From ido at mellanox.co.il Wed Dec 1 07:29:54 2004 From: ido at mellanox.co.il (Ido Bukspan) Date: Wed, 1 Dec 2004 17:29:54 +0200 Subject: [openib-general] Fix for possible oops when calling ipoib_neigh_destructor Message-ID: <91DB792C7985D411BEC300B40080D29C711BC2@mtvex01.mtv.mtl.com> I have implemented code in order to fix the oops problem, and it seems like it works fine. I still don't have a patch but here is the pseudo code I used. Please notice that I used a dynamic list (can be organized in an rb tree as well, as Roland suggested) of the current PATHs which exist in the driver (I use it also for the unicast ARP). I'm on vacation starting tomorrow but if you want I can work on a patch when I will come back (two weeks from now). Thanks -Ido The next function will be called from : ipoib_cleanup_module(void) { spin_lock_irq(&priv->lock); if (list_empty(&priv->path_list)) { // Meaning that the destructor already destroyed all the entries // Nothing to do spin_unlock_irq(&priv->lock); return; } spin_unlock_irq(&priv->lock); while (TRUE) { // We will keep checking until the list is empty. spin_lock_irq(&priv->lock); if (list_empty(&priv->path_list)) { //list empty spin_unlock_irq(&priv->lock); break; } else { fpath = path_list->next // We keep a pointer to the first neighbour for future use list_for_each_entry_safe(path, tpath, &priv->path_list, list) { // We are cloning all the neighbours so no destructor could be called // While we are in the process // I know the following is not well protected but I didn't find something better. path->neighbour = neigh_clone(path->neighbour); if (&path->neighbour->refcnt > 1){ // we check if we are the only one that hold if. // If not we will delete this AH. list_del(&path->list); list_add_tail(&path->list, &remove_list); } else { // If so the kernel already start the destructor process and we are backing off. neigh_release(path->neighbour); } } spin_unlock_irq(&priv->lock); } } // Set null pointer to the destructor function so Future called will be called to NULL. fpath->neighbour->ops->destructor = NULL; // Destroy the AHs list_for_each_entry_safe(path, tpath, &remove_list, list) { ib_address_destroy(path->ah); //In case that the module is reloaded, we erase the path pointer memset(path->neighbour->ha + 24, 0, 8); neigh_release(path->neighbour); // For the cloning kfree(path); } } Ido Bukspan Mellanox Technologies Ltd. Phone : (972)-3-6259500 ,Ext 518. Fax : (972)-3-5614943 mailto:ido at mellanox.co.il http://www.mellanox.com No play No game From halr at voltaire.com Wed Dec 1 07:27:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Dec 2004 10:27:24 -0500 Subject: [Fwd: [openib-general] ib_mthca can't be removed in some error case] Message-ID: <1101914844.6411.606.camel@localhost.localdomain> A few more messages came out and then the module removal ultimately timed out and completed. Not sure I trust the state of things after this though... 
Dec 1 09:35:17 localhost kernel: ib_mthca 0000:02:00.0: HW2SW_CQ failed (-16) Dec 1 09:35:17 localhost kernel: ib_mthca 0000:02:00.0: HW2SW_CQ failed (-16) -- Hal --Forwarded Message-- From: Hal Rosenstock To: Roland Dreier Cc: openib-general at openib.org Subject: [openib-general] ib_mthca can't be removed in some error case Date: 01 Dec 2004 09:34:55 -0500 Hi, When I /sbin/modprobe -r ib_mthca (after having umad and IPoIB up and then removed and a switch and SM come and go), the removal hangs up as follows: 29836 pts/5 D 0:00 /sbin/modprobe -r ib_mthca uninterruptible sleep I get the following error in the system logs: Dec 1 09:31:17 localhost kernel: ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16) Dec 1 09:32:17 localhost kernel: ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16) -- Hal _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From roland at topspin.com Wed Dec 1 07:53:46 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Dec 2004 07:53:46 -0800 Subject: [openib-general] testing OpenIB on TopSpin hardware ? In-Reply-To: <200411301718.SAA08841@styx.bruyeres.cea.fr> (Philippe Gregoire's message of "Tue, 30 Nov 2004 18:18:19 +0100") References: <200411301718.SAA08841@styx.bruyeres.cea.fr> Message-ID: <52hdn6ovzp.fsf@topspin.com> Philippe> Hello, I would like to test the latest OpenIB software Philippe> on our test platform, specially the SDP part. We have a Philippe> HP-DL380/DL360 (IA32) cluster with 12 nodes connected Philippe> through TopSpin HCA and TopSpin 90 IB switch. I got the Philippe> OpenIB source with svn. What is the latest version , Philippe> gen1 or 1.0 ? 1.0 looks like the version available in Philippe> march, correct ? What is the firmware requirement for Philippe> the HCA and the switch ? The latest OpenIB software is the subversion tree at https://openib.org/svn/gen2/trunk/ However this tree has support for IPoIB only (no SDP or any other ULPs). The gen1 tree does have SDP but has not been touched in some time. Any firmware should work on HCA and switch, but the newer firmware will probably be more stable. - Roland From roland at topspin.com Wed Dec 1 07:54:37 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Dec 2004 07:54:37 -0800 Subject: [Fwd: [openib-general] ib_mthca can't be removed in some error case] In-Reply-To: <1101914844.6411.606.camel@localhost.localdomain> (Hal Rosenstock's message of "Wed, 01 Dec 2004 10:27:24 -0500") References: <1101914844.6411.606.camel@localhost.localdomain> Message-ID: <52d5xuovya.fsf@topspin.com> Hal> A few more messages came out and then the module removal Hal> ultimately timed out and completed. Not sure I trust the Hal> state of things after this though... Looks like you managed to crash the HCA (firmware commands timed out). - Roland From halr at voltaire.com Wed Dec 1 07:59:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 01 Dec 2004 10:59:56 -0500 Subject: [Fwd: [openib-general] ib_mthca can't be removed in some error case] In-Reply-To: <52d5xuovya.fsf@topspin.com> References: <1101914844.6411.606.camel@localhost.localdomain> <52d5xuovya.fsf@topspin.com> Message-ID: <1101916796.4124.1.camel@localhost.localdomain> On Wed, 2004-12-01 at 10:54, Roland Dreier wrote: > Hal> A few more messages came out and then the module removal > Hal> ultimately timed out and completed. 
Not sure I trust the > Hal> state of things after this though... > > Looks like you managed to crash the HCA (firmware commands timed out). There was a power glitch. Perhaps it reset the HCA (and it is more sensitive to these things) but the system rode this out (and did not reset). That's another possible explanation. -- Hal From roland at topspin.com Wed Dec 1 09:14:50 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Dec 2004 09:14:50 -0800 Subject: [openib-general] Fix for possible oops when calling ipoib_neigh_destructor In-Reply-To: <91DB792C7985D411BEC300B40080D29C711BC2@mtvex01.mtv.mtl.com> (Ido Bukspan's message of "Wed, 1 Dec 2004 17:29:54 +0200") References: <91DB792C7985D411BEC300B40080D29C711BC2@mtvex01.mtv.mtl.com> Message-ID: <521xeaos8l.fsf@topspin.com> Thanks for investigating this. I have a slightly different scheme in mind (which resolves both the unicast ARP and neighbour destructor problems). I'll implement it in the next couple of days (before you get back from vacation, I guess). Thanks, Roland From roland at topspin.com Wed Dec 1 09:32:50 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 01 Dec 2004 09:32:50 -0800 Subject: [openib-general] [PATCH] Move IPoIB to use LockLess TX Message-ID: <52wtw1orel.fsf@topspin.com> This changes IPoIB's locking scheme to use the new NETIF_F_LLTX scheme. It adds about 2-3 % to throughput in my netpipe tests. - R. Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1304) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -204,7 +204,7 @@ kfree(path); } -static int path_rec_start(struct sk_buff *skb, struct net_device *dev) +static void path_rec_start(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_path *path = kmalloc(sizeof *path, GFP_ATOMIC); @@ -244,23 +244,23 @@ path->neighbour = skb->dst->neighbour; *to_ipoib_path(skb->dst->neighbour) = path; - return 0; + return; err: kfree(path); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); - - return 0; } -static int path_lookup(struct sk_buff *skb, struct net_device *dev) +static void path_lookup(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(skb->dev); /* Look up path record for unicasts */ - if (skb->dst->neighbour->ha[4] != 0xff) - return path_rec_start(skb, dev); + if (skb->dst->neighbour->ha[4] != 0xff) { + path_rec_start(skb, dev); + return; + } /* Add in the P_Key */ skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; @@ -268,7 +268,6 @@ ipoib_mcast_send(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4), skb); - return 0; } static void unicast_arp_completion(int status, @@ -336,8 +335,8 @@ * still go through (since we'll get the new path from the SM for * these queries) so we'll never update the neighbour. 
*/ -static int unicast_arp_start(struct sk_buff *skb, struct net_device *dev, - struct ipoib_pseudoheader *phdr) +static void unicast_arp_start(struct sk_buff *skb, struct net_device *dev, + struct ipoib_pseudoheader *phdr) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *tmp_skb; @@ -352,7 +351,7 @@ dev_kfree_skb_any(tmp_skb); if (!skb) { ++priv->stats.tx_dropped; - return 0; + return; } } @@ -381,25 +380,32 @@ ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); } - - return 0; } static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_path *path; + unsigned long flags; + local_irq_save(flags); + if (!spin_trylock(&priv->tx_lock)) { + local_irq_restore(flags); + return NETDEV_TX_LOCKED; + } + if (skb->dst && skb->dst->neighbour) { - if (unlikely(!*to_ipoib_path(skb->dst->neighbour))) - return path_lookup(skb, dev); + if (unlikely(!*to_ipoib_path(skb->dst->neighbour))) { + path_lookup(skb, dev); + goto out; + } path = *to_ipoib_path(skb->dst->neighbour); if (likely(path->ah)) { ipoib_send(dev, skb, path->ah, be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); - return 0; + goto out; } if (skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) @@ -417,8 +423,7 @@ phdr->hwaddr[9] = priv->pkey & 0xff; ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); - } - else { + } else { /* unicast GID -- ARP reply?? */ /* @@ -429,7 +434,7 @@ if (skb->destructor == unicast_arp_finish) { ipoib_send(dev, skb, *(struct ipoib_ah **) skb->cb, be32_to_cpup((u32 *) phdr->hwaddr)); - return 0; + goto out; } if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { @@ -441,22 +446,25 @@ IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); dev_kfree_skb_any(skb); ++priv->stats.tx_dropped; - return 0; + goto out; } /* put the pseudoheader back on */ skb_push(skb, sizeof *phdr); - return unicast_arp_start(skb, dev, phdr); + unicast_arp_start(skb, dev, phdr); } } - return 0; + goto out; err: ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); - return 0; +out: + spin_unlock_irqrestore(&priv->tx_lock, flags); + + return NETDEV_TX_OK; } struct net_device_stats *ipoib_get_stats(struct net_device *dev) @@ -641,7 +649,7 @@ dev->addr_len = INFINIBAND_ALEN; dev->type = ARPHRD_INFINIBAND; dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; - dev->features = NETIF_F_VLAN_CHALLENGED; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; /* MTU will be reset when mcast join happens */ dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; @@ -656,6 +664,7 @@ priv->dev = dev; spin_lock_init(&priv->lock); + spin_lock_init(&priv->tx_lock); init_MUTEX(&priv->mcast_mutex); init_MUTEX(&priv->vlan_mutex); Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 1304) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -130,6 +130,7 @@ struct ipoib_buf *rx_ring; + spinlock_t tx_lock; struct ipoib_buf *tx_ring; unsigned tx_head; unsigned tx_tail; Index: infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 1304) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -70,14 +70,15 @@ unsigned long flags; - spin_lock_irqsave(&priv->lock, flags); if (ah->last_send <= priv->tx_tail) { ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); ib_destroy_ah(ah->ah); kfree(ah); - } else + } else { + spin_lock_irqsave(&priv->lock, flags); list_add_tail(&ah->list, &priv->dead_ahs); - 
spin_unlock_irqrestore(&priv->lock, flags); + spin_unlock_irqrestore(&priv->lock, flags); + } } static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, @@ -235,11 +236,11 @@ dev_kfree_skb_any(tx_req->skb); - spin_lock_irqsave(&priv->lock, flags); + spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) netif_wake_queue(dev); - spin_unlock_irqrestore(&priv->lock, flags); + spin_unlock_irqrestore(&priv->tx_lock, flags); if (wc->status != IB_WC_SUCCESS && wc->status != IB_WC_WR_FLUSH_ERR) @@ -338,19 +339,15 @@ ++priv->stats.tx_errors; dev_kfree_skb_any(skb); } else { - unsigned long flags; - dev->trans_start = jiffies; address->last_send = priv->tx_head; ++priv->tx_head; - spin_lock_irqsave(&priv->lock, flags); if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); netif_stop_queue(dev); } - spin_unlock_irqrestore(&priv->lock, flags); } } From mshefty at ichips.intel.com Wed Dec 1 10:45:15 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 01 Dec 2004 10:45:15 -0800 Subject: [openib-general] smpdump and current MAD layer In-Reply-To: <1101854788.6411.373.camel@localhost.localdomain> References: <1101838548.6411.276.camel@localhost.localdomain> <41ACC233.2000201@ichips.intel.com> <1101843199.6411.288.camel@localhost.localdomain> <41ACCE63.5050800@ichips.intel.com> <1101854788.6411.373.camel@localhost.localdomain> Message-ID: <41AE113B.6070902@ichips.intel.com> Hal Rosenstock wrote: > On Tue, 2004-11-30 at 14:47, Sean Hefty wrote: > >>I guess filtering can be done above the MAD layer, so just letting the >>user specify the qp_type may be all that's needed, beyond indicating >>that snooping is desired. If we go this route, we can probably support >>any number of snoopers. > > > Does that mean the snoopers would just be a list based on qp_type (and > we have a list per QP type (SMI, GSI)) ? This was my first thought... Also, looking ahead at future changes that will be going into the code (RMPP, CM), plus a debugging issues that I'm hitting when loading the drivers, I think it makes sense to add in snooping sooner rather than later. This is something that I can start, since it would be useful to me as a debugging tool today. - Sean To: Hal Rosenstock Subject: Re: [openib-general] smpdump and current MAD layer References: <1101838548.6411.276.camel at localhost.localdomain> In-Reply-To: <1101838548.6411.276.camel at localhost.localdomain> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.31 (www . roaringpenguin . com / mimedefang) X-Spam-Checker-Version: SpamAssassin 2.64 (2004-01-11) on openib.ca.sandia.gov X-Spam-Level: X-Spam-Status: No, hits=0.0 required=5.0 tests=none autolearn=no version=2.64 Cc: openib-general at openib.org X-BeenThere: openib-general at openib.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: OpenIB General Mailing List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 02 Dec 2004 18:04:13 -0000 Hal Rosenstock wrote: > So solicited MAD responses cannot currently be snooped nor can > unsolicited ones for which an agent is registered (Since SMA and PMA are > currently firmware based, the latter is not an issue for the current > implementation). I've gotten a start on adding in the snooping support. 
It was a little more difficult than I first thought to support multiple snoop clients, because of a race condition deregistering clients while snooping. Question on where to perform the snoop callback on the send side: should it be done in the completion handling code, immediately before the MAD is posted to the QP, or somewhere else? I'd like to place the snooping code in as few places as possible, but still be able to snoop locally processed MADs. Ideally a MAD should be snooped exactly once, which requires some extra care when handling QP errors. Snooping in the completion handling allows the MAD layer to own the thread that performs the callback. Calling clients back on the outbound path puts the callback in an arbitrary thread context. Thoughts? - Sean From halr at voltaire.com Thu Dec 2 12:05:39 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 02 Dec 2004 15:05:39 -0500 Subject: [openib-general] smpdump and current MAD layer In-Reply-To: <41AF559C.7090700@ichips.intel.com> References: <1101838548.6411.276.camel@localhost.localdomain> <41AF559C.7090700@ichips.intel.com> Message-ID: <1102017939.4179.10.camel@localhost.localdomain> On Thu, 2004-12-02 at 12:49, Sean Hefty wrote: > Question on where to perform the snoop callback on the send side: > should it be done in the completion handling code, immediately before > the MAD is posted to the QP, or somewhere else? > > I'd like to place the snooping code in as few places as possible, but > still be able to snoop locally processed MADs. Ideally a MAD should be > snooped exactly once, which requires some extra care when handling QP > errors. Snooping in the completion handling allows the MAD layer to > own the thread that performs the callback. Calling clients back on the > outbound path puts the callback in an arbitrary thread context. > > Thoughts? While there are arguments for snooping when the MAD is posted to the QP, I think it makes more sense to snoop it when it completes. In addition to the thread reason you cite, one could also pass the completion status which would give more information on what really happened (which might be useful). I don't see a need to instrument this more than once on the send side either. -- Hal From halr at voltaire.com Thu Dec 2 12:09:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 02 Dec 2004 15:09:36 -0500 Subject: [openib-general] mthca Page Allocation Failures Message-ID: <1102018176.4179.15.camel@localhost.localdomain> Hi, I'm seeing this a lot more lately. Not sure when this started. It happens on first starting the driver after boot. It seems to occur when there is more going on at system startup (like fixing the disk after a previous freeze). -- Hal Dec 2 14:36:17 localhost kernel: ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) Dec 2 14:36:17 localhost kernel: ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:02:00.0) Dec 2 14:36:17 localhost kernel: modprobe: page allocation failure. 
order:6, mode:0xd0 Dec 2 14:36:17 localhost kernel: [] __alloc_pages+0x1c2/0x370 Dec 2 14:36:17 localhost kernel: [] __get_free_pages+0x1f/0x40 Dec 2 14:36:17 localhost kernel: [] dma_alloc_coherent+0xce/0x100 Dec 2 14:36:17 localhost kernel: [] mthca_alloc_sqp+0x65/0x420 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] mthca_create_qp+0x16e/0x180 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] ib_create_qp+0x22/0x80 [ib_core] Dec 2 14:36:17 localhost kernel: [] create_mad_qp+0x86/0xd0 [ib_mad] Dec 2 14:36:17 localhost kernel: [] qp_event_handler+0x0/0x30 [ib_mad] Dec 2 14:36:17 localhost kernel: [] ib_get_dma_mr+0x1e/0x50 [ib_core] Dec 2 14:36:17 localhost kernel: [] ib_mad_port_open+0x24d/0x5d0 [ib_mad] Dec 2 14:36:17 localhost kernel: [] ib_mad_init_device+0x3e/0x100 [ib_mad] Dec 2 14:36:17 localhost kernel: [] ib_cache_setup_one+0x12d/0x1d0 [ib_core] Dec 2 14:36:17 localhost kernel: [] ib_mad_init_device+0x0/0x100 [ib_mad] Dec 2 14:36:17 localhost kernel: [] ib_register_device+0x17d/0x1a0 [ib_core] Dec 2 14:36:17 localhost kernel: [] mthca_req_notify_cq+0x0/0x30 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] mthca_poll_cq+0x0/0xbb0 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] mthca_destroy_cq+0x0/0x30 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] mthca_register_device+0x15b/0x1a0 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] mthca_init_one+0x523/0x6e0 [ib_mthca] Dec 2 14:36:17 localhost kernel: [] pci_device_probe_static+0x52/0x70 Dec 2 14:36:17 localhost kernel: [] __pci_device_probe+0x3c/0x50 Dec 2 14:36:17 localhost kernel: [] pci_device_probe+0x2c/0x50 Dec 2 14:36:17 localhost kernel: [] bus_match+0x3f/0x70 Dec 2 14:36:17 localhost kernel: [] driver_attach+0x5c/0x90 Dec 2 14:36:17 localhost kernel: [] bus_add_driver+0x91/0xb0 Dec 2 14:36:17 localhost kernel: [] driver_register+0x8c/0x90 Dec 2 14:36:17 localhost kernel: [] pci_register_driver+0x90/0xb0 Dec 2 14:36:17 localhost kernel: [] mthca_init+0xf/0x1a [ib_mthca] Dec 2 14:36:17 localhost kernel: [] sys_init_module+0x289/0x340 Dec 2 14:36:17 localhost kernel: [] sysenter_past_esp+0x52/0x71 Dec 2 14:36:17 localhost kernel: ib_mad: Couldn't create ib_mad QP1 Dec 2 14:36:17 localhost kernel: ib_mad: Couldn't open mthca0 port 2 Dec 2 14:36:17 localhost kernel: ib_agent: Port 1 not found Dec 2 14:36:17 localhost kernel: ib_mad: Couldn't close mthca0 port 1 for agents Dec 2 14:36:17 localhost kernel: ib_mad: Port 1 not found Dec 2 14:36:17 localhost kernel: ib_mad: Couldn't close mthca0 port 1 Dec 2 14:36:17 localhost kernel: ib_agent: Port 2 not found Dec 2 14:36:17 localhost kernel: ib_mad: Couldn't close mthca0 port 2 for agents Dec 2 14:36:17 localhost kernel: ib_mad: Port 2 not found Dec 2 14:36:17 localhost kernel: ib_mad: Couldn't close mthca0 port 2 From roland at topspin.com Thu Dec 2 13:32:34 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 02 Dec 2004 13:32:34 -0800 Subject: [openib-general] Re: mthca Page Allocation Failures In-Reply-To: <1102018176.4179.15.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 02 Dec 2004 15:09:36 -0500") References: <1102018176.4179.15.camel@localhost.localdomain> Message-ID: <52fz2ol72l.fsf@topspin.com> This patch should help. - R. 
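(A rough reading of the failure above, not something stated in the thread: "order:6" means the allocator was asked for 2^6 = 64 contiguous pages, i.e. 256 KB with 4 KB pages. The traceback shows the request coming from dma_alloc_coherent() inside mthca_alloc_sqp(), whose coherent buffer has one slot per send WQE on the special QPs, so a 2048-entry MAD send queue turns into an order-6 contiguous allocation that easily fails on a fragmented system. Dropping IB_MAD_QP_SEND_SIZE to 128, as in the patch below, shrinks that request by a factor of 16, down to a few pages.)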
Index: infiniband/core/mad_priv.h =================================================================== --- infiniband/core/mad_priv.h (revision 1304) +++ infiniband/core/mad_priv.h (working copy) @@ -68,7 +68,7 @@ #define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ /* QP and CQ parameters */ -#define IB_MAD_QP_SEND_SIZE 2048 +#define IB_MAD_QP_SEND_SIZE 128 #define IB_MAD_QP_RECV_SIZE 512 #define IB_MAD_SEND_REQ_MAX_SG 2 #define IB_MAD_RECV_REQ_MAX_SG 1 From mshefty at ichips.intel.com Thu Dec 2 14:48:25 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 02 Dec 2004 14:48:25 -0800 Subject: [openib-general] smpdump and current MAD layer In-Reply-To: <1102017939.4179.10.camel@localhost.localdomain> References: <1101838548.6411.276.camel@localhost.localdomain> <41AF559C.7090700@ichips.intel.com> <1102017939.4179.10.camel@localhost.localdomain> Message-ID: <41AF9BB9.5040402@ichips.intel.com> Hal Rosenstock wrote: >>I'd like to place the snooping code in as few places as possible, but >>still be able to snoop locally processed MADs. Ideally a MAD should be >>snooped exactly once, which requires some extra care when handling QP >>errors. Snooping in the completion handling allows the MAD layer to >>own the thread that performs the callback. Calling clients back on the >>outbound path puts the callback in an arbitrary thread context. > > While there are arguments for snooping when the MAD is posted to the QP, > I think it makes more sense to snoop it when it completes. In addition > to the thread reason you cite, one could also pass the completion status > which would give more information on what really happened (which might > be useful). I don't see a need to instrument this more than once on the > send side either. It would be difficult to snoop all MADs without snooping in multiple locations. For example, local MADs do not generate CQ entries, and are completed from the sender's thread. (I think we may want to change the threading at some point, so it may not matter long term.) It's also not clear if a snooper would want to know if a request received a reply, snoop separate send completions in the case of RMPP, or snoop RMPP ACKs. (For RMPP debugging, I want to snoop at the lowest level.) Right now, I've tried to design the API/code to be fairly flexible about where snooping can occur (since I'm not sure where the best place to do this is), but will likely only snoop in one location (after completions). - Sean From halr at voltaire.com Thu Dec 2 20:08:58 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 02 Dec 2004 23:08:58 -0500 Subject: [openib-general] [PATCH] Add vendor OUI support to MAD layer Message-ID: <1102046938.14406.45.camel@localhost.localdomain> mad: Add vendor OUI support to MAD layer Also, in ib_register_mad_agent, if registration request supplied, make sure that class supplied in request is consistent with the QP type. On sending, it is the responsibility of the client to put the OUI in the MAD when sending a vendor MAD with OUI (despite this being available in the mad_agent). This seems consistent with the way we have been doing things up to now. (This could be changed if someone strongly objects). Also, I didn't compress ib_mad_mgmt_class_table (to eliminate the vendor with OUI classes. That's 32 pointers per version (I think there are 2 versions in play right now). That's an optimization that could be made but I'm not sure it is worth it. 
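A minimal, hypothetical usage sketch of the registration interface added by the patch below -- the OUI value, handler names, and device/port variables are invented for illustration, and the ib_register_mad_agent() argument order is assumed rather than taken from this thread:

	struct ib_mad_reg_req req;
	struct ib_mad_agent *agent;

	memset(&req, 0, sizeof req);
	req.mgmt_class         = IB_MGMT_CLASS_VENDOR_RANGE2_START; /* 0x30 */
	req.mgmt_class_version = 1;
	req.oui[0] = 0x00; req.oui[1] = 0x02; req.oui[2] = 0xc9;    /* example OUI */
	set_bit(IB_MGMT_METHOD_GET, req.method_mask);

	agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI, &req, 0,
				      my_send_handler, my_recv_handler, NULL);

	/* As noted above, when sending it is still up to the client to put
	 * the OUI into the wire format of each outgoing vendor MAD: */
	memcpy(((struct ib_vendor_mad *) mad)->oui, req.oui, 3);
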
Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 1308) +++ include/ib_mad.h (working copy) @@ -42,6 +42,8 @@ #define IB_MGMT_CLASS_DEVICE_MGMT 0x06 #define IB_MGMT_CLASS_CM 0x07 #define IB_MGMT_CLASS_SNMP 0x08 +#define IB_MGMT_CLASS_VENDOR_RANGE2_START 0x30 +#define IB_MGMT_CLASS_VENDOR_RANGE2_END 0x4F /* Management methods */ #define IB_MGMT_METHOD_GET 0x01 @@ -55,7 +57,6 @@ #define IB_MGMT_METHOD_RESP 0x80 - #define IB_MGMT_MAX_METHODS 128 #define IB_QP0 0 @@ -104,6 +105,14 @@ u8 data[220]; } __attribute__ ((packed)); +struct ib_vendor_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 reserved; + u8 oui[3]; + u8 data[216]; +} __attribute__ ((packed)); + struct ib_mad_agent; struct ib_mad_send_wc; struct ib_mad_recv_wc; @@ -199,12 +208,15 @@ * receive unsolicited MADs, otherwise it should be 0. * @mgmt_class_version: Indicates which version of MADs for the given * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. * @method_mask: The caller will receive unsolicited MADs for any method * where @method_mask = 1. */ struct ib_mad_reg_req { u8 mgmt_class; u8 mgmt_class_version; + u8 oui[3]; DECLARE_BITMAP(method_mask, IB_MGMT_MAX_METHODS); }; Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 1308) +++ core/mad_priv.h (working copy) @@ -78,6 +78,9 @@ /* Registration table sizes */ #define MAX_MGMT_CLASS 80 #define MAX_MGMT_VERSION 8 +#define MAX_MGMT_OUI 8 +#define MAX_MGMT_VENDOR_RANGE2 IB_MGMT_CLASS_VENDOR_RANGE2_END - \ + IB_MGMT_CLASS_VENDOR_RANGE2_START + 1 struct ib_mad_list_head { struct list_head list; @@ -140,6 +143,20 @@ struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; }; +struct ib_mad_mgmt_vendor_class { + u8 oui[MAX_MGMT_OUI][3]; + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_OUI]; +}; + +struct ib_mad_mgmt_vendor_class_table { + struct ib_mad_mgmt_vendor_class *vendor_class[MAX_MGMT_VENDOR_RANGE2]; +}; + +struct ib_mad_mgmt_version_table { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; +}; + struct ib_mad_queue { spinlock_t lock; struct list_head list; @@ -165,7 +182,7 @@ struct ib_mr *mr; spinlock_t reg_lock; - struct ib_mad_mgmt_class_table *version[MAX_MGMT_VERSION]; + struct ib_mad_mgmt_version_table version[MAX_MGMT_VERSION]; struct list_head agent_list; struct workqueue_struct *wq; struct work_struct work; Index: core/mad.c =================================================================== --- core/mad.c (revision 1308) +++ core/mad.c (working copy) @@ -144,6 +144,47 @@ } } +static int vendor_class_index(u8 mgmt_class) +{ + return mgmt_class - IB_MGMT_CLASS_VENDOR_RANGE2_START; +} + +static int is_vendor_class(u8 mgmt_class) +{ + if ((mgmt_class < IB_MGMT_CLASS_VENDOR_RANGE2_START) || + (mgmt_class > IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return 0; + return 1; +} + +static int is_vendor_oui(char *oui) +{ + if (oui[0] || oui[1] || oui[2]) + return 1; + return 0; +} + +static int is_vendor_method_in_use( + struct ib_mad_mgmt_vendor_class *vendor_class, + struct ib_mad_reg_req *mad_reg_req) +{ + struct ib_mad_mgmt_method_table *method; + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) { + if (!memcmp(vendor_class->oui[i], mad_reg_req->oui, 3)) { + method = vendor_class->method_table[i]; + if (method) { + if (method_in_use(&method, mad_reg_req)) + return 1; 
+ else + break; + } + } + } + return 0; +} + /* * ib_register_mad_agent - Register to send/receive MADs */ @@ -157,61 +198,71 @@ void *context) { struct ib_mad_port_private *port_priv; - struct ib_mad_agent *ret; + struct ib_mad_agent *ret = ERR_PTR(-EINVAL); struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_reg_req *reg_req = NULL; struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; struct ib_mad_mgmt_method_table *method; int ret2, qpn; unsigned long flags; - u8 mgmt_class; + u8 mgmt_class, vclass; /* Validate parameters */ qpn = get_spl_qp_index(qp_type); - if (qpn == -1) { - ret = ERR_PTR(-EINVAL); + if (qpn == -1) goto error1; - } - if (rmpp_version) { - ret = ERR_PTR(-EINVAL); /* XXX: until RMPP implemented */ - goto error1; - } + if (rmpp_version) + goto error1; /* XXX: until RMPP implemented */ /* Validate MAD registration request if supplied */ if (mad_reg_req) { - if (mad_reg_req->mgmt_class_version >= MAX_MGMT_VERSION) { - ret = ERR_PTR(-EINVAL); + if (mad_reg_req->mgmt_class_version >= MAX_MGMT_VERSION) goto error1; - } - if (!recv_handler) { - ret = ERR_PTR(-EINVAL); + if (!recv_handler) goto error1; - } if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) { /* * IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is the only * one in this range currently allowed */ if (mad_reg_req->mgmt_class != - IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - ret = ERR_PTR(-EINVAL); + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) goto error1; - } } else if (mad_reg_req->mgmt_class == 0) { /* * Class 0 is reserved in IBA and is used for * aliasing of IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE */ - ret = ERR_PTR(-EINVAL); goto error1; + } else if (is_vendor_class(mad_reg_req->mgmt_class)) { + /* + * If class is in "new" vendor range, + * ensure supplied OUI is not zero + */ + if (!is_vendor_oui(mad_reg_req->oui)) + goto error1; } + /* Make sure class supplied is consistent with QP type */ + if (qp_type == IB_QPT_SMI) { + if ((mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_LID_ROUTED) && + (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } else { + if ((mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } } else { /* No registration request supplied */ - if (!send_handler) { - ret = ERR_PTR(-EINVAL); + if (!send_handler) goto error1; - } } /* Validate device and port */ @@ -258,17 +309,32 @@ * is non overlapping with any existing ones */ if (mad_reg_req) { - class = port_priv->version[mad_reg_req->mgmt_class_version]; - if (class) { - mgmt_class = convert_mgmt_class( - mad_reg_req->mgmt_class); - method = class->method_table[mgmt_class]; - if (method) { - if (method_in_use(&method, mad_reg_req)) { - ret = ERR_PTR(-EINVAL); - goto error3; + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); + if (!is_vendor_class(mgmt_class)) { + class = port_priv->version[mad_reg_req-> + mgmt_class_version].class; + if (class) { + method = class->method_table[mgmt_class]; + if (method) { + if (method_in_use(&method, + mad_reg_req)) + goto error3; } } + } else { + /* "New" vendor class range */ + vendor = port_priv->version[mad_reg_req-> + mgmt_class_version].vendor; + if (vendor) { + vclass = vendor_class_index(mgmt_class); + vendor_class = vendor->vendor_class[vclass]; + if (vendor_class) { + if (is_vendor_method_in_use( + vendor_class, + mad_reg_req)) + goto error3; + } + } } } @@ -721,6 +787,40 @@ return 0; } +static int 
check_vendor_class(struct ib_mad_mgmt_vendor_class *vendor_class) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + if (vendor_class->method_table[i]) + return 1; + return 0; +} + +static int find_vendor_oui(struct ib_mad_mgmt_vendor_class *vendor_class, + char *oui) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + /* Is there matching OUI for this vendor class ? */ + if (!memcmp(vendor_class->oui[i], oui, 3)) + return i; + + return -1; +} + +static int check_vendor_table(struct ib_mad_mgmt_vendor_class_table *vendor) +{ + int i; + + for (i = 0; i < MAX_MGMT_VENDOR_RANGE2; i++) + if (vendor->vendor_class[i]) + return 1; + + return 0; +} + static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, struct ib_mad_agent_private *agent) { @@ -734,23 +834,17 @@ } } -static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, - struct ib_mad_agent_private *priv) +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class) { - struct ib_mad_port_private *private; + struct ib_mad_port_private *port_priv; struct ib_mad_mgmt_class_table **class; struct ib_mad_mgmt_method_table **method; - int i, ret; - u8 mgmt_class; - /* Make sure MAD registration request supplied */ - if (!mad_reg_req) - return 0; - - private = priv->qp_info->port_priv; - mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); - class = &private->version[mad_reg_req->mgmt_class_version]; + port_priv = agent_priv->qp_info->port_priv; + class = &port_priv->version[mad_reg_req->mgmt_class_version].class; if (!*class) { /* Allocate management class table for "new" class version */ *class = kmalloc(sizeof **class, GFP_ATOMIC); @@ -760,9 +854,8 @@ ret = -ENOMEM; goto error1; } - /* Clear management class table for this class version */ - memset((*class)->method_table, 0, - sizeof((*class)->method_table)); + /* Clear management class table */ + memset(*class, 0, sizeof(**class)); /* Allocate method table for this management class */ method = &(*class)->method_table[mgmt_class]; if ((ret = allocate_method_table(method))) @@ -782,17 +875,17 @@ /* Finally, add in methods being registered */ for (i = find_first_bit(mad_reg_req->method_mask, - IB_MGMT_MAX_METHODS); + IB_MGMT_MAX_METHODS); i < IB_MGMT_MAX_METHODS; i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, 1+i)) { - (*method)->agent[i] = priv; + (*method)->agent[i] = agent_priv; } return 0; error3: /* Remove any methods for this mad agent */ - remove_methods_mad_agent(*method, priv); + remove_methods_mad_agent(*method, agent_priv); /* Now, check to see if there are any methods in use */ if (!check_method_table(*method)) { /* If not, release management method table */ @@ -808,11 +901,138 @@ return ret; } +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_vendor_class_table **vendor_table; + struct ib_mad_mgmt_vendor_class_table *vendor = NULL; + struct ib_mad_mgmt_vendor_class *vendor_class = NULL; + struct ib_mad_mgmt_method_table **method; + int i, ret = -ENOMEM; + u8 vclass; + + /* "New" vendor (with OUI) class */ + vclass = vendor_class_index(mad_reg_req->mgmt_class); + port_priv = agent_priv->qp_info->port_priv; + vendor_table = &port_priv->version[ + mad_reg_req->mgmt_class_version].vendor; + if (!*vendor_table) { + /* Allocate mgmt vendor class table for "new" class version */ + vendor = kmalloc(sizeof *vendor, GFP_ATOMIC); + if (!vendor) { + 
printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class_table\n"); + goto error1; + } + /* Clear management vendor class table */ + memset(vendor, 0, sizeof(*vendor)); + *vendor_table = vendor; + } + if (!(*vendor_table)->vendor_class[vclass]) { + /* Allocate table for this management vendor class */ + vendor_class = kmalloc(sizeof *vendor_class, GFP_ATOMIC); + if (!vendor_class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class\n"); + goto error2; + } + memset(vendor_class, 0, sizeof(*vendor_class)); + (*vendor_table)->vendor_class[vclass] = vendor_class; + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* Is there matching OUI for this vendor class ? */ + if (!memcmp((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3)) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(!*method); + goto check_in_use; + } + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* OUI slot available ? */ + if (!is_vendor_oui((*vendor_table)->vendor_class[ + vclass]->oui[i])) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(*method); + /* Allocate method table for this OUI */ + if ((ret = allocate_method_table(method))) + goto error3; + memcpy((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3); + goto check_in_use; + } + } + printk(KERN_ERR PFX "All OUI slots in use\n"); + goto error3; + +check_in_use: + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error4; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error4: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; +error3: + if (vendor_class) { + (*vendor_table)->vendor_class[vclass] = NULL; + kfree(vendor_class); + } +error2: + if (vendor) { + *vendor_table = NULL; + kfree(vendor); + } +error1: + return ret; +} + +static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *priv) +{ + int ret; + u8 mgmt_class; + + /* Make sure MAD registration request supplied */ + if (!mad_reg_req) + return 0; + + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); + if (!is_vendor_class(mgmt_class)) + ret = add_nonoui_reg_req(mad_reg_req, priv, mgmt_class); + else + ret = add_oui_reg_req(mad_reg_req, priv); + return ret; +} + static void remove_mad_reg_req(struct ib_mad_agent_private *agent_priv) { struct ib_mad_port_private *port_priv; struct ib_mad_mgmt_class_table *class; struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + int index; u8 mgmt_class; /* @@ -824,12 +1044,10 @@ } port_priv = agent_priv->qp_info->port_priv; - class = port_priv->version[agent_priv->reg_req->mgmt_class_version]; - if (!class) { - printk(KERN_ERR PFX "No class table yet MAD registration " - "request supplied\n"); - goto out; - } + class = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].class; + if (!class) + goto vendor_check; mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); method = 
class->method_table[mgmt_class]; @@ -845,12 +1063,56 @@ if (!check_class_table(class)) { /* If not, release management class table */ kfree(class); - port_priv->version[agent_priv->reg_req-> - mgmt_class_version]= NULL; + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version].class = NULL; } } } +vendor_check: + vendor = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].vendor; + if (!vendor) + goto out; + + mgmt_class = vendor_class_index(agent_priv->reg_req->mgmt_class); + vendor_class = vendor->vendor_class[mgmt_class]; + if (vendor_class) { + index = find_vendor_oui(vendor_class, agent_priv->reg_req->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* + * Now, check to see if there are + * any methods still in use + */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + vendor_class->method_table[index] = NULL; + memset(vendor_class->oui[index], 0, 3); + /* Any OUIs left ? */ + if (!check_vendor_class(vendor_class)) { + /* If not, release vendor class table */ + kfree(vendor_class); + vendor->vendor_class[mgmt_class] = NULL; + /* Any other vendor classes left ? */ + if (!check_vendor_table(vendor)) { + kfree(vendor); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version]. + vendor = NULL; + } + } + } + } + } + out: return; } @@ -907,20 +1169,49 @@ } } } else { - struct ib_mad_mgmt_class_table *version; - struct ib_mad_mgmt_method_table *class; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_vendor_mad *vendor_mad; + int index; - /* Routing is based on version, class, and method */ + /* + * Routing is based on version, class, and method + * For "newer" vendor MADs, also based on OUI + */ if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) goto out; - version = port_priv->version[mad->mad_hdr.class_version]; - if (!version) - goto out; - class = version->method_table[convert_mgmt_class( + if (!is_vendor_class(mad->mad_hdr.mgmt_class)) { + class = port_priv->version[ + mad->mad_hdr.class_version].class; + if (!class) + goto out; + method = class->method_table[convert_mgmt_class( + mad->mad_hdr.mgmt_class)]; + if (method) + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } else { + vendor = port_priv->version[ + mad->mad_hdr.class_version].vendor; + if (!vendor) + goto out; + vendor_class = vendor->vendor_class[vendor_class_index( mad->mad_hdr.mgmt_class)]; - if (class) - mad_agent = class->agent[mad->mad_hdr.method & - ~IB_MGMT_METHOD_RESP]; + if (!vendor_class) + goto out; + /* Find matching OUI */ + vendor_mad = (struct ib_vendor_mad *)mad; + index = find_vendor_oui(vendor_class, vendor_mad->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } + } } if (mad_agent) { From halr at voltaire.com Fri Dec 3 06:07:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 03 Dec 2004 09:07:32 -0500 Subject: [openib-general] Re: mthca Page Allocation Failures In-Reply-To: <52fz2ol72l.fsf@topspin.com> References: <1102018176.4179.15.camel@localhost.localdomain> <52fz2ol72l.fsf@topspin.com> Message-ID: <1102082852.4197.0.camel@localhost.localdomain> On Thu, 2004-12-02 
at 16:32, Roland Dreier wrote: > This patch should help. It did indeed. Thanks. Applied. -- Hal From halr at voltaire.com Fri Dec 3 06:09:19 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 03 Dec 2004 09:09:19 -0500 Subject: [openib-general] smpdump and current MAD layer In-Reply-To: <41AF9BB9.5040402@ichips.intel.com> References: <1101838548.6411.276.camel@localhost.localdomain> <41AF559C.7090700@ichips.intel.com> <1102017939.4179.10.camel@localhost.localdomain> <41AF9BB9.5040402@ichips.intel.com> Message-ID: <1102082958.4197.4.camel@localhost.localdomain> On Thu, 2004-12-02 at 17:48, Sean Hefty wrote: > For example, local MADs do not generate CQ entries, and are > completed from the sender's thread. (I think we may want to change > the threading at some point, so it may not matter long term.) That's been on the TODO list: Roland pointed this out when it was added. -- Hal From roland at topspin.com Fri Dec 3 07:00:45 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 07:00:45 -0800 Subject: [openib-general] [PATCH] Add vendor OUI support to MAD layer In-Reply-To: <1102046938.14406.45.camel@localhost.localdomain> (Hal Rosenstock's message of "Thu, 02 Dec 2004 23:08:58 -0500") References: <1102046938.14406.45.camel@localhost.localdomain> Message-ID: <52k6rzifz6.fsf@topspin.com> This exposes the OUI support to userspace... I bumped the ABI version because the size of the reg request changed. Index: infiniband/include/ib_user_mad.h =================================================================== --- infiniband/include/ib_user_mad.h (revision 1310) +++ infiniband/include/ib_user_mad.h (working copy) @@ -31,7 +31,7 @@ * Increment this value if any changes that break userspace ABI * compatibility are made. */ -#define IB_USER_MAD_ABI_VERSION 1 +#define IB_USER_MAD_ABI_VERSION 2 /* * Make sure that all structs defined in this file remain laid out so @@ -90,6 +90,8 @@ * receive unsolicited MADs, otherwise it should be 0. * @mgmt_class_version - Indicates which version of MADs for the given * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. */ struct ib_user_mad_reg_req { __u32 id; @@ -97,6 +99,7 @@ __u8 qpn; __u8 mgmt_class; __u8 mgmt_class_version; + __u8 oui[3]; }; #define IB_IOCTL_MAGIC 0x1b Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 1310) +++ infiniband/core/user_mad.c (working copy) @@ -352,6 +352,7 @@ req.mgmt_class = ureq.mgmt_class; req.mgmt_class_version = ureq.mgmt_class_version; memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + memcpy(req.oui, ureq.oui, sizeof req.oui); agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI, From iod00d at hp.com Fri Dec 3 10:27:11 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 3 Dec 2004 10:27:11 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041123225624.GO10431@esmail.cup.hp.com> References: <20041123225624.GO10431@esmail.cup.hp.com> Message-ID: <20041203182711.GH15286@esmail.cup.hp.com> On Tue, Nov 23, 2004 at 02:56:24PM -0800, Grant Grundler wrote: > So the adventure continues on a different box (rx4640). > (I'll go back to the rx2600 and reflash/reboot the box). > > With tvflash, I was able to upload the hca-cougar image I mentioned > before successfully...at least that's what tvflash asserted. 
So it turns out I had flash the "high profile" firmware to a "low profile" card...it didn't like that. That's why the 3rd BAR was visible but not responding to programming. I've update the source tree and tried again with a high profile card (also reflashed with topspin firmware) and still getting the same error: Linux iowa 2.6.10-rc2 #10 SMP Fri Dec 3 08:06:24 PST 2004 ia64 GNU/Linux iowa:~# modprobe ib_mthca ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:41:00.0) GSI 38 (level, low) -> CPU 1 (0x0100) vector 66 ACPI: PCI interrupt 0000:41:00.0[A] -> GSI 38 (level, low) -> IRQ 66 ib_mthca 0000:41:00.0: Unhandled event 0f(00) on eqn 3 ib_query_gid failed (-16) for mthca0 (index 12) ib_query_port failed (-16) for mthca0 ib_mthca 0000:41:00.0: WRITE_MTT failed (-16) ib_mad: Couldn't create ib_mad CQ ib_mad: Couldn't open mthca0 port 1 ib_agent: Port 1 not found ib_mad: Couldn't close mthca0 port 1 for agents ib_mad: Port 1 not found ib_mad: Couldn't close mthca0 port 1 ib_agent: Port 2 not found ib_mad: Couldn't close mthca0 port 2 for agents ib_mad: Port 2 not found ib_mad: Couldn't close mthca0 port 2 iowa:~# lspci -vs 0000:41:00.0 lspci: -f: Invalid slot number iowa:~# lspci -vs 41:00.0 0000:41:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1) Subsystem: Mellanox Technology MT23108 InfiniHost Flags: 66MHz, medium devsel, IRQ 66 Memory at 00000000b0800000 (64-bit, non-prefetchable) [size=1M] Memory at 00000000b0000000 (64-bit, prefetchable) [size=8M] Memory at 00000000a0000000 (64-bit, prefetchable) [size=256M] Capabilities: [40] #11 [001f] Capabilities: [50] Vital Product Data Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [70] PCI-X non-bridge device. I'm fighting other issues right now and haven't been able to work on this (^#%$ tulip driver). If anyone has advice on how to proceed debugging this or needs more info, I can use it. I'm still leary that tvflash didn't work right despite the assertion the flash operation completed: iowa:~# tvflash -i open_hca(0) flash_chip_reset() flash_check_failsafe() Error. String Tag not present (found tag 50 instead) HCA #0: Found MT23108, Cougar, revision A1 Primary image is valid, unknown source (sig 0x0/0x0) Secondary image is valid, unknown source (sig 0x0/0x0) Error. String Tag not present (found tag 50 instead) close_hca() Note that "tvflash -i" worked fine when the original firmware was loaded: HCA #0: Found MT23108, Cougar, revision A1 Primary image is valid, unknown source (sig 0x0/0x0) Secondary image is valid, unknown source (sig 0x0/0x0) Vital Product Data Product Name: PCI-X Dual Port InfiniBand HCA P/N: AB286-60001 E/C: A-4412 S/N: US4417F00350 Freq/Power: PW=15W;PCI 66MHZ;PCI-X 133MHZ Date Code: N/A Checksum: N/A Maybe the firmware image I have is corrupt? grundler at iowa:~$ cksum hca-cougar-a1-250-157.bin 2761115387 932768 hca-cougar-a1-250-157.bin Does tvflash to have some sanity checking (embedded checksums or something) so it wouldn't use corrupted images? I also tried a different image from HP (that also exposes the 3rd BAR). Got the same result as above. Since the 3rd BAR is visible and programmable, I'll assume the firmware downloaded was good in both cases. thanks, grant From roland at topspin.com Fri Dec 3 10:35:17 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 10:35:17 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... 
In-Reply-To: <20041203182711.GH15286@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 3 Dec 2004 10:27:11 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> Message-ID: <52fz2ni61m.fsf@topspin.com> Can you try with the very latest svn bits? It turns out there the readb() I was using in the FW command code path is not allowed by Mellanox HW -- all accesses need to be 32 bits wide. I have no idea if it could cause this problem but in the latest tree I converted it to a readl() just to be safe. By the way, what's the difference between this box and the one that was working before? Just OpenIB SW version? - R. From philippe.gregoire at cea.fr Fri Dec 3 10:44:03 2004 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Fri, 03 Dec 2004 19:44:03 +0100 Subject: [openib-general] IP over IB configuration procedure Message-ID: <200412031844.TAA14129@styx.bruyeres.cea.fr> What are the requires steps to configure IP over IB ? I was able to load the kernel modules but unable to configure the interface ib0 with /sbin/ifup Is /sbin/ifup the correct command to setup the infinband interface or is there any other procedure to follow ? I tried also to configure "manually" the interface but pinging another node does not work. I execute the following commands : # modprobe ib_mthca # modprobe ib_ipoib # lsmod Module Size Used by ib_ipoib 51608 1 ib_sa 12812 1 ib_ipoib ib_mthca 76944 0 ib_mad 28180 2 ib_sa,ib_mthca ib_core 41472 4 ib_ipoib,ib_sa,ib_mthca,ib_mad iptable_filter 4224 0 ip_tables 19600 1 iptable_filter ep 418080 0 rms 24868 0 elan4 256520 1 ep elan3 304400 1 ep elan 41760 3 ep,elan4,elan3 qsnet 91812 5 ep,rms,elan4,elan3,elan # dmesg ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:04:00.0) divert: not allocating divert_blk for non-ethernet device ib0 divert: not allocating divert_blk for non-ethernet device ib1 ]# ifup ib0 Error, some other host already uses address xxx.yyy.zzz.119. Thanks for your help Philippe From roland at topspin.com Fri Dec 3 10:55:34 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 10:55:34 -0800 Subject: [openib-general] IP over IB configuration procedure In-Reply-To: <200412031844.TAA14129@styx.bruyeres.cea.fr> (Philippe Gregoire's message of "Fri, 03 Dec 2004 19:44:03 +0100") References: <200412031844.TAA14129@styx.bruyeres.cea.fr> Message-ID: <52brdbi53t.fsf@topspin.com> Philippe> Is /sbin/ifup the correct command to setup the infinband Philippe> interface or is there any other procedure to follow ? I Philippe> tried also to configure "manually" the interface but Philippe> pinging another node does not work. Not sure what ifup does on your system (it depends on the distribution). However you probably need to configure something like /etc/network/interfaces (the Debian config file for ifup). Something like "ifconfig ib0
" should work, though. Hal's IPoIB FAQ may be useful for debugging: http://article.gmane.org/gmane.linux.drivers.openib/6175 - Roland From iod00d at hp.com Fri Dec 3 10:56:38 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 3 Dec 2004 10:56:38 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <52fz2ni61m.fsf@topspin.com> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> Message-ID: <20041203185638.GA16001@esmail.cup.hp.com> On Fri, Dec 03, 2004 at 10:35:17AM -0800, Roland Dreier wrote: > Can you try with the very latest svn bits? How can I tell which bits I'm currently testing? I did the "svn up" about 2h ago. svn is now telling me: grundler at iowa:/usr/src/linux-ia64-release-2.6.10/drivers/infiniband$ svn up At revision 1311. grundler at iowa:/usr/src/linux-ia64-release-2.6.10/drivers/infiniband$ > It turns out there the > readb() I was using in the FW command code path is not allowed by > Mellanox HW -- all accesses need to be 32 bits wide. I have no idea > if it could cause this problem but in the latest tree I converted it > to a readl() just to be safe. ok. > By the way, what's the difference between this box and the one that > was working before? Just OpenIB SW version? Before we were using the original HP firmware that exposes the virtual bridge but hides the "3rd BAR". I will eventually go back and try with such a card again. But I think being able to use a reflashed card would be just as important for openib.org than using the original HP card. thanks, grant From roland at topspin.com Fri Dec 3 10:59:12 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 10:59:12 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041203185638.GA16001@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 3 Dec 2004 10:56:38 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> Message-ID: <527jnzi4xr.fsf@topspin.com> Grant> How can I tell which bits I'm currently testing? I did the Grant> "svn up" about 2h ago. That's new enough. Here's a debugging patch that might help me figure out what's going on; can you reproduce the problem with this applied? Thanks, Roland Index: infiniband/hw/mthca/mthca_eq.c =================================================================== --- infiniband/hw/mthca/mthca_eq.c (revision 1310) +++ infiniband/hw/mthca/mthca_eq.c (working copy) @@ -151,6 +151,8 @@ { u32 doorbell[2]; + printk(KERN_ERR "Set EQ CI %d/%d\n", eqn, ci); + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); doorbell[1] = cpu_to_be32(ci); @@ -270,6 +272,11 @@ break; case MTHCA_EVENT_TYPE_CMD: + { + static int c; + ++c; + printk(KERN_ERR "cmd completion %d\n", c); + } mthca_cmd_event(dev, be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, From halr at voltaire.com Fri Dec 3 11:23:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 03 Dec 2004 14:23:32 -0500 Subject: [openib-general] IP over IB configuration procedure In-Reply-To: <52brdbi53t.fsf@topspin.com> References: <200412031844.TAA14129@styx.bruyeres.cea.fr> <52brdbi53t.fsf@topspin.com> Message-ID: <1102101811.4197.31.camel@localhost.localdomain> On Fri, 2004-12-03 at 13:55, Roland Dreier wrote: > Something like "ifconfig ib0
" should work, though. > Hal's IPoIB FAQ may be useful for debugging: > http://article.gmane.org/gmane.linux.drivers.openib/6175 Should we add how to bring up IPoIB to this writeup ? It currently assumes IPoIB has been brought up and is not working. -- Hal From halr at voltaire.com Fri Dec 3 11:43:20 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 03 Dec 2004 14:43:20 -0500 Subject: [openib-general] IPoIB FAQ Update Message-ID: <1102103000.4197.45.camel@localhost.localdomain> Here's an update to my initial attempt at an IPoIB FAQ: ping doesn't work between IPoIB nodes. What should I do ? First, verify that the ports are active. This can be done via: cat /sys/class/infiniband/mthca0/ports/1/state This should indicate 4: ACTIVE assuming the HCA is mthca0 and port 1 is the one plugged into the subnet (switch, etc.). If the port is not active, there could be several reasons: 1. You need an SM in your subnet to bring the ports to active. Do you have an SM ? This can be embedded in a switch or some other IB hardware or run on an end node (HCA) although OpenIB (gen2) does not currently support this. 2. If you have an SM in your subnet, there might be a cabling problem where the SM cannot "reach" your end node. If the port is active, indicate the subnet configuration and which SM is being utilized. Do /sys/class/net/ib0/statistics/rx_packets and/or "tcpdump -i ib0" show anything on the other nodes when you try to ping or something? There are 2 levels of IPoIB debug which can be enabled when building: IP-over-InfiniBand debugging and IP-over-InfiniBand data path debugging. The latter has performance implications and should only be enabled when all else fails. Enable the first level of IPoIB debug and then: mount -t ipoib_debugfs none /ipoib_debufs/ cat /ipoib_debugfs/ib0_mcg Other things to verify and supply to help isolate the problem: 1. Verify the firmware version via cat /sys/class/infiniband/mthca0/fw_ver For PCI-X HCAs, version 3.2.0 is recommended. For PCIe HCAs, version 4.5.3 is recommended. 2. Make sure the IB modules are loaded: /sbin/lsmod | grep ib_ should show ib_mthca (HCA driver) as well as ib_ipoib. There are others but those are the two which need to be loaded and any others will follow. 3. Make sure there are no errors in /var/log/messages pertaining to ib_. 4. Indicate the IP configuration via /sbin/ifconfig -a From iod00d at hp.com Fri Dec 3 14:14:13 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 3 Dec 2004 14:14:13 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <527jnzi4xr.fsf@topspin.com> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> Message-ID: <20041203221413.GC16522@esmail.cup.hp.com> On Fri, Dec 03, 2004 at 10:59:12AM -0800, Roland Dreier wrote: > Grant> How can I tell which bits I'm currently testing? I did the > Grant> "svn up" about 2h ago. > > That's new enough. Here's a debugging patch that might help me figure > out what's going on; can you reproduce the problem with this applied? I tried but it worked with the patch. :^( Last output was: ... 
Set EQ CI 3/55 cmd completion 1208 Set EQ CI 3/56 cmd completion 1209 Set EQ CI 3/57 cmd completion 1210 Set EQ CI 3/58 cmd completion 1211 Set EQ CI 3/59 --- end of output --- iowa:/usr/src/linux-ia64-release-2.6.10# cat /sys/class/infiniband/mthca0/ports/1/state cmd completion 1212 Set EQ CI 3/60 4: ACTIVE iowa:/usr/src/linux-ia64-release-2.6.10# cat /sys/class/infiniband/mthca0/ports/2/state cmd completion 1213 Set EQ CI 3/61 1: DOWN which is correct. Port "A" is connected to a switch and "B" is not. > + printk(KERN_ERR "Set EQ CI %d/%d\n", eqn, ci); > + > doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); > doorbell[1] = cpu_to_be32(ci); This doesn't feel very safe to me. If write ordering is required here, writel() or wmb() is necessary. Let me look over this code and see if the ordering is enforced elsewhere. I'll also play around with removing one printk at a time. thanks, grant From roland at topspin.com Fri Dec 3 14:26:12 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 14:26:12 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041203221413.GC16522@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 3 Dec 2004 14:14:13 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> Message-ID: <52hdn3ggsb.fsf@topspin.com> Grant> I tried but it worked with the patch. :^( Of course it would ... :) >> doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); >> doorbell[1] = cpu_to_be32(ci); Grant> This doesn't feel very safe to me. If write ordering is Grant> required here, writel() or wmb() is necessary. Let me look Grant> over this code and see if the ordering is enforced Grant> elsewhere. I'm not positive but I think it should be OK. doorbell is just a temporary variable that gets passed to mthca_write64(), which essentially does a __raw_writeq(*(u64 *) doorbell). On ia64, __raw_writeq is #defined to be writeq, so ordering should be OK there. And surely ia64 ordering is strong enough that the CPU won't try to do the writeq before the writes to doorbell complete, right? On the other hand, the fact that changing the timing with printks makes things work does make it look like there may be some sort of ordering problem... - R. From iod00d at hp.com Fri Dec 3 14:40:39 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 3 Dec 2004 14:40:39 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <52hdn3ggsb.fsf@topspin.com> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> Message-ID: <20041203224039.GE16522@esmail.cup.hp.com> On Fri, Dec 03, 2004 at 02:26:12PM -0800, Roland Dreier wrote: > Grant> I tried but it worked with the patch. :^( > > Of course it would ... :) :^) > Grant> This doesn't feel very safe to me. If write ordering is > Grant> required here, writel() or wmb() is necessary. Let me look > Grant> over this code and see if the ordering is enforced > Grant> elsewhere. > > I'm not positive but I think it should be OK. doorbell is just a > temporary variable that gets passed to mthca_write64(), which > essentially does a __raw_writeq(*(u64 *) doorbell). 
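As an aside, a user-space sketch (made-up example values, not the driver code) of the byte layout under discussion: the two cpu_to_be32() stores into a temporary array followed by one 64-bit raw write put the same bytes on the wire as combining the halves with an explicit shift. htonl() stands in for cpu_to_be32() outside the kernel.

/*
 * User-space sketch, made-up values: the two 32-bit big-endian
 * stores into doorbell[2] followed by one 64-bit raw write produce
 * the same byte image as combining the halves with a shift.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <arpa/inet.h>		/* htonl() plays the role of cpu_to_be32() */

int main(void)
{
	uint32_t hi = 0x80000003;	/* example value for the first doorbell word  */
	uint32_t lo = 59;		/* example value for the second (CI) word     */
	uint32_t doorbell[2] = { htonl(hi), htonl(lo) };	/* style used today */

	uint64_t v = ((uint64_t) hi << 32) | lo;	/* explicit combination */
	unsigned char combined[8];
	int i;

	for (i = 0; i < 8; ++i)		/* lay v out in big-endian byte order */
		combined[i] = v >> (56 - 8 * i);

	printf("byte images match: %s\n",
	       memcmp(doorbell, combined, sizeof combined) ? "no" : "yes");
	return 0;
}

Either form presents the high word to the device first; the only difference is whether the combination happens through memory or in a register.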
On ia64, > __raw_writeq is #defined to be writeq, so ordering should be OK > there. Yes - I just went through the same code and came to the same conclusion. The key bit here is in __writeq() where it add "volatile". This makes sure all previous writes have completed...but... > And surely ia64 ordering is strong enough that the CPU won't > try to do the writeq before the writes to doorbell complete, right? Yes - I believe it is. > On the other hand, the fact that changing the timing with printks > makes things work does make it look like there may be some sort of > ordering problem... Well, or other race. Ordering was just my first guess. It's likely the race is not even here - but on the completion side of things. I'll play with this for a bit and see were it leads me. thanks, grant From roland at topspin.com Fri Dec 3 14:51:01 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 14:51:01 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041203224039.GE16522@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 3 Dec 2004 14:40:39 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> Message-ID: <52d5xrgfmy.fsf@topspin.com> Grant> Well, or other race. Ordering was just my first guess. Grant> It's likely the race is not even here - but on the Grant> completion side of things. Hmm, I'll think about it too. The problem appears to be an overflow of the FW command completion event queue (that's what the message ib_mthca 0000:41:00.0: Unhandled event 0f(00) on eqn 3 is saying ... event 0f is EQ overflow) It is true that we call mthca_cmd_event() on FW command completion (which frees up a command slot and could let another command start) before we update the consumer index and let the HCA know that there is a free slot in the EQ. Still, it's a little hard to see how we could overflow the EQ (which is created with 128 slots). - R. From roland at topspin.com Fri Dec 3 15:22:25 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 15:22:25 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <52d5xrgfmy.fsf@topspin.com> (Roland Dreier's message of "Fri, 03 Dec 2004 14:51:01 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> Message-ID: <528y8fge6m.fsf@topspin.com> OK, I may have figured out the problem. How does this patch work? 
Index: infiniband/hw/mthca/mthca_eq.c =================================================================== --- infiniband/hw/mthca/mthca_eq.c (revision 1310) +++ infiniband/hw/mthca/mthca_eq.c (working copy) @@ -219,11 +219,14 @@ struct mthca_eqe *eqe; int disarm_cqn; int work = 0; + int set_ci = 0; while (1) { if (!next_eqe_sw(eq)) break; + set_ci = 0; + eqe = get_eqe(eq, eq->cons_index); work = 1; @@ -274,6 +277,13 @@ be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, be64_to_cpu(eqe->event.cmd.out_param)); + /* + * Need to set the CI inside the loop for + * command completion events, because this + * event allows another command to be posted + * and we may overflow the EQ. + */ + set_ci = 1; break; case MTHCA_EVENT_TYPE_PORT_CHANGE: @@ -296,9 +306,14 @@ set_eqe_hw(eq, eq->cons_index); eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + + if (work && !set_ci) { + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + } } - if (work) { + if (work && !set_ci) { wmb(); set_eq_ci(dev, eq->eqn, eq->cons_index); } From iod00d at hp.com Fri Dec 3 18:18:19 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 3 Dec 2004 18:18:19 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <528y8fge6m.fsf@topspin.com> References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> Message-ID: <20041204021819.GG16522@esmail.cup.hp.com> On Fri, Dec 03, 2004 at 03:22:25PM -0800, Roland Dreier wrote: > OK, I may have figured out the problem. How does this patch work? I've saved it off as diff-openib-set_ci Applied fine but still getting the "Unhandled event 0f(00) on eqn 3" message. :^( I'm not done staring at the code but will have to give up for today. The little bit that is left of my brain is cooked. BTW, a better name for "set_ci" might be "ci_full" or "stop_ci". (YKWIM) thanks, grant From roland at topspin.com Fri Dec 3 18:23:48 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 18:23:48 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041204021819.GG16522@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 3 Dec 2004 18:18:19 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> Message-ID: <524qj2hkcr.fsf@topspin.com> Grant> I've saved it off as diff-openib-set_ci Applied fine but Grant> still getting the "Unhandled event 0f(00) on eqn 3" Grant> message. :^( Hmm, I guess that wasn't the problem (or I didn't fix it properly). Grant> BTW, a better name for "set_ci" might be "ci_full" or Grant> "stop_ci". (YKWIM) No, set_ci stands for "set consumer index," which is what it does -- each event queue has a producer index (where the HW will write the next event) and a consumer index (the last event the driver has consumed). The HW compares the index to know when the EQ overflowed, so we need to set the CI to tell the HW when we've consumed events. - R. 
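To make the producer/consumer picture concrete, here is a small self-contained sketch, a toy model with invented names rather than the mthca EQ code, of why events get dropped once the hardware's view of the consumer index lags too far behind, and why reporting the CI back promptly matters.

/*
 * Toy model (user space, invented names) of the EQ indices described
 * above.  The "hardware" advances a producer index as it posts events;
 * the driver advances a consumer index as it handles them and then
 * reports the new CI back.  The hardware declares overflow when its
 * producer would catch up with the last CI it was told about.
 */
#include <stdio.h>

/*
 * A real ring would index slots as (idx & (EQ_NENT - 1)), which is
 * why nent must be a power of two, as noted later in the thread.
 */
#define EQ_NENT 128

struct toy_eq {
	unsigned prod;		/* next slot the "HW" will fill      */
	unsigned cons;		/* last slot the driver has consumed */
	unsigned hw_cons;	/* CI last reported to the "HW"      */
};

/* what the hardware does for each new event */
static int eq_post(struct toy_eq *eq)
{
	if (eq->prod - eq->hw_cons >= EQ_NENT)
		return -1;	/* overflow: the "Unhandled event 0f" case */
	++eq->prod;
	return 0;
}

/* what the interrupt handler does */
static void eq_poll(struct toy_eq *eq)
{
	while (eq->cons != eq->prod)
		++eq->cons;	/* handle one event */
	eq->hw_cons = eq->cons;	/* the set_eq_ci() step */
}

int main(void)
{
	struct toy_eq eq = { 0, 0, 0 };
	int i, dropped = 0;

	/* post two rings' worth of events without ever updating the CI */
	for (i = 0; i < 2 * EQ_NENT; ++i)
		if (eq_post(&eq))
			++dropped;
	printf("dropped %d of %d events before the CI update\n", dropped, 2 * EQ_NENT);

	eq_poll(&eq);
	printf("after poll: cons=%u, hardware now knows cons=%u\n", eq.cons, eq.hw_cons);
	return 0;
}

The fixes being discussed amount to making sure the hardware-visible CI is updated (or command slots are only freed afterwards) before the gap can ever reach the ring size.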
From roland at topspin.com Fri Dec 3 18:25:27 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 03 Dec 2004 18:25:27 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <524qj2hkcr.fsf@topspin.com> (Roland Dreier's message of "Fri, 03 Dec 2004 18:23:48 -0800") References: <20041123225624.GO10431@esmail.cup.hp.com> <20041203182711.GH15286@esmail.cup.hp.com> <52fz2ni61m.fsf@topspin.com> <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> Message-ID: <52zn0ug5pk.fsf@topspin.com> Roland> Hmm, I guess that wasn't the problem (or I didn't fix it Roland> properly). The latter (fix was bogus, I made a cut-and-paste error). Here's a better version: Index: infiniband/hw/mthca/mthca_eq.c =================================================================== --- infiniband/hw/mthca/mthca_eq.c (revision 1310) +++ infiniband/hw/mthca/mthca_eq.c (working copy) @@ -219,11 +219,14 @@ struct mthca_eqe *eqe; int disarm_cqn; int work = 0; + int set_ci = 0; while (1) { if (!next_eqe_sw(eq)) break; + set_ci = 0; + eqe = get_eqe(eq, eq->cons_index); work = 1; @@ -274,6 +277,13 @@ be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, be64_to_cpu(eqe->event.cmd.out_param)); + /* + * Need to set the CI inside the loop for + * command completion events, because this + * event allows another command to be posted + * and we may overflow the EQ. + */ + set_ci = 1; break; case MTHCA_EVENT_TYPE_PORT_CHANGE: @@ -296,9 +306,14 @@ set_eqe_hw(eq, eq->cons_index); eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + + if (set_ci) { + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + } } - if (work) { + if (work && !set_ci) { wmb(); set_eq_ci(dev, eq->eqn, eq->cons_index); } From jjengla at sandia.gov Sun Dec 5 20:02:56 2004 From: jjengla at sandia.gov (Josh England) Date: Sun, 05 Dec 2004 20:02:56 -0800 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1102103000.4197.45.camel@localhost.localdomain> References: <1102103000.4197.45.camel@localhost.localdomain> Message-ID: <1102305776.5890.28.camel@localhost> I'm following the FAQ pretty closely and IPoIB is still not working for me. I'm using the latest from SVN (as of tonight). These are PCIe HCAs with 4.5.3 firmware (x86_64 MST is horribly broken BTW --can't wait for tvflash). Hardware/cabling is fine since things work under VAPI 3.2. All modules are loaded, and syslog doesn't show anything. It seems now (new with 4.5.3 FW) that packets are flowing one direction but not the other. Here's as much debug info as I could dig up. 
ping from n0 to n1: /sys/class/net/ib0/statistics/rx_packets on n1 increases ping from n1 to n0: /sys/class/net/ib0/statistics/rx_packets remains at 0 tcpdump shows nothing on either node n0:/ipoib_debugfs/ib0_mcg shows: GID: ff12:401b:ffff:0:0:0:0:1 created: 4295359818 queuelen: 0 complete: 1 send_only: 0 GID: ff12:401b:ffff:0:0:0:ffff:ffff created: 4295359818 queuelen: 0 complete: 1 send_only: 0 GID: ff12:601b:ffff:0:0:0:0:1 created: 4295359819 queuelen: 0 complete: 1 send_only: 0 GID: ff12:601b:ffff:0:0:0:0:2 created: 4295361295 queuelen: 2 complete: 0 send_only: 1 GID: ff12:601b:ffff:0:0:0:0:16 created: 4295359821 queuelen: 0 complete: 0 send_only: 1 GID: ff12:601b:ffff:0:0:1:ff00:6693 created: 4295359819 queuelen: 0 complete: 1 send_only: 0 n1:/ipoib_debugfs/ib0_mcg shows: GID: ff12:401b:ffff:0:0:0:0:1 created: 4294846326 queuelen: 0 complete: 1 send_only: 0 GID: ff12:401b:ffff:0:0:0:ffff:ffff created: 4294846326 queuelen: 0 complete: 1 send_only: 0 GID: ff12:601b:ffff:0:0:0:0:1 created: 4294846327 queuelen: 0 complete: 1 send_only: 0 GID: ff12:601b:ffff:0:0:0:0:2 created: 4294847749 queuelen: 2 complete: 0 send_only: 1 GID: ff12:601b:ffff:0:0:0:0:16 created: 4294846329 queuelen: 0 complete: 0 send_only: 1 GID: ff12:601b:ffff:0:0:1:ff00:66bf created: 4294846327 queuelen: 0 complete: 1 send_only: 0 ifconfig -a on n0: eth0 Link encap:Ethernet HWaddr 00:30:48:53:69:D8 inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:41000 errors:0 dropped:0 overruns:0 frame:0 TX packets:32882 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:21739542 (20.7 Mb) TX bytes:4794259 (4.5 Mb) Base address:0xbc00 Memory:feae0000-feb00000 ib0 Link encap:UNSPEC HWaddr 00-00-00-84-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:10.0.0.1 Bcast:10.255.255.255 Mask:255.0.0.0 inet6 addr: fe80::206:6a00:a000:6693/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:46 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:2768 (2.7 Kb) ib1 Link encap:UNSPEC HWaddr 00-00-00-85-FE-80-00-00-00-00-00-00-00-00-00-00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:70 errors:0 dropped:0 overruns:0 frame:0 TX packets:70 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:6748 (6.5 Kb) TX bytes:6748 (6.5 Kb) sit0 Link encap:IPv6-in-IPv4 NOARP MTU:1480 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) ifconfig -a on n1: eth0 Link encap:Ethernet HWaddr 00:30:48:53:6A:77 inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:41747 errors:0 dropped:0 overruns:0 frame:0 TX packets:33587 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:21841556 (20.8 Mb) TX bytes:4897945 (4.6 Mb) Base address:0xbc00 Memory:feaa0000-feac0000 ib0 Link encap:UNSPEC HWaddr 00-00-00-84-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:10.0.0.2 Bcast:10.255.255.255 Mask:255.0.0.0 inet6 addr: 
fe80::206:6a00:a000:66bf/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:45 errors:0 dropped:0 overruns:0 frame:0 TX packets:69 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:2520 (2.4 Kb) TX bytes:4372 (4.2 Kb) ib1 Link encap:UNSPEC HWaddr 00-00-00-85-FE-80-00-00-00-00-00-00-00-00-00-00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:36 errors:0 dropped:0 overruns:0 frame:0 TX packets:36 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:2940 (2.8 Kb) TX bytes:2940 (2.8 Kb) sit0 Link encap:IPv6-in-IPv4 NOARP MTU:1480 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Is there anything else I can try? -JE On Fri, 2004-12-03 at 14:43 -0500, Hal Rosenstock wrote: > Here's an update to my initial attempt at an IPoIB FAQ: > > ping doesn't work between IPoIB nodes. What should I do ? > > First, verify that the ports are active. > > This can be done via: > > cat /sys/class/infiniband/mthca0/ports/1/state > > This should indicate 4: ACTIVE > > assuming the HCA is mthca0 and port 1 is the one plugged into the subnet > (switch, etc.). > > If the port is not active, there could be several reasons: > > 1. You need an SM in your subnet to bring the ports to active. Do you > have an SM ? This can be embedded in a switch or some other IB hardware > or run on an end node (HCA) although OpenIB (gen2) does not currently > support this. > > 2. If you have an SM in your subnet, there might be a cabling problem > where the SM cannot "reach" your end node. > > If the port is active, indicate the subnet configuration and which SM is > being utilized. > > Do /sys/class/net/ib0/statistics/rx_packets and/or "tcpdump -i ib0" > show anything on the other nodes when you try to ping or something? > > There are 2 levels of IPoIB debug which can be enabled when building: > IP-over-InfiniBand debugging and IP-over-InfiniBand data path debugging. > The latter has performance implications and should only be enabled when > all else fails. Enable the first level of IPoIB debug and then: > > mount -t ipoib_debugfs none /ipoib_debufs/ > cat /ipoib_debugfs/ib0_mcg > > Other things to verify and supply to help isolate the problem: > > 1. Verify the firmware version via > > cat /sys/class/infiniband/mthca0/fw_ver > > For PCI-X HCAs, version 3.2.0 is recommended. For PCIe HCAs, version > 4.5.3 is recommended. > > 2. Make sure the IB modules are loaded: > /sbin/lsmod | grep ib_ > should show ib_mthca (HCA driver) as well as ib_ipoib. There are others > but those are the two which need to be loaded and any others will > follow. > > 3. Make sure there are no errors in /var/log/messages pertaining to ib_. > > 4. 
Indicate the IP configuration via > /sbin/ifconfig -a > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Mon Dec 6 04:09:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 06 Dec 2004 07:09:11 -0500 Subject: [openib-general] Re: IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1102305776.5890.28.camel@localhost> References: <1102103000.4197.45.camel@localhost.localdomain> <1102305776.5890.28.camel@localhost> Message-ID: <1102334951.4197.940.camel@localhost.localdomain> Hi Josh, On Sun, 2004-12-05 at 23:02, Josh England wrote: Your IPoIB configuration looks fine. > It seems now (new with 4.5.3 FW) that packets are flowing one direction > but not the other. It is possible there are still multicast ARP issues with PCIe with even the 4.5.3 firmware. There is apparently a pre 4.6 version which has a fix in it for this which some others have said they need in other (non OpenIB)environments. You will likely need to contact Mellanox to get this prerelease. -- Hal From bogus@does.not.exist.com Mon Dec 6 07:29:11 2004 From: bogus@does.not.exist.com () Date: Mon, 06 Dec 2004 15:29:11 -0000 Subject: No subject Message-ID: From dledford at redhat.com Mon Dec 6 07:43:12 2004 From: dledford at redhat.com (Doug Ledford) Date: Mon, 06 Dec 2004 10:43:12 -0500 Subject: [openib-general] Re: IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1102334951.4197.940.camel@localhost.localdomain> References: <1102103000.4197.45.camel@localhost.localdomain> <1102305776.5890.28.camel@localhost> <1102334951.4197.940.camel@localhost.localdomain> Message-ID: <1102347792.6490.156.camel@compaq-rhel4.xsintricity.com> On Mon, 2004-12-06 at 07:09 -0500, Hal Rosenstock wrote: > Hi Josh, > > On Sun, 2004-12-05 at 23:02, Josh England wrote: > > Your IPoIB configuration looks fine. Well, he had identical hardware addresses on n0 and n1 for his ib0 and ib1 interfaces respectively. This would likely keep the linux network stack from actually sending the packet back to the originating host and instead convince it that it's intended for internal consumption I would think. Try putting a different HW MAC address on your different ib? ports and see if that gets the packets flowing. > > It seems now (new with 4.5.3 FW) that packets are flowing one direction > > but not the other. > > It is possible there are still multicast ARP issues with PCIe with even > the 4.5.3 firmware. There is apparently a pre 4.6 version which has a > fix in it for this which some others have said they need in other (non > OpenIB)environments. You will likely need to contact Mellanox to get > this prerelease. > > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Doug Ledford Red Hat, Inc. 1801 Varsity Dr. 
Raleigh, NC 27606 From halr at voltaire.com Mon Dec 6 07:40:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 06 Dec 2004 10:40:08 -0500 Subject: [openib-general] Re: IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1102347792.6490.156.camel@compaq-rhel4.xsintricity.com> References: <1102103000.4197.45.camel@localhost.localdomain> <1102305776.5890.28.camel@localhost> <1102334951.4197.940.camel@localhost.localdomain> <1102347792.6490.156.camel@compaq-rhel4.xsintricity.com> Message-ID: <1102347607.4197.1183.camel@localhost.localdomain> On Mon, 2004-12-06 at 10:43, Doug Ledford wrote: > On Mon, 2004-12-06 at 07:09 -0500, Hal Rosenstock wrote: > > Hi Josh, > > > > On Sun, 2004-12-05 at 23:02, Josh England wrote: > > > > Your IPoIB configuration looks fine. > > Well, he had identical hardware addresses on n0 and n1 for his ib0 and > ib1 interfaces respectively. This would likely keep the linux network > stack from actually sending the packet back to the originating host and > instead convince it that it's intended for internal consumption I would > think. Try putting a different HW MAC address on your different ib? > ports and see if that gets the packets flowing. The HW address is truncated in ifconfig as it is 20 bytes which is longer than the 16 I think ifconfig (and arp) support. You can tell the full IB HW addresses from ip neigh. If these were the same (due to duplicate GUIDs being programmed), this would cause a problem as you indicate. (I will add this tidbit into the next IPoIB FAQ). -- Hal From roland at topspin.com Mon Dec 6 07:51:03 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 06 Dec 2004 07:51:03 -0800 Subject: [openib-general] Re: IPoIB still not working In-Reply-To: <1102347792.6490.156.camel@compaq-rhel4.xsintricity.com> (Doug Ledford's message of "Mon, 06 Dec 2004 10:43:12 -0500") References: <1102103000.4197.45.camel@localhost.localdomain> <1102305776.5890.28.camel@localhost> <1102334951.4197.940.camel@localhost.localdomain> <1102347792.6490.156.camel@compaq-rhel4.xsintricity.com> Message-ID: <527jnvfms8.fsf@topspin.com> Doug> Well, he had identical hardware addresses on n0 and n1 for Doug> his ib0 and ib1 interfaces respectively. This would likely Doug> keep the linux network stack from actually sending the Doug> packet back to the originating host and instead convince it Doug> that it's intended for internal consumption I would think. Doug> Try putting a different HW MAC address on your different ib? Doug> ports and see if that gets the packets flowing. Actually ifconfig can only show the first 16 octets of the HW address (and I think the last two bytes are actually wrong, because the SIOGIFHWADDR ioctl that it uses can only return 14 bytes). IPoIB has a 20 byte HW address; the four (or six?) bytes that get cut off are the low-order bytes of the port GID, which is probably where the difference between port GIDs is. To see the real address, you need to do something like "ip addr show dev ib0". 
For example, on my system: # ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) # ip addr show dev ib0 5: ib0: mtu 2044 qdisc noop qlen 128 link/[32] 00:00:04:04:fe:80:00:00:00:00:00:00:00:02:c9:01:07:8c:e4:61 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff Perhaps we need something about this in the growing IPoIB FAQ? - Roland From jjengla at sandia.gov Mon Dec 6 08:07:19 2004 From: jjengla at sandia.gov (Josh England) Date: Mon, 06 Dec 2004 08:07:19 -0800 Subject: [openib-general] Re: IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1102334951.4197.940.camel@localhost.localdomain> References: <1102103000.4197.45.camel@localhost.localdomain> <1102305776.5890.28.camel@localhost> <1102334951.4197.940.camel@localhost.localdomain> Message-ID: <1102349239.21826.4.camel@localhost> On Mon, 2004-12-06 at 07:09 -0500, Hal Rosenstock wrote: > > It seems now (new with 4.5.3 FW) that packets are flowing one direction > > but not the other. > > It is possible there are still multicast ARP issues with PCIe with even > the 4.5.3 firmware. There is apparently a pre 4.6 version which has a > fix in it... I'll look into the pre-4.6 FW. Do you know if anyone has gotten openIB (IPoIB) to work over PCIe HCAs? -JE From robert.j.woodruff at intel.com Mon Dec 6 08:18:12 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 6 Dec 2004 08:18:12 -0800 Subject: [openib-general] Re: IPoIB still not working [was IPoIB FAQ Update] Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002F78E42@orsmsx408> >I'll look into the pre-4.6 FW. Do you know if anyone has gotten openIB >(IPoIB) to work over PCIe HCAs? >-JE I am working on setting up a couple of PCI-E systems today. Will let you know what I find. woody From halr at voltaire.com Mon Dec 6 08:41:16 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 06 Dec 2004 11:41:16 -0500 Subject: [openib-general] Latest IPoIB FAQ Message-ID: <1102351276.4197.1196.camel@localhost.localdomain> Here's an update to my initial attempt at an IPoIB FAQ: ping doesn't work between IPoIB nodes. What should I do ? First, verify that the ports are active. This can be done via: cat /sys/class/infiniband/mthca0/ports/1/state This should indicate 4: ACTIVE assuming the HCA is mthca0 and port 1 is the one plugged into the subnet (switch, etc.). If the port is not active, there could be several reasons: 1. You need an SM in your subnet to bring the ports to active. Do you have an SM ? This can be embedded in a switch or some other IB hardware or run on an end node (HCA) although OpenIB (gen2) does not currently support this. 2. If you have an SM in your subnet, there might be a cabling problem where the SM cannot "reach" your end node. If the port is active, indicate the subnet configuration and which SM is being utilized. Do /sys/class/net/ib0/statistics/rx_packets and/or "tcpdump -i ib0" show anything on the other nodes when you try to ping or something? There are 2 levels of IPoIB debug which can be enabled when building: IP-over-InfiniBand debugging and IP-over-InfiniBand data path debugging. The latter has performance implications and should only be enabled when all else fails. 
Enable the first level of IPoIB debug and then: mount -t ipoib_debugfs none /ipoib_debufs/ cat /ipoib_debugfs/ib0_mcg Other things to verify and supply to help isolate the problem: 1. Verify the firmware version via cat /sys/class/infiniband/mthca0/fw_ver For PCI-X HCAs, version 3.2.0 is recommended. For PCIe HCAs, version 4.5.3 is recommended. 2. Make sure the IB modules are loaded: /sbin/lsmod | grep ib_ should show ib_mthca (HCA driver) as well as ib_ipoib. There are others but those are the two which need to be loaded and any others will follow. 3. Make sure there are no errors in /var/log/messages pertaining to ib_. 4. Indicate the IP configuration via /sbin/ifconfig -a and ip addr show dev ib0 (assuming ib0 is the network interface being configured) This is because ifconfig can only show the first 16 octets of the HW address (and the last two bytes are actually wrong, because the SIOGIFHWADDR ioctl that it uses can only return 14 bytes). IPoIB has a 20 byte HW address; the four (or six?) bytes that get cut off are the low-order bytes of the port GID, which is probably where the difference between port GIDs is. To see the real IB hardware address, you need to do something like "ip addr show dev ib0". For example, # ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) # ip addr show dev ib0 5: ib0: mtu 2044 qdisc noop qlen 128 link/[32] 00:00:04:04:fe:80:00:00:00:00:00:00:00:02:c9:01:07:8c:e4:61 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff 5. Use ip neigh show dev ib0 to display ARP table for IB interface ib0 From roland at topspin.com Mon Dec 6 09:13:05 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 06 Dec 2004 09:13:05 -0800 Subject: [openib-general] Latest IPoIB FAQ In-Reply-To: <1102351276.4197.1196.camel@localhost.localdomain> (Hal Rosenstock's message of "Mon, 06 Dec 2004 11:41:16 -0500") References: <1102351276.4197.1196.camel@localhost.localdomain> Message-ID: <52u0qze4f2.fsf@topspin.com> This looks good. Matt, it might be worth putting this on the web site, and longer term I think this is yet another reason to set up some sort of Wiki. - R. From iod00d at hp.com Mon Dec 6 09:39:54 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 6 Dec 2004 09:39:54 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <52zn0ug5pk.fsf@topspin.com> References: <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> Message-ID: <20041206173954.GF26198@esmail.cup.hp.com> On Fri, Dec 03, 2004 at 06:25:27PM -0800, Roland Dreier wrote: > Roland> Hmm, I guess that wasn't the problem (or I didn't fix it > Roland> properly). > > The latter (fix was bogus, I made a cut-and-paste error). Here's a > better version: Yes - that works. Please commit. I was still trying to sort out how set_ci when I gave up on friday. You explanation helped though I'm familiar with consumer/producer (tg3 has a similar construct) indexes. 
> Index: infiniband/hw/mthca/mthca_eq.c > =================================================================== > --- infiniband/hw/mthca/mthca_eq.c (revision 1310) > +++ infiniband/hw/mthca/mthca_eq.c (working copy) > @@ -219,11 +219,14 @@ > struct mthca_eqe *eqe; > int disarm_cqn; > int work = 0; > + int set_ci = 0; > > while (1) { > if (!next_eqe_sw(eq)) > break; > > + set_ci = 0; > + > eqe = get_eqe(eq, eq->cons_index); > work = 1; > > @@ -274,6 +277,13 @@ > be16_to_cpu(eqe->event.cmd.token), > eqe->event.cmd.status, > be64_to_cpu(eqe->event.cmd.out_param)); > + /* > + * Need to set the CI inside the loop for > + * command completion events, because this > + * event allows another command to be posted > + * and we may overflow the EQ. > + */ The comment inside "case MTHCA_EVENT_TYPE_CMD" now makes sense and it didn't make sense to me on friday...perhaps showing it was indeed time for a break. > + set_ci = 1; > break; > > case MTHCA_EVENT_TYPE_PORT_CHANGE: > @@ -296,9 +306,14 @@ > > set_eqe_hw(eq, eq->cons_index); > eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); This line of code assumes eq->nent is a power of 2. I didn't see anything in mthca_create_eq() that enforces the assumption that nent is a power of 2. It mthca_create_eq() isn't in the perf path, I think it would be good to enforcement the requirement or at least check for it (e.g. WARN_ON). I just don't want to make the performance patch longer. > + > + if (set_ci) { > + wmb(); this wmb() isn't necessary since set_eq_ci() calls mthca_write64() and the latter *must* enforce wmb() for the MMIO write. I'd also like to see doorbell[2] go away and just pass the two u32 values to mthca_write64(). Perhaps combine them explicitly instead depending on load/store model of the stack. While it seems to work, I'm very nervous about this dependency and wonder if it would even work on parisc-linux (ok, I'm the only who cares :^) which has an upward growing stack. > + set_eq_ci(dev, eq->eqn, eq->cons_index); > + } > } > > - if (work) { > + if (work && !set_ci) { > wmb(); ditto. > set_eq_ci(dev, eq->eqn, eq->cons_index); > } thanks, grant From roland at topspin.com Mon Dec 6 10:59:51 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 06 Dec 2004 10:59:51 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041206173954.GF26198@esmail.cup.hp.com> (Grant Grundler's message of "Mon, 6 Dec 2004 09:39:54 -0800") References: <20041203185638.GA16001@esmail.cup.hp.com> <527jnzi4xr.fsf@topspin.com> <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> Message-ID: <52pt1ndzh4.fsf@topspin.com> Grant> Yes - that works. Please commit. I rewrote things in a way that seems cleaner to me -- what I actually committed is below. Please try one more time and make sure this still fixes the problem. Grant> It mthca_create_eq() isn't in the perf path, I think it Grant> would be good to enforcement the requirement or at least Grant> check for it (e.g. WARN_ON). I just don't want to make the Grant> performance patch longer. Good idea, I'll do this. Grant> this wmb() isn't necessary since set_eq_ci() calls Grant> mthca_write64() and the latter *must* enforce wmb() for the Grant> MMIO write. I'm not sure about that... 
mthca_write64() turns into __raw_writeq, which may not be ordered. Even if it were writeq(), the barrier is after the write (eg on ppc64, the "sync" if after the store operation). We want to make sure that the updates to the ownership bits in the EQ in host memory are complete before doing the MMIO write to update the HCA's consumer pointer; this avoids the (pretty much impossible) race where the HCA writes to the EQ entry and then our ownership update happens later and overwrites the HCA's value. Grant> I'd also like to see doorbell[2] go away and just pass the Grant> two u32 values to mthca_write64(). Perhaps combine them Grant> explicitly instead depending on load/store model of the Grant> stack. That makes sense. I'll try to come up with an API that avoids shifts on a 64-bit arch... Thanks, Roland Index: infiniband/hw/mthca/mthca_cmd.c =================================================================== --- infiniband/hw/mthca/mthca_cmd.c (revision 1310) +++ infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -293,6 +293,12 @@ complete(&context->done); } +void mthca_cmd_complete(struct mthca_dev *dev, int ncomp) +{ + while (ncomp--) + up(&dev->cmd.event_sem); +} + static void event_timeout(unsigned long context_ptr) { struct mthca_cmd_context *context = @@ -357,7 +363,6 @@ dev->cmd.free_head = context - dev->cmd.context; spin_unlock(&dev->cmd.context_lock); - up(&dev->cmd.event_sem); return err; } Index: infiniband/hw/mthca/mthca_eq.c =================================================================== --- infiniband/hw/mthca/mthca_eq.c (revision 1310) +++ infiniband/hw/mthca/mthca_eq.c (working copy) @@ -219,6 +219,7 @@ struct mthca_eqe *eqe; int disarm_cqn; int work = 0; + int ncmd = 0; while (1) { if (!next_eqe_sw(eq)) @@ -274,6 +275,7 @@ be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, be64_to_cpu(eqe->event.cmd.out_param)); + ++ncmd; break; case MTHCA_EVENT_TYPE_PORT_CHANGE: @@ -303,6 +305,9 @@ set_eq_ci(dev, eq->eqn, eq->cons_index); } + if (ncmd) + mthca_cmd_complete(dev, ncmd); + eq_req_not(dev, eq->eqn); } Index: infiniband/hw/mthca/mthca_cmd.h =================================================================== --- infiniband/hw/mthca/mthca_cmd.h (revision 1310) +++ infiniband/hw/mthca/mthca_cmd.h (working copy) @@ -203,10 +203,9 @@ int mthca_cmd_use_events(struct mthca_dev *dev); void mthca_cmd_use_polling(struct mthca_dev *dev); -void mthca_cmd_event(struct mthca_dev *dev, - u16 token, - u8 status, - u64 out_param); +void mthca_cmd_event(struct mthca_dev *dev, u16 token, + u8 status, u64 out_param); +void mthca_cmd_complete(struct mthca_dev *dev, int ncomp); int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); From iod00d at hp.com Mon Dec 6 12:16:12 2004 From: iod00d at hp.com (Grant Grundler) Date: Mon, 6 Dec 2004 12:16:12 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <52pt1ndzh4.fsf@topspin.com> References: <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> Message-ID: <20041206201612.GH26198@esmail.cup.hp.com> On Mon, Dec 06, 2004 at 10:59:51AM -0800, Roland Dreier wrote: > Grant> Yes - that works. Please commit. 
> > I rewrote things in a way that seems cleaner to me -- what I actually > committed is below. Please try one more time and make sure this still > fixes the problem. will do in a bit. ... > Grant> this wmb() isn't necessary since set_eq_ci() calls > Grant> mthca_write64() and the latter *must* enforce wmb() for the > Grant> MMIO write. > > I'm not sure about that... mthca_write64() turns into __raw_writeq, > which may not be ordered. Even if it were writeq(), the barrier is > after the write (eg on ppc64, the "sync" if after the store > operation). We want to make sure that the updates to the ownership > bits in the EQ in host memory are complete before doing the MMIO write > to update the HCA's consumer pointer; this avoids the (pretty much > impossible) race where the HCA writes to the EQ entry and then our > ownership update happens later and overwrites the HCA's value. Yeah - ia64 has slight differences in that the "release" semantics used for writeX() (and wmb()) force prior write mem ops to complete before this one does. I had forgotten exactly what the linux API semantics where and re-read Documentation/DocBook/deviceiobook.tmpl. It doesn't seem to address this issue unfortunately. The best description I have of memory ordering is still "IA64 Linux Kernel" by David Mosberger and Stephane Eranian. And that supports exactly what you say above. > Grant> I'd also like to see doorbell[2] go away and just pass the > Grant> two u32 values to mthca_write64(). Perhaps combine them > Grant> explicitly instead depending on load/store model of the > Grant> stack. > > That makes sense. I'll try to come up with an API that avoids shifts > on a 64-bit arch... I'm not sure there is one. And it still might be better to use the shift op. This feels like one of those perf vs portability issue where someone has to decide if it's worth trading off some perf for clearer code. Thinking about it more, I just realized the two 32-bit stores are to a cacheline already resident in L1. Ditto for the successive load to recover the 64-bit value. The mem ops are nearly "free" since it's to a resident, private cacheline. But completely avoiding those stores/loads would be good too and I expect the resulting code will be better to read/understand. And given the use, I don't think the shift unit will limit parallelism on ia64 or matter on other arches. But I might be wrong on that since it's just a guess. thanks, grant From halr at voltaire.com Mon Dec 6 12:25:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 06 Dec 2004 15:25:06 -0500 Subject: [openib-general] Latest IPoIB FAQ In-Reply-To: <52u0qze4f2.fsf@topspin.com> References: <1102351276.4197.1196.camel@localhost.localdomain> <52u0qze4f2.fsf@topspin.com> Message-ID: <1102364706.4197.1220.camel@localhost.localdomain> On Mon, 2004-12-06 at 12:13, Roland Dreier wrote: > This looks good. > > Matt, it might be worth putting this on the web site, and longer term > I think this is yet another reason to set up some sort of Wiki. I would like to confirm the PCIe firmware rev "requirement" in this FAQ. I believe an official release is shortly coming. 
-- Hal From mlleinin at hpcn.ca.sandia.gov Mon Dec 6 21:31:51 2004 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 06 Dec 2004 21:31:51 -0800 Subject: [openib-general] Latest IPoIB FAQ In-Reply-To: <52u0qze4f2.fsf@topspin.com> References: <1102351276.4197.1196.camel@localhost.localdomain> <52u0qze4f2.fsf@topspin.com> Message-ID: <1102397511.4239.323.camel@trinity> I'll post the updated IPoIB FAQ on our webpages. We started looking at Wiki's and then got sidetracked. Is there a preferred wiki? Thanks, - Matt On Mon, 2004-12-06 at 09:13 -0800, Roland Dreier wrote: > This looks good. > > Matt, it might be worth putting this on the web site, and longer term > I think this is yet another reason to set up some sort of Wiki. > > - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From roland at topspin.com Tue Dec 7 05:25:29 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 07 Dec 2004 05:25:29 -0800 Subject: [openib-general] Latest IPoIB FAQ In-Reply-To: <1102397511.4239.323.camel@trinity> (Matt Leininger's message of "Mon, 06 Dec 2004 21:31:51 -0800") References: <1102351276.4197.1196.camel@localhost.localdomain> <52u0qze4f2.fsf@topspin.com> <1102397511.4239.323.camel@trinity> Message-ID: <52r7m2ckae.fsf@topspin.com> Matt> We started looking at Wiki's and then got sidetracked. Is Matt> there a preferred wiki? I don't have any strong opinions -- probably whatever is easiest to set up will work fine for us. I seem to recall that Zwiki is a Zope-based Wiki that might work well with Plone. (ubuntulinux.org seems to be running Plone and Zwiki for their site) - R. From robert.j.woodruff at intel.com Tue Dec 7 11:33:34 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 7 Dec 2004 11:33:34 -0800 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FB6A82@orsmsx408> >I'm following the FAQ pretty closely and IPoIB is still not working for >me. I'm using the latest from SVN (as of tonight). These are PCIe HCAs >with 4.5.3 firmware (x86_64 MST is horribly broken BTW --can't wait for >tvflash). Hardware/cabling is fine since things work under VAPI 3.2. >All modules are loaded, and syslog doesn't show anything. It seems now >(new with 4.5.3 FW) that packets are flowing one direction but not the >other. Here's as much debug info as I could dig up. Ok here is what I found. I installed the openib.org code on my EM64T (x86_64) systems that have PCI-E HCAs. We have a 8 node Mellanox switch. On one node (a 32-bit Xeon node), we are running opensm from the SF project. We have one 32-bit system (Sean's) and 2 EM64T systems connected to the switch that are running the openib.org code. When Sean's 32-bit system loads ipoib and configures the interface, everything seems to work fine. However, when I load ipoib on the 64-bit x86_64 node, the opensm starts complaining that osm_mcr_rcv_join_mgrp ERR IB10 Provided Join State != FullMember so it looks like the x86_64 system is not able to join the multicast group. I suspect some issue with structure definitions, but have not debugged the issue any further. Any ideas ? 
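A cheap way to test the structure-definition theory above is to print sizeof and offsetof for the wire-format record on a 32-bit and a 64-bit build and diff the output. The layout below is an invented stand-in, not the real MCMemberRecord, and the gen2 code actually packs MADs with ib_pack() field tables (see the sa_query.c patch later in this thread), so this only illustrates the technique.

/*
 * Stand-in layout (not the real MCMemberRecord): print size and field
 * offsets, build once with -m32 and once with -m64, and diff the output.
 * Any difference means the wire format depends on the word size.
 */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct example_rec {
	uint8_t  mgid[16];
	uint8_t  port_gid[16];
	uint32_t qkey;
	uint16_t mlid;
	uint8_t  mtu;
	uint8_t  scope_state;
} __attribute__((packed));

#define SHOW(f) printf("%-12s offset %2zu\n", #f, offsetof(struct example_rec, f))

int main(void)
{
	printf("sizeof(example_rec) = %zu\n", sizeof(struct example_rec));
	SHOW(mgid);
	SHOW(port_gid);
	SHOW(qkey);
	SHOW(mlid);
	SHOW(mtu);
	SHOW(scope_state);
	return 0;
}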
From halr at voltaire.com Tue Dec 7 11:59:17 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 14:59:17 -0500 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB6A82@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0002FB6A82@orsmsx408> Message-ID: <1102449557.4141.59.camel@localhost.localdomain> On Tue, 2004-12-07 at 14:33, Woodruff, Robert J wrote: > > >I'm following the FAQ pretty closely and IPoIB is still not working for > >me. I'm using the latest from SVN (as of tonight). These are PCIe > HCAs > >with 4.5.3 firmware (x86_64 MST is horribly broken BTW --can't wait for > >tvflash). Hardware/cabling is fine since things work under VAPI 3.2. > >All modules are loaded, and syslog doesn't show anything. It seems now > >(new with 4.5.3 FW) that packets are flowing one direction but not the > >other. Here's as much debug info as I could dig up. > > Ok here is what I found. I installed the openib.org code on my > EM64T (x86_64) systems that have PCI-E HCAs. We have a 8 node > Mellanox switch. On one node (a 32-bit Xeon node), > we are running opensm from the SF project. > We have one 32-bit system (Sean's) and 2 EM64T systems connected to > the switch that are running the openib.org code. > When Sean's 32-bit system loads ipoib and configures the > interface, everything seems to work fine. However, when I load ipoib > on the 64-bit x86_64 node, the opensm starts complaining that > > osm_mcr_rcv_join_mgrp ERR IB10 Provided Join State != FullMember > > so it looks like the x86_64 system is not able to join the multicast > group. > I suspect some issue with structure definitions, but have not debugged > the > issue any further. I run on x86_64 (Opteron) too with a different SM and I do not see that issue. While there is code in OpenIB IPoIB to perform send only joins, it currently joins as full member (the send only join state is commented out to be full member right now due to some SMs lacking this support (which BTW is required if it does support multicast) so I can't explain what OpenSM indicates. Maybe OpenSM is putting out the wrong message. There are two types of "joins" done by OpenIB: 1. with components sufficient to create the multicast group, and 2. with components sufficient to join an already created group. The second style has been done for a long time and is similar to those being used by other stacks. The first is relatively new and OpenSM may not like it or already have this group preconfigured and return an error due to different characteristics or something like that. Is it correct to presume you do not have an IB analyzer to capture the SA packets on the link to/from the x86_64 system ? -- Hal From halr at voltaire.com Tue Dec 7 12:11:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 15:11:11 -0500 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB6A82@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0002FB6A82@orsmsx408> Message-ID: <1102450271.4141.77.camel@localhost.localdomain> On Tue, 2004-12-07 at 14:33, Woodruff, Robert J wrote: > Ok here is what I found. I installed the openib.org code on my > EM64T (x86_64) systems that have PCI-E HCAs. We have a 8 node > Mellanox switch. On one node (a 32-bit Xeon node), > we are running opensm from the SF project. > We have one 32-bit system (Sean's) and 2 EM64T systems connected to > the switch that are running the openib.org code. 
> When Sean's 32-bit system loads ipoib and configures the > interface, everything seems to work fine. However, when I load ipoib > on the 64-bit x86_64 node, the opensm starts complaining that > > osm_mcr_rcv_join_mgrp ERR IB10 Provided Join State != FullMember > > so it looks like the x86_64 system is not able to join the multicast > group. > I suspect some issue with structure definitions, but have not debugged > the > issue any further. Are the x86_64 machines PCIe or PCIX ? (Mine is PCIX). -- Hal From halr at voltaire.com Tue Dec 7 12:26:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 15:26:24 -0500 Subject: [openib-general] IPoIB FAQ Message-ID: <1102451183.4141.94.camel@localhost.localdomain> is now checked into the tree as https://openib.org/svn/gen2/trunk/src/linux-kernel/docs/ipoib_faq.txt From robert.j.woodruff at intel.com Tue Dec 7 12:26:38 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 7 Dec 2004 12:26:38 -0800 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FB6B57@orsmsx408> >Is it correct to presume you do not have an IB analyzer to capture the >SA packets on the link to/from the x86_64 system ? >-- Hal No, unfortunately we do not have a 4x IB analyzer. Note that the same openib.org code running on a 32-bit system does not seem to invoke the complaints from opensm. I looked at the opensm code and it is looking for a value of 1 in the join_state. I looked at the ipoib_multicast.c code and the join routine appears to set the value to 1. Somehow the value is lost in transit. Perhaps there is some 32/64 bit issue with the data structures, etc. that allows it to work on a 32 bit platform and fail on x86_64. As I said, I did not debug it any further. It does however not look like the issue is with dropping multicast packets, which was the firmware issue we saw with the PCI-E cards. Rather, it looks like x86_64 systems are having issues communicating with a 32-bit opensm. From robert.j.woodruff at intel.com Tue Dec 7 12:28:39 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 7 Dec 2004 12:28:39 -0800 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FB6B5F@orsmsx408> >Are the x86_64 machines PCIe or PCIX ? (Mine is PCIX). >-- Hal PCIe From halr at voltaire.com Tue Dec 7 12:33:48 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 15:33:48 -0500 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB6B57@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0002FB6B57@orsmsx408> Message-ID: <1102451628.4141.102.camel@localhost.localdomain> On Tue, 2004-12-07 at 15:26, Woodruff, Robert J wrote: > No, unfortunately we do not have a 4x IB analyzer. Do you have 4x to 1x cables ? Then you could use a 1x analyzer for this. > Note that the same openib.org code running > on a 32-bit system does not seem to invoke the complaints from opensm. Yes, but is that SA client local to the node with the OpenSM ? I'm not sure whether this makes a difference or not as I am unaware of how this local communication (SA client -> OpenSM) is accomplished. 
-- Hal From halr at voltaire.com Tue Dec 7 12:35:00 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 15:35:00 -0500 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB6B5F@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0002FB6B5F@orsmsx408> Message-ID: <1102451700.4141.105.camel@localhost.localdomain> On Tue, 2004-12-07 at 15:28, Woodruff, Robert J wrote: > > >Are the x86_64 machines PCIe or PCIX ? (Mine is PCIX). > > >-- Hal > > PCIe That's the difference and that's what makes me think this too is a firmware problem. -- Hal From robert.j.woodruff at intel.com Tue Dec 7 12:51:35 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 7 Dec 2004 12:51:35 -0800 Subject: [openib-general] IPoIB still not working [was IPoIB FAQ Update] Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FB6BC0@orsmsx408> > Do you have 4x to 1x cables ? Then you could use a 1x analyzer for this. Use to have one, but I think it was lost. The 1x analyzer we have is very old and out of date, not sure that it is worth trying to get running. >Yes, but is that SA client local to the node with the OpenSM ? I'm not >sure whether this makes a difference or not as I am unaware of how this >local communication (SA client -> OpenSM) is accomplished. We have one dedicated 32-bit node running the opensm from sourceforge. We have a separate 32-bit node that Sean uses for development, so no the sa_client is not on the same node as the SM. His node does not seem to invoke the errors from opensm, but the x86_86 node does. I'll poke around in the code this afternoon to see if I can get any additional information. woody From roland at topspin.com Tue Dec 7 13:08:21 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 07 Dec 2004 13:08:21 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB6BC0@orsmsx408> (Robert J. Woodruff's message of "Tue, 7 Dec 2004 12:51:35 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0002FB6BC0@orsmsx408> Message-ID: <52653dddfe.fsf@topspin.com> Please try applying this patch (which will dump the SA data portion of MCMember MADs). If you could send the output from the 64-bit and 32-bit systems that may help us figure out where the bug is. Thanks, Roland Index: infiniband/core/sa_query.c =================================================================== --- infiniband/core/sa_query.c (revision 1314) +++ infiniband/core/sa_query.c (working copy) @@ -662,6 +662,18 @@ ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), rec, query->sa_query.mad->data); + { + int i; + + for (i = 0; i < 104; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i); + printk(" %02x", query->sa_query.mad->data[i]); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); if (ret) { From robert.j.woodruff at intel.com Tue Dec 7 17:11:37 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 7 Dec 2004 17:11:37 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> Here are some log files. First file, mcast-64.log is the /var/log/messages output from the patch you sent on the 64-bit system. Next log files is the opensm log file osm-64bit.log Next log file is the opensm log file when running the 32-node. osm-32-bit.log In the passing case, ipoib sends 2 MCM messages and opensm has no complaints. 
Search for MCMember Record in osm-32-bit.log In the failing case, ipoib sends 2 MCM messages that look similar with no errors reported. However, in the failing case ipoib continues to send MCM messages that opensm rejects. In the failing case there are a couple of differences: first, the MGID lower 32-bits appear to be 0xffffffff in the passing case and something else when it fails. Second, it appears that perhaps the opensm is rejecting the messages because of a bug where the scope and join fields are reversed when extracted from the mad. In the passing case, since the lower 32 bits of the mgid are 0xffffffff, you never get to the code that checks the join member. Someone who understands opensm should look at this, but Sean, I think it may be wrong. This however does not explain why in the failing case, ipoib continues to try to join the mcast group unless it is having difficulties after trying to join the group and decides to re-try, with the subsequent re-tries to join being failed by opensm. -------------- next part -------------- A non-text attachment was scrubbed... Name: osm-32bit.log Type: application/octet-stream Size: 2781897 bytes Desc: osm-32bit.log URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: osm-64bit.log Type: application/octet-stream Size: 387066 bytes Desc: osm-64bit.log URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mcast-64.log Type: application/octet-stream Size: 50359 bytes Desc: mcast-64.log URL: From roland at topspin.com Tue Dec 7 18:21:42 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 07 Dec 2004 18:21:42 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> (Robert J. Woodruff's message of "Tue, 7 Dec 2004 17:11:37 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> Message-ID: <52mzwpbkcp.fsf@topspin.com> Robert> In the failing case, ipoib sends 2 MCM messages that look Robert> similar with no errors reported. However, in the failing Robert> case ipoib continues to send MCM messages that opensm Robert> rejects. In the failing case there are a couple of Robert> differences: first, the MGID lower 32-bits appear to be Robert> 0xffffffff in the passing case and something else when it Robert> fails. Second, it appears that perhaps the opensm is Robert> rejecting the messages because of a bug where the scope Robert> and join fields are reversed when extracted from the Robert> mad. In the passing case, since the lower 32 bits of the Robert> mgid are 0xffffffff, you never get to the code that Robert> checks the join member. Someone who understands opensm Robert> should look at this, but Sean, I think it may be wrong. I think the difference is not 32 bit vs. 64 bit but no IPv6 vs IPv6. It looks like your 32 bit hosts do not have IPv6 support turned on, so IPoIB only joins groups with MGIDs starting ff12:401b. The 64 bit host does have IPv6 and tries to join its solicited-node group (messages about ff12:601b:ffff:0:0:1:ffd2:58f1 in mcast-64.log) and the IPv6 all nodes group (messages about ff12:601b:ffff:0:0:0:0:1 in osm-64bit.log). Since no one has created this group yet, OpenSM looks at the join state field. As you say, there seems to be a bug in OpenSM in how it interprets "ScopeState" (JoinState is the low nibble, and OpenSM dumps the byte as 0x01, so it seems OpenSM is receiving a correct FullMember request).
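For reference, a minimal C sketch of the nibble layout being described -- scope in the high nibble of the MCMemberRecord "ScopeState" byte, JoinState in the low nibble. The helper names are illustrative, not OpenSM's actual code:

#include <stdint.h>

static inline uint8_t mcm_scope(uint8_t scope_state)
{
        return scope_state >> 4;        /* high nibble: scope */
}

static inline uint8_t mcm_join_state(uint8_t scope_state)
{
        return scope_state & 0x0f;      /* low nibble: JoinState */
}

A dumped value of 0x01 therefore decodes to scope 0 and JoinState 0x1 (the FullMember bit); an implementation that swaps the two nibbles would read JoinState as 0 and reject the create.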
The joins of the IPv4 broadcast group (ff12:401b:ffff:0:0:0:ffff:ffff) and IPv4 all nodes group (ff12:401b:ffff:0:0:0:0:1) succeed because presumably OpenSM has already created these groups. Robert> This however does not explain why in the failing case, Robert> ipoib continues to try to join the mcast group unless it Robert> is having difficulties after trying yo join he group and Robert> decides to re-try, with the subsequent re-tries to join Robert> being failed by opensm. IPoIB is dumb -- when it fails to join a multicast group, it just keeps trying. - R. From halr at voltaire.com Tue Dec 7 18:43:51 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 21:43:51 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <52mzwpbkcp.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> <52mzwpbkcp.fsf@topspin.com> Message-ID: <1102473830.4141.111.camel@localhost.localdomain> On Tue, 2004-12-07 at 21:21, Roland Dreier wrote: > The 64 bit host does have IPv6 and tries to join its solicited-node group > (messages about ff12:601b:ffff:0:0:1:ffd2:58f1 in mcast-64.log) and > the IPv6 all nodes group (messages about ff12:601b:ffff:0:0:0:0:1 in > osm-64bit.log). Since no one has created this group yet, OpenSM looks > at the join state field. As you say, there seems to be a bug in > OpenSM in how it interprets "ScopeState" (JoinState is the low nibble, > and OpenSM dumps the byte as 0x01, so it seems OpenSM is receiving a > correct FullMember request). Isn't the join sufficient to create these IPv6 groups ? If so, that would mean a problem with OpenSM in not doing so. -- Hal From halr at voltaire.com Tue Dec 7 18:47:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 07 Dec 2004 21:47:21 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> Message-ID: <1102474041.4141.117.camel@localhost.localdomain> On Tue, 2004-12-07 at 20:11, Woodruff, Robert J wrote: So other than this, does IPoIB seem to be working ? I guess that's at least IPv4. It doesn't seem like IPv6 could be working properly. -- Hal From eitan at mellanox.co.il Tue Dec 7 22:39:06 2004 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 8 Dec 2004 08:39:06 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <506C3D7B14CDD411A52C00025558DED6047EEA70@mtlex01.yok.mtl.com> Forgive me for not following the entire thread. But I did take a look at the log files: The 64bit version have the following multicast activities: 1. Port 0x0002c9010ad258f1 joining MLID 0xC000 -> success. Note that MLID 0xC000 is predefined (IPoIB). MGID....................0xff12401bffff0000 : 0x00000000ffffffff PortGid.................0xfe80000000000000 : 0x0002c9010ad258f1 qkey....................0x0 Mlid....................0x0 ScopeState..............0x1 Rate....................0x0 Mtu.....................0x0 2. Port 0x0002c9010ad258f1 joining MLID 0xC000. (Again). MGID....................0xff12401bffff0000 : 0x00000000ffffffff PortGid.................0xfe80000000000000 : 0x0002c9010ad258f1 qkey....................0x1B0B0000 Mlid....................0xC000 ScopeState..............0x11 Rate....................0x3 Mtu.....................0x4 -> considered as an update to the scope state. 3. 
Request to join : MGID....................0xff12601bffff0000 : 0x0000000000000016 PortGid.................0xfe80000000000000 : 0x0002c9010ad258f1 qkey....................0x0 Mlid....................0x0 ScopeState..............0x1 Rate....................0x0 Mtu.....................0x0 Results with - ERR 1B10: Provided Join State != FullMember - required for create. You can not create a group if you are not a full member. 4. A sequence of requests arrive to create MGRPs with several MGIDs: MGID 0xff12601bffff0000:0x0000000000000002 MGID 0xff12601bffff0000:0x0000000000000016 MGID 0xff12601bffff0000:0x00000001ffd258f1 All fail due to the same join state issue. Inspecting the 32bit version: I see only one request to join Port 0x0002c90107fc5be1 joining MLID 0xC000 And it succeeds Hope this helps. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Wednesday, December 08, 2004 3:12 AM To: Roland Dreier Cc: openib-general at openib.org Subject: RE: [openib-general] IPoIB still not working Here are some log files. First file, mcast-64.log is the /var/log/messages output from the patch you sent on the 64-bit system. Next log files is the opensm log file osm-64bit.log Next log file is the opensm log file when running the 32-node. osm-32-bit.log In the passing case, ipoib sends 2 MCM messages and opensm has no complaints. Search for MCMember Record in osm-32-bit.log In the failing case, ipoib sends 2 MCM messages that look similar with no errors reported. However, in the failing case ipoib continues to send MCM messages that opensm rejects. In the failing case there are a couple of differences, first the MGID lower 32-bits appear to be 0xffffffff in the passing case and something else when it fails. Second, it appears that perhaps the opensm is rejecting the messages because of a bug where the scope and join fields are reversed when extracted from the mad. In the passing case, since the lower 32 bits of the mgid are 0xfffffffff, you never get to the code that checks the join member. Someone that understands opensm should look at this, but Sean I think it may be wrong. This however does not explain why in the failing case, ipoib continues to try to join the mcast group unless it is having difficulties after trying yo join he group and decides to re-try, with the subsequent re-tries to join being failed by opensm. -------------- next part -------------- An HTML attachment was scrubbed... URL: From roland at topspin.com Wed Dec 8 04:21:34 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 08 Dec 2004 04:21:34 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EEA70@mtlex01.yok.mtl.com> (Eitan Zahavi's message of "Wed, 8 Dec 2004 08:39:06 +0200") References: <506C3D7B14CDD411A52C00025558DED6047EEA70@mtlex01.yok.mtl.com> Message-ID: <52is7daskx.fsf@topspin.com> Eitan> Results with - ERR 1B10: Provided Join State != FullMember Eitan> - required for create. You can not create a group if you Eitan> are not a full member. Right. However, ScopeState is dumped as 0x1, which means bit 0 of JoinState (the FullMember bit) is in fact set, so OpenSM should create the group. 
Thanks, Roland From halr at voltaire.com Wed Dec 8 05:17:18 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 08 Dec 2004 08:17:18 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <506C3D7B14CDD411A52C00025558DED6047EEA70@mtlex01.yok.mtl.com> References: <506C3D7B14CDD411A52C00025558DED6047EEA70@mtlex01.yok.mtl.com> Message-ID: <1102511838.4141.762.camel@localhost.localdomain> Hi Eitan, The component masks are also significant relative to the below. See embedded comments. On Wed, 2004-12-08 at 01:39, Eitan Zahavi wrote: > Forgive me for not following the entire thread. > But I did take a look at the log files: > > The 64bit version have the following multicast activities: > 1. Port 0x0002c9010ad258f1 joining MLID 0xC000 -> success. > Note that MLID 0xC000 is predefined (IPoIB). > MGID....................0xff12401bffff0000 : > 0x00000000ffffffff > PortGid.................0xfe80000000000000 : > 0x0002c9010ad258f1 > qkey....................0x0 > Mlid....................0x0 > ScopeState..............0x1 > Rate....................0x0 > Mtu.....................0x0 This is a join rather than a create. It relies on the group preexisting or is rejected by the SA. > 2. Port 0x0002c9010ad258f1 joining MLID 0xC000. (Again). > MGID....................0xff12401bffff0000 : > 0x00000000ffffffff > PortGid.................0xfe80000000000000 : > 0x0002c9010ad258f1 > qkey....................0x1B0B0000 > Mlid....................0xC000 > ScopeState..............0x11 > Rate....................0x3 > Mtu.....................0x4 This is a request for group creation with sufficient components for this. > -> considered as an update to the scope state. This interpretation is non conformant with o15-0.2.1 (IBA 1.2) which obsoleted 015-0.1.2 (IBA 1.1). It is supposed to be rejected with ERR_REQ_INVALID. > 3. Request to join : > MGID....................0xff12601bffff0000 : > 0x0000000000000016 > PortGid.................0xfe80000000000000 : > 0x0002c9010ad258f1 > qkey....................0x0 > Mlid....................0x0 > ScopeState..............0x1 > Rate....................0x0 > Mtu.....................0x0 > Results with - ERR 1B10: Provided Join State != FullMember - required > for create. > You can not create a group if you are not a full member. JoinState bit 0 is Full Member. Wouldn't 0x1 have bit 0 on making this a full member join ? Anyhow, looking at the components, this appears to be a join rather than create requests (although the component mask is not shown). -- Hal From halr at voltaire.com Wed Dec 8 05:58:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 08 Dec 2004 08:58:36 -0500 Subject: [openib-general] GUID/EUI-64 Issue Message-ID: <1102514316.4141.843.camel@localhost.localdomain> Hi, Did we come to closure on how to handle the GUID/EUI-64 issue ? -- Hal On Thu, 2004-11-11 at 13:11, Roland Dreier wrote: > My only questions are: > > + eui[0] ^= 2; > > I remember some discussion about whether IBTA GUIDs are already > modified EUI-64 or not. Is this the correct transformation or should > we be doing something like "eui[0] |= 2;" (ie assume the universal bit > should always be set in our IPv6 address)? IBTA GUIDs are EUI-64. The only issue I recall was whether the polarity of the U/G bit was consistent with IEEE. This was updated at IBA 1.2. It now says "manufacturer assigns EUI-64 with global scope set. May also assign additional EUI-64 with local scope." 
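As a minimal sketch of the transformation under discussion -- the ^= 2 form from the quoted patch; whether XOR or OR of the universal/local bit is right is exactly the open question above:

#include <stdint.h>
#include <string.h>

/* Map a port GUID (treated as an EUI-64) to an IPv6 interface ID by
 * inverting the universal/local bit, as the quoted patch does. */
static void guid_to_ipv6_iid(const uint8_t guid[8], uint8_t iid[8])
{
        memcpy(iid, guid, 8);
        iid[0] ^= 2;    /* flip the U/L bit of the EUI-64 */
}

The alternative Roland raises would be iid[0] |= 2, i.e. assume the universal bit should always be set in the resulting IPv6 address regardless of what the GUID carries.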
From halr at voltaire.com Wed Dec 8 06:35:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 08 Dec 2004 09:35:28 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <52mzwpbkcp.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0002FB71E6@orsmsx408> <52mzwpbkcp.fsf@topspin.com> Message-ID: <1102516528.4129.17.camel@localhost.localdomain> On Tue, 2004-12-07 at 21:21, Roland Dreier wrote: > Robert> This however does not explain why in the failing case, > Robert> ipoib continues to try to join the mcast group unless it > Robert> is having difficulties after trying to join the group and > Robert> decides to re-try, with the subsequent re-tries to join > Robert> being failed by opensm. > > IPoIB is dumb -- when it fails to join a multicast group, it just > keeps trying. This issue has come up before. The choices seem to be retry forever or retry some number of times and then give up, unless someone can see a better "policy". The retry strategy might also depend on the status code returned from the SA, although the errors may not be sufficiently rich to drive different client behavior; it could be based on the join type and the status code. Some errors to certain joins might indicate that retries might never succeed no matter how many retries are attempted. -- Hal From halr at voltaire.com Wed Dec 8 06:52:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 08 Dec 2004 09:52:08 -0500 Subject: [openib-general] IPv6 All Router Multicast Group Message-ID: <1102517528.4129.34.camel@localhost.localdomain> Hi Roland, With IPv4 and IPv6, I see all the relevant groups attempted to be joined and then, if that fails, created (first component mask 0x10083 and then 0x130c7). The former gets status 0x0600 in the GetResp when the group does not already exist, which causes the retry with the group creation. I see one IPv6 group (all routers) (0xff12:601b:ffff:0:0:0:0:2) that is only attempted with component mask 0x10083 (join) and not 0x130c7 (create). Is this because an IP router would create this group and this node is only trying to join as it is not a router ? (If so, has the creation been tested) ? Thanks. -- Hal From halr at voltaire.com Wed Dec 8 07:59:54 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 08 Dec 2004 10:59:54 -0500 Subject: [openib-general] User MAD support for cancel MAD Message-ID: <1102521594.4129.42.camel@localhost.localdomain> Hi Roland, It doesn't look to me like there is a way to cancel a MAD from user space. Would this be an additional ioctl to support ? This is needed from an OpenSM perspective. Thanks. -- Hal From robert.j.woodruff at intel.com Wed Dec 8 08:04:46 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 8 Dec 2004 08:04:46 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FB746B@orsmsx408> Roland > I think the difference is not 32 bit vs. 64 bit but no IPv6 vs IPv6. Ok, I'll take a look at the opensm code and see if we can make a fix that will allow it to work in the IPv6 case. From mst at mellanox.co.il Wed Dec 8 08:22:22 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Dec 2004 18:22:22 +0200 Subject: [openib-general] User MAD support for cancel MAD In-Reply-To: <1102521594.4129.42.camel@localhost.localdomain> References: <1102521594.4129.42.camel@localhost.localdomain> Message-ID: <20041208162222.GA31925@mellanox.co.il> Hello! Quoting r.
Hal Rosenstock (halr at voltaire.com) "[openib-general] User MAD support for cancel MAD": > Hi Roland, > > It doesn't look to me like there is a way to cancel a MAD from user > space. Would this be an additional ioctl to support ? This is needed > from an OpenSM perspective. Thanks. > > -- Hal Are you talking about MADs that have been linked but not yet posted to the QP? Since it seems opensm has no way to tell this is the state of the MAD (as opposed to being already posted to the QP), why does is need a way to change it? mst From mshefty at ichips.intel.com Wed Dec 8 09:44:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 08 Dec 2004 09:44:46 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <52is7daskx.fsf@topspin.com> References: <506C3D7B14CDD411A52C00025558DED6047EEA70@mtlex01.yok.mtl.com> <52is7daskx.fsf@topspin.com> Message-ID: <41B73D8E.7050602@ichips.intel.com> Roland Dreier wrote: > Eitan> Results with - ERR 1B10: Provided Join State != FullMember > Eitan> - required for create. You can not create a group if you > Eitan> are not a full member. > > Right. However, ScopeState is dumped as 0x1, which means bit 0 of > JoinState (the FullMember bit) is in fact set, so OpenSM should create > the group. Here's the relevant code from opensm for extracting the join state and scope information: *p_scope = (uint8_t)(scope_state & 0x0f); tmp_scope_state = scope_state >> 4; *p_state = (uint8_t)(tmp_scope_state &0x0f); So, opensm has the join state and scope fields reversed within the byte. I guess at some point we'll need to go through opensm and verify that it's setting/extracting bit subfields properly. - Sean From mshefty at ichips.intel.com Wed Dec 8 14:57:20 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 08 Dec 2004 14:57:20 -0800 Subject: [openib-general] crash in mthca soon after loading drivers Message-ID: <41B786D0.50701@ichips.intel.com> I'm getting the following bug in mthca when loading the drivers (core, mad, and mthca). The system is attached to a fabric with opensm running on top of the Mellanox gold software stack. I hit this when running with the tip of openib. Any help would be, well, helpful. - Sean Dec 8 14:53:47 mshefty-linux2 kernel: kernel BUG at drivers/infiniband/hw/mthca/mthca_cmd.c:328! 
Dec 8 14:53:47 mshefty-linux2 kernel: invalid operand: 0000 [#1] Dec 8 14:53:47 mshefty-linux2 kernel: SMP Dec 8 14:53:47 mshefty-linux2 kernel: Modules linked in: ib_mthca ib_mad ib_core edd st sr_mod ide_cd cdrom thermal processor fan button battery ac e100 mii e1000 hw_random uhci_hcd usbcore evdev reiserfs aic7xxx sd_mod scsi_mod Dec 8 14:53:47 mshefty-linux2 kernel: CPU: 0 Dec 8 14:53:47 mshefty-linux2 kernel: EIP: 0060:[pg0+948359147/1069220864] Not tainted VLI Dec 8 14:53:47 mshefty-linux2 kernel: EIP: 0060:[] Not tainted VLI Dec 8 14:53:47 mshefty-linux2 kernel: EFLAGS: 00010286 (2.6.9) Dec 8 14:53:47 mshefty-linux2 kernel: EIP is at mthca_cmd_wait+0x19b/0x1b0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: eax: f6b245a0 ebx: f6b24584 ecx: f6b24584 edx: ffffffff Dec 8 14:53:47 mshefty-linux2 kernel: esi: 32b50000 edi: f6b24324 ebp: f6b245a0 esp: f29d7e34 Dec 8 14:53:47 mshefty-linux2 kernel: ds: 007b es: 007b ss: 0068 Dec 8 14:53:47 mshefty-linux2 kernel: Process ib_mad1 (pid: 9900, threadinfo=f29d6000 task=f744a710) Dec 8 14:53:47 mshefty-linux2 kernel: Stack: 00000024 32b50000 00000000 f6b24324 32b50000 00000000 0000ea60 f8cbb058 Dec 8 14:53:47 mshefty-linux2 kernel: f29d7e70 00000000 00000001 00000000 00000024 0000ea60 f29d7edb 32b50100 Dec 8 14:53:47 mshefty-linux2 kernel: 00000000 00000001 00000000 f2b50100 f2b50000 f8cbd265 32b50100 00000000 Dec 8 14:53:47 mshefty-linux2 kernel: Call Trace: Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+948359256/1069220864] mthca_cmd_box+0x58/0x90 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [] mthca_cmd_box+0x58/0x90 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+948367973/1069220864] mthca_MAD_IFC+0x85/0xf0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [] mthca_MAD_IFC+0x85/0xf0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [check_poison_obj+45/432] check_poison_obj+0x2d/0x1b0 Dec 8 14:53:47 mshefty-linux2 kernel: [] check_poison_obj+0x2d/0x1b0 Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+948399983/1069220864] mthca_process_mad+0xcf/0x1c0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [] mthca_process_mad+0xcf/0x1c0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+948399776/1069220864] mthca_process_mad+0x0/0x1c0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [] mthca_process_mad+0x0/0x1c0 [ib_mthca] Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+946047120/1069220864] ib_mad_recv_done_handler+0xd0/0x230 [ib_mad] Dec 8 14:53:47 mshefty-linux2 kernel: [] ib_mad_recv_done_handler+0xd0/0x230 [ib_mad] Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+946048724/1069220864] ib_mad_completion_handler+0x94/0xa0 [ib_mad] Dec 8 14:53:47 mshefty-linux2 kernel: [] ib_mad_completion_handler+0x94/0xa0 [ib_mad] Dec 8 14:53:47 mshefty-linux2 kernel: [remove_wait_queue+12/64] remove_wait_queue+0xc/0x40 Dec 8 14:53:47 mshefty-linux2 kernel: [] remove_wait_queue+0xc/0x40 Dec 8 14:53:47 mshefty-linux2 kernel: [worker_thread+424/560] worker_thread+0x1a8/0x230 Dec 8 14:53:47 mshefty-linux2 kernel: [] worker_thread+0x1a8/0x230 Dec 8 14:53:47 mshefty-linux2 kernel: [pg0+946048576/1069220864] ib_mad_completion_handler+0x0/0xa0 [ib_mad] Dec 8 14:53:47 mshefty-linux2 kernel: [] ib_mad_completion_handler+0x0/0xa0 [ib_mad] Dec 8 14:53:47 mshefty-linux2 kernel: [default_wake_function+0/16] default_wake_function+0x0/0x10 Dec 8 14:53:47 mshefty-linux2 kernel: [] default_wake_function+0x0/0x10 Dec 8 14:53:47 mshefty-linux2 kernel: [default_wake_function+0/16] default_wake_function+0x0/0x10 Dec 8 14:53:47 mshefty-linux2 kernel: [] default_wake_function+0x0/0x10 
Dec 8 14:53:47 mshefty-linux2 kernel: [worker_thread+0/560] worker_thread+0x0/0x230 Dec 8 14:53:47 mshefty-linux2 kernel: [] worker_thread+0x0/0x230 Dec 8 14:53:47 mshefty-linux2 kernel: [kthread+136/176] kthread+0x88/0xb0 Dec 8 14:53:47 mshefty-linux2 kernel: [] kthread+0x88/0xb0 Dec 8 14:53:47 mshefty-linux2 kernel: [kthread+0/176] kthread+0x0/0xb0 Dec 8 14:53:47 mshefty-linux2 kernel: [] kthread+0x0/0xb0 Dec 8 14:53:47 mshefty-linux2 kernel: [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10 Dec 8 14:53:47 mshefty-linux2 kernel: [] kernel_thread_helper+0x5/0x10 Dec 8 14:53:47 mshefty-linux2 kernel: Code: 14 d2 89 d0 c1 e0 09 29 d0 89 c2 c1 e2 12 01 d0 f7 d8 89 87 84 02 00 00 89 e8 e8 51 92 64 c7 89 da 83 c4 0c 89 d0 5b 5e 5f 5d c3 <0f> 0b 48 01 40 7b cc f8 e9 c7 fe ff ff 90 8d b4 26 00 00 00 00 From robert.j.woodruff at intel.com Wed Dec 8 15:35:47 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 8 Dec 2004 15:35:47 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F0002FF6BB5@orsmsx408> I found the problem with parsing of ScopeState. It appears that there was a bug in ib_types.h in the sourceforge code base. This was already fixed in the latest Mellanox code base. However, with that fixed, I still get the following error from opensm when ipoib tries to join the multicast group. This is the dump from the latest mellanox opensm. [1102546997:000053256][18007] -> osm_mcmr_rcv_join_mgrp: [ [1102546997:000053278][18007] -> osm_mcmr_rcv_join_mgrp: Dump of incomming record. [1102546997:000053307][18007] -> MCMember Record dump: MGID....................0xff12401bffff0000 : 0x0000000000000016 PortGid.................0xfe80000000000000 : 0x0002c9010ad25b91 qkey....................0x0 Mlid....................0x0 ScopeState..............0x1 Rate....................0x0 Mtu.....................0x0 [1102546997:000053334][18007] -> osm_physp_share_pkey: [ [1102546997:000053358][18007] -> osm_physp_share_pkey: ] [1102546997:000053389][18007] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7. [1102546997:000053413][18007] -> osm_sa_send_error: [ [1102546997:000053434][18007] -> osm_mad_pool_get: [ [1102546997:000053457][18007] -> osm_vendor_get: [ [1102546997:000053481][18007] -> osm_vendor_get: Allocated MAD 0x818a860, size = 256. _ From mshefty at ichips.intel.com Wed Dec 8 17:12:16 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 08 Dec 2004 17:12:16 -0800 Subject: [openib-general] crash in mthca soon after loading drivers In-Reply-To: <41B786D0.50701@ichips.intel.com> References: <41B786D0.50701@ichips.intel.com> Message-ID: <41B7A670.5070102@ichips.intel.com> Sean Hefty wrote: > I'm getting the following bug in mthca when loading the drivers (core, > mad, and mthca). The system is attached to a fabric with opensm running > on top of the Mellanox gold software stack. I hit this when running > with the tip of openib. Any help would be, well, helpful. > > - Sean > > > Dec 8 14:53:47 mshefty-linux2 kernel: kernel BUG at > drivers/infiniband/hw/mthca/mthca_cmd.c:328! I still need to spend more time investigating this, but looking at mthca_cmd_wait(): if (down_interruptible(&dev->cmd.event_sem)) return -EINTR; spin_lock(&dev->cmd.context_lock); BUG_ON(dev->cmd.free_head < 0); context = &dev->cmd.context[dev->cmd.free_head]; dev->cmd.free_head = context->next; spin_unlock(&dev->cmd.context_lock); ...snip... 
wait_for_completion(&context->done); ***** possible race here ***** ...snip... out: spin_lock(&dev->cmd.context_lock); context->next = dev->cmd.free_head; dev->cmd.free_head = context - dev->cmd.context; spin_unlock(&dev->cmd.context_lock); There appears to be a race here where event_sem can be incremented (in mthca_cmd_complete()), but free_head has not yet been updated. A second call to mthca_cmd_wait could then get the semaphore, but find the list empty, leading to the bug. In my case, max_cmd is set to 1. I need to verify if this is indeed what is happening, and if so what to do to fix it. - Sean From eitan at mellanox.co.il Wed Dec 8 22:33:04 2004 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 9 Dec 2004 08:33:04 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <506C3D7B14CDD411A52C00025558DED6047EEA87@mtlex01.yok.mtl.com> Hi Woody, ERR 1B11 means as it says: The expected component mask for the join is not sufficient. Quoting from your mail: >> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, >> component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7. Seems the component mask being sent is missing bits: 12,13 (starting from 0). These are the 2^12 = SL and 2^13 = FlowLabel. Please see the following spec quote: o15-0.2.2: If SA supports UD multicast, then SA shall create a multicast group if it receives a SubnAdmSet() method for a MCMemberRecord, with the MGID set to 0 and the MCMemberRecord:JoinState.FullMember bit set to 1. The required components in the MCMemberRecord for the group to be created are P_Key, Q_Key, SL, FlowLabel, TClass, JoinState and PortGID (see o15-0.1.3:) with the corresponding bits in the ComponentMask set. All other components may be wildcarded. This results in an implicit join for the port specified by PortGID. In osm_sa-mcmember_record.h you can find: #define REQUIRED_MC_CREATE_COMP_MASK (IB_MCR_COMPMASK_MGID | \ IB_MCR_COMPMASK_PORT_GID | \ IB_MCR_COMPMASK_JOIN_STATE | \ IB_MCR_COMPMASK_QKEY | \ IB_MCR_COMPMASK_TCLASS | \ IB_MCR_COMPMASK_PKEY | \ IB_MCR_COMPMASK_FLOW | \ IB_MCR_COMPMASK_SL) Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Thursday, December 09, 2004 1:36 AM To: Eitan Zahavi; Roland Dreier Cc: openib-general at openib.org Subject: RE: [openib-general] IPoIB still not working I found the problem with parsing of ScopeState. It appears that there was a bug in ib_types.h in the sourceforge code base. This was already fixed in the latest Mellanox code base. However, with that fixed, I still get the following error from opensm when ipoib tries to join the multicast group. This is the dump from the latest mellanox opensm. [1102546997:000053256][18007] -> osm_mcmr_rcv_join_mgrp: [ [1102546997:000053278][18007] -> osm_mcmr_rcv_join_mgrp: Dump of incomming record.
[1102546997:000053307][18007] -> MCMember Record dump: MGID....................0xff12401bffff0000 : 0x0000000000000016 PortGid.................0xfe80000000000000 : 0x0002c9010ad25b91 qkey....................0x0 Mlid....................0x0 ScopeState..............0x1 Rate....................0x0 Mtu.....................0x0 [1102546997:000053334][18007] -> osm_physp_share_pkey: [ [1102546997:000053358][18007] -> osm_physp_share_pkey: ] [1102546997:000053389][18007] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7. [1102546997:000053413][18007] -> osm_sa_send_error: [ [1102546997:000053434][18007] -> osm_mad_pool_get: [ [1102546997:000053457][18007] -> osm_vendor_get: [ [1102546997:000053481][18007] -> osm_vendor_get: Allocated MAD 0x818a860, size = 256. _ -------------- next part -------------- An HTML attachment was scrubbed... URL: From shaharf at voltaire.com Thu Dec 9 03:23:20 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 13:23:20 +0200 Subject: [openib-general] Missining Infiniband class fields Message-ID: Hi guys, I am implementing a user mode access library for the umad. I have encountered few issues: 1. I am using the sys Infiniband class to get the HCA/port attributes, and several fields are missing: For the ports: Physical port state (as in Portinfo) Port GUID (as in Nodeinfo to the port) For the CA: Node type (as in NodeInfo) If no one objects, I would like to add them to the class. If someone else prefers to do it, please tell me. 2. I need an interface for setting/clearing IS_SM bit. If there is such please let me know. If not, I would like to add an additional ioctl to set/clear the IS_SM bit. A ctl file in /dev/.../ports/.../ is anther option. Please tell me what do you think. 3. It would help me very much if I could get an async event on some changes, especially ports state changes. Lid/SM lid changes events will be nice too. I am not familiar with the HCA fw, so I don't know if it does trigger such events. The AnafaII fw does. Anyhow there are several ways to implement the event mechanism, and I would to hear what you think about it. I am still new to 2.6 mechanisms, and I don't know if there is a recommended method to do this. The methods I know are Signals, poll events, pseudo device blocking read, etc. If we are going to use device reads, I want to suggest that we implement circular events queue that each should contain at least the following: . If we use on /dev and /sys I would also use the name of the relevant sys|dev file as the event name, and have its new data as the event data. Such an event queue is managed for each fd that request that, and an event will be cleared once read, or upon queue overflow. I have used such mechanisms before and they were very useful. Again, if there is a standard way to do it, it is OK with me. The IS_SM issue is critical for OpenSM. The missing class fields are important because without them I will have to do local mads and it will considerably complicates the library. The events issue is less critical, but still such a mechanism may help to solve many problems, and to simplify many other mechanisms. Shahar -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shaharf at voltaire.com Thu Dec 9 03:44:08 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 13:44:08 +0200 Subject: [openib-general] Another user mode issue Message-ID: Hi all, Another issue I want to raise concerning the umad access library is some change indicators for the infiniband class objects - ca & ports. It would be very nice if we could have a modification version file for each just object. This will help the safe atomic reading and caching of ca/port attributes. The idea that when you want to read an object attributes you will do the following: 1. read the object version. 2. read all object fields you want 3. read the object version again 4. if the new one is the same as the old one, we are done 5. else the object was changed while reading, so go to step 2. If you have cached the object you can compare the version each time the object is accessed to check if re-read is required. The version should be incremented upon any modification. Of course this is not my idea - this is common method used in lock free programming, and is similar to LL/SC (load linked, store conditional) or C&S (compare and set) schemes. Just as a long shot, it will be very nice to replace the switch state change bit with this version based mechanism. Not only that it can be used to implement atomic read, it can be used to let several distributed SM's to safely scan the network. As a matter of fact it would also solve the dual master synchronization problem we currently have (happens on network merge - two masters sweeps the network, but one clears the change bit before the other see it and it may end such that the SM with the lesser priority waits for the other to take over, while the other don't see it al all...). This issue may be relevant for the distributed SM mentioned in the SOW. Shahar -------------- next part -------------- An HTML attachment was scrubbed... URL: From mucci at cs.utk.edu Thu Dec 9 05:24:43 2004 From: mucci at cs.utk.edu (Philip Mucci) Date: Thu, 09 Dec 2004 14:24:43 +0100 Subject: [openib-general] Should I use umad -or- osm Message-ID: <1102598683.3731.66.camel@muccislaptop.pdc.kth.se> Hi folks, I've been tasked with developing a rough performance tool for IB networks. I've scanned the documentation and looks like the kind of data we're interested in can be obtained from the Mellanox ASICs. My question is a simple one: I've got to send/recv mads to enable and obtain the performance counters from a user space tool...ideally non-root, but we'll work with what we have. My current inclination has been to use the osm_vendor_api.h functions to do this work. However, the late work here done by Hal on the UMAD access layer seems to be appropriate as well. Could someone elaborate on what you think the best (and most maintainable) approach to accomplishing this task might be? Regards, Philip From shaharf at voltaire.com Thu Dec 9 07:52:43 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 17:52:43 +0200 Subject: [openib-general] Should I use umad -or- osm Message-ID: Hi Philip, I am currently implementing umad access library. It much simpler then the osm_vendor api. I would recommend using umad library and not osm. As a matter of fact the current osm vendor layer does not support openib gen2. I am working on that either. Both the umad access library and the new osm vendor layer that uses it are not finished yet. I guess that I will need at least another week to reach a point where I can release it. 
Even then it will change a lot until I will be finished with it. The question is can you wait a little? I can give you a preliminary version - but if you will use it you will have to modify your code several each time the library interface is changed. On the other hand, I would like to understand exactly what you need, because you are the first "client" of the user mode stuff beside me. Shahar > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Philip Mucci > Sent: Thursday, December 09, 2004 3:25 PM > To: openib-general at openib.org > Subject: [openib-general] Should I use umad -or- osm > > Hi folks, > > I've been tasked with developing a rough performance tool for IB > networks. I've scanned the documentation and looks like the kind of data > we're interested in can be obtained from the Mellanox ASICs. > > My question is a simple one: > > I've got to send/recv mads to enable and obtain the performance counters > from a user space tool...ideally non-root, but we'll work with what we > have. > > My current inclination has been to use the osm_vendor_api.h functions to > do this work. However, the late work here done by Hal on the UMAD access > layer seems to be appropriate as well. > > Could someone elaborate on what you think the best (and most > maintainable) approach to accomplishing this task might be? > > Regards, > > Philip > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From mucci at cs.utk.edu Thu Dec 9 08:43:40 2004 From: mucci at cs.utk.edu (Philip Mucci) Date: Thu, 09 Dec 2004 17:43:40 +0100 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: References: Message-ID: <1102610620.3731.101.camel@muccislaptop.pdc.kth.se> Hi Shahar, Thanks for the info. My needs are very 'simple'. Just to send and receive MADS to the PM agent on each adapter and switch in the network. Ideally, I would like this to work for existing installations based on either OpenIB gen1 or Mellanox Gold. But I think for that, I need to use the current osm_vendor_api.h interface. How much will this interface change with gen2? will it export the same functions? Or will everything change Lastly, will I be able to send/recv these MADS as a non-root user? Thanks again, and the answer to your question is, yes, I can wait. Regards, Philip On Thu, 2004-12-09 at 17:52 +0200, shaharf wrote: > Hi Philip, > > I am currently implementing umad access library. It much simpler > then the osm_vendor api. I would recommend using umad library and not > osm. As a matter of fact the current osm vendor layer does not support > openib gen2. I am working on that either. Both the umad access library > and the new osm vendor layer that uses it are not finished yet. I guess > that I will need at least another week to reach a point where I can > release it. Even then it will change a lot until I will be finished with > it. > The question is can you wait a little? I can give you a preliminary > version - but if you will use it you will have to modify your code > several each time the library interface is changed. > On the other hand, I would like to understand exactly what you need, > because you are the first "client" of the user mode stuff beside me. 
> > Shahar > > > -----Original Message----- > > From: openib-general-bounces at openib.org [mailto:openib-general- > > bounces at openib.org] On Behalf Of Philip Mucci > > Sent: Thursday, December 09, 2004 3:25 PM > > To: openib-general at openib.org > > Subject: [openib-general] Should I use umad -or- osm > > > > Hi folks, > > > > I've been tasked with developing a rough performance tool for IB > > networks. I've scanned the documentation and looks like the kind of > data > > we're interested in can be obtained from the Mellanox ASICs. > > > > My question is a simple one: > > > > I've got to send/recv mads to enable and obtain the performance > counters > > from a user space tool...ideally non-root, but we'll work with what we > > have. > > > > My current inclination has been to use the osm_vendor_api.h functions > to > > do this work. However, the late work here done by Hal on the UMAD > access > > layer seems to be appropriate as well. > > > > Could someone elaborate on what you think the best (and most > > maintainable) approach to accomplishing this task might be? > > > > Regards, > > > > Philip > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib- > > general From mst at mellanox.co.il Thu Dec 9 08:52:39 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 18:52:39 +0200 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: <1102610620.3731.101.camel@muccislaptop.pdc.kth.se> References: <1102610620.3731.101.camel@muccislaptop.pdc.kth.se> Message-ID: <20041209165239.GA8493@mellanox.co.il> Quoting r. Philip Mucci (mucci at cs.utk.edu) "RE: [openib-general] Should I use umad -or- osm": > Hi Shahar, > > Thanks for the info. > > My needs are very 'simple'. Just to send and receive MADS to the PM > agent on each adapter and switch in the network. > > Ideally, I would like this to work for existing installations based on > either OpenIB gen1 or Mellanox Gold. But I think for that, I need to use > the current osm_vendor_api.h interface. > > How much will this interface change with gen2? will it export the same > functions? Or will everything change > > Lastly, will I be able to send/recv these MADS as a non-root user? > > Thanks again, and the answer to your question is, yes, I can wait. > > Regards, > > Philip > > > On Thu, 2004-12-09 at 17:52 +0200, shaharf wrote: > > Hi Philip, > > > > I am currently implementing umad access library. It much simpler > > then the osm_vendor api. I would recommend using umad library and not > > osm. As a matter of fact the current osm vendor layer does not support > > openib gen2. I am working on that either. Both the umad access library > > and the new osm vendor layer that uses it are not finished yet. I guess > > that I will need at least another week to reach a point where I can > > release it. Even then it will change a lot until I will be finished with > > it. > > The question is can you wait a little? I can give you a preliminary > > version - but if you will use it you will have to modify your code > > several each time the library interface is changed. > > On the other hand, I would like to understand exactly what you need, > > because you are the first "client" of the user mode stuff beside me. 
> > > > Shahar > > > > > -----Original Message----- > > > From: openib-general-bounces at openib.org [mailto:openib-general- > > > bounces at openib.org] On Behalf Of Philip Mucci > > > Sent: Thursday, December 09, 2004 3:25 PM > > > To: openib-general at openib.org > > > Subject: [openib-general] Should I use umad -or- osm > > > > > > Hi folks, > > > > > > I've been tasked with developing a rough performance tool for IB > > > networks. I've scanned the documentation and looks like the kind of > > > data > > > we're interested in can be obtained from the Mellanox ASICs. > > > > > > My question is a simple one: > > > > > > I've got to send/recv mads to enable and obtain the performance > > > counters > > > from a user space tool...ideally non-root, but we'll work with what we > > > have. > > > > > > My current inclination has been to use the osm_vendor_api.h functions > > > to > > > do this work. However, the late work here done by Hal on the UMAD > > > access > > > layer seems to be appropriate as well. > > > > > > Could someone elaborate on what you think the best (and most > > > maintainable) approach to accomplishing this task might be? > > > > > > Regards, > > > > > > Philip Its clearly not a great idea to let non-root inject arbitrary MADs into the system ,is it? MST From shaharf at voltaire.com Thu Dec 9 08:55:20 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 18:55:20 +0200 Subject: [openib-general] Should I use umad -or- osm Message-ID: Philip, I would say that there is not much in common between gen1 and gen2 user mode mad interface. If you want your tools to works above both, then osm vendor layer is your only choice. If you are planning to use gen2 stacks only at some point you can use libumad. If you need just a simple PM portcounter get query, then maybe osm vendor api is a bit too "fat" for you. I would consider using gen1 interface directly. But still, this is your choice. I guess that on the long run, openib gen2 will be the only maintained openib version. Anyone thinks differently? Shahar > From: Philip Mucci [mailto:mucci at cs.utk.edu] > Sent: Thursday, December 09, 2004 6:44 PM > To: shaharf > Cc: openib-general at openib.org > Subject: RE: [openib-general] Should I use umad -or- osm > > Hi Shahar, > > Thanks for the info. > > My needs are very 'simple'. Just to send and receive MADS to the PM > agent on each adapter and switch in the network. > > Ideally, I would like this to work for existing installations based on > either OpenIB gen1 or Mellanox Gold. But I think for that, I need to use > the current osm_vendor_api.h interface. > > How much will this interface change with gen2? will it export the same > functions? Or will everything change > > Lastly, will I be able to send/recv these MADS as a non-root user? > > Thanks again, and the answer to your question is, yes, I can wait. > > Regards, > > Philip > > > On Thu, 2004-12-09 at 17:52 +0200, shaharf wrote: > > Hi Philip, > > > > I am currently implementing umad access library. It much simpler > > then the osm_vendor api. I would recommend using umad library and not > > osm. As a matter of fact the current osm vendor layer does not support > > openib gen2. I am working on that either. Both the umad access library > > and the new osm vendor layer that uses it are not finished yet. I guess > > that I will need at least another week to reach a point where I can > > release it. Even then it will change a lot until I will be finished with > > it. > > The question is can you wait a little? 
I can give you a preliminary > > version - but if you will use it you will have to modify your code > > several each time the library interface is changed. > > On the other hand, I would like to understand exactly what you need, > > because you are the first "client" of the user mode stuff beside me. > > > > Shahar > > > > > -----Original Message----- > > > From: openib-general-bounces at openib.org [mailto:openib-general- > > > bounces at openib.org] On Behalf Of Philip Mucci > > > Sent: Thursday, December 09, 2004 3:25 PM > > > To: openib-general at openib.org > > > Subject: [openib-general] Should I use umad -or- osm > > > > > > Hi folks, > > > > > > I've been tasked with developing a rough performance tool for IB > > > networks. I've scanned the documentation and looks like the kind of > > data > > > we're interested in can be obtained from the Mellanox ASICs. > > > > > > My question is a simple one: > > > > > > I've got to send/recv mads to enable and obtain the performance > > counters > > > from a user space tool...ideally non-root, but we'll work with what we > > > have. > > > > > > My current inclination has been to use the osm_vendor_api.h functions > > to > > > do this work. However, the late work here done by Hal on the UMAD > > access > > > layer seems to be appropriate as well. > > > > > > Could someone elaborate on what you think the best (and most > > > maintainable) approach to accomplishing this task might be? > > > > > > Regards, > > > > > > Philip > > > > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib- > > > general From shaharf at voltaire.com Thu Dec 9 09:00:54 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 19:00:54 +0200 Subject: [openib-general] Should I use umad -or- osm Message-ID: > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > Sent: Thursday, December 09, 2004 6:53 PM > To: Philip Mucci > Cc: shaharf; openib-general at openib.org > Subject: Re: [openib-general] Should I use umad -or- osm > > Its clearly not a great idea to let non-root inject arbitrary MADs > into the system ,is it? > > MST I guess that in this stage only root will able to use user mode mads. Later I would consider letting non-root applications use some mads - meaning most of the get/query mads, and some of the set mads. I won't rely on root access for security. There are mkey, qkey and pkey to handle that. Shahar From mst at mellanox.co.il Thu Dec 9 09:03:42 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 19:03:42 +0200 Subject: [openib-general] Missining Infiniband class fields In-Reply-To: References: Message-ID: <20041209170342.GB8493@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "[openib-general] Missining Infiniband class fields": > 2. I need an interface for setting/clearing IS_SM bit. If there is such please > let me know. If not, I would like to add an additional ioctl to set/clear the > IS_SM bit. A ctl file in /dev/?/ports/?/ is anther option. Please tell me what > do you think. Please note that you want to clear this bit automatically if the application set it and then dies. One way to do this is to require that the user keeps some file open while he wants to be the SM, and clear it on file close. 
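A rough sketch of the lifetime rule Michael suggests -- tie IsSM to an open file and clear it in the release hook, so a crashed SM cannot leave the bit set. The set_port_is_sm() helper here is a stand-in, not an existing kernel interface; real code would update the port's IsSM capability through the driver:

#include <linux/fs.h>
#include <linux/module.h>

static int port_is_sm;  /* placeholder for the real PortInfo capability bit */

static void set_port_is_sm(int val)
{
        /* placeholder: the real code would modify the port capability mask */
        port_is_sm = val;
}

static int issm_open(struct inode *inode, struct file *filp)
{
        set_port_is_sm(1);              /* IsSM set while the file is held open */
        return 0;
}

static int issm_release(struct inode *inode, struct file *filp)
{
        set_port_is_sm(0);              /* runs even if the SM process dies */
        return 0;
}

static struct file_operations issm_fops = {
        .owner   = THIS_MODULE,
        .open    = issm_open,
        .release = issm_release,
};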
mst From shaharf at voltaire.com Thu Dec 9 09:04:20 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 19:04:20 +0200 Subject: [openib-general] Missining Infiniband class fields Message-ID: > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > Sent: Thursday, December 09, 2004 7:04 PM > To: shaharf > Cc: openib-general at openib.org > Subject: Re: [openib-general] Missining Infiniband class fields > > Hello! > Quoting r. shaharf (shaharf at voltaire.com) "[openib-general] Missining > Infiniband class fields": > > 2. I need an interface for setting/clearing IS_SM bit. If there is such > please > > let me know. If not, I would like to add an additional ioctl to > set/clear the > > IS_SM bit. A ctl file in /dev/?/ports/?/ is anther option. Please tell > me what > > do you think. > > Please note that you want to clear this bit automatically if the > application set it and then dies. > One way to do this is to require that the user keeps some file > open while he wants to be the SM, and clear it on file close. > > mst Indeed this plays in the favor of ioctl. Shahar From mst at mellanox.co.il Thu Dec 9 09:07:14 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 19:07:14 +0200 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: References: Message-ID: <20041209170714.GC8493@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Should I use umad -or- osm": > > > > -----Original Message----- > > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > > Sent: Thursday, December 09, 2004 6:53 PM > > To: Philip Mucci > > Cc: shaharf; openib-general at openib.org > > Subject: Re: [openib-general] Should I use umad -or- osm > > > > Its clearly not a great idea to let non-root inject arbitrary MADs > > into the system ,is it? > > > > MST > > I guess that in this stage only root will able to use user mode mads. > Later I would consider letting non-root applications use some mads - > meaning most of the get/query mads, and some of the set mads. I won't > rely on root access for security. There are mkey, qkey and pkey to > handle that. > > Shahar They are trivial to guess, so kernel would have to touch the MAD data somehow? Further, it seems local MADs have the check disabled now? MST From mst at mellanox.co.il Thu Dec 9 09:09:33 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 19:09:33 +0200 Subject: [openib-general] Missining Infiniband class fields In-Reply-To: References: Message-ID: <20041209170933.GD8493@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Missining Infiniband class fields": > > > -----Original Message----- > > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > > Sent: Thursday, December 09, 2004 7:04 PM > > To: shaharf > > Cc: openib-general at openib.org > > Subject: Re: [openib-general] Missining Infiniband class fields > > > > Hello! > > Quoting r. shaharf (shaharf at voltaire.com) "[openib-general] Missining > > Infiniband class fields": > > > 2. I need an interface for setting/clearing IS_SM bit. If there is > such > > please > > > let me know. If not, I would like to add an additional ioctl to > > set/clear the > > > IS_SM bit. A ctl file in /dev/?/ports/?/ is anther option. Please > tell > > me what > > > do you think. > > > > Please note that you want to clear this bit automatically if the > > application set it and then dies. 
> > One way to do this is to require that the user keeps some file > > open while he wants to be the SM, and clear it on file close. > > > > mst > > Indeed this plays in the favor of ioctl. > > Shahar ioctls have a disadvantage of being half-broken when used by a 32 bti app on the 64 bit OS. Maybe write 1 to some offset to set the SM bit? This is what /proc/bus/pci has for pci configuration access. MST From mucci at cs.utk.edu Thu Dec 9 09:15:57 2004 From: mucci at cs.utk.edu (Philip Mucci) Date: Thu, 09 Dec 2004 18:15:57 +0100 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: <20041209165239.GA8493@mellanox.co.il> References: <1102610620.3731.101.camel@muccislaptop.pdc.kth.se> <20041209165239.GA8493@mellanox.co.il> Message-ID: <1102612558.3731.105.camel@muccislaptop.pdc.kth.se> Arbitrary, no. Performance monitoring, yes. These performance counters are not necessarily different than the resource counters available on other interconnects, memory controllers or CPU's. But as I said, we can live with either if we have to. Phil > Its clearly not a great idea to let non-root inject arbitrary MADs > into the system ,is it? > > MST From shaharf at voltaire.com Thu Dec 9 09:22:59 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 19:22:59 +0200 Subject: [openib-general] Missining Infiniband class fields Message-ID: > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Michael S. Tsirkin > Sent: Thursday, December 09, 2004 7:10 PM > Cc: openib-general at openib.org > Subject: Re: [openib-general] Missining Infiniband class fields > > Please note that you want to clear this bit automatically if the > > > application set it and then dies. > > > One way to do this is to require that the user keeps some file > > > open while he wants to be the SM, and clear it on file close. > > > > > > mst > > > > Indeed this plays in the favor of ioctl. > > > > Shahar > > ioctls have a disadvantage of being half-broken when used > by a 32 bti app on the 64 bit OS. > Maybe write 1 to some offset to set the SM bit? > This is what /proc/bus/pci has for pci configuration access. > MST As it will not be the first IOCTL, another one would not do any harm not already done. Shahar From shaharf at voltaire.com Thu Dec 9 09:27:47 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 19:27:47 +0200 Subject: [openib-general] Should I use umad -or- osm Message-ID: > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Michael S. Tsirkin > Sent: Thursday, December 09, 2004 7:07 PM > To: openib-general at openib.org > Subject: Re: [openib-general] Should I use umad -or- osm > > > > I guess that in this stage only root will able to use user mode mads. > > Later I would consider letting non-root applications use some mads - > > meaning most of the get/query mads, and some of the set mads. I won't > > rely on root access for security. There are mkey, qkey and pkey to > > handle that. > > > > Shahar > > They are trivial to guess, so kernel would have to touch the MAD > data somehow? > Further, it seems local MADs have the check disabled now? > > MST > _______________________________________________ The Mkey should set according to the system policy. They can be non trivial. 64 bits (changing) keys may be relatively strong. Currently only trivial keys are used so we won't let non root users use mads. 
But this is very weak (NFS style) security. Anyone can have root access on his machine. Shahar From mst at mellanox.co.il Thu Dec 9 10:01:37 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 20:01:37 +0200 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: References: Message-ID: <20041209180137.GF8493@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Should I use umad -or- osm": > > > > -----Original Message----- > > From: openib-general-bounces at openib.org [mailto:openib-general- > > bounces at openib.org] On Behalf Of Michael S. Tsirkin > > Sent: Thursday, December 09, 2004 7:07 PM > > To: openib-general at openib.org > > Subject: Re: [openib-general] Should I use umad -or- osm > > > > > > I guess that in this stage only root will able to use user mode > > > mads. > > > Later I would consider letting non-root applications use some mads - > > > meaning most of the get/query mads, and some of the set mads. I > > > won't > > > rely on root access for security. There are mkey, qkey and pkey to > > > handle that. > > > > > > Shahar > > > > They are trivial to guess, so kernel would have to touch the MAD > > data somehow? > > Further, it seems local MADs have the check disabled now? > > > > MST > > _______________________________________________ > > The Mkey should set according to the system policy. They can be non > trivial. > 64 bits (changing) keys may be relatively strong. Depends on your definition of the "relatively" I guess. > Currently only trivial keys are used so we won't let non root users use > mads. Fine, we are in agreement then. > But this is very weak (NFS style) security. I'm afraid it wont be easy to get beyond that level of security. > Anyone can have root > access on his machine. 1. Why not on the switch then? 2. With "anyone can be root" assumption in mind, anyone can for example, do RDMA to a memory region that is enabled for remote write, since that is protected only by a 32 bit r_key? 3. etc. mst From shaharf at voltaire.com Thu Dec 9 10:16:28 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 9 Dec 2004 20:16:28 +0200 Subject: [openib-general] Should I use umad -or- osm Message-ID: > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Michael S. Tsirkin > Sent: Thursday, December 09, 2004 8:02 PM > Cc: openib-general at openib.org > Subject: Re: [openib-general] Should I use umad -or- osm > > > > > I guess that in this stage only root will able to use user mode > > > > mads. > > > > Later I would consider letting non-root applications use some mads - > > > > meaning most of the get/query mads, and some of the set mads. I > > > > won't > > > > rely on root access for security. There are mkey, qkey and pkey to > > > > handle that. > > > > > > > > Shahar > > > > > > They are trivial to guess, so kernel would have to touch the MAD > > > data somehow? > > > Further, it seems local MADs have the check disabled now? > > > > > > MST > > > _______________________________________________ > > > > The Mkey should set according to the system policy. They can be non > > trivial. > > 64 bits (changing) keys may be relatively strong. > > Depends on your definition of the "relatively" I guess. > > > Currently only trivial keys are used so we won't let non root users use > > mads. > > Fine, we are in agreement then. > > > But this is very weak (NFS style) security. 
> > I'm afraid it wont be easy to get beyond that level of security. > > > Anyone can have root > > access on his machine. > > 1. Why not on the switch then? > What do you mean? To be able to send/recv mads it is enough to have one host with HCA. No switch can block root user sending mads unless pkey or mkey mechanism is used. > 2. With "anyone can be root" assumption in mind, anyone can for example, > do RDMA to a memory region that is enabled for remote write, > since that is protected only by a 32 bit r_key? > > 3. etc. > This is a real problem. It is true that brute force attacks can break 32 bit keys quite easily, but in practice even breaking 32 bits keys takes some time. To handle these brute force attacks, I would expect the attacked target to bombard the SM with key violations traps. This should trigger SM action to block and neutralize the offending host, hopefully before the RDMA write succeeds. Anyhow, this is not worse then a regular Ethernet HCA that you attack with valid requests to valid ports. > mst Shahar From mst at mellanox.co.il Thu Dec 9 10:29:57 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 20:29:57 +0200 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: References: Message-ID: <20041209182957.GA8778@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Should I use umad -or- osm": > > > Anyone can have root > > > access on his machine. > > > > 1. Why not on the switch then? > > > > What do you mean? To be able to send/recv mads it is enough to have one > host with HCA. No switch can block root user sending mads unless pkey or > mkey mechanism is used. I mean that if a malicious user has control of a switch he can cause even more problems. > > 2. With "anyone can be root" assumption in mind, anyone can for > > example, > > do RDMA to a memory region that is enabled for remote write, > > since that is protected only by a 32 bit r_key? > > > > 3. etc. > > > This is a real problem. It is true that brute force attacks can break 32 > bit keys quite easily, but in practice even breaking 32 bits keys takes > some time. To handle these brute force attacks, I would expect the > attacked target to bombard the SM with key violations traps. This should > trigger SM action to block and neutralize the offending host, hopefully > before the RDMA write succeeds. > Anyhow, this is not worse then a regular Ethernet HCA that you attack > with valid requests to valid ports. Anyway, I'm just trying to say its not easy. mst From mshefty at ichips.intel.com Thu Dec 9 10:33:02 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 09 Dec 2004 10:33:02 -0800 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: <20041209182957.GA8778@mellanox.co.il> References: <20041209182957.GA8778@mellanox.co.il> Message-ID: <41B89A5E.3090104@ichips.intel.com> Michael S. Tsirkin wrote: > Hello! > Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Should I use umad -or- osm": > >>>>Anyone can have root >>>>access on his machine. >>> >>>1. Why not on the switch then? >>> >> >>What do you mean? To be able to send/recv mads it is enough to have one >>host with HCA. No switch can block root user sending mads unless pkey or >>mkey mechanism is used. > > > I mean that if a malicious user has control of a switch he can > cause even more problems. 
IMO, if you have a malicious user on the fabric, then you've already been compromised, and attacks on your IB network are probably not your greatest worries. - Sean From mshefty at ichips.intel.com Thu Dec 9 10:43:17 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 9 Dec 2004 10:43:17 -0800 Subject: [openib-general] [PATCH] fix race condition in mthca event code Message-ID: <20041209104317.66264ea2.mshefty@ichips.intel.com> This patch fixed my problem hitting the BUG_ON code in mthca_cmd, line 328. It moves releasing the semaphore to after freeing the event entry. - Sean Index: hw/mthca/mthca_cmd.c =================================================================== --- hw/mthca/mthca_cmd.c (revision 1316) +++ hw/mthca/mthca_cmd.c (working copy) @@ -293,12 +293,6 @@ complete(&context->done); } -void mthca_cmd_complete(struct mthca_dev *dev, int ncomp) -{ - while (ncomp--) - up(&dev->cmd.event_sem); -} - static void event_timeout(unsigned long context_ptr) { struct mthca_cmd_context *context = @@ -362,7 +356,7 @@ context->next = dev->cmd.free_head; dev->cmd.free_head = context - dev->cmd.context; spin_unlock(&dev->cmd.context_lock); - + up(&dev->cmd.event_sem); return err; } Index: hw/mthca/mthca_eq.c =================================================================== --- hw/mthca/mthca_eq.c (revision 1316) +++ hw/mthca/mthca_eq.c (working copy) @@ -219,7 +219,6 @@ struct mthca_eqe *eqe; int disarm_cqn; int work = 0; - int ncmd = 0; while (1) { if (!next_eqe_sw(eq)) @@ -275,7 +274,6 @@ be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, be64_to_cpu(eqe->event.cmd.out_param)); - ++ncmd; break; case MTHCA_EVENT_TYPE_PORT_CHANGE: @@ -314,9 +312,6 @@ set_eq_ci(dev, eq->eqn, eq->cons_index); } - if (ncmd) - mthca_cmd_complete(dev, ncmd); - eq_req_not(dev, eq->eqn); } Index: hw/mthca/mthca_cmd.h =================================================================== --- hw/mthca/mthca_cmd.h (revision 1316) +++ hw/mthca/mthca_cmd.h (working copy) @@ -205,7 +205,6 @@ void mthca_cmd_use_polling(struct mthca_dev *dev); void mthca_cmd_event(struct mthca_dev *dev, u16 token, u8 status, u64 out_param); -void mthca_cmd_complete(struct mthca_dev *dev, int ncomp); int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); From mst at mellanox.co.il Thu Dec 9 10:49:54 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 9 Dec 2004 20:49:54 +0200 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: <41B89A5E.3090104@ichips.intel.com> References: <20041209182957.GA8778@mellanox.co.il> <41B89A5E.3090104@ichips.intel.com> Message-ID: <20041209184954.GA8856@mellanox.co.il> Hello! Quoting r. Sean Hefty (mshefty at ichips.intel.com) "Re: [openib-general] Should I use umad -or- osm": > Michael S. Tsirkin wrote: > >Hello! > >Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Should I > >use umad -or- osm": > > > >>>>Anyone can have root > >>>>access on his machine. > >>> > >>>1. Why not on the switch then? > >>> > >> > >>What do you mean? To be able to send/recv mads it is enough to have one > >>host with HCA. No switch can block root user sending mads unless pkey or > >>mkey mechanism is used. > > > > > >I mean that if a malicious user has control of a switch he can > >cause even more problems. > > IMO, if you have a malicious user on the fabric, then you've already > been compromised, and attacks on your IB network are probably not your > greatest worries. > Thats what I was saying. 
MST From iod00d at hp.com Thu Dec 9 11:00:43 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 9 Dec 2004 11:00:43 -0800 Subject: [openib-general] Missining Infiniband class fields In-Reply-To: References: Message-ID: <20041209190043.GB7178@esmail.cup.hp.com> On Thu, Dec 09, 2004 at 07:22:59PM +0200, shaharf wrote: > As it will not be the first IOCTL, another one would not do any harm not > already done. This is not an reason for adding another one. ioctl's are a real PITA since they tend to port badly and provide opportunity for all sorts of mischief (e.g. bad arguements can crash the box and security holes). grant From hnrose at earthlink.net Thu Dec 9 11:21:49 2004 From: hnrose at earthlink.net (hnrose at earthlink.net) Date: Thu, 9 Dec 2004 14:21:49 -0500 Subject: [openib-general] UD with GRH Message-ID: <305790-220041249192149681@M2W049.mail2web.com> Hi Roland, Just wondering whether UD with GRH has been tested in terms of mthca. I am having difficulties sending UD with GRH (receiving seems OK although I have not verified all the fields as yet) and just want to know whether this should work or not. Thanks. BTW, I'm sending this from my personal email as my Voltaire email has been down for 1.5 days now :-( -- Hal -------------------------------------------------------------------- mail2web - Check your email from the web at http://mail2web.com/ . From hnrose at earthlink.net Thu Dec 9 12:29:28 2004 From: hnrose at earthlink.net (hnrose at earthlink.net) Date: Thu, 9 Dec 2004 15:29:28 -0500 Subject: [openib-general] Re: UD with GRH Message-ID: <185290-220041249202928358@M2W103.mail2web.com> Never mind. This was my problem. It's fixed now. -- Hal -------------------------------------------------------------------- mail2web - Check your email from the web at http://mail2web.com/ . From mshefty at ichips.intel.com Thu Dec 9 13:50:49 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 9 Dec 2004 13:50:49 -0800 Subject: [openib-general] [PATCH] MAD snooping API/implementation Message-ID: <20041209135049.5ee48ef4.mshefty@ichips.intel.com> Here's a patch that adds in the ability to snoop MADs. Currently only send and receive completions are snooped, but the implementation should be general enough to add in snooping to other areas fairly easily. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1316) +++ core/mad.c (working copy) @@ -367,17 +367,129 @@ } EXPORT_SYMBOL(ib_register_mad_agent); -/* - * ib_unregister_mad_agent - Unregisters a client from using MAD services - */ -int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) +static inline int is_snooping_sends(int mad_snoop_flags) { - struct ib_mad_agent_private *mad_agent_priv; - struct ib_mad_port_private *port_priv; + return (mad_snoop_flags & + (/*IB_MAD_SNOOP_POSTED_SENDS | + IB_MAD_SNOOP_RMPP_SENDS |*/ + IB_MAD_SNOOP_SEND_COMPLETIONS /*| + IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS*/)); +} + +static inline int is_snooping_recvs(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (IB_MAD_SNOOP_RECVS /*| + IB_MAD_SNOOP_RMPP_RECVS*/)); +} + +static int register_snoop_agent(struct ib_mad_qp_info *qp_info, + struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_snoop_private **new_snoop_table; unsigned long flags; + int i; - mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, - agent); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + /* Check for empty slot in array. 
*/ + for (i = 0; i < qp_info->snoop_table_size; i++) + if (!qp_info->snoop_table[i]) + break; + + if (i == qp_info->snoop_table_size) { + /* Grow table. */ + new_snoop_table = kmalloc(sizeof mad_snoop_priv * + qp_info->snoop_table_size + 1, + GFP_ATOMIC); + if (!new_snoop_table) { + i = -ENOMEM; + goto out; + } + if (qp_info->snoop_table) { + memcpy(new_snoop_table, qp_info->snoop_table, + sizeof mad_snoop_priv * + qp_info->snoop_table_size); + kfree(qp_info->snoop_table); + } + qp_info->snoop_table = new_snoop_table; + qp_info->snoop_table_size++; + } + qp_info->snoop_table[i] = mad_snoop_priv; + atomic_inc(&qp_info->snoop_count); +out: + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + return i; +} + +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret; + struct ib_mad_snoop_private *mad_snoop_priv; + int qpn; + + /* Validate parameters */ + if ((is_snooping_sends(mad_snoop_flags) && !snoop_handler) || + (is_snooping_recvs(mad_snoop_flags) && !recv_handler)) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + /* Allocate structures */ + mad_snoop_priv = kmalloc(sizeof *mad_snoop_priv, GFP_KERNEL); + if (!mad_snoop_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + /* Now, fill in the various structures */ + memset(mad_snoop_priv, 0, sizeof *mad_snoop_priv); + mad_snoop_priv->qp_info = &port_priv->qp_info[qpn]; + mad_snoop_priv->agent.device = device; + mad_snoop_priv->agent.recv_handler = recv_handler; + mad_snoop_priv->agent.snoop_handler = snoop_handler; + mad_snoop_priv->agent.context = context; + mad_snoop_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_snoop_priv->agent.port_num = port_num; + mad_snoop_priv->mad_snoop_flags = mad_snoop_flags; + init_waitqueue_head(&mad_snoop_priv->wait); + mad_snoop_priv->snoop_index = register_snoop_agent( + &port_priv->qp_info[qpn], + mad_snoop_priv); + if (mad_snoop_priv->snoop_index < 0) { + ret = ERR_PTR(mad_snoop_priv->snoop_index); + goto error2; + } + + atomic_set(&mad_snoop_priv->refcount, 1); + return &mad_snoop_priv->agent; + +error2: + kfree(mad_snoop_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_snoop); + +static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; /* Note that we could still be handling received MADs */ @@ -405,6 +517,46 @@ if (mad_agent_priv->reg_req) kfree(mad_agent_priv->reg_req); kfree(mad_agent_priv); +} + +static void unregister_mad_snoop(struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_qp_info *qp_info; + unsigned long flags; + + qp_info = mad_snoop_priv->qp_info; + spin_lock_irqsave(&qp_info->snoop_lock, flags); + qp_info->snoop_table[mad_snoop_priv->snoop_index] = NULL; + atomic_dec(&qp_info->snoop_count); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + + atomic_dec(&mad_snoop_priv->refcount); + wait_event(mad_snoop_priv->wait, + !atomic_read(&mad_snoop_priv->refcount)); + + kfree(mad_snoop_priv); +} + +/* + * ib_unregister_mad_agent - Unregisters a client from using MAD services + */ +int ib_unregister_mad_agent(struct 
ib_mad_agent *mad_agent) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_snoop_private *mad_snoop_priv; + + /* If the TID is zero, the agent can only snoop. */ + if (mad_agent->hi_tid) { + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + unregister_mad_agent(mad_agent_priv); + } else { + mad_snoop_priv = container_of(mad_agent, + struct ib_mad_snoop_private, + agent); + unregister_mad_snoop(mad_snoop_priv); + } return 0; } EXPORT_SYMBOL(ib_unregister_mad_agent); @@ -422,30 +574,82 @@ spin_unlock_irqrestore(&mad_queue->lock, flags); } +static void snoop_send(struct ib_mad_qp_info *qp_info, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.snoop_handler(&mad_snoop_priv->agent, + send_wr, mad_send_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +static void snoop_recv(struct ib_mad_qp_info *qp_info, + struct ib_mad_recv_wc *mad_recv_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.recv_handler(&mad_snoop_priv->agent, + mad_recv_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + /* * Return 0 if SMP is to be sent * Return 1 if SMP was consumed locally (whether or not solicited) * Return < 0 if error */ -static int handle_outgoing_smp(struct ib_mad_agent *mad_agent, +static int handle_outgoing_smp(struct ib_mad_agent_private *mad_agent_priv, struct ib_smp *smp, struct ib_send_wr *send_wr) { int ret; struct ib_mad_private *mad_priv; struct ib_mad_send_wc mad_send_wc; + struct ib_device *device = mad_agent_priv->agent.device; + u8 port_num = mad_agent_priv->agent.port_num; - if (!smi_handle_dr_smp_send(smp, - mad_agent->device->node_type, - mad_agent->port_num)) { + if (!smi_handle_dr_smp_send(smp, device->node_type, port_num)) { ret = -EINVAL; printk(KERN_ERR PFX "Invalid directed route\n"); goto out; } /* Check to post send on QP or process locally */ - ret = smi_check_local_dr_smp(smp, mad_agent->device, - mad_agent->port_num); - if (!ret || !mad_agent->device->process_mad) + ret = smi_check_local_dr_smp(smp, device, port_num); + if (!ret || !device->process_mad) goto out; mad_priv = kmem_cache_alloc(ib_mad_cache, @@ -456,10 +660,9 @@ printk(KERN_ERR PFX "No memory for local response MAD\n"); goto out; } - ret = mad_agent->device->process_mad(mad_agent->device, 0, - mad_agent->port_num, smp->dr_slid, - (struct ib_mad *)smp, - (struct ib_mad 
*)&mad_priv->mad); + ret = device->process_mad(device, 0, port_num, smp->dr_slid, + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); switch (ret) { case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: @@ -468,7 +671,7 @@ * there is a recv handler */ if (solicited_mad(&mad_priv->mad.mad) && - mad_agent->recv_handler) { + mad_agent_priv->agent.recv_handler) { struct ib_wc wc; /* @@ -494,7 +697,12 @@ mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; mad_priv->header.recv_wc.recv_buf = &mad_priv->header.recv_buf; - mad_agent->recv_handler(mad_agent, + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_recv(mad_agent_priv->qp_info, + &mad_priv->header.recv_wc, + IB_MAD_SNOOP_RECVS); + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, &mad_priv->header.recv_wc); } else kmem_cache_free(ib_mad_cache, mad_priv); @@ -516,7 +724,11 @@ mad_send_wc.status = IB_WC_SUCCESS; mad_send_wc.vendor_err = 0; mad_send_wc.wr_id = send_wr->wr_id; - mad_agent->send_handler(mad_agent, &mad_send_wc); + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_send(mad_agent_priv->qp_info, send_wr, &mad_send_wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); ret = 1; out: return ret; @@ -610,7 +822,7 @@ smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - ret = handle_outgoing_smp(mad_agent, smp, send_wr); + ret = handle_outgoing_smp(mad_agent_priv, smp, send_wr); if (ret < 0) /* error */ goto error2; else if (ret == 1) /* locally consumed */ @@ -1383,6 +1595,9 @@ recv->header.recv_buf.mad = (struct ib_mad *)&recv->mad; recv->header.recv_buf.grh = &recv->grh; + if (atomic_read(&qp_info->snoop_count)) + snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); + /* Validate MAD */ if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) goto out; @@ -1600,7 +1815,11 @@ /* Restore client wr_id in WC and complete send */ wc->wr_id = mad_send_wr->wr_id; - ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc*)wc); + if (atomic_read(&qp_info->snoop_count)) + snoop_send(qp_info, &mad_send_wr->send_wr, + (struct ib_mad_send_wc *)wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc *)wc); if (queued_send_wr) { ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, @@ -2068,6 +2287,10 @@ init_mad_queue(qp_info, &qp_info->send_queue); init_mad_queue(qp_info, &qp_info->recv_queue); INIT_LIST_HEAD(&qp_info->overflow_list); + spin_lock_init(&qp_info->snoop_lock); + qp_info->snoop_table = NULL; + qp_info->snoop_table_size = 0; + atomic_set(&qp_info->snoop_count, 0); } static int create_mad_qp(struct ib_mad_qp_info *qp_info, @@ -2108,6 +2331,8 @@ static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) { ib_destroy_qp(qp_info->qp); + if (qp_info->snoop_table) + kfree(qp_info->snoop_table); } /* Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 1316) +++ core/mad_priv.h (working copy) @@ -121,6 +121,15 @@ u8 rmpp_version; }; +struct ib_mad_snoop_private { + struct ib_mad_agent agent; + struct ib_mad_qp_info *qp_info; + int snoop_index; + int mad_snoop_flags; + atomic_t refcount; + wait_queue_head_t wait; +}; + struct ib_mad_send_wr_private { struct ib_mad_list_head mad_list; struct list_head agent_list; @@ -171,6 +180,10 @@ struct ib_mad_queue send_queue; struct ib_mad_queue recv_queue; struct list_head overflow_list; + spinlock_t 
snoop_lock; + struct ib_mad_snoop_private **snoop_table; + int snoop_table_size; + atomic_t snoop_count; }; struct ib_mad_port_private { Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 1316) +++ include/ib_mad.h (working copy) @@ -126,13 +126,29 @@ struct ib_mad_send_wc *mad_send_wc); /** + * ib_mad_snoop_handler - Callback handler for snooping sent MADs. + * @mad_agent: MAD agent that snooped the MAD. + * @send_wr: Work request information on the sent MAD. + * @mad_send_wc: Work completion information on the sent MAD. Valid + * only for snooping that occurs on a send completion. + * + * Clients snooping MADs should not modify data referenced by the @send_wr + * or @mad_send_wc. + */ +typedef void (*ib_mad_snoop_handler)(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc); + +/** * ib_mad_recv_handler - callback handler for a received MAD. * @mad_agent: MAD agent requesting the received MAD. * @mad_recv_wc: Received work completion information on the received MAD. * * MADs received in response to a send request operation will be handed to * the user after the send operation completes. All data buffers given - * to the user through this routine are owned by the receiving client. + * to registered agents through this routine are owned by the receiving + * client, except for snooping agents. Clients snooping MADs should not + * modify the data referenced by @mad_recv_wc. */ typedef void (*ib_mad_recv_handler)(struct ib_mad_agent *mad_agent, struct ib_mad_recv_wc *mad_recv_wc); @@ -143,6 +159,7 @@ * @qp: Reference to QP used for sending and receiving MADs. * @recv_handler: Callback handler for a received MAD. * @send_handler: Callback handler for a sent MAD. + * @snoop_handler: Callback handler for snooped sent MADs. * @context: User-specified context associated with this registration. * @hi_tid: Access layer assigned transaction ID for this client. * Unsolicited MADs sent by this client will have the upper 32-bits @@ -154,6 +171,7 @@ struct ib_qp *qp; ib_mad_recv_handler recv_handler; ib_mad_send_handler send_handler; + ib_mad_snoop_handler snoop_handler; void *context; u32 hi_tid; u8 port_num; @@ -247,6 +265,35 @@ ib_mad_recv_handler recv_handler, void *context); +enum ib_mad_snoop_flags { + /*IB_MAD_SNOOP_POSTED_SENDS = 1,*/ + /*IB_MAD_SNOOP_RMPP_SENDS = (1<<1),*/ + IB_MAD_SNOOP_SEND_COMPLETIONS = (1<<2), + /*IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS = (1<<3),*/ + IB_MAD_SNOOP_RECVS = (1<<4) + /*IB_MAD_SNOOP_RMPP_RECVS = (1<<5),*/ + /*IB_MAD_SNOOP_REDIRECTED_QPS = (1<<6)*/ +}; + +/** + * ib_register_mad_snoop - Register to snoop sent and received MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP traffic to snoop. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_snoop_flags: Specifies information where snooping occurs. + * @send_handler: The callback routine invoked for a snooped send. + * @recv_handler: The callback routine invoked for a snooped receive. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context); + /** * ib_unregister_mad_agent - Unregisters a client from using MAD services. 
* @mad_agent: Corresponding MAD registration request to deregister. From mshefty at ichips.intel.com Thu Dec 9 14:24:59 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 9 Dec 2004 14:24:59 -0800 Subject: [openib-general] [PATCH] [RFC] new test directory + test code for MAD snooping Message-ID: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> This patch creates a new subdirectory under infiniband called 'test', and adds a new kernel module called 'madeye' that can be used to snoop and display MADs (until user-mode smpdump and gmpdump programs are created). We need to decide if we want to include this sort of test code in the openib tree, and where it might best go. - Sean Index: Kconfig =================================================================== --- Kconfig (revision 1316) +++ Kconfig (working copy) @@ -11,4 +11,6 @@ source "drivers/infiniband/ulp/ipoib/Kconfig" +source "drivers/infiniband/test/madeye/Kconfig" + endmenu Index: Makefile =================================================================== --- Makefile (revision 1316) +++ Makefile (working copy) @@ -1,3 +1,4 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ +obj-$(CONFIG_INFINIBAND_MADEYE) += test/madeye/ Index: test/madeye/Kconfig =================================================================== --- test/madeye/Kconfig (revision 0) +++ test/madeye/Kconfig (revision 0) @@ -0,0 +1,6 @@ +config INFINIBAND_MADEYE + tristate "MAD debug viewer for InfiniBand" + depends on INFINIBAND + ---help--- + Prints sent and received MADs on QP 0/1 for debugging. + Index: test/madeye/madeye.c =================================================================== --- test/madeye/madeye.c (revision 0) +++ test/madeye/madeye.c (revision 0) @@ -0,0 +1,241 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Intel Corporation. All rights reserved. 
+ * + * $Id$ + */ + +#include +#include +#include + +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand MAD viewer"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void madeye_remove_one(struct ib_device *device); +static void madeye_add_one(struct ib_device *device); + +static struct ib_client madeye_client = { + .name = "madeye", + .add = madeye_add_one, + .remove = madeye_remove_one +}; + +struct madeye_port { + struct ib_mad_agent *smi_agent; + struct ib_mad_agent *gsi_agent; +}; + +static char * get_class_name(u8 mgmt_class) +{ + switch(mgmt_class) { + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + return "LID routed SMP"; + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + return "Directed route SMP"; + case IB_MGMT_CLASS_SUBN_ADM: + return "Subnet admin."; + case IB_MGMT_CLASS_PERF_MGMT: + return "Perf. mgmt."; + case IB_MGMT_CLASS_BM: + return "Baseboard mgmt."; + case IB_MGMT_CLASS_DEVICE_MGMT: + return "Device mgmt."; + case IB_MGMT_CLASS_CM: + return "Comm. mgmt."; + case IB_MGMT_CLASS_SNMP: + return "SNMP"; + default: + return "Unknown vendor/application"; + } +} + +static char * get_method_name(u8 method) +{ + switch(method) { + case IB_MGMT_METHOD_GET: + return "Get"; + case IB_MGMT_METHOD_SET: + return "Set"; + case IB_MGMT_METHOD_GET_RESP: + return "Get response"; + case IB_MGMT_METHOD_SEND: + return "Send"; + case IB_MGMT_METHOD_TRAP: + return "Trap"; + case IB_MGMT_METHOD_REPORT: + return "Report"; + case IB_MGMT_METHOD_REPORT_RESP: + return "Report response"; + case IB_MGMT_METHOD_TRAP_REPRESS: + return "Trap repress"; + default: + return "Unknown"; + } +} + +static void print_status_details(u16 status) +{ + if (status & cpu_to_be16(0x0001)) + printk(" busy\n"); + if (status & cpu_to_be16(0x0002)) + printk(" redirection required\n"); + switch((be16_to_cpu(status) & 0x001C) >> 2) { + case 1: + printk(" bad version\n"); + break; + case 2: + printk(" method not supported\n"); + break; + case 3: + printk(" method/attribute combo not supported\n"); + break; + case 7: + printk(" invalid attribute/modifier value\n"); + break; + } +} + +static void print_mad_hdr(struct ib_mad_hdr *mad_hdr) +{ + printk("MAD version....0x%01x\n", mad_hdr->base_version); + printk("Class..........0x%01x (%s)\n", mad_hdr->mgmt_class, + get_class_name(mad_hdr->mgmt_class)); + printk("Class version..0x%01x\n", mad_hdr->class_version); + printk("Method.........0x%01x (%s)\n", mad_hdr->method, + get_method_name(mad_hdr->method)); + printk("Status.........0x%02x\n", mad_hdr->status); + if (mad_hdr->status) + print_status_details(mad_hdr->status); + printk("Class specific.0x%02x\n", mad_hdr->class_specific); + printk("Trans ID.......0x%llx\n", mad_hdr->tid); + printk("Attr ID........0x%02x\n", mad_hdr->attr_id); + printk("Attr modifier..0x%04x\n", mad_hdr->attr_mod); +} + +static void snoop_smi_handler(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + printk("Madeye:sent SMP\n"); + print_mad_hdr(send_wr->wr.ud.mad_hdr); +} + +static void recv_smi_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + printk("Madeye:recv SMP\n"); + print_mad_hdr(&mad_recv_wc->recv_buf->mad->mad_hdr); +} + +static void snoop_gsi_handler(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + printk("Madeye:sent GMP\n"); + print_mad_hdr(send_wr->wr.ud.mad_hdr); +} + +static void recv_gsi_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + printk("Madeye:recv 
GMP\n"); + print_mad_hdr(&mad_recv_wc->recv_buf->mad->mad_hdr); +} + +static void madeye_add_one(struct ib_device *device) +{ + struct madeye_port *port; + int reg_flags; + u8 i, s, e; + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + port = kmalloc(sizeof *port * (e - s + 1), GFP_KERNEL); + if (!port) + goto out; + + reg_flags = IB_MAD_SNOOP_SEND_COMPLETIONS | IB_MAD_SNOOP_RECVS; + for (i = s; i <= e; i++) { + port[i].smi_agent = ib_register_mad_snoop(device, i, + IB_QPT_SMI, + reg_flags, + snoop_smi_handler, + recv_smi_handler, + &port[i]); + port[i].gsi_agent = ib_register_mad_snoop(device, i, + IB_QPT_GSI, + reg_flags, + snoop_gsi_handler, + recv_gsi_handler, + &port[i]); + } + +out: + ib_set_client_data(device, &madeye_client, port); +} + +static void madeye_remove_one(struct ib_device *device) +{ + struct madeye_port *port; + int i, s, e; + + port = (struct madeye_port *) + ib_get_client_data(device, &madeye_client); + if (!port) + return; + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (i = s; i <= e; i++) { + if (!IS_ERR(port[i].smi_agent)) + ib_unregister_mad_agent(port[i].smi_agent); + if (!IS_ERR(port[i].gsi_agent)) + ib_unregister_mad_agent(port[i].gsi_agent); + } + kfree(port); +} + +static int __init ib_madeye_init(void) +{ + return ib_register_client(&madeye_client); +} + +static void __exit ib_madeye_cleanup(void) +{ + ib_unregister_client(&madeye_client); +} + +module_init(ib_madeye_init); +module_exit(ib_madeye_cleanup); Index: test/madeye/Makefile =================================================================== --- test/madeye/Makefile (revision 0) +++ test/madeye/Makefile (revision 0) @@ -0,0 +1,6 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_MADEYE) += ib_madeye.o + +ib_madeye-y := madeye.o \ + From iod00d at hp.com Thu Dec 9 15:41:09 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 9 Dec 2004 15:41:09 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <52pt1ndzh4.fsf@topspin.com> References: <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> Message-ID: <20041209234109.GN7178@esmail.cup.hp.com> On Mon, Dec 06, 2004 at 10:59:51AM -0800, Roland Dreier wrote: > Grant> Yes - that works. Please commit. > > I rewrote things in a way that seems cleaner to me -- what I actually > committed is below. Please try one more time and make sure this still > fixes the problem. I've updated to svn 1316 and that has both the problem I originally observed plus it "hangs" when the module is loaded. ... 
> Index: infiniband/hw/mthca/mthca_cmd.c > =================================================================== > --- infiniband/hw/mthca/mthca_cmd.c (revision 1310) > +++ infiniband/hw/mthca/mthca_cmd.c (working copy) > @@ -293,6 +293,12 @@ > complete(&context->done); > } > > +void mthca_cmd_complete(struct mthca_dev *dev, int ncomp) > +{ > + while (ncomp--) > + up(&dev->cmd.event_sem); > +} > + > static void event_timeout(unsigned long context_ptr) > { > struct mthca_cmd_context *context = > @@ -357,7 +363,6 @@ > dev->cmd.free_head = context - dev->cmd.context; > spin_unlock(&dev->cmd.context_lock); > > - up(&dev->cmd.event_sem); > return err; > } I have the feeling the timeout code isn't cleaning up the event_sem and that's causing the hang. trying to understand this now. thanks, grant > > Index: infiniband/hw/mthca/mthca_eq.c > =================================================================== > --- infiniband/hw/mthca/mthca_eq.c (revision 1310) > +++ infiniband/hw/mthca/mthca_eq.c (working copy) > @@ -219,6 +219,7 @@ > struct mthca_eqe *eqe; > int disarm_cqn; > int work = 0; > + int ncmd = 0; > > while (1) { > if (!next_eqe_sw(eq)) > @@ -274,6 +275,7 @@ > be16_to_cpu(eqe->event.cmd.token), > eqe->event.cmd.status, > be64_to_cpu(eqe->event.cmd.out_param)); > + ++ncmd; > break; > > case MTHCA_EVENT_TYPE_PORT_CHANGE: > @@ -303,6 +305,9 @@ > set_eq_ci(dev, eq->eqn, eq->cons_index); > } > > + if (ncmd) > + mthca_cmd_complete(dev, ncmd); > + > eq_req_not(dev, eq->eqn); > } > > Index: infiniband/hw/mthca/mthca_cmd.h > =================================================================== > --- infiniband/hw/mthca/mthca_cmd.h (revision 1310) > +++ infiniband/hw/mthca/mthca_cmd.h (working copy) > @@ -203,10 +203,9 @@ > > int mthca_cmd_use_events(struct mthca_dev *dev); > void mthca_cmd_use_polling(struct mthca_dev *dev); > -void mthca_cmd_event(struct mthca_dev *dev, > - u16 token, > - u8 status, > - u64 out_param); > +void mthca_cmd_event(struct mthca_dev *dev, u16 token, > + u8 status, u64 out_param); > +void mthca_cmd_complete(struct mthca_dev *dev, int ncomp); > > int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); > int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); From mshefty at ichips.intel.com Thu Dec 9 15:48:06 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 09 Dec 2004 15:48:06 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041209234109.GN7178@esmail.cup.hp.com> References: <20041203221413.GC16522@esmail.cup.hp.com> <52hdn3ggsb.fsf@topspin.com> <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> <20041209234109.GN7178@esmail.cup.hp.com> Message-ID: <41B8E436.7050509@ichips.intel.com> Grant Grundler wrote: > On Mon, Dec 06, 2004 at 10:59:51AM -0800, Roland Dreier wrote: > >> Grant> Yes - that works. Please commit. >> >>I rewrote things in a way that seems cleaner to me -- what I actually >>committed is below. Please try one more time and make sure this still >>fixes the problem. > > > I've updated to svn 1316 and that has both the problem I > originally observed plus it "hangs" when the module is loaded. > > ... 
> >>Index: infiniband/hw/mthca/mthca_cmd.c >>=================================================================== >>--- infiniband/hw/mthca/mthca_cmd.c (revision 1310) >>+++ infiniband/hw/mthca/mthca_cmd.c (working copy) >>@@ -293,6 +293,12 @@ >> complete(&context->done); >> } >> >>+void mthca_cmd_complete(struct mthca_dev *dev, int ncomp) >>+{ >>+ while (ncomp--) >>+ up(&dev->cmd.event_sem); >>+} >>+ I had to remove this patch in order to get things working on my system. - Sean From mshefty at ichips.intel.com Thu Dec 9 15:49:16 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 9 Dec 2004 15:49:16 -0800 Subject: [openib-general] [PATCH] remove add_mad_reg_req function Message-ID: <20041209154916.69d2dc80.mshefty@ichips.intel.com> This patch removes add_mad_reg_req(), which removes redundant checks in the code. - Sean Index: core/mad.c =================================================================== --- core/mad.c (revision 1317) +++ core/mad.c (working copy) @@ -80,8 +80,6 @@ /* Forward declarations */ static int method_in_use(struct ib_mad_mgmt_method_table **method, struct ib_mad_reg_req *mad_reg_req); -static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, - struct ib_mad_agent_private *priv); static void remove_mad_reg_req(struct ib_mad_agent_private *priv); static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, struct ib_mad_private *mad); @@ -321,6 +319,8 @@ goto error3; } } + ret2 = add_nonoui_reg_req(mad_reg_req, mad_agent_priv, + mgmt_class); } else { /* "New" vendor class range */ vendor = port_priv->version[mad_reg_req-> @@ -335,13 +335,12 @@ goto error3; } } + ret2 = add_oui_reg_req(mad_reg_req, mad_agent_priv); + } + if (ret2) { + ret = ERR_PTR(ret2); + goto error3; } - } - - ret2 = add_mad_reg_req(mad_reg_req, mad_agent_priv); - if (ret2) { - ret = ERR_PTR(ret2); - goto error3; } /* Add mad agent into port's agent list */ @@ -1007,24 +1006,6 @@ return ret; } -static int add_mad_reg_req(struct ib_mad_reg_req *mad_reg_req, - struct ib_mad_agent_private *priv) -{ - int ret; - u8 mgmt_class; - - /* Make sure MAD registration request supplied */ - if (!mad_reg_req) - return 0; - - mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); - if (!is_vendor_class(mgmt_class)) - ret = add_nonoui_reg_req(mad_reg_req, priv, mgmt_class); - else - ret = add_oui_reg_req(mad_reg_req, priv); - return ret; -} - static void remove_mad_reg_req(struct ib_mad_agent_private *agent_priv) { struct ib_mad_port_private *port_priv; From iod00d at hp.com Thu Dec 9 16:05:36 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 9 Dec 2004 16:05:36 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <41B8E436.7050509@ichips.intel.com> References: <20041203224039.GE16522@esmail.cup.hp.com> <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> <20041209234109.GN7178@esmail.cup.hp.com> <41B8E436.7050509@ichips.intel.com> Message-ID: <20041210000536.GO7178@esmail.cup.hp.com> On Thu, Dec 09, 2004 at 03:48:06PM -0800, Sean Hefty wrote: > >>+void mthca_cmd_complete(struct mthca_dev *dev, int ncomp) > >>+{ > >>+ while (ncomp--) > >>+ up(&dev->cmd.event_sem); > >>+} > >>+ > > I had to remove this patch in order to get things working on my system. Ok. But I see now that the set_ci patch that worked hasn't been committed. 
Roland, Would you consider committing the set_ci patch and backing the ncmd patch out? thanks, grant From iod00d at hp.com Thu Dec 9 17:12:16 2004 From: iod00d at hp.com (Grant Grundler) Date: Thu, 9 Dec 2004 17:12:16 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041210000536.GO7178@esmail.cup.hp.com> References: <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> <20041209234109.GN7178@esmail.cup.hp.com> <41B8E436.7050509@ichips.intel.com> <20041210000536.GO7178@esmail.cup.hp.com> Message-ID: <20041210011216.GQ7178@esmail.cup.hp.com> On Thu, Dec 09, 2004 at 04:05:36PM -0800, Grant Grundler wrote: > Roland, > Would you consider committing the set_ci patch and backing the ncmd patch out? Roland, Attached is a patch that does three things (LART me for combining them, but I can split them up later if you want): o reverse most of the mthca_cmd_complete() patch o add set_ci logic so consumer index is updated inside the event loop o eliminate "work" variable (nicely simplifies the code). If I understood the code correctly, "work" was always being set if next_eqe_sw(eq) was non-zero. It is so expensive to get to the interrupt handler in the first place that if nothing needs to be done: a) the problem is NOT in the interrupt handler. b) the spurious set_eq_ci() is light weight enough to be in the noise I do understand the wmb()/set_eq_ci() disturb the PCI data flows but we shouldn't be seeing spurious interrupts either. If they are happening often enough to disturb performance, something else is wrong. grant Index: hw/mthca/mthca_cmd.c =================================================================== --- hw/mthca/mthca_cmd.c (revision 1317) +++ hw/mthca/mthca_cmd.c (working copy) @@ -293,12 +293,6 @@ complete(&context->done); } -void mthca_cmd_complete(struct mthca_dev *dev, int ncomp) -{ - while (ncomp--) - up(&dev->cmd.event_sem); -} - static void event_timeout(unsigned long context_ptr) { struct mthca_cmd_context *context = @@ -363,6 +357,7 @@ dev->cmd.free_head = context - dev->cmd.context; spin_unlock(&dev->cmd.context_lock); + up(&dev->cmd.event_sem); return err; } Index: hw/mthca/mthca_eq.c =================================================================== --- hw/mthca/mthca_eq.c (revision 1317) +++ hw/mthca/mthca_eq.c (working copy) @@ -218,15 +218,10 @@ { struct mthca_eqe *eqe; int disarm_cqn; - int work = 0; - int ncmd = 0; - while (1) { - if (!next_eqe_sw(eq)) - break; - + while (next_eqe_sw(eq)) { + int set_ci = 0; eqe = get_eqe(eq, eq->cons_index); - work = 1; switch (eqe->type) { case MTHCA_EVENT_TYPE_COMP: @@ -275,7 +270,11 @@ be16_to_cpu(eqe->event.cmd.token), eqe->event.cmd.status, be64_to_cpu(eqe->event.cmd.out_param)); - ++ncmd; + /* cmd_event() may add more commands. + * The card will think the queue has overflowed if + * we don't tell it we've been processing events. + */ + set_ci = 1; break; case MTHCA_EVENT_TYPE_PORT_CHANGE: @@ -298,25 +297,25 @@ set_eqe_hw(eq, eq->cons_index); eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); - } - if (work) { - /* - * This barrier makes sure that all updates to - * ownership bits done by set_eqe_hw() hit memory - * before the consumer index is updated. 
set_eq_ci() - * allows the HCA to possibly write more EQ entries, - * and we want to avoid the exceedingly unlikely - * possibility of the HCA writing an entry and then - * having set_eqe_hw() overwrite the owner field. - */ - wmb(); - set_eq_ci(dev, eq->eqn, eq->cons_index); + if (set_ci) { + wmb(); /* see comment below */ + set_eq_ci(dev, eq->eqn, eq->cons_index); + set_ci = 0; + } } - if (ncmd) - mthca_cmd_complete(dev, ncmd); - + /* + * This barrier makes sure that all updates to + * ownership bits done by set_eqe_hw() hit memory + * before the consumer index is updated. set_eq_ci() + * allows the HCA to possibly write more EQ entries, + * and we want to avoid the exceedingly unlikely + * possibility of the HCA writing an entry and then + * having set_eqe_hw() overwrite the owner field. + */ + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); eq_req_not(dev, eq->eqn); } Index: hw/mthca/mthca_cmd.h =================================================================== --- hw/mthca/mthca_cmd.h (revision 1317) +++ hw/mthca/mthca_cmd.h (working copy) @@ -205,7 +205,6 @@ void mthca_cmd_use_polling(struct mthca_dev *dev); void mthca_cmd_event(struct mthca_dev *dev, u16 token, u8 status, u64 out_param); -void mthca_cmd_complete(struct mthca_dev *dev, int ncomp); int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); From mucci at cs.utk.edu Fri Dec 10 05:19:32 2004 From: mucci at cs.utk.edu (Philip Mucci) Date: Fri, 10 Dec 2004 14:19:32 +0100 Subject: [openib-general] Should I use umad -or- osm In-Reply-To: References: Message-ID: <1102684772.3690.31.camel@muccislaptop.pdc.kth.se> Hi Shahar, Thanks again for the information. Yes, the OSM interface sure seems a bit excessive for my needs...you mentioned that the libumad interface will be much simpler? Simpler is better. I definitely would like to have access to your code, that would be great. I'll chat more with you about this off-line. Phil On Thu, 2004-12-09 at 18:55 +0200, shaharf wrote: > Philip, > I would say that there is not much in common between gen1 and > gen2 user mode mad interface. If you want your tools to works above > both, then osm vendor layer is your only choice. If you are planning to > use gen2 stacks only at some point you can use libumad. > > If you need just a simple PM portcounter get query, then maybe osm > vendor api is a bit too "fat" for you. I would consider using gen1 > interface directly. But still, this is your choice. > > I guess that on the long run, openib gen2 will be the only maintained > openib version. Anyone thinks differently? > > Shahar > > > From: Philip Mucci [mailto:mucci at cs.utk.edu] > > Sent: Thursday, December 09, 2004 6:44 PM > > To: shaharf > > Cc: openib-general at openib.org > > Subject: RE: [openib-general] Should I use umad -or- osm > > > > Hi Shahar, > > > > Thanks for the info. > > > > My needs are very 'simple'. Just to send and receive MADS to the PM > > agent on each adapter and switch in the network. > > > > Ideally, I would like this to work for existing installations based on > > either OpenIB gen1 or Mellanox Gold. But I think for that, I need to > use > > the current osm_vendor_api.h interface. > > > > How much will this interface change with gen2? will it export the same > > functions? Or will everything change > > > > Lastly, will I be able to send/recv these MADS as a non-root user? > > > > Thanks again, and the answer to your question is, yes, I can wait. 
> > > > Regards, > > > > Philip > > > > > > On Thu, 2004-12-09 at 17:52 +0200, shaharf wrote: > > > Hi Philip, > > > > > > I am currently implementing umad access library. It much simpler > > > then the osm_vendor api. I would recommend using umad library and > not > > > osm. As a matter of fact the current osm vendor layer does not > support > > > openib gen2. I am working on that either. Both the umad access > library > > > and the new osm vendor layer that uses it are not finished yet. I > guess > > > that I will need at least another week to reach a point where I can > > > release it. Even then it will change a lot until I will be finished > with > > > it. > > > The question is can you wait a little? I can give you a preliminary > > > version - but if you will use it you will have to modify your code > > > several each time the library interface is changed. > > > On the other hand, I would like to understand exactly what you need, > > > because you are the first "client" of the user mode stuff beside me. > > > > > > Shahar > > > > > > > -----Original Message----- > > > > From: openib-general-bounces at openib.org [mailto:openib-general- > > > > bounces at openib.org] On Behalf Of Philip Mucci > > > > Sent: Thursday, December 09, 2004 3:25 PM > > > > To: openib-general at openib.org > > > > Subject: [openib-general] Should I use umad -or- osm > > > > > > > > Hi folks, > > > > > > > > I've been tasked with developing a rough performance tool for IB > > > > networks. I've scanned the documentation and looks like the kind > of > > > data > > > > we're interested in can be obtained from the Mellanox ASICs. > > > > > > > > My question is a simple one: > > > > > > > > I've got to send/recv mads to enable and obtain the performance > > > counters > > > > from a user space tool...ideally non-root, but we'll work with > what we > > > > have. > > > > > > > > My current inclination has been to use the osm_vendor_api.h > functions > > > to > > > > do this work. However, the late work here done by Hal on the UMAD > > > access > > > > layer seems to be appropriate as well. > > > > > > > > Could someone elaborate on what you think the best (and most > > > > maintainable) approach to accomplishing this task might be? > > > > > > > > Regards, > > > > > > > > Philip > > > > > > > > > > > > _______________________________________________ > > > > openib-general mailing list > > > > openib-general at openib.org > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib- > > > > general > From hnrose at earthlink.net Fri Dec 10 06:29:07 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 09:29:07 -0500 Subject: [openib-general] [PATCH] MAD snooping API/implementation Message-ID: <001801c4dec4$9bcee3c0$6501a8c0@comcast.net> Hi Sean, A couple of minor questions about this patch: 1. In ib_mad.h: /** + * ib_mad_snoop_handler - Callback handler for snooping sent MADs. + * @mad_agent: MAD agent that snooped the MAD. + * @send_wr: Work request information on the sent MAD. + * @mad_send_wc: Work completion information on the sent MAD. Valid + * only for snooping that occurs on a send completion. + * + * Clients snooping MADs should not modify data referenced by the @send_wr + * or @mad_send_wc. + */ I presume snoop clients should also not free the MAD either. If so, should that comment also be added ? 2. Should MAD snooping be exposed to user space too ? Thanks. 
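For reference, a minimal kernel-side sketch of the kind of consumer this API enables - the function and counter names below are made up, and it assumes the ib_register_mad_snoop() signature, snoop flags and handler typedefs exactly as defined in the patch above (plus the usual ib_mad.h and atomic includes) - that simply counts snooped GSI sends and receives:

static atomic_t snooped_sends = ATOMIC_INIT(0);
static atomic_t snooped_recvs = ATOMIC_INIT(0);

/* Send-completion snoop: look at the MAD but never modify or free it. */
static void count_send(struct ib_mad_agent *mad_agent,
                       struct ib_send_wr *send_wr,
                       struct ib_mad_send_wc *mad_send_wc)
{
        atomic_inc(&snooped_sends);
}

/* Receive snoop: same rule, the buffer still belongs to the MAD layer. */
static void count_recv(struct ib_mad_agent *mad_agent,
                       struct ib_mad_recv_wc *mad_recv_wc)
{
        atomic_inc(&snooped_recvs);
}

static struct ib_mad_agent *start_gsi_counters(struct ib_device *device,
                                               u8 port_num)
{
        return ib_register_mad_snoop(device, port_num, IB_QPT_GSI,
                                     IB_MAD_SNOOP_SEND_COMPLETIONS |
                                     IB_MAD_SNOOP_RECVS,
                                     count_send, count_recv, NULL);
}

The agent returned there is later torn down with the existing ib_unregister_mad_agent(), which the patch teaches to recognize snoop-only registrations.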
- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at earthlink.net Fri Dec 10 06:55:22 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 09:55:22 -0500 Subject: [openib-general] [PATCH] MAD snooping API/implementation Message-ID: <003101c4dec8$468281c0$6501a8c0@comcast.net> Thanks :-) Applied. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at earthlink.net Fri Dec 10 07:09:02 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 10:09:02 -0500 Subject: [openib-general] [PATCH] remove add_mad_reg_req function Message-ID: <003c01c4deca$2f418680$6501a8c0@comcast.net> Thanks. Applied with the minor addition of forward declarations for add_oui/nonoui_reg_req. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at earthlink.net Fri Dec 10 07:11:04 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 10:11:04 -0500 Subject: [openib-general] [PATCH] remove add_mad_reg_req function Message-ID: <005f01c4deca$77e6f000$6501a8c0@comcast.net> Thanks. Applied with the minor addition of forward declarations for add_oui/nonoui_reg_req. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at earthlink.net Fri Dec 10 07:38:08 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 10:38:08 -0500 Subject: [openib-general] IPv6 All Router Multicast Group Message-ID: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> Hi again Roland, In looking at the code, this IPv6 group (all routers) (0xff12:601b:ffff:0:0:0:0:2) is going through the send only path in ipoib_multicast.c::ipoib_mcast_send where: spin_lock_irqsave(&priv->lock, flags); mcast = __ipoib_mcast_find(dev, mgid); if (!mcast) { /* Let's create a new send only group now */ ipoib_dbg_mcast(priv, "setting up send only multicast group for " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); mcast = ipoib_mcast_alloc(dev, 0); if (!mcast) { ipoib_warn(priv, "unable to allocate memory for " "multicast structure\n"); dev_kfree_skb_any(skb); goto out; } set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); mcast->mcmember.mgid = *mgid; __ipoib_mcast_add(dev, mcast); list_add_tail(&mcast->list, &priv->multicast_list); } if (!mcast->ah) { if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) skb_queue_tail(&mcast->pkt_queue, skb); else dev_kfree_skb_any(skb); if (mcast->query) ipoib_dbg_mcast(priv, "no address vector, " "but multicast join already started\n"); else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) ipoib_mcast_sendonly_join(mcast); Although the sendonly_join code has been changed to do a full member rather than send only join, it does not fall back to create the group if it does not already exist. One wouldn't expect a send only join to create the group if it didn't already exist. Any idea on why this group is send only ? Don't end nodes need to both send and receive on the all routers group ? -- Hal -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hnrose at earthlink.net Fri Dec 10 07:41:54 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 10:41:54 -0500 Subject: [openib-general] IPoIB still not working Message-ID: <008601c4dece$c6e14da0$6501a8c0@comcast.net> Hi, MGID....................0xff12401bffff0000 : 0x0000000000000016 PortGid.................0xfe80000000000000 : 0x0002c9010ad25b91 qkey....................0x0 Mlid....................0x0 ScopeState..............0x1 Rate....................0x0 Mtu.....................0x0 [1102546997:000053334][18007] -> osm_physp_share_pkey: [ [1102546997:000053358][18007] -> osm_physp_share_pkey: ] [1102546997:000053389][18007] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7. This is the IPv4 equivalent of what the previous post on All Routers Multicast Group. For some reason, your node is joining this as send only (which attempts to join rather than create) the underlying IB multicast group for this IP multicast group. I'm not sure how the Linux network stack decides this. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From hnrose at earthlink.net Fri Dec 10 07:54:01 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 10:54:01 -0500 Subject: [openib-general] Other Send Only Multicast Groups Message-ID: <000c01c4ded0$78670dc0$6501a8c0@comcast.net> I've also seen the IGMP group (0x16) for IPv6 and/or IPv4 (224.0.0.22) joined as send only (in addition to the all routers (2) group). -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.j.woodruff at intel.com Fri Dec 10 08:50:01 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 10 Dec 2004 08:50:01 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F000302FBBE@orsmsx408> >[1102546997:000053389][18007] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: >method = SubnAdmSet,scope_state = 0x1, component mask = >0x0000000000010083, expected comp mask = 0x00000000000130c7. >This is the IPv4 equivalent of what the previous post on All Routers Multicast >Group. >For some reason, your node is joining this as send only (which attempts to >join rather than create) the underlying IB multicast group for this IP multicast group. >I'm not sure how the Linux network stack decides this. >-- Hal In comparing the behavior of my EM64T system against Seans 32-bit systems, I see Sean's stack only joining mcast group's c0000 and c0001, the 2 created by openSM. For some reason, my stack also creates an additional 2 groups. The first one C002 appears to get created OK. The next one, it seems to try to do the join without create, as seen above. I will try compare the .config files from Sean and my system to see what are the differences in the network stack configuration, but it seems to be related to how the network stack is configured. woody From sean.hefty at intel.com Fri Dec 10 09:15:55 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 10 Dec 2004 09:15:55 -0800 Subject: [openib-general] [PATCH] MAD snooping API/implementation In-Reply-To: <001801c4dec4$9bcee3c0$6501a8c0@comcast.net> Message-ID: I presume snoop clients should also not free the MAD either. If so, should that comment also be added ? Snooping clients should never touch the MADs. There are a couple of comments to that effect elsewhere in the code, but we can add more if needed. 2. 
Should MAD snooping be exposed to user space too ? I think so. My hope was that a client would be built on top of this API to expose snooping to user-mode, since it requires copying the MAD. I'd also like to eventually take the madeye code and modify it to collect statistics and dynamically allow printing of MAD data. - Sean -------------- next part -------------- An HTML attachment was scrubbed... URL: From iod00d at hp.com Fri Dec 10 10:41:19 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 10:41:19 -0800 Subject: [openib-general] ipoib "link" failure on ia64 Message-ID: <20041210184119.GF11324@esmail.cup.hp.com> Hi Roland/et al, With the patch I posted last time I was able to ifconfig ib0/ib1. But I had problems with sending packets from one host to another. My goal was to run netperf but that doesn't seem possible yet on ia64. Both systems are running 2.6.10-rc2 + openib-1317 + set_ci patch. Both systems have both ports connected to a Topspin 12port switch. Both showed similar behaviour - following output applies to both. "ionize" is an rx2600 with proto Cougarcub Card flashed with: fw-cougarcub-a1-3.1.0.bin I will update this with fw-cougarcub-a1-3.2.0.bin that I found in a different tar ball. The other system "iowa" is an rx4640 with "HP" Cougar flashed with: hca-cougar-a1-250-157.bin I'll update that to the corresponding 3.2.0.bin as well. ionize:~# cat /sys/class/infiniband/mthca0/ports/?/state 4: ACTIVE 4: ACTIVE ionize:~# modprobe ib_ipoib ionize:~# ifconfig -a ... ionize:~# ifconfig ib0 10.0.0.2 netmask 255.255.255.0 broadcast 10.0.0.255 ionize:~# ifconfig ib1 10.0.1.2 netmask 255.255.255.0 broadcast 10.0.1.255 (Ditto for iowa but with x.x.x.1) I could ping 10.0.1.1 (from ionize) but not 10.0.0.1. Then after trying to run netperf, ping failed for 10.0.1.x subnet too. My initial guess is there are more race conditions in the code. Ideas on where I should start looking? Or other debug info wanted? grant From roland at topspin.com Fri Dec 10 10:52:43 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 10:52:43 -0800 Subject: [openib-general] Re: User MAD support for cancel MAD In-Reply-To: <1102521594.4129.42.camel@localhost.localdomain> (Hal Rosenstock's message of "Wed, 08 Dec 2004 10:59:54 -0500") References: <1102521594.4129.42.camel@localhost.localdomain> Message-ID: <52653aasuc.fsf@topspin.com> Hal> Hi Roland, It doesn't look to me like there is a way to Hal> cancel a MAD from user space. Would this be an additional Hal> ioctl to support ? This is needed from an OpenSM Hal> perspective. I guess it would be another ioctl. However I'm not sure how useful this is for userspace... in the kernel modules like IPoIB want to cancel pending sends to avoid a callback when the module is unloaded. However in userspace, an application can just close the file on exit. - R. From roland at topspin.com Fri Dec 10 10:56:07 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 10:56:07 -0800 Subject: [openib-general] Missining Infiniband class fields In-Reply-To: (shaharf@voltaire.com's message of "Thu, 9 Dec 2004 13:23:20 +0200") References: Message-ID: <521xdyasoo.fsf@topspin.com> shaharf> 2. I need an interface for setting/clearing IS_SM bit. If shaharf> there is such please let me know. If not, I would like to shaharf> add an additional ioctl to set/clear the IS_SM bit. A ctl shaharf> file in /dev/.../ports/.../ is anther option. Please tell shaharf> me what do you think. Don't add an ioctl. 
Let's create an "is_sm" file that sets the bit when opened and clears it when closed. shaharf> 3. It would help me very much if I could get an async shaharf> event on some changes, especially ports state shaharf> changes. Lid/SM lid changes events will be nice too. I am shaharf> not familiar with the HCA fw, so I don't know if it does shaharf> trigger such events. The AnafaII fw does. I would create another device special file like the umad file and use reads to get the events. - Roland From roland at topspin.com Fri Dec 10 10:56:55 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 10:56:55 -0800 Subject: [openib-general] Re: [PATCH] fix race condition in mthca event code In-Reply-To: <20041209104317.66264ea2.mshefty@ichips.intel.com> (Sean Hefty's message of "Thu, 9 Dec 2004 10:43:17 -0800") References: <20041209104317.66264ea2.mshefty@ichips.intel.com> Message-ID: <52wtvq9e2w.fsf@topspin.com> Sean> This patch fixed my problem hitting the BUG_ON code in Sean> mthca_cmd, line 328. It moves releasing the semaphore to Sean> after freeing the event entry. Unfortunately this backs out the fix for Grant's race. I need to think about the proper fix here. Thanks, Roland From roland at topspin.com Fri Dec 10 10:58:05 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 10:58:05 -0800 Subject: [openib-general] IPv6 All Router Multicast Group In-Reply-To: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> (Hal Rosenstock's message of "Fri, 10 Dec 2004 10:38:08 -0500") References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> Message-ID: <52sm6e9e0y.fsf@topspin.com> Hal> Any idea on why this group is send only ? Don't end nodes Hal> need to both send and receive on the all routers group ? IPoIB does a send only join when it gets a multicast packet to send. It will do a full member join when the kernel asks it to add a multicast group to the list of groups to receive from. - Roland From roland at topspin.com Fri Dec 10 11:24:43 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 11:24:43 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... In-Reply-To: <20041210011216.GQ7178@esmail.cup.hp.com> (Grant Grundler's message of "Thu, 9 Dec 2004 17:12:16 -0800") References: <52d5xrgfmy.fsf@topspin.com> <528y8fge6m.fsf@topspin.com> <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> <20041209234109.GN7178@esmail.cup.hp.com> <41B8E436.7050509@ichips.intel.com> <20041210000536.GO7178@esmail.cup.hp.com> <20041210011216.GQ7178@esmail.cup.hp.com> Message-ID: <52fz2e9csk.fsf@topspin.com> Grant> Roland, Attached is a patch that does three things (LART me Grant> for combining them, but I can split them up later if you Grant> want) Thanks, I applied this. Sorry about the brokenness for the past few days... (The only thing I would LART you for would be leaving out the Signed-off-by: line with your patch, but I'm not going to be anal about that yet) - R. From hnrose at earthlink.net Fri Dec 10 11:37:18 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 14:37:18 -0500 Subject: [openib-general] IPv6 All Router Multicast Group References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> <52sm6e9e0y.fsf@topspin.com> Message-ID: <002601c4deef$a9c18ac0$6501a8c0@comcast.net> Roland Dreier wrote: > IPoIB does a send only join when it gets a multicast packet to send. 
and this packet is directed at a group which is not currently joined. > It will do a full member join when the kernel asks it to add a > multicast group to the list of groups to receive from. My question was a little unclear; I meant: Why does the kernel not request a full join in the case of the all routers and IGMP groups ? What is the differentiator for this ? -- Hal From roland at topspin.com Fri Dec 10 11:43:22 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 11:43:22 -0800 Subject: [openib-general] IPv6 All Router Multicast Group In-Reply-To: <002601c4deef$a9c18ac0$6501a8c0@comcast.net> (Hal Rosenstock's message of "Fri, 10 Dec 2004 14:37:18 -0500") References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> <52sm6e9e0y.fsf@topspin.com> <002601c4deef$a9c18ac0$6501a8c0@comcast.net> Message-ID: <524qiu9bxh.fsf@topspin.com> Hal> My question was a little unclear; I meant: Why does the Hal> kernel not request a full join in the case of the all routers Hal> and IGMP groups ? What is the differentiator for this ? Not sure exactly but I would guess only routers (or nodes running routing protocols) would join these groups (as opposed to groups that all nodes have to join). On the other hand I'm not sure who is sending packets to these groups either... - R. From hnrose at earthlink.net Fri Dec 10 11:50:46 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 14:50:46 -0500 Subject: [openib-general] IPv6 All Router Multicast Group References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net><52sm6e9e0y.fsf@topspin.com><002601c4deef$a9c18ac0$6501a8c0@comcast.net> <524qiu9bxh.fsf@topspin.com> Message-ID: <005e01c4def1$8b673d20$6501a8c0@comcast.net> Roland Dreier wrote: > Hal> My question was a little unclear; I meant: Why does the > Hal> kernel not request a full join in the case of the all routers > Hal> and IGMP groups ? What is the differentiator for this ? > > Not sure exactly but I would guess only routers (or nodes running > routing protocols) would join these groups (as opposed to groups that > all nodes have to join). On the other hand I'm not sure who is > sending packets to these groups either... I would expect end nodes joining the all routers group to at least be listening to router advertisements so just joining send only doesn't make sense. If the node is behaving as a router, I would expect a full join as they would need to both send and receive. In terms of IGMP, if the end node is running multicast, it needs to send and receive as would a multicast router. So a send only join doesn't make sense to me for these groups. To state what sounds obvious, the only time a send only join makes sense would be for some send only application. Something doesn't quite seem right to me here. -- Hal From roland at topspin.com Fri Dec 10 11:59:47 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 11:59:47 -0800 Subject: [openib-general] IPv6 All Router Multicast Group In-Reply-To: <005e01c4def1$8b673d20$6501a8c0@comcast.net> (Hal Rosenstock's message of "Fri, 10 Dec 2004 14:50:46 -0500") References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> <52sm6e9e0y.fsf@topspin.com> <002601c4deef$a9c18ac0$6501a8c0@comcast.net> <524qiu9bxh.fsf@topspin.com> <005e01c4def1$8b673d20$6501a8c0@comcast.net> Message-ID: <52zn0l9b64.fsf@topspin.com> Hal> I would expect end nodes joining the all routers group to at Hal> least be listening to router advertisements so just joining Hal> send only doesn't make sense. 
If the node is behaving as a Hal> router, I would expect a full join as they would need to both Hal> send and receive. Doesn't IPv6 autoconfiguration work by having a node send a router solicit message to the all-routers group (without needing to listen to the all-routers group)? - R. From hnrose at earthlink.net Fri Dec 10 12:07:50 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 15:07:50 -0500 Subject: [openib-general] IPv6 All Router Multicast Group References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net><52sm6e9e0y.fsf@topspin.com><002601c4deef$a9c18ac0$6501a8c0@comcast.net><524qiu9bxh.fsf@topspin.com><005e01c4def1$8b673d20$6501a8c0@comcast.net> <52zn0l9b64.fsf@topspin.com> Message-ID: <007301c4def3$ed472b20$6501a8c0@comcast.net> Roland Dreier wrote: > Hal> I would expect end nodes joining the all routers group to at > Hal> least be listening to router advertisements so just joining > Hal> send only doesn't make sense. If the node is behaving as a > Hal> router, I would expect a full join as they would need to both > Hal> send and receive. > > Doesn't IPv6 autoconfiguration work by having a node send a > router solicit message to the all-routers group (without needing to > listen to the all-routers group)? Routers send router advertisements periodically. If a hosts does not receive any router advertisements in some time period, it will solicit for routers (send a router solicitation message) some number of times. So maybe IPv6 unicast routers don't need to receive on this group (hosts definitely do). For IGMP, I think both hosts and routers need to both send and receive yet we see a send only join for this group. The same thing appears to be occuring in these IPv4 (all routers and IGMP). -- Hal From mshefty at ichips.intel.com Fri Dec 10 12:10:50 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 10 Dec 2004 12:10:50 -0800 Subject: [openib-general] [PATCH] embed ib_mad_recv_buf into ib_mad_recv_wc Message-ID: <20041210121050.7e41f45e.mshefty@ichips.intel.com> This patch replaces a pointer to struct ib_mad_recv_buf in struct ib_mad_recv_wc by embedding the recv_buf directly. It saves having to allocate and dereference a pointer. Patch touches mad.c, user_mad.c, and sa_query.c + header files. 
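In essence (condensed from the ib_mad.h hunk that follows), the work completion now embeds its first receive buffer instead of pointing at it, and consumers switch from mad_recv_wc->recv_buf->mad to mad_recv_wc->recv_buf.mad:

/* After this patch (condensed from the ib_mad.h hunk below): */
struct ib_mad_recv_wc {
        struct ib_wc            *wc;
        struct ib_mad_recv_buf  recv_buf;   /* previously: struct ib_mad_recv_buf *recv_buf */
        int                     mad_len;
};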
- Sean Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 1321) +++ include/ib_mad.h (working copy) @@ -215,7 +215,7 @@ */ struct ib_mad_recv_wc { struct ib_wc *wc; - struct ib_mad_recv_buf *recv_buf; + struct ib_mad_recv_buf recv_buf; int mad_len; }; Index: core/user_mad.c =================================================================== --- core/user_mad.c (revision 1321) +++ core/user_mad.c (working copy) @@ -148,7 +148,7 @@ memset(packet, 0, sizeof *packet); - memcpy(packet->mad.data, mad_recv_wc->recv_buf->mad, sizeof packet->mad.data); + memcpy(packet->mad.data, mad_recv_wc->recv_buf.mad, sizeof packet->mad.data); packet->mad.status = 0; packet->mad.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); Index: core/mad.c =================================================================== --- core/mad.c (revision 1321) +++ core/mad.c (working copy) @@ -80,7 +80,7 @@ /* Forward declarations */ static int method_in_use(struct ib_mad_mgmt_method_table **method, struct ib_mad_reg_req *mad_reg_req); -static void remove_mad_reg_req(struct ib_mad_agent_private *priv); +static void remove_mad_reg_req(struct ib_mad_agent_private *priv); static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, struct ib_mad_private *mad); static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); @@ -696,11 +696,10 @@ mad_priv->header.recv_wc.wc = &wc; mad_priv->header.recv_wc.mad_len = sizeof(struct ib_mad); - INIT_LIST_HEAD(&mad_priv->header.recv_buf.list); - mad_priv->header.recv_buf.grh = NULL; - mad_priv->header.recv_buf.mad = &mad_priv->mad.mad; - mad_priv->header.recv_wc.recv_buf = - &mad_priv->header.recv_buf; + INIT_LIST_HEAD(&mad_priv->header.recv_wc.recv_buf.list); + mad_priv->header.recv_wc.recv_buf.grh = NULL; + mad_priv->header.recv_wc.recv_buf.mad = + &mad_priv->mad.mad; if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) snoop_recv(mad_agent_priv->qp_info, &mad_priv->header.recv_wc, @@ -906,11 +905,12 @@ * Walk receive buffer list associated with this WC * No need to remove them from list of receive buffers */ - list_for_each_entry(entry, &mad_recv_wc->recv_buf->list, list) { + list_for_each_entry(entry, &mad_recv_wc->recv_buf.list, list) { /* Free previous receive buffer */ kmem_cache_free(ib_mad_cache, priv); - mad_priv_hdr = container_of(entry, struct ib_mad_private_header, - recv_buf); + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); priv = container_of(mad_priv_hdr, struct ib_mad_private, header); } @@ -1462,7 +1462,7 @@ struct ib_mad_private *recv) { /* Until we have RMPP, all receives are reassembled!... 
*/ - INIT_LIST_HEAD(&recv->header.recv_buf.list); + INIT_LIST_HEAD(&recv->header.recv_wc.recv_buf.list); return recv; } @@ -1553,7 +1553,6 @@ struct ib_mad_private *recv, *response; struct ib_mad_list_head *mad_list; struct ib_mad_agent_private *mad_agent; - struct ib_smp *smp; int solicited; response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); @@ -1577,32 +1576,30 @@ /* Setup MAD receive work completion from "normal" work completion */ recv->header.recv_wc.wc = wc; recv->header.recv_wc.mad_len = sizeof(struct ib_mad); - recv->header.recv_wc.recv_buf = &recv->header.recv_buf; - recv->header.recv_buf.mad = (struct ib_mad *)&recv->mad; - recv->header.recv_buf.grh = &recv->grh; + recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; + recv->header.recv_wc.recv_buf.grh = &recv->grh; if (atomic_read(&qp_info->snoop_count)) snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); /* Validate MAD */ - if (!validate_mad(recv->header.recv_buf.mad, qp_info->qp->qp_num)) + if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) goto out; - if (recv->header.recv_buf.mad->mad_hdr.mgmt_class == + if (recv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - smp = (struct ib_smp *)recv->header.recv_buf.mad; - if (!smi_handle_dr_smp_recv(smp, + if (!smi_handle_dr_smp_recv(&recv->mad.smp, port_priv->device->node_type, port_priv->port_num, port_priv->device->phys_port_cnt)) goto out; - if (!smi_check_forward_dr_smp(smp)) + if (!smi_check_forward_dr_smp(&recv->mad.smp)) goto local; - if (!smi_handle_dr_smp_send(smp, + if (!smi_handle_dr_smp_send(&recv->mad.smp, port_priv->device->node_type, port_priv->port_num)) goto out; - if (!smi_check_local_dr_smp(smp, + if (!smi_check_local_dr_smp(&recv->mad.smp, port_priv->device, port_priv->port_num)) goto out; @@ -1625,7 +1622,7 @@ ret = port_priv->device->process_mad(port_priv->device, 0, port_priv->port_num, wc->slid, - recv->header.recv_buf.mad, + &recv->mad.mad, &response->mad.mad); if (ret & IB_MAD_RESULT_SUCCESS) { if (ret & IB_MAD_RESULT_CONSUMED) @@ -1642,9 +1639,8 @@ } /* Determine corresponding MAD agent for incoming receive MAD */ - solicited = solicited_mad(recv->header.recv_buf.mad); - mad_agent = find_mad_agent(port_priv, recv->header.recv_buf.mad, - solicited); + solicited = solicited_mad(&recv->mad.mad); + mad_agent = find_mad_agent(port_priv, &recv->mad.mad, solicited); if (mad_agent) { ib_mad_complete_recv(mad_agent, recv, solicited); /* Index: core/mad_priv.h =================================================================== --- core/mad_priv.h (revision 1321) +++ core/mad_priv.h (working copy) @@ -90,7 +90,6 @@ struct ib_mad_private_header { struct ib_mad_list_head mad_list; struct ib_mad_recv_wc recv_wc; - struct ib_mad_recv_buf recv_buf; DECLARE_PCI_UNMAP_ADDR(mapping) } __attribute__ ((packed)); Index: core/sa_query.c =================================================================== --- core/sa_query.c (revision 1321) +++ core/sa_query.c (working copy) @@ -728,9 +728,9 @@ if (query) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) query->callback(query, - mad_recv_wc->recv_buf->mad->mad_hdr.status ? + mad_recv_wc->recv_buf.mad->mad_hdr.status ? 
-EINVAL : 0, - (struct ib_sa_mad *) mad_recv_wc->recv_buf->mad); + (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); else query->callback(query, -EIO, NULL); } From roland at topspin.com Fri Dec 10 12:29:58 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 12:29:58 -0800 Subject: [openib-general] Re: [PATCH] embed ib_mad_recv_buf into ib_mad_recv_wc In-Reply-To: <20041210121050.7e41f45e.mshefty@ichips.intel.com> (Sean Hefty's message of "Fri, 10 Dec 2004 12:10:50 -0800") References: <20041210121050.7e41f45e.mshefty@ichips.intel.com> Message-ID: <52vfb999rt.fsf@topspin.com> This patch seems fine to apply to me. - R. From hnrose at earthlink.net Fri Dec 10 13:10:22 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 16:10:22 -0500 Subject: [openib-general] A Couple of verbs/mthca/firmware questions Message-ID: <004001c4defc$a9f2c100$6501a8c0@comcast.net> Hi, I have a couple of questions relative to verbs/mthca/firmware: 1. On received packets, the LRH is not visible but there are some WC fields set. One of those fields is dlid_path_bits. dlid_path_bits seems to be the same regardless of whether the incoming DR SMP was sent to the actual DLID or the permissive DLID. Is there a way to distinguish these cases ? 2. If a status is set in the MAD header and a post_send is issued on that MAD, should that status be sent in the packet ? Is there any conversion of the status field which might occur ? Thanks. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From iod00d at hp.com Fri Dec 10 13:22:19 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 13:22:19 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! :^) Message-ID: <20041210212219.GK11324@esmail.cup.hp.com> /me bounces! With openib-1321, ip_ipoib is seems to be working! I couldn't reproduce the problem with ping not working sometimes. Current issue is misaligned accesses in the kernel. Here's a "cleaner" set of data. ionize:/usr/src/linux-ia64-release-2.6.10# modprobe ib_mthca ib_mthca: Mellanox InfiniBand HCA driver v0.06-pre (November 8, 2004) ib_mthca: Initializing Mellanox Technology MT23108 InfiniHost (0000:81:00.0) GSI 60 (level, low) -> CPU 0 (0x0000) vector 67 ACPI: PCI interrupt 0000:81:00.0[A] -> GSI 60 (level, low) -> IRQ 67 ionize:/usr/src/linux-ia64-release-2.6.10# elilo -v --efiboot ionize:/usr/src/linux-ia64-release-2.6.10# modprobe ib_ipoib ionize:/usr/src/linux-ia64-release-2.6.10# cat /sys/class/infiniband/mthca0/ports/?/state 4: ACTIVE 4: ACTIVE ionize:/usr/src/linux-ia64-release-2.6.10# ifconfig ib0 10.0.0.2 netmask 255.255.255.0 broadcast 10.0.0.255 ionize:/usr/src/linux-ia64-release-2.6.10# ifconfig ib1 10.0.1.2 netmask 255.255.255.0 broadcast 10.0.1.255 ionize:/usr/src/linux-ia64-release-2.6.10# ping 10.0.0.1 PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data. 
kernel unaligned access to 0xe0000002ff5fe05c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fe05c, ip=0xa0000002001be010 kernel unaligned access to 0xe0000002ff5fe05c, ip=0xa0000002001bef10 64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=14.4 ms kernel unaligned access to 0xe0000002ff5fe05c, ip=0xa0000002001bef10 64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.571 ms 64 bytes from 10.0.0.1: icmp_seq=3 ttl=64 time=0.069 ms 64 bytes from 10.0.0.1: icmp_seq=4 ttl=64 time=0.067 ms 64 bytes from 10.0.0.1: icmp_seq=5 ttl=64 time=0.069 ms 64 bytes from 10.0.0.1: icmp_seq=6 ttl=64 time=0.068 ms --- 10.0.0.1 ping statistics --- 6 packets transmitted, 6 received, 0% packet loss, time 5001ms rtt min/avg/max/mdev = 0.067/2.551/14.463/5.330 ms ionize:/usr/src/linux-ia64-release-2.6.10# ionize:/usr/src/linux-ia64-release-2.6.10# cd /opt/netperf/ ionize:/opt/netperf# ls netperf snapshot_script tcp_rr_script udp_rr_script netserver tcp_range_script tcp_stream_script udp_stream_script ionize:/opt/netperf# ./snapshot_script 10.0.1.1 Netperf snapshot script started at Fri Dec 10 13:00:02 PST 2004 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 kernel unaligned access to 0xe0000002ff5fd85c, ip=0xa0000002001bef10 ... misaligned accesses reports are rate limited by the kernel. The above is just the tip of the iceberg. a0000002001bee60 t ipoib_start_xmit [ib_ipoib] a0000002001bf880 t ipoib_get_stats [ib_ipoib] The "netserver" (rx4640) is getting the following: kernel unaligned access to 0xe0000001008b0f5c, ip=0xa000000200152f10 a000000200152e60 t ipoib_start_xmit [ib_ipoib] a000000200153880 t ipoib_get_stats [ib_ipoib] based on IP and offset (0x5c) I'll guess this is the same problem on both sides. Still looking at it. FYA, Starting 32x4 TCP_STREAM tests at Fri Dec 10 13:09:36 PST 2004 ------------------------------------ Testing with the following command line: /opt/netperf/netperf -t TCP_STREAM -l 60 -H 10.0.1.1 -i 10,3 -I 99,5 -- -s 32768 -S 32768 -m 4096 ... Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 262142 262142 4096 60.00 1164.65 Fixing the alignment issue should help here. Then I can start drilling a bit deeper on bottlenecks. hth, grant From iod00d at hp.com Fri Dec 10 12:58:30 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 12:58:30 -0800 Subject: [openib-general] HP ZX1 and HP IB cards... 
In-Reply-To: <52fz2e9csk.fsf@topspin.com> References: <20041204021819.GG16522@esmail.cup.hp.com> <524qj2hkcr.fsf@topspin.com> <52zn0ug5pk.fsf@topspin.com> <20041206173954.GF26198@esmail.cup.hp.com> <52pt1ndzh4.fsf@topspin.com> <20041209234109.GN7178@esmail.cup.hp.com> <41B8E436.7050509@ichips.intel.com> <20041210000536.GO7178@esmail.cup.hp.com> <20041210011216.GQ7178@esmail.cup.hp.com> <52fz2e9csk.fsf@topspin.com> Message-ID: <20041210205830.GJ11324@esmail.cup.hp.com> On Fri, Dec 10, 2004 at 11:24:43AM -0800, Roland Dreier wrote: > Grant> Roland, Attached is a patch that does three things (LART me > Grant> for combining them, but I can split them up later if you > Grant> want) > > Thanks, I applied this. Sorry about the brokenness for the past few days... > > (The only thing I would LART you for would be leaving out the Signed-off-by: > line with your patch, but I'm not going to be anal about that yet) Yes - I definitely know better - apologies and thanks. Here's one for the archive in case it comes up: Signed-off-by: Grant Grundler (for the add "set_ci"/remove "work"/backout "cmd_complete" patch I submitted yesterday) thanks, grant From hnrose at earthlink.net Fri Dec 10 13:58:53 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 16:58:53 -0500 Subject: [openib-general] [PATCH] embed ib_mad_recv_buf into ib_mad_recv_wc Message-ID: <006001c4df03$729cef80$6501a8c0@comcast.net> Thanks. Applied. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.j.woodruff at intel.com Fri Dec 10 14:07:38 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 10 Dec 2004 14:07:38 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F000306A861@orsmsx408> >This is the IPv4 equivalent of what the previous post on All Routers >Multicast Group. I was able to get rid of the multicast join errors by removing various network services from starting up. Now I see no errors in joining multicast groups, since I am only joining the 2 default ones, but I still cannot ping on EM64T with the PCI-E HCAs. I am running the latest F/W I received from Mellanox 4.6.0-rc4. I cannot back up to the 4.3.5 firmware or any other firmware rev, because the MST tools don't work on my 2.6 based system. I will try running the same S/W on an EM64T system but with a PCI-X card to see if it is the PCI-E card, which I suspect. From hnrose at earthlink.net Fri Dec 10 14:12:06 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 17:12:06 -0500 Subject: [openib-general] IPoIB still not working References: <1AC79F16F5C5284499BB9591B33D6F000306A861@orsmsx408> Message-ID: <007201c4df05$49a50ac0$6501a8c0@comcast.net> Woodruff, Robert J wrote: > I was able to get rid of the multicast join errors by removing various > network services > from starting up. Now I see no errors in joining multicast > groups, since I am only joining the 2 default ones, but I still cannot > ping on EM64T > with the PCI-E HCAs. Those join errors were not on groups which would have anything to do with ping. The only group which matters for that is the broadcast group. You now have the same problem as others have reported with PCI-E HCAs. I believe the 4.6 released firmware (RSN) is supposed to fix this. > I cannot back up to the 4.3.5 firmware or any other firmware rev, > because the MST tools don't work on my 2.6 based system. I thought there was a workaround patch for the invariant sector using tvflash. 
-- Hal From roland at topspin.com Fri Dec 10 14:18:24 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 14:18:24 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! :^) In-Reply-To: <20041210212219.GK11324@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 10 Dec 2004 13:22:19 -0800") References: <20041210212219.GK11324@esmail.cup.hp.com> Message-ID: <52k6rp94r3.fsf@topspin.com> Grant> /me bounces! With openib-1321, ip_ipoib is seems to be working! Cool. (I was wondering whether the earlier problems were because each system had two interfaces on the same broadcast domain and hence maybe responding to ARPs from the wrong interface. I wonder whether /proc/sys/net/ipv4/conf/ibX/arp_filter might have helped...) Grant> Current issue is misaligned accesses in the kernel. Hmm... what's offsetof(struct neighbour, ha) on ia64? (I'll check for myself a little later but I think the problem may be stashing a pointer at neigh->ha + 24) My current tree has a lot of local changes so it's a little hard for me to generate a patch, but does changing the body of to_ipoib_neigh() to the following help? static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) { return (struct ipoib_neigh **) (neigh->ha + 24 - (offsetof(struct neighbour, ha) & 4)); } - Roland From roland at topspin.com Fri Dec 10 15:33:09 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 15:33:09 -0800 Subject: [openib-general] A Couple of verbs/mthca/firmware questions In-Reply-To: <004001c4defc$a9f2c100$6501a8c0@comcast.net> (Hal Rosenstock's message of "Fri, 10 Dec 2004 16:10:22 -0500") References: <004001c4defc$a9f2c100$6501a8c0@comcast.net> Message-ID: <527jnp91ai.fsf@topspin.com> Hal> 1. On received packets, the LRH is not visible but there are Hal> some WC fields set. One of those fields is Hal> dlid_path_bits. dlid_path_bits seems to be the same Hal> regardless of whether the incoming DR SMP was sent to the Hal> actual DLID or the permissive DLID. Is there a way to Hal> distinguish these cases ? Not that I know of... as far as I know, the IB spec doesn't have a way to distinguish this at the verbs level. Hal> 2. If a status is set in the MAD header and a post_send is Hal> issued on that MAD, should that status be sent in the packet Hal> ? Is there any conversion of the status field which might Hal> occur ? I think the payload of the UD packet (ie the full 256 byte MAD packet) should be put on the wire unchanged. - Roland From iod00d at hp.com Fri Dec 10 15:36:54 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 15:36:54 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! 
:^) In-Reply-To: <20041210212219.GK11324@esmail.cup.hp.com> References: <20041210212219.GK11324@esmail.cup.hp.com> Message-ID: <20041210233654.GN11324@esmail.cup.hp.com> On Fri, Dec 10, 2004 at 01:22:19PM -0800, Grant Grundler wrote: > The "netserver" (rx4640) is getting the following: > kernel unaligned access to 0xe0000001008b0f5c, ip=0xa000000200152f10 > > a000000200152e60 t ipoib_start_xmit [ib_ipoib] > a000000200153880 t ipoib_get_stats [ib_ipoib] f10 - e60 == 0xb0 if (!spin_trylock(&priv->tx_lock)) { local_irq_restore(flags); return NETDEV_TX_LOCKED; } 1e3c: 01 40 20 e6 cmp4.eq p9,p8=0,r8 1e40: 1c 48 01 42 00 21 [MFB] mov r41=r33 1e46: 00 00 00 02 80 04 nop.f 0x0 1e4c: 50 09 00 43 (p09) br.cond.dpnt.few 2790 if (skb->dst && skb->dst->neighbour) { 1e50: 0b 50 01 40 00 21 [MMI] mov r42=r32;; 1e56: e0 00 50 30 20 00 ld8 r14=[r20] 1e5c: 00 00 04 00 nop.i 0x0;; 1e60: 11 40 40 1c 01 21 [MIB] adds r8=144,r14 1e66: 80 00 38 12 72 04 cmp.eq p8,p9=0,r14 1e6c: 60 03 00 42 (p08) br.cond.dptk.few 21c0 ;; 1e70: 0a 78 00 10 18 10 [MMI] ld8 r15=[r8];; 1e76: 90 e0 3e 00 42 40 adds r9=92,r15 1e7c: 01 78 2c e4 cmp.eq p10,p11=0,r15 1e80: 11 80 10 1f 00 21 [MIB] adds r16=68,r15 1e86: 00 00 00 02 00 05 nop.i 0x0 1e8c: 40 03 00 42 (p10) br.cond.dptk.few 21c0 ;; if (unlikely(!*to_ipoib_path(skb->dst->neighbour))) { 1e90: 1d 70 00 12 18 10 [MFB] ld8 r14=[r9] 0x1e90 is the faulting insn. Roland, yeah - this is because of the "neigh->ha + 24" issue. I'll play with it. thanks, grant From roland at topspin.com Fri Dec 10 15:41:13 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 15:41:13 -0800 Subject: [openib-general] IPv6 All Router Multicast Group In-Reply-To: <007301c4def3$ed472b20$6501a8c0@comcast.net> (Hal Rosenstock's message of "Fri, 10 Dec 2004 15:07:50 -0500") References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> <52sm6e9e0y.fsf@topspin.com> <002601c4deef$a9c18ac0$6501a8c0@comcast.net> <524qiu9bxh.fsf@topspin.com> <005e01c4def1$8b673d20$6501a8c0@comcast.net> <52zn0l9b64.fsf@topspin.com> <007301c4def3$ed472b20$6501a8c0@comcast.net> Message-ID: <523byd90x2.fsf@topspin.com> Hal> Routers send router advertisements periodically. yes. Hal> If a hosts does not receive any router advertisements in some Hal> time period, it will solicit for routers (send a router Hal> solicitation message) some number of times. yes, it will send router solicitation messages to the all-routers group. Hal> So maybe IPv6 unicast routers don't need to receive on this Hal> group (hosts definitely do). I think unicast routers would need to listen to the all-routers group. However, since router advertisements are sent to the all-nodes group, a typical IPv6 end node does not need to listen to the all-routers group (which is why the kernel doesn't join the all-routers group by default). Hal> For IGMP, I think both hosts and routers need to both send Hal> and receive yet we see a send only join for this group. OK, not sure how IGMP works on the host side. Hal> The same thing appears to be occuring in these IPv4 (all Hal> routers and IGMP). The only IPv4 group I see my systems joining is the broadcast group -- maybe you have more config options turned on than I do; I'm running # CONFIG_IP_ADVANCED_ROUTER is not set # CONFIG_IP_MROUTE is not set - Roland From roland at topspin.com Fri Dec 10 15:44:25 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 15:44:25 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! 
:^) In-Reply-To: <20041210233654.GN11324@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 10 Dec 2004 15:36:54 -0800") References: <20041210212219.GK11324@esmail.cup.hp.com> <20041210233654.GN11324@esmail.cup.hp.com> Message-ID: <52y8g57m7a.fsf@topspin.com> Grant> Roland, yeah - this is because of the "neigh->ha + 24" Grant> issue. I'll play with it. Try changing to_ipoib_neigh() (in ipoib.h) to this: static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) { return (struct ipoib_neigh **) (neigh->ha + 24 - (offsetof(struct neighbour, ha) & 4)); } I think this should fix it. I have a big patch for IPoIB coming soon (fixes the neighbour lifetime issues as well as unicast ARP), and this will be part of it. I'd be curious how much this boosts your performance (it would be at least one unaligned trap per packet, so it's probably a big deal). - R. From hnrose at earthlink.net Fri Dec 10 15:48:33 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 18:48:33 -0500 Subject: [openib-general] IPv6 All Router Multicast Group References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net><52sm6e9e0y.fsf@topspin.com><002601c4deef$a9c18ac0$6501a8c0@comcast.net><524qiu9bxh.fsf@topspin.com><005e01c4def1$8b673d20$6501a8c0@comcast.net><52zn0l9b64.fsf@topspin.com><007301c4def3$ed472b20$6501a8c0@comcast.net> <523byd90x2.fsf@topspin.com> Message-ID: <002e01c4df12$c293fba0$6501a8c0@comcast.net> Roland Dreier wrote: > Hal> So maybe IPv6 unicast routers don't need to receive on this > Hal> group (hosts definitely do). > > I think unicast routers would need to listen to the all-routers > group. Do they need to know the other routers ? Isn't that what the routing protocols (RIP, OSPF, ...) are about ? > However, since router advertisements are sent to the all-nodes > group, a typical IPv6 end node does not need to listen to the > all-routers group (which is why the kernel doesn't join the > all-routers group by default). So an end node doesn't find available routers via this group ? What you are saying is consistent with the observations for this and the router would create the group and there is no one to send to if the group isn't there (send only join). > Hal> For IGMP, I think both hosts and routers need to both send > Hal> and receive yet we see a send only join for this group. > > OK, not sure how IGMP works on the host side. A host needs to tell the multicast router when it is joining and leaving a group so the router knows when to join the multicast tree for that group or prune the tree. > Hal> The same thing appears to be occuring in these IPv4 (all > Hal> routers and IGMP). 
> > The only IPv4 group I see my systems joining is the broadcast group -- > maybe you have more config options turned on than I do; I'm running > > # CONFIG_IP_ADVANCED_ROUTER is not set > # CONFIG_IP_MROUTE is not set I have the same for those options with also CONFIG_IP_MULTICAST = y -- Hal From roland at topspin.com Fri Dec 10 16:23:46 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 16:23:46 -0800 Subject: [openib-general] IPv6 All Router Multicast Group In-Reply-To: <002e01c4df12$c293fba0$6501a8c0@comcast.net> (Hal Rosenstock's message of "Fri, 10 Dec 2004 18:48:33 -0500") References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net> <52sm6e9e0y.fsf@topspin.com> <002601c4deef$a9c18ac0$6501a8c0@comcast.net> <524qiu9bxh.fsf@topspin.com> <005e01c4def1$8b673d20$6501a8c0@comcast.net> <52zn0l9b64.fsf@topspin.com> <007301c4def3$ed472b20$6501a8c0@comcast.net> <523byd90x2.fsf@topspin.com> <002e01c4df12$c293fba0$6501a8c0@comcast.net> Message-ID: <52u0qt7kdp.fsf@topspin.com> Hal> So an end node doesn't find available routers via this group Hal> ? What you are saying is consistent with the observations Hal> for this and the router would create the group and there is Hal> no one to send to if the group isn't there (send only join). Yes, see RFC 2461 -- router advertisements are sent to the all-nodes group (although to be precise, solicited advertisements MAY be unicast to the requestor). Hal> A host needs to tell the multicast router when it is joining Hal> and leaving a group so the router knows when to join the Hal> multicast tree for that group or prune the tree. If the host does not need to receive any multicast messages, then only the multicast routers would need to join the group. - R. From robert.j.woodruff at intel.com Fri Dec 10 16:28:41 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 10 Dec 2004 16:28:41 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F000306AAC3@orsmsx408> >You now have the same problem as others have reported with PCI-E HCAs. >I believe the 4.6 released firmware (RSN) is supposed to fix this. I tried replacing the PCI-E card with a PCI-X card without changing any software and ping works just fine with the PCI-X card on my EM64T system. As I mentioned before, I was running 5.4.6-rc4 in the PCI-E card and it still fails, so there is some other issue with the PCI-E card that needs to be investigated. woody From iod00d at hp.com Fri Dec 10 16:42:09 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 16:42:09 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! :^) In-Reply-To: <52k6rp94r3.fsf@topspin.com> References: <20041210212219.GK11324@esmail.cup.hp.com> <52k6rp94r3.fsf@topspin.com> Message-ID: <20041211004209.GO11324@esmail.cup.hp.com> On Fri, Dec 10, 2004 at 02:18:24PM -0800, Roland Dreier wrote: > Grant> /me bounces! With openib-1321, ip_ipoib is seems to be working! > > Cool. > > (I was wondering whether the earlier problems were because each system > had two interfaces on the same broadcast domain and hence maybe > responding to ARPs from the wrong interface. Yes, you are, as usual :^), probably right. The initial broadcast mask was 10.255.255.255 because I didn't specify one when I ran ifconfig (with params) for the first time. I checked what I had done by running ifconfig (no params) and realized the error. I ran ifconfig (with params) again for both ib0/1 and this time specified the broadcast address and netmask. 
It's likely it didn't recover from that. But I expect this issue is not ia64 specific and anyone should be able to reproduce it. > I wonder whether > /proc/sys/net/ipv4/conf/ibX/arp_filter might have helped...) sorry - I don't understand networking protocols well enough to know what you are alluding to here. But if you are already aware of the issue and fixing it... > Grant> Current issue is misaligned accesses in the kernel. > > Hmm... what's offsetof(struct neighbour, ha) on ia64? By hand, I counted 68. It should be in the asm I posted earlier. > (I'll check for > myself a little later but I think the problem may be stashing a > pointer at neigh->ha + 24) yes, I'm pretty sure it is. > My current tree has a lot of local changes so it's a little hard for > me to generate a patch, but does changing the body of to_ipoib_neigh() > to the following help? > > static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) > { > return (struct ipoib_neigh **) (neigh->ha + 24 - > (offsetof(struct neighbour, ha) & 4)); > } Sorry, I don't have this function in my tree...that's probably part of the changes you want to commit. I think it's called to_ipoib_path() in the exist tree: static inline struct ipoib_path **to_ipoib_path(struct neighbour *neigh) { return (struct ipoib_path **) (neigh->ha + 24); } I've add the same bit to it that you have above and that does avoid the misaligned access. Trying to unload the module didn't go smoothly either: ionize:/opt/netperf# ifconfig ib0 down ionize:/opt/netperf# ifconfig ib1 down ionize:/opt/netperf# rmmod ib_ipoib ib1: ib_dealloc_pd failed Not sure what caused that hiccup but I was able to unload everything else and reload the new modules just fine. In another email you commented: > I'd be curious how much this boosts your performance (it would be at > least one unaligned trap per packet, so it's probably a big deal). That's what I expected too. But not for this particular test: Starting 56x4 TCP_STREAM tests at Fri Dec 10 15:58:12 PST 2004 /opt/netperf/netperf -t TCP_STREAM -l 60 -H 10.0.1.1 -i 10,3 -I 99,5 -- -s 57344 -S 57344 -m 4096 TCP STREAM TEST to 10.0.1.1 : +/-2.5% @ 99% conf. !!! WARNING !!! Desired confidence was not achieved within the specified iterations. !!! This implies that there was variability in the test environment that !!! must be investigated before going further. !!! Confidence intervals: Throughput : 5.6% !!! Local CPU util : 0.0% !!! Remote CPU util : 0.0% Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 262142 262142 4096 60.00 1232.26 That's something like ~10% improvment. I ran the same test again but limited it to three passes *AND* under control of pfmon...about the same result (1221.12 Mbps) CPU0 16948574 UC_LOADS_RETIRED CPU1 13211 UC_LOADS_RETIRED *sigh*...forgot to grab IRQ counts. Again, but with one iteration, (60 seconds): 67: 149648661 0 IO-SAPIC-level ib_mthca pfmon -e uc_loads_retired -k --system-wide -- /opt/netperf/netperf -t TCP_STREAM -l 60 -H 10.0.1.1 -i 1,1 -- -s 57344 -S 57344 -m 4096 ... 114688 114688 4096 60.00 1170.88 CPU0 5656170 UC_LOADS_RETIRED CPU1 4466 UC_LOADS_RETIRED ionize:/opt/netperf# cat /proc/interrupts | fgrep mthca 67: 152474464 0 IO-SAPIC-level ib_mthca 5656170/(152474464-149648661) ~= 2 two uncached reads per Interrupt. That's what e1000 driver is doing today. No wonder we aren't much faster. I was expecting zero uncached reads from IB in the interrupt path. 
Fix that and we'll get back ~8 seconds of CPU time for a 60 second test. 47096 interrupts/second. We should be able to do 2x that on this box at least. (1.5GHz Madison) I'll dig up the other trivial things with pfmon. BTW, if anyone has another favorite trivial test or netperf parameters, I'd be happy to collect pfmon, q-syscollect, prospect, or oprofile output. thanks, grant From hnrose at earthlink.net Fri Dec 10 16:55:25 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 19:55:25 -0500 Subject: [openib-general] IPv6 All Router Multicast Group References: <007d01c4dece$3fd6fda0$6501a8c0@comcast.net><52sm6e9e0y.fsf@topspin.com><002601c4deef$a9c18ac0$6501a8c0@comcast.net><524qiu9bxh.fsf@topspin.com><005e01c4def1$8b673d20$6501a8c0@comcast.net><52zn0l9b64.fsf@topspin.com><007301c4def3$ed472b20$6501a8c0@comcast.net><523byd90x2.fsf@topspin.com><002e01c4df12$c293fba0$6501a8c0@comcast.net> <52u0qt7kdp.fsf@topspin.com> Message-ID: <001901c4df1c$19ac0a00$6501a8c0@comcast.net> Roland Dreier wrote: > Hal> A host needs to tell the multicast router when it is joining > Hal> and leaving a group so the router knows when to join the > Hal> multicast tree for that group or prune the tree. > > If the host does not need to receive any multicast messages, then only > the multicast routers would need to join the group. In IGMP, the host needs to both listen (for queries) and send (reports, leave). The inverse is true for mrouters. It seems like they both need to join the group if they are running IGMP of some version. IGMPv1 does not support leave. -- Hal From roland at topspin.com Fri Dec 10 18:23:36 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 10 Dec 2004 18:23:36 -0800 Subject: [openib-general] [PATCH] IPoIB neighbour fixes Message-ID: <52mzwl7etz.fsf@topspin.com> I just committed this patch, which fixes both the "path mismatch for unicast ARP" and "neighbour destructor after rmmod" issues. I've tested rmmod'ing ipoib with traffic running with this patch, and it survives fine. I now have everything that I wanted to get done, and I'm planning on submitting patches to lkml again on Monday. - R. 
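Before the diff, it may help to restate the neighbour bookkeeping trick the patch relies on, condensed from earlier messages in this thread. The struct below is abbreviated and illustrative, not the exact patch code; the helper is the one Roland posted, with the offsetof() adjustment that keeps the stashed pointer naturally aligned and avoids the ia64 unaligned-access traps Grant reported.

/* Illustrative sketch only -- field list abbreviated from the patch. */
struct ipoib_neigh {
        struct ipoib_ah    *ah;         /* set once the path record resolves */
        struct sk_buff_head queue;      /* packets queued while resolving */
        struct neighbour   *neighbour;
        struct list_head    list;       /* entry on the path's neigh_list */
};

/* The IPoIB hardware address uses ha[0..19] (QPN + GID); the private
 * pointer is stashed in the spare bytes after it. Subtracting
 * (offsetof(struct neighbour, ha) & 4) keeps that slot 8-byte aligned
 * no matter where ha falls within struct neighbour.
 */
static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh)
{
        return (struct ipoib_neigh **) (neigh->ha + 24 -
                                        (offsetof(struct neighbour, ha) & 4));
}

The pointer is set on the first transmit (*to_ipoib_neigh(skb->dst->neighbour) = neigh;) and cleared, together with the neighbour's ops->destructor, when paths are flushed, which is what lets the module unload safely while neighbours are still alive.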
Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1320) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -149,22 +149,116 @@ return 0; } +static struct ipoib_path *__path_find(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->path_tree.rb_node; + struct ipoib_path *path; + int ret; + + while (n) { + path = rb_entry(n, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, gid->raw, + sizeof (union ib_gid)); + + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return path; + } + + return NULL; +} + +static int __path_add(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->path_tree.rb_node; + struct rb_node *pn = NULL; + struct ipoib_path *tpath; + int ret; + + while (*n) { + pn = *n; + tpath = rb_entry(pn, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, tpath->pathrec.dgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&path->rb_node, pn, n); + rb_insert_color(&path->rb_node, &priv->path_tree); + + list_add_tail(&path->list, &priv->path_list); + + return 0; +} + +static void __path_free(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tn; + struct sk_buff *skb; + + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_irq(skb); + + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + if (path->ah) + ipoib_put_ah(path->ah); + + rb_erase(&path->rb_node, &priv->path_tree); + list_del(&path->list); + kfree(path); +} + +void ipoib_flush_paths(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path, *tp; + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(path, tp, &priv->path_list, list) + __path_free(dev, path); + + spin_unlock_irqrestore(&priv->lock, flags); +} + static void path_rec_completion(int status, struct ib_sa_path_rec *pathrec, void *path_ptr) { struct ipoib_path *path = path_ptr; struct ipoib_dev_priv *priv = netdev_priv(path->dev); + struct ipoib_ah *ah = NULL; + struct ipoib_neigh *neigh; + struct sk_buff_head skqueue; struct sk_buff *skb; - struct ipoib_ah *ah; + unsigned long flags; ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); - if (status != IB_WC_SUCCESS) - goto err; - - { + if (status == IB_WC_SUCCESS) { struct ib_ah_attr av = { .dlid = be16_to_cpu(pathrec->dlid), .sl = pathrec->sl, @@ -177,215 +271,216 @@ ah = ipoib_create_ah(path->dev, priv->pd, &av); } - if (!ah) - goto err; + spin_lock_irqsave(&priv->lock, flags); path->ah = ah; - ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", - ah, pathrec->dlid, pathrec->sl); + if (ah) { + path->pathrec = *pathrec; - while ((skb = __skb_dequeue(&path->queue))) { + ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", + ah, be16_to_cpu(pathrec->dlid), pathrec->sl); + + skb_queue_head_init(&skqueue); + + while ((skb = 
__skb_dequeue(&path->queue))) + __skb_queue_tail(&skqueue, skb); + + list_for_each_entry(neigh, &path->neigh_list, list) { + neigh->ah = path->ah; + kref_get(&path->ah->ref); + + while ((skb = __skb_dequeue(&neigh->queue))) + __skb_queue_tail(&skqueue, skb); + } + } else + path->query = NULL; + + + complete(&path->done); + + spin_unlock_irqrestore(&priv->lock, flags); + + while ((skb = __skb_dequeue(&skqueue))) { skb->dev = path->dev; if (dev_queue_xmit(skb)) ipoib_warn(priv, "dev_queue_xmit failed " "to requeue packet\n"); } - - return; - -err: - while ((skb = __skb_dequeue(&path->queue))) - dev_kfree_skb(skb); - - if (path->neighbour) - *to_ipoib_path(path->neighbour) = NULL; - - kfree(path); } -static void path_rec_start(struct sk_buff *skb, struct net_device *dev) +static struct ipoib_path *path_rec_create(struct net_device *dev, + union ib_gid *gid) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_path *path = kmalloc(sizeof *path, GFP_ATOMIC); - struct ib_sa_path_rec rec = { - .numb_path = 1 - }; - struct ib_sa_query *query; + struct ipoib_path *path; + path = kmalloc(sizeof *path, GFP_ATOMIC); if (!path) - goto err; + return NULL; - path->ah = NULL; path->dev = dev; + path->pathrec.dlid = 0; + skb_queue_head_init(&path->queue); - __skb_queue_tail(&path->queue, skb); - path->neighbour = NULL; - rec.sgid = priv->local_gid; - memcpy(rec.dgid.raw, skb->dst->neighbour->ha + 4, 16); - rec.pkey = cpu_to_be16(priv->pkey); + INIT_LIST_HEAD(&path->neigh_list); + path->query = NULL; + init_completion(&path->done); - /* - * XXX there's a race here if path record completion runs - * before we get to finish up. Add a lock to path struct? - */ - if (ib_sa_path_rec_get(priv->ca, priv->port, &rec, - IB_SA_PATH_REC_DGID | - IB_SA_PATH_REC_SGID | - IB_SA_PATH_REC_NUMB_PATH | - IB_SA_PATH_REC_PKEY, - 1000, GFP_ATOMIC, - path_rec_completion, - path, &query) < 0) { - ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); - goto err; - } + memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; - path->neighbour = skb->dst->neighbour; - *to_ipoib_path(skb->dst->neighbour) = path; - return; + __path_add(dev, path); -err: - kfree(path); - ++priv->stats.tx_dropped; - dev_kfree_skb_any(skb); + return path; } -static void path_lookup(struct sk_buff *skb, struct net_device *dev) +static int path_rec_start(struct net_device *dev, + struct ipoib_path *path) { - struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); - /* Look up path record for unicasts */ - if (skb->dst->neighbour->ha[4] != 0xff) { - path_rec_start(skb, dev); - return; + path->query_id = + ib_sa_path_rec_get(priv->ca, priv->port, + &path->pathrec, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + 1000, GFP_ATOMIC, + path_rec_completion, + path, &path->query); + if (path->query_id < 0) { + ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); + path->query = NULL; + return path->query_id; } - /* Add in the P_Key */ - skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; - skb->dst->neighbour->ha[9] = priv->pkey & 0xff; - ipoib_mcast_send(dev, - (union ib_gid *) (skb->dst->neighbour->ha + 4), - skb); + return 0; } -static void unicast_arp_completion(int status, - struct ib_sa_path_rec *pathrec, - void *skb_ptr) +static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) { - struct sk_buff *skb = skb_ptr; - 
struct ipoib_dev_priv *priv = netdev_priv(skb->dev); - struct ipoib_ah *ah; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + struct ipoib_neigh *neigh; - ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", - status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + return; + } - if (status) - goto err; + skb_queue_head_init(&neigh->queue); + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; - { - struct ib_ah_attr av = { - .dlid = be16_to_cpu(pathrec->dlid), - .sl = pathrec->sl, - .src_path_bits = 0, - .static_rate = 0, - .ah_flags = 0, - .port_num = priv->port - }; + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); - ah = ipoib_create_ah(skb->dev, priv->pd, &av); + path = __path_find(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) + goto err; } - if (!ah) - goto err; + list_add_tail(&neigh->list, &path->neigh_list); - *(struct ipoib_ah **) skb->cb = ah; + if (path->pathrec.dlid) { + neigh->ah = path->ah; + kref_get(&path->ah->ref); - if (dev_queue_xmit(skb)) - ipoib_warn(priv, "dev_queue_xmit failed " - "to requeue ARP packet\n"); + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + } else if (!path->query) { + neigh->ah = NULL; + __skb_queue_tail(&neigh->queue, skb); + if (path_rec_start(dev, path)) + goto err; + } + spin_unlock(&priv->lock); return; err: - dev_kfree_skb(skb); + *to_ipoib_neigh(skb->dst->neighbour) = NULL; + list_del(&neigh->list); + kfree(neigh); + neigh->neighbour->ops->destructor = NULL; + + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + + spin_unlock(&priv->lock); } -static void unicast_arp_finish(struct sk_buff *skb) +static void path_lookup(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(skb->dev); - struct ipoib_ah *ah = *(struct ipoib_ah **) skb->cb; - unsigned long flags; - if (ah) { - spin_lock_irqsave(&priv->lock, flags); - list_add_tail(&ah->list, &priv->dead_ahs); - spin_unlock_irqrestore(&priv->lock, flags); + /* Look up path record for unicasts */ + if (skb->dst->neighbour->ha[4] != 0xff) { + neigh_add_path(skb, dev); + return; } + + /* Add in the P_Key for multicasts */ + skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; + skb->dst->neighbour->ha[9] = priv->pkey & 0xff; + ipoib_mcast_send(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4), skb); } -/* - * For unicast packets with no skb->dst->neighbour (unicast ARPs are - * the main example), we fire off a path record query for each packet. - * This is pretty bad for scalability (since this is going to hammer - * the SM on a big fabric) but it's the best I can think of for now. - * - * Also we might have a problem if a path changes, because ARPs will - * still go through (since we'll get the new path from the SM for - * these queries) so we'll never update the neighbour. 
- */ -static void unicast_arp_start(struct sk_buff *skb, struct net_device *dev, - struct ipoib_pseudoheader *phdr) +static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, + struct ipoib_pseudoheader *phdr) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct sk_buff *tmp_skb; - struct ib_sa_path_rec rec = { - .numb_path = 1 - }; - struct ib_sa_query *query; + struct ipoib_path *path; - if (skb->destructor) { - tmp_skb = skb; - skb = skb_clone(tmp_skb, GFP_ATOMIC); - dev_kfree_skb_any(tmp_skb); - if (!skb) { + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (phdr->hwaddr + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (phdr->hwaddr + 4)); + if (path) { + __skb_queue_tail(&path->queue, skb); + + if (path_rec_start(dev, path)) + __path_free(dev, path); + } else { ++priv->stats.tx_dropped; - return; + dev_kfree_skb_any(skb); } + + spin_unlock(&priv->lock); + return; } - skb->dev = dev; - skb->destructor = unicast_arp_finish; - memset(skb->cb, 0, sizeof skb->cb); + ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); - rec.sgid = priv->local_gid; - memcpy(rec.dgid.raw, phdr->hwaddr + 4, 16); - rec.pkey = cpu_to_be16(priv->pkey); + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) phdr->hwaddr)); - /* - * XXX We need to keep a record of the skb and TID somewhere - * so that we can cancel the request if the device goes down - * before it finishes. - */ - if (ib_sa_path_rec_get(priv->ca, priv->port, &rec, - IB_SA_PATH_REC_DGID | - IB_SA_PATH_REC_SGID | - IB_SA_PATH_REC_NUMB_PATH | - IB_SA_PATH_REC_PKEY, - 1000, GFP_ATOMIC, - unicast_arp_completion, - skb, &query) < 0) { - ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); - ++priv->stats.tx_dropped; - dev_kfree_skb_any(skb); - } + spin_unlock(&priv->lock); } static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_path *path; + struct ipoib_neigh *neigh; unsigned long flags; local_irq_save(flags); @@ -395,21 +490,21 @@ } if (skb->dst && skb->dst->neighbour) { - if (unlikely(!*to_ipoib_path(skb->dst->neighbour))) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { path_lookup(skb, dev); goto out; } - path = *to_ipoib_path(skb->dst->neighbour); + neigh = *to_ipoib_neigh(skb->dst->neighbour); - if (likely(path->ah)) { - ipoib_send(dev, skb, path->ah, + if (likely(neigh->ah)) { + ipoib_send(dev, skb, neigh->ah, be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); goto out; } - if (skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) - __skb_queue_tail(&path->queue, skb); + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) + __skb_queue_tail(&neigh->queue, skb); else goto err; } else { @@ -418,25 +513,14 @@ skb_pull(skb, sizeof *phdr); if (phdr->hwaddr[4] == 0xff) { - /* Add in the P_Key */ + /* Add in the P_Key for multicast*/ phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; phdr->hwaddr[9] = priv->pkey & 0xff; ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); } else { - /* unicast GID -- ARP reply?? */ + /* unicast GID -- should be ARP reply */ - /* - * If destructor is unicast_arp_finish, we've - * already been through the path lookup and - * now we can just send the packet. 
- */ - if (skb->destructor == unicast_arp_finish) { - ipoib_send(dev, skb, *(struct ipoib_ah **) skb->cb, - be32_to_cpup((u32 *) phdr->hwaddr)); - goto out; - } - if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " IPOIB_GID_FMT "\n", @@ -449,9 +533,8 @@ goto out; } - /* put the pseudoheader back on */ - skb_push(skb, sizeof *phdr); - unicast_arp_start(skb, dev, phdr); + /* put the pseudoheader back on and try to send */ + unicast_arp_send(skb, dev, phdr); } } @@ -516,19 +599,28 @@ schedule_work(&priv->restart_task); } -static void ipoib_neigh_destructor(struct neighbour *neigh) +static void ipoib_neigh_destructor(struct neighbour *n) { - struct ipoib_path *path = *to_ipoib_path(neigh); + struct ipoib_neigh *neigh = *to_ipoib_neigh(n); + struct ipoib_dev_priv *priv = netdev_priv(n->dev); + unsigned long flags; - ipoib_dbg(netdev_priv(neigh->dev), + ipoib_dbg(priv, "neigh_destructor for %06x " IPOIB_GID_FMT "\n", - be32_to_cpup((__be32 *) neigh->ha), - IPOIB_GID_ARG(*((union ib_gid *) (neigh->ha + 4)))); + be32_to_cpup((__be32 *) n->ha), + IPOIB_GID_ARG(*((union ib_gid *) (n->ha + 4)))); - if (path && path->ah) { - ipoib_put_ah(path->ah); - kfree(path); + spin_lock_irqsave(&priv->lock, flags); + + if (neigh) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + *to_ipoib_neigh(n) = NULL; + kfree(neigh); } + + spin_unlock_irqrestore(&priv->lock, flags); } static int ipoib_neigh_setup(struct neighbour *neigh) @@ -669,6 +761,7 @@ init_MUTEX(&priv->mcast_mutex); init_MUTEX(&priv->vlan_mutex); + INIT_LIST_HEAD(&priv->path_list); INIT_LIST_HEAD(&priv->child_intfs); INIT_LIST_HEAD(&priv->dead_ahs); INIT_LIST_HEAD(&priv->multicast_list); Index: infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 1320) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -60,6 +60,8 @@ unsigned long flags; unsigned char logcount; + struct list_head neigh_list; + struct sk_buff_head pkt_queue; struct net_device *dev; @@ -77,11 +79,25 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast) { struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tmp; + unsigned long flags; ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); + if (mcast->ah) ipoib_put_ah(mcast->ah); @@ -114,6 +130,7 @@ mcast->logcount = 0; INIT_LIST_HEAD(&mcast->list); + INIT_LIST_HEAD(&mcast->neigh_list); skb_queue_head_init(&mcast->pkt_queue); mcast->ah = NULL; @@ -671,24 +688,25 @@ } out: - spin_unlock_irqrestore(&priv->lock, flags); if (mcast && mcast->ah) { if (skb->dst && skb->dst->neighbour && - !*to_ipoib_path(skb->dst->neighbour)) { - struct ipoib_path *path = kmalloc(sizeof *path, GFP_ATOMIC); + !*to_ipoib_neigh(skb->dst->neighbour)) { + struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); - if (path) { + if (neigh) { kref_get(&mcast->ah->ref); - path->ah = mcast->ah; - path->dev = dev; - path->neighbour = skb->dst->neighbour; - *to_ipoib_path(skb->dst->neighbour) = path; + neigh->ah = mcast->ah; + neigh->neighbour = 
skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + list_add_tail(&neigh->list, &mcast->neigh_list); } } ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); } + + spin_unlock_irqrestore(&priv->lock, flags); } void ipoib_mcast_dev_flush(struct net_device *dev) Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 1320) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -93,6 +93,12 @@ DECLARE_PCI_UNMAP_ADDR(mapping) }; +/* + * Device private locking: tx_lock protects members used in TX fast + * path (and we use LLTX so upper layers don't do extra locking). + * lock protects everything else. lock nests inside of tx_lock (ie + * tx_lock must be acquired first if needed). + */ struct ipoib_dev_priv { spinlock_t lock; @@ -103,6 +109,9 @@ struct semaphore mcast_mutex; struct semaphore vlan_mutex; + struct rb_root path_tree; + struct list_head path_list; + struct ipoib_mcast *broadcast; struct list_head multicast_list; struct rb_root multicast_tree; @@ -162,16 +171,34 @@ }; struct ipoib_path { + struct net_device *dev; + struct ib_sa_path_rec pathrec; + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct list_head neigh_list; + + int query_id; + struct ib_sa_query *query; + struct completion done; + + struct rb_node rb_node; + struct list_head list; +}; + +struct ipoib_neigh { struct ipoib_ah *ah; struct sk_buff_head queue; - struct net_device *dev; struct neighbour *neighbour; + + struct list_head list; }; -static inline struct ipoib_path **to_ipoib_path(struct neighbour *neigh) +static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) { - return (struct ipoib_path **) (neigh->ha + 24); + return (struct ipoib_neigh **) (neigh->ha + 24 - + (offsetof(struct neighbour, ha) & 4)); } extern struct workqueue_struct *ipoib_workqueue; @@ -194,6 +221,7 @@ struct ipoib_ah *address, u32 qpn); void ipoib_reap_ah(void *dev_ptr); +void ipoib_flush_paths(struct net_device *dev); struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); Index: infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- infiniband/ulp/ipoib/ipoib_ib.c (revision 1320) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -445,6 +445,8 @@ /* Delete broadcast and local addresses since they will be recreated */ ipoib_mcast_dev_down(dev); + ipoib_flush_paths(dev); + return 0; } From hnrose at earthlink.net Fri Dec 10 20:33:39 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Fri, 10 Dec 2004 23:33:39 -0500 Subject: Fw: [openib-general] IPv6 All Router Multicast Group Message-ID: <004c01c4df3a$98e72980$6501a8c0@comcast.net> Hal Rosenstock wrote: > Roland Dreier wrote: >> Hal> A host needs to tell the multicast router when it is joining >> Hal> and leaving a group so the router knows when to join the >> Hal> multicast tree for that group or prune the tree. >> >> If the host does not need to receive any multicast messages, then >> only the multicast routers would need to join the group. > > In IGMP, the host needs to both listen (for queries) and send > (reports, leave). > The inverse is true for mrouters. It seems like they both need to > join the group > if they are running IGMP of some version. IGMPv1 does not support > leave. In thinking about this further, it depends on whether the messages are multicast or not and in what direction(s). 
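(An illustrative aside, not part of the original mail: the host side of this IGMP exchange can be sketched with the standard BSD sockets API. The group 239.1.1.1 and port 9999 below are arbitrary examples. Joining a group is what causes the kernel to emit the IGMP membership report on the host's behalf; dropping the membership, or closing the socket, is what can trigger the leave.)

/* Minimal multicast receiver sketch (illustration only). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr;
	struct ip_mreq mreq;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&addr, 0, sizeof addr);
	addr.sin_family      = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port        = htons(9999);	/* arbitrary example port */
	bind(fd, (struct sockaddr *) &addr, sizeof addr);

	/* Joining the group: the kernel sends the IGMP membership report. */
	mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");	/* arbitrary example group */
	mreq.imr_interface.s_addr = htonl(INADDR_ANY);
	if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq) < 0)
		perror("IP_ADD_MEMBERSHIP");

	/* ... recvfrom() multicast datagrams here ... */

	/* Dropping membership (or close()) lets the kernel send a leave if needed. */
	setsockopt(fd, IPPROTO_IP, IP_DROP_MEMBERSHIP, &mreq, sizeof mreq);
	close(fd);
	return 0;
}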
The router-to-host general query is multicast (which means a host would need to receive as well as send) but this is on the all systems multicast address (224.0.0.1). Specific queries are sent on the group being queried. Routers send to these groups and also listen to the all routers group (224.0.0.2) for IGMPv2 (leaves are sent on this group from hosts) and to all other groups (which is where the reports come back). Only Version 3 Reports are sent with an IP destination address of 224.0.0.22, to which all IGMPv3-capable multicast routers listen. So hosts send only on this group. -- Hal From iod00d at hp.com Fri Dec 10 22:21:55 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 22:21:55 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! :^) In-Reply-To: <20041211004209.GO11324@esmail.cup.hp.com> References: <20041210212219.GK11324@esmail.cup.hp.com> <52k6rp94r3.fsf@topspin.com> <20041211004209.GO11324@esmail.cup.hp.com> Message-ID: <20041211062155.GC13957@esmail.cup.hp.com> On Fri, Dec 10, 2004 at 04:42:09PM -0800, Grant Grundler wrote: > I'll dig up the other trivial things with pfmon. pfmon 3.1 hasn't been released as I'd hoped, and thus I may not be able to collect Data EAR samples as planned. I did a q-syscollect run (full output): http://gsyprf3.external.hp.com/apache2-default/openib/q-1321-tcp_stream-0.txt Here are the top offenders:

Flat profile of CPU_CYCLES in kernel-cpu0.hist#0:
Each histogram sample counts as 1.00034m seconds

 % time   self  cumul  calls  self/call  tot/call  name
  20.48   8.10   8.10  41.0k       198u      198u  default_idle
  15.02   5.94  14.05  1.83M      3.25u     5.06u  mthca_interrupt
  14.91   5.90  19.94  17.2M       342n      342n  _spin_unlock_irqrestore
   4.05   1.60  21.55  12.1M       132n      149n  ipt_do_table
   3.20   1.27  22.81  7.17M       177n      177n  do_csum
   2.67   1.05  23.87  7.83M       135n      135n  __copy_user
   2.26   0.89  24.76  33.9M      26.4n     36.6n  local_bh_enable
   2.15   0.85  25.61  2.81M       303n      778n  tcp_transmit_skb
   2.01   0.79  26.41  1.45M       545n     3.98u  tcp_sendmsg
   ...

hrm...don't understand the 20% idle. This is a dual CPU system and (this version of) netperf is not multi-threaded. The top 3 only add up to about 50%. I guess I need to see what's being inlined into mthca_interrupt and try to break that down into smaller bits. thanks, grant From iod00d at hp.com Fri Dec 10 22:35:09 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 10 Dec 2004 22:35:09 -0800 Subject: [openib-general] ip_ipoib works on IA64! (woohoo! :^) In-Reply-To: <20041211062155.GC13957@esmail.cup.hp.com> References: <20041210212219.GK11324@esmail.cup.hp.com> <52k6rp94r3.fsf@topspin.com> <20041211004209.GO11324@esmail.cup.hp.com> <20041211062155.GC13957@esmail.cup.hp.com> Message-ID: <20041211063509.GA14155@esmail.cup.hp.com> On Fri, Dec 10, 2004 at 10:21:55PM -0800, Grant Grundler wrote: > I did a q-syscollect run (full output): > http://gsyprf3.external.hp.com/apache2-default/openib/q-1321-tcp_stream-0.txt Sorry that should be: http://gsyprf3.external.hp.com/openib/q-1321-tcp_stream-0.txt (misconfigured apache2 server - fixed it) grant From roland at topspin.com Sat Dec 11 06:37:37 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 11 Dec 2004 06:37:37 -0800 Subject: [openib-general] Missining Infiniband class fields In-Reply-To: <521xdyasoo.fsf@topspin.com> (Roland Dreier's message of "Fri, 10 Dec 2004 10:56:07 -0800") References: <521xdyasoo.fsf@topspin.com> Message-ID: <52acsk7vf2.fsf@topspin.com> Roland> I would create another device special file like the umad Roland> file and use reads to get the events.
I thought about this a little more, and I'm not sure this is the right approach any more. I would look at using the kobject_uevent mechanism to pass these events to userspace (add a KOBJ_IB_ASYNC action and generate events using the IB device's class_device kobject). - R. From shaharf at voltaire.com Sun Dec 12 02:43:04 2004 From: shaharf at voltaire.com (shaharf) Date: Sun, 12 Dec 2004 12:43:04 +0200 Subject: [openib-general] Re: User MAD support for cancel MAD Message-ID: > I guess it would be another ioctl. However I'm not sure how useful > this is for userspace... in the kernel modules like IPoIB want to > cancel pending sends to avoid a callback when the module is unloaded. > However in userspace, an application can just close the file on exit. > > - R. OpenSM attempts to cancel mads in several scenarios. For example, the SM issues SwitchInfo to all switches. One of them returns with a port change. This means that all other SwitchInfo are not relevant anymore - a full "heavy" sweep must be performed anyway. While it is not critical to be able to cancel these mads, it may help free kernel resources in the case when these mads time out, for example when some switches are behind a switch that is disconnected. Another scenario is when a discovery is done and an error is received and another sweep is forced. The pending mads should be canceled. Again, this is not critical, but if the instrumentation is already there, it would be nice to use it. The interface issue is not so trivial. An IOCTL may do, but the problem is the parameter of the IOCTL. The most straightforward way is to specify a TID to cancel. This requires the usermode to avoid sending the same mad until it times out. While this is reasonable, I would not force such a limitation, after all there are plenty of retries mechanisms and we should not force the applications to use only one sort. For example, if you have a retries mechanism that work over IB_MGT that just retires the mad after X msecs, I think you want to let it work also above openib. This means that either you allow no timeout/no kernel matching semantics (do we have one?) or let the user cancel *all* mads with the same TID. I think that both should be implemented. Shahar From shaharf at voltaire.com Sun Dec 12 10:22:08 2004 From: shaharf at voltaire.com (shaharf) Date: Sun, 12 Dec 2004 20:22:08 +0200 Subject: [openib-general] Missining Infiniband class fields Message-ID: > > Roland> I would create another device special file like the umad > Roland> file and use reads to get the events. > > I thought about this a little more, and I'm not sure this is the right > approach any more. I would look at using the kobject_uevent mechanism > to pass these events to userspace (add a KOBJ_IB_ASYNC action and > generate events using the IB device's class_device kobject). > > - R. Either way is good enough for me. Just tell me what the interface is and what structures and events are passed. Shahar From mshefty at ichips.intel.com Mon Dec 13 09:56:55 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Dec 2004 09:56:55 -0800 Subject: [openib-general] Re: User MAD support for cancel MAD In-Reply-To: References: Message-ID: <41BDD7E7.6050600@ichips.intel.com> shaharf wrote: > The interface issue is not so trivial. An IOCTL may do, but the problem > is the parameter of the IOCTL. The most straightforward way is to > specify a TID to cancel. This requires the usermode to avoid sending the > same mad until it times out.
While this is reasonable, I would not > force such a limitation, Why would a client send the same MAD before it times out? I think that enforcing such a limitation is perfectly acceptable. We could argue that cancel MAD should work based on TID, management class, and SGID/SLID, rather than simply the TID. > For example, if you have a retries mechanism that work over IB_MGT > that just retires the mad after X msecs, I think you want to let it work > also above openib. This means that either you allow no timeout/no kernel > matching semantics Clients are required to use the access layer timeout mechanism to properly match responses with requests. If a timeout is not specified, the access layer will drop the response, since it cannot uniquely identify which client to hand the response to. This is done to avoid making multiple data copies of a receive MAD in order to hand it to all clients. - Sean From roland at topspin.com Mon Dec 13 10:09:16 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:16 -0800 Subject: [openib-general] [PATCH][v3][0/21] Initial submission of InfiniBand patches for review Message-ID: <20041213109.xPBcb5yOtGKuT24L@topspin.com> The following series of patches is the latest version of the OpenIB InfiniBand drivers. We believe that this version is suitable for merging when 2.6.11 opens (or into -mm immediately), although of course we are willing to go through as many more iterations as required to fix any remaining issues. We appreciate all of the excellent feedback we received for our previous posting, and we believe we have addressed all of the problems that were identified. We did not intentionally ignore any issues -- if we did not address some of your comments, please rest assured that it was an error on our part. Based on the discussion on cleaning up kernel headers, we have left our .h files under drivers/infiniband/include. None of these .h files are used outside of drivers/infiniband, so it seems that it is better not to add them to the global include/ namespace. Thanks, Roland Dreier OpenIB Alliance www.openib.org From roland at topspin.com Mon Dec 13 10:09:17 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:17 -0800 Subject: [openib-general] [PATCH][v3][1/21] Add core InfiniBand support (public headers) In-Reply-To: <20041213109.xPBcb5yOtGKuT24L@topspin.com> Message-ID: <20041213109.NN2Neh1z1VRsHIIH@topspin.com> Add public headers for core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_cache.h 2004-12-13 09:44:39.409919591 -0800 @@ -0,0 +1,42 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ib_cache.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#ifndef _IB_CACHE_H +#define _IB_CACHE_H + +#include + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid); +int ib_cached_pkey_get(struct ib_device *device_handle, + u8 port, + int index, + u16 *pkey); +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index); + +#endif /* _IB_CACHE_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_fmr_pool.h 2004-12-13 09:44:39.435915761 -0800 @@ -0,0 +1,81 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: ib_fmr_pool.h 1295 2004-11-25 19:26:33Z roland $ + */ + +#if !defined(IB_FMR_POOL_H) +#define IB_FMR_POOL_H + +#include + +struct ib_fmr_pool; + +/** + * struct ib_fmr_pool_param - Parameters for creating FMR pool + * @max_pages_per_fmr:Maximum number of pages per map request. + * @access:Access flags for FMRs in pool. + * @pool_size:Number of FMRs to allocate for pool. + * @dirty_watermark:Flush is triggered when @dirty_watermark dirty + * FMRs are present. + * @flush_function:Callback called when unmapped FMRs are flushed and + * more FMRs are possibly available for mapping + * @flush_arg:Context passed to user's flush function. + * @cache:If set, FMRs may be reused after unmapping for identical map + * requests. 
+ */ +struct ib_fmr_pool_param { + int max_pages_per_fmr; + enum ib_access_flags access; + int pool_size; + int dirty_watermark; + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + unsigned cache:1; +}; + +struct ib_pool_fmr { + struct ib_fmr *fmr; + struct ib_fmr_pool *pool; + struct list_head list; + struct hlist_node cache_node; + int ref_count; + int remap_count; + u64 io_virtual_address; + int page_list_len; + u64 page_list[0]; +}; + +struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params); + +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool); + +int ib_flush_fmr_pool(struct ib_fmr_pool *pool); + +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address); + +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr); + +#endif /* IB_FMR_POOL_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_pack.h 2004-12-13 09:44:39.467911048 -0800 @@ -0,0 +1,234 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: ib_pack.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#ifndef IB_PACK_H +#define IB_PACK_H + +#include + +enum { + IB_LRH_BYTES = 8, + IB_GRH_BYTES = 40, + IB_BTH_BYTES = 12, + IB_DETH_BYTES = 8 +}; + +struct ib_field { + size_t struct_offset_bytes; + size_t struct_size_bytes; + int offset_words; + int offset_bits; + int size_bits; + char *field_name; +}; + +#define RESERVED \ + .field_name = "reserved" + +/* + * This macro cleans up the definitions of constants for BTH opcodes. + * It is used to define constants such as IB_OPCODE_UD_SEND_ONLY, + * which becomes IB_OPCODE_UD + IB_OPCODE_SEND_ONLY, and this gives + * the correct value. + * + * In short, user code should use the constants defined using the + * macro rather than worrying about adding together other constants. 
+*/ +#define IB_OPCODE(transport, op) \ + IB_OPCODE_ ## transport ## _ ## op = \ + IB_OPCODE_ ## transport + IB_OPCODE_ ## op + +enum { + /* transport types -- just used to define real constants */ + IB_OPCODE_RC = 0x00, + IB_OPCODE_UC = 0x20, + IB_OPCODE_RD = 0x40, + IB_OPCODE_UD = 0x60, + + /* operations -- just used to define real constants */ + IB_OPCODE_SEND_FIRST = 0x00, + IB_OPCODE_SEND_MIDDLE = 0x01, + IB_OPCODE_SEND_LAST = 0x02, + IB_OPCODE_SEND_LAST_WITH_IMMEDIATE = 0x03, + IB_OPCODE_SEND_ONLY = 0x04, + IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE = 0x05, + IB_OPCODE_RDMA_WRITE_FIRST = 0x06, + IB_OPCODE_RDMA_WRITE_MIDDLE = 0x07, + IB_OPCODE_RDMA_WRITE_LAST = 0x08, + IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE = 0x09, + IB_OPCODE_RDMA_WRITE_ONLY = 0x0a, + IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE = 0x0b, + IB_OPCODE_RDMA_READ_REQUEST = 0x0c, + IB_OPCODE_RDMA_READ_RESPONSE_FIRST = 0x0d, + IB_OPCODE_RDMA_READ_RESPONSE_MIDDLE = 0x0e, + IB_OPCODE_RDMA_READ_RESPONSE_LAST = 0x0f, + IB_OPCODE_RDMA_READ_RESPONSE_ONLY = 0x10, + IB_OPCODE_ACKNOWLEDGE = 0x11, + IB_OPCODE_ATOMIC_ACKNOWLEDGE = 0x12, + IB_OPCODE_COMPARE_SWAP = 0x13, + IB_OPCODE_FETCH_ADD = 0x14, + + /* real constants follow -- see comment about above IB_OPCODE() + macro for more details */ + + /* RC */ + IB_OPCODE(RC, SEND_FIRST), + IB_OPCODE(RC, SEND_MIDDLE), + IB_OPCODE(RC, SEND_LAST), + IB_OPCODE(RC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, SEND_ONLY), + IB_OPCODE(RC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_FIRST), + IB_OPCODE(RC, RDMA_WRITE_MIDDLE), + IB_OPCODE(RC, RDMA_WRITE_LAST), + IB_OPCODE(RC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_ONLY), + IB_OPCODE(RC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_READ_REQUEST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RC, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RC, ACKNOWLEDGE), + IB_OPCODE(RC, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RC, COMPARE_SWAP), + IB_OPCODE(RC, FETCH_ADD), + + /* UC */ + IB_OPCODE(UC, SEND_FIRST), + IB_OPCODE(UC, SEND_MIDDLE), + IB_OPCODE(UC, SEND_LAST), + IB_OPCODE(UC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, SEND_ONLY), + IB_OPCODE(UC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_FIRST), + IB_OPCODE(UC, RDMA_WRITE_MIDDLE), + IB_OPCODE(UC, RDMA_WRITE_LAST), + IB_OPCODE(UC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_ONLY), + IB_OPCODE(UC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + + /* RD */ + IB_OPCODE(RD, SEND_FIRST), + IB_OPCODE(RD, SEND_MIDDLE), + IB_OPCODE(RD, SEND_LAST), + IB_OPCODE(RD, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, SEND_ONLY), + IB_OPCODE(RD, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_FIRST), + IB_OPCODE(RD, RDMA_WRITE_MIDDLE), + IB_OPCODE(RD, RDMA_WRITE_LAST), + IB_OPCODE(RD, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_ONLY), + IB_OPCODE(RD, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_READ_REQUEST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RD, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RD, ACKNOWLEDGE), + IB_OPCODE(RD, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RD, COMPARE_SWAP), + IB_OPCODE(RD, FETCH_ADD), + + /* UD */ + IB_OPCODE(UD, SEND_ONLY), + IB_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE) +}; + +enum { + IB_LNH_RAW = 0, + IB_LNH_IP = 1, + IB_LNH_IBA_LOCAL = 2, + IB_LNH_IBA_GLOBAL = 3 +}; + +struct ib_unpacked_lrh { + u8 virtual_lane; + u8 link_version; + u8 service_level; + u8 
link_next_header; + __be16 destination_lid; + __be16 packet_length; + __be16 source_lid; +}; + +struct ib_unpacked_grh { + u8 ip_version; + u8 traffic_class; + __be32 flow_label; + __be16 payload_length; + u8 next_header; + u8 hop_limit; + union ib_gid source_gid; + union ib_gid destination_gid; +}; + +struct ib_unpacked_bth { + u8 opcode; + u8 solicited_event; + u8 mig_req; + u8 pad_count; + u8 transport_header_version; + __be16 pkey; + __be32 destination_qpn; + u8 ack_req; + __be32 psn; +}; + +struct ib_unpacked_deth { + __be32 qkey; + __be32 source_qpn; +}; + +struct ib_ud_header { + struct ib_unpacked_lrh lrh; + int grh_present; + struct ib_unpacked_grh grh; + struct ib_unpacked_bth bth; + struct ib_unpacked_deth deth; + int immediate_present; + __be32 immediate_data; +}; + +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf); + +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure); + +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header); + +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf); + +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header); + +#endif /* IB_PACK_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_verbs.h 2004-12-13 09:44:39.495906923 -0800 @@ -0,0 +1,1225 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+ * + * $Id: ib_verbs.h 1304 2004-11-30 19:01:25Z sean.hefty $ + */ + +#if !defined(IB_VERBS_H) +#define IB_VERBS_H + +#include +#include +#include + +union ib_gid { + u8 raw[16]; + struct { + u64 subnet_prefix; + u64 interface_id; + } global; +}; + +enum ib_node_type { + IB_NODE_CA = 1, + IB_NODE_SWITCH, + IB_NODE_ROUTER +}; + +enum ib_device_cap_flags { + IB_DEVICE_RESIZE_MAX_WR = 1, + IB_DEVICE_BAD_PKEY_CNTR = (1<<1), + IB_DEVICE_BAD_QKEY_CNTR = (1<<2), + IB_DEVICE_RAW_MULTI = (1<<3), + IB_DEVICE_AUTO_PATH_MIG = (1<<4), + IB_DEVICE_CHANGE_PHY_PORT = (1<<5), + IB_DEVICE_UD_AV_PORT_ENFORCE = (1<<6), + IB_DEVICE_CURR_QP_STATE_MOD = (1<<7), + IB_DEVICE_SHUTDOWN_PORT = (1<<8), + IB_DEVICE_INIT_TYPE = (1<<9), + IB_DEVICE_PORT_ACTIVE_EVENT = (1<<10), + IB_DEVICE_SYS_IMAGE_GUID = (1<<11), + IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), + IB_DEVICE_SRQ_RESIZE = (1<<13), + IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_RQ_SIG_TYPE = (1<<15) +}; + +enum ib_atomic_cap { + IB_ATOMIC_NONE, + IB_ATOMIC_HCA, + IB_ATOMIC_GLOB +}; + +struct ib_device_attr { + u64 fw_ver; + u64 node_guid; + u64 sys_image_guid; + u64 max_mr_size; + u64 page_size_cap; + u32 vendor_id; + u32 vendor_part_id; + u32 hw_ver; + int max_qp; + int max_qp_wr; + int device_cap_flags; + int max_sge; + int max_sge_rd; + int max_cq; + int max_cqe; + int max_mr; + int max_pd; + int max_qp_rd_atom; + int max_ee_rd_atom; + int max_res_rd_atom; + int max_qp_init_rd_atom; + int max_ee_init_rd_atom; + enum ib_atomic_cap atomic_cap; + int max_ee; + int max_rdd; + int max_mw; + int max_raw_ipv6_qp; + int max_raw_ethy_qp; + int max_mcast_grp; + int max_mcast_qp_attach; + int max_total_mcast_qp_attach; + int max_ah; + int max_fmr; + int max_map_per_fmr; + int max_srq; + int max_srq_wr; + int max_srq_sge; + u16 max_pkeys; + u8 local_ca_ack_delay; +}; + +enum ib_mtu { + IB_MTU_256 = 1, + IB_MTU_512 = 2, + IB_MTU_1024 = 3, + IB_MTU_2048 = 4, + IB_MTU_4096 = 5 +}; + +static inline int ib_mtu_enum_to_int(enum ib_mtu mtu) +{ + switch (mtu) { + case IB_MTU_256: return 256; + case IB_MTU_512: return 512; + case IB_MTU_1024: return 1024; + case IB_MTU_2048: return 2048; + case IB_MTU_4096: return 4096; + default: return -1; + } +} + +enum ib_static_rate { + IB_STATIC_RATE_FULL = 0, + IB_STATIC_RATE_12X_TO_4X = 2, + IB_STATIC_RATE_4X_TO_1X = 3, + IB_STATIC_RATE_12X_TO_1X = 11 +}; + +enum ib_port_state { + IB_PORT_NOP = 0, + IB_PORT_DOWN = 1, + IB_PORT_INIT = 2, + IB_PORT_ARMED = 3, + IB_PORT_ACTIVE = 4, + IB_PORT_ACTIVE_DEFER = 5 +}; + +enum ib_port_cap_flags { + IB_PORT_SM = (1<<31), + IB_PORT_NOTICE_SUP = (1<<30), + IB_PORT_TRAP_SUP = (1<<29), + IB_PORT_AUTO_MIGR_SUP = (1<<27), + IB_PORT_SL_MAP_SUP = (1<<26), + IB_PORT_MKEY_NVRAM = (1<<25), + IB_PORT_PKEY_NVRAM = (1<<24), + IB_PORT_LED_INFO_SUP = (1<<23), + IB_PORT_SM_DISABLED = (1<<22), + IB_PORT_SYS_IMAGE_GUID_SUP = (1<<21), + IB_PORT_PKEY_SW_EXT_PORT_TRAP_SUP = (1<<20), + IB_PORT_CM_SUP = (1<<16), + IB_PORT_SNMP_TUNNEL_SUP = (1<<15), + IB_PORT_REINIT_SUP = (1<<14), + IB_PORT_DEVICE_MGMT_SUP = (1<<13), + IB_PORT_VENDOR_CLASS_SUP = (1<<12), + IB_PORT_DR_NOTICE_SUP = (1<<11), + IB_PORT_PORT_NOTICE_SUP = (1<<10), + IB_PORT_BOOT_MGMT_SUP = (1<<9) +}; + +struct ib_port_attr { + enum ib_port_state state; + enum ib_mtu max_mtu; + enum ib_mtu active_mtu; + int gid_tbl_len; + u32 port_cap_flags; + u32 max_msg_sz; + u32 bad_pkey_cntr; + u32 qkey_viol_cntr; + u16 pkey_tbl_len; + u16 lid; + u16 sm_lid; + u8 lmc; + u8 max_vl_num; + u8 sm_sl; + u8 subnet_timeout; + u8 init_type_reply; +}; + +enum ib_device_modify_flags { + 
IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 +}; + +struct ib_device_modify { + u64 sys_image_guid; +}; + +enum ib_port_modify_flags { + IB_PORT_SHUTDOWN = 1, + IB_PORT_INIT_TYPE = (1<<2), + IB_PORT_RESET_QKEY_CNTR = (1<<3) +}; + +struct ib_port_modify { + u32 set_port_cap_mask; + u32 clr_port_cap_mask; + u8 init_type; +}; + +enum ib_event_type { + IB_EVENT_CQ_ERR, + IB_EVENT_QP_FATAL, + IB_EVENT_QP_REQ_ERR, + IB_EVENT_QP_ACCESS_ERR, + IB_EVENT_COMM_EST, + IB_EVENT_SQ_DRAINED, + IB_EVENT_PATH_MIG, + IB_EVENT_PATH_MIG_ERR, + IB_EVENT_DEVICE_FATAL, + IB_EVENT_PORT_ACTIVE, + IB_EVENT_PORT_ERR, + IB_EVENT_LID_CHANGE, + IB_EVENT_PKEY_CHANGE, + IB_EVENT_SM_CHANGE +}; + +struct ib_event { + struct ib_device *device; + union { + struct ib_cq *cq; + struct ib_qp *qp; + u8 port_num; + } element; + enum ib_event_type event; +}; + +struct ib_event_handler { + struct ib_device *device; + void (*handler)(struct ib_event_handler *, struct ib_event *); + struct list_head list; +}; + +#define INIT_IB_EVENT_HANDLER(_ptr, _device, _handler) \ + do { \ + (_ptr)->device = _device; \ + (_ptr)->handler = _handler; \ + INIT_LIST_HEAD(&(_ptr)->list); \ + } while (0) + +struct ib_global_route { + union ib_gid dgid; + u32 flow_label; + u8 sgid_index; + u8 hop_limit; + u8 traffic_class; +}; + +enum { + IB_MULTICAST_QPN = 0xffffff +}; + +enum ib_ah_flags { + IB_AH_GRH = 1 +}; + +struct ib_ah_attr { + struct ib_global_route grh; + u16 dlid; + u8 sl; + u8 src_path_bits; + u8 static_rate; + u8 ah_flags; + u8 port_num; +}; + +enum ib_wc_status { + IB_WC_SUCCESS, + IB_WC_LOC_LEN_ERR, + IB_WC_LOC_QP_OP_ERR, + IB_WC_LOC_EEC_OP_ERR, + IB_WC_LOC_PROT_ERR, + IB_WC_WR_FLUSH_ERR, + IB_WC_MW_BIND_ERR, + IB_WC_BAD_RESP_ERR, + IB_WC_LOC_ACCESS_ERR, + IB_WC_REM_INV_REQ_ERR, + IB_WC_REM_ACCESS_ERR, + IB_WC_REM_OP_ERR, + IB_WC_RETRY_EXC_ERR, + IB_WC_RNR_RETRY_EXC_ERR, + IB_WC_LOC_RDD_VIOL_ERR, + IB_WC_REM_INV_RD_REQ_ERR, + IB_WC_REM_ABORT_ERR, + IB_WC_INV_EECN_ERR, + IB_WC_INV_EEC_STATE_ERR, + IB_WC_FATAL_ERR, + IB_WC_RESP_TIMEOUT_ERR, + IB_WC_GENERAL_ERR +}; + +enum ib_wc_opcode { + IB_WC_SEND, + IB_WC_RDMA_WRITE, + IB_WC_RDMA_READ, + IB_WC_COMP_SWAP, + IB_WC_FETCH_ADD, + IB_WC_BIND_MW, +/* + * Set value of IB_WC_RECV so consumers can test if a completion is a + * receive by testing (opcode & IB_WC_RECV). + */ + IB_WC_RECV = 1 << 7, + IB_WC_RECV_RDMA_WITH_IMM +}; + +enum ib_wc_flags { + IB_WC_GRH = 1, + IB_WC_WITH_IMM = (1<<1) +}; + +struct ib_wc { + u64 wr_id; + enum ib_wc_status status; + enum ib_wc_opcode opcode; + u32 vendor_err; + u32 byte_len; + __be32 imm_data; + u32 src_qp; + int wc_flags; + u16 pkey_index; + u16 slid; + u8 sl; + u8 dlid_path_bits; + u8 port_num; /* valid only for DR SMPs on switches */ +}; + +enum ib_cq_notify { + IB_CQ_SOLICITED, + IB_CQ_NEXT_COMP +}; + +struct ib_qp_cap { + u32 max_send_wr; + u32 max_recv_wr; + u32 max_send_sge; + u32 max_recv_sge; + u32 max_inline_data; +}; + +enum ib_sig_type { + IB_SIGNAL_ALL_WR, + IB_SIGNAL_REQ_WR +}; + +enum ib_qp_type { + /* + * IB_QPT_SMI and IB_QPT_GSI have to be the first two entries + * here (and in that order) since the MAD layer uses them as + * indices into a 2-entry table. 
+ */ + IB_QPT_SMI, + IB_QPT_GSI, + + IB_QPT_RC, + IB_QPT_UC, + IB_QPT_UD, + IB_QPT_RAW_IPV6, + IB_QPT_RAW_ETY +}; + +struct ib_qp_init_attr { + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + struct ib_qp_cap cap; + enum ib_sig_type sq_sig_type; + enum ib_sig_type rq_sig_type; + enum ib_qp_type qp_type; + u8 port_num; /* special QP types only */ +}; + +enum ib_rnr_timeout { + IB_RNR_TIMER_655_36 = 0, + IB_RNR_TIMER_000_01 = 1, + IB_RNR_TIMER_000_02 = 2, + IB_RNR_TIMER_000_03 = 3, + IB_RNR_TIMER_000_04 = 4, + IB_RNR_TIMER_000_06 = 5, + IB_RNR_TIMER_000_08 = 6, + IB_RNR_TIMER_000_12 = 7, + IB_RNR_TIMER_000_16 = 8, + IB_RNR_TIMER_000_24 = 9, + IB_RNR_TIMER_000_32 = 10, + IB_RNR_TIMER_000_48 = 11, + IB_RNR_TIMER_000_64 = 12, + IB_RNR_TIMER_000_96 = 13, + IB_RNR_TIMER_001_28 = 14, + IB_RNR_TIMER_001_92 = 15, + IB_RNR_TIMER_002_56 = 16, + IB_RNR_TIMER_003_84 = 17, + IB_RNR_TIMER_005_12 = 18, + IB_RNR_TIMER_007_68 = 19, + IB_RNR_TIMER_010_24 = 20, + IB_RNR_TIMER_015_36 = 21, + IB_RNR_TIMER_020_48 = 22, + IB_RNR_TIMER_030_72 = 23, + IB_RNR_TIMER_040_96 = 24, + IB_RNR_TIMER_061_44 = 25, + IB_RNR_TIMER_081_92 = 26, + IB_RNR_TIMER_122_88 = 27, + IB_RNR_TIMER_163_84 = 28, + IB_RNR_TIMER_245_76 = 29, + IB_RNR_TIMER_327_68 = 30, + IB_RNR_TIMER_491_52 = 31 +}; + +enum ib_qp_attr_mask { + IB_QP_STATE = 1, + IB_QP_CUR_STATE = (1<<1), + IB_QP_EN_SQD_ASYNC_NOTIFY = (1<<2), + IB_QP_ACCESS_FLAGS = (1<<3), + IB_QP_PKEY_INDEX = (1<<4), + IB_QP_PORT = (1<<5), + IB_QP_QKEY = (1<<6), + IB_QP_AV = (1<<7), + IB_QP_PATH_MTU = (1<<8), + IB_QP_TIMEOUT = (1<<9), + IB_QP_RETRY_CNT = (1<<10), + IB_QP_RNR_RETRY = (1<<11), + IB_QP_RQ_PSN = (1<<12), + IB_QP_MAX_QP_RD_ATOMIC = (1<<13), + IB_QP_ALT_PATH = (1<<14), + IB_QP_MIN_RNR_TIMER = (1<<15), + IB_QP_SQ_PSN = (1<<16), + IB_QP_MAX_DEST_RD_ATOMIC = (1<<17), + IB_QP_PATH_MIG_STATE = (1<<18), + IB_QP_CAP = (1<<19), + IB_QP_DEST_QPN = (1<<20) +}; + +enum ib_qp_state { + IB_QPS_RESET, + IB_QPS_INIT, + IB_QPS_RTR, + IB_QPS_RTS, + IB_QPS_SQD, + IB_QPS_SQE, + IB_QPS_ERR +}; + +enum ib_mig_state { + IB_MIG_MIGRATED, + IB_MIG_REARM, + IB_MIG_ARMED +}; + +struct ib_qp_attr { + enum ib_qp_state qp_state; + enum ib_qp_state cur_qp_state; + enum ib_mtu path_mtu; + enum ib_mig_state path_mig_state; + u32 qkey; + u32 rq_psn; + u32 sq_psn; + u32 dest_qp_num; + int qp_access_flags; + struct ib_qp_cap cap; + struct ib_ah_attr ah_attr; + struct ib_ah_attr alt_ah_attr; + u16 pkey_index; + u16 alt_pkey_index; + u8 en_sqd_async_notify; + u8 sq_draining; + u8 max_rd_atomic; + u8 max_dest_rd_atomic; + u8 min_rnr_timer; + u8 port_num; + u8 timeout; + u8 retry_cnt; + u8 rnr_retry; + u8 alt_port_num; + u8 alt_timeout; +}; + +enum ib_wr_opcode { + IB_WR_RDMA_WRITE, + IB_WR_RDMA_WRITE_WITH_IMM, + IB_WR_SEND, + IB_WR_SEND_WITH_IMM, + IB_WR_RDMA_READ, + IB_WR_ATOMIC_CMP_AND_SWP, + IB_WR_ATOMIC_FETCH_AND_ADD +}; + +enum ib_send_flags { + IB_SEND_FENCE = 1, + IB_SEND_SIGNALED = (1<<1), + IB_SEND_SOLICITED = (1<<2), + IB_SEND_INLINE = (1<<3) +}; + +enum ib_recv_flags { + IB_RECV_SIGNALED = 1 +}; + +struct ib_sge { + u64 addr; + u32 length; + u32 lkey; +}; + +struct ib_send_wr { + struct ib_send_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + enum ib_wr_opcode opcode; + int send_flags; + u32 imm_data; + union { + struct { + u64 remote_addr; + u32 rkey; + } rdma; + struct { + u64 remote_addr; + u64 compare_add; + u64 swap; + u32 rkey; + } atomic; + struct { + struct ib_ah *ah; + struct ib_mad_hdr 
*mad_hdr; + u32 remote_qpn; + u32 remote_qkey; + int timeout_ms; /* valid for MADs only */ + u16 pkey_index; /* valid for GSI only */ + u8 port_num; /* valid for DR SMPs on switch only */ + } ud; + } wr; +}; + +struct ib_recv_wr { + struct ib_recv_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + int recv_flags; +}; + +enum ib_access_flags { + IB_ACCESS_LOCAL_WRITE = 1, + IB_ACCESS_REMOTE_WRITE = (1<<1), + IB_ACCESS_REMOTE_READ = (1<<2), + IB_ACCESS_REMOTE_ATOMIC = (1<<3), + IB_ACCESS_MW_BIND = (1<<4) +}; + +struct ib_phys_buf { + u64 addr; + u64 size; +}; + +struct ib_mr_attr { + struct ib_pd *pd; + u64 device_virt_addr; + u64 size; + int mr_access_flags; + u32 lkey; + u32 rkey; +}; + +enum ib_mr_rereg_flags { + IB_MR_REREG_TRANS = 1, + IB_MR_REREG_PD = (1<<1), + IB_MR_REREG_ACCESS = (1<<2) +}; + +struct ib_mw_bind { + struct ib_mr *mr; + u64 wr_id; + u64 addr; + u32 length; + int send_flags; + int mw_access_flags; +}; + +struct ib_fmr_attr { + int max_pages; + int max_maps; + u8 page_size; +}; + +struct ib_pd { + struct ib_device *device; + atomic_t usecnt; /* count all resources */ +}; + +struct ib_ah { + struct ib_device *device; + struct ib_pd *pd; +}; + +typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context); + +struct ib_cq { + struct ib_device *device; + ib_comp_handler comp_handler; + void (*event_handler)(struct ib_event *, void *); + void * cq_context; + int cqe; + atomic_t usecnt; /* count number of work queues */ +}; + +struct ib_srq { + struct ib_device *device; + struct ib_pd *pd; + void *srq_context; + atomic_t usecnt; +}; + +struct ib_qp { + struct ib_device *device; + struct ib_pd *pd; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + u32 qp_num; +}; + +struct ib_mr { + struct ib_device *device; + struct ib_pd *pd; + u32 lkey; + u32 rkey; + atomic_t usecnt; /* count number of MWs */ +}; + +struct ib_mw { + struct ib_device *device; + struct ib_pd *pd; + u32 rkey; +}; + +struct ib_fmr { + struct ib_device *device; + struct ib_pd *pd; + struct list_head list; + u32 lkey; + u32 rkey; +}; + +struct ib_mad; + +enum ib_process_mad_flags { + IB_MAD_IGNORE_MKEY = 1 +}; + +enum ib_mad_result { + IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ + IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ + IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ +}; + +#define IB_DEVICE_NAME_MAX 64 + +struct ib_cache { + rwlock_t lock; + struct ib_event_handler event_handler; + struct ib_pkey_cache **pkey_cache; + struct ib_gid_cache **gid_cache; +}; + +struct ib_device { + struct device *dma_device; + + char name[IB_DEVICE_NAME_MAX]; + + struct list_head event_handler_list; + spinlock_t event_handler_lock; + + struct list_head core_list; + struct list_head client_data_list; + spinlock_t client_data_lock; + + struct ib_cache cache; + + u32 flags; + + int (*query_device)(struct ib_device *device, + struct ib_device_attr *device_attr); + int (*query_port)(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr); + int (*query_gid)(struct ib_device *device, + u8 port_num, int index, + union ib_gid *gid); + int (*query_pkey)(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + int (*modify_device)(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + int (*modify_port)(struct 
ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + struct ib_pd * (*alloc_pd)(struct ib_device *device); + int (*dealloc_pd)(struct ib_pd *pd); + struct ib_ah * (*create_ah)(struct ib_pd *pd, + struct ib_ah_attr *ah_attr); + int (*modify_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*query_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*destroy_ah)(struct ib_ah *ah); + struct ib_qp * (*create_qp)(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + int (*modify_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + int (*query_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + int (*destroy_qp)(struct ib_qp *qp); + int (*post_send)(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + int (*post_recv)(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr); + struct ib_cq * (*create_cq)(struct ib_device *device, + int cqe); + int (*destroy_cq)(struct ib_cq *cq); + int (*resize_cq)(struct ib_cq *cq, int *cqe); + int (*poll_cq)(struct ib_cq *cq, int num_entries, + struct ib_wc *wc); + int (*peek_cq)(struct ib_cq *cq, int wc_cnt); + int (*req_notify_cq)(struct ib_cq *cq, + enum ib_cq_notify cq_notify); + int (*req_ncomp_notif)(struct ib_cq *cq, + int wc_cnt); + struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, + int mr_access_flags); + struct ib_mr * (*reg_phys_mr)(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + int (*query_mr)(struct ib_mr *mr, + struct ib_mr_attr *mr_attr); + int (*dereg_mr)(struct ib_mr *mr); + int (*rereg_phys_mr)(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + struct ib_mw * (*alloc_mw)(struct ib_pd *pd); + int (*bind_mw)(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); + int (*dealloc_mw)(struct ib_mw *mw); + struct ib_fmr * (*alloc_fmr)(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + int (*map_phys_fmr)(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova); + int (*unmap_fmr)(struct list_head *fmr_list); + int (*dealloc_fmr)(struct ib_fmr *fmr); + int (*attach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*detach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*process_mad)(struct ib_device *device, + int process_mad_flags, + u8 port_num, + u16 source_lid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); + + struct class_device class_dev; + struct kobject ports_parent; + struct list_head port_list; + + enum { + IB_DEV_UNINITIALIZED, + IB_DEV_REGISTERED, + IB_DEV_UNREGISTERED + } reg_state; + + u8 node_type; + u8 phys_port_cnt; +}; + +struct ib_client { + char *name; + void (*add) (struct ib_device *); + void (*remove)(struct ib_device *); + + struct list_head list; +}; + +struct ib_device *ib_alloc_device(size_t size); +void ib_dealloc_device(struct ib_device *device); + +int ib_register_device (struct ib_device *device); +void ib_unregister_device(struct ib_device *device); + +int ib_register_client (struct ib_client *client); +void ib_unregister_client(struct ib_client *client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client); +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data); + +int 
ib_register_event_handler (struct ib_event_handler *event_handler); +int ib_unregister_event_handler(struct ib_event_handler *event_handler); +void ib_dispatch_event(struct ib_event *event); + +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr); + +int ib_query_port(struct ib_device *device, + u8 port_num, struct ib_port_attr *port_attr); + +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid); + +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + +/** + * ib_alloc_pd - Allocates an unused protection domain. + * @device: The device on which to allocate the protection domain. + * + * A protection domain object provides an association between QPs, shared + * receive queues, address handles, memory regions, and memory windows. + */ +struct ib_pd *ib_alloc_pd(struct ib_device *device); + +/** + * ib_dealloc_pd - Deallocates a protection domain. + * @pd: The protection domain to deallocate. + */ +int ib_dealloc_pd(struct ib_pd *pd); + +/** + * ib_create_ah - Creates an address handle for the given address vector. + * @pd: The protection domain associated with the address handle. + * @ah_attr: The attributes of the address vector. + * + * The address handle is used to reference a local or global destination + * in all UD QP post sends. + */ +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); + +/** + * ib_modify_ah - Modifies the address vector associated with an address + * handle. + * @ah: The address handle to modify. + * @ah_attr: The new address vector attributes to associate with the + * address handle. + */ +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +/** + * ib_query_ah - Queries the address vector associated with an address + * handle. + * @ah: The address handle to query. + * @ah_attr: The address vector attributes associated with the address + * handle. + */ +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +/** + * ib_destroy_ah - Destroys an address handle. + * @ah: The address handle to destroy. + */ +int ib_destroy_ah(struct ib_ah *ah); + +/** + * ib_create_qp - Creates a QP associated with the specified protection + * domain. + * @pd: The protection domain associated with the QP. + * @qp_init_attr: A list of initial attributes required to create the QP. + */ +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + +/** + * ib_modify_qp - Modifies the attributes for the specified QP and then + * transitions the QP to the given state. + * @qp: The QP to modify. + * @qp_attr: On input, specifies the QP attributes to modify. On output, + * the current values of selected QP attributes are returned. + * @qp_attr_mask: A bit-mask used to specify which attributes of the QP + * are being modified. + */ +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + +/** + * ib_query_qp - Returns the attribute list and current values for the + * specified QP. + * @qp: The QP to query. + * @qp_attr: The attributes of the specified QP. + * @qp_attr_mask: A bit-mask used to select specific attributes to query. + * @qp_init_attr: Additional attributes of the selected QP. 
+ * + * The qp_attr_mask may be used to limit the query to gathering only the + * selected attributes. + */ +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + +/** + * ib_destroy_qp - Destroys the specified QP. + * @qp: The QP to destroy. + */ +int ib_destroy_qp(struct ib_qp *qp); + +/** + * ib_post_send - Posts a list of work requests to the send queue of + * the specified QP. + * @qp: The QP to post the work request on. + * @send_wr: A list of work requests to post on the send queue. + * @bad_send_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the QP. + */ +static inline int ib_post_send(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + return qp->device->post_send(qp, send_wr, bad_send_wr); +} + +/** + * ib_post_recv - Posts a list of work requests to the receive queue of + * the specified QP. + * @qp: The QP to post the work request on. + * @recv_wr: A list of work requests to post on the receive queue. + * @bad_recv_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the QP. + */ +static inline int ib_post_recv(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + return qp->device->post_recv(qp, recv_wr, bad_recv_wr); +} + +/** + * ib_create_cq - Creates a CQ on the specified device. + * @device: The device on which to create the CQ. + * @comp_handler: A user-specified callback that is invoked when a + * completion event occurs on the CQ. + * @event_handler: A user-specified callback that is invoked when an + * asynchronous event not associated with a completion occurs on the CQ. + * @cq_context: Context associated with the CQ returned to the user via + * the associated completion and event handlers. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. + */ +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe); + +/** + * ib_resize_cq - Modifies the capacity of the CQ. + * @cq: The CQ to resize. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. + */ +int ib_resize_cq(struct ib_cq *cq, int cqe); + +/** + * ib_destroy_cq - Destroys the specified CQ. + * @cq: The CQ to destroy. + */ +int ib_destroy_cq(struct ib_cq *cq); + +/** + * ib_poll_cq - poll a CQ for completion(s) + * @cq:the CQ being polled + * @num_entries:maximum number of completions to return + * @wc:array of at least @num_entries &struct ib_wc where completions + * will be returned + * + * Poll a CQ for (possibly multiple) completions. If the return value + * is < 0, an error occurred. If the return value is >= 0, it is the + * number of completions returned. If the return value is + * non-negative and < num_entries, then the CQ was emptied. + */ +static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, + struct ib_wc *wc) +{ + return cq->device->poll_cq(cq, num_entries, wc); +} + +/** + * ib_peek_cq - Returns the number of unreaped completions currently + * on the specified CQ. + * @cq: The CQ to peek. + * @wc_cnt: A minimum number of unreaped completions to check for. 
+ * + * If the number of unreaped completions is greater than or equal to wc_cnt, + * this function returns wc_cnt, otherwise, it returns the actual number of + * unreaped completions. + */ +int ib_peek_cq(struct ib_cq *cq, int wc_cnt); + +/** + * ib_req_notify_cq - Request completion notification on a CQ. + * @cq: The CQ to generate an event for. + * @cq_notify: If set to %IB_CQ_SOLICITED, completion notification will + * occur on the next solicited event. If set to %IB_CQ_NEXT_COMP, + * notification will occur on the next completion. + */ +static inline int ib_req_notify_cq(struct ib_cq *cq, + enum ib_cq_notify cq_notify) +{ + return cq->device->req_notify_cq(cq, cq_notify); +} + +/** + * ib_req_ncomp_notif - Request completion notification when there are + * at least the specified number of unreaped completions on the CQ. + * @cq: The CQ to generate an event for. + * @wc_cnt: The number of unreaped completions that should be on the + * CQ before an event is generated. + */ +static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) +{ + return cq->device->req_ncomp_notif ? + cq->device->req_ncomp_notif(cq, wc_cnt) : + -ENOSYS; +} + +/** + * ib_get_dma_mr - Returns a memory region for system memory that is + * usable for DMA. + * @pd: The protection domain associated with the memory region. + * @mr_access_flags: Specifies the memory access rights. + */ +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +/** + * ib_reg_phys_mr - Prepares a virtually addressed memory region for use + * by an HCA. + * @pd: The protection domain associated assigned to the registered region. + * @phys_buf_array: Specifies a list of physical buffers to use in the + * memory region. + * @num_phys_buf: Specifies the size of the phys_buf_array. + * @mr_access_flags: Specifies the memory access rights. + * @iova_start: The offset of the region's starting I/O virtual address. + */ +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +/** + * ib_rereg_phys_mr - Modifies the attributes of an existing memory region. + * Conceptually, this call performs the functions deregister memory region + * followed by register physical memory region. Where possible, + * resources are reused instead of deallocated and reallocated. + * @mr: The memory region to modify. + * @mr_rereg_mask: A bit-mask used to indicate which of the following + * properties of the memory region are being modified. + * @pd: If %IB_MR_REREG_PD is set in mr_rereg_mask, this field specifies + * the new protection domain to associated with the memory region, + * otherwise, this parameter is ignored. + * @phys_buf_array: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this + * field specifies a list of physical buffers to use in the new + * translation, otherwise, this parameter is ignored. + * @num_phys_buf: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this + * field specifies the size of the phys_buf_array, otherwise, this + * parameter is ignored. + * @mr_access_flags: If %IB_MR_REREG_ACCESS is set in mr_rereg_mask, this + * field specifies the new memory access rights, otherwise, this + * parameter is ignored. + * @iova_start: The offset of the region's starting I/O virtual address. 
+ */ +int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +/** + * ib_query_mr - Retrieves information about a specific memory region. + * @mr: The memory region to retrieve information about. + * @mr_attr: The attributes of the specified memory region. + */ +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); + +/** + * ib_dereg_mr - Deregisters a memory region and removes it from the + * HCA translation table. + * @mr: The memory region to deregister. + */ +int ib_dereg_mr(struct ib_mr *mr); + +/** + * ib_alloc_mw - Allocates a memory window. + * @pd: The protection domain associated with the memory window. + */ +struct ib_mw *ib_alloc_mw(struct ib_pd *pd); + +/** + * ib_bind_mw - Posts a work request to the send queue of the specified + * QP, which binds the memory window to the given address range and + * remote access attributes. + * @qp: QP to post the bind work request on. + * @mw: The memory window to bind. + * @mw_bind: Specifies information about the memory window, including + * its address range, remote access rights, and associated memory region. + */ +static inline int ib_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + /* XXX reference counting in corresponding MR? */ + return mw->device->bind_mw ? + mw->device->bind_mw(qp, mw, mw_bind) : + -ENOSYS; +} + +/** + * ib_dealloc_mw - Deallocates a memory window. + * @mw: The memory window to deallocate. + */ +int ib_dealloc_mw(struct ib_mw *mw); + +/** + * ib_alloc_fmr - Allocates a unmapped fast memory region. + * @pd: The protection domain associated with the unmapped region. + * @mr_access_flags: Specifies the memory access rights. + * @fmr_attr: Attributes of the unmapped region. + * + * A fast memory region must be mapped before it can be used as part of + * a work request. + */ +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + +/** + * ib_map_phys_fmr - Maps a list of physical pages to a fast memory region. + * @fmr: The fast memory region to associate with the pages. + * @page_list: An array of physical pages to map to the fast memory region. + * @list_len: The number of pages in page_list. + * @iova: The I/O virtual address to use with the mapped region. + */ +static inline int ib_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova) +{ + return fmr->device->map_phys_fmr(fmr, page_list, list_len, iova); +} + +/** + * ib_unmap_fmr - Removes the mapping from a list of fast memory regions. + * @fmr_list: A linked list of fast memory regions to unmap. + */ +int ib_unmap_fmr(struct list_head *fmr_list); + +/** + * ib_dealloc_fmr - Deallocates a fast memory region. + * @fmr: The fast memory region to deallocate. + */ +int ib_dealloc_fmr(struct ib_fmr *fmr); + +/** + * ib_attach_mcast - Attaches the specified QP to a multicast group. + * @qp: QP to attach to the multicast group. The QP must be type + * IB_QPT_UD. + * @gid: Multicast group GID. + * @lid: Multicast group LID in host byte order. + * + * In order to send and receive multicast packets, subnet + * administration must have created the multicast group and configured + * the fabric appropriately. The port associated with the specified + * QP must also be a member of the multicast group. 
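+ *
+ * A minimal illustrative call sequence (the GID and LID values below are
+ * placeholders; real parameters must come from the MCMemberRecord that
+ * subnet administration returns for the group):
+ *
+ *	union ib_gid mgid;
+ *	u16 mlid = 0xc001;
+ *
+ *	memcpy(mgid.raw, mcmember_mgid, 16);
+ *	if (ib_attach_mcast(qp, &mgid, mlid))
+ *		printk(KERN_WARNING "multicast attach failed\n");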
+ */ +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +/** + * ib_detach_mcast - Detaches the specified QP from a multicast group. + * @qp: QP to detach from the multicast group. + * @gid: Multicast group GID. + * @lid: Multicast group LID in host byte order. + */ +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +#endif /* IB_VERBS_H */ From roland at topspin.com Mon Dec 13 10:09:20 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:20 -0800 Subject: [openib-general] [PATCH][v3][2/21] Add core InfiniBand support In-Reply-To: <20041213109.NN2Neh1z1VRsHIIH@topspin.com> Message-ID: <20041213109.B80JuEFdg6Nma7kr@topspin.com> Add implementation of core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-13 09:44:40.230798660 -0800 @@ -0,0 +1,11 @@ +menu "InfiniBand support" + +config INFINIBAND + tristate "InfiniBand support" + default n + ---help--- + Core support for InfiniBand (IB). Make sure to also select + any protocols you wish to use as well as drivers for your + InfiniBand hardware. + +endmenu --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Makefile 2004-12-13 09:44:40.278791590 -0800 @@ -0,0 +1 @@ +obj-$(CONFIG_INFINIBAND) += core/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-13 09:44:40.305787613 -0800 @@ -0,0 +1,13 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND) += \ + ib_core.o + +ib_core-objs := \ + packer.o \ + ud_header.o \ + verbs.o \ + sysfs.o \ + device.o \ + fmr_pool.o \ + cache.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/cache.c 2004-12-13 09:44:40.502758596 -0800 @@ -0,0 +1,317 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Topspin Communications. All rights reserved. + + $Id: cache.c 1293 2004-11-25 16:04:07Z roland $ +*/ + +#include +#include +#include +#include + +#include "core_priv.h" + +struct ib_pkey_cache { + int table_len; + u16 table[0]; +}; + +struct ib_gid_cache { + int table_len; + union ib_gid table[0]; +}; + +struct ib_update_work { + struct work_struct work; + struct ib_device *device; + u8 port_num; +}; + +static inline int start_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 0 : 1; +} + +static inline int end_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 
0 : device->phys_port_cnt; +} + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid) +{ + struct ib_gid_cache *cache; + unsigned long flags; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.gid_cache[port - start_port(device)]; + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *gid = cache->table[index]; + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_gid_get); + +int ib_cached_pkey_get(struct ib_device *device, + u8 port, + int index, + u16 *pkey) +{ + struct ib_pkey_cache *cache; + unsigned long flags; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.pkey_cache[port - start_port(device)]; + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *pkey = cache->table[index]; + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_get); + +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index) +{ + struct ib_pkey_cache *cache; + unsigned long flags; + int i; + int ret = -ENOENT; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.pkey_cache[port - start_port(device)]; + + *index = -1; + + for (i = 0; i < cache->table_len; ++i) + if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { + *index = i; + ret = 0; + break; + } + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_find); + +static void ib_cache_update(struct ib_device *device, + u8 port) +{ + struct ib_port_attr *tprops = NULL; + struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache; + struct ib_gid_cache *gid_cache = NULL, *old_gid_cache; + int i; + int ret; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + return; + + ret = ib_query_port(device, port, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed (%d) for %s\n", + ret, device->name); + goto err; + } + + pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * + sizeof *pkey_cache->table, GFP_KERNEL); + if (!pkey_cache) + goto err; + + pkey_cache->table_len = tprops->pkey_tbl_len; + + gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len * + sizeof *gid_cache->table, GFP_KERNEL); + if (!gid_cache) + goto err; + + gid_cache->table_len = tprops->gid_tbl_len; + + for (i = 0; i < pkey_cache->table_len; ++i) { + ret = ib_query_pkey(device, port, i, pkey_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + for (i = 0; i < gid_cache->table_len; ++i) { + ret = ib_query_gid(device, port, i, gid_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + write_lock_irq(&device->cache.lock); + + old_pkey_cache = device->cache.pkey_cache[port - start_port(device)]; + old_gid_cache = device->cache.gid_cache [port - start_port(device)]; + + device->cache.pkey_cache[port - start_port(device)] = pkey_cache; + device->cache.gid_cache [port - start_port(device)] = gid_cache; + + write_unlock_irq(&device->cache.lock); + + kfree(old_pkey_cache); + 
kfree(old_gid_cache); + kfree(tprops); + return; + +err: + kfree(pkey_cache); + kfree(gid_cache); + kfree(tprops); +} + +static void ib_cache_task(void *work_ptr) +{ + struct ib_update_work *work = work_ptr; + + ib_cache_update(work->device, work->port_num); + kfree(work); +} + +static void ib_cache_event(struct ib_event_handler *handler, + struct ib_event *event) +{ + struct ib_update_work *work; + + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + work = kmalloc(sizeof *work, GFP_ATOMIC); + if (work) { + INIT_WORK(&work->work, ib_cache_task, work); + work->device = event->device; + work->port_num = event->element.port_num; + schedule_work(&work->work); + } + } +} + +void ib_cache_setup_one(struct ib_device *device) +{ + int p; + + rwlock_init(&device->cache.lock); + + device->cache.pkey_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + device->cache.gid_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + + if (!device->cache.pkey_cache || !device->cache.gid_cache) { + printk(KERN_WARNING "Couldn't allocate cache " + "for %s\n", device->name); + goto err; + } + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + device->cache.pkey_cache[p] = NULL; + device->cache.gid_cache [p] = NULL; + ib_cache_update(device, p + start_port(device)); + } + + INIT_IB_EVENT_HANDLER(&device->cache.event_handler, + device, ib_cache_event); + if (ib_register_event_handler(&device->cache.event_handler)) + goto err_cache; + + return; + +err_cache: + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); + } + +err: + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} + +void ib_cache_cleanup_one(struct ib_device *device) +{ + int p; + + ib_unregister_event_handler(&device->cache.event_handler); + flush_scheduled_work(); + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); + } + + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} + +struct ib_client cache_client = { + .name = "cache", + .add = ib_cache_setup_one, + .remove = ib_cache_cleanup_one +}; + +int __init ib_cache_setup(void) +{ + return ib_register_client(&cache_client); +} + +void __exit ib_cache_cleanup(void) +{ + ib_unregister_client(&cache_client); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/core_priv.h 2004-12-13 09:44:40.527754913 -0800 @@ -0,0 +1,41 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Topspin Communications. All rights reserved. + + $Id: core_priv.h 1288 2004-11-24 01:12:39Z roland $ +*/ + +#ifndef _CORE_PRIV_H +#define _CORE_PRIV_H + +#include +#include + +#include + +int ib_device_register_sysfs(struct ib_device *device); +void ib_device_unregister_sysfs(struct ib_device *device); + +int ib_sysfs_setup(void); +void ib_sysfs_cleanup(void); + +int ib_cache_setup(void); +void ib_cache_cleanup(void); + +#endif /* _CORE_PRIV_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/device.c 2004-12-13 09:44:40.448766550 -0800 @@ -0,0 +1,603 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: device.c 1304 2004-11-30 19:01:25Z sean.hefty $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("core kernel InfiniBand API"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct ib_client_data { + struct list_head list; + struct ib_client *client; + void * data; +}; + +static LIST_HEAD(device_list); +static LIST_HEAD(client_list); + +/* + * device_sem protects access to both device_list and client_list. + * There's no real point to using multiple locks or something fancier + * like an rwsem: we always access both lists, and we're always + * modifying one list or the other list. In any case this is not a + * hot path so there's no point in trying to optimize. 
+ */ +static DECLARE_MUTEX(device_sem); + +static int ib_device_check_mandatory(struct ib_device *device) +{ +#define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device, x), #x } + static const struct { + size_t offset; + char *name; + } mandatory_table[] = { + IB_MANDATORY_FUNC(query_device), + IB_MANDATORY_FUNC(query_port), + IB_MANDATORY_FUNC(query_pkey), + IB_MANDATORY_FUNC(query_gid), + IB_MANDATORY_FUNC(alloc_pd), + IB_MANDATORY_FUNC(dealloc_pd), + IB_MANDATORY_FUNC(create_ah), + IB_MANDATORY_FUNC(destroy_ah), + IB_MANDATORY_FUNC(create_qp), + IB_MANDATORY_FUNC(modify_qp), + IB_MANDATORY_FUNC(destroy_qp), + IB_MANDATORY_FUNC(post_send), + IB_MANDATORY_FUNC(post_recv), + IB_MANDATORY_FUNC(create_cq), + IB_MANDATORY_FUNC(destroy_cq), + IB_MANDATORY_FUNC(poll_cq), + IB_MANDATORY_FUNC(req_notify_cq), + IB_MANDATORY_FUNC(get_dma_mr), + IB_MANDATORY_FUNC(dereg_mr) + }; + int i; + + for (i = 0; i < sizeof mandatory_table / sizeof mandatory_table[0]; ++i) { + if (!*(void **) ((void *) device + mandatory_table[i].offset)) { + printk(KERN_WARNING "Device %s is missing mandatory function %s\n", + device->name, mandatory_table[i].name); + return -EINVAL; + } + } + + return 0; +} + +static struct ib_device *__ib_device_get_by_name(const char *name) +{ + struct ib_device *device; + + list_for_each_entry(device, &device_list, core_list) + if (!strncmp(name, device->name, IB_DEVICE_NAME_MAX)) + return device; + + return NULL; +} + + +static int alloc_name(char *name) +{ + long *inuse; + char buf[IB_DEVICE_NAME_MAX]; + struct ib_device *device; + int i; + + inuse = (long *) get_zeroed_page(GFP_KERNEL); + if (!inuse) + return -ENOMEM; + + list_for_each_entry(device, &device_list, core_list) { + if (!sscanf(device->name, name, &i)) + continue; + if (i < 0 || i >= PAGE_SIZE * 8) + continue; + snprintf(buf, sizeof buf, name, i); + if (!strncmp(buf, device->name, IB_DEVICE_NAME_MAX)) + set_bit(i, inuse); + } + + i = find_first_zero_bit(inuse, PAGE_SIZE * 8); + free_page((unsigned long) inuse); + snprintf(buf, sizeof buf, name, i); + + if (__ib_device_get_by_name(buf)) + return -ENFILE; + + strlcpy(name, buf, IB_DEVICE_NAME_MAX); + return 0; +} + +/** + * ib_alloc_device - allocate an IB device struct + * @size:size of structure to allocate + * + * Low-level drivers should use ib_alloc_device() to allocate &struct + * ib_device. @size is the size of the structure to be allocated, + * including any private data used by the low-level driver. + * ib_dealloc_device() must be used to free structures allocated with + * ib_alloc_device(). + */ +struct ib_device *ib_alloc_device(size_t size) +{ + void *dev; + + BUG_ON(size < sizeof (struct ib_device)); + + dev = kmalloc(size, GFP_KERNEL); + if (!dev) + return NULL; + + memset(dev, 0, size); + + return dev; +} +EXPORT_SYMBOL(ib_alloc_device); + +/** + * ib_dealloc_device - free an IB device struct + * @device:structure to free + * + * Free a structure allocated with ib_alloc_device(). 
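+ *
+ * Sketch of the usual low-level driver pattern ("struct my_hca" is a
+ * hypothetical driver-private structure whose first member is a
+ * struct ib_device named ib_dev, which is what makes the cast valid):
+ *
+ *	struct my_hca *hca;
+ *
+ *	hca = (struct my_hca *) ib_alloc_device(sizeof *hca);
+ *	if (!hca)
+ *		return -ENOMEM;
+ *	...
+ *	ib_dealloc_device(&hca->ib_dev);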
+ */ +void ib_dealloc_device(struct ib_device *device) +{ + if (device->reg_state == IB_DEV_UNINITIALIZED) { + kfree(device); + return; + } + + BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); + + ib_device_unregister_sysfs(device); +} +EXPORT_SYMBOL(ib_dealloc_device); + +static int add_client_context(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + unsigned long flags; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) { + printk(KERN_WARNING "Couldn't allocate client context for %s/%s\n", + device->name, client->name); + return -ENOMEM; + } + + context->client = client; + context->data = NULL; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return 0; +} + +/** + * ib_register_device - Register an IB device with IB core + * @device:Device to register + * + * Low-level drivers use ib_register_device() to register their + * devices with the IB core. All registered clients will receive a + * callback for each device that is added. @device must be allocated + * with ib_alloc_device(). + */ +int ib_register_device(struct ib_device *device) +{ + int ret; + + down(&device_sem); + + if (strchr(device->name, '%')) { + ret = alloc_name(device->name); + if (ret) + goto out; + } + + if (ib_device_check_mandatory(device)) { + ret = -EINVAL; + goto out; + } + + INIT_LIST_HEAD(&device->event_handler_list); + INIT_LIST_HEAD(&device->client_data_list); + spin_lock_init(&device->event_handler_lock); + spin_lock_init(&device->client_data_lock); + + ret = ib_device_register_sysfs(device); + if (ret) { + printk(KERN_WARNING "Couldn't register device %s with driver model\n", + device->name); + goto out; + } + + list_add_tail(&device->core_list, &device_list); + + device->reg_state = IB_DEV_REGISTERED; + + { + struct ib_client *client; + + list_for_each_entry(client, &client_list, list) + if (client->add && !add_client_context(device, client)) + client->add(device); + } + + out: + up(&device_sem); + return ret; +} +EXPORT_SYMBOL(ib_register_device); + +/** + * ib_unregister_device - Unregister an IB device + * @device:Device to unregister + * + * Unregister an IB device. All clients will receive a remove callback. + */ +void ib_unregister_device(struct ib_device *device) +{ + struct ib_client *client; + struct ib_client_data *context, *tmp; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry_reverse(client, &client_list, list) + if (client->remove) + client->remove(device); + + list_del(&device->core_list); + + up(&device_sem); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + kfree(context); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + device->reg_state = IB_DEV_UNREGISTERED; +} +EXPORT_SYMBOL(ib_unregister_device); + +/** + * ib_register_client - Register an IB client + * @client:Client to register + * + * Upper level users of the IB drivers can use ib_register_client() to + * register callbacks for IB device addition and removal. When an IB + * device is added, each registered client's add method will be called + * (in the order the clients were registered), and when a device is + * removed, each client's remove method will be called (in the reverse + * order that clients were registered). 
In addition, when + * ib_register_client() is called, the client will receive an add + * callback for all devices already registered. + */ +int ib_register_client(struct ib_client *client) +{ + struct ib_device *device; + + down(&device_sem); + + list_add_tail(&client->list, &client_list); + list_for_each_entry(device, &device_list, core_list) + if (client->add && !add_client_context(device, client)) + client->add(device); + + up(&device_sem); + + return 0; +} +EXPORT_SYMBOL(ib_register_client); + +/** + * ib_unregister_client - Unregister an IB client + * @client:Client to unregister + * + * Upper level users use ib_unregister_client() to remove their client + * registration. When ib_unregister_client() is called, the client + * will receive a remove callback for each IB device still registered. + */ +void ib_unregister_client(struct ib_client *client) +{ + struct ib_client_data *context, *tmp; + struct ib_device *device; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry(device, &device_list, core_list) { + if (client->remove) + client->remove(device); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + if (context->client == client) { + list_del(&context->list); + kfree(context); + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + } + list_del(&client->list); + + up(&device_sem); +} +EXPORT_SYMBOL(ib_unregister_client); + +/** + * ib_get_client_data - Get IB client context + * @device:Device to get context for + * @client:Client to get context for + * + * ib_get_client_data() returns client context set with + * ib_set_client_data(). + */ +void *ib_get_client_data(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + void *ret = NULL; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + ret = context->data; + break; + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_get_client_data); + +/** + * ib_set_client_data - Get IB client context + * @device:Device to set context for + * @client:Client to set context for + * @data:Context to set + * + * ib_set_client_data() sets client context that can be retrieved with + * ib_get_client_data(). + */ +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data) +{ + struct ib_client_data *context; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + context->data = data; + goto out; + } + + printk(KERN_WARNING "No client context found for %s/%s\n", + device->name, client->name); + +out: + spin_unlock_irqrestore(&device->client_data_lock, flags); +} +EXPORT_SYMBOL(ib_set_client_data); + +/** + * ib_register_event_handler - Register an IB event handler + * @event_handler:Handler to register + * + * ib_register_event_handler() registers an event handler that will be + * called back when asynchronous IB events occur (as defined in + * chapter 11 of the InfiniBand Architecture Specification). This + * callback may occur in interrupt context. 
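+ *
+ * Sketch of a client registering a handler (my_handler is a
+ * struct ib_event_handler owned by the caller; the callback must not
+ * sleep, since it may be invoked from interrupt context):
+ *
+ *	static void my_event_callback(struct ib_event_handler *handler,
+ *				      struct ib_event *event)
+ *	{
+ *		handle event->event for event->element.port_num here
+ *	}
+ *
+ *	INIT_IB_EVENT_HANDLER(&my_handler, device, my_event_callback);
+ *	ib_register_event_handler(&my_handler);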
+ */ +int ib_register_event_handler (struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_add_tail(&event_handler->list, + &event_handler->device->event_handler_list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_register_event_handler); + +/** + * ib_unregister_event_handler - Unregister an event handler + * @event_handler:Handler to unregister + * + * Unregister an event handler registered with + * ib_register_event_handler(). + */ +int ib_unregister_event_handler(struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_del(&event_handler->list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_unregister_event_handler); + +/** + * ib_dispatch_event - Dispatch an asynchronous event + * @event:Event to dispatch + * + * Low-level drivers must call ib_dispatch_event() to dispatch the + * event to all registered event handlers when an asynchronous event + * occurs. + */ +void ib_dispatch_event(struct ib_event *event) +{ + unsigned long flags; + struct ib_event_handler *handler; + + spin_lock_irqsave(&event->device->event_handler_lock, flags); + + list_for_each_entry(handler, &event->device->event_handler_list, list) + handler->handler(handler, event); + + spin_unlock_irqrestore(&event->device->event_handler_lock, flags); +} +EXPORT_SYMBOL(ib_dispatch_event); + +/** + * ib_query_device - Query IB device attributes + * @device:Device to query + * @device_attr:Device attributes + * + * ib_query_device() returns the attributes of a device through the + * @device_attr pointer. + */ +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr) +{ + return device->query_device(device, device_attr); +} +EXPORT_SYMBOL(ib_query_device); + +/** + * ib_query_port - Query IB port attributes + * @device:Device to query + * @port_num:Port number to query + * @port_attr:Port attributes + * + * ib_query_port() returns the attributes of a port through the + * @port_attr pointer. + */ +int ib_query_port(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr) +{ + return device->query_port(device, port_num, port_attr); +} +EXPORT_SYMBOL(ib_query_port); + +/** + * ib_query_gid - Get GID table entry + * @device:Device to query + * @port_num:Port number to query + * @index:GID table index to query + * @gid:Returned GID + * + * ib_query_gid() fetches the specified GID table entry. + */ +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid) +{ + return device->query_gid(device, port_num, index, gid); +} +EXPORT_SYMBOL(ib_query_gid); + +/** + * ib_query_pkey - Get P_Key table entry + * @device:Device to query + * @port_num:Port number to query + * @index:P_Key table index to query + * @pkey:Returned P_Key + * + * ib_query_pkey() fetches the specified P_Key table entry. 
+ */ +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey) +{ + return device->query_pkey(device, port_num, index, pkey); +} +EXPORT_SYMBOL(ib_query_pkey); + +/** + * ib_modify_device - Change IB device attributes + * @device:Device to modify + * @device_modify_mask:Mask of attributes to change + * @device_modify:New attribute values + * + * ib_modify_device() changes a device's attributes as specified by + * the @device_modify_mask and @device_modify structure. + */ +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify) +{ + return device->modify_device(device, device_modify_mask, + device_modify); +} +EXPORT_SYMBOL(ib_modify_device); + +/** + * ib_modify_port - Modifies the attributes for the specified port. + * @device: The device to modify. + * @port_num: The number of the port to modify. + * @port_modify_mask: Mask used to specify which attributes of the port + * to change. + * @port_modify: New attribute values for the port. + * + * ib_modify_port() changes a port's attributes as specified by the + * @port_modify_mask and @port_modify structure. + */ +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify) +{ + return device->modify_port(device, port_num, port_modify_mask, + port_modify); +} +EXPORT_SYMBOL(ib_modify_port); + +static int __init ib_core_init(void) +{ + int ret; + + ret = ib_sysfs_setup(); + if (ret) + printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); + + ret = ib_cache_setup(); + if (ret) { + printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n"); + ib_sysfs_cleanup(); + } + + return ret; +} + +static void __exit ib_core_cleanup(void) +{ + ib_cache_cleanup(); + ib_sysfs_cleanup(); +} + +module_init(ib_core_init); +module_exit(ib_core_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/fmr_pool.c 2004-12-13 09:44:40.476762425 -0800 @@ -0,0 +1,496 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: fmr_pool.c 1294 2004-11-25 19:17:57Z roland $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +enum { + IB_FMR_MAX_REMAPS = 32, + + IB_FMR_HASH_BITS = 8, + IB_FMR_HASH_SIZE = 1 << IB_FMR_HASH_BITS, + IB_FMR_HASH_MASK = IB_FMR_HASH_SIZE - 1 +}; + +/* + * If an FMR is not in use, then the list member will point to either + * its pool's free_list (if the FMR can be mapped again; that is, + * remap_count < IB_FMR_MAX_REMAPS) or its pool's dirty_list (if the + * FMR needs to be unmapped before being remapped). 
In either of + * these cases it is a bug if the ref_count is not 0. In other words, + * if ref_count is > 0, then the list member must not be linked into + * either free_list or dirty_list. + * + * The cache_node member is used to link the FMR into a cache bucket + * (if caching is enabled). This is independent of the reference + * count of the FMR. When a valid FMR is released, its ref_count is + * decremented, and if ref_count reaches 0, the FMR is placed in + * either free_list or dirty_list as appropriate. However, it is not + * removed from the cache and may be "revived" if a call to + * ib_fmr_register_physical() occurs before the FMR is remapped. In + * this case we just increment the ref_count and remove the FMR from + * free_list/dirty_list. + * + * Before we remap an FMR from free_list, we remove it from the cache + * (to prevent another user from obtaining a stale FMR). When an FMR + * is released, we add it to the tail of the free list, so that our + * cache eviction policy is "least recently used." + * + * All manipulation of ref_count, list and cache_node is protected by + * pool_lock to maintain consistency. + */ + +struct ib_fmr_pool { + spinlock_t pool_lock; + + int pool_size; + int max_pages; + int dirty_watermark; + int dirty_len; + struct list_head free_list; + struct list_head dirty_list; + struct hlist_head *cache_bucket; + + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + + struct task_struct *thread; + + atomic_t req_ser; + atomic_t flush_ser; + + wait_queue_head_t force_wait; +}; + +static inline u32 ib_fmr_hash(u64 first_page) +{ + return jhash_2words((u32) first_page, + (u32) (first_page >> 32), + 0); +} + +/* Caller must hold pool_lock */ +static inline struct ib_pool_fmr *ib_fmr_cache_lookup(struct ib_fmr_pool *pool, + u64 *page_list, + int page_list_len, + u64 io_virtual_address) +{ + struct hlist_head *bucket; + struct ib_pool_fmr *fmr; + struct hlist_node *pos; + + if (!pool->cache_bucket) + return NULL; + + bucket = pool->cache_bucket + ib_fmr_hash(*page_list); + + hlist_for_each_entry(fmr, pos, bucket, cache_node) + if (io_virtual_address == fmr->io_virtual_address && + page_list_len == fmr->page_list_len && + !memcmp(page_list, fmr->page_list, + page_list_len * sizeof *page_list)) + return fmr; + + return NULL; +} + +static void ib_fmr_batch_release(struct ib_fmr_pool *pool) +{ + int ret; + struct ib_pool_fmr *fmr; + LIST_HEAD(unmap_list); + LIST_HEAD(fmr_list); + + spin_lock_irq(&pool->pool_lock); + + list_for_each_entry(fmr, &pool->dirty_list, list) { + hlist_del_init(&fmr->cache_node); + fmr->remap_count = 0; + list_add_tail(&fmr->fmr->list, &fmr_list); + +#ifdef DEBUG + if (fmr->ref_count !=0) { + printk(KERN_WARNING "Unmapping FMR 0x%08x with ref count %d", + fmr, fmr->ref_count); + } +#endif + } + + list_splice(&pool->dirty_list, &unmap_list); + INIT_LIST_HEAD(&pool->dirty_list); + pool->dirty_len = 0; + + spin_unlock_irq(&pool->pool_lock); + + if (list_empty(&unmap_list)) { + return; + } + + ret = ib_unmap_fmr(&fmr_list); + if (ret) + printk(KERN_WARNING "ib_unmap_fmr returned %d", ret); + + spin_lock_irq(&pool->pool_lock); + list_splice(&unmap_list, &pool->free_list); + spin_unlock_irq(&pool->pool_lock); +} + +static int ib_fmr_cleanup_thread(void *pool_ptr) +{ + struct ib_fmr_pool *pool = pool_ptr; + + do { + if (pool->dirty_len >= pool->dirty_watermark || + atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) < 0) { + ib_fmr_batch_release(pool); + + atomic_inc(&pool->flush_ser); + 
wake_up_interruptible(&pool->force_wait); + + if (pool->flush_function) + pool->flush_function(pool, pool->flush_arg); + } + + set_current_state(TASK_INTERRUPTIBLE); + if (pool->dirty_len < pool->dirty_watermark && + atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) >= 0 && + !kthread_should_stop()) + schedule(); + __set_current_state(TASK_RUNNING); + } while (!kthread_should_stop()); + + return 0; +} + +/** + * ib_create_fmr_pool - Create an FMR pool + * @pd:Protection domain for FMRs + * @params:FMR pool parameters + * + * Create a pool of FMRs. Return value is pointer to new pool or + * error code if creation failed. + */ +struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params) +{ + struct ib_device *device; + struct ib_fmr_pool *pool; + int i; + int ret; + + if (!params) + return ERR_PTR(-EINVAL); + + device = pd->device; + if (!device->alloc_fmr || !device->dealloc_fmr || + !device->map_phys_fmr || !device->unmap_fmr) { + printk(KERN_WARNING "Device %s does not support fast memory regions", + device->name); + return ERR_PTR(-ENOSYS); + } + + pool = kmalloc(sizeof *pool, GFP_KERNEL); + if (!pool) { + printk(KERN_WARNING "couldn't allocate pool struct"); + return ERR_PTR(-ENOMEM); + } + + pool->cache_bucket = NULL; + + pool->flush_function = params->flush_function; + pool->flush_arg = params->flush_arg; + + INIT_LIST_HEAD(&pool->free_list); + INIT_LIST_HEAD(&pool->dirty_list); + + if (params->cache) { + pool->cache_bucket = + kmalloc(IB_FMR_HASH_SIZE * sizeof *pool->cache_bucket, + GFP_KERNEL); + if (!pool->cache_bucket) { + printk(KERN_WARNING "Failed to allocate cache in pool"); + ret = -ENOMEM; + goto out_free_pool; + } + + for (i = 0; i < IB_FMR_HASH_SIZE; ++i) + INIT_HLIST_HEAD(pool->cache_bucket + i); + } + + pool->pool_size = 0; + pool->max_pages = params->max_pages_per_fmr; + pool->dirty_watermark = params->dirty_watermark; + pool->dirty_len = 0; + spin_lock_init(&pool->pool_lock); + atomic_set(&pool->req_ser, 0); + atomic_set(&pool->flush_ser, 0); + init_waitqueue_head(&pool->force_wait); + + pool->thread = kthread_create(ib_fmr_cleanup_thread, + pool, + "ib_fmr(%s)", + device->name); + if (IS_ERR(pool->thread)) { + printk(KERN_WARNING "couldn't start cleanup thread"); + ret = PTR_ERR(pool->thread); + goto out_free_pool; + } + + { + struct ib_pool_fmr *fmr; + struct ib_fmr_attr attr = { + .max_pages = params->max_pages_per_fmr, + .max_maps = IB_FMR_MAX_REMAPS, + .page_size = PAGE_SHIFT + }; + + for (i = 0; i < params->pool_size; ++i) { + fmr = kmalloc(sizeof *fmr + params->max_pages_per_fmr * sizeof (u64), + GFP_KERNEL); + if (!fmr) { + printk(KERN_WARNING "failed to allocate fmr struct " + "for FMR %d", i); + goto out_fail; + } + + fmr->pool = pool; + fmr->remap_count = 0; + fmr->ref_count = 0; + INIT_HLIST_NODE(&fmr->cache_node); + + fmr->fmr = ib_alloc_fmr(pd, params->access, &attr); + if (IS_ERR(fmr->fmr)) { + printk(KERN_WARNING "fmr_create failed for FMR %d", i); + kfree(fmr); + goto out_fail; + } + + list_add_tail(&fmr->list, &pool->free_list); + ++pool->pool_size; + } + } + + return pool; + + out_free_pool: + kfree(pool->cache_bucket); + kfree(pool); + + return ERR_PTR(ret); + + out_fail: + ib_destroy_fmr_pool(pool); + + return ERR_PTR(-ENOMEM); +} +EXPORT_SYMBOL(ib_create_fmr_pool); + +/** + * ib_destroy_fmr_pool - Free FMR pool + * @pool:FMR pool to free + * + * Destroy an FMR pool and free all associated resources. 
+ */ +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool) +{ + struct ib_pool_fmr *fmr; + struct ib_pool_fmr *tmp; + int i; + + kthread_stop(pool->thread); + ib_fmr_batch_release(pool); + + i = 0; + list_for_each_entry_safe(fmr, tmp, &pool->free_list, list) { + ib_dealloc_fmr(fmr->fmr); + list_del(&fmr->list); + kfree(fmr); + ++i; + } + + if (i < pool->pool_size) + printk(KERN_WARNING "pool still has %d regions registered", + pool->pool_size - i); + + kfree(pool->cache_bucket); + kfree(pool); + + return 0; +} +EXPORT_SYMBOL(ib_destroy_fmr_pool); + +/** + * ib_flush_fmr_pool - Invalidate all unmapped FMRs + * @pool:FMR pool to flush + * + * Ensure that all unmapped FMRs are fully invalidated. + */ +int ib_flush_fmr_pool(struct ib_fmr_pool *pool) +{ + int serial; + + atomic_inc(&pool->req_ser); + /* + * It's OK if someone else bumps req_ser again here -- we'll + * just wait a little longer. + */ + serial = atomic_read(&pool->req_ser); + + wake_up_process(pool->thread); + + if (wait_event_interruptible(pool->force_wait, + atomic_read(&pool->flush_ser) - + atomic_read(&pool->req_ser) >= 0)) + return -EINTR; + + return 0; +} +EXPORT_SYMBOL(ib_flush_fmr_pool); + +/** + * ib_fmr_pool_map_phys - + * @pool:FMR pool to allocate FMR from + * @page_list:List of pages to map + * @list_len:Number of pages in @page_list + * @io_virtual_address:I/O virtual address for new FMR + * + * Map an FMR from an FMR pool. + */ +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address) +{ + struct ib_fmr_pool *pool = pool_handle; + struct ib_pool_fmr *fmr; + unsigned long flags; + int result; + + if (list_len < 1 || list_len > pool->max_pages) + return ERR_PTR(-EINVAL); + + spin_lock_irqsave(&pool->pool_lock, flags); + fmr = ib_fmr_cache_lookup(pool, + page_list, + list_len, + *io_virtual_address); + if (fmr) { + /* found in cache */ + ++fmr->ref_count; + if (fmr->ref_count == 1) { + list_del(&fmr->list); + } + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return fmr; + } + + if (list_empty(&pool->free_list)) { + spin_unlock_irqrestore(&pool->pool_lock, flags); + return ERR_PTR(-EAGAIN); + } + + fmr = list_entry(pool->free_list.next, struct ib_pool_fmr, list); + list_del(&fmr->list); + hlist_del_init(&fmr->cache_node); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + result = ib_map_phys_fmr(fmr->fmr, page_list, list_len, + *io_virtual_address); + + if (result) { + spin_lock_irqsave(&pool->pool_lock, flags); + list_add(&fmr->list, &pool->free_list); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + printk(KERN_WARNING "fmr_map returns %d", + result); + + return ERR_PTR(result); + } + + ++fmr->remap_count; + fmr->ref_count = 1; + + if (pool->cache_bucket) { + fmr->io_virtual_address = *io_virtual_address; + fmr->page_list_len = list_len; + memcpy(fmr->page_list, page_list, list_len * sizeof(*page_list)); + + spin_lock_irqsave(&pool->pool_lock, flags); + hlist_add_head(&fmr->cache_node, + pool->cache_bucket + ib_fmr_hash(fmr->page_list[0])); + spin_unlock_irqrestore(&pool->pool_lock, flags); + } + + return fmr; +} +EXPORT_SYMBOL(ib_fmr_pool_map_phys); + +/** + * ib_fmr_pool_unmap - Unmap FMR + * @fmr:FMR to unmap + * + * Unmap an FMR. The FMR mapping may remain valid until the FMR is + * reused (or until ib_flush_fmr_pool() is called). 
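+ *
+ * A typical map/unmap cycle looks roughly like this (pages[] holds the
+ * caller-supplied physical addresses to map and iova is the requested
+ * I/O virtual address):
+ *
+ *	pool_fmr = ib_fmr_pool_map_phys(pool, pages, npages, &iova);
+ *	if (IS_ERR(pool_fmr))
+ *		return PTR_ERR(pool_fmr);
+ *	... post work requests that reference the mapped region ...
+ *	ib_fmr_pool_unmap(pool_fmr);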
+ */ +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr) +{ + struct ib_fmr_pool *pool; + unsigned long flags; + + pool = fmr->pool; + + spin_lock_irqsave(&pool->pool_lock, flags); + + --fmr->ref_count; + if (!fmr->ref_count) { + if (fmr->remap_count < IB_FMR_MAX_REMAPS) { + list_add_tail(&fmr->list, &pool->free_list); + } else { + list_add_tail(&fmr->list, &pool->dirty_list); + ++pool->dirty_len; + wake_up_process(pool->thread); + } + } + +#ifdef DEBUG + if (fmr->ref_count < 0) + printk(KERN_WARNING "FMR %p has ref count %d < 0", + fmr, fmr->ref_count); +#endif + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_fmr_pool_unmap); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/packer.c 2004-12-13 09:44:40.335783194 -0800 @@ -0,0 +1,190 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: packer.c 1292 2004-11-25 05:24:40Z roland $ + */ + +#include + +static u64 value_read(int offset, int size, void *structure) +{ + switch (size) { + case 1: return *(u8 *) (structure + offset); + case 2: return be16_to_cpup((__be16 *) (structure + offset)); + case 4: return be32_to_cpup((__be32 *) (structure + offset)); + case 8: return be64_to_cpup((__be64 *) (structure + offset)); + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + return 0; + } +} + +/** + * ib_pack - Pack a structure into a buffer + * @desc:Array of structure field descriptions + * @desc_len:Number of entries in @desc + * @structure:Structure to pack from + * @buf:Buffer to pack into + * + * ib_pack() packs a list of structure fields into a buffer, + * controlled by the array of fields in @desc. 
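+ *
+ * Each entry in @desc describes one field: where it lives in the host
+ * structure and where it lives in the wire-format buffer. An
+ * illustrative entry (struct my_hdr and the offsets are made up for
+ * the example) might look like:
+ *
+ *	{ .struct_offset_bytes = offsetof(struct my_hdr, dlid),
+ *	  .struct_size_bytes   = 2,
+ *	  .offset_words        = 1,
+ *	  .offset_bits         = 16,
+ *	  .size_bits           = 16,
+ *	  .field_name          = "DLID" }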
+ */ +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + __be32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be32(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be32 *) buf + desc[i].offset_words; + *addr = (*addr & ~mask) | (cpu_to_be32(val) & mask); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + __be64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be64(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be64 *) ((__be32 *) buf + desc[i].offset_words); + *addr = (*addr & ~mask) | (cpu_to_be64(val) & mask); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + if (desc[i].struct_size_bytes) + memcpy(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + structure + desc[i].struct_offset_bytes, + desc[i].size_bits / 8); + else + memset(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + 0, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_pack); + +static void value_write(int offset, int size, u64 val, void *structure) +{ + switch (size * 8) { + case 8: *( u8 *) (structure + offset) = val; break; + case 16: *(__be16 *) (structure + offset) = cpu_to_be16(val); break; + case 32: *(__be32 *) (structure + offset) = cpu_to_be32(val); break; + case 64: *(__be64 *) (structure + offset) = cpu_to_be64(val); break; + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + } +} + +/** + * ib_unpack - Unpack a buffer into a structure + * @desc:Array of structure field descriptions + * @desc_len:Number of entries in @desc + * @buf:Buffer to unpack from + * @structure:Structure to unpack into + * + * ib_pack() unpacks a list of structure fields from a buffer, + * controlled by the array of fields in @desc. 
+ */ +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (!desc[i].struct_size_bytes) + continue; + + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + u32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be32 *) buf + desc[i].offset_words; + val = (be32_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + u64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be64 *) buf + desc[i].offset_words; + val = (be64_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + memcpy(structure + desc[i].struct_offset_bytes, + buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_unpack); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sysfs.c 2004-12-13 09:44:40.421770527 -0800 @@ -0,0 +1,684 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: sysfs.c 1257 2004-11-17 23:12:18Z roland $ +*/ + +#include "core_priv.h" + +#include + +struct ib_port { + struct kobject kobj; + struct ib_device *ibdev; + struct attribute_group gid_group; + struct attribute **gid_attr; + struct attribute_group pkey_group; + struct attribute **pkey_attr; + u8 port_num; +}; + +struct port_attribute { + struct attribute attr; + ssize_t (*show)(struct ib_port *, struct port_attribute *, char *buf); + ssize_t (*store)(struct ib_port *, struct port_attribute *, + const char *buf, size_t count); +}; + +#define PORT_ATTR(_name, _mode, _show, _store) \ +struct port_attribute port_attr_##_name = __ATTR(_name, _mode, _show, _store) + +#define PORT_ATTR_RO(_name) \ +struct port_attribute port_attr_##_name = __ATTR_RO(_name) + +struct port_table_attribute { + struct port_attribute attr; + int index; +}; + +static ssize_t port_attr_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct port_attribute *port_attr = + container_of(attr, struct port_attribute, attr); + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + + if (!port_attr->show) + return 0; + + return port_attr->show(p, port_attr, buf); +} + +static struct sysfs_ops port_sysfs_ops = { + .show = port_attr_show +}; + +static ssize_t state_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + static const char *state_name[] = { + [IB_PORT_NOP] = "NOP", + [IB_PORT_DOWN] = "DOWN", + [IB_PORT_INIT] = "INIT", + [IB_PORT_ARMED] = "ARMED", + [IB_PORT_ACTIVE] = "ACTIVE", + [IB_PORT_ACTIVE_DEFER] = "ACTIVE_DEFER" + }; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d: %s\n", attr.state, + attr.state >= 0 && attr.state < ARRAY_SIZE(state_name) ?
+ state_name[attr.state] : "UNKNOWN"); +} + +static ssize_t lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.lid); +} + +static ssize_t lid_mask_count_show(struct ib_port *p, + struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.lmc); +} + +static ssize_t sm_lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.sm_lid); +} + +static ssize_t sm_sl_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.sm_sl); +} + +static ssize_t cap_mask_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%08x\n", attr.port_cap_flags); +} + +static PORT_ATTR_RO(state); +static PORT_ATTR_RO(lid); +static PORT_ATTR_RO(lid_mask_count); +static PORT_ATTR_RO(sm_lid); +static PORT_ATTR_RO(sm_sl); +static PORT_ATTR_RO(cap_mask); + +static struct attribute *port_default_attrs[] = { + &port_attr_state.attr, + &port_attr_lid.attr, + &port_attr_lid_mask_count.attr, + &port_attr_sm_lid.attr, + &port_attr_sm_sl.attr, + &port_attr_cap_mask.attr, + NULL +}; + +static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + ssize_t ret; + + ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) gid.raw)[0]), + be16_to_cpu(((u16 *) gid.raw)[1]), + be16_to_cpu(((u16 *) gid.raw)[2]), + be16_to_cpu(((u16 *) gid.raw)[3]), + be16_to_cpu(((u16 *) gid.raw)[4]), + be16_to_cpu(((u16 *) gid.raw)[5]), + be16_to_cpu(((u16 *) gid.raw)[6]), + be16_to_cpu(((u16 *) gid.raw)[7])); +} + +static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + u16 pkey; + ssize_t ret; + + ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey); + if (ret) + return ret; + + return sprintf(buf, "0x%04x\n", pkey); +} + +#define PORT_PMA_ATTR(_name, _counter, _width, _offset) \ +struct port_table_attribute port_pma_attr_##_name = { \ + .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL), \ + .index = (_offset) | ((_width) << 16) | ((_counter) << 24) \ +} + +static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + int offset = tab_attr->index & 0xffff; + int width = (tab_attr->index >> 16) & 0xff; + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + ssize_t ret; + + if (!p->ibdev->process_mad) + return sprintf(buf, "N/A (no PMA)\n"); + 
+ in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + if (!in_mad || !out_mad) { + ret = -ENOMEM; + goto out; + } + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(0x12); /* PortCounters */ + + in_mad->data[41] = p->port_num; /* PortSelect field */ + + if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, p->port_num, 0xffff, + in_mad, out_mad) & + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) != + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) { + ret = -EINVAL; + goto out; + } + + switch (width) { + case 4: + ret = sprintf(buf, "%u\n", (out_mad->data[40 + offset / 8] >> + (offset % 4)) & 0xf); + break; + case 8: + ret = sprintf(buf, "%u\n", out_mad->data[40 + offset / 8]); + break; + case 16: + ret = sprintf(buf, "%u\n", + be16_to_cpup((u16 *)(out_mad->data + 40 + offset / 8))); + break; + case 32: + ret = sprintf(buf, "%u\n", + be32_to_cpup((u32 *)(out_mad->data + 40 + offset / 8))); + break; + default: + ret = 0; + } + +out: + kfree(in_mad); + kfree(out_mad); + + return ret; +} + +static PORT_PMA_ATTR(symbol_error , 0, 16, 32); +static PORT_PMA_ATTR(link_error_recovery , 1, 8, 48); +static PORT_PMA_ATTR(link_downed , 2, 8, 56); +static PORT_PMA_ATTR(port_rcv_errors , 3, 16, 64); +static PORT_PMA_ATTR(port_rcv_remote_physical_errors, 4, 16, 80); +static PORT_PMA_ATTR(port_rcv_switch_relay_errors , 5, 16, 96); +static PORT_PMA_ATTR(port_xmit_discards , 6, 16, 112); +static PORT_PMA_ATTR(port_xmit_constraint_errors , 7, 8, 128); +static PORT_PMA_ATTR(port_rcv_constraint_errors , 8, 8, 136); +static PORT_PMA_ATTR(local_link_integrity_errors , 9, 4, 152); +static PORT_PMA_ATTR(excessive_buffer_overrun_errors, 10, 4, 156); +static PORT_PMA_ATTR(VL15_dropped , 11, 16, 176); +static PORT_PMA_ATTR(port_xmit_data , 12, 32, 192); +static PORT_PMA_ATTR(port_rcv_data , 13, 32, 224); +static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256); +static PORT_PMA_ATTR(port_rcv_packets , 15, 32, 288); + +static struct attribute *pma_attrs[] = { + &port_pma_attr_symbol_error.attr.attr, + &port_pma_attr_link_error_recovery.attr.attr, + &port_pma_attr_link_downed.attr.attr, + &port_pma_attr_port_rcv_errors.attr.attr, + &port_pma_attr_port_rcv_remote_physical_errors.attr.attr, + &port_pma_attr_port_rcv_switch_relay_errors.attr.attr, + &port_pma_attr_port_xmit_discards.attr.attr, + &port_pma_attr_port_xmit_constraint_errors.attr.attr, + &port_pma_attr_port_rcv_constraint_errors.attr.attr, + &port_pma_attr_local_link_integrity_errors.attr.attr, + &port_pma_attr_excessive_buffer_overrun_errors.attr.attr, + &port_pma_attr_VL15_dropped.attr.attr, + &port_pma_attr_port_xmit_data.attr.attr, + &port_pma_attr_port_rcv_data.attr.attr, + &port_pma_attr_port_xmit_packets.attr.attr, + &port_pma_attr_port_rcv_packets.attr.attr, + NULL +}; + +static struct attribute_group pma_group = { + .name = "counters", + .attrs = pma_attrs +}; + +static void ib_port_release(struct kobject *kobj) +{ + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + struct attribute *a; + int i; + + for (i = 0; (a = p->gid_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + for (i = 0; (a = p->pkey_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + kfree(p->gid_attr); + kfree(p); +} + +static struct kobj_type port_type = { + .release = ib_port_release, + .sysfs_ops = &port_sysfs_ops, + 
.default_attrs = port_default_attrs +}; + +static void ib_device_release(struct class_device *cdev) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + + kfree(dev); +} + +static int ib_device_hotplug(struct class_device *cdev, char **envp, + int num_envp, char *buf, int size) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + int i = 0, len = 0; + + if (add_hotplug_env_var(envp, num_envp, &i, buf, size, &len, + "NAME=%s", dev->name)) + return -ENOMEM; + + /* + * It might be nice to pass the node GUID to hotplug, but + * right now the only way to get it is to query the device + * provider, and this can crash during device removal because + * we are will be running after driver removal has started. + * We could add a node_guid field to struct ib_device, or we + * could just let the hotplug script read the node GUID from + * sysfs when devices are added. + */ + + envp[i] = NULL; + return 0; +} + +static int alloc_group(struct attribute ***attr, + ssize_t (*show)(struct ib_port *, + struct port_attribute *, char *buf), + int len) +{ + struct port_table_attribute ***tab_attr = + (struct port_table_attribute ***) attr; + int i; + int ret; + + *tab_attr = kmalloc((1 + len) * sizeof *tab_attr, GFP_KERNEL); + if (!*tab_attr) + return -ENOMEM; + + memset(*tab_attr, 0, (1 + len) * sizeof *tab_attr); + + for (i = 0; i < len; ++i) { + (*tab_attr)[i] = kmalloc(sizeof *(*tab_attr)[i], GFP_KERNEL); + if (!(*tab_attr)[i]) { + ret = -ENOMEM; + goto err; + } + memset((*tab_attr)[i], 0, sizeof *(*tab_attr)[i]); + (*tab_attr)[i]->attr.attr.name = kmalloc(8, GFP_KERNEL); + if (!(*tab_attr)[i]->attr.attr.name) { + ret = -ENOMEM; + goto err; + } + + if (snprintf((*tab_attr)[i]->attr.attr.name, 8, "%d", i) >= 8) { + ret = -ENOMEM; + goto err; + } + + (*tab_attr)[i]->attr.attr.mode = S_IRUGO; + (*tab_attr)[i]->attr.attr.owner = THIS_MODULE; + (*tab_attr)[i]->attr.show = show; + (*tab_attr)[i]->index = i; + } + + return 0; + +err: + for (i = 0; i < len; ++i) { + if ((*tab_attr)[i]) + kfree((*tab_attr)[i]->attr.attr.name); + kfree((*tab_attr)[i]); + } + + kfree(*tab_attr); + + return ret; +} + +static int add_port(struct ib_device *device, int port_num) +{ + struct ib_port *p; + struct ib_port_attr attr; + int i; + int ret; + + ret = ib_query_port(device, port_num, &attr); + if (ret) + return ret; + + p = kmalloc(sizeof *p, GFP_KERNEL); + if (!p) + return -ENOMEM; + memset(p, 0, sizeof *p); + + p->ibdev = device; + p->port_num = port_num; + p->kobj.ktype = &port_type; + + p->kobj.parent = kobject_get(&device->ports_parent); + if (!p->kobj.parent) { + ret = -EBUSY; + goto err; + } + + ret = kobject_set_name(&p->kobj, "%d", port_num); + if (ret) + goto err_put; + + ret = kobject_register(&p->kobj); + if (ret) + goto err_put; + + ret = sysfs_create_group(&p->kobj, &pma_group); + if (ret) + goto err_put; + + ret = alloc_group(&p->gid_attr, show_port_gid, attr.gid_tbl_len); + if (ret) + goto err_remove_pma; + + p->gid_group.name = "gids"; + p->gid_group.attrs = p->gid_attr; + + ret = sysfs_create_group(&p->kobj, &p->gid_group); + if (ret) + goto err_free_gid; + + ret = alloc_group(&p->pkey_attr, show_port_pkey, attr.pkey_tbl_len); + if (ret) + goto err_remove_gid; + + p->pkey_group.name = "pkeys"; + p->pkey_group.attrs = p->pkey_attr; + + ret = sysfs_create_group(&p->kobj, &p->pkey_group); + if (ret) + goto err_free_pkey; + + list_add_tail(&p->kobj.entry, &device->port_list); + + return 0; + +err_free_pkey: + for (i = 0; i < attr.pkey_tbl_len; ++i) { + 
kfree(p->pkey_attr[i]->name); + kfree(p->pkey_attr[i]); + } + + kfree(p->pkey_attr); + +err_remove_gid: + sysfs_remove_group(&p->kobj, &p->gid_group); + +err_free_gid: + for (i = 0; i < attr.gid_tbl_len; ++i) { + kfree(p->gid_attr[i]->name); + kfree(p->gid_attr[i]); + } + + kfree(p->gid_attr); + +err_remove_pma: + sysfs_remove_group(&p->kobj, &pma_group); + +err_put: + kobject_put(&device->ports_parent); + +err: + kfree(p); + return ret; +} + +static ssize_t show_sys_image_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.sys_image_guid)[0]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[1]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[2]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[3])); +} + +static ssize_t show_node_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.node_guid)[0]), + be16_to_cpu(((u16 *) &attr.node_guid)[1]), + be16_to_cpu(((u16 *) &attr.node_guid)[2]), + be16_to_cpu(((u16 *) &attr.node_guid)[3])); +} + +static CLASS_DEVICE_ATTR(sys_image_guid, S_IRUGO, show_sys_image_guid, NULL); +static CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); + +static struct class_device_attribute *ib_class_attributes[] = { + &class_device_attr_sys_image_guid, + &class_device_attr_node_guid +}; + +static struct class ib_class = { + .name = "infiniband", + .release = ib_device_release, + .hotplug = ib_device_hotplug, +}; + +int ib_device_register_sysfs(struct ib_device *device) +{ + struct class_device *class_dev = &device->class_dev; + int ret; + int i; + + class_dev->class = &ib_class; + class_dev->class_data = device; + strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); + + INIT_LIST_HEAD(&device->port_list); + + ret = class_device_register(class_dev); + if (ret) + goto err; + + for (i = 0; i < ARRAY_SIZE(ib_class_attributes); ++i) { + ret = class_device_create_file(class_dev, ib_class_attributes[i]); + if (ret) + goto err_unregister; + } + + device->ports_parent.parent = kobject_get(&class_dev->kobj); + if (!device->ports_parent.parent) { + ret = -EBUSY; + goto err_unregister; + } + ret = kobject_set_name(&device->ports_parent, "ports"); + if (ret) + goto err_put; + ret = kobject_register(&device->ports_parent); + if (ret) + goto err_put; + + if (device->node_type == IB_NODE_SWITCH) { + ret = add_port(device, 0); + if (ret) + goto err_put; + } else { + int i; + + for (i = 1; i <= device->phys_port_cnt; ++i) { + ret = add_port(device, i); + if (ret) + goto err_put; + } + } + + return 0; + +err_put: + { + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + } + + kobject_put(&class_dev->kobj); + +err_unregister: + class_device_unregister(class_dev); + +err: + return ret; +} + +void ib_device_unregister_sysfs(struct ib_device *device) +{ + struct kobject *p, *t; + struct ib_port *port; + 
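+	/*
+	 * Undo everything ib_device_register_sysfs() set up: remove the
+	 * "counters", "gids" and "pkeys" groups from each port and
+	 * unregister its kobject, then drop the "ports" directory and
+	 * finally the class device itself.
+	 */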
+ list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + + kobject_unregister(&device->ports_parent); + class_device_unregister(&device->class_dev); +} + +int ib_sysfs_setup(void) +{ + return class_register(&ib_class); +} + +void ib_sysfs_cleanup(void) +{ + class_unregister(&ib_class); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/ud_header.c 2004-12-13 09:44:40.360779512 -0800 @@ -0,0 +1,354 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * $Id: ud_header.c 1292 2004-11-25 05:24:40Z roland $ + */ + +#include + +#include + +#define STRUCT_FIELD(header, field) \ + .struct_offset_bytes = offsetof(struct ib_unpacked_ ## header, field), \ + .struct_size_bytes = sizeof ((struct ib_unpacked_ ## header *) 0)->field, \ + .field_name = #header ":" #field + +static const struct ib_field lrh_table[] = { + { STRUCT_FIELD(lrh, virtual_lane), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, link_version), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, service_level), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 4 }, + { RESERVED, + .offset_words = 0, + .offset_bits = 12, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, link_next_header), + .offset_words = 0, + .offset_bits = 14, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, destination_lid), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 5 }, + { STRUCT_FIELD(lrh, packet_length), + .offset_words = 1, + .offset_bits = 5, + .size_bits = 11 }, + { STRUCT_FIELD(lrh, source_lid), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 } +}; + +static const struct ib_field grh_table[] = { + { STRUCT_FIELD(grh, ip_version), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(grh, traffic_class), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 8 }, + { STRUCT_FIELD(grh, flow_label), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 20 }, + { STRUCT_FIELD(grh, payload_length), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { STRUCT_FIELD(grh, next_header), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 8 }, + { STRUCT_FIELD(grh, hop_limit), + .offset_words = 1, + .offset_bits = 24, + .size_bits = 8 }, + { STRUCT_FIELD(grh, source_gid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { 
STRUCT_FIELD(grh, destination_gid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 } +}; + +static const struct ib_field bth_table[] = { + { STRUCT_FIELD(bth, opcode), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, solicited_event), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 1 }, + { STRUCT_FIELD(bth, mig_req), + .offset_words = 0, + .offset_bits = 9, + .size_bits = 1 }, + { STRUCT_FIELD(bth, pad_count), + .offset_words = 0, + .offset_bits = 10, + .size_bits = 2 }, + { STRUCT_FIELD(bth, transport_header_version), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 4 }, + { STRUCT_FIELD(bth, pkey), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, destination_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 }, + { STRUCT_FIELD(bth, ack_req), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 2, + .offset_bits = 1, + .size_bits = 7 }, + { STRUCT_FIELD(bth, psn), + .offset_words = 2, + .offset_bits = 8, + .size_bits = 24 } +}; + +static const struct ib_field deth_table[] = { + { STRUCT_FIELD(deth, qkey), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(deth, source_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 } +}; + +/** + * ib_ud_header_init - Initialize UD header structure + * @payload_bytes:Length of packet payload + * @grh_present:GRH flag (if non-zero, GRH will be included) + * @header:Structure to initialize + * + * ib_ud_header_init() initializes the lrh.link_version, lrh.link_next_header, + * lrh.packet_length, grh.ip_version, grh.payload_length, + * grh.next_header, bth.opcode, bth.pad_count and + * bth.transport_header_version fields of a &struct ib_ud_header given + * the payload length and whether a GRH will be included. + */ +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header) +{ + int header_len; + + memset(header, 0, sizeof *header); + + header_len = + IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES; + if (grh_present) { + header_len += IB_GRH_BYTES; + } + + header->lrh.link_version = 0; + header->lrh.link_next_header = + grh_present ? IB_LNH_IBA_GLOBAL : IB_LNH_IBA_LOCAL; + header->lrh.packet_length = (IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) / 4; /* round up */ + + header->grh_present = grh_present; + if (grh_present) { + header->lrh.packet_length += IB_GRH_BYTES / 4; + + header->grh.ip_version = 6; + header->grh.payload_length = + cpu_to_be16((IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) & ~3); /* round up */ + header->grh.next_header = 0x1b; + } + + cpu_to_be16s(&header->lrh.packet_length); + + if (header->immediate_present) + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + else + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY; + header->bth.pad_count = (4 - payload_bytes) & 3; + header->bth.transport_header_version = 0; +} +EXPORT_SYMBOL(ib_ud_header_init); + +/** + * ib_ud_header_pack - Pack UD header struct into wire format + * @header:UD header struct + * @buf:Buffer to pack into + * + * ib_ud_header_pack() packs the UD header structure @header into wire + * format in the buffer @buf. 
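+ *
+ * A minimal usage sketch, with payload_len standing in for the UD
+ * payload size in bytes (error handling omitted; the caller still has
+ * to fill in the address fields of hdr.lrh, hdr.bth and hdr.deth
+ * between initializing and packing):
+ *
+ *	struct ib_ud_header hdr;
+ *	u8 wire[IB_LRH_BYTES + IB_BTH_BYTES + IB_DETH_BYTES];
+ *	int len;
+ *
+ *	ib_ud_header_init(payload_len, 0, &hdr);
+ *	len = ib_ud_header_pack(&hdr, wire);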
+ */ +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf) +{ + int len = 0; + + ib_pack(lrh_table, ARRAY_SIZE(lrh_table), + &header->lrh, buf); + len += IB_LRH_BYTES; + + if (header->grh_present) { + ib_pack(grh_table, ARRAY_SIZE(grh_table), + &header->grh, buf + len); + len += IB_GRH_BYTES; + } + + ib_pack(bth_table, ARRAY_SIZE(bth_table), + &header->bth, buf + len); + len += IB_BTH_BYTES; + + ib_pack(deth_table, ARRAY_SIZE(deth_table), + &header->deth, buf + len); + len += IB_DETH_BYTES; + + if (header->immediate_present) { + memcpy(buf + len, &header->immediate_data, sizeof header->immediate_data); + len += sizeof header->immediate_data; + } + + return len; +} +EXPORT_SYMBOL(ib_ud_header_pack); + +/** + * ib_ud_header_unpack - Unpack UD header struct from wire format + * @header:UD header struct + * @buf:Buffer to pack into + * + * ib_ud_header_pack() unpacks the UD header structure @header from wire + * format in the buffer @buf. + */ +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header) +{ + ib_unpack(lrh_table, ARRAY_SIZE(lrh_table), + buf, &header->lrh); + buf += IB_LRH_BYTES; + + if (header->lrh.link_version != 0) { + printk(KERN_WARNING "Invalid LRH.link_version %d\n", + header->lrh.link_version); + return -EINVAL; + } + + switch (header->lrh.link_next_header) { + case IB_LNH_IBA_LOCAL: + header->grh_present = 0; + break; + + case IB_LNH_IBA_GLOBAL: + header->grh_present = 1; + ib_unpack(grh_table, ARRAY_SIZE(grh_table), + buf, &header->grh); + buf += IB_GRH_BYTES; + + if (header->grh.ip_version != 6) { + printk(KERN_WARNING "Invalid GRH.ip_version %d\n", + header->grh.ip_version); + return -EINVAL; + } + if (header->grh.next_header != 0x1b) { + printk(KERN_WARNING "Invalid GRH.next_header 0x%02x\n", + header->grh.next_header); + return -EINVAL; + } + break; + + default: + printk(KERN_WARNING "Invalid LRH.link_next_header %d\n", + header->lrh.link_next_header); + return -EINVAL; + } + + ib_unpack(bth_table, ARRAY_SIZE(bth_table), + buf, &header->bth); + buf += IB_BTH_BYTES; + + switch (header->bth.opcode) { + case IB_OPCODE_UD_SEND_ONLY: + header->immediate_present = 0; + break; + case IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE: + header->immediate_present = 1; + break; + default: + printk(KERN_WARNING "Invalid BTH.opcode 0x%02x\n", + header->bth.opcode); + return -EINVAL; + } + + if (header->bth.transport_header_version != 0) { + printk(KERN_WARNING "Invalid BTH.transport_header_version %d\n", + header->bth.transport_header_version); + return -EINVAL; + } + + ib_unpack(deth_table, ARRAY_SIZE(deth_table), + buf, &header->deth); + buf += IB_DETH_BYTES; + + if (header->immediate_present) + memcpy(&header->immediate_data, buf, sizeof header->immediate_data); + + return 0; +} +EXPORT_SYMBOL(ib_ud_header_unpack); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/verbs.c 2004-12-13 09:44:40.392774798 -0800 @@ -0,0 +1,420 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#include +#include + +#include + +/* Protection domains */ + +struct ib_pd *ib_alloc_pd(struct ib_device *device) +{ + struct ib_pd *pd; + + pd = device->alloc_pd(device); + + if (!IS_ERR(pd)) { + pd->device = device; + atomic_set(&pd->usecnt, 0); + } + + return pd; +} +EXPORT_SYMBOL(ib_alloc_pd); + +int ib_dealloc_pd(struct ib_pd *pd) +{ + if (atomic_read(&pd->usecnt)) + return -EBUSY; + + return pd->device->dealloc_pd(pd); +} +EXPORT_SYMBOL(ib_dealloc_pd); + +/* Address handles */ + +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + struct ib_ah *ah; + + ah = pd->device->create_ah(pd, ah_attr); + + if (!IS_ERR(ah)) { + ah->device = pd->device; + ah->pd = pd; + atomic_inc(&pd->usecnt); + } + + return ah; +} +EXPORT_SYMBOL(ib_create_ah); + +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->modify_ah ? + ah->device->modify_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_modify_ah); + +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->query_ah ? + ah->device->query_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_ah); + +int ib_destroy_ah(struct ib_ah *ah) +{ + struct ib_pd *pd; + int ret; + + pd = ah->pd; + ret = ah->device->destroy_ah(ah); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_destroy_ah); + +/* Queue pairs */ + +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr) +{ + struct ib_qp *qp; + + qp = pd->device->create_qp(pd, qp_init_attr); + + if (!IS_ERR(qp)) { + qp->device = pd->device; + qp->pd = pd; + qp->send_cq = qp_init_attr->send_cq; + qp->recv_cq = qp_init_attr->recv_cq; + qp->srq = qp_init_attr->srq; + qp->event_handler = qp_init_attr->event_handler; + qp->qp_context = qp_init_attr->qp_context; + atomic_inc(&pd->usecnt); + atomic_inc(&qp_init_attr->send_cq->usecnt); + atomic_inc(&qp_init_attr->recv_cq->usecnt); + if (qp_init_attr->srq) + atomic_inc(&qp_init_attr->srq->usecnt); + } + + return qp; +} +EXPORT_SYMBOL(ib_create_qp); + +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask) +{ + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); +} +EXPORT_SYMBOL(ib_modify_qp); + +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr) +{ + return qp->device->query_qp ? 
+ qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_qp); + +int ib_destroy_qp(struct ib_qp *qp) +{ + struct ib_pd *pd; + struct ib_cq *scq, *rcq; + struct ib_srq *srq; + int ret; + + pd = qp->pd; + scq = qp->send_cq; + rcq = qp->recv_cq; + srq = qp->srq; + + ret = qp->device->destroy_qp(qp); + if (!ret) { + atomic_dec(&pd->usecnt); + atomic_dec(&scq->usecnt); + atomic_dec(&rcq->usecnt); + if (srq) + atomic_dec(&srq->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_destroy_qp); + +/* Completion queues */ + +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe) +{ + struct ib_cq *cq; + + cq = device->create_cq(device, cqe); + + if (!IS_ERR(cq)) { + cq->device = device; + cq->comp_handler = comp_handler; + cq->event_handler = event_handler; + cq->cq_context = cq_context; + atomic_set(&cq->usecnt, 0); + } + + return cq; +} +EXPORT_SYMBOL(ib_create_cq); + +int ib_destroy_cq(struct ib_cq *cq) +{ + if (atomic_read(&cq->usecnt)) + return -EBUSY; + + return cq->device->destroy_cq(cq); +} +EXPORT_SYMBOL(ib_destroy_cq); + +int ib_resize_cq(struct ib_cq *cq, + int cqe) +{ + int ret; + + if (!cq->device->resize_cq) + return -ENOSYS; + + ret = cq->device->resize_cq(cq, &cqe); + if (!ret) + cq->cqe = cqe; + + return ret; +} +EXPORT_SYMBOL(ib_resize_cq); + +/* Memory regions */ + +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) +{ + struct ib_mr *mr; + + mr = pd->device->get_dma_mr(pd, mr_access_flags); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_get_dma_mr); + +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_mr *mr; + + mr = pd->device->reg_phys_mr(pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_reg_phys_mr); + +int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_pd *old_pd; + int ret; + + if (!mr->device->rereg_phys_mr) + return -ENOSYS; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + old_pd = mr->pd; + + ret = mr->device->rereg_phys_mr(mr, mr_rereg_mask, pd, + phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!ret && (mr_rereg_mask & IB_MR_REREG_PD)) { + atomic_dec(&old_pd->usecnt); + atomic_inc(&pd->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_rereg_phys_mr); + +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) +{ + return mr->device->query_mr ? 
+ mr->device->query_mr(mr, mr_attr) : -ENOSYS; +} +EXPORT_SYMBOL(ib_query_mr); + +int ib_dereg_mr(struct ib_mr *mr) +{ + struct ib_pd *pd; + int ret; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + pd = mr->pd; + ret = mr->device->dereg_mr(mr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dereg_mr); + +/* Memory windows */ + +struct ib_mw *ib_alloc_mw(struct ib_pd *pd) +{ + struct ib_mw *mw; + + if (!pd->device->alloc_mw) + return ERR_PTR(-ENOSYS); + + mw = pd->device->alloc_mw(pd); + if (!IS_ERR(mw)) { + mw->device = pd->device; + mw->pd = pd; + atomic_inc(&pd->usecnt); + } + + return mw; +} +EXPORT_SYMBOL(ib_alloc_mw); + +int ib_dealloc_mw(struct ib_mw *mw) +{ + struct ib_pd *pd; + int ret; + + pd = mw->pd; + ret = mw->device->dealloc_mw(mw); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_mw); + +/* "Fast" memory regions */ + +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct ib_fmr *fmr; + + if (!pd->device->alloc_fmr) + return ERR_PTR(-ENOSYS); + + fmr = pd->device->alloc_fmr(pd, mr_access_flags, fmr_attr); + if (!IS_ERR(fmr)) { + fmr->device = pd->device; + fmr->pd = pd; + atomic_inc(&pd->usecnt); + } + + return fmr; +} +EXPORT_SYMBOL(ib_alloc_fmr); + +int ib_unmap_fmr(struct list_head *fmr_list) +{ + struct ib_fmr *fmr; + + if (list_empty(fmr_list)) + return 0; + + fmr = list_entry(fmr_list->next, struct ib_fmr, list); + return fmr->device->unmap_fmr(fmr_list); +} +EXPORT_SYMBOL(ib_unmap_fmr); + +int ib_dealloc_fmr(struct ib_fmr *fmr) +{ + struct ib_pd *pd; + int ret; + + pd = fmr->pd; + ret = fmr->device->dealloc_fmr(fmr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_fmr); + +/* Multicast groups */ + +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->attach_mcast ? + qp->device->attach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_attach_mcast); + +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->detach_mcast ? + qp->device->detach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_detach_mcast); From roland at topspin.com Mon Dec 13 10:09:23 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:23 -0800 Subject: [openib-general] [PATCH][v3][3/21] Hook up drivers/infiniband In-Reply-To: <20041213109.B80JuEFdg6Nma7kr@topspin.com> Message-ID: <20041213109.GUz9r6Ey44wtDeiY@topspin.com> Add the appropriate lines to drivers/Kconfig and drivers/Makefile so that the kernel configuration and build systems know about drivers/infiniband. 
Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/Kconfig 2004-12-11 15:16:34.000000000 -0800 +++ linux-bk/drivers/Kconfig 2004-12-13 09:44:41.933547814 -0800 @@ -54,4 +54,6 @@ source "drivers/usb/Kconfig" +source "drivers/infiniband/Kconfig" + endmenu --- linux-bk.orig/drivers/Makefile 2004-12-11 15:16:49.000000000 -0800 +++ linux-bk/drivers/Makefile 2004-12-13 09:44:41.953544868 -0800 @@ -59,4 +59,5 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ From roland at topspin.com Mon Dec 13 10:09:23 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:23 -0800 Subject: [openib-general] [PATCH][v3][4/21] Add InfiniBand MAD (management datagram) support (public headers) In-Reply-To: <20041213109.GUz9r6Ey44wtDeiY@topspin.com> Message-ID: <20041213109.kvbhOIc6xDgg0Bag@topspin.com> Add public headers for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_mad.h 2004-12-13 09:44:42.628445443 -0800 @@ -0,0 +1,393 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+ * + * $Id$ + */ + +#if !defined( IB_MAD_H ) +#define IB_MAD_H + +#include + +/* Management base version */ +#define IB_MGMT_BASE_VERSION 1 + +/* Management classes */ +#define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01 +#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81 +#define IB_MGMT_CLASS_SUBN_ADM 0x03 +#define IB_MGMT_CLASS_PERF_MGMT 0x04 +#define IB_MGMT_CLASS_BM 0x05 +#define IB_MGMT_CLASS_DEVICE_MGMT 0x06 +#define IB_MGMT_CLASS_CM 0x07 +#define IB_MGMT_CLASS_SNMP 0x08 +#define IB_MGMT_CLASS_VENDOR_RANGE2_START 0x30 +#define IB_MGMT_CLASS_VENDOR_RANGE2_END 0x4F + +/* Management methods */ +#define IB_MGMT_METHOD_GET 0x01 +#define IB_MGMT_METHOD_SET 0x02 +#define IB_MGMT_METHOD_GET_RESP 0x81 +#define IB_MGMT_METHOD_SEND 0x03 +#define IB_MGMT_METHOD_TRAP 0x05 +#define IB_MGMT_METHOD_REPORT 0x06 +#define IB_MGMT_METHOD_REPORT_RESP 0x86 +#define IB_MGMT_METHOD_TRAP_REPRESS 0x07 + +#define IB_MGMT_METHOD_RESP 0x80 + +#define IB_MGMT_MAX_METHODS 128 + +#define IB_QP0 0 +#define IB_QP1 cpu_to_be32(1) +#define IB_QP1_QKEY 0x80010000 + +struct ib_grh { + u32 version_tclass_flow; + u16 paylen; + u8 next_hdr; + u8 hop_limit; + union ib_gid sgid; + union ib_gid dgid; +} __attribute__ ((packed)); + +struct ib_mad_hdr { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u16 class_specific; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; +} __attribute__ ((packed)); + +struct ib_rmpp_hdr { + u8 rmpp_version; + u8 rmpp_type; + u8 rmpp_rtime_flags; + u8 rmpp_status; + u32 seg_num; + u32 paylen_newwin; +} __attribute__ ((packed)); + +struct ib_mad { + struct ib_mad_hdr mad_hdr; + u8 data[232]; +} __attribute__ ((packed)); + +struct ib_rmpp_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 data[220]; +} __attribute__ ((packed)); + +struct ib_vendor_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 reserved; + u8 oui[3]; + u8 data[216]; +} __attribute__ ((packed)); + +struct ib_mad_agent; +struct ib_mad_send_wc; +struct ib_mad_recv_wc; + +/** + * ib_mad_send_handler - callback handler for a sent MAD. + * @mad_agent: MAD agent that sent the MAD. + * @mad_send_wc: Send work completion information on the sent MAD. + */ +typedef void (*ib_mad_send_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_snoop_handler - Callback handler for snooping sent MADs. + * @mad_agent: MAD agent that snooped the MAD. + * @send_wr: Work request information on the sent MAD. + * @mad_send_wc: Work completion information on the sent MAD. Valid + * only for snooping that occurs on a send completion. + * + * Clients snooping MADs should not modify data referenced by the @send_wr + * or @mad_send_wc. + */ +typedef void (*ib_mad_snoop_handler)(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_recv_handler - callback handler for a received MAD. + * @mad_agent: MAD agent requesting the received MAD. + * @mad_recv_wc: Received work completion information on the received MAD. + * + * MADs received in response to a send request operation will be handed to + * the user after the send operation completes. All data buffers given + * to registered agents through this routine are owned by the receiving + * client, except for snooping agents. Clients snooping MADs should not + * modify the data referenced by @mad_recv_wc. 
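+ *
+ * A minimal handler sketch (my_recv_handler and handle_mad stand in
+ * for client code); the buffers must be handed back to the access
+ * layer with ib_free_recv_mad() once the client is done with them:
+ *
+ *	static void my_recv_handler(struct ib_mad_agent *agent,
+ *				    struct ib_mad_recv_wc *wc)
+ *	{
+ *		handle_mad(wc->recv_buf.mad, agent->context);
+ *		ib_free_recv_mad(wc);
+ *	}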
+ */ +typedef void (*ib_mad_recv_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_mad_agent - Used to track MAD registration with the access layer. + * @device: Reference to device registration is on. + * @qp: Reference to QP used for sending and receiving MADs. + * @recv_handler: Callback handler for a received MAD. + * @send_handler: Callback handler for a sent MAD. + * @snoop_handler: Callback handler for snooped sent MADs. + * @context: User-specified context associated with this registration. + * @hi_tid: Access layer assigned transaction ID for this client. + * Unsolicited MADs sent by this client will have the upper 32-bits + * of their TID set to this value. + * @port_num: Port number on which QP is registered + */ +struct ib_mad_agent { + struct ib_device *device; + struct ib_qp *qp; + ib_mad_recv_handler recv_handler; + ib_mad_send_handler send_handler; + ib_mad_snoop_handler snoop_handler; + void *context; + u32 hi_tid; + u8 port_num; +}; + +/** + * ib_mad_send_wc - MAD send completion information. + * @wr_id: Work request identifier associated with the send MAD request. + * @status: Completion status. + * @vendor_err: Optional vendor error information returned with a failed + * request. + */ +struct ib_mad_send_wc { + u64 wr_id; + enum ib_wc_status status; + u32 vendor_err; +}; + +/** + * ib_mad_recv_buf - received MAD buffer information. + * @list: Reference to next data buffer for a received RMPP MAD. + * @grh: References a data buffer containing the global route header. + * The data refereced by this buffer is only valid if the GRH is + * valid. + * @mad: References the start of the received MAD. + */ +struct ib_mad_recv_buf { + struct list_head list; + struct ib_grh *grh; + struct ib_mad *mad; +}; + +/** + * ib_mad_recv_wc - received MAD information. + * @wc: Completion information for the received data. + * @recv_buf: Specifies the location of the received data buffer(s). + * @mad_len: The length of the received MAD, without duplicated headers. + * + * For received response, the wr_id field of the wc is set to the wr_id + * for the corresponding send request. + */ +struct ib_mad_recv_wc { + struct ib_wc *wc; + struct ib_mad_recv_buf recv_buf; + int mad_len; +}; + +/** + * ib_mad_reg_req - MAD registration request + * @mgmt_class: Indicates which management class of MADs should be receive + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version: Indicates which version of MADs for the given + * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. + * @method_mask: The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + */ +struct ib_mad_reg_req { + u8 mgmt_class; + u8 mgmt_class_version; + u8 oui[3]; + DECLARE_BITMAP(method_mask, IB_MGMT_MAX_METHODS); +}; + +/** + * ib_register_mad_agent - Register to send/receive MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP to access. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_reg_req: Specifies which unsolicited MADs should be received + * by the caller. This parameter may be NULL if the caller only + * wishes to receive solicited responses. + * @rmpp_version: If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. 
+ * If set to 0, indicates that RMPP is not used by this client. + * @send_handler: The completion callback routine invoked after a send + * request has completed. + * @recv_handler: The completion callback routine invoked for a received + * MAD. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +enum ib_mad_snoop_flags { + /*IB_MAD_SNOOP_POSTED_SENDS = 1,*/ + /*IB_MAD_SNOOP_RMPP_SENDS = (1<<1),*/ + IB_MAD_SNOOP_SEND_COMPLETIONS = (1<<2), + /*IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS = (1<<3),*/ + IB_MAD_SNOOP_RECVS = (1<<4) + /*IB_MAD_SNOOP_RMPP_RECVS = (1<<5),*/ + /*IB_MAD_SNOOP_REDIRECTED_QPS = (1<<6)*/ +}; + +/** + * ib_register_mad_snoop - Register to snoop sent and received MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP traffic to snoop. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_snoop_flags: Specifies information where snooping occurs. + * @send_handler: The callback routine invoked for a snooped send. + * @recv_handler: The callback routine invoked for a snooped receive. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_unregister_mad_agent - Unregisters a client from using MAD services. + * @mad_agent: Corresponding MAD registration request to deregister. + * + * After invoking this routine, MAD services are no longer usable by the + * client on the associated QP. + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent); + +/** + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client. + * @mad_agent: Specifies the associated registration to post the send to. + * @send_wr: Specifies the information needed to send the MAD(s). + * @bad_send_wr: Specifies the MAD on which an error was encountered. + * + * Sent MADs are not guaranteed to complete in the order that they were posted. + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + +/** + * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. + * @mad_recv_wc: Work completion information for a received MAD. + * @buf: User-provided data buffer to receive the coalesced buffers. The + * referenced buffer should be at least the size of the mad_len specified + * by @mad_recv_wc. + * + * This call copies a chain of received RMPP MADs into a single data buffer, + * removing duplicated headers. + */ +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf); + +/** + * ib_free_recv_mad - Returns data buffers used to receive a MAD to the + * access layer. + * @mad_recv_wc: Work completion information for a received MAD. + * + * Clients receiving MADs through their ib_mad_recv_handler must call this + * routine to return the work completion buffers to the access layer. + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_cancel_mad - Cancels an outstanding send MAD operation. 
+ * @mad_agent: Specifies the registration associated with sent MAD. + * @wr_id: Indicates the work request identifier of the MAD to cancel. + * + * MADs will be returned to the user through the corresponding + * ib_mad_send_handler. + */ +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id); + +/** + * ib_redirect_mad_qp - Registers a QP for MAD services. + * @qp: Reference to a QP that requires MAD services. + * @rmpp_version: If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. + * If set to 0, indicates that RMPP is not used by this client. + * @send_handler: The completion callback routine invoked after a send + * request has completed. + * @recv_handler: The completion callback routine invoked for a received + * MAD. + * @context: User specified context associated with the registration. + * + * Use of this call allows clients to use MAD services, such as RMPP, + * on user-owned QPs. After calling this routine, users may send + * MADs on the specified QP by calling ib_mad_post_send. + */ +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_process_mad_wc - Processes a work completion associated with a + * MAD sent or received on a redirected QP. + * @mad_agent: Specifies the registered MAD service using the redirected QP. + * @wc: References a work completion associated with a sent or received + * MAD segment. + * + * This routine is used to complete or continue processing on a MAD request. + * If the work completion is associated with a send operation, calling + * this routine is required to continue an RMPP transfer or to wait for a + * corresponding response, if it is a request. If the work completion is + * associated with a receive operation, calling this routine is required to + * process an inbound or outbound RMPP transfer, or to match a response MAD + * with its corresponding request. + */ +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc); + +#endif /* IB_MAD_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_smi.h 2004-12-13 09:44:42.653441760 -0800 @@ -0,0 +1,67 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+ * + * $Id$ + */ + +#if !defined( IB_SMI_H ) +#define IB_SMI_H + +#include + +#define IB_LID_PERMISSIVE 0xFFFF + +#define IB_SMP_DATA_SIZE 64 +#define IB_SMP_MAX_PATH_HOPS 64 + +struct ib_smp { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u8 hop_ptr; + u8 hop_cnt; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; + u64 mkey; + u16 dr_slid; + u16 dr_dlid; + u8 reserved[28]; + u8 data[IB_SMP_DATA_SIZE]; + u8 initial_path[IB_SMP_MAX_PATH_HOPS]; + u8 return_path[IB_SMP_MAX_PATH_HOPS]; +} __attribute__ ((packed)); + +#define IB_SMP_DIRECTION cpu_to_be16(0x8000) + +static inline u8 +ib_get_smp_direction(struct ib_smp *smp) +{ + return ((smp->status & IB_SMP_DIRECTION) == IB_SMP_DIRECTION); +} + +#endif /* IB_SMI_H */ From roland at topspin.com Mon Dec 13 10:09:27 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:27 -0800 Subject: [openib-general] [PATCH][v3][6/21] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <20041213109.3tK6alRLJABxH4bu@topspin.com> Message-ID: <20041213109.cVS0twN822l4xQbR@topspin.com> Add support for sending queries to the SA (Subnet Administration). In particular the PathRecord and MCMember (multicast group member) used by the IP-over-InfiniBand driver are implemented. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-13 09:44:42.966395657 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-13 09:44:43.579305364 -0800 @@ -2,7 +2,8 @@ obj-$(CONFIG_INFINIBAND) += \ ib_core.o \ - ib_mad.o + ib_mad.o \ + ib_sa.o ib_core-objs := \ packer.o \ @@ -17,3 +18,5 @@ mad.o \ smi.o \ agent.o + +ib_sa-objs := sa_query.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sa_query.c 2004-12-13 09:44:43.603301828 -0800 @@ -0,0 +1,855 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand subnet administration query support"); +MODULE_LICENSE("Dual BSD/GPL"); + +/* + * These two structures must be packed because they have 64-bit fields + * that are only 32-bit aligned. 64-bit architectures will lay them + * out wrong otherwise. 
(And unfortunately they are sent on the wire + * so we can't change the layout) + */ +struct ib_sa_hdr { + u64 sm_key; + u16 attr_offset; + u16 reserved; + ib_sa_comp_mask comp_mask; +} __attribute__ ((packed)); + +struct ib_sa_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + struct ib_sa_hdr sa_hdr; + u8 data[200]; +} __attribute__ ((packed)); + +struct ib_sa_sm_ah { + struct ib_ah *ah; + struct kref ref; +}; + +struct ib_sa_port { + struct ib_mad_agent *agent; + struct ib_mr *mr; + struct ib_sa_sm_ah *sm_ah; + struct work_struct update_task; + spinlock_t ah_lock; + u8 port_num; +}; + +struct ib_sa_device { + int start_port, end_port; + struct ib_event_handler event_handler; + struct ib_sa_port port[0]; +}; + +struct ib_sa_query { + void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); + void (*release)(struct ib_sa_query *); + struct ib_sa_port *port; + struct ib_sa_mad *mad; + struct ib_sa_sm_ah *sm_ah; + DECLARE_PCI_UNMAP_ADDR(mapping) + int id; +}; + +struct ib_sa_path_query { + void (*callback)(int, struct ib_sa_path_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +struct ib_sa_mcmember_query { + void (*callback)(int, struct ib_sa_mcmember_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +static void ib_sa_add_one(struct ib_device *device); +static void ib_sa_remove_one(struct ib_device *device); + +static struct ib_client sa_client = { + .name = "sa", + .add = ib_sa_add_one, + .remove = ib_sa_remove_one +}; + +static spinlock_t idr_lock; +static DEFINE_IDR(query_idr); + +static spinlock_t tid_lock; +static u32 tid; + +enum { + IB_SA_ATTR_CLASS_PORTINFO = 0x01, + IB_SA_ATTR_NOTICE = 0x02, + IB_SA_ATTR_INFORM_INFO = 0x03, + IB_SA_ATTR_NODE_REC = 0x11, + IB_SA_ATTR_PORT_INFO_REC = 0x12, + IB_SA_ATTR_SL2VL_REC = 0x13, + IB_SA_ATTR_SWITCH_REC = 0x14, + IB_SA_ATTR_LINEAR_FDB_REC = 0x15, + IB_SA_ATTR_RANDOM_FDB_REC = 0x16, + IB_SA_ATTR_MCAST_FDB_REC = 0x17, + IB_SA_ATTR_SM_INFO_REC = 0x18, + IB_SA_ATTR_LINK_REC = 0x20, + IB_SA_ATTR_GUID_INFO_REC = 0x30, + IB_SA_ATTR_SERVICE_REC = 0x31, + IB_SA_ATTR_PARTITION_REC = 0x33, + IB_SA_ATTR_RANGE_REC = 0x34, + IB_SA_ATTR_PATH_REC = 0x35, + IB_SA_ATTR_VL_ARB_REC = 0x36, + IB_SA_ATTR_MC_GROUP_REC = 0x37, + IB_SA_ATTR_MC_MEMBER_REC = 0x38, + IB_SA_ATTR_TRACE_REC = 0x39, + IB_SA_ATTR_MULTI_PATH_REC = 0x3a, + IB_SA_ATTR_SERVICE_ASSOC_REC = 0x3b +}; + +#define PATH_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_path_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_path_rec *) 0)->field, \ + .field_name = "sa_path_rec:" #field + +static const struct ib_field path_rec_table[] = { + { RESERVED, + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 32 }, + { PATH_REC_FIELD(dgid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(sgid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(dlid), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { PATH_REC_FIELD(slid), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 16 }, + { PATH_REC_FIELD(raw_traffic), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 11, + .offset_bits = 1, + .size_bits = 3 }, + { PATH_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { PATH_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { 
PATH_REC_FIELD(traffic_class), + .offset_words = 12, + .offset_bits = 0, + .size_bits = 8 }, + { PATH_REC_FIELD(reversible), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { PATH_REC_FIELD(numb_path), + .offset_words = 12, + .offset_bits = 9, + .size_bits = 7 }, + { PATH_REC_FIELD(pkey), + .offset_words = 12, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 13, + .offset_bits = 0, + .size_bits = 12 }, + { PATH_REC_FIELD(sl), + .offset_words = 13, + .offset_bits = 12, + .size_bits = 4 }, + { PATH_REC_FIELD(mtu_selector), + .offset_words = 13, + .offset_bits = 16, + .size_bits = 2 }, + { PATH_REC_FIELD(mtu), + .offset_words = 13, + .offset_bits = 18, + .size_bits = 6 }, + { PATH_REC_FIELD(rate_selector), + .offset_words = 13, + .offset_bits = 24, + .size_bits = 2 }, + { PATH_REC_FIELD(rate), + .offset_words = 13, + .offset_bits = 26, + .size_bits = 6 }, + { PATH_REC_FIELD(packet_life_time_selector), + .offset_words = 14, + .offset_bits = 0, + .size_bits = 2 }, + { PATH_REC_FIELD(packet_life_time), + .offset_words = 14, + .offset_bits = 2, + .size_bits = 6 }, + { PATH_REC_FIELD(preference), + .offset_words = 14, + .offset_bits = 8, + .size_bits = 8 }, + { RESERVED, + .offset_words = 14, + .offset_bits = 16, + .size_bits = 48 }, +}; + +#define MCMEMBER_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_mcmember_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_mcmember_rec *) 0)->field, \ + .field_name = "sa_mcmember_rec:" #field + +static const struct ib_field mcmember_rec_table[] = { + { MCMEMBER_REC_FIELD(mgid), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(port_gid), + .offset_words = 4, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(qkey), + .offset_words = 8, + .offset_bits = 0, + .size_bits = 32 }, + { MCMEMBER_REC_FIELD(mlid), + .offset_words = 9, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(mtu_selector), + .offset_words = 9, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(mtu), + .offset_words = 9, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(traffic_class), + .offset_words = 9, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(pkey), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(rate_selector), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(rate), + .offset_words = 10, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(packet_life_time_selector), + .offset_words = 10, + .offset_bits = 24, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(packet_life_time), + .offset_words = 10, + .offset_bits = 26, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(sl), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { MCMEMBER_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(scope), + .offset_words = 12, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(join_state), + .offset_words = 12, + .offset_bits = 4, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(proxy_join), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { RESERVED, + .offset_words = 12, + .offset_bits = 9, + .size_bits = 23 }, +}; + +static void free_sm_ah(struct kref *kref) +{ + struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); + + 
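+	/* last reference dropped: destroy the SM's address handle and free the wrapper */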
ib_destroy_ah(sm_ah->ah); + kfree(sm_ah); +} + +static void update_sm_ah(void *port_ptr) +{ + struct ib_sa_port *port = port_ptr; + struct ib_sa_sm_ah *new_ah, *old_ah; + struct ib_port_attr port_attr; + struct ib_ah_attr ah_attr; + + if (ib_query_port(port->agent->device, port->port_num, &port_attr)) { + printk(KERN_WARNING "Couldn't query port\n"); + return; + } + + new_ah = kmalloc(sizeof *new_ah, GFP_KERNEL); + if (!new_ah) { + printk(KERN_WARNING "Couldn't allocate new SM AH\n"); + return; + } + + kref_init(&new_ah->ref); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = port_attr.sm_lid; + ah_attr.sl = port_attr.sm_sl; + ah_attr.port_num = port->port_num; + + new_ah->ah = ib_create_ah(port->agent->qp->pd, &ah_attr); + if (IS_ERR(new_ah->ah)) { + printk(KERN_WARNING "Couldn't create new SM AH\n"); + kfree(new_ah); + return; + } + + spin_lock_irq(&port->ah_lock); + old_ah = port->sm_ah; + port->sm_ah = new_ah; + spin_unlock_irq(&port->ah_lock); + + if (old_ah) + kref_put(&old_ah->ref, free_sm_ah); +} + +static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) +{ + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + struct ib_sa_device *sa_dev = + ib_get_client_data(event->device, &sa_client); + + schedule_work(&sa_dev->port[event->element.port_num - + sa_dev->start_port].update_task); + } +} + +/** + * ib_sa_cancel_query - try to cancel an SA query + * @id:ID of query to cancel + * @query:query pointer to cancel + * + * Try to cancel an SA query. If the id and query don't match up or + * the query has already completed, nothing is done. Otherwise the + * query is canceled and will complete with a status of -EINTR. 
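+ *
+ * The @id is the non-negative value returned by the query call and
+ * @query is the handle that call stored through its final argument.
+ * A sketch (my_callback, my_context, rec, comp_mask and timeout_ms
+ * stand in for caller-supplied values):
+ *
+ *	struct ib_sa_query *query;
+ *	int id;
+ *
+ *	id = ib_sa_path_rec_get(device, port_num, &rec, comp_mask,
+ *				timeout_ms, GFP_KERNEL, my_callback,
+ *				my_context, &query);
+ *
+ * and later, to give up on the outstanding query:
+ *
+ *	if (id >= 0)
+ *		ib_sa_cancel_query(id, query);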
+ */ +void ib_sa_cancel_query(int id, struct ib_sa_query *query) +{ + unsigned long flags; + struct ib_mad_agent *agent; + + spin_lock_irqsave(&idr_lock, flags); + if (idr_find(&query_idr, id) != query) { + spin_unlock_irqrestore(&idr_lock, flags); + return; + } + agent = query->port->agent; + spin_unlock_irqrestore(&idr_lock, flags); + + ib_cancel_mad(agent, id); +} +EXPORT_SYMBOL(ib_sa_cancel_query); + +static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent) +{ + unsigned long flags; + + memset(mad, 0, sizeof *mad); + + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + + spin_lock_irqsave(&tid_lock, flags); + mad->mad_hdr.tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | tid++); + spin_unlock_irqrestore(&tid_lock, flags); +} + +static int send_mad(struct ib_sa_query *query, int timeout_ms) +{ + struct ib_sa_port *port = query->port; + unsigned long flags; + int ret; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .mad_hdr = &query->mad->mad_hdr, + .remote_qpn = 1, + .remote_qkey = IB_QP1_QKEY, + .timeout_ms = timeout_ms + } + } + }; + +retry: + if (!idr_pre_get(&query_idr, GFP_ATOMIC)) + return -ENOMEM; + spin_lock_irqsave(&idr_lock, flags); + ret = idr_get_new(&query_idr, query, &query->id); + spin_unlock_irqrestore(&idr_lock, flags); + if (ret == -EAGAIN) + goto retry; + if (ret) + return ret; + + wr.wr_id = query->id; + + spin_lock_irqsave(&port->ah_lock, flags); + kref_get(&port->sm_ah->ref); + query->sm_ah = port->sm_ah; + wr.wr.ud.ah = port->sm_ah->ah; + spin_unlock_irqrestore(&port->ah_lock, flags); + + gather_list.addr = dma_map_single(port->agent->device->dma_device, + query->mad, + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + gather_list.length = sizeof (struct ib_sa_mad); + gather_list.lkey = port->mr->lkey; + pci_unmap_addr_set(query, mapping, gather_list.addr); + + ret = ib_post_send_mad(port->agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(port->agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, query->id); + spin_unlock_irqrestore(&idr_lock, flags); + } + + return ret; +} + +static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_path_query *query = + container_of(sa_query, struct ib_sa_path_query, sa_query); + + if (mad) { + struct ib_sa_path_rec rec; + + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); +} + +/** + * ib_sa_path_rec_get - Start a Path get query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Path Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to 
cancel query
+ *
+ * Send a Path Record Get query to the SA to look up a path. The
+ * callback function will be called when the query completes (or
+ * fails); status is 0 for a successful response, -EINTR if the query
+ * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error
+ * occurred sending the query. The resp parameter of the callback is
+ * only valid if status is 0.
+ *
+ * If the return value of ib_sa_path_rec_get() is negative, it is an
+ * error code. Otherwise it is a query ID that can be used to cancel
+ * the query.
+ */
+int ib_sa_path_rec_get(struct ib_device *device, u8 port_num,
+		       struct ib_sa_path_rec *rec,
+		       ib_sa_comp_mask comp_mask,
+		       int timeout_ms, int gfp_mask,
+		       void (*callback)(int status,
+					struct ib_sa_path_rec *resp,
+					void *context),
+		       void *context,
+		       struct ib_sa_query **sa_query)
+{
+	struct ib_sa_path_query *query;
+	struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client);
+	struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port];
+	struct ib_mad_agent *agent = port->agent;
+	int ret;
+
+	query = kmalloc(sizeof *query, gfp_mask);
+	if (!query)
+		return -ENOMEM;
+	query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask);
+	if (!query->sa_query.mad) {
+		kfree(query);
+		return -ENOMEM;
+	}
+
+	query->callback = callback;
+	query->context = context;
+
+	init_mad(query->sa_query.mad, agent);
+
+	query->sa_query.callback = ib_sa_path_rec_callback;
+	query->sa_query.release = ib_sa_path_rec_release;
+	query->sa_query.port = port;
+	query->sa_query.mad->mad_hdr.method = IB_MGMT_METHOD_GET;
+	query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC);
+	query->sa_query.mad->sa_hdr.comp_mask = comp_mask;
+
+	ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table),
+		rec, query->sa_query.mad->data);
+
+	*sa_query = &query->sa_query;
+	ret = send_mad(&query->sa_query, timeout_ms);
+	if (ret) {
+		*sa_query = NULL;
+		kfree(query->sa_query.mad);
+		kfree(query);
+	}
+
+	return ret ?
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_path_rec_get); + +static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_mcmember_query *query = + container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + + if (mad) { + struct ib_sa_mcmember_rec rec; + + ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); +} + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_mcmember_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_mcmember_rec_callback; + query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = method; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_mcmember_rec_query); + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (!query) + return; + + switch (mad_send_wc->status) { + case IB_WC_SUCCESS: + /* No callback -- already got recv */ + break; + case IB_WC_RESP_TIMEOUT_ERR: + query->callback(query, -ETIMEDOUT, NULL); + break; + case IB_WC_WR_FLUSH_ERR: + query->callback(query, -EINTR, NULL); + break; + default: + query->callback(query, -EIO, NULL); + break; + } + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + + query->release(query); + + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (query) { + if (mad_recv_wc->wc->status == IB_WC_SUCCESS) + query->callback(query, + mad_recv_wc->recv_buf.mad->mad_hdr.status ? + -EINVAL : 0, + (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); + else + query->callback(query, -EIO, NULL); + } + + ib_free_recv_mad(mad_recv_wc); +} + +static void ib_sa_add_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + sa_dev = kmalloc(sizeof *sa_dev + + (e - s + 1) * sizeof (struct ib_sa_port), + GFP_KERNEL); + if (!sa_dev) + return; + + sa_dev->start_port = s; + sa_dev->end_port = e; + + for (i = 0; i <= e - s; ++i) { + sa_dev->port[i].mr = NULL; + sa_dev->port[i].sm_ah = NULL; + sa_dev->port[i].port_num = i + s; + spin_lock_init(&sa_dev->port[i].ah_lock); + + sa_dev->port[i].agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, + NULL, 0, send_handler, + recv_handler, sa_dev); + if (IS_ERR(sa_dev->port[i].agent)) + goto err; + + sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(sa_dev->port[i].mr)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + goto err; + } + + INIT_WORK(&sa_dev->port[i].update_task, + update_sm_ah, &sa_dev->port[i]); + } + + /* + * We register our event handler after everything is set up, + * and then update our cached info after the event handler is + * registered to avoid any problems if a port changes state + * during our initialization. 
+ */ + + INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); + if (ib_register_event_handler(&sa_dev->event_handler)) + goto err; + + for (i = 0; i <= e - s; ++i) + update_sm_ah(&sa_dev->port[i]); + + ib_set_client_data(device, &sa_client, sa_dev); + + return; + +err: + while (--i >= 0) { + ib_dereg_mr(sa_dev->port[i].mr); + ib_unregister_mad_agent(sa_dev->port[i].agent); + } + + kfree(sa_dev); + + return; +} + +static void ib_sa_remove_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + int i; + + if (!sa_dev) + return; + + ib_unregister_event_handler(&sa_dev->event_handler); + + for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); + } + + kfree(sa_dev); +} + +static int __init ib_sa_init(void) +{ + int ret; + + spin_lock_init(&idr_lock); + spin_lock_init(&tid_lock); + + get_random_bytes(&tid, sizeof tid); + + ret = ib_register_client(&sa_client); + if (ret) + printk(KERN_ERR "Couldn't register ib_sa client\n"); + + return ret; +} + +static void __exit ib_sa_cleanup(void) +{ + ib_unregister_client(&sa_client); +} + +module_init(ib_sa_init); +module_exit(ib_sa_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_sa.h 2004-12-13 09:44:43.630297851 -0800 @@ -0,0 +1,269 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef IB_SA_H +#define IB_SA_H + +#include + +#include +#include + +enum { + IB_SA_CLASS_VERSION = 2, /* IB spec version 1.1/1.2 */ + + IB_SA_METHOD_DELETE = 0x15 +}; + +enum ib_sa_selector { + IB_SA_GTE = 0, + IB_SA_LTE = 1, + IB_SA_EQ = 2, + /* + * The meaning of "best" depends on the attribute: for + * example, for MTU best will return the largest available + * MTU, while for packet life time, best will return the + * smallest available life time. + */ + IB_SA_BEST = 3 +}; + +typedef u64 __bitwise ib_sa_comp_mask; + +#define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) + +/* + * Structures for SA records are named "struct ib_sa_xxx_rec." No + * attempt is made to pack structures to match the physical layout of + * SA records in SA MADs; all packing and unpacking is handled by the + * SA query code. + * + * For a record with structure ib_sa_xxx_rec, the naming convention + * for the component mask value for field yyy is IB_SA_XXX_REC_YYY (we + * never use different abbreviations or otherwise change the spelling + * of xxx/yyy between ib_sa_xxx_rec.yyy and IB_SA_XXX_REC_YYY). 
+ * + * Reserved rows are indicated with comments to help maintainability. + */ + +/* reserved: 0 */ +/* reserved: 1 */ +#define IB_SA_PATH_REC_DGID IB_SA_COMP_MASK( 2) +#define IB_SA_PATH_REC_SGID IB_SA_COMP_MASK( 3) +#define IB_SA_PATH_REC_DLID IB_SA_COMP_MASK( 4) +#define IB_SA_PATH_REC_SLID IB_SA_COMP_MASK( 5) +#define IB_SA_PATH_REC_RAW_TRAFFIC IB_SA_COMP_MASK( 6) +/* reserved: 7 */ +#define IB_SA_PATH_REC_FLOW_LABEL IB_SA_COMP_MASK( 8) +#define IB_SA_PATH_REC_HOP_LIMIT IB_SA_COMP_MASK( 9) +#define IB_SA_PATH_REC_TRAFFIC_CLASS IB_SA_COMP_MASK(10) +#define IB_SA_PATH_REC_REVERSIBLE IB_SA_COMP_MASK(11) +#define IB_SA_PATH_REC_NUMB_PATH IB_SA_COMP_MASK(12) +#define IB_SA_PATH_REC_PKEY IB_SA_COMP_MASK(13) +/* reserved: 14 */ +#define IB_SA_PATH_REC_SL IB_SA_COMP_MASK(15) +#define IB_SA_PATH_REC_MTU_SELECTOR IB_SA_COMP_MASK(16) +#define IB_SA_PATH_REC_MTU IB_SA_COMP_MASK(17) +#define IB_SA_PATH_REC_RATE_SELECTOR IB_SA_COMP_MASK(18) +#define IB_SA_PATH_REC_RATE IB_SA_COMP_MASK(19) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(20) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(21) +#define IB_SA_PATH_REC_PREFERENCE IB_SA_COMP_MASK(22) + +struct ib_sa_path_rec { + /* reserved */ + /* reserved */ + union ib_gid dgid; + union ib_gid sgid; + u16 dlid; + u16 slid; + int raw_traffic; + /* reserved */ + u32 flow_label; + u8 hop_limit; + u8 traffic_class; + int reversible; + u8 numb_path; + u16 pkey; + /* reserved */ + u8 sl; + u8 mtu_selector; + enum ib_mtu mtu; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 preference; +}; + +#define IB_SA_MCMEMBER_REC_MGID IB_SA_COMP_MASK( 0) +#define IB_SA_MCMEMBER_REC_PORT_GID IB_SA_COMP_MASK( 1) +#define IB_SA_MCMEMBER_REC_QKEY IB_SA_COMP_MASK( 2) +#define IB_SA_MCMEMBER_REC_MLID IB_SA_COMP_MASK( 3) +#define IB_SA_MCMEMBER_REC_MTU_SELECTOR IB_SA_COMP_MASK( 4) +#define IB_SA_MCMEMBER_REC_MTU IB_SA_COMP_MASK( 5) +#define IB_SA_MCMEMBER_REC_TRAFFIC_CLASS IB_SA_COMP_MASK( 6) +#define IB_SA_MCMEMBER_REC_PKEY IB_SA_COMP_MASK( 7) +#define IB_SA_MCMEMBER_REC_RATE_SELECTOR IB_SA_COMP_MASK( 8) +#define IB_SA_MCMEMBER_REC_RATE IB_SA_COMP_MASK( 9) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(10) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(11) +#define IB_SA_MCMEMBER_REC_SL IB_SA_COMP_MASK(12) +#define IB_SA_MCMEMBER_REC_FLOW_LABEL IB_SA_COMP_MASK(13) +#define IB_SA_MCMEMBER_REC_HOP_LIMIT IB_SA_COMP_MASK(14) +#define IB_SA_MCMEMBER_REC_SCOPE IB_SA_COMP_MASK(15) +#define IB_SA_MCMEMBER_REC_JOIN_STATE IB_SA_COMP_MASK(16) +#define IB_SA_MCMEMBER_REC_PROXY_JOIN IB_SA_COMP_MASK(17) + +struct ib_sa_mcmember_rec { + union ib_gid mgid; + union ib_gid port_gid; + u32 qkey; + u16 mlid; + u8 mtu_selector; + enum ib_mtu mtu; + u8 traffic_class; + u16 pkey; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 sl; + u32 flow_label; + u8 hop_limit; + u8 scope; + u8 join_state; + int proxy_join; +}; + +struct ib_sa_query; + +void ib_sa_cancel_query(int id, struct ib_sa_query *query); + +int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, 
+			     void (*callback)(int status,
+					      struct ib_sa_mcmember_rec *resp,
+					      void *context),
+			     void *context,
+			     struct ib_sa_query **query);
+
+/**
+ * ib_sa_mcmember_rec_set - Start an MCMember set query
+ * @device:device to send query on
+ * @port_num: port number to send query on
+ * @rec:MCMember Record to send in query
+ * @comp_mask:component mask to send in query
+ * @timeout_ms:time to wait for response
+ * @gfp_mask:GFP mask to use for internal allocations
+ * @callback:function called when query completes, times out or is
+ * canceled
+ * @context:opaque user context passed to callback
+ * @sa_query:query context, used to cancel query
+ *
+ * Send an MCMember Set query to the SA (eg to join a multicast
+ * group). The callback function will be called when the query
+ * completes (or fails); status is 0 for a successful response, -EINTR
+ * if the query is canceled, -ETIMEDOUT if the query timed out, or
+ * -EIO if an error occurred sending the query. The resp parameter of
+ * the callback is only valid if status is 0.
+ *
+ * If the return value of ib_sa_mcmember_rec_set() is negative, it is
+ * an error code. Otherwise it is a query ID that can be used to
+ * cancel the query.
+ */
+static inline int
+ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num,
+		       struct ib_sa_mcmember_rec *rec,
+		       ib_sa_comp_mask comp_mask,
+		       int timeout_ms, int gfp_mask,
+		       void (*callback)(int status,
+					struct ib_sa_mcmember_rec *resp,
+					void *context),
+		       void *context,
+		       struct ib_sa_query **query)
+{
+	return ib_sa_mcmember_rec_query(device, port_num,
+					IB_MGMT_METHOD_SET,
+					rec, comp_mask,
+					timeout_ms, gfp_mask, callback,
+					context, query);
+}
+
+/**
+ * ib_sa_mcmember_rec_delete - Start an MCMember delete query
+ * @device:device to send query on
+ * @port_num: port number to send query on
+ * @rec:MCMember Record to send in query
+ * @comp_mask:component mask to send in query
+ * @timeout_ms:time to wait for response
+ * @gfp_mask:GFP mask to use for internal allocations
+ * @callback:function called when query completes, times out or is
+ * canceled
+ * @context:opaque user context passed to callback
+ * @sa_query:query context, used to cancel query
+ *
+ * Send an MCMember Delete query to the SA (eg to leave a multicast
+ * group). The callback function will be called when the query
+ * completes (or fails); status is 0 for a successful response, -EINTR
+ * if the query is canceled, -ETIMEDOUT if the query timed out, or
+ * -EIO if an error occurred sending the query. The resp parameter of
+ * the callback is only valid if status is 0.
+ *
+ * If the return value of ib_sa_mcmember_rec_delete() is negative, it
+ * is an error code. Otherwise it is a query ID that can be used to
+ * cancel the query.
+ */ +static inline int +ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_SA_METHOD_DELETE, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + + +#endif /* IB_SA_H */ From roland at topspin.com Mon Dec 13 10:09:25 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:25 -0800 Subject: [openib-general] [PATCH][v3][5/21] Add InfiniBand MAD (management datagram) support In-Reply-To: <20041213109.kvbhOIc6xDgg0Bag@topspin.com> Message-ID: <20041213109.3tK6alRLJABxH4bu@topspin.com> Add support for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. This is required for an SM (subnet manager) to discover and configure the host, since the SM's query MADs must be passed to the local SMA (subnet management agent). In addition, this support is used by upper level protocols to send queries to and receive responses from the SM. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-13 09:44:40.305787613 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-13 09:44:42.966395657 -0800 @@ -1,7 +1,8 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include obj-$(CONFIG_INFINIBAND) += \ - ib_core.o + ib_core.o \ + ib_mad.o ib_core-objs := \ packer.o \ @@ -11,3 +12,8 @@ device.o \ fmr_pool.o \ cache.o + +ib_mad-objs := \ + mad.o \ + smi.o \ + agent.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.c 2004-12-13 09:44:43.016388292 -0800 @@ -0,0 +1,386 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+*/ + +#include + +#include + +#include + +#include "smi.h" +#include "agent_priv.h" +#include "mad_priv.h" + + +spinlock_t ib_agent_port_list_lock; +static LIST_HEAD(ib_agent_port_list); + +extern kmem_cache_t *ib_mad_cache; + + +/* + * Caller must hold ib_agent_port_list_lock + */ +static inline struct ib_agent_port_private * +__ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + + BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ + + if (device) { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if (entry->dr_smp_agent->device == device && + entry->port_num == port_num) + return entry; + } + } else { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if ((entry->dr_smp_agent == mad_agent) || + (entry->lr_smp_agent == mad_agent) || + (entry->perf_mgmt_agent == mad_agent)) + return entry; + } + } + return NULL; +} + +static inline struct ib_agent_port_private * +ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + entry = __ib_get_agent_port(device, port_num, mad_agent); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return entry; +} + +int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + + if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + return 1; + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " + "not open\n", + device->name, port_num); + return 1; + } + + return smi_check_local_smp(port_priv->dr_smp_agent, smp); +} + +static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_agent_port_private *port_priv, + struct ib_mad_private *mad_priv, + struct ib_grh *grh, + struct ib_wc *wc) +{ + struct ib_agent_send_wr *agent_send_wr; + struct ib_sge gather_list; + struct ib_send_wr send_wr; + struct ib_send_wr *bad_send_wr; + struct ib_ah_attr ah_attr; + unsigned long flags; + int ret = 1; + + agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); + if (!agent_send_wr) + goto out; + agent_send_wr->mad = mad_priv; + + /* PCI mapping */ + gather_list.addr = dma_map_single(mad_agent->device->dma_device, + &mad_priv->mad, + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + gather_list.length = sizeof(mad_priv->mad); + gather_list.lkey = (*port_priv->mr).lkey; + + send_wr.next = NULL; + send_wr.opcode = IB_WR_SEND; + send_wr.sg_list = &gather_list; + send_wr.num_sge = 1; + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ + send_wr.wr.ud.timeout_ms = 0; + send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; + + ah_attr.dlid = wc->slid; + ah_attr.port_num = mad_agent->port_num; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.sl = wc->sl; + ah_attr.static_rate = 0; + ah_attr.ah_flags = 0; /* No GRH */ + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (wc->wc_flags & IB_WC_GRH) { + ah_attr.ah_flags = IB_AH_GRH; + /* Should sgid be looked up ? 
*/ + ah_attr.grh.sgid_index = 0; + ah_attr.grh.hop_limit = grh->hop_limit; + ah_attr.grh.flow_label = be32_to_cpup( + &grh->version_tclass_flow) & 0xfffff; + ah_attr.grh.traffic_class = (be32_to_cpup( + &grh->version_tclass_flow) >> 20) & 0xff; + memcpy(ah_attr.grh.dgid.raw, + grh->sgid.raw, + sizeof(ah_attr.grh.dgid)); + } + } + + agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); + if (IS_ERR(agent_send_wr->ah)) { + printk(KERN_ERR SPFX "No memory for address handle\n"); + kfree(agent_send_wr); + goto out; + } + + send_wr.wr.ud.ah = agent_send_wr->ah; + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + send_wr.wr.ud.pkey_index = wc->pkey_index; + send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; + } else { /* for SMPs */ + send_wr.wr.ud.pkey_index = 0; + send_wr.wr.ud.remote_qkey = 0; + } + send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr; + send_wr.wr_id = (unsigned long)agent_send_wr; + + pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); + + /* Send */ + spin_lock_irqsave(&port_priv->send_list_lock, flags); + if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + ib_destroy_ah(agent_send_wr->ah); + kfree(agent_send_wr); + } else { + list_add_tail(&agent_send_wr->send_list, + &port_priv->send_posted_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; + } + +out: + return ret; +} + +int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + struct ib_mad_agent *mad_agent; + + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", + device->name, port_num); + return 1; + } + + /* Get mad agent based on mgmt_class in MAD */ + switch (mad->mad.mad.mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + mad_agent = port_priv->dr_smp_agent; + break; + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + mad_agent = port_priv->lr_smp_agent; + break; + case IB_MGMT_CLASS_PERF_MGMT: + mad_agent = port_priv->perf_mgmt_agent; + break; + default: + return 1; + } + + return agent_mad_send(mad_agent, port_priv, mad, grh, wc); +} + +static void agent_send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_agent_port_private *port_priv; + struct ib_agent_send_wr *agent_send_wr; + unsigned long flags; + + /* Find matching MAD agent */ + port_priv = ib_get_agent_port(NULL, 0, mad_agent); + if (!port_priv) { + printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " + "agent %p\n", mad_agent); + return; + } + + agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; + spin_lock_irqsave(&port_priv->send_list_lock, flags); + /* Remove completed send from posted send MAD list */ + list_del(&agent_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + + /* Unmap PCI */ + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(agent_send_wr->mad->mad), + DMA_TO_DEVICE); + + ib_destroy_ah(agent_send_wr->ah); + + /* Release allocated memory */ + kmem_cache_free(ib_mad_cache, agent_send_wr->mad); + kfree(agent_send_wr); +} + +int ib_agent_port_open(struct ib_device *device, int port_num) +{ + int ret; + struct 
ib_agent_port_private *port_priv; + struct ib_mad_reg_req reg_req; + unsigned long flags; + + /* First, check if port already open for SMI */ + port_priv = ib_get_agent_port(device, port_num, NULL); + if (port_priv) { + printk(KERN_DEBUG SPFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR SPFX "No memory for ib_agent_port_private\n"); + ret = -ENOMEM; + goto error1; + } + + memset(port_priv, 0, sizeof *port_priv); + port_priv->port_num = port_num; + spin_lock_init(&port_priv->send_list_lock); + INIT_LIST_HEAD(&port_priv->send_posted_list); + + /* Obtain MAD agent for directed route SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; + reg_req.mgmt_class_version = 1; + + port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + + if (IS_ERR(port_priv->dr_smp_agent)) { + ret = PTR_ERR(port_priv->dr_smp_agent); + goto error2; + } + + /* Obtain MAD agent for LID routed SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->lr_smp_agent)) { + ret = PTR_ERR(port_priv->lr_smp_agent); + goto error3; + } + + /* Obtain MAD agent for PerfMgmt class */ + reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->perf_mgmt_agent)) { + ret = PTR_ERR(port_priv->perf_mgmt_agent); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR SPFX "Couldn't get DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_agent_port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return 0; + +error5: + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); +error4: + ib_unregister_mad_agent(port_priv->lr_smp_agent); +error3: + ib_unregister_mad_agent(port_priv->dr_smp_agent); +error2: + kfree(port_priv); +error1: + return ret; +} + +int ib_agent_port_close(struct ib_device *device, int port_num) +{ + struct ib_agent_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + port_priv = __ib_get_agent_port(device, port_num, NULL); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + printk(KERN_ERR SPFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + ib_dereg_mr(port_priv->mr); + + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); + ib_unregister_mad_agent(port_priv->lr_smp_agent); + ib_unregister_mad_agent(port_priv->dr_smp_agent); + kfree(port_priv); + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.h 2004-12-13 09:44:43.070380338 -0800 @@ -0,0 +1,42 @@ +/* + This software is available to you under a choice of one of two + licenses. 
You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#ifndef __AGENT_H_ +#define __AGENT_H_ + +extern spinlock_t ib_agent_port_list_lock; + +extern int ib_agent_port_open(struct ib_device *device, + int port_num); + +extern int ib_agent_port_close(struct ib_device *device, int port_num); + +extern int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + +#endif /* __AGENT_H_ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent_priv.h 2004-12-13 09:44:43.096376508 -0800 @@ -0,0 +1,51 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+*/ + +#ifndef __IB_AGENT_PRIV_H__ +#define __IB_AGENT_PRIV_H__ + +#include + +#define SPFX "ib_agent: " + +struct ib_agent_send_wr { + struct list_head send_list; + struct ib_ah *ah; + struct ib_mad_private *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct ib_agent_port_private { + struct list_head port_list; + struct list_head send_posted_list; + spinlock_t send_list_lock; + int port_num; + struct ib_mad_agent *dr_smp_agent; /* DR SM class */ + struct ib_mad_agent *lr_smp_agent; /* LR SM class */ + struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ + struct ib_mr *mr; +}; + +#endif /* __IB_AGENT_PRIV_H__ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad.c 2004-12-13 09:44:42.990392121 -0800 @@ -0,0 +1,2593 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. + * Maintained by: vtrmaint1 at voltaire.com + * + * This program is intended for the purpose of Infiniband + * protocol stack for Linux Servers. + * + * This software program is free software and you are free to modifyi + * and/or redistribute it under a choice of one of the following two + * licenses: + * + * 1) under either the GNU General Public License (GPL) Version 2, June 1991, + * a copy of which is in the file LICENSE_GPL_V2.txt in the root directory. + * This GPL license is also available from the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA, or on the + * web at http://www.fsf.org/copyleft/gpl.html + * + * OR + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, on the web at + * http://www.opensource.org/licenses/bsd-license.php. + * + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * + * + * To obtain a copy of these licenses, the source code to this software or + * for other questions, you may write to Voltaire, Inc., + * Attention: Voltaire openSource maintainer, + * Voltaire, Inc. 54 Middlesex Turnpike Bedford, MA 01730 or + * by Email: vtrmaint1 at voltaire.com + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, either one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ + +#include +#include + +#include + +#include "mad_priv.h" +#include "smi.h" +#include "agent.h" + + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_DESCRIPTION("kernel IB MAD API"); +MODULE_AUTHOR("Hal Rosenstock"); +MODULE_AUTHOR("Sean Hefty"); + + +kmem_cache_t *ib_mad_cache; +static struct list_head ib_mad_port_list; +static u32 ib_mad_client_id = 0; + +/* Port list lock */ +static spinlock_t ib_mad_port_list_lock; + + +/* Forward declarations */ +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req); +static void remove_mad_reg_req(struct ib_mad_agent_private *priv); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc); +static void timeout_sends(void *data); +static int solicited_mad(struct ib_mad *mad); +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class); +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv); + +/* + * Returns a ib_mad_port_private structure or NULL for a device/port + * Assumes ib_mad_port_list_lock is being held + */ +static inline struct ib_mad_port_private * +__ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + + list_for_each_entry(entry, &ib_mad_port_list, port_list) { + if (entry->device == device && entry->port_num == port_num) + return entry; + } + return NULL; +} + +/* + * Wrapper function to return a ib_mad_port_private structure or NULL + * for a device/port + */ +static inline struct ib_mad_port_private * +ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + entry = __ib_get_mad_port(device, port_num); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + return entry; +} + +static inline u8 convert_mgmt_class(u8 mgmt_class) +{ + /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ + return mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE ? 
+ 0 : mgmt_class; +} + +static int get_spl_qp_index(enum ib_qp_type qp_type) +{ + switch (qp_type) + { + case IB_QPT_SMI: + return 0; + case IB_QPT_GSI: + return 1; + default: + return -1; + } +} + +static int vendor_class_index(u8 mgmt_class) +{ + return mgmt_class - IB_MGMT_CLASS_VENDOR_RANGE2_START; +} + +static int is_vendor_class(u8 mgmt_class) +{ + if ((mgmt_class < IB_MGMT_CLASS_VENDOR_RANGE2_START) || + (mgmt_class > IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return 0; + return 1; +} + +static int is_vendor_oui(char *oui) +{ + if (oui[0] || oui[1] || oui[2]) + return 1; + return 0; +} + +static int is_vendor_method_in_use( + struct ib_mad_mgmt_vendor_class *vendor_class, + struct ib_mad_reg_req *mad_reg_req) +{ + struct ib_mad_mgmt_method_table *method; + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) { + if (!memcmp(vendor_class->oui[i], mad_reg_req->oui, 3)) { + method = vendor_class->method_table[i]; + if (method) { + if (method_in_use(&method, mad_reg_req)) + return 1; + else + break; + } + } + } + return 0; +} + +/* + * ib_register_mad_agent - Register to send/receive MADs + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret = ERR_PTR(-EINVAL); + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_reg_req *reg_req = NULL; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_mad_mgmt_method_table *method; + int ret2, qpn; + unsigned long flags; + u8 mgmt_class, vclass; + + /* Validate parameters */ + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) + goto error1; + + if (rmpp_version) + goto error1; /* XXX: until RMPP implemented */ + + /* Validate MAD registration request if supplied */ + if (mad_reg_req) { + if (mad_reg_req->mgmt_class_version >= MAX_MGMT_VERSION) + goto error1; + if (!recv_handler) + goto error1; + if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) { + /* + * IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is the only + * one in this range currently allowed + */ + if (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + goto error1; + } else if (mad_reg_req->mgmt_class == 0) { + /* + * Class 0 is reserved in IBA and is used for + * aliasing of IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE + */ + goto error1; + } else if (is_vendor_class(mad_reg_req->mgmt_class)) { + /* + * If class is in "new" vendor range, + * ensure supplied OUI is not zero + */ + if (!is_vendor_oui(mad_reg_req->oui)) + goto error1; + } + /* Make sure class supplied is consistent with QP type */ + if (qp_type == IB_QPT_SMI) { + if ((mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_LID_ROUTED) && + (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } else { + if ((mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } + } else { + /* No registration request supplied */ + if (!send_handler) + goto error1; + } + + /* Validate device and port */ + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + + /* Allocate structures */ + mad_agent_priv = kmalloc(sizeof *mad_agent_priv, GFP_KERNEL); + if (!mad_agent_priv) { + ret = ERR_PTR(-ENOMEM); + goto 
error1; + } + + if (mad_reg_req) { + reg_req = kmalloc(sizeof *reg_req, GFP_KERNEL); + if (!reg_req) { + ret = ERR_PTR(-ENOMEM); + goto error2; + } + /* Make a copy of the MAD registration request */ + memcpy(reg_req, mad_reg_req, sizeof *reg_req); + } + + /* Now, fill in the various structures */ + memset(mad_agent_priv, 0, sizeof *mad_agent_priv); + mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; + mad_agent_priv->reg_req = reg_req; + mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->agent.device = device; + mad_agent_priv->agent.recv_handler = recv_handler; + mad_agent_priv->agent.send_handler = send_handler; + mad_agent_priv->agent.context = context; + mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_agent_priv->agent.port_num = port_num; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + mad_agent_priv->agent.hi_tid = ++ib_mad_client_id; + + /* + * Make sure MAD registration (if supplied) + * is non overlapping with any existing ones + */ + if (mad_reg_req) { + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); + if (!is_vendor_class(mgmt_class)) { + class = port_priv->version[mad_reg_req-> + mgmt_class_version].class; + if (class) { + method = class->method_table[mgmt_class]; + if (method) { + if (method_in_use(&method, + mad_reg_req)) + goto error3; + } + } + ret2 = add_nonoui_reg_req(mad_reg_req, mad_agent_priv, + mgmt_class); + } else { + /* "New" vendor class range */ + vendor = port_priv->version[mad_reg_req-> + mgmt_class_version].vendor; + if (vendor) { + vclass = vendor_class_index(mgmt_class); + vendor_class = vendor->vendor_class[vclass]; + if (vendor_class) { + if (is_vendor_method_in_use( + vendor_class, + mad_reg_req)) + goto error3; + } + } + ret2 = add_oui_reg_req(mad_reg_req, mad_agent_priv); + } + if (ret2) { + ret = ERR_PTR(ret2); + goto error3; + } + } + + /* Add mad agent into port's agent list */ + list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + spin_lock_init(&mad_agent_priv->lock); + INIT_LIST_HEAD(&mad_agent_priv->send_list); + INIT_LIST_HEAD(&mad_agent_priv->wait_list); + INIT_WORK(&mad_agent_priv->timed_work, timeout_sends, mad_agent_priv); + atomic_set(&mad_agent_priv->refcount, 1); + init_waitqueue_head(&mad_agent_priv->wait); + + return &mad_agent_priv->agent; + +error3: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + kfree(reg_req); +error2: + kfree(mad_agent_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_agent); + +static inline int is_snooping_sends(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (/*IB_MAD_SNOOP_POSTED_SENDS | + IB_MAD_SNOOP_RMPP_SENDS |*/ + IB_MAD_SNOOP_SEND_COMPLETIONS /*| + IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS*/)); +} + +static inline int is_snooping_recvs(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (IB_MAD_SNOOP_RECVS /*| + IB_MAD_SNOOP_RMPP_RECVS*/)); +} + +static int register_snoop_agent(struct ib_mad_qp_info *qp_info, + struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_snoop_private **new_snoop_table; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + /* Check for empty slot in array. */ + for (i = 0; i < qp_info->snoop_table_size; i++) + if (!qp_info->snoop_table[i]) + break; + + if (i == qp_info->snoop_table_size) { + /* Grow table. 
*/ + new_snoop_table = kmalloc(sizeof mad_snoop_priv * + qp_info->snoop_table_size + 1, + GFP_ATOMIC); + if (!new_snoop_table) { + i = -ENOMEM; + goto out; + } + if (qp_info->snoop_table) { + memcpy(new_snoop_table, qp_info->snoop_table, + sizeof mad_snoop_priv * + qp_info->snoop_table_size); + kfree(qp_info->snoop_table); + } + qp_info->snoop_table = new_snoop_table; + qp_info->snoop_table_size++; + } + qp_info->snoop_table[i] = mad_snoop_priv; + atomic_inc(&qp_info->snoop_count); +out: + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + return i; +} + +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret; + struct ib_mad_snoop_private *mad_snoop_priv; + int qpn; + + /* Validate parameters */ + if ((is_snooping_sends(mad_snoop_flags) && !snoop_handler) || + (is_snooping_recvs(mad_snoop_flags) && !recv_handler)) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + /* Allocate structures */ + mad_snoop_priv = kmalloc(sizeof *mad_snoop_priv, GFP_KERNEL); + if (!mad_snoop_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + /* Now, fill in the various structures */ + memset(mad_snoop_priv, 0, sizeof *mad_snoop_priv); + mad_snoop_priv->qp_info = &port_priv->qp_info[qpn]; + mad_snoop_priv->agent.device = device; + mad_snoop_priv->agent.recv_handler = recv_handler; + mad_snoop_priv->agent.snoop_handler = snoop_handler; + mad_snoop_priv->agent.context = context; + mad_snoop_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_snoop_priv->agent.port_num = port_num; + mad_snoop_priv->mad_snoop_flags = mad_snoop_flags; + init_waitqueue_head(&mad_snoop_priv->wait); + mad_snoop_priv->snoop_index = register_snoop_agent( + &port_priv->qp_info[qpn], + mad_snoop_priv); + if (mad_snoop_priv->snoop_index < 0) { + ret = ERR_PTR(mad_snoop_priv->snoop_index); + goto error2; + } + + atomic_set(&mad_snoop_priv->refcount, 1); + return &mad_snoop_priv->agent; + +error2: + kfree(mad_snoop_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_snoop); + +static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + /* Note that we could still be handling received MADs */ + + /* + * Canceling all sends results in dropping received response + * MADs, preventing us from queuing additional work + */ + cancel_mads(mad_agent_priv); + + port_priv = mad_agent_priv->qp_info->port_priv; + cancel_delayed_work(&mad_agent_priv->timed_work); + flush_workqueue(port_priv->wq); + + spin_lock_irqsave(&port_priv->reg_lock, flags); + remove_mad_reg_req(mad_agent_priv); + list_del(&mad_agent_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + /* XXX: Cleanup pending RMPP receives for this agent */ + + atomic_dec(&mad_agent_priv->refcount); + wait_event(mad_agent_priv->wait, + !atomic_read(&mad_agent_priv->refcount)); + + if (mad_agent_priv->reg_req) + kfree(mad_agent_priv->reg_req); + kfree(mad_agent_priv); +} + +static void unregister_mad_snoop(struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_qp_info *qp_info; + unsigned long flags; + + qp_info = 
mad_snoop_priv->qp_info; + spin_lock_irqsave(&qp_info->snoop_lock, flags); + qp_info->snoop_table[mad_snoop_priv->snoop_index] = NULL; + atomic_dec(&qp_info->snoop_count); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + + atomic_dec(&mad_snoop_priv->refcount); + wait_event(mad_snoop_priv->wait, + !atomic_read(&mad_snoop_priv->refcount)); + + kfree(mad_snoop_priv); +} + +/* + * ib_unregister_mad_agent - Unregisters a client from using MAD services + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_snoop_private *mad_snoop_priv; + + /* If the TID is zero, the agent can only snoop. */ + if (mad_agent->hi_tid) { + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + unregister_mad_agent(mad_agent_priv); + } else { + mad_snoop_priv = container_of(mad_agent, + struct ib_mad_snoop_private, + agent); + unregister_mad_snoop(mad_snoop_priv); + } + return 0; +} +EXPORT_SYMBOL(ib_unregister_mad_agent); + +static void dequeue_mad(struct ib_mad_list_head *mad_list) +{ + struct ib_mad_queue *mad_queue; + unsigned long flags; + + BUG_ON(!mad_list->mad_queue); + mad_queue = mad_list->mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_del(&mad_list->list); + mad_queue->count--; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + +static void snoop_send(struct ib_mad_qp_info *qp_info, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.snoop_handler(&mad_snoop_priv->agent, + send_wr, mad_send_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +static void snoop_recv(struct ib_mad_qp_info *qp_info, + struct ib_mad_recv_wc *mad_recv_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.recv_handler(&mad_snoop_priv->agent, + mad_recv_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +/* + * Return 0 if SMP is to be sent + * Return 1 if SMP was consumed locally (whether or not solicited) + * Return < 0 if error + */ +static int handle_outgoing_smp(struct ib_mad_agent_private *mad_agent_priv, + struct ib_smp *smp, + struct ib_send_wr *send_wr) +{ + int ret; + struct ib_mad_private *mad_priv; + struct ib_mad_send_wc mad_send_wc; + struct ib_device *device = mad_agent_priv->agent.device; + u8 port_num = mad_agent_priv->agent.port_num; + + if (!smi_handle_dr_smp_send(smp, 
device->node_type, port_num)) { + ret = -EINVAL; + printk(KERN_ERR PFX "Invalid directed route\n"); + goto out; + } + /* Check to post send on QP or process locally */ + ret = smi_check_local_dr_smp(smp, device, port_num); + if (!ret || !device->process_mad) + goto out; + + mad_priv = kmem_cache_alloc(ib_mad_cache, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local response MAD\n"); + goto out; + } + ret = device->process_mad(device, 0, port_num, smp->dr_slid, + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); + switch (ret) + { + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent_priv->agent.recv_handler) { + struct ib_wc wc; + + /* + * Defined behavior is to complete response + * before request + */ + wc.wr_id = send_wr->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = IB_QP0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + mad_priv->header.recv_wc.wc = &wc; + mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&mad_priv->header.recv_wc.recv_buf.list); + mad_priv->header.recv_wc.recv_buf.grh = NULL; + mad_priv->header.recv_wc.recv_buf.mad = + &mad_priv->mad.mad; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_recv(mad_agent_priv->qp_info, + &mad_priv->header.recv_wc, + IB_MAD_SNOOP_RECVS); + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &mad_priv->header.recv_wc); + } else + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS: + kmem_cache_free(ib_mad_cache, mad_priv); + ret = 0; + goto out; + default: + kmem_cache_free(ib_mad_cache, mad_priv); + ret = -EINVAL; + goto out; + } + + /* Complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = send_wr->wr_id; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_send(mad_agent_priv->qp_info, send_wr, &mad_send_wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + ret = 1; +out: + return ret; +} + +static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr) +{ + struct ib_mad_qp_info *qp_info; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + /* Replace user's WR ID with our own to find WR upon completion */ + qp_info = mad_agent_priv->qp_info; + mad_send_wr->wr_id = mad_send_wr->send_wr.wr_id; + mad_send_wr->send_wr.wr_id = (unsigned long)&mad_send_wr->mad_list; + mad_send_wr->mad_list.mad_queue = &qp_info->send_queue; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + if (qp_info->send_queue.count++ < qp_info->send_queue.max_active) { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->send_queue.list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = ib_post_send(mad_agent_priv->agent.qp, + &mad_send_wr->send_wr, &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + dequeue_mad(&mad_send_wr->mad_list); + } + } else { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->overflow_list); + 
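+		/* Send queue already has max_active requests posted; hold this one on the overflow list for now. */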
spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = 0; + } + return ret; +} + +/* + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + int ret = -EINVAL; + struct ib_mad_agent_private *mad_agent_priv; + + /* Validate supplied parameters */ + if (!bad_send_wr) + goto error1; + + if (!mad_agent || !send_wr) + goto error2; + + if (!mad_agent->send_handler) + goto error2; + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + + /* Walk list of send WRs and post each on send list */ + while (send_wr) { + unsigned long flags; + struct ib_send_wr *next_send_wr; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_smp *smp; + + /* Validate more parameters */ + if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) + goto error2; + + if (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler) + goto error2; + + if (!send_wr->wr.ud.mad_hdr) { + printk(KERN_ERR PFX "MAD header must be supplied " + "in WR %p\n", send_wr); + goto error2; + } + + /* + * Save pointer to next work request to post in case the + * current one completes, and the user modifies the work + * request associated with the completion + */ + next_send_wr = (struct ib_send_wr *)send_wr->next; + + smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = handle_outgoing_smp(mad_agent_priv, smp, send_wr); + if (ret < 0) /* error */ + goto error2; + else if (ret == 1) /* locally consumed */ + goto next; + } + + /* Allocate MAD send WR tracking structure */ + mad_send_wr = kmalloc(sizeof *mad_send_wr, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_send_wr) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_send_wr_private\n"); + ret = -ENOMEM; + goto error2; + } + + mad_send_wr->send_wr = *send_wr; + mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; + memcpy(mad_send_wr->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + mad_send_wr->send_wr.next = NULL; + mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid; + mad_send_wr->agent = mad_agent; + /* Timeout will be updated after send completes */ + mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. 
+ ud.timeout_ms); + mad_send_wr->retry = 0; + /* One reference for each work request to QP + response */ + mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); + mad_send_wr->status = IB_WC_SUCCESS; + + /* Reference MAD agent until send completes */ + atomic_inc(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&mad_send_wr->agent_list, + &mad_agent_priv->send_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + ret = ib_send_mad(mad_agent_priv, mad_send_wr); + if (ret) { + /* Fail send request */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + atomic_dec(&mad_agent_priv->refcount); + goto error2; + } +next: + send_wr = next_send_wr; + } + return 0; + +error2: + *bad_send_wr = send_wr; +error1: + return ret; +} +EXPORT_SYMBOL(ib_post_send_mad); + +/* + * ib_free_recv_mad - Returns data buffers used to receive + * a MAD to the access layer + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_recv_buf *entry; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *priv; + + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, header); + + /* + * Walk receive buffer list associated with this WC + * No need to remove them from list of receive buffers + */ + list_for_each_entry(entry, &mad_recv_wc->recv_buf.list, list) { + /* Free previous receive buffer */ + kmem_cache_free(ib_mad_cache, priv); + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, + header); + } + + /* Free last buffer */ + kmem_cache_free(ib_mad_cache, priv); +} +EXPORT_SYMBOL(ib_free_recv_mad); + +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf) +{ + printk(KERN_ERR PFX "ib_coalesce_recv_mad() not implemented yet\n"); +} +EXPORT_SYMBOL(ib_coalesce_recv_mad); + +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + return ERR_PTR(-EINVAL); /* XXX: for now */ +} +EXPORT_SYMBOL(ib_redirect_mad_qp); + +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc) +{ + printk(KERN_ERR PFX "ib_process_mad_wc() not implemented yet\n"); + return 0; +} +EXPORT_SYMBOL(ib_process_mad_wc); + +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req) +{ + int i; + + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + if ((*method)->agent[i]) { + printk(KERN_ERR PFX "Method %d already in use\n", i); + return -EINVAL; + } + } + return 0; +} + +static int allocate_method_table(struct ib_mad_mgmt_method_table **method) +{ + /* Allocate management method table */ + *method = kmalloc(sizeof **method, GFP_ATOMIC); + if (!*method) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_method_table\n"); + return -ENOMEM; + } + /* Clear management method table */ + memset(*method, 0, sizeof **method); + + return 0; +} + +/* + * Check to see if there are any methods still in use + */ +static int check_method_table(struct ib_mad_mgmt_method_table *method) +{ + int i; + + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) + if (method->agent[i]) + return 
1; + return 0; +} + +/* + * Check to see if there are any method tables for this class still in use + */ +static int check_class_table(struct ib_mad_mgmt_class_table *class) +{ + int i; + + for (i = 0; i < MAX_MGMT_CLASS; i++) + if (class->method_table[i]) + return 1; + return 0; +} + +static int check_vendor_class(struct ib_mad_mgmt_vendor_class *vendor_class) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + if (vendor_class->method_table[i]) + return 1; + return 0; +} + +static int find_vendor_oui(struct ib_mad_mgmt_vendor_class *vendor_class, + char *oui) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + /* Is there matching OUI for this vendor class ? */ + if (!memcmp(vendor_class->oui[i], oui, 3)) + return i; + + return -1; +} + +static int check_vendor_table(struct ib_mad_mgmt_vendor_class_table *vendor) +{ + int i; + + for (i = 0; i < MAX_MGMT_VENDOR_RANGE2; i++) + if (vendor->vendor_class[i]) + return 1; + + return 0; +} + +static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, + struct ib_mad_agent_private *agent) +{ + int i; + + /* Remove any methods for this mad agent */ + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { + if (method->agent[i] == agent) { + method->agent[i] = NULL; + } + } +} + +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table **class; + struct ib_mad_mgmt_method_table **method; + int i, ret; + + port_priv = agent_priv->qp_info->port_priv; + class = &port_priv->version[mad_reg_req->mgmt_class_version].class; + if (!*class) { + /* Allocate management class table for "new" class version */ + *class = kmalloc(sizeof **class, GFP_ATOMIC); + if (!*class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_class_table\n"); + ret = -ENOMEM; + goto error1; + } + /* Clear management class table */ + memset(*class, 0, sizeof(**class)); + /* Allocate method table for this management class */ + method = &(*class)->method_table[mgmt_class]; + if ((ret = allocate_method_table(method))) + goto error2; + } else { + method = &(*class)->method_table[mgmt_class]; + if (!*method) { + /* Allocate method table for this management class */ + if ((ret = allocate_method_table(method))) + goto error1; + } + } + + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error3; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error3: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; + goto error1; +error2: + kfree(*class); + *class = NULL; +error1: + return ret; +} + +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_vendor_class_table **vendor_table; + struct ib_mad_mgmt_vendor_class_table *vendor = NULL; + struct ib_mad_mgmt_vendor_class *vendor_class = NULL; + struct ib_mad_mgmt_method_table **method; + int i, ret = -ENOMEM; + u8 vclass; + + /* "New" vendor (with OUI) 
class */ + vclass = vendor_class_index(mad_reg_req->mgmt_class); + port_priv = agent_priv->qp_info->port_priv; + vendor_table = &port_priv->version[ + mad_reg_req->mgmt_class_version].vendor; + if (!*vendor_table) { + /* Allocate mgmt vendor class table for "new" class version */ + vendor = kmalloc(sizeof *vendor, GFP_ATOMIC); + if (!vendor) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class_table\n"); + goto error1; + } + /* Clear management vendor class table */ + memset(vendor, 0, sizeof(*vendor)); + *vendor_table = vendor; + } + if (!(*vendor_table)->vendor_class[vclass]) { + /* Allocate table for this management vendor class */ + vendor_class = kmalloc(sizeof *vendor_class, GFP_ATOMIC); + if (!vendor_class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class\n"); + goto error2; + } + memset(vendor_class, 0, sizeof(*vendor_class)); + (*vendor_table)->vendor_class[vclass] = vendor_class; + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* Is there matching OUI for this vendor class ? */ + if (!memcmp((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3)) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(!*method); + goto check_in_use; + } + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* OUI slot available ? */ + if (!is_vendor_oui((*vendor_table)->vendor_class[ + vclass]->oui[i])) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(*method); + /* Allocate method table for this OUI */ + if ((ret = allocate_method_table(method))) + goto error3; + memcpy((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3); + goto check_in_use; + } + } + printk(KERN_ERR PFX "All OUI slots in use\n"); + goto error3; + +check_in_use: + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error4; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error4: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; +error3: + if (vendor_class) { + (*vendor_table)->vendor_class[vclass] = NULL; + kfree(vendor_class); + } +error2: + if (vendor) { + *vendor_table = NULL; + kfree(vendor); + } +error1: + return ret; +} + +static void remove_mad_reg_req(struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + int index; + u8 mgmt_class; + + /* + * Was MAD registration request supplied + * with original registration ? 
+ */ + if (!agent_priv->reg_req) { + goto out; + } + + port_priv = agent_priv->qp_info->port_priv; + class = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].class; + if (!class) + goto vendor_check; + + mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); + method = class->method_table[mgmt_class]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* Now, check to see if there are any methods still in use */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + class->method_table[mgmt_class] = NULL; + /* Any management classes left ? */ + if (!check_class_table(class)) { + /* If not, release management class table */ + kfree(class); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version].class = NULL; + } + } + } + +vendor_check: + vendor = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].vendor; + if (!vendor) + goto out; + + mgmt_class = vendor_class_index(agent_priv->reg_req->mgmt_class); + vendor_class = vendor->vendor_class[mgmt_class]; + if (vendor_class) { + index = find_vendor_oui(vendor_class, agent_priv->reg_req->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* + * Now, check to see if there are + * any methods still in use + */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + vendor_class->method_table[index] = NULL; + memset(vendor_class->oui[index], 0, 3); + /* Any OUIs left ? */ + if (!check_vendor_class(vendor_class)) { + /* If not, release vendor class table */ + kfree(vendor_class); + vendor->vendor_class[mgmt_class] = NULL; + /* Any other vendor classes left ? */ + if (!check_vendor_table(vendor)) { + kfree(vendor); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version]. + vendor = NULL; + } + } + } + } + } + +out: + return; +} + +static int response_mad(struct ib_mad *mad) +{ + /* Trap represses are responses although response bit is reset */ + return ((mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) || + (mad->mad_hdr.method & IB_MGMT_METHOD_RESP)); +} + +static int solicited_mad(struct ib_mad *mad) +{ + /* CM MADs are never solicited */ + if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_CM) { + return 0; + } + + /* XXX: Determine whether MAD is using RMPP */ + + /* Not using RMPP */ + /* Is this MAD a response to a previous MAD ? */ + return response_mad(mad); +} + +static struct ib_mad_agent_private * +find_mad_agent(struct ib_mad_port_private *port_priv, + struct ib_mad *mad, + int solicited) +{ + struct ib_mad_agent_private *mad_agent = NULL; + unsigned long flags; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + + /* + * Whether MAD was solicited determines type of routing to + * MAD client. + */ + if (solicited) { + u32 hi_tid; + struct ib_mad_agent_private *entry; + + /* + * Routing is based on high 32 bits of transaction ID + * of MAD. 
+ */ + hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; + list_for_each_entry(entry, &port_priv->agent_list, + agent_list) { + if (entry->agent.hi_tid == hi_tid) { + mad_agent = entry; + break; + } + } + } else { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_vendor_mad *vendor_mad; + int index; + + /* + * Routing is based on version, class, and method + * For "newer" vendor MADs, also based on OUI + */ + if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) + goto out; + if (!is_vendor_class(mad->mad_hdr.mgmt_class)) { + class = port_priv->version[ + mad->mad_hdr.class_version].class; + if (!class) + goto out; + method = class->method_table[convert_mgmt_class( + mad->mad_hdr.mgmt_class)]; + if (method) + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } else { + vendor = port_priv->version[ + mad->mad_hdr.class_version].vendor; + if (!vendor) + goto out; + vendor_class = vendor->vendor_class[vendor_class_index( + mad->mad_hdr.mgmt_class)]; + if (!vendor_class) + goto out; + /* Find matching OUI */ + vendor_mad = (struct ib_vendor_mad *)mad; + index = find_vendor_oui(vendor_class, vendor_mad->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } + } + } + + if (mad_agent) { + if (mad_agent->agent.recv_handler) + atomic_inc(&mad_agent->refcount); + else { + printk(KERN_NOTICE PFX "No receive handler for client " + "%p on port %d\n", + &mad_agent->agent, port_priv->port_num); + mad_agent = NULL; + } + } +out: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + return mad_agent; +} + +static int validate_mad(struct ib_mad *mad, u32 qp_num) +{ + int valid = 0; + + /* Make sure MAD base version is understood */ + if (mad->mad_hdr.base_version != IB_MGMT_BASE_VERSION) { + printk(KERN_ERR PFX "MAD received with unsupported base " + "version %d\n", mad->mad_hdr.base_version); + goto out; + } + + /* Filter SMI packets sent to other than QP0 */ + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) { + if (qp_num == 0) + valid = 1; + } else { + /* Filter GSI packets sent to QP0 */ + if (qp_num != 0) + valid = 1; + } + +out: + return valid; +} + +/* + * Return start of fully reassembled MAD, or NULL, if MAD isn't assembled yet + */ +static struct ib_mad_private * +reassemble_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv) +{ + /* Until we have RMPP, all receives are reassembled!... */ + INIT_LIST_HEAD(&recv->header.recv_wc.recv_buf.list); + return recv; +} + +static struct ib_mad_send_wr_private* +find_send_req(struct ib_mad_agent_private *mad_agent_priv, + u64 tid) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->tid == tid) + return mad_send_wr; + } + + /* + * It's possible to receive the response before we've + * been notified that the send has completed + */ + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->tid == tid && mad_send_wr->timeout) { + /* Verify request has not been canceled */ + return (mad_send_wr->status == IB_WC_SUCCESS) ? 
+ mad_send_wr : NULL; + } + } + return NULL; +} + +static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv, + int solicited) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + /* Fully reassemble receive before processing */ + recv = reassemble_recv(mad_agent_priv, recv); + if (!recv) { + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + + /* Complete corresponding request */ + if (solicited) { + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_req(mad_agent_priv, + recv->mad.mad.mad_hdr.tid); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + ib_free_recv_mad(&recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + /* Timeout = 0 means that we won't wait for a response */ + mad_send_wr->timeout = 0; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Defined behavior is to complete response before request */ + recv->header.recv_wc.wc->wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + atomic_dec(&mad_agent_priv->refcount); + + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); + } else { + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + } +} + +static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_qp_info *qp_info; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv, *response; + struct ib_mad_list_head *mad_list; + struct ib_mad_agent_private *mad_agent; + int solicited; + + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + dequeue_mad(mad_list); + + mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header); + dma_unmap_single(port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + + /* Setup MAD receive work completion from "normal" work completion */ + recv->header.recv_wc.wc = wc; + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); + recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; + recv->header.recv_wc.recv_buf.grh = &recv->grh; + + if (atomic_read(&qp_info->snoop_count)) + snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); + + /* Validate MAD */ + if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) + goto out; + + if (recv->mad.mad.mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (!smi_handle_dr_smp_recv(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt)) + goto out; + if (!smi_check_forward_dr_smp(&recv->mad.smp)) + goto local; + if (!smi_handle_dr_smp_send(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num)) + goto out; + if (!smi_check_local_dr_smp(&recv->mad.smp, 
+ port_priv->device, + port_priv->port_num)) + goto out; + } + +local: + /* Give driver "right of first refusal" on incoming MAD */ + if (port_priv->device->process_mad) { + int ret; + + if (!response) { + printk(KERN_ERR PFX "No memory for response MAD\n"); + /* + * Is it better to assume that + * it wouldn't be processed ? + */ + goto out; + } + + ret = port_priv->device->process_mad(port_priv->device, 0, + port_priv->port_num, + wc->slid, + &recv->mad.mad, + &response->mad.mad); + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) + goto out; + if (ret & IB_MAD_RESULT_REPLY) { + /* Send response */ + if (!agent_send(response, &recv->grh, wc, + port_priv->device, + port_priv->port_num)) + response = NULL; + goto out; + } + } + } + + /* Determine corresponding MAD agent for incoming receive MAD */ + solicited = solicited_mad(&recv->mad.mad); + mad_agent = find_mad_agent(port_priv, &recv->mad.mad, solicited); + if (mad_agent) { + ib_mad_complete_recv(mad_agent, recv, solicited); + /* + * recv is freed up in error cases in ib_mad_complete_recv + * or via recv_handler in ib_mad_complete_recv() + */ + recv = NULL; + } + +out: + /* Post another receive request for this QP */ + if (response) { + ib_mad_post_receive_mads(qp_info, response); + if (recv) + kmem_cache_free(ib_mad_cache, recv); + } else + ib_mad_post_receive_mads(qp_info, recv); +} + +static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_send_wr_private *mad_send_wr; + unsigned long delay; + + if (list_empty(&mad_agent_priv->wait_list)) { + cancel_delayed_work(&mad_agent_priv->timed_work); + } else { + mad_send_wr = list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_agent_priv->timeout, + mad_send_wr->timeout)) { + mad_agent_priv->timeout = mad_send_wr->timeout; + cancel_delayed_work(&mad_agent_priv->timed_work); + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->timed_work, delay); + } + } +} + +static void wait_for_response(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr ) +{ + struct ib_mad_send_wr_private *temp_mad_send_wr; + struct list_head *list_item; + unsigned long delay; + + list_del(&mad_send_wr->agent_list); + + delay = mad_send_wr->timeout; + mad_send_wr->timeout += jiffies; + + list_for_each_prev(list_item, &mad_agent_priv->wait_list) { + temp_mad_send_wr = list_entry(list_item, + struct ib_mad_send_wr_private, + agent_list); + if (time_after(mad_send_wr->timeout, + temp_mad_send_wr->timeout)) + break; + } + list_add(&mad_send_wr->agent_list, list_item); + + /* Reschedule a work item if we have a shorter timeout */ + if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { + cancel_delayed_work(&mad_agent_priv->timed_work); + queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->timed_work, delay); + } +} + +/* + * Process a send work completion + */ +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_mad_agent_private *mad_agent_priv; + unsigned long flags; + + mad_agent_priv = container_of(mad_send_wr->agent, + struct ib_mad_agent_private, agent); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + if (mad_send_wc->status != IB_WC_SUCCESS && + mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = mad_send_wc->status; + 
mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + + if (--mad_send_wr->refcount > 0) { + if (mad_send_wr->refcount == 1 && mad_send_wr->timeout && + mad_send_wr->status == IB_WC_SUCCESS) { + wait_for_response(mad_agent_priv, mad_send_wr); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + return; + } + + /* Remove send from MAD agent and notify client of completion */ + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + if (mad_send_wr->status != IB_WC_SUCCESS ) + mad_send_wc->status = mad_send_wr->status; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + mad_send_wc); + + /* Release reference on agent taken when sending */ + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + + kfree(mad_send_wr); +} + +static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_send_wr_private *mad_send_wr, *queued_send_wr; + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_queue *send_queue; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + send_queue = mad_list->mad_queue; + qp_info = send_queue->qp_info; + +retry: + queued_send_wr = NULL; + spin_lock_irqsave(&send_queue->lock, flags); + list_del(&mad_list->list); + + /* Move queued send to the send queue */ + if (send_queue->count-- > send_queue->max_active) { + mad_list = container_of(qp_info->overflow_list.next, + struct ib_mad_list_head, list); + queued_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + list_del(&mad_list->list); + list_add_tail(&mad_list->list, &send_queue->list); + } + spin_unlock_irqrestore(&send_queue->lock, flags); + + /* Restore client wr_id in WC and complete send */ + wc->wr_id = mad_send_wr->wr_id; + if (atomic_read(&qp_info->snoop_count)) + snoop_send(qp_info, &mad_send_wr->send_wr, + (struct ib_mad_send_wc *)wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc *)wc); + + if (queued_send_wr) { + ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, + &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + mad_send_wr = queued_send_wr; + wc->status = IB_WC_LOC_QP_OP_ERR; + goto retry; + } + } +} + +static void mark_sends_for_retry(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_list_head *mad_list; + unsigned long flags; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { + mad_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + mad_send_wr->retry = 1; + } + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); +} + +static void mad_error_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_send_wr_private *mad_send_wr; + int ret; + + /* Determine if failure was a send or receive */ + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->recv_queue) + /* + * Receive errors indicate that the QP has entered the error + * state - error handling/shutdown code will cleanup 
+ */ + return; + + /* + * Send errors will transition the QP to SQE - move + * QP to RTS and repost flushed work requests + */ + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + if (wc->status == IB_WC_WR_FLUSH_ERR) { + if (mad_send_wr->retry) { + /* Repost send */ + struct ib_send_wr *bad_send_wr; + + mad_send_wr->retry = 0; + ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, + &bad_send_wr); + if (ret) + ib_mad_send_done_handler(port_priv, wc); + } else + ib_mad_send_done_handler(port_priv, wc); + } else { + struct ib_qp_attr *attr; + + /* Transition QP to RTS and fail offending send */ + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + attr->cur_qp_state = IB_QPS_SQE; + ret = ib_modify_qp(qp_info->qp, attr, + IB_QP_STATE | IB_QP_CUR_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } + ib_mad_send_done_handler(port_priv, wc); + } +} + +/* + * IB MAD completion callback + */ +static void ib_mad_completion_handler(void *data) +{ + struct ib_mad_port_private *port_priv; + struct ib_wc wc; + + port_priv = (struct ib_mad_port_private *)data; + ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + + while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { + if (wc.status == IB_WC_SUCCESS) { + switch (wc.opcode) { + case IB_WC_SEND: + ib_mad_send_done_handler(port_priv, &wc); + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; + } + } else + mad_error_handler(port_priv, &wc); + } +} + +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv) +{ + unsigned long flags; + struct ib_mad_send_wr_private *mad_send_wr, *temp_mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + struct list_head cancel_list; + + INIT_LIST_HEAD(&cancel_list); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &mad_agent_priv->send_list, agent_list) { + if (mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + } + + /* Empty wait list to prevent receives from finding a request */ + list_splice_init(&mad_agent_priv->wait_list, &cancel_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Report all cancelled requests */ + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &cancel_list, agent_list) { + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + list_del(&mad_send_wr->agent_list); + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + } +} + +static struct ib_mad_send_wr_private* +find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, + u64 wr_id) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + return NULL; +} + +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + mad_agent_priv = 
container_of(mad_agent, struct ib_mad_agent_private, + agent); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_by_wr_id(mad_agent_priv, wr_id); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + if (mad_send_wr->status == IB_WC_SUCCESS) + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + + if (mad_send_wr->refcount != 0) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + +out: + return; +} +EXPORT_SYMBOL(ib_cancel_mad); + +static void timeout_sends(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags, delay; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + mad_send_wc.status = IB_WC_RESP_TIMEOUT_ERR; + mad_send_wc.vendor_err = 0; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->wait_list)) { + mad_send_wr = list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_send_wr->timeout, jiffies)) { + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->timed_work, delay); + break; + } + + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + +static void ib_mad_thread_completion_handler(struct ib_cq *cq) +{ + struct ib_mad_port_private *port_priv = cq->cq_context; + + queue_work(port_priv->wq, &port_priv->work); +} + +/* + * Allocate receive MADs and post receive WRs for them + */ +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) +{ + unsigned long flags; + int post, ret; + struct ib_mad_private *mad_priv; + struct ib_sge sg_list; + struct ib_recv_wr recv_wr, *bad_recv_wr; + struct ib_mad_queue *recv_queue = &qp_info->recv_queue; + + /* Initialize common scatter list fields */ + sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; + sg_list.lkey = (*qp_info->port_priv->mr).lkey; + + /* Initialize common receive WR fields */ + recv_wr.next = NULL; + recv_wr.sg_list = &sg_list; + recv_wr.num_sge = 1; + recv_wr.recv_flags = IB_RECV_SIGNALED; + + do { + /* Allocate and map receive buffer */ + if (mad) { + mad_priv = mad; + mad = NULL; + } else { + mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + ret = -ENOMEM; + break; + } + } + sg_list.addr = dma_map_single(qp_info->port_priv-> + device->dma_device, + &mad_priv->grh, + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&mad_priv->header, 
mapping, sg_list.addr); + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; + mad_priv->header.mad_list.mad_queue = recv_queue; + + /* Post receive WR */ + spin_lock_irqsave(&recv_queue->lock, flags); + post = (++recv_queue->count < recv_queue->max_active); + list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list); + spin_unlock_irqrestore(&recv_queue->lock, flags); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); + if (ret) { + spin_lock_irqsave(&recv_queue->lock, flags); + list_del(&mad_priv->header.mad_list.list); + recv_queue->count--; + spin_unlock_irqrestore(&recv_queue->lock, flags); + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&mad_priv->header, + mapping), + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, mad_priv); + printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret); + break; + } + } while (post); + + return ret; +} + +/* + * Return all the posted receive MADs + */ +static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv; + struct ib_mad_list_head *mad_list; + + while (!list_empty(&qp_info->recv_queue.list)) { + + mad_list = list_entry(qp_info->recv_queue.list.next, + struct ib_mad_list_head, list); + mad_priv_hdr = container_of(mad_list, + struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, + header); + + /* Remove from posted receive MAD list */ + list_del(&mad_list->list); + + /* Undo PCI mapping */ + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, recv); + } + + qp_info->recv_queue.count = 0; +} + +/* + * Start the port + */ +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) +{ + int ret, i; + struct ib_qp_attr *attr; + struct ib_qp *qp; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); + return -ENOMEM; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 
0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "INIT: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTR: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTS: %d\n", i, ret); + goto out; + } + } + + ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR PFX "Failed to request completion " + "notification: %d\n", ret); + goto out; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); + if (ret) { + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); + goto out; + } + } +out: + kfree(attr); + return ret; +} + +static void qp_event_handler(struct ib_event *event, void *qp_context) +{ + struct ib_mad_qp_info *qp_info = qp_context; + + /* It's worse than that! He's dead, Jim! */ + printk(KERN_ERR PFX "Fatal error (%d) on MAD QP (%d)\n", + event->event, qp_info->qp->qp_num); +} + +static void init_mad_queue(struct ib_mad_qp_info *qp_info, + struct ib_mad_queue *mad_queue) +{ + mad_queue->qp_info = qp_info; + mad_queue->count = 0; + spin_lock_init(&mad_queue->lock); + INIT_LIST_HEAD(&mad_queue->list); +} + +static void init_mad_qp(struct ib_mad_port_private *port_priv, + struct ib_mad_qp_info *qp_info) +{ + qp_info->port_priv = port_priv; + init_mad_queue(qp_info, &qp_info->send_queue); + init_mad_queue(qp_info, &qp_info->recv_queue); + INIT_LIST_HEAD(&qp_info->overflow_list); + spin_lock_init(&qp_info->snoop_lock); + qp_info->snoop_table = NULL; + qp_info->snoop_table_size = 0; + atomic_set(&qp_info->snoop_count, 0); +} + +static int create_mad_qp(struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; + + memset(&qp_init_attr, 0, sizeof qp_init_attr); + qp_init_attr.send_cq = qp_info->port_priv->cq; + qp_init_attr.recv_cq = qp_info->port_priv->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = qp_info->port_priv->port_num; + qp_init_attr.qp_context = qp_info; + qp_init_attr.event_handler = qp_event_handler; + qp_info->qp = ib_create_qp(qp_info->port_priv->pd, &qp_init_attr); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", + get_spl_qp_index(qp_type)); + ret = PTR_ERR(qp_info->qp); + goto error; + } + /* Use minimum queue sizes unless the CQ is resized */ + qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; + qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; + return 0; + +error: + return ret; +} + +static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) +{ + ib_destroy_qp(qp_info->qp); + if (qp_info->snoop_table) + kfree(qp_info->snoop_table); +} + +/* + * Open the port + * Create the QP, PD, MR, and CQ if needed + */ +static int ib_mad_port_open(struct ib_device *device, + int port_num) +{ + int ret, cq_size; + struct ib_mad_port_private *port_priv; + 
unsigned long flags; + char name[sizeof "ib_mad123"]; + + /* First, check if port already open at MAD layer */ + port_priv = ib_get_mad_port(device, port_num); + if (port_priv) { + printk(KERN_DEBUG PFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR PFX "No memory for ib_mad_port_private\n"); + return -ENOMEM; + } + memset(port_priv, 0, sizeof *port_priv); + port_priv->device = device; + port_priv->port_num = port_num; + spin_lock_init(&port_priv->reg_lock); + INIT_LIST_HEAD(&port_priv->agent_list); + init_mad_qp(port_priv, &port_priv->qp_info[0]); + init_mad_qp(port_priv, &port_priv->qp_info[1]); + + cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + port_priv->cq = ib_create_cq(port_priv->device, + (ib_comp_handler) + ib_mad_thread_completion_handler, + NULL, port_priv, cq_size); + if (IS_ERR(port_priv->cq)) { + printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n"); + ret = PTR_ERR(port_priv->cq); + goto error3; + } + + port_priv->pd = ib_alloc_pd(device); + if (IS_ERR(port_priv->pd)) { + printk(KERN_ERR PFX "Couldn't create ib_mad PD\n"); + ret = PTR_ERR(port_priv->pd); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR PFX "Couldn't get ib_mad DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI); + if (ret) + goto error7; + + snprintf(name, sizeof name, "ib_mad%d", port_num); + port_priv->wq = create_singlethread_workqueue(name); + if (!port_priv->wq) { + ret = -ENOMEM; + goto error8; + } + INIT_WORK(&port_priv->work, ib_mad_completion_handler, port_priv); + + ret = ib_mad_port_start(port_priv); + if (ret) { + printk(KERN_ERR PFX "Couldn't start port\n"); + goto error9; + } + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_mad_port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + return 0; + +error9: + destroy_workqueue(port_priv->wq); +error8: + destroy_mad_qp(&port_priv->qp_info[1]); +error7: + destroy_mad_qp(&port_priv->qp_info[0]); +error6: + ib_dereg_mr(port_priv->mr); +error5: + ib_dealloc_pd(port_priv->pd); +error4: + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); +error3: + kfree(port_priv); + + return ret; +} + +/* + * Close the port + * If there are no classes using the port, free the port + * resources (CQ, MR, PD, QP) and remove the port's info structure + */ +static int ib_mad_port_close(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + port_priv = __ib_get_mad_port(device, port_num); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + printk(KERN_ERR PFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + /* Stop processing completions. 
*/ + flush_workqueue(port_priv->wq); + destroy_workqueue(port_priv->wq); + destroy_mad_qp(&port_priv->qp_info[1]); + destroy_mad_qp(&port_priv->qp_info[0]); + ib_dereg_mr(port_priv->mr); + ib_dealloc_pd(port_priv->pd); + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); + /* XXX: Handle deallocation of MAD registration tables */ + + kfree(port_priv); + + return 0; +} + +static void ib_mad_init_device(struct ib_device *device) +{ + int ret, num_ports, cur_port, i, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret = ib_mad_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d\n", + device->name, cur_port); + goto error_device_open; + } + ret = ib_agent_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d " + "for agents\n", + device->name, cur_port); + goto error_device_open; + } + } + + goto error_device_query; + +error_device_open: + while (i > 0) { + cur_port--; + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + } + i--; + } + +error_device_query: + return; +} + +static void ib_mad_remove_device(struct ib_device *device) +{ + int ret = 0, i, num_ports, cur_port, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + } +} + +static struct ib_client mad_client = { + .name = "mad", + .add = ib_mad_init_device, + .remove = ib_mad_remove_device +}; + +static int __init ib_mad_init_module(void) +{ + int ret; + + spin_lock_init(&ib_mad_port_list_lock); + spin_lock_init(&ib_agent_port_list_lock); + + ib_mad_cache = kmem_cache_create("ib_mad", + sizeof(struct ib_mad_private), + 0, + SLAB_HWCACHE_ALIGN, + NULL, + NULL); + if (!ib_mad_cache) { + printk(KERN_ERR PFX "Couldn't create ib_mad cache\n"); + ret = -ENOMEM; + goto error1; + } + + INIT_LIST_HEAD(&ib_mad_port_list); + + if (ib_register_client(&mad_client)) { + printk(KERN_ERR PFX "Couldn't register ib_mad client\n"); + ret = -EINVAL; + goto error2; + } + + return 0; + +error2: + kmem_cache_destroy(ib_mad_cache); +error1: + return ret; +} + +static void __exit ib_mad_cleanup_module(void) +{ + ib_unregister_client(&mad_client); + + if (kmem_cache_destroy(ib_mad_cache)) { + printk(KERN_DEBUG PFX "Failed to destroy ib_mad cache\n"); + } +} + +module_init(ib_mad_init_module); +module_exit(ib_mad_cleanup_module); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad_priv.h 2004-12-13 09:44:43.123372531 -0800 @@ -0,0 +1,204 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. 
+ * Maintained by: vtrmaint1 at voltaire.com + * + * This program is intended for the purpose of the Infiniband + * protocol stack for Linux Servers. + * + * This software program is free software and you are free to modify + * and/or redistribute it under a choice of one of the following two + * licenses: + * + * 1) under either the GNU General Public License (GPL) Version 2, June 1991, + * a copy of which is in the file LICENSE_GPL_V2.txt in the root directory. + * This GPL license is also available from the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA, or on the + * web at http://www.fsf.org/copyleft/gpl.html + * + * OR + * + * 2) under the terms of "The BSD License", a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, on the web at + * http://www.opensource.org/licenses/bsd-license.php. + * + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * + * + * To obtain a copy of these licenses, the source code to this software or + * for other questions, you may write to Voltaire, Inc., + * Attention: Voltaire openSource maintainer, + * Voltaire, Inc. 54 Middlesex Turnpike Bedford, MA 01730 or + * by Email: vtrmaint1 at voltaire.com + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice and either one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ + +#ifndef __IB_MAD_PRIV_H__ +#define __IB_MAD_PRIV_H__ + +#include +#include +#include +#include +#include + + +#define PFX "ib_mad: " + +#define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ + +/* QP and CQ parameters */ +#define IB_MAD_QP_SEND_SIZE 128 +#define IB_MAD_QP_RECV_SIZE 512 +#define IB_MAD_SEND_REQ_MAX_SG 2 +#define IB_MAD_RECV_REQ_MAX_SG 1 + +#define IB_MAD_SEND_Q_PSN 0 + +/* Registration table sizes */ +#define MAX_MGMT_CLASS 80 +#define MAX_MGMT_VERSION 8 +#define MAX_MGMT_OUI 8 +#define MAX_MGMT_VENDOR_RANGE2 IB_MGMT_CLASS_VENDOR_RANGE2_END - \ + IB_MGMT_CLASS_VENDOR_RANGE2_START + 1 + +struct ib_mad_list_head { + struct list_head list; + struct ib_mad_queue *mad_queue; +}; + +struct ib_mad_private_header { + struct ib_mad_list_head mad_list; + struct ib_mad_recv_wc recv_wc; + DECLARE_PCI_UNMAP_ADDR(mapping) +} __attribute__ ((packed)); + +struct ib_mad_private { + struct ib_mad_private_header header; + struct ib_grh grh; + union { + struct ib_mad mad; + struct ib_rmpp_mad rmpp_mad; + struct ib_smp smp; + } mad; +} __attribute__ ((packed)); + +struct ib_mad_agent_private { + struct list_head agent_list; + struct ib_mad_agent agent; + struct ib_mad_reg_req *reg_req; + struct ib_mad_qp_info *qp_info; + + spinlock_t lock; + struct list_head send_list; + struct list_head wait_list; + struct work_struct timed_work; + unsigned long timeout; + + atomic_t refcount; + wait_queue_head_t wait; + u8 rmpp_version; +}; + +struct ib_mad_snoop_private { + struct ib_mad_agent agent; + struct ib_mad_qp_info *qp_info; + int snoop_index; + int mad_snoop_flags; + atomic_t refcount; + wait_queue_head_t wait; +}; + +struct ib_mad_send_wr_private { + struct ib_mad_list_head mad_list; + struct list_head agent_list; + struct ib_mad_agent *agent; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; + unsigned long timeout; + int retry; + int refcount; + enum ib_wc_status status; +}; + +struct ib_mad_mgmt_method_table { + struct ib_mad_agent_private *agent[IB_MGMT_MAX_METHODS]; +}; + +struct ib_mad_mgmt_class_table { + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; +}; + +struct ib_mad_mgmt_vendor_class { + u8 oui[MAX_MGMT_OUI][3]; + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_OUI]; +}; + +struct ib_mad_mgmt_vendor_class_table { + struct ib_mad_mgmt_vendor_class *vendor_class[MAX_MGMT_VENDOR_RANGE2]; +}; + +struct ib_mad_mgmt_version_table { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; +}; + +struct ib_mad_queue { + spinlock_t lock; + struct list_head list; + int count; + int max_active; + struct ib_mad_qp_info *qp_info; +}; + +struct ib_mad_qp_info { + struct ib_mad_port_private *port_priv; + struct ib_qp *qp; + struct ib_mad_queue send_queue; + struct ib_mad_queue recv_queue; + struct list_head overflow_list; + spinlock_t snoop_lock; + struct ib_mad_snoop_private **snoop_table; + int snoop_table_size; + atomic_t snoop_count; +}; + +struct ib_mad_port_private { + struct list_head port_list; + struct ib_device *device; + int port_num; + struct ib_cq *cq; + struct ib_pd *pd; + struct ib_mr *mr; + + spinlock_t reg_lock; + struct ib_mad_mgmt_version_table version[MAX_MGMT_VERSION]; + struct list_head agent_list; + struct workqueue_struct *wq; + struct work_struct work; + struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; +}; + +#endif /* __IB_MAD_PRIV_H__ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.c 
2004-12-13 09:44:43.044384167 -0800 @@ -0,0 +1,222 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. All rights reserved. +*/ + +#include + + +/* + * Fixup a directed route SMP for sending + * Return 0 if the SMP should be discarded + */ +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 */ + if (hop_cnt && hop_ptr == 0) { + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:2 */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->return_path set when received */ + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + /* smp->return_path set when received */ + smp->hop_ptr++; + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- Fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) { + smp->hop_ptr--; + /* C14-13:3 -- SMPs destined for SM shouldn't be here */ + return (node_type == IB_NODE_SWITCH || + smp->dr_slid == IB_LID_PERMISSIVE); + } + + /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ + if (hop_ptr == 0) + return 1; + + /* C14-13:5 -- Check for unreasonable hop pointer */ + return 0; + } +} + +/* + * Adjust information for a received SMP + * Return 0 if the SMP should be dropped + */ +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 -- sender should have incremented hop_ptr */ + if (hop_cnt && hop_ptr == 0) + return 0; + + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && 
hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + if (hop_cnt) + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->hop_ptr updated when sending */ + return (smp->return_path[hop_ptr-1] <= phys_port_cnt); + } + + /* C14-13:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == 1) { + if (smp->dr_slid == IB_LID_PERMISSIVE) { + /* giving SMP to SM - update hop_ptr */ + smp->hop_ptr--; + return 1; + } + /* smp->hop_ptr updated when sending */ + return (node_type == IB_NODE_SWITCH); + } + + /* C14-13:4 -- hop_ptr = 0 -> give to SM */ + /* C14-13:5 -- Check for unreasonable hop pointer */ + return (hop_ptr == 0); + } +} + +/* + * Return 1 if the received DR SMP should be forwarded to the send queue + * Return 0 if the SMP should be completed up the stack + */ +int smi_check_forward_dr_smp(struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + if (!ib_get_smp_direction(smp)) { + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) + return 1; + + /* C14-9:3 -- at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) + return (smp->dr_dlid == IB_LID_PERMISSIVE); + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + if (hop_ptr == hop_cnt + 1) + return 1; + } else { + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) + return 1; + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) + return (smp->dr_slid != IB_LID_PERMISSIVE); + } + return 0; +} + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.h 2004-12-13 09:44:43.148368849 -0800 @@ -0,0 +1,54 @@ +/* + This software is available to you under a choice of one of two + licenses. You may choose to be licensed under the terms of the GNU + General Public License (GPL) Version 2, available at + , or the OpenIB.org BSD + license, available in the LICENSE.TXT file accompanying this + software. These details are also available at + . + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + + Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + Copyright (c) 2004 Infinicon Corporation. All rights reserved. + Copyright (c) 2004 Intel Corporation. All rights reserved. + Copyright (c) 2004 Topspin Corporation. All rights reserved. + Copyright (c) 2004 Voltaire Corporation. 
All rights reserved. +*/ + +#ifndef __SMI_H_ +#define __SMI_H_ + +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); + +/* + * Return 1 if the SMP should be handled by the local SMA/SM via process_mad + */ +static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp) +{ + /* C14-9:3 -- We're at the end of the DR segment of path */ + /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ + return ((mad_agent->device->process_mad && + !ib_get_smp_direction(smp) && + (smp->hop_ptr == smp->hop_cnt + 1))); +} + +#endif /* __SMI_H_ */ From roland at topspin.com Mon Dec 13 10:09:29 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:29 -0800 Subject: [openib-general] [PATCH][v3][7/21] Add Mellanox HCA low-level driver In-Reply-To: <20041213109.cVS0twN822l4xQbR@topspin.com> Message-ID: <20041213109.PeK9O1dDWBb3rThl@topspin.com> Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) are present. (As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API.) Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-13 09:44:40.230798660 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-13 09:44:43.936252779 -0800 @@ -8,4 +8,6 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware. +source "drivers/infiniband/hw/mthca/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-13 09:44:40.278791590 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-13 09:44:43.909256756 -0800 @@ -1 +1,2 @@ obj-$(CONFIG_INFINIBAND) += core/ +obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Kconfig 2004-12-13 09:44:43.962248949 -0800 @@ -0,0 +1,26 @@ +config INFINIBAND_MTHCA + tristate "Mellanox HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for Mellanox InfiniHost host + channel adapters (HCAs), including the MT23108 PCI-X HCA + ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). + +config INFINIBAND_MTHCA_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option causes the mthca driver to produce a bunch of debug + messages. Select this if you are developing the driver or + trying to diagnose a problem. + +config INFINIBAND_MTHCA_SSE_DOORBELL + bool "SSE doorbell code" + depends on INFINIBAND_MTHCA && X86 && !X86_64 + default n + ---help--- + This option will have the mthca driver use SSE instructions + to ring hardware doorbell registers. This may improve + performance for some workloads, but the driver will not run + on processors without SSE instructions.
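For anyone trying the driver out, the options above boil down to a .config fragment along these lines (an illustration only; the option names are the ones defined in the Kconfig hunk above, and the SSE doorbell code is left off since it only applies to 32-bit x86 builds):

CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_MTHCA=m
CONFIG_INFINIBAND_MTHCA_DEBUG=y
# CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL is not set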
--- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Makefile 2004-12-13 09:44:43.990244825 -0800 @@ -0,0 +1,12 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o + +ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ + mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ + mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ + mthca_provider.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c 2004-12-13 09:44:44.017240848 -0800 @@ -0,0 +1,168 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_allocator.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" + +/* Trivial bitmap-based allocator */ +u32 mthca_alloc(struct mthca_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) { + alloc->top = (alloc->top + alloc->max) & alloc->mask; + obj = find_first_zero_bit(alloc->table, alloc->max); + } + + if (obj < alloc->max) { + set_bit(obj, alloc->table); + obj |= alloc->top; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void mthca_free(struct mthca_alloc *alloc, u32 obj) +{ + obj &= alloc->max - 1; + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + alloc->top = (alloc->top + alloc->max) & alloc->mask; + spin_unlock(&alloc->lock); +} + +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != 1 << (ffs(num) - 1)) + return -EINVAL; + + alloc->last = 0; + alloc->top = 0; + alloc->max = num; + alloc->mask = mask; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void mthca_alloc_cleanup(struct mthca_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. 
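+ * (For reference, the cq_table and qp_table structures in mthca_dev.h pair a spinlock with their mthca_array, which is the kind of serialization meant here.)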
+ */ + +void *mthca_array_get(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int mthca_array_set(struct mthca_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held. */ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h 2004-12-13 09:44:44.041237312 -0800 @@ -0,0 +1,44 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_config_reg.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#ifndef MTHCA_CONFIG_REG_H +#define MTHCA_CONFIG_REG_H + +#include + +#define MTHCA_HCR_BASE 0x80680 +#define MTHCA_HCR_SIZE 0x0001c +#define MTHCA_ECR_BASE 0x80700 +#define MTHCA_ECR_SIZE 0x00008 +#define MTHCA_ECR_CLR_BASE 0x80708 +#define MTHCA_ECR_CLR_SIZE 0x00008 +#define MTHCA_ECR_OFFSET (MTHCA_ECR_BASE - MTHCA_HCR_BASE) +#define MTHCA_ECR_CLR_OFFSET (MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE) +#define MTHCA_CLR_INT_BASE 0xf00d8 +#define MTHCA_CLR_INT_SIZE 0x00008 + +#define MTHCA_MAP_HCR_SIZE (MTHCA_ECR_CLR_BASE + \ + MTHCA_ECR_CLR_SIZE - \ + MTHCA_HCR_BASE) + +#endif /* MTHCA_CONFIG_REG_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-12-13 09:44:44.069233188 -0800 @@ -0,0 +1,380 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_dev.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int 
num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; + struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) 
\ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp 
*ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-12-13 09:44:44.095229358 -0800 @@ -0,0 +1,112 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_doorbell.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. + */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. 
*/ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-12-13 09:44:44.121225529 -0800 @@ -0,0 +1,922 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_main.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim) +{ + int err; + u8 status; + + err = mthca_QUERY_DEV_LIM(mdev, dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + if (dev_lim->min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + "kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim->min_page_sz, PAGE_SIZE); + return -ENODEV; + } + if (dev_lim->num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim->num_ports, MTHCA_MAX_PORTS); + return -ENODEV; + } + + mdev->limits.num_ports = dev_lim->num_ports; + mdev->limits.vl_cap = dev_lim->max_vl; + mdev->limits.mtu_cap = dev_lim->max_mtu; + mdev->limits.gid_table_len = dev_lim->max_gids; + mdev->limits.pkey_table_len = dev_lim->max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay; + mdev->limits.max_sg = dev_lim->max_sg; + mdev->limits.reserved_qps = dev_lim->reserved_qps; + mdev->limits.reserved_srqs = 
dev_lim->reserved_srqs; + mdev->limits.reserved_eecs = dev_lim->reserved_eecs; + mdev->limits.reserved_cqs = dev_lim->reserved_cqs; + mdev->limits.reserved_eqs = dev_lim->reserved_eqs; + mdev->limits.reserved_mtts = dev_lim->reserved_mtts; + mdev->limits.reserved_mrws = dev_lim->reserved_mrws; + mdev->limits.reserved_uars = dev_lim->reserved_uars; + mdev->limits.reserved_pds = dev_lim->reserved_pds; + + if (dev_lim->flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_dev_lim(mdev, &dev_lim); + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_ent, num_sg, fw_pages, cur_order; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + fw_pages = mdev->fw.arbel.fw_pages; + num_ent = 0; + /* + * We allocate in as big chunks as we can, up to a maximum of + * 256 KB per chunk. + */ + cur_order = get_order(1 << 18); + while (fw_pages > 0) { + while (1 << cur_order > fw_pages) + --cur_order; + + /* + * We allocate with GFP_HIGHUSER because only the + * firmware is going to touch these pages, so there's + * no need for a kernel virtual address. We use + * __GFP_NOWARN because we'll deal with any allocation + * failures ourselves. 
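+ * (A worked example, assuming 4 KB pages and no allocation failures: a 1000-page FW area would be carved into fifteen 64-page chunks, one 32-page chunk and one 8-page chunk -- 17 scatterlist entries in all.)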
+ */ + mdev->fw.arbel.mem[num_ent].page = alloc_pages(GFP_HIGHUSER | __GFP_NOWARN, + cur_order); + mdev->fw.arbel.mem[num_ent].length = PAGE_SIZE << cur_order; + if (!mdev->fw.arbel.mem[num_ent].page) { + --cur_order; + if (cur_order < 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } else { + ++num_ent; + fw_pages -= 1 << cur_order; + } + } + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, num_ent, + PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_pages(mdev->fw.arbel.mem[i].page, + get_order(mdev->fw.arbel.mem[i].length)); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + struct mthca_dev_lim dev_lim; + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + err = mthca_dev_lim(mdev, &dev_lim); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + err = -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev *mdev) +{ + if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + 
mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. 
+ */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar2: + pci_release_region(pdev, 2); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_pages(mdev->fw.arbel.mem[i].page, + get_order(mdev->fw.arbel.mem[i].length)); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. 
We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); From roland at topspin.com Mon Dec 13 10:09:30 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:30 -0800 Subject: [openib-general] [PATCH][v3][8/21] Add Mellanox HCA low-level driver (midlayer interface) In-Reply-To: <20041213109.PeK9O1dDWBb3rThl@topspin.com> Message-ID: <20041213109.tdJgFcyoYZx3znNi@topspin.com> Add midlayer interface code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c 2004-12-13 09:44:44.599155121 -0800 @@ -0,0 +1,622 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_provider.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* Temporary until we get core support straightened out */ +enum { + IB_SMP_ATTRIB_NODE_INFO = 0x0011, + IB_SMP_ATTRIB_GUID_INFO = 0x0014, + IB_SMP_ATTRIB_PORT_INFO = 0x0015, + IB_SMP_ATTRIB_PKEY_TABLE = 0x0016 +}; + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + 
in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 48, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + 
to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent < entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) | + MTHCA_MPT_FLAG_LOCAL_READ; +} + +static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mthca_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mthca_mr_alloc_notrans(to_mdev(pd->device), + to_mpd(pd)->pd_num, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct mthca_mr *mr; + u64 *page_list; + u64 total_size; + u64 mask; + int shift; + int npages; + int err; + int i, j, n; + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + mask = 0; + total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (buffer_list[i].addr & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + /* Find largest page shift we can use to cover buffers */ + for (shift = PAGE_SHIFT; shift < 31; ++shift) + if (num_phys_buf > 1) { + if ((1ULL << shift) & mask) + break; + } else { + if (1ULL << shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); + buffer_list[0].addr &= ~0ull << shift; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + npages = 0; + for (i = 0; i < num_phys_buf; ++i) + npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + + if (!npages) + return &mr->ibmr; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + ++j) + page_list[n++] = buffer_list[i].addr + ((u64) j << shift); + + mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) " + "in PD %x; shift %d, npages %d.\n", + (unsigned long long) buffer_list[0].addr, + (unsigned long long) *iova_start, + to_mpd(pd)->pd_num, + shift, npages); + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + *iova_start, total_size, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static int mthca_dereg_mr(struct ib_mr *mr) +{ + mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); + kfree(mr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { + 
case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + +int mthca_register_device(struct mthca_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.dma_device = &dev->pdev->dev; + dev->ib_dev.class_dev.dev = &dev->pdev->dev; + dev->ib_dev.query_device = mthca_query_device; + dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_port = mthca_modify_port; + dev->ib_dev.query_pkey = mthca_query_pkey; + dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_pd = mthca_alloc_pd; + dev->ib_dev.dealloc_pd = mthca_dealloc_pd; + dev->ib_dev.create_ah = mthca_ah_create; + dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.create_qp = mthca_create_qp; + dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.destroy_qp = mthca_destroy_qp; + dev->ib_dev.post_send = mthca_post_send; + dev->ib_dev.post_recv = mthca_post_receive; + dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.destroy_cq = mthca_destroy_cq; + dev->ib_dev.poll_cq = mthca_poll_cq; + dev->ib_dev.req_notify_cq = mthca_req_notify_cq; + dev->ib_dev.get_dma_mr = mthca_get_dma_mr; + dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.dereg_mr = mthca_dereg_mr; + dev->ib_dev.attach_mcast = mthca_multicast_attach; + dev->ib_dev.detach_mcast = mthca_multicast_detach; + dev->ib_dev.process_mad = mthca_process_mad; + + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; +} + +void mthca_unregister_device(struct mthca_dev *dev) +{ + ib_unregister_device(&dev->ib_dev); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h 2004-12-13 09:44:44.636149671 -0800 @@ -0,0 +1,214 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_provider.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#ifndef MTHCA_PROVIDER_H +#define MTHCA_PROVIDER_H + +#include +#include + +#define MTHCA_MPT_FLAG_ATOMIC (1 << 14) +#define MTHCA_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define MTHCA_MPT_FLAG_REMOTE_READ (1 << 12) +#define MTHCA_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define MTHCA_MPT_FLAG_LOCAL_READ (1 << 10) + +struct mthca_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct mthca_mr { + struct ib_mr ibmr; + int order; + u32 first_seg; +}; + +struct mthca_pd { + struct ib_pd ibpd; + u32 pd_num; + atomic_t sqp_count; + struct mthca_mr ntmr; +}; + +struct mthca_eq { + struct mthca_dev *dev; + int eqn; + u32 ecr_mask; + u16 msi_x_vector; + u16 msi_x_entry; + int have_irq; + int nent; + int cons_index; + struct mthca_buf_list *page_list; + struct mthca_mr mr; +}; + +struct mthca_av; + +struct mthca_ah { + struct ib_ah ibah; + int on_hca; + u32 key; + struct mthca_av *av; + dma_addr_t avdma; +}; + +/* + * Quick description of our CQ/QP locking scheme: + * + * We have one global lock that protects dev->cq/qp_table. Each + * struct mthca_cq/qp also has its own lock. An individual qp lock + * may be taken inside of an individual cq lock. Both cqs attached to + * a qp may be locked, with the send cq locked first. No other + * nesting should be done. + * + * Each struct mthca_cq/qp also has an atomic_t ref count. The + * pointer from the cq/qp_table to the struct counts as one reference. + * This reference also is good for access through the consumer API, so + * modifying the CQ/QP etc doesn't need to take another reference. + * Access because of a completion being polled does need a reference. + * + * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the + * destroy function to sleep on. + * + * This means that access from the consumer API requires nothing but + * taking the struct's lock. + * + * Access because of a completion event should go as follows: + * - lock cq/qp_table and look up struct + * - increment ref count in struct + * - drop cq/qp_table lock + * - lock struct, do your thing, and unlock struct + * - decrement ref count; if zero, wake up waiters + * + * To destroy a CQ/QP, we can do the following: + * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock + * - decrement ref count + * - wait_event until ref count is zero + * + * It is the consumer's responsibilty to make sure that no QP + * operations (WQE posting or state modification) are pending when the + * QP is destroyed. Also, the consumer must make sure that calls to + * qp_modify are serialized. 
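As an illustration of the access/destroy protocol spelled out above (not part of the patch: lookup_cq(), remove_cq() and the table type are hypothetical stand-ins for the driver's CQ table code, and error handling is omitted), a minimal sketch:

/* Minimal sketch of the completion-path / destroy-path protocol described
 * in the comment above.  Illustration only; lookup_cq()/remove_cq() and
 * struct mthca_cq_table are made-up stand-ins. */
static struct mthca_cq *cq_get_for_event(struct mthca_cq_table *table, int cqn)
{
        struct mthca_cq *cq;

        spin_lock(&table->lock);                /* lock cq_table and look up struct   */
        cq = lookup_cq(table, cqn);             /* hypothetical lookup helper          */
        if (cq)
                atomic_inc(&cq->refcount);      /* take a reference before unlocking   */
        spin_unlock(&table->lock);              /* drop cq_table lock                  */
        return cq;
}

static void cq_put(struct mthca_cq *cq)
{
        if (atomic_dec_and_test(&cq->refcount)) /* last user wakes a waiting destroyer */
                wake_up(&cq->wait);
}

static void cq_destroy(struct mthca_cq_table *table, struct mthca_cq *cq)
{
        spin_lock(&table->lock);
        remove_cq(table, cq);                   /* unhook the table's pointer           */
        spin_unlock(&table->lock);

        atomic_dec(&cq->refcount);              /* drop the reference that pointer held */
        wait_event(cq->wait, !atomic_read(&cq->refcount)); /* wait for in-flight users  */
}

The same pattern applies to QPs; the only difference is which table and wait queue are involved.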
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ From roland at topspin.com Mon Dec 13 10:09:32 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:32 -0800 Subject: [openib-general] [PATCH][v3][9/21] Add Mellanox HCA low-level driver (FW commands) In-Reply-To: <20041213109.tdJgFcyoYZx3znNi@topspin.com> Message-ID: <20041213109.NiBwdaLIPMmwHwiP@topspin.com> Add firmware command processing code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-12-13 09:44:45.011094434 -0800 @@ -0,0 +1,1562 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cmd.c 1321 2004-12-10 19:38:54Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now. 
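For reference, the "strict" values disabled just below appear to be the PRM's roughly 1 ms / 10 ms / 100 ms command classes converted to whole jiffies, rounded up with one extra jiffy of slack so a command started mid-jiffy still gets its full wait. A quick worked example (the HZ value is made up for illustration, not taken from the patch):

/* Illustrative arithmetic for the disabled strict time classes, assuming HZ = 250 (4 ms/jiffy): */
/*   CMD_TIME_CLASS_A = (250 + 999) / 1000 + 1 =  2 jiffies  (  8 ms for the  ~1 ms class)       */
/*   CMD_TIME_CLASS_B = (250 +  99) /  100 + 1 =  4 jiffies  ( 16 ms for the ~10 ms class)       */
/*   CMD_TIME_CLASS_C = (250 +   9) /   10 + 1 = 26 jiffies  (104 ms for the ~100 ms class)      */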
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* __raw_writel may not order writes. */ + wmb(); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = be32_to_cpu(__raw_readl(dev->hcr + HCR_STATUS_OFFSET)) >> 24; + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. 
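The alignment trick described in the comment above can be seen in isolation with a tiny standalone example (not part of the patch; the address and length are made up, and userspace ffs() stands in for the kernel's):

#include <stdio.h>
#include <strings.h>                    /* userspace ffs(); the kernel provides its own */

/* Illustration of the MAP_FA chunk-size computation used in the loop below. */
int main(void)
{
        unsigned int addr = 0x12340000; /* pretend sg_dma_address()            */
        unsigned int len  = 0x6000;     /* pretend sg_dma_len(): 24 KB          */
        int lg = ffs(addr | len) - 1;   /* lowest set bit of address | length   */

        /* lg == 13 here, so this entry would be passed to FW as three
         * naturally aligned 8 KB (1 << 13) chunks; lg < 12 would be
         * rejected because the FW area must be at least 4 KB aligned. */
        printf("log2 chunk size = %d, chunks = %u\n", lg, len >> lg);
        return 0;
}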
+ */ + lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1; + if (lg < 12) { + mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n", + (unsigned long long) sg_dma_address(sglist + i), + sg_dma_len(sglist + i)); + err = -EINVAL; + goto out; + } + for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j, ++nent) { + *((__be64 *) (inbox + nent * 4 + 2)) = + cpu_to_be64((sg_dma_address(sglist + i) + + (j << lg)) | + (lg - 12)); + ts += 1 << (lg - 10); + if (nent == PAGE_SIZE / 16) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + if (err || *status) + goto out; + nent = 0; + } + } + } + + if (nent) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + } + + mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts); + +out: + pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma); + return err; +} + +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status); +} + +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status); +} + +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err = 0; + u8 lg; + +#define QUERY_FW_OUT_SIZE 0x100 +#define QUERY_FW_VER_OFFSET 0x00 +#define QUERY_FW_MAX_CMD_OFFSET 0x0f +#define QUERY_FW_ERR_START_OFFSET 0x30 +#define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_START_OFFSET 0x20 +#define QUERY_FW_END_OFFSET 0x28 + +#define QUERY_FW_SIZE_OFFSET 0x00 +#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20 +#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40 +#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma); + if (!outbox) { + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET); + /* + * FW subminor version is at more signifant bits than minor + * version, so swap here. + */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + /* + * Arbel page size is always 4 KB; round up number of + * system pages needed. 
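The rounding in the next statement is easier to read with numbers plugged in (the counts below are made up for illustration): the firmware reports its size in 4 KB units, so on systems whose pages are larger than 4 KB the count has to be rounded up to whole system pages.

/* Illustration, assuming PAGE_SHIFT = 14 (16 KB pages) and FW reporting 1030 x 4 KB pages: */
/*   fw_pages = (1030 + (1 << (14 - 12)) - 1) >> (14 - 12)                                  */
/*            = (1030 + 3) >> 2 = 258 system pages  (covers 1032 x 4 KB of firmware area)   */
/* With 4 KB system pages (PAGE_SHIFT = 12) the expression leaves fw_pages unchanged.       */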
+ */ + dev->fw.arbel.fw_pages = + (dev->fw.arbel.fw_pages + (1 << (PAGE_SHIFT - 12)) - 1) >> + (PAGE_SHIFT - 12); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_RSZ_SRQ_OFFSET 0x33 +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_SG_RQ_OFFSET 0x55 +#define QUERY_DEV_LIM_MAX_DESC_SZ_RQ_OFFSET 0x56 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e +#define QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET 0x90 +#define QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET 0x92 +#define QUERY_DEV_LIM_PBL_SZ_OFFSET 0x96 +#define QUERY_DEV_LIM_BMME_FLAGS_OFFSET 0x97 +#define QUERY_DEV_LIM_RSVD_LKEY_OFFSET 0x98 +#define QUERY_DEV_LIM_LAMR_OFFSET 0x9f +#define QUERY_DEV_LIM_MAX_ICM_SZ_OFFSET 0xa0 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, 
outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + 
MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSZ_SRQ_OFFSET); + dev_lim->hca.arbel.resize_srq = field & 1; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET); + dev_lim->hca.arbel.mtt_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET); + dev_lim->hca.arbel.mpt_entry_sz = size; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET); + dev_lim->hca.arbel.max_pbl_sz = 1 << (field & 0x3f); + MTHCA_GET(dev_lim->hca.arbel.bmme_flags, outbox, + QUERY_DEV_LIM_BMME_FLAGS_OFFSET); + MTHCA_GET(dev_lim->hca.arbel.reserved_lkey, outbox, + QUERY_DEV_LIM_RSVD_LKEY_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_LAMR_OFFSET); + dev_lim->hca.arbel.lam_required = field & 1; + MTHCA_GET(dev_lim->hca.arbel.max_icm_sz, outbox, + QUERY_DEV_LIM_MAX_ICM_SZ_OFFSET); + + if (dev_lim->hca.arbel.bmme_flags & 1) + mthca_dbg(dev, "Base MM extensions: yes " + "(flags %d, max PBL %d, rsvd L_Key %08x)\n", + dev_lim->hca.arbel.bmme_flags, + dev_lim->hca.arbel.max_pbl_sz, + dev_lim->hca.arbel.reserved_lkey); + else + mthca_dbg(dev, "Base MM extensions: no\n"); + + mthca_dbg(dev, "Max ICM size %lld MB\n", + (unsigned long long) dev_lim->hca.arbel.max_icm_sz >> 20); + } else { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->hca.tavor.max_avs = 1 << (field & 0x3f); + } + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 
+#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + 
MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; + flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; + MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); + + MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET); + MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET); + MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET); + MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET); + MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET); + + err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, + CMD_TIME_CLASS_A, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) +{ + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); +} + +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) +{ + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); +} + +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + if (mpt_entry) { + outdma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry, + CMD_HW2SW_MPT, + CMD_TIME_CLASS_B, status); + + if (mpt_entry) + pci_unmap_single(dev->pdev, outdma, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mtt_entry, + (num_mtt + 2) * 8, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + (num_mtt + 2) * 8, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status) +{ + mthca_dbg(dev, "%s mask %016llx for eqn %d\n", + unmap ? 
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE); + return err; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h 2004-12-13 09:44:45.036090752 -0800 @@ -0,0 +1,265 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cmd.h 1321 2004-12-10 19:38:54Z roland $ + */ + +#ifndef MTHCA_CMD_H +#define MTHCA_CMD_H + +#include + +#define MTHCA_CMD_MAILBOX_ALIGN 16UL +#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1) + +enum { + /* command completed successfully: */ + MTHCA_CMD_STAT_OK = 0x00, + /* Internal error (such as a bus error) occurred while processing command: */ + MTHCA_CMD_STAT_INTERNAL_ERR = 0x01, + /* Operation/command not supported or opcode modifier not supported: */ + MTHCA_CMD_STAT_BAD_OP = 0x02, + /* Parameter not supported or parameter out of range: */ + MTHCA_CMD_STAT_BAD_PARAM = 0x03, + /* System not enabled or bad system state: */ + MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04, + /* Attempt to access reserved or unallocaterd resource: */ + MTHCA_CMD_STAT_BAD_RESOURCE = 0x05, + /* Requested resource is currently executing a command, or is otherwise busy: */ + MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06, + /* memory error: */ + MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07, + /* Required capability exceeds device limits: */ + MTHCA_CMD_STAT_EXCEED_LIM = 0x08, + /* Resource is not in the appropriate state or ownership: */ + MTHCA_CMD_STAT_BAD_RES_STATE = 0x09, + /* Index out of range: */ + MTHCA_CMD_STAT_BAD_INDEX = 0x0a, + /* FW image corrupted: */ + MTHCA_CMD_STAT_BAD_NVMEM = 0x0b, + /* Attempt to modify a QP/EE which is not in the presumed state: */ + MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, + /* Bad segment parameters (Address/Size): */ + MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20, + /* Memory Region has Memory Windows bound to: */ + MTHCA_CMD_STAT_REG_BOUND = 0x21, + /* HCA local attached memory not present: */ + MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22, + /* Bad management packet (silently discarded): */ + MTHCA_CMD_STAT_BAD_PKT = 0x30, + /* More outstanding CQEs in CQ than new CQ size: */ + MTHCA_CMD_STAT_BAD_SIZE = 0x40 +}; + +enum { + MTHCA_TRANS_INVALID = 0, + MTHCA_TRANS_RST2INIT, + MTHCA_TRANS_INIT2INIT, + MTHCA_TRANS_INIT2RTR, + MTHCA_TRANS_RTR2RTS, + MTHCA_TRANS_RTS2RTS, + MTHCA_TRANS_SQERR2RTS, + MTHCA_TRANS_ANY2ERR, + MTHCA_TRANS_RTS2SQD, + MTHCA_TRANS_SQD2SQD, + MTHCA_TRANS_SQD2RTS, + MTHCA_TRANS_ANY2RST, +}; + +enum { + DEV_LIM_FLAG_SRQ = 1 << 6 +}; + +struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int 
max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; + union { + struct { + int max_avs; + } tavor; + struct { + int resize_srq; + int mtt_entry_sz; + int mpt_entry_sz; + int max_pbl_sz; + u8 bmme_flags; + u32 reserved_lkey; + int lam_required; + u64 max_icm_sz; + } arbel; + } hca; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, u16 token, + u8 status, u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int 
mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status); +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status); +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status); +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status); +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status); + +#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) (x), MTHCA_CMD_MAILBOX_ALIGN)) + +#endif /* MTHCA_CMD_H */ From roland at topspin.com Mon Dec 13 10:09:34 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:34 -0800 Subject: [openib-general] [PATCH][v3][10/21] Add Mellanox HCA low-level driver (EQ) In-Reply-To: <20041213109.NiBwdaLIPMmwHwiP@topspin.com> Message-ID: <20041213109.SaVuYVABecenGuRB@topspin.com> Add event queue code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-12-13 09:44:45.329047594 -0800 @@ -0,0 +1,663 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_eq.c 1321 2004-12-10 19:38:54Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. 
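[Editorial note -- not part of the original patch. The comment above refers to the __attribute__((packed)) on the structure that follows: on ABIs where a u64 wants 8-byte alignment, the compiler would otherwise insert four bytes of padding before the 64-bit 'start' field, shifting every later field and breaking the layout the hardware expects. A minimal sketch of the difference, assuming a typical 64-bit ABI such as x86_64 and using hypothetical struct names:

    #include <stdint.h>

    struct eqc_padded { uint32_t flags; uint64_t start; };
    /* sizeof == 16, 'start' at offset 8 */

    struct eqc_packed { uint32_t flags; uint64_t start; } __attribute__((packed));
    /* sizeof == 12, 'start' at offset 4, matching the hardware layout */
]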
+ */ +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? 
+ MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + + while (next_eqe_sw(eq)) { + int set_ci = 0; + eqe = get_eqe(eq, eq->cons_index); + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + /* cmd_event() may add more commands. + * The card will think the queue has overflowed if + * we don't tell it we've been processing events. 
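[Editorial note -- not part of the original patch. The command-event EQ has only MTHCA_NUM_CMD_EQE (0x80) entries, and completing a firmware command can wake a waiter that immediately posts another one, so many command EQEs can arrive while this loop is still running; pushing the consumer index out with set_eq_ci() for this event type keeps the hardware's producer index from catching up with a stale consumer index and declaring the EQ overflowed.]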
+ */ + set_ci = 1; + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + case MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on eqn %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + + if (set_ci) { + wmb(); /* see comment below */ + set_eq_ci(dev, eq->eqn, eq->cons_index); + set_ci = 0; + } + } + + /* + * This barrier makes sure that all updates to + * ownership bits done by set_eqe_hw() hit memory + * before the consumer index is updated. set_eq_ci() + * allows the HCA to possibly write more EQ entries, + * and we want to avoid the exceedingly unlikely + * possibility of the HCA writing an entry and then + * having set_eqe_hw() overwrite the owner field. + */ + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + /* Make sure EQ size is aligned to a power of 2 size. 
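[Editorial note -- not part of the original patch. The loop that follows rounds nent up to the next power of two, so e.g. a request for 100 entries yields a 128-entry EQ. An equivalent standalone sketch of the same idiom:

    static int round_up_pow2(int n)
    {
            int i;

            for (i = 1; i < n; i <<= 1)
                    ; /* nothing */
            return i;
    }

    /* round_up_pow2(100) == 128, round_up_pow2(128) == 128 */
]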
*/ + for (i = 1; i < nent; i <<= 1) + ; /* nothing */ + nent = i; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); 
+ + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 130 : intr, + &dev->eq_table.eq[MTHCA_EQ_CMD]); + if (err) + goto err_out_async; + + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { + static const char *eq_name[] = { + [MTHCA_EQ_COMP] = DRV_NAME " (comp)", + [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", + [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + }; + + for (i = 0; i < MTHCA_NUM_EQ; ++i) { + err = request_irq(dev->eq_table.eq[i].msi_x_vector, + mthca_msi_x_interrupt, 0, + eq_name[i], dev->eq_table.eq + i); + if (err) + goto err_out_cmd; + dev->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, + DRV_NAME, dev); + if (err) + goto err_out_cmd; + dev->eq_table.have_irq = 1; + } + + err = mthca_MAP_EQ(dev, async_mask(dev), + 0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status); + + err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + + return 0; + +err_out_cmd: + mthca_free_irqs(dev); + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); + +err_out_async: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + +err_out_comp: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); + +err_out_free: + mthca_alloc_cleanup(&dev->eq_table.alloc); + return err; +} + +void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev) +{ + u8 status; + int i; + + mthca_free_irqs(dev); + + mthca_MAP_EQ(dev, async_mask(dev), + 1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); + + mthca_alloc_cleanup(&dev->eq_table.alloc); +} From roland at topspin.com Mon 
Dec 13 10:09:35 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:35 -0800 Subject: [openib-general] [PATCH][v3][11/21] Add Mellanox HCA low-level driver (initialization) In-Reply-To: <20041213109.SaVuYVABecenGuRB@topspin.com> Message-ID: <20041213109.ZLx80T8qLt2sXe9H@topspin.com> Add device initializaton code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c 2004-12-13 09:44:45.555014305 -0800 @@ -0,0 +1,215 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_profile.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include + +#include "mthca_profile.h" + +static int default_profile[MTHCA_RES_NUM] = { + [MTHCA_RES_QP] = 1 << 16, + [MTHCA_RES_EQP] = 1 << 16, + [MTHCA_RES_CQ] = 1 << 16, + [MTHCA_RES_EQ] = 32, + [MTHCA_RES_RDB] = 1 << 18, + [MTHCA_RES_MCG] = 1 << 13, + [MTHCA_RES_MPT] = 1 << 17, + [MTHCA_RES_MTT] = 1 << 20, + [MTHCA_RES_UDAV] = 1 << 15 +}; + +enum { + MTHCA_RDB_ENTRY_SIZE = 32, + MTHCA_MTT_SEG_SIZE = 64 +}; + +enum { + MTHCA_NUM_PDS = 1 << 15 +}; + +int mthca_make_profile(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca) +{ + /* just use default profile for now */ + struct mthca_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mthca_resource *profile; + struct mthca_resource tmp; + int i, j; + + default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE; + + profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; + profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; + profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; + profile[MTHCA_RES_CQ].size = dev_lim->cqc_entry_sz; + profile[MTHCA_RES_EQP].size = dev_lim->eqpc_entry_sz; + profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz; + profile[MTHCA_RES_EQ].size = dev_lim->eqc_entry_sz; + profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; + profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; + profile[MTHCA_RES_MPT].size = MTHCA_MPT_ENTRY_SIZE; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; + profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = default_profile[i]; + profile[i].log_num = max(ffs(default_profile[i]) - 1, 0); + profile[i].size *= default_profile[i]; + } + + /* + * Sort the resources in decreasing order of size. 
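[Editorial illustration -- not part of the original patch. Because every total size is a power of two, placing the resources largest-first at the running total keeps each one naturally aligned: areas of 0x100000, 0x40000 and 0x2000 bytes land at ddr_start + 0x0, + 0x100000 and + 0x140000 with no gaps, and each offset is a multiple of that area's own size since everything placed before it is a larger power of two.]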
Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. + */ + for (i = MTHCA_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = dev->ddr_start + total_size; + total_size += profile[i].size; + } + if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) { + mthca_err(dev, "Profile requires 0x%llx bytes; " + "won't fit between DDR start at 0x%016llx " + "and FW start at 0x%016llx.\n", + (unsigned long long) total_size, + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->fw.tavor.fw_start); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx " + "(size 0x%8llx)\n", + i, profile[i].type, profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n", + (int) (total_size >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10)); + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + switch (profile[i].type) { + case MTHCA_RES_QP: + dev->limits.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MTHCA_RES_EEC: + dev->limits.num_eecs = profile[i].num; + init_hca->eec_base = profile[i].start; + init_hca->log_num_eecs = profile[i].log_num; + break; + case MTHCA_RES_SRQ: + dev->limits.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MTHCA_RES_CQ: + dev->limits.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MTHCA_RES_EQP: + init_hca->eqpc_base = profile[i].start; + break; + case MTHCA_RES_EEEC: + init_hca->eeec_base = profile[i].start; + break; + case MTHCA_RES_EQ: + dev->limits.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MTHCA_RES_RDB: + dev->limits.num_rdbs = profile[i].num; + init_hca->rdb_base = profile[i].start; + break; + case MTHCA_RES_MCG: + dev->limits.num_mgms = profile[i].num >> 1; + dev->limits.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1; + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); + break; + case MTHCA_RES_MPT: + dev->limits.num_mpts = profile[i].num; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MTHCA_RES_MTT: + dev->limits.num_mtt_segs = profile[i].num; + dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE; + dev->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; + break; + case MTHCA_RES_UAR: + init_hca->uar_scratch_base = profile[i].start; + break; + case MTHCA_RES_UDAV: + dev->av_table.ddr_av_base = profile[i].start; + dev->av_table.num_ddr_avs = profile[i].num; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
+ */ + dev->limits.num_pds = MTHCA_NUM_PDS; + + kfree(profile); + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h 2004-12-13 09:44:45.581010475 -0800 @@ -0,0 +1,51 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_profile.h 1288 2004-11-24 01:12:39Z roland $ + */ + +#ifndef MTHCA_PROFILE_H +#define MTHCA_PROFILE_H + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_RES_QP, + MTHCA_RES_EEC, + MTHCA_RES_SRQ, + MTHCA_RES_CQ, + MTHCA_RES_EQP, + MTHCA_RES_EEEC, + MTHCA_RES_EQ, + MTHCA_RES_RDB, + MTHCA_RES_MCG, + MTHCA_RES_MPT, + MTHCA_RES_MTT, + MTHCA_RES_UAR, + MTHCA_RES_UDAV, + MTHCA_RES_NUM +}; + +int mthca_make_profile(struct mthca_dev *mdev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca); + +#endif /* MTHCA_PROFILE_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c 2004-12-13 09:44:45.607006645 -0800 @@ -0,0 +1,221 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_reset.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +int mthca_reset(struct mthca_dev *mdev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + u32 *bridge_header = NULL; + struct pci_dev *bridge = NULL; + +#define MTHCA_RESET_OFFSET 0xf0010 +#define MTHCA_RESET_VALUE cpu_to_be32(1) + + /* + * Reset the chip. This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. 
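[Editorial note -- not part of the original patch. The reset itself, further down in this function, is a single 32-bit write of MTHCA_RESET_VALUE to BAR 0 offset MTHCA_RESET_OFFSET (0xf0010), followed by a one-second msleep() and then polling config space until the device answers with something other than 0xffffffff.]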
+ * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID. */ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. 
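[Editorial note -- not part of the original patch. Both restore loops below also skip the dword containing PCI_COMMAND and write it back separately at the end, so neither the bridge nor the HCA is re-enabled on the bus until the rest of its saved configuration is in place.]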
+ */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} From roland at topspin.com Mon Dec 13 10:09:36 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:36 -0800 Subject: [openib-general] [PATCH][v3][12/21] Add Mellanox HCA low-level driver (QP/CQ) In-Reply-To: <20041213109.ZLx80T8qLt2sXe9H@topspin.com> Message-ID: <20041213109.qjNNDyU3lIqRtV2z@topspin.com> Add CQ (completion queue) and QP (queue pair) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c 2004-12-13 09:44:45.891964666 -0800 @@ -0,0 +1,817 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_cq.c 1298 2004-11-29 03:26:10Z roland $ + */ + +#include + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE +}; + +enum { + MTHCA_CQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. 
+ */ +struct mthca_cq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 error_eqn; + u32 comp_eqn; + u32 pd; + u32 lkey; + u32 last_notified_index; + u32 solicit_producer_index; + u32 consumer_index; + u32 producer_index; + u32 cqn; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_CQ_STATUS_OK ( 0 << 28) +#define MTHCA_CQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_CQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_CQ_FLAG_TR ( 1 << 18) +#define MTHCA_CQ_FLAG_OI ( 1 << 17) +#define MTHCA_CQ_STATE_DISARMED ( 0 << 8) +#define MTHCA_CQ_STATE_ARMED ( 1 << 8) +#define MTHCA_CQ_STATE_ARMED_SOL ( 4 << 8) +#define MTHCA_EQ_STATE_FIRED (10 << 8) + +enum { + MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe +}; + +enum { + SYNDROME_LOCAL_LENGTH_ERR = 0x01, + SYNDROME_LOCAL_QP_OP_ERR = 0x02, + SYNDROME_LOCAL_EEC_OP_ERR = 0x03, + SYNDROME_LOCAL_PROT_ERR = 0x04, + SYNDROME_WR_FLUSH_ERR = 0x05, + SYNDROME_MW_BIND_ERR = 0x06, + SYNDROME_BAD_RESP_ERR = 0x10, + SYNDROME_LOCAL_ACCESS_ERR = 0x11, + SYNDROME_REMOTE_INVAL_REQ_ERR = 0x12, + SYNDROME_REMOTE_ACCESS_ERR = 0x13, + SYNDROME_REMOTE_OP_ERR = 0x14, + SYNDROME_RETRY_EXC_ERR = 0x15, + SYNDROME_RNR_RETRY_EXC_ERR = 0x16, + SYNDROME_LOCAL_RDD_VIOL_ERR = 0x20, + SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21, + SYNDROME_REMOTE_ABORTED_ERR = 0x22, + SYNDROME_INVAL_EECN_ERR = 0x23, + SYNDROME_INVAL_EEC_STATE_ERR = 0x24 +}; + +struct mthca_cqe { + u32 my_qpn; + u32 my_ee; + u32 rqpn; + u16 sl_g_mlpath; + u16 rlid; + u32 imm_etype_pkey_eec; + u32 byte_cnt; + u32 wqe; + u8 opcode; + u8 is_send; + u8 reserved; + u8 owner; +}; + +struct mthca_err_cqe { + u32 my_qpn; + u32 reserved1[3]; + u8 syndrome; + u8 reserved2; + u16 db_cnt; + u32 reserved3; + u32 wqe; + u8 opcode; + u8 reserved4[2]; + u8 owner; +}; + +#define MTHCA_CQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_CQ_ENTRY_OWNER_HW (1 << 7) + +#define MTHCA_CQ_DB_INC_CI (1 << 24) +#define MTHCA_CQ_DB_REQ_NOT (2 << 24) +#define MTHCA_CQ_DB_REQ_NOT_SOL (3 << 24) +#define MTHCA_CQ_DB_SET_CI (4 << 24) +#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24) + +static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) +{ + if (cq->is_direct) + return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE); + else + return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int cqe_sw(struct mthca_cq *cq, int i) +{ + return !(MTHCA_CQ_ENTRY_OWNER_HW & + get_cqe(cq, i)->owner); +} + +static inline int next_cqe_sw(struct mthca_cq *cq) +{ + return cqe_sw(cq, cq->cons_index); +} + +static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +{ + get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; +} + +static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int nent) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(nent - 1); + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +{ + struct mthca_cq *cq; + + spin_lock(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, 
u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & (cq->ibcq.cqe - 1)); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe - 1) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & (cq->ibcq.cqe - 1)); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + (cq->ibcq.cqe - 1)), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & (cq->ibcq.cqe - 1); + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
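[Editorial note -- not part of the original patch. "Update it" here means the hardware error CQE is recycled in place: its WQE pointer is advanced to new_wqe, its doorbell count is decremented by dbd, its syndrome is rewritten to SYNDROME_WR_FLUSH_ERR, and *free_cqe is left clear, so the same CQE is polled again and reports a flush completion for each remaining work request in the chain.]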
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? "Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
+ IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & (cq->ibcq.cqe - 1); + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + wmb(); + inc_cons_index(dev, cq, freed); + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? + MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto err_out_free_cq; + + 
spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < (cq->ibcq.cqe * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = mthca_alloc_init(&dev->cq_table.alloc, + 
dev->limits.num_cqs, + (1 << 24) - 1, + dev->limits.reserved_cqs); + if (err) + return err; + + err = mthca_array_init(&dev->cq_table.cq, + dev->limits.num_cqs); + if (err) + mthca_alloc_cleanup(&dev->cq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev) +{ + mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs); + mthca_alloc_cleanup(&dev->cq_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c 2004-12-13 09:44:45.916960983 -0800 @@ -0,0 +1,1479 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_qp.c 1324 2004-12-13 17:55:08Z roland $ + */ + +#include + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, + MTHCA_ACK_REQ_FREQ = 10, + MTHCA_FLIGHT_LIMIT = 9, + MTHCA_UD_HEADER_SIZE = 72 /* largest UD header possible */ +}; + +enum { + MTHCA_QP_STATE_RST = 0, + MTHCA_QP_STATE_INIT = 1, + MTHCA_QP_STATE_RTR = 2, + MTHCA_QP_STATE_RTS = 3, + MTHCA_QP_STATE_SQE = 4, + MTHCA_QP_STATE_SQD = 5, + MTHCA_QP_STATE_ERR = 6, + MTHCA_QP_STATE_DRAINING = 7 +}; + +enum { + MTHCA_QP_ST_RC = 0x0, + MTHCA_QP_ST_UC = 0x1, + MTHCA_QP_ST_RD = 0x2, + MTHCA_QP_ST_UD = 0x3, + MTHCA_QP_ST_MLX = 0x7 +}; + +enum { + MTHCA_QP_PM_MIGRATED = 0x3, + MTHCA_QP_PM_ARMED = 0x0, + MTHCA_QP_PM_REARM = 0x1 +}; + +enum { + /* qp_context flags */ + MTHCA_QP_BIT_DE = 1 << 8, + /* params1 */ + MTHCA_QP_BIT_SRE = 1 << 15, + MTHCA_QP_BIT_SWE = 1 << 14, + MTHCA_QP_BIT_SAE = 1 << 13, + MTHCA_QP_BIT_SIC = 1 << 4, + MTHCA_QP_BIT_SSC = 1 << 3, + /* params2 */ + MTHCA_QP_BIT_RRE = 1 << 15, + MTHCA_QP_BIT_RWE = 1 << 14, + MTHCA_QP_BIT_RAE = 1 << 13, + MTHCA_QP_BIT_RIC = 1 << 4, + MTHCA_QP_BIT_RSC = 1 << 3 +}; + +struct mthca_qp_path { + u32 port_pkey; + u8 rnr_retry; + u8 g_mylmc; + u16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u8 rgid[16]; +} __attribute__((packed)); + +struct mthca_qp_context { + u32 flags; + u32 sched_queue; + u32 mtu_msgmax; + u32 usr_page; + u32 local_qpn; + u32 remote_qpn; + u32 reserved1[2]; + struct mthca_qp_path pri_path; + struct mthca_qp_path alt_path; + u32 rdd; + u32 pd; + u32 wqe_base; + u32 wqe_lkey; + u32 params1; + u32 reserved2; + u32 next_send_psn; + u32 cqn_snd; + u32 next_snd_wqe[2]; + u32 last_acked_psn; + u32 ssn; + u32 params2; + u32 rnr_nextrecvpsn; + u32 ra_buff_indx; + u32 cqn_rcv; + u32 next_rcv_wqe[2]; + u32 qkey; + u32 srqn; + u32 rmsn; + u32 reserved3[19]; +} __attribute__((packed)); + +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context 
context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* immediate data */ +} __attribute__((packed)); + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +} __attribute__((packed)); + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +} __attribute__((packed)); + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +} __attribute__((packed)); + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +} __attribute__((packed)); + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +} __attribute__((packed)); + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +} __attribute__((packed)); + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + 
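/*
 * (Aside: illustrative only, not part of this patch.)  For an
 * indirect (page-list) queue, get_send_wqe()/get_recv_wqe() above
 * split a WQE's byte offset into a page index and an offset within
 * that page; PAGE_SIZE is a power of two, so both are cheap shifts
 * and masks.  With hypothetical names:
 *
 *	unsigned long off = (unsigned long) n << wqe_shift;
 *	void *page = page_list[off >> PAGE_SHIFT].buf;
 *	void *wqe  = page + (off & (PAGE_SIZE - 1));   // address of WQE n
 */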
atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + }, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + 
.trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; + if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(¶m, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, ¶m, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + 
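/*
 * (Aside: sketch only, not part of this patch.)  Each state_table
 * entry carries a required and an optional attribute mask per
 * transport; a transition is rejected if a required bit is missing
 * or if any bit outside required | optional | IB_QP_STATE is set.
 * The same check as a hypothetical helper:
 *
 *	static int check_attr_mask(u32 attr_mask, u32 req, u32 opt)
 *	{
 *		if ((req & attr_mask) != req)
 *			return -EINVAL;    // required attribute missing
 *		if (attr_mask & ~(req | opt | IB_QP_STATE))
 *			return -EINVAL;    // unexpected extra attribute
 *		return 0;
 *	}
 */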
cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR 
based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, "modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) { + spin_lock_irq(&qp->lock); + /* XXX deal with async transitions to ERROR */ + qp->state = new_state; + spin_unlock_irq(&qp->lock); + } + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. 
+ * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + + err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + 
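/*
 * (Aside: illustrative, not part of this patch.)  In
 * mthca_alloc_wqe_buf() above, each WQE stride is rounded up to a
 * power of two of at least 64 bytes (shift 6), and the send queue is
 * then placed after the receive queue at an offset aligned to the
 * send stride.  The rounding, with hypothetical names:
 *
 *	int shift = 6;                  // 1 << 6 == 64-byte minimum WQE
 *	while ((1 << shift) < wqe_size)
 *		++shift;                // round the stride up to 2^shift
 */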
atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + 
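/*
 * (Aside: sketch of the numbering used by mthca_alloc_sqp() above,
 * not part of this patch; sqp_start is whatever mthca_init_qp_table()
 * chose.)  The special QPs occupy four consecutive "real" QP numbers
 * starting at sqp_start:
 *
 *	// qpn is 0 for SMI (QP0) or 1 for GSI (QP1)
 *	u32 mqpn = qpn * 2 + sqp_start + port - 1;
 *	// port 1: QP0 -> sqp_start,     QP1 -> sqp_start + 2
 *	// port 2: QP0 -> sqp_start + 1, QP1 -> sqp_start + 3
 */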
pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + dma_free_coherent(&dev->pdev->dev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
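/*
 * (Aside: restating the Q_Key rule applied on this line, not new
 * code.)  A work request whose remote Q_Key has the high bit set
 * means "use the QP's own Q_Key instead":
 *
 *	u32 qkey = (remote_qkey & 0x80000000) ? qp_qkey : remote_qkey;
 */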
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
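/*
 * (Aside: sketch only, not part of this patch.)  The send queue
 * tracks its occupancy in sq.cur rather than comparing producer and
 * consumer indices, so the overflow check at the top of the posting
 * loop is a plain bound on outstanding work requests:
 *
 *	if (outstanding + nreq >= max_wqes)
 *		return -ENOMEM;         // queue full, report *bad_wr
 */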
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->flags = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (qp->transport == UD) { + ((struct mthca_ud_seg *) wqe)->lkey = + cpu_to_be32(to_mah(wr->wr.ud.ah)->key); + ((struct mthca_ud_seg *) wqe)->av_addr = + cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); + ((struct mthca_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_ud_seg); + size += sizeof (struct mthca_ud_seg) / 16; + } else if (qp->transport == MLX) { + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + opcode[wr->opcode]); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32((size0 ? 
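/*
 * (Aside: the ordering rule behind the two stores above, expressed
 * with hypothetical helpers; not part of this patch.)  A new WQE only
 * becomes reachable by the HCA once the previous WQE's control words
 * are updated, so its body must be complete first and the two
 * prev_wqe stores are separated by smp_wmb():
 *
 *	prev->nda_op = link_and_opcode(new_wqe);  // where + what
 *	smp_wmb();                                // body and link visible first
 *	prev->ee_nds = dbd_and_size(new_wqe);     // now hardware may follow it
 */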
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} From roland at topspin.com Mon Dec 13 10:09:38 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:38 -0800 Subject: [openib-general] [PATCH][v3][13/21] Add Mellanox HCA low-level driver (last bits) In-Reply-To: <20041213109.qjNNDyU3lIqRtV2z@topspin.com> Message-ID: <20041213109.xRUlTjiTPPxsdvte@topspin.com> Add code for remaining InfiniBand objects (address vectors, multicast groups, memory regions and protection domains) Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-12-13 09:44:46.775834455 -0800 @@ -0,0 +1,205 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_av.c 1298 2004-11-29 03:26:10Z roland $ + */ + +#include + +#include +#include + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +}; + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if (!(dev->mthca_flags & 
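/*
 * (Aside: restating the g_slid encoding used by mthca_create_ah()
 * and mthca_read_ah() above; not part of this patch.)  One byte packs
 * the source path bits and the GRH flag:
 *
 *	u8 g_slid = src_path_bits & 0x7f;   // low seven bits: path bits
 *	if (has_grh)
 *		g_slid |= 0x80;             // top bit: a GRH follows
 */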
MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-12-13 09:44:46.802830478 -0800 @@ -0,0 +1,365 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mcg.c 1298 2004-11-29 03:26:10Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +}; + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. + * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * if GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. 
+ */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + return -EINVAL; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + 
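/*
 * (Aside: simplified sketch of the walk find_mgm() performs above;
 * not part of this patch.)  Groups hash into the MGM table and
 * collisions chain into AMGM entries through next_gid_index, which
 * is stored shifted left by 5:
 *
 *	index = hash;
 *	do {
 *		read_mgm(index, &mgm);
 *		if (gid_is_zero(&mgm) || gid_matches(&mgm, gid))
 *			break;
 *		prev  = index;
 *		index = be32_to_cpu(mgm.next_gid_index) >> 5;
 *	} while (index);
 */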
struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int prev, index; + int i, loc; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((u16 *) gid->raw)[0]), + be16_to_cpu(((u16 *) gid->raw)[1]), + be16_to_cpu(((u16 *) gid->raw)[2]), + be16_to_cpu(((u16 *) gid->raw)[3]), + be16_to_cpu(((u16 *) gid->raw)[4]), + be16_to_cpu(((u16 *) gid->raw)[5]), + be16_to_cpu(((u16 *) gid->raw)[6]), + be16_to_cpu(((u16 *) gid->raw)[7])); + err = -EINVAL; + goto out; + } + + for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) { + if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) + loc = i; + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) + break; + } + + if (loc == -1) { + mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num); + err = -EINVAL; + goto out; + } + + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (i != 1) + goto out; + + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + if (be32_to_cpu(mgm->next_gid_index) >> 5) { + err = mthca_READ_MGM(dev, + be32_to_cpu(mgm->next_gid_index) >> 5, + mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", + status); + err = -EINVAL; + goto out; + } + } else + memset(mgm->gid, 0, 16); + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } else { + /* Remove entry from AMGM */ + index = be32_to_cpu(mgm->next_gid_index) >> 5; + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int __devinit mthca_init_mcg_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->mcg_table.alloc, + dev->limits.num_amgms, + dev->limits.num_amgms - 1, + 0); + if (err) + return err; + + init_MUTEX(&dev->mcg_table.sem); + + return 0; +} + +void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev) +{ + mthca_alloc_cleanup(&dev->mcg_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-12-13 09:44:46.828826648 -0800 @@ -0,0 +1,385 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: mthca_mr.c 1298 2004-11-29 03:26:10Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* + * Must be packed because mtt_seg is 64 bits but only aligned to 32 bits. + */ +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + 
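/*
 * (Aside: sketch of the buddy relationship used by mthca_alloc_mtt()
 * and mthca_free_mtt() above; not part of this patch.)  At any order,
 * a block's buddy is its segment number with the low bit flipped, so
 * splitting marks the sibling free one order down and freeing merges
 * upward while the buddy is free:
 *
 *	u32 buddy = seg ^ 1;   // blocks pair up as (0,1), (2,3), ...
 *	// merge step: drop the pair bit and move up one order
 *	seg >>= 1;
 *	++order;
 */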
mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & 
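/*
 * (Aside: reading of the page_size encoding in mthca_mr_alloc_phys()
 * above; the 4 KB base is an inference from the "- 12", not stated
 * elsewhere in this patch.)  The MPT page_size field appears to be an
 * exponent relative to 4 KB:
 *
 *	u32 encoded = buffer_size_shift - 12;   // 12 -> 0 (4 KB), 16 -> 4 (64 KB)
 */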
(dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-12-13 09:44:46.861821787 -0800 @@ -0,0 +1,69 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_pd.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} From roland at topspin.com Mon Dec 13 10:09:45 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:45 -0800 Subject: [openib-general] [PATCH][v3][14/21] Add Mellanox HCA low-level driver (MAD) In-Reply-To: <20041213109.xRUlTjiTPPxsdvte@topspin.com> Message-ID: <20041213109.m8TyDSPRhM2X6Nst@topspin.com> Add MAD (management datagram) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-12-13 09:44:47.344750643 -0800 @@ -0,0 +1,314 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: mthca_mad.c 1288 2004-11-24 01:12:39Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + IB_SM_PORT_INFO = 0x0015, + IB_SM_PKEY_TABLE = 0x0016, + IB_SM_SM_INFO = 0x0020, + IB_SM_VENDOR_START = 0xff00 +}; + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = dma_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + DMA_TO_DEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). 
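+		 *
+		 * Concretely: once sm_lock is dropped below, update_sm_ah()
+		 * may replace and destroy the AH that this send was posted
+		 * with; per the note above, that is tolerated on our hardware.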
+ */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. + */ + if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || + be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? 
IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} From roland at topspin.com Mon Dec 13 10:09:45 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:45 -0800 Subject: [openib-general] [PATCH][v3][15/21] IPoIB IPv4 multicast In-Reply-To: <20041213109.m8TyDSPRhM2X6Nst@topspin.com> Message-ID: <20041213109.5NKezuGE9PMejMSM@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-12-13 09:44:47.613711020 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ --- linux-bk.orig/include/net/ip.h 2004-12-11 15:16:19.000000000 -0800 +++ linux-bk/include/net/ip.h 2004-12-13 09:44:47.641706896 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. 
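+ *
+ * For example, with the P_Key bytes still zero the group 224.0.0.1
+ * maps to the 20-byte hardware address
+ *
+ *   00:ff:ff:ff:ff:12:40:1b:00:00:00:00:00:00:00:00:00:00:00:01
+ *
+ * i.e. the reserved byte, the 0xffffff multicast QPN, and the IPoIB
+ * multicast GID ff12:401b::1 built from the low 28 bits of the IPv4
+ * group address.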
+ */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif --- linux-bk.orig/net/ipv4/arp.c 2004-12-11 15:16:28.000000000 -0800 +++ linux-bk/net/ipv4/arp.c 2004-12-13 09:44:47.650705570 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Mon Dec 13 10:09:46 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:46 -0800 Subject: [openib-general] [PATCH][v3][16/21] IPoIB IPv6 support In-Reply-To: <20041213109.5NKezuGE9PMejMSM@topspin.com> Message-ID: <20041213109.iziHvQZqtmP83gmx@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier --- linux-bk.orig/include/net/if_inet6.h 2004-12-11 15:16:37.000000000 -0800 +++ linux-bk/include/net/if_inet6.h 2004-12-13 09:44:48.801536031 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif --- linux-bk.orig/net/ipv6/addrconf.c 2004-12-11 15:16:33.000000000 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-12-13 09:44:48.840530286 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1095,6 +1096,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1794,6 +1801,7 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && + (dev->type != ARPHRD_INFINIBAND) && (dev->type != ARPHRD_ARCNET)) { /* Alas, we support only Ethernet autoconfiguration. 
*/ return; --- linux-bk.orig/net/ipv6/ndisc.c 2004-12-11 15:16:13.000000000 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-12-13 09:44:48.890522921 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Mon Dec 13 10:09:51 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:51 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041213109.iziHvQZqtmP83gmx@topspin.com> Message-ID: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> Add a driver that implements the (IPoIB) IP-over-InfiniBand protocol. This is a network device driver of type ARPHRD_INFINIBAND (and addr_len INFINIBAND_ALEN bytes). The ARP/ND implementation for this driver is not completely straightforward, because InfiniBand requires an additional path lookup be performed (through an IB-specific mechanism) after a remote hardware address has been resolved. We are very open to suggestions of a better way to handle this than the current implementation. Although IB has a special multicast group join mode intended to support IP multicast routing (non member join), no means to identify different multicast styles has yet been determined, so all joins by the driver are currently full member joins. We are looking for guidance in how to solve this. The IPoIB protocol/encapsulation is described in the Internet-Drafts http://www.ietf.org/internet-drafts/draft-ietf-ipoib-architecture-04.txt http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-13 09:44:43.936252779 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-13 09:44:49.385450009 -0800 @@ -2,7 +2,6 @@ config INFINIBAND tristate "InfiniBand support" - default n ---help--- Core support for InfiniBand (IB). Make sure to also select any protocols you wish to use as well as drivers for your @@ -10,4 +9,6 @@ source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/ulp/ipoib/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-13 09:44:43.909256756 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-13 09:44:49.342456343 -0800 @@ -1,2 +1,3 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Kconfig 2004-12-13 09:44:49.470437489 -0800 @@ -0,0 +1,33 @@ +config INFINIBAND_IPOIB + tristate "IP-over-InfiniBand" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the IP-over-InfiniBand protocol (IPoIB). This + transports IP packets over InfiniBand so you can use your IB + device as a fancy NIC. + + The IPoIB protocol is defined by the IETF ipoib working + group: . + +config INFINIBAND_IPOIB_DEBUG + bool "IP-over-InfiniBand debugging" + depends on INFINIBAND_IPOIB + ---help--- + This option causes debugging code to be compiled into the + IPoIB driver. The output can be turned on via the + debug_level and mcast_debug_level module parameters (which + can also be set after the driver is loaded through sysfs). + + This option also creates an "ipoib_debugfs," which can be + mounted to expose debugging information about IB multicast + groups used by the IPoIB driver. 
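+
+	  For example, tracing might be enabled and the multicast group
+	  state inspected at runtime roughly as follows (a sketch only;
+	  it assumes the module is loaded as ib_ipoib, an interface
+	  named ib0, and the usual /sys/module parameter layout):
+
+	    echo 1 > /sys/module/ib_ipoib/parameters/debug_level
+	    mount -t ipoib_debugfs nodev /mnt
+	    cat /mnt/ib0_mcg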
+ +config INFINIBAND_IPOIB_DEBUG_DATA + bool "IP-over-InfiniBand data path debugging" + depends on INFINIBAND_IPOIB_DEBUG + ---help--- + This option compiles debugging code into the the data path + of the IPoIB driver. The output can be turned on via the + data_debug_level module parameter; however, even with output + turned off, this debugging code will have some performance + impact. --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Makefile 2004-12-13 09:44:49.426443970 -0800 @@ -0,0 +1,11 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o + +ib_ipoib-y := ipoib_main.o \ + ipoib_ib.o \ + ipoib_multicast.o \ + ipoib_verbs.o \ + ipoib_vlan.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib.h 2004-12-13 09:44:49.497433512 -0800 @@ -0,0 +1,340 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ipoib.h 1323 2004-12-11 02:36:04Z roland $ + */ + +#ifndef _IPOIB_H +#define _IPOIB_H + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +#include +#include +#include + +/* constants */ + +enum { + IPOIB_PACKET_SIZE = 2048, + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, + + IPOIB_ENCAP_LEN = 4, + + IPOIB_RX_RING_SIZE = 128, + IPOIB_TX_RING_SIZE = 64, + + IPOIB_NUM_WC = 4, + + IPOIB_MAX_PATH_REC_QUEUE = 3, + IPOIB_MAX_MCAST_QUEUE = 3, + + IPOIB_FLAG_TX_FULL = 0, + IPOIB_FLAG_OPER_UP = 1, + IPOIB_FLAG_ADMIN_UP = 2, + IPOIB_PKEY_ASSIGNED = 3, + IPOIB_PKEY_STOP = 4, + IPOIB_FLAG_SUBINTERFACE = 5, + IPOIB_MCAST_RUN = 6, + IPOIB_STOP_REAPER = 7, + + IPOIB_MAX_BACKOFF_SECONDS = 16, + + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ + IPOIB_MCAST_FLAG_SENDONLY = 1, + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + IPOIB_MCAST_FLAG_ATTACHED = 3, +}; + +/* structs */ + +struct ipoib_header { + u16 proto; + u16 reserved; +}; + +struct ipoib_pseudoheader { + u8 hwaddr[INFINIBAND_ALEN]; +}; + +struct ipoib_mcast; + +struct ipoib_buf { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +/* + * Device private locking: tx_lock protects members used in TX fast + * path (and we use LLTX so upper layers don't do extra locking). + * lock protects everything else. lock nests inside of tx_lock (ie + * tx_lock must be acquired first if needed). 
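+ *
+ * A minimal sketch of the ordering when both locks are needed:
+ *
+ *	spin_lock_irqsave(&priv->tx_lock, flags);
+ *	spin_lock(&priv->lock);
+ *	... touch TX-path state and the rest of the private data ...
+ *	spin_unlock(&priv->lock);
+ *	spin_unlock_irqrestore(&priv->tx_lock, flags);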
+ */ +struct ipoib_dev_priv { + spinlock_t lock; + + struct net_device *dev; + + unsigned long flags; + + struct semaphore mcast_mutex; + struct semaphore vlan_mutex; + + struct rb_root path_tree; + struct list_head path_list; + + struct ipoib_mcast *broadcast; + struct list_head multicast_list; + struct rb_root multicast_tree; + + struct work_struct pkey_task; + struct work_struct mcast_task; + struct work_struct flush_task; + struct work_struct restart_task; + struct work_struct ah_reap_task; + + struct ib_device *ca; + u8 port; + u16 pkey; + struct ib_pd *pd; + struct ib_mr *mr; + struct ib_cq *cq; + struct ib_qp *qp; + u32 qkey; + + union ib_gid local_gid; + u16 local_lid; + + unsigned int admin_mtu; + unsigned int mcast_mtu; + + struct ipoib_buf *rx_ring; + + spinlock_t tx_lock; + struct ipoib_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + + struct ib_wc ibwc[IPOIB_NUM_WC]; + + struct list_head dead_ahs; + + struct ib_event_handler event_handler; + + struct net_device_stats stats; + + struct net_device *parent; + struct list_head child_intfs; + struct list_head list; + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + struct list_head fs_list; + struct dentry *mcg_dentry; +#endif +}; + +struct ipoib_ah { + struct net_device *dev; + struct ib_ah *ah; + struct list_head list; + struct kref ref; + unsigned last_send; +}; + +struct ipoib_path { + struct net_device *dev; + struct ib_sa_path_rec pathrec; + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct list_head neigh_list; + + int query_id; + struct ib_sa_query *query; + struct completion done; + + struct rb_node rb_node; + struct list_head list; +}; + +struct ipoib_neigh { + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct neighbour *neighbour; + + struct list_head list; +}; + +static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) +{ + return (struct ipoib_neigh **) (neigh->ha + 24 - + (offsetof(struct neighbour, ha) & 4)); +} + +extern struct workqueue_struct *ipoib_workqueue; + +/* functions */ + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr); +void ipoib_free_ah(struct kref *kref); +static inline void ipoib_put_ah(struct ipoib_ah *ah) +{ + kref_put(&ah->ref, ipoib_free_ah); +} + +int ipoib_add_pkey_attr(struct net_device *dev); + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn); +void ipoib_reap_ah(void *dev_ptr); + +void ipoib_flush_paths(struct net_device *dev); +struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_ib_dev_flush(void *dev); +void ipoib_ib_dev_cleanup(struct net_device *dev); + +int ipoib_ib_dev_open(struct net_device *dev); +int ipoib_ib_dev_up(struct net_device *dev); +int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev); + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_dev_cleanup(struct net_device *dev); + +void ipoib_mcast_join_task(void *dev_ptr); +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb); + +void ipoib_mcast_restart_task(void *dev_ptr); +int ipoib_mcast_start_thread(struct net_device *dev); +int ipoib_mcast_stop_thread(struct net_device *dev); + +void ipoib_mcast_dev_down(struct net_device *dev); +void ipoib_mcast_dev_flush(struct net_device *dev); + +struct 
ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *gid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only); + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); + +int ipoib_qp_create(struct net_device *dev); +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); +void ipoib_transport_dev_cleanup(struct net_device *dev); + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); + +void ipoib_pkey_poll(void *dev); +int ipoib_pkey_dev_delay_open(struct net_device *dev); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int ipoib_create_debug_file(struct net_device *dev); +void ipoib_delete_debug_file(struct net_device *dev); +int ipoib_register_debugfs(void); +void ipoib_unregister_debugfs(void); +#else +static inline int ipoib_create_debug_file(struct net_device *dev) { return 0; } +static inline void ipoib_delete_debug_file(struct net_device *dev) { } +static inline int ipoib_register_debugfs(void) { return 0; } +static inline void ipoib_unregister_debugfs(void) { } +#endif + + +#define ipoib_printk(level, priv, format, arg...) \ + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) +#define ipoib_warn(priv, format, arg...) \ + ipoib_printk(KERN_WARNING, priv, format , ## arg) + + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +extern int debug_level; + +#define ipoib_dbg(priv, format, arg...) \ + do { \ + if (debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { \ + if (mcast_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG */ +#define ipoib_dbg(priv, format, arg...) \ + do { (void) (priv); } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +#define ipoib_dbg_data(priv, format, arg...) \ + do { \ + if (data_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ +#define ipoib_dbg_data(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ + + +#define IPOIB_GID_FMT "%x:%x:%x:%x:%x:%x:%x:%x" + +#define IPOIB_GID_ARG(gid) be16_to_cpup((__be16 *) ((gid).raw + 0)), \ + be16_to_cpup((__be16 *) ((gid).raw + 2)), \ + be16_to_cpup((__be16 *) ((gid).raw + 4)), \ + be16_to_cpup((__be16 *) ((gid).raw + 6)), \ + be16_to_cpup((__be16 *) ((gid).raw + 8)), \ + be16_to_cpup((__be16 *) ((gid).raw + 10)), \ + be16_to_cpup((__be16 *) ((gid).raw + 12)), \ + be16_to_cpup((__be16 *) ((gid).raw + 14)) + +#endif /* _IPOIB_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2004-12-13 09:44:49.522429829 -0800 @@ -0,0 +1,276 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#include +#include + +#include "ipoib.h" + +enum { + IPOIB_MAGIC = 0x49504942 /* "IPIB" */ +}; + +static DECLARE_MUTEX(ipoib_fs_mutex); +static struct dentry *ipoib_root; +static struct super_block *ipoib_sb; +static LIST_HEAD(ipoib_device_list); + +static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) +{ + struct ipoib_mcast_iter *iter; + loff_t n = *pos; + + iter = ipoib_mcast_iter_init(file->private); + if (!iter) + return NULL; + + while (n--) { + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + } + + return iter; +} + +static void *ipoib_mcg_seq_next(struct seq_file *file, void *iter_ptr, + loff_t *pos) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + + (*pos)++; + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +static void ipoib_mcg_seq_stop(struct seq_file *file, void *iter_ptr) +{ + /* nothing for now */ +} + +static int ipoib_mcg_seq_show(struct seq_file *file, void *iter_ptr) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + char gid_buf[sizeof "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"]; + union ib_gid mgid; + int i, n; + unsigned long created; + unsigned int queuelen, complete, send_only; + + if (iter) { + ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, + &complete, &send_only); + + for (n = 0, i = 0; i < sizeof mgid / 2; ++i) { + n += sprintf(gid_buf + n, "%x", + be16_to_cpu(((u16 *)mgid.raw)[i])); + if (i < sizeof mgid / 2 - 1) + gid_buf[n++] = ':'; + } + } + + seq_printf(file, "GID: %*s", -(1 + (int) sizeof gid_buf), gid_buf); + + seq_printf(file, + " created: %10ld queuelen: %4d complete: %d send_only: %d\n", + created, queuelen, complete, send_only); + + return 0; +} + +static struct seq_operations ipoib_seq_ops = { + .start = ipoib_mcg_seq_start, + .next = ipoib_mcg_seq_next, + .stop = ipoib_mcg_seq_stop, + .show = ipoib_mcg_seq_show, +}; + +static int ipoib_mcg_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &ipoib_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->u.generic_ip; + + return 0; +} + +static struct file_operations ipoib_fops = { + .owner = THIS_MODULE, + .open = ipoib_mcg_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +static struct inode *ipoib_get_inode(void) +{ + struct inode *inode = new_inode(ipoib_sb); + + if (inode) { + inode->i_mode = S_IFREG | S_IRUGO; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_fop = &ipoib_fops; + } + + return 
inode; +} + +static int __ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dentry *dentry; + struct inode *inode; + char name[IFNAMSIZ + sizeof "_mcg"]; + + snprintf(name, sizeof name, "%s_mcg", dev->name); + + dentry = d_alloc_name(ipoib_root, name); + if (!dentry) + return -ENOMEM; + + inode = ipoib_get_inode(); + if (!inode) { + dput(dentry); + return -ENOMEM; + } + + inode->u.generic_ip = dev; + priv->mcg_dentry = dentry; + + d_add(dentry, inode); + + return 0; +} + +int ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + + list_add_tail(&priv->fs_list, &ipoib_device_list); + + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return 0; + } + + up(&ipoib_fs_mutex); + + return __ipoib_create_debug_file(dev); +} + +void ipoib_delete_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + list_del(&priv->fs_list); + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return; + } + up(&ipoib_fs_mutex); + + if (priv->mcg_dentry) { + d_drop(priv->mcg_dentry); + simple_unlink(ipoib_root->d_inode, priv->mcg_dentry); + } +} + +static int ipoib_fill_super(struct super_block *sb, void *data, int silent) +{ + static struct tree_descr ipoib_files[] = { + { "" } + }; + struct ipoib_dev_priv *priv; + int ret; + + ret = simple_fill_super(sb, IPOIB_MAGIC, ipoib_files); + if (ret) + return ret; + + ipoib_root = sb->s_root; + + down(&ipoib_fs_mutex); + + ipoib_sb = sb; + + list_for_each_entry(priv, &ipoib_device_list, fs_list) { + ret = __ipoib_create_debug_file(priv->dev); + if (ret) + break; + } + + up(&ipoib_fs_mutex); + + return ret; +} + +static struct super_block *ipoib_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_single(fs_type, flags, data, ipoib_fill_super); +} + +static void ipoib_kill_sb(struct super_block *sb) +{ + down(&ipoib_fs_mutex); + ipoib_sb = NULL; + up(&ipoib_fs_mutex); + + kill_litter_super(sb); +} + +static struct file_system_type ipoib_fs_type = { + .owner = THIS_MODULE, + .name = "ipoib_debugfs", + .get_sb = ipoib_get_sb, + .kill_sb = ipoib_kill_sb, +}; + +int ipoib_register_debugfs(void) +{ + return register_filesystem(&ipoib_fs_type); +} + +void ipoib_unregister_debugfs(void) +{ + unregister_filesystem(&ipoib_fs_type); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2004-12-13 09:44:49.547426147 -0800 @@ -0,0 +1,627 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_ib.c 1323 2004-12-11 02:36:04Z roland $ + */ + +#include +#include + +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +int data_debug_level; + +module_param(data_debug_level, int, 0644); +MODULE_PARM_DESC(data_debug_level, + "Enable data path debug tracing if > 0"); +#endif + +#define IPOIB_OP_RECV (1ul << 31) + +static DECLARE_MUTEX(pkey_sem); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr) +{ + struct ipoib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return NULL; + + ah->dev = dev; + ah->last_send = 0; + kref_init(&ah->ref); + + ah->ah = ib_create_ah(pd, attr); + if (IS_ERR(ah->ah)) { + kfree(ah); + ah = NULL; + } else + ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah); + + return ah; +} + +void ipoib_free_ah(struct kref *kref) +{ + struct ipoib_ah *ah = container_of(kref, struct ipoib_ah, ref); + struct ipoib_dev_priv *priv = netdev_priv(ah->dev); + + unsigned long flags; + + if (ah->last_send <= priv->tx_tail) { + ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } else { + spin_lock_irqsave(&priv->lock, flags); + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, + unsigned int wr_id, + dma_addr_t addr) +{ + struct ib_sge list = { + .addr = addr, + .length = IPOIB_BUF_SIZE, + .lkey = priv->mr->lkey, + }; + struct ib_recv_wr param = { + .wr_id = wr_id | IPOIB_OP_RECV, + .sg_list = &list, + .num_sge = 1, + .recv_flags = IB_RECV_SIGNALED + }; + struct ib_recv_wr *bad_wr; + + return ib_post_recv(priv->qp, ¶m, &bad_wr); +} + +static int ipoib_ib_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + dma_addr_t addr; + int ret; + + skb = dev_alloc_skb(IPOIB_BUF_SIZE + 4); + if (!skb) { + ipoib_warn(priv, "failed to allocate receive buffer\n"); + + priv->rx_ring[id].skb = NULL; + return -ENOMEM; + } + skb_reserve(skb, 4); /* 16 byte align IP header */ + priv->rx_ring[id].skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); + + ret = ipoib_ib_receive(priv, id, addr); + if (ret) { + ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", + id, ret); + priv->rx_ring[id].skb = NULL; + } + + return ret; +} + +static int ipoib_ib_post_receives(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + if (ipoib_ib_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + return -EIO; + } + } + + return 0; +} + +static void ipoib_ib_handle_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (wr_id & IPOIB_OP_RECV) { + wr_id &= ~IPOIB_OP_RECV; + + if (wr_id < IPOIB_RX_RING_SIZE) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + + priv->rx_ring[wr_id].skb = NULL; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->rx_ring[wr_id], + mapping), + IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + + if (wc->status != IB_WC_SUCCESS) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + 
wc->status, wr_id, wc->vendor_err); + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put(skb, wc->byte_len); + skb_pull(skb, IB_GRH_BYTES); + + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); + } + + /* repost receive */ + if (ipoib_ib_post_receive(dev, wr_id)) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + wr_id); + + } else { + struct ipoib_buf *tx_req; + unsigned long flags; + + if (wr_id >= IPOIB_TX_RING_SIZE) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, IPOIB_TX_RING_SIZE); + return; + } + + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + + tx_req = &priv->tx_ring[wr_id]; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->tx_tail; + if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + } +} + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < n; ++i) + ipoib_ib_handle_wc(dev, priv->ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +static inline int post_send(struct ipoib_dev_priv *priv, + unsigned int wr_id, + struct ib_ah *address, u32 qpn, + dma_addr_t addr, int len) +{ + struct ib_sge list = { + .addr = addr, + .length = len, + .lkey = priv->mr->lkey, + }; + struct ib_send_wr param = { + .wr_id = wr_id, + .opcode = IB_WR_SEND, + .sg_list = &list, + .num_sge = 1, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = priv->qkey, + .ah = address + }, + }, + .send_flags = IB_SEND_SIGNALED, + }; + struct ib_send_wr *bad_wr; + + return ib_post_send(priv->qp, ¶m, &bad_wr); +} + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_buf *tx_req; + dma_addr_t addr; + + if (skb->len > dev->mtu + INFINIBAND_ALEN) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, dev->mtu + INFINIBAND_ALEN); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + + if (!(skb = skb_unshare(skb, GFP_ATOMIC))) { + ipoib_warn(priv, "failed to unshare sk_buff. 
Dropping\n"); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + return; + } + + ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", + skb->len, address, qpn); + + /* + * We put the skb into the tx_ring _before_ we call post_send() + * because it's entirely possible that the completion handler will + * run before we execute anything after the post_send(). That + * means we have to make sure everything is properly recorded and + * our state is consistent before we call post_send(). + */ + tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req->skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, skb->len, + DMA_TO_DEVICE); + pci_unmap_addr_set(tx_req, mapping, addr); + + if (post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + address->ah, qpn, addr, skb->len)) { + ipoib_warn(priv, "post_send failed\n"); + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + } else { + dev->trans_start = jiffies; + + address->last_send = priv->tx_head; + ++priv->tx_head; + + if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + } + } +} + +void __ipoib_reap_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah, *tah; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) + if (ah->last_send <= priv->tx_tail) { + list_del(&ah->list); + list_add_tail(&ah->list, &remove_list); + } + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(ah, tah, &remove_list, list) { + ipoib_dbg(priv, "Reaping ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } +} + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + +int ipoib_ib_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + ret = ipoib_qp_create(dev); + if (ret) { + ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret); + return -1; + } + + ret = ipoib_ib_post_receives(dev); + if (ret) { + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + return -1; + } + + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + + return 0; +} + +int ipoib_ib_dev_up(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + + return ipoib_mcast_start_thread(dev); +} + +int ipoib_ib_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "downing ib_dev\n"); + + clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + netif_carrier_off(dev); + + /* Shutdown the P_Key thread if still active */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + set_bit(IPOIB_PKEY_STOP, &priv->flags); + cancel_delayed_work(&priv->pkey_task); + up(&pkey_sem); + flush_workqueue(ipoib_workqueue); + } + + ipoib_mcast_stop_thread(dev); + + /* + * Flush the multicast groups first so we stop any multicast joins. The + * completion thread may have already died and we may deadlock waiting + * for the completion thread to finish some multicast joins. 
+ */ + ipoib_mcast_dev_flush(dev); + + /* Delete broadcast and local addresses since they will be recreated */ + ipoib_mcast_dev_down(dev); + + ipoib_flush_paths(dev); + + return 0; +} + +static int recvs_pending(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + return 1; + + return 0; +} + +int ipoib_ib_dev_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int attr_mask; + int i; + + /* Kill the existing QP and allocate a new one */ + qp_attr.qp_state = IB_QPS_ERR; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to ERROR state\n"); + + /* Wait for all sends and receives to complete */ + while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) + yield(); + + ipoib_dbg(priv, "All sends and receives done.\n"); + + qp_attr.qp_state = IB_QPS_RESET; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to RESET state\n"); + + /* Wait for all AHs to be reaped */ + set_bit(IPOIB_STOP_REAPER, &priv->flags); + cancel_delayed_work(&priv->ah_reap_task); + flush_workqueue(ipoib_workqueue); + while (!list_empty(&priv->dead_ahs)) { + __ipoib_reap_ah(dev); + yield(); + } + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + ipoib_warn(priv, "Recv skb still around @ %d\n", i); + + return 0; +} + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + priv->ca = ca; + priv->port = port; + priv->qp = NULL; + + if (ipoib_transport_dev_init(dev, ca)) { + printk(KERN_WARNING "%s: ipoib_transport_dev_init failed\n", ca->name); + return -ENODEV; + } + + if (dev->flags & IFF_UP) { + if (ipoib_ib_dev_open(dev)) { + ipoib_transport_dev_cleanup(dev); + return -ENODEV; + } + } + + return 0; +} + +void ipoib_ib_dev_flush(void *_dev) +{ + struct net_device *dev = (struct net_device *)_dev; + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv; + + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + + ipoib_dbg(priv, "flushing\n"); + + ipoib_ib_dev_down(dev); + + /* + * The device could have been brought down between the start and when + * we get here, don't bring it back up if it's not configured up + */ + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_ib_dev_up(dev); + + /* Flush any child interfaces too */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + ipoib_ib_dev_flush(&cpriv->dev); +} + +void ipoib_ib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "cleaning up ib_dev\n"); + + ipoib_mcast_stop_thread(dev); + + /* Delete the broadcast address and the local address */ + ipoib_mcast_dev_down(dev); + + ipoib_transport_dev_cleanup(dev); +} + +/* + * Delayed P_Key Assigment Interim Support + * + * The following is initial implementation of delayed P_Key assigment + * mechanism. It is using the same approach implemented for the multicast + * group join. The single goal of this implementation is to quickly address + * Bug #2507. This implementation will probably be removed when the P_Key + * change async notification is available. 
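+ *
+ * Rough flow, for orientation:
+ *
+ *   ipoib_open()
+ *     -> ipoib_pkey_dev_delay_open(): if the interface P_Key is not
+ *        yet in the port's P_Key table, queue ipoib_pkey_poll() with
+ *        a one second delay and return without bringing the
+ *        interface up.
+ *   ipoib_pkey_poll()
+ *     -> if the P_Key has appeared, call ipoib_open(); otherwise
+ *        requeue itself until IPOIB_PKEY_STOP is set by
+ *        ipoib_ib_dev_down().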
+ */ +int ipoib_open(struct net_device *dev); + +static void ipoib_pkey_dev_check_presence(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + u16 pkey_index = 0; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + else + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); +} + +void ipoib_pkey_poll(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_pkey_dev_check_presence(dev); + + if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + ipoib_open(dev); + else { + down(&pkey_sem); + if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + } +} + +int ipoib_pkey_dev_delay_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Look for the interface pkey value in the IB Port P_Key table and */ + /* set the interface pkey assigment flag */ + ipoib_pkey_dev_check_presence(dev); + + /* P_Key value not assigned yet - start polling */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + clear_bit(IPOIB_PKEY_STOP, &priv->flags); + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + return 1; + } + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_main.c 2004-12-13 09:44:49.573422317 -0800 @@ -0,0 +1,1023 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_main.c 1323 2004-12-11 02:36:04Z roland $ + */ + +#include "ipoib.h" + +#include +#include + +#include +#include +#include + +#include /* For ARPHRD_xxx */ + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); +MODULE_LICENSE("Dual BSD/GPL"); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int debug_level; + +module_param(debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); +#endif + +static const u8 ipv4_bcast_addr[] = { + 0x00, 0xff, 0xff, 0xff, + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff +}; + +struct workqueue_struct *ipoib_workqueue; + +static void ipoib_add_one(struct ib_device *device); +static void ipoib_remove_one(struct ib_device *device); + +static struct ib_client ipoib_client = { + .name = "ipoib", + .add = ipoib_add_one, + .remove = ipoib_remove_one +}; + +int ipoib_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "bringing up interface\n"); + + set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + if (ipoib_pkey_dev_delay_open(dev)) + return 0; + + if (ipoib_ib_dev_open(dev)) + return -EINVAL; + + if (ipoib_ib_dev_up(dev)) + return -EINVAL; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring up any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (flags & IFF_UP) + continue; + + dev_change_flags(cpriv->dev, flags | IFF_UP); + } + up(&priv->vlan_mutex); + } + + netif_start_queue(dev); + + return 0; +} + +static int ipoib_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "stopping interface\n"); + + clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + netif_stop_queue(dev); + + ipoib_ib_dev_down(dev); + ipoib_ib_dev_stop(dev); + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring down any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (!(flags & IFF_UP)) + continue; + + dev_change_flags(cpriv->dev, flags & ~IFF_UP); + } + up(&priv->vlan_mutex); + } + + return 0; +} + +static int ipoib_change_mtu(struct net_device *dev, int new_mtu) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + return -EINVAL; + + priv->admin_mtu = new_mtu; + + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + return 0; +} + +static struct ipoib_path *__path_find(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->path_tree.rb_node; + struct ipoib_path *path; + int ret; + + while (n) { + path = rb_entry(n, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, gid->raw, + sizeof (union ib_gid)); + + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return path; + } + + return NULL; +} + +static int __path_add(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->path_tree.rb_node; + struct rb_node *pn = NULL; + struct ipoib_path *tpath; + int ret; + + while (*n) { + pn = *n; + tpath = rb_entry(pn, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, 
tpath->pathrec.dgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&path->rb_node, pn, n); + rb_insert_color(&path->rb_node, &priv->path_tree); + + list_add_tail(&path->list, &priv->path_list); + + return 0; +} + +static void __path_free(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tn; + struct sk_buff *skb; + + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_irq(skb); + + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + if (path->ah) + ipoib_put_ah(path->ah); + + rb_erase(&path->rb_node, &priv->path_tree); + list_del(&path->list); + kfree(path); +} + +void ipoib_flush_paths(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path, *tp; + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(path, tp, &priv->path_list, list) + __path_free(dev, path); + + spin_unlock_irqrestore(&priv->lock, flags); +} + +static void path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *path_ptr) +{ + struct ipoib_path *path = path_ptr; + struct ipoib_dev_priv *priv = netdev_priv(path->dev); + struct ipoib_ah *ah = NULL; + struct ipoib_neigh *neigh; + struct sk_buff_head skqueue; + struct sk_buff *skb; + unsigned long flags; + + ipoib_dbg(priv, "status %d, LID 0x%04x for GID " IPOIB_GID_FMT "\n", + status, be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + + if (status == IB_WC_SUCCESS) { + struct ib_ah_attr av = { + .dlid = be16_to_cpu(pathrec->dlid), + .sl = pathrec->sl, + .src_path_bits = 0, + .static_rate = 0, + .ah_flags = 0, + .port_num = priv->port + }; + + ah = ipoib_create_ah(path->dev, priv->pd, &av); + } + + spin_lock_irqsave(&priv->lock, flags); + + path->ah = ah; + + if (ah) { + path->pathrec = *pathrec; + + ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", + ah, be16_to_cpu(pathrec->dlid), pathrec->sl); + + skb_queue_head_init(&skqueue); + + while ((skb = __skb_dequeue(&path->queue))) + __skb_queue_tail(&skqueue, skb); + + list_for_each_entry(neigh, &path->neigh_list, list) { + neigh->ah = path->ah; + kref_get(&path->ah->ref); + + while ((skb = __skb_dequeue(&neigh->queue))) + __skb_queue_tail(&skqueue, skb); + } + } else + path->query = NULL; + + + complete(&path->done); + + spin_unlock_irqrestore(&priv->lock, flags); + + while ((skb = __skb_dequeue(&skqueue))) { + skb->dev = path->dev; + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue packet\n"); + } +} + +static struct ipoib_path *path_rec_create(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + path = kmalloc(sizeof *path, GFP_ATOMIC); + if (!path) + return NULL; + + path->dev = dev; + path->pathrec.dlid = 0; + + skb_queue_head_init(&path->queue); + + INIT_LIST_HEAD(&path->neigh_list); + path->query = NULL; + init_completion(&path->done); + + memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; + + __path_add(dev, path); + + return path; +} + +static int path_rec_start(struct net_device *dev, + struct 
ipoib_path *path)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	path->query_id =
+		ib_sa_path_rec_get(priv->ca, priv->port,
+				   &path->pathrec,
+				   IB_SA_PATH_REC_DGID |
+				   IB_SA_PATH_REC_SGID |
+				   IB_SA_PATH_REC_NUMB_PATH |
+				   IB_SA_PATH_REC_PKEY,
+				   1000, GFP_ATOMIC,
+				   path_rec_completion,
+				   path, &path->query);
+	if (path->query_id < 0) {
+		ipoib_warn(priv, "ib_sa_path_rec_get failed\n");
+		path->query = NULL;
+		return path->query_id;
+	}
+
+	return 0;
+}
+
+static void neigh_add_path(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_path *path;
+	struct ipoib_neigh *neigh;
+
+	neigh = kmalloc(sizeof *neigh, GFP_ATOMIC);
+	if (!neigh) {
+		++priv->stats.tx_dropped;
+		dev_kfree_skb_any(skb);
+		return;
+	}
+
+	skb_queue_head_init(&neigh->queue);
+	neigh->neighbour = skb->dst->neighbour;
+	*to_ipoib_neigh(skb->dst->neighbour) = neigh;
+
+	/*
+	 * We can only be called from ipoib_start_xmit, so we're
+	 * inside tx_lock -- no need to save/restore flags.
+	 */
+	spin_lock(&priv->lock);
+
+	path = __path_find(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4));
+	if (!path) {
+		path = path_rec_create(dev,
+				       (union ib_gid *) (skb->dst->neighbour->ha + 4));
+		if (!path)
+			goto err_path;
+	}
+
+	list_add_tail(&neigh->list, &path->neigh_list);
+
+	if (path->pathrec.dlid) {
+		neigh->ah = path->ah;
+		kref_get(&path->ah->ref);
+
+		ipoib_send(dev, skb, path->ah,
+			   be32_to_cpup((__be32 *) skb->dst->neighbour->ha));
+	} else if (!path->query) {
+		neigh->ah = NULL;
+		__skb_queue_tail(&neigh->queue, skb);
+		if (path_rec_start(dev, path))
+			goto err_list;
+	}
+
+	spin_unlock(&priv->lock);
+	return;
+
+err_list:
+	list_del(&neigh->list);
+
+err_path:
+	/* Clear the neighbour back-pointer and destructor before freeing. */
+	*to_ipoib_neigh(skb->dst->neighbour) = NULL;
+	neigh->neighbour->ops->destructor = NULL;
+	kfree(neigh);
+
+	++priv->stats.tx_dropped;
+	dev_kfree_skb_any(skb);
+
+	spin_unlock(&priv->lock);
+}
+
+static void path_lookup(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(skb->dev);
+
+	/* Look up path record for unicasts */
+	if (skb->dst->neighbour->ha[4] != 0xff) {
+		neigh_add_path(skb, dev);
+		return;
+	}
+
+	/* Add in the P_Key for multicasts */
+	skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff;
+	skb->dst->neighbour->ha[9] = priv->pkey & 0xff;
+	ipoib_mcast_send(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4), skb);
+}
+
+static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
+			     struct ipoib_pseudoheader *phdr)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_path *path;
+
+	/*
+	 * We can only be called from ipoib_start_xmit, so we're
+	 * inside tx_lock -- no need to save/restore flags.
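+	 * (ipoib_start_xmit disables interrupts before it takes tx_lock, so a
+	 * plain spin_lock() on priv->lock is sufficient here.)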
+ */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (phdr->hwaddr + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (phdr->hwaddr + 4)); + if (path) { + __skb_queue_tail(&path->queue, skb); + + if (path_rec_start(dev, path)) + __path_free(dev, path); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); + return; + } + + ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) phdr->hwaddr)); + + spin_unlock(&priv->lock); +} + +static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh; + unsigned long flags; + + local_irq_save(flags); + if (!spin_trylock(&priv->tx_lock)) { + local_irq_restore(flags); + return NETDEV_TX_LOCKED; + } + + if (skb->dst && skb->dst->neighbour) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { + path_lookup(skb, dev); + goto out; + } + + neigh = *to_ipoib_neigh(skb->dst->neighbour); + + if (likely(neigh->ah)) { + ipoib_send(dev, skb, neigh->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + goto out; + } + + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) + __skb_queue_tail(&neigh->queue, skb); + else + goto err; + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key for multicast*/ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); + } else { + /* unicast GID -- should be ARP reply */ + + if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? "neigh" : "dst", + be16_to_cpup((u16 *) skb->data), + be32_to_cpup((u32 *) phdr->hwaddr), + IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + goto out; + } + + /* put the pseudoheader back on and try to send */ + unicast_arp_send(skb, dev, phdr); + } + } + + goto out; + +err: + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + +out: + spin_unlock_irqrestore(&priv->tx_lock, flags); + + return NETDEV_TX_OK; +} + +struct net_device_stats *ipoib_get_stats(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return &priv->stats; +} + +static void ipoib_timeout(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_warn(priv, "transmit timeout: latency %ld\n", + jiffies - dev->trans_start); + /* XXX reset QP, etc. */ +} + +static int ipoib_hard_header(struct sk_buff *skb, + struct net_device *dev, + unsigned short type, + void *daddr, void *saddr, unsigned len) +{ + struct ipoib_header *header; + + header = (struct ipoib_header *) skb_push(skb, sizeof *header); + + header->proto = htons(type); + header->reserved = 0; + + /* + * If we don't have a neighbour structure, stuff the + * destination address onto the front of the skb so we can + * figure out where to send the packet later. 
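+	 * (ipoib_start_xmit pulls this pseudoheader back off with skb_pull()
+	 * before resolving the destination and sending.)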
+ */ + if (!skb->dst || !skb->dst->neighbour) { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb_push(skb, sizeof *phdr); + memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN); + } + + return 0; +} + +static void ipoib_set_mcast_list(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + schedule_work(&priv->restart_task); +} + +static void ipoib_neigh_destructor(struct neighbour *n) +{ + struct ipoib_neigh *neigh = *to_ipoib_neigh(n); + struct ipoib_dev_priv *priv = netdev_priv(n->dev); + unsigned long flags; + + ipoib_dbg(priv, + "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + be32_to_cpup((__be32 *) n->ha), + IPOIB_GID_ARG(*((union ib_gid *) (n->ha + 4)))); + + spin_lock_irqsave(&priv->lock, flags); + + if (neigh) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + *to_ipoib_neigh(n) = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); +} + +static int ipoib_neigh_setup(struct neighbour *neigh) +{ + /* + * Is this kosher? I can't find anybody in the kernel that + * sets neigh->destructor, so we should be able to set it here + * without trouble. + */ + neigh->ops->destructor = ipoib_neigh_destructor; + + return 0; +} + +static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +{ + parms->neigh_setup = ipoib_neigh_setup; + + return 0; +} + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Allocate RX/TX "rings" to hold queued skbs */ + + priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->rx_ring) { + printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", + ca->name, IPOIB_RX_RING_SIZE); + goto out; + } + memset(priv->rx_ring, 0, + IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf)); + + priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->tx_ring) { + printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", + ca->name, IPOIB_TX_RING_SIZE); + goto out_rx_ring_cleanup; + } + memset(priv->tx_ring, 0, + IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf)); + + /* priv->tx_head & tx_tail are already 0 */ + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_ring_cleanup; + + return 0; + +out_tx_ring_cleanup: + kfree(priv->tx_ring); + +out_rx_ring_cleanup: + kfree(priv->rx_ring); + +out: + return -ENOMEM; +} + +void ipoib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; + + ipoib_delete_debug_file(dev); + + /* Delete any child interfaces first */ + list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { + unregister_netdev(cpriv->dev); + ipoib_dev_cleanup(cpriv->dev); + free_netdev(cpriv->dev); + } + + ipoib_ib_dev_cleanup(dev); + + if (priv->rx_ring) { + kfree(priv->rx_ring); + priv->rx_ring = NULL; + } + + if (priv->tx_ring) { + kfree(priv->tx_ring); + priv->tx_ring = NULL; + } +} + +static void ipoib_setup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + dev->open = ipoib_open; + dev->stop = ipoib_stop; + dev->change_mtu = ipoib_change_mtu; + dev->hard_start_xmit = ipoib_start_xmit; + dev->get_stats = ipoib_get_stats; + dev->tx_timeout = ipoib_timeout; + dev->hard_header = ipoib_hard_header; + dev->set_multicast_list = ipoib_set_mcast_list; + dev->neigh_setup = ipoib_neigh_setup_dev; + + dev->watchdog_timeo = HZ; + + dev->rebuild_header = NULL; + dev->set_mac_address = NULL; + 
dev->header_cache_update = NULL; + + dev->flags |= IFF_BROADCAST | IFF_MULTICAST; + + /* + * We add in INFINIBAND_ALEN to allow for the destination + * address "pseudoheader" for skbs without neighbour struct. + */ + dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; + dev->addr_len = INFINIBAND_ALEN; + dev->type = ARPHRD_INFINIBAND; + dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + + /* MTU will be reset when mcast join happens */ + dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; + priv->mcast_mtu = priv->admin_mtu = dev->mtu; + + memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN); + + netif_carrier_off(dev); + + SET_MODULE_OWNER(dev); + + priv->dev = dev; + + spin_lock_init(&priv->lock); + spin_lock_init(&priv->tx_lock); + + init_MUTEX(&priv->mcast_mutex); + init_MUTEX(&priv->vlan_mutex); + + INIT_LIST_HEAD(&priv->path_list); + INIT_LIST_HEAD(&priv->child_intfs); + INIT_LIST_HEAD(&priv->dead_ahs); + INIT_LIST_HEAD(&priv->multicast_list); + + INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev); + INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); + INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); +} + +struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) +{ + struct net_device *dev; + + dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name, + ipoib_setup); + if (!dev) + return NULL; + + return netdev_priv(dev); +} + +static ssize_t show_pkey(struct class_device *cdev, char *buf) +{ + struct ipoib_dev_priv *priv = + netdev_priv(container_of(cdev, struct net_device, class_dev)); + + return sprintf(buf, "0x%04x\n", priv->pkey); +} +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t create_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; +} +static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); + +static ssize_t delete_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? 
ret : count; + +} +static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); + +int ipoib_add_pkey_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, + &class_device_attr_pkey); +} + +static struct net_device *ipoib_add_port(const char *format, + struct ib_device *hca, u8 port) +{ + struct ipoib_dev_priv *priv; + int result = -ENOMEM; + + priv = ipoib_intf_alloc(format); + if (!priv) + goto alloc_mem_failed; + + SET_NETDEV_DEV(priv->dev, hca->dma_device); + + result = ib_query_pkey(hca, port, 0, &priv->pkey); + if (result) { + printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } + + priv->dev->broadcast[8] = priv->pkey >> 8; + priv->dev->broadcast[9] = priv->pkey & 0xff; + + result = ib_query_gid(hca, port, 0, &priv->local_gid); + if (result) { + printk(KERN_WARNING "%s: ib_query_gid port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + + result = ipoib_dev_init(priv->dev, hca, port); + if (result < 0) { + printk(KERN_WARNING "%s: failed to initialize port %d (ret = %d)\n", + hca->name, port, result); + goto device_init_failed; + } + + INIT_IB_EVENT_HANDLER(&priv->event_handler, + priv->ca, ipoib_event); + result = ib_register_event_handler(&priv->event_handler); + if (result < 0) { + printk(KERN_WARNING "%s: ib_register_event_handler failed for " + "port %d (ret = %d)\n", + hca->name, port, result); + goto event_failed; + } + + result = register_netdev(priv->dev); + if (result) { + printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", + hca->name, port, result); + goto register_failed; + } + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_create_child)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_delete_child)) + goto sysfs_failed; + + return priv->dev; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ib_unregister_event_handler(&priv->event_handler); + +event_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +alloc_mem_failed: + return ERR_PTR(result); +} + +static void ipoib_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct net_device *dev; + struct ipoib_dev_priv *priv; + int s, e, p; + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + return; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + dev = ipoib_add_port("ib%d", device, p); + if (!IS_ERR(dev)) { + priv = netdev_priv(dev); + list_add_tail(&priv->list, dev_list); + } + } + + ib_set_client_data(device, &ipoib_client, dev_list); +} + +static void ipoib_remove_one(struct ib_device *device) +{ + struct ipoib_dev_priv *priv, *tmp; + struct list_head *dev_list; + + dev_list = ib_get_client_data(device, &ipoib_client); + + list_for_each_entry_safe(priv, tmp, dev_list, list) { + ib_unregister_event_handler(&priv->event_handler); + + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + free_netdev(priv->dev); + } +} + +static int __init ipoib_init_module(void) +{ + int ret; + + ret = 
ipoib_register_debugfs(); + if (ret) + return ret; + + /* + * We create our own workqueue mainly because we want to be + * able to flush it when devices are being removed. We can't + * use schedule_work()/flush_scheduled_work() because both + * unregister_netdev() and linkwatch_event take the rtnl lock, + * so flush_scheduled_work() can deadlock during device + * removal. + */ + ipoib_workqueue = create_singlethread_workqueue("ipoib"); + if (!ipoib_workqueue) { + ret = -ENOMEM; + goto err_fs; + } + + ret = ib_register_client(&ipoib_client); + if (ret) + goto err_wq; + + return 0; + +err_fs: + ipoib_unregister_debugfs(); + +err_wq: + destroy_workqueue(ipoib_workqueue); + + return ret; +} + +static void __exit ipoib_cleanup_module(void) +{ + ipoib_unregister_debugfs(); + ib_unregister_client(&ipoib_client); + destroy_workqueue(ipoib_workqueue); +} + +module_init(ipoib_init_module); +module_exit(ipoib_cleanup_module); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2004-12-13 09:44:49.603417898 -0800 @@ -0,0 +1,954 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_multicast.c 1323 2004-12-11 02:36:04Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int mcast_debug_level; + +module_param(mcast_debug_level, int, 0644); +MODULE_PARM_DESC(mcast_debug_level, + "Enable multicast debug tracing if > 0"); +#endif + +static DECLARE_MUTEX(mcast_mutex); + +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ipoib_ah *ah; + + struct rb_node rb_node; + struct list_head list; + struct completion done; + + int query_id; + struct ib_sa_query *query; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct list_head neigh_list; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; + +struct ipoib_mcast_iter { + struct net_device *dev; + union ib_gid mgid; + unsigned long created; + unsigned int queuelen; + unsigned int complete; + unsigned int send_only; +}; + +static void ipoib_mcast_free(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tmp; + unsigned long flags; + + ipoib_dbg_mcast(netdev_priv(dev), + "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); + + if (mcast->ah) + ipoib_put_ah(mcast->ah); + + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + dev_kfree_skb_any(skb); + } + + kfree(mcast); +} + +static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev, + int can_sleep) +{ + struct ipoib_mcast *mcast; + + mcast = kmalloc(sizeof (*mcast), can_sleep ? 
GFP_KERNEL : GFP_ATOMIC); + if (!mcast) + return NULL; + + memset(mcast, 0, sizeof (*mcast)); + + init_completion(&mcast->done); + + mcast->dev = dev; + mcast->created = jiffies; + mcast->backoff = HZ; + mcast->logcount = 0; + + INIT_LIST_HEAD(&mcast->list); + INIT_LIST_HEAD(&mcast->neigh_list); + skb_queue_head_init(&mcast->pkt_queue); + + mcast->ah = NULL; + mcast->query = NULL; + + return mcast; +} + +static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->multicast_tree.rb_node; + + while (n) { + struct ipoib_mcast *mcast; + int ret; + + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + ret = memcmp(mgid->raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return mcast; + } + + return NULL; +} + +static int __ipoib_mcast_add(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->multicast_tree.rb_node, *pn = NULL; + + while (*n) { + struct ipoib_mcast *tmcast; + int ret; + + pn = *n; + tmcast = rb_entry(pn, struct ipoib_mcast, rb_node); + + ret = memcmp(mcast->mcmember.mgid.raw, tmcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&mcast->rb_node, pn, n); + rb_insert_color(&mcast->rb_node, &priv->multicast_tree); + + return 0; +} + +static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, + struct ib_sa_mcmember_rec *mcmember) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + mcast->mcmember = *mcmember; + + /* Set the cached Q_Key before we attach if it's the broadcast group */ + if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid))) + priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); + + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_warn(priv, "multicast group " IPOIB_GID_FMT + " already attached\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + return 0; + } + + ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret < 0) { + ipoib_warn(priv, "couldn't attach QP to multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); + return ret; + } + } + + { + struct ib_ah_attr av = { + .dlid = be16_to_cpu(mcast->mcmember.mlid), + .port_num = priv->port, + .sl = mcast->mcmember.sl, + .src_path_bits = 0, + .static_rate = 0, + .ah_flags = IB_AH_GRH, + .grh = { + .flow_label = be32_to_cpu(mcast->mcmember.flow_label), + .hop_limit = mcast->mcmember.hop_limit, + .sgid_index = 0, + .traffic_class = mcast->mcmember.traffic_class + } + }; + + av.grh.dgid = mcast->mcmember.mgid; + + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); + if (!mcast->ah) { + ipoib_warn(priv, "ib_address_create failed\n"); + } else { + ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT + " AV %p, LID 0x%04x, SL %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + mcast->ah->ah, + be16_to_cpu(mcast->mcmember.mlid), + mcast->mcmember.sl); + } + } + + /* actually send any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + if 
(dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + } + + return 0; +} + +static void +ipoib_mcast_sendonly_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + + if (!status) + ipoib_mcast_join_finish(mcast, mcmember); + else { + if (mcast->logcount++ < 20) + ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + /* Flush out any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + dev_kfree_skb_any(skb); + } + + /* Clear the busy flag so we try again */ + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + } + + complete(&mcast->done); +} + +static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { +#if 0 /* Some SMs don't support send-only yet */ + .join_state = 4 +#else + .join_state = 1 +#endif + }; + int ret = 0; + + if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { + ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n"); + return -ENODEV; + } + + if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); + return -EBUSY; + } + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 1000, GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast, &mcast->query); + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + ret); + } else { + ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT + ", starting join\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + mcast->query_id = ret; + } + + return ret; +} + +static void ipoib_mcast_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "join completion for " IPOIB_GID_FMT + " (status %d)\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { + mcast->backoff = HZ; + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + complete(&mcast->done); + return; + } + + if (status == -EINTR) { + complete(&mcast->done); + return; + } + + if (status && mcast->logcount++ < 20) { + if (status == -ETIMEDOUT || status == -EINTR) { + ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT + ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } else { + ipoib_warn(priv, "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } + } + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + mcast->query = NULL; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { + if (status == -ETIMEDOUT) + queue_work(ipoib_workqueue, &priv->mcast_task); + else + 
queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); + } else + complete(&mcast->done); + up(&mcast_mutex); + + return; +} + +static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, + int create) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + ib_sa_comp_mask comp_mask; + int ret = 0; + + ipoib_dbg_mcast(priv, "joining MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + comp_mask = + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + if (create) { + comp_mask |= + IB_SA_MCMEMBER_REC_QKEY | + IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + + rec.qkey = priv->broadcast->mcmember.qkey; + rec.sl = priv->broadcast->mcmember.sl; + rec.flow_label = priv->broadcast->mcmember.flow_label; + rec.traffic_class = priv->broadcast->mcmember.traffic_class; + } + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, + mcast->backoff * 1000, GFP_ATOMIC, + ipoib_mcast_join_complete, + mcast, &mcast->query); + + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, + mcast->backoff); + up(&mcast_mutex); + } else + mcast->query_id = ret; +} + +void ipoib_mcast_join_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) + return; + + if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid)) + ipoib_warn(priv, "ib_gid_entry_get() failed\n"); + else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + if (!priv->broadcast) { + priv->broadcast = ipoib_mcast_alloc(dev, 1); + if (!priv->broadcast) { + ipoib_warn(priv, "failed to allocate broadcast group\n"); + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, HZ); + up(&mcast_mutex); + return; + } + + memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid)); + + spin_lock_irq(&priv->lock); + __ipoib_mcast_add(dev, priv->broadcast); + spin_unlock_irq(&priv->lock); + } + + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + ipoib_mcast_join(dev, priv->broadcast, 0); + return; + } + + while (1) { + struct ipoib_mcast *mcast = NULL; + + spin_lock_irq(&priv->lock); + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + /* Found the next unjoined group */ + break; + } + } + spin_unlock_irq(&priv->lock); + + if (&mcast->list == &priv->multicast_list) { + /* All done */ + break; + } + + ipoib_mcast_join(dev, mcast, 1); + return; + } + + { + struct ib_port_attr attr; + + if (!ib_query_port(priv->ca, priv->port, &attr)) + priv->local_lid = attr.lid; + else + ipoib_warn(priv, "ib_query_port failed\n"); + } + + priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - + 
IPOIB_ENCAP_LEN; + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); + + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + netif_carrier_on(dev); +} + +int ipoib_mcast_start_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "starting multicast thread\n"); + + down(&mcast_mutex); + if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + + return 0; +} + +int ipoib_mcast_stop_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + + down(&mcast_mutex); + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + cancel_delayed_work(&priv->mcast_task); + up(&mcast_mutex); + + flush_workqueue(ipoib_workqueue); + + if (priv->broadcast && priv->broadcast->query) { + ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); + priv->broadcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for bcast\n"); + wait_for_completion(&priv->broadcast->done); + } + + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + } + + return 0; +} + +int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + int ret = 0; + + if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) + return 0; + + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + + /* + * Just make one shot at leaving and don't wait for a reply; + * if we fail, too bad. 
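+	 * (The completion callback passed to ib_sa_mcmember_rec_delete() below
+	 * is NULL, so the result of the query is simply discarded.)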
+ */ + ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 0, GFP_ATOMIC, NULL, + mcast, &mcast->query); + if (ret < 0) + ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " + "for leave (result = %d)\n", ret); + + return 0; +} + +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + mcast = __ipoib_mcast_find(dev, mgid); + if (!mcast) { + /* Let's create a new send only group now */ + ipoib_dbg_mcast(priv, "setting up send only multicast group for " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); + + mcast = ipoib_mcast_alloc(dev, 0); + if (!mcast) { + ipoib_warn(priv, "unable to allocate memory for " + "multicast structure\n"); + dev_kfree_skb_any(skb); + goto out; + } + + set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); + mcast->mcmember.mgid = *mgid; + __ipoib_mcast_add(dev, mcast); + list_add_tail(&mcast->list, &priv->multicast_list); + } + + if (!mcast->ah) { + if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) + skb_queue_tail(&mcast->pkt_queue, skb); + else + dev_kfree_skb_any(skb); + + if (mcast->query) + ipoib_dbg_mcast(priv, "no address vector, " + "but multicast join already started\n"); + else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) + ipoib_mcast_sendonly_join(mcast); + + /* + * If lookup completes between here and out:, don't + * want to send packet twice. + */ + mcast = NULL; + } + +out: + if (mcast && mcast->ah) { + if (skb->dst && + skb->dst->neighbour && + !*to_ipoib_neigh(skb->dst->neighbour)) { + struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + + if (neigh) { + kref_get(&mcast->ah->ref); + neigh->ah = mcast->ah; + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + list_add_tail(&neigh->list, &mcast->neigh_list); + } + } + + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + } + + spin_unlock_irqrestore(&priv->lock, flags); +} + +void ipoib_mcast_dev_flush(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + LIST_HEAD(remove_list); + struct ipoib_mcast *mcast, *tmcast, *nmcast; + unsigned long flags; + + ipoib_dbg_mcast(priv, "flushing multicast list\n"); + + spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->flags = + mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY); + + nmcast->mcmember.mgid = mcast->mcmember.mgid; + + /* Add the new group in before the to-be-destroyed group */ + list_add_tail(&nmcast->list, &mcast->list); + list_del_init(&mcast->list); + + rb_replace_node(&mcast->rb_node, &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&mcast->list, &remove_list); + } else { + ipoib_warn(priv, "could not reallocate multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + } + } + + if (priv->broadcast) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; + + rb_replace_node(&priv->broadcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&priv->broadcast->list, &remove_list); + } + + priv->broadcast = nmcast; + } + + spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry(mcast, &remove_list, 
list) { + ipoib_mcast_leave(dev, mcast); + ipoib_mcast_free(mcast); + } +} + +void ipoib_mcast_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + /* Delete broadcast since it will be recreated */ + if (priv->broadcast) { + ipoib_dbg_mcast(priv, "deleting broadcast group\n"); + + spin_lock_irqsave(&priv->lock, flags); + rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_mcast_leave(dev, priv->broadcast); + ipoib_mcast_free(priv->broadcast); + priv->broadcast = NULL; + } +} + +void ipoib_mcast_restart_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dev_mc_list *mclist; + struct ipoib_mcast *mcast, *tmcast; + LIST_HEAD(remove_list); + unsigned long flags; + + ipoib_dbg_mcast(priv, "restarting multicast task\n"); + + ipoib_mcast_stop_thread(dev); + + spin_lock_irqsave(&priv->lock, flags); + + /* + * Unfortunately, the networking core only gives us a list of all of + * the multicast hardware addresses. We need to figure out which ones + * are new and which ones have been removed + */ + + /* Clear out the found flag */ + list_for_each_entry(mcast, &priv->multicast_list, list) + clear_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + + /* Mark all of the entries that are found or don't exist */ + for (mclist = dev->mc_list; mclist; mclist = mclist->next) { + union ib_gid mgid; + + memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid); + + /* Add in the P_Key */ + mgid.raw[4] = (priv->pkey >> 8) & 0xff; + mgid.raw[5] = priv->pkey & 0xff; + + mcast = __ipoib_mcast_find(dev, &mgid); + if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + struct ipoib_mcast *nmcast; + + /* Not found or send-only group, let's add a new entry */ + ipoib_dbg_mcast(priv, "adding multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + + nmcast = ipoib_mcast_alloc(dev, 0); + if (!nmcast) { + ipoib_warn(priv, "unable to allocate memory for multicast structure\n"); + continue; + } + + set_bit(IPOIB_MCAST_FLAG_FOUND, &nmcast->flags); + + nmcast->mcmember.mgid = mgid; + + if (mcast) { + /* Destroy the send only entry */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + + rb_replace_node(&mcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + } else + __ipoib_mcast_add(dev, nmcast); + + list_add_tail(&nmcast->list, &priv->multicast_list); + } + + if (mcast) + set_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + } + + /* Remove all of the entries don't exist anymore */ + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rb_erase(&mcast->rb_node, &priv->multicast_tree); + + /* Move to the remove list */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + } + } + spin_unlock_irqrestore(&priv->lock, flags); + + /* We have to cancel outside of the spinlock */ + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(mcast->dev, mcast); + ipoib_mcast_free(mcast); + } + + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_mcast_start_thread(dev); +} + +struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev) +{ + struct ipoib_mcast_iter *iter; + + iter = kmalloc(sizeof *iter, GFP_KERNEL); 
+ if (!iter) + return NULL; + + iter->dev = dev; + memset(iter->mgid.raw, 0, sizeof iter->mgid); + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter) +{ + kfree(iter); +} + +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter) +{ + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); + struct rb_node *n; + struct ipoib_mcast *mcast; + int ret = 1; + + spin_lock_irq(&priv->lock); + + n = rb_first(&priv->multicast_tree); + + while (n) { + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + if (memcmp(iter->mgid.raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)) < 0) { + iter->mgid = mcast->mcmember.mgid; + iter->created = mcast->created; + iter->queuelen = skb_queue_len(&mcast->pkt_queue); + iter->complete = !!mcast->ah; + iter->send_only = !!(mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY)); + + ret = 0; + + break; + } + + n = rb_next(n); + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *mgid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only) +{ + *mgid = iter->mgid; + *created = iter->created; + *queuelen = iter->queuelen; + *complete = iter->complete; + *send_only = iter->send_only; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2004-12-13 09:44:49.634413332 -0800 @@ -0,0 +1,243 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id: ipoib_verbs.c 1308 2004-12-03 01:31:40Z roland $ + */ + +#include + +#include "ipoib.h" + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr *qp_attr; + int attr_mask; + int ret; + u16 pkey_index; + + ret = -ENOMEM; + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + goto out; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = -ENXIO; + goto out; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* set correct QKey for QP */ + qp_attr->qkey = priv->qkey; + attr_mask = IB_QP_QKEY; + ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); + goto out; + } + + /* attach QP to multicast group */ + down(&priv->mcast_mutex); + ret = ib_attach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "failed to attach to multicast group, ret = %d\n", ret); + +out: + kfree(qp_attr); + return ret; +} + +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + down(&priv->mcast_mutex); + ret = ib_detach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret); + + return ret; +} + +int ipoib_qp_create(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u16 pkey_index; + struct ib_qp_attr qp_attr; + int attr_mask; + + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. 
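+	 * (If the P_Key is not present in the cached table, ib_cached_pkey_find()
+	 * fails below, IPOIB_PKEY_ASSIGNED is cleared and no QP is created.)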
+ */ + ret = ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index); + if (ret) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return ret; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qkey = 0; + qp_attr.port_num = priv->port; + qp_attr.pkey_index = pkey_index; + attr_mask = + IB_QP_QKEY | + IB_QP_PORT | + IB_QP_PKEY_INDEX | + IB_QP_STATE; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTR; + /* Can't set this in a INIT->RTR transition */ + attr_mask &= ~IB_QP_PORT; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTS; + qp_attr.sq_psn = 0; + attr_mask |= IB_QP_SQ_PSN; + attr_mask &= ~IB_QP_PKEY_INDEX; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret); + goto out_fail; + } + + return 0; + +out_fail: + ib_destroy_qp(priv->qp); + priv->qp = NULL; + + return -EINVAL; +} + +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr init_attr = { + .cap = { + .max_send_wr = IPOIB_TX_RING_SIZE, + .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .sq_sig_type = IB_SIGNAL_ALL_WR, + .rq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UD + }; + + priv->pd = ib_alloc_pd(priv->ca); + if (IS_ERR(priv->pd)) { + printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); + return -ENODEV; + } + + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, + IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + if (IS_ERR(priv->cq)) { + printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); + goto out_free_pd; + } + + if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP)) + goto out_free_cq; + + priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(priv->mr)) { + printk(KERN_WARNING "%s: ib_reg_phys_mr failed\n", ca->name); + goto out_free_cq; + } + + init_attr.send_cq = priv->cq; + init_attr.recv_cq = priv->cq, + + priv->qp = ib_create_qp(priv->pd, &init_attr); + if (IS_ERR(priv->qp)) { + printk(KERN_WARNING "%s: failed to create QP\n", ca->name); + goto out_free_mr; + } + + priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff; + priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; + priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; + + return 0; + +out_free_mr: + ib_dereg_mr(priv->mr); + +out_free_cq: + ib_destroy_cq(priv->cq); + +out_free_pd: + ib_dealloc_pd(priv->pd); + return -ENODEV; +} + +void ipoib_transport_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->qp) { + if (ib_destroy_qp(priv->qp)) + ipoib_warn(priv, "ib_qp_destroy failed\n"); + + priv->qp = NULL; + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + } + + if (ib_dereg_mr(priv->mr)) + ipoib_warn(priv, "ib_dereg_mr failed\n"); + + if (ib_destroy_cq(priv->cq)) + ipoib_warn(priv, "ib_cq_destroy failed\n"); + + if (ib_dealloc_pd(priv->pd)) + ipoib_warn(priv, "ib_dealloc_pd failed\n"); +} + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record) +{ + struct ipoib_dev_priv *priv = + container_of(handler, struct ipoib_dev_priv, event_handler); + + if (record->event == IB_EVENT_PORT_ACTIVE || + 
record->event == IB_EVENT_LID_CHANGE || + record->event == IB_EVENT_SM_CHANGE) { + ipoib_dbg(priv, "Port active event\n"); + schedule_work(&priv->flush_task); + } +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2004-12-13 09:44:49.660409502 -0800 @@ -0,0 +1,166 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id: ipoib_vlan.c 1271 2004-11-18 22:11:29Z roland $ + */ + +#include +#include + +#include +#include +#include + +#include + +#include "ipoib.h" + +static ssize_t show_parent(struct class_device *class_dev, char *buf) +{ + struct net_device *dev = + container_of(class_dev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return sprintf(buf, "%s\n", priv->parent->name); +} +static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv; + char intf_name[IFNAMSIZ]; + int result; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + + /* + * First ensure this isn't a duplicate. We check the parent device and + * then all of the child interfaces to make sure the Pkey doesn't match. 
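+	 * (A duplicate P_Key would also collide on the interface name, since
+	 * child interfaces are named <parent>.<P_Key> below.)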
+ */ + if (ppriv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + + list_for_each_entry(priv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + } + + snprintf(intf_name, sizeof intf_name, "%s.%04x", + ppriv->dev->name, pkey); + priv = ipoib_intf_alloc(intf_name); + if (!priv) { + result = -ENOMEM; + goto err; + } + + set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); + + priv->pkey = pkey; + + memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN); + priv->dev->broadcast[8] = pkey >> 8; + priv->dev->broadcast[9] = pkey & 0xff; + + result = ipoib_dev_init(priv->dev, ppriv->ca, ppriv->port); + if (result < 0) { + ipoib_warn(ppriv, "failed to initialize subinterface: " + "device %s, port %d", + ppriv->ca->name, ppriv->port); + goto device_init_failed; + } + + result = register_netdev(priv->dev); + if (result) { + ipoib_warn(priv, "failed to initialize; error %i", result); + goto register_failed; + } + + priv->parent = ppriv->dev; + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_parent)) + goto sysfs_failed; + + list_add_tail(&priv->list, &ppriv->child_intfs); + + up(&ppriv->vlan_mutex); + + return 0; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +err: + up(&ppriv->vlan_mutex); + return result; +} + +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv, *tpriv; + int ret = -ENOENT; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + + list_del(&priv->list); + + kfree(priv); + + ret = 0; + break; + } + } + up(&ppriv->vlan_mutex); + + return ret; +} From roland at topspin.com Mon Dec 13 10:09:59 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:59 -0800 Subject: [openib-general] [PATCH][v3][19/21] Document InfiniBand ioctl use In-Reply-To: <20041213109.dhdvNgaA6X6YHRdV@topspin.com> Message-ID: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> Add the 0x1b ioctl magic number used by ib_umad module to Documentation/ioctl-number.txt. Signed-off-by: Roland Dreier --- linux-bk.orig/Documentation/ioctl-number.txt 2004-12-11 15:16:34.000000000 -0800 +++ linux-bk/Documentation/ioctl-number.txt 2004-12-13 09:44:50.671260584 -0800 @@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem From roland at topspin.com Mon Dec 13 10:09:57 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:09:57 -0800 Subject: [openib-general] [PATCH][v3][18/21] Add InfiniBand userspace MAD support In-Reply-To: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> Message-ID: <20041213109.dhdvNgaA6X6YHRdV@topspin.com> Add a driver that provides a character special device for each InfiniBand port. This device allows userspace to send and receive MADs via write() and read() (with some control operations implemented as ioctls). 
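For illustration only (not part of this patch), a minimal userspace consumer might look roughly like the sketch below. The device node name is an assumption here, agent registration through the module's ioctls is omitted, and a raw buffer stands in for the struct ib_user_mad layout defined in the ib_user_mad.h header (not shown in this hunk).

/* Hypothetical sketch: block until one MAD arrives on the umad device and
 * read it.  Assumes the node is /dev/infiniband/umad0 and that a MAD agent
 * has already been registered via the driver's ioctls (not shown here). */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct pollfd pfd;
	char buf[4096];		/* comfortably larger than one struct ib_user_mad */
	ssize_t n;
	int fd;

	fd = open("/dev/infiniband/umad0", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	pfd.fd = fd;
	pfd.events = POLLIN;

	if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
		n = read(fd, buf, sizeof buf);	/* each read() returns one MAD */
		if (n > 0)
			printf("received a %zd-byte MAD\n", n);
	}

	close(fd);
	return 0;
}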
All operations are 32/64 clean and have been tested with 32-bit userspace running on a ppc64 kernel. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-13 09:44:43.579305364 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-13 09:44:50.192331140 -0800 @@ -1,22 +1,12 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += \ - ib_core.o \ - ib_mad.o \ - ib_sa.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o ib_umad.o -ib_core-objs := \ - packer.o \ - ud_header.o \ - verbs.o \ - sysfs.o \ - device.o \ - fmr_pool.o \ - cache.o +ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ + device.o fmr_pool.o cache.o -ib_mad-objs := \ - mad.o \ - smi.o \ - agent.o +ib_mad-y := mad.o smi.o agent.o -ib_sa-objs := sa_query.o +ib_sa-y := sa_query.o + +ib_umad-y := user_mad.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/user_mad.c 2004-12-13 09:44:50.232325248 -0800 @@ -0,0 +1,694 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand userspace MAD packet access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UMAD_MAX_PORTS = 256, + IB_UMAD_MAX_AGENTS = 32 +}; + +struct ib_umad_port { + int devnum; + struct cdev dev; + struct class_device class_dev; + struct ib_device *ib_dev; + struct ib_umad_device *umad_dev; + u8 port_num; +}; + +struct ib_umad_device { + int start_port, end_port; + struct kref ref; + struct ib_umad_port port[0]; +}; + +struct ib_umad_file { + struct ib_umad_port *port; + spinlock_t recv_lock; + struct list_head recv_list; + wait_queue_head_t recv_wait; + struct rw_semaphore agent_mutex; + struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; +}; + +struct ib_umad_packet { + struct ib_user_mad mad; + struct ib_ah *ah; + struct list_head list; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static dev_t base_dev; +static spinlock_t map_lock; +static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS); + +static void ib_umad_add_one(struct ib_device *device); +static void ib_umad_remove_one(struct ib_device *device); + +static int queue_packet(struct ib_umad_file *file, + struct ib_mad_agent *agent, + struct ib_umad_packet *packet) +{ + int ret = 1; + + down_read(&file->agent_mutex); + for (packet->mad.id = 0; + packet->mad.id < IB_UMAD_MAX_AGENTS; + packet->mad.id++) + if (agent == file->agent[packet->mad.id]) { + spin_lock_irq(&file->recv_lock); + list_add_tail(&packet->list, &file->recv_list); + spin_unlock_irq(&file->recv_lock); + wake_up_interruptible(&file->recv_wait); + ret = 0; + break; + } + + up_read(&file->agent_mutex); + + return ret; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *send_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet = + (void *) (unsigned long) send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + ib_destroy_ah(packet->ah); + + if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { + packet->mad.status = ETIMEDOUT; + + if (!queue_packet(file, agent, packet)) + return; + } + + kfree(packet); +} + +static void recv_handler(struct ib_mad_agent *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet; + + if (mad_recv_wc->wc->status != IB_WC_SUCCESS) + goto out; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + goto out; + + memset(packet, 0, sizeof *packet); + + memcpy(packet->mad.data, mad_recv_wc->recv_buf.mad, sizeof packet->mad.data); + packet->mad.status = 0; + packet->mad.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); + packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); + packet->mad.sl = mad_recv_wc->wc->sl; + packet->mad.path_bits = mad_recv_wc->wc->dlid_path_bits; + packet->mad.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); + if (packet->mad.grh_present) { + /* XXX parse GRH */ + packet->mad.gid_index = 0; + packet->mad.hop_limit = 0; + packet->mad.traffic_class = 0; + memset(packet->mad.gid, 0, 16); + packet->mad.flow_label = 0; + } + + if (queue_packet(file, agent, packet)) + kfree(packet); + +out: + ib_free_recv_mad(mad_recv_wc); +} + +static ssize_t ib_umad_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = 
filp->private_data; + struct ib_umad_packet *packet; + ssize_t ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + spin_lock_irq(&file->recv_lock); + + while (list_empty(&file->recv_list)) { + spin_unlock_irq(&file->recv_lock); + + if (filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + if (wait_event_interruptible(file->recv_wait, + !list_empty(&file->recv_list))) + return -ERESTARTSYS; + + spin_lock_irq(&file->recv_lock); + } + + packet = list_entry(file->recv_list.next, struct ib_umad_packet, list); + list_del(&packet->list); + + spin_unlock_irq(&file->recv_lock); + + if (copy_to_user(buf, &packet->mad, sizeof packet->mad)) + ret = -EFAULT; + else + ret = sizeof packet->mad; + + kfree(packet); + return ret; +} + +static ssize_t ib_umad_write(struct file *filp, const char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + struct ib_mad_agent *agent; + struct ib_ah_attr ah_attr; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + }; + int ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + return -ENOMEM; + + if (copy_from_user(&packet->mad, buf, sizeof packet->mad)) { + kfree(packet); + return -EFAULT; + } + + if (packet->mad.id < 0 || packet->mad.id >= IB_UMAD_MAX_AGENTS) { + ret = -EINVAL; + goto err; + } + + down_read(&file->agent_mutex); + + agent = file->agent[packet->mad.id]; + if (!agent) { + ret = -EINVAL; + goto err_up; + } + + ((struct ib_mad_hdr *) packet->mad.data)->tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & + 0xffffffff)); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = be16_to_cpu(packet->mad.lid); + ah_attr.sl = packet->mad.sl; + ah_attr.src_path_bits = packet->mad.path_bits; + ah_attr.port_num = file->port->port_num; + /* XXX handle GRH */ + + packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); + if (IS_ERR(packet->ah)) { + ret = PTR_ERR(packet->ah); + goto err_up; + } + + gather_list.addr = dma_map_single(agent->device->dma_device, + packet->mad.data, + sizeof packet->mad.data, + DMA_TO_DEVICE); + gather_list.length = sizeof packet->mad.data; + gather_list.lkey = file->mr[packet->mad.id]->lkey; + pci_unmap_addr_set(packet, mapping, gather_list.addr); + + wr.wr.ud.mad_hdr = (struct ib_mad_hdr *) packet->mad.data; + wr.wr.ud.ah = packet->ah; + wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); + wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); + wr.wr.ud.timeout_ms = packet->mad.timeout_ms; + + wr.wr_id = (unsigned long) packet; + + ret = ib_post_send_mad(agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + goto err_up; + } + + up_read(&file->agent_mutex); + + return sizeof packet->mad; + +err_up: + up_read(&file->agent_mutex); + +err: + kfree(packet); + return ret; +} + +static unsigned int ib_umad_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ib_umad_file *file = filp->private_data; + + /* we will always be able to post a MAD send */ + unsigned int mask = POLLOUT | POLLWRNORM; + + poll_wait(filp, &file->recv_wait, wait); + + if (!list_empty(&file->recv_list)) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_umad_reg_agent(struct ib_umad_file 
*file, unsigned long arg) +{ + struct ib_user_mad_reg_req ureq; + struct ib_mad_reg_req req; + struct ib_mad_agent *agent; + int agent_id; + int ret; + + down_write(&file->agent_mutex); + + if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { + ret = -EFAULT; + goto out; + } + + if (ureq.qpn != 0 && ureq.qpn != 1) { + ret = -EINVAL; + goto out; + } + + for (agent_id = 0; agent_id < IB_UMAD_MAX_AGENTS; ++agent_id) + if (!file->agent[agent_id]) + goto found; + + ret = -ENOMEM; + goto out; + +found: + req.mgmt_class = ureq.mgmt_class; + req.mgmt_class_version = ureq.mgmt_class_version; + memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + memcpy(req.oui, ureq.oui, sizeof req.oui); + + agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, + ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI, + &req, 0, send_handler, recv_handler, + file); + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); + goto out; + } + + file->agent[agent_id] = agent; + + file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(file->mr[agent_id])) { + ret = -ENOMEM; + goto err; + } + + if (put_user(agent_id, + (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { + ret = -EFAULT; + goto err_mr; + } + + ret = 0; + goto out; + +err_mr: + ib_dereg_mr(file->mr[agent_id]); + +err: + file->agent[agent_id] = NULL; + ib_unregister_mad_agent(agent); + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) +{ + u32 id; + int ret = 0; + + down_write(&file->agent_mutex); + + if (get_user(id, (u32 __user *) arg)) { + ret = -EFAULT; + goto out; + } + + if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { + ret = -EINVAL; + goto out; + } + + ib_dereg_mr(file->mr[id]); + ib_unregister_mad_agent(file->agent[id]); + file->agent[id] = NULL; + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case IB_USER_MAD_REGISTER_AGENT: + return ib_umad_reg_agent(filp->private_data, arg); + case IB_USER_MAD_UNREGISTER_AGENT: + return ib_umad_unreg_agent(filp->private_data, arg); + default: + return -ENOIOCTLCMD; + } +} + +static int ib_umad_open(struct inode *inode, struct file *filp) +{ + struct ib_umad_port *port = + container_of(inode->i_cdev, struct ib_umad_port, dev); + struct ib_umad_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + memset(file, 0, sizeof *file); + + spin_lock_init(&file->recv_lock); + init_rwsem(&file->agent_mutex); + INIT_LIST_HEAD(&file->recv_list); + init_waitqueue_head(&file->recv_wait); + + file->port = port; + filp->private_data = file; + + return 0; +} + +static int ib_umad_close(struct inode *inode, struct file *filp) +{ + struct ib_umad_file *file = filp->private_data; + int i; + + for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) + if (file->agent[i]) { + ib_dereg_mr(file->mr[i]); + ib_unregister_mad_agent(file->agent[i]); + } + + kfree(file); + + return 0; +} + +static struct file_operations umad_fops = { + .owner = THIS_MODULE, + .read = ib_umad_read, + .write = ib_umad_write, + .poll = ib_umad_poll, + .ioctl = ib_umad_ioctl, + .open = ib_umad_open, + .release = ib_umad_close +}; + +static struct ib_client umad_client = { + .name = "umad", + .add = ib_umad_add_one, + .remove = ib_umad_remove_one +}; + +static ssize_t show_dev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port 
= + container_of(class_dev, struct ib_umad_port, class_dev); + + return print_dev_t(buf, port->dev.dev); +} +static CLASS_DEVICE_ATTR(dev, S_IRUGO, show_dev, NULL); + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return sprintf(buf, "%s\n", port->ib_dev->name); +} +static CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); + +static ssize_t show_port(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return sprintf(buf, "%d\n", port->port_num); +} +static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); + +static void ib_umad_release_dev(struct kref *ref) +{ + struct ib_umad_device *dev = + container_of(ref, struct ib_umad_device, ref); + + kfree(dev); +} + +static void ib_umad_release_port(struct class_device *class_dev) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + cdev_del(&port->dev); + clear_bit(port->devnum, dev_map); + kref_put(&port->umad_dev->ref, ib_umad_release_dev); +} + +static struct class umad_class = { + .name = "infiniband_mad", + .release = ib_umad_release_port +}; + +static ssize_t show_abi_version(struct class *class, char *buf) +{ + return sprintf(buf, "%d\n", IB_USER_MAD_ABI_VERSION); +} +static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); + +static void ib_umad_add_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + umad_dev = kmalloc(sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port), + GFP_KERNEL); + if (!umad_dev) + return; + + memset(umad_dev, 0, sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port)); + + kref_init(&umad_dev->ref); + + umad_dev->start_port = s; + umad_dev->end_port = e; + + for (i = s; i <= e; ++i) { + umad_dev->port[i - s].umad_dev = umad_dev; + kref_get(&umad_dev->ref); + + spin_lock(&map_lock); + umad_dev->port[i - s].devnum = + find_first_zero_bit(dev_map, IB_UMAD_MAX_PORTS); + if (umad_dev->port[i - s].devnum >= IB_UMAD_MAX_PORTS) { + spin_unlock(&map_lock); + goto err; + } + set_bit(umad_dev->port[i - s].devnum, dev_map); + spin_unlock(&map_lock); + + umad_dev->port[i - s].ib_dev = device; + umad_dev->port[i - s].port_num = i; + + cdev_init(&umad_dev->port[i - s].dev, &umad_fops); + umad_dev->port[i - s].dev.owner = THIS_MODULE; + kobject_set_name(&umad_dev->port[i - s].dev.kobj, + "umad%d", umad_dev->port[i - s].devnum); + if (cdev_add(&umad_dev->port[i - s].dev, base_dev + + umad_dev->port[i - s].devnum, 1)) + goto err; + + umad_dev->port[i - s].class_dev.class = &umad_class; + umad_dev->port[i - s].class_dev.dev = device->dma_device; + snprintf(umad_dev->port[i - s].class_dev.class_id, + BUS_ID_SIZE, "umad%d", umad_dev->port[i - s].devnum); + if (class_device_register(&umad_dev->port[i - s].class_dev)) + goto err_class; + + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_dev)) + goto err_class; + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_ibdev)) + goto err_class; + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_port)) + goto err_class; + } + + ib_set_client_data(device, &umad_client, umad_dev); + + return; + +err_class: + cdev_del(&umad_dev->port[i - s].dev); + clear_bit(umad_dev->port[i - 
s].devnum, dev_map); + +err: + while (--i >= s) + class_device_unregister(&umad_dev->port[i - s].class_dev); + + kref_put(&umad_dev->ref, ib_umad_release_dev); +} + +static void ib_umad_remove_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); + int i; + + if (!umad_dev) + return; + + for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) + class_device_unregister(&umad_dev->port[i].class_dev); + + kref_put(&umad_dev->ref, ib_umad_release_dev); +} + +static int __init ib_umad_init(void) +{ + int ret; + + spin_lock_init(&map_lock); + + ret = alloc_chrdev_region(&base_dev, 0, IB_UMAD_MAX_PORTS, + "infiniband_mad"); + if (ret) { + printk(KERN_ERR "user_mad: couldn't get device number\n"); + goto out; + } + + ret = class_register(&umad_class); + if (ret) { + printk(KERN_ERR "user_mad: couldn't create class infiniband_mad\n"); + goto out_chrdev; + } + + ret = class_create_file(&umad_class, &class_attr_abi_version); + if (ret) { + printk(KERN_ERR "user_mad: couldn't create abi_version attribute\n"); + goto out_class; + } + + ret = ib_register_client(&umad_client); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ib_umad client\n"); + goto out_class; + } + + return 0; + +out_class: + class_unregister(&umad_class); + +out_chrdev: + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); + +out: + return ret; +} + +static void __exit ib_umad_cleanup(void) +{ + ib_unregister_client(&umad_client); + class_unregister(&umad_class); + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); +} + +module_init(ib_umad_init); +module_exit(ib_umad_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_user_mad.h 2004-12-13 09:44:50.261320976 -0800 @@ -0,0 +1,112 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef IB_USER_MAD_H +#define IB_USER_MAD_H + +#include +#include + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. + */ +#define IB_USER_MAD_ABI_VERSION 2 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). 
+ */ + +/** + * ib_user_mad - MAD packet + * @data - Contents of MAD + * @id - ID of agent MAD received with/to be sent with + * @status - 0 on successful receive, ETIMEDOUT if no response + * received (transaction ID in data[] will be set to TID of original + * request) (ignored on send) + * @timeout_ms - Milliseconds to wait for response (unset on receive) + * @qpn - Remote QP number received from/to be sent to + * @qkey - Remote Q_Key to be sent with (unset on receive) + * @lid - Remote lid received from/to be sent to + * @sl - Service level received with/to be sent with + * @path_bits - Local path bits received with/to be sent with + * @grh_present - If set, GRH was received/should be sent + * @gid_index - Local GID index to send with (unset on receive) + * @hop_limit - Hop limit in GRH + * @traffic_class - Traffic class in GRH + * @gid - Remote GID in GRH + * @flow_label - Flow label in GRH + * + * All multi-byte quantities are stored in network (big endian) byte order. + */ +struct ib_user_mad { + __u8 data[256]; + __u32 id; + __u32 status; + __u32 timeout_ms; + __u32 qpn; + __u32 qkey; + __u16 lid; + __u8 sl; + __u8 path_bits; + __u8 grh_present; + __u8 gid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 gid[16]; + __u32 flow_label; +}; + +/** + * ib_user_mad_reg_req - MAD registration request + * @id - Set by the kernel; used to identify agent in future requests. + * @qpn - Queue pair number; must be 0 or 1. + * @method_mask - The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + * @mgmt_class - Indicates which management class of MADs should be receive + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version - Indicates which version of MADs for the given + * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. + */ +struct ib_user_mad_reg_req { + __u32 id; + __u32 method_mask[4]; + __u8 qpn; + __u8 mgmt_class; + __u8 mgmt_class_version; + __u8 oui[3]; +}; + +#define IB_IOCTL_MAGIC 0x1b + +#define IB_USER_MAD_REGISTER_AGENT _IOWR(IB_IOCTL_MAGIC, 1, \ + struct ib_user_mad_reg_req) + +#define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 2, __u32) + +#endif /* IB_USER_MAD_H */ From roland at topspin.com Mon Dec 13 10:10:04 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:10:04 -0800 Subject: [openib-general] [PATCH][v3][20/21] Add InfiniBand Documentation files In-Reply-To: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> Message-ID: <200412131010.qyAMW5NxoiM4CntC@topspin.com> Add files to Documentation/infiniband that describe the tree under /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/ipoib.txt 2004-12-13 09:44:50.964217426 -0800 @@ -0,0 +1,56 @@ +IP OVER INFINIBAND + + The ib_ipoib driver is an implementation of the IP over InfiniBand + protocol as specified by the latest Internet-Drafts issued by the + IETF ipoib working group. It is a "native" implementation in the + sense of setting the interface type to ARPHRD_INFINIBAND and the + hardware address length to 20 (earlier proprietary implementations + masqueraded to the kernel as ethernet interfaces). + +Partitions and P_Keys + + When the IPoIB driver is loaded, it creates one interface for each + port using the P_Key at index 0. 
To create an interface with a + different P_Key, write the desired P_Key into the main interface's + /sys/class/net//create_child file. For example: + + echo 0x8001 > /sys/class/net/ib0/create_child + + This will create an interface named ib0.8001 with P_Key 0x8001. To + remove a subinterface, use the "delete_child" file: + + echo 0x8001 > /sys/class/net/ib0/delete_child + + The P_Key for any interface is given by the "pkey" file, and the + main interface for a subinterface is in "parent." + +Debugging Information + + By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set + to 'y', tracing messages are compiled into the driver. They are + turned on by setting the module parameters debug_level and + mcast_debug_level to 1. These parameters can be controlled at + runtime through files in /sys/module/ib_ipoib/. + + CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + virtual filesystem. By mounting this filesystem, for example with + + mkdir -p /ipoib_debugfs + mount -t ipoib_debugfs none /ipoib_debufs + + it is possible to get statistics about multicast groups from the + files /ipoib_debugfs/ib0_mcg and so on. + + The performance impact of this option is negligible, so it + is safe to enable this option with debug_level set to 0 for normal + operation. + + CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in + the data path when data_debug_level is set to 1. However, even with + the output disabled, enabling this configuration option will affect + performance, because it adds tests to the fast path. + +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-12-13 09:44:51.008210945 -0800 @@ -0,0 +1,63 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port. For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. 
+ +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-12-13 09:44:51.441147165 -0800 @@ -0,0 +1,81 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%k" + + can be used. This will create a device node named + + /dev/infiniband/umad0 + + for the first port, and so on. The InfiniBand device and port + associated with this device can be determined from the files + + /sys/class/infiniband_mad/umad0/ibdev + /sys/class/infiniband_mad/umad0/port From roland at topspin.com Mon Dec 13 10:10:07 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:10:07 -0800 Subject: [openib-general] [PATCH][v3][21/21] InfiniBand MAINTAINERS entry In-Reply-To: <200412131010.qyAMW5NxoiM4CntC@topspin.com> Message-ID: <200412131010.czsQg0IIemnlj3gy@topspin.com> Add OpenIB maintainers information to MAINTAINERS. 
Signed-off-by: Roland Dreier --- linux-bk.orig/MAINTAINERS 2004-12-11 15:16:16.000000000 -0800 +++ linux-bk/MAINTAINERS 2004-12-13 09:44:51.816091929 -0800 @@ -1081,6 +1081,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From halr at voltaire.com Mon Dec 13 10:16:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Dec 2004 20:16:23 +0200 Subject: [openib-general] Re: User MAD support for cancel MAD Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175AFF@taurus.voltaire.com> [Sean wrote:] > We could argue that cancel MAD should work based on TID, management class, > and SGID/SLID, rather than simply the TID. Are you saying to have other forms of MAD like cancel by mgmt_class, cancel by SGID and/or SLID in addition to cancel by TID ? We could do this if there is a need. I don't think that need has been demonstrated (at least yet). -- Hal From mshefty at ichips.intel.com Mon Dec 13 10:28:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Dec 2004 10:28:18 -0800 Subject: [openib-general] Re: User MAD support for cancel MAD In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175AFF@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175AFF@taurus.voltaire.com> Message-ID: <41BDDF42.2080507@ichips.intel.com> Hal Rosenstock wrote: > [Sean wrote:] > >>We could argue that cancel MAD should work based on TID, management class, >>and SGID/SLID, rather than simply the TID. > > > Are you saying to have other forms of MAD like cancel by mgmt_class, cancel > by SGID and/or SLID in addition to cancel by TID ? We could do this if there > is a need. I don't think that need has been demonstrated (at least yet). No - I'm not suggesting that we have more ways to cancel a MAD. I was simply pointing out that we _could_ have cancel_mad work based on TID + mgmt_class + SGID/SLID (which would match uniqueness as defined by the spec), rather than just TID. I think canceling MADs based on the wr_id is sufficient. - Sean From tduffy at sun.com Mon Dec 13 10:44:24 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 13 Dec 2004 10:44:24 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> References: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> Message-ID: <1102963464.9258.11.camel@duffman> On Mon, 2004-12-13 at 10:09 -0800, Roland Dreier wrote: > --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-13 09:44:43.936252779 -0800 > +++ linux-bk/drivers/infiniband/Kconfig 2004-12-13 09:44:49.385450009 -0800 > @@ -2,7 +2,6 @@ > > config INFINIBAND > tristate "InfiniBand support" > - default n > ---help--- > Core support for InfiniBand (IB). Make sure to also select > any protocols you wish to use as well as drivers for your Is there a reason why you put this in in an earlier patch and then take it out later? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Mon Dec 13 10:48:07 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 13 Dec 2004 10:48:07 -0800 Subject: [openib-general] [PATCH][v3][20/21] Add InfiniBand Documentation files In-Reply-To: <200412131010.qyAMW5NxoiM4CntC@topspin.com> References: <200412131010.qyAMW5NxoiM4CntC@topspin.com> Message-ID: <1102963687.9258.14.camel@duffman> On Mon, 2004-12-13 at 10:10 -0800, Roland Dreier wrote: > Add files to Documentation/infiniband that describe the tree under > /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Do you want to add in Hal's ipoib faq as well? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Mon Dec 13 10:49:36 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 10:49:36 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <1102963464.9258.11.camel@duffman> (Tom Duffy's message of "Mon, 13 Dec 2004 10:44:24 -0800") References: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> <1102963464.9258.11.camel@duffman> Message-ID: <52mzwi58zj.fsf@topspin.com> Tom> Is there a reason why you put this in in an earlier patch and Tom> then take it out later? I guess the reasons are stupidity and bad patch scripts... Doesn't hurt for now, will be fixed in future versions. - R. From tduffy at sun.com Mon Dec 13 10:54:29 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 13 Dec 2004 10:54:29 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <52mzwi58zj.fsf@topspin.com> References: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> <1102963464.9258.11.camel@duffman> <52mzwi58zj.fsf@topspin.com> Message-ID: <1102964069.9258.20.camel@duffman> On Mon, 2004-12-13 at 10:49 -0800, Roland Dreier wrote: > Tom> Is there a reason why you put this in in an earlier patch and > Tom> then take it out later? > > I guess the reasons are stupidity and bad patch scripts... > > Doesn't hurt for now, will be fixed in future versions. Speaking of nits, there are also some formatting issues with the Makefiles that changes in the later patches... But, the end result is you get the "correct" formatting if you apply all the patches. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Mon Dec 13 11:00:30 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 11:00:30 -0800 Subject: [openib-general] [PATCH][v3][20/21] Add InfiniBand Documentation files In-Reply-To: <1102963687.9258.14.camel@duffman> (Tom Duffy's message of "Mon, 13 Dec 2004 10:48:07 -0800") References: <200412131010.qyAMW5NxoiM4CntC@topspin.com> <1102963687.9258.14.camel@duffman> Message-ID: <52is7658hd.fsf@topspin.com> Tom> Do you want to add in Hal's ipoib faq as well? Good idea. I'll fix up the formatting and add that to the next version as well. - R. 
From roland at topspin.com Mon Dec 13 11:00:48 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 11:00:48 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <1102964069.9258.20.camel@duffman> (Tom Duffy's message of "Mon, 13 Dec 2004 10:54:29 -0800") References: <20041213109.JT1ejUdkRIUXbWOm@topspin.com> <1102963464.9258.11.camel@duffman> <52mzwi58zj.fsf@topspin.com> <1102964069.9258.20.camel@duffman> Message-ID: <52ekhu58gv.fsf@topspin.com> Tom> Speaking of nits, there are also some formatting issues with Tom> the Makefiles that changes in the later patches... Thanks, I've fixed those intermediate versions as well. - R. From halr at voltaire.com Mon Dec 13 12:19:09 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Dec 2004 22:19:09 +0200 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand)driver Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B0C@taurus.voltaire.com> Hi Roland, [You wrote:] > The IPoIB protocol/encapsulation is described in the Internet-Drafts > http://www.ietf.org/internet-drafts/draft-ietf-ipoib-architecture-04.txt > http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-07.txt The latest I-D is now http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Also, isn't DHCP over IB (http://www.ietf.org/internet-drafts/draft-ietf-ipoib-dhcp-over-infiniband-07.txt) also supported ? If so, is that part of this or some other patch being submitted ? Thanks. -- Hal From mst at mellanox.co.il Mon Dec 13 12:14:59 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 13 Dec 2004 22:14:59 +0200 Subject: [openib-general] Re: User MAD support for cancel MAD In-Reply-To: References: Message-ID: <20041213201459.GA24197@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: User MAD support for cancel MAD": > > The interface issue is not so trivial. An IOCTL may do, but the problem > is the parameter of the IOCTL. The most straight forward way is to > specify a TID to cancel. This requires the usermode to avoid sending the > same mad until it is timeouts. And here I thought it is the kernel that selects the TID. No? mst From mshefty at ichips.intel.com Mon Dec 13 12:27:35 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Dec 2004 12:27:35 -0800 Subject: [openib-general] Re: User MAD support for cancel MAD In-Reply-To: <20041213201459.GA24197@mellanox.co.il> References: <20041213201459.GA24197@mellanox.co.il> Message-ID: <41BDFB37.9070800@ichips.intel.com> Michael S. Tsirkin wrote: > Hello! > Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: User MAD support for cancel MAD": > >>The interface issue is not so trivial. An IOCTL may do, but the problem >>is the parameter of the IOCTL. The most straight forward way is to >>specify a TID to cancel. This requires the usermode to avoid sending the >>same mad until it is timeouts. > > > And here I thought it is the kernel that selects the TID. > No? The MAD layer in the kernel assigns the only half of the TID. This is done to ensure that TIDs are unique between clients. Also, as a side note (and not a direct response), the current ib_cancel_mad API cancels MADs based on work request ID, not TID. 
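To make "half of the TID" concrete -- this mirrors what ib_umad_write()
in the user_mad patch does on every send; mad_hdr below is shorthand for
the MAD header at the start of packet->mad.data:

    /* upper 32 bits: kernel-assigned agent->hi_tid;
     * lower 32 bits: whatever TID userspace put in the MAD */
    mad_hdr->tid = cpu_to_be64(((u64) agent->hi_tid) << 32 |
                               (be64_to_cpu(mad_hdr->tid) & 0xffffffff));

so two agents cannot generate colliding TIDs even if they happen to pick
the same low 32 bits.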
- Sean From halr at voltaire.com Mon Dec 13 12:31:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Dec 2004 22:31:38 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B05@taurus.voltaire.com> [You wrote:] > MGID....................0xff12401bffff0000 : 0x0000000000000016 > [1102546997:000053389][18007] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7. MGID 0xff12401bffff0000 : 0x0000000000000016 corresponds to 224.0.0.22 (IGMP). Only Version 3 Reports are sent with an IP destination address of 224.0.0.22, to which all IGMPv3-capable multicast routers listen. The OpenIB stack is joining this group (as it is a end node) rather than attempting to creating this group (hence the component masks). It is joining as a full rather than send only member due to a limitation with some SMs not supporting send only (or non member) joins. Even if that were supported by the SM, if the group were not created first, the send only join would fail. It makes no sense to send when no one is listening so the fact that it does not join is a benign error (ignore the message). The code retries on the next send to this group anyhow. It takes a multicast router which would be listening to IGMP to create the group. Do you have one configured on your subnet ? -- Hal From robert.j.woodruff at intel.com Mon Dec 13 14:14:21 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 13 Dec 2004 14:14:21 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030B8563@orsmsx408> >It takes a multicast router which would be listening to IGMP >to create the group. Do you have one configured on your subnet ? >-- Hal The real question is how come the Mellanox PCI-E card does not seem to work with the openib.org IPoIB ? I replaced the PCI-E card with a PCI-X card and it works fine. With the PCI-E card, it appears (from the counters logged in /sys/class/infiniband/mtcha0/ports/1/counters) that the PCI-E card is sending data but not receiving any. From halr at voltaire.com Mon Dec 13 14:14:58 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Dec 2004 00:14:58 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B12@taurus.voltaire.com> [Woddy wrote:] > The real question is how come the Mellanox PCI-E card does not seem to > work with the openib.org IPoIB ? > I replaced the PCI-E card with a PCI-X card and it works fine. > With the PCI-E card, it appears (from the counters logged in > /sys/class/infiniband/mtcha0/ports/1/counters) > that the PCI-E card is sending data but not receiving any. I believe (yet to be confirmed) that this is fixed in the upcoming Arbel firmware release. -- Hal From robert.j.woodruff at intel.com Mon Dec 13 14:19:56 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 13 Dec 2004 14:19:56 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030B8581@orsmsx408> >I believe (yet to be confirmed) that this is fixed in the upcoming >Arbel firmware release. >-- Hal I hope so. I am running the 5.4.6-rc4 firmware that I got from Trevor and it still fails. Perhaps Eitan knows if there is some newer version that fixes this issue. 
woody From halr at voltaire.com Mon Dec 13 14:29:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Dec 2004 00:29:21 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B13@taurus.voltaire.com> >>I believe (yet to be confirmed) that this is fixed in the upcoming >>Arbel firmware release. >I hope so. I am running the 5.4.6-rc4 firmware that I got from Trevor >and it still fails. Perhaps Eitan knows if there is some newer version >that fixes this issue. There could be an OpenIB problem. I don't know for sure as I do not have a PCI-E machine or HCA. My only experience is with IPoIB on other stacks with PCI-E. -- Hal From robert.j.woodruff at intel.com Mon Dec 13 14:56:35 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 13 Dec 2004 14:56:35 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030B8655@orsmsx408> >>I believe (yet to be confirmed) that this is fixed in the upcoming >>Arbel firmware release. >I hope so. I am running the 5.4.6-rc4 firmware that I got from Trevor >and it still fails. Perhaps Eitan knows if there is some newer version >that fixes this issue. >There could be an OpenIB problem. I don't know for sure as I do >not have a PCI-E machine or HCA. My only experience is with IPoIB >on other stacks with PCI-E. >-- Hal I have other stacks that work fine on the same systems with the same hardware so I suppose that mthca could be doing something wrong. However, it sure points to the PCI-E hardware since without changing any software and just the hardware, it works with the PCI-X card and fails with the PCI-E card. They are suppose to be software compatible, so I am not sure what we could be doing wrong. From mshefty at ichips.intel.com Mon Dec 13 15:37:11 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Dec 2004 15:37:11 -0800 Subject: [openib-general] [PATCH] define SMP attributes in ib_smi.h Message-ID: <20041213153711.2cde885f.mshefty@ichips.intel.com> This patch defines SMP attribute values in ib_smi.h. (I use these in madeye.) It removes/replaces the SMP attribute enums from mthca. There is also a minor change from using cpu_to_be32 to __constant_htonl to allow using #define's in switch statements. 
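To illustrate (a sketch of the kind of use the change enables, e.g. in
madeye -- not code from the patch): a case label must be an integer
constant expression, which __constant_htons() is by construction, so the
attribute IDs can be matched directly against the big-endian attr_id
field:

    switch (mad->mad_hdr.attr_id) {
    case IB_SMP_ATTR_NODE_INFO:     /* __constant_htons(0x0011) */
            printk(KERN_INFO "NodeInfo\n");
            break;
    case IB_SMP_ATTR_PORT_INFO:     /* __constant_htons(0x0015) */
            printk(KERN_INFO "PortInfo\n");
            break;
    default:
            break;
    }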
- Sean Index: include/ib_mad.h =================================================================== --- include/ib_mad.h (revision 1324) +++ include/ib_mad.h (working copy) @@ -60,7 +60,7 @@ #define IB_MGMT_MAX_METHODS 128 #define IB_QP0 0 -#define IB_QP1 cpu_to_be32(1) +#define IB_QP1 __constant_htonl(1) #define IB_QP1_QKEY 0x80010000 struct ib_grh { Index: include/ib_smi.h =================================================================== --- include/ib_smi.h (revision 1324) +++ include/ib_smi.h (working copy) @@ -56,7 +56,25 @@ u8 return_path[IB_SMP_MAX_PATH_HOPS]; } __attribute__ ((packed)); -#define IB_SMP_DIRECTION cpu_to_be16(0x8000) +#define IB_SMP_DIRECTION __constant_htons(0x8000) + +/* Subnet management attributes */ +#define IB_SMP_ATTR_NOTICE __constant_htons(0x0002) +#define IB_SMP_ATTR_NODE_DESC __constant_htons(0x0010) +#define IB_SMP_ATTR_NODE_INFO __constant_htons(0x0011) +#define IB_SMP_ATTR_SWITCH_INFO __constant_htons(0x0012) +#define IB_SMP_ATTR_GUID_INFO __constant_htons(0x0014) +#define IB_SMP_ATTR_PORT_INFO __constant_htons(0x0015) +#define IB_SMP_ATTR_PKEY_TABLE __constant_htons(0x0016) +#define IB_SMP_ATTR_SL_TO_VL_TABLE __constant_htons(0x0017) +#define IB_SMP_ATTR_VL_ARB_TABLE __constant_htons(0x0018) +#define IB_SMP_ATTR_LINEAR_FORWARD_TABLE __constant_htons(0x0019) +#define IB_SMP_ATTR_RANDOM_FORWARD_TABLE __constant_htons(0x001A) +#define IB_SMP_ATTR_MCAST_FORWARD_TABLE __constant_htons(0x001B) +#define IB_SMP_ATTR_SM_INFO __constant_htons(0x0020) +#define IB_SMP_ATTR_VENDOR_DIAG __constant_htons(0x0030) +#define IB_SMP_ATTR_LED_INFO __constant_htons(0x0031) +#define IB_SMP_ATTR_VENDOR_MASK __constant_htons(0xFF00) static inline u8 ib_get_smp_direction(struct ib_smp *smp) Index: hw/mthca/mthca_provider.c =================================================================== --- hw/mthca/mthca_provider.c (revision 1324) +++ hw/mthca/mthca_provider.c (working copy) @@ -22,18 +22,11 @@ */ #include +#include #include "mthca_dev.h" #include "mthca_cmd.h" -/* Temporary until we get core support straightened out */ -enum { - IB_SMP_ATTRIB_NODE_INFO = 0x0011, - IB_SMP_ATTRIB_GUID_INFO = 0x0014, - IB_SMP_ATTRIB_PORT_INFO = 0x0015, - IB_SMP_ATTRIB_PKEY_TABLE = 0x0016 -}; - static int mthca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) { @@ -54,7 +47,7 @@ in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->mad_hdr.class_version = 1; in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; - in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_NODE_INFO); + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_NODE_INFO; err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, in_mad, out_mad, @@ -98,7 +91,7 @@ in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->mad_hdr.class_version = 1; in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; - in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_PORT_INFO; in_mad->mad_hdr.attr_mod = cpu_to_be32(port); err = mthca_MAD_IFC(to_mdev(ibdev), 1, @@ -152,7 +145,7 @@ in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->mad_hdr.class_version = 1; in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; - in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PKEY_TABLE); + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_PKEY_TABLE; in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); err = mthca_MAD_IFC(to_mdev(ibdev), 1, @@ -191,7 +184,7 @@ in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->mad_hdr.class_version = 1; in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; - 
in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_PORT_INFO); + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_PORT_INFO; in_mad->mad_hdr.attr_mod = cpu_to_be32(port); err = mthca_MAD_IFC(to_mdev(ibdev), 1, @@ -211,7 +204,7 @@ in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; in_mad->mad_hdr.class_version = 1; in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; - in_mad->mad_hdr.attr_id = cpu_to_be16(IB_SMP_ATTRIB_GUID_INFO); + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_GUID_INFO; in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); err = mthca_MAD_IFC(to_mdev(ibdev), 1, Index: hw/mthca/mthca_mad.c =================================================================== --- hw/mthca/mthca_mad.c (revision 1324) +++ hw/mthca/mthca_mad.c (working copy) @@ -23,18 +23,12 @@ #include #include +#include #include "mthca_dev.h" #include "mthca_cmd.h" enum { - IB_SM_PORT_INFO = 0x0015, - IB_SM_PKEY_TABLE = 0x0016, - IB_SM_SM_INFO = 0x0020, - IB_SM_VENDOR_START = 0xff00 -}; - -enum { MTHCA_VENDOR_CLASS1 = 0x9, MTHCA_VENDOR_CLASS2 = 0xa }; @@ -84,7 +78,7 @@ if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && mad->mad_hdr.method == IB_MGMT_METHOD_SET) { - if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PORT_INFO)) { + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { update_sm_ah(to_mdev(ibdev), port_num, be16_to_cpup((__be16 *) (mad->data + 58)), (*(u8 *) (mad->data + 76)) & 0xf); @@ -95,7 +89,7 @@ ib_dispatch_event(&event); } - if (mad->mad_hdr.attr_id == cpu_to_be16(IB_SM_PKEY_TABLE)) { + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) { event.device = ibdev; event.event = IB_EVENT_PKEY_CHANGE; event.element.port_num = port_num; @@ -211,8 +205,9 @@ * Don't process SMInfo queries or vendor-specific * MADs -- the SMA can't handle them. */ - if (be16_to_cpu(in_mad->mad_hdr.attr_id) == IB_SM_SM_INFO || - be16_to_cpu(in_mad->mad_hdr.attr_id) >= IB_SM_VENDOR_START) + if (in_mad->mad_hdr.attr_id == IB_SMP_ATTR_SM_INFO || + ((in_mad->mad_hdr.attr_id & IB_SMP_ATTR_VENDOR_MASK) == + IB_SMP_ATTR_VENDOR_MASK)) return IB_MAD_RESULT_SUCCESS; } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || From mshefty at ichips.intel.com Mon Dec 13 15:54:23 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Dec 2004 15:54:23 -0800 Subject: [openib-general] [PATCH] [RFC] new test directory + test code for MAD snooping In-Reply-To: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> References: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> Message-ID: <41BE2BAF.10506@ichips.intel.com> Sean Hefty wrote: > This patch creates a new subdirectory under infiniband called 'test', and adds a > new kernel module called 'madeye' that can be used to snoop and display MADs > (until user-mode smpdump and gmpdump programs are created). > > We need to decide if we want to include this sort of test code in the openib > tree, and where it might best go. Does anyone care where this code gets checked into? Right now I'm leaning on putting it in gen2/test or gen2/utils to keep from polluting gen2/trunk. I'm anticipating creating a test app for the CM as well in order to verify some basic functionality. 
- Sean From jjengla at sandia.gov Mon Dec 13 08:03:35 2004 From: jjengla at sandia.gov (Josh England) Date: Mon, 13 Dec 2004 08:03:35 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00030B8655@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F00030B8655@orsmsx408> Message-ID: <1102953815.15045.4.camel@localhost> On Mon, 2004-12-13 at 14:56 -0800, Woodruff, Robert J wrote: > I have other stacks that work fine on the same systems with the same > hardware This is what I'm seeing as well. I'm doing IPoIB over PCI-E just fine with VAPI 3.2. Could there be some extra (software) initialization stuff for PCI-E cards that needs to be done? -JE From roland at topspin.com Mon Dec 13 16:14:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 16:14:54 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand)driver In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175B0C@taurus.voltaire.com> (Hal Rosenstock's message of "Mon, 13 Dec 2004 22:19:09 +0200") References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B0C@taurus.voltaire.com> Message-ID: <52sm694txd.fsf@topspin.com> Hal> The latest I-D is now Hal> http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Thanks, I'll correct this. Hal> Also, isn't DHCP over IB Hal> (http://www.ietf.org/internet-drafts/draft-ietf-ipoib-dhcp-over-infiniband-07.txt) Hal> also supported ? If so, is that part of this or some other Hal> patch being submitted ? DHCP should work but I don't think it's a kernel issue (I don't think kernel DHCP for NFS root over IPoIB will work unfortunately). - R. From roland at topspin.com Mon Dec 13 16:17:24 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 16:17:24 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00030B8655@orsmsx408> (Robert J. Woodruff's message of "Mon, 13 Dec 2004 14:56:35 -0800") References: <1AC79F16F5C5284499BB9591B33D6F00030B8655@orsmsx408> Message-ID: <52oegx4tt7.fsf@topspin.com> Robert> I have other stacks that work fine on the same systems Robert> with the same hardware so I suppose that mthca could be Robert> doing something wrong. Seems like there must be an mthca bug -- unfortunately I don't have a PCI-e system set up to debug IPoIB so it may be a while before I can debug this (unless someone can get more details about why IPoIB is failing) - R. From roland at topspin.com Mon Dec 13 17:05:24 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 17:05:24 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <52oegx4tt7.fsf@topspin.com> (Roland Dreier's message of "Mon, 13 Dec 2004 16:17:24 -0800") References: <1AC79F16F5C5284499BB9591B33D6F00030B8655@orsmsx408> <52oegx4tt7.fsf@topspin.com> Message-ID: <527jnl4rl7.fsf@topspin.com> By the way, does anyone have a fabric with both PCI-X and PCIe HCAs? It would be interesting to see if a PCI-X HCA receives ARPs sent by a PCIe HCA (and also test the other direction). - R. From hnrose at earthlink.net Mon Dec 13 17:29:32 2004 From: hnrose at earthlink.net (Hal Rosenstock) Date: Mon, 13 Dec 2004 20:29:32 -0500 Subject: [openib-general] [PATCH] [RFC] new test directory + test codefor MAD snooping References: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> <41BE2BAF.10506@ichips.intel.com> Message-ID: <003101c4e17c$5d6c7600$6501a8c0@comcast.net> Sean Hefty wrote: > Does anyone care where this code gets checked into? 
Right now I'm > leaning on putting it in gen2/test or gen2/utils to keep from > polluting gen2/trunk. I think that's a good idea (not to mix this into trunk). Still not sure about what is the best way for test kernel modules. Are there any examples anyone is aware of ? -- Hal From yoshfuji at linux-ipv6.org Mon Dec 13 18:30:23 2004 From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=) Date: Tue, 14 Dec 2004 11:30:23 +0900 (JST) Subject: [openib-general] Re: [PATCH][v3][16/21] IPoIB IPv6 support In-Reply-To: <20041213109.iziHvQZqtmP83gmx@topspin.com> References: <20041213109.5NKezuGE9PMejMSM@topspin.com> <20041213109.iziHvQZqtmP83gmx@topspin.com> Message-ID: <20041214.113023.65866789.yoshfuji@linux-ipv6.org> In article <20041213109.iziHvQZqtmP83gmx at topspin.com> (at Mon, 13 Dec 2004 10:09:46 -0800), Roland Dreier says: > } > @@ -1794,6 +1801,7 @@ > if ((dev->type != ARPHRD_ETHER) && > (dev->type != ARPHRD_FDDI) && > (dev->type != ARPHRD_IEEE802_TR) && > + (dev->type != ARPHRD_INFINIBAND) && > (dev->type != ARPHRD_ARCNET)) { > /* Alas, we support only Ethernet autoconfiguration. */ > return; Please put ARPHRD_INFINIBAND after ARPHRD_ARCNET like other places. --yoshfuji From roland at topspin.com Mon Dec 13 18:36:16 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 18:36:16 -0800 Subject: [openib-general] Re: [PATCH][v3][16/21] IPoIB IPv6 support In-Reply-To: <20041214.113023.65866789.yoshfuji@linux-ipv6.org> (YOSHIFUJI Hideaki's message of "Tue, 14 Dec 2004 11:30:23 +0900 (JST)") References: <20041213109.5NKezuGE9PMejMSM@topspin.com> <20041213109.iziHvQZqtmP83gmx@topspin.com> <20041214.113023.65866789.yoshfuji@linux-ipv6.org> Message-ID: <52y8g138tb.fsf@topspin.com> YOSHIFUJI> Please put ARPHRD_INFINIBAND after ARPHRD_ARCNET like YOSHIFUJI> other places. Thanks, I've made this update. Thanks, Roland From yoshfuji at linux-ipv6.org Mon Dec 13 18:42:59 2004 From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=) Date: Tue, 14 Dec 2004 11:42:59 +0900 (JST) Subject: [openib-general] Re: [PATCH][v3][20/21] Add InfiniBand Documentation files In-Reply-To: <200412131010.qyAMW5NxoiM4CntC@topspin.com> References: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> <200412131010.qyAMW5NxoiM4CntC@topspin.com> Message-ID: <20041214.114259.16194522.yoshfuji@linux-ipv6.org> In article <200412131010.qyAMW5NxoiM4CntC at topspin.com> (at Mon, 13 Dec 2004 10:10:04 -0800), Roland Dreier says: > Add files to Documentation/infiniband that describe the tree under > /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. > > Signed-off-by: Roland Dreier LICENSE.TXT is missing. Please add one. -- Hideaki YOSHIFUJI @ USAGI Project GPG FP: 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA From roland at topspin.com Mon Dec 13 18:58:38 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 18:58:38 -0800 Subject: [openib-general] Re: [PATCH][v3][20/21] Add InfiniBand Documentation files In-Reply-To: <20041214.114259.16194522.yoshfuji@linux-ipv6.org> (YOSHIFUJI Hideaki's message of "Tue, 14 Dec 2004 11:42:59 +0900 (JST)") References: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> <200412131010.qyAMW5NxoiM4CntC@topspin.com> <20041214.114259.16194522.yoshfuji@linux-ipv6.org> Message-ID: <52u0qp37s1.fsf@topspin.com> YOSHIFUJI> LICENSE.TXT is missing. Please add one. I think it would be even better to rewrite the copyright header so it doesn't refer to external files. I will see if this is possible. 
Thanks, Roland From roland at topspin.com Mon Dec 13 19:02:40 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 13 Dec 2004 19:02:40 -0800 Subject: Better kernel license (was Re: [openib-general] Re: [PATCH][v3][20/21] Add InfiniBand Documentation files) In-Reply-To: <52u0qp37s1.fsf@topspin.com> (Roland Dreier's message of "Mon, 13 Dec 2004 18:58:38 -0800") References: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> <200412131010.qyAMW5NxoiM4CntC@topspin.com> <20041214.114259.16194522.yoshfuji@linux-ipv6.org> <52u0qp37s1.fsf@topspin.com> Message-ID: <52pt1d37lb.fsf_-_@topspin.com> How would everyone feel about replacing the copyright notices in the kernel source files with the following? I think it says exactly the same thing, just without referring to any external URLs. - Roland /* * Copyright (c) 2004 Company 1. All rights reserved. * Copyright (c) 2004 Company 2. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * * $Id$ */ From yoshfuji at linux-ipv6.org Mon Dec 13 19:16:46 2004 From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=) Date: Tue, 14 Dec 2004 12:16:46 +0900 (JST) Subject: [openib-general] Re: [PATCH][v3][20/21] Add InfiniBand Documentation files In-Reply-To: <52u0qp37s1.fsf@topspin.com> References: <200412131010.qyAMW5NxoiM4CntC@topspin.com> <20041214.114259.16194522.yoshfuji@linux-ipv6.org> <52u0qp37s1.fsf@topspin.com> Message-ID: <20041214.121646.42527619.yoshfuji@linux-ipv6.org> In article <52u0qp37s1.fsf at topspin.com> (at Mon, 13 Dec 2004 18:58:38 -0800), Roland Dreier says: > YOSHIFUJI> LICENSE.TXT is missing. Please add one. > > I think it would be even better to rewrite the copyright header so it > doesn't refer to external files. I will see if this is possible. Yes, I agree. 
--yoshfuji From tduffy at sun.com Mon Dec 13 19:16:55 2004 From: tduffy at sun.com (Tom Duffy) Date: Mon, 13 Dec 2004 19:16:55 -0800 Subject: Better kernel license (was Re: [openib-general] Re: [PATCH][v3][20/21] Add InfiniBand Documentation files) In-Reply-To: <52pt1d37lb.fsf_-_@topspin.com> References: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> <200412131010.qyAMW5NxoiM4CntC@topspin.com> <20041214.114259.16194522.yoshfuji@linux-ipv6.org> <52u0qp37s1.fsf@topspin.com> <52pt1d37lb.fsf_-_@topspin.com> Message-ID: <1102994215.1054.5.camel@duffman> On Mon, 2004-12-13 at 19:02 -0800, Roland Dreier wrote: > How would everyone feel about replacing the copyright notices in the > kernel source files with the following? I like this idea. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From shaharf at voltaire.com Tue Dec 14 00:40:11 2004 From: shaharf at voltaire.com (shaharf) Date: Tue, 14 Dec 2004 10:40:11 +0200 Subject: [openib-general] Re: User MAD support for cancel MAD Message-ID: > > Are you saying to have other forms of MAD like cancel by mgmt_class, > cancel > > by SGID and/or SLID in addition to cancel by TID ? We could do this if > there > > is a need. I don't think that need has been demonstrated (at least yet). > > No - I'm not suggesting that we have more ways to cancel a MAD. I was > simply pointing out that we _could_ have cancel_mad work based on TID + > mgmt_class + SGID/SLID (which would match uniqueness as defined by the > spec), rather than just TID. > > I think canceling MADs based on the wr_id is sufficient. > > - Sean Sean, I totally agree; I too think canceling MADs should be based on wr_id, but that requires exposing the wr_id to the user-mode level. This may overcome some problems that may result from using TIDs. Shahar From roland at topspin.com Tue Dec 14 08:34:26 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Dec 2004 08:34:26 -0800 Subject: [openib-general] Re: [PATCH] define SMP attributes in ib_smi.h In-Reply-To: <20041213153711.2cde885f.mshefty@ichips.intel.com> (Sean Hefty's message of "Mon, 13 Dec 2004 15:37:11 -0800") References: <20041213153711.2cde885f.mshefty@ichips.intel.com> Message-ID: <52u0qo260d.fsf@topspin.com> The mthca pieces seem fine to me. Once the header changes are in place I will commit the mthca stuff. Thanks, Roland From halr at voltaire.com Tue Dec 14 08:31:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2004 11:31:34 -0500 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand)driver In-Reply-To: <52sm694txd.fsf@topspin.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B0C@taurus.voltaire.com> <52sm694txd.fsf@topspin.com> Message-ID: <1103041894.4127.10.camel@localhost.localdomain> On Mon, 2004-12-13 at 19:14, Roland Dreier wrote: > Hal> Also, isn't DHCP over IB > Hal> (http://www.ietf.org/internet-drafts/draft-ietf-ipoib-dhcp-over-infiniband-07.txt) > Hal> also supported ? If so, is that part of this or some other > Hal> patch being submitted ? > > DHCP should work but I don't think it's a kernel issue Where is Linux DHCP kept ? Are the IB changes incorporated into the DHCP source package ? If so, what version is needed ? > (I don't think kernel DHCP for NFS root over IPoIB will work unfortunately). Does root over NFS only work with RARP or BOOTP and not DHCP ? Is this because DHCP (client) is not in the kernel ? Thanks.
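As background for the client-identifier point raised in the replies that follow, DHCP option 61 has the general layout sketched below (per RFC 2132); the IPoIB-specific identifier contents are defined in the draft cited above and are not reproduced here:

#include <stdint.h>

/*
 * On-the-wire layout of DHCP option 61 (client-identifier).  IPoIB
 * cannot fit its 20-byte link-layer address into the fixed 16-byte
 * chaddr field, which is why the client identifier (and the broadcast
 * flag) matter for DHCP over IPoIB.
 */
struct dhcp_client_id_option {
        uint8_t code;   /* always 61 */
        uint8_t len;    /* 1 (type byte) + identifier length */
        uint8_t type;   /* hardware type, or 0 for a non-hardware id */
        uint8_t id[];   /* client identifier bytes */
} __attribute__((packed));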
-- Hal From robert.j.woodruff at intel.com Tue Dec 14 08:49:33 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 14 Dec 2004 08:49:33 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> Roland >Seems like there must be an mthca bug -- unfortunately I don't have a >PCI-e system set up to debug IPoIB so it may be a while before I can >debug this (unless someone can get more details about why IPoIB is failing) > - R. If I get some time, I will try to investigate further. Rolland, to turn on the debug tracing in IPoIB, I just need to enable the IPoIB debug options in .CONFIG, Right ? Do I need to pass any parameters when I load the driver ? Looks like there is something called debug_level, mcast_debug_level, and data_debug_level in ipoib.h. woody From halr at voltaire.com Tue Dec 14 08:48:13 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2004 11:48:13 -0500 Subject: [openib-general] Re: [PATCH] define SMP attributes in ib_smi.h In-Reply-To: <52u0qo260d.fsf@topspin.com> References: <20041213153711.2cde885f.mshefty@ichips.intel.com> <52u0qo260d.fsf@topspin.com> Message-ID: <1103042893.4127.12.camel@localhost.localdomain> On Tue, 2004-12-14 at 11:34, Roland Dreier wrote: > The mthca pieces seem fine to me. Once the header changes are in > place I will commit the mthca stuff. Header changes (ib_smi.h and ib_mad.h) are now committed. -- Hal From roland at topspin.com Tue Dec 14 08:52:38 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Dec 2004 08:52:38 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> (Robert J. Woodruff's message of "Tue, 14 Dec 2004 08:49:33 -0800") References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> Message-ID: <52mzwg2561.fsf@topspin.com> Robert> If I get some time, I will try to investigate further. Robert> Rolland, to turn on the debug tracing in IPoIB, I just Robert> need to enable the IPoIB debug options in .CONFIG, Right ? Robert> Do I need to pass any parameters when I load the driver ? Robert> Looks like there is something called debug_level, Robert> mcast_debug_level, and data_debug_level in ipoib.h. Turn on the CONFIG options, then you can turn on debugging by setting any or all of the above module parameters to 1 (also can be changed with the module loaded through /sys/module/ib_ipoib). - R. From halr at voltaire.com Tue Dec 14 09:00:52 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2004 12:00:52 -0500 Subject: [openib-general] [PATCH] IPoIB FAQ Message-ID: <1103043652.4127.16.camel@localhost.localdomain> Adds more information on enabling debug for IPoIB Index: ipoib_faq.txt =================================================================== --- ipoib_faq.txt (revision 1325) +++ ipoib_faq.txt (working copy) @@ -37,6 +37,13 @@ mount -t ipoib_debugfs none /ipoib_debufs/ cat /ipoib_debugfs/ib0_mcg +There are 3 module parameters for IPoIB debug: +debug_level, mcast_debug_level, and data_debug_level. +If both CONFIG options have been enabled, debugging is +turned on by setting the ib_ipoib module parameters to 1. +(This can also be changed with the module loaded through +/sys/module/ib_ipoib). + Other things to verify and supply to help isolate the problem: 1. 
Verify the firmware version via From tduffy at sun.com Tue Dec 14 09:10:47 2004 From: tduffy at sun.com (Tom Duffy) Date: Tue, 14 Dec 2004 09:10:47 -0800 Subject: [openib-general] [PATCH][v3][5/21] Add InfiniBand MAD (management datagram) support In-Reply-To: <20041213109.3tK6alRLJABxH4bu@topspin.com> References: <20041213109.3tK6alRLJABxH4bu@topspin.com> Message-ID: <1103044247.1054.11.camel@duffman> On Mon, 2004-12-13 at 10:09 -0800, Roland Dreier wrote: > Add support for handling InfiniBand MADs (management datagrams), > including sending and receiving MADs as well as passing MADs on to > local agents. > > This is required for an SM (subnet manager) to discover and configure > the host, since the SM's query MADs must be passed to the local SMA > (subnet management agent). In addition, this support is used by upper > level protocols to send queries to and receive responses from the SM. This one didn't make it to lkml. I think it was over the size limit. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Tue Dec 14 09:42:25 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Dec 2004 09:42:25 -0800 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand)driver In-Reply-To: <1103041894.4127.10.camel@localhost.localdomain> (Hal Rosenstock's message of "14 Dec 2004 11:31:34 -0500") References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B0C@taurus.voltaire.com> <52sm694txd.fsf@topspin.com> <1103041894.4127.10.camel@localhost.localdomain> Message-ID: <52is7422v2.fsf@topspin.com> Hal> Where is Linux DHCP kept ? Are the IB changes incorporated Hal> into the DHCP source package ? If so, what version is needed ? There are lots of Linux DHCP client packages. I'm not sure if any of them will work with IPoIB. Hal> Does root over NFS only work with RARP or BOOTP and not DHCP Hal> ? Is this because DHCP (cleint) is not in the kernel ? No, there is DHCP client support in the kernel. I don't think it supports client ID though, so it won't work with IPoIB. - Roland From roland at topspin.com Tue Dec 14 09:43:01 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Dec 2004 09:43:01 -0800 Subject: [openib-general] [PATCH][v3][5/21] Add InfiniBand MAD (management datagram) support In-Reply-To: <1103044247.1054.11.camel@duffman> (Tom Duffy's message of "Tue, 14 Dec 2004 09:10:47 -0800") References: <20041213109.3tK6alRLJABxH4bu@topspin.com> <1103044247.1054.11.camel@duffman> Message-ID: <52ekhs22u2.fsf@topspin.com> Tom> This one didn't make it to lkml. I think it was over the Tom> size limit. Ugh, looks like it was just over 100K. I'll split it next time. - R. From roland at topspin.com Tue Dec 14 09:44:59 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 14 Dec 2004 09:44:59 -0800 Subject: [openib-general] Re: [PATCH] define SMP attributes in ib_smi.h In-Reply-To: <1103042893.4127.12.camel@localhost.localdomain> (Hal Rosenstock's message of "14 Dec 2004 11:48:13 -0500") References: <20041213153711.2cde885f.mshefty@ichips.intel.com> <52u0qo260d.fsf@topspin.com> <1103042893.4127.12.camel@localhost.localdomain> Message-ID: <52acsg22qs.fsf@topspin.com> I've committed the mthca piece as well. By the way, in the future please include a Signed-off-by: line. 
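For readers without the tree handy, the kind of definitions this ib_smi.h patch adds look roughly like the following; the attribute values come from the IB Architecture Specification, while the macro names here are illustrative and the in-tree definitions may differ (for example by storing the values in network byte order):

/* Common SM attribute IDs (values per the IBA spec); names are
 * illustrative and may not match ib_smi.h exactly. */
enum {
        IB_SMP_ATTR_NOTICE               = 0x0002,
        IB_SMP_ATTR_NODE_DESC            = 0x0010,
        IB_SMP_ATTR_NODE_INFO            = 0x0011,
        IB_SMP_ATTR_SWITCH_INFO          = 0x0012,
        IB_SMP_ATTR_GUID_INFO            = 0x0014,
        IB_SMP_ATTR_PORT_INFO            = 0x0015,
        IB_SMP_ATTR_PKEY_TABLE           = 0x0016,
        IB_SMP_ATTR_SL_TO_VL_TABLE       = 0x0017,
        IB_SMP_ATTR_VL_ARB_TABLE         = 0x0018,
        IB_SMP_ATTR_LINEAR_FORWARD_TABLE = 0x0019,
        IB_SMP_ATTR_SM_INFO              = 0x0020,
        IB_SMP_ATTR_LED_INFO             = 0x0031,
};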
Thanks, Roland From mshefty at ichips.intel.com Tue Dec 14 09:48:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 14 Dec 2004 09:48:18 -0800 Subject: [openib-general] Re: [PATCH] define SMP attributes in ib_smi.h In-Reply-To: <52acsg22qs.fsf@topspin.com> References: <20041213153711.2cde885f.mshefty@ichips.intel.com> <52u0qo260d.fsf@topspin.com> <1103042893.4127.12.camel@localhost.localdomain> <52acsg22qs.fsf@topspin.com> Message-ID: <41BF2762.6060304@ichips.intel.com> Roland Dreier wrote: > I've committed the mthca piece as well. By the way, in the future > please include a Signed-off-by: line. Thanks for the reminder. I've made a note to _try_ to remind myself next time. (Feel free to reject the patch next time if I don't include it.) - Sean From robert.j.woodruff at intel.com Tue Dec 14 10:00:55 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 14 Dec 2004 10:00:55 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030B8F7B@orsmsx408> >By the way, does anyone have a fabric with both PCI-X and PCIe HCAs? >It would be interesting to see if a PCI-X HCA receives ARPs sent by a >PCIe HCA (and also test the other direction). > - R. Yes. I have 2 nodes that are PCI-X and 2 nodes that are PCI-E on the same fabric. From mshefty at ichips.intel.com Tue Dec 14 12:05:43 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 14 Dec 2004 12:05:43 -0800 Subject: [openib-general] [PATCH] [RFC] new test directory + test codefor MAD snooping In-Reply-To: <003101c4e17c$5d6c7600$6501a8c0@comcast.net> References: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> <41BE2BAF.10506@ichips.intel.com> <003101c4e17c$5d6c7600$6501a8c0@comcast.net> Message-ID: <41BF4797.8030007@ichips.intel.com> Hal Rosenstock wrote: > Sean Hefty wrote: > >>Does anyone care where this code gets checked into? Right now I'm >>leaning on putting it in gen2/test or gen2/utils to keep from >>polluting gen2/trunk. > > > I think that's a good idea (not to mix this into trunk). Still not sure > about what is the best way for test kernel modules. Are there any > examples anyone is aware of ? I've pushed in the madeye code under gen2/utils. Longer term I think we'll eventually have better testing utilities, but for now I think that sharing some of these component tests is useful for ensuring that we have a basic level of functionality. - Sean From halr at voltaire.com Tue Dec 14 13:45:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Dec 2004 16:45:23 -0500 Subject: [openib-general] [PATCH][v3][17/21] Add IPoIB (IP-over-InfiniBand)driver References: <5CE025EE7D88BA4599A2C8FEFCF226F5175B0C@taurus.voltaire.com><52sm694txd.fsf@topspin.com><1103041894.4127.10.camel@localhost.localdomain> <52is7422v2.fsf@topspin.com> Message-ID: <00a201c4e226$3bf20d50$6601a8c0@Gripen> Roland Dreier wrote: > Hal> Does root over NFS only work with RARP or BOOTP and not DHCP > Hal> ? Is this because DHCP (cleint) is not in the kernel ? > > No, there is DHCP client support in the kernel. I don't think it > supports client ID though, so it won't work with IPoIB. Right, it looks like net/ipv4/addrconf.c does not currently support option 61 nor does it support the setting of the broadcast flag in DHCPDISCOVER and DHCPREQUEST messages. -- Hal From mst at mellanox.co.il Tue Dec 14 22:09:20 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 15 Dec 2004 08:09:20 +0200 Subject: [openib-general] A Couple of verbs/mthca/firmware questions In-Reply-To: <004001c4defc$a9f2c100$6501a8c0@comcast.net> References: <004001c4defc$a9f2c100$6501a8c0@comcast.net> Message-ID: <20041215060920.GA11501@mellanox.co.il> Hello! Here's the answer from our firmware developers: Quoting r. Hal Rosenstock (hnrose at earthlink.net) "[openib-general] A Couple of verbs/mthca/firmware questions": > Hi, > > I have a couple of questions relative to verbs/mthca/firmware: > > 1. On received packets, the LRH is not visible but there are some WC > fields set. One of those fields is dlid_path_bits. dlid_path_bits seems > to be the same regardless of whether the incoming DR SMP was sent to the > actual DLID or the permissive DLID. Is there a way to distinguish these > cases ? > 1. dlid_path_bits are the 0-7 LSB from the DLID in LRH. permissive LID is not distinguished. > 2. If a status is set in the MAD header and a post_send is issued on > that MAD, should that status be sent in the packet ? Is there any > conversion of the status field which might occur ? 2. no conversion and no modification is done to the MAD. MST From mst at mellanox.co.il Wed Dec 15 04:58:25 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Dec 2004 14:58:25 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support Message-ID: <20041215125825.GA12480@mellanox.co.il> Hello! I'd like to start working on MT25218 (aka Arbel memfree, aka native mode) support for openib gen2. There are many similiarities with MT25208, but there are also major differences, and a lot of extra setup to be done to compensate for the absence of on-board memory, which unfortunately affects not only the set-up commands but also performance-critical data path verbs (e.g. posting WQEs, CQ polling). What would be the best way to organise this code? I guess could be two thinkable ways to do this: 1. Add the functionality to the mthca module, use a run-time flag to describe hardware differences. Thus, the same module will handle devices of all types. 2. Create a sibling or a child directory to mthca, and export common functions from mthca. In this case we will have two kernel modules, sharing some common files. I am leaning towards option 2 for both the performance reasons and for better separation between drivers. Therefore, I propose adding a directory memfree under https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/hw/mthca/memfree and all memfree specific code will go there. Comments welcome, MST From halr at voltaire.com Wed Dec 15 05:05:33 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 08:05:33 -0500 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <20041215125825.GA12480@mellanox.co.il> References: <20041215125825.GA12480@mellanox.co.il> Message-ID: <1103115933.4126.53.camel@localhost.localdomain> Hi, On Wed, 2004-12-15 at 07:58, Michael S. Tsirkin wrote: > Hello! > I'd like to start working on MT25218 (aka Arbel memfree, aka native mode) > support for openib gen2. > > There are many similiarities with MT25208, but there are also > major differences, and a lot of extra setup to be done > to compensate for the absence of on-board memory, > which unfortunately affects not only the set-up > commands but also performance-critical data path verbs > (e.g. posting WQEs, CQ polling). > > What would be the best way to organise this code? > I guess could be two thinkable ways to do this: > > 1. 
Add the functionality to the mthca module, > use a run-time flag to describe hardware differences. > Thus, the same module will handle devices of all > types. > > 2. Create a sibling or a child directory to mthca, > and export common functions from mthca. > In this case we will have two kernel modules, > sharing some common files. > > I am leaning towards option 2 for both the performance > reasons and for better separation between drivers. > > > Therefore, I propose adding a directory memfree under > > https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/hw/mthca/memfree > > and all memfree specific code will go there. There already is some Arbel native code in mthca. I believe the path being taken is more in lines with approach 1. As Roland is the maintainer of this, you would need to coordinate with him. I would expect we can utilize your offer to accelerate Arbel native support, which a number of people have asked about. Also, with this work, would Tavor also be made to work with DDR hidden ? -- Hal From mst at mellanox.co.il Wed Dec 15 05:24:59 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Dec 2004 15:24:59 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <1103115933.4126.53.camel@localhost.localdomain> References: <20041215125825.GA12480@mellanox.co.il> <1103115933.4126.53.camel@localhost.localdomain> Message-ID: <20041215132459.GB12480@mellanox.co.il> Hello! Quoting r. Hal Rosenstock (halr at voltaire.com) "Re: [openib-general] MT25218 (aka Arbel memfree) support": > Hi, > > On Wed, 2004-12-15 at 07:58, Michael S. Tsirkin wrote: > > Hello! > > I'd like to start working on MT25218 (aka Arbel memfree, aka native mode) > > support for openib gen2. > > > > There are many similiarities with MT25208, but there are also > > major differences, and a lot of extra setup to be done > > to compensate for the absence of on-board memory, > > which unfortunately affects not only the set-up > > commands but also performance-critical data path verbs > > (e.g. posting WQEs, CQ polling). > > > > What would be the best way to organise this code? > > I guess could be two thinkable ways to do this: > > > > 1. Add the functionality to the mthca module, > > use a run-time flag to describe hardware differences. > > Thus, the same module will handle devices of all > > types. > > > > 2. Create a sibling or a child directory to mthca, > > and export common functions from mthca. > > In this case we will have two kernel modules, > > sharing some common files. > > > > I am leaning towards option 2 for both the performance > > reasons and for better separation between drivers. > > > > > > Therefore, I propose adding a directory memfree under > > > > https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/hw/mthca/memfree > > > > and all memfree specific code will go there. > > There already is some Arbel native code in mthca. I believe the path > being taken is more in lines with approach 1. As Roland is the > maintainer of this, you would need to coordinate with him. I would > expect we can utilize your offer to accelerate Arbel native support, > which a number of people have asked about. OK. > Also, with this work, would Tavor also be made to work with DDR hidden ? > > -- Hal > Thats an unrelated issue. With DDR hidden AFAIK the only difference is that FMRs dont work since you have to use a command to access the internal DDR. Do you know of any issues with that? 
MST From halr at voltaire.com Wed Dec 15 05:35:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 08:35:06 -0500 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <20041215132459.GB12480@mellanox.co.il> References: <20041215125825.GA12480@mellanox.co.il> <1103115933.4126.53.camel@localhost.localdomain> <20041215132459.GB12480@mellanox.co.il> Message-ID: <1103117706.4126.93.camel@localhost.localdomain> On Wed, 2004-12-15 at 08:24, Michael S. Tsirkin wrote: > Thats an unrelated issue. > > With DDR hidden AFAIK the only difference is > that FMRs dont work since you have to use a command to access the > internal DDR. > > Do you know of any issues with that? Currently, the AV needs to be read when sending UD: int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, struct ib_ud_header *header) if (ah->on_hca) return -EINVAL; on_hca is set in mthca_create_ah when DDR is hidden -- Hal From mst at mellanox.co.il Wed Dec 15 05:47:43 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Dec 2004 15:47:43 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <1103117706.4126.93.camel@localhost.localdomain> References: <20041215125825.GA12480@mellanox.co.il> <1103115933.4126.53.camel@localhost.localdomain> <20041215132459.GB12480@mellanox.co.il> <1103117706.4126.93.camel@localhost.localdomain> Message-ID: <20041215134743.GA12584@mellanox.co.il> Hello! Quoting r. Hal Rosenstock (halr at voltaire.com) "Re: [openib-general] MT25218 (aka Arbel memfree) support": > On Wed, 2004-12-15 at 08:24, Michael S. Tsirkin wrote: > > Thats an unrelated issue. > > > > With DDR hidden AFAIK the only difference is > > that FMRs dont work since you have to use a command to access the > > internal DDR. > > > > Do you know of any issues with that? > > Currently, the AV needs to be read when sending UD: > > int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, > struct ib_ud_header *header) > if (ah->on_hca) > return -EINVAL; > > on_hca is set in mthca_create_ah when DDR is hidden > > -- Hal But ah->on_hca = 0; if (!atomic_read(&pd->sqp_count) && !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { index = mthca_alloc(&dev->av_table.alloc); /* fall back to allocate in host memory */ if (index == -1) goto host_alloc; av = kmalloc(sizeof *av, GFP_KERNEL); if (!av) goto host_alloc; ah->on_hca = 1; ah->avdma = dev->av_table.ddr_av_base + index * MTHCA_AV_SIZE; } So it shouldnt be set when DDR is hidden. mst From halr at voltaire.com Wed Dec 15 05:51:19 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 08:51:19 -0500 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <20041215134743.GA12584@mellanox.co.il> References: <20041215125825.GA12480@mellanox.co.il> <1103115933.4126.53.camel@localhost.localdomain> <20041215132459.GB12480@mellanox.co.il> <1103117706.4126.93.camel@localhost.localdomain> <20041215134743.GA12584@mellanox.co.il> Message-ID: <1103118679.4126.111.camel@localhost.localdomain> On Wed, 2004-12-15 at 08:47, Michael S. Tsirkin wrote: > But > > ah->on_hca = 0; > So it shouldnt be set when DDR is hidden. Got it backwards: it is when DDR is not hidden. Sorry. 
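To make the conclusion of this exchange explicit, the two excerpts quoted above combine roughly as follows; this is a condensed sketch of mthca_create_ah() and mthca_read_ah(), not the complete functions:

/* Condensed from the mthca_create_ah() excerpt quoted above. */
ah->on_hca = 0;
if (!atomic_read(&pd->sqp_count) &&
    !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) {
        /*
         * DDR visible: try to place the AV in on-board memory,
         * falling back to host memory if the allocation fails.
         */
        ah->on_hca = 1;
}

/*
 * With the DDR hidden the condition above is never true, so on_hca
 * stays 0, the AV lives in host memory, and the
 *
 *      if (ah->on_hca)
 *              return -EINVAL;
 *
 * check in mthca_read_ah() does not fire for UD sends.
 */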
-- Hal From halr at voltaire.com Wed Dec 15 06:56:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 09:56:46 -0500 Subject: [openib-general] IPoIB oops on path record completion Message-ID: <1103122605.4349.7.camel@localhost.localdomain> Dec 15 09:46:01 localhost kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000104 Dec 15 09:46:01 localhost kernel: printing eip: Dec 15 09:46:01 localhost kernel: d093f90d Dec 15 09:46:01 localhost kernel: *pde = 0ba10067 Dec 15 09:46:01 localhost kernel: *pte = 00000000 Dec 15 09:46:01 localhost kernel: Oops: 0002 [#1] Dec 15 09:46:01 localhost kernel: Modules linked in: ib_ipoib ib_sa ib_mthca ib_mad ib_core loop ide_cd cdrom lp ipv6 autofs parport_pc parport uhci_hcd ehci_hcd ohci_hcd eepro100 mii evdev usbcore Dec 15 09:46:01 localhost kernel: CPU: 0 Dec 15 09:46:01 localhost kernel: EIP: 0060:[] Not tainted VLI Dec 15 09:46:01 localhost kernel: EFLAGS: 00010202 (2.6.9) Dec 15 09:46:01 localhost kernel: EIP is at path_rec_completion+0x2ad/0x450 [ib_ipoib] Dec 15 09:46:01 localhost kernel: eax: 00000100 ebx: 00000000 ecx: ca39be14 edx: cb4ebb60 Dec 15 09:46:01 localhost kernel: esi: 00000000 edi: cffc6440 ebp: c5f12a80 esp: ca39bdac Dec 15 09:46:01 localhost kernel: ds: 007b es: 007b ss: 0068 Dec 15 09:46:01 localhost kernel: Process ib_mad1 (pid: 8938, threadinfo=ca39a000 task=cfef7160) Dec 15 09:46:01 localhost kernel: Stack: 01000002 00000000 01000004 5d000000 00000286 c126f3a0 01000003 80000000 Dec 15 09:46:01 localhost kernel: 00000282 00000000 02000000 ce19d000 0c0fc060 cef3dce0 00000282 cc0fc060 Dec 15 09:46:01 localhost kernel: 00000206 cef3dce0 0c0fc060 ca39be50 c0108cd4 00000005 ce19d000 ca39be50 Dec 15 09:46:01 localhost kernel: Call Trace: Dec 15 09:46:01 localhost kernel: [] handle_IRQ_event+0x34/0x70 Dec 15 09:46:01 localhost kernel: [] mthca_destroy_ah+0x5c/0x60 [ib_mthca] Dec 15 09:46:01 localhost kernel: [] ib_sa_path_rec_callback+0x5a/0x90 [ib_sa] Dec 15 09:46:01 localhost kernel: [] queue_delayed_work+0x55/0x80 Dec 15 09:46:01 localhost kernel: [] ib_mad_send_done_handler+0x13b/0x230 [ib_mad] Dec 15 09:46:01 localhost kernel: [] send_handler+0x10b/0x350 [ib_sa] Dec 15 09:46:01 localhost kernel: [] schedule+0x306/0x5c0 Dec 15 09:46:01 localhost kernel: [] timeout_sends+0x113/0x2f0 [ib_mad] Dec 15 09:46:01 localhost kernel: [] timeout_sends+0x0/0x2f0 [ib_mad] Dec 15 09:46:01 localhost kernel: [] worker_thread+0x251/0x470 Dec 15 09:46:01 localhost kernel: [] default_wake_function+0x0/0x20 Dec 15 09:46:01 localhost kernel: [] default_wake_function+0x0/0x20 Dec 15 09:46:01 localhost kernel: [] worker_thread+0x0/0x470 Dec 15 09:46:01 localhost kernel: [] kthread+0xaa/0xb0 Dec 15 09:46:01 localhost kernel: [] kthread+0x0/0xb0 Dec 15 09:46:01 localhost kernel: [] kernel_thread_helper+0x5/0x10 Dec 15 09:46:01 localhost kernel: Code: ff 74 24 20 9d 89 f6 8d bc 27 00 00 00 00 8b 44 24 68 8d 4c 24 68 31 d2 39 c8 74 23 89 c2 8b 00 ff 4c 24 70 c7 42 08 00 00 00 00 <89> 48 04 89 44 24 68 c7 42 04 00 00 00 00 c7 02 00 00 00 00 85 From postmaster at mail.iaai.fr Wed Dec 15 07:03:26 2004 From: postmaster at mail.iaai.fr (postmaster at mail.iaai.fr) Date: Wed, 15 Dec 2004 16:03:26 +0100 Subject: [openib-general] WARNING! Blocked mail [Mail Delivery (failure bruno.ballester@iaai.fr)] Message-ID: Votre Email avec l'objet 'Mail Delivery (failure bruno.ballester at iaai.fr)', envoye au(x) destinataire(s) bruno.ballester at iaai.fr contenait un fichier en attachement de type '[scr]'. 
Notre organisation n'accepte pas ce type de fichier pas email. Ce message n'a PAS ete delivre au(x) destinataire(s) du message. Si vous avez la moindre question a ce sujet, veuillez contacter notre administrateur (mailto:systeme.information/mailvirus at mail.iaai.fr). -- Ce message a ete genere par exiscan (Institut des Applications Avancees de l'Internet - http://www.iaai.fr) From halr at voltaire.com Wed Dec 15 07:29:01 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 10:29:01 -0500 Subject: [openib-general] IPoIB oops on path record completion Message-ID: <1103124540.4222.13.camel@localhost.localdomain> Dec 15 10:15:56 localhost kernel: ib0: Unicast, no dst: type 0000, QPN 8060000 20:800:1404:2:0:404:fe80:0 >From ipoib_main.c line 522: /* unicast GID -- should be ARP reply */ if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " IPOIB_GID_FMT "\n", skb->dst ? "neigh" : "dst", be16_to_cpup((u16 *) skb->data), be32_to_cpup((u32 *) phdr->hwaddr), IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); dev_kfree_skb_any(skb); ++priv->stats.tx_dropped; goto out; } So it looks like 0x806 is being put in the QPN rather than type on the remote machine which is an x86_64 (Opteron) machine. Not sure whether this is related to what causes the oops or not. -- Hal From halr at voltaire.com Wed Dec 15 07:32:34 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 10:32:34 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> Message-ID: <1103124702.4222.16.camel@localhost.localdomain> Hi Woody, One thing I would be sure of is the following: In /var/log/messages (or via dmesg), make sure there are no oops (that all processes needed are running). I have seen (just reported) what I think is an alignment issue on x64_64 which is what I think you are using. -- Hal From roland at topspin.com Wed Dec 15 07:51:17 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 07:51:17 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <52mzwg2561.fsf@topspin.com> (Roland Dreier's message of "Tue, 14 Dec 2004 08:52:38 -0800") References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> Message-ID: <52ekhry2yy.fsf@topspin.com> OK, I think the following change (already committed) should fix IPoIB with PCI Express HCAs. Thanks to Josh England for giving me access to a couple of his machines for debugging, and to Tziporet Koren for pointing out this Arbel "feature." - Roland Index: infiniband/hw/mthca/mthca_av.c =================================================================== --- infiniband/hw/mthca/mthca_av.c (revision 1331) +++ infiniband/hw/mthca/mthca_av.c (working copy) @@ -97,6 +97,9 @@ cpu_to_be32((ah_attr->grh.traffic_class << 20) | ah_attr->grh.flow_label); memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } else { + /* Arbel workaround -- low byte of GID must be 2 */ + av->dgid[3] = cpu_to_be32(2); } if (0) { From roland at topspin.com Wed Dec 15 07:53:16 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 07:53:16 -0800 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <20041215125825.GA12480@mellanox.co.il> (Michael S. 
Tsirkin's message of "Wed, 15 Dec 2004 14:58:25 +0200") References: <20041215125825.GA12480@mellanox.co.il> Message-ID: <52acsfy2vn.fsf@topspin.com> Michael> What would be the best way to organise this code? I Michael> guess could be two thinkable ways to do this: Michael> 1. Add the functionality to the mthca module, use a Michael> run-time flag to describe hardware differences. Thus, Michael> the same module will handle devices of all types. I've already started adding some memfree support ot the mthca module (following this approach). So far the code does initialization through MAP_FA and RUN_FW, and calls QUERY_DEV_LIM. By the way, there's no need for a runtime flag, the PCI device ID is sufficient to distinguish Tavor mode from Arbel mode. - Roland From roland at topspin.com Wed Dec 15 07:54:01 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 07:54:01 -0800 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <1103115933.4126.53.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 08:05:33 -0500") References: <20041215125825.GA12480@mellanox.co.il> <1103115933.4126.53.camel@localhost.localdomain> Message-ID: <526533y2ue.fsf@topspin.com> Hal> Also, with this work, would Tavor also be made to work with Hal> DDR hidden ? This should already work. I've been using DDR-hidden mode to work around IBM pSeries OpenFirmware issues for several months now. - Roland From roland at topspin.com Wed Dec 15 07:56:10 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 07:56:10 -0800 Subject: [openib-general] Re: Better kernel license In-Reply-To: <52pt1d37lb.fsf_-_@topspin.com> (Roland Dreier's message of "Mon, 13 Dec 2004 19:02:40 -0800") References: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> <200412131010.qyAMW5NxoiM4CntC@topspin.com> <20041214.114259.16194522.yoshfuji@linux-ipv6.org> <52u0qp37s1.fsf@topspin.com> <52pt1d37lb.fsf_-_@topspin.com> Message-ID: <52vfb3wo6d.fsf@topspin.com> I haven't gotten much response to my initial query. So does anyone (especially copyright holders) object if I change the licenses headers in our kernel source files to the below? If I don't get any complaints, I'll commit this change on Thursday afternoon. Thanks, Roland /* * Copyright (c) 2004 Company 1. All rights reserved. * Copyright (c) 2004 Company 2. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * * $Id$ */ From halr at voltaire.com Wed Dec 15 08:02:03 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 11:02:03 -0500 Subject: [openib-general] removing ib_mthca module Message-ID: <1103126174.4357.10.camel@localhost.localdomain> After loading mthca and IPoIB, running some IPoIB traffic then removing IPoIB module, and mthca, I got the following: Dec 15 10:48:51 localhost kernel: ib0: ib_dealloc_pd failed is back :-( From halr at voltaire.com Wed Dec 15 08:05:16 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 11:05:16 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <52ekhry2yy.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> Message-ID: <1103126523.4357.18.camel@localhost.localdomain> On Wed, 2004-12-15 at 10:51, Roland Dreier wrote: > Thanks to Josh England for giving me access to > a couple of his machines for debugging, and to Tziporet Koren for > pointing out this Arbel "feature." > + /* Arbel workaround -- low byte of GID must be 2 */ > + av->dgid[3] = cpu_to_be32(2); Is this intended as a temporary or permanent workaround ? If temporary, how temporary (is there a timeframe for the "real" fix) and any idea of what it would entail ? -- Hal From roland at topspin.com Wed Dec 15 08:09:11 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 08:09:11 -0800 Subject: [openib-general] removing ib_mthca module In-Reply-To: <1103126174.4357.10.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 11:02:03 -0500") References: <1103126174.4357.10.camel@localhost.localdomain> Message-ID: <52r7lrwnko.fsf@topspin.com> I found some bugs in the new IPoIB code (although it does fix the neigh destructor oops on module removal). I'll be committing fixes later today once I finish testing. - R. From roland at topspin.com Wed Dec 15 08:13:25 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 08:13:25 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1103126523.4357.18.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 11:05:16 -0500") References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> <1103126523.4357.18.camel@localhost.localdomain> Message-ID: <52mzwfwndm.fsf@topspin.com> Hal> Is this intended as a temporary or permanent workaround ? If Hal> temporary, how temporary (is there a timeframe for the "real" Hal> fix) and any idea of what it would entail ? As I understand it, this is a permanent workaround for an Arbel hardware issue. 
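Reading the committed mthca_av.c hunk together with this explanation, the workaround only touches the address vector built for a send without a GRH; a sketch follows, where the guarding GRH test is an assumption since the quoted hunk does not show it:

if (ah_attr->ah_flags & IB_AH_GRH) {
        /* ... GRH fields filled in as in the quoted hunk ... */
        memcpy(av->dgid, ah_attr->grh.dgid.raw, 16);
} else {
        /*
         * No GRH will be sent, so av->dgid never goes on the wire;
         * Arbel merely requires the low byte of this internal field
         * to read 2.  Port GUIDs and GUIDInfo are untouched.
         */
        av->dgid[3] = cpu_to_be32(2);
}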
- Roland From halr at voltaire.com Wed Dec 15 08:16:57 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 11:16:57 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <52mzwfwndm.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> <1103126523.4357.18.camel@localhost.localdomain> <52mzwfwndm.fsf@topspin.com> Message-ID: <1103127417.4357.25.camel@localhost.localdomain> On Wed, 2004-12-15 at 11:13, Roland Dreier wrote: > As I understand it, this is a permanent workaround for an Arbel > hardware issue. If that is the case, isn't there more than this that needs to be done ? -- Hal From mst at mellanox.co.il Wed Dec 15 08:28:54 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Dec 2004 18:28:54 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <1103118679.4126.111.camel@localhost.localdomain> References: <20041215125825.GA12480@mellanox.co.il> <1103115933.4126.53.camel@localhost.localdomain> <20041215132459.GB12480@mellanox.co.il> <1103117706.4126.93.camel@localhost.localdomain> <20041215134743.GA12584@mellanox.co.il> <1103118679.4126.111.camel@localhost.localdomain> Message-ID: <20041215162854.GB13160@mellanox.co.il> Hello! Quoting r. Hal Rosenstock (halr at voltaire.com) "Re: [openib-general] MT25218 (aka Arbel memfree) support": > On Wed, 2004-12-15 at 08:47, Michael S. Tsirkin wrote: > > But > > > > ah->on_hca = 0; > > > So it shouldnt be set when DDR is hidden. > > Got it backwards: it is when DDR is not hidden. Sorry. > So what's the problem again? MST From roland at topspin.com Wed Dec 15 08:36:04 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 08:36:04 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1103127417.4357.25.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 11:16:57 -0500") References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> <1103126523.4357.18.camel@localhost.localdomain> <52mzwfwndm.fsf@topspin.com> <1103127417.4357.25.camel@localhost.localdomain> Message-ID: <52ekhrwmbv.fsf@topspin.com> Hal> If that is the case, isn't there more than this that needs to Hal> be done ? The docs say that the low byte of the GID has to be set to 2 when a GRH is not being included in the AV. That's what this patch does. Why would there be anything else required? - Roland From mst at mellanox.co.il Wed Dec 15 08:36:20 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Dec 2004 18:36:20 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <52acsfy2vn.fsf@topspin.com> References: <20041215125825.GA12480@mellanox.co.il> <52acsfy2vn.fsf@topspin.com> Message-ID: <20041215163620.GC13160@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] MT25218 (aka Arbel memfree) support": > Michael> What would be the best way to organise this code? I > Michael> guess could be two thinkable ways to do this: > > Michael> 1. Add the functionality to the mthca module, use a > Michael> run-time flag to describe hardware differences. Thus, > Michael> the same module will handle devices of all types. > > I've already started adding some memfree support ot the mthca module > (following this approach). So far the code does initialization > through MAP_FA and RUN_FW, and calls QUERY_DEV_LIM. Yep, I see that. 
I think its a valid approach - although I thought that it is better to put the memfree code in a separate file. I guess we can always optimise to compile-time checks it later. > By the way, there's no need for a runtime flag, the PCI device ID is > sufficient to distinguish Tavor mode from Arbel mode. > > - Roland How do you mean? I see you use the hca_type flag? MST From roland at topspin.com Wed Dec 15 08:39:15 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 08:39:15 -0800 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <20041215163620.GC13160@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 15 Dec 2004 18:36:20 +0200") References: <20041215125825.GA12480@mellanox.co.il> <52acsfy2vn.fsf@topspin.com> <20041215163620.GC13160@mellanox.co.il> Message-ID: <526533wm6k.fsf@topspin.com> Roland> By the way, there's no need for a runtime flag, the PCI Roland> device ID is sufficient to distinguish Tavor mode from Roland> Arbel mode. Michael> How do you mean? I see you use the hca_type flag? Maybe I misunderstood you. But all I meant was that there's no requirement to add any more code to check for HCA type or require the user to set a flag. - Roland From halr at voltaire.com Wed Dec 15 08:51:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 11:51:08 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <52ekhrwmbv.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> <1103126523.4357.18.camel@localhost.localdomain> <52mzwfwndm.fsf@topspin.com> <1103127417.4357.25.camel@localhost.localdomain> <52ekhrwmbv.fsf@topspin.com> Message-ID: <1103129468.4224.5.camel@localhost.localdomain> On Wed, 2004-12-15 at 11:36, Roland Dreier wrote: > Hal> If that is the case, isn't there more than this that needs to > Hal> be done ? > > The docs say that the low byte of the GID has to be set to 2 when a > GRH is not being included in the AV. That's what this patch does. > Why would there be anything else required? Isn't the GID the subnet prefix concatenated with the EUI-64 GUID ? If so, I see 2 issues: 1. GUIDInfo should also reflect this. 2. Uniquification of EUI-64s (flag odd ones as errors) -- Hal From mshefty at ichips.intel.com Wed Dec 15 09:01:19 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Dec 2004 09:01:19 -0800 Subject: [openib-general] Re: Better kernel license In-Reply-To: <52vfb3wo6d.fsf@topspin.com> References: <20041213109.42OdQqmmAkW2Pv7s@topspin.com> <200412131010.qyAMW5NxoiM4CntC@topspin.com> <20041214.114259.16194522.yoshfuji@linux-ipv6.org> <52u0qp37s1.fsf@topspin.com> <52pt1d37lb.fsf_-_@topspin.com> <52vfb3wo6d.fsf@topspin.com> Message-ID: <41C06DDF.7010108@ichips.intel.com> Roland Dreier wrote: > I haven't gotten much response to my initial query. So does anyone > (especially copyright holders) object if I change the licenses headers > in our kernel source files to the below? > > If I don't get any complaints, I'll commit this change on Thursday > afternoon. This looks fine to me. - Sean From mst at mellanox.co.il Wed Dec 15 09:20:14 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 15 Dec 2004 19:20:14 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <526533wm6k.fsf@topspin.com> References: <20041215125825.GA12480@mellanox.co.il> <52acsfy2vn.fsf@topspin.com> <20041215163620.GC13160@mellanox.co.il> <526533wm6k.fsf@topspin.com> Message-ID: <20041215172014.GB13365@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] MT25218 (aka Arbel memfree) support": > Roland> By the way, there's no need for a runtime flag, the PCI > Roland> device ID is sufficient to distinguish Tavor mode from > Roland> Arbel mode. > > Michael> How do you mean? I see you use the hca_type flag? > > Maybe I misunderstood you. But all I meant was that there's no > requirement to add any more code to check for HCA type or require the > user to set a flag. > Yes, clearly, hca_type is sufficient. MST From jjengla at sandia.gov Wed Dec 15 10:22:27 2004 From: jjengla at sandia.gov (England, Joshua J) Date: Wed, 15 Dec 2004 11:22:27 -0700 Subject: [openib-general] IPoIB still not working Message-ID: <42FD903D5ADE644A87701E946196A11373A53B@ES22SNLNT.srn.sandia.gov> Roland, Thanks for taking the time to debug this issue. I'll definitely pound on the stuff and let you know if anything breaks. -JE -----Original Message----- From: openib-general-bounces at openib.org on behalf of Roland Dreier Sent: Wed 12/15/2004 8:51 AM To: Woodruff, Robert J Cc: openib-general at openib.org Subject: Re: [openib-general] IPoIB still not working OK, I think the following change (already committed) should fix IPoIB with PCI Express HCAs. Thanks to Josh England for giving me access to a couple of his machines for debugging, and to Tziporet Koren for pointing out this Arbel "feature." - Roland Index: infiniband/hw/mthca/mthca_av.c =================================================================== --- infiniband/hw/mthca/mthca_av.c (revision 1331) +++ infiniband/hw/mthca/mthca_av.c (working copy) @@ -97,6 +97,9 @@ cpu_to_be32((ah_attr->grh.traffic_class << 20) | ah_attr->grh.flow_label); memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } else { + /* Arbel workaround -- low byte of GID must be 2 */ + av->dgid[3] = cpu_to_be32(2); } if (0) { _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From libor at topspin.com Wed Dec 15 10:35:32 2004 From: libor at topspin.com (Libor Michalek) Date: Wed, 15 Dec 2004 10:35:32 -0800 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <52acsfy2vn.fsf@topspin.com>; from roland@topspin.com on Wed, Dec 15, 2004 at 07:53:16AM -0800 References: <20041215125825.GA12480@mellanox.co.il> <52acsfy2vn.fsf@topspin.com> Message-ID: <20041215103532.A26487@topspin.com> On Wed, Dec 15, 2004 at 07:53:16AM -0800, Roland Dreier wrote: > Michael> What would be the best way to organise this code? I > Michael> guess could be two thinkable ways to do this: > > Michael> 1. Add the functionality to the mthca module, use a > Michael> run-time flag to describe hardware differences. Thus, > Michael> the same module will handle devices of all types. > > By the way, there's no need for a runtime flag, the PCI device ID is > sufficient to distinguish Tavor mode from Arbel mode. 
Roland, How is the determination whether to use memfree mode or onboard memory going to be made for Arbel mode? -Libor From robert.j.woodruff at intel.com Wed Dec 15 10:58:57 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 15 Dec 2004 10:58:57 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030F1E67@orsmsx408> Hal> > + /* Arbel workaround -- low byte of GID must be 2 */ > + av->dgid[3] = cpu_to_be32(2); >Is this intended as a temporary or permanent workaround ? If temporary, >how temporary (is there a timeframe for the "real" fix) and any idea of >what it would entail ? >-- Hal Thanks. This fixed my problem with the PCI-E cards also. woody From roland at topspin.com Wed Dec 15 11:19:43 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 11:19:43 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1103129468.4224.5.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 11:51:08 -0500") References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> <1103126523.4357.18.camel@localhost.localdomain> <52mzwfwndm.fsf@topspin.com> <1103127417.4357.25.camel@localhost.localdomain> <52ekhrwmbv.fsf@topspin.com> <1103129468.4224.5.camel@localhost.localdomain> Message-ID: <521xdrwer4.fsf@topspin.com> Hal> Isn't the GID the subnet prefix concatenated with the EUI-64 GUID ? Hal> If so, I see 2 issues: 1. GUIDInfo should also reflect this. Hal> 2. Uniquification of EUI-64s (flag odd ones as errors) This change/workaround only affects the value put into an internal HCA data structure when a GRH/GID is _not_ being sent. - Roland From roland at topspin.com Wed Dec 15 11:20:54 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 11:20:54 -0800 Subject: [openib-general] MT25218 (aka Arbel memfree) support In-Reply-To: <20041215103532.A26487@topspin.com> (Libor Michalek's message of "Wed, 15 Dec 2004 10:35:32 -0800") References: <20041215125825.GA12480@mellanox.co.il> <52acsfy2vn.fsf@topspin.com> <20041215103532.A26487@topspin.com> Message-ID: <52wtvjv04p.fsf@topspin.com> Libor> How is the determination whether to use memfree mode or Libor> onboard memory going to be made for Arbel mode? As I understand it, only memfree mode is supported for Arbel mode. However, if onboard memory is ever supported, I think we would just have the driver prefer using HCA-attached memory to system memory. - R. From robert.j.woodruff at intel.com Wed Dec 15 11:35:47 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 15 Dec 2004 11:35:47 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00030F1F3D@orsmsx408> >One thing I would be sure of is the following: >In /var/log/messages (or via dmesg), make sure there are no oops >(that all processes needed are running). >I have seen (just reported) what I think is an alignment issue on x64_64 >which is what I think you are using. >-- Hal I have not seen any oops, but have only done minimal testing so far. I also do not see the message that in /var/log/messages or dmesg, so my stack is not executing that code path. ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " IPOIB_GID_FMT "\n", skb->dst ? 
"neigh" : "dst", be16_to_cpup((u16 *) skb->data), be32_to_cpup((u32 *) phdr->hwaddr), IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr I now plan to re-install my PCI-E cards on the 2 nodes that have PCI-X cards, which will give me a 4 node PCI-E cluster, and load up the very latest code from svn, and run some MPI tests over IPoIB. I will let you know if I see any issues. From tduffy at sun.com Wed Dec 15 13:23:03 2004 From: tduffy at sun.com (Tom Duffy) Date: Wed, 15 Dec 2004 13:23:03 -0800 Subject: [openib-general] new sparse warnings Message-ID: <1103145783.22163.7.camel@duffman> on x86_64. CHECK /build1/tduffy/openib-work/linux-2.6.9-openib/drivers/infiniband/core/user_mad.c /build1/tduffy/openib-work/linux-2.6.9-openib/drivers/infiniband/core/user_mad.c:375:6: warning: cast removes address space of expression /build1/tduffy/openib-work/linux-2.6.9-openib/drivers/infiniband/core/user_mad.c:375:6: warning: cast removes address space of expression /build1/tduffy/openib-work/linux-2.6.9-openib/drivers/infiniband/core/user_mad.c:375:6: warning: cast removes address space of expression /build1/tduffy/openib-work/linux-2.6.9-openib/drivers/infiniband/core/user_mad.c:375:6: warning: cast removes address space of expression CC [M] drivers/infiniband/core/user_mad.o -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Wed Dec 15 13:42:46 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 13:42:46 -0800 Subject: [openib-general] new sparse warnings In-Reply-To: <1103145783.22163.7.camel@duffman> (Tom Duffy's message of "Wed, 15 Dec 2004 13:23:03 -0800") References: <1103145783.22163.7.camel@duffman> Message-ID: <52fz27utk9.fsf@topspin.com> I'm puzzled by that sparse warning. I get no complaint on i386, and the line in question seems to be if (put_user(agent_id, (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { which I believe is correctly annotated. A quick glance at shows the definition of put_user() to be way to obfuscated to understand quickly but my first guess would be that there is some subtle difference between the i386 and x86_64 versions that triggers the warning. - R. From roland at topspin.com Wed Dec 15 13:44:57 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 13:44:57 -0800 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1103124540.4222.13.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 10:29:01 -0500") References: <1103124540.4222.13.camel@localhost.localdomain> Message-ID: <52brcvutgm.fsf@topspin.com> I've committed a few fixes to IPoIB. Please let me know if you still see issues with the latest code. - R. From Tom.Duffy at Sun.COM Wed Dec 15 13:55:39 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Wed, 15 Dec 2004 13:55:39 -0800 Subject: [openib-general] new sparse warnings In-Reply-To: <52fz27utk9.fsf@topspin.com> References: <1103145783.22163.7.camel@duffman> <52fz27utk9.fsf@topspin.com> Message-ID: <1103147739.22163.10.camel@duffman> On Wed, 2004-12-15 at 13:42 -0800, Roland Dreier wrote: > I'm puzzled by that sparse warning. I get no complaint on i386, and > the line in question seems to be > > if (put_user(agent_id, > (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { > > which I believe is correctly annotated. 
A quick glance at > shows the definition of put_user() to be way to obfuscated to > understand quickly but my first guess would be that there is some > subtle difference between the i386 and x86_64 versions that triggers > the warning. Yeah, I started down this route as well, and ran into the same confusion. I thought maybe somebody smarter than me could figure it out... -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Wed Dec 15 14:23:25 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 17:23:25 -0500 Subject: [openib-general] [PATCH] mad: Use thread context for callbacks due to locally handled MADs in handle_outgoing_smp Message-ID: <1103149405.4224.277.camel@localhost.localdomain> mad: Use thread context for callbacks due to locally handled MADs in handle_outgoing_smp Signed-off-by: Hal Rosenstock Index: mad_priv.h =================================================================== --- mad_priv.h (revision 1332) +++ mad_priv.h (working copy) @@ -114,6 +114,8 @@ struct list_head wait_list; struct work_struct timed_work; unsigned long timeout; + struct list_head local_list; + struct work_struct local_work; atomic_t refcount; wait_queue_head_t wait; @@ -143,6 +145,15 @@ enum ib_wc_status status; }; +struct ib_mad_local_private { + struct list_head completion_list; + struct ib_mad_private *mad_priv; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; +}; + struct ib_mad_mgmt_method_table { struct ib_mad_agent_private *agent[IB_MGMT_MAX_METHODS]; }; Index: mad.c =================================================================== --- mad.c (revision 1332) +++ mad.c (working copy) @@ -87,6 +87,7 @@ static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); static void timeout_sends(void *data); +static void local_completions(void *data); static int solicited_mad(struct ib_mad *mad); static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, struct ib_mad_agent_private *agent_priv, @@ -356,6 +357,9 @@ INIT_LIST_HEAD(&mad_agent_priv->send_list); INIT_LIST_HEAD(&mad_agent_priv->wait_list); INIT_WORK(&mad_agent_priv->timed_work, timeout_sends, mad_agent_priv); + INIT_LIST_HEAD(&mad_agent_priv->local_list); + INIT_WORK(&mad_agent_priv->local_work, local_completions, + mad_agent_priv); atomic_set(&mad_agent_priv->refcount, 1); init_waitqueue_head(&mad_agent_priv->wait); @@ -640,9 +644,10 @@ struct ib_smp *smp, struct ib_send_wr *send_wr) { - int ret; + int ret, alloc_flags; + unsigned long flags; + struct ib_mad_local_private *local; struct ib_mad_private *mad_priv; - struct ib_mad_send_wc mad_send_wc; struct ib_device *device = mad_agent_priv->agent.device; u8 port_num = mad_agent_priv->agent.port_num; @@ -656,12 +661,22 @@ if (!ret || !device->process_mad) goto out; - mad_priv = kmem_cache_alloc(ib_mad_cache, - (in_atomic() || irqs_disabled()) ? 
- GFP_ATOMIC : GFP_KERNEL); + if (in_atomic() || irqs_disabled()) + alloc_flags = GFP_ATOMIC; + else + alloc_flags = GFP_KERNEL; + local = kmalloc(sizeof *local, alloc_flags); + if (!local) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for ib_mad_local_private\n"); + goto out; + } + local->mad_priv = NULL; + mad_priv = kmem_cache_alloc(ib_mad_cache, alloc_flags); if (!mad_priv) { ret = -ENOMEM; printk(KERN_ERR PFX "No memory for local response MAD\n"); + kfree(local); goto out; } ret = device->process_mad(device, 0, port_num, smp->dr_slid, @@ -675,39 +690,9 @@ * there is a recv handler */ if (solicited_mad(&mad_priv->mad.mad) && - mad_agent_priv->agent.recv_handler) { - struct ib_wc wc; - - /* - * Defined behavior is to complete response - * before request - */ - wc.wr_id = send_wr->wr_id; - wc.status = IB_WC_SUCCESS; - wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; - wc.byte_len = sizeof(struct ib_mad); - wc.src_qp = IB_QP0; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = IB_LID_PERMISSIVE; - wc.sl = 0; - wc.dlid_path_bits = 0; - mad_priv->header.recv_wc.wc = &wc; - mad_priv->header.recv_wc.mad_len = - sizeof(struct ib_mad); - INIT_LIST_HEAD(&mad_priv->header.recv_wc.recv_buf.list); - mad_priv->header.recv_wc.recv_buf.grh = NULL; - mad_priv->header.recv_wc.recv_buf.mad = - &mad_priv->mad.mad; - if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) - snoop_recv(mad_agent_priv->qp_info, - &mad_priv->header.recv_wc, - IB_MAD_SNOOP_RECVS); - mad_agent_priv->agent.recv_handler( - &mad_agent_priv->agent, - &mad_priv->header.recv_wc); - } else + mad_agent_priv->agent.recv_handler) + local->mad_priv = mad_priv; + else kmem_cache_free(ib_mad_cache, mad_priv); break; case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: @@ -715,23 +700,31 @@ break; case IB_MAD_RESULT_SUCCESS: kmem_cache_free(ib_mad_cache, mad_priv); + kfree(local); ret = 0; goto out; default: kmem_cache_free(ib_mad_cache, mad_priv); + kfree(local); ret = -EINVAL; goto out; } - /* Complete send */ - mad_send_wc.status = IB_WC_SUCCESS; - mad_send_wc.vendor_err = 0; - mad_send_wc.wr_id = send_wr->wr_id; - if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) - snoop_send(mad_agent_priv->qp_info, send_wr, &mad_send_wc, - IB_MAD_SNOOP_SEND_COMPLETIONS); - mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, - &mad_send_wc); + local->send_wr = *send_wr; + local->send_wr.sg_list = local->sg_list; + memcpy(local->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + local->send_wr.next = NULL; + local->tid = send_wr->wr.ud.mad_hdr->tid; + local->wr_id = send_wr->wr_id; + /* Reference MAD agent until local completion handled */ + atomic_inc(&mad_agent_priv->refcount); + /* Queue local completion to local list */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&local->completion_list, &mad_agent_priv->local_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + queue_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->local_work); ret = 1; out: return ret; @@ -2020,6 +2013,73 @@ } EXPORT_SYMBOL(ib_cancel_mad); +static void local_completions(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_local_private *local; + unsigned long flags; + struct ib_wc wc; + struct ib_mad_send_wc mad_send_wc; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->local_list)) { + local = list_entry(mad_agent_priv->local_list.next, + struct ib_mad_local_private, + 
completion_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + if (local->mad_priv) { + /* + * Defined behavior is to complete response + * before request + */ + wc.wr_id = local->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = IB_QP0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + local->mad_priv->header.recv_wc.wc = &wc; + local->mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&local->mad_priv->header.recv_wc.recv_buf.list); + local->mad_priv->header.recv_wc.recv_buf.grh = NULL; + local->mad_priv->header.recv_wc.recv_buf.mad = + &local->mad_priv->mad.mad; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_recv(mad_agent_priv->qp_info, + &local->mad_priv->header.recv_wc, + IB_MAD_SNOOP_RECVS); + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &local->mad_priv->header.recv_wc); + } + + /* Complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = local->wr_id; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_send(mad_agent_priv->qp_info, &local->send_wr, + &mad_send_wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&local->completion_list); + atomic_dec(&mad_agent_priv->refcount); + kfree(local); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + static void timeout_sends(void *data) { struct ib_mad_agent_private *mad_agent_priv; From halr at voltaire.com Wed Dec 15 15:08:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 18:08:28 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <52brcvutgm.fsf@topspin.com> References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> Message-ID: <1103151733.4224.280.camel@localhost.localdomain> On Wed, 2004-12-15 at 16:44, Roland Dreier wrote: > I've committed a few fixes to IPoIB. Please let me know if you still > see issues with the latest code. 
Still get oops on ping -b immediately after bringing up ib0 on the IP subnet :-( Dec 15 17:59:45 localhost kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000000b Dec 15 17:59:45 localhost kernel: printing eip: Dec 15 17:59:45 localhost kernel: d09a5900 Dec 15 17:59:45 localhost kernel: *pde = 0ba0e067 Dec 15 17:59:45 localhost kernel: *pte = 00000000 Dec 15 17:59:45 localhost kernel: Oops: 0000 [#1] Dec 15 17:59:45 localhost kernel: Modules linked in: ib_ipoib ib_sa ib_mthca ib_mad ib_core loop ide_cd cdrom lp ipv6 autofs parport_pc parport uhci_hcd ehci_hcd ohci_hcd eepro100 mii evdev usbcore Dec 15 17:59:45 localhost kernel: CPU: 0 Dec 15 17:59:45 localhost kernel: EIP: 0060:[] Not tainted VLI Dec 15 17:59:45 localhost kernel: EFLAGS: 00010203 (2.6.9) Dec 15 17:59:45 localhost kernel: EIP is at path_rec_completion+0x2a0/0x450 [ib_ipoib] Dec 15 17:59:45 localhost kernel: eax: 0000000b ebx: 00000000 ecx: c3cb5e14 edx: 0000000b Dec 15 17:59:45 localhost kernel: esi: 00000000 edi: cff66a60 ebp: ce175a80 esp: c3cb5dac Dec 15 17:59:45 localhost kernel: ds: 007b es: 007b ss: 0068 Dec 15 17:59:45 localhost kernel: Process ib_mad1 (pid: 25367, threadinfo=c3cb4000 task=c306f420) Dec 15 17:59:45 localhost kernel: Stack: c1d2c800 c1d2cc80 00000001 c6b22420 00000087 00000001 c3cb5e1c c0108cd4 Dec 15 17:59:45 localhost kernel: 00000282 00000000 c3cb5e1c c03a3a80 c03a3a80 00000005 c6b22420 c0109304 Dec 15 17:59:45 localhost kernel: 00000005 00000282 c128aa00 0000000a 00000286 00000082 d0944f00 00000246 Dec 15 17:59:45 localhost kernel: Call Trace: Dec 15 17:59:45 localhost kernel: [] handle_IRQ_event+0x34/0x70 Dec 15 17:59:45 localhost kernel: [] do_IRQ+0x194/0x350 Dec 15 17:59:45 localhost kernel: [] ib_sa_path_rec_callback+0x5a/0x90 [ib_sa] Dec 15 17:59:45 localhost kernel: [] ib_mad_send_done_handler+0x13b/0x230 [ib_mad] Dec 15 17:59:45 localhost kernel: [] send_handler+0x10b/0x350 [ib_sa] Dec 15 17:59:45 localhost kernel: [] schedule+0x306/0x5c0 Dec 15 17:59:45 localhost kernel: [] timeout_sends+0x113/0x2f0 [ib_mad] Dec 15 17:59:45 localhost kernel: [] timeout_sends+0x0/0x2f0 [ib_mad] Dec 15 17:59:45 localhost kernel: [] worker_thread+0x251/0x470 Dec 15 17:59:45 localhost kernel: [] default_wake_function+0x0/0x20 Dec 15 17:59:45 localhost kernel: [] default_wake_function+0x0/0x20 Dec 15 17:59:45 localhost kernel: [] worker_thread+0x0/0x470 Dec 15 17:59:45 localhost kernel: [] kthread+0xaa/0xb0 Dec 15 17:59:45 localhost kernel: [] kthread+0x0/0xb0 Dec 15 17:59:45 localhost kernel: [] kernel_thread_helper+0x5/0x10 Dec 15 17:59:45 localhost kernel: Code: 45 08 85 c0 75 7b c7 45 04 00 00 00 00 ff 74 24 20 9d 89 f6 8d bc 27 00 00 00 00 8b 44 24 68 8d 4c 24 68 31 d2 39 c8 74 23 89 c2 <8b> 00 ff 4c 24 70 c7 42 08 00 00 00 00 89 48 04 89 44 24 68 c7 From halr at voltaire.com Wed Dec 15 15:11:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 18:11:56 -0500 Subject: [openib-general] [PATCH] mad: Use thread context for callbacks due to locally handled MADs in handle_outgoing_smp In-Reply-To: <1103149405.4224.277.camel@localhost.localdomain> References: <1103149405.4224.277.camel@localhost.localdomain> Message-ID: <1103152316.4146.4.camel@localhost.localdomain> On Wed, 2004-12-15 at 17:23, Hal Rosenstock wrote: > mad: Use thread context for callbacks due to locally handled MADs in > handle_outgoing_smp I forgot to comment on the following relative to this change: I did not implement cancel looking at the local completion list for possible matches. 
If that is incorrect, it is straightforward to add this in. -- Hal From mshefty at ichips.intel.com Wed Dec 15 15:18:31 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Dec 2004 15:18:31 -0800 Subject: [openib-general] [PATCH] mad: Use thread context for callbacks due to locally handled MADs in handle_outgoing_smp In-Reply-To: <1103152316.4146.4.camel@localhost.localdomain> References: <1103149405.4224.277.camel@localhost.localdomain> <1103152316.4146.4.camel@localhost.localdomain> Message-ID: <41C0C647.1040901@ichips.intel.com> Hal Rosenstock wrote: > On Wed, 2004-12-15 at 17:23, Hal Rosenstock wrote: > >>mad: Use thread context for callbacks due to locally handled MADs in >>handle_outgoing_smp > > > I forgot to comment on the following relative to this change: > I did not implement cancel looking at the local completion list for > possible matches. If that is incorrect, it is straightforward to add > this in. I think that's okay. Cancel will simply return (since it doesn't return anything), and the user's callback will be invoked soon after or before with the completion. And it looked like the reference counting was there to handle deregistration. - Sean From halr at voltaire.com Wed Dec 15 15:22:35 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 18:22:35 -0500 Subject: [openib-general] [PATCH] mad: Use thread context for callbacks due to locally handled MADs in handle_outgoing_smp In-Reply-To: <41C0C647.1040901@ichips.intel.com> References: <1103149405.4224.277.camel@localhost.localdomain> <1103152316.4146.4.camel@localhost.localdomain> <41C0C647.1040901@ichips.intel.com> Message-ID: <1103152954.4146.8.camel@localhost.localdomain> On Wed, 2004-12-15 at 18:18, Sean Hefty wrote: > And it looked like the reference counting was there to handle deregistration. Yes. Thanks. -- Hal From roland at topspin.com Wed Dec 15 15:36:30 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 15:36:30 -0800 Subject: [openib-general] new sparse warnings In-Reply-To: <1103145783.22163.7.camel@duffman> (Tom Duffy's message of "Wed, 15 Dec 2004 13:23:03 -0800") References: <1103145783.22163.7.camel@duffman> Message-ID: <521xdruoap.fsf@topspin.com> I have no x86_64 sparse setup to test with, but does this kernel patch fix it? Index: linux-bk/include/asm-x86_64/uaccess.h =================================================================== --- linux-bk.orig/include/asm-x86_64/uaccess.h 2004-12-11 15:16:44.000000000 -0800 +++ linux-bk/include/asm-x86_64/uaccess.h 2004-12-15 15:35:47.482091664 -0800 @@ -172,7 +172,7 @@ /* FIXME: this hack is definitely wrong -AK */ struct __large_struct { unsigned long buf[100]; }; -#define __m(x) (*(struct __large_struct *)(x)) +#define __m(x) (*(struct __large_struct __user *)(x)) /* * Tell gcc we read from memory instead of writing: this is because From tduffy at sun.com Wed Dec 15 15:43:47 2004 From: tduffy at sun.com (Tom Duffy) Date: Wed, 15 Dec 2004 15:43:47 -0800 Subject: [openib-general] new sparse warnings In-Reply-To: <521xdruoap.fsf@topspin.com> References: <1103145783.22163.7.camel@duffman> <521xdruoap.fsf@topspin.com> Message-ID: <1103154227.12446.2.camel@duffman> On Wed, 2004-12-15 at 15:36 -0800, Roland Dreier wrote: > I have no x86_64 sparse setup to test with, but does this kernel patch fix it? Yes, this fixed the sparse warning. Thanks, I knew somebody smarter would be able to fix it :) -tduffy -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Wed Dec 15 16:12:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 19:12:29 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1103151733.4224.280.camel@localhost.localdomain> References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> Message-ID: <1103155948.4127.1.camel@localhost.localdomain> On Wed, 2004-12-15 at 18:08, Hal Rosenstock wrote: > On Wed, 2004-12-15 at 16:44, Roland Dreier wrote: > > I've committed a few fixes to IPoIB. Please let me know if you still > > see issues with the latest code. > > Still get oops on ping -b immediately after bringing up ib0 on the IP > subnet :-( This is due to the following: ib_sa_path_rec_callback: sa_query 0xc0db0788 status 0xffffff92 mad 0x00000000 which invokes query->callback(status, NULL, query->context); ipoib_main.c: static void path_rec_completion(int status, struct ib_sa_path_rec *pathrec, void *path_ptr) path_rec_completion is using the pathrec parameter as a pointer without checking it for NULL first. Is the same also true for any other SA client callbacks like multicast ? Also, what I do see when I do a broadcast ping is that the path record is obtained over and over rather than being requested once and cached. Is that what is supposed to be happening now ? -- Hal From mshefty at ichips.intel.com Wed Dec 15 16:25:23 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 15 Dec 2004 16:25:23 -0800 Subject: [openib-general] CM header file Message-ID: <41C0D5F3.7010307@ichips.intel.com> Listed below is a start at an updated CM API. Several details are missing, but there should be enough there to evaluate the direction of the CM. Some design notes: * User's are responsible for destroying "connection identifiers". This was done to ensure that reference counting can be handled properly. * The CM will allocate connection identifiers for clients upon receiving REQ or SIDR_REQ messages. The new identifiers are returned to clients via a callback. See below. * QP transitions are expected to be done by the clients. * The CM handles formatting CM MADs and tracking CM states. Questions: * Connection identifiers cannot be destroyed from within a CM callback. I think that we can support this by adding a flag (destroying within callback) to the destroy call. Is this worth adding? * Should listening clients call an "accept" routine to wait for a connection request? Currently, the API operates asynchronously and inokves a CM event handler. * Should we allow forcing a connection into the established state, regardless of its previous state? This would let the user connect without using the CM, and then give control of the connection to the CM. * Should calls to send the LAP/APR set the QP's alternate path information? Related to this, should the CM support a call to force path migration on the QP. - Sean /* * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available at * , or the OpenIB.org BSD * license, available in the LICENSE.TXT file accompanying this * software. These details are also available at * . 
* * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * * Copyright (c) 2004 Intel Corporation. All rights reserved. * Copyright (c) 2004 Topspin Corporation. All rights reserved. * Copyright (c) 2004 Voltaire Corporation. All rights reserved. * * $Id$ */ #if !defined(IB_CM_H) #define IB_CM_H #include struct ib_path_record; //*** TBD: define me somewhere else (ib_smi.h?) enum ib_cm_state { IB_CM_IDLE, IB_CM_LISTEN, IB_CM_REQ_SENT, IB_CM_REQ_RCVD, IB_CM_MRA_REQ_SENT IB_CM_MRA_REQ_RCVD, IB_CM_REP_SENT, IB_CM_REP_RCVD, IB_CM_MRA_REP_SENT, IB_CM_MRA_REP_RCVD, IB_CM_ESTABLISHED, IB_CM_LAP_SENT, IB_CM_LAP_RCVD, IB_CM_MRA_LAP_SENT, IB_CM_MRA_LAP_RCVD, IB_CM_DREQ_SENT, IB_CM_DREQ_RCVD, IB_CM_TIMEWAIT, IB_CM_SIDR_REQ_SENT, IB_CM_SIDR_REQ_RCVD }; enum ib_cm_event_type { IB_CM_REQ_TIMEOUT, IB_CM_REQ_RECEIVED, IB_CM_REP_TIMEOUT, IB_CM_REP_RECEIVED, IB_CM_RTU_RECEIVED, IB_CM_DREQ_TIMEOUT IB_CM_DREQ_RECEIVED, IB_CM_DREP_RECEIVED, IB_CM_MRA_RECEIVED, IB_CM_LAP_TIMEOUT, IB_CM_LAP_RECEIVED, IB_CM_APR_RECEIVED }; struct ib_cm_event { /* a whole lot more stuff goes here */ void *private_data; u8 private_data_len; enum ib_cm_event_type event; }; typedef void (*ib_cm_handler)(struct ib_cm_id *, struct ib_cm_event *); struct ib_cm_id { ib_cm_handler cm_handler; void *context; u64 service_id; enum ib_cm_state state; }; /** * ib_create_cm_id - Allocate a connection identifier. * @cm_handler: Callback invoked to notify the user of CM events. * @context: User specified context associated with the connection * identifier. * * Connection identifiers are used to track connection states and * listen requests. */ struct ib_cm_id *ib_create_cm_id(ib_cm_handler cm_handler, void *context); /** * ib_destroy_cm_id - Destroy a connection identifier. * @cm_id: Connection identifier to destroy. * * This call blocks until the connection identifier is destroyed. */ int ib_destroy_cm_id(struct ib_cm_id *cm_id); //*** TBD : add flags to allow calling routine from CM callback... /** * ib_cm_listen - Initiates listening on the specified service ID for * connection and service ID resolution requests. * @cm_id: Connection identifier associated with the listen request. * @service_id: Service identifier matched against incoming connection * and service ID resolution requests. */ int ib_cm_listen(struct ib_cm_id *cm_id, u64 service_id); struct ib_cm_req_param { struct ib_qp *qp; struct ib_path_record *primary_path; struct ib_path_record *alternate_path; u64 service_id; int timeout_ms; void *private_data; u8 private_data_len; u8 responder_resources; u8 initiator_depth; u8 remote_cm_response_timeout; u8 flow_control; u8 local_cm_response_timeout; u8 retry_count; u8 rnr_retry_count; u8 max_cm_retries; }; /** * ib_send_cm_req - Sends a connection request to the remote node. * @cm_id: Connection identifier that will be associated with the * connection request. * @param: Connection request information needed to establish the * connection. 
*/ int ib_send_cm_req(struct ib_cm_id *cm_id, struct ib_cm_req_param *param); struct ib_cm_rep_param { struct ib_qp *qp; void *private_data; u8 reply_private_data_len; u8 responder_resources; u8 initiator_depth; u8 target_ack_delay; u8 failover_accepted; u8 flow_control; u8 rnr_retry_count; }; /** * ib_send_cm_rep - Sends a connection reply in response to a connection * request. * @cm_id: Connection identifier that will be associated with the * connection request. * @param: Connection reply information needed to establish the * connection. */ int ib_send_cm_rep(struct ib_cm_id *cm_id, struct ib_cm_req_param *param); /** * ib_send_cm_rtu - Sends a connection ready to use message in response * to a connection reply message. * @cm_id: Connection identifier associated with the connection request. * @private_data: Optional user-defined private data sent with the * ready to use message. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_rtu(struct ib_cm_id *cm_id, void *private_data, u8 private_data_len); /** * ib_send_cm_dreq - Sends a disconnection request for an existing * connection. * @cm_id: Connection identifier associated with the connection being * released. * @private_data: Optional user-defined private data sent with the * disconnection request message. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_dreq(struct ib_cm_id *cm_id, void *private_data, u8 private_data_len); /** * ib_send_cm_drep - Sends a disconnection reply to a disconnection request. * @cm_id: Connection identifier associated with the connection being * released. * @private_data: Optional user-defined private data sent with the * disconnection reply message. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_drep(struct ib_cm_id *cm_id, void *private_data, u8 private_data_len); /** * ib_cm_establish - Forces a connection state to established. * @cm_id: Connection identifier to transition to established. * * This routine should be invoked by users who receive messages on a * connected QP before an RTU has been received. */ int ib_cm_establish(struct ib_cm_id *id); //*** TBD: should we allow a user for force any cm_id to established? 
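/*
 * Illustrative sketch only -- not part of the proposed API.  Roughly how an
 * active-side consumer might drive a connection with the calls declared
 * above, assuming the consumer owns the QP transitions.  my_cm_handler,
 * my_qp and my_path are placeholder names, and the parameter values are
 * arbitrary.
 *
 *	static void my_cm_handler(struct ib_cm_id *cm_id,
 *				  struct ib_cm_event *event)
 *	{
 *		switch (event->event) {
 *		case IB_CM_REP_RECEIVED:
 *			// consumer transitions its QP to RTR/RTS here
 *			ib_send_cm_rtu(cm_id, NULL, 0);
 *			break;
 *		case IB_CM_REP_TIMEOUT:
 *		case IB_CM_DREQ_RECEIVED:
 *			// tear down; the cm_id is destroyed outside the callback
 *			break;
 *		default:
 *			break;
 *		}
 *	}
 *
 *	static int my_connect(struct ib_qp *my_qp,
 *			      struct ib_path_record *my_path, u64 service_id)
 *	{
 *		struct ib_cm_req_param param = {
 *			.qp		= my_qp,
 *			.primary_path	= my_path,
 *			.service_id	= service_id,
 *			.timeout_ms	= 2000,
 *			.retry_count	= 5,
 *			.max_cm_retries	= 15,
 *		};
 *		struct ib_cm_id *cm_id;
 *
 *		cm_id = ib_create_cm_id(my_cm_handler, NULL);
 *		if (!cm_id)
 *			return -ENOMEM;
 *		return ib_send_cm_req(cm_id, &param);
 *	}
 */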
enum ib_cm_rej_reason { IB_CM_REJ_NO_QP = __constant_htons(1) IB_CM_REJ_NO_EEC = __constant_htons(2) IB_CM_REJ_NO_RESOURCES = __constant_htons(3) IB_CM_REJ_TIMEOUT = __constant_htons(4) IB_CM_REJ_UNSUPPORTED = __constant_htons(5) IB_CM_REJ_INVALID_COMM_ID = __constant_htons(6) IB_CM_REJ_INVALID_COMM_INSTANCE = __constant_htons(7) IB_CM_REJ_INVALID_SERVICE_ID = __constant_htons(8) IB_CM_REJ_INVALID_TRANSPORT_TYPE = __constant_htons(9) IB_CM_REJ_STALE_CONN = __constant_htons(10) IB_CM_REJ_RDC_NOT_EXIST = __constant_htons(11) IB_CM_REJ_INVALID_GID = __constant_htons(12) IB_CM_REJ_INVALID_LID = __constant_htons(13) IB_CM_REJ_INVALID_SL = __constant_htons(14) IB_CM_REJ_INVALID_TRAFFIC_CLASS = __constant_htons(15) IB_CM_REJ_INVALID_HOP_LIMIT = __constant_htons(16) IB_CM_REJ_INVALID_PACKET_RATE = __constant_htons(17) IB_CM_REJ_INVALID_ALT_GID = __constant_htons(18) IB_CM_REJ_INVALID_ALT_LID = __constant_htons(19) IB_CM_REJ_INVALID_ALT_SL = __constant_htons(20) IB_CM_REJ_INVALID_ALT_TRAFFIC_CLASS = __constant_htons(21) IB_CM_REJ_INVALID_ALT_HOP_LIMIT = __constant_htons(22) IB_CM_REJ_INVALID_ALT_PACKET_RATE = __constant_htons(23) IB_CM_REJ_PORT_REDIRECT = __constant_htons(24) IB_CM_REJ_INVALID_MTU = __constant_htons(26) IB_CM_REJ_INSUFFICIENT_RESP_RESOURCES = __constant_htons(27) IB_CM_REJ_CONSUMER_DEFINED = __constant_htons(28) IB_CM_REJ_INVALID_RNR_RETRY = __constant_htons(29) IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = __constant_htons(30) IB_CM_REJ_INVALID_CLASS_VERSION = __constant_htons(31) IB_CM_REJ_INVALID_FLOW_LABEL = __constant_htons(32) IB_CM_REJ_INVALID_ALT_FLOW_LABEL = __constant_htons(33) }; /** * ib_send_cm_rej - Sends a connection rejection message to the * remote node. * @cm_id: Connection identifier associated with the connection being * rejected. * @reason: Reason for the connection request rejection. * @ari: Optional additional rejection information. * @ari_length: Size of the additional rejection information, in bytes. * @private_data: Optional user-defined private data sent with the * rejection message. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_rej(struct ib_cm_id *cm_id, enum ib_cm_rej_reason reason, void *ari, u8 ari_length, void *private_data, u8 private_data_len); /** * ib_send_cm_mra - Sends a message receipt acknowledgement to a connection * message. * @cm_id: Connection identifier associated with the connection message. * @service_timeout: The maximum time required for the sender to reply to * to the connection message. * @private_data: Optional user-defined private data sent with the * message receipt acknowledgement. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_mra(struct ib_cm_id *cm_id, u8 service_timeout, void *private_data, u8 private_data_len); /** * ib_send_cm_lap - Sends a load alternate path request. * @cm_id: Connection identifier associated with the load alternate path * message. * @alternate_path: A path record that identifies the alternate path to * load. * @private_data: Optional user-defined private data sent with the * load alternate path message. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_lap(struct ib_cm_id *cm_id, struct ib_path_record *alternate_path, void *private_data, u8 private_data_len); //*** TBD: should LAP/APR set the QP's alternate path information... 
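/*
 * Illustrative sketch only -- not part of the proposed API.  One possible
 * shape of the failover exchange built on ib_send_cm_lap() above and
 * ib_send_cm_apr()/enum ib_cm_apr_status declared below, assuming the
 * consumer (not the CM) modifies the QP's alternate path.  new_path is a
 * placeholder for a path record the consumer has already resolved.
 *
 *	// Active side, on an established connection:
 *	ret = ib_send_cm_lap(cm_id, new_path, NULL, 0);
 *
 *	// In the consumer's ib_cm_handler:
 *	case IB_CM_LAP_RECEIVED:
 *		// passive side: load the alternate path into the QP
 *		// (or have the CM do it, per the TBD above), then accept.
 *		ib_send_cm_apr(cm_id, IB_CM_APR_SUCCESS, NULL, 0, NULL, 0);
 *		break;
 *	case IB_CM_APR_RECEIVED:
 *		// active side: alternate path armed; migration can later be
 *		// forced by modifying the QP's path_mig_state.
 *		break;
 */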
enum ib_cm_apr_status { IB_CM_APR_SUCCESS IB_CM_APR_INVALID_COMM_ID IB_CM_APR_UNSUPPORTED IB_CM_APR_REJECT IB_CM_APR_REDIRECT IB_CM_APR_IS_CURRENT IB_CM_APR_INVALID_QPN_EECN IB_CM_APR_INVALID_LID IB_CM_APR_INVALID_GID IB_CM_APR_INVALID_FLOW_LABEL IB_CM_APR_INVALID_TCLASS IB_CM_APR_INVALID_HOP_LIMIT IB_CM_APR_INVALID_PACKET_RATE IB_CM_APR_INVALID_SL }; /** * ib_send_cm_apr - Sends an alternate path response message in response to * a load alternate path request. * @cm_id: Connection identifier associated with the alternate path response. * @status: Reply status sent with the alternate path response. * @info: Optional additional information sent with the alternate path * response. * @info_length: Size of the additional information, in bytes. * @private_data: Optional user-defined private data sent with the * alternate path response message. * @private_data_len: Size of the private data buffer, in bytes. */ int ib_send_cm_apr(struct ib_cm_id *cm_id, enum ib_cm_apr_status status, void *info, u8 info_length, void *private_data, u8 private_data_len); /** * ib_cm_migrate_path - Forces a connected QP to use the loaded alternate * path. * @cm_id: Connection identifier associated with the alternate path response. */ //int ib_cm_migrate_path(struct ib_cm_id *cm_id); //*** TBD: should this be part of CM, since it only modifies the QP state? struct ib_cm_sidr_req_param { struct ib_path_record *path; u64 service_id; int timeout_ms; void *private_data; u8 private_data_len; u16 pkey; }; /** * ib_send_cm_sidr_req - Sends a service ID resolution request to the * remote node. * @cm_id: Communication identifier that will be associated with the * service ID resolution request. * @param: Service ID resolution request information. */ int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, struct ib_cm_sidr_req_param *param); enum ib_cm_sidr_status { IB_SIDR_SUCCESS, IB_SIDR_UNSUPPORTED, IB_SIDR_REJECT, IB_SIDR_NO_QP, IB_SIDR_REDIRECT, IB_SIDR_UNSUPPORTED_VERSION }; struct ib_cm_sidr_rep_param { u32 qp_num; u32 qkey; enum ib_cm_sidr_status status; void *info; u8 info_length; void *private_data; u8 private_data_len; }; /** * ib_send_cm_sidr_rep - Sends a service ID resolution request to the * remote node. * @cm_id: Communication identifier associated with the received service ID * resolution request. * @param: Service ID resolution reply information. */ int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, struct ib_cm_sidr_rep_param *param); #endif /* IB_CM_H */ From roland at topspin.com Wed Dec 15 17:14:32 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 17:14:32 -0800 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1103155948.4127.1.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 19:12:29 -0500") References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> Message-ID: <52wtvjt56v.fsf@topspin.com> Hal> This is due to the following: ib_sa_path_rec_callback: Hal> sa_query 0xc0db0788 status 0xffffff92 mad 0x00000000 which Hal> invokes query-> callback(status, NULL, query->context); Hal> ipoib_main.c: static void path_rec_completion(int status, Hal> struct ib_sa_path_rec *pathrec, void *path_ptr) Hal> path_rec_completion is using the pathrec parameter as a Hal> pointer without checking it for NULL first. Hmm... are you sure this is what causes the oops? 
path_rec_completion() will only dereference the pathrec parameter if its local variable ah is non-NULL: if (ah) { path->pathrec = *pathrec; and ah can only be set to non-NULL if status is successful (ah is initialized to NULL and the only place it can be changed is ah = ipoib_create_ah(path->dev, priv->pd, &av); which is inside a test of status. Can you give the exact sequence you use to duplicate this? I haven't been able to make it happen in my network. Hal> Also, what I do see when I do a broadcast ping is that the Hal> path record is obtained over and over rather than being Hal> requested once and cached. Is that what is supposed to be Hal> happening now ? No, that shouldn't happen. I'll try to figure out what's happening. - R. From halr at voltaire.com Wed Dec 15 17:18:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 20:18:21 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <52wtvjt56v.fsf@topspin.com> References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> Message-ID: <1103159901.4127.17.camel@localhost.localdomain> On Wed, 2004-12-15 at 20:14, Roland Dreier wrote: > Hal> This is due to the following: ib_sa_path_rec_callback: > Hal> sa_query 0xc0db0788 status 0xffffff92 mad 0x00000000 which > Hal> invokes query-> callback(status, NULL, query->context); > > Hal> ipoib_main.c: static void path_rec_completion(int status, > Hal> struct ib_sa_path_rec *pathrec, void *path_ptr) > > Hal> path_rec_completion is using the pathrec parameter as a > Hal> pointer without checking it for NULL first. > > Hmm... are you sure this is what causes the oops? No but it definitely oops in that callback. I didn't trace it in path_rec_completion; only glanced at the code. Don't the debug statements deference through NULL regardless of status ? > path_rec_completion() will only dereference the pathrec parameter if > its local variable ah is non-NULL: > > if (ah) { > path->pathrec = *pathrec; > > and ah can only be set to non-NULL if status is successful (ah is > initialized to NULL and the only place it can be changed is > > ah = ipoib_create_ah(path->dev, priv->pd, &av); > > which is inside a test of status. > > Can you give the exact sequence you use to duplicate this? I haven't > been able to make it happen in my network. Have you gotten a negative status on the callback (and NULL pathrec) ? I've yet to see this response on the analyzer as there are too many to go through right now. I do see it with extra debug I put in to narrow this down. > Hal> Also, what I do see when I do a broadcast ping is that the > Hal> path record is obtained over and over rather than being > Hal> requested once and cached. Is that what is supposed to be > Hal> happening now ? > > No, that shouldn't happen. I'll try to figure out what's happening. Thanks. 
-- Hal From roland at topspin.com Wed Dec 15 17:38:49 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 17:38:49 -0800 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1103159901.4127.17.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 20:18:21 -0500") References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> <1103159901.4127.17.camel@localhost.localdomain> Message-ID: <52llbzt42e.fsf@topspin.com> Hal> No but it definitely oops in that callback. I didn't trace it Hal> in path_rec_completion; only glanced at the code. Don't the Hal> debug statements deference through NULL regardless of status ? Good point, I missed those uses. I've just pushed a patch that should fix this problem. Hal> Have you gotten a negative status on the callback (and NULL Hal> pathrec) ? I've yet to see this response on the analyzer as Hal> there are too many to go through right now. I do see it with Hal> extra debug I put in to narrow this down. No, I guess that's why I never saw this problem -- I've never had a path record callback fail. Hal> Also, what I do see when I do a broadcast ping is that the Hal> path record is obtained over and over rather than being Hal> requested once and cached. Is that what is supposed to be Hal> happening now ? I can't duplicate this one. If I do something like ifconfig ib0 10.0.0.1 ping -b 10.255.255.255 I only see one path record lookup for each remote system. You seem to be seeing a path record lookup fail, which of course will cause the lookup to be retried the next time we want to send something to that destination. Do you know why the the path record lookup is failing? - R. From libor at topspin.com Wed Dec 15 17:48:14 2004 From: libor at topspin.com (Libor Michalek) Date: Wed, 15 Dec 2004 17:48:14 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C0D5F3.7010307@ichips.intel.com>; from mshefty@ichips.intel.com on Wed, Dec 15, 2004 at 04:25:23PM -0800 References: <41C0D5F3.7010307@ichips.intel.com> Message-ID: <20041215174814.C26487@topspin.com> On Wed, Dec 15, 2004 at 04:25:23PM -0800, Sean Hefty wrote: > Listed below is a start at an updated CM API. Several details are > missing, but there should be enough there to evaluate the direction of > the CM. Some design notes: > Questions: > > * Connection identifiers cannot be destroyed from within a CM callback. > I think that we can support this by adding a flag (destroying within > callback) to the destroy call. Is this worth adding? The other option is to destroy the connection if the consumer returns an error value from the callback. Along the same lines, will the consumer be allowed to call the corresponding response function from a callback? (e.g. ib_send_cm_rep() from the REQ callback) If not then the same behaviour could also be achieved with a callback return value. > * Should listening clients call an "accept" routine to wait for a > connection request? Currently, the API operates asynchronously and > inokves a CM event handler. I don't think this is necessary. Presumably the biggest reason to use "accept" is to force the consumer to use its own thread to handle CM state changes, thus avoiding CM or MAD thread deadlock if there is a buggy consumer, or is there another reason to add this step? 
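To make the callback-driven model concrete, here is a rough sketch of a passive side that does everything from the handler (written against the draft header; it assumes ib_send_cm_rep() takes the ib_cm_rep_param defined there, and my_qp is just a placeholder for the consumer's QP):

        extern struct ib_qp *my_qp;     /* placeholder: consumer-owned QP */

        static void my_server_handler(struct ib_cm_id *cm_id,
                                      struct ib_cm_event *event)
        {
                struct ib_cm_rep_param rep = {
                        .qp = my_qp,
                };

                switch (event->event) {
                case IB_CM_REQ_RECEIVED:
                        /* consumer moves my_qp to INIT/RTR here, then replies
                           directly from the callback */
                        ib_send_cm_rep(cm_id, &rep);
                        break;
                case IB_CM_RTU_RECEIVED:
                        /* connection usable; consumer moves my_qp to RTS */
                        break;
                default:
                        break;
                }
        }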
However, a nice to have feature which I've grown use to is the ability to listen to an entire range of service IDs using a value/mask combo. Thanks for taking this on, once you have something minimal I'll be happy to try it out, I've been waiting. :) -Libor From halr at voltaire.com Wed Dec 15 18:21:56 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2004 21:21:56 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <52llbzt42e.fsf@topspin.com> References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> <1103159901.4127.17.camel@localhost.localdomain> <52llbzt42e.fsf@topspin.com> Message-ID: <1103163716.4126.30.camel@localhost.localdomain> On Wed, 2004-12-15 at 20:38, Roland Dreier wrote: > Hal> No but it definitely oops in that callback. I didn't trace it > Hal> in path_rec_completion; only glanced at the code. Don't the > Hal> debug statements deference through NULL regardless of status ? > > Good point, I missed those uses. I've just pushed a patch that should > fix this problem. Still oops-es :-( > Hal> Have you gotten a negative status on the callback (and NULL > Hal> pathrec) ? I've yet to see this response on the analyzer as > Hal> there are too many to go through right now. I do see it with > Hal> extra debug I put in to narrow this down. > > No, I guess that's why I never saw this problem -- I've never had a > path record callback fail. I get the callback error but the packets on the wire look OK (see below). > Hal> Also, what I do see when I do a broadcast ping is that the > Hal> path record is obtained over and over rather than being > Hal> requested once and cached. Is that what is supposed to be > Hal> happening now ? > > I can't duplicate this one. If I do something like > > ifconfig ib0 10.0.0.1 > ping -b 10.255.255.255 > > I only see one path record lookup for each remote system. I'm using /24 rather than /8 (class C rather than class A). So my config is ifconfig ib0 192.168.0.1 ping -b 192.168.0.255 I have 3 nodes (this is an x86 and the other 2 are x86_64). I doubt this makes a difference in terms of these issues. > You seem to be seeing a path record lookup fail, which of course will > cause the lookup to be retried the next time we want to send something > to that destination. Do you know why the the path record lookup is failing? Just looked at the trace and all SA GetResp(PathRecord) had status 0 but ib_sa_path_rec_callback: sa_query 0xc373fe48 status 0xffffff92 mad 0x00000000 This status is #define ETIMEDOUT 110 /* Connection timed out */ Can you shorten your timeout down to 1 msec and see what happens ? I do see one SA Get which was not responded to. There are a variety of reasons this could occur. 
-- Hal From roland at topspin.com Wed Dec 15 20:18:09 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 15 Dec 2004 20:18:09 -0800 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1103163716.4126.30.camel@localhost.localdomain> (Hal Rosenstock's message of "15 Dec 2004 21:21:56 -0500") References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> <1103159901.4127.17.camel@localhost.localdomain> <52llbzt42e.fsf@topspin.com> <1103163716.4126.30.camel@localhost.localdomain> Message-ID: <527jniub9a.fsf@topspin.com> Hal> Can you shorten your timeout down to 1 msec and see what happens ? OK, that let me reproduce the oops, and I pushed a fix out (skqueue was used uninitialized). - R. From tziporet at mellanox.co.il Wed Dec 15 23:28:20 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 16 Dec 2004 09:28:20 +0200 Subject: [openib-general] MT25218 (aka Arbel memfree) support Message-ID: <506C3D7B14CDD411A52C00025558DED6064BEBAA@mtlex01.yok.mtl.com> The current (5.0.1) MemFree FW does not support the HCA-attached memory. Tziporet -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Wednesday, December 15, 2004 9:21 PM To: Libor Michalek Cc: openib-general at openib.org Subject: Re: [openib-general] MT25218 (aka Arbel memfree) support Libor> How is the determination whether to use memfree mode or Libor> onboard memory going to be made for Arbel mode? As I understand it, only memfree mode is supported for Arbel mode. However, if onboard memory is ever supported, I think we would just have the driver prefer using HCA-attached memory to system memory. - R. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From tziporet at mellanox.co.il Thu Dec 16 00:40:36 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 16 Dec 2004 10:40:36 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <506C3D7B14CDD411A52C00025558DED6064BEBB0@mtlex01.yok.mtl.com> Roland is correct here - this was the only change we had to do for Arbel. Tavor FW was enhanced to get this workaround too. Tziporet -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Wednesday, December 15, 2004 9:20 PM To: Hal Rosenstock Cc: openib-general at openib.org Subject: Re: [openib-general] IPoIB still not working Hal> Isn't the GID the subnet prefix concatenated with the EUI-64 GUID ? Hal> If so, I see 2 issues: 1. GUIDInfo should also reflect this. Hal> 2. Uniquification of EUI-64s (flag odd ones as errors) This change/workaround only affects the value put into an internal HCA data structure when a GRH/GID is _not_ being sent. - Roland _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed...
URL: From halr at voltaire.com Thu Dec 16 04:55:36 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 07:55:36 -0500 Subject: [openib-general] [PATCH] [RFC] new test directory + test codefor MAD snooping In-Reply-To: <41BF4797.8030007@ichips.intel.com> References: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> <41BE2BAF.10506@ichips.intel.com> <003101c4e17c$5d6c7600$6501a8c0@comcast.net> <41BF4797.8030007@ichips.intel.com> Message-ID: <1103201736.4126.95.camel@localhost.localdomain> On Tue, 2004-12-14 at 15:05, Sean Hefty wrote: > I've pushed in the madeye code under gen2/utils. I think it would be better as something like gen2/trunk/src/tests, gen2/trunk/tests, gen2/trunk/src/utils, or /gen2/trunk/utils so it is all in one place and can be obtained with a single checkout. Would there be an issue if it was under src ? Would this cause some submittal problem ? -- Hal From halr at voltaire.com Thu Dec 16 05:32:11 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 08:32:11 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <527jniub9a.fsf@topspin.com> References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> <1103159901.4127.17.camel@localhost.localdomain> <52llbzt42e.fsf@topspin.com> <1103163716.4126.30.camel@localhost.localdomain> <527jniub9a.fsf@topspin.com> Message-ID: <1103203545.4126.143.camel@localhost.localdomain> On Wed, 2004-12-15 at 23:18, Roland Dreier wrote: > Hal> Can you shorten your timeout down to 1 msec and see what happens ? > > OK, that let me reproduce the oops, and I pushed a fix out (skqueue > was used unitialized). That fixed the oops :-) Thanks. I am still seeing the continual retransmission of SA Get(PathRecords) even after I terminate the ping -b. The status on the callback is 0. When I remove ipoib module, I get: Dec 16 08:21:31 localhost kernel: ib0: dev_queue_xmit failed to requeue packet Dec 16 08:21:31 localhost kernel: ib_sa_path_rec_callback: sa_query 0xc26052e8 status 0x0 mad 0xce5e1a4c Dec 16 08:21:31 localhost kernel: ib0: dev_queue_xmit failed to requeue packet Dec 16 08:21:31 localhost kernel: ib_sa_path_rec_callback: sa_query 0xc26052a8 status 0x0 mad 0xce5e18ec Dec 16 08:21:31 localhost kernel: ib0: dev_queue_xmit failed to requeue packet Dec 16 08:21:31 localhost kernel: ib_sa_path_rec_callback: sa_query 0xc2605268 status 0x0 mad 0xce5e178c Dec 16 08:21:31 localhost kernel: ib0: dev_queue_xmit failed to requeue packet Dec 16 08:21:31 localhost kernel: ib_sa_path_rec_callback: sa_query 0xc2605228 status 0x0 mad 0xce5e162c Dec 16 08:21:31 localhost kernel: ib0: dev_queue_xmit failed to requeue packet Dec 16 08:21:31 localhost kernel: ib_sa_path_rec_callback: sa_query 0xc26051e8 status 0x0 mad 0xce5e14cc Dec 16 08:21:31 localhost kernel: ib0: dev_queue_xmit failed to requeue packet Dec 16 08:21:31 localhost kernel: ib0: ib_dealloc_pd failed When I then remove mthca, I get: Dec 16 08:23:22 localhost kernel: ool_destroy mthca_av, c9f32000 busy Dec 16 08:23:22 localhost kernel: ib_mthca 0000:02:00.0: dma_pool_destroy mthca_av, c5bc5000 busy The latter is repeated over and over for different AVs. So there appear to me to be 2 problems: 1. In this error case, why can't resources be freed ? 2. Why does the Get(PR) keep being retried ? 
-- Hal From roland at topspin.com Thu Dec 16 06:58:35 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 06:58:35 -0800 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1103203545.4126.143.camel@localhost.localdomain> (Hal Rosenstock's message of "16 Dec 2004 08:32:11 -0500") References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> <1103159901.4127.17.camel@localhost.localdomain> <52llbzt42e.fsf@topspin.com> <1103163716.4126.30.camel@localhost.localdomain> <527jniub9a.fsf@topspin.com> <1103203545.4126.143.camel@localhost.localdomain> Message-ID: <52wtvis31g.fsf@topspin.com> Hal> I am still seeing the continual retransmission of SA Hal> Get(PathRecords) even after I terminate the ping -b. The Hal> status on the callback is 0. OK, I found another bug and pushed the change out. Are things better now? - Roland From halr at voltaire.com Thu Dec 16 07:46:31 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 10:46:31 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <52wtvis31g.fsf@topspin.com> References: <1103124540.4222.13.camel@localhost.localdomain> <52brcvutgm.fsf@topspin.com> <1103151733.4224.280.camel@localhost.localdomain> <1103155948.4127.1.camel@localhost.localdomain> <52wtvjt56v.fsf@topspin.com> <1103159901.4127.17.camel@localhost.localdomain> <52llbzt42e.fsf@topspin.com> <1103163716.4126.30.camel@localhost.localdomain> <527jniub9a.fsf@topspin.com> <1103203545.4126.143.camel@localhost.localdomain> <52wtvis31g.fsf@topspin.com> Message-ID: <1103211990.4327.14.camel@localhost.localdomain> On Thu, 2004-12-16 at 09:58, Roland Dreier wrote: > Hal> I am still seeing the continual retransmission of SA > Hal> Get(PathRecords) even after I terminate the ping -b. The > Hal> status on the callback is 0. > > OK, I found another bug and pushed the change out. Are things better now? Yes, that's better. Still have the partial connectivity problem. I can see the ARP going out on the broadcast group followed by ARPs coming in on the broadcast group followed by the PathRecord requests/responses with the SA followed by the unicast ARP and ICMP. After the unicast ARP to one of the nodes, it is not heard from. Yet if I initiate it at one of the remote nodes, connectivity is restored. -- Hal From fzago at systemfabricworks.com Thu Dec 16 07:51:49 2004 From: fzago at systemfabricworks.com (frank zago) Date: Thu, 16 Dec 2004 09:51:49 -0600 Subject: [openib-general] CM header file In-Reply-To: <20041215174814.C26487@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <20041215174814.C26487@topspin.com> Message-ID: <41C1AF15.8000802@systemfabricworks.com> > > However, a nice to have feature which I've grown use to is the ability >to listen to an entire range of service IDs using a value/mask combo. > > I second this request. It is very useful in some cases. Also, I didn't see a provision for peer-to-peer connections. Frank.
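For illustration only, the value/mask listen being asked for might look roughly like this as an extension of the draft API (hypothetical sketch, not something in the posted header):

        /* Match any REQ whose service ID equals service_id under service_mask;
         * a mask of ~0ULL degenerates to today's exact-match listen. */
        int ib_cm_listen(struct ib_cm_id *cm_id, u64 service_id,
                         u64 service_mask);

        /* e.g. claim a whole 32-bit block of service IDs: */
        ret = ib_cm_listen(cm_id, 0x1234000000000000ULL,
                           0xFFFFFFFF00000000ULL);

with the CM matching an incoming REQ as (req_service_id & service_mask) == (service_id & service_mask).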
From halr at voltaire.com Thu Dec 16 07:54:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 10:54:46 -0500 Subject: [openib-general] IPoIB still not working In-Reply-To: <42FD903D5ADE644A87701E946196A11373A53B@ES22SNLNT.srn.sandia.gov> References: <42FD903D5ADE644A87701E946196A11373A53B@ES22SNLNT.srn.sandia.gov> Message-ID: <1103212485.4327.19.camel@localhost.localdomain> On Wed, 2004-12-15 at 13:22, England, Joshua J wrote: > I'll definitely pound on the stuff and let you know if anything > breaks. You are using the 4.3.5 firmware, right ? I want to put the proper info into the IPoIB FAQ. Thanks. -- Hal From jjengla at sandia.gov Thu Dec 16 09:14:41 2004 From: jjengla at sandia.gov (England, Joshua J) Date: Thu, 16 Dec 2004 10:14:41 -0700 Subject: [openib-general] IPoIB still not working Message-ID: <42FD903D5ADE644A87701E946196A11373A546@ES22SNLNT.srn.sandia.gov> They're on 4.5.3. -JE -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thu 12/16/2004 8:54 AM To: England, Joshua J Cc: Roland Dreier; Robert J Woodruff; openib-general at openib.org Subject: RE: [openib-general] IPoIB still not working On Wed, 2004-12-15 at 13:22, England, Joshua J wrote: > I'll definitely pound on the stuff and let you know if anything > breaks. You are using the 4.3.5 firmware, right ? I want to put the proper info into the IPoIB FAQ. Thanks. -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Thu Dec 16 09:19:23 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 12:19:23 -0500 Subject: [openib-general] [PATCH] IPoIB FAQ Message-ID: <1103217563.4327.67.camel@localhost.localdomain> reflect proper Arbel firmware revision Signed-off-by: Hal Rosenstock Index: ipoib_faq.txt =================================================================== --- ipoib_faq.txt (revision 1346) +++ ipoib_faq.txt (working copy) @@ -50,8 +50,8 @@ cat /sys/class/infiniband/mthca0/fw_ver -For PCI-X HCAs, version 3.2.0 is recommended. For PCIe HCAs, version -4.6.1 is recommended. +For PCI-X HCAs, version 3.2.0 or later is recommended. For PCIe HCAs, +version 4.5.3 or later is recommended. 2. Make sure the IB modules are loaded: /sbin/lsmod | grep ib_ From robert.j.woodruff at intel.com Thu Dec 16 09:31:14 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 16 Dec 2004 09:31:14 -0800 Subject: [openib-general] IPoIB oops on path record completion Message-ID: <1AC79F16F5C5284499BB9591B33D6F0003125600@orsmsx408> >Still have the partial connectivity problem. I can see the ARP going out >on the broadcast group followed by ARPs coming oin on the broadcast >group followed by the PathRecord requests/responses with the SA followed >by the unicast ARP and ICMP. After the unicast ARP to one of the nodes, >it is not heard from. Yet if I initiate it at one of the remote nodes, >connectivity is restored. >-- Hal I also seem to be having some partial connectivity problems. The first 2 nodes seem to be able to communicate, but adding the 3rd and 4th nodes, they cannot ping the first 2. From roland at topspin.com Thu Dec 16 09:35:20 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 09:35:20 -0800 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003125600@orsmsx408> (Robert J. 
Woodruff's message of "Thu, 16 Dec 2004 09:31:14 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0003125600@orsmsx408> Message-ID: <52y8fyqh7r.fsf@topspin.com> Robert> I also seem to be having some partial connectivity Robert> problems. The first 2 nodes seem to be able to Robert> communicate, but adding the 3rd and 4th nodes, they cannot Robert> ping the first 2. Are you running the latest code from svn? I fixed a bug this morning that would cause problems with more than 2 nodes. Thanks, Roland From mshefty at ichips.intel.com Thu Dec 16 09:36:59 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 09:36:59 -0800 Subject: [openib-general] [PATCH] [RFC] new test directory + test codefor MAD snooping In-Reply-To: <1103201736.4126.95.camel@localhost.localdomain> References: <20041209142459.5c8dbb0f.mshefty@ichips.intel.com> <41BE2BAF.10506@ichips.intel.com> <003101c4e17c$5d6c7600$6501a8c0@comcast.net> <41BF4797.8030007@ichips.intel.com> <1103201736.4126.95.camel@localhost.localdomain> Message-ID: <41C1C7BB.6090106@ichips.intel.com> Hal Rosenstock wrote: > On Tue, 2004-12-14 at 15:05, Sean Hefty wrote: > >>I've pushed in the madeye code under gen2/utils. > > > I think it would be better as something like > gen2/trunk/src/tests, gen2/trunk/tests, gen2/trunk/src/utils, > or /gen2/trunk/utils so it is all in one place and can be > obtained with a single checkout. > > Would there be an issue if it was under src ? Would this cause > some submittal problem ? I have no issue with where the code goes. - Sean From robert.j.woodruff at intel.com Thu Dec 16 09:41:55 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 16 Dec 2004 09:41:55 -0800 Subject: [openib-general] IPoIB oops on path record completion Message-ID: <1AC79F16F5C5284499BB9591B33D6F0003125642@orsmsx408> >Are you running the latest code from svn? I fixed a bug this morning >that would cause problems with more than 2 nodes. >Thanks, > Roland Thanks, I will grab it and give it a try. I am running 1335 and I know that you pushed a couple of fixes late yesterday and this morning after I downloaded the code yesterday afternoon. I also saw one oops, but saw a note that you pushed a fix for that also. woody -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Thursday, December 16, 2004 9:35 AM To: Woodruff, Robert J Cc: Hal Rosenstock; openib-general at openib.org Subject: Re: [openib-general] IPoIB oops on path record completion Robert> I also seem to be having some partial connectivity Robert> problems. The first 2 nodes seem to be able to Robert> communicate, but adding the 3rd and 4th nodes, they cannot Robert> ping the first 2. Are you running the latest code from svn? I fixed a bug this morning that would cause problems with more than 2 nodes. Thanks, Roland From halr at voltaire.com Thu Dec 16 09:38:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 12:38:38 -0500 Subject: [openib-general] IPoIB oops on path record completion In-Reply-To: <52y8fyqh7r.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003125600@orsmsx408> <52y8fyqh7r.fsf@topspin.com> Message-ID: <1103218556.4327.77.camel@localhost.localdomain> On Thu, 2004-12-16 at 12:35, Roland Dreier wrote: > Are you running the latest code from svn? I fixed a bug this morning > that would cause problems with more than 2 nodes. I am. 
-- Hal From mshefty at ichips.intel.com Thu Dec 16 09:47:54 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 09:47:54 -0800 Subject: [openib-general] CM header file In-Reply-To: <20041215174814.C26487@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <20041215174814.C26487@topspin.com> Message-ID: <41C1CA4A.6020901@ichips.intel.com> Libor Michalek wrote: > The other option is to destroy the connection if the consumer returns > an error value from the callback. I'll have to think about this. As a personal preference I try to avoid having callbacks return values. But then I'm not thrilled about passing in flags to destroy to handle this either. > Along the same lines, will the > consumer be allowed to call the corresponding response function from > a callback? (e.g. ib_send_cm_rep() from the REQ callback) If not then > the same behaviour could also be achieved with a callback return value. This is a goal of the implementation, and I don't foresee any reason why it can't be done. >>* Should listening clients call an "accept" routine to wait for a >> connection request? Currently, the API operates asynchronously and >> inokves a CM event handler. > > I don't think this is necessary. Presumably the biggest reason to use > "accept" is to force the consumer to use its own thread to handle CM state > changes, thus avoiding CM or MAD thread deadlock if there is a buggy > consumer, or is there another reason to add this step? I didn't think it was necessary either, but wanted to mention it as a possible idea. > However, a nice to have feature which I've grown use to is the ability > to listen to an entire range of service IDs using a value/mask combo. I'll add this. I saw the mask in the Topspin CM API, but didn't look into why it was there, so removed it. Thanks for the feedback. - Sean From robert.j.woodruff at intel.com Thu Dec 16 10:53:31 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 16 Dec 2004 10:53:31 -0800 Subject: [openib-general] IPoIB oops on path record completion Message-ID: <1AC79F16F5C5284499BB9591B33D6F000312582B@orsmsx408> >Are you running the latest code from svn? I fixed a bug this morning >that would cause problems with more than 2 nodes. >Thanks, > Roland With the 1348 version I just downloaded, I can now ping from all nodes to all other nodes. I will not try to install and run some MPI tests and/or other tests to exercise the stack. woody From mshefty at ichips.intel.com Thu Dec 16 11:54:57 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 11:54:57 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C1AF15.8000802@systemfabricworks.com> References: <41C0D5F3.7010307@ichips.intel.com> <20041215174814.C26487@topspin.com> <41C1AF15.8000802@systemfabricworks.com> Message-ID: <41C1E811.9080405@ichips.intel.com> frank zago wrote: > >> >> However, a nice to have feature which I've grown use to is the ability >> to listen to an entire range of service IDs using a value/mask combo. >> >> > > I second this request. It very usefull in some cases. > > Also, I didn't see a provision for peer to peer connections. I'm not sure that peer-to-peer needs to be exposed by the API. The CM should be able to determine the connection model when matching a received MAD with a local service ID. I.e. does the local service ID match with a listen request or a connection request... 
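To make the service ID matching concrete (including the value/mask listens agreed to above), here is a minimal sketch. Illustration only, not part of the patch: the cm_listen bookkeeping structure and list are invented here; only ib_cm_listen()'s service_id/service_mask parameters and struct ib_cm_id come from the posted header.

#include <linux/list.h>
#include <linux/types.h>

struct cm_listen {
	struct list_head list;
	struct ib_cm_id *cm_id;
	u64 service_id;
	u64 service_mask;
};

static LIST_HEAD(cm_listen_list);

/*
 * Match an incoming REQ's service ID against the registered listens.
 * A mask of ~0 gives the usual exact match; a narrower mask lets one
 * listen cover a whole range of service IDs.
 */
static struct ib_cm_id *cm_find_listen(u64 service_id)
{
	struct cm_listen *l;

	list_for_each_entry(l, &cm_listen_list, list)
		if ((service_id & l->service_mask) ==
		    (l->service_id & l->service_mask))
			return l->cm_id;

	return NULL;
}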
- Sean From halr at voltaire.com Thu Dec 16 11:53:46 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 14:53:46 -0500 Subject: [openib-general] CM header file In-Reply-To: <41C0D5F3.7010307@ichips.intel.com> References: <41C0D5F3.7010307@ichips.intel.com> Message-ID: <1103226826.4327.149.camel@localhost.localdomain> On Wed, 2004-12-15 at 19:25, Sean Hefty wrote: Here are some initial CM comments: Would UC as well as RC be supported ? If so, UC can wait a little for implementing if this adds time. /** > * ib_send_cm_mra - Sends a message receipt acknowledgement to a > connection > * message. > * @cm_id: Connection identifier associated with the connection message. > * @service_timeout: The maximum time required for the sender to reply to > * to the connection message. > * @private_data: Optional user-defined private data sent with the > * message receipt acknowledgement. > * @private_data_len: Size of the private data buffer, in bytes. > */ Couldn't outgoing MRAs be automatically generated rather than forcing the CM client to do this ? Would the CM be 1.1 compliant ? If so, would 1.2 be a future thing ? > struct ib_path_record; //*** TBD: define me somewhere else (ib_smi.h?) Is this the SA PathRecord ? There is ib_sa_path_rec in ib_sa.h. More later as I go through this in more detail... -- Hal From mshefty at ichips.intel.com Thu Dec 16 11:55:56 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 11:55:56 -0800 Subject: [openib-general] [PATCH] initial CM module Message-ID: <20041216115556.25cdb461.mshefty@ichips.intel.com> This patch adds in the initial CM API and module code. The module loads, unloads, and allocates/deallocates connection structures, but that's about it. This patch does not include changes needed to Kconfig or the Makefile, since I'm not sure that it makes sense to change these yet. I will commit this unless there are any objections. - Sean Index: include/ib_cm.h =================================================================== --- include/ib_cm.h (revision 0) +++ include/ib_cm.h (revision 0) @@ -0,0 +1,387 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+ * + * $Id$ + */ +#if !defined(IB_CM_H) +#define IB_CM_H + +#include +#include + +enum ib_cm_state { + IB_CM_IDLE, + IB_CM_LISTEN, + IB_CM_REQ_SENT, + IB_CM_REQ_RCVD, + IB_CM_MRA_REQ_SENT, + IB_CM_MRA_REQ_RCVD, + IB_CM_REP_SENT, + IB_CM_REP_RCVD, + IB_CM_MRA_REP_SENT, + IB_CM_MRA_REP_RCVD, + IB_CM_ESTABLISHED, + IB_CM_LAP_SENT, + IB_CM_LAP_RCVD, + IB_CM_MRA_LAP_SENT, + IB_CM_MRA_LAP_RCVD, + IB_CM_DREQ_SENT, + IB_CM_DREQ_RCVD, + IB_CM_TIMEWAIT, + IB_CM_SIDR_REQ_SENT, + IB_CM_SIDR_REQ_RCVD +}; + +enum ib_cm_event_type { + IB_CM_REQ_TIMEOUT, + IB_CM_REQ_RECEIVED, + IB_CM_REP_TIMEOUT, + IB_CM_REP_RECEIVED, + IB_CM_RTU_RECEIVED, + IB_CM_DREQ_TIMEOUT, + IB_CM_DREQ_RECEIVED, + IB_CM_DREP_RECEIVED, + IB_CM_MRA_RECEIVED, + IB_CM_LAP_TIMEOUT, + IB_CM_LAP_RECEIVED, + IB_CM_APR_RECEIVED +}; + +struct ib_cm_event { + /* a whole lot more stuff goes here */ + void *private_data; + u8 private_data_len; + enum ib_cm_event_type event; +}; + +struct ib_cm_id; + +typedef void (*ib_cm_handler)(struct ib_cm_id *cm_id, + struct ib_cm_event *event); + +struct ib_cm_id { + ib_cm_handler cm_handler; + void *context; + u64 service_id; + enum ib_cm_state state; +}; + +/** + * ib_create_cm_id - Allocate a connection identifier. + * @cm_handler: Callback invoked to notify the user of CM events. + * @context: User specified context associated with the connection + * identifier. + * + * Connection identifiers are used to track connection states and + * listen requests. + */ +struct ib_cm_id *ib_create_cm_id(ib_cm_handler cm_handler, + void *context); + +/** + * ib_destroy_cm_id - Destroy a connection identifier. + * @cm_id: Connection identifier to destroy. + * + * This call blocks until the connection identifier is destroyed. + */ +int ib_destroy_cm_id(struct ib_cm_id *cm_id); +//*** TBD : add flags to allow calling routine from CM callback... + +/** + * ib_cm_listen - Initiates listening on the specified service ID for + * connection and service ID resolution requests. + * @cm_id: Connection identifier associated with the listen request. + * @service_id: Service identifier matched against incoming connection + * and service ID resolution requests. + * @service_mask: Mask applied to service ID used to listen across a + * range of service IDs. + */ +int ib_cm_listen(struct ib_cm_id *cm_id, + u64 service_id, + u64 service_mask); + +struct ib_cm_req_param { + struct ib_qp *qp; + struct ib_path_record *primary_path; + struct ib_path_record *alternate_path; + u64 service_id; + int timeout_ms; + void *private_data; + u8 private_data_len; + u8 responder_resources; + u8 initiator_depth; + u8 remote_cm_response_timeout; + u8 flow_control; + u8 local_cm_response_timeout; + u8 retry_count; + u8 rnr_retry_count; + u8 max_cm_retries; +}; + +/** + * ib_send_cm_req - Sends a connection request to the remote node. + * @cm_id: Connection identifier that will be associated with the + * connection request. + * @param: Connection request information needed to establish the + * connection. + */ +int ib_send_cm_req(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param); + +struct ib_cm_rep_param { + struct ib_qp *qp; + void *private_data; + u8 reply_private_data_len; + u8 responder_resources; + u8 initiator_depth; + u8 target_ack_delay; + u8 failover_accepted; + u8 flow_control; + u8 rnr_retry_count; +}; + +/** + * ib_send_cm_rep - Sends a connection reply in response to a connection + * request. + * @cm_id: Connection identifier that will be associated with the + * connection request. 
+ * @param: Connection reply information needed to establish the + * connection. + */ +int ib_send_cm_rep(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param); + +/** + * ib_send_cm_rtu - Sends a connection ready to use message in response + * to a connection reply message. + * @cm_id: Connection identifier associated with the connection request. + * @private_data: Optional user-defined private data sent with the + * ready to use message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_rtu(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_dreq - Sends a disconnection request for an existing + * connection. + * @cm_id: Connection identifier associated with the connection being + * released. + * @private_data: Optional user-defined private data sent with the + * disconnection request message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_dreq(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_drep - Sends a disconnection reply to a disconnection request. + * @cm_id: Connection identifier associated with the connection being + * released. + * @private_data: Optional user-defined private data sent with the + * disconnection reply message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_drep(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len); + +/** + * ib_cm_establish - Forces a connection state to established. + * @cm_id: Connection identifier to transition to established. + * + * This routine should be invoked by users who receive messages on a + * connected QP before an RTU has been received. + */ +int ib_cm_establish(struct ib_cm_id *id); + +enum ib_cm_rej_reason { + IB_CM_REJ_NO_QP = __constant_htons(1), + IB_CM_REJ_NO_EEC = __constant_htons(2), + IB_CM_REJ_NO_RESOURCES = __constant_htons(3), + IB_CM_REJ_TIMEOUT = __constant_htons(4), + IB_CM_REJ_UNSUPPORTED = __constant_htons(5), + IB_CM_REJ_INVALID_COMM_ID = __constant_htons(6), + IB_CM_REJ_INVALID_COMM_INSTANCE = __constant_htons(7), + IB_CM_REJ_INVALID_SERVICE_ID = __constant_htons(8), + IB_CM_REJ_INVALID_TRANSPORT_TYPE = __constant_htons(9), + IB_CM_REJ_STALE_CONN = __constant_htons(10), + IB_CM_REJ_RDC_NOT_EXIST = __constant_htons(11), + IB_CM_REJ_INVALID_GID = __constant_htons(12), + IB_CM_REJ_INVALID_LID = __constant_htons(13), + IB_CM_REJ_INVALID_SL = __constant_htons(14), + IB_CM_REJ_INVALID_TRAFFIC_CLASS = __constant_htons(15), + IB_CM_REJ_INVALID_HOP_LIMIT = __constant_htons(16), + IB_CM_REJ_INVALID_PACKET_RATE = __constant_htons(17), + IB_CM_REJ_INVALID_ALT_GID = __constant_htons(18), + IB_CM_REJ_INVALID_ALT_LID = __constant_htons(19), + IB_CM_REJ_INVALID_ALT_SL = __constant_htons(20), + IB_CM_REJ_INVALID_ALT_TRAFFIC_CLASS = __constant_htons(21), + IB_CM_REJ_INVALID_ALT_HOP_LIMIT = __constant_htons(22), + IB_CM_REJ_INVALID_ALT_PACKET_RATE = __constant_htons(23), + IB_CM_REJ_PORT_REDIRECT = __constant_htons(24), + IB_CM_REJ_INVALID_MTU = __constant_htons(26), + IB_CM_REJ_INSUFFICIENT_RESP_RESOURCES = __constant_htons(27), + IB_CM_REJ_CONSUMER_DEFINED = __constant_htons(28), + IB_CM_REJ_INVALID_RNR_RETRY = __constant_htons(29), + IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = __constant_htons(30), + IB_CM_REJ_INVALID_CLASS_VERSION = __constant_htons(31), + IB_CM_REJ_INVALID_FLOW_LABEL = __constant_htons(32), + IB_CM_REJ_INVALID_ALT_FLOW_LABEL = __constant_htons(33) +}; + +/** + * ib_send_cm_rej - Sends a connection 
rejection message to the + * remote node. + * @cm_id: Connection identifier associated with the connection being + * rejected. + * @reason: Reason for the connection request rejection. + * @ari: Optional additional rejection information. + * @ari_length: Size of the additional rejection information, in bytes. + * @private_data: Optional user-defined private data sent with the + * rejection message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_rej(struct ib_cm_id *cm_id, + enum ib_cm_rej_reason reason, + void *ari, + u8 ari_length, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_mra - Sends a message receipt acknowledgement to a connection + * message. + * @cm_id: Connection identifier associated with the connection message. + * @service_timeout: The maximum time required for the sender to reply to + * to the connection message. + * @private_data: Optional user-defined private data sent with the + * message receipt acknowledgement. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_mra(struct ib_cm_id *cm_id, + u8 service_timeout, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_lap - Sends a load alternate path request. + * @cm_id: Connection identifier associated with the load alternate path + * message. + * @alternate_path: A path record that identifies the alternate path to + * load. + * @private_data: Optional user-defined private data sent with the + * load alternate path message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_lap(struct ib_cm_id *cm_id, + struct ib_path_record *alternate_path, + void *private_data, + u8 private_data_len); + +enum ib_cm_apr_status { + IB_CM_APR_SUCCESS, + IB_CM_APR_INVALID_COMM_ID, + IB_CM_APR_UNSUPPORTED, + IB_CM_APR_REJECT, + IB_CM_APR_REDIRECT, + IB_CM_APR_IS_CURRENT, + IB_CM_APR_INVALID_QPN_EECN, + IB_CM_APR_INVALID_LID, + IB_CM_APR_INVALID_GID, + IB_CM_APR_INVALID_FLOW_LABEL, + IB_CM_APR_INVALID_TCLASS, + IB_CM_APR_INVALID_HOP_LIMIT, + IB_CM_APR_INVALID_PACKET_RATE, + IB_CM_APR_INVALID_SL +}; + +/** + * ib_send_cm_apr - Sends an alternate path response message in response to + * a load alternate path request. + * @cm_id: Connection identifier associated with the alternate path response. + * @status: Reply status sent with the alternate path response. + * @info: Optional additional information sent with the alternate path + * response. + * @info_length: Size of the additional information, in bytes. + * @private_data: Optional user-defined private data sent with the + * alternate path response message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_apr(struct ib_cm_id *cm_id, + enum ib_cm_apr_status status, + void *info, + u8 info_length, + void *private_data, + u8 private_data_len); + +struct ib_cm_sidr_req_param { + struct ib_path_record *path; + u64 service_id; + int timeout_ms; + void *private_data; + u8 private_data_len; + u16 pkey; +}; + +/** + * ib_send_cm_sidr_req - Sends a service ID resolution request to the + * remote node. + * @cm_id: Communication identifier that will be associated with the + * service ID resolution request. + * @param: Service ID resolution request information. 
+ */ +int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, + struct ib_cm_sidr_req_param *param); + +enum ib_cm_sidr_status { + IB_SIDR_SUCCESS, + IB_SIDR_UNSUPPORTED, + IB_SIDR_REJECT, + IB_SIDR_NO_QP, + IB_SIDR_REDIRECT, + IB_SIDR_UNSUPPORTED_VERSION +}; + +struct ib_cm_sidr_rep_param { + u32 qp_num; + u32 qkey; + enum ib_cm_sidr_status status; + void *info; + u8 info_length; + void *private_data; + u8 private_data_len; +}; + +/** + * ib_send_cm_sidr_rep - Sends a service ID resolution request to the + * remote node. + * @cm_id: Communication identifier associated with the received service ID + * resolution request. + * @param: Service ID resolution reply information. + */ +int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, + struct ib_cm_sidr_rep_param *param); + +#endif /* IB_CM_H */ Index: core/cm.c =================================================================== --- core/cm.c (revision 0) +++ core/cm.c (revision 0) @@ -0,0 +1,292 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. 
+ * + * $Id$ + */ +#include +#include + +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void cm_add_one(struct ib_device *device); +static void cm_remove_one(struct ib_device *device); + +static struct ib_client cm_client = { + .name = "cm", + .add = cm_add_one, + .remove = cm_remove_one +}; + +struct cm_port { + struct ib_mad_agent *mad_agent; +}; + +struct ib_cm_id_private { + struct ib_cm_id id; + + spinlock_t lock; + wait_queue_head_t wait; + atomic_t refcount; +}; + +struct ib_cm_id *ib_create_cm_id(ib_cm_handler cm_handler, + void *context) +{ + struct ib_cm_id_private *cm_id_priv; + + cm_id_priv = kmalloc(sizeof *cm_id_priv, GFP_KERNEL); + if (!cm_id_priv) + return ERR_PTR(-ENOMEM); + + cm_id_priv->id.service_id = 0; + cm_id_priv->id.state = IB_CM_IDLE; + cm_id_priv->id.cm_handler = cm_handler; + cm_id_priv->id.context = context; + + spin_lock_init(&cm_id_priv->lock); + init_waitqueue_head(&cm_id_priv->wait); + atomic_set(&cm_id_priv->refcount, 1); + + return &cm_id_priv->id; +} +EXPORT_SYMBOL(ib_create_cm_id); + +static void reset_cm_state(struct ib_cm_id_private *cm_id_priv) +{ + /* reject connections if establishing */ + /* disconnect established connections */ + /* update timewait info */ +} + +int ib_destroy_cm_id(struct ib_cm_id *cm_id) +{ + struct ib_cm_id_private *cm_id_priv; + unsigned long flags; + + cm_id_priv = container_of(cm_id, struct ib_cm_id_private, id); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch(cm_id->state) { + case IB_CM_IDLE: + case IB_CM_LISTEN: + case IB_CM_TIMEWAIT: + break; /* Connection is ready to be destroyed. */ + default: + reset_cm_state(cm_id_priv); + break; + } + cm_id->state = IB_CM_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + atomic_dec(&cm_id_priv->refcount); + wait_event(cm_id_priv->wait, + !atomic_read(&cm_id_priv->refcount)); + kfree(cm_id_priv); + return 0; +} +EXPORT_SYMBOL(ib_destroy_cm_id); + +int ib_cm_listen(struct ib_cm_id *cm_id, + u64 service_id, + u64 service_mask) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_cm_listen); + +int ib_send_cm_req(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_req); + +int ib_send_cm_rep(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_rep); + +int ib_send_cm_rtu(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_rtu); + +int ib_send_cm_dreq(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_dreq); + +int ib_send_cm_drep(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_drep); + +int ib_cm_establish(struct ib_cm_id *id) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_cm_establish); + +int ib_send_cm_rej(struct ib_cm_id *cm_id, + enum ib_cm_rej_reason reason, + void *ari, + u8 ari_length, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_rej); + +int ib_send_cm_mra(struct ib_cm_id *cm_id, + u8 service_timeout, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_mra); + +int ib_send_cm_lap(struct ib_cm_id *cm_id, + struct ib_path_record *alternate_path, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_lap); + +int ib_send_cm_apr(struct ib_cm_id *cm_id, + enum 
ib_cm_apr_status status, + void *info, + u8 info_length, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_apr); + +int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, + struct ib_cm_sidr_req_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_sidr_req); + +int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, + struct ib_cm_sidr_rep_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_sidr_rep); + +static void send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct cm_port *port; + + port = (struct cm_port *)mad_agent->context; +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct cm_port *port; + + port = (struct cm_port *)mad_agent->context; +} + +static void cm_add_one(struct ib_device *device) +{ + struct cm_port *port; + struct ib_mad_reg_req reg_req; + u8 i; + + port = kmalloc(sizeof *port * device->phys_port_cnt, GFP_KERNEL); + if (!port) + goto out; + + memset(®_req, 0, sizeof reg_req); + reg_req.mgmt_class = IB_MGMT_CLASS_CM; + reg_req.mgmt_class_version = 1; + set_bit(IB_MGMT_METHOD_GET, reg_req.method_mask); + set_bit(IB_MGMT_METHOD_SET, reg_req.method_mask); + set_bit(IB_MGMT_METHOD_GET_RESP, reg_req.method_mask); + set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); + for (i = 1; i <= device->phys_port_cnt; i++) { + port[i].mad_agent = ib_register_mad_agent(device, i, + IB_QPT_GSI, + ®_req, + 0, + send_handler, + recv_handler, + &port[i]); + } + +out: + ib_set_client_data(device, &cm_client, port); +} + +static void cm_remove_one(struct ib_device *device) +{ + struct cm_port *port; + int i; + + port = (struct cm_port *)ib_get_client_data(device, &cm_client); + if (!port) + return; + + for (i = 1; i <= device->phys_port_cnt; i++) { + if (!IS_ERR(port[i].mad_agent)) + ib_unregister_mad_agent(port[i].mad_agent); + } + kfree(port); +} + +static int __init ib_cm_init(void) +{ + return ib_register_client(&cm_client); +} + +static void __exit ib_cm_cleanup(void) +{ + ib_unregister_client(&cm_client); +} + +module_init(ib_cm_init); +module_exit(ib_cm_cleanup); From mshefty at ichips.intel.com Thu Dec 16 11:58:46 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 11:58:46 -0800 Subject: [openib-general] [PATCH] new CM test utility Message-ID: <20041216115846.11f7e956.mshefty@ichips.intel.com> This patch adds in a new test utility framework for CM development. It's currently located in the gen2/utils directory. I will commit this change unless there are any objections. - Sean Index: util/cmpost/Kconfig =================================================================== --- util/cmpost/Kconfig (revision 0) +++ util/cmpost/Kconfig (revision 0) @@ -0,0 +1,6 @@ +config INFINIBAND_CMPOST + tristate "Connection manager test utility for InfiniBand" + depends on INFINIBAND + ---help--- + Test module for Infiniband connection manager. + Index: util/cmpost/cmpost.c =================================================================== --- util/cmpost/cmpost.c (revision 0) +++ util/cmpost/cmpost.c (revision 0) @@ -0,0 +1,100 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * + * $Id$ + */ + +#include +#include +#include + +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand CM test utility"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void cmpost_remove_one(struct ib_device *device); +static void cmpost_add_one(struct ib_device *device); + +static struct ib_client cmpost_client = { + .name = "cmpost", + .add = cmpost_add_one, + .remove = cmpost_remove_one +}; + +struct cmpost_port { + struct ib_cm_id *listen_cm_id; + struct ib_cm_id *conn_cm_id; +}; + +void cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ +} + +static void cmpost_add_one(struct ib_device *device) +{ + struct cmpost_port *port; + int i; + + port = kmalloc(sizeof *port * device->phys_port_cnt, GFP_KERNEL); + if (!port) + goto out; + + for (i = 1; i <= device->phys_port_cnt; i++) { + port[i].listen_cm_id = ib_create_cm_id(cm_handler, &port[i]); + port[i].conn_cm_id = ib_create_cm_id(cm_handler, &port[i]); + } + +out: + ib_set_client_data(device, &cmpost_client, port); +} + +static void cmpost_remove_one(struct ib_device *device) +{ + struct cmpost_port *port; + int i; + + port = (struct cmpost_port *) + ib_get_client_data(device, &cmpost_client); + if (!port) + return; + + for (i = 1; i <= device->phys_port_cnt; i++) { + if (!IS_ERR(port[i].listen_cm_id)) + ib_destroy_cm_id(port[i].listen_cm_id); + if (!IS_ERR(port[i].conn_cm_id)) + ib_destroy_cm_id(port[i].conn_cm_id); + } + kfree(port); +} + +static int __init ib_cmpost_init(void) +{ + return ib_register_client(&cmpost_client); +} + +static void __exit ib_cmpost_cleanup(void) +{ + ib_unregister_client(&cmpost_client); +} + +module_init(ib_cmpost_init); +module_exit(ib_cmpost_cleanup); Index: util/cmpost/Makefile =================================================================== --- util/cmpost/Makefile (revision 0) +++ util/cmpost/Makefile (revision 0) @@ -0,0 +1,6 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_CMPOST) += ib_cmpost.o + +ib_cmpost-y := cmpost.o \ + From mshefty at ichips.intel.com Thu Dec 16 12:14:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 12:14:18 -0800 Subject: [openib-general] CM header file In-Reply-To: <1103226826.4327.149.camel@localhost.localdomain> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> Message-ID: <41C1EC9A.5030005@ichips.intel.com> Hal Rosenstock wrote: > Would UC as well as RC be supported ? If so, UC can wait a little for > implementing if this adds time. I'm not sure that RC or UC matters much to the CM. The CM will probably just read the value from the QP type. > /** > >> * ib_send_cm_mra - Sends a message receipt acknowledgement to a >>connection >> * message. >> * @cm_id: Connection identifier associated with the connection message. >> * @service_timeout: The maximum time required for the sender to reply to >> * to the connection message. 
>> * @private_data: Optional user-defined private data sent with the >> * message receipt acknowledgement. >> * @private_data_len: Size of the private data buffer, in bytes. >> */ > > Couldn't outgoing MRAs be automatically generated rather than forcing > the CM client to do this ? I'm not sure that there's much value in abstracting this call. Also, there would need to be a way for the client to specify the private data and service timeout, which could change per message. > Would the CM be 1.1 compliant ? If so, would 1.2 be a future thing ? I was looking at the 1.2 version of the spec, but was planning on pulling code from the existing Topspin CM. I'm guessing that it's 1.1 compliant. Depending on the differences between 1.1 and 1.2, I can try to support both. >>struct ib_path_record; //*** TBD: define me somewhere else (ib_smi.h?) > > Is this the SA PathRecord ? There is ib_sa_path_rec in ib_sa.h. Thanks - I included ib_sa.h from ib_cm.h. One final note, I'm hoping that a more abstracted CM could be layered on top of this one, if it were desired. E.g. one that performs QP transitions, automatically generates MRAs, retries requests, etc. - Sean From halr at voltaire.com Thu Dec 16 12:19:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 15:19:32 -0500 Subject: [openib-general] IPoIB Partial Connectivity Scenario Message-ID: <1103228372.4327.167.camel@localhost.localdomain> I've looked at the remote side to understand what it was (or wasn't doing). The partial connectivity stems from an issue in resolving the path on the remote side. I have a proposal: Rather than a single SA Get(PathRecord) with a 1 second timeout, what about a retry or two with a smaller (0.33 - 0.5 sec) timeout ? SA Get/GetResp is inherently unreliable and these could be retried. -- Hal From halr at voltaire.com Thu Dec 16 12:28:21 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 15:28:21 -0500 Subject: [openib-general] CM header file In-Reply-To: <41C1EC9A.5030005@ichips.intel.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> Message-ID: <1103228901.4327.188.camel@localhost.localdomain> On Thu, 2004-12-16 at 15:14, Sean Hefty wrote: > Hal Rosenstock wrote: > > > Would UC as well as RC be supported ? If so, UC can wait a little for > > implementing if this adds time. > > I'm not sure that RC or UC matters much to the CM. The CM will > probably just read the value from the QP type. So it's just a pass through in terms of the components ? > > Would the CM be 1.1 compliant ? If so, would 1.2 be a future thing ? > > I was looking at the 1.2 version of the spec, but was planning on > pulling code from the existing Topspin CM. I'm guessing that it's 1.1 > compliant. I would state it a little differently: It's intended to be 1.1 compliant. > Depending on the differences between 1.1 and 1.2, I can try > to support both. The class version for CM was bumped for 1.2. A 1.2 CM could be made to be compatible with 1.1 (revert back to the previous class version and formats) but this is more work. It depends on what this needs to interoperate with as to what the requirement is here. 
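To make the compatibility point concrete, a check on received REQs could look like the sketch below. Illustration only: cm_reject_req() is a made-up helper standing in for building and posting the REJ, the numeric class versions are my assumption about how the version was bumped, and I am assuming the MAD header exposes a class_version field.

static u8 cm_class_version = 1;	/* bump to 2 if/when 1.2 formats are added */

static int cm_check_class_version(struct ib_cm_id *cm_id,
				  struct ib_mad_hdr *hdr)
{
	if (hdr->class_version == cm_class_version)
		return 0;

	/* IB_CM_REJ_INVALID_CLASS_VERSION is already in the posted ib_cm.h */
	cm_reject_req(cm_id, IB_CM_REJ_INVALID_CLASS_VERSION);
	return -EINVAL;
}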
-- Hal From mshefty at ichips.intel.com Thu Dec 16 12:37:49 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 12:37:49 -0800 Subject: [openib-general] CM header file In-Reply-To: <1103228901.4327.188.camel@localhost.localdomain> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <1103228901.4327.188.camel@localhost.localdomain> Message-ID: <41C1F21D.4010709@ichips.intel.com> Hal Rosenstock wrote: >>>Would UC as well as RC be supported ? If so, UC can wait a little for >>>implementing if this adds time. >> >>I'm not sure that RC or UC matters much to the CM. The CM will >>probably just read the value from the QP type. > > So it's just a pass through in terms of the components ? I believe so. I think that the only place the CM needs to know the difference is when setting the transport service type in the REQ message. >> Depending on the differences between 1.1 and 1.2, I can try >>to support both. > > The class version for CM was bumped for 1.2. A 1.2 CM could be made to > be compatible with 1.1 (revert back to the previous class version and > formats) but this is more work. It depends on what this needs to > interoperate with as to what the requirement is here. I think it makes sense to target 1.1 for now. If the amount of effort to support 1.2 is minimal, I will add in that support. - Sean From halr at voltaire.com Thu Dec 16 12:41:32 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 15:41:32 -0500 Subject: [openib-general] GUID/EUI-64 Issue In-Reply-To: <1102514316.4141.843.camel@localhost.localdomain> References: <1102514316.4141.843.camel@localhost.localdomain> Message-ID: <1103229681.4327.214.camel@localhost.localdomain> On Wed, 2004-12-08 at 08:58, Hal Rosenstock wrote: > Hi, > > Did we come to closure on how to handle the GUID/EUI-64 issue ? > > -- Hal > > On Thu, 2004-11-11 at 13:11, Roland Dreier wrote: > > My only questions are: > > > > + eui[0] ^= 2; > > > > I remember some discussion about whether IBTA GUIDs are already > > modified EUI-64 or not. Is this the correct transformation or > should > > we be doing something like "eui[0] |= 2;" (ie assume the universal > bit > > should always be set in our IPv6 address)? > > IBTA GUIDs are EUI-64. The only issue I recall was whether the > polarity > of the U/G bit was consistent with IEEE. This was updated at IBA 1.2. > It > now says "manufacturer assigns EUI-64 with global scope set. May also > assign additional EUI-64 with local scope." Here's what the IPoIB I-D says on this: [AARCH] requires the interface identifier be created in the "Modified EUI-64" format when derived from an EUI-64 identifier. [IBTA] is unclear if the GUID should use IEEE EUI-64 format or the "Modified EUI-64" format. Therefore, when creating an interface identifier from the GUID an implementation MUST do the following: => Determine if the GUID is a modified EUI-64 identifier ("u" bit is toggled) as defined by [AARCH] => If the GUID is a modified EUI-64 identifier then the "u" bit MUST NOT be toggled when creating the interface identifier => If the GUID is an unmodified EUI-64 identifier then the "u" bit MUST be toggled in compliance with [AARCH] This confusion is due to IBA 1.1 and previous versions. It has been fixed at IBA 1.2. 
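In code, the I-D's rules boil down to something like this. Illustration only: ipoib_guid_to_iid() is not a function in the tree, and how an implementation decides that a GUID is "already modified" is exactly the ambiguity described above.

#include <linux/types.h>
#include <linux/string.h>

#define EUI64_UL_BIT 0x02	/* universal/local bit in octet 0 */

static void ipoib_guid_to_iid(const u8 guid[8], int guid_is_modified,
			      u8 iid[8])
{
	memcpy(iid, guid, 8);

	/*
	 * AARCH wants the interface identifier in Modified EUI-64 form:
	 * toggle the u bit only when the GUID is an unmodified EUI-64
	 * (the "eui[0] ^= 2" case); if the vendor already produced a
	 * modified identifier, leave the bit alone.
	 */
	if (!guid_is_modified)
		iid[0] ^= EUI64_UL_BIT;
}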
-- Hal From roland at topspin.com Thu Dec 16 12:49:51 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 12:49:51 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C1E811.9080405@ichips.intel.com> (Sean Hefty's message of "Thu, 16 Dec 2004 11:54:57 -0800") References: <41C0D5F3.7010307@ichips.intel.com> <20041215174814.C26487@topspin.com> <41C1AF15.8000802@systemfabricworks.com> <41C1E811.9080405@ichips.intel.com> Message-ID: <52oeguq87k.fsf@topspin.com> Sean> I'm not sure that peer-to-peer needs to be exposed by the Sean> API. The CM should be able to determine the connection Sean> model when matching a received MAD with a local service ID. Sean> I.e. does the local service ID match with a listen request Sean> or a connection request... Nope, a consumer has to say whether an active connection request is peer-to-peer or not. It's entirely possible for two systems to be listening for the same service, and want to make two connections (ie the normal TCP-like case where system A connects to port X on system B at the same time as system B connects to the same port X on system A). Or the application may want peer-to-peer semantics. There's no way for the CM to know without being told. - R. From roland at topspin.com Thu Dec 16 12:53:05 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 12:53:05 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <20041216115556.25cdb461.mshefty@ichips.intel.com> (Sean Hefty's message of "Thu, 16 Dec 2004 11:55:56 -0800") References: <20041216115556.25cdb461.mshefty@ichips.intel.com> Message-ID: <52k6riq826.fsf@topspin.com> Can you use the new copyright header I posted? Also it's good to hold off on Makefile/Kconfig changes for now, since that will simplify generating patches for upstream merging. If it gets too difficult to hold back on the CM, I would suggest developing the CM on a branch. - R. From halr at voltaire.com Thu Dec 16 12:55:08 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 15:55:08 -0500 Subject: [openib-general] IPoIB Path Static Rate Message-ID: <1103230507.4327.224.camel@localhost.localdomain> Hi Roland, It looks to me like after obtaining the PathRecord, the static rate is not used when the AV is created. Shouldn't it be ? Is there an issue with doing this ? There is a similar issue with the multicast AVs as well. I know there is an assumption that everything is 4x but I am not sure that is good one to lock in. -- Hal From roland at topspin.com Thu Dec 16 13:01:39 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 13:01:39 -0800 Subject: [openib-general] Re: IPoIB Partial Connectivity Scenario In-Reply-To: <1103228372.4327.167.camel@localhost.localdomain> (Hal Rosenstock's message of "16 Dec 2004 15:19:32 -0500") References: <1103228372.4327.167.camel@localhost.localdomain> Message-ID: <52fz26q7nw.fsf@topspin.com> Hal> I have a proposal: Rather than a single SA Get(PathRecord) Hal> with a 1 second timeout, what about a retry or two with a Hal> smaller (0.33 - 0.5 sec) timeout ? SA Get/GetResp is Hal> inherently unreliable and these could be retried. Sure, that sounds reasonable. I had thought that future packets would cause the path record to be retried, but maybe that's not happening. - R. 
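As a sketch of what the proposed retry could look like on the ipoib side (illustration only: path_rec_start() here stands in for the existing call into the SA query code with an added per-attempt timeout, and query_retries_left is an invented field, initialized to IPOIB_PATH_QUERY_RETRIES when the first query is posted):

#define IPOIB_PATH_QUERY_MS		400	/* 0.33-0.5 sec per attempt */
#define IPOIB_PATH_QUERY_RETRIES	2

static void path_rec_completion(int status, struct ib_sa_path_rec *pathrec,
				void *path_ptr)
{
	struct ipoib_path *path = path_ptr;

	if (status == -ETIMEDOUT && path->query_retries_left-- > 0) {
		/* SA Get/GetResp is unreliable; reissue the query with
		 * the shorter per-attempt timeout instead of giving up. */
		path_rec_start(path, IPOIB_PATH_QUERY_MS);
		return;
	}

	/* existing success/failure handling continues here */
}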
From mshefty at ichips.intel.com Thu Dec 16 13:09:38 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 13:09:38 -0800 Subject: [openib-general] CM header file In-Reply-To: <52oeguq87k.fsf@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <20041215174814.C26487@topspin.com> <41C1AF15.8000802@systemfabricworks.com> <41C1E811.9080405@ichips.intel.com> <52oeguq87k.fsf@topspin.com> Message-ID: <41C1F992.4010908@ichips.intel.com> Roland Dreier wrote: > Sean> I'm not sure that peer-to-peer needs to be exposed by the > Sean> API. The CM should be able to determine the connection > Sean> model when matching a received MAD with a local service ID. > Sean> I.e. does the local service ID match with a listen request > Sean> or a connection request... > > Nope, a consumer has to say whether an active connection request is > peer-to-peer or not. It's entirely possible for two systems to be > listening for the same service, and want to make two connections (ie > the normal TCP-like case where system A connects to port X on system B > at the same time as system B connects to the same port X on system > A). Or the application may want peer-to-peer semantics. There's no > way for the CM to know without being told. My thinking was that the connection model isn't carried in the CM MADs however, so a receiving CM has to determine how to match the connection request based on what the remote user requested, which isn't known. I guess that by not having this parameter, an incoming request could be matched incorrectly with an outgoing request, if the local listening service isn't up. I'll add a peer_to_peer parameter to ib_cm_req_param. - Sean From mshefty at ichips.intel.com Thu Dec 16 13:11:33 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 13:11:33 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <52k6riq826.fsf@topspin.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <52k6riq826.fsf@topspin.com> Message-ID: <41C1FA05.6020103@ichips.intel.com> Roland Dreier wrote: > Can you use the new copyright header I posted? Will do. > Also it's good to hold off on Makefile/Kconfig changes for now, since > that will simplify generating patches for upstream merging. If it > gets too difficult to hold back on the CM, I would suggest developing > the CM on a branch. It's not a problem holding off these changes. - Sean From halr at voltaire.com Thu Dec 16 13:12:33 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 16:12:33 -0500 Subject: [openib-general] Re: IPoIB Partial Connectivity Scenario In-Reply-To: <52fz26q7nw.fsf@topspin.com> References: <1103228372.4327.167.camel@localhost.localdomain> <52fz26q7nw.fsf@topspin.com> Message-ID: <1103231553.4327.230.camel@localhost.localdomain> On Thu, 2004-12-16 at 16:01, Roland Dreier wrote: > Sure, that sounds reasonable. I had thought that future packets would > cause the path record to be retried, but maybe that's not happening. If that were to happen that would be fine too. I don't see them on subsequent packets. Not sure what is going on. Can you artificially drop a GetResp ? That might reproduce what I am seeing. 
-- Hal From halr at voltaire.com Thu Dec 16 14:05:07 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 16 Dec 2004 17:05:07 -0500 Subject: [openib-general] Re: IPoIB Partial Connectivity Scenario In-Reply-To: <1103231553.4327.230.camel@localhost.localdomain> References: <1103228372.4327.167.camel@localhost.localdomain> <52fz26q7nw.fsf@topspin.com> <1103231553.4327.230.camel@localhost.localdomain> Message-ID: <1103234707.4214.8.camel@localhost.localdomain> On the remote node to which connectivity fails, it has a stale arp cache entry which does not seem to go away as if the timer is not started. Is that possible ? Is there a case where the ARP entry is created but not timed ? /sbin/ip neigh show dev ib0 192.168.0.1 lladdr 00:00:04:04:fe:80:00:00:00:00:00:00:00:54:42:b1:00:00:09:01 nud stale The other nodes' entries (they have bidirectional connectivity) do go from reachable to delay to stale to "gone". -- Hal From fzago at systemfabricworks.com Thu Dec 16 14:10:52 2004 From: fzago at systemfabricworks.com (frank zago) Date: Thu, 16 Dec 2004 16:10:52 -0600 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <20041216115556.25cdb461.mshefty@ichips.intel.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> Message-ID: <41C207EC.301@systemfabricworks.com> Hi Sean, >This patch adds in the initial CM API and module code. The module loads, >unloads, and allocates/deallocates connection structures, but that's about it. >This patch does not include changes needed to Kconfig or the Makefile, since I'm >not sure that it makes sense to change these yet. > >I will commit this unless there are any objections. >[...] > >+int ib_send_cm_req(struct ib_cm_id *cm_id, >+ struct ib_cm_req_param *param); >+ > >+int ib_send_cm_rep(struct ib_cm_id *cm_id, >+ struct ib_cm_req_param *param); >+ >+int ib_send_cm_rtu(struct ib_cm_id *cm_id, >+ void *private_data, >+ u8 private_data_len); > > I've used several CM and I found this kind of interface to be painful to use. I'd rather see an interface similar to Topspin's where you register a CM callback, get CM events and react (or not) to these. With the interface you propose it takes maybe 200 lines of code to establish a simple connection, while with a callback it can be down to 30 lines. It should be as easy as possible for an application or a driver to establish a connection. I shouldn't have to rewrite a CM state machine every time I need a connection. Frank. From mshefty at ichips.intel.com Thu Dec 16 14:30:17 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 14:30:17 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <41C207EC.301@systemfabricworks.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <41C207EC.301@systemfabricworks.com> Message-ID: <41C20C79.9010406@ichips.intel.com> frank zago wrote: >> +int ib_send_cm_req(struct ib_cm_id *cm_id, >> + struct ib_cm_req_param *param); >> + The Topspin API: int ib_cm_connect(struct ib_cm_active_param *param, struct ib_path_record *primary_path, struct ib_path_record *alternate_path, tTS_IB_SERVICE_ID service_id, int peer_to_peer, ib_cm_callback_func function, void *arg, tTS_IB_CM_COMM_ID *comm_id) Main difference here is that the parameters have been folded into one. (I'm not sure why ib_cm_active_param didn't just include the others.) >> +int ib_send_cm_rep(struct ib_cm_id *cm_id, >> + struct ib_cm_req_param *param); int ib_cm_accept(tTS_IB_CM_COMM_ID comm_id, struct ib_cm_passive_param *param) Nearly identical. 
Missing field was added to ib_cm_req_param that was lacking in ib_cm_passive_param. >> +int ib_send_cm_rtu(struct ib_cm_id *cm_id, >> + void *private_data, >> + u8 private_data_len); int ib_cm_confirm(tTS_IB_CM_COMM_ID comm_id, void *rtu_private_data, int rtu_private_data_len) Same. > > I've used several CM and I found this kind of interface to be painful to > use. > I'd rather see an interface similar to Topspin's where you register a CM > callback, get CM events and react (or not) to these. As pointed out above, this interface is similar to Topspin's. > With the interface you propose it takes maybe 200 lines of code to > establish a simple connection, while with a callback it can be down to > 30 lines. It should be as easy as possible for an application or a > driver to establish a connection. I shouldn't have to rewrite a CM state > machine every time I need a connection. This API does use a callback model based on events. You register for a callback by calling ib_create_cm_id. - Sean From fzago at systemfabricworks.com Thu Dec 16 14:35:09 2004 From: fzago at systemfabricworks.com (frank zago) Date: Thu, 16 Dec 2004 16:35:09 -0600 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <41C20C79.9010406@ichips.intel.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <41C207EC.301@systemfabricworks.com> <41C20C79.9010406@ichips.intel.com> Message-ID: <41C20D9D.7080502@systemfabricworks.com> >> With the interface you propose it takes maybe 200 lines of code to >> establish a simple connection, while with a callback it can be down >> to 30 lines. It should be as easy as possible for an application or a >> driver to establish a connection. I shouldn't have to rewrite a CM >> state machine every time I need a connection. > > > This API does use a callback model based on events. You register for > a callback by calling ib_create_cm_id. > > - Sean > > I was under the impression that with your proposal, an application would have to send REQ/REP/RTU/... mads. Is that the case? From libor at topspin.com Thu Dec 16 14:37:58 2004 From: libor at topspin.com (Libor Michalek) Date: Thu, 16 Dec 2004 14:37:58 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <41C207EC.301@systemfabricworks.com>; from fzago@systemfabricworks.com on Thu, Dec 16, 2004 at 04:10:52PM -0600 References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <41C207EC.301@systemfabricworks.com> Message-ID: <20041216143758.D26487@topspin.com> On Thu, Dec 16, 2004 at 04:10:52PM -0600, frank zago wrote: > > I've used several CM and I found this kind of interface to be painful to > use. > I'd rather see an interface similar to Topspin's where you register a CM > callback, get CM events and react (or not) to these. > > With the interface you propose it takes maybe 200 lines of code to > establish a simple connection, while with a callback it can be down to > 30 lines. It should be as easy as possible for an application or a > driver to establish a connection. I shouldn't have to rewrite a CM state > machine every time I need a connection. This ties into what I was saying about an error return value from the consumer callback being treated as a connection handle destroy request. There were three return types supported: error - connection handle destroy request defer - CM requires an API call to continue. (e.g. REQ callback requires a send_rep() call to continue) What the Sean's API does by default. success - CM generates callback response. (e.g. 
REQ callback completed successfully so the CM generates a REP response.) For this to work the corresponsing consumer callback needs to have return parameters to retrieve the parameters to generate the next CM response. So the callback consumer must set the return parameters if it uses the success return value. However, I'm not sure that not having this ability is that big of a code bloat, since you will call the corresponsing API function from the callback, and presumably set similar parameters for the API call as you would set for the callback return parameters. The bigger increase in consumer line count is having each consumer perform the QP state transitions. -Libor From mshefty at ichips.intel.com Thu Dec 16 14:41:56 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 14:41:56 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <41C20D9D.7080502@systemfabricworks.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <41C207EC.301@systemfabricworks.com> <41C20C79.9010406@ichips.intel.com> <41C20D9D.7080502@systemfabricworks.com> Message-ID: <41C20F34.6000404@ichips.intel.com> frank zago wrote: > >>> With the interface you propose it takes maybe 200 lines of code to >>> establish a simple connection, while with a callback it can be down >>> to 30 lines. It should be as easy as possible for an application or a >>> driver to establish a connection. I shouldn't have to rewrite a CM >>> state machine every time I need a connection. >> >> >> >> This API does use a callback model based on events. You register for >> a callback by calling ib_create_cm_id. >> >> - Sean >> >> > I was under the impression that with your proposal, an application would > have to send REQ/REP/RTU/... mads. Is that the case? The MADs are allocated and formated by the CM. The function names are descriptive of the action taken by the CM, but the user does not actually deal with MADs. I'm open to changing the name, but I would prefer it reflect what the CM is doing. The CM also manages the connection state and will fail any call done while in the wrong state. (This should be similar to the Topspin code.) - Sean From libor at topspin.com Thu Dec 16 14:42:26 2004 From: libor at topspin.com (Libor Michalek) Date: Thu, 16 Dec 2004 14:42:26 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C1EC9A.5030005@ichips.intel.com>; from mshefty@ichips.intel.com on Thu, Dec 16, 2004 at 12:14:18PM -0800 References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> Message-ID: <20041216144226.E26487@topspin.com> On Thu, Dec 16, 2004 at 12:14:18PM -0800, Sean Hefty wrote: > > One final note, I'm hoping that a more abstracted CM could be layered > on top of this one, if it were desired. E.g. one that performs QP > transitions, automatically generates MRAs, retries requests, etc. Are you suggesting that this proposed CM layer would not be responsible for retries and timeouts? I'm not sure how useful that would be, seems like every CM consumer would need/want these capabilities. 
-Libor From mshefty at ichips.intel.com Thu Dec 16 14:47:45 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 14:47:45 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <20041216143758.D26487@topspin.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <41C207EC.301@systemfabricworks.com> <20041216143758.D26487@topspin.com> Message-ID: <41C21091.5000305@ichips.intel.com> Libor Michalek wrote: > This ties into what I was saying about an error return value from the > consumer callback being treated as a connection handle destroy request. > There were three return types supported: I'm not opposed to this. I just haven't thought about it enough. > error - connection handle destroy request > defer - CM requires an API call to continue. (e.g. REQ callback > requires a send_rep() call to continue) What the Sean's > API does by default. > success - CM generates callback response. (e.g. REQ callback completed > successfully so the CM generates a REP response.) > > However, I'm not sure that not having this ability is that big of a > code bloat, since you will call the corresponsing API function from > the callback, and presumably set similar parameters for the API call > as you would set for the callback return parameters. The bigger increase > in consumer line count is having each consumer perform the QP state > transitions. I agree. I'm not sure that success is needed. This adds several parameters that need to be returned from the callback to save a single function call. - Sean From mshefty at ichips.intel.com Thu Dec 16 14:51:29 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 14:51:29 -0800 Subject: [openib-general] CM header file In-Reply-To: <20041216144226.E26487@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> Message-ID: <41C21171.8010407@ichips.intel.com> Libor Michalek wrote: > On Thu, Dec 16, 2004 at 12:14:18PM -0800, Sean Hefty wrote: > >>One final note, I'm hoping that a more abstracted CM could be layered >>on top of this one, if it were desired. E.g. one that performs QP >>transitions, automatically generates MRAs, retries requests, etc. > > > Are you suggesting that this proposed CM layer would not be responsible > for retries and timeouts? I'm not sure how useful that would be, seems > like every CM consumer would need/want these capabilities. I was trying to match the existing MAD API. The CM would perform timeouts, but not retries. Consumers could retry request immediately upon notification of a timeout. This lets the client change the timeout value. (I'm negotiable on this, but the cost of having clients initiate retries is basically one function call.) I have given some thought to how retries should work. I've thought about adding a new call, ib_retry_cm_send() - or something like that, that resends the last message sent. Or the callback could indicate to retry. 
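With the header as posted, a consumer-driven retry would look roughly like this. Sketch only: ib_retry_cm_send() does not exist yet and is just the idea floated above, and struct my_conn with its fields is invented for the example.

struct my_conn {
	struct ib_cm_req_param req_param;
	int req_retries_left;
	int failed;
};

static void my_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
{
	struct my_conn *conn = cm_id->context;

	switch (event->event) {
	case IB_CM_REQ_TIMEOUT:
		if (conn->req_retries_left-- > 0) {
			/* back off and resend the same REQ */
			conn->req_param.timeout_ms *= 2;
			ib_send_cm_req(cm_id, &conn->req_param);
			/* or, if it gets added: ib_retry_cm_send(cm_id); */
		} else {
			/* give up; tear the id down outside the callback
			 * until destroy-from-callback is sorted out */
			conn->failed = 1;
		}
		break;
	default:
		break;
	}
}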
- Sean From fzago at systemfabricworks.com Thu Dec 16 15:02:06 2004 From: fzago at systemfabricworks.com (frank zago) Date: Thu, 16 Dec 2004 17:02:06 -0600 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <41C21091.5000305@ichips.intel.com> References: <20041216115556.25cdb461.mshefty@ichips.intel.com> <41C207EC.301@systemfabricworks.com> <20041216143758.D26487@topspin.com> <41C21091.5000305@ichips.intel.com> Message-ID: <41C213EE.5070602@systemfabricworks.com> Sean Hefty wrote: > Libor Michalek wrote: > >> This ties into what I was saying about an error return value from the >> consumer callback being treated as a connection handle destroy request. >> There were three return types supported: > > > I'm not opposed to this. I just haven't thought about it enough. > >> error - connection handle destroy request >> defer - CM requires an API call to continue. (e.g. REQ callback >> requires a send_rep() call to continue) What the Sean's >> API does by default. >> success - CM generates callback response. (e.g. REQ callback completed >> successfully so the CM generates a REP response.) >> >> However, I'm not sure that not having this ability is that big of a >> code bloat, since you will call the corresponsing API function from >> the callback, and presumably set similar parameters for the API call >> as you would set for the callback return parameters. The bigger increase >> in consumer line count is having each consumer perform the QP state >> transitions. > > > I agree. I'm not sure that success is needed. This adds several > parameters that need to be returned from the callback to save a single > function call. It doesn't just save a function call. For instance: cm_handler(cm_id, event, param) { ... case CM_REQ: if (not yet connected) { param.req.userdata[0]=0x5A; param.req.userdatalen=1; ret = OK; } else { ret = REJECT; } .... } Here, CM has prepared with defaults values the userdata that the handler can override. It hides CM internals: no allocation of a REQ structure, no cleaning of structures, no sending of mads. Frank. From fzago at systemfabricworks.com Thu Dec 16 15:05:41 2004 From: fzago at systemfabricworks.com (frank zago) Date: Thu, 16 Dec 2004 17:05:41 -0600 Subject: [openib-general] CM header file In-Reply-To: <41C21171.8010407@ichips.intel.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> Message-ID: <41C214C5.9090704@systemfabricworks.com> > > I was trying to match the existing MAD API. The CM would perform > timeouts, but not retries. Consumers could retry request immediately > upon notification of a timeout. This lets the client change the > timeout value. (I'm negotiable on this, but the cost of having > clients initiate retries is basically one function call.) > > I have given some thought to how retries should work. I've thought > about adding a new call, ib_retry_cm_send() - or something like that, > that resends the last message sent. Or the callback could indicate to > retry. I have the same problem here than with the handler. A client should be able to set timeouts and retries, and let CM take care of the painful stuff. The complexity should be in the CM, not repeated amongst the clients. 
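For instance, with the fields already in the posted ib_cm_req_param, the client-visible part could shrink to little more than the following. Sketch only: MY_SERVICE_ID and my_connect() are placeholders, and how timeout_ms/max_cm_retries would map onto the REQ's timeout fields is up to the CM implementation.

static int my_connect(struct ib_cm_id *cm_id, struct ib_qp *qp,
		      struct ib_path_record *path)
{
	struct ib_cm_req_param param = {
		.qp		= qp,
		.primary_path	= path,
		.service_id	= MY_SERVICE_ID,	/* placeholder */
		.timeout_ms	= 500,	/* per-attempt CM timeout */
		.max_cm_retries	= 3,	/* CM retransmits the REQ itself */
		.retry_count	= 7,	/* QP retries, not CM retries */
	};

	return ib_send_cm_req(cm_id, &param);
}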
Frank From libor at topspin.com Thu Dec 16 15:21:35 2004 From: libor at topspin.com (Libor Michalek) Date: Thu, 16 Dec 2004 15:21:35 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C21171.8010407@ichips.intel.com>; from mshefty@ichips.intel.com on Thu, Dec 16, 2004 at 02:51:29PM -0800 References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> Message-ID: <20041216152135.F26487@topspin.com> On Thu, Dec 16, 2004 at 02:51:29PM -0800, Sean Hefty wrote: > Libor Michalek wrote: > > On Thu, Dec 16, 2004 at 12:14:18PM -0800, Sean Hefty wrote: > > > >>One final note, I'm hoping that a more abstracted CM could be layered > >>on top of this one, if it were desired. E.g. one that performs QP > >>transitions, automatically generates MRAs, retries requests, etc. > > > > > > Are you suggesting that this proposed CM layer would not be responsible > > for retries and timeouts? I'm not sure how useful that would be, seems > > like every CM consumer would need/want these capabilities. > > I was trying to match the existing MAD API. The CM would perform > timeouts, but not retries. Consumers could retry request immediately > upon notification of a timeout. This lets the client change the > timeout value. (I'm negotiable on this, but the cost of having clients > initiate retries is basically one function call.) I was thinking the cost is a bit higher than a function call. The data used to generate the CM MAD needs to be stored somewhere to generate the retry if necessary. I'll need to take a closer look at which parameters you're expecting from the consumer and which you are going to save as part of the internal CM connection structure. > I have given some thought to how retries should work. I've thought > about adding a new call, ib_retry_cm_send() - or something like that, > that resends the last message sent. Or the callback could indicate to > retry. Oh, so the CM would have enough information to generate the retry, but the consumer would need to notify the CM to do so? -Libor From roland at topspin.com Thu Dec 16 15:28:29 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 15:28:29 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C1F992.4010908@ichips.intel.com> (Sean Hefty's message of "Thu, 16 Dec 2004 13:09:38 -0800") References: <41C0D5F3.7010307@ichips.intel.com> <20041215174814.C26487@topspin.com> <41C1AF15.8000802@systemfabricworks.com> <41C1E811.9080405@ichips.intel.com> <52oeguq87k.fsf@topspin.com> <41C1F992.4010908@ichips.intel.com> Message-ID: <52brctrffm.fsf@topspin.com> Sean> My thinking was that the connection model isn't carried in Sean> the CM MADs however, so a receiving CM has to determine how Sean> to match the connection request based on what the remote Sean> user requested, which isn't known. Sean> I guess that by not having this parameter, an incoming Sean> request could be matched incorrectly with an outgoing Sean> request, if the local listening service isn't up. I'll add Sean> a peer_to_peer parameter to ib_cm_req_param. Right, peer-to-peer comes in when the CM receives a REQ after it has sent its own REQ. So the active connect (which generated the send of a REQ) needs to have a flag saying whether subsequent incoming REQs should go through peer compare or be treated independently. - R.
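To make the preceding exchange concrete, here is roughly what the "defer" model Sean describes looks like from the consumer side, fleshing out Frank's pseudocode: the REQ callback does its own QP transitions and then continues the handshake with a single explicit call. This is only a sketch; IB_CM_REQ_RECEIVED, ib_send_cm_rep(), ib_send_cm_rej() and the example_* helpers are placeholder names, not the API actually proposed in this thread.

#include <ib_cm.h>

/* Sketch only -- the event code, send calls and helpers below are
 * placeholders; only ib_cm_id/ib_cm_event come from the posted header. */
static void example_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
{
	switch (event->event) {
	case IB_CM_REQ_RECEIVED:
		if (example_can_accept(cm_id)) {
			/* consumer owns the QP: move it through INIT/RTR itself... */
			example_transition_qp(cm_id);
			/* ...then the "one function call" that answers the REQ */
			ib_send_cm_rep(cm_id, NULL, 0);
		} else {
			ib_send_cm_rej(cm_id, NULL, 0);
		}
		break;
	default:
		break;
	}
}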
From roland at topspin.com Thu Dec 16 15:31:32 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 15:31:32 -0800 Subject: [openib-general] Re: IPoIB Path Static Rate In-Reply-To: <1103230507.4327.224.camel@localhost.localdomain> (Hal Rosenstock's message of "16 Dec 2004 15:55:08 -0500") References: <1103230507.4327.224.camel@localhost.localdomain> Message-ID: <527jnhrfaj.fsf@topspin.com> Hal> It looks to me like after obtaining the PathRecord, the Hal> static rate is not used when the AV is created. Shouldn't it Hal> be ? Is there an issue with doing this ? There is a similar Hal> issue with the multicast AVs as well. I know there is an Hal> assumption that everything is 4x but I am not sure that is Hal> good one to lock in. Ultimately we probably will want to do static rate correctly, but I decided to punt on this for now because it's not trivial to determine the correct IPD to use, and IPoIB doesn't really inject packets at rates beyond 1X anyway. - R. From mshefty at ichips.intel.com Thu Dec 16 15:35:18 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 15:35:18 -0800 Subject: [openib-general] CM header file In-Reply-To: <20041216152135.F26487@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <20041216152135.F26487@topspin.com> Message-ID: <41C21BB6.1060403@ichips.intel.com> Libor Michalek wrote: >>I have given some thought to how retries should work. I've thought >>about adding a new call, ib_retry_cm_send() - or something like that, >>that resends the last message sent. Or the callback could indicate to >>retry. > > Oh, so the CM would have enough information to generate the retry, but > the consumer would need to notify the CM to do so? I think it makes sense in the case of a timeout for the CM to optimize for the retry case. (Such as keeping the last sent MAD around for retransmission.) The argument used against putting retries into the MAD layer was to allow the consumer to set the retry policy, such as increasing the timeout value. A major difference between the MAD API and the CM API is that the CM abstracts the MAD from the user, making the retry issue a more difficult to do efficiently. So, while I think it would be nice to let the user initiate retries, I want to make sure that it doesn't have a substantial impact on performance. - Sean From mshefty at ichips.intel.com Thu Dec 16 15:41:31 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 15:41:31 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C214C5.9090704@systemfabricworks.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> Message-ID: <41C21D2B.5000509@ichips.intel.com> frank zago wrote: >> I was trying to match the existing MAD API. The CM would perform >> timeouts, but not retries. Consumers could retry request immediately >> upon notification of a timeout. This lets the client change the >> timeout value. (I'm negotiable on this, but the cost of having >> clients initiate retries is basically one function call.) >> >> I have given some thought to how retries should work. 
I've thought >> about adding a new call, ib_retry_cm_send() - or something like that, >> that resends the last message sent. Or the callback could indicate to >> retry. > > I have the same problem here than with the handler. A client should be > able to set timeouts and retries, and let CM take care of the painful > stuff. > The complexity should be in the CM, not repeated amongst the clients. I am trying to avoid adding policy or hard-coding default values into the CM, and follow the same design decisions that we used with the current APIs. Long term I think simpler interfaces will be a better solution. Areas where the majority of clients will need to implement the exact same code make sense to push into the CM. So far there doesn't seem to be any disagreement that the CM has features that it doesn't need. And the list of desired features seem to be: * Perform QP transitions for the user. * Provide default values when establishing a connection. * Perform retransmissions. * Destroy connection identifiers from/after a callback. Are there others? - Sean From roland at topspin.com Thu Dec 16 15:44:12 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 15:44:12 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C21BB6.1060403@ichips.intel.com> (Sean Hefty's message of "Thu, 16 Dec 2004 15:35:18 -0800") References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <20041216152135.F26487@topspin.com> <41C21BB6.1060403@ichips.intel.com> Message-ID: <52y8fxq04z.fsf@topspin.com> Sean> I think it makes sense in the case of a timeout for the CM Sean> to optimize for the retry case. (Such as keeping the last Sean> sent MAD around for retransmission.) The argument used Sean> against putting retries into the MAD layer was to allow the Sean> consumer to set the retry policy, such as increasing the Sean> timeout value. Sean> A major difference between the MAD API and the CM API is Sean> that the CM abstracts the MAD from the user, making the Sean> retry issue a more difficult to do efficiently. So, while I Sean> think it would be nice to let the user initiate retries, I Sean> want to make sure that it doesn't have a substantial impact Sean> on performance. The CM retry policy is specified much more tightly than for general MADs. The total number of retries is limited by the "Max CM Retries" field and the timeout waiting for each response is also part of the CM protocol. - Roland From mshefty at ichips.intel.com Thu Dec 16 15:55:33 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Dec 2004 15:55:33 -0800 Subject: [openib-general] CM header file In-Reply-To: <52y8fxq04z.fsf@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <20041216152135.F26487@topspin.com> <41C21BB6.1060403@ichips.intel.com> <52y8fxq04z.fsf@topspin.com> Message-ID: <41C22075.6050506@ichips.intel.com> Roland Dreier wrote: > The CM retry policy is specified much more tightly than for general > MADs. The total number of retries is limited by the "Max CM Retries" > field and the timeout waiting for each response is also part of the CM > protocol. 
It appears then that given max_cm_retries, remote_cm_response_timeout, and primary_path->packet_life_time (or local_ack_timeout), the CM can issue retries automatically for the user. Does this sound correct? I will plan on having the code issue retries. (I'm assuming that the Topspin CM does this, which is where I'm planning on pulling code over from.) - Sean From fzago at systemfabricworks.com Thu Dec 16 15:55:45 2004 From: fzago at systemfabricworks.com (frank zago) Date: Thu, 16 Dec 2004 17:55:45 -0600 Subject: [openib-general] CM header file In-Reply-To: <41C21D2B.5000509@ichips.intel.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> Message-ID: <41C22081.2010806@systemfabricworks.com> > > Areas where the majority of clients will need to implement the exact > same code make sense to push into the CM. So far there doesn't seem > to be any disagreement that the CM has features that it doesn't need. > And the list of desired features seem to be: > > * Perform QP transitions for the user. > * Provide default values when establishing a connection. > * Perform retransmissions. > * Destroy connection identifiers from/after a callback. > > Are there others? What about: * CM creates the QP when necessary, unless the app provides one. * CM will destroy the QP on connection closure if it owns it. Frank. From robert.j.woodruff at intel.com Thu Dec 16 16:05:43 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 16 Dec 2004 16:05:43 -0800 Subject: [openib-general] Re: IPoIB Failure CQ overrun Message-ID: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> I am now seeing a new failure. I bring up 2 nodes and initially can ping between the nodes. Then I try to run netpipe, and after the message size gets a little past 4K, it hangs. I see the same behavior running MPI over TCP. This used to work. I look in the dmesg log and see the following: ib0: send complete, wrid 0 ib0: called: id 1, op 0, status: 0 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: send complete, wrid 1 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: sending packet, length=389 address=00000100bff20b00 qpn=0x000404 ib_mthca 0000:04:00.0: CQ overrun on CQN 00000082 ib0: called: id 2, op 0, status: 0 ib0: send complete, wrid 2 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 ib0: sending packet, length=2048 address=00000100bff20b00 qpn=0x000404 After it gets into this state, the interface is dead. /sbin/ip neigh show dev ib0 192.168.0.1 nud failed Any ideas ? woody From roland at topspin.com Thu Dec 16 16:29:20 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 16 Dec 2004 16:29:20 -0800 Subject: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> (Robert J.
Woodruff's message of "Thu, 16 Dec 2004 16:05:43 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> Message-ID: <52u0qlpy1r.fsf@topspin.com> Robert> 0000:04:00.0: CQ overrun on CQN 00000082 This appears to be an issue with the latest FW (I see it with Tavor FW 3.3.1 but not 3.2.0). I am working with Mellanox on finding out whether it's a FW bug or a problem with mthca. For now you can work around it by changing IPOIB_NUM_WC = 4, to IPOIB_NUM_WC = 1, in ipoib.h. - Roland From robert.j.woodruff at intel.com Thu Dec 16 16:32:00 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 16 Dec 2004 16:32:00 -0800 Subject: [openib-general] Re: IPoIB Failure CQ overrun Message-ID: <1AC79F16F5C5284499BB9591B33D6F000315C183@orsmsx408> >This appears to be an issue with the latest FW (I see it with Tavor FW >3.3.1 but not 3.2.0). I am working with Mellanox on finding out >whether it's a FW bug or a problem with mthca. >For now you can work around it by changing > IPOIB_NUM_WC = 4, >to > IPOIB_NUM_WC = 1, >in ipoib.h. > - Roland Ok, thanks, I will give it a try. woody From robert.j.woodruff at intel.com Thu Dec 16 17:07:02 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 16 Dec 2004 17:07:02 -0800 Subject: [openib-general] Re: IPoIB Failure CQ overrun Message-ID: <1AC79F16F5C5284499BB9591B33D6F000315C204@orsmsx408> >This appears to be an issue with the latest FW (I see it with Tavor FW >3.3.1 but not 3.2.0). I am working with Mellanox on finding out >whether it's a FW bug or a problem with mthca. >For now you can work around it by changing > IPOIB_NUM_WC = 4, >to > IPOIB_NUM_WC = 1, >in ipoib.h. > - Roland Thanks, That fixed the problem. From halr at voltaire.com Fri Dec 17 06:05:24 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Dec 2004 09:05:24 -0500 Subject: [openib-general] Latest mthca_qp.c appears to be broken with gcc 3.4.2 Message-ID: <1103292324.4214.627.camel@localhost.localdomain> drivers/infiniband/hw/mthca/mthca_qp.c: In function `mthca_post_send': drivers/infiniband/hw/mthca/mthca_qp.c:1249: error: label at end of compound statement gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3) From halr at voltaire.com Fri Dec 17 06:17:54 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Dec 2004 09:17:54 -0500 Subject: [openib-general] Re: IPoIB Partial Connectivity Scenario In-Reply-To: <1103234707.4214.8.camel@localhost.localdomain> References: <1103228372.4327.167.camel@localhost.localdomain> <52fz26q7nw.fsf@topspin.com> <1103231553.4327.230.camel@localhost.localdomain> <1103234707.4214.8.camel@localhost.localdomain> Message-ID: <1103293074.4214.647.camel@localhost.localdomain> On Thu, 2004-12-16 at 17:05, Hal Rosenstock wrote: > On the remote node to which connectivity fails, it has a stale arp cache > entry which does not seem to go away as if the timer is not started. Is > that possible ? Is there a case where the ARP entry is created but not > timed ? > > /sbin/ip neigh show dev ib0 > 192.168.0.1 lladdr > 00:00:04:04:fe:80:00:00:00:00:00:00:00:54:42:b1:00:00:09:01 nud stale > > The other nodes' entries (they have bidirectional connectivity) do go > from reachable to delay to stale to "gone". With the latest tree (1354), this problem is now gone. Was your change to ipoib_main.c intended to fix this ? r1353 | roland | 2004-12-16 23:10:24 -0500 (Thu, 16 Dec 2004) | 4 lines Be slightly more careful about when we queue packets. Thanks. 
-- Hal From roland at topspin.com Fri Dec 17 07:08:42 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 07:08:42 -0800 Subject: [openib-general] Re: IPoIB Partial Connectivity Scenario In-Reply-To: <1103293074.4214.647.camel@localhost.localdomain> (Hal Rosenstock's message of "17 Dec 2004 09:17:54 -0500") References: <1103228372.4327.167.camel@localhost.localdomain> <52fz26q7nw.fsf@topspin.com> <1103231553.4327.230.camel@localhost.localdomain> <1103234707.4214.8.camel@localhost.localdomain> <1103293074.4214.647.camel@localhost.localdomain> Message-ID: <52llbxotc5.fsf@topspin.com> Hal> With the latest tree (1354), this problem is now gone. Was Hal> your change to ipoib_main.c intended to fix this ? No, it should just prevent queues from growing arbitrarily long and start dropping packets. - R. From roland at topspin.com Fri Dec 17 07:09:02 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 07:09:02 -0800 Subject: [openib-general] Re: Latest mthca_qp.c appears to be broken with gcc 3.4.2 In-Reply-To: <1103292324.4214.627.camel@localhost.localdomain> (Hal Rosenstock's message of "17 Dec 2004 09:05:24 -0500") References: <1103292324.4214.627.camel@localhost.localdomain> Message-ID: <52hdmlotbl.fsf@topspin.com> Hal> drivers/infiniband/hw/mthca/mthca_qp.c: In function `mthca_post_send': Hal> drivers/infiniband/hw/mthca/mthca_qp.c:1249: error: label at end of compound statement Thanks, I've fixed this. - R. From halr at voltaire.com Fri Dec 17 07:27:26 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Dec 2004 10:27:26 -0500 Subject: [openib-general] Re: IPoIB Partial Connectivity Scenario In-Reply-To: <52llbxotc5.fsf@topspin.com> References: <1103228372.4327.167.camel@localhost.localdomain> <52fz26q7nw.fsf@topspin.com> <1103231553.4327.230.camel@localhost.localdomain> <1103234707.4214.8.camel@localhost.localdomain> <1103293074.4214.647.camel@localhost.localdomain> <52llbxotc5.fsf@topspin.com> Message-ID: <1103297246.4214.668.camel@localhost.localdomain> On Fri, 2004-12-17 at 10:08, Roland Dreier wrote: > Hal> With the latest tree (1354), this problem is now gone. Was > Hal> your change to ipoib_main.c intended to fix this ? > > No, it should just prevent queues from growing arbitrarily long and > start dropping packets. Guess the PR requests/responses aren't being lost right now... From iod00d at hp.com Fri Dec 17 08:49:04 2004 From: iod00d at hp.com (Grant Grundler) Date: Fri, 17 Dec 2004 08:49:04 -0800 Subject: [openib-general] CM header file In-Reply-To: <41C22081.2010806@systemfabricworks.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> <41C22081.2010806@systemfabricworks.com> Message-ID: <20041217164904.GC20977@esmail.cup.hp.com> On Thu, Dec 16, 2004 at 05:55:45PM -0600, frank zago wrote: > >Are there others? > > What about: > * CM creates the QP when necessary, unless the app provides one. Sorry, I can't get enthusiastic about this idea. It adds complexity and code where it's usually not required. If someone wants to another layer of inline code that handles the allocation for the caller (and deallocation when it's done), that's works for me. But it shouldn't be part of the basic design. 
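For what it's worth, the optional convenience layer Grant alludes to could live entirely outside the CM, along these lines. ib_create_qp()/ib_destroy_qp() are the existing verbs; ib_send_cm_req() and the argument list used here are only assumed for illustration. The point is simply that the consumer, not the CM, owns the QP's lifetime.

#include <ib_verbs.h>
#include <ib_cm.h>

/* Sketch of a thin wrapper kept out of the CM proper; ib_send_cm_req()
 * and its signature are illustrative assumptions. */
static struct ib_qp *example_connect_with_new_qp(struct ib_pd *pd,
						 struct ib_qp_init_attr *init_attr,
						 struct ib_cm_id *cm_id,
						 struct ib_cm_req_param *param)
{
	struct ib_qp *qp;
	int ret;

	qp = ib_create_qp(pd, init_attr);
	if (IS_ERR(qp))
		return qp;

	ret = ib_send_cm_req(cm_id, qp, param);	/* hypothetical call */
	if (ret) {
		ib_destroy_qp(qp);
		return ERR_PTR(ret);
	}
	/* caller still owns qp and destroys it when the connection goes away */
	return qp;
}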
my $0.02, grant From roland at topspin.com Fri Dec 17 08:53:15 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 08:53:15 -0800 Subject: [openib-general] CM header file In-Reply-To: <20041217164904.GC20977@esmail.cup.hp.com> (Grant Grundler's message of "Fri, 17 Dec 2004 08:49:04 -0800") References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> <41C22081.2010806@systemfabricworks.com> <20041217164904.GC20977@esmail.cup.hp.com> Message-ID: <52d5x8q32c.fsf@topspin.com> Frank> What about: * CM creates the QP when necessary, unless the Frank> app provides one. Grant> Sorry, I can't get enthusiastic about this idea. It adds Grant> complexity and code where it's usually not required. If Grant> someone wants to another layer of inline code that handles Grant> the allocation for the caller (and deallocation when it's Grant> done), that's works for me. But it shouldn't be part of Grant> the basic design. I agree. In the original design of the Topspin drivers, we had the CM handle allocating and freeing QPs. However it leads to all sorts of object lifetime issues and ends up as more trouble than it's worth. - R. From halr at voltaire.com Fri Dec 17 09:03:38 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Dec 2004 12:03:38 -0500 Subject: [openib-general] CM header file In-Reply-To: <52d5x8q32c.fsf@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> <41C22081.2010806@systemfabricworks.com> <20041217164904.GC20977@esmail.cup.hp.com> <52d5x8q32c.fsf@topspin.com> Message-ID: <1103303017.4214.781.camel@localhost.localdomain> On Fri, 2004-12-17 at 11:53, Roland Dreier wrote: > I agree. In the original design of the Topspin drivers, we had the CM > handle allocating and freeing QPs. However it leads to all sorts of > object lifetime issues and ends up as more trouble than it's worth. But not doing this does eliminate some optimizations that can be done in terms of this which are useful for certain applications which setup and tear down connections frequently. -- Hal From roland at topspin.com Fri Dec 17 09:18:32 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 09:18:32 -0800 Subject: [openib-general] CM header file In-Reply-To: <1103303017.4214.781.camel@localhost.localdomain> (Hal Rosenstock's message of "17 Dec 2004 12:03:38 -0500") References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> <41C22081.2010806@systemfabricworks.com> <20041217164904.GC20977@esmail.cup.hp.com> <52d5x8q32c.fsf@topspin.com> <1103303017.4214.781.camel@localhost.localdomain> Message-ID: <528y7wq1w7.fsf@topspin.com> Hal> But not doing this does eliminate some optimizations that can Hal> be done in terms of this which are useful for certain Hal> applications which setup and tear down connections Hal> frequently. 
I can't parse this very well (too many negatives with "not doing" and "does eliminate") -- are you agreeing with Grant and me or supporting Frank's original suggestion? - R. From mshefty at ichips.intel.com Fri Dec 17 09:39:24 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 17 Dec 2004 09:39:24 -0800 Subject: [openib-general] CM header file In-Reply-To: <52d5x8q32c.fsf@topspin.com> References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> <41C22081.2010806@systemfabricworks.com> <20041217164904.GC20977@esmail.cup.hp.com> <52d5x8q32c.fsf@topspin.com> Message-ID: <41C319CC.2070900@ichips.intel.com> Roland Dreier wrote: > Frank> What about: * CM creates the QP when necessary, unless the > Frank> app provides one. > > Grant> Sorry, I can't get enthusiastic about this idea. It adds > Grant> complexity and code where it's usually not required. If > Grant> someone wants to another layer of inline code that handles > Grant> the allocation for the caller (and deallocation when it's > Grant> done), that's works for me. But it shouldn't be part of > Grant> the basic design. > > I agree. In the original design of the Topspin drivers, we had the CM > handle allocating and freeing QPs. However it leads to all sorts of > object lifetime issues and ends up as more trouble than it's worth. I agree with Grant and Roland. Again, I believe that the initial CM should be fairly focused on what it does. If more advanced features are desired, then a separate module should be layered on top of this one. That doesn't mean that those features can't be in the openib stack, I just don't think they necessarily belong in the lower layers. I do want to continue collecting feedback, to make sure that nothing that should be part of this layer gets missed. - Sean From libor at topspin.com Fri Dec 17 11:38:53 2004 From: libor at topspin.com (Libor Michalek) Date: Fri, 17 Dec 2004 11:38:53 -0800 Subject: [openib-general] CM header file In-Reply-To: <52d5x8q32c.fsf@topspin.com>; from roland@topspin.com on Fri, Dec 17, 2004 at 08:53:15AM -0800 References: <41C0D5F3.7010307@ichips.intel.com> <1103226826.4327.149.camel@localhost.localdomain> <41C1EC9A.5030005@ichips.intel.com> <20041216144226.E26487@topspin.com> <41C21171.8010407@ichips.intel.com> <41C214C5.9090704@systemfabricworks.com> <41C21D2B.5000509@ichips.intel.com> <41C22081.2010806@systemfabricworks.com> <20041217164904.GC20977@esmail.cup.hp.com> <52d5x8q32c.fsf@topspin.com> Message-ID: <20041217113853.A29699@topspin.com> On Fri, Dec 17, 2004 at 08:53:15AM -0800, Roland Dreier wrote: > Frank> What about: * CM creates the QP when necessary, unless the > Frank> app provides one. > > Grant> Sorry, I can't get enthusiastic about this idea. It adds > Grant> complexity and code where it's usually not required. If > Grant> someone wants to another layer of inline code that handles > Grant> the allocation for the caller (and deallocation when it's > Grant> done), that's works for me. But it shouldn't be part of > Grant> the basic design. > > I agree. In the original design of the Topspin drivers, we had the CM > handle allocating and freeing QPs. However it leads to all sorts of > object lifetime issues and ends up as more trouble than it's worth. True. 
It resulted in numerous race conditions revolving around object lifetime, as a result most of the software that used the CM ended up using it in consumer provided QP mode. At the time of the design I was the biggest proponent for having CM allocated QPs, but in retrospect it was a bad idea. -Libor From robert.j.woodruff at intel.com Fri Dec 17 12:00:00 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 17 Dec 2004 12:00:00 -0800 Subject: [openib-general] Kernel assertion Message-ID: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> I was running a 4 node cluster running an MPI application over IPoIB and one of the system's died with the following messages logged to /var/log/messages. The svn rev. is 1355. 2 of the nodes are PCI-E cards and 2 nodes are PCI-X. The system that asserted was one of the PCI-E systems. Dec 17 11:45:33 iclust-16 rsh(pam_unix)[4605]: session closed for user woody Dec 17 11:46:30 iclust-16 kernel: ib_mthca 0000:04:00.0: SQ full (64 posted, 64 max, 0 nreq) Dec 17 11:46:30 iclust-16 kernel: ib0: post_send failed Dec 17 11:46:39 iclust-16 kernel: ib_mthca 0000:04:00.0: SQ full (64 posted, 64 max, 0 nreq) Dec 17 11:46:39 iclust-16 kernel: ib0: post_send failed Dec 17 11:46:39 iclust-16 kernel: KERNEL: assertion (!atomic_read(&skb->users)) failed at net/core/dev.c (1616) Dec 17 11:49:49 iclust-16 syslogd 1.4.1: restart. Any ideas ? woody From roland at topspin.com Fri Dec 17 12:04:26 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 12:04:26 -0800 Subject: [openib-general] Kernel assertion In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> (Robert J. Woodruff's message of "Fri, 17 Dec 2004 12:00:00 -0800") References: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> Message-ID: <52vfb0ofn9.fsf@topspin.com> Robert> I was running a 4 node cluster running an MPI application Robert> over IPoIB and one of the system's died with the following Robert> messages logged to /var/log/messages. The svn rev. is Robert> 1355. 2 of the nodes are PCI-E cards and 2 nodes are Robert> PCI-X. The system that asserted was one of the PCI-E Robert> systems. Not sure, I'll take a look at how this could happen. - R. From roland at topspin.com Fri Dec 17 12:45:07 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 12:45:07 -0800 Subject: [openib-general] Kernel assertion In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> (Robert J. Woodruff's message of "Fri, 17 Dec 2004 12:00:00 -0800") References: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> Message-ID: <52r7loodrg.fsf@topspin.com> I've been able to reproduce this here by reducing the size of my TX queue and lowering the MTU, so I should be able to fix it soon. Thanks, Roland From roland at topspin.com Fri Dec 17 13:57:40 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 13:57:40 -0800 Subject: [openib-general] LLTX and netif_stop_queue Message-ID: <52llbwoaej.fsf@topspin.com> While testing my IP-over-InfiniBand driver, I discovered that if a net device sets NETIF_F_LLTX, it seems the device's hard_start_xmit method can be called even after a netif_stop_queue(). This is because in the LLTX case, qdisc_restart() holds no locks while calling hard_start_xmit, so something like the following can happen: CPU 1 CPU 2 qdisc_restart: drop queue lock call hard_start_xmit() net driver: acquire TX lock queue packet to HW acquire queue lock... 
qdisc_restart: drop queue lock call hard_start_xmit: queue full, call netif_stop_queue() release TX lock net driver: acquire TX lock queue is already full! Is my understanding correct? If so it seems the patch below would make sense. (e1000 seems to handle this properly already) Thanks, Roland Since tg3 and sungem now use lockless TX (NETIF_F_LLTX), it's possible for their hard_start_xmit method to be called even after they call netif_stop_queue. Therefore a full queue no longer indicates a bug -- this patch fixes the comment and removes the KERN_ERR printk. Signed-off-by: Roland Dreier Index: linux-bk/drivers/net/sungem.c =================================================================== --- linux-bk.orig/drivers/net/sungem.c 2004-12-16 15:56:19.000000000 -0800 +++ linux-bk/drivers/net/sungem.c 2004-12-17 13:46:43.307064457 -0800 @@ -976,12 +976,10 @@ return NETDEV_TX_LOCKED; } - /* This is a hard error, log it. */ + /* This may happen, since we have NETIF_F_LLTX set */ if (TX_BUFFS_AVAIL(gp) <= (skb_shinfo(skb)->nr_frags + 1)) { netif_stop_queue(dev); spin_unlock_irqrestore(&gp->tx_lock, flags); - printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", - dev->name); return NETDEV_TX_BUSY; } Index: linux-bk/drivers/net/tg3.c =================================================================== --- linux-bk.orig/drivers/net/tg3.c 2004-12-16 15:56:06.000000000 -0800 +++ linux-bk/drivers/net/tg3.c 2004-12-17 13:46:25.952622672 -0800 @@ -3076,12 +3076,10 @@ return NETDEV_TX_LOCKED; } - /* This is a hard error, log it. */ + /* This may happen, since we have NETIF_F_LLTX set */ if (unlikely(TX_BUFFS_AVAIL(tp) <= (skb_shinfo(skb)->nr_frags + 1))) { netif_stop_queue(dev); spin_unlock_irqrestore(&tp->tx_lock, flags); - printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", - dev->name); return NETDEV_TX_BUSY; } From roland at topspin.com Fri Dec 17 14:04:52 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 17 Dec 2004 14:04:52 -0800 Subject: [openib-general] Kernel assertion In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> (Robert J. Woodruff's message of "Fri, 17 Dec 2004 12:00:00 -0800") References: <1AC79F16F5C5284499BB9591B33D6F000315C8F9@orsmsx408> Message-ID: <52hdmkoa2j.fsf@topspin.com> OK, I checked in a fix for this. I'm actually not sure if it was really a bug in IPoIB or a glitch in the network stack (see my recent post to netdev, cc'ed to openib-general, about 'LLTX and netif_stop_queue') but in any case I made the messages go away in my setup. By the way, it might be interesting to see if increasing IPOIB_RX_RING_SIZE and/or IPOIB_TX_RING_SIZE in ipoib.h had any effect on your performance (make sure to keep them powers of 2). Thanks, Roland From robert.j.woodruff at intel.com Fri Dec 17 14:19:03 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 17 Dec 2004 14:19:03 -0800 Subject: [openib-general] Kernel assertion Message-ID: <1AC79F16F5C5284499BB9591B33D6F000315CA9D@orsmsx408> Roland> By the way, it might be interesting to see if increasing >IPOIB_RX_RING_SIZE and/or IPOIB_TX_RING_SIZE in ipoib.h had any effect >on your performance (make sure to keep them powers of 2). >Thanks, > Roland Ok thanks, I will pull a fresh set of code and give it a try. I can also try increasing the ring queue size. 
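For readers following along, the driver-side pattern the patch above leaves in place looks roughly like this condensed sketch; example_priv, its tx_free count and the posting step are made up, while the return codes and locking mirror what sungem/tg3 already do. With NETIF_F_LLTX set, finding the ring full here is a legal race, not a driver bug.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/spinlock.h>

struct example_priv {
	spinlock_t	tx_lock;
	int		tx_free;	/* free TX ring entries (illustrative) */
};

static int example_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct example_priv *priv = netdev_priv(dev);
	unsigned long flags;

	local_irq_save(flags);
	if (!spin_trylock(&priv->tx_lock)) {
		local_irq_restore(flags);
		return NETDEV_TX_LOCKED;	/* core will retry */
	}

	if (priv->tx_free <= skb_shinfo(skb)->nr_frags + 1) {
		/* Another CPU may have just filled the ring and stopped the
		 * queue; with LLTX this is not a hard error. */
		netif_stop_queue(dev);
		spin_unlock_irqrestore(&priv->tx_lock, flags);
		return NETDEV_TX_BUSY;
	}

	/* ... post skb to the hardware ring, decrement tx_free,
	 *     and netif_stop_queue() if the ring is now full ... */

	spin_unlock_irqrestore(&priv->tx_lock, flags);
	return NETDEV_TX_OK;
}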
From mshefty at ichips.intel.com Fri Dec 17 17:23:21 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 17 Dec 2004 17:23:21 -0800 Subject: [openib-general] [PATCH] CM listen update Message-ID: <20041217172321.4b9026bb.mshefty@ichips.intel.com> The following patch implements a call to ib_cm_listen. Listen requests are stored using a red-black tree. A service mask was also added to the listen request. The changes only affect the CM module. - Sean Index: include/ib_cm.h =================================================================== --- include/ib_cm.h (revision 1359) +++ include/ib_cm.h (working copy) @@ -93,6 +93,7 @@ ib_cm_handler cm_handler; void *context; u64 service_id; + u64 service_mask; enum ib_cm_state state; }; @@ -125,7 +126,8 @@ * @service_id: Service identifier matched against incoming connection * and service ID resolution requests. * @service_mask: Mask applied to service ID used to listen across a - * range of service IDs. + * range of service IDs. If set to 0, the service ID is matched + * exactly. */ int ib_cm_listen(struct ib_cm_id *cm_id, u64 service_id, @@ -136,7 +138,6 @@ struct ib_path_record *primary_path; struct ib_path_record *alternate_path; u64 service_id; - int timeout_ms; void *private_data; u8 private_data_len; u8 peer_to_peer; Index: core/cm.c =================================================================== --- core/cm.c (revision 1359) +++ core/cm.c (working copy) @@ -34,6 +34,7 @@ * $Id$ */ #include +#include #include #include @@ -51,6 +52,11 @@ .remove = cm_remove_one }; +static struct ib_cm { + spinlock_t lock; + struct rb_root service_table; +} cm; + struct cm_port { struct ib_mad_agent *mad_agent; }; @@ -58,11 +64,51 @@ struct ib_cm_id_private { struct ib_cm_id id; + struct rb_node node; spinlock_t lock; wait_queue_head_t wait; atomic_t refcount; }; +static struct ib_cm_id_private *find_cm_service(u64 service_id) +{ + struct rb_node *node = cm.service_table.rb_node; + struct ib_cm_id_private *cm_id_priv; + + while (node) { + cm_id_priv = rb_entry(node, struct ib_cm_id_private, node); + if ((cm_id_priv->id.service_mask & service_id) == + (cm_id_priv->id.service_mask & cm_id_priv->id.service_id)) + return cm_id_priv; + + if (service_id < cm_id_priv->id.service_id) + node = node->rb_left; + else + node = node->rb_right; + } + return NULL; +} + +static void insert_cm_service(struct ib_cm_id_private *cm_id_priv) +{ + struct rb_node **link = &cm.service_table.rb_node; + struct rb_node *parent = NULL; + struct ib_cm_id_private *cur_cm_id_priv; + u64 service_id = cm_id_priv->id.service_id; + + while (*link) { + parent = *link; + cur_cm_id_priv = rb_entry(parent, struct ib_cm_id_private, + node); + if (service_id < cur_cm_id_priv->id.service_id) + link = &(*link)->rb_left; + else + link = &(*link)->rb_right; + } + rb_link_node(&cm_id_priv->node, parent, link); + rb_insert_color(&cm_id_priv->node, &cm.service_table); +} + struct ib_cm_id *ib_create_cm_id(ib_cm_handler cm_handler, void *context) { @@ -72,7 +118,7 @@ if (!cm_id_priv) return ERR_PTR(-ENOMEM); - cm_id_priv->id.service_id = 0; + memset(cm_id_priv, 0, sizeof *cm_id_priv); cm_id_priv->id.state = IB_CM_IDLE; cm_id_priv->id.cm_handler = cm_handler; cm_id_priv->id.context = context; @@ -95,16 +141,21 @@ int ib_destroy_cm_id(struct ib_cm_id *cm_id) { struct ib_cm_id_private *cm_id_priv; - unsigned long flags; + unsigned long flags, flags2; cm_id_priv = container_of(cm_id, struct ib_cm_id_private, id); spin_lock_irqsave(&cm_id_priv->lock, flags); switch(cm_id->state) { - case IB_CM_IDLE: 
case IB_CM_LISTEN: + spin_lock_irqsave(&cm.lock, flags2); + rb_erase(&cm_id_priv->node, &cm.service_table); + spin_lock_irqrestore(&cm.lock, flags2); + break; + case IB_CM_IDLE: + break; case IB_CM_TIMEWAIT: - break; /* Connection is ready to be destroyed. */ + break; default: reset_cm_state(cm_id_priv); break; @@ -113,8 +164,7 @@ spin_unlock_irqrestore(&cm_id_priv->lock, flags); atomic_dec(&cm_id_priv->refcount); - wait_event(cm_id_priv->wait, - !atomic_read(&cm_id_priv->refcount)); + wait_event(cm_id_priv->wait, !atomic_read(&cm_id_priv->refcount)); kfree(cm_id_priv); return 0; } @@ -124,7 +174,33 @@ u64 service_id, u64 service_mask) { - return -EINVAL; + struct ib_cm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct ib_cm_id_private, id); + spin_lock_irqsave(&cm_id_priv->lock, flags); + if (cm_id->state != IB_CM_IDLE) { + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = -EINVAL; + goto out; + } + cm_id->state = IB_CM_LISTEN; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + cm_id->service_id = service_id; + cm_id->service_mask = service_mask ? service_mask : ~0ULL; + + spin_lock_irqsave(&cm.lock, flags); + if (find_cm_service(service_id)) { + /* No one else is able to change the cm_id state. */ + cm_id->state = IB_CM_IDLE; + ret = -EBUSY; + } else + insert_cm_service(cm_id_priv); + spin_unlock_irqrestore(&cm.lock, flags); +out: + return ret; } EXPORT_SYMBOL(ib_cm_listen); @@ -291,6 +367,9 @@ static int __init ib_cm_init(void) { + memset(&cm, 0, sizeof cm); + spin_lock_init(&cm.lock); + cm.service_table = RB_ROOT; return ib_register_client(&cm_client); } From mshefty at ichips.intel.com Fri Dec 17 17:27:30 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 17 Dec 2004 17:27:30 -0800 Subject: [openib-general] [PATCH] new test code for ib_cm_listen Message-ID: <20041217172730.0f6e2fdc.mshefty@ichips.intel.com> This patch adds basic test code for ib_cm_listen. It tests listen insertions, removal, duplicate insertions, component mask usage, and all error handling cases. - Sean Index: util/cmpost/cmpost.c =================================================================== --- util/cmpost/cmpost.c (revision 1352) +++ util/cmpost/cmpost.c (working copy) @@ -55,14 +55,69 @@ struct ib_cm_id *conn_cm_id; }; +/* Last test number used: 8 */ +#define VERIFY_RESULT(test, result, value) \ + printk("cmpost test %d : ", test); \ + if (result == value) { \ + printk("passed\n"); \ + } else { \ + printk("FAILED! 
result = %d\n", result); \ + return result; \ + } + void cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) { } +static int test_listen() +{ + struct ib_cm_id * listen_ids[3]; + int i, ret; + + for (i = 0; i < 3; i++) { + listen_ids[i] = ib_create_cm_id(cm_handler, NULL); + if (IS_ERR(listen_ids[i])) + return PTR_ERR(listen_ids[i]); + } + + ret = ib_cm_listen(listen_ids[0], 0x1000, 0); + VERIFY_RESULT(1, ret, 0); + ret = ib_cm_listen(listen_ids[0], 0x1000, 0); + VERIFY_RESULT(2, ret, -EINVAL); + + ret = ib_cm_listen(listen_ids[1], 0x1001, 0); + VERIFY_RESULT(3, ret, 0); + ret = ib_cm_listen(listen_ids[2], 0x1000, 0); + VERIFY_RESULT(4, ret, -EBUSY); + ret = ib_cm_listen(listen_ids[2], 0x1001, 0xF00F); + VERIFY_RESULT(5, ret, -EBUSY); + ret = ib_cm_listen(listen_ids[2], 0x1100, 0x1100); + VERIFY_RESULT(6, ret, 0); + + ib_destroy_cm_id(listen_ids[0]); + listen_ids[0] = ib_create_cm_id(cm_handler, NULL); + if (IS_ERR(listen_ids[0])) + return PTR_ERR(listen_ids[0]); + ret = ib_cm_listen(listen_ids[0], 0x1011, 0); + VERIFY_RESULT(7, ret, 0); + ib_destroy_cm_id(listen_ids[0]); + listen_ids[0] = ib_create_cm_id(cm_handler, NULL); + if (IS_ERR(listen_ids[0])) + return PTR_ERR(listen_ids[0]); + ret = ib_cm_listen(listen_ids[0], 0x1000, 0); + VERIFY_RESULT(8, ret, 0); + + for (i = 0; i < 3; i++) { + if (listen_ids[i] && !IS_ERR(listen_ids[i])) + ib_destroy_cm_id(listen_ids[i]); + } + return 0; +} + static void cmpost_add_one(struct ib_device *device) { struct cmpost_port *port; - int i; + int i, ret; port = kmalloc(sizeof *port * device->phys_port_cnt, GFP_KERNEL); if (!port) @@ -73,6 +128,11 @@ port[i].conn_cm_id = ib_create_cm_id(cm_handler, &port[i]); } + ret = test_listen(); + if (ret) { + printk("FAILURE: test_listen failed: %d\n", ret); + goto out; + } out: ib_set_client_data(device, &cmpost_client, port); } From robert.j.woodruff at intel.com Fri Dec 17 17:45:19 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 17 Dec 2004 17:45:19 -0800 Subject: [openib-general] Kernel assertion Message-ID: <1AC79F16F5C5284499BB9591B33D6F000315CC7C@orsmsx408> >Roland wrote, >OK, I checked in a fix for this. I'm actually not sure if it was >really a bug in IPoIB or a glitch in the network stack (see my recent >post to netdev, cc'ed to openib-general, about 'LLTX and >netif_stop_queue') but in any case I made the messages go away in my >setup. >By the way, it might be interesting to see if increasing >IPOIB_RX_RING_SIZE and/or IPOIB_TX_RING_SIZE in ipoib.h had any effect >on your performance (make sure to keep them powers of 2). >Thanks, > Roland Ok this seems to fix the assertion and I am now able to run a 4 node cluster OK. I have 2 PCI-E HCA nodes and 2 PCI-X HCA nodes. I still had to modify IPOIB_NUM_WC = 4, to IPOIB_NUM_WC = 1, and with that change it seems to run fine with the 4.6-0-rc4 firmware. I still cannot seen to get it to run reliably with the 4.3.5 firmware, so that is why I am running PCI-X cards in the other 2 nodes. Guess I need to investigate how to use tvflash to upgrade all the nodes to 4.6.0-rc4. I will let it run over the week-end running MPI jobs to test the stability, but it has been running for about 1 hour without any problems. I also played around with the ring buffer sizes, I tried the default 64-128, 128-256, and 256-512. It really did not seem to make much difference in performance (on MPI Pallas benchmark). See attached logs, mpi.64-128.log, mpi.128-256.log, and mpi.256-512.log for the numbers running across the PCI-E HCAs. 
I also made a run between the 2 PCI cards. The PCI-E cards look a lot better in performance, but I am not sure if the PCI slots are PCI-X 133 or not, so that could explain why it is so much slower on the PCI-X cards. woody -------------- next part -------------- A non-text attachment was scrubbed... Name: mpi-256-512.log Type: application/octet-stream Size: 21692 bytes Desc: mpi-256-512.log URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mpi-64-128.log Type: application/octet-stream Size: 21692 bytes Desc: mpi-64-128.log URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mpi-64-128-pci.log Type: application/octet-stream Size: 21691 bytes Desc: mpi-64-128-pci.log URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mpi-128-256.log Type: application/octet-stream Size: 21692 bytes Desc: mpi-128-256.log URL: From halr at voltaire.com Fri Dec 17 21:21:45 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2004 00:21:45 -0500 Subject: [openib-general] Re: IPoIB Path Static Rate In-Reply-To: <527jnhrfaj.fsf@topspin.com> References: <1103230507.4327.224.camel@localhost.localdomain> <527jnhrfaj.fsf@topspin.com> Message-ID: <1103347112.4214.795.camel@localhost.localdomain> On Thu, 2004-12-16 at 18:31, Roland Dreier wrote: > Ultimately we probably will want to do static rate correctly, but I > decided to punt on this for now because it's not trivial to determine > the correct IPD to use, and IPoIB doesn't really inject packets at > rates beyond 1X anyway. From Woody's MPI data (logs in http://openib.org/pipermail/openib-general/2004-December/007167.html), there appear to be some numbers above 1x to me. 1x is 2.5 Gbps, which minus signalling overhead is really a 2 Gbps data rate. There are numbers in the log in excess of 350 MBytes/sec which is 2.8 Gbps. -- Hal From davem at davemloft.net Fri Dec 17 21:44:32 2004 From: davem at davemloft.net (David S. Miller) Date: Fri, 17 Dec 2004 21:44:32 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <52llbwoaej.fsf@topspin.com> References: <52llbwoaej.fsf@topspin.com> Message-ID: <20041217214432.07b7b21e.davem@davemloft.net> On Fri, 17 Dec 2004 13:57:40 -0800 Roland Dreier wrote: > While testing my IP-over-InfiniBand driver, I discovered that if a net > device sets NETIF_F_LLTX, it seems the device's hard_start_xmit method > can be called even after a netif_stop_queue(). > > This is because in the LLTX case, qdisc_restart() holds no locks while > calling hard_start_xmit, so something like the following can happen: ... > if (TX_BUFFS_AVAIL(gp) <= (skb_shinfo(skb)->nr_frags + 1)) { > netif_stop_queue(dev); > spin_unlock_irqrestore(&gp->tx_lock, flags); > - printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", > - dev->name); > return NETDEV_TX_BUSY; > } I understand the bug, but we're not going to fix it this way. This is a crucial invariant that we need to check for because it indicates a pretty serious state error except in this bug case you've discovered. Perhaps one way to fix this is to add a pointer to a spinlock to the netdev struct, and have the upper level grab that when NETIF_F_LLTX is set when doing queue state checks. Actually, that could end up being racy too. If we can't find a good fix for this, besides removing the necessary debugging message, we might have to pull NETIF_F_LLTX out or disable it temporarily until we figure out a way.
From roland at topspin.com Sat Dec 18 07:35:49 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 18 Dec 2004 07:35:49 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <20041217214432.07b7b21e.davem@davemloft.net> (David S. Miller's message of "Fri, 17 Dec 2004 21:44:32 -0800") References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> Message-ID: <528y7vobze.fsf@topspin.com> David> I understand the bug, but we're not going to fix it this David> way. This is a crucial invariant that we need to check for David> because it indicates a pretty serious state error except in David> this bug case you've discovered. OK, that's fair enough. I was pretty bummed myself when I realized hard_start_xmit was getting called even after I stopped the queue. Thanks for confirming my analysis. David> Perhaps one way to fix this is to add a pointer to a David> spinlock to the netdev struct, and have hold that the upper David> level grab that when NETIF_F_LLTX when doing queue state David> checks. Actually, that could end up being racy too. I may be missing something, but it seems to me that we get all of the benefits of LLTX by just documenting that device drivers can use the xmit_lock member of struct net_device to synchronize other parts of the driver against hard_start_xmit. I guess the driver also should set xmit_lock_owner to -1 after it acquires xmit_lock. This would mean that NETIF_F_LLTX can go away, and LLTX drivers would just replace their use of their own private tx_lock with xmit_lock (except in hard_start_xmit, where the upper layer already holds xmit_lock). By the way, am I correct in thinking that the use of xmit_lock_owner in qdisc_restart() is racy? if (!spin_trylock(&dev->xmit_lock)) { /* get the lock */ if (!spin_trylock(&dev->xmit_lock)) { /* fail */ if (dev->xmit_lock_owner == smp_processor_id()) { /* test the wrong value */ /* set the value too late */ dev->xmit_lock_owner = smp_processor_id(); I don't see a simple way to fix this unfortunately. Thanks, Roland From roland at topspin.com Sat Dec 18 07:46:56 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 18 Dec 2004 07:46:56 -0800 Subject: [openib-general] Re: IPoIB Path Static Rate In-Reply-To: <1103347112.4214.795.camel@localhost.localdomain> (Hal Rosenstock's message of "18 Dec 2004 00:21:45 -0500") References: <1103230507.4327.224.camel@localhost.localdomain> <527jnhrfaj.fsf@topspin.com> <1103347112.4214.795.camel@localhost.localdomain> Message-ID: <521xdnobgv.fsf@topspin.com> Hal> From Woody's MPI data (logs in Hal> http://openib.org/pipermail/openib-general/2004-December/007167.html), Hal> there appear to be some numbers above 1x to me. 1x is 2.5 Hal> Gbps minus signalling is really a 2 Gbps data rate. There are Hal> numbers in the log in excess of 350 MBytes/sec which is 2.6 Hal> Gbps. Fair enough but I still don't think there are many fabrics right now where not setting static rate for IPoIB will even be required, let alone have a practical effect. I don't object in principle to doing this right, it just seems like a fair bit of work to write code to retrieve the local PortInfo to get the active link speed and link width so that we can calculate our local rate and get the right IPD. To solve this, do you think it makes sense to add the active link speed and link width to struct ib_port_attr so that it's easy for ULPs to get them? Otherwise ULPs would have to do their own PortInfo queries. 
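As a rough illustration of the calculation being punted on: the static rate mechanism works by programming an inter-packet delay (IPD) into the address handle so that the effective injection rate is about link_rate / (IPD + 1). Assuming the active width/speed were available as suggested, a ULP could derive the IPD with something like the helper below; the function name and the "multiples of 1X" encoding are made up for illustration and are not part of ib_port_attr or the verbs.

/* Illustrative only: rates are expressed as multiples of a 1X SDR link
 * (1 = 2.5 Gbps signalling, 4 = 10 Gbps, 12 = 30 Gbps). */
static inline int example_rate_to_ipd(int local_link_rate, int path_static_rate)
{
	if (path_static_rate <= 0 || local_link_rate <= path_static_rate)
		return 0;			/* no throttling needed */

	/* effective rate ~= local_link_rate / (IPD + 1) */
	return local_link_rate / path_static_rate - 1;
}

/* e.g. a 4X local port sending across a 1X path: example_rate_to_ipd(4, 1) == 3,
 * i.e. inject at most one packet in every four "slots". */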
- Roland From halr at voltaire.com Sat Dec 18 08:42:07 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2004 11:42:07 -0500 Subject: [openib-general] Re: IPoIB Path Static Rate In-Reply-To: <521xdnobgv.fsf@topspin.com> References: <1103230507.4327.224.camel@localhost.localdomain> <527jnhrfaj.fsf@topspin.com> <1103347112.4214.795.camel@localhost.localdomain> <521xdnobgv.fsf@topspin.com> Message-ID: <1103388127.4214.1218.camel@localhost.localdomain> On Sat, 2004-12-18 at 10:46, Roland Dreier wrote: > Hal> From Woody's MPI data (logs in > Hal> http://openib.org/pipermail/openib-general/2004-December/007167.html), > Hal> there appear to be some numbers above 1x to me. 1x is 2.5 > Hal> Gbps minus signalling is really a 2 Gbps data rate. There are > Hal> numbers in the log in excess of 350 MBytes/sec which is 2.6 > Hal> Gbps. > > Fair enough but I still don't think there are many fabrics right now > where not setting static rate for IPoIB will even be required, let > alone have a practical effect. Agreed. > I don't object in principle to doing this right, it just seems like a > fair bit of work to write code to retrieve the local PortInfo to get > the active link speed and link width so that we can calculate our > local rate and get the right IPD. Link speed active (other than locked at 2.5 Gbps) was added at IBA 1.2. > To solve this, do you think it makes sense to add the active link > speed and link width to struct ib_port_attr so that it's easy for ULPs > to get them? Otherwise ULPs would have to do their own PortInfo queries. That makes sense to me. There's one catch: since link width/speed active can change under the covers, this info cannot be cached by the application or there needs to be a tie in to an event for either of these (1.2) or link width active (1.1) changing. -- Hal From roland at topspin.com Sat Dec 18 09:55:48 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 18 Dec 2004 09:55:48 -0800 Subject: [openib-general] Re: IPoIB Path Static Rate In-Reply-To: <1103388127.4214.1218.camel@localhost.localdomain> (Hal Rosenstock's message of "18 Dec 2004 11:42:07 -0500") References: <1103230507.4327.224.camel@localhost.localdomain> <527jnhrfaj.fsf@topspin.com> <1103347112.4214.795.camel@localhost.localdomain> <521xdnobgv.fsf@topspin.com> <1103388127.4214.1218.camel@localhost.localdomain> Message-ID: <52wtvfmqxn.fsf@topspin.com> Hal> That makes sense to me. There's one catch: since link Hal> width/speed active can change under the covers, this info Hal> cannot be cached by the application or there needs to be a Hal> tie in to an event for either of these (1.2) or link width Hal> active (1.1) changing. Surely link width and/or speed can't change without the port state changing, can they? As I understand it, the link layer can't renegotiate this sort of thing without bringing the link down. In which case ULPs only need to refresh their rate information when they get one of the existing port state change events. (If the link rate can change without any notice, then static rate for RC is meaningless since a link could change from 4X to 1X without disturbing existing connections). 
- Roland From roland at topspin.com Sat Dec 18 09:58:15 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 18 Dec 2004 09:58:15 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <528y7vobze.fsf@topspin.com> (Roland Dreier's message of "Sat, 18 Dec 2004 07:35:49 -0800") References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <528y7vobze.fsf@topspin.com> Message-ID: <52sm63mqtk.fsf@topspin.com> Roland> I may be missing something, but it seems to me that we get Roland> all of the benefits of LLTX by just documenting that Roland> device drivers can use the xmit_lock member of struct Roland> net_device to synchronize other parts of the driver Roland> against hard_start_xmit. I guess the driver also should Roland> set xmit_lock_owner to -1 after it acquires xmit_lock. Thinking about this a little more, I realize that there's no reason for the driver to set xmit_lock_owner -- if the driver is able to acquire the lock, then xmit_lock_owner will already be -1. So it seems LLTX can be replaced by just having drivers use net_device.xmit_lock instead of their own private tx_lock. Assuming this works (and I don't see anything wrong with it) then this seems like a pretty nice solution: we remove some code from the networking core and get rid of all the "trylock" logic in driver's hard_start_xmit. - Roland From halr at voltaire.com Sat Dec 18 09:58:44 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2004 12:58:44 -0500 Subject: [openib-general] Re: IPoIB Path Static Rate In-Reply-To: <52wtvfmqxn.fsf@topspin.com> References: <1103230507.4327.224.camel@localhost.localdomain> <527jnhrfaj.fsf@topspin.com> <1103347112.4214.795.camel@localhost.localdomain> <521xdnobgv.fsf@topspin.com> <1103388127.4214.1218.camel@localhost.localdomain> <52wtvfmqxn.fsf@topspin.com> Message-ID: <1103392724.4214.1221.camel@localhost.localdomain> On Sat, 2004-12-18 at 12:55, Roland Dreier wrote: > Surely link width and/or speed can't change without the port state > changing, can they? As I understand it, the link layer can't > renegotiate this sort of thing without bringing the link down. In > which case ULPs only need to refresh their rate information when they > get one of the existing port state change events. Yes, port state change is sufficient to take care of this. > (If the link rate can change without any notice, then static rate for > RC is meaningless since a link could change from 4X to 1X without > disturbing existing connections). -- Hal From roland at topspin.com Sat Dec 18 10:26:40 2004 From: roland at topspin.com (Roland Dreier) Date: Sat, 18 Dec 2004 10:26:40 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <52sm63mqtk.fsf@topspin.com> (Roland Dreier's message of "Sat, 18 Dec 2004 09:58:15 -0800") References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <528y7vobze.fsf@topspin.com> <52sm63mqtk.fsf@topspin.com> Message-ID: <52oegrmpi7.fsf@topspin.com> Roland> So it seems LLTX can be replaced by just having drivers Roland> use net_device.xmit_lock instead of their own private Roland> tx_lock. Assuming this works (and I don't see anything Roland> wrong with it) then this seems like a pretty nice Roland> solution: we remove some code from the networking core and Roland> get rid of all the "trylock" logic in driver's Roland> hard_start_xmit. Actually trying it instead of talking out of my ass... 
Just doing this naively without changing the net core can deadlock because the net core acquires dev->xmit_lock without disabling interrupts. So if the driver tries to use xmit_lock in its interrupt handler, it will deadlock if the interrupt occurs during hard_start_xmit. Even doing local_irq_save() in hard_start_xmit isn't good enough, because there's still a window between the net core's call to hard_start_xmit and the actual local_irq_save where xmit_lock is held with interrupts on. Maybe it makes sense to change NETIF_F_LLTX to NETIF_F_TX_IRQ_DIS or something like that and have that flag mean "disable interrupts when calling hard_start_xmit." (We could just do this unconditionally, but I'm not sure if any drivers rely on having interrupts enabled during hard_start_xmit and I'm worried about making a change in semantics like that -- not to mention some drivers may not need interrupts disabled and may not want the cost). - Roland From shaharf at voltaire.com Sun Dec 19 10:26:16 2004 From: shaharf at voltaire.com (shaharf) Date: Sun, 19 Dec 2004 20:26:16 +0200 Subject: [openib-general] osm Message-ID: Hi all, I took the liberty to commit the user-mode tree even though it is far from complete or stable. However, the tree contains several applications and utilities that may help all of us, even at this preliminary stage. I renamed the 1362 osm tree to be osm.old and I started a new osm tree that is not based on the old, rather it is based on gen1 osm tree. I also changed the tree structure. I have no intention to stick to the old and very ugly osm tree structure (that was inherited from the old ibal tree). I also rewrote all Makefiles. The new tree is (rev 1366): Userspace: Makefile make.rules make.inc include libcommon/* libumad/* util/ mad_test/* diags/ host ibstat scripts net smpdump osm/ complib include complib iba opensm To make this tree, please edit make.inc and set OPENIB_ROOT to the .../trunk/src dir. Then do make && make install Libraries will be installed at $OPENIB_ROOT/lib and binaries at $OPENIB_ROOT/bin. If you don't like that, change it in make.inc. Hopefully everything will compile and install, and then you can run opensm with $OPENIB_ROOT/bin/opensm & It will run on the first existing port. Please verify that it is active. You can override that by using "-g portguid_in_hex". In any case of problems where you want me to assist, please run opensm with -V and send me the log file (/tmp/osm.log). Other utilities: Ibstat - shows host adapter status Ibstatus - almost the same, but implemented as a script Mad_test - simple umad interface test program based on Roland's test program (ported to C) Smpdump - simple solicited SMP query tool. Shows its output as a hex dump (unless requested otherwise, e.g. using -s). Don't forget to modprobe ib_umad and configure udev before using the userspace stuff. Shahar -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Sun Dec 19 11:10:54 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Dec 2004 14:10:54 -0500 Subject: [openib-general] osm In-Reply-To: References: Message-ID: <1103483454.4214.1248.camel@localhost.localdomain> On Sun, 2004-12-19 at 13:26, shaharf wrote: > I took the liberty to commit the user-mode tree even > though it is far from complete or stable. Great. Continue to inform us as different tools become stable. > I renamed the 1362 osm tree to be osm.old and I started a > new osm tree that is not based on the old, rather it is based on gen1 > osm tree.
I also changed the tree structure. I have no intension to > stick to the old and very ugly osm tree structure (that was inherited > from the old ibal tree). Does anyone think we need to keep osm.old in the gen2 tree ? > I also rewrote all Makefiles. A subsequent step will be to do the build using autotools. A lot of what you wrote should end up in a readme or FAQ. -- Hal From hadi at cyberus.ca Sun Dec 19 11:31:15 2004 From: hadi at cyberus.ca (jamal) Date: 19 Dec 2004 14:31:15 -0500 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <20041217214432.07b7b21e.davem@davemloft.net> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> Message-ID: <1103484675.1050.158.camel@jzny.localdomain> On Sat, 2004-12-18 at 00:44, David S. Miller wrote: > Perhaps one way to fix this is to add a pointer to a spinlock to > the netdev struct, and have hold that the upper level grab that > when NETIF_F_LLTX when doing queue state checks. Actually, that > could end up being racy too. How about releasing the qlock only when the LLTX transmit lock is grabbed? That should bring it to par with what it was originally. cheers, jamal From hadi at cyberus.ca Sun Dec 19 11:33:55 2004 From: hadi at cyberus.ca (jamal) Date: 19 Dec 2004 14:33:55 -0500 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <528y7vobze.fsf@topspin.com> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <528y7vobze.fsf@topspin.com> Message-ID: <1103484835.1050.160.camel@jzny.localdomain> On Sat, 2004-12-18 at 10:35, Roland Dreier wrote: > By the way, am I correct in thinking that the use of xmit_lock_owner > in qdisc_restart() is racy? No. > if (!spin_trylock(&dev->xmit_lock)) { > /* get the lock */ > > if (!spin_trylock(&dev->xmit_lock)) { > /* fail */ > if (dev->xmit_lock_owner == smp_processor_id()) { > /* test the wrong value */ > > /* set the value too late */ > dev->xmit_lock_owner = smp_processor_id(); > The setting is protected by the queue lock. So no race. cheers, jamal From hadi at cyberus.ca Sun Dec 19 11:54:51 2004 From: hadi at cyberus.ca (jamal) Date: 19 Dec 2004 14:54:51 -0500 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <1103484675.1050.158.camel@jzny.localdomain> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> Message-ID: <1103486090.1047.166.camel@jzny.localdomain> On Sun, 2004-12-19 at 14:31, jamal wrote: > How about releasing the qlock only when the LLTX transmit lock is > grabbed? That should bring it to par with what it was originally. Something like two attached patches... one showing how to do it for e1000. untested (not even compiled) cheers, jamal -------------- next part -------------- --- a/net/sched/bak.sch_generic.c 2004-12-19 13:46:19.799356432 -0500 +++ b/net/sched/sch_generic.c 2004-12-19 13:49:14.384815408 -0500 @@ -128,12 +128,11 @@ } /* Remember that the driver is grabbed by us. 
*/ dev->xmit_lock_owner = smp_processor_id(); + /* And release queue */ + spin_unlock(&dev->queue_lock); } { - /* And release queue */ - spin_unlock(&dev->queue_lock); - if (!netif_queue_stopped(dev)) { int ret; if (netdev_nit) @@ -141,15 +140,14 @@ ret = dev->hard_start_xmit(skb, dev); if (ret == NETDEV_TX_OK) { + dev->xmit_lock_owner = -1; if (!nolock) { - dev->xmit_lock_owner = -1; spin_unlock(&dev->xmit_lock); } spin_lock(&dev->queue_lock); return -1; } if (ret == NETDEV_TX_LOCKED && nolock) { - spin_lock(&dev->queue_lock); goto collision; } } -------------- next part -------------- --- a/drivers/net/e1000/b.e1000_main.c 2004-12-19 13:50:59.481838232 -0500 +++ b/drivers/net/e1000/e1000_main.c 2004-12-19 13:53:34.326298296 -0500 @@ -1809,6 +1809,10 @@ return NETDEV_TX_LOCKED; } + /* new from sch_generic for LLTX */ + spin_unlock(&dev->queue_lock); + dev->xmit_lock_owner = smp_processor_id(); + /* need: count + 2 desc gap to keep tail from touching * head, otherwise try next time */ if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) { From hadi at znyx.com Sun Dec 19 12:02:24 2004 From: hadi at znyx.com (Jamal Hadi Salim) Date: 19 Dec 2004 15:02:24 -0500 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <1103486090.1047.166.camel@jzny.localdomain> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <1103486090.1047.166.camel@jzny.localdomain> Message-ID: <1103486544.1049.169.camel@jzny.localdomain> On Sun, 2004-12-19 at 14:54, jamal wrote: > On Sun, 2004-12-19 at 14:31, jamal wrote: > > > How about releasing the qlock only when the LLTX transmit lock is > > grabbed? That should bring it to par with what it was originally. > > Something like two attached patches... one showing how to do it for > e1000. untested (not even compiled) After attempting to compile, p2 take2. cheers, jamal -------------- next part -------------- --- 2610-rc3-bk12/drivers/net/e1000/bak.e1000_main.c 2004-12-19 13:50:59.000000000 -0500 +++ 2610-rc3-bk12/drivers/net/e1000/e1000_main.c 2004-12-19 13:58:17.000000000 -0500 @@ -1809,6 +1809,10 @@ return NETDEV_TX_LOCKED; } + /* new from sch_generic for LLTX */ + spin_unlock(&netdev->queue_lock); + netdev->xmit_lock_owner = smp_processor_id(); + /* need: count + 2 desc gap to keep tail from touching * head, otherwise try next time */ if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) { From roland at topspin.com Sun Dec 19 14:35:26 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 14:35:26 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <1103484675.1050.158.camel@jzny.localdomain> (jamal's message of "19 Dec 2004 14:31:15 -0500") References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> Message-ID: <52fz21ncgh.fsf@topspin.com> jamal> How about releasing the qlock only when the LLTX transmit jamal> lock is grabbed? That should bring it to par with what it jamal> was originally. This seems a little risky. I can't point to a specific deadlock but it doesn't seem right on general principles to unlock in a different order than you nested the locks when acquiring them -- if I understand correctly, you're suggesting lock(queue_lock), lock(tx_lock), unlock(queue_lock), unlock(tx_lock). 
Thanks, Roland From hadi at cyberus.ca Sun Dec 19 15:06:32 2004 From: hadi at cyberus.ca (jamal) Date: 19 Dec 2004 18:06:32 -0500 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <52fz21ncgh.fsf@topspin.com> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <52fz21ncgh.fsf@topspin.com> Message-ID: <1103497592.1046.235.camel@jzny.localdomain> On Sun, 2004-12-19 at 17:35, Roland Dreier wrote: > jamal> How about releasing the qlock only when the LLTX transmit > jamal> lock is grabbed? That should bring it to par with what it > jamal> was originally. > > This seems a little risky. I can't point to a specific deadlock but > it doesn't seem right on general principles to unlock in a different > order than you nested the locks when acquiring them -- if I understand > correctly, you're suggesting lock(queue_lock), lock(tx_lock), > unlock(queue_lock), unlock(tx_lock). There is no deadlock. That's exactly how things work. Try the patches I posted. cheers, jamal From roland at topspin.com Sun Dec 19 15:11:29 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 15:11:29 -0800 Subject: [openib-general] osm In-Reply-To: (shaharf@voltaire.com's message of "Sun, 19 Dec 2004 20:26:16 +0200") References: Message-ID: <52brcpnase.fsf@topspin.com> shaharf> Hi all, I took the liberty to commit the user-mode tree shaharf> even though it is far from complete or stable. However, shaharf> the tree contains several applications and utilities that shaharf> may help all of us, even at this preliminary stage. That's great; better to commit early and get feedback than to work privately and end up going in the wrong direction. A couple of quick suggestions: - Add copyright/license headers to all your source files - Use GNU autotools instead of writing your own Makefiles - Roland From roland at topspin.com Sun Dec 19 15:16:58 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 15:16:58 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <1103497592.1046.235.camel@jzny.localdomain> (jamal's message of "19 Dec 2004 18:06:32 -0500") References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <52fz21ncgh.fsf@topspin.com> <1103497592.1046.235.camel@jzny.localdomain> Message-ID: <527jndnaj9.fsf@topspin.com> Roland> This seems a little risky. I can't point to a specific Roland> deadlock but it doesn't seem right on general principles Roland> to unlock in a different order than you nested the locks Roland> when acquiring them -- if I understand correctly, you're Roland> suggesting lock(queue_lock), lock(tx_lock), Roland> unlock(queue_lock), unlock(tx_lock). jamal> There is no deadlock. That's exactly how things work. Try jamal> the patches I posted. Thinking about it more, I guess it's OK. I was just thinking about the basic general rule that locks need to be acquired/released in LIFO order to avoid deadlock (e.g. lock(A), lock(B), unlock(B), unlock(A)). However in this case unlocking queue_lock after acquiring the driver's tx_lock seems to be OK because the driver does a trylock on tx_lock in the xmit path, so it can't deadlock. At worst the trylock will just fail to get the lock.
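To make the trylock argument concrete, here is a small standalone sketch (plain userspace pthreads, not the actual net core or driver code; the queue_lock/tx_lock names just mirror the discussion) of the ordering in question: take queue_lock, trylock tx_lock, drop queue_lock while still holding tx_lock, transmit, then drop tx_lock. Because the inner lock is only ever taken with a trylock, the non-LIFO release order cannot deadlock; a contended trylock simply fails and the caller backs off with queue_lock still held.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t tx_lock    = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 if a packet was "transmitted", -1 on a tx_lock collision. */
static int xmit_one(const char *who)
{
	pthread_mutex_lock(&queue_lock);	/* dequeue under queue_lock */

	if (pthread_mutex_trylock(&tx_lock)) {
		/* collision: someone else is in the transmit path;
		 * back off (the real code would requeue and retry) */
		pthread_mutex_unlock(&queue_lock);
		return -1;
	}

	pthread_mutex_unlock(&queue_lock);	/* hand over: keep only tx_lock */
	printf("%s: hard_start_xmit()\n", who);	/* driver transmit stands in here */
	pthread_mutex_unlock(&tx_lock);
	return 0;
}

int main(void)
{
	xmit_one("cpu0");
	xmit_one("cpu1");
	return 0;
}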
Thanks, Roland From roland at topspin.com Sun Dec 19 21:23:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 21:23:14 -0800 Subject: [openib-general] osm In-Reply-To: <52brcpnase.fsf@topspin.com> (Roland Dreier's message of "Sun, 19 Dec 2004 15:11:29 -0800") References: <52brcpnase.fsf@topspin.com> Message-ID: <523by1mtkt.fsf@topspin.com> One more suggestion: consider using libsysfs (which every modern distro will ship) instead reimplementing it in libcommon/sysfs.c. - R. From roland at topspin.com Sun Dec 19 22:14:43 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:43 -0800 Subject: [openib-general] [PATCH][v4][0/24] Second InfiniBand merge candidate patch set Message-ID: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> The following series of patches is the latest version of the OpenIB InfiniBand drivers. We believe that this version is suitable for merging when 2.6.11 opens (or into -mm immediately), although of course we are willing to go through as many more iterations as required to fix any remaining issues. The previous posting last week did not generate many comments. We have fixed all the minor glitches identified, and updated the copyright/license comments in the source files so that no external files or URLs other than the main COPYING file are referenced (the license remains unchanged as dual GPL/BSD). In addition a fair bit of testing has been done and several bugs have been fixed over the past week. The IP-over-InfiniBand driver now appears to work well on a broad range of hardware. Merging these patches should be low-risk: the only non-documentation changes outside of the new drivers/infiniband tree are a small set of networking patches to handle devices of type AF_INFINIBAND where appropriate. These patches have been posted to netdev at oss.sgi.com multiple times. Linus/Andrew - is there any further that you want to see done, or do these patches look acceptable to you? Thanks, Roland Dreier OpenIB Alliance www.openib.org From roland at topspin.com Sun Dec 19 22:14:46 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:46 -0800 Subject: [openib-general] [PATCH][v4][1/24] Add core InfiniBand support (public headers) In-Reply-To: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> Message-ID: <200412192214.PJDxbje135ozyhtX@topspin.com> Add public headers for core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_cache.h 2004-12-19 22:04:11.573993560 -0800 @@ -0,0 +1,53 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef _IB_CACHE_H +#define _IB_CACHE_H + +#include + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid); +int ib_cached_pkey_get(struct ib_device *device_handle, + u8 port, + int index, + u16 *pkey); +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index); + +#endif /* _IB_CACHE_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_fmr_pool.h 2004-12-19 22:04:11.597990024 -0800 @@ -0,0 +1,92 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_fmr_pool.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#if !defined(IB_FMR_POOL_H) +#define IB_FMR_POOL_H + +#include + +struct ib_fmr_pool; + +/** + * struct ib_fmr_pool_param - Parameters for creating FMR pool + * @max_pages_per_fmr:Maximum number of pages per map request. + * @access:Access flags for FMRs in pool. + * @pool_size:Number of FMRs to allocate for pool. + * @dirty_watermark:Flush is triggered when @dirty_watermark dirty + * FMRs are present. + * @flush_function:Callback called when unmapped FMRs are flushed and + * more FMRs are possibly available for mapping + * @flush_arg:Context passed to user's flush function. + * @cache:If set, FMRs may be reused after unmapping for identical map + * requests. 
+ */ +struct ib_fmr_pool_param { + int max_pages_per_fmr; + enum ib_access_flags access; + int pool_size; + int dirty_watermark; + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + unsigned cache:1; +}; + +struct ib_pool_fmr { + struct ib_fmr *fmr; + struct ib_fmr_pool *pool; + struct list_head list; + struct hlist_node cache_node; + int ref_count; + int remap_count; + u64 io_virtual_address; + int page_list_len; + u64 page_list[0]; +}; + +struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params); + +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool); + +int ib_flush_fmr_pool(struct ib_fmr_pool *pool); + +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address); + +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr); + +#endif /* IB_FMR_POOL_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_pack.h 2004-12-19 22:04:11.622986340 -0800 @@ -0,0 +1,245 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_pack.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef IB_PACK_H +#define IB_PACK_H + +#include + +enum { + IB_LRH_BYTES = 8, + IB_GRH_BYTES = 40, + IB_BTH_BYTES = 12, + IB_DETH_BYTES = 8 +}; + +struct ib_field { + size_t struct_offset_bytes; + size_t struct_size_bytes; + int offset_words; + int offset_bits; + int size_bits; + char *field_name; +}; + +#define RESERVED \ + .field_name = "reserved" + +/* + * This macro cleans up the definitions of constants for BTH opcodes. + * It is used to define constants such as IB_OPCODE_UD_SEND_ONLY, + * which becomes IB_OPCODE_UD + IB_OPCODE_SEND_ONLY, and this gives + * the correct value. + * + * In short, user code should use the constants defined using the + * macro rather than worrying about adding together other constants. 
+*/ +#define IB_OPCODE(transport, op) \ + IB_OPCODE_ ## transport ## _ ## op = \ + IB_OPCODE_ ## transport + IB_OPCODE_ ## op + +enum { + /* transport types -- just used to define real constants */ + IB_OPCODE_RC = 0x00, + IB_OPCODE_UC = 0x20, + IB_OPCODE_RD = 0x40, + IB_OPCODE_UD = 0x60, + + /* operations -- just used to define real constants */ + IB_OPCODE_SEND_FIRST = 0x00, + IB_OPCODE_SEND_MIDDLE = 0x01, + IB_OPCODE_SEND_LAST = 0x02, + IB_OPCODE_SEND_LAST_WITH_IMMEDIATE = 0x03, + IB_OPCODE_SEND_ONLY = 0x04, + IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE = 0x05, + IB_OPCODE_RDMA_WRITE_FIRST = 0x06, + IB_OPCODE_RDMA_WRITE_MIDDLE = 0x07, + IB_OPCODE_RDMA_WRITE_LAST = 0x08, + IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE = 0x09, + IB_OPCODE_RDMA_WRITE_ONLY = 0x0a, + IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE = 0x0b, + IB_OPCODE_RDMA_READ_REQUEST = 0x0c, + IB_OPCODE_RDMA_READ_RESPONSE_FIRST = 0x0d, + IB_OPCODE_RDMA_READ_RESPONSE_MIDDLE = 0x0e, + IB_OPCODE_RDMA_READ_RESPONSE_LAST = 0x0f, + IB_OPCODE_RDMA_READ_RESPONSE_ONLY = 0x10, + IB_OPCODE_ACKNOWLEDGE = 0x11, + IB_OPCODE_ATOMIC_ACKNOWLEDGE = 0x12, + IB_OPCODE_COMPARE_SWAP = 0x13, + IB_OPCODE_FETCH_ADD = 0x14, + + /* real constants follow -- see comment about above IB_OPCODE() + macro for more details */ + + /* RC */ + IB_OPCODE(RC, SEND_FIRST), + IB_OPCODE(RC, SEND_MIDDLE), + IB_OPCODE(RC, SEND_LAST), + IB_OPCODE(RC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, SEND_ONLY), + IB_OPCODE(RC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_FIRST), + IB_OPCODE(RC, RDMA_WRITE_MIDDLE), + IB_OPCODE(RC, RDMA_WRITE_LAST), + IB_OPCODE(RC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_ONLY), + IB_OPCODE(RC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_READ_REQUEST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RC, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RC, ACKNOWLEDGE), + IB_OPCODE(RC, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RC, COMPARE_SWAP), + IB_OPCODE(RC, FETCH_ADD), + + /* UC */ + IB_OPCODE(UC, SEND_FIRST), + IB_OPCODE(UC, SEND_MIDDLE), + IB_OPCODE(UC, SEND_LAST), + IB_OPCODE(UC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, SEND_ONLY), + IB_OPCODE(UC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_FIRST), + IB_OPCODE(UC, RDMA_WRITE_MIDDLE), + IB_OPCODE(UC, RDMA_WRITE_LAST), + IB_OPCODE(UC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_ONLY), + IB_OPCODE(UC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + + /* RD */ + IB_OPCODE(RD, SEND_FIRST), + IB_OPCODE(RD, SEND_MIDDLE), + IB_OPCODE(RD, SEND_LAST), + IB_OPCODE(RD, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, SEND_ONLY), + IB_OPCODE(RD, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_FIRST), + IB_OPCODE(RD, RDMA_WRITE_MIDDLE), + IB_OPCODE(RD, RDMA_WRITE_LAST), + IB_OPCODE(RD, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_ONLY), + IB_OPCODE(RD, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_READ_REQUEST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RD, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RD, ACKNOWLEDGE), + IB_OPCODE(RD, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RD, COMPARE_SWAP), + IB_OPCODE(RD, FETCH_ADD), + + /* UD */ + IB_OPCODE(UD, SEND_ONLY), + IB_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE) +}; + +enum { + IB_LNH_RAW = 0, + IB_LNH_IP = 1, + IB_LNH_IBA_LOCAL = 2, + IB_LNH_IBA_GLOBAL = 3 +}; + +struct ib_unpacked_lrh { + u8 virtual_lane; + u8 link_version; + u8 service_level; + u8 
link_next_header; + __be16 destination_lid; + __be16 packet_length; + __be16 source_lid; +}; + +struct ib_unpacked_grh { + u8 ip_version; + u8 traffic_class; + __be32 flow_label; + __be16 payload_length; + u8 next_header; + u8 hop_limit; + union ib_gid source_gid; + union ib_gid destination_gid; +}; + +struct ib_unpacked_bth { + u8 opcode; + u8 solicited_event; + u8 mig_req; + u8 pad_count; + u8 transport_header_version; + __be16 pkey; + __be32 destination_qpn; + u8 ack_req; + __be32 psn; +}; + +struct ib_unpacked_deth { + __be32 qkey; + __be32 source_qpn; +}; + +struct ib_ud_header { + struct ib_unpacked_lrh lrh; + int grh_present; + struct ib_unpacked_grh grh; + struct ib_unpacked_bth bth; + struct ib_unpacked_deth deth; + int immediate_present; + __be32 immediate_data; +}; + +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf); + +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure); + +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header); + +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf); + +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header); + +#endif /* IB_PACK_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_verbs.h 2004-12-19 22:04:11.647982657 -0800 @@ -0,0 +1,1236 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#if !defined(IB_VERBS_H) +#define IB_VERBS_H + +#include +#include +#include + +union ib_gid { + u8 raw[16]; + struct { + u64 subnet_prefix; + u64 interface_id; + } global; +}; + +enum ib_node_type { + IB_NODE_CA = 1, + IB_NODE_SWITCH, + IB_NODE_ROUTER +}; + +enum ib_device_cap_flags { + IB_DEVICE_RESIZE_MAX_WR = 1, + IB_DEVICE_BAD_PKEY_CNTR = (1<<1), + IB_DEVICE_BAD_QKEY_CNTR = (1<<2), + IB_DEVICE_RAW_MULTI = (1<<3), + IB_DEVICE_AUTO_PATH_MIG = (1<<4), + IB_DEVICE_CHANGE_PHY_PORT = (1<<5), + IB_DEVICE_UD_AV_PORT_ENFORCE = (1<<6), + IB_DEVICE_CURR_QP_STATE_MOD = (1<<7), + IB_DEVICE_SHUTDOWN_PORT = (1<<8), + IB_DEVICE_INIT_TYPE = (1<<9), + IB_DEVICE_PORT_ACTIVE_EVENT = (1<<10), + IB_DEVICE_SYS_IMAGE_GUID = (1<<11), + IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), + IB_DEVICE_SRQ_RESIZE = (1<<13), + IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_RQ_SIG_TYPE = (1<<15) +}; + +enum ib_atomic_cap { + IB_ATOMIC_NONE, + IB_ATOMIC_HCA, + IB_ATOMIC_GLOB +}; + +struct ib_device_attr { + u64 fw_ver; + u64 node_guid; + u64 sys_image_guid; + u64 max_mr_size; + u64 page_size_cap; + u32 vendor_id; + u32 vendor_part_id; + u32 hw_ver; + int max_qp; + int max_qp_wr; + int device_cap_flags; + int max_sge; + int max_sge_rd; + int max_cq; + int max_cqe; + int max_mr; + int max_pd; + int max_qp_rd_atom; + int max_ee_rd_atom; + int max_res_rd_atom; + int max_qp_init_rd_atom; + int max_ee_init_rd_atom; + enum ib_atomic_cap atomic_cap; + int max_ee; + int max_rdd; + int max_mw; + int max_raw_ipv6_qp; + int max_raw_ethy_qp; + int max_mcast_grp; + int max_mcast_qp_attach; + int max_total_mcast_qp_attach; + int max_ah; + int max_fmr; + int max_map_per_fmr; + int max_srq; + int max_srq_wr; + int max_srq_sge; + u16 max_pkeys; + u8 local_ca_ack_delay; +}; + +enum ib_mtu { + IB_MTU_256 = 1, + IB_MTU_512 = 2, + IB_MTU_1024 = 3, + IB_MTU_2048 = 4, + IB_MTU_4096 = 5 +}; + +static inline int ib_mtu_enum_to_int(enum ib_mtu mtu) +{ + switch (mtu) { + case IB_MTU_256: return 256; + case IB_MTU_512: return 512; + case IB_MTU_1024: return 1024; + case IB_MTU_2048: return 2048; + case IB_MTU_4096: return 4096; + default: return -1; + } +} + +enum ib_static_rate { + IB_STATIC_RATE_FULL = 0, + IB_STATIC_RATE_12X_TO_4X = 2, + IB_STATIC_RATE_4X_TO_1X = 3, + IB_STATIC_RATE_12X_TO_1X = 11 +}; + +enum ib_port_state { + IB_PORT_NOP = 0, + IB_PORT_DOWN = 1, + IB_PORT_INIT = 2, + IB_PORT_ARMED = 3, + IB_PORT_ACTIVE = 4, + IB_PORT_ACTIVE_DEFER = 5 +}; + +enum ib_port_cap_flags { + IB_PORT_SM = (1<<31), + IB_PORT_NOTICE_SUP = (1<<30), + IB_PORT_TRAP_SUP = (1<<29), + IB_PORT_AUTO_MIGR_SUP = (1<<27), + IB_PORT_SL_MAP_SUP = (1<<26), + IB_PORT_MKEY_NVRAM = (1<<25), + IB_PORT_PKEY_NVRAM = (1<<24), + IB_PORT_LED_INFO_SUP = (1<<23), + IB_PORT_SM_DISABLED = (1<<22), + IB_PORT_SYS_IMAGE_GUID_SUP = (1<<21), + IB_PORT_PKEY_SW_EXT_PORT_TRAP_SUP = (1<<20), + IB_PORT_CM_SUP = (1<<16), + IB_PORT_SNMP_TUNNEL_SUP = (1<<15), + IB_PORT_REINIT_SUP = (1<<14), + IB_PORT_DEVICE_MGMT_SUP = (1<<13), + IB_PORT_VENDOR_CLASS_SUP = (1<<12), + IB_PORT_DR_NOTICE_SUP = (1<<11), + IB_PORT_PORT_NOTICE_SUP = (1<<10), + IB_PORT_BOOT_MGMT_SUP = (1<<9) +}; + +struct ib_port_attr { + enum ib_port_state state; + enum ib_mtu max_mtu; + enum ib_mtu active_mtu; + int gid_tbl_len; + u32 port_cap_flags; + u32 max_msg_sz; + u32 bad_pkey_cntr; + u32 qkey_viol_cntr; + u16 pkey_tbl_len; + u16 lid; + u16 sm_lid; + u8 lmc; + u8 max_vl_num; + u8 sm_sl; + u8 subnet_timeout; + u8 init_type_reply; +}; + +enum ib_device_modify_flags { + 
IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 +}; + +struct ib_device_modify { + u64 sys_image_guid; +}; + +enum ib_port_modify_flags { + IB_PORT_SHUTDOWN = 1, + IB_PORT_INIT_TYPE = (1<<2), + IB_PORT_RESET_QKEY_CNTR = (1<<3) +}; + +struct ib_port_modify { + u32 set_port_cap_mask; + u32 clr_port_cap_mask; + u8 init_type; +}; + +enum ib_event_type { + IB_EVENT_CQ_ERR, + IB_EVENT_QP_FATAL, + IB_EVENT_QP_REQ_ERR, + IB_EVENT_QP_ACCESS_ERR, + IB_EVENT_COMM_EST, + IB_EVENT_SQ_DRAINED, + IB_EVENT_PATH_MIG, + IB_EVENT_PATH_MIG_ERR, + IB_EVENT_DEVICE_FATAL, + IB_EVENT_PORT_ACTIVE, + IB_EVENT_PORT_ERR, + IB_EVENT_LID_CHANGE, + IB_EVENT_PKEY_CHANGE, + IB_EVENT_SM_CHANGE +}; + +struct ib_event { + struct ib_device *device; + union { + struct ib_cq *cq; + struct ib_qp *qp; + u8 port_num; + } element; + enum ib_event_type event; +}; + +struct ib_event_handler { + struct ib_device *device; + void (*handler)(struct ib_event_handler *, struct ib_event *); + struct list_head list; +}; + +#define INIT_IB_EVENT_HANDLER(_ptr, _device, _handler) \ + do { \ + (_ptr)->device = _device; \ + (_ptr)->handler = _handler; \ + INIT_LIST_HEAD(&(_ptr)->list); \ + } while (0) + +struct ib_global_route { + union ib_gid dgid; + u32 flow_label; + u8 sgid_index; + u8 hop_limit; + u8 traffic_class; +}; + +enum { + IB_MULTICAST_QPN = 0xffffff +}; + +enum ib_ah_flags { + IB_AH_GRH = 1 +}; + +struct ib_ah_attr { + struct ib_global_route grh; + u16 dlid; + u8 sl; + u8 src_path_bits; + u8 static_rate; + u8 ah_flags; + u8 port_num; +}; + +enum ib_wc_status { + IB_WC_SUCCESS, + IB_WC_LOC_LEN_ERR, + IB_WC_LOC_QP_OP_ERR, + IB_WC_LOC_EEC_OP_ERR, + IB_WC_LOC_PROT_ERR, + IB_WC_WR_FLUSH_ERR, + IB_WC_MW_BIND_ERR, + IB_WC_BAD_RESP_ERR, + IB_WC_LOC_ACCESS_ERR, + IB_WC_REM_INV_REQ_ERR, + IB_WC_REM_ACCESS_ERR, + IB_WC_REM_OP_ERR, + IB_WC_RETRY_EXC_ERR, + IB_WC_RNR_RETRY_EXC_ERR, + IB_WC_LOC_RDD_VIOL_ERR, + IB_WC_REM_INV_RD_REQ_ERR, + IB_WC_REM_ABORT_ERR, + IB_WC_INV_EECN_ERR, + IB_WC_INV_EEC_STATE_ERR, + IB_WC_FATAL_ERR, + IB_WC_RESP_TIMEOUT_ERR, + IB_WC_GENERAL_ERR +}; + +enum ib_wc_opcode { + IB_WC_SEND, + IB_WC_RDMA_WRITE, + IB_WC_RDMA_READ, + IB_WC_COMP_SWAP, + IB_WC_FETCH_ADD, + IB_WC_BIND_MW, +/* + * Set value of IB_WC_RECV so consumers can test if a completion is a + * receive by testing (opcode & IB_WC_RECV). + */ + IB_WC_RECV = 1 << 7, + IB_WC_RECV_RDMA_WITH_IMM +}; + +enum ib_wc_flags { + IB_WC_GRH = 1, + IB_WC_WITH_IMM = (1<<1) +}; + +struct ib_wc { + u64 wr_id; + enum ib_wc_status status; + enum ib_wc_opcode opcode; + u32 vendor_err; + u32 byte_len; + __be32 imm_data; + u32 src_qp; + int wc_flags; + u16 pkey_index; + u16 slid; + u8 sl; + u8 dlid_path_bits; + u8 port_num; /* valid only for DR SMPs on switches */ +}; + +enum ib_cq_notify { + IB_CQ_SOLICITED, + IB_CQ_NEXT_COMP +}; + +struct ib_qp_cap { + u32 max_send_wr; + u32 max_recv_wr; + u32 max_send_sge; + u32 max_recv_sge; + u32 max_inline_data; +}; + +enum ib_sig_type { + IB_SIGNAL_ALL_WR, + IB_SIGNAL_REQ_WR +}; + +enum ib_qp_type { + /* + * IB_QPT_SMI and IB_QPT_GSI have to be the first two entries + * here (and in that order) since the MAD layer uses them as + * indices into a 2-entry table. 
+ */ + IB_QPT_SMI, + IB_QPT_GSI, + + IB_QPT_RC, + IB_QPT_UC, + IB_QPT_UD, + IB_QPT_RAW_IPV6, + IB_QPT_RAW_ETY +}; + +struct ib_qp_init_attr { + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + struct ib_qp_cap cap; + enum ib_sig_type sq_sig_type; + enum ib_sig_type rq_sig_type; + enum ib_qp_type qp_type; + u8 port_num; /* special QP types only */ +}; + +enum ib_rnr_timeout { + IB_RNR_TIMER_655_36 = 0, + IB_RNR_TIMER_000_01 = 1, + IB_RNR_TIMER_000_02 = 2, + IB_RNR_TIMER_000_03 = 3, + IB_RNR_TIMER_000_04 = 4, + IB_RNR_TIMER_000_06 = 5, + IB_RNR_TIMER_000_08 = 6, + IB_RNR_TIMER_000_12 = 7, + IB_RNR_TIMER_000_16 = 8, + IB_RNR_TIMER_000_24 = 9, + IB_RNR_TIMER_000_32 = 10, + IB_RNR_TIMER_000_48 = 11, + IB_RNR_TIMER_000_64 = 12, + IB_RNR_TIMER_000_96 = 13, + IB_RNR_TIMER_001_28 = 14, + IB_RNR_TIMER_001_92 = 15, + IB_RNR_TIMER_002_56 = 16, + IB_RNR_TIMER_003_84 = 17, + IB_RNR_TIMER_005_12 = 18, + IB_RNR_TIMER_007_68 = 19, + IB_RNR_TIMER_010_24 = 20, + IB_RNR_TIMER_015_36 = 21, + IB_RNR_TIMER_020_48 = 22, + IB_RNR_TIMER_030_72 = 23, + IB_RNR_TIMER_040_96 = 24, + IB_RNR_TIMER_061_44 = 25, + IB_RNR_TIMER_081_92 = 26, + IB_RNR_TIMER_122_88 = 27, + IB_RNR_TIMER_163_84 = 28, + IB_RNR_TIMER_245_76 = 29, + IB_RNR_TIMER_327_68 = 30, + IB_RNR_TIMER_491_52 = 31 +}; + +enum ib_qp_attr_mask { + IB_QP_STATE = 1, + IB_QP_CUR_STATE = (1<<1), + IB_QP_EN_SQD_ASYNC_NOTIFY = (1<<2), + IB_QP_ACCESS_FLAGS = (1<<3), + IB_QP_PKEY_INDEX = (1<<4), + IB_QP_PORT = (1<<5), + IB_QP_QKEY = (1<<6), + IB_QP_AV = (1<<7), + IB_QP_PATH_MTU = (1<<8), + IB_QP_TIMEOUT = (1<<9), + IB_QP_RETRY_CNT = (1<<10), + IB_QP_RNR_RETRY = (1<<11), + IB_QP_RQ_PSN = (1<<12), + IB_QP_MAX_QP_RD_ATOMIC = (1<<13), + IB_QP_ALT_PATH = (1<<14), + IB_QP_MIN_RNR_TIMER = (1<<15), + IB_QP_SQ_PSN = (1<<16), + IB_QP_MAX_DEST_RD_ATOMIC = (1<<17), + IB_QP_PATH_MIG_STATE = (1<<18), + IB_QP_CAP = (1<<19), + IB_QP_DEST_QPN = (1<<20) +}; + +enum ib_qp_state { + IB_QPS_RESET, + IB_QPS_INIT, + IB_QPS_RTR, + IB_QPS_RTS, + IB_QPS_SQD, + IB_QPS_SQE, + IB_QPS_ERR +}; + +enum ib_mig_state { + IB_MIG_MIGRATED, + IB_MIG_REARM, + IB_MIG_ARMED +}; + +struct ib_qp_attr { + enum ib_qp_state qp_state; + enum ib_qp_state cur_qp_state; + enum ib_mtu path_mtu; + enum ib_mig_state path_mig_state; + u32 qkey; + u32 rq_psn; + u32 sq_psn; + u32 dest_qp_num; + int qp_access_flags; + struct ib_qp_cap cap; + struct ib_ah_attr ah_attr; + struct ib_ah_attr alt_ah_attr; + u16 pkey_index; + u16 alt_pkey_index; + u8 en_sqd_async_notify; + u8 sq_draining; + u8 max_rd_atomic; + u8 max_dest_rd_atomic; + u8 min_rnr_timer; + u8 port_num; + u8 timeout; + u8 retry_cnt; + u8 rnr_retry; + u8 alt_port_num; + u8 alt_timeout; +}; + +enum ib_wr_opcode { + IB_WR_RDMA_WRITE, + IB_WR_RDMA_WRITE_WITH_IMM, + IB_WR_SEND, + IB_WR_SEND_WITH_IMM, + IB_WR_RDMA_READ, + IB_WR_ATOMIC_CMP_AND_SWP, + IB_WR_ATOMIC_FETCH_AND_ADD +}; + +enum ib_send_flags { + IB_SEND_FENCE = 1, + IB_SEND_SIGNALED = (1<<1), + IB_SEND_SOLICITED = (1<<2), + IB_SEND_INLINE = (1<<3) +}; + +enum ib_recv_flags { + IB_RECV_SIGNALED = 1 +}; + +struct ib_sge { + u64 addr; + u32 length; + u32 lkey; +}; + +struct ib_send_wr { + struct ib_send_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + enum ib_wr_opcode opcode; + int send_flags; + u32 imm_data; + union { + struct { + u64 remote_addr; + u32 rkey; + } rdma; + struct { + u64 remote_addr; + u64 compare_add; + u64 swap; + u32 rkey; + } atomic; + struct { + struct ib_ah *ah; + struct ib_mad_hdr 
*mad_hdr; + u32 remote_qpn; + u32 remote_qkey; + int timeout_ms; /* valid for MADs only */ + u16 pkey_index; /* valid for GSI only */ + u8 port_num; /* valid for DR SMPs on switch only */ + } ud; + } wr; +}; + +struct ib_recv_wr { + struct ib_recv_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + int recv_flags; +}; + +enum ib_access_flags { + IB_ACCESS_LOCAL_WRITE = 1, + IB_ACCESS_REMOTE_WRITE = (1<<1), + IB_ACCESS_REMOTE_READ = (1<<2), + IB_ACCESS_REMOTE_ATOMIC = (1<<3), + IB_ACCESS_MW_BIND = (1<<4) +}; + +struct ib_phys_buf { + u64 addr; + u64 size; +}; + +struct ib_mr_attr { + struct ib_pd *pd; + u64 device_virt_addr; + u64 size; + int mr_access_flags; + u32 lkey; + u32 rkey; +}; + +enum ib_mr_rereg_flags { + IB_MR_REREG_TRANS = 1, + IB_MR_REREG_PD = (1<<1), + IB_MR_REREG_ACCESS = (1<<2) +}; + +struct ib_mw_bind { + struct ib_mr *mr; + u64 wr_id; + u64 addr; + u32 length; + int send_flags; + int mw_access_flags; +}; + +struct ib_fmr_attr { + int max_pages; + int max_maps; + u8 page_size; +}; + +struct ib_pd { + struct ib_device *device; + atomic_t usecnt; /* count all resources */ +}; + +struct ib_ah { + struct ib_device *device; + struct ib_pd *pd; +}; + +typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context); + +struct ib_cq { + struct ib_device *device; + ib_comp_handler comp_handler; + void (*event_handler)(struct ib_event *, void *); + void * cq_context; + int cqe; + atomic_t usecnt; /* count number of work queues */ +}; + +struct ib_srq { + struct ib_device *device; + struct ib_pd *pd; + void *srq_context; + atomic_t usecnt; +}; + +struct ib_qp { + struct ib_device *device; + struct ib_pd *pd; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + u32 qp_num; +}; + +struct ib_mr { + struct ib_device *device; + struct ib_pd *pd; + u32 lkey; + u32 rkey; + atomic_t usecnt; /* count number of MWs */ +}; + +struct ib_mw { + struct ib_device *device; + struct ib_pd *pd; + u32 rkey; +}; + +struct ib_fmr { + struct ib_device *device; + struct ib_pd *pd; + struct list_head list; + u32 lkey; + u32 rkey; +}; + +struct ib_mad; + +enum ib_process_mad_flags { + IB_MAD_IGNORE_MKEY = 1 +}; + +enum ib_mad_result { + IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ + IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ + IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ +}; + +#define IB_DEVICE_NAME_MAX 64 + +struct ib_cache { + rwlock_t lock; + struct ib_event_handler event_handler; + struct ib_pkey_cache **pkey_cache; + struct ib_gid_cache **gid_cache; +}; + +struct ib_device { + struct device *dma_device; + + char name[IB_DEVICE_NAME_MAX]; + + struct list_head event_handler_list; + spinlock_t event_handler_lock; + + struct list_head core_list; + struct list_head client_data_list; + spinlock_t client_data_lock; + + struct ib_cache cache; + + u32 flags; + + int (*query_device)(struct ib_device *device, + struct ib_device_attr *device_attr); + int (*query_port)(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr); + int (*query_gid)(struct ib_device *device, + u8 port_num, int index, + union ib_gid *gid); + int (*query_pkey)(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + int (*modify_device)(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + int (*modify_port)(struct 
ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + struct ib_pd * (*alloc_pd)(struct ib_device *device); + int (*dealloc_pd)(struct ib_pd *pd); + struct ib_ah * (*create_ah)(struct ib_pd *pd, + struct ib_ah_attr *ah_attr); + int (*modify_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*query_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*destroy_ah)(struct ib_ah *ah); + struct ib_qp * (*create_qp)(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + int (*modify_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + int (*query_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + int (*destroy_qp)(struct ib_qp *qp); + int (*post_send)(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + int (*post_recv)(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr); + struct ib_cq * (*create_cq)(struct ib_device *device, + int cqe); + int (*destroy_cq)(struct ib_cq *cq); + int (*resize_cq)(struct ib_cq *cq, int *cqe); + int (*poll_cq)(struct ib_cq *cq, int num_entries, + struct ib_wc *wc); + int (*peek_cq)(struct ib_cq *cq, int wc_cnt); + int (*req_notify_cq)(struct ib_cq *cq, + enum ib_cq_notify cq_notify); + int (*req_ncomp_notif)(struct ib_cq *cq, + int wc_cnt); + struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, + int mr_access_flags); + struct ib_mr * (*reg_phys_mr)(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + int (*query_mr)(struct ib_mr *mr, + struct ib_mr_attr *mr_attr); + int (*dereg_mr)(struct ib_mr *mr); + int (*rereg_phys_mr)(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + struct ib_mw * (*alloc_mw)(struct ib_pd *pd); + int (*bind_mw)(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); + int (*dealloc_mw)(struct ib_mw *mw); + struct ib_fmr * (*alloc_fmr)(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + int (*map_phys_fmr)(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova); + int (*unmap_fmr)(struct list_head *fmr_list); + int (*dealloc_fmr)(struct ib_fmr *fmr); + int (*attach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*detach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*process_mad)(struct ib_device *device, + int process_mad_flags, + u8 port_num, + u16 source_lid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); + + struct class_device class_dev; + struct kobject ports_parent; + struct list_head port_list; + + enum { + IB_DEV_UNINITIALIZED, + IB_DEV_REGISTERED, + IB_DEV_UNREGISTERED + } reg_state; + + u8 node_type; + u8 phys_port_cnt; +}; + +struct ib_client { + char *name; + void (*add) (struct ib_device *); + void (*remove)(struct ib_device *); + + struct list_head list; +}; + +struct ib_device *ib_alloc_device(size_t size); +void ib_dealloc_device(struct ib_device *device); + +int ib_register_device (struct ib_device *device); +void ib_unregister_device(struct ib_device *device); + +int ib_register_client (struct ib_client *client); +void ib_unregister_client(struct ib_client *client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client); +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data); + +int 
ib_register_event_handler (struct ib_event_handler *event_handler); +int ib_unregister_event_handler(struct ib_event_handler *event_handler); +void ib_dispatch_event(struct ib_event *event); + +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr); + +int ib_query_port(struct ib_device *device, + u8 port_num, struct ib_port_attr *port_attr); + +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid); + +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + +/** + * ib_alloc_pd - Allocates an unused protection domain. + * @device: The device on which to allocate the protection domain. + * + * A protection domain object provides an association between QPs, shared + * receive queues, address handles, memory regions, and memory windows. + */ +struct ib_pd *ib_alloc_pd(struct ib_device *device); + +/** + * ib_dealloc_pd - Deallocates a protection domain. + * @pd: The protection domain to deallocate. + */ +int ib_dealloc_pd(struct ib_pd *pd); + +/** + * ib_create_ah - Creates an address handle for the given address vector. + * @pd: The protection domain associated with the address handle. + * @ah_attr: The attributes of the address vector. + * + * The address handle is used to reference a local or global destination + * in all UD QP post sends. + */ +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); + +/** + * ib_modify_ah - Modifies the address vector associated with an address + * handle. + * @ah: The address handle to modify. + * @ah_attr: The new address vector attributes to associate with the + * address handle. + */ +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +/** + * ib_query_ah - Queries the address vector associated with an address + * handle. + * @ah: The address handle to query. + * @ah_attr: The address vector attributes associated with the address + * handle. + */ +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +/** + * ib_destroy_ah - Destroys an address handle. + * @ah: The address handle to destroy. + */ +int ib_destroy_ah(struct ib_ah *ah); + +/** + * ib_create_qp - Creates a QP associated with the specified protection + * domain. + * @pd: The protection domain associated with the QP. + * @qp_init_attr: A list of initial attributes required to create the QP. + */ +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + +/** + * ib_modify_qp - Modifies the attributes for the specified QP and then + * transitions the QP to the given state. + * @qp: The QP to modify. + * @qp_attr: On input, specifies the QP attributes to modify. On output, + * the current values of selected QP attributes are returned. + * @qp_attr_mask: A bit-mask used to specify which attributes of the QP + * are being modified. + */ +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + +/** + * ib_query_qp - Returns the attribute list and current values for the + * specified QP. + * @qp: The QP to query. + * @qp_attr: The attributes of the specified QP. + * @qp_attr_mask: A bit-mask used to select specific attributes to query. + * @qp_init_attr: Additional attributes of the selected QP. 
+ * + * The qp_attr_mask may be used to limit the query to gathering only the + * selected attributes. + */ +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + +/** + * ib_destroy_qp - Destroys the specified QP. + * @qp: The QP to destroy. + */ +int ib_destroy_qp(struct ib_qp *qp); + +/** + * ib_post_send - Posts a list of work requests to the send queue of + * the specified QP. + * @qp: The QP to post the work request on. + * @send_wr: A list of work requests to post on the send queue. + * @bad_send_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the QP. + */ +static inline int ib_post_send(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + return qp->device->post_send(qp, send_wr, bad_send_wr); +} + +/** + * ib_post_recv - Posts a list of work requests to the receive queue of + * the specified QP. + * @qp: The QP to post the work request on. + * @recv_wr: A list of work requests to post on the receive queue. + * @bad_recv_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the QP. + */ +static inline int ib_post_recv(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + return qp->device->post_recv(qp, recv_wr, bad_recv_wr); +} + +/** + * ib_create_cq - Creates a CQ on the specified device. + * @device: The device on which to create the CQ. + * @comp_handler: A user-specified callback that is invoked when a + * completion event occurs on the CQ. + * @event_handler: A user-specified callback that is invoked when an + * asynchronous event not associated with a completion occurs on the CQ. + * @cq_context: Context associated with the CQ returned to the user via + * the associated completion and event handlers. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. + */ +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe); + +/** + * ib_resize_cq - Modifies the capacity of the CQ. + * @cq: The CQ to resize. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. + */ +int ib_resize_cq(struct ib_cq *cq, int cqe); + +/** + * ib_destroy_cq - Destroys the specified CQ. + * @cq: The CQ to destroy. + */ +int ib_destroy_cq(struct ib_cq *cq); + +/** + * ib_poll_cq - poll a CQ for completion(s) + * @cq:the CQ being polled + * @num_entries:maximum number of completions to return + * @wc:array of at least @num_entries &struct ib_wc where completions + * will be returned + * + * Poll a CQ for (possibly multiple) completions. If the return value + * is < 0, an error occurred. If the return value is >= 0, it is the + * number of completions returned. If the return value is + * non-negative and < num_entries, then the CQ was emptied. + */ +static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, + struct ib_wc *wc) +{ + return cq->device->poll_cq(cq, num_entries, wc); +} + +/** + * ib_peek_cq - Returns the number of unreaped completions currently + * on the specified CQ. + * @cq: The CQ to peek. + * @wc_cnt: A minimum number of unreaped completions to check for. 
+ * + * If the number of unreaped completions is greater than or equal to wc_cnt, + * this function returns wc_cnt, otherwise, it returns the actual number of + * unreaped completions. + */ +int ib_peek_cq(struct ib_cq *cq, int wc_cnt); + +/** + * ib_req_notify_cq - Request completion notification on a CQ. + * @cq: The CQ to generate an event for. + * @cq_notify: If set to %IB_CQ_SOLICITED, completion notification will + * occur on the next solicited event. If set to %IB_CQ_NEXT_COMP, + * notification will occur on the next completion. + */ +static inline int ib_req_notify_cq(struct ib_cq *cq, + enum ib_cq_notify cq_notify) +{ + return cq->device->req_notify_cq(cq, cq_notify); +} + +/** + * ib_req_ncomp_notif - Request completion notification when there are + * at least the specified number of unreaped completions on the CQ. + * @cq: The CQ to generate an event for. + * @wc_cnt: The number of unreaped completions that should be on the + * CQ before an event is generated. + */ +static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) +{ + return cq->device->req_ncomp_notif ? + cq->device->req_ncomp_notif(cq, wc_cnt) : + -ENOSYS; +} + +/** + * ib_get_dma_mr - Returns a memory region for system memory that is + * usable for DMA. + * @pd: The protection domain associated with the memory region. + * @mr_access_flags: Specifies the memory access rights. + */ +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +/** + * ib_reg_phys_mr - Prepares a virtually addressed memory region for use + * by an HCA. + * @pd: The protection domain associated assigned to the registered region. + * @phys_buf_array: Specifies a list of physical buffers to use in the + * memory region. + * @num_phys_buf: Specifies the size of the phys_buf_array. + * @mr_access_flags: Specifies the memory access rights. + * @iova_start: The offset of the region's starting I/O virtual address. + */ +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +/** + * ib_rereg_phys_mr - Modifies the attributes of an existing memory region. + * Conceptually, this call performs the functions deregister memory region + * followed by register physical memory region. Where possible, + * resources are reused instead of deallocated and reallocated. + * @mr: The memory region to modify. + * @mr_rereg_mask: A bit-mask used to indicate which of the following + * properties of the memory region are being modified. + * @pd: If %IB_MR_REREG_PD is set in mr_rereg_mask, this field specifies + * the new protection domain to associated with the memory region, + * otherwise, this parameter is ignored. + * @phys_buf_array: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this + * field specifies a list of physical buffers to use in the new + * translation, otherwise, this parameter is ignored. + * @num_phys_buf: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this + * field specifies the size of the phys_buf_array, otherwise, this + * parameter is ignored. + * @mr_access_flags: If %IB_MR_REREG_ACCESS is set in mr_rereg_mask, this + * field specifies the new memory access rights, otherwise, this + * parameter is ignored. + * @iova_start: The offset of the region's starting I/O virtual address. 
+ */ +int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +/** + * ib_query_mr - Retrieves information about a specific memory region. + * @mr: The memory region to retrieve information about. + * @mr_attr: The attributes of the specified memory region. + */ +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); + +/** + * ib_dereg_mr - Deregisters a memory region and removes it from the + * HCA translation table. + * @mr: The memory region to deregister. + */ +int ib_dereg_mr(struct ib_mr *mr); + +/** + * ib_alloc_mw - Allocates a memory window. + * @pd: The protection domain associated with the memory window. + */ +struct ib_mw *ib_alloc_mw(struct ib_pd *pd); + +/** + * ib_bind_mw - Posts a work request to the send queue of the specified + * QP, which binds the memory window to the given address range and + * remote access attributes. + * @qp: QP to post the bind work request on. + * @mw: The memory window to bind. + * @mw_bind: Specifies information about the memory window, including + * its address range, remote access rights, and associated memory region. + */ +static inline int ib_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + /* XXX reference counting in corresponding MR? */ + return mw->device->bind_mw ? + mw->device->bind_mw(qp, mw, mw_bind) : + -ENOSYS; +} + +/** + * ib_dealloc_mw - Deallocates a memory window. + * @mw: The memory window to deallocate. + */ +int ib_dealloc_mw(struct ib_mw *mw); + +/** + * ib_alloc_fmr - Allocates a unmapped fast memory region. + * @pd: The protection domain associated with the unmapped region. + * @mr_access_flags: Specifies the memory access rights. + * @fmr_attr: Attributes of the unmapped region. + * + * A fast memory region must be mapped before it can be used as part of + * a work request. + */ +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + +/** + * ib_map_phys_fmr - Maps a list of physical pages to a fast memory region. + * @fmr: The fast memory region to associate with the pages. + * @page_list: An array of physical pages to map to the fast memory region. + * @list_len: The number of pages in page_list. + * @iova: The I/O virtual address to use with the mapped region. + */ +static inline int ib_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova) +{ + return fmr->device->map_phys_fmr(fmr, page_list, list_len, iova); +} + +/** + * ib_unmap_fmr - Removes the mapping from a list of fast memory regions. + * @fmr_list: A linked list of fast memory regions to unmap. + */ +int ib_unmap_fmr(struct list_head *fmr_list); + +/** + * ib_dealloc_fmr - Deallocates a fast memory region. + * @fmr: The fast memory region to deallocate. + */ +int ib_dealloc_fmr(struct ib_fmr *fmr); + +/** + * ib_attach_mcast - Attaches the specified QP to a multicast group. + * @qp: QP to attach to the multicast group. The QP must be type + * IB_QPT_UD. + * @gid: Multicast group GID. + * @lid: Multicast group LID in host byte order. + * + * In order to send and receive multicast packets, subnet + * administration must have created the multicast group and configured + * the fabric appropriately. The port associated with the specified + * QP must also be a member of the multicast group. 
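+ *
+ * Illustrative sketch (the gid/lid variables are placeholders; how they
+ * are obtained from subnet administration is outside the scope of this
+ * file): once the group exists, a UD QP joins and later leaves with:
+ *
+ *        ret = ib_attach_mcast(qp, &mcast_gid, mcast_lid);
+ *        ...
+ *        ret = ib_detach_mcast(qp, &mcast_gid, mcast_lid);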
+ */ +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +/** + * ib_detach_mcast - Detaches the specified QP from a multicast group. + * @qp: QP to detach from the multicast group. + * @gid: Multicast group GID. + * @lid: Multicast group LID in host byte order. + */ +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +#endif /* IB_VERBS_H */ From roland at topspin.com Sun Dec 19 22:14:50 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:50 -0800 Subject: [openib-general] [PATCH][v4][2/24] Add core InfiniBand support In-Reply-To: <200412192214.PJDxbje135ozyhtX@topspin.com> Message-ID: <200412192214.uWyITZT30vxqMJuO@topspin.com> Add implementation of core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-19 22:04:11.935940222 -0800 @@ -0,0 +1,10 @@ +menu "InfiniBand support" + +config INFINIBAND + tristate "InfiniBand support" + ---help--- + Core support for InfiniBand (IB). Make sure to also select + any protocols you wish to use as well as drivers for your + InfiniBand hardware. + +endmenu --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Makefile 2004-12-19 22:04:11.960936538 -0800 @@ -0,0 +1 @@ +obj-$(CONFIG_INFINIBAND) += core/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-19 22:04:11.985932855 -0800 @@ -0,0 +1,6 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND) += ib_core.o + +ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ + device.o fmr_pool.o cache.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/cache.c 2004-12-19 22:04:12.287888357 -0800 @@ -0,0 +1,328 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include + +#include "core_priv.h" + +struct ib_pkey_cache { + int table_len; + u16 table[0]; +}; + +struct ib_gid_cache { + int table_len; + union ib_gid table[0]; +}; + +struct ib_update_work { + struct work_struct work; + struct ib_device *device; + u8 port_num; +}; + +static inline int start_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 0 : 1; +} + +static inline int end_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 0 : device->phys_port_cnt; +} + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid) +{ + struct ib_gid_cache *cache; + unsigned long flags; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.gid_cache[port - start_port(device)]; + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *gid = cache->table[index]; + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_gid_get); + +int ib_cached_pkey_get(struct ib_device *device, + u8 port, + int index, + u16 *pkey) +{ + struct ib_pkey_cache *cache; + unsigned long flags; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.pkey_cache[port - start_port(device)]; + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *pkey = cache->table[index]; + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_get); + +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index) +{ + struct ib_pkey_cache *cache; + unsigned long flags; + int i; + int ret = -ENOENT; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.pkey_cache[port - start_port(device)]; + + *index = -1; + + for (i = 0; i < cache->table_len; ++i) + if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { + *index = i; + ret = 0; + break; + } + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_find); + +static void ib_cache_update(struct ib_device *device, + u8 port) +{ + struct ib_port_attr *tprops = NULL; + struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache; + struct ib_gid_cache *gid_cache = NULL, *old_gid_cache; + int i; + int ret; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + return; + + ret = ib_query_port(device, port, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed (%d) for %s\n", + ret, device->name); + goto err; + } + + pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * + sizeof *pkey_cache->table, GFP_KERNEL); + if (!pkey_cache) + goto err; + + pkey_cache->table_len = tprops->pkey_tbl_len; + + gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len * + sizeof *gid_cache->table, GFP_KERNEL); + if (!gid_cache) + goto err; + + gid_cache->table_len = tprops->gid_tbl_len; + + for (i = 0; i < pkey_cache->table_len; ++i) { + ret = ib_query_pkey(device, port, i, pkey_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + for (i = 0; i < gid_cache->table_len; ++i) { + ret 
= ib_query_gid(device, port, i, gid_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + write_lock_irq(&device->cache.lock); + + old_pkey_cache = device->cache.pkey_cache[port - start_port(device)]; + old_gid_cache = device->cache.gid_cache [port - start_port(device)]; + + device->cache.pkey_cache[port - start_port(device)] = pkey_cache; + device->cache.gid_cache [port - start_port(device)] = gid_cache; + + write_unlock_irq(&device->cache.lock); + + kfree(old_pkey_cache); + kfree(old_gid_cache); + kfree(tprops); + return; + +err: + kfree(pkey_cache); + kfree(gid_cache); + kfree(tprops); +} + +static void ib_cache_task(void *work_ptr) +{ + struct ib_update_work *work = work_ptr; + + ib_cache_update(work->device, work->port_num); + kfree(work); +} + +static void ib_cache_event(struct ib_event_handler *handler, + struct ib_event *event) +{ + struct ib_update_work *work; + + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + work = kmalloc(sizeof *work, GFP_ATOMIC); + if (work) { + INIT_WORK(&work->work, ib_cache_task, work); + work->device = event->device; + work->port_num = event->element.port_num; + schedule_work(&work->work); + } + } +} + +void ib_cache_setup_one(struct ib_device *device) +{ + int p; + + rwlock_init(&device->cache.lock); + + device->cache.pkey_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + device->cache.gid_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + + if (!device->cache.pkey_cache || !device->cache.gid_cache) { + printk(KERN_WARNING "Couldn't allocate cache " + "for %s\n", device->name); + goto err; + } + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + device->cache.pkey_cache[p] = NULL; + device->cache.gid_cache [p] = NULL; + ib_cache_update(device, p + start_port(device)); + } + + INIT_IB_EVENT_HANDLER(&device->cache.event_handler, + device, ib_cache_event); + if (ib_register_event_handler(&device->cache.event_handler)) + goto err_cache; + + return; + +err_cache: + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); + } + +err: + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} + +void ib_cache_cleanup_one(struct ib_device *device) +{ + int p; + + ib_unregister_event_handler(&device->cache.event_handler); + flush_scheduled_work(); + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); + } + + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} + +struct ib_client cache_client = { + .name = "cache", + .add = ib_cache_setup_one, + .remove = ib_cache_cleanup_one +}; + +int __init ib_cache_setup(void) +{ + return ib_register_client(&cache_client); +} + +void __exit ib_cache_cleanup(void) +{ + ib_unregister_client(&cache_client); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/core_priv.h 2004-12-19 22:04:12.312884674 -0800 @@ -0,0 +1,52 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: core_priv.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef _CORE_PRIV_H +#define _CORE_PRIV_H + +#include +#include + +#include + +int ib_device_register_sysfs(struct ib_device *device); +void ib_device_unregister_sysfs(struct ib_device *device); + +int ib_sysfs_setup(void); +void ib_sysfs_cleanup(void); + +int ib_cache_setup(void); +void ib_cache_cleanup(void); + +#endif /* _CORE_PRIV_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/device.c 2004-12-19 22:04:12.238895577 -0800 @@ -0,0 +1,614 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: device.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("core kernel InfiniBand API"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct ib_client_data { + struct list_head list; + struct ib_client *client; + void * data; +}; + +static LIST_HEAD(device_list); +static LIST_HEAD(client_list); + +/* + * device_sem protects access to both device_list and client_list. + * There's no real point to using multiple locks or something fancier + * like an rwsem: we always access both lists, and we're always + * modifying one list or the other list. In any case this is not a + * hot path so there's no point in trying to optimize. + */ +static DECLARE_MUTEX(device_sem); + +static int ib_device_check_mandatory(struct ib_device *device) +{ +#define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device, x), #x } + static const struct { + size_t offset; + char *name; + } mandatory_table[] = { + IB_MANDATORY_FUNC(query_device), + IB_MANDATORY_FUNC(query_port), + IB_MANDATORY_FUNC(query_pkey), + IB_MANDATORY_FUNC(query_gid), + IB_MANDATORY_FUNC(alloc_pd), + IB_MANDATORY_FUNC(dealloc_pd), + IB_MANDATORY_FUNC(create_ah), + IB_MANDATORY_FUNC(destroy_ah), + IB_MANDATORY_FUNC(create_qp), + IB_MANDATORY_FUNC(modify_qp), + IB_MANDATORY_FUNC(destroy_qp), + IB_MANDATORY_FUNC(post_send), + IB_MANDATORY_FUNC(post_recv), + IB_MANDATORY_FUNC(create_cq), + IB_MANDATORY_FUNC(destroy_cq), + IB_MANDATORY_FUNC(poll_cq), + IB_MANDATORY_FUNC(req_notify_cq), + IB_MANDATORY_FUNC(get_dma_mr), + IB_MANDATORY_FUNC(dereg_mr) + }; + int i; + + for (i = 0; i < sizeof mandatory_table / sizeof mandatory_table[0]; ++i) { + if (!*(void **) ((void *) device + mandatory_table[i].offset)) { + printk(KERN_WARNING "Device %s is missing mandatory function %s\n", + device->name, mandatory_table[i].name); + return -EINVAL; + } + } + + return 0; +} + +static struct ib_device *__ib_device_get_by_name(const char *name) +{ + struct ib_device *device; + + list_for_each_entry(device, &device_list, core_list) + if (!strncmp(name, device->name, IB_DEVICE_NAME_MAX)) + return device; + + return NULL; +} + + +static int alloc_name(char *name) +{ + long *inuse; + char buf[IB_DEVICE_NAME_MAX]; + struct ib_device *device; + int i; + + inuse = (long *) get_zeroed_page(GFP_KERNEL); + if (!inuse) + return -ENOMEM; + + list_for_each_entry(device, &device_list, core_list) { + if (!sscanf(device->name, name, &i)) + continue; + if (i < 0 || i >= PAGE_SIZE * 8) + continue; + snprintf(buf, sizeof buf, name, i); + if (!strncmp(buf, device->name, IB_DEVICE_NAME_MAX)) + set_bit(i, inuse); + } + + i = find_first_zero_bit(inuse, PAGE_SIZE * 8); + free_page((unsigned long) inuse); + snprintf(buf, sizeof buf, name, i); + + if (__ib_device_get_by_name(buf)) + return -ENFILE; + + strlcpy(name, buf, IB_DEVICE_NAME_MAX); + return 0; +} + +/** + * ib_alloc_device - allocate an IB device struct + * @size:size of structure to allocate + * + * Low-level drivers should use ib_alloc_device() to allocate &struct + * ib_device. @size is the size of the structure to be allocated, + * including any private data used by the low-level driver. + * ib_dealloc_device() must be used to free structures allocated with + * ib_alloc_device(). 
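+ *
+ * A minimal sketch (illustrative; "struct my_hca" stands in for a
+ * low-level driver's private structure whose first member is a
+ * struct ib_device):
+ *
+ *        struct my_hca *hca;
+ *
+ *        hca = (struct my_hca *) ib_alloc_device(sizeof *hca);
+ *        if (!hca)
+ *                return -ENOMEM;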
+ */ +struct ib_device *ib_alloc_device(size_t size) +{ + void *dev; + + BUG_ON(size < sizeof (struct ib_device)); + + dev = kmalloc(size, GFP_KERNEL); + if (!dev) + return NULL; + + memset(dev, 0, size); + + return dev; +} +EXPORT_SYMBOL(ib_alloc_device); + +/** + * ib_dealloc_device - free an IB device struct + * @device:structure to free + * + * Free a structure allocated with ib_alloc_device(). + */ +void ib_dealloc_device(struct ib_device *device) +{ + if (device->reg_state == IB_DEV_UNINITIALIZED) { + kfree(device); + return; + } + + BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); + + ib_device_unregister_sysfs(device); +} +EXPORT_SYMBOL(ib_dealloc_device); + +static int add_client_context(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + unsigned long flags; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) { + printk(KERN_WARNING "Couldn't allocate client context for %s/%s\n", + device->name, client->name); + return -ENOMEM; + } + + context->client = client; + context->data = NULL; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return 0; +} + +/** + * ib_register_device - Register an IB device with IB core + * @device:Device to register + * + * Low-level drivers use ib_register_device() to register their + * devices with the IB core. All registered clients will receive a + * callback for each device that is added. @device must be allocated + * with ib_alloc_device(). + */ +int ib_register_device(struct ib_device *device) +{ + int ret; + + down(&device_sem); + + if (strchr(device->name, '%')) { + ret = alloc_name(device->name); + if (ret) + goto out; + } + + if (ib_device_check_mandatory(device)) { + ret = -EINVAL; + goto out; + } + + INIT_LIST_HEAD(&device->event_handler_list); + INIT_LIST_HEAD(&device->client_data_list); + spin_lock_init(&device->event_handler_lock); + spin_lock_init(&device->client_data_lock); + + ret = ib_device_register_sysfs(device); + if (ret) { + printk(KERN_WARNING "Couldn't register device %s with driver model\n", + device->name); + goto out; + } + + list_add_tail(&device->core_list, &device_list); + + device->reg_state = IB_DEV_REGISTERED; + + { + struct ib_client *client; + + list_for_each_entry(client, &client_list, list) + if (client->add && !add_client_context(device, client)) + client->add(device); + } + + out: + up(&device_sem); + return ret; +} +EXPORT_SYMBOL(ib_register_device); + +/** + * ib_unregister_device - Unregister an IB device + * @device:Device to unregister + * + * Unregister an IB device. All clients will receive a remove callback. 
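+ *
+ * Sketch of a typical removal path (placeholder names; assumes the
+ * driver embeds its struct ib_device in a private "hca" structure):
+ *
+ *        ib_unregister_device(&hca->ib_dev);
+ *        ib_dealloc_device(&hca->ib_dev);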
+ */ +void ib_unregister_device(struct ib_device *device) +{ + struct ib_client *client; + struct ib_client_data *context, *tmp; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry_reverse(client, &client_list, list) + if (client->remove) + client->remove(device); + + list_del(&device->core_list); + + up(&device_sem); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + kfree(context); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + device->reg_state = IB_DEV_UNREGISTERED; +} +EXPORT_SYMBOL(ib_unregister_device); + +/** + * ib_register_client - Register an IB client + * @client:Client to register + * + * Upper level users of the IB drivers can use ib_register_client() to + * register callbacks for IB device addition and removal. When an IB + * device is added, each registered client's add method will be called + * (in the order the clients were registered), and when a device is + * removed, each client's remove method will be called (in the reverse + * order that clients were registered). In addition, when + * ib_register_client() is called, the client will receive an add + * callback for all devices already registered. + */ +int ib_register_client(struct ib_client *client) +{ + struct ib_device *device; + + down(&device_sem); + + list_add_tail(&client->list, &client_list); + list_for_each_entry(device, &device_list, core_list) + if (client->add && !add_client_context(device, client)) + client->add(device); + + up(&device_sem); + + return 0; +} +EXPORT_SYMBOL(ib_register_client); + +/** + * ib_unregister_client - Unregister an IB client + * @client:Client to unregister + * + * Upper level users use ib_unregister_client() to remove their client + * registration. When ib_unregister_client() is called, the client + * will receive a remove callback for each IB device still registered. + */ +void ib_unregister_client(struct ib_client *client) +{ + struct ib_client_data *context, *tmp; + struct ib_device *device; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry(device, &device_list, core_list) { + if (client->remove) + client->remove(device); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + if (context->client == client) { + list_del(&context->list); + kfree(context); + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + } + list_del(&client->list); + + up(&device_sem); +} +EXPORT_SYMBOL(ib_unregister_client); + +/** + * ib_get_client_data - Get IB client context + * @device:Device to get context for + * @client:Client to get context for + * + * ib_get_client_data() returns client context set with + * ib_set_client_data(). + */ +void *ib_get_client_data(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + void *ret = NULL; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + ret = context->data; + break; + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_get_client_data); + +/** + * ib_set_client_data - Get IB client context + * @device:Device to set context for + * @client:Client to set context for + * @data:Context to set + * + * ib_set_client_data() sets client context that can be retrieved with + * ib_get_client_data(). 
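+ *
+ * Sketch (placeholder names): a client's add callback can allocate
+ * per-device state and make it retrievable later:
+ *
+ *        static void my_add_one(struct ib_device *device)
+ *        {
+ *                struct my_state *st = kmalloc(sizeof *st, GFP_KERNEL);
+ *
+ *                if (st)
+ *                        ib_set_client_data(device, &my_client, st);
+ *        }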
+ */ +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data) +{ + struct ib_client_data *context; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + context->data = data; + goto out; + } + + printk(KERN_WARNING "No client context found for %s/%s\n", + device->name, client->name); + +out: + spin_unlock_irqrestore(&device->client_data_lock, flags); +} +EXPORT_SYMBOL(ib_set_client_data); + +/** + * ib_register_event_handler - Register an IB event handler + * @event_handler:Handler to register + * + * ib_register_event_handler() registers an event handler that will be + * called back when asynchronous IB events occur (as defined in + * chapter 11 of the InfiniBand Architecture Specification). This + * callback may occur in interrupt context. + */ +int ib_register_event_handler (struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_add_tail(&event_handler->list, + &event_handler->device->event_handler_list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_register_event_handler); + +/** + * ib_unregister_event_handler - Unregister an event handler + * @event_handler:Handler to unregister + * + * Unregister an event handler registered with + * ib_register_event_handler(). + */ +int ib_unregister_event_handler(struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_del(&event_handler->list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_unregister_event_handler); + +/** + * ib_dispatch_event - Dispatch an asynchronous event + * @event:Event to dispatch + * + * Low-level drivers must call ib_dispatch_event() to dispatch the + * event to all registered event handlers when an asynchronous event + * occurs. + */ +void ib_dispatch_event(struct ib_event *event) +{ + unsigned long flags; + struct ib_event_handler *handler; + + spin_lock_irqsave(&event->device->event_handler_lock, flags); + + list_for_each_entry(handler, &event->device->event_handler_list, list) + handler->handler(handler, event); + + spin_unlock_irqrestore(&event->device->event_handler_lock, flags); +} +EXPORT_SYMBOL(ib_dispatch_event); + +/** + * ib_query_device - Query IB device attributes + * @device:Device to query + * @device_attr:Device attributes + * + * ib_query_device() returns the attributes of a device through the + * @device_attr pointer. + */ +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr) +{ + return device->query_device(device, device_attr); +} +EXPORT_SYMBOL(ib_query_device); + +/** + * ib_query_port - Query IB port attributes + * @device:Device to query + * @port_num:Port number to query + * @port_attr:Port attributes + * + * ib_query_port() returns the attributes of a port through the + * @port_attr pointer. 
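+ *
+ * For example (sketch; "attr" is a local variable, not part of the
+ * interface), a consumer can test whether port 1 is usable:
+ *
+ *        struct ib_port_attr attr;
+ *
+ *        if (!ib_query_port(device, 1, &attr) &&
+ *            attr.state == IB_PORT_ACTIVE)
+ *                ...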
+ */ +int ib_query_port(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr) +{ + return device->query_port(device, port_num, port_attr); +} +EXPORT_SYMBOL(ib_query_port); + +/** + * ib_query_gid - Get GID table entry + * @device:Device to query + * @port_num:Port number to query + * @index:GID table index to query + * @gid:Returned GID + * + * ib_query_gid() fetches the specified GID table entry. + */ +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid) +{ + return device->query_gid(device, port_num, index, gid); +} +EXPORT_SYMBOL(ib_query_gid); + +/** + * ib_query_pkey - Get P_Key table entry + * @device:Device to query + * @port_num:Port number to query + * @index:P_Key table index to query + * @pkey:Returned P_Key + * + * ib_query_pkey() fetches the specified P_Key table entry. + */ +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey) +{ + return device->query_pkey(device, port_num, index, pkey); +} +EXPORT_SYMBOL(ib_query_pkey); + +/** + * ib_modify_device - Change IB device attributes + * @device:Device to modify + * @device_modify_mask:Mask of attributes to change + * @device_modify:New attribute values + * + * ib_modify_device() changes a device's attributes as specified by + * the @device_modify_mask and @device_modify structure. + */ +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify) +{ + return device->modify_device(device, device_modify_mask, + device_modify); +} +EXPORT_SYMBOL(ib_modify_device); + +/** + * ib_modify_port - Modifies the attributes for the specified port. + * @device: The device to modify. + * @port_num: The number of the port to modify. + * @port_modify_mask: Mask used to specify which attributes of the port + * to change. + * @port_modify: New attribute values for the port. + * + * ib_modify_port() changes a port's attributes as specified by the + * @port_modify_mask and @port_modify structure. + */ +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify) +{ + return device->modify_port(device, port_num, port_modify_mask, + port_modify); +} +EXPORT_SYMBOL(ib_modify_port); + +static int __init ib_core_init(void) +{ + int ret; + + ret = ib_sysfs_setup(); + if (ret) + printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); + + ret = ib_cache_setup(); + if (ret) { + printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n"); + ib_sysfs_cleanup(); + } + + return ret; +} + +static void __exit ib_core_cleanup(void) +{ + ib_cache_cleanup(); + ib_sysfs_cleanup(); +} + +module_init(ib_core_init); +module_exit(ib_core_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/fmr_pool.c 2004-12-19 22:04:12.262892041 -0800 @@ -0,0 +1,507 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: fmr_pool.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +enum { + IB_FMR_MAX_REMAPS = 32, + + IB_FMR_HASH_BITS = 8, + IB_FMR_HASH_SIZE = 1 << IB_FMR_HASH_BITS, + IB_FMR_HASH_MASK = IB_FMR_HASH_SIZE - 1 +}; + +/* + * If an FMR is not in use, then the list member will point to either + * its pool's free_list (if the FMR can be mapped again; that is, + * remap_count < IB_FMR_MAX_REMAPS) or its pool's dirty_list (if the + * FMR needs to be unmapped before being remapped). In either of + * these cases it is a bug if the ref_count is not 0. In other words, + * if ref_count is > 0, then the list member must not be linked into + * either free_list or dirty_list. + * + * The cache_node member is used to link the FMR into a cache bucket + * (if caching is enabled). This is independent of the reference + * count of the FMR. When a valid FMR is released, its ref_count is + * decremented, and if ref_count reaches 0, the FMR is placed in + * either free_list or dirty_list as appropriate. However, it is not + * removed from the cache and may be "revived" if a call to + * ib_fmr_register_physical() occurs before the FMR is remapped. In + * this case we just increment the ref_count and remove the FMR from + * free_list/dirty_list. + * + * Before we remap an FMR from free_list, we remove it from the cache + * (to prevent another user from obtaining a stale FMR). When an FMR + * is released, we add it to the tail of the free list, so that our + * cache eviction policy is "least recently used." + * + * All manipulation of ref_count, list and cache_node is protected by + * pool_lock to maintain consistency. 
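+ *
+ * To summarize the lifecycle described above:
+ *
+ *        in use:     ref_count > 0, not linked on free_list or dirty_list
+ *        released:   ref_count == 0, tail of free_list if it can still be
+ *                    remapped, dirty_list if it must be unmapped first
+ *        cache_node: kept hashed until the FMR is actually remapped, so a
+ *                    released FMR can be revived straight from the cache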
+ */ + +struct ib_fmr_pool { + spinlock_t pool_lock; + + int pool_size; + int max_pages; + int dirty_watermark; + int dirty_len; + struct list_head free_list; + struct list_head dirty_list; + struct hlist_head *cache_bucket; + + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + + struct task_struct *thread; + + atomic_t req_ser; + atomic_t flush_ser; + + wait_queue_head_t force_wait; +}; + +static inline u32 ib_fmr_hash(u64 first_page) +{ + return jhash_2words((u32) first_page, + (u32) (first_page >> 32), + 0); +} + +/* Caller must hold pool_lock */ +static inline struct ib_pool_fmr *ib_fmr_cache_lookup(struct ib_fmr_pool *pool, + u64 *page_list, + int page_list_len, + u64 io_virtual_address) +{ + struct hlist_head *bucket; + struct ib_pool_fmr *fmr; + struct hlist_node *pos; + + if (!pool->cache_bucket) + return NULL; + + bucket = pool->cache_bucket + ib_fmr_hash(*page_list); + + hlist_for_each_entry(fmr, pos, bucket, cache_node) + if (io_virtual_address == fmr->io_virtual_address && + page_list_len == fmr->page_list_len && + !memcmp(page_list, fmr->page_list, + page_list_len * sizeof *page_list)) + return fmr; + + return NULL; +} + +static void ib_fmr_batch_release(struct ib_fmr_pool *pool) +{ + int ret; + struct ib_pool_fmr *fmr; + LIST_HEAD(unmap_list); + LIST_HEAD(fmr_list); + + spin_lock_irq(&pool->pool_lock); + + list_for_each_entry(fmr, &pool->dirty_list, list) { + hlist_del_init(&fmr->cache_node); + fmr->remap_count = 0; + list_add_tail(&fmr->fmr->list, &fmr_list); + +#ifdef DEBUG + if (fmr->ref_count !=0) { + printk(KERN_WARNING "Unmapping FMR 0x%08x with ref count %d", + fmr, fmr->ref_count); + } +#endif + } + + list_splice(&pool->dirty_list, &unmap_list); + INIT_LIST_HEAD(&pool->dirty_list); + pool->dirty_len = 0; + + spin_unlock_irq(&pool->pool_lock); + + if (list_empty(&unmap_list)) { + return; + } + + ret = ib_unmap_fmr(&fmr_list); + if (ret) + printk(KERN_WARNING "ib_unmap_fmr returned %d", ret); + + spin_lock_irq(&pool->pool_lock); + list_splice(&unmap_list, &pool->free_list); + spin_unlock_irq(&pool->pool_lock); +} + +static int ib_fmr_cleanup_thread(void *pool_ptr) +{ + struct ib_fmr_pool *pool = pool_ptr; + + do { + if (pool->dirty_len >= pool->dirty_watermark || + atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) < 0) { + ib_fmr_batch_release(pool); + + atomic_inc(&pool->flush_ser); + wake_up_interruptible(&pool->force_wait); + + if (pool->flush_function) + pool->flush_function(pool, pool->flush_arg); + } + + set_current_state(TASK_INTERRUPTIBLE); + if (pool->dirty_len < pool->dirty_watermark && + atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) >= 0 && + !kthread_should_stop()) + schedule(); + __set_current_state(TASK_RUNNING); + } while (!kthread_should_stop()); + + return 0; +} + +/** + * ib_create_fmr_pool - Create an FMR pool + * @pd:Protection domain for FMRs + * @params:FMR pool parameters + * + * Create a pool of FMRs. Return value is pointer to new pool or + * error code if creation failed. 
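+ *
+ * Sketch (the numbers are illustrative, and IB_ACCESS_LOCAL_WRITE is just
+ * an example access flag):
+ *
+ *        struct ib_fmr_pool_param params = {
+ *                .pool_size         = 32,
+ *                .dirty_watermark   = 8,
+ *                .max_pages_per_fmr = 64,
+ *                .access            = IB_ACCESS_LOCAL_WRITE,
+ *                .cache             = 1,
+ *        };
+ *
+ *        pool = ib_create_fmr_pool(pd, &params);
+ *        if (IS_ERR(pool))
+ *                return PTR_ERR(pool);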
+ */ +struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params) +{ + struct ib_device *device; + struct ib_fmr_pool *pool; + int i; + int ret; + + if (!params) + return ERR_PTR(-EINVAL); + + device = pd->device; + if (!device->alloc_fmr || !device->dealloc_fmr || + !device->map_phys_fmr || !device->unmap_fmr) { + printk(KERN_WARNING "Device %s does not support fast memory regions", + device->name); + return ERR_PTR(-ENOSYS); + } + + pool = kmalloc(sizeof *pool, GFP_KERNEL); + if (!pool) { + printk(KERN_WARNING "couldn't allocate pool struct"); + return ERR_PTR(-ENOMEM); + } + + pool->cache_bucket = NULL; + + pool->flush_function = params->flush_function; + pool->flush_arg = params->flush_arg; + + INIT_LIST_HEAD(&pool->free_list); + INIT_LIST_HEAD(&pool->dirty_list); + + if (params->cache) { + pool->cache_bucket = + kmalloc(IB_FMR_HASH_SIZE * sizeof *pool->cache_bucket, + GFP_KERNEL); + if (!pool->cache_bucket) { + printk(KERN_WARNING "Failed to allocate cache in pool"); + ret = -ENOMEM; + goto out_free_pool; + } + + for (i = 0; i < IB_FMR_HASH_SIZE; ++i) + INIT_HLIST_HEAD(pool->cache_bucket + i); + } + + pool->pool_size = 0; + pool->max_pages = params->max_pages_per_fmr; + pool->dirty_watermark = params->dirty_watermark; + pool->dirty_len = 0; + spin_lock_init(&pool->pool_lock); + atomic_set(&pool->req_ser, 0); + atomic_set(&pool->flush_ser, 0); + init_waitqueue_head(&pool->force_wait); + + pool->thread = kthread_create(ib_fmr_cleanup_thread, + pool, + "ib_fmr(%s)", + device->name); + if (IS_ERR(pool->thread)) { + printk(KERN_WARNING "couldn't start cleanup thread"); + ret = PTR_ERR(pool->thread); + goto out_free_pool; + } + + { + struct ib_pool_fmr *fmr; + struct ib_fmr_attr attr = { + .max_pages = params->max_pages_per_fmr, + .max_maps = IB_FMR_MAX_REMAPS, + .page_size = PAGE_SHIFT + }; + + for (i = 0; i < params->pool_size; ++i) { + fmr = kmalloc(sizeof *fmr + params->max_pages_per_fmr * sizeof (u64), + GFP_KERNEL); + if (!fmr) { + printk(KERN_WARNING "failed to allocate fmr struct " + "for FMR %d", i); + goto out_fail; + } + + fmr->pool = pool; + fmr->remap_count = 0; + fmr->ref_count = 0; + INIT_HLIST_NODE(&fmr->cache_node); + + fmr->fmr = ib_alloc_fmr(pd, params->access, &attr); + if (IS_ERR(fmr->fmr)) { + printk(KERN_WARNING "fmr_create failed for FMR %d", i); + kfree(fmr); + goto out_fail; + } + + list_add_tail(&fmr->list, &pool->free_list); + ++pool->pool_size; + } + } + + return pool; + + out_free_pool: + kfree(pool->cache_bucket); + kfree(pool); + + return ERR_PTR(ret); + + out_fail: + ib_destroy_fmr_pool(pool); + + return ERR_PTR(-ENOMEM); +} +EXPORT_SYMBOL(ib_create_fmr_pool); + +/** + * ib_destroy_fmr_pool - Free FMR pool + * @pool:FMR pool to free + * + * Destroy an FMR pool and free all associated resources. + */ +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool) +{ + struct ib_pool_fmr *fmr; + struct ib_pool_fmr *tmp; + int i; + + kthread_stop(pool->thread); + ib_fmr_batch_release(pool); + + i = 0; + list_for_each_entry_safe(fmr, tmp, &pool->free_list, list) { + ib_dealloc_fmr(fmr->fmr); + list_del(&fmr->list); + kfree(fmr); + ++i; + } + + if (i < pool->pool_size) + printk(KERN_WARNING "pool still has %d regions registered", + pool->pool_size - i); + + kfree(pool->cache_bucket); + kfree(pool); + + return 0; +} +EXPORT_SYMBOL(ib_destroy_fmr_pool); + +/** + * ib_flush_fmr_pool - Invalidate all unmapped FMRs + * @pool:FMR pool to flush + * + * Ensure that all unmapped FMRs are fully invalidated. 
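+ *
+ * For example (sketch), a consumer about to reuse or free memory that was
+ * covered by pool FMRs can unmap them and then force the invalidate to
+ * finish:
+ *
+ *        ib_fmr_pool_unmap(pool_fmr);
+ *        ib_flush_fmr_pool(pool);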
+ */ +int ib_flush_fmr_pool(struct ib_fmr_pool *pool) +{ + int serial; + + atomic_inc(&pool->req_ser); + /* + * It's OK if someone else bumps req_ser again here -- we'll + * just wait a little longer. + */ + serial = atomic_read(&pool->req_ser); + + wake_up_process(pool->thread); + + if (wait_event_interruptible(pool->force_wait, + atomic_read(&pool->flush_ser) - + atomic_read(&pool->req_ser) >= 0)) + return -EINTR; + + return 0; +} +EXPORT_SYMBOL(ib_flush_fmr_pool); + +/** + * ib_fmr_pool_map_phys - + * @pool:FMR pool to allocate FMR from + * @page_list:List of pages to map + * @list_len:Number of pages in @page_list + * @io_virtual_address:I/O virtual address for new FMR + * + * Map an FMR from an FMR pool. + */ +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address) +{ + struct ib_fmr_pool *pool = pool_handle; + struct ib_pool_fmr *fmr; + unsigned long flags; + int result; + + if (list_len < 1 || list_len > pool->max_pages) + return ERR_PTR(-EINVAL); + + spin_lock_irqsave(&pool->pool_lock, flags); + fmr = ib_fmr_cache_lookup(pool, + page_list, + list_len, + *io_virtual_address); + if (fmr) { + /* found in cache */ + ++fmr->ref_count; + if (fmr->ref_count == 1) { + list_del(&fmr->list); + } + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return fmr; + } + + if (list_empty(&pool->free_list)) { + spin_unlock_irqrestore(&pool->pool_lock, flags); + return ERR_PTR(-EAGAIN); + } + + fmr = list_entry(pool->free_list.next, struct ib_pool_fmr, list); + list_del(&fmr->list); + hlist_del_init(&fmr->cache_node); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + result = ib_map_phys_fmr(fmr->fmr, page_list, list_len, + *io_virtual_address); + + if (result) { + spin_lock_irqsave(&pool->pool_lock, flags); + list_add(&fmr->list, &pool->free_list); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + printk(KERN_WARNING "fmr_map returns %d", + result); + + return ERR_PTR(result); + } + + ++fmr->remap_count; + fmr->ref_count = 1; + + if (pool->cache_bucket) { + fmr->io_virtual_address = *io_virtual_address; + fmr->page_list_len = list_len; + memcpy(fmr->page_list, page_list, list_len * sizeof(*page_list)); + + spin_lock_irqsave(&pool->pool_lock, flags); + hlist_add_head(&fmr->cache_node, + pool->cache_bucket + ib_fmr_hash(fmr->page_list[0])); + spin_unlock_irqrestore(&pool->pool_lock, flags); + } + + return fmr; +} +EXPORT_SYMBOL(ib_fmr_pool_map_phys); + +/** + * ib_fmr_pool_unmap - Unmap FMR + * @fmr:FMR to unmap + * + * Unmap an FMR. The FMR mapping may remain valid until the FMR is + * reused (or until ib_flush_fmr_pool() is called). + */ +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr) +{ + struct ib_fmr_pool *pool; + unsigned long flags; + + pool = fmr->pool; + + spin_lock_irqsave(&pool->pool_lock, flags); + + --fmr->ref_count; + if (!fmr->ref_count) { + if (fmr->remap_count < IB_FMR_MAX_REMAPS) { + list_add_tail(&fmr->list, &pool->free_list); + } else { + list_add_tail(&fmr->list, &pool->dirty_list); + ++pool->dirty_len; + wake_up_process(pool->thread); + } + } + +#ifdef DEBUG + if (fmr->ref_count < 0) + printk(KERN_WARNING "FMR %p has ref count %d < 0", + fmr, fmr->ref_count); +#endif + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_fmr_pool_unmap); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/packer.c 2004-12-19 22:04:12.010929171 -0800 @@ -0,0 +1,201 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: packer.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +static u64 value_read(int offset, int size, void *structure) +{ + switch (size) { + case 1: return *(u8 *) (structure + offset); + case 2: return be16_to_cpup((__be16 *) (structure + offset)); + case 4: return be32_to_cpup((__be32 *) (structure + offset)); + case 8: return be64_to_cpup((__be64 *) (structure + offset)); + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + return 0; + } +} + +/** + * ib_pack - Pack a structure into a buffer + * @desc:Array of structure field descriptions + * @desc_len:Number of entries in @desc + * @structure:Structure to pack from + * @buf:Buffer to pack into + * + * ib_pack() packs a list of structure fields into a buffer, + * controlled by the array of fields in @desc. 
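+ *
+ * Sketch (a single hypothetical 16-bit field of "struct my_hdr"; the
+ * offsets are examples, not a real wire format):
+ *
+ *        static const struct ib_field my_hdr_desc[] = {
+ *                { .struct_offset_bytes = offsetof(struct my_hdr, dlid),
+ *                  .struct_size_bytes   = 2,
+ *                  .offset_words        = 0,
+ *                  .offset_bits         = 16,
+ *                  .size_bits           = 16,
+ *                  .field_name          = "dlid" }
+ *        };
+ *
+ *        ib_pack(my_hdr_desc, ARRAY_SIZE(my_hdr_desc), &hdr, buf);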
+ */ +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + __be32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be32(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be32 *) buf + desc[i].offset_words; + *addr = (*addr & ~mask) | (cpu_to_be32(val) & mask); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + __be64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be64(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be64 *) ((__be32 *) buf + desc[i].offset_words); + *addr = (*addr & ~mask) | (cpu_to_be64(val) & mask); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + if (desc[i].struct_size_bytes) + memcpy(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + structure + desc[i].struct_offset_bytes, + desc[i].size_bits / 8); + else + memset(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + 0, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_pack); + +static void value_write(int offset, int size, u64 val, void *structure) +{ + switch (size * 8) { + case 8: *( u8 *) (structure + offset) = val; break; + case 16: *(__be16 *) (structure + offset) = cpu_to_be16(val); break; + case 32: *(__be32 *) (structure + offset) = cpu_to_be32(val); break; + case 64: *(__be64 *) (structure + offset) = cpu_to_be64(val); break; + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + } +} + +/** + * ib_unpack - Unpack a buffer into a structure + * @desc:Array of structure field descriptions + * @desc_len:Number of entries in @desc + * @buf:Buffer to unpack from + * @structure:Structure to unpack into + * + * ib_pack() unpacks a list of structure fields from a buffer, + * controlled by the array of fields in @desc. 
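+ *
+ * With the descriptor array from the ib_pack() sketch above, the reverse
+ * direction is simply:
+ *
+ *        ib_unpack(my_hdr_desc, ARRAY_SIZE(my_hdr_desc), buf, &hdr);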
+ */ +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (!desc[i].struct_size_bytes) + continue; + + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + u32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be32 *) buf + desc[i].offset_words; + val = (be32_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + u64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be64 *) buf + desc[i].offset_words; + val = (be64_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + memcpy(structure + desc[i].struct_offset_bytes, + buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_unpack); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sysfs.c 2004-12-19 22:04:12.213899261 -0800 @@ -0,0 +1,695 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include "core_priv.h" + +#include + +struct ib_port { + struct kobject kobj; + struct ib_device *ibdev; + struct attribute_group gid_group; + struct attribute **gid_attr; + struct attribute_group pkey_group; + struct attribute **pkey_attr; + u8 port_num; +}; + +struct port_attribute { + struct attribute attr; + ssize_t (*show)(struct ib_port *, struct port_attribute *, char *buf); + ssize_t (*store)(struct ib_port *, struct port_attribute *, + const char *buf, size_t count); +}; + +#define PORT_ATTR(_name, _mode, _show, _store) \ +struct port_attribute port_attr_##_name = __ATTR(_name, _mode, _show, _store) + +#define PORT_ATTR_RO(_name) \ +struct port_attribute port_attr_##_name = __ATTR_RO(_name) + +struct port_table_attribute { + struct port_attribute attr; + int index; +}; + +static ssize_t port_attr_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct port_attribute *port_attr = + container_of(attr, struct port_attribute, attr); + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + + if (!port_attr->show) + return 0; + + return port_attr->show(p, port_attr, buf); +} + +static struct sysfs_ops port_sysfs_ops = { + .show = port_attr_show +}; + +static ssize_t state_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + static const char *state_name[] = { + [IB_PORT_NOP] = "NOP", + [IB_PORT_DOWN] = "DOWN", + [IB_PORT_INIT] = "INIT", + [IB_PORT_ARMED] = "ARMED", + [IB_PORT_ACTIVE] = "ACTIVE", + [IB_PORT_ACTIVE_DEFER] = "ACTIVE_DEFER" + }; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d: %s\n", attr.state, + attr.state >= 0 && attr.state <= ARRAY_SIZE(state_name) ? 
+ state_name[attr.state] : "UNKNOWN"); +} + +static ssize_t lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.lid); +} + +static ssize_t lid_mask_count_show(struct ib_port *p, + struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.lmc); +} + +static ssize_t sm_lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.sm_lid); +} + +static ssize_t sm_sl_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.sm_sl); +} + +static ssize_t cap_mask_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%08x\n", attr.port_cap_flags); +} + +static PORT_ATTR_RO(state); +static PORT_ATTR_RO(lid); +static PORT_ATTR_RO(lid_mask_count); +static PORT_ATTR_RO(sm_lid); +static PORT_ATTR_RO(sm_sl); +static PORT_ATTR_RO(cap_mask); + +static struct attribute *port_default_attrs[] = { + &port_attr_state.attr, + &port_attr_lid.attr, + &port_attr_lid_mask_count.attr, + &port_attr_sm_lid.attr, + &port_attr_sm_sl.attr, + &port_attr_cap_mask.attr, + NULL +}; + +static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + ssize_t ret; + + ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) gid.raw)[0]), + be16_to_cpu(((u16 *) gid.raw)[1]), + be16_to_cpu(((u16 *) gid.raw)[2]), + be16_to_cpu(((u16 *) gid.raw)[3]), + be16_to_cpu(((u16 *) gid.raw)[4]), + be16_to_cpu(((u16 *) gid.raw)[5]), + be16_to_cpu(((u16 *) gid.raw)[6]), + be16_to_cpu(((u16 *) gid.raw)[7])); +} + +static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + u16 pkey; + ssize_t ret; + + ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey); + if (ret) + return ret; + + return sprintf(buf, "0x%04x\n", pkey); +} + +#define PORT_PMA_ATTR(_name, _counter, _width, _offset) \ +struct port_table_attribute port_pma_attr_##_name = { \ + .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL), \ + .index = (_offset) | ((_width) << 16) | ((_counter) << 24) \ +} + +static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + int offset = tab_attr->index & 0xffff; + int width = (tab_attr->index >> 16) & 0xff; + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + ssize_t ret; + + if (!p->ibdev->process_mad) + return sprintf(buf, "N/A (no PMA)\n"); + 
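+        /*
+         * Build a Performance Management GET(PortCounters) MAD for this
+         * port, hand it to the driver's process_mad method, and pull the
+         * requested counter out of the reply using the offset and width
+         * encoded in the attribute's index.
+         */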
+ in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + if (!in_mad || !out_mad) { + ret = -ENOMEM; + goto out; + } + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(0x12); /* PortCounters */ + + in_mad->data[41] = p->port_num; /* PortSelect field */ + + if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, p->port_num, 0xffff, + in_mad, out_mad) & + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) != + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) { + ret = -EINVAL; + goto out; + } + + switch (width) { + case 4: + ret = sprintf(buf, "%u\n", (out_mad->data[40 + offset / 8] >> + (offset % 4)) & 0xf); + break; + case 8: + ret = sprintf(buf, "%u\n", out_mad->data[40 + offset / 8]); + break; + case 16: + ret = sprintf(buf, "%u\n", + be16_to_cpup((u16 *)(out_mad->data + 40 + offset / 8))); + break; + case 32: + ret = sprintf(buf, "%u\n", + be32_to_cpup((u32 *)(out_mad->data + 40 + offset / 8))); + break; + default: + ret = 0; + } + +out: + kfree(in_mad); + kfree(out_mad); + + return ret; +} + +static PORT_PMA_ATTR(symbol_error , 0, 16, 32); +static PORT_PMA_ATTR(link_error_recovery , 1, 8, 48); +static PORT_PMA_ATTR(link_downed , 2, 8, 56); +static PORT_PMA_ATTR(port_rcv_errors , 3, 16, 64); +static PORT_PMA_ATTR(port_rcv_remote_physical_errors, 4, 16, 80); +static PORT_PMA_ATTR(port_rcv_switch_relay_errors , 5, 16, 96); +static PORT_PMA_ATTR(port_xmit_discards , 6, 16, 112); +static PORT_PMA_ATTR(port_xmit_constraint_errors , 7, 8, 128); +static PORT_PMA_ATTR(port_rcv_constraint_errors , 8, 8, 136); +static PORT_PMA_ATTR(local_link_integrity_errors , 9, 4, 152); +static PORT_PMA_ATTR(excessive_buffer_overrun_errors, 10, 4, 156); +static PORT_PMA_ATTR(VL15_dropped , 11, 16, 176); +static PORT_PMA_ATTR(port_xmit_data , 12, 32, 192); +static PORT_PMA_ATTR(port_rcv_data , 13, 32, 224); +static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256); +static PORT_PMA_ATTR(port_rcv_packets , 15, 32, 288); + +static struct attribute *pma_attrs[] = { + &port_pma_attr_symbol_error.attr.attr, + &port_pma_attr_link_error_recovery.attr.attr, + &port_pma_attr_link_downed.attr.attr, + &port_pma_attr_port_rcv_errors.attr.attr, + &port_pma_attr_port_rcv_remote_physical_errors.attr.attr, + &port_pma_attr_port_rcv_switch_relay_errors.attr.attr, + &port_pma_attr_port_xmit_discards.attr.attr, + &port_pma_attr_port_xmit_constraint_errors.attr.attr, + &port_pma_attr_port_rcv_constraint_errors.attr.attr, + &port_pma_attr_local_link_integrity_errors.attr.attr, + &port_pma_attr_excessive_buffer_overrun_errors.attr.attr, + &port_pma_attr_VL15_dropped.attr.attr, + &port_pma_attr_port_xmit_data.attr.attr, + &port_pma_attr_port_rcv_data.attr.attr, + &port_pma_attr_port_xmit_packets.attr.attr, + &port_pma_attr_port_rcv_packets.attr.attr, + NULL +}; + +static struct attribute_group pma_group = { + .name = "counters", + .attrs = pma_attrs +}; + +static void ib_port_release(struct kobject *kobj) +{ + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + struct attribute *a; + int i; + + for (i = 0; (a = p->gid_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + for (i = 0; (a = p->pkey_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + kfree(p->gid_attr); + kfree(p); +} + +static struct kobj_type port_type = { + .release = ib_port_release, + .sysfs_ops = &port_sysfs_ops, + 
.default_attrs = port_default_attrs +}; + +static void ib_device_release(struct class_device *cdev) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + + kfree(dev); +} + +static int ib_device_hotplug(struct class_device *cdev, char **envp, + int num_envp, char *buf, int size) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + int i = 0, len = 0; + + if (add_hotplug_env_var(envp, num_envp, &i, buf, size, &len, + "NAME=%s", dev->name)) + return -ENOMEM; + + /* + * It might be nice to pass the node GUID to hotplug, but + * right now the only way to get it is to query the device + * provider, and this can crash during device removal because + * we are will be running after driver removal has started. + * We could add a node_guid field to struct ib_device, or we + * could just let the hotplug script read the node GUID from + * sysfs when devices are added. + */ + + envp[i] = NULL; + return 0; +} + +static int alloc_group(struct attribute ***attr, + ssize_t (*show)(struct ib_port *, + struct port_attribute *, char *buf), + int len) +{ + struct port_table_attribute ***tab_attr = + (struct port_table_attribute ***) attr; + int i; + int ret; + + *tab_attr = kmalloc((1 + len) * sizeof *tab_attr, GFP_KERNEL); + if (!*tab_attr) + return -ENOMEM; + + memset(*tab_attr, 0, (1 + len) * sizeof *tab_attr); + + for (i = 0; i < len; ++i) { + (*tab_attr)[i] = kmalloc(sizeof *(*tab_attr)[i], GFP_KERNEL); + if (!(*tab_attr)[i]) { + ret = -ENOMEM; + goto err; + } + memset((*tab_attr)[i], 0, sizeof *(*tab_attr)[i]); + (*tab_attr)[i]->attr.attr.name = kmalloc(8, GFP_KERNEL); + if (!(*tab_attr)[i]->attr.attr.name) { + ret = -ENOMEM; + goto err; + } + + if (snprintf((*tab_attr)[i]->attr.attr.name, 8, "%d", i) >= 8) { + ret = -ENOMEM; + goto err; + } + + (*tab_attr)[i]->attr.attr.mode = S_IRUGO; + (*tab_attr)[i]->attr.attr.owner = THIS_MODULE; + (*tab_attr)[i]->attr.show = show; + (*tab_attr)[i]->index = i; + } + + return 0; + +err: + for (i = 0; i < len; ++i) { + if ((*tab_attr)[i]) + kfree((*tab_attr)[i]->attr.attr.name); + kfree((*tab_attr)[i]); + } + + kfree(*tab_attr); + + return ret; +} + +static int add_port(struct ib_device *device, int port_num) +{ + struct ib_port *p; + struct ib_port_attr attr; + int i; + int ret; + + ret = ib_query_port(device, port_num, &attr); + if (ret) + return ret; + + p = kmalloc(sizeof *p, GFP_KERNEL); + if (!p) + return -ENOMEM; + memset(p, 0, sizeof *p); + + p->ibdev = device; + p->port_num = port_num; + p->kobj.ktype = &port_type; + + p->kobj.parent = kobject_get(&device->ports_parent); + if (!p->kobj.parent) { + ret = -EBUSY; + goto err; + } + + ret = kobject_set_name(&p->kobj, "%d", port_num); + if (ret) + goto err_put; + + ret = kobject_register(&p->kobj); + if (ret) + goto err_put; + + ret = sysfs_create_group(&p->kobj, &pma_group); + if (ret) + goto err_put; + + ret = alloc_group(&p->gid_attr, show_port_gid, attr.gid_tbl_len); + if (ret) + goto err_remove_pma; + + p->gid_group.name = "gids"; + p->gid_group.attrs = p->gid_attr; + + ret = sysfs_create_group(&p->kobj, &p->gid_group); + if (ret) + goto err_free_gid; + + ret = alloc_group(&p->pkey_attr, show_port_pkey, attr.pkey_tbl_len); + if (ret) + goto err_remove_gid; + + p->pkey_group.name = "pkeys"; + p->pkey_group.attrs = p->pkey_attr; + + ret = sysfs_create_group(&p->kobj, &p->pkey_group); + if (ret) + goto err_free_pkey; + + list_add_tail(&p->kobj.entry, &device->port_list); + + return 0; + +err_free_pkey: + for (i = 0; i < attr.pkey_tbl_len; ++i) { + 
kfree(p->pkey_attr[i]->name); + kfree(p->pkey_attr[i]); + } + + kfree(p->pkey_attr); + +err_remove_gid: + sysfs_remove_group(&p->kobj, &p->gid_group); + +err_free_gid: + for (i = 0; i < attr.gid_tbl_len; ++i) { + kfree(p->gid_attr[i]->name); + kfree(p->gid_attr[i]); + } + + kfree(p->gid_attr); + +err_remove_pma: + sysfs_remove_group(&p->kobj, &pma_group); + +err_put: + kobject_put(&device->ports_parent); + +err: + kfree(p); + return ret; +} + +static ssize_t show_sys_image_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.sys_image_guid)[0]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[1]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[2]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[3])); +} + +static ssize_t show_node_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.node_guid)[0]), + be16_to_cpu(((u16 *) &attr.node_guid)[1]), + be16_to_cpu(((u16 *) &attr.node_guid)[2]), + be16_to_cpu(((u16 *) &attr.node_guid)[3])); +} + +static CLASS_DEVICE_ATTR(sys_image_guid, S_IRUGO, show_sys_image_guid, NULL); +static CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); + +static struct class_device_attribute *ib_class_attributes[] = { + &class_device_attr_sys_image_guid, + &class_device_attr_node_guid +}; + +static struct class ib_class = { + .name = "infiniband", + .release = ib_device_release, + .hotplug = ib_device_hotplug, +}; + +int ib_device_register_sysfs(struct ib_device *device) +{ + struct class_device *class_dev = &device->class_dev; + int ret; + int i; + + class_dev->class = &ib_class; + class_dev->class_data = device; + strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); + + INIT_LIST_HEAD(&device->port_list); + + ret = class_device_register(class_dev); + if (ret) + goto err; + + for (i = 0; i < ARRAY_SIZE(ib_class_attributes); ++i) { + ret = class_device_create_file(class_dev, ib_class_attributes[i]); + if (ret) + goto err_unregister; + } + + device->ports_parent.parent = kobject_get(&class_dev->kobj); + if (!device->ports_parent.parent) { + ret = -EBUSY; + goto err_unregister; + } + ret = kobject_set_name(&device->ports_parent, "ports"); + if (ret) + goto err_put; + ret = kobject_register(&device->ports_parent); + if (ret) + goto err_put; + + if (device->node_type == IB_NODE_SWITCH) { + ret = add_port(device, 0); + if (ret) + goto err_put; + } else { + int i; + + for (i = 1; i <= device->phys_port_cnt; ++i) { + ret = add_port(device, i); + if (ret) + goto err_put; + } + } + + return 0; + +err_put: + { + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + } + + kobject_put(&class_dev->kobj); + +err_unregister: + class_device_unregister(class_dev); + +err: + return ret; +} + +void ib_device_unregister_sysfs(struct ib_device *device) +{ + struct kobject *p, *t; + struct ib_port *port; + 
+ list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + + kobject_unregister(&device->ports_parent); + class_device_unregister(&device->class_dev); +} + +int ib_sysfs_setup(void) +{ + return class_register(&ib_class); +} + +void ib_sysfs_cleanup(void) +{ + class_unregister(&ib_class); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/ud_header.c 2004-12-19 22:04:12.034925635 -0800 @@ -0,0 +1,365 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ud_header.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include + +#define STRUCT_FIELD(header, field) \ + .struct_offset_bytes = offsetof(struct ib_unpacked_ ## header, field), \ + .struct_size_bytes = sizeof ((struct ib_unpacked_ ## header *) 0)->field, \ + .field_name = #header ":" #field + +static const struct ib_field lrh_table[] = { + { STRUCT_FIELD(lrh, virtual_lane), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, link_version), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, service_level), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 4 }, + { RESERVED, + .offset_words = 0, + .offset_bits = 12, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, link_next_header), + .offset_words = 0, + .offset_bits = 14, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, destination_lid), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 5 }, + { STRUCT_FIELD(lrh, packet_length), + .offset_words = 1, + .offset_bits = 5, + .size_bits = 11 }, + { STRUCT_FIELD(lrh, source_lid), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 } +}; + +static const struct ib_field grh_table[] = { + { STRUCT_FIELD(grh, ip_version), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(grh, traffic_class), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 8 }, + { STRUCT_FIELD(grh, flow_label), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 20 }, + { STRUCT_FIELD(grh, payload_length), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { STRUCT_FIELD(grh, next_header), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 8 }, + { STRUCT_FIELD(grh, hop_limit), + .offset_words = 1, + .offset_bits = 24, + .size_bits = 8 }, + { STRUCT_FIELD(grh, source_gid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { STRUCT_FIELD(grh, destination_gid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 } +}; + +static const struct ib_field bth_table[] = { + { STRUCT_FIELD(bth, opcode), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, solicited_event), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 1 }, + { STRUCT_FIELD(bth, mig_req), + .offset_words = 0, + .offset_bits = 9, + .size_bits = 1 }, + { STRUCT_FIELD(bth, pad_count), + .offset_words = 0, + .offset_bits = 10, + .size_bits = 2 }, + { STRUCT_FIELD(bth, transport_header_version), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 4 }, + { STRUCT_FIELD(bth, pkey), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, destination_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 }, + { STRUCT_FIELD(bth, ack_req), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 2, + .offset_bits = 1, + .size_bits = 7 }, + { STRUCT_FIELD(bth, psn), + .offset_words = 2, + .offset_bits = 8, + .size_bits = 24 } +}; + +static const struct ib_field deth_table[] = { + { STRUCT_FIELD(deth, qkey), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(deth, source_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 } +}; + +/** + * ib_ud_header_init - Initialize UD header structure + * @payload_bytes:Length of packet payload + 
* @grh_present:GRH flag (if non-zero, GRH will be included) + * @header:Structure to initialize + * + * ib_ud_header_init() initializes the lrh.link_version, lrh.link_next_header, + * lrh.packet_length, grh.ip_version, grh.payload_length, + * grh.next_header, bth.opcode, bth.pad_count and + * bth.transport_header_version fields of a &struct ib_ud_header given + * the payload length and whether a GRH will be included. + */ +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header) +{ + int header_len; + + memset(header, 0, sizeof *header); + + header_len = + IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES; + if (grh_present) { + header_len += IB_GRH_BYTES; + } + + header->lrh.link_version = 0; + header->lrh.link_next_header = + grh_present ? IB_LNH_IBA_GLOBAL : IB_LNH_IBA_LOCAL; + header->lrh.packet_length = (IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) / 4; /* round up */ + + header->grh_present = grh_present; + if (grh_present) { + header->lrh.packet_length += IB_GRH_BYTES / 4; + + header->grh.ip_version = 6; + header->grh.payload_length = + cpu_to_be16((IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) & ~3); /* round up */ + header->grh.next_header = 0x1b; + } + + cpu_to_be16s(&header->lrh.packet_length); + + if (header->immediate_present) + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + else + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY; + header->bth.pad_count = (4 - payload_bytes) & 3; + header->bth.transport_header_version = 0; +} +EXPORT_SYMBOL(ib_ud_header_init); + +/** + * ib_ud_header_pack - Pack UD header struct into wire format + * @header:UD header struct + * @buf:Buffer to pack into + * + * ib_ud_header_pack() packs the UD header structure @header into wire + * format in the buffer @buf. + */ +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf) +{ + int len = 0; + + ib_pack(lrh_table, ARRAY_SIZE(lrh_table), + &header->lrh, buf); + len += IB_LRH_BYTES; + + if (header->grh_present) { + ib_pack(grh_table, ARRAY_SIZE(grh_table), + &header->grh, buf + len); + len += IB_GRH_BYTES; + } + + ib_pack(bth_table, ARRAY_SIZE(bth_table), + &header->bth, buf + len); + len += IB_BTH_BYTES; + + ib_pack(deth_table, ARRAY_SIZE(deth_table), + &header->deth, buf + len); + len += IB_DETH_BYTES; + + if (header->immediate_present) { + memcpy(buf + len, &header->immediate_data, sizeof header->immediate_data); + len += sizeof header->immediate_data; + } + + return len; +} +EXPORT_SYMBOL(ib_ud_header_pack); + +/** + * ib_ud_header_unpack - Unpack UD header struct from wire format + * @header:UD header struct + * @buf:Buffer to unpack from + * + * ib_ud_header_unpack() unpacks the UD header structure @header from wire + * format in the buffer @buf.
+ */ +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header) +{ + ib_unpack(lrh_table, ARRAY_SIZE(lrh_table), + buf, &header->lrh); + buf += IB_LRH_BYTES; + + if (header->lrh.link_version != 0) { + printk(KERN_WARNING "Invalid LRH.link_version %d\n", + header->lrh.link_version); + return -EINVAL; + } + + switch (header->lrh.link_next_header) { + case IB_LNH_IBA_LOCAL: + header->grh_present = 0; + break; + + case IB_LNH_IBA_GLOBAL: + header->grh_present = 1; + ib_unpack(grh_table, ARRAY_SIZE(grh_table), + buf, &header->grh); + buf += IB_GRH_BYTES; + + if (header->grh.ip_version != 6) { + printk(KERN_WARNING "Invalid GRH.ip_version %d\n", + header->grh.ip_version); + return -EINVAL; + } + if (header->grh.next_header != 0x1b) { + printk(KERN_WARNING "Invalid GRH.next_header 0x%02x\n", + header->grh.next_header); + return -EINVAL; + } + break; + + default: + printk(KERN_WARNING "Invalid LRH.link_next_header %d\n", + header->lrh.link_next_header); + return -EINVAL; + } + + ib_unpack(bth_table, ARRAY_SIZE(bth_table), + buf, &header->bth); + buf += IB_BTH_BYTES; + + switch (header->bth.opcode) { + case IB_OPCODE_UD_SEND_ONLY: + header->immediate_present = 0; + break; + case IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE: + header->immediate_present = 1; + break; + default: + printk(KERN_WARNING "Invalid BTH.opcode 0x%02x\n", + header->bth.opcode); + return -EINVAL; + } + + if (header->bth.transport_header_version != 0) { + printk(KERN_WARNING "Invalid BTH.transport_header_version %d\n", + header->bth.transport_header_version); + return -EINVAL; + } + + ib_unpack(deth_table, ARRAY_SIZE(deth_table), + buf, &header->deth); + buf += IB_DETH_BYTES; + + if (header->immediate_present) + memcpy(&header->immediate_data, buf, sizeof header->immediate_data); + + return 0; +} +EXPORT_SYMBOL(ib_ud_header_unpack); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/verbs.c 2004-12-19 22:04:12.171905449 -0800 @@ -0,0 +1,433 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: verbs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include + +/* Protection domains */ + +struct ib_pd *ib_alloc_pd(struct ib_device *device) +{ + struct ib_pd *pd; + + pd = device->alloc_pd(device); + + if (!IS_ERR(pd)) { + pd->device = device; + atomic_set(&pd->usecnt, 0); + } + + return pd; +} +EXPORT_SYMBOL(ib_alloc_pd); + +int ib_dealloc_pd(struct ib_pd *pd) +{ + if (atomic_read(&pd->usecnt)) + return -EBUSY; + + return pd->device->dealloc_pd(pd); +} +EXPORT_SYMBOL(ib_dealloc_pd); + +/* Address handles */ + +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + struct ib_ah *ah; + + ah = pd->device->create_ah(pd, ah_attr); + + if (!IS_ERR(ah)) { + ah->device = pd->device; + ah->pd = pd; + atomic_inc(&pd->usecnt); + } + + return ah; +} +EXPORT_SYMBOL(ib_create_ah); + +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->modify_ah ? + ah->device->modify_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_modify_ah); + +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->query_ah ? + ah->device->query_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_ah); + +int ib_destroy_ah(struct ib_ah *ah) +{ + struct ib_pd *pd; + int ret; + + pd = ah->pd; + ret = ah->device->destroy_ah(ah); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_destroy_ah); + +/* Queue pairs */ + +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr) +{ + struct ib_qp *qp; + + qp = pd->device->create_qp(pd, qp_init_attr); + + if (!IS_ERR(qp)) { + qp->device = pd->device; + qp->pd = pd; + qp->send_cq = qp_init_attr->send_cq; + qp->recv_cq = qp_init_attr->recv_cq; + qp->srq = qp_init_attr->srq; + qp->event_handler = qp_init_attr->event_handler; + qp->qp_context = qp_init_attr->qp_context; + atomic_inc(&pd->usecnt); + atomic_inc(&qp_init_attr->send_cq->usecnt); + atomic_inc(&qp_init_attr->recv_cq->usecnt); + if (qp_init_attr->srq) + atomic_inc(&qp_init_attr->srq->usecnt); + } + + return qp; +} +EXPORT_SYMBOL(ib_create_qp); + +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask) +{ + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); +} +EXPORT_SYMBOL(ib_modify_qp); + +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr) +{ + return qp->device->query_qp ? 
+ qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_qp); + +int ib_destroy_qp(struct ib_qp *qp) +{ + struct ib_pd *pd; + struct ib_cq *scq, *rcq; + struct ib_srq *srq; + int ret; + + pd = qp->pd; + scq = qp->send_cq; + rcq = qp->recv_cq; + srq = qp->srq; + + ret = qp->device->destroy_qp(qp); + if (!ret) { + atomic_dec(&pd->usecnt); + atomic_dec(&scq->usecnt); + atomic_dec(&rcq->usecnt); + if (srq) + atomic_dec(&srq->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_destroy_qp); + +/* Completion queues */ + +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe) +{ + struct ib_cq *cq; + + cq = device->create_cq(device, cqe); + + if (!IS_ERR(cq)) { + cq->device = device; + cq->comp_handler = comp_handler; + cq->event_handler = event_handler; + cq->cq_context = cq_context; + atomic_set(&cq->usecnt, 0); + } + + return cq; +} +EXPORT_SYMBOL(ib_create_cq); + +int ib_destroy_cq(struct ib_cq *cq) +{ + if (atomic_read(&cq->usecnt)) + return -EBUSY; + + return cq->device->destroy_cq(cq); +} +EXPORT_SYMBOL(ib_destroy_cq); + +int ib_resize_cq(struct ib_cq *cq, + int cqe) +{ + int ret; + + if (!cq->device->resize_cq) + return -ENOSYS; + + ret = cq->device->resize_cq(cq, &cqe); + if (!ret) + cq->cqe = cqe; + + return ret; +} +EXPORT_SYMBOL(ib_resize_cq); + +/* Memory regions */ + +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) +{ + struct ib_mr *mr; + + mr = pd->device->get_dma_mr(pd, mr_access_flags); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_get_dma_mr); + +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_mr *mr; + + mr = pd->device->reg_phys_mr(pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_reg_phys_mr); + +int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_pd *old_pd; + int ret; + + if (!mr->device->rereg_phys_mr) + return -ENOSYS; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + old_pd = mr->pd; + + ret = mr->device->rereg_phys_mr(mr, mr_rereg_mask, pd, + phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!ret && (mr_rereg_mask & IB_MR_REREG_PD)) { + atomic_dec(&old_pd->usecnt); + atomic_inc(&pd->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_rereg_phys_mr); + +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) +{ + return mr->device->query_mr ? 
+ mr->device->query_mr(mr, mr_attr) : -ENOSYS; +} +EXPORT_SYMBOL(ib_query_mr); + +int ib_dereg_mr(struct ib_mr *mr) +{ + struct ib_pd *pd; + int ret; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + pd = mr->pd; + ret = mr->device->dereg_mr(mr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dereg_mr); + +/* Memory windows */ + +struct ib_mw *ib_alloc_mw(struct ib_pd *pd) +{ + struct ib_mw *mw; + + if (!pd->device->alloc_mw) + return ERR_PTR(-ENOSYS); + + mw = pd->device->alloc_mw(pd); + if (!IS_ERR(mw)) { + mw->device = pd->device; + mw->pd = pd; + atomic_inc(&pd->usecnt); + } + + return mw; +} +EXPORT_SYMBOL(ib_alloc_mw); + +int ib_dealloc_mw(struct ib_mw *mw) +{ + struct ib_pd *pd; + int ret; + + pd = mw->pd; + ret = mw->device->dealloc_mw(mw); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_mw); + +/* "Fast" memory regions */ + +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct ib_fmr *fmr; + + if (!pd->device->alloc_fmr) + return ERR_PTR(-ENOSYS); + + fmr = pd->device->alloc_fmr(pd, mr_access_flags, fmr_attr); + if (!IS_ERR(fmr)) { + fmr->device = pd->device; + fmr->pd = pd; + atomic_inc(&pd->usecnt); + } + + return fmr; +} +EXPORT_SYMBOL(ib_alloc_fmr); + +int ib_unmap_fmr(struct list_head *fmr_list) +{ + struct ib_fmr *fmr; + + if (list_empty(fmr_list)) + return 0; + + fmr = list_entry(fmr_list->next, struct ib_fmr, list); + return fmr->device->unmap_fmr(fmr_list); +} +EXPORT_SYMBOL(ib_unmap_fmr); + +int ib_dealloc_fmr(struct ib_fmr *fmr) +{ + struct ib_pd *pd; + int ret; + + pd = fmr->pd; + ret = fmr->device->dealloc_fmr(fmr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_fmr); + +/* Multicast groups */ + +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->attach_mcast ? + qp->device->attach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_attach_mcast); + +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->detach_mcast ? + qp->device->detach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_detach_mcast); From roland at topspin.com Sun Dec 19 22:14:53 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:53 -0800 Subject: [openib-general] [PATCH][v4][3/24] Hook up drivers/infiniband In-Reply-To: <200412192214.uWyITZT30vxqMJuO@topspin.com> Message-ID: <200412192214.qTsUPCeCMCi5dvjR@topspin.com> Add the appropriate lines to drivers/Kconfig and drivers/Makefile so that the kernel configuration and build systems know about drivers/infiniband. 
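With these hooks in place the new directory builds like any other driver subtree. As a purely illustrative sketch (only CONFIG_INFINIBAND is confirmed by the Makefile hunk below; the other symbols are assumed to be provided by drivers/infiniband/Kconfig elsewhere in this series), a .config enabling a modular build might contain:

CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_MTHCA=m
CONFIG_INFINIBAND_IPOIB=m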
Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/Kconfig 2004-12-19 21:09:51.000000000 -0800 +++ linux-bk/drivers/Kconfig 2004-12-19 22:04:12.806811886 -0800 @@ -54,4 +54,6 @@ source "drivers/usb/Kconfig" +source "drivers/infiniband/Kconfig" + endmenu --- linux-bk.orig/drivers/Makefile 2004-12-19 21:10:06.000000000 -0800 +++ linux-bk/drivers/Makefile 2004-12-19 22:04:12.806811886 -0800 @@ -59,4 +59,5 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ From roland at topspin.com Sun Dec 19 22:14:53 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:53 -0800 Subject: [openib-general] [PATCH][v4][4/24] Add InfiniBand MAD (management datagram) support (public headers) In-Reply-To: <200412192214.qTsUPCeCMCi5dvjR@topspin.com> Message-ID: <200412192214.Lv59G3vg3yqeEo9J@topspin.com> Add public headers for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_mad.h 2004-12-19 22:04:13.037777849 -0800 @@ -0,0 +1,404 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#if !defined( IB_MAD_H ) +#define IB_MAD_H + +#include + +/* Management base version */ +#define IB_MGMT_BASE_VERSION 1 + +/* Management classes */ +#define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01 +#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81 +#define IB_MGMT_CLASS_SUBN_ADM 0x03 +#define IB_MGMT_CLASS_PERF_MGMT 0x04 +#define IB_MGMT_CLASS_BM 0x05 +#define IB_MGMT_CLASS_DEVICE_MGMT 0x06 +#define IB_MGMT_CLASS_CM 0x07 +#define IB_MGMT_CLASS_SNMP 0x08 +#define IB_MGMT_CLASS_VENDOR_RANGE2_START 0x30 +#define IB_MGMT_CLASS_VENDOR_RANGE2_END 0x4F + +/* Management methods */ +#define IB_MGMT_METHOD_GET 0x01 +#define IB_MGMT_METHOD_SET 0x02 +#define IB_MGMT_METHOD_GET_RESP 0x81 +#define IB_MGMT_METHOD_SEND 0x03 +#define IB_MGMT_METHOD_TRAP 0x05 +#define IB_MGMT_METHOD_REPORT 0x06 +#define IB_MGMT_METHOD_REPORT_RESP 0x86 +#define IB_MGMT_METHOD_TRAP_REPRESS 0x07 + +#define IB_MGMT_METHOD_RESP 0x80 + +#define IB_MGMT_MAX_METHODS 128 + +#define IB_QP0 0 +#define IB_QP1 __constant_htonl(1) +#define IB_QP1_QKEY 0x80010000 + +struct ib_grh { + u32 version_tclass_flow; + u16 paylen; + u8 next_hdr; + u8 hop_limit; + union ib_gid sgid; + union ib_gid dgid; +} __attribute__ ((packed)); + +struct ib_mad_hdr { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u16 class_specific; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; +} __attribute__ ((packed)); + +struct ib_rmpp_hdr { + u8 rmpp_version; + u8 rmpp_type; + u8 rmpp_rtime_flags; + u8 rmpp_status; + u32 seg_num; + u32 paylen_newwin; +} __attribute__ ((packed)); + +struct ib_mad { + struct ib_mad_hdr mad_hdr; + u8 data[232]; +} __attribute__ ((packed)); + +struct ib_rmpp_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 data[220]; +} __attribute__ ((packed)); + +struct ib_vendor_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 reserved; + u8 oui[3]; + u8 data[216]; +} __attribute__ ((packed)); + +struct ib_mad_agent; +struct ib_mad_send_wc; +struct ib_mad_recv_wc; + +/** + * ib_mad_send_handler - callback handler for a sent MAD. + * @mad_agent: MAD agent that sent the MAD. + * @mad_send_wc: Send work completion information on the sent MAD. + */ +typedef void (*ib_mad_send_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_snoop_handler - Callback handler for snooping sent MADs. + * @mad_agent: MAD agent that snooped the MAD. + * @send_wr: Work request information on the sent MAD. + * @mad_send_wc: Work completion information on the sent MAD. Valid + * only for snooping that occurs on a send completion. + * + * Clients snooping MADs should not modify data referenced by the @send_wr + * or @mad_send_wc. + */ +typedef void (*ib_mad_snoop_handler)(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_recv_handler - callback handler for a received MAD. + * @mad_agent: MAD agent requesting the received MAD. + * @mad_recv_wc: Received work completion information on the received MAD. + * + * MADs received in response to a send request operation will be handed to + * the user after the send operation completes. All data buffers given + * to registered agents through this routine are owned by the receiving + * client, except for snooping agents. Clients snooping MADs should not + * modify the data referenced by @mad_recv_wc. 
+ */ +typedef void (*ib_mad_recv_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_mad_agent - Used to track MAD registration with the access layer. + * @device: Reference to device registration is on. + * @qp: Reference to QP used for sending and receiving MADs. + * @recv_handler: Callback handler for a received MAD. + * @send_handler: Callback handler for a sent MAD. + * @snoop_handler: Callback handler for snooped sent MADs. + * @context: User-specified context associated with this registration. + * @hi_tid: Access layer assigned transaction ID for this client. + * Unsolicited MADs sent by this client will have the upper 32-bits + * of their TID set to this value. + * @port_num: Port number on which QP is registered + */ +struct ib_mad_agent { + struct ib_device *device; + struct ib_qp *qp; + ib_mad_recv_handler recv_handler; + ib_mad_send_handler send_handler; + ib_mad_snoop_handler snoop_handler; + void *context; + u32 hi_tid; + u8 port_num; +}; + +/** + * ib_mad_send_wc - MAD send completion information. + * @wr_id: Work request identifier associated with the send MAD request. + * @status: Completion status. + * @vendor_err: Optional vendor error information returned with a failed + * request. + */ +struct ib_mad_send_wc { + u64 wr_id; + enum ib_wc_status status; + u32 vendor_err; +}; + +/** + * ib_mad_recv_buf - received MAD buffer information. + * @list: Reference to next data buffer for a received RMPP MAD. + * @grh: References a data buffer containing the global route header. + * The data referenced by this buffer is only valid if the GRH is + * valid. + * @mad: References the start of the received MAD. + */ +struct ib_mad_recv_buf { + struct list_head list; + struct ib_grh *grh; + struct ib_mad *mad; +}; + +/** + * ib_mad_recv_wc - received MAD information. + * @wc: Completion information for the received data. + * @recv_buf: Specifies the location of the received data buffer(s). + * @mad_len: The length of the received MAD, without duplicated headers. + * + * For received responses, the wr_id field of the wc is set to the wr_id + * for the corresponding send request. + */ +struct ib_mad_recv_wc { + struct ib_wc *wc; + struct ib_mad_recv_buf recv_buf; + int mad_len; +}; + +/** + * ib_mad_reg_req - MAD registration request + * @mgmt_class: Indicates which management class of MADs should be received + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version: Indicates which version of MADs for the given + * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. + * @method_mask: The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + */ +struct ib_mad_reg_req { + u8 mgmt_class; + u8 mgmt_class_version; + u8 oui[3]; + DECLARE_BITMAP(method_mask, IB_MGMT_MAX_METHODS); +}; + +/** + * ib_register_mad_agent - Register to send/receive MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP to access. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_reg_req: Specifies which unsolicited MADs should be received + * by the caller. This parameter may be NULL if the caller only + * wishes to receive solicited responses. + * @rmpp_version: If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version.
+ * If set to 0, indicates that RMPP is not used by this client. + * @send_handler: The completion callback routine invoked after a send + * request has completed. + * @recv_handler: The completion callback routine invoked for a received + * MAD. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +enum ib_mad_snoop_flags { + /*IB_MAD_SNOOP_POSTED_SENDS = 1,*/ + /*IB_MAD_SNOOP_RMPP_SENDS = (1<<1),*/ + IB_MAD_SNOOP_SEND_COMPLETIONS = (1<<2), + /*IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS = (1<<3),*/ + IB_MAD_SNOOP_RECVS = (1<<4) + /*IB_MAD_SNOOP_RMPP_RECVS = (1<<5),*/ + /*IB_MAD_SNOOP_REDIRECTED_QPS = (1<<6)*/ +}; + +/** + * ib_register_mad_snoop - Register to snoop sent and received MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP traffic to snoop. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_snoop_flags: Specifies information where snooping occurs. + * @send_handler: The callback routine invoked for a snooped send. + * @recv_handler: The callback routine invoked for a snooped receive. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_unregister_mad_agent - Unregisters a client from using MAD services. + * @mad_agent: Corresponding MAD registration request to deregister. + * + * After invoking this routine, MAD services are no longer usable by the + * client on the associated QP. + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent); + +/** + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client. + * @mad_agent: Specifies the associated registration to post the send to. + * @send_wr: Specifies the information needed to send the MAD(s). + * @bad_send_wr: Specifies the MAD on which an error was encountered. + * + * Sent MADs are not guaranteed to complete in the order that they were posted. + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + +/** + * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. + * @mad_recv_wc: Work completion information for a received MAD. + * @buf: User-provided data buffer to receive the coalesced buffers. The + * referenced buffer should be at least the size of the mad_len specified + * by @mad_recv_wc. + * + * This call copies a chain of received RMPP MADs into a single data buffer, + * removing duplicated headers. + */ +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf); + +/** + * ib_free_recv_mad - Returns data buffers used to receive a MAD to the + * access layer. + * @mad_recv_wc: Work completion information for a received MAD. + * + * Clients receiving MADs through their ib_mad_recv_handler must call this + * routine to return the work completion buffers to the access layer. + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_cancel_mad - Cancels an outstanding send MAD operation. 
+ * @mad_agent: Specifies the registration associated with sent MAD. + * @wr_id: Indicates the work request identifier of the MAD to cancel. + * + * MADs will be returned to the user through the corresponding + * ib_mad_send_handler. + */ +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id); + +/** + * ib_redirect_mad_qp - Registers a QP for MAD services. + * @qp: Reference to a QP that requires MAD services. + * @rmpp_version: If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. + * If set to 0, indicates that RMPP is not used by this client. + * @send_handler: The completion callback routine invoked after a send + * request has completed. + * @recv_handler: The completion callback routine invoked for a received + * MAD. + * @context: User specified context associated with the registration. + * + * Use of this call allows clients to use MAD services, such as RMPP, + * on user-owned QPs. After calling this routine, users may send + * MADs on the specified QP by calling ib_mad_post_send. + */ +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_process_mad_wc - Processes a work completion associated with a + * MAD sent or received on a redirected QP. + * @mad_agent: Specifies the registered MAD service using the redirected QP. + * @wc: References a work completion associated with a sent or received + * MAD segment. + * + * This routine is used to complete or continue processing on a MAD request. + * If the work completion is associated with a send operation, calling + * this routine is required to continue an RMPP transfer or to wait for a + * corresponding response, if it is a request. If the work completion is + * associated with a receive operation, calling this routine is required to + * process an inbound or outbound RMPP transfer, or to match a response MAD + * with its corresponding request. + */ +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc); + +#endif /* IB_MAD_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_smi.h 2004-12-19 22:04:13.062774166 -0800 @@ -0,0 +1,96 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#if !defined( IB_SMI_H ) +#define IB_SMI_H + +#include + +#define IB_LID_PERMISSIVE 0xFFFF + +#define IB_SMP_DATA_SIZE 64 +#define IB_SMP_MAX_PATH_HOPS 64 + +struct ib_smp { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u8 hop_ptr; + u8 hop_cnt; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; + u64 mkey; + u16 dr_slid; + u16 dr_dlid; + u8 reserved[28]; + u8 data[IB_SMP_DATA_SIZE]; + u8 initial_path[IB_SMP_MAX_PATH_HOPS]; + u8 return_path[IB_SMP_MAX_PATH_HOPS]; +} __attribute__ ((packed)); + +#define IB_SMP_DIRECTION __constant_htons(0x8000) + +/* Subnet management attributes */ +#define IB_SMP_ATTR_NOTICE __constant_htons(0x0002) +#define IB_SMP_ATTR_NODE_DESC __constant_htons(0x0010) +#define IB_SMP_ATTR_NODE_INFO __constant_htons(0x0011) +#define IB_SMP_ATTR_SWITCH_INFO __constant_htons(0x0012) +#define IB_SMP_ATTR_GUID_INFO __constant_htons(0x0014) +#define IB_SMP_ATTR_PORT_INFO __constant_htons(0x0015) +#define IB_SMP_ATTR_PKEY_TABLE __constant_htons(0x0016) +#define IB_SMP_ATTR_SL_TO_VL_TABLE __constant_htons(0x0017) +#define IB_SMP_ATTR_VL_ARB_TABLE __constant_htons(0x0018) +#define IB_SMP_ATTR_LINEAR_FORWARD_TABLE __constant_htons(0x0019) +#define IB_SMP_ATTR_RANDOM_FORWARD_TABLE __constant_htons(0x001A) +#define IB_SMP_ATTR_MCAST_FORWARD_TABLE __constant_htons(0x001B) +#define IB_SMP_ATTR_SM_INFO __constant_htons(0x0020) +#define IB_SMP_ATTR_VENDOR_DIAG __constant_htons(0x0030) +#define IB_SMP_ATTR_LED_INFO __constant_htons(0x0031) +#define IB_SMP_ATTR_VENDOR_MASK __constant_htons(0xFF00) + +static inline u8 +ib_get_smp_direction(struct ib_smp *smp) +{ + return ((smp->status & IB_SMP_DIRECTION) == IB_SMP_DIRECTION); +} + +#endif /* IB_SMI_H */ From roland at topspin.com Sun Dec 19 22:14:57 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:57 -0800 Subject: [openib-general] [PATCH][v4][6/24] Add InfiniBand MAD (management datagram) support (private headers) In-Reply-To: <200412192214.LiAN0y3IR8YkPxhM@topspin.com> Message-ID: <200412192214.OnqdUjlwc4A94uzk@topspin.com> Add MAD layer private implementation headers. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.h 2004-12-19 22:04:13.613692980 -0800 @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef __AGENT_H_ +#define __AGENT_H_ + +extern spinlock_t ib_agent_port_list_lock; + +extern int ib_agent_port_open(struct ib_device *device, + int port_num); + +extern int ib_agent_port_close(struct ib_device *device, int port_num); + +extern int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + +#endif /* __AGENT_H_ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent_priv.h 2004-12-19 22:04:13.636689591 -0800 @@ -0,0 +1,64 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#ifndef __IB_AGENT_PRIV_H__ +#define __IB_AGENT_PRIV_H__ + +#include + +#define SPFX "ib_agent: " + +struct ib_agent_send_wr { + struct list_head send_list; + struct ib_ah *ah; + struct ib_mad_private *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct ib_agent_port_private { + struct list_head port_list; + struct list_head send_posted_list; + spinlock_t send_list_lock; + int port_num; + struct ib_mad_agent *dr_smp_agent; /* DR SM class */ + struct ib_mad_agent *lr_smp_agent; /* LR SM class */ + struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ + struct ib_mr *mr; +}; + +#endif /* __IB_AGENT_PRIV_H__ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad_priv.h 2004-12-19 22:04:13.660686054 -0800 @@ -0,0 +1,194 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#ifndef __IB_MAD_PRIV_H__ +#define __IB_MAD_PRIV_H__ + +#include +#include +#include +#include +#include + + +#define PFX "ib_mad: " + +#define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ + +/* QP and CQ parameters */ +#define IB_MAD_QP_SEND_SIZE 128 +#define IB_MAD_QP_RECV_SIZE 512 +#define IB_MAD_SEND_REQ_MAX_SG 2 +#define IB_MAD_RECV_REQ_MAX_SG 1 + +#define IB_MAD_SEND_Q_PSN 0 + +/* Registration table sizes */ +#define MAX_MGMT_CLASS 80 +#define MAX_MGMT_VERSION 8 +#define MAX_MGMT_OUI 8 +#define MAX_MGMT_VENDOR_RANGE2 IB_MGMT_CLASS_VENDOR_RANGE2_END - \ + IB_MGMT_CLASS_VENDOR_RANGE2_START + 1 + +struct ib_mad_list_head { + struct list_head list; + struct ib_mad_queue *mad_queue; +}; + +struct ib_mad_private_header { + struct ib_mad_list_head mad_list; + struct ib_mad_recv_wc recv_wc; + DECLARE_PCI_UNMAP_ADDR(mapping) +} __attribute__ ((packed)); + +struct ib_mad_private { + struct ib_mad_private_header header; + struct ib_grh grh; + union { + struct ib_mad mad; + struct ib_rmpp_mad rmpp_mad; + struct ib_smp smp; + } mad; +} __attribute__ ((packed)); + +struct ib_mad_agent_private { + struct list_head agent_list; + struct ib_mad_agent agent; + struct ib_mad_reg_req *reg_req; + struct ib_mad_qp_info *qp_info; + + spinlock_t lock; + struct list_head send_list; + struct list_head wait_list; + struct work_struct timed_work; + unsigned long timeout; + struct list_head local_list; + struct work_struct local_work; + + atomic_t refcount; + wait_queue_head_t wait; + u8 rmpp_version; +}; + +struct ib_mad_snoop_private { + struct ib_mad_agent agent; + struct ib_mad_qp_info *qp_info; + int snoop_index; + int mad_snoop_flags; + atomic_t refcount; + wait_queue_head_t wait; +}; + +struct ib_mad_send_wr_private { + struct ib_mad_list_head mad_list; + struct list_head agent_list; + struct ib_mad_agent *agent; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; + unsigned long timeout; + int retry; + int refcount; + enum ib_wc_status status; +}; + +struct ib_mad_local_private { + struct list_head completion_list; + struct ib_mad_private *mad_priv; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; +}; + +struct ib_mad_mgmt_method_table { + struct ib_mad_agent_private *agent[IB_MGMT_MAX_METHODS]; +}; + +struct ib_mad_mgmt_class_table { + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; +}; + +struct ib_mad_mgmt_vendor_class { + u8 oui[MAX_MGMT_OUI][3]; + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_OUI]; +}; + +struct ib_mad_mgmt_vendor_class_table { + struct ib_mad_mgmt_vendor_class *vendor_class[MAX_MGMT_VENDOR_RANGE2]; +}; + +struct ib_mad_mgmt_version_table { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; +}; + +struct ib_mad_queue { + spinlock_t lock; + struct list_head list; + int count; + int max_active; + struct ib_mad_qp_info *qp_info; +}; + +struct ib_mad_qp_info { + struct ib_mad_port_private *port_priv; + struct ib_qp *qp; + struct ib_mad_queue send_queue; + struct ib_mad_queue recv_queue; + struct list_head overflow_list; + spinlock_t snoop_lock; + struct ib_mad_snoop_private **snoop_table; + int snoop_table_size; + atomic_t snoop_count; +}; + +struct ib_mad_port_private { + struct list_head port_list; + struct ib_device *device; + int port_num; + struct ib_cq *cq; + struct ib_pd *pd; + struct ib_mr *mr; + + spinlock_t reg_lock; + struct 
ib_mad_mgmt_version_table version[MAX_MGMT_VERSION]; + struct list_head agent_list; + struct workqueue_struct *wq; + struct work_struct work; + struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; +}; + +#endif /* __IB_MAD_PRIV_H__ */ From roland at topspin.com Sun Dec 19 22:14:55 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:55 -0800 Subject: [openib-general] [PATCH][v4][5/24] Add InfiniBand MAD (management datagram) support In-Reply-To: <200412192214.Lv59G3vg3yqeEo9J@topspin.com> Message-ID: <200412192214.LiAN0y3IR8YkPxhM@topspin.com> Add support for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. This is required for an SM (subnet manager) to discover and configure the host, since the SM's query MADs must be passed to the local SMA (subnet management agent). In addition, this support is used by upper level protocols to send queries to and receive responses from the SM. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-19 22:04:11.985932855 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-19 22:04:13.291740424 -0800 @@ -1,6 +1,8 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o + +ib_mad-y := mad.o smi.o agent.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.c 2004-12-19 22:04:13.339733352 -0800 @@ -0,0 +1,399 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include + +#include + +#include + +#include "smi.h" +#include "agent_priv.h" +#include "mad_priv.h" + + +spinlock_t ib_agent_port_list_lock; +static LIST_HEAD(ib_agent_port_list); + +extern kmem_cache_t *ib_mad_cache; + + +/* + * Caller must hold ib_agent_port_list_lock + */ +static inline struct ib_agent_port_private * +__ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + + BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ + + if (device) { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if (entry->dr_smp_agent->device == device && + entry->port_num == port_num) + return entry; + } + } else { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if ((entry->dr_smp_agent == mad_agent) || + (entry->lr_smp_agent == mad_agent) || + (entry->perf_mgmt_agent == mad_agent)) + return entry; + } + } + return NULL; +} + +static inline struct ib_agent_port_private * +ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + entry = __ib_get_agent_port(device, port_num, mad_agent); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return entry; +} + +int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + + if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + return 1; + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " + "not open\n", + device->name, port_num); + return 1; + } + + return smi_check_local_smp(port_priv->dr_smp_agent, smp); +} + +static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_agent_port_private *port_priv, + struct ib_mad_private *mad_priv, + struct ib_grh *grh, + struct ib_wc *wc) +{ + struct ib_agent_send_wr *agent_send_wr; + struct ib_sge gather_list; + struct ib_send_wr send_wr; + struct ib_send_wr *bad_send_wr; + struct ib_ah_attr ah_attr; + unsigned long flags; + int ret = 1; + + agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); + if (!agent_send_wr) + goto out; + agent_send_wr->mad = mad_priv; + + /* PCI mapping */ + gather_list.addr = dma_map_single(mad_agent->device->dma_device, + &mad_priv->mad, + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + gather_list.length = sizeof(mad_priv->mad); + gather_list.lkey = (*port_priv->mr).lkey; + + send_wr.next = NULL; + send_wr.opcode = IB_WR_SEND; + send_wr.sg_list = &gather_list; + send_wr.num_sge = 1; + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ + send_wr.wr.ud.timeout_ms = 0; + send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; + + ah_attr.dlid = wc->slid; + ah_attr.port_num = mad_agent->port_num; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.sl = wc->sl; + ah_attr.static_rate = 0; + ah_attr.ah_flags = 0; /* No GRH */ + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (wc->wc_flags & IB_WC_GRH) { + ah_attr.ah_flags = IB_AH_GRH; + /* Should sgid be looked up ? 
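+ * For now sgid_index 0 (the port's default GID) is used; the reply's DGID is copied from the SGID of the received GRH, and the hop limit, flow label, and traffic class are echoed from the request.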
*/ + ah_attr.grh.sgid_index = 0; + ah_attr.grh.hop_limit = grh->hop_limit; + ah_attr.grh.flow_label = be32_to_cpup( + &grh->version_tclass_flow) & 0xfffff; + ah_attr.grh.traffic_class = (be32_to_cpup( + &grh->version_tclass_flow) >> 20) & 0xff; + memcpy(ah_attr.grh.dgid.raw, + grh->sgid.raw, + sizeof(ah_attr.grh.dgid)); + } + } + + agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); + if (IS_ERR(agent_send_wr->ah)) { + printk(KERN_ERR SPFX "No memory for address handle\n"); + kfree(agent_send_wr); + goto out; + } + + send_wr.wr.ud.ah = agent_send_wr->ah; + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + send_wr.wr.ud.pkey_index = wc->pkey_index; + send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; + } else { /* for SMPs */ + send_wr.wr.ud.pkey_index = 0; + send_wr.wr.ud.remote_qkey = 0; + } + send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr; + send_wr.wr_id = (unsigned long)agent_send_wr; + + pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); + + /* Send */ + spin_lock_irqsave(&port_priv->send_list_lock, flags); + if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + ib_destroy_ah(agent_send_wr->ah); + kfree(agent_send_wr); + } else { + list_add_tail(&agent_send_wr->send_list, + &port_priv->send_posted_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; + } + +out: + return ret; +} + +int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + struct ib_mad_agent *mad_agent; + + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", + device->name, port_num); + return 1; + } + + /* Get mad agent based on mgmt_class in MAD */ + switch (mad->mad.mad.mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + mad_agent = port_priv->dr_smp_agent; + break; + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + mad_agent = port_priv->lr_smp_agent; + break; + case IB_MGMT_CLASS_PERF_MGMT: + mad_agent = port_priv->perf_mgmt_agent; + break; + default: + return 1; + } + + return agent_mad_send(mad_agent, port_priv, mad, grh, wc); +} + +static void agent_send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_agent_port_private *port_priv; + struct ib_agent_send_wr *agent_send_wr; + unsigned long flags; + + /* Find matching MAD agent */ + port_priv = ib_get_agent_port(NULL, 0, mad_agent); + if (!port_priv) { + printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " + "agent %p\n", mad_agent); + return; + } + + agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; + spin_lock_irqsave(&port_priv->send_list_lock, flags); + /* Remove completed send from posted send MAD list */ + list_del(&agent_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + + /* Unmap PCI */ + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(agent_send_wr->mad->mad), + DMA_TO_DEVICE); + + ib_destroy_ah(agent_send_wr->ah); + + /* Release allocated memory */ + kmem_cache_free(ib_mad_cache, agent_send_wr->mad); + kfree(agent_send_wr); +} + +int ib_agent_port_open(struct ib_device *device, int port_num) +{ + int ret; + struct 
ib_agent_port_private *port_priv; + struct ib_mad_reg_req reg_req; + unsigned long flags; + + /* First, check if port already open for SMI */ + port_priv = ib_get_agent_port(device, port_num, NULL); + if (port_priv) { + printk(KERN_DEBUG SPFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR SPFX "No memory for ib_agent_port_private\n"); + ret = -ENOMEM; + goto error1; + } + + memset(port_priv, 0, sizeof *port_priv); + port_priv->port_num = port_num; + spin_lock_init(&port_priv->send_list_lock); + INIT_LIST_HEAD(&port_priv->send_posted_list); + + /* Obtain MAD agent for directed route SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; + reg_req.mgmt_class_version = 1; + + port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + + if (IS_ERR(port_priv->dr_smp_agent)) { + ret = PTR_ERR(port_priv->dr_smp_agent); + goto error2; + } + + /* Obtain MAD agent for LID routed SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->lr_smp_agent)) { + ret = PTR_ERR(port_priv->lr_smp_agent); + goto error3; + } + + /* Obtain MAD agent for PerfMgmt class */ + reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->perf_mgmt_agent)) { + ret = PTR_ERR(port_priv->perf_mgmt_agent); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR SPFX "Couldn't get DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_agent_port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return 0; + +error5: + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); +error4: + ib_unregister_mad_agent(port_priv->lr_smp_agent); +error3: + ib_unregister_mad_agent(port_priv->dr_smp_agent); +error2: + kfree(port_priv); +error1: + return ret; +} + +int ib_agent_port_close(struct ib_device *device, int port_num) +{ + struct ib_agent_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + port_priv = __ib_get_agent_port(device, port_num, NULL); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + printk(KERN_ERR SPFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + ib_dereg_mr(port_priv->mr); + + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); + ib_unregister_mad_agent(port_priv->lr_smp_agent); + ib_unregister_mad_agent(port_priv->dr_smp_agent); + kfree(port_priv); + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad.c 2004-12-19 22:04:13.315736888 -0800 @@ -0,0 +1,2632 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#include +#include + +#include + +#include "mad_priv.h" +#include "smi.h" +#include "agent.h" + + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_DESCRIPTION("kernel IB MAD API"); +MODULE_AUTHOR("Hal Rosenstock"); +MODULE_AUTHOR("Sean Hefty"); + + +kmem_cache_t *ib_mad_cache; +static struct list_head ib_mad_port_list; +static u32 ib_mad_client_id = 0; + +/* Port list lock */ +static spinlock_t ib_mad_port_list_lock; + + +/* Forward declarations */ +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req); +static void remove_mad_reg_req(struct ib_mad_agent_private *priv); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc); +static void timeout_sends(void *data); +static void local_completions(void *data); +static int solicited_mad(struct ib_mad *mad); +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class); +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv); + +/* + * Returns a ib_mad_port_private structure or NULL for a device/port + * Assumes ib_mad_port_list_lock is being held + */ +static inline struct ib_mad_port_private * +__ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + + list_for_each_entry(entry, &ib_mad_port_list, port_list) { + if (entry->device == device && entry->port_num == port_num) + return entry; + } + return NULL; +} + +/* + * Wrapper function to return a ib_mad_port_private structure or NULL + * for a device/port + */ +static inline struct ib_mad_port_private * +ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + entry = __ib_get_mad_port(device, port_num); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + return entry; +} + +static inline u8 convert_mgmt_class(u8 mgmt_class) +{ + /* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ + return 
mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE ? + 0 : mgmt_class; +} + +static int get_spl_qp_index(enum ib_qp_type qp_type) +{ + switch (qp_type) + { + case IB_QPT_SMI: + return 0; + case IB_QPT_GSI: + return 1; + default: + return -1; + } +} + +static int vendor_class_index(u8 mgmt_class) +{ + return mgmt_class - IB_MGMT_CLASS_VENDOR_RANGE2_START; +} + +static int is_vendor_class(u8 mgmt_class) +{ + if ((mgmt_class < IB_MGMT_CLASS_VENDOR_RANGE2_START) || + (mgmt_class > IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return 0; + return 1; +} + +static int is_vendor_oui(char *oui) +{ + if (oui[0] || oui[1] || oui[2]) + return 1; + return 0; +} + +static int is_vendor_method_in_use( + struct ib_mad_mgmt_vendor_class *vendor_class, + struct ib_mad_reg_req *mad_reg_req) +{ + struct ib_mad_mgmt_method_table *method; + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) { + if (!memcmp(vendor_class->oui[i], mad_reg_req->oui, 3)) { + method = vendor_class->method_table[i]; + if (method) { + if (method_in_use(&method, mad_reg_req)) + return 1; + else + break; + } + } + } + return 0; +} + +/* + * ib_register_mad_agent - Register to send/receive MADs + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret = ERR_PTR(-EINVAL); + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_reg_req *reg_req = NULL; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_mad_mgmt_method_table *method; + int ret2, qpn; + unsigned long flags; + u8 mgmt_class, vclass; + + /* Validate parameters */ + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) + goto error1; + + if (rmpp_version) + goto error1; /* XXX: until RMPP implemented */ + + /* Validate MAD registration request if supplied */ + if (mad_reg_req) { + if (mad_reg_req->mgmt_class_version >= MAX_MGMT_VERSION) + goto error1; + if (!recv_handler) + goto error1; + if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) { + /* + * IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is the only + * one in this range currently allowed + */ + if (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + goto error1; + } else if (mad_reg_req->mgmt_class == 0) { + /* + * Class 0 is reserved in IBA and is used for + * aliasing of IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE + */ + goto error1; + } else if (is_vendor_class(mad_reg_req->mgmt_class)) { + /* + * If class is in "new" vendor range, + * ensure supplied OUI is not zero + */ + if (!is_vendor_oui(mad_reg_req->oui)) + goto error1; + } + /* Make sure class supplied is consistent with QP type */ + if (qp_type == IB_QPT_SMI) { + if ((mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_LID_ROUTED) && + (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } else { + if ((mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } + } else { + /* No registration request supplied */ + if (!send_handler) + goto error1; + } + + /* Validate device and port */ + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + + /* Allocate structures */ + mad_agent_priv = kmalloc(sizeof *mad_agent_priv, GFP_KERNEL); + if 
(!mad_agent_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + if (mad_reg_req) { + reg_req = kmalloc(sizeof *reg_req, GFP_KERNEL); + if (!reg_req) { + ret = ERR_PTR(-ENOMEM); + goto error2; + } + /* Make a copy of the MAD registration request */ + memcpy(reg_req, mad_reg_req, sizeof *reg_req); + } + + /* Now, fill in the various structures */ + memset(mad_agent_priv, 0, sizeof *mad_agent_priv); + mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; + mad_agent_priv->reg_req = reg_req; + mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->agent.device = device; + mad_agent_priv->agent.recv_handler = recv_handler; + mad_agent_priv->agent.send_handler = send_handler; + mad_agent_priv->agent.context = context; + mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_agent_priv->agent.port_num = port_num; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + mad_agent_priv->agent.hi_tid = ++ib_mad_client_id; + + /* + * Make sure MAD registration (if supplied) + * is non overlapping with any existing ones + */ + if (mad_reg_req) { + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); + if (!is_vendor_class(mgmt_class)) { + class = port_priv->version[mad_reg_req-> + mgmt_class_version].class; + if (class) { + method = class->method_table[mgmt_class]; + if (method) { + if (method_in_use(&method, + mad_reg_req)) + goto error3; + } + } + ret2 = add_nonoui_reg_req(mad_reg_req, mad_agent_priv, + mgmt_class); + } else { + /* "New" vendor class range */ + vendor = port_priv->version[mad_reg_req-> + mgmt_class_version].vendor; + if (vendor) { + vclass = vendor_class_index(mgmt_class); + vendor_class = vendor->vendor_class[vclass]; + if (vendor_class) { + if (is_vendor_method_in_use( + vendor_class, + mad_reg_req)) + goto error3; + } + } + ret2 = add_oui_reg_req(mad_reg_req, mad_agent_priv); + } + if (ret2) { + ret = ERR_PTR(ret2); + goto error3; + } + } + + /* Add mad agent into port's agent list */ + list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + spin_lock_init(&mad_agent_priv->lock); + INIT_LIST_HEAD(&mad_agent_priv->send_list); + INIT_LIST_HEAD(&mad_agent_priv->wait_list); + INIT_WORK(&mad_agent_priv->timed_work, timeout_sends, mad_agent_priv); + INIT_LIST_HEAD(&mad_agent_priv->local_list); + INIT_WORK(&mad_agent_priv->local_work, local_completions, + mad_agent_priv); + atomic_set(&mad_agent_priv->refcount, 1); + init_waitqueue_head(&mad_agent_priv->wait); + + return &mad_agent_priv->agent; + +error3: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + kfree(reg_req); +error2: + kfree(mad_agent_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_agent); + +static inline int is_snooping_sends(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (/*IB_MAD_SNOOP_POSTED_SENDS | + IB_MAD_SNOOP_RMPP_SENDS |*/ + IB_MAD_SNOOP_SEND_COMPLETIONS /*| + IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS*/)); +} + +static inline int is_snooping_recvs(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (IB_MAD_SNOOP_RECVS /*| + IB_MAD_SNOOP_RMPP_RECVS*/)); +} + +static int register_snoop_agent(struct ib_mad_qp_info *qp_info, + struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_snoop_private **new_snoop_table; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + /* Check for empty slot in array. */ + for (i = 0; i < qp_info->snoop_table_size; i++) + if (!qp_info->snoop_table[i]) + break; + + if (i == qp_info->snoop_table_size) { + /* Grow table. 
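+ * The replacement table needs room for snoop_table_size + 1 pointers; in the size expression below, sizeof binds more tightly than '+', so it computes sizeof(mad_snoop_priv) * snoop_table_size + 1 bytes rather than sizeof(mad_snoop_priv) * (snoop_table_size + 1).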
*/ + new_snoop_table = kmalloc(sizeof mad_snoop_priv * + qp_info->snoop_table_size + 1, + GFP_ATOMIC); + if (!new_snoop_table) { + i = -ENOMEM; + goto out; + } + if (qp_info->snoop_table) { + memcpy(new_snoop_table, qp_info->snoop_table, + sizeof mad_snoop_priv * + qp_info->snoop_table_size); + kfree(qp_info->snoop_table); + } + qp_info->snoop_table = new_snoop_table; + qp_info->snoop_table_size++; + } + qp_info->snoop_table[i] = mad_snoop_priv; + atomic_inc(&qp_info->snoop_count); +out: + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + return i; +} + +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret; + struct ib_mad_snoop_private *mad_snoop_priv; + int qpn; + + /* Validate parameters */ + if ((is_snooping_sends(mad_snoop_flags) && !snoop_handler) || + (is_snooping_recvs(mad_snoop_flags) && !recv_handler)) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + /* Allocate structures */ + mad_snoop_priv = kmalloc(sizeof *mad_snoop_priv, GFP_KERNEL); + if (!mad_snoop_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + /* Now, fill in the various structures */ + memset(mad_snoop_priv, 0, sizeof *mad_snoop_priv); + mad_snoop_priv->qp_info = &port_priv->qp_info[qpn]; + mad_snoop_priv->agent.device = device; + mad_snoop_priv->agent.recv_handler = recv_handler; + mad_snoop_priv->agent.snoop_handler = snoop_handler; + mad_snoop_priv->agent.context = context; + mad_snoop_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_snoop_priv->agent.port_num = port_num; + mad_snoop_priv->mad_snoop_flags = mad_snoop_flags; + init_waitqueue_head(&mad_snoop_priv->wait); + mad_snoop_priv->snoop_index = register_snoop_agent( + &port_priv->qp_info[qpn], + mad_snoop_priv); + if (mad_snoop_priv->snoop_index < 0) { + ret = ERR_PTR(mad_snoop_priv->snoop_index); + goto error2; + } + + atomic_set(&mad_snoop_priv->refcount, 1); + return &mad_snoop_priv->agent; + +error2: + kfree(mad_snoop_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_snoop); + +static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + /* Note that we could still be handling received MADs */ + + /* + * Canceling all sends results in dropping received response + * MADs, preventing us from queuing additional work + */ + cancel_mads(mad_agent_priv); + + port_priv = mad_agent_priv->qp_info->port_priv; + cancel_delayed_work(&mad_agent_priv->timed_work); + flush_workqueue(port_priv->wq); + + spin_lock_irqsave(&port_priv->reg_lock, flags); + remove_mad_reg_req(mad_agent_priv); + list_del(&mad_agent_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + /* XXX: Cleanup pending RMPP receives for this agent */ + + atomic_dec(&mad_agent_priv->refcount); + wait_event(mad_agent_priv->wait, + !atomic_read(&mad_agent_priv->refcount)); + + if (mad_agent_priv->reg_req) + kfree(mad_agent_priv->reg_req); + kfree(mad_agent_priv); +} + +static void unregister_mad_snoop(struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_qp_info *qp_info; + unsigned long flags; + + qp_info = 
mad_snoop_priv->qp_info; + spin_lock_irqsave(&qp_info->snoop_lock, flags); + qp_info->snoop_table[mad_snoop_priv->snoop_index] = NULL; + atomic_dec(&qp_info->snoop_count); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + + atomic_dec(&mad_snoop_priv->refcount); + wait_event(mad_snoop_priv->wait, + !atomic_read(&mad_snoop_priv->refcount)); + + kfree(mad_snoop_priv); +} + +/* + * ib_unregister_mad_agent - Unregisters a client from using MAD services + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_snoop_private *mad_snoop_priv; + + /* If the TID is zero, the agent can only snoop. */ + if (mad_agent->hi_tid) { + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + unregister_mad_agent(mad_agent_priv); + } else { + mad_snoop_priv = container_of(mad_agent, + struct ib_mad_snoop_private, + agent); + unregister_mad_snoop(mad_snoop_priv); + } + return 0; +} +EXPORT_SYMBOL(ib_unregister_mad_agent); + +static void dequeue_mad(struct ib_mad_list_head *mad_list) +{ + struct ib_mad_queue *mad_queue; + unsigned long flags; + + BUG_ON(!mad_list->mad_queue); + mad_queue = mad_list->mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_del(&mad_list->list); + mad_queue->count--; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + +static void snoop_send(struct ib_mad_qp_info *qp_info, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.snoop_handler(&mad_snoop_priv->agent, + send_wr, mad_send_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +static void snoop_recv(struct ib_mad_qp_info *qp_info, + struct ib_mad_recv_wc *mad_recv_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.recv_handler(&mad_snoop_priv->agent, + mad_recv_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +/* + * Return 0 if SMP is to be sent + * Return 1 if SMP was consumed locally (whether or not solicited) + * Return < 0 if error + */ +static int handle_outgoing_smp(struct ib_mad_agent_private *mad_agent_priv, + struct ib_smp *smp, + struct ib_send_wr *send_wr) +{ + int ret, alloc_flags; + unsigned long flags; + struct ib_mad_local_private *local; + struct ib_mad_private *mad_priv; + struct ib_device *device = mad_agent_priv->agent.device; + u8 port_num = mad_agent_priv->agent.port_num; + + if 
(!smi_handle_dr_smp_send(smp, device->node_type, port_num)) { + ret = -EINVAL; + printk(KERN_ERR PFX "Invalid directed route\n"); + goto out; + } + /* Check to post send on QP or process locally */ + ret = smi_check_local_dr_smp(smp, device, port_num); + if (!ret || !device->process_mad) + goto out; + + if (in_atomic() || irqs_disabled()) + alloc_flags = GFP_ATOMIC; + else + alloc_flags = GFP_KERNEL; + local = kmalloc(sizeof *local, alloc_flags); + if (!local) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for ib_mad_local_private\n"); + goto out; + } + local->mad_priv = NULL; + mad_priv = kmem_cache_alloc(ib_mad_cache, alloc_flags); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local response MAD\n"); + kfree(local); + goto out; + } + ret = device->process_mad(device, 0, port_num, smp->dr_slid, + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); + switch (ret) + { + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent_priv->agent.recv_handler) + local->mad_priv = mad_priv; + else + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS: + kmem_cache_free(ib_mad_cache, mad_priv); + kfree(local); + ret = 0; + goto out; + default: + kmem_cache_free(ib_mad_cache, mad_priv); + kfree(local); + ret = -EINVAL; + goto out; + } + + local->send_wr = *send_wr; + local->send_wr.sg_list = local->sg_list; + memcpy(local->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + local->send_wr.next = NULL; + local->tid = send_wr->wr.ud.mad_hdr->tid; + local->wr_id = send_wr->wr_id; + /* Reference MAD agent until local completion handled */ + atomic_inc(&mad_agent_priv->refcount); + /* Queue local completion to local list */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&local->completion_list, &mad_agent_priv->local_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + queue_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->local_work); + ret = 1; +out: + return ret; +} + +static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr) +{ + struct ib_mad_qp_info *qp_info; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + /* Replace user's WR ID with our own to find WR upon completion */ + qp_info = mad_agent_priv->qp_info; + mad_send_wr->wr_id = mad_send_wr->send_wr.wr_id; + mad_send_wr->send_wr.wr_id = (unsigned long)&mad_send_wr->mad_list; + mad_send_wr->mad_list.mad_queue = &qp_info->send_queue; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + if (qp_info->send_queue.count++ < qp_info->send_queue.max_active) { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->send_queue.list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = ib_post_send(mad_agent_priv->agent.qp, + &mad_send_wr->send_wr, &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + dequeue_mad(&mad_send_wr->mad_list); + } + } else { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->overflow_list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = 0; + } + return ret; +} + +/* + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client + */ +int ib_post_send_mad(struct 
ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + int ret = -EINVAL; + struct ib_mad_agent_private *mad_agent_priv; + + /* Validate supplied parameters */ + if (!bad_send_wr) + goto error1; + + if (!mad_agent || !send_wr) + goto error2; + + if (!mad_agent->send_handler) + goto error2; + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + + /* Walk list of send WRs and post each on send list */ + while (send_wr) { + unsigned long flags; + struct ib_send_wr *next_send_wr; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_smp *smp; + + /* Validate more parameters */ + if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) + goto error2; + + if (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler) + goto error2; + + if (!send_wr->wr.ud.mad_hdr) { + printk(KERN_ERR PFX "MAD header must be supplied " + "in WR %p\n", send_wr); + goto error2; + } + + /* + * Save pointer to next work request to post in case the + * current one completes, and the user modifies the work + * request associated with the completion + */ + next_send_wr = (struct ib_send_wr *)send_wr->next; + + smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = handle_outgoing_smp(mad_agent_priv, smp, send_wr); + if (ret < 0) /* error */ + goto error2; + else if (ret == 1) /* locally consumed */ + goto next; + } + + /* Allocate MAD send WR tracking structure */ + mad_send_wr = kmalloc(sizeof *mad_send_wr, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_send_wr) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_send_wr_private\n"); + ret = -ENOMEM; + goto error2; + } + + mad_send_wr->send_wr = *send_wr; + mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; + memcpy(mad_send_wr->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + mad_send_wr->send_wr.next = NULL; + mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid; + mad_send_wr->agent = mad_agent; + /* Timeout will be updated after send completes */ + mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. 
+ ud.timeout_ms); + mad_send_wr->retry = 0; + /* One reference for each work request to QP + response */ + mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); + mad_send_wr->status = IB_WC_SUCCESS; + + /* Reference MAD agent until send completes */ + atomic_inc(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&mad_send_wr->agent_list, + &mad_agent_priv->send_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + ret = ib_send_mad(mad_agent_priv, mad_send_wr); + if (ret) { + /* Fail send request */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + atomic_dec(&mad_agent_priv->refcount); + goto error2; + } +next: + send_wr = next_send_wr; + } + return 0; + +error2: + *bad_send_wr = send_wr; +error1: + return ret; +} +EXPORT_SYMBOL(ib_post_send_mad); + +/* + * ib_free_recv_mad - Returns data buffers used to receive + * a MAD to the access layer + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_recv_buf *entry; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *priv; + + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, header); + + /* + * Walk receive buffer list associated with this WC + * No need to remove them from list of receive buffers + */ + list_for_each_entry(entry, &mad_recv_wc->recv_buf.list, list) { + /* Free previous receive buffer */ + kmem_cache_free(ib_mad_cache, priv); + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, + header); + } + + /* Free last buffer */ + kmem_cache_free(ib_mad_cache, priv); +} +EXPORT_SYMBOL(ib_free_recv_mad); + +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf) +{ + printk(KERN_ERR PFX "ib_coalesce_recv_mad() not implemented yet\n"); +} +EXPORT_SYMBOL(ib_coalesce_recv_mad); + +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + return ERR_PTR(-EINVAL); /* XXX: for now */ +} +EXPORT_SYMBOL(ib_redirect_mad_qp); + +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc) +{ + printk(KERN_ERR PFX "ib_process_mad_wc() not implemented yet\n"); + return 0; +} +EXPORT_SYMBOL(ib_process_mad_wc); + +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req) +{ + int i; + + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + if ((*method)->agent[i]) { + printk(KERN_ERR PFX "Method %d already in use\n", i); + return -EINVAL; + } + } + return 0; +} + +static int allocate_method_table(struct ib_mad_mgmt_method_table **method) +{ + /* Allocate management method table */ + *method = kmalloc(sizeof **method, GFP_ATOMIC); + if (!*method) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_method_table\n"); + return -ENOMEM; + } + /* Clear management method table */ + memset(*method, 0, sizeof **method); + + return 0; +} + +/* + * Check to see if there are any methods still in use + */ +static int check_method_table(struct ib_mad_mgmt_method_table *method) +{ + int i; + + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) + if (method->agent[i]) + return 
1; + return 0; +} + +/* + * Check to see if there are any method tables for this class still in use + */ +static int check_class_table(struct ib_mad_mgmt_class_table *class) +{ + int i; + + for (i = 0; i < MAX_MGMT_CLASS; i++) + if (class->method_table[i]) + return 1; + return 0; +} + +static int check_vendor_class(struct ib_mad_mgmt_vendor_class *vendor_class) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + if (vendor_class->method_table[i]) + return 1; + return 0; +} + +static int find_vendor_oui(struct ib_mad_mgmt_vendor_class *vendor_class, + char *oui) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + /* Is there matching OUI for this vendor class ? */ + if (!memcmp(vendor_class->oui[i], oui, 3)) + return i; + + return -1; +} + +static int check_vendor_table(struct ib_mad_mgmt_vendor_class_table *vendor) +{ + int i; + + for (i = 0; i < MAX_MGMT_VENDOR_RANGE2; i++) + if (vendor->vendor_class[i]) + return 1; + + return 0; +} + +static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, + struct ib_mad_agent_private *agent) +{ + int i; + + /* Remove any methods for this mad agent */ + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { + if (method->agent[i] == agent) { + method->agent[i] = NULL; + } + } +} + +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table **class; + struct ib_mad_mgmt_method_table **method; + int i, ret; + + port_priv = agent_priv->qp_info->port_priv; + class = &port_priv->version[mad_reg_req->mgmt_class_version].class; + if (!*class) { + /* Allocate management class table for "new" class version */ + *class = kmalloc(sizeof **class, GFP_ATOMIC); + if (!*class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_class_table\n"); + ret = -ENOMEM; + goto error1; + } + /* Clear management class table */ + memset(*class, 0, sizeof(**class)); + /* Allocate method table for this management class */ + method = &(*class)->method_table[mgmt_class]; + if ((ret = allocate_method_table(method))) + goto error2; + } else { + method = &(*class)->method_table[mgmt_class]; + if (!*method) { + /* Allocate method table for this management class */ + if ((ret = allocate_method_table(method))) + goto error1; + } + } + + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error3; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error3: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; + goto error1; +error2: + kfree(*class); + *class = NULL; +error1: + return ret; +} + +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_vendor_class_table **vendor_table; + struct ib_mad_mgmt_vendor_class_table *vendor = NULL; + struct ib_mad_mgmt_vendor_class *vendor_class = NULL; + struct ib_mad_mgmt_method_table **method; + int i, ret = -ENOMEM; + u8 vclass; + + /* "New" vendor (with OUI) 
class */ + vclass = vendor_class_index(mad_reg_req->mgmt_class); + port_priv = agent_priv->qp_info->port_priv; + vendor_table = &port_priv->version[ + mad_reg_req->mgmt_class_version].vendor; + if (!*vendor_table) { + /* Allocate mgmt vendor class table for "new" class version */ + vendor = kmalloc(sizeof *vendor, GFP_ATOMIC); + if (!vendor) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class_table\n"); + goto error1; + } + /* Clear management vendor class table */ + memset(vendor, 0, sizeof(*vendor)); + *vendor_table = vendor; + } + if (!(*vendor_table)->vendor_class[vclass]) { + /* Allocate table for this management vendor class */ + vendor_class = kmalloc(sizeof *vendor_class, GFP_ATOMIC); + if (!vendor_class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class\n"); + goto error2; + } + memset(vendor_class, 0, sizeof(*vendor_class)); + (*vendor_table)->vendor_class[vclass] = vendor_class; + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* Is there matching OUI for this vendor class ? */ + if (!memcmp((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3)) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(!*method); + goto check_in_use; + } + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* OUI slot available ? */ + if (!is_vendor_oui((*vendor_table)->vendor_class[ + vclass]->oui[i])) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(*method); + /* Allocate method table for this OUI */ + if ((ret = allocate_method_table(method))) + goto error3; + memcpy((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3); + goto check_in_use; + } + } + printk(KERN_ERR PFX "All OUI slots in use\n"); + goto error3; + +check_in_use: + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error4; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error4: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; +error3: + if (vendor_class) { + (*vendor_table)->vendor_class[vclass] = NULL; + kfree(vendor_class); + } +error2: + if (vendor) { + *vendor_table = NULL; + kfree(vendor); + } +error1: + return ret; +} + +static void remove_mad_reg_req(struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + int index; + u8 mgmt_class; + + /* + * Was MAD registration request supplied + * with original registration ? 
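+ * An agent registered without a reg_req was never added to the class or vendor method tables, so there is nothing to remove.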
+ */ + if (!agent_priv->reg_req) { + goto out; + } + + port_priv = agent_priv->qp_info->port_priv; + class = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].class; + if (!class) + goto vendor_check; + + mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); + method = class->method_table[mgmt_class]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* Now, check to see if there are any methods still in use */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + class->method_table[mgmt_class] = NULL; + /* Any management classes left ? */ + if (!check_class_table(class)) { + /* If not, release management class table */ + kfree(class); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version].class = NULL; + } + } + } + +vendor_check: + vendor = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].vendor; + if (!vendor) + goto out; + + mgmt_class = vendor_class_index(agent_priv->reg_req->mgmt_class); + vendor_class = vendor->vendor_class[mgmt_class]; + if (vendor_class) { + index = find_vendor_oui(vendor_class, agent_priv->reg_req->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* + * Now, check to see if there are + * any methods still in use + */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + vendor_class->method_table[index] = NULL; + memset(vendor_class->oui[index], 0, 3); + /* Any OUIs left ? */ + if (!check_vendor_class(vendor_class)) { + /* If not, release vendor class table */ + kfree(vendor_class); + vendor->vendor_class[mgmt_class] = NULL; + /* Any other vendor classes left ? */ + if (!check_vendor_table(vendor)) { + kfree(vendor); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version]. + vendor = NULL; + } + } + } + } + } + +out: + return; +} + +static int response_mad(struct ib_mad *mad) +{ + /* Trap represses are responses although response bit is reset */ + return ((mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) || + (mad->mad_hdr.method & IB_MGMT_METHOD_RESP)); +} + +static int solicited_mad(struct ib_mad *mad) +{ + /* CM MADs are never solicited */ + if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_CM) { + return 0; + } + + /* XXX: Determine whether MAD is using RMPP */ + + /* Not using RMPP */ + /* Is this MAD a response to a previous MAD ? */ + return response_mad(mad); +} + +static struct ib_mad_agent_private * +find_mad_agent(struct ib_mad_port_private *port_priv, + struct ib_mad *mad, + int solicited) +{ + struct ib_mad_agent_private *mad_agent = NULL; + unsigned long flags; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + + /* + * Whether MAD was solicited determines type of routing to + * MAD client. + */ + if (solicited) { + u32 hi_tid; + struct ib_mad_agent_private *entry; + + /* + * Routing is based on high 32 bits of transaction ID + * of MAD. 
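+ * Each agent gets a unique hi_tid at registration time, and a requester is expected to place it in the upper half of every TID it sends, e.g. (illustrative only) mad_hdr->tid = cpu_to_be64(((u64)agent->hi_tid << 32) | seq), where seq is any client-chosen low 32 bits; the lookup below matches the response back to that agent.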
+ */ + hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; + list_for_each_entry(entry, &port_priv->agent_list, + agent_list) { + if (entry->agent.hi_tid == hi_tid) { + mad_agent = entry; + break; + } + } + } else { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_vendor_mad *vendor_mad; + int index; + + /* + * Routing is based on version, class, and method + * For "newer" vendor MADs, also based on OUI + */ + if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) + goto out; + if (!is_vendor_class(mad->mad_hdr.mgmt_class)) { + class = port_priv->version[ + mad->mad_hdr.class_version].class; + if (!class) + goto out; + method = class->method_table[convert_mgmt_class( + mad->mad_hdr.mgmt_class)]; + if (method) + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } else { + vendor = port_priv->version[ + mad->mad_hdr.class_version].vendor; + if (!vendor) + goto out; + vendor_class = vendor->vendor_class[vendor_class_index( + mad->mad_hdr.mgmt_class)]; + if (!vendor_class) + goto out; + /* Find matching OUI */ + vendor_mad = (struct ib_vendor_mad *)mad; + index = find_vendor_oui(vendor_class, vendor_mad->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } + } + } + + if (mad_agent) { + if (mad_agent->agent.recv_handler) + atomic_inc(&mad_agent->refcount); + else { + printk(KERN_NOTICE PFX "No receive handler for client " + "%p on port %d\n", + &mad_agent->agent, port_priv->port_num); + mad_agent = NULL; + } + } +out: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + return mad_agent; +} + +static int validate_mad(struct ib_mad *mad, u32 qp_num) +{ + int valid = 0; + + /* Make sure MAD base version is understood */ + if (mad->mad_hdr.base_version != IB_MGMT_BASE_VERSION) { + printk(KERN_ERR PFX "MAD received with unsupported base " + "version %d\n", mad->mad_hdr.base_version); + goto out; + } + + /* Filter SMI packets sent to other than QP0 */ + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) { + if (qp_num == 0) + valid = 1; + } else { + /* Filter GSI packets sent to QP0 */ + if (qp_num != 0) + valid = 1; + } + +out: + return valid; +} + +/* + * Return start of fully reassembled MAD, or NULL, if MAD isn't assembled yet + */ +static struct ib_mad_private * +reassemble_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv) +{ + /* Until we have RMPP, all receives are reassembled!... */ + INIT_LIST_HEAD(&recv->header.recv_wc.recv_buf.list); + return recv; +} + +static struct ib_mad_send_wr_private* +find_send_req(struct ib_mad_agent_private *mad_agent_priv, + u64 tid) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->tid == tid) + return mad_send_wr; + } + + /* + * It's possible to receive the response before we've + * been notified that the send has completed + */ + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->tid == tid && mad_send_wr->timeout) { + /* Verify request has not been canceled */ + return (mad_send_wr->status == IB_WC_SUCCESS) ? 
+ mad_send_wr : NULL; + } + } + return NULL; +} + +static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv, + int solicited) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + /* Fully reassemble receive before processing */ + recv = reassemble_recv(mad_agent_priv, recv); + if (!recv) { + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + + /* Complete corresponding request */ + if (solicited) { + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_req(mad_agent_priv, + recv->mad.mad.mad_hdr.tid); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + ib_free_recv_mad(&recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + /* Timeout = 0 means that we won't wait for a response */ + mad_send_wr->timeout = 0; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Defined behavior is to complete response before request */ + recv->header.recv_wc.wc->wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + atomic_dec(&mad_agent_priv->refcount); + + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); + } else { + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + } +} + +static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_qp_info *qp_info; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv, *response; + struct ib_mad_list_head *mad_list; + struct ib_mad_agent_private *mad_agent; + int solicited; + + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + dequeue_mad(mad_list); + + mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header); + dma_unmap_single(port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + + /* Setup MAD receive work completion from "normal" work completion */ + recv->header.recv_wc.wc = wc; + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); + recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; + recv->header.recv_wc.recv_buf.grh = &recv->grh; + + if (atomic_read(&qp_info->snoop_count)) + snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); + + /* Validate MAD */ + if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) + goto out; + + if (recv->mad.mad.mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (!smi_handle_dr_smp_recv(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt)) + goto out; + if (!smi_check_forward_dr_smp(&recv->mad.smp)) + goto local; + if (!smi_handle_dr_smp_send(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num)) + goto out; + if (!smi_check_local_dr_smp(&recv->mad.smp, 
+ port_priv->device, + port_priv->port_num)) + goto out; + } + +local: + /* Give driver "right of first refusal" on incoming MAD */ + if (port_priv->device->process_mad) { + int ret; + + if (!response) { + printk(KERN_ERR PFX "No memory for response MAD\n"); + /* + * Is it better to assume that + * it wouldn't be processed ? + */ + goto out; + } + + ret = port_priv->device->process_mad(port_priv->device, 0, + port_priv->port_num, + wc->slid, + &recv->mad.mad, + &response->mad.mad); + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) + goto out; + if (ret & IB_MAD_RESULT_REPLY) { + /* Send response */ + if (!agent_send(response, &recv->grh, wc, + port_priv->device, + port_priv->port_num)) + response = NULL; + goto out; + } + } + } + + /* Determine corresponding MAD agent for incoming receive MAD */ + solicited = solicited_mad(&recv->mad.mad); + mad_agent = find_mad_agent(port_priv, &recv->mad.mad, solicited); + if (mad_agent) { + ib_mad_complete_recv(mad_agent, recv, solicited); + /* + * recv is freed up in error cases in ib_mad_complete_recv + * or via recv_handler in ib_mad_complete_recv() + */ + recv = NULL; + } + +out: + /* Post another receive request for this QP */ + if (response) { + ib_mad_post_receive_mads(qp_info, response); + if (recv) + kmem_cache_free(ib_mad_cache, recv); + } else + ib_mad_post_receive_mads(qp_info, recv); +} + +static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_send_wr_private *mad_send_wr; + unsigned long delay; + + if (list_empty(&mad_agent_priv->wait_list)) { + cancel_delayed_work(&mad_agent_priv->timed_work); + } else { + mad_send_wr = list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_agent_priv->timeout, + mad_send_wr->timeout)) { + mad_agent_priv->timeout = mad_send_wr->timeout; + cancel_delayed_work(&mad_agent_priv->timed_work); + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->timed_work, delay); + } + } +} + +static void wait_for_response(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr ) +{ + struct ib_mad_send_wr_private *temp_mad_send_wr; + struct list_head *list_item; + unsigned long delay; + + list_del(&mad_send_wr->agent_list); + + delay = mad_send_wr->timeout; + mad_send_wr->timeout += jiffies; + + list_for_each_prev(list_item, &mad_agent_priv->wait_list) { + temp_mad_send_wr = list_entry(list_item, + struct ib_mad_send_wr_private, + agent_list); + if (time_after(mad_send_wr->timeout, + temp_mad_send_wr->timeout)) + break; + } + list_add(&mad_send_wr->agent_list, list_item); + + /* Reschedule a work item if we have a shorter timeout */ + if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { + cancel_delayed_work(&mad_agent_priv->timed_work); + queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->timed_work, delay); + } +} + +/* + * Process a send work completion + */ +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_mad_agent_private *mad_agent_priv; + unsigned long flags; + + mad_agent_priv = container_of(mad_send_wr->agent, + struct ib_mad_agent_private, agent); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + if (mad_send_wc->status != IB_WC_SUCCESS && + mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = mad_send_wc->status; + 
mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + + if (--mad_send_wr->refcount > 0) { + if (mad_send_wr->refcount == 1 && mad_send_wr->timeout && + mad_send_wr->status == IB_WC_SUCCESS) { + wait_for_response(mad_agent_priv, mad_send_wr); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + return; + } + + /* Remove send from MAD agent and notify client of completion */ + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + if (mad_send_wr->status != IB_WC_SUCCESS ) + mad_send_wc->status = mad_send_wr->status; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + mad_send_wc); + + /* Release reference on agent taken when sending */ + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + + kfree(mad_send_wr); +} + +static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_send_wr_private *mad_send_wr, *queued_send_wr; + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_queue *send_queue; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + send_queue = mad_list->mad_queue; + qp_info = send_queue->qp_info; + +retry: + queued_send_wr = NULL; + spin_lock_irqsave(&send_queue->lock, flags); + list_del(&mad_list->list); + + /* Move queued send to the send queue */ + if (send_queue->count-- > send_queue->max_active) { + mad_list = container_of(qp_info->overflow_list.next, + struct ib_mad_list_head, list); + queued_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + list_del(&mad_list->list); + list_add_tail(&mad_list->list, &send_queue->list); + } + spin_unlock_irqrestore(&send_queue->lock, flags); + + /* Restore client wr_id in WC and complete send */ + wc->wr_id = mad_send_wr->wr_id; + if (atomic_read(&qp_info->snoop_count)) + snoop_send(qp_info, &mad_send_wr->send_wr, + (struct ib_mad_send_wc *)wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc *)wc); + + if (queued_send_wr) { + ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, + &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + mad_send_wr = queued_send_wr; + wc->status = IB_WC_LOC_QP_OP_ERR; + goto retry; + } + } +} + +static void mark_sends_for_retry(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_list_head *mad_list; + unsigned long flags; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { + mad_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + mad_send_wr->retry = 1; + } + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); +} + +static void mad_error_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_send_wr_private *mad_send_wr; + int ret; + + /* Determine if failure was a send or receive */ + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->recv_queue) + /* + * Receive errors indicate that the QP has entered the error + * state - error handling/shutdown code will cleanup 
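+ * (editor's note: a receive completion error moves the QP to the
+ * Error state, so nothing is retried here; the flushed receive
+ * buffers are reclaimed later by cleanup_recv_queue())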
+ */ + return; + + /* + * Send errors will transition the QP to SQE - move + * QP to RTS and repost flushed work requests + */ + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + if (wc->status == IB_WC_WR_FLUSH_ERR) { + if (mad_send_wr->retry) { + /* Repost send */ + struct ib_send_wr *bad_send_wr; + + mad_send_wr->retry = 0; + ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, + &bad_send_wr); + if (ret) + ib_mad_send_done_handler(port_priv, wc); + } else + ib_mad_send_done_handler(port_priv, wc); + } else { + struct ib_qp_attr *attr; + + /* Transition QP to RTS and fail offending send */ + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + attr->cur_qp_state = IB_QPS_SQE; + ret = ib_modify_qp(qp_info->qp, attr, + IB_QP_STATE | IB_QP_CUR_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } + ib_mad_send_done_handler(port_priv, wc); + } +} + +/* + * IB MAD completion callback + */ +static void ib_mad_completion_handler(void *data) +{ + struct ib_mad_port_private *port_priv; + struct ib_wc wc; + + port_priv = (struct ib_mad_port_private *)data; + ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + + while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { + if (wc.status == IB_WC_SUCCESS) { + switch (wc.opcode) { + case IB_WC_SEND: + ib_mad_send_done_handler(port_priv, &wc); + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; + } + } else + mad_error_handler(port_priv, &wc); + } +} + +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv) +{ + unsigned long flags; + struct ib_mad_send_wr_private *mad_send_wr, *temp_mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + struct list_head cancel_list; + + INIT_LIST_HEAD(&cancel_list); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &mad_agent_priv->send_list, agent_list) { + if (mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + } + + /* Empty wait list to prevent receives from finding a request */ + list_splice_init(&mad_agent_priv->wait_list, &cancel_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Report all cancelled requests */ + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &cancel_list, agent_list) { + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + list_del(&mad_send_wr->agent_list); + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + } +} + +static struct ib_mad_send_wr_private* +find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, + u64 wr_id) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + return NULL; +} + +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + mad_agent_priv = 
container_of(mad_agent, struct ib_mad_agent_private, + agent); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_by_wr_id(mad_agent_priv, wr_id); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + if (mad_send_wr->status == IB_WC_SUCCESS) + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + + if (mad_send_wr->refcount != 0) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + +out: + return; +} +EXPORT_SYMBOL(ib_cancel_mad); + +static void local_completions(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_local_private *local; + unsigned long flags; + struct ib_wc wc; + struct ib_mad_send_wc mad_send_wc; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->local_list)) { + local = list_entry(mad_agent_priv->local_list.next, + struct ib_mad_local_private, + completion_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + if (local->mad_priv) { + /* + * Defined behavior is to complete response + * before request + */ + wc.wr_id = local->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = IB_QP0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + local->mad_priv->header.recv_wc.wc = &wc; + local->mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&local->mad_priv->header.recv_wc.recv_buf.list); + local->mad_priv->header.recv_wc.recv_buf.grh = NULL; + local->mad_priv->header.recv_wc.recv_buf.mad = + &local->mad_priv->mad.mad; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_recv(mad_agent_priv->qp_info, + &local->mad_priv->header.recv_wc, + IB_MAD_SNOOP_RECVS); + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &local->mad_priv->header.recv_wc); + } + + /* Complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = local->wr_id; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_send(mad_agent_priv->qp_info, &local->send_wr, + &mad_send_wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&local->completion_list); + atomic_dec(&mad_agent_priv->refcount); + kfree(local); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + +static void timeout_sends(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags, delay; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + mad_send_wc.status = IB_WC_RESP_TIMEOUT_ERR; + mad_send_wc.vendor_err = 0; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->wait_list)) { + mad_send_wr = 
list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_send_wr->timeout, jiffies)) { + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->timed_work, delay); + break; + } + + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + +static void ib_mad_thread_completion_handler(struct ib_cq *cq) +{ + struct ib_mad_port_private *port_priv = cq->cq_context; + + queue_work(port_priv->wq, &port_priv->work); +} + +/* + * Allocate receive MADs and post receive WRs for them + */ +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) +{ + unsigned long flags; + int post, ret; + struct ib_mad_private *mad_priv; + struct ib_sge sg_list; + struct ib_recv_wr recv_wr, *bad_recv_wr; + struct ib_mad_queue *recv_queue = &qp_info->recv_queue; + + /* Initialize common scatter list fields */ + sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; + sg_list.lkey = (*qp_info->port_priv->mr).lkey; + + /* Initialize common receive WR fields */ + recv_wr.next = NULL; + recv_wr.sg_list = &sg_list; + recv_wr.num_sge = 1; + recv_wr.recv_flags = IB_RECV_SIGNALED; + + do { + /* Allocate and map receive buffer */ + if (mad) { + mad_priv = mad; + mad = NULL; + } else { + mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + ret = -ENOMEM; + break; + } + } + sg_list.addr = dma_map_single(qp_info->port_priv-> + device->dma_device, + &mad_priv->grh, + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; + mad_priv->header.mad_list.mad_queue = recv_queue; + + /* Post receive WR */ + spin_lock_irqsave(&recv_queue->lock, flags); + post = (++recv_queue->count < recv_queue->max_active); + list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list); + spin_unlock_irqrestore(&recv_queue->lock, flags); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); + if (ret) { + spin_lock_irqsave(&recv_queue->lock, flags); + list_del(&mad_priv->header.mad_list.list); + recv_queue->count--; + spin_unlock_irqrestore(&recv_queue->lock, flags); + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&mad_priv->header, + mapping), + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, mad_priv); + printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret); + break; + } + } while (post); + + return ret; +} + +/* + * Return all the posted receive MADs + */ +static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv; + struct ib_mad_list_head *mad_list; + + while (!list_empty(&qp_info->recv_queue.list)) { + + mad_list = list_entry(qp_info->recv_queue.list.next, + struct ib_mad_list_head, list); + mad_priv_hdr = container_of(mad_list, + struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, + 
header); + + /* Remove from posted receive MAD list */ + list_del(&mad_list->list); + + /* Undo PCI mapping */ + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, recv); + } + + qp_info->recv_queue.count = 0; +} + +/* + * Start the port + */ +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) +{ + int ret, i; + struct ib_qp_attr *attr; + struct ib_qp *qp; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); + return -ENOMEM; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "INIT: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTR: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTS: %d\n", i, ret); + goto out; + } + } + + ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR PFX "Failed to request completion " + "notification: %d\n", ret); + goto out; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); + if (ret) { + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); + goto out; + } + } +out: + kfree(attr); + return ret; +} + +static void qp_event_handler(struct ib_event *event, void *qp_context) +{ + struct ib_mad_qp_info *qp_info = qp_context; + + /* It's worse than that! He's dead, Jim! 
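(editor's note: no recovery is attempted for a fatal asynchronous QP event; it is only logged)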
*/ + printk(KERN_ERR PFX "Fatal error (%d) on MAD QP (%d)\n", + event->event, qp_info->qp->qp_num); +} + +static void init_mad_queue(struct ib_mad_qp_info *qp_info, + struct ib_mad_queue *mad_queue) +{ + mad_queue->qp_info = qp_info; + mad_queue->count = 0; + spin_lock_init(&mad_queue->lock); + INIT_LIST_HEAD(&mad_queue->list); +} + +static void init_mad_qp(struct ib_mad_port_private *port_priv, + struct ib_mad_qp_info *qp_info) +{ + qp_info->port_priv = port_priv; + init_mad_queue(qp_info, &qp_info->send_queue); + init_mad_queue(qp_info, &qp_info->recv_queue); + INIT_LIST_HEAD(&qp_info->overflow_list); + spin_lock_init(&qp_info->snoop_lock); + qp_info->snoop_table = NULL; + qp_info->snoop_table_size = 0; + atomic_set(&qp_info->snoop_count, 0); +} + +static int create_mad_qp(struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; + + memset(&qp_init_attr, 0, sizeof qp_init_attr); + qp_init_attr.send_cq = qp_info->port_priv->cq; + qp_init_attr.recv_cq = qp_info->port_priv->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = qp_info->port_priv->port_num; + qp_init_attr.qp_context = qp_info; + qp_init_attr.event_handler = qp_event_handler; + qp_info->qp = ib_create_qp(qp_info->port_priv->pd, &qp_init_attr); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", + get_spl_qp_index(qp_type)); + ret = PTR_ERR(qp_info->qp); + goto error; + } + /* Use minimum queue sizes unless the CQ is resized */ + qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; + qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; + return 0; + +error: + return ret; +} + +static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) +{ + ib_destroy_qp(qp_info->qp); + if (qp_info->snoop_table) + kfree(qp_info->snoop_table); +} + +/* + * Open the port + * Create the QP, PD, MR, and CQ if needed + */ +static int ib_mad_port_open(struct ib_device *device, + int port_num) +{ + int ret, cq_size; + struct ib_mad_port_private *port_priv; + unsigned long flags; + char name[sizeof "ib_mad123"]; + + /* First, check if port already open at MAD layer */ + port_priv = ib_get_mad_port(device, port_num); + if (port_priv) { + printk(KERN_DEBUG PFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR PFX "No memory for ib_mad_port_private\n"); + return -ENOMEM; + } + memset(port_priv, 0, sizeof *port_priv); + port_priv->device = device; + port_priv->port_num = port_num; + spin_lock_init(&port_priv->reg_lock); + INIT_LIST_HEAD(&port_priv->agent_list); + init_mad_qp(port_priv, &port_priv->qp_info[0]); + init_mad_qp(port_priv, &port_priv->qp_info[1]); + + cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + port_priv->cq = ib_create_cq(port_priv->device, + (ib_comp_handler) + ib_mad_thread_completion_handler, + NULL, port_priv, cq_size); + if (IS_ERR(port_priv->cq)) { + printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n"); + ret = PTR_ERR(port_priv->cq); + goto error3; + } + + port_priv->pd = ib_alloc_pd(device); + if (IS_ERR(port_priv->pd)) { + printk(KERN_ERR PFX "Couldn't create ib_mad 
PD\n"); + ret = PTR_ERR(port_priv->pd); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR PFX "Couldn't get ib_mad DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI); + if (ret) + goto error7; + + snprintf(name, sizeof name, "ib_mad%d", port_num); + port_priv->wq = create_singlethread_workqueue(name); + if (!port_priv->wq) { + ret = -ENOMEM; + goto error8; + } + INIT_WORK(&port_priv->work, ib_mad_completion_handler, port_priv); + + ret = ib_mad_port_start(port_priv); + if (ret) { + printk(KERN_ERR PFX "Couldn't start port\n"); + goto error9; + } + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_mad_port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + return 0; + +error9: + destroy_workqueue(port_priv->wq); +error8: + destroy_mad_qp(&port_priv->qp_info[1]); +error7: + destroy_mad_qp(&port_priv->qp_info[0]); +error6: + ib_dereg_mr(port_priv->mr); +error5: + ib_dealloc_pd(port_priv->pd); +error4: + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); +error3: + kfree(port_priv); + + return ret; +} + +/* + * Close the port + * If there are no classes using the port, free the port + * resources (CQ, MR, PD, QP) and remove the port's info structure + */ +static int ib_mad_port_close(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + port_priv = __ib_get_mad_port(device, port_num); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + printk(KERN_ERR PFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + /* Stop processing completions. 
*/ + flush_workqueue(port_priv->wq); + destroy_workqueue(port_priv->wq); + destroy_mad_qp(&port_priv->qp_info[1]); + destroy_mad_qp(&port_priv->qp_info[0]); + ib_dereg_mr(port_priv->mr); + ib_dealloc_pd(port_priv->pd); + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); + /* XXX: Handle deallocation of MAD registration tables */ + + kfree(port_priv); + + return 0; +} + +static void ib_mad_init_device(struct ib_device *device) +{ + int ret, num_ports, cur_port, i, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret = ib_mad_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d\n", + device->name, cur_port); + goto error_device_open; + } + ret = ib_agent_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d " + "for agents\n", + device->name, cur_port); + goto error_device_open; + } + } + + goto error_device_query; + +error_device_open: + while (i > 0) { + cur_port--; + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + } + i--; + } + +error_device_query: + return; +} + +static void ib_mad_remove_device(struct ib_device *device) +{ + int ret = 0, i, num_ports, cur_port, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + } +} + +static struct ib_client mad_client = { + .name = "mad", + .add = ib_mad_init_device, + .remove = ib_mad_remove_device +}; + +static int __init ib_mad_init_module(void) +{ + int ret; + + spin_lock_init(&ib_mad_port_list_lock); + spin_lock_init(&ib_agent_port_list_lock); + + ib_mad_cache = kmem_cache_create("ib_mad", + sizeof(struct ib_mad_private), + 0, + SLAB_HWCACHE_ALIGN, + NULL, + NULL); + if (!ib_mad_cache) { + printk(KERN_ERR PFX "Couldn't create ib_mad cache\n"); + ret = -ENOMEM; + goto error1; + } + + INIT_LIST_HEAD(&ib_mad_port_list); + + if (ib_register_client(&mad_client)) { + printk(KERN_ERR PFX "Couldn't register ib_mad client\n"); + ret = -EINVAL; + goto error2; + } + + return 0; + +error2: + kmem_cache_destroy(ib_mad_cache); +error1: + return ret; +} + +static void __exit ib_mad_cleanup_module(void) +{ + ib_unregister_client(&mad_client); + + if (kmem_cache_destroy(ib_mad_cache)) { + printk(KERN_DEBUG PFX "Failed to destroy ib_mad cache\n"); + } +} + +module_init(ib_mad_init_module); +module_exit(ib_mad_cleanup_module); From roland at topspin.com Sun Dec 19 22:14:57 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:57 -0800 Subject: [openib-general] [PATCH][v4][7/24] Add InfiniBand MAD SMI support In-Reply-To: <200412192214.OnqdUjlwc4A94uzk@topspin.com> Message-ID: 
<200412192214.ePDzwiq1jLcgVszE@topspin.com> Add MAD layer SMI (Subnet Management Interface) code. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.c 2004-12-19 22:04:13.901650545 -0800 @@ -0,0 +1,234 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include + + +/* + * Fixup a directed route SMP for sending + * Return 0 if the SMP should be discarded + */ +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 */ + if (hop_cnt && hop_ptr == 0) { + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:2 */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->return_path set when received */ + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + /* smp->return_path set when received */ + smp->hop_ptr++; + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- Fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) { + smp->hop_ptr--; + /* C14-13:3 -- SMPs destined for SM shouldn't be here */ + return (node_type == IB_NODE_SWITCH || + smp->dr_slid == IB_LID_PERMISSIVE); + } + + /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ + if (hop_ptr == 0) + return 1; + + /* C14-13:5 -- Check for unreasonable hop pointer */ + return 0; + } +} + +/* + * Adjust information for a received SMP + * Return 0 if the SMP should be dropped + */ +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 -- sender should have incremented hop_ptr */ + if (hop_cnt && hop_ptr == 0) + return 0; + + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + if (hop_cnt) + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->hop_ptr updated when sending */ + return (smp->return_path[hop_ptr-1] <= phys_port_cnt); + } + + /* C14-13:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == 1) { + if (smp->dr_slid == IB_LID_PERMISSIVE) { + /* giving SMP to SM - update hop_ptr */ + 
smp->hop_ptr--; + return 1; + } + /* smp->hop_ptr updated when sending */ + return (node_type == IB_NODE_SWITCH); + } + + /* C14-13:4 -- hop_ptr = 0 -> give to SM */ + /* C14-13:5 -- Check for unreasonable hop pointer */ + return (hop_ptr == 0); + } +} + +/* + * Return 1 if the received DR SMP should be forwarded to the send queue + * Return 0 if the SMP should be completed up the stack + */ +int smi_check_forward_dr_smp(struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + if (!ib_get_smp_direction(smp)) { + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) + return 1; + + /* C14-9:3 -- at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) + return (smp->dr_dlid == IB_LID_PERMISSIVE); + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + if (hop_ptr == hop_cnt + 1) + return 1; + } else { + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) + return 1; + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) + return (smp->dr_slid != IB_LID_PERMISSIVE); + } + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.h 2004-12-19 22:04:13.924647156 -0800 @@ -0,0 +1,67 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#ifndef __SMI_H_ +#define __SMI_H_ + +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); + +/* + * Return 1 if the SMP should be handled by the local SMA/SM via process_mad + */ +static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp) +{ + /* C14-9:3 -- We're at the end of the DR segment of path */ + /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ + return ((mad_agent->device->process_mad && + !ib_get_smp_direction(smp) && + (smp->hop_ptr == smp->hop_cnt + 1))); +} + +#endif /* __SMI_H_ */ From roland at topspin.com Sun Dec 19 22:14:58 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:58 -0800 Subject: [openib-general] [PATCH][v4][8/24] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <200412192214.ePDzwiq1jLcgVszE@topspin.com> Message-ID: <200412192214.sRxfKUrqyRM7YANI@topspin.com> Add support for sending queries to the SA (Subnet Administration). In particular the PathRecord and MCMember (multicast group member) used by the IP-over-InfiniBand driver are implemented. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-19 22:04:13.291740424 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-19 22:04:14.163611941 -0800 @@ -1,8 +1,10 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o ib_mad-y := mad.o smi.o agent.o + +ib_sa-y := sa_query.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sa_query.c 2004-12-19 22:04:14.187608404 -0800 @@ -0,0 +1,866 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand subnet administration query support"); +MODULE_LICENSE("Dual BSD/GPL"); + +/* + * These two structures must be packed because they have 64-bit fields + * that are only 32-bit aligned. 64-bit architectures will lay them + * out wrong otherwise. (And unfortunately they are sent on the wire + * so we can't change the layout) + */ +struct ib_sa_hdr { + u64 sm_key; + u16 attr_offset; + u16 reserved; + ib_sa_comp_mask comp_mask; +} __attribute__ ((packed)); + +struct ib_sa_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + struct ib_sa_hdr sa_hdr; + u8 data[200]; +} __attribute__ ((packed)); + +struct ib_sa_sm_ah { + struct ib_ah *ah; + struct kref ref; +}; + +struct ib_sa_port { + struct ib_mad_agent *agent; + struct ib_mr *mr; + struct ib_sa_sm_ah *sm_ah; + struct work_struct update_task; + spinlock_t ah_lock; + u8 port_num; +}; + +struct ib_sa_device { + int start_port, end_port; + struct ib_event_handler event_handler; + struct ib_sa_port port[0]; +}; + +struct ib_sa_query { + void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); + void (*release)(struct ib_sa_query *); + struct ib_sa_port *port; + struct ib_sa_mad *mad; + struct ib_sa_sm_ah *sm_ah; + DECLARE_PCI_UNMAP_ADDR(mapping) + int id; +}; + +struct ib_sa_path_query { + void (*callback)(int, struct ib_sa_path_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +struct ib_sa_mcmember_query { + void (*callback)(int, struct ib_sa_mcmember_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +static void ib_sa_add_one(struct ib_device *device); +static void ib_sa_remove_one(struct ib_device *device); + +static struct ib_client sa_client = { + .name = "sa", + .add = ib_sa_add_one, + .remove = ib_sa_remove_one +}; + +static spinlock_t idr_lock; +static DEFINE_IDR(query_idr); + +static spinlock_t tid_lock; +static u32 tid; + +enum { + IB_SA_ATTR_CLASS_PORTINFO = 0x01, + IB_SA_ATTR_NOTICE = 0x02, + IB_SA_ATTR_INFORM_INFO = 0x03, + IB_SA_ATTR_NODE_REC = 0x11, + IB_SA_ATTR_PORT_INFO_REC = 0x12, + IB_SA_ATTR_SL2VL_REC = 0x13, + IB_SA_ATTR_SWITCH_REC = 0x14, + IB_SA_ATTR_LINEAR_FDB_REC = 0x15, + IB_SA_ATTR_RANDOM_FDB_REC = 0x16, + IB_SA_ATTR_MCAST_FDB_REC = 0x17, + IB_SA_ATTR_SM_INFO_REC = 0x18, + IB_SA_ATTR_LINK_REC = 0x20, + IB_SA_ATTR_GUID_INFO_REC = 0x30, + IB_SA_ATTR_SERVICE_REC = 0x31, + IB_SA_ATTR_PARTITION_REC = 0x33, + IB_SA_ATTR_RANGE_REC = 0x34, + IB_SA_ATTR_PATH_REC = 0x35, + IB_SA_ATTR_VL_ARB_REC = 0x36, + IB_SA_ATTR_MC_GROUP_REC = 0x37, + IB_SA_ATTR_MC_MEMBER_REC = 0x38, + IB_SA_ATTR_TRACE_REC = 0x39, + IB_SA_ATTR_MULTI_PATH_REC = 0x3a, + IB_SA_ATTR_SERVICE_ASSOC_REC = 0x3b +}; + +#define PATH_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_path_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_path_rec *) 0)->field, \ + .field_name = "sa_path_rec:" #field + +static const struct ib_field path_rec_table[] = { + { RESERVED, + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 32 }, + { PATH_REC_FIELD(dgid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(sgid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(dlid), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { 
PATH_REC_FIELD(slid), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 16 }, + { PATH_REC_FIELD(raw_traffic), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 11, + .offset_bits = 1, + .size_bits = 3 }, + { PATH_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { PATH_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { PATH_REC_FIELD(traffic_class), + .offset_words = 12, + .offset_bits = 0, + .size_bits = 8 }, + { PATH_REC_FIELD(reversible), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { PATH_REC_FIELD(numb_path), + .offset_words = 12, + .offset_bits = 9, + .size_bits = 7 }, + { PATH_REC_FIELD(pkey), + .offset_words = 12, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 13, + .offset_bits = 0, + .size_bits = 12 }, + { PATH_REC_FIELD(sl), + .offset_words = 13, + .offset_bits = 12, + .size_bits = 4 }, + { PATH_REC_FIELD(mtu_selector), + .offset_words = 13, + .offset_bits = 16, + .size_bits = 2 }, + { PATH_REC_FIELD(mtu), + .offset_words = 13, + .offset_bits = 18, + .size_bits = 6 }, + { PATH_REC_FIELD(rate_selector), + .offset_words = 13, + .offset_bits = 24, + .size_bits = 2 }, + { PATH_REC_FIELD(rate), + .offset_words = 13, + .offset_bits = 26, + .size_bits = 6 }, + { PATH_REC_FIELD(packet_life_time_selector), + .offset_words = 14, + .offset_bits = 0, + .size_bits = 2 }, + { PATH_REC_FIELD(packet_life_time), + .offset_words = 14, + .offset_bits = 2, + .size_bits = 6 }, + { PATH_REC_FIELD(preference), + .offset_words = 14, + .offset_bits = 8, + .size_bits = 8 }, + { RESERVED, + .offset_words = 14, + .offset_bits = 16, + .size_bits = 48 }, +}; + +#define MCMEMBER_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_mcmember_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_mcmember_rec *) 0)->field, \ + .field_name = "sa_mcmember_rec:" #field + +static const struct ib_field mcmember_rec_table[] = { + { MCMEMBER_REC_FIELD(mgid), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(port_gid), + .offset_words = 4, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(qkey), + .offset_words = 8, + .offset_bits = 0, + .size_bits = 32 }, + { MCMEMBER_REC_FIELD(mlid), + .offset_words = 9, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(mtu_selector), + .offset_words = 9, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(mtu), + .offset_words = 9, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(traffic_class), + .offset_words = 9, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(pkey), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(rate_selector), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(rate), + .offset_words = 10, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(packet_life_time_selector), + .offset_words = 10, + .offset_bits = 24, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(packet_life_time), + .offset_words = 10, + .offset_bits = 26, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(sl), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { MCMEMBER_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(scope), + .offset_words = 12, + .offset_bits = 
0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(join_state), + .offset_words = 12, + .offset_bits = 4, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(proxy_join), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { RESERVED, + .offset_words = 12, + .offset_bits = 9, + .size_bits = 23 }, +}; + +static void free_sm_ah(struct kref *kref) +{ + struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); + + ib_destroy_ah(sm_ah->ah); + kfree(sm_ah); +} + +static void update_sm_ah(void *port_ptr) +{ + struct ib_sa_port *port = port_ptr; + struct ib_sa_sm_ah *new_ah, *old_ah; + struct ib_port_attr port_attr; + struct ib_ah_attr ah_attr; + + if (ib_query_port(port->agent->device, port->port_num, &port_attr)) { + printk(KERN_WARNING "Couldn't query port\n"); + return; + } + + new_ah = kmalloc(sizeof *new_ah, GFP_KERNEL); + if (!new_ah) { + printk(KERN_WARNING "Couldn't allocate new SM AH\n"); + return; + } + + kref_init(&new_ah->ref); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = port_attr.sm_lid; + ah_attr.sl = port_attr.sm_sl; + ah_attr.port_num = port->port_num; + + new_ah->ah = ib_create_ah(port->agent->qp->pd, &ah_attr); + if (IS_ERR(new_ah->ah)) { + printk(KERN_WARNING "Couldn't create new SM AH\n"); + kfree(new_ah); + return; + } + + spin_lock_irq(&port->ah_lock); + old_ah = port->sm_ah; + port->sm_ah = new_ah; + spin_unlock_irq(&port->ah_lock); + + if (old_ah) + kref_put(&old_ah->ref, free_sm_ah); +} + +static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) +{ + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + struct ib_sa_device *sa_dev = + ib_get_client_data(event->device, &sa_client); + + schedule_work(&sa_dev->port[event->element.port_num - + sa_dev->start_port].update_task); + } +} + +/** + * ib_sa_cancel_query - try to cancel an SA query + * @id:ID of query to cancel + * @query:query pointer to cancel + * + * Try to cancel an SA query. If the id and query don't match up or + * the query has already completed, nothing is done. Otherwise the + * query is canceled and will complete with a status of -EINTR. 
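+ *
+ * (editor's note, illustrative sketch only: a caller that started a
+ * query with
+ *	id = ib_sa_path_rec_get(device, port, &rec, mask, timeout_ms,
+ *				GFP_KERNEL, my_callback, my_context, &query);
+ * would cancel it with ib_sa_cancel_query(id, query); my_callback,
+ * my_context, rec and mask are hypothetical caller names, and the
+ * callback is still invoked, with status -EINTR)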
+ */ +void ib_sa_cancel_query(int id, struct ib_sa_query *query) +{ + unsigned long flags; + struct ib_mad_agent *agent; + + spin_lock_irqsave(&idr_lock, flags); + if (idr_find(&query_idr, id) != query) { + spin_unlock_irqrestore(&idr_lock, flags); + return; + } + agent = query->port->agent; + spin_unlock_irqrestore(&idr_lock, flags); + + ib_cancel_mad(agent, id); +} +EXPORT_SYMBOL(ib_sa_cancel_query); + +static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent) +{ + unsigned long flags; + + memset(mad, 0, sizeof *mad); + + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + + spin_lock_irqsave(&tid_lock, flags); + mad->mad_hdr.tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | tid++); + spin_unlock_irqrestore(&tid_lock, flags); +} + +static int send_mad(struct ib_sa_query *query, int timeout_ms) +{ + struct ib_sa_port *port = query->port; + unsigned long flags; + int ret; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .mad_hdr = &query->mad->mad_hdr, + .remote_qpn = 1, + .remote_qkey = IB_QP1_QKEY, + .timeout_ms = timeout_ms + } + } + }; + +retry: + if (!idr_pre_get(&query_idr, GFP_ATOMIC)) + return -ENOMEM; + spin_lock_irqsave(&idr_lock, flags); + ret = idr_get_new(&query_idr, query, &query->id); + spin_unlock_irqrestore(&idr_lock, flags); + if (ret == -EAGAIN) + goto retry; + if (ret) + return ret; + + wr.wr_id = query->id; + + spin_lock_irqsave(&port->ah_lock, flags); + kref_get(&port->sm_ah->ref); + query->sm_ah = port->sm_ah; + wr.wr.ud.ah = port->sm_ah->ah; + spin_unlock_irqrestore(&port->ah_lock, flags); + + gather_list.addr = dma_map_single(port->agent->device->dma_device, + query->mad, + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + gather_list.length = sizeof (struct ib_sa_mad); + gather_list.lkey = port->mr->lkey; + pci_unmap_addr_set(query, mapping, gather_list.addr); + + ret = ib_post_send_mad(port->agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(port->agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, query->id); + spin_unlock_irqrestore(&idr_lock, flags); + } + + return ret; +} + +static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_path_query *query = + container_of(sa_query, struct ib_sa_path_query, sa_query); + + if (mad) { + struct ib_sa_path_rec rec; + + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); +} + +/** + * ib_sa_path_rec_get - Start a Path get query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Path Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to 
cancel query + * + * Send a Path Record Get query to the SA to look up a path. The + * callback function will be called when the query completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_path_rec_get() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query. + */ +int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_path_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_path_rec_callback; + query->sa_query.release = ib_sa_path_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = IB_MGMT_METHOD_GET; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_path_rec_get); + +static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_mcmember_query *query = + container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + + if (mad) { + struct ib_sa_mcmember_rec rec; + + ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); +} + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_mcmember_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_mcmember_rec_callback; + query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = method; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_mcmember_rec_query); + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (!query) + return; + + switch (mad_send_wc->status) { + case IB_WC_SUCCESS: + /* No callback -- already got recv */ + break; + case IB_WC_RESP_TIMEOUT_ERR: + query->callback(query, -ETIMEDOUT, NULL); + break; + case IB_WC_WR_FLUSH_ERR: + query->callback(query, -EINTR, NULL); + break; + default: + query->callback(query, -EIO, NULL); + break; + } + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + + query->release(query); + + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (query) { + if (mad_recv_wc->wc->status == IB_WC_SUCCESS) + query->callback(query, + mad_recv_wc->recv_buf.mad->mad_hdr.status ? + -EINVAL : 0, + (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); + else + query->callback(query, -EIO, NULL); + } + + ib_free_recv_mad(mad_recv_wc); +} + +static void ib_sa_add_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + sa_dev = kmalloc(sizeof *sa_dev + + (e - s + 1) * sizeof (struct ib_sa_port), + GFP_KERNEL); + if (!sa_dev) + return; + + sa_dev->start_port = s; + sa_dev->end_port = e; + + for (i = 0; i <= e - s; ++i) { + sa_dev->port[i].mr = NULL; + sa_dev->port[i].sm_ah = NULL; + sa_dev->port[i].port_num = i + s; + spin_lock_init(&sa_dev->port[i].ah_lock); + + sa_dev->port[i].agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, + NULL, 0, send_handler, + recv_handler, sa_dev); + if (IS_ERR(sa_dev->port[i].agent)) + goto err; + + sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(sa_dev->port[i].mr)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + goto err; + } + + INIT_WORK(&sa_dev->port[i].update_task, + update_sm_ah, &sa_dev->port[i]); + } + + ib_set_client_data(device, &sa_client, sa_dev); + + /* + * We register our event handler after everything is set up, + * and then update our cached info after the event handler is + * registered to avoid any problems if a port changes state + * during our initialization. 
+ */ + + INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); + if (ib_register_event_handler(&sa_dev->event_handler)) + goto err; + + for (i = 0; i <= e - s; ++i) + update_sm_ah(&sa_dev->port[i]); + + return; + +err: + while (--i >= 0) { + ib_dereg_mr(sa_dev->port[i].mr); + ib_unregister_mad_agent(sa_dev->port[i].agent); + } + + kfree(sa_dev); + + return; +} + +static void ib_sa_remove_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + int i; + + if (!sa_dev) + return; + + ib_unregister_event_handler(&sa_dev->event_handler); + + for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); + } + + kfree(sa_dev); +} + +static int __init ib_sa_init(void) +{ + int ret; + + spin_lock_init(&idr_lock); + spin_lock_init(&tid_lock); + + get_random_bytes(&tid, sizeof tid); + + ret = ib_register_client(&sa_client); + if (ret) + printk(KERN_ERR "Couldn't register ib_sa client\n"); + + return ret; +} + +static void __exit ib_sa_cleanup(void) +{ + ib_unregister_client(&sa_client); +} + +module_init(ib_sa_init); +module_exit(ib_sa_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_sa.h 2004-12-19 22:04:14.212604721 -0800 @@ -0,0 +1,280 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef IB_SA_H +#define IB_SA_H + +#include + +#include +#include + +enum { + IB_SA_CLASS_VERSION = 2, /* IB spec version 1.1/1.2 */ + + IB_SA_METHOD_DELETE = 0x15 +}; + +enum ib_sa_selector { + IB_SA_GTE = 0, + IB_SA_LTE = 1, + IB_SA_EQ = 2, + /* + * The meaning of "best" depends on the attribute: for + * example, for MTU best will return the largest available + * MTU, while for packet life time, best will return the + * smallest available life time. + */ + IB_SA_BEST = 3 +}; + +typedef u64 __bitwise ib_sa_comp_mask; + +#define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) + +/* + * Structures for SA records are named "struct ib_sa_xxx_rec." 
No + * attempt is made to pack structures to match the physical layout of + * SA records in SA MADs; all packing and unpacking is handled by the + * SA query code. + * + * For a record with structure ib_sa_xxx_rec, the naming convention + * for the component mask value for field yyy is IB_SA_XXX_REC_YYY (we + * never use different abbreviations or otherwise change the spelling + * of xxx/yyy between ib_sa_xxx_rec.yyy and IB_SA_XXX_REC_YYY). + * + * Reserved rows are indicated with comments to help maintainability. + */ + +/* reserved: 0 */ +/* reserved: 1 */ +#define IB_SA_PATH_REC_DGID IB_SA_COMP_MASK( 2) +#define IB_SA_PATH_REC_SGID IB_SA_COMP_MASK( 3) +#define IB_SA_PATH_REC_DLID IB_SA_COMP_MASK( 4) +#define IB_SA_PATH_REC_SLID IB_SA_COMP_MASK( 5) +#define IB_SA_PATH_REC_RAW_TRAFFIC IB_SA_COMP_MASK( 6) +/* reserved: 7 */ +#define IB_SA_PATH_REC_FLOW_LABEL IB_SA_COMP_MASK( 8) +#define IB_SA_PATH_REC_HOP_LIMIT IB_SA_COMP_MASK( 9) +#define IB_SA_PATH_REC_TRAFFIC_CLASS IB_SA_COMP_MASK(10) +#define IB_SA_PATH_REC_REVERSIBLE IB_SA_COMP_MASK(11) +#define IB_SA_PATH_REC_NUMB_PATH IB_SA_COMP_MASK(12) +#define IB_SA_PATH_REC_PKEY IB_SA_COMP_MASK(13) +/* reserved: 14 */ +#define IB_SA_PATH_REC_SL IB_SA_COMP_MASK(15) +#define IB_SA_PATH_REC_MTU_SELECTOR IB_SA_COMP_MASK(16) +#define IB_SA_PATH_REC_MTU IB_SA_COMP_MASK(17) +#define IB_SA_PATH_REC_RATE_SELECTOR IB_SA_COMP_MASK(18) +#define IB_SA_PATH_REC_RATE IB_SA_COMP_MASK(19) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(20) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(21) +#define IB_SA_PATH_REC_PREFERENCE IB_SA_COMP_MASK(22) + +struct ib_sa_path_rec { + /* reserved */ + /* reserved */ + union ib_gid dgid; + union ib_gid sgid; + u16 dlid; + u16 slid; + int raw_traffic; + /* reserved */ + u32 flow_label; + u8 hop_limit; + u8 traffic_class; + int reversible; + u8 numb_path; + u16 pkey; + /* reserved */ + u8 sl; + u8 mtu_selector; + enum ib_mtu mtu; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 preference; +}; + +#define IB_SA_MCMEMBER_REC_MGID IB_SA_COMP_MASK( 0) +#define IB_SA_MCMEMBER_REC_PORT_GID IB_SA_COMP_MASK( 1) +#define IB_SA_MCMEMBER_REC_QKEY IB_SA_COMP_MASK( 2) +#define IB_SA_MCMEMBER_REC_MLID IB_SA_COMP_MASK( 3) +#define IB_SA_MCMEMBER_REC_MTU_SELECTOR IB_SA_COMP_MASK( 4) +#define IB_SA_MCMEMBER_REC_MTU IB_SA_COMP_MASK( 5) +#define IB_SA_MCMEMBER_REC_TRAFFIC_CLASS IB_SA_COMP_MASK( 6) +#define IB_SA_MCMEMBER_REC_PKEY IB_SA_COMP_MASK( 7) +#define IB_SA_MCMEMBER_REC_RATE_SELECTOR IB_SA_COMP_MASK( 8) +#define IB_SA_MCMEMBER_REC_RATE IB_SA_COMP_MASK( 9) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(10) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(11) +#define IB_SA_MCMEMBER_REC_SL IB_SA_COMP_MASK(12) +#define IB_SA_MCMEMBER_REC_FLOW_LABEL IB_SA_COMP_MASK(13) +#define IB_SA_MCMEMBER_REC_HOP_LIMIT IB_SA_COMP_MASK(14) +#define IB_SA_MCMEMBER_REC_SCOPE IB_SA_COMP_MASK(15) +#define IB_SA_MCMEMBER_REC_JOIN_STATE IB_SA_COMP_MASK(16) +#define IB_SA_MCMEMBER_REC_PROXY_JOIN IB_SA_COMP_MASK(17) + +struct ib_sa_mcmember_rec { + union ib_gid mgid; + union ib_gid port_gid; + u32 qkey; + u16 mlid; + u8 mtu_selector; + enum ib_mtu mtu; + u8 traffic_class; + u16 pkey; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 sl; + u32 flow_label; + u8 hop_limit; + u8 scope; + u8 join_state; + int proxy_join; +}; + +struct ib_sa_query; + +void ib_sa_cancel_query(int id, struct ib_sa_query *query); + +int 
ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +/** + * ib_sa_mcmember_rec_set - Start an MCMember set query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:MCMember Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send an MCMember Set query to the SA (eg to join a multicast + * group). The callback function will be called when the query + * completes (or fails); status is 0 for a successful response, -EINTR + * if the query is canceled, -ETIMEDOUT if the query timed out, or + * -EIO if an error occurred sending the query. The resp parameter of + * the callback is only valid if status is 0. + * + * If the return value of ib_sa_mcmember_rec_set() is negative, it is + * an error code. Otherwise it is a query ID that can be used to + * cancel the query. + */ +static inline int +ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_MGMT_METHOD_SET, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + +/** + * ib_sa_mcmember_rec_delete - Start an MCMember delete query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:MCMember Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send an MCMember Delete query to the SA (eg to leave a multicast + * group). The callback function will be called when the query + * completes (or fails); status is 0 for a successful response, -EINTR + * if the query is canceled, -ETIMEDOUT if the query timed out, or + * -EIO if an error occurred sending the query. The resp parameter of + * the callback is only valid if status is 0. + * + * If the return value of ib_sa_mcmember_rec_delete() is negative, it + * is an error code. Otherwise it is a query ID that can be used to + * cancel the query.
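To make the calling convention documented above concrete, here is a minimal, hypothetical consumer sketch (the names my_path_callback and start_path_query are invented for this illustration and are not part of the posted patch). It starts a path record query and keeps the returned ID so the query could later be canceled with ib_sa_cancel_query():

static void my_path_callback(int status, struct ib_sa_path_rec *resp,
                             void *context)
{
        /* resp is only valid while status == 0 */
        if (status)
                printk(KERN_WARNING "path query failed: %d\n", status);
        else
                printk(KERN_INFO "path resolved, dlid 0x%x\n", resp->dlid);
}

static int start_path_query(struct ib_device *device, u8 port_num,
                            union ib_gid *sgid, union ib_gid *dgid,
                            struct ib_sa_query **query)
{
        struct ib_sa_path_rec rec = {
                .sgid      = *sgid,
                .dgid      = *dgid,
                .numb_path = 1
        };
        int id;

        id = ib_sa_path_rec_get(device, port_num, &rec,
                                IB_SA_PATH_REC_DGID |
                                IB_SA_PATH_REC_SGID |
                                IB_SA_PATH_REC_NUMB_PATH,
                                1000, GFP_KERNEL,
                                my_path_callback, NULL, query);
        if (id < 0)
                return id;      /* error code */

        /* a non-negative id can later be passed to ib_sa_cancel_query(id, *query) */
        return 0;
}

Judging by the MCMember callback earlier in this patch, the record handed to the callback is a stack-local copy unpacked from the MAD, so a consumer should copy out whatever fields it needs before returning.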
+ */ +static inline int +ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_SA_METHOD_DELETE, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + + +#endif /* IB_SA_H */ From roland at topspin.com Sun Dec 19 22:15:02 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:02 -0800 Subject: [openib-general] [PATCH][v4][10/24] Add Mellanox HCA low-level driver (midlayer interface) In-Reply-To: <200412192214.l0L0xI7jtoJS0zXS@topspin.com> Message-ID: <200412192215.tmAzGQeAfP77Jebj@topspin.com> Add midlayer interface code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c 2004-12-19 22:04:15.045481984 -0800 @@ -0,0 +1,626 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_provider.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_NODE_INFO; + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 76)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 70)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 72)); + memcpy(&props->sys_image_guid, out_mad->data + 44, 8); + memcpy(&props->node_guid, out_mad->data + 52, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 56)); + props->lmc = (*(u8 *) (out_mad->data + 74)) & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 58)); + props->sm_sl = (*(u8 *) (out_mad->data + 76)) & 0xf; + props->state = (*(u8 *) (out_mad->data + 72)) & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 60)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 88)); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_PKEY_TABLE; + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 32); + + err 
= mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) (out_mad->data + 40))[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->mad_hdr.attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 48, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = IB_SMP_ATTR_GUID_INFO; + in_mad->mad_hdr.attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + 40 + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = 
init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent <= entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent - 1; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) | + MTHCA_MPT_FLAG_LOCAL_READ; +} + +static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mthca_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mthca_mr_alloc_notrans(to_mdev(pd->device), + to_mpd(pd)->pd_num, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct mthca_mr *mr; + u64 *page_list; + u64 total_size; + u64 mask; + int shift; + int npages; + int err; + int i, j, n; + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + mask = 0; + total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (buffer_list[i].addr & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + /* Find largest page shift we can use to cover buffers */ + for (shift = PAGE_SHIFT; shift < 31; ++shift) + if (num_phys_buf > 1) { + if ((1ULL << shift) & mask) + break; + } else { + if (1ULL << shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); + buffer_list[0].addr &= ~0ull << shift; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + npages = 0; + for (i = 0; i < num_phys_buf; ++i) + npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + + if (!npages) + return &mr->ibmr; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + ++j) + page_list[n++] = buffer_list[i].addr + ((u64) j << shift); + + mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) " + "in PD %x; shift %d, npages %d.\n", + (unsigned long long) buffer_list[0].addr, + (unsigned long long) *iova_start, + to_mpd(pd)->pd_num, + shift, npages); + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + *iova_start, total_size, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static int mthca_dereg_mr(struct ib_mr *mr) +{ + mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); + kfree(mr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { + 
case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + +int mthca_register_device(struct mthca_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.dma_device = &dev->pdev->dev; + dev->ib_dev.class_dev.dev = &dev->pdev->dev; + dev->ib_dev.query_device = mthca_query_device; + dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_port = mthca_modify_port; + dev->ib_dev.query_pkey = mthca_query_pkey; + dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_pd = mthca_alloc_pd; + dev->ib_dev.dealloc_pd = mthca_dealloc_pd; + dev->ib_dev.create_ah = mthca_ah_create; + dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.create_qp = mthca_create_qp; + dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.destroy_qp = mthca_destroy_qp; + dev->ib_dev.post_send = mthca_post_send; + dev->ib_dev.post_recv = mthca_post_receive; + dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.destroy_cq = mthca_destroy_cq; + dev->ib_dev.poll_cq = mthca_poll_cq; + dev->ib_dev.req_notify_cq = mthca_req_notify_cq; + dev->ib_dev.get_dma_mr = mthca_get_dma_mr; + dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.dereg_mr = mthca_dereg_mr; + dev->ib_dev.attach_mcast = mthca_multicast_attach; + dev->ib_dev.detach_mcast = mthca_multicast_detach; + dev->ib_dev.process_mad = mthca_process_mad; + + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; +} + +void mthca_unregister_device(struct mthca_dev *dev) +{ + ib_unregister_device(&dev->ib_dev); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h 2004-12-19 22:04:15.069478447 -0800 @@ -0,0 +1,225 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_provider.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_PROVIDER_H +#define MTHCA_PROVIDER_H + +#include +#include + +#define MTHCA_MPT_FLAG_ATOMIC (1 << 14) +#define MTHCA_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define MTHCA_MPT_FLAG_REMOTE_READ (1 << 12) +#define MTHCA_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define MTHCA_MPT_FLAG_LOCAL_READ (1 << 10) + +struct mthca_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct mthca_mr { + struct ib_mr ibmr; + int order; + u32 first_seg; +}; + +struct mthca_pd { + struct ib_pd ibpd; + u32 pd_num; + atomic_t sqp_count; + struct mthca_mr ntmr; +}; + +struct mthca_eq { + struct mthca_dev *dev; + int eqn; + u32 ecr_mask; + u16 msi_x_vector; + u16 msi_x_entry; + int have_irq; + int nent; + int cons_index; + struct mthca_buf_list *page_list; + struct mthca_mr mr; +}; + +struct mthca_av; + +struct mthca_ah { + struct ib_ah ibah; + int on_hca; + u32 key; + struct mthca_av *av; + dma_addr_t avdma; +}; + +/* + * Quick description of our CQ/QP locking scheme: + * + * We have one global lock that protects dev->cq/qp_table. Each + * struct mthca_cq/qp also has its own lock. An individual qp lock + * may be taken inside of an individual cq lock. Both cqs attached to + * a qp may be locked, with the send cq locked first. No other + * nesting should be done. + * + * Each struct mthca_cq/qp also has an atomic_t ref count. The + * pointer from the cq/qp_table to the struct counts as one reference. + * This reference also is good for access through the consumer API, so + * modifying the CQ/QP etc doesn't need to take another reference. + * Access because of a completion being polled does need a reference. + * + * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the + * destroy function to sleep on. + * + * This means that access from the consumer API requires nothing but + * taking the struct's lock. + * + * Access because of a completion event should go as follows: + * - lock cq/qp_table and look up struct + * - increment ref count in struct + * - drop cq/qp_table lock + * - lock struct, do your thing, and unlock struct + * - decrement ref count; if zero, wake up waiters + * + * To destroy a CQ/QP, we can do the following: + * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock + * - decrement ref count + * - wait_event until ref count is zero + * + * It is the consumer's responsibility to make sure that no QP + * operations (WQE posting or state modification) are pending when the + * QP is destroyed. Also, the consumer must make sure that calls to + * qp_modify are serialized.
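As a minimal sketch of the completion-event sequence listed above, using the cq_table and per-CQ fields defined in this header (the function name handle_cq_event is invented for illustration, and plain spin_lock stands in for whatever IRQ-safe locking a real event handler would need):

static void handle_cq_event(struct mthca_dev *dev, u32 cqn)
{
        struct mthca_cq *cq;

        /* lock cq_table and look up the struct */
        spin_lock(&dev->cq_table.lock);
        cq = mthca_array_get(&dev->cq_table.cq, cqn);
        if (cq)
                atomic_inc(&cq->refcount);      /* take a reference */
        spin_unlock(&dev->cq_table.lock);       /* drop cq_table lock */

        if (!cq)
                return;

        spin_lock(&cq->lock);                   /* lock the struct itself */
        /* ... do your thing: process the event, poll completions ... */
        spin_unlock(&cq->lock);

        /* drop the reference; if it hits zero, wake a sleeping destroy */
        if (atomic_dec_and_test(&cq->refcount))
                wake_up(&cq->wait);
}

The destroy path is the mirror image: remove the pointer from the table, drop that reference, and wait_event() until the count reaches zero before freeing.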
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ From roland at topspin.com Sun Dec 19 22:14:59 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:14:59 -0800 Subject: [openib-general] [PATCH][v4][9/24] Add Mellanox HCA low-level driver In-Reply-To: <200412192214.sRxfKUrqyRM7YANI@topspin.com> Message-ID: <200412192214.l0L0xI7jtoJS0zXS@topspin.com> Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) is present. (As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API) Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-19 22:04:11.935940222 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-19 22:04:14.496562875 -0800 @@ -7,4 +7,6 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware. 
+source "drivers/infiniband/hw/mthca/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-19 22:04:11.960936538 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-19 22:04:14.472566412 -0800 @@ -1 +1,2 @@ obj-$(CONFIG_INFINIBAND) += core/ +obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Kconfig 2004-12-19 22:04:14.520559339 -0800 @@ -0,0 +1,26 @@ +config INFINIBAND_MTHCA + tristate "Mellanox HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for Mellanox InfiniHost host + channel adapters (HCAs), including the MT23108 PCI-X HCA + ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). + +config INFINIBAND_MTHCA_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option causes the mthca driver to produce a bunch of debug + messages. Select this if you are developing the driver or + trying to diagnose a problem. + +config INFINIBAND_MTHCA_SSE_DOORBELL + bool "SSE doorbell code" + depends on INFINIBAND_MTHCA && X86 && !X86_64 + default n + ---help--- + This option will have the mthca driver use SSE instructions + to ring hardware doorbell registers. This may improve + performance for some workloads, but the driver will not run + on processors without SSE instructions. --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Makefile 2004-12-19 22:04:14.543555950 -0800 @@ -0,0 +1,12 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o + +ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ + mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ + mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ + mthca_provider.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c 2004-12-19 22:04:14.567552414 -0800 @@ -0,0 +1,179 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE.
+ * + * $Id: mthca_allocator.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" + +/* Trivial bitmap-based allocator */ +u32 mthca_alloc(struct mthca_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) { + alloc->top = (alloc->top + alloc->max) & alloc->mask; + obj = find_first_zero_bit(alloc->table, alloc->max); + } + + if (obj < alloc->max) { + set_bit(obj, alloc->table); + obj |= alloc->top; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void mthca_free(struct mthca_alloc *alloc, u32 obj) +{ + obj &= alloc->max - 1; + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + alloc->top = (alloc->top + alloc->max) & alloc->mask; + spin_unlock(&alloc->lock); +} + +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != 1 << (ffs(num) - 1)) + return -EINVAL; + + alloc->last = 0; + alloc->top = 0; + alloc->max = num; + alloc->mask = mask; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void mthca_alloc_cleanup(struct mthca_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. + */ + +void *mthca_array_get(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int mthca_array_set(struct mthca_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held. 
*/ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h 2004-12-19 22:04:14.591548878 -0800 @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
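Since mthca_array leaves serialization of the _get, _set and _clear methods entirely to its callers, a rough usage sketch might look like the following (my_table, my_table_store and my_table_lookup are invented names for illustration, and the array is assumed to have been set up with mthca_array_init() beforehand):

struct my_table {
        spinlock_t lock;                /* serializes all mthca_array calls */
        struct mthca_array objs;
};

static int my_table_store(struct my_table *tbl, int index, void *obj)
{
        int err;

        spin_lock_irq(&tbl->lock);
        /* may allocate a leaf page; uses GFP_ATOMIC since the lock is held */
        err = mthca_array_set(&tbl->objs, index, obj);
        spin_unlock_irq(&tbl->lock);
        return err;
}

static void *my_table_lookup(struct my_table *tbl, int index)
{
        void *obj;

        spin_lock_irq(&tbl->lock);
        obj = mthca_array_get(&tbl->objs, index);
        spin_unlock_irq(&tbl->lock);
        return obj;
}

This mirrors how the driver's own CQ and QP tables pair a struct mthca_array with a spinlock.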
+ * + * $Id: mthca_config_reg.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_CONFIG_REG_H +#define MTHCA_CONFIG_REG_H + +#include + +#define MTHCA_HCR_BASE 0x80680 +#define MTHCA_HCR_SIZE 0x0001c +#define MTHCA_ECR_BASE 0x80700 +#define MTHCA_ECR_SIZE 0x00008 +#define MTHCA_ECR_CLR_BASE 0x80708 +#define MTHCA_ECR_CLR_SIZE 0x00008 +#define MTHCA_ECR_OFFSET (MTHCA_ECR_BASE - MTHCA_HCR_BASE) +#define MTHCA_ECR_CLR_OFFSET (MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE) +#define MTHCA_CLR_INT_BASE 0xf00d8 +#define MTHCA_CLR_INT_SIZE 0x00008 + +#define MTHCA_MAP_HCR_SIZE (MTHCA_ECR_CLR_BASE + \ + MTHCA_ECR_CLR_SIZE - \ + MTHCA_HCR_BASE) + +#endif /* MTHCA_CONFIG_REG_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-12-19 22:04:14.615545341 -0800 @@ -0,0 +1,391 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_dev.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; 
+ struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) \ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct 
mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-12-19 22:04:14.639541805 -0800 @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_doorbell.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. + */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-12-19 22:04:14.663538269 -0800 @@ -0,0 +1,933 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_main.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim) +{ + int err; + u8 status; + + err = mthca_QUERY_DEV_LIM(mdev, dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + if (dev_lim->min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + 
"kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim->min_page_sz, PAGE_SIZE); + return -ENODEV; + } + if (dev_lim->num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim->num_ports, MTHCA_MAX_PORTS); + return -ENODEV; + } + + mdev->limits.num_ports = dev_lim->num_ports; + mdev->limits.vl_cap = dev_lim->max_vl; + mdev->limits.mtu_cap = dev_lim->max_mtu; + mdev->limits.gid_table_len = dev_lim->max_gids; + mdev->limits.pkey_table_len = dev_lim->max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay; + mdev->limits.max_sg = dev_lim->max_sg; + mdev->limits.reserved_qps = dev_lim->reserved_qps; + mdev->limits.reserved_srqs = dev_lim->reserved_srqs; + mdev->limits.reserved_eecs = dev_lim->reserved_eecs; + mdev->limits.reserved_cqs = dev_lim->reserved_cqs; + mdev->limits.reserved_eqs = dev_lim->reserved_eqs; + mdev->limits.reserved_mtts = dev_lim->reserved_mtts; + mdev->limits.reserved_mrws = dev_lim->reserved_mrws; + mdev->limits.reserved_uars = dev_lim->reserved_uars; + mdev->limits.reserved_pds = dev_lim->reserved_pds; + + if (dev_lim->flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_dev_lim(mdev, &dev_lim); + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_ent, num_sg, fw_pages, cur_order; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, 
aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + fw_pages = mdev->fw.arbel.fw_pages; + num_ent = 0; + /* + * We allocate in as big chunks as we can, up to a maximum of + * 256 KB per chunk. + */ + cur_order = get_order(1 << 18); + while (fw_pages > 0) { + while (1 << cur_order > fw_pages) + --cur_order; + + /* + * We allocate with GFP_HIGHUSER because only the + * firmware is going to touch these pages, so there's + * no need for a kernel virtual address. We use + * __GFP_NOWARN because we'll deal with any allocation + * failures ourselves. + */ + mdev->fw.arbel.mem[num_ent].page = alloc_pages(GFP_HIGHUSER | __GFP_NOWARN, + cur_order); + mdev->fw.arbel.mem[num_ent].length = PAGE_SIZE << cur_order; + if (!mdev->fw.arbel.mem[num_ent].page) { + --cur_order; + if (cur_order < 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } else { + ++num_ent; + fw_pages -= 1 << cur_order; + } + } + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, num_ent, + PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_pages(mdev->fw.arbel.mem[i].page, + get_order(mdev->fw.arbel.mem[i].length)); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + struct mthca_dev_lim dev_lim; + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + err = mthca_dev_lim(mdev, &dev_lim); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + err = -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev *mdev) +{ 
+ if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. 
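+	 * (Concretely, the code below claims two sub-ranges of BAR 0 --
+	 * MTHCA_MAP_HCR_SIZE bytes at MTHCA_HCR_BASE for the command
+	 * interface and MTHCA_CLR_INT_SIZE bytes at MTHCA_CLR_INT_BASE for
+	 * the clear-interrupt register -- and leaves the MSI-X vector table
+	 * range unclaimed.)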
+ */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar2: + pci_release_region(pdev, 2); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_pages(mdev->fw.arbel.mem[i].page, + get_order(mdev->fw.arbel.mem[i].length)); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. 
We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); From roland at topspin.com Sun Dec 19 22:15:05 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:05 -0800 Subject: [openib-general] [PATCH][v4][12/24] Add Mellanox HCA low-level driver (EQ) In-Reply-To: <200412192215.f5Aip1f85HbDxoU5@topspin.com> Message-ID: <200412192215.2kuiGerZWq2NkwHo@topspin.com> Add event queue code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-12-19 22:04:15.548407870 -0800 @@ -0,0 +1,689 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_eq.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. + */ +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 cqn; + u32 reserved1; + u8 reserved2[3]; + u8 syndrome; + } __attribute__((packed)) cq_err; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; 
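+	/* high bit of owner is the ownership flag (MTHCA_EQ_ENTRY_OWNER_SW/_HW below) */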
+} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? + MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? 
IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + + while (next_eqe_sw(eq)) { + int set_ci = 0; + eqe = get_eqe(eq, eq->cons_index); + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + /* cmd_event() may add more commands. + * The card will think the queue has overflowed if + * we don't tell it we've been processing events. + */ + set_ci = 1; + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + mthca_warn(dev, "CQ %s on CQN %08x\n", + eqe->event.cq_err.syndrome == 1 ? + "overrun" : "access violation", + be32_to_cpu(eqe->event.cq_err.cqn)); + break; + + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + mthca_warn(dev, "EQ overrun on EQN %d\n", eq->eqn); + break; + + case MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on EQ %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + + if (set_ci) { + wmb(); /* see comment below */ + set_eq_ci(dev, eq->eqn, eq->cons_index); + set_ci = 0; + } + } + + /* + * This barrier makes sure that all updates to + * ownership bits done by set_eqe_hw() hit memory + * before the consumer index is updated. set_eq_ci() + * allows the HCA to possibly write more EQ entries, + * and we want to avoid the exceedingly unlikely + * possibility of the HCA writing an entry and then + * having set_eqe_hw() overwrite the owner field. 
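+	 * (The same reasoning is behind the wmb() in the MTHCA_EVENT_TYPE_CMD
+	 * arm of the switch above, where the consumer index is pushed early so
+	 * that commands queued from cmd_event() don't make the EQ appear to
+	 * have overflowed.)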
+ */ + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + /* Make sure EQ size is aligned to a power of 2 size. */ + for (i = 1; i < nent; i <<= 1) + ; /* nothing */ + nent = i; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + 
mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); + + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 
130 : intr, + &dev->eq_table.eq[MTHCA_EQ_CMD]); + if (err) + goto err_out_async; + + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { + static const char *eq_name[] = { + [MTHCA_EQ_COMP] = DRV_NAME " (comp)", + [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", + [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + }; + + for (i = 0; i < MTHCA_NUM_EQ; ++i) { + err = request_irq(dev->eq_table.eq[i].msi_x_vector, + mthca_msi_x_interrupt, 0, + eq_name[i], dev->eq_table.eq + i); + if (err) + goto err_out_cmd; + dev->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, + DRV_NAME, dev); + if (err) + goto err_out_cmd; + dev->eq_table.have_irq = 1; + } + + err = mthca_MAP_EQ(dev, async_mask(dev), + 0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status); + + err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + + return 0; + +err_out_cmd: + mthca_free_irqs(dev); + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); + +err_out_async: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + +err_out_comp: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); + +err_out_free: + mthca_alloc_cleanup(&dev->eq_table.alloc); + return err; +} + +void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev) +{ + u8 status; + int i; + + mthca_free_irqs(dev); + + mthca_MAP_EQ(dev, async_mask(dev), + 1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); + + mthca_alloc_cleanup(&dev->eq_table.alloc); +} From roland at topspin.com Sun Dec 19 22:15:03 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:03 -0800 Subject: [openib-general] [PATCH][v4][11/24] Add Mellanox HCA low-level driver (FW commands) In-Reply-To: <200412192215.tmAzGQeAfP77Jebj@topspin.com> Message-ID: <200412192215.f5Aip1f85HbDxoU5@topspin.com> Add firmware command processing code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-12-19 22:04:15.289446032 -0800 @@ -0,0 +1,1573 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_cmd.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now. 
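+ * (For reference, with HZ = 1000 the stricter per-class timeouts kept
+ * under "#if 0" below would work out to roughly 2, 11 and 101 jiffies;
+ * the values actually used are all 60 * HZ.)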
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* __raw_writel may not order writes. */ + wmb(); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = be32_to_cpu(__raw_readl(dev->hcr + HCR_STATUS_OFFSET)) >> 24; + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
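+ * (mthca_SYS_EN() below is one caller: on a DDR memory error the
+ * syndrome, socket and slave address are decoded from the immediate
+ * output parameter.)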
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. 
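+	 * For example, a chunk with DMA address 0x12340000 and length 0x40000
+	 * gives ffs(0x12340000 | 0x40000) - 1 = 18, so the whole chunk is
+	 * passed to the firmware as a single 256 KB (2^18 byte) page.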
+ */ + lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1; + if (lg < 12) { + mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n", + (unsigned long long) sg_dma_address(sglist + i), + sg_dma_len(sglist + i)); + err = -EINVAL; + goto out; + } + for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j, ++nent) { + *((__be64 *) (inbox + nent * 4 + 2)) = + cpu_to_be64((sg_dma_address(sglist + i) + + (j << lg)) | + (lg - 12)); + ts += 1 << (lg - 10); + if (nent == PAGE_SIZE / 16) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + if (err || *status) + goto out; + nent = 0; + } + } + } + + if (nent) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + } + + mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts); + +out: + pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma); + return err; +} + +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status); +} + +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status); +} + +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err = 0; + u8 lg; + +#define QUERY_FW_OUT_SIZE 0x100 +#define QUERY_FW_VER_OFFSET 0x00 +#define QUERY_FW_MAX_CMD_OFFSET 0x0f +#define QUERY_FW_ERR_START_OFFSET 0x30 +#define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_START_OFFSET 0x20 +#define QUERY_FW_END_OFFSET 0x28 + +#define QUERY_FW_SIZE_OFFSET 0x00 +#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20 +#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40 +#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma); + if (!outbox) { + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET); + /* + * FW subminor version is at more significant bits than minor + * version, so swap here. + */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + /* + * Arbel page size is always 4 KB; round up number of + * system pages needed.
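The rounding mentioned in the comment above converts the firmware's 4 KB page count into whole system pages. A minimal sketch of the same arithmetic, assuming a hypothetical 64 KB system page size purely for illustration:

#include <stdio.h>

#define EXAMPLE_PAGE_SHIFT 16            /* e.g. a 64 KB-page system */

/* Round a count of 4 KB firmware pages up to whole system pages. */
static int fw_pages_to_sys_pages(int fw_pages_4k)
{
    int shift = EXAMPLE_PAGE_SHIFT - 12; /* log2 of 4 KB pages per system page */

    return (fw_pages_4k + (1 << shift) - 1) >> shift;
}

int main(void)
{
    /* 17 x 4 KB = 68 KB of firmware -> needs 2 x 64 KB system pages */
    printf("%d\n", fw_pages_to_sys_pages(17));
    /* 16 x 4 KB = 64 KB -> exactly 1 system page */
    printf("%d\n", fw_pages_to_sys_pages(16));
    return 0;
}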
+ */ + dev->fw.arbel.fw_pages = + (dev->fw.arbel.fw_pages + (1 << (PAGE_SHIFT - 12)) - 1) >> + (PAGE_SHIFT - 12); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_RSZ_SRQ_OFFSET 0x33 +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_SG_RQ_OFFSET 0x55 +#define QUERY_DEV_LIM_MAX_DESC_SZ_RQ_OFFSET 0x56 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e +#define QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET 0x90 +#define QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET 0x92 +#define QUERY_DEV_LIM_PBL_SZ_OFFSET 0x96 +#define QUERY_DEV_LIM_BMME_FLAGS_OFFSET 0x97 +#define QUERY_DEV_LIM_RSVD_LKEY_OFFSET 0x98 +#define QUERY_DEV_LIM_LAMR_OFFSET 0x9f +#define QUERY_DEV_LIM_MAX_ICM_SZ_OFFSET 0xa0 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, 
outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + 
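Most of the QUERY_DEV_LIM limits above come back log2-encoded in part of a byte and are decoded as 1 << (field & mask), or with a shift for high-nibble fields. A small illustrative sketch of that convention (the byte value and masks below are invented, not taken from the firmware spec):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t outbox_byte = 0x94;          /* hypothetical byte from the mailbox */

    /* low 5 bits: log2 of max QPs (the rest of the byte is used elsewhere) */
    int max_qps = 1 << (outbox_byte & 0x1f);

    /* high 4 bits of another field: log2 of reserved entries */
    int reserved = 1 << (outbox_byte >> 4);

    printf("max_qps = %d, reserved = %d\n", max_qps, reserved);
    return 0;
}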
MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSZ_SRQ_OFFSET); + dev_lim->hca.arbel.resize_srq = field & 1; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET); + dev_lim->hca.arbel.mtt_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET); + dev_lim->hca.arbel.mpt_entry_sz = size; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET); + dev_lim->hca.arbel.max_pbl_sz = 1 << (field & 0x3f); + MTHCA_GET(dev_lim->hca.arbel.bmme_flags, outbox, + QUERY_DEV_LIM_BMME_FLAGS_OFFSET); + MTHCA_GET(dev_lim->hca.arbel.reserved_lkey, outbox, + QUERY_DEV_LIM_RSVD_LKEY_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_LAMR_OFFSET); + dev_lim->hca.arbel.lam_required = field & 1; + MTHCA_GET(dev_lim->hca.arbel.max_icm_sz, outbox, + QUERY_DEV_LIM_MAX_ICM_SZ_OFFSET); + + if (dev_lim->hca.arbel.bmme_flags & 1) + mthca_dbg(dev, "Base MM extensions: yes " + "(flags %d, max PBL %d, rsvd L_Key %08x)\n", + dev_lim->hca.arbel.bmme_flags, + dev_lim->hca.arbel.max_pbl_sz, + dev_lim->hca.arbel.reserved_lkey); + else + mthca_dbg(dev, "Base MM extensions: no\n"); + + mthca_dbg(dev, "Max ICM size %lld MB\n", + (unsigned long long) dev_lim->hca.arbel.max_icm_sz >> 20); + } else { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->hca.tavor.max_avs = 1 << (field & 0x3f); + } + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 
+#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + 
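The #if block above sets or clears the host-endianness bit in the big-endian INIT_HCA flags word through cpu_to_be32(). A userspace sketch of the same idea, using htonl() and a crude runtime probe as stand-ins (illustrative only, not the driver's mechanism):

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

int main(void)
{
    uint32_t flags = 0;                  /* stored big-endian, like the inbox words */

    /* set bit 0 ("check port for UD address vector") in big-endian form */
    flags |= htonl(1);

    /* bit 1 tells the HCA whether the host is big-endian: set it on a
     * big-endian host, keep it clear on a little-endian one */
    if (htonl(1) == 1)                   /* true only on big-endian hosts */
        flags |= htonl(1 << 1);
    else
        flags &= ~htonl(1 << 1);

    printf("flags word as stored: 0x%08x\n", flags);
    return 0;
}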
MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; + flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; + MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); + + MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET); + MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET); + MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET); + MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET); + MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET); + + err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, + CMD_TIME_CLASS_A, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) +{ + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); +} + +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) +{ + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); +} + +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + if (mpt_entry) { + outdma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry, + CMD_HW2SW_MPT, + CMD_TIME_CLASS_B, status); + + if (mpt_entry) + pci_unmap_single(dev->pdev, outdma, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mtt_entry, + (num_mtt + 2) * 8, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + (num_mtt + 2) * 8, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status) +{ + mthca_dbg(dev, "%s mask %016llx for eqn %d\n", + unmap ? 
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
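All of the command wrappers above follow the same shape: map the caller's buffer for DMA, issue the firmware command with a class timeout, then unmap. A self-contained sketch of that pattern, with stub functions standing in for pci_map_single()/mthca_cmd()/pci_unmap_single() (nothing below is real driver or PCI API):

#include <stdio.h>
#include <stdint.h>
#include <errno.h>

typedef uint64_t dma_addr_t;

static dma_addr_t map_buffer(void *buf, size_t len)
{
    printf("map   %zu bytes at %p\n", len, buf);
    return (dma_addr_t) (uintptr_t) buf;     /* pretend identity mapping */
}

static void unmap_buffer(dma_addr_t dma, size_t len)
{
    printf("unmap %zu bytes at 0x%llx\n", len, (unsigned long long) dma);
}

static int fw_command(dma_addr_t in, int index, int opcode, uint8_t *status)
{
    printf("cmd   opcode 0x%x, index %d, inbox 0x%llx\n",
           opcode, index, (unsigned long long) in);
    *status = 0;
    return 0;
}

/* Shape of mthca_SW2HW_MPT() and friends: map, command, unmap. */
static int sw2hw_entry(void *entry, size_t len, int index, uint8_t *status)
{
    dma_addr_t dma = map_buffer(entry, len);
    int err;

    if (!dma)
        return -ENOMEM;

    err = fw_command(dma, index, 0xd /* hypothetical opcode */, status);
    unmap_buffer(dma, len);
    return err;
}

int main(void)
{
    char entry[64];
    uint8_t status;

    return sw2hw_entry(entry, sizeof entry, 5, &status);
}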
pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE); + return err; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h 2004-12-19 22:04:15.312442643 -0800 @@ -0,0 +1,276 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_cmd.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_CMD_H +#define MTHCA_CMD_H + +#include + +#define MTHCA_CMD_MAILBOX_ALIGN 16UL +#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1) + +enum { + /* command completed successfully: */ + MTHCA_CMD_STAT_OK = 0x00, + /* Internal error (such as a bus error) occurred while processing command: */ + MTHCA_CMD_STAT_INTERNAL_ERR = 0x01, + /* Operation/command not supported or opcode modifier not supported: */ + MTHCA_CMD_STAT_BAD_OP = 0x02, + /* Parameter not supported or parameter out of range: */ + MTHCA_CMD_STAT_BAD_PARAM = 0x03, + /* System not enabled or bad system state: */ + MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04, + /* Attempt to access reserved or unallocated resource: */ + MTHCA_CMD_STAT_BAD_RESOURCE = 0x05, + /* Requested resource is currently executing a command, or is otherwise busy: */ + MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06, + /* memory error: */ + MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07, + /* Required capability exceeds device limits: */ + MTHCA_CMD_STAT_EXCEED_LIM = 0x08, + /* Resource is not in the appropriate state or ownership: */ + MTHCA_CMD_STAT_BAD_RES_STATE = 0x09, + /* Index out of range: */ + MTHCA_CMD_STAT_BAD_INDEX = 0x0a, + /* FW image corrupted: */ + MTHCA_CMD_STAT_BAD_NVMEM = 0x0b, + /* Attempt to modify a QP/EE which is not in the presumed state: */ + MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, + /* Bad segment parameters (Address/Size): */ + MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20, + /* Memory Region has Memory Windows bound to: */ + MTHCA_CMD_STAT_REG_BOUND = 0x21, + /* HCA local attached memory not present: */ + MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22, + /* Bad management packet (silently discarded): */ + MTHCA_CMD_STAT_BAD_PKT = 0x30, + /* More outstanding CQEs in CQ than new CQ size: */ + MTHCA_CMD_STAT_BAD_SIZE = 0x40 +}; + +enum { + MTHCA_TRANS_INVALID = 0, + MTHCA_TRANS_RST2INIT, +
MTHCA_TRANS_INIT2INIT, + MTHCA_TRANS_INIT2RTR, + MTHCA_TRANS_RTR2RTS, + MTHCA_TRANS_RTS2RTS, + MTHCA_TRANS_SQERR2RTS, + MTHCA_TRANS_ANY2ERR, + MTHCA_TRANS_RTS2SQD, + MTHCA_TRANS_SQD2SQD, + MTHCA_TRANS_SQD2RTS, + MTHCA_TRANS_ANY2RST, +}; + +enum { + DEV_LIM_FLAG_SRQ = 1 << 6 +}; + +struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; + union { + struct { + int max_avs; + } tavor; + struct { + int resize_srq; + int mtt_entry_sz; + int mpt_entry_sz; + int max_pbl_sz; + u8 bmme_flags; + u32 reserved_lkey; + int lam_required; + u64 max_icm_sz; + } arbel; + } hca; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, u16 token, + u8 status, u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + 
int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status); +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status); +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status); +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status); +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status); + +#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) (x), MTHCA_CMD_MAILBOX_ALIGN)) + +#endif /* MTHCA_CMD_H */ From roland at topspin.com Sun Dec 19 22:15:06 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:06 -0800 Subject: [openib-general] [PATCH][v4][13/24] Add Mellanox HCA low-level driver (initialization) In-Reply-To: <200412192215.2kuiGerZWq2NkwHo@topspin.com> Message-ID: <200412192215.Y6Xr7GqjaXLzJU2L@topspin.com> Add device initialization code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c 2004-12-19 22:04:15.759376780 -0800 @@ -0,0 +1,226 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE.
+ * + * $Id: mthca_profile.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include "mthca_profile.h" + +static int default_profile[MTHCA_RES_NUM] = { + [MTHCA_RES_QP] = 1 << 16, + [MTHCA_RES_EQP] = 1 << 16, + [MTHCA_RES_CQ] = 1 << 16, + [MTHCA_RES_EQ] = 32, + [MTHCA_RES_RDB] = 1 << 18, + [MTHCA_RES_MCG] = 1 << 13, + [MTHCA_RES_MPT] = 1 << 17, + [MTHCA_RES_MTT] = 1 << 20, + [MTHCA_RES_UDAV] = 1 << 15 +}; + +enum { + MTHCA_RDB_ENTRY_SIZE = 32, + MTHCA_MTT_SEG_SIZE = 64 +}; + +enum { + MTHCA_NUM_PDS = 1 << 15 +}; + +int mthca_make_profile(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca) +{ + /* just use default profile for now */ + struct mthca_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mthca_resource *profile; + struct mthca_resource tmp; + int i, j; + + default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE; + + profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; + profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; + profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; + profile[MTHCA_RES_CQ].size = dev_lim->cqc_entry_sz; + profile[MTHCA_RES_EQP].size = dev_lim->eqpc_entry_sz; + profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz; + profile[MTHCA_RES_EQ].size = dev_lim->eqc_entry_sz; + profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; + profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; + profile[MTHCA_RES_MPT].size = MTHCA_MPT_ENTRY_SIZE; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; + profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = default_profile[i]; + profile[i].log_num = max(ffs(default_profile[i]) - 1, 0); + profile[i].size *= default_profile[i]; + } + + /* + * Sort the resources in decreasing order of size. Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. 
+ */ + for (i = MTHCA_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = dev->ddr_start + total_size; + total_size += profile[i].size; + } + if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) { + mthca_err(dev, "Profile requires 0x%llx bytes; " + "won't fit between DDR start at 0x%016llx " + "and FW start at 0x%016llx.\n", + (unsigned long long) total_size, + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->fw.tavor.fw_start); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx " + "(size 0x%8llx)\n", + i, profile[i].type, profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n", + (int) (total_size >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10)); + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + switch (profile[i].type) { + case MTHCA_RES_QP: + dev->limits.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MTHCA_RES_EEC: + dev->limits.num_eecs = profile[i].num; + init_hca->eec_base = profile[i].start; + init_hca->log_num_eecs = profile[i].log_num; + break; + case MTHCA_RES_SRQ: + dev->limits.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MTHCA_RES_CQ: + dev->limits.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MTHCA_RES_EQP: + init_hca->eqpc_base = profile[i].start; + break; + case MTHCA_RES_EEEC: + init_hca->eeec_base = profile[i].start; + break; + case MTHCA_RES_EQ: + dev->limits.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MTHCA_RES_RDB: + dev->limits.num_rdbs = profile[i].num; + init_hca->rdb_base = profile[i].start; + break; + case MTHCA_RES_MCG: + dev->limits.num_mgms = profile[i].num >> 1; + dev->limits.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1; + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); + break; + case MTHCA_RES_MPT: + dev->limits.num_mpts = profile[i].num; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MTHCA_RES_MTT: + dev->limits.num_mtt_segs = profile[i].num; + dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE; + dev->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; + break; + case MTHCA_RES_UAR: + init_hca->uar_scratch_base = profile[i].start; + break; + case MTHCA_RES_UDAV: + dev->av_table.ddr_av_base = profile[i].start; + dev->av_table.num_ddr_avs = profile[i].num; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
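The sort above relies on every region size being a power of two: laying the regions out in decreasing size order keeps each start offset aligned to that region's size with no padding. A small standalone check of that property (the sizes below are arbitrary, not real profile values):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t size[] = { 1 << 12, 1 << 20, 1 << 16, 1 << 12, 1 << 18 };
    int n = sizeof size / sizeof size[0];
    int i, j;

    /* simple descending bubble sort, like the loop in mthca_make_profile() */
    for (i = n; i > 0; --i)
        for (j = 1; j < i; ++j)
            if (size[j] > size[j - 1]) {
                uint64_t tmp = size[j];
                size[j] = size[j - 1];
                size[j - 1] = tmp;
            }

    uint64_t start = 0;
    for (i = 0; i < n; ++i) {
        printf("region %d: start 0x%08llx size 0x%08llx aligned=%s\n",
               i, (unsigned long long) start, (unsigned long long) size[i],
               (start % size[i]) ? "no" : "yes");
        start += size[i];
    }
    return 0;
}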
+ */ + dev->limits.num_pds = MTHCA_NUM_PDS; + + kfree(profile); + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h 2004-12-19 22:04:15.783373244 -0800 @@ -0,0 +1,62 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_profile.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_PROFILE_H +#define MTHCA_PROFILE_H + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_RES_QP, + MTHCA_RES_EEC, + MTHCA_RES_SRQ, + MTHCA_RES_CQ, + MTHCA_RES_EQP, + MTHCA_RES_EEEC, + MTHCA_RES_EQ, + MTHCA_RES_RDB, + MTHCA_RES_MCG, + MTHCA_RES_MPT, + MTHCA_RES_MTT, + MTHCA_RES_UAR, + MTHCA_RES_UDAV, + MTHCA_RES_NUM +}; + +int mthca_make_profile(struct mthca_dev *mdev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca); + +#endif /* MTHCA_PROFILE_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c 2004-12-19 22:04:15.807369708 -0800 @@ -0,0 +1,232 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_reset.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +int mthca_reset(struct mthca_dev *mdev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + u32 *bridge_header = NULL; + struct pci_dev *bridge = NULL; + +#define MTHCA_RESET_OFFSET 0xf0010 +#define MTHCA_RESET_VALUE cpu_to_be32(1) + + /* + * Reset the chip. This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. + * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID. */ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? 
bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. + */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} From roland at topspin.com Sun Dec 19 22:15:07 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:07 -0800 Subject: [openib-general] [PATCH][v4][14/24] Add Mellanox HCA low-level driver (QP/CQ) In-Reply-To: <200412192215.Y6Xr7GqjaXLzJU2L@topspin.com> Message-ID: <200412192215.7jqz9W3Au4nr5j1b@topspin.com> Add CQ (completion queue) and QP (queue pair) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c 2004-12-19 22:04:16.047334345 -0800 @@ -0,0 +1,831 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_cq.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE +}; + +enum { + MTHCA_CQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. + */ +struct mthca_cq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 error_eqn; + u32 comp_eqn; + u32 pd; + u32 lkey; + u32 last_notified_index; + u32 solicit_producer_index; + u32 consumer_index; + u32 producer_index; + u32 cqn; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_CQ_STATUS_OK ( 0 << 28) +#define MTHCA_CQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_CQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_CQ_FLAG_TR ( 1 << 18) +#define MTHCA_CQ_FLAG_OI ( 1 << 17) +#define MTHCA_CQ_STATE_DISARMED ( 0 << 8) +#define MTHCA_CQ_STATE_ARMED ( 1 << 8) +#define MTHCA_CQ_STATE_ARMED_SOL ( 4 << 8) +#define MTHCA_EQ_STATE_FIRED (10 << 8) + +enum { + MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe +}; + +enum { + SYNDROME_LOCAL_LENGTH_ERR = 0x01, + SYNDROME_LOCAL_QP_OP_ERR = 0x02, + SYNDROME_LOCAL_EEC_OP_ERR = 0x03, + SYNDROME_LOCAL_PROT_ERR = 0x04, + SYNDROME_WR_FLUSH_ERR = 0x05, + SYNDROME_MW_BIND_ERR = 0x06, + SYNDROME_BAD_RESP_ERR = 0x10, + SYNDROME_LOCAL_ACCESS_ERR = 0x11, + SYNDROME_REMOTE_INVAL_REQ_ERR = 0x12, + SYNDROME_REMOTE_ACCESS_ERR = 0x13, + SYNDROME_REMOTE_OP_ERR = 0x14, + SYNDROME_RETRY_EXC_ERR = 0x15, + SYNDROME_RNR_RETRY_EXC_ERR = 0x16, + SYNDROME_LOCAL_RDD_VIOL_ERR = 0x20, + SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21, + SYNDROME_REMOTE_ABORTED_ERR = 0x22, + SYNDROME_INVAL_EECN_ERR = 0x23, + SYNDROME_INVAL_EEC_STATE_ERR = 0x24 +}; + +struct mthca_cqe { + u32 my_qpn; + u32 my_ee; + u32 rqpn; + u16 sl_g_mlpath; + u16 rlid; + u32 imm_etype_pkey_eec; + u32 byte_cnt; + u32 wqe; + u8 opcode; + u8 is_send; + u8 reserved; + u8 owner; +}; + +struct mthca_err_cqe { + u32 my_qpn; + u32 reserved1[3]; + u8 syndrome; + u8 reserved2; + u16 db_cnt; + u32 reserved3; + u32 wqe; + u8 opcode; + u8 reserved4[2]; + u8 owner; +}; + +#define MTHCA_CQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_CQ_ENTRY_OWNER_HW (1 << 7) + +#define MTHCA_CQ_DB_INC_CI (1 << 24) +#define MTHCA_CQ_DB_REQ_NOT (2 << 24) +#define MTHCA_CQ_DB_REQ_NOT_SOL (3 << 24) +#define MTHCA_CQ_DB_SET_CI (4 << 24) +#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24) + +static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) +{ + if (cq->is_direct) + return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE); + else + return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int cqe_sw(struct mthca_cq *cq, int i) +{ + return !(MTHCA_CQ_ENTRY_OWNER_HW & + get_cqe(cq, i)->owner); +} + +static inline int next_cqe_sw(struct mthca_cq *cq) +{ + return cqe_sw(cq, cq->cons_index); +} + +static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +{ + get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; +} + +static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int nent) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(nent - 1); + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +{ + struct mthca_cq *cq; + + spin_lock(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + 
atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & cq->ibcq.cqe); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & cq->ibcq.cqe); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + cq->ibcq.cqe), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & cq->ibcq.cqe; + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? "Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? 
+ IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & cq->ibcq.cqe; + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + if (freed) { + wmb(); + while (freed--) + inc_cons_index(dev, cq, 1); + } + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? + MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto 
err_out_free_cq; + + spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < ((cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = 
mthca_alloc_init(&dev->cq_table.alloc, + dev->limits.num_cqs, + (1 << 24) - 1, + dev->limits.reserved_cqs); + if (err) + return err; + + err = mthca_array_init(&dev->cq_table.cq, + dev->limits.num_cqs); + if (err) + mthca_alloc_cleanup(&dev->cq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev) +{ + mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs); + mthca_alloc_cleanup(&dev->cq_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c 2004-12-19 22:04:16.071330809 -0800 @@ -0,0 +1,1536 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_qp.c 1355 2004-12-17 15:23:43Z roland $ + */ + +#include + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, + MTHCA_ACK_REQ_FREQ = 10, + MTHCA_FLIGHT_LIMIT = 9, + MTHCA_UD_HEADER_SIZE = 72 /* largest UD header possible */ +}; + +enum { + MTHCA_QP_STATE_RST = 0, + MTHCA_QP_STATE_INIT = 1, + MTHCA_QP_STATE_RTR = 2, + MTHCA_QP_STATE_RTS = 3, + MTHCA_QP_STATE_SQE = 4, + MTHCA_QP_STATE_SQD = 5, + MTHCA_QP_STATE_ERR = 6, + MTHCA_QP_STATE_DRAINING = 7 +}; + +enum { + MTHCA_QP_ST_RC = 0x0, + MTHCA_QP_ST_UC = 0x1, + MTHCA_QP_ST_RD = 0x2, + MTHCA_QP_ST_UD = 0x3, + MTHCA_QP_ST_MLX = 0x7 +}; + +enum { + MTHCA_QP_PM_MIGRATED = 0x3, + MTHCA_QP_PM_ARMED = 0x0, + MTHCA_QP_PM_REARM = 0x1 +}; + +enum { + /* qp_context flags */ + MTHCA_QP_BIT_DE = 1 << 8, + /* params1 */ + MTHCA_QP_BIT_SRE = 1 << 15, + MTHCA_QP_BIT_SWE = 1 << 14, + MTHCA_QP_BIT_SAE = 1 << 13, + MTHCA_QP_BIT_SIC = 1 << 4, + MTHCA_QP_BIT_SSC = 1 << 3, + /* params2 */ + MTHCA_QP_BIT_RRE = 1 << 15, + MTHCA_QP_BIT_RWE = 1 << 14, + MTHCA_QP_BIT_RAE = 1 << 13, + MTHCA_QP_BIT_RIC = 1 << 4, + MTHCA_QP_BIT_RSC = 1 << 3 +}; + +struct mthca_qp_path { + u32 port_pkey; + u8 rnr_retry; + u8 g_mylmc; + u16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u8 rgid[16]; +} __attribute__((packed)); + +struct mthca_qp_context { + u32 flags; + u32 sched_queue; + u32 mtu_msgmax; + u32 usr_page; + u32 local_qpn; + u32 remote_qpn; + u32 reserved1[2]; + struct mthca_qp_path pri_path; + struct mthca_qp_path alt_path; + u32 rdd; + u32 pd; + u32 wqe_base; + u32 wqe_lkey; + u32 params1; + u32 reserved2; + u32 next_send_psn; + u32 cqn_snd; + u32 next_snd_wqe[2]; + u32 last_acked_psn; + u32 ssn; + u32 params2; + u32 rnr_nextrecvpsn; + u32 ra_buff_indx; + u32 cqn_rcv; + u32 next_rcv_wqe[2]; + u32 qkey; + u32 srqn; + u32 rmsn; + u32 reserved3[19]; +} __attribute__((packed)); + +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* 
immediate data */ +}; + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +}; + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +}; + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +}; + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +}; + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +}; + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +}; + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + 
}, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; + if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void 
init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(&param, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, &param, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + 
qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, "modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) + 
qp->state = new_state; + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. + * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + + err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if 
(qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + 
if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + dma_free_coherent(&dev->pdev->dev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->flags = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + switch (qp->transport) { + case RC: + switch (wr->opcode) { + case IB_WR_ATOMIC_CMP_AND_SWP: + case IB_WR_ATOMIC_FETCH_AND_ADD: + ((struct mthca_raddr_seg *) wqe)->raddr = + cpu_to_be64(wr->wr.atomic.remote_addr); + ((struct mthca_raddr_seg *) wqe)->rkey = + cpu_to_be32(wr->wr.atomic.rkey); + ((struct mthca_raddr_seg *) wqe)->reserved = 0; + + wqe += sizeof (struct mthca_raddr_seg); + + if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP) { + ((struct mthca_atomic_seg *) wqe)->swap_add = + cpu_to_be64(wr->wr.atomic.swap); + ((struct mthca_atomic_seg *) wqe)->compare = + cpu_to_be64(wr->wr.atomic.compare_add); + } else { + ((struct mthca_atomic_seg *) wqe)->swap_add = + cpu_to_be64(wr->wr.atomic.compare_add); + ((struct mthca_atomic_seg *) wqe)->compare = 0; + } + + wqe += sizeof (struct mthca_atomic_seg); + size += sizeof (struct mthca_raddr_seg) / 16 + + sizeof (struct mthca_atomic_seg); + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + case IB_WR_RDMA_READ: + ((struct mthca_raddr_seg *) wqe)->raddr = + cpu_to_be64(wr->wr.rdma.remote_addr); + ((struct mthca_raddr_seg *) wqe)->rkey = + cpu_to_be32(wr->wr.rdma.rkey); + ((struct mthca_raddr_seg *) wqe)->reserved = 0; + wqe += sizeof (struct mthca_raddr_seg); + size += sizeof (struct mthca_raddr_seg) / 16; + break; + + default: + /* No extra segments required for sends */ + break; + } + + case UD: + ((struct mthca_ud_seg *) wqe)->lkey = + cpu_to_be32(to_mah(wr->wr.ud.ah)->key); + ((struct mthca_ud_seg *) wqe)->av_addr = + cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); + ((struct mthca_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_ud_seg); + size += sizeof (struct mthca_ud_seg) / 16; + break; + + case MLX: + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + break; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + opcode[wr->opcode]); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32((size0 ? 
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} From roland at topspin.com Sun Dec 19 22:15:10 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:10 -0800 Subject: [openib-general] [PATCH][v4][15/24] Add Mellanox HCA low-level driver (last bits) In-Reply-To: <200412192215.7jqz9W3Au4nr5j1b@topspin.com> Message-ID: <200412192215.EI3HeB132RULe1EN@topspin.com> Add code for remaining InfiniBand objects (address vectors, multicast groups, memory regions and protection domains) Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-12-19 22:04:16.308295889 -0800 @@ -0,0 +1,219 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_av.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include +#include + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +}; + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } else { + /* Arbel workaround -- low byte of GID must be 2 */ + av->dgid[3] = cpu_to_be32(2); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + 
MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if (!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-12-19 22:04:16.332292352 -0800 @@ -0,0 +1,376 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_mcg.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +}; + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. + * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * if GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. 
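 *
 * Illustrative example (derived from the rules above, not in the
 * original comment): with hash H and a chain MGM[H] -> AMGM entry 7 ->
 * AMGM entry 3, a lookup of a GID stored in entry 3 gives *index = 3,
 * *prev = 7; a lookup of a GID not in the chain gives *index = -1,
 * *prev = 3; and a GID stored in (or an empty) MGM[H] gives
 * *index = H, *prev = -1.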
+ */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + return -EINVAL; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + 
struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int prev, index; + int i, loc; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((u16 *) gid->raw)[0]), + be16_to_cpu(((u16 *) gid->raw)[1]), + be16_to_cpu(((u16 *) gid->raw)[2]), + be16_to_cpu(((u16 *) gid->raw)[3]), + be16_to_cpu(((u16 *) gid->raw)[4]), + be16_to_cpu(((u16 *) gid->raw)[5]), + be16_to_cpu(((u16 *) gid->raw)[6]), + be16_to_cpu(((u16 *) gid->raw)[7])); + err = -EINVAL; + goto out; + } + + for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) { + if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) + loc = i; + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) + break; + } + + if (loc == -1) { + mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num); + err = -EINVAL; + goto out; + } + + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (i != 1) + goto out; + + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + if (be32_to_cpu(mgm->next_gid_index) >> 5) { + err = mthca_READ_MGM(dev, + be32_to_cpu(mgm->next_gid_index) >> 5, + mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", + status); + err = -EINVAL; + goto out; + } + } else + memset(mgm->gid, 0, 16); + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } else { + /* Remove entry from AMGM */ + index = be32_to_cpu(mgm->next_gid_index) >> 5; + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int __devinit mthca_init_mcg_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->mcg_table.alloc, + dev->limits.num_amgms, + dev->limits.num_amgms - 1, + 0); + if (err) + return err; + + init_MUTEX(&dev->mcg_table.sem); + + return 0; +} + +void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev) +{ + mthca_alloc_cleanup(&dev->mcg_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-12-19 22:04:16.356288816 -0800 @@ -0,0 +1,396 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_mr.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* + * Must be packed because mtt_seg is 64 bits but only aligned to 32 bits. + */ +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + 
MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + 
} + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-12-19 22:04:16.379285427 -0800 @@ -0,0 +1,80 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_pd.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} From roland at topspin.com Sun Dec 19 22:15:12 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:12 -0800 Subject: [openib-general] [PATCH][v4][16/24] Add Mellanox HCA low-level driver (MAD) In-Reply-To: <200412192215.EI3HeB132RULe1EN@topspin.com> Message-ID: <200412192215.l62Q9JXNhGg51wOf@topspin.com> Add MAD (management datagram) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-12-19 22:04:16.654244908 -0800 @@ -0,0 +1,320 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_mad.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? 
IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = dma_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + DMA_TO_DEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). + */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. 
+ */ + if (in_mad->mad_hdr.attr_id == IB_SMP_ATTR_SM_INFO || + ((in_mad->mad_hdr.attr_id & IB_SMP_ATTR_VENDOR_MASK) == + IB_SMP_ATTR_VENDOR_MASK)) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} From roland at topspin.com Sun Dec 19 22:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:14 -0800 Subject: [openib-general] [PATCH][v4][17/24] IPoIB IPv4 multicast In-Reply-To: <200412192215.l62Q9JXNhGg51wOf@topspin.com> Message-ID: <200412192215.vegmgBmv5xungHlQ@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-12-19 22:04:16.867213523 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ --- linux-bk.orig/include/net/ip.h 2004-12-19 21:09:26.000000000 -0800 +++ linux-bk/include/net/ip.h 2004-12-19 22:04:16.868213376 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. + */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif --- linux-bk.orig/net/ipv4/arp.c 2004-12-19 21:09:46.000000000 -0800 +++ linux-bk/net/ipv4/arp.c 2004-12-19 22:04:16.868213376 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Sun Dec 19 22:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:14 -0800 Subject: [openib-general] [PATCH][v4][18/24] IPoIB IPv6 support In-Reply-To: <200412192215.vegmgBmv5xungHlQ@topspin.com> Message-ID: <200412192215.69tnzAhGIT1vQGLF@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. 
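As a worked example (computed from the ip_ib_mc_map()/ipv6_ib_mc_map() helpers, not quoted from the drafts), the resulting 20-byte hardware addresses are:

    224.1.2.3  ->  00 | ff ff ff | ff12:401b:0000:0000:0000:0000:0001:0203
    ff02::1    ->  00 | ff ff ff | ff12:601b:0000:0000:0000:0000:0000:0001
                  rsvd  mcast QPN  16-byte MGID (P_Key bytes 8-9 left 0 for the driver to fill in)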
The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier --- linux-bk.orig/include/net/if_inet6.h 2004-12-19 21:09:54.000000000 -0800 +++ linux-bk/include/net/if_inet6.h 2004-12-19 22:04:17.213162542 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif --- linux-bk.orig/net/ipv6/addrconf.c 2004-12-19 21:09:51.000000000 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-12-19 22:04:17.215162248 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1095,6 +1096,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1794,7 +1801,8 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && - (dev->type != ARPHRD_ARCNET)) { + (dev->type != ARPHRD_ARCNET) && + (dev->type != ARPHRD_INFINIBAND)) { /* Alas, we support only Ethernet autoconfiguration. */ return; } --- linux-bk.orig/net/ipv6/ndisc.c 2004-12-19 21:09:20.000000000 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-12-19 22:04:17.216162100 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Sun Dec 19 22:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:14 -0800 Subject: [openib-general] [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <200412192215.69tnzAhGIT1vQGLF@topspin.com> Message-ID: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> Add a driver that implements the (IPoIB) IP-over-InfiniBand protocol. This is a network device driver of type ARPHRD_INFINIBAND (and addr_len INFINIBAND_ALEN bytes). The ARP/ND implementation for this driver is not completely straightforward, because InfiniBand requires an additional path lookup be performed (through an IB-specific mechanism) after a remote hardware address has been resolved. We are very open to suggestions of a better way to handle this than the current implementation. Although IB has a special multicast group join mode intended to support IP multicast routing (non member join), no means to identify different multicast styles has yet been determined, so all joins by the driver are currently full member joins. We are looking for guidance in how to solve this. 
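To make the hardware-address handling above concrete, here is a small sketch (an assumption based on the drafts and the mc_map helpers in the previous patches, not code from this patch): the 20-byte address (INFINIBAND_ALEN) carries a QPN and a GID, and the GID still has to be resolved to a full path (LID, SL, rate, ...) by a path record query before an address handle can be built for the send.

    /*
     * Sketch only -- ipoib_example_qpn() is a hypothetical helper.
     *
     *   byte  0      reserved/flags
     *   bytes 1-3    destination QPN
     *   bytes 4-19   destination port GID
     */
    static inline u32 ipoib_example_qpn(const u8 *ha)
    {
            return (((u32) ha[1]) << 16) | (((u32) ha[2]) << 8) | ha[3];
    }

This is why the neighbour's ha field alone is not enough to transmit: the QPN can be pulled straight out of it, but the GID portion feeds the IB-specific path lookup mentioned above.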
The IPoIB protocol/encapsulation is described in the Internet-Drafts http://www.ietf.org/internet-drafts/draft-ietf-ipoib-architecture-04.txt http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-19 22:04:14.496562875 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-19 22:04:17.510118781 -0800 @@ -9,4 +9,6 @@ source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/ulp/ipoib/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-19 22:04:14.472566412 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-19 22:04:17.485122465 -0800 @@ -1,2 +1,3 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Kconfig 2004-12-19 22:04:17.559111562 -0800 @@ -0,0 +1,33 @@ +config INFINIBAND_IPOIB + tristate "IP-over-InfiniBand" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the IP-over-InfiniBand protocol (IPoIB). This + transports IP packets over InfiniBand so you can use your IB + device as a fancy NIC. + + The IPoIB protocol is defined by the IETF ipoib working + group: . + +config INFINIBAND_IPOIB_DEBUG + bool "IP-over-InfiniBand debugging" + depends on INFINIBAND_IPOIB + ---help--- + This option causes debugging code to be compiled into the + IPoIB driver. The output can be turned on via the + debug_level and mcast_debug_level module parameters (which + can also be set after the driver is loaded through sysfs). + + This option also creates an "ipoib_debugfs," which can be + mounted to expose debugging information about IB multicast + groups used by the IPoIB driver. + +config INFINIBAND_IPOIB_DEBUG_DATA + bool "IP-over-InfiniBand data path debugging" + depends on INFINIBAND_IPOIB_DEBUG + ---help--- + This option compiles debugging code into the the data path + of the IPoIB driver. The output can be turned on via the + data_debug_level module parameter; however, even with output + turned off, this debugging code will have some performance + impact. --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Makefile 2004-12-19 22:04:17.534115245 -0800 @@ -0,0 +1,11 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o + +ib_ipoib-y := ipoib_main.o \ + ipoib_ib.o \ + ipoib_multicast.o \ + ipoib_verbs.o \ + ipoib_vlan.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib.h 2004-12-19 22:04:17.584107878 -0800 @@ -0,0 +1,350 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib.h 1358 2004-12-17 22:00:11Z roland $ + */ + +#ifndef _IPOIB_H +#define _IPOIB_H + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +#include +#include +#include + +/* constants */ + +enum { + IPOIB_PACKET_SIZE = 2048, + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, + + IPOIB_ENCAP_LEN = 4, + + IPOIB_RX_RING_SIZE = 128, + IPOIB_TX_RING_SIZE = 64, + + IPOIB_NUM_WC = 4, + + IPOIB_MAX_PATH_REC_QUEUE = 3, + IPOIB_MAX_MCAST_QUEUE = 3, + + IPOIB_FLAG_OPER_UP = 0, + IPOIB_FLAG_ADMIN_UP = 1, + IPOIB_PKEY_ASSIGNED = 2, + IPOIB_PKEY_STOP = 3, + IPOIB_FLAG_SUBINTERFACE = 4, + IPOIB_MCAST_RUN = 5, + IPOIB_STOP_REAPER = 6, + + IPOIB_MAX_BACKOFF_SECONDS = 16, + + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ + IPOIB_MCAST_FLAG_SENDONLY = 1, + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + IPOIB_MCAST_FLAG_ATTACHED = 3, +}; + +/* structs */ + +struct ipoib_header { + u16 proto; + u16 reserved; +}; + +struct ipoib_pseudoheader { + u8 hwaddr[INFINIBAND_ALEN]; +}; + +struct ipoib_mcast; + +struct ipoib_buf { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +/* + * Device private locking: tx_lock protects members used in TX fast + * path (and we use LLTX so upper layers don't do extra locking). + * lock protects everything else. lock nests inside of tx_lock (ie + * tx_lock must be acquired first if needed). 
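 *
 * Illustration only (not part of the original comment) -- a path that
 * needs both locks takes them in this order:
 *
 *	spin_lock_irqsave(&priv->tx_lock, flags);
 *	spin_lock(&priv->lock);
 *	...
 *	spin_unlock(&priv->lock);
 *	spin_unlock_irqrestore(&priv->tx_lock, flags);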
+ */ +struct ipoib_dev_priv { + spinlock_t lock; + + struct net_device *dev; + + unsigned long flags; + + struct semaphore mcast_mutex; + struct semaphore vlan_mutex; + + struct rb_root path_tree; + struct list_head path_list; + + struct ipoib_mcast *broadcast; + struct list_head multicast_list; + struct rb_root multicast_tree; + + struct work_struct pkey_task; + struct work_struct mcast_task; + struct work_struct flush_task; + struct work_struct restart_task; + struct work_struct ah_reap_task; + + struct ib_device *ca; + u8 port; + u16 pkey; + struct ib_pd *pd; + struct ib_mr *mr; + struct ib_cq *cq; + struct ib_qp *qp; + u32 qkey; + + union ib_gid local_gid; + u16 local_lid; + + unsigned int admin_mtu; + unsigned int mcast_mtu; + + struct ipoib_buf *rx_ring; + + spinlock_t tx_lock; + struct ipoib_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + + struct ib_wc ibwc[IPOIB_NUM_WC]; + + struct list_head dead_ahs; + + struct ib_event_handler event_handler; + + struct net_device_stats stats; + + struct net_device *parent; + struct list_head child_intfs; + struct list_head list; + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + struct list_head fs_list; + struct dentry *mcg_dentry; +#endif +}; + +struct ipoib_ah { + struct net_device *dev; + struct ib_ah *ah; + struct list_head list; + struct kref ref; + unsigned last_send; +}; + +struct ipoib_path { + struct net_device *dev; + struct ib_sa_path_rec pathrec; + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct list_head neigh_list; + + int query_id; + struct ib_sa_query *query; + struct completion done; + + struct rb_node rb_node; + struct list_head list; +}; + +struct ipoib_neigh { + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct neighbour *neighbour; + + struct list_head list; +}; + +static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) +{ + return (struct ipoib_neigh **) (neigh->ha + 24 - + (offsetof(struct neighbour, ha) & 4)); +} + +extern struct workqueue_struct *ipoib_workqueue; + +/* functions */ + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr); +void ipoib_free_ah(struct kref *kref); +static inline void ipoib_put_ah(struct ipoib_ah *ah) +{ + kref_put(&ah->ref, ipoib_free_ah); +} + +int ipoib_add_pkey_attr(struct net_device *dev); + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn); +void ipoib_reap_ah(void *dev_ptr); + +void ipoib_flush_paths(struct net_device *dev); +struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_ib_dev_flush(void *dev); +void ipoib_ib_dev_cleanup(struct net_device *dev); + +int ipoib_ib_dev_open(struct net_device *dev); +int ipoib_ib_dev_up(struct net_device *dev); +int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev); + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_dev_cleanup(struct net_device *dev); + +void ipoib_mcast_join_task(void *dev_ptr); +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb); + +void ipoib_mcast_restart_task(void *dev_ptr); +int ipoib_mcast_start_thread(struct net_device *dev); +int ipoib_mcast_stop_thread(struct net_device *dev); + +void ipoib_mcast_dev_down(struct net_device *dev); +void ipoib_mcast_dev_flush(struct net_device *dev); + +struct 
ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *gid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only); + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); + +int ipoib_qp_create(struct net_device *dev); +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); +void ipoib_transport_dev_cleanup(struct net_device *dev); + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); + +void ipoib_pkey_poll(void *dev); +int ipoib_pkey_dev_delay_open(struct net_device *dev); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int ipoib_create_debug_file(struct net_device *dev); +void ipoib_delete_debug_file(struct net_device *dev); +int ipoib_register_debugfs(void); +void ipoib_unregister_debugfs(void); +#else +static inline int ipoib_create_debug_file(struct net_device *dev) { return 0; } +static inline void ipoib_delete_debug_file(struct net_device *dev) { } +static inline int ipoib_register_debugfs(void) { return 0; } +static inline void ipoib_unregister_debugfs(void) { } +#endif + + +#define ipoib_printk(level, priv, format, arg...) \ + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) +#define ipoib_warn(priv, format, arg...) \ + ipoib_printk(KERN_WARNING, priv, format , ## arg) + + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +extern int debug_level; + +#define ipoib_dbg(priv, format, arg...) \ + do { \ + if (debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { \ + if (mcast_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG */ +#define ipoib_dbg(priv, format, arg...) \ + do { (void) (priv); } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +#define ipoib_dbg_data(priv, format, arg...) \ + do { \ + if (data_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ +#define ipoib_dbg_data(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ + + +#define IPOIB_GID_FMT "%x:%x:%x:%x:%x:%x:%x:%x" + +#define IPOIB_GID_ARG(gid) be16_to_cpup((__be16 *) ((gid).raw + 0)), \ + be16_to_cpup((__be16 *) ((gid).raw + 2)), \ + be16_to_cpup((__be16 *) ((gid).raw + 4)), \ + be16_to_cpup((__be16 *) ((gid).raw + 6)), \ + be16_to_cpup((__be16 *) ((gid).raw + 8)), \ + be16_to_cpup((__be16 *) ((gid).raw + 10)), \ + be16_to_cpup((__be16 *) ((gid).raw + 12)), \ + be16_to_cpup((__be16 *) ((gid).raw + 14)) + +#endif /* _IPOIB_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2004-12-19 22:04:17.608104342 -0800 @@ -0,0 +1,287 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#include +#include + +#include "ipoib.h" + +enum { + IPOIB_MAGIC = 0x49504942 /* "IPIB" */ +}; + +static DECLARE_MUTEX(ipoib_fs_mutex); +static struct dentry *ipoib_root; +static struct super_block *ipoib_sb; +static LIST_HEAD(ipoib_device_list); + +static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) +{ + struct ipoib_mcast_iter *iter; + loff_t n = *pos; + + iter = ipoib_mcast_iter_init(file->private); + if (!iter) + return NULL; + + while (n--) { + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + } + + return iter; +} + +static void *ipoib_mcg_seq_next(struct seq_file *file, void *iter_ptr, + loff_t *pos) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + + (*pos)++; + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +static void ipoib_mcg_seq_stop(struct seq_file *file, void *iter_ptr) +{ + /* nothing for now */ +} + +static int ipoib_mcg_seq_show(struct seq_file *file, void *iter_ptr) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + char gid_buf[sizeof "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"]; + union ib_gid mgid; + int i, n; + unsigned long created; + unsigned int queuelen, complete, send_only; + + if (iter) { + ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, + &complete, &send_only); + + for (n = 0, i = 0; i < sizeof mgid / 2; ++i) { + n += sprintf(gid_buf + n, "%x", + be16_to_cpu(((u16 *)mgid.raw)[i])); + if (i < sizeof mgid / 2 - 1) + gid_buf[n++] = ':'; + } + } + + seq_printf(file, "GID: %*s", -(1 + (int) sizeof gid_buf), gid_buf); + + seq_printf(file, + " created: %10ld queuelen: %4d complete: %d send_only: %d\n", + created, queuelen, complete, send_only); + + return 0; +} + +static struct seq_operations ipoib_seq_ops = { + .start = ipoib_mcg_seq_start, + .next = ipoib_mcg_seq_next, + .stop = ipoib_mcg_seq_stop, + .show = ipoib_mcg_seq_show, +}; + +static int ipoib_mcg_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &ipoib_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->u.generic_ip; + + return 0; +} + +static struct file_operations 
ipoib_fops = { + .owner = THIS_MODULE, + .open = ipoib_mcg_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +static struct inode *ipoib_get_inode(void) +{ + struct inode *inode = new_inode(ipoib_sb); + + if (inode) { + inode->i_mode = S_IFREG | S_IRUGO; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_fop = &ipoib_fops; + } + + return inode; +} + +static int __ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dentry *dentry; + struct inode *inode; + char name[IFNAMSIZ + sizeof "_mcg"]; + + snprintf(name, sizeof name, "%s_mcg", dev->name); + + dentry = d_alloc_name(ipoib_root, name); + if (!dentry) + return -ENOMEM; + + inode = ipoib_get_inode(); + if (!inode) { + dput(dentry); + return -ENOMEM; + } + + inode->u.generic_ip = dev; + priv->mcg_dentry = dentry; + + d_add(dentry, inode); + + return 0; +} + +int ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + + list_add_tail(&priv->fs_list, &ipoib_device_list); + + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return 0; + } + + up(&ipoib_fs_mutex); + + return __ipoib_create_debug_file(dev); +} + +void ipoib_delete_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + list_del(&priv->fs_list); + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return; + } + up(&ipoib_fs_mutex); + + if (priv->mcg_dentry) { + d_drop(priv->mcg_dentry); + simple_unlink(ipoib_root->d_inode, priv->mcg_dentry); + } +} + +static int ipoib_fill_super(struct super_block *sb, void *data, int silent) +{ + static struct tree_descr ipoib_files[] = { + { "" } + }; + struct ipoib_dev_priv *priv; + int ret; + + ret = simple_fill_super(sb, IPOIB_MAGIC, ipoib_files); + if (ret) + return ret; + + ipoib_root = sb->s_root; + + down(&ipoib_fs_mutex); + + ipoib_sb = sb; + + list_for_each_entry(priv, &ipoib_device_list, fs_list) { + ret = __ipoib_create_debug_file(priv->dev); + if (ret) + break; + } + + up(&ipoib_fs_mutex); + + return ret; +} + +static struct super_block *ipoib_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_single(fs_type, flags, data, ipoib_fill_super); +} + +static void ipoib_kill_sb(struct super_block *sb) +{ + down(&ipoib_fs_mutex); + ipoib_sb = NULL; + up(&ipoib_fs_mutex); + + kill_litter_super(sb); +} + +static struct file_system_type ipoib_fs_type = { + .owner = THIS_MODULE, + .name = "ipoib_debugfs", + .get_sb = ipoib_get_sb, + .kill_sb = ipoib_kill_sb, +}; + +int ipoib_register_debugfs(void) +{ + return register_filesystem(&ipoib_fs_type); +} + +void ipoib_unregister_debugfs(void) +{ + unregister_filesystem(&ipoib_fs_type); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2004-12-19 22:04:17.633100658 -0800 @@ -0,0 +1,632 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib_ib.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +int data_debug_level; + +module_param(data_debug_level, int, 0644); +MODULE_PARM_DESC(data_debug_level, + "Enable data path debug tracing if > 0"); +#endif + +#define IPOIB_OP_RECV (1ul << 31) + +static DECLARE_MUTEX(pkey_sem); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr) +{ + struct ipoib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return NULL; + + ah->dev = dev; + ah->last_send = 0; + kref_init(&ah->ref); + + ah->ah = ib_create_ah(pd, attr); + if (IS_ERR(ah->ah)) { + kfree(ah); + ah = NULL; + } else + ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah); + + return ah; +} + +void ipoib_free_ah(struct kref *kref) +{ + struct ipoib_ah *ah = container_of(kref, struct ipoib_ah, ref); + struct ipoib_dev_priv *priv = netdev_priv(ah->dev); + + unsigned long flags; + + if (ah->last_send <= priv->tx_tail) { + ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } else { + spin_lock_irqsave(&priv->lock, flags); + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, + unsigned int wr_id, + dma_addr_t addr) +{ + struct ib_sge list = { + .addr = addr, + .length = IPOIB_BUF_SIZE, + .lkey = priv->mr->lkey, + }; + struct ib_recv_wr param = { + .wr_id = wr_id | IPOIB_OP_RECV, + .sg_list = &list, + .num_sge = 1, + .recv_flags = IB_RECV_SIGNALED + }; + struct ib_recv_wr *bad_wr; + + return ib_post_recv(priv->qp, &param, &bad_wr); +} + +static int ipoib_ib_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + dma_addr_t addr; + int ret; + + skb = dev_alloc_skb(IPOIB_BUF_SIZE + 4); + if (!skb) { + ipoib_warn(priv, "failed to allocate receive buffer\n"); + + priv->rx_ring[id].skb = NULL; + return -ENOMEM; + } + skb_reserve(skb, 4); /* 16 byte align IP header */ + priv->rx_ring[id].skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); + 
ret = ipoib_ib_receive(priv, id, addr); + if (ret) { + ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", + id, ret); + priv->rx_ring[id].skb = NULL; + } + + return ret; +} + +static int ipoib_ib_post_receives(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + if (ipoib_ib_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + return -EIO; + } + } + + return 0; +} + +static void ipoib_ib_handle_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (wr_id & IPOIB_OP_RECV) { + wr_id &= ~IPOIB_OP_RECV; + + if (wr_id < IPOIB_RX_RING_SIZE) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + + priv->rx_ring[wr_id].skb = NULL; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->rx_ring[wr_id], + mapping), + IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + + if (wc->status != IB_WC_SUCCESS) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put(skb, wc->byte_len); + skb_pull(skb, IB_GRH_BYTES); + + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); + } + + /* repost receive */ + if (ipoib_ib_post_receive(dev, wr_id)) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + wr_id); + + } else { + struct ipoib_buf *tx_req; + unsigned long flags; + + if (wr_id >= IPOIB_TX_RING_SIZE) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, IPOIB_TX_RING_SIZE); + return; + } + + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + + tx_req = &priv->tx_ring[wr_id]; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->tx_tail; + if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + } +} + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < n; ++i) + ipoib_ib_handle_wc(dev, priv->ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +static inline int post_send(struct ipoib_dev_priv *priv, + unsigned int 
wr_id, + struct ib_ah *address, u32 qpn, + dma_addr_t addr, int len) +{ + struct ib_sge list = { + .addr = addr, + .length = len, + .lkey = priv->mr->lkey, + }; + struct ib_send_wr param = { + .wr_id = wr_id, + .opcode = IB_WR_SEND, + .sg_list = &list, + .num_sge = 1, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = priv->qkey, + .ah = address + }, + }, + .send_flags = IB_SEND_SIGNALED, + }; + struct ib_send_wr *bad_wr; + + return ib_post_send(priv->qp, &param, &bad_wr); +} + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_buf *tx_req; + dma_addr_t addr; + + if (skb->len > dev->mtu + INFINIBAND_ALEN) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, dev->mtu + INFINIBAND_ALEN); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", + skb->len, address, qpn); + + /* + * We put the skb into the tx_ring _before_ we call post_send() + * because it's entirely possible that the completion handler will + * run before we execute anything after the post_send(). That + * means we have to make sure everything is properly recorded and + * our state is consistent before we call post_send(). + */ + tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req->skb = skb; + addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, + DMA_TO_DEVICE); + pci_unmap_addr_set(tx_req, mapping, addr); + + if (unlikely(post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + address->ah, qpn, addr, skb->len))) { + ipoib_warn(priv, "post_send failed\n"); + ++priv->stats.tx_errors; + dma_unmap_single(priv->ca->dma_device, addr, skb->len, + DMA_TO_DEVICE); + dev_kfree_skb_any(skb); + } else { + dev->trans_start = jiffies; + + address->last_send = priv->tx_head; + ++priv->tx_head; + + if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + } + } +} + +void __ipoib_reap_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah, *tah; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) + if (ah->last_send <= priv->tx_tail) { + list_del(&ah->list); + list_add_tail(&ah->list, &remove_list); + } + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(ah, tah, &remove_list, list) { + ipoib_dbg(priv, "Reaping ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } +} + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + +int ipoib_ib_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + ret = ipoib_qp_create(dev); + if (ret) { + ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret); + return -1; + } + + ret = ipoib_ib_post_receives(dev); + if (ret) { + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + return -1; + } + + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + + return 0; +} + +int ipoib_ib_dev_up(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + 
+ set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + + return ipoib_mcast_start_thread(dev); +} + +int ipoib_ib_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "downing ib_dev\n"); + + clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + netif_carrier_off(dev); + + /* Shutdown the P_Key thread if still active */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + set_bit(IPOIB_PKEY_STOP, &priv->flags); + cancel_delayed_work(&priv->pkey_task); + up(&pkey_sem); + flush_workqueue(ipoib_workqueue); + } + + ipoib_mcast_stop_thread(dev); + + /* + * Flush the multicast groups first so we stop any multicast joins. The + * completion thread may have already died and we may deadlock waiting + * for the completion thread to finish some multicast joins. + */ + ipoib_mcast_dev_flush(dev); + + /* Delete broadcast and local addresses since they will be recreated */ + ipoib_mcast_dev_down(dev); + + ipoib_flush_paths(dev); + + return 0; +} + +static int recvs_pending(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + return 1; + + return 0; +} + +int ipoib_ib_dev_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int attr_mask; + int i; + + /* Kill the existing QP and allocate a new one */ + qp_attr.qp_state = IB_QPS_ERR; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to ERROR state\n"); + + /* Wait for all sends and receives to complete */ + while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) + yield(); + + ipoib_dbg(priv, "All sends and receives done.\n"); + + qp_attr.qp_state = IB_QPS_RESET; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to RESET state\n"); + + /* Wait for all AHs to be reaped */ + set_bit(IPOIB_STOP_REAPER, &priv->flags); + cancel_delayed_work(&priv->ah_reap_task); + flush_workqueue(ipoib_workqueue); + while (!list_empty(&priv->dead_ahs)) { + __ipoib_reap_ah(dev); + yield(); + } + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + ipoib_warn(priv, "Recv skb still around @ %d\n", i); + + return 0; +} + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + priv->ca = ca; + priv->port = port; + priv->qp = NULL; + + if (ipoib_transport_dev_init(dev, ca)) { + printk(KERN_WARNING "%s: ipoib_transport_dev_init failed\n", ca->name); + return -ENODEV; + } + + if (dev->flags & IFF_UP) { + if (ipoib_ib_dev_open(dev)) { + ipoib_transport_dev_cleanup(dev); + return -ENODEV; + } + } + + return 0; +} + +void ipoib_ib_dev_flush(void *_dev) +{ + struct net_device *dev = (struct net_device *)_dev; + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv; + + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + + ipoib_dbg(priv, "flushing\n"); + + ipoib_ib_dev_down(dev); + + /* + * The device could have been brought down between the start and when + * we get here, don't bring it back up if it's not configured up + */ + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_ib_dev_up(dev); + + /* Flush any child interfaces too */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + ipoib_ib_dev_flush(&cpriv->dev); +} + +void ipoib_ib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = 
netdev_priv(dev); + + ipoib_dbg(priv, "cleaning up ib_dev\n"); + + ipoib_mcast_stop_thread(dev); + + /* Delete the broadcast address and the local address */ + ipoib_mcast_dev_down(dev); + + ipoib_transport_dev_cleanup(dev); +} + +/* + * Delayed P_Key Assignment Interim Support + * + * The following is an initial implementation of a delayed P_Key assignment + * mechanism. It uses the same approach as the one implemented for the multicast + * group join. The single goal of this implementation is to quickly address + * Bug #2507. This implementation will probably be removed when the P_Key + * change async notification is available. + */ +int ipoib_open(struct net_device *dev); + +static void ipoib_pkey_dev_check_presence(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + u16 pkey_index = 0; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + else + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); +} + +void ipoib_pkey_poll(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_pkey_dev_check_presence(dev); + + if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + ipoib_open(dev); + else { + down(&pkey_sem); + if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + } +} + +int ipoib_pkey_dev_delay_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Look for the interface pkey value in the IB Port P_Key table and */ + /* set the interface pkey assignment flag */ + ipoib_pkey_dev_check_presence(dev); + + /* P_Key value not assigned yet - start polling */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + clear_bit(IPOIB_PKEY_STOP, &priv->flags); + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + return 1; + } + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_main.c 2004-12-19 22:04:17.658096974 -0800 @@ -0,0 +1,1084 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE.
+ * + * $Id: ipoib_main.c 1362 2004-12-18 15:56:29Z roland $ + */ + +#include "ipoib.h" + +#include +#include + +#include +#include +#include + +#include /* For ARPHRD_xxx */ + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); +MODULE_LICENSE("Dual BSD/GPL"); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int debug_level; + +module_param(debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); +#endif + +static const u8 ipv4_bcast_addr[] = { + 0x00, 0xff, 0xff, 0xff, + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff +}; + +struct workqueue_struct *ipoib_workqueue; + +static void ipoib_add_one(struct ib_device *device); +static void ipoib_remove_one(struct ib_device *device); + +static struct ib_client ipoib_client = { + .name = "ipoib", + .add = ipoib_add_one, + .remove = ipoib_remove_one +}; + +int ipoib_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "bringing up interface\n"); + + set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + if (ipoib_pkey_dev_delay_open(dev)) + return 0; + + if (ipoib_ib_dev_open(dev)) + return -EINVAL; + + if (ipoib_ib_dev_up(dev)) + return -EINVAL; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring up any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (flags & IFF_UP) + continue; + + dev_change_flags(cpriv->dev, flags | IFF_UP); + } + up(&priv->vlan_mutex); + } + + netif_start_queue(dev); + + return 0; +} + +static int ipoib_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "stopping interface\n"); + + clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + netif_stop_queue(dev); + + ipoib_ib_dev_down(dev); + ipoib_ib_dev_stop(dev); + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring down any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (!(flags & IFF_UP)) + continue; + + dev_change_flags(cpriv->dev, flags & ~IFF_UP); + } + up(&priv->vlan_mutex); + } + + return 0; +} + +static int ipoib_change_mtu(struct net_device *dev, int new_mtu) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + return -EINVAL; + + priv->admin_mtu = new_mtu; + + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + return 0; +} + +static struct ipoib_path *__path_find(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->path_tree.rb_node; + struct ipoib_path *path; + int ret; + + while (n) { + path = rb_entry(n, struct ipoib_path, rb_node); + + ret = memcmp(gid->raw, path->pathrec.dgid.raw, + sizeof (union ib_gid)); + + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return path; + } + + return NULL; +} + +static int __path_add(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->path_tree.rb_node; + struct rb_node *pn = NULL; + struct ipoib_path *tpath; + int ret; + + while (*n) { + pn = *n; + tpath = rb_entry(pn, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, 
tpath->pathrec.dgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&path->rb_node, pn, n); + rb_insert_color(&path->rb_node, &priv->path_tree); + + list_add_tail(&path->list, &priv->path_list); + + return 0; +} + +static void __path_free(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tn; + struct sk_buff *skb; + + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_irq(skb); + + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + if (path->ah) + ipoib_put_ah(path->ah); + + rb_erase(&path->rb_node, &priv->path_tree); + list_del(&path->list); + kfree(path); +} + +void ipoib_flush_paths(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path, *tp; + LIST_HEAD(remove_list); + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + list_splice(&priv->path_list, &remove_list); + INIT_LIST_HEAD(&priv->path_list); + spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry_safe(path, tp, &remove_list, list) { + if (path->query) + ib_sa_cancel_query(path->query_id, path->query); + wait_for_completion(&path->done); + __path_free(dev, path); + } +} + +static void path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *path_ptr) +{ + struct ipoib_path *path = path_ptr; + struct net_device *dev = path->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah = NULL; + struct ipoib_neigh *neigh; + struct sk_buff_head skqueue; + struct sk_buff *skb; + unsigned long flags; + + if (pathrec) + ipoib_dbg(priv, "PathRec LID 0x%04x for GID " IPOIB_GID_FMT "\n", + be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + else + ipoib_dbg(priv, "PathRec status %d for GID " IPOIB_GID_FMT "\n", + status, IPOIB_GID_ARG(path->pathrec.dgid)); + + skb_queue_head_init(&skqueue); + + if (!status) { + /* + * For now we set static_rate to 0. This is not + * really correct: we should look at the rate + * component of the path member record, compare it + * with the rate of our local port (calculated from + * the active link speed and link width) and set an + * inter-packet delay appropriately. 
+ */ + struct ib_ah_attr av = { + .dlid = be16_to_cpu(pathrec->dlid), + .sl = pathrec->sl, + .static_rate = 0, + .port_num = priv->port + }; + + ah = ipoib_create_ah(dev, priv->pd, &av); + } + + spin_lock_irqsave(&priv->lock, flags); + + path->ah = ah; + + if (ah) { + path->pathrec = *pathrec; + + ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", + ah, be16_to_cpu(pathrec->dlid), pathrec->sl); + + while ((skb = __skb_dequeue(&path->queue))) + __skb_queue_tail(&skqueue, skb); + + list_for_each_entry(neigh, &path->neigh_list, list) { + kref_get(&path->ah->ref); + neigh->ah = path->ah; + + while ((skb = __skb_dequeue(&neigh->queue))) + __skb_queue_tail(&skqueue, skb); + } + } else + path->query = NULL; + + complete(&path->done); + + spin_unlock_irqrestore(&priv->lock, flags); + + while ((skb = __skb_dequeue(&skqueue))) { + skb->dev = dev; + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue packet\n"); + } +} + +static struct ipoib_path *path_rec_create(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + path = kmalloc(sizeof *path, GFP_ATOMIC); + if (!path) + return NULL; + + path->dev = dev; + path->pathrec.dlid = 0; + + skb_queue_head_init(&path->queue); + + INIT_LIST_HEAD(&path->neigh_list); + path->query = NULL; + init_completion(&path->done); + + memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; + + __path_add(dev, path); + + return path; +} + +static int path_rec_start(struct net_device *dev, + struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "Start path record lookup for " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(path->pathrec.dgid)); + + path->query_id = + ib_sa_path_rec_get(priv->ca, priv->port, + &path->pathrec, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + 1000, GFP_ATOMIC, + path_rec_completion, + path, &path->query); + if (path->query_id < 0) { + ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); + path->query = NULL; + return path->query_id; + } + + return 0; +} + +static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + return; + } + + skb_queue_head_init(&neigh->queue); + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. 
+ */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) + goto err; + } + + list_add_tail(&neigh->list, &path->neigh_list); + + if (path->pathrec.dlid) { + kref_get(&path->ah->ref); + neigh->ah = path->ah; + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + } else { + neigh->ah = NULL; + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + __skb_queue_tail(&neigh->queue, skb); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + if (!path->query && path_rec_start(dev, path)) + goto err; + } + + spin_unlock(&priv->lock); + return; + +err: + *to_ipoib_neigh(skb->dst->neighbour) = NULL; + list_del(&neigh->list); + kfree(neigh); + neigh->neighbour->ops->destructor = NULL; + + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + + spin_unlock(&priv->lock); +} + +static void path_lookup(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + + /* Look up path record for unicasts */ + if (skb->dst->neighbour->ha[4] != 0xff) { + neigh_add_path(skb, dev); + return; + } + + /* Add in the P_Key for multicasts */ + skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; + skb->dst->neighbour->ha[9] = priv->pkey & 0xff; + ipoib_mcast_send(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4), skb); +} + +static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, + struct ipoib_pseudoheader *phdr) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (phdr->hwaddr + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (phdr->hwaddr + 4)); + if (path) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof *phdr); + __skb_queue_tail(&path->queue, skb); + + if (path_rec_start(dev, path)) + __path_free(dev, path); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); + return; + } + + if (path->pathrec.dlid) { + ipoib_dbg(priv, "Send unicast ARP to %04x\n", + be16_to_cpu(path->pathrec.dlid)); + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) phdr->hwaddr)); + } else if ((path->query || !path_rec_start(dev, path)) && + skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof *phdr); + __skb_queue_tail(&path->queue, skb); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); +} + +static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh; + unsigned long flags; + + local_irq_save(flags); + if (!spin_trylock(&priv->tx_lock)) { + local_irq_restore(flags); + return NETDEV_TX_LOCKED; + } + + /* + * Check if our queue is full. Since we have the LLTX feature + * bit set, we can't rely on netif_stop_queue() preventing our + * xmit function from being called with a full queue. + * + * This is a temporary workaround until LLTX is fixed so that + * hard_start_xmit does not get called after netif_stop_queue(). 
+ */ + if (unlikely(priv->tx_head - priv->tx_tail >= IPOIB_TX_RING_SIZE)) { + ipoib_dbg(priv, "TX ring full in xmit, stopping kernel net queue\n"); + netif_stop_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + return NETDEV_TX_BUSY; + } + + if (skb->dst && skb->dst->neighbour) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { + path_lookup(skb, dev); + goto out; + } + + neigh = *to_ipoib_neigh(skb->dst->neighbour); + + if (likely(neigh->ah)) { + ipoib_send(dev, skb, neigh->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + goto out; + } + + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + spin_lock(&priv->lock); + __skb_queue_tail(&neigh->queue, skb); + spin_unlock(&priv->lock); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key for multicast*/ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); + } else { + /* unicast GID -- should be ARP reply */ + + if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? "neigh" : "dst", + be16_to_cpup((u16 *) skb->data), + be32_to_cpup((u32 *) phdr->hwaddr), + IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + goto out; + } + + unicast_arp_send(skb, dev, phdr); + } + } + +out: + spin_unlock_irqrestore(&priv->tx_lock, flags); + + return NETDEV_TX_OK; +} + +struct net_device_stats *ipoib_get_stats(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return &priv->stats; +} + +static void ipoib_timeout(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_warn(priv, "transmit timeout: latency %ld\n", + jiffies - dev->trans_start); + /* XXX reset QP, etc. */ +} + +static int ipoib_hard_header(struct sk_buff *skb, + struct net_device *dev, + unsigned short type, + void *daddr, void *saddr, unsigned len) +{ + struct ipoib_header *header; + + header = (struct ipoib_header *) skb_push(skb, sizeof *header); + + header->proto = htons(type); + header->reserved = 0; + + /* + * If we don't have a neighbour structure, stuff the + * destination address onto the front of the skb so we can + * figure out where to send the packet later. 
+ */ + if (!skb->dst || !skb->dst->neighbour) { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb_push(skb, sizeof *phdr); + memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN); + } + + return 0; +} + +static void ipoib_set_mcast_list(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + schedule_work(&priv->restart_task); +} + +static void ipoib_neigh_destructor(struct neighbour *n) +{ + struct ipoib_neigh *neigh = *to_ipoib_neigh(n); + struct ipoib_dev_priv *priv = netdev_priv(n->dev); + unsigned long flags; + + ipoib_dbg(priv, + "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + be32_to_cpup((__be32 *) n->ha), + IPOIB_GID_ARG(*((union ib_gid *) (n->ha + 4)))); + + spin_lock_irqsave(&priv->lock, flags); + + if (neigh) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + *to_ipoib_neigh(n) = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); +} + +static int ipoib_neigh_setup(struct neighbour *neigh) +{ + /* + * Is this kosher? I can't find anybody in the kernel that + * sets neigh->destructor, so we should be able to set it here + * without trouble. + */ + neigh->ops->destructor = ipoib_neigh_destructor; + + return 0; +} + +static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +{ + parms->neigh_setup = ipoib_neigh_setup; + + return 0; +} + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Allocate RX/TX "rings" to hold queued skbs */ + + priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->rx_ring) { + printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", + ca->name, IPOIB_RX_RING_SIZE); + goto out; + } + memset(priv->rx_ring, 0, + IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf)); + + priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->tx_ring) { + printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", + ca->name, IPOIB_TX_RING_SIZE); + goto out_rx_ring_cleanup; + } + memset(priv->tx_ring, 0, + IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf)); + + /* priv->tx_head & tx_tail are already 0 */ + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_ring_cleanup; + + return 0; + +out_tx_ring_cleanup: + kfree(priv->tx_ring); + +out_rx_ring_cleanup: + kfree(priv->rx_ring); + +out: + return -ENOMEM; +} + +void ipoib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; + + ipoib_delete_debug_file(dev); + + /* Delete any child interfaces first */ + list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { + unregister_netdev(cpriv->dev); + ipoib_dev_cleanup(cpriv->dev); + free_netdev(cpriv->dev); + } + + ipoib_ib_dev_cleanup(dev); + + if (priv->rx_ring) { + kfree(priv->rx_ring); + priv->rx_ring = NULL; + } + + if (priv->tx_ring) { + kfree(priv->tx_ring); + priv->tx_ring = NULL; + } +} + +static void ipoib_setup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + dev->open = ipoib_open; + dev->stop = ipoib_stop; + dev->change_mtu = ipoib_change_mtu; + dev->hard_start_xmit = ipoib_start_xmit; + dev->get_stats = ipoib_get_stats; + dev->tx_timeout = ipoib_timeout; + dev->hard_header = ipoib_hard_header; + dev->set_multicast_list = ipoib_set_mcast_list; + dev->neigh_setup = ipoib_neigh_setup_dev; + + dev->watchdog_timeo = HZ; + + dev->rebuild_header = NULL; + dev->set_mac_address = NULL; + 
dev->header_cache_update = NULL; + + dev->flags |= IFF_BROADCAST | IFF_MULTICAST; + + /* + * We add in INFINIBAND_ALEN to allow for the destination + * address "pseudoheader" for skbs without neighbour struct. + */ + dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; + dev->addr_len = INFINIBAND_ALEN; + dev->type = ARPHRD_INFINIBAND; + dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + + /* MTU will be reset when mcast join happens */ + dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; + priv->mcast_mtu = priv->admin_mtu = dev->mtu; + + memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN); + + netif_carrier_off(dev); + + SET_MODULE_OWNER(dev); + + priv->dev = dev; + + spin_lock_init(&priv->lock); + spin_lock_init(&priv->tx_lock); + + init_MUTEX(&priv->mcast_mutex); + init_MUTEX(&priv->vlan_mutex); + + INIT_LIST_HEAD(&priv->path_list); + INIT_LIST_HEAD(&priv->child_intfs); + INIT_LIST_HEAD(&priv->dead_ahs); + INIT_LIST_HEAD(&priv->multicast_list); + + INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev); + INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); + INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); +} + +struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) +{ + struct net_device *dev; + + dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name, + ipoib_setup); + if (!dev) + return NULL; + + return netdev_priv(dev); +} + +static ssize_t show_pkey(struct class_device *cdev, char *buf) +{ + struct ipoib_dev_priv *priv = + netdev_priv(container_of(cdev, struct net_device, class_dev)); + + return sprintf(buf, "0x%04x\n", priv->pkey); +} +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t create_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; +} +static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); + +static ssize_t delete_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? 
ret : count; + +} +static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); + +int ipoib_add_pkey_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, + &class_device_attr_pkey); +} + +static struct net_device *ipoib_add_port(const char *format, + struct ib_device *hca, u8 port) +{ + struct ipoib_dev_priv *priv; + int result = -ENOMEM; + + priv = ipoib_intf_alloc(format); + if (!priv) + goto alloc_mem_failed; + + SET_NETDEV_DEV(priv->dev, hca->dma_device); + + result = ib_query_pkey(hca, port, 0, &priv->pkey); + if (result) { + printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } + + priv->dev->broadcast[8] = priv->pkey >> 8; + priv->dev->broadcast[9] = priv->pkey & 0xff; + + result = ib_query_gid(hca, port, 0, &priv->local_gid); + if (result) { + printk(KERN_WARNING "%s: ib_query_gid port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + + result = ipoib_dev_init(priv->dev, hca, port); + if (result < 0) { + printk(KERN_WARNING "%s: failed to initialize port %d (ret = %d)\n", + hca->name, port, result); + goto device_init_failed; + } + + INIT_IB_EVENT_HANDLER(&priv->event_handler, + priv->ca, ipoib_event); + result = ib_register_event_handler(&priv->event_handler); + if (result < 0) { + printk(KERN_WARNING "%s: ib_register_event_handler failed for " + "port %d (ret = %d)\n", + hca->name, port, result); + goto event_failed; + } + + result = register_netdev(priv->dev); + if (result) { + printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", + hca->name, port, result); + goto register_failed; + } + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_create_child)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_delete_child)) + goto sysfs_failed; + + return priv->dev; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ib_unregister_event_handler(&priv->event_handler); + +event_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +alloc_mem_failed: + return ERR_PTR(result); +} + +static void ipoib_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct net_device *dev; + struct ipoib_dev_priv *priv; + int s, e, p; + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + return; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + dev = ipoib_add_port("ib%d", device, p); + if (!IS_ERR(dev)) { + priv = netdev_priv(dev); + list_add_tail(&priv->list, dev_list); + } + } + + ib_set_client_data(device, &ipoib_client, dev_list); +} + +static void ipoib_remove_one(struct ib_device *device) +{ + struct ipoib_dev_priv *priv, *tmp; + struct list_head *dev_list; + + dev_list = ib_get_client_data(device, &ipoib_client); + + list_for_each_entry_safe(priv, tmp, dev_list, list) { + ib_unregister_event_handler(&priv->event_handler); + + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + free_netdev(priv->dev); + } +} + +static int __init ipoib_init_module(void) +{ + int ret; + + ret = 
ipoib_register_debugfs(); + if (ret) + return ret; + + /* + * We create our own workqueue mainly because we want to be + * able to flush it when devices are being removed. We can't + * use schedule_work()/flush_scheduled_work() because both + * unregister_netdev() and linkwatch_event take the rtnl lock, + * so flush_scheduled_work() can deadlock during device + * removal. + */ + ipoib_workqueue = create_singlethread_workqueue("ipoib"); + if (!ipoib_workqueue) { + ret = -ENOMEM; + goto err_fs; + } + + ret = ib_register_client(&ipoib_client); + if (ret) + goto err_wq; + + return 0; + +err_fs: + ipoib_unregister_debugfs(); + +err_wq: + destroy_workqueue(ipoib_workqueue); + + return ret; +} + +static void __exit ipoib_cleanup_module(void) +{ + ipoib_unregister_debugfs(); + ib_unregister_client(&ipoib_client); + destroy_workqueue(ipoib_workqueue); +} + +module_init(ipoib_init_module); +module_exit(ipoib_cleanup_module); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2004-12-19 22:04:17.682093438 -0800 @@ -0,0 +1,254 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include "ipoib.h" + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr *qp_attr; + int attr_mask; + int ret; + u16 pkey_index; + + ret = -ENOMEM; + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + goto out; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = -ENXIO; + goto out; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* set correct QKey for QP */ + qp_attr->qkey = priv->qkey; + attr_mask = IB_QP_QKEY; + ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); + goto out; + } + + /* attach QP to multicast group */ + down(&priv->mcast_mutex); + ret = ib_attach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "failed to attach to multicast group, ret = %d\n", ret); + +out: + kfree(qp_attr); + return ret; +} + +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + down(&priv->mcast_mutex); + ret = ib_detach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret); + + return ret; +} + +int ipoib_qp_create(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u16 pkey_index; + struct ib_qp_attr qp_attr; + int attr_mask; + + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. 
+ */ + ret = ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index); + if (ret) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return ret; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qkey = 0; + qp_attr.port_num = priv->port; + qp_attr.pkey_index = pkey_index; + attr_mask = + IB_QP_QKEY | + IB_QP_PORT | + IB_QP_PKEY_INDEX | + IB_QP_STATE; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTR; + /* Can't set this in a INIT->RTR transition */ + attr_mask &= ~IB_QP_PORT; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTS; + qp_attr.sq_psn = 0; + attr_mask |= IB_QP_SQ_PSN; + attr_mask &= ~IB_QP_PKEY_INDEX; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret); + goto out_fail; + } + + return 0; + +out_fail: + ib_destroy_qp(priv->qp); + priv->qp = NULL; + + return -EINVAL; +} + +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr init_attr = { + .cap = { + .max_send_wr = IPOIB_TX_RING_SIZE, + .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .sq_sig_type = IB_SIGNAL_ALL_WR, + .rq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UD + }; + + priv->pd = ib_alloc_pd(priv->ca); + if (IS_ERR(priv->pd)) { + printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); + return -ENODEV; + } + + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, + IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + if (IS_ERR(priv->cq)) { + printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); + goto out_free_pd; + } + + if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP)) + goto out_free_cq; + + priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(priv->mr)) { + printk(KERN_WARNING "%s: ib_reg_phys_mr failed\n", ca->name); + goto out_free_cq; + } + + init_attr.send_cq = priv->cq; + init_attr.recv_cq = priv->cq, + + priv->qp = ib_create_qp(priv->pd, &init_attr); + if (IS_ERR(priv->qp)) { + printk(KERN_WARNING "%s: failed to create QP\n", ca->name); + goto out_free_mr; + } + + priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff; + priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; + priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; + + return 0; + +out_free_mr: + ib_dereg_mr(priv->mr); + +out_free_cq: + ib_destroy_cq(priv->cq); + +out_free_pd: + ib_dealloc_pd(priv->pd); + return -ENODEV; +} + +void ipoib_transport_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->qp) { + if (ib_destroy_qp(priv->qp)) + ipoib_warn(priv, "ib_qp_destroy failed\n"); + + priv->qp = NULL; + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + } + + if (ib_dereg_mr(priv->mr)) + ipoib_warn(priv, "ib_dereg_mr failed\n"); + + if (ib_destroy_cq(priv->cq)) + ipoib_warn(priv, "ib_cq_destroy failed\n"); + + if (ib_dealloc_pd(priv->pd)) + ipoib_warn(priv, "ib_dealloc_pd failed\n"); +} + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record) +{ + struct ipoib_dev_priv *priv = + container_of(handler, struct ipoib_dev_priv, event_handler); + + if (record->event == IB_EVENT_PORT_ACTIVE || + 
record->event == IB_EVENT_LID_CHANGE || + record->event == IB_EVENT_SM_CHANGE) { + ipoib_dbg(priv, "Port active event\n"); + schedule_work(&priv->flush_task); + } +} From roland at topspin.com Sun Dec 19 22:15:16 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:16 -0800 Subject: [openib-general] [PATCH][v4][20/24] Add IPoIB multicast & partition code In-Reply-To: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> Message-ID: <200412192215.0RFFzMyJ3V7jnOWs@topspin.com> Add functions for handling IPoIB multicast and multiple partitions. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2004-12-19 22:04:18.140025955 -0800 @@ -0,0 +1,981 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_multicast.c 1362 2004-12-18 15:56:29Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int mcast_debug_level; + +module_param(mcast_debug_level, int, 0644); +MODULE_PARM_DESC(mcast_debug_level, + "Enable multicast debug tracing if > 0"); +#endif + +static DECLARE_MUTEX(mcast_mutex); + +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ipoib_ah *ah; + + struct rb_node rb_node; + struct list_head list; + struct completion done; + + int query_id; + struct ib_sa_query *query; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct list_head neigh_list; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; + +struct ipoib_mcast_iter { + struct net_device *dev; + union ib_gid mgid; + unsigned long created; + unsigned int queuelen; + unsigned int complete; + unsigned int send_only; +}; + +static void ipoib_mcast_free(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tmp; + unsigned long flags; + + ipoib_dbg_mcast(netdev_priv(dev), + "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); + + if (mcast->ah) + ipoib_put_ah(mcast->ah); + + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + dev_kfree_skb_any(skb); + } + + kfree(mcast); +} + +static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev, + int can_sleep) +{ + struct ipoib_mcast *mcast; + + mcast = kmalloc(sizeof (*mcast), can_sleep ? 
GFP_KERNEL : GFP_ATOMIC); + if (!mcast) + return NULL; + + memset(mcast, 0, sizeof (*mcast)); + + init_completion(&mcast->done); + + mcast->dev = dev; + mcast->created = jiffies; + mcast->backoff = HZ; + mcast->logcount = 0; + + INIT_LIST_HEAD(&mcast->list); + INIT_LIST_HEAD(&mcast->neigh_list); + skb_queue_head_init(&mcast->pkt_queue); + + mcast->ah = NULL; + mcast->query = NULL; + + return mcast; +} + +static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->multicast_tree.rb_node; + + while (n) { + struct ipoib_mcast *mcast; + int ret; + + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + ret = memcmp(mgid->raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return mcast; + } + + return NULL; +} + +static int __ipoib_mcast_add(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->multicast_tree.rb_node, *pn = NULL; + + while (*n) { + struct ipoib_mcast *tmcast; + int ret; + + pn = *n; + tmcast = rb_entry(pn, struct ipoib_mcast, rb_node); + + ret = memcmp(mcast->mcmember.mgid.raw, tmcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&mcast->rb_node, pn, n); + rb_insert_color(&mcast->rb_node, &priv->multicast_tree); + + return 0; +} + +static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, + struct ib_sa_mcmember_rec *mcmember) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + mcast->mcmember = *mcmember; + + /* Set the cached Q_Key before we attach if it's the broadcast group */ + if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid))) + priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); + + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_warn(priv, "multicast group " IPOIB_GID_FMT + " already attached\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + return 0; + } + + ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret < 0) { + ipoib_warn(priv, "couldn't attach QP to multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); + return ret; + } + } + + { + /* + * For now we set static_rate to 0. This is not + * really correct: we should look at the rate + * component of the MC member record, compare it with + * the rate of our local port (calculated from the + * active link speed and link width) and set an + * inter-packet delay appropriately. 
+ */ + struct ib_ah_attr av = { + .dlid = be16_to_cpu(mcast->mcmember.mlid), + .port_num = priv->port, + .sl = mcast->mcmember.sl, + .static_rate = 0, + .ah_flags = IB_AH_GRH, + .grh = { + .flow_label = be32_to_cpu(mcast->mcmember.flow_label), + .hop_limit = mcast->mcmember.hop_limit, + .sgid_index = 0, + .traffic_class = mcast->mcmember.traffic_class + } + }; + + av.grh.dgid = mcast->mcmember.mgid; + + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); + if (!mcast->ah) { + ipoib_warn(priv, "ib_address_create failed\n"); + } else { + ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT + " AV %p, LID 0x%04x, SL %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + mcast->ah->ah, + be16_to_cpu(mcast->mcmember.mlid), + mcast->mcmember.sl); + } + } + + /* actually send any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + if (!skb->dst || !skb->dst->neighbour) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof (struct ipoib_pseudoheader)); + } + + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + } + + return 0; +} + +static void +ipoib_mcast_sendonly_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + + if (!status) + ipoib_mcast_join_finish(mcast, mcmember); + else { + if (mcast->logcount++ < 20) + ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + /* Flush out any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + dev_kfree_skb_any(skb); + } + + /* Clear the busy flag so we try again */ + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + } + + complete(&mcast->done); +} + +static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { +#if 0 /* Some SMs don't support send-only yet */ + .join_state = 4 +#else + .join_state = 1 +#endif + }; + int ret = 0; + + if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { + ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n"); + return -ENODEV; + } + + if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); + return -EBUSY; + } + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 1000, GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast, &mcast->query); + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + ret); + } else { + ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT + ", starting join\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + mcast->query_id = ret; + } + + return ret; +} + +static void ipoib_mcast_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "join completion for " IPOIB_GID_FMT + " (status %d)\n", + 
IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { + mcast->backoff = HZ; + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + complete(&mcast->done); + return; + } + + if (status == -EINTR) { + complete(&mcast->done); + return; + } + + if (status && mcast->logcount++ < 20) { + if (status == -ETIMEDOUT || status == -EINTR) { + ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT + ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } else { + ipoib_warn(priv, "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } + } + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + mcast->query = NULL; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { + if (status == -ETIMEDOUT) + queue_work(ipoib_workqueue, &priv->mcast_task); + else + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); + } else + complete(&mcast->done); + up(&mcast_mutex); + + return; +} + +static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, + int create) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + ib_sa_comp_mask comp_mask; + int ret = 0; + + ipoib_dbg_mcast(priv, "joining MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + comp_mask = + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + if (create) { + comp_mask |= + IB_SA_MCMEMBER_REC_QKEY | + IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + + rec.qkey = priv->broadcast->mcmember.qkey; + rec.sl = priv->broadcast->mcmember.sl; + rec.flow_label = priv->broadcast->mcmember.flow_label; + rec.traffic_class = priv->broadcast->mcmember.traffic_class; + } + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, + mcast->backoff * 1000, GFP_ATOMIC, + ipoib_mcast_join_complete, + mcast, &mcast->query); + + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, + mcast->backoff); + up(&mcast_mutex); + } else + mcast->query_id = ret; +} + +void ipoib_mcast_join_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) + return; + + if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid)) + ipoib_warn(priv, "ib_gid_entry_get() failed\n"); + else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + if (!priv->broadcast) { + priv->broadcast = ipoib_mcast_alloc(dev, 1); + if (!priv->broadcast) { + ipoib_warn(priv, "failed to allocate broadcast group\n"); + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, HZ); + up(&mcast_mutex); + return; + } + + memcpy(priv->broadcast->mcmember.mgid.raw, 
priv->dev->broadcast + 4, + sizeof (union ib_gid)); + + spin_lock_irq(&priv->lock); + __ipoib_mcast_add(dev, priv->broadcast); + spin_unlock_irq(&priv->lock); + } + + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + ipoib_mcast_join(dev, priv->broadcast, 0); + return; + } + + while (1) { + struct ipoib_mcast *mcast = NULL; + + spin_lock_irq(&priv->lock); + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + /* Found the next unjoined group */ + break; + } + } + spin_unlock_irq(&priv->lock); + + if (&mcast->list == &priv->multicast_list) { + /* All done */ + break; + } + + ipoib_mcast_join(dev, mcast, 1); + return; + } + + { + struct ib_port_attr attr; + + if (!ib_query_port(priv->ca, priv->port, &attr)) + priv->local_lid = attr.lid; + else + ipoib_warn(priv, "ib_query_port failed\n"); + } + + priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - + IPOIB_ENCAP_LEN; + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); + + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + netif_carrier_on(dev); +} + +int ipoib_mcast_start_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "starting multicast thread\n"); + + down(&mcast_mutex); + if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + + return 0; +} + +int ipoib_mcast_stop_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + + down(&mcast_mutex); + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + cancel_delayed_work(&priv->mcast_task); + up(&mcast_mutex); + + flush_workqueue(ipoib_workqueue); + + if (priv->broadcast && priv->broadcast->query) { + ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); + priv->broadcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for bcast\n"); + wait_for_completion(&priv->broadcast->done); + } + + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + } + + return 0; +} + +int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + int ret = 0; + + if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) + return 0; + + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + + /* + * Just make one shot at leaving and don't wait for a reply; + * if we fail, too bad. 
+ */ + ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 0, GFP_ATOMIC, NULL, + mcast, &mcast->query); + if (ret < 0) + ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " + "for leave (result = %d)\n", ret); + + return 0; +} + +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + mcast = __ipoib_mcast_find(dev, mgid); + if (!mcast) { + /* Let's create a new send only group now */ + ipoib_dbg_mcast(priv, "setting up send only multicast group for " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); + + mcast = ipoib_mcast_alloc(dev, 0); + if (!mcast) { + ipoib_warn(priv, "unable to allocate memory for " + "multicast structure\n"); + dev_kfree_skb_any(skb); + goto out; + } + + set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); + mcast->mcmember.mgid = *mgid; + __ipoib_mcast_add(dev, mcast); + list_add_tail(&mcast->list, &priv->multicast_list); + } + + if (!mcast->ah) { + if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) + skb_queue_tail(&mcast->pkt_queue, skb); + else + dev_kfree_skb_any(skb); + + if (mcast->query) + ipoib_dbg_mcast(priv, "no address vector, " + "but multicast join already started\n"); + else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) + ipoib_mcast_sendonly_join(mcast); + + /* + * If lookup completes between here and out:, don't + * want to send packet twice. + */ + mcast = NULL; + } + +out: + if (mcast && mcast->ah) { + if (skb->dst && + skb->dst->neighbour && + !*to_ipoib_neigh(skb->dst->neighbour)) { + struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + + if (neigh) { + kref_get(&mcast->ah->ref); + neigh->ah = mcast->ah; + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + list_add_tail(&neigh->list, &mcast->neigh_list); + } + } + + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + } + + spin_unlock(&priv->lock); +} + +void ipoib_mcast_dev_flush(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + LIST_HEAD(remove_list); + struct ipoib_mcast *mcast, *tmcast, *nmcast; + unsigned long flags; + + ipoib_dbg_mcast(priv, "flushing multicast list\n"); + + spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->flags = + mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY); + + nmcast->mcmember.mgid = mcast->mcmember.mgid; + + /* Add the new group in before the to-be-destroyed group */ + list_add_tail(&nmcast->list, &mcast->list); + list_del_init(&mcast->list); + + rb_replace_node(&mcast->rb_node, &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&mcast->list, &remove_list); + } else { + ipoib_warn(priv, "could not reallocate multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + } + } + + if (priv->broadcast) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; + + rb_replace_node(&priv->broadcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&priv->broadcast->list, &remove_list); + } + + priv->broadcast = nmcast; + } + + 
spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(dev, mcast); + ipoib_mcast_free(mcast); + } +} + +void ipoib_mcast_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + /* Delete broadcast since it will be recreated */ + if (priv->broadcast) { + ipoib_dbg_mcast(priv, "deleting broadcast group\n"); + + spin_lock_irqsave(&priv->lock, flags); + rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_mcast_leave(dev, priv->broadcast); + ipoib_mcast_free(priv->broadcast); + priv->broadcast = NULL; + } +} + +void ipoib_mcast_restart_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dev_mc_list *mclist; + struct ipoib_mcast *mcast, *tmcast; + LIST_HEAD(remove_list); + unsigned long flags; + + ipoib_dbg_mcast(priv, "restarting multicast task\n"); + + ipoib_mcast_stop_thread(dev); + + spin_lock_irqsave(&priv->lock, flags); + + /* + * Unfortunately, the networking core only gives us a list of all of + * the multicast hardware addresses. We need to figure out which ones + * are new and which ones have been removed + */ + + /* Clear out the found flag */ + list_for_each_entry(mcast, &priv->multicast_list, list) + clear_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + + /* Mark all of the entries that are found or don't exist */ + for (mclist = dev->mc_list; mclist; mclist = mclist->next) { + union ib_gid mgid; + + memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid); + + /* Add in the P_Key */ + mgid.raw[4] = (priv->pkey >> 8) & 0xff; + mgid.raw[5] = priv->pkey & 0xff; + + mcast = __ipoib_mcast_find(dev, &mgid); + if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + struct ipoib_mcast *nmcast; + + /* Not found or send-only group, let's add a new entry */ + ipoib_dbg_mcast(priv, "adding multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + + nmcast = ipoib_mcast_alloc(dev, 0); + if (!nmcast) { + ipoib_warn(priv, "unable to allocate memory for multicast structure\n"); + continue; + } + + set_bit(IPOIB_MCAST_FLAG_FOUND, &nmcast->flags); + + nmcast->mcmember.mgid = mgid; + + if (mcast) { + /* Destroy the send only entry */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + + rb_replace_node(&mcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + } else + __ipoib_mcast_add(dev, nmcast); + + list_add_tail(&nmcast->list, &priv->multicast_list); + } + + if (mcast) + set_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + } + + /* Remove all of the entries don't exist anymore */ + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rb_erase(&mcast->rb_node, &priv->multicast_tree); + + /* Move to the remove list */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + } + } + spin_unlock_irqrestore(&priv->lock, flags); + + /* We have to cancel outside of the spinlock */ + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(mcast->dev, mcast); + ipoib_mcast_free(mcast); + } + + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_mcast_start_thread(dev); +} + +struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct 
net_device *dev) +{ + struct ipoib_mcast_iter *iter; + + iter = kmalloc(sizeof *iter, GFP_KERNEL); + if (!iter) + return NULL; + + iter->dev = dev; + memset(iter->mgid.raw, 0, sizeof iter->mgid); + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter) +{ + kfree(iter); +} + +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter) +{ + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); + struct rb_node *n; + struct ipoib_mcast *mcast; + int ret = 1; + + spin_lock_irq(&priv->lock); + + n = rb_first(&priv->multicast_tree); + + while (n) { + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + if (memcmp(iter->mgid.raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)) < 0) { + iter->mgid = mcast->mcmember.mgid; + iter->created = mcast->created; + iter->queuelen = skb_queue_len(&mcast->pkt_queue); + iter->complete = !!mcast->ah; + iter->send_only = !!(mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY)); + + ret = 0; + + break; + } + + n = rb_next(n); + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *mgid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only) +{ + *mgid = iter->mgid; + *created = iter->created; + *queuelen = iter->queuelen; + *complete = iter->complete; + *send_only = iter->send_only; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2004-12-19 22:04:18.168021829 -0800 @@ -0,0 +1,177 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_vlan.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include +#include +#include + +#include + +#include "ipoib.h" + +static ssize_t show_parent(struct class_device *class_dev, char *buf) +{ + struct net_device *dev = + container_of(class_dev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return sprintf(buf, "%s\n", priv->parent->name); +} +static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv; + char intf_name[IFNAMSIZ]; + int result; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + + /* + * First ensure this isn't a duplicate. We check the parent device and + * then all of the child interfaces to make sure the Pkey doesn't match. + */ + if (ppriv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + + list_for_each_entry(priv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + } + + snprintf(intf_name, sizeof intf_name, "%s.%04x", + ppriv->dev->name, pkey); + priv = ipoib_intf_alloc(intf_name); + if (!priv) { + result = -ENOMEM; + goto err; + } + + set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); + + priv->pkey = pkey; + + memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN); + priv->dev->broadcast[8] = pkey >> 8; + priv->dev->broadcast[9] = pkey & 0xff; + + result = ipoib_dev_init(priv->dev, ppriv->ca, ppriv->port); + if (result < 0) { + ipoib_warn(ppriv, "failed to initialize subinterface: " + "device %s, port %d", + ppriv->ca->name, ppriv->port); + goto device_init_failed; + } + + result = register_netdev(priv->dev); + if (result) { + ipoib_warn(priv, "failed to initialize; error %i", result); + goto register_failed; + } + + priv->parent = ppriv->dev; + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_parent)) + goto sysfs_failed; + + list_add_tail(&priv->list, &ppriv->child_intfs); + + up(&ppriv->vlan_mutex); + + return 0; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +err: + up(&ppriv->vlan_mutex); + return result; +} + +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv, *tpriv; + int ret = -ENOENT; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + + list_del(&priv->list); + + kfree(priv); + + ret = 0; + break; + } + } + up(&ppriv->vlan_mutex); + + return ret; +} From roland at topspin.com Sun Dec 19 22:15:17 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:17 -0800 Subject: [openib-general] [PATCH][v4][21/24] Add InfiniBand userspace MAD support In-Reply-To: <200412192215.0RFFzMyJ3V7jnOWs@topspin.com> Message-ID: <200412192215.96NE5c2UPDke8PZm@topspin.com> Add a driver that provides a character special device for each InfiniBand port. 
This device allows userspace to send and receive MADs via write() and read() (with some control operations implemented as ioctls). All operations are 32/64 clean and have been tested with 32-bit userspace running on a ppc64 kernel. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-19 22:04:14.163611941 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-19 22:04:18.415985288 -0800 @@ -1,6 +1,6 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o ib_umad.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o @@ -8,3 +8,5 @@ ib_mad-y := mad.o smi.o agent.o ib_sa-y := sa_query.o + +ib_umad-y := user_mad.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/user_mad.c 2004-12-19 22:04:18.442981310 -0800 @@ -0,0 +1,719 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand userspace MAD packet access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UMAD_MAX_PORTS = 256, + IB_UMAD_MAX_AGENTS = 32 +}; + +struct ib_umad_port { + int devnum; + struct cdev dev; + struct class_device class_dev; + struct ib_device *ib_dev; + struct ib_umad_device *umad_dev; + u8 port_num; +}; + +struct ib_umad_device { + int start_port, end_port; + struct kref ref; + struct ib_umad_port port[0]; +}; + +struct ib_umad_file { + struct ib_umad_port *port; + spinlock_t recv_lock; + struct list_head recv_list; + wait_queue_head_t recv_wait; + struct rw_semaphore agent_mutex; + struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; +}; + +struct ib_umad_packet { + struct ib_user_mad mad; + struct ib_ah *ah; + struct list_head list; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static dev_t base_dev; +static spinlock_t map_lock; +static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS); + +static void ib_umad_add_one(struct ib_device *device); +static void ib_umad_remove_one(struct ib_device *device); + +static int queue_packet(struct ib_umad_file *file, + struct ib_mad_agent *agent, + struct ib_umad_packet *packet) +{ + int ret = 1; + + down_read(&file->agent_mutex); + for (packet->mad.id = 0; + packet->mad.id < IB_UMAD_MAX_AGENTS; + packet->mad.id++) + if (agent == file->agent[packet->mad.id]) { + spin_lock_irq(&file->recv_lock); + list_add_tail(&packet->list, &file->recv_list); + spin_unlock_irq(&file->recv_lock); + wake_up_interruptible(&file->recv_wait); + ret = 0; + break; + } + + up_read(&file->agent_mutex); + + return ret; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *send_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet = + (void *) (unsigned long) send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + ib_destroy_ah(packet->ah); + + if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { + packet->mad.status = ETIMEDOUT; + + if (!queue_packet(file, agent, packet)) + return; + } + + kfree(packet); +} + +static void recv_handler(struct ib_mad_agent *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet; + + if (mad_recv_wc->wc->status != IB_WC_SUCCESS) + goto out; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + goto out; + + memset(packet, 0, sizeof *packet); + + memcpy(packet->mad.data, mad_recv_wc->recv_buf.mad, sizeof packet->mad.data); + packet->mad.status = 0; + packet->mad.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); + packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); + packet->mad.sl = mad_recv_wc->wc->sl; + packet->mad.path_bits = mad_recv_wc->wc->dlid_path_bits; + packet->mad.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); + if (packet->mad.grh_present) { + /* XXX parse GRH */ + packet->mad.gid_index = 0; + packet->mad.hop_limit = 0; + packet->mad.traffic_class = 0; + memset(packet->mad.gid, 0, 16); + packet->mad.flow_label = 0; + } + + if (queue_packet(file, agent, packet)) + kfree(packet); + +out: + ib_free_recv_mad(mad_recv_wc); +} + +static ssize_t ib_umad_read(struct file *filp, char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file 
*file = filp->private_data; + struct ib_umad_packet *packet; + ssize_t ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + spin_lock_irq(&file->recv_lock); + + while (list_empty(&file->recv_list)) { + spin_unlock_irq(&file->recv_lock); + + if (filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + if (wait_event_interruptible(file->recv_wait, + !list_empty(&file->recv_list))) + return -ERESTARTSYS; + + spin_lock_irq(&file->recv_lock); + } + + packet = list_entry(file->recv_list.next, struct ib_umad_packet, list); + list_del(&packet->list); + + spin_unlock_irq(&file->recv_lock); + + if (copy_to_user(buf, &packet->mad, sizeof packet->mad)) + ret = -EFAULT; + else + ret = sizeof packet->mad; + + kfree(packet); + return ret; +} + +static ssize_t ib_umad_write(struct file *filp, const char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + struct ib_mad_agent *agent; + struct ib_ah_attr ah_attr; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + }; + int ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + return -ENOMEM; + + if (copy_from_user(&packet->mad, buf, sizeof packet->mad)) { + kfree(packet); + return -EFAULT; + } + + if (packet->mad.id < 0 || packet->mad.id >= IB_UMAD_MAX_AGENTS) { + ret = -EINVAL; + goto err; + } + + down_read(&file->agent_mutex); + + agent = file->agent[packet->mad.id]; + if (!agent) { + ret = -EINVAL; + goto err_up; + } + + ((struct ib_mad_hdr *) packet->mad.data)->tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & + 0xffffffff)); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = be16_to_cpu(packet->mad.lid); + ah_attr.sl = packet->mad.sl; + ah_attr.src_path_bits = packet->mad.path_bits; + ah_attr.port_num = file->port->port_num; + /* XXX handle GRH */ + + packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); + if (IS_ERR(packet->ah)) { + ret = PTR_ERR(packet->ah); + goto err_up; + } + + gather_list.addr = dma_map_single(agent->device->dma_device, + packet->mad.data, + sizeof packet->mad.data, + DMA_TO_DEVICE); + gather_list.length = sizeof packet->mad.data; + gather_list.lkey = file->mr[packet->mad.id]->lkey; + pci_unmap_addr_set(packet, mapping, gather_list.addr); + + wr.wr.ud.mad_hdr = (struct ib_mad_hdr *) packet->mad.data; + wr.wr.ud.ah = packet->ah; + wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); + wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); + wr.wr.ud.timeout_ms = packet->mad.timeout_ms; + + wr.wr_id = (unsigned long) packet; + + ret = ib_post_send_mad(agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + goto err_up; + } + + up_read(&file->agent_mutex); + + return sizeof packet->mad; + +err_up: + up_read(&file->agent_mutex); + +err: + kfree(packet); + return ret; +} + +static unsigned int ib_umad_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ib_umad_file *file = filp->private_data; + + /* we will always be able to post a MAD send */ + unsigned int mask = POLLOUT | POLLWRNORM; + + poll_wait(filp, &file->recv_wait, wait); + + if (!list_empty(&file->recv_list)) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_umad_reg_agent(struct 
ib_umad_file *file, unsigned long arg) +{ + struct ib_user_mad_reg_req ureq; + struct ib_mad_reg_req req; + struct ib_mad_agent *agent; + int agent_id; + int ret; + + down_write(&file->agent_mutex); + + if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { + ret = -EFAULT; + goto out; + } + + if (ureq.qpn != 0 && ureq.qpn != 1) { + ret = -EINVAL; + goto out; + } + + for (agent_id = 0; agent_id < IB_UMAD_MAX_AGENTS; ++agent_id) + if (!file->agent[agent_id]) + goto found; + + ret = -ENOMEM; + goto out; + +found: + req.mgmt_class = ureq.mgmt_class; + req.mgmt_class_version = ureq.mgmt_class_version; + memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + memcpy(req.oui, ureq.oui, sizeof req.oui); + + agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, + ureq.qpn ? IB_QPT_GSI : IB_QPT_SMI, + &req, 0, send_handler, recv_handler, + file); + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); + goto out; + } + + file->agent[agent_id] = agent; + + file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(file->mr[agent_id])) { + ret = -ENOMEM; + goto err; + } + + if (put_user(agent_id, + (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { + ret = -EFAULT; + goto err_mr; + } + + ret = 0; + goto out; + +err_mr: + ib_dereg_mr(file->mr[agent_id]); + +err: + file->agent[agent_id] = NULL; + ib_unregister_mad_agent(agent); + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) +{ + u32 id; + int ret = 0; + + down_write(&file->agent_mutex); + + if (get_user(id, (u32 __user *) arg)) { + ret = -EFAULT; + goto out; + } + + if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { + ret = -EINVAL; + goto out; + } + + ib_dereg_mr(file->mr[id]); + ib_unregister_mad_agent(file->agent[id]); + file->agent[id] = NULL; + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case IB_USER_MAD_REGISTER_AGENT: + return ib_umad_reg_agent(filp->private_data, arg); + case IB_USER_MAD_UNREGISTER_AGENT: + return ib_umad_unreg_agent(filp->private_data, arg); + default: + return -ENOIOCTLCMD; + } +} + +static int ib_umad_open(struct inode *inode, struct file *filp) +{ + struct ib_umad_port *port = + container_of(inode->i_cdev, struct ib_umad_port, dev); + struct ib_umad_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + memset(file, 0, sizeof *file); + + spin_lock_init(&file->recv_lock); + init_rwsem(&file->agent_mutex); + INIT_LIST_HEAD(&file->recv_list); + init_waitqueue_head(&file->recv_wait); + + file->port = port; + filp->private_data = file; + + return 0; +} + +static int ib_umad_close(struct inode *inode, struct file *filp) +{ + struct ib_umad_file *file = filp->private_data; + int i; + + for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) + if (file->agent[i]) { + ib_dereg_mr(file->mr[i]); + ib_unregister_mad_agent(file->agent[i]); + } + + kfree(file); + + return 0; +} + +static struct file_operations umad_fops = { + .owner = THIS_MODULE, + .read = ib_umad_read, + .write = ib_umad_write, + .poll = ib_umad_poll, + .ioctl = ib_umad_ioctl, + .open = ib_umad_open, + .release = ib_umad_close +}; + +static struct ib_client umad_client = { + .name = "umad", + .add = ib_umad_add_one, + .remove = ib_umad_remove_one +}; + +static ssize_t show_dev(struct class_device *class_dev, char *buf) +{ + struct 
ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return print_dev_t(buf, port->dev.dev); +} +static CLASS_DEVICE_ATTR(dev, S_IRUGO, show_dev, NULL); + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return sprintf(buf, "%s\n", port->ib_dev->name); +} +static CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); + +static ssize_t show_port(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return sprintf(buf, "%d\n", port->port_num); +} +static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); + +static void ib_umad_release_dev(struct kref *ref) +{ + struct ib_umad_device *dev = + container_of(ref, struct ib_umad_device, ref); + + kfree(dev); +} + +static void ib_umad_release_port(struct class_device *class_dev) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + cdev_del(&port->dev); + clear_bit(port->devnum, dev_map); + kref_put(&port->umad_dev->ref, ib_umad_release_dev); +} + +static struct class umad_class = { + .name = "infiniband_mad", + .release = ib_umad_release_port +}; + +static ssize_t show_abi_version(struct class *class, char *buf) +{ + return sprintf(buf, "%d\n", IB_USER_MAD_ABI_VERSION); +} +static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); + +static void ib_umad_add_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + umad_dev = kmalloc(sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port), + GFP_KERNEL); + if (!umad_dev) + return; + + memset(umad_dev, 0, sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port)); + + kref_init(&umad_dev->ref); + + umad_dev->start_port = s; + umad_dev->end_port = e; + + for (i = s; i <= e; ++i) { + umad_dev->port[i - s].umad_dev = umad_dev; + kref_get(&umad_dev->ref); + + spin_lock(&map_lock); + umad_dev->port[i - s].devnum = + find_first_zero_bit(dev_map, IB_UMAD_MAX_PORTS); + if (umad_dev->port[i - s].devnum >= IB_UMAD_MAX_PORTS) { + spin_unlock(&map_lock); + goto err; + } + set_bit(umad_dev->port[i - s].devnum, dev_map); + spin_unlock(&map_lock); + + umad_dev->port[i - s].ib_dev = device; + umad_dev->port[i - s].port_num = i; + + cdev_init(&umad_dev->port[i - s].dev, &umad_fops); + umad_dev->port[i - s].dev.owner = THIS_MODULE; + kobject_set_name(&umad_dev->port[i - s].dev.kobj, + "umad%d", umad_dev->port[i - s].devnum); + if (cdev_add(&umad_dev->port[i - s].dev, base_dev + + umad_dev->port[i - s].devnum, 1)) + goto err; + + umad_dev->port[i - s].class_dev.class = &umad_class; + umad_dev->port[i - s].class_dev.dev = device->dma_device; + snprintf(umad_dev->port[i - s].class_dev.class_id, + BUS_ID_SIZE, "umad%d", umad_dev->port[i - s].devnum); + if (class_device_register(&umad_dev->port[i - s].class_dev)) + goto err_class; + + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_dev)) + goto err_class; + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_ibdev)) + goto err_class; + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_port)) + goto err_class; + } + + ib_set_client_data(device, &umad_client, umad_dev); + + return; + +err_class: + cdev_del(&umad_dev->port[i - s].dev); + 
clear_bit(umad_dev->port[i - s].devnum, dev_map); + +err: + while (--i >= s) + class_device_unregister(&umad_dev->port[i - s].class_dev); + + kref_put(&umad_dev->ref, ib_umad_release_dev); +} + +static void ib_umad_remove_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); + int i; + + if (!umad_dev) + return; + + for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) + class_device_unregister(&umad_dev->port[i].class_dev); + + kref_put(&umad_dev->ref, ib_umad_release_dev); +} + +static int __init ib_umad_init(void) +{ + int ret; + + spin_lock_init(&map_lock); + + ret = alloc_chrdev_region(&base_dev, 0, IB_UMAD_MAX_PORTS, + "infiniband_mad"); + if (ret) { + printk(KERN_ERR "user_mad: couldn't get device number\n"); + goto out; + } + + ret = class_register(&umad_class); + if (ret) { + printk(KERN_ERR "user_mad: couldn't create class infiniband_mad\n"); + goto out_chrdev; + } + + ret = class_create_file(&umad_class, &class_attr_abi_version); + if (ret) { + printk(KERN_ERR "user_mad: couldn't create abi_version attribute\n"); + goto out_class; + } + + ret = ib_register_client(&umad_client); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ib_umad client\n"); + goto out_class; + } + + /* Our ioctls are 32/64 clean */ + ret = register_ioctl32_conversion(IB_USER_MAD_REGISTER_AGENT, NULL); + ret |= register_ioctl32_conversion(IB_USER_MAD_UNREGISTER_AGENT, NULL); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ioctl32 conversions\n"); + goto out_client; + } + + return 0; + +out_client: + ib_unregister_client(&umad_client); + +out_class: + class_unregister(&umad_class); + +out_chrdev: + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); + +out: + return ret; +} + +static void __exit ib_umad_cleanup(void) +{ + unregister_ioctl32_conversion(IB_USER_MAD_REGISTER_AGENT); + unregister_ioctl32_conversion(IB_USER_MAD_UNREGISTER_AGENT); + ib_unregister_client(&umad_client); + class_unregister(&umad_class); + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); +} + +module_init(ib_umad_init); +module_exit(ib_umad_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_user_mad.h 2004-12-19 22:04:18.470977184 -0800 @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef IB_USER_MAD_H +#define IB_USER_MAD_H + +#include +#include + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. + */ +#define IB_USER_MAD_ABI_VERSION 2 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + */ + +/** + * ib_user_mad - MAD packet + * @data - Contents of MAD + * @id - ID of agent MAD received with/to be sent with + * @status - 0 on successful receive, ETIMEDOUT if no response + * received (transaction ID in data[] will be set to TID of original + * request) (ignored on send) + * @timeout_ms - Milliseconds to wait for response (unset on receive) + * @qpn - Remote QP number received from/to be sent to + * @qkey - Remote Q_Key to be sent with (unset on receive) + * @lid - Remote lid received from/to be sent to + * @sl - Service level received with/to be sent with + * @path_bits - Local path bits received with/to be sent with + * @grh_present - If set, GRH was received/should be sent + * @gid_index - Local GID index to send with (unset on receive) + * @hop_limit - Hop limit in GRH + * @traffic_class - Traffic class in GRH + * @gid - Remote GID in GRH + * @flow_label - Flow label in GRH + * + * All multi-byte quantities are stored in network (big endian) byte order. + */ +struct ib_user_mad { + __u8 data[256]; + __u32 id; + __u32 status; + __u32 timeout_ms; + __u32 qpn; + __u32 qkey; + __u16 lid; + __u8 sl; + __u8 path_bits; + __u8 grh_present; + __u8 gid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 gid[16]; + __u32 flow_label; +}; + +/** + * ib_user_mad_reg_req - MAD registration request + * @id - Set by the kernel; used to identify agent in future requests. + * @qpn - Queue pair number; must be 0 or 1. + * @method_mask - The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + * @mgmt_class - Indicates which management class of MADs should be receive + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version - Indicates which version of MADs for the given + * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. + */ +struct ib_user_mad_reg_req { + __u32 id; + __u32 method_mask[4]; + __u8 qpn; + __u8 mgmt_class; + __u8 mgmt_class_version; + __u8 oui[3]; +}; + +#define IB_IOCTL_MAGIC 0x1b + +#define IB_USER_MAD_REGISTER_AGENT _IOWR(IB_IOCTL_MAGIC, 1, \ + struct ib_user_mad_reg_req) + +#define IB_USER_MAD_UNREGISTER_AGENT _IOW(IB_IOCTL_MAGIC, 2, __u32) + +#endif /* IB_USER_MAD_H */ From roland at topspin.com Sun Dec 19 22:15:18 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:18 -0800 Subject: [openib-general] [PATCH][v4][22/24] Document InfiniBand ioctl use In-Reply-To: <200412192215.96NE5c2UPDke8PZm@topspin.com> Message-ID: <200412192215.XIYuLGzemoQZ9LxY@topspin.com> Add the 0x1b ioctl magic number used by ib_umad module to Documentation/ioctl-number.txt. 
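For reference, a minimal userspace sketch (illustrative only, not part of the patch) of how the ioctls built from this 0x1b magic are used; it assumes the ib_user_mad.h header from the previous patch is available on the include path and that udev has created /dev/infiniband/umad0 as described in the user_mad.txt documentation:

    /*
     * Sketch: register a MAD agent for unsolicited Performance Management
     * Get MADs using IB_USER_MAD_REGISTER_AGENT (_IOWR(0x1b, 1, ...)), then
     * drop it again with IB_USER_MAD_UNREGISTER_AGENT (_IOW(0x1b, 2, __u32)).
     * The class/method numbers are standard IBA values chosen for illustration.
     */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include "ib_user_mad.h"        /* assumed to be on the include path */

    int main(void)
    {
            struct ib_user_mad_reg_req req;
            int fd = open("/dev/infiniband/umad0", O_RDWR);  /* node created by udev */

            if (fd < 0) {
                    perror("open umad device");
                    return 1;
            }

            memset(&req, 0, sizeof req);
            req.qpn                = 1;         /* GSI; must be 0 or 1 */
            req.mgmt_class         = 0x04;      /* Performance Management class */
            req.mgmt_class_version = 1;
            req.method_mask[0]     = 1 << 0x01; /* unsolicited Get(0x01) MADs */

            if (ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req) < 0)
                    perror("agent register");
            else if (ioctl(fd, IB_USER_MAD_UNREGISTER_AGENT, &req.id) < 0)
                    perror("agent unregister");

            close(fd);
            return 0;
    }

The register ioctl fills in req.id; that id is what identifies the agent in the unregister call and in the id field of MADs subsequently sent with write().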
Signed-off-by: Roland Dreier --- linux-bk.orig/Documentation/ioctl-number.txt 2004-12-19 21:09:51.000000000 -0800 +++ linux-bk/Documentation/ioctl-number.txt 2004-12-19 22:04:19.307853857 -0800 @@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem From roland at topspin.com Sun Dec 19 22:15:19 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:19 -0800 Subject: [openib-general] [PATCH][v4][23/24] Add InfiniBand Documentation files In-Reply-To: <200412192215.XIYuLGzemoQZ9LxY@topspin.com> Message-ID: <200412192215.pKjErOfjUaT6gtSk@topspin.com> Add files to Documentation/infiniband that describe the tree under /sys/class/infiniband, the IPoIB driver and the userspace MAD access driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/ipoib.txt 2004-12-19 22:04:20.188724047 -0800 @@ -0,0 +1,56 @@ +IP OVER INFINIBAND + + The ib_ipoib driver is an implementation of the IP over InfiniBand + protocol as specified by the latest Internet-Drafts issued by the + IETF ipoib working group. It is a "native" implementation in the + sense of setting the interface type to ARPHRD_INFINIBAND and the + hardware address length to 20 (earlier proprietary implementations + masqueraded to the kernel as ethernet interfaces). + +Partitions and P_Keys + + When the IPoIB driver is loaded, it creates one interface for each + port using the P_Key at index 0. To create an interface with a + different P_Key, write the desired P_Key into the main interface's + /sys/class/net/<interface name>/create_child file. For example: + + echo 0x8001 > /sys/class/net/ib0/create_child + + This will create an interface named ib0.8001 with P_Key 0x8001. To + remove a subinterface, use the "delete_child" file: + + echo 0x8001 > /sys/class/net/ib0/delete_child + + The P_Key for any interface is given by the "pkey" file, and the + main interface for a subinterface is in "parent." + +Debugging Information + + By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set + to 'y', tracing messages are compiled into the driver. They are + turned on by setting the module parameters debug_level and + mcast_debug_level to 1. These parameters can be controlled at + runtime through files in /sys/module/ib_ipoib/. + + CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs" + virtual filesystem. By mounting this filesystem, for example with + + mkdir -p /ipoib_debugfs + mount -t ipoib_debugfs none /ipoib_debugfs + + it is possible to get statistics about multicast groups from the + files /ipoib_debugfs/ib0_mcg and so on. + + The performance impact of this option is negligible, so it + is safe to enable this option with debug_level set to 0 for normal + operation. + + CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in + the data path when data_debug_level is set to 1. However, even with + the output disabled, enabling this configuration option will affect + performance, because it adds tests to the fast path.
+ +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-12-19 22:04:20.415690600 -0800 @@ -0,0 +1,63 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port. For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. + +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-12-19 22:04:20.627659363 -0800 @@ -0,0 +1,81 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. 
For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%k" + + can be used. This will create a device node named + + /dev/infiniband/umad0 + + for the first port, and so on. The InfiniBand device and port + associated with this device can be determined from the files + + /sys/class/infiniband_mad/umad0/ibdev + /sys/class/infiniband_mad/umad0/port From roland at topspin.com Sun Dec 19 22:15:20 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:20 -0800 Subject: [openib-general] [PATCH][v4][24/24] InfiniBand MAINTAINERS entry In-Reply-To: <200412192215.pKjErOfjUaT6gtSk@topspin.com> Message-ID: <200412192215.TKOr0u539NEw6MPp@topspin.com> Add OpenIB maintainers information to MAINTAINERS. Signed-off-by: Roland Dreier --- linux-bk.orig/MAINTAINERS 2004-12-19 21:09:20.000000000 -0800 +++ linux-bk/MAINTAINERS 2004-12-19 22:04:20.988606172 -0800 @@ -1081,6 +1081,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From yoshfuji at linux-ipv6.org Sun Dec 19 22:38:45 2004 From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=) Date: Mon, 20 Dec 2004 15:38:45 +0900 (JST) Subject: [openib-general] Re: [PATCH][v4][0/24] Second InfiniBand merge candidate patch set In-Reply-To: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> References: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> Message-ID: <20041220.153845.70996857.yoshfuji@linux-ipv6.org> In article <200412192214.KlDxQ7icOmxHYIf0 at topspin.com> (at Sun, 19 Dec 2004 22:14:43 -0800), Roland Dreier says: > The following series of patches is the latest version of the OpenIB > InfiniBand drivers. We believe that this version is suitable for > merging when 2.6.11 opens (or into -mm immediately), although of > course we are willing to go through as many more iterations as > required to fix any remaining issues. Maybe, via the net queue. David? 
--yoshfuji From yoshfuji at linux-ipv6.org Sun Dec 19 22:58:36 2004 From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / =?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=) Date: Mon, 20 Dec 2004 15:58:36 +0900 (JST) Subject: [openib-general] Re: [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> References: <200412192215.69tnzAhGIT1vQGLF@topspin.com> <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> Message-ID: <20041220.155836.75677852.yoshfuji@linux-ipv6.org> In article <200412192215.fZX1ZQqQD4QGkKcF at topspin.com> (at Sun, 19 Dec 2004 22:15:14 -0800), Roland Dreier says: > +enum { > + IPOIB_PACKET_SIZE = 2048, > + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, > + > + IPOIB_ENCAP_LEN = 4, > + > + IPOIB_RX_RING_SIZE = 128, > + IPOIB_TX_RING_SIZE = 64, > + > + IPOIB_NUM_WC = 4, > + > + IPOIB_MAX_PATH_REC_QUEUE = 3, > + IPOIB_MAX_MCAST_QUEUE = 3, above entries does not seem to appropriate for enum (than #define). > + > + IPOIB_FLAG_OPER_UP = 0, > + IPOIB_FLAG_ADMIN_UP = 1, > + IPOIB_PKEY_ASSIGNED = 2, > + IPOIB_PKEY_STOP = 3, > + IPOIB_FLAG_SUBINTERFACE = 4, > + IPOIB_MCAST_RUN = 5, > + IPOIB_STOP_REAPER = 6, this seems ok, but are "xxx_FLAG_xxx" entries really flags? > + IPOIB_MAX_BACKOFF_SECONDS = 16, ditto, w/ first one. > + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ > + IPOIB_MCAST_FLAG_SENDONLY = 1, > + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ > + IPOIB_MCAST_FLAG_ATTACHED = 3, seems fine, but are these really flags? --yoshfuji From roland at topspin.com Sun Dec 19 22:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:14 -0800 Subject: [openib-general] [PATCH][v4][17/24] IPoIB IPv4 multicast In-Reply-To: <200412192215.l62Q9JXNhGg51wOf@topspin.com> Message-ID: <200412192215.vegmgBmv5xungHlQ@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-12-19 22:04:16.867213523 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ --- linux-bk.orig/include/net/ip.h 2004-12-19 21:09:26.000000000 -0800 +++ linux-bk/include/net/ip.h 2004-12-19 22:04:16.868213376 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. + */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif --- linux-bk.orig/net/ipv4/arp.c 2004-12-19 21:09:46.000000000 -0800 +++ linux-bk/net/ipv4/arp.c 2004-12-19 22:04:16.868213376 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Sun Dec 19 22:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:14 -0800 Subject: [openib-general] [PATCH][v4][18/24] IPoIB IPv6 support In-Reply-To: <200412192215.vegmgBmv5xungHlQ@topspin.com> Message-ID: <200412192215.69tnzAhGIT1vQGLF@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier --- linux-bk.orig/include/net/if_inet6.h 2004-12-19 21:09:54.000000000 -0800 +++ linux-bk/include/net/if_inet6.h 2004-12-19 22:04:17.213162542 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif --- linux-bk.orig/net/ipv6/addrconf.c 2004-12-19 21:09:51.000000000 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-12-19 22:04:17.215162248 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1095,6 +1096,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1794,7 +1801,8 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && - (dev->type != ARPHRD_ARCNET)) { + (dev->type != ARPHRD_ARCNET) && + (dev->type != ARPHRD_INFINIBAND)) { /* Alas, we support only Ethernet autoconfiguration. 
*/ return; } --- linux-bk.orig/net/ipv6/ndisc.c 2004-12-19 21:09:20.000000000 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-12-19 22:04:17.216162100 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Sun Dec 19 22:15:16 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:16 -0800 Subject: [openib-general] [PATCH][v4][20/24] Add IPoIB multicast & partition code In-Reply-To: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> Message-ID: <200412192215.0RFFzMyJ3V7jnOWs@topspin.com> Add functions for handling IPoIB multicast and multiple partitions. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2004-12-19 22:04:18.140025955 -0800 @@ -0,0 +1,981 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_multicast.c 1362 2004-12-18 15:56:29Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int mcast_debug_level; + +module_param(mcast_debug_level, int, 0644); +MODULE_PARM_DESC(mcast_debug_level, + "Enable multicast debug tracing if > 0"); +#endif + +static DECLARE_MUTEX(mcast_mutex); + +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ipoib_ah *ah; + + struct rb_node rb_node; + struct list_head list; + struct completion done; + + int query_id; + struct ib_sa_query *query; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct list_head neigh_list; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; + +struct ipoib_mcast_iter { + struct net_device *dev; + union ib_gid mgid; + unsigned long created; + unsigned int queuelen; + unsigned int complete; + unsigned int send_only; +}; + +static void ipoib_mcast_free(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tmp; + unsigned long flags; + + ipoib_dbg_mcast(netdev_priv(dev), + "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); + + if (mcast->ah) + ipoib_put_ah(mcast->ah); + + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + dev_kfree_skb_any(skb); + } + + kfree(mcast); +} + +static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev, + int can_sleep) +{ + struct ipoib_mcast *mcast; + + mcast = kmalloc(sizeof (*mcast), can_sleep ? 
GFP_KERNEL : GFP_ATOMIC); + if (!mcast) + return NULL; + + memset(mcast, 0, sizeof (*mcast)); + + init_completion(&mcast->done); + + mcast->dev = dev; + mcast->created = jiffies; + mcast->backoff = HZ; + mcast->logcount = 0; + + INIT_LIST_HEAD(&mcast->list); + INIT_LIST_HEAD(&mcast->neigh_list); + skb_queue_head_init(&mcast->pkt_queue); + + mcast->ah = NULL; + mcast->query = NULL; + + return mcast; +} + +static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->multicast_tree.rb_node; + + while (n) { + struct ipoib_mcast *mcast; + int ret; + + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + ret = memcmp(mgid->raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return mcast; + } + + return NULL; +} + +static int __ipoib_mcast_add(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->multicast_tree.rb_node, *pn = NULL; + + while (*n) { + struct ipoib_mcast *tmcast; + int ret; + + pn = *n; + tmcast = rb_entry(pn, struct ipoib_mcast, rb_node); + + ret = memcmp(mcast->mcmember.mgid.raw, tmcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&mcast->rb_node, pn, n); + rb_insert_color(&mcast->rb_node, &priv->multicast_tree); + + return 0; +} + +static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, + struct ib_sa_mcmember_rec *mcmember) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + mcast->mcmember = *mcmember; + + /* Set the cached Q_Key before we attach if it's the broadcast group */ + if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid))) + priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); + + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_warn(priv, "multicast group " IPOIB_GID_FMT + " already attached\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + return 0; + } + + ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret < 0) { + ipoib_warn(priv, "couldn't attach QP to multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); + return ret; + } + } + + { + /* + * For now we set static_rate to 0. This is not + * really correct: we should look at the rate + * component of the MC member record, compare it with + * the rate of our local port (calculated from the + * active link speed and link width) and set an + * inter-packet delay appropriately. 
+ */ + struct ib_ah_attr av = { + .dlid = be16_to_cpu(mcast->mcmember.mlid), + .port_num = priv->port, + .sl = mcast->mcmember.sl, + .static_rate = 0, + .ah_flags = IB_AH_GRH, + .grh = { + .flow_label = be32_to_cpu(mcast->mcmember.flow_label), + .hop_limit = mcast->mcmember.hop_limit, + .sgid_index = 0, + .traffic_class = mcast->mcmember.traffic_class + } + }; + + av.grh.dgid = mcast->mcmember.mgid; + + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); + if (!mcast->ah) { + ipoib_warn(priv, "ib_address_create failed\n"); + } else { + ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT + " AV %p, LID 0x%04x, SL %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + mcast->ah->ah, + be16_to_cpu(mcast->mcmember.mlid), + mcast->mcmember.sl); + } + } + + /* actually send any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + if (!skb->dst || !skb->dst->neighbour) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof (struct ipoib_pseudoheader)); + } + + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + } + + return 0; +} + +static void +ipoib_mcast_sendonly_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + + if (!status) + ipoib_mcast_join_finish(mcast, mcmember); + else { + if (mcast->logcount++ < 20) + ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + /* Flush out any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + dev_kfree_skb_any(skb); + } + + /* Clear the busy flag so we try again */ + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + } + + complete(&mcast->done); +} + +static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { +#if 0 /* Some SMs don't support send-only yet */ + .join_state = 4 +#else + .join_state = 1 +#endif + }; + int ret = 0; + + if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { + ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n"); + return -ENODEV; + } + + if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); + return -EBUSY; + } + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 1000, GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast, &mcast->query); + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + ret); + } else { + ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT + ", starting join\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + mcast->query_id = ret; + } + + return ret; +} + +static void ipoib_mcast_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "join completion for " IPOIB_GID_FMT + " (status %d)\n", + 
IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { + mcast->backoff = HZ; + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + complete(&mcast->done); + return; + } + + if (status == -EINTR) { + complete(&mcast->done); + return; + } + + if (status && mcast->logcount++ < 20) { + if (status == -ETIMEDOUT || status == -EINTR) { + ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT + ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } else { + ipoib_warn(priv, "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } + } + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + mcast->query = NULL; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { + if (status == -ETIMEDOUT) + queue_work(ipoib_workqueue, &priv->mcast_task); + else + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); + } else + complete(&mcast->done); + up(&mcast_mutex); + + return; +} + +static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, + int create) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + ib_sa_comp_mask comp_mask; + int ret = 0; + + ipoib_dbg_mcast(priv, "joining MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + comp_mask = + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + if (create) { + comp_mask |= + IB_SA_MCMEMBER_REC_QKEY | + IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + + rec.qkey = priv->broadcast->mcmember.qkey; + rec.sl = priv->broadcast->mcmember.sl; + rec.flow_label = priv->broadcast->mcmember.flow_label; + rec.traffic_class = priv->broadcast->mcmember.traffic_class; + } + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, + mcast->backoff * 1000, GFP_ATOMIC, + ipoib_mcast_join_complete, + mcast, &mcast->query); + + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, + mcast->backoff); + up(&mcast_mutex); + } else + mcast->query_id = ret; +} + +void ipoib_mcast_join_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) + return; + + if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid)) + ipoib_warn(priv, "ib_gid_entry_get() failed\n"); + else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + if (!priv->broadcast) { + priv->broadcast = ipoib_mcast_alloc(dev, 1); + if (!priv->broadcast) { + ipoib_warn(priv, "failed to allocate broadcast group\n"); + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, HZ); + up(&mcast_mutex); + return; + } + + memcpy(priv->broadcast->mcmember.mgid.raw, 
priv->dev->broadcast + 4, + sizeof (union ib_gid)); + + spin_lock_irq(&priv->lock); + __ipoib_mcast_add(dev, priv->broadcast); + spin_unlock_irq(&priv->lock); + } + + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + ipoib_mcast_join(dev, priv->broadcast, 0); + return; + } + + while (1) { + struct ipoib_mcast *mcast = NULL; + + spin_lock_irq(&priv->lock); + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + /* Found the next unjoined group */ + break; + } + } + spin_unlock_irq(&priv->lock); + + if (&mcast->list == &priv->multicast_list) { + /* All done */ + break; + } + + ipoib_mcast_join(dev, mcast, 1); + return; + } + + { + struct ib_port_attr attr; + + if (!ib_query_port(priv->ca, priv->port, &attr)) + priv->local_lid = attr.lid; + else + ipoib_warn(priv, "ib_query_port failed\n"); + } + + priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - + IPOIB_ENCAP_LEN; + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); + + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + netif_carrier_on(dev); +} + +int ipoib_mcast_start_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "starting multicast thread\n"); + + down(&mcast_mutex); + if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + + return 0; +} + +int ipoib_mcast_stop_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + + down(&mcast_mutex); + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + cancel_delayed_work(&priv->mcast_task); + up(&mcast_mutex); + + flush_workqueue(ipoib_workqueue); + + if (priv->broadcast && priv->broadcast->query) { + ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); + priv->broadcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for bcast\n"); + wait_for_completion(&priv->broadcast->done); + } + + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + } + + return 0; +} + +int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + int ret = 0; + + if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) + return 0; + + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + + /* + * Just make one shot at leaving and don't wait for a reply; + * if we fail, too bad. 
+ */ + ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 0, GFP_ATOMIC, NULL, + mcast, &mcast->query); + if (ret < 0) + ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " + "for leave (result = %d)\n", ret); + + return 0; +} + +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + mcast = __ipoib_mcast_find(dev, mgid); + if (!mcast) { + /* Let's create a new send only group now */ + ipoib_dbg_mcast(priv, "setting up send only multicast group for " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); + + mcast = ipoib_mcast_alloc(dev, 0); + if (!mcast) { + ipoib_warn(priv, "unable to allocate memory for " + "multicast structure\n"); + dev_kfree_skb_any(skb); + goto out; + } + + set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); + mcast->mcmember.mgid = *mgid; + __ipoib_mcast_add(dev, mcast); + list_add_tail(&mcast->list, &priv->multicast_list); + } + + if (!mcast->ah) { + if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) + skb_queue_tail(&mcast->pkt_queue, skb); + else + dev_kfree_skb_any(skb); + + if (mcast->query) + ipoib_dbg_mcast(priv, "no address vector, " + "but multicast join already started\n"); + else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) + ipoib_mcast_sendonly_join(mcast); + + /* + * If lookup completes between here and out:, don't + * want to send packet twice. + */ + mcast = NULL; + } + +out: + if (mcast && mcast->ah) { + if (skb->dst && + skb->dst->neighbour && + !*to_ipoib_neigh(skb->dst->neighbour)) { + struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + + if (neigh) { + kref_get(&mcast->ah->ref); + neigh->ah = mcast->ah; + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + list_add_tail(&neigh->list, &mcast->neigh_list); + } + } + + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + } + + spin_unlock(&priv->lock); +} + +void ipoib_mcast_dev_flush(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + LIST_HEAD(remove_list); + struct ipoib_mcast *mcast, *tmcast, *nmcast; + unsigned long flags; + + ipoib_dbg_mcast(priv, "flushing multicast list\n"); + + spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->flags = + mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY); + + nmcast->mcmember.mgid = mcast->mcmember.mgid; + + /* Add the new group in before the to-be-destroyed group */ + list_add_tail(&nmcast->list, &mcast->list); + list_del_init(&mcast->list); + + rb_replace_node(&mcast->rb_node, &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&mcast->list, &remove_list); + } else { + ipoib_warn(priv, "could not reallocate multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + } + } + + if (priv->broadcast) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; + + rb_replace_node(&priv->broadcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&priv->broadcast->list, &remove_list); + } + + priv->broadcast = nmcast; + } + + 
spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(dev, mcast); + ipoib_mcast_free(mcast); + } +} + +void ipoib_mcast_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + /* Delete broadcast since it will be recreated */ + if (priv->broadcast) { + ipoib_dbg_mcast(priv, "deleting broadcast group\n"); + + spin_lock_irqsave(&priv->lock, flags); + rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_mcast_leave(dev, priv->broadcast); + ipoib_mcast_free(priv->broadcast); + priv->broadcast = NULL; + } +} + +void ipoib_mcast_restart_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dev_mc_list *mclist; + struct ipoib_mcast *mcast, *tmcast; + LIST_HEAD(remove_list); + unsigned long flags; + + ipoib_dbg_mcast(priv, "restarting multicast task\n"); + + ipoib_mcast_stop_thread(dev); + + spin_lock_irqsave(&priv->lock, flags); + + /* + * Unfortunately, the networking core only gives us a list of all of + * the multicast hardware addresses. We need to figure out which ones + * are new and which ones have been removed + */ + + /* Clear out the found flag */ + list_for_each_entry(mcast, &priv->multicast_list, list) + clear_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + + /* Mark all of the entries that are found or don't exist */ + for (mclist = dev->mc_list; mclist; mclist = mclist->next) { + union ib_gid mgid; + + memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid); + + /* Add in the P_Key */ + mgid.raw[4] = (priv->pkey >> 8) & 0xff; + mgid.raw[5] = priv->pkey & 0xff; + + mcast = __ipoib_mcast_find(dev, &mgid); + if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + struct ipoib_mcast *nmcast; + + /* Not found or send-only group, let's add a new entry */ + ipoib_dbg_mcast(priv, "adding multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + + nmcast = ipoib_mcast_alloc(dev, 0); + if (!nmcast) { + ipoib_warn(priv, "unable to allocate memory for multicast structure\n"); + continue; + } + + set_bit(IPOIB_MCAST_FLAG_FOUND, &nmcast->flags); + + nmcast->mcmember.mgid = mgid; + + if (mcast) { + /* Destroy the send only entry */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + + rb_replace_node(&mcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + } else + __ipoib_mcast_add(dev, nmcast); + + list_add_tail(&nmcast->list, &priv->multicast_list); + } + + if (mcast) + set_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + } + + /* Remove all of the entries don't exist anymore */ + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rb_erase(&mcast->rb_node, &priv->multicast_tree); + + /* Move to the remove list */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + } + } + spin_unlock_irqrestore(&priv->lock, flags); + + /* We have to cancel outside of the spinlock */ + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(mcast->dev, mcast); + ipoib_mcast_free(mcast); + } + + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_mcast_start_thread(dev); +} + +struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct 
net_device *dev) +{ + struct ipoib_mcast_iter *iter; + + iter = kmalloc(sizeof *iter, GFP_KERNEL); + if (!iter) + return NULL; + + iter->dev = dev; + memset(iter->mgid.raw, 0, sizeof iter->mgid); + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter) +{ + kfree(iter); +} + +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter) +{ + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); + struct rb_node *n; + struct ipoib_mcast *mcast; + int ret = 1; + + spin_lock_irq(&priv->lock); + + n = rb_first(&priv->multicast_tree); + + while (n) { + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + if (memcmp(iter->mgid.raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)) < 0) { + iter->mgid = mcast->mcmember.mgid; + iter->created = mcast->created; + iter->queuelen = skb_queue_len(&mcast->pkt_queue); + iter->complete = !!mcast->ah; + iter->send_only = !!(mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY)); + + ret = 0; + + break; + } + + n = rb_next(n); + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *mgid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only) +{ + *mgid = iter->mgid; + *created = iter->created; + *queuelen = iter->queuelen; + *complete = iter->complete; + *send_only = iter->send_only; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2004-12-19 22:04:18.168021829 -0800 @@ -0,0 +1,177 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_vlan.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include +#include +#include + +#include + +#include "ipoib.h" + +static ssize_t show_parent(struct class_device *class_dev, char *buf) +{ + struct net_device *dev = + container_of(class_dev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return sprintf(buf, "%s\n", priv->parent->name); +} +static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv; + char intf_name[IFNAMSIZ]; + int result; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + + /* + * First ensure this isn't a duplicate. We check the parent device and + * then all of the child interfaces to make sure the Pkey doesn't match. + */ + if (ppriv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + + list_for_each_entry(priv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + } + + snprintf(intf_name, sizeof intf_name, "%s.%04x", + ppriv->dev->name, pkey); + priv = ipoib_intf_alloc(intf_name); + if (!priv) { + result = -ENOMEM; + goto err; + } + + set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); + + priv->pkey = pkey; + + memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN); + priv->dev->broadcast[8] = pkey >> 8; + priv->dev->broadcast[9] = pkey & 0xff; + + result = ipoib_dev_init(priv->dev, ppriv->ca, ppriv->port); + if (result < 0) { + ipoib_warn(ppriv, "failed to initialize subinterface: " + "device %s, port %d", + ppriv->ca->name, ppriv->port); + goto device_init_failed; + } + + result = register_netdev(priv->dev); + if (result) { + ipoib_warn(priv, "failed to initialize; error %i", result); + goto register_failed; + } + + priv->parent = ppriv->dev; + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_parent)) + goto sysfs_failed; + + list_add_tail(&priv->list, &ppriv->child_intfs); + + up(&ppriv->vlan_mutex); + + return 0; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +err: + up(&ppriv->vlan_mutex); + return result; +} + +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv, *tpriv; + int ret = -ENOENT; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + + list_del(&priv->list); + + kfree(priv); + + ret = 0; + break; + } + } + up(&ppriv->vlan_mutex); + + return ret; +} From roland at topspin.com Sun Dec 19 22:15:14 2004 From: roland at topspin.com (Roland Dreier) Date: Sun, 19 Dec 2004 22:15:14 -0800 Subject: [openib-general] [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <200412192215.69tnzAhGIT1vQGLF@topspin.com> Message-ID: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> Add a driver that implements the (IPoIB) IP-over-InfiniBand protocol. This is a network device driver of type ARPHRD_INFINIBAND (and addr_len INFINIBAND_ALEN bytes). 
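(As a rough sketch of what those 20 bytes carry, following the encapsulation draft referenced below:

    octet 0        reserved
    octets 1-3     destination QPN
    octets 4-19    port GID (16 bytes)

This is why the multicast mapping helpers in the earlier patches fill in a QPN of all ones and copy the GID starting at offset 4, and why the driver itself works with dev_addr + 4 whenever it needs the port GID.)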
The ARP/ND implementation for this driver is not completely straightforward, because InfiniBand requires an additional path lookup be performed (through an IB-specific mechanism) after a remote hardware address has been resolved. We are very open to suggestions of a better way to handle this than the current implementation. Although IB has a special multicast group join mode intended to support IP multicast routing (non member join), no means to identify different multicast styles has yet been determined, so all joins by the driver are currently full member joins. We are looking for guidance in how to solve this. The IPoIB protocol/encapsulation is described in the Internet-Drafts http://www.ietf.org/internet-drafts/draft-ietf-ipoib-architecture-04.txt http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-19 22:04:14.496562875 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-19 22:04:17.510118781 -0800 @@ -9,4 +9,6 @@ source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/ulp/ipoib/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-19 22:04:14.472566412 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-19 22:04:17.485122465 -0800 @@ -1,2 +1,3 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Kconfig 2004-12-19 22:04:17.559111562 -0800 @@ -0,0 +1,33 @@ +config INFINIBAND_IPOIB + tristate "IP-over-InfiniBand" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the IP-over-InfiniBand protocol (IPoIB). This + transports IP packets over InfiniBand so you can use your IB + device as a fancy NIC. + + The IPoIB protocol is defined by the IETF ipoib working + group: . + +config INFINIBAND_IPOIB_DEBUG + bool "IP-over-InfiniBand debugging" + depends on INFINIBAND_IPOIB + ---help--- + This option causes debugging code to be compiled into the + IPoIB driver. The output can be turned on via the + debug_level and mcast_debug_level module parameters (which + can also be set after the driver is loaded through sysfs). + + This option also creates an "ipoib_debugfs," which can be + mounted to expose debugging information about IB multicast + groups used by the IPoIB driver. + +config INFINIBAND_IPOIB_DEBUG_DATA + bool "IP-over-InfiniBand data path debugging" + depends on INFINIBAND_IPOIB_DEBUG + ---help--- + This option compiles debugging code into the the data path + of the IPoIB driver. The output can be turned on via the + data_debug_level module parameter; however, even with output + turned off, this debugging code will have some performance + impact. --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Makefile 2004-12-19 22:04:17.534115245 -0800 @@ -0,0 +1,11 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o + +ib_ipoib-y := ipoib_main.o \ + ipoib_ib.o \ + ipoib_multicast.o \ + ipoib_verbs.o \ + ipoib_vlan.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib.h 2004-12-19 22:04:17.584107878 -0800 @@ -0,0 +1,350 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib.h 1358 2004-12-17 22:00:11Z roland $ + */ + +#ifndef _IPOIB_H +#define _IPOIB_H + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +#include +#include +#include + +/* constants */ + +enum { + IPOIB_PACKET_SIZE = 2048, + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, + + IPOIB_ENCAP_LEN = 4, + + IPOIB_RX_RING_SIZE = 128, + IPOIB_TX_RING_SIZE = 64, + + IPOIB_NUM_WC = 4, + + IPOIB_MAX_PATH_REC_QUEUE = 3, + IPOIB_MAX_MCAST_QUEUE = 3, + + IPOIB_FLAG_OPER_UP = 0, + IPOIB_FLAG_ADMIN_UP = 1, + IPOIB_PKEY_ASSIGNED = 2, + IPOIB_PKEY_STOP = 3, + IPOIB_FLAG_SUBINTERFACE = 4, + IPOIB_MCAST_RUN = 5, + IPOIB_STOP_REAPER = 6, + + IPOIB_MAX_BACKOFF_SECONDS = 16, + + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ + IPOIB_MCAST_FLAG_SENDONLY = 1, + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + IPOIB_MCAST_FLAG_ATTACHED = 3, +}; + +/* structs */ + +struct ipoib_header { + u16 proto; + u16 reserved; +}; + +struct ipoib_pseudoheader { + u8 hwaddr[INFINIBAND_ALEN]; +}; + +struct ipoib_mcast; + +struct ipoib_buf { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +/* + * Device private locking: tx_lock protects members used in TX fast + * path (and we use LLTX so upper layers don't do extra locking). + * lock protects everything else. lock nests inside of tx_lock (ie + * tx_lock must be acquired first if needed). 
+ */ +struct ipoib_dev_priv { + spinlock_t lock; + + struct net_device *dev; + + unsigned long flags; + + struct semaphore mcast_mutex; + struct semaphore vlan_mutex; + + struct rb_root path_tree; + struct list_head path_list; + + struct ipoib_mcast *broadcast; + struct list_head multicast_list; + struct rb_root multicast_tree; + + struct work_struct pkey_task; + struct work_struct mcast_task; + struct work_struct flush_task; + struct work_struct restart_task; + struct work_struct ah_reap_task; + + struct ib_device *ca; + u8 port; + u16 pkey; + struct ib_pd *pd; + struct ib_mr *mr; + struct ib_cq *cq; + struct ib_qp *qp; + u32 qkey; + + union ib_gid local_gid; + u16 local_lid; + + unsigned int admin_mtu; + unsigned int mcast_mtu; + + struct ipoib_buf *rx_ring; + + spinlock_t tx_lock; + struct ipoib_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + + struct ib_wc ibwc[IPOIB_NUM_WC]; + + struct list_head dead_ahs; + + struct ib_event_handler event_handler; + + struct net_device_stats stats; + + struct net_device *parent; + struct list_head child_intfs; + struct list_head list; + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + struct list_head fs_list; + struct dentry *mcg_dentry; +#endif +}; + +struct ipoib_ah { + struct net_device *dev; + struct ib_ah *ah; + struct list_head list; + struct kref ref; + unsigned last_send; +}; + +struct ipoib_path { + struct net_device *dev; + struct ib_sa_path_rec pathrec; + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct list_head neigh_list; + + int query_id; + struct ib_sa_query *query; + struct completion done; + + struct rb_node rb_node; + struct list_head list; +}; + +struct ipoib_neigh { + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct neighbour *neighbour; + + struct list_head list; +}; + +static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) +{ + return (struct ipoib_neigh **) (neigh->ha + 24 - + (offsetof(struct neighbour, ha) & 4)); +} + +extern struct workqueue_struct *ipoib_workqueue; + +/* functions */ + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr); +void ipoib_free_ah(struct kref *kref); +static inline void ipoib_put_ah(struct ipoib_ah *ah) +{ + kref_put(&ah->ref, ipoib_free_ah); +} + +int ipoib_add_pkey_attr(struct net_device *dev); + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn); +void ipoib_reap_ah(void *dev_ptr); + +void ipoib_flush_paths(struct net_device *dev); +struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_ib_dev_flush(void *dev); +void ipoib_ib_dev_cleanup(struct net_device *dev); + +int ipoib_ib_dev_open(struct net_device *dev); +int ipoib_ib_dev_up(struct net_device *dev); +int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev); + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_dev_cleanup(struct net_device *dev); + +void ipoib_mcast_join_task(void *dev_ptr); +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb); + +void ipoib_mcast_restart_task(void *dev_ptr); +int ipoib_mcast_start_thread(struct net_device *dev); +int ipoib_mcast_stop_thread(struct net_device *dev); + +void ipoib_mcast_dev_down(struct net_device *dev); +void ipoib_mcast_dev_flush(struct net_device *dev); + +struct 
ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *gid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only); + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); + +int ipoib_qp_create(struct net_device *dev); +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); +void ipoib_transport_dev_cleanup(struct net_device *dev); + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); + +void ipoib_pkey_poll(void *dev); +int ipoib_pkey_dev_delay_open(struct net_device *dev); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int ipoib_create_debug_file(struct net_device *dev); +void ipoib_delete_debug_file(struct net_device *dev); +int ipoib_register_debugfs(void); +void ipoib_unregister_debugfs(void); +#else +static inline int ipoib_create_debug_file(struct net_device *dev) { return 0; } +static inline void ipoib_delete_debug_file(struct net_device *dev) { } +static inline int ipoib_register_debugfs(void) { return 0; } +static inline void ipoib_unregister_debugfs(void) { } +#endif + + +#define ipoib_printk(level, priv, format, arg...) \ + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) +#define ipoib_warn(priv, format, arg...) \ + ipoib_printk(KERN_WARNING, priv, format , ## arg) + + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +extern int debug_level; + +#define ipoib_dbg(priv, format, arg...) \ + do { \ + if (debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { \ + if (mcast_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG */ +#define ipoib_dbg(priv, format, arg...) \ + do { (void) (priv); } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +#define ipoib_dbg_data(priv, format, arg...) \ + do { \ + if (data_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ +#define ipoib_dbg_data(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ + + +#define IPOIB_GID_FMT "%x:%x:%x:%x:%x:%x:%x:%x" + +#define IPOIB_GID_ARG(gid) be16_to_cpup((__be16 *) ((gid).raw + 0)), \ + be16_to_cpup((__be16 *) ((gid).raw + 2)), \ + be16_to_cpup((__be16 *) ((gid).raw + 4)), \ + be16_to_cpup((__be16 *) ((gid).raw + 6)), \ + be16_to_cpup((__be16 *) ((gid).raw + 8)), \ + be16_to_cpup((__be16 *) ((gid).raw + 10)), \ + be16_to_cpup((__be16 *) ((gid).raw + 12)), \ + be16_to_cpup((__be16 *) ((gid).raw + 14)) + +#endif /* _IPOIB_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2004-12-19 22:04:17.608104342 -0800 @@ -0,0 +1,287 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#include +#include + +#include "ipoib.h" + +enum { + IPOIB_MAGIC = 0x49504942 /* "IPIB" */ +}; + +static DECLARE_MUTEX(ipoib_fs_mutex); +static struct dentry *ipoib_root; +static struct super_block *ipoib_sb; +static LIST_HEAD(ipoib_device_list); + +static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) +{ + struct ipoib_mcast_iter *iter; + loff_t n = *pos; + + iter = ipoib_mcast_iter_init(file->private); + if (!iter) + return NULL; + + while (n--) { + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + } + + return iter; +} + +static void *ipoib_mcg_seq_next(struct seq_file *file, void *iter_ptr, + loff_t *pos) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + + (*pos)++; + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +static void ipoib_mcg_seq_stop(struct seq_file *file, void *iter_ptr) +{ + /* nothing for now */ +} + +static int ipoib_mcg_seq_show(struct seq_file *file, void *iter_ptr) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + char gid_buf[sizeof "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"]; + union ib_gid mgid; + int i, n; + unsigned long created; + unsigned int queuelen, complete, send_only; + + if (iter) { + ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, + &complete, &send_only); + + for (n = 0, i = 0; i < sizeof mgid / 2; ++i) { + n += sprintf(gid_buf + n, "%x", + be16_to_cpu(((u16 *)mgid.raw)[i])); + if (i < sizeof mgid / 2 - 1) + gid_buf[n++] = ':'; + } + } + + seq_printf(file, "GID: %*s", -(1 + (int) sizeof gid_buf), gid_buf); + + seq_printf(file, + " created: %10ld queuelen: %4d complete: %d send_only: %d\n", + created, queuelen, complete, send_only); + + return 0; +} + +static struct seq_operations ipoib_seq_ops = { + .start = ipoib_mcg_seq_start, + .next = ipoib_mcg_seq_next, + .stop = ipoib_mcg_seq_stop, + .show = ipoib_mcg_seq_show, +}; + +static int ipoib_mcg_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &ipoib_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->u.generic_ip; + + return 0; +} + +static struct file_operations 
ipoib_fops = { + .owner = THIS_MODULE, + .open = ipoib_mcg_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +static struct inode *ipoib_get_inode(void) +{ + struct inode *inode = new_inode(ipoib_sb); + + if (inode) { + inode->i_mode = S_IFREG | S_IRUGO; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_fop = &ipoib_fops; + } + + return inode; +} + +static int __ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dentry *dentry; + struct inode *inode; + char name[IFNAMSIZ + sizeof "_mcg"]; + + snprintf(name, sizeof name, "%s_mcg", dev->name); + + dentry = d_alloc_name(ipoib_root, name); + if (!dentry) + return -ENOMEM; + + inode = ipoib_get_inode(); + if (!inode) { + dput(dentry); + return -ENOMEM; + } + + inode->u.generic_ip = dev; + priv->mcg_dentry = dentry; + + d_add(dentry, inode); + + return 0; +} + +int ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + + list_add_tail(&priv->fs_list, &ipoib_device_list); + + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return 0; + } + + up(&ipoib_fs_mutex); + + return __ipoib_create_debug_file(dev); +} + +void ipoib_delete_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + list_del(&priv->fs_list); + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return; + } + up(&ipoib_fs_mutex); + + if (priv->mcg_dentry) { + d_drop(priv->mcg_dentry); + simple_unlink(ipoib_root->d_inode, priv->mcg_dentry); + } +} + +static int ipoib_fill_super(struct super_block *sb, void *data, int silent) +{ + static struct tree_descr ipoib_files[] = { + { "" } + }; + struct ipoib_dev_priv *priv; + int ret; + + ret = simple_fill_super(sb, IPOIB_MAGIC, ipoib_files); + if (ret) + return ret; + + ipoib_root = sb->s_root; + + down(&ipoib_fs_mutex); + + ipoib_sb = sb; + + list_for_each_entry(priv, &ipoib_device_list, fs_list) { + ret = __ipoib_create_debug_file(priv->dev); + if (ret) + break; + } + + up(&ipoib_fs_mutex); + + return ret; +} + +static struct super_block *ipoib_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_single(fs_type, flags, data, ipoib_fill_super); +} + +static void ipoib_kill_sb(struct super_block *sb) +{ + down(&ipoib_fs_mutex); + ipoib_sb = NULL; + up(&ipoib_fs_mutex); + + kill_litter_super(sb); +} + +static struct file_system_type ipoib_fs_type = { + .owner = THIS_MODULE, + .name = "ipoib_debugfs", + .get_sb = ipoib_get_sb, + .kill_sb = ipoib_kill_sb, +}; + +int ipoib_register_debugfs(void) +{ + return register_filesystem(&ipoib_fs_type); +} + +void ipoib_unregister_debugfs(void) +{ + unregister_filesystem(&ipoib_fs_type); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2004-12-19 22:04:17.633100658 -0800 @@ -0,0 +1,632 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib_ib.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +int data_debug_level; + +module_param(data_debug_level, int, 0644); +MODULE_PARM_DESC(data_debug_level, + "Enable data path debug tracing if > 0"); +#endif + +#define IPOIB_OP_RECV (1ul << 31) + +static DECLARE_MUTEX(pkey_sem); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr) +{ + struct ipoib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return NULL; + + ah->dev = dev; + ah->last_send = 0; + kref_init(&ah->ref); + + ah->ah = ib_create_ah(pd, attr); + if (IS_ERR(ah->ah)) { + kfree(ah); + ah = NULL; + } else + ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah); + + return ah; +} + +void ipoib_free_ah(struct kref *kref) +{ + struct ipoib_ah *ah = container_of(kref, struct ipoib_ah, ref); + struct ipoib_dev_priv *priv = netdev_priv(ah->dev); + + unsigned long flags; + + if (ah->last_send <= priv->tx_tail) { + ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } else { + spin_lock_irqsave(&priv->lock, flags); + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, + unsigned int wr_id, + dma_addr_t addr) +{ + struct ib_sge list = { + .addr = addr, + .length = IPOIB_BUF_SIZE, + .lkey = priv->mr->lkey, + }; + struct ib_recv_wr param = { + .wr_id = wr_id | IPOIB_OP_RECV, + .sg_list = &list, + .num_sge = 1, + .recv_flags = IB_RECV_SIGNALED + }; + struct ib_recv_wr *bad_wr; + + return ib_post_recv(priv->qp, ¶m, &bad_wr); +} + +static int ipoib_ib_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + dma_addr_t addr; + int ret; + + skb = dev_alloc_skb(IPOIB_BUF_SIZE + 4); + if (!skb) { + ipoib_warn(priv, "failed to allocate receive buffer\n"); + + priv->rx_ring[id].skb = NULL; + return -ENOMEM; + } + skb_reserve(skb, 4); /* 16 byte align IP header */ + priv->rx_ring[id].skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); + + 
ret = ipoib_ib_receive(priv, id, addr); + if (ret) { + ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", + id, ret); + priv->rx_ring[id].skb = NULL; + } + + return ret; +} + +static int ipoib_ib_post_receives(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + if (ipoib_ib_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + return -EIO; + } + } + + return 0; +} + +static void ipoib_ib_handle_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (wr_id & IPOIB_OP_RECV) { + wr_id &= ~IPOIB_OP_RECV; + + if (wr_id < IPOIB_RX_RING_SIZE) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + + priv->rx_ring[wr_id].skb = NULL; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->rx_ring[wr_id], + mapping), + IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + + if (wc->status != IB_WC_SUCCESS) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put(skb, wc->byte_len); + skb_pull(skb, IB_GRH_BYTES); + + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); + } + + /* repost receive */ + if (ipoib_ib_post_receive(dev, wr_id)) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + wr_id); + + } else { + struct ipoib_buf *tx_req; + unsigned long flags; + + if (wr_id >= IPOIB_TX_RING_SIZE) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, IPOIB_TX_RING_SIZE); + return; + } + + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + + tx_req = &priv->tx_ring[wr_id]; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->tx_tail; + if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + } +} + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < n; ++i) + ipoib_ib_handle_wc(dev, priv->ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +static inline int post_send(struct ipoib_dev_priv *priv, + unsigned int 
wr_id, + struct ib_ah *address, u32 qpn, + dma_addr_t addr, int len) +{ + struct ib_sge list = { + .addr = addr, + .length = len, + .lkey = priv->mr->lkey, + }; + struct ib_send_wr param = { + .wr_id = wr_id, + .opcode = IB_WR_SEND, + .sg_list = &list, + .num_sge = 1, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = priv->qkey, + .ah = address + }, + }, + .send_flags = IB_SEND_SIGNALED, + }; + struct ib_send_wr *bad_wr; + + return ib_post_send(priv->qp, ¶m, &bad_wr); +} + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_buf *tx_req; + dma_addr_t addr; + + if (skb->len > dev->mtu + INFINIBAND_ALEN) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, dev->mtu + INFINIBAND_ALEN); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", + skb->len, address, qpn); + + /* + * We put the skb into the tx_ring _before_ we call post_send() + * because it's entirely possible that the completion handler will + * run before we execute anything after the post_send(). That + * means we have to make sure everything is properly recorded and + * our state is consistent before we call post_send(). + */ + tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req->skb = skb; + addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, + DMA_TO_DEVICE); + pci_unmap_addr_set(tx_req, mapping, addr); + + if (unlikely(post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + address->ah, qpn, addr, skb->len))) { + ipoib_warn(priv, "post_send failed\n"); + ++priv->stats.tx_errors; + dma_unmap_single(priv->ca->dma_device, addr, skb->len, + DMA_TO_DEVICE); + dev_kfree_skb_any(skb); + } else { + dev->trans_start = jiffies; + + address->last_send = priv->tx_head; + ++priv->tx_head; + + if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + } + } +} + +void __ipoib_reap_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah, *tah; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) + if (ah->last_send <= priv->tx_tail) { + list_del(&ah->list); + list_add_tail(&ah->list, &remove_list); + } + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(ah, tah, &remove_list, list) { + ipoib_dbg(priv, "Reaping ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } +} + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + +int ipoib_ib_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + ret = ipoib_qp_create(dev); + if (ret) { + ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret); + return -1; + } + + ret = ipoib_ib_post_receives(dev); + if (ret) { + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + return -1; + } + + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + + return 0; +} + +int ipoib_ib_dev_up(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + 
+ set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + + return ipoib_mcast_start_thread(dev); +} + +int ipoib_ib_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "downing ib_dev\n"); + + clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + netif_carrier_off(dev); + + /* Shutdown the P_Key thread if still active */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + set_bit(IPOIB_PKEY_STOP, &priv->flags); + cancel_delayed_work(&priv->pkey_task); + up(&pkey_sem); + flush_workqueue(ipoib_workqueue); + } + + ipoib_mcast_stop_thread(dev); + + /* + * Flush the multicast groups first so we stop any multicast joins. The + * completion thread may have already died and we may deadlock waiting + * for the completion thread to finish some multicast joins. + */ + ipoib_mcast_dev_flush(dev); + + /* Delete broadcast and local addresses since they will be recreated */ + ipoib_mcast_dev_down(dev); + + ipoib_flush_paths(dev); + + return 0; +} + +static int recvs_pending(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + return 1; + + return 0; +} + +int ipoib_ib_dev_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int attr_mask; + int i; + + /* Kill the existing QP and allocate a new one */ + qp_attr.qp_state = IB_QPS_ERR; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to ERROR state\n"); + + /* Wait for all sends and receives to complete */ + while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) + yield(); + + ipoib_dbg(priv, "All sends and receives done.\n"); + + qp_attr.qp_state = IB_QPS_RESET; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to RESET state\n"); + + /* Wait for all AHs to be reaped */ + set_bit(IPOIB_STOP_REAPER, &priv->flags); + cancel_delayed_work(&priv->ah_reap_task); + flush_workqueue(ipoib_workqueue); + while (!list_empty(&priv->dead_ahs)) { + __ipoib_reap_ah(dev); + yield(); + } + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + ipoib_warn(priv, "Recv skb still around @ %d\n", i); + + return 0; +} + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + priv->ca = ca; + priv->port = port; + priv->qp = NULL; + + if (ipoib_transport_dev_init(dev, ca)) { + printk(KERN_WARNING "%s: ipoib_transport_dev_init failed\n", ca->name); + return -ENODEV; + } + + if (dev->flags & IFF_UP) { + if (ipoib_ib_dev_open(dev)) { + ipoib_transport_dev_cleanup(dev); + return -ENODEV; + } + } + + return 0; +} + +void ipoib_ib_dev_flush(void *_dev) +{ + struct net_device *dev = (struct net_device *)_dev; + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv; + + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + + ipoib_dbg(priv, "flushing\n"); + + ipoib_ib_dev_down(dev); + + /* + * The device could have been brought down between the start and when + * we get here, don't bring it back up if it's not configured up + */ + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_ib_dev_up(dev); + + /* Flush any child interfaces too */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + ipoib_ib_dev_flush(&cpriv->dev); +} + +void ipoib_ib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = 
netdev_priv(dev); + + ipoib_dbg(priv, "cleaning up ib_dev\n"); + + ipoib_mcast_stop_thread(dev); + + /* Delete the broadcast address and the local address */ + ipoib_mcast_dev_down(dev); + + ipoib_transport_dev_cleanup(dev); +} + +/* + * Delayed P_Key Assigment Interim Support + * + * The following is initial implementation of delayed P_Key assigment + * mechanism. It is using the same approach implemented for the multicast + * group join. The single goal of this implementation is to quickly address + * Bug #2507. This implementation will probably be removed when the P_Key + * change async notification is available. + */ +int ipoib_open(struct net_device *dev); + +static void ipoib_pkey_dev_check_presence(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + u16 pkey_index = 0; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + else + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); +} + +void ipoib_pkey_poll(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_pkey_dev_check_presence(dev); + + if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + ipoib_open(dev); + else { + down(&pkey_sem); + if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + } +} + +int ipoib_pkey_dev_delay_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Look for the interface pkey value in the IB Port P_Key table and */ + /* set the interface pkey assigment flag */ + ipoib_pkey_dev_check_presence(dev); + + /* P_Key value not assigned yet - start polling */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + clear_bit(IPOIB_PKEY_STOP, &priv->flags); + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + return 1; + } + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_main.c 2004-12-19 22:04:17.658096974 -0800 @@ -0,0 +1,1084 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_main.c 1362 2004-12-18 15:56:29Z roland $ + */ + +#include "ipoib.h" + +#include +#include + +#include +#include +#include + +#include /* For ARPHRD_xxx */ + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); +MODULE_LICENSE("Dual BSD/GPL"); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int debug_level; + +module_param(debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); +#endif + +static const u8 ipv4_bcast_addr[] = { + 0x00, 0xff, 0xff, 0xff, + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff +}; + +struct workqueue_struct *ipoib_workqueue; + +static void ipoib_add_one(struct ib_device *device); +static void ipoib_remove_one(struct ib_device *device); + +static struct ib_client ipoib_client = { + .name = "ipoib", + .add = ipoib_add_one, + .remove = ipoib_remove_one +}; + +int ipoib_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "bringing up interface\n"); + + set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + if (ipoib_pkey_dev_delay_open(dev)) + return 0; + + if (ipoib_ib_dev_open(dev)) + return -EINVAL; + + if (ipoib_ib_dev_up(dev)) + return -EINVAL; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring up any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (flags & IFF_UP) + continue; + + dev_change_flags(cpriv->dev, flags | IFF_UP); + } + up(&priv->vlan_mutex); + } + + netif_start_queue(dev); + + return 0; +} + +static int ipoib_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "stopping interface\n"); + + clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + netif_stop_queue(dev); + + ipoib_ib_dev_down(dev); + ipoib_ib_dev_stop(dev); + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring down any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (!(flags & IFF_UP)) + continue; + + dev_change_flags(cpriv->dev, flags & ~IFF_UP); + } + up(&priv->vlan_mutex); + } + + return 0; +} + +static int ipoib_change_mtu(struct net_device *dev, int new_mtu) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + return -EINVAL; + + priv->admin_mtu = new_mtu; + + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + return 0; +} + +static struct ipoib_path *__path_find(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->path_tree.rb_node; + struct ipoib_path *path; + int ret; + + while (n) { + path = rb_entry(n, struct ipoib_path, rb_node); + + ret = memcmp(gid->raw, path->pathrec.dgid.raw, + sizeof (union ib_gid)); + + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return path; + } + + return NULL; +} + +static int __path_add(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->path_tree.rb_node; + struct rb_node *pn = NULL; + struct ipoib_path *tpath; + int ret; + + while (*n) { + pn = *n; + tpath = rb_entry(pn, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, 
tpath->pathrec.dgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&path->rb_node, pn, n); + rb_insert_color(&path->rb_node, &priv->path_tree); + + list_add_tail(&path->list, &priv->path_list); + + return 0; +} + +static void __path_free(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tn; + struct sk_buff *skb; + + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_irq(skb); + + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + if (path->ah) + ipoib_put_ah(path->ah); + + rb_erase(&path->rb_node, &priv->path_tree); + list_del(&path->list); + kfree(path); +} + +void ipoib_flush_paths(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path, *tp; + LIST_HEAD(remove_list); + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + list_splice(&priv->path_list, &remove_list); + INIT_LIST_HEAD(&priv->path_list); + spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry_safe(path, tp, &remove_list, list) { + if (path->query) + ib_sa_cancel_query(path->query_id, path->query); + wait_for_completion(&path->done); + __path_free(dev, path); + } +} + +static void path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *path_ptr) +{ + struct ipoib_path *path = path_ptr; + struct net_device *dev = path->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah = NULL; + struct ipoib_neigh *neigh; + struct sk_buff_head skqueue; + struct sk_buff *skb; + unsigned long flags; + + if (pathrec) + ipoib_dbg(priv, "PathRec LID 0x%04x for GID " IPOIB_GID_FMT "\n", + be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + else + ipoib_dbg(priv, "PathRec status %d for GID " IPOIB_GID_FMT "\n", + status, IPOIB_GID_ARG(path->pathrec.dgid)); + + skb_queue_head_init(&skqueue); + + if (!status) { + /* + * For now we set static_rate to 0. This is not + * really correct: we should look at the rate + * component of the path member record, compare it + * with the rate of our local port (calculated from + * the active link speed and link width) and set an + * inter-packet delay appropriately. 
+ */ + struct ib_ah_attr av = { + .dlid = be16_to_cpu(pathrec->dlid), + .sl = pathrec->sl, + .static_rate = 0, + .port_num = priv->port + }; + + ah = ipoib_create_ah(dev, priv->pd, &av); + } + + spin_lock_irqsave(&priv->lock, flags); + + path->ah = ah; + + if (ah) { + path->pathrec = *pathrec; + + ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", + ah, be16_to_cpu(pathrec->dlid), pathrec->sl); + + while ((skb = __skb_dequeue(&path->queue))) + __skb_queue_tail(&skqueue, skb); + + list_for_each_entry(neigh, &path->neigh_list, list) { + kref_get(&path->ah->ref); + neigh->ah = path->ah; + + while ((skb = __skb_dequeue(&neigh->queue))) + __skb_queue_tail(&skqueue, skb); + } + } else + path->query = NULL; + + complete(&path->done); + + spin_unlock_irqrestore(&priv->lock, flags); + + while ((skb = __skb_dequeue(&skqueue))) { + skb->dev = dev; + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue packet\n"); + } +} + +static struct ipoib_path *path_rec_create(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + path = kmalloc(sizeof *path, GFP_ATOMIC); + if (!path) + return NULL; + + path->dev = dev; + path->pathrec.dlid = 0; + + skb_queue_head_init(&path->queue); + + INIT_LIST_HEAD(&path->neigh_list); + path->query = NULL; + init_completion(&path->done); + + memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; + + __path_add(dev, path); + + return path; +} + +static int path_rec_start(struct net_device *dev, + struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "Start path record lookup for " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(path->pathrec.dgid)); + + path->query_id = + ib_sa_path_rec_get(priv->ca, priv->port, + &path->pathrec, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + 1000, GFP_ATOMIC, + path_rec_completion, + path, &path->query); + if (path->query_id < 0) { + ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); + path->query = NULL; + return path->query_id; + } + + return 0; +} + +static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + return; + } + + skb_queue_head_init(&neigh->queue); + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. 
+ */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) + goto err; + } + + list_add_tail(&neigh->list, &path->neigh_list); + + if (path->pathrec.dlid) { + kref_get(&path->ah->ref); + neigh->ah = path->ah; + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + } else { + neigh->ah = NULL; + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + __skb_queue_tail(&neigh->queue, skb); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + if (!path->query && path_rec_start(dev, path)) + goto err; + } + + spin_unlock(&priv->lock); + return; + +err: + *to_ipoib_neigh(skb->dst->neighbour) = NULL; + list_del(&neigh->list); + kfree(neigh); + neigh->neighbour->ops->destructor = NULL; + + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + + spin_unlock(&priv->lock); +} + +static void path_lookup(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + + /* Look up path record for unicasts */ + if (skb->dst->neighbour->ha[4] != 0xff) { + neigh_add_path(skb, dev); + return; + } + + /* Add in the P_Key for multicasts */ + skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; + skb->dst->neighbour->ha[9] = priv->pkey & 0xff; + ipoib_mcast_send(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4), skb); +} + +static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, + struct ipoib_pseudoheader *phdr) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (phdr->hwaddr + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (phdr->hwaddr + 4)); + if (path) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof *phdr); + __skb_queue_tail(&path->queue, skb); + + if (path_rec_start(dev, path)) + __path_free(dev, path); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); + return; + } + + if (path->pathrec.dlid) { + ipoib_dbg(priv, "Send unicast ARP to %04x\n", + be16_to_cpu(path->pathrec.dlid)); + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) phdr->hwaddr)); + } else if ((path->query || !path_rec_start(dev, path)) && + skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof *phdr); + __skb_queue_tail(&path->queue, skb); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); +} + +static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh; + unsigned long flags; + + local_irq_save(flags); + if (!spin_trylock(&priv->tx_lock)) { + local_irq_restore(flags); + return NETDEV_TX_LOCKED; + } + + /* + * Check if our queue is full. Since we have the LLTX feature + * bit set, we can't rely on netif_stop_queue() preventing our + * xmit function from being called with a full queue. + * + * This is a temporary workaround until LLTX is fixed so that + * hard_start_xmit does not get called after netif_stop_queue(). 
+ */ + if (unlikely(priv->tx_head - priv->tx_tail >= IPOIB_TX_RING_SIZE)) { + ipoib_dbg(priv, "TX ring full in xmit, stopping kernel net queue\n"); + netif_stop_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + return NETDEV_TX_BUSY; + } + + if (skb->dst && skb->dst->neighbour) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { + path_lookup(skb, dev); + goto out; + } + + neigh = *to_ipoib_neigh(skb->dst->neighbour); + + if (likely(neigh->ah)) { + ipoib_send(dev, skb, neigh->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + goto out; + } + + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + spin_lock(&priv->lock); + __skb_queue_tail(&neigh->queue, skb); + spin_unlock(&priv->lock); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key for multicast*/ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); + } else { + /* unicast GID -- should be ARP reply */ + + if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? "neigh" : "dst", + be16_to_cpup((u16 *) skb->data), + be32_to_cpup((u32 *) phdr->hwaddr), + IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + goto out; + } + + unicast_arp_send(skb, dev, phdr); + } + } + +out: + spin_unlock_irqrestore(&priv->tx_lock, flags); + + return NETDEV_TX_OK; +} + +struct net_device_stats *ipoib_get_stats(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return &priv->stats; +} + +static void ipoib_timeout(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_warn(priv, "transmit timeout: latency %ld\n", + jiffies - dev->trans_start); + /* XXX reset QP, etc. */ +} + +static int ipoib_hard_header(struct sk_buff *skb, + struct net_device *dev, + unsigned short type, + void *daddr, void *saddr, unsigned len) +{ + struct ipoib_header *header; + + header = (struct ipoib_header *) skb_push(skb, sizeof *header); + + header->proto = htons(type); + header->reserved = 0; + + /* + * If we don't have a neighbour structure, stuff the + * destination address onto the front of the skb so we can + * figure out where to send the packet later. 
+ */ + if (!skb->dst || !skb->dst->neighbour) { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb_push(skb, sizeof *phdr); + memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN); + } + + return 0; +} + +static void ipoib_set_mcast_list(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + schedule_work(&priv->restart_task); +} + +static void ipoib_neigh_destructor(struct neighbour *n) +{ + struct ipoib_neigh *neigh = *to_ipoib_neigh(n); + struct ipoib_dev_priv *priv = netdev_priv(n->dev); + unsigned long flags; + + ipoib_dbg(priv, + "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + be32_to_cpup((__be32 *) n->ha), + IPOIB_GID_ARG(*((union ib_gid *) (n->ha + 4)))); + + spin_lock_irqsave(&priv->lock, flags); + + if (neigh) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + *to_ipoib_neigh(n) = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); +} + +static int ipoib_neigh_setup(struct neighbour *neigh) +{ + /* + * Is this kosher? I can't find anybody in the kernel that + * sets neigh->destructor, so we should be able to set it here + * without trouble. + */ + neigh->ops->destructor = ipoib_neigh_destructor; + + return 0; +} + +static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +{ + parms->neigh_setup = ipoib_neigh_setup; + + return 0; +} + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Allocate RX/TX "rings" to hold queued skbs */ + + priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->rx_ring) { + printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", + ca->name, IPOIB_RX_RING_SIZE); + goto out; + } + memset(priv->rx_ring, 0, + IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf)); + + priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->tx_ring) { + printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", + ca->name, IPOIB_TX_RING_SIZE); + goto out_rx_ring_cleanup; + } + memset(priv->tx_ring, 0, + IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf)); + + /* priv->tx_head & tx_tail are already 0 */ + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_ring_cleanup; + + return 0; + +out_tx_ring_cleanup: + kfree(priv->tx_ring); + +out_rx_ring_cleanup: + kfree(priv->rx_ring); + +out: + return -ENOMEM; +} + +void ipoib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; + + ipoib_delete_debug_file(dev); + + /* Delete any child interfaces first */ + list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { + unregister_netdev(cpriv->dev); + ipoib_dev_cleanup(cpriv->dev); + free_netdev(cpriv->dev); + } + + ipoib_ib_dev_cleanup(dev); + + if (priv->rx_ring) { + kfree(priv->rx_ring); + priv->rx_ring = NULL; + } + + if (priv->tx_ring) { + kfree(priv->tx_ring); + priv->tx_ring = NULL; + } +} + +static void ipoib_setup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + dev->open = ipoib_open; + dev->stop = ipoib_stop; + dev->change_mtu = ipoib_change_mtu; + dev->hard_start_xmit = ipoib_start_xmit; + dev->get_stats = ipoib_get_stats; + dev->tx_timeout = ipoib_timeout; + dev->hard_header = ipoib_hard_header; + dev->set_multicast_list = ipoib_set_mcast_list; + dev->neigh_setup = ipoib_neigh_setup_dev; + + dev->watchdog_timeo = HZ; + + dev->rebuild_header = NULL; + dev->set_mac_address = NULL; + 
dev->header_cache_update = NULL; + + dev->flags |= IFF_BROADCAST | IFF_MULTICAST; + + /* + * We add in INFINIBAND_ALEN to allow for the destination + * address "pseudoheader" for skbs without neighbour struct. + */ + dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; + dev->addr_len = INFINIBAND_ALEN; + dev->type = ARPHRD_INFINIBAND; + dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + + /* MTU will be reset when mcast join happens */ + dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; + priv->mcast_mtu = priv->admin_mtu = dev->mtu; + + memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN); + + netif_carrier_off(dev); + + SET_MODULE_OWNER(dev); + + priv->dev = dev; + + spin_lock_init(&priv->lock); + spin_lock_init(&priv->tx_lock); + + init_MUTEX(&priv->mcast_mutex); + init_MUTEX(&priv->vlan_mutex); + + INIT_LIST_HEAD(&priv->path_list); + INIT_LIST_HEAD(&priv->child_intfs); + INIT_LIST_HEAD(&priv->dead_ahs); + INIT_LIST_HEAD(&priv->multicast_list); + + INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev); + INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); + INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); +} + +struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) +{ + struct net_device *dev; + + dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name, + ipoib_setup); + if (!dev) + return NULL; + + return netdev_priv(dev); +} + +static ssize_t show_pkey(struct class_device *cdev, char *buf) +{ + struct ipoib_dev_priv *priv = + netdev_priv(container_of(cdev, struct net_device, class_dev)); + + return sprintf(buf, "0x%04x\n", priv->pkey); +} +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t create_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; +} +static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); + +static ssize_t delete_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? 
ret : count; + +} +static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); + +int ipoib_add_pkey_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, + &class_device_attr_pkey); +} + +static struct net_device *ipoib_add_port(const char *format, + struct ib_device *hca, u8 port) +{ + struct ipoib_dev_priv *priv; + int result = -ENOMEM; + + priv = ipoib_intf_alloc(format); + if (!priv) + goto alloc_mem_failed; + + SET_NETDEV_DEV(priv->dev, hca->dma_device); + + result = ib_query_pkey(hca, port, 0, &priv->pkey); + if (result) { + printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } + + priv->dev->broadcast[8] = priv->pkey >> 8; + priv->dev->broadcast[9] = priv->pkey & 0xff; + + result = ib_query_gid(hca, port, 0, &priv->local_gid); + if (result) { + printk(KERN_WARNING "%s: ib_query_gid port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + + result = ipoib_dev_init(priv->dev, hca, port); + if (result < 0) { + printk(KERN_WARNING "%s: failed to initialize port %d (ret = %d)\n", + hca->name, port, result); + goto device_init_failed; + } + + INIT_IB_EVENT_HANDLER(&priv->event_handler, + priv->ca, ipoib_event); + result = ib_register_event_handler(&priv->event_handler); + if (result < 0) { + printk(KERN_WARNING "%s: ib_register_event_handler failed for " + "port %d (ret = %d)\n", + hca->name, port, result); + goto event_failed; + } + + result = register_netdev(priv->dev); + if (result) { + printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", + hca->name, port, result); + goto register_failed; + } + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_create_child)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_delete_child)) + goto sysfs_failed; + + return priv->dev; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ib_unregister_event_handler(&priv->event_handler); + +event_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +alloc_mem_failed: + return ERR_PTR(result); +} + +static void ipoib_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct net_device *dev; + struct ipoib_dev_priv *priv; + int s, e, p; + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + return; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + dev = ipoib_add_port("ib%d", device, p); + if (!IS_ERR(dev)) { + priv = netdev_priv(dev); + list_add_tail(&priv->list, dev_list); + } + } + + ib_set_client_data(device, &ipoib_client, dev_list); +} + +static void ipoib_remove_one(struct ib_device *device) +{ + struct ipoib_dev_priv *priv, *tmp; + struct list_head *dev_list; + + dev_list = ib_get_client_data(device, &ipoib_client); + + list_for_each_entry_safe(priv, tmp, dev_list, list) { + ib_unregister_event_handler(&priv->event_handler); + + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + free_netdev(priv->dev); + } +} + +static int __init ipoib_init_module(void) +{ + int ret; + + ret = 
ipoib_register_debugfs(); + if (ret) + return ret; + + /* + * We create our own workqueue mainly because we want to be + * able to flush it when devices are being removed. We can't + * use schedule_work()/flush_scheduled_work() because both + * unregister_netdev() and linkwatch_event take the rtnl lock, + * so flush_scheduled_work() can deadlock during device + * removal. + */ + ipoib_workqueue = create_singlethread_workqueue("ipoib"); + if (!ipoib_workqueue) { + ret = -ENOMEM; + goto err_fs; + } + + ret = ib_register_client(&ipoib_client); + if (ret) + goto err_wq; + + return 0; + +err_fs: + ipoib_unregister_debugfs(); + +err_wq: + destroy_workqueue(ipoib_workqueue); + + return ret; +} + +static void __exit ipoib_cleanup_module(void) +{ + ipoib_unregister_debugfs(); + ib_unregister_client(&ipoib_client); + destroy_workqueue(ipoib_workqueue); +} + +module_init(ipoib_init_module); +module_exit(ipoib_cleanup_module); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2004-12-19 22:04:17.682093438 -0800 @@ -0,0 +1,254 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include "ipoib.h" + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr *qp_attr; + int attr_mask; + int ret; + u16 pkey_index; + + ret = -ENOMEM; + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + goto out; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = -ENXIO; + goto out; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* set correct QKey for QP */ + qp_attr->qkey = priv->qkey; + attr_mask = IB_QP_QKEY; + ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); + goto out; + } + + /* attach QP to multicast group */ + down(&priv->mcast_mutex); + ret = ib_attach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "failed to attach to multicast group, ret = %d\n", ret); + +out: + kfree(qp_attr); + return ret; +} + +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + down(&priv->mcast_mutex); + ret = ib_detach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret); + + return ret; +} + +int ipoib_qp_create(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u16 pkey_index; + struct ib_qp_attr qp_attr; + int attr_mask; + + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. 
+ */ + ret = ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index); + if (ret) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return ret; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qkey = 0; + qp_attr.port_num = priv->port; + qp_attr.pkey_index = pkey_index; + attr_mask = + IB_QP_QKEY | + IB_QP_PORT | + IB_QP_PKEY_INDEX | + IB_QP_STATE; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTR; + /* Can't set this in a INIT->RTR transition */ + attr_mask &= ~IB_QP_PORT; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTS; + qp_attr.sq_psn = 0; + attr_mask |= IB_QP_SQ_PSN; + attr_mask &= ~IB_QP_PKEY_INDEX; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret); + goto out_fail; + } + + return 0; + +out_fail: + ib_destroy_qp(priv->qp); + priv->qp = NULL; + + return -EINVAL; +} + +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr init_attr = { + .cap = { + .max_send_wr = IPOIB_TX_RING_SIZE, + .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .sq_sig_type = IB_SIGNAL_ALL_WR, + .rq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UD + }; + + priv->pd = ib_alloc_pd(priv->ca); + if (IS_ERR(priv->pd)) { + printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); + return -ENODEV; + } + + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, + IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + if (IS_ERR(priv->cq)) { + printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); + goto out_free_pd; + } + + if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP)) + goto out_free_cq; + + priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(priv->mr)) { + printk(KERN_WARNING "%s: ib_reg_phys_mr failed\n", ca->name); + goto out_free_cq; + } + + init_attr.send_cq = priv->cq; + init_attr.recv_cq = priv->cq, + + priv->qp = ib_create_qp(priv->pd, &init_attr); + if (IS_ERR(priv->qp)) { + printk(KERN_WARNING "%s: failed to create QP\n", ca->name); + goto out_free_mr; + } + + priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff; + priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; + priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; + + return 0; + +out_free_mr: + ib_dereg_mr(priv->mr); + +out_free_cq: + ib_destroy_cq(priv->cq); + +out_free_pd: + ib_dealloc_pd(priv->pd); + return -ENODEV; +} + +void ipoib_transport_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->qp) { + if (ib_destroy_qp(priv->qp)) + ipoib_warn(priv, "ib_qp_destroy failed\n"); + + priv->qp = NULL; + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + } + + if (ib_dereg_mr(priv->mr)) + ipoib_warn(priv, "ib_dereg_mr failed\n"); + + if (ib_destroy_cq(priv->cq)) + ipoib_warn(priv, "ib_cq_destroy failed\n"); + + if (ib_dealloc_pd(priv->pd)) + ipoib_warn(priv, "ib_dealloc_pd failed\n"); +} + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record) +{ + struct ipoib_dev_priv *priv = + container_of(handler, struct ipoib_dev_priv, event_handler); + + if (record->event == IB_EVENT_PORT_ACTIVE || + 
record->event == IB_EVENT_LID_CHANGE || + record->event == IB_EVENT_SM_CHANGE) { + ipoib_dbg(priv, "Port active event\n"); + schedule_work(&priv->flush_task); + } +}

From shaharf at voltaire.com  Mon Dec 20 01:22:45 2004
From: shaharf at voltaire.com (shaharf)
Date: Mon, 20 Dec 2004 11:22:45 +0200
Subject: [openib-general] osm
Message-ID:

> shaharf> Hi all, I took the liberty to commit the user-mode tree
> shaharf> even though it is far from complete or stable. However,
> shaharf> the tree contains several applications and utilities that
> shaharf> may help all of us, even at this preliminary stage.
>
> That's great, better to commit early and get feedback than to work
> privately and end up going the wrong direction.
>
> A couple quick suggestions:
>  - Add copyright/license headers to all your source files
>  - Use GNU autotools instead of writing your own Makefiles
>
>  - Roland

As I am new to open source development, I am not sure what license I should
add. Is the BSD license what we want? The standard GNU license may be
problematic. I kept the Intel-based license in the new opensm files; it looks
good enough to me. Autotools will be used at a later stage. Again, I am not
really used to these tools, so I didn't want to delay opensm because of that.
The current makefiles should do for now.

Shahar

From shaharf at voltaire.com  Mon Dec 20 01:28:15 2004
From: shaharf at voltaire.com (shaharf)
Date: Mon, 20 Dec 2004 11:28:15 +0200
Subject: [openib-general] osm
Message-ID:

>
> One more suggestion: consider using libsysfs (which every modern
> distro will ship) instead of reimplementing it in libcommon/sysfs.c.
>
>  - R.

Thanks - I will do that. I was not aware of this library.

Shahar

From arnd at arndb.de  Mon Dec 20 04:14:35 2004
From: arnd at arndb.de (Arnd Bergmann)
Date: Mon, 20 Dec 2004 13:14:35 +0100
Subject: [openib-general] Re: [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver
In-Reply-To: <20041220.155836.75677852.yoshfuji@linux-ipv6.org>
References: <200412192215.69tnzAhGIT1vQGLF@topspin.com>
	<200412192215.fZX1ZQqQD4QGkKcF@topspin.com>
	<20041220.155836.75677852.yoshfuji@linux-ipv6.org>
Message-ID: <200412201314.35502.arnd@arndb.de>

On Monday 20 December 2004 07:58, YOSHIFUJI Hideaki / 吉藤英明 wrote:
> Roland Dreier says:
>
> > +enum {
> > +	IPOIB_PACKET_SIZE         = 2048,
> > +	IPOIB_BUF_SIZE            = IPOIB_PACKET_SIZE + IB_GRH_BYTES,
> > +
> > +	IPOIB_ENCAP_LEN           = 4,
> > +
> > +	IPOIB_RX_RING_SIZE        = 128,
> > +	IPOIB_TX_RING_SIZE        = 64,
> > +
>
> the above entries do not seem appropriate for an enum (rather than #define).

According to Documentation/CodingStyle, it actually is preferred like this.
See also include/linux/ide.h for another example where this is done.
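To make the comparison concrete, here is a purely illustrative sketch of the two styles side by side, reusing the ring-size constants quoted above (the #if 0 is only there so both variants can sit in one compilable file):

	#if 0
	/* Style 1: preprocessor macros -- untyped, and the names are not
	 * visible to the compiler or debugger as real symbols. */
	#define IPOIB_RX_RING_SIZE	128
	#define IPOIB_TX_RING_SIZE	64
	#else
	/* Style 2: anonymous enum -- the enumerators are compile-time
	 * constants that obey C scoping and show up in debug info, which
	 * is the property Documentation/CodingStyle is pointing at for
	 * groups of related integer constants. */
	enum {
		IPOIB_RX_RING_SIZE	= 128,
		IPOIB_TX_RING_SIZE	= 64
	};
	#endif

Either form produces the same object code for these sizes; the difference is only in type information and debuggability.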
Arnd <><

From yoshfuji at linux-ipv6.org  Mon Dec 20 05:17:09 2004
From: yoshfuji at linux-ipv6.org (YOSHIFUJI Hideaki / 吉藤英明)
Date: Mon, 20 Dec 2004 22:17:09 +0900 (JST)
Subject: [openib-general] Re: [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver
In-Reply-To: <200412201314.35502.arnd@arndb.de>
References: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com>
	<20041220.155836.75677852.yoshfuji@linux-ipv6.org>
	<200412201314.35502.arnd@arndb.de>
Message-ID: <20041220.221709.99112884.yoshfuji@linux-ipv6.org>

In article <200412201314.35502.arnd at arndb.de> (at Mon, 20 Dec 2004 13:14:35 +0100), Arnd Bergmann says:

> On Monday 20 December 2004 07:58, YOSHIFUJI Hideaki / 吉藤英明 wrote:
> > Roland Dreier says:
> >
> > > +enum {
> > > +	IPOIB_PACKET_SIZE         = 2048,
> > > +	IPOIB_BUF_SIZE            = IPOIB_PACKET_SIZE + IB_GRH_BYTES,
> > > +
> > > +	IPOIB_ENCAP_LEN           = 4,
> > > +
> > > +	IPOIB_RX_RING_SIZE        = 128,
> > > +	IPOIB_TX_RING_SIZE        = 64,
> > > +
> >
> > the above entries do not seem appropriate for an enum (rather than #define).
>
> According to Documentation/CodingStyle, it actually is preferred like this.
> See also include/linux/ide.h for another example where this is done.

No, it is not a similar case.

--yoshfuji

From itamar at mellanox.co.il  Mon Dec 20 05:58:16 2004
From: itamar at mellanox.co.il (Itamar Rabenstein)
Date: Mon, 20 Dec 2004 15:58:16 +0200
Subject: [openib-general] [PATCH] initial CM module
Message-ID: <91DB792C7985D411BEC300B40080D29CB47A1A@mtvex01.mtv.mtl.com>

In the gen1 TS CM there was a function, ib_cm_service_assign(), that returns a
free service_id; you could then call ib_cm_listen with this service_id.
In the DAPL implementation we use this when we want to listen on a service_id
that has not been pre-assigned (DAT verb: Dat_Psp_Create_any).

I think the gen2 CM should include something similar. There are two ways to
implement it:

1) Add a new function, as in gen1:

Index: include/ib_cm.h
===================================================================
+/**
+ * ib_cm_service_assign() - return a free service_id
+ */
+u64 ib_cm_service_assign(void);
+

2) Change ib_cm_listen to understand that the request comes without a
service_id; it would then assign a service_id and only then do the listen.
This can be done in many ways, for example if the service_id == 0 (like a
sockets bind).

Itamar Rabenstein
Mellanox Technologies Ltd
mailto : itamar at mellanox.co.il
phone: 972-3-6259506

-----Original Message-----
From: Sean Hefty [ mailto:mshefty at ichips.intel.com ]
Sent: Thursday, December 16, 2004 9:56 PM
To: openib-general at openib.org; halr at voltaire.com
Subject: [openib-general] [PATCH] initial CM module

This patch adds in the initial CM API and module code. The module loads,
unloads, and allocates/deallocates connection structures, but that's about it.

This patch does not include changes needed to Kconfig or the Makefile, since
I'm not sure that it makes sense to change these yet.

I will commit this unless there are any objections.

- Sean

Index: include/ib_cm.h
===================================================================
--- include/ib_cm.h	(revision 0)
+++ include/ib_cm.h	(revision 0)
@@ -0,0 +1,387 @@
+/*
+ * This software is available to you under a choice of one of two
+ * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * < http://www.fsf.org/copyleft/gpl.html >, or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * < http://openib.org/license.html >. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * $Id$ + */ +#if !defined(IB_CM_H) +#define IB_CM_H + +#include +#include + +enum ib_cm_state { + IB_CM_IDLE, + IB_CM_LISTEN, + IB_CM_REQ_SENT, + IB_CM_REQ_RCVD, + IB_CM_MRA_REQ_SENT, + IB_CM_MRA_REQ_RCVD, + IB_CM_REP_SENT, + IB_CM_REP_RCVD, + IB_CM_MRA_REP_SENT, + IB_CM_MRA_REP_RCVD, + IB_CM_ESTABLISHED, + IB_CM_LAP_SENT, + IB_CM_LAP_RCVD, + IB_CM_MRA_LAP_SENT, + IB_CM_MRA_LAP_RCVD, + IB_CM_DREQ_SENT, + IB_CM_DREQ_RCVD, + IB_CM_TIMEWAIT, + IB_CM_SIDR_REQ_SENT, + IB_CM_SIDR_REQ_RCVD +}; + +enum ib_cm_event_type { + IB_CM_REQ_TIMEOUT, + IB_CM_REQ_RECEIVED, + IB_CM_REP_TIMEOUT, + IB_CM_REP_RECEIVED, + IB_CM_RTU_RECEIVED, + IB_CM_DREQ_TIMEOUT, + IB_CM_DREQ_RECEIVED, + IB_CM_DREP_RECEIVED, + IB_CM_MRA_RECEIVED, + IB_CM_LAP_TIMEOUT, + IB_CM_LAP_RECEIVED, + IB_CM_APR_RECEIVED +}; + +struct ib_cm_event { + /* a whole lot more stuff goes here */ + void *private_data; + u8 private_data_len; + enum ib_cm_event_type event; +}; + +struct ib_cm_id; + +typedef void (*ib_cm_handler)(struct ib_cm_id *cm_id, + struct ib_cm_event *event); + +struct ib_cm_id { + ib_cm_handler cm_handler; + void *context; + u64 service_id; + enum ib_cm_state state; +}; + +/** + * ib_create_cm_id - Allocate a connection identifier. + * @cm_handler: Callback invoked to notify the user of CM events. + * @context: User specified context associated with the connection + * identifier. + * + * Connection identifiers are used to track connection states and + * listen requests. + */ +struct ib_cm_id *ib_create_cm_id(ib_cm_handler cm_handler, + void *context); + +/** + * ib_destroy_cm_id - Destroy a connection identifier. + * @cm_id: Connection identifier to destroy. + * + * This call blocks until the connection identifier is destroyed. + */ +int ib_destroy_cm_id(struct ib_cm_id *cm_id); +//*** TBD : add flags to allow calling routine from CM callback... + +/** + * ib_cm_listen - Initiates listening on the specified service ID for + * connection and service ID resolution requests. + * @cm_id: Connection identifier associated with the listen request. + * @service_id: Service identifier matched against incoming connection + * and service ID resolution requests. + * @service_mask: Mask applied to service ID used to listen across a + * range of service IDs. 
+ */ +int ib_cm_listen(struct ib_cm_id *cm_id, + u64 service_id, + u64 service_mask); + +struct ib_cm_req_param { + struct ib_qp *qp; + struct ib_path_record *primary_path; + struct ib_path_record *alternate_path; + u64 service_id; + int timeout_ms; + void *private_data; + u8 private_data_len; + u8 responder_resources; + u8 initiator_depth; + u8 remote_cm_response_timeout; + u8 flow_control; + u8 local_cm_response_timeout; + u8 retry_count; + u8 rnr_retry_count; + u8 max_cm_retries; +}; + +/** + * ib_send_cm_req - Sends a connection request to the remote node. + * @cm_id: Connection identifier that will be associated with the + * connection request. + * @param: Connection request information needed to establish the + * connection. + */ +int ib_send_cm_req(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param); + +struct ib_cm_rep_param { + struct ib_qp *qp; + void *private_data; + u8 reply_private_data_len; + u8 responder_resources; + u8 initiator_depth; + u8 target_ack_delay; + u8 failover_accepted; + u8 flow_control; + u8 rnr_retry_count; +}; + +/** + * ib_send_cm_rep - Sends a connection reply in response to a connection + * request. + * @cm_id: Connection identifier that will be associated with the + * connection request. + * @param: Connection reply information needed to establish the + * connection. + */ +int ib_send_cm_rep(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param); + +/** + * ib_send_cm_rtu - Sends a connection ready to use message in response + * to a connection reply message. + * @cm_id: Connection identifier associated with the connection request. + * @private_data: Optional user-defined private data sent with the + * ready to use message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_rtu(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_dreq - Sends a disconnection request for an existing + * connection. + * @cm_id: Connection identifier associated with the connection being + * released. + * @private_data: Optional user-defined private data sent with the + * disconnection request message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_dreq(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_drep - Sends a disconnection reply to a disconnection request. + * @cm_id: Connection identifier associated with the connection being + * released. + * @private_data: Optional user-defined private data sent with the + * disconnection reply message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_drep(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len); + +/** + * ib_cm_establish - Forces a connection state to established. + * @cm_id: Connection identifier to transition to established. + * + * This routine should be invoked by users who receive messages on a + * connected QP before an RTU has been received. 
+ */ +int ib_cm_establish(struct ib_cm_id *id); + +enum ib_cm_rej_reason { + IB_CM_REJ_NO_QP = __constant_htons(1), + IB_CM_REJ_NO_EEC = __constant_htons(2), + IB_CM_REJ_NO_RESOURCES = __constant_htons(3), + IB_CM_REJ_TIMEOUT = __constant_htons(4), + IB_CM_REJ_UNSUPPORTED = __constant_htons(5), + IB_CM_REJ_INVALID_COMM_ID = __constant_htons(6), + IB_CM_REJ_INVALID_COMM_INSTANCE = __constant_htons(7), + IB_CM_REJ_INVALID_SERVICE_ID = __constant_htons(8), + IB_CM_REJ_INVALID_TRANSPORT_TYPE = __constant_htons(9), + IB_CM_REJ_STALE_CONN = __constant_htons(10), + IB_CM_REJ_RDC_NOT_EXIST = __constant_htons(11), + IB_CM_REJ_INVALID_GID = __constant_htons(12), + IB_CM_REJ_INVALID_LID = __constant_htons(13), + IB_CM_REJ_INVALID_SL = __constant_htons(14), + IB_CM_REJ_INVALID_TRAFFIC_CLASS = __constant_htons(15), + IB_CM_REJ_INVALID_HOP_LIMIT = __constant_htons(16), + IB_CM_REJ_INVALID_PACKET_RATE = __constant_htons(17), + IB_CM_REJ_INVALID_ALT_GID = __constant_htons(18), + IB_CM_REJ_INVALID_ALT_LID = __constant_htons(19), + IB_CM_REJ_INVALID_ALT_SL = __constant_htons(20), + IB_CM_REJ_INVALID_ALT_TRAFFIC_CLASS = __constant_htons(21), + IB_CM_REJ_INVALID_ALT_HOP_LIMIT = __constant_htons(22), + IB_CM_REJ_INVALID_ALT_PACKET_RATE = __constant_htons(23), + IB_CM_REJ_PORT_REDIRECT = __constant_htons(24), + IB_CM_REJ_INVALID_MTU = __constant_htons(26), + IB_CM_REJ_INSUFFICIENT_RESP_RESOURCES = __constant_htons(27), + IB_CM_REJ_CONSUMER_DEFINED = __constant_htons(28), + IB_CM_REJ_INVALID_RNR_RETRY = __constant_htons(29), + IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = __constant_htons(30), + IB_CM_REJ_INVALID_CLASS_VERSION = __constant_htons(31), + IB_CM_REJ_INVALID_FLOW_LABEL = __constant_htons(32), + IB_CM_REJ_INVALID_ALT_FLOW_LABEL = __constant_htons(33) +}; + +/** + * ib_send_cm_rej - Sends a connection rejection message to the + * remote node. + * @cm_id: Connection identifier associated with the connection being + * rejected. + * @reason: Reason for the connection request rejection. + * @ari: Optional additional rejection information. + * @ari_length: Size of the additional rejection information, in bytes. + * @private_data: Optional user-defined private data sent with the + * rejection message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_rej(struct ib_cm_id *cm_id, + enum ib_cm_rej_reason reason, + void *ari, + u8 ari_length, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_mra - Sends a message receipt acknowledgement to a connection + * message. + * @cm_id: Connection identifier associated with the connection message. + * @service_timeout: The maximum time required for the sender to reply to + * to the connection message. + * @private_data: Optional user-defined private data sent with the + * message receipt acknowledgement. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_mra(struct ib_cm_id *cm_id, + u8 service_timeout, + void *private_data, + u8 private_data_len); + +/** + * ib_send_cm_lap - Sends a load alternate path request. + * @cm_id: Connection identifier associated with the load alternate path + * message. + * @alternate_path: A path record that identifies the alternate path to + * load. + * @private_data: Optional user-defined private data sent with the + * load alternate path message. + * @private_data_len: Size of the private data buffer, in bytes. 
+ */ +int ib_send_cm_lap(struct ib_cm_id *cm_id, + struct ib_path_record *alternate_path, + void *private_data, + u8 private_data_len); + +enum ib_cm_apr_status { + IB_CM_APR_SUCCESS, + IB_CM_APR_INVALID_COMM_ID, + IB_CM_APR_UNSUPPORTED, + IB_CM_APR_REJECT, + IB_CM_APR_REDIRECT, + IB_CM_APR_IS_CURRENT, + IB_CM_APR_INVALID_QPN_EECN, + IB_CM_APR_INVALID_LID, + IB_CM_APR_INVALID_GID, + IB_CM_APR_INVALID_FLOW_LABEL, + IB_CM_APR_INVALID_TCLASS, + IB_CM_APR_INVALID_HOP_LIMIT, + IB_CM_APR_INVALID_PACKET_RATE, + IB_CM_APR_INVALID_SL +}; + +/** + * ib_send_cm_apr - Sends an alternate path response message in response to + * a load alternate path request. + * @cm_id: Connection identifier associated with the alternate path response. + * @status: Reply status sent with the alternate path response. + * @info: Optional additional information sent with the alternate path + * response. + * @info_length: Size of the additional information, in bytes. + * @private_data: Optional user-defined private data sent with the + * alternate path response message. + * @private_data_len: Size of the private data buffer, in bytes. + */ +int ib_send_cm_apr(struct ib_cm_id *cm_id, + enum ib_cm_apr_status status, + void *info, + u8 info_length, + void *private_data, + u8 private_data_len); + +struct ib_cm_sidr_req_param { + struct ib_path_record *path; + u64 service_id; + int timeout_ms; + void *private_data; + u8 private_data_len; + u16 pkey; +}; + +/** + * ib_send_cm_sidr_req - Sends a service ID resolution request to the + * remote node. + * @cm_id: Communication identifier that will be associated with the + * service ID resolution request. + * @param: Service ID resolution request information. + */ +int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, + struct ib_cm_sidr_req_param *param); + +enum ib_cm_sidr_status { + IB_SIDR_SUCCESS, + IB_SIDR_UNSUPPORTED, + IB_SIDR_REJECT, + IB_SIDR_NO_QP, + IB_SIDR_REDIRECT, + IB_SIDR_UNSUPPORTED_VERSION +}; + +struct ib_cm_sidr_rep_param { + u32 qp_num; + u32 qkey; + enum ib_cm_sidr_status status; + void *info; + u8 info_length; + void *private_data; + u8 private_data_len; +}; + +/** + * ib_send_cm_sidr_rep - Sends a service ID resolution request to the + * remote node. + * @cm_id: Communication identifier associated with the received service ID + * resolution request. + * @param: Service ID resolution reply information. + */ +int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, + struct ib_cm_sidr_rep_param *param); + +#endif /* IB_CM_H */ Index: core/cm.c =================================================================== --- core/cm.c (revision 0) +++ core/cm.c (revision 0) @@ -0,0 +1,292 @@ +/* + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * < http://www.fsf.org/copyleft/gpl.html >, or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * < http://openib.org/license.html >. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * $Id$ + */ +#include +#include + +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand CM"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void cm_add_one(struct ib_device *device); +static void cm_remove_one(struct ib_device *device); + +static struct ib_client cm_client = { + .name = "cm", + .add = cm_add_one, + .remove = cm_remove_one +}; + +struct cm_port { + struct ib_mad_agent *mad_agent; +}; + +struct ib_cm_id_private { + struct ib_cm_id id; + + spinlock_t lock; + wait_queue_head_t wait; + atomic_t refcount; +}; + +struct ib_cm_id *ib_create_cm_id(ib_cm_handler cm_handler, + void *context) +{ + struct ib_cm_id_private *cm_id_priv; + + cm_id_priv = kmalloc(sizeof *cm_id_priv, GFP_KERNEL); + if (!cm_id_priv) + return ERR_PTR(-ENOMEM); + + cm_id_priv->id.service_id = 0; + cm_id_priv->id.state = IB_CM_IDLE; + cm_id_priv->id.cm_handler = cm_handler; + cm_id_priv->id.context = context; + + spin_lock_init(&cm_id_priv->lock); + init_waitqueue_head(&cm_id_priv->wait); + atomic_set(&cm_id_priv->refcount, 1); + + return &cm_id_priv->id; +} +EXPORT_SYMBOL(ib_create_cm_id); + +static void reset_cm_state(struct ib_cm_id_private *cm_id_priv) +{ + /* reject connections if establishing */ + /* disconnect established connections */ + /* update timewait info */ +} + +int ib_destroy_cm_id(struct ib_cm_id *cm_id) +{ + struct ib_cm_id_private *cm_id_priv; + unsigned long flags; + + cm_id_priv = container_of(cm_id, struct ib_cm_id_private, id); + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch(cm_id->state) { + case IB_CM_IDLE: + case IB_CM_LISTEN: + case IB_CM_TIMEWAIT: + break; /* Connection is ready to be destroyed. 
*/ + default: + reset_cm_state(cm_id_priv); + break; + } + cm_id->state = IB_CM_IDLE; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + + atomic_dec(&cm_id_priv->refcount); + wait_event(cm_id_priv->wait, + !atomic_read(&cm_id_priv->refcount)); + kfree(cm_id_priv); + return 0; +} +EXPORT_SYMBOL(ib_destroy_cm_id); + +int ib_cm_listen(struct ib_cm_id *cm_id, + u64 service_id, + u64 service_mask) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_cm_listen); + +int ib_send_cm_req(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_req); + +int ib_send_cm_rep(struct ib_cm_id *cm_id, + struct ib_cm_req_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_rep); + +int ib_send_cm_rtu(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_rtu); + +int ib_send_cm_dreq(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_dreq); + +int ib_send_cm_drep(struct ib_cm_id *cm_id, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_drep); + +int ib_cm_establish(struct ib_cm_id *id) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_cm_establish); + +int ib_send_cm_rej(struct ib_cm_id *cm_id, + enum ib_cm_rej_reason reason, + void *ari, + u8 ari_length, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_rej); + +int ib_send_cm_mra(struct ib_cm_id *cm_id, + u8 service_timeout, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_mra); + +int ib_send_cm_lap(struct ib_cm_id *cm_id, + struct ib_path_record *alternate_path, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_lap); + +int ib_send_cm_apr(struct ib_cm_id *cm_id, + enum ib_cm_apr_status status, + void *info, + u8 info_length, + void *private_data, + u8 private_data_len) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_apr); + +int ib_send_cm_sidr_req(struct ib_cm_id *cm_id, + struct ib_cm_sidr_req_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_sidr_req); + +int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, + struct ib_cm_sidr_rep_param *param) +{ + return -EINVAL; +} +EXPORT_SYMBOL(ib_send_cm_sidr_rep); + +static void send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct cm_port *port; + + port = (struct cm_port *)mad_agent->context; +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct cm_port *port; + + port = (struct cm_port *)mad_agent->context; +} + +static void cm_add_one(struct ib_device *device) +{ + struct cm_port *port; + struct ib_mad_reg_req reg_req; + u8 i; + + port = kmalloc(sizeof *port * device->phys_port_cnt, GFP_KERNEL); + if (!port) + goto out; + + memset(®_req, 0, sizeof reg_req); + reg_req.mgmt_class = IB_MGMT_CLASS_CM; + reg_req.mgmt_class_version = 1; + set_bit(IB_MGMT_METHOD_GET, reg_req.method_mask); + set_bit(IB_MGMT_METHOD_SET, reg_req.method_mask); + set_bit(IB_MGMT_METHOD_GET_RESP, reg_req.method_mask); + set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask); + for (i = 1; i <= device->phys_port_cnt; i++) { + port[i].mad_agent = ib_register_mad_agent(device, i, + IB_QPT_GSI, + ®_req, + 0, + send_handler, + recv_handler, + &port[i]); + } + +out: + ib_set_client_data(device, &cm_client, port); +} + +static void cm_remove_one(struct ib_device *device) +{ + struct 
cm_port *port; + int i; + + port = (struct cm_port *)ib_get_client_data(device, &cm_client); + if (!port) + return; + + for (i = 1; i <= device->phys_port_cnt; i++) { + if (!IS_ERR(port[i].mad_agent)) + ib_unregister_mad_agent(port[i].mad_agent); + } + kfree(port); +} + +static int __init ib_cm_init(void) +{ + return ib_register_client(&cm_client); +} + +static void __exit ib_cm_cleanup(void) +{ + ib_unregister_client(&cm_client); +} + +module_init(ib_cm_init); +module_exit(ib_cm_cleanup); From sean.hefty at intel.com Mon Dec 20 06:42:40 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 20 Dec 2004 06:42:40 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <91DB792C7985D411BEC300B40080D29CB47A1A@mtvex01.mtv.mtl.com> Message-ID: In the gen1 TS CM there was a function, ib_cm_service_assign(), that returns a free service_id, and then you could call ib_cm_listen with this service_id. In the DAPL implementation we use this when we want to listen on a service_id that is not pre-assigned (DAT verb: Dat_Psp_Create_any). I think the gen2 CM should include something similar. There are 2 ways to implement it: 1) add a new function as it was in gen1 2) change ib_cm_listen to understand that the request comes without a service_id; it will then assign a service_id and only then do the listen. This can be done in many ways, for example: if the service_id == 0 (like sockets bind) Do others feel that this functionality is needed? This seems problematic in that the CM could allocate a service ID that is used/required by an application that has not yet started when this call is made. (I don't recall if service IDs have reserved/non-reserved values similar to sockets, but I don't think that they do.) If this functionality is needed, my initial preference is the second method mentioned. - Sean From mst at mellanox.co.il Mon Dec 20 07:29:24 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Dec 2004 17:29:24 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52u0qlpy1r.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> Message-ID: <20041220152924.GA3746@mellanox.co.il> Hello, Roland! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] Re: IPoIB Failure CQ overrun": > Robert> 0000:04:00.0: CQ overrun on CQN 00000082 > > This appears to be an issue with the latest FW (I see it with Tavor FW > 3.3.1 but not 3.2.0). I am working with Mellanox on finding out > whether it's a FW bug or a problem with mthca. > > For now you can work around it by changing > > IPOIB_NUM_WC = 4, > > to > > IPOIB_NUM_WC = 1, > > in ipoib.h. > > - Roland In investigating this issue I discovered what I believe is a race condition in mthca: mthca_poll_one decrements the qp->cur counter. This will make it possible, once mthca_poll_one returns, to post new wqes on the qp. The qp lock is then dropped, and finally the cq consumer index is incremented. However, if you try to post send wqes after qp->cur was decremented but before the consumer index is updated, the post will succeed, and the cq will overrun. The simplest solution is to update the cq consumer index before the qp is unlocked.
A patch is attached. I also would like to suggest implementing CQ doorbell coalescing in mthca, to reduce the number of CQ doorbells. Unfortunately this patch does not seem to solve the overrun problem, so may be another problem. That will need more looking into. Thanks, MST -------------- next part -------------- Index: hw/mthca/mthca_cq.c =================================================================== --- hw/mthca/mthca_cq.c (revision 1362) +++ hw/mthca/mthca_cq.c (working copy) @@ -412,6 +412,11 @@ if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { if (*cur_qp) { + if (*freed) { + wmb(); + inc_cons_index(dev, cq, *freed); + *freed = 0; + } spin_unlock(&(*cur_qp)->lock); if (atomic_dec_and_test(&(*cur_qp)->refcount)) wake_up(&(*cur_qp)->wait); @@ -529,16 +534,17 @@ break; } + if (freed) { + wmb(); + inc_cons_index(dev, cq, freed); + } + if (qp) { spin_unlock(&qp->lock); if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); } - if (freed) { - wmb(); - inc_cons_index(dev, cq, freed); - } spin_unlock_irqrestore(&cq->lock, flags); From roland at topspin.com Mon Dec 20 07:36:47 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 07:36:47 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: (Sean Hefty's message of "Mon, 20 Dec 2004 06:42:40 -0800") References: Message-ID: <52ekhlkmls.fsf@topspin.com> Sean> Do others feel that this functionality is needed? This Sean> seems problematic in that the CM could allocate a service ID Sean> that is used/required by an application that has not yet Sean> started when this call is made. (I don't recall if service Sean> IDs have a reserved/non-reserved values similar to sockets, Sean> but I don't think that they do.) If this functionality is Sean> needed, my initial preference is the second method Sean> mentioned. Yes, some mechanism of allocating unique "local OS administered" service IDs (see section A.3.2.3.3 of the spec) is required. I don't see any problem with allocating IDs before an application is listening. This is no different from the fact that, say, the ROM repository service ID is always assigned, but there may not be any service listening for connections. All that is required for correct operation is for an application to start listening on its assigned service IDs before publishing those IDs. Or did I misunderstand and you were just worried about possible service ID collisions? In that case there is no problem, all service IDs starting with 0x02 are reserved for this "local OS" allocation method and may be freely assigned by the CM. - R. From roland at topspin.com Mon Dec 20 07:38:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 07:38:54 -0800 Subject: [openib-general] osm In-Reply-To: (shaharf@voltaire.com's message of "Mon, 20 Dec 2004 11:22:45 +0200") References: Message-ID: <52acs9kmi9.fsf@topspin.com> shaharf> As I am new to open source development so I am not sure shaharf> what license I should add. Is BSD license what we want? shaharf> Standard GNU license may be problematic. I kept the Intel shaharf> based license in the new opensm files. It looks good shaharf> enough for me. Yes, we should not change existing licenses for code such as opensm. For new code we should use the standard OpenIB dual GPL/BSD license. You can copy the copyright/license comment from the top of any kernel source file. 
- Roland From roland at topspin.com Mon Dec 20 07:42:47 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 07:42:47 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <20041220152924.GA3746@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 20 Dec 2004 17:29:24 +0200") References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> Message-ID: <52652xkmbs.fsf@topspin.com> Michael> In investigating this issue I discovered what I belive is Michael> a race condition in mthca: Thanks, good catch. I'll apply your patch. In the future can you add a Signed-off-by: line to your patches? Michael> I also would like to suggest implementing CQ doorbell Michael> coalescing in mthca, to reduce the number of CQ Michael> doorbells. Sounds like a good idea... Michael> Unfortunately this patch does not seem to solve the Michael> overrun problem, so may be another problem. That will Michael> need more looking into. OK. At this point do you think it's a FW problem or a driver problem? Thanks, Roland From mst at mellanox.co.il Mon Dec 20 08:01:46 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Dec 2004 18:01:46 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52652xkmbs.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> Message-ID: <20041220160146.GB3746@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun": > Michael> In investigating this issue I discovered what I belive is > Michael> a race condition in mthca: > > Thanks, good catch. I'll apply your patch. In the future can you add > a Signed-off-by: line to your patches? Sorry,I forgot it. Here it is for the last patch: Signed-off-by: Michael S. Tsirkin > Michael> I also would like to suggest implementing CQ doorbell > Michael> coalescing in mthca, to reduce the number of CQ > Michael> doorbells. > > Sounds like a good idea... > > Michael> Unfortunately this patch does not seem to solve the > Michael> overrun problem, so may be another problem. That will > Michael> need more looking into. > > OK. At this point do you think it's a FW problem or a driver problem? > > Thanks, > Roland CQ consumer index doorbell FW is reasonably well tested with VAPI (and with directed tests). It is also relatively straight-forward code so I would suspect a driver problem first of all. Unfortunately once the overrun happends I can not bring the interface down nor unload the ip over ib module (both commands hang) so I have to reboot. This is slowing me down considerably. Do you have an idea why is that, and how to fix this problem? MST From roland at topspin.com Mon Dec 20 08:06:24 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 08:06:24 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <20041220160146.GB3746@mellanox.co.il> (Michael S. 
Tsirkin's message of "Mon, 20 Dec 2004 18:01:46 +0200") References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> Message-ID: <52wtvdj6nz.fsf@topspin.com> Michael> CQ consumer index doorbell FW is reasonably well tested Michael> with VAPI (and with directed tests). It is also Michael> relatively straight-forward code so I would suspect a Michael> driver problem first of all. Fair enough but the behavior changed from FW version 3.2 to 3.3.1 which is interesting as well. Michael> Unfortunately once the overrun happends I can not bring Michael> the interface down nor unload the ip over ib module (both Michael> commands hang) so I have to reboot. This is slowing me Michael> down considerably. Do you have an idea why is that, and Michael> how to fix this problem? Probably IPoIB is stuck in the loop /* Wait for all sends and receives to complete */ while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) yield(); in ipoib_ib_dev_stop(), since some of completions it's waiting for are lost because of the CQ overrun. I'll add a timeout here where we give up and assume everything is done. - R. From mst at mellanox.co.il Mon Dec 20 08:16:28 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Dec 2004 18:16:28 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52wtvdj6nz.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> <52wtvdj6nz.fsf@topspin.com> Message-ID: <20041220161628.GC3746@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun": > Michael> CQ consumer index doorbell FW is reasonably well tested > Michael> with VAPI (and with directed tests). It is also > Michael> relatively straight-forward code so I would suspect a > Michael> driver problem first of all. > > Fair enough but the behavior changed from FW version 3.2 to 3.3.1 > which is interesting as well. I know but races are always tricky, could be just a timing issue. Its just that CI doorbells are routinely stressed here by QA. > Michael> Unfortunately once the overrun happends I can not bring > Michael> the interface down nor unload the ip over ib module (both > Michael> commands hang) so I have to reboot. This is slowing me > Michael> down considerably. Do you have an idea why is that, and > Michael> how to fix this problem? > > Probably IPoIB is stuck in the loop > > /* Wait for all sends and receives to complete */ > while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) > yield(); > > in ipoib_ib_dev_stop(), since some of completions it's waiting for are > lost because of the CQ overrun. I'll add a timeout here where we give > up and assume everything is done. > > - R. Let me know when you do. But why wait? Once you close the CQ, and get the command interface event of the hw2sw cq, it is guaranteed you wont get any new cqes or events on this cq. Alternatively, you can do a query cq to check its not in overrun, although it seems like working around a specific problem we see. 
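(For illustration only: the bounded wait Roland describes for ipoib_ib_dev_stop() might look roughly like the sketch below, reusing the priv->tx_head/tx_tail counters and the recvs_pending() helper from the loop quoted above; the exact form is whatever gets committed.)

	unsigned long begin = jiffies;

	/* Wait for all sends and receives to complete, but give up after
	 * a few seconds so a lost completion cannot hang module removal. */
	while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) {
		if (time_after(jiffies, begin + 5 * HZ)) {
			ipoib_warn(priv, "timing out; %d sends %d receives not completed\n",
				   priv->tx_head - priv->tx_tail, recvs_pending(dev));
			break;
		}
		yield();
	}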
MST From arnd at arndb.de Mon Dec 20 08:10:03 2004 From: arnd at arndb.de (Arnd Bergmann) Date: Mon, 20 Dec 2004 17:10:03 +0100 Subject: [openib-general] Re: [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041220.221709.99112884.yoshfuji@linux-ipv6.org> References: <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> <200412201314.35502.arnd@arndb.de> <20041220.221709.99112884.yoshfuji@linux-ipv6.org> Message-ID: <200412201710.04025.arnd@arndb.de> On Maandag 20 Dezember 2004 14:17, YOSHIFUJI Hideaki / 吉藤英明 wrote: > In article <200412201314.35502.arnd at arndb.de> (at Mon, 20 Dec 2004 13:14:35 +0100), Arnd Bergmann says: > > See also include/linux/ide.h for another example where this is done. > > No, it is not the similar case. Sorry, it was a typo on my side, I meant include/linux/ata.h. Arnd <>< From roland at topspin.com Mon Dec 20 08:41:17 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 08:41:17 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <20041220161628.GC3746@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 20 Dec 2004 18:16:28 +0200") References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> <52wtvdj6nz.fsf@topspin.com> <20041220161628.GC3746@mellanox.co.il> Message-ID: <52r7lkkjma.fsf@topspin.com> Michael> I know but races are always tricky, could be just a Michael> timing issue. Its just that CI doorbells are routinely Michael> stressed here by QA. The thing that really makes it hard for me to think of a potential driver problem is that changing from updating the CI all at once to updating it by 1 at a time in a loop fixes things for me. If anything this lengthens the amount of time during which the CQ has too little space. Also, adding 1000 extra entries to the CQ created by IPoIB -- i.e. changing the code in ipoib_verbs.c to priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1000); still has the same problem, so we're not just transiently overrunning by 1 or something like that -- it looks like we're systematically losing updates to the CQ. Michael> Let me know when you do. But why wait? Once you close Michael> the CQ, and get the command interface event of the hw2sw Michael> cq, it is guaranteed you wont get any new cqes or events Michael> on this cq. OK, it's done. The reason for the wait here is that we are actually cleaning up the QP and want to make sure that we don't leak any resources. First we transition the QP to error, wait for all work requests to complete, and then transition the QP to reset. - Roland From mst at mellanox.co.il Mon Dec 20 08:46:01 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Dec 2004 18:46:01 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52r7lkkjma.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> <52wtvdj6nz.fsf@topspin.com> <52r7lkkjma.fsf@topspin.com> Message-ID: <20041220164601.GD3746@mellanox.co.il> Hello! Quoting r.
Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun": > Michael> I know but races are always tricky, could be just a > Michael> timing issue. Its just that CI doorbells are routinely > Michael> stressed here by QA. > > The thing that really makes it hard to for to think of a potential > driver problem is that changing from updating the CI all at once to > updating it by 1 at a time in a loop fixes things for me. If anything > this lengthens the amount of time during which the CQ has too little > space. > > Also adding a 1000 extra entries to the CQ created by IPoIB -- ie > changing the code in ipoib_verbs.c to > > priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, > IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1000); > > still has the same problem, so we're not just transiently overrunning > by 1 or something like that -- it looks like we're systematically > losing updates to the CQ. > > Michael> Let me know when you do. But why wait? Once you close > Michael> the CQ, and get the command interface event of the hw2sw > Michael> cq, it is guaranteed you wont get any new cqes or events > Michael> on this cq. > > OK, it's done. The reason for the wait here is that we are actually > cleaning up the QP and want to make sure that we don't leak any > resources. First we transition the QP to error, wait for all work > requests to complete, and then transition the QP to reset. > > - Roland But why wait for completion? Once QP is in error no new WQEs will be processed by hardware. You can close the CQ and free all of them. MST From roland at topspin.com Mon Dec 20 08:45:49 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 08:45:49 -0800 Subject: [openib-general] Re: [PATCH][v4][19/24] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <20041220.155836.75677852.yoshfuji@linux-ipv6.org> (YOSHIFUJI Hideaki's message of "Mon, 20 Dec 2004 15:58:36 +0900 (JST)") References: <200412192215.69tnzAhGIT1vQGLF@topspin.com> <200412192215.fZX1ZQqQD4QGkKcF@topspin.com> <20041220.155836.75677852.yoshfuji@linux-ipv6.org> Message-ID: <52is6wkjeq.fsf@topspin.com> YOSHIFUJI> above entries does not seem to appropriate for enum YOSHIFUJI> (than #define). As Arnd mentioned, I thought enum values were preferred to using the preprocessor. What's the advantage of converting to macros (which have no type, are invisible to the compiler, etc)? Thanks, Roland From roland at topspin.com Mon Dec 20 08:50:58 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 08:50:58 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <20041220164601.GD3746@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 20 Dec 2004 18:46:01 +0200") References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> <52wtvdj6nz.fsf@topspin.com> <20041220161628.GC3746@mellanox.co.il> <52r7lkkjma.fsf@topspin.com> <20041220164601.GD3746@mellanox.co.il> Message-ID: <52ekhkkj65.fsf@topspin.com> Michael> But why wait for completion? Once QP is in error no new Michael> WQEs will be processed by hardware. You can close the CQ Michael> and free all of them. We can't destroy the CQ before we destroy the associated QP. And we don't want to destroy the associated QP (for various reasons we prefer using the same QP if someone ifconfigs down and then up their IPoIB interface). 
In any case if we destroyed the QP without waiting for completions we wouldn't know which requests completed successfully and which were flushed. - R. From mst at mellanox.co.il Mon Dec 20 09:04:51 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 20 Dec 2004 19:04:51 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52ekhkkj65.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> <52wtvdj6nz.fsf@topspin.com> <20041220161628.GC3746@mellanox.co.il> <52r7lkkjma.fsf@topspin.com> <20041220164601.GD3746@mellanox.co.il> <52ekhkkj65.fsf@topspin.com> Message-ID: <20041220170451.GA4011@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun": > Michael> But why wait for completion? Once QP is in error no new > Michael> WQEs will be processed by hardware. You can close the CQ > Michael> and free all of them. > > We can't destroy the CQ before we destroy the associated QP. And we > don't want to destroy the associated QP (for various reasons we prefer > using the same QP if someone ifconfigs down and then up their IPoIB > interface). I see. Hmm. But not if you are bringing the module down, right? So if the module is going down, can't you do this then? Maybe we just need a way to do "reset cq" which will do hardware to software and back. > In any case if we destroyed the QP without waiting for completions we > wouldn't know which requests completed successfully and which were flushed. > > - R. You free them either way. I think it's only because you keep the old QP and the old CQ. mst From roland at topspin.com Mon Dec 20 09:55:34 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 20 Dec 2004 09:55:34 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52r7lkkjma.fsf@topspin.com> (Roland Dreier's message of "Mon, 20 Dec 2004 08:41:17 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0003125E9A@orsmsx408> <52u0qlpy1r.fsf@topspin.com> <20041220152924.GA3746@mellanox.co.il> <52652xkmbs.fsf@topspin.com> <20041220160146.GB3746@mellanox.co.il> <52wtvdj6nz.fsf@topspin.com> <20041220161628.GC3746@mellanox.co.il> <52r7lkkjma.fsf@topspin.com> Message-ID: <52brcoj1m1.fsf@topspin.com> From adding some more dumping of CQ state, what _may_ be happening is that under rare conditions the HCA's CQ consumer index gets incremented by 1 too many. Then when the CQ is completely empty it will look full to the HW and we'll get an overrun for the next CQE. (I saw it happen after ~300K increments of the CQ's CI, ~160K of which were for >1) I didn't see how the driver could be doing this, since the HCA ended up with a CI that was one more than the number of increments that the driver did. Also, converting all of the increment CI dbells to only increment by 1 fixes the problem, which is more evidence of a FW glitch. Thanks, Roland From jjengla at sandia.gov Mon Dec 20 16:21:58 2004 From: jjengla at sandia.gov (Josh England) Date: Mon, 20 Dec 2004 16:21:58 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <52ekhry2yy.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> Message-ID: <1103588518.22372.21.camel@localhost> Roland, Alright...I'm not sure what's going on.
I just now got a chance to test this out and it still isn't working. I've tried with the latest from SVN as well as the exact same kernel/rootFS you were using, and I still can't ping between nodes. I'm just doing 'modprobe ib_mthca; modprobe ib_ipoib; ifconfig ...; ping ...' on the same two nodes. Am I missing something? -JE On Wed, 2004-12-15 at 07:51 -0800, Roland Dreier wrote: > OK, I think the following change (already committed) should fix IPoIB > with PCI Express HCAs. Thanks to Josh England for giving me access to > a couple of his machines for debugging, and to Tziporet Koren for > pointing out this Arbel "feature." > > - Roland > > Index: infiniband/hw/mthca/mthca_av.c > =================================================================== > --- infiniband/hw/mthca/mthca_av.c (revision 1331) > +++ infiniband/hw/mthca/mthca_av.c (working copy) > @@ -97,6 +97,9 @@ > cpu_to_be32((ah_attr->grh.traffic_class << 20) | > ah_attr->grh.flow_label); > memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); > + } else { > + /* Arbel workaround -- low byte of GID must be 2 */ > + av->dgid[3] = cpu_to_be32(2); > } > > if (0) { > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Tue Dec 21 01:46:12 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 21 Dec 2004 11:46:12 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun Message-ID: <506C3D7B14CDD411A52C00025558DED60545D730@mtlex01.yok.mtl.com> I'm a bit ill, expect to work on it tomorrow. Could you post the patch with these dumps? > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Mon, December 20, 2004 7:56 PM > To: Michael S. Tsirkin > Cc: openib-general at openib.org > Subject: Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun > > > >From adding some more dumping of CQ state, what _may_ be happening is > that under rare conditions the HCA's CQ consumer index gets > incremented by 1 too many. Then when the CQ is completely empty it > will look full to the HW and we'll get an overrun for the next CQE. > (I saw it happen after ~300K increments of the CQ's CI, ~160K of which > were for >1) > > I didn't see how the driver could be doing this, since the HCA ended > up with a CI that was one more than the number of increments that the > driver did. Also, converting all of the increment CI dbells to only > increment by 1 fixes the problem, which is more evidence of a > FW glitch. > > Thanks, > Roland > -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Dec 21 06:42:45 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 21 Dec 2004 16:42:45 +0200 Subject: [openib-general] PDR ARs and Notes Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B1A@taurus.voltaire.com> Hi, Attached are the ARs I recorded (categorized as related to PF contract or other) and my notes from yesterday's PDR. I took the liberty of assigning the ARs :-) Let me know if I missed anything or whether there are any other comments or questions. Thanks. -- Hal -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: PDR ARs.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: PDR Notes.txt URL: From roland at topspin.com Tue Dec 21 08:02:01 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 21 Dec 2004 08:02:01 -0800 Subject: [openib-general] IPoIB still not working In-Reply-To: <1103588518.22372.21.camel@localhost> (Josh England's message of "Mon, 20 Dec 2004 16:21:58 -0800") References: <1AC79F16F5C5284499BB9591B33D6F00030B8D5D@orsmsx408> <52mzwg2561.fsf@topspin.com> <52ekhry2yy.fsf@topspin.com> <1103588518.22372.21.camel@localhost> Message-ID: <527jnbhc7a.fsf@topspin.com> Josh> Roland, Alright...I'm not sure whats going on. I just now Josh> got a chance to test this out and it still isn't working. Josh> I've tried with the latest from SVN as well as the exact Josh> same kernel/rootFS you were using, and I still can't ping Josh> between nodes. I'm just doing 'modprobe ib_mthca; modprobe Josh> ib_ipoib; ifconfig ...; ping ...' on the same two nodes. Am Josh> I missing something? That should work. I definitely was able to ping earlier with my kernel/rootFS, but maybe that was depending on some previous state. You could go through Hal's IPoIB debugging FAQ, or if you reenable my account I should be able to take a look today. - R. From halr at voltaire.com Tue Dec 21 09:07:03 2004 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 21 Dec 2004 19:07:03 +0200 Subject: [openib-general] IPoIB still not working Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175B2B@taurus.voltaire.com> Hi Josh, [You wrote:] I've tried with the latest from SVN as well as the exact same kernel/rootFS you were using, and I still can't ping between nodes. I'm just doing 'modprobe ib_mthca; modprobe ib_ipoib; ifconfig ...; ping ...' on the same two nodes. Am I missing something? Anything in /var/log/messages on any of the nodes ? Is the ping directed at a specific IP address or to the subnet broadcast address ? Is this PCIe nodes or PCIX ones or both ? Are you at 4.6.1 now on all your PCIe HCAs ? Not sure whether 4.5.3 does indeed work but it sounds like something has changed for the worse since Roland had it working. Thanks. -- Hal From sean.hefty at intel.com Tue Dec 21 09:26:12 2004 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 21 Dec 2004 09:26:12 -0800 Subject: [openib-general] [PATCH] initial CM module In-Reply-To: <52ekhlkmls.fsf@topspin.com> Message-ID: >Yes, some mechanism of allocating unique "local OS administered" >service IDs (see section A.3.2.3.3 of the spec) is required. Thanks, that's what I was looking for in the spec, but couldn't find it. >Or did I misunderstand and you were just worried about possible >service ID collisions? In that case there is no problem, all service >IDs starting with 0x02 are reserved for this "local OS" allocation >method and may be freely assigned by the CM. I was worried about possible service ID collisions. I will update the API to allow this, with the plan to use the existing listen call. - Sean From robert.j.woodruff at intel.com Tue Dec 21 09:32:23 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 21 Dec 2004 09:32:23 -0800 Subject: [openib-general] IPoIB still not working Message-ID: <1AC79F16F5C5284499BB9591B33D6F00031BCB8B@orsmsx408> >Are you at 4.6.1 now on all your PCIe HCAs ? Not sure whether 4.5.3 does indeed work >but it sounds like something has changed for the worse since Roland had it working. >Thanks. >-- Hal I had problems with the 4.3.5 firmware and it seemed to work with my early version of the 4.6.0-rc4 firmware. 
I was able to successfully run MPI jobs over IPoIB over the week-end without any issues. I was using the 1359 version, plus the one tweak to ipoib.h that I reported last friday. I will also try the latest released 4.6.1 firmware when I get back in the office. woody From roland at topspin.com Tue Dec 21 11:24:22 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 21 Dec 2004 11:24:22 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <506C3D7B14CDD411A52C00025558DED60545D730@mtlex01.yok.mtl.com> (Michael S. Tsirkin's message of "Tue, 21 Dec 2004 11:46:12 +0200") References: <506C3D7B14CDD411A52C00025558DED60545D730@mtlex01.yok.mtl.com> Message-ID: <52oegnfo9l.fsf@topspin.com> Michael> I'm a bit ill, expect to work on it tomorrow. Could you Michael> post the patch with these dumps? The patch is below. Testing on PCI Express/Arbel systems (with dual 3.2 GHz Xeons) with FW 4.5.3, I've seen somewhat different behavior. Even after setting IPOIB_NUM_WC to 1 so that IPoIB never polls more than 1 CQE at a time (and so we always inc the CI by exactly 1), I still get the CQ overrun. The debug patch produces this output: ib_mthca 0000:02:00.0: CQ overrun on CQN 00000082 ib0: timing out; 3 sends 128 receives not completed divert: no divert_blk to free, ib0 not ethernet context for CQN 82 cons_index 7f, nfrees 17d47f, ndfrees 0 [ 0] 90040000 [ 4] 00000000 [ 8] 00000000 [ c] e8000001 [10] 00000002 [14] 00000001 [18] 02000000 [1c] 00000023 [20] 0000007d [24] 7fffffff [28] 00000080 [2c] 0000007f [30] f8000082 [34] 002483b4 [38] 00000001 [3c] 00000000 You can see that the HW's consumer index (at offset 0x28) is 0x80, one more than the driver thinks it should be. Even more interesting is this dump from the other system on the other end of this netpipe run: context for CQN 82 cons_index 81, nfrees 17d481, ndfrees 0 [ 0] 00040100 [ 4] 00000000 [ 8] 00000000 [ c] e8000001 [10] 00000002 [14] 00000001 [18] 02000000 [1c] 00000023 [20] 00000080 [24] 00000080 [28] 000000a7 [2c] 00000081 [30] f8000082 [34] 002483b4 [38] 00000001 [3c] 00000000 Here the HW CI is 0xa7, way higher than the driver's value. It looks like we avoided the CQ overrun because the CI got bumped past the correct value more than once. 
Thanks, Roland Index: hw/mthca/mthca_provider.h =================================================================== --- hw/mthca/mthca_provider.h (revision 1370) +++ hw/mthca/mthca_provider.h (working copy) @@ -138,6 +138,8 @@ int cqn; int cons_index; int is_direct; + int nfrees; + int ndfrees; union { struct mthca_buf_list direct; struct mthca_buf_list *page_list; Index: hw/mthca/mthca_cq.c =================================================================== --- hw/mthca/mthca_cq.c (revision 1370) +++ hw/mthca/mthca_cq.c (working copy) @@ -535,6 +535,8 @@ } if (freed) { + cq->nfrees += freed; + cq->ndfrees += freed - 1; wmb(); inc_cons_index(dev, cq, freed); } @@ -706,6 +708,7 @@ spin_unlock_irq(&dev->cq_table.lock); cq->cons_index = 0; + cq->nfrees = cq->ndfrees = 0; kfree(dma_list); kfree(mailbox); @@ -764,11 +767,13 @@ mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", status); - if (0) { + if (1) { u32 *ctx = MAILBOX_ALIGN(mailbox); int j; printk(KERN_ERR "context for CQN %x\n", cq->cqn); + printk(KERN_ERR "cons_index %x, nfrees %x, ndfrees %x\n", + cq->cons_index, cq->nfrees, cq->ndfrees); for (j = 0; j < 16; ++j) printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); } From eric.lemoine at gmail.com Wed Dec 22 10:49:48 2004 From: eric.lemoine at gmail.com (Eric Lemoine) Date: Wed, 22 Dec 2004 19:49:48 +0100 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <1103484675.1050.158.camel@jzny.localdomain> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> Message-ID: <5cac192f04122210491d64d4b6@mail.gmail.com> On 19 Dec 2004 14:31:15 -0500, jamal wrote: > On Sat, 2004-12-18 at 00:44, David S. Miller wrote: > > > Perhaps one way to fix this is to add a pointer to a spinlock to > > the netdev struct, and have hold that the upper level grab that > > when NETIF_F_LLTX when doing queue state checks. Actually, that > > could end up being racy too. > > How about releasing the qlock only when the LLTX transmit lock is > grabbed? That should bring it to par with what it was originally. I dont like the idea of releasing inside the driver a lock taken outside. That might be just me... Instead, I would suggest to have LLTX drivers check whether queue is stopped after they grab their private tx lock and before they check tx ring fullness. That way we close the race window but keep the driver bug check around. See attached sungem patch. -- Eric -------------- next part -------------- A non-text attachment was scrubbed... Name: sungem-lltx.patch Type: application/octet-stream Size: 511 bytes Desc: not available URL: From roland at topspin.com Wed Dec 22 12:29:12 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 22 Dec 2004 12:29:12 -0800 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52oegnfo9l.fsf@topspin.com> (Roland Dreier's message of "Tue, 21 Dec 2004 11:24:22 -0800") References: <506C3D7B14CDD411A52C00025558DED60545D730@mtlex01.yok.mtl.com> <52oegnfo9l.fsf@topspin.com> Message-ID: <52r7licc13.fsf@topspin.com> Roland> Testing on PCI Express/Arbel systems (with dual 3.2 GHz Roland> Xeons) with FW 4.5.3, I've seen somewhat different Roland> behavior. Even after setting IPOIB_NUM_WC to 1 so that Roland> IPoIB never polls more than 1 CQE at a time (and so we Roland> always inc the CI by exactly 1), I still get the CQ Roland> overrun. 
FW 4.6.1 on these systems behaves the way Tavor FW 3.3.1 did on my systems: if I modify IPoIB to always poll 1 CQE at a time, it works fine. If IPoIB polls more than 1 CQE at a time then we see an overrun. - Roland From mst at mellanox.co.il Wed Dec 22 13:20:38 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 22 Dec 2004 23:20:38 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun In-Reply-To: <52r7licc13.fsf@topspin.com> References: <506C3D7B14CDD411A52C00025558DED60545D730@mtlex01.yok.mtl.com> <52oegnfo9l.fsf@topspin.com> <52r7licc13.fsf@topspin.com> Message-ID: <20041222212037.GB22913@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun": > Roland> Testing on PCI Express/Arbel systems (with dual 3.2 GHz > Roland> Xeons) with FW 4.5.3, I've seen somewhat different > Roland> behavior. Even after setting IPOIB_NUM_WC to 1 so that > Roland> IPoIB never polls more than 1 CQE at a time (and so we > Roland> always inc the CI by exactly 1), I still get the CQ > Roland> overrun. > > FW 4.6.1 on these systems behaves the way Tavor FW 3.3.1 did on my > systems: if I modify IPoIB to always poll 1 CQE at a time, it works > fine. If IPoIB polls more than 1 CQE at a time then we see an > overrun. > > - Roland Thanks, Roland! I'll be working on it. mst From davem at davemloft.net Wed Dec 22 20:29:19 2004 From: davem at davemloft.net (David S. Miller) Date: Wed, 22 Dec 2004 20:29:19 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <5cac192f04122210491d64d4b6@mail.gmail.com> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> Message-ID: <20041222202919.057b8331.davem@davemloft.net> On Wed, 22 Dec 2004 19:49:48 +0100 Eric Lemoine wrote: > Instead, I would suggest to have LLTX drivers check whether queue is > stopped after they grab their private tx lock and before they check tx > ring fullness. That way we close the race window but keep the driver > bug check around. > > See attached sungem patch. That sounds about right. Nice idea. It solves the race, and retains the error state check. I'll apply Eric's patch, and do something similar in the other LLTX drivers (except loopback which has not "queue" per se so doesn't need this stuff). From roland at topspin.com Wed Dec 22 20:38:34 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 22 Dec 2004 20:38:34 -0800 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <20041222202919.057b8331.davem@davemloft.net> (David S. Miller's message of "Wed, 22 Dec 2004 20:29:19 -0800") References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> <20041222202919.057b8331.davem@davemloft.net> Message-ID: <52pt11bpdh.fsf@topspin.com> David> That sounds about right. Nice idea. It solves the race, David> and retains the error state check. Great, I've made a similar change to the IP-over-IB driver. Thanks, Roland From tziporet at mellanox.co.il Thu Dec 23 01:07:14 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 23 Dec 2004 11:07:14 +0200 Subject: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun Message-ID: <506C3D7B14CDD411A52C00025558DED6064BEC29@mtlex01.yok.mtl.com> Hi, We found an issue with the FW that causing this overrun. 
The bug happened when incrementing the CQ consumer index by more than 1 while a CQE is written at the same time on this CQ. This bug is the same in the 3.3.1 and 4.6.1 versions. A FW with a fix will be provided next week. Tziporet -----Original Message----- From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] Sent: Wednesday, December 22, 2004 11:21 PM To: Roland Dreier Cc: openib-general at openib.org Subject: Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun": > Roland> Testing on PCI Express/Arbel systems (with dual 3.2 GHz > Roland> Xeons) with FW 4.5.3, I've seen somewhat different > Roland> behavior. Even after setting IPOIB_NUM_WC to 1 so that > Roland> IPoIB never polls more than 1 CQE at a time (and so we > Roland> always inc the CI by exactly 1), I still get the CQ > Roland> overrun. > > FW 4.6.1 on these systems behaves the way Tavor FW 3.3.1 did on my > systems: if I modify IPoIB to always poll 1 CQE at a time, it works > fine. If IPoIB polls more than 1 CQE at a time then we see an > overrun. > > - Roland Thanks, Roland! I'll be working on it. mst _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.lemoine at gmail.com Thu Dec 23 01:10:34 2004 From: eric.lemoine at gmail.com (Eric Lemoine) Date: Thu, 23 Dec 2004 10:10:34 +0100 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <20041222202919.057b8331.davem@davemloft.net> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> <20041222202919.057b8331.davem@davemloft.net> Message-ID: <5cac192f0412230110628749e3@mail.gmail.com> On Wed, 22 Dec 2004 20:29:19 -0800, David S. Miller wrote: > On Wed, 22 Dec 2004 19:49:48 +0100 > Eric Lemoine wrote: > > > Instead, I would suggest to have LLTX drivers check whether queue is > > stopped after they grab their private tx lock and before they check tx > > ring fullness. That way we close the race window but keep the driver > > bug check around. > > > > See attached sungem patch. > > That sounds about right. Nice idea. It solves the race, and retains > the error state check. > > I'll apply Eric's patch, and do something similar in the other LLTX > drivers (except loopback which has not "queue" per se so doesn't need > this stuff). Dave, I still have one concern with the LLTX code (and it may be that the correct patch is Jamal's): Without LLTX we do: lock(queue_lock), lock(xmit_lock), release(queue_lock), release(xmit_lock). With LLTX (without Jamal's patch) we do: lock(queue_lock), release(queue_lock), lock(tx_lock), release(tx_lock). LLTX doesn't look correct because it creates a race condition window between the two lock-protected sections. So you may want to reconsider Jamal's patch or pull out LLTX...
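Concretely, the suggested check would sit roughly like this in an LLTX driver's xmit routine -- a sketch only, not the attached sungem patch; the xxx_ names, priv->tx_lock and the elided ring handling are placeholder assumptions:

struct xxx_priv {
        spinlock_t tx_lock;     /* driver-private LLTX lock */
        /* ... tx ring state ... */
};

static int xxx_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct xxx_priv *priv = netdev_priv(dev);
        unsigned long flags;

        /* LLTX: the core no longer takes dev->xmit_lock around this call */
        spin_lock_irqsave(&priv->tx_lock, flags);

        /*
         * Re-check the queue state after taking the private lock, so a
         * CPU that raced past the qdisc-level check backs off instead
         * of tripping the "tx ring full" bug check.
         */
        if (unlikely(netif_queue_stopped(dev))) {
                spin_unlock_irqrestore(&priv->tx_lock, flags);
                return NETDEV_TX_BUSY;
        }

        /*
         * ... usual descriptor setup goes here; call netif_stop_queue(dev)
         * when the ring fills up ...
         */

        spin_unlock_irqrestore(&priv->tx_lock, flags);
        return NETDEV_TX_OK;
}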
Thanks, -- Eric From kaber at trash.net Thu Dec 23 08:37:24 2004 From: kaber at trash.net (Patrick McHardy) Date: Thu, 23 Dec 2004 17:37:24 +0100 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <5cac192f0412230110628749e3@mail.gmail.com> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> <20041222202919.057b8331.davem@davemloft.net> <5cac192f0412230110628749e3@mail.gmail.com> Message-ID: <41CAF444.3000305@trash.net> Eric Lemoine wrote: > I still have one concern with the LLTX code (and it may be that the > correct patch is Jamal's) : > > Without LLTX we do : lock(queue_lock), lock(xmit_lock), > release(queue_lock), release(xmit_lock). With LLTX (without Jamal's > patch) we do : lock(queue_lock), release(queue_lock), lock(tx_lock), > release(tx_lock). LLTX doesn't look correct because it creates a race > condition window between the the two lock-protected sections. So you > may want to reconsider Jamal's patch or pull out LLTX... You're right, it can cause packet reordering if something like this happens: CPU1 CPU2 lock(queue_lock) dequeue unlock(queue_lock) lock(queue_lock) dequeue unlock(queue_lock) lock(xmit_lock) hard_start_xmit unlock(xmit_lock) lock(xmit_lock) hard_start_xmit unlock(xmit_lock) Jamal's patch should fix this. Regards Patrick From roland at topspin.com Thu Dec 23 09:22:52 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 09:22:52 -0800 Subject: [openib-general] Problem with userspace tree structure Message-ID: <52llbp7wur.fsf@topspin.com> I started looking at the code that's being checked in under src/userspace/, and I think the current structure needs to be modified. As it stands now, the whole userspace/ tree is treated as a single source package with a Makefile at the userspace/ level. I think it would be far better to return to the older structure where the userspace/ directory just has subdirectories for various packages (such as userspace/tvflash). If we continue with a monolithic userspace/ tree, then it becomes much harder to do releases (and also harder for users to build what they're interested in). For example, it will become very annoying to have to do a full release of everything including opensm just to update ibstat. In fact it's probably already very confusing that one can do a make from userspace/, but to build tvflash one has to do a make from userspace/tvflash. Thanks, Roland From robert.j.woodruff at intel.com Thu Dec 23 09:33:51 2004 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 23 Dec 2004 09:33:51 -0800 Subject: [openib-general] Problem with userspace tree structure Message-ID: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> Roland> I started looking at the code that's being checked in under >src/userspace/, and I think the current structure needs to be >modified. As it stands now, the whole userspace/ tree is treated as a >single source package with a Makefile at the userspace/ level. I don't have any objection. My only suggestion would be to have a structure where someone can easily build all of the components from a top level Makefile or easily build each individual component. woody From mst at mellanox.co.il Thu Dec 23 09:39:02 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 23 Dec 2004 19:39:02 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <52llbp7wur.fsf@topspin.com> References: <52llbp7wur.fsf@topspin.com> Message-ID: <20041223173902.GA29875@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "[openib-general] Problem with userspace tree structure": > I started looking at the code that's being checked in under > src/userspace/, and I think the current structure needs to be > modified. As it stands now, the whole userspace/ tree is treated as a > single source package with a Makefile at the userspace/ level. > > I think it would be far better to return to the older structure where > the userspace/ directory just has subdirectories for various packages > (such as userspace/tvflash). If we continue with a monolithic > userspace/ tree, then it becomes much harder to do releases (and also > harder for users to build what they're interested in). For example, > it will become very annoying to have to do a full release of > everything including opensm just to update ibstat. > > In fact it's probably already very confusing that one can do a make > from userspace/, but to build tvflash one has to do a make from > userspace/tvflash. > > Thanks, > Roland Apropos tvflash, I think we want to copy mstflint from https://openib.org/svn/trunk/contrib/mellanox/mstflint/ to somewhere under https://openib.org/svn/gen2/trunk/src/userspace/ As with tvflash, no driver is required, and Mellanox does not support flashing tvflash for Mellanox-produced boards, you really need mstflint for these. OK? MST From halr at voltaire.com Thu Dec 23 09:34:44 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2004 12:34:44 -0500 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <52llbp7wur.fsf@topspin.com> References: <52llbp7wur.fsf@topspin.com> Message-ID: <1103823283.4054.85.camel@localhost.localdomain> On Thu, 2004-12-23 at 12:22, Roland Dreier wrote: > I started looking at the code that's being checked in under > src/userspace/, and I think the current structure needs to be > modified. As it stands now, the whole userspace/ tree is treated as a > single source package with a Makefile at the userspace/ level. I'm pretty sure this is temporary and just a quick way to get started. At least opensm will be moving to autotools. This would be done at the osm level (userspace subdirectory). I would expect diags and util to be similar. I think those are the subdirs tied together with the makefile at the userspace level now. -- Hal From halr at voltaire.com Thu Dec 23 09:45:15 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2004 12:45:15 -0500 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223173902.GA29875@mellanox.co.il> References: <52llbp7wur.fsf@topspin.com> <20041223173902.GA29875@mellanox.co.il> Message-ID: <1103823915.4054.88.camel@localhost.localdomain> On Thu, 2004-12-23 at 12:39, Michael S. Tsirkin wrote: > Apropos tvflash, I think we want to copy mstflint from > https://openib.org/svn/trunk/contrib/mellanox/mstflint/ > to somewhere under https://openib.org/svn/gen2/trunk/src/userspace/ > > As with tvflash, no driver is required, > and Mellanox does not support flashing tvflash for Mellanox-produced boards, > you really need mstflint for these. > > OK? A userspace subdir for mstflint is fine with me. 
-- Hal From eric.lemoine at gmail.com Thu Dec 23 10:11:50 2004 From: eric.lemoine at gmail.com (Eric Lemoine) Date: Thu, 23 Dec 2004 19:11:50 +0100 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <41CAF444.3000305@trash.net> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> <20041222202919.057b8331.davem@davemloft.net> <5cac192f0412230110628749e3@mail.gmail.com> <41CAF444.3000305@trash.net> Message-ID: <5cac192f0412231011471763f3@mail.gmail.com> On Thu, 23 Dec 2004 17:37:24 +0100, Patrick McHardy wrote: > Eric Lemoine wrote: > > I still have one concern with the LLTX code (and it may be that the > > correct patch is Jamal's) : > > > > Without LLTX we do : lock(queue_lock), lock(xmit_lock), > > release(queue_lock), release(xmit_lock). With LLTX (without Jamal's > > patch) we do : lock(queue_lock), release(queue_lock), lock(tx_lock), > > release(tx_lock). LLTX doesn't look correct because it creates a race > > condition window between the the two lock-protected sections. So you > > may want to reconsider Jamal's patch or pull out LLTX... > > You're right, it can cause packet reordering That's exactly what I was thinking of. > if something like this > happens: > > CPU1 CPU2 > lock(queue_lock) > dequeue > unlock(queue_lock) > lock(queue_lock) > dequeue > unlock(queue_lock) > lock(xmit_lock) > hard_start_xmit > unlock(xmit_lock) > lock(xmit_lock) > hard_start_xmit > unlock(xmit_lock) > > Jamal's patch should fix this. -- Eric From roland at topspin.com Thu Dec 23 10:17:26 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 10:17:26 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> (Robert J. Woodruff's message of "Thu, 23 Dec 2004 09:33:51 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> Message-ID: <52hdmc98w9.fsf@topspin.com> Robert> I don't have any objection. My only suggestion would be Robert> to have a structure where someone can easily build all of Robert> the components from a top level Makefile or easily build Robert> each individual component. Yes, each subdirectory of userspace should be set up with autotools so a standard "./configure && make && make install" is all that's needed. - R. From mst at mellanox.co.il Thu Dec 23 10:40:04 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 23 Dec 2004 20:40:04 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <52hdmc98w9.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> <52hdmc98w9.fsf@topspin.com> Message-ID: <20041223184004.GA30030@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] Problem with userspace tree structure": > Robert> I don't have any objection. My only suggestion would be > Robert> to have a structure where someone can easily build all of > Robert> the components from a top level Makefile or easily build > Robert> each individual component. > > Yes, each subdirectory of userspace should be set up with autotools so > a standard "./configure && make && make install" is all that's needed. > > - R. The problem is that the autotools files are generated, so having them in svn is a problem - conflicts cant be resolved properly - while not having them in svn is a problem - you require all users to have autotools to build. 
mst From tduffy at sun.com Thu Dec 23 10:48:42 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 10:48:42 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223184004.GA30030@mellanox.co.il> References: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> <52hdmc98w9.fsf@topspin.com> <20041223184004.GA30030@mellanox.co.il> Message-ID: <1103827723.10426.3.camel@duffman> On Thu, 2004-12-23 at 20:40 +0200, Michael S. Tsirkin wrote: > The problem is that the autotools files are generated, so having > them in svn is a problem - conflicts cant be resolved properly - > while not having them in svn is a problem - you require all users > to have autotools to build. Yes, and any distribution of Linux includes them. You also require all users to have a compiler... -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Dec 23 10:51:15 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 10:51:15 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223184004.GA30030@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 23 Dec 2004 20:40:04 +0200") References: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> <52hdmc98w9.fsf@topspin.com> <20041223184004.GA30030@mellanox.co.il> Message-ID: <52brck97bw.fsf@topspin.com> Michael> The problem is that the autotools files are generated, so Michael> having them in svn is a problem - conflicts cant be Michael> resolved properly - while not having them in svn is a Michael> problem - you require all users to have autotools to Michael> build. Good point. What I really meant is that every subdirectory of userspace/ should have it's own autotools setup that generates configure etc. with a run of "autogen.sh." That makes it easy to make tarballs for users to build. Developers or others who want to run from subversion need autotools (which are trivial to install on any distribution modern enough to run 2.6 kernels). - R. From tduffy at sun.com Thu Dec 23 10:52:14 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 10:52:14 -0800 Subject: [openib-general] [OT] out of office replies Message-ID: <1103827939.10426.10.camel@duffman> Can people please fix their out-of-office scripts to not respond to mailing lists. Thanks, -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From tduffy at sun.com Thu Dec 23 11:33:38 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 11:33:38 -0800 Subject: [openib-general] [PATCH][TRIVIAL] Fix ipoib build Message-ID: <1103830418.16121.1.camel@duffman> Signed-off-by: Tom Duffy Index: drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- drivers/infiniband/ulp/ipoib/ipoib_main.c (revision 1376) +++ drivers/infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -543,7 +543,7 @@ * set, we can't rely on netif_stop_queue() preventing our * xmit function from being called with a full queue. */ - if (unlikely(netif_queue_stopped(dev))) + if (unlikely(netif_queue_stopped(dev))) { spin_unlock_irqrestore(&priv->tx_lock, flags); return NETDEV_TX_BUSY; } -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Dec 23 11:41:35 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 11:41:35 -0800 Subject: [openib-general] [PATCH][TRIVIAL] Fix ipoib build In-Reply-To: <1103830418.16121.1.camel@duffman> (Tom Duffy's message of "Thu, 23 Dec 2004 11:33:38 -0800") References: <1103830418.16121.1.camel@duffman> Message-ID: <527jn89500.fsf@topspin.com> Oops, sorry. Applied... - R. From roland at topspin.com Thu Dec 23 11:47:46 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 11:47:46 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <52brck97bw.fsf@topspin.com> (Roland Dreier's message of "Thu, 23 Dec 2004 10:51:15 -0800") References: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> <52hdmc98w9.fsf@topspin.com> <20041223184004.GA30030@mellanox.co.il> <52brck97bw.fsf@topspin.com> Message-ID: <52y8fo7q59.fsf@topspin.com> By the way, I don't think we have to mandate autotools for every package under the userspace/ directory (although using autotools is the friendliest for users, packagers and developers). My real objection was just to the recent checkin of Makefiles etc. directly under the userspace/ directory. I think userspace/ should just have subdirectories for each OpenIB package and nothing else. eg take a look at http://cvs.gnome.org/viewcvs/ to see what I think our userspace/ directory should look like. - R. From tduffy at sun.com Thu Dec 23 11:52:28 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 11:52:28 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <52brck97bw.fsf@topspin.com> References: <1AC79F16F5C5284499BB9591B33D6F0003206023@orsmsx408> <52hdmc98w9.fsf@topspin.com> <20041223184004.GA30030@mellanox.co.il> <52brck97bw.fsf@topspin.com> Message-ID: <1103831548.16121.5.camel@duffman> On Thu, 2004-12-23 at 10:51 -0800, Roland Dreier wrote: > Michael> The problem is that the autotools files are generated, so > Michael> having them in svn is a problem - conflicts cant be > Michael> resolved properly - while not having them in svn is a > Michael> problem - you require all users to have autotools to > Michael> build. > > Good point. What I really meant is that every subdirectory of > userspace/ should have it's own autotools setup that generates > configure etc. with a run of "autogen.sh." That makes it easy to make > tarballs for users to build. Developers or others who want to run > from subversion need autotools (which are trivial to install on any > distribution modern enough to run 2.6 kernels). I have seen a lot of projects who have only autogen.sh in the source control, but then have a tarball available that has a generated configure script for the user to run. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mst at mellanox.co.il Thu Dec 23 12:03:57 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 23 Dec 2004 22:03:57 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <1103827723.10426.3.camel@duffman> References: <1103827723.10426.3.camel@duffman> Message-ID: <20041223200357.GA30207@mellanox.co.il> Hello! Quoting r. 
Tom Duffy (tduffy at sun.com) "Re: [openib-general] Problem with userspace tree structure": > On Thu, 2004-12-23 at 20:40 +0200, Michael S. Tsirkin wrote: > > The problem is that the autotools files are generated, so having > > them in svn is a problem - conflicts cant be resolved properly - > > while not having them in svn is a problem - you require all users > > to have autotools to build. > > Yes, and any distribution of Linux includes them. You also require all > users to have a compiler... > > -tduffy There's a difference. There are no real standards, and the language seems to be not well defined and not stable between autotools versions. At least I see that many tools that went this way require specific autogen revisions. MST From tduffy at sun.com Thu Dec 23 12:14:27 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 12:14:27 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223200357.GA30207@mellanox.co.il> References: <1103827723.10426.3.camel@duffman> <20041223200357.GA30207@mellanox.co.il> Message-ID: <1103832867.16121.11.camel@duffman> On Thu, 2004-12-23 at 22:03 +0200, Michael S. Tsirkin wrote: > There's a difference. > > There are no real standards, and the language seems to be not well > defined and not stable between autotools versions. At least I see > that many tools that went this way require specific autogen revisions. Hrm, sounds like you are describing gcc ;) Kidding aside, most distributions package a whole slew of versions for this very reason. On my Fedora 3 system, I have automake 1.4, 1.5, 1.6, 1.7 and 1.9; autoconf 2.13 and 2.59. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From johannes at erdfelt.com Thu Dec 23 12:50:45 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Thu, 23 Dec 2004 12:50:45 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223173902.GA29875@mellanox.co.il> References: <52llbp7wur.fsf@topspin.com> <20041223173902.GA29875@mellanox.co.il> Message-ID: <20041223205045.GJ1107@sventech.com> On Thu, Dec 23, 2004, Michael S. Tsirkin wrote: > Apropos tvflash, I think we want to copy mstflint from > https://openib.org/svn/trunk/contrib/mellanox/mstflint/ > to somewhere under https://openib.org/svn/gen2/trunk/src/userspace/ > > As with tvflash, no driver is required, > and Mellanox does not support flashing tvflash for Mellanox-produced boards, > you really need mstflint for these. I have no problem with adding mstflint. tvflash should flash firmware images according to the specs I have and in fact, I have burned binary images generated by infiniburn with tvflash. Is there something wrong with the way tvflash burns images? JE From tduffy at sun.com Thu Dec 23 13:10:14 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 13:10:14 -0800 Subject: [openib-general] osm In-Reply-To: References: Message-ID: <1103836214.16121.20.camel@duffman> On Sun, 2004-12-19 at 20:26 +0200, shaharf wrote: > Hopefully everything will compile and install and then you > can run opensm by > > $OPENIB_ROOT/bin/opensm & > > > > It will run on the first existing port. Plese verify that > it is active. You can override that by using “-g portguid_in_hex”. On > any case of problems that you want me to assist, please run the opensm > with –V and send me the log file (/tmp/osm.log). 
How is a port supposed to be active without an SM running? I am getting: [root at flopteron2 bin]# ./opensm -V ------------------------------------------------- OpenSM Rev:1.6.0-rc2 Command Line Arguments: Big V selected Log File: /tmp/osm.log ------------------------------------------------- using default guid 0x2c9010a99e031 return umad_open_port -5 return umad_port_id -5 Error from osm_opensm_bind (0x2A) [root at flopteron2 bin]# ./ibstat CA 'mthca0': CA type: MT25208 (MT23108 coma0 Numer of ports: 2 Firmware version: 4.5.3 Hardware version: a0 Node GUID: 0x0002c9010a99e030 System image GUID: 0x0002c9010a99e033 Port 1: State: Down Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00500a68 Port GUID: 0x0002c9010a99e031 Port 2: State: Down Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x00500a68 Port GUID: 0x0002c9010a99e032 > Don’t forget to modprobe ib_umad and configure udev before > using the userspace staff. When you say configure udev, what do you mean? I have /dev/umad[01] -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: osm.log Type: text/x-log Size: 17209 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From mst at mellanox.co.il Thu Dec 23 13:27:41 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 23 Dec 2004 23:27:41 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223205045.GJ1107@sventech.com> References: <52llbp7wur.fsf@topspin.com> <20041223173902.GA29875@mellanox.co.il> <20041223205045.GJ1107@sventech.com> Message-ID: <20041223212741.GA30425@mellanox.co.il> Hello! Quoting r. Johannes Erdfelt (johannes at erdfelt.com) "Re: [openib-general] Problem with userspace tree structure": > On Thu, Dec 23, 2004, Michael S. Tsirkin wrote: > > Apropos tvflash, I think we want to copy mstflint from > > https://openib.org/svn/trunk/contrib/mellanox/mstflint/ > > to somewhere under https://openib.org/svn/gen2/trunk/src/userspace/ > > > > As with tvflash, no driver is required, > > and Mellanox does not support flashing tvflash for Mellanox-produced boards, > > you really need mstflint for these. > > I have no problem with adding mstflint. > > tvflash should flash firmware images according to the specs I have and > in fact, I have burned binary images generated by infiniburn with > tvflash. Is there something wrong with the way tvflash burns images? > > JE I didnt say that. I didnt look at tvflash in any detail. It's midnight here' I'll try to clarify on Sunday. mst From shaharf at voltaire.com Thu Dec 23 13:27:44 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 23 Dec 2004 23:27:44 +0200 Subject: [openib-general] Problem with userspace tree structure Message-ID: >By the way, I don't think we have to mandate autotools for every >package under the userspace/ directory (although using autotools is >the friendliest for users, packagers and developers). > >My real objection was just to the recent checkin of Makefiles >etc. directly under the userspace/ directory. I think userspace/ >should just have subdirectories for each OpenIB package and nothing >else. eg take a look at http://cvs.gnome.org/viewcvs/ to see what I >think our userspace/ directory should look like. > > - R. Autotools will be used sometime in the near future. 
As for the userspace directory itself, I really see no harm in having a simple makefile that will build everything below it. It doesn't collide with treating each of the sub-trees as a different package, having its autotools setup in it. As I see it, the usermode tools are one set of tools sharing some libraries. I saw several distributions (such as an SVN distribution) that had many packages, even from completely different sources under one root, and the root's make file did build all of them. This can be very convenient, and free the user from a complicated install procedure where you have to manually build many libraries and tools in a specific order. Make is the tool to do it for you and it is much more common than autotools.... As a matter of fact, this is similar to other tools packages that have many tools under one roof (most tool chains). I am not sure http://cvs.gnome.org/viewcvs/ is a good example for us - it contains hundreds of related and unrelated projects that there is really no sense in building all of them. We are not there. What we have now is a set of tools. If or when the number of usermode projects grows, and there are sets of tools that are dedicated to some environment or task, I would separate the usermode root into some sub-trees such as switch, host, MPI, ... Right now it is unnecessary overhead. BTW, the reason I didn't include tvflash in my make file is because it doesn't build at all on my SUSE9.1 (opteron). If it is only a local problem of mine, I will be glad to include it. Same with any other utilities. Shahar From shaharf at voltaire.com Thu Dec 23 13:29:07 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 23 Dec 2004 23:29:07 +0200 Subject: [openib-general] Gen2 OpenSM Message-ID: OK, I have good news and bad news: The good news is that the committed OpenSM can bring up OpenIB (at least I could do it once or twice... ;-). It is also merged with Mellanox 1.6.1 OpenSM. The bad news is that I am pretty sure that the committed openib stack will not work - there is a bug in user_mad that has to be fixed. I forgot to check it in, and I can't do it now (I have no svn at home ;-). If anybody needs it now I will simply send my version of the file and you can use it. Hal has it too if I will not be available. Please remember that this is a *very* preliminary version, and the fact that it may work sometimes should be treated as a Christmas miracle.... Retries and many other things are still missing. Shahar From mst at mellanox.co.il Thu Dec 23 13:31:29 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 23 Dec 2004 23:31:29 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: References: Message-ID: <20041223213129.GC30425@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Problem with userspace tree structure": > BTW, the reason I didn't include tvflash in my make file is because it doesn't build at all on my SUSE9.1 (opteron). If it is only a local problem of mine, I will be glad to include it. Same with any other utilities. > > Shahar > I guess that's why one needs ./configure :) From mst at mellanox.co.il Thu Dec 23 13:33:30 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 23 Dec 2004 23:33:30 +0200 Subject: [openib-general] Gen2 OpenSM In-Reply-To: References: Message-ID: <20041223213330.GD30425@mellanox.co.il> Hello! Quoting r.
shaharf (shaharf at voltaire.com) "[openib-general] Gen2 OpenSM": > OK, I have good news and bad news: > > The good news are that the committed OpenSM can bring up OpenIB (at least I could do it once or twice... ;-). > > It is also merged with Mellanox 1.6.1 OpenSM. > > The bad news are that I am pretty sure that the committed openib stack will not work - there is a bug in user_mad that has to be fixed. I forgot to check it in, and I can't do it now (I have no svn at home ;-). You can install it even on win98. > If anybody needs it now it will simply send my version of the file and you can use it. Hal has it too if I will not be available. Cant Hal commit it then? mst From shaharf at voltaire.com Thu Dec 23 13:32:26 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 23 Dec 2004 23:32:26 +0200 Subject: [openib-general] Problem with userspace tree structure Message-ID: >Hello! >Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Problem with userspace tree structure": >> BTW, the reason I didn't include tvflash in my make file, because it doesn't build at all on my SUSE9.1 (option). If it is only a local problem of mine, I will be glad to include it. Same with any other utilities. >> >> Shahar >> > I guess thats why one needs ./configure :) It has one - or to be exact you can generate one. It doesn't help if the relevent headers files are missing ;-( Shahar From tduffy at sun.com Thu Dec 23 13:46:23 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 13:46:23 -0800 Subject: [openib-general] [PATCH] fix CA type name truncation Message-ID: <1103838383.16121.28.camel@duffman> Increase the size of ca_type to accommodate string printed out of ARBEL in TAVOR compat mode. Before: [root at flopteron2 bin]# ./ibstat CA 'mthca0': CA type: MT25208 (MT23108 coma0 After: [root at flopteron2 bin]# ./ibstat CA 'mthca0': CA type: MT25208 (MT23108 compat mode) Note: there is still a bug (probably in sys_read_string()) that does not zero out the last character, thus the a0 is printed (which is the next string in the struct). Index: gen2/trunk/src/userspace/include/umad.h =================================================================== --- gen2/trunk/src/userspace/include/umad.h (revision 1376) +++ gen2/trunk/src/userspace/include/umad.h (working copy) @@ -68,7 +68,7 @@ char ca_name[UMAD_CA_NAME_LEN]; int numports; char fw_ver[20]; - char ca_type[20]; + char ca_type[40]; char hw_ver[20]; uint64_t node_guid; uint64_t system_guid; -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From shaharf at voltaire.com Thu Dec 23 13:48:30 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 23 Dec 2004 23:48:30 +0200 Subject: [openib-general] Gen2 OpenSM Message-ID: Ok, I will send the relevant function from user_mad file. Please forgive me not to send you a proper patch. I am already at home and I don't want to mess up my Wife's computer ;-) Aha - BTW, what I really meant is that OpenSM can bring up IPoIB (small mistake for me, big progress for the openib...;-). I will send a proper patch on Sunday. Happy Christmas everybody! 
Shahar _________________________________ static int mad_is_solicit(struct ib_mad_hdr *madhdr) { int method = madhdr->method; // filter only request methods according to IB spec V1.2 13.4.5 and C13-6 return (method < IB_MGMT_METHOD_RESP && method != IB_MGMT_METHOD_SEND && method != IB_MGMT_METHOD_TRAP && method != IB_MGMT_METHOD_TRAP_REPRESS); } static ssize_t ib_umad_write(struct file *filp, const char __user *buf, size_t count, loff_t *pos) { struct ib_umad_file *file = filp->private_data; struct ib_umad_packet *packet; struct ib_mad_agent *agent; struct ib_ah_attr ah_attr; struct ib_sge gather_list; struct ib_send_wr *bad_wr, wr = { .opcode = IB_WR_SEND, .sg_list = &gather_list, .num_sge = 1, .send_flags = IB_SEND_SIGNALED, }; int ret; if (count < sizeof (struct ib_user_mad)) return -EINVAL; packet = kmalloc(sizeof *packet, GFP_KERNEL); if (!packet) return -ENOMEM; if (copy_from_user(&packet->mad, buf, sizeof packet->mad)) { kfree(packet); return -EFAULT; } if (packet->mad.id < 0 || packet->mad.id >= IB_UMAD_MAX_AGENTS) { ret = -EINVAL; goto err; } down_read(&file->agent_mutex); agent = file->agent[packet->mad.id]; if (!agent) { ret = -EINVAL; goto err_up; } if (mad_is_solicit((struct ib_mad_hdr *)packet->mad.data)) { ((struct ib_mad_hdr *) packet->mad.data)->tid = cpu_to_be64(((u64) agent->hi_tid) << 32 | (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & 0xffffffff)); } memset(&ah_attr, 0, sizeof ah_attr); ah_attr.dlid = be16_to_cpu(packet->mad.lid); ah_attr.sl = packet->mad.sl; ah_attr.src_path_bits = packet->mad.path_bits; ah_attr.port_num = file->port->port_num; /* XXX handle GRH */ packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); if (IS_ERR(packet->ah)) { ret = PTR_ERR(packet->ah); goto err_up; } gather_list.addr = dma_map_single(agent->device->dma_device, packet->mad.data, sizeof packet->mad.data, DMA_TO_DEVICE); gather_list.length = sizeof packet->mad.data; gather_list.lkey = file->mr[packet->mad.id]->lkey; pci_unmap_addr_set(packet, mapping, gather_list.addr); wr.wr.ud.mad_hdr = (struct ib_mad_hdr *) packet->mad.data; wr.wr.ud.ah = packet->ah; wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); wr.wr.ud.timeout_ms = packet->mad.timeout_ms; wr.wr_id = (unsigned long) packet; ret = ib_post_send_mad(agent, &wr, &bad_wr); if (ret) { dma_unmap_single(agent->device->dma_device, pci_unmap_addr(packet, mapping), sizeof packet->mad.data, DMA_TO_DEVICE); goto err_up; } up_read(&file->agent_mutex); return sizeof packet->mad; err_up: up_read(&file->agent_mutex); err: kfree(packet); return ret; } From shaharf at voltaire.com Thu Dec 23 14:02:26 2004 From: shaharf at voltaire.com (shaharf) Date: Fri, 24 Dec 2004 00:02:26 +0200 Subject: [openib-general] osm Message-ID: ________________________________ From: Tom Duffy [mailto:tduffy at sun.com] Sent: Thu 12/23/2004 11:10 PM To: shaharf Cc: openib-general at openib.org Subject: Re: [openib-general] osm On Sun, 2004-12-19 at 20:26 +0200, shaharf wrote: > Hopefully everything will compile and install and then you > can run opensm by > > $OPENIB_ROOT/bin/opensm & > > > > It will run on the first existing port. Plese verify that > it is active. You can override that by using "-g portguid_in_hex". On > any case of problems that you want me to assist, please run the opensm > with -V and send me the log file (/tmp/osm.log). How is a port supposed to be active without an SM running? -- You are completely right - I feel foolish.... 
What I really should meant is that the port should be Physically UP. Unfortunately, there is no sys file for that right now. I hope that it will be fixed soon. > Don't forget to modprobe ib_umad and configure udev before > using the userspace staff. When you say configure udev, what do you mean? I have /dev/umad[01] -tduffy -- Please read Roland's doc on user mad (linux-kernel/doc or something similar). If it won't help I will send you my configuraion next Sunday. I am not sure that this is your problem. I will look into it on Sunday. In the meanwhile, can you recheck with the latest version? Shahar From johannes at erdfelt.com Thu Dec 23 14:06:18 2004 From: johannes at erdfelt.com (Johannes Erdfelt) Date: Thu, 23 Dec 2004 14:06:18 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041223212741.GA30425@mellanox.co.il> References: <52llbp7wur.fsf@topspin.com> <20041223173902.GA29875@mellanox.co.il> <20041223205045.GJ1107@sventech.com> <20041223212741.GA30425@mellanox.co.il> Message-ID: <20041223220618.GK1107@sventech.com> On Thu, Dec 23, 2004, Michael S. Tsirkin wrote: > Hello! > Quoting r. Johannes Erdfelt (johannes at erdfelt.com) "Re: [openib-general] Problem with userspace tree structure": > > On Thu, Dec 23, 2004, Michael S. Tsirkin wrote: > > > Apropos tvflash, I think we want to copy mstflint from > > > https://openib.org/svn/trunk/contrib/mellanox/mstflint/ > > > to somewhere under https://openib.org/svn/gen2/trunk/src/userspace/ > > > > > > As with tvflash, no driver is required, > > > and Mellanox does not support flashing tvflash for Mellanox-produced boards, > > > you really need mstflint for these. > > > > I have no problem with adding mstflint. > > > > tvflash should flash firmware images according to the specs I have and > > in fact, I have burned binary images generated by infiniburn with > > tvflash. Is there something wrong with the way tvflash burns images? > > I didnt say that. I didnt look at tvflash in any detail. > It's midnight here' I'll try to clarify on Sunday. Oh, I misunderstood you when you said "really need". If you find any issues, let me or Roland know and we can take care of it. Thanks. JE From roland at topspin.com Thu Dec 23 14:10:45 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 14:10:45 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: (shaharf@voltaire.com's message of "Thu, 23 Dec 2004 23:27:44 +0200") References: Message-ID: <52pt107jiy.fsf@topspin.com> shaharf> BTW, the reason I didn't include tvflash in my make file, shaharf> because it doesn't build at all on my SUSE9.1 shaharf> (opteron). If it is only a local problem of mine, I will shaharf> be glad to inclue it. Same with any other utilities. If you give a hint as to the problem you saw trying to build it, I can probably help. - Roland From tduffy at sun.com Thu Dec 23 14:20:11 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 14:20:11 -0800 Subject: [openib-general] [PATCH] Gen2 OpenSM In-Reply-To: References: Message-ID: <1103840411.1886.5.camel@duffman> On Thu, 2004-12-23 at 23:48 +0200, shaharf wrote: > I will send a proper patch on Sunday. BTW, your mailer munged the whitespace. > _________________________________ > > static int mad_is_solicit(struct ib_mad_hdr *madhdr) > { > int method = madhdr->method; > > // filter only request methods according to IB spec V1.2 13.4.5 and C13-6 Please don't use C++ isms in the kernel code. 
Anyways, here is the patch from the whitespace-fixed, comment-fixed version: Index: drivers/infiniband/core/user_mad.c =================================================================== --- drivers/infiniband/core/user_mad.c (revision 1377) +++ drivers/infiniband/core/user_mad.c (working copy) @@ -222,6 +222,21 @@ return ret; } +static int mad_is_solicit(struct ib_mad_hdr *madhdr) +{ + int method = madhdr->method; + + /* + * filter only request methods according to + * IB spec V1.2 13.4.5 and C13-6 + */ + return (method < IB_MGMT_METHOD_RESP && + method != IB_MGMT_METHOD_SEND && + method != IB_MGMT_METHOD_TRAP && + method != IB_MGMT_METHOD_TRAP_REPRESS); +} + + static ssize_t ib_umad_write(struct file *filp, const char __user *buf, size_t count, loff_t *pos) { @@ -263,10 +278,12 @@ goto err_up; } - ((struct ib_mad_hdr *) packet->mad.data)->tid = - cpu_to_be64(((u64) agent->hi_tid) << 32 | - (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & - 0xffffffff)); + if (mad_is_solicit((struct ib_mad_hdr *)packet->mad.data)) { + ((struct ib_mad_hdr *) packet->mad.data)->tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpu(((struct ib_mad_hdr *) + packet->mad.data)->tid) & 0xffffffff)); + } memset(&ah_attr, 0, sizeof ah_attr); ah_attr.dlid = be16_to_cpu(packet->mad.lid); -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Dec 23 14:22:11 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 14:22:11 -0800 Subject: [openib-general] Gen2 OpenSM In-Reply-To: (shaharf@voltaire.com's message of "Thu, 23 Dec 2004 23:48:30 +0200") References: Message-ID: <52llbo7izw.fsf@topspin.com> shaharf> Ok, I will send the relevant function from user_mad shaharf> file. Please forgive me not to send you a proper patch. I shaharf> am already at home and I don't want to mess up my Wife's shaharf> computer ;-) Aha - BTW, what I really meant is that shaharf> OpenSM can bring up IPoIB (small mistake for me, big shaharf> progress for the openib...;-). Here's the patch that I committed, I think I got the idea. By the way, I had the umad module set the TID when sending traps, since we do expect to get a response in the form of a trap repress. - Roland Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 1375) +++ infiniband/core/user_mad.c (working copy) @@ -236,6 +236,8 @@ .num_sge = 1, .send_flags = IB_SEND_SIGNALED, }; + u8 method; + u64 *tid; int ret; if (count < sizeof (struct ib_user_mad)) @@ -263,11 +265,22 @@ goto err_up; } - ((struct ib_mad_hdr *) packet->mad.data)->tid = - cpu_to_be64(((u64) agent->hi_tid) << 32 | - (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & - 0xffffffff)); + /* + * If userspace is generating a request that will generate a + * response, we need to make sure the high-order part of the + * transaction ID matches the agent being used to send the + * MAD. 
+ */ + method = ((struct ib_mad_hdr *) packet->mad.data)->method; + if (!(method & IB_MGMT_METHOD_RESP) && + method != IB_MGMT_METHOD_TRAP_REPRESS && + method != IB_MGMT_METHOD_SEND) { + tid = &((struct ib_mad_hdr *) packet->mad.data)->tid; + *tid = cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpup(tid) & 0xffffffff)); + } + memset(&ah_attr, 0, sizeof ah_attr); ah_attr.dlid = be16_to_cpu(packet->mad.lid); ah_attr.sl = packet->mad.sl; From halr at voltaire.com Thu Dec 23 14:21:13 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2004 17:21:13 -0500 Subject: [openib-general] [PATCH] user_mad: Fix handling of non request TIDs in ib_umad_write Message-ID: <1103840473.4054.346.camel@localhost.localdomain> user_mad: Fix handling of non request TIDs in ib_umad_write Add "mad_is_solicit" function and its invocation from ib_umad_write. This is required to avoid trashing non request TIDs. Index: user_mad.c =================================================================== --- user_mad.c (revision 1377) +++ user_mad.c (working copy) @@ -183,6 +183,17 @@ ib_free_recv_mad(mad_recv_wc); } +static int mad_is_solicit(struct ib_mad_hdr *madhdr) +{ + int method = madhdr->method; + + // filter only request methods according to IB spec V1.2 13.4.5 and C13-6 + return (method < IB_MGMT_METHOD_RESP && + method != IB_MGMT_METHOD_SEND && + method != IB_MGMT_METHOD_TRAP && + method != IB_MGMT_METHOD_TRAP_REPRESS); +} + static ssize_t ib_umad_read(struct file *filp, char __user *buf, size_t count, loff_t *pos) { @@ -263,10 +274,12 @@ goto err_up; } - ((struct ib_mad_hdr *) packet->mad.data)->tid = - cpu_to_be64(((u64) agent->hi_tid) << 32 | - (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & - 0xffffffff)); + if (mad_is_solicit((struct ib_mad_hdr *)packet->mad.data)) { + ((struct ib_mad_hdr *) packet->mad.data)->tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & + 0xffffffff)); + } memset(&ah_attr, 0, sizeof ah_attr); ah_attr.dlid = be16_to_cpu(packet->mad.lid); From tduffy at sun.com Thu Dec 23 14:28:49 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 14:28:49 -0800 Subject: [openib-general] osm In-Reply-To: References: Message-ID: <1103840929.1886.8.camel@duffman> On Fri, 2004-12-24 at 00:02 +0200, shaharf wrote: > -- Please read Roland's doc on user mad (linux-kernel/doc or something > similar). If it won't help I will send you my configuraion next > Sunday. I am not sure that this is your problem. I will look into it > on Sunday. In the meanwhile, can you recheck with the latest version? Does your code specifically open /dev/infiniband/umad0? Or will it just open the device based off of major minor from /sys/class/infiniband_mad/umad0/dev? Roland, do you absolutely need the KERNEL="umad*", NAME="infiniband/%k" rule in udev? -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From shaharf at voltaire.com Thu Dec 23 14:27:15 2004 From: shaharf at voltaire.com (shaharf) Date: Fri, 24 Dec 2004 00:27:15 +0200 Subject: [openib-general] Gen2 OpenSM Message-ID: Thanks. You are probably right about the trap. I wonder why it has different scheme in the spec. Did this work for anybody else then me? Shahar ________________________________ From: Roland Dreier [mailto:roland at topspin.com] Sent: Fri 12/24/2004 12:22 AM To: shaharf Cc: Michael S. 
Tsirkin; openib-general at openib.org Subject: Re: [openib-general] Gen2 OpenSM shaharf> Ok, I will send the relevant function from user_mad shaharf> file. Please forgive me not to send you a proper patch. I shaharf> am already at home and I don't want to mess up my Wife's shaharf> computer ;-) Aha - BTW, what I really meant is that shaharf> OpenSM can bring up IPoIB (small mistake for me, big shaharf> progress for the openib...;-). Here's the patch that I committed, I think I got the idea. By the way, I had the umad module set the TID when sending traps, since we do expect to get a response in the form of a trap repress. - Roland Index: infiniband/core/user_mad.c =================================================================== --- infiniband/core/user_mad.c (revision 1375) +++ infiniband/core/user_mad.c (working copy) @@ -236,6 +236,8 @@ .num_sge = 1, .send_flags = IB_SEND_SIGNALED, }; + u8 method; + u64 *tid; int ret; if (count < sizeof (struct ib_user_mad)) @@ -263,11 +265,22 @@ goto err_up; } - ((struct ib_mad_hdr *) packet->mad.data)->tid = - cpu_to_be64(((u64) agent->hi_tid) << 32 | - (be64_to_cpu(((struct ib_mad_hdr *) packet->mad.data)->tid) & - 0xffffffff)); + /* + * If userspace is generating a request that will generate a + * response, we need to make sure the high-order part of the + * transaction ID matches the agent being used to send the + * MAD. + */ + method = ((struct ib_mad_hdr *) packet->mad.data)->method; + if (!(method & IB_MGMT_METHOD_RESP) && + method != IB_MGMT_METHOD_TRAP_REPRESS && + method != IB_MGMT_METHOD_SEND) { + tid = &((struct ib_mad_hdr *) packet->mad.data)->tid; + *tid = cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpup(tid) & 0xffffffff)); + } + memset(&ah_attr, 0, sizeof ah_attr); ah_attr.dlid = be16_to_cpu(packet->mad.lid); ah_attr.sl = packet->mad.sl; From roland at topspin.com Thu Dec 23 14:33:24 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 14:33:24 -0800 Subject: [openib-general] osm In-Reply-To: (shaharf@voltaire.com's message of "Fri, 24 Dec 2004 00:02:26 +0200") References: Message-ID: <52hdmc7ih7.fsf@topspin.com> shaharf> -- Please read Roland's doc on user mad (linux-kernel/doc shaharf> or something similar). If it won't help I will send you shaharf> my configuraion next Sunday. I am not sure that this is shaharf> your problem. I will look into it on Sunday. In the shaharf> meanwhile, can you recheck with the latest version? Actually it seems libumad/umad.c is written against the previous version of that doc. Based on Greg K-H's input, we changed the names of the device files to /dev/infiniband/umad0 etc. If I change libumad/umad.c to open /dev/infiniband/umad0, then osm + OpenIB IPoIB work for me. - Roland From roland at topspin.com Thu Dec 23 14:35:12 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 14:35:12 -0800 Subject: [openib-general] osm In-Reply-To: <1103840929.1886.8.camel@duffman> (Tom Duffy's message of "Thu, 23 Dec 2004 14:28:49 -0800") References: <1103840929.1886.8.camel@duffman> Message-ID: <52d5x07ie7.fsf@topspin.com> Tom> Or will it just open the device based off of major minor from Tom> /sys/class/infiniband_mad/umad0/dev? Looks like right now it is hard coded in libumad/umad.c. However there's no way for userspace to open a device just based off major/minor -- you need a device node. Tom> Roland, do you absolutely need the KERNEL="umad*", Tom> NAME="infiniband/%k" rule in udev? 
I'm not a udev expert by any stretch of the imagination, but how else does udev know to create devices? - R. From roland at topspin.com Thu Dec 23 14:35:42 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 14:35:42 -0800 Subject: [openib-general] [PATCH] user_mad: Fix handling of non request TIDs in ib_umad_write In-Reply-To: <1103840473.4054.346.camel@localhost.localdomain> (Hal Rosenstock's message of "23 Dec 2004 17:21:13 -0500") References: <1103840473.4054.346.camel@localhost.localdomain> Message-ID: <528y7o7idd.fsf@topspin.com> Thanks to you and Tom for making a patch -- however I already did it myself and comitted it. - r. From Tom.Duffy at Sun.COM Thu Dec 23 14:41:46 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Thu, 23 Dec 2004 14:41:46 -0800 Subject: [openib-general] osm In-Reply-To: <52d5x07ie7.fsf@topspin.com> References: <1103840929.1886.8.camel@duffman> <52d5x07ie7.fsf@topspin.com> Message-ID: <1103841706.1886.11.camel@duffman> On Thu, 2004-12-23 at 14:35 -0800, Roland Dreier wrote: > Tom> Or will it just open the device based off of major minor from > Tom> /sys/class/infiniband_mad/umad0/dev? > > Looks like right now it is hard coded in libumad/umad.c. OK, I see now... > However > there's no way for userspace to open a device just based off > major/minor -- you need a device node. I didn't know that. OK, I will change my local copy of umad.c to open my devices. > Tom> Roland, do you absolutely need the KERNEL="umad*", > Tom> NAME="infiniband/%k" rule in udev? > > I'm not a udev expert by any stretch of the imagination, but how else > does udev know to create devices? I think it defaults to put them in /dev/. When I modprobe ib_umad, I get them installed: [root at flopteron2 bin]# rmmod ib_umad [root at flopteron2 bin]# ls -l /dev/umad* ls: /dev/umad*: No such file or directory [root at flopteron2 bin]# modprobe ib_umad [root at flopteron2 bin]# ls -l /dev/umad* crw------- 1 root root 253, 0 Dec 23 23:35 /dev/umad0 crw------- 1 root root 253, 1 Dec 23 23:35 /dev/umad1 -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu Dec 23 14:40:06 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2004 17:40:06 -0500 Subject: [openib-general] osm In-Reply-To: <1103836214.16121.20.camel@duffman> References: <1103836214.16121.20.camel@duffman> Message-ID: <1103841606.4054.382.camel@localhost.localdomain> On Thu, 2004-12-23 at 16:10, Tom Duffy wrote: > How is a port supposed to be active without an SM running? The instructions weren't quite in the right order. I think the README file I added earlier has this right: opensm will run on the first existing port. You can override that by using "-g ". Verify that the first port is active. > I am getting: > > [root at flopteron2 bin]# ./opensm -V > ------------------------------------------------- > OpenSM Rev:1.6.0-rc2 > Command Line Arguments: > Big V selected > Log File: /tmp/osm.log > ------------------------------------------------- > using default guid 0x2c9010a99e031 > return umad_open_port -5 > return umad_port_id -5 > > Error from osm_opensm_bind (0x2A) I think the reason for this has been answered. 
> [root at flopteron2 bin]# ./ibstat > CA 'mthca0': > CA type: MT25208 (MT23108 coma0 > Numer of ports: 2 > Firmware version: 4.5.3 > Hardware version: a0 > Node GUID: 0x0002c9010a99e030 > System image GUID: 0x0002c9010a99e033 > Port 1: > State: Down > Base lid: 0 > LMC: 0 > SM lid: 0 > Capability mask: 0x00500a68 > Port GUID: 0x0002c9010a99e031 > Port 2: > State: Down > Base lid: 0 > LMC: 0 > SM lid: 0 > Capability mask: 0x00500a68 > Port GUID: 0x0002c9010a99e032 Is port 1 plugged into anything ? It should get to INIT without the help of a SM. -- Hal From Tom.Duffy at Sun.COM Thu Dec 23 14:48:13 2004 From: Tom.Duffy at Sun.COM (Tom Duffy) Date: Thu, 23 Dec 2004 14:48:13 -0800 Subject: [openib-general] osm In-Reply-To: <1103841606.4054.382.camel@localhost.localdomain> References: <1103836214.16121.20.camel@duffman> <1103841606.4054.382.camel@localhost.localdomain> Message-ID: <1103842093.1886.14.camel@duffman> On Thu, 2004-12-23 at 17:40 -0500, Hal Rosenstock wrote: > I think the reason for this has been answered. Yeah, I am about to try with modified libumad to open the right device. > Is port 1 plugged into anything ? It should get to INIT without the help > of a SM. Right, that was my first problem. I installed this new machine yesterday and forgot to plug the cable in. That is fixed now. It turns out that opensm will loop trying to open the port if the port is DOWN printing something like this: OpenSM[9009]: SM port is down. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Thu Dec 23 15:04:51 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2004 18:04:51 -0500 Subject: [openib-general] New Headers for OpenIB Message-ID: <1103843091.4054.406.camel@localhost.localdomain> Hi, Didn't someone on lkml complain about things like $Id$ being in the header or is this ok ? -- Hal From tduffy at sun.com Thu Dec 23 15:16:56 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 23 Dec 2004 15:16:56 -0800 Subject: [openib-general] progress...but opensm crashed Message-ID: <1103843816.1886.23.camel@duffman> The good news was that my port went to active on the node running opensm (way to go Shahar!). The bad news is that there was no xmas miracle when I brought up another node on the subnet. 
(gdb) bt #0 0x0000002a95994f1e in stack_dump () at stack.c:33 #1 0x0000002a959953e1 in handler (x=11) at stack.c:112 #2 #3 0x0000000000410121 in osm_port_share_pkey (p_log=0x559af8, p_port_1=0x5913c0, p_port_2=0x0) at osm_port.h:1616 #4 0x0000000000408a5c in __match_notice_to_inf_rec (p_list_item=0x559b00, context=0x0) at osm_inform.c:599 #5 0x0000002a9588b044 in cl_qlist_apply_func (p_list=0x557ce0, pfn_func=0x4086d9 <__match_notice_to_inf_rec>, context=0x43004f90) at cl_list.c:387 #6 0x0000000000408c8a in osm_report_notice (p_log=0x559af8, p_subn=0x557940, p_ntc=0x430050d0) at osm_inform.c:705 #7 0x000000000042bfa2 in __osm_trap_rcv_process_request (p_rcv=0x558848, p_madw=0x5b95a8) at osm_trap_rcv.c:681 #8 0x000000000042c128 in osm_trap_rcv_process (p_rcv=0x558848, p_madw=0x5b95a8) at osm_trap_rcv.c:759 #9 0x000000000042c158 in __osm_trap_rcv_ctrl_disp_callback (context=0x559b00, p_data=0x0) at osm_trap_rcv_ctrl.c:99 #10 0x00000000004035ce in __cl_disp_worker (context=0x559b00) at cl_dispatcher.c:138 #11 0x0000002a9588f5ef in __cl_thread_pool_routine (context=0x559b00) at cl_threadpool.c:111 #12 0x0000002a9588f4be in __cl_thread_wrapper (arg=0x0) at cl_thread.c:94 ---Type to continue, or q to quit--- #13 0x0000002a9567213a in start_thread () from /lib64/tls/libpthread.so.0 #14 0x0000002a95ce33c3 in clone () from /lib64/tls/libc.so.6 #15 0x0000000000000000 in ?? () -- Tom Duffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From roland at topspin.com Thu Dec 23 17:54:44 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 17:54:44 -0800 Subject: [openib-general] New Headers for OpenIB In-Reply-To: <1103843091.4054.406.camel@localhost.localdomain> (Hal Rosenstock's message of "23 Dec 2004 18:04:51 -0500") References: <1103843091.4054.406.camel@localhost.localdomain> Message-ID: <52vfas5ul7.fsf@topspin.com> Hal> Didn't someone on lkml complain about things like $Id$ Hal> being in the header or is this ok ? Someone may have complained but a quick grep of existing kernel sources shows more than a thousand files with /* $Id$ */ so I don't think it's a real issue. - R. From roland at topspin.com Thu Dec 23 18:11:36 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 23 Dec 2004 18:11:36 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: (shaharf@voltaire.com's message of "Thu, 23 Dec 2004 23:27:44 +0200") References: Message-ID: <52r7lg5tt3.fsf@topspin.com> shaharf> As a matter of fact, this is similar to other tools shaharf> packages that have many tools under one roof (most tool shaharf> chains). I am not sure http://cvs.gnome.org/viewcvs/ is a shaharf> good example for us - it contains hundreds of related and shaharf> non related projects that there is really no sense to shaharf> build them all. We are not there. What we have now is a shaharf> set of tools. If or when the number of usermode project shaharf> will grow, and there will be sets of tools that dedicated shaharf> to some environment or task, I would separate the shaharf> usermode root to some sub-trees such as switch, host, shaharf> MPI, ... Right now it is unnecasarily overhead. I think gnome is a pretty good parallel, because the contents of userspace/ are really separate projects, with different sets of developers and different maintainers. 
Having one main makefile is confusing for users and packagers, and the current scheme is also error prone, since it's hard to maintain correct cross-package dependencies (for example, touch include/common.h, run make, and watch binaries that depend on it such as mad_test fail to get rebuilt). - Roland From shaharf at voltaire.com Thu Dec 23 23:30:54 2004 From: shaharf at voltaire.com (shaharf) Date: Fri, 24 Dec 2004 09:30:54 +0200 Subject: [openib-general] progress...but opensm crashed Message-ID: Hi Tom, It seems that one of nodes in your subnet registered itself to be informed on traps. This is one of the rare cases that require GRH headers which are not really supported yet. I didn't think GRH has a high priority, but as I see that it is really used, I will have to deal with that sooner then I thought. Thanks for the information. I would be glad to have your osm.log file (probably in /tmp) if you have run it with -V. This could let me check my theory. Shahar ________________________________ From: openib-general-bounces at openib.org on behalf of Tom Duffy Sent: Fri 12/24/2004 1:16 AM To: openib-general at openib.org Subject: [openib-general] progress...but opensm crashed The good news was that my port went to active on the node running opensm (way to go Shahar!). The bad news is that there was no xmas miracle when I brought up another node on the subnet. (gdb) bt #0 0x0000002a95994f1e in stack_dump () at stack.c:33 #1 0x0000002a959953e1 in handler (x=11) at stack.c:112 #2 #3 0x0000000000410121 in osm_port_share_pkey (p_log=0x559af8, p_port_1=0x5913c0, p_port_2=0x0) at osm_port.h:1616 #4 0x0000000000408a5c in __match_notice_to_inf_rec (p_list_item=0x559b00, context=0x0) at osm_inform.c:599 #5 0x0000002a9588b044 in cl_qlist_apply_func (p_list=0x557ce0, pfn_func=0x4086d9 <__match_notice_to_inf_rec>, context=0x43004f90) at cl_list.c:387 #6 0x0000000000408c8a in osm_report_notice (p_log=0x559af8, p_subn=0x557940, p_ntc=0x430050d0) at osm_inform.c:705 #7 0x000000000042bfa2 in __osm_trap_rcv_process_request (p_rcv=0x558848, p_madw=0x5b95a8) at osm_trap_rcv.c:681 #8 0x000000000042c128 in osm_trap_rcv_process (p_rcv=0x558848, p_madw=0x5b95a8) at osm_trap_rcv.c:759 #9 0x000000000042c158 in __osm_trap_rcv_ctrl_disp_callback (context=0x559b00, p_data=0x0) at osm_trap_rcv_ctrl.c:99 #10 0x00000000004035ce in __cl_disp_worker (context=0x559b00) at cl_dispatcher.c:138 #11 0x0000002a9588f5ef in __cl_thread_pool_routine (context=0x559b00) at cl_threadpool.c:111 #12 0x0000002a9588f4be in __cl_thread_wrapper (arg=0x0) at cl_thread.c:94 ---Type to continue, or q to quit--- #13 0x0000002a9567213a in start_thread () from /lib64/tls/libpthread.so.0 #14 0x0000002a95ce33c3 in clone () from /lib64/tls/libc.so.6 #15 0x0000000000000000 in ?? () -- Tom Duffy From halr at voltaire.com Fri Dec 24 05:10:14 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Dec 2004 08:10:14 -0500 Subject: [openib-general] progress...but opensm crashed In-Reply-To: References: Message-ID: <1103893814.4054.420.camel@localhost.localdomain> On Fri, 2004-12-24 at 02:30, shaharf wrote: > Hi Tom, > It seems that one of nodes in your subnet registered itself to be informed on traps. This is one of the rare cases that require GRH headers which are not really supported yet. > > I didn't think GRH has a high priority, but as I see that it is really used, > I will have to deal with that sooner then I thought. 
GRH support is currently lacking from user_mad.c and I think there is some support needed in the OpenSM OpenIB "vendor" interface for this. > Thanks for the information. I would be glad to have your osm.log file (probably in /tmp) if you have run it with -V. This could let me check my theory. Also, would it be possible to turn this off for the time being (and see if the subnet comes up) ? -- Hal From roland at topspin.com Fri Dec 24 07:31:50 2004 From: roland at topspin.com (Roland Dreier) Date: Fri, 24 Dec 2004 07:31:50 -0800 Subject: [openib-general] progress...but opensm crashed In-Reply-To: (shaharf@voltaire.com's message of "Fri, 24 Dec 2004 09:30:54 +0200") References: Message-ID: <523bxv67bt.fsf@topspin.com> shaharf> Hi Tom, It seems that one of nodes in your subnet shaharf> registered itself to be informed on traps. This is one of shaharf> the rare cases that require GRH headers which are not shaharf> really supported yet. I'm confused -- I thought only MADs sent between subnets need GRHs? - Roland From eric.lemoine at gmail.com Fri Dec 24 08:10:38 2004 From: eric.lemoine at gmail.com (Eric Lemoine) Date: Fri, 24 Dec 2004 17:10:38 +0100 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <41CAF444.3000305@trash.net> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> <20041222202919.057b8331.davem@davemloft.net> <5cac192f0412230110628749e3@mail.gmail.com> <41CAF444.3000305@trash.net> Message-ID: <5cac192f04122408102129af43@mail.gmail.com> On Thu, 23 Dec 2004 17:37:24 +0100, Patrick McHardy wrote: > Eric Lemoine wrote: > > I still have one concern with the LLTX code (and it may be that the > > correct patch is Jamal's) : > > > > Without LLTX we do : lock(queue_lock), lock(xmit_lock), > > release(queue_lock), release(xmit_lock). With LLTX (without Jamal's > > patch) we do : lock(queue_lock), release(queue_lock), lock(tx_lock), > > release(tx_lock). LLTX doesn't look correct because it creates a race > > condition window between the the two lock-protected sections. So you > > may want to reconsider Jamal's patch or pull out LLTX... > > You're right, it can cause packet reordering if something like this > happens: > > CPU1 CPU2 > lock(queue_lock) > dequeue > unlock(queue_lock) > lock(queue_lock) > dequeue > unlock(queue_lock) > lock(xmit_lock) > hard_start_xmit > unlock(xmit_lock) > lock(xmit_lock) > hard_start_xmit > unlock(xmit_lock) > > Jamal's patch should fix this. Yes but requiring drivers to release a lock that they should not even be aware of doesn't sound good. Another way would be to keep dev->queue_lock grabbed when entering start_xmit() and let the driver drop it (and re-acquire it before it returns) only if it wishes so. Although I don't like this too much either, that's the best way I can think of up to now... -- Eric From halr at voltaire.com Fri Dec 24 08:57:29 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Dec 2004 11:57:29 -0500 Subject: [openib-general] progress...but opensm crashed In-Reply-To: <523bxv67bt.fsf@topspin.com> References: <523bxv67bt.fsf@topspin.com> Message-ID: <1103907449.4054.431.camel@localhost.localdomain> On Fri, 2004-12-24 at 10:31, Roland Dreier wrote: > shaharf> Hi Tom, It seems that one of nodes in your subnet > shaharf> registered itself to be informed on traps. This is one of > shaharf> the rare cases that require GRH headers which are not > shaharf> really supported yet. 
> > I'm confused -- I thought only MADs sent between subnets need GRHs? In general that is true. There is nothing that precludes sending with a GRH inside a subnet (other than it is slightly more overhead). In terms of Report(Notice), it depends on whether the Set(InformInfo) included a GRH or not. I think it may be used with redirection as well (ClassPortInfo::RedirectGID non 0). -- Hal From shaharf at voltaire.com Mon Dec 27 02:55:10 2004 From: shaharf at voltaire.com (shaharf) Date: Mon, 27 Dec 2004 12:55:10 +0200 Subject: [openib-general] progress...but opensm crashed Message-ID: Tom, I debugged the opensm a little, and it seems like a simple opensm bug. Somehow the destination lid of the report target was not found, and as there were no checks for that, the opensm faulted. I added the missing checks. I don't know how this happened - i.e. why was the dest port invalid. It may happen during reconfiguration phases. Again, log files would help me very much. Please update your opensm and try again. Shahar > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Tom Duffy > Sent: Friday, December 24, 2004 1:17 AM > To: openib-general at openib.org > Subject: [openib-general] progress...but opensm crashed > > The good news was that my port went to active on the node running opensm > (way to go Shahar!). > > The bad news is that there was no xmas miracle when I brought up another > node on the subnet. > > (gdb) bt > #0 0x0000002a95994f1e in stack_dump () at stack.c:33 > #1 0x0000002a959953e1 in handler (x=11) at stack.c:112 > #2 > #3 0x0000000000410121 in osm_port_share_pkey (p_log=0x559af8, > p_port_1=0x5913c0, p_port_2=0x0) at osm_port.h:1616 > #4 0x0000000000408a5c in __match_notice_to_inf_rec (p_list_item=0x559b00, > context=0x0) at osm_inform.c:599 > #5 0x0000002a9588b044 in cl_qlist_apply_func (p_list=0x557ce0, > pfn_func=0x4086d9 <__match_notice_to_inf_rec>, context=0x43004f90) > at cl_list.c:387 > #6 0x0000000000408c8a in osm_report_notice (p_log=0x559af8, > p_subn=0x557940, > p_ntc=0x430050d0) at osm_inform.c:705 > #7 0x000000000042bfa2 in __osm_trap_rcv_process_request (p_rcv=0x558848, > p_madw=0x5b95a8) at osm_trap_rcv.c:681 > #8 0x000000000042c128 in osm_trap_rcv_process (p_rcv=0x558848, > p_madw=0x5b95a8) at osm_trap_rcv.c:759 > #9 0x000000000042c158 in __osm_trap_rcv_ctrl_disp_callback > (context=0x559b00, > p_data=0x0) at osm_trap_rcv_ctrl.c:99 > #10 0x00000000004035ce in __cl_disp_worker (context=0x559b00) > at cl_dispatcher.c:138 > #11 0x0000002a9588f5ef in __cl_thread_pool_routine (context=0x559b00) > at cl_threadpool.c:111 > #12 0x0000002a9588f4be in __cl_thread_wrapper (arg=0x0) at cl_thread.c:94 > ---Type to continue, or q to quit--- > #13 0x0000002a9567213a in start_thread () from /lib64/tls/libpthread.so.0 > #14 0x0000002a95ce33c3 in clone () from /lib64/tls/libc.so.6 > #15 0x0000000000000000 in ?? () > > -- > Tom Duffy From ido at mellanox.co.il Mon Dec 27 06:06:33 2004 From: ido at mellanox.co.il (Ido Bukspan) Date: Mon, 27 Dec 2004 16:06:33 +0200 Subject: [openib-general] IPoIB performance. Message-ID: <91DB792C7985D411BEC300B40080D29C711C5A@mtvex01.mtv.mtl.com> Hello I have been investigating the performance of the IPoIB for a while and I have discovered 3 interesting points which I think are worth implementing in gen2. 1. We can divide the single CQ into two separate completion queues: one for the RQ and the other for SQ. 
Then we can change the CQ policy affiliated with the SQ into IB_CQ_CONSUMER_REARM and in mainstream not arm the CQ. In such case the poll_cq_tq will be called from the send_packet method, and will reap completions without any need for interrupts/events. Obviously in cases when we have to stop the queue ( e.g. no more room available) , we need to arm the CQ until completions arrive. In general, this change reduces the interrupt rate. It may also help when posting and polling on the SQ, happens from two different processors (e.g. spinlock clash). 2. The current IPoIB driver is signaling every transmitting packet. We can improve performance by selective signaling )e.g. every 5 to 10 packets(. Note that we did notice several problems when doing it. This approach can have a problem in case of for example ping (ICMP) which do not allocate new buffers before the first ones are released. I can think of some WA to this problem such as send a dummy packet every now or then (which won't go out of the device). this includes changing the send policy to be IB_WQ_SIGNAL_SELECTABLE. When a selective WQE completes, the ipoib driver has to internally complete the rest of WQEs that weren't signaled. This change mainly reduces the overhead required by the HW driver to poll on the completion queue. 3. I think we should call netif_wake_queue only if the queue is actually stopped because as far as I understand, the kernel can schedule another process after calling wake queue. I have tried these changes and got ~20% improvement in performance )2 Tavor machines with dual CPU*3.1 GH (Xeon), AS3.0 U3) If you find it interesting, I can work on a patch for gen2. What do you think? -Ido Ido Bukspan Mellanox Technologies Ltd. Phone : (972)-3-6259500 ,Ext 518. Fax : (972)-3-5614943 mailto:ido at mellanox.co.il http://www.mellanox.com No play No game From roland at topspin.com Mon Dec 27 08:21:13 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 08:21:13 -0800 Subject: [openib-general] IPoIB performance. In-Reply-To: <91DB792C7985D411BEC300B40080D29C711C5A@mtvex01.mtv.mtl.com> (Ido Bukspan's message of "Mon, 27 Dec 2004 16:06:33 +0200") References: <91DB792C7985D411BEC300B40080D29C711C5A@mtvex01.mtv.mtl.com> Message-ID: <528y7jlnk6.fsf@topspin.com> Ido> 1. We can divide the single CQ into two separate completion Ido> queues: one for the RQ and the other for SQ. Then we can Ido> change the CQ policy affiliated with the SQ into Ido> IB_CQ_CONSUMER_REARM and in mainstream not arm the CQ. In Ido> such case the poll_cq_tq will be called from the send_packet Ido> method, and will reap completions without any need for Ido> interrupts/events. Obviously in cases when we have to stop Ido> the queue ( e.g. no more room available) , we need to arm the Ido> CQ until completions arrive. In general, this change reduces Ido> the interrupt rate. It may also help when posting and polling Ido> on the SQ, happens from two different processors Ido> (e.g. spinlock clash). I guess you also need to set a timer to poll the send queue so that you eventually get a completion for all sends, even when a packet is sent and another one isn't sent for a while. Also we would need to try a variety of workloads, because splitting the send and receive CQs means that we will always have to lock the CQ twice to poll sends and completions. For example "NPtcp -2" would be a useful test. Ido> 2. The current IPoIB driver is signaling every transmitting Ido> packet. We can improve performance by selective signaling Ido> )e.g. 
every 5 to 10 packets(. Note that we did notice Ido> several problems when doing it. This approach can have a Ido> problem in case of for example ping (ICMP) which do not Ido> allocate new buffers before the first ones are released. I Ido> can think of some WA to this problem such as send a dummy Ido> packet every now or then (which won't go out of the device). Ido> this includes changing the send policy to be Ido> IB_WQ_SIGNAL_SELECTABLE. When a selective WQE completes, the Ido> ipoib driver has to internally complete the rest of WQEs that Ido> weren't signaled. This change mainly reduces the overhead Ido> required by the HW driver to poll on the completion queue. I've looked at this in the past as well. As you point out, the kernel needs the destructor of an skb being sent to be called before it can free space in a socket buffer, so you would need to be very clever here. I'll be curious to see your approach. Ido> 3. I think we should call netif_wake_queue only if the queue Ido> is actually stopped because as far as I understand, the Ido> kernel can schedule another process after calling wake queue. Thanks for pointing this out (although of course netif_wake_queue and __netif_schedule don't actually schedule a new process -- since we're calling netif_wake_queue from interrupt context they couldn't in any case). The expensive thing seems to be the clearing and restoring interrupts in __netif_schedule. In any case I committed the change below, which seems to be about a 3% improvement. Ido> I have tried these changes and got ~20% improvement in Ido> performance )2 Tavor machines with dual CPU*3.1 GH (Xeon), Ido> AS3.0 U3) If you find it interesting, I can work on a patch Ido> for gen2. If you write a patch and do some benchmarking, that would be great. Thanks, Roland Index: ulp/ipoib/ipoib_ib.c =================================================================== --- ulp/ipoib/ipoib_ib.c (revision 1383) +++ ulp/ipoib/ipoib_ib.c (working copy) @@ -249,7 +249,8 @@ spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; - if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + if (netif_queue_stopped(dev) && + priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) netif_wake_queue(dev); spin_unlock_irqrestore(&priv->tx_lock, flags); From shaharf at voltaire.com Mon Dec 27 08:29:40 2004 From: shaharf at voltaire.com (shaharf) Date: Mon, 27 Dec 2004 18:29:40 +0200 Subject: [openib-general] Problem with userspace tree structure Message-ID: > > I think gnome is a pretty good parallel, because the contents of > userspace/ are really separate projects, with different sets of > developers and different maintainers. Having one main makefile is > confusing for users and packagers, and the current scheme is also > error prone, since it's hard to maintain correct cross-package > dependencies (for example, touch include/common.h, run make, and watch > binaries that depend on it such as mad_test fail to get rebuilt). > > - Roland I am not sure everything in usermode is separate project. For example, osm, util, lib* do belong to the same package. The dependencies argument is for handling everything under the same root. You are correct about the remake problem, but this is just a bug - forgot to include the .depend file - it is fixed now. You have a point that there may be some directories that should be separate projects (even if not independent). When we will have such projects, we should re-structure the tree. 
If you think it is very important to have such multi trees environment right from the start, I don't really care (it is just one more directory name to type) as long as everything required by the opensm can be made by a single make. Shahar From roland at topspin.com Mon Dec 27 08:36:59 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 08:36:59 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: (shaharf@voltaire.com's message of "Mon, 27 Dec 2004 18:29:40 +0200") References: Message-ID: <524qi7lmtw.fsf@topspin.com> shaharf> I am not sure everything in usermode is separate shaharf> project. For example, osm, util, lib* do belong to the shaharf> same package. The dependencies argument is for handling shaharf> everything under the same root. You are correct about the shaharf> remake problem, but this is just a bug - forgot to shaharf> include the .depend file - it is fixed now. You have a shaharf> point that there may be some directories that should be shaharf> separate projects (even if not independent). When we shaharf> will have such projects, we should re-structure the shaharf> tree. If you think it is very important to have such shaharf> multi trees environment right from the start, I don't shaharf> really care (it is just one more directory name to type) shaharf> as long as everything required by the opensm can be made shaharf> by a single make. We already have osm, diags, and tvflash which are all independent. I'll commit directories for libibvers and libmthca, which will be two more independent projects. Presumably MPI and possibly DAPL/ITAPI will start work soon. That makes at least 6 or 7 independent packages under userspace, so I'd prefer to get the tree structure correct now. - Roland From shaharf at voltaire.com Mon Dec 27 08:49:35 2004 From: shaharf at voltaire.com (shaharf) Date: Mon, 27 Dec 2004 18:49:35 +0200 Subject: [openib-general] Problem with userspace tree structure Message-ID: > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Monday, December 27, 2004 6:37 PM > To: shaharf > Cc: openib-general at openib.org > Subject: Re: [openib-general] Problem with userspace tree structure > > We already have osm, diags, and tvflash which are all independent. > I'll commit directories for libibvers and libmthca, which will be two > more independent projects. Presumably MPI and possibly DAPL/ITAPI > will start work soon. That makes at least 6 or 7 independent packages > under userspace, so I'd prefer to get the tree structure correct now. > > - Roland > OK. How about the following tree? Usermode Host-tools Tvflash Switch-tools ... Management Libcommon Libumad Diag ... osm Complib opensm Access-libs Libibverbs Libmthca MPI ... ... From roland at topspin.com Mon Dec 27 09:07:54 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 09:07:54 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: (shaharf@voltaire.com's message of "Mon, 27 Dec 2004 18:49:35 +0200") References: Message-ID: <52zmzzk6tx.fsf@topspin.com> shaharf> OK. How about the following tree? That's OK, but why do we need a hierarchy? Why not just put every package in a top-level directory like: tvflash management libibverbs libmthca mpi and so on. - Roland From mst at mellanox.co.il Mon Dec 27 09:14:08 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 27 Dec 2004 19:14:08 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <524qi7lmtw.fsf@topspin.com> References: <524qi7lmtw.fsf@topspin.com> Message-ID: <20041227171408.GA10124@mellanox.co.il> Hello! Quoting r. Roland Dreier (roland at topspin.com) "Re: [openib-general] Problem with userspace tree structure": > We already have osm, diags, and tvflash which are all independent. > I'll commit directories for libibvers and libmthca, which will be two > more independent projects. Could you add mstflint? Copy from contrib/mellanox. MST From shaharf at voltaire.com Mon Dec 27 09:21:11 2004 From: shaharf at voltaire.com (shaharf) Date: Mon, 27 Dec 2004 19:21:11 +0200 Subject: [openib-general] Opensm - retries added Message-ID: Hi all, I added very simple retries mechanism to the SM. It is not really checked, because I don't have a controlled way to loose packets. It is committed. Please let me know if it you encounter related problems (or non-related problems ;-) Shahar -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: retries-diff Type: application/octet-stream Size: 1532 bytes Desc: retries-diff URL: From mst at mellanox.co.il Mon Dec 27 13:32:08 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 27 Dec 2004 23:32:08 +0200 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: References: Message-ID: <20041227213208.GA13309@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Problem with userspace tree structure": > > > > -----Original Message----- > > From: Roland Dreier [mailto:roland at topspin.com] > > Sent: Monday, December 27, 2004 6:37 PM > > To: shaharf > > Cc: openib-general at openib.org > > Subject: Re: [openib-general] Problem with userspace tree structure > > > > We already have osm, diags, and tvflash which are all independent. > > I'll commit directories for libibvers and libmthca, which will be two > > more independent projects. Presumably MPI and possibly DAPL/ITAPI > > will start work soon. That makes at least 6 or 7 independent packages > > under userspace, so I'd prefer to get the tree structure correct now. > > > > - Roland > > > > OK. How about the following tree? > > Usermode > Host-tools > Tvflash > Switch-tools > ... > Management > Libcommon > Libumad > Diag > ... > osm > Complib > opensm > Access-libs > Libibverbs > Libmthca > MPI > ... > ... Could you please remove openib/src/userspace/osm.old as a first step? Its really painful to get two copies of opensm each time. If you thjink you need it for reference, you could always just move it somewhere ... mst From roland at topspin.com Mon Dec 27 14:19:03 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 14:19:03 -0800 Subject: [openib-general] Problem with userspace tree structure In-Reply-To: <20041227213208.GA13309@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 27 Dec 2004 23:32:08 +0200") References: <20041227213208.GA13309@mellanox.co.il> Message-ID: <52ekhbjsfc.fsf@topspin.com> Michael> Could you please remove openib/src/userspace/osm.old as a Michael> first step? Its really painful to get two copies of Michael> opensm each time. If you thjink you need it for Michael> reference, you could always just move it somewhere ... I agree osm.old should be deleted. 
No need to move it -- since we're using a revision control system, one can always check out any specific revision from the past that's of interest. - R. From roland at topspin.com Mon Dec 27 18:01:53 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 18:01:53 -0800 Subject: [openib-general] Start of userspace verbs support Message-ID: <52oegfji3y.fsf@topspin.com> I've just committed skeleton versions of userspace/libibverbs and userspace/libmthca. This is mostly to get early feedback on the structure of the code, etc. -- what's there doesn't do much that's useful. libibverbs is the device-independent part of userspace verbs, and is what applications actually link to. libmthca is device-dependent support for the same set of Mellanox HCAs as mthca support in the kernel. Right now, about all you can do is build the libraries, set OPENIB_DRIVER_PATH to a path where mthca.so can be found, and run ib_devices to get a list of devices and node GUIDs. All comments and suggestions welcome as usual. Thanks, Roland From davem at davemloft.net Mon Dec 27 21:00:07 2004 From: davem at davemloft.net (David S. Miller) Date: Mon, 27 Dec 2004 21:00:07 -0800 Subject: [openib-general] Re: [PATCH][v4][0/24] Second InfiniBand merge candidate patch set In-Reply-To: <20041220.153845.70996857.yoshfuji@linux-ipv6.org> References: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> <20041220.153845.70996857.yoshfuji@linux-ipv6.org> Message-ID: <20041227210007.398734cd.davem@davemloft.net> On Mon, 20 Dec 2004 15:38:45 +0900 (JST) YOSHIFUJI Hideaki / $B5HF#1QL@(B wrote: > In article <200412192214.KlDxQ7icOmxHYIf0 at topspin.com> (at Sun, 19 Dec 2004 22:14:43 -0800), Roland Dreier says: > > > The following series of patches is the latest version of the OpenIB > > InfiniBand drivers. We believe that this version is suitable for > > merging when 2.6.11 opens (or into -mm immediately), although of > > course we are willing to go through as many more iterations as > > required to fix any remaining issues. > > Maybe, via the net queue. David? If Roland can resubmit his patch queue to me with the fixes folks have recommended to him, I can start merging this stuff in. I leave for vacation Wednesday morning (PST time), so if it is submitted after that I'll get to it at the beginning of the new year. From roland at topspin.com Mon Dec 27 21:19:26 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:19:26 -0800 Subject: [openib-general] Re: [PATCH][v4][0/24] Second InfiniBand merge candidate patch set In-Reply-To: <20041227210007.398734cd.davem@davemloft.net> (David S. Miller's message of "Mon, 27 Dec 2004 21:00:07 -0800") References: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> <20041220.153845.70996857.yoshfuji@linux-ipv6.org> <20041227210007.398734cd.davem@davemloft.net> Message-ID: <52k6r3j8yp.fsf@topspin.com> David> If Roland can resubmit his patch queue to me with the fixes David> folks have recommended to him, I can start merging this David> stuff in. Andrew has indicated to me that he has merged the whole InfiniBand patch into -mm now. Dave, did you want to handle the entire merge of the whole IB stack, or just the net/ parts, which are pretty trivial and stand alone, since AF_INFINIBAND is already defined in the tree? I'm happy to do whatever it takes to get IB merged as expeditiously as possible so Dave & Andrew, please let me know what seems easiest and best to you. Thanks, Roland From davem at davemloft.net Mon Dec 27 21:26:15 2004 From: davem at davemloft.net (David S. 
Miller) Date: Mon, 27 Dec 2004 21:26:15 -0800 Subject: [openib-general] Re: [PATCH][v4][0/24] Second InfiniBand merge candidate patch set In-Reply-To: <52k6r3j8yp.fsf@topspin.com> References: <200412192214.KlDxQ7icOmxHYIf0@topspin.com> <20041220.153845.70996857.yoshfuji@linux-ipv6.org> <20041227210007.398734cd.davem@davemloft.net> <52k6r3j8yp.fsf@topspin.com> Message-ID: <20041227212615.0536c99f.davem@davemloft.net> On Mon, 27 Dec 2004 21:19:26 -0800 Roland Dreier wrote: > Dave, did you want to handle the entire merge of the whole IB stack, > or just the net/ parts, which are pretty trivial and stand alone, > since AF_INFINIBAND is already defined in the tree? Send it all over. From roland at topspin.com Mon Dec 27 21:40:06 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:40:06 -0800 Subject: [openib-general] [PATCH] Add support for querying port width/speed Message-ID: <52fz1rj809.fsf@topspin.com> Here's a patch that adds active_width and active_speed members to the port attributes structure. This is one step towards having IPoIB set the static rate of its address handles properly. I also added a "rate" sysfs attribute for each port, which shows the data rate. I don't have any 12X or DDR ports to test with, but 1X and 4X display correctly for me. While I was at it, I killed off the ib_static_rate enum, because the 1.2 spec adds a ton more IPD values (due to the addition of DDR and QDR as well as 8X), and it seemed like there wasn't really a good way to name all of them. For example, should the IPD value 1 be called IB_STATIC_RATE_8X_TO_4X? IB_STATIC_RATE_DDR_TO_SDR? IB_STATIC_RATE_QDR_TO_DDR? Should it have three different names? The enum wasn't used anywhere so it didn't seem worth struggling with this issue. In actual code it seems better to compute the ratio of data rates and derive the IPD value from that. Look OK to commit? 
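(To make the arithmetic concrete -- this is an illustration only, not part of the patch: the rate attribute below computes 25 * width * speed in units of 0.1 Gb/sec, so a 1X SDR port reads "2.5 Gb/sec (1X)" and a 4X SDR port reads "10 Gb/sec (4X)". Deriving an IPD from the ratio of data rates could then look roughly like the hypothetical helper below, which happens to reproduce the old enum values -- 12X-to-4X gives 2, 4X-to-1X gives 3, 12X-to-1X gives 11.)

static int rate_to_ipd(int local_rate, int path_rate)
{
	/* No throttling needed if the path is at least as fast as this port. */
	if (path_rate <= 0 || path_rate >= local_rate)
		return 0;
	/* Round up so the resulting rate never exceeds the path rate. */
	return (local_rate + path_rate - 1) / path_rate - 1;
}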
Thanks, Roland Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 1387) +++ infiniband/include/ib_verbs.h (working copy) @@ -144,13 +144,6 @@ } } -enum ib_static_rate { - IB_STATIC_RATE_FULL = 0, - IB_STATIC_RATE_12X_TO_4X = 2, - IB_STATIC_RATE_4X_TO_1X = 3, - IB_STATIC_RATE_12X_TO_1X = 11 -}; - enum ib_port_state { IB_PORT_NOP = 0, IB_PORT_DOWN = 1, @@ -182,6 +175,24 @@ IB_PORT_BOOT_MGMT_SUP = (1<<9) }; +enum ib_port_width { + IB_WIDTH_1X = 1, + IB_WIDTH_4X = 2, + IB_WIDTH_8X = 4, + IB_WIDTH_12X = 8 +}; + +static inline int ib_width_enum_to_int(enum ib_port_width width) +{ + switch (width) { + case IB_WIDTH_1X: return 1; + case IB_WIDTH_4X: return 4; + case IB_WIDTH_8X: return 8; + case IB_WIDTH_12X: return 12; + default: return -1; + } +} + struct ib_port_attr { enum ib_port_state state; enum ib_mtu max_mtu; @@ -199,6 +210,8 @@ u8 sm_sl; u8 subnet_timeout; u8 init_type_reply; + u8 active_width; + u8 active_speed; }; enum ib_device_modify_flags { Index: infiniband/core/sysfs.c =================================================================== --- infiniband/core/sysfs.c (revision 1387) +++ infiniband/core/sysfs.c (working copy) @@ -171,12 +171,41 @@ return sprintf(buf, "0x%08x\n", attr.port_cap_flags); } +static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + char *speed = ""; + int rate; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + switch (attr.active_speed) { + case 2: speed = " DDR"; break; + case 4: speed = " QDR"; break; + } + + printk(KERN_ERR "width %d speed %d\n", attr.active_width, attr.active_speed); + + rate = 25 * ib_width_enum_to_int(attr.active_width) * attr.active_speed; + if (rate < 0) + return -EINVAL; + + return sprintf(buf, "%d%s Gb/sec (%dX%s)\n", + rate / 10, rate % 10 ? 
".5" : "", + ib_width_enum_to_int(attr.active_width), speed); +} + static PORT_ATTR_RO(state); static PORT_ATTR_RO(lid); static PORT_ATTR_RO(lid_mask_count); static PORT_ATTR_RO(sm_lid); static PORT_ATTR_RO(sm_sl); static PORT_ATTR_RO(cap_mask); +static PORT_ATTR_RO(rate); static struct attribute *port_default_attrs[] = { &port_attr_state.attr, @@ -185,6 +214,7 @@ &port_attr_sm_lid.attr, &port_attr_sm_sl.attr, &port_attr_cap_mask.attr, + &port_attr_rate.attr, NULL }; Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 1397) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -115,14 +115,16 @@ } props->lid = be16_to_cpup((u16 *) (out_mad->data + 16)); - props->lmc = (*(u8 *) (out_mad->data + 34)) & 0x7; + props->lmc = out_mad->data[34] & 0x7; props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 18)); - props->sm_sl = (*(u8 *) (out_mad->data + 36)) & 0xf; - props->state = (*(u8 *) (out_mad->data + 32)) & 0xf; + props->sm_sl = out_mad->data[36] & 0xf; + props->state = out_mad->data[32] & 0xf; props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 20)); props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 48)); + props->active_width = out_mad->data[31] & 0xf; + props->active_speed = out_mad->data[35] >> 4; out: kfree(in_mad); Index: docs/sysfs.txt =================================================================== --- docs/sysfs.txt (revision 1387) +++ docs/sysfs.txt (working copy) @@ -21,6 +21,7 @@ cap_mask - Port capability mask lid - Port LID lid_mask_count - Port LID mask count + rate - Port data rate (active width * active speed) sm_lid - Subnet manager LID for port's subnet sm_sl - Subnet manager SL for port's subnet state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) From roland at topspin.com Mon Dec 27 21:50:47 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:47 -0800 Subject: [openib-general] [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: 20041227212615.0536c99f.davem@davemloft.net Message-ID: <200412272150.IBRnA4AvjendsF8x@topspin.com> >>>>> "David" == David S Miller writes: David> Send it all over. OK, you asked for it... here's our latest tree, which should incorporate all the feedback I've seen. (Individuals trimmed from CC list, since they probably don't want to get all 24 patches over again) Thanks, Roland Dreier From roland at topspin.com Mon Dec 27 21:50:48 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:48 -0800 Subject: [openib-general] [PATCH][v5][1/24] Add core InfiniBand support (public headers) In-Reply-To: <200412272150.IBRnA4AvjendsF8x@topspin.com> Message-ID: <200412272150.khYObxkGxtPP9Oju@topspin.com> Add public headers for core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_cache.h 2004-12-27 21:48:17.561381253 -0800 @@ -0,0 +1,53 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef _IB_CACHE_H +#define _IB_CACHE_H + +#include + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid); +int ib_cached_pkey_get(struct ib_device *device_handle, + u8 port, + int index, + u16 *pkey); +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index); + +#endif /* _IB_CACHE_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_fmr_pool.h 2004-12-27 21:48:17.586377574 -0800 @@ -0,0 +1,92 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_fmr_pool.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#if !defined(IB_FMR_POOL_H) +#define IB_FMR_POOL_H + +#include + +struct ib_fmr_pool; + +/** + * struct ib_fmr_pool_param - Parameters for creating FMR pool + * @max_pages_per_fmr:Maximum number of pages per map request. 
+ * @access:Access flags for FMRs in pool. + * @pool_size:Number of FMRs to allocate for pool. + * @dirty_watermark:Flush is triggered when @dirty_watermark dirty + * FMRs are present. + * @flush_function:Callback called when unmapped FMRs are flushed and + * more FMRs are possibly available for mapping + * @flush_arg:Context passed to user's flush function. + * @cache:If set, FMRs may be reused after unmapping for identical map + * requests. + */ +struct ib_fmr_pool_param { + int max_pages_per_fmr; + enum ib_access_flags access; + int pool_size; + int dirty_watermark; + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + unsigned cache:1; +}; + +struct ib_pool_fmr { + struct ib_fmr *fmr; + struct ib_fmr_pool *pool; + struct list_head list; + struct hlist_node cache_node; + int ref_count; + int remap_count; + u64 io_virtual_address; + int page_list_len; + u64 page_list[0]; +}; + +struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params); + +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool); + +int ib_flush_fmr_pool(struct ib_fmr_pool *pool); + +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address); + +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr); + +#endif /* IB_FMR_POOL_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_pack.h 2004-12-27 21:48:17.640369627 -0800 @@ -0,0 +1,245 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_pack.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef IB_PACK_H +#define IB_PACK_H + +#include + +enum { + IB_LRH_BYTES = 8, + IB_GRH_BYTES = 40, + IB_BTH_BYTES = 12, + IB_DETH_BYTES = 8 +}; + +struct ib_field { + size_t struct_offset_bytes; + size_t struct_size_bytes; + int offset_words; + int offset_bits; + int size_bits; + char *field_name; +}; + +#define RESERVED \ + .field_name = "reserved" + +/* + * This macro cleans up the definitions of constants for BTH opcodes. + * It is used to define constants such as IB_OPCODE_UD_SEND_ONLY, + * which becomes IB_OPCODE_UD + IB_OPCODE_SEND_ONLY, and this gives + * the correct value. 
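+ * (With the values defined below this means, for instance, that + * IB_OPCODE_UD_SEND_ONLY works out to 0x60 + 0x04 = 0x64.)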
+ * + * In short, user code should use the constants defined using the + * macro rather than worrying about adding together other constants. +*/ +#define IB_OPCODE(transport, op) \ + IB_OPCODE_ ## transport ## _ ## op = \ + IB_OPCODE_ ## transport + IB_OPCODE_ ## op + +enum { + /* transport types -- just used to define real constants */ + IB_OPCODE_RC = 0x00, + IB_OPCODE_UC = 0x20, + IB_OPCODE_RD = 0x40, + IB_OPCODE_UD = 0x60, + + /* operations -- just used to define real constants */ + IB_OPCODE_SEND_FIRST = 0x00, + IB_OPCODE_SEND_MIDDLE = 0x01, + IB_OPCODE_SEND_LAST = 0x02, + IB_OPCODE_SEND_LAST_WITH_IMMEDIATE = 0x03, + IB_OPCODE_SEND_ONLY = 0x04, + IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE = 0x05, + IB_OPCODE_RDMA_WRITE_FIRST = 0x06, + IB_OPCODE_RDMA_WRITE_MIDDLE = 0x07, + IB_OPCODE_RDMA_WRITE_LAST = 0x08, + IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE = 0x09, + IB_OPCODE_RDMA_WRITE_ONLY = 0x0a, + IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE = 0x0b, + IB_OPCODE_RDMA_READ_REQUEST = 0x0c, + IB_OPCODE_RDMA_READ_RESPONSE_FIRST = 0x0d, + IB_OPCODE_RDMA_READ_RESPONSE_MIDDLE = 0x0e, + IB_OPCODE_RDMA_READ_RESPONSE_LAST = 0x0f, + IB_OPCODE_RDMA_READ_RESPONSE_ONLY = 0x10, + IB_OPCODE_ACKNOWLEDGE = 0x11, + IB_OPCODE_ATOMIC_ACKNOWLEDGE = 0x12, + IB_OPCODE_COMPARE_SWAP = 0x13, + IB_OPCODE_FETCH_ADD = 0x14, + + /* real constants follow -- see comment about above IB_OPCODE() + macro for more details */ + + /* RC */ + IB_OPCODE(RC, SEND_FIRST), + IB_OPCODE(RC, SEND_MIDDLE), + IB_OPCODE(RC, SEND_LAST), + IB_OPCODE(RC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, SEND_ONLY), + IB_OPCODE(RC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_FIRST), + IB_OPCODE(RC, RDMA_WRITE_MIDDLE), + IB_OPCODE(RC, RDMA_WRITE_LAST), + IB_OPCODE(RC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_WRITE_ONLY), + IB_OPCODE(RC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RC, RDMA_READ_REQUEST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RC, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RC, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RC, ACKNOWLEDGE), + IB_OPCODE(RC, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RC, COMPARE_SWAP), + IB_OPCODE(RC, FETCH_ADD), + + /* UC */ + IB_OPCODE(UC, SEND_FIRST), + IB_OPCODE(UC, SEND_MIDDLE), + IB_OPCODE(UC, SEND_LAST), + IB_OPCODE(UC, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, SEND_ONLY), + IB_OPCODE(UC, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_FIRST), + IB_OPCODE(UC, RDMA_WRITE_MIDDLE), + IB_OPCODE(UC, RDMA_WRITE_LAST), + IB_OPCODE(UC, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(UC, RDMA_WRITE_ONLY), + IB_OPCODE(UC, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + + /* RD */ + IB_OPCODE(RD, SEND_FIRST), + IB_OPCODE(RD, SEND_MIDDLE), + IB_OPCODE(RD, SEND_LAST), + IB_OPCODE(RD, SEND_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, SEND_ONLY), + IB_OPCODE(RD, SEND_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_FIRST), + IB_OPCODE(RD, RDMA_WRITE_MIDDLE), + IB_OPCODE(RD, RDMA_WRITE_LAST), + IB_OPCODE(RD, RDMA_WRITE_LAST_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_WRITE_ONLY), + IB_OPCODE(RD, RDMA_WRITE_ONLY_WITH_IMMEDIATE), + IB_OPCODE(RD, RDMA_READ_REQUEST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_FIRST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_MIDDLE), + IB_OPCODE(RD, RDMA_READ_RESPONSE_LAST), + IB_OPCODE(RD, RDMA_READ_RESPONSE_ONLY), + IB_OPCODE(RD, ACKNOWLEDGE), + IB_OPCODE(RD, ATOMIC_ACKNOWLEDGE), + IB_OPCODE(RD, COMPARE_SWAP), + IB_OPCODE(RD, FETCH_ADD), + + /* UD */ + IB_OPCODE(UD, SEND_ONLY), + IB_OPCODE(UD, SEND_ONLY_WITH_IMMEDIATE) +}; + +enum { + IB_LNH_RAW = 0, + IB_LNH_IP = 1, + 
IB_LNH_IBA_LOCAL = 2, + IB_LNH_IBA_GLOBAL = 3 +}; + +struct ib_unpacked_lrh { + u8 virtual_lane; + u8 link_version; + u8 service_level; + u8 link_next_header; + __be16 destination_lid; + __be16 packet_length; + __be16 source_lid; +}; + +struct ib_unpacked_grh { + u8 ip_version; + u8 traffic_class; + __be32 flow_label; + __be16 payload_length; + u8 next_header; + u8 hop_limit; + union ib_gid source_gid; + union ib_gid destination_gid; +}; + +struct ib_unpacked_bth { + u8 opcode; + u8 solicited_event; + u8 mig_req; + u8 pad_count; + u8 transport_header_version; + __be16 pkey; + __be32 destination_qpn; + u8 ack_req; + __be32 psn; +}; + +struct ib_unpacked_deth { + __be32 qkey; + __be32 source_qpn; +}; + +struct ib_ud_header { + struct ib_unpacked_lrh lrh; + int grh_present; + struct ib_unpacked_grh grh; + struct ib_unpacked_bth bth; + struct ib_unpacked_deth deth; + int immediate_present; + __be32 immediate_data; +}; + +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf); + +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure); + +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header); + +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf); + +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header); + +#endif /* IB_PACK_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_verbs.h 2004-12-27 21:48:17.684363151 -0800 @@ -0,0 +1,1249 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#if !defined(IB_VERBS_H) +#define IB_VERBS_H + +#include +#include +#include + +union ib_gid { + u8 raw[16]; + struct { + u64 subnet_prefix; + u64 interface_id; + } global; +}; + +enum ib_node_type { + IB_NODE_CA = 1, + IB_NODE_SWITCH, + IB_NODE_ROUTER +}; + +enum ib_device_cap_flags { + IB_DEVICE_RESIZE_MAX_WR = 1, + IB_DEVICE_BAD_PKEY_CNTR = (1<<1), + IB_DEVICE_BAD_QKEY_CNTR = (1<<2), + IB_DEVICE_RAW_MULTI = (1<<3), + IB_DEVICE_AUTO_PATH_MIG = (1<<4), + IB_DEVICE_CHANGE_PHY_PORT = (1<<5), + IB_DEVICE_UD_AV_PORT_ENFORCE = (1<<6), + IB_DEVICE_CURR_QP_STATE_MOD = (1<<7), + IB_DEVICE_SHUTDOWN_PORT = (1<<8), + IB_DEVICE_INIT_TYPE = (1<<9), + IB_DEVICE_PORT_ACTIVE_EVENT = (1<<10), + IB_DEVICE_SYS_IMAGE_GUID = (1<<11), + IB_DEVICE_RC_RNR_NAK_GEN = (1<<12), + IB_DEVICE_SRQ_RESIZE = (1<<13), + IB_DEVICE_N_NOTIFY_CQ = (1<<14), + IB_DEVICE_RQ_SIG_TYPE = (1<<15) +}; + +enum ib_atomic_cap { + IB_ATOMIC_NONE, + IB_ATOMIC_HCA, + IB_ATOMIC_GLOB +}; + +struct ib_device_attr { + u64 fw_ver; + u64 node_guid; + u64 sys_image_guid; + u64 max_mr_size; + u64 page_size_cap; + u32 vendor_id; + u32 vendor_part_id; + u32 hw_ver; + int max_qp; + int max_qp_wr; + int device_cap_flags; + int max_sge; + int max_sge_rd; + int max_cq; + int max_cqe; + int max_mr; + int max_pd; + int max_qp_rd_atom; + int max_ee_rd_atom; + int max_res_rd_atom; + int max_qp_init_rd_atom; + int max_ee_init_rd_atom; + enum ib_atomic_cap atomic_cap; + int max_ee; + int max_rdd; + int max_mw; + int max_raw_ipv6_qp; + int max_raw_ethy_qp; + int max_mcast_grp; + int max_mcast_qp_attach; + int max_total_mcast_qp_attach; + int max_ah; + int max_fmr; + int max_map_per_fmr; + int max_srq; + int max_srq_wr; + int max_srq_sge; + u16 max_pkeys; + u8 local_ca_ack_delay; +}; + +enum ib_mtu { + IB_MTU_256 = 1, + IB_MTU_512 = 2, + IB_MTU_1024 = 3, + IB_MTU_2048 = 4, + IB_MTU_4096 = 5 +}; + +static inline int ib_mtu_enum_to_int(enum ib_mtu mtu) +{ + switch (mtu) { + case IB_MTU_256: return 256; + case IB_MTU_512: return 512; + case IB_MTU_1024: return 1024; + case IB_MTU_2048: return 2048; + case IB_MTU_4096: return 4096; + default: return -1; + } +} + +enum ib_port_state { + IB_PORT_NOP = 0, + IB_PORT_DOWN = 1, + IB_PORT_INIT = 2, + IB_PORT_ARMED = 3, + IB_PORT_ACTIVE = 4, + IB_PORT_ACTIVE_DEFER = 5 +}; + +enum ib_port_cap_flags { + IB_PORT_SM = (1<<31), + IB_PORT_NOTICE_SUP = (1<<30), + IB_PORT_TRAP_SUP = (1<<29), + IB_PORT_AUTO_MIGR_SUP = (1<<27), + IB_PORT_SL_MAP_SUP = (1<<26), + IB_PORT_MKEY_NVRAM = (1<<25), + IB_PORT_PKEY_NVRAM = (1<<24), + IB_PORT_LED_INFO_SUP = (1<<23), + IB_PORT_SM_DISABLED = (1<<22), + IB_PORT_SYS_IMAGE_GUID_SUP = (1<<21), + IB_PORT_PKEY_SW_EXT_PORT_TRAP_SUP = (1<<20), + IB_PORT_CM_SUP = (1<<16), + IB_PORT_SNMP_TUNNEL_SUP = (1<<15), + IB_PORT_REINIT_SUP = (1<<14), + IB_PORT_DEVICE_MGMT_SUP = (1<<13), + IB_PORT_VENDOR_CLASS_SUP = (1<<12), + IB_PORT_DR_NOTICE_SUP = (1<<11), + IB_PORT_PORT_NOTICE_SUP = (1<<10), + IB_PORT_BOOT_MGMT_SUP = (1<<9) +}; + +enum ib_port_width { + IB_WIDTH_1X = 1, + IB_WIDTH_4X = 2, + IB_WIDTH_8X = 4, + IB_WIDTH_12X = 8 +}; + +static inline int ib_width_enum_to_int(enum ib_port_width width) +{ + switch (width) { + case IB_WIDTH_1X: return 1; + case IB_WIDTH_4X: return 4; + case IB_WIDTH_8X: return 8; + case IB_WIDTH_12X: return 12; + default: return -1; + } +} + +struct ib_port_attr { + enum ib_port_state state; + enum ib_mtu max_mtu; + enum ib_mtu active_mtu; + int gid_tbl_len; + u32 port_cap_flags; + u32 max_msg_sz; + u32 bad_pkey_cntr; + 
u32 qkey_viol_cntr; + u16 pkey_tbl_len; + u16 lid; + u16 sm_lid; + u8 lmc; + u8 max_vl_num; + u8 sm_sl; + u8 subnet_timeout; + u8 init_type_reply; + u8 active_width; + u8 active_speed; +}; + +enum ib_device_modify_flags { + IB_DEVICE_MODIFY_SYS_IMAGE_GUID = 1 +}; + +struct ib_device_modify { + u64 sys_image_guid; +}; + +enum ib_port_modify_flags { + IB_PORT_SHUTDOWN = 1, + IB_PORT_INIT_TYPE = (1<<2), + IB_PORT_RESET_QKEY_CNTR = (1<<3) +}; + +struct ib_port_modify { + u32 set_port_cap_mask; + u32 clr_port_cap_mask; + u8 init_type; +}; + +enum ib_event_type { + IB_EVENT_CQ_ERR, + IB_EVENT_QP_FATAL, + IB_EVENT_QP_REQ_ERR, + IB_EVENT_QP_ACCESS_ERR, + IB_EVENT_COMM_EST, + IB_EVENT_SQ_DRAINED, + IB_EVENT_PATH_MIG, + IB_EVENT_PATH_MIG_ERR, + IB_EVENT_DEVICE_FATAL, + IB_EVENT_PORT_ACTIVE, + IB_EVENT_PORT_ERR, + IB_EVENT_LID_CHANGE, + IB_EVENT_PKEY_CHANGE, + IB_EVENT_SM_CHANGE +}; + +struct ib_event { + struct ib_device *device; + union { + struct ib_cq *cq; + struct ib_qp *qp; + u8 port_num; + } element; + enum ib_event_type event; +}; + +struct ib_event_handler { + struct ib_device *device; + void (*handler)(struct ib_event_handler *, struct ib_event *); + struct list_head list; +}; + +#define INIT_IB_EVENT_HANDLER(_ptr, _device, _handler) \ + do { \ + (_ptr)->device = _device; \ + (_ptr)->handler = _handler; \ + INIT_LIST_HEAD(&(_ptr)->list); \ + } while (0) + +struct ib_global_route { + union ib_gid dgid; + u32 flow_label; + u8 sgid_index; + u8 hop_limit; + u8 traffic_class; +}; + +enum { + IB_MULTICAST_QPN = 0xffffff +}; + +enum ib_ah_flags { + IB_AH_GRH = 1 +}; + +struct ib_ah_attr { + struct ib_global_route grh; + u16 dlid; + u8 sl; + u8 src_path_bits; + u8 static_rate; + u8 ah_flags; + u8 port_num; +}; + +enum ib_wc_status { + IB_WC_SUCCESS, + IB_WC_LOC_LEN_ERR, + IB_WC_LOC_QP_OP_ERR, + IB_WC_LOC_EEC_OP_ERR, + IB_WC_LOC_PROT_ERR, + IB_WC_WR_FLUSH_ERR, + IB_WC_MW_BIND_ERR, + IB_WC_BAD_RESP_ERR, + IB_WC_LOC_ACCESS_ERR, + IB_WC_REM_INV_REQ_ERR, + IB_WC_REM_ACCESS_ERR, + IB_WC_REM_OP_ERR, + IB_WC_RETRY_EXC_ERR, + IB_WC_RNR_RETRY_EXC_ERR, + IB_WC_LOC_RDD_VIOL_ERR, + IB_WC_REM_INV_RD_REQ_ERR, + IB_WC_REM_ABORT_ERR, + IB_WC_INV_EECN_ERR, + IB_WC_INV_EEC_STATE_ERR, + IB_WC_FATAL_ERR, + IB_WC_RESP_TIMEOUT_ERR, + IB_WC_GENERAL_ERR +}; + +enum ib_wc_opcode { + IB_WC_SEND, + IB_WC_RDMA_WRITE, + IB_WC_RDMA_READ, + IB_WC_COMP_SWAP, + IB_WC_FETCH_ADD, + IB_WC_BIND_MW, +/* + * Set value of IB_WC_RECV so consumers can test if a completion is a + * receive by testing (opcode & IB_WC_RECV). + */ + IB_WC_RECV = 1 << 7, + IB_WC_RECV_RDMA_WITH_IMM +}; + +enum ib_wc_flags { + IB_WC_GRH = 1, + IB_WC_WITH_IMM = (1<<1) +}; + +struct ib_wc { + u64 wr_id; + enum ib_wc_status status; + enum ib_wc_opcode opcode; + u32 vendor_err; + u32 byte_len; + __be32 imm_data; + u32 src_qp; + int wc_flags; + u16 pkey_index; + u16 slid; + u8 sl; + u8 dlid_path_bits; + u8 port_num; /* valid only for DR SMPs on switches */ +}; + +enum ib_cq_notify { + IB_CQ_SOLICITED, + IB_CQ_NEXT_COMP +}; + +struct ib_qp_cap { + u32 max_send_wr; + u32 max_recv_wr; + u32 max_send_sge; + u32 max_recv_sge; + u32 max_inline_data; +}; + +enum ib_sig_type { + IB_SIGNAL_ALL_WR, + IB_SIGNAL_REQ_WR +}; + +enum ib_qp_type { + /* + * IB_QPT_SMI and IB_QPT_GSI have to be the first two entries + * here (and in that order) since the MAD layer uses them as + * indices into a 2-entry table. 
+ */ + IB_QPT_SMI, + IB_QPT_GSI, + + IB_QPT_RC, + IB_QPT_UC, + IB_QPT_UD, + IB_QPT_RAW_IPV6, + IB_QPT_RAW_ETY +}; + +struct ib_qp_init_attr { + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + struct ib_qp_cap cap; + enum ib_sig_type sq_sig_type; + enum ib_sig_type rq_sig_type; + enum ib_qp_type qp_type; + u8 port_num; /* special QP types only */ +}; + +enum ib_rnr_timeout { + IB_RNR_TIMER_655_36 = 0, + IB_RNR_TIMER_000_01 = 1, + IB_RNR_TIMER_000_02 = 2, + IB_RNR_TIMER_000_03 = 3, + IB_RNR_TIMER_000_04 = 4, + IB_RNR_TIMER_000_06 = 5, + IB_RNR_TIMER_000_08 = 6, + IB_RNR_TIMER_000_12 = 7, + IB_RNR_TIMER_000_16 = 8, + IB_RNR_TIMER_000_24 = 9, + IB_RNR_TIMER_000_32 = 10, + IB_RNR_TIMER_000_48 = 11, + IB_RNR_TIMER_000_64 = 12, + IB_RNR_TIMER_000_96 = 13, + IB_RNR_TIMER_001_28 = 14, + IB_RNR_TIMER_001_92 = 15, + IB_RNR_TIMER_002_56 = 16, + IB_RNR_TIMER_003_84 = 17, + IB_RNR_TIMER_005_12 = 18, + IB_RNR_TIMER_007_68 = 19, + IB_RNR_TIMER_010_24 = 20, + IB_RNR_TIMER_015_36 = 21, + IB_RNR_TIMER_020_48 = 22, + IB_RNR_TIMER_030_72 = 23, + IB_RNR_TIMER_040_96 = 24, + IB_RNR_TIMER_061_44 = 25, + IB_RNR_TIMER_081_92 = 26, + IB_RNR_TIMER_122_88 = 27, + IB_RNR_TIMER_163_84 = 28, + IB_RNR_TIMER_245_76 = 29, + IB_RNR_TIMER_327_68 = 30, + IB_RNR_TIMER_491_52 = 31 +}; + +enum ib_qp_attr_mask { + IB_QP_STATE = 1, + IB_QP_CUR_STATE = (1<<1), + IB_QP_EN_SQD_ASYNC_NOTIFY = (1<<2), + IB_QP_ACCESS_FLAGS = (1<<3), + IB_QP_PKEY_INDEX = (1<<4), + IB_QP_PORT = (1<<5), + IB_QP_QKEY = (1<<6), + IB_QP_AV = (1<<7), + IB_QP_PATH_MTU = (1<<8), + IB_QP_TIMEOUT = (1<<9), + IB_QP_RETRY_CNT = (1<<10), + IB_QP_RNR_RETRY = (1<<11), + IB_QP_RQ_PSN = (1<<12), + IB_QP_MAX_QP_RD_ATOMIC = (1<<13), + IB_QP_ALT_PATH = (1<<14), + IB_QP_MIN_RNR_TIMER = (1<<15), + IB_QP_SQ_PSN = (1<<16), + IB_QP_MAX_DEST_RD_ATOMIC = (1<<17), + IB_QP_PATH_MIG_STATE = (1<<18), + IB_QP_CAP = (1<<19), + IB_QP_DEST_QPN = (1<<20) +}; + +enum ib_qp_state { + IB_QPS_RESET, + IB_QPS_INIT, + IB_QPS_RTR, + IB_QPS_RTS, + IB_QPS_SQD, + IB_QPS_SQE, + IB_QPS_ERR +}; + +enum ib_mig_state { + IB_MIG_MIGRATED, + IB_MIG_REARM, + IB_MIG_ARMED +}; + +struct ib_qp_attr { + enum ib_qp_state qp_state; + enum ib_qp_state cur_qp_state; + enum ib_mtu path_mtu; + enum ib_mig_state path_mig_state; + u32 qkey; + u32 rq_psn; + u32 sq_psn; + u32 dest_qp_num; + int qp_access_flags; + struct ib_qp_cap cap; + struct ib_ah_attr ah_attr; + struct ib_ah_attr alt_ah_attr; + u16 pkey_index; + u16 alt_pkey_index; + u8 en_sqd_async_notify; + u8 sq_draining; + u8 max_rd_atomic; + u8 max_dest_rd_atomic; + u8 min_rnr_timer; + u8 port_num; + u8 timeout; + u8 retry_cnt; + u8 rnr_retry; + u8 alt_port_num; + u8 alt_timeout; +}; + +enum ib_wr_opcode { + IB_WR_RDMA_WRITE, + IB_WR_RDMA_WRITE_WITH_IMM, + IB_WR_SEND, + IB_WR_SEND_WITH_IMM, + IB_WR_RDMA_READ, + IB_WR_ATOMIC_CMP_AND_SWP, + IB_WR_ATOMIC_FETCH_AND_ADD +}; + +enum ib_send_flags { + IB_SEND_FENCE = 1, + IB_SEND_SIGNALED = (1<<1), + IB_SEND_SOLICITED = (1<<2), + IB_SEND_INLINE = (1<<3) +}; + +enum ib_recv_flags { + IB_RECV_SIGNALED = 1 +}; + +struct ib_sge { + u64 addr; + u32 length; + u32 lkey; +}; + +struct ib_send_wr { + struct ib_send_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + enum ib_wr_opcode opcode; + int send_flags; + u32 imm_data; + union { + struct { + u64 remote_addr; + u32 rkey; + } rdma; + struct { + u64 remote_addr; + u64 compare_add; + u64 swap; + u32 rkey; + } atomic; + struct { + struct ib_ah *ah; + struct ib_mad_hdr 
*mad_hdr; + u32 remote_qpn; + u32 remote_qkey; + int timeout_ms; /* valid for MADs only */ + u16 pkey_index; /* valid for GSI only */ + u8 port_num; /* valid for DR SMPs on switch only */ + } ud; + } wr; +}; + +struct ib_recv_wr { + struct ib_recv_wr *next; + u64 wr_id; + struct ib_sge *sg_list; + int num_sge; + int recv_flags; +}; + +enum ib_access_flags { + IB_ACCESS_LOCAL_WRITE = 1, + IB_ACCESS_REMOTE_WRITE = (1<<1), + IB_ACCESS_REMOTE_READ = (1<<2), + IB_ACCESS_REMOTE_ATOMIC = (1<<3), + IB_ACCESS_MW_BIND = (1<<4) +}; + +struct ib_phys_buf { + u64 addr; + u64 size; +}; + +struct ib_mr_attr { + struct ib_pd *pd; + u64 device_virt_addr; + u64 size; + int mr_access_flags; + u32 lkey; + u32 rkey; +}; + +enum ib_mr_rereg_flags { + IB_MR_REREG_TRANS = 1, + IB_MR_REREG_PD = (1<<1), + IB_MR_REREG_ACCESS = (1<<2) +}; + +struct ib_mw_bind { + struct ib_mr *mr; + u64 wr_id; + u64 addr; + u32 length; + int send_flags; + int mw_access_flags; +}; + +struct ib_fmr_attr { + int max_pages; + int max_maps; + u8 page_size; +}; + +struct ib_pd { + struct ib_device *device; + atomic_t usecnt; /* count all resources */ +}; + +struct ib_ah { + struct ib_device *device; + struct ib_pd *pd; +}; + +typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context); + +struct ib_cq { + struct ib_device *device; + ib_comp_handler comp_handler; + void (*event_handler)(struct ib_event *, void *); + void * cq_context; + int cqe; + atomic_t usecnt; /* count number of work queues */ +}; + +struct ib_srq { + struct ib_device *device; + struct ib_pd *pd; + void *srq_context; + atomic_t usecnt; +}; + +struct ib_qp { + struct ib_device *device; + struct ib_pd *pd; + struct ib_cq *send_cq; + struct ib_cq *recv_cq; + struct ib_srq *srq; + void (*event_handler)(struct ib_event *, void *); + void *qp_context; + u32 qp_num; +}; + +struct ib_mr { + struct ib_device *device; + struct ib_pd *pd; + u32 lkey; + u32 rkey; + atomic_t usecnt; /* count number of MWs */ +}; + +struct ib_mw { + struct ib_device *device; + struct ib_pd *pd; + u32 rkey; +}; + +struct ib_fmr { + struct ib_device *device; + struct ib_pd *pd; + struct list_head list; + u32 lkey; + u32 rkey; +}; + +struct ib_mad; + +enum ib_process_mad_flags { + IB_MAD_IGNORE_MKEY = 1 +}; + +enum ib_mad_result { + IB_MAD_RESULT_FAILURE = 0, /* (!SUCCESS is the important flag) */ + IB_MAD_RESULT_SUCCESS = 1 << 0, /* MAD was successfully processed */ + IB_MAD_RESULT_REPLY = 1 << 1, /* Reply packet needs to be sent */ + IB_MAD_RESULT_CONSUMED = 1 << 2 /* Packet consumed: stop processing */ +}; + +#define IB_DEVICE_NAME_MAX 64 + +struct ib_cache { + rwlock_t lock; + struct ib_event_handler event_handler; + struct ib_pkey_cache **pkey_cache; + struct ib_gid_cache **gid_cache; +}; + +struct ib_device { + struct device *dma_device; + + char name[IB_DEVICE_NAME_MAX]; + + struct list_head event_handler_list; + spinlock_t event_handler_lock; + + struct list_head core_list; + struct list_head client_data_list; + spinlock_t client_data_lock; + + struct ib_cache cache; + + u32 flags; + + int (*query_device)(struct ib_device *device, + struct ib_device_attr *device_attr); + int (*query_port)(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr); + int (*query_gid)(struct ib_device *device, + u8 port_num, int index, + union ib_gid *gid); + int (*query_pkey)(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + int (*modify_device)(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + int (*modify_port)(struct 
ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + struct ib_pd * (*alloc_pd)(struct ib_device *device); + int (*dealloc_pd)(struct ib_pd *pd); + struct ib_ah * (*create_ah)(struct ib_pd *pd, + struct ib_ah_attr *ah_attr); + int (*modify_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*query_ah)(struct ib_ah *ah, + struct ib_ah_attr *ah_attr); + int (*destroy_ah)(struct ib_ah *ah); + struct ib_qp * (*create_qp)(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + int (*modify_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + int (*query_qp)(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + int (*destroy_qp)(struct ib_qp *qp); + int (*post_send)(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + int (*post_recv)(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr); + struct ib_cq * (*create_cq)(struct ib_device *device, + int cqe); + int (*destroy_cq)(struct ib_cq *cq); + int (*resize_cq)(struct ib_cq *cq, int *cqe); + int (*poll_cq)(struct ib_cq *cq, int num_entries, + struct ib_wc *wc); + int (*peek_cq)(struct ib_cq *cq, int wc_cnt); + int (*req_notify_cq)(struct ib_cq *cq, + enum ib_cq_notify cq_notify); + int (*req_ncomp_notif)(struct ib_cq *cq, + int wc_cnt); + struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, + int mr_access_flags); + struct ib_mr * (*reg_phys_mr)(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + int (*query_mr)(struct ib_mr *mr, + struct ib_mr_attr *mr_attr); + int (*dereg_mr)(struct ib_mr *mr); + int (*rereg_phys_mr)(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + struct ib_mw * (*alloc_mw)(struct ib_pd *pd); + int (*bind_mw)(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); + int (*dealloc_mw)(struct ib_mw *mw); + struct ib_fmr * (*alloc_fmr)(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + int (*map_phys_fmr)(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova); + int (*unmap_fmr)(struct list_head *fmr_list); + int (*dealloc_fmr)(struct ib_fmr *fmr); + int (*attach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*detach_mcast)(struct ib_qp *qp, + union ib_gid *gid, + u16 lid); + int (*process_mad)(struct ib_device *device, + int process_mad_flags, + u8 port_num, + u16 source_lid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); + + struct class_device class_dev; + struct kobject ports_parent; + struct list_head port_list; + + enum { + IB_DEV_UNINITIALIZED, + IB_DEV_REGISTERED, + IB_DEV_UNREGISTERED + } reg_state; + + u8 node_type; + u8 phys_port_cnt; +}; + +struct ib_client { + char *name; + void (*add) (struct ib_device *); + void (*remove)(struct ib_device *); + + struct list_head list; +}; + +struct ib_device *ib_alloc_device(size_t size); +void ib_dealloc_device(struct ib_device *device); + +int ib_register_device (struct ib_device *device); +void ib_unregister_device(struct ib_device *device); + +int ib_register_client (struct ib_client *client); +void ib_unregister_client(struct ib_client *client); + +void *ib_get_client_data(struct ib_device *device, struct ib_client *client); +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data); + +int 
ib_register_event_handler (struct ib_event_handler *event_handler); +int ib_unregister_event_handler(struct ib_event_handler *event_handler); +void ib_dispatch_event(struct ib_event *event); + +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr); + +int ib_query_port(struct ib_device *device, + u8 port_num, struct ib_port_attr *port_attr); + +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid); + +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey); + +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify); + +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify); + +/** + * ib_alloc_pd - Allocates an unused protection domain. + * @device: The device on which to allocate the protection domain. + * + * A protection domain object provides an association between QPs, shared + * receive queues, address handles, memory regions, and memory windows. + */ +struct ib_pd *ib_alloc_pd(struct ib_device *device); + +/** + * ib_dealloc_pd - Deallocates a protection domain. + * @pd: The protection domain to deallocate. + */ +int ib_dealloc_pd(struct ib_pd *pd); + +/** + * ib_create_ah - Creates an address handle for the given address vector. + * @pd: The protection domain associated with the address handle. + * @ah_attr: The attributes of the address vector. + * + * The address handle is used to reference a local or global destination + * in all UD QP post sends. + */ +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr); + +/** + * ib_modify_ah - Modifies the address vector associated with an address + * handle. + * @ah: The address handle to modify. + * @ah_attr: The new address vector attributes to associate with the + * address handle. + */ +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +/** + * ib_query_ah - Queries the address vector associated with an address + * handle. + * @ah: The address handle to query. + * @ah_attr: The address vector attributes associated with the address + * handle. + */ +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr); + +/** + * ib_destroy_ah - Destroys an address handle. + * @ah: The address handle to destroy. + */ +int ib_destroy_ah(struct ib_ah *ah); + +/** + * ib_create_qp - Creates a QP associated with the specified protection + * domain. + * @pd: The protection domain associated with the QP. + * @qp_init_attr: A list of initial attributes required to create the QP. + */ +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr); + +/** + * ib_modify_qp - Modifies the attributes for the specified QP and then + * transitions the QP to the given state. + * @qp: The QP to modify. + * @qp_attr: On input, specifies the QP attributes to modify. On output, + * the current values of selected QP attributes are returned. + * @qp_attr_mask: A bit-mask used to specify which attributes of the QP + * are being modified. + */ +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask); + +/** + * ib_query_qp - Returns the attribute list and current values for the + * specified QP. + * @qp: The QP to query. + * @qp_attr: The attributes of the specified QP. + * @qp_attr_mask: A bit-mask used to select specific attributes to query. + * @qp_init_attr: Additional attributes of the selected QP. 
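For illustration, a caller using the ib_modify_qp() interface above to move a freshly created RC QP into the INIT state might issue something like the following sketch (the pkey index, port number and access flags are arbitrary example values):

        struct ib_qp_attr attr = {
                .qp_state        = IB_QPS_INIT,
                .pkey_index      = 0,
                .port_num        = 1,
                .qp_access_flags = IB_ACCESS_REMOTE_WRITE,
        };
        int ret;

        ret = ib_modify_qp(qp, &attr,
                           IB_QP_STATE | IB_QP_PKEY_INDEX |
                           IB_QP_PORT | IB_QP_ACCESS_FLAGS);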
+ * + * The qp_attr_mask may be used to limit the query to gathering only the + * selected attributes. + */ +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr); + +/** + * ib_destroy_qp - Destroys the specified QP. + * @qp: The QP to destroy. + */ +int ib_destroy_qp(struct ib_qp *qp); + +/** + * ib_post_send - Posts a list of work requests to the send queue of + * the specified QP. + * @qp: The QP to post the work request on. + * @send_wr: A list of work requests to post on the send queue. + * @bad_send_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the QP. + */ +static inline int ib_post_send(struct ib_qp *qp, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + return qp->device->post_send(qp, send_wr, bad_send_wr); +} + +/** + * ib_post_recv - Posts a list of work requests to the receive queue of + * the specified QP. + * @qp: The QP to post the work request on. + * @recv_wr: A list of work requests to post on the receive queue. + * @bad_recv_wr: On an immediate failure, this parameter will reference + * the work request that failed to be posted on the QP. + */ +static inline int ib_post_recv(struct ib_qp *qp, + struct ib_recv_wr *recv_wr, + struct ib_recv_wr **bad_recv_wr) +{ + return qp->device->post_recv(qp, recv_wr, bad_recv_wr); +} + +/** + * ib_create_cq - Creates a CQ on the specified device. + * @device: The device on which to create the CQ. + * @comp_handler: A user-specified callback that is invoked when a + * completion event occurs on the CQ. + * @event_handler: A user-specified callback that is invoked when an + * asynchronous event not associated with a completion occurs on the CQ. + * @cq_context: Context associated with the CQ returned to the user via + * the associated completion and event handlers. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. + */ +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe); + +/** + * ib_resize_cq - Modifies the capacity of the CQ. + * @cq: The CQ to resize. + * @cqe: The minimum size of the CQ. + * + * Users can examine the cq structure to determine the actual CQ size. + */ +int ib_resize_cq(struct ib_cq *cq, int cqe); + +/** + * ib_destroy_cq - Destroys the specified CQ. + * @cq: The CQ to destroy. + */ +int ib_destroy_cq(struct ib_cq *cq); + +/** + * ib_poll_cq - poll a CQ for completion(s) + * @cq:the CQ being polled + * @num_entries:maximum number of completions to return + * @wc:array of at least @num_entries &struct ib_wc where completions + * will be returned + * + * Poll a CQ for (possibly multiple) completions. If the return value + * is < 0, an error occurred. If the return value is >= 0, it is the + * number of completions returned. If the return value is + * non-negative and < num_entries, then the CQ was emptied. + */ +static inline int ib_poll_cq(struct ib_cq *cq, int num_entries, + struct ib_wc *wc) +{ + return cq->device->poll_cq(cq, num_entries, wc); +} + +/** + * ib_peek_cq - Returns the number of unreaped completions currently + * on the specified CQ. + * @cq: The CQ to peek. + * @wc_cnt: A minimum number of unreaped completions to check for. 
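As a sketch of the ib_post_send() interface above, a signalled send of a single buffer could be built like this (dma_addr, len, mr and qp are assumed to have been set up already; the wr_id cookie is arbitrary):

        struct ib_sge sge = {
                .addr   = dma_addr,
                .length = len,
                .lkey   = mr->lkey,
        };
        struct ib_send_wr wr = {
                .wr_id      = 1,
                .sg_list    = &sge,
                .num_sge    = 1,
                .opcode     = IB_WR_SEND,
                .send_flags = IB_SEND_SIGNALED,
        }, *bad_wr;

        if (ib_post_send(qp, &wr, &bad_wr))
                printk(KERN_WARNING "post_send failed, bad_wr %p\n", bad_wr);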
+ * + * If the number of unreaped completions is greater than or equal to wc_cnt, + * this function returns wc_cnt, otherwise, it returns the actual number of + * unreaped completions. + */ +int ib_peek_cq(struct ib_cq *cq, int wc_cnt); + +/** + * ib_req_notify_cq - Request completion notification on a CQ. + * @cq: The CQ to generate an event for. + * @cq_notify: If set to %IB_CQ_SOLICITED, completion notification will + * occur on the next solicited event. If set to %IB_CQ_NEXT_COMP, + * notification will occur on the next completion. + */ +static inline int ib_req_notify_cq(struct ib_cq *cq, + enum ib_cq_notify cq_notify) +{ + return cq->device->req_notify_cq(cq, cq_notify); +} + +/** + * ib_req_ncomp_notif - Request completion notification when there are + * at least the specified number of unreaped completions on the CQ. + * @cq: The CQ to generate an event for. + * @wc_cnt: The number of unreaped completions that should be on the + * CQ before an event is generated. + */ +static inline int ib_req_ncomp_notif(struct ib_cq *cq, int wc_cnt) +{ + return cq->device->req_ncomp_notif ? + cq->device->req_ncomp_notif(cq, wc_cnt) : + -ENOSYS; +} + +/** + * ib_get_dma_mr - Returns a memory region for system memory that is + * usable for DMA. + * @pd: The protection domain associated with the memory region. + * @mr_access_flags: Specifies the memory access rights. + */ +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +/** + * ib_reg_phys_mr - Prepares a virtually addressed memory region for use + * by an HCA. + * @pd: The protection domain associated assigned to the registered region. + * @phys_buf_array: Specifies a list of physical buffers to use in the + * memory region. + * @num_phys_buf: Specifies the size of the phys_buf_array. + * @mr_access_flags: Specifies the memory access rights. + * @iova_start: The offset of the region's starting I/O virtual address. + */ +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +/** + * ib_rereg_phys_mr - Modifies the attributes of an existing memory region. + * Conceptually, this call performs the functions deregister memory region + * followed by register physical memory region. Where possible, + * resources are reused instead of deallocated and reallocated. + * @mr: The memory region to modify. + * @mr_rereg_mask: A bit-mask used to indicate which of the following + * properties of the memory region are being modified. + * @pd: If %IB_MR_REREG_PD is set in mr_rereg_mask, this field specifies + * the new protection domain to associated with the memory region, + * otherwise, this parameter is ignored. + * @phys_buf_array: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this + * field specifies a list of physical buffers to use in the new + * translation, otherwise, this parameter is ignored. + * @num_phys_buf: If %IB_MR_REREG_TRANS is set in mr_rereg_mask, this + * field specifies the size of the phys_buf_array, otherwise, this + * parameter is ignored. + * @mr_access_flags: If %IB_MR_REREG_ACCESS is set in mr_rereg_mask, this + * field specifies the new memory access rights, otherwise, this + * parameter is ignored. + * @iova_start: The offset of the region's starting I/O virtual address. 
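Putting ib_create_cq(), ib_req_notify_cq() and ib_poll_cq() together, an event-driven consumer typically re-arms notification before draining the queue, roughly as in this sketch (my_comp_handler, process_completion, my_context and the CQ size of 128 are all made-up names and values):

        static void my_comp_handler(struct ib_cq *cq, void *cq_context)
        {
                struct ib_wc wc;

                /* re-arm first so completions that arrive while we
                 * poll still generate another event */
                ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);

                while (ib_poll_cq(cq, 1, &wc) > 0)
                        process_completion(&wc);
        }

        /* at setup time; no async CQ event handler in this sketch: */
        cq = ib_create_cq(device, my_comp_handler, NULL, my_context, 128);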
+ */ +int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start); + +/** + * ib_query_mr - Retrieves information about a specific memory region. + * @mr: The memory region to retrieve information about. + * @mr_attr: The attributes of the specified memory region. + */ +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); + +/** + * ib_dereg_mr - Deregisters a memory region and removes it from the + * HCA translation table. + * @mr: The memory region to deregister. + */ +int ib_dereg_mr(struct ib_mr *mr); + +/** + * ib_alloc_mw - Allocates a memory window. + * @pd: The protection domain associated with the memory window. + */ +struct ib_mw *ib_alloc_mw(struct ib_pd *pd); + +/** + * ib_bind_mw - Posts a work request to the send queue of the specified + * QP, which binds the memory window to the given address range and + * remote access attributes. + * @qp: QP to post the bind work request on. + * @mw: The memory window to bind. + * @mw_bind: Specifies information about the memory window, including + * its address range, remote access rights, and associated memory region. + */ +static inline int ib_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + /* XXX reference counting in corresponding MR? */ + return mw->device->bind_mw ? + mw->device->bind_mw(qp, mw, mw_bind) : + -ENOSYS; +} + +/** + * ib_dealloc_mw - Deallocates a memory window. + * @mw: The memory window to deallocate. + */ +int ib_dealloc_mw(struct ib_mw *mw); + +/** + * ib_alloc_fmr - Allocates a unmapped fast memory region. + * @pd: The protection domain associated with the unmapped region. + * @mr_access_flags: Specifies the memory access rights. + * @fmr_attr: Attributes of the unmapped region. + * + * A fast memory region must be mapped before it can be used as part of + * a work request. + */ +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + +/** + * ib_map_phys_fmr - Maps a list of physical pages to a fast memory region. + * @fmr: The fast memory region to associate with the pages. + * @page_list: An array of physical pages to map to the fast memory region. + * @list_len: The number of pages in page_list. + * @iova: The I/O virtual address to use with the mapped region. + */ +static inline int ib_map_phys_fmr(struct ib_fmr *fmr, + u64 *page_list, int list_len, + u64 iova) +{ + return fmr->device->map_phys_fmr(fmr, page_list, list_len, iova); +} + +/** + * ib_unmap_fmr - Removes the mapping from a list of fast memory regions. + * @fmr_list: A linked list of fast memory regions to unmap. + */ +int ib_unmap_fmr(struct list_head *fmr_list); + +/** + * ib_dealloc_fmr - Deallocates a fast memory region. + * @fmr: The fast memory region to deallocate. + */ +int ib_dealloc_fmr(struct ib_fmr *fmr); + +/** + * ib_attach_mcast - Attaches the specified QP to a multicast group. + * @qp: QP to attach to the multicast group. The QP must be type + * IB_QPT_UD. + * @gid: Multicast group GID. + * @lid: Multicast group LID in host byte order. + * + * In order to send and receive multicast packets, subnet + * administration must have created the multicast group and configured + * the fabric appropriately. The port associated with the specified + * QP must also be a member of the multicast group. 
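As a brief sketch of the multicast calls described above, a UD QP joins and later leaves a group as follows (mcast_gid and mcast_lid would normally come back from the subnet administrator; error handling is omitted):

        union ib_gid mcast_gid;         /* group GID from the SA */
        u16 mcast_lid;                  /* group LID, host byte order */

        ib_attach_mcast(qp, &mcast_gid, mcast_lid);
        /* ... send and receive on the group ... */
        ib_detach_mcast(qp, &mcast_gid, mcast_lid);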
+ */ +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +/** + * ib_detach_mcast - Detaches the specified QP from a multicast group. + * @qp: QP to detach from the multicast group. + * @gid: Multicast group GID. + * @lid: Multicast group LID in host byte order. + */ +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid); + +#endif /* IB_VERBS_H */ From roland at topspin.com Mon Dec 27 21:50:55 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:55 -0800 Subject: [openib-general] [PATCH][v5][3/24] Hook up drivers/infiniband In-Reply-To: <200412272150.vKuRYXlCFl5x8NAo@topspin.com> Message-ID: <200412272150.BzlME8aULSGdgnS3@topspin.com> Add the appropriate lines to drivers/Kconfig and drivers/Makefile so that the kernel configuration and build systems know about drivers/infiniband. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/Kconfig 2004-12-27 21:47:59.198084242 -0800 +++ linux-bk/drivers/Kconfig 2004-12-27 21:48:19.194140917 -0800 @@ -56,4 +56,6 @@ source "drivers/mmc/Kconfig" +source "drivers/infiniband/Kconfig" + endmenu --- linux-bk.orig/drivers/Makefile 2004-12-27 21:48:10.314447971 -0800 +++ linux-bk/drivers/Makefile 2004-12-27 21:48:19.194140917 -0800 @@ -59,5 +59,6 @@ obj-$(CONFIG_EISA) += eisa/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ +obj-$(CONFIG_INFINIBAND) += infiniband/ obj-y += firmware/ obj-$(CONFIG_CRYPTO) += crypto/ From roland at topspin.com Mon Dec 27 21:50:56 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:56 -0800 Subject: [openib-general] [PATCH][v5][4/24] Add InfiniBand MAD (management datagram) support (public headers) In-Reply-To: <200412272150.BzlME8aULSGdgnS3@topspin.com> Message-ID: <200412272150.7LBVS92XE77zrCiS@topspin.com> Add public headers for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_mad.h 2004-12-27 21:48:19.513093969 -0800 @@ -0,0 +1,404 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_mad.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#if !defined( IB_MAD_H ) +#define IB_MAD_H + +#include + +/* Management base version */ +#define IB_MGMT_BASE_VERSION 1 + +/* Management classes */ +#define IB_MGMT_CLASS_SUBN_LID_ROUTED 0x01 +#define IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE 0x81 +#define IB_MGMT_CLASS_SUBN_ADM 0x03 +#define IB_MGMT_CLASS_PERF_MGMT 0x04 +#define IB_MGMT_CLASS_BM 0x05 +#define IB_MGMT_CLASS_DEVICE_MGMT 0x06 +#define IB_MGMT_CLASS_CM 0x07 +#define IB_MGMT_CLASS_SNMP 0x08 +#define IB_MGMT_CLASS_VENDOR_RANGE2_START 0x30 +#define IB_MGMT_CLASS_VENDOR_RANGE2_END 0x4F + +/* Management methods */ +#define IB_MGMT_METHOD_GET 0x01 +#define IB_MGMT_METHOD_SET 0x02 +#define IB_MGMT_METHOD_GET_RESP 0x81 +#define IB_MGMT_METHOD_SEND 0x03 +#define IB_MGMT_METHOD_TRAP 0x05 +#define IB_MGMT_METHOD_REPORT 0x06 +#define IB_MGMT_METHOD_REPORT_RESP 0x86 +#define IB_MGMT_METHOD_TRAP_REPRESS 0x07 + +#define IB_MGMT_METHOD_RESP 0x80 + +#define IB_MGMT_MAX_METHODS 128 + +#define IB_QP0 0 +#define IB_QP1 __constant_htonl(1) +#define IB_QP1_QKEY 0x80010000 + +struct ib_grh { + u32 version_tclass_flow; + u16 paylen; + u8 next_hdr; + u8 hop_limit; + union ib_gid sgid; + union ib_gid dgid; +} __attribute__ ((packed)); + +struct ib_mad_hdr { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u16 class_specific; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; +} __attribute__ ((packed)); + +struct ib_rmpp_hdr { + u8 rmpp_version; + u8 rmpp_type; + u8 rmpp_rtime_flags; + u8 rmpp_status; + u32 seg_num; + u32 paylen_newwin; +} __attribute__ ((packed)); + +struct ib_mad { + struct ib_mad_hdr mad_hdr; + u8 data[232]; +} __attribute__ ((packed)); + +struct ib_rmpp_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 data[220]; +} __attribute__ ((packed)); + +struct ib_vendor_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + u8 reserved; + u8 oui[3]; + u8 data[216]; +} __attribute__ ((packed)); + +struct ib_mad_agent; +struct ib_mad_send_wc; +struct ib_mad_recv_wc; + +/** + * ib_mad_send_handler - callback handler for a sent MAD. + * @mad_agent: MAD agent that sent the MAD. + * @mad_send_wc: Send work completion information on the sent MAD. + */ +typedef void (*ib_mad_send_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_snoop_handler - Callback handler for snooping sent MADs. + * @mad_agent: MAD agent that snooped the MAD. + * @send_wr: Work request information on the sent MAD. + * @mad_send_wc: Work completion information on the sent MAD. Valid + * only for snooping that occurs on a send completion. + * + * Clients snooping MADs should not modify data referenced by the @send_wr + * or @mad_send_wc. + */ +typedef void (*ib_mad_snoop_handler)(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc); + +/** + * ib_mad_recv_handler - callback handler for a received MAD. + * @mad_agent: MAD agent requesting the received MAD. + * @mad_recv_wc: Received work completion information on the received MAD. + * + * MADs received in response to a send request operation will be handed to + * the user after the send operation completes. 
All data buffers given + * to registered agents through this routine are owned by the receiving + * client, except for snooping agents. Clients snooping MADs should not + * modify the data referenced by @mad_recv_wc. + */ +typedef void (*ib_mad_recv_handler)(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_mad_agent - Used to track MAD registration with the access layer. + * @device: Reference to device registration is on. + * @qp: Reference to QP used for sending and receiving MADs. + * @recv_handler: Callback handler for a received MAD. + * @send_handler: Callback handler for a sent MAD. + * @snoop_handler: Callback handler for snooped sent MADs. + * @context: User-specified context associated with this registration. + * @hi_tid: Access layer assigned transaction ID for this client. + * Unsolicited MADs sent by this client will have the upper 32-bits + * of their TID set to this value. + * @port_num: Port number on which QP is registered + */ +struct ib_mad_agent { + struct ib_device *device; + struct ib_qp *qp; + ib_mad_recv_handler recv_handler; + ib_mad_send_handler send_handler; + ib_mad_snoop_handler snoop_handler; + void *context; + u32 hi_tid; + u8 port_num; +}; + +/** + * ib_mad_send_wc - MAD send completion information. + * @wr_id: Work request identifier associated with the send MAD request. + * @status: Completion status. + * @vendor_err: Optional vendor error information returned with a failed + * request. + */ +struct ib_mad_send_wc { + u64 wr_id; + enum ib_wc_status status; + u32 vendor_err; +}; + +/** + * ib_mad_recv_buf - received MAD buffer information. + * @list: Reference to next data buffer for a received RMPP MAD. + * @grh: References a data buffer containing the global route header. + * The data refereced by this buffer is only valid if the GRH is + * valid. + * @mad: References the start of the received MAD. + */ +struct ib_mad_recv_buf { + struct list_head list; + struct ib_grh *grh; + struct ib_mad *mad; +}; + +/** + * ib_mad_recv_wc - received MAD information. + * @wc: Completion information for the received data. + * @recv_buf: Specifies the location of the received data buffer(s). + * @mad_len: The length of the received MAD, without duplicated headers. + * + * For received response, the wr_id field of the wc is set to the wr_id + * for the corresponding send request. + */ +struct ib_mad_recv_wc { + struct ib_wc *wc; + struct ib_mad_recv_buf recv_buf; + int mad_len; +}; + +/** + * ib_mad_reg_req - MAD registration request + * @mgmt_class: Indicates which management class of MADs should be receive + * by the caller. This field is only required if the user wishes to + * receive unsolicited MADs, otherwise it should be 0. + * @mgmt_class_version: Indicates which version of MADs for the given + * management class to receive. + * @oui: Indicates IEEE OUI when mgmt_class is a vendor class + * in the range from 0x30 to 0x4f. Otherwise not used. + * @method_mask: The caller will receive unsolicited MADs for any method + * where @method_mask = 1. + */ +struct ib_mad_reg_req { + u8 mgmt_class; + u8 mgmt_class_version; + u8 oui[3]; + DECLARE_BITMAP(method_mask, IB_MGMT_MAX_METHODS); +}; + +/** + * ib_register_mad_agent - Register to send/receive MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP to access. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_reg_req: Specifies which unsolicited MADs should be received + * by the caller. 
This parameter may be NULL if the caller only + * wishes to receive solicited responses. + * @rmpp_version: If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. + * If set to 0, indicates that RMPP is not used by this client. + * @send_handler: The completion callback routine invoked after a send + * request has completed. + * @recv_handler: The completion callback routine invoked for a received + * MAD. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +enum ib_mad_snoop_flags { + /*IB_MAD_SNOOP_POSTED_SENDS = 1,*/ + /*IB_MAD_SNOOP_RMPP_SENDS = (1<<1),*/ + IB_MAD_SNOOP_SEND_COMPLETIONS = (1<<2), + /*IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS = (1<<3),*/ + IB_MAD_SNOOP_RECVS = (1<<4) + /*IB_MAD_SNOOP_RMPP_RECVS = (1<<5),*/ + /*IB_MAD_SNOOP_REDIRECTED_QPS = (1<<6)*/ +}; + +/** + * ib_register_mad_snoop - Register to snoop sent and received MADs. + * @device: The device to register with. + * @port_num: The port on the specified device to use. + * @qp_type: Specifies which QP traffic to snoop. Must be either + * IB_QPT_SMI or IB_QPT_GSI. + * @mad_snoop_flags: Specifies information where snooping occurs. + * @send_handler: The callback routine invoked for a snooped send. + * @recv_handler: The callback routine invoked for a snooped receive. + * @context: User specified context associated with the registration. + */ +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_unregister_mad_agent - Unregisters a client from using MAD services. + * @mad_agent: Corresponding MAD registration request to deregister. + * + * After invoking this routine, MAD services are no longer usable by the + * client on the associated QP. + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent); + +/** + * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client. + * @mad_agent: Specifies the associated registration to post the send to. + * @send_wr: Specifies the information needed to send the MAD(s). + * @bad_send_wr: Specifies the MAD on which an error was encountered. + * + * Sent MADs are not guaranteed to complete in the order that they were posted. + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr); + +/** + * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. + * @mad_recv_wc: Work completion information for a received MAD. + * @buf: User-provided data buffer to receive the coalesced buffers. The + * referenced buffer should be at least the size of the mad_len specified + * by @mad_recv_wc. + * + * This call copies a chain of received RMPP MADs into a single data buffer, + * removing duplicated headers. + */ +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf); + +/** + * ib_free_recv_mad - Returns data buffers used to receive a MAD to the + * access layer. + * @mad_recv_wc: Work completion information for a received MAD. 
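For illustration, a client that wants unsolicited Performance Management Get MADs on the GSI QP might register roughly like this (my_send_handler, my_recv_handler and my_context are placeholders, and the returned agent pointer should be checked for an error before use):

        struct ib_mad_reg_req req = {
                .mgmt_class         = IB_MGMT_CLASS_PERF_MGMT,
                .mgmt_class_version = 1,
        };
        struct ib_mad_agent *agent;

        set_bit(IB_MGMT_METHOD_GET, req.method_mask);

        agent = ib_register_mad_agent(device, port_num, IB_QPT_GSI,
                                      &req, 0 /* no RMPP */,
                                      my_send_handler, my_recv_handler,
                                      my_context);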
+ * + * Clients receiving MADs through their ib_mad_recv_handler must call this + * routine to return the work completion buffers to the access layer. + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc); + +/** + * ib_cancel_mad - Cancels an outstanding send MAD operation. + * @mad_agent: Specifies the registration associated with sent MAD. + * @wr_id: Indicates the work request identifier of the MAD to cancel. + * + * MADs will be returned to the user through the corresponding + * ib_mad_send_handler. + */ +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id); + +/** + * ib_redirect_mad_qp - Registers a QP for MAD services. + * @qp: Reference to a QP that requires MAD services. + * @rmpp_version: If set, indicates that the client will send + * and receive MADs that contain the RMPP header for the given version. + * If set to 0, indicates that RMPP is not used by this client. + * @send_handler: The completion callback routine invoked after a send + * request has completed. + * @recv_handler: The completion callback routine invoked for a received + * MAD. + * @context: User specified context associated with the registration. + * + * Use of this call allows clients to use MAD services, such as RMPP, + * on user-owned QPs. After calling this routine, users may send + * MADs on the specified QP by calling ib_mad_post_send. + */ +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context); + +/** + * ib_process_mad_wc - Processes a work completion associated with a + * MAD sent or received on a redirected QP. + * @mad_agent: Specifies the registered MAD service using the redirected QP. + * @wc: References a work completion associated with a sent or received + * MAD segment. + * + * This routine is used to complete or continue processing on a MAD request. + * If the work completion is associated with a send operation, calling + * this routine is required to continue an RMPP transfer or to wait for a + * corresponding response, if it is a request. If the work completion is + * associated with a receive operation, calling this routine is required to + * process an inbound or outbound RMPP transfer, or to match a response MAD + * with its corresponding request. + */ +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc); + +#endif /* IB_MAD_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_smi.h 2004-12-27 21:48:19.539090142 -0800 @@ -0,0 +1,96 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_smi.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#if !defined( IB_SMI_H ) +#define IB_SMI_H + +#include + +#define IB_LID_PERMISSIVE 0xFFFF + +#define IB_SMP_DATA_SIZE 64 +#define IB_SMP_MAX_PATH_HOPS 64 + +struct ib_smp { + u8 base_version; + u8 mgmt_class; + u8 class_version; + u8 method; + u16 status; + u8 hop_ptr; + u8 hop_cnt; + u64 tid; + u16 attr_id; + u16 resv; + u32 attr_mod; + u64 mkey; + u16 dr_slid; + u16 dr_dlid; + u8 reserved[28]; + u8 data[IB_SMP_DATA_SIZE]; + u8 initial_path[IB_SMP_MAX_PATH_HOPS]; + u8 return_path[IB_SMP_MAX_PATH_HOPS]; +} __attribute__ ((packed)); + +#define IB_SMP_DIRECTION __constant_htons(0x8000) + +/* Subnet management attributes */ +#define IB_SMP_ATTR_NOTICE __constant_htons(0x0002) +#define IB_SMP_ATTR_NODE_DESC __constant_htons(0x0010) +#define IB_SMP_ATTR_NODE_INFO __constant_htons(0x0011) +#define IB_SMP_ATTR_SWITCH_INFO __constant_htons(0x0012) +#define IB_SMP_ATTR_GUID_INFO __constant_htons(0x0014) +#define IB_SMP_ATTR_PORT_INFO __constant_htons(0x0015) +#define IB_SMP_ATTR_PKEY_TABLE __constant_htons(0x0016) +#define IB_SMP_ATTR_SL_TO_VL_TABLE __constant_htons(0x0017) +#define IB_SMP_ATTR_VL_ARB_TABLE __constant_htons(0x0018) +#define IB_SMP_ATTR_LINEAR_FORWARD_TABLE __constant_htons(0x0019) +#define IB_SMP_ATTR_RANDOM_FORWARD_TABLE __constant_htons(0x001A) +#define IB_SMP_ATTR_MCAST_FORWARD_TABLE __constant_htons(0x001B) +#define IB_SMP_ATTR_SM_INFO __constant_htons(0x0020) +#define IB_SMP_ATTR_VENDOR_DIAG __constant_htons(0x0030) +#define IB_SMP_ATTR_LED_INFO __constant_htons(0x0031) +#define IB_SMP_ATTR_VENDOR_MASK __constant_htons(0xFF00) + +static inline u8 +ib_get_smp_direction(struct ib_smp *smp) +{ + return ((smp->status & IB_SMP_DIRECTION) == IB_SMP_DIRECTION); +} + +#endif /* IB_SMI_H */ From roland at topspin.com Mon Dec 27 21:50:52 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:52 -0800 Subject: [openib-general] [PATCH][v5][2/24] Add core InfiniBand support In-Reply-To: <200412272150.khYObxkGxtPP9Oju@topspin.com> Message-ID: <200412272150.vKuRYXlCFl5x8NAo@topspin.com> Add implementation of core InfiniBand support. This can be thought of as a midlayer that provides an abstraction between low-level hardware drivers and upper level protocols (such as IP-over-InfiniBand). Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-27 21:48:18.185289416 -0800 @@ -0,0 +1,10 @@ +menu "InfiniBand support" + +config INFINIBAND + tristate "InfiniBand support" + ---help--- + Core support for InfiniBand (IB). Make sure to also select + any protocols you wish to use as well as drivers for your + InfiniBand hardware. 
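(With this entry and the Makefile hook from the previous patch, the core gets built once the new option is selected, e.g. with

        CONFIG_INFINIBAND=m

in the kernel .config.)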
+ +endmenu --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/Makefile 2004-12-27 21:48:18.216284854 -0800 @@ -0,0 +1 @@ +obj-$(CONFIG_INFINIBAND) += core/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-27 21:48:18.262278084 -0800 @@ -0,0 +1,6 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND) += ib_core.o + +ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ + device.o fmr_pool.o cache.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/cache.c 2004-12-27 21:48:18.576231871 -0800 @@ -0,0 +1,328 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: cache.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include + +#include "core_priv.h" + +struct ib_pkey_cache { + int table_len; + u16 table[0]; +}; + +struct ib_gid_cache { + int table_len; + union ib_gid table[0]; +}; + +struct ib_update_work { + struct work_struct work; + struct ib_device *device; + u8 port_num; +}; + +static inline int start_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 0 : 1; +} + +static inline int end_port(struct ib_device *device) +{ + return device->node_type == IB_NODE_SWITCH ? 
0 : device->phys_port_cnt; +} + +int ib_cached_gid_get(struct ib_device *device, + u8 port, + int index, + union ib_gid *gid) +{ + struct ib_gid_cache *cache; + unsigned long flags; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.gid_cache[port - start_port(device)]; + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *gid = cache->table[index]; + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_gid_get); + +int ib_cached_pkey_get(struct ib_device *device, + u8 port, + int index, + u16 *pkey) +{ + struct ib_pkey_cache *cache; + unsigned long flags; + int ret = 0; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.pkey_cache[port - start_port(device)]; + + if (index < 0 || index >= cache->table_len) + ret = -EINVAL; + else + *pkey = cache->table[index]; + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_get); + +int ib_cached_pkey_find(struct ib_device *device, + u8 port, + u16 pkey, + u16 *index) +{ + struct ib_pkey_cache *cache; + unsigned long flags; + int i; + int ret = -ENOENT; + + if (port < start_port(device) || port > end_port(device)) + return -EINVAL; + + read_lock_irqsave(&device->cache.lock, flags); + + cache = device->cache.pkey_cache[port - start_port(device)]; + + *index = -1; + + for (i = 0; i < cache->table_len; ++i) + if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) { + *index = i; + ret = 0; + break; + } + + read_unlock_irqrestore(&device->cache.lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_cached_pkey_find); + +static void ib_cache_update(struct ib_device *device, + u8 port) +{ + struct ib_port_attr *tprops = NULL; + struct ib_pkey_cache *pkey_cache = NULL, *old_pkey_cache; + struct ib_gid_cache *gid_cache = NULL, *old_gid_cache; + int i; + int ret; + + tprops = kmalloc(sizeof *tprops, GFP_KERNEL); + if (!tprops) + return; + + ret = ib_query_port(device, port, tprops); + if (ret) { + printk(KERN_WARNING "ib_query_port failed (%d) for %s\n", + ret, device->name); + goto err; + } + + pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * + sizeof *pkey_cache->table, GFP_KERNEL); + if (!pkey_cache) + goto err; + + pkey_cache->table_len = tprops->pkey_tbl_len; + + gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len * + sizeof *gid_cache->table, GFP_KERNEL); + if (!gid_cache) + goto err; + + gid_cache->table_len = tprops->gid_tbl_len; + + for (i = 0; i < pkey_cache->table_len; ++i) { + ret = ib_query_pkey(device, port, i, pkey_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + for (i = 0; i < gid_cache->table_len; ++i) { + ret = ib_query_gid(device, port, i, gid_cache->table + i); + if (ret) { + printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n", + ret, device->name, i); + goto err; + } + } + + write_lock_irq(&device->cache.lock); + + old_pkey_cache = device->cache.pkey_cache[port - start_port(device)]; + old_gid_cache = device->cache.gid_cache [port - start_port(device)]; + + device->cache.pkey_cache[port - start_port(device)] = pkey_cache; + device->cache.gid_cache [port - start_port(device)] = gid_cache; + + write_unlock_irq(&device->cache.lock); + + kfree(old_pkey_cache); + 
kfree(old_gid_cache); + kfree(tprops); + return; + +err: + kfree(pkey_cache); + kfree(gid_cache); + kfree(tprops); +} + +static void ib_cache_task(void *work_ptr) +{ + struct ib_update_work *work = work_ptr; + + ib_cache_update(work->device, work->port_num); + kfree(work); +} + +static void ib_cache_event(struct ib_event_handler *handler, + struct ib_event *event) +{ + struct ib_update_work *work; + + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + work = kmalloc(sizeof *work, GFP_ATOMIC); + if (work) { + INIT_WORK(&work->work, ib_cache_task, work); + work->device = event->device; + work->port_num = event->element.port_num; + schedule_work(&work->work); + } + } +} + +void ib_cache_setup_one(struct ib_device *device) +{ + int p; + + rwlock_init(&device->cache.lock); + + device->cache.pkey_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + device->cache.gid_cache = + kmalloc(sizeof *device->cache.pkey_cache * + (end_port(device) - start_port(device) + 1), GFP_KERNEL); + + if (!device->cache.pkey_cache || !device->cache.gid_cache) { + printk(KERN_WARNING "Couldn't allocate cache " + "for %s\n", device->name); + goto err; + } + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + device->cache.pkey_cache[p] = NULL; + device->cache.gid_cache [p] = NULL; + ib_cache_update(device, p + start_port(device)); + } + + INIT_IB_EVENT_HANDLER(&device->cache.event_handler, + device, ib_cache_event); + if (ib_register_event_handler(&device->cache.event_handler)) + goto err_cache; + + return; + +err_cache: + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); + } + +err: + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} + +void ib_cache_cleanup_one(struct ib_device *device) +{ + int p; + + ib_unregister_event_handler(&device->cache.event_handler); + flush_scheduled_work(); + + for (p = 0; p <= end_port(device) - start_port(device); ++p) { + kfree(device->cache.pkey_cache[p]); + kfree(device->cache.gid_cache[p]); + } + + kfree(device->cache.pkey_cache); + kfree(device->cache.gid_cache); +} + +struct ib_client cache_client = { + .name = "cache", + .add = ib_cache_setup_one, + .remove = ib_cache_cleanup_one +}; + +int __init ib_cache_setup(void) +{ + return ib_register_client(&cache_client); +} + +void __exit ib_cache_cleanup(void) +{ + ib_unregister_client(&cache_client); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/core_priv.h 2004-12-27 21:48:18.600228339 -0800 @@ -0,0 +1,52 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: core_priv.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef _CORE_PRIV_H +#define _CORE_PRIV_H + +#include +#include + +#include + +int ib_device_register_sysfs(struct ib_device *device); +void ib_device_unregister_sysfs(struct ib_device *device); + +int ib_sysfs_setup(void); +void ib_sysfs_cleanup(void); + +int ib_cache_setup(void); +void ib_cache_cleanup(void); + +#endif /* _CORE_PRIV_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/device.c 2004-12-27 21:48:18.525239377 -0800 @@ -0,0 +1,614 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: device.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("core kernel InfiniBand API"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct ib_client_data { + struct list_head list; + struct ib_client *client; + void * data; +}; + +static LIST_HEAD(device_list); +static LIST_HEAD(client_list); + +/* + * device_sem protects access to both device_list and client_list. + * There's no real point to using multiple locks or something fancier + * like an rwsem: we always access both lists, and we're always + * modifying one list or the other list. In any case this is not a + * hot path so there's no point in trying to optimize. 
+ */ +static DECLARE_MUTEX(device_sem); + +static int ib_device_check_mandatory(struct ib_device *device) +{ +#define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device, x), #x } + static const struct { + size_t offset; + char *name; + } mandatory_table[] = { + IB_MANDATORY_FUNC(query_device), + IB_MANDATORY_FUNC(query_port), + IB_MANDATORY_FUNC(query_pkey), + IB_MANDATORY_FUNC(query_gid), + IB_MANDATORY_FUNC(alloc_pd), + IB_MANDATORY_FUNC(dealloc_pd), + IB_MANDATORY_FUNC(create_ah), + IB_MANDATORY_FUNC(destroy_ah), + IB_MANDATORY_FUNC(create_qp), + IB_MANDATORY_FUNC(modify_qp), + IB_MANDATORY_FUNC(destroy_qp), + IB_MANDATORY_FUNC(post_send), + IB_MANDATORY_FUNC(post_recv), + IB_MANDATORY_FUNC(create_cq), + IB_MANDATORY_FUNC(destroy_cq), + IB_MANDATORY_FUNC(poll_cq), + IB_MANDATORY_FUNC(req_notify_cq), + IB_MANDATORY_FUNC(get_dma_mr), + IB_MANDATORY_FUNC(dereg_mr) + }; + int i; + + for (i = 0; i < sizeof mandatory_table / sizeof mandatory_table[0]; ++i) { + if (!*(void **) ((void *) device + mandatory_table[i].offset)) { + printk(KERN_WARNING "Device %s is missing mandatory function %s\n", + device->name, mandatory_table[i].name); + return -EINVAL; + } + } + + return 0; +} + +static struct ib_device *__ib_device_get_by_name(const char *name) +{ + struct ib_device *device; + + list_for_each_entry(device, &device_list, core_list) + if (!strncmp(name, device->name, IB_DEVICE_NAME_MAX)) + return device; + + return NULL; +} + + +static int alloc_name(char *name) +{ + long *inuse; + char buf[IB_DEVICE_NAME_MAX]; + struct ib_device *device; + int i; + + inuse = (long *) get_zeroed_page(GFP_KERNEL); + if (!inuse) + return -ENOMEM; + + list_for_each_entry(device, &device_list, core_list) { + if (!sscanf(device->name, name, &i)) + continue; + if (i < 0 || i >= PAGE_SIZE * 8) + continue; + snprintf(buf, sizeof buf, name, i); + if (!strncmp(buf, device->name, IB_DEVICE_NAME_MAX)) + set_bit(i, inuse); + } + + i = find_first_zero_bit(inuse, PAGE_SIZE * 8); + free_page((unsigned long) inuse); + snprintf(buf, sizeof buf, name, i); + + if (__ib_device_get_by_name(buf)) + return -ENFILE; + + strlcpy(name, buf, IB_DEVICE_NAME_MAX); + return 0; +} + +/** + * ib_alloc_device - allocate an IB device struct + * @size:size of structure to allocate + * + * Low-level drivers should use ib_alloc_device() to allocate &struct + * ib_device. @size is the size of the structure to be allocated, + * including any private data used by the low-level driver. + * ib_dealloc_device() must be used to free structures allocated with + * ib_alloc_device(). + */ +struct ib_device *ib_alloc_device(size_t size) +{ + void *dev; + + BUG_ON(size < sizeof (struct ib_device)); + + dev = kmalloc(size, GFP_KERNEL); + if (!dev) + return NULL; + + memset(dev, 0, size); + + return dev; +} +EXPORT_SYMBOL(ib_alloc_device); + +/** + * ib_dealloc_device - free an IB device struct + * @device:structure to free + * + * Free a structure allocated with ib_alloc_device(). 
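[Editorial aside: to make the ib_alloc_device() usage concrete, here is a minimal sketch of how a low-level driver might carve its private state out of the single allocation. The names (mydrv_dev, mydrv_alloc, the regs member) are made up for illustration and the sketch assumes the ib_verbs definitions from this patch series.]

struct mydrv_dev {
	struct ib_device ib_dev;	/* handed to ib_register_device() later */
	void __iomem    *regs;		/* driver-private state lives after it */
};

static struct mydrv_dev *mydrv_alloc(void)
{
	struct mydrv_dev *dev;

	/* one zeroed allocation covering ib_dev plus our private part */
	dev = (struct mydrv_dev *) ib_alloc_device(sizeof *dev);
	if (!dev)
		return NULL;

	/* a "%d" in the name asks the core to pick a free unit number
	   when ib_register_device() runs */
	strlcpy(dev->ib_dev.name, "mydrv%d", IB_DEVICE_NAME_MAX);

	return dev;
}

ib_register_device(&dev->ib_dev) would then fire the add callbacks of every client already registered, and ib_dealloc_device() gives the memory back once the device has been unregistered.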
+ */ +void ib_dealloc_device(struct ib_device *device) +{ + if (device->reg_state == IB_DEV_UNINITIALIZED) { + kfree(device); + return; + } + + BUG_ON(device->reg_state != IB_DEV_UNREGISTERED); + + ib_device_unregister_sysfs(device); +} +EXPORT_SYMBOL(ib_dealloc_device); + +static int add_client_context(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + unsigned long flags; + + context = kmalloc(sizeof *context, GFP_KERNEL); + if (!context) { + printk(KERN_WARNING "Couldn't allocate client context for %s/%s\n", + device->name, client->name); + return -ENOMEM; + } + + context->client = client; + context->data = NULL; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_add(&context->list, &device->client_data_list); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return 0; +} + +/** + * ib_register_device - Register an IB device with IB core + * @device:Device to register + * + * Low-level drivers use ib_register_device() to register their + * devices with the IB core. All registered clients will receive a + * callback for each device that is added. @device must be allocated + * with ib_alloc_device(). + */ +int ib_register_device(struct ib_device *device) +{ + int ret; + + down(&device_sem); + + if (strchr(device->name, '%')) { + ret = alloc_name(device->name); + if (ret) + goto out; + } + + if (ib_device_check_mandatory(device)) { + ret = -EINVAL; + goto out; + } + + INIT_LIST_HEAD(&device->event_handler_list); + INIT_LIST_HEAD(&device->client_data_list); + spin_lock_init(&device->event_handler_lock); + spin_lock_init(&device->client_data_lock); + + ret = ib_device_register_sysfs(device); + if (ret) { + printk(KERN_WARNING "Couldn't register device %s with driver model\n", + device->name); + goto out; + } + + list_add_tail(&device->core_list, &device_list); + + device->reg_state = IB_DEV_REGISTERED; + + { + struct ib_client *client; + + list_for_each_entry(client, &client_list, list) + if (client->add && !add_client_context(device, client)) + client->add(device); + } + + out: + up(&device_sem); + return ret; +} +EXPORT_SYMBOL(ib_register_device); + +/** + * ib_unregister_device - Unregister an IB device + * @device:Device to unregister + * + * Unregister an IB device. All clients will receive a remove callback. + */ +void ib_unregister_device(struct ib_device *device) +{ + struct ib_client *client; + struct ib_client_data *context, *tmp; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry_reverse(client, &client_list, list) + if (client->remove) + client->remove(device); + + list_del(&device->core_list); + + up(&device_sem); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + kfree(context); + spin_unlock_irqrestore(&device->client_data_lock, flags); + + device->reg_state = IB_DEV_UNREGISTERED; +} +EXPORT_SYMBOL(ib_unregister_device); + +/** + * ib_register_client - Register an IB client + * @client:Client to register + * + * Upper level users of the IB drivers can use ib_register_client() to + * register callbacks for IB device addition and removal. When an IB + * device is added, each registered client's add method will be called + * (in the order the clients were registered), and when a device is + * removed, each client's remove method will be called (in the reverse + * order that clients were registered). 
In addition, when + * ib_register_client() is called, the client will receive an add + * callback for all devices already registered. + */ +int ib_register_client(struct ib_client *client) +{ + struct ib_device *device; + + down(&device_sem); + + list_add_tail(&client->list, &client_list); + list_for_each_entry(device, &device_list, core_list) + if (client->add && !add_client_context(device, client)) + client->add(device); + + up(&device_sem); + + return 0; +} +EXPORT_SYMBOL(ib_register_client); + +/** + * ib_unregister_client - Unregister an IB client + * @client:Client to unregister + * + * Upper level users use ib_unregister_client() to remove their client + * registration. When ib_unregister_client() is called, the client + * will receive a remove callback for each IB device still registered. + */ +void ib_unregister_client(struct ib_client *client) +{ + struct ib_client_data *context, *tmp; + struct ib_device *device; + unsigned long flags; + + down(&device_sem); + + list_for_each_entry(device, &device_list, core_list) { + if (client->remove) + client->remove(device); + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry_safe(context, tmp, &device->client_data_list, list) + if (context->client == client) { + list_del(&context->list); + kfree(context); + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + } + list_del(&client->list); + + up(&device_sem); +} +EXPORT_SYMBOL(ib_unregister_client); + +/** + * ib_get_client_data - Get IB client context + * @device:Device to get context for + * @client:Client to get context for + * + * ib_get_client_data() returns client context set with + * ib_set_client_data(). + */ +void *ib_get_client_data(struct ib_device *device, struct ib_client *client) +{ + struct ib_client_data *context; + void *ret = NULL; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + ret = context->data; + break; + } + spin_unlock_irqrestore(&device->client_data_lock, flags); + + return ret; +} +EXPORT_SYMBOL(ib_get_client_data); + +/** + * ib_set_client_data - Get IB client context + * @device:Device to set context for + * @client:Client to set context for + * @data:Context to set + * + * ib_set_client_data() sets client context that can be retrieved with + * ib_get_client_data(). + */ +void ib_set_client_data(struct ib_device *device, struct ib_client *client, + void *data) +{ + struct ib_client_data *context; + unsigned long flags; + + spin_lock_irqsave(&device->client_data_lock, flags); + list_for_each_entry(context, &device->client_data_list, list) + if (context->client == client) { + context->data = data; + goto out; + } + + printk(KERN_WARNING "No client context found for %s/%s\n", + device->name, client->name); + +out: + spin_unlock_irqrestore(&device->client_data_lock, flags); +} +EXPORT_SYMBOL(ib_set_client_data); + +/** + * ib_register_event_handler - Register an IB event handler + * @event_handler:Handler to register + * + * ib_register_event_handler() registers an event handler that will be + * called back when asynchronous IB events occur (as defined in + * chapter 11 of the InfiniBand Architecture Specification). This + * callback may occur in interrupt context. 
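[Editorial aside: the client registration and per-device context calls above are easiest to see end to end. A minimal sketch of a hypothetical ULP ("myulp") follows; the per-device structure is only a placeholder.]

static void myulp_add_one(struct ib_device *device);
static void myulp_remove_one(struct ib_device *device);

static struct ib_client myulp_client = {
	.name   = "myulp",
	.add    = myulp_add_one,
	.remove = myulp_remove_one
};

struct myulp_device {
	struct ib_device *ib_dev;
};

static void myulp_add_one(struct ib_device *device)
{
	struct myulp_device *mdev = kmalloc(sizeof *mdev, GFP_KERNEL);

	if (!mdev)
		return;
	mdev->ib_dev = device;
	/* remembered per (device, client); remove_one pulls it back out */
	ib_set_client_data(device, &myulp_client, mdev);
}

static void myulp_remove_one(struct ib_device *device)
{
	kfree(ib_get_client_data(device, &myulp_client));
}

static int __init myulp_init(void)
{
	/* add() fires immediately for devices that are already registered */
	return ib_register_client(&myulp_client);
}

The module exit path would just call ib_unregister_client(&myulp_client), which triggers remove() for every device still present before dropping the registration.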
+ */ +int ib_register_event_handler (struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_add_tail(&event_handler->list, + &event_handler->device->event_handler_list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_register_event_handler); + +/** + * ib_unregister_event_handler - Unregister an event handler + * @event_handler:Handler to unregister + * + * Unregister an event handler registered with + * ib_register_event_handler(). + */ +int ib_unregister_event_handler(struct ib_event_handler *event_handler) +{ + unsigned long flags; + + spin_lock_irqsave(&event_handler->device->event_handler_lock, flags); + list_del(&event_handler->list); + spin_unlock_irqrestore(&event_handler->device->event_handler_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_unregister_event_handler); + +/** + * ib_dispatch_event - Dispatch an asynchronous event + * @event:Event to dispatch + * + * Low-level drivers must call ib_dispatch_event() to dispatch the + * event to all registered event handlers when an asynchronous event + * occurs. + */ +void ib_dispatch_event(struct ib_event *event) +{ + unsigned long flags; + struct ib_event_handler *handler; + + spin_lock_irqsave(&event->device->event_handler_lock, flags); + + list_for_each_entry(handler, &event->device->event_handler_list, list) + handler->handler(handler, event); + + spin_unlock_irqrestore(&event->device->event_handler_lock, flags); +} +EXPORT_SYMBOL(ib_dispatch_event); + +/** + * ib_query_device - Query IB device attributes + * @device:Device to query + * @device_attr:Device attributes + * + * ib_query_device() returns the attributes of a device through the + * @device_attr pointer. + */ +int ib_query_device(struct ib_device *device, + struct ib_device_attr *device_attr) +{ + return device->query_device(device, device_attr); +} +EXPORT_SYMBOL(ib_query_device); + +/** + * ib_query_port - Query IB port attributes + * @device:Device to query + * @port_num:Port number to query + * @port_attr:Port attributes + * + * ib_query_port() returns the attributes of a port through the + * @port_attr pointer. + */ +int ib_query_port(struct ib_device *device, + u8 port_num, + struct ib_port_attr *port_attr) +{ + return device->query_port(device, port_num, port_attr); +} +EXPORT_SYMBOL(ib_query_port); + +/** + * ib_query_gid - Get GID table entry + * @device:Device to query + * @port_num:Port number to query + * @index:GID table index to query + * @gid:Returned GID + * + * ib_query_gid() fetches the specified GID table entry. + */ +int ib_query_gid(struct ib_device *device, + u8 port_num, int index, union ib_gid *gid) +{ + return device->query_gid(device, port_num, index, gid); +} +EXPORT_SYMBOL(ib_query_gid); + +/** + * ib_query_pkey - Get P_Key table entry + * @device:Device to query + * @port_num:Port number to query + * @index:P_Key table index to query + * @pkey:Returned P_Key + * + * ib_query_pkey() fetches the specified P_Key table entry. 
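[Editorial aside: the query wrappers here are thin, but for completeness this is roughly what a consumer's probe path might do with them. The function name, the use of port 1 (an HCA's first port) and the printk are only for illustration.]

static void dump_port_one(struct ib_device *device)
{
	struct ib_port_attr port_attr;
	union ib_gid gid;
	u16 pkey;

	if (ib_query_port(device, 1, &port_attr) ||
	    ib_query_gid(device, 1, 0, &gid)     ||
	    ib_query_pkey(device, 1, 0, &pkey))
		return;

	printk(KERN_INFO "%s port 1: state %d, LID 0x%x, pkey[0] 0x%04x\n",
	       device->name, port_attr.state, port_attr.lid, pkey);
}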
+ */ +int ib_query_pkey(struct ib_device *device, + u8 port_num, u16 index, u16 *pkey) +{ + return device->query_pkey(device, port_num, index, pkey); +} +EXPORT_SYMBOL(ib_query_pkey); + +/** + * ib_modify_device - Change IB device attributes + * @device:Device to modify + * @device_modify_mask:Mask of attributes to change + * @device_modify:New attribute values + * + * ib_modify_device() changes a device's attributes as specified by + * the @device_modify_mask and @device_modify structure. + */ +int ib_modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify) +{ + return device->modify_device(device, device_modify_mask, + device_modify); +} +EXPORT_SYMBOL(ib_modify_device); + +/** + * ib_modify_port - Modifies the attributes for the specified port. + * @device: The device to modify. + * @port_num: The number of the port to modify. + * @port_modify_mask: Mask used to specify which attributes of the port + * to change. + * @port_modify: New attribute values for the port. + * + * ib_modify_port() changes a port's attributes as specified by the + * @port_modify_mask and @port_modify structure. + */ +int ib_modify_port(struct ib_device *device, + u8 port_num, int port_modify_mask, + struct ib_port_modify *port_modify) +{ + return device->modify_port(device, port_num, port_modify_mask, + port_modify); +} +EXPORT_SYMBOL(ib_modify_port); + +static int __init ib_core_init(void) +{ + int ret; + + ret = ib_sysfs_setup(); + if (ret) + printk(KERN_WARNING "Couldn't create InfiniBand device class\n"); + + ret = ib_cache_setup(); + if (ret) { + printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID cache\n"); + ib_sysfs_cleanup(); + } + + return ret; +} + +static void __exit ib_core_cleanup(void) +{ + ib_cache_cleanup(); + ib_sysfs_cleanup(); +} + +module_init(ib_core_init); +module_exit(ib_core_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/fmr_pool.c 2004-12-27 21:48:18.551235551 -0800 @@ -0,0 +1,507 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: fmr_pool.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include + +#include + +#include "core_priv.h" + +enum { + IB_FMR_MAX_REMAPS = 32, + + IB_FMR_HASH_BITS = 8, + IB_FMR_HASH_SIZE = 1 << IB_FMR_HASH_BITS, + IB_FMR_HASH_MASK = IB_FMR_HASH_SIZE - 1 +}; + +/* + * If an FMR is not in use, then the list member will point to either + * its pool's free_list (if the FMR can be mapped again; that is, + * remap_count < IB_FMR_MAX_REMAPS) or its pool's dirty_list (if the + * FMR needs to be unmapped before being remapped). In either of + * these cases it is a bug if the ref_count is not 0. In other words, + * if ref_count is > 0, then the list member must not be linked into + * either free_list or dirty_list. + * + * The cache_node member is used to link the FMR into a cache bucket + * (if caching is enabled). This is independent of the reference + * count of the FMR. When a valid FMR is released, its ref_count is + * decremented, and if ref_count reaches 0, the FMR is placed in + * either free_list or dirty_list as appropriate. However, it is not + * removed from the cache and may be "revived" if a call to + * ib_fmr_register_physical() occurs before the FMR is remapped. In + * this case we just increment the ref_count and remove the FMR from + * free_list/dirty_list. + * + * Before we remap an FMR from free_list, we remove it from the cache + * (to prevent another user from obtaining a stale FMR). When an FMR + * is released, we add it to the tail of the free list, so that our + * cache eviction policy is "least recently used." + * + * All manipulation of ref_count, list and cache_node is protected by + * pool_lock to maintain consistency. + */ + +struct ib_fmr_pool { + spinlock_t pool_lock; + + int pool_size; + int max_pages; + int dirty_watermark; + int dirty_len; + struct list_head free_list; + struct list_head dirty_list; + struct hlist_head *cache_bucket; + + void (*flush_function)(struct ib_fmr_pool *pool, + void * arg); + void *flush_arg; + + struct task_struct *thread; + + atomic_t req_ser; + atomic_t flush_ser; + + wait_queue_head_t force_wait; +}; + +static inline u32 ib_fmr_hash(u64 first_page) +{ + return jhash_2words((u32) first_page, + (u32) (first_page >> 32), + 0); +} + +/* Caller must hold pool_lock */ +static inline struct ib_pool_fmr *ib_fmr_cache_lookup(struct ib_fmr_pool *pool, + u64 *page_list, + int page_list_len, + u64 io_virtual_address) +{ + struct hlist_head *bucket; + struct ib_pool_fmr *fmr; + struct hlist_node *pos; + + if (!pool->cache_bucket) + return NULL; + + bucket = pool->cache_bucket + ib_fmr_hash(*page_list); + + hlist_for_each_entry(fmr, pos, bucket, cache_node) + if (io_virtual_address == fmr->io_virtual_address && + page_list_len == fmr->page_list_len && + !memcmp(page_list, fmr->page_list, + page_list_len * sizeof *page_list)) + return fmr; + + return NULL; +} + +static void ib_fmr_batch_release(struct ib_fmr_pool *pool) +{ + int ret; + struct ib_pool_fmr *fmr; + LIST_HEAD(unmap_list); + LIST_HEAD(fmr_list); + + spin_lock_irq(&pool->pool_lock); + + list_for_each_entry(fmr, &pool->dirty_list, list) { + hlist_del_init(&fmr->cache_node); + fmr->remap_count = 0; + list_add_tail(&fmr->fmr->list, &fmr_list); + +#ifdef DEBUG + if (fmr->ref_count !=0) { + printk(KERN_WARNING "Unmapping FMR 0x%08x with ref count %d", + fmr, fmr->ref_count); + } +#endif + } + + list_splice(&pool->dirty_list, &unmap_list); + INIT_LIST_HEAD(&pool->dirty_list); + pool->dirty_len = 0; + + spin_unlock_irq(&pool->pool_lock); + + if 
(list_empty(&unmap_list)) { + return; + } + + ret = ib_unmap_fmr(&fmr_list); + if (ret) + printk(KERN_WARNING "ib_unmap_fmr returned %d", ret); + + spin_lock_irq(&pool->pool_lock); + list_splice(&unmap_list, &pool->free_list); + spin_unlock_irq(&pool->pool_lock); +} + +static int ib_fmr_cleanup_thread(void *pool_ptr) +{ + struct ib_fmr_pool *pool = pool_ptr; + + do { + if (pool->dirty_len >= pool->dirty_watermark || + atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) < 0) { + ib_fmr_batch_release(pool); + + atomic_inc(&pool->flush_ser); + wake_up_interruptible(&pool->force_wait); + + if (pool->flush_function) + pool->flush_function(pool, pool->flush_arg); + } + + set_current_state(TASK_INTERRUPTIBLE); + if (pool->dirty_len < pool->dirty_watermark && + atomic_read(&pool->flush_ser) - atomic_read(&pool->req_ser) >= 0 && + !kthread_should_stop()) + schedule(); + __set_current_state(TASK_RUNNING); + } while (!kthread_should_stop()); + + return 0; +} + +/** + * ib_create_fmr_pool - Create an FMR pool + * @pd:Protection domain for FMRs + * @params:FMR pool parameters + * + * Create a pool of FMRs. Return value is pointer to new pool or + * error code if creation failed. + */ +struct ib_fmr_pool *ib_create_fmr_pool(struct ib_pd *pd, + struct ib_fmr_pool_param *params) +{ + struct ib_device *device; + struct ib_fmr_pool *pool; + int i; + int ret; + + if (!params) + return ERR_PTR(-EINVAL); + + device = pd->device; + if (!device->alloc_fmr || !device->dealloc_fmr || + !device->map_phys_fmr || !device->unmap_fmr) { + printk(KERN_WARNING "Device %s does not support fast memory regions", + device->name); + return ERR_PTR(-ENOSYS); + } + + pool = kmalloc(sizeof *pool, GFP_KERNEL); + if (!pool) { + printk(KERN_WARNING "couldn't allocate pool struct"); + return ERR_PTR(-ENOMEM); + } + + pool->cache_bucket = NULL; + + pool->flush_function = params->flush_function; + pool->flush_arg = params->flush_arg; + + INIT_LIST_HEAD(&pool->free_list); + INIT_LIST_HEAD(&pool->dirty_list); + + if (params->cache) { + pool->cache_bucket = + kmalloc(IB_FMR_HASH_SIZE * sizeof *pool->cache_bucket, + GFP_KERNEL); + if (!pool->cache_bucket) { + printk(KERN_WARNING "Failed to allocate cache in pool"); + ret = -ENOMEM; + goto out_free_pool; + } + + for (i = 0; i < IB_FMR_HASH_SIZE; ++i) + INIT_HLIST_HEAD(pool->cache_bucket + i); + } + + pool->pool_size = 0; + pool->max_pages = params->max_pages_per_fmr; + pool->dirty_watermark = params->dirty_watermark; + pool->dirty_len = 0; + spin_lock_init(&pool->pool_lock); + atomic_set(&pool->req_ser, 0); + atomic_set(&pool->flush_ser, 0); + init_waitqueue_head(&pool->force_wait); + + pool->thread = kthread_create(ib_fmr_cleanup_thread, + pool, + "ib_fmr(%s)", + device->name); + if (IS_ERR(pool->thread)) { + printk(KERN_WARNING "couldn't start cleanup thread"); + ret = PTR_ERR(pool->thread); + goto out_free_pool; + } + + { + struct ib_pool_fmr *fmr; + struct ib_fmr_attr attr = { + .max_pages = params->max_pages_per_fmr, + .max_maps = IB_FMR_MAX_REMAPS, + .page_size = PAGE_SHIFT + }; + + for (i = 0; i < params->pool_size; ++i) { + fmr = kmalloc(sizeof *fmr + params->max_pages_per_fmr * sizeof (u64), + GFP_KERNEL); + if (!fmr) { + printk(KERN_WARNING "failed to allocate fmr struct " + "for FMR %d", i); + goto out_fail; + } + + fmr->pool = pool; + fmr->remap_count = 0; + fmr->ref_count = 0; + INIT_HLIST_NODE(&fmr->cache_node); + + fmr->fmr = ib_alloc_fmr(pd, params->access, &attr); + if (IS_ERR(fmr->fmr)) { + printk(KERN_WARNING "fmr_create failed for FMR %d", i); + kfree(fmr); + 
goto out_fail; + } + + list_add_tail(&fmr->list, &pool->free_list); + ++pool->pool_size; + } + } + + return pool; + + out_free_pool: + kfree(pool->cache_bucket); + kfree(pool); + + return ERR_PTR(ret); + + out_fail: + ib_destroy_fmr_pool(pool); + + return ERR_PTR(-ENOMEM); +} +EXPORT_SYMBOL(ib_create_fmr_pool); + +/** + * ib_destroy_fmr_pool - Free FMR pool + * @pool:FMR pool to free + * + * Destroy an FMR pool and free all associated resources. + */ +int ib_destroy_fmr_pool(struct ib_fmr_pool *pool) +{ + struct ib_pool_fmr *fmr; + struct ib_pool_fmr *tmp; + int i; + + kthread_stop(pool->thread); + ib_fmr_batch_release(pool); + + i = 0; + list_for_each_entry_safe(fmr, tmp, &pool->free_list, list) { + ib_dealloc_fmr(fmr->fmr); + list_del(&fmr->list); + kfree(fmr); + ++i; + } + + if (i < pool->pool_size) + printk(KERN_WARNING "pool still has %d regions registered", + pool->pool_size - i); + + kfree(pool->cache_bucket); + kfree(pool); + + return 0; +} +EXPORT_SYMBOL(ib_destroy_fmr_pool); + +/** + * ib_flush_fmr_pool - Invalidate all unmapped FMRs + * @pool:FMR pool to flush + * + * Ensure that all unmapped FMRs are fully invalidated. + */ +int ib_flush_fmr_pool(struct ib_fmr_pool *pool) +{ + int serial; + + atomic_inc(&pool->req_ser); + /* + * It's OK if someone else bumps req_ser again here -- we'll + * just wait a little longer. + */ + serial = atomic_read(&pool->req_ser); + + wake_up_process(pool->thread); + + if (wait_event_interruptible(pool->force_wait, + atomic_read(&pool->flush_ser) - + atomic_read(&pool->req_ser) >= 0)) + return -EINTR; + + return 0; +} +EXPORT_SYMBOL(ib_flush_fmr_pool); + +/** + * ib_fmr_pool_map_phys - + * @pool:FMR pool to allocate FMR from + * @page_list:List of pages to map + * @list_len:Number of pages in @page_list + * @io_virtual_address:I/O virtual address for new FMR + * + * Map an FMR from an FMR pool. 
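[Editorial aside: since the pool API is spread over several entry points, here is a rough end-to-end sketch of how a consumer (SDP, SRP and so on) might drive it. The pool sizes, access flags, page_list contents and io_addr value are placeholders, the flush callback is left unused, and error unwinding is trimmed.]

static int myulp_fmr_demo(struct ib_pd *pd, u64 *page_list, int npages)
{
	struct ib_fmr_pool_param params = {
		.max_pages_per_fmr = 64,
		.access            = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ,
		.pool_size         = 32,
		.dirty_watermark   = 8,
		.cache             = 1
		/* flush_function/flush_arg left zero: no flush notification */
	};
	struct ib_fmr_pool *pool;
	struct ib_pool_fmr *fmr;
	u64 io_addr = 0;	/* desired I/O virtual address for the mapping */

	pool = ib_create_fmr_pool(pd, &params);
	if (IS_ERR(pool))
		return PTR_ERR(pool);

	fmr = ib_fmr_pool_map_phys(pool, page_list, npages, &io_addr);
	if (IS_ERR(fmr)) {
		/* -EAGAIN here means the free list is empty; retry later */
		ib_destroy_fmr_pool(pool);
		return PTR_ERR(fmr);
	}

	/* ... post work requests that reference this mapping ... */

	ib_fmr_pool_unmap(fmr);		/* mapping may stay valid until reuse */
	ib_flush_fmr_pool(pool);	/* force unmapped FMRs to be invalidated */
	ib_destroy_fmr_pool(pool);

	return 0;
}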
+ */ +struct ib_pool_fmr *ib_fmr_pool_map_phys(struct ib_fmr_pool *pool_handle, + u64 *page_list, + int list_len, + u64 *io_virtual_address) +{ + struct ib_fmr_pool *pool = pool_handle; + struct ib_pool_fmr *fmr; + unsigned long flags; + int result; + + if (list_len < 1 || list_len > pool->max_pages) + return ERR_PTR(-EINVAL); + + spin_lock_irqsave(&pool->pool_lock, flags); + fmr = ib_fmr_cache_lookup(pool, + page_list, + list_len, + *io_virtual_address); + if (fmr) { + /* found in cache */ + ++fmr->ref_count; + if (fmr->ref_count == 1) { + list_del(&fmr->list); + } + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return fmr; + } + + if (list_empty(&pool->free_list)) { + spin_unlock_irqrestore(&pool->pool_lock, flags); + return ERR_PTR(-EAGAIN); + } + + fmr = list_entry(pool->free_list.next, struct ib_pool_fmr, list); + list_del(&fmr->list); + hlist_del_init(&fmr->cache_node); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + result = ib_map_phys_fmr(fmr->fmr, page_list, list_len, + *io_virtual_address); + + if (result) { + spin_lock_irqsave(&pool->pool_lock, flags); + list_add(&fmr->list, &pool->free_list); + spin_unlock_irqrestore(&pool->pool_lock, flags); + + printk(KERN_WARNING "fmr_map returns %d", + result); + + return ERR_PTR(result); + } + + ++fmr->remap_count; + fmr->ref_count = 1; + + if (pool->cache_bucket) { + fmr->io_virtual_address = *io_virtual_address; + fmr->page_list_len = list_len; + memcpy(fmr->page_list, page_list, list_len * sizeof(*page_list)); + + spin_lock_irqsave(&pool->pool_lock, flags); + hlist_add_head(&fmr->cache_node, + pool->cache_bucket + ib_fmr_hash(fmr->page_list[0])); + spin_unlock_irqrestore(&pool->pool_lock, flags); + } + + return fmr; +} +EXPORT_SYMBOL(ib_fmr_pool_map_phys); + +/** + * ib_fmr_pool_unmap - Unmap FMR + * @fmr:FMR to unmap + * + * Unmap an FMR. The FMR mapping may remain valid until the FMR is + * reused (or until ib_flush_fmr_pool() is called). + */ +int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr) +{ + struct ib_fmr_pool *pool; + unsigned long flags; + + pool = fmr->pool; + + spin_lock_irqsave(&pool->pool_lock, flags); + + --fmr->ref_count; + if (!fmr->ref_count) { + if (fmr->remap_count < IB_FMR_MAX_REMAPS) { + list_add_tail(&fmr->list, &pool->free_list); + } else { + list_add_tail(&fmr->list, &pool->dirty_list); + ++pool->dirty_len; + wake_up_process(pool->thread); + } + } + +#ifdef DEBUG + if (fmr->ref_count < 0) + printk(KERN_WARNING "FMR %p has ref count %d < 0", + fmr, fmr->ref_count); +#endif + + spin_unlock_irqrestore(&pool->pool_lock, flags); + + return 0; +} +EXPORT_SYMBOL(ib_fmr_pool_unmap); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/packer.c 2004-12-27 21:48:18.385259982 -0800 @@ -0,0 +1,201 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: packer.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +static u64 value_read(int offset, int size, void *structure) +{ + switch (size) { + case 1: return *(u8 *) (structure + offset); + case 2: return be16_to_cpup((__be16 *) (structure + offset)); + case 4: return be32_to_cpup((__be32 *) (structure + offset)); + case 8: return be64_to_cpup((__be64 *) (structure + offset)); + default: + printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + return 0; + } +} + +/** + * ib_pack - Pack a structure into a buffer + * @desc:Array of structure field descriptions + * @desc_len:Number of entries in @desc + * @structure:Structure to pack from + * @buf:Buffer to pack into + * + * ib_pack() packs a list of structure fields into a buffer, + * controlled by the array of fields in @desc. + */ +void ib_pack(const struct ib_field *desc, + int desc_len, + void *structure, + void *buf) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + __be32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be32(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be32 *) buf + desc[i].offset_words; + *addr = (*addr & ~mask) | (cpu_to_be32(val) & mask); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + __be64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + if (desc[i].struct_size_bytes) + val = value_read(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + structure) << shift; + else + val = 0; + + mask = cpu_to_be64(((1ull << desc[i].size_bits) - 1) << shift); + addr = (__be64 *) ((__be32 *) buf + desc[i].offset_words); + *addr = (*addr & ~mask) | (cpu_to_be64(val) & mask); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + if (desc[i].struct_size_bytes) + memcpy(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + structure + desc[i].struct_offset_bytes, + desc[i].size_bits / 8); + else + memset(buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + 0, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_pack); + +static void value_write(int offset, int size, u64 val, void *structure) +{ + switch (size * 8) { + case 8: *( u8 *) (structure + offset) = val; break; + case 16: *(__be16 *) (structure + offset) = cpu_to_be16(val); break; + case 32: *(__be32 *) (structure + offset) = cpu_to_be32(val); break; + case 64: *(__be64 *) (structure + offset) = cpu_to_be64(val); break; + default: + 
printk(KERN_WARNING "Field size %d bits not handled\n", size * 8); + } +} + +/** + * ib_unpack - Unpack a buffer into a structure + * @desc:Array of structure field descriptions + * @desc_len:Number of entries in @desc + * @buf:Buffer to unpack from + * @structure:Structure to unpack into + * + * ib_pack() unpacks a list of structure fields from a buffer, + * controlled by the array of fields in @desc. + */ +void ib_unpack(const struct ib_field *desc, + int desc_len, + void *buf, + void *structure) +{ + int i; + + for (i = 0; i < desc_len; ++i) { + if (!desc[i].struct_size_bytes) + continue; + + if (desc[i].size_bits <= 32) { + int shift; + u32 val; + u32 mask; + __be32 *addr; + + shift = 32 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be32 *) buf + desc[i].offset_words; + val = (be32_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else if (desc[i].size_bits <= 64) { + int shift; + u64 val; + u64 mask; + __be64 *addr; + + shift = 64 - desc[i].offset_bits - desc[i].size_bits; + mask = ((1ull << desc[i].size_bits) - 1) << shift; + addr = (__be64 *) buf + desc[i].offset_words; + val = (be64_to_cpup(addr) & mask) >> shift; + value_write(desc[i].struct_offset_bytes, + desc[i].struct_size_bytes, + val, + structure); + } else { + if (desc[i].offset_bits % 8 || + desc[i].size_bits % 8) { + printk(KERN_WARNING "Structure field %s of size %d " + "bits is not byte-aligned\n", + desc[i].field_name, desc[i].size_bits); + } + + memcpy(structure + desc[i].struct_offset_bytes, + buf + desc[i].offset_words * 4 + + desc[i].offset_bits / 8, + desc[i].size_bits / 8); + } + } +} +EXPORT_SYMBOL(ib_unpack); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sysfs.c 2004-12-27 21:48:18.498243351 -0800 @@ -0,0 +1,725 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
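[Editorial aside, going back to packer.c for a moment: as a worked example of the field tables ib_pack()/ib_unpack() consume, here is how a hypothetical two-field header (eight bits of version, then sixteen bits of length in the second half of the first word) could be described. my_hdr and its table are made up for illustration.]

struct my_hdr {
	u8  version;
	u16 length;
};

static const struct ib_field my_hdr_table[] = {
	{ .struct_offset_bytes = offsetof(struct my_hdr, version),
	  .struct_size_bytes   = sizeof ((struct my_hdr *) 0)->version,
	  .field_name          = "my_hdr:version",
	  .offset_words        = 0,
	  .offset_bits         = 0,
	  .size_bits           = 8 },
	{ .struct_offset_bytes = offsetof(struct my_hdr, length),
	  .struct_size_bytes   = sizeof ((struct my_hdr *) 0)->length,
	  .field_name          = "my_hdr:length",
	  .offset_words        = 0,
	  .offset_bits         = 16,
	  .size_bits           = 16 }
};

static void my_hdr_to_wire(struct my_hdr *hdr, void *buf)
{
	/* bits 8-15 of word 0 are not listed, so whatever buf holds there is
	   preserved; multi-byte struct fields are read as big-endian values */
	ib_pack(my_hdr_table, ARRAY_SIZE(my_hdr_table), hdr, buf);
}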
+ * + * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include "core_priv.h" + +#include + +struct ib_port { + struct kobject kobj; + struct ib_device *ibdev; + struct attribute_group gid_group; + struct attribute **gid_attr; + struct attribute_group pkey_group; + struct attribute **pkey_attr; + u8 port_num; +}; + +struct port_attribute { + struct attribute attr; + ssize_t (*show)(struct ib_port *, struct port_attribute *, char *buf); + ssize_t (*store)(struct ib_port *, struct port_attribute *, + const char *buf, size_t count); +}; + +#define PORT_ATTR(_name, _mode, _show, _store) \ +struct port_attribute port_attr_##_name = __ATTR(_name, _mode, _show, _store) + +#define PORT_ATTR_RO(_name) \ +struct port_attribute port_attr_##_name = __ATTR_RO(_name) + +struct port_table_attribute { + struct port_attribute attr; + int index; +}; + +static ssize_t port_attr_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct port_attribute *port_attr = + container_of(attr, struct port_attribute, attr); + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + + if (!port_attr->show) + return 0; + + return port_attr->show(p, port_attr, buf); +} + +static struct sysfs_ops port_sysfs_ops = { + .show = port_attr_show +}; + +static ssize_t state_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + static const char *state_name[] = { + [IB_PORT_NOP] = "NOP", + [IB_PORT_DOWN] = "DOWN", + [IB_PORT_INIT] = "INIT", + [IB_PORT_ARMED] = "ARMED", + [IB_PORT_ACTIVE] = "ACTIVE", + [IB_PORT_ACTIVE_DEFER] = "ACTIVE_DEFER" + }; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d: %s\n", attr.state, + attr.state >= 0 && attr.state <= ARRAY_SIZE(state_name) ? 
+ state_name[attr.state] : "UNKNOWN"); +} + +static ssize_t lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.lid); +} + +static ssize_t lid_mask_count_show(struct ib_port *p, + struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.lmc); +} + +static ssize_t sm_lid_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%x\n", attr.sm_lid); +} + +static ssize_t sm_sl_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "%d\n", attr.sm_sl); +} + +static ssize_t cap_mask_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + return sprintf(buf, "0x%08x\n", attr.port_cap_flags); +} + +static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + char *speed = ""; + int rate; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + switch (attr.active_speed) { + case 2: speed = " DDR"; break; + case 4: speed = " QDR"; break; + } + + printk(KERN_ERR "width %d speed %d\n", attr.active_width, attr.active_speed); + + rate = 25 * ib_width_enum_to_int(attr.active_width) * attr.active_speed; + if (rate < 0) + return -EINVAL; + + return sprintf(buf, "%d%s Gb/sec (%dX%s)\n", + rate / 10, rate % 10 ? 
".5" : "", + ib_width_enum_to_int(attr.active_width), speed); +} + +static PORT_ATTR_RO(state); +static PORT_ATTR_RO(lid); +static PORT_ATTR_RO(lid_mask_count); +static PORT_ATTR_RO(sm_lid); +static PORT_ATTR_RO(sm_sl); +static PORT_ATTR_RO(cap_mask); +static PORT_ATTR_RO(rate); + +static struct attribute *port_default_attrs[] = { + &port_attr_state.attr, + &port_attr_lid.attr, + &port_attr_lid_mask_count.attr, + &port_attr_sm_lid.attr, + &port_attr_sm_sl.attr, + &port_attr_cap_mask.attr, + &port_attr_rate.attr, + NULL +}; + +static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + ssize_t ret; + + ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) gid.raw)[0]), + be16_to_cpu(((u16 *) gid.raw)[1]), + be16_to_cpu(((u16 *) gid.raw)[2]), + be16_to_cpu(((u16 *) gid.raw)[3]), + be16_to_cpu(((u16 *) gid.raw)[4]), + be16_to_cpu(((u16 *) gid.raw)[5]), + be16_to_cpu(((u16 *) gid.raw)[6]), + be16_to_cpu(((u16 *) gid.raw)[7])); +} + +static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + u16 pkey; + ssize_t ret; + + ret = ib_query_pkey(p->ibdev, p->port_num, tab_attr->index, &pkey); + if (ret) + return ret; + + return sprintf(buf, "0x%04x\n", pkey); +} + +#define PORT_PMA_ATTR(_name, _counter, _width, _offset) \ +struct port_table_attribute port_pma_attr_##_name = { \ + .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL), \ + .index = (_offset) | ((_width) << 16) | ((_counter) << 24) \ +} + +static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + int offset = tab_attr->index & 0xffff; + int width = (tab_attr->index >> 16) & 0xff; + struct ib_mad *in_mad = NULL; + struct ib_mad *out_mad = NULL; + ssize_t ret; + + if (!p->ibdev->process_mad) + return sprintf(buf, "N/A (no PMA)\n"); + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + if (!in_mad || !out_mad) { + ret = -ENOMEM; + goto out; + } + + memset(in_mad, 0, sizeof *in_mad); + in_mad->mad_hdr.base_version = 1; + in_mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + in_mad->mad_hdr.class_version = 1; + in_mad->mad_hdr.method = IB_MGMT_METHOD_GET; + in_mad->mad_hdr.attr_id = cpu_to_be16(0x12); /* PortCounters */ + + in_mad->data[41] = p->port_num; /* PortSelect field */ + + if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, p->port_num, 0xffff, + in_mad, out_mad) & + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) != + (IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY)) { + ret = -EINVAL; + goto out; + } + + switch (width) { + case 4: + ret = sprintf(buf, "%u\n", (out_mad->data[40 + offset / 8] >> + (offset % 4)) & 0xf); + break; + case 8: + ret = sprintf(buf, "%u\n", out_mad->data[40 + offset / 8]); + break; + case 16: + ret = sprintf(buf, "%u\n", + be16_to_cpup((u16 *)(out_mad->data + 40 + offset / 8))); + break; + case 32: + ret = sprintf(buf, "%u\n", + be32_to_cpup((u32 *)(out_mad->data + 40 + offset / 8))); + break; + default: + ret = 0; + } + +out: + kfree(in_mad); + kfree(out_mad); + + return ret; +} + +static 
PORT_PMA_ATTR(symbol_error , 0, 16, 32); +static PORT_PMA_ATTR(link_error_recovery , 1, 8, 48); +static PORT_PMA_ATTR(link_downed , 2, 8, 56); +static PORT_PMA_ATTR(port_rcv_errors , 3, 16, 64); +static PORT_PMA_ATTR(port_rcv_remote_physical_errors, 4, 16, 80); +static PORT_PMA_ATTR(port_rcv_switch_relay_errors , 5, 16, 96); +static PORT_PMA_ATTR(port_xmit_discards , 6, 16, 112); +static PORT_PMA_ATTR(port_xmit_constraint_errors , 7, 8, 128); +static PORT_PMA_ATTR(port_rcv_constraint_errors , 8, 8, 136); +static PORT_PMA_ATTR(local_link_integrity_errors , 9, 4, 152); +static PORT_PMA_ATTR(excessive_buffer_overrun_errors, 10, 4, 156); +static PORT_PMA_ATTR(VL15_dropped , 11, 16, 176); +static PORT_PMA_ATTR(port_xmit_data , 12, 32, 192); +static PORT_PMA_ATTR(port_rcv_data , 13, 32, 224); +static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256); +static PORT_PMA_ATTR(port_rcv_packets , 15, 32, 288); + +static struct attribute *pma_attrs[] = { + &port_pma_attr_symbol_error.attr.attr, + &port_pma_attr_link_error_recovery.attr.attr, + &port_pma_attr_link_downed.attr.attr, + &port_pma_attr_port_rcv_errors.attr.attr, + &port_pma_attr_port_rcv_remote_physical_errors.attr.attr, + &port_pma_attr_port_rcv_switch_relay_errors.attr.attr, + &port_pma_attr_port_xmit_discards.attr.attr, + &port_pma_attr_port_xmit_constraint_errors.attr.attr, + &port_pma_attr_port_rcv_constraint_errors.attr.attr, + &port_pma_attr_local_link_integrity_errors.attr.attr, + &port_pma_attr_excessive_buffer_overrun_errors.attr.attr, + &port_pma_attr_VL15_dropped.attr.attr, + &port_pma_attr_port_xmit_data.attr.attr, + &port_pma_attr_port_rcv_data.attr.attr, + &port_pma_attr_port_xmit_packets.attr.attr, + &port_pma_attr_port_rcv_packets.attr.attr, + NULL +}; + +static struct attribute_group pma_group = { + .name = "counters", + .attrs = pma_attrs +}; + +static void ib_port_release(struct kobject *kobj) +{ + struct ib_port *p = container_of(kobj, struct ib_port, kobj); + struct attribute *a; + int i; + + for (i = 0; (a = p->gid_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + for (i = 0; (a = p->pkey_attr[i]); ++i) { + kfree(a->name); + kfree(a); + } + + kfree(p->gid_attr); + kfree(p); +} + +static struct kobj_type port_type = { + .release = ib_port_release, + .sysfs_ops = &port_sysfs_ops, + .default_attrs = port_default_attrs +}; + +static void ib_device_release(struct class_device *cdev) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + + kfree(dev); +} + +static int ib_device_hotplug(struct class_device *cdev, char **envp, + int num_envp, char *buf, int size) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + int i = 0, len = 0; + + if (add_hotplug_env_var(envp, num_envp, &i, buf, size, &len, + "NAME=%s", dev->name)) + return -ENOMEM; + + /* + * It might be nice to pass the node GUID to hotplug, but + * right now the only way to get it is to query the device + * provider, and this can crash during device removal because + * we are will be running after driver removal has started. + * We could add a node_guid field to struct ib_device, or we + * could just let the hotplug script read the node GUID from + * sysfs when devices are added. 
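[Editorial aside, back on the PMA counter attributes defined just above: to decode the PORT_PMA_ATTR() encoding by example, port_rcv_data is counter 13, 32 bits wide at bit offset 224 of the PortCounters attribute, so its .index is 224 | (32 << 16) | (13 << 24). show_pma_counter() then recovers offset = index & 0xffff = 224 and width = (index >> 16) & 0xff = 32, and returns the big-endian 32-bit value it finds at out_mad->data[40 + 224 / 8].]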
+ */ + + envp[i] = NULL; + return 0; +} + +static int alloc_group(struct attribute ***attr, + ssize_t (*show)(struct ib_port *, + struct port_attribute *, char *buf), + int len) +{ + struct port_table_attribute ***tab_attr = + (struct port_table_attribute ***) attr; + int i; + int ret; + + *tab_attr = kmalloc((1 + len) * sizeof *tab_attr, GFP_KERNEL); + if (!*tab_attr) + return -ENOMEM; + + memset(*tab_attr, 0, (1 + len) * sizeof *tab_attr); + + for (i = 0; i < len; ++i) { + (*tab_attr)[i] = kmalloc(sizeof *(*tab_attr)[i], GFP_KERNEL); + if (!(*tab_attr)[i]) { + ret = -ENOMEM; + goto err; + } + memset((*tab_attr)[i], 0, sizeof *(*tab_attr)[i]); + (*tab_attr)[i]->attr.attr.name = kmalloc(8, GFP_KERNEL); + if (!(*tab_attr)[i]->attr.attr.name) { + ret = -ENOMEM; + goto err; + } + + if (snprintf((*tab_attr)[i]->attr.attr.name, 8, "%d", i) >= 8) { + ret = -ENOMEM; + goto err; + } + + (*tab_attr)[i]->attr.attr.mode = S_IRUGO; + (*tab_attr)[i]->attr.attr.owner = THIS_MODULE; + (*tab_attr)[i]->attr.show = show; + (*tab_attr)[i]->index = i; + } + + return 0; + +err: + for (i = 0; i < len; ++i) { + if ((*tab_attr)[i]) + kfree((*tab_attr)[i]->attr.attr.name); + kfree((*tab_attr)[i]); + } + + kfree(*tab_attr); + + return ret; +} + +static int add_port(struct ib_device *device, int port_num) +{ + struct ib_port *p; + struct ib_port_attr attr; + int i; + int ret; + + ret = ib_query_port(device, port_num, &attr); + if (ret) + return ret; + + p = kmalloc(sizeof *p, GFP_KERNEL); + if (!p) + return -ENOMEM; + memset(p, 0, sizeof *p); + + p->ibdev = device; + p->port_num = port_num; + p->kobj.ktype = &port_type; + + p->kobj.parent = kobject_get(&device->ports_parent); + if (!p->kobj.parent) { + ret = -EBUSY; + goto err; + } + + ret = kobject_set_name(&p->kobj, "%d", port_num); + if (ret) + goto err_put; + + ret = kobject_register(&p->kobj); + if (ret) + goto err_put; + + ret = sysfs_create_group(&p->kobj, &pma_group); + if (ret) + goto err_put; + + ret = alloc_group(&p->gid_attr, show_port_gid, attr.gid_tbl_len); + if (ret) + goto err_remove_pma; + + p->gid_group.name = "gids"; + p->gid_group.attrs = p->gid_attr; + + ret = sysfs_create_group(&p->kobj, &p->gid_group); + if (ret) + goto err_free_gid; + + ret = alloc_group(&p->pkey_attr, show_port_pkey, attr.pkey_tbl_len); + if (ret) + goto err_remove_gid; + + p->pkey_group.name = "pkeys"; + p->pkey_group.attrs = p->pkey_attr; + + ret = sysfs_create_group(&p->kobj, &p->pkey_group); + if (ret) + goto err_free_pkey; + + list_add_tail(&p->kobj.entry, &device->port_list); + + return 0; + +err_free_pkey: + for (i = 0; i < attr.pkey_tbl_len; ++i) { + kfree(p->pkey_attr[i]->name); + kfree(p->pkey_attr[i]); + } + + kfree(p->pkey_attr); + +err_remove_gid: + sysfs_remove_group(&p->kobj, &p->gid_group); + +err_free_gid: + for (i = 0; i < attr.gid_tbl_len; ++i) { + kfree(p->gid_attr[i]->name); + kfree(p->gid_attr[i]); + } + + kfree(p->gid_attr); + +err_remove_pma: + sysfs_remove_group(&p->kobj, &pma_group); + +err_put: + kobject_put(&device->ports_parent); + +err: + kfree(p); + return ret; +} + +static ssize_t show_sys_image_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.sys_image_guid)[0]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[1]), + be16_to_cpu(((u16 *) &attr.sys_image_guid)[2]), + be16_to_cpu(((u16 *) 
&attr.sys_image_guid)[3])); +} + +static ssize_t show_node_guid(struct class_device *cdev, char *buf) +{ + struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); + struct ib_device_attr attr; + ssize_t ret; + + ret = ib_query_device(dev, &attr); + if (ret) + return ret; + + return sprintf(buf, "%04x:%04x:%04x:%04x\n", + be16_to_cpu(((u16 *) &attr.node_guid)[0]), + be16_to_cpu(((u16 *) &attr.node_guid)[1]), + be16_to_cpu(((u16 *) &attr.node_guid)[2]), + be16_to_cpu(((u16 *) &attr.node_guid)[3])); +} + +static CLASS_DEVICE_ATTR(sys_image_guid, S_IRUGO, show_sys_image_guid, NULL); +static CLASS_DEVICE_ATTR(node_guid, S_IRUGO, show_node_guid, NULL); + +static struct class_device_attribute *ib_class_attributes[] = { + &class_device_attr_sys_image_guid, + &class_device_attr_node_guid +}; + +static struct class ib_class = { + .name = "infiniband", + .release = ib_device_release, + .hotplug = ib_device_hotplug, +}; + +int ib_device_register_sysfs(struct ib_device *device) +{ + struct class_device *class_dev = &device->class_dev; + int ret; + int i; + + class_dev->class = &ib_class; + class_dev->class_data = device; + strlcpy(class_dev->class_id, device->name, BUS_ID_SIZE); + + INIT_LIST_HEAD(&device->port_list); + + ret = class_device_register(class_dev); + if (ret) + goto err; + + for (i = 0; i < ARRAY_SIZE(ib_class_attributes); ++i) { + ret = class_device_create_file(class_dev, ib_class_attributes[i]); + if (ret) + goto err_unregister; + } + + device->ports_parent.parent = kobject_get(&class_dev->kobj); + if (!device->ports_parent.parent) { + ret = -EBUSY; + goto err_unregister; + } + ret = kobject_set_name(&device->ports_parent, "ports"); + if (ret) + goto err_put; + ret = kobject_register(&device->ports_parent); + if (ret) + goto err_put; + + if (device->node_type == IB_NODE_SWITCH) { + ret = add_port(device, 0); + if (ret) + goto err_put; + } else { + int i; + + for (i = 1; i <= device->phys_port_cnt; ++i) { + ret = add_port(device, i); + if (ret) + goto err_put; + } + } + + return 0; + +err_put: + { + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + } + + kobject_put(&class_dev->kobj); + +err_unregister: + class_device_unregister(class_dev); + +err: + return ret; +} + +void ib_device_unregister_sysfs(struct ib_device *device) +{ + struct kobject *p, *t; + struct ib_port *port; + + list_for_each_entry_safe(p, t, &device->port_list, entry) { + list_del(&p->entry); + port = container_of(p, struct ib_port, kobj); + sysfs_remove_group(p, &pma_group); + sysfs_remove_group(p, &port->pkey_group); + sysfs_remove_group(p, &port->gid_group); + kobject_unregister(p); + } + + kobject_unregister(&device->ports_parent); + class_device_unregister(&device->class_dev); +} + +int ib_sysfs_setup(void) +{ + return class_register(&ib_class); +} + +void ib_sysfs_cleanup(void) +{ + class_unregister(&ib_class); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/ud_header.c 2004-12-27 21:48:18.428253653 -0800 @@ -0,0 +1,365 @@ +/* + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
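[Editorial aside: putting the pieces of sysfs.c together, the resulting layout for a registered device looks roughly like the tree below. The device name mthca0 and the table sizes are only examples; a switch would get a single port directory named 0 instead of 1..phys_port_cnt.]

	/sys/class/infiniband/mthca0/
		node_guid
		sys_image_guid
		ports/
			1/
				state  lid  lid_mask_count  sm_lid  sm_sl  cap_mask  rate
				counters/   symbol_error, link_downed, port_rcv_data, ...
				gids/       0 .. gid_tbl_len - 1
				pkeys/      0 .. pkey_tbl_len - 1
			2/
				(same layout for each additional port)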
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ud_header.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include + +#define STRUCT_FIELD(header, field) \ + .struct_offset_bytes = offsetof(struct ib_unpacked_ ## header, field), \ + .struct_size_bytes = sizeof ((struct ib_unpacked_ ## header *) 0)->field, \ + .field_name = #header ":" #field + +static const struct ib_field lrh_table[] = { + { STRUCT_FIELD(lrh, virtual_lane), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, link_version), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 4 }, + { STRUCT_FIELD(lrh, service_level), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 4 }, + { RESERVED, + .offset_words = 0, + .offset_bits = 12, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, link_next_header), + .offset_words = 0, + .offset_bits = 14, + .size_bits = 2 }, + { STRUCT_FIELD(lrh, destination_lid), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 5 }, + { STRUCT_FIELD(lrh, packet_length), + .offset_words = 1, + .offset_bits = 5, + .size_bits = 11 }, + { STRUCT_FIELD(lrh, source_lid), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 } +}; + +static const struct ib_field grh_table[] = { + { STRUCT_FIELD(grh, ip_version), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 4 }, + { STRUCT_FIELD(grh, traffic_class), + .offset_words = 0, + .offset_bits = 4, + .size_bits = 8 }, + { STRUCT_FIELD(grh, flow_label), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 20 }, + { STRUCT_FIELD(grh, payload_length), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { STRUCT_FIELD(grh, next_header), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 8 }, + { STRUCT_FIELD(grh, hop_limit), + .offset_words = 1, + .offset_bits = 24, + .size_bits = 8 }, + { STRUCT_FIELD(grh, source_gid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { STRUCT_FIELD(grh, destination_gid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 } +}; + +static const struct ib_field bth_table[] = { + { STRUCT_FIELD(bth, opcode), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, solicited_event), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 1 }, + { 
STRUCT_FIELD(bth, mig_req), + .offset_words = 0, + .offset_bits = 9, + .size_bits = 1 }, + { STRUCT_FIELD(bth, pad_count), + .offset_words = 0, + .offset_bits = 10, + .size_bits = 2 }, + { STRUCT_FIELD(bth, transport_header_version), + .offset_words = 0, + .offset_bits = 12, + .size_bits = 4 }, + { STRUCT_FIELD(bth, pkey), + .offset_words = 0, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(bth, destination_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 }, + { STRUCT_FIELD(bth, ack_req), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 2, + .offset_bits = 1, + .size_bits = 7 }, + { STRUCT_FIELD(bth, psn), + .offset_words = 2, + .offset_bits = 8, + .size_bits = 24 } +}; + +static const struct ib_field deth_table[] = { + { STRUCT_FIELD(deth, qkey), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 8 }, + { STRUCT_FIELD(deth, source_qpn), + .offset_words = 1, + .offset_bits = 8, + .size_bits = 24 } +}; + +/** + * ib_ud_header_init - Initialize UD header structure + * @payload_bytes:Length of packet payload + * @grh_present:GRH flag (if non-zero, GRH will be included) + * @header:Structure to initialize + * + * ib_ud_header_init() initializes the lrh.link_version, lrh.link_next_header, + * lrh.packet_length, grh.ip_version, grh.payload_length, + * grh.next_header, bth.opcode, bth.pad_count and + * bth.transport_header_version fields of a &struct ib_ud_header given + * the payload length and whether a GRH will be included. + */ +void ib_ud_header_init(int payload_bytes, + int grh_present, + struct ib_ud_header *header) +{ + int header_len; + + memset(header, 0, sizeof *header); + + header_len = + IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES; + if (grh_present) { + header_len += IB_GRH_BYTES; + } + + header->lrh.link_version = 0; + header->lrh.link_next_header = + grh_present ? IB_LNH_IBA_GLOBAL : IB_LNH_IBA_LOCAL; + header->lrh.packet_length = (IB_LRH_BYTES + + IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) / 4; /* round up */ + + header->grh_present = grh_present; + if (grh_present) { + header->lrh.packet_length += IB_GRH_BYTES / 4; + + header->grh.ip_version = 6; + header->grh.payload_length = + cpu_to_be16((IB_BTH_BYTES + + IB_DETH_BYTES + + payload_bytes + + 4 + /* ICRC */ + 3) & ~3); /* round up */ + header->grh.next_header = 0x1b; + } + + cpu_to_be16s(&header->lrh.packet_length); + + if (header->immediate_present) + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + else + header->bth.opcode = IB_OPCODE_UD_SEND_ONLY; + header->bth.pad_count = (4 - payload_bytes) & 3; + header->bth.transport_header_version = 0; +} +EXPORT_SYMBOL(ib_ud_header_init); + +/** + * ib_ud_header_pack - Pack UD header struct into wire format + * @header:UD header struct + * @buf:Buffer to pack into + * + * ib_ud_header_pack() packs the UD header structure @header into wire + * format in the buffer @buf. 
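[Editorial aside: for anyone wiring this into a driver's special QP send path, a rough sketch of building and packing a UD send header follows. The LID, P_Key, QPN, PSN and Q_Key values are placeholder parameters, no GRH is used, and only the caller-owned fields are shown (with a GRH, the grh.* addresses would be filled the same way). Note that, per the packing tables, multi-byte fields are kept big-endian inside the unpacked struct.]

static int build_ud_send_header(void *buf, int payload_bytes,
				u16 dlid, u16 slid, u16 pkey,
				u32 remote_qpn, u32 local_qpn,
				u32 psn, u32 qkey)
{
	struct ib_ud_header header;

	/* buf needs IB_LRH_BYTES + IB_BTH_BYTES + IB_DETH_BYTES of room */
	ib_ud_header_init(payload_bytes, 0 /* no GRH */, &header);

	header.lrh.service_level   = 0;
	header.lrh.destination_lid = cpu_to_be16(dlid);
	header.lrh.source_lid      = cpu_to_be16(slid);

	header.bth.solicited_event = 1;
	header.bth.pkey            = cpu_to_be16(pkey);
	header.bth.destination_qpn = cpu_to_be32(remote_qpn);
	header.bth.psn             = cpu_to_be32(psn);

	header.deth.qkey           = cpu_to_be32(qkey);
	header.deth.source_qpn     = cpu_to_be32(local_qpn);

	/* returns the number of header bytes written into buf */
	return ib_ud_header_pack(&header, buf);
}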
+ */ +int ib_ud_header_pack(struct ib_ud_header *header, + void *buf) +{ + int len = 0; + + ib_pack(lrh_table, ARRAY_SIZE(lrh_table), + &header->lrh, buf); + len += IB_LRH_BYTES; + + if (header->grh_present) { + ib_pack(grh_table, ARRAY_SIZE(grh_table), + &header->grh, buf + len); + len += IB_GRH_BYTES; + } + + ib_pack(bth_table, ARRAY_SIZE(bth_table), + &header->bth, buf + len); + len += IB_BTH_BYTES; + + ib_pack(deth_table, ARRAY_SIZE(deth_table), + &header->deth, buf + len); + len += IB_DETH_BYTES; + + if (header->immediate_present) { + memcpy(buf + len, &header->immediate_data, sizeof header->immediate_data); + len += sizeof header->immediate_data; + } + + return len; +} +EXPORT_SYMBOL(ib_ud_header_pack); + +/** + * ib_ud_header_unpack - Unpack UD header struct from wire format + * @header:UD header struct + * @buf:Buffer to unpack from + * + * ib_ud_header_unpack() unpacks the UD header structure @header from wire + * format in the buffer @buf. + */ +int ib_ud_header_unpack(void *buf, + struct ib_ud_header *header) +{ + ib_unpack(lrh_table, ARRAY_SIZE(lrh_table), + buf, &header->lrh); + buf += IB_LRH_BYTES; + + if (header->lrh.link_version != 0) { + printk(KERN_WARNING "Invalid LRH.link_version %d\n", + header->lrh.link_version); + return -EINVAL; + } + + switch (header->lrh.link_next_header) { + case IB_LNH_IBA_LOCAL: + header->grh_present = 0; + break; + + case IB_LNH_IBA_GLOBAL: + header->grh_present = 1; + ib_unpack(grh_table, ARRAY_SIZE(grh_table), + buf, &header->grh); + buf += IB_GRH_BYTES; + + if (header->grh.ip_version != 6) { + printk(KERN_WARNING "Invalid GRH.ip_version %d\n", + header->grh.ip_version); + return -EINVAL; + } + if (header->grh.next_header != 0x1b) { + printk(KERN_WARNING "Invalid GRH.next_header 0x%02x\n", + header->grh.next_header); + return -EINVAL; + } + break; + + default: + printk(KERN_WARNING "Invalid LRH.link_next_header %d\n", + header->lrh.link_next_header); + return -EINVAL; + } + + ib_unpack(bth_table, ARRAY_SIZE(bth_table), + buf, &header->bth); + buf += IB_BTH_BYTES; + + switch (header->bth.opcode) { + case IB_OPCODE_UD_SEND_ONLY: + header->immediate_present = 0; + break; + case IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE: + header->immediate_present = 1; + break; + default: + printk(KERN_WARNING "Invalid BTH.opcode 0x%02x\n", + header->bth.opcode); + return -EINVAL; + } + + if (header->bth.transport_header_version != 0) { + printk(KERN_WARNING "Invalid BTH.transport_header_version %d\n", + header->bth.transport_header_version); + return -EINVAL; + } + + ib_unpack(deth_table, ARRAY_SIZE(deth_table), + buf, &header->deth); + buf += IB_DETH_BYTES; + + if (header->immediate_present) + memcpy(&header->immediate_data, buf, sizeof header->immediate_data); + + return 0; +} +EXPORT_SYMBOL(ib_ud_header_unpack); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/verbs.c 2004-12-27 21:48:18.453249974 -0800 @@ -0,0 +1,433 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: verbs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include + +/* Protection domains */ + +struct ib_pd *ib_alloc_pd(struct ib_device *device) +{ + struct ib_pd *pd; + + pd = device->alloc_pd(device); + + if (!IS_ERR(pd)) { + pd->device = device; + atomic_set(&pd->usecnt, 0); + } + + return pd; +} +EXPORT_SYMBOL(ib_alloc_pd); + +int ib_dealloc_pd(struct ib_pd *pd) +{ + if (atomic_read(&pd->usecnt)) + return -EBUSY; + + return pd->device->dealloc_pd(pd); +} +EXPORT_SYMBOL(ib_dealloc_pd); + +/* Address handles */ + +struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) +{ + struct ib_ah *ah; + + ah = pd->device->create_ah(pd, ah_attr); + + if (!IS_ERR(ah)) { + ah->device = pd->device; + ah->pd = pd; + atomic_inc(&pd->usecnt); + } + + return ah; +} +EXPORT_SYMBOL(ib_create_ah); + +int ib_modify_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->modify_ah ? + ah->device->modify_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_modify_ah); + +int ib_query_ah(struct ib_ah *ah, struct ib_ah_attr *ah_attr) +{ + return ah->device->query_ah ? 
+ ah->device->query_ah(ah, ah_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_ah); + +int ib_destroy_ah(struct ib_ah *ah) +{ + struct ib_pd *pd; + int ret; + + pd = ah->pd; + ret = ah->device->destroy_ah(ah); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_destroy_ah); + +/* Queue pairs */ + +struct ib_qp *ib_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr) +{ + struct ib_qp *qp; + + qp = pd->device->create_qp(pd, qp_init_attr); + + if (!IS_ERR(qp)) { + qp->device = pd->device; + qp->pd = pd; + qp->send_cq = qp_init_attr->send_cq; + qp->recv_cq = qp_init_attr->recv_cq; + qp->srq = qp_init_attr->srq; + qp->event_handler = qp_init_attr->event_handler; + qp->qp_context = qp_init_attr->qp_context; + atomic_inc(&pd->usecnt); + atomic_inc(&qp_init_attr->send_cq->usecnt); + atomic_inc(&qp_init_attr->recv_cq->usecnt); + if (qp_init_attr->srq) + atomic_inc(&qp_init_attr->srq->usecnt); + } + + return qp; +} +EXPORT_SYMBOL(ib_create_qp); + +int ib_modify_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask) +{ + return qp->device->modify_qp(qp, qp_attr, qp_attr_mask); +} +EXPORT_SYMBOL(ib_modify_qp); + +int ib_query_qp(struct ib_qp *qp, + struct ib_qp_attr *qp_attr, + int qp_attr_mask, + struct ib_qp_init_attr *qp_init_attr) +{ + return qp->device->query_qp ? + qp->device->query_qp(qp, qp_attr, qp_attr_mask, qp_init_attr) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_query_qp); + +int ib_destroy_qp(struct ib_qp *qp) +{ + struct ib_pd *pd; + struct ib_cq *scq, *rcq; + struct ib_srq *srq; + int ret; + + pd = qp->pd; + scq = qp->send_cq; + rcq = qp->recv_cq; + srq = qp->srq; + + ret = qp->device->destroy_qp(qp); + if (!ret) { + atomic_dec(&pd->usecnt); + atomic_dec(&scq->usecnt); + atomic_dec(&rcq->usecnt); + if (srq) + atomic_dec(&srq->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_destroy_qp); + +/* Completion queues */ + +struct ib_cq *ib_create_cq(struct ib_device *device, + ib_comp_handler comp_handler, + void (*event_handler)(struct ib_event *, void *), + void *cq_context, int cqe) +{ + struct ib_cq *cq; + + cq = device->create_cq(device, cqe); + + if (!IS_ERR(cq)) { + cq->device = device; + cq->comp_handler = comp_handler; + cq->event_handler = event_handler; + cq->cq_context = cq_context; + atomic_set(&cq->usecnt, 0); + } + + return cq; +} +EXPORT_SYMBOL(ib_create_cq); + +int ib_destroy_cq(struct ib_cq *cq) +{ + if (atomic_read(&cq->usecnt)) + return -EBUSY; + + return cq->device->destroy_cq(cq); +} +EXPORT_SYMBOL(ib_destroy_cq); + +int ib_resize_cq(struct ib_cq *cq, + int cqe) +{ + int ret; + + if (!cq->device->resize_cq) + return -ENOSYS; + + ret = cq->device->resize_cq(cq, &cqe); + if (!ret) + cq->cqe = cqe; + + return ret; +} +EXPORT_SYMBOL(ib_resize_cq); + +/* Memory regions */ + +struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags) +{ + struct ib_mr *mr; + + mr = pd->device->get_dma_mr(pd, mr_access_flags); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_get_dma_mr); + +struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_mr *mr; + + mr = pd->device->reg_phys_mr(pd, phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_reg_phys_mr); + 
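/*
 * Editorial sketch, not part of this patch: a minimal consumer of the
 * wrappers above.  The wrappers only maintain the usecnt reference counts,
 * so teardown must destroy the QP before its CQ and PD, otherwise
 * ib_destroy_cq() and ib_dealloc_pd() return -EBUSY.  The device pointer,
 * CQE count and QP attributes are placeholders; the remaining
 * ib_qp_init_attr fields (QP type, capabilities) are omitted here.
 */
static int example_verbs_bringup(struct ib_device *device)
{
	struct ib_pd *pd;
	struct ib_cq *cq;
	struct ib_qp *qp;
	struct ib_qp_init_attr init_attr;
	int ret = 0;

	pd = ib_alloc_pd(device);
	if (IS_ERR(pd))
		return PTR_ERR(pd);

	cq = ib_create_cq(device, NULL, NULL, NULL, 64);	/* no handlers, 64 CQEs */
	if (IS_ERR(cq)) {
		ret = PTR_ERR(cq);
		goto err_pd;
	}

	memset(&init_attr, 0, sizeof init_attr);
	init_attr.send_cq = cq;
	init_attr.recv_cq = cq;

	qp = ib_create_qp(pd, &init_attr);	/* takes usecnt references on PD and CQ */
	if (IS_ERR(qp)) {
		ret = PTR_ERR(qp);
		goto err_cq;
	}

	/* ... post work requests here ... */

	ib_destroy_qp(qp);		/* drops the PD and CQ references */
err_cq:
	ib_destroy_cq(cq);		/* -EBUSY while a QP still uses it */
err_pd:
	ib_dealloc_pd(pd);		/* -EBUSY while anything still uses it */
	return ret;
}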
+int ib_rereg_phys_mr(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *phys_buf_array, + int num_phys_buf, + int mr_access_flags, + u64 *iova_start) +{ + struct ib_pd *old_pd; + int ret; + + if (!mr->device->rereg_phys_mr) + return -ENOSYS; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + old_pd = mr->pd; + + ret = mr->device->rereg_phys_mr(mr, mr_rereg_mask, pd, + phys_buf_array, num_phys_buf, + mr_access_flags, iova_start); + + if (!ret && (mr_rereg_mask & IB_MR_REREG_PD)) { + atomic_dec(&old_pd->usecnt); + atomic_inc(&pd->usecnt); + } + + return ret; +} +EXPORT_SYMBOL(ib_rereg_phys_mr); + +int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) +{ + return mr->device->query_mr ? + mr->device->query_mr(mr, mr_attr) : -ENOSYS; +} +EXPORT_SYMBOL(ib_query_mr); + +int ib_dereg_mr(struct ib_mr *mr) +{ + struct ib_pd *pd; + int ret; + + if (atomic_read(&mr->usecnt)) + return -EBUSY; + + pd = mr->pd; + ret = mr->device->dereg_mr(mr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dereg_mr); + +/* Memory windows */ + +struct ib_mw *ib_alloc_mw(struct ib_pd *pd) +{ + struct ib_mw *mw; + + if (!pd->device->alloc_mw) + return ERR_PTR(-ENOSYS); + + mw = pd->device->alloc_mw(pd); + if (!IS_ERR(mw)) { + mw->device = pd->device; + mw->pd = pd; + atomic_inc(&pd->usecnt); + } + + return mw; +} +EXPORT_SYMBOL(ib_alloc_mw); + +int ib_dealloc_mw(struct ib_mw *mw) +{ + struct ib_pd *pd; + int ret; + + pd = mw->pd; + ret = mw->device->dealloc_mw(mw); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_mw); + +/* "Fast" memory regions */ + +struct ib_fmr *ib_alloc_fmr(struct ib_pd *pd, + int mr_access_flags, + struct ib_fmr_attr *fmr_attr) +{ + struct ib_fmr *fmr; + + if (!pd->device->alloc_fmr) + return ERR_PTR(-ENOSYS); + + fmr = pd->device->alloc_fmr(pd, mr_access_flags, fmr_attr); + if (!IS_ERR(fmr)) { + fmr->device = pd->device; + fmr->pd = pd; + atomic_inc(&pd->usecnt); + } + + return fmr; +} +EXPORT_SYMBOL(ib_alloc_fmr); + +int ib_unmap_fmr(struct list_head *fmr_list) +{ + struct ib_fmr *fmr; + + if (list_empty(fmr_list)) + return 0; + + fmr = list_entry(fmr_list->next, struct ib_fmr, list); + return fmr->device->unmap_fmr(fmr_list); +} +EXPORT_SYMBOL(ib_unmap_fmr); + +int ib_dealloc_fmr(struct ib_fmr *fmr) +{ + struct ib_pd *pd; + int ret; + + pd = fmr->pd; + ret = fmr->device->dealloc_fmr(fmr); + if (!ret) + atomic_dec(&pd->usecnt); + + return ret; +} +EXPORT_SYMBOL(ib_dealloc_fmr); + +/* Multicast groups */ + +int ib_attach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->attach_mcast ? + qp->device->attach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_attach_mcast); + +int ib_detach_mcast(struct ib_qp *qp, union ib_gid *gid, u16 lid) +{ + return qp->device->detach_mcast ? + qp->device->detach_mcast(qp, gid, lid) : + -ENOSYS; +} +EXPORT_SYMBOL(ib_detach_mcast); From roland at topspin.com Mon Dec 27 21:50:59 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:59 -0800 Subject: [openib-general] [PATCH][v5][6/24] Add InfiniBand MAD (management datagram) support (private headers) In-Reply-To: <200412272150.vDVch42vu9imXkVM@topspin.com> Message-ID: <200412272150.vaYM2cWv3KsWVYPV@topspin.com> Add MAD layer private implementation headers. 
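For orientation, the tables declared in mad_priv.h below are what the MAD receive path walks to dispatch an incoming packet to its registered agent: the class version indexes an ib_mad_mgmt_version_table, the (converted) management class selects a method table, and the method selects the owning agent. A rough sketch of that lookup follows; it ignores the vendor-class branch and the locking that the real dispatch code in mad.c performs, and the mad_hdr field names and the IB_MGMT_METHOD_RESP mask are assumed from ib_mad.h rather than shown in this patch.

/* Sketch only: resolve a non-vendor-class MAD to its registered agent. */
static struct ib_mad_agent_private *
find_agent_sketch(struct ib_mad_port_private *port_priv, struct ib_mad *mad)
{
	struct ib_mad_mgmt_class_table *class;
	struct ib_mad_mgmt_method_table *method;
	u8 mgmt_class = convert_mgmt_class(mad->mad_hdr.mgmt_class);

	if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION)
		return NULL;

	class = port_priv->version[mad->mad_hdr.class_version].class;
	if (!class)
		return NULL;

	/* vendor-range classes use the separate vendor tables, not shown */
	method = class->method_table[mgmt_class];
	if (!method)
		return NULL;

	return method->agent[mad->mad_hdr.method & ~IB_MGMT_METHOD_RESP];
}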
Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.h 2004-12-27 21:48:20.224989180 -0800 @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: agent.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#ifndef __AGENT_H_ +#define __AGENT_H_ + +extern spinlock_t ib_agent_port_list_lock; + +extern int ib_agent_port_open(struct ib_device *device, + int port_num); + +extern int ib_agent_port_close(struct ib_device *device, int port_num); + +extern int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num); + +#endif /* __AGENT_H_ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent_priv.h 2004-12-27 21:48:20.250985354 -0800 @@ -0,0 +1,64 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: agent_priv.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#ifndef __IB_AGENT_PRIV_H__ +#define __IB_AGENT_PRIV_H__ + +#include + +#define SPFX "ib_agent: " + +struct ib_agent_send_wr { + struct list_head send_list; + struct ib_ah *ah; + struct ib_mad_private *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct ib_agent_port_private { + struct list_head port_list; + struct list_head send_posted_list; + spinlock_t send_list_lock; + int port_num; + struct ib_mad_agent *dr_smp_agent; /* DR SM class */ + struct ib_mad_agent *lr_smp_agent; /* LR SM class */ + struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ + struct ib_mr *mr; +}; + +#endif /* __IB_AGENT_PRIV_H__ */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad_priv.h 2004-12-27 21:48:20.321974904 -0800 @@ -0,0 +1,194 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mad_priv.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#ifndef __IB_MAD_PRIV_H__ +#define __IB_MAD_PRIV_H__ + +#include +#include +#include +#include +#include + + +#define PFX "ib_mad: " + +#define IB_MAD_QPS_CORE 2 /* Always QP0 and QP1 as a minimum */ + +/* QP and CQ parameters */ +#define IB_MAD_QP_SEND_SIZE 128 +#define IB_MAD_QP_RECV_SIZE 512 +#define IB_MAD_SEND_REQ_MAX_SG 2 +#define IB_MAD_RECV_REQ_MAX_SG 1 + +#define IB_MAD_SEND_Q_PSN 0 + +/* Registration table sizes */ +#define MAX_MGMT_CLASS 80 +#define MAX_MGMT_VERSION 8 +#define MAX_MGMT_OUI 8 +#define MAX_MGMT_VENDOR_RANGE2 IB_MGMT_CLASS_VENDOR_RANGE2_END - \ + IB_MGMT_CLASS_VENDOR_RANGE2_START + 1 + +struct ib_mad_list_head { + struct list_head list; + struct ib_mad_queue *mad_queue; +}; + +struct ib_mad_private_header { + struct ib_mad_list_head mad_list; + struct ib_mad_recv_wc recv_wc; + DECLARE_PCI_UNMAP_ADDR(mapping) +} __attribute__ ((packed)); + +struct ib_mad_private { + struct ib_mad_private_header header; + struct ib_grh grh; + union { + struct ib_mad mad; + struct ib_rmpp_mad rmpp_mad; + struct ib_smp smp; + } mad; +} __attribute__ ((packed)); + +struct ib_mad_agent_private { + struct list_head agent_list; + struct ib_mad_agent agent; + struct ib_mad_reg_req *reg_req; + struct ib_mad_qp_info *qp_info; + + spinlock_t lock; + struct list_head send_list; + struct list_head wait_list; + struct work_struct timed_work; + unsigned long timeout; + struct list_head local_list; + struct work_struct local_work; + + atomic_t refcount; + wait_queue_head_t wait; + u8 rmpp_version; +}; + +struct ib_mad_snoop_private { + struct ib_mad_agent agent; + struct ib_mad_qp_info *qp_info; + int snoop_index; + int mad_snoop_flags; + atomic_t refcount; + wait_queue_head_t wait; +}; + +struct ib_mad_send_wr_private { + struct ib_mad_list_head mad_list; + struct list_head agent_list; + struct ib_mad_agent *agent; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; + unsigned long timeout; + int retry; + int refcount; + enum ib_wc_status status; +}; + +struct ib_mad_local_private { + struct list_head completion_list; + struct ib_mad_private *mad_priv; + struct ib_send_wr send_wr; + struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; + u64 wr_id; /* client WR ID */ + u64 tid; +}; + +struct ib_mad_mgmt_method_table { + struct ib_mad_agent_private *agent[IB_MGMT_MAX_METHODS]; +}; + +struct ib_mad_mgmt_class_table { + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_CLASS]; +}; + +struct ib_mad_mgmt_vendor_class { + u8 oui[MAX_MGMT_OUI][3]; + struct ib_mad_mgmt_method_table *method_table[MAX_MGMT_OUI]; +}; + +struct ib_mad_mgmt_vendor_class_table { + struct ib_mad_mgmt_vendor_class *vendor_class[MAX_MGMT_VENDOR_RANGE2]; +}; + +struct ib_mad_mgmt_version_table { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; +}; + +struct ib_mad_queue { + spinlock_t lock; + struct list_head list; + int count; + int max_active; + struct ib_mad_qp_info *qp_info; +}; + +struct ib_mad_qp_info { + struct ib_mad_port_private *port_priv; + struct ib_qp *qp; + struct ib_mad_queue send_queue; + struct ib_mad_queue recv_queue; + struct list_head overflow_list; + spinlock_t snoop_lock; + struct ib_mad_snoop_private **snoop_table; + int snoop_table_size; + atomic_t snoop_count; +}; + +struct ib_mad_port_private { + struct list_head port_list; + struct ib_device *device; + int port_num; + struct ib_cq *cq; + struct ib_pd *pd; + struct ib_mr *mr; + + 
spinlock_t reg_lock; + struct ib_mad_mgmt_version_table version[MAX_MGMT_VERSION]; + struct list_head agent_list; + struct workqueue_struct *wq; + struct work_struct work; + struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE]; +}; + +#endif /* __IB_MAD_PRIV_H__ */ From roland at topspin.com Mon Dec 27 21:50:57 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:50:57 -0800 Subject: [openib-general] [PATCH][v5][5/24] Add InfiniBand MAD (management datagram) support In-Reply-To: <200412272150.7LBVS92XE77zrCiS@topspin.com> Message-ID: <200412272150.vDVch42vu9imXkVM@topspin.com> Add support for handling InfiniBand MADs (management datagrams), including sending and receiving MADs as well as passing MADs on to local agents. This is required for an SM (subnet manager) to discover and configure the host, since the SM's query MADs must be passed to the local SMA (subnet management agent). In addition, this support is used by upper level protocols to send queries to and receive responses from the SM. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-27 21:48:18.262278084 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-27 21:48:19.838046137 -0800 @@ -1,6 +1,8 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o + +ib_mad-y := mad.o smi.o agent.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/agent.c 2004-12-27 21:48:19.916034657 -0800 @@ -0,0 +1,399 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: agent.c 1389 2004-12-27 22:56:47Z roland $ + */ + +#include + +#include + +#include + +#include "smi.h" +#include "agent_priv.h" +#include "mad_priv.h" + + +spinlock_t ib_agent_port_list_lock; +static LIST_HEAD(ib_agent_port_list); + +extern kmem_cache_t *ib_mad_cache; + + +/* + * Caller must hold ib_agent_port_list_lock + */ +static inline struct ib_agent_port_private * +__ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + + BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ + + if (device) { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if (entry->dr_smp_agent->device == device && + entry->port_num == port_num) + return entry; + } + } else { + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if ((entry->dr_smp_agent == mad_agent) || + (entry->lr_smp_agent == mad_agent) || + (entry->perf_mgmt_agent == mad_agent)) + return entry; + } + } + return NULL; +} + +static inline struct ib_agent_port_private * +ib_get_agent_port(struct ib_device *device, int port_num, + struct ib_mad_agent *mad_agent) +{ + struct ib_agent_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + entry = __ib_get_agent_port(device, port_num, mad_agent); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return entry; +} + +int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + + if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + return 1; + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " + "not open\n", + device->name, port_num); + return 1; + } + + return smi_check_local_smp(port_priv->dr_smp_agent, smp); +} + +static int agent_mad_send(struct ib_mad_agent *mad_agent, + struct ib_agent_port_private *port_priv, + struct ib_mad_private *mad_priv, + struct ib_grh *grh, + struct ib_wc *wc) +{ + struct ib_agent_send_wr *agent_send_wr; + struct ib_sge gather_list; + struct ib_send_wr send_wr; + struct ib_send_wr *bad_send_wr; + struct ib_ah_attr ah_attr; + unsigned long flags; + int ret = 1; + + agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); + if (!agent_send_wr) + goto out; + agent_send_wr->mad = mad_priv; + + /* PCI mapping */ + gather_list.addr = dma_map_single(mad_agent->device->dma_device, + &mad_priv->mad, + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + gather_list.length = sizeof(mad_priv->mad); + gather_list.lkey = (*port_priv->mr).lkey; + + send_wr.next = NULL; + send_wr.opcode = IB_WR_SEND; + send_wr.sg_list = &gather_list; + send_wr.num_sge = 1; + send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ + send_wr.wr.ud.timeout_ms = 0; + send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; + + ah_attr.dlid = wc->slid; + ah_attr.port_num = mad_agent->port_num; + ah_attr.src_path_bits = wc->dlid_path_bits; + ah_attr.sl = wc->sl; + ah_attr.static_rate = 0; + ah_attr.ah_flags = 0; /* No GRH */ + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + if (wc->wc_flags & IB_WC_GRH) { + ah_attr.ah_flags = IB_AH_GRH; + /* Should sgid be looked up ? 
*/ + ah_attr.grh.sgid_index = 0; + ah_attr.grh.hop_limit = grh->hop_limit; + ah_attr.grh.flow_label = be32_to_cpup( + &grh->version_tclass_flow) & 0xfffff; + ah_attr.grh.traffic_class = (be32_to_cpup( + &grh->version_tclass_flow) >> 20) & 0xff; + memcpy(ah_attr.grh.dgid.raw, + grh->sgid.raw, + sizeof(ah_attr.grh.dgid)); + } + } + + agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); + if (IS_ERR(agent_send_wr->ah)) { + printk(KERN_ERR SPFX "No memory for address handle\n"); + kfree(agent_send_wr); + goto out; + } + + send_wr.wr.ud.ah = agent_send_wr->ah; + if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { + send_wr.wr.ud.pkey_index = wc->pkey_index; + send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; + } else { /* for SMPs */ + send_wr.wr.ud.pkey_index = 0; + send_wr.wr.ud.remote_qkey = 0; + } + send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr; + send_wr.wr_id = (unsigned long)agent_send_wr; + + pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); + + /* Send */ + spin_lock_irqsave(&port_priv->send_list_lock, flags); + if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(mad_priv->mad), + DMA_TO_DEVICE); + ib_destroy_ah(agent_send_wr->ah); + kfree(agent_send_wr); + } else { + list_add_tail(&agent_send_wr->send_list, + &port_priv->send_posted_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + ret = 0; + } + +out: + return ret; +} + +int agent_send(struct ib_mad_private *mad, + struct ib_grh *grh, + struct ib_wc *wc, + struct ib_device *device, + int port_num) +{ + struct ib_agent_port_private *port_priv; + struct ib_mad_agent *mad_agent; + + port_priv = ib_get_agent_port(device, port_num, NULL); + if (!port_priv) { + printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", + device->name, port_num); + return 1; + } + + /* Get mad agent based on mgmt_class in MAD */ + switch (mad->mad.mad.mad_hdr.mgmt_class) { + case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: + mad_agent = port_priv->dr_smp_agent; + break; + case IB_MGMT_CLASS_SUBN_LID_ROUTED: + mad_agent = port_priv->lr_smp_agent; + break; + case IB_MGMT_CLASS_PERF_MGMT: + mad_agent = port_priv->perf_mgmt_agent; + break; + default: + return 1; + } + + return agent_mad_send(mad_agent, port_priv, mad, grh, wc); +} + +static void agent_send_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_agent_port_private *port_priv; + struct ib_agent_send_wr *agent_send_wr; + unsigned long flags; + + /* Find matching MAD agent */ + port_priv = ib_get_agent_port(NULL, 0, mad_agent); + if (!port_priv) { + printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " + "agent %p\n", mad_agent); + return; + } + + agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; + spin_lock_irqsave(&port_priv->send_list_lock, flags); + /* Remove completed send from posted send MAD list */ + list_del(&agent_send_wr->send_list); + spin_unlock_irqrestore(&port_priv->send_list_lock, flags); + + /* Unmap PCI */ + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(agent_send_wr, mapping), + sizeof(agent_send_wr->mad->mad), + DMA_TO_DEVICE); + + ib_destroy_ah(agent_send_wr->ah); + + /* Release allocated memory */ + kmem_cache_free(ib_mad_cache, agent_send_wr->mad); + kfree(agent_send_wr); +} + +int ib_agent_port_open(struct ib_device *device, int port_num) +{ + int ret; + struct 
ib_agent_port_private *port_priv; + struct ib_mad_reg_req reg_req; + unsigned long flags; + + /* First, check if port already open for SMI */ + port_priv = ib_get_agent_port(device, port_num, NULL); + if (port_priv) { + printk(KERN_DEBUG SPFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR SPFX "No memory for ib_agent_port_private\n"); + ret = -ENOMEM; + goto error1; + } + + memset(port_priv, 0, sizeof *port_priv); + port_priv->port_num = port_num; + spin_lock_init(&port_priv->send_list_lock); + INIT_LIST_HEAD(&port_priv->send_posted_list); + + /* Obtain MAD agent for directed route SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE; + reg_req.mgmt_class_version = 1; + + port_priv->dr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + + if (IS_ERR(port_priv->dr_smp_agent)) { + ret = PTR_ERR(port_priv->dr_smp_agent); + goto error2; + } + + /* Obtain MAD agent for LID routed SM class */ + reg_req.mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + port_priv->lr_smp_agent = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->lr_smp_agent)) { + ret = PTR_ERR(port_priv->lr_smp_agent); + goto error3; + } + + /* Obtain MAD agent for PerfMgmt class */ + reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT; + port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, + NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->perf_mgmt_agent)) { + ret = PTR_ERR(port_priv->perf_mgmt_agent); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->dr_smp_agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR SPFX "Couldn't get DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_agent_port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + return 0; + +error5: + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); +error4: + ib_unregister_mad_agent(port_priv->lr_smp_agent); +error3: + ib_unregister_mad_agent(port_priv->dr_smp_agent); +error2: + kfree(port_priv); +error1: + return ret; +} + +int ib_agent_port_close(struct ib_device *device, int port_num) +{ + struct ib_agent_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_agent_port_list_lock, flags); + port_priv = __ib_get_agent_port(device, port_num, NULL); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + printk(KERN_ERR SPFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); + + ib_dereg_mr(port_priv->mr); + + ib_unregister_mad_agent(port_priv->perf_mgmt_agent); + ib_unregister_mad_agent(port_priv->lr_smp_agent); + ib_unregister_mad_agent(port_priv->dr_smp_agent); + kfree(port_priv); + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/mad.c 2004-12-27 21:48:19.890038484 -0800 @@ -0,0 +1,2632 @@ +/* + * Copyright (c) 2004, Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mad.c 1389 2004-12-27 22:56:47Z roland $ + */ + +#include +#include + +#include + +#include "mad_priv.h" +#include "smi.h" +#include "agent.h" + + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_DESCRIPTION("kernel IB MAD API"); +MODULE_AUTHOR("Hal Rosenstock"); +MODULE_AUTHOR("Sean Hefty"); + + +kmem_cache_t *ib_mad_cache; +static struct list_head ib_mad_port_list; +static u32 ib_mad_client_id = 0; + +/* Port list lock */ +static spinlock_t ib_mad_port_list_lock; + + +/* Forward declarations */ +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req); +static void remove_mad_reg_req(struct ib_mad_agent_private *priv); +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad); +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv); +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc); +static void timeout_sends(void *data); +static void local_completions(void *data); +static int solicited_mad(struct ib_mad *mad); +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class); +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv); + +/* + * Returns a ib_mad_port_private structure or NULL for a device/port + * Assumes ib_mad_port_list_lock is being held + */ +static inline struct ib_mad_port_private * +__ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + + list_for_each_entry(entry, &ib_mad_port_list, port_list) { + if (entry->device == device && entry->port_num == port_num) + return entry; + } + return NULL; +} + +/* + * Wrapper function to return a ib_mad_port_private structure or NULL + * for a device/port + */ +static inline struct ib_mad_port_private * +ib_get_mad_port(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *entry; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + entry = __ib_get_mad_port(device, port_num); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + return entry; +} + +static inline u8 convert_mgmt_class(u8 mgmt_class) +{ + /* Alias 
IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */ + return mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE ? + 0 : mgmt_class; +} + +static int get_spl_qp_index(enum ib_qp_type qp_type) +{ + switch (qp_type) + { + case IB_QPT_SMI: + return 0; + case IB_QPT_GSI: + return 1; + default: + return -1; + } +} + +static int vendor_class_index(u8 mgmt_class) +{ + return mgmt_class - IB_MGMT_CLASS_VENDOR_RANGE2_START; +} + +static int is_vendor_class(u8 mgmt_class) +{ + if ((mgmt_class < IB_MGMT_CLASS_VENDOR_RANGE2_START) || + (mgmt_class > IB_MGMT_CLASS_VENDOR_RANGE2_END)) + return 0; + return 1; +} + +static int is_vendor_oui(char *oui) +{ + if (oui[0] || oui[1] || oui[2]) + return 1; + return 0; +} + +static int is_vendor_method_in_use( + struct ib_mad_mgmt_vendor_class *vendor_class, + struct ib_mad_reg_req *mad_reg_req) +{ + struct ib_mad_mgmt_method_table *method; + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) { + if (!memcmp(vendor_class->oui[i], mad_reg_req->oui, 3)) { + method = vendor_class->method_table[i]; + if (method) { + if (method_in_use(&method, mad_reg_req)) + return 1; + else + break; + } + } + } + return 0; +} + +/* + * ib_register_mad_agent - Register to send/receive MADs + */ +struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + struct ib_mad_reg_req *mad_reg_req, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret = ERR_PTR(-EINVAL); + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_reg_req *reg_req = NULL; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_mad_mgmt_method_table *method; + int ret2, qpn; + unsigned long flags; + u8 mgmt_class, vclass; + + /* Validate parameters */ + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) + goto error1; + + if (rmpp_version) + goto error1; /* XXX: until RMPP implemented */ + + /* Validate MAD registration request if supplied */ + if (mad_reg_req) { + if (mad_reg_req->mgmt_class_version >= MAX_MGMT_VERSION) + goto error1; + if (!recv_handler) + goto error1; + if (mad_reg_req->mgmt_class >= MAX_MGMT_CLASS) { + /* + * IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE is the only + * one in this range currently allowed + */ + if (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + goto error1; + } else if (mad_reg_req->mgmt_class == 0) { + /* + * Class 0 is reserved in IBA and is used for + * aliasing of IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE + */ + goto error1; + } else if (is_vendor_class(mad_reg_req->mgmt_class)) { + /* + * If class is in "new" vendor range, + * ensure supplied OUI is not zero + */ + if (!is_vendor_oui(mad_reg_req->oui)) + goto error1; + } + /* Make sure class supplied is consistent with QP type */ + if (qp_type == IB_QPT_SMI) { + if ((mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_LID_ROUTED) && + (mad_reg_req->mgmt_class != + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } else { + if ((mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad_reg_req->mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) + goto error1; + } + } else { + /* No registration request supplied */ + if (!send_handler) + goto error1; + } + + /* Validate device and port */ + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + + /* Allocate structures */ + mad_agent_priv = 
kmalloc(sizeof *mad_agent_priv, GFP_KERNEL); + if (!mad_agent_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + if (mad_reg_req) { + reg_req = kmalloc(sizeof *reg_req, GFP_KERNEL); + if (!reg_req) { + ret = ERR_PTR(-ENOMEM); + goto error2; + } + /* Make a copy of the MAD registration request */ + memcpy(reg_req, mad_reg_req, sizeof *reg_req); + } + + /* Now, fill in the various structures */ + memset(mad_agent_priv, 0, sizeof *mad_agent_priv); + mad_agent_priv->qp_info = &port_priv->qp_info[qpn]; + mad_agent_priv->reg_req = reg_req; + mad_agent_priv->rmpp_version = rmpp_version; + mad_agent_priv->agent.device = device; + mad_agent_priv->agent.recv_handler = recv_handler; + mad_agent_priv->agent.send_handler = send_handler; + mad_agent_priv->agent.context = context; + mad_agent_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_agent_priv->agent.port_num = port_num; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + mad_agent_priv->agent.hi_tid = ++ib_mad_client_id; + + /* + * Make sure MAD registration (if supplied) + * is non overlapping with any existing ones + */ + if (mad_reg_req) { + mgmt_class = convert_mgmt_class(mad_reg_req->mgmt_class); + if (!is_vendor_class(mgmt_class)) { + class = port_priv->version[mad_reg_req-> + mgmt_class_version].class; + if (class) { + method = class->method_table[mgmt_class]; + if (method) { + if (method_in_use(&method, + mad_reg_req)) + goto error3; + } + } + ret2 = add_nonoui_reg_req(mad_reg_req, mad_agent_priv, + mgmt_class); + } else { + /* "New" vendor class range */ + vendor = port_priv->version[mad_reg_req-> + mgmt_class_version].vendor; + if (vendor) { + vclass = vendor_class_index(mgmt_class); + vendor_class = vendor->vendor_class[vclass]; + if (vendor_class) { + if (is_vendor_method_in_use( + vendor_class, + mad_reg_req)) + goto error3; + } + } + ret2 = add_oui_reg_req(mad_reg_req, mad_agent_priv); + } + if (ret2) { + ret = ERR_PTR(ret2); + goto error3; + } + } + + /* Add mad agent into port's agent list */ + list_add_tail(&mad_agent_priv->agent_list, &port_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + spin_lock_init(&mad_agent_priv->lock); + INIT_LIST_HEAD(&mad_agent_priv->send_list); + INIT_LIST_HEAD(&mad_agent_priv->wait_list); + INIT_WORK(&mad_agent_priv->timed_work, timeout_sends, mad_agent_priv); + INIT_LIST_HEAD(&mad_agent_priv->local_list); + INIT_WORK(&mad_agent_priv->local_work, local_completions, + mad_agent_priv); + atomic_set(&mad_agent_priv->refcount, 1); + init_waitqueue_head(&mad_agent_priv->wait); + + return &mad_agent_priv->agent; + +error3: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + kfree(reg_req); +error2: + kfree(mad_agent_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_agent); + +static inline int is_snooping_sends(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (/*IB_MAD_SNOOP_POSTED_SENDS | + IB_MAD_SNOOP_RMPP_SENDS |*/ + IB_MAD_SNOOP_SEND_COMPLETIONS /*| + IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS*/)); +} + +static inline int is_snooping_recvs(int mad_snoop_flags) +{ + return (mad_snoop_flags & + (IB_MAD_SNOOP_RECVS /*| + IB_MAD_SNOOP_RMPP_RECVS*/)); +} + +static int register_snoop_agent(struct ib_mad_qp_info *qp_info, + struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_snoop_private **new_snoop_table; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + /* Check for empty slot in array. 
*/ + for (i = 0; i < qp_info->snoop_table_size; i++) + if (!qp_info->snoop_table[i]) + break; + + if (i == qp_info->snoop_table_size) { + /* Grow table. */ + new_snoop_table = kmalloc(sizeof mad_snoop_priv * + qp_info->snoop_table_size + 1, + GFP_ATOMIC); + if (!new_snoop_table) { + i = -ENOMEM; + goto out; + } + if (qp_info->snoop_table) { + memcpy(new_snoop_table, qp_info->snoop_table, + sizeof mad_snoop_priv * + qp_info->snoop_table_size); + kfree(qp_info->snoop_table); + } + qp_info->snoop_table = new_snoop_table; + qp_info->snoop_table_size++; + } + qp_info->snoop_table[i] = mad_snoop_priv; + atomic_inc(&qp_info->snoop_count); +out: + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + return i; +} + +struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device, + u8 port_num, + enum ib_qp_type qp_type, + int mad_snoop_flags, + ib_mad_snoop_handler snoop_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_agent *ret; + struct ib_mad_snoop_private *mad_snoop_priv; + int qpn; + + /* Validate parameters */ + if ((is_snooping_sends(mad_snoop_flags) && !snoop_handler) || + (is_snooping_recvs(mad_snoop_flags) && !recv_handler)) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + qpn = get_spl_qp_index(qp_type); + if (qpn == -1) { + ret = ERR_PTR(-EINVAL); + goto error1; + } + port_priv = ib_get_mad_port(device, port_num); + if (!port_priv) { + ret = ERR_PTR(-ENODEV); + goto error1; + } + /* Allocate structures */ + mad_snoop_priv = kmalloc(sizeof *mad_snoop_priv, GFP_KERNEL); + if (!mad_snoop_priv) { + ret = ERR_PTR(-ENOMEM); + goto error1; + } + + /* Now, fill in the various structures */ + memset(mad_snoop_priv, 0, sizeof *mad_snoop_priv); + mad_snoop_priv->qp_info = &port_priv->qp_info[qpn]; + mad_snoop_priv->agent.device = device; + mad_snoop_priv->agent.recv_handler = recv_handler; + mad_snoop_priv->agent.snoop_handler = snoop_handler; + mad_snoop_priv->agent.context = context; + mad_snoop_priv->agent.qp = port_priv->qp_info[qpn].qp; + mad_snoop_priv->agent.port_num = port_num; + mad_snoop_priv->mad_snoop_flags = mad_snoop_flags; + init_waitqueue_head(&mad_snoop_priv->wait); + mad_snoop_priv->snoop_index = register_snoop_agent( + &port_priv->qp_info[qpn], + mad_snoop_priv); + if (mad_snoop_priv->snoop_index < 0) { + ret = ERR_PTR(mad_snoop_priv->snoop_index); + goto error2; + } + + atomic_set(&mad_snoop_priv->refcount, 1); + return &mad_snoop_priv->agent; + +error2: + kfree(mad_snoop_priv); +error1: + return ret; +} +EXPORT_SYMBOL(ib_register_mad_snoop); + +static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + /* Note that we could still be handling received MADs */ + + /* + * Canceling all sends results in dropping received response + * MADs, preventing us from queuing additional work + */ + cancel_mads(mad_agent_priv); + + port_priv = mad_agent_priv->qp_info->port_priv; + cancel_delayed_work(&mad_agent_priv->timed_work); + flush_workqueue(port_priv->wq); + + spin_lock_irqsave(&port_priv->reg_lock, flags); + remove_mad_reg_req(mad_agent_priv); + list_del(&mad_agent_priv->agent_list); + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + /* XXX: Cleanup pending RMPP receives for this agent */ + + atomic_dec(&mad_agent_priv->refcount); + wait_event(mad_agent_priv->wait, + !atomic_read(&mad_agent_priv->refcount)); + + if (mad_agent_priv->reg_req) + kfree(mad_agent_priv->reg_req); + kfree(mad_agent_priv); +} + 
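/*
 * Editorial sketch, not part of this patch: what a typical caller of
 * ib_register_mad_agent() above looks like.  It registers for PerfMgmt
 * class GET methods on the GSI QP.  IB_MGMT_METHOD_GET and the handler
 * prototypes come from ib_mad.h; example_send_handler and
 * example_recv_handler are placeholders the caller would supply.
 * rmpp_version must be 0 for now, as enforced above.
 */
static struct ib_mad_agent *example_register(struct ib_device *device,
					     u8 port_num)
{
	struct ib_mad_reg_req reg_req;

	memset(&reg_req, 0, sizeof reg_req);
	reg_req.mgmt_class = IB_MGMT_CLASS_PERF_MGMT;
	reg_req.mgmt_class_version = 1;
	set_bit(IB_MGMT_METHOD_GET, reg_req.method_mask);

	return ib_register_mad_agent(device, port_num, IB_QPT_GSI,
				     &reg_req, 0 /* no RMPP */,
				     example_send_handler,
				     example_recv_handler,
				     NULL /* context */);
}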
+static void unregister_mad_snoop(struct ib_mad_snoop_private *mad_snoop_priv) +{ + struct ib_mad_qp_info *qp_info; + unsigned long flags; + + qp_info = mad_snoop_priv->qp_info; + spin_lock_irqsave(&qp_info->snoop_lock, flags); + qp_info->snoop_table[mad_snoop_priv->snoop_index] = NULL; + atomic_dec(&qp_info->snoop_count); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + + atomic_dec(&mad_snoop_priv->refcount); + wait_event(mad_snoop_priv->wait, + !atomic_read(&mad_snoop_priv->refcount)); + + kfree(mad_snoop_priv); +} + +/* + * ib_unregister_mad_agent - Unregisters a client from using MAD services + */ +int ib_unregister_mad_agent(struct ib_mad_agent *mad_agent) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_snoop_private *mad_snoop_priv; + + /* If the TID is zero, the agent can only snoop. */ + if (mad_agent->hi_tid) { + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + unregister_mad_agent(mad_agent_priv); + } else { + mad_snoop_priv = container_of(mad_agent, + struct ib_mad_snoop_private, + agent); + unregister_mad_snoop(mad_snoop_priv); + } + return 0; +} +EXPORT_SYMBOL(ib_unregister_mad_agent); + +static void dequeue_mad(struct ib_mad_list_head *mad_list) +{ + struct ib_mad_queue *mad_queue; + unsigned long flags; + + BUG_ON(!mad_list->mad_queue); + mad_queue = mad_list->mad_queue; + spin_lock_irqsave(&mad_queue->lock, flags); + list_del(&mad_list->list); + mad_queue->count--; + spin_unlock_irqrestore(&mad_queue->lock, flags); +} + +static void snoop_send(struct ib_mad_qp_info *qp_info, + struct ib_send_wr *send_wr, + struct ib_mad_send_wc *mad_send_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.snoop_handler(&mad_snoop_priv->agent, + send_wr, mad_send_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +static void snoop_recv(struct ib_mad_qp_info *qp_info, + struct ib_mad_recv_wc *mad_recv_wc, + int mad_snoop_flags) +{ + struct ib_mad_snoop_private *mad_snoop_priv; + unsigned long flags; + int i; + + spin_lock_irqsave(&qp_info->snoop_lock, flags); + for (i = 0; i < qp_info->snoop_table_size; i++) { + mad_snoop_priv = qp_info->snoop_table[i]; + if (!mad_snoop_priv || + !(mad_snoop_priv->mad_snoop_flags & mad_snoop_flags)) + continue; + + atomic_inc(&mad_snoop_priv->refcount); + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); + mad_snoop_priv->agent.recv_handler(&mad_snoop_priv->agent, + mad_recv_wc); + if (atomic_dec_and_test(&mad_snoop_priv->refcount)) + wake_up(&mad_snoop_priv->wait); + spin_lock_irqsave(&qp_info->snoop_lock, flags); + } + spin_unlock_irqrestore(&qp_info->snoop_lock, flags); +} + +/* + * Return 0 if SMP is to be sent + * Return 1 if SMP was consumed locally (whether or not solicited) + * Return < 0 if error + */ +static int handle_outgoing_smp(struct ib_mad_agent_private *mad_agent_priv, + struct ib_smp *smp, + struct ib_send_wr *send_wr) +{ + int ret, alloc_flags; + unsigned long flags; + struct ib_mad_local_private *local; + 
struct ib_mad_private *mad_priv; + struct ib_device *device = mad_agent_priv->agent.device; + u8 port_num = mad_agent_priv->agent.port_num; + + if (!smi_handle_dr_smp_send(smp, device->node_type, port_num)) { + ret = -EINVAL; + printk(KERN_ERR PFX "Invalid directed route\n"); + goto out; + } + /* Check to post send on QP or process locally */ + ret = smi_check_local_dr_smp(smp, device, port_num); + if (!ret || !device->process_mad) + goto out; + + if (in_atomic() || irqs_disabled()) + alloc_flags = GFP_ATOMIC; + else + alloc_flags = GFP_KERNEL; + local = kmalloc(sizeof *local, alloc_flags); + if (!local) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for ib_mad_local_private\n"); + goto out; + } + local->mad_priv = NULL; + mad_priv = kmem_cache_alloc(ib_mad_cache, alloc_flags); + if (!mad_priv) { + ret = -ENOMEM; + printk(KERN_ERR PFX "No memory for local response MAD\n"); + kfree(local); + goto out; + } + ret = device->process_mad(device, 0, port_num, smp->dr_slid, + (struct ib_mad *)smp, + (struct ib_mad *)&mad_priv->mad); + switch (ret) + { + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY: + /* + * See if response is solicited and + * there is a recv handler + */ + if (solicited_mad(&mad_priv->mad.mad) && + mad_agent_priv->agent.recv_handler) + local->mad_priv = mad_priv; + else + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: + kmem_cache_free(ib_mad_cache, mad_priv); + break; + case IB_MAD_RESULT_SUCCESS: + kmem_cache_free(ib_mad_cache, mad_priv); + kfree(local); + ret = 0; + goto out; + default: + kmem_cache_free(ib_mad_cache, mad_priv); + kfree(local); + ret = -EINVAL; + goto out; + } + + local->send_wr = *send_wr; + local->send_wr.sg_list = local->sg_list; + memcpy(local->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + local->send_wr.next = NULL; + local->tid = send_wr->wr.ud.mad_hdr->tid; + local->wr_id = send_wr->wr_id; + /* Reference MAD agent until local completion handled */ + atomic_inc(&mad_agent_priv->refcount); + /* Queue local completion to local list */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&local->completion_list, &mad_agent_priv->local_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + queue_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->local_work); + ret = 1; +out: + return ret; +} + +static int ib_send_mad(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr) +{ + struct ib_mad_qp_info *qp_info; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + /* Replace user's WR ID with our own to find WR upon completion */ + qp_info = mad_agent_priv->qp_info; + mad_send_wr->wr_id = mad_send_wr->send_wr.wr_id; + mad_send_wr->send_wr.wr_id = (unsigned long)&mad_send_wr->mad_list; + mad_send_wr->mad_list.mad_queue = &qp_info->send_queue; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + if (qp_info->send_queue.count++ < qp_info->send_queue.max_active) { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->send_queue.list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = ib_post_send(mad_agent_priv->agent.qp, + &mad_send_wr->send_wr, &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + dequeue_mad(&mad_send_wr->mad_list); + } + } else { + list_add_tail(&mad_send_wr->mad_list.list, + &qp_info->overflow_list); + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + ret = 0; + } + return ret; +} + +/* + * 
ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated + * with the registered client + */ +int ib_post_send_mad(struct ib_mad_agent *mad_agent, + struct ib_send_wr *send_wr, + struct ib_send_wr **bad_send_wr) +{ + int ret = -EINVAL; + struct ib_mad_agent_private *mad_agent_priv; + + /* Validate supplied parameters */ + if (!bad_send_wr) + goto error1; + + if (!mad_agent || !send_wr) + goto error2; + + if (!mad_agent->send_handler) + goto error2; + + mad_agent_priv = container_of(mad_agent, + struct ib_mad_agent_private, + agent); + + /* Walk list of send WRs and post each on send list */ + while (send_wr) { + unsigned long flags; + struct ib_send_wr *next_send_wr; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_smp *smp; + + /* Validate more parameters */ + if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) + goto error2; + + if (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler) + goto error2; + + if (!send_wr->wr.ud.mad_hdr) { + printk(KERN_ERR PFX "MAD header must be supplied " + "in WR %p\n", send_wr); + goto error2; + } + + /* + * Save pointer to next work request to post in case the + * current one completes, and the user modifies the work + * request associated with the completion + */ + next_send_wr = (struct ib_send_wr *)send_wr->next; + + smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; + if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = handle_outgoing_smp(mad_agent_priv, smp, send_wr); + if (ret < 0) /* error */ + goto error2; + else if (ret == 1) /* locally consumed */ + goto next; + } + + /* Allocate MAD send WR tracking structure */ + mad_send_wr = kmalloc(sizeof *mad_send_wr, + (in_atomic() || irqs_disabled()) ? + GFP_ATOMIC : GFP_KERNEL); + if (!mad_send_wr) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_send_wr_private\n"); + ret = -ENOMEM; + goto error2; + } + + mad_send_wr->send_wr = *send_wr; + mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; + memcpy(mad_send_wr->sg_list, send_wr->sg_list, + sizeof *send_wr->sg_list * send_wr->num_sge); + mad_send_wr->send_wr.next = NULL; + mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid; + mad_send_wr->agent = mad_agent; + /* Timeout will be updated after send completes */ + mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. 
+ ud.timeout_ms); + mad_send_wr->retry = 0; + /* One reference for each work request to QP + response */ + mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); + mad_send_wr->status = IB_WC_SUCCESS; + + /* Reference MAD agent until send completes */ + atomic_inc(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_add_tail(&mad_send_wr->agent_list, + &mad_agent_priv->send_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + ret = ib_send_mad(mad_agent_priv, mad_send_wr); + if (ret) { + /* Fail send request */ + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + atomic_dec(&mad_agent_priv->refcount); + goto error2; + } +next: + send_wr = next_send_wr; + } + return 0; + +error2: + *bad_send_wr = send_wr; +error1: + return ret; +} +EXPORT_SYMBOL(ib_post_send_mad); + +/* + * ib_free_recv_mad - Returns data buffers used to receive + * a MAD to the access layer + */ +void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_recv_buf *entry; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *priv; + + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, header); + + /* + * Walk receive buffer list associated with this WC + * No need to remove them from list of receive buffers + */ + list_for_each_entry(entry, &mad_recv_wc->recv_buf.list, list) { + /* Free previous receive buffer */ + kmem_cache_free(ib_mad_cache, priv); + mad_priv_hdr = container_of(mad_recv_wc, + struct ib_mad_private_header, + recv_wc); + priv = container_of(mad_priv_hdr, struct ib_mad_private, + header); + } + + /* Free last buffer */ + kmem_cache_free(ib_mad_cache, priv); +} +EXPORT_SYMBOL(ib_free_recv_mad); + +void ib_coalesce_recv_mad(struct ib_mad_recv_wc *mad_recv_wc, + void *buf) +{ + printk(KERN_ERR PFX "ib_coalesce_recv_mad() not implemented yet\n"); +} +EXPORT_SYMBOL(ib_coalesce_recv_mad); + +struct ib_mad_agent *ib_redirect_mad_qp(struct ib_qp *qp, + u8 rmpp_version, + ib_mad_send_handler send_handler, + ib_mad_recv_handler recv_handler, + void *context) +{ + return ERR_PTR(-EINVAL); /* XXX: for now */ +} +EXPORT_SYMBOL(ib_redirect_mad_qp); + +int ib_process_mad_wc(struct ib_mad_agent *mad_agent, + struct ib_wc *wc) +{ + printk(KERN_ERR PFX "ib_process_mad_wc() not implemented yet\n"); + return 0; +} +EXPORT_SYMBOL(ib_process_mad_wc); + +static int method_in_use(struct ib_mad_mgmt_method_table **method, + struct ib_mad_reg_req *mad_reg_req) +{ + int i; + + for (i = find_first_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + if ((*method)->agent[i]) { + printk(KERN_ERR PFX "Method %d already in use\n", i); + return -EINVAL; + } + } + return 0; +} + +static int allocate_method_table(struct ib_mad_mgmt_method_table **method) +{ + /* Allocate management method table */ + *method = kmalloc(sizeof **method, GFP_ATOMIC); + if (!*method) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_method_table\n"); + return -ENOMEM; + } + /* Clear management method table */ + memset(*method, 0, sizeof **method); + + return 0; +} + +/* + * Check to see if there are any methods still in use + */ +static int check_method_table(struct ib_mad_mgmt_method_table *method) +{ + int i; + + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) + if (method->agent[i]) + return 
1; + return 0; +} + +/* + * Check to see if there are any method tables for this class still in use + */ +static int check_class_table(struct ib_mad_mgmt_class_table *class) +{ + int i; + + for (i = 0; i < MAX_MGMT_CLASS; i++) + if (class->method_table[i]) + return 1; + return 0; +} + +static int check_vendor_class(struct ib_mad_mgmt_vendor_class *vendor_class) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + if (vendor_class->method_table[i]) + return 1; + return 0; +} + +static int find_vendor_oui(struct ib_mad_mgmt_vendor_class *vendor_class, + char *oui) +{ + int i; + + for (i = 0; i < MAX_MGMT_OUI; i++) + /* Is there matching OUI for this vendor class ? */ + if (!memcmp(vendor_class->oui[i], oui, 3)) + return i; + + return -1; +} + +static int check_vendor_table(struct ib_mad_mgmt_vendor_class_table *vendor) +{ + int i; + + for (i = 0; i < MAX_MGMT_VENDOR_RANGE2; i++) + if (vendor->vendor_class[i]) + return 1; + + return 0; +} + +static void remove_methods_mad_agent(struct ib_mad_mgmt_method_table *method, + struct ib_mad_agent_private *agent) +{ + int i; + + /* Remove any methods for this mad agent */ + for (i = 0; i < IB_MGMT_MAX_METHODS; i++) { + if (method->agent[i] == agent) { + method->agent[i] = NULL; + } + } +} + +static int add_nonoui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv, + u8 mgmt_class) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table **class; + struct ib_mad_mgmt_method_table **method; + int i, ret; + + port_priv = agent_priv->qp_info->port_priv; + class = &port_priv->version[mad_reg_req->mgmt_class_version].class; + if (!*class) { + /* Allocate management class table for "new" class version */ + *class = kmalloc(sizeof **class, GFP_ATOMIC); + if (!*class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_class_table\n"); + ret = -ENOMEM; + goto error1; + } + /* Clear management class table */ + memset(*class, 0, sizeof(**class)); + /* Allocate method table for this management class */ + method = &(*class)->method_table[mgmt_class]; + if ((ret = allocate_method_table(method))) + goto error2; + } else { + method = &(*class)->method_table[mgmt_class]; + if (!*method) { + /* Allocate method table for this management class */ + if ((ret = allocate_method_table(method))) + goto error1; + } + } + + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error3; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error3: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; + goto error1; +error2: + kfree(*class); + *class = NULL; +error1: + return ret; +} + +static int add_oui_reg_req(struct ib_mad_reg_req *mad_reg_req, + struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_vendor_class_table **vendor_table; + struct ib_mad_mgmt_vendor_class_table *vendor = NULL; + struct ib_mad_mgmt_vendor_class *vendor_class = NULL; + struct ib_mad_mgmt_method_table **method; + int i, ret = -ENOMEM; + u8 vclass; + + /* "New" vendor (with OUI) 
class */ + vclass = vendor_class_index(mad_reg_req->mgmt_class); + port_priv = agent_priv->qp_info->port_priv; + vendor_table = &port_priv->version[ + mad_reg_req->mgmt_class_version].vendor; + if (!*vendor_table) { + /* Allocate mgmt vendor class table for "new" class version */ + vendor = kmalloc(sizeof *vendor, GFP_ATOMIC); + if (!vendor) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class_table\n"); + goto error1; + } + /* Clear management vendor class table */ + memset(vendor, 0, sizeof(*vendor)); + *vendor_table = vendor; + } + if (!(*vendor_table)->vendor_class[vclass]) { + /* Allocate table for this management vendor class */ + vendor_class = kmalloc(sizeof *vendor_class, GFP_ATOMIC); + if (!vendor_class) { + printk(KERN_ERR PFX "No memory for " + "ib_mad_mgmt_vendor_class\n"); + goto error2; + } + memset(vendor_class, 0, sizeof(*vendor_class)); + (*vendor_table)->vendor_class[vclass] = vendor_class; + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* Is there matching OUI for this vendor class ? */ + if (!memcmp((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3)) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(!*method); + goto check_in_use; + } + } + for (i = 0; i < MAX_MGMT_OUI; i++) { + /* OUI slot available ? */ + if (!is_vendor_oui((*vendor_table)->vendor_class[ + vclass]->oui[i])) { + method = &(*vendor_table)->vendor_class[ + vclass]->method_table[i]; + BUG_ON(*method); + /* Allocate method table for this OUI */ + if ((ret = allocate_method_table(method))) + goto error3; + memcpy((*vendor_table)->vendor_class[vclass]->oui[i], + mad_reg_req->oui, 3); + goto check_in_use; + } + } + printk(KERN_ERR PFX "All OUI slots in use\n"); + goto error3; + +check_in_use: + /* Now, make sure methods are not already in use */ + if (method_in_use(method, mad_reg_req)) + goto error4; + + /* Finally, add in methods being registered */ + for (i = find_first_bit(mad_reg_req->method_mask, + IB_MGMT_MAX_METHODS); + i < IB_MGMT_MAX_METHODS; + i = find_next_bit(mad_reg_req->method_mask, IB_MGMT_MAX_METHODS, + 1+i)) { + (*method)->agent[i] = agent_priv; + } + return 0; + +error4: + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(*method, agent_priv); + /* Now, check to see if there are any methods in use */ + if (!check_method_table(*method)) { + /* If not, release management method table */ + kfree(*method); + *method = NULL; + } + ret = -EINVAL; +error3: + if (vendor_class) { + (*vendor_table)->vendor_class[vclass] = NULL; + kfree(vendor_class); + } +error2: + if (vendor) { + *vendor_table = NULL; + kfree(vendor); + } +error1: + return ret; +} + +static void remove_mad_reg_req(struct ib_mad_agent_private *agent_priv) +{ + struct ib_mad_port_private *port_priv; + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + int index; + u8 mgmt_class; + + /* + * Was MAD registration request supplied + * with original registration ? 
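+ * (If it was not, the agent was never added to the class or vendor + * method tables, so there is nothing to remove here.)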
+ */ + if (!agent_priv->reg_req) { + goto out; + } + + port_priv = agent_priv->qp_info->port_priv; + class = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].class; + if (!class) + goto vendor_check; + + mgmt_class = convert_mgmt_class(agent_priv->reg_req->mgmt_class); + method = class->method_table[mgmt_class]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* Now, check to see if there are any methods still in use */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + class->method_table[mgmt_class] = NULL; + /* Any management classes left ? */ + if (!check_class_table(class)) { + /* If not, release management class table */ + kfree(class); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version].class = NULL; + } + } + } + +vendor_check: + vendor = port_priv->version[ + agent_priv->reg_req->mgmt_class_version].vendor; + if (!vendor) + goto out; + + mgmt_class = vendor_class_index(agent_priv->reg_req->mgmt_class); + vendor_class = vendor->vendor_class[mgmt_class]; + if (vendor_class) { + index = find_vendor_oui(vendor_class, agent_priv->reg_req->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + /* Remove any methods for this mad agent */ + remove_methods_mad_agent(method, agent_priv); + /* + * Now, check to see if there are + * any methods still in use + */ + if (!check_method_table(method)) { + /* If not, release management method table */ + kfree(method); + vendor_class->method_table[index] = NULL; + memset(vendor_class->oui[index], 0, 3); + /* Any OUIs left ? */ + if (!check_vendor_class(vendor_class)) { + /* If not, release vendor class table */ + kfree(vendor_class); + vendor->vendor_class[mgmt_class] = NULL; + /* Any other vendor classes left ? */ + if (!check_vendor_table(vendor)) { + kfree(vendor); + port_priv->version[ + agent_priv->reg_req-> + mgmt_class_version]. + vendor = NULL; + } + } + } + } + } + +out: + return; +} + +static int response_mad(struct ib_mad *mad) +{ + /* Trap represses are responses although response bit is reset */ + return ((mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) || + (mad->mad_hdr.method & IB_MGMT_METHOD_RESP)); +} + +static int solicited_mad(struct ib_mad *mad) +{ + /* CM MADs are never solicited */ + if (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_CM) { + return 0; + } + + /* XXX: Determine whether MAD is using RMPP */ + + /* Not using RMPP */ + /* Is this MAD a response to a previous MAD ? */ + return response_mad(mad); +} + +static struct ib_mad_agent_private * +find_mad_agent(struct ib_mad_port_private *port_priv, + struct ib_mad *mad, + int solicited) +{ + struct ib_mad_agent_private *mad_agent = NULL; + unsigned long flags; + + spin_lock_irqsave(&port_priv->reg_lock, flags); + + /* + * Whether MAD was solicited determines type of routing to + * MAD client. + */ + if (solicited) { + u32 hi_tid; + struct ib_mad_agent_private *entry; + + /* + * Routing is based on high 32 bits of transaction ID + * of MAD. 
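+ * Each registered agent owns a unique hi_tid value, so a solicited + * response is routed back to the agent that sent the original request.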
+ */ + hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; + list_for_each_entry(entry, &port_priv->agent_list, + agent_list) { + if (entry->agent.hi_tid == hi_tid) { + mad_agent = entry; + break; + } + } + } else { + struct ib_mad_mgmt_class_table *class; + struct ib_mad_mgmt_method_table *method; + struct ib_mad_mgmt_vendor_class_table *vendor; + struct ib_mad_mgmt_vendor_class *vendor_class; + struct ib_vendor_mad *vendor_mad; + int index; + + /* + * Routing is based on version, class, and method + * For "newer" vendor MADs, also based on OUI + */ + if (mad->mad_hdr.class_version >= MAX_MGMT_VERSION) + goto out; + if (!is_vendor_class(mad->mad_hdr.mgmt_class)) { + class = port_priv->version[ + mad->mad_hdr.class_version].class; + if (!class) + goto out; + method = class->method_table[convert_mgmt_class( + mad->mad_hdr.mgmt_class)]; + if (method) + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } else { + vendor = port_priv->version[ + mad->mad_hdr.class_version].vendor; + if (!vendor) + goto out; + vendor_class = vendor->vendor_class[vendor_class_index( + mad->mad_hdr.mgmt_class)]; + if (!vendor_class) + goto out; + /* Find matching OUI */ + vendor_mad = (struct ib_vendor_mad *)mad; + index = find_vendor_oui(vendor_class, vendor_mad->oui); + if (index == -1) + goto out; + method = vendor_class->method_table[index]; + if (method) { + mad_agent = method->agent[mad->mad_hdr.method & + ~IB_MGMT_METHOD_RESP]; + } + } + } + + if (mad_agent) { + if (mad_agent->agent.recv_handler) + atomic_inc(&mad_agent->refcount); + else { + printk(KERN_NOTICE PFX "No receive handler for client " + "%p on port %d\n", + &mad_agent->agent, port_priv->port_num); + mad_agent = NULL; + } + } +out: + spin_unlock_irqrestore(&port_priv->reg_lock, flags); + + return mad_agent; +} + +static int validate_mad(struct ib_mad *mad, u32 qp_num) +{ + int valid = 0; + + /* Make sure MAD base version is understood */ + if (mad->mad_hdr.base_version != IB_MGMT_BASE_VERSION) { + printk(KERN_ERR PFX "MAD received with unsupported base " + "version %d\n", mad->mad_hdr.base_version); + goto out; + } + + /* Filter SMI packets sent to other than QP0 */ + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED) || + (mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)) { + if (qp_num == 0) + valid = 1; + } else { + /* Filter GSI packets sent to QP0 */ + if (qp_num != 0) + valid = 1; + } + +out: + return valid; +} + +/* + * Return start of fully reassembled MAD, or NULL, if MAD isn't assembled yet + */ +static struct ib_mad_private * +reassemble_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv) +{ + /* Until we have RMPP, all receives are reassembled!... */ + INIT_LIST_HEAD(&recv->header.recv_wc.recv_buf.list); + return recv; +} + +static struct ib_mad_send_wr_private* +find_send_req(struct ib_mad_agent_private *mad_agent_priv, + u64 tid) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->tid == tid) + return mad_send_wr; + } + + /* + * It's possible to receive the response before we've + * been notified that the send has completed + */ + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->tid == tid && mad_send_wr->timeout) { + /* Verify request has not been canceled */ + return (mad_send_wr->status == IB_WC_SUCCESS) ? 
+ mad_send_wr : NULL; + } + } + return NULL; +} + +static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_private *recv, + int solicited) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + /* Fully reassemble receive before processing */ + recv = reassemble_recv(mad_agent_priv, recv); + if (!recv) { + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + + /* Complete corresponding request */ + if (solicited) { + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_req(mad_agent_priv, + recv->mad.mad.mad_hdr.tid); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + ib_free_recv_mad(&recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + return; + } + /* Timeout = 0 means that we won't wait for a response */ + mad_send_wr->timeout = 0; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Defined behavior is to complete response before request */ + recv->header.recv_wc.wc->wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + atomic_dec(&mad_agent_priv->refcount); + + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); + } else { + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &recv->header.recv_wc); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + } +} + +static void ib_mad_recv_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_qp_info *qp_info; + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv, *response; + struct ib_mad_list_head *mad_list; + struct ib_mad_agent_private *mad_agent; + int solicited; + + response = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!response) + printk(KERN_ERR PFX "ib_mad_recv_done_handler no memory " + "for response buffer\n"); + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + dequeue_mad(mad_list); + + mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, header); + dma_unmap_single(port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + + /* Setup MAD receive work completion from "normal" work completion */ + recv->header.recv_wc.wc = wc; + recv->header.recv_wc.mad_len = sizeof(struct ib_mad); + recv->header.recv_wc.recv_buf.mad = &recv->mad.mad; + recv->header.recv_wc.recv_buf.grh = &recv->grh; + + if (atomic_read(&qp_info->snoop_count)) + snoop_recv(qp_info, &recv->header.recv_wc, IB_MAD_SNOOP_RECVS); + + /* Validate MAD */ + if (!validate_mad(&recv->mad.mad, qp_info->qp->qp_num)) + goto out; + + if (recv->mad.mad.mad_hdr.mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (!smi_handle_dr_smp_recv(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num, + port_priv->device->phys_port_cnt)) + goto out; + if (!smi_check_forward_dr_smp(&recv->mad.smp)) + goto local; + if (!smi_handle_dr_smp_send(&recv->mad.smp, + port_priv->device->node_type, + port_priv->port_num)) + goto out; + if (!smi_check_local_dr_smp(&recv->mad.smp, 
+ port_priv->device, + port_priv->port_num)) + goto out; + } + +local: + /* Give driver "right of first refusal" on incoming MAD */ + if (port_priv->device->process_mad) { + int ret; + + if (!response) { + printk(KERN_ERR PFX "No memory for response MAD\n"); + /* + * Is it better to assume that + * it wouldn't be processed ? + */ + goto out; + } + + ret = port_priv->device->process_mad(port_priv->device, 0, + port_priv->port_num, + wc->slid, + &recv->mad.mad, + &response->mad.mad); + if (ret & IB_MAD_RESULT_SUCCESS) { + if (ret & IB_MAD_RESULT_CONSUMED) + goto out; + if (ret & IB_MAD_RESULT_REPLY) { + /* Send response */ + if (!agent_send(response, &recv->grh, wc, + port_priv->device, + port_priv->port_num)) + response = NULL; + goto out; + } + } + } + + /* Determine corresponding MAD agent for incoming receive MAD */ + solicited = solicited_mad(&recv->mad.mad); + mad_agent = find_mad_agent(port_priv, &recv->mad.mad, solicited); + if (mad_agent) { + ib_mad_complete_recv(mad_agent, recv, solicited); + /* + * recv is freed up in error cases in ib_mad_complete_recv + * or via recv_handler in ib_mad_complete_recv() + */ + recv = NULL; + } + +out: + /* Post another receive request for this QP */ + if (response) { + ib_mad_post_receive_mads(qp_info, response); + if (recv) + kmem_cache_free(ib_mad_cache, recv); + } else + ib_mad_post_receive_mads(qp_info, recv); +} + +static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv) +{ + struct ib_mad_send_wr_private *mad_send_wr; + unsigned long delay; + + if (list_empty(&mad_agent_priv->wait_list)) { + cancel_delayed_work(&mad_agent_priv->timed_work); + } else { + mad_send_wr = list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_agent_priv->timeout, + mad_send_wr->timeout)) { + mad_agent_priv->timeout = mad_send_wr->timeout; + cancel_delayed_work(&mad_agent_priv->timed_work); + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->timed_work, delay); + } + } +} + +static void wait_for_response(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_wr_private *mad_send_wr ) +{ + struct ib_mad_send_wr_private *temp_mad_send_wr; + struct list_head *list_item; + unsigned long delay; + + list_del(&mad_send_wr->agent_list); + + delay = mad_send_wr->timeout; + mad_send_wr->timeout += jiffies; + + list_for_each_prev(list_item, &mad_agent_priv->wait_list) { + temp_mad_send_wr = list_entry(list_item, + struct ib_mad_send_wr_private, + agent_list); + if (time_after(mad_send_wr->timeout, + temp_mad_send_wr->timeout)) + break; + } + list_add(&mad_send_wr->agent_list, list_item); + + /* Reschedule a work item if we have a shorter timeout */ + if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) { + cancel_delayed_work(&mad_agent_priv->timed_work); + queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq, + &mad_agent_priv->timed_work, delay); + } +} + +/* + * Process a send work completion + */ +static void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_mad_agent_private *mad_agent_priv; + unsigned long flags; + + mad_agent_priv = container_of(mad_send_wr->agent, + struct ib_mad_agent_private, agent); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + if (mad_send_wc->status != IB_WC_SUCCESS && + mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = mad_send_wc->status; + 
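/* A failed send will never receive a response, so drop the reference that was held for one */ +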
mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + + if (--mad_send_wr->refcount > 0) { + if (mad_send_wr->refcount == 1 && mad_send_wr->timeout && + mad_send_wr->status == IB_WC_SUCCESS) { + wait_for_response(mad_agent_priv, mad_send_wr); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + return; + } + + /* Remove send from MAD agent and notify client of completion */ + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + if (mad_send_wr->status != IB_WC_SUCCESS ) + mad_send_wc->status = mad_send_wr->status; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + mad_send_wc); + + /* Release reference on agent taken when sending */ + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + + kfree(mad_send_wr); +} + +static void ib_mad_send_done_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_send_wr_private *mad_send_wr, *queued_send_wr; + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_queue *send_queue; + struct ib_send_wr *bad_send_wr; + unsigned long flags; + int ret; + + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + send_queue = mad_list->mad_queue; + qp_info = send_queue->qp_info; + +retry: + queued_send_wr = NULL; + spin_lock_irqsave(&send_queue->lock, flags); + list_del(&mad_list->list); + + /* Move queued send to the send queue */ + if (send_queue->count-- > send_queue->max_active) { + mad_list = container_of(qp_info->overflow_list.next, + struct ib_mad_list_head, list); + queued_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + list_del(&mad_list->list); + list_add_tail(&mad_list->list, &send_queue->list); + } + spin_unlock_irqrestore(&send_queue->lock, flags); + + /* Restore client wr_id in WC and complete send */ + wc->wr_id = mad_send_wr->wr_id; + if (atomic_read(&qp_info->snoop_count)) + snoop_send(qp_info, &mad_send_wr->send_wr, + (struct ib_mad_send_wc *)wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc *)wc); + + if (queued_send_wr) { + ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, + &bad_send_wr); + if (ret) { + printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); + mad_send_wr = queued_send_wr; + wc->status = IB_WC_LOC_QP_OP_ERR; + goto retry; + } + } +} + +static void mark_sends_for_retry(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_list_head *mad_list; + unsigned long flags; + + spin_lock_irqsave(&qp_info->send_queue.lock, flags); + list_for_each_entry(mad_list, &qp_info->send_queue.list, list) { + mad_send_wr = container_of(mad_list, + struct ib_mad_send_wr_private, + mad_list); + mad_send_wr->retry = 1; + } + spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); +} + +static void mad_error_handler(struct ib_mad_port_private *port_priv, + struct ib_wc *wc) +{ + struct ib_mad_list_head *mad_list; + struct ib_mad_qp_info *qp_info; + struct ib_mad_send_wr_private *mad_send_wr; + int ret; + + /* Determine if failure was a send or receive */ + mad_list = (struct ib_mad_list_head *)(unsigned long)wc->wr_id; + qp_info = mad_list->mad_queue->qp_info; + if (mad_list->mad_queue == &qp_info->recv_queue) + /* + * Receive errors indicate that the QP has entered the error + * state - error handling/shutdown code will cleanup 
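+ * (posted receive buffers are reclaimed by cleanup_recv_queue() + * when the port is closed)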
+ */ + return; + + /* + * Send errors will transition the QP to SQE - move + * QP to RTS and repost flushed work requests + */ + mad_send_wr = container_of(mad_list, struct ib_mad_send_wr_private, + mad_list); + if (wc->status == IB_WC_WR_FLUSH_ERR) { + if (mad_send_wr->retry) { + /* Repost send */ + struct ib_send_wr *bad_send_wr; + + mad_send_wr->retry = 0; + ret = ib_post_send(qp_info->qp, &mad_send_wr->send_wr, + &bad_send_wr); + if (ret) + ib_mad_send_done_handler(port_priv, wc); + } else + ib_mad_send_done_handler(port_priv, wc); + } else { + struct ib_qp_attr *attr; + + /* Transition QP to RTS and fail offending send */ + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (attr) { + attr->qp_state = IB_QPS_RTS; + attr->cur_qp_state = IB_QPS_SQE; + ret = ib_modify_qp(qp_info->qp, attr, + IB_QP_STATE | IB_QP_CUR_STATE); + kfree(attr); + if (ret) + printk(KERN_ERR PFX "mad_error_handler - " + "ib_modify_qp to RTS : %d\n", ret); + else + mark_sends_for_retry(qp_info); + } + ib_mad_send_done_handler(port_priv, wc); + } +} + +/* + * IB MAD completion callback + */ +static void ib_mad_completion_handler(void *data) +{ + struct ib_mad_port_private *port_priv; + struct ib_wc wc; + + port_priv = (struct ib_mad_port_private *)data; + ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + + while (ib_poll_cq(port_priv->cq, 1, &wc) == 1) { + if (wc.status == IB_WC_SUCCESS) { + switch (wc.opcode) { + case IB_WC_SEND: + ib_mad_send_done_handler(port_priv, &wc); + break; + case IB_WC_RECV: + ib_mad_recv_done_handler(port_priv, &wc); + break; + default: + BUG_ON(1); + break; + } + } else + mad_error_handler(port_priv, &wc); + } +} + +static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv) +{ + unsigned long flags; + struct ib_mad_send_wr_private *mad_send_wr, *temp_mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + struct list_head cancel_list; + + INIT_LIST_HEAD(&cancel_list); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &mad_agent_priv->send_list, agent_list) { + if (mad_send_wr->status == IB_WC_SUCCESS) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + } + } + + /* Empty wait list to prevent receives from finding a request */ + list_splice_init(&mad_agent_priv->wait_list, &cancel_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + /* Report all cancelled requests */ + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + + list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, + &cancel_list, agent_list) { + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + list_del(&mad_send_wr->agent_list); + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + } +} + +static struct ib_mad_send_wr_private* +find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, + u64 wr_id) +{ + struct ib_mad_send_wr_private *mad_send_wr; + + list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + + list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, + agent_list) { + if (mad_send_wr->wr_id == wr_id) + return mad_send_wr; + } + return NULL; +} + +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + u64 wr_id) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags; + + mad_agent_priv = 
container_of(mad_agent, struct ib_mad_agent_private, + agent); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + mad_send_wr = find_send_by_wr_id(mad_agent_priv, wr_id); + if (!mad_send_wr) { + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + if (mad_send_wr->status == IB_WC_SUCCESS) + mad_send_wr->refcount -= (mad_send_wr->timeout > 0); + + if (mad_send_wr->refcount != 0) { + mad_send_wr->status = IB_WC_WR_FLUSH_ERR; + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + goto out; + } + + list_del(&mad_send_wr->agent_list); + adjust_timeout(mad_agent_priv); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.status = IB_WC_WR_FLUSH_ERR; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + if (atomic_dec_and_test(&mad_agent_priv->refcount)) + wake_up(&mad_agent_priv->wait); + +out: + return; +} +EXPORT_SYMBOL(ib_cancel_mad); + +static void local_completions(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_local_private *local; + unsigned long flags; + struct ib_wc wc; + struct ib_mad_send_wc mad_send_wc; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->local_list)) { + local = list_entry(mad_agent_priv->local_list.next, + struct ib_mad_local_private, + completion_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + if (local->mad_priv) { + /* + * Defined behavior is to complete response + * before request + */ + wc.wr_id = local->wr_id; + wc.status = IB_WC_SUCCESS; + wc.opcode = IB_WC_RECV; + wc.vendor_err = 0; + wc.byte_len = sizeof(struct ib_mad); + wc.src_qp = IB_QP0; + wc.wc_flags = 0; + wc.pkey_index = 0; + wc.slid = IB_LID_PERMISSIVE; + wc.sl = 0; + wc.dlid_path_bits = 0; + local->mad_priv->header.recv_wc.wc = &wc; + local->mad_priv->header.recv_wc.mad_len = + sizeof(struct ib_mad); + INIT_LIST_HEAD(&local->mad_priv->header.recv_wc.recv_buf.list); + local->mad_priv->header.recv_wc.recv_buf.grh = NULL; + local->mad_priv->header.recv_wc.recv_buf.mad = + &local->mad_priv->mad.mad; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_recv(mad_agent_priv->qp_info, + &local->mad_priv->header.recv_wc, + IB_MAD_SNOOP_RECVS); + mad_agent_priv->agent.recv_handler( + &mad_agent_priv->agent, + &local->mad_priv->header.recv_wc); + } + + /* Complete send */ + mad_send_wc.status = IB_WC_SUCCESS; + mad_send_wc.vendor_err = 0; + mad_send_wc.wr_id = local->wr_id; + if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) + snoop_send(mad_agent_priv->qp_info, &local->send_wr, + &mad_send_wc, + IB_MAD_SNOOP_SEND_COMPLETIONS); + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + list_del(&local->completion_list); + atomic_dec(&mad_agent_priv->refcount); + kfree(local); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + +static void timeout_sends(void *data) +{ + struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_wr_private *mad_send_wr; + struct ib_mad_send_wc mad_send_wc; + unsigned long flags, delay; + + mad_agent_priv = (struct ib_mad_agent_private *)data; + + mad_send_wc.status = IB_WC_RESP_TIMEOUT_ERR; + mad_send_wc.vendor_err = 0; + + spin_lock_irqsave(&mad_agent_priv->lock, flags); + while (!list_empty(&mad_agent_priv->wait_list)) { + mad_send_wr = 
list_entry(mad_agent_priv->wait_list.next, + struct ib_mad_send_wr_private, + agent_list); + + if (time_after(mad_send_wr->timeout, jiffies)) { + delay = mad_send_wr->timeout - jiffies; + if ((long)delay <= 0) + delay = 1; + queue_delayed_work(mad_agent_priv->qp_info-> + port_priv->wq, + &mad_agent_priv->timed_work, delay); + break; + } + + list_del(&mad_send_wr->agent_list); + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); + + mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, + &mad_send_wc); + + kfree(mad_send_wr); + atomic_dec(&mad_agent_priv->refcount); + spin_lock_irqsave(&mad_agent_priv->lock, flags); + } + spin_unlock_irqrestore(&mad_agent_priv->lock, flags); +} + +static void ib_mad_thread_completion_handler(struct ib_cq *cq) +{ + struct ib_mad_port_private *port_priv = cq->cq_context; + + queue_work(port_priv->wq, &port_priv->work); +} + +/* + * Allocate receive MADs and post receive WRs for them + */ +static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info, + struct ib_mad_private *mad) +{ + unsigned long flags; + int post, ret; + struct ib_mad_private *mad_priv; + struct ib_sge sg_list; + struct ib_recv_wr recv_wr, *bad_recv_wr; + struct ib_mad_queue *recv_queue = &qp_info->recv_queue; + + /* Initialize common scatter list fields */ + sg_list.length = sizeof *mad_priv - sizeof mad_priv->header; + sg_list.lkey = (*qp_info->port_priv->mr).lkey; + + /* Initialize common receive WR fields */ + recv_wr.next = NULL; + recv_wr.sg_list = &sg_list; + recv_wr.num_sge = 1; + recv_wr.recv_flags = IB_RECV_SIGNALED; + + do { + /* Allocate and map receive buffer */ + if (mad) { + mad_priv = mad; + mad = NULL; + } else { + mad_priv = kmem_cache_alloc(ib_mad_cache, GFP_KERNEL); + if (!mad_priv) { + printk(KERN_ERR PFX "No memory for receive buffer\n"); + ret = -ENOMEM; + break; + } + } + sg_list.addr = dma_map_single(qp_info->port_priv-> + device->dma_device, + &mad_priv->grh, + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); + recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; + mad_priv->header.mad_list.mad_queue = recv_queue; + + /* Post receive WR */ + spin_lock_irqsave(&recv_queue->lock, flags); + post = (++recv_queue->count < recv_queue->max_active); + list_add_tail(&mad_priv->header.mad_list.list, &recv_queue->list); + spin_unlock_irqrestore(&recv_queue->lock, flags); + ret = ib_post_recv(qp_info->qp, &recv_wr, &bad_recv_wr); + if (ret) { + spin_lock_irqsave(&recv_queue->lock, flags); + list_del(&mad_priv->header.mad_list.list); + recv_queue->count--; + spin_unlock_irqrestore(&recv_queue->lock, flags); + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&mad_priv->header, + mapping), + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, mad_priv); + printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret); + break; + } + } while (post); + + return ret; +} + +/* + * Return all the posted receive MADs + */ +static void cleanup_recv_queue(struct ib_mad_qp_info *qp_info) +{ + struct ib_mad_private_header *mad_priv_hdr; + struct ib_mad_private *recv; + struct ib_mad_list_head *mad_list; + + while (!list_empty(&qp_info->recv_queue.list)) { + + mad_list = list_entry(qp_info->recv_queue.list.next, + struct ib_mad_list_head, list); + mad_priv_hdr = container_of(mad_list, + struct ib_mad_private_header, + mad_list); + recv = container_of(mad_priv_hdr, struct ib_mad_private, + 
header); + + /* Remove from posted receive MAD list */ + list_del(&mad_list->list); + + /* Undo PCI mapping */ + dma_unmap_single(qp_info->port_priv->device->dma_device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); + kmem_cache_free(ib_mad_cache, recv); + } + + qp_info->recv_queue.count = 0; +} + +/* + * Start the port + */ +static int ib_mad_port_start(struct ib_mad_port_private *port_priv) +{ + int ret, i; + struct ib_qp_attr *attr; + struct ib_qp *qp; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + printk(KERN_ERR PFX "Couldn't kmalloc ib_qp_attr\n"); + return -ENOMEM; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + qp = port_priv->qp_info[i].qp; + /* + * PKey index for QP1 is irrelevant but + * one is needed for the Reset to Init transition + */ + attr->qp_state = IB_QPS_INIT; + attr->pkey_index = 0; + attr->qkey = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | + IB_QP_PKEY_INDEX | IB_QP_QKEY); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "INIT: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTR; + ret = ib_modify_qp(qp, attr, IB_QP_STATE); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTR: %d\n", i, ret); + goto out; + } + + attr->qp_state = IB_QPS_RTS; + attr->sq_psn = IB_MAD_SEND_Q_PSN; + ret = ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_SQ_PSN); + if (ret) { + printk(KERN_ERR PFX "Couldn't change QP%d state to " + "RTS: %d\n", i, ret); + goto out; + } + } + + ret = ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP); + if (ret) { + printk(KERN_ERR PFX "Failed to request completion " + "notification: %d\n", ret); + goto out; + } + + for (i = 0; i < IB_MAD_QPS_CORE; i++) { + ret = ib_mad_post_receive_mads(&port_priv->qp_info[i], NULL); + if (ret) { + printk(KERN_ERR PFX "Couldn't post receive WRs\n"); + goto out; + } + } +out: + kfree(attr); + return ret; +} + +static void qp_event_handler(struct ib_event *event, void *qp_context) +{ + struct ib_mad_qp_info *qp_info = qp_context; + + /* It's worse than that! He's dead, Jim! 
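- a fatal event on a MAD QP is only logged here; the QP is not recovered 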
*/ + printk(KERN_ERR PFX "Fatal error (%d) on MAD QP (%d)\n", + event->event, qp_info->qp->qp_num); +} + +static void init_mad_queue(struct ib_mad_qp_info *qp_info, + struct ib_mad_queue *mad_queue) +{ + mad_queue->qp_info = qp_info; + mad_queue->count = 0; + spin_lock_init(&mad_queue->lock); + INIT_LIST_HEAD(&mad_queue->list); +} + +static void init_mad_qp(struct ib_mad_port_private *port_priv, + struct ib_mad_qp_info *qp_info) +{ + qp_info->port_priv = port_priv; + init_mad_queue(qp_info, &qp_info->send_queue); + init_mad_queue(qp_info, &qp_info->recv_queue); + INIT_LIST_HEAD(&qp_info->overflow_list); + spin_lock_init(&qp_info->snoop_lock); + qp_info->snoop_table = NULL; + qp_info->snoop_table_size = 0; + atomic_set(&qp_info->snoop_count, 0); +} + +static int create_mad_qp(struct ib_mad_qp_info *qp_info, + enum ib_qp_type qp_type) +{ + struct ib_qp_init_attr qp_init_attr; + int ret; + + memset(&qp_init_attr, 0, sizeof qp_init_attr); + qp_init_attr.send_cq = qp_info->port_priv->cq; + qp_init_attr.recv_cq = qp_info->port_priv->cq; + qp_init_attr.sq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.rq_sig_type = IB_SIGNAL_ALL_WR; + qp_init_attr.cap.max_send_wr = IB_MAD_QP_SEND_SIZE; + qp_init_attr.cap.max_recv_wr = IB_MAD_QP_RECV_SIZE; + qp_init_attr.cap.max_send_sge = IB_MAD_SEND_REQ_MAX_SG; + qp_init_attr.cap.max_recv_sge = IB_MAD_RECV_REQ_MAX_SG; + qp_init_attr.qp_type = qp_type; + qp_init_attr.port_num = qp_info->port_priv->port_num; + qp_init_attr.qp_context = qp_info; + qp_init_attr.event_handler = qp_event_handler; + qp_info->qp = ib_create_qp(qp_info->port_priv->pd, &qp_init_attr); + if (IS_ERR(qp_info->qp)) { + printk(KERN_ERR PFX "Couldn't create ib_mad QP%d\n", + get_spl_qp_index(qp_type)); + ret = PTR_ERR(qp_info->qp); + goto error; + } + /* Use minimum queue sizes unless the CQ is resized */ + qp_info->send_queue.max_active = IB_MAD_QP_SEND_SIZE; + qp_info->recv_queue.max_active = IB_MAD_QP_RECV_SIZE; + return 0; + +error: + return ret; +} + +static void destroy_mad_qp(struct ib_mad_qp_info *qp_info) +{ + ib_destroy_qp(qp_info->qp); + if (qp_info->snoop_table) + kfree(qp_info->snoop_table); +} + +/* + * Open the port + * Create the QP, PD, MR, and CQ if needed + */ +static int ib_mad_port_open(struct ib_device *device, + int port_num) +{ + int ret, cq_size; + struct ib_mad_port_private *port_priv; + unsigned long flags; + char name[sizeof "ib_mad123"]; + + /* First, check if port already open at MAD layer */ + port_priv = ib_get_mad_port(device, port_num); + if (port_priv) { + printk(KERN_DEBUG PFX "%s port %d already open\n", + device->name, port_num); + return 0; + } + + /* Create new device info */ + port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); + if (!port_priv) { + printk(KERN_ERR PFX "No memory for ib_mad_port_private\n"); + return -ENOMEM; + } + memset(port_priv, 0, sizeof *port_priv); + port_priv->device = device; + port_priv->port_num = port_num; + spin_lock_init(&port_priv->reg_lock); + INIT_LIST_HEAD(&port_priv->agent_list); + init_mad_qp(port_priv, &port_priv->qp_info[0]); + init_mad_qp(port_priv, &port_priv->qp_info[1]); + + cq_size = (IB_MAD_QP_SEND_SIZE + IB_MAD_QP_RECV_SIZE) * 2; + port_priv->cq = ib_create_cq(port_priv->device, + (ib_comp_handler) + ib_mad_thread_completion_handler, + NULL, port_priv, cq_size); + if (IS_ERR(port_priv->cq)) { + printk(KERN_ERR PFX "Couldn't create ib_mad CQ\n"); + ret = PTR_ERR(port_priv->cq); + goto error3; + } + + port_priv->pd = ib_alloc_pd(device); + if (IS_ERR(port_priv->pd)) { + printk(KERN_ERR PFX "Couldn't create ib_mad 
PD\n"); + ret = PTR_ERR(port_priv->pd); + goto error4; + } + + port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(port_priv->mr)) { + printk(KERN_ERR PFX "Couldn't get ib_mad DMA MR\n"); + ret = PTR_ERR(port_priv->mr); + goto error5; + } + + ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI); + if (ret) + goto error6; + ret = create_mad_qp(&port_priv->qp_info[1], IB_QPT_GSI); + if (ret) + goto error7; + + snprintf(name, sizeof name, "ib_mad%d", port_num); + port_priv->wq = create_singlethread_workqueue(name); + if (!port_priv->wq) { + ret = -ENOMEM; + goto error8; + } + INIT_WORK(&port_priv->work, ib_mad_completion_handler, port_priv); + + ret = ib_mad_port_start(port_priv); + if (ret) { + printk(KERN_ERR PFX "Couldn't start port\n"); + goto error9; + } + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + list_add_tail(&port_priv->port_list, &ib_mad_port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + return 0; + +error9: + destroy_workqueue(port_priv->wq); +error8: + destroy_mad_qp(&port_priv->qp_info[1]); +error7: + destroy_mad_qp(&port_priv->qp_info[0]); +error6: + ib_dereg_mr(port_priv->mr); +error5: + ib_dealloc_pd(port_priv->pd); +error4: + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); +error3: + kfree(port_priv); + + return ret; +} + +/* + * Close the port + * If there are no classes using the port, free the port + * resources (CQ, MR, PD, QP) and remove the port's info structure + */ +static int ib_mad_port_close(struct ib_device *device, int port_num) +{ + struct ib_mad_port_private *port_priv; + unsigned long flags; + + spin_lock_irqsave(&ib_mad_port_list_lock, flags); + port_priv = __ib_get_mad_port(device, port_num); + if (port_priv == NULL) { + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + printk(KERN_ERR PFX "Port %d not found\n", port_num); + return -ENODEV; + } + list_del(&port_priv->port_list); + spin_unlock_irqrestore(&ib_mad_port_list_lock, flags); + + /* Stop processing completions. 
*/ + flush_workqueue(port_priv->wq); + destroy_workqueue(port_priv->wq); + destroy_mad_qp(&port_priv->qp_info[1]); + destroy_mad_qp(&port_priv->qp_info[0]); + ib_dereg_mr(port_priv->mr); + ib_dealloc_pd(port_priv->pd); + ib_destroy_cq(port_priv->cq); + cleanup_recv_queue(&port_priv->qp_info[1]); + cleanup_recv_queue(&port_priv->qp_info[0]); + /* XXX: Handle deallocation of MAD registration tables */ + + kfree(port_priv); + + return 0; +} + +static void ib_mad_init_device(struct ib_device *device) +{ + int ret, num_ports, cur_port, i, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret = ib_mad_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d\n", + device->name, cur_port); + goto error_device_open; + } + ret = ib_agent_port_open(device, cur_port); + if (ret) { + printk(KERN_ERR PFX "Couldn't open %s port %d " + "for agents\n", + device->name, cur_port); + goto error_device_open; + } + } + + goto error_device_query; + +error_device_open: + while (i > 0) { + cur_port--; + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + } + i--; + } + +error_device_query: + return; +} + +static void ib_mad_remove_device(struct ib_device *device) +{ + int ret = 0, i, num_ports, cur_port, ret2; + + if (device->node_type == IB_NODE_SWITCH) { + num_ports = 1; + cur_port = 0; + } else { + num_ports = device->phys_port_cnt; + cur_port = 1; + } + for (i = 0; i < num_ports; i++, cur_port++) { + ret2 = ib_agent_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d " + "for agents\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + ret2 = ib_mad_port_close(device, cur_port); + if (ret2) { + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, cur_port); + if (!ret) + ret = ret2; + } + } +} + +static struct ib_client mad_client = { + .name = "mad", + .add = ib_mad_init_device, + .remove = ib_mad_remove_device +}; + +static int __init ib_mad_init_module(void) +{ + int ret; + + spin_lock_init(&ib_mad_port_list_lock); + spin_lock_init(&ib_agent_port_list_lock); + + ib_mad_cache = kmem_cache_create("ib_mad", + sizeof(struct ib_mad_private), + 0, + SLAB_HWCACHE_ALIGN, + NULL, + NULL); + if (!ib_mad_cache) { + printk(KERN_ERR PFX "Couldn't create ib_mad cache\n"); + ret = -ENOMEM; + goto error1; + } + + INIT_LIST_HEAD(&ib_mad_port_list); + + if (ib_register_client(&mad_client)) { + printk(KERN_ERR PFX "Couldn't register ib_mad client\n"); + ret = -EINVAL; + goto error2; + } + + return 0; + +error2: + kmem_cache_destroy(ib_mad_cache); +error1: + return ret; +} + +static void __exit ib_mad_cleanup_module(void) +{ + ib_unregister_client(&mad_client); + + if (kmem_cache_destroy(ib_mad_cache)) { + printk(KERN_DEBUG PFX "Failed to destroy ib_mad cache\n"); + } +} + +module_init(ib_mad_init_module); +module_exit(ib_mad_cleanup_module); From roland at topspin.com Mon Dec 27 21:51:00 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:00 -0800 Subject: [openib-general] [PATCH][v5][7/24] Add InfiniBand MAD SMI support In-Reply-To: <200412272150.vaYM2cWv3KsWVYPV@topspin.com> Message-ID: 
<200412272151.7VvrtyAsRwgOK3mP@topspin.com> Add MAD layer SMI (Subnet Management Interface) code. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.c 2004-12-27 21:48:20.566938847 -0800 @@ -0,0 +1,234 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: smi.c 1389 2004-12-27 22:56:47Z roland $ + */ + +#include + + +/* + * Fixup a directed route SMP for sending + * Return 0 if the SMP should be discarded + */ +int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 */ + if (hop_cnt && hop_ptr == 0) { + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:2 */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->return_path set when received */ + smp->hop_ptr++; + return (smp->initial_path[smp->hop_ptr] == + port_num); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + /* smp->return_path set when received */ + smp->hop_ptr++; + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- Fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) { + smp->hop_ptr--; + /* C14-13:3 -- SMPs destined for SM shouldn't be here */ + return (node_type == IB_NODE_SWITCH || + smp->dr_slid == IB_LID_PERMISSIVE); + } + + /* C14-13:4 -- hop_ptr = 0 -> should have gone to SM */ + if (hop_ptr == 0) + return 1; + + /* C14-13:5 -- Check for unreasonable hop pointer */ + return 0; + } +} + +/* + * Adjust information for a received SMP + * Return 0 if the SMP should be dropped + */ +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + /* See section 14.2.2.2, Vol 1 IB spec */ + if (!ib_get_smp_direction(smp)) { + /* C14-9:1 -- sender should have incremented hop_ptr */ + if (hop_cnt && hop_ptr == 0) + return 0; + + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + return (smp->initial_path[hop_ptr+1] <= phys_port_cnt); + } + + /* C14-9:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) { + if (hop_cnt) + smp->return_path[hop_ptr] = port_num; + /* smp->hop_ptr updated when sending */ + + return (node_type == IB_NODE_SWITCH || + smp->dr_dlid == IB_LID_PERMISSIVE); + } + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + /* C14-9:5 -- fail unreasonable hop pointer */ + return (hop_ptr == hop_cnt + 1); + + } else { + + /* C14-13:1 */ + if (hop_cnt && hop_ptr == hop_cnt + 1) { + smp->hop_ptr--; + return (smp->return_path[smp->hop_ptr] == + port_num); + } + + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) { + if (node_type != IB_NODE_SWITCH) + return 0; + + /* smp->hop_ptr updated when sending */ + return (smp->return_path[hop_ptr-1] <= phys_port_cnt); + } + + /* C14-13:3 -- We're at the end of the DR segment of path */ + if (hop_ptr == 1) { + if (smp->dr_slid == IB_LID_PERMISSIVE) { + /* 
giving SMP to SM - update hop_ptr */ + smp->hop_ptr--; + return 1; + } + /* smp->hop_ptr updated when sending */ + return (node_type == IB_NODE_SWITCH); + } + + /* C14-13:4 -- hop_ptr = 0 -> give to SM */ + /* C14-13:5 -- Check for unreasonable hop pointer */ + return (hop_ptr == 0); + } +} + +/* + * Return 1 if the received DR SMP should be forwarded to the send queue + * Return 0 if the SMP should be completed up the stack + */ +int smi_check_forward_dr_smp(struct ib_smp *smp) +{ + u8 hop_ptr, hop_cnt; + + hop_ptr = smp->hop_ptr; + hop_cnt = smp->hop_cnt; + + if (!ib_get_smp_direction(smp)) { + /* C14-9:2 -- intermediate hop */ + if (hop_ptr && hop_ptr < hop_cnt) + return 1; + + /* C14-9:3 -- at the end of the DR segment of path */ + if (hop_ptr == hop_cnt) + return (smp->dr_dlid == IB_LID_PERMISSIVE); + + /* C14-9:4 -- hop_ptr = hop_cnt + 1 -> give to SMA/SM */ + if (hop_ptr == hop_cnt + 1) + return 1; + } else { + /* C14-13:2 */ + if (2 <= hop_ptr && hop_ptr <= hop_cnt) + return 1; + + /* C14-13:3 -- at the end of the DR segment of path */ + if (hop_ptr == 1) + return (smp->dr_slid != IB_LID_PERMISSIVE); + } + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/smi.h 2004-12-27 21:48:20.592935020 -0800 @@ -0,0 +1,67 @@ +/* + * Copyright (c) 2004 Mellanox Technologies Ltd. All rights reserved. + * Copyright (c) 2004 Infinicon Corporation. All rights reserved. + * Copyright (c) 2004 Intel Corporation. All rights reserved. + * Copyright (c) 2004 Topspin Corporation. All rights reserved. + * Copyright (c) 2004 Voltaire Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: smi.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#ifndef __SMI_H_ +#define __SMI_H_ + +int smi_handle_dr_smp_recv(struct ib_smp *smp, + u8 node_type, + int port_num, + int phys_port_cnt); +extern int smi_check_forward_dr_smp(struct ib_smp *smp); +extern int smi_handle_dr_smp_send(struct ib_smp *smp, + u8 node_type, + int port_num); +extern int smi_check_local_dr_smp(struct ib_smp *smp, + struct ib_device *device, + int port_num); + +/* + * Return 1 if the SMP should be handled by the local SMA/SM via process_mad + */ +static inline int smi_check_local_smp(struct ib_mad_agent *mad_agent, + struct ib_smp *smp) +{ + /* C14-9:3 -- We're at the end of the DR segment of path */ + /* C14-9:4 -- Hop Pointer = Hop Count + 1 -> give to SMA/SM */ + return ((mad_agent->device->process_mad && + !ib_get_smp_direction(smp) && + (smp->hop_ptr == smp->hop_cnt + 1))); +} + +#endif /* __SMI_H_ */ From roland at topspin.com Mon Dec 27 21:51:01 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:01 -0800 Subject: [openib-general] [PATCH][v5][8/24] Add InfiniBand SA (Subnet Administration) query support In-Reply-To: <200412272151.7VvrtyAsRwgOK3mP@topspin.com> Message-ID: <200412272151.h64HpFQTg9SpyhAM@topspin.com> Add support for sending queries to the SA (Subnet Administration). In particular the PathRecord and MCMember (multicast group member) used by the IP-over-InfiniBand driver are implemented. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-27 21:48:19.838046137 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-27 21:48:20.847897490 -0800 @@ -1,8 +1,10 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o ib_mad-y := mad.o smi.o agent.o + +ib_sa-y := sa_query.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/sa_query.c 2004-12-27 21:48:20.896890279 -0800 @@ -0,0 +1,866 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
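Before getting into sa_query.c itself, it may help to see roughly how a consumer such as IPoIB is expected to drive the PathRecord interface added below. This is only a sketch, not code from the patch: my_path_complete(), my_start_path_query(), the GID arguments and the one-second timeout are all made up for illustration.

        static void my_path_complete(int status, struct ib_sa_path_rec *resp,
                                     void *context)
        {
                /* resp is only valid when status == 0 */
                if (status)
                        printk(KERN_WARNING "path record query failed (%d)\n",
                               status);
                else
                        printk(KERN_INFO "path found: dlid 0x%x sl %d\n",
                               resp->dlid, resp->sl);
        }

        static int my_start_path_query(struct ib_device *device, u8 port_num,
                                       union ib_gid *sgid, union ib_gid *dgid,
                                       struct ib_sa_query **query)
        {
                struct ib_sa_path_rec rec = {
                        .sgid      = *sgid,
                        .dgid      = *dgid,
                        .numb_path = 1,
                };

                /* returns a query id (>= 0) or a negative errno; the
                   (id, *query) pair can later be fed to ib_sa_cancel_query() */
                return ib_sa_path_rec_get(device, port_num, &rec,
                                          IB_SA_PATH_REC_SGID |
                                          IB_SA_PATH_REC_DGID |
                                          IB_SA_PATH_REC_NUMB_PATH,
                                          1000, GFP_KERNEL,
                                          my_path_complete, NULL, query);
        }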
+ * + * $Id: sa_query.c 1389 2004-12-27 22:56:47Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand subnet administration query support"); +MODULE_LICENSE("Dual BSD/GPL"); + +/* + * These two structures must be packed because they have 64-bit fields + * that are only 32-bit aligned. 64-bit architectures will lay them + * out wrong otherwise. (And unfortunately they are sent on the wire + * so we can't change the layout) + */ +struct ib_sa_hdr { + u64 sm_key; + u16 attr_offset; + u16 reserved; + ib_sa_comp_mask comp_mask; +} __attribute__ ((packed)); + +struct ib_sa_mad { + struct ib_mad_hdr mad_hdr; + struct ib_rmpp_hdr rmpp_hdr; + struct ib_sa_hdr sa_hdr; + u8 data[200]; +} __attribute__ ((packed)); + +struct ib_sa_sm_ah { + struct ib_ah *ah; + struct kref ref; +}; + +struct ib_sa_port { + struct ib_mad_agent *agent; + struct ib_mr *mr; + struct ib_sa_sm_ah *sm_ah; + struct work_struct update_task; + spinlock_t ah_lock; + u8 port_num; +}; + +struct ib_sa_device { + int start_port, end_port; + struct ib_event_handler event_handler; + struct ib_sa_port port[0]; +}; + +struct ib_sa_query { + void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); + void (*release)(struct ib_sa_query *); + struct ib_sa_port *port; + struct ib_sa_mad *mad; + struct ib_sa_sm_ah *sm_ah; + DECLARE_PCI_UNMAP_ADDR(mapping) + int id; +}; + +struct ib_sa_path_query { + void (*callback)(int, struct ib_sa_path_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +struct ib_sa_mcmember_query { + void (*callback)(int, struct ib_sa_mcmember_rec *, void *); + void *context; + struct ib_sa_query sa_query; +}; + +static void ib_sa_add_one(struct ib_device *device); +static void ib_sa_remove_one(struct ib_device *device); + +static struct ib_client sa_client = { + .name = "sa", + .add = ib_sa_add_one, + .remove = ib_sa_remove_one +}; + +static spinlock_t idr_lock; +static DEFINE_IDR(query_idr); + +static spinlock_t tid_lock; +static u32 tid; + +enum { + IB_SA_ATTR_CLASS_PORTINFO = 0x01, + IB_SA_ATTR_NOTICE = 0x02, + IB_SA_ATTR_INFORM_INFO = 0x03, + IB_SA_ATTR_NODE_REC = 0x11, + IB_SA_ATTR_PORT_INFO_REC = 0x12, + IB_SA_ATTR_SL2VL_REC = 0x13, + IB_SA_ATTR_SWITCH_REC = 0x14, + IB_SA_ATTR_LINEAR_FDB_REC = 0x15, + IB_SA_ATTR_RANDOM_FDB_REC = 0x16, + IB_SA_ATTR_MCAST_FDB_REC = 0x17, + IB_SA_ATTR_SM_INFO_REC = 0x18, + IB_SA_ATTR_LINK_REC = 0x20, + IB_SA_ATTR_GUID_INFO_REC = 0x30, + IB_SA_ATTR_SERVICE_REC = 0x31, + IB_SA_ATTR_PARTITION_REC = 0x33, + IB_SA_ATTR_RANGE_REC = 0x34, + IB_SA_ATTR_PATH_REC = 0x35, + IB_SA_ATTR_VL_ARB_REC = 0x36, + IB_SA_ATTR_MC_GROUP_REC = 0x37, + IB_SA_ATTR_MC_MEMBER_REC = 0x38, + IB_SA_ATTR_TRACE_REC = 0x39, + IB_SA_ATTR_MULTI_PATH_REC = 0x3a, + IB_SA_ATTR_SERVICE_ASSOC_REC = 0x3b +}; + +#define PATH_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_path_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_path_rec *) 0)->field, \ + .field_name = "sa_path_rec:" #field + +static const struct ib_field path_rec_table[] = { + { RESERVED, + .offset_words = 0, + .offset_bits = 0, + .size_bits = 32 }, + { RESERVED, + .offset_words = 1, + .offset_bits = 0, + .size_bits = 32 }, + { PATH_REC_FIELD(dgid), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(sgid), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 128 }, + { PATH_REC_FIELD(dlid), + .offset_words = 10, + .offset_bits 
= 0, + .size_bits = 16 }, + { PATH_REC_FIELD(slid), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 16 }, + { PATH_REC_FIELD(raw_traffic), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 1 }, + { RESERVED, + .offset_words = 11, + .offset_bits = 1, + .size_bits = 3 }, + { PATH_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { PATH_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { PATH_REC_FIELD(traffic_class), + .offset_words = 12, + .offset_bits = 0, + .size_bits = 8 }, + { PATH_REC_FIELD(reversible), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { PATH_REC_FIELD(numb_path), + .offset_words = 12, + .offset_bits = 9, + .size_bits = 7 }, + { PATH_REC_FIELD(pkey), + .offset_words = 12, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 13, + .offset_bits = 0, + .size_bits = 12 }, + { PATH_REC_FIELD(sl), + .offset_words = 13, + .offset_bits = 12, + .size_bits = 4 }, + { PATH_REC_FIELD(mtu_selector), + .offset_words = 13, + .offset_bits = 16, + .size_bits = 2 }, + { PATH_REC_FIELD(mtu), + .offset_words = 13, + .offset_bits = 18, + .size_bits = 6 }, + { PATH_REC_FIELD(rate_selector), + .offset_words = 13, + .offset_bits = 24, + .size_bits = 2 }, + { PATH_REC_FIELD(rate), + .offset_words = 13, + .offset_bits = 26, + .size_bits = 6 }, + { PATH_REC_FIELD(packet_life_time_selector), + .offset_words = 14, + .offset_bits = 0, + .size_bits = 2 }, + { PATH_REC_FIELD(packet_life_time), + .offset_words = 14, + .offset_bits = 2, + .size_bits = 6 }, + { PATH_REC_FIELD(preference), + .offset_words = 14, + .offset_bits = 8, + .size_bits = 8 }, + { RESERVED, + .offset_words = 14, + .offset_bits = 16, + .size_bits = 48 }, +}; + +#define MCMEMBER_REC_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_mcmember_rec, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_mcmember_rec *) 0)->field, \ + .field_name = "sa_mcmember_rec:" #field + +static const struct ib_field mcmember_rec_table[] = { + { MCMEMBER_REC_FIELD(mgid), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(port_gid), + .offset_words = 4, + .offset_bits = 0, + .size_bits = 128 }, + { MCMEMBER_REC_FIELD(qkey), + .offset_words = 8, + .offset_bits = 0, + .size_bits = 32 }, + { MCMEMBER_REC_FIELD(mlid), + .offset_words = 9, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(mtu_selector), + .offset_words = 9, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(mtu), + .offset_words = 9, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(traffic_class), + .offset_words = 9, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(pkey), + .offset_words = 10, + .offset_bits = 0, + .size_bits = 16 }, + { MCMEMBER_REC_FIELD(rate_selector), + .offset_words = 10, + .offset_bits = 16, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(rate), + .offset_words = 10, + .offset_bits = 18, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(packet_life_time_selector), + .offset_words = 10, + .offset_bits = 24, + .size_bits = 2 }, + { MCMEMBER_REC_FIELD(packet_life_time), + .offset_words = 10, + .offset_bits = 26, + .size_bits = 6 }, + { MCMEMBER_REC_FIELD(sl), + .offset_words = 11, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(flow_label), + .offset_words = 11, + .offset_bits = 4, + .size_bits = 20 }, + { MCMEMBER_REC_FIELD(hop_limit), + .offset_words = 11, + .offset_bits = 24, + .size_bits = 8 }, + { MCMEMBER_REC_FIELD(scope), + 
.offset_words = 12, + .offset_bits = 0, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(join_state), + .offset_words = 12, + .offset_bits = 4, + .size_bits = 4 }, + { MCMEMBER_REC_FIELD(proxy_join), + .offset_words = 12, + .offset_bits = 8, + .size_bits = 1 }, + { RESERVED, + .offset_words = 12, + .offset_bits = 9, + .size_bits = 23 }, +}; + +static void free_sm_ah(struct kref *kref) +{ + struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); + + ib_destroy_ah(sm_ah->ah); + kfree(sm_ah); +} + +static void update_sm_ah(void *port_ptr) +{ + struct ib_sa_port *port = port_ptr; + struct ib_sa_sm_ah *new_ah, *old_ah; + struct ib_port_attr port_attr; + struct ib_ah_attr ah_attr; + + if (ib_query_port(port->agent->device, port->port_num, &port_attr)) { + printk(KERN_WARNING "Couldn't query port\n"); + return; + } + + new_ah = kmalloc(sizeof *new_ah, GFP_KERNEL); + if (!new_ah) { + printk(KERN_WARNING "Couldn't allocate new SM AH\n"); + return; + } + + kref_init(&new_ah->ref); + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = port_attr.sm_lid; + ah_attr.sl = port_attr.sm_sl; + ah_attr.port_num = port->port_num; + + new_ah->ah = ib_create_ah(port->agent->qp->pd, &ah_attr); + if (IS_ERR(new_ah->ah)) { + printk(KERN_WARNING "Couldn't create new SM AH\n"); + kfree(new_ah); + return; + } + + spin_lock_irq(&port->ah_lock); + old_ah = port->sm_ah; + port->sm_ah = new_ah; + spin_unlock_irq(&port->ah_lock); + + if (old_ah) + kref_put(&old_ah->ref, free_sm_ah); +} + +static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) +{ + if (event->event == IB_EVENT_PORT_ERR || + event->event == IB_EVENT_PORT_ACTIVE || + event->event == IB_EVENT_LID_CHANGE || + event->event == IB_EVENT_PKEY_CHANGE || + event->event == IB_EVENT_SM_CHANGE) { + struct ib_sa_device *sa_dev = + ib_get_client_data(event->device, &sa_client); + + schedule_work(&sa_dev->port[event->element.port_num - + sa_dev->start_port].update_task); + } +} + +/** + * ib_sa_cancel_query - try to cancel an SA query + * @id:ID of query to cancel + * @query:query pointer to cancel + * + * Try to cancel an SA query. If the id and query don't match up or + * the query has already completed, nothing is done. Otherwise the + * query is canceled and will complete with a status of -EINTR. 
+ */ +void ib_sa_cancel_query(int id, struct ib_sa_query *query) +{ + unsigned long flags; + struct ib_mad_agent *agent; + + spin_lock_irqsave(&idr_lock, flags); + if (idr_find(&query_idr, id) != query) { + spin_unlock_irqrestore(&idr_lock, flags); + return; + } + agent = query->port->agent; + spin_unlock_irqrestore(&idr_lock, flags); + + ib_cancel_mad(agent, id); +} +EXPORT_SYMBOL(ib_sa_cancel_query); + +static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent) +{ + unsigned long flags; + + memset(mad, 0, sizeof *mad); + + mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; + mad->mad_hdr.mgmt_class = IB_MGMT_CLASS_SUBN_ADM; + mad->mad_hdr.class_version = IB_SA_CLASS_VERSION; + + spin_lock_irqsave(&tid_lock, flags); + mad->mad_hdr.tid = + cpu_to_be64(((u64) agent->hi_tid) << 32 | tid++); + spin_unlock_irqrestore(&tid_lock, flags); +} + +static int send_mad(struct ib_sa_query *query, int timeout_ms) +{ + struct ib_sa_port *port = query->port; + unsigned long flags; + int ret; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .mad_hdr = &query->mad->mad_hdr, + .remote_qpn = 1, + .remote_qkey = IB_QP1_QKEY, + .timeout_ms = timeout_ms + } + } + }; + +retry: + if (!idr_pre_get(&query_idr, GFP_ATOMIC)) + return -ENOMEM; + spin_lock_irqsave(&idr_lock, flags); + ret = idr_get_new(&query_idr, query, &query->id); + spin_unlock_irqrestore(&idr_lock, flags); + if (ret == -EAGAIN) + goto retry; + if (ret) + return ret; + + wr.wr_id = query->id; + + spin_lock_irqsave(&port->ah_lock, flags); + kref_get(&port->sm_ah->ref); + query->sm_ah = port->sm_ah; + wr.wr.ud.ah = port->sm_ah->ah; + spin_unlock_irqrestore(&port->ah_lock, flags); + + gather_list.addr = dma_map_single(port->agent->device->dma_device, + query->mad, + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + gather_list.length = sizeof (struct ib_sa_mad); + gather_list.lkey = port->mr->lkey; + pci_unmap_addr_set(query, mapping, gather_list.addr); + + ret = ib_post_send_mad(port->agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(port->agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, query->id); + spin_unlock_irqrestore(&idr_lock, flags); + } + + return ret; +} + +static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_path_query *query = + container_of(sa_query, struct ib_sa_path_query, sa_query); + + if (mad) { + struct ib_sa_path_rec rec; + + ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); +} + +/** + * ib_sa_path_rec_get - Start a Path get query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Path Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to 
cancel query + * + * Send a Path Record Get query to the SA to look up a path. The + * callback function will be called when the query completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT is the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_path_rec_get() is negative, it is an + * error code. Otherwise it is a query ID that can be used to cancel + * the query. + */ +int ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_path_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_path_rec_callback; + query->sa_query.release = ib_sa_path_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = IB_MGMT_METHOD_GET; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_path_rec_get); + +static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_mcmember_query *query = + container_of(sa_query, struct ib_sa_mcmember_query, sa_query); + + if (mad) { + struct ib_sa_mcmember_rec rec; + + ib_unpack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) +{ + kfree(sa_query->mad); + kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); +} + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_mcmember_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; + struct ib_mad_agent *agent = port->agent; + int ret; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); + if (!query->sa_query.mad) { + kfree(query); + return -ENOMEM; + } + + query->callback = callback; + query->context = context; + + init_mad(query->sa_query.mad, agent); + + query->sa_query.callback = ib_sa_mcmember_rec_callback; + query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.port = port; + query->sa_query.mad->mad_hdr.method = method; + query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); + query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), + rec, query->sa_query.mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms); + if (ret) { + *sa_query = NULL; + kfree(query->sa_query.mad); + kfree(query); + } + + return ret ? 
ret : query->sa_query.id; +} +EXPORT_SYMBOL(ib_sa_mcmember_rec_query); + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (!query) + return; + + switch (mad_send_wc->status) { + case IB_WC_SUCCESS: + /* No callback -- already got recv */ + break; + case IB_WC_RESP_TIMEOUT_ERR: + query->callback(query, -ETIMEDOUT, NULL); + break; + case IB_WC_WR_FLUSH_ERR: + query->callback(query, -EINTR, NULL); + break; + default: + query->callback(query, -EIO, NULL); + break; + } + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(query, mapping), + sizeof (struct ib_sa_mad), + DMA_TO_DEVICE); + kref_put(&query->sm_ah->ref, free_sm_ah); + + query->release(query); + + spin_lock_irqsave(&idr_lock, flags); + idr_remove(&query_idr, mad_send_wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); +} + +static void recv_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_query *query; + unsigned long flags; + + spin_lock_irqsave(&idr_lock, flags); + query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); + spin_unlock_irqrestore(&idr_lock, flags); + + if (query) { + if (mad_recv_wc->wc->status == IB_WC_SUCCESS) + query->callback(query, + mad_recv_wc->recv_buf.mad->mad_hdr.status ? + -EINVAL : 0, + (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); + else + query->callback(query, -EIO, NULL); + } + + ib_free_recv_mad(mad_recv_wc); +} + +static void ib_sa_add_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + sa_dev = kmalloc(sizeof *sa_dev + + (e - s + 1) * sizeof (struct ib_sa_port), + GFP_KERNEL); + if (!sa_dev) + return; + + sa_dev->start_port = s; + sa_dev->end_port = e; + + for (i = 0; i <= e - s; ++i) { + sa_dev->port[i].mr = NULL; + sa_dev->port[i].sm_ah = NULL; + sa_dev->port[i].port_num = i + s; + spin_lock_init(&sa_dev->port[i].ah_lock); + + sa_dev->port[i].agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, + NULL, 0, send_handler, + recv_handler, sa_dev); + if (IS_ERR(sa_dev->port[i].agent)) + goto err; + + sa_dev->port[i].mr = ib_get_dma_mr(sa_dev->port[i].agent->qp->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(sa_dev->port[i].mr)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + goto err; + } + + INIT_WORK(&sa_dev->port[i].update_task, + update_sm_ah, &sa_dev->port[i]); + } + + ib_set_client_data(device, &sa_client, sa_dev); + + /* + * We register our event handler after everything is set up, + * and then update our cached info after the event handler is + * registered to avoid any problems if a port changes state + * during our initialization. 
+ */ + + INIT_IB_EVENT_HANDLER(&sa_dev->event_handler, device, ib_sa_event); + if (ib_register_event_handler(&sa_dev->event_handler)) + goto err; + + for (i = 0; i <= e - s; ++i) + update_sm_ah(&sa_dev->port[i]); + + return; + +err: + while (--i >= 0) { + ib_dereg_mr(sa_dev->port[i].mr); + ib_unregister_mad_agent(sa_dev->port[i].agent); + } + + kfree(sa_dev); + + return; +} + +static void ib_sa_remove_one(struct ib_device *device) +{ + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + int i; + + if (!sa_dev) + return; + + ib_unregister_event_handler(&sa_dev->event_handler); + + for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); + } + + kfree(sa_dev); +} + +static int __init ib_sa_init(void) +{ + int ret; + + spin_lock_init(&idr_lock); + spin_lock_init(&tid_lock); + + get_random_bytes(&tid, sizeof tid); + + ret = ib_register_client(&sa_client); + if (ret) + printk(KERN_ERR "Couldn't register ib_sa client\n"); + + return ret; +} + +static void __exit ib_sa_cleanup(void) +{ + ib_unregister_client(&sa_client); +} + +module_init(ib_sa_init); +module_exit(ib_sa_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_sa.h 2004-12-27 21:48:20.923886305 -0800 @@ -0,0 +1,280 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_sa.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#ifndef IB_SA_H +#define IB_SA_H + +#include + +#include +#include + +enum { + IB_SA_CLASS_VERSION = 2, /* IB spec version 1.1/1.2 */ + + IB_SA_METHOD_DELETE = 0x15 +}; + +enum ib_sa_selector { + IB_SA_GTE = 0, + IB_SA_LTE = 1, + IB_SA_EQ = 2, + /* + * The meaning of "best" depends on the attribute: for + * example, for MTU best will return the largest available + * MTU, while for packet life time, best will return the + * smallest available life time. + */ + IB_SA_BEST = 3 +}; + +typedef u64 __bitwise ib_sa_comp_mask; + +#define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) + +/* + * Structures for SA records are named "struct ib_sa_xxx_rec." 
No + * attempt is made to pack structures to match the physical layout of + * SA records in SA MADs; all packing and unpacking is handled by the + * SA query code. + * + * For a record with structure ib_sa_xxx_rec, the naming convention + * for the component mask value for field yyy is IB_SA_XXX_REC_YYY (we + * never use different abbreviations or otherwise change the spelling + * of xxx/yyy between ib_sa_xxx_rec.yyy and IB_SA_XXX_REC_YYY). + * + * Reserved rows are indicated with comments to help maintainability. + */ + +/* reserved: 0 */ +/* reserved: 1 */ +#define IB_SA_PATH_REC_DGID IB_SA_COMP_MASK( 2) +#define IB_SA_PATH_REC_SGID IB_SA_COMP_MASK( 3) +#define IB_SA_PATH_REC_DLID IB_SA_COMP_MASK( 4) +#define IB_SA_PATH_REC_SLID IB_SA_COMP_MASK( 5) +#define IB_SA_PATH_REC_RAW_TRAFFIC IB_SA_COMP_MASK( 6) +/* reserved: 7 */ +#define IB_SA_PATH_REC_FLOW_LABEL IB_SA_COMP_MASK( 8) +#define IB_SA_PATH_REC_HOP_LIMIT IB_SA_COMP_MASK( 9) +#define IB_SA_PATH_REC_TRAFFIC_CLASS IB_SA_COMP_MASK(10) +#define IB_SA_PATH_REC_REVERSIBLE IB_SA_COMP_MASK(11) +#define IB_SA_PATH_REC_NUMB_PATH IB_SA_COMP_MASK(12) +#define IB_SA_PATH_REC_PKEY IB_SA_COMP_MASK(13) +/* reserved: 14 */ +#define IB_SA_PATH_REC_SL IB_SA_COMP_MASK(15) +#define IB_SA_PATH_REC_MTU_SELECTOR IB_SA_COMP_MASK(16) +#define IB_SA_PATH_REC_MTU IB_SA_COMP_MASK(17) +#define IB_SA_PATH_REC_RATE_SELECTOR IB_SA_COMP_MASK(18) +#define IB_SA_PATH_REC_RATE IB_SA_COMP_MASK(19) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(20) +#define IB_SA_PATH_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(21) +#define IB_SA_PATH_REC_PREFERENCE IB_SA_COMP_MASK(22) + +struct ib_sa_path_rec { + /* reserved */ + /* reserved */ + union ib_gid dgid; + union ib_gid sgid; + u16 dlid; + u16 slid; + int raw_traffic; + /* reserved */ + u32 flow_label; + u8 hop_limit; + u8 traffic_class; + int reversible; + u8 numb_path; + u16 pkey; + /* reserved */ + u8 sl; + u8 mtu_selector; + enum ib_mtu mtu; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 preference; +}; + +#define IB_SA_MCMEMBER_REC_MGID IB_SA_COMP_MASK( 0) +#define IB_SA_MCMEMBER_REC_PORT_GID IB_SA_COMP_MASK( 1) +#define IB_SA_MCMEMBER_REC_QKEY IB_SA_COMP_MASK( 2) +#define IB_SA_MCMEMBER_REC_MLID IB_SA_COMP_MASK( 3) +#define IB_SA_MCMEMBER_REC_MTU_SELECTOR IB_SA_COMP_MASK( 4) +#define IB_SA_MCMEMBER_REC_MTU IB_SA_COMP_MASK( 5) +#define IB_SA_MCMEMBER_REC_TRAFFIC_CLASS IB_SA_COMP_MASK( 6) +#define IB_SA_MCMEMBER_REC_PKEY IB_SA_COMP_MASK( 7) +#define IB_SA_MCMEMBER_REC_RATE_SELECTOR IB_SA_COMP_MASK( 8) +#define IB_SA_MCMEMBER_REC_RATE IB_SA_COMP_MASK( 9) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME_SELECTOR IB_SA_COMP_MASK(10) +#define IB_SA_MCMEMBER_REC_PACKET_LIFE_TIME IB_SA_COMP_MASK(11) +#define IB_SA_MCMEMBER_REC_SL IB_SA_COMP_MASK(12) +#define IB_SA_MCMEMBER_REC_FLOW_LABEL IB_SA_COMP_MASK(13) +#define IB_SA_MCMEMBER_REC_HOP_LIMIT IB_SA_COMP_MASK(14) +#define IB_SA_MCMEMBER_REC_SCOPE IB_SA_COMP_MASK(15) +#define IB_SA_MCMEMBER_REC_JOIN_STATE IB_SA_COMP_MASK(16) +#define IB_SA_MCMEMBER_REC_PROXY_JOIN IB_SA_COMP_MASK(17) + +struct ib_sa_mcmember_rec { + union ib_gid mgid; + union ib_gid port_gid; + u32 qkey; + u16 mlid; + u8 mtu_selector; + enum ib_mtu mtu; + u8 traffic_class; + u16 pkey; + u8 rate_selector; + u8 rate; + u8 packet_life_time_selector; + u8 packet_life_time; + u8 sl; + u32 flow_label; + u8 hop_limit; + u8 scope; + u8 join_state; + int proxy_join; +}; + +struct ib_sa_query; + +void ib_sa_cancel_query(int id, struct ib_sa_query *query); + +int 
ib_sa_path_rec_get(struct ib_device *device, u8 port_num, + struct ib_sa_path_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_path_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +int ib_sa_mcmember_rec_query(struct ib_device *device, u8 port_num, + u8 method, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query); + +/** + * ib_sa_mcmember_rec_set - Start an MCMember set query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:MCMember Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send an MCMember Set query to the SA (eg to join a multicast + * group). The callback function will be called when the query + * completes (or fails); status is 0 for a successful response, -EINTR + * if the query is canceled, -ETIMEDOUT is the query timed out, or + * -EIO if an error occurred sending the query. The resp parameter of + * the callback is only valid if status is 0. + * + * If the return value of ib_sa_mcmember_rec_set() is negative, it is + * an error code. Otherwise it is a query ID that can be used to + * cancel the query. + */ +static inline int +ib_sa_mcmember_rec_set(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_MGMT_METHOD_SET, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + +/** + * ib_sa_mcmember_rec_delete - Start an MCMember delete query + * @device:device to send query on + * @port_num: port number to send query on + * @rec:MCMember Record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when query completes, times out or is + * canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * Send an MCMember Delete query to the SA (eg to leave a multicast + * group). The callback function will be called when the query + * completes (or fails); status is 0 for a successful response, -EINTR + * if the query is canceled, -ETIMEDOUT is the query timed out, or + * -EIO if an error occurred sending the query. The resp parameter of + * the callback is only valid if status is 0. + * + * If the return value of ib_sa_mcmember_rec_delete() is negative, it + * is an error code. Otherwise it is a query ID that can be used to + * cancel the query. 
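Since these set/delete wrappers are what IPoIB will actually use for multicast, a consumer-side join looks roughly like this (again just a sketch: my_join_complete(), the GID values and the timeout are placeholders, and join_state = 1 asks for full membership). Leaving the group is the same call shape through ib_sa_mcmember_rec_delete(), defined just below.

        struct ib_sa_mcmember_rec rec = {
                .mgid       = mcast_gid,
                .port_gid   = local_port_gid,
                .join_state = 1,
        };
        struct ib_sa_query *query;
        int id;

        id = ib_sa_mcmember_rec_set(device, port_num, &rec,
                                    IB_SA_MCMEMBER_REC_MGID |
                                    IB_SA_MCMEMBER_REC_PORT_GID |
                                    IB_SA_MCMEMBER_REC_JOIN_STATE,
                                    1000, GFP_KERNEL,
                                    my_join_complete, NULL, &query);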
+ */ +static inline int +ib_sa_mcmember_rec_delete(struct ib_device *device, u8 port_num, + struct ib_sa_mcmember_rec *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, int gfp_mask, + void (*callback)(int status, + struct ib_sa_mcmember_rec *resp, + void *context), + void *context, + struct ib_sa_query **query) +{ + return ib_sa_mcmember_rec_query(device, port_num, + IB_SA_METHOD_DELETE, + rec, comp_mask, + timeout_ms, gfp_mask, callback, + context, query); +} + + +#endif /* IB_SA_H */ From roland at topspin.com Mon Dec 27 21:51:02 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:02 -0800 Subject: [openib-general] [PATCH][v5][9/24] Add Mellanox HCA low-level driver In-Reply-To: <200412272151.h64HpFQTg9SpyhAM@topspin.com> Message-ID: <200412272151.HKpCNDFOMg5KO6kB@topspin.com> Add a low-level driver for Mellanox MT23108 and MT25208 HCAs. The MT25208 is only fully supported when in MT23108 compatibility mode; only the very beginnings of support for native MT25208 mode (required for HCAs without local memory) is present. (As a side note, I believe this driver would be the first in-tree consumer of the PCI MSI/MSI-X API) Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-27 21:48:18.185289416 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-27 21:48:21.258837002 -0800 @@ -7,4 +7,6 @@ any protocols you wish to use as well as drivers for your InfiniBand hardware. +source "drivers/infiniband/hw/mthca/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-27 21:48:18.216284854 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-27 21:48:21.219842741 -0800 @@ -1 +1,2 @@ obj-$(CONFIG_INFINIBAND) += core/ +obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Kconfig 2004-12-27 21:48:21.318828171 -0800 @@ -0,0 +1,26 @@ +config INFINIBAND_MTHCA + tristate "Mellanox HCA support" + depends on PCI && INFINIBAND + ---help--- + This is a low-level driver for Mellanox InfiniHost host + channel adapters (HCAs), including the MT23108 PCI-X HCA + ("Tavor") and the MT25208 PCI Express HCA ("Arbel"). + +config INFINIBAND_MTHCA_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_MTHCA + default n + ---help--- + This option causes the mthca driver produce a bunch of debug + messages. Select this is you are developing the driver or + trying to diagnose a problem. + +config INFINIBAND_MTHCA_SSE_DOORBELL + bool "SSE doorbell code" + depends on INFINIBAND_MTHCA && X86 && !X86_64 + default n + ---help--- + This option will have the mthca driver use SSE instructions + to ring hardware doorbell registers. This may improve + performance for some workloads, but the driver will not run + on processors without SSE instructions. 
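One note on the DEBUG option above before the Makefile: as far as I can tell it only adds -DDEBUG (see the ifdef in the Makefile below), which turns the dev_dbg()-backed mthca_dbg() macro from mthca_dev.h into real output instead of a no-op. A hypothetical call site, just to show the shape (the message text is made up):

        /* compiled away unless CONFIG_INFINIBAND_MTHCA_DEBUG=y */
        mthca_dbg(mdev, "mapped HCR at %p\n", mdev->hcr);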
--- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/Makefile 2004-12-27 21:48:21.366821107 -0800 @@ -0,0 +1,12 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +ifdef CONFIG_INFINIBAND_MTHCA_DEBUG +EXTRA_CFLAGS += -DDEBUG +endif + +obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mthca.o + +ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ + mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ + mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ + mthca_provider.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_allocator.c 2004-12-27 21:48:21.428811982 -0800 @@ -0,0 +1,179 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_allocator.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" + +/* Trivial bitmap-based allocator */ +u32 mthca_alloc(struct mthca_alloc *alloc) +{ + u32 obj; + + spin_lock(&alloc->lock); + obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last); + if (obj >= alloc->max) { + alloc->top = (alloc->top + alloc->max) & alloc->mask; + obj = find_first_zero_bit(alloc->table, alloc->max); + } + + if (obj < alloc->max) { + set_bit(obj, alloc->table); + obj |= alloc->top; + } else + obj = -1; + + spin_unlock(&alloc->lock); + + return obj; +} + +void mthca_free(struct mthca_alloc *alloc, u32 obj) +{ + obj &= alloc->max - 1; + spin_lock(&alloc->lock); + clear_bit(obj, alloc->table); + alloc->last = min(alloc->last, obj); + alloc->top = (alloc->top + alloc->max) & alloc->mask; + spin_unlock(&alloc->lock); +} + +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved) +{ + int i; + + /* num must be a power of 2 */ + if (num != 1 << (ffs(num) - 1)) + return -EINVAL; + + alloc->last = 0; + alloc->top = 0; + alloc->max = num; + alloc->mask = mask; + spin_lock_init(&alloc->lock); + alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof (long), + GFP_KERNEL); + if (!alloc->table) + return -ENOMEM; + + bitmap_zero(alloc->table, num); + for (i = 0; i < reserved; ++i) + set_bit(i, alloc->table); + + return 0; +} + +void mthca_alloc_cleanup(struct mthca_alloc *alloc) +{ + kfree(alloc->table); +} + +/* + * Array of pointers with lazy allocation of leaf pages. Callers of + * _get, _set and _clear methods must use a lock or otherwise + * serialize access to the array. + */ + +void *mthca_array_get(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (array->page_list[p].page) { + int i = index & (PAGE_SIZE / sizeof (void *) - 1); + return array->page_list[p].page[i]; + } else + return NULL; +} + +int mthca_array_set(struct mthca_array *array, int index, void *value) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + /* Allocate with GFP_ATOMIC because we'll be called with locks held. 
*/ + if (!array->page_list[p].page) + array->page_list[p].page = (void **) get_zeroed_page(GFP_ATOMIC); + + if (!array->page_list[p].page) + return -ENOMEM; + + array->page_list[p].page[index & (PAGE_SIZE / sizeof (void *) - 1)] = + value; + ++array->page_list[p].used; + + return 0; +} + +void mthca_array_clear(struct mthca_array *array, int index) +{ + int p = (index * sizeof (void *)) >> PAGE_SHIFT; + + if (--array->page_list[p].used == 0) { + free_page((unsigned long) array->page_list[p].page); + array->page_list[p].page = NULL; + } + + if (array->page_list[p].used < 0) + pr_debug("Array %p index %d page %d with ref count %d < 0\n", + array, index, p, array->page_list[p].used); +} + +int mthca_array_init(struct mthca_array *array, int nent) +{ + int npage = (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; + int i; + + array->page_list = kmalloc(npage * sizeof *array->page_list, GFP_KERNEL); + if (!array->page_list) + return -ENOMEM; + + for (i = 0; i < npage; ++i) { + array->page_list[i].page = NULL; + array->page_list[i].used = 0; + } + + return 0; +} + +void mthca_array_cleanup(struct mthca_array *array, int nent) +{ + int i; + + for (i = 0; i < (nent * sizeof (void *) + PAGE_SIZE - 1) / PAGE_SIZE; ++i) + free_page((unsigned long) array->page_list[i].page); + + kfree(array->page_list); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_config_reg.h 2004-12-27 21:48:21.473805359 -0800 @@ -0,0 +1,55 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
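Stepping back to the bitmap allocator at the top of mthca_allocator.c for a second, since the mask/top interplay is easy to misread: objects come out of a bitmap of num entries, but the value returned has a rotating offset OR-ed into it (bounded by mask), so a number that was just freed is unlikely to be handed straight back out. A rough usage sketch with arbitrary sizes (not from the patch):

        struct mthca_alloc alloc;
        u32 obj;

        /* 256-entry bitmap, numbers 0..15 reserved; returned numbers
           cycle through a 4096-value space */
        if (mthca_alloc_init(&alloc, 256, (1 << 12) - 1, 16))
                return -ENOMEM;

        obj = mthca_alloc(&alloc);
        if (obj == -1)
                return -ENOMEM;         /* bitmap exhausted */

        /* ... use obj as a PD/QP/CQ number ... */

        mthca_free(&alloc, obj);
        mthca_alloc_cleanup(&alloc);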
+ * + * $Id: mthca_config_reg.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_CONFIG_REG_H +#define MTHCA_CONFIG_REG_H + +#include + +#define MTHCA_HCR_BASE 0x80680 +#define MTHCA_HCR_SIZE 0x0001c +#define MTHCA_ECR_BASE 0x80700 +#define MTHCA_ECR_SIZE 0x00008 +#define MTHCA_ECR_CLR_BASE 0x80708 +#define MTHCA_ECR_CLR_SIZE 0x00008 +#define MTHCA_ECR_OFFSET (MTHCA_ECR_BASE - MTHCA_HCR_BASE) +#define MTHCA_ECR_CLR_OFFSET (MTHCA_ECR_CLR_BASE - MTHCA_HCR_BASE) +#define MTHCA_CLR_INT_BASE 0xf00d8 +#define MTHCA_CLR_INT_SIZE 0x00008 + +#define MTHCA_MAP_HCR_SIZE (MTHCA_ECR_CLR_BASE + \ + MTHCA_ECR_CLR_SIZE - \ + MTHCA_HCR_BASE) + +#endif /* MTHCA_CONFIG_REG_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_dev.h 2004-12-27 21:48:21.522798147 -0800 @@ -0,0 +1,391 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_dev.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_DEV_H +#define MTHCA_DEV_H + +#include +#include +#include +#include +#include +#include + +#include "mthca_provider.h" +#include "mthca_doorbell.h" + +#define DRV_NAME "ib_mthca" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.06-pre" +#define DRV_RELDATE "November 8, 2004" + +/* Types of supported HCA */ +enum { + TAVOR, /* MT23108 */ + ARBEL_COMPAT, /* MT25208 in Tavor compat mode */ + ARBEL_NATIVE /* MT25208 with extended features */ +}; + +enum { + MTHCA_FLAG_DDR_HIDDEN = 1 << 1, + MTHCA_FLAG_SRQ = 1 << 2, + MTHCA_FLAG_MSI = 1 << 3, + MTHCA_FLAG_MSI_X = 1 << 4, + MTHCA_FLAG_NO_LAM = 1 << 5 +}; + +enum { + MTHCA_KAR_PAGE = 1, + MTHCA_MAX_PORTS = 2 +}; + +enum { + MTHCA_MPT_ENTRY_SIZE = 0x40, + MTHCA_EQ_CONTEXT_SIZE = 0x40, + MTHCA_CQ_CONTEXT_SIZE = 0x40, + MTHCA_QP_CONTEXT_SIZE = 0x200, + MTHCA_AV_SIZE = 0x20, + MTHCA_MGM_ENTRY_SIZE = 0x40 +}; + +enum { + MTHCA_EQ_CMD, + MTHCA_EQ_ASYNC, + MTHCA_EQ_COMP, + MTHCA_NUM_EQ +}; + +struct mthca_cmd { + int use_events; + struct semaphore hcr_sem; + struct semaphore poll_sem; + struct semaphore event_sem; + int max_cmds; + spinlock_t context_lock; + int free_head; + struct mthca_cmd_context *context; + u16 token_mask; +}; + +struct mthca_limits { + int num_ports; + int vl_cap; + int mtu_cap; + int gid_table_len; + int pkey_table_len; + int local_ca_ack_delay; + int max_sg; + int num_qps; + int reserved_qps; + int num_srqs; + int reserved_srqs; + int num_eecs; + int reserved_eecs; + int num_cqs; + int reserved_cqs; + int num_eqs; + int reserved_eqs; + int num_mpts; + int num_mtt_segs; + int mtt_seg_size; + int reserved_mtts; + int reserved_mrws; + int num_rdbs; + int reserved_uars; + int num_mgms; + int num_amgms; + int reserved_mcgs; + int num_pds; + int reserved_pds; +}; + +struct mthca_alloc { + u32 last; + u32 top; + u32 max; + u32 mask; + spinlock_t lock; + unsigned long *table; +}; + +struct mthca_array { + struct { + void **page; + int used; + } *page_list; +}; + +struct mthca_pd_table { + struct mthca_alloc alloc; +}; + +struct mthca_mr_table { + struct mthca_alloc mpt_alloc; + int max_mtt_order; + unsigned long **mtt_buddy; + u64 mtt_base; +}; + +struct mthca_eq_table { + struct mthca_alloc alloc; + void __iomem *clr_int; + u32 clr_mask; + struct mthca_eq eq[MTHCA_NUM_EQ]; + int have_irq; + u8 inta_pin; +}; + +struct mthca_cq_table { + struct mthca_alloc alloc; + spinlock_t lock; + struct mthca_array cq; +}; + +struct mthca_qp_table { + struct mthca_alloc alloc; + int sqp_start; + spinlock_t lock; + struct mthca_array qp; +}; + +struct mthca_av_table { + struct pci_pool *pool; + int num_ddr_avs; + u64 ddr_av_base; + void __iomem *av_map; + struct mthca_alloc alloc; +}; + +struct mthca_mcg_table { + struct semaphore sem; + struct mthca_alloc alloc; +}; + +struct mthca_dev { + struct ib_device ib_dev; + struct pci_dev *pdev; + + int hca_type; + unsigned long mthca_flags; + + u32 rev_id; + + /* firmware info */ + u64 fw_ver; + union { + struct { + u64 fw_start; + u64 fw_end; + } tavor; + struct { + u64 clr_int_base; + u64 eq_arm_base; + u64 eq_set_ci_base; + struct scatterlist *mem; + u16 fw_pages; + } arbel; + } fw; + + u64 ddr_start; + u64 ddr_end; + + MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) + + void __iomem *hcr; + void __iomem *clr_base; + void __iomem *kar; + + struct mthca_cmd cmd; + struct mthca_limits limits; + + struct mthca_pd_table pd_table; + struct mthca_mr_table mr_table; + struct mthca_eq_table eq_table; + struct mthca_cq_table cq_table; 
+ struct mthca_qp_table qp_table; + struct mthca_av_table av_table; + struct mthca_mcg_table mcg_table; + + struct mthca_pd driver_pd; + struct mthca_mr driver_mr; + + struct ib_mad_agent *send_agent[MTHCA_MAX_PORTS][2]; + struct ib_ah *sm_ah[MTHCA_MAX_PORTS]; + spinlock_t sm_lock; +}; + +#define mthca_dbg(mdev, format, arg...) \ + dev_dbg(&mdev->pdev->dev, format, ## arg) +#define mthca_err(mdev, format, arg...) \ + dev_err(&mdev->pdev->dev, format, ## arg) +#define mthca_info(mdev, format, arg...) \ + dev_info(&mdev->pdev->dev, format, ## arg) +#define mthca_warn(mdev, format, arg...) \ + dev_warn(&mdev->pdev->dev, format, ## arg) + +extern void __buggy_use_of_MTHCA_GET(void); +extern void __buggy_use_of_MTHCA_PUT(void); + +#define MTHCA_GET(dest, source, offset) \ + do { \ + void *__p = (char *) (source) + (offset); \ + switch (sizeof (dest)) { \ + case 1: (dest) = *(u8 *) __p; break; \ + case 2: (dest) = be16_to_cpup(__p); break; \ + case 4: (dest) = be32_to_cpup(__p); break; \ + case 8: (dest) = be64_to_cpup(__p); break; \ + default: __buggy_use_of_MTHCA_GET(); \ + } \ + } while (0) + +#define MTHCA_PUT(dest, source, offset) \ + do { \ + __typeof__(source) *__p = \ + (__typeof__(source) *) ((char *) (dest) + (offset)); \ + switch (sizeof(source)) { \ + case 1: *__p = (source); break; \ + case 2: *__p = cpu_to_be16(source); break; \ + case 4: *__p = cpu_to_be32(source); break; \ + case 8: *__p = cpu_to_be64(source); break; \ + default: __buggy_use_of_MTHCA_PUT(); \ + } \ + } while (0) + +int mthca_reset(struct mthca_dev *mdev); + +u32 mthca_alloc(struct mthca_alloc *alloc); +void mthca_free(struct mthca_alloc *alloc, u32 obj); +int mthca_alloc_init(struct mthca_alloc *alloc, u32 num, u32 mask, + u32 reserved); +void mthca_alloc_cleanup(struct mthca_alloc *alloc); +void *mthca_array_get(struct mthca_array *array, int index); +int mthca_array_set(struct mthca_array *array, int index, void *value); +void mthca_array_clear(struct mthca_array *array, int index); +int mthca_array_init(struct mthca_array *array, int nent); +void mthca_array_cleanup(struct mthca_array *array, int nent); + +int mthca_init_pd_table(struct mthca_dev *dev); +int mthca_init_mr_table(struct mthca_dev *dev); +int mthca_init_eq_table(struct mthca_dev *dev); +int mthca_init_cq_table(struct mthca_dev *dev); +int mthca_init_qp_table(struct mthca_dev *dev); +int mthca_init_av_table(struct mthca_dev *dev); +int mthca_init_mcg_table(struct mthca_dev *dev); + +void mthca_cleanup_pd_table(struct mthca_dev *dev); +void mthca_cleanup_mr_table(struct mthca_dev *dev); +void mthca_cleanup_eq_table(struct mthca_dev *dev); +void mthca_cleanup_cq_table(struct mthca_dev *dev); +void mthca_cleanup_qp_table(struct mthca_dev *dev); +void mthca_cleanup_av_table(struct mthca_dev *dev); +void mthca_cleanup_mcg_table(struct mthca_dev *dev); + +int mthca_register_device(struct mthca_dev *dev); +void mthca_unregister_device(struct mthca_dev *dev); + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd); +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr); +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr); +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr); + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry); +void mthca_arm_cq(struct mthca_dev *dev, struct 
mthca_cq *cq, + int solicited); +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq); +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn); + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type); +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask); +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe); +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp); +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp); +void mthca_free_qp(struct mthca_dev *dev, struct mthca_qp *qp); +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah); +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah); +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header); + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad); +int mthca_create_agents(struct mthca_dev *dev); +void mthca_free_agents(struct mthca_dev *dev); + +static inline struct mthca_dev *to_mdev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct mthca_dev, ib_dev); +} + +#endif /* MTHCA_DEV_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_doorbell.h 2004-12-27 21:48:21.567791525 -0800 @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
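Since the doorbell header below provides three different mthca_write64() implementations, it may be worth spelling out the one calling convention they all share before the code: the caller builds the two big-endian words and passes whatever MTHCA_GET_DOORBELL_LOCK() yields -- NULL on 64-bit and SSE builds (where the store is a single atomic 64-bit write), or the real spinlock in the fallback case. A sketch of a send doorbell ring (the two word values are placeholders):

        u32 doorbell[2];

        doorbell[0] = cpu_to_be32(first_word);    /* e.g. opcode/WQE address bits */
        doorbell[1] = cpu_to_be32(second_word);   /* e.g. QPN/size bits */

        mthca_write64(doorbell,
                      dev->kar + MTHCA_SEND_DOORBELL,
                      MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock));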
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_doorbell.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#define MTHCA_RD_DOORBELL 0x00 +#define MTHCA_SEND_DOORBELL 0x10 +#define MTHCA_RECEIVE_DOORBELL 0x18 +#define MTHCA_CQ_DOORBELL 0x20 +#define MTHCA_EQ_DOORBELL 0x28 + +#if BITS_PER_LONG == 64 +/* + * Assume that we can just write a 64-bit doorbell atomically. s390 + * actually doesn't have writeq() but S/390 systems don't even have + * PCI so we won't worry about it. + */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + __raw_writeq(*(u64 *) val, dest); +} + +#elif defined(CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL) +/* Use SSE to write 64 bits atomically without a lock. */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) +#define MTHCA_INIT_DOORBELL_LOCK(ptr) do { } while (0) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (NULL) + +static inline unsigned long mthca_get_fpu(void) +{ + unsigned long cr0; + + preempt_disable(); + asm volatile("mov %%cr0,%0; clts" : "=r" (cr0)); + return cr0; +} + +static inline void mthca_put_fpu(unsigned long cr0) +{ + asm volatile("mov %0,%%cr0" : : "r" (cr0)); + preempt_enable(); +} + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + /* i386 stack is aligned to 8 bytes, so this should be OK: */ + u8 xmmsave[8] __attribute__((aligned(8))); + unsigned long cr0; + + cr0 = mthca_get_fpu(); + + asm volatile ( + "movlps %%xmm0,(%0); \n\t" + "movlps (%1),%%xmm0; \n\t" + "movlps %%xmm0,(%2); \n\t" + "movlps (%0),%%xmm0; \n\t" + : + : "r" (xmmsave), "r" (val), "r" (dest) + : "memory" ); + + mthca_put_fpu(cr0); +} + +#else +/* Just fall back to a spinlock to protect the doorbell */ + +#define MTHCA_DECLARE_DOORBELL_LOCK(name) spinlock_t name; +#define MTHCA_INIT_DOORBELL_LOCK(ptr) spin_lock_init(ptr) +#define MTHCA_GET_DOORBELL_LOCK(ptr) (ptr) + +static inline void mthca_write64(u32 val[2], void __iomem *dest, + spinlock_t *doorbell_lock) +{ + unsigned long flags; + + spin_lock_irqsave(doorbell_lock, flags); + __raw_writel(val[0], dest); + __raw_writel(val[1], dest + 4); + spin_unlock_irqrestore(doorbell_lock, flags); +} + +#endif --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_main.c 2004-12-27 21:48:21.623783283 -0800 @@ -0,0 +1,936 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_main.c 1396 2004-12-28 04:10:27Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include + +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL +#include +#endif + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" +#include "mthca_profile.h" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +#ifdef CONFIG_PCI_MSI + +static int msi_x = 0; +module_param(msi_x, int, 0444); +MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero"); + +static int msi = 0; +module_param(msi, int, 0444); +MODULE_PARM_DESC(msi, "attempt to use MSI if nonzero"); + +#else /* CONFIG_PCI_MSI */ + +#define msi_x (0) +#define msi (0) + +#endif /* CONFIG_PCI_MSI */ + +static const char mthca_version[] __devinitdata = + "ib_mthca: Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +static int __devinit mthca_tune_pci(struct mthca_dev *mdev) +{ + int cap; + u16 val; + + /* First try to max out Read Byte Count */ + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_PCIX); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_X_CMD, &val)) { + mthca_err(mdev, "Couldn't read PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_X_CMD_MAX_READ) | (3 << 2); + if (pci_write_config_word(mdev->pdev, cap + PCI_X_CMD, val)) { + mthca_err(mdev, "Couldn't write PCI-X command register, " + "aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == TAVOR) + mthca_info(mdev, "No PCI-X capability, not setting RBC.\n"); + + cap = pci_find_capability(mdev->pdev, PCI_CAP_ID_EXP); + if (cap) { + if (pci_read_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, &val)) { + mthca_err(mdev, "Couldn't read PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12); + if (pci_write_config_word(mdev->pdev, cap + PCI_EXP_DEVCTL, val)) { + mthca_err(mdev, "Couldn't write PCI Express device control " + "register, aborting.\n"); + return -ENODEV; + } + } else if (mdev->hca_type == ARBEL_NATIVE || + mdev->hca_type == ARBEL_COMPAT) + mthca_info(mdev, "No PCI Express capability, " + "not setting Max Read Request Size.\n"); + + return 0; +} + +static int __devinit mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim) +{ + int err; + u8 status; + + err = mthca_QUERY_DEV_LIM(mdev, dev_lim, &status); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_DEV_LIM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + if (dev_lim->min_page_sz > PAGE_SIZE) { + mthca_err(mdev, "HCA minimum page size of %d bigger than " + 
"kernel PAGE_SIZE of %ld, aborting.\n", + dev_lim->min_page_sz, PAGE_SIZE); + return -ENODEV; + } + if (dev_lim->num_ports > MTHCA_MAX_PORTS) { + mthca_err(mdev, "HCA has %d ports, but we only support %d, " + "aborting.\n", + dev_lim->num_ports, MTHCA_MAX_PORTS); + return -ENODEV; + } + + mdev->limits.num_ports = dev_lim->num_ports; + mdev->limits.vl_cap = dev_lim->max_vl; + mdev->limits.mtu_cap = dev_lim->max_mtu; + mdev->limits.gid_table_len = dev_lim->max_gids; + mdev->limits.pkey_table_len = dev_lim->max_pkeys; + mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay; + mdev->limits.max_sg = dev_lim->max_sg; + mdev->limits.reserved_qps = dev_lim->reserved_qps; + mdev->limits.reserved_srqs = dev_lim->reserved_srqs; + mdev->limits.reserved_eecs = dev_lim->reserved_eecs; + mdev->limits.reserved_cqs = dev_lim->reserved_cqs; + mdev->limits.reserved_eqs = dev_lim->reserved_eqs; + mdev->limits.reserved_mtts = dev_lim->reserved_mtts; + mdev->limits.reserved_mrws = dev_lim->reserved_mrws; + mdev->limits.reserved_uars = dev_lim->reserved_uars; + mdev->limits.reserved_pds = dev_lim->reserved_pds; + + if (dev_lim->flags & DEV_LIM_FLAG_SRQ) + mdev->mthca_flags |= MTHCA_FLAG_SRQ; + + return 0; +} + +static int __devinit mthca_init_tavor(struct mthca_dev *mdev) +{ + u8 status; + int err; + struct mthca_dev_lim dev_lim; + struct mthca_init_hca_param init_hca; + struct mthca_adapter adapter; + + err = mthca_SYS_EN(mdev, &status); + if (err) { + mthca_err(mdev, "SYS_EN command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "SYS_EN returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + err = mthca_QUERY_DDR(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_DDR command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_DDR returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_dev_lim(mdev, &dev_lim); + + err = mthca_make_profile(mdev, &dev_lim, &init_hca); + if (err) + goto err_out_disable; + + err = mthca_INIT_HCA(mdev, &init_hca, &status); + if (err) { + mthca_err(mdev, "INIT_HCA command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "INIT_HCA returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_disable; + } + + err = mthca_QUERY_ADAPTER(mdev, &adapter, &status); + if (err) { + mthca_err(mdev, "QUERY_ADAPTER command failed, aborting.\n"); + goto err_out_disable; + } + if (status) { + mthca_err(mdev, "QUERY_ADAPTER returned status 0x%02x, " + "aborting.\n", status); + err = -EINVAL; + goto err_out_close; + } + + mdev->eq_table.inta_pin = adapter.inta_pin; + mdev->rev_id = adapter.revision_id; + + return 0; + +err_out_close: + mthca_CLOSE_HCA(mdev, 0, &status); + +err_out_disable: + mthca_SYS_DIS(mdev, &status); + + return err; +} + +static int __devinit mthca_load_fw(struct mthca_dev *mdev) +{ + u8 status; + int err; + int num_ent, num_sg, fw_pages, cur_order; + int i; + + /* FIXME: use HCA-attached memory for FW if present */ + + mdev->fw.arbel.mem = kmalloc(sizeof *mdev->fw.arbel.mem * + mdev->fw.arbel.fw_pages, + GFP_KERNEL); + if (!mdev->fw.arbel.mem) { + mthca_err(mdev, "Couldn't allocate FW area, 
aborting.\n"); + return -ENOMEM; + } + + memset(mdev->fw.arbel.mem, 0, + sizeof *mdev->fw.arbel.mem * mdev->fw.arbel.fw_pages); + + fw_pages = mdev->fw.arbel.fw_pages; + num_ent = 0; + + /* + * We allocate in as big chunks as we can, up to a maximum of + * 256 KB per chunk. + */ + cur_order = get_order(1 << 18); + + while (fw_pages > 0) { + while (1 << cur_order > fw_pages) + --cur_order; + + /* + * We allocate with GFP_HIGHUSER because only the + * firmware is going to touch these pages, so there's + * no need for a kernel virtual address. We use + * __GFP_NOWARN because we'll deal with any allocation + * failures ourselves. + */ + mdev->fw.arbel.mem[num_ent].page = alloc_pages(GFP_HIGHUSER | __GFP_NOWARN, + cur_order); + mdev->fw.arbel.mem[num_ent].length = PAGE_SIZE << cur_order; + if (!mdev->fw.arbel.mem[num_ent].page) { + --cur_order; + if (cur_order < 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + } else { + ++num_ent; + fw_pages -= 1 << cur_order; + } + } + + num_sg = pci_map_sg(mdev->pdev, mdev->fw.arbel.mem, num_ent, + PCI_DMA_BIDIRECTIONAL); + if (num_sg <= 0) { + mthca_err(mdev, "Couldn't allocate FW area, aborting.\n"); + err = -ENOMEM; + goto err_free; + } + + err = mthca_MAP_FA(mdev, num_sg, mdev->fw.arbel.mem, &status); + if (err) { + mthca_err(mdev, "MAP_FA command failed, aborting.\n"); + goto err_unmap; + } + if (status) { + mthca_err(mdev, "MAP_FA returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap; + } + err = mthca_RUN_FW(mdev, &status); + if (err) { + mthca_err(mdev, "RUN_FW command failed, aborting.\n"); + goto err_unmap_fa; + } + if (status) { + mthca_err(mdev, "RUN_FW returned status 0x%02x, aborting.\n", status); + err = -EINVAL; + goto err_unmap_fa; + } + + return 0; + +err_unmap_fa: + mthca_UNMAP_FA(mdev, &status); + +err_unmap: + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); +err_free: + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_pages(mdev->fw.arbel.mem[i].page, + get_order(mdev->fw.arbel.mem[i].length)); + kfree(mdev->fw.arbel.mem); + return err; +} + +static int __devinit mthca_init_arbel(struct mthca_dev *mdev) +{ + struct mthca_dev_lim dev_lim; + u8 status; + int err; + + err = mthca_QUERY_FW(mdev, &status); + if (err) { + mthca_err(mdev, "QUERY_FW command failed, aborting.\n"); + return err; + } + if (status) { + mthca_err(mdev, "QUERY_FW returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_ENABLE_LAM(mdev, &status); + if (err) { + mthca_err(mdev, "ENABLE_LAM command failed, aborting.\n"); + return err; + } + if (status == MTHCA_CMD_STAT_LAM_NOT_PRE) { + mthca_dbg(mdev, "No HCA-attached memory (running in MemFree mode)\n"); + mdev->mthca_flags |= MTHCA_FLAG_NO_LAM; + } else if (status) { + mthca_err(mdev, "ENABLE_LAM returned status 0x%02x, " + "aborting.\n", status); + return -EINVAL; + } + + err = mthca_load_fw(mdev); + if (err) { + mthca_err(mdev, "Failed to start FW, aborting.\n"); + goto err_out_disable; + } + + err = mthca_dev_lim(mdev, &dev_lim); + if (err) { + mthca_err(mdev, "QUERY_DEV_LIM command failed, aborting.\n"); + goto err_out_disable; + } + + mthca_warn(mdev, "Sorry, native MT25208 mode support is not done, " + "aborting.\n"); + err = -ENODEV; + +err_out_disable: + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + return err; +} + +static int __devinit mthca_init_hca(struct mthca_dev 
*mdev) +{ + if (mdev->hca_type == ARBEL_NATIVE) + return mthca_init_arbel(mdev); + else + return mthca_init_tavor(mdev); +} + +static int __devinit mthca_setup_hca(struct mthca_dev *dev) +{ + int err; + + MTHCA_INIT_DOORBELL_LOCK(&dev->doorbell_lock); + + err = mthca_init_pd_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "protection domain table, aborting.\n"); + return err; + } + + err = mthca_init_mr_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "memory region table, aborting.\n"); + goto err_out_pd_table_free; + } + + err = mthca_pd_alloc(dev, &dev->driver_pd); + if (err) { + mthca_err(dev, "Failed to create driver PD, " + "aborting.\n"); + goto err_out_mr_table_free; + } + + err = mthca_init_eq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "event queue table, aborting.\n"); + goto err_out_pd_free; + } + + err = mthca_cmd_use_events(dev); + if (err) { + mthca_err(dev, "Failed to switch to event-driven " + "firmware commands, aborting.\n"); + goto err_out_eq_table_free; + } + + err = mthca_init_cq_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "completion queue table, aborting.\n"); + goto err_out_cmd_poll; + } + + err = mthca_init_qp_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "queue pair table, aborting.\n"); + goto err_out_cq_table_free; + } + + err = mthca_init_av_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "address vector table, aborting.\n"); + goto err_out_qp_table_free; + } + + err = mthca_init_mcg_table(dev); + if (err) { + mthca_err(dev, "Failed to initialize " + "multicast group table, aborting.\n"); + goto err_out_av_table_free; + } + + return 0; + +err_out_av_table_free: + mthca_cleanup_av_table(dev); + +err_out_qp_table_free: + mthca_cleanup_qp_table(dev); + +err_out_cq_table_free: + mthca_cleanup_cq_table(dev); + +err_out_cmd_poll: + mthca_cmd_use_polling(dev); + +err_out_eq_table_free: + mthca_cleanup_eq_table(dev); + +err_out_pd_free: + mthca_pd_free(dev, &dev->driver_pd); + +err_out_mr_table_free: + mthca_cleanup_mr_table(dev); + +err_out_pd_table_free: + mthca_cleanup_pd_table(dev); + return err; +} + +static int __devinit mthca_request_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + int err; + + /* + * We request our first BAR in two chunks, since the MSI-X + * vector table is right in the middle. + * + * This is why we can't just use pci_request_regions() -- if + * we did then setting up MSI-X would fail, since the PCI core + * wants to do request_mem_region on the MSI-X vector table. 
+ */ + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE, + DRV_NAME)) + return -EBUSY; + + if (!request_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE, + DRV_NAME)) { + err = -EBUSY; + goto err_out_bar0_beg; + } + + err = pci_request_region(pdev, 2, DRV_NAME); + if (err) + goto err_out_bar0_end; + + if (!ddr_hidden) { + err = pci_request_region(pdev, 4, DRV_NAME); + if (err) + goto err_out_bar2; + } + + return 0; + +err_out_bar0_beg: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + +err_out_bar0_end: + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + +err_out_bar2: + pci_release_region(pdev, 2); + return err; +} + +static void mthca_release_regions(struct pci_dev *pdev, + int ddr_hidden) +{ + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_HCR_BASE, + MTHCA_MAP_HCR_SIZE); + release_mem_region(pci_resource_start(pdev, 0) + + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + pci_release_region(pdev, 2); + if (!ddr_hidden) + pci_release_region(pdev, 4); +} + +static int __devinit mthca_enable_msi_x(struct mthca_dev *mdev) +{ + struct msix_entry entries[3]; + int err; + + entries[0].entry = 0; + entries[1].entry = 1; + entries[2].entry = 2; + + err = pci_enable_msix(mdev->pdev, entries, ARRAY_SIZE(entries)); + if (err) { + if (err > 0) + mthca_info(mdev, "Only %d MSI-X vectors available, " + "not using MSI-X\n", err); + return err; + } + + mdev->eq_table.eq[MTHCA_EQ_COMP ].msi_x_vector = entries[0].vector; + mdev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector = entries[1].vector; + mdev->eq_table.eq[MTHCA_EQ_CMD ].msi_x_vector = entries[2].vector; + + return 0; +} + +static void mthca_close_hca(struct mthca_dev *mdev) +{ + u8 status; + int i; + + mthca_CLOSE_HCA(mdev, 0, &status); + + if (mdev->hca_type == ARBEL_NATIVE) { + mthca_UNMAP_FA(mdev, &status); + + pci_unmap_sg(mdev->pdev, mdev->fw.arbel.mem, + mdev->fw.arbel.fw_pages, PCI_DMA_BIDIRECTIONAL); + + for (i = 0; i < mdev->fw.arbel.fw_pages; ++i) + if (mdev->fw.arbel.mem[i].page) + __free_pages(mdev->fw.arbel.mem[i].page, + get_order(mdev->fw.arbel.mem[i].length)); + kfree(mdev->fw.arbel.mem); + + if (!(mdev->mthca_flags & MTHCA_FLAG_NO_LAM)) + mthca_DISABLE_LAM(mdev, &status); + } else + mthca_SYS_DIS(mdev, &status); +} + +static int __devinit mthca_init_one(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + static int mthca_version_printed = 0; + int ddr_hidden = 0; + int err; + unsigned long mthca_base; + struct mthca_dev *mdev; + + if (!mthca_version_printed) { + printk(KERN_INFO "%s", mthca_version); + ++mthca_version_printed; + } + + printk(KERN_INFO PFX "Initializing %s (%s)\n", + pci_pretty_name(pdev), pci_name(pdev)); + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "Cannot enable PCI device, " + "aborting.\n"); + return err; + } + + /* + * Check for BARs. 
We expect 0: 1MB, 2: 8MB, 4: DDR (may not + * be present) + */ + if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) || + pci_resource_len(pdev, 0) != 1 << 20) { + dev_err(&pdev->dev, "Missing DCS, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 2) & IORESOURCE_MEM) || + pci_resource_len(pdev, 2) != 1 << 23) { + dev_err(&pdev->dev, "Missing UAR, aborting."); + err = -ENODEV; + goto err_out_disable_pdev; + } + if (!(pci_resource_flags(pdev, 4) & IORESOURCE_MEM)) + ddr_hidden = 1; + + err = mthca_request_regions(pdev, ddr_hidden); + if (err) { + dev_err(&pdev->dev, "Cannot obtain PCI resources, " + "aborting.\n"); + goto err_out_disable_pdev; + } + + pci_set_master(pdev); + + err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit PCI DMA mask.\n"); + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set PCI DMA mask, aborting.\n"); + goto err_out_free_res; + } + } + err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); + if (err) { + dev_warn(&pdev->dev, "Warning: couldn't set 64-bit " + "consistent PCI DMA mask.\n"); + err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); + if (err) { + dev_err(&pdev->dev, "Can't set consistent PCI DMA mask, " + "aborting.\n"); + goto err_out_free_res; + } + } + + mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev); + if (!mdev) { + dev_err(&pdev->dev, "Device struct alloc failed, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_res; + } + + mdev->pdev = pdev; + mdev->hca_type = id->driver_data; + + if (ddr_hidden) + mdev->mthca_flags |= MTHCA_FLAG_DDR_HIDDEN; + + /* + * Now reset the HCA before we touch the PCI capabilities or + * attempt a firmware command, since a boot ROM may have left + * the HCA in an undefined state. 
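As an aside on the probe sequence above: the DMA-mask setup follows the usual try-64-bit-then-fall-back-to-32-bit pattern, using the same 2.6-era pci_set_dma_mask()/pci_set_consistent_dma_mask() calls the patch itself uses. A stand-alone sketch of just that pattern -- the helper name example_set_dma_masks is illustrative only, not part of the patch -- would be:

static int example_set_dma_masks(struct pci_dev *pdev)
{
        int err;

        err = pci_set_dma_mask(pdev, DMA_64BIT_MASK);
        if (err) {
                /* no 64-bit streaming DMA; settle for 32 bits */
                err = pci_set_dma_mask(pdev, DMA_32BIT_MASK);
                if (err)
                        return err;
        }

        err = pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK);
        if (err)
                /* likewise for coherent (descriptor) memory */
                err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK);

        return err;
}

The probe path then continues below with the HCA reset that the preceding comment motivates.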
+ */ + err = mthca_reset(mdev); + if (err) { + mthca_err(mdev, "Failed to reset HCA, aborting.\n"); + goto err_out_free_dev; + } + + if (msi_x && !mthca_enable_msi_x(mdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI_X; + if (msi && !(mdev->mthca_flags & MTHCA_FLAG_MSI_X) && + !pci_enable_msi(pdev)) + mdev->mthca_flags |= MTHCA_FLAG_MSI; + + sema_init(&mdev->cmd.hcr_sem, 1); + sema_init(&mdev->cmd.poll_sem, 1); + mdev->cmd.use_events = 0; + + mthca_base = pci_resource_start(pdev, 0); + mdev->hcr = ioremap(mthca_base + MTHCA_HCR_BASE, MTHCA_MAP_HCR_SIZE); + if (!mdev->hcr) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_free_dev; + } + mdev->clr_base = ioremap(mthca_base + MTHCA_CLR_INT_BASE, + MTHCA_CLR_INT_SIZE); + if (!mdev->clr_base) { + mthca_err(mdev, "Couldn't map command register, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap; + } + + mthca_base = pci_resource_start(pdev, 2); + mdev->kar = ioremap(mthca_base + PAGE_SIZE * MTHCA_KAR_PAGE, PAGE_SIZE); + if (!mdev->kar) { + mthca_err(mdev, "Couldn't map kernel access region, " + "aborting.\n"); + err = -ENOMEM; + goto err_out_iounmap_clr; + } + + err = mthca_tune_pci(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_init_hca(mdev); + if (err) + goto err_out_iounmap_kar; + + err = mthca_setup_hca(mdev); + if (err) + goto err_out_close; + + err = mthca_register_device(mdev); + if (err) + goto err_out_cleanup; + + err = mthca_create_agents(mdev); + if (err) + goto err_out_unregister; + + pci_set_drvdata(pdev, mdev); + + return 0; + +err_out_unregister: + mthca_unregister_device(mdev); + +err_out_cleanup: + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + +err_out_close: + mthca_close_hca(mdev); + +err_out_iounmap_kar: + iounmap(mdev->kar); + +err_out_iounmap_clr: + iounmap(mdev->clr_base); + +err_out_iounmap: + iounmap(mdev->hcr); + +err_out_free_dev: + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + +err_out_free_res: + mthca_release_regions(pdev, ddr_hidden); + +err_out_disable_pdev: + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + return err; +} + +static void __devexit mthca_remove_one(struct pci_dev *pdev) +{ + struct mthca_dev *mdev = pci_get_drvdata(pdev); + u8 status; + int p; + + if (mdev) { + mthca_free_agents(mdev); + mthca_unregister_device(mdev); + + for (p = 1; p <= mdev->limits.num_ports; ++p) + mthca_CLOSE_IB(mdev, p, &status); + + mthca_cleanup_mcg_table(mdev); + mthca_cleanup_av_table(mdev); + mthca_cleanup_qp_table(mdev); + mthca_cleanup_cq_table(mdev); + mthca_cmd_use_polling(mdev); + mthca_cleanup_eq_table(mdev); + + mthca_pd_free(mdev, &mdev->driver_pd); + + mthca_cleanup_mr_table(mdev); + mthca_cleanup_pd_table(mdev); + + mthca_close_hca(mdev); + + iounmap(mdev->hcr); + iounmap(mdev->clr_base); + + if (mdev->mthca_flags & MTHCA_FLAG_MSI_X) + pci_disable_msix(pdev); + if (mdev->mthca_flags & MTHCA_FLAG_MSI) + pci_disable_msi(pdev); + + ib_dealloc_device(&mdev->ib_dev); + mthca_release_regions(pdev, mdev->mthca_flags & + MTHCA_FLAG_DDR_HIDDEN); + pci_disable_device(pdev); + pci_set_drvdata(pdev, NULL); + } +} + +static struct pci_device_id mthca_pci_table[] = 
{ + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_TAVOR), + .driver_data = TAVOR }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL_COMPAT), + .driver_data = ARBEL_COMPAT }, + { PCI_DEVICE(PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { PCI_DEVICE(PCI_VENDOR_ID_TOPSPIN, PCI_DEVICE_ID_MELLANOX_ARBEL), + .driver_data = ARBEL_NATIVE }, + { 0, } +}; + +MODULE_DEVICE_TABLE(pci, mthca_pci_table); + +static struct pci_driver mthca_driver = { + .name = "ib_mthca", + .id_table = mthca_pci_table, + .probe = mthca_init_one, + .remove = __devexit_p(mthca_remove_one) +}; + +static int __init mthca_init(void) +{ + int ret; + + /* + * TODO: measure whether dynamically choosing doorbell code at + * runtime affects our performance. Is there a "magic" way to + * choose without having to follow a function pointer every + * time we ring a doorbell? + */ +#ifdef CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL + if (!cpu_has_xmm) { + printk(KERN_ERR PFX "mthca was compiled with SSE doorbell code, but\n"); + printk(KERN_ERR PFX "the current CPU does not support SSE.\n"); + printk(KERN_ERR PFX "Turn off CONFIG_INFINIBAND_MTHCA_SSE_DOORBELL " + "and recompile.\n"); + return -ENODEV; + } +#endif + + ret = pci_register_driver(&mthca_driver); + return ret < 0 ? ret : 0; +} + +static void __exit mthca_cleanup(void) +{ + pci_unregister_driver(&mthca_driver); +} + +module_init(mthca_init); +module_exit(mthca_cleanup); From roland at topspin.com Mon Dec 27 21:51:04 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:04 -0800 Subject: [openib-general] [PATCH][v5][10/24] Add Mellanox HCA low-level driver (midlayer interface) In-Reply-To: <200412272151.HKpCNDFOMg5KO6kB@topspin.com> Message-ID: <200412272151.5KRaOFYNt0hYYQgh@topspin.com> Add midlayer interface code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.c 2004-12-27 21:48:22.043721469 -0800 @@ -0,0 +1,627 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_provider.c 1397 2004-12-28 05:09:00Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +static int mthca_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + props->fw_ver = to_mdev(ibdev)->fw_ver; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->base_version = 1; + in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->class_version = 1; + in_mad->method = IB_MGMT_METHOD_GET; + in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + 1, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->vendor_id = be32_to_cpup((u32 *) (out_mad->data + 36)) & + 0xffffff; + props->vendor_part_id = be16_to_cpup((u16 *) (out_mad->data + 30)); + props->hw_ver = be16_to_cpup((u16 *) (out_mad->data + 32)); + memcpy(&props->sys_image_guid, out_mad->data + 4, 8); + memcpy(&props->node_guid, out_mad->data + 12, 8); + + err = 0; + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->base_version = 1; + in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->class_version = 1; + in_mad->method = IB_MGMT_METHOD_GET; + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + props->lid = be16_to_cpup((u16 *) (out_mad->data + 16)); + props->lmc = out_mad->data[34] & 0x7; + props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 18)); + props->sm_sl = out_mad->data[36] & 0xf; + props->state = out_mad->data[32] & 0xf; + props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 20)); + props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; + props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; + props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 48)); + props->active_width = out_mad->data[31] & 0xf; + props->active_speed = out_mad->data[35] >> 4; + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return 0; +} + +static int mthca_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 *pkey) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->base_version = 1; + in_mad->mgmt_class = 
IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->class_version = 1; + in_mad->method = IB_MGMT_METHOD_GET; + in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE; + in_mad->attr_mod = cpu_to_be32(index / 32); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + *pkey = be16_to_cpu(((u16 *) out_mad->data)[index % 32]); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static int mthca_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + memset(in_mad, 0, sizeof *in_mad); + in_mad->base_version = 1; + in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->class_version = 1; + in_mad->method = IB_MGMT_METHOD_GET; + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw, out_mad->data + 8, 8); + + memset(in_mad, 0, sizeof *in_mad); + in_mad->base_version = 1; + in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + in_mad->class_version = 1; + in_mad->method = IB_MGMT_METHOD_GET; + in_mad->attr_id = IB_SMP_ATTR_GUID_INFO; + in_mad->attr_mod = cpu_to_be32(index / 8); + + err = mthca_MAD_IFC(to_mdev(ibdev), 1, + port, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(gid->raw + 8, out_mad->data + (index % 8) * 16, 8); + + out: + kfree(in_mad); + kfree(out_mad); + return err; +} + +static struct ib_pd *mthca_alloc_pd(struct ib_device *ibdev) +{ + struct mthca_pd *pd; + int err; + + pd = kmalloc(sizeof *pd, GFP_KERNEL); + if (!pd) + return ERR_PTR(-ENOMEM); + + err = mthca_pd_alloc(to_mdev(ibdev), pd); + if (err) { + kfree(pd); + return ERR_PTR(err); + } + + return &pd->ibpd; +} + +static int mthca_dealloc_pd(struct ib_pd *pd) +{ + mthca_pd_free(to_mdev(pd->device), to_mpd(pd)); + kfree(pd); + + return 0; +} + +static struct ib_ah *mthca_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + int err; + struct mthca_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return ERR_PTR(-ENOMEM); + + err = mthca_create_ah(to_mdev(pd->device), to_mpd(pd), ah_attr, ah); + if (err) { + kfree(ah); + return ERR_PTR(err); + } + + return &ah->ibah; +} + +static int mthca_ah_destroy(struct ib_ah *ah) +{ + mthca_destroy_ah(to_mdev(ah->device), to_mah(ah)); + kfree(ah); + + return 0; +} + +static struct ib_qp *mthca_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *init_attr) +{ + struct mthca_qp *qp; + int err; + + switch (init_attr->qp_type) { + case IB_QPT_RC: + case IB_QPT_UC: + case IB_QPT_UD: + { + qp = kmalloc(sizeof *qp, GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + err = mthca_alloc_qp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->qp_type, init_attr->sq_sig_type, + init_attr->rq_sig_type, qp); + qp->ibqp.qp_num = qp->qpn; + break; + } + case IB_QPT_SMI: + case IB_QPT_GSI: + { + qp = kmalloc(sizeof (struct mthca_sqp), 
GFP_KERNEL); + if (!qp) + return ERR_PTR(-ENOMEM); + + qp->sq.max = init_attr->cap.max_send_wr; + qp->rq.max = init_attr->cap.max_recv_wr; + qp->sq.max_gs = init_attr->cap.max_send_sge; + qp->rq.max_gs = init_attr->cap.max_recv_sge; + + qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1; + + err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd), + to_mcq(init_attr->send_cq), + to_mcq(init_attr->recv_cq), + init_attr->sq_sig_type, init_attr->rq_sig_type, + qp->ibqp.qp_num, init_attr->port_num, + to_msqp(qp)); + break; + } + default: + /* Don't support raw QPs */ + return ERR_PTR(-ENOSYS); + } + + if (err) { + kfree(qp); + return ERR_PTR(err); + } + + init_attr->cap.max_inline_data = 0; + + return &qp->ibqp; +} + +static int mthca_destroy_qp(struct ib_qp *qp) +{ + mthca_free_qp(to_mdev(qp->device), to_mqp(qp)); + kfree(qp); + return 0; +} + +static struct ib_cq *mthca_create_cq(struct ib_device *ibdev, int entries) +{ + struct mthca_cq *cq; + int nent; + int err; + + cq = kmalloc(sizeof *cq, GFP_KERNEL); + if (!cq) + return ERR_PTR(-ENOMEM); + + for (nent = 1; nent <= entries; nent <<= 1) + ; /* nothing */ + + err = mthca_init_cq(to_mdev(ibdev), nent, cq); + if (err) { + kfree(cq); + cq = ERR_PTR(err); + } else + cq->ibcq.cqe = nent - 1; + + return &cq->ibcq; +} + +static int mthca_destroy_cq(struct ib_cq *cq) +{ + mthca_free_cq(to_mdev(cq->device), to_mcq(cq)); + kfree(cq); + + return 0; +} + +static int mthca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify notify) +{ + mthca_arm_cq(to_mdev(cq->device), to_mcq(cq), + notify == IB_CQ_SOLICITED); + return 0; +} + +static inline u32 convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_ATOMIC ? MTHCA_MPT_FLAG_ATOMIC : 0) | + (acc & IB_ACCESS_REMOTE_WRITE ? MTHCA_MPT_FLAG_REMOTE_WRITE : 0) | + (acc & IB_ACCESS_REMOTE_READ ? MTHCA_MPT_FLAG_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? 
MTHCA_MPT_FLAG_LOCAL_WRITE : 0) | + MTHCA_MPT_FLAG_LOCAL_READ; +} + +static struct ib_mr *mthca_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct mthca_mr *mr; + int err; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + err = mthca_mr_alloc_notrans(to_mdev(pd->device), + to_mpd(pd)->pd_num, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + return &mr->ibmr; +} + +static struct ib_mr *mthca_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + struct mthca_mr *mr; + u64 *page_list; + u64 total_size; + u64 mask; + int shift; + int npages; + int err; + int i, j, n; + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + mask = 0; + total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (buffer_list[i].addr & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return ERR_PTR(-EINVAL); + + total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + /* Find largest page shift we can use to cover buffers */ + for (shift = PAGE_SHIFT; shift < 31; ++shift) + if (num_phys_buf > 1) { + if ((1ULL << shift) & mask) + break; + } else { + if (1ULL << shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << shift) - 1); + buffer_list[0].addr &= ~0ull << shift; + + mr = kmalloc(sizeof *mr, GFP_KERNEL); + if (!mr) + return ERR_PTR(-ENOMEM); + + npages = 0; + for (i = 0; i < num_phys_buf; ++i) + npages += (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + + if (!npages) + return &mr->ibmr; + + page_list = kmalloc(npages * sizeof *page_list, GFP_KERNEL); + if (!page_list) { + kfree(mr); + return ERR_PTR(-ENOMEM); + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << shift) - 1) >> shift; + ++j) + page_list[n++] = buffer_list[i].addr + ((u64) j << shift); + + mthca_dbg(to_mdev(pd->device), "Registering memory at %llx (iova %llx) " + "in PD %x; shift %d, npages %d.\n", + (unsigned long long) buffer_list[0].addr, + (unsigned long long) *iova_start, + to_mpd(pd)->pd_num, + shift, npages); + + err = mthca_mr_alloc_phys(to_mdev(pd->device), + to_mpd(pd)->pd_num, + page_list, shift, npages, + *iova_start, total_size, + convert_access(acc), mr); + + if (err) { + kfree(mr); + return ERR_PTR(err); + } + + kfree(page_list); + return &mr->ibmr; +} + +static int mthca_dereg_mr(struct ib_mr *mr) +{ + mthca_free_mr(to_mdev(mr->device), to_mmr(mr)); + kfree(mr); + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x\n", dev->rev_id); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + return sprintf(buf, "%x.%x.%x\n", (int) (dev->fw_ver >> 32), + (int) (dev->fw_ver >> 16) & 0xffff, + (int) dev->fw_ver & 0xffff); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct mthca_dev *dev = container_of(cdev, struct mthca_dev, ib_dev.class_dev); + switch (dev->hca_type) { + 
case TAVOR: return sprintf(buf, "MT23108\n"); + case ARBEL_COMPAT: return sprintf(buf, "MT25208 (MT23108 compat mode)\n"); + case ARBEL_NATIVE: return sprintf(buf, "MT25208\n"); + default: return sprintf(buf, "unknown\n"); + } +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); + +static struct class_device_attribute *mthca_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type +}; + +int mthca_register_device(struct mthca_dev *dev) +{ + int ret; + int i; + + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); + dev->ib_dev.node_type = IB_NODE_CA; + dev->ib_dev.phys_port_cnt = dev->limits.num_ports; + dev->ib_dev.dma_device = &dev->pdev->dev; + dev->ib_dev.class_dev.dev = &dev->pdev->dev; + dev->ib_dev.query_device = mthca_query_device; + dev->ib_dev.query_port = mthca_query_port; + dev->ib_dev.modify_port = mthca_modify_port; + dev->ib_dev.query_pkey = mthca_query_pkey; + dev->ib_dev.query_gid = mthca_query_gid; + dev->ib_dev.alloc_pd = mthca_alloc_pd; + dev->ib_dev.dealloc_pd = mthca_dealloc_pd; + dev->ib_dev.create_ah = mthca_ah_create; + dev->ib_dev.destroy_ah = mthca_ah_destroy; + dev->ib_dev.create_qp = mthca_create_qp; + dev->ib_dev.modify_qp = mthca_modify_qp; + dev->ib_dev.destroy_qp = mthca_destroy_qp; + dev->ib_dev.post_send = mthca_post_send; + dev->ib_dev.post_recv = mthca_post_receive; + dev->ib_dev.create_cq = mthca_create_cq; + dev->ib_dev.destroy_cq = mthca_destroy_cq; + dev->ib_dev.poll_cq = mthca_poll_cq; + dev->ib_dev.req_notify_cq = mthca_req_notify_cq; + dev->ib_dev.get_dma_mr = mthca_get_dma_mr; + dev->ib_dev.reg_phys_mr = mthca_reg_phys_mr; + dev->ib_dev.dereg_mr = mthca_dereg_mr; + dev->ib_dev.attach_mcast = mthca_multicast_attach; + dev->ib_dev.detach_mcast = mthca_multicast_detach; + dev->ib_dev.process_mad = mthca_process_mad; + + ret = ib_register_device(&dev->ib_dev); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(mthca_class_attributes); ++i) { + ret = class_device_create_file(&dev->ib_dev.class_dev, + mthca_class_attributes[i]); + if (ret) { + ib_unregister_device(&dev->ib_dev); + return ret; + } + } + + return 0; +} + +void mthca_unregister_device(struct mthca_dev *dev) +{ + ib_unregister_device(&dev->ib_dev); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_provider.h 2004-12-27 21:48:22.091714405 -0800 @@ -0,0 +1,225 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_provider.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_PROVIDER_H +#define MTHCA_PROVIDER_H + +#include +#include + +#define MTHCA_MPT_FLAG_ATOMIC (1 << 14) +#define MTHCA_MPT_FLAG_REMOTE_WRITE (1 << 13) +#define MTHCA_MPT_FLAG_REMOTE_READ (1 << 12) +#define MTHCA_MPT_FLAG_LOCAL_WRITE (1 << 11) +#define MTHCA_MPT_FLAG_LOCAL_READ (1 << 10) + +struct mthca_buf_list { + void *buf; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +struct mthca_mr { + struct ib_mr ibmr; + int order; + u32 first_seg; +}; + +struct mthca_pd { + struct ib_pd ibpd; + u32 pd_num; + atomic_t sqp_count; + struct mthca_mr ntmr; +}; + +struct mthca_eq { + struct mthca_dev *dev; + int eqn; + u32 ecr_mask; + u16 msi_x_vector; + u16 msi_x_entry; + int have_irq; + int nent; + int cons_index; + struct mthca_buf_list *page_list; + struct mthca_mr mr; +}; + +struct mthca_av; + +struct mthca_ah { + struct ib_ah ibah; + int on_hca; + u32 key; + struct mthca_av *av; + dma_addr_t avdma; +}; + +/* + * Quick description of our CQ/QP locking scheme: + * + * We have one global lock that protects dev->cq/qp_table. Each + * struct mthca_cq/qp also has its own lock. An individual qp lock + * may be taken inside of an individual cq lock. Both cqs attached to + * a qp may be locked, with the send cq locked first. No other + * nesting should be done. + * + * Each struct mthca_cq/qp also has an atomic_t ref count. The + * pointer from the cq/qp_table to the struct counts as one reference. + * This reference also is good for access through the consumer API, so + * modifying the CQ/QP etc doesn't need to take another reference. + * Access because of a completion being polled does need a reference. + * + * Finally, each struct mthca_cq/qp has a wait_queue_head_t for the + * destroy function to sleep on. + * + * This means that access from the consumer API requires nothing but + * taking the struct's lock. + * + * Access because of a completion event should go as follows: + * - lock cq/qp_table and look up struct + * - increment ref count in struct + * - drop cq/qp_table lock + * - lock struct, do your thing, and unlock struct + * - decrement ref count; if zero, wake up waiters + * + * To destroy a CQ/QP, we can do the following: + * - lock cq/qp_table, remove pointer, unlock cq/qp_table lock + * - decrement ref count + * - wait_event until ref count is zero + * + * It is the consumer's responsibilty to make sure that no QP + * operations (WQE posting or state modification) are pending when the + * QP is destroyed. Also, the consumer must make sure that calls to + * qp_modify are serialized. 
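Concretely, the completion-event access pattern and the destroy-and-wait sequence described above could be sketched as follows. The table type and the lookup/remove helpers (example_cq_table, example_cq_lookup, example_cq_remove) are assumptions made up for this illustration; the lock, refcount and wait fields are the ones declared in struct mthca_cq further down, and the irqsave locking variants are omitted for brevity:

static struct mthca_cq *example_cq_get(struct example_cq_table *table, u32 cqn)
{
        struct mthca_cq *cq;

        spin_lock(&table->lock);
        cq = example_cq_lookup(table, cqn);     /* assumed lookup helper */
        if (cq)
                atomic_inc(&cq->refcount);      /* hold it across the event */
        spin_unlock(&table->lock);

        return cq;
}

static void example_cq_put(struct mthca_cq *cq)
{
        if (atomic_dec_and_test(&cq->refcount))
                wake_up(&cq->wait);             /* last reference gone: wake any destroyer */
}

static void example_cq_destroy(struct example_cq_table *table, struct mthca_cq *cq)
{
        spin_lock(&table->lock);
        example_cq_remove(table, cq);           /* assumed removal helper */
        spin_unlock(&table->lock);

        example_cq_put(cq);                     /* drop the table's reference */
        wait_event(cq->wait, !atomic_read(&cq->refcount));
}

An event handler would pair example_cq_get() with its own lock/unlock of the cq and a final example_cq_put() once it is done polling.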
+ * + * Possible optimizations (wait for profile data to see if/where we + * have locks bouncing between CPUs): + * - split cq/qp table lock into n separate (cache-aligned) locks, + * indexed (say) by the page in the table + * - split QP struct lock into three (one for common info, one for the + * send queue and one for the receive queue) + */ + +struct mthca_cq { + struct ib_cq ibcq; + spinlock_t lock; + atomic_t refcount; + int cqn; + int cons_index; + int is_direct; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + struct mthca_mr mr; + wait_queue_head_t wait; +}; + +struct mthca_wq { + int max; + int cur; + int next; + int last_comp; + void *last; + int max_gs; + int wqe_shift; + enum ib_sig_type policy; +}; + +struct mthca_qp { + struct ib_qp ibqp; + spinlock_t lock; + atomic_t refcount; + u32 qpn; + int transport; + enum ib_qp_state state; + int is_direct; + struct mthca_mr mr; + + struct mthca_wq rq; + struct mthca_wq sq; + int send_wqe_offset; + + u64 *wrid; + union { + struct mthca_buf_list direct; + struct mthca_buf_list *page_list; + } queue; + + wait_queue_head_t wait; +}; + +struct mthca_sqp { + struct mthca_qp qp; + int port; + int pkey_index; + u32 qkey; + u32 send_psn; + struct ib_ud_header ud_header; + int header_buf_size; + void *header_buf; + dma_addr_t header_dma; +}; + +static inline struct mthca_mr *to_mmr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct mthca_mr, ibmr); +} + +static inline struct mthca_pd *to_mpd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct mthca_pd, ibpd); +} + +static inline struct mthca_ah *to_mah(struct ib_ah *ibah) +{ + return container_of(ibah, struct mthca_ah, ibah); +} + +static inline struct mthca_cq *to_mcq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct mthca_cq, ibcq); +} + +static inline struct mthca_qp *to_mqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct mthca_qp, ibqp); +} + +static inline struct mthca_sqp *to_msqp(struct mthca_qp *qp) +{ + return container_of(qp, struct mthca_sqp, qp); +} + +#endif /* MTHCA_PROVIDER_H */ From roland at topspin.com Mon Dec 27 21:51:05 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:05 -0800 Subject: [openib-general] [PATCH][v5][11/24] Add Mellanox HCA low-level driver (FW commands) In-Reply-To: <200412272151.5KRaOFYNt0hYYQgh@topspin.com> Message-ID: <200412272151.6amOd1o39RpEe1KK@topspin.com> Add firmware command processing code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.c 2004-12-27 21:48:22.369673490 -0800 @@ -0,0 +1,1573 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_cmd.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_config_reg.h" +#include "mthca_cmd.h" + +#define CMD_POLL_TOKEN 0xffff + +enum { + HCR_IN_PARAM_OFFSET = 0x00, + HCR_IN_MODIFIER_OFFSET = 0x08, + HCR_OUT_PARAM_OFFSET = 0x0c, + HCR_TOKEN_OFFSET = 0x14, + HCR_STATUS_OFFSET = 0x18, + + HCR_OPMOD_SHIFT = 12, + HCA_E_BIT = 22, + HCR_GO_BIT = 23 +}; + +enum { + /* initialization and general commands */ + CMD_SYS_EN = 0x1, + CMD_SYS_DIS = 0x2, + CMD_MAP_FA = 0xfff, + CMD_UNMAP_FA = 0xffe, + CMD_RUN_FW = 0xff6, + CMD_MOD_STAT_CFG = 0x34, + CMD_QUERY_DEV_LIM = 0x3, + CMD_QUERY_FW = 0x4, + CMD_ENABLE_LAM = 0xff8, + CMD_DISABLE_LAM = 0xff7, + CMD_QUERY_DDR = 0x5, + CMD_QUERY_ADAPTER = 0x6, + CMD_INIT_HCA = 0x7, + CMD_CLOSE_HCA = 0x8, + CMD_INIT_IB = 0x9, + CMD_CLOSE_IB = 0xa, + CMD_QUERY_HCA = 0xb, + CMD_SET_IB = 0xc, + CMD_ACCESS_DDR = 0x2e, + CMD_MAP_ICM = 0xffa, + CMD_UNMAP_ICM = 0xff9, + CMD_MAP_ICM_AUX = 0xffc, + CMD_UNMAP_ICM_AUX = 0xffb, + CMD_SET_ICM_SIZE = 0xffd, + + /* TPT commands */ + CMD_SW2HW_MPT = 0xd, + CMD_QUERY_MPT = 0xe, + CMD_HW2SW_MPT = 0xf, + CMD_READ_MTT = 0x10, + CMD_WRITE_MTT = 0x11, + CMD_SYNC_TPT = 0x2f, + + /* EQ commands */ + CMD_MAP_EQ = 0x12, + CMD_SW2HW_EQ = 0x13, + CMD_HW2SW_EQ = 0x14, + CMD_QUERY_EQ = 0x15, + + /* CQ commands */ + CMD_SW2HW_CQ = 0x16, + CMD_HW2SW_CQ = 0x17, + CMD_QUERY_CQ = 0x18, + CMD_RESIZE_CQ = 0x2c, + + /* SRQ commands */ + CMD_SW2HW_SRQ = 0x35, + CMD_HW2SW_SRQ = 0x36, + CMD_QUERY_SRQ = 0x37, + + /* QP/EE commands */ + CMD_RST2INIT_QPEE = 0x19, + CMD_INIT2RTR_QPEE = 0x1a, + CMD_RTR2RTS_QPEE = 0x1b, + CMD_RTS2RTS_QPEE = 0x1c, + CMD_SQERR2RTS_QPEE = 0x1d, + CMD_2ERR_QPEE = 0x1e, + CMD_RTS2SQD_QPEE = 0x1f, + CMD_SQD2SQD_QPEE = 0x38, + CMD_SQD2RTS_QPEE = 0x20, + CMD_ERR2RST_QPEE = 0x21, + CMD_QUERY_QPEE = 0x22, + CMD_INIT2INIT_QPEE = 0x2d, + CMD_SUSPEND_QPEE = 0x32, + CMD_UNSUSPEND_QPEE = 0x33, + /* special QPs and management commands */ + CMD_CONF_SPECIAL_QP = 0x23, + CMD_MAD_IFC = 0x24, + + /* multicast commands */ + CMD_READ_MGM = 0x25, + CMD_WRITE_MGM = 0x26, + CMD_MGID_HASH = 0x27, + + /* miscellaneous commands */ + CMD_DIAG_RPRT = 0x30, + CMD_NOP = 0x31, + + /* debug commands */ + CMD_QUERY_DEBUG_MSG = 0x2a, + CMD_SET_DEBUG_MSG = 0x2b, +}; + +/* + * According to Mellanox code, FW may be starved and never complete + * commands. So we can't use strict timeouts described in PRM -- we + * just arbitrarily select 60 seconds for now. 
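For concreteness: with HZ = 1000 (a common setting on 2.6 kernels), the disabled PRM-style values below evaluate to 2, 11 and 101 jiffies for classes A, B and C -- roughly 2 ms, 11 ms and 100 ms -- whereas the values actually compiled in, 60 * HZ, are 60,000 jiffies, i.e. a full minute for every class.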
+ */ +#if 0 +/* + * Round up and add 1 to make sure we get the full wait time (since we + * will be starting in the middle of a jiffy) + */ +enum { + CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1, + CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1, + CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1 +}; +#else +enum { + CMD_TIME_CLASS_A = 60 * HZ, + CMD_TIME_CLASS_B = 60 * HZ, + CMD_TIME_CLASS_C = 60 * HZ +}; +#endif + +enum { + GO_BIT_TIMEOUT = HZ * 10 +}; + +struct mthca_cmd_context { + struct completion done; + struct timer_list timer; + int result; + int next; + u64 out_param; + u16 token; + u8 status; +}; + +static inline int go_bit(struct mthca_dev *dev) +{ + return readl(dev->hcr + HCR_STATUS_OFFSET) & + swab32(1 << HCR_GO_BIT); +} + +static int mthca_cmd_post(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + u16 token, + int event) +{ + int err = 0; + + if (down_interruptible(&dev->cmd.hcr_sem)) + return -EINTR; + + if (event) { + unsigned long end = jiffies + GO_BIT_TIMEOUT; + + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + } + + if (go_bit(dev)) { + err = -EAGAIN; + goto out; + } + + /* + * We use writel (instead of something like memcpy_toio) + * because writes of less than 32 bits to the HCR don't work + * (and some architectures such as ia64 implement memcpy_toio + * in terms of writeb). + */ + __raw_writel(cpu_to_be32(in_param >> 32), dev->hcr + 0 * 4); + __raw_writel(cpu_to_be32(in_param & 0xfffffffful), dev->hcr + 1 * 4); + __raw_writel(cpu_to_be32(in_modifier), dev->hcr + 2 * 4); + __raw_writel(cpu_to_be32(out_param >> 32), dev->hcr + 3 * 4); + __raw_writel(cpu_to_be32(out_param & 0xfffffffful), dev->hcr + 4 * 4); + __raw_writel(cpu_to_be32(token << 16), dev->hcr + 5 * 4); + + /* __raw_writel may not order writes. */ + wmb(); + + __raw_writel(cpu_to_be32((1 << HCR_GO_BIT) | + (event ? (1 << HCA_E_BIT) : 0) | + (op_modifier << HCR_OPMOD_SHIFT) | + op), dev->hcr + 6 * 4); + +out: + up(&dev->cmd.hcr_sem); + return err; +} + +static int mthca_cmd_poll(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + unsigned long end; + + if (down_interruptible(&dev->cmd.poll_sem)) + return -EINTR; + + err = mthca_cmd_post(dev, in_param, + out_param ? 
*out_param : 0, + in_modifier, op_modifier, + op, CMD_POLL_TOKEN, 0); + if (err) + goto out; + + end = timeout + jiffies; + while (go_bit(dev) && time_before(jiffies, end)) { + set_current_state(TASK_RUNNING); + schedule(); + } + + if (go_bit(dev)) { + err = -EBUSY; + goto out; + } + + if (out_is_imm) { + memcpy_fromio(out_param, dev->hcr + HCR_OUT_PARAM_OFFSET, sizeof (u64)); + be64_to_cpus(out_param); + } + + *status = be32_to_cpu(__raw_readl(dev->hcr + HCR_STATUS_OFFSET)) >> 24; + +out: + up(&dev->cmd.poll_sem); + return err; +} + +void mthca_cmd_event(struct mthca_dev *dev, + u16 token, + u8 status, + u64 out_param) +{ + struct mthca_cmd_context *context = + &dev->cmd.context[token & dev->cmd.token_mask]; + + /* previously timed out command completing at long last */ + if (token != context->token) + return; + + context->result = 0; + context->status = status; + context->out_param = out_param; + + context->token += dev->cmd.token_mask + 1; + + complete(&context->done); +} + +static void event_timeout(unsigned long context_ptr) +{ + struct mthca_cmd_context *context = + (struct mthca_cmd_context *) context_ptr; + + context->result = -EBUSY; + complete(&context->done); +} + +static int mthca_cmd_wait(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + int out_is_imm, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + int err = 0; + struct mthca_cmd_context *context; + + if (down_interruptible(&dev->cmd.event_sem)) + return -EINTR; + + spin_lock(&dev->cmd.context_lock); + BUG_ON(dev->cmd.free_head < 0); + context = &dev->cmd.context[dev->cmd.free_head]; + dev->cmd.free_head = context->next; + spin_unlock(&dev->cmd.context_lock); + + init_completion(&context->done); + + err = mthca_cmd_post(dev, in_param, + out_param ? *out_param : 0, + in_modifier, op_modifier, + op, context->token, 1); + if (err) + goto out; + + context->timer.expires = jiffies + timeout; + add_timer(&context->timer); + + wait_for_completion(&context->done); + del_timer_sync(&context->timer); + + err = context->result; + if (err) + goto out; + + *status = context->status; + if (*status) + mthca_dbg(dev, "Command %02x completed with status %02x\n", + op, *status); + + if (out_is_imm) + *out_param = context->out_param; + +out: + spin_lock(&dev->cmd.context_lock); + context->next = dev->cmd.free_head; + dev->cmd.free_head = context - dev->cmd.context; + spin_unlock(&dev->cmd.context_lock); + + up(&dev->cmd.event_sem); + return err; +} + +/* Invoke a command with an output mailbox */ +static int mthca_cmd_box(struct mthca_dev *dev, + u64 in_param, + u64 out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, &out_param, 0, + in_modifier, op_modifier, op, + timeout, status); +} + +/* Invoke a command with no output parameter */ +static int mthca_cmd(struct mthca_dev *dev, + u64 in_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + return mthca_cmd_box(dev, in_param, 0, in_modifier, + op_modifier, op, timeout, status); +} + +/* + * Invoke a command with an immediate output parameter (and copy the + * output into the caller's out_param pointer after the command + * executes). 
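Note the two-level error convention shared by mthca_cmd(), mthca_cmd_box() and mthca_cmd_imm() (defined next): the return value reports a Linux-level failure (couldn't post the command, timed out, interrupted), while *status carries the status byte the firmware itself returned, and callers check both. A purely illustrative caller -- example_ping_fw is not part of the patch, its operands are arbitrary, and it would have to live in this file since the helpers are static -- might look like:

static int example_ping_fw(struct mthca_dev *dev)
{
        u8 status;
        int err;

        /* CMD_NOP and CMD_TIME_CLASS_A come from the definitions above;
         * the zero operands are illustrative only. */
        err = mthca_cmd(dev, 0, 0, 0, CMD_NOP, CMD_TIME_CLASS_A, &status);
        if (err)
                return err;       /* command never completed at the Linux level */
        if (status)
                return -EINVAL;   /* firmware executed it but reported an error */

        return 0;
}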
+ */ +static int mthca_cmd_imm(struct mthca_dev *dev, + u64 in_param, + u64 *out_param, + u32 in_modifier, + u8 op_modifier, + u16 op, + unsigned long timeout, + u8 *status) +{ + if (dev->cmd.use_events) + return mthca_cmd_wait(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); + else + return mthca_cmd_poll(dev, in_param, out_param, 1, + in_modifier, op_modifier, op, + timeout, status); +} + +/* + * Switch to using events to issue FW commands (should be called after + * event queue to command events has been initialized). + */ +int mthca_cmd_use_events(struct mthca_dev *dev) +{ + int i; + + dev->cmd.context = kmalloc(dev->cmd.max_cmds * + sizeof (struct mthca_cmd_context), + GFP_KERNEL); + if (!dev->cmd.context) + return -ENOMEM; + + for (i = 0; i < dev->cmd.max_cmds; ++i) { + dev->cmd.context[i].token = i; + dev->cmd.context[i].next = i + 1; + init_timer(&dev->cmd.context[i].timer); + dev->cmd.context[i].timer.data = + (unsigned long) &dev->cmd.context[i]; + dev->cmd.context[i].timer.function = event_timeout; + } + + dev->cmd.context[dev->cmd.max_cmds - 1].next = -1; + dev->cmd.free_head = 0; + + sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds); + spin_lock_init(&dev->cmd.context_lock); + + for (dev->cmd.token_mask = 1; + dev->cmd.token_mask < dev->cmd.max_cmds; + dev->cmd.token_mask <<= 1) + ; /* nothing */ + --dev->cmd.token_mask; + + dev->cmd.use_events = 1; + down(&dev->cmd.poll_sem); + + return 0; +} + +/* + * Switch back to polling (used when shutting down the device) + */ +void mthca_cmd_use_polling(struct mthca_dev *dev) +{ + int i; + + dev->cmd.use_events = 0; + + for (i = 0; i < dev->cmd.max_cmds; ++i) + down(&dev->cmd.event_sem); + + kfree(dev->cmd.context); + + up(&dev->cmd.poll_sem); +} + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status) +{ + u64 out; + int ret; + + ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status); + + if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR) + mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, " + "sladdr=%d, SPD source=%s\n", + (int) (out >> 6) & 0xf, (int) (out >> 4) & 3, + (int) (out >> 1) & 7, (int) out & 1 ? "NVMEM" : "DIMM"); + + return ret; +} + +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status); +} + +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int lg; + int nent = 0; + int i, j; + int err = 0; + int ts = 0; + + inbox = pci_alloc_consistent(dev->pdev, PAGE_SIZE, &indma); + memset(inbox, 0, PAGE_SIZE); + + for (i = 0; i < count; ++i) { + /* + * We have to pass pages that are aligned to their + * size, so find the least significant 1 in the + * address or size and use that as our log2 size. 
+ */ + lg = ffs(sg_dma_address(sglist + i) | sg_dma_len(sglist + i)) - 1; + if (lg < 12) { + mthca_warn(dev, "Got FW area not aligned to 4K (%llx/%x).\n", + (unsigned long long) sg_dma_address(sglist + i), + sg_dma_len(sglist + i)); + err = -EINVAL; + goto out; + } + for (j = 0; j < sg_dma_len(sglist + i) / (1 << lg); ++j, ++nent) { + *((__be64 *) (inbox + nent * 4 + 2)) = + cpu_to_be64((sg_dma_address(sglist + i) + + (j << lg)) | + (lg - 12)); + ts += 1 << (lg - 10); + if (nent == PAGE_SIZE / 16) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + if (err || *status) + goto out; + nent = 0; + } + } + } + + if (nent) { + err = mthca_cmd(dev, indma, nent, 0, CMD_MAP_FA, + CMD_TIME_CLASS_B, status); + } + + mthca_dbg(dev, "Mapped %d KB of host memory for FW.\n", ts); + +out: + pci_free_consistent(dev->pdev, PAGE_SIZE, inbox, indma); + return err; +} + +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_UNMAP_FA, CMD_TIME_CLASS_B, status); +} + +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_RUN_FW, CMD_TIME_CLASS_A, status); +} + +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err = 0; + u8 lg; + +#define QUERY_FW_OUT_SIZE 0x100 +#define QUERY_FW_VER_OFFSET 0x00 +#define QUERY_FW_MAX_CMD_OFFSET 0x0f +#define QUERY_FW_ERR_START_OFFSET 0x30 +#define QUERY_FW_ERR_SIZE_OFFSET 0x38 + +#define QUERY_FW_START_OFFSET 0x20 +#define QUERY_FW_END_OFFSET 0x28 + +#define QUERY_FW_SIZE_OFFSET 0x00 +#define QUERY_FW_CLR_INT_BASE_OFFSET 0x20 +#define QUERY_FW_EQ_ARM_BASE_OFFSET 0x40 +#define QUERY_FW_EQ_SET_CI_BASE_OFFSET 0x48 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_FW_OUT_SIZE, &outdma); + if (!outbox) { + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_FW, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->fw_ver, outbox, QUERY_FW_VER_OFFSET); + /* + * FW subminor version is at more signifant bits than minor + * version, so swap here. + */ + dev->fw_ver = (dev->fw_ver & 0xffff00000000ull) | + ((dev->fw_ver & 0xffff0000ull) >> 16) | + ((dev->fw_ver & 0x0000ffffull) << 16); + + MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); + dev->cmd.max_cmds = 1 << lg; + + mthca_dbg(dev, "FW version %012llx, max commands %d\n", + (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); + MTHCA_GET(dev->fw.arbel.clr_int_base, outbox, QUERY_FW_CLR_INT_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_arm_base, outbox, QUERY_FW_EQ_ARM_BASE_OFFSET); + MTHCA_GET(dev->fw.arbel.eq_set_ci_base, outbox, QUERY_FW_EQ_SET_CI_BASE_OFFSET); + mthca_dbg(dev, "FW size %d KB\n", dev->fw.arbel.fw_pages << 2); + + /* + * Arbel page size is always 4 KB; round up number of + * system pages needed. 
+ */ + dev->fw.arbel.fw_pages = + (dev->fw.arbel.fw_pages + (1 << (PAGE_SHIFT - 12)) - 1) >> + (PAGE_SHIFT - 12); + + mthca_dbg(dev, "Clear int @ %llx, EQ arm @ %llx, EQ set CI @ %llx\n", + (unsigned long long) dev->fw.arbel.clr_int_base, + (unsigned long long) dev->fw.arbel.eq_arm_base, + (unsigned long long) dev->fw.arbel.eq_set_ci_base); + } else { + MTHCA_GET(dev->fw.tavor.fw_start, outbox, QUERY_FW_START_OFFSET); + MTHCA_GET(dev->fw.tavor.fw_end, outbox, QUERY_FW_END_OFFSET); + + mthca_dbg(dev, "FW size %d KB (start %llx, end %llx)\n", + (int) ((dev->fw.tavor.fw_end - dev->fw.tavor.fw_start) >> 10), + (unsigned long long) dev->fw.tavor.fw_start, + (unsigned long long) dev->fw.tavor.fw_end); + } + +out: + pci_free_consistent(dev->pdev, QUERY_FW_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define ENABLE_LAM_OUT_SIZE 0x100 +#define ENABLE_LAM_START_OFFSET 0x00 +#define ENABLE_LAM_END_OFFSET 0x08 +#define ENABLE_LAM_INFO_OFFSET 0x13 + +#define ENABLE_LAM_INFO_HIDDEN_FLAG (1 << 4) +#define ENABLE_LAM_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_ENABLE_LAM, + CMD_TIME_CLASS_C, status); + + if (err) + goto out; + + if (*status == MTHCA_CMD_STAT_LAM_NOT_PRE) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, ENABLE_LAM_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, ENABLE_LAM_END_OFFSET); + MTHCA_GET(info, outbox, ENABLE_LAM_INFO_OFFSET); + + if (!!(info & ENABLE_LAM_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & ENABLE_LAM_INFO_HIDDEN_FLAG) ? + "" : "not"); + } + if (info & ENABLE_LAM_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, ENABLE_LAM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status) +{ + return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status); +} + +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status) +{ + u8 info; + u32 *outbox; + dma_addr_t outdma; + int err = 0; + +#define QUERY_DDR_OUT_SIZE 0x100 +#define QUERY_DDR_START_OFFSET 0x00 +#define QUERY_DDR_END_OFFSET 0x08 +#define QUERY_DDR_INFO_OFFSET 0x13 + +#define QUERY_DDR_INFO_HIDDEN_FLAG (1 << 4) +#define QUERY_DDR_INFO_ECC_MASK 0x3 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DDR, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(dev->ddr_start, outbox, QUERY_DDR_START_OFFSET); + MTHCA_GET(dev->ddr_end, outbox, QUERY_DDR_END_OFFSET); + MTHCA_GET(info, outbox, QUERY_DDR_INFO_OFFSET); + + if (!!(info & QUERY_DDR_INFO_HIDDEN_FLAG) != + !!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + mthca_info(dev, "FW reports that HCA-attached memory " + "is %s hidden; does not match PCI config\n", + (info & QUERY_DDR_INFO_HIDDEN_FLAG) ? 
+ "" : "not"); + } + if (info & QUERY_DDR_INFO_HIDDEN_FLAG) + mthca_dbg(dev, "HCA-attached memory is hidden.\n"); + + mthca_dbg(dev, "HCA memory size %d KB (start %llx, end %llx)\n", + (int) ((dev->ddr_end - dev->ddr_start) >> 10), + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->ddr_end); + +out: + pci_free_consistent(dev->pdev, QUERY_DDR_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + u8 field; + u16 size; + int err; + +#define QUERY_DEV_LIM_OUT_SIZE 0x100 +#define QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET 0x10 +#define QUERY_DEV_LIM_MAX_QP_SZ_OFFSET 0x11 +#define QUERY_DEV_LIM_RSVD_QP_OFFSET 0x12 +#define QUERY_DEV_LIM_MAX_QP_OFFSET 0x13 +#define QUERY_DEV_LIM_RSVD_SRQ_OFFSET 0x14 +#define QUERY_DEV_LIM_MAX_SRQ_OFFSET 0x15 +#define QUERY_DEV_LIM_RSVD_EEC_OFFSET 0x16 +#define QUERY_DEV_LIM_MAX_EEC_OFFSET 0x17 +#define QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET 0x19 +#define QUERY_DEV_LIM_RSVD_CQ_OFFSET 0x1a +#define QUERY_DEV_LIM_MAX_CQ_OFFSET 0x1b +#define QUERY_DEV_LIM_MAX_MPT_OFFSET 0x1d +#define QUERY_DEV_LIM_RSVD_EQ_OFFSET 0x1e +#define QUERY_DEV_LIM_MAX_EQ_OFFSET 0x1f +#define QUERY_DEV_LIM_RSVD_MTT_OFFSET 0x20 +#define QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET 0x21 +#define QUERY_DEV_LIM_RSVD_MRW_OFFSET 0x22 +#define QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET 0x23 +#define QUERY_DEV_LIM_MAX_AV_OFFSET 0x27 +#define QUERY_DEV_LIM_MAX_REQ_QP_OFFSET 0x29 +#define QUERY_DEV_LIM_MAX_RES_QP_OFFSET 0x2b +#define QUERY_DEV_LIM_MAX_RDMA_OFFSET 0x2f +#define QUERY_DEV_LIM_RSZ_SRQ_OFFSET 0x33 +#define QUERY_DEV_LIM_ACK_DELAY_OFFSET 0x35 +#define QUERY_DEV_LIM_MTU_WIDTH_OFFSET 0x36 +#define QUERY_DEV_LIM_VL_PORT_OFFSET 0x37 +#define QUERY_DEV_LIM_MAX_GID_OFFSET 0x3b +#define QUERY_DEV_LIM_MAX_PKEY_OFFSET 0x3f +#define QUERY_DEV_LIM_FLAGS_OFFSET 0x44 +#define QUERY_DEV_LIM_RSVD_UAR_OFFSET 0x48 +#define QUERY_DEV_LIM_UAR_SZ_OFFSET 0x49 +#define QUERY_DEV_LIM_PAGE_SZ_OFFSET 0x4b +#define QUERY_DEV_LIM_MAX_SG_OFFSET 0x51 +#define QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET 0x52 +#define QUERY_DEV_LIM_MAX_SG_RQ_OFFSET 0x55 +#define QUERY_DEV_LIM_MAX_DESC_SZ_RQ_OFFSET 0x56 +#define QUERY_DEV_LIM_MAX_QP_MCG_OFFSET 0x61 +#define QUERY_DEV_LIM_RSVD_MCG_OFFSET 0x62 +#define QUERY_DEV_LIM_MAX_MCG_OFFSET 0x63 +#define QUERY_DEV_LIM_RSVD_PD_OFFSET 0x64 +#define QUERY_DEV_LIM_MAX_PD_OFFSET 0x65 +#define QUERY_DEV_LIM_RSVD_RDD_OFFSET 0x66 +#define QUERY_DEV_LIM_MAX_RDD_OFFSET 0x67 +#define QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET 0x80 +#define QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET 0x82 +#define QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET 0x84 +#define QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET 0x86 +#define QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET 0x88 +#define QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET 0x8a +#define QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET 0x8c +#define QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET 0x8e +#define QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET 0x90 +#define QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET 0x92 +#define QUERY_DEV_LIM_PBL_SZ_OFFSET 0x96 +#define QUERY_DEV_LIM_BMME_FLAGS_OFFSET 0x97 +#define QUERY_DEV_LIM_RSVD_LKEY_OFFSET 0x98 +#define QUERY_DEV_LIM_LAMR_OFFSET 0x9f +#define QUERY_DEV_LIM_MAX_ICM_SZ_OFFSET 0xa0 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_DEV_LIM, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); + dev_lim->max_srq_sz = 1 << field; + MTHCA_GET(field, 
outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); + dev_lim->max_qp_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); + dev_lim->reserved_qps = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); + dev_lim->max_qps = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_SRQ_OFFSET); + dev_lim->reserved_srqs = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_OFFSET); + dev_lim->max_srqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EEC_OFFSET); + dev_lim->reserved_eecs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EEC_OFFSET); + dev_lim->max_eecs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_SZ_OFFSET); + dev_lim->max_cq_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_CQ_OFFSET); + dev_lim->reserved_cqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_CQ_OFFSET); + dev_lim->max_cqs = 1 << (field & 0x1f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MPT_OFFSET); + dev_lim->max_mpts = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_EQ_OFFSET); + dev_lim->reserved_eqs = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_EQ_OFFSET); + dev_lim->max_eqs = 1 << (field & 0x7); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MTT_OFFSET); + dev_lim->reserved_mtts = 1 << (field >> 4); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MRW_SZ_OFFSET); + dev_lim->max_mrw_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MRW_OFFSET); + dev_lim->reserved_mrws = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MTT_SEG_OFFSET); + dev_lim->max_mtt_seg = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_REQ_QP_OFFSET); + dev_lim->max_requester_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RES_QP_OFFSET); + dev_lim->max_responder_per_qp = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDMA_OFFSET); + dev_lim->max_rdma_global = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_ACK_DELAY_OFFSET); + dev_lim->local_ca_ack_delay = field & 0x1f; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MTU_WIDTH_OFFSET); + dev_lim->max_mtu = field >> 4; + dev_lim->max_port_width = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_VL_PORT_OFFSET); + dev_lim->max_vl = field >> 4; + dev_lim->num_ports = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_GID_OFFSET); + dev_lim->max_gids = 1 << (field & 0xf); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PKEY_OFFSET); + dev_lim->max_pkeys = 1 << (field & 0xf); + MTHCA_GET(dev_lim->flags, outbox, QUERY_DEV_LIM_FLAGS_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_UAR_OFFSET); + dev_lim->reserved_uars = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_UAR_SZ_OFFSET); + dev_lim->uar_size = 1 << ((field & 0x3f) + 20); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PAGE_SZ_OFFSET); + dev_lim->min_page_sz = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SG_OFFSET); + dev_lim->max_sg = field; + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MAX_DESC_SZ_OFFSET); + dev_lim->max_desc_sz = size; + + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_MCG_OFFSET); + dev_lim->max_qp_per_mcg = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_MCG_OFFSET); + dev_lim->reserved_mgms = field & 0xf; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_MCG_OFFSET); + dev_lim->max_mcgs = 1 << field; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_PD_OFFSET); + dev_lim->reserved_pds = field >> 4; + 
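/*
 * Note on the MTHCA_GET() accessors used throughout this function: the
 * macro itself is defined elsewhere in the driver and is not part of
 * this hunk.  Judging from its use on u8/u16/u32/u64 destinations at
 * byte offsets into the mailbox, it presumably copies a big-endian
 * field of sizeof(dest) bytes out of the command outbox.  A minimal
 * sketch of that idea (hypothetical GET_FIELD name, not the driver's
 * real macro) would be:
 *
 *	#define GET_FIELD(dest, source, offset)                     \
 *		do {                                                \
 *			void *__p = (char *) (source) + (offset);   \
 *			switch (sizeof (dest)) {                    \
 *			case 1: (dest) = *(u8 *) __p;        break; \
 *			case 2: (dest) = be16_to_cpup(__p);  break; \
 *			case 4: (dest) = be32_to_cpup(__p);  break; \
 *			case 8: (dest) = be64_to_cpup(__p);  break; \
 *			}                                           \
 *		} while (0)
 */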
MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_PD_OFFSET); + dev_lim->max_pds = 1 << (field & 0x3f); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_RDD_OFFSET); + dev_lim->reserved_rdds = field >> 4; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_RDD_OFFSET); + dev_lim->max_rdds = 1 << (field & 0x3f); + + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEC_ENTRY_SZ_OFFSET); + dev_lim->eec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_QPC_ENTRY_SZ_OFFSET); + dev_lim->qpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EEEC_ENTRY_SZ_OFFSET); + dev_lim->eeec_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQPC_ENTRY_SZ_OFFSET); + dev_lim->eqpc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_EQC_ENTRY_SZ_OFFSET); + dev_lim->eqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_CQC_ENTRY_SZ_OFFSET); + dev_lim->cqc_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_SRQ_ENTRY_SZ_OFFSET); + dev_lim->srq_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_UAR_ENTRY_SZ_OFFSET); + dev_lim->uar_scratch_entry_sz = size; + + mthca_dbg(dev, "Max QPs: %d, reserved QPs: %d, entry size: %d\n", + dev_lim->max_qps, dev_lim->reserved_qps, dev_lim->qpc_entry_sz); + mthca_dbg(dev, "Max CQs: %d, reserved CQs: %d, entry size: %d\n", + dev_lim->max_cqs, dev_lim->reserved_cqs, dev_lim->cqc_entry_sz); + mthca_dbg(dev, "Max EQs: %d, reserved EQs: %d, entry size: %d\n", + dev_lim->max_eqs, dev_lim->reserved_eqs, dev_lim->eqc_entry_sz); + mthca_dbg(dev, "reserved MPTs: %d, reserved MTTs: %d\n", + dev_lim->reserved_mrws, dev_lim->reserved_mtts); + mthca_dbg(dev, "Max PDs: %d, reserved PDs: %d, reserved UARs: %d\n", + dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); + mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", + dev_lim->max_pds, dev_lim->reserved_mgms); + + mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); + + if (dev->hca_type == ARBEL_NATIVE) { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSZ_SRQ_OFFSET); + dev_lim->hca.arbel.resize_srq = field & 1; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MTT_ENTRY_SZ_OFFSET); + dev_lim->hca.arbel.mtt_entry_sz = size; + MTHCA_GET(size, outbox, QUERY_DEV_LIM_MPT_ENTRY_SZ_OFFSET); + dev_lim->hca.arbel.mpt_entry_sz = size; + MTHCA_GET(field, outbox, QUERY_DEV_LIM_PBL_SZ_OFFSET); + dev_lim->hca.arbel.max_pbl_sz = 1 << (field & 0x3f); + MTHCA_GET(dev_lim->hca.arbel.bmme_flags, outbox, + QUERY_DEV_LIM_BMME_FLAGS_OFFSET); + MTHCA_GET(dev_lim->hca.arbel.reserved_lkey, outbox, + QUERY_DEV_LIM_RSVD_LKEY_OFFSET); + MTHCA_GET(field, outbox, QUERY_DEV_LIM_LAMR_OFFSET); + dev_lim->hca.arbel.lam_required = field & 1; + MTHCA_GET(dev_lim->hca.arbel.max_icm_sz, outbox, + QUERY_DEV_LIM_MAX_ICM_SZ_OFFSET); + + if (dev_lim->hca.arbel.bmme_flags & 1) + mthca_dbg(dev, "Base MM extensions: yes " + "(flags %d, max PBL %d, rsvd L_Key %08x)\n", + dev_lim->hca.arbel.bmme_flags, + dev_lim->hca.arbel.max_pbl_sz, + dev_lim->hca.arbel.reserved_lkey); + else + mthca_dbg(dev, "Base MM extensions: no\n"); + + mthca_dbg(dev, "Max ICM size %lld MB\n", + (unsigned long long) dev_lim->hca.arbel.max_icm_sz >> 20); + } else { + MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_AV_OFFSET); + dev_lim->hca.tavor.max_avs = 1 << (field & 0x3f); + } + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status) +{ + u32 *outbox; + dma_addr_t outdma; + int err; + +#define QUERY_ADAPTER_OUT_SIZE 0x100 +#define QUERY_ADAPTER_VENDOR_ID_OFFSET 0x00 
+#define QUERY_ADAPTER_DEVICE_ID_OFFSET 0x04 +#define QUERY_ADAPTER_REVISION_ID_OFFSET 0x08 +#define QUERY_ADAPTER_INTA_PIN_OFFSET 0x10 + + outbox = pci_alloc_consistent(dev->pdev, QUERY_ADAPTER_OUT_SIZE, &outdma); + if (!outbox) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, 0, 0, CMD_QUERY_ADAPTER, + CMD_TIME_CLASS_A, status); + + if (err) + goto out; + + MTHCA_GET(adapter->vendor_id, outbox, QUERY_ADAPTER_VENDOR_ID_OFFSET); + MTHCA_GET(adapter->device_id, outbox, QUERY_ADAPTER_DEVICE_ID_OFFSET); + MTHCA_GET(adapter->revision_id, outbox, QUERY_ADAPTER_REVISION_ID_OFFSET); + MTHCA_GET(adapter->inta_pin, outbox, QUERY_ADAPTER_INTA_PIN_OFFSET); + +out: + pci_free_consistent(dev->pdev, QUERY_DEV_LIM_OUT_SIZE, outbox, outdma); + return err; +} + +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + +#define INIT_HCA_IN_SIZE 0x200 +#define INIT_HCA_FLAGS_OFFSET 0x014 +#define INIT_HCA_QPC_OFFSET 0x020 +#define INIT_HCA_QPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x10) +#define INIT_HCA_LOG_QP_OFFSET (INIT_HCA_QPC_OFFSET + 0x17) +#define INIT_HCA_EEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x20) +#define INIT_HCA_LOG_EEC_OFFSET (INIT_HCA_QPC_OFFSET + 0x27) +#define INIT_HCA_SRQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x28) +#define INIT_HCA_LOG_SRQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x2f) +#define INIT_HCA_CQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x30) +#define INIT_HCA_LOG_CQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x37) +#define INIT_HCA_EQPC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x40) +#define INIT_HCA_EEEC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x50) +#define INIT_HCA_EQC_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x60) +#define INIT_HCA_LOG_EQ_OFFSET (INIT_HCA_QPC_OFFSET + 0x67) +#define INIT_HCA_RDB_BASE_OFFSET (INIT_HCA_QPC_OFFSET + 0x70) +#define INIT_HCA_UDAV_OFFSET 0x0b0 +#define INIT_HCA_UDAV_LKEY_OFFSET (INIT_HCA_UDAV_OFFSET + 0x0) +#define INIT_HCA_UDAV_PD_OFFSET (INIT_HCA_UDAV_OFFSET + 0x4) +#define INIT_HCA_MCAST_OFFSET 0x0c0 +#define INIT_HCA_MC_BASE_OFFSET (INIT_HCA_MCAST_OFFSET + 0x00) +#define INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x12) +#define INIT_HCA_MC_HASH_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x16) +#define INIT_HCA_LOG_MC_TABLE_SZ_OFFSET (INIT_HCA_MCAST_OFFSET + 0x1b) +#define INIT_HCA_TPT_OFFSET 0x0f0 +#define INIT_HCA_MPT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x00) +#define INIT_HCA_MTT_SEG_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x09) +#define INIT_HCA_LOG_MPT_SZ_OFFSET (INIT_HCA_TPT_OFFSET + 0x0b) +#define INIT_HCA_MTT_BASE_OFFSET (INIT_HCA_TPT_OFFSET + 0x10) +#define INIT_HCA_UAR_OFFSET 0x120 +#define INIT_HCA_UAR_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x00) +#define INIT_HCA_UAR_PAGE_SZ_OFFSET (INIT_HCA_UAR_OFFSET + 0x0b) +#define INIT_HCA_UAR_SCATCH_BASE_OFFSET (INIT_HCA_UAR_OFFSET + 0x10) + + inbox = pci_alloc_consistent(dev->pdev, INIT_HCA_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_HCA_IN_SIZE); + +#if defined(__LITTLE_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) &= ~cpu_to_be32(1 << 1); +#elif defined(__BIG_ENDIAN) + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1 << 1); +#else +#error Host endianness not defined +#endif + /* Check port for UD address vector: */ + *(inbox + INIT_HCA_FLAGS_OFFSET / 4) |= cpu_to_be32(1); + + /* We leave wqe_quota, responder_exu, etc as 0 (default) */ + + /* QPC/EEC/CQC/EQC/RDB attributes */ + + MTHCA_PUT(inbox, param->qpc_base, INIT_HCA_QPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_qps, INIT_HCA_LOG_QP_OFFSET); + 
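/*
 * The MTHCA_PUT() stores in this function pack the INIT_HCA mailbox:
 * a base address and log2 size for each context table (QPC, EEC, SRQC,
 * CQC, EQC, RDB), followed by the multicast, TPT and UAR parameters.
 * A caller -- presumably the device-init path, which is not part of
 * this hunk -- would fill the parameter block and issue the command
 * roughly as follows (illustrative sketch only, names and values are
 * made up):
 *
 *	struct mthca_init_hca_param param;
 *	u8 status;
 *	int err;
 *
 *	memset(&param, 0, sizeof param);
 *	param.qpc_base    = qpc_base;      // table base chosen by the caller
 *	param.log_num_qps = log_num_qps;   // log2 of the QP table size
 *	// ... the EEC/SRQC/CQC/EQC/MC/TPT/UAR fields are filled the same way
 *
 *	err = mthca_INIT_HCA(mdev, &param, &status);
 *	if (err || status)
 *		goto err_out;
 */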
MTHCA_PUT(inbox, param->eec_base, INIT_HCA_EEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eecs, INIT_HCA_LOG_EEC_OFFSET); + MTHCA_PUT(inbox, param->srqc_base, INIT_HCA_SRQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_srqs, INIT_HCA_LOG_SRQ_OFFSET); + MTHCA_PUT(inbox, param->cqc_base, INIT_HCA_CQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_cqs, INIT_HCA_LOG_CQ_OFFSET); + MTHCA_PUT(inbox, param->eqpc_base, INIT_HCA_EQPC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eeec_base, INIT_HCA_EEEC_BASE_OFFSET); + MTHCA_PUT(inbox, param->eqc_base, INIT_HCA_EQC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_num_eqs, INIT_HCA_LOG_EQ_OFFSET); + MTHCA_PUT(inbox, param->rdb_base, INIT_HCA_RDB_BASE_OFFSET); + + /* UD AV attributes */ + + /* multicast attributes */ + + MTHCA_PUT(inbox, param->mc_base, INIT_HCA_MC_BASE_OFFSET); + MTHCA_PUT(inbox, param->log_mc_entry_sz, INIT_HCA_LOG_MC_ENTRY_SZ_OFFSET); + MTHCA_PUT(inbox, param->mc_hash_sz, INIT_HCA_MC_HASH_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mc_table_sz, INIT_HCA_LOG_MC_TABLE_SZ_OFFSET); + + /* TPT attributes */ + + MTHCA_PUT(inbox, param->mpt_base, INIT_HCA_MPT_BASE_OFFSET); + MTHCA_PUT(inbox, param->mtt_seg_sz, INIT_HCA_MTT_SEG_SZ_OFFSET); + MTHCA_PUT(inbox, param->log_mpt_sz, INIT_HCA_LOG_MPT_SZ_OFFSET); + MTHCA_PUT(inbox, param->mtt_base, INIT_HCA_MTT_BASE_OFFSET); + + /* UAR attributes */ + { + u8 uar_page_sz = PAGE_SHIFT - 12; + MTHCA_PUT(inbox, uar_page_sz, INIT_HCA_UAR_PAGE_SZ_OFFSET); + MTHCA_PUT(inbox, param->uar_scratch_base, INIT_HCA_UAR_SCATCH_BASE_OFFSET); + } + + err = mthca_cmd(dev, indma, 0, 0, CMD_INIT_HCA, + HZ, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status) +{ + u32 *inbox; + dma_addr_t indma; + int err; + u32 flags; + +#define INIT_IB_IN_SIZE 56 +#define INIT_IB_FLAGS_OFFSET 0x00 +#define INIT_IB_FLAG_SIG (1 << 18) +#define INIT_IB_FLAG_NG (1 << 17) +#define INIT_IB_FLAG_G0 (1 << 16) +#define INIT_IB_FLAG_1X (1 << 8) +#define INIT_IB_FLAG_4X (1 << 9) +#define INIT_IB_FLAG_12X (1 << 11) +#define INIT_IB_VL_SHIFT 4 +#define INIT_IB_MTU_SHIFT 12 +#define INIT_IB_MAX_GID_OFFSET 0x06 +#define INIT_IB_MAX_PKEY_OFFSET 0x0a +#define INIT_IB_GUID0_OFFSET 0x10 +#define INIT_IB_NODE_GUID_OFFSET 0x18 +#define INIT_IB_SI_GUID_OFFSET 0x20 + + inbox = pci_alloc_consistent(dev->pdev, INIT_IB_IN_SIZE, &indma); + if (!inbox) + return -ENOMEM; + + memset(inbox, 0, INIT_IB_IN_SIZE); + + flags = 0; + flags |= param->enable_1x ? INIT_IB_FLAG_1X : 0; + flags |= param->enable_4x ? INIT_IB_FLAG_4X : 0; + flags |= param->set_guid0 ? INIT_IB_FLAG_G0 : 0; + flags |= param->set_node_guid ? INIT_IB_FLAG_NG : 0; + flags |= param->set_si_guid ? 
INIT_IB_FLAG_SIG : 0; + flags |= param->vl_cap << INIT_IB_VL_SHIFT; + flags |= param->mtu_cap << INIT_IB_MTU_SHIFT; + MTHCA_PUT(inbox, flags, INIT_IB_FLAGS_OFFSET); + + MTHCA_PUT(inbox, param->gid_cap, INIT_IB_MAX_GID_OFFSET); + MTHCA_PUT(inbox, param->pkey_cap, INIT_IB_MAX_PKEY_OFFSET); + MTHCA_PUT(inbox, param->guid0, INIT_IB_GUID0_OFFSET); + MTHCA_PUT(inbox, param->node_guid, INIT_IB_NODE_GUID_OFFSET); + MTHCA_PUT(inbox, param->si_guid, INIT_IB_SI_GUID_OFFSET); + + err = mthca_cmd(dev, indma, port, 0, CMD_INIT_IB, + CMD_TIME_CLASS_A, status); + + pci_free_consistent(dev->pdev, INIT_HCA_IN_SIZE, inbox, indma); + return err; +} + +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) +{ + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); +} + +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) +{ + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); +} + +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, mpt_index, 0, CMD_SW2HW_MPT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MPT_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + if (mpt_entry) { + outdma = pci_map_single(dev->pdev, mpt_entry, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + } + + err = mthca_cmd_box(dev, 0, outdma, mpt_index, !mpt_entry, + CMD_HW2SW_MPT, + CMD_TIME_CLASS_B, status); + + if (mpt_entry) + pci_unmap_single(dev->pdev, outdma, + MTHCA_MPT_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mtt_entry, + (num_mtt + 2) * 8, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, num_mtt, 0, CMD_WRITE_MTT, + CMD_TIME_CLASS_B, status); + + pci_unmap_single(dev->pdev, indma, + (num_mtt + 2) * 8, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status) +{ + mthca_dbg(dev, "%s mask %016llx for eqn %d\n", + unmap ? 
"Clearing" : "Setting", + (unsigned long long) event_mask, eq_num); + return mthca_cmd(dev, event_mask, (unmap << 31) | eq_num, + 0, CMD_MAP_EQ, CMD_TIME_CLASS_B, status); +} + +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, eq_num, 0, CMD_SW2HW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_EQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, eq_context, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, eq_num, 0, + CMD_HW2SW_EQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_EQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, cq_num, 0, CMD_SW2HW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_CQ_CONTEXT_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, cq_context, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, cq_num, 0, + CMD_HW2SW_CQ, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_CQ_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status) +{ + static const u16 op[] = { + [MTHCA_TRANS_RST2INIT] = CMD_RST2INIT_QPEE, + [MTHCA_TRANS_INIT2INIT] = CMD_INIT2INIT_QPEE, + [MTHCA_TRANS_INIT2RTR] = CMD_INIT2RTR_QPEE, + [MTHCA_TRANS_RTR2RTS] = CMD_RTR2RTS_QPEE, + [MTHCA_TRANS_RTS2RTS] = CMD_RTS2RTS_QPEE, + [MTHCA_TRANS_SQERR2RTS] = CMD_SQERR2RTS_QPEE, + [MTHCA_TRANS_ANY2ERR] = CMD_2ERR_QPEE, + [MTHCA_TRANS_RTS2SQD] = CMD_RTS2SQD_QPEE, + [MTHCA_TRANS_SQD2SQD] = CMD_SQD2SQD_QPEE, + [MTHCA_TRANS_SQD2RTS] = CMD_SQD2RTS_QPEE, + [MTHCA_TRANS_ANY2RST] = CMD_ERR2RST_QPEE + }; + u8 op_mod = 0; + + dma_addr_t indma; + int err; + + if (trans < 0 || trans >= ARRAY_SIZE(op)) + return -EINVAL; + + if (trans == MTHCA_TRANS_ANY2RST) { + indma = 0; + op_mod = 3; /* don't write outbox, any->reset */ + + /* For debugging */ + qp_context = pci_alloc_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + &indma); + op_mod = 2; /* write outbox, any->reset */ + } else { + indma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + } + + if 
(trans == MTHCA_TRANS_ANY2RST) { + err = mthca_cmd_box(dev, 0, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (0) { + int i; + mthca_dbg(dev, "Dumping QP context:\n"); + printk(" %08x\n", be32_to_cpup(qp_context)); + for (i = 0; i < 0x100 / 4; ++i) { + if (i % 8 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) qp_context)[i + 2])); + if ((i + 1) % 8 == 0) + printk("\n"); + } + } + + } else + err = mthca_cmd(dev, indma, (!!is_ee << 24) | num, + op_mod, op[trans], CMD_TIME_CLASS_C, status); + + if (trans != MTHCA_TRANS_ANY2RST) + pci_unmap_single(dev->pdev, indma, + MTHCA_QP_CONTEXT_SIZE, PCI_DMA_TODEVICE); + else + pci_free_consistent(dev->pdev, MTHCA_QP_CONTEXT_SIZE, + qp_context, indma); + return err; +} + +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, qp_context, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, (!!is_ee << 24) | num, 0, + CMD_QUERY_QPEE, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_QP_CONTEXT_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status) +{ + u8 op_mod; + + switch (type) { + case IB_QPT_SMI: + op_mod = 0; + break; + case IB_QPT_GSI: + op_mod = 1; + break; + case IB_QPT_RAW_IPV6: + op_mod = 2; + break; + case IB_QPT_RAW_ETY: + op_mod = 3; + break; + default: + return -EINVAL; + } + + return mthca_cmd(dev, 0, qpn, op_mod, CMD_CONF_SPECIAL_QP, + CMD_TIME_CLASS_B, status); +} + +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status) { + void *box; + dma_addr_t dma; + int err; + +#define MAD_IFC_BOX_SIZE 512 + + box = pci_alloc_consistent(dev->pdev, MAD_IFC_BOX_SIZE, &dma); + if (!box) + return -ENOMEM; + + memcpy(box, in_mad, 256); + + err = mthca_cmd_box(dev, dma, dma + 256, port, !!ignore_mkey, + CMD_MAD_IFC, CMD_TIME_CLASS_C, status); + + if (!err && !*status) + memcpy(response_mad, box + 256, 256); + + pci_free_consistent(dev->pdev, MAD_IFC_BOX_SIZE, box, dma); + return err; +} + +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t outdma = 0; + int err; + + outdma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + if (pci_dma_mapping_error(outdma)) + return -ENOMEM; + + err = mthca_cmd_box(dev, 0, outdma, index, 0, + CMD_READ_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, outdma, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_FROMDEVICE); + return err; +} + +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status) +{ + dma_addr_t indma; + int err; + + indma = pci_map_single(dev->pdev, mgm, + MTHCA_MGM_ENTRY_SIZE, + PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd(dev, indma, index, 0, CMD_WRITE_MGM, + CMD_TIME_CLASS_A, status); + + pci_unmap_single(dev->pdev, indma, + MTHCA_MGM_ENTRY_SIZE, PCI_DMA_TODEVICE); + return err; +} + +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status) +{ + dma_addr_t indma; + u64 imm; + int err; + + indma = pci_map_single(dev->pdev, gid, 16, PCI_DMA_TODEVICE); + if (pci_dma_mapping_error(indma)) + return -ENOMEM; + + err = mthca_cmd_imm(dev, indma, &imm, 0, 0, CMD_MGID_HASH, + CMD_TIME_CLASS_A, status); + *hash = imm; + + 
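/*
 * MGID_HASH returns its result as an immediate output (the out_is_imm
 * path through mthca_cmd_imm() above); the u64 immediate is truncated
 * to the 16-bit hash here.  The multicast group code -- not part of
 * this hunk -- would presumably combine the three MGM commands along
 * these lines (illustrative sketch only):
 *
 *	u16 hash;
 *	u8 status;
 *	int err;
 *
 *	err = mthca_MGID_HASH(dev, gid, &hash, &status);
 *	if (err || status)
 *		return err ? err : -EINVAL;
 *
 *	err = mthca_READ_MGM(dev, hash, mgm, &status);
 *	// ... add or remove a QP in the MGM entry, then write it back:
 *	err = mthca_WRITE_MGM(dev, hash, mgm, &status);
 */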
pci_unmap_single(dev->pdev, indma, 16, PCI_DMA_TODEVICE); + return err; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cmd.h 2004-12-27 21:48:22.408667751 -0800 @@ -0,0 +1,276 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_cmd.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_CMD_H +#define MTHCA_CMD_H + +#include + +#define MTHCA_CMD_MAILBOX_ALIGN 16UL +#define MTHCA_CMD_MAILBOX_EXTRA (MTHCA_CMD_MAILBOX_ALIGN - 1) + +enum { + /* command completed successfully: */ + MTHCA_CMD_STAT_OK = 0x00, + /* Internal error (such as a bus error) occurred while processing command: */ + MTHCA_CMD_STAT_INTERNAL_ERR = 0x01, + /* Operation/command not supported or opcode modifier not supported: */ + MTHCA_CMD_STAT_BAD_OP = 0x02, + /* Parameter not supported or parameter out of range: */ + MTHCA_CMD_STAT_BAD_PARAM = 0x03, + /* System not enabled or bad system state: */ + MTHCA_CMD_STAT_BAD_SYS_STATE = 0x04, + /* Attempt to access reserved or unallocaterd resource: */ + MTHCA_CMD_STAT_BAD_RESOURCE = 0x05, + /* Requested resource is currently executing a command, or is otherwise busy: */ + MTHCA_CMD_STAT_RESOURCE_BUSY = 0x06, + /* memory error: */ + MTHCA_CMD_STAT_DDR_MEM_ERR = 0x07, + /* Required capability exceeds device limits: */ + MTHCA_CMD_STAT_EXCEED_LIM = 0x08, + /* Resource is not in the appropriate state or ownership: */ + MTHCA_CMD_STAT_BAD_RES_STATE = 0x09, + /* Index out of range: */ + MTHCA_CMD_STAT_BAD_INDEX = 0x0a, + /* FW image corrupted: */ + MTHCA_CMD_STAT_BAD_NVMEM = 0x0b, + /* Attempt to modify a QP/EE which is not in the presumed state: */ + MTHCA_CMD_STAT_BAD_QPEE_STATE = 0x10, + /* Bad segment parameters (Address/Size): */ + MTHCA_CMD_STAT_BAD_SEG_PARAM = 0x20, + /* Memory Region has Memory Windows bound to: */ + MTHCA_CMD_STAT_REG_BOUND = 0x21, + /* HCA local attached memory not present: */ + MTHCA_CMD_STAT_LAM_NOT_PRE = 0x22, + /* Bad management packet (silently discarded): */ + MTHCA_CMD_STAT_BAD_PKT = 0x30, + /* More outstanding CQEs in CQ than new CQ size: */ + MTHCA_CMD_STAT_BAD_SIZE = 0x40 +}; + +enum { + MTHCA_TRANS_INVALID = 0, + MTHCA_TRANS_RST2INIT, + 
MTHCA_TRANS_INIT2INIT, + MTHCA_TRANS_INIT2RTR, + MTHCA_TRANS_RTR2RTS, + MTHCA_TRANS_RTS2RTS, + MTHCA_TRANS_SQERR2RTS, + MTHCA_TRANS_ANY2ERR, + MTHCA_TRANS_RTS2SQD, + MTHCA_TRANS_SQD2SQD, + MTHCA_TRANS_SQD2RTS, + MTHCA_TRANS_ANY2RST, +}; + +enum { + DEV_LIM_FLAG_SRQ = 1 << 6 +}; + +struct mthca_dev_lim { + int max_srq_sz; + int max_qp_sz; + int reserved_qps; + int max_qps; + int reserved_srqs; + int max_srqs; + int reserved_eecs; + int max_eecs; + int max_cq_sz; + int reserved_cqs; + int max_cqs; + int max_mpts; + int reserved_eqs; + int max_eqs; + int reserved_mtts; + int max_mrw_sz; + int reserved_mrws; + int max_mtt_seg; + int max_requester_per_qp; + int max_responder_per_qp; + int max_rdma_global; + int local_ca_ack_delay; + int max_mtu; + int max_port_width; + int max_vl; + int num_ports; + int max_gids; + int max_pkeys; + u32 flags; + int reserved_uars; + int uar_size; + int min_page_sz; + int max_sg; + int max_desc_sz; + int max_qp_per_mcg; + int reserved_mgms; + int max_mcgs; + int reserved_pds; + int max_pds; + int reserved_rdds; + int max_rdds; + int eec_entry_sz; + int qpc_entry_sz; + int eeec_entry_sz; + int eqpc_entry_sz; + int eqc_entry_sz; + int cqc_entry_sz; + int srq_entry_sz; + int uar_scratch_entry_sz; + union { + struct { + int max_avs; + } tavor; + struct { + int resize_srq; + int mtt_entry_sz; + int mpt_entry_sz; + int max_pbl_sz; + u8 bmme_flags; + u32 reserved_lkey; + int lam_required; + u64 max_icm_sz; + } arbel; + } hca; +}; + +struct mthca_adapter { + u32 vendor_id; + u32 device_id; + u32 revision_id; + u8 inta_pin; +}; + +struct mthca_init_hca_param { + u64 qpc_base; + u8 log_num_qps; + u64 eec_base; + u8 log_num_eecs; + u64 srqc_base; + u8 log_num_srqs; + u64 cqc_base; + u8 log_num_cqs; + u64 eqpc_base; + u64 eeec_base; + u64 eqc_base; + u8 log_num_eqs; + u64 rdb_base; + u64 mc_base; + u16 log_mc_entry_sz; + u16 mc_hash_sz; + u8 log_mc_table_sz; + u64 mpt_base; + u8 mtt_seg_sz; + u8 log_mpt_sz; + u64 mtt_base; + u64 uar_scratch_base; +}; + +struct mthca_init_ib_param { + int enable_1x; + int enable_4x; + int vl_cap; + int mtu_cap; + u16 gid_cap; + u16 pkey_cap; + int set_guid0; + u64 guid0; + int set_node_guid; + u64 node_guid; + int set_si_guid; + u64 si_guid; +}; + +int mthca_cmd_use_events(struct mthca_dev *dev); +void mthca_cmd_use_polling(struct mthca_dev *dev); +void mthca_cmd_event(struct mthca_dev *dev, u16 token, + u8 status, u64 out_param); + +int mthca_SYS_EN(struct mthca_dev *dev, u8 *status); +int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status); +int mthca_MAP_FA(struct mthca_dev *dev, int count, + struct scatterlist *sglist, u8 *status); +int mthca_UNMAP_FA(struct mthca_dev *dev, u8 *status); +int mthca_RUN_FW(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status); +int mthca_ENABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_DISABLE_LAM(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DDR(struct mthca_dev *dev, u8 *status); +int mthca_QUERY_DEV_LIM(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, u8 *status); +int mthca_QUERY_ADAPTER(struct mthca_dev *dev, + struct mthca_adapter *adapter, u8 *status); +int mthca_INIT_HCA(struct mthca_dev *dev, + struct mthca_init_hca_param *param, + u8 *status); +int mthca_INIT_IB(struct mthca_dev *dev, + struct mthca_init_ib_param *param, + int port, u8 *status); +int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status); +int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status); +int mthca_SW2HW_MPT(struct mthca_dev *dev, void *mpt_entry, + 
int mpt_index, u8 *status); +int mthca_HW2SW_MPT(struct mthca_dev *dev, void *mpt_entry, + int mpt_index, u8 *status); +int mthca_WRITE_MTT(struct mthca_dev *dev, u64 *mtt_entry, + int num_mtt, u8 *status); +int mthca_MAP_EQ(struct mthca_dev *dev, u64 event_mask, int unmap, + int eq_num, u8 *status); +int mthca_SW2HW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_HW2SW_EQ(struct mthca_dev *dev, void *eq_context, + int eq_num, u8 *status); +int mthca_SW2HW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_HW2SW_CQ(struct mthca_dev *dev, void *cq_context, + int cq_num, u8 *status); +int mthca_MODIFY_QP(struct mthca_dev *dev, int trans, u32 num, + int is_ee, void *qp_context, u32 optmask, + u8 *status); +int mthca_QUERY_QP(struct mthca_dev *dev, u32 num, int is_ee, + void *qp_context, u8 *status); +int mthca_CONF_SPECIAL_QP(struct mthca_dev *dev, int type, u32 qpn, + u8 *status); +int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int port, + void *in_mad, void *response_mad, u8 *status); +int mthca_READ_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_WRITE_MGM(struct mthca_dev *dev, int index, void *mgm, + u8 *status); +int mthca_MGID_HASH(struct mthca_dev *dev, void *gid, u16 *hash, + u8 *status); + +#define MAILBOX_ALIGN(x) ((void *) ALIGN((unsigned long) (x), MTHCA_CMD_MAILBOX_ALIGN)) + +#endif /* MTHCA_CMD_H */ From roland at topspin.com Mon Dec 27 21:51:08 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:08 -0800 Subject: [openib-general] [PATCH][v5][12/24] Add Mellanox HCA low-level driver (EQ) In-Reply-To: <200412272151.6amOd1o39RpEe1KK@topspin.com> Message-ID: <200412272151.abmqZdwtcCCwm2QW@topspin.com> Add event queue code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_eq.c 2004-12-27 21:48:22.766615062 -0800 @@ -0,0 +1,690 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_eq.c 1382 2004-12-24 02:21:02Z roland $ + */ + +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" +#include "mthca_config_reg.h" + +enum { + MTHCA_NUM_ASYNC_EQE = 0x80, + MTHCA_NUM_CMD_EQE = 0x80, + MTHCA_EQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. + */ +struct mthca_eq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 pd; + u8 reserved1[3]; + u8 intr; + u32 lost_count; + u32 lkey; + u32 reserved2[2]; + u32 consumer_index; + u32 producer_index; + u32 reserved3[4]; +} __attribute__((packed)); + +#define MTHCA_EQ_STATUS_OK ( 0 << 28) +#define MTHCA_EQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_EQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_EQ_OWNER_SW ( 0 << 24) +#define MTHCA_EQ_OWNER_HW ( 1 << 24) +#define MTHCA_EQ_FLAG_TR ( 1 << 18) +#define MTHCA_EQ_FLAG_OI ( 1 << 17) +#define MTHCA_EQ_STATE_ARMED ( 1 << 8) +#define MTHCA_EQ_STATE_FIRED ( 2 << 8) +#define MTHCA_EQ_STATE_ALWAYS_ARMED ( 3 << 8) + +enum { + MTHCA_EVENT_TYPE_COMP = 0x00, + MTHCA_EVENT_TYPE_PATH_MIG = 0x01, + MTHCA_EVENT_TYPE_COMM_EST = 0x02, + MTHCA_EVENT_TYPE_SQ_DRAINED = 0x03, + MTHCA_EVENT_TYPE_SRQ_LAST_WQE = 0x13, + MTHCA_EVENT_TYPE_CQ_ERROR = 0x04, + MTHCA_EVENT_TYPE_WQ_CATAS_ERROR = 0x05, + MTHCA_EVENT_TYPE_EEC_CATAS_ERROR = 0x06, + MTHCA_EVENT_TYPE_PATH_MIG_FAILED = 0x07, + MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR = 0x10, + MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR = 0x11, + MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR = 0x12, + MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR = 0x08, + MTHCA_EVENT_TYPE_PORT_CHANGE = 0x09, + MTHCA_EVENT_TYPE_EQ_OVERFLOW = 0x0f, + MTHCA_EVENT_TYPE_ECC_DETECT = 0x0e, + MTHCA_EVENT_TYPE_CMD = 0x0a +}; + +#define MTHCA_ASYNC_EVENT_MASK ((1ULL << MTHCA_EVENT_TYPE_PATH_MIG) | \ + (1ULL << MTHCA_EVENT_TYPE_COMM_EST) | \ + (1ULL << MTHCA_EVENT_TYPE_SQ_DRAINED) | \ + (1ULL << MTHCA_EVENT_TYPE_CQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_EEC_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PATH_MIG_FAILED) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_PORT_CHANGE) | \ + (1ULL << MTHCA_EVENT_TYPE_ECC_DETECT)) +#define MTHCA_SRQ_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR) | \ + (1ULL << MTHCA_EVENT_TYPE_SRQ_LAST_WQE) +#define MTHCA_CMD_EVENT_MASK (1ULL << MTHCA_EVENT_TYPE_CMD) + +#define MTHCA_EQ_DB_INC_CI (1 << 24) +#define MTHCA_EQ_DB_REQ_NOT (2 << 24) +#define MTHCA_EQ_DB_DISARM_CQ (3 << 24) +#define MTHCA_EQ_DB_SET_CI (4 << 24) +#define MTHCA_EQ_DB_ALWAYS_ARM (5 << 24) + +struct mthca_eqe { + u8 reserved1; + u8 type; + u8 reserved2; + u8 subtype; + union { + u32 raw[6]; + struct { + u32 cqn; + } __attribute__((packed)) comp; + struct { + u16 reserved1; + u16 token; + u32 reserved2; + u8 reserved3[3]; + u8 status; + u64 out_param; + } __attribute__((packed)) cmd; + struct { + u32 qpn; + } __attribute__((packed)) qp; + struct { + u32 cqn; + u32 reserved1; + u8 reserved2[3]; + u8 syndrome; + } __attribute__((packed)) cq_err; + struct { + u32 reserved1[2]; + u32 port; + } __attribute__((packed)) port_change; + } event; + u8 reserved3[3]; + u8 owner; +} __attribute__((packed)); + +#define MTHCA_EQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_EQ_ENTRY_OWNER_HW (1 << 7) + +static inline u64 async_mask(struct mthca_dev *dev) +{ + return dev->mthca_flags & MTHCA_FLAG_SRQ ? 
+ MTHCA_ASYNC_EVENT_MASK | MTHCA_SRQ_EVENT_MASK : + MTHCA_ASYNC_EVENT_MASK; +} + +static inline void set_eq_ci(struct mthca_dev *dev, int eqn, int ci) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_SET_CI | eqn); + doorbell[1] = cpu_to_be32(ci); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void eq_req_not(struct mthca_dev *dev, int eqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_REQ_NOT | eqn); + doorbell[1] = 0; + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline void disarm_cq(struct mthca_dev *dev, int eqn, int cqn) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_EQ_DB_DISARM_CQ | eqn); + doorbell[1] = cpu_to_be32(cqn); + + mthca_write64(doorbell, + dev->kar + MTHCA_EQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +static inline struct mthca_eqe *get_eqe(struct mthca_eq *eq, int entry) +{ + return eq->page_list[entry * MTHCA_EQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_EQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int next_eqe_sw(struct mthca_eq *eq) +{ + return !(MTHCA_EQ_ENTRY_OWNER_HW & + get_eqe(eq, eq->cons_index)->owner); +} + +static inline void set_eqe_hw(struct mthca_eq *eq, int entry) +{ + get_eqe(eq, entry)->owner = MTHCA_EQ_ENTRY_OWNER_HW; +} + +static void port_change(struct mthca_dev *dev, int port, int active) +{ + struct ib_event record; + + mthca_dbg(dev, "Port change to %s for port %d\n", + active ? "active" : "down", port); + + record.device = &dev->ib_dev; + record.event = active ? IB_EVENT_PORT_ACTIVE : IB_EVENT_PORT_ERR; + record.element.port_num = port; + + ib_dispatch_event(&record); +} + +static void mthca_eq_int(struct mthca_dev *dev, struct mthca_eq *eq) +{ + struct mthca_eqe *eqe; + int disarm_cqn; + + while (next_eqe_sw(eq)) { + int set_ci = 0; + eqe = get_eqe(eq, eq->cons_index); + + switch (eqe->type) { + case MTHCA_EVENT_TYPE_COMP: + disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; + disarm_cq(dev, eq->eqn, disarm_cqn); + mthca_cq_event(dev, disarm_cqn); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG); + break; + + case MTHCA_EVENT_TYPE_COMM_EST: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_COMM_EST); + break; + + case MTHCA_EVENT_TYPE_SQ_DRAINED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_SQ_DRAINED); + break; + + case MTHCA_EVENT_TYPE_WQ_CATAS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_FATAL); + break; + + case MTHCA_EVENT_TYPE_PATH_MIG_FAILED: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_PATH_MIG_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_INVAL_REQ_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_REQ_ERR); + break; + + case MTHCA_EVENT_TYPE_WQ_ACCESS_ERROR: + mthca_qp_event(dev, be32_to_cpu(eqe->event.qp.qpn) & 0xffffff, + IB_EVENT_QP_ACCESS_ERR); + break; + + case MTHCA_EVENT_TYPE_CMD: + mthca_cmd_event(dev, + be16_to_cpu(eqe->event.cmd.token), + eqe->event.cmd.status, + be64_to_cpu(eqe->event.cmd.out_param)); + /* + * cmd_event() may add more commands. + * The card will think the queue has overflowed if + * we don't tell it we've been processing events. 
+ */ + set_ci = 1; + break; + + case MTHCA_EVENT_TYPE_PORT_CHANGE: + port_change(dev, + (be32_to_cpu(eqe->event.port_change.port) >> 28) & 3, + eqe->subtype == 0x4); + break; + + case MTHCA_EVENT_TYPE_CQ_ERROR: + mthca_warn(dev, "CQ %s on CQN %08x\n", + eqe->event.cq_err.syndrome == 1 ? + "overrun" : "access violation", + be32_to_cpu(eqe->event.cq_err.cqn)); + break; + + case MTHCA_EVENT_TYPE_EQ_OVERFLOW: + mthca_warn(dev, "EQ overrun on EQN %d\n", eq->eqn); + break; + + case MTHCA_EVENT_TYPE_EEC_CATAS_ERROR: + case MTHCA_EVENT_TYPE_SRQ_CATAS_ERROR: + case MTHCA_EVENT_TYPE_LOCAL_CATAS_ERROR: + case MTHCA_EVENT_TYPE_ECC_DETECT: + default: + mthca_warn(dev, "Unhandled event %02x(%02x) on EQ %d\n", + eqe->type, eqe->subtype, eq->eqn); + break; + }; + + set_eqe_hw(eq, eq->cons_index); + eq->cons_index = (eq->cons_index + 1) & (eq->nent - 1); + + if (set_ci) { + wmb(); /* see comment below */ + set_eq_ci(dev, eq->eqn, eq->cons_index); + set_ci = 0; + } + } + + /* + * This barrier makes sure that all updates to + * ownership bits done by set_eqe_hw() hit memory + * before the consumer index is updated. set_eq_ci() + * allows the HCA to possibly write more EQ entries, + * and we want to avoid the exceedingly unlikely + * possibility of the HCA writing an entry and then + * having set_eqe_hw() overwrite the owner field. + */ + wmb(); + set_eq_ci(dev, eq->eqn, eq->cons_index); + eq_req_not(dev, eq->eqn); +} + +static irqreturn_t mthca_interrupt(int irq, void *dev_ptr, struct pt_regs *regs) +{ + struct mthca_dev *dev = dev_ptr; + u32 ecr; + int work = 0; + int i; + + if (dev->eq_table.clr_mask) + writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); + + while ((ecr = readl(dev->hcr + MTHCA_ECR_OFFSET + 4)) != 0) { + work = 1; + + writel(ecr, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].ecr_mask) + mthca_eq_int(dev, &dev->eq_table.eq[i]); + } + + return IRQ_RETVAL(work); +} + +static irqreturn_t mthca_msi_x_interrupt(int irq, void *eq_ptr, + struct pt_regs *regs) +{ + struct mthca_eq *eq = eq_ptr; + struct mthca_dev *dev = eq->dev; + + writel(eq->ecr_mask, dev->hcr + MTHCA_ECR_CLR_OFFSET + 4); + mthca_eq_int(dev, eq); + + /* MSI-X vectors always belong to us */ + return IRQ_HANDLED; +} + +static int __devinit mthca_create_eq(struct mthca_dev *dev, + int nent, + u8 intr, + struct mthca_eq *eq) +{ + int npages = (nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + u64 *dma_list = NULL; + dma_addr_t t; + void *mailbox = NULL; + struct mthca_eq_context *eq_context; + int err = -ENOMEM; + int i; + u8 status; + + /* Make sure EQ size is aligned to a power of 2 size. 
*/ + for (i = 1; i < nent; i <<= 1) + ; /* nothing */ + nent = i; + + eq->dev = dev; + + eq->page_list = kmalloc(npages * sizeof *eq->page_list, + GFP_KERNEL); + if (!eq->page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + eq->page_list[i].buf = NULL; + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + mailbox = kmalloc(sizeof *eq_context + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free; + eq_context = MAILBOX_ALIGN(mailbox); + + for (i = 0; i < npages; ++i) { + eq->page_list[i].buf = pci_alloc_consistent(dev->pdev, + PAGE_SIZE, &t); + if (!eq->page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&eq->page_list[i], mapping, t); + + memset(eq->page_list[i].buf, 0, PAGE_SIZE); + } + + for (i = 0; i < nent; ++i) + set_eqe_hw(eq, i); + + eq->eqn = mthca_alloc(&dev->eq_table.alloc); + if (eq->eqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, PAGE_SHIFT, npages, + 0, npages * PAGE_SIZE, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &eq->mr); + if (err) + goto err_out_free_eq; + + eq->nent = nent; + + memset(eq_context, 0, sizeof *eq_context); + eq_context->flags = cpu_to_be32(MTHCA_EQ_STATUS_OK | + MTHCA_EQ_OWNER_HW | + MTHCA_EQ_STATE_ARMED | + MTHCA_EQ_FLAG_TR); + eq_context->start = cpu_to_be64(0); + eq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + eq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + eq_context->intr = intr; + eq_context->lkey = cpu_to_be32(eq->mr.ibmr.lkey); + + err = mthca_SW2HW_EQ(dev, eq_context, eq->eqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_EQ failed (%d)\n", err); + goto err_out_free_mr; + } + if (status) { + mthca_warn(dev, "SW2HW_EQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + kfree(dma_list); + kfree(mailbox); + + eq->ecr_mask = swab32(1 << eq->eqn); + eq->cons_index = 0; + + eq_req_not(dev, eq->eqn); + + mthca_dbg(dev, "Allocated EQ %d with %d entries\n", + eq->eqn, nent); + + return err; + + err_out_free_mr: + mthca_free_mr(dev, &eq->mr); + + err_out_free_eq: + mthca_free(&dev->eq_table.alloc, eq->eqn); + + err_out_free: + for (i = 0; i < npages; ++i) + if (eq->page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], + mapping)); + + kfree(eq->page_list); + kfree(dma_list); + kfree(mailbox); + + err_out: + return err; +} + +static void mthca_free_eq(struct mthca_dev *dev, + struct mthca_eq *eq) +{ + void *mailbox = NULL; + int err; + u8 status; + int npages = (eq->nent * MTHCA_EQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + int i; + + mailbox = kmalloc(sizeof (struct mthca_eq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + return; + + err = mthca_HW2SW_EQ(dev, MAILBOX_ALIGN(mailbox), + eq->eqn, &status); + if (err) + mthca_warn(dev, "HW2SW_EQ failed (%d)\n", err); + if (status) + mthca_warn(dev, "HW2SW_EQ returned status 0x%02x\n", + status); + + if (0) { + mthca_dbg(dev, "Dumping EQ context %02x:\n", eq->eqn); + for (i = 0; i < sizeof (struct mthca_eq_context) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpup(MAILBOX_ALIGN(mailbox) + i * 4)); + if ((i + 1) % 4 == 0) + printk("\n"); + } + } + + + mthca_free_mr(dev, &eq->mr); + for (i = 0; i < npages; ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + eq->page_list[i].buf, + pci_unmap_addr(&eq->page_list[i], mapping)); 
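/*
 * Teardown order in this function: HW2SW_EQ first hands ownership of
 * the EQ context back to software (so the HCA stops writing events
 * into the buffer), then the MR covering the EQ pages is freed, and
 * only then are the DMA pages themselves unmapped and released.
 */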
+ + kfree(eq->page_list); + kfree(mailbox); +} + +static void mthca_free_irqs(struct mthca_dev *dev) +{ + int i; + + if (dev->eq_table.have_irq) + free_irq(dev->pdev->irq, dev); + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (dev->eq_table.eq[i].have_irq) + free_irq(dev->eq_table.eq[i].msi_x_vector, + dev->eq_table.eq + i); +} + +int __devinit mthca_init_eq_table(struct mthca_dev *dev) +{ + int err; + u8 status; + u8 intr; + int i; + + err = mthca_alloc_init(&dev->eq_table.alloc, + dev->limits.num_eqs, + dev->limits.num_eqs - 1, + dev->limits.reserved_eqs); + if (err) + return err; + + if (dev->mthca_flags & MTHCA_FLAG_MSI || + dev->mthca_flags & MTHCA_FLAG_MSI_X) { + dev->eq_table.clr_mask = 0; + } else { + dev->eq_table.clr_mask = + swab32(1 << (dev->eq_table.inta_pin & 31)); + dev->eq_table.clr_int = dev->clr_base + + (dev->eq_table.inta_pin < 31 ? 4 : 0); + } + + intr = (dev->mthca_flags & MTHCA_FLAG_MSI) ? + 128 : dev->eq_table.inta_pin; + + err = mthca_create_eq(dev, dev->limits.num_cqs, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 128 : intr, + &dev->eq_table.eq[MTHCA_EQ_COMP]); + if (err) + goto err_out_free; + + err = mthca_create_eq(dev, MTHCA_NUM_ASYNC_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 129 : intr, + &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + if (err) + goto err_out_comp; + + err = mthca_create_eq(dev, MTHCA_NUM_CMD_EQE, + (dev->mthca_flags & MTHCA_FLAG_MSI_X) ? 130 : intr, + &dev->eq_table.eq[MTHCA_EQ_CMD]); + if (err) + goto err_out_async; + + if (dev->mthca_flags & MTHCA_FLAG_MSI_X) { + static const char *eq_name[] = { + [MTHCA_EQ_COMP] = DRV_NAME " (comp)", + [MTHCA_EQ_ASYNC] = DRV_NAME " (async)", + [MTHCA_EQ_CMD] = DRV_NAME " (cmd)" + }; + + for (i = 0; i < MTHCA_NUM_EQ; ++i) { + err = request_irq(dev->eq_table.eq[i].msi_x_vector, + mthca_msi_x_interrupt, 0, + eq_name[i], dev->eq_table.eq + i); + if (err) + goto err_out_cmd; + dev->eq_table.eq[i].have_irq = 1; + } + } else { + err = request_irq(dev->pdev->irq, mthca_interrupt, SA_SHIRQ, + DRV_NAME, dev); + if (err) + goto err_out_cmd; + dev->eq_table.have_irq = 1; + } + + err = mthca_MAP_EQ(dev, async_mask(dev), + 0, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for async EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, status); + + err = mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 0, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + if (err) + mthca_warn(dev, "MAP_EQ for cmd EQ %d failed (%d)\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, err); + if (status) + mthca_warn(dev, "MAP_EQ for cmd EQ %d returned status 0x%02x\n", + dev->eq_table.eq[MTHCA_EQ_CMD].eqn, status); + + return 0; + +err_out_cmd: + mthca_free_irqs(dev); + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_CMD]); + +err_out_async: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_ASYNC]); + +err_out_comp: + mthca_free_eq(dev, &dev->eq_table.eq[MTHCA_EQ_COMP]); + +err_out_free: + mthca_alloc_cleanup(&dev->eq_table.alloc); + return err; +} + +void __devexit mthca_cleanup_eq_table(struct mthca_dev *dev) +{ + u8 status; + int i; + + mthca_free_irqs(dev); + + mthca_MAP_EQ(dev, async_mask(dev), + 1, dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn, &status); + mthca_MAP_EQ(dev, MTHCA_CMD_EVENT_MASK, + 1, dev->eq_table.eq[MTHCA_EQ_CMD].eqn, &status); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + mthca_free_eq(dev, &dev->eq_table.eq[i]); + + mthca_alloc_cleanup(&dev->eq_table.alloc); +} From roland at topspin.com Mon 
Dec 27 21:51:08 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:08 -0800 Subject: [openib-general] [PATCH][v5][13/24] Add Mellanox HCA low-level driver (initialization) In-Reply-To: <200412272151.abmqZdwtcCCwm2QW@topspin.com> Message-ID: <200412272151.MaHN86hIK0Tt84ws@topspin.com> Add device initialization code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.c 2004-12-27 21:48:23.120562962 -0800 @@ -0,0 +1,226 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_profile.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include "mthca_profile.h" + +static int default_profile[MTHCA_RES_NUM] = { + [MTHCA_RES_QP] = 1 << 16, + [MTHCA_RES_EQP] = 1 << 16, + [MTHCA_RES_CQ] = 1 << 16, + [MTHCA_RES_EQ] = 32, + [MTHCA_RES_RDB] = 1 << 18, + [MTHCA_RES_MCG] = 1 << 13, + [MTHCA_RES_MPT] = 1 << 17, + [MTHCA_RES_MTT] = 1 << 20, + [MTHCA_RES_UDAV] = 1 << 15 +}; + +enum { + MTHCA_RDB_ENTRY_SIZE = 32, + MTHCA_MTT_SEG_SIZE = 64 +}; + +enum { + MTHCA_NUM_PDS = 1 << 15 +}; + +int mthca_make_profile(struct mthca_dev *dev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca) +{ + /* just use default profile for now */ + struct mthca_resource { + u64 size; + u64 start; + int type; + int num; + int log_num; + }; + + u64 total_size = 0; + struct mthca_resource *profile; + struct mthca_resource tmp; + int i, j; + + default_profile[MTHCA_RES_UAR] = dev_lim->uar_size / PAGE_SIZE; + + profile = kmalloc(MTHCA_RES_NUM * sizeof *profile, GFP_KERNEL); + if (!profile) + return -ENOMEM; + + profile[MTHCA_RES_QP].size = dev_lim->qpc_entry_sz; + profile[MTHCA_RES_EEC].size = dev_lim->eec_entry_sz; + profile[MTHCA_RES_SRQ].size = dev_lim->srq_entry_sz; + profile[MTHCA_RES_CQ].size = dev_lim->cqc_entry_sz; + profile[MTHCA_RES_EQP].size = dev_lim->eqpc_entry_sz; + profile[MTHCA_RES_EEEC].size = dev_lim->eeec_entry_sz; + profile[MTHCA_RES_EQ].size = dev_lim->eqc_entry_sz; + profile[MTHCA_RES_RDB].size = MTHCA_RDB_ENTRY_SIZE; + profile[MTHCA_RES_MCG].size = MTHCA_MGM_ENTRY_SIZE; + profile[MTHCA_RES_MPT].size = MTHCA_MPT_ENTRY_SIZE; + profile[MTHCA_RES_MTT].size = MTHCA_MTT_SEG_SIZE; + profile[MTHCA_RES_UAR].size = dev_lim->uar_scratch_entry_sz; + profile[MTHCA_RES_UDAV].size = MTHCA_AV_SIZE; + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + profile[i].type = i; + profile[i].num = default_profile[i]; + profile[i].log_num = max(ffs(default_profile[i]) - 1, 0); + profile[i].size *= default_profile[i]; + } + + /* + * Sort the resources in decreasing order of size. Since they + * all have sizes that are powers of 2, we'll be able to keep + * resources aligned to their size and pack them without gaps + * using the sorted order. 
+ */ + for (i = MTHCA_RES_NUM; i > 0; --i) + for (j = 1; j < i; ++j) { + if (profile[j].size > profile[j - 1].size) { + tmp = profile[j]; + profile[j] = profile[j - 1]; + profile[j - 1] = tmp; + } + } + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + if (profile[i].size) { + profile[i].start = dev->ddr_start + total_size; + total_size += profile[i].size; + } + if (total_size > dev->fw.tavor.fw_start - dev->ddr_start) { + mthca_err(dev, "Profile requires 0x%llx bytes; " + "won't fit between DDR start at 0x%016llx " + "and FW start at 0x%016llx.\n", + (unsigned long long) total_size, + (unsigned long long) dev->ddr_start, + (unsigned long long) dev->fw.tavor.fw_start); + kfree(profile); + return -ENOMEM; + } + + if (profile[i].size) + mthca_dbg(dev, "profile[%2d]--%2d/%2d @ 0x%16llx " + "(size 0x%8llx)\n", + i, profile[i].type, profile[i].log_num, + (unsigned long long) profile[i].start, + (unsigned long long) profile[i].size); + } + + mthca_dbg(dev, "HCA memory: allocated %d KB/%d KB (%d KB free)\n", + (int) (total_size >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start) >> 10), + (int) ((dev->fw.tavor.fw_start - dev->ddr_start - total_size) >> 10)); + + for (i = 0; i < MTHCA_RES_NUM; ++i) { + switch (profile[i].type) { + case MTHCA_RES_QP: + dev->limits.num_qps = profile[i].num; + init_hca->qpc_base = profile[i].start; + init_hca->log_num_qps = profile[i].log_num; + break; + case MTHCA_RES_EEC: + dev->limits.num_eecs = profile[i].num; + init_hca->eec_base = profile[i].start; + init_hca->log_num_eecs = profile[i].log_num; + break; + case MTHCA_RES_SRQ: + dev->limits.num_srqs = profile[i].num; + init_hca->srqc_base = profile[i].start; + init_hca->log_num_srqs = profile[i].log_num; + break; + case MTHCA_RES_CQ: + dev->limits.num_cqs = profile[i].num; + init_hca->cqc_base = profile[i].start; + init_hca->log_num_cqs = profile[i].log_num; + break; + case MTHCA_RES_EQP: + init_hca->eqpc_base = profile[i].start; + break; + case MTHCA_RES_EEEC: + init_hca->eeec_base = profile[i].start; + break; + case MTHCA_RES_EQ: + dev->limits.num_eqs = profile[i].num; + init_hca->eqc_base = profile[i].start; + init_hca->log_num_eqs = profile[i].log_num; + break; + case MTHCA_RES_RDB: + dev->limits.num_rdbs = profile[i].num; + init_hca->rdb_base = profile[i].start; + break; + case MTHCA_RES_MCG: + dev->limits.num_mgms = profile[i].num >> 1; + dev->limits.num_amgms = profile[i].num >> 1; + init_hca->mc_base = profile[i].start; + init_hca->log_mc_entry_sz = ffs(MTHCA_MGM_ENTRY_SIZE) - 1; + init_hca->log_mc_table_sz = profile[i].log_num; + init_hca->mc_hash_sz = 1 << (profile[i].log_num - 1); + break; + case MTHCA_RES_MPT: + dev->limits.num_mpts = profile[i].num; + init_hca->mpt_base = profile[i].start; + init_hca->log_mpt_sz = profile[i].log_num; + break; + case MTHCA_RES_MTT: + dev->limits.num_mtt_segs = profile[i].num; + dev->limits.mtt_seg_size = MTHCA_MTT_SEG_SIZE; + dev->mr_table.mtt_base = profile[i].start; + init_hca->mtt_base = profile[i].start; + init_hca->mtt_seg_sz = ffs(MTHCA_MTT_SEG_SIZE) - 7; + break; + case MTHCA_RES_UAR: + init_hca->uar_scratch_base = profile[i].start; + break; + case MTHCA_RES_UDAV: + dev->av_table.ddr_av_base = profile[i].start; + dev->av_table.num_ddr_avs = profile[i].num; + default: + break; + } + } + + /* + * PDs don't take any HCA memory, but we assign them as part + * of the HCA profile anyway. 
+ */ + dev->limits.num_pds = MTHCA_NUM_PDS; + + kfree(profile); + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_profile.h 2004-12-27 21:48:23.154557958 -0800 @@ -0,0 +1,62 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_profile.h 1349 2004-12-16 21:09:43Z roland $ + */ + +#ifndef MTHCA_PROFILE_H +#define MTHCA_PROFILE_H + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_RES_QP, + MTHCA_RES_EEC, + MTHCA_RES_SRQ, + MTHCA_RES_CQ, + MTHCA_RES_EQP, + MTHCA_RES_EEEC, + MTHCA_RES_EQ, + MTHCA_RES_RDB, + MTHCA_RES_MCG, + MTHCA_RES_MPT, + MTHCA_RES_MTT, + MTHCA_RES_UAR, + MTHCA_RES_UDAV, + MTHCA_RES_NUM +}; + +int mthca_make_profile(struct mthca_dev *mdev, + struct mthca_dev_lim *dev_lim, + struct mthca_init_hca_param *init_hca); + +#endif /* MTHCA_PROFILE_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_reset.c 2004-12-27 21:48:23.199551335 -0800 @@ -0,0 +1,232 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_reset.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +int mthca_reset(struct mthca_dev *mdev) +{ + int i; + int err = 0; + u32 *hca_header = NULL; + u32 *bridge_header = NULL; + struct pci_dev *bridge = NULL; + +#define MTHCA_RESET_OFFSET 0xf0010 +#define MTHCA_RESET_VALUE cpu_to_be32(1) + + /* + * Reset the chip. This is somewhat ugly because we have to + * save off the PCI header before reset and then restore it + * after the chip reboots. We skip config space offsets 22 + * and 23 since those have a special meaning. + * + * To make matters worse, for Tavor (PCI-X HCA) we have to + * find the associated bridge device and save off its PCI + * header as well. + */ + + if (mdev->hca_type == TAVOR) { + /* Look for the bridge -- its device ID will be 2 more + than HCA's device ID. */ + while ((bridge = pci_get_device(mdev->pdev->vendor, + mdev->pdev->device + 2, + bridge)) != NULL) { + if (bridge->hdr_type == PCI_HEADER_TYPE_BRIDGE && + bridge->subordinate == mdev->pdev->bus) { + mthca_dbg(mdev, "Found bridge: %s (%s)\n", + pci_pretty_name(bridge), pci_name(bridge)); + break; + } + } + + if (!bridge) { + /* + * Didn't find a bridge for a Tavor device -- + * assume we're in no-bridge mode and hope for + * the best. + */ + mthca_warn(mdev, "No bridge found for %s (%s)\n", + pci_pretty_name(mdev->pdev), pci_name(mdev->pdev)); + } + + } + + /* For Arbel do we need to save off the full 4K PCI Express header?? */ + hca_header = kmalloc(256, GFP_KERNEL); + if (!hca_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(mdev->pdev, i * 4, hca_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA " + "PCI header, aborting.\n"); + goto out; + } + } + + if (bridge) { + bridge_header = kmalloc(256, GFP_KERNEL); + if (!bridge_header) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't allocate memory to save HCA " + "bridge PCI header, aborting.\n"); + goto out; + } + + for (i = 0; i < 64; ++i) { + if (i == 22 || i == 23) + continue; + if (pci_read_config_dword(bridge, i * 4, bridge_header + i)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't save HCA bridge " + "PCI header, aborting.\n"); + goto out; + } + } + } + + /* actually hit reset */ + { + void __iomem *reset = ioremap(pci_resource_start(mdev->pdev, 0) + + MTHCA_RESET_OFFSET, 4); + + if (!reset) { + err = -ENOMEM; + mthca_err(mdev, "Couldn't map HCA reset register, " + "aborting.\n"); + goto out; + } + + writel(MTHCA_RESET_VALUE, reset); + iounmap(reset); + } + + /* Docs say to wait one second before accessing device */ + msleep(1000); + + /* Now wait for PCI device to start responding again */ + { + u32 v; + int c = 0; + + for (c = 0; c < 100; ++c) { + if (pci_read_config_dword(bridge ? 
bridge : mdev->pdev, 0, &v)) { + err = -ENODEV; + mthca_err(mdev, "Couldn't access HCA after reset, " + "aborting.\n"); + goto out; + } + + if (v != 0xffffffff) + goto good; + + msleep(100); + } + + err = -ENODEV; + mthca_err(mdev, "PCI device did not come back after reset, " + "aborting.\n"); + goto out; + } + +good: + /* Now restore the PCI headers */ + if (bridge) { + /* + * Bridge control register is at 0x3e, so we'll + * naturally restore it last in this loop. + */ + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(bridge, i * 4, bridge_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(bridge, PCI_COMMAND, + bridge_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA bridge COMMAND, " + "aborting.\n"); + goto out; + } + } + + for (i = 0; i < 16; ++i) { + if (i * 4 == PCI_COMMAND) + continue; + + if (pci_write_config_dword(mdev->pdev, i * 4, hca_header[i])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA reg %x, " + "aborting.\n", i); + goto out; + } + } + + if (pci_write_config_dword(mdev->pdev, PCI_COMMAND, + hca_header[PCI_COMMAND / 4])) { + err = -ENODEV; + mthca_err(mdev, "Couldn't restore HCA COMMAND, " + "aborting.\n"); + goto out; + } + +out: + if (bridge) + pci_dev_put(bridge); + kfree(bridge_header); + kfree(hca_header); + + return err; +} From roland at topspin.com Mon Dec 27 21:51:10 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:10 -0800 Subject: [openib-general] [PATCH][v5][14/24] Add Mellanox HCA low-level driver (QP/CQ) In-Reply-To: <200412272151.MaHN86hIK0Tt84ws@topspin.com> Message-ID: <200412272151.qdu1iD71iJs65qnW@topspin.com> Add CQ (completion queue) and QP (queue pair) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_cq.c 2004-12-27 21:48:23.509505711 -0800 @@ -0,0 +1,836 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_cq.c 1369 2004-12-20 16:17:07Z roland $ + */ + +#include + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_CQ_SIZE = 4 * PAGE_SIZE +}; + +enum { + MTHCA_CQ_ENTRY_SIZE = 0x20 +}; + +/* + * Must be packed because start is 64 bits but only aligned to 32 bits. + */ +struct mthca_cq_context { + u32 flags; + u64 start; + u32 logsize_usrpage; + u32 error_eqn; + u32 comp_eqn; + u32 pd; + u32 lkey; + u32 last_notified_index; + u32 solicit_producer_index; + u32 consumer_index; + u32 producer_index; + u32 cqn; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_CQ_STATUS_OK ( 0 << 28) +#define MTHCA_CQ_STATUS_OVERFLOW ( 9 << 28) +#define MTHCA_CQ_STATUS_WRITE_FAIL (10 << 28) +#define MTHCA_CQ_FLAG_TR ( 1 << 18) +#define MTHCA_CQ_FLAG_OI ( 1 << 17) +#define MTHCA_CQ_STATE_DISARMED ( 0 << 8) +#define MTHCA_CQ_STATE_ARMED ( 1 << 8) +#define MTHCA_CQ_STATE_ARMED_SOL ( 4 << 8) +#define MTHCA_EQ_STATE_FIRED (10 << 8) + +enum { + MTHCA_ERROR_CQE_OPCODE_MASK = 0xfe +}; + +enum { + SYNDROME_LOCAL_LENGTH_ERR = 0x01, + SYNDROME_LOCAL_QP_OP_ERR = 0x02, + SYNDROME_LOCAL_EEC_OP_ERR = 0x03, + SYNDROME_LOCAL_PROT_ERR = 0x04, + SYNDROME_WR_FLUSH_ERR = 0x05, + SYNDROME_MW_BIND_ERR = 0x06, + SYNDROME_BAD_RESP_ERR = 0x10, + SYNDROME_LOCAL_ACCESS_ERR = 0x11, + SYNDROME_REMOTE_INVAL_REQ_ERR = 0x12, + SYNDROME_REMOTE_ACCESS_ERR = 0x13, + SYNDROME_REMOTE_OP_ERR = 0x14, + SYNDROME_RETRY_EXC_ERR = 0x15, + SYNDROME_RNR_RETRY_EXC_ERR = 0x16, + SYNDROME_LOCAL_RDD_VIOL_ERR = 0x20, + SYNDROME_REMOTE_INVAL_RD_REQ_ERR = 0x21, + SYNDROME_REMOTE_ABORTED_ERR = 0x22, + SYNDROME_INVAL_EECN_ERR = 0x23, + SYNDROME_INVAL_EEC_STATE_ERR = 0x24 +}; + +struct mthca_cqe { + u32 my_qpn; + u32 my_ee; + u32 rqpn; + u16 sl_g_mlpath; + u16 rlid; + u32 imm_etype_pkey_eec; + u32 byte_cnt; + u32 wqe; + u8 opcode; + u8 is_send; + u8 reserved; + u8 owner; +}; + +struct mthca_err_cqe { + u32 my_qpn; + u32 reserved1[3]; + u8 syndrome; + u8 reserved2; + u16 db_cnt; + u32 reserved3; + u32 wqe; + u8 opcode; + u8 reserved4[2]; + u8 owner; +}; + +#define MTHCA_CQ_ENTRY_OWNER_SW (0 << 7) +#define MTHCA_CQ_ENTRY_OWNER_HW (1 << 7) + +#define MTHCA_CQ_DB_INC_CI (1 << 24) +#define MTHCA_CQ_DB_REQ_NOT (2 << 24) +#define MTHCA_CQ_DB_REQ_NOT_SOL (3 << 24) +#define MTHCA_CQ_DB_SET_CI (4 << 24) +#define MTHCA_CQ_DB_REQ_NOT_MULT (5 << 24) + +static inline struct mthca_cqe *get_cqe(struct mthca_cq *cq, int entry) +{ + if (cq->is_direct) + return cq->queue.direct.buf + (entry * MTHCA_CQ_ENTRY_SIZE); + else + return cq->queue.page_list[entry * MTHCA_CQ_ENTRY_SIZE / PAGE_SIZE].buf + + (entry * MTHCA_CQ_ENTRY_SIZE) % PAGE_SIZE; +} + +static inline int cqe_sw(struct mthca_cq *cq, int i) +{ + return !(MTHCA_CQ_ENTRY_OWNER_HW & + get_cqe(cq, i)->owner); +} + +static inline int next_cqe_sw(struct mthca_cq *cq) +{ + return cqe_sw(cq, cq->cons_index); +} + +static inline void set_cqe_hw(struct mthca_cq *cq, int entry) +{ + get_cqe(cq, entry)->owner = MTHCA_CQ_ENTRY_OWNER_HW; +} + +static inline void inc_cons_index(struct mthca_dev *dev, struct mthca_cq *cq, + int nent) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(MTHCA_CQ_DB_INC_CI | cq->cqn); + doorbell[1] = cpu_to_be32(nent - 1); + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +{ + struct mthca_cq *cq; + + spin_lock(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + 
atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Completion event for bogus CQ %08x\n", cqn); + return; + } + + cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn) +{ + struct mthca_cq *cq; + struct mthca_cqe *cqe; + int prod_index; + int nfreed = 0; + + spin_lock_irq(&dev->cq_table.lock); + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + if (cq) + atomic_inc(&cq->refcount); + spin_unlock_irq(&dev->cq_table.lock); + + if (!cq) + return; + + spin_lock_irq(&cq->lock); + + /* + * First we need to find the current producer index, so we + * know where to start cleaning from. It doesn't matter if HW + * adds new entries after this loop -- the QP we're worried + * about is already in RESET, so the new entries won't come + * from our QP and therefore don't need to be checked. + */ + for (prod_index = cq->cons_index; + cqe_sw(cq, prod_index & cq->ibcq.cqe); + ++prod_index) + if (prod_index == cq->cons_index + cq->ibcq.cqe) + break; + + if (0) + mthca_dbg(dev, "Cleaning QPN %06x from CQN %06x; ci %d, pi %d\n", + qpn, cqn, cq->cons_index, prod_index); + + /* + * Now sweep backwards through the CQ, removing CQ entries + * that match our QP by copying older entries on top of them. + */ + while (prod_index > cq->cons_index) { + cqe = get_cqe(cq, (prod_index - 1) & cq->ibcq.cqe); + if (cqe->my_qpn == cpu_to_be32(qpn)) + ++nfreed; + else if (nfreed) + memcpy(get_cqe(cq, (prod_index - 1 + nfreed) & + cq->ibcq.cqe), + cqe, + MTHCA_CQ_ENTRY_SIZE); + --prod_index; + } + + if (nfreed) { + wmb(); + inc_cons_index(dev, cq, nfreed); + cq->cons_index = (cq->cons_index + nfreed) & cq->ibcq.cqe; + } + + spin_unlock_irq(&cq->lock); + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + +static int handle_error_cqe(struct mthca_dev *dev, struct mthca_cq *cq, + struct mthca_qp *qp, int wqe_index, int is_send, + struct mthca_err_cqe *cqe, + struct ib_wc *entry, int *free_cqe) +{ + int err; + int dbd; + u32 new_wqe; + + if (1 && cqe->syndrome != SYNDROME_WR_FLUSH_ERR) { + int j; + + mthca_dbg(dev, "%x/%d: error CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); + } + + /* + * For completions in error, only work request ID, status (and + * freed resource count for RD) have to be set. 
+ */ + switch (cqe->syndrome) { + case SYNDROME_LOCAL_LENGTH_ERR: + entry->status = IB_WC_LOC_LEN_ERR; + break; + case SYNDROME_LOCAL_QP_OP_ERR: + entry->status = IB_WC_LOC_QP_OP_ERR; + break; + case SYNDROME_LOCAL_EEC_OP_ERR: + entry->status = IB_WC_LOC_EEC_OP_ERR; + break; + case SYNDROME_LOCAL_PROT_ERR: + entry->status = IB_WC_LOC_PROT_ERR; + break; + case SYNDROME_WR_FLUSH_ERR: + entry->status = IB_WC_WR_FLUSH_ERR; + break; + case SYNDROME_MW_BIND_ERR: + entry->status = IB_WC_MW_BIND_ERR; + break; + case SYNDROME_BAD_RESP_ERR: + entry->status = IB_WC_BAD_RESP_ERR; + break; + case SYNDROME_LOCAL_ACCESS_ERR: + entry->status = IB_WC_LOC_ACCESS_ERR; + break; + case SYNDROME_REMOTE_INVAL_REQ_ERR: + entry->status = IB_WC_REM_INV_REQ_ERR; + break; + case SYNDROME_REMOTE_ACCESS_ERR: + entry->status = IB_WC_REM_ACCESS_ERR; + break; + case SYNDROME_REMOTE_OP_ERR: + entry->status = IB_WC_REM_OP_ERR; + break; + case SYNDROME_RETRY_EXC_ERR: + entry->status = IB_WC_RETRY_EXC_ERR; + break; + case SYNDROME_RNR_RETRY_EXC_ERR: + entry->status = IB_WC_RNR_RETRY_EXC_ERR; + break; + case SYNDROME_LOCAL_RDD_VIOL_ERR: + entry->status = IB_WC_LOC_RDD_VIOL_ERR; + break; + case SYNDROME_REMOTE_INVAL_RD_REQ_ERR: + entry->status = IB_WC_REM_INV_RD_REQ_ERR; + break; + case SYNDROME_REMOTE_ABORTED_ERR: + entry->status = IB_WC_REM_ABORT_ERR; + break; + case SYNDROME_INVAL_EECN_ERR: + entry->status = IB_WC_INV_EECN_ERR; + break; + case SYNDROME_INVAL_EEC_STATE_ERR: + entry->status = IB_WC_INV_EEC_STATE_ERR; + break; + default: + entry->status = IB_WC_GENERAL_ERR; + break; + } + + err = mthca_free_err_wqe(qp, is_send, wqe_index, &dbd, &new_wqe); + if (err) + return err; + + /* + * If we're at the end of the WQE chain, or we've used up our + * doorbell count, free the CQE. Otherwise just update it for + * the next poll operation. 
+ */ + if (!(new_wqe & cpu_to_be32(0x3f)) || (!cqe->db_cnt && dbd)) + return 0; + + cqe->db_cnt = cpu_to_be16(be16_to_cpu(cqe->db_cnt) - dbd); + cqe->wqe = new_wqe; + cqe->syndrome = SYNDROME_WR_FLUSH_ERR; + + *free_cqe = 0; + + return 0; +} + +static void dump_cqe(struct mthca_cqe *cqe) +{ + int j; + + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) cqe)[j])); +} + +static inline int mthca_poll_one(struct mthca_dev *dev, + struct mthca_cq *cq, + struct mthca_qp **cur_qp, + int *freed, + struct ib_wc *entry) +{ + struct mthca_wq *wq; + struct mthca_cqe *cqe; + int wqe_index; + int is_error = 0; + int is_send; + int free_cqe = 1; + int err = 0; + + if (!next_cqe_sw(cq)) + return -EAGAIN; + + rmb(); + + cqe = get_cqe(cq, cq->cons_index); + + if (0) { + mthca_dbg(dev, "%x/%d: CQE -> QPN %06x, WQE @ %08x\n", + cq->cqn, cq->cons_index, be32_to_cpu(cqe->my_qpn), + be32_to_cpu(cqe->wqe)); + + dump_cqe(cqe); + } + + if ((cqe->opcode & MTHCA_ERROR_CQE_OPCODE_MASK) == + MTHCA_ERROR_CQE_OPCODE_MASK) { + is_error = 1; + is_send = cqe->opcode & 1; + } else + is_send = cqe->is_send & 0x80; + + if (!*cur_qp || be32_to_cpu(cqe->my_qpn) != (*cur_qp)->qpn) { + if (*cur_qp) { + if (*freed) { + wmb(); + inc_cons_index(dev, cq, *freed); + *freed = 0; + } + spin_unlock(&(*cur_qp)->lock); + if (atomic_dec_and_test(&(*cur_qp)->refcount)) + wake_up(&(*cur_qp)->wait); + } + + spin_lock(&dev->qp_table.lock); + *cur_qp = mthca_array_get(&dev->qp_table.qp, + be32_to_cpu(cqe->my_qpn) & + (dev->limits.num_qps - 1)); + if (*cur_qp) + atomic_inc(&(*cur_qp)->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!*cur_qp) { + mthca_warn(dev, "CQ entry for unknown QP %06x\n", + be32_to_cpu(cqe->my_qpn) & 0xffffff); + err = -EINVAL; + goto out; + } + + spin_lock(&(*cur_qp)->lock); + } + + if (is_send) { + wq = &(*cur_qp)->sq; + wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) + >> wq->wqe_shift); + entry->wr_id = (*cur_qp)->wrid[wqe_index + + (*cur_qp)->rq.max]; + } else { + wq = &(*cur_qp)->rq; + wqe_index = be32_to_cpu(cqe->wqe) >> wq->wqe_shift; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; + } + + if (wq->last_comp < wqe_index) + wq->cur -= wqe_index - wq->last_comp; + else + wq->cur -= wq->max - wq->last_comp + wqe_index; + + wq->last_comp = wqe_index; + + if (0) + mthca_dbg(dev, "%s completion for QP %06x, index %d (nr %d)\n", + is_send ? 
"Send" : "Receive", + (*cur_qp)->qpn, wqe_index, wq->max); + + if (is_error) { + err = handle_error_cqe(dev, cq, *cur_qp, wqe_index, is_send, + (struct mthca_err_cqe *) cqe, + entry, &free_cqe); + goto out; + } + + if (is_send) { + entry->opcode = IB_WC_SEND; /* XXX */ + } else { + entry->byte_len = be32_to_cpu(cqe->byte_cnt); + switch (cqe->opcode & 0x1f) { + case IB_OPCODE_SEND_LAST_WITH_IMMEDIATE: + case IB_OPCODE_SEND_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV; + break; + case IB_OPCODE_RDMA_WRITE_LAST_WITH_IMMEDIATE: + case IB_OPCODE_RDMA_WRITE_ONLY_WITH_IMMEDIATE: + entry->wc_flags = IB_WC_WITH_IMM; + entry->imm_data = cqe->imm_etype_pkey_eec; + entry->opcode = IB_WC_RECV_RDMA_WITH_IMM; + break; + default: + entry->wc_flags = 0; + entry->opcode = IB_WC_RECV; + break; + } + entry->slid = be16_to_cpu(cqe->rlid); + entry->sl = be16_to_cpu(cqe->sl_g_mlpath) >> 12; + entry->src_qp = be32_to_cpu(cqe->rqpn) & 0xffffff; + entry->dlid_path_bits = be16_to_cpu(cqe->sl_g_mlpath) & 0x7f; + entry->pkey_index = be32_to_cpu(cqe->imm_etype_pkey_eec) >> 16; + entry->wc_flags |= be16_to_cpu(cqe->sl_g_mlpath) & 0x80 ? + IB_WC_GRH : 0; + } + + entry->status = IB_WC_SUCCESS; + + out: + if (free_cqe) { + set_cqe_hw(cq, cq->cons_index); + ++(*freed); + cq->cons_index = (cq->cons_index + 1) & cq->ibcq.cqe; + } + + return err; +} + +int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, + struct ib_wc *entry) +{ + struct mthca_dev *dev = to_mdev(ibcq->device); + struct mthca_cq *cq = to_mcq(ibcq); + struct mthca_qp *qp = NULL; + unsigned long flags; + int err = 0; + int freed = 0; + int npolled; + + spin_lock_irqsave(&cq->lock, flags); + + for (npolled = 0; npolled < num_entries; ++npolled) { + err = mthca_poll_one(dev, cq, &qp, + &freed, entry + npolled); + if (err) + break; + } + + if (freed) { + wmb(); + inc_cons_index(dev, cq, freed); + } + + if (qp) { + spin_unlock(&qp->lock); + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + + + spin_unlock_irqrestore(&cq->lock, flags); + + return err == 0 || err == -EAGAIN ? npolled : err; +} + +void mthca_arm_cq(struct mthca_dev *dev, struct mthca_cq *cq, + int solicited) +{ + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((solicited ? 
+ MTHCA_CQ_DB_REQ_NOT_SOL : + MTHCA_CQ_DB_REQ_NOT) | + cq->cqn); + doorbell[1] = 0xffffffff; + + mthca_write64(doorbell, + dev->kar + MTHCA_CQ_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); +} + +int mthca_init_cq(struct mthca_dev *dev, int nent, + struct mthca_cq *cq) +{ + int size = nent * MTHCA_CQ_ENTRY_SIZE; + dma_addr_t t; + void *mailbox = NULL; + int npages, shift; + u64 *dma_list = NULL; + struct mthca_cq_context *cq_context; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out; + + cq_context = MAILBOX_ALIGN(mailbox); + + if (size <= MTHCA_MAX_DIRECT_CQ_SIZE) { + if (0) + mthca_dbg(dev, "Creating direct CQ of size %d\n", size); + + cq->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + cq->queue.direct.buf = pci_alloc_consistent(dev->pdev, + size, &t); + if (!cq->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&cq->queue.direct, mapping, t); + + memset(cq->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + cq->is_direct = 0; + npages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect CQ with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + cq->queue.page_list = kmalloc(npages * sizeof *cq->queue.page_list, + GFP_KERNEL); + if (!cq->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) + cq->queue.page_list[i].buf = NULL; + + for (i = 0; i < npages; ++i) { + cq->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!cq->queue.page_list[i].buf) + goto err_out_free; + + dma_list[i] = t; + pci_unmap_addr_set(&cq->queue.page_list[i], mapping, t); + + memset(cq->queue.page_list[i].buf, 0, PAGE_SIZE); + } + } + + for (i = 0; i < nent; ++i) + set_cqe_hw(cq, i); + + cq->cqn = mthca_alloc(&dev->cq_table.alloc); + if (cq->cqn == -1) + goto err_out_free; + + err = mthca_mr_alloc_phys(dev, dev->driver_pd.pd_num, + dma_list, shift, npages, + 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &cq->mr); + if (err) + goto err_out_free_cq; + + spin_lock_init(&cq->lock); + atomic_set(&cq->refcount, 1); + init_waitqueue_head(&cq->wait); + + memset(cq_context, 0, sizeof *cq_context); + cq_context->flags = cpu_to_be32(MTHCA_CQ_STATUS_OK | + MTHCA_CQ_STATE_DISARMED | + MTHCA_CQ_FLAG_TR); + cq_context->start = cpu_to_be64(0); + cq_context->logsize_usrpage = cpu_to_be32((ffs(nent) - 1) << 24 | + MTHCA_KAR_PAGE); + cq_context->error_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_ASYNC].eqn); + cq_context->comp_eqn = cpu_to_be32(dev->eq_table.eq[MTHCA_EQ_COMP].eqn); + cq_context->pd = cpu_to_be32(dev->driver_pd.pd_num); + cq_context->lkey = cpu_to_be32(cq->mr.ibmr.lkey); + cq_context->cqn = cpu_to_be32(cq->cqn); + + err = mthca_SW2HW_CQ(dev, cq_context, cq->cqn, &status); + if (err) { + mthca_warn(dev, "SW2HW_CQ failed (%d)\n", err); + goto err_out_free_mr; + } + + if (status) { + mthca_warn(dev, "SW2HW_CQ returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_free_mr; + } + + spin_lock_irq(&dev->cq_table.lock); + if (mthca_array_set(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1), + cq)) { + 
spin_unlock_irq(&dev->cq_table.lock); + goto err_out_free_mr; + } + spin_unlock_irq(&dev->cq_table.lock); + + cq->cons_index = 0; + + kfree(dma_list); + kfree(mailbox); + + return 0; + + err_out_free_mr: + mthca_free_mr(dev, &cq->mr); + + err_out_free_cq: + mthca_free(&dev->cq_table.alloc, cq->cqn); + + err_out_free: + if (cq->is_direct) + pci_free_consistent(dev->pdev, size, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, mapping)); + else { + for (i = 0; i < npages; ++i) + if (cq->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + err_out: + kfree(dma_list); + kfree(mailbox); + + return err; +} + +void mthca_free_cq(struct mthca_dev *dev, + struct mthca_cq *cq) +{ + void *mailbox; + int err; + u8 status; + + might_sleep(); + + mailbox = kmalloc(sizeof (struct mthca_cq_context) + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_warn(dev, "No memory for mailbox to free CQ.\n"); + return; + } + + err = mthca_HW2SW_CQ(dev, MAILBOX_ALIGN(mailbox), cq->cqn, &status); + if (err) + mthca_warn(dev, "HW2SW_CQ failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n", + status); + + if (0) { + u32 *ctx = MAILBOX_ALIGN(mailbox); + int j; + + printk(KERN_ERR "context for CQN %x\n", cq->cqn); + for (j = 0; j < 16; ++j) + printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j])); + } + + spin_lock_irq(&dev->cq_table.lock); + mthca_array_clear(&dev->cq_table.cq, + cq->cqn & (dev->limits.num_cqs - 1)); + spin_unlock_irq(&dev->cq_table.lock); + + atomic_dec(&cq->refcount); + wait_event(cq->wait, !atomic_read(&cq->refcount)); + + mthca_free_mr(dev, &cq->mr); + + if (cq->is_direct) + pci_free_consistent(dev->pdev, + (cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE, + cq->queue.direct.buf, + pci_unmap_addr(&cq->queue.direct, + mapping)); + else { + int i; + + for (i = 0; + i < ((cq->ibcq.cqe + 1) * MTHCA_CQ_ENTRY_SIZE + PAGE_SIZE - 1) / + PAGE_SIZE; + ++i) + pci_free_consistent(dev->pdev, PAGE_SIZE, + cq->queue.page_list[i].buf, + pci_unmap_addr(&cq->queue.page_list[i], + mapping)); + + kfree(cq->queue.page_list); + } + + mthca_free(&dev->cq_table.alloc, cq->cqn); + kfree(mailbox); +} + +int __devinit mthca_init_cq_table(struct mthca_dev *dev) +{ + int err; + + spin_lock_init(&dev->cq_table.lock); + + err = mthca_alloc_init(&dev->cq_table.alloc, + dev->limits.num_cqs, + (1 << 24) - 1, + dev->limits.reserved_cqs); + if (err) + return err; + + err = mthca_array_init(&dev->cq_table.cq, + dev->limits.num_cqs); + if (err) + mthca_alloc_cleanup(&dev->cq_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_cq_table(struct mthca_dev *dev) +{ + mthca_array_cleanup(&dev->cq_table.cq, dev->limits.num_cqs); + mthca_alloc_cleanup(&dev->cq_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_qp.c 2004-12-27 21:48:23.540501149 -0800 @@ -0,0 +1,1536 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_qp.c 1355 2004-12-17 15:23:43Z roland $ + */ + +#include + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_MAX_DIRECT_QP_SIZE = 4 * PAGE_SIZE, + MTHCA_ACK_REQ_FREQ = 10, + MTHCA_FLIGHT_LIMIT = 9, + MTHCA_UD_HEADER_SIZE = 72 /* largest UD header possible */ +}; + +enum { + MTHCA_QP_STATE_RST = 0, + MTHCA_QP_STATE_INIT = 1, + MTHCA_QP_STATE_RTR = 2, + MTHCA_QP_STATE_RTS = 3, + MTHCA_QP_STATE_SQE = 4, + MTHCA_QP_STATE_SQD = 5, + MTHCA_QP_STATE_ERR = 6, + MTHCA_QP_STATE_DRAINING = 7 +}; + +enum { + MTHCA_QP_ST_RC = 0x0, + MTHCA_QP_ST_UC = 0x1, + MTHCA_QP_ST_RD = 0x2, + MTHCA_QP_ST_UD = 0x3, + MTHCA_QP_ST_MLX = 0x7 +}; + +enum { + MTHCA_QP_PM_MIGRATED = 0x3, + MTHCA_QP_PM_ARMED = 0x0, + MTHCA_QP_PM_REARM = 0x1 +}; + +enum { + /* qp_context flags */ + MTHCA_QP_BIT_DE = 1 << 8, + /* params1 */ + MTHCA_QP_BIT_SRE = 1 << 15, + MTHCA_QP_BIT_SWE = 1 << 14, + MTHCA_QP_BIT_SAE = 1 << 13, + MTHCA_QP_BIT_SIC = 1 << 4, + MTHCA_QP_BIT_SSC = 1 << 3, + /* params2 */ + MTHCA_QP_BIT_RRE = 1 << 15, + MTHCA_QP_BIT_RWE = 1 << 14, + MTHCA_QP_BIT_RAE = 1 << 13, + MTHCA_QP_BIT_RIC = 1 << 4, + MTHCA_QP_BIT_RSC = 1 << 3 +}; + +struct mthca_qp_path { + u32 port_pkey; + u8 rnr_retry; + u8 g_mylmc; + u16 rlid; + u8 ackto; + u8 mgid_index; + u8 static_rate; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u8 rgid[16]; +} __attribute__((packed)); + +struct mthca_qp_context { + u32 flags; + u32 sched_queue; + u32 mtu_msgmax; + u32 usr_page; + u32 local_qpn; + u32 remote_qpn; + u32 reserved1[2]; + struct mthca_qp_path pri_path; + struct mthca_qp_path alt_path; + u32 rdd; + u32 pd; + u32 wqe_base; + u32 wqe_lkey; + u32 params1; + u32 reserved2; + u32 next_send_psn; + u32 cqn_snd; + u32 next_snd_wqe[2]; + u32 last_acked_psn; + u32 ssn; + u32 params2; + u32 rnr_nextrecvpsn; + u32 ra_buff_indx; + u32 cqn_rcv; + u32 next_rcv_wqe[2]; + u32 qkey; + u32 srqn; + u32 rmsn; + u32 reserved3[19]; +} __attribute__((packed)); + +struct mthca_qp_param { + u32 opt_param_mask; + u32 reserved1; + struct mthca_qp_context context; + u32 reserved2[62]; +} __attribute__((packed)); + +enum { + MTHCA_QP_OPTPAR_ALT_ADDR_PATH = 1 << 0, + MTHCA_QP_OPTPAR_RRE = 1 << 1, + MTHCA_QP_OPTPAR_RAE = 1 << 2, + MTHCA_QP_OPTPAR_REW = 1 << 3, + MTHCA_QP_OPTPAR_PKEY_INDEX = 1 << 
4, + MTHCA_QP_OPTPAR_Q_KEY = 1 << 5, + MTHCA_QP_OPTPAR_RNR_TIMEOUT = 1 << 6, + MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH = 1 << 7, + MTHCA_QP_OPTPAR_SRA_MAX = 1 << 8, + MTHCA_QP_OPTPAR_RRA_MAX = 1 << 9, + MTHCA_QP_OPTPAR_PM_STATE = 1 << 10, + MTHCA_QP_OPTPAR_PORT_NUM = 1 << 11, + MTHCA_QP_OPTPAR_RETRY_COUNT = 1 << 12, + MTHCA_QP_OPTPAR_ALT_RNR_RETRY = 1 << 13, + MTHCA_QP_OPTPAR_ACK_TIMEOUT = 1 << 14, + MTHCA_QP_OPTPAR_RNR_RETRY = 1 << 15, + MTHCA_QP_OPTPAR_SCHED_QUEUE = 1 << 16 +}; + +enum { + MTHCA_OPCODE_NOP = 0x00, + MTHCA_OPCODE_RDMA_WRITE = 0x08, + MTHCA_OPCODE_RDMA_WRITE_IMM = 0x09, + MTHCA_OPCODE_SEND = 0x0a, + MTHCA_OPCODE_SEND_IMM = 0x0b, + MTHCA_OPCODE_RDMA_READ = 0x10, + MTHCA_OPCODE_ATOMIC_CS = 0x11, + MTHCA_OPCODE_ATOMIC_FA = 0x12, + MTHCA_OPCODE_BIND_MW = 0x18, + MTHCA_OPCODE_INVALID = 0xff +}; + +enum { + MTHCA_NEXT_DBD = 1 << 7, + MTHCA_NEXT_FENCE = 1 << 6, + MTHCA_NEXT_CQ_UPDATE = 1 << 3, + MTHCA_NEXT_EVENT_GEN = 1 << 2, + MTHCA_NEXT_SOLICIT = 1 << 1, + + MTHCA_MLX_VL15 = 1 << 17, + MTHCA_MLX_SLR = 1 << 16 +}; + +struct mthca_next_seg { + u32 nda_op; /* [31:6] next WQE [4:0] next opcode */ + u32 ee_nds; /* [31:8] next EE [7] DBD [6] F [5:0] next WQE size */ + u32 flags; /* [3] CQ [2] Event [1] Solicit */ + u32 imm; /* immediate data */ +}; + +struct mthca_ud_seg { + u32 reserved1; + u32 lkey; + u64 av_addr; + u32 reserved2[4]; + u32 dqpn; + u32 qkey; + u32 reserved3[2]; +}; + +struct mthca_bind_seg { + u32 flags; /* [31] Atomic [30] rem write [29] rem read */ + u32 reserved; + u32 new_rkey; + u32 lkey; + u64 addr; + u64 length; +}; + +struct mthca_raddr_seg { + u64 raddr; + u32 rkey; + u32 reserved; +}; + +struct mthca_atomic_seg { + u64 swap_add; + u64 compare; +}; + +struct mthca_data_seg { + u32 byte_count; + u32 lkey; + u64 addr; +}; + +struct mthca_mlx_seg { + u32 nda_op; + u32 nds; + u32 flags; /* [17] VL15 [16] SLR [14:12] static rate + [11:8] SL [3] C [2] E */ + u16 rlid; + u16 vcrc; +}; + +static int is_sqp(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 3; +} + +static int is_qp0(struct mthca_dev *dev, struct mthca_qp *qp) +{ + return qp->qpn >= dev->qp_table.sqp_start && + qp->qpn <= dev->qp_table.sqp_start + 1; +} + +static void *get_recv_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + (n << qp->rq.wqe_shift); + else + return qp->queue.page_list[(n << qp->rq.wqe_shift) >> PAGE_SHIFT].buf + + ((n << qp->rq.wqe_shift) & (PAGE_SIZE - 1)); +} + +static void *get_send_wqe(struct mthca_qp *qp, int n) +{ + if (qp->is_direct) + return qp->queue.direct.buf + qp->send_wqe_offset + + (n << qp->sq.wqe_shift); + else + return qp->queue.page_list[(qp->send_wqe_offset + + (n << qp->sq.wqe_shift)) >> + PAGE_SHIFT].buf + + ((qp->send_wqe_offset + (n << qp->sq.wqe_shift)) & + (PAGE_SIZE - 1)); +} + +void mthca_qp_event(struct mthca_dev *dev, u32 qpn, + enum ib_event_type event_type) +{ + struct mthca_qp *qp; + struct ib_event event; + + spin_lock(&dev->qp_table.lock); + qp = mthca_array_get(&dev->qp_table.qp, qpn & (dev->limits.num_qps - 1)); + if (qp) + atomic_inc(&qp->refcount); + spin_unlock(&dev->qp_table.lock); + + if (!qp) { + mthca_warn(dev, "Async event for bogus QP %08x\n", qpn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.qp = &qp->ibqp; + if (qp->ibqp.event_handler) + qp->ibqp.event_handler(&event, qp->ibqp.qp_context); + + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); +} + +static int 
to_mthca_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: return MTHCA_QP_STATE_RST; + case IB_QPS_INIT: return MTHCA_QP_STATE_INIT; + case IB_QPS_RTR: return MTHCA_QP_STATE_RTR; + case IB_QPS_RTS: return MTHCA_QP_STATE_RTS; + case IB_QPS_SQD: return MTHCA_QP_STATE_SQD; + case IB_QPS_SQE: return MTHCA_QP_STATE_SQE; + case IB_QPS_ERR: return MTHCA_QP_STATE_ERR; + default: return -1; + } +} + +enum { RC, UC, UD, RD, RDEE, MLX, NUM_TRANS }; + +static int to_mthca_st(int transport) +{ + switch (transport) { + case RC: return MTHCA_QP_ST_RC; + case UC: return MTHCA_QP_ST_UC; + case UD: return MTHCA_QP_ST_UD; + case RD: return MTHCA_QP_ST_RD; + case MLX: return MTHCA_QP_ST_MLX; + default: return -1; + } +} + +static const struct { + int trans; + u32 req_param[NUM_TRANS]; + u32 opt_param[NUM_TRANS]; +} state_table[IB_QPS_ERR + 1][IB_QPS_ERR + 1] = { + [IB_QPS_RESET] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_RST2INIT, + .req_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + }, + /* bug-for-bug compatibility with VAPI: */ + .opt_param = { + [MLX] = IB_QP_PORT + } + }, + }, + [IB_QPS_INIT] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_INIT] = { + .trans = MTHCA_TRANS_INIT2INIT, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_QKEY), + [RC] = (IB_QP_PKEY_INDEX | + IB_QP_PORT | + IB_QP_ACCESS_FLAGS), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + }, + [IB_QPS_RTR] = { + .trans = MTHCA_TRANS_INIT2RTR, + .req_param = { + [RC] = (IB_QP_AV | + IB_QP_PATH_MTU | + IB_QP_DEST_QPN | + IB_QP_RQ_PSN | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER), + }, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTR2RTS, + .req_param = { + [UD] = IB_QP_SQ_PSN, + [RC] = (IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_SQ_PSN | + IB_QP_MAX_QP_RD_ATOMIC), + [MLX] = IB_QP_SQ_PSN, + }, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_RTS] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_RTS2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_ACCESS_FLAGS | + IB_QP_ALT_PATH | + IB_QP_PATH_MIG_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_RTS2SQD, + }, + }, + [IB_QPS_SQD] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQD2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = 
(IB_QP_CUR_STATE | + IB_QP_QKEY), + } + }, + [IB_QPS_SQD] = { + .trans = MTHCA_TRANS_SQD2SQD, + .opt_param = { + [UD] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + [RC] = (IB_QP_AV | + IB_QP_TIMEOUT | + IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC | + IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_CUR_STATE | + IB_QP_ALT_PATH | + IB_QP_ACCESS_FLAGS | + IB_QP_PKEY_INDEX | + IB_QP_MIN_RNR_TIMER | + IB_QP_PATH_MIG_STATE), + [MLX] = (IB_QP_PKEY_INDEX | + IB_QP_QKEY), + } + } + }, + [IB_QPS_SQE] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR }, + [IB_QPS_RTS] = { + .trans = MTHCA_TRANS_SQERR2RTS, + .opt_param = { + [UD] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + [RC] = (IB_QP_CUR_STATE | + IB_QP_MIN_RNR_TIMER), + [MLX] = (IB_QP_CUR_STATE | + IB_QP_QKEY), + } + } + }, + [IB_QPS_ERR] = { + [IB_QPS_RESET] = { .trans = MTHCA_TRANS_ANY2RST }, + [IB_QPS_ERR] = { .trans = MTHCA_TRANS_ANY2ERR } + } +}; + +static void store_attrs(struct mthca_sqp *sqp, struct ib_qp_attr *attr, + int attr_mask) +{ + if (attr_mask & IB_QP_PKEY_INDEX) + sqp->pkey_index = attr->pkey_index; + if (attr_mask & IB_QP_QKEY) + sqp->qkey = attr->qkey; + if (attr_mask & IB_QP_SQ_PSN) + sqp->send_psn = attr->sq_psn; +} + +static void init_port(struct mthca_dev *dev, int port) +{ + int err; + u8 status; + struct mthca_init_ib_param param; + + memset(¶m, 0, sizeof param); + + param.enable_1x = 1; + param.enable_4x = 1; + param.vl_cap = dev->limits.vl_cap; + param.mtu_cap = dev->limits.mtu_cap; + param.gid_cap = dev->limits.gid_table_len; + param.pkey_cap = dev->limits.pkey_table_len; + + err = mthca_INIT_IB(dev, ¶m, port, &status); + if (err) + mthca_warn(dev, "INIT_IB failed, return code %d.\n", err); + if (status) + mthca_warn(dev, "INIT_IB returned status %02x.\n", status); +} + +int mthca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + enum ib_qp_state cur_state, new_state; + void *mailbox = NULL; + struct mthca_qp_param *qp_param; + struct mthca_qp_context *qp_context; + u32 req_param, opt_param; + u8 status; + int err; + + if (attr_mask & IB_QP_CUR_STATE) { + if (attr->cur_qp_state != IB_QPS_RTR && + attr->cur_qp_state != IB_QPS_RTS && + attr->cur_qp_state != IB_QPS_SQD && + attr->cur_qp_state != IB_QPS_SQE) + return -EINVAL; + else + cur_state = attr->cur_qp_state; + } else { + spin_lock_irq(&qp->lock); + cur_state = qp->state; + spin_unlock_irq(&qp->lock); + } + + if (attr_mask & IB_QP_STATE) { + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) + return -EINVAL; + new_state = attr->qp_state; + } else + new_state = cur_state; + + if (state_table[cur_state][new_state].trans == MTHCA_TRANS_INVALID) { + mthca_dbg(dev, "Illegal QP transition " + "%d->%d\n", cur_state, new_state); + return -EINVAL; + } + + req_param = state_table[cur_state][new_state].req_param[qp->transport]; + opt_param = state_table[cur_state][new_state].opt_param[qp->transport]; + + if ((req_param & attr_mask) != req_param) { + mthca_dbg(dev, "QP transition " + "%d->%d missing req attr 0x%08x\n", + cur_state, new_state, + req_param & ~attr_mask); + return -EINVAL; + } + + if (attr_mask & ~(req_param | opt_param | IB_QP_STATE)) { + mthca_dbg(dev, "QP transition (transport %d) " + "%d->%d has extra attr 0x%08x\n", + qp->transport, + cur_state, new_state, + attr_mask & ~(req_param | opt_param | + IB_QP_STATE)); + return -EINVAL; + } + + mailbox = kmalloc(sizeof (*qp_param) + 
MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + qp_param = MAILBOX_ALIGN(mailbox); + qp_context = &qp_param->context; + memset(qp_param, 0, sizeof *qp_param); + + qp_context->flags = cpu_to_be32((to_mthca_state(new_state) << 28) | + (to_mthca_st(qp->transport) << 16)); + qp_context->flags |= cpu_to_be32(MTHCA_QP_BIT_DE); + if (!(attr_mask & IB_QP_PATH_MIG_STATE)) + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + else { + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PM_STATE); + switch (attr->path_mig_state) { + case IB_MIG_MIGRATED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_MIGRATED << 11); + break; + case IB_MIG_REARM: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_REARM << 11); + break; + case IB_MIG_ARMED: + qp_context->flags |= cpu_to_be32(MTHCA_QP_PM_ARMED << 11); + break; + } + } + /* leave sched_queue as 0 */ + if (qp->transport == MLX || qp->transport == UD) + qp_context->mtu_msgmax = cpu_to_be32((IB_MTU_2048 << 29) | + (11 << 24)); + else if (attr_mask & IB_QP_PATH_MTU) { + qp_context->mtu_msgmax = cpu_to_be32((attr->path_mtu << 29) | + (31 << 24)); + } + qp_context->usr_page = cpu_to_be32(MTHCA_KAR_PAGE); + qp_context->local_qpn = cpu_to_be32(qp->qpn); + if (attr_mask & IB_QP_DEST_QPN) { + qp_context->remote_qpn = cpu_to_be32(attr->dest_qp_num); + } + + if (qp->transport == MLX) + qp_context->pri_path.port_pkey |= + cpu_to_be32(to_msqp(qp)->port << 24); + else { + if (attr_mask & IB_QP_PORT) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->port_num << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PORT_NUM); + } + } + + if (attr_mask & IB_QP_PKEY_INDEX) { + qp_context->pri_path.port_pkey |= + cpu_to_be32(attr->pkey_index); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PKEY_INDEX); + } + + if (attr_mask & IB_QP_RNR_RETRY) { + qp_context->pri_path.rnr_retry = attr->rnr_retry << 5; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_RETRY); + } + + if (attr_mask & IB_QP_AV) { + qp_context->pri_path.g_mylmc = attr->ah_attr.src_path_bits & 0x7f; + qp_context->pri_path.rlid = cpu_to_be16(attr->ah_attr.dlid); + qp_context->pri_path.static_rate = (!!attr->ah_attr.static_rate) << 3; + if (attr->ah_attr.ah_flags & IB_AH_GRH) { + qp_context->pri_path.g_mylmc |= 1 << 7; + qp_context->pri_path.mgid_index = attr->ah_attr.grh.sgid_index; + qp_context->pri_path.hop_limit = attr->ah_attr.grh.hop_limit; + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32((attr->ah_attr.sl << 28) | + (attr->ah_attr.grh.traffic_class << 20) | + (attr->ah_attr.grh.flow_label)); + memcpy(qp_context->pri_path.rgid, + attr->ah_attr.grh.dgid.raw, 16); + } else { + qp_context->pri_path.sl_tclass_flowlabel = + cpu_to_be32(attr->ah_attr.sl << 28); + } + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_PRIMARY_ADDR_PATH); + } + + if (attr_mask & IB_QP_TIMEOUT) { + qp_context->pri_path.ackto = attr->timeout; + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_ACK_TIMEOUT); + } + + /* XXX alt_path */ + + /* leave rdd as 0 */ + qp_context->pd = cpu_to_be32(to_mpd(ibqp->pd)->pd_num); + /* leave wqe_base as 0 (we always create an MR based at 0 for WQs) */ + qp_context->wqe_lkey = cpu_to_be32(qp->mr.ibmr.lkey); + qp_context->params1 = cpu_to_be32((MTHCA_ACK_REQ_FREQ << 28) | + (MTHCA_FLIGHT_LIMIT << 24) | + MTHCA_QP_BIT_SRE | + MTHCA_QP_BIT_SWE | + MTHCA_QP_BIT_SAE); + if (qp->sq.policy == IB_SIGNAL_ALL_WR) + qp_context->params1 |= cpu_to_be32(MTHCA_QP_BIT_SSC); + if (attr_mask & IB_QP_RETRY_CNT) { + 
qp_context->params1 |= cpu_to_be32(attr->retry_cnt << 16); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RETRY_COUNT); + } + + /* XXX initiator resources */ + + if (attr_mask & IB_QP_SQ_PSN) + qp_context->next_send_psn = cpu_to_be32(attr->sq_psn); + qp_context->cqn_snd = cpu_to_be32(to_mcq(ibqp->send_cq)->cqn); + + /* XXX RDMA/atomic enable, responder resources */ + + if (qp->rq.policy == IB_SIGNAL_ALL_WR) + qp_context->params2 |= cpu_to_be32(MTHCA_QP_BIT_RSC); + if (attr_mask & IB_QP_MIN_RNR_TIMER) { + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->min_rnr_timer << 24); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_RNR_TIMEOUT); + } + if (attr_mask & IB_QP_RQ_PSN) + qp_context->rnr_nextrecvpsn |= cpu_to_be32(attr->rq_psn); + + /* XXX ra_buff_indx */ + + qp_context->cqn_rcv = cpu_to_be32(to_mcq(ibqp->recv_cq)->cqn); + + if (attr_mask & IB_QP_QKEY) { + qp_context->qkey = cpu_to_be32(attr->qkey); + qp_param->opt_param_mask |= cpu_to_be32(MTHCA_QP_OPTPAR_Q_KEY); + } + + err = mthca_MODIFY_QP(dev, state_table[cur_state][new_state].trans, + qp->qpn, 0, qp_param, 0, &status); + if (status) { + mthca_warn(dev, "modify QP %d returned status %02x.\n", + state_table[cur_state][new_state].trans, status); + err = -EINVAL; + } + + if (!err) + qp->state = new_state; + + kfree(mailbox); + + if (is_sqp(dev, qp)) + store_attrs(to_msqp(qp), attr, attr_mask); + + /* + * If we are moving QP0 to RTR, bring the IB link up; if we + * are moving QP0 to RESET or ERROR, bring the link back down. + */ + if (is_qp0(dev, qp)) { + if (cur_state != IB_QPS_RTR && + new_state == IB_QPS_RTR) + init_port(dev, to_msqp(qp)->port); + + if (cur_state != IB_QPS_RESET && + cur_state != IB_QPS_ERR && + (new_state == IB_QPS_RESET || + new_state == IB_QPS_ERR)) + mthca_CLOSE_IB(dev, to_msqp(qp)->port, &status); + } + + return err; +} + +/* + * Allocate and register buffer for WQEs. qp->rq.max, sq.max, + * rq.max_gs and sq.max_gs must all be assigned. 
+ * mthca_alloc_wqe_buf will calculate rq.wqe_shift and + * sq.wqe_shift (as well as send_wqe_offset, is_direct, and + * queue) + */ +static int mthca_alloc_wqe_buf(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_qp *qp) +{ + int size; + int i; + int npages, shift; + dma_addr_t t; + u64 *dma_list = NULL; + int err = -ENOMEM; + + size = sizeof (struct mthca_next_seg) + + qp->rq.max_gs * sizeof (struct mthca_data_seg); + + for (qp->rq.wqe_shift = 6; 1 << qp->rq.wqe_shift < size; + qp->rq.wqe_shift++) + ; /* nothing */ + + size = sizeof (struct mthca_next_seg) + + qp->sq.max_gs * sizeof (struct mthca_data_seg); + if (qp->transport == MLX) + size += 2 * sizeof (struct mthca_data_seg); + else if (qp->transport == UD) + size += sizeof (struct mthca_ud_seg); + else /* bind seg is as big as atomic + raddr segs */ + size += sizeof (struct mthca_bind_seg); + + for (qp->sq.wqe_shift = 6; 1 << qp->sq.wqe_shift < size; + qp->sq.wqe_shift++) + ; /* nothing */ + + qp->send_wqe_offset = ALIGN(qp->rq.max << qp->rq.wqe_shift, + 1 << qp->sq.wqe_shift); + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + qp->wrid = kmalloc((qp->rq.max + qp->sq.max) * sizeof (u64), + GFP_KERNEL); + if (!qp->wrid) + goto err_out; + + if (size <= MTHCA_MAX_DIRECT_QP_SIZE) { + qp->is_direct = 1; + npages = 1; + shift = get_order(size) + PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating direct QP of size %d (shift %d)\n", + size, shift); + + qp->queue.direct.buf = pci_alloc_consistent(dev->pdev, size, &t); + if (!qp->queue.direct.buf) + goto err_out; + + pci_unmap_addr_set(&qp->queue.direct, mapping, t); + + memset(qp->queue.direct.buf, 0, size); + + while (t & ((1 << shift) - 1)) { + --shift; + npages *= 2; + } + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out_free; + + for (i = 0; i < npages; ++i) + dma_list[i] = t + i * (1 << shift); + } else { + qp->is_direct = 0; + npages = size / PAGE_SIZE; + shift = PAGE_SHIFT; + + if (0) + mthca_dbg(dev, "Creating indirect QP with %d pages\n", npages); + + dma_list = kmalloc(npages * sizeof *dma_list, GFP_KERNEL); + if (!dma_list) + goto err_out; + + qp->queue.page_list = kmalloc(npages * + sizeof *qp->queue.page_list, + GFP_KERNEL); + if (!qp->queue.page_list) + goto err_out; + + for (i = 0; i < npages; ++i) { + qp->queue.page_list[i].buf = + pci_alloc_consistent(dev->pdev, PAGE_SIZE, &t); + if (!qp->queue.page_list[i].buf) + goto err_out_free; + + memset(qp->queue.page_list[i].buf, 0, PAGE_SIZE); + + pci_unmap_addr_set(&qp->queue.page_list[i], mapping, t); + dma_list[i] = t; + } + } + + err = mthca_mr_alloc_phys(dev, pd->pd_num, dma_list, shift, + npages, 0, size, + MTHCA_MPT_FLAG_LOCAL_WRITE | + MTHCA_MPT_FLAG_LOCAL_READ, + &qp->mr); + if (err) + goto err_out_free; + + kfree(dma_list); + return 0; + + err_out_free: + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else + for (i = 0; i < npages; ++i) { + if (qp->queue.page_list[i].buf) + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + + } + + err_out: + kfree(qp->wrid); + kfree(dma_list); + return err; +} + +static int mthca_alloc_qp_common(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + spin_lock_init(&qp->lock); + 
atomic_set(&qp->refcount, 1); + qp->state = IB_QPS_RESET; + qp->sq.policy = send_policy; + qp->rq.policy = recv_policy; + qp->rq.cur = 0; + qp->sq.cur = 0; + qp->rq.next = 0; + qp->sq.next = 0; + qp->rq.last_comp = qp->rq.max - 1; + qp->sq.last_comp = qp->sq.max - 1; + qp->rq.last = NULL; + qp->sq.last = NULL; + + err = mthca_alloc_wqe_buf(dev, pd, qp); + return err; +} + +int mthca_alloc_qp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_qp_type type, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + struct mthca_qp *qp) +{ + int err; + + switch (type) { + case IB_QPT_RC: qp->transport = RC; break; + case IB_QPT_UC: qp->transport = UC; break; + case IB_QPT_UD: qp->transport = UD; break; + default: return -EINVAL; + } + + qp->qpn = mthca_alloc(&dev->qp_table.alloc); + if (qp->qpn == -1) + return -ENOMEM; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, qp); + if (err) { + mthca_free(&dev->qp_table.alloc, qp->qpn); + return err; + } + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_set(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1), qp); + spin_unlock_irq(&dev->qp_table.lock); + + return 0; +} + +int mthca_alloc_sqp(struct mthca_dev *dev, + struct mthca_pd *pd, + struct mthca_cq *send_cq, + struct mthca_cq *recv_cq, + enum ib_sig_type send_policy, + enum ib_sig_type recv_policy, + int qpn, + int port, + struct mthca_sqp *sqp) +{ + int err = 0; + u32 mqpn = qpn * 2 + dev->qp_table.sqp_start + port - 1; + + sqp->header_buf_size = sqp->qp.sq.max * MTHCA_UD_HEADER_SIZE; + sqp->header_buf = dma_alloc_coherent(&dev->pdev->dev, sqp->header_buf_size, + &sqp->header_dma, GFP_KERNEL); + if (!sqp->header_buf) + return -ENOMEM; + + spin_lock_irq(&dev->qp_table.lock); + if (mthca_array_get(&dev->qp_table.qp, mqpn)) + err = -EBUSY; + else + mthca_array_set(&dev->qp_table.qp, mqpn, sqp); + spin_unlock_irq(&dev->qp_table.lock); + + if (err) + goto err_out; + + sqp->port = port; + sqp->qp.qpn = mqpn; + sqp->qp.transport = MLX; + + err = mthca_alloc_qp_common(dev, pd, send_cq, recv_cq, + send_policy, recv_policy, + &sqp->qp); + if (err) + goto err_out_free; + + atomic_inc(&pd->sqp_count); + + return 0; + + err_out_free: + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, mqpn); + spin_unlock_irq(&dev->qp_table.lock); + + err_out: + dma_free_coherent(&dev->pdev->dev, sqp->header_buf_size, + sqp->header_buf, sqp->header_dma); + + return err; +} + +void mthca_free_qp(struct mthca_dev *dev, + struct mthca_qp *qp) +{ + u8 status; + int size; + int i; + + spin_lock_irq(&dev->qp_table.lock); + mthca_array_clear(&dev->qp_table.qp, + qp->qpn & (dev->limits.num_qps - 1)); + spin_unlock_irq(&dev->qp_table.lock); + + atomic_dec(&qp->refcount); + wait_event(qp->wait, !atomic_read(&qp->refcount)); + + if (qp->state != IB_QPS_RESET) + mthca_MODIFY_QP(dev, MTHCA_TRANS_ANY2RST, qp->qpn, 0, NULL, 0, &status); + + mthca_cq_clean(dev, to_mcq(qp->ibqp.send_cq)->cqn, qp->qpn); + if (qp->ibqp.send_cq != qp->ibqp.recv_cq) + mthca_cq_clean(dev, to_mcq(qp->ibqp.recv_cq)->cqn, qp->qpn); + + mthca_free_mr(dev, &qp->mr); + + size = PAGE_ALIGN(qp->send_wqe_offset + + (qp->sq.max << qp->sq.wqe_shift)); + + if (qp->is_direct) { + pci_free_consistent(dev->pdev, size, + qp->queue.direct.buf, + pci_unmap_addr(&qp->queue.direct, mapping)); + } else { + for (i = 0; i < size / PAGE_SIZE; ++i) { + pci_free_consistent(dev->pdev, PAGE_SIZE, + qp->queue.page_list[i].buf, + 
pci_unmap_addr(&qp->queue.page_list[i], + mapping)); + } + } + + kfree(qp->wrid); + + if (is_sqp(dev, qp)) { + atomic_dec(&(to_mpd(qp->ibqp.pd)->sqp_count)); + dma_free_coherent(&dev->pdev->dev, + to_msqp(qp)->header_buf_size, + to_msqp(qp)->header_buf, + to_msqp(qp)->header_dma); + } + else + mthca_free(&dev->qp_table.alloc, qp->qpn); +} + +/* Create UD header for an MLX send and build a data segment for it */ +static int build_mlx_header(struct mthca_dev *dev, struct mthca_sqp *sqp, + int ind, struct ib_send_wr *wr, + struct mthca_mlx_seg *mlx, + struct mthca_data_seg *data) +{ + int header_size; + int err; + + ib_ud_header_init(256, /* assume a MAD */ + sqp->ud_header.grh_present, + &sqp->ud_header); + + err = mthca_read_ah(dev, to_mah(wr->wr.ud.ah), &sqp->ud_header); + if (err) + return err; + mlx->flags &= ~cpu_to_be32(MTHCA_NEXT_SOLICIT | 1); + mlx->flags |= cpu_to_be32((!sqp->qp.ibqp.qp_num ? MTHCA_MLX_VL15 : 0) | + (sqp->ud_header.lrh.destination_lid == 0xffff ? + MTHCA_MLX_SLR : 0) | + (sqp->ud_header.lrh.service_level << 8)); + mlx->rlid = sqp->ud_header.lrh.destination_lid; + mlx->vcrc = 0; + + switch (wr->opcode) { + case IB_WR_SEND: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY; + sqp->ud_header.immediate_present = 0; + break; + case IB_WR_SEND_WITH_IMM: + sqp->ud_header.bth.opcode = IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE; + sqp->ud_header.immediate_present = 1; + sqp->ud_header.immediate_data = wr->imm_data; + break; + default: + return -EINVAL; + } + + sqp->ud_header.lrh.virtual_lane = !sqp->qp.ibqp.qp_num ? 15 : 0; + if (sqp->ud_header.lrh.destination_lid == 0xffff) + sqp->ud_header.lrh.source_lid = 0xffff; + sqp->ud_header.bth.solicited_event = !!(wr->send_flags & IB_SEND_SOLICITED); + if (!sqp->qp.ibqp.qp_num) + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + sqp->pkey_index, + &sqp->ud_header.bth.pkey); + else + ib_cached_pkey_get(&dev->ib_dev, sqp->port, + wr->wr.ud.pkey_index, + &sqp->ud_header.bth.pkey); + cpu_to_be16s(&sqp->ud_header.bth.pkey); + sqp->ud_header.bth.destination_qpn = cpu_to_be32(wr->wr.ud.remote_qpn); + sqp->ud_header.bth.psn = cpu_to_be32((sqp->send_psn++) & ((1 << 24) - 1)); + sqp->ud_header.deth.qkey = cpu_to_be32(wr->wr.ud.remote_qkey & 0x80000000 ? 
+ sqp->qkey : wr->wr.ud.remote_qkey); + sqp->ud_header.deth.source_qpn = cpu_to_be32(sqp->qp.ibqp.qp_num); + + header_size = ib_ud_header_pack(&sqp->ud_header, + sqp->header_buf + + ind * MTHCA_UD_HEADER_SIZE); + + data->byte_count = cpu_to_be32(header_size); + data->lkey = cpu_to_be32(to_mpd(sqp->qp.ibqp.pd)->ntmr.ibmr.lkey); + data->addr = cpu_to_be64(sqp->header_dma + + ind * MTHCA_UD_HEADER_SIZE); + + return 0; +} + +int mthca_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + void *wqe; + void *prev_wqe; + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + u32 f0 = 0; + int ind; + u8 op0 = 0; + + static const u8 opcode[] = { + [IB_WR_SEND] = MTHCA_OPCODE_SEND, + [IB_WR_SEND_WITH_IMM] = MTHCA_OPCODE_SEND_IMM, + [IB_WR_RDMA_WRITE] = MTHCA_OPCODE_RDMA_WRITE, + [IB_WR_RDMA_WRITE_WITH_IMM] = MTHCA_OPCODE_RDMA_WRITE_IMM, + [IB_WR_RDMA_READ] = MTHCA_OPCODE_RDMA_READ, + [IB_WR_ATOMIC_CMP_AND_SWP] = MTHCA_OPCODE_ATOMIC_CS, + [IB_WR_ATOMIC_FETCH_AND_ADD] = MTHCA_OPCODE_ATOMIC_FA, + }; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post send */ + + ind = qp->sq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->sq.cur + nreq >= qp->sq.max) { + mthca_err(dev, "SQ full (%d posted, %d max, %d nreq)\n", + qp->sq.cur, qp->sq.max, nreq); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_send_wqe(qp, ind); + prev_wqe = qp->sq.last; + qp->sq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = 0; + ((struct mthca_next_seg *) wqe)->flags = + ((wr->send_flags & IB_SEND_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0) | + ((wr->send_flags & IB_SEND_SOLICITED) ? 
+ cpu_to_be32(MTHCA_NEXT_SOLICIT) : 0) | + cpu_to_be32(1); + if (wr->opcode == IB_WR_SEND_WITH_IMM || + wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) + ((struct mthca_next_seg *) wqe)->flags = wr->imm_data; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + switch (qp->transport) { + case RC: + switch (wr->opcode) { + case IB_WR_ATOMIC_CMP_AND_SWP: + case IB_WR_ATOMIC_FETCH_AND_ADD: + ((struct mthca_raddr_seg *) wqe)->raddr = + cpu_to_be64(wr->wr.atomic.remote_addr); + ((struct mthca_raddr_seg *) wqe)->rkey = + cpu_to_be32(wr->wr.atomic.rkey); + ((struct mthca_raddr_seg *) wqe)->reserved = 0; + + wqe += sizeof (struct mthca_raddr_seg); + + if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP) { + ((struct mthca_atomic_seg *) wqe)->swap_add = + cpu_to_be64(wr->wr.atomic.swap); + ((struct mthca_atomic_seg *) wqe)->compare = + cpu_to_be64(wr->wr.atomic.compare_add); + } else { + ((struct mthca_atomic_seg *) wqe)->swap_add = + cpu_to_be64(wr->wr.atomic.compare_add); + ((struct mthca_atomic_seg *) wqe)->compare = 0; + } + + wqe += sizeof (struct mthca_atomic_seg); + size += sizeof (struct mthca_raddr_seg) / 16 + + sizeof (struct mthca_atomic_seg); + break; + + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + case IB_WR_RDMA_READ: + ((struct mthca_raddr_seg *) wqe)->raddr = + cpu_to_be64(wr->wr.rdma.remote_addr); + ((struct mthca_raddr_seg *) wqe)->rkey = + cpu_to_be32(wr->wr.rdma.rkey); + ((struct mthca_raddr_seg *) wqe)->reserved = 0; + wqe += sizeof (struct mthca_raddr_seg); + size += sizeof (struct mthca_raddr_seg) / 16; + break; + + default: + /* No extra segments required for sends */ + break; + } + + case UD: + ((struct mthca_ud_seg *) wqe)->lkey = + cpu_to_be32(to_mah(wr->wr.ud.ah)->key); + ((struct mthca_ud_seg *) wqe)->av_addr = + cpu_to_be64(to_mah(wr->wr.ud.ah)->avdma); + ((struct mthca_ud_seg *) wqe)->dqpn = + cpu_to_be32(wr->wr.ud.remote_qpn); + ((struct mthca_ud_seg *) wqe)->qkey = + cpu_to_be32(wr->wr.ud.remote_qkey); + + wqe += sizeof (struct mthca_ud_seg); + size += sizeof (struct mthca_ud_seg) / 16; + break; + + case MLX: + err = build_mlx_header(dev, to_msqp(qp), ind, wr, + wqe - sizeof (struct mthca_next_seg), + wqe); + if (err) { + *bad_wr = wr; + goto out; + } + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + break; + } + + if (wr->num_sge > qp->sq.max_gs) { + mthca_err(dev, "too many gathers\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + /* Add one more inline data segment for ICRC */ + if (qp->transport == MLX) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32((1 << 31) | 4); + ((u32 *) wqe)[1] = 0; + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind + qp->rq.max] = wr->wr_id; + + if (wr->opcode >= ARRAY_SIZE(opcode)) { + mthca_err(dev, "opcode invalid\n"); + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32(((ind << qp->sq.wqe_shift) + + qp->send_wqe_offset) | + opcode[wr->opcode]); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32((size0 ? 
0 : MTHCA_NEXT_DBD) | size); + } + + if (!size0) { + size0 = size; + op0 = opcode[wr->opcode]; + } + + ++ind; + if (unlikely(ind >= qp->sq.max)) + ind -= qp->sq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32(((qp->sq.next << qp->sq.wqe_shift) + + qp->send_wqe_offset) | f0 | op0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | size0); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_SEND_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->sq.cur += nreq; + qp->sq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + struct mthca_qp *qp = to_mqp(ibqp); + unsigned long flags; + int err = 0; + int nreq; + int i; + int size; + int size0 = 0; + int ind; + void *wqe; + void *prev_wqe; + + spin_lock_irqsave(&qp->lock, flags); + + /* XXX check that state is OK to post receive */ + + ind = qp->rq.next; + + for (nreq = 0; wr; ++nreq, wr = wr->next) { + if (qp->rq.cur + nreq >= qp->rq.max) { + mthca_err(dev, "RQ %06x full\n", qp->qpn); + err = -ENOMEM; + *bad_wr = wr; + goto out; + } + + wqe = get_recv_wqe(qp, ind); + prev_wqe = qp->rq.last; + qp->rq.last = wqe; + + ((struct mthca_next_seg *) wqe)->nda_op = 0; + ((struct mthca_next_seg *) wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD); + ((struct mthca_next_seg *) wqe)->flags = + (wr->recv_flags & IB_RECV_SIGNALED) ? + cpu_to_be32(MTHCA_NEXT_CQ_UPDATE) : 0; + + wqe += sizeof (struct mthca_next_seg); + size = sizeof (struct mthca_next_seg) / 16; + + if (wr->num_sge > qp->rq.max_gs) { + err = -EINVAL; + *bad_wr = wr; + goto out; + } + + for (i = 0; i < wr->num_sge; ++i) { + ((struct mthca_data_seg *) wqe)->byte_count = + cpu_to_be32(wr->sg_list[i].length); + ((struct mthca_data_seg *) wqe)->lkey = + cpu_to_be32(wr->sg_list[i].lkey); + ((struct mthca_data_seg *) wqe)->addr = + cpu_to_be64(wr->sg_list[i].addr); + wqe += sizeof (struct mthca_data_seg); + size += sizeof (struct mthca_data_seg) / 16; + } + + qp->wrid[ind] = wr->wr_id; + + if (prev_wqe) { + ((struct mthca_next_seg *) prev_wqe)->nda_op = + cpu_to_be32((ind << qp->rq.wqe_shift) | 1); + smp_wmb(); + ((struct mthca_next_seg *) prev_wqe)->ee_nds = + cpu_to_be32(MTHCA_NEXT_DBD | size); + } + + if (!size0) + size0 = size; + + ++ind; + if (unlikely(ind >= qp->rq.max)) + ind -= qp->rq.max; + } + +out: + if (nreq) { + u32 doorbell[2]; + + doorbell[0] = cpu_to_be32((qp->rq.next << qp->rq.wqe_shift) | size0); + doorbell[1] = cpu_to_be32((qp->qpn << 8) | nreq); + + wmb(); + + mthca_write64(doorbell, + dev->kar + MTHCA_RECEIVE_DOORBELL, + MTHCA_GET_DOORBELL_LOCK(&dev->doorbell_lock)); + } + + qp->rq.cur += nreq; + qp->rq.next = ind; + + spin_unlock_irqrestore(&qp->lock, flags); + return err; +} + +int mthca_free_err_wqe(struct mthca_qp *qp, int is_send, + int index, int *dbd, u32 *new_wqe) +{ + struct mthca_next_seg *next; + + if (is_send) + next = get_send_wqe(qp, index); + else + next = get_recv_wqe(qp, index); + + *dbd = !!(next->ee_nds & cpu_to_be32(MTHCA_NEXT_DBD)); + if (next->ee_nds & cpu_to_be32(0x3f)) + *new_wqe = (next->nda_op & cpu_to_be32(~0x3f)) | + (next->ee_nds & cpu_to_be32(0x3f)); + else + *new_wqe = 0; + + return 0; +} + +int __devinit mthca_init_qp_table(struct mthca_dev *dev) +{ + int err; + u8 status; + int i; + + spin_lock_init(&dev->qp_table.lock); + + /* + * We reserve 2 extra QPs per port for the special QPs. 
The + * special QP for port 1 has to be even, so round up. + */ + dev->qp_table.sqp_start = (dev->limits.reserved_qps + 1) & ~1UL; + err = mthca_alloc_init(&dev->qp_table.alloc, + dev->limits.num_qps, + (1 << 24) - 1, + dev->qp_table.sqp_start + + MTHCA_MAX_PORTS * 2); + if (err) + return err; + + err = mthca_array_init(&dev->qp_table.qp, + dev->limits.num_qps); + if (err) { + mthca_alloc_cleanup(&dev->qp_table.alloc); + return err; + } + + for (i = 0; i < 2; ++i) { + err = mthca_CONF_SPECIAL_QP(dev, i ? IB_QPT_GSI : IB_QPT_SMI, + dev->qp_table.sqp_start + i * 2, + &status); + if (err) + goto err_out; + if (status) { + mthca_warn(dev, "CONF_SPECIAL_QP returned " + "status %02x, aborting.\n", + status); + err = -EINVAL; + goto err_out; + } + } + return 0; + + err_out: + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_array_cleanup(&dev->qp_table.qp, dev->limits.num_qps); + mthca_alloc_cleanup(&dev->qp_table.alloc); + + return err; +} + +void __devexit mthca_cleanup_qp_table(struct mthca_dev *dev) +{ + int i; + u8 status; + + for (i = 0; i < 2; ++i) + mthca_CONF_SPECIAL_QP(dev, i, 0, &status); + + mthca_alloc_cleanup(&dev->qp_table.alloc); +} From roland at topspin.com Mon Dec 27 21:51:12 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:12 -0800 Subject: [openib-general] [PATCH][v5][15/24] Add Mellanox HCA low-level driver (last bits) In-Reply-To: <200412272151.qdu1iD71iJs65qnW@topspin.com> Message-ID: <200412272151.KJdDvki21LuSsIo9@topspin.com> Add code for remaining InfiniBand objects (address vectors, multicast groups, memory regions and protection domains) Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_av.c 2004-12-27 21:48:23.889449784 -0800 @@ -0,0 +1,219 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: mthca_av.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include +#include + +#include "mthca_dev.h" + +struct mthca_av { + u32 port_pd; + u8 reserved1; + u8 g_slid; + u16 dlid; + u8 reserved2; + u8 gid_index; + u8 msg_sr; + u8 hop_limit; + u32 sl_tclass_flowlabel; + u32 dgid[4]; +}; + +int mthca_create_ah(struct mthca_dev *dev, + struct mthca_pd *pd, + struct ib_ah_attr *ah_attr, + struct mthca_ah *ah) +{ + u32 index = -1; + struct mthca_av *av = NULL; + + ah->on_hca = 0; + + if (!atomic_read(&pd->sqp_count) && + !(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + index = mthca_alloc(&dev->av_table.alloc); + + /* fall back to allocate in host memory */ + if (index == -1) + goto host_alloc; + + av = kmalloc(sizeof *av, GFP_KERNEL); + if (!av) + goto host_alloc; + + ah->on_hca = 1; + ah->avdma = dev->av_table.ddr_av_base + + index * MTHCA_AV_SIZE; + } + + host_alloc: + if (!ah->on_hca) { + ah->av = pci_pool_alloc(dev->av_table.pool, + SLAB_KERNEL, &ah->avdma); + if (!ah->av) + return -ENOMEM; + + av = ah->av; + } + + ah->key = pd->ntmr.ibmr.lkey; + + memset(av, 0, MTHCA_AV_SIZE); + + av->port_pd = cpu_to_be32(pd->pd_num | (ah_attr->port_num << 24)); + av->g_slid = ah_attr->src_path_bits; + av->dlid = cpu_to_be16(ah_attr->dlid); + av->msg_sr = (3 << 4) | /* 2K message */ + ah_attr->static_rate; + av->sl_tclass_flowlabel = cpu_to_be32(ah_attr->sl << 28); + if (ah_attr->ah_flags & IB_AH_GRH) { + av->g_slid |= 0x80; + av->gid_index = (ah_attr->port_num - 1) * dev->limits.gid_table_len + + ah_attr->grh.sgid_index; + av->hop_limit = ah_attr->grh.hop_limit; + av->sl_tclass_flowlabel |= + cpu_to_be32((ah_attr->grh.traffic_class << 20) | + ah_attr->grh.flow_label); + memcpy(av->dgid, ah_attr->grh.dgid.raw, 16); + } else { + /* Arbel workaround -- low byte of GID must be 2 */ + av->dgid[3] = cpu_to_be32(2); + } + + if (0) { + int j; + + mthca_dbg(dev, "Created UDAV at %p/%08lx:\n", + av, (unsigned long) ah->avdma); + for (j = 0; j < 8; ++j) + printk(KERN_DEBUG " [%2x] %08x\n", + j * 4, be32_to_cpu(((u32 *) av)[j])); + } + + if (ah->on_hca) { + memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE, + av, MTHCA_AV_SIZE); + kfree(av); + } + + return 0; +} + +int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah) +{ + if (ah->on_hca) + mthca_free(&dev->av_table.alloc, + (ah->avdma - dev->av_table.ddr_av_base) / + MTHCA_AV_SIZE); + else + pci_pool_free(dev->av_table.pool, ah->av, ah->avdma); + + return 0; +} + +int mthca_read_ah(struct mthca_dev *dev, struct mthca_ah *ah, + struct ib_ud_header *header) +{ + if (ah->on_hca) + return -EINVAL; + + header->lrh.service_level = be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 28; + header->lrh.destination_lid = ah->av->dlid; + header->lrh.source_lid = ah->av->g_slid & 0x7f; + if (ah->av->g_slid & 0x80) { + header->grh_present = 1; + header->grh.traffic_class = + (be32_to_cpu(ah->av->sl_tclass_flowlabel) >> 20) & 0xff; + header->grh.flow_label = + ah->av->sl_tclass_flowlabel & cpu_to_be32(0xfffff); + ib_cached_gid_get(&dev->ib_dev, + be32_to_cpu(ah->av->port_pd) >> 24, + ah->av->gid_index, + &header->grh.source_gid); + memcpy(header->grh.destination_gid.raw, + ah->av->dgid, 16); + } else { + header->grh_present = 0; + } + + return 0; +} + +int __devinit mthca_init_av_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->av_table.alloc, + dev->av_table.num_ddr_avs, + dev->av_table.num_ddr_avs - 1, + 0); + if (err) + return err; + + dev->av_table.pool = pci_pool_create("mthca_av", dev->pdev, + MTHCA_AV_SIZE, + 
MTHCA_AV_SIZE, 0); + if (!dev->av_table.pool) + goto out_free_alloc; + + if (!(dev->mthca_flags & MTHCA_FLAG_DDR_HIDDEN)) { + dev->av_table.av_map = ioremap(pci_resource_start(dev->pdev, 4) + + dev->av_table.ddr_av_base - + dev->ddr_start, + dev->av_table.num_ddr_avs * + MTHCA_AV_SIZE); + if (!dev->av_table.av_map) + goto out_free_pool; + } else + dev->av_table.av_map = NULL; + + return 0; + + out_free_pool: + pci_pool_destroy(dev->av_table.pool); + + out_free_alloc: + mthca_alloc_cleanup(&dev->av_table.alloc); + return -ENOMEM; +} + +void __devexit mthca_cleanup_av_table(struct mthca_dev *dev) +{ + if (dev->av_table.av_map) + iounmap(dev->av_table.av_map); + pci_pool_destroy(dev->av_table.pool); + mthca_alloc_cleanup(&dev->av_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mcg.c 2004-12-27 21:48:23.936442867 -0800 @@ -0,0 +1,376 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_mcg.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_QP_PER_MGM = 4 * (MTHCA_MGM_ENTRY_SIZE / 16 - 2) +}; + +struct mthca_mgm { + u32 next_gid_index; + u32 reserved[3]; + u8 gid[16]; + u32 qp[MTHCA_QP_PER_MGM]; +}; + +static const u8 zero_gid[16]; /* automatically initialized to 0 */ + +/* + * Caller must hold MCG table semaphore. gid and mgm parameters must + * be properly aligned for command interface. + * + * Returns 0 unless a firmware command error occurs. + * + * If GID is found in MGM or MGM is empty, *index = *hash, *prev = -1 + * and *mgm holds MGM entry. + * + * if GID is found in AMGM, *index = index in AMGM, *prev = index of + * previous entry in hash chain and *mgm holds AMGM entry. + * + * If no AMGM exists for given gid, *index = -1, *prev = index of last + * entry in hash chain and *mgm holds end of hash chain. 
+ */ +static int find_mgm(struct mthca_dev *dev, + u8 *gid, struct mthca_mgm *mgm, + u16 *hash, int *prev, int *index) +{ + void *mailbox; + u8 *mgid; + int err; + u8 status; + + mailbox = kmalloc(16 + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgid = MAILBOX_ALIGN(mailbox); + + memcpy(mgid, gid, 16); + + err = mthca_MGID_HASH(dev, mgid, hash, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "MGID_HASH returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (0) + mthca_dbg(dev, "Hash for %04x:%04x:%04x:%04x:" + "%04x:%04x:%04x:%04x is %04x\n", + be16_to_cpu(((u16 *) gid)[0]), be16_to_cpu(((u16 *) gid)[1]), + be16_to_cpu(((u16 *) gid)[2]), be16_to_cpu(((u16 *) gid)[3]), + be16_to_cpu(((u16 *) gid)[4]), be16_to_cpu(((u16 *) gid)[5]), + be16_to_cpu(((u16 *) gid)[6]), be16_to_cpu(((u16 *) gid)[7]), + *hash); + + *index = *hash; + *prev = -1; + + do { + err = mthca_READ_MGM(dev, *index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + return -EINVAL; + } + + if (!memcmp(mgm->gid, zero_gid, 16)) { + if (*index != *hash) { + mthca_err(dev, "Found zero MGID in AMGM.\n"); + err = -EINVAL; + } + goto out; + } + + if (!memcmp(mgm->gid, gid, 16)) + goto out; + + *prev = *index; + *index = be32_to_cpu(mgm->next_gid_index) >> 5; + } while (*index); + + *index = -1; + + out: + kfree(mailbox); + return err; +} + +int mthca_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int index, prev; + int link = 0; + int i; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index != -1) { + if (!memcmp(mgm->gid, zero_gid, 16)) + memcpy(mgm->gid, gid->raw, 16); + } else { + link = 1; + + index = mthca_alloc(&dev->mcg_table.alloc); + if (index == -1) { + mthca_err(dev, "No AMGM entries left\n"); + err = -ENOMEM; + goto out; + } + + err = mthca_READ_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + memcpy(mgm->gid, gid->raw, 16); + mgm->next_gid_index = 0; + } + + for (i = 0; i < MTHCA_QP_PER_MGM; ++i) + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) { + mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31)); + break; + } + + if (i == MTHCA_QP_PER_MGM) { + mthca_err(dev, "MGM at index %x is full.\n", index); + err = -ENOMEM; + goto out; + } + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + if (!link) + goto out; + + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int mthca_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + 
struct mthca_dev *dev = to_mdev(ibqp->device); + void *mailbox; + struct mthca_mgm *mgm; + u16 hash; + int prev, index; + int i, loc; + int err; + u8 status; + + mailbox = kmalloc(sizeof *mgm + MTHCA_CMD_MAILBOX_EXTRA, GFP_KERNEL); + if (!mailbox) + return -ENOMEM; + mgm = MAILBOX_ALIGN(mailbox); + + if (down_interruptible(&dev->mcg_table.sem)) + return -EINTR; + + err = find_mgm(dev, gid->raw, mgm, &hash, &prev, &index); + if (err) + goto out; + + if (index == -1) { + mthca_err(dev, "MGID %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x " + "not found\n", + be16_to_cpu(((u16 *) gid->raw)[0]), + be16_to_cpu(((u16 *) gid->raw)[1]), + be16_to_cpu(((u16 *) gid->raw)[2]), + be16_to_cpu(((u16 *) gid->raw)[3]), + be16_to_cpu(((u16 *) gid->raw)[4]), + be16_to_cpu(((u16 *) gid->raw)[5]), + be16_to_cpu(((u16 *) gid->raw)[6]), + be16_to_cpu(((u16 *) gid->raw)[7])); + err = -EINVAL; + goto out; + } + + for (loc = -1, i = 0; i < MTHCA_QP_PER_MGM; ++i) { + if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) + loc = i; + if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) + break; + } + + if (loc == -1) { + mthca_err(dev, "QP %06x not found in MGM\n", ibqp->qp_num); + err = -EINVAL; + goto out; + } + + mgm->qp[loc] = mgm->qp[i - 1]; + mgm->qp[i - 1] = 0; + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + if (i != 1) + goto out; + + goto out; + + if (prev == -1) { + /* Remove entry from MGM */ + if (be32_to_cpu(mgm->next_gid_index) >> 5) { + err = mthca_READ_MGM(dev, + be32_to_cpu(mgm->next_gid_index) >> 5, + mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", + status); + err = -EINVAL; + goto out; + } + } else + memset(mgm->gid, 0, 16); + + err = mthca_WRITE_MGM(dev, index, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } else { + /* Remove entry from AMGM */ + index = be32_to_cpu(mgm->next_gid_index) >> 5; + err = mthca_READ_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "READ_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + + mgm->next_gid_index = cpu_to_be32(index << 5); + + err = mthca_WRITE_MGM(dev, prev, mgm, &status); + if (err) + goto out; + if (status) { + mthca_err(dev, "WRITE_MGM returned status %02x\n", status); + err = -EINVAL; + goto out; + } + } + + out: + up(&dev->mcg_table.sem); + kfree(mailbox); + return err; +} + +int __devinit mthca_init_mcg_table(struct mthca_dev *dev) +{ + int err; + + err = mthca_alloc_init(&dev->mcg_table.alloc, + dev->limits.num_amgms, + dev->limits.num_amgms - 1, + 0); + if (err) + return err; + + init_MUTEX(&dev->mcg_table.sem); + + return 0; +} + +void __devexit mthca_cleanup_mcg_table(struct mthca_dev *dev) +{ + mthca_alloc_cleanup(&dev->mcg_table.alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mr.c 2004-12-27 21:48:23.964438746 -0800 @@ -0,0 +1,396 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_mr.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +/* + * Must be packed because mtt_seg is 64 bits but only aligned to 32 bits. + */ +struct mthca_mpt_entry { + u32 flags; + u32 page_size; + u32 key; + u32 pd; + u64 start; + u64 length; + u32 lkey; + u32 window_count; + u32 window_count_limit; + u64 mtt_seg; + u32 reserved[3]; +} __attribute__((packed)); + +#define MTHCA_MPT_FLAG_SW_OWNS (0xfUL << 28) +#define MTHCA_MPT_FLAG_MIO (1 << 17) +#define MTHCA_MPT_FLAG_BIND_ENABLE (1 << 15) +#define MTHCA_MPT_FLAG_PHYSICAL (1 << 9) +#define MTHCA_MPT_FLAG_REGION (1 << 8) + +#define MTHCA_MTT_FLAG_PRESENT 1 + +/* + * Buddy allocator for MTT segments (currently not very efficient + * since it doesn't keep a free list and just searches linearly + * through the bitmaps) + */ + +static u32 mthca_alloc_mtt(struct mthca_dev *dev, int order) +{ + int o; + int m; + u32 seg; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + for (o = order; o <= dev->mr_table.max_mtt_order; ++o) { + m = 1 << (dev->mr_table.max_mtt_order - o); + seg = find_first_bit(dev->mr_table.mtt_buddy[o], m); + if (seg < m) + goto found; + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + return -1; + + found: + clear_bit(seg, dev->mr_table.mtt_buddy[o]); + + while (o > order) { + --o; + seg <<= 1; + set_bit(seg ^ 1, dev->mr_table.mtt_buddy[o]); + } + + spin_unlock(&dev->mr_table.mpt_alloc.lock); + + seg <<= order; + + return seg; +} + +static void mthca_free_mtt(struct mthca_dev *dev, u32 seg, int order) +{ + seg >>= order; + + spin_lock(&dev->mr_table.mpt_alloc.lock); + + while (test_bit(seg ^ 1, dev->mr_table.mtt_buddy[order])) { + clear_bit(seg ^ 1, dev->mr_table.mtt_buddy[order]); + seg >>= 1; + ++order; + } + + set_bit(seg, dev->mr_table.mtt_buddy[order]); + + spin_unlock(&dev->mr_table.mpt_alloc.lock); +} + +int mthca_mr_alloc_notrans(struct mthca_dev *dev, u32 pd, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + struct mthca_mpt_entry *mpt_entry; + int err; + u8 status; + + might_sleep(); + + mr->order = -1; + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + mailbox = kmalloc(sizeof *mpt_entry + 
MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) { + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return -ENOMEM; + } + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_PHYSICAL | + MTHCA_MPT_FLAG_REGION | + access); + mpt_entry->page_size = 0; + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = 0; + mpt_entry->length = ~0ULL; + + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; +} + +int mthca_mr_alloc_phys(struct mthca_dev *dev, u32 pd, + u64 *buffer_list, int buffer_size_shift, + int list_len, u64 iova, u64 total_size, + u32 access, struct mthca_mr *mr) +{ + void *mailbox; + u64 *mtt_entry; + struct mthca_mpt_entry *mpt_entry; + int err = -ENOMEM; + u8 status; + int i; + + might_sleep(); + WARN_ON(buffer_size_shift >= 32); + + mr->ibmr.lkey = mthca_alloc(&dev->mr_table.mpt_alloc); + if (mr->ibmr.lkey == -1) + return -ENOMEM; + mr->ibmr.rkey = mr->ibmr.lkey; + + for (i = dev->limits.mtt_seg_size / 8, mr->order = 0; + i < list_len; + i <<= 1, ++mr->order) + /* nothing */ ; + + mr->first_seg = mthca_alloc_mtt(dev, mr->order); + if (mr->first_seg == -1) + goto err_out_mpt_free; + + /* + * If list_len is odd, we add one more dummy entry for + * firmware efficiency. + */ + mailbox = kmalloc(max(sizeof *mpt_entry, + (size_t) 8 * (list_len + (list_len & 1) + 2)) + + MTHCA_CMD_MAILBOX_EXTRA, + GFP_KERNEL); + if (!mailbox) + goto err_out_free_mtt; + + mtt_entry = MAILBOX_ALIGN(mailbox); + + mtt_entry[0] = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + mtt_entry[1] = 0; + for (i = 0; i < list_len; ++i) + mtt_entry[i + 2] = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + if (list_len & 1) { + mtt_entry[i + 2] = 0; + ++list_len; + } + + if (0) { + mthca_dbg(dev, "Dumping MPT entry\n"); + for (i = 0; i < list_len + 2; ++i) + printk(KERN_ERR "[%2d] %016llx\n", + i, (unsigned long long) be64_to_cpu(mtt_entry[i])); + } + + err = mthca_WRITE_MTT(dev, mtt_entry, list_len, &status); + if (err) { + mthca_warn(dev, "WRITE_MTT failed (%d)\n", err); + goto err_out_mailbox_free; + } + if (status) { + mthca_warn(dev, "WRITE_MTT returned status 0x%02x\n", + status); + err = -EINVAL; + goto err_out_mailbox_free; + } + + mpt_entry = MAILBOX_ALIGN(mailbox); + + mpt_entry->flags = cpu_to_be32(MTHCA_MPT_FLAG_SW_OWNS | + MTHCA_MPT_FLAG_MIO | + MTHCA_MPT_FLAG_REGION | + access); + + mpt_entry->page_size = cpu_to_be32(buffer_size_shift - 12); + mpt_entry->key = cpu_to_be32(mr->ibmr.lkey); + mpt_entry->pd = cpu_to_be32(pd); + mpt_entry->start = cpu_to_be64(iova); + mpt_entry->length = cpu_to_be64(total_size); + memset(&mpt_entry->lkey, 0, + sizeof *mpt_entry - offsetof(struct mthca_mpt_entry, lkey)); + mpt_entry->mtt_seg = cpu_to_be64(dev->mr_table.mtt_base + + mr->first_seg * dev->limits.mtt_seg_size); + + if (0) { + mthca_dbg(dev, "Dumping MPT entry %08x:\n", mr->ibmr.lkey); + for (i = 0; i < sizeof (struct mthca_mpt_entry) / 4; ++i) { + if (i % 4 == 0) + printk("[%02x] ", i * 4); + printk(" %08x", be32_to_cpu(((u32 *) mpt_entry)[i])); + if ((i + 1) % 4 == 0) + printk("\n"); + } + 
} + + err = mthca_SW2HW_MPT(dev, mpt_entry, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "SW2HW_MPT failed (%d)\n", err); + else if (status) { + mthca_warn(dev, "SW2HW_MPT returned status 0x%02x\n", + status); + err = -EINVAL; + } + + kfree(mailbox); + return err; + + err_out_mailbox_free: + kfree(mailbox); + + err_out_free_mtt: + mthca_free_mtt(dev, mr->first_seg, mr->order); + + err_out_mpt_free: + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); + return err; +} + +void mthca_free_mr(struct mthca_dev *dev, struct mthca_mr *mr) +{ + int err; + u8 status; + + might_sleep(); + + err = mthca_HW2SW_MPT(dev, NULL, + mr->ibmr.lkey & (dev->limits.num_mpts - 1), + &status); + if (err) + mthca_warn(dev, "HW2SW_MPT failed (%d)\n", err); + else if (status) + mthca_warn(dev, "HW2SW_MPT returned status 0x%02x\n", + status); + + if (mr->order >= 0) + mthca_free_mtt(dev, mr->first_seg, mr->order); + + mthca_free(&dev->mr_table.mpt_alloc, mr->ibmr.lkey); +} + +int __devinit mthca_init_mr_table(struct mthca_dev *dev) +{ + int err; + int i, s; + + err = mthca_alloc_init(&dev->mr_table.mpt_alloc, + dev->limits.num_mpts, + ~0, dev->limits.reserved_mrws); + if (err) + return err; + + err = -ENOMEM; + + for (i = 1, dev->mr_table.max_mtt_order = 0; + i < dev->limits.num_mtt_segs; + i <<= 1, ++dev->mr_table.max_mtt_order) + /* nothing */ ; + + dev->mr_table.mtt_buddy = kmalloc((dev->mr_table.max_mtt_order + 1) * + sizeof (long *), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy) + goto err_out; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + dev->mr_table.mtt_buddy[i] = NULL; + + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) { + s = BITS_TO_LONGS(1 << (dev->mr_table.max_mtt_order - i)); + dev->mr_table.mtt_buddy[i] = kmalloc(s * sizeof (long), + GFP_KERNEL); + if (!dev->mr_table.mtt_buddy[i]) + goto err_out_free; + bitmap_zero(dev->mr_table.mtt_buddy[i], + 1 << (dev->mr_table.max_mtt_order - i)); + } + + set_bit(0, dev->mr_table.mtt_buddy[dev->mr_table.max_mtt_order]); + + for (i = 0; i < dev->mr_table.max_mtt_order; ++i) + if (1 << i >= dev->limits.reserved_mtts) + break; + + if (i == dev->mr_table.max_mtt_order) { + mthca_err(dev, "MTT table of order %d is " + "too small.\n", i); + goto err_out_free; + } + + (void) mthca_alloc_mtt(dev, i); + + return 0; + + err_out_free: + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + + err_out: + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); + + return err; +} + +void __devexit mthca_cleanup_mr_table(struct mthca_dev *dev) +{ + int i; + + /* XXX check if any MRs are still allocated? */ + for (i = 0; i <= dev->mr_table.max_mtt_order; ++i) + kfree(dev->mr_table.mtt_buddy[i]); + kfree(dev->mr_table.mtt_buddy); + mthca_alloc_cleanup(&dev->mr_table.mpt_alloc); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_pd.c 2004-12-27 21:48:23.990434920 -0800 @@ -0,0 +1,80 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_pd.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include "mthca_dev.h" + +int mthca_pd_alloc(struct mthca_dev *dev, struct mthca_pd *pd) +{ + int err; + + might_sleep(); + + atomic_set(&pd->sqp_count, 0); + pd->pd_num = mthca_alloc(&dev->pd_table.alloc); + if (pd->pd_num == -1) + return -ENOMEM; + + err = mthca_mr_alloc_notrans(dev, pd->pd_num, + MTHCA_MPT_FLAG_LOCAL_READ | + MTHCA_MPT_FLAG_LOCAL_WRITE, + &pd->ntmr); + if (err) + mthca_free(&dev->pd_table.alloc, pd->pd_num); + + return err; +} + +void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd) +{ + might_sleep(); + mthca_free_mr(dev, &pd->ntmr); + mthca_free(&dev->pd_table.alloc, pd->pd_num); +} + +int __devinit mthca_init_pd_table(struct mthca_dev *dev) +{ + return mthca_alloc_init(&dev->pd_table.alloc, + dev->limits.num_pds, + (1 << 24) - 1, + dev->limits.reserved_pds); +} + +void __devexit mthca_cleanup_pd_table(struct mthca_dev *dev) +{ + /* XXX check if any PDs are still allocated? */ + mthca_alloc_cleanup(&dev->pd_table.alloc); +} From roland at topspin.com Mon Dec 27 21:51:13 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:13 -0800 Subject: [openib-general] [PATCH][v5][16/24] Add Mellanox HCA low-level driver (MAD) In-Reply-To: <200412272151.KJdDvki21LuSsIo9@topspin.com> Message-ID: <200412272151.ZZGB77m9dy61uyrb@topspin.com> Add MAD (management datagram) code for Mellanox HCA driver. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/hw/mthca/mthca_mad.c 2004-12-27 21:48:24.331384733 -0800 @@ -0,0 +1,320 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: mthca_mad.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include +#include + +#include "mthca_dev.h" +#include "mthca_cmd.h" + +enum { + MTHCA_VENDOR_CLASS1 = 0x9, + MTHCA_VENDOR_CLASS2 = 0xa +}; + +struct mthca_trap_mad { + struct ib_mad *mad; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static void update_sm_ah(struct mthca_dev *dev, + u8 port_num, u16 lid, u8 sl) +{ + struct ib_ah *new_ah; + struct ib_ah_attr ah_attr; + unsigned long flags; + + if (!dev->send_agent[port_num - 1][0]) + return; + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = lid; + ah_attr.sl = sl; + ah_attr.port_num = port_num; + + new_ah = ib_create_ah(dev->send_agent[port_num - 1][0]->qp->pd, + &ah_attr); + if (IS_ERR(new_ah)) + return; + + spin_lock_irqsave(&dev->sm_lock, flags); + if (dev->sm_ah[port_num - 1]) + ib_destroy_ah(dev->sm_ah[port_num - 1]); + dev->sm_ah[port_num - 1] = new_ah; + spin_unlock_irqrestore(&dev->sm_lock, flags); +} + +/* + * Snoop SM MADs for port info and P_Key table sets, so we can + * synthesize LID change and P_Key change events. + */ +static void smp_snoop(struct ib_device *ibdev, + u8 port_num, + struct ib_mad *mad) +{ + struct ib_event event; + + if ((mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + mad->mad_hdr.method == IB_MGMT_METHOD_SET) { + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { + update_sm_ah(to_mdev(ibdev), port_num, + be16_to_cpup((__be16 *) (mad->data + 58)), + (*(u8 *) (mad->data + 76)) & 0xf); + + event.device = ibdev; + event.event = IB_EVENT_LID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) { + event.device = ibdev; + event.event = IB_EVENT_PKEY_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } + } +} + +static void forward_trap(struct mthca_dev *dev, + u8 port_num, + struct ib_mad *mad) +{ + int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; + struct mthca_trap_mad *tmad; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = qpn ? 
IB_QP1_QKEY : 0, + .timeout_ms = 0 + } + } + }; + struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; + int ret; + unsigned long flags; + + if (agent) { + tmad = kmalloc(sizeof *tmad, GFP_KERNEL); + if (!tmad) + return; + + tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); + if (!tmad->mad) { + kfree(tmad); + return; + } + + memcpy(tmad->mad, mad, sizeof *mad); + + wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; + wr.wr_id = (unsigned long) tmad; + + gather_list.addr = dma_map_single(agent->device->dma_device, + tmad->mad, + sizeof *tmad->mad, + DMA_TO_DEVICE); + gather_list.length = sizeof *tmad->mad; + gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; + pci_unmap_addr_set(tmad, mapping, gather_list.addr); + + /* + * We rely here on the fact that MLX QPs don't use the + * address handle after the send is posted (this is + * wrong following the IB spec strictly, but we know + * it's OK for our devices). + */ + spin_lock_irqsave(&dev->sm_lock, flags); + wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &wr, &bad_wr); + else + ret = -EINVAL; + spin_unlock_irqrestore(&dev->sm_lock, flags); + + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); + } + } +} + +int mthca_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + u16 slid, + struct ib_mad *in_mad, + struct ib_mad *out_mad) +{ + int err; + u8 status; + + /* Forward locally generated traps to the SM */ + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && + slid == 0) { + forward_trap(to_mdev(ibdev), port_num, in_mad); + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + } + + /* + * Only handle SM gets, sets and trap represses for SM class + * + * Only handle PMA and Mellanox vendor-specific class gets and + * sets for other classes. + */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_TRAP_REPRESS) + return IB_MAD_RESULT_SUCCESS; + + /* + * Don't process SMInfo queries or vendor-specific + * MADs -- the SMA can't handle them. 
+ */ + if (in_mad->mad_hdr.attr_id == IB_SMP_ATTR_SM_INFO || + ((in_mad->mad_hdr.attr_id & IB_SMP_ATTR_VENDOR_MASK) == + IB_SMP_ATTR_VENDOR_MASK)) + return IB_MAD_RESULT_SUCCESS; + } else if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS1 || + in_mad->mad_hdr.mgmt_class == MTHCA_VENDOR_CLASS2) { + if (in_mad->mad_hdr.method != IB_MGMT_METHOD_GET && + in_mad->mad_hdr.method != IB_MGMT_METHOD_SET) + return IB_MAD_RESULT_SUCCESS; + } else + return IB_MAD_RESULT_SUCCESS; + + err = mthca_MAD_IFC(to_mdev(ibdev), + !!(mad_flags & IB_MAD_IGNORE_MKEY), + port_num, in_mad, out_mad, + &status); + if (err) { + mthca_err(to_mdev(ibdev), "MAD_IFC failed\n"); + return IB_MAD_RESULT_FAILURE; + } + if (status == MTHCA_CMD_STAT_BAD_PKT) + return IB_MAD_RESULT_SUCCESS; + if (status) { + mthca_err(to_mdev(ibdev), "MAD_IFC returned status %02x\n", + status); + return IB_MAD_RESULT_FAILURE; + } + + if (!out_mad->mad_hdr.status) + smp_snoop(ibdev, port_num, in_mad); + + /* set return bit in status of directed route responses */ + if (in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) + out_mad->mad_hdr.status |= cpu_to_be16(1 << 15); + + if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP_REPRESS) + /* no response for trap repress */ + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED; + + return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + struct mthca_trap_mad *tmad = + (void *) (unsigned long) mad_send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(tmad, mapping), + sizeof *tmad->mad, + DMA_TO_DEVICE); + kfree(tmad->mad); + kfree(tmad); +} + +int mthca_create_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + spin_lock_init(&dev->sm_lock); + + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) { + agent = ib_register_mad_agent(&dev->ib_dev, p + 1, + q ? IB_QPT_GSI : IB_QPT_SMI, + NULL, 0, send_handler, + NULL, NULL); + if (IS_ERR(agent)) + goto err; + dev->send_agent[p][q] = agent; + } + + return 0; + +err: + for (p = 0; p < dev->limits.num_ports; ++p) + for (q = 0; q <= 1; ++q) + if (dev->send_agent[p][q]) + ib_unregister_mad_agent(dev->send_agent[p][q]); + + return PTR_ERR(agent); +} + +void mthca_free_agents(struct mthca_dev *dev) +{ + struct ib_mad_agent *agent; + int p, q; + + for (p = 0; p < dev->limits.num_ports; ++p) { + for (q = 0; q <= 1; ++q) { + agent = dev->send_agent[p][q]; + dev->send_agent[p][q] = NULL; + ib_unregister_mad_agent(agent); + } + + if (dev->sm_ah[p]) + ib_destroy_ah(dev->sm_ah[p]); + } +} From roland at topspin.com Mon Dec 27 21:51:14 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:14 -0800 Subject: [openib-general] [PATCH][v5][17/24] IPoIB IPv4 multicast In-Reply-To: <200412272151.ZZGB77m9dy61uyrb@topspin.com> Message-ID: <200412272151.qKxQwjb8RXkS5kEQ@topspin.com> Add ip_ib_mc_map() to convert IPv4 multicast addresses to IPoIB hardware addresses. Also add so INFINIBAND_ALEN has a home. The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/include/linux/if_infiniband.h 2004-12-27 21:48:24.639339403 -0800 @@ -0,0 +1,29 @@ +/* + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available at + * , or the OpenIB.org BSD + * license, available in the LICENSE.TXT file accompanying this + * software. These details are also available at + * . + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * $Id$ + */ + +#ifndef _LINUX_IF_INFINIBAND_H +#define _LINUX_IF_INFINIBAND_H + +#define INFINIBAND_ALEN 20 /* Octets in IPoIB HW addr */ + +#endif /* _LINUX_IF_INFINIBAND_H */ --- linux-bk.orig/include/net/ip.h 2004-12-27 21:47:47.982735072 -0800 +++ linux-bk/include/net/ip.h 2004-12-27 21:48:24.639339403 -0800 @@ -229,6 +229,39 @@ buf[3]=addr&0x7F; } +/* + * Map a multicast IP onto multicast MAC for type IP-over-InfiniBand. + * Leave P_Key as 0 to be filled in by driver. + */ + +static inline void ip_ib_mc_map(u32 addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + addr = ntohl(addr); + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x40; /* IPv4 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + buf[10] = 0; + buf[11] = 0; + buf[12] = 0; + buf[13] = 0; + buf[14] = 0; + buf[15] = 0; + buf[19] = addr & 0xff; + addr >>= 8; + buf[18] = addr & 0xff; + addr >>= 8; + buf[17] = addr & 0xff; + addr >>= 8; + buf[16] = addr & 0x0f; +} + #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) #include #endif --- linux-bk.orig/net/ipv4/arp.c 2004-12-27 21:47:52.507069119 -0800 +++ linux-bk/net/ipv4/arp.c 2004-12-27 21:48:24.640339256 -0800 @@ -213,6 +213,9 @@ case ARPHRD_IEEE802_TR: ip_tr_mc_map(addr, haddr); return 0; + case ARPHRD_INFINIBAND: + ip_ib_mc_map(addr, haddr); + return 0; default: if (dir) { memcpy(haddr, dev->broadcast, dev->addr_len); From roland at topspin.com Mon Dec 27 21:51:15 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:15 -0800 Subject: [openib-general] [PATCH][v5][18/24] IPoIB IPv6 support In-Reply-To: <200412272151.qKxQwjb8RXkS5kEQ@topspin.com> Message-ID: <200412272151.6vXZnyn99yKN3sSD@topspin.com> Add ipv6_ib_mc_map() to convert IPv6 multicast addresses to IPoIB hardware addresses, and add support for autoconfiguration for devices with type ARPHRD_INFINIBAND. 
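For a quick check of the byte layout, here is a standalone user-space sketch (not part of the patch; the helper name sketch_ipv6_ib_mc_map and the sample group ff02::1 are only illustrative) that mirrors the ipv6_ib_mc_map() hunk below. The ip_ib_mc_map() helper in the previous patch produces the same layout, with signature byte 0x40 in place of 0x60 and the low 28 bits of the IPv4 group in the last four bytes.

/*
 * Standalone sketch, not part of the patch: reproduces the layout of
 * ipv6_ib_mc_map() so the resulting 20-byte IPoIB hardware address can
 * be printed and inspected in user space.
 */
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define INFINIBAND_ALEN 20

static void sketch_ipv6_ib_mc_map(const struct in6_addr *addr, unsigned char *buf)
{
	buf[0] = 0;		/* Reserved */
	buf[1] = 0xff;		/* Multicast QPN */
	buf[2] = 0xff;
	buf[3] = 0xff;
	buf[4] = 0xff;
	buf[5] = 0x12;		/* link local scope */
	buf[6] = 0x60;		/* IPv6 signature */
	buf[7] = 0x1b;
	buf[8] = 0;		/* P_Key, filled in by the driver */
	buf[9] = 0;
	memcpy(buf + 10, addr->s6_addr + 6, 10);
}

int main(void)
{
	struct in6_addr mc;
	unsigned char buf[INFINIBAND_ALEN];
	int i;

	inet_pton(AF_INET6, "ff02::1", &mc);	/* all-nodes group, just a sample */
	sketch_ipv6_ib_mc_map(&mc, buf);
	for (i = 0; i < INFINIBAND_ALEN; ++i)
		printf("%02x%c", buf[i], i < INFINIBAND_ALEN - 1 ? ':' : '\n');
	/* prints 00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:00:00:00:00:01 */
	return 0;
}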
The mapping for multicast addresses is described in http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Nitin Hande Signed-off-by: Roland Dreier --- linux-bk.orig/include/net/if_inet6.h 2004-12-27 21:47:59.669014924 -0800 +++ linux-bk/include/net/if_inet6.h 2004-12-27 21:48:24.976289805 -0800 @@ -266,5 +266,20 @@ { buf[0] = 0x00; } + +static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +{ + buf[0] = 0; /* Reserved */ + buf[1] = 0xff; /* Multicast QPN */ + buf[2] = 0xff; + buf[3] = 0xff; + buf[4] = 0xff; + buf[5] = 0x12; /* link local scope */ + buf[6] = 0x60; /* IPv6 signature */ + buf[7] = 0x1b; + buf[8] = 0; /* P_Key */ + buf[9] = 0; + memcpy(buf + 10, addr->s6_addr + 6, 10); +} #endif #endif --- linux-bk.orig/net/ipv6/addrconf.c 2004-12-27 21:47:59.159089982 -0800 +++ linux-bk/net/ipv6/addrconf.c 2004-12-27 21:48:24.978289511 -0800 @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -1095,6 +1096,12 @@ memset(eui, 0, 7); eui[7] = *(u8*)dev->dev_addr; return 0; + case ARPHRD_INFINIBAND: + if (dev->addr_len != INFINIBAND_ALEN) + return -1; + memcpy(eui, dev->dev_addr + 12, 8); + eui[0] |= 2; + return 0; } return -1; } @@ -1794,7 +1801,8 @@ if ((dev->type != ARPHRD_ETHER) && (dev->type != ARPHRD_FDDI) && (dev->type != ARPHRD_IEEE802_TR) && - (dev->type != ARPHRD_ARCNET)) { + (dev->type != ARPHRD_ARCNET) && + (dev->type != ARPHRD_INFINIBAND)) { /* Alas, we support only Ethernet autoconfiguration. */ return; } --- linux-bk.orig/net/ipv6/ndisc.c 2004-12-27 21:47:44.031316692 -0800 +++ linux-bk/net/ipv6/ndisc.c 2004-12-27 21:48:24.979289364 -0800 @@ -260,6 +260,9 @@ case ARPHRD_ARCNET: ipv6_arcnet_mc_map(addr, buf); return 0; + case ARPHRD_INFINIBAND: + ipv6_ib_mc_map(addr, buf); + return 0; default: if (dir) { memcpy(buf, dev->broadcast, dev->addr_len); From roland at topspin.com Mon Dec 27 21:51:15 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:15 -0800 Subject: [openib-general] [PATCH][v5][19/24] Add IPoIB (IP-over-InfiniBand) driver In-Reply-To: <200412272151.6vXZnyn99yKN3sSD@topspin.com> Message-ID: <200412272151.XKAQkznglOHW39xJ@topspin.com> Add a driver that implements the (IPoIB) IP-over-InfiniBand protocol. This is a network device driver of type ARPHRD_INFINIBAND (and addr_len INFINIBAND_ALEN bytes). The ARP/ND implementation for this driver is not completely straightforward, because InfiniBand requires an additional path lookup be performed (through an IB-specific mechanism) after a remote hardware address has been resolved. We are very open to suggestions of a better way to handle this than the current implementation. Although IB has a special multicast group join mode intended to support IP multicast routing (non member join), no means to identify different multicast styles has yet been determined, so all joins by the driver are currently full member joins. We are looking for guidance in how to solve this. 
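To make the hardware-address handling above a bit more concrete, the following is a small user-space sketch (not part of the patch; the sample address is simply the driver's IPv4 broadcast mapping) of how the INFINIBAND_ALEN-byte address that ARP/ND hands back splits into the pieces the driver uses: the first four bytes carry the remote QPN, and the remaining sixteen carry the destination GID (used for the path record lookup on unicasts, or as the multicast GID otherwise).

/*
 * Standalone sketch, not part of the patch: decomposes a 20-byte IPoIB
 * hardware address the same way the driver does, i.e. a big-endian QPN
 * in bytes 0-3 and the 16-byte GID in bytes 4-19.
 */
#include <stdio.h>
#include <stdint.h>

#define INFINIBAND_ALEN 20

int main(void)
{
	/* IPv4 broadcast mapping, as used for dev->broadcast in the driver */
	static const unsigned char ha[INFINIBAND_ALEN] = {
		0x00, 0xff, 0xff, 0xff,
		0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00,
		0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff
	};
	uint32_t qpn;
	int i;

	/* same access as be32_to_cpup((__be32 *) neighbour->ha) in the driver */
	qpn = ((uint32_t) ha[0] << 24) | ((uint32_t) ha[1] << 16) |
	      ((uint32_t) ha[2] << 8)  |  (uint32_t) ha[3];

	printf("QPN 0x%06x, GID ", qpn & 0xffffff);
	for (i = 4; i < INFINIBAND_ALEN; ++i)	/* GID starts at ha + 4 */
		printf("%02x%s", ha[i], i == INFINIBAND_ALEN - 1 ? "\n" : "");
	return 0;
}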
The IPoIB protocol/encapsulation is described in the Internet-Drafts http://www.ietf.org/internet-drafts/draft-ietf-ipoib-architecture-04.txt http://www.ietf.org/internet-drafts/draft-ietf-ipoib-ip-over-infiniband-08.txt Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/Kconfig 2004-12-27 21:48:21.258837002 -0800 +++ linux-bk/drivers/infiniband/Kconfig 2004-12-27 21:48:25.377230788 -0800 @@ -9,4 +9,6 @@ source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/ulp/ipoib/Kconfig" + endmenu --- linux-bk.orig/drivers/infiniband/Makefile 2004-12-27 21:48:21.219842741 -0800 +++ linux-bk/drivers/infiniband/Makefile 2004-12-27 21:48:25.347235203 -0800 @@ -1,2 +1,3 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/ +obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Kconfig 2004-12-27 21:48:25.454219455 -0800 @@ -0,0 +1,33 @@ +config INFINIBAND_IPOIB + tristate "IP-over-InfiniBand" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the IP-over-InfiniBand protocol (IPoIB). This + transports IP packets over InfiniBand so you can use your IB + device as a fancy NIC. + + The IPoIB protocol is defined by the IETF ipoib working + group: . + +config INFINIBAND_IPOIB_DEBUG + bool "IP-over-InfiniBand debugging" + depends on INFINIBAND_IPOIB + ---help--- + This option causes debugging code to be compiled into the + IPoIB driver. The output can be turned on via the + debug_level and mcast_debug_level module parameters (which + can also be set after the driver is loaded through sysfs). + + This option also creates an "ipoib_debugfs," which can be + mounted to expose debugging information about IB multicast + groups used by the IPoIB driver. + +config INFINIBAND_IPOIB_DEBUG_DATA + bool "IP-over-InfiniBand data path debugging" + depends on INFINIBAND_IPOIB_DEBUG + ---help--- + This option compiles debugging code into the the data path + of the IPoIB driver. The output can be turned on via the + data_debug_level module parameter; however, even with output + turned off, this debugging code will have some performance + impact. --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/Makefile 2004-12-27 21:48:25.420224459 -0800 @@ -0,0 +1,11 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_IPOIB) += ib_ipoib.o + +ib_ipoib-y := ipoib_main.o \ + ipoib_ib.o \ + ipoib_multicast.o \ + ipoib_verbs.o \ + ipoib_vlan.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o + --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib.h 2004-12-27 21:48:25.497213127 -0800 @@ -0,0 +1,350 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib.h 1358 2004-12-17 22:00:11Z roland $ + */ + +#ifndef _IPOIB_H +#define _IPOIB_H + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +#include +#include +#include + +/* constants */ + +enum { + IPOIB_PACKET_SIZE = 2048, + IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES, + + IPOIB_ENCAP_LEN = 4, + + IPOIB_RX_RING_SIZE = 128, + IPOIB_TX_RING_SIZE = 64, + + IPOIB_NUM_WC = 4, + + IPOIB_MAX_PATH_REC_QUEUE = 3, + IPOIB_MAX_MCAST_QUEUE = 3, + + IPOIB_FLAG_OPER_UP = 0, + IPOIB_FLAG_ADMIN_UP = 1, + IPOIB_PKEY_ASSIGNED = 2, + IPOIB_PKEY_STOP = 3, + IPOIB_FLAG_SUBINTERFACE = 4, + IPOIB_MCAST_RUN = 5, + IPOIB_STOP_REAPER = 6, + + IPOIB_MAX_BACKOFF_SECONDS = 16, + + IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ + IPOIB_MCAST_FLAG_SENDONLY = 1, + IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + IPOIB_MCAST_FLAG_ATTACHED = 3, +}; + +/* structs */ + +struct ipoib_header { + u16 proto; + u16 reserved; +}; + +struct ipoib_pseudoheader { + u8 hwaddr[INFINIBAND_ALEN]; +}; + +struct ipoib_mcast; + +struct ipoib_buf { + struct sk_buff *skb; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +/* + * Device private locking: tx_lock protects members used in TX fast + * path (and we use LLTX so upper layers don't do extra locking). + * lock protects everything else. lock nests inside of tx_lock (ie + * tx_lock must be acquired first if needed). 
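+ * (For example, neigh_add_path() and unicast_arp_send() in
+ * ipoib_main.c take lock while already running inside tx_lock
+ * from ipoib_start_xmit().)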
+ */ +struct ipoib_dev_priv { + spinlock_t lock; + + struct net_device *dev; + + unsigned long flags; + + struct semaphore mcast_mutex; + struct semaphore vlan_mutex; + + struct rb_root path_tree; + struct list_head path_list; + + struct ipoib_mcast *broadcast; + struct list_head multicast_list; + struct rb_root multicast_tree; + + struct work_struct pkey_task; + struct work_struct mcast_task; + struct work_struct flush_task; + struct work_struct restart_task; + struct work_struct ah_reap_task; + + struct ib_device *ca; + u8 port; + u16 pkey; + struct ib_pd *pd; + struct ib_mr *mr; + struct ib_cq *cq; + struct ib_qp *qp; + u32 qkey; + + union ib_gid local_gid; + u16 local_lid; + + unsigned int admin_mtu; + unsigned int mcast_mtu; + + struct ipoib_buf *rx_ring; + + spinlock_t tx_lock; + struct ipoib_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + + struct ib_wc ibwc[IPOIB_NUM_WC]; + + struct list_head dead_ahs; + + struct ib_event_handler event_handler; + + struct net_device_stats stats; + + struct net_device *parent; + struct list_head child_intfs; + struct list_head list; + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG + struct list_head fs_list; + struct dentry *mcg_dentry; +#endif +}; + +struct ipoib_ah { + struct net_device *dev; + struct ib_ah *ah; + struct list_head list; + struct kref ref; + unsigned last_send; +}; + +struct ipoib_path { + struct net_device *dev; + struct ib_sa_path_rec pathrec; + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct list_head neigh_list; + + int query_id; + struct ib_sa_query *query; + struct completion done; + + struct rb_node rb_node; + struct list_head list; +}; + +struct ipoib_neigh { + struct ipoib_ah *ah; + struct sk_buff_head queue; + + struct neighbour *neighbour; + + struct list_head list; +}; + +static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh) +{ + return (struct ipoib_neigh **) (neigh->ha + 24 - + (offsetof(struct neighbour, ha) & 4)); +} + +extern struct workqueue_struct *ipoib_workqueue; + +/* functions */ + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr); +void ipoib_free_ah(struct kref *kref); +static inline void ipoib_put_ah(struct ipoib_ah *ah) +{ + kref_put(&ah->ref, ipoib_free_ah); +} + +int ipoib_add_pkey_attr(struct net_device *dev); + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn); +void ipoib_reap_ah(void *dev_ptr); + +void ipoib_flush_paths(struct net_device *dev); +struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_ib_dev_flush(void *dev); +void ipoib_ib_dev_cleanup(struct net_device *dev); + +int ipoib_ib_dev_open(struct net_device *dev); +int ipoib_ib_dev_up(struct net_device *dev); +int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev); + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); +void ipoib_dev_cleanup(struct net_device *dev); + +void ipoib_mcast_join_task(void *dev_ptr); +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb); + +void ipoib_mcast_restart_task(void *dev_ptr); +int ipoib_mcast_start_thread(struct net_device *dev); +int ipoib_mcast_stop_thread(struct net_device *dev); + +void ipoib_mcast_dev_down(struct net_device *dev); +void ipoib_mcast_dev_flush(struct net_device *dev); + +struct 
ipoib_mcast_iter *ipoib_mcast_iter_init(struct net_device *dev); +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter); +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter); +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *gid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only); + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, + union ib_gid *mgid); + +int ipoib_qp_create(struct net_device *dev); +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); +void ipoib_transport_dev_cleanup(struct net_device *dev); + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); + +void ipoib_pkey_poll(void *dev); +int ipoib_pkey_dev_delay_open(struct net_device *dev); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int ipoib_create_debug_file(struct net_device *dev); +void ipoib_delete_debug_file(struct net_device *dev); +int ipoib_register_debugfs(void); +void ipoib_unregister_debugfs(void); +#else +static inline int ipoib_create_debug_file(struct net_device *dev) { return 0; } +static inline void ipoib_delete_debug_file(struct net_device *dev) { } +static inline int ipoib_register_debugfs(void) { return 0; } +static inline void ipoib_unregister_debugfs(void) { } +#endif + + +#define ipoib_printk(level, priv, format, arg...) \ + printk(level "%s: " format, ((struct ipoib_dev_priv *) priv)->dev->name , ## arg) +#define ipoib_warn(priv, format, arg...) \ + ipoib_printk(KERN_WARNING, priv, format , ## arg) + + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +extern int debug_level; + +#define ipoib_dbg(priv, format, arg...) \ + do { \ + if (debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { \ + if (mcast_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG */ +#define ipoib_dbg(priv, format, arg...) \ + do { (void) (priv); } while (0) +#define ipoib_dbg_mcast(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +#define ipoib_dbg_data(priv, format, arg...) \ + do { \ + if (data_debug_level > 0) \ + ipoib_printk(KERN_DEBUG, priv, format , ## arg); \ + } while (0) +#else /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ +#define ipoib_dbg_data(priv, format, arg...) \ + do { (void) (priv); } while (0) +#endif /* CONFIG_INFINIBAND_IPOIB_DEBUG_DATA */ + + +#define IPOIB_GID_FMT "%x:%x:%x:%x:%x:%x:%x:%x" + +#define IPOIB_GID_ARG(gid) be16_to_cpup((__be16 *) ((gid).raw + 0)), \ + be16_to_cpup((__be16 *) ((gid).raw + 2)), \ + be16_to_cpup((__be16 *) ((gid).raw + 4)), \ + be16_to_cpup((__be16 *) ((gid).raw + 6)), \ + be16_to_cpup((__be16 *) ((gid).raw + 8)), \ + be16_to_cpup((__be16 *) ((gid).raw + 10)), \ + be16_to_cpup((__be16 *) ((gid).raw + 12)), \ + be16_to_cpup((__be16 *) ((gid).raw + 14)) + +#endif /* _IPOIB_H */ --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_fs.c 2004-12-27 21:48:25.549205474 -0800 @@ -0,0 +1,287 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib_fs.c 1389 2004-12-27 22:56:47Z roland $ + */ + +#include +#include + +#include "ipoib.h" + +enum { + IPOIB_MAGIC = 0x49504942 /* "IPIB" */ +}; + +static DECLARE_MUTEX(ipoib_fs_mutex); +static struct dentry *ipoib_root; +static struct super_block *ipoib_sb; +static LIST_HEAD(ipoib_device_list); + +static void *ipoib_mcg_seq_start(struct seq_file *file, loff_t *pos) +{ + struct ipoib_mcast_iter *iter; + loff_t n = *pos; + + iter = ipoib_mcast_iter_init(file->private); + if (!iter) + return NULL; + + while (n--) { + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + } + + return iter; +} + +static void *ipoib_mcg_seq_next(struct seq_file *file, void *iter_ptr, + loff_t *pos) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + + (*pos)++; + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +static void ipoib_mcg_seq_stop(struct seq_file *file, void *iter_ptr) +{ + /* nothing for now */ +} + +static int ipoib_mcg_seq_show(struct seq_file *file, void *iter_ptr) +{ + struct ipoib_mcast_iter *iter = iter_ptr; + char gid_buf[sizeof "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"]; + union ib_gid mgid; + int i, n; + unsigned long created; + unsigned int queuelen, complete, send_only; + + if (iter) { + ipoib_mcast_iter_read(iter, &mgid, &created, &queuelen, + &complete, &send_only); + + for (n = 0, i = 0; i < sizeof mgid / 2; ++i) { + n += sprintf(gid_buf + n, "%x", + be16_to_cpu(((u16 *)mgid.raw)[i])); + if (i < sizeof mgid / 2 - 1) + gid_buf[n++] = ':'; + } + } + + seq_printf(file, "GID: %*s", -(1 + (int) sizeof gid_buf), gid_buf); + + seq_printf(file, + " created: %10ld queuelen: %4d complete: %d send_only: %d\n", + created, queuelen, complete, send_only); + + return 0; +} + +static struct seq_operations ipoib_seq_ops = { + .start = ipoib_mcg_seq_start, + .next = ipoib_mcg_seq_next, + .stop = ipoib_mcg_seq_stop, + .show = ipoib_mcg_seq_show, +}; + +static int ipoib_mcg_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &ipoib_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->u.generic_ip; + + 
return 0; +} + +static struct file_operations ipoib_fops = { + .owner = THIS_MODULE, + .open = ipoib_mcg_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +static struct inode *ipoib_get_inode(void) +{ + struct inode *inode = new_inode(ipoib_sb); + + if (inode) { + inode->i_mode = S_IFREG | S_IRUGO; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_fop = &ipoib_fops; + } + + return inode; +} + +static int __ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dentry *dentry; + struct inode *inode; + char name[IFNAMSIZ + sizeof "_mcg"]; + + snprintf(name, sizeof name, "%s_mcg", dev->name); + + dentry = d_alloc_name(ipoib_root, name); + if (!dentry) + return -ENOMEM; + + inode = ipoib_get_inode(); + if (!inode) { + dput(dentry); + return -ENOMEM; + } + + inode->u.generic_ip = dev; + priv->mcg_dentry = dentry; + + d_add(dentry, inode); + + return 0; +} + +int ipoib_create_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + + list_add_tail(&priv->fs_list, &ipoib_device_list); + + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return 0; + } + + up(&ipoib_fs_mutex); + + return __ipoib_create_debug_file(dev); +} + +void ipoib_delete_debug_file(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + down(&ipoib_fs_mutex); + list_del(&priv->fs_list); + if (!ipoib_sb) { + up(&ipoib_fs_mutex); + return; + } + up(&ipoib_fs_mutex); + + if (priv->mcg_dentry) { + d_drop(priv->mcg_dentry); + simple_unlink(ipoib_root->d_inode, priv->mcg_dentry); + } +} + +static int ipoib_fill_super(struct super_block *sb, void *data, int silent) +{ + static struct tree_descr ipoib_files[] = { + { "" } + }; + struct ipoib_dev_priv *priv; + int ret; + + ret = simple_fill_super(sb, IPOIB_MAGIC, ipoib_files); + if (ret) + return ret; + + ipoib_root = sb->s_root; + + down(&ipoib_fs_mutex); + + ipoib_sb = sb; + + list_for_each_entry(priv, &ipoib_device_list, fs_list) { + ret = __ipoib_create_debug_file(priv->dev); + if (ret) + break; + } + + up(&ipoib_fs_mutex); + + return ret; +} + +static struct super_block *ipoib_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_single(fs_type, flags, data, ipoib_fill_super); +} + +static void ipoib_kill_sb(struct super_block *sb) +{ + down(&ipoib_fs_mutex); + ipoib_sb = NULL; + up(&ipoib_fs_mutex); + + kill_litter_super(sb); +} + +static struct file_system_type ipoib_fs_type = { + .owner = THIS_MODULE, + .name = "ipoib_debugfs", + .get_sb = ipoib_get_sb, + .kill_sb = ipoib_kill_sb, +}; + +int ipoib_register_debugfs(void) +{ + return register_filesystem(&ipoib_fs_type); +} + +void ipoib_unregister_debugfs(void) +{ + unregister_filesystem(&ipoib_fs_type); +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2004-12-27 21:48:25.597198409 -0800 @@ -0,0 +1,678 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ipoib_ib.c 1386 2004-12-27 16:23:17Z roland $ + */ + +#include +#include + +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +int data_debug_level; + +module_param(data_debug_level, int, 0644); +MODULE_PARM_DESC(data_debug_level, + "Enable data path debug tracing if > 0"); +#endif + +#define IPOIB_OP_RECV (1ul << 31) + +static DECLARE_MUTEX(pkey_sem); + +struct ipoib_ah *ipoib_create_ah(struct net_device *dev, + struct ib_pd *pd, struct ib_ah_attr *attr) +{ + struct ipoib_ah *ah; + + ah = kmalloc(sizeof *ah, GFP_KERNEL); + if (!ah) + return NULL; + + ah->dev = dev; + ah->last_send = 0; + kref_init(&ah->ref); + + ah->ah = ib_create_ah(pd, attr); + if (IS_ERR(ah->ah)) { + kfree(ah); + ah = NULL; + } else + ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah); + + return ah; +} + +void ipoib_free_ah(struct kref *kref) +{ + struct ipoib_ah *ah = container_of(kref, struct ipoib_ah, ref); + struct ipoib_dev_priv *priv = netdev_priv(ah->dev); + + unsigned long flags; + + if (ah->last_send <= priv->tx_tail) { + ipoib_dbg(priv, "Freeing ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } else { + spin_lock_irqsave(&priv->lock, flags); + list_add_tail(&ah->list, &priv->dead_ahs); + spin_unlock_irqrestore(&priv->lock, flags); + } +} + +static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, + unsigned int wr_id, + dma_addr_t addr) +{ + struct ib_sge list = { + .addr = addr, + .length = IPOIB_BUF_SIZE, + .lkey = priv->mr->lkey, + }; + struct ib_recv_wr param = { + .wr_id = wr_id | IPOIB_OP_RECV, + .sg_list = &list, + .num_sge = 1, + .recv_flags = IB_RECV_SIGNALED + }; + struct ib_recv_wr *bad_wr; + + return ib_post_recv(priv->qp, ¶m, &bad_wr); +} + +static int ipoib_ib_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + dma_addr_t addr; + int ret; + + skb = dev_alloc_skb(IPOIB_BUF_SIZE + 4); + if (!skb) { + ipoib_warn(priv, "failed to allocate receive buffer\n"); + + priv->rx_ring[id].skb = NULL; + return -ENOMEM; + } + skb_reserve(skb, 4); /* 16 byte align IP header */ + priv->rx_ring[id].skb = skb; + addr = dma_map_single(priv->ca->dma_device, + skb->data, IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); + + 
ret = ipoib_ib_receive(priv, id, addr); + if (ret) { + ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", + id, ret); + priv->rx_ring[id].skb = NULL; + } + + return ret; +} + +static int ipoib_ib_post_receives(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + if (ipoib_ib_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + return -EIO; + } + } + + return 0; +} + +static void ipoib_ib_handle_wc(struct net_device *dev, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + + ipoib_dbg_data(priv, "called: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (wr_id & IPOIB_OP_RECV) { + wr_id &= ~IPOIB_OP_RECV; + + if (wr_id < IPOIB_RX_RING_SIZE) { + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + + priv->rx_ring[wr_id].skb = NULL; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->rx_ring[wr_id], + mapping), + IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + + if (wc->status != IB_WC_SUCCESS) { + if (wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed recv event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put(skb, wc->byte_len); + skb_pull(skb, IB_GRH_BYTES); + + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); + } + + /* repost receive */ + if (ipoib_ib_post_receive(dev, wr_id)) + ipoib_warn(priv, "ipoib_ib_post_receive failed " + "for buf %d\n", wr_id); + } else + ipoib_warn(priv, "completion event with wrid %d\n", + wr_id); + + } else { + struct ipoib_buf *tx_req; + unsigned long flags; + + if (wr_id >= IPOIB_TX_RING_SIZE) { + ipoib_warn(priv, "completion event with wrid %d (> %d)\n", + wr_id, IPOIB_TX_RING_SIZE); + return; + } + + ipoib_dbg_data(priv, "send complete, wrid %d\n", wr_id); + + tx_req = &priv->tx_ring[wr_id]; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++priv->tx_tail; + if (netif_queue_stopped(dev) && + priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2) + netif_wake_queue(dev); + spin_unlock_irqrestore(&priv->tx_lock, flags); + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) + ipoib_warn(priv, "failed send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + } +} + +void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->ibwc); + for (i = 0; i < n; ++i) + ipoib_ib_handle_wc(dev, priv->ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +static inline int post_send(struct 
ipoib_dev_priv *priv, + unsigned int wr_id, + struct ib_ah *address, u32 qpn, + dma_addr_t addr, int len) +{ + struct ib_sge list = { + .addr = addr, + .length = len, + .lkey = priv->mr->lkey, + }; + struct ib_send_wr param = { + .wr_id = wr_id, + .opcode = IB_WR_SEND, + .sg_list = &list, + .num_sge = 1, + .wr = { + .ud = { + .remote_qpn = qpn, + .remote_qkey = priv->qkey, + .ah = address + }, + }, + .send_flags = IB_SEND_SIGNALED, + }; + struct ib_send_wr *bad_wr; + + return ib_post_send(priv->qp, ¶m, &bad_wr); +} + +void ipoib_send(struct net_device *dev, struct sk_buff *skb, + struct ipoib_ah *address, u32 qpn) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_buf *tx_req; + dma_addr_t addr; + + if (skb->len > dev->mtu + INFINIBAND_ALEN) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, dev->mtu + INFINIBAND_ALEN); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "sending packet, length=%d address=%p qpn=0x%06x\n", + skb->len, address, qpn); + + /* + * We put the skb into the tx_ring _before_ we call post_send() + * because it's entirely possible that the completion handler will + * run before we execute anything after the post_send(). That + * means we have to make sure everything is properly recorded and + * our state is consistent before we call post_send(). + */ + tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)]; + tx_req->skb = skb; + addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, + DMA_TO_DEVICE); + pci_unmap_addr_set(tx_req, mapping, addr); + + if (unlikely(post_send(priv, priv->tx_head & (IPOIB_TX_RING_SIZE - 1), + address->ah, qpn, addr, skb->len))) { + ipoib_warn(priv, "post_send failed\n"); + ++priv->stats.tx_errors; + dma_unmap_single(priv->ca->dma_device, addr, skb->len, + DMA_TO_DEVICE); + dev_kfree_skb_any(skb); + } else { + dev->trans_start = jiffies; + + address->last_send = priv->tx_head; + ++priv->tx_head; + + if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + } + } +} + +void __ipoib_reap_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah, *tah; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) + if (ah->last_send <= priv->tx_tail) { + list_del(&ah->list); + list_add_tail(&ah->list, &remove_list); + } + spin_unlock_irq(&priv->lock); + + list_for_each_entry_safe(ah, tah, &remove_list, list) { + ipoib_dbg(priv, "Reaping ah %p\n", ah->ah); + ib_destroy_ah(ah->ah); + kfree(ah); + } +} + +void ipoib_reap_ah(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + __ipoib_reap_ah(dev); + + if (!test_bit(IPOIB_STOP_REAPER, &priv->flags)) + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); +} + +int ipoib_ib_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + ret = ipoib_qp_create(dev); + if (ret) { + ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret); + return -1; + } + + ret = ipoib_ib_post_receives(dev); + if (ret) { + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + return -1; + } + + clear_bit(IPOIB_STOP_REAPER, &priv->flags); + queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); + + return 0; +} + +int ipoib_ib_dev_up(struct net_device *dev) +{ + struct 
ipoib_dev_priv *priv = netdev_priv(dev); + + set_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + + return ipoib_mcast_start_thread(dev); +} + +int ipoib_ib_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "downing ib_dev\n"); + + clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags); + netif_carrier_off(dev); + + /* Shutdown the P_Key thread if still active */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + set_bit(IPOIB_PKEY_STOP, &priv->flags); + cancel_delayed_work(&priv->pkey_task); + up(&pkey_sem); + flush_workqueue(ipoib_workqueue); + } + + ipoib_mcast_stop_thread(dev); + + /* + * Flush the multicast groups first so we stop any multicast joins. The + * completion thread may have already died and we may deadlock waiting + * for the completion thread to finish some multicast joins. + */ + ipoib_mcast_dev_flush(dev); + + /* Delete broadcast and local addresses since they will be recreated */ + ipoib_mcast_dev_down(dev); + + ipoib_flush_paths(dev); + + return 0; +} + +static int recvs_pending(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int pending = 0; + int i; + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) + ++pending; + + return pending; +} + +int ipoib_ib_dev_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int attr_mask; + unsigned long begin; + struct ipoib_buf *tx_req; + int i; + + /* Kill the existing QP and allocate a new one */ + qp_attr.qp_state = IB_QPS_ERR; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to ERROR state\n"); + + /* Wait for all sends and receives to complete */ + begin = jiffies; + + while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) { + if (time_after(jiffies, begin + 5 * HZ)) { + ipoib_warn(priv, "timing out; %d sends %d receives not completed\n", + priv->tx_head - priv->tx_tail, recvs_pending(dev)); + + /* + * assume the HW is wedged and just free up + * all our pending work requests. 
+ */ + while (priv->tx_tail < priv->tx_head) { + tx_req = &priv->tx_ring[priv->tx_tail & + (IPOIB_TX_RING_SIZE - 1)]; + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + dev_kfree_skb_any(tx_req->skb); + ++priv->tx_tail; + } + + for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) + if (priv->rx_ring[i].skb) { + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->rx_ring[i], + mapping), + IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + dev_kfree_skb_any(priv->rx_ring[i].skb); + priv->rx_ring[i].skb = NULL; + } + + goto timeout; + } + + yield(); + } + + ipoib_dbg(priv, "All sends and receives done.\n"); + +timeout: + qp_attr.qp_state = IB_QPS_RESET; + attr_mask = IB_QP_STATE; + if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + ipoib_warn(priv, "Failed to modify QP to RESET state\n"); + + /* Wait for all AHs to be reaped */ + set_bit(IPOIB_STOP_REAPER, &priv->flags); + cancel_delayed_work(&priv->ah_reap_task); + flush_workqueue(ipoib_workqueue); + + begin = jiffies; + + while (!list_empty(&priv->dead_ahs)) { + __ipoib_reap_ah(dev); + + if (time_after(jiffies, begin + HZ)) { + ipoib_warn(priv, "timing out; will leak address handles\n"); + break; + } + + yield(); + } + + return 0; +} + +int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + priv->ca = ca; + priv->port = port; + priv->qp = NULL; + + if (ipoib_transport_dev_init(dev, ca)) { + printk(KERN_WARNING "%s: ipoib_transport_dev_init failed\n", ca->name); + return -ENODEV; + } + + if (dev->flags & IFF_UP) { + if (ipoib_ib_dev_open(dev)) { + ipoib_transport_dev_cleanup(dev); + return -ENODEV; + } + } + + return 0; +} + +void ipoib_ib_dev_flush(void *_dev) +{ + struct net_device *dev = (struct net_device *)_dev; + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv; + + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + + ipoib_dbg(priv, "flushing\n"); + + ipoib_ib_dev_down(dev); + + /* + * The device could have been brought down between the start and when + * we get here, don't bring it back up if it's not configured up + */ + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_ib_dev_up(dev); + + /* Flush any child interfaces too */ + list_for_each_entry(cpriv, &priv->child_intfs, list) + ipoib_ib_dev_flush(&cpriv->dev); +} + +void ipoib_ib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "cleaning up ib_dev\n"); + + ipoib_mcast_stop_thread(dev); + + /* Delete the broadcast address and the local address */ + ipoib_mcast_dev_down(dev); + + ipoib_transport_dev_cleanup(dev); +} + +/* + * Delayed P_Key Assigment Interim Support + * + * The following is initial implementation of delayed P_Key assigment + * mechanism. It is using the same approach implemented for the multicast + * group join. The single goal of this implementation is to quickly address + * Bug #2507. This implementation will probably be removed when the P_Key + * change async notification is available. 
+ */ +int ipoib_open(struct net_device *dev); + +static void ipoib_pkey_dev_check_presence(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + u16 pkey_index = 0; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + else + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); +} + +void ipoib_pkey_poll(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_pkey_dev_check_presence(dev); + + if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + ipoib_open(dev); + else { + down(&pkey_sem); + if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + } +} + +int ipoib_pkey_dev_delay_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Look for the interface pkey value in the IB Port P_Key table and */ + /* set the interface pkey assigment flag */ + ipoib_pkey_dev_check_presence(dev); + + /* P_Key value not assigned yet - start polling */ + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + down(&pkey_sem); + clear_bit(IPOIB_PKEY_STOP, &priv->flags); + queue_delayed_work(ipoib_workqueue, + &priv->pkey_task, + HZ); + up(&pkey_sem); + return 1; + } + + return 0; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_main.c 2004-12-27 21:48:25.628193847 -0800 @@ -0,0 +1,1079 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_main.c 1377 2004-12-23 19:57:12Z roland $ + */ + +#include "ipoib.h" + +#include +#include + +#include +#include +#include + +#include /* For ARPHRD_xxx */ + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); +MODULE_LICENSE("Dual BSD/GPL"); + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int debug_level; + +module_param(debug_level, int, 0644); +MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0"); +#endif + +static const u8 ipv4_bcast_addr[] = { + 0x00, 0xff, 0xff, 0xff, + 0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff +}; + +struct workqueue_struct *ipoib_workqueue; + +static void ipoib_add_one(struct ib_device *device); +static void ipoib_remove_one(struct ib_device *device); + +static struct ib_client ipoib_client = { + .name = "ipoib", + .add = ipoib_add_one, + .remove = ipoib_remove_one +}; + +int ipoib_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "bringing up interface\n"); + + set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + if (ipoib_pkey_dev_delay_open(dev)) + return 0; + + if (ipoib_ib_dev_open(dev)) + return -EINVAL; + + if (ipoib_ib_dev_up(dev)) + return -EINVAL; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring up any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (flags & IFF_UP) + continue; + + dev_change_flags(cpriv->dev, flags | IFF_UP); + } + up(&priv->vlan_mutex); + } + + netif_start_queue(dev); + + return 0; +} + +static int ipoib_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "stopping interface\n"); + + clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + netif_stop_queue(dev); + + ipoib_ib_dev_down(dev); + ipoib_ib_dev_stop(dev); + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + struct ipoib_dev_priv *cpriv; + + /* Bring down any child interfaces too */ + down(&priv->vlan_mutex); + list_for_each_entry(cpriv, &priv->child_intfs, list) { + int flags; + + flags = cpriv->dev->flags; + if (!(flags & IFF_UP)) + continue; + + dev_change_flags(cpriv->dev, flags & ~IFF_UP); + } + up(&priv->vlan_mutex); + } + + return 0; +} + +static int ipoib_change_mtu(struct net_device *dev, int new_mtu) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + return -EINVAL; + + priv->admin_mtu = new_mtu; + + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + return 0; +} + +static struct ipoib_path *__path_find(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->path_tree.rb_node; + struct ipoib_path *path; + int ret; + + while (n) { + path = rb_entry(n, struct ipoib_path, rb_node); + + ret = memcmp(gid->raw, path->pathrec.dgid.raw, + sizeof (union ib_gid)); + + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return path; + } + + return NULL; +} + +static int __path_add(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->path_tree.rb_node; + struct rb_node *pn = NULL; + struct ipoib_path *tpath; + int ret; + + while (*n) { + pn = *n; + tpath = rb_entry(pn, struct ipoib_path, rb_node); + + ret = memcmp(path->pathrec.dgid.raw, 
tpath->pathrec.dgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&path->rb_node, pn, n); + rb_insert_color(&path->rb_node, &priv->path_tree); + + list_add_tail(&path->list, &priv->path_list); + + return 0; +} + +static void __path_free(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tn; + struct sk_buff *skb; + + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_irq(skb); + + list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + if (path->ah) + ipoib_put_ah(path->ah); + + rb_erase(&path->rb_node, &priv->path_tree); + list_del(&path->list); + kfree(path); +} + +void ipoib_flush_paths(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path, *tp; + LIST_HEAD(remove_list); + unsigned long flags; + + spin_lock_irqsave(&priv->lock, flags); + list_splice(&priv->path_list, &remove_list); + INIT_LIST_HEAD(&priv->path_list); + spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry_safe(path, tp, &remove_list, list) { + if (path->query) + ib_sa_cancel_query(path->query_id, path->query); + wait_for_completion(&path->done); + __path_free(dev, path); + } +} + +static void path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *path_ptr) +{ + struct ipoib_path *path = path_ptr; + struct net_device *dev = path->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_ah *ah = NULL; + struct ipoib_neigh *neigh; + struct sk_buff_head skqueue; + struct sk_buff *skb; + unsigned long flags; + + if (pathrec) + ipoib_dbg(priv, "PathRec LID 0x%04x for GID " IPOIB_GID_FMT "\n", + be16_to_cpu(pathrec->dlid), IPOIB_GID_ARG(pathrec->dgid)); + else + ipoib_dbg(priv, "PathRec status %d for GID " IPOIB_GID_FMT "\n", + status, IPOIB_GID_ARG(path->pathrec.dgid)); + + skb_queue_head_init(&skqueue); + + if (!status) { + /* + * For now we set static_rate to 0. This is not + * really correct: we should look at the rate + * component of the path member record, compare it + * with the rate of our local port (calculated from + * the active link speed and link width) and set an + * inter-packet delay appropriately. 
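+ * For example, a peer reached over a 1X (2.5 Gb/s) path from a
+ * local 4X (10 Gb/s) port would need enough inter-packet delay to
+ * hold the sender to roughly a quarter of the local line rate.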
+ */ + struct ib_ah_attr av = { + .dlid = be16_to_cpu(pathrec->dlid), + .sl = pathrec->sl, + .static_rate = 0, + .port_num = priv->port + }; + + ah = ipoib_create_ah(dev, priv->pd, &av); + } + + spin_lock_irqsave(&priv->lock, flags); + + path->ah = ah; + + if (ah) { + path->pathrec = *pathrec; + + ipoib_dbg(priv, "created address handle %p for LID 0x%04x, SL %d\n", + ah, be16_to_cpu(pathrec->dlid), pathrec->sl); + + while ((skb = __skb_dequeue(&path->queue))) + __skb_queue_tail(&skqueue, skb); + + list_for_each_entry(neigh, &path->neigh_list, list) { + kref_get(&path->ah->ref); + neigh->ah = path->ah; + + while ((skb = __skb_dequeue(&neigh->queue))) + __skb_queue_tail(&skqueue, skb); + } + } else + path->query = NULL; + + complete(&path->done); + + spin_unlock_irqrestore(&priv->lock, flags); + + while ((skb = __skb_dequeue(&skqueue))) { + skb->dev = dev; + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue packet\n"); + } +} + +static struct ipoib_path *path_rec_create(struct net_device *dev, + union ib_gid *gid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + path = kmalloc(sizeof *path, GFP_ATOMIC); + if (!path) + return NULL; + + path->dev = dev; + path->pathrec.dlid = 0; + + skb_queue_head_init(&path->queue); + + INIT_LIST_HEAD(&path->neigh_list); + path->query = NULL; + init_completion(&path->done); + + memcpy(path->pathrec.dgid.raw, gid->raw, sizeof (union ib_gid)); + path->pathrec.sgid = priv->local_gid; + path->pathrec.pkey = cpu_to_be16(priv->pkey); + path->pathrec.numb_path = 1; + + __path_add(dev, path); + + return path; +} + +static int path_rec_start(struct net_device *dev, + struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg(priv, "Start path record lookup for " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(path->pathrec.dgid)); + + path->query_id = + ib_sa_path_rec_get(priv->ca, priv->port, + &path->pathrec, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + 1000, GFP_ATOMIC, + path_rec_completion, + path, &path->query); + if (path->query_id < 0) { + ipoib_warn(priv, "ib_sa_path_rec_get failed\n"); + path->query = NULL; + return path->query_id; + } + + return 0; +} + +static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + struct ipoib_neigh *neigh; + + neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + if (!neigh) { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + return; + } + + skb_queue_head_init(&neigh->queue); + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. 
+ */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (skb->dst->neighbour->ha + 4)); + if (!path) + goto err; + } + + list_add_tail(&neigh->list, &path->neigh_list); + + if (path->pathrec.dlid) { + kref_get(&path->ah->ref); + neigh->ah = path->ah; + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + } else { + neigh->ah = NULL; + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + __skb_queue_tail(&neigh->queue, skb); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + if (!path->query && path_rec_start(dev, path)) + goto err; + } + + spin_unlock(&priv->lock); + return; + +err: + *to_ipoib_neigh(skb->dst->neighbour) = NULL; + list_del(&neigh->list); + kfree(neigh); + neigh->neighbour->ops->destructor = NULL; + + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + + spin_unlock(&priv->lock); +} + +static void path_lookup(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(skb->dev); + + /* Look up path record for unicasts */ + if (skb->dst->neighbour->ha[4] != 0xff) { + neigh_add_path(skb, dev); + return; + } + + /* Add in the P_Key for multicasts */ + skb->dst->neighbour->ha[8] = (priv->pkey >> 8) & 0xff; + skb->dst->neighbour->ha[9] = priv->pkey & 0xff; + ipoib_mcast_send(dev, (union ib_gid *) (skb->dst->neighbour->ha + 4), skb); +} + +static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, + struct ipoib_pseudoheader *phdr) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + path = __path_find(dev, (union ib_gid *) (phdr->hwaddr + 4)); + if (!path) { + path = path_rec_create(dev, + (union ib_gid *) (phdr->hwaddr + 4)); + if (path) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof *phdr); + __skb_queue_tail(&path->queue, skb); + + if (path_rec_start(dev, path)) + __path_free(dev, path); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); + return; + } + + if (path->pathrec.dlid) { + ipoib_dbg(priv, "Send unicast ARP to %04x\n", + be16_to_cpu(path->pathrec.dlid)); + + ipoib_send(dev, skb, path->ah, + be32_to_cpup((__be32 *) phdr->hwaddr)); + } else if ((path->query || !path_rec_start(dev, path)) && + skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof *phdr); + __skb_queue_tail(&path->queue, skb); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + + spin_unlock(&priv->lock); +} + +static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh; + unsigned long flags; + + local_irq_save(flags); + if (!spin_trylock(&priv->tx_lock)) { + local_irq_restore(flags); + return NETDEV_TX_LOCKED; + } + + /* + * Check if our queue is stopped. Since we have the LLTX bit + * set, we can't rely on netif_stop_queue() preventing our + * xmit function from being called with a full queue. 
+ */ + if (unlikely(netif_queue_stopped(dev))) { + spin_unlock_irqrestore(&priv->tx_lock, flags); + return NETDEV_TX_BUSY; + } + + if (skb->dst && skb->dst->neighbour) { + if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { + path_lookup(skb, dev); + goto out; + } + + neigh = *to_ipoib_neigh(skb->dst->neighbour); + + if (likely(neigh->ah)) { + ipoib_send(dev, skb, neigh->ah, + be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + goto out; + } + + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { + spin_lock(&priv->lock); + __skb_queue_tail(&neigh->queue, skb); + spin_unlock(&priv->lock); + } else { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } + } else { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb->data; + skb_pull(skb, sizeof *phdr); + + if (phdr->hwaddr[4] == 0xff) { + /* Add in the P_Key for multicast*/ + phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff; + phdr->hwaddr[9] = priv->pkey & 0xff; + + ipoib_mcast_send(dev, (union ib_gid *) (phdr->hwaddr + 4), skb); + } else { + /* unicast GID -- should be ARP reply */ + + if (be16_to_cpup((u16 *) skb->data) != ETH_P_ARP) { + ipoib_warn(priv, "Unicast, no %s: type %04x, QPN %06x " + IPOIB_GID_FMT "\n", + skb->dst ? "neigh" : "dst", + be16_to_cpup((u16 *) skb->data), + be32_to_cpup((u32 *) phdr->hwaddr), + IPOIB_GID_ARG(*(union ib_gid *) (phdr->hwaddr + 4))); + dev_kfree_skb_any(skb); + ++priv->stats.tx_dropped; + goto out; + } + + unicast_arp_send(skb, dev, phdr); + } + } + +out: + spin_unlock_irqrestore(&priv->tx_lock, flags); + + return NETDEV_TX_OK; +} + +struct net_device_stats *ipoib_get_stats(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return &priv->stats; +} + +static void ipoib_timeout(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_warn(priv, "transmit timeout: latency %ld\n", + jiffies - dev->trans_start); + /* XXX reset QP, etc. */ +} + +static int ipoib_hard_header(struct sk_buff *skb, + struct net_device *dev, + unsigned short type, + void *daddr, void *saddr, unsigned len) +{ + struct ipoib_header *header; + + header = (struct ipoib_header *) skb_push(skb, sizeof *header); + + header->proto = htons(type); + header->reserved = 0; + + /* + * If we don't have a neighbour structure, stuff the + * destination address onto the front of the skb so we can + * figure out where to send the packet later. + */ + if (!skb->dst || !skb->dst->neighbour) { + struct ipoib_pseudoheader *phdr = + (struct ipoib_pseudoheader *) skb_push(skb, sizeof *phdr); + memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN); + } + + return 0; +} + +static void ipoib_set_mcast_list(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + schedule_work(&priv->restart_task); +} + +static void ipoib_neigh_destructor(struct neighbour *n) +{ + struct ipoib_neigh *neigh = *to_ipoib_neigh(n); + struct ipoib_dev_priv *priv = netdev_priv(n->dev); + unsigned long flags; + + ipoib_dbg(priv, + "neigh_destructor for %06x " IPOIB_GID_FMT "\n", + be32_to_cpup((__be32 *) n->ha), + IPOIB_GID_ARG(*((union ib_gid *) (n->ha + 4)))); + + spin_lock_irqsave(&priv->lock, flags); + + if (neigh) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); + list_del(&neigh->list); + *to_ipoib_neigh(n) = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); +} + +static int ipoib_neigh_setup(struct neighbour *neigh) +{ + /* + * Is this kosher? 
I can't find anybody in the kernel that + * sets neigh->destructor, so we should be able to set it here + * without trouble. + */ + neigh->ops->destructor = ipoib_neigh_destructor; + + return 0; +} + +static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) +{ + parms->neigh_setup = ipoib_neigh_setup; + + return 0; +} + +int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* Allocate RX/TX "rings" to hold queued skbs */ + + priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->rx_ring) { + printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", + ca->name, IPOIB_RX_RING_SIZE); + goto out; + } + memset(priv->rx_ring, 0, + IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf)); + + priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf), + GFP_KERNEL); + if (!priv->tx_ring) { + printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", + ca->name, IPOIB_TX_RING_SIZE); + goto out_rx_ring_cleanup; + } + memset(priv->tx_ring, 0, + IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf)); + + /* priv->tx_head & tx_tail are already 0 */ + + if (ipoib_ib_dev_init(dev, ca, port)) + goto out_tx_ring_cleanup; + + return 0; + +out_tx_ring_cleanup: + kfree(priv->tx_ring); + +out_rx_ring_cleanup: + kfree(priv->rx_ring); + +out: + return -ENOMEM; +} + +void ipoib_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv; + + ipoib_delete_debug_file(dev); + + /* Delete any child interfaces first */ + list_for_each_entry_safe(cpriv, tcpriv, &priv->child_intfs, list) { + unregister_netdev(cpriv->dev); + ipoib_dev_cleanup(cpriv->dev); + free_netdev(cpriv->dev); + } + + ipoib_ib_dev_cleanup(dev); + + if (priv->rx_ring) { + kfree(priv->rx_ring); + priv->rx_ring = NULL; + } + + if (priv->tx_ring) { + kfree(priv->tx_ring); + priv->tx_ring = NULL; + } +} + +static void ipoib_setup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + dev->open = ipoib_open; + dev->stop = ipoib_stop; + dev->change_mtu = ipoib_change_mtu; + dev->hard_start_xmit = ipoib_start_xmit; + dev->get_stats = ipoib_get_stats; + dev->tx_timeout = ipoib_timeout; + dev->hard_header = ipoib_hard_header; + dev->set_multicast_list = ipoib_set_mcast_list; + dev->neigh_setup = ipoib_neigh_setup_dev; + + dev->watchdog_timeo = HZ; + + dev->rebuild_header = NULL; + dev->set_mac_address = NULL; + dev->header_cache_update = NULL; + + dev->flags |= IFF_BROADCAST | IFF_MULTICAST; + + /* + * We add in INFINIBAND_ALEN to allow for the destination + * address "pseudoheader" for skbs without neighbour struct. 
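 * Concretely: IPOIB_ENCAP_LEN is the 4-byte encapsulation header and
 * INFINIBAND_ALEN the 20-byte hardware address (4-byte QPN field plus
 * 16-byte GID), so hard_header_len comes out to 24 bytes of headroom.
 * ipoib_hard_header() pushes those 20 address bytes onto the front of any
 * skb that has no neighbour, and ipoib_start_xmit() pulls them back off to
 * recover the destination QPN and GID.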
+ */ + dev->hard_header_len = IPOIB_ENCAP_LEN + INFINIBAND_ALEN; + dev->addr_len = INFINIBAND_ALEN; + dev->type = ARPHRD_INFINIBAND; + dev->tx_queue_len = IPOIB_TX_RING_SIZE * 2; + dev->features = NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + + /* MTU will be reset when mcast join happens */ + dev->mtu = IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN; + priv->mcast_mtu = priv->admin_mtu = dev->mtu; + + memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN); + + netif_carrier_off(dev); + + SET_MODULE_OWNER(dev); + + priv->dev = dev; + + spin_lock_init(&priv->lock); + spin_lock_init(&priv->tx_lock); + + init_MUTEX(&priv->mcast_mutex); + init_MUTEX(&priv->vlan_mutex); + + INIT_LIST_HEAD(&priv->path_list); + INIT_LIST_HEAD(&priv->child_intfs); + INIT_LIST_HEAD(&priv->dead_ahs); + INIT_LIST_HEAD(&priv->multicast_list); + + INIT_WORK(&priv->pkey_task, ipoib_pkey_poll, priv->dev); + INIT_WORK(&priv->mcast_task, ipoib_mcast_join_task, priv->dev); + INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush, priv->dev); + INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task, priv->dev); + INIT_WORK(&priv->ah_reap_task, ipoib_reap_ah, priv->dev); +} + +struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) +{ + struct net_device *dev; + + dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name, + ipoib_setup); + if (!dev) + return NULL; + + return netdev_priv(dev); +} + +static ssize_t show_pkey(struct class_device *cdev, char *buf) +{ + struct ipoib_dev_priv *priv = + netdev_priv(container_of(cdev, struct net_device, class_dev)); + + return sprintf(buf, "0x%04x\n", priv->pkey); +} +static CLASS_DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t create_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_add(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? ret : count; +} +static CLASS_DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child); + +static ssize_t delete_child(struct class_device *cdev, + const char *buf, size_t count) +{ + int pkey; + int ret; + + if (sscanf(buf, "%i", &pkey) != 1) + return -EINVAL; + + if (pkey < 0 || pkey > 0xffff) + return -EINVAL; + + ret = ipoib_vlan_delete(container_of(cdev, struct net_device, class_dev), + pkey); + + return ret ? 
ret : count; + +} +static CLASS_DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child); + +int ipoib_add_pkey_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, + &class_device_attr_pkey); +} + +static struct net_device *ipoib_add_port(const char *format, + struct ib_device *hca, u8 port) +{ + struct ipoib_dev_priv *priv; + int result = -ENOMEM; + + priv = ipoib_intf_alloc(format); + if (!priv) + goto alloc_mem_failed; + + SET_NETDEV_DEV(priv->dev, hca->dma_device); + + result = ib_query_pkey(hca, port, 0, &priv->pkey); + if (result) { + printk(KERN_WARNING "%s: ib_query_pkey port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } + + priv->dev->broadcast[8] = priv->pkey >> 8; + priv->dev->broadcast[9] = priv->pkey & 0xff; + + result = ib_query_gid(hca, port, 0, &priv->local_gid); + if (result) { + printk(KERN_WARNING "%s: ib_query_gid port %d failed (ret = %d)\n", + hca->name, port, result); + goto alloc_mem_failed; + } else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + + result = ipoib_dev_init(priv->dev, hca, port); + if (result < 0) { + printk(KERN_WARNING "%s: failed to initialize port %d (ret = %d)\n", + hca->name, port, result); + goto device_init_failed; + } + + INIT_IB_EVENT_HANDLER(&priv->event_handler, + priv->ca, ipoib_event); + result = ib_register_event_handler(&priv->event_handler); + if (result < 0) { + printk(KERN_WARNING "%s: ib_register_event_handler failed for " + "port %d (ret = %d)\n", + hca->name, port, result); + goto event_failed; + } + + result = register_netdev(priv->dev); + if (result) { + printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", + hca->name, port, result); + goto register_failed; + } + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_create_child)) + goto sysfs_failed; + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_delete_child)) + goto sysfs_failed; + + return priv->dev; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ib_unregister_event_handler(&priv->event_handler); + +event_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +alloc_mem_failed: + return ERR_PTR(result); +} + +static void ipoib_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct net_device *dev; + struct ipoib_dev_priv *priv; + int s, e, p; + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + return; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + dev = ipoib_add_port("ib%d", device, p); + if (!IS_ERR(dev)) { + priv = netdev_priv(dev); + list_add_tail(&priv->list, dev_list); + } + } + + ib_set_client_data(device, &ipoib_client, dev_list); +} + +static void ipoib_remove_one(struct ib_device *device) +{ + struct ipoib_dev_priv *priv, *tmp; + struct list_head *dev_list; + + dev_list = ib_get_client_data(device, &ipoib_client); + + list_for_each_entry_safe(priv, tmp, dev_list, list) { + ib_unregister_event_handler(&priv->event_handler); + + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + free_netdev(priv->dev); + } +} + +static int __init ipoib_init_module(void) +{ + int ret; + + ret = 
ipoib_register_debugfs(); + if (ret) + return ret; + + /* + * We create our own workqueue mainly because we want to be + * able to flush it when devices are being removed. We can't + * use schedule_work()/flush_scheduled_work() because both + * unregister_netdev() and linkwatch_event take the rtnl lock, + * so flush_scheduled_work() can deadlock during device + * removal. + */ + ipoib_workqueue = create_singlethread_workqueue("ipoib"); + if (!ipoib_workqueue) { + ret = -ENOMEM; + goto err_fs; + } + + ret = ib_register_client(&ipoib_client); + if (ret) + goto err_wq; + + return 0; + +err_fs: + ipoib_unregister_debugfs(); + +err_wq: + destroy_workqueue(ipoib_workqueue); + + return ret; +} + +static void __exit ipoib_cleanup_module(void) +{ + ipoib_unregister_debugfs(); + ib_unregister_client(&ipoib_client); + destroy_workqueue(ipoib_workqueue); +} + +module_init(ipoib_init_module); +module_exit(ipoib_cleanup_module); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2004-12-27 21:48:25.681186047 -0800 @@ -0,0 +1,254 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include + +#include "ipoib.h" + +int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr *qp_attr; + int attr_mask; + int ret; + u16 pkey_index; + + ret = -ENOMEM; + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + goto out; + + if (ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + ret = -ENXIO; + goto out; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + /* set correct QKey for QP */ + qp_attr->qkey = priv->qkey; + attr_mask = IB_QP_QKEY; + ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); + goto out; + } + + /* attach QP to multicast group */ + down(&priv->mcast_mutex); + ret = ib_attach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "failed to attach to multicast group, ret = %d\n", ret); + +out: + kfree(qp_attr); + return ret; +} + +int ipoib_mcast_detach(struct net_device *dev, u16 mlid, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + down(&priv->mcast_mutex); + ret = ib_detach_mcast(priv->qp, mgid, mlid); + up(&priv->mcast_mutex); + if (ret) + ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret); + + return ret; +} + +int ipoib_qp_create(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + u16 pkey_index; + struct ib_qp_attr qp_attr; + int attr_mask; + + /* + * Search through the port P_Key table for the requested pkey value. + * The port has to be assigned to the respective IB partition in + * advance. 
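 * Note that it is the returned pkey_index, not the P_Key value itself,
 * that gets programmed into the QP.  The remainder of this function then
 * walks the UD QP through the usual INIT -> RTR -> RTS transitions: Q_Key,
 * port and P_Key index for INIT, nothing extra for RTR, and a starting
 * send PSN for RTS.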
+ */ + ret = ib_cached_pkey_find(priv->ca, priv->port, priv->pkey, &pkey_index); + if (ret) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + return ret; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qkey = 0; + qp_attr.port_num = priv->port; + qp_attr.pkey_index = pkey_index; + attr_mask = + IB_QP_QKEY | + IB_QP_PORT | + IB_QP_PKEY_INDEX | + IB_QP_STATE; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTR; + /* Can't set this in a INIT->RTR transition */ + attr_mask &= ~IB_QP_PORT; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret); + goto out_fail; + } + + qp_attr.qp_state = IB_QPS_RTS; + qp_attr.sq_psn = 0; + attr_mask |= IB_QP_SQ_PSN; + attr_mask &= ~IB_QP_PKEY_INDEX; + ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret); + goto out_fail; + } + + return 0; + +out_fail: + ib_destroy_qp(priv->qp); + priv->qp = NULL; + + return -EINVAL; +} + +int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr init_attr = { + .cap = { + .max_send_wr = IPOIB_TX_RING_SIZE, + .max_recv_wr = IPOIB_RX_RING_SIZE, + .max_send_sge = 1, + .max_recv_sge = 1 + }, + .sq_sig_type = IB_SIGNAL_ALL_WR, + .rq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_UD + }; + + priv->pd = ib_alloc_pd(priv->ca); + if (IS_ERR(priv->pd)) { + printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name); + return -ENODEV; + } + + priv->cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, + IPOIB_TX_RING_SIZE + IPOIB_RX_RING_SIZE + 1); + if (IS_ERR(priv->cq)) { + printk(KERN_WARNING "%s: failed to create CQ\n", ca->name); + goto out_free_pd; + } + + if (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP)) + goto out_free_cq; + + priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(priv->mr)) { + printk(KERN_WARNING "%s: ib_reg_phys_mr failed\n", ca->name); + goto out_free_cq; + } + + init_attr.send_cq = priv->cq; + init_attr.recv_cq = priv->cq, + + priv->qp = ib_create_qp(priv->pd, &init_attr); + if (IS_ERR(priv->qp)) { + printk(KERN_WARNING "%s: failed to create QP\n", ca->name); + goto out_free_mr; + } + + priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff; + priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff; + priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff; + + return 0; + +out_free_mr: + ib_dereg_mr(priv->mr); + +out_free_cq: + ib_destroy_cq(priv->cq); + +out_free_pd: + ib_dealloc_pd(priv->pd); + return -ENODEV; +} + +void ipoib_transport_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (priv->qp) { + if (ib_destroy_qp(priv->qp)) + ipoib_warn(priv, "ib_qp_destroy failed\n"); + + priv->qp = NULL; + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + } + + if (ib_dereg_mr(priv->mr)) + ipoib_warn(priv, "ib_dereg_mr failed\n"); + + if (ib_destroy_cq(priv->cq)) + ipoib_warn(priv, "ib_cq_destroy failed\n"); + + if (ib_dealloc_pd(priv->pd)) + ipoib_warn(priv, "ib_dealloc_pd failed\n"); +} + +void ipoib_event(struct ib_event_handler *handler, + struct ib_event *record) +{ + struct ipoib_dev_priv *priv = + container_of(handler, struct ipoib_dev_priv, event_handler); + + if (record->event == IB_EVENT_PORT_ACTIVE || + 
record->event == IB_EVENT_LID_CHANGE || + record->event == IB_EVENT_SM_CHANGE) { + ipoib_dbg(priv, "Port active event\n"); + schedule_work(&priv->flush_task); + } +} From roland at topspin.com Mon Dec 27 21:51:17 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:17 -0800 Subject: [openib-general] [PATCH][v5][20/24] Add IPoIB multicast & partition code In-Reply-To: <200412272151.XKAQkznglOHW39xJ@topspin.com> Message-ID: <200412272151.JEARwZ80axXxZD2Q@topspin.com> Add functions for handling IPoIB multicast and multiple partitions. Signed-off-by: Roland Dreier --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2004-12-27 21:48:27.157968669 -0800 @@ -0,0 +1,981 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_multicast.c 1362 2004-12-18 15:56:29Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ipoib.h" + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG +int mcast_debug_level; + +module_param(mcast_debug_level, int, 0644); +MODULE_PARM_DESC(mcast_debug_level, + "Enable multicast debug tracing if > 0"); +#endif + +static DECLARE_MUTEX(mcast_mutex); + +/* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ +struct ipoib_mcast { + struct ib_sa_mcmember_rec mcmember; + struct ipoib_ah *ah; + + struct rb_node rb_node; + struct list_head list; + struct completion done; + + int query_id; + struct ib_sa_query *query; + + unsigned long created; + unsigned long backoff; + + unsigned long flags; + unsigned char logcount; + + struct list_head neigh_list; + + struct sk_buff_head pkt_queue; + + struct net_device *dev; +}; + +struct ipoib_mcast_iter { + struct net_device *dev; + union ib_gid mgid; + unsigned long created; + unsigned int queuelen; + unsigned int complete; + unsigned int send_only; +}; + +static void ipoib_mcast_free(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tmp; + unsigned long flags; + + ipoib_dbg_mcast(netdev_priv(dev), + "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + spin_lock_irqsave(&priv->lock, flags); + + list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { + ipoib_put_ah(neigh->ah); + *to_ipoib_neigh(neigh->neighbour) = NULL; + neigh->neighbour->ops->destructor = NULL; + kfree(neigh); + } + + spin_unlock_irqrestore(&priv->lock, flags); + + if (mcast->ah) + ipoib_put_ah(mcast->ah); + + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + dev_kfree_skb_any(skb); + } + + kfree(mcast); +} + +static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev, + int can_sleep) +{ + struct ipoib_mcast *mcast; + + mcast = kmalloc(sizeof (*mcast), can_sleep ? 
GFP_KERNEL : GFP_ATOMIC); + if (!mcast) + return NULL; + + memset(mcast, 0, sizeof (*mcast)); + + init_completion(&mcast->done); + + mcast->dev = dev; + mcast->created = jiffies; + mcast->backoff = HZ; + mcast->logcount = 0; + + INIT_LIST_HEAD(&mcast->list); + INIT_LIST_HEAD(&mcast->neigh_list); + skb_queue_head_init(&mcast->pkt_queue); + + mcast->ah = NULL; + mcast->query = NULL; + + return mcast; +} + +static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, union ib_gid *mgid) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node *n = priv->multicast_tree.rb_node; + + while (n) { + struct ipoib_mcast *mcast; + int ret; + + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + ret = memcmp(mgid->raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = n->rb_left; + else if (ret > 0) + n = n->rb_right; + else + return mcast; + } + + return NULL; +} + +static int __ipoib_mcast_add(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct rb_node **n = &priv->multicast_tree.rb_node, *pn = NULL; + + while (*n) { + struct ipoib_mcast *tmcast; + int ret; + + pn = *n; + tmcast = rb_entry(pn, struct ipoib_mcast, rb_node); + + ret = memcmp(mcast->mcmember.mgid.raw, tmcast->mcmember.mgid.raw, + sizeof (union ib_gid)); + if (ret < 0) + n = &pn->rb_left; + else if (ret > 0) + n = &pn->rb_right; + else + return -EEXIST; + } + + rb_link_node(&mcast->rb_node, pn, n); + rb_insert_color(&mcast->rb_node, &priv->multicast_tree); + + return 0; +} + +static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, + struct ib_sa_mcmember_rec *mcmember) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + mcast->mcmember = *mcmember; + + /* Set the cached Q_Key before we attach if it's the broadcast group */ + if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, + sizeof (union ib_gid))) + priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); + + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + if (test_and_set_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + ipoib_warn(priv, "multicast group " IPOIB_GID_FMT + " already attached\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + return 0; + } + + ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret < 0) { + ipoib_warn(priv, "couldn't attach QP to multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags); + return ret; + } + } + + { + /* + * For now we set static_rate to 0. This is not + * really correct: we should look at the rate + * component of the MC member record, compare it with + * the rate of our local port (calculated from the + * active link speed and link width) and set an + * inter-packet delay appropriately. 
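 * (Roughly, the delay would be local_rate / group_rate - 1 packet times:
 * a 4x SDR port at 10 Gb/s sending to a group advertising 2.5 Gb/s would
 * want an IPD of 3, while 0 simply means "no throttling".)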
+ */ + struct ib_ah_attr av = { + .dlid = be16_to_cpu(mcast->mcmember.mlid), + .port_num = priv->port, + .sl = mcast->mcmember.sl, + .static_rate = 0, + .ah_flags = IB_AH_GRH, + .grh = { + .flow_label = be32_to_cpu(mcast->mcmember.flow_label), + .hop_limit = mcast->mcmember.hop_limit, + .sgid_index = 0, + .traffic_class = mcast->mcmember.traffic_class + } + }; + + av.grh.dgid = mcast->mcmember.mgid; + + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); + if (!mcast->ah) { + ipoib_warn(priv, "ib_address_create failed\n"); + } else { + ipoib_dbg_mcast(priv, "MGID " IPOIB_GID_FMT + " AV %p, LID 0x%04x, SL %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + mcast->ah->ah, + be16_to_cpu(mcast->mcmember.mlid), + mcast->mcmember.sl); + } + } + + /* actually send any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + if (!skb->dst || !skb->dst->neighbour) { + /* put pseudoheader back on for next time */ + skb_push(skb, sizeof (struct ipoib_pseudoheader)); + } + + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed to requeue packet\n"); + } + + return 0; +} + +static void +ipoib_mcast_sendonly_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + + if (!status) + ipoib_mcast_join_finish(mcast, mcmember); + else { + if (mcast->logcount++ < 20) + ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + /* Flush out any queued packets */ + while (!skb_queue_empty(&mcast->pkt_queue)) { + struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); + + skb->dev = dev; + + dev_kfree_skb_any(skb); + } + + /* Clear the busy flag so we try again */ + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); + } + + complete(&mcast->done); +} + +static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) +{ + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { +#if 0 /* Some SMs don't support send-only yet */ + .join_state = 4 +#else + .join_state = 1 +#endif + }; + int ret = 0; + + if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) { + ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n"); + return -ENODEV; + } + + if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); + return -EBUSY; + } + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 1000, GFP_ATOMIC, + ipoib_mcast_sendonly_join_complete, + mcast, &mcast->query); + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", + ret); + } else { + ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT + ", starting join\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + mcast->query_id = ret; + } + + return ret; +} + +static void ipoib_mcast_join_complete(int status, + struct ib_sa_mcmember_rec *mcmember, + void *mcast_ptr) +{ + struct ipoib_mcast *mcast = mcast_ptr; + struct net_device *dev = mcast->dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "join completion for " IPOIB_GID_FMT + " (status %d)\n", + 
IPOIB_GID_ARG(mcast->mcmember.mgid), status); + + if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { + mcast->backoff = HZ; + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + complete(&mcast->done); + return; + } + + if (status == -EINTR) { + complete(&mcast->done); + return; + } + + if (status && mcast->logcount++ < 20) { + if (status == -ETIMEDOUT || status == -EINTR) { + ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT + ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } else { + ipoib_warn(priv, "multicast join failed for " + IPOIB_GID_FMT ", status %d\n", + IPOIB_GID_ARG(mcast->mcmember.mgid), + status); + } + } + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + mcast->query = NULL; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { + if (status == -ETIMEDOUT) + queue_work(ipoib_workqueue, &priv->mcast_task); + else + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, + mcast->backoff * HZ); + } else + complete(&mcast->done); + up(&mcast_mutex); + + return; +} + +static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, + int create) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + ib_sa_comp_mask comp_mask; + int ret = 0; + + ipoib_dbg_mcast(priv, "joining MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + comp_mask = + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + if (create) { + comp_mask |= + IB_SA_MCMEMBER_REC_QKEY | + IB_SA_MCMEMBER_REC_SL | + IB_SA_MCMEMBER_REC_FLOW_LABEL | + IB_SA_MCMEMBER_REC_TRAFFIC_CLASS; + + rec.qkey = priv->broadcast->mcmember.qkey; + rec.sl = priv->broadcast->mcmember.sl; + rec.flow_label = priv->broadcast->mcmember.flow_label; + rec.traffic_class = priv->broadcast->mcmember.traffic_class; + } + + ret = ib_sa_mcmember_rec_set(priv->ca, priv->port, &rec, comp_mask, + mcast->backoff * 1000, GFP_ATOMIC, + ipoib_mcast_join_complete, + mcast, &mcast->query); + + if (ret < 0) { + ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); + + mcast->backoff *= 2; + if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) + mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; + + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, + mcast->backoff); + up(&mcast_mutex); + } else + mcast->query_id = ret; +} + +void ipoib_mcast_join_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) + return; + + if (ib_query_gid(priv->ca, priv->port, 0, &priv->local_gid)) + ipoib_warn(priv, "ib_gid_entry_get() failed\n"); + else + memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + + if (!priv->broadcast) { + priv->broadcast = ipoib_mcast_alloc(dev, 1); + if (!priv->broadcast) { + ipoib_warn(priv, "failed to allocate broadcast group\n"); + down(&mcast_mutex); + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_delayed_work(ipoib_workqueue, + &priv->mcast_task, HZ); + up(&mcast_mutex); + return; + } + + memcpy(priv->broadcast->mcmember.mgid.raw, 
priv->dev->broadcast + 4, + sizeof (union ib_gid)); + + spin_lock_irq(&priv->lock); + __ipoib_mcast_add(dev, priv->broadcast); + spin_unlock_irq(&priv->lock); + } + + if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { + ipoib_mcast_join(dev, priv->broadcast, 0); + return; + } + + while (1) { + struct ipoib_mcast *mcast = NULL; + + spin_lock_irq(&priv->lock); + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) + && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { + /* Found the next unjoined group */ + break; + } + } + spin_unlock_irq(&priv->lock); + + if (&mcast->list == &priv->multicast_list) { + /* All done */ + break; + } + + ipoib_mcast_join(dev, mcast, 1); + return; + } + + { + struct ib_port_attr attr; + + if (!ib_query_port(priv->ca, priv->port, &attr)) + priv->local_lid = attr.lid; + else + ipoib_warn(priv, "ib_query_port failed\n"); + } + + priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - + IPOIB_ENCAP_LEN; + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); + + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + netif_carrier_on(dev); +} + +int ipoib_mcast_start_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + ipoib_dbg_mcast(priv, "starting multicast thread\n"); + + down(&mcast_mutex); + if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags)) + queue_work(ipoib_workqueue, &priv->mcast_task); + up(&mcast_mutex); + + return 0; +} + +int ipoib_mcast_stop_thread(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + ipoib_dbg_mcast(priv, "stopping multicast thread\n"); + + down(&mcast_mutex); + clear_bit(IPOIB_MCAST_RUN, &priv->flags); + cancel_delayed_work(&priv->mcast_task); + up(&mcast_mutex); + + flush_workqueue(ipoib_workqueue); + + if (priv->broadcast && priv->broadcast->query) { + ib_sa_cancel_query(priv->broadcast->query_id, priv->broadcast->query); + priv->broadcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for bcast\n"); + wait_for_completion(&priv->broadcast->done); + } + + list_for_each_entry(mcast, &priv->multicast_list, list) { + if (mcast->query) { + ib_sa_cancel_query(mcast->query_id, mcast->query); + mcast->query = NULL; + ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + wait_for_completion(&mcast->done); + } + } + + return 0; +} + +int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_mcmember_rec rec = { + .join_state = 1 + }; + int ret = 0; + + if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) + return 0; + + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rec.mgid = mcast->mcmember.mgid; + rec.port_gid = priv->local_gid; + rec.pkey = be16_to_cpu(priv->pkey); + + /* Remove ourselves from the multicast group */ + ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), + &mcast->mcmember.mgid); + if (ret) + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); + + /* + * Just make one shot at leaving and don't wait for a reply; + * if we fail, too bad. 
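 * (Hence the zero timeout and NULL completion callback passed to
 * ib_sa_mcmember_rec_delete() below -- the query is fire-and-forget.)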
+ */ + ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, + IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_PKEY | + IB_SA_MCMEMBER_REC_JOIN_STATE, + 0, GFP_ATOMIC, NULL, + mcast, &mcast->query); + if (ret < 0) + ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " + "for leave (result = %d)\n", ret); + + return 0; +} + +void ipoib_mcast_send(struct net_device *dev, union ib_gid *mgid, + struct sk_buff *skb) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_mcast *mcast; + + /* + * We can only be called from ipoib_start_xmit, so we're + * inside tx_lock -- no need to save/restore flags. + */ + spin_lock(&priv->lock); + + mcast = __ipoib_mcast_find(dev, mgid); + if (!mcast) { + /* Let's create a new send only group now */ + ipoib_dbg_mcast(priv, "setting up send only multicast group for " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(*mgid)); + + mcast = ipoib_mcast_alloc(dev, 0); + if (!mcast) { + ipoib_warn(priv, "unable to allocate memory for " + "multicast structure\n"); + dev_kfree_skb_any(skb); + goto out; + } + + set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags); + mcast->mcmember.mgid = *mgid; + __ipoib_mcast_add(dev, mcast); + list_add_tail(&mcast->list, &priv->multicast_list); + } + + if (!mcast->ah) { + if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) + skb_queue_tail(&mcast->pkt_queue, skb); + else + dev_kfree_skb_any(skb); + + if (mcast->query) + ipoib_dbg_mcast(priv, "no address vector, " + "but multicast join already started\n"); + else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) + ipoib_mcast_sendonly_join(mcast); + + /* + * If lookup completes between here and out:, don't + * want to send packet twice. + */ + mcast = NULL; + } + +out: + if (mcast && mcast->ah) { + if (skb->dst && + skb->dst->neighbour && + !*to_ipoib_neigh(skb->dst->neighbour)) { + struct ipoib_neigh *neigh = kmalloc(sizeof *neigh, GFP_ATOMIC); + + if (neigh) { + kref_get(&mcast->ah->ref); + neigh->ah = mcast->ah; + neigh->neighbour = skb->dst->neighbour; + *to_ipoib_neigh(skb->dst->neighbour) = neigh; + list_add_tail(&neigh->list, &mcast->neigh_list); + } + } + + ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN); + } + + spin_unlock(&priv->lock); +} + +void ipoib_mcast_dev_flush(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + LIST_HEAD(remove_list); + struct ipoib_mcast *mcast, *tmcast, *nmcast; + unsigned long flags; + + ipoib_dbg_mcast(priv, "flushing multicast list\n"); + + spin_lock_irqsave(&priv->lock, flags); + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->flags = + mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY); + + nmcast->mcmember.mgid = mcast->mcmember.mgid; + + /* Add the new group in before the to-be-destroyed group */ + list_add_tail(&nmcast->list, &mcast->list); + list_del_init(&mcast->list); + + rb_replace_node(&mcast->rb_node, &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&mcast->list, &remove_list); + } else { + ipoib_warn(priv, "could not reallocate multicast group " + IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + } + } + + if (priv->broadcast) { + nmcast = ipoib_mcast_alloc(dev, 0); + if (nmcast) { + nmcast->mcmember.mgid = priv->broadcast->mcmember.mgid; + + rb_replace_node(&priv->broadcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + + list_add_tail(&priv->broadcast->list, &remove_list); + } + + priv->broadcast = nmcast; + } + + 
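	/*
	 * At this point every group we managed to reallocate (including the
	 * broadcast group) has been replaced in the rb-tree by a fresh,
	 * unattached copy, and the old entries sit on remove_list; the
	 * actual SA leave and free happen below, after the lock is dropped.
	 */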
spin_unlock_irqrestore(&priv->lock, flags); + + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(dev, mcast); + ipoib_mcast_free(mcast); + } +} + +void ipoib_mcast_dev_down(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + /* Delete broadcast since it will be recreated */ + if (priv->broadcast) { + ipoib_dbg_mcast(priv, "deleting broadcast group\n"); + + spin_lock_irqsave(&priv->lock, flags); + rb_erase(&priv->broadcast->rb_node, &priv->multicast_tree); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_mcast_leave(dev, priv->broadcast); + ipoib_mcast_free(priv->broadcast); + priv->broadcast = NULL; + } +} + +void ipoib_mcast_restart_task(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct dev_mc_list *mclist; + struct ipoib_mcast *mcast, *tmcast; + LIST_HEAD(remove_list); + unsigned long flags; + + ipoib_dbg_mcast(priv, "restarting multicast task\n"); + + ipoib_mcast_stop_thread(dev); + + spin_lock_irqsave(&priv->lock, flags); + + /* + * Unfortunately, the networking core only gives us a list of all of + * the multicast hardware addresses. We need to figure out which ones + * are new and which ones have been removed + */ + + /* Clear out the found flag */ + list_for_each_entry(mcast, &priv->multicast_list, list) + clear_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + + /* Mark all of the entries that are found or don't exist */ + for (mclist = dev->mc_list; mclist; mclist = mclist->next) { + union ib_gid mgid; + + memcpy(mgid.raw, mclist->dmi_addr + 4, sizeof mgid); + + /* Add in the P_Key */ + mgid.raw[4] = (priv->pkey >> 8) & 0xff; + mgid.raw[5] = priv->pkey & 0xff; + + mcast = __ipoib_mcast_find(dev, &mgid); + if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + struct ipoib_mcast *nmcast; + + /* Not found or send-only group, let's add a new entry */ + ipoib_dbg_mcast(priv, "adding multicast entry for mgid " + IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mgid)); + + nmcast = ipoib_mcast_alloc(dev, 0); + if (!nmcast) { + ipoib_warn(priv, "unable to allocate memory for multicast structure\n"); + continue; + } + + set_bit(IPOIB_MCAST_FLAG_FOUND, &nmcast->flags); + + nmcast->mcmember.mgid = mgid; + + if (mcast) { + /* Destroy the send only entry */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + + rb_replace_node(&mcast->rb_node, + &nmcast->rb_node, + &priv->multicast_tree); + } else + __ipoib_mcast_add(dev, nmcast); + + list_add_tail(&nmcast->list, &priv->multicast_list); + } + + if (mcast) + set_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags); + } + + /* Remove all of the entries don't exist anymore */ + list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list) { + if (!test_bit(IPOIB_MCAST_FLAG_FOUND, &mcast->flags) && + !test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) { + ipoib_dbg_mcast(priv, "deleting multicast group " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(mcast->mcmember.mgid)); + + rb_erase(&mcast->rb_node, &priv->multicast_tree); + + /* Move to the remove list */ + list_del(&mcast->list); + list_add_tail(&mcast->list, &remove_list); + } + } + spin_unlock_irqrestore(&priv->lock, flags); + + /* We have to cancel outside of the spinlock */ + list_for_each_entry(mcast, &remove_list, list) { + ipoib_mcast_leave(mcast->dev, mcast); + ipoib_mcast_free(mcast); + } + + if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + ipoib_mcast_start_thread(dev); +} + +struct ipoib_mcast_iter *ipoib_mcast_iter_init(struct 
net_device *dev) +{ + struct ipoib_mcast_iter *iter; + + iter = kmalloc(sizeof *iter, GFP_KERNEL); + if (!iter) + return NULL; + + iter->dev = dev; + memset(iter->mgid.raw, 0, sizeof iter->mgid); + + if (ipoib_mcast_iter_next(iter)) { + ipoib_mcast_iter_free(iter); + return NULL; + } + + return iter; +} + +void ipoib_mcast_iter_free(struct ipoib_mcast_iter *iter) +{ + kfree(iter); +} + +int ipoib_mcast_iter_next(struct ipoib_mcast_iter *iter) +{ + struct ipoib_dev_priv *priv = netdev_priv(iter->dev); + struct rb_node *n; + struct ipoib_mcast *mcast; + int ret = 1; + + spin_lock_irq(&priv->lock); + + n = rb_first(&priv->multicast_tree); + + while (n) { + mcast = rb_entry(n, struct ipoib_mcast, rb_node); + + if (memcmp(iter->mgid.raw, mcast->mcmember.mgid.raw, + sizeof (union ib_gid)) < 0) { + iter->mgid = mcast->mcmember.mgid; + iter->created = mcast->created; + iter->queuelen = skb_queue_len(&mcast->pkt_queue); + iter->complete = !!mcast->ah; + iter->send_only = !!(mcast->flags & (1 << IPOIB_MCAST_FLAG_SENDONLY)); + + ret = 0; + + break; + } + + n = rb_next(n); + } + + spin_unlock_irq(&priv->lock); + + return ret; +} + +void ipoib_mcast_iter_read(struct ipoib_mcast_iter *iter, + union ib_gid *mgid, + unsigned long *created, + unsigned int *queuelen, + unsigned int *complete, + unsigned int *send_only) +{ + *mgid = iter->mgid; + *created = iter->created; + *queuelen = iter->queuelen; + *complete = iter->complete; + *send_only = iter->send_only; +} --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 2004-12-27 21:48:27.219959544 -0800 @@ -0,0 +1,177 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: ipoib_vlan.c 1349 2004-12-16 21:09:43Z roland $ + */ + +#include +#include + +#include +#include +#include + +#include + +#include "ipoib.h" + +static ssize_t show_parent(struct class_device *class_dev, char *buf) +{ + struct net_device *dev = + container_of(class_dev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); + + return sprintf(buf, "%s\n", priv->parent->name); +} +static CLASS_DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL); + +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv; + char intf_name[IFNAMSIZ]; + int result; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + + /* + * First ensure this isn't a duplicate. We check the parent device and + * then all of the child interfaces to make sure the Pkey doesn't match. + */ + if (ppriv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + + list_for_each_entry(priv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + result = -ENOTUNIQ; + goto err; + } + } + + snprintf(intf_name, sizeof intf_name, "%s.%04x", + ppriv->dev->name, pkey); + priv = ipoib_intf_alloc(intf_name); + if (!priv) { + result = -ENOMEM; + goto err; + } + + set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); + + priv->pkey = pkey; + + memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN); + priv->dev->broadcast[8] = pkey >> 8; + priv->dev->broadcast[9] = pkey & 0xff; + + result = ipoib_dev_init(priv->dev, ppriv->ca, ppriv->port); + if (result < 0) { + ipoib_warn(ppriv, "failed to initialize subinterface: " + "device %s, port %d", + ppriv->ca->name, ppriv->port); + goto device_init_failed; + } + + result = register_netdev(priv->dev); + if (result) { + ipoib_warn(priv, "failed to initialize; error %i", result); + goto register_failed; + } + + priv->parent = ppriv->dev; + + if (ipoib_create_debug_file(priv->dev)) + goto debug_failed; + + if (ipoib_add_pkey_attr(priv->dev)) + goto sysfs_failed; + + if (class_device_create_file(&priv->dev->class_dev, + &class_device_attr_parent)) + goto sysfs_failed; + + list_add_tail(&priv->list, &ppriv->child_intfs); + + up(&ppriv->vlan_mutex); + + return 0; + +sysfs_failed: + ipoib_delete_debug_file(priv->dev); + +debug_failed: + unregister_netdev(priv->dev); + +register_failed: + ipoib_dev_cleanup(priv->dev); + +device_init_failed: + free_netdev(priv->dev); + +err: + up(&ppriv->vlan_mutex); + return result; +} + +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey) +{ + struct ipoib_dev_priv *ppriv, *priv, *tpriv; + int ret = -ENOENT; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + ppriv = netdev_priv(pdev); + + down(&ppriv->vlan_mutex); + list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { + if (priv->pkey == pkey) { + unregister_netdev(priv->dev); + ipoib_dev_cleanup(priv->dev); + + list_del(&priv->list); + + kfree(priv); + + ret = 0; + break; + } + } + up(&ppriv->vlan_mutex); + + return ret; +} From roland at topspin.com Mon Dec 27 21:51:18 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:18 -0800 Subject: [openib-general] [PATCH][v5][21/24] Add InfiniBand userspace MAD support In-Reply-To: <200412272151.JEARwZ80axXxZD2Q@topspin.com> Message-ID: <200412272151.3Lde9MPbD7ODIUdu@topspin.com> Add a driver that provides a character special device for each InfiniBand port. 
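For a sense of how such a device gets used, here is a rough userspace sketch (an illustration only, not part of the patch). It assumes udev has created the node as /dev/umad0, that ib_user_mad.h has been copied somewhere the compiler can find it, and it only fills in the fields the driver below actually looks at; the management class chosen is just an example.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>		/* htons/htonl */
#include <sys/ioctl.h>
#include "ib_user_mad.h"	/* struct ib_user_mad{,_reg_req} + ioctls */

int main(void)
{
	struct ib_user_mad_reg_req req;
	struct ib_user_mad mad;
	int fd;

	fd = open("/dev/umad0", O_RDWR);
	if (fd < 0)
		return 1;

	/* Register a GSI agent, e.g. for the performance management class.
	 * method_mask is left all-zero: we only expect replies to our own
	 * sends, which come back matched by transaction ID. */
	memset(&req, 0, sizeof req);
	req.qpn                = 1;
	req.mgmt_class         = 0x04;
	req.mgmt_class_version = 1;
	if (ioctl(fd, IB_USER_MAD_REGISTER_AGENT, &req))
		return 1;
	/* the driver wrote our agent id back into req.id */

	memset(&mad, 0, sizeof mad);
	mad.id         = req.id;
	mad.qpn        = htonl(1);		/* remote GSI QP */
	mad.qkey       = htonl(0x80010000);	/* default GSI Q_Key */
	mad.lid        = htons(1);		/* destination LID */
	mad.timeout_ms = 1000;
	/* ... build the actual MAD payload in mad.data here ... */

	if (write(fd, &mad, sizeof mad) != sizeof mad)	/* post the send */
		return 1;
	if (read(fd, &mad, sizeof mad) != sizeof mad)	/* wait for a reply */
		return 1;

	printf("got response, status %d\n", mad.status);

	ioctl(fd, IB_USER_MAD_UNREGISTER_AGENT, &req.id);
	close(fd);
	return 0;
}
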
This device allows userspace to send and receive MADs via write() and read() (with some control operations implemented as ioctls). All operations are 32/64 clean and have been tested with 32-bit userspace running on a ppc64 kernel. Signed-off-by: Roland Dreier --- linux-bk.orig/drivers/infiniband/core/Makefile 2004-12-27 21:48:20.847897490 -0800 +++ linux-bk/drivers/infiniband/core/Makefile 2004-12-27 21:48:27.528914067 -0800 @@ -1,6 +1,6 @@ EXTRA_CFLAGS += -Idrivers/infiniband/include -obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o +obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o ib_umad.o ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o @@ -8,3 +8,5 @@ ib_mad-y := mad.o smi.o agent.o ib_sa-y := sa_query.o + +ib_umad-y := user_mad.o --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/core/user_mad.c 2004-12-27 21:48:27.576907002 -0800 @@ -0,0 +1,738 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id: user_mad.c 1389 2004-12-27 22:56:47Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand userspace MAD packet access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + IB_UMAD_MAX_PORTS = 256, + IB_UMAD_MAX_AGENTS = 32 +}; + +struct ib_umad_port { + int devnum; + struct cdev dev; + struct class_device class_dev; + struct ib_device *ib_dev; + struct ib_umad_device *umad_dev; + u8 port_num; +}; + +struct ib_umad_device { + int start_port, end_port; + struct kref ref; + struct ib_umad_port port[0]; +}; + +struct ib_umad_file { + struct ib_umad_port *port; + spinlock_t recv_lock; + struct list_head recv_list; + wait_queue_head_t recv_wait; + struct rw_semaphore agent_mutex; + struct ib_mad_agent *agent[IB_UMAD_MAX_AGENTS]; + struct ib_mr *mr[IB_UMAD_MAX_AGENTS]; +}; + +struct ib_umad_packet { + struct ib_user_mad mad; + struct ib_ah *ah; + struct list_head list; + DECLARE_PCI_UNMAP_ADDR(mapping) +}; + +static dev_t base_dev; +static spinlock_t map_lock; +static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS); + +static void ib_umad_add_one(struct ib_device *device); +static void ib_umad_remove_one(struct ib_device *device); + +static int queue_packet(struct ib_umad_file *file, + struct ib_mad_agent *agent, + struct ib_umad_packet *packet) +{ + int ret = 1; + + down_read(&file->agent_mutex); + for (packet->mad.id = 0; + packet->mad.id < IB_UMAD_MAX_AGENTS; + packet->mad.id++) + if (agent == file->agent[packet->mad.id]) { + spin_lock_irq(&file->recv_lock); + list_add_tail(&packet->list, &file->recv_list); + spin_unlock_irq(&file->recv_lock); + wake_up_interruptible(&file->recv_wait); + ret = 0; + break; + } + + up_read(&file->agent_mutex); + + return ret; +} + +static void send_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *send_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet = + (void *) (unsigned long) send_wc->wr_id; + + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + ib_destroy_ah(packet->ah); + + if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { + packet->mad.status = ETIMEDOUT; + + if (!queue_packet(file, agent, packet)) + return; + } + + kfree(packet); +} + +static void recv_handler(struct ib_mad_agent *agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_umad_file *file = agent->context; + struct ib_umad_packet *packet; + + if (mad_recv_wc->wc->status != IB_WC_SUCCESS) + goto out; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + goto out; + + memset(packet, 0, sizeof *packet); + + memcpy(packet->mad.data, mad_recv_wc->recv_buf.mad, sizeof packet->mad.data); + packet->mad.status = 0; + packet->mad.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp); + packet->mad.lid = cpu_to_be16(mad_recv_wc->wc->slid); + packet->mad.sl = mad_recv_wc->wc->sl; + packet->mad.path_bits = mad_recv_wc->wc->dlid_path_bits; + packet->mad.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH); + if (packet->mad.grh_present) { + /* XXX parse GRH */ + packet->mad.gid_index = 0; + packet->mad.hop_limit = 0; + packet->mad.traffic_class = 0; + memset(packet->mad.gid, 0, 16); + packet->mad.flow_label = 0; + } + + if (queue_packet(file, agent, packet)) + kfree(packet); + +out: + ib_free_recv_mad(mad_recv_wc); +} + +static ssize_t ib_umad_read(struct file *filp, char __user *buf, + size_t 
count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + ssize_t ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + spin_lock_irq(&file->recv_lock); + + while (list_empty(&file->recv_list)) { + spin_unlock_irq(&file->recv_lock); + + if (filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + if (wait_event_interruptible(file->recv_wait, + !list_empty(&file->recv_list))) + return -ERESTARTSYS; + + spin_lock_irq(&file->recv_lock); + } + + packet = list_entry(file->recv_list.next, struct ib_umad_packet, list); + list_del(&packet->list); + + spin_unlock_irq(&file->recv_lock); + + if (copy_to_user(buf, &packet->mad, sizeof packet->mad)) + ret = -EFAULT; + else + ret = sizeof packet->mad; + + kfree(packet); + return ret; +} + +static ssize_t ib_umad_write(struct file *filp, const char __user *buf, + size_t count, loff_t *pos) +{ + struct ib_umad_file *file = filp->private_data; + struct ib_umad_packet *packet; + struct ib_mad_agent *agent; + struct ib_ah_attr ah_attr; + struct ib_sge gather_list; + struct ib_send_wr *bad_wr, wr = { + .opcode = IB_WR_SEND, + .sg_list = &gather_list, + .num_sge = 1, + .send_flags = IB_SEND_SIGNALED, + }; + u8 method; + u64 *tid; + int ret; + + if (count < sizeof (struct ib_user_mad)) + return -EINVAL; + + packet = kmalloc(sizeof *packet, GFP_KERNEL); + if (!packet) + return -ENOMEM; + + if (copy_from_user(&packet->mad, buf, sizeof packet->mad)) { + kfree(packet); + return -EFAULT; + } + + if (packet->mad.id < 0 || packet->mad.id >= IB_UMAD_MAX_AGENTS) { + ret = -EINVAL; + goto err; + } + + down_read(&file->agent_mutex); + + agent = file->agent[packet->mad.id]; + if (!agent) { + ret = -EINVAL; + goto err_up; + } + + /* + * If userspace is generating a request that will generate a + * response, we need to make sure the high-order part of the + * transaction ID matches the agent being used to send the + * MAD. 
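 * (The MAD layer demultiplexes incoming responses on those high 32 bits,
 * so stamping agent->hi_tid there -- while leaving userspace's low 32 bits
 * of the TID untouched -- is what gets the reply delivered back to this
 * agent's recv_handler, and from there onto this file's receive queue.)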
+ */ + method = ((struct ib_mad_hdr *) packet->mad.data)->method; + + if (!(method & IB_MGMT_METHOD_RESP) && + method != IB_MGMT_METHOD_TRAP_REPRESS && + method != IB_MGMT_METHOD_SEND) { + tid = &((struct ib_mad_hdr *) packet->mad.data)->tid; + *tid = cpu_to_be64(((u64) agent->hi_tid) << 32 | + (be64_to_cpup(tid) & 0xffffffff)); + } + + memset(&ah_attr, 0, sizeof ah_attr); + ah_attr.dlid = be16_to_cpu(packet->mad.lid); + ah_attr.sl = packet->mad.sl; + ah_attr.src_path_bits = packet->mad.path_bits; + ah_attr.port_num = file->port->port_num; + if (packet->mad.grh_present) { + ah_attr.ah_flags = IB_AH_GRH; + memcpy(ah_attr.grh.dgid.raw, packet->mad.gid, 16); + ah_attr.grh.flow_label = packet->mad.flow_label; + ah_attr.grh.hop_limit = packet->mad.hop_limit; + ah_attr.grh.traffic_class = packet->mad.traffic_class; + } + + packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); + if (IS_ERR(packet->ah)) { + ret = PTR_ERR(packet->ah); + goto err_up; + } + + gather_list.addr = dma_map_single(agent->device->dma_device, + packet->mad.data, + sizeof packet->mad.data, + DMA_TO_DEVICE); + gather_list.length = sizeof packet->mad.data; + gather_list.lkey = file->mr[packet->mad.id]->lkey; + pci_unmap_addr_set(packet, mapping, gather_list.addr); + + wr.wr.ud.mad_hdr = (struct ib_mad_hdr *) packet->mad.data; + wr.wr.ud.ah = packet->ah; + wr.wr.ud.remote_qpn = be32_to_cpu(packet->mad.qpn); + wr.wr.ud.remote_qkey = be32_to_cpu(packet->mad.qkey); + wr.wr.ud.timeout_ms = packet->mad.timeout_ms; + + wr.wr_id = (unsigned long) packet; + + ret = ib_post_send_mad(agent, &wr, &bad_wr); + if (ret) { + dma_unmap_single(agent->device->dma_device, + pci_unmap_addr(packet, mapping), + sizeof packet->mad.data, + DMA_TO_DEVICE); + goto err_up; + } + + up_read(&file->agent_mutex); + + return sizeof packet->mad; + +err_up: + up_read(&file->agent_mutex); + +err: + kfree(packet); + return ret; +} + +static unsigned int ib_umad_poll(struct file *filp, struct poll_table_struct *wait) +{ + struct ib_umad_file *file = filp->private_data; + + /* we will always be able to post a MAD send */ + unsigned int mask = POLLOUT | POLLWRNORM; + + poll_wait(filp, &file->recv_wait, wait); + + if (!list_empty(&file->recv_list)) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static int ib_umad_reg_agent(struct ib_umad_file *file, unsigned long arg) +{ + struct ib_user_mad_reg_req ureq; + struct ib_mad_reg_req req; + struct ib_mad_agent *agent; + int agent_id; + int ret; + + down_write(&file->agent_mutex); + + if (copy_from_user(&ureq, (void __user *) arg, sizeof ureq)) { + ret = -EFAULT; + goto out; + } + + if (ureq.qpn != 0 && ureq.qpn != 1) { + ret = -EINVAL; + goto out; + } + + for (agent_id = 0; agent_id < IB_UMAD_MAX_AGENTS; ++agent_id) + if (!file->agent[agent_id]) + goto found; + + ret = -ENOMEM; + goto out; + +found: + req.mgmt_class = ureq.mgmt_class; + req.mgmt_class_version = ureq.mgmt_class_version; + memcpy(req.method_mask, ureq.method_mask, sizeof req.method_mask); + memcpy(req.oui, ureq.oui, sizeof req.oui); + + agent = ib_register_mad_agent(file->port->ib_dev, file->port->port_num, + ureq.qpn ? 
IB_QPT_GSI : IB_QPT_SMI, + &req, 0, send_handler, recv_handler, + file); + if (IS_ERR(agent)) { + ret = PTR_ERR(agent); + goto out; + } + + file->agent[agent_id] = agent; + + file->mr[agent_id] = ib_get_dma_mr(agent->qp->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(file->mr[agent_id])) { + ret = -ENOMEM; + goto err; + } + + if (put_user(agent_id, + (u32 __user *) (arg + offsetof(struct ib_user_mad_reg_req, id)))) { + ret = -EFAULT; + goto err_mr; + } + + ret = 0; + goto out; + +err_mr: + ib_dereg_mr(file->mr[agent_id]); + +err: + file->agent[agent_id] = NULL; + ib_unregister_mad_agent(agent); + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_unreg_agent(struct ib_umad_file *file, unsigned long arg) +{ + u32 id; + int ret = 0; + + down_write(&file->agent_mutex); + + if (get_user(id, (u32 __user *) arg)) { + ret = -EFAULT; + goto out; + } + + if (id < 0 || id >= IB_UMAD_MAX_AGENTS || !file->agent[id]) { + ret = -EINVAL; + goto out; + } + + ib_dereg_mr(file->mr[id]); + ib_unregister_mad_agent(file->agent[id]); + file->agent[id] = NULL; + +out: + up_write(&file->agent_mutex); + return ret; +} + +static int ib_umad_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + switch (cmd) { + case IB_USER_MAD_REGISTER_AGENT: + return ib_umad_reg_agent(filp->private_data, arg); + case IB_USER_MAD_UNREGISTER_AGENT: + return ib_umad_unreg_agent(filp->private_data, arg); + default: + return -ENOIOCTLCMD; + } +} + +static int ib_umad_open(struct inode *inode, struct file *filp) +{ + struct ib_umad_port *port = + container_of(inode->i_cdev, struct ib_umad_port, dev); + struct ib_umad_file *file; + + file = kmalloc(sizeof *file, GFP_KERNEL); + if (!file) + return -ENOMEM; + + memset(file, 0, sizeof *file); + + spin_lock_init(&file->recv_lock); + init_rwsem(&file->agent_mutex); + INIT_LIST_HEAD(&file->recv_list); + init_waitqueue_head(&file->recv_wait); + + file->port = port; + filp->private_data = file; + + return 0; +} + +static int ib_umad_close(struct inode *inode, struct file *filp) +{ + struct ib_umad_file *file = filp->private_data; + int i; + + for (i = 0; i < IB_UMAD_MAX_AGENTS; ++i) + if (file->agent[i]) { + ib_dereg_mr(file->mr[i]); + ib_unregister_mad_agent(file->agent[i]); + } + + kfree(file); + + return 0; +} + +static struct file_operations umad_fops = { + .owner = THIS_MODULE, + .read = ib_umad_read, + .write = ib_umad_write, + .poll = ib_umad_poll, + .ioctl = ib_umad_ioctl, + .open = ib_umad_open, + .release = ib_umad_close +}; + +static struct ib_client umad_client = { + .name = "umad", + .add = ib_umad_add_one, + .remove = ib_umad_remove_one +}; + +static ssize_t show_dev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return print_dev_t(buf, port->dev.dev); +} +static CLASS_DEVICE_ATTR(dev, S_IRUGO, show_dev, NULL); + +static ssize_t show_ibdev(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return sprintf(buf, "%s\n", port->ib_dev->name); +} +static CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); + +static ssize_t show_port(struct class_device *class_dev, char *buf) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + return sprintf(buf, "%d\n", port->port_num); +} +static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); + +static void ib_umad_release_dev(struct kref *ref) +{ + struct ib_umad_device 
*dev = + container_of(ref, struct ib_umad_device, ref); + + kfree(dev); +} + +static void ib_umad_release_port(struct class_device *class_dev) +{ + struct ib_umad_port *port = + container_of(class_dev, struct ib_umad_port, class_dev); + + cdev_del(&port->dev); + clear_bit(port->devnum, dev_map); + kref_put(&port->umad_dev->ref, ib_umad_release_dev); +} + +static struct class umad_class = { + .name = "infiniband_mad", + .release = ib_umad_release_port +}; + +static ssize_t show_abi_version(struct class *class, char *buf) +{ + return sprintf(buf, "%d\n", IB_USER_MAD_ABI_VERSION); +} +static CLASS_ATTR(abi_version, S_IRUGO, show_abi_version, NULL); + +static void ib_umad_add_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev; + int s, e, i; + + if (device->node_type == IB_NODE_SWITCH) + s = e = 0; + else { + s = 1; + e = device->phys_port_cnt; + } + + umad_dev = kmalloc(sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port), + GFP_KERNEL); + if (!umad_dev) + return; + + memset(umad_dev, 0, sizeof *umad_dev + + (e - s + 1) * sizeof (struct ib_umad_port)); + + kref_init(&umad_dev->ref); + + umad_dev->start_port = s; + umad_dev->end_port = e; + + for (i = s; i <= e; ++i) { + umad_dev->port[i - s].umad_dev = umad_dev; + kref_get(&umad_dev->ref); + + spin_lock(&map_lock); + umad_dev->port[i - s].devnum = + find_first_zero_bit(dev_map, IB_UMAD_MAX_PORTS); + if (umad_dev->port[i - s].devnum >= IB_UMAD_MAX_PORTS) { + spin_unlock(&map_lock); + goto err; + } + set_bit(umad_dev->port[i - s].devnum, dev_map); + spin_unlock(&map_lock); + + umad_dev->port[i - s].ib_dev = device; + umad_dev->port[i - s].port_num = i; + + cdev_init(&umad_dev->port[i - s].dev, &umad_fops); + umad_dev->port[i - s].dev.owner = THIS_MODULE; + kobject_set_name(&umad_dev->port[i - s].dev.kobj, + "umad%d", umad_dev->port[i - s].devnum); + if (cdev_add(&umad_dev->port[i - s].dev, base_dev + + umad_dev->port[i - s].devnum, 1)) + goto err; + + umad_dev->port[i - s].class_dev.class = &umad_class; + umad_dev->port[i - s].class_dev.dev = device->dma_device; + snprintf(umad_dev->port[i - s].class_dev.class_id, + BUS_ID_SIZE, "umad%d", umad_dev->port[i - s].devnum); + if (class_device_register(&umad_dev->port[i - s].class_dev)) + goto err_class; + + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_dev)) + goto err_class; + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_ibdev)) + goto err_class; + if (class_device_create_file(&umad_dev->port[i - s].class_dev, + &class_device_attr_port)) + goto err_class; + } + + ib_set_client_data(device, &umad_client, umad_dev); + + return; + +err_class: + cdev_del(&umad_dev->port[i - s].dev); + clear_bit(umad_dev->port[i - s].devnum, dev_map); + +err: + while (--i >= s) + class_device_unregister(&umad_dev->port[i - s].class_dev); + + kref_put(&umad_dev->ref, ib_umad_release_dev); +} + +static void ib_umad_remove_one(struct ib_device *device) +{ + struct ib_umad_device *umad_dev = ib_get_client_data(device, &umad_client); + int i; + + if (!umad_dev) + return; + + for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) + class_device_unregister(&umad_dev->port[i].class_dev); + + kref_put(&umad_dev->ref, ib_umad_release_dev); +} + +static int __init ib_umad_init(void) +{ + int ret; + + spin_lock_init(&map_lock); + + ret = alloc_chrdev_region(&base_dev, 0, IB_UMAD_MAX_PORTS, + "infiniband_mad"); + if (ret) { + printk(KERN_ERR "user_mad: couldn't get device number\n"); + goto out; + } + + ret = 
class_register(&umad_class); + if (ret) { + printk(KERN_ERR "user_mad: couldn't create class infiniband_mad\n"); + goto out_chrdev; + } + + ret = class_create_file(&umad_class, &class_attr_abi_version); + if (ret) { + printk(KERN_ERR "user_mad: couldn't create abi_version attribute\n"); + goto out_class; + } + + ret = ib_register_client(&umad_client); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ib_umad client\n"); + goto out_class; + } + + /* Our ioctls are 32/64 clean */ + ret = register_ioctl32_conversion(IB_USER_MAD_REGISTER_AGENT, NULL); + ret |= register_ioctl32_conversion(IB_USER_MAD_UNREGISTER_AGENT, NULL); + if (ret) { + printk(KERN_ERR "user_mad: couldn't register ioctl32 conversions\n"); + goto out_client; + } + + return 0; + +out_client: + ib_unregister_client(&umad_client); + +out_class: + class_unregister(&umad_class); + +out_chrdev: + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); + +out: + return ret; +} + +static void __exit ib_umad_cleanup(void) +{ + unregister_ioctl32_conversion(IB_USER_MAD_REGISTER_AGENT); + unregister_ioctl32_conversion(IB_USER_MAD_UNREGISTER_AGENT); + ib_unregister_client(&umad_client); + class_unregister(&umad_class); + unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS); +} + +module_init(ib_umad_init); +module_exit(ib_umad_cleanup); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/drivers/infiniband/include/ib_user_mad.h 2004-12-27 21:48:27.631898908 -0800 @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2004 Topspin Communications. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_user_mad.h 1389 2004-12-27 22:56:47Z roland $ + */ + +#ifndef IB_USER_MAD_H +#define IB_USER_MAD_H + +#include +#include + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. + */ +#define IB_USER_MAD_ABI_VERSION 2 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). 
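+ * In practice this means using only fixed-width __u8/__u16/__u32
+ * fields, laid out so that field offsets are the same whether the
+ * header is compiled for a 32-bit or a 64-bit ABI.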
+ */

+/**
+ * ib_user_mad - MAD packet
+ * @data - Contents of MAD
+ * @id - ID of agent MAD received with/to be sent with
+ * @status - 0 on successful receive, ETIMEDOUT if no response
+ *   received (transaction ID in data[] will be set to TID of original
+ *   request) (ignored on send)
+ * @timeout_ms - Milliseconds to wait for response (unset on receive)
+ * @qpn - Remote QP number received from/to be sent to
+ * @qkey - Remote Q_Key to be sent with (unset on receive)
+ * @lid - Remote lid received from/to be sent to
+ * @sl - Service level received with/to be sent with
+ * @path_bits - Local path bits received with/to be sent with
+ * @grh_present - If set, GRH was received/should be sent
+ * @gid_index - Local GID index to send with (unset on receive)
+ * @hop_limit - Hop limit in GRH
+ * @traffic_class - Traffic class in GRH
+ * @gid - Remote GID in GRH
+ * @flow_label - Flow label in GRH
+ *
+ * All multi-byte quantities are stored in network (big endian) byte order.
+ */
+struct ib_user_mad {
+        __u8    data[256];
+        __u32   id;
+        __u32   status;
+        __u32   timeout_ms;
+        __u32   qpn;
+        __u32   qkey;
+        __u16   lid;
+        __u8    sl;
+        __u8    path_bits;
+        __u8    grh_present;
+        __u8    gid_index;
+        __u8    hop_limit;
+        __u8    traffic_class;
+        __u8    gid[16];
+        __u32   flow_label;
+};
+
+/**
+ * ib_user_mad_reg_req - MAD registration request
+ * @id - Set by the kernel; used to identify agent in future requests.
+ * @qpn - Queue pair number; must be 0 or 1.
+ * @method_mask - The caller will receive unsolicited MADs for any method
+ *   where @method_mask = 1.
+ * @mgmt_class - Indicates which management class of MADs should be received
+ *   by the caller. This field is only required if the user wishes to
+ *   receive unsolicited MADs, otherwise it should be 0.
+ * @mgmt_class_version - Indicates which version of MADs for the given
+ *   management class to receive.
+ * @oui - Indicates IEEE OUI when mgmt_class is a vendor class
+ *   in the range from 0x30 to 0x4f. Otherwise not used.
+ */
+struct ib_user_mad_reg_req {
+        __u32   id;
+        __u32   method_mask[4];
+        __u8    qpn;
+        __u8    mgmt_class;
+        __u8    mgmt_class_version;
+        __u8    oui[3];
+};
+
+#define IB_IOCTL_MAGIC          0x1b
+
+#define IB_USER_MAD_REGISTER_AGENT      _IOWR(IB_IOCTL_MAGIC, 1, \
+                                              struct ib_user_mad_reg_req)
+
+#define IB_USER_MAD_UNREGISTER_AGENT    _IOW(IB_IOCTL_MAGIC, 2, __u32)
+
+#endif /* IB_USER_MAD_H */

From roland at topspin.com Mon Dec 27 21:51:19 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 27 Dec 2004 21:51:19 -0800
Subject: [openib-general] [PATCH][v5][22/24] Document InfiniBand ioctl use
In-Reply-To: <200412272151.3Lde9MPbD7ODIUdu@topspin.com>
Message-ID: <200412272151.zeKZJPoIEBr55elh@topspin.com>

Add the 0x1b ioctl magic number used by ib_umad module to
Documentation/ioctl-number.txt.
Signed-off-by: Roland Dreier

--- linux-bk.orig/Documentation/ioctl-number.txt 2004-12-27 21:47:59.407053483 -0800
+++ linux-bk/Documentation/ioctl-number.txt 2004-12-27 21:48:28.036839302 -0800
@@ -72,6 +72,7 @@ 0x09 all linux/md.h 0x12 all linux/fs.h linux/blkpg.h +0x1b all InfiniBand Subsystem 0x20 all drivers/cdrom/cm206.h 0x22 all scsi/sg.h '#' 00-3F IEEE 1394 Subsystem Block for the entire subsystem

From roland at topspin.com Mon Dec 27 21:51:19 2004
From: roland at topspin.com (Roland Dreier)
Date: Mon, 27 Dec 2004 21:51:19 -0800
Subject: [openib-general] [PATCH][v5][23/24] Add InfiniBand Documentation files
In-Reply-To: <200412272151.zeKZJPoIEBr55elh@topspin.com>
Message-ID: <200412272151.S29WkrmlJifc5kHZ@topspin.com>

Add files to Documentation/infiniband that describe the tree under
/sys/class/infiniband, the IPoIB driver and the userspace MAD access
driver.

Signed-off-by: Roland Dreier

--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-bk/Documentation/infiniband/ipoib.txt 2004-12-27 21:48:28.484773367 -0800
@@ -0,0 +1,56 @@
+IP OVER INFINIBAND
+
+  The ib_ipoib driver is an implementation of the IP over InfiniBand
+  protocol as specified by the latest Internet-Drafts issued by the
+  IETF ipoib working group. It is a "native" implementation in the
+  sense of setting the interface type to ARPHRD_INFINIBAND and the
+  hardware address length to 20 (earlier proprietary implementations
+  masqueraded to the kernel as ethernet interfaces).
+
+Partitions and P_Keys
+
+  When the IPoIB driver is loaded, it creates one interface for each
+  port using the P_Key at index 0. To create an interface with a
+  different P_Key, write the desired P_Key into the main interface's
+  /sys/class/net/<intf name>/create_child file. For example:
+
+    echo 0x8001 > /sys/class/net/ib0/create_child
+
+  This will create an interface named ib0.8001 with P_Key 0x8001. To
+  remove a subinterface, use the "delete_child" file:
+
+    echo 0x8001 > /sys/class/net/ib0/delete_child
+
+  The P_Key for any interface is given by the "pkey" file, and the
+  main interface for a subinterface is in "parent."
+
+Debugging Information
+
+  By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
+  to 'y', tracing messages are compiled into the driver. They are
+  turned on by setting the module parameters debug_level and
+  mcast_debug_level to 1. These parameters can be controlled at
+  runtime through files in /sys/module/ib_ipoib/.
+
+  CONFIG_INFINIBAND_IPOIB_DEBUG also enables the "ipoib_debugfs"
+  virtual filesystem. By mounting this filesystem, for example with
+
+    mkdir -p /ipoib_debugfs
+    mount -t ipoib_debugfs none /ipoib_debugfs
+
+  it is possible to get statistics about multicast groups from the
+  files /ipoib_debugfs/ib0_mcg and so on.
+
+  The performance impact of this option is negligible, so it
+  is safe to enable this option with debug_level set to 0 for normal
+  operation.
+
+  CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
+  the data path when data_debug_level is set to 1. However, even with
+  the output disabled, enabling this configuration option will affect
+  performance, because it adds tests to the fast path.
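  As a minimal illustration (assuming the parameter files under
  /sys/module/ib_ipoib/ are writable on the running kernel), the
  tracing described above can be toggled at runtime with:

    echo 1 > /sys/module/ib_ipoib/debug_level
    echo 0 > /sys/module/ib_ipoib/mcast_debug_level

  where writing 1 turns the corresponding messages on and 0 turns
  them off again.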
+ +References + + IETF IP over InfiniBand (ipoib) Working Group + http://ietf.org/html.charters/ipoib-charter.html --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/sysfs.txt 2004-12-27 21:48:28.513769099 -0800 @@ -0,0 +1,64 @@ +SYSFS FILES + + For each InfiniBand device, the InfiniBand drivers create the + following files under /sys/class/infiniband/: + + node_guid - Node GUID + sys_image_guid - System image GUID + + In addition, there is a "ports" subdirectory, with one subdirectory + for each port. For example, if mthca0 is a 2-port HCA, there will + be two directories: + + /sys/class/infiniband/mthca0/ports/1 + /sys/class/infiniband/mthca0/ports/2 + + (A switch will only have a single "0" subdirectory for switch port + 0; no subdirectory is created for normal switch ports) + + In each port subdirectory, the following files are created: + + cap_mask - Port capability mask + lid - Port LID + lid_mask_count - Port LID mask count + rate - Port data rate (active width * active speed) + sm_lid - Subnet manager LID for port's subnet + sm_sl - Subnet manager SL for port's subnet + state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) + + There is also a "counters" subdirectory, with files + + VL15_dropped + excessive_buffer_overrun_errors + link_downed + link_error_recovery + local_link_integrity_errors + port_rcv_constraint_errors + port_rcv_data + port_rcv_errors + port_rcv_packets + port_rcv_remote_physical_errors + port_rcv_switch_relay_errors + port_xmit_constraint_errors + port_xmit_data + port_xmit_discards + port_xmit_packets + symbol_error + + Each of these files contains the corresponding value from the port's + Performance Management PortCounters attribute, as described in + section 16.1.3.5 of the InfiniBand Architecture Specification. + + The "pkeys" and "gids" subdirectories contain one file for each + entry in the port's P_Key or GID table respectively. For example, + ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key + table. + +MTHCA + + The Mellanox HCA driver also creates the files: + + hw_rev - Hardware revision number + fw_ver - Firmware version + hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)", + or "MT25208" --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-bk/Documentation/infiniband/user_mad.txt 2004-12-27 21:48:28.543764684 -0800 @@ -0,0 +1,81 @@ +USERSPACE MAD ACCESS + +Device files + + Each port of each InfiniBand device has a "umad" device attached. + For example, a two-port HCA will have two devices, while a switch + will have one device (for switch port 0). + +Creating MAD agents + + A MAD agent can be created by filling in a struct ib_user_mad_reg_req + and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file + descriptor for the appropriate device file. If the registration + request succeeds, a 32-bit id will be returned in the structure. + For example: + + struct ib_user_mad_reg_req req = { /* ... */ }; + ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req); + if (!ret) + my_agent = req.id; + else + perror("agent register"); + + Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT + ioctl. Also, all agents registered through a file descriptor will + be unregistered when the descriptor is closed. + +Receiving MADs + + MADs are received using read(). The buffer passed to read() must be + large enough to hold at least one struct ib_user_mad. 
For example: + + struct ib_user_mad mad; + ret = read(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("read"); + + In addition to the actual MAD contents, the other struct ib_user_mad + fields will be filled in with information on the received MAD. For + example, the remote LID will be in mad.lid. + + If a send times out, a receive will be generated with mad.status set + to ETIMEDOUT. Otherwise when a MAD has been successfully received, + mad.status will be 0. + + poll()/select() may be used to wait until a MAD can be read. + +Sending MADs + + MADs are sent using write(). The agent ID for sending should be + filled into the id field of the MAD, the destination LID should be + filled into the lid field, and so on. For example: + + struct ib_user_mad mad; + + /* fill in mad.data */ + + mad.id = my_agent; /* req.id from agent registration */ + mad.lid = my_dest; /* in network byte order... */ + /* etc. */ + + ret = write(fd, &mad, sizeof mad); + if (ret != sizeof mad) + perror("write"); + +/dev files + + To create the appropriate character device files automatically with + udev, a rule like + + KERNEL="umad*", NAME="infiniband/%k" + + can be used. This will create a device node named + + /dev/infiniband/umad0 + + for the first port, and so on. The InfiniBand device and port + associated with this device can be determined from the files + + /sys/class/infiniband_mad/umad0/ibdev + /sys/class/infiniband_mad/umad0/port From roland at topspin.com Mon Dec 27 21:51:20 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 21:51:20 -0800 Subject: [openib-general] [PATCH][v5][24/24] InfiniBand MAINTAINERS entry In-Reply-To: <200412272151.S29WkrmlJifc5kHZ@topspin.com> Message-ID: <200412272151.hCBo2fMh4Io7Xy87@topspin.com> Add OpenIB maintainers information to MAINTAINERS. Signed-off-by: Roland Dreier --- linux-bk.orig/MAINTAINERS 2004-12-27 21:47:44.140300651 -0800 +++ linux-bk/MAINTAINERS 2004-12-27 21:48:28.966702428 -0800 @@ -1081,6 +1081,17 @@ L: linux-fbdev-devel at lists.sourceforge.net S: Maintained +INFINIBAND SUBSYSTEM +P: Roland Dreier +M: roland at topspin.com +P: Sean Hefty +M: mshefty at ichips.intel.com +P: Hal Rosenstock +M: halr at voltaire.com +L: openib-general at openib.org +W: http://www.openib.org/ +S: Supported + INPUT (KEYBOARD, MOUSE, JOYSTICK) DRIVERS P: Vojtech Pavlik M: vojtech at suse.cz From roland at topspin.com Mon Dec 27 22:10:46 2004 From: roland at topspin.com (Roland Dreier) Date: Mon, 27 Dec 2004 22:10:46 -0800 Subject: [openib-general] [PATCH] Add support for querying port width/speed In-Reply-To: <52fz1rj809.fsf@topspin.com> (Roland Dreier's message of "Mon, 27 Dec 2004 21:40:06 -0800") References: <52fz1rj809.fsf@topspin.com> Message-ID: <528y7jj6l5.fsf@topspin.com> By, the way does anyone know what IPD is supposed to be used when the injection port's rate is not a multiple of the rate for the path? For example, suppose an 8X DDR port is sending to a 12X path -- 12 doesn't divide 16 evenly. I would guess that the correct thing to do is round up, so 8X DDR -> 12X uses an IPD of 1, 8X QDR -> 12X uses an IPD of 2, etc. - R. From davem at davemloft.net Mon Dec 27 22:54:17 2004 From: davem at davemloft.net (David S. 
Miller) Date: Mon, 27 Dec 2004 22:54:17 -0800 Subject: [openib-general] Re: [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: <200412272150.IBRnA4AvjendsF8x@topspin.com> References: <200412272150.IBRnA4AvjendsF8x@topspin.com> Message-ID: <20041227225417.3ac7a0a6.davem@davemloft.net> On Mon, 27 Dec 2004 21:50:47 -0800 Roland Dreier wrote: > >>>>> "David" == David S Miller writes: > > David> Send it all over. > > OK, you asked for it... here's our latest tree, which should > incorporate all the feedback I've seen. W00t :-) All applied, thanks Roland. I'll run it through some build tests then toss it upstream. From shaharf at voltaire.com Tue Dec 28 00:39:30 2004 From: shaharf at voltaire.com (shaharf) Date: Tue, 28 Dec 2004 10:39:30 +0200 Subject: [openib-general] Problem with userspace tree structure Message-ID: > Michael> Could you please remove openib/src/userspace/osm.old as a > Michael> first step? Its really painful to get two copies of > Michael> opensm each time. If you thjink you need it for > Michael> reference, you could always just move it somewhere ... > > I agree osm.old should be deleted. No need to move it -- since we're > using a revision control system, one can always check out any specific > revision from the past that's of interest. > > - R. Osm.old is deleted. Shahar From Diego at Mellanox.com Tue Dec 28 04:43:25 2004 From: Diego at Mellanox.com (Diego Crupnicoff) Date: Tue, 28 Dec 2004 04:43:25 -0800 Subject: [openib-general] [PATCH] Add support for querying port width/ speed Message-ID: <25AE7F432672D511B8DC00B0D0DF11DA029850E9@MTIEX01> I think the choice of IPD is beyond the scope of the IB spec. The spec provides a mechanism to regulate the injection rate but does not mandate its use. Your assumption below (rounding up) makes perfect sense. Rounding down would defeat the entire purpose of the IPD mechanism. Diego > -----Original Message----- > From: Roland Dreier [mailto:roland at topspin.com] > Sent: Tuesday, December 28, 2004 3:11 AM > To: openib-general at openib.org > Subject: Re: [openib-general] [PATCH] Add support for > querying port width/speed > > > By, the way does anyone know what IPD is supposed to be used > when the injection port's rate is not a multiple of the rate > for the path? For example, suppose an 8X DDR port is sending > to a 12X path -- 12 doesn't divide 16 evenly. > > I would guess that the correct thing to do is round up, so 8X DDR > -> 12X uses an IPD of 1, 8X QDR -> 12X uses an IPD of 2, etc. > > - R. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-> general > > To > unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Dec 28 05:03:12 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Dec 2004 08:03:12 -0500 Subject: [openib-general] Some CM Comments Message-ID: <1104238992.4054.3823.camel@localhost.localdomain> Hi Sean, I've been looking over ib_cm.h and have some comments: 1.IBA 1.2 CM is still class version 2 and is upward compatible with 1.1 CM. The only difference appears to be SRQ component added to REQ and REP. 2. The following CM states appear to be missing: peer compare reject retry timeout DREQ timeout (is an event) It is also a state in the spec RTU timeout (nor is it an event) 3. There is no REJ received CM event type. Shouldn't there be ? 4. 
Should there be an API to obtain local/remote comm IDs ? 5. REQ message is missing transport service type (connection type). Also, what about ack timeout ? Also, SRQ. Is starting PSN chosen internally by the CM ? 6. SRQ should be added for REP. Is there a danger to have the client set the CM response timeout ? Also, is the lack of remote response timeout for LAP in inconsistent with this ? Thanks. -- Hal From hadi at cyberus.ca Tue Dec 28 05:31:57 2004 From: hadi at cyberus.ca (jamal) Date: 28 Dec 2004 08:31:57 -0500 Subject: [openib-general] Re: LLTX and netif_stop_queue In-Reply-To: <5cac192f04122408102129af43@mail.gmail.com> References: <52llbwoaej.fsf@topspin.com> <20041217214432.07b7b21e.davem@davemloft.net> <1103484675.1050.158.camel@jzny.localdomain> <5cac192f04122210491d64d4b6@mail.gmail.com> <20041222202919.057b8331.davem@davemloft.net> <5cac192f0412230110628749e3@mail.gmail.com> <41CAF444.3000305@trash.net> <5cac192f04122408102129af43@mail.gmail.com> Message-ID: <1104240717.1100.66.camel@jzny.localdomain> On Fri, 2004-12-24 at 11:10, Eric Lemoine wrote: > Yes but requiring drivers to release a lock that they should not even > be aware of doesn't sound good. Another way would be to keep > dev->queue_lock grabbed when entering start_xmit() and let the driver > drop it (and re-acquire it before it returns) only if it wishes so. > Although I don't like this too much either, that's the best way I can > think of up to now... I am not a big fan of that patch either, but i cant think of a cleaner way to do it. The violation already happens with the LLTX flag. So maybe a big warning that says "Do this only if you driver is LLTX enabled". The other way to do it is put a check to see if LLTX is enabled before releasing that lock - but why the extra cycles? Driver writer should know. cheers, jamal From mst at mellanox.co.il Tue Dec 28 06:45:28 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 16:45:28 +0200 Subject: [openib-general] opensm error message In-Reply-To: References: Message-ID: <20041228144528.GB27036@mellanox.co.il> Hi! ./osm/opensm/opensm ------------------------------------------------- OpenSM Rev:1.6.0-rc2 Command Line Arguments: Log File: /tmp/osm.log ------------------------------------------------- warn: [22072] umad_init: wrong ABI version: /sys/class/infiniband_mad/abi_version is 134705591 but library ABI is 2 using default guid 0xc900012a3981 Error from osm_opensm_bind (0x2A) swlab118:/usr/src/openib/src/userspace # cat /sys/class/infiniband_mad/abi_version cat: /sys/class/infiniband_mad/abi_version: No such file or directory So the file does not exist, but the log indicates bad ABI. You probably dont check the return status of open(). mst From mst at mellanox.co.il Mon Dec 27 17:30:51 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 03:30:51 +0200 Subject: [openib-general] opensm failure In-Reply-To: <20041228144528.GB27036@mellanox.co.il> References: <20041228144528.GB27036@mellanox.co.il> Message-ID: <20041228013051.GA22735@mellanox.co.il> Hi! 
# ./osm/opensm/opensm -V ------------------------------------------------- OpenSM Rev:1.6.0-rc2 Command Line Arguments: Big V selected Log File: /tmp/osm.log ------------------------------------------------- using default guid 0xc900012a3981 Error from osm_opensm_bind (0x2A) MST -------------- next part -------------- [1104197291:000962479][B7E78460] -> OpenSM Rev:1.6.0-rc2 [1104197291:000962736][B7E78460] -> osm_opensm_init: [ [1104197291:000962753][B7E78460] -> osm_opensm_init: Forcing single threaded dispatcher. [1104197291:000962828][B7E78460] -> osm_vendor_new: [ [1104197291:000962839][B7E78460] -> osm_vendor_init: [ [1104197291:000962897][B7E78460] -> osm_vendor_init: ] [1104197291:000962905][B7E78460] -> osm_vendor_new: ] [1104197291:000962915][B7E78460] -> osm_mad_pool_init: [ [1104197291:000963039][B7E78460] -> osm_mad_pool_init: ] [1104197291:000963048][B7E78460] -> osm_vl15_init: [ [1104197291:000963088][B7E78460] -> osm_vl15_init: ] [1104197291:000963099][B7E78460] -> osm_sm_init: [ [1104197291:000963121][B7E78460] -> osm_sm_mad_ctrl_init: [ [1104197291:000963130][B7E78460] -> osm_sm_mad_ctrl_init: ] [1104197291:000963140][B7E78460] -> osm_req_init: [ [1104197291:000963147][B7E78460] -> osm_req_init: ] [1104197291:000963154][B7E78460] -> osm_req_ctrl_init: [ [1104197291:000963162][B7E78460] -> osm_req_ctrl_init: ] [1104197291:000963169][B7E78460] -> osm_resp_init: [ [1104197291:000963176][B7E78460] -> osm_resp_init: ] [1104197291:000963185][B7E78460] -> osm_ni_rcv_init: [ [1104197291:000963192][B7E78460] -> osm_ni_rcv_init: ] [1104197291:000963199][B7E78460] -> osm_ni_rcv_ctrl_init: [ [1104197291:000963207][B7E78460] -> osm_ni_rcv_ctrl_init: ] [1104197291:000963214][B7E78460] -> osm_pi_rcv_init: [ [1104197291:000963221][B7E78460] -> osm_pi_rcv_init: ] [1104197291:000963228][B7E78460] -> osm_pi_rcv_ctrl_init: [ [1104197291:000963235][B7E78460] -> osm_pi_rcv_ctrl_init: ] [1104197291:000963242][B7E78460] -> osm_si_rcv_init: [ [1104197291:000963249][B7E78460] -> osm_si_rcv_init: ] [1104197291:000963255][B7E78460] -> osm_si_rcv_ctrl_init: [ [1104197291:000963262][B7E78460] -> osm_si_rcv_ctrl_init: ] [1104197291:000963269][B7E78460] -> osm_nd_rcv_init: [ [1104197291:000963276][B7E78460] -> osm_nd_rcv_init: ] [1104197291:000963283][B7E78460] -> osm_nd_rcv_ctrl_init: [ [1104197291:000963293][B7E78460] -> osm_nd_rcv_ctrl_init: ] [1104197291:000963301][B7E78460] -> osm_lid_mgr_init: [ [1104197291:000963323][B4E6FBB0] -> __osm_vl15_poller: [ [1104197291:000963347][B7E78460] -> osm_lid_mgr_init: ] [1104197291:000963365][B7E78460] -> osm_ucast_mgr_init: [ [1104197291:000963373][B7E78460] -> osm_ucast_mgr_init: ] [1104197291:000963381][B7E78460] -> osm_link_mgr_init: [ [1104197291:000963388][B7E78460] -> osm_link_mgr_init: ] [1104197291:000963421][B7E78460] -> osm_state_mgr_init: [ [1104197291:000963431][B7E78460] -> osm_state_mgr_init: ] [1104197291:000963438][B7E78460] -> osm_state_mgr_ctrl_init: [ [1104197291:000963446][B7E78460] -> osm_state_mgr_ctrl_init: ] [1104197291:000963454][B7E78460] -> osm_drop_mgr_init: [ [1104197291:000963461][B7E78460] -> osm_drop_mgr_init: ] [1104197291:000963468][B7E78460] -> osm_lft_rcv_init: [ [1104197291:000963475][B7E78460] -> osm_lft_rcv_init: ] [1104197291:000963482][B7E78460] -> osm_lft_rcv_ctrl_init: [ [1104197291:000963498][B7E78460] -> osm_lft_rcv_ctrl_init: ] [1104197291:000963506][B7E78460] -> osm_mft_rcv_init: [ [1104197291:000963512][B7E78460] -> osm_mft_rcv_init: ] [1104197291:000963520][B7E78460] -> osm_mft_rcv_ctrl_init: [ 
[1104197291:000963528][B7E78460] -> osm_mft_rcv_ctrl_init: ] [1104197291:000963536][B7E78460] -> osm_sweep_fail_ctrl_init: [ [1104197291:000963545][B7E78460] -> osm_sweep_fail_ctrl_init: ] [1104197291:000963552][B7E78460] -> osm_sminfo_rcv_init: [ [1104197291:000963559][B7E78460] -> osm_sminfo_rcv_init: ] [1104197291:000963566][B7E78460] -> osm_sminfo_rcv_ctrl_init: [ [1104197291:000963574][B7E78460] -> osm_sminfo_rcv_ctrl_init: ] [1104197291:000963581][B7E78460] -> osm_trap_rcv_init: [ [1104197291:000963589][B7E78460] -> cl_event_wheel_init: [ [1104197291:000963598][B7E78460] -> cl_event_wheel_init: ] [1104197291:000963606][B7E78460] -> osm_trap_rcv_init: ] [1104197291:000963613][B7E78460] -> osm_trap_rcv_ctrl_init: [ [1104197291:000963621][B7E78460] -> osm_trap_rcv_ctrl_init: ] [1104197291:000963631][B7E78460] -> osm_sm_state_mgr_init: [ [1104197291:000963640][B7E78460] -> osm_sm_state_mgr_init: ] [1104197291:000963721][B7E78460] -> osm_mcast_mgr_init: [ [1104197291:000963730][B7E78460] -> osm_mcast_mgr_init: ] [1104197291:000963737][B7E78460] -> osm_slvl_rcv_init: [ [1104197291:000963744][B7E78460] -> osm_slvl_rcv_init: ] [1104197291:000963751][B7E78460] -> osm_slvl_rcv_ctrl_init: [ [1104197291:000963759][B7E78460] -> osm_slvl_rcv_ctrl_init: ] [1104197291:000963766][B7E78460] -> osm_vla_rcv_init: [ [1104197291:000963773][B7E78460] -> osm_vla_rcv_init: ] [1104197291:000963780][B7E78460] -> osm_vla_rcv_ctrl_init: [ [1104197291:000963788][B7E78460] -> osm_vla_rcv_ctrl_init: ] [1104197291:000963795][B7E78460] -> osm_pkey_rcv_init: [ [1104197291:000963802][B7E78460] -> osm_pkey_rcv_init: ] [1104197291:000963809][B7E78460] -> osm_pkey_rcv_ctrl_init: [ [1104197291:000963817][B7E78460] -> osm_pkey_rcv_ctrl_init: ] [1104197291:000963852][B7E78460] -> osm_sm_init: ] [1104197291:000963860][B7E78460] -> osm_sa_init: [ [1104197291:000963870][B7E78460] -> osm_sa_resp_init: [ [1104197291:000963878][B7E78460] -> osm_sa_resp_init: ] [1104197291:000963887][B7E78460] -> osm_sa_mad_ctrl_init: [ [1104197291:000963895][B7E78460] -> osm_sa_mad_ctrl_init: ] [1104197291:000963941][B7E78460] -> osm_cpi_rcv_init: [ [1104197291:000963948][B7E78460] -> osm_cpi_rcv_init: ] [1104197291:000963955][B7E78460] -> osm_cpi_rcv_ctrl_init: [ [1104197291:000963963][B7E78460] -> osm_cpi_rcv_ctrl_init: ] [1104197291:000963973][B7E78460] -> osm_nr_rcv_init: [ [1104197291:000963993][B7E78460] -> osm_nr_rcv_init: ] [1104197291:000964004][B7E78460] -> osm_nr_rcv_ctrl_init: [ [1104197291:000964012][B7E78460] -> osm_nr_rcv_ctrl_init: ] [1104197291:000964019][B7E78460] -> osm_pir_rcv_init: [ [1104197291:000964034][B7E78460] -> osm_pir_rcv_init: ] [1104197291:000964042][B7E78460] -> osm_pir_rcv_ctrl_init: [ [1104197291:000964050][B7E78460] -> osm_pir_rcv_ctrl_init: ] [1104197291:000964057][B7E78460] -> osm_lr_rcv_init: [ [1104197291:000964075][B7E78460] -> osm_lr_rcv_init: ] [1104197291:000964082][B7E78460] -> osm_lr_rcv_ctrl_init: [ [1104197291:000964090][B7E78460] -> osm_lr_rcv_ctrl_init: ] [1104197291:000964097][B7E78460] -> osm_pr_rcv_init: [ [1104197291:000964114][B7E78460] -> osm_pr_rcv_init: ] [1104197291:000964122][B7E78460] -> osm_pr_rcv_ctrl_init: [ [1104197291:000964129][B7E78460] -> osm_pr_rcv_ctrl_init: ] [1104197291:000964136][B7E78460] -> osm_smir_rcv_init: [ [1104197291:000964143][B7E78460] -> osm_smir_rcv_init: ] [1104197291:000964150][B7E78460] -> osm_smir_ctrl_init: [ [1104197291:000964158][B7E78460] -> osm_smir_ctrl_init: ] [1104197291:000964165][B7E78460] -> osm_mcmr_rcv_init: [ [1104197291:000964179][B7E78460] -> 
osm_mcmr_rcv_init: ] [1104197291:000964187][B7E78460] -> osm_mcmr_rcv_ctrl_init: [ [1104197291:000964195][B7E78460] -> osm_mcmr_rcv_ctrl_init: ] [1104197291:000964202][B7E78460] -> osm_sr_rcv_init: [ [1104197291:000964236][B7E78460] -> osm_sr_rcv_init: ] [1104197291:000964244][B7E78460] -> osm_sr_rcv_ctrl_init: [ [1104197291:000964251][B7E78460] -> osm_sr_rcv_ctrl_init: ] [1104197291:000964259][B7E78460] -> osm_infr_rcv_init: [ [1104197291:000964266][B7E78460] -> osm_infr_rcv_init: ] [1104197291:000964273][B7E78460] -> osm_infr_rcv_ctrl_init: [ [1104197291:000964280][B7E78460] -> osm_infr_rcv_ctrl_init: ] [1104197291:000964290][B7E78460] -> osm_vlarb_rec_rcv_init: [ [1104197291:000964301][B7E78460] -> osm_vlarb_rec_rcv_init: ] [1104197291:000964315][B7E78460] -> osm_vlarb_rec_rcv_ctrl_init: [ [1104197291:000964325][B7E78460] -> osm_vlarb_rec_rcv_ctrl_init: ] [1104197291:000964336][B7E78460] -> osm_slvl_rec_rcv_init: [ [1104197291:000964348][B466EBB0] -> __osm_sm_sweeper: [ [1104197291:000964363][B466EBB0] -> __osm_sm_sweeper: Masking ^C Signals. [1104197291:000964376][B7E78460] -> osm_slvl_rec_rcv_init: ] [1104197291:000964387][B7E78460] -> osm_slvl_rec_rcv_ctrl_init: [ [1104197291:000964396][B7E78460] -> osm_slvl_rec_rcv_ctrl_init: ] [1104197291:000964403][B7E78460] -> osm_pkey_rec_rcv_init: [ [1104197291:000964419][B7E78460] -> osm_pkey_rec_rcv_init: ] [1104197291:000964427][B7E78460] -> osm_pkey_rec_rcv_ctrl_init: [ [1104197291:000964442][B7E78460] -> osm_pkey_rec_rcv_ctrl_init: ] [1104197291:000964451][B7E78460] -> osm_lftr_rcv_init: [ [1104197291:000964468][B7E78460] -> osm_lftr_rcv_init: ] [1104197291:000964476][B7E78460] -> osm_lftr_rcv_ctrl_init: [ [1104197291:000964484][B7E78460] -> osm_lftr_rcv_ctrl_init: ] [1104197291:000964491][B7E78460] -> osm_sa_init: ] [1104197291:000964499][B7E78460] -> osm_opensm_create_mcgroups: [ [1104197291:000964506][B7E78460] -> osm_sa_create_template_record_ipoib: [ [1104197291:000964518][B7E78460] -> osm_mcmr_rcv_create_new_mgrp: [ [1104197291:000964526][B7E78460] -> osm_mcmr_rcv_create_new_mgrp: Getting new mlid 0xc000. [1104197291:000964534][B7E78460] -> __validate_requested_mgid: [ [1104197291:000964543][B7E78460] -> __validate_requested_mgid: MGID Signed as 0x401B. [1104197291:000964551][B7E78460] -> __validate_requested_mgid: Skeeping MGID Validation for IPoIB Signed (0x401B) MGIDs. [1104197291:000964558][B7E78460] -> __validate_requested_mgid: ] [1104197291:000964568][B7E78460] -> osm_mgrp_send_create_notice: [ [1104197291:000964580][B7E78460] -> osm_report_notice: [ [1104197291:000964600][B7E78460] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GUID:0xfe80000000000000,0x0000000000000000 [1104197291:000964616][B7E78460] -> osm_report_notice: ] [1104197291:000964623][B7E78460] -> osm_mgrp_send_create_notice: ] [1104197291:000964631][B7E78460] -> osm_mcmr_rcv_create_new_mgrp: ] [1104197291:000964638][B7E78460] -> osm_mcmr_rcv_create_new_mgrp: [ [1104197291:000964646][B7E78460] -> osm_mcmr_rcv_create_new_mgrp: Getting new mlid 0xc001. [1104197291:000964653][B7E78460] -> __validate_requested_mgid: [ [1104197291:000964660][B7E78460] -> __validate_requested_mgid: MGID Signed as 0x401B. [1104197291:000964668][B7E78460] -> __validate_requested_mgid: Skeeping MGID Validation for IPoIB Signed (0x401B) MGIDs. 
[1104197291:000964675][B7E78460] -> __validate_requested_mgid: ] [1104197291:000964683][B7E78460] -> osm_mgrp_send_create_notice: [ [1104197291:000964690][B7E78460] -> osm_report_notice: [ [1104197291:000964699][B7E78460] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GUID:0xfe80000000000000,0x0000000000000000 [1104197291:000964709][B7E78460] -> osm_report_notice: ] [1104197291:000964716][B7E78460] -> osm_mgrp_send_create_notice: ] [1104197291:000964723][B7E78460] -> osm_mcmr_rcv_create_new_mgrp: ] [1104197291:000964731][B7E78460] -> osm_sa_create_template_record_ipoib: ] [1104197291:000964738][B7E78460] -> osm_opensm_create_mcgroups: ] [1104197291:000964749][B7E78460] -> updn_construct: [ [1104197291:000964756][B7E78460] -> updn_construct: ] [1104197291:000964763][B7E78460] -> updn_init: [ [1104197291:000964775][B7E78460] -> updn_init: ] [1104197291:000964782][B7E78460] -> osm_opensm_init: ] [1104197291:000964802][B7E78460] -> osm_vendor_get_all_port_attr: [ [1104197291:000965856][B7E78460] -> osm_vendor_get_all_port_attr: assign CA 0x80980f8ort 1 guid (0xc900012a3981) as the default port. [1104197291:000965866][B7E78460] -> osm_vendor_get_all_port_attr: ] [1104197291:000965957][B7E78460] -> osm_opensm_bind: [ [1104197291:000965965][B7E78460] -> osm_sm_bind: [ [1104197291:000965973][B7E78460] -> osm_sm_mad_ctrl_bind: [ [1104197291:000965981][B7E78460] -> osm_sm_mad_ctrl_bind: Binding to port 0xc900012a3981. [1104197291:000965989][B7E78460] -> osm_vendor_bind: [ [1104197291:000965996][B7E78460] -> osm_vendor_bind: Binding to port 0xc900012a3981. [1104197291:000966004][B7E78460] -> osm_vendor_open_port: [ [1104197291:000968284][B7E78460] -> osm_vl15_init: [ [1104197291:000968335][B7E78460] -> osm_vl15_init: ] [1104197291:000968344][B7E78460] -> osm_vendor_open_port: ] [1104197291:000968351][B7E78460] -> osm_vendor_bind: Unable to Open Port 0xc900012a3981. [1104197291:000968359][B7E78460] -> osm_vendor_bind: ] [1104197291:000968366][B7E78460] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed. [1104197291:000968373][B7E78460] -> osm_sm_mad_ctrl_bind: ] [1104197291:000968382][B7E78460] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR). [1104197291:000968397][B7E78460] -> osm_sm_bind: ] [1104197291:000968405][B7E78460] -> osm_opensm_bind: ] [1104197291:000968522][B7E78460] -> osm_vl15_shutdown: [ [1104197291:000968531][B7E78460] -> osm_vl15_shutdown: ] [1104197291:000968551][B7E78460] -> osm_sm_destroy: [ [1104197291:000968572][B466EBB0] -> __osm_sm_sweeper: Off schedule sweep signalled. 
[1104197291:000968586][B466EBB0] -> __osm_sm_sweeper: ] [1104197291:000968598][B3E6DBB0] -> umad_receiver: [ [1104197291:000968619][B7E78460] -> osm_trap_rcv_destroy: [ [1104197291:000968628][B7E78460] -> cl_event_wheel_destroy: [ [1104197291:000968635][B7E78460] -> cl_event_wheel_dump: [ [1104197291:000968642][B7E78460] -> cl_event_wheel_dump: event_wheel ptr:0x8093f18 [1104197291:000968672][B7E78460] -> cl_event_wheel_dump: ] [1104197291:000968686][B7E78460] -> cl_event_wheel_destroy: ] [1104197291:000968694][B7E78460] -> osm_trap_rcv_destroy: ] [1104197291:000968701][B7E78460] -> osm_sminfo_rcv_destroy: [ [1104197291:000968708][B7E78460] -> osm_sminfo_rcv_destroy: ] [1104197291:000968716][B7E78460] -> osm_ni_rcv_destroy: [ [1104197291:000968723][B7E78460] -> osm_ni_rcv_destroy: ] [1104197291:000968730][B7E78460] -> osm_pi_rcv_destroy: [ [1104197291:000968737][B7E78460] -> osm_pi_rcv_destroy: ] [1104197291:000968745][B7E78460] -> osm_si_rcv_destroy: [ [1104197291:000968751][B7E78460] -> osm_si_rcv_destroy: ] [1104197291:000968759][B7E78460] -> osm_nd_rcv_destroy: [ [1104197291:000968766][B7E78460] -> osm_nd_rcv_destroy: ] [1104197291:000968773][B7E78460] -> osm_lid_mgr_destroy: [ [1104197291:000968781][B7E78460] -> osm_lid_mgr_destroy: ] [1104197291:000968788][B7E78460] -> osm_ucast_mgr_destroy: [ [1104197291:000968794][B7E78460] -> osm_ucast_mgr_destroy: ] [1104197291:000968801][B7E78460] -> osm_link_mgr_destroy: [ [1104197291:000968808][B7E78460] -> osm_link_mgr_destroy: ] [1104197291:000968815][B7E78460] -> osm_drop_mgr_destroy: [ [1104197291:000968821][B7E78460] -> osm_drop_mgr_destroy: ] [1104197291:000968829][B7E78460] -> osm_lft_rcv_destroy: [ [1104197291:000968836][B7E78460] -> osm_lft_rcv_destroy: ] [1104197291:000968843][B7E78460] -> osm_mft_rcv_destroy: [ [1104197291:000968850][B7E78460] -> osm_mft_rcv_destroy: ] [1104197291:000968858][B7E78460] -> osm_slvl_rcv_destroy: [ [1104197291:000968864][B7E78460] -> osm_slvl_rcv_destroy: ] [1104197291:000968872][B7E78460] -> osm_vla_rcv_destroy: [ [1104197291:000968878][B7E78460] -> osm_vla_rcv_destroy: ] [1104197291:000968886][B7E78460] -> osm_pkey_rcv_destroy: [ [1104197291:000968893][B7E78460] -> osm_pkey_rcv_destroy: ] [1104197291:000968901][B7E78460] -> osm_state_mgr_destroy: [ [1104197291:000968909][B7E78460] -> osm_state_mgr_destroy: ] [1104197291:000968917][B7E78460] -> osm_sm_state_mgr_destroy: [ [1104197291:000968924][B7E78460] -> osm_sm_state_mgr_destroy: ] [1104197291:000968932][B7E78460] -> osm_mcast_mgr_destroy: [ [1104197291:000968939][B7E78460] -> osm_mcast_mgr_destroy: ] [1104197291:000969031][B7E78460] -> osm_sm_destroy: ] [1104197291:000969039][B7E78460] -> osm_sa_destroy: [ [1104197291:000969047][B7E78460] -> osm_nr_rcv_destroy: [ [1104197291:000969056][B7E78460] -> osm_nr_rcv_destroy: ] [1104197291:000969074][B7E78460] -> osm_pir_rcv_destroy: [ [1104197291:000969082][B7E78460] -> osm_pir_rcv_destroy: ] [1104197291:000969089][B7E78460] -> osm_lr_rcv_destroy: [ [1104197291:000969097][B7E78460] -> osm_lr_rcv_destroy: ] [1104197291:000969104][B7E78460] -> osm_pr_rcv_destroy: [ [1104197291:000969112][B7E78460] -> osm_pr_rcv_destroy: ] [1104197291:000969119][B7E78460] -> osm_smir_rcv_destroy: [ [1104197291:000969126][B7E78460] -> osm_smir_rcv_destroy: ] [1104197291:000969133][B7E78460] -> osm_mcmr_rcv_destroy: [ [1104197291:000969141][B7E78460] -> osm_mcmr_rcv_destroy: ] [1104197291:000969149][B7E78460] -> osm_sr_rcv_destroy: [ [1104197291:000969198][B7E78460] -> osm_sr_rcv_destroy: ] [1104197291:000969207][B7E78460] -> 
osm_infr_rcv_destroy: [ [1104197291:000969214][B7E78460] -> osm_infr_rcv_destroy: ] [1104197291:000969221][B7E78460] -> osm_vlarb_rec_rcv_destroy: [ [1104197291:000969240][B7E78460] -> osm_vlarb_rec_rcv_destroy: ] [1104197291:000969249][B7E78460] -> osm_slvl_rec_rcv_destroy: [ [1104197291:000969257][B7E78460] -> osm_slvl_rec_rcv_destroy: ] [1104197291:000969265][B7E78460] -> osm_pkey_rec_rcv_destroy: [ [1104197291:000969272][B7E78460] -> osm_pkey_rec_rcv_destroy: ] [1104197291:000969280][B7E78460] -> osm_lftr_rcv_destroy: [ [1104197291:000969290][B7E78460] -> osm_lftr_rcv_destroy: ] [1104197291:000969298][B7E78460] -> osm_sa_destroy: ] [1104197291:000969345][B3E6DBB0] -> umad_receiver: recv error Success From roland at topspin.com Tue Dec 28 08:28:58 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Dec 2004 08:28:58 -0800 Subject: [openib-general] Some CM Comments In-Reply-To: <1104238992.4054.3823.camel@localhost.localdomain> (Hal Rosenstock's message of "28 Dec 2004 08:03:12 -0500") References: <1104238992.4054.3823.camel@localhost.localdomain> Message-ID: <523bxqjsj9.fsf@topspin.com> Hal> 4. Should there be an API to obtain local/remote comm IDs ? What would the consumer do with the comm ID? - R. From shaharf at voltaire.com Tue Dec 28 08:34:33 2004 From: shaharf at voltaire.com (shaharf) Date: Tue, 28 Dec 2004 18:34:33 +0200 Subject: [openib-general] RE: opensm failure Message-ID: > Hi! > > > # ./osm/opensm/opensm -V --------------------------------- > ---------------- > OpenSM Rev:1.6.0-rc2 > Command Line Arguments: > Big V selected > Log File: /tmp/osm.log > ------------------------------------------------- > using default guid 0xc900012a3981 > > Error from osm_opensm_bind (0x2A) > > MST Please update your SM, and run it again with "-V -d5" and send me also the stdout/stderr of the SM. Shahar From shaharf at voltaire.com Tue Dec 28 08:53:12 2004 From: shaharf at voltaire.com (shaharf) Date: Tue, 28 Dec 2004 18:53:12 +0200 Subject: [openib-general] usermode tree changes Message-ID: I moved all osm and umad related directories to usermode/management. I also fixed some minor bugs in the umad libraries: - Minor make files error fix (missing .depend warning) - force terminating null at sys_read_string - fix sys_read_uint return types - add warning message on missing ABI sys files (i.e. ib_umad not loaded). Shahar -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: patch Type: application/octet-stream Size: 4305 bytes Desc: patch URL: From mst at mellanox.co.il Tue Dec 28 09:07:39 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 19:07:39 +0200 Subject: [openib-general] usermode tree changes In-Reply-To: References: Message-ID: <20041228170739.GA27321@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "[openib-general] usermode tree changes": > I moved all osm and umad related directories to usermode/management. Good idea. But I still see userspace/osm/ at 1406. Cant something be done about the necessity to set the openib root? Assuming I check everything out properly from svn, why cant each component set OPENIB_ROOT inside the makefile to ../../ or something? MST From mst at mellanox.co.il Tue Dec 28 09:09:55 2004 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 28 Dec 2004 19:09:55 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: References: Message-ID: <20041228170955.GB27321@mellanox.co.il> Quoting r. shaharf (shaharf at voltaire.com) "RE: opensm failure": > > Hi! > Please update your SM, and run it again with "-V -d5" and send me also > the stdout/stderr of the SM. > > Shahar Here. ------------------------------------------------- OpenSM Rev:1.6.0-rc2 Command Line Arguments: Big V selected d level = 0x5 Log File: /tmp/osm.log ------------------------------------------------- warn: [32666] umad_init: warn: [32666] umad_get_cas_names: max 10 warn: [32666] umad_get_cas_names: return 1 ca warn: [32666] umad_get_ca_portguids: ca name mthca0 max port guids warn: [32666] umad_get_ca: ca_name mthca0 warn: [32666] umad_get_ca: opened mthca0 warn: [32666] umad_get_ca_portguids: mthca0: 3 ports using default guid 0xc900012a3981 warn: [32666] umad_get_ca_portguids: ca name mthca0 max port guids warn: [32666] umad_get_ca: ca_name mthca0 warn: [32666] umad_get_ca: opened mthca0 warn: [32666] umad_get_ca_portguids: mthca0: 3 ports warn: [32666] umad_get_ca: ca_name mthca0 warn: [32666] umad_get_ca: opened mthca0 warn: [32666] umad_get_port: ca_name mthca0 portnum 1 warn: [32666] umad_open_port: ca mthca0 port 1 warn: [32666] umad_open_port: open /dev/infiniband/mthca0/ports/1/mad failed Error from osm_opensm_bind (0x2A) warn: [32666] umad_done: warn: [32666] umad_recv: portid -5 umad 0x80970e8 timeout 0 -------------- next part -------------- [1104201592:000212673][40188080] -> OpenSM Rev:1.6.0-rc2 [1104201592:000212887][40188080] -> osm_opensm_init: [ [1104201592:000212905][40188080] -> osm_opensm_init: Forcing single threaded dispatcher. [1104201592:000213005][40188080] -> osm_vendor_new: [ [1104201592:000213016][40188080] -> osm_vendor_init: [ [1104201592:000213309][40188080] -> osm_vendor_init: ] [1104201592:000213319][40188080] -> osm_vendor_new: ] [1104201592:000213340][40188080] -> osm_mad_pool_init: [ [1104201592:000213455][40188080] -> osm_mad_pool_init: ] [1104201592:000213464][40188080] -> osm_vl15_init: [ [1104201592:000213499][40188080] -> osm_vl15_init: ] [1104201592:000213511][40188080] -> osm_sm_init: [ [1104201592:000213530][40188080] -> osm_sm_mad_ctrl_init: [ [1104201592:000213540][40188080] -> osm_sm_mad_ctrl_init: ] [1104201592:000213550][40188080] -> osm_req_init: [ [1104201592:000213557][40188080] -> osm_req_init: ] [1104201592:000213564][40188080] -> osm_req_ctrl_init: [ [1104201592:000213571][40188080] -> osm_req_ctrl_init: ] [1104201592:000213579][40188080] -> osm_resp_init: [ [1104201592:000213586][40188080] -> osm_resp_init: ] [1104201592:000213595][40188080] -> osm_ni_rcv_init: [ [1104201592:000213602][40188080] -> osm_ni_rcv_init: ] [1104201592:000213609][40188080] -> osm_ni_rcv_ctrl_init: [ [1104201592:000213617][40188080] -> osm_ni_rcv_ctrl_init: ] [1104201592:000213624][40188080] -> osm_pi_rcv_init: [ [1104201592:000213631][40188080] -> osm_pi_rcv_init: ] [1104201592:000213638][40188080] -> osm_pi_rcv_ctrl_init: [ [1104201592:000213645][40188080] -> osm_pi_rcv_ctrl_init: ] [1104201592:000213652][40188080] -> osm_si_rcv_init: [ [1104201592:000213659][40188080] -> osm_si_rcv_init: ] [1104201592:000213666][40188080] -> osm_si_rcv_ctrl_init: [ [1104201592:000213673][40188080] -> osm_si_rcv_ctrl_init: ] [1104201592:000213680][40188080] -> osm_nd_rcv_init: [ [1104201592:000213687][40188080] -> osm_nd_rcv_init: ] [1104201592:000213694][40188080] -> osm_nd_rcv_ctrl_init: [ 
[1104201592:000213701][40188080] -> osm_nd_rcv_ctrl_init: ] [1104201592:000213708][40188080] -> osm_lid_mgr_init: [ [1104201592:000213742][40188080] -> osm_lid_mgr_init: ] [1104201592:000213753][40188080] -> osm_ucast_mgr_init: [ [1104201592:000213760][40188080] -> osm_ucast_mgr_init: ] [1104201592:000213767][40188080] -> osm_link_mgr_init: [ [1104201592:000213774][40188080] -> osm_link_mgr_init: ] [1104201592:000213783][40188080] -> osm_state_mgr_init: [ [1104201592:000213791][40188080] -> osm_state_mgr_init: ] [1104201592:000213798][40188080] -> osm_state_mgr_ctrl_init: [ [1104201592:000213805][40188080] -> osm_state_mgr_ctrl_init: ] [1104201592:000213812][40188080] -> osm_drop_mgr_init: [ [1104201592:000213819][40188080] -> osm_drop_mgr_init: ] [1104201592:000213826][40188080] -> osm_lft_rcv_init: [ [1104201592:000213833][40188080] -> osm_lft_rcv_init: ] [1104201592:000213840][40188080] -> osm_lft_rcv_ctrl_init: [ [1104201592:000213853][40188080] -> osm_lft_rcv_ctrl_init: ] [1104201592:000213860][40188080] -> osm_mft_rcv_init: [ [1104201592:000213867][40188080] -> osm_mft_rcv_init: ] [1104201592:000213874][40188080] -> osm_mft_rcv_ctrl_init: [ [1104201592:000213885][40188080] -> osm_mft_rcv_ctrl_init: ] [1104201592:000213893][40188080] -> osm_sweep_fail_ctrl_init: [ [1104201592:000213900][40188080] -> osm_sweep_fail_ctrl_init: ] [1104201592:000213911][40188080] -> osm_sminfo_rcv_init: [ [1104201592:000213918][40188080] -> osm_sminfo_rcv_init: ] [1104201592:000213925][40188080] -> osm_sminfo_rcv_ctrl_init: [ [1104201592:000213933][40188080] -> osm_sminfo_rcv_ctrl_init: ] [1104201592:000213940][40188080] -> osm_trap_rcv_init: [ [1104201592:000213948][40188080] -> cl_event_wheel_init: [ [1104201592:000213958][40188080] -> cl_event_wheel_init: ] [1104201592:000213966][40188080] -> osm_trap_rcv_init: ] [1104201592:000213973][40188080] -> osm_trap_rcv_ctrl_init: [ [1104201592:000213980][40188080] -> osm_trap_rcv_ctrl_init: ] [1104201592:000213988][40188080] -> osm_sm_state_mgr_init: [ [1104201592:000213995][40188080] -> osm_sm_state_mgr_init: ] [1104201592:000214003][40188080] -> osm_mcast_mgr_init: [ [1104201592:000214022][40188080] -> osm_mcast_mgr_init: ] [1104201592:000214031][40188080] -> osm_slvl_rcv_init: [ [1104201592:000214038][40188080] -> osm_slvl_rcv_init: ] [1104201592:000214045][40188080] -> osm_slvl_rcv_ctrl_init: [ [1104201592:000214053][40188080] -> osm_slvl_rcv_ctrl_init: ] [1104201592:000214060][40188080] -> osm_vla_rcv_init: [ [1104201592:000214067][40188080] -> osm_vla_rcv_init: ] [1104201592:000214074][40188080] -> osm_vla_rcv_ctrl_init: [ [1104201592:000214082][40188080] -> osm_vla_rcv_ctrl_init: ] [1104201592:000214089][40188080] -> osm_pkey_rcv_init: [ [1104201592:000214096][40188080] -> osm_pkey_rcv_init: ] [1104201592:000214103][40188080] -> osm_pkey_rcv_ctrl_init: [ [1104201592:000214110][40188080] -> osm_pkey_rcv_ctrl_init: ] [1104201592:000214156][40F91BB0] -> __osm_vl15_poller: [ [1104201592:000214186][41192BB0] -> __osm_sm_sweeper: [ [1104201592:000214195][41192BB0] -> __osm_sm_sweeper: Masking ^C Signals. 
[1104201592:000214217][40188080] -> osm_sm_init: ] [1104201592:000214225][40188080] -> osm_sa_init: [ [1104201592:000214234][40188080] -> osm_sa_resp_init: [ [1104201592:000214242][40188080] -> osm_sa_resp_init: ] [1104201592:000214252][40188080] -> osm_sa_mad_ctrl_init: [ [1104201592:000214260][40188080] -> osm_sa_mad_ctrl_init: ] [1104201592:000214267][40188080] -> osm_cpi_rcv_init: [ [1104201592:000214273][40188080] -> osm_cpi_rcv_init: ] [1104201592:000214281][40188080] -> osm_cpi_rcv_ctrl_init: [ [1104201592:000214288][40188080] -> osm_cpi_rcv_ctrl_init: ] [1104201592:000214298][40188080] -> osm_nr_rcv_init: [ [1104201592:000214319][40188080] -> osm_nr_rcv_init: ] [1104201592:000214339][40188080] -> osm_nr_rcv_ctrl_init: [ [1104201592:000214347][40188080] -> osm_nr_rcv_ctrl_init: ] [1104201592:000214354][40188080] -> osm_pir_rcv_init: [ [1104201592:000214370][40188080] -> osm_pir_rcv_init: ] [1104201592:000214378][40188080] -> osm_pir_rcv_ctrl_init: [ [1104201592:000214385][40188080] -> osm_pir_rcv_ctrl_init: ] [1104201592:000214392][40188080] -> osm_lr_rcv_init: [ [1104201592:000214402][40188080] -> osm_lr_rcv_init: ] [1104201592:000214409][40188080] -> osm_lr_rcv_ctrl_init: [ [1104201592:000214416][40188080] -> osm_lr_rcv_ctrl_init: ] [1104201592:000214423][40188080] -> osm_pr_rcv_init: [ [1104201592:000214438][40188080] -> osm_pr_rcv_init: ] [1104201592:000214447][40188080] -> osm_pr_rcv_ctrl_init: [ [1104201592:000214454][40188080] -> osm_pr_rcv_ctrl_init: ] [1104201592:000214461][40188080] -> osm_smir_rcv_init: [ [1104201592:000214468][40188080] -> osm_smir_rcv_init: ] [1104201592:000214475][40188080] -> osm_smir_ctrl_init: [ [1104201592:000214482][40188080] -> osm_smir_ctrl_init: ] [1104201592:000214490][40188080] -> osm_mcmr_rcv_init: [ [1104201592:000214506][40188080] -> osm_mcmr_rcv_init: ] [1104201592:000214514][40188080] -> osm_mcmr_rcv_ctrl_init: [ [1104201592:000214521][40188080] -> osm_mcmr_rcv_ctrl_init: ] [1104201592:000214528][40188080] -> osm_sr_rcv_init: [ [1104201592:000214560][40188080] -> osm_sr_rcv_init: ] [1104201592:000214568][40188080] -> osm_sr_rcv_ctrl_init: [ [1104201592:000214575][40188080] -> osm_sr_rcv_ctrl_init: ] [1104201592:000214582][40188080] -> osm_infr_rcv_init: [ [1104201592:000214589][40188080] -> osm_infr_rcv_init: ] [1104201592:000214596][40188080] -> osm_infr_rcv_ctrl_init: [ [1104201592:000214604][40188080] -> osm_infr_rcv_ctrl_init: ] [1104201592:000214611][40188080] -> osm_vlarb_rec_rcv_init: [ [1104201592:000214620][40188080] -> osm_vlarb_rec_rcv_init: ] [1104201592:000214628][40188080] -> osm_vlarb_rec_rcv_ctrl_init: [ [1104201592:000214636][40188080] -> osm_vlarb_rec_rcv_ctrl_init: ] [1104201592:000214643][40188080] -> osm_slvl_rec_rcv_init: [ [1104201592:000214652][40188080] -> osm_slvl_rec_rcv_init: ] [1104201592:000214660][40188080] -> osm_slvl_rec_rcv_ctrl_init: [ [1104201592:000214667][40188080] -> osm_slvl_rec_rcv_ctrl_init: ] [1104201592:000214675][40188080] -> osm_pkey_rec_rcv_init: [ [1104201592:000214690][40188080] -> osm_pkey_rec_rcv_init: ] [1104201592:000214698][40188080] -> osm_pkey_rec_rcv_ctrl_init: [ [1104201592:000214713][40188080] -> osm_pkey_rec_rcv_ctrl_init: ] [1104201592:000214721][40188080] -> osm_lftr_rcv_init: [ [1104201592:000214736][40188080] -> osm_lftr_rcv_init: ] [1104201592:000214744][40188080] -> osm_lftr_rcv_ctrl_init: [ [1104201592:000214752][40188080] -> osm_lftr_rcv_ctrl_init: ] [1104201592:000214758][40188080] -> osm_sa_init: ] [1104201592:000214766][40188080] -> osm_opensm_create_mcgroups: [ 
[1104201592:000214774][40188080] -> osm_sa_create_template_record_ipoib: [ [1104201592:000214786][40188080] -> osm_mcmr_rcv_create_new_mgrp: [ [1104201592:000214794][40188080] -> osm_mcmr_rcv_create_new_mgrp: Getting new mlid 0xc000. [1104201592:000214802][40188080] -> __validate_requested_mgid: [ [1104201592:000214811][40188080] -> __validate_requested_mgid: MGID Signed as 0x401B. [1104201592:000214820][40188080] -> __validate_requested_mgid: Skeeping MGID Validation for IPoIB Signed (0x401B) MGIDs. [1104201592:000214828][40188080] -> __validate_requested_mgid: ] [1104201592:000214838][40188080] -> osm_mgrp_send_create_notice: [ [1104201592:000214849][40188080] -> osm_report_notice: [ [1104201592:000214863][40188080] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GUID:0xfe80000000000000,0x0000000000000000 [1104201592:000214877][40188080] -> osm_report_notice: ] [1104201592:000214887][40188080] -> osm_mgrp_send_create_notice: ] [1104201592:000214894][40188080] -> osm_mcmr_rcv_create_new_mgrp: ] [1104201592:000214901][40188080] -> osm_mcmr_rcv_create_new_mgrp: [ [1104201592:000214908][40188080] -> osm_mcmr_rcv_create_new_mgrp: Getting new mlid 0xc001. [1104201592:000214915][40188080] -> __validate_requested_mgid: [ [1104201592:000214923][40188080] -> __validate_requested_mgid: MGID Signed as 0x401B. [1104201592:000214930][40188080] -> __validate_requested_mgid: Skeeping MGID Validation for IPoIB Signed (0x401B) MGIDs. [1104201592:000214937][40188080] -> __validate_requested_mgid: ] [1104201592:000214945][40188080] -> osm_mgrp_send_create_notice: [ [1104201592:000214953][40188080] -> osm_report_notice: [ [1104201592:000214962][40188080] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GUID:0xfe80000000000000,0x0000000000000000 [1104201592:000214971][40188080] -> osm_report_notice: ] [1104201592:000214978][40188080] -> osm_mgrp_send_create_notice: ] [1104201592:000214985][40188080] -> osm_mcmr_rcv_create_new_mgrp: ] [1104201592:000214992][40188080] -> osm_sa_create_template_record_ipoib: ] [1104201592:000215000][40188080] -> osm_opensm_create_mcgroups: ] [1104201592:000215011][40188080] -> updn_construct: [ [1104201592:000215019][40188080] -> updn_construct: ] [1104201592:000215026][40188080] -> updn_init: [ [1104201592:000215038][40188080] -> updn_init: ] [1104201592:000215046][40188080] -> osm_opensm_init: ] [1104201592:000215070][40188080] -> osm_vendor_get_all_port_attr: [ [1104201592:000216450][40188080] -> osm_vendor_get_all_port_attr: assign CA 0x8098100ort 1 guid (0xc900012a3981) as the default port. [1104201592:000216461][40188080] -> osm_vendor_get_all_port_attr: ] [1104201592:000216544][40188080] -> osm_opensm_bind: [ [1104201592:000216553][40188080] -> osm_sm_bind: [ [1104201592:000216560][40188080] -> osm_sm_mad_ctrl_bind: [ [1104201592:000216568][40188080] -> osm_sm_mad_ctrl_bind: Binding to port 0xc900012a3981. [1104201592:000216576][40188080] -> osm_vendor_bind: [ [1104201592:000216583][40188080] -> osm_vendor_bind: Binding to port 0xc900012a3981. [1104201592:000216591][40188080] -> osm_vendor_open_port: [ [1104201592:000219904][40188080] -> osm_vl15_init: [ [1104201592:000219967][40188080] -> osm_vl15_init: ] [1104201592:000219976][40188080] -> osm_vendor_open_port: ] [1104201592:000219985][40188080] -> osm_vendor_bind: Unable to Open Port 0xc900012a3981. [1104201592:000219992][40188080] -> osm_vendor_bind: ] [1104201592:000220000][40188080] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed. 
[1104201592:000220009][40188080] -> osm_sm_mad_ctrl_bind: ] [1104201592:000220017][40188080] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR). [1104201592:000220037][40188080] -> osm_sm_bind: ] [1104201592:000220045][40188080] -> osm_opensm_bind: ] [1104201592:000220170][40188080] -> osm_vl15_shutdown: [ [1104201592:000220179][40188080] -> osm_vl15_shutdown: ] [1104201592:000220259][40188080] -> osm_sm_destroy: [ [1104201592:000220287][41393BB0] -> umad_receiver: [ [1104201592:000220414][41393BB0] -> umad_receiver: recv error Success [1104201592:000220450][41192BB0] -> __osm_sm_sweeper: Off schedule sweep signalled. [1104201592:000220460][41192BB0] -> __osm_sm_sweeper: ] [1104201592:000220477][40188080] -> osm_trap_rcv_destroy: [ [1104201592:000220486][40188080] -> cl_event_wheel_destroy: [ [1104201592:000220493][40188080] -> cl_event_wheel_dump: [ [1104201592:000220501][40188080] -> cl_event_wheel_dump: event_wheel ptr:0x8093f18 [1104201592:000220508][40188080] -> cl_event_wheel_dump: ] [1104201592:000220523][40188080] -> cl_event_wheel_destroy: ] [1104201592:000220530][40188080] -> osm_trap_rcv_destroy: ] [1104201592:000220539][40188080] -> osm_sminfo_rcv_destroy: [ [1104201592:000220547][40188080] -> osm_sminfo_rcv_destroy: ] [1104201592:000220555][40188080] -> osm_ni_rcv_destroy: [ [1104201592:000220562][40188080] -> osm_ni_rcv_destroy: ] [1104201592:000220569][40188080] -> osm_pi_rcv_destroy: [ [1104201592:000220576][40188080] -> osm_pi_rcv_destroy: ] [1104201592:000220583][40188080] -> osm_si_rcv_destroy: [ [1104201592:000220590][40188080] -> osm_si_rcv_destroy: ] [1104201592:000220598][40188080] -> osm_nd_rcv_destroy: [ [1104201592:000220604][40188080] -> osm_nd_rcv_destroy: ] [1104201592:000220611][40188080] -> osm_lid_mgr_destroy: [ [1104201592:000220620][40188080] -> osm_lid_mgr_destroy: ] [1104201592:000220627][40188080] -> osm_ucast_mgr_destroy: [ [1104201592:000220634][40188080] -> osm_ucast_mgr_destroy: ] [1104201592:000220641][40188080] -> osm_link_mgr_destroy: [ [1104201592:000220648][40188080] -> osm_link_mgr_destroy: ] [1104201592:000220655][40188080] -> osm_drop_mgr_destroy: [ [1104201592:000220661][40188080] -> osm_drop_mgr_destroy: ] [1104201592:000220669][40188080] -> osm_lft_rcv_destroy: [ [1104201592:000220676][40188080] -> osm_lft_rcv_destroy: ] [1104201592:000220683][40188080] -> osm_mft_rcv_destroy: [ [1104201592:000220690][40188080] -> osm_mft_rcv_destroy: ] [1104201592:000220698][40188080] -> osm_slvl_rcv_destroy: [ [1104201592:000220704][40188080] -> osm_slvl_rcv_destroy: ] [1104201592:000220712][40188080] -> osm_vla_rcv_destroy: [ [1104201592:000220719][40188080] -> osm_vla_rcv_destroy: ] [1104201592:000220727][40188080] -> osm_pkey_rcv_destroy: [ [1104201592:000220733][40188080] -> osm_pkey_rcv_destroy: ] [1104201592:000220741][40188080] -> osm_state_mgr_destroy: [ [1104201592:000220750][40188080] -> osm_state_mgr_destroy: ] [1104201592:000220758][40188080] -> osm_sm_state_mgr_destroy: [ [1104201592:000220765][40188080] -> osm_sm_state_mgr_destroy: ] [1104201592:000220773][40188080] -> osm_mcast_mgr_destroy: [ [1104201592:000220780][40188080] -> osm_mcast_mgr_destroy: ] [1104201592:000220834][40188080] -> osm_sm_destroy: ] [1104201592:000220843][40188080] -> osm_sa_destroy: [ [1104201592:000220852][40188080] -> osm_nr_rcv_destroy: [ [1104201592:000220861][40188080] -> osm_nr_rcv_destroy: ] [1104201592:000220870][40188080] -> osm_pir_rcv_destroy: [ [1104201592:000220880][40188080] -> osm_pir_rcv_destroy: ] [1104201592:000220888][40188080] 
-> osm_lr_rcv_destroy: [ [1104201592:000220897][40188080] -> osm_lr_rcv_destroy: ] [1104201592:000220904][40188080] -> osm_pr_rcv_destroy: [ [1104201592:000220912][40188080] -> osm_pr_rcv_destroy: ] [1104201592:000220919][40188080] -> osm_smir_rcv_destroy: [ [1104201592:000220926][40188080] -> osm_smir_rcv_destroy: ] [1104201592:000220934][40188080] -> osm_mcmr_rcv_destroy: [ [1104201592:000220941][40188080] -> osm_mcmr_rcv_destroy: ] [1104201592:000220949][40188080] -> osm_sr_rcv_destroy: [ [1104201592:000220969][40188080] -> osm_sr_rcv_destroy: ] [1104201592:000220978][40188080] -> osm_infr_rcv_destroy: [ [1104201592:000220985][40188080] -> osm_infr_rcv_destroy: ] [1104201592:000221001][40188080] -> osm_vlarb_rec_rcv_destroy: [ [1104201592:000221010][40188080] -> osm_vlarb_rec_rcv_destroy: ] [1104201592:000221018][40188080] -> osm_slvl_rec_rcv_destroy: [ [1104201592:000221026][40188080] -> osm_slvl_rec_rcv_destroy: ] [1104201592:000221034][40188080] -> osm_pkey_rec_rcv_destroy: [ [1104201592:000221042][40188080] -> osm_pkey_rec_rcv_destroy: ] [1104201592:000221050][40188080] -> osm_lftr_rcv_destroy: [ [1104201592:000221057][40188080] -> osm_lftr_rcv_destroy: ] [1104201592:000221064][40188080] -> osm_sa_destroy: ] From mst at mellanox.co.il Tue Dec 28 09:32:47 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 19:32:47 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: <20041228170955.GB27321@mellanox.co.il> References: <20041228170955.GB27321@mellanox.co.il> Message-ID: <20041228173247.GC27321@mellanox.co.il> Hello! Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "[openib-general] Re: opensm failure": > warn: [32666] umad_open_port: open /dev/infiniband/mthca0/ports/1/mad failed Oh. Is that a udev problem then? From shaharf at voltaire.com Tue Dec 28 09:34:06 2004 From: shaharf at voltaire.com (shaharf) Date: Tue, 28 Dec 2004 19:34:06 +0200 Subject: [openib-general] Re: opensm failure Message-ID: > > Hello! > Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "[openib-general] Re: > opensm failure": > > warn: [32666] umad_open_port: open /dev/infiniband/mthca0/ports/1/mad > failed > > Oh. Is that a udev problem then? Yes. I Forgot to modify umad.c for the new udev scheme. I will do it soon. Shahar From mst at mellanox.co.il Tue Dec 28 09:45:32 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 19:45:32 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: References: Message-ID: <20041228174532.GD27321@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: opensm failure": > > > > Hello! > > Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "[openib-general] > Re: > > opensm failure": > > > warn: [32666] umad_open_port: open > /dev/infiniband/mthca0/ports/1/mad > > failed > > > > Oh. Is that a udev problem then? > > Yes. I Forgot to modify umad.c for the new udev scheme. I will do it > soon. > > Shahar I now have /dev/infiniband/umad0 and /dev/infiniband/umad1, as per src/linux-kernel/docs/user_mad.txt by the way, should not user_mad.txt move out of linux-kernel/docs? From mst at mellanox.co.il Tue Dec 28 09:58:24 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 19:58:24 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: References: Message-ID: <20041228175824.GE27321@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: opensm failure": > > > > Hello! > > Quoting r. Michael S. 
Tsirkin (mst at mellanox.co.il) "[openib-general] > Re: > > opensm failure": > > > warn: [32666] umad_open_port: open > /dev/infiniband/mthca0/ports/1/mad > > failed > > > > Oh. Is that a udev problem then? > > Yes. I Forgot to modify umad.c for the new udev scheme. I will do it > soon. > > Shahar Thanks, I worked around this by creating a softlink and now ip over ib seems to work. mst From mst at mellanox.co.il Tue Dec 28 10:09:41 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 20:09:41 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: References: Message-ID: <20041228180941.GA27507@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: opensm failure": > > > > Hello! > > Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "[openib-general] > Re: > > opensm failure": > > > warn: [32666] umad_open_port: open > /dev/infiniband/mthca0/ports/1/mad > > failed > > > > Oh. Is that a udev problem then? > > Yes. I Forgot to modify umad.c for the new udev scheme. I will do it > soon. > > Shahar 1. Why cant I give the udev device path on a command line to opensm? that would solve this once and for all, and let the user play with udev names all he wants. mst From mst at mellanox.co.il Tue Dec 28 10:21:12 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 20:21:12 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: <20041228180941.GA27507@mellanox.co.il> References: <20041228180941.GA27507@mellanox.co.il> Message-ID: <20041228182112.GA27551@mellanox.co.il> Hello! Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "Re: [openib-general] Re: opensm failure": > Hello! > Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: opensm failure": > > > > > > Hello! > > > Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "[openib-general] > > Re: > > > opensm failure": > > > > warn: [32666] umad_open_port: open > > /dev/infiniband/mthca0/ports/1/mad > > > failed > > > > > > Oh. Is that a udev problem then? > > > > Yes. I Forgot to modify umad.c for the new udev scheme. I will do it > > soon. > > > > Shahar > > Why cant I give the udev device path on a command line to opensm? > that would solve this once and for all, and let the user play with > udev names all he wants. What I had in mind was something like: ./opensm --dev /devinfiniband/umad0 to work on port 1 only and ./opensm --dev /dev/infiniband/umad* to work on all ports of all devices. /dev/infiniband/umad* could be a default (maybe set at configure time or by varibale in makefile), but otherwise I really think *forcing* a specific udev scheme is a problem, since this makes udev scheme part of abi, but not checkable by abiversion. mst From shaharf at voltaire.com Tue Dec 28 10:32:07 2004 From: shaharf at voltaire.com (shaharf) Date: Tue, 28 Dec 2004 20:32:07 +0200 Subject: [openib-general] Re: opensm failure Message-ID: > > > > Oh. Is that a udev problem then? > > > > > > Yes. I Forgot to modify umad.c for the new udev scheme. I will do it > > > soon. > > > > > > Shahar > > > > Why cant I give the udev device path on a command line to opensm? > > that would solve this once and for all, and let the user play with > > udev names all he wants. > > What I had in mind was something like: > > ./opensm --dev /devinfiniband/umad0 > > to work on port 1 only > > and > > ./opensm --dev /dev/infiniband/umad* > > to work on all ports of all devices. 
> > /dev/infiniband/umad* could be a default (maybe set at configure time > or by varibale in makefile), but otherwise I really think *forcing* a > specific > udev scheme is a problem, since this makes udev scheme part of abi, > but not checkable by abiversion. > > mst You are correct about the udev scheme. I suggest that the ABI version will change on any modification (Roland?) Specifying dev seems like a bad idea. You will have to find what port is active, reverse map it by scanning /sys/class/infiniband_mad/*/{ibdev,port} and then find the correct umad dev file and pass its path to opensm. I prefer letting the library do that. That's what I have done and committed. Please check again. Shahar -------------- next part -------------- A non-text attachment was scrubbed... Name: patch-umad Type: application/octet-stream Size: 4477 bytes Desc: patch-umad URL: From roland.list at gmail.com Tue Dec 28 10:41:45 2004 From: roland.list at gmail.com (Roland Dreier) Date: Tue, 28 Dec 2004 10:41:45 -0800 Subject: [openib-general] Re: opensm failure In-Reply-To: References: Message-ID: > You are correct about the udev scheme. I suggest that the ABI version > will change on any modification (Roland?) The udev rules are not part of the kernel ABI. udev and indeed all dev/ naming issues are pure userspace and as such are totally out of the kernel's control. Eventually a standard location for umad* devices may emerge but for now I think making any programs that use the device nodes as flexible as possible is probably a good idea. - Roland From mst at mellanox.co.il Tue Dec 28 11:00:00 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 21:00:00 +0200 Subject: [openib-general] Re: opensm failure In-Reply-To: References: Message-ID: <20041228190000.GA27601@mellanox.co.il> Hello! Quoting r. shaharf (shaharf at voltaire.com) "RE: [openib-general] Re: opensm failure": > > > > > Oh. Is that a udev problem then? > > > > > > > > Yes. I Forgot to modify umad.c for the new udev scheme. I will do > it > > > > soon. > > > > > > > > Shahar > > > > > > Why cant I give the udev device path on a command line to opensm? > > > that would solve this once and for all, and let the user play with > > > udev names all he wants. > > > > What I had in mind was something like: > > > > ./opensm --dev /devinfiniband/umad0 > > > > to work on port 1 only > > > > and > > > > ./opensm --dev /dev/infiniband/umad* > > > > to work on all ports of all devices. > > > > /dev/infiniband/umad* could be a default (maybe set at configure time > > or by varibale in makefile), but otherwise I really think *forcing* a > > specific > > udev scheme is a problem, since this makes udev scheme part of abi, > > but not checkable by abiversion. > > > > mst > > You are correct about the udev scheme. I suggest that the ABI version > will change on any modification (Roland?) Can't be done. And the freedom to set rules is the whole idea of udev. > Specifying dev seems like a bad idea. You will have to find what port is > active, reverse map it by scanning > /sys/class/infiniband_mad/*/{ibdev,port} and then find the correct umad > dev file and pass its path to opensm. It's not that hard :) Since opensm now just needs all devices, it could be as simple as giving /dev/infiniband/umad* to opensm. And, if it becomes more relevant to have opensm on separate ports only, user shall be free to change the udev scheme to make port/device mapping more explicit. > I prefer letting the library do that. That's what I have done and > committed.
Please check again. > > Shahar > Works with this rule in udev.rules KERNEL="umad*", NAME="infiniband/%k" mst From roland.list at gmail.com Tue Dec 28 11:07:41 2004 From: roland.list at gmail.com (Roland Dreier) Date: Tue, 28 Dec 2004 11:07:41 -0800 Subject: [openib-general] [PATCH] Add support for querying port width/ speed In-Reply-To: <25AE7F432672D511B8DC00B0D0DF11DA029850E9@MTIEX01> References: <25AE7F432672D511B8DC00B0D0DF11DA029850E9@MTIEX01> Message-ID: > I think the choice of IPD is beyond the scope of the IB spec. The spec > provides a mechanism to regulate the injection rate but does not mandate its > use. Your assumption below (rounding up) makes perfect sense. Rounding down > would defeat the entire purpose of the IPD mechanism. Thanks. By the way, I realized that this is not a totally unrealistic situation -- the question arises for a 12X port injecting into a fabric with 4X DDR links, which should be possible in the not-too-distant future. - Roland From roland at topspin.com Tue Dec 28 11:22:35 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Dec 2004 11:22:35 -0800 Subject: [openib-general] Problem with userspace tree structure References: Message-ID: <523bxqut1g.fsf@topspin.com> I assume libmthca was moved to userspace/management by mistake. I moved it back to the top level userspace directory. Thanks, Roland From roland at topspin.com Tue Dec 28 11:48:13 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Dec 2004 11:48:13 -0800 Subject: [openib-general] Re: [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: <20041227225417.3ac7a0a6.davem@davemloft.net> (David S. Miller's message of "Mon, 27 Dec 2004 22:54:17 -0800") References: <200412272150.IBRnA4AvjendsF8x@topspin.com> <20041227225417.3ac7a0a6.davem@davemloft.net> Message-ID: <52pt0unr0i.fsf@topspin.com> David> W00t :-) All applied, thanks Roland. David> I'll run it through some build tests then toss it upstream. Very cool, thanks a lot. Let me know if you see any build failures -- I test on about 6 or 7 different archs/configs but the bug gods always seem to hide problems from me. Speaking of build failures, one of my test builds is cross-compiling for sparc64 with gcc 3.4.2, which adds __attribute__((warn_unused_result)) to copy_to_user() et al. The -Werror in the arch/sparc64 means the build fails with linux-2.6.10/arch/sparc64/kernel/sys_sparc32.c:1686: warning: ignoring return value of `copy_to_user', declared with attribute warn_unused_result Of course binfmt_elf.c and compat_ioctl.c still have issues but those probably get more visibility... Thanks, Roland Check copy_to_user() return value in sys_sparc32.c and sys_sunos32.c. 
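(Illustrative aside, not part of the patch that follows: this is roughly the gcc 3.4 behaviour being described above. A call whose result is ignored trips the warn_unused_result attribute, and sparc64's -Werror then turns the resulting warning into the build failure quoted in the message. The function name here is made up for the example.)

extern int must_check_copy(void) __attribute__((warn_unused_result));

int caller(void)
{
	must_check_copy();		/* warning: ignoring return value of 'must_check_copy' */
	if (must_check_copy())		/* no warning: the return value is actually used */
		return -1;
	return 0;
}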
Signed-off-by: Roland Dreier Index: linux-2.6.10/arch/sparc64/kernel/sys_sparc32.c =================================================================== --- linux-2.6.10.orig/arch/sparc64/kernel/sys_sparc32.c 2004-12-24 13:35:00.000000000 -0800 +++ linux-2.6.10/arch/sparc64/kernel/sys_sparc32.c 2004-12-28 11:46:00.190457463 -0800 @@ -1683,7 +1683,8 @@ put_user(oldlen, (u32 __user *)(unsigned long) tmp.oldlenp)) error = -EFAULT; } - copy_to_user(args->__unused, tmp.__unused, sizeof(tmp.__unused)); + if (copy_to_user(args->__unused, tmp.__unused, sizeof(tmp.__unused))) + error = -EFAULT; } return error; #endif Index: linux-2.6.10/arch/sparc64/kernel/sys_sunos32.c =================================================================== --- linux-2.6.10.orig/arch/sparc64/kernel/sys_sunos32.c 2004-12-24 13:35:00.000000000 -0800 +++ linux-2.6.10/arch/sparc64/kernel/sys_sunos32.c 2004-12-28 11:47:03.954923634 -0800 @@ -291,7 +291,8 @@ put_user(ino, &dirent->d_ino); put_user(namlen, &dirent->d_namlen); put_user(reclen, &dirent->d_reclen); - copy_to_user(dirent->d_name, name, namlen); + if (copy_to_user(dirent->d_name, name, namlen)) + return -EFAULT; put_user(0, dirent->d_name + namlen); dirent = (void __user *) dirent + reclen; buf->curr = dirent; @@ -371,7 +372,8 @@ put_user(ino, &dirent->d_ino); put_user(namlen, &dirent->d_namlen); put_user(reclen, &dirent->d_reclen); - copy_to_user(dirent->d_name, name, namlen); + if (copy_to_user(dirent->d_name, name, namlen)) + return -EFAULT; put_user(0, dirent->d_name + namlen); dirent = (void __user *) dirent + reclen; buf->curr = dirent; From mst at mellanox.co.il Tue Dec 28 11:53:28 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 28 Dec 2004 21:53:28 +0200 Subject: [openib-general] make.inc In-Reply-To: References: Message-ID: <20041228195328.GA27727@mellanox.co.il> I got tired of editing make.inc, and I made this patch to figure it out on the fly. Index: libcommon/Makefile =================================================================== --- libcommon/Makefile (revision 1408) +++ libcommon/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../make.inc +OPENIB_MANAGEMENT=.. +include ${OPENIB_MANAGEMENT}/make.inc SRCS:=$(wildcard *.c) LIB_OBJS:=$(SRCS:.c=.lo) Index: make.inc =================================================================== --- make.inc (revision 1408) +++ make.inc (working copy) @@ -2,9 +2,12 @@ # Openib usermode common make file variables # -# OPENIB_ROOT: root of OPIBIB src tree -OPENIB_ROOT=/home/openib/svn/gen2/trunk/src/ +OPENIB_ROOT:=$(shell pwd)/${OPENIB_MANAGEMENT}/../.. +#Uncomment to override the default above +# OPENIB_ROOT: root of OPENIB src tree +#OPENIB_ROOT=/usr/src/openib/src + # OPENIB_KERN: kenel used to compile modules OPENIB_KERN=$(OPENIB_ROOT)/linux-2.6 Index: diags/net/smpdump/Makefile =================================================================== --- diags/net/smpdump/Makefile (revision 1408) +++ diags/net/smpdump/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../../../make.inc +OPENIB_MANAGEMENT=../../.. +include ${OPENIB_MANAGEMENT}/make.inc BIN_OBJS=smpdump.o BIN_LIBS=$(OPENIB_USR_LIB)/libcommon.la $(OPENIB_USR_LIB)/libumad.la Index: diags/host/ibstat/Makefile =================================================================== --- diags/host/ibstat/Makefile (revision 1408) +++ diags/host/ibstat/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../../../make.inc +OPENIB_MANAGEMENT=../../.. 
+include ${OPENIB_MANAGEMENT}/make.inc BIN_OBJS=ibstat.o BIN_LIBS=$(OPENIB_USR_LIB)/libcommon.la $(OPENIB_USR_LIB)/libumad.la Index: diags/host/scripts/Makefile =================================================================== --- diags/host/scripts/Makefile (revision 1408) +++ diags/host/scripts/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../../../make.inc +OPENIB_MANAGEMENT=../../.. +include ${OPENIB_MANAGEMENT}/make.inc SCRIPT_TARGET=ibstatus Index: util/mad_test/Makefile =================================================================== --- util/mad_test/Makefile (revision 1408) +++ util/mad_test/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../../make.inc +OPENIB_MANAGEMENT=../.. +include ${OPENIB_MANAGEMENT}/make.inc BIN_OBJS=mad_test.o BIN_LIBS=$(OPENIB_USR_LIB)/libcommon.la Index: libumad/Makefile =================================================================== --- libumad/Makefile (revision 1408) +++ libumad/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../make.inc +OPENIB_MANAGEMENT=.. +include ${OPENIB_MANAGEMENT}/make.inc SRCS=$(wildcard *.c) LIB_OBJS=$(SRCS:.c=.lo) Index: osm/opensm/Makefile =================================================================== --- osm/opensm/Makefile (revision 1408) +++ osm/opensm/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../../make.inc +OPENIB_MANAGEMENT=../.. +include ${OPENIB_MANAGEMENT}/make.inc # We need to provide a filtered C source files to avoid inclussion # of the other (non selected) vendor sources: Index: osm/complib/Makefile =================================================================== --- osm/complib/Makefile (revision 1408) +++ osm/complib/Makefile (working copy) @@ -1,4 +1,5 @@ -include ../../make.inc +OPENIB_MANAGEMENT=../.. +include ${OPENIB_MANAGEMENT}/make.inc SRCS=$(wildcard *.c) LIB_OBJS=$(SRCS:.c=.lo) From roland at topspin.com Tue Dec 28 12:35:22 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Dec 2004 12:35:22 -0800 Subject: [openib-general] [PATCH] Add support for querying port width/speed In-Reply-To: <52fz1rj809.fsf@topspin.com> (Roland Dreier's message of "Mon, 27 Dec 2004 21:40:06 -0800") References: <52fz1rj809.fsf@topspin.com> Message-ID: <52652mnotx.fsf@topspin.com> Here's an updated patch that actually uses this information to set static rate in IPoIB. Seems to work for me... Any comments/suggestions, or does it seem OK to commit? - R. Index: infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- infiniband/ulp/ipoib/ipoib_main.c (revision 1408) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -294,10 +294,18 @@ struct ib_ah_attr av = { .dlid = be16_to_cpu(pathrec->dlid), .sl = pathrec->sl, - .static_rate = 0, .port_num = priv->port }; + if (ib_sa_rate_enum_to_int(pathrec->rate) > 0) + av.static_rate = (2 * priv->local_rate - + ib_sa_rate_enum_to_int(pathrec->rate) - 1) / + (priv->local_rate ? priv->local_rate : 1); + + ipoib_dbg(priv, "static_rate %d for local port %dX, path %dX\n", + av.static_rate, priv->local_rate, + ib_sa_rate_enum_to_int(pathrec->rate)); + ah = ipoib_create_ah(dev, priv->pd, &av); } Index: infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- infiniband/ulp/ipoib/ipoib_multicast.c (revision 1408) +++ infiniband/ulp/ipoib/ipoib_multicast.c (working copy) @@ -238,19 +238,10 @@ } { - /* - * For now we set static_rate to 0. 
This is not - * really correct: we should look at the rate - * component of the MC member record, compare it with - * the rate of our local port (calculated from the - * active link speed and link width) and set an - * inter-packet delay appropriately. - */ struct ib_ah_attr av = { .dlid = be16_to_cpu(mcast->mcmember.mlid), .port_num = priv->port, .sl = mcast->mcmember.sl, - .static_rate = 0, .ah_flags = IB_AH_GRH, .grh = { .flow_label = be32_to_cpu(mcast->mcmember.flow_label), @@ -262,6 +253,15 @@ av.grh.dgid = mcast->mcmember.mgid; + if (ib_sa_rate_enum_to_int(mcast->mcmember.rate) > 0) + av.static_rate = (2 * priv->local_rate - + ib_sa_rate_enum_to_int(mcast->mcmember.rate) - 1) / + (priv->local_rate ? priv->local_rate : 1); + + ipoib_dbg_mcast(priv, "static_rate %d for local port %dX, mcmember %dX\n", + av.static_rate, priv->local_rate, + ib_sa_rate_enum_to_int(mcast->mcmember.rate)); + mcast->ah = ipoib_create_ah(dev, priv->pd, &av); if (!mcast->ah) { ipoib_warn(priv, "ib_address_create failed\n"); @@ -506,6 +506,17 @@ else memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid)); + { + struct ib_port_attr attr; + + if (!ib_query_port(priv->ca, priv->port, &attr)) { + priv->local_lid = attr.lid; + priv->local_rate = attr.active_speed * + ib_width_enum_to_int(attr.active_width); + } else + ipoib_warn(priv, "ib_query_port failed\n"); + } + if (!priv->broadcast) { priv->broadcast = ipoib_mcast_alloc(dev, 1); if (!priv->broadcast) { @@ -554,15 +565,6 @@ return; } - { - struct ib_port_attr attr; - - if (!ib_query_port(priv->ca, priv->port, &attr)) - priv->local_lid = attr.lid; - else - ipoib_warn(priv, "ib_query_port failed\n"); - } - priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - IPOIB_ENCAP_LEN; dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); Index: infiniband/ulp/ipoib/ipoib.h =================================================================== --- infiniband/ulp/ipoib/ipoib.h (revision 1408) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -143,6 +143,7 @@ union ib_gid local_gid; u16 local_lid; + u8 local_rate; unsigned int admin_mtu; unsigned int mcast_mtu; Index: infiniband/include/ib_verbs.h =================================================================== --- infiniband/include/ib_verbs.h (revision 1408) +++ infiniband/include/ib_verbs.h (working copy) @@ -144,13 +144,6 @@ } } -enum ib_static_rate { - IB_STATIC_RATE_FULL = 0, - IB_STATIC_RATE_12X_TO_4X = 2, - IB_STATIC_RATE_4X_TO_1X = 3, - IB_STATIC_RATE_12X_TO_1X = 11 -}; - enum ib_port_state { IB_PORT_NOP = 0, IB_PORT_DOWN = 1, @@ -182,6 +175,24 @@ IB_PORT_BOOT_MGMT_SUP = (1<<9) }; +enum ib_port_width { + IB_WIDTH_1X = 1, + IB_WIDTH_4X = 2, + IB_WIDTH_8X = 4, + IB_WIDTH_12X = 8 +}; + +static inline int ib_width_enum_to_int(enum ib_port_width width) +{ + switch (width) { + case IB_WIDTH_1X: return 1; + case IB_WIDTH_4X: return 4; + case IB_WIDTH_8X: return 8; + case IB_WIDTH_12X: return 12; + default: return -1; + } +} + struct ib_port_attr { enum ib_port_state state; enum ib_mtu max_mtu; @@ -199,6 +210,8 @@ u8 sm_sl; u8 subnet_timeout; u8 init_type_reply; + u8 active_width; + u8 active_speed; }; enum ib_device_modify_flags { Index: infiniband/include/ib_sa.h =================================================================== --- infiniband/include/ib_sa.h (revision 1408) +++ infiniband/include/ib_sa.h (working copy) @@ -59,6 +59,34 @@ IB_SA_BEST = 3 }; +enum ib_sa_rate { + IB_SA_RATE_2_5_GBPS = 2, + IB_SA_RATE_5_GBPS = 5, + IB_SA_RATE_10_GBPS = 3, + IB_SA_RATE_20_GBPS = 6, + 
IB_SA_RATE_30_GBPS = 4, + IB_SA_RATE_40_GBPS = 7, + IB_SA_RATE_60_GBPS = 8, + IB_SA_RATE_80_GBPS = 9, + IB_SA_RATE_120_GBPS = 10 +}; + +static inline int ib_sa_rate_enum_to_int(enum ib_sa_rate rate) +{ + switch (rate) { + case IB_SA_RATE_2_5_GBPS: return 1; + case IB_SA_RATE_5_GBPS: return 2; + case IB_SA_RATE_10_GBPS: return 4; + case IB_SA_RATE_20_GBPS: return 8; + case IB_SA_RATE_30_GBPS: return 12; + case IB_SA_RATE_40_GBPS: return 16; + case IB_SA_RATE_60_GBPS: return 24; + case IB_SA_RATE_80_GBPS: return 32; + case IB_SA_RATE_120_GBPS: return 48; + default: return -1; + } +} + typedef u64 __bitwise ib_sa_comp_mask; #define IB_SA_COMP_MASK(n) ((__force ib_sa_comp_mask) cpu_to_be64(1ull << n)) Index: infiniband/core/sysfs.c =================================================================== --- infiniband/core/sysfs.c (revision 1408) +++ infiniband/core/sysfs.c (working copy) @@ -171,12 +171,41 @@ return sprintf(buf, "0x%08x\n", attr.port_cap_flags); } +static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused, + char *buf) +{ + struct ib_port_attr attr; + char *speed = ""; + int rate; + ssize_t ret; + + ret = ib_query_port(p->ibdev, p->port_num, &attr); + if (ret) + return ret; + + switch (attr.active_speed) { + case 2: speed = " DDR"; break; + case 4: speed = " QDR"; break; + } + + printk(KERN_ERR "width %d speed %d\n", attr.active_width, attr.active_speed); + + rate = 25 * ib_width_enum_to_int(attr.active_width) * attr.active_speed; + if (rate < 0) + return -EINVAL; + + return sprintf(buf, "%d%s Gb/sec (%dX%s)\n", + rate / 10, rate % 10 ? ".5" : "", + ib_width_enum_to_int(attr.active_width), speed); +} + static PORT_ATTR_RO(state); static PORT_ATTR_RO(lid); static PORT_ATTR_RO(lid_mask_count); static PORT_ATTR_RO(sm_lid); static PORT_ATTR_RO(sm_sl); static PORT_ATTR_RO(cap_mask); +static PORT_ATTR_RO(rate); static struct attribute *port_default_attrs[] = { &port_attr_state.attr, @@ -185,6 +214,7 @@ &port_attr_sm_lid.attr, &port_attr_sm_sl.attr, &port_attr_cap_mask.attr, + &port_attr_rate.attr, NULL }; Index: infiniband/hw/mthca/mthca_provider.c =================================================================== --- infiniband/hw/mthca/mthca_provider.c (revision 1408) +++ infiniband/hw/mthca/mthca_provider.c (working copy) @@ -115,14 +115,16 @@ } props->lid = be16_to_cpup((u16 *) (out_mad->data + 16)); - props->lmc = (*(u8 *) (out_mad->data + 34)) & 0x7; + props->lmc = out_mad->data[34] & 0x7; props->sm_lid = be16_to_cpup((u16 *) (out_mad->data + 18)); - props->sm_sl = (*(u8 *) (out_mad->data + 36)) & 0xf; - props->state = (*(u8 *) (out_mad->data + 32)) & 0xf; + props->sm_sl = out_mad->data[36] & 0xf; + props->state = out_mad->data[32] & 0xf; props->port_cap_flags = be32_to_cpup((u32 *) (out_mad->data + 20)); props->gid_tbl_len = to_mdev(ibdev)->limits.gid_table_len; props->pkey_tbl_len = to_mdev(ibdev)->limits.pkey_table_len; props->qkey_viol_cntr = be16_to_cpup((u16 *) (out_mad->data + 48)); + props->active_width = out_mad->data[31] & 0xf; + props->active_speed = out_mad->data[35] >> 4; out: kfree(in_mad); Index: docs/sysfs.txt =================================================================== --- docs/sysfs.txt (revision 1408) +++ docs/sysfs.txt (working copy) @@ -21,6 +21,7 @@ cap_mask - Port capability mask lid - Port LID lid_mask_count - Port LID mask count + rate - Port data rate (active width * active speed) sm_lid - Subnet manager LID for port's subnet sm_sl - Subnet manager SL for port's subnet state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER) 
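As a worked example (illustrative; the numbers follow directly from rate_show() above, which computes 25 * width * speed in units of 0.1 Gb/sec): a plain 4X SDR port (active_width 4X, active_speed 1) gives 25 * 4 * 1 = 100 and is displayed as "10 Gb/sec (4X)", while a 4X DDR port (active_speed 2) gives 200 and would be displayed as "20 Gb/sec (4X DDR)". The new rate attribute appears alongside the other per-port files listed in docs/sysfs.txt above.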
From davem at davemloft.net Tue Dec 28 14:17:10 2004 From: davem at davemloft.net (David S. Miller) Date: Tue, 28 Dec 2004 14:17:10 -0800 Subject: [openib-general] Re: [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: <52pt0unr0i.fsf@topspin.com> References: <200412272150.IBRnA4AvjendsF8x@topspin.com> <20041227225417.3ac7a0a6.davem@davemloft.net> <52pt0unr0i.fsf@topspin.com> Message-ID: <20041228141710.4daebcfb.davem@davemloft.net> On Tue, 28 Dec 2004 11:48:13 -0800 Roland Dreier wrote: > Speaking of build failures, one of my test builds is cross-compiling > for sparc64 with gcc 3.4.2, which adds __attribute__((warn_unused_result)) > to copy_to_user() et al. The -Werror in the arch/sparc64 means the > build fails with Thanks, I'll check that out. I believe that you didn't test the sparc64 build of the infiniband stuff because arch/sparc64/Kconfig needs to explicitly include the infiniband Kconfig since it does not use drivers/Kconfig. You didn't send me any such changes. There are a few platforms which also are in this situation. I added the sparc64 one to my tree while integrating your changes, but the others need to be attended to if you wish infiniband to be configurable on them. From roland at topspin.com Tue Dec 28 15:24:43 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Dec 2004 15:24:43 -0800 Subject: [openib-general] Re: [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: <20041228141710.4daebcfb.davem@davemloft.net> (David S. Miller's message of "Tue, 28 Dec 2004 14:17:10 -0800") References: <200412272150.IBRnA4AvjendsF8x@topspin.com> <20041227225417.3ac7a0a6.davem@davemloft.net> <52pt0unr0i.fsf@topspin.com> <20041228141710.4daebcfb.davem@davemloft.net> Message-ID: <52pt0uhupw.fsf@topspin.com> David> I believe that you didn't test the sparc64 build of the David> infiniband stuff because arch/sparc64/Kconfig needs to David> explicitly include the infiniband Kconfig since it does not David> use drivers/Kconfig. You didn't send me any such changes. Actually I did test the build (and Tom Duffy at Sun has actually run the drivers on his system), but I forgot to include the required Kconfig change -- I just have it in my local test tree. David> There are a few platforms which also are in this situation. David> I added the sparc64 one to my tree while integrating your David> changes, but the others need to be attended to if you wish David> infiniband to be configurable on them. I think sparc64 is the only such platform where InfiniBand is likely to be of much interest. However I'll check out all of arch/ and send patches to hook up drivers/infiniband/ to the relevant maintainers once IB makes it upstream. Thanks, Roland From halr at voltaire.com Tue Dec 28 15:59:41 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Dec 2004 18:59:41 -0500 Subject: [openib-general] Some CM Comments In-Reply-To: <523bxqjsj9.fsf@topspin.com> References: <1104238992.4054.3823.camel@localhost.localdomain> <523bxqjsj9.fsf@topspin.com> Message-ID: <1104278381.4054.5322.camel@localhost.localdomain> On Tue, 2004-12-28 at 11:28, Roland Dreier wrote: > Hal> 4. Should there be an API to obtain local/remote comm IDs ? > > What would the consumer do with the comm ID? This would be for debug purposes (correlating host state, etc. with CM messages on the wire). 
-- Hal From shaeffer at neuralscape.com Tue Dec 28 17:28:17 2004 From: shaeffer at neuralscape.com (Karen Shaeffer) Date: Tue, 28 Dec 2004 17:28:17 -0800 Subject: [openib-general] Re: [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: <52pt0uhupw.fsf@topspin.com> References: <200412272150.IBRnA4AvjendsF8x@topspin.com> <20041227225417.3ac7a0a6.davem@davemloft.net> <52pt0unr0i.fsf@topspin.com> <20041228141710.4daebcfb.davem@davemloft.net> <52pt0uhupw.fsf@topspin.com> Message-ID: <20041229012817.GA18863@synapse.neuralscape.com> On Tue, Dec 28, 2004 at 03:24:43PM -0800, Roland Dreier wrote: > > I think sparc64 is the only such platform where InfiniBand is likely > to be of much interest. However I'll check out all of arch/ and send > patches to hook up drivers/infiniband/ to the relevant maintainers > once IB makes it upstream. > Hi Roland, I am interested in Infiniband with x86_64 Opterons. Thanks, Karen -- Karen Shaeffer Neuralscape, Palo Alto, Ca. 94306 shaeffer at neuralscape.com http://www.neuralscape.com From roland at topspin.com Tue Dec 28 17:36:12 2004 From: roland at topspin.com (Roland Dreier) Date: Tue, 28 Dec 2004 17:36:12 -0800 Subject: [openib-general] Re: [PATCH][v5][0/24] Latest IB patch queue In-Reply-To: <20041229012817.GA18863@synapse.neuralscape.com> (Karen Shaeffer's message of "Tue, 28 Dec 2004 17:28:17 -0800") References: <200412272150.IBRnA4AvjendsF8x@topspin.com> <20041227225417.3ac7a0a6.davem@davemloft.net> <52pt0unr0i.fsf@topspin.com> <20041228141710.4daebcfb.davem@davemloft.net> <52pt0uhupw.fsf@topspin.com> <20041229012817.GA18863@synapse.neuralscape.com> Message-ID: <52hdm5j377.fsf@topspin.com> Roland> I think sparc64 is the only such platform where InfiniBand Roland> is likely to be of much interest. However I'll check out Roland> all of arch/ and send patches to hook up Roland> drivers/infiniband/ to the relevant maintainers once IB Roland> makes it upstream. Karen> I am interested in Infiniband with x86_64 Opterons. OK, the current code should work well for you -- x86_64 is probably the most-tested architecture. "such platform[s]" in my comment above referred to architectures where arch/xxx/Kconfig does _not_ include drivers/Kconfig; arch/x86_64/Kconfig does include that file. So no change is required to use the current IB patches on x86_64. I believe the only architectures that both support PCI and do not include drivers/Kconfig in their arch Kconfig are arm, sparc, sparc64 and v850. Perhaps I'm wrong, but of those four architectures, sparc64 seems to be the only one where there would be any interest in using IB. - Roland From boas1 at llnl.gov Tue Dec 28 17:36:59 2004 From: boas1 at llnl.gov (Bill Boas) Date: Tue, 28 Dec 2004 17:36:59 -0800 Subject: [openib-general] OpenIB Development Workshop Feb 6-8 2005, Sonoma CA Message-ID: <6.1.2.0.2.20041228164304.0c07dfc8@mail-lc.llnl.gov> The OpenIB Alliance is holding its first Development Workshop at The Lodge at Sonoma from Febrary 6-8, 2005. The program will start at 6 PM Sunday evening and end by 5PM on the 8th. Everyone interested in the work of the OpenIB Alliance is invited. Please set aside these dates. We apologize for the short notice. More details are forthcoming. 
The program will have these major elements: * developers describing the kernel-level software submitted to kernel.org and its current status; * developers discussing the proposed architecture and plans for additional modules to complete a full Lunux open software stack for enterprise and high performance computing deployments including underlying capabilities for Infiniband and other RDMA architectures such as IT-API, DAT, and iWARP; * customers, vendors and developers discussing their requirements and road map for the OpenIB software development, quality assurance and distribution beginning in 2005 for subsequent inclusion in Linux distributions and integrated into products and systems such as servers, blades, clusters and data center infrastructures; * the DOE Labs and their partners describing the OpenIB PathForward project - http://www.llnl.gov/pao/news/news_releases/2004/NR-04-11-03.html; * OpenIB Alliance membership Annual General and Board meetings. The Alliance web site, will be set up for workshop registration, provide more details of the program, enable room reservations at the Lodge, costs, etc. Please do not contact the Lodge directly. Please stay tuned to www.openib.org ------------------------------------------------------------------ Bill Boas bboas at llnl.gov ICCD LLNL, B-113, R-2018 Wk: 925-422-4110 7000 East Ave Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From shaharf at voltaire.com Wed Dec 29 02:56:11 2004 From: shaharf at voltaire.com (shaharf) Date: Wed, 29 Dec 2004 12:56:11 +0200 Subject: [openib-general] Problem with userspace tree structure Message-ID: > > I assume libmthca was moved to userspace/management by mistake. I > moved it back to the top level userspace directory. > > Thanks, > Roland Oops, sorry. From mst at mellanox.co.il Wed Dec 29 05:43:51 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 29 Dec 2004 15:43:51 +0200 Subject: [openib-general] ip over ib throughtput Message-ID: <20041229134351.GA3486@mellanox.co.il> Hi! What kind of performance do people see with ip over ib on gen2? I see about 100Mbyte/sec at 99% CPU utilisation on send, on an express card, Xeon 2.8GHz, SSE doorbells enabled. MST From mst at mellanox.co.il Wed Dec 29 05:53:47 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 29 Dec 2004 15:53:47 +0200 Subject: [openib-general] ip over ib throughtput In-Reply-To: <20041229134351.GA3486@mellanox.co.il> References: <20041229134351.GA3486@mellanox.co.il> Message-ID: <20041229135347.GB3486@mellanox.co.il> Hello! Quoting r. Michael S. Tsirkin (mst at mellanox.co.il) "[openib-general] ip over ib throughtput": > Hi! > What kind of performance do people see with ip over ib on gen2? > I see about 100Mbyte/sec at 99% CPU utilisation on send, > on an express card, Xeon 2.8GHz, SSE doorbells enabled. At 159 Mbyte/sec without SSE doorbells. The test is netperf-2.2pl5, BTW. mst From halr at voltaire.com Wed Dec 29 09:12:30 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2004 12:12:30 -0500 Subject: [openib-general] [PATCH] Add support for querying port width/speed In-Reply-To: <528y7jj6l5.fsf@topspin.com> References: <52fz1rj809.fsf@topspin.com> <528y7jj6l5.fsf@topspin.com> Message-ID: <1104340350.4054.6458.camel@localhost.localdomain> On Tue, 2004-12-28 at 01:10, Roland Dreier wrote: > By, the way does anyone know what IPD is supposed to be used when the > injection port's rate is not a multiple of the rate for the path? 
For > example, suppose an 8X DDR port is sending to a 12X path -- 12 doesn't > divide 16 evenly. > > I would guess that the correct thing to do is round up, so 8X DDR > -> 12X uses an IPD of 1, 8X QDR -> 12X uses an IPD of 2, etc. Yes, rounding up causes the time to wait before scheduling the next packet to be larger which slows the injection rate (which is what you would want). There is also the recommendation on the HCA that if the IPD value is not supported then to use the next larger one supported (IBA 1.2 informative text in paragraph after C9-226). This might result in a slightly larger slowdown than required for some cases. To do the optimal thing here, you might need to determine which optional IPDs are supported. However, I'm not sure if there is a way to determine the optional IPDs supported by the HCA. -- Hal From mshefty at ichips.intel.com Wed Dec 29 11:25:43 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Dec 2004 11:25:43 -0800 Subject: [openib-general] Opensm - retries added In-Reply-To: References: Message-ID: <41D304B7.9030500@ichips.intel.com> shaharf wrote: > I added very simple retries mechanism to the SM. It is not > really checked, because I don't have a controlled way to loose packets. You can simulate this by modifying validate_mad in mad.c to return occasionally. E.g. declare a local static variable as a counter and return 0 whenever variable mod drop rate = 0. - Sean From mshefty at ichips.intel.com Wed Dec 29 11:28:52 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Dec 2004 11:28:52 -0800 Subject: [openib-general] [PATCH] Add support for querying port width/speed In-Reply-To: <52652mnotx.fsf@topspin.com> References: <52fz1rj809.fsf@topspin.com> <52652mnotx.fsf@topspin.com> Message-ID: <41D30574.6010000@ichips.intel.com> Roland Dreier wrote: > Here's an updated patch that actually uses this information to set > static rate in IPoIB. Seems to work for me... > > Any comments/suggestions, or does it seem OK to commit? Looks fine to commit by me. - Sean From mshefty at ichips.intel.com Wed Dec 29 11:58:00 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Dec 2004 11:58:00 -0800 Subject: [openib-general] Some CM Comments In-Reply-To: <1104278381.4054.5322.camel@localhost.localdomain> References: <1104238992.4054.3823.camel@localhost.localdomain> <523bxqjsj9.fsf@topspin.com> <1104278381.4054.5322.camel@localhost.localdomain> Message-ID: <41D30C48.8070707@ichips.intel.com> Hal Rosenstock wrote: > On Tue, 2004-12-28 at 11:28, Roland Dreier wrote: > >> Hal> 4. Should there be an API to obtain local/remote comm IDs ? >> >>What would the consumer do with the comm ID? > > > This would be for debug purposes (correlating host state, etc. with CM > messages on the wire). I don't think that we should add an API for this purpose. We could add this to the ib_cm_id structure if it's useful outside of the CM. - Sean From mshefty at ichips.intel.com Wed Dec 29 12:25:03 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Dec 2004 12:25:03 -0800 Subject: [openib-general] Re: Some CM Comments In-Reply-To: <1104238992.4054.3823.camel@localhost.localdomain> References: <1104238992.4054.3823.camel@localhost.localdomain> Message-ID: <41D3129F.7060209@ichips.intel.com> Hal Rosenstock wrote: > 2. The following CM states appear to be missing: > peer compare > reject retry > timeout > DREQ timeout (is an event) It is also a state in the spec > RTU timeout (nor is it an event) > 3. There is no REJ received CM event type. 
Shouldn't there be ? I'll add in needed CM states when I get to the relevant portions of the code. I may have left out some of the states, since they are temporary and would transition the connection object immediately to a new state. E.g. I don't think that we really need a peer compare state. > 5. REQ message is missing transport service type (connection type). > Also, what about ack timeout ? Also, SRQ. Is starting PSN chosen > internally by the CM ? The CM can get the transport service type from the QP type. I have a note to add in SRQ based on our previous discussions, but we need to add in verb support as well. I need to think about the PSN. The Topspin CM generated the PSN internal to the CM, but I'm not certain that it makes sense to still do this. > Is there a danger to have the client set the CM response timeout ? Also, > is the lack of remote response timeout for LAP in inconsistent with this > ? Since the client is responsible for generating the response, I thought that it made more sense for it to set that value. We could have the CM set a reasonable upper bound to what a client can provide. The lack of a remote response timeout for LAP does seem inconsistent. It looks like the code could use the value from the REQ, but it may be better to expose this field. Thanks for the feedback. Once there's more code, most of these issues should get flushed out. - Sean From roland at topspin.com Wed Dec 29 13:31:03 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Dec 2004 13:31:03 -0800 Subject: [openib-general] w00t Message-ID: <52ekh8g5bc.fsf@topspin.com> In case you aren't a junkie doing bk pulls every 15 minutes, I'm extremely happy to say that Linus's tree now contains a drivers/infiniband directory! - R. From libor at topspin.com Wed Dec 29 13:47:45 2004 From: libor at topspin.com (Libor Michalek) Date: Wed, 29 Dec 2004 13:47:45 -0800 Subject: [openib-general] w00t In-Reply-To: <52ekh8g5bc.fsf@topspin.com>; from roland@topspin.com on Wed, Dec 29, 2004 at 01:31:03PM -0800 References: <52ekh8g5bc.fsf@topspin.com> Message-ID: <20041229134745.A13183@topspin.com> On Wed, Dec 29, 2004 at 01:31:03PM -0800, Roland Dreier wrote: > In case you aren't a junkie doing bk pulls every 15 minutes, I'm > extremely happy to say that Linus's tree now contains a > drivers/infiniband directory! Nice. Any thoughts on what's going to be the procedure for submitting fixes that are accepted into the OpenIB source base to LKML? -Libor From roland at topspin.com Wed Dec 29 14:00:19 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Dec 2004 14:00:19 -0800 Subject: [openib-general] w00t In-Reply-To: <20041229134745.A13183@topspin.com> (Libor Michalek's message of "Wed, 29 Dec 2004 13:47:45 -0800") References: <52ekh8g5bc.fsf@topspin.com> <20041229134745.A13183@topspin.com> Message-ID: <521xd8g3yk.fsf@topspin.com> Libor> Nice. Any thoughts on what's going to be the procedure Libor> for submitting fixes that are accepted into the OpenIB Libor> source base to LKML? I'm planning on pulling patches out of subversion and submitting them upstream on a semi-regular basis. - R. 
From mshefty at ichips.intel.com Wed Dec 29 15:37:11 2004 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 29 Dec 2004 15:37:11 -0800 Subject: [openib-general] w00t In-Reply-To: <52ekh8g5bc.fsf@topspin.com> References: <52ekh8g5bc.fsf@topspin.com> Message-ID: <41D33FA7.1010906@ichips.intel.com> Roland Dreier wrote: > In case you aren't a junkie doing bk pulls every 15 minutes, I'm > extremely happy to say that Linus's tree now contains a > drivers/infiniband directory! Excellent job! From boas1 at llnl.gov Wed Dec 29 18:04:44 2004 From: boas1 at llnl.gov (Bill Boas) Date: Wed, 29 Dec 2004 18:04:44 -0800 Subject: [openib-general] w00t In-Reply-To: <52ekh8g5bc.fsf@topspin.com> References: <52ekh8g5bc.fsf@topspin.com> Message-ID: <6.1.2.0.2.20041229180304.01f1af70@mail-lc.llnl.gov> Roland and those working with you, Great news. Thank you for all your work on OpenIB. Bill. At 01:31 PM 12/29/2004, Roland Dreier wrote: >In case you aren't a junkie doing bk pulls every 15 minutes, I'm >extremely happy to say that Linus's tree now contains a >drivers/infiniband directory! > > - R. >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general Bill Boas bboas at llnl.gov ICCD LLNL, B-113, R-2018 Wk: 925-422-4110 7000 East Ave Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 From roland at topspin.com Wed Dec 29 21:14:02 2004 From: roland at topspin.com (Roland Dreier) Date: Wed, 29 Dec 2004 21:14:02 -0800 Subject: [openib-general] InfiniBand MAD device number request Message-ID: <52brcce5b9.fsf@topspin.com> The InfiniBand drivers just merged into the 2.6 tree add a character device for userspace to send and receive InfiniBand "management datagrams" (MADs) for each port of each InfiniBand device. Current documentation recommends that these character devices be named /dev/infiniband/umad0, /dev/infiniband/umad1, etc. Please assign a major number for these devices. Thanks, Roland Dreier OpenIB Alliance www.openib.org From tziporet at mellanox.co.il Wed Dec 29 23:50:38 2004 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 30 Dec 2004 09:50:38 +0200 Subject: [openib-general] w00t Message-ID: <506C3D7B14CDD411A52C00025558DED6064BECAB@mtlex01.yok.mtl.com> These are great news. Thanks to all developers involved and mainly to Roland. Tziporet -----Original Message----- From: Roland Dreier [mailto:roland at topspin.com] Sent: Wednesday, December 29, 2004 11:31 PM To: openib-general at openib.org Subject: [openib-general] w00t In case you aren't a junkie doing bk pulls every 15 minutes, I'm extremely happy to say that Linus's tree now contains a drivers/infiniband directory! - R. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From shaharf at voltaire.com Thu Dec 30 01:03:34 2004 From: shaharf at voltaire.com (shaharf) Date: Thu, 30 Dec 2004 11:03:34 +0200 Subject: [openib-general] Opensm - retries added Message-ID: > shaharf wrote: > > I added very simple retries mechanism to the SM. It is not > > really checked, because I don't have a controlled way to loose packets. 
> > You can simulate this by modifying validate_mad in mad.c to return > occasionally. E.g. declare a local static variable as a counter and > return 0 whenever variable mod drop rate = 0. > > - Sean Thanks. I will do that. Shahar From mst at mellanox.co.il Thu Dec 30 08:46:08 2004 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 30 Dec 2004 18:46:08 +0200 Subject: [openib-general] RFC: process_mad extension Message-ID: <20041230164608.GA12143@mellanox.co.il> Hello! ib_verbs.h has: int (*process_mad)(struct ib_device *device, int process_mad_flags, u8 port_num, u16 source_lid, struct ib_mad *in_mad, struct ib_mad *out_mad); This function serves both GSI and SMA MADs. It is used internally by the ib core, and I think its normally not used by ULPs. process_mad_flags just says whether to enable the MKey check. The Mkey check is enabled for incoming mad packets that are being returned to the HCA. When enabled, the HCA is expected to generate mkey violation trap when appropriate. I see some issues with this interface: 1. I think for GSI MADs, the HCA (even Tavor) needs more information to generate traps in case of errors. Consider for example Trap 259 - Bkey violation. Per IB spec 1.2, the trap has the following payload. Bad B_Key, from // attempted with and However, the GIDADDR/QP is never passed to process_mad and so is not available to the HCA. 2. For BMA MADs, it seems an additional flag would be needed to enable/ disable the bkey check for the MAD, same as we do for the mkey? 3. The Arbel memfree (aka native) mode hardware needs the Grh, and some bits from the completion of the incoming MAD, to process the MAD. Again, this information is not passed to process_mad. I think it makes sence to address these issues sooner rather than later :) I propose extending the interface in the following way: int (*process_mad)(struct ib_device *device, int process_mad_flags, u8 port_num, struct ib_wc* in_wc, struct ib_grh* in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad); I think its in some sence nicer than passing the slid directly since when sm builds packets (with mkey check disabled) slid may not be valid at all. The new parameters can be NULL in this case, then no trap can be generated. If you guys agree, I'll prepare a patch fixing the core and mthca. MST From roland at topspin.com Thu Dec 30 09:09:36 2004 From: roland at topspin.com (Roland Dreier) Date: Thu, 30 Dec 2004 09:09:36 -0800 Subject: [openib-general] RFC: process_mad extension In-Reply-To: <20041230164608.GA12143@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 30 Dec 2004 18:46:08 +0200") References: <20041230164608.GA12143@mellanox.co.il> Message-ID: <52652jd86n.fsf@topspin.com> Michael> 3. The Arbel memfree (aka native) mode hardware needs the Michael> Grh, and some bits from the completion of the incoming Michael> MAD, to process the MAD. Again, this information is not Michael> passed to process_mad. I'm looking at version 0.60 of the memfree PRM. I don't see any requirement for this information if the "extended info" flag isn't set when calling MAD_IFC. Is the PRM inaccurate? Michael> If you guys agree, I'll prepare a patch fixing the core Michael> and mthca. This seems good to me. Fixing this up was on my list of things to do but I never got to it yet. 
- Roland From tduffy at sun.com Thu Dec 30 21:28:57 2004 From: tduffy at sun.com (Tom Duffy) Date: Thu, 30 Dec 2004 21:28:57 -0800 Subject: [openib-general] w00t In-Reply-To: <52ekh8g5bc.fsf@topspin.com> References: <52ekh8g5bc.fsf@topspin.com> Message-ID: <1104470937.17172.3.camel@duffman> On Wed, 2004-12-29 at 13:31 -0800, Roland Dreier wrote: > In case you aren't a junkie doing bk pulls every 15 minutes, I'm > extremely happy to say that Linus's tree now contains a > drivers/infiniband directory! This is REALLY good news. Congrats to all. Have a good new year. -tduffy -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Fri Dec 31 09:56:28 2004 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Dec 2004 12:56:28 -0500 Subject: [openib-general] tvflash HCA numbering with multiple HCAs Message-ID: <1104515787.4662.170.camel@localhost.localdomain> Hi, Has flashing HCAs when there are multiple HCAs in the host been tried ? There seems to be some confusion on the HCA numbering in tvflash as shown by the GUIDs below. I needed to tvflash -h 1 to update mthca0. -- Hal tvflash -v /home/mellanox/fw-23108-a1-rel.bin New Node GUID = 005442b100004900 New Port1 GUID = 005442b100004901 New Port2 GUID = 005442b100004902 Programming HCA Microcode... Flash Image Size = 342956 Erasing [==================================================================] Writing [==================================================================] Verifying [==================================================================] Flash verify passed! CA 'mthca0': CA type: MT23108 Numer of ports: 2 Firmware version: 3.0.0 Hardware version: a1 Node GUID: 0x0008f10403961354 System image GUID: 0x0008f10403961357