From halr at voltaire.com Sat Oct 1 04:32:35 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Oct 2005 07:32:35 -0400
Subject: [openib-general] [PATCH] OpenSM: osm_port_info_rcv.c::__osm_pi_rcv_process_router_port Fix router port handling
Message-ID: <1128166129.4401.1202.camel@hal.voltaire.com>

OpenSM: osm_port_info_rcv.c::__osm_pi_rcv_process_router_port
Fix router port handling

Signed-off-by: Hal Rosenstock

Index: osm_port_info_rcv.c
===================================================================
--- osm_port_info_rcv.c (revision 3623)
+++ osm_port_info_rcv.c (working copy)
@@ -411,6 +411,8 @@ __osm_pi_rcv_process_router_port(
 "Invalid base LID 0x%x corrected.\n",
 cl_ntoh16 ( orig_lid) );
+ __osm_pi_rcv_process_endport(p_rcv, p_physp, p_pi);
+
 OSM_LOG_EXIT( p_rcv->p_log );
 }

From rolandd at cisco.com Sat Oct 1 13:05:27 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Sat, 01 Oct 2005 13:05:27 -0700
Subject: [openib-general] Re: [PATCH] [mthca]: fixed fields in query_port
In-Reply-To: <20050928134107.GA23849@mellanox.co.il> (Jack Morgenstein's message of "Wed, 28 Sep 2005 16:41:07 +0300")
References: <20050928134107.GA23849@mellanox.co.il>
Message-ID: <52u0g1c8ag.fsf@cisco.com>

Thanks, applied and queued for 2.6.15.

I left out the max_vl_num part of the patch, because it doesn't make
sense to me to fill in the field and then later change the meaning of
the field.

In fact is there any reason to have the max_vl_num field be returned
from the query_port method? I don't see anything sensible a consumer
can do with the value, and I would think consumers should just be
using service levels rather than worrying about the next hop VL. So
maybe we should just delete the field entirely.

 - R.

From sean.hefty at intel.com Sat Oct 1 16:14:12 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Sat, 1 Oct 2005 16:14:12 -0700
Subject: [openib-general] Re: [RFC] IB address translation using ARP
In-Reply-To: <20050930081346.GB31930@mellanox.co.il>
Message-ID:

>I suspect the CM related part can't be easily shared between SDP and CMA,
>since the CM REQ format and the service record format for SDP are already
>set in stone, and are very SDP-specific.

I've given this some more thought, and I think that it makes sense for the
CMA to provide support for SDP, iSER, kDAPL, etc. to the extent that it can.
This requires the CMA to:

* send CM REQ private data using different formats
* know how to interpret received CM REQ private data
* map listen requests to service IDs correctly

One solution is to make the CMA protocol aware to some degree. Clients can
specify a protocol when binding a cma_id to a particular address.

In the simplest case, a user can tell the CMA to simply pass through all
private data. On the passive side, this means that the CMA does not provide
source address information. Apps must either extract the source information
from the private data themselves, or through some other means, such as ATS.

However, this doesn't help map connection or listen requests to IB service IDs.
And I'm not familiar with how SDP, iSER, kDAPL perform their mappings to know
if the CMA could do this without knowing being protocol aware. If this is the
case, then it makes sense to give the CMA some knowledge of the CM REQ private
data format.

- Sean

From sean.hefty at intel.com Sat Oct 1 16:18:30 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Sat, 1 Oct 2005 16:18:30 -0700
Subject: [openib-general] Re: [RFC] IB address translation using ARP
In-Reply-To:
Message-ID:

>However, this doesn't help map connection or listen requests to IB service IDs.
>And I'm not familiar with how SDP, iSER, kDAPL perform their mappings to know if
>the CMA could do this without knowing being protocol aware. If this is the

Er... how about "without being protocol aware" as opposed to "knowing being..."

From jackm at mellanox.co.il Sun Oct 2 00:30:25 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 2 Oct 2005 09:30:25 +0200
Subject: [openib-general] Re: [PATCH] [mthca]: fixed fields in query_port
In-Reply-To: <52u0g1c8ag.fsf@cisco.com>
References: <52u0g1c8ag.fsf@cisco.com>
Message-ID: <20051002073024.GA9873@mellanox.co.il>

On Sat, Oct 01, 2005 at 11:05:27PM +0300, Roland Dreier wrote:
> In fact is there any reason to have the max_vl_num field be returned
> from the query_port method? I don't see anything sensible a consumer
> can do with the value, and I would think consumers should just be
> using service levels rather than worrying about the next hop VL. So
> maybe we should just delete the field entirely.
>

I agree. That value is only of interest to the SM, for use in SL-to-VL mapping
(IB Spec 3.5.7) -- and the SM obtains this value via a MAD query. Applications
should use the SL field in packets for specifying a QoS (in the future)-- and
should not even be aware of VL's.

Anyone else have an opinion?

Jack

From jackm at mellanox.co.il Sun Oct 2 02:17:38 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 2 Oct 2005 11:17:38 +0200
Subject: [openib-general] [PATCH] mthca: when creating a cq, check that requested cqes does not exceed HCA max
Message-ID: <20051002091738.GB9873@mellanox.co.il>

Return an error if requested number of cq entries exceeds HCA max
(IB Spec 11.2.6.1).

Signed-off-by: Jack Morgenstein

Index: linux-kernel/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy)
@@ -134,6 +134,7 @@
 int num_eecs;
 int reserved_eecs;
 int num_cqs;
+ int max_cqes;
 int reserved_cqs;
 int num_eqs;
 int reserved_eqs;
Index: linux-kernel/infiniband/hw/mthca/mthca_main.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_main.c (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_main.c (working copy)
@@ -173,6 +173,7 @@
 mdev->limits.reserved_pds = dev_lim->reserved_pds;
 mdev->limits.port_width_cap = dev_lim->max_port_width;
 mdev->limits.flags = dev_lim->flags;
+ mdev->limits.max_cqes = 0xffff; /* driver override */

 /* IB_DEVICE_RESIZE_MAX_WR not supported by driver.
    May be doable since hardware supports it for SRQ.
Index: linux-kernel/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_provider.c (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_provider.c (working copy)
@@ -93,7 +93,7 @@
 props->max_qp_wr = 0xffff;
 props->max_sge = mdev->limits.max_sg;
 props->max_cq = mdev->limits.num_cqs - mdev->limits.reserved_cqs;
- props->max_cqe = 0xffff;
+ props->max_cqe = mdev->limits.max_cqes;
 props->max_mr = mdev->limits.num_mpts - mdev->limits.reserved_mrws;
 props->max_pd = mdev->limits.num_pds - mdev->limits.reserved_pds;
 props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift;
@@ -639,7 +639,11 @@
 struct mthca_cq *cq;
 int nent;
 int err;
+ struct mthca_dev* mdev = to_mdev(ibdev);

+ if (mdev->limits.max_cqes < entries || entries < 0)
+     return ERR_PTR(-EINVAL);
+
 if (context) {
     if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
         return ERR_PTR(-EFAULT);

From jackm at mellanox.co.il Sun Oct 2 06:25:52 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 2 Oct 2005 15:25:52 +0200
Subject: [openib-general] [PATCH] mthca: check for illegal acl when registering an mr
Message-ID: <20051002132552.GC9873@mellanox.co.il>

Now check in kernel space for illegal combination of acl parameters
(per IB Spec 11.2.8.2).

Signed-off-by: Jack Morgenstein

Index: linux-kernel/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_provider.c (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_provider.c (working copy)
@@ -860,6 +860,10 @@
 int i, j, k;
 int err = 0;

+ if (acc & (IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_REMOTE_WRITE) &&
+     !(acc & IB_ACCESS_LOCAL_WRITE))
+     return ERR_PTR(-EINVAL);
+
 shift = ffs(region->page_size) - 1;

 mr = kmalloc(sizeof *mr, GFP_KERNEL);

From jackm at mellanox.co.il Sun Oct 2 07:10:44 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 2 Oct 2005 16:10:44 +0200
Subject: [openib-general] [PATCH] mthca: fixes pkey_ix processing in mthca_modify_qp
Message-ID: <20051002141043.GD9873@mellanox.co.il>

Problem: When pkey-index provided > pkey_table_size, the pkey index used
in sending packets is pkey_index % pkey_table_size (64 for Mellanox HCAs).

Signed-off-by: Jack Morgenstein

Index: linux-kernel/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_qp.c (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_qp.c (working copy)
@@ -585,6 +585,13 @@
 IB_QP_STATE));
 return -EINVAL;
 }
+
+ if ((attr_mask & IB_QP_PKEY_INDEX) &&
+     attr->pkey_index >= dev->limits.pkey_table_len) {
+     mthca_dbg(dev, "PKey index (%u) too large. max is %d\n",
+               attr->pkey_index,dev->limits.pkey_table_len-1);
+     return -EINVAL;
+ }

 mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL);
 if (IS_ERR(mailbox))

From jackm at mellanox.co.il Sun Oct 2 08:12:28 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 2 Oct 2005 17:12:28 +0200
Subject: [openib-general] [PATCH] mthca: check that QP is not already a member of a MCG before attach
Message-ID: <20051002151228.GE9873@mellanox.co.il>

The patch below avoids entering a QP as member of a multicast group
multiple times.

Signed-off-by: Jack Morgenstein

Index: linux-kernel/infiniband/hw/mthca/mthca_mcg.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_mcg.c (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_mcg.c (working copy)
@@ -189,7 +189,12 @@
 }

 for (i = 0; i < MTHCA_QP_PER_MGM; ++i)
-     if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) {
+     if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) {
+         mthca_dbg(dev, "QP %06x already a member of MGM\n",
+                   ibqp->qp_num);
+         err = 0;
+         goto out;
+     } else if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) {
         mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31));
         break;
     }

From hch at lst.de Sun Oct 2 08:50:06 2005
From: hch at lst.de (Christoph Hellwig)
Date: Sun, 2 Oct 2005 17:50:06 +0200
Subject: [openib-general] [PATCH] mthca: check for illegal acl when registering an mr
In-Reply-To: <20051002132552.GC9873@mellanox.co.il>
References: <20051002132552.GC9873@mellanox.co.il>
Message-ID: <20051002155006.GA9896@lst.de>

On Sun, Oct 02, 2005 at 03:25:52PM +0200, Jack Morgenstein wrote:
> Now check in kernel space for illegal combination of acl parameters
> (per IB Spec 11.2.8.2).

The check should be in ib_uverbs_reg_mr(), not in every driver.

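[Editorial sketch, not part of any posted message.] The rule discussed in this thread is that remote write or remote atomic access may only be requested together with local write access. A minimal standalone illustration of that check follows. The flag names and values and the helper access_flags_valid() are assumptions made for this sketch; they are stand-ins, not the kernel's definitions, and the placement of the real check (driver or uverbs layer) is what the thread is debating.

```c
#include <stdbool.h>

/*
 * Stand-in access flags for illustration only. The real flags are the
 * IB_ACCESS_* constants in the kernel headers; these values are assumed.
 */
enum {
    ACCESS_LOCAL_WRITE   = 1 << 0,
    ACCESS_REMOTE_WRITE  = 1 << 1,
    ACCESS_REMOTE_READ   = 1 << 2,
    ACCESS_REMOTE_ATOMIC = 1 << 3,
};

/*
 * Hypothetical helper: returns false when remote write or remote atomic
 * access is requested without local write access, true otherwise.
 */
static bool access_flags_valid(int acc)
{
    if ((acc & (ACCESS_REMOTE_ATOMIC | ACCESS_REMOTE_WRITE)) &&
        !(acc & ACCESS_LOCAL_WRITE))
        return false;

    return true;
}
```

With these stand-in values, a request for remote write alone is rejected, while remote read alone or remote write combined with local write is accepted. A caller in the style of the patches above would map a false result to -EINVAL.
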
From halr at voltaire.com Mon Oct 3 05:58:18 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Oct 2005 08:58:18 -0400
Subject: [openib-general] Re: [PATCH] [mthca]: fixed fields in query_port
In-Reply-To: <20051002073024.GA9873@mellanox.co.il>
References: <52u0g1c8ag.fsf@cisco.com> <20051002073024.GA9873@mellanox.co.il>
Message-ID: <1128344167.4401.7657.camel@hal.voltaire.com>

On Sun, 2005-10-02 at 03:30, Jack Morgenstein wrote:
> On Sat, Oct 01, 2005 at 11:05:27PM +0300, Roland Dreier wrote:
> > In fact is there any reason to have the max_vl_num field be returned
> > from the query_port method? I don't see anything sensible a consumer
> > can do with the value, and I would think consumers should just be
> > using service levels rather than worrying about the next hop VL. So
> > maybe we should just delete the field entirely.
> >
>
> I agree. That value is only of interest to the SM, for use in SL-to-VL mapping
> (IB Spec 3.5.7) -- and the SM obtains this value via a MAD query. Applications
> should use the SL field in packets for specifying a QoS (in the future)-- and
> should not even be aware of VL's.
>
> Anyone else have an opinion?

A diagnostics application could use this. Not sure if that is sufficient
justification to keep this in. This value can be retrieved via an SA
query or through SM MADs as long as the protection level is low enough.

-- Hal

From jlentini at netapp.com Mon Oct 3 07:45:05 2005
From: jlentini at netapp.com (James Lentini)
Date: Mon, 3 Oct 2005 10:45:05 -0400 (EDT)
Subject: [openib-general] Re: [PATCH] uDAPL cq channel support, sync with latest verbs
In-Reply-To:
References:
Message-ID:

On Fri, 30 Sep 2005, Arlin Davis wrote:

> James,
>
> Here is a patch to support CQ_WAIT_OBJECT with channels and sync
> with latest verbs. Tested with dapltest, dtest, netpipe, and
> Intel-MPI.

Committed in revision 3637

From halr at voltaire.com Mon Oct 3 07:41:08 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Oct 2005 10:41:08 -0400
Subject: [openib-general] [PATCH] af_packet: Allow for > 8 byte hardware addresses
Message-ID: <1128350467.4401.7746.camel@hal.voltaire.com>

Hi,

The following forward patch was accepted into 2.6.14 and affects OpenIB.
I placed this in
gen2/trunk/src/linux-kernel/patches/linux-2.6.13-af-packet.diff

af_packet: Allow for > 8 byte hardware addresses

The convention is that longer addresses will simply extend
the hardware address byte arrays at the end of sockaddr_ll and
packet_mreq.

Signed-off-by: Eric W. Biederman

-- Hal

From rolandd at cisco.com Mon Oct 3 09:13:51 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 03 Oct 2005 09:13:51 -0700
Subject: [openib-general] Re: [PATCH] mthca: when creating a cq, check that requested cqes does not exceed HCA max
In-Reply-To: <20051002091738.GB9873@mellanox.co.il> (Jack Morgenstein's message of "Sun, 2 Oct 2005 11:17:38 +0200")
References: <20051002091738.GB9873@mellanox.co.il>
Message-ID: <52fyribmtc.fsf@cisco.com>

Seems reasonable. However, looking back at the chip documentation, it
seems that the max CQEs should really be 0x1ffff rather than 0xffff as
I had it. Can you confirm?

Thanks,
  Roland

From rolandd at cisco.com Mon Oct 3 09:18:08 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 03 Oct 2005 09:18:08 -0700
Subject: [openib-general] [PATCH] mthca: check for illegal acl when registering an mr
In-Reply-To: <20051002155006.GA9896@lst.de> (Christoph Hellwig's message of "Sun, 2 Oct 2005 17:50:06 +0200")
References: <20051002132552.GC9873@mellanox.co.il> <20051002155006.GA9896@lst.de>
Message-ID: <52br26bmm7.fsf@cisco.com>

    Christoph> The check should be in ib_uverbs_reg_mr(), not in every driver.

Agreed -- I did it like this:

--- infiniband/core/uverbs_cmd.c (revision 3613)
+++ infiniband/core/uverbs_cmd.c (working copy)
@@ -396,6 +396,14 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb
 if ((cmd.start & ~PAGE_MASK) != (cmd.hca_va & ~PAGE_MASK))
     return -EINVAL;

+ /*
+  * Local write permission is required if remote write or
+  * remote atomic permission is also requested.
+  */
+ if (cmd.access_flags & (IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_REMOTE_WRITE) &&
+     !(cmd.access_flags & IB_ACCESS_LOCAL_WRITE))
+     return -EINVAL;
+
 obj = kmalloc(sizeof *obj, GFP_KERNEL);
 if (!obj)
     return -ENOMEM;

From rolandd at cisco.com Mon Oct 3 09:29:30 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 03 Oct 2005 09:29:30 -0700
Subject: [openib-general] some bugs that can be found using the gen2_basic in the contrib/mellanox folder
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E319B157@mtlexch01.mtl.com> (Dotan Barak's message of "Wed, 28 Sep 2005 16:43:01 +0300")
References: <6AB138A2AB8C8E4A98B9C0C3D52670E319B157@mtlexch01.mtl.com>
Message-ID: <527jcubm39.fsf@cisco.com>

I finally got a chance to try your tests. A few comments:

 - Several of the tests are buggy. See the patch below at least.

 - It would be much more useful if the COMPARE() macro printed the
   expected and actual value on failure.

 - Similarly, other macros should probably also print more context.
   For example, in something like:

       CHECK_PTR("ibv_create_qp", qp[i], goto cleanup);

   I would probably want to know the value of i on failure.

 - I don't believe some of the tests are really valid. For example,
   the max number of QPs doesn't have to be precisely correct -- no
   valid app is going to depend on being able to create exactly that
   number of QPs and no more.

 - In any case, I'm not convinced that this sort of negative testing
   is the most valuable thing to focus on right now.
   I think it would be better to have regression tests of basic
   functionality (sends, receives, RDMA, CQ polling, etc) and stress
   tests before testing whether a buggy app will get the right error
   value when passing invalid parameters.

 - R.

Index: test_cq.c
===================================================================
--- test_cq.c (revision 3639)
+++ test_cq.c (working copy)
@@ -106,6 +106,7 @@ int cq_2(
 {
 struct ibv_context *ib_cont = NULL;
 struct ibv_pd *pd = NULL;
+ struct ibv_comp_channel *channel = NULL;
 struct ibv_cq *cq = NULL;
 struct ibv_cq *event_cq = NULL;
 struct ibv_qp *qp = NULL;
@@ -132,8 +133,11 @@ int cq_2(
 pd = ibv_alloc_pd(ib_cont);
 CHECK_PTR("ibv_alloc_pd", pd, goto cleanup);

+ channel = ibv_create_comp_channel(ib_cont);
+ CHECK_PTR("ibv_create_comp_channel", channel, goto cleanup);
+
 cq_size = VL_range(rand_gen, 1, device_attr.max_cqe);
- cq = ibv_create_cq(ib_cont, cq_size, (void *)&count, NULL, 0);
+ cq = ibv_create_cq(ib_cont, cq_size, (void *)&count, channel, 0);
 CHECK_PTR("ibv_create_cq", cq, goto cleanup);

 mr_size = VL_range(rand_gen, 1, 1024);
@@ -211,6 +215,7 @@ int cq_2(
 CHECK_MALLOC(event_count, goto cleanup);
 *event_count = 0;

+ rc = ibv_get_cq_event(channel, (void *)&event_cq, (void *)&event_count);
 rc = ibv_get_cq_event(NULL, (void *)&event_cq, (void *)&event_count);
 CHECK_VALUE("ibv_get_cq_event", rc, 0, goto cleanup);

Index: test_hca.c
===================================================================
--- test_hca.c (revision 3639)
+++ test_hca.c (working copy)
@@ -230,7 +230,7 @@ int hca_5(
 j = port_attr.gid_tbl_len + VL_random(rand_gen, 0xFFFFFFFF - port_attr.gid_tbl_len);

 rc = ibv_query_gid(ib_cont, i, j, &gid);
- CHECK_VALUE("ibv_query_gid", rc, 0, goto cleanup);
+ CHECK_VALUE("ibv_query_gid", rc, -1, goto cleanup);
 }
 PASSED;

@@ -239,7 +239,7 @@ int hca_5(
 i = VL_range(rand_gen, device_attr.phys_port_cnt + 1, 0xFF);

 rc = ibv_query_gid(ib_cont, i, j, &gid);
- CHECK_VALUE("ibv_query_gid", rc, 0, goto cleanup);
+ CHECK_VALUE("ibv_query_gid", rc, -1, goto cleanup);
 PASSED;

 test_result = 0;

From rolandd at cisco.com Mon Oct 3 09:32:40 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 03 Oct 2005 09:32:40 -0700
Subject: [PATCH] Check port number in query_port/modify_port (was: [openib-general] some bugs that can be found using the gen2_basic in the contrib/mellanox folder)
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E319B157@mtlexch01.mtl.com> (Dotan Barak's message of "Wed, 28 Sep 2005 16:43:01 +0300")
References: <6AB138A2AB8C8E4A98B9C0C3D52670E319B157@mtlexch01.mtl.com>
Message-ID: <523bniblxz.fsf@cisco.com>

I feel silly for spending time on this, but I made this change to make
a couple of your tests pass:

 - R.

--- infiniband/core/device.c (revision 3613)
+++ infiniband/core/device.c (working copy)
@@ -514,6 +514,12 @@ int ib_query_port(struct ib_device *devi
 u8 port_num,
 struct ib_port_attr *port_attr)
 {
+ if (device->node_type == IB_NODE_SWITCH) {
+     if (port_num)
+         return -EINVAL;
+ } else if (port_num < 1 || port_num > device->phys_port_cnt)
+     return -EINVAL;
+
 return device->query_port(device, port_num, port_attr);
 }
 EXPORT_SYMBOL(ib_query_port);
@@ -583,6 +589,12 @@ int ib_modify_port(struct ib_device *dev
 u8 port_num, int port_modify_mask,
 struct ib_port_modify *port_modify)
 {
+ if (device->node_type == IB_NODE_SWITCH) {
+     if (port_num)
+         return -EINVAL;
+ } else if (port_num < 1 || port_num > device->phys_port_cnt)
+     return -EINVAL;
+
 return device->modify_port(device, port_num, port_modify_mask,
                            port_modify);
 }

From mlleini at ca.sandia.gov Mon Oct 3 11:05:37 2005
From: mlleini at ca.sandia.gov (Matt L. Leininger)
Date: Mon, 03 Oct 2005 11:05:37 -0700
Subject: [openib-general] OpenIB gen2 support ibv_create_cq
Message-ID: <1128362737.10484.267.camel@localhost>

The latest mvapich-gen2 does not compile with the latest OpenIB gen2
code base. The number of function arguments to ibv_create_cq has
changed from 3 to 5. This looks like a simple fix, but you may need to
support both the old and new API for ibv_create_cq. The current OpenIB
gen2 backport to 2.6.9 (for RedHat) uses the older API.

Woody, are there plans to update the 2.6.9 backports to svn version 3632
or more recent to fix this?

mvapich-gen2-1.0-102/mpid/ch_gen2/viainit.c ~line 118

static void create_cq(void)
{
    ibv_dev.cq_hndl = ibv_create_cq(ibv_dev.context, viadev_cq_size, NULL);

    if(!ibv_dev.cq_hndl) {
        error_abort_all(GEN_EXIT_ERR, "Error creating CQ\n");
    }
}

OpenIB verbs.h

extern struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe,
                                    void *cq_context,
                                    struct ibv_comp_channel *channel,
                                    int comp_vector);

Thanks,

 - Matt

From robert.j.woodruff at intel.com Mon Oct 3 11:09:18 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 3 Oct 2005 11:09:18 -0700
Subject: [openib-general] RE: OpenIB gen2 support ibv_create_cq
In-Reply-To: <1128362737.10484.267.camel@localhost>
Message-ID:

Matt wrote,
>Woody, are there plans to update the 2.6.9 backports to svn version 3632
>or more recent to fix this?

Yes.
I am working on testing the 2.6.9 backport for 3640 right now.
If all goes well, I should be done testing these within a day or so
and then I will push them out to SVN.

woody

From halr at voltaire.com Mon Oct 3 11:48:44 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Oct 2005 14:48:44 -0400
Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure
Message-ID: <1128365323.4397.38.camel@hal.voltaire.com>

netdevice.h: Add RDMA private pointer to the net_device structure

Signed-off-by: Hal Rosenstock

--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -366,6 +366,7 @@ struct net_device
 void *ip6_ptr; /* IPv6 specific data */
 void *ec_ptr; /* Econet specific data */
 void *ax25_ptr; /* AX.25 specific data */
+ void *rdma_ptr; /* RDMA specific data */

 /*
  * Cache line mostly used on receive path (including eth_type_trans())

From shemminger at osdl.org Mon Oct 3 13:54:07 2005
From: shemminger at osdl.org (Stephen Hemminger)
Date: Mon, 3 Oct 2005 13:54:07 -0700
Subject: [openib-general] Re: [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure
In-Reply-To: <1128365323.4397.38.camel@hal.voltaire.com>
References: <1128365323.4397.38.camel@hal.voltaire.com>
Message-ID: <20051003135407.072aaff6@dxpl.pdx.osdl.net>

On 03 Oct 2005 14:48:44 -0400
Hal Rosenstock wrote:

> netdevice.h: Add RDMA private pointer to the net_device structure
>
> Signed-off-by: Hal Rosenstock

Who is going to use it? Is RDMA being submitted for code review?

--
Stephen Hemminger
OSDL http://developer.osdl.org/~shemminger

From halr at voltaire.com Mon Oct 3 13:53:52 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Oct 2005 16:53:52 -0400
Subject: [openib-general] Re: [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure
In-Reply-To: <20051003135407.072aaff6@dxpl.pdx.osdl.net>
References: <1128365323.4397.38.camel@hal.voltaire.com> <20051003135407.072aaff6@dxpl.pdx.osdl.net>
Message-ID: <1128372832.4397.270.camel@hal.voltaire.com>

On Mon, 2005-10-03 at 16:54, Stephen Hemminger wrote:
> On 03 Oct 2005 14:48:44 -0400
> Hal Rosenstock wrote:
>
> > netdevice.h: Add RDMA private pointer to the net_device structure
> >
> > Signed-off-by: Hal Rosenstock
>
> Who is going to use it? Is RDMA being submitted for code review?

IB (and ultimately RDMA) will use it.

-- Hal

From rolandd at cisco.com Mon Oct 3 14:28:17 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 03 Oct 2005 14:28:17 -0700
Subject: [openib-general] OpenIB gen2 support ibv_create_cq
In-Reply-To: <1128362737.10484.267.camel@localhost> (Matt L. Leininger's message of "Mon, 03 Oct 2005 11:05:37 -0700")
References: <1128362737.10484.267.camel@localhost>
Message-ID: <52k6gu9tou.fsf@cisco.com>

    Matt> Woody, are there plans to update the 2.6.9 backports to svn
    Matt> version 3632 or more recent to fix this?

There's no need to backport anything. The latest libibverbs (1.0-rc3)
supports the new CQ API on all kernel ABIs.

 - R.

From rolandd at cisco.com Mon Oct 3 14:29:08 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 14:29:08 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <1128365323.4397.38.camel@hal.voltaire.com> (Hal Rosenstock's message of "03 Oct 2005 14:48:44 -0400") References: <1128365323.4397.38.camel@hal.voltaire.com> Message-ID: <52fyri9tnf.fsf@cisco.com> Hal> netdevice.h: Add RDMA private pointer to the net_device structure I don't think there's any point in making this change until we have some code that will use the pointer. - R. From rolandd at cisco.com Mon Oct 3 14:30:30 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 14:30:30 -0700 Subject: [openib-general] Re: [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <20051003135407.072aaff6@dxpl.pdx.osdl.net> (Stephen Hemminger's message of "Mon, 3 Oct 2005 13:54:07 -0700") References: <1128365323.4397.38.camel@hal.voltaire.com> <20051003135407.072aaff6@dxpl.pdx.osdl.net> Message-ID: <52br269tl5.fsf@cisco.com> Stephen> Who is going to use it? Is RDMA being submitted for code Stephen> review? I agree that we should hold off on this until there's an in-tree user. However, just as a clarification, we're trying to move from "ib" to "rdma" nomenclature as we try to make the existing kernel InfiniBand layer a more generic layer that can support both IB and iWARP. So new code should use "rdma" names. - R.
From halr at voltaire.com Mon Oct 3 14:26:56 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Oct 2005 17:26:56 -0400 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <52fyri9tnf.fsf@cisco.com> References: <1128365323.4397.38.camel@hal.voltaire.com> <52fyri9tnf.fsf@cisco.com> Message-ID: <1128374816.4397.343.camel@hal.voltaire.com> On Mon, 2005-10-03 at 17:29, Roland Dreier wrote: > Hal> netdevice.h: Add RDMA private pointer to the net_device structure > > I don't think there's any point in making this change until we have > some code that will use the pointer. We will have this shortly. I have been waiting for this to propose the changes to SDP et al. -- Hal From davem at davemloft.net Mon Oct 3 14:34:07 2005 From: davem at davemloft.net (David S. Miller) Date: Mon, 03 Oct 2005 14:34:07 -0700 (PDT) Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <52fyri9tnf.fsf@cisco.com> References: <1128365323.4397.38.camel@hal.voltaire.com> <52fyri9tnf.fsf@cisco.com> Message-ID: <20051003.143407.49100316.davem@davemloft.net> From: Roland Dreier Date: Mon, 03 Oct 2005 14:29:08 -0700 > Hal> netdevice.h: Add RDMA private pointer to the net_device structure > > I don't think there's any point in making this change until we have > some code that will use the pointer. I definitely agree. From rolandd at cisco.com Mon Oct 3 14:35:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 14:35:39 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <1128374816.4397.343.camel@hal.voltaire.com> (Hal Rosenstock's message of "03 Oct 2005 17:26:56 -0400") References: <1128365323.4397.38.camel@hal.voltaire.com> <52fyri9tnf.fsf@cisco.com> <1128374816.4397.343.camel@hal.voltaire.com> Message-ID: <523bni9tck.fsf@cisco.com> Hal> We will have this shortly. 
I have been waiting for this to Hal> propose the changes to SDP et al. OK, but I don't think it makes sense to merge this upstream until there is in-tree code that will use it. - R. From halr at voltaire.com Mon Oct 3 14:50:21 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Oct 2005 17:50:21 -0400 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <523bni9tck.fsf@cisco.com> References: <1128365323.4397.38.camel@hal.voltaire.com> <52fyri9tnf.fsf@cisco.com> <1128374816.4397.343.camel@hal.voltaire.com> <523bni9tck.fsf@cisco.com> Message-ID: <1128375898.4397.389.camel@hal.voltaire.com> On Mon, 2005-10-03 at 17:35, Roland Dreier wrote: > Hal> We will have this shortly. I have been waiting for this to > Hal> propose the changes to SDP et al. > > OK, but I don't think it makes sense to merge this upstream until > there is in-tree code that will use it. I wanted to get this in so I could add the code to IPoIB to use this so SDP and others no longer poke at IPoIB's private data. This is a small change. Should this change be made locally (in OpenIB) first (and we'll have our own modified netdevice.h for a short time) ? -- Hal From nacc at us.ibm.com Mon Oct 3 15:15:54 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 3 Oct 2005 15:15:54 -0700 Subject: [openib-general] Latest build test results Message-ID: <20051003221553.GA27996@us.ibm.com> Hello, Here are the build results for 2.6.14-rc3 with and without the latest gen2 trunk. Looks like all the builds were successful, with some warnings: - ppc64 + gen2 with =y drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type - same for =m, plus *** Warning: ".ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined! *** Warning: ".ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! 
WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/core/ib_at.ko needs unknown symbol ip_dev_find WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko needs unknown symbol ip_dev_find - x86 + gen2 with =y drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_adaptor_release': drivers/infiniband/ulp/iser/iser_conn.c:195: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c:203: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c:206: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_establish': drivers/infiniband/ulp/iser/iser_conn.c:285: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_enable_rdma': drivers/infiniband/ulp/iser/iser_conn.c:357: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c:431: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_post_receive_control': drivers/infiniband/ulp/iser/iser_conn.c:933: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c:950: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c:981: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_memory.c: In function `iser_all_mem_add_to_dto': drivers/infiniband/ulp/iser/iser_memory.c:230: warning: cast from pointer to integer of different size drivers/infiniband/ulp/iser/iser_mod.c: In function `init_module': drivers/infiniband/ulp/iser/iser_mod.c:152: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_initiator.c: In function `iser_reg_rdma_mem': drivers/infiniband/ulp/iser/iser_initiator.c:62: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_initiator.c:67: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_initiator.c:80: warning: too few arguments for 
format drivers/infiniband/ulp/iser/iser_initiator.c:95: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_create_ia_pz_evd': drivers/infiniband/ulp/iser/iser_lkdapl.c:147: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_start_dto': drivers/infiniband/ulp/iser/iser_lkdapl.c:660: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_consume_events': drivers/infiniband/ulp/iser/iser_lkdapl.c:758: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_event_handler_thread': drivers/infiniband/ulp/iser/iser_lkdapl.c:800: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:819: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_conn_event': drivers/infiniband/ulp/iser/iser_lkdapl.c:846: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:849: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:852: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:855: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:858: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:861: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:864: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:867: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:870: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_single_kdapl_event': drivers/infiniband/ulp/iser/iser_lkdapl.c:1116: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_mod.c: In function `cleanup_module': drivers/infiniband/ulp/iser/iser_mod.c:241: warning: too few arguments for 
format drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type - same for =m, plus: *** Warning: "ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined! *** Warning: "ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko needs unknown symbol ip_dev_find WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/core/ib_at.ko needs unknown symbol ip_dev_find Mainline does not appear to have any issues on either ppc64 or x86, =m or =y. Thanks, Nish From rolandd at cisco.com Mon Oct 3 15:17:18 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 15:17:18 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <1128375898.4397.389.camel@hal.voltaire.com> (Hal Rosenstock's message of "03 Oct 2005 17:50:21 -0400") References: <1128365323.4397.38.camel@hal.voltaire.com> <52fyri9tnf.fsf@cisco.com> <1128374816.4397.343.camel@hal.voltaire.com> <523bni9tck.fsf@cisco.com> <1128375898.4397.389.camel@hal.voltaire.com> Message-ID: <52psqm8cup.fsf@cisco.com> Hal> I wanted to get this in so I could add the code to IPoIB to Hal> use this so SDP and others no longer poke at IPoIB's private Hal> data. This is a small change. Should this change be made Hal> locally (in OpenIB) first (and we'll have our own modified Hal> netdevice.h for a short time) ? Yes, I think that's the way to develop this sort of thing. - R. From panda at cse.ohio-state.edu Mon Oct 3 15:47:54 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Mon, 3 Oct 2005 18:47:54 -0400 (EDT) Subject: [openib-general] Re: OpenIB gen2 support ibv_create_cq In-Reply-To: <1128362737.10484.267.camel@localhost> from "Matt L. 
Leininger" at Oct 03, 2005 11:05:37 AM Message-ID: <200510032247.j93MlssL006110@xi.cse.ohio-state.edu> Matt, > The latest mvapich-gen2 does not compile with the latest OpenIB gen2 > code base. The number of function arguments to ibv_create_cq has > changed from 3 to 5. This looks like a simple fix, but you may need to > support both the old and new API for ibv_create_cq. The current OpenIB > gen2 backport to 2.6.9 (for RedHat) uses the older API. The patch has been included in the latest MVAPICH-Gen2 version checked into the SVN a few hours ago. MVAPICH-Gen2 now compiles against the latest Gen2 stack. If an older Gen2 stack is being used against the latest MVAPICH-Gen2, we have added a new flag (-DGEN2_OLD_CQ_VERB) for the code to be compiled with. More information on this has been added to mvapich.user_guide.pdf (Version 1.1). Hope this helps. Thanks, DK > Woody, are there plans to update the 2.6.9 backports to svn version 3632 > or more recent to fix this? > > > > > mvapich-gen2-1.0-102/mpid/ch_gen2/viainit.c ~line 118 > > static void create_cq(void) > { > ibv_dev.cq_hndl = ibv_create_cq(ibv_dev.context, > viadev_cq_size, NULL); > > if(!ibv_dev.cq_hndl) { > error_abort_all(GEN_EXIT_ERR, "Error creating CQ\n"); > } > } > > > > OpenIB verbs.h > > extern struct ibv_cq *ibv_create_cq(struct ibv_context *context, int > cqe, > void *cq_context, > struct ibv_comp_channel *channel, > int comp_vector); > > > Thanks, > > - Matt > > From pradeep at us.ibm.com Mon Oct 3 16:05:45 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 3 Oct 2005 16:05:45 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <1128375898.4397.389.camel@hal.voltaire.com> Message-ID: My understanding is that the refcnt will still need to be held (even after this change) even if SDP would not poke at IPoIB's private data. Is that true? Moreover there was discussion about getting this data from the CM REQ private data. 
So, what is the exact rationale for adding this to the net_device structure? Pradeep pradeep at us.ibm.com openib-general-bounces at openib.org wrote on 10/03/2005 02:50:21 PM: > On Mon, 2005-10-03 at 17:35, Roland Dreier wrote: > > Hal> We will have this shortly. I have been waiting for this to > > Hal> propose the changes to SDP et al. > > > > OK, but I don't think it makes sense to merge this upstream until > > there is in-tree code that will use it. > > I wanted to get this in so I could add the code to IPoIB to use this so SDP > and others no longer poke at IPoIB's private data. This is a small > change. Should this change be made locally (in OpenIB) first (and we'll > have our own modified netdevice.h for a short time) ? > > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Mon Oct 3 16:17:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Oct 2005 19:17:17 -0400 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: References: Message-ID: <1128381437.4397.594.camel@hal.voltaire.com> On Mon, 2005-10-03 at 19:05, Pradeep Satyanarayana wrote: > My understanding is that the refcnt will still need to be held (even > after this change) even if SDP would not poke at IPoIB's private data. > Is that true? Yes, that's an independent issue. > Moreover there was discussion about getting this data from the CM REQ > private data. So, what is the exact rationale for adding this to the > net_device structure? To get at the ib_device, port, and PKey which are needed for a subsequent SA path record request. 
-- Hal > Pradeep > pradeep at us.ibm.com > > openib-general-bounces at openib.org wrote on 10/03/2005 02:50:21 PM: > > > On Mon, 2005-10-03 at 17:35, Roland Dreier wrote: > > > Hal> We will have this shortly. I have been waiting for this > to > > > Hal> propose the changes to SDP et al. > > > > > > OK, but I don't think it makes sense to merge this upstream until > > > there is in-tree code that will use it. > > > > I wanted to get this in so I could add the code to IPoIB to use this > so SDP > > and others no longer poke at IPoIB's private data. This is a small > > change. Should this change be made locally (in OpenIB) first (and > we'll > > have our own modified netdevice.h for a short time) ? > > > > -- Hal > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From pradeep at us.ibm.com Mon Oct 3 16:52:55 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 3 Oct 2005 16:52:55 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <1128381437.4397.594.camel@hal.voltaire.com> Message-ID: Ok thanks for the explanation. So, I presume that means rdma_ptr will now point to ib_device? If so, one issue that strikes me as significant would be backward compatability. My view is that one could continue to use the IPoIB private data.
Pradeep pradeep at us.ibm.com Hal Rosenstock wrote on 10/03/2005 04:17:17 PM: > On Mon, 2005-10-03 at 19:05, Pradeep Satyanarayana wrote: > > My understanding is that the refcnt will still need to be held (even > > after this change) even if SDP would not poke at IPoIB's private data. > > Is that true? > > Yes, that's an independent issue. > > > Moreover there was discussion about getting this data from the CM REQ > > private data. So, what is the exact rationale for adding this to the > > net_device structure? > > To get at the ib_device, port, and PKey which are needed for a > subsequent SA path record request. > > -- Hal > > > Pradeep > > pradeep at us.ibm.com > > > > openib-general-bounces at openib.org wrote on 10/03/2005 02:50:21 PM: > > > > > On Mon, 2005-10-03 at 17:35, Roland Dreier wrote: > > > > Hal> We will have this shortly. I have been waiting for this > > to > > > > Hal> propose the changes to SDP et al. > > > > > > > > OK, but I don't think it makes sense to merge this upstream until > > > > there is in-tree code that will use it. > > > > > > I wanted to get this in so I could add the code to IPoIB to use this > > so SDP > > > and others no longer poke at IPoIB's private data. This is a small > > > change. Should this change be made locally (in OpenIB) first (and > > we'll > > > have our own modified netdevice.h for a short time) ? > > > > > > -- Hal > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rolandd at cisco.com Mon Oct 3 17:07:26 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 17:07:26 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: (Pradeep Satyanarayana's message of "Mon, 3 Oct 2005 16:52:55 -0700") References: Message-ID: <52ll1a87r5.fsf@cisco.com> Pradeep> If so, one issue that strikes me as significant would be Pradeep> backward compatability. My view is that one could Pradeep> continue to use the IPoIB private data. This is an in-kernel API. There's no reason to even think about backwards compatibility. - R. From sean.hefty at intel.com Mon Oct 3 17:09:58 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 3 Oct 2005 17:09:58 -0700 Subject: [openib-general] CMA and device removal Message-ID: >The idea with this is that a user of the CMA does not need to register for >device addition/removal, and track devices themselves. What I have right now >is something similar to this: > >rdma_create_id(); >rdma_bind_addr(id, optional src addr, dst addr); >rdma_resolve_route(id); /* optional - done by connect if not called */ >rdma_connect(id); I've committed a version of the CMA that attempts to handle device removal internally. When a device is removed, a device removal event is generated on a user's RDMA identifier, and the removal is delayed within the CMA until all references have been released. An updated version of the API is given below. The implementation has not been tested, and there are a couple of missing features: support for listening across all devices and automatic route resolution. The implementation is available under: svn/gen2/users/mshefty. - Sean /* * Copyright (c) 2005 Voltaire Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. 
* * This Software is licensed under one of the following licenses: * * 1) under the terms of the "Common Public License 1.0" a copy of which is * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. * * 2) under the terms of the "The BSD License" a copy of which is * available from the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. * * 3) under the terms of the "GNU General Public License (GPL) Version 2" a * copy of which is available from the Open Source Initiative, see * http://www.opensource.org/licenses/gpl-license.php. * * Licensee has the right to choose one of the above licenses. * * Redistributions of source code must retain the above copyright * notice and one of the license notices. * * Redistributions in binary form must reproduce both the above copyright * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. * */ #if !defined(RDMA_CMA_H) #define RDMA_CMA_H #include #include #include /* * Upon receiving a device removal event, users must destroy the associated * RDMA identifier and release all resources allocated with the device. */ enum rdma_event_type { RDMA_EVENT_ADDR_RESOLVED, RDMA_EVENT_ADDR_ERROR, RDMA_EVENT_ROUTE_RESOLVED, RDMA_EVENT_ROUTE_ERROR, RDMA_EVENT_CONNECT_REQUEST, RDMA_EVENT_CONNECT_ERROR, RDMA_EVENT_UNREACHABLE, RDMA_EVENT_REJECTED, RDMA_EVENT_ESTABLISHED, RDMA_EVENT_DISCONNECTED, RDMA_EVENT_DEVICE_REMOVAL, }; struct rdma_addr { struct sockaddr src_addr; struct sockaddr dst_addr; union { struct ib_addr ibaddr; } addr; }; struct rdma_route { struct rdma_addr addr; struct ib_sa_path_rec *path_rec; int num_paths; }; struct rdma_event { enum rdma_event_type event; int status; void *private_data; u8 private_data_len; }; struct rdma_id; /** * rdma_event_handler - Callback used to report user events. 
* * Notes: Users may not call rdma_destroy_id from this callback to destroy * the passed in id, or a corresponding listen id. Returning a * non-zero value from the callback will destroy the corresponding id. */ typedef int (*rdma_event_handler)(struct rdma_id *id, struct rdma_event *event); struct rdma_id { struct ib_device *device; void *context; struct ib_qp *qp; rdma_event_handler event_handler; struct rdma_route route; }; struct rdma_id* rdma_create_id(rdma_event_handler event_handler, void *context); void rdma_destroy_id(struct rdma_id *id); /** * rdma_bind_addr - Bind an RDMA identifier to a source address and * associated RDMA device, if needed. * * @id: RDMA identifier. * @addr: Local address information. Wildcard values are permitted. * * This associates a source address with the RDMA identifier before calling * rdma_listen. If a specific local address is given, the RDMA identifier will * be bound to a local RDMA device. */ int rdma_bind_addr(struct rdma_id *id, struct sockaddr *addr); /** * rdma_resolve_addr - Resolve destination and optional source addresses * from IP addresses to an RDMA address. If successful, the specified * rdma_id will be bound to a local device. * * @id: RDMA identifier. * @src_addr: Source address information. This parameter may be NULL. * @dst_addr: Destination address information. * @timeout_ms: Time to wait for resolution to complete. */ int rdma_resolve_addr(struct rdma_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr, int timeout_ms); /** * rdma_resolve_route - Resolve the RDMA address bound to the RDMA identifier * into route information needed to establish a connection. * * This is called on the client side of a connection, but its use is optional. * Users must have first called rdma_resolve_addr to resolve a dst_addr * into an RDMA address before calling this routine.
*/ int rdma_resolve_route(struct rdma_id *id, int timeout_ms); /** * rdma_init_qp - Associates a QP with a CMA identifier and initializes the * QP for use in establishing a connection. * * TODO: fix how to do this... doesn't work with iWarp... */ int rdma_init_qp(struct rdma_id *id, struct ib_qp *qp, int qp_access_flags); struct rdma_conn_param { const void *private_data; u8 private_data_len; u8 responder_resources; u8 initiator_depth; u8 flow_control; u8 retry_count; /* ignored when accepting */ u8 rnr_retry_count; }; /** * rdma_connect - Initiate an active connection request. * * Users must have bound the rdma_id to a local device by having called * rdma_resolve_addr before calling this routine. Users may also resolve the * RDMA address to a route with rdma_resolve_route, but if a route has not * been resolved, a default route will be selected. * * Note that the QP must be in the INIT state. */ int rdma_connect(struct rdma_id *id, struct rdma_conn_param *conn_param); /** * rdma_listen - This function is called by the passive side to * listen for incoming connection requests. * * Users must have bound the rdma_id to a local address by calling * rdma_bind_addr before calling this routine. */ int rdma_listen(struct rdma_id *id); /** * rdma_accept - Called on the passive side to accept a connection request * * Note that the QP must be in the INIT state. */ int rdma_accept(struct rdma_id *id, struct rdma_conn_param *conn_param); /** * rdma_reject - Called on the passive side to reject a connection request. */ int rdma_reject(struct rdma_id *id, const void *private_data, u8 private_data_len); /** * rdma_disconnect - This function disconnects the associated QP. 
*/ int rdma_disconnect(struct rdma_id *id); #endif /* RDMA_CMA_H */
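[Editor's sketch, not part of the original thread.] The header above implies an active-side sequence of address resolution, route resolution, then connect, with a non-zero return from the event handler asking the CMA to destroy the id (for example on device removal). The fragment below is a minimal illustration of that flow under stated assumptions: the structs are trimmed stand-ins, rdma_resolve_route and rdma_connect are stubs that only count calls, and example_event_handler, the 2000 ms timeout, and retry_count 7 are hypothetical choices. Whether the posted implementation allows calling these routines from inside the callback is also an assumption, and a real consumer would need its QP in the INIT state before connecting.

```c
#include <stddef.h>

/* Event types as listed in the posted header. */
enum rdma_event_type {
	RDMA_EVENT_ADDR_RESOLVED,
	RDMA_EVENT_ADDR_ERROR,
	RDMA_EVENT_ROUTE_RESOLVED,
	RDMA_EVENT_ROUTE_ERROR,
	RDMA_EVENT_CONNECT_REQUEST,
	RDMA_EVENT_CONNECT_ERROR,
	RDMA_EVENT_UNREACHABLE,
	RDMA_EVENT_REJECTED,
	RDMA_EVENT_ESTABLISHED,
	RDMA_EVENT_DISCONNECTED,
	RDMA_EVENT_DEVICE_REMOVAL,
};

/* Trimmed stand-ins (assumptions): only the fields this sketch touches. */
struct rdma_id { void *context; };
struct rdma_event { enum rdma_event_type event; int status; };
struct rdma_conn_param { unsigned char retry_count; };

/* Stubs standing in for the CMA calls; they only count invocations. */
static int route_calls, connect_calls;

static int rdma_resolve_route(struct rdma_id *id, int timeout_ms)
{
	(void)id;
	(void)timeout_ms;
	route_calls++;
	return 0;
}

static int rdma_connect(struct rdma_id *id, struct rdma_conn_param *conn_param)
{
	(void)id;
	(void)conn_param;
	connect_calls++;
	return 0;
}

/* Hypothetical active-side handler driving the connection forward. */
static int example_event_handler(struct rdma_id *id, struct rdma_event *event)
{
	struct rdma_conn_param param = { .retry_count = 7 };

	switch (event->event) {
	case RDMA_EVENT_ADDR_RESOLVED:
		/* Address is bound to a device; ask for a route next. */
		return rdma_resolve_route(id, 2000);
	case RDMA_EVENT_ROUTE_RESOLVED:
		/* Real code must already have the QP in the INIT state. */
		return rdma_connect(id, &param);
	case RDMA_EVENT_ESTABLISHED:
		return 0;
	case RDMA_EVENT_DEVICE_REMOVAL:
		/* Release device resources here; non-zero destroys the id. */
		return 1;
	default:
		/* This sketch treats every other event as fatal. */
		return 1;
	}
}
```

The consumer never registers for device add/remove itself; it only reacts to the removal event delivered on its own id.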
From sean.hefty at intel.com Mon Oct 3 20:54:26 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 3 Oct 2005 20:54:26 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA privatepointer to the net_device structure In-Reply-To: <1128381437.4397.594.camel@hal.voltaire.com> Message-ID: >> Moreover there was discussion about getting this data from the CM REQ >> private data. So, what is the exact rationale for adding this to the >> net_device structure? > >To get at the ib_device, port, and PKey which are needed for a >subsequent SA path record request. We should be able to retrieve the device and port through GID matching. I'm not sure how safe it is to access the device pointer in the case of device removal. Reading the device pointer from the rdma_ptr would need to be synchronized with ipoib's device removal handling, but maybe that's handled by the reference on the net_device...? Does ipoib create a device per pkey associated with a port? Is it possible for a user to get at a pkey other than the one at index 0 given only an IP address? - Sean From rolandd at cisco.com Mon Oct 3 21:13:21 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 21:13:21 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA privatepointer to the net_device structure In-Reply-To: (Sean Hefty's message of "Mon, 3 Oct 2005 20:54:26 -0700") References: Message-ID: <52ek719axq.fsf@cisco.com> Sean> Does ipoib create a device per pkey associated with a port? Sean> Is it possible for a user to get at a pkey other than the Sean> one at index 0 given only an IP address? Yes to both. Each P_Key is a different IPoIB broadcast domain and a different netdevice/interface.
Routing could easily return an IPoIB interface with any P_Key. - R. From rolandd at cisco.com Mon Oct 3 21:14:53 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 21:14:53 -0700 Subject: [openib-general] CMA and device removal In-Reply-To: (Sean Hefty's message of "Mon, 3 Oct 2005 17:09:58 -0700") References: Message-ID: <52achp9av6.fsf@cisco.com> >> The idea with this is that a user of the CMA does not need to >> register for device addition/removal, and track devices >> themselves. Not really related to this latest posting, but I think I forgot to reply earlier... in any case, I think this is a really good idea: have the CMA insulate consumers from device addition/removal, so that CMA consumers don't have to use the ib_register_client() API directly. - R.
From IBMEHCAD at de.ibm.com Tue Oct 4 06:52:55 2005 From: IBMEHCAD at de.ibm.com (IBMEHCA DD) Date: Tue, 4 Oct 2005 15:52:55 +0200 Subject: [openib-general] moving IBM eHCA Device Driver to openib.org Message-ID: We're ready now to release the eHCA device driver to openib.org under http://openib.org/license.html Our assumption is the right place for that code would be: gen2/trunk/src/linux-kernel/infiniband/hw/ehca gen2/trunk/src/userspace/libehca We should probably modify the linux-kernel/infiniband/kconfig to only allow to compile the kernel part for ppc64 builds Please let us know if this is the right way to move our code from sourceforge to openib.org Thanks, Christoph Raisch ibm boeblingen lab -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Oct 4 06:57:00 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Oct 2005 09:57:00 -0400 Subject: [openib-general] moving IBM eHCA Device Driver to openib.org In-Reply-To: References: Message-ID: <1128434220.4397.3899.camel@hal.voltaire.com> Hi, On Tue, 2005-10-04 at 09:52, IBMEHCA DD wrote: > We're ready now to release the eHCA device driver to openib.org under > http://openib.org/license.html Glad to hear this :-) > Our assumption is the right place for that code would be: > > gen2/trunk/src/linux-kernel/infiniband/hw/ehca > gen2/trunk/src/userspace/libehca > > We should probably modify the linux-kernel/infiniband/kconfig to only > allow to compile the kernel part for ppc64 builds Yes (and the makefile there as well). > Please let us know if this is the right way to move our code from > sourceforge to openib.org Yes, that appears right to me. 
-- Hal From caitlinb at broadcom.com Tue Oct 4 07:21:20 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 4 Oct 2005 07:21:20 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020956@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock > Sent: Monday, October 03, 2005 4:17 PM > To: Pradeep Satyanarayana > Cc: openib-general-bounces at openib.org; openib-general at openib.org > Subject: Re: [openib-general] [PATCH] netdevice.h: Add RDMA > private pointer to the net_device structure > > On Mon, 2005-10-03 at 19:05, Pradeep Satyanarayana wrote: > > My understanding is that the refcnt will still need to be > held (even > > after this change) even if SDP would not poke at IPoIB's > private data. > > Is that true? > > Yes, that's an independent issue. > > > Moreover there was discussion about getting this data from > the CM REQ > > private data. So, what is the exact rationale for adding > this to the > > net_device structure? > > To get at the ib_device, port, and PKey which are needed for > a subsequent SA path record request. > In terms of justifying the field in the net_device structure you are saying that this holds data needed by and only understood by the rdma layer, but that is specific to the net_device. That makes sense. The only thing really missing is clarifying the intended scope of this data. I believe that the intent is for it to be transport specific, but not device specific. Is that correct? 
From halr at voltaire.com Tue Oct 4 08:01:14 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Oct 2005 11:01:14 -0400 Subject: [openib-general] [PATCH] ipv4/fib_frontend.c: (Re)export ip_dev_find for 2.6.14 Message-ID: <1128438073.4397.4105.camel@hal.voltaire.com> ipv4/fib_frontend.c: (Re)export ip_dev_find for 2.6.14 (There is emerging functionality (not yet pushed upstream) in the IB subsystem which relies on this being available. ip_dev_find is used to find a valid IPoIB device when the outgoing device returned by the route lookup (ip_route_output_key) is using the loopback interface. A valid IPoIB device is needed to perform sending an ARP and doing an IB path lookup so that an IB connection can be made). Signed-off-by: Hal Rosenstock --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -661,4 +661,5 @@ void __init ip_fib_init(void) } EXPORT_SYMBOL(inet_addr_type); +EXPORT_SYMBOL(ip_dev_find); EXPORT_SYMBOL(ip_rt_ioctl); From mshefty at ichips.intel.com Tue Oct 4 09:36:27 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 04 Oct 2005 09:36:27 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020956@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020956@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <4342AF8B.60900@ichips.intel.com> Caitlin Bestler wrote: > That makes sense. The only thing really missing is clarifying > the intended scope of this data. I believe that the intent is > for it to be transport specific, but not device specific. Is > that correct? I'm trying to understand who would use this field and what it would contain. From discussions so far, it looks like only an IP to IB address translation mechanism would need it. And the only value that's required seems to be the pkey. Other values could be returned as well to possibly simplify things, but not sure that anything else is required. 
- Sean From rolandd at cisco.com Tue Oct 4 09:43:09 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 04 Oct 2005 09:43:09 -0700 Subject: [openib-general] moving IBM eHCA Device Driver to openib.org In-Reply-To: (IBMEHCA DD's message of "Tue, 4 Oct 2005 15:52:55 +0200") References: Message-ID: <52ll196xnm.fsf@cisco.com> Congratulations on getting to this stage! > gen2/trunk/src/linux-kernel/infiniband/hw/ehca > gen2/trunk/src/userspace/libehca Yes, this is the right place to add the code. > We should probably modify the linux-kernel/infiniband/Kconfig to only > allow to compile the kernel part for ppc64 builds Yes, add source "drivers/infiniband/hw/ehca/Kconfig" to that Kconfig, and obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/ to the Makefile. - R. From caitlinb at broadcom.com Tue Oct 4 10:43:26 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 4 Oct 2005 10:43:26 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020963@NT-SJCA-0751.brcm.ad.broadcom.com> I've been trying to think of some iWARP uses, but haven't come up with any yet. But I have strong lingering suspicions that they will eventually be found and having this type of field will ensure that the data is placed where it belongs rather than another inappropriate peeking at another layer's data being the result. > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 04, 2005 9:36 AM > To: Caitlin Bestler > Cc: Hal Rosenstock; Pradeep Satyanarayana; openib-general at openib.org > Subject: Re: [openib-general] [PATCH] netdevice.h: Add RDMA > private pointer to the net_device structure > > Caitlin Bestler wrote: > > That makes sense. The only thing really missing is clarifying the > > intended scope of this data. I believe that the intent is > for it to be > > transport specific, but not device specific. Is that correct? 
> > I'm trying to understand who would use this field and what it > would contain. > From discussions so far, it looks like only an IP to IB > address translation mechanism would need it. And the only > value that's required seems to be the pkey. Other values > could be returned as well to possibly simplify things, but > not sure that anything else is required. > > - Sean > > From rolandd at cisco.com Tue Oct 4 10:51:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 04 Oct 2005 10:51:54 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020963@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Tue, 4 Oct 2005 10:43:26 -0700") References: <54AD0F12E08D1541B826BE97C98F99F1020963@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <52y8595fwl.fsf@cisco.com> Caitlin> I've been trying to think of some iWARP uses, but haven't Caitlin> come up with any yet. But I have strong lingering Caitlin> suspicions that they will eventually be found and having Caitlin> this type of field will ensure that the data is placed Caitlin> where it belongs rather than another inappropriate Caitlin> peeking at another layer's data being the result. I'm pretty sure iWARP needs the rdma_ptr member for exactly the same reason that IB needs it: to go from a struct net_device coming from route lookup on to a struct rdma_device. - R. From caitlinb at broadcom.com Tue Oct 4 10:55:30 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 4 Oct 2005 10:55:30 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020968@NT-SJCA-0751.brcm.ad.broadcom.com> I think a link from the rdma_device to the net_device is adequate for those purposes. 
> -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, October 04, 2005 10:52 AM > To: Caitlin Bestler > Cc: Sean Hefty; openib-general at openib.org > Subject: Re: [openib-general] [PATCH] netdevice.h: Add RDMA > private pointer to the net_device structure > > Caitlin> I've been trying to think of some iWARP uses, but haven't > Caitlin> come up with any yet. But I have strong lingering > Caitlin> suspicions that they will eventually be found and having > Caitlin> this type of field will ensure that the data is placed > Caitlin> where it belongs rather than another inappropriate > Caitlin> peeking at another layer's data being the result. > > I'm pretty sure iWARP needs the rdma_ptr member for exactly > the same reason that IB needs it: to go from a struct > net_device coming from route lookup on to a struct rdma_device. > > - R. > > From rolandd at cisco.com Tue Oct 4 11:01:28 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 04 Oct 2005 11:01:28 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020968@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Tue, 4 Oct 2005 10:55:30 -0700") References: <54AD0F12E08D1541B826BE97C98F99F1020968@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <52u0fx5fgn.fsf@cisco.com> Caitlin> I think a link from the rdma_device to the net_device is Caitlin> adequate for those purposes. It's the wrong direction though. It seems kind of ugly to have to iterate through the list of rdma_devices for every route lookup, even if the list is almost always short. - R. 
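The direction argument above can be sketched in plain userspace C. This is only an illustration under stated assumptions: the two structs below are hypothetical stand-ins (not the kernel's struct net_device, and not any real RDMA structure), and find_by_scan(), find_by_ptr() and lookup_demo() are names invented for this sketch. With only a back link from the RDMA device to the netdevice, every route lookup result has to be matched by walking the device list; with a forward pointer such as the proposed rdma_ptr, the same lookup is a single dereference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only. */
struct net_device {
    const char *name;
    void *rdma_ptr;                 /* forward link, as proposed in the thread */
};

struct rdma_device {
    const char *name;
    struct net_device *netdev;      /* back link: rdma_device -> net_device */
    struct rdma_device *next;       /* singly linked list of RDMA devices */
};

/* Back link only: walk the whole list for each netdevice we are handed. */
struct rdma_device *find_by_scan(struct rdma_device *head,
                                 const struct net_device *nd)
{
    struct rdma_device *d;

    for (d = head; d != NULL; d = d->next)
        if (d->netdev == nd)
            return d;
    return NULL;
}

/* Forward link: one pointer dereference, no list walk. */
struct rdma_device *find_by_ptr(const struct net_device *nd)
{
    assert(nd != NULL);
    return nd->rdma_ptr;
}

/* Builds two devices and returns 1 if both lookup styles agree. */
int lookup_demo(void)
{
    struct net_device ib0 = { "ib0", NULL };
    struct net_device ib1 = { "ib1", NULL };
    struct rdma_device hca0 = { "hca0", &ib0, NULL };
    struct rdma_device hca1 = { "hca1", &ib1, &hca0 };

    ib0.rdma_ptr = &hca0;
    ib1.rdma_ptr = &hca1;

    return find_by_scan(&hca1, &ib0) == &hca0 &&
           find_by_ptr(&ib0) == &hca0 &&
           find_by_scan(&hca1, &ib1) == find_by_ptr(&ib1);
}
```

The sketch deliberately ignores the device-removal synchronization question raised earlier in the thread; a real forward pointer would need some locking or reference counting around it.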
From caitlinb at broadcom.com Tue Oct 4 11:06:59 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 4 Oct 2005 11:06:59 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure Message-ID: <54AD0F12E08D1541B826BE97C98F99F102096A@NT-SJCA-0751.brcm.ad.broadcom.com> Good point. That might be enough of a justification alone. And as already stated, I'm convinced there will be other uses. > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, October 04, 2005 11:01 AM > To: Caitlin Bestler > Cc: Sean Hefty; openib-general at openib.org > Subject: Re: [openib-general] [PATCH] netdevice.h: Add RDMA > private pointer to the net_device structure > > Caitlin> I think a link from the rdma_device to the net_device is > Caitlin> adequate for those purposes. > > It's the wrong direction though. It seems kind of ugly to > have to iterate through the list of rdma_devices for every > route lookup, even if the list is almost always short. > > - R. > > From viswa.krish at gmail.com Tue Oct 4 11:17:32 2005 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Tue, 4 Oct 2005 11:17:32 -0700 Subject: [openib-general] Vendor specific MAD support Message-ID: <4df28be40510041117t6f01b70fu488228a16b83b6b9@mail.gmail.com> Does openIB Gen2 stack umad/mad library support Vendor specific MAD extensions ? -Viswa -------------- next part -------------- An HTML attachment was scrubbed...
URL: From rolandd at cisco.com Tue Oct 4 11:20:08 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 04 Oct 2005 11:20:08 -0700 Subject: [openib-general] Vendor specific MAD support In-Reply-To: <4df28be40510041117t6f01b70fu488228a16b83b6b9@mail.gmail.com> (Viswanath Krishnamurthy's message of "Tue, 4 Oct 2005 11:17:32 -0700") References: <4df28be40510041117t6f01b70fu488228a16b83b6b9@mail.gmail.com> Message-ID: <52psql5elj.fsf@cisco.com> Viswanath> Does openIB Gen2 stack umad/mad library support Vendor Viswanath> specific MAD extensions ? The kernel's userspace MAD interface allows userspace to send and receive arbitrary MADs containing any data at all that userspace wants. I'm not sure what the existing libraries expose, but it's rather trivial to code directly to the kernel interface if required. - R. From halr at voltaire.com Tue Oct 4 11:54:32 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Oct 2005 14:54:32 -0400 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <4342AF8B.60900@ichips.intel.com> References: <54AD0F12E08D1541B826BE97C98F99F1020956@NT-SJCA-0751.brcm.ad.broadcom.com> <4342AF8B.60900@ichips.intel.com> Message-ID: <1128452025.4397.4580.camel@hal.voltaire.com> On Tue, 2005-10-04 at 12:36, Sean Hefty wrote: > Caitlin Bestler wrote: > > That makes sense. The only thing really missing is clarifying > > the intended scope of this data. I believe that the intent is > > for it to be transport specific, but not device specific. Is > > that correct? > > I'm trying to understand who would use this field and what it would contain. > From discussions so far, it looks like only an IP to IB address translation > mechanism would need it. And the only value that's required seems to be the > pkey. Other values could be returned as well to possibly simplify things, but > not sure that anything else is required. Also, ib_device and port as well as PKey. 
-- Hal From halr at voltaire.com Tue Oct 4 11:58:28 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Oct 2005 14:58:28 -0400 Subject: [openib-general] Vendor specific MAD support In-Reply-To: <4df28be40510041117t6f01b70fu488228a16b83b6b9@mail.gmail.com> References: <4df28be40510041117t6f01b70fu488228a16b83b6b9@mail.gmail.com> Message-ID: <1128452307.4397.4593.camel@hal.voltaire.com> On Tue, 2005-10-04 at 14:17, Viswanath Krishnamurthy wrote: > Does openIB Gen2 stack umad/mad library support Vendor specific MAD > extensions ? libibmad has some support for vendor MADs: uint8_t * ib_vendor_call(void *data, ib_portid_t *portid, ib_vendor_call_t *call) where: typedef struct ib_vendor_call { uint method; uint mgmt_class; uint attrid; uint mod; uint32_t oui; uint timeout; ib_rmpp_hdr_t rmpp; } ib_vendor_call_t; You can look at ibping or ibsysstat (under diags) for use of this. -- Hal From mshefty at ichips.intel.com Tue Oct 4 12:26:33 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 04 Oct 2005 12:26:33 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <1128452025.4397.4580.camel@hal.voltaire.com> References: <54AD0F12E08D1541B826BE97C98F99F1020956@NT-SJCA-0751.brcm.ad.broadcom.com> <4342AF8B.60900@ichips.intel.com> <1128452025.4397.4580.camel@hal.voltaire.com> Message-ID: <4342D769.2030700@ichips.intel.com> Hal Rosenstock wrote: >>I'm trying to understand who would use this field and what it would contain. >> From discussions so far, it looks like only an IP to IB address translation >>mechanism would need it. And the only value that's required seems to be the >>pkey. Other values could be returned as well to possibly simplify things, but >>not sure that anything else is required. > > Also, ib_device and port as well as PKey. The device and port can be retrieved by looking up the GID in a local device list, though it's a little inefficient. 
I agree that these 3 values are ideal, but not sure that having them helps. (And returning the device pointer could actually lead to misuse.) What's still not clear to me is how an ib_device pointer would be used with respect to device removal. Ultimately a client needs to get a pointer to an ib_device that they can use for QP allocation, etc. I think that we need to examine the problem from a ULP's perspective, versus going up a single layer in the stack. For example, currently the CMA queries an address translation service to convert an IP address into a GID. The CMA searches its device list until it finds a match on the GID. This permits synchronization with device removal. Given the current device registration interface, it seems that a search through a device list is needed at some point. The only alternative that I can think of is to make use of a more complex reference counting scheme. - Sean From davem at davemloft.net Tue Oct 4 12:39:05 2005 From: davem at davemloft.net (David S. Miller) Date: Tue, 04 Oct 2005 12:39:05 -0700 (PDT) Subject: [openib-general] Re: [PATCH] ipv4/fib_frontend.c: (Re)export ip_dev_find for 2.6.14 In-Reply-To: <1128438073.4397.4105.camel@hal.voltaire.com> References: <1128438073.4397.4105.camel@hal.voltaire.com> Message-ID: <20051004.123905.56817889.davem@davemloft.net> From: Hal Rosenstock Date: 04 Oct 2005 11:01:14 -0400 > (There is emerging functionality (not yet pushed upstream) in the IB > subsystem which relies on this being available. ip_dev_find is used to > find a valid IPoIB device when the outgoing device returned by the route > lookup (ip_route_output_key) is using the loopback interface. A valid > IPoIB device is needed to perform sending an ARP and doing an IB path > lookup so that an IB connection can be made). Then add this when this "emerging functionality" is pushed upstream. 
From wcchen at us.ibm.com Tue Oct 4 13:09:16 2005 From: wcchen at us.ibm.com (Winston Chen) Date: Tue, 4 Oct 2005 16:09:16 -0400 Subject: [openib-general] libibat/libibcm build mess Message-ID: Hi, Hal: Where can I find functions class_create() and class_device_create() called by ~/infiniband/core/uat.c ? Thanks, Winston Chen IBM RS/6000 SP Development 522 South Road, MS P963 Poughkeepsie, New York 12601 Tel: 1-845-433-8071 email: wcchen at us.ibm.com
From rolandd at cisco.com Tue Oct 4 13:25:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 04 Oct 2005 13:25:54 -0700 Subject: [openib-general] libibat/libibcm build mess In-Reply-To: (Winston Chen's message of "Tue, 4 Oct 2005 16:09:16 -0400") References: Message-ID: <52br25t4fh.fsf@cisco.com> Winston> Where can I find functions class_create() and Winston> class_device_create() called by ~/infiniband/core/uat.c ? They're in include/linux/device.h. What kernel version are you using? - R. From halr at voltaire.com Tue Oct 4 15:00:00 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Oct 2005 18:00:00 -0400 Subject: [openib-general] libibat/libibcm build mess In-Reply-To: References: Message-ID: <1128463199.4397.4605.camel@hal.voltaire.com> Hi Winston, On Tue, 2005-10-04 at 16:09, Winston Chen wrote: > Where can I find functions class_create() and class_device_create() > called by > ~/infiniband/core/uat.c ? Those functions are in 2.6.13 and beyond. Are you using a kernel older than that ?
There is a backpatch available: https://openib.org/svn/gen2/branches/backport/2.6.12/uat_3465_to_2_6_12.patch -- Hal > > Thanks, > > Winston Chen > IBM RS/6000 SP Development > 522 South Road, MS P963 > Poughkeepsie, New York 12601 > Tel: 1-845-433-8071 > email: wcchen at us.ibm.com > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rolandd at cisco.com Tue Oct 4 16:46:15 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 04 Oct 2005 16:46:15 -0700 Subject: [openib-general] [git pull] InfiniBand updates for 2.6.14 Message-ID: <52u0fwsv5k.fsf@cisco.com> Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get the following changes (full patch below): Michael S. 
Tsirkin: [IB] mthca: Fix memory leak on device close Roland Dreier: [IPoIB] Rename IPoIB's path_lookup() to avoid name clashes drivers/infiniband/hw/mthca/mthca_main.c | 45 ++++++++++++++--------------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 4 +-- 2 files changed, 23 insertions(+), 26 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -503,6 +503,25 @@ err_free_aux: return err; } +static void mthca_free_icms(struct mthca_dev *mdev) +{ + u8 status; + + mthca_free_icm_table(mdev, mdev->mcg_table.table); + if (mdev->mthca_flags & MTHCA_FLAG_SRQ) + mthca_free_icm_table(mdev, mdev->srq_table.table); + mthca_free_icm_table(mdev, mdev->cq_table.table); + mthca_free_icm_table(mdev, mdev->qp_table.rdb_table); + mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); + mthca_free_icm_table(mdev, mdev->qp_table.qp_table); + mthca_free_icm_table(mdev, mdev->mr_table.mpt_table); + mthca_free_icm_table(mdev, mdev->mr_table.mtt_table); + mthca_unmap_eq_icm(mdev); + + mthca_UNMAP_ICM_AUX(mdev, &status); + mthca_free_icm(mdev, mdev->fw.arbel.aux_icm); +} + static int __devinit mthca_init_arbel(struct mthca_dev *mdev) { struct mthca_dev_lim dev_lim; @@ -580,18 +599,7 @@ static int __devinit mthca_init_arbel(st return 0; err_free_icm: - if (mdev->mthca_flags & MTHCA_FLAG_SRQ) - mthca_free_icm_table(mdev, mdev->srq_table.table); - mthca_free_icm_table(mdev, mdev->cq_table.table); - mthca_free_icm_table(mdev, mdev->qp_table.rdb_table); - mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); - mthca_free_icm_table(mdev, mdev->qp_table.qp_table); - mthca_free_icm_table(mdev, mdev->mr_table.mpt_table); - mthca_free_icm_table(mdev, mdev->mr_table.mtt_table); - mthca_unmap_eq_icm(mdev); - - mthca_UNMAP_ICM_AUX(mdev, &status); - mthca_free_icm(mdev, mdev->fw.arbel.aux_icm); + mthca_free_icms(mdev); err_stop_fw: mthca_UNMAP_FA(mdev, 
&status); @@ -611,18 +619,7 @@ static void mthca_close_hca(struct mthca mthca_CLOSE_HCA(mdev, 0, &status); if (mthca_is_memfree(mdev)) { - if (mdev->mthca_flags & MTHCA_FLAG_SRQ) - mthca_free_icm_table(mdev, mdev->srq_table.table); - mthca_free_icm_table(mdev, mdev->cq_table.table); - mthca_free_icm_table(mdev, mdev->qp_table.rdb_table); - mthca_free_icm_table(mdev, mdev->qp_table.eqp_table); - mthca_free_icm_table(mdev, mdev->qp_table.qp_table); - mthca_free_icm_table(mdev, mdev->mr_table.mpt_table); - mthca_free_icm_table(mdev, mdev->mr_table.mtt_table); - mthca_unmap_eq_icm(mdev); - - mthca_UNMAP_ICM_AUX(mdev, &status); - mthca_free_icm(mdev, mdev->fw.arbel.aux_icm); + mthca_free_icms(mdev); mthca_UNMAP_FA(mdev, &status); mthca_free_icm(mdev, mdev->fw.arbel.fw_icm); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -474,7 +474,7 @@ err: spin_unlock(&priv->lock); } -static void path_lookup(struct sk_buff *skb, struct net_device *dev) +static void ipoib_path_lookup(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(skb->dev); @@ -569,7 +569,7 @@ static int ipoib_start_xmit(struct sk_bu if (skb->dst && skb->dst->neighbour) { if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { - path_lookup(skb, dev); + ipoib_path_lookup(skb, dev); goto out; } From halr at voltaire.com Wed Oct 5 03:25:03 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Oct 2005 06:25:03 -0400 Subject: [openib-general] [PATCH] ipv4/fib_frontend.c: (Re)export ip_dev_find for 2.6.14 Message-ID: <1128507902.4397.5400.camel@hal.voltaire.com> Hi, The following patch is currently needed for 2.6.14-rc3 (for SDP and AT). 
I placed this in gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff -- Hal ipv4/fib_frontend.c: (Re)export ip_dev_find for 2.6.14 This was removed at 2.6.14 as part of a general cleanup as noone outside of IP currently is using this (but SDP and AT currently do) Signed-off-by: Hal Rosenstock --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -661,4 +661,5 @@ void __init ip_fib_init(void) } EXPORT_SYMBOL(inet_addr_type); +EXPORT_SYMBOL(ip_dev_find); EXPORT_SYMBOL(ip_rt_ioctl);
From xma at us.ibm.com Wed Oct 5 08:54:37 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Oct 2005 09:54:37 -0600 Subject: [openib-general] [PATCH]small cleanup in cache.c Message-ID: The first time ib_cache_update being called both old_pkey_cache & old_gid_cache are NULL. Signed-off-by: Shirley Ma (xma at us.ibm.com) diff -uprN infiniband/core/cache.c infiniband-patch/core/cache.c --- infiniband/core/cache.c 2005-10-05 06:59:34.000000000 -0700 +++ infiniband-patch/core/cache.c 2005-10-05 08:55:42.550693304 -0700 @@ -252,8 +252,10 @@ static void ib_cache_update(struct ib_de write_unlock_irq(&device->cache.lock); - kfree(old_pkey_cache); - kfree(old_gid_cache); + if (old_pkey_cache) + kfree(old_pkey_cache); + if (old_gid_cache) + kfree(old_gid_cache); kfree(tprops); return; Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: freecache.patch Type: application/octet-stream Size: 522 bytes Desc: not available URL: From twbowman at gmail.com Wed Oct 5 08:56:14 2005 From: twbowman at gmail.com (Todd Bowman) Date: Wed, 5 Oct 2005 09:56:14 -0600 Subject: [openib-general] ib_cm_listen failure In-Reply-To: References: <433C2ADF.4010402@ichips.intel.com>

Message-ID: On 9/30/05, James Lentini wrote: > > > > On Fri, 30 Sep 2005, Todd Bowman wrote: > > > udapl is using 0x115d3. How is this set and what value should it be? > > > > Todd > > On InfiniBand, uDAPL maps connection qualifiers onto service IDs > (SIDs). > > The connection qualifier is chosen by the uDAPL application when it > creates a Public Service Point (PSP) or Reserved Service Point (RSP). > > As Arlin noted, 0x115d3 is in the SDP range. The dapltest test tools > uses 0xB0de. I would try any value except those in the range > 0x10000-0x1fffff and 0xB0de. > > james > Here is a patch for dtest.c to remove the qualifier from the sdp range. Index: userspace/dapl/test/dtest/dtest.c =================================================================== --- userspace/dapl/test/dtest/dtest.c (revision 3547) +++ userspace/dapl/test/dtest/dtest.c (working copy) @@ -53,7 +53,7 @@ #include "dat/udat.h" /* definitions */ -#define SERVER_CONN_QUAL 71123 +#define SERVER_CONN_QUAL 45248 #define DTO_TIMEOUT (1000*1000*5) #define DTO_FLUSH_TIMEOUT (1000*1000*2) #define CONN_TIMEOUT (1000*1000*10) -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Wed Oct 5 09:04:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 09:04:54 -0700 Subject: [openib-general] [PATCH]small cleanup in cache.c In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 09:54:37 -0600") References: Message-ID: <52br24rluh.fsf@cisco.com> > - kfree(old_pkey_cache); > - kfree(old_gid_cache); > + if (old_pkey_cache) > + kfree(old_pkey_cache); > + if (old_gid_cache) > + kfree(old_gid_cache); This isn't needed and in fact having this check is considered bad kernel style. The first thing kfree() does is check if the pointer is NULL, so duplicating this check in the caller just makes the code bigger. - R. 
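The point about kfree() can be illustrated with a minimal userspace analogue, assuming the C library's free() as a stand-in for kfree(). Standard C defines free(NULL) as a no-op, which mirrors the kernel behaviour described above. The struct and the function name cache_update() below are hypothetical and are not the code in infiniband/core/cache.c; they only mimic the "swap in new tables, release the old ones" pattern, where the old pointers are NULL on the first call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a per-device cache; not the kernel structure. */
struct cache {
    char *pkey_cache;
    char *gid_cache;
};

/*
 * Install fresh copies of both tables and release the previous ones.
 * On the first call the previous pointers are NULL, and free(NULL)
 * does nothing, so no "if (ptr)" guard is needed before freeing.
 * Returns 0 on success, -1 if an allocation failed.
 */
int cache_update(struct cache *c, const char *pkeys, const char *gids)
{
    char *new_pkey, *new_gid, *old_pkey, *old_gid;

    assert(c != NULL);

    new_pkey = malloc(strlen(pkeys) + 1);
    new_gid = malloc(strlen(gids) + 1);
    if (new_pkey == NULL || new_gid == NULL) {
        free(new_pkey);             /* either one may be NULL here */
        free(new_gid);
        return -1;
    }
    strcpy(new_pkey, pkeys);
    strcpy(new_gid, gids);

    old_pkey = c->pkey_cache;
    old_gid = c->gid_cache;
    c->pkey_cache = new_pkey;
    c->gid_cache = new_gid;

    free(old_pkey);                 /* NULL on the first update */
    free(old_gid);
    return 0;
}
```

Calling cache_update() twice on a zero-initialized struct cache exercises both cases: the first call frees two NULL pointers, the second frees the tables installed by the first.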
From xma at us.ibm.com Wed Oct 5 09:09:02 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Oct 2005 09:09:02 -0700 Subject: [openib-general] [PATCH]small cleanup in cache.c In-Reply-To: <52br24rluh.fsf@cisco.com> Message-ID: Yes, as long as it's on Linux it's safe. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From xma at us.ibm.com Wed Oct 5 09:52:53 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Oct 2005 09:52:53 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA Message-ID: One HCA could support 256 ports. The current implementation doesn't support partially successful ports, which is a waste if any one of the ports fails. Also, adding some break points to induce errors in each client during registration triggers some of the potential problems. Here is my proposal to enable partial ports. Basically the upper user's physical ports number is going to be replaced by the successful ports bitmap of the client it depends on. I have done some research on each client for enabling partial ports on an HCA, created some patches, and tested the idea. Please correct me if my understanding is wrong. Also, if you have other ideas, please share. cache_client: This client allows partial ports. But ib_cache_update() might fail on a port whose pkey_cache or gid_cache fails to be generated, so all the upper level users can only be allowed on the successful ports, not on the HCA's physical ports number. There are 9 upper users there, they are: ib_srp,ib_sdp,ib_uverbs,ib_umad,ib_cm, ib_ipoib,ib_sa,ib_mad. mad_client: This client doesn't allow partial ports. I would like to suggest enabling a port only when both QP0 and QP1 are set up successfully. I don't know where QP0 can be used while QP1 is absent. (You can tell me if there is a case.) The upper users are ib_umad, ib_cm, ib_sa.
cm_client: This client doesn't allow partial ports. To enable partial ports, these upper users ib_ucm, ib_srp, ib_sdp can be only allowed on the successful ports. sa_client: This client doesn't allow partial ports. To enable partial ports, these upper users ib_ipoib, ib_srp, ib_sdp, ib_at can be only allowed on the successful ports. ipoib_client: This client does allow partial ports. The number of physical ports should be replaced by each client's successful ports. For example ipoib_client will be allowed on sa_client ports bitmap, sa_client will be allowed on mad_client ports bitmap, mad_client will be allowed on cache_client ports bitmap. Adding bitmap field is not necessary, the ib_cache, ib_device, ib_sa_device, cm_device stored all the ports info. ib_uat & kdapl & ib_ping should be updated too. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From pradeep at us.ibm.com Wed Oct 5 10:40:34 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Wed, 5 Oct 2005 10:40:34 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: Message-ID: One thing that strikes me is to have a single "bit map" (or it's equivalent, implemented in say ib_device). This single "bit map" corresponds to the physical ports. So, each of the higher level modules only references this "bit map" and one does not have mad client "bit map", sa client "bit map" and so on -is my understanding of your proposal correct? With multiple "bit maps" isn't there a risk of these not being in sync, resulting in hard to detect problems? Pradeep pradeep at us.ibm.com openib-general-bounces at openib.org wrote on 10/05/2005 09:52:53 AM: > > One HCA could support 256 ports. The current implementation doesn't > support partially successful ports, which would be a waste if any of > the port failure. 
And after adding some break points to induce > errors in each client during registration, some of the potential > problems will be triggered. Here is my proposal to enable partial > ports. Basically the upper user's physical ports number is going to > replaced by the successful ports bitmap of the client it depends on. > I have done some research on each client for enabling partially > ports on HCA, and created some patches and tested the idea. Please > correct if my understanding is wrong. Also if you have other idea, > please share. > > cache_client: This client allows partially ports. But > ib_cache_update() might fail on a port whose pkey_cache, gid_cache > fail to be generated, so all the upper level users can be only > allowed on the successful ports not the HCA's physical ports number. > There are 9 upper users there, they are: ib_srp,ib_sdp,ib_uverbs, > ib_umad,ib_cm, ib_ipoib,ib_sa,ib_mad. > > mad_client: This client doesn't allow partially ports. I would like > to suggestion only enable the ports when both QP0&QP1 are > successful. Don't know where QP0 can be used while QP1 is absent. > (You can tell me if there is a case.) The upper users are ib_umad, > ib_cm, ib_sa. > > cm_client: This client doesn't allow partial ports. To enable > partial ports, these upper users ib_ucm, ib_srp, ib_sdp can be only > allowed on the successful ports. > > sa_client: This client doesn't allow partial ports. To enable > partial ports, these upper users ib_ipoib, ib_srp, ib_sdp, ib_at can > be only allowed on the successful ports. > > ipoib_client: This client does allow partial ports. > > The number of physical ports should be replaced by each client's > successful ports. For example ipoib_client will be allowed on > sa_client ports bitmap, sa_client will be allowed on mad_client > ports bitmap, mad_client will be allowed on cache_client ports bitmap. > > Adding bitmap field is not necessary, the ib_cache, ib_device, > ib_sa_device, cm_device stored all the ports info. 
ib_uat & kdapl & > ib_ping should be updated too. > > Thanks > Shirley Ma > IBM Linux Technology Center > 15300 SW Koll Parkway > Beaverton, OR 97006-6063 > Phone(Fax): (503) 578-7638
From rolandd at cisco.com Wed Oct 5 10:50:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 10:50:05 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 09:52:53 -0700") References: Message-ID: <523bnfsvjm.fsf@cisco.com> Shirley> One HCA could support 256 ports. The current Shirley> implementation doesn't support partially successful Shirley> ports, which would be a waste if any of the port Shirley> failure. What does "port failure" mean? If it just means that the port is not active, then I think the drivers should still be able to use the port. I don't know of anything in the IB spec that says a port can only be used if its link is up. It seems fantastically unlikely that we'll see some HCA failure that means a particular port can never be used but the rest of the HCA continues to work. So I don't think it's worth spending time on that either. Right now my feeling is that we don't want to add the complication entailed by having to track individual HCA ports, just to work around a certain hardware/firmware quirk (which I would argue is in fact a bug). - R.
From rolandd at cisco.com Wed Oct 5 10:51:29 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 10:51:29 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 09:52:53 -0700") References: Message-ID: <52y857rgwu.fsf@cisco.com> Shirley> mad_client: This client doesn't allow partially ports. I Shirley> would like to suggestion only enable the ports when both Shirley> QP0&QP1 are successful. Don't know where QP0 can be used Shirley> while QP1 is absent. (You can tell me if there is a Shirley> case.) The upper users are ib_umad, ib_cm, ib_sa. If the drivers can't access QP0 until the port is active, how does one run an SM? - R. From halr at voltaire.com Wed Oct 5 10:54:56 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Oct 2005 13:54:56 -0400 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <52y857rgwu.fsf@cisco.com> References: <52y857rgwu.fsf@cisco.com> Message-ID: <1128534895.4400.399.camel@hal.voltaire.com> On Wed, 2005-10-05 at 13:51, Roland Dreier wrote: > Shirley> mad_client: This client doesn't allow partially ports. I > Shirley> would like to suggestion only enable the ports when both > Shirley> QP0&QP1 are successful. Don't know where QP0 can be used > Shirley> while QP1 is absent. (You can tell me if there is a > Shirley> case.) The upper users are ib_umad, ib_cm, ib_sa. > > If the drivers can't access QP0 until the port is active, how does one > run an SM? or perhaps also a software based SMA ? -- Hal From surs at cse.ohio-state.edu Wed Oct 5 11:36:52 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 5 Oct 2005 14:36:52 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq Message-ID: <20051005183649.GA9036@cse.ohio-state.edu> Hello, This is in regard to the use of `ibv_modify_srq' call. When I use this call, I get a segmentation fault. 
I have included the code snippet, output of strace -ewrite=all command
and dmesg output below. I'd be glad if someone could help me get around
the problem. Please let me know if additional debug information is
required.

TIA,
Sayantan.

Platform: Opteron 2.2GHz, Tyan S2895 motherboard, 2GB memory
OS: Linux 2.6.13.1-smp, SuSe 9.3
Firmware: 5.1.0
OpenIB svn rev: 3665 (the revision number might be off by a little, but
this version was checked out yesterday evening 04/10).

Code Snippet:
=============

static void create_srq(void)
{
    struct ibv_srq_init_attr srq_init_attr;
    struct ibv_srq_attr srq_attr;

    memset(&srq_init_attr, 0, sizeof(srq_init_attr));
    memset(&srq_attr, 0, sizeof(srq_attr));

    srq_init_attr.srq_context = ibv_dev.context;
    srq_init_attr.attr.max_wr = viadev_rq_size; // is 300.
    srq_init_attr.attr.max_sge = 1;
    srq_init_attr.attr.srq_limit = 10;

    ibv_dev.srq_hndl = ibv_create_srq(ibv_dev.ptag, &srq_init_attr);

    if(!ibv_dev.srq_hndl) {
        error_abort_all(GEN_EXIT_ERR, "Error creating SRQ\n");
    }

    srq_attr.max_wr = viadev_rq_size;
    srq_attr.max_sge = 1;
    srq_attr.srq_limit = 10;

    // Fails after this call
    if(ibv_modify_srq(ibv_dev.srq_hndl, &srq_attr, IBV_SRQ_LIMIT)) {
        error_abort_all(GEN_EXIT_ERR, "Couldn't modify SRQ limit\n");
    }

    fprintf(stderr,"[%d] limit %d\n", ibv_dev.me, srq_attr.srq_limit);
}

===========
Strace output
===========

[surs at ro0:osu_benchmarks] ../bin/mpirun_rsh -np 2 ro0 ro1 strace -ewrite -ewrite=all ./lat
write(3, "\0\0\0\0\4\0\4\0PT\317\377\377\177\0\0", 16write(3, "\0\0\0\0\4\0\4\0\20\370\233\377\377\177\0\0", 16) = 16
 | 00000 00 00 00 00 04 00 04 00 10 f8 9b ff ff 7f 00 00 ........ ........ |
write(3, "\3\0\0\0\4\0\3\0\320\367\233\377\377\177\0\0", 16) = 16
 | 00000 03 00 00 00 04 00 03 00 d0 f7 9b ff ff 7f 00 00 ........ ........ |
write(3, "\3\0\0\0\4\0\3\0 \370\233\377\377\177\0\0", 16) = 16
 | 00000 03 00 00 00 04 00 03 00 20 f8 9b ff ff 7f 00 00 ........ .......
| write(3, "\2\0\0\0\6\0\n\0\340\367\233\377\377\177\0\0\1\335\324"..., 24) = 24 | 00000 02 00 00 00 06 00 0a 00 e0 f7 9b ff ff 7f 00 00 ........ ........ | | 00010 01 dd d4 00 00 00 00 00 ........ | ) = 16 | 00000 00 00 00 00 04 00 04 00 50 54 cf ff ff 7f 00 00 ........ PT...... | write(3, "\3\0\0\0\4\0\3\0\20T\317\377\377\177\0\0", 16) = 16 | 00000 03 00 00 00 04 00 03 00 10 54 cf ff ff 7f 00 00 ........ .T...... | write(3, "\3\0\0\0\4\0\3\0`T\317\377\377\177\0\0", 16) = 16 | 00000 03 00 00 00 04 00 03 00 60 54 cf ff ff 7f 00 00 ........ `T...... | write(3, "\2\0\0\0\6\0\n\0 T\317\377\377\177\0\0\1\335\324\0\0\0"..., 24) = 24 | 00000 02 00 00 00 06 00 0a 00 20 54 cf ff ff 7f 00 00 ........ T...... | | 00010 01 dd d4 00 00 00 00 00 ........ | write(3, "\t\0\0\0\f\0\3\0 S\317\377\377\177\0\0\0\20\325\0\0\0\0"..., 48) = 48 | 00000 09 00 00 00 0c 00 03 00 20 53 cf ff ff 7f 00 00 ........ S...... | write(3, "\t\0\0\0\f\0\3\0\340\366\233\377\377\177\0\0\0\20\325\0"..., 48) = 48 | 00000 09 00 00 00 0c 00 03 00 e0 f6 9b ff ff 7f 00 00 ........ ........ | | 00010 00 10 d5 00 00 00 00 00 00 00 20 00 00 00 00 00 ........ .. ..... | | 00020 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ........ ........ | write(3, "\22\0\0\0\22\0\4\0\260\367\233\377\377\177\0\0 \331\324"..., 72) = 72 | 00000 12 00 00 00 12 00 04 00 b0 f7 9b ff ff 7f 00 00 ........ ........ | | 00010 20 d9 d4 00 00 00 00 00 ff ff 00 00 00 00 00 00 ....... ........ | | 00020 ff ff ff ff 00 00 00 00 02 26 00 4c 07 00 12 00 ........ .&.L.... | | 00030 00 40 f5 00 00 00 00 00 00 20 f5 00 00 00 00 00 . at ...... . ...... | | 00040 00 00 00 00 ff 7f 00 00 ........ | write(3, "\t\0\0\0\f\0\3\0 \367\233\377\377\177\0\0\0`\365\0\0\0"..., 48) = 48 | 00000 09 00 00 00 0c 00 03 00 20 f7 9b ff ff 7f 00 00 ........ ....... | | 00010 00 60 f5 00 00 00 00 00 00 80 00 00 00 00 00 00 .`...... ........ | | 00020 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ........ ........ 
| write(3, " \0\0\0\16\0\3\0\340\367\233\377\377\177\0\0\0\1\325\0"..., 56) = 56 | 00000 20 00 00 00 0e 00 03 00 e0 f7 9b ff ff 7f 00 00 ....... ........ | | 00010 00 01 d5 00 00 00 00 00 01 00 00 00 2c 01 00 00 ........ ....,... | | 00010 00 10 d5 00 00 00 00 00 00 00 20 00 00 00 00 00 ........ .. ..... | | 00020 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ........ ........ | write(3, "\22\0\0\0\22\0\4\0\360S\317\377\377\177\0\0 \331\324\0"..., 72) = 72 | 00000 12 00 00 00 12 00 04 00 f0 53 cf ff ff 7f 00 00 ........ .S...... | | 00010 20 d9 d4 00 00 00 00 00 ff ff 00 00 00 00 00 00 ....... ........ | | 00020 ff ff ff ff 00 00 00 00 02 26 00 4c 07 00 12 00 ........ .&.L.... | | 00030 00 40 f5 00 00 00 00 00 00 20 f5 00 00 00 00 00 . at ...... . ...... | | 00040 00 00 00 00 ff 7f 00 00 ........ | write(3, "\t\0\0\0\f\0\3\0`S\317\377\377\177\0\0\0`\365\0\0\0\0\0"..., 48) = 48 | 00000 09 00 00 00 0c 00 03 00 60 53 cf ff ff 7f 00 00 ........ `S...... | | 00010 00 60 f5 00 00 00 00 00 00 80 00 00 00 00 00 00 .`...... ........ | | 00020 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ........ ........ | write(3, " \0\0\0\16\0\3\0 T\317\377\377\177\0\0\0\1\325\0\0\0\0"..., 56) = 56 | 00020 01 00 00 00 0a 00 00 00 02 27 00 4c fe 7f 00 00 ........ .'.L.... | | 00030 00 20 f5 00 00 00 00 00 . ...... | --- SIGSEGV (Segmentation fault) @ 0 (0) --- | 00000 20 00 00 00 0e 00 03 00 20 54 cf ff ff 7f 00 00 ....... T...... | | 00010 00 01 d5 00 00 00 00 00 01 00 00 00 2c 01 00 00 ........ ....,... | | 00020 01 00 00 00 0a 00 00 00 02 27 00 4c fe 7f 00 00 ........ .'.L.... | | 00030 00 20 f5 00 00 00 00 00 . ...... 
| --- SIGSEGV (Segmentation fault) @ 0 (0) --- +++ killed by SIGSEGV +++ +++ killed by SIGSEGV +++ dmesg output ============ lat[18631]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fffff9748c8 error 14 lat[18755]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fffffb3aa58 error 14 lat[18777]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fffffe7bb88 error 14 lat[19128]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fffff942018 error 14 ============ -- http://www.cse.ohio-state.edu/~surs From rolandd at cisco.com Wed Oct 5 11:42:09 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 11:42:09 -0700 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051005183649.GA9036@cse.ohio-state.edu> (Sayantan Sur's message of "Wed, 5 Oct 2005 14:36:52 -0400") References: <20051005183649.GA9036@cse.ohio-state.edu> Message-ID: <52oe63reke.fsf@cisco.com> Sayantan> Hello, This is in regard to the use of `ibv_modify_srq' Sayantan> call. When I use this call, I get a segmentation Sayantan> fault. This is because the modify SRQ operation is not implemented at all in libmthca. Do you just want to set the SRQ limit? That's not so hard for me to implement. However, you should be aware that as far as I know, only mem-free HCAs generate the SRQ limited reached event. - R. From xma at us.ibm.com Wed Oct 5 11:56:09 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Oct 2005 11:56:09 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <523bnfsvjm.fsf@cisco.com> Message-ID: The port failure means the SW clients initilization of that port failure. Doesn't matter whether the link is up/down or the hardware/firmare problem. If encountering any of the SW errors, the upper users can't use that port correctly, or even the whole device correctly. It's easily to prove that if you set error points during client registration and start the upper users. 
The problems could be kernel hung, kernel oops. For example, if mad_client initilization ports failure and you start ipoib_client. ifconfig will hung in kernel. If sa_client failure, the ipoib multicast join will hit kernel oops. Staring the upper users without checking the depency resouce allocation is buggy. It is definitely worth to spend time to address this. And the complication is only added to the client registration. The ports info are stored in ib_device, ib_cache, ib_sa_device, cm_device, it's not hard to fix it. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
From xma at us.ibm.com Wed Oct 5 12:04:29 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Oct 2005 12:04:29 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: Message-ID: > One thing that strikes me is to have a single "bit map" (or it's equivalent, implemented in say ib_device). This single "bit map" corresponds to the physical ports. So, each of the higher level modules only references this "bit map" and one does not have mad client "bit map", sa client "bit map" and so on -is my understanding of your proposal correct? With multiple "bit maps" isn't there a risk of these not being in sync, resulting in hard to detect problems? There is not a really bitmap there. I just use it to be easily understood. The client registration has sequence. Checking resouce dependency is needed to start upper client registration on that port. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
From rolandd at cisco.com Wed Oct 5 12:06:57 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 12:06:57 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 11:56:09 -0700") References: Message-ID: <52k6grrdf2.fsf@cisco.com> Shirley> The port failure means the SW clients initilization of Shirley> that port failure. Doesn't matter whether the link is Shirley> up/down or the hardware/firmare problem. If encountering Shirley> any of the SW errors, the upper users can't use that port Shirley> correctly, or even the whole device correctly. It's Shirley> easily to prove that if you set error points during Shirley> client registration and start the upper users. The Shirley> problems could be kernel hung, kernel oops. For example, Shirley> if mad_client initilization ports failure and you start Shirley> ipoib_client. ifconfig will hung in kernel. If sa_client Shirley> failure, the ipoib multicast join will hit kernel Shirley> oops. Staring the upper users without checking the Shirley> depency resouce allocation is buggy. It is definitely Shirley> worth to spend time to address this. Yes, I agree we should fix the bugs in error handling during registration. However, I don't think that a mask of ports is the right answer -- it doesn't seem to address the real issue. We should just make sure that if, say, the MAD layer fails to initialize a device, then all clients that depend on the MAD layer don't try to use that device. I'm not sure what the right way to express these dependencies is, however. - R.
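One way the dependencies mentioned here might be written down is a small table with one bitmask per layer. The sketch below is purely illustrative. The enum, the deps[] table and layer_usable() are hypothetical names, and the edges are only the ones listed earlier in this thread (SA and CM sit on the MAD layer, IPoIB on SA, SRP and SDP on both CM and SA). None of it is existing OpenIB code.

```c
#include <assert.h>

/* Hypothetical layer identifiers, for illustration only. */
enum layer { L_MAD, L_SA, L_CM, L_IPOIB, L_SRP, L_SDP, L_COUNT };

#define BIT(l) (1u << (l))

/* deps[l] is the set of layers that layer l uses directly. */
static const unsigned deps[L_COUNT] = {
    [L_MAD]   = 0,
    [L_SA]    = BIT(L_MAD),
    [L_CM]    = BIT(L_MAD),
    [L_IPOIB] = BIT(L_SA),
    [L_SRP]   = BIT(L_CM) | BIT(L_SA),
    [L_SDP]   = BIT(L_CM) | BIT(L_SA),
};

/*
 * failed: bitmask of layers whose own initialization failed on a device.
 * Returns 1 if layer l and everything beneath it initialized, else 0.
 * The table is acyclic, so the recursion terminates.
 */
int layer_usable(enum layer l, unsigned failed)
{
    int d;

    assert(l < L_COUNT);
    if (failed & BIT(l))
        return 0;
    for (d = 0; d < L_COUNT; d++)
        if ((deps[l] & BIT(d)) && !layer_usable((enum layer)d, failed))
            return 0;
    return 1;
}
```

With failed = BIT(L_CM), layer_usable() reports L_IPOIB as usable and L_SRP as not, since only the latter has CM underneath it in this table.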
From rolandd at cisco.com Wed Oct 5 12:09:40 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 12:09:40 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 12:04:29 -0700") References: Message-ID: <52fyrfrdaj.fsf@cisco.com> Shirley> There is not a really bitmap there. I just use it to be Shirley> easily understood. The client registration has Shirley> sequence. Checking resouce dependency is needed to start Shirley> upper client registration on that port. It's not a strict sequence, however. If the CM fails to initialize a device, then SDP and SRP cannot use that device. However, IPoIB can use the device just fine, even if it loads after the CM. Similarly, if SDP fails to initialize a device, then SRP should not be affected even if it loads after SDP. And so on. - R. From ftillier at silverstorm.com Wed Oct 5 12:10:48 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 5 Oct 2005 12:10:48 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: Message-ID: <000a01c5c9e0$8014c7f0$9601470a@infiniconsys.com> > From: Shirley Ma [mailto:xma at us.ibm.com] > Sent: Wednesday, October 05, 2005 11:56 AM > > The port failure means the SW clients initilization of that port failure. > Doesn't matter whether the link is up/down or the hardware/firmare problem. If > encountering any of the SW errors, the upper users can't use that port > correctly, or even the whole device correctly. It's easily to prove that if > you set error points during client registration and start the upper users. The > problems could be kernel hung, kernel oops. For example, if mad_client > initilization ports failure and you start ipoib_client. ifconfig will hung in > kernel. If sa_client failure, the ipoib multicast join will hit kernel oops. > Staring the upper users without checking the depency resouce allocation is > buggy. 
It is definitely worth to spend time to address this. This sounds like bugs in the code where we don't trap failures gracefully. I think fixing that is probably much more useful. There will always be situations where runtime errors can occur (memory allocation failure, for example), and all upper level protocols must handle failures of these calls. Putting in code and requiring every client to compare all the various bit fields they're interested in doesn't remove the need for proper error handling. Proper error handling should resolve both the ifconfig hang and multicast join oops. Just my $0.02 - Fab From surs at cse.ohio-state.edu Wed Oct 5 12:09:37 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 5 Oct 2005 15:09:37 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <52oe63reke.fsf@cisco.com> References: <20051005183649.GA9036@cse.ohio-state.edu> <52oe63reke.fsf@cisco.com> Message-ID: <20051005190934.GA9412@cse.ohio-state.edu> Roland, * On Oct,2 Roland Dreier wrote : > Sayantan> Hello, This is in regard to the use of `ibv_modify_srq' > Sayantan> call. When I use this call, I get a segmentation > Sayantan> fault. > > This is because the modify SRQ operation is not implemented at all in > libmthca. Do you just want to set the SRQ limit? That's not so hard > for me to implement. However, you should be aware that as far as I > know, only mem-free HCAs generate the SRQ limited reached event. Thanks for your reply. Yes, I want to set a SRQ limit. Yes, I am aware that only mem-free HCAs generate SRQ limit reached event. I am trying this on a Mem-free HCA. If you could implement this feature, that would be really great! Thanks, Sayantan. > > - R. 
-- http://www.cse.ohio-state.edu/~surs From ftillier at silverstorm.com Wed Oct 5 12:15:49 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 5 Oct 2005 12:15:49 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <52k6grrdf2.fsf@cisco.com> Message-ID: <000b01c5c9e1$332eb210$9601470a@infiniconsys.com> > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, October 05, 2005 12:07 PM > > Shirley> The port failure means the SW clients initilization of > Shirley> that port failure. Doesn't matter whether the link is > Shirley> up/down or the hardware/firmare problem. If encountering > Shirley> any of the SW errors, the upper users can't use that port > Shirley> correctly, or even the whole device correctly. It's > Shirley> easily to prove that if you set error points during > Shirley> client registration and start the upper users. The > Shirley> problems could be kernel hung, kernel oops. For example, > Shirley> if mad_client initilization ports failure and you start > Shirley> ipoib_client. ifconfig will hung in kernel. If sa_client > Shirley> failure, the ipoib multicast join will hit kernel > Shirley> oops. Staring the upper users without checking the > Shirley> depency resouce allocation is buggy. It is definitely > Shirley> worth to spend time to address this. > > Yes, I agree we should fix the bugs in error handling during > registration. However, I don't think that a mask of ports is the > right answer -- it doesn't seem to address the real issue. We should > just make sure that if, say, the MAD layer fails to initialize a > device, then all clients that depend on the MAD layer don't try to use > that device. Shouldn't a user get an error (not an oops) if they try to use the MAD layer for a device that didn't initialize properly within the MAD layer? Doesn't the MAD layer trap that device requests are valid? 
It seems that adding such checks would be much simpler to implement, rather than trying to figure out how to express these limitations to the various ULPs. - Fab From mshefty at ichips.intel.com Wed Oct 5 12:16:21 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 05 Oct 2005 12:16:21 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <52k6grrdf2.fsf@cisco.com> References: <52k6grrdf2.fsf@cisco.com> Message-ID: <43442685.1070406@ichips.intel.com> Roland Dreier wrote: > Yes, I agree we should fix the bugs in error handling during > registration. However, I don't think that a mask of ports is the > right answer -- it doesn't seem to address the real issue. We should > just make sure that if, say, the MAD layer fails to initialize a > device, then all clients that depend on the MAD layer don't try to use > that device. I'm not sure what the right way to express these > dependencies is, however. One possibility is to have each layer verify the device/port parameters. The MAD layer can verify that the specified device/port are valid in ib_register_mad_agent(). Similar for other other modules. We also have the port capability mask available that could be used. - Sean From rolandd at cisco.com Wed Oct 5 12:16:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 12:16:24 -0700 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051005190934.GA9412@cse.ohio-state.edu> (Sayantan Sur's message of "Wed, 5 Oct 2005 15:09:37 -0400") References: <20051005183649.GA9036@cse.ohio-state.edu> <52oe63reke.fsf@cisco.com> <20051005190934.GA9412@cse.ohio-state.edu> Message-ID: <52br23rczb.fsf@cisco.com> Sayantan> If you could implement this feature, that would be Sayantan> really great! OK, there's not much left to do. I should have something to check in today. I'll let you know when it's ready. - R. 
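The dmesg lines in the original SRQ report show rip 0000000000000000, which is what a call through a NULL function pointer looks like. The sketch below illustrates that failure mode and a defensive wrapper. It assumes a provider that exposes its verbs through a table of function pointers and leaves unimplemented entries unset; that is an assumption made for illustration, not a description of libibverbs or libmthca internals, and every name in it is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical SRQ object holding just the limit. */
struct fake_srq {
    int limit;
};

/* Hypothetical provider ops table; a NULL entry means "not implemented". */
struct fake_ops {
    int (*modify_srq)(struct fake_srq *srq, int new_limit);
};

/*
 * Call the provider's modify_srq if it exists.
 * Returns ENOSYS instead of jumping to address 0 when the entry is unset.
 */
int checked_modify_srq(const struct fake_ops *ops, struct fake_srq *srq,
                       int new_limit)
{
    assert(srq != NULL);
    if (ops == NULL || ops->modify_srq == NULL)
        return ENOSYS;
    return ops->modify_srq(srq, new_limit);
}

/* Example provider implementation: just records the new limit. */
int fake_set_limit(struct fake_srq *srq, int new_limit)
{
    srq->limit = new_limit;
    return 0;
}
```

A caller that receives ENOSYS can report "operation not supported by this provider" rather than crash.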
From rolandd at cisco.com Wed Oct 5 12:24:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 12:24:05 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <000b01c5c9e1$332eb210$9601470a@infiniconsys.com> (Fab Tillier's message of "Wed, 5 Oct 2005 12:15:49 -0700") References: <000b01c5c9e1$332eb210$9601470a@infiniconsys.com> Message-ID: <524q7vrcmi.fsf@cisco.com> Fab> Shouldn't a user get an error (not an oops) if they try to Fab> use the MAD layer for a device that didn't initialize Fab> properly within the MAD layer? Doesn't the MAD layer trap Fab> that device requests are valid? It seems that adding such Fab> checks would be much simpler to implement, rather than trying Fab> to figure out how to express these limitations to the various Fab> ULPs. Yeah, I guess that makes sense, although it exercises the upper layers' error paths more. All of the modules that export interfaces used by other layers have to be prepared for a device that they failed to initialize, and the upper layers have to be prepared for lower layers to fail. - R. From rolandd at cisco.com Wed Oct 5 12:25:00 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 12:25:00 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <000a01c5c9e0$8014c7f0$9601470a@infiniconsys.com> (Fab Tillier's message of "Wed, 5 Oct 2005 12:10:48 -0700") References: <000a01c5c9e0$8014c7f0$9601470a@infiniconsys.com> Message-ID: <52zmpnpy0j.fsf@cisco.com> Fab> Proper error handling should resolve both the ifconfig hang Fab> and multicast join oops. To be honest, I'm not familiar with the ifconfig hang, but I don't think the multicast join oops is caused by lack of error handling. It's some small race somewhere. - R. 
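The approach discussed here, where each layer rejects requests for a device or port it failed to set up, might look roughly like the sketch below. It is a sketch under assumptions: toy_mad_dev, toy_mad_init_port() and toy_register_agent() are hypothetical names rather than the real ib_mad interfaces, and the port count is fixed at 2 for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_PORTS 2

/* Hypothetical per-device state kept by one layer. */
struct toy_mad_dev {
    int port_ready[TOY_MAX_PORTS + 1];   /* HCA ports are numbered from 1 */
};

/*
 * Record the outcome of per-port setup. init_rc is the result of the
 * (omitted) real initialization: 0 for success, negative errno otherwise.
 */
void toy_mad_init_port(struct toy_mad_dev *dev, int port, int init_rc)
{
    assert(dev != NULL && port >= 1 && port <= TOY_MAX_PORTS);
    dev->port_ready[port] = (init_rc == 0);
}

/*
 * Entry point used by upper layers.
 * Returns 0 on success, -EINVAL for a bad device or port number, and
 * -ENODEV for a port this layer never managed to initialize.
 */
int toy_register_agent(const struct toy_mad_dev *dev, int port)
{
    if (dev == NULL || port < 1 || port > TOY_MAX_PORTS)
        return -EINVAL;
    if (!dev->port_ready[port])
        return -ENODEV;
    return 0;
}
```

The upper layer then only has to check the return value of the registration call; it does not need to know why the lower layer is unavailable on that port.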
From rolandd at cisco.com Wed Oct 5 12:40:20 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 12:40:20 -0700 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051005183649.GA9036@cse.ohio-state.edu> (Sayantan Sur's message of "Wed, 5 Oct 2005 14:36:52 -0400") References: <20051005183649.GA9036@cse.ohio-state.edu> Message-ID: <52vf0bpxaz.fsf@cisco.com> OK, I just checked in an initial implementation of both setting the SRQ limit with the modify SRQ verb, and also getting SRQ limit reached events when they occur. You will need to update your kernel drivers, libibverbs and libmthca to get this. I've done zero testing, so please let me know how it works. You should at least get an interesting new failure. - R.
From xma at us.ibm.com Wed Oct 5 13:49:50 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Oct 2005 13:49:50 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <524q7vrcmi.fsf@cisco.com> Message-ID: Fab> Shouldn't a user get an error (not an oops) if they try to Fab> use the MAD layer for a device that didn't initialize Fab> properly within the MAD layer? Doesn't the MAD layer trap Fab> that device requests are valid? It seems that adding such Fab> checks would be much simpler to implement, rather than trying Fab> to figure out how to express these limitations to the various Fab> ULPs. > Yeah, I guess that makes sense, although it exercises the upper > layers' error paths more. All of the modules that export interfaces > used by other layers have to be prepared for a device that they failed > to initialize, and the upper layers have to be prepared for lower > layers to fail. Both approaches need to go through each layer. The difference is that one prevents the error from happening earlier, while the other detects the error later, which would be a better solution if the error could happen later.
It's necessary to modify the ib_mad, ib_sa, ib_cm, just act like ib_ipoib and ib_cache to continue initializing when one port encounters errors, instead of releasing all resources. If you agree, I am creating this as the first patch for review. How to handle the errors would be the second patch. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlentini at netapp.com Wed Oct 5 14:04:54 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 5 Oct 2005 17:04:54 -0400 (EDT) Subject: [openib-general] ib_cm_listen failure In-Reply-To: References: <433C2ADF.4010402@ichips.intel.com>

Message-ID: On Wed, 5 Oct 2005, Todd Bowman wrote: > Here is a patch for dtest.c to remove the qualifier from the sdp range. > > Index: userspace/dapl/test/dtest/dtest.c > =================================================================== > --- userspace/dapl/test/dtest/dtest.c (revision 3547) > +++ userspace/dapl/test/dtest/dtest.c (working copy) > @@ -53,7 +53,7 @@ > #include "dat/udat.h" > > /* definitions */ > -#define SERVER_CONN_QUAL 71123 > +#define SERVER_CONN_QUAL 45248 > #define DTO_TIMEOUT (1000*1000*5) > #define DTO_FLUSH_TIMEOUT (1000*1000*2) > #define CONN_TIMEOUT (1000*1000*10) Thanks Todd. I don't mean to nitpick, but do you mind throwing a Signed-off-by line on it? From rolandd at cisco.com Wed Oct 5 14:24:50 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 14:24:50 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 13:49:50 -0700") References: Message-ID: <52psqjpsgt.fsf@cisco.com> Shirley> It's necessary to modify the ib_mad, ib_sa, ib_cm, just Shirley> act like ib_ipoib and ib_cache to continue initializing Shirley> when one port encounters errors, instead of releasing all Shirley> resources. If you agree, I am creating this as the first patch Shirley> for review. How to handle the errors would be the second Shirley> patch. I don't agree that we want to handle "half-usable" devices where some ports don't work. The only use for this seems to be working around some problems with the current Galaxy HCA implementation, and there must be a better way to handle this. You're welcome to prove me wrong, but I think that handling ports that are not usable and then become usable later is just going to be horrible. And if we do that, then I think it would make sense to handle ports starting out usable and then becoming unusable later -- and I think that's going to be even worse still. I do agree that we want to handle errors in initialization better.
The ib_mad and ib_cm code actually looks OK to me (with a small bug in ib_mad for which I'll post a patch shortly). I think something like the patch below is all that's needed to fix ib_sa: --- infiniband/core/sa_query.c (revision 3664) +++ infiniband/core/sa_query.c (working copy) @@ -583,10 +583,16 @@ int ib_sa_path_rec_get(struct ib_device { struct ib_sa_path_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); - struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; - struct ib_mad_agent *agent = port->agent; + struct ib_sa_port *port; + struct ib_mad_agent *agent; int ret; + if (!sa_dev) + return -ENODEV; + + port = &sa_dev->port[port_num - sa_dev->start_port]; + agent = port->agent; + query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; @@ -685,10 +691,16 @@ int ib_sa_service_rec_query(struct ib_de { struct ib_sa_service_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); - struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; - struct ib_mad_agent *agent = port->agent; + struct ib_sa_port *port; + struct ib_mad_agent *agent; int ret; + if (!sa_dev) + return -ENODEV; + + port = &sa_dev->port[port_num - sa_dev->start_port]; + agent = port->agent; + if (method != IB_MGMT_METHOD_GET && method != IB_MGMT_METHOD_SET && method != IB_SA_METHOD_DELETE) @@ -768,10 +780,16 @@ int ib_sa_mcmember_rec_query(struct ib_d { struct ib_sa_mcmember_query *query; struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); - struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; - struct ib_mad_agent *agent = port->agent; + struct ib_sa_port *port; + struct ib_mad_agent *agent; int ret; + if (!sa_dev) + return -ENODEV; + + port = &sa_dev->port[port_num - sa_dev->start_port]; + agent = port->agent; + query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; From rolandd at cisco.com Wed Oct 5 14:25:56 2005 From: rolandd at 
cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 14:25:56 -0700 Subject: [openib-general] [PATCH] Fix leak on MAD initialization failure In-Reply-To: <52psqjpsgt.fsf@cisco.com> (Roland Dreier's message of "Wed, 05 Oct 2005 14:24:50 -0700") References: <52psqjpsgt.fsf@cisco.com> Message-ID: <52ll17psez.fsf_-_@cisco.com> It seems that there is a bug in ib_mad_init_device(): if ib_agent_port_open() fails for a given port, then the current code doesn't call ib_mad_port_close() for that port. I think something like the patch below is needed. Signed-off-by: Roland Dreier --- infiniband/core/mad.c (revision 3664) +++ infiniband/core/mad.c (working copy) @@ -2683,40 +2683,47 @@ static int ib_mad_port_close(struct ib_d static void ib_mad_init_device(struct ib_device *device) { - int num_ports, cur_port, i; + int start, end, i; if (device->node_type == IB_NODE_SWITCH) { - num_ports = 1; - cur_port = 0; + start = 0; + end = 0; } else { - num_ports = device->phys_port_cnt; - cur_port = 1; + start = 1; + end = device->phys_port_cnt; } - for (i = 0; i < num_ports; i++, cur_port++) { - if (ib_mad_port_open(device, cur_port)) { + + for (i = start; i <= end; i++) { + if (ib_mad_port_open(device, i)) { printk(KERN_ERR PFX "Couldn't open %s port %d\n", - device->name, cur_port); - goto error_device_open; + device->name, i); + goto error; } - if (ib_agent_port_open(device, cur_port)) { + if (ib_agent_port_open(device, i)) { printk(KERN_ERR PFX "Couldn't open %s port %d " "for agents\n", - device->name, cur_port); - goto error_device_open; + device->name, i); + goto error_agent; } } return; -error_device_open: - while (i > 0) { - cur_port--; - if (ib_agent_port_close(device, cur_port)) +error_agent: + if (ib_mad_port_close(device, i)) + printk(KERN_ERR PFX "Couldn't close %s port %d\n", + device->name, i); + +error: + i--; + + while (i >= start) { + if (ib_agent_port_close(device, i)) printk(KERN_ERR PFX "Couldn't close %s port %d " "for agents\n", - device->name, cur_port); - if 
(ib_mad_port_close(device, cur_port)) + device->name, i); + if (ib_mad_port_close(device, i)) printk(KERN_ERR PFX "Couldn't close %s port %d\n", - device->name, cur_port); + device->name, i); i--; } } From surs at cse.ohio-state.edu Wed Oct 5 14:24:50 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 5 Oct 2005 17:24:50 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <52vf0bpxaz.fsf@cisco.com> References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> Message-ID: <20051005212448.GA10612@cse.ohio-state.edu> Roland, * On Oct,5 Roland Dreier wrote : > OK, I just checked in an initial implementation of both setting the > SRQ limit with the modify SRQ verb, and also getting SRP limit reached > events when the occur. You will need to update your kernel drivers, > libibverbs and libmthca to get this. Thanks a lot for checking this in so quickly! I got the changes and updated our systems. > > I've done zero testing, so please let me know how it works. You > should at least get an interesting new failure. With your changes the `ibv_modify_qp' works. I will have the "message passing" part done sometime soon. If I see any failure, I'll report it to this reflector. Thanks, Sayantan. > > - R. -- http://www.cse.ohio-state.edu/~surs From mlleini at ca.sandia.gov Wed Oct 5 14:32:02 2005 From: mlleini at ca.sandia.gov (Matt L. Leininger) Date: Wed, 05 Oct 2005 14:32:02 -0700 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051005190934.GA9412@cse.ohio-state.edu> References: <20051005183649.GA9036@cse.ohio-state.edu> <52oe63reke.fsf@cisco.com> <20051005190934.GA9412@cse.ohio-state.edu> Message-ID: <1128547922.13952.184.camel@localhost> On Wed, 2005-10-05 at 15:09 -0400, Sayantan Sur wrote: > > This is because the modify SRQ operation is not implemented at all in > > libmthca. Do you just want to set the SRQ limit? That's not so hard > > for me to implement. 
However, you should be aware that as far as I > > know, only mem-free HCAs generate the SRQ limited reached event. > > Thanks for your reply. Yes, I want to set a SRQ limit. Yes, I am aware > that only mem-free HCAs generate SRQ limit reached event. I am trying > this on a Mem-free HCA. Is this due to memfree vs. memfull hardware or firmware difference? If you flash the memfull HCA with the memfree firmware (which I was told you can do) will the HCA generate an SRQ limit reached event? Thanks, - Matt From xma at us.ibm.com Wed Oct 5 14:59:33 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Oct 2005 14:59:33 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <52psqjpsgt.fsf@cisco.com> Message-ID: > I don't agree that we want to handle "half-usable" devices where some > ports don't work. The only use for this seems to be working around > some problems with the current Galaxy HCA implementation, and there > must be a better way to handle this. > You're welcome to prove me wrong, but I think that handling ports that > are not usable and then become usable later is just going to be > horrible. And if we do that, then I think it would make sense to > handle ports starting out usable and then becoming unusable later -- > and I think that's going to be even worse still. I don't think we handle "half-usable" devices here. We treat each port as an individual "device" in many layers, ports to ports are independent. For each HCA which could be as many as 256 ports, I think it makes more sense to handle per port, not per HCA device based. Second, The IB SW stack shouldn't prevent any implementation from handling later ports becoming usable. The SW implementation should support all kinds of HCA implementations. Doesn't matter if it is IBM HCAs or HCAs from other vendors in the future. Third ib_cache & ib_ipoib implmentation actually allow "half-usable" devices. It allows other ports initializing while one port has errors. 
Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Wed Oct 5 14:59:57 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 14:59:57 -0700 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <1128547922.13952.184.camel@localhost> (Matt L. Leininger's message of "Wed, 05 Oct 2005 14:32:02 -0700") References: <20051005183649.GA9036@cse.ohio-state.edu> <52oe63reke.fsf@cisco.com> <20051005190934.GA9412@cse.ohio-state.edu> <1128547922.13952.184.camel@localhost> Message-ID: <52hdbvpqua.fsf@cisco.com> Matt> Is this due to memfree vs. memfull hardware or firmware Matt> difference? If you flash the memfull HCA with the memfree Matt> firmware (which I was told you can do) will the HCA generate Matt> an SRQ limit reached event? I believe it's a firmware difference. There are basically three Mellanox HCA chips: MT23108 - PCI-X - memfull only (FW 3.x.y) MT25208 - 2 port PCI Express - memfull (FW 4.x.y) or memfree (FW 5.x.y) memfree FW will work even if HCA board has memory on it. Obviously memfree FW is required if the HCA board has no memory. MT25204 - 1 port PCI Express - memfree only (FW 1.x.y) Any HCA that works with memfree FW (ie any PCI Express HCA) should be able to generate SRQ limit events. In the current FW release, memfull HCAs do not generate SRQ limit events. - R. From rolandd at cisco.com Wed Oct 5 15:57:18 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 15:57:18 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 14:59:33 -0700") References: Message-ID: <52d5mjpo6p.fsf@cisco.com> Shirley> I don't think we handle "half-usable" devices here. 
We Shirley> treat each port as an individual "device" in many layers, Shirley> ports to ports are independent. For each HCA which could Shirley> be as many as 256 ports, I think it makes more sense to Shirley> handle per port, not per HCA device based. The problem with this view is that the HCA is really the fundamental object in the model described in the IB spec. Most transport resources are attached to an HCA, not a port. In fact, with APM, a QP might be attached to two different ports at the same time. Shirley> Second, The IB SW stack shouldn't prevent any Shirley> implementation from handling later ports becoming Shirley> usable. The SW implementation should support all kinds of Shirley> HCA implementations. Doesn't matter if it is IBM HCAs or Shirley> HCAs from other vendors in the future. I definitely don't want to block support for IBM HCAs. However, at the same time I don't want to make the IB stack more complex, more error-prone, etc. just to work around what I would argue is a bug in your firmware. Shirley> Third ib_cache & ib_ipoib implmentation actually allow Shirley> "half-usable" devices. It allows other ports initializing Shirley> while one port has errors. It seems cache.c actually bails out if it fails to allocate space for one HCA port. IPoIB does indeed proceed even if one port fails, but that's more because there's no real reason to bail out halfway rather than wanting to support half-usable devices. I don't object much to making layers that really are per-port work that way. What worries me is trying to fix everything to work sanely with individual ports becoming usable or unusable after an HCA has been attached to the system. I guess we'll have to wait and see how convincing your patches are. - R. 
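The IPoIB-style behavior Roland describes above, where initialization carries on even if one port fails, can be sketched as a best-effort loop. This is a minimal sketch, not the IPoIB code: demo_port_init_fn, demo_init_ports_best_effort and demo_port2_fails are hypothetical names, and ports are numbered from 1 only to match the HCA convention used in this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_PORT_LIMIT 4       /* hypothetical upper bound for the sketch */

/* Hypothetical per-port init callback: returns 0 on success. */
typedef int (*demo_port_init_fn)(int port);

/*
 * Best-effort policy: try every port, record which ones came up in
 * usable[port - 1], and return the number of usable ports.
 */
int demo_init_ports_best_effort(demo_port_init_fn init, int nports,
                                bool usable[])
{
    int port, up = 0;

    assert(nports >= 0 && nports <= DEMO_PORT_LIMIT);

    for (port = 1; port <= nports; port++) {
        usable[port - 1] = (init(port) == 0);
        if (usable[port - 1])
            up++;
    }
    return up;
}

/* Example callback in which only port 2 fails to initialize. */
int demo_port2_fails(int port)
{
    return port == 2 ? -1 : 0;
}
```

The sketch covers only the initialization-time case; it says nothing about ports changing state after the HCA is attached, which is the part Roland is worried about.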
From sean.hefty at intel.com Wed Oct 5 16:15:17 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 5 Oct 2005 16:15:17 -0700 Subject: [openib-general] [PATCH] Fix leak on MAD initialization failure In-Reply-To: <52ll17psez.fsf_-_@cisco.com> Message-ID: >It seems that there is a bug in ib_mad_init_device(): if >ib_agent_port_open() fails for a given port, then the current code >doesn't call ib_mad_port_close() for that port. I think something >like the patch below is needed. The patch looks fine. Did you want to commit this, or have myself or Hal do it? - Sean From xma at us.ibm.com Wed Oct 5 16:17:03 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Oct 2005 16:17:03 -0700 Subject: [openib-general] Re: [PATCH] Fix leak on MAD initialization failure In-Reply-To: <52ll17psez.fsf_-_@cisco.com> Message-ID: Yes. I found the problem too. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Wed Oct 5 16:22:58 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 16:22:58 -0700 Subject: [openib-general] [PATCH] Fix leak on MAD initialization failure In-Reply-To: (Sean Hefty's message of "Wed, 5 Oct 2005 16:15:17 -0700") References: Message-ID: <527jcrpmzx.fsf@cisco.com> Sean> The patch looks fine. Did you want to commit this, or have Sean> myself or Hal do it? I'll do it in a little while unless you beat me to it. - R.
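For readers following the leak discussed in this thread, the unwind order can be modeled in a small standalone C sketch. It is not the kernel code: the demo_* names and the fail_* fields are hypothetical, and unlike ib_mad_init_device() the sketch returns a status so the result can be checked.

```c
#include <assert.h>

#define DEMO_MAX_PORTS 8        /* hypothetical bound for the sketch */

/* Hypothetical per-port state; not the kernel's MAD structures. */
struct demo_hca {
    int mad_open[DEMO_MAX_PORTS + 1];
    int agent_open[DEMO_MAX_PORTS + 1];
    int fail_mad_port;          /* port whose MAD open fails, or -1 */
    int fail_agent_port;        /* port whose agent open fails, or -1 */
};

static int demo_mad_port_open(struct demo_hca *hca, int port)
{
    if (port == hca->fail_mad_port)
        return -1;
    hca->mad_open[port] = 1;
    return 0;
}

static int demo_agent_port_open(struct demo_hca *hca, int port)
{
    if (port == hca->fail_agent_port)
        return -1;
    hca->agent_open[port] = 1;
    return 0;
}

static void demo_mad_port_close(struct demo_hca *hca, int port)
{
    hca->mad_open[port] = 0;
}

static void demo_agent_port_close(struct demo_hca *hca, int port)
{
    hca->agent_open[port] = 0;
}

/* Open ports start..end; on failure, undo everything opened so far. */
int demo_init_device(struct demo_hca *hca, int start, int end)
{
    int i;

    assert(start >= 0 && start <= end && end <= DEMO_MAX_PORTS);

    for (i = start; i <= end; i++) {
        if (demo_mad_port_open(hca, i))
            goto error;
        if (demo_agent_port_open(hca, i))
            goto error_agent;
    }
    return 0;

error_agent:
    /* Port i has its MAD half open but no agent: close only that half. */
    demo_mad_port_close(hca, i);
error:
    /* Ports start..i-1 are fully open: close both halves, newest first. */
    for (i--; i >= start; i--) {
        demo_agent_port_close(hca, i);
        demo_mad_port_close(hca, i);
    }
    return -1;
}

/* Number of per-port resources still open, for checking the unwind. */
int demo_open_count(const struct demo_hca *hca)
{
    int i, n = 0;

    for (i = 0; i <= DEMO_MAX_PORTS; i++)
        n += hca->mad_open[i] + hca->agent_open[i];
    return n;
}
```

The separate error_agent label is the point of the fix: without it, a port whose agent open fails keeps its MAD half open after the unwind.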
From surs at cse.ohio-state.edu Wed Oct 5 19:15:31 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 5 Oct 2005 22:15:31 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <52vf0bpxaz.fsf@cisco.com> References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> Message-ID: <20051006021529.GA14502@cse.ohio-state.edu> Roland, * On Oct,7 Roland Dreier wrote : > OK, I just checked in an initial implementation of both setting the > SRQ limit with the modify SRQ verb, and also getting SRP limit reached > events when the occur. You will need to update your kernel drivers, > libibverbs and libmthca to get this. > > I've done zero testing, so please let me know how it works. You > should at least get an interesting new failure. I am getting a segmentation fault after a couple of thousand messages are sent over SRQ (using ping-pong latency test). Here is a snippet from the core generated. Let me know what you think about this. Thanks, Sayantan. 
============= #0 0x00002aaaab238faa in mthca_poll_cq (ibcq=0xd4b920, ne=1, wc=0x7fffff957f90) at cq.c:336 336 wc->wr_id = srq->wrid[wqe_index]; (gdb) bt #0 0x00002aaaab238faa in mthca_poll_cq (ibcq=0xd4b920, ne=1, wc=0x7fffff957f90) at cq.c:336 #1 0x00000000004151f5 in MPID_DeviceCheck (blocking=MPID_BLOCKING) at verbs.h:746 #2 0x000000000042101c in MPID_RecvComplete (request=0x7fffff958030, status=0x7fffff958230, error_code=0x7fffff958184) at mpid_recv.c:90 #3 0x000000000041791c in MPID_RecvDatatype (comm_ptr=0xf5e9d0, buf=0x536280, count=2, dtype_ptr=0xd36f60, src_lrank=0, tag=1, context_id=0, status=0x7fffff958230, error_code=0x7fffff958184) at mpid_hrecv.c:89 #4 0x0000000000402586 in PMPI_Recv (buf=0x536280, count=2, datatype=, source=0, tag=1, comm=, status=0x7fffff958230) at recv.c:87 #5 0x00000000004020a9 in main () (gdb) f 0 #0 0x00002aaaab238faa in mthca_poll_cq (ibcq=0xd4b920, ne=1, wc=0x7fffff957f90) at cq.c:336 336 wc->wr_id = srq->wrid[wqe_index]; (gdb) list 331 } else if ((*cur_qp)->ibv_qp.srq) { 332 srq = to_msrq((*cur_qp)->ibv_qp.srq); 333 wqe = htonl(cqe->wqe); 334 wq = NULL; 335 wqe_index = wqe >> srq->wqe_shift; 336 wc->wr_id = srq->wrid[wqe_index]; 337 mthca_free_srq_wqe(srq, wqe); 338 } else { 339 wq = &(*cur_qp)->rq; 340 wqe_index = ntohl(cqe->wqe) >> wq->wqe_shift; > > - R. -- http://www.cse.ohio-state.edu/~surs
From rolandd at cisco.com Wed Oct 5 21:35:11 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 21:35:11 -0700 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051006021529.GA14502@cse.ohio-state.edu> (Sayantan Sur's message of "Wed, 5 Oct 2005 22:15:31 -0400") References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> <20051006021529.GA14502@cse.ohio-state.edu> Message-ID: <523bnfp8jk.fsf@cisco.com> Sayantan> I am getting a segmentation fault after a couple of Sayantan> thousand messages are sent over SRQ (using ping-pong Sayantan> latency test). Here is a snippet from the core Sayantan> generated. Is it possible that you are posting one more receive to the SRQ than the max capacity you requested when creating the SRQ? What happens with the patch below applied to libmthca? Thanks, Roland --- libmthca/src/srq.c (revision 3664) +++ libmthca/src/srq.c (working copy) @@ -110,6 +110,13 @@ int mthca_tavor_post_srq_recv(struct ibv wqe = get_wqe(srq, ind); next_ind = *wqe_to_link(wqe); + + if (next_ind < 0) { + err = -1; + *bad_wr = wr; + break; + } + prev_wqe = srq->last; srq->last = wqe; @@ -197,6 +204,12 @@ int mthca_arbel_post_srq_recv(struct ibv wqe = get_wqe(srq, ind); next_ind = *wqe_to_link(wqe); + if (next_ind < 0) { + err = -1; + *bad_wr = wr; + break; + } + ((struct mthca_next_seg *) wqe)->nda_op = htonl((next_ind << srq->wqe_shift) | 1); ((struct mthca_next_seg *) wqe)->ee_nds = 0; From mst at mellanox.co.il Thu Oct 6 00:12:51 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 6 Oct 2005 09:12:51 +0200 Subject: [openib-general] Updating firmware In-Reply-To: <433D820B.10100@dbresearch.net> References: <433D820B.10100@dbresearch.net> Message-ID: <20051006071251.GC8114@mellanox.co.il> Quoting Sean Hubbell : > Michael, > > Would you like me to add autogen.sh and configure scripts to build > mstflint? The reason is that to compile this on my system (Dell > PowerEdge 2850 (2) 3.2 GHz running cAos 2.0 (with Patches) is not > resolving some of the require include paths. > > Sean Sean, thanks for offering help. So far, I managed to avoid the need for configure scripts, basically on account of the tool dependencies being so simple. Could you please explain what kind of problem are you facing? Is this a cross-compilation environment? How would configure scripts help? Thanks, -- MST From SCHICKHJ at de.ibm.com Thu Oct 6 05:14:43 2005 From: SCHICKHJ at de.ibm.com (Heiko J Schick) Date: Thu, 6 Oct 2005 14:14:43 +0200 Subject: [openib-general] [PATCH] libibat: little / big endian problems in example programs Message-ID: Hello, during (some) test with libibat I found out that the example programs include a little/big endian problem. 
Below you will find the patch for ats.c and att.c which will solve this problem on PPC64: Signed-off-by: Heiko Joerg Schick --- /home/source/trunk_3615_orig/src/userspace/libibat/examples/ats.c 2005-08-23 18:49:39.000000000 +0200 +++ ats.c 2005-10-06 13:42:02.492909848 +0200 @@ -225,7 +225,7 @@ int main(int argc, char **argv) } for (i = 0; i < MAX_REQ; i++) { - r = ib_at_route_by_ip(0x0100a8c0, 0, 0, + r = ib_at_route_by_ip(htonl(0xc0a80001), 0, 0, IB_AT_ROUTE_FORCE_ATS, att_rt + i, att_rt_comp + i, &req_id); --- /home/source/trunk_3615_orig/src/userspace/libibat/examples/att.c 2005-08-23 18:49:39.000000000 +0200 +++ att.c 2005-10-06 13:40:26.293891760 +0200 @@ -190,7 +190,7 @@ int main(int argc, char **argv) } for (i = 0; i < MAX_REQ; i++) { - r = ib_at_route_by_ip(0x0100a8c0, 0, 0, 0, + r = ib_at_route_by_ip(htonl(0xc0a80001), 0, 0, 0, att_rt + i, att_rt_comp + i, &req_id); #if __WORDSIZE == 64 BTW. Does the output of the uatt program looks alright? uatt: att_path_comp_fn: id 21 context 0x10012ae8 completed with rec_num 1 ===> slid 0xab dlid 0xae uatt: main: ib_at_route_by_ip: ret 1 errno 0 for request 21 id 0 0 uatt: att_rt_comp_fn: id 0 context 0x100135f0 completed with rec_num 1 ===> rt 0x100135f0 sgid 0xfe8000000000000067eafbe000040001 dgid 0xfe8000000000000067eafbe000040002 uatt: att_rt_comp_fn: ib_at_paths_by_route: ret 0 errno 0 id 22 22 uatt: att_path_comp_fn: id 22 context 0x10012b30 completed with rec_num 1 ===> slid 0xab dlid 0xae uatt: main: ib_at_route_by_ip: ret 1 errno 0 for request 22 id 0 0 uatt: att_rt_comp_fn: id 0 context 0x10013628 completed with rec_num 1 ===> rt 0x10013628 sgid 0xfe8000000000000067eafbe000040001 dgid 0xfe8000000000000067eafbe000040002 uatt: att_rt_comp_fn: ib_at_paths_by_route: ret 0 errno 0 id 23 23 uatt: att_path_comp_fn: id 23 context 0x10012b78 completed with rec_num 1 ===> slid 0xab dlid 0xae ... Many thanks in advance! 
Mit freundlichen Gruessen / Kind Regards Heiko Joerg Schick IBM Deutschland Entwicklung GmbH I/Ox Microcode Development Linux Infiniband Device Drivers Schoenaicher Str. 220 71032 Boeblingen E-Mail: schickhj at de.ibm.com External: 49-7031-16-0 x4219, t/l: 120-4219 -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Thu Oct 6 05:47:25 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 08:47:25 -0400 Subject: [openib-general] [PATCH] libibat: little / big endian problems in example programs In-Reply-To: References: Message-ID: <1128602844.4400.3586.camel@hal.voltaire.com> On Thu, 2005-10-06 at 08:14, Heiko J Schick wrote: > Hello, > > during (some) test with libibat I found out that the example programs > include a little/big endian problem. > Below you will find the patch for ats.c and att.c which will solve > this problem on PPC64: > > Signed-off-by: Heiko Joerg Schick Thanks. Applied. > --- /home/source/trunk_3615_orig/src/userspace/libibat/examples/ats.c > 2005-08-23 18:49:39.000000000 +0200 > +++ ats.c 2005-10-06 13:42:02.492909848 +0200 > @@ -225,7 +225,7 @@ int main(int argc, char **argv) > } > > for (i = 0; i < MAX_REQ; i++) { > - r = ib_at_route_by_ip(0x0100a8c0, 0, 0, > + r = ib_at_route_by_ip(htonl(0xc0a80001), 0, 0, > IB_AT_ROUTE_FORCE_ATS, > att_rt + i, att_rt_comp + i, > &req_id); The patch didn't apply. It indicated it was malformed here. I think your mailer line-wrapped this. That needs to be turned off when submitting patches. > > --- /home/source/trunk_3615_orig/src/userspace/libibat/examples/att.c > 2005-08-23 18:49:39.000000000 +0200 > +++ att.c 2005-10-06 13:40:26.293891760 +0200 > @@ -190,7 +190,7 @@ int main(int argc, char **argv) > } > > for (i = 0; i < MAX_REQ; i++) { > - r = ib_at_route_by_ip(0x0100a8c0, 0, 0, 0, > + r = ib_at_route_by_ip(htonl(0xc0a80001), 0, 0, 0, > att_rt + i, att_rt_comp + i, > &req_id); > > #if __WORDSIZE == 64 > > BTW.
Does the output of the uatt program looks alright? Yes, that looks OK to me but would need to be verified with your subnet config. It looks like your test node was not 192.168.0.1 and had a LID of 0xab and the 192.168.0.1 node was a different node with LID 0xae. You could also verify the GIDs which were indicated as well. -- Hal > uatt: att_path_comp_fn: id 21 context 0x10012ae8 completed with > rec_num 1 > ===> slid 0xab dlid 0xae > uatt: main: ib_at_route_by_ip: ret 1 errno 0 for request 21 id 0 0 > uatt: att_rt_comp_fn: id 0 context 0x100135f0 completed with rec_num 1 > ===> rt 0x100135f0 sgid 0xfe8000000000000067eafbe000040001 dgid > 0xfe8000000000000067eafbe000040002 > uatt: att_rt_comp_fn: ib_at_paths_by_route: ret 0 errno 0 id 22 22 > uatt: att_path_comp_fn: id 22 context 0x10012b30 completed with > rec_num 1 > ===> slid 0xab dlid 0xae > uatt: main: ib_at_route_by_ip: ret 1 errno 0 for request 22 id 0 0 > uatt: att_rt_comp_fn: id 0 context 0x10013628 completed with rec_num 1 > ===> rt 0x10013628 sgid 0xfe8000000000000067eafbe000040001 dgid > 0xfe8000000000000067eafbe000040002 > uatt: att_rt_comp_fn: ib_at_paths_by_route: ret 0 errno 0 id 23 23 > uatt: att_path_comp_fn: id 23 context 0x10012b78 completed with > rec_num 1 > ===> slid 0xab dlid 0xae > ... > > Many thanks in advance! > > Mit freundlichen Gruessen / Kind Regards > Heiko Joerg Schick > > IBM Deutschland Entwicklung GmbH > I/Ox Microcode Development > Linux Infiniband Device Drivers > > Schoenaicher Str. 
220 > 71032 Boeblingen > E-Mail: schickhj at de.ibm.com > External: 49-7031-16-0 x4219, t/l: 120-4219 > > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Thu Oct 6 06:09:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 09:09:45 -0400 Subject: [openib-general] Re: [PATCH] Fix leak on MAD initialization failure In-Reply-To: <52ll17psez.fsf_-_@cisco.com> References: <52psqjpsgt.fsf@cisco.com> <52ll17psez.fsf_-_@cisco.com> Message-ID: <1128604185.4382.1.camel@hal.voltaire.com> On Wed, 2005-10-05 at 17:25, Roland Dreier wrote: > It seems that there is a bug in ib_mad_init_device(): if > ib_agent_port_open() fails for a given port, then the current code > doesn't call ib_mad_port_close() for that port. I think something > like the patch below is needed. Yup, it missed calling ib_mad_port_close in the case where it was the ib_agent_port_open which failed for a port. Thanks. Applied. -- Hal From halr at voltaire.com Thu Oct 6 06:27:47 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 09:27:47 -0400 Subject: [openib-general] [PATCH] IPoIB: Backoff on send only joins as well Message-ID: <1128605267.4382.57.camel@hal.voltaire.com> IPoIB: Backoff on send only joins as well (as full member ones) (This was part of the original patch but somehow doesn't appear to have made it in).
Signed-off-by: Hal Rosenstock Index: ipoib_multicast.c =================================================================== --- ipoib_multicast.c (revision 3678) +++ ipoib_multicast.c (working copy) @@ -366,7 +366,7 @@ static int ipoib_mcast_sendonly_join(str IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE, - 1000, GFP_ATOMIC, + mcast->backoff * 1000, GFP_ATOMIC, ipoib_mcast_sendonly_join_complete, mcast, &mcast->query); if (ret < 0) { From surs at cse.ohio-state.edu Thu Oct 6 06:39:39 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Thu, 6 Oct 2005 09:39:39 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <523bnfp8jk.fsf@cisco.com> References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> <20051006021529.GA14502@cse.ohio-state.edu> <523bnfp8jk.fsf@cisco.com> Message-ID: <20051006133937.GA23901@cse.ohio-state.edu> * On Oct,10 Roland Dreier wrote : > Sayantan> I am getting a segmentation fault after a couple of > Sayantan> thousand messages are sent over SRQ (using ping-pong > Sayantan> latency test). Here is a snippet from the core > Sayantan> generated. > > Is it possible that you are posting one more receive to the SRQ than > the max capacity you requested when creating the SRQ? > > What happens with the patch below applied to libmthca? Upon inspection of my code, I found that there _is_ a possibility of posting more than srq config. I fixed that and the ping-pong test works. The patch you sent is good, it prevents the application from posting more than max. I will test out the limit event generation next. Thanks, Sayantan. 
> > Thanks, > Roland > > > --- libmthca/src/srq.c (revision 3664) > +++ libmthca/src/srq.c (working copy) > @@ -110,6 +110,13 @@ int mthca_tavor_post_srq_recv(struct ibv > > wqe = get_wqe(srq, ind); > next_ind = *wqe_to_link(wqe); > + > + if (next_ind < 0) { > + err = -1; > + *bad_wr = wr; > + break; > + } > + > prev_wqe = srq->last; > srq->last = wqe; > > @@ -197,6 +204,12 @@ int mthca_arbel_post_srq_recv(struct ibv > wqe = get_wqe(srq, ind); > next_ind = *wqe_to_link(wqe); > > + if (next_ind < 0) { > + err = -1; > + *bad_wr = wr; > + break; > + } > + > ((struct mthca_next_seg *) wqe)->nda_op = > htonl((next_ind << srq->wqe_shift) | 1); > ((struct mthca_next_seg *) wqe)->ee_nds = 0; -- http://www.cse.ohio-state.edu/~surs From twbowman at gmail.com Thu Oct 6 07:13:22 2005 From: twbowman at gmail.com (Todd Bowman) Date: Thu, 6 Oct 2005 08:13:22 -0600 Subject: [openib-general] ib_cm_listen failure In-Reply-To: References: <433C2ADF.4010402@ichips.intel.com>

Message-ID: On 10/5/05, James Lentini wrote: > > > > On Wed, 5 Oct 2005, Todd Bowman wrote: > > > Here is a patch for dtest.c to remove the qualifier from the sdp range. > > > > Index: userspace/dapl/test/dtest/dtest.c > > =================================================================== > > --- userspace/dapl/test/dtest/dtest.c (revision 3547) > > +++ userspace/dapl/test/dtest/dtest.c (working copy) > > @@ -53,7 +53,7 @@ > > #include "dat/udat.h" > > > > /* definitions */ > > -#define SERVER_CONN_QUAL 71123 > > +#define SERVER_CONN_QUAL 45248 > > #define DTO_TIMEOUT (1000*1000*5) > > #define DTO_FLUSH_TIMEOUT (1000*1000*2) > > #define CONN_TIMEOUT (1000*1000*10) > > Thanks Todd. I don't mean to nit pick, but do mind throwing a > Signed-off-by line on it? > No problem. Signed-off-by: Todd Bowman Index: userspace/dapl/test/dtest/dtest.c =================================================================== --- userspace/dapl/test/dtest/dtest.c (revision 3547) +++ userspace/dapl/test/dtest/dtest.c (working copy) @@ -53,7 +53,7 @@ #include "dat/udat.h" /* definitions */ -#define SERVER_CONN_QUAL 71123 +#define SERVER_CONN_QUAL 45248 #define DTO_TIMEOUT (1000*1000*5) #define DTO_FLUSH_TIMEOUT (1000*1000*2) #define CONN_TIMEOUT (1000*1000*10) -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Thu Oct 6 07:28:45 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 06 Oct 2005 07:28:45 -0700 Subject: [openib-general] Re: [PATCH] IPoIB: Backoff on send only joins as well In-Reply-To: <1128605267.4382.57.camel@hal.voltaire.com> (Hal Rosenstock's message of "06 Oct 2005 09:27:47 -0400") References: <1128605267.4382.57.camel@hal.voltaire.com> Message-ID: <52wtkqoh2a.fsf@cisco.com> Hal> IPoIB: Backoff on send only joins as well (as full member Hal> ones) (This was part of the original patch but somehow Hal> doesn't appear to have made it in). 
I left this part out intentionally because I don't see how it makes a
difference. Maybe I'm missing something, but where does
mcast->backoff get updated for send-only joins?

Does this patch fix something in your testing?

 - R.

From halr at voltaire.com Thu Oct 6 07:47:40 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Oct 2005 10:47:40 -0400
Subject: [openib-general] Re: [PATCH] IPoIB: Backoff on send only joins as well
In-Reply-To: <52wtkqoh2a.fsf@cisco.com>
References: <1128605267.4382.57.camel@hal.voltaire.com> <52wtkqoh2a.fsf@cisco.com>
Message-ID: <1128610060.4382.397.camel@hal.voltaire.com>

On Thu, 2005-10-06 at 10:28, Roland Dreier wrote:
> Hal> IPoIB: Backoff on send only joins as well (as full member
> Hal> ones) (This was part of the original patch but somehow
> Hal> doesn't appear to have made it in).
>
> I left this part out intentionally because I don't see how it makes a
> difference. Maybe I'm missing something, but where does
> mcast->backoff get updated for send-only joins?

OK. There is some code missing from the patch to do the backoff for send
only joins.

> Does this patch fix something in your testing?

Shouldn't send only joins backoff like full member ones ?

-- Hal

From jlentini at netapp.com Thu Oct 6 08:07:19 2005
From: jlentini at netapp.com (James Lentini)
Date: Thu, 6 Oct 2005 11:07:19 -0400 (EDT)
Subject: [openib-general] ib_cm_listen failure
In-Reply-To:
References: <433C2ADF.4010402@ichips.intel.com>

Message-ID:

On Thu, 6 Oct 2005, Todd Bowman wrote:

> Here is a patch for dtest.c to remove the qualifier from the sdp range.

Thanks. Committed in revision 3683.

From halr at voltaire.com Thu Oct 6 08:41:51 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Oct 2005 11:41:51 -0400
Subject: [openib-general] [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey
Message-ID: <1128613310.4382.609.camel@hal.voltaire.com>

IPoIB: Add API to retrieve ib device, port, and pkey

(I'm also attaching my patch to at.c which uses this; If this is
accepted, I will make up a patch for SDP as well.)

Signed-off-by: Hal Rosenstock

Index: ipoib.h
===================================================================
--- ipoib.h	(revision 3683)
+++ ipoib.h	(working copy)
@@ -210,6 +210,12 @@ struct ipoib_neigh {
 	struct list_head list;
 };
 
+struct ipoib_info {
+	struct ib_device *dev;
+	int port;
+	u16 pkey;
+};
+
 static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh)
 {
 	return (struct ipoib_neigh **) (neigh->ha + 24 -
@@ -239,6 +245,8 @@ void ipoib_reap_ah(void *dev_ptr);
 void ipoib_flush_paths(struct net_device *dev);
 struct ipoib_dev_priv *ipoib_intf_alloc(const char *format);
 
+int ipoib_get_info(struct net_device *dev, struct ipoib_info *info);
+
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(void *dev);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
Index: ipoib_ib.c
===================================================================
--- ipoib_ib.c	(revision 3683)
+++ ipoib_ib.c	(working copy)
@@ -38,6 +38,8 @@
 #include
 #include
 
+#include /* For ARPHRD_xxx */
+
 #include
 
 #include "ipoib.h"
@@ -569,6 +571,29 @@ int ipoib_ib_dev_init(struct net_device
 	return 0;
 }
 
+int ipoib_get_info(struct net_device *dev, struct ipoib_info *info)
+{
+	struct ipoib_dev_priv *priv;
+
+	if (!info)
+		return -EINVAL;
+
+	/* Make sure IPoIB interface */
+	if (dev->type != ARPHRD_INFINIBAND)
+		return -ENODEV;
+
+	priv = netdev_priv(dev);
+	/* PKey assigned yet ? */
+	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
+		return -ENOENT;
+
+	info->dev = priv->ca;
+	info->port = priv->port;
+	info->pkey = priv->pkey;
+	return 0;
+}
+EXPORT_SYMBOL(ipoib_get_info);
+
 void ipoib_ib_dev_flush(void *_dev)
 {
 	struct net_device *dev = (struct net_device *)_dev;
Index: at.c
===================================================================
--- at.c	(revision 3683)
+++ at.c	(working copy)
@@ -416,10 +416,10 @@ static void ib_at_ats_reg(void *data)
 static int resolve_ip(struct ib_at_src *src, u32 dst_ip, u32 src_ip,
 		      int tos, union ib_gid *dgid)
 {
-	struct ipoib_dev_priv *priv;
 	struct net_device *loopback = NULL;
 	struct net_device *ipoib_dev;
 	struct rtable *rt;
+	struct ipoib_info info;
 	struct flowi fl = {
 		.oif = 0,	/* oif */
 		.nl_u = {
@@ -504,14 +504,16 @@ static int resolve_ip(struct ib_at_src *
 	}
 
 	/*
-	 * lookup local info.
+	 * Obtain ib_device, port, and PKey based on IPoIB net_device
 	 */
-	priv = ipoib_dev->priv;
-
 	src->netdev = ipoib_dev;
-	src->dev = priv->ca;
-	src->port = priv->port;
-	src->pkey = cpu_to_be16(priv->pkey);
+	if ((r = ipoib_get_info(ipoib_dev, &info))) {
+		DEBUG("ipoib_get_pkey failed %d", r);
+		goto done;
+	}
+	src->dev = info.dev;
+	src->port = info.port;
+	src->pkey = cpu_to_be16(info.pkey);
 	memcpy(&src->gid, ipoib_dev->dev_addr + 4, sizeof src->gid);
 
 	if (!dgid) {

From mshefty at ichips.intel.com Thu Oct 6 09:34:15 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 06 Oct 2005 09:34:15 -0700
Subject: [openib-general] [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey
In-Reply-To: <1128613310.4382.609.camel@hal.voltaire.com>
References: <1128613310.4382.609.camel@hal.voltaire.com>
Message-ID: <43455207.9010508@ichips.intel.com>

Hal Rosenstock wrote:
> IPoIB: Add API to retrieve ib device, port, and pkey
>
> (I'm also attaching my patch to at.c which uses this; If this is
> accepted, I will make up a patch for SDP as well.)

I didn't see any other way to retrieve the pkey associated with an IP address without this. For SDP, if we layered it over the CMA, would it still need to access this information? - Sean From bardov at gmail.com Thu Oct 6 09:40:40 2005 From: bardov at gmail.com (Dan Bar Dov) Date: Thu, 6 Oct 2005 19:40:40 +0300 Subject: [openib-general] Latest build test results In-Reply-To: <20051003221553.GA27996@us.ibm.com> References: <20051003221553.GA27996@us.ibm.com> Message-ID: I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. Dan On 10/4/05, Nishanth Aravamudan wrote: > Hello, > > Here are the build results for 2.6.14-rc3 with and without the latest > gen2 trunk. > > Looks like all the builds were successful, with some warnings: > > - ppc64 + gen2 with =y > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > - same for =m, plus > > *** Warning: ".ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined! > *** Warning: ".ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! 
> > WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/core/ib_at.ko needs unknown symbol ip_dev_find > WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko needs unknown symbol ip_dev_find > > - x86 + gen2 with =y > > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_adaptor_release': > drivers/infiniband/ulp/iser/iser_conn.c:195: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c:203: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c:206: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_establish': > drivers/infiniband/ulp/iser/iser_conn.c:285: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_enable_rdma': > drivers/infiniband/ulp/iser/iser_conn.c:357: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c:431: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_post_receive_control': > drivers/infiniband/ulp/iser/iser_conn.c:933: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c:950: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c:981: warning: too few arguments for format > > drivers/infiniband/ulp/iser/iser_memory.c: In function `iser_all_mem_add_to_dto': > drivers/infiniband/ulp/iser/iser_memory.c:230: warning: cast from pointer to integer of different size > > drivers/infiniband/ulp/iser/iser_mod.c: In function `init_module': > drivers/infiniband/ulp/iser/iser_mod.c:152: warning: too few arguments for format > > drivers/infiniband/ulp/iser/iser_initiator.c: In function `iser_reg_rdma_mem': > drivers/infiniband/ulp/iser/iser_initiator.c:62: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_initiator.c:67: warning: too few arguments for format > 
drivers/infiniband/ulp/iser/iser_initiator.c:80: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_initiator.c:95: warning: too few arguments for format > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_create_ia_pz_evd': > drivers/infiniband/ulp/iser/iser_lkdapl.c:147: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_start_dto': > drivers/infiniband/ulp/iser/iser_lkdapl.c:660: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_consume_events': > drivers/infiniband/ulp/iser/iser_lkdapl.c:758: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_event_handler_thread': > drivers/infiniband/ulp/iser/iser_lkdapl.c:800: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:819: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_conn_event': > drivers/infiniband/ulp/iser/iser_lkdapl.c:846: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:849: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:852: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:855: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:858: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:861: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:864: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:867: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:870: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_single_kdapl_event': > drivers/infiniband/ulp/iser/iser_lkdapl.c:1116: warning: too few arguments for format > > 
drivers/infiniband/ulp/iser/iser_mod.c: In function `cleanup_module': > drivers/infiniband/ulp/iser/iser_mod.c:241: warning: too few arguments for format > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > - same for =m, plus: > > *** Warning: "ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined! > *** Warning: "ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! > > WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko needs unknown symbol ip_dev_find > WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/core/ib_at.ko needs unknown symbol ip_dev_find > > Mainline does not appear to have any issues on either ppc64 or x86, =m > or =y. > > Thanks, > Nish > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Thu Oct 6 09:45:39 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 12:45:39 -0400 Subject: [openib-general] [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <43455207.9010508@ichips.intel.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <43455207.9010508@ichips.intel.com> Message-ID: <1128616945.4382.839.camel@hal.voltaire.com> On Thu, 2005-10-06 at 12:34, Sean Hefty wrote: > Hal Rosenstock wrote: > > IPoIB: Add API to retrieve ib device, port, and pkey > > > > (I'm also attaching my patch to at.c which uses this; If this is > > accepted, I will make up a patch for SDP as well.) > > I didn't see any other way to retrieve the pkey associated with an IP address > without this. 
Yes, and I looked at getting the ib_device but there is no easy way so I
added them into the structure returned. Is CMA keeping a list of
ib_devices that it walks for this ?

> For SDP, if we layered it over the CMA, would it still need to access this
> information?

I'm not 100% sure. It partially depends on the CMA APIs. How is the
PathRecord request done ? That's what it's needed for.

-- Hal

From rolandd at cisco.com Thu Oct 6 09:55:34 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 06 Oct 2005 09:55:34 -0700
Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey
In-Reply-To: <1128613310.4382.609.camel@hal.voltaire.com> (Hal Rosenstock's message of "06 Oct 2005 11:41:51 -0400")
References: <1128613310.4382.609.camel@hal.voltaire.com>
Message-ID: <52r7ayoa9l.fsf@cisco.com>

Did we ever figure out how to handle the hotplug issues with the
lifetime of the struct ib_device pointer? Right now this API is
unsafe, because a caller can get a pointer to a device that has
already disappeared.

Also if we do decide to add an API like this, the struct ipoib_info
and ipoib_get_info() declarations should be in rather than in the
private ipoib.h header.

 - R.

From mshefty at ichips.intel.com Thu Oct 6 10:01:35 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 06 Oct 2005 10:01:35 -0700
Subject: [openib-general] [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey
In-Reply-To: <1128616945.4382.839.camel@hal.voltaire.com>
References: <1128613310.4382.609.camel@hal.voltaire.com> <43455207.9010508@ichips.intel.com> <1128616945.4382.839.camel@hal.voltaire.com>
Message-ID: <4345586F.7010001@ichips.intel.com>

Hal Rosenstock wrote:
>>I didn't see any other way to retrieve the pkey associated with an IP address
>>without this.
>
> Yes, and I looked at getting the ib_device but there is no easy way so I
> added them into the structure returned. Is CMA keeping a list of
> ib_devices that it walks for this ?

The CMA maintains a list of devices. The address translation code takes an IP
address and returns the corresponding GID. The CMA looks up the GID against its
list of devices. All synchronization for device removal is handled by the CMA.

Currently, the address translation code isn't aware of ib_devices. It's almost
a device independent IP to HW address translation mechanism.

A question that I have is how does the user know if the ib_device pointer is valid?

>>For SDP, if we layered it over the CMA, would it still need to access this
>>information?
>
> I'm not 100% sure. It partially depends on the CMA APIs. How is the
> PathRecord request done ? That's what it's needed for.

Right now, the CMA issues a path record request based on the SGID/DGID only. It
would be fairly easy to add the PKey to the request once the address translation
code returns it.

- Sean

From nacc at us.ibm.com Thu Oct 6 10:11:28 2005
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Thu, 6 Oct 2005 10:11:28 -0700
Subject: [openib-general] Latest build test results
In-Reply-To:
References: <20051003221553.GA27996@us.ibm.com>
Message-ID: <20051006171128.GA15908@us.ibm.com>

On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote:
> I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682.

Great! Thanks.

I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs
weren't running) now and will post the latest results.

Thanks, Nish From jcarr at linuxmachines.com Thu Oct 6 10:32:05 2005 From: jcarr at linuxmachines.com (Jeff Carr) Date: Thu, 06 Oct 2005 10:32:05 -0700 Subject: [openib-general] Re: [git pull] InfiniBand fixes for 2.6.14 In-Reply-To: <524q85on6e.fsf@cisco.com> References: <524q85on6e.fsf@cisco.com> Message-ID: <43455F95.8000105@linuxmachines.com> On 09/27/2005 09:01 PM, Roland Dreier wrote: > Linus, please pull from > > master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > This tree is also available from kernel.org mirrors at: > > rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus When I pulled this yesterday, it didn't compile uverbs_main.c. It looks like it's missing from include/rdma/ib_user_verbs.h I'm wondering if I pulled your tree/branch correctly. Can you confirm these would be the right instructions? export \ IB="rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git" git clone $IB ib cd ib git-read-tree -m HEAD git-checkout-cache -q -f -u -a At that point I have the master branch. Then I switch to your branch: git checkout -f for-linus Then, after the initial pull, if I wanted to update to the current version I'd run: git pull Thanks, Jeff drivers/infiniband/core/uverbs_main.c: In function `ib_uverbs_write': drivers/infiniband/core/uverbs_main.c:517: error: `IB_USER_VERBS_CMD_QUERY_PARAMS' undeclared (first use in this function) drivers/infiniband/core/uverbs_main.c:517: error: (Each undeclared identifier is reported only once drivers/infiniband/core/uverbs_main.c:517: error: for each function it appears in.) 
From halr at voltaire.com Thu Oct 6 10:25:35 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 13:25:35 -0400 Subject: [openib-general] Latest build test results In-Reply-To: <20051006171128.GA15908@us.ibm.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> Message-ID: <1128619535.4382.1039.camel@hal.voltaire.com> On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > Great! Thanks. > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > weren't running) now and will post the latest results. You might also want to apply https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff to get rid of the AT and SDP warnings. -- Hal From twbowman at gmail.com Thu Oct 6 10:48:02 2005 From: twbowman at gmail.com (Todd Bowman) Date: Thu, 6 Oct 2005 11:48:02 -0600 Subject: [openib-general] [PATCH] udapl: PPC64 cpuinfo change Message-ID: This patch in addition to "PPC64 atomic function additions" provides udapl support on PPC64 platform. /proc/cpuinfo on PPC64 prints different label for processor speed. Signed-off-by: Todd Bowman Index: userspace/dapl/test/dapltest/mdep/linux/dapl_mdep_user.c =================================================================== --- userspace/dapl/test/dapltest/mdep/linux/dapl_mdep_user.c (revision 3547) +++ userspace/dapl/test/dapltest/mdep/linux/dapl_mdep_user.c (working copy) @@ -186,7 +186,12 @@ void ) { #define DT_CPU_MHZ_BUFFER_SIZE 128 + +#if defined (__PPC64__) +#define DT_CPU_MHZ_MHZ "clock" +#else #define DT_CPU_MHZ_MHZ "cpu MHz" +#endif #define DT_CPU_MHZ_DELIMITER ":" FILE *fp; -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From twbowman at gmail.com Thu Oct 6 10:48:06 2005 From: twbowman at gmail.com (Todd Bowman) Date: Thu, 6 Oct 2005 11:48:06 -0600 Subject: [openib-general] [PATCH] udapl: PPC64 atomic function additions Message-ID: This patch in addition to "PPC64 cpuinfo change" provides udapl support on PPC64 platform. Added PPC64 dependent code to dapl_os_atomic_inc, dapl_os_atomic_dec, dapl_os_atomic_assign and DT_Mdep_GetTimeStamp. Also added PPC64 to platform checks. Signed-off-by: Todd Bowman Index: userspace/dapl/dapl/udapl/linux/dapl_osd.h =================================================================== --- userspace/dapl/dapl/udapl/linux/dapl_osd.h (revision 3547) +++ userspace/dapl/dapl/udapl/linux/dapl_osd.h (working copy) @@ -49,7 +49,7 @@ #error UNDEFINED OS TYPE #endif /* __linux__ */ -#if !defined (__i386__) && !defined (__ia64__) && !defined(__x86_64__) +#if !defined (__i386__) && !defined (__ia64__) && !defined(__x86_64__) && !defined(__PPC64__) #error UNDEFINED ARCH #endif @@ -78,7 +78,7 @@ #include #include -#ifdef __ia64__ +#if defined(__ia64__) || defined(__PPC64__) #include #include #endif @@ -162,6 +160,8 @@ IA64_FETCHADD (old_value,v,1,4); #endif +#elif defined(__PPC64__) + atomic_inc((atomic_t *) v); #else /* !__ia64__ */ __asm__ __volatile__ ( "lock;" "incl %0" @@ -190,6 +190,9 @@ IA64_FETCHADD (old_value,v,-1,4); #endif +#elif defined (__PPC64__) + atomic_dec((atomic_t *)v); + #else /* !__ia64__ */ __asm__ __volatile__ ( "lock;" "decl %0" @@ -230,6 +233,22 @@ current_value = ia64_cmpxchg("acq",v,match_value,new_value,4); +#elif defined(__PPC64__) + + __asm__ __volatile__ ( + EIEIO_ON_SMP +"1: lwarx %0,0,%2 # __cmpxchg_u64\n\ + cmpd 0,%0,%3\n\ + bne- 2f\n\ + stwcx. 
%4,0,%2\n\ + bne- 1b" + ISYNC_ON_SMP + "\n\ +2:" + : "=&r" (current_value), "=m" (*v) + : "r" (v), "r" (match_value), "r" (new_value), "m" (*v) + : "cc", "memory"); + #else __asm__ __volatile__ ( "lock; cmpxchgl %1, %2" Index: userspace/dapl/test/dapltest/mdep/linux/dapl_mdep_user.h =================================================================== --- userspace/dapl/test/dapltest/mdep/linux/dapl_mdep_user.h (revision 3547) +++ userspace/dapl/test/dapltest/mdep/linux/dapl_mdep_user.h (working copy) @@ -128,10 +128,20 @@ x = get_cycles (); return x; +#else +#if defined(__PPC64__) + unsigned int tbl, tbu0, tbu1; + do { + __asm__ __volatile__ ("mftbu %0" : "=r"(tbu0)); + __asm__ __volatile__ ("mftb %0" : "=r"(tbl)); + __asm__ __volatile__ ("mftbu %0" : "=r"(tbu1)); + } while (tbu0 != tbu1); + return (((unsigned long long)tbu0) << 32) | tbl; #else -#error "Non-Pentium Linux - unimplemented" +#error "Non-Pentium and Non-PPC Linux - unimplemented" #endif #endif +#endif } /* -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Thu Oct 6 10:50:15 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Oct 2005 10:50:15 -0700 Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <52r7ayoa9l.fsf@cisco.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com> Message-ID: <434563D7.6080601@ichips.intel.com> Roland Dreier wrote: > Did we ever figure out how to handle the hotplug issues with the > lifetime of the struct ib_device pointer? Right now this API is > unsafe, because a caller can get a pointer to a device that has > already disappeared. Is it possible to retrieve the pkey using net_device->class_dev? 
- Sean From rolandd at cisco.com Thu Oct 6 10:51:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 06 Oct 2005 10:51:25 -0700 Subject: [openib-general] Re: [git pull] InfiniBand fixes for 2.6.14 In-Reply-To: <43455F95.8000105@linuxmachines.com> (Jeff Carr's message of "Thu, 06 Oct 2005 10:32:05 -0700") References: <524q85on6e.fsf@cisco.com> <43455F95.8000105@linuxmachines.com> Message-ID: <52mzlmo7oi.fsf@cisco.com> Jeff> When I pulled this yesterday, it didn't compile Jeff> uverbs_main.c. It looks like it's missing from Jeff> include/rdma/ib_user_verbs.h Jeff> I'm wondering if I pulled your tree/branch correctly. Can Jeff> you confirm these would be the right instructions? Looks reasonable to me. I'm not sure what went wrong. Unfortunately I just blew away that git tree and rebased against Linus's latest tree. But everything from the for-linus branch should be in Linus's git tree. Does Linus's tree build for you? I just made a new infiniband git tree with an "upstream" branch for changes I plan to merge in 2.6.15 and a for-linus branch (currently empty) with 2.6.14 fixes. Once that hits the mirrors you could try pulling that and see how it works for you. > drivers/infiniband/core/uverbs_main.c: In function `ib_uverbs_write': > drivers/infiniband/core/uverbs_main.c:517: error: > `IB_USER_VERBS_CMD_QUERY_PARAMS' undeclared (first use in this function) > drivers/infiniband/core/uverbs_main.c:517: error: (Each undeclared > identifier is reported only once > drivers/infiniband/core/uverbs_main.c:517: error: for each function it > appears in.) These error messages seem like your uverbs_main.c and ib_user_verbs.h files got out of sync somehow. My tree looked OK to me so I don't know how to explain this. - R. 
From nacc at us.ibm.com Thu Oct 6 11:11:47 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 6 Oct 2005 11:11:47 -0700 Subject: [openib-general] Latest build test results In-Reply-To: <1128619535.4382.1039.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> Message-ID: <20051006181147.GB15908@us.ibm.com> On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > Great! Thanks. > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > weren't running) now and will post the latest results. > > You might also want to apply > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > to get rid of the AT and SDP warnings. I already submitted several jobs for 2.6.14-rc3-git6, but I'll redo the gen2 ones with that patch, thanks. Here are the results from 2.6.14-rc3-git6 + gen2 3683 Looks like x86 is broken in the current svn tree. x86 and ppc64 mainline is fine with both =y and =m ppc64 + gen2 =y drivers/infiniband/ulp/srp/ib_srp.c: In function `srp_process_rsp': drivers/infiniband/ulp/srp/ib_srp.c:650: warning: long long unsigned int format, u64 arg (arg 2) drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type ppc64 + gen2 =m same as above, plus *** Warning: ".ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined! *** Warning: ".ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! 
x86 + gen2 =y *FAILED* drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_adaptor_release': drivers/infiniband/ulp/iser/iser_conn.c:195: parse error before `)' drivers/infiniband/ulp/iser/iser_conn.c:203: parse error before `)' drivers/infiniband/ulp/iser/iser_conn.c:206: parse error before `)' drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_establish': drivers/infiniband/ulp/iser/iser_conn.c:284: parse error before `)' drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_post_receive_control': drivers/infiniband/ulp/iser/iser_conn.c:861: parse error before `)' drivers/infiniband/ulp/iser/iser_conn.c:873: parse error before `)' drivers/infiniband/ulp/iser/iser_initiator.c: In function `iser_reg_rdma_mem': drivers/infiniband/ulp/iser/iser_initiator.c:125: parse error before `)' drivers/infiniband/ulp/iser/iser_initiator.c:130: parse error before `)' drivers/infiniband/ulp/iser/iser_initiator.c:141: parse error before `)' drivers/infiniband/ulp/iser/iser_initiator.c:153: parse error before `)' drivers/infiniband/ulp/iser/iser_mod.c: In function `init_module': drivers/infiniband/ulp/iser/iser_mod.c:154: parse error before `)' drivers/infiniband/ulp/iser/iser_mod.c: In function `cleanup_module': drivers/infiniband/ulp/iser/iser_mod.c:243: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_create_ia_pz_evd': drivers/infiniband/ulp/iser/iser_lkdapl.c:147: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_consume_events': drivers/infiniband/ulp/iser/iser_lkdapl.c:691: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_event_handler_thread': drivers/infiniband/ulp/iser/iser_lkdapl.c:731: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:749: 
parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_conn_event': drivers/infiniband/ulp/iser/iser_lkdapl.c:776: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:779: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:782: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:785: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:788: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:791: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:794: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:797: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:800: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_single_kdapl_event': drivers/infiniband/ulp/iser/iser_lkdapl.c:1025: parse error before `)' x86 + gen2 =m *FAILED* same as above Thanks, Nish From surs at cse.ohio-state.edu Thu Oct 6 11:46:54 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Thu, 6 Oct 2005 14:46:54 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051006133937.GA23901@cse.ohio-state.edu> References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> <20051006021529.GA14502@cse.ohio-state.edu> <523bnfp8jk.fsf@cisco.com> <20051006133937.GA23901@cse.ohio-state.edu> Message-ID: <20051006184652.GA27969@cse.ohio-state.edu> Roland, * On Oct,11 Sayantan Sur wrote : > I will test out the limit event generation next. I made some simple modifications to srq_pingpong.c to see if I am able to generate the IBV_EVENT_SRQ_LIMIT_REACHED event. I have attached my changes as a patch and the full file (for easy execution). I noticed that the test re-posts buffers only when the outstanding recv count is <= 1. I set a SRQ limit as max_recv - 5. So, I should get the event when 5 WQEs are consumed from the SRQ, right? As of now, I am not able to see the event happening. 
I'd be glad if you could see if this issue can be resolved. Thanks for your prompt help. Sayantan. -- http://www.cse.ohio-state.edu/~surs -------------- next part -------------- Index: srq_pingpong.c =================================================================== --- srq_pingpong.c (revision 3676) +++ srq_pingpong.c (working copy) @@ -36,6 +36,8 @@ # include #endif /* HAVE_CONFIG_H */ +#define _GNU_SOURCE + #include #include #include @@ -62,6 +64,8 @@ static int page_size; +static pthread_t limit_thread; + struct pingpong_context { struct ibv_context *context; struct ibv_comp_channel *channel; @@ -82,6 +86,25 @@ int psn; }; + +static void asyncwatch(struct ibv_context *context) +{ + struct ibv_async_event event; + + while (1) { + + if (ibv_get_async_event(context, &event)) { + fprintf(stderr,"Error getting event!\n"); + } + + fprintf(stderr, " event_type %d, port %d\n", event.event_type, + event.element.port_num); + fflush(stderr); + + ibv_ack_async_event(&event); + } +} + static uint16_t pp_get_local_lid(struct pingpong_context *ctx, int port) { struct ibv_port_attr attr; @@ -382,7 +405,11 @@ return NULL; } + pthread_create(&limit_thread, NULL, (void *) asyncwatch, (void *)ctx->context); + { + struct ibv_srq_attr srq_attr; + struct ibv_srq_init_attr attr = { .attr = { .max_wr = rx_depth, @@ -395,6 +422,15 @@ fprintf(stderr, "Couldn't create SRQ\n"); return NULL; } + + srq_attr.max_wr = rx_depth; + srq_attr.max_sge = 1; + srq_attr.srq_limit = rx_depth-5; + + if(ibv_modify_srq(ctx->srq, &srq_attr, IBV_SRQ_LIMIT)) { + fprintf(stderr,"Error modifying SRQ\n"); + exit(-1); + } } for (i = 0; i < num_qp; ++i) { @@ -434,6 +470,7 @@ } } + return ctx; } @@ -742,6 +779,8 @@ } } + fprintf(stderr,"routs %d\n", routs); + if (scnt < iters) { j = find_qp(wc[i].qp_num, ctx, num_qp); if (j < 0) { @@ -784,5 +823,7 @@ iters, usec / 1000000., usec / iters); } + sleep(3); + return 0; } -------------- next part -------------- A non-text attachment was scrubbed... 
Name: srq_pingpong.c Type: text/x-csrc Size: 19155 bytes Desc: not available URL: From halr at voltaire.com Thu Oct 6 11:55:02 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 14:55:02 -0400 Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <434563D7.6080601@ichips.intel.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com> <434563D7.6080601@ichips.intel.com> Message-ID: <1128624901.4382.1599.camel@hal.voltaire.com> On Thu, 2005-10-06 at 13:50, Sean Hefty wrote: > Roland Dreier wrote: > > Did we ever figure out how to handle the hotplug issues with the > > lifetime of the struct ib_device pointer? Right now this API is > > unsafe, because a caller can get a pointer to a device that has > > already disappeared. > > Is it possible to retrieve the pkey using net_device->class_dev? I think so, but would that be any safer ? I think it might end up going through the IPoIB device private data (or an API anyhow). -- Hal From rolandd at cisco.com Thu Oct 6 12:00:48 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 06 Oct 2005 12:00:48 -0700 Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <434563D7.6080601@ichips.intel.com> (Sean Hefty's message of "Thu, 06 Oct 2005 10:50:15 -0700") References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com> <434563D7.6080601@ichips.intel.com> Message-ID: <52irwao4gv.fsf@cisco.com> Sean> Is it possible to retrieve the pkey using Sean> net_device->class_dev? Maybe, but even more direct would be taking it from net_device->broadcast. - R. 
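Roland's suggestion of reading the P_Key from net_device->broadcast can be sketched in plain user-space C. This is a minimal illustration, not code from any posted patch: it assumes the 20-byte IPoIB hardware address layout (4 bytes of flags/QPN followed by the 16-byte multicast GID), in which the P_Key sits in bytes 8 and 9, high byte first. The helper name ipoib_pkey_from_broadcast and the IPOIB_HW_ADDR_LEN constant are hypothetical.

```c
#include <stdint.h>

/* Assumed IPoIB hardware address length: 4 bytes flags/QPN + 16-byte GID. */
#define IPOIB_HW_ADDR_LEN 20

/*
 * Hypothetical helper: recover the 16-bit P_Key from an IPoIB broadcast
 * hardware address. Byte 8 holds the high byte and byte 9 the low byte.
 */
static uint16_t ipoib_pkey_from_broadcast(const uint8_t *broadcast)
{
    return (uint16_t)((broadcast[8] << 8) | broadcast[9]);
}
```

In kernel code the caller would pass dev->broadcast while holding a reference on the net_device; the lifetime questions raised in this thread still apply and are not addressed by the sketch.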
From halr at voltaire.com Thu Oct 6 12:03:48 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 15:03:48 -0400 Subject: [openib-general] [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <4345586F.7010001@ichips.intel.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <43455207.9010508@ichips.intel.com> <1128616945.4382.839.camel@hal.voltaire.com> <4345586F.7010001@ichips.intel.com> Message-ID: <1128625363.4382.1653.camel@hal.voltaire.com> On Thu, 2005-10-06 at 13:01, Sean Hefty wrote: > Hal Rosenstock wrote: > >>I didn't see any other way to retrieve the pkey associated with an IP address > >>without this. > > > > Yes, and I looked at getting the ib_device but there is no easy way so I > > added them into the structure returned. Is CMA keeping a list of > > ib_devices that it walks for this ? > > The CMA maintains a list of devices. The address translation code takes an IP > address and returns the corresponding GID. The CMA looks up the GID against its > list of devices. All synchronization for device removal is handled by the CMA. > > Currently, the address translation code isn't aware of ib_devices. It's almost > a device independent IP to HW address translation mechanism. > > A question that I have is how does the user know if the ib_device pointer is valid? The only way I see is that a user needs to register as a client and track device removals. Is there another way ? > >>For SDP, if we layered it over the CMA, would it still need to access this > >>information? > > > > I'm not 100% sure. It partially depends on the CMA APIs. How is the > > PathRecord request done ? That's what it's needed for. > > Right now, the CMA issues a path record request based on the SGID/DGID only. It > would be fairly easy to add the PKey to the request once the address translation > code returns it. How would the address translation code get it ? 
-- Hal From mshefty at ichips.intel.com Thu Oct 6 12:08:21 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Oct 2005 12:08:21 -0700 Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <52irwao4gv.fsf@cisco.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com> <434563D7.6080601@ichips.intel.com> <52irwao4gv.fsf@cisco.com> Message-ID: <43457625.1020702@ichips.intel.com> Roland Dreier wrote: > Sean> Is it possible to retrieve the pkey using > Sean> net_device->class_dev? > > Maybe, but even more direct would be taking it from net_device->broadcast. Okay - this is starting to make more sense to me now: priv->dev->broadcast[8] = priv->pkey >> 8; priv->dev->broadcast[9] = priv->pkey & 0xff; I assume that the broadcast address is well defined, and there's no issue reading it from there? If so, then I think it's a simple change to addr.c to extract it. - Sean From mshefty at ichips.intel.com Thu Oct 6 12:16:20 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Oct 2005 12:16:20 -0700 Subject: [openib-general] [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <1128625363.4382.1653.camel@hal.voltaire.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <43455207.9010508@ichips.intel.com> <1128616945.4382.839.camel@hal.voltaire.com> <4345586F.7010001@ichips.intel.com> <1128625363.4382.1653.camel@hal.voltaire.com> Message-ID: <43457804.6090506@ichips.intel.com> Hal Rosenstock wrote: >>The CMA maintains a list of devices. The address translation code takes an IP >>address and returns the corresponding GID. The CMA looks up the GID against its >>list of devices. All synchronization for device removal is handled by the CMA. > > The only way I see is that a user needs to register as a client and > track device removals. Is there another way ? The CMA will attempt to handle device removal internally. 
The basic operation is this: id = rdma_create_id(); rdma_resolve_addr(id...); /* associates a device with the ID */ /* wait for resolution to complete */ ib_alloc_pd(id->device...); ib_create_cq(id->device...); ib_create_qp(id->device...); rdma_connect(id); If a device is removed, the user will receive a callback with DEVICE_REMOVAL. The user must free all resources created using id->device, and destroy the id. The removal is blocked until the id is destroyed. >>Right now, the CMA issues a path record request based on the SGID/DGID only. It >>would be fairly easy to add the PKey to the request once the address translation >>code returns it. > > How would the address translation code get it ? Right now, it doesn't. But see Roland's message. It could be read directly from the broadcast address. - Sean From nacc at us.ibm.com Thu Oct 6 12:20:24 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 6 Oct 2005 12:20:24 -0700 Subject: [openib-general] Latest build test results In-Reply-To: <1128619535.4382.1039.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> Message-ID: <20051006192024.GC15908@us.ibm.com> On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > Great! Thanks. > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > weren't running) now and will post the latest results. > > You might also want to apply > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > to get rid of the AT and SDP warnings. 
This patch does remove the warning regarding undefined symbols during modpost, but does not remove the warnings drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type Thanks, Nish From halr at voltaire.com Thu Oct 6 12:23:19 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 15:23:19 -0400 Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <43457625.1020702@ichips.intel.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com> <434563D7.6080601@ichips.intel.com> <52irwao4gv.fsf@cisco.com> <43457625.1020702@ichips.intel.com> Message-ID: <1128626405.4382.1741.camel@hal.voltaire.com> On Thu, 2005-10-06 at 15:08, Sean Hefty wrote: > Roland Dreier wrote: > > Sean> Is it possible to retrieve the pkey using > > Sean> net_device->class_dev? > > > > Maybe, but even more direct would be taking it from net_device->broadcast. > > Okay - this is starting to make more sense to me now: > > priv->dev->broadcast[8] = priv->pkey >> 8; > priv->dev->broadcast[9] = priv->pkey & 0xff; > > I assume that the broadcast address is well defined, and there's no issue > reading it from there? If so, then I think it's a simple change to addr.c to > extract it. What stops the net_device from being pulled from underneath this ? Seems like a similar issue to me. The difference I see is that only net_devices need to be tracked rather than perhaps net_devices and ib_devices. 
-- Hal From halr at voltaire.com Thu Oct 6 12:26:41 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 15:26:41 -0400 Subject: [openib-general] Latest build test results In-Reply-To: <20051006192024.GC15908@us.ibm.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> Message-ID: <1128626684.4382.1762.camel@hal.voltaire.com> On Thu, 2005-10-06 at 15:20, Nishanth Aravamudan wrote: > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > Great! Thanks. > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > weren't running) now and will post the latest results. > > > > You might also want to apply > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > to get rid of the AT and SDP warnings. > > This patch does remove the warning regarding undefined symbols during > modpost, but does not remove the warnings > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type Right. Roland reported a change to struct packet_type in 2.6.14. I'll work on a patch for this too. Thanks. 
-- Hal From mshefty at ichips.intel.com Thu Oct 6 12:35:04 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Oct 2005 12:35:04 -0700 Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <1128626405.4382.1741.camel@hal.voltaire.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com> <434563D7.6080601@ichips.intel.com> <52irwao4gv.fsf@cisco.com> <43457625.1020702@ichips.intel.com> <1128626405.4382.1741.camel@hal.voltaire.com> Message-ID: <43457C68.8020905@ichips.intel.com> Hal Rosenstock wrote: > What stops the net_device from being pulled from underneath this ? Seems > like a similar issue to me. The difference I see is that only > net_devices need to be tracked rather than perhaps net_devices and > ib_devices. A reference on the net_device needs to be held while this is being read. Net_devices already have reference counting that comes with them; this would need to be added to ib_devices. E.g. dev = ip_dev_find(ip); gid = dev->dev_addr + 4; pkey = get_pkey(dev->broadcast); dev_put(dev); could be used to convert a local IP address to a GID/PKey. I'm assuming that neigh_lookup() provides the same protection: that neigh->dev is valid while a reference on the neigh is held (until neigh_release is called). Does anyone know if this is the case? - Sean From shubbell at dbresearch.net Thu Oct 6 12:43:32 2005 From: shubbell at dbresearch.net (Sean Hubbell) Date: Thu, 06 Oct 2005 14:43:32 -0500 Subject: [openib-general] Linux 2.6.13 Kernel Support Question In-Reply-To: <1128626684.4382.1762.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> Message-ID: <43457E64.1010406@dbresearch.net> Hello, Will openib still supply patches to the 2.6.13 Kernel or do I need to upgrade my kernel to 2.6.14? 
Thanks, Sean Hubbell From rolandd at cisco.com Thu Oct 6 12:50:36 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 06 Oct 2005 12:50:36 -0700 Subject: [openib-general] Linux 2.6.13 Kernel Support Question In-Reply-To: <43457E64.1010406@dbresearch.net> (Sean Hubbell's message of "Thu, 06 Oct 2005 14:43:32 -0500") References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <43457E64.1010406@dbresearch.net> Message-ID: <52ek6yo25v.fsf@cisco.com> Sean> Hello, Will openib still supply patches to the 2.6.13 Kernel Sean> or do I need to upgrade my kernel to 2.6.14? 2.6.14 is not out yet, so the OpenIB subversion repository continues to be targeted at 2.6.13 (the latest full kernel release). Once 2.6.14 is released, we'll target that for development. If there are API changes from 2.6.13 to 2.6.14 that mean the subversion tree no longer works with 2.6.13, then if you want to use the latest subversion sources, you'll have to either upgrade to 2.6.14, find some contributed backport patches, or do the backporting yourself. - R. From shubbell at dbresearch.net Thu Oct 6 12:55:50 2005 From: shubbell at dbresearch.net (Sean Hubbell) Date: Thu, 06 Oct 2005 14:55:50 -0500 Subject: [openib-general] Linux 2.6.13 Kernel Support Question In-Reply-To: <52ek6yo25v.fsf@cisco.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <43457E64.1010406@dbresearch.net> <52ek6yo25v.fsf@cisco.com> Message-ID: <43458146.2090307@dbresearch.net> Roland Dreier wrote: > Sean> Hello, Will openib still supply patches to the 2.6.13 Kernel > Sean> or do I need to upgrade my kernel to 2.6.14?
> >2.6.14 is not out yet, so the OpenIB subversion repository continues >to be targeted at 2.6.13 (the latest full kernel release). Once >2.6.14 is released, we'll target that for development. If there are API >changes from 2.6.13 to 2.6.14 that mean the subversion tree no longer >works with 2.6.13, then if you want to use the latest subversion >sources, you'll have to either upgrade to 2.6.14, find some >contributed backport patches, or do the backporting yourself. > > - R. > > > > Thanks Roland. Sean Hubbell From rolandd at cisco.com Thu Oct 6 13:10:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 06 Oct 2005 13:10:42 -0700 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051006184652.GA27969@cse.ohio-state.edu> (Sayantan Sur's message of "Thu, 6 Oct 2005 14:46:54 -0400") References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> <20051006021529.GA14502@cse.ohio-state.edu> <523bnfp8jk.fsf@cisco.com> <20051006133937.GA23901@cse.ohio-state.edu> <20051006184652.GA27969@cse.ohio-state.edu> Message-ID: <52achmo18d.fsf@cisco.com> Sayantan> I noticed that the test re-posts buffers only when the Sayantan> outstanding recv count is <= 1. I set a SRQ limit as Sayantan> max_recv - 5. So, I should get the event when 5 WQEs are Sayantan> consumed from the SRQ, right? Yes, your code is correct. The problem was that the mthca kernel driver was dispatching SRQ events incorrectly, so the event never reached userspace. I've checked in a fix for that, and I'm going to queue the SRQ limit event stuff for 2.6.15 (now that I've seen it working). BTW, in your code, you have: fprintf(stderr, " event_type %d, port %d\n", event.event_type, event.element.port_num); it would be more sensible to print event.element.srq here, since you're expecting an SRQ event. - R.
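The point about printing event.element.srq rather than event.element.port_num comes down to reading only the union member that matches the event type. The following is a self-contained sketch of that idea using stand-in types; the demo_* names are hypothetical and only mimic the shape of an async event with a union of per-type elements. Real code would use struct ibv_async_event from the libibverbs headers instead.

```c
#include <stdio.h>
#include <stddef.h>

/* Stand-in types for illustration only (not the libibverbs definitions). */
enum demo_event_type {
    DEMO_EVENT_PORT_ACTIVE,
    DEMO_EVENT_SRQ_LIMIT_REACHED
};

struct demo_async_event {
    union {
        void *srq;      /* meaningful only for SRQ events */
        int   port_num; /* meaningful only for port events */
    } element;
    enum demo_event_type event_type;
};

/* Format an event, reading only the union member that matches its type. */
static int demo_describe_event(const struct demo_async_event *ev,
                               char *buf, size_t len)
{
    switch (ev->event_type) {
    case DEMO_EVENT_SRQ_LIMIT_REACHED:
        return snprintf(buf, len, "event_type %d, srq %p",
                        (int)ev->event_type, ev->element.srq);
    case DEMO_EVENT_PORT_ACTIVE:
        return snprintf(buf, len, "event_type %d, port %d",
                        (int)ev->event_type, ev->element.port_num);
    }
    /* Unknown type: do not touch the union at all. */
    return snprintf(buf, len, "event_type %d (unhandled)",
                    (int)ev->event_type);
}
```

Printing port_num for an SRQ event, as in the posted asyncwatch loop, would reinterpret part of the SRQ pointer as an integer, which is why the value shown there is not meaningful.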
From surs at cse.ohio-state.edu Thu Oct 6 13:54:29 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Thu, 6 Oct 2005 16:54:29 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <52achmo18d.fsf@cisco.com> References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> <20051006021529.GA14502@cse.ohio-state.edu> <523bnfp8jk.fsf@cisco.com> <20051006133937.GA23901@cse.ohio-state.edu> <20051006184652.GA27969@cse.ohio-state.edu> <52achmo18d.fsf@cisco.com> Message-ID: <20051006205426.GA28969@cse.ohio-state.edu> Roland, * On Oct,13 Roland Dreier wrote : > Sayantan> I noticed that the test re-posts buffers only when the > Sayantan> outstanding recv count is <= 1. I set a SRQ limit as > Sayantan> max_recv - 5. So, I should get the event when 5 WQEs are > Sayantan> consumed from the SRQ, right? > > Yes, your code is correct. The problem was that the mthca kernel > driver was dispatching SRQ events incorrectly, so the event never > reached userspace. I've checked in a fix for that, and I'm going to > queue the SRQ limit event stuff for 2.6.15 (now that I've seen it > working). > > BTW, in your code, you have: > > fprintf(stderr, " event_type %d, port %d\n", event.event_type, > event.element.port_num); > > it would be more sensible to print event.element.srq here, since > you're expecting an SRQ event. Thanks for the fix!! I have updated our systems, and am able to see the event. Thanks for the tip too. My async function was a quick copy from the example asyncwatch.c :-) Thanks, Sayantan. > > - R. 
-- http://www.cse.ohio-state.edu/~surs From jlentini at netapp.com Thu Oct 6 14:00:02 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 6 Oct 2005 17:00:02 -0400 (EDT) Subject: [openib-general] Re: [PATCH] udapl: PPC64 cpuinfo change In-Reply-To: References: Message-ID: On Thu, 6 Oct 2005, Todd Bowman wrote: twbowm> This patch in addition to "PPC64 atomic function additions" provides udapl twbowm> support on PPC64 platform. twbowm> twbowm> /proc/cpuinfo on PPC64 prints different label for processor speed. Committed in revision 3687. From jlentini at netapp.com Thu Oct 6 14:00:24 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 6 Oct 2005 17:00:24 -0400 (EDT) Subject: [openib-general] Re: [PATCH] udapl: PPC64 atomic function additions In-Reply-To: References: Message-ID: On Thu, 6 Oct 2005, Todd Bowman wrote: > This patch in addition to "PPC64 cpuinfo change" provides udapl support on > PPC64 platform. > > Added PPC64 dependent code to dapl_os_atomic_inc, dapl_os_atomic_dec, > dapl_os_atomic_assign and DT_Mdep_GetTimeStamp. > Also added PPC64 to platform checks. Committed in revision 3687. From iod00d at hp.com Thu Oct 6 14:14:08 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 6 Oct 2005 14:14:08 -0700 Subject: [openib-general] [PATCH] udapl: PPC64 cpuinfo change In-Reply-To: References: Message-ID: <20051006211408.GF26238@esmail.cup.hp.com> On Thu, Oct 06, 2005 at 11:48:02AM -0600, Todd Bowman wrote: > /proc/cpuinfo on PPC64 prints different label for processor speed. ... ISTR the "clock" value in cpuinfo is NOT the same as the CPU MHz. Can you remind me if "clock" value * "mtfb" results in "wall clock" time units? If not, then use of DT_CPU_MHZ_MHZ needs to be reviewed since it typically makes that assumption. Also, if someone cares about sparc (hey Tom! 
:^) ), then might leverage the get_clock.c code on: http://svn.gnumonks.org/trunk/mmio_test/ hth, grant From robert.j.woodruff at intel.com Thu Oct 6 15:08:17 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 6 Oct 2005 15:08:17 -0700 Subject: [openib-general] RE: OpenIB gen2 support ibv_create_cq Message-ID: <1AC79F16F5C5284499BB9591B33D6F0005C17A28@orsmsx408> Matt wrote, >Woody, are there plans to update the 2.6.9 backports to svn version 3632 >or more recent to fix this? I just checked in new 2.6.9 backport patches for SVN rev. 3640 that should have this fix. woody From hozer at hozed.org Thu Oct 6 21:01:21 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 6 Oct 2005 23:01:21 -0500 Subject: [openib-general] [PATCH] udapl: PPC64 cpuinfo change In-Reply-To: <20051006211408.GF26238@esmail.cup.hp.com> References: <20051006211408.GF26238@esmail.cup.hp.com> Message-ID: <20051007040121.GW4612@kalmia.hozed.org> On Thu, Oct 06, 2005 at 02:14:08PM -0700, Grant Grundler wrote: > On Thu, Oct 06, 2005 at 11:48:02AM -0600, Todd Bowman wrote: > > /proc/cpuinfo on PPC64 prints different label for processor speed. > ... > > ISTR the "clock" value in cpuinfo is NOT the same as the CPU MHz. > Can you remind me if "clock" value * "mtfb" results in > "wall clock" time units? > > If not, then use of DT_CPU_MHZ_MHZ needs to be reviewed since > it typically makes that assumption. > > Also, if someone cares about sparc (hey Tom! :^) ), > then might leverage the get_clock.c code on: > http://svn.gnumonks.org/trunk/mmio_test/ Oh boy.... is there some reason 'gettimeofday' does not work? Trying to infer timebase/clock/rtsc frequency is going to be a mess. Think cpus that dynamically change frequency.. Laptops do now.. how long before something with infiniband does and breaks this code horribly? 
(think embedded systems) There are a couple of implementations of gettimeofday fully in userspace that hide the details and still read the high-res hardware counters. Google for 'vDSO gettimeofday'. From mlleinin at hpcn.ca.sandia.gov Fri Oct 7 01:06:53 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Fri, 07 Oct 2005 01:06:53 -0700 Subject: [openib-general] Timeline of IPoIB performance Message-ID: <1128672413.13948.326.camel@localhost> I'm seeing an IPoIB netperf performance drop off, up to 90 MB/s, when using kernels newer than 2.6.11.
This doesn't appear to be an OpenIB IPoIB issue since the in-kernel and a recent svn3687 snapshot both have the same performance (464 MB/s) with 2.6.11. I used the same kernel config file as a starting point for each of these kernel builds. Have there been any changes in Linux that would explain these results? All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0 dual EM64T 3.2 GHz PCIe IB HCA (memfull)

Kernel          OpenIB     msi_x  netperf (MB/s)
2.6.14-rc3      in-kernel  1      374
2.6.13.2        svn3627    1      386
2.6.13.2        in-kernel  1      394
2.6.12          in-kernel  1      406
2.6.11          in-kernel  1      464
2.6.11          svn3687    1      464
2.6.9-11.ELsmp  svn3513    1      425 (Woody's results, 3.6Ghz EM64T)

Thanks, - Matt
From halr at voltaire.com Fri Oct 7 05:21:19 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Oct 2005 08:21:19 -0400 Subject: [openib-general] Latest build test results In-Reply-To: <20051006181147.GB15908@us.ibm.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006181147.GB15908@us.ibm.com> Message-ID: <1128687678.4382.6520.camel@hal.voltaire.com> On Thu, 2005-10-06 at 14:11, Nishanth Aravamudan wrote: > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > Great! Thanks. > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > weren't running) now and will post the latest results. > > > > You might also want to apply > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > to get rid of the AT and SDP warnings. > > I already submitted several jobs for 2.6.14-rc3-git6, but I'll redo the > gen2 ones with that patch, thanks. > > Here are the results from 2.6.14-rc3-git6 + gen2 3683 > > Looks like x86 is broken in the current svn tree. > > x86 and ppc64 mainline is fine with both =y and =m > > ppc64 + gen2 =y > > drivers/infiniband/ulp/srp/ib_srp.c: In function `srp_process_rsp': > drivers/infiniband/ulp/srp/ib_srp.c:650: warning: long long unsigned int format, u64 arg (arg 2) > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > ppc64 + gen2 =m > > same as above, plus > > *** Warning: ".ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined!
> *** Warning: ".ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! > > x86 + gen2 =y *FAILED* What gcc version are you using ? > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_adaptor_release': > drivers/infiniband/ulp/iser/iser_conn.c:195: parse error before `)' > drivers/infiniband/ulp/iser/iser_conn.c:203: parse error before `)' > drivers/infiniband/ulp/iser/iser_conn.c:206: parse error before `)' > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_establish': > drivers/infiniband/ulp/iser/iser_conn.c:284: parse error before `)' > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_post_receive_control': > drivers/infiniband/ulp/iser/iser_conn.c:861: parse error before `)' > drivers/infiniband/ulp/iser/iser_conn.c:873: parse error before `)' > > drivers/infiniband/ulp/iser/iser_initiator.c: In function `iser_reg_rdma_mem': > drivers/infiniband/ulp/iser/iser_initiator.c:125: parse error before `)' > drivers/infiniband/ulp/iser/iser_initiator.c:130: parse error before `)' > drivers/infiniband/ulp/iser/iser_initiator.c:141: parse error before `)' > drivers/infiniband/ulp/iser/iser_initiator.c:153: parse error before `)' > > drivers/infiniband/ulp/iser/iser_mod.c: In function `init_module': > drivers/infiniband/ulp/iser/iser_mod.c:154: parse error before `)' > drivers/infiniband/ulp/iser/iser_mod.c: In function `cleanup_module': > drivers/infiniband/ulp/iser/iser_mod.c:243: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_create_ia_pz_evd': > drivers/infiniband/ulp/iser/iser_lkdapl.c:147: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_consume_events': > drivers/infiniband/ulp/iser/iser_lkdapl.c:691: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_event_handler_thread': > drivers/infiniband/ulp/iser/iser_lkdapl.c:731: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:749: parse error before `)' 
> drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_conn_event': > drivers/infiniband/ulp/iser/iser_lkdapl.c:776: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:779: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:782: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:785: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:788: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:791: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:794: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:797: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:800: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_single_kdapl_event': > drivers/infiniband/ulp/iser/iser_lkdapl.c:1025: parse error before `)' > > x86 + gen2 =m *FAILED* > > same as above Can you try this patch and see if it eliminates the iser errors ? Thanks. -- Hal Signed-off-by: Hal Rosenstock Index: iser.h =================================================================== --- iser.h (revision 3691) +++ iser.h (working copy) @@ -334,7 +334,7 @@ extern int iser_debug_level; do { \ if (iser_debug_level > 0) \ printk(KERN_DEBUG PFX "%s:" fmt,\ - __func__, ## arg); \ + __func__ , ## arg); \ } while (0) #define iser_err(fmt, arg...) \ From halr at voltaire.com Fri Oct 7 05:38:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Oct 2005 08:38:05 -0400 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <1128672413.13948.326.camel@localhost> References: <1128672413.13948.326.camel@localhost> Message-ID: <1128688684.4382.6629.camel@hal.voltaire.com> Hi Matt, On Fri, 2005-10-07 at 04:06, Matt Leininger wrote: > I'm seeing an IPoIB netperf performance drop off, up to 90 MB/s, when > using kernels newer than 2.6.11. 
This doesn't appear to be an OpenIB > IPoIB issue since the in-kernel and a recent svn3687 snapshot both have > the same performance (464 MB/s) with 2.6.11. I used the same kernel > config file as a starting point for each of these kernel builds. Have > there been any changes in Linux that would explain these results? > > > All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0 > dual EM64T 3.2 GHz PCIe IB HCA (memfull) > > Kernel OpenIB msi_x netperf (MB/s) > 2.6.14-rc3 in-kernel 1 374 > 2.6.13.2 svn3627 1 386 > 2.6.13.2 in-kernel 1 394 > 2.6.12 in-kernel 1 406 > 2.6.11 in-kernel 1 464 > 2.6.11 svn3687 1 464 > 2.6.9-11.ELsmp svn3513 1 425 (Woody's results, 3.6Ghz EM64T) There was already the following thread on netdev that I found: TCP Network performance degade from 2.4.18 to 2.6.10 http://marc.theaimsgroup.com/?l=linux-netdev&m=112792558832125&w=2 I think you should (cross)post this to netdev. -- Hal From nacc at us.ibm.com Fri Oct 7 06:47:46 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Fri, 7 Oct 2005 06:47:46 -0700 Subject: [openib-general] Latest build test results In-Reply-To: <1128687678.4382.6520.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006181147.GB15908@us.ibm.com> <1128687678.4382.6520.camel@hal.voltaire.com> Message-ID: <20051007134746.GA5972@us.ibm.com> On 07.10.2005 [08:21:19 -0400], Hal Rosenstock wrote: > On Thu, 2005-10-06 at 14:11, Nishanth Aravamudan wrote: > > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > > > Great! Thanks. > > > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > > weren't running) now and will post the latest results. 
> > > > > > You might also want to apply > > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > > to get rid of the AT and SDP warnings. > > > > I already submitted several jobs for 2.6.14-rc3-git6, but I'll redo the > > gen2 ones with that patch, thanks. > > > > Here are the results from 2.6.14-rc3-git6 + gen2 3683 > > > > Looks like x86 is broken in the current svn tree. > > > > x86 and ppc64 mainline is fine with both =y and =m > > > > ppc64 + gen2 =y > > > > drivers/infiniband/ulp/srp/ib_srp.c: In function `srp_process_rsp': > > drivers/infiniband/ulp/srp/ib_srp.c:650: warning: long long unsigned int format, u64 arg (arg 2) > > > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > > > ppc64 + gen2 =m > > > > same as above, plus > > > > *** Warning: ".ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined! > > *** Warning: ".ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! > > > > x86 + gen2 =y *FAILED* > > What gcc version are you using ? 
I believe the build systems on all the automated machines are 2.95: Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs gcc version 2.95.4 20011002 (Debian prerelease) > > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_adaptor_release': > > drivers/infiniband/ulp/iser/iser_conn.c:195: parse error before `)' > > drivers/infiniband/ulp/iser/iser_conn.c:203: parse error before `)' > > drivers/infiniband/ulp/iser/iser_conn.c:206: parse error before `)' > > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_establish': > > drivers/infiniband/ulp/iser/iser_conn.c:284: parse error before `)' > > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_post_receive_control': > > drivers/infiniband/ulp/iser/iser_conn.c:861: parse error before `)' > > drivers/infiniband/ulp/iser/iser_conn.c:873: parse error before `)' > > > > drivers/infiniband/ulp/iser/iser_initiator.c: In function `iser_reg_rdma_mem': > > drivers/infiniband/ulp/iser/iser_initiator.c:125: parse error before `)' > > drivers/infiniband/ulp/iser/iser_initiator.c:130: parse error before `)' > > drivers/infiniband/ulp/iser/iser_initiator.c:141: parse error before `)' > > drivers/infiniband/ulp/iser/iser_initiator.c:153: parse error before `)' > > > > drivers/infiniband/ulp/iser/iser_mod.c: In function `init_module': > > drivers/infiniband/ulp/iser/iser_mod.c:154: parse error before `)' > > drivers/infiniband/ulp/iser/iser_mod.c: In function `cleanup_module': > > drivers/infiniband/ulp/iser/iser_mod.c:243: parse error before `)' > > > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_create_ia_pz_evd': > > drivers/infiniband/ulp/iser/iser_lkdapl.c:147: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_consume_events': > > drivers/infiniband/ulp/iser/iser_lkdapl.c:691: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_event_handler_thread': > > drivers/infiniband/ulp/iser/iser_lkdapl.c:731: 
parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:749: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_conn_event': > > drivers/infiniband/ulp/iser/iser_lkdapl.c:776: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:779: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:782: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:785: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:788: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:791: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:794: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:797: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:800: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_single_kdapl_event': > > drivers/infiniband/ulp/iser/iser_lkdapl.c:1025: parse error before `)' > > > > x86 + gen2 =m *FAILED* > > > > same as above > > Can you try this patch and see if it eliminates the iser errors ? > Thanks. Will try it in a bit. 
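For reference, the iser.h change under test is a one-character edit: it puts a space between `__func__` and the comma in front of `## arg`. The failing builds used gcc 2.95.4, so a plausible cause (not confirmed anywhere in this thread) is that compiler's handling of comma pasting in GNU-style variadic macros. A minimal user-space sketch of the same macro shape follows; `demo_log`, `log_buf`, and the two helper functions are made-up names for illustration and are not part of iSER.

```c
#include <stdio.h>

/* Buffer that receives the formatted message (hypothetical, demo only). */
static char log_buf[128];

/*
 * GNU named-variadic macro in the same shape as the iser.h debug macro:
 * "arg..." names the variable arguments, and ", ## arg" removes the comma
 * when no variable arguments are passed. The space before the comma
 * mirrors the patched line.
 */
#define demo_log(fmt, arg...) \
	snprintf(log_buf, sizeof log_buf, "%s:" fmt, __func__ , ## arg)

/* No variable arguments: the comma in front of "## arg" is dropped. */
static const char *log_without_args(void)
{
	demo_log("started");
	return log_buf;
}

/* One variable argument: the comma is kept and the argument is passed on. */
static const char *log_with_args(int count)
{
	demo_log("count=%d", count);
	return log_buf;
}
```

With a current gcc both forms (with and without the space) preprocess the same way, so this only shows what the macro is expected to expand to, not the 2.95 failure itself.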
Thanks, Nish From halr at voltaire.com Fri Oct 7 06:48:56 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Oct 2005 09:48:56 -0400 Subject: [openib-general] Latest build test results In-Reply-To: <1128626684.4382.1762.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> Message-ID: <1128692935.4382.7072.camel@hal.voltaire.com> On Thu, 2005-10-06 at 15:26, Hal Rosenstock wrote: > On Thu, 2005-10-06 at 15:20, Nishanth Aravamudan wrote: > > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > > > Great! Thanks. > > > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > > weren't running) now and will post the latest results. > > > > > > You might also want to apply > > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > > to get rid of the AT and SDP warnings. > > > > This patch does remove the warning regarding undefined symbols during > > modpost, but does not remove the warnings > > > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > Right. Roland reported a change to struct packet_type in 2.6.14. I'll > work on a patch for this too. Thanks. Can you try this patch for the above 2 warnings ? If it works, I check it into the patches directory. Thanks. 
-- Hal

Update arp_recv functions to latest 2.6.14 netdevice.h API for struct packet_type

Signed-off-by: Hal Rosenstock

Index: core/at.c
===================================================================
--- core/at.c	(revision 3691)
+++ core/at.c	(working copy)
@@ -1258,7 +1258,7 @@ static void ib_at_arp_work(void *data)
 }
 
 static int ib_at_arp_recv(struct sk_buff *skb, struct net_device *dev,
-			  struct packet_type *pt)
+			  struct packet_type *pt, struct net_device *orig_dev)
 {
 	struct arp_work *work;
 	struct arphdr *arp_hdr;
Index: ulp/sdp/sdp_link.c
===================================================================
--- ulp/sdp/sdp_link.c	(revision 3691)
+++ ulp/sdp/sdp_link.c	(working copy)
@@ -716,7 +716,7 @@ done:
  * sdp_link_arp_recv - receive all ARP packets
  */
 static int sdp_link_arp_recv(struct sk_buff *skb, struct net_device *dev,
-			     struct packet_type *pt)
+			     struct packet_type *pt, struct net_device *orig_dev)
 {
 	struct sdp_work *work;
 	struct arphdr *arp_hdr;

From hozer at hozed.org Fri Oct 7 07:12:07 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Fri, 7 Oct 2005 09:12:07 -0500
Subject: [openib-general] IBM eHCA testing..
Message-ID: <20051007141207.GX4612@kalmia.hozed.org>

I have two IBM eHCA cards installed and it appears that OpenSM
is happily talking to the firmware and bringing up the links.

So now I'm looking at the install instructions for the ehca2_EHCA2_0025.tgz
code drop, and wondering what (if any) issues there are with a 2.6.13
kernel, or later OpenIB svn drops.

Is there a later code drop I can get ahold of? Is the nr_ports issue
something in the driver? I wound up connecting to the lower port in the
Openpower720 machine.. do you know if that's port 1 or 2?
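
On the arp_recv patch earlier in this thread: the two "initialization from incompatible pointer type" warnings are what one would expect when a three-argument receive function is stored in a handler slot that now takes a fourth `struct net_device *orig_dev` argument. The sketch below reproduces that shape in user space. The struct bodies are simplified stand-ins, and `pt_handler`, `demo_arp_recv`, `demo_arp_type`, and `demo_deliver` are hypothetical names, so this illustrates the signature match only and is not the kernel's actual definition.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only the shapes matter here. */
struct sk_buff { int len; };
struct net_device { const char *name; };
struct packet_type;

/* Handler type with the extra orig_dev parameter, as in the patch. */
typedef int (*pt_handler)(struct sk_buff *skb, struct net_device *dev,
			  struct packet_type *pt,
			  struct net_device *orig_dev);

struct packet_type {
	unsigned short type;
	pt_handler func;
};

/* Receive function whose signature matches the four-argument handler. */
static int demo_arp_recv(struct sk_buff *skb, struct net_device *dev,
			 struct packet_type *pt, struct net_device *orig_dev)
{
	(void)pt;
	/* Return the length if the packet arrived on dev itself, else -1. */
	return (dev == orig_dev) ? skb->len : -1;
}

/* With matching signatures this initialization compiles without warnings. */
static struct packet_type demo_arp_type = {
	.type = 0x0806,		/* ETH_P_ARP */
	.func = demo_arp_recv,
};

/* Dispatch through the handler slot, the way a caller of func would. */
static int demo_deliver(struct sk_buff *skb, struct net_device *dev,
			struct net_device *orig_dev)
{
	return demo_arp_type.func(skb, dev, &demo_arp_type, orig_dev);
}
```

Dropping the last parameter from `demo_arp_recv` brings back the same class of diagnostic on the `.func` initializer (a warning in older compilers, a hard error in some newer ones).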
From nacc at us.ibm.com Fri Oct 7 07:16:39 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Fri, 7 Oct 2005 07:16:39 -0700 Subject: [openib-general] Latest build test results In-Reply-To: <1128692935.4382.7072.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <1128692935.4382.7072.camel@hal.voltaire.com> Message-ID: <20051007141639.GB5972@us.ibm.com> On 07.10.2005 [09:48:56 -0400], Hal Rosenstock wrote: > On Thu, 2005-10-06 at 15:26, Hal Rosenstock wrote: > > On Thu, 2005-10-06 at 15:20, Nishanth Aravamudan wrote: > > > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > > > > > Great! Thanks. > > > > > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > > > weren't running) now and will post the latest results. > > > > > > > > You might also want to apply > > > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > > > to get rid of the AT and SDP warnings. > > > > > > This patch does remove the warning regarding undefined symbols during > > > modpost, but does not remove the warnings > > > > > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > > > > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > > > Right. Roland reported a change to struct packet_type in 2.6.14. I'll > > work on a patch for this too. Thanks. > > Can you try this patch for the above 2 warnings ? If it works, I check > it into the patches directory. Thanks. 
Will try this along with the other patch you sent after I return from
class (about 2 hours).

Thanks,
Nish

From parks at lanl.gov Fri Oct 7 08:49:12 2005
From: parks at lanl.gov (Parks Fields)
Date: Fri, 07 Oct 2005 09:49:12 -0600
Subject: [openib-general] Timeline of IPoIB performance
In-Reply-To: <1128672413.13948.326.camel@localhost>
References: <1128672413.13948.326.camel@localhost>
Message-ID: <6.2.3.4.2.20051007074938.01fefcf8@ccn-mail.lanl.gov>

Matt,
I have seen the same thing. I just didn't relate it to the Kernel.
My IPoIB performance is down to ~340MB/sec with 2.6.12.1 and svn 3040.
With 2.6.13 and svn 3490 the peak is 402MB/sec.

At 02:06 AM 10/7/2005, Matt Leininger wrote:
>I'm seeing an IPoIB netperf performance drop off, up to 90 MB/s, when
>using kernels newer than 2.6.11. This doesn't appear to be an OpenIB
>IPoIB issue since the in-kernel and a recent svn3687 snapshot both have
>the same performance (464 MB/s) with 2.6.11. I used the same kernel
>config file as a starting point for each of these kernel builds. Have
>there been any changes in Linux that would explain these results?
> > >All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0 >dual EM64T 3.2 GHz PCIe IB HCA (memfull) > >Kernel OpenIB msi_x netperf (MB/s) >2.6.14-rc3 in-kernel 1 374 >2.6.13.2 svn3627 1 386 >2.6.13.2 in-kernel 1 394 >2.6.12 in-kernel 1 406 >2.6.11 in-kernel 1 464 >2.6.11 svn3687 1 464 >2.6.9-11.ELsmp svn3513 1 425 (Woody's results, 3.6Ghz EM64T) > > Thanks, > > - Matt > > > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rolandd at cisco.com Fri Oct 7 08:58:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 07 Oct 2005 08:58:55 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <1128672413.13948.326.camel@localhost> (Matt Leininger's message of "Fri, 07 Oct 2005 01:06:53 -0700") References: <1128672413.13948.326.camel@localhost> Message-ID: <52ek6xmi80.fsf@cisco.com> Hmm, looks like something in the network stack must have changed. > 2.6.12 in-kernel 1 406 > 2.6.11 in-kernel 1 464 This looks like the biggest dropoff. I can think of two things that would be interesting to do if you or anyone else has time. First, taking profiles of netperf runs between these two kernels and comparing might be enlightening. Also, it would be useful to pin down when the regression happened, so running the same test with 2.6.12-rc1 through 2.6.12-rc6 would be a good thing. - R. From pradeep at us.ibm.com Fri Oct 7 09:14:04 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Fri, 7 Oct 2005 09:14:04 -0700 Subject: [openib-general] IBM eHCA testing.. In-Reply-To: <20051007141207.GX4612@kalmia.hozed.org> Message-ID: I believe the lower port is port 1. I will defer to the EHCA team as regards to issues with 2.6.13 (if any). We have minimally used both ports on p570. So, my guess is that should work on a Openpower720. 
Pradeep pradeep at us.ibm.com openib-general-bounces at openib.org wrote on 10/07/2005 07:12:07 AM: > I have two IBM eHCA cards installed and it appears that OpenSM > is happily talking to the firmware and bringing up the links. > > So now I'm looking at the install instructions for the ehca2_EHCA2_0025.tgz > code drop, and wondering what (if any) issues there are with a 2.6.13 > kernel, or later OpenIB svn drops. > > Is there a later code drop I can get ahold of? Is the nr_ports issue > something in the driver? I wound up connecting to the lower port in the > Openpower720 machine.. do you know if that's port 1 or 2? > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From krause at cup.hp.com Fri Oct 7 09:20:17 2005 From: krause at cup.hp.com (Michael Krause) Date: Fri, 07 Oct 2005 09:20:17 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020912@NT-SJCA-0751.brcm.a d.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020912@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <6.2.0.14.2.20051007091316.024dec70@esmail.cup.hp.com> At 06:38 AM 9/30/2005, Caitlin Bestler wrote: > > > > -----Original Message----- > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > > Sent: Thursday, September 29, 2005 6:50 PM > > To: Sean Hefty > > Cc: Openib > > Subject: Re: [openib-general] [RFC] IB address translation using ARP > > > > Sean> Can you explain how RDMA works in this case? This is simply > > Sean> performing IP routing, and not IB routing, correct? Are you > > Sean> referring to a protocol running on top of IP or IB directly? 
> > Sean> Is the router establishing a second reliable connection on > > Sean> the backend? Does it simply translate headers as packets > > Sean> pass through in this case? > > > > I think the usage model is the following: you have some magic > > device that has an IB port on one side and "something else" > > on the other side. Think of something like a gateway that > > talks SDP on the IB side and TCP/IP on the other side. > > > > You configure your IPoIB routing so that this magic device is > > the next hop for talking to hosts on the IP network on the other side. > > > > Now someone tries to make an SDP connection to an IP address > > on the other side of the magic device. Routing tables + ARP > > give it the GID of the IB port of this magic device. It > > connects to the magic device and run SDP to talk to the magic > > device, and the magic device magically splices this into a > > TCP connection to the real destination. > > > > Or the same idea for an NFS/RDMA <-> NFS/UDP gateway, etc. > > > >Those examples are all basically application level gateways. >As such they would have no transport or connection setup >implications. The application level gateway simply offers >a service on network X that it fulfills on network Y. But >as far as network X is concerned the gateway IS the server. It must be viewed as such. The cross over point between the two domains represents independent management domains, trust domains, reliable delivery domains, etc. >I do not believe it is possible to construct a transport >layer gateway that bridges RDMA between IB and iWARP while >appearing to be a normal RDMA endpoint on both networks. >Higher level gateways will be possible for many >applications, but I don't see how that relates to >connection establishment. That would require having >an end-to-end reliable connection, complete with flow >control semantics, that bridged the two networks by >some method other than encapsulation or tunneling. 
We took steps to ensure that both IB and iWARP could transmit packets in the
main data path very efficiently between the two interconnects but it was
never envisioned that a connection was truly end-to-end transparent across
the gateway component. I think most of the architects would not support such
an effort to define such a beast. There are many issues in attempting such an
offering. Just examine all of the problems with the existing iSCSI to FC
solutions; they ignore a number of customer issues and hence have been
relegated in many customer minds as TTM, play toys not ready for prime time.
This is one of the many reasons why iSCSI has not taken off as the hype
portrayed.

It would be best to define a CM architecture that enabled communication
between like endpoints and avoid the gateway dilemma. Let the gateway
provider work out such issues as there are many requirements already on each
side of these interconnects.

Mike

From krause at cup.hp.com Fri Oct 7 09:29:19 2005
From: krause at cup.hp.com (Michael Krause)
Date: Fri, 07 Oct 2005 09:29:19 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F7F9F14@taurus.voltaire.com>
References: <35EA21F54A45CB47B879F21A91F4862F7F9F14@taurus.voltaire.com>
Message-ID: <6.2.0.14.2.20051007092706.02504c98@esmail.cup.hp.com>

At 06:24 AM 9/30/2005, Yaron Haviv wrote:
> > -----Original Message-----
> > From: Roland Dreier [mailto:rolandd at cisco.com]
> > Sent: Thursday, September 29, 2005 9:50 PM
> > To: Sean Hefty
> > Cc: Yaron Haviv; Openib
> > Subject: Re: [openib-general] [RFC] IB address translation using ARP
> >
> > I think the usage model is the following: you have some magic device
> > that has an IB port on one side and "something else" on the other
> > side. Think of something like a gateway that talks SDP on the IB side
> > and TCP/IP on the other side.
> > > >Also applicable to two IB ports, e.g. forwarding SDP traffic from one IB >partition to SDP on another partition (may even be the same port with >two P_Keys), and doing some load-balancing or traffic management in >between, overall there are many use cases for that. While I can envision how an endpoint could communicate with another in separate partitions, doing so really violates the spirit of the partitioning where endpoints must be in the same partition in order to see one another and communicate. Attempting to create an intermediary who has insights into both and then somehow is able to communicate how to find one another using some proprietary (can't be through standards that I can think of) method, seems like way too much complexity to be worth it. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From xma at us.ibm.com Fri Oct 7 09:33:27 2005 From: xma at us.ibm.com (Shirley Ma) Date: Fri, 7 Oct 2005 09:33:27 -0700 Subject: [openib-general] IBM eHCA testing.. In-Reply-To: Message-ID: Hi, Troy, There is INSTALL file in the EHCA driver package. In OpenPower 720 port 1 is at the top, port 2 is at the bottom. In P570, port1 is at the bottom, port2 is at the top. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.hefty at intel.com Fri Oct 7 09:40:04 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 7 Oct 2005 09:40:04 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <6.2.0.14.2.20051007091316.024dec70@esmail.cup.hp.com> Message-ID: >It would be best to define a CM architecture that enabled communication >between like endpoints and avoid the gateway dilemma. Let the gateway >provider work out such issues as there are many requirements already >on each side of these interconnects. 
I've given this some more thought since the original postings and agree with
you. It doesn't seem right to me to have the CM establish a connection to
something that is not the specified destination, under the assumption that
whatever is being connected to is a gateway.

I think it would be better for the application to determine that the actual
destination is on a different subnet, locate the gateway, and issue a
connection request to the gateway.

- Sean

From iod00d at hp.com Fri Oct 7 10:05:50 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 7 Oct 2005 10:05:50 -0700
Subject: [openib-general] [PATCH] udapl: PPC64 cpuinfo change
In-Reply-To: <20051007040121.GW4612@kalmia.hozed.org>
References: <20051006211408.GF26238@esmail.cup.hp.com> <20051007040121.GW4612@kalmia.hozed.org>
Message-ID: <20051007170550.GD30308@esmail.cup.hp.com>

On Thu, Oct 06, 2005 at 11:01:21PM -0500, Troy Benjegerdes wrote:
> Oh boy.... is there some reason 'gettimeofday' does not work?

In general, it doesn't work as well.

> Trying to infer timebase/clock/rtsc frequency is going to be a mess.

Using cycle counters is quite portable today and provides
accurate results (with caveats on its use). I'm open to using
the next best thing once it's clear the cycle counters do NOT work.

> Think cpus that dynamically change frequency.. Laptops do now..
> how long before something with infiniband does and breaks this
> code horribly? (think embedded systems)

I don't buy this argument. Most of the tests load the CPU and it
essentially runs at a fixed frequency. A better argument is
how to benchmark under a virtualized environment. I think that is
totally broken today regardless of what method one uses to measure time.

> There are a couple of implementations of gettimeofday fully in userspace
> that hide the details and still read the high-res hardware counters. Google
> for 'vDSO gettimeofday'.

Well, I'm sure Michael is open to patches on this for userspace/perftest
stuff and likewise for James Lentini for uDAPL.

grant

From sean.hefty at intel.com Fri Oct 7 12:19:23 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 7 Oct 2005 12:19:23 -0700
Subject: [openib-general] [PATCH] [ADDR] address translation module for CMA
Message-ID:

The following patch adds a simple IP to IB address translation module using
ARP. It is based off AT and SDP, but kept as simple as possible. I would like
to merge this back into the trunk, and apply other changes there.

Signed-off-by: Sean Hefty

Index: include/rdma/ib_addr.h
===================================================================
--- include/rdma/ib_addr.h	(revision 0)
+++ include/rdma/ib_addr.h	(revision 0)
@@ -0,0 +1,72 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *    copy of which is available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ */
+
+#if !defined(IB_ADDR_H)
+#define IB_ADDR_H
+
+#include
+#include
+
+struct ib_addr {
+	union ib_gid sgid;
+	union ib_gid dgid;
+	u16 pkey;
+};
+
+/**
+ * ib_translate_addr - Translate a local IP address to an Infiniband GID and
+ *   PKey.
+ */
+int ib_translate_addr(struct sockaddr *addr, union ib_gid *gid, u16 *pkey);
+
+/**
+ * ib_resolve_addr - Resolve source and destination IP addresses to
+ *   Infiniband network addresses.
+ * @src_addr: An optional source address to use in the resolution. If a
+ *   source address is not provided, a usable address will be returned via
+ *   the callback.
+ * @dst_addr: The destination address to resolve.
+ * @addr: A reference to a data location that will receive the resolved
+ *   addresses. The data location must remain valid until the callback has
+ *   been invoked.
+ * @timeout_ms: Amount of time to wait for the address resolution to complete.
+ * @callback: Call invoked once address resolution has completed, timed out,
+ *   or been canceled. A status of 0 indicates success.
+ * @context: User-specified context associated with the call.
+ */
+int ib_resolve_addr(struct sockaddr *src_addr, struct sockaddr *dst_addr,
+		    struct ib_addr *addr, int timeout_ms,
+		    void (*callback)(int status, struct sockaddr *src_addr,
+				     struct ib_addr *addr, void *context),
+		    void *context);
+
+void ib_addr_cancel(struct ib_addr *addr);
+
+#endif /* IB_ADDR_H */
+
Index: core/addr.c
===================================================================
--- core/addr.c	(revision 0)
+++ core/addr.c	(revision 0)
@@ -0,0 +1,351 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc. All rights reserved.
+ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *    copy of which is available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("IB Address Translation");
+MODULE_LICENSE("Dual BSD/GPL");
+
+struct addr_req {
+	struct list_head list;
+	struct sockaddr src_addr;
+	struct sockaddr dst_addr;
+	struct ib_addr *addr;
+	void *context;
+	void (*callback)(int status, struct sockaddr *src_addr,
+			 struct ib_addr *addr, void *context);
+	unsigned long timeout;
+	int status;
+};
+
+static void process_req(void *data);
+
+static DECLARE_MUTEX(mutex);
+static LIST_HEAD(req_list);
+static DECLARE_WORK(work, process_req, NULL);
+static struct workqueue_struct *wq;
+
+static u16 addr_get_pkey(struct net_device *dev)
+{
+	return ((u16)dev->broadcast[8] << 8) | (u16)dev->broadcast[9];
+}
+
+int ib_translate_addr(struct sockaddr *addr, union ib_gid *gid, u16 *pkey)
+{
+	struct net_device *dev;
+	u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr;
+
+	dev = ip_dev_find(ip);
+	if (!dev)
+		return -EADDRNOTAVAIL;
+
+	*gid = *(union ib_gid *) (dev->dev_addr + 4);
+	*pkey = addr_get_pkey(dev);
+	dev_put(dev);
+	return 0;
+}
+EXPORT_SYMBOL(ib_translate_addr);
+
+static void set_timeout(unsigned long time)
+{
+	unsigned long delay;
+
+	cancel_delayed_work(&work);
+
+	delay = time - jiffies;
+	if ((long)delay <= 0)
+		delay = 1;
+
+	queue_delayed_work(wq, &work, delay);
+}
+
+static void queue_req(struct addr_req *req)
+{
+	struct addr_req *temp_req;
+
+	down(&mutex);
+	list_for_each_entry_reverse(temp_req, &req_list, list) {
+		if (time_after(req->timeout, temp_req->timeout))
+			break;
+	}
+
+	list_add(&req->list, &temp_req->list);
+
+	if (req_list.next == &req->list)
+		set_timeout(req->timeout);
+	up(&mutex);
+}
+
+static void addr_send_arp(struct sockaddr_in *dst_in)
+{
+	struct rtable *rt;
+	struct flowi fl;
+	u32 dst_ip = dst_in->sin_addr.s_addr;
+
+	memset(&fl, 0, sizeof fl);
+	fl.nl_u.ip4_u.daddr = dst_ip;
+	if (ip_route_output_key(&rt, &fl))
+		return;
+
+	arp_send(ARPOP_REQUEST, ETH_P_ARP, dst_ip, rt->idev->dev, rt->rt_src,
+		 NULL, rt->idev->dev->dev_addr, NULL);
+	ip_rt_put(rt);
+}
+
+static int addr_resolve_remote(struct sockaddr_in *src_in,
+			       struct sockaddr_in *dst_in,
+			       struct ib_addr *addr)
+{
+	u32 src_ip = src_in->sin_addr.s_addr;
+	u32 dst_ip = dst_in->sin_addr.s_addr;
+	struct flowi fl;
+	struct rtable *rt;
+	struct neighbour *neigh;
+	int ret;
+
+	memset(&fl, 0, sizeof fl);
+	fl.nl_u.ip4_u.daddr = dst_ip;
+	fl.nl_u.ip4_u.saddr = src_ip;
+	ret = ip_route_output_key(&rt, &fl);
+	if (ret)
+		goto out;
+
+	neigh = neigh_lookup(&arp_tbl, &dst_ip, rt->idev->dev);
+	if (!neigh) {
+		ret = -ENODATA;
+		goto err1;
+	}
+
+	if (!(neigh->nud_state & NUD_VALID)) {
+		ret = -ENODATA;
+		goto err2;
+	}
+
+	if (!src_ip) {
+		src_in->sin_family = dst_in->sin_family;
+		src_in->sin_addr.s_addr = rt->rt_src;
+	}
+
+	addr->sgid = *(union ib_gid *) (neigh->dev->dev_addr + 4);
+	addr->dgid = *(union ib_gid *) (neigh->ha + 4);
+	addr->pkey = addr_get_pkey(neigh->dev);
+
+err2:
+	neigh_release(neigh);
+err1:
+	ip_rt_put(rt);
+out:
+	return ret;
+}
+
+static void process_req(void *data)
+{
+	struct addr_req *req, *temp_req;
+	struct sockaddr_in *src_in, *dst_in;
+	struct list_head done_list;
+
+	INIT_LIST_HEAD(&done_list);
+
+	down(&mutex);
+	list_for_each_entry_safe(req, temp_req, &req_list, list) {
+		if (req->status) {
+			src_in = (struct sockaddr_in *) &req->src_addr;
+			dst_in = (struct sockaddr_in *) &req->dst_addr;
+			req->status = addr_resolve_remote(src_in, dst_in,
+							  req->addr);
+		}
+		if (req->status && time_after(jiffies, req->timeout))
+			req->status = -ETIMEDOUT;
+		else if (req->status == -ENODATA)
+			continue;
+
+		list_del(&req->list);
+		list_add_tail(&req->list, &done_list);
+	}
+
+	if (!list_empty(&req_list)) {
+		req = list_entry(req_list.next, struct addr_req, list);
+		set_timeout(req->timeout);
+	}
+	up(&mutex);
+
+	list_for_each_entry_safe(req, temp_req, &done_list, list) {
+		list_del(&req->list);
req->callback(req->status, &req->src_addr, req->addr, + req->context); + kfree(req); + } +} + +static int addr_resolve_local(struct sockaddr_in *src_in, + struct sockaddr_in *dst_in, + struct ib_addr *addr) +{ + struct net_device *dev; + u32 src_ip = src_in->sin_addr.s_addr; + u32 dst_ip = dst_in->sin_addr.s_addr; + int ret = 0; + + dev = ip_dev_find(dst_ip); + if (!dev) + return -EADDRNOTAVAIL; + + if (!src_ip) { + src_in->sin_family = dst_in->sin_family; + src_in->sin_addr.s_addr = dst_ip; + addr->sgid = *(union ib_gid *) (dev->dev_addr + 4); + addr->pkey = addr_get_pkey(dev); + } else { + ret = ib_translate_addr((struct sockaddr *)src_in, + &addr->sgid, &addr->pkey); + if (ret) + goto out; + } + + addr->dgid = *(union ib_gid *) (dev->dev_addr + 4); +out: + dev_put(dev); + return ret; +} + +int ib_resolve_addr(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct ib_addr *addr, int timeout_ms, + void (*callback)(int status, struct sockaddr *src_addr, + struct ib_addr *addr, void *context), + void *context) +{ + struct sockaddr_in *src_in, *dst_in; + struct addr_req *req; + int ret = 0; + + req = kmalloc(sizeof *req, GFP_KERNEL); + if (!req) + return -ENOMEM; + memset(req, 0, sizeof *req); + + if (src_addr) + req->src_addr = *src_addr; + req->dst_addr = *dst_addr; + req->addr = addr; + req->callback = callback; + req->context = context; + + src_in = (struct sockaddr_in *) &req->src_addr; + dst_in = (struct sockaddr_in *) &req->dst_addr; + + req->status = addr_resolve_local(src_in, dst_in, addr); + if (req->status == -EADDRNOTAVAIL) + req->status = addr_resolve_remote(src_in, dst_in, addr); + + switch (req->status) { + case 0: + req->timeout = jiffies; + queue_req(req); + break; + case -ENODATA: + req->timeout = msecs_to_jiffies(timeout_ms) + jiffies; + queue_req(req); + addr_send_arp(dst_in); + break; + default: + ret = req->status; + kfree(req); + break; + } + return ret; +} +EXPORT_SYMBOL(ib_resolve_addr); + +void ib_addr_cancel(struct ib_addr *addr) +{ 
+	struct addr_req *req, *temp_req;
+
+	down(&mutex);
+	list_for_each_entry_safe(req, temp_req, &req_list, list) {
+		if (req->addr == addr) {
+			req->status = -ECANCELED;
+			req->timeout = jiffies;
+			list_del(&req->list);
+			list_add(&req->list, &req_list);
+			set_timeout(req->timeout);
+			break;
+		}
+	}
+	up(&mutex);
+}
+EXPORT_SYMBOL(ib_addr_cancel);
+
+static int addr_arp_recv(struct sk_buff *skb, struct net_device *dev,
+			 struct packet_type *pkt)
+{
+	struct arphdr *arp_hdr;
+
+	arp_hdr = (struct arphdr *) skb->nh.raw;
+
+	if (dev->type == ARPHRD_INFINIBAND &&
+	    (arp_hdr->ar_op == __constant_htons(ARPOP_REQUEST) ||
+	     arp_hdr->ar_op == __constant_htons(ARPOP_REPLY)))
+		set_timeout(jiffies);
+
+	kfree_skb(skb);
+	return 0;
+}
+
+static struct packet_type addr_arp = {
+	.type = __constant_htons(ETH_P_ARP),
+	.func = addr_arp_recv,
+	.af_packet_priv = (void*) 1,
+};
+
+static int addr_init(void)
+{
+	wq = create_singlethread_workqueue("ib_addr");
+	if (!wq)
+		return -ENOMEM;
+
+	dev_add_pack(&addr_arp);
+	return 0;
+}
+
+static void addr_cleanup(void)
+{
+	dev_remove_pack(&addr_arp);
+	destroy_workqueue(wq);
+}
+
+module_init(addr_init);
+module_exit(addr_cleanup);

From pradeep at us.ibm.com Fri Oct 7 12:19:48 2005
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Fri, 7 Oct 2005 12:19:48 -0700
Subject: [openib-general] Questions about mad_test
Message-ID:

I am hoping someone will be able to help me out with a few answers,
saving me some debug time or effort spent on something that is already
known.

I was trying to execute mad_test and found that it errors out. For some
reason it does not like the DR path that I gave it.

1. I ran ibnetdiscover and got the set of LIDs that I use as the DR path.
   Is that the correct way to go about it? It always errors out with
   something like:
   hop 0 != 0 or hop 1 != dev_port

2. Also, there is an expectation of there being a device
   /dev/infiniband/mthca0/ports/1/mad (using all defaults in this case).
   Is that correct? Are there any specific major and minor numbers I
   must use?

3. Is there anything else that I am missing?

I am using this from trunk 3675 on a 2.6.13 kernel.

Thanks in advance for all the help!

Pradeep
pradeep at us.ibm.com

From sean.hefty at intel.com Fri Oct 7 12:27:44 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 7 Oct 2005 12:27:44 -0700
Subject: [openib-general] [PATCH] [CMA] RDMA CM abstraction module
Message-ID:

The following patch adds a basic RDMA connection management abstraction.
It is functional, but needs additional work for handling device removal,
and several features are still missing. I'd like to merge this back into
the trunk, and continue working on it from there.

This depends on the ib_addr module.

Signed-off-by: Sean Hefty

Index: include/rdma/rdma_cm.h
===================================================================
--- include/rdma/rdma_cm.h (revision 0)
+++ include/rdma/rdma_cm.h (revision 0)
@@ -0,0 +1,201 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *    copy of which is available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ */
+
+#if !defined(RDMA_CM_H)
+#define RDMA_CM_H
+
+#include
+#include
+#include
+
+/*
+ * Upon receiving a device removal event, users must destroy the associated
+ * RDMA identifier and release all resources allocated with the device.
+ */
+enum rdma_event_type {
+	RDMA_EVENT_ADDR_RESOLVED,
+	RDMA_EVENT_ADDR_ERROR,
+	RDMA_EVENT_ROUTE_RESOLVED,
+	RDMA_EVENT_ROUTE_ERROR,
+	RDMA_EVENT_CONNECT_REQUEST,
+	RDMA_EVENT_CONNECT_ERROR,
+	RDMA_EVENT_UNREACHABLE,
+	RDMA_EVENT_REJECTED,
+	RDMA_EVENT_ESTABLISHED,
+	RDMA_EVENT_DISCONNECTED,
+	RDMA_EVENT_DEVICE_REMOVAL,
+};
+
+struct rdma_addr {
+	struct sockaddr src_addr;
+	struct sockaddr dst_addr;
+	union {
+		struct ib_addr ibaddr;
+	} addr;
+};
+
+struct rdma_route {
+	struct rdma_addr addr;
+	struct ib_sa_path_rec *path_rec;
+	int num_paths;
+};
+
+struct rdma_event {
+	enum rdma_event_type event;
+	int status;
+	void *private_data;
+	u8 private_data_len;
+};
+
+struct rdma_id;
+
+/**
+ * rdma_event_handler - Callback used to report user events.
+ *
+ * Notes: Users may not call rdma_destroy_id from this callback to destroy
+ *   the passed in id, or a corresponding listen id. Returning a
+ *   non-zero value from the callback will destroy the corresponding id.
+ */
+typedef int (*rdma_event_handler)(struct rdma_id *id, struct rdma_event *event);
+
+struct rdma_id {
+	struct ib_device *device;
+	void *context;
+	struct ib_qp *qp;
+	rdma_event_handler event_handler;
+	struct rdma_route route;
+};
+
+struct rdma_id* rdma_create_id(rdma_event_handler event_handler, void *context);
+
+void rdma_destroy_id(struct rdma_id *id);
+
+/**
+ * rdma_bind_addr - Bind an RDMA identifier to a source address and
+ *   associated RDMA device, if needed.
+ *
+ * @id: RDMA identifier.
+ * @addr: Local address information. Wildcard values are permitted.
+ *
+ * This associates a source address with the RDMA identifier before calling
+ * rdma_listen. If a specific local address is given, the RDMA identifier will
+ * be bound to a local RDMA device.
+ */
+int rdma_bind_addr(struct rdma_id *id, struct sockaddr *addr);
+
+/**
+ * rdma_resolve_addr - Resolve destination and optional source addresses
+ *   from IP addresses to an RDMA address. If successful, the specified
+ *   rdma_id will be bound to a local device.
+ *
+ * @id: RDMA identifier.
+ * @src_addr: Source address information. This parameter may be NULL.
+ * @dst_addr: Destination address information.
+ * @timeout_ms: Time to wait for resolution to complete.
+ */
+int rdma_resolve_addr(struct rdma_id *id, struct sockaddr *src_addr,
+		      struct sockaddr *dst_addr, int timeout_ms);
+
+/**
+ * rdma_resolve_route - Resolve the RDMA address bound to the RDMA identifier
+ *   into route information needed to establish a connection.
+ *
+ * This is called on the client side of a connection, but its use is optional.
+ * Users must have first called rdma_resolve_addr to resolve a dst_addr
+ * into an RDMA address before calling this routine.
+ */
+int rdma_resolve_route(struct rdma_id *id, int timeout_ms);
+
+/**
+ * rdma_create_qp - Allocate a QP and associate it with the specified RDMA
+ *   identifier.
+ */
+int rdma_create_qp(struct rdma_id *id, struct ib_pd *pd,
+		   struct ib_qp_init_attr *qp_init_attr);
+
+/**
+ * rdma_destroy_qp - Deallocate the QP associated with the specified RDMA
+ *   identifier.
+ *
+ * Users must destroy any QP associated with an RDMA identifier before
+ * destroying the RDMA ID.
+ */
+void rdma_destroy_qp(struct rdma_id *id);
+
+struct rdma_conn_param {
+	const void *private_data;
+	u8 private_data_len;
+	u8 responder_resources;
+	u8 initiator_depth;
+	u8 flow_control;
+	u8 retry_count;		/* ignored when accepting */
+	u8 rnr_retry_count;
+};
+
+/**
+ * rdma_connect - Initiate an active connection request.
+ * + * Users must have bound the rdma_id to a local device by having called + * rdma_resolve_addr before calling this routine. Users may also resolve the + * RDMA address to a route with rdma_resolve_route, but if a route has not + * been resolved, a default route will be selected. + * + * Note that the QP must be in the INIT state. + */ +int rdma_connect(struct rdma_id *id, struct rdma_conn_param *conn_param); + +/** + * rdma_listen - This function is called by the passive side to + * listen for incoming connection requests. + * + * Users must have bound the rdma_id to a local address by calling + * rdma_bind_addr before calling this routine. + */ +int rdma_listen(struct rdma_id *id); + +/** + * rdma_accept - Called on the passive side to accept a connection request + * + * Note that the QP must be in the INIT state. + */ +int rdma_accept(struct rdma_id *id, struct rdma_conn_param *conn_param); + +/** + * rdma_reject - Called on the passive side to reject a connection request. + */ +int rdma_reject(struct rdma_id *id, const void *private_data, + u8 private_data_len); + +/** + * rdma_disconnect - This function disconnects the associated QP. + */ +int rdma_disconnect(struct rdma_id *id); + +#endif /* RDMA_CM_H */ + Index: core/cma.c =================================================================== --- core/cma.c (revision 0) +++ core/cma.c (revision 0) @@ -0,0 +1,1207 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. 
+ * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + * + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Guy German"); +MODULE_DESCRIPTION("Generic RDMA CM Agent"); +MODULE_LICENSE("Dual BSD/GPL"); + +#define CMA_CM_RESPONSE_TIMEOUT 20 +#define CMA_MAX_CM_RETRIES 3 + +static void cma_add_one(struct ib_device *device); +static void cma_remove_one(struct ib_device *device); + +static struct ib_client cma_client = { + .name = "cma", + .add = cma_add_one, + .remove = cma_remove_one +}; + +static DEFINE_SPINLOCK(lock); +static LIST_HEAD(dev_list); + +struct cma_device { + struct list_head list; + struct ib_device *device; + __be64 node_guid; + wait_queue_head_t wait; + atomic_t refcount; + struct list_head id_list; +}; + +enum cma_state { + CMA_IDLE, + CMA_ADDR_QUERY, + CMA_ADDR_RESOLVED, + CMA_ROUTE_QUERY, + CMA_ROUTE_RESOLVED, + CMA_CONNECT, + CMA_ADDR_BOUND, + CMA_LISTEN, + CMA_DEVICE_REMOVAL, + CMA_DESTROYING +}; + +/* + * Device removal can occur at anytime, so we need extra handling to + * serialize notifying the user of device removal with other callbacks. + * We do this by disabling removal notification while a callback is in process, + * and reporting it after the callback completes. 
+ */ +struct rdma_id_private { + struct rdma_id id; + + struct list_head list; + struct cma_device *cma_dev; + + enum cma_state state; + spinlock_t lock; + wait_queue_head_t wait; + atomic_t refcount; + atomic_t dev_remove; + + int timeout_ms; + struct ib_sa_query *query; + int query_id; + struct ib_cm_id *cm_id; +}; + +struct cma_addr { + u8 version; /* CMA version: 7:4, IP version: 3:0 */ + u8 reserved; + __be16 port; + struct { + union { + struct in6_addr ip6; + struct { + __be32 pad[3]; + __be32 addr; + } ip4; + } ver; + } src_addr, dst_addr; +}; + +static int cma_comp(struct rdma_id_private *id_priv, enum cma_state comp) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&id_priv->lock, flags); + ret = (id_priv->state == comp); + spin_unlock_irqrestore(&id_priv->lock, flags); + return ret; +} + +static int cma_comp_exch(struct rdma_id_private *id_priv, + enum cma_state comp, enum cma_state exch) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&id_priv->lock, flags); + if ((ret = (id_priv->state == comp))) + id_priv->state = exch; + spin_unlock_irqrestore(&id_priv->lock, flags); + return ret; +} + +static enum cma_state cma_exch(struct rdma_id_private *id_priv, + enum cma_state exch) +{ + unsigned long flags; + enum cma_state old; + + spin_lock_irqsave(&id_priv->lock, flags); + old = id_priv->state; + id_priv->state = exch; + spin_unlock_irqrestore(&id_priv->lock, flags); + return old; +} + +static inline u8 cma_get_ip_ver(struct cma_addr *addr) +{ + return addr->version & 0xF; +} + +static inline u8 cma_get_cma_ver(struct cma_addr *addr) +{ + return addr->version >> 4; +} + +static inline void cma_set_vers(struct cma_addr *addr, u8 cma_ver, u8 ip_ver) +{ + addr->version = (cma_ver << 4) + (ip_ver & 0xF); +} + +static int cma_acquire_ib_dev(struct rdma_id_private *id_priv, + union ib_gid *gid) +{ + struct cma_device *cma_dev; + unsigned long flags; + int ret = -ENODEV; + u8 port; + + spin_lock_irqsave(&lock, flags); + 
list_for_each_entry(cma_dev, &dev_list, list) { + ret = ib_find_cached_gid(cma_dev->device, gid, &port, NULL); + if (!ret) { + atomic_inc(&cma_dev->refcount); + id_priv->cma_dev = cma_dev; + id_priv->id.device = cma_dev->device; + list_add_tail(&id_priv->list, &cma_dev->id_list); + break; + } + } + spin_unlock_irqrestore(&lock, flags); + return ret; +} + +static void cma_release_dev(struct rdma_id_private *id_priv) +{ + unsigned long flags; + + spin_lock_irqsave(&lock, flags); + list_del(&id_priv->list); + spin_unlock_irqrestore(&lock, flags); + + if (atomic_dec_and_test(&id_priv->cma_dev->refcount)) + wake_up(&id_priv->cma_dev->wait); +} + +static void cma_deref_id(struct rdma_id_private *id_priv) +{ + if (atomic_dec_and_test(&id_priv->refcount)) + wake_up(&id_priv->wait); +} + +struct rdma_id* rdma_create_id(rdma_event_handler event_handler, void *context) +{ + struct rdma_id_private *id_priv; + + id_priv = kmalloc(sizeof *id_priv, GFP_KERNEL); + if (!id_priv) + return NULL; + memset(id_priv, 0, sizeof *id_priv); + + id_priv->state = CMA_IDLE; + id_priv->id.context = context; + id_priv->id.event_handler = event_handler; + spin_lock_init(&id_priv->lock); + init_waitqueue_head(&id_priv->wait); + atomic_set(&id_priv->refcount, 1); + atomic_set(&id_priv->dev_remove, 1); + + return &id_priv->id; +} +EXPORT_SYMBOL(rdma_create_id); + +static int cma_init_ib_qp(struct rdma_id_private *id_priv, struct ib_qp *qp) +{ + struct ib_qp_attr qp_attr; + struct ib_sa_path_rec *path_rec; + int ret; + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + + path_rec = id_priv->id.route.path_rec; + ret = ib_find_cached_gid(id_priv->id.device, &path_rec->sgid, + &qp_attr.port_num, NULL); + if (ret) + return ret; + + ret = ib_find_cached_pkey(id_priv->id.device, qp_attr.port_num, + id_priv->id.route.addr.addr.ibaddr.pkey, + &qp_attr.pkey_index); + if (ret) + return ret; + + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS | + 
IB_QP_PKEY_INDEX | IB_QP_PORT); +} + +int rdma_create_qp(struct rdma_id *id, struct ib_pd *pd, + struct ib_qp_init_attr *qp_init_attr) +{ + struct rdma_id_private *id_priv; + struct ib_qp *qp; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (id->device != pd->device) + return -EINVAL; + + qp = ib_create_qp(pd, qp_init_attr); + if (IS_ERR(qp)) + return PTR_ERR(qp); + + switch (id->device->node_type) { + case IB_NODE_CA: + ret = cma_init_ib_qp(id_priv, qp); + break; + default: + ret = -ENOSYS; + break; + } + + if (ret) + goto err; + + id->qp = qp; + return 0; +err: + ib_destroy_qp(qp); + return ret; +} +EXPORT_SYMBOL(rdma_create_qp); + +void rdma_destroy_qp(struct rdma_id *id) +{ + ib_destroy_qp(id->qp); +} +EXPORT_SYMBOL(rdma_destroy_qp); + +static int cma_modify_ib_qp_rtr(struct rdma_id_private *id_priv) +{ + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; + + /* Need to update QP attributes from default values. */ + qp_attr.qp_state = IB_QPS_INIT; + ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask); + if (ret) + return ret; + + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); + if (ret) + return ret; + + qp_attr.qp_state = IB_QPS_RTR; + ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask); + if (ret) + return ret; + + qp_attr.rq_psn = id_priv->id.qp->qp_num; + return ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); +} + +static int cma_modify_ib_qp_rts(struct rdma_id_private *id_priv) +{ + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; + + qp_attr.qp_state = IB_QPS_RTS; + ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask); + if (ret) + return ret; + + return ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); +} + +static int cma_modify_qp_err(struct rdma_id *id) +{ + struct ib_qp_attr qp_attr; + + qp_attr.qp_state = IB_QPS_ERR; + return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE); +} + +static int cma_verify_addr(struct cma_addr *addr, + struct sockaddr_in *ip_addr) 
+{ + if (cma_get_cma_ver(addr) != 1 || cma_get_ip_ver(addr) != 4) + return -EINVAL; + + if (ip_addr->sin_port != be16_to_cpu(addr->port)) + return -EINVAL; + + if (ip_addr->sin_addr.s_addr && + (ip_addr->sin_addr.s_addr != be32_to_cpu(addr->dst_addr. + ver.ip4.addr))) + return -EINVAL; + + return 0; +} + +static int cma_notify_user(struct rdma_id_private *id_priv, + enum rdma_event_type type, int status, + void *data, u8 data_len) +{ + struct rdma_event event; + + event.event = type; + event.status = status; + event.private_data = data; + event.private_data_len = data_len; + + return id_priv->id.event_handler(&id_priv->id, &event); +} + +static inline void cma_disable_dev_remove(struct rdma_id_private *id_priv) +{ + atomic_inc(&id_priv->dev_remove); +} + +static inline void cma_deref_dev(struct rdma_id_private *id_priv) +{ +// if (atomic_dec_and_test(&id_priv->dev_remove)) +// wake_up(&id_priv->wait); +// return atomic_dec_and_test(&id_priv->dev_remove) ? +// cma_notify_user(id_priv, RDMA_EVENT_DEVICE_REMOVAL, -ENODEV, +// NULL, 0) : 0; +} + +static void cma_cancel_addr(struct rdma_id_private *id_priv) +{ + switch (id_priv->id.device->node_type) { + case IB_NODE_CA: + ib_addr_cancel(&id_priv->id.route.addr.addr.ibaddr); + break; + default: + break; + } +} + +static void cma_cancel_route(struct rdma_id_private *id_priv) +{ + switch (id_priv->id.device->node_type) { + case IB_NODE_CA: + ib_sa_cancel_query(id_priv->query_id, id_priv->query); + break; + default: + break; + } +} + +static void cma_cancel_operation(struct rdma_id_private *id_priv, + enum cma_state state) +{ + switch (state) { + case CMA_ADDR_QUERY: + cma_cancel_addr(id_priv); + break; + case CMA_ROUTE_QUERY: + cma_cancel_route(id_priv); + break; + default: + break; + } +} + +static void cma_free_id(struct rdma_id_private *id_priv) +{ + if (id_priv->cma_dev) { + switch (id_priv->id.device->node_type) { + case IB_NODE_CA: + if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) + ib_destroy_cm_id(id_priv->cm_id); 
+ break; + default: + break; + } + cma_release_dev(id_priv); + } + + atomic_dec(&id_priv->refcount); + wait_event(id_priv->wait, !atomic_read(&id_priv->refcount)); + + kfree(id_priv->id.route.path_rec); + kfree(id_priv); +} + +void rdma_destroy_id(struct rdma_id *id) +{ + struct rdma_id_private *id_priv; + enum cma_state state; + + id_priv = container_of(id, struct rdma_id_private, id); + + state = cma_exch(id_priv, CMA_DESTROYING); + cma_cancel_operation(id_priv, state); + cma_free_id(id_priv); +} +EXPORT_SYMBOL(rdma_destroy_id); + +static int cma_rep_recv(struct rdma_id_private *id_priv) +{ + int ret; + + ret = cma_modify_ib_qp_rtr(id_priv); + if (ret) + goto reject; + + ret = cma_modify_ib_qp_rts(id_priv); + if (ret) + goto reject; + + ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0); + if (ret) + goto reject; + + return 0; +reject: + cma_modify_qp_err(&id_priv->id); + ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + return ret; +} + +static int cma_rtu_recv(struct rdma_id_private *id_priv) +{ + int ret; + + ret = cma_modify_ib_qp_rts(id_priv); + if (ret) + goto reject; + + return 0; +reject: + cma_modify_qp_err(&id_priv->id); + ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + return ret; +} + +static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) +{ + struct rdma_id_private *id_priv = cm_id->context; + enum rdma_event_type event; + u8 private_data_len = 0; + int ret = 0, status = 0; + + if (!cma_comp(id_priv, CMA_CONNECT)) + return 0; + + switch (ib_event->event) { + case IB_CM_REQ_ERROR: + case IB_CM_REP_ERROR: + event = RDMA_EVENT_UNREACHABLE; + status = -ETIMEDOUT; + break; + case IB_CM_REP_RECEIVED: + status = cma_rep_recv(id_priv); + event = status ? RDMA_EVENT_CONNECT_ERROR : + RDMA_EVENT_ESTABLISHED; + private_data_len = IB_CM_REP_PRIVATE_DATA_SIZE; + break; + case IB_CM_RTU_RECEIVED: + status = cma_rtu_recv(id_priv); + event = status ? 
RDMA_EVENT_CONNECT_ERROR : + RDMA_EVENT_ESTABLISHED; + break; + case IB_CM_DREQ_ERROR: + status = -ETIMEDOUT; /* fall through */ + case IB_CM_DREQ_RECEIVED: + case IB_CM_DREP_RECEIVED: + event = RDMA_EVENT_DISCONNECTED; + break; + case IB_CM_TIMEWAIT_EXIT: + case IB_CM_MRA_RECEIVED: + /* ignore event */ + goto out; + case IB_CM_REJ_RECEIVED: + cma_modify_qp_err(&id_priv->id); + status = ib_event->param.rej_rcvd.reason; + event = RDMA_EVENT_REJECTED; + break; + default: + printk(KERN_ERR "RDMA CMA: unexpected IB CM event: %d", + ib_event->event); + goto out; + } + + ret = cma_notify_user(id_priv, event, status, ib_event->private_data, + private_data_len); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. */ + id_priv->cm_id = NULL; + rdma_destroy_id(&id_priv->id); + } +out: + return ret; +} + +static struct rdma_id_private* cma_new_id(struct rdma_id *listen_id, + struct ib_cm_event *ib_event) +{ + struct rdma_id_private *id_priv; + struct rdma_id *id; + struct rdma_route *route; + struct sockaddr_in *ip_addr; + struct ib_sa_path_rec *path_rec; + struct cma_addr *addr; + int num_paths; + + ip_addr = (struct sockaddr_in *) &listen_id->route.addr.src_addr; + if (cma_verify_addr(ib_event->private_data, ip_addr)) + return NULL; + + num_paths = 1 + (ib_event->param.req_rcvd.alternate_path != NULL); + path_rec = kmalloc(sizeof *path_rec * num_paths, GFP_KERNEL); + if (!path_rec) + return NULL; + + id = rdma_create_id(listen_id->event_handler, listen_id->context); + if (!id) + goto err; + + route = &id->route; + route->addr.src_addr = listen_id->route.addr.src_addr; + route->addr.dst_addr.sa_family = ip_addr->sin_family; + + ip_addr = (struct sockaddr_in *) &route->addr.dst_addr; + addr = ib_event->private_data; + ip_addr->sin_addr.s_addr = be32_to_cpu(addr->src_addr.ver.ip4.addr); + + route->num_paths = num_paths; + route->path_rec = path_rec; + path_rec[0] = *ib_event->param.req_rcvd.primary_path; + if (num_paths == 2) + path_rec[1] = 
*ib_event->param.req_rcvd.alternate_path; + + route->addr.addr.ibaddr.sgid = path_rec->dgid; + route->addr.addr.ibaddr.dgid = path_rec->sgid; + route->addr.addr.ibaddr.pkey = be16_to_cpu(path_rec->pkey); + + id_priv = container_of(id, struct rdma_id_private, id); + id_priv->state = CMA_CONNECT; + return id_priv; +err: + kfree(path_rec); + return NULL; +} + +static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) +{ + struct rdma_id_private *listen_id, *conn_id; + int offset, ret; + + listen_id = cm_id->context; + conn_id = cma_new_id(&listen_id->id, ib_event); + if (!conn_id) + return -ENOMEM; + + ret = cma_acquire_ib_dev(conn_id, &conn_id->id.route.path_rec[0].sgid); + if (ret) { + ret = -ENODEV; + goto err; + } + + conn_id->cm_id = cm_id; + cm_id->context = conn_id; + cm_id->cm_handler = cma_ib_handler; + conn_id->state = CMA_CONNECT; + + offset = sizeof(struct cma_addr); + ret = cma_notify_user(conn_id, RDMA_EVENT_CONNECT_REQUEST, 0, + ib_event->private_data + offset, + IB_CM_REQ_PRIVATE_DATA_SIZE - offset); + if (ret) { + /* Destroy the CM ID by returning a non-zero value. 
*/ + conn_id->cm_id = NULL; + rdma_destroy_id(&conn_id->id); + } + return ret; +err: + rdma_destroy_id(&conn_id->id); + return ret; +} + +static __be64 cma_get_service_id(struct sockaddr *addr) +{ + return cpu_to_be64(((u64)IB_OPENIB_OUI << 48) + + ((struct sockaddr_in *) addr)->sin_port); +} + +static int cma_ib_listen(struct rdma_id_private *id_priv) +{ + __be64 svc_id; + int ret; + + id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_req_handler, + id_priv); + if (IS_ERR(id_priv->cm_id)) + return PTR_ERR(id_priv->cm_id); + + svc_id = cma_get_service_id(&id_priv->id.route.addr.src_addr); + ret = ib_cm_listen(id_priv->cm_id, svc_id, 0); + if (ret) + ib_destroy_cm_id(id_priv->cm_id); + + return ret; +} + +int rdma_listen(struct rdma_id *id) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN)) + return -EINVAL; + + /* TODO: handle listen across multiple devices */ + if (!id->device) { + ret = -ENOSYS; + goto err; + } + + switch (id->device->node_type) { + case IB_NODE_CA: + ret = cma_ib_listen(id_priv); + break; + default: + ret = -ENOSYS; + break; + } + if (ret) + goto err; + + return 0; +err: + cma_comp_exch(id_priv, CMA_LISTEN, CMA_ADDR_BOUND); + return ret; +}; +EXPORT_SYMBOL(rdma_listen); + +static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, + void *context) +{ + struct rdma_id_private *id_priv = context; + struct rdma_route *route = &id_priv->id.route; + enum rdma_event_type event = RDMA_EVENT_ROUTE_RESOLVED; + + if (!status) { + route->path_rec = kmalloc(sizeof *route->path_rec, GFP_KERNEL); + if (route->path_rec) { + route->num_paths = 1; + *route->path_rec = *path_rec; + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, + CMA_ROUTE_RESOLVED)) { + kfree(route->path_rec); + goto out; + } + } else + status = -ENOMEM; + } + + if (status) { + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED)) + goto out; + 
event = RDMA_EVENT_ROUTE_ERROR; + } + + if (cma_notify_user(id_priv, event, status, NULL, 0)) { + cma_deref_id(id_priv); + rdma_destroy_id(&id_priv->id); + return; + } +out: + cma_deref_id(id_priv); +} + +static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) +{ + struct ib_addr *addr = &id_priv->id.route.addr.addr.ibaddr; + struct ib_sa_path_rec path_rec; + int ret; + u8 port; + + ret = ib_find_cached_gid(id_priv->id.device, &addr->sgid, &port, NULL); + if (ret) + return -ENODEV; + + memset(&path_rec, 0, sizeof path_rec); + path_rec.sgid = addr->sgid; + path_rec.dgid = addr->dgid; + path_rec.pkey = addr->pkey; + path_rec.numb_path = 1; + + id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, + port, &path_rec, + IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, + timeout_ms, GFP_KERNEL, + cma_query_handler, id_priv, &id_priv->query); + + return (id_priv->query_id < 0) ? id_priv->query_id : 0; +} + +int rdma_resolve_route(struct rdma_id *id, int timeout_ms) +{ + struct rdma_id_private *id_priv; + int ret; + + id_priv = container_of(id, struct rdma_id_private, id); + if (!cma_comp_exch(id_priv, CMA_ADDR_RESOLVED, CMA_ROUTE_QUERY)) + return -EINVAL; + + atomic_inc(&id_priv->refcount); + switch (id->device->node_type) { + case IB_NODE_CA: + ret = cma_resolve_ib_route(id_priv, timeout_ms); + break; + default: + ret = -ENOSYS; + break; + } + if (ret) + goto err; + + return 0; +err: + cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED); + cma_deref_id(id_priv); + return ret; +} +EXPORT_SYMBOL(rdma_resolve_route); + +static void addr_handler(int status, struct sockaddr *src_addr, + struct ib_addr *ibaddr, void *context) +{ + struct rdma_id_private *id_priv = context; + enum rdma_event_type event; + + if (!status) + status = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); + + if (status) { + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_IDLE)) + goto out; + event = RDMA_EVENT_ADDR_ERROR; + } 
else {
+		if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED))
+			goto out;
+		id_priv->id.route.addr.src_addr = *src_addr;
+		event = RDMA_EVENT_ADDR_RESOLVED;
+	}
+
+	if (cma_notify_user(id_priv, event, status, NULL, 0)) {
+		cma_deref_id(id_priv);
+		rdma_destroy_id(&id_priv->id);
+		return;
+	}
+out:
+	cma_deref_id(id_priv);
+}
+
+int rdma_resolve_addr(struct rdma_id *id, struct sockaddr *src_addr,
+		      struct sockaddr *dst_addr, int timeout_ms)
+{
+	struct rdma_id_private *id_priv;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_QUERY))
+		return -EINVAL;
+
+	atomic_inc(&id_priv->refcount);
+	id->route.addr.dst_addr = *dst_addr;
+	ret = ib_resolve_addr(src_addr, dst_addr, &id->route.addr.addr.ibaddr,
+			      timeout_ms, addr_handler, id_priv);
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_IDLE);
+	cma_deref_id(id_priv);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_resolve_addr);
+
+int rdma_bind_addr(struct rdma_id *id, struct sockaddr *addr)
+{
+	struct rdma_id_private *id_priv;
+	struct sockaddr_in *ip_addr = (struct sockaddr_in *) addr;
+	struct ib_addr *ibaddr = &id->route.addr.addr.ibaddr;
+	int ret;
+
+	if (addr->sa_family != AF_INET)
+		return -EINVAL;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_BOUND))
+		return -EINVAL;
+
+	if (ip_addr->sin_addr.s_addr) {
+		ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey);
+		if (!ret)
+			ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid);
+	} else
+		ret = -ENOSYS; /* TODO: support wild card addresses */
+
+	if (ret)
+		goto err;
+
+	id->route.addr.src_addr = *addr;
+	return 0;
+err:
+	cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_IDLE);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_bind_addr);
+
+static void cma_format_addr(struct cma_addr *addr, struct rdma_route *route)
+{
+	struct sockaddr_in *ip_addr;
+
+	memset(addr, 0, sizeof *addr);
+	cma_set_vers(addr, 1, 4);
+
+	ip_addr = (struct sockaddr_in *) &route->addr.src_addr;
+	addr->src_addr.ver.ip4.addr = cpu_to_be32(ip_addr->sin_addr.s_addr);
+
+	ip_addr = (struct sockaddr_in *) &route->addr.dst_addr;
+	addr->dst_addr.ver.ip4.addr = cpu_to_be32(ip_addr->sin_addr.s_addr);
+	addr->port = cpu_to_be16(ip_addr->sin_port);
+}
+
+static int cma_connect_ib(struct rdma_id_private *id_priv,
+			  struct rdma_conn_param *conn_param)
+{
+	struct ib_cm_req_param req;
+	struct rdma_route *route;
+	struct cma_addr *addr;
+	void *private_data;
+	int ret;
+
+	memset(&req, 0, sizeof req);
+	req.private_data_len = sizeof *addr + conn_param->private_data_len;
+
+	private_data = kmalloc(req.private_data_len, GFP_ATOMIC);
+	if (!private_data)
+		return -ENOMEM;
+
+	id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler,
+					 id_priv);
+	if (IS_ERR(id_priv->cm_id)) {
+		ret = PTR_ERR(id_priv->cm_id);
+		goto out;
+	}
+
+	addr = private_data;
+	route = &id_priv->id.route;
+	cma_format_addr(addr, route);
+
+	if (conn_param->private_data && conn_param->private_data_len)
+		memcpy(addr + 1, conn_param->private_data,
+		       conn_param->private_data_len);
+	req.private_data = private_data;
+
+	req.primary_path = &route->path_rec[0];
+	if (route->num_paths == 2)
+		req.alternate_path = &route->path_rec[1];
+
+	req.service_id = cma_get_service_id(&route->addr.dst_addr);
+	req.qp_num = id_priv->id.qp->qp_num;
+	req.qp_type = IB_QPT_RC;
+	req.starting_psn = req.qp_num;
+	req.responder_resources = conn_param->responder_resources;
+	req.initiator_depth = conn_param->initiator_depth;
+	req.flow_control = conn_param->flow_control;
+	req.retry_count = conn_param->retry_count;
+	req.rnr_retry_count = conn_param->rnr_retry_count;
+	req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
+	req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
+	req.max_cm_retries = CMA_MAX_CM_RETRIES;
+	req.srq = id_priv->id.qp->srq ? 1 : 0;
+
+	ret = ib_send_cm_req(id_priv->cm_id, &req);
+out:
+	kfree(private_data);
+	return ret;
+}
+
+int rdma_connect(struct rdma_id *id, struct rdma_conn_param *conn_param)
+{
+	struct rdma_id_private *id_priv;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT))
+		return -EINVAL;
+
+	switch (id->device->node_type) {
+	case IB_NODE_CA:
+		ret = cma_connect_ib(id_priv, conn_param);
+		break;
+	default:
+		ret = -ENOSYS;
+		break;
+	}
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	cma_comp_exch(id_priv, CMA_CONNECT, CMA_ROUTE_RESOLVED);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_connect);
+
+static int cma_accept_ib(struct rdma_id_private *id_priv,
+			 struct rdma_conn_param *conn_param)
+{
+	struct ib_cm_rep_param rep;
+	int ret;
+
+	ret = cma_modify_ib_qp_rtr(id_priv);
+	if (ret)
+		return ret;
+
+	memset(&rep, 0, sizeof rep);
+	rep.qp_num = id_priv->id.qp->qp_num;
+	rep.starting_psn = rep.qp_num;
+	rep.private_data = conn_param->private_data;
+	rep.private_data_len = conn_param->private_data_len;
+	rep.responder_resources = conn_param->responder_resources;
+	rep.initiator_depth = conn_param->initiator_depth;
+	rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT;
+	rep.failover_accepted = 0;
+	rep.flow_control = conn_param->flow_control;
+	rep.rnr_retry_count = conn_param->rnr_retry_count;
+	rep.srq = id_priv->id.qp->srq ? 1 : 0;
+
+	return ib_send_cm_rep(id_priv->cm_id, &rep);
+}
+
+int rdma_accept(struct rdma_id *id, struct rdma_conn_param *conn_param)
+{
+	struct rdma_id_private *id_priv;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp(id_priv, CMA_CONNECT))
+		return -EINVAL;
+
+	switch (id->device->node_type) {
+	case IB_NODE_CA:
+		ret = cma_accept_ib(id_priv, conn_param);
+		break;
+	default:
+		ret = -ENOSYS;
+		break;
+	}
+
+	if (ret)
+		goto reject;
+
+	return 0;
+reject:
+	cma_modify_qp_err(id);
+	rdma_reject(id, NULL, 0);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_accept);
+
+int rdma_reject(struct rdma_id *id, const void *private_data,
+		u8 private_data_len)
+{
+	struct rdma_id_private *id_priv;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp(id_priv, CMA_CONNECT))
+		return -EINVAL;
+
+	switch (id->device->node_type) {
+	case IB_NODE_CA:
+		ret = ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED,
+				     NULL, 0, private_data, private_data_len);
+		break;
+	default:
+		ret = -ENOSYS;
+		break;
+	}
+	return ret;
+};
+EXPORT_SYMBOL(rdma_reject);
+
+int rdma_disconnect(struct rdma_id *id)
+{
+	struct rdma_id_private *id_priv;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp(id_priv, CMA_CONNECT))
+		return -EINVAL;
+
+	ret = cma_modify_qp_err(id);
+	if (ret)
+		goto out;
+
+	switch (id->device->node_type) {
+	case IB_NODE_CA:
+		/* Initiate or respond to a disconnect. */
+		if (ib_send_cm_dreq(id_priv->cm_id, NULL, 0))
+			ib_send_cm_drep(id_priv->cm_id, NULL, 0);
+		break;
+	default:
+		break;
+	}
+out:
+	return ret;
+}
+EXPORT_SYMBOL(rdma_disconnect);
+
+/* TODO: add this to the device structure - see Roland's patch */
+static __be64 get_ca_guid(struct ib_device *device)
+{
+	struct ib_device_attr *device_attr;
+	__be64 guid;
+	int ret;
+
+	device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL);
+	if (!device_attr)
+		return 0;
+
+	ret = ib_query_device(device, device_attr);
+	guid = ret ? 0 : device_attr->node_guid;
+	kfree(device_attr);
+	return guid;
+}
+
+static void cma_add_one(struct ib_device *device)
+{
+	struct cma_device *cma_dev;
+	unsigned long flags;
+
+	cma_dev = kmalloc(sizeof *cma_dev, GFP_KERNEL);
+	if (!cma_dev)
+		return;
+
+	cma_dev->device = device;
+	cma_dev->node_guid = get_ca_guid(device);
+	if (!cma_dev->node_guid)
+		goto err;
+
+	init_waitqueue_head(&cma_dev->wait);
+	atomic_set(&cma_dev->refcount, 1);
+	INIT_LIST_HEAD(&cma_dev->id_list);
+	ib_set_client_data(device, &cma_client, cma_dev);
+
+	spin_lock_irqsave(&lock, flags);
+	list_add_tail(&cma_dev->list, &dev_list);
+	spin_unlock_irqrestore(&lock, flags);
+	return;
+err:
+	kfree(cma_dev);
+}
+
+static int cma_remove_id_dev(struct rdma_id_private *id_priv)
+{
+	enum cma_state state;
+
+	/* Record that we want to remove the device */
+	state = cma_exch(id_priv, CMA_DEVICE_REMOVAL);
+	if (state == CMA_DESTROYING)
+		return 0;
+
+	/* TODO: wait until safe to process removal. */
+
+	/* Check for destruction from another callback. */
+	if (!cma_comp(id_priv, CMA_DEVICE_REMOVAL))
+		return 0;
+
+	return cma_notify_user(id_priv, RDMA_EVENT_DEVICE_REMOVAL, 0, NULL, 0);
+}
+
+static void cma_process_remove(struct cma_device *cma_dev)
+{
+	struct list_head remove_list;
+	struct rdma_id_private *id_priv;
+	unsigned long flags;
+	int ret;
+
+	INIT_LIST_HEAD(&remove_list);
+
+	spin_lock_irqsave(&lock, flags);
+	while (!list_empty(&cma_dev->id_list)) {
+		id_priv = list_entry(cma_dev->id_list.next,
+				     struct rdma_id_private, list);
+		list_del(&id_priv->list);
+		list_add_tail(&id_priv->list, &remove_list);
+		atomic_inc(&id_priv->refcount);
+		spin_unlock_irqrestore(&lock, flags);
+
+		ret = cma_remove_id_dev(id_priv);
+		cma_deref_id(id_priv);
+		if (ret)
+			rdma_destroy_id(&id_priv->id);
+
+		spin_lock_irqsave(&lock, flags);
+	}
+	spin_unlock_irqrestore(&lock, flags);
+
+	atomic_dec(&cma_dev->refcount);
+	wait_event(cma_dev->wait, !atomic_read(&cma_dev->refcount));
+}
+
+static void cma_remove_one(struct ib_device *device)
+{
+	struct cma_device *cma_dev;
+	unsigned long flags;
+
+	cma_dev = ib_get_client_data(device, &cma_client);
+	if (!cma_dev)
+		return;
+
+	spin_lock_irqsave(&lock, flags);
+	list_del(&cma_dev->list);
+	spin_unlock_irqrestore(&lock, flags);
+
+	cma_process_remove(cma_dev);
+	kfree(cma_dev);
+}
+
+static int cma_init(void)
+{
+	return ib_register_client(&cma_client);
+}
+
+static void cma_cleanup(void)
+{
+	ib_unregister_client(&cma_client);
+}
+
+module_init(cma_init);
+module_exit(cma_cleanup);

From yaronh at voltaire.com Fri Oct 7 12:52:29 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Fri, 7 Oct 2005 21:52:29 +0200
Subject: [openib-general] [RFC] IB address translation using ARP
Message-ID: <35EA21F54A45CB47B879F21A91F4862F7FA3A1@taurus.voltaire.com>

> ________________________________________
> From: Michael Krause [mailto:krause at cup.hp.com]
> Sent: Friday, October 07, 2005 12:29 PM
> To: Yaron Haviv
> Cc: Openib
> Subject: RE: [openib-general] [RFC] IB address translation
using ARP > > At 06:24 AM 9/30/2005, Yaron Haviv wrote: > > > -----Original Message----- > > From: Roland Dreier [ mailto:rolandd at cisco.com] > > Sent: Thursday, September 29, 2005 9:50 PM > > To: Sean Hefty > > Cc: Yaron Haviv; Openib > > Subject: Re: [openib-general] [RFC] IB address translation using ARP > > > > I think the usage model is the following: you have some magic device > > that has an IB port on one side and "something else" on the other > > side. Think of something like a gateway that talks SDP on the IB side > > and TCP/IP on the other side. > > > > >Also applicable to two IB ports, e.g. forwarding SDP traffic from one IB > >partition to SDP on another partition (may even be the same port with > >two P_Keys), and doing some load-balancing or traffic management in > >between, overall there are many use cases for that. > > While I can envision how an endpoint could communicate with another in > separate partitions, doing so really violates the spirit of the > partitioning where endpoints must be in the same partition in order to see > one another and communicate. Mike, This is exactly the same case as two IPoIB interfaces over same port with two partitions configured with IP routing between them, or a layer 7 proxy that connects two network segments I don’t see anything wrong with such a model > Attempting to create an intermediary who has > insights into both and then somehow is able to communicate how to find one > another using some proprietary (can't be through standards that I can > think of) method, seems like way too much complexity to be worth it. > Assuming the ULPs on both sides are standards, how the proxy is built and how it functions is application dependent just like people do proxies for XML which don’t need to obey to any standard beside be transparent to both sides. OpenIB should not block the ability to provide gateway/proxy functionality, or routing traffic beyond a single IP addressing hop. 
This is just matching IB to capabilities already available in iWarp. Yaron From yaronh at voltaire.com Fri Oct 7 12:59:00 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Fri, 7 Oct 2005 21:59:00 +0200 Subject: [openib-general] [RFC] IB address translation using ARP Message-ID: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Sean Hefty > Sent: Friday, October 07, 2005 12:40 PM > To: 'Michael Krause'; Caitlin Bestler > Cc: Openib > Subject: RE: [openib-general] [RFC] IB address translation using ARP > > >It would be best to define a CM architecture that enabled communication > >between like endpoints and avoid the gateway dilemma. Let the gateway > >provider work out such issues as there are many requirements already > >on each side of these interconnects. > > > I've given this some more thought since the original postings and agree > with > you. It doesn't seem right to me to have the CM establish a connection to > something that is not the specified destination, under the assumption that > whatever is being connected to is a gateway. I think it would be better > for the > application to determine that the actual destination is on a different > subnet, > locate the gateway, and issue a connection request to the gateway. 
> > - Sean > Sean, I believe this is exactly how it is been proposed The gateway is the endpoint in IB, and the IB CM request is done against the gateway, the gateway may decide to create its own connection on the other side based on IB headers or Private data or even application data (depend on the type of the gateway), this just requires that traffic targeted to a certain IP range/subnet/non-local will end up in the gateway without the need to specify address by address individually (just like its done in IP) Yaron > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From mshefty at ichips.intel.com Fri Oct 7 13:10:34 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Oct 2005 13:10:34 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> Message-ID: <4346D63A.2070801@ichips.intel.com> Yaron Haviv wrote: > Sean, I believe this is exactly how it is been proposed > The gateway is the endpoint in IB, and the IB CM request is done against > the gateway, the gateway may decide to create its own connection on the Yes - I agree with that. I'm referring to the RDMA connection manager, versus the IB connection manager. > targeted to a certain IP range/subnet/non-local will end up in the > gateway without the need to specify address by address individually > (just like its done in IP) IP is connectionless, so I'm not sure how to relate from IP to the RDMA CM. With TCP, the connection is to the actual endpoint, not the IP router. This seems more similar to an application requesting a connection to a proxy server. 
- Sean From halr at voltaire.com Fri Oct 7 13:07:37 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Oct 2005 16:07:37 -0400 Subject: [openib-general] Re: Questions about mad_test In-Reply-To: References: Message-ID: <1128715656.4382.9844.camel@hal.voltaire.com> Hi Pradeep, On Fri, 2005-10-07 at 15:19, Pradeep Satyanarayana wrote: > I am hoping some one will be able to help me out with a few answers > saving me some debug time, or having to expend effort on something > that is already known. > I was trying to execute mad_test and found that it errors out. What is your command invocation ? Can you send the output of ibnetdiscover ? > For some reason it does not like the DR Path that I gave it. > > 1. I ran ibnetdiscover and got the set of LIDs that I use is DR Path. > Is that correct way to go about it? > It always errors out with something like: hop 0 != 0 or hop 1 != > dev_port It's telling you the DR path you specified is invalid. LIDs go "direct" and are hardware forwarded (via LID routing). DR is uses a list of next hop (switch) ports (and not LIDs) and is firmware or software forwarded usually although that is more an implementation than architectural. See IBA 1.2 14.2.2 p.797 on for more on DR SMPs (MADs). > 2. Also there is an expectation of there being a device > /dev/infiniband/mthca0/ports/1/mad (using all defaults in this case) > -is that correct? Any specific major and minor numbers I must use? No. It just accesses those and some /sys/class/infiniband infiniband_mad files. > 3. Anything else that I am missing? > > I am using this from trunk 3675 on 2.6.13 kernel. > > Thanks in advance for all the help! 
> > Pradeep > pradeep at us.ibm.com From halr at voltaire.com Fri Oct 7 13:17:00 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Oct 2005 16:17:00 -0400 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <4346D63A.2070801@ichips.intel.com> References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> Message-ID: <1128716018.4382.9900.camel@hal.voltaire.com> On Fri, 2005-10-07 at 16:10, Sean Hefty wrote: > Yaron Haviv wrote: > > Sean, I believe this is exactly how it is been proposed > > The gateway is the endpoint in IB, and the IB CM request is done against > > the gateway, the gateway may decide to create its own connection on the > > Yes - I agree with that. I'm referring to the RDMA connection manager, versus > the IB connection manager. > > > targeted to a certain IP range/subnet/non-local will end up in the > > gateway without the need to specify address by address individually > > (just like its done in IP) > > IP is connectionless, so I'm not sure how to relate from IP to the RDMA CM. IP is connectionless but has been implemented on top of connection oriented link layers which may gateway to other connection oriented link layers or non connection oriented link layers. I think it is analagous to that. -- Hal > With TCP, the connection is to the actual endpoint, not the IP router. This > seems more similar to an application requesting a connection to a proxy server. 
> > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Fri Oct 7 14:02:09 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Oct 2005 14:02:09 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <1128716018.4382.9900.camel@hal.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com> Message-ID: <4346E251.9080109@ichips.intel.com> Hal Rosenstock wrote: >>IP is connectionless, so I'm not sure how to relate from IP to the RDMA CM. > > > IP is connectionless but has been implemented on top of connection > oriented link layers which may gateway to other connection oriented link > layers or non connection oriented link layers. I think it is analagous > to that. I didn't think that IP was even being run in this case. Aren't we talking about an application level gateway? If the RDMA CM ran a protocol that ensured that data sent from the source reached the actual destination, then this would make more sense to me. But the protocol is coming from the client. I just don't think that the RDMA CM should connect to a gateway under the assumption that a client is running a protocol that operates this way. If the source and destination were both running iWarp, then wouldn't a connection be established to the actual destination, and not a gateway? 
- Sean From halr at voltaire.com Fri Oct 7 14:08:35 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Oct 2005 17:08:35 -0400 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <4346E251.9080109@ichips.intel.com> References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com> <4346E251.9080109@ichips.intel.com> Message-ID: <1128719144.4382.10255.camel@hal.voltaire.com> On Fri, 2005-10-07 at 17:02, Sean Hefty wrote: > Hal Rosenstock wrote: > >>IP is connectionless, so I'm not sure how to relate from IP to the RDMA CM. > > > > > > IP is connectionless but has been implemented on top of connection > > oriented link layers which may gateway to other connection oriented link > > layers or non connection oriented link layers. I think it is analagous > > to that. > > I didn't think that IP was even being run in this case. Aren't we talking about > an application level gateway? Yes. > If the RDMA CM ran a protocol that ensured that data sent from the source reached the actual destination, then this would make > more sense to me. But the protocol is coming from the client. Wouldn't the gateway/host reject or drop the connection if it couldn't do what was required ? > I just don't think that the RDMA CM should connect to a gateway under the > assumption that a client is running a protocol that operates this way. If the > source and destination were both running iWarp, then wouldn't a connection be > established to the actual destination, and not a gateway? Would it shortcut the connection across IP subnets or go through a gateway in that case ? 
-- Hal From mshefty at ichips.intel.com Fri Oct 7 14:30:43 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Oct 2005 14:30:43 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <1128719144.4382.10255.camel@hal.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com> <4346E251.9080109@ichips.intel.com> <1128719144.4382.10255.camel@hal.voltaire.com> Message-ID: <4346E903.8030601@ichips.intel.com> Hal Rosenstock wrote: >> If the RDMA CM ran a protocol that ensured that data sent from the source >> reached the actual destination, then this would make more sense to me. But >> the protocol is coming from the client. > > Wouldn't the gateway/host reject or drop the connection if it couldn't do > what was required ? I would assume so, and maybe that's sufficient. The one problem that I see if this feature weren't in the RDMA CM is that clients may need to be transport aware. (Assuming that an iWarp connection would go directly to the destination.) - Sean From halr at voltaire.com Fri Oct 7 16:48:00 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Oct 2005 19:48:00 -0400 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <4346E903.8030601@ichips.intel.com> References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com> <4346E251.9080109@ichips.intel.com> <1128719144.4382.10255.camel@hal.voltaire.com> <4346E903.8030601@ichips.intel.com> Message-ID: <1128728790.4382.11354.camel@hal.voltaire.com> On Fri, 2005-10-07 at 17:30, Sean Hefty wrote: > Hal Rosenstock wrote: > >> If the RDMA CM ran a protocol that ensured that data sent from the source > >> reached the actual destination, then this would make more sense to me. But > >> the protocol is coming from the client. 
> > > > Wouldn't the gateway/host reject or drop the connection if it couldn't do > > what was required ? > > I would assume so, and maybe that's sufficient. The one problem that I see if > this feature weren't in the RDMA CM is that clients may need to be transport > aware. (Assuming that an iWarp connection would go directly to the destination.) Would an iWARP connection jump across IP subnets ? It would need to determine that it could do this (ala NHRP with ATM). Also, could there be other RDMA networks between them (like IB) ? -- Hal From mshefty at ichips.intel.com Fri Oct 7 16:57:48 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 07 Oct 2005 16:57:48 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <1128728790.4382.11354.camel@hal.voltaire.com> References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com> <4346E251.9080109@ichips.intel.com> <1128719144.4382.10255.camel@hal.voltaire.com> <4346E903.8030601@ichips.intel.com> <1128728790.4382.11354.camel@hal.voltaire.com> Message-ID: <43470B7C.7060600@ichips.intel.com> Hal Rosenstock wrote: > Would an iWARP connection jump across IP subnets ? It would need to > determine that it could do this (ala NHRP with ATM). Also, could there > be other RDMA networks between them (like IB) ? if iWarp is on top of TCP, I don't think that it would care about IP subnets. 
- Sean From halr at voltaire.com Fri Oct 7 17:13:18 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Oct 2005 20:13:18 -0400 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <43470B7C.7060600@ichips.intel.com> References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com> <4346E251.9080109@ichips.intel.com> <1128719144.4382.10255.camel@hal.voltaire.com> <4346E903.8030601@ichips.intel.com> <1128728790.4382.11354.camel@hal.voltaire.com> <43470B7C.7060600@ichips.intel.com> Message-ID: <1128730364.4382.11557.camel@hal.voltaire.com> On Fri, 2005-10-07 at 19:57, Sean Hefty wrote: > Hal Rosenstock wrote: > > Would an iWARP connection jump across IP subnets ? It would need to > > determine that it could do this (ala NHRP with ATM). Also, could there > > be other RDMA networks between them (like IB) ? > > if iWarp is on top of TCP, I don't think that it would care about IP subnets. I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ? Doesn't a routing decision still need to be made at the IP layer ? Doesn't the IP next hop need to be determined (e.g. gateway when the destination is off the local IP subnet) ? Is there something that precludes iWARP from working across IP subnets ? -- Hal From rolandd at cisco.com Fri Oct 7 18:16:37 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 07 Oct 2005 18:16:37 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <1128672413.13948.326.camel@localhost> (Matt Leininger's message of "Fri, 07 Oct 2005 01:06:53 -0700") References: <1128672413.13948.326.camel@localhost> Message-ID: <52br20lsei.fsf@cisco.com> I wonder if this BIC bug has anything to do with it: http://lkml.org/lkml/2005/10/7/230 From hozer at hozed.org Fri Oct 7 19:03:08 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Fri, 7 Oct 2005 21:03:08 -0500 Subject: [openib-general] IBM eHCA testing.. 
In-Reply-To: References:

Message-ID: <20051008020308.GZ4612@kalmia.hozed.org>

On Fri, Oct 07, 2005 at 09:33:27AM -0700, Shirley Ma wrote:
> Hi, Troy,
>
> There is INSTALL file in the EHCA driver package.
> In OpenPower 720 port 1 is at the top, port 2 is at the bottom.
> In P570, port1 is at the bottom, port2 is at the top.

Okay, I guess I should read more carefully ;)

What is the issue with needing to use port 1? Can that be fixed in the
driver, or does that need a firmware update?

From mlleinin at hpcn.ca.sandia.gov Fri Oct 7 19:22:56 2005
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Fri, 07 Oct 2005 19:22:56 -0700
Subject: [openib-general] Timeline of IPoIB performance
In-Reply-To: <52br20lsei.fsf@cisco.com>
References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com>
Message-ID: <1128738176.13952.365.camel@localhost>

On Fri, 2005-10-07 at 18:16 -0700, Roland Dreier wrote:
> I wonder if this BIC bug has anything to do with it: http://lkml.org/lkml/2005/10/7/230
>

I'm not sure this helps. I'm seeing the performance drop-off happen
between 2.6.12-rc4 (470 MB/s) and 2.6.12-rc5 (405 MB/s). I'll send out
my new data and cc netdev.

  - Matt

From mlleinin at hpcn.ca.sandia.gov Fri Oct 7 19:25:49 2005
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Fri, 07 Oct 2005 19:25:49 -0700
Subject: [openib-general] Timeline of IPoIB performance
In-Reply-To: <52br20lsei.fsf@cisco.com>
References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com>
Message-ID: <1128738350.13945.369.camel@localhost>

I'm adding netdev to this thread to see if they can help.

I'm seeing an IPoIB (IP over InfiniBand) netperf performance drop-off
of up to 90 MB/s when using kernels newer than 2.6.11. This doesn't
appear to be an OpenIB IPoIB issue, since the older in-kernel IB for
2.6.11 and a recent svn3687 snapshot both have the same performance
(464 MB/s) with 2.6.11. I used the same kernel config file as a
starting point for each of these kernel builds.
Have there been any changes in Linux that would explain these results?

Here is the hardware setup and netperf results using
'netperf -f -M -c -C -H IPoIB_ADDRESS

All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)

Kernel            OpenIB     msi_x  netperf (MB/s)
2.6.14-rc3        in-kernel  1      374
2.6.13.2          svn3627    1      386
2.6.13.2          in-kernel  1      394
2.6.12.5-lustre   in-kernel  1      399
2.6.12.5          in-kernel  1      402
2.6.12            in-kernel  1      406
2.6.12-rc6        in-kernel  1      407
2.6.12-rc5        in-kernel  1      405  <<<<<
2.6.12-rc4        in-kernel  1      470  <<<<<
2.6.12-rc3        in-kernel  1      466
2.6.12-rc2        in-kernel  1      469
2.6.12-rc1        in-kernel  1      466
2.6.11            in-kernel  1      464
2.6.11            svn3687    1      464
2.6.9-11.ELsmp    svn3513    1      425  (Woody's results, 3.6Ghz EM64T)

Thanks,

  - Matt

References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com>
	<4346D63A.2070801@ichips.intel.com>
	<1128716018.4382.9900.camel@hal.voltaire.com>
	<4346E251.9080109@ichips.intel.com>
	<1128719144.4382.10255.camel@hal.voltaire.com>
	<4346E903.8030601@ichips.intel.com>
	<1128728790.4382.11354.camel@hal.voltaire.com>
	<43470B7C.7060600@ichips.intel.com>
	<1128730364.4382.11557.camel@hal.voltaire.com>
Message-ID: <1128829186.25001.76.camel@mail.es335.com>

On Fri, 2005-10-07 at 20:13 -0400, Hal Rosenstock wrote:
> On Fri, 2005-10-07 at 19:57, Sean Hefty wrote:
> > Hal Rosenstock wrote:
> > > Would an iWARP connection jump across IP subnets ? It would need to
> > > determine that it could do this (ala NHRP with ATM). Also, could there
> > > be other RDMA networks between them (like IB) ?
> >
> > if iWarp is on top of TCP, I don't think that it would care about IP subnets.
>
> I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ?
> Doesn't a routing decision still need to be made at the IP layer ?
> Doesn't the IP next hop need to be determined (e.g. gateway when the
> destination is off the local IP subnet) ? Is there something that
> precludes iWARP from working across IP subnets ?
>
> -- Hal
>

I've just read through this entire thread for the first time, and I
sense considerable confusion about how IP routing works. I know I'm
confused ;-)

With sockets, the path to the remote peer is determined *after* the
connection request is submitted by the app (connect(...)). The app has
no idea which local interface will ultimately handle this connection or
what the path (route) is to the remote peer. It simply says
connect(67.65.105.4, ...). In fact, TCP doesn't know this either!
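
To make the sockets point concrete, here is a minimal sketch, assuming
a POSIX sockets environment. The helper name local_addr_for() is
hypothetical. It uses a UDP socket rather than the TCP connect()
described above, so that nothing is sent on the wire: connect() only
makes the kernel do the route lookup, and getsockname() then reports
which local address was picked. The caller names only the destination.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel which local address it would use
 * to reach dst_ip. Returns 0 and fills *local on success, -1 on error.
 */
int local_addr_for(const char *dst_ip, unsigned short port,
		   struct sockaddr_in *local)
{
	struct sockaddr_in dst;
	socklen_t len = sizeof *local;
	int fd, ret = -1;

	memset(&dst, 0, sizeof dst);
	dst.sin_family = AF_INET;
	dst.sin_port = htons(port);
	if (inet_pton(AF_INET, dst_ip, &dst.sin_addr) != 1)
		return -1;	/* not a dotted-quad IPv4 address */

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	/*
	 * connect() on a UDP socket sends no packet; it only binds the
	 * socket to a route, which is when the local address is chosen.
	 */
	if (connect(fd, (struct sockaddr *) &dst, sizeof dst) == 0 &&
	    getsockname(fd, (struct sockaddr *) local, &len) == 0)
		ret = 0;

	close(fd);
	return ret;
}
```

Calling local_addr_for("67.65.105.4", 80, &local) on a multi-homed host
would report whichever interface address the routing table selects at
that moment; the application never chose it.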

As Hal suggests, the connect request (SYN packet) gets all the way down
to IP, where the least-cost route is selected and, if not already known,
the Ethernet address for the next hop is determined (ARP). The reasons
for this are varied but include: routes may change and Ethernet
addresses for next hops may change, all within the lifetime of a
connection. Almost certainly if the connection lasts more than 15
minutes.

The route identifies the local interface and the next hop IP. An
interface is only ever on a single subnet. The ARP broadcast is issued
on this interface and is only on this one subnet. We're not
broadcasting across subnets. Note that the local interface is
"logical", and a single Ethernet NIC may have multiple IP addresses and
may in fact be on multiple subnets if using VLAN.

It is theoretically possible to support all this on an IPoIB based
network. Multiple subnets, multiple routes to remote peers, ICMP
redirect, multiple IP addresses for each physical interface, yada yada
yada. But IMHO, the only way to do this would be to tie directly into
the existing routing, ARP, ICMP, etc. subsystems in Linux. Otherwise
you'll end up recreating a gigantic (and I mean GIGANTIC) amount of
code.

This belief is why I've been a proponent of mapping GIDs to one and
only one IP address and treating it for management purposes as the
equivalent of an IP address. Without this, the whole mechanism for
determining routes, etc. breaks down. If you treat the GID like a MAC
address -- it breaks, because a MAC address can have multiple IP
addresses -- the observation that led to the conclusion that ATS was
broken in the first place.

I know there is significant resistance to this idea, but I just don't
see how we get this generically resolved without binding the two
addressing schemes more closely. With the current binding, I just don't
think it works.

If I'm off in the weeds, please let me know ... and I'll cease spouting
off.
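
The next-hop decision discussed in this thread (destination on the
local subnet versus reached through a gateway) can be sketched roughly
as below. This is only an illustration under simplifying assumptions:
one interface, one default gateway, no routing table. The names
struct iface_route, next_hop() and IP4() are hypothetical and do not
come from the Linux networking code.

```c
#include <stdint.h>

/* Build an IPv4 address in host byte order from four octets. */
#define IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical single-interface view; all fields in host byte order. */
struct iface_route {
	uint32_t addr;		/* IP address of the local interface */
	uint32_t netmask;	/* subnet mask of that interface */
	uint32_t gateway;	/* default gateway on the same subnet */
};

/*
 * Return the IP address that has to be resolved to a link-layer address
 * (ARP on Ethernet) in order to send a packet toward dst.
 */
uint32_t next_hop(const struct iface_route *rt, uint32_t dst)
{
	if ((dst & rt->netmask) == (rt->addr & rt->netmask))
		return dst;	/* on the local subnet: resolve dst itself */
	return rt->gateway;	/* off the subnet: resolve the gateway */
}
```

With an interface at 192.168.1.10/24 and gateway 192.168.1.1, a packet
for 192.168.1.20 resolves 192.168.1.20 directly, while one for
67.65.105.4 resolves the gateway instead. Either way the link-layer
lookup stays on the one local subnet.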
> _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jackm at mellanox.co.il Sun Oct 9 01:44:55 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 9 Oct 2005 10:44:55 +0200 Subject: [openib-general] Re: [PATCH] mthca: when creating a cq, check that requested cqes does not exceed HCA max In-Reply-To: <52fyribmtc.fsf@cisco.com> References: <52fyribmtc.fsf@cisco.com> Message-ID: <20051009084455.GA24993@mellanox.co.il> Hi, I'm proposing a better fix. see below. On Mon, Oct 03, 2005 at 06:13:51PM +0200, Roland Dreier wrote: > Seems reasonable. However, looking back at the chip documentation, it > seems that the max CQEs should really be 0x1ffff rather than 0xffff as > I had it. Can you confirm? > > Thanks, > Roland -------------------------------------------------- Best to take the actual max cqes from QUERY_DEV_LIMS -- new patch below. The "- 1" is there because the cq needs one spare cqe (circular list logic). 
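
A minimal sketch of why a circular queue usually keeps one entry spare,
assuming this is the "circular list logic" meant above. With only a producer
index and a consumer index, "empty" and "full" would look identical if every
slot could be used, so one slot is left unused and a ring of N slots holds at
most N - 1 entries. All names here (toy_ring, TOY_RING_SLOTS, and so on) are
hypothetical; this is not the mthca driver's or the HCA's actual CQ layout.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RING_SLOTS 8 /* hypothetical size; usable entries = slots - 1 */

struct toy_ring {
    unsigned head;              /* next slot to consume */
    unsigned tail;              /* next slot to produce into */
    int slot[TOY_RING_SLOTS];
};

/* Empty when both indices coincide. */
static bool toy_ring_empty(const struct toy_ring *r)
{
    return r->head == r->tail;
}

/* Full when advancing tail would make it collide with head. */
static bool toy_ring_full(const struct toy_ring *r)
{
    return (r->tail + 1) % TOY_RING_SLOTS == r->head;
}

/* Largest number of entries that can be queued at once. */
static unsigned toy_ring_max_entries(void)
{
    return TOY_RING_SLOTS - 1;
}

/* Returns false (and stores nothing) if the ring is full. */
static bool toy_ring_push(struct toy_ring *r, int value)
{
    if (toy_ring_full(r))
        return false;
    r->slot[r->tail] = value;
    r->tail = (r->tail + 1) % TOY_RING_SLOTS;
    return true;
}

/* Returns false (and leaves *value alone) if the ring is empty. */
static bool toy_ring_pop(struct toy_ring *r, int *value)
{
    if (toy_ring_empty(r))
        return false;
    *value = r->slot[r->head];
    r->head = (r->head + 1) % TOY_RING_SLOTS;
    return true;
}
```

Under that assumption, reporting "size - 1" as the maximum number of CQEs
corresponds to toy_ring_max_entries() here: the eighth push into an
eight-slot ring is refused.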
Jack

Signed-off-by: Jack Morgenstein

Index: linux-kernel/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy)
@@ -134,6 +134,7 @@
 	int num_eecs;
 	int reserved_eecs;
 	int num_cqs;
+	int max_cqes;
 	int reserved_cqs;
 	int num_eqs;
 	int reserved_eqs;
Index: linux-kernel/infiniband/hw/mthca/mthca_main.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_main.c (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_main.c (working copy)
@@ -173,6 +173,7 @@
 	mdev->limits.reserved_pds = dev_lim->reserved_pds;
 	mdev->limits.port_width_cap = dev_lim->max_port_width;
 	mdev->limits.flags = dev_lim->flags;
+	mdev->limits.max_cqes = dev_lim->max_cq_sz - 1;
 
 	/* IB_DEVICE_RESIZE_MAX_WR not supported by driver.
 	   May be doable since hardware supports it for SRQ.
Index: linux-kernel/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_provider.c (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_provider.c (working copy)
@@ -93,7 +93,7 @@
 	props->max_qp_wr = 0xffff;
 	props->max_sge = mdev->limits.max_sg;
 	props->max_cq = mdev->limits.num_cqs - mdev->limits.reserved_cqs;
-	props->max_cqe = 0xffff;
+	props->max_cqe = mdev->limits.max_cqes;
 	props->max_mr = mdev->limits.num_mpts - mdev->limits.reserved_mrws;
 	props->max_pd = mdev->limits.num_pds - mdev->limits.reserved_pds;
 	props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift;
@@ -639,7 +639,11 @@
 	struct mthca_cq *cq;
 	int nent;
 	int err;
+	struct mthca_dev* mdev = to_mdev(ibdev);
 
+	if (mdev->limits.max_cqes < entries || entries < 0)
+		return ERR_PTR(-EINVAL);
+
 	if (context) {
 		if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
 			return ERR_PTR(-EFAULT);

From yael at mellanox.co.il Sun Oct 9 04:18:23 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: 09 Oct 2005 13:18:23 +0200
Subject: [openib-general] [PATCH] Opensm - handling immediate error in vendor_send
Message-ID: <5zu0frvszk.fsf@mtl066.yok.mtl.com>

Hi Hal,

During our tests on Windows we encountered an issue that originates in a
problem in the lower layer but causes a problem in OpenSM.
If the osm_vendor_send call fails immediately, we need to update several
counters (currently, only qp0_mads_sent is decremented), and also call the
dispatcher if we reached qp0_mads_outstanding == 0 (in order to signal
the state mgr).
What we saw was that these counters weren't decremented, so the state mgr
wasn't signalled and OpenSM didn't proceed through its stages.
The following patch updates the relevant counters and calls the
dispatcher if necessary.
Thanks, Yael Signed-off-by: Yael Kalka Index: include/opensm/osm_vl15intf.h =================================================================== --- include/opensm/osm_vl15intf.h (revision 3703) +++ include/opensm/osm_vl15intf.h (working copy) @@ -60,6 +60,7 @@ #include #include #include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -137,6 +138,8 @@ typedef struct _osm_vl15 osm_vendor_t *p_vend; osm_log_t *p_log; osm_stats_t *p_stats; + osm_subn_t *p_subn; + cl_disp_reg_handle_t h_disp; } osm_vl15_t; /* @@ -176,6 +179,12 @@ typedef struct _osm_vl15 * p_stats * Pointer to the OpenSM statistics block. * +* p_subn +* Pointer to the Subnet object for this subnet. +* +* h_disp +* Handle returned from dispatcher registration. +* * SEE ALSO * VL15 object *********/ @@ -265,7 +274,9 @@ osm_vl15_init( IN osm_vendor_t* const p_vend, IN osm_log_t* const p_log, IN osm_stats_t* const p_stats, - IN const int32_t max_wire_smps ); + IN const int32_t max_wire_smps, + IN osm_subn_t* const p_subn, + IN cl_dispatcher_t* const p_disp ); /* * PARAMETERS * p_vl15 @@ -283,6 +294,12 @@ osm_vl15_init( * max_wire_smps * [in] Maximum number of MADs allowed on the wire at one time. * +* p_subn +* [in] Pointer to the subnet object. +* +* p_disp +* [in] Pointer to the dispatcher object. +* * RETURN VALUES * IB_SUCCESS if the VL15 object was initialized successfully. 
* Index: opensm/osm_opensm.c =================================================================== --- opensm/osm_opensm.c (revision 3703) +++ opensm/osm_opensm.c (working copy) @@ -257,7 +257,7 @@ osm_opensm_init( status = osm_vl15_init( &p_osm->vl15, p_osm->p_vendor, - &p_osm->log, &p_osm->stats, p_opt->max_wire_smps ); + &p_osm->log, &p_osm->stats, p_opt->max_wire_smps, &p_osm->subn, &p_osm->disp ); if( status != IB_SUCCESS ) goto Exit; Index: opensm/osm_vl15intf.c =================================================================== --- opensm/osm_vl15intf.c (revision 3703) +++ opensm/osm_vl15intf.c (working copy) @@ -157,6 +157,8 @@ __osm_vl15_poller( if( status != IB_SUCCESS ) { + uint32_t outstanding; + cl_status_t cl_status; osm_log( p_vl->p_log, OSM_LOG_ERROR, "__osm_vl15_poller: ERR 3E03: " "MAD send failed (%s).\n", @@ -166,7 +168,64 @@ __osm_vl15_poller( The MAD was never successfully sent, so fix up the pre-incremented count values. */ + /* Decrement qp0_mads_sent and qp0_mads_outstanding_on_wire + that was incremented in the code above. */ mads_sent = cl_atomic_dec( &p_vl->p_stats->qp0_mads_sent ); + if( p_madw->resp_expected == TRUE ) + if ( !&p_vl->p_stats->qp0_mads_outstanding_on_wire ) + osm_log( p_vl->p_log, OSM_LOG_ERROR, + "__osm_vl15_poller: ERR 3E04: " + "Trying to dec qp0_mads_outstanding_on_wire=0. " + "Problem with transaction mgr!\n"); + else + cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding_on_wire ); + + /* The following code is similar to the one in + __osm_sm_mad_ctrl_retire_trans_mad. We need to decrement the + qp0_mads_outstanding counter, and if we reached 0 - need to call + the cl_disp_post with OSM_SIGNAL_NO_PENDING_TRANSACTION (in order + to wake up the state mgr). */ + if ( !&p_vl->p_stats->qp0_mads_outstanding ) + osm_log( p_vl->p_log, OSM_LOG_ERROR, + "__osm_vl15_poller: ERR 3E05: " + "Trying to dec qp0_mads_outstanding=0. 
" + "Problem with transaction mgr!\n"); + else + outstanding = cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); + + osm_log( p_vl->p_log, OSM_LOG_DEBUG, + "__osm_vl15_poller: " + "%u(%u) QP0 MADs outstanding.\n", + p_vl->p_stats->qp0_mads_outstanding,outstanding ); + + if( outstanding == 0 ) + { + /* + The wire is clean. + Signal the state manager. + */ + if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_vl->p_log, OSM_LOG_DEBUG, + "__osm_vl15_poller: " + "Posting Dispatcher message %s.\n", + osm_get_disp_msg_str( OSM_MSG_NO_SMPS_OUTSTANDING ) ); + } + + cl_status = cl_disp_post( p_vl->h_disp, + OSM_MSG_NO_SMPS_OUTSTANDING, + (void *)OSM_SIGNAL_NO_PENDING_TRANSACTIONS, + NULL, + NULL ); + + if( cl_status != CL_SUCCESS ) + { + osm_log( p_vl->p_log, OSM_LOG_ERROR, + "__osm_vl15_poller: ERR 3E06: " + "Dispatcher post message failed (%s).\n", + CL_STATUS_MSG( cl_status ) ); + } + } } else { @@ -232,6 +291,7 @@ osm_vl15_construct( cl_qlist_init( &p_vl->rfifo ); cl_qlist_init( &p_vl->ufifo ); cl_thread_construct( &p_vl->poller ); + p_vl->h_disp = CL_DISP_INVALID_HANDLE; } /********************************************************************** @@ -281,6 +341,8 @@ osm_vl15_destroy( p_vl->state = OSM_VL15_STATE_INIT; cl_spinlock_destroy( &p_vl->lock ); + cl_disp_unregister( p_vl->h_disp ); + OSM_LOG_EXIT( p_vl->p_log ); } @@ -292,7 +354,10 @@ osm_vl15_init( IN osm_vendor_t* const p_vend, IN osm_log_t* const p_log, IN osm_stats_t* const p_stats, - IN const int32_t max_wire_smps ) + IN const int32_t max_wire_smps, + IN osm_subn_t* const p_subn, + IN cl_dispatcher_t* const p_disp + ) { ib_api_status_t status = IB_SUCCESS; OSM_LOG_ENTER( p_log, osm_vl15_init ); @@ -301,6 +366,7 @@ osm_vl15_init( p_vl->p_log = p_log; p_vl->p_stats = p_stats; p_vl->max_wire_smps = max_wire_smps; + p_vl->p_subn = p_subn; status = cl_event_init( &p_vl->signal, FALSE ); if( status != IB_SUCCESS ) @@ -321,6 +387,21 @@ osm_vl15_init( if( status != IB_SUCCESS ) goto Exit; + 
p_vl->h_disp = cl_disp_register( + p_disp, + CL_DISP_MSGID_NONE, + NULL, + NULL ); + + if( p_vl->h_disp == CL_DISP_INVALID_HANDLE ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_vl15_init: ERR 3E01: " + "Dispatcher registration failed.\n" ); + status = IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + Exit: OSM_LOG_EXIT( p_log ); return( status ); From sean.hefty at intel.com Sun Oct 9 07:19:37 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 9 Oct 2005 07:19:37 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <1128730364.4382.11557.camel@hal.voltaire.com> Message-ID: >I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ? I'm referring to the case that iWarp is running over TCP. I know that it can run over SCTP, but I'm not familiar with the details of that protocol. With TCP, this is an end-to-end connection, so layering iWarp over it, only the endpoints need to deal with it. I believe the same is true for SCTP. >Doesn't a routing decision still need to be made at the IP layer ? Routing of the IP packets is done at the IP layer, but I don't see how this affects iWarp. >Doesn't the IP next hop need to be determined (e.g. gateway when the >destination is off the local IP subnet) ? Is there something that >precludes iWARP from working across IP subnets ? I can't think of anything that would preclude iWarp from working across subnets. - Sean From sean.hefty at intel.com Sun Oct 9 07:57:04 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 9 Oct 2005 07:57:04 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <1128829186.25001.76.camel@mail.es335.com> Message-ID: >It is theoretically possible to support all this on an IPoIB based >network. Multiple subnets, multiple routes to remote peers, ICMP >redirect, multiple IP addresses for each physical interface, yada yada >yada. 
But IMHO, the only way to do this would be to tie directly into
>the existing routing, ARP, ICMP, etc... subsystems in Linux. Otherwise
>you'll end up recreating a gigantic (and I mean GIGANTIC) amount of

The current implementation ties into the standard Linux ARP tables. If
connections were made over TCP/IP, using IPoIB, then I don't think that there
would be any issues. The issues only arise because of the desire to use TCP/IP
network addresses over a non-TCP/IP network.

>code. This belief is why I've been a proponent of mapping GIDs to one
>and only one IP address and treating it for management purposes as the
>equivalent of an IP address. Without this, the whole mechanism for
>determining routes, etc.. breaks down. If you treat the GID like a MAC
>address -- it breaks, because a MAC address can have multiple IP
>addresses -- the observation that lead to the conclusion that ATS was
>broken in the first place.

We should be able to handle the case where a GID has multiple IP addresses bound
to it. But even if we added a 1:1 restriction, the connection over IB issue
still exists.

>I know there is significant resistance to this idea, but I just don't
>see how we get this generically resolved without binding the two
>addressing schemes more closely. With the current binding, I just don't
>think it works.

Again, I don't think that the binding is the issue, so much as the desire to use
an address for a protocol that isn't actually being used for communication. I
don't view a GID as an IP address because we're not sending and receiving IP
packets on the GID. IPoIB treats GIDs as only part of a MAC address, which I
think is the proper view.

Anyway, returning to the original problem of connecting to an IB gateway when
given a destination IP address on a different subnet... I'm slowly convincing
myself that either the CMA or AT should do this. (I believe that the ib_addr
code will do this now, but I still wasn't sure that we wanted it to.)

- Sean From surs at cse.ohio-state.edu Sun Oct 9 08:18:53 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Sun, 9 Oct 2005 11:18:53 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <52achmo18d.fsf@cisco.com> References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> <20051006021529.GA14502@cse.ohio-state.edu> <523bnfp8jk.fsf@cisco.com> <20051006133937.GA23901@cse.ohio-state.edu> <20051006184652.GA27969@cse.ohio-state.edu> <52achmo18d.fsf@cisco.com> Message-ID: <20051009151851.GA16147@cse.ohio-state.edu> Roland, * On Oct,13 Roland Dreier wrote : > Sayantan> I noticed that the test re-posts buffers only when the > Sayantan> outstanding recv count is <= 1. I set a SRQ limit as > Sayantan> max_recv - 5. So, I should get the event when 5 WQEs are > Sayantan> consumed from the SRQ, right? > > Yes, your code is correct. The problem was that the mthca kernel > driver was dispatching SRQ events incorrectly, so the event never > reached userspace. I've checked in a fix for that, and I'm going to > queue the SRQ limit event stuff for 2.6.15 (now that I've seen it > working). I did some further testing with this. Apparently, when the asynchronous thread is first started, it gets the limit event (since no receives are posted yet ...). But after that when the number of posted receives actually drop below max_recv - 5, I am not able to see another limit event. Do you think that this could happen in the current implementation? Thanks, Sayantan. 
--
http://www.cse.ohio-state.edu/~surs

From jackm at mellanox.co.il Sun Oct 9 09:30:05 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 9 Oct 2005 18:30:05 +0200
Subject: [openib-general] segmentation fault in ibv_modify_srq
In-Reply-To: <20051009151851.GA16147@cse.ohio-state.edu>
References: <20051009151851.GA16147@cse.ohio-state.edu>
Message-ID: <20051009163005.GA26296@mellanox.co.il>

Sayantan,
The Limit Event must be re-armed after an event has occurred (it is a "one-shot"),
i.e., modify-srq/set-limit must be re-invoked. This is compliant with the
IB Spec (see section 10.2.9.3, first paragraph). (Note that after each SRQ LWM
event, the limit for the SRQ gets reset back to zero -- i.e., disabled.)

Therefore, proper use of this feature is as follows (after creating the SRQ):
a. Post the SRQ WQEs.
b. Arm the Limit to a non-zero value (less than the number of WQEs posted,
   or the arming is useless -- you will immediately get the event).
c. If the number of posted WQEs falls below your limit, you will get an
   event.
d. Handling the event:
   1) FIRST, post more WQEs to the SRQ to get the number of posted WQEs to be
      greater than your desired limit.
   2) THEN, re-arm the event (i.e., modify the SRQ limit again to
      be a non-zero value).

Jack

-----Original Message-----
On Sun, Oct 09, 2005 at 05:18:53PM +0200, Sayantan Sur wrote:
> Roland,
>
> * On Oct,13 Roland Dreier wrote :
> > Sayantan> I noticed that the test re-posts buffers only when the
> > Sayantan> outstanding recv count is <= 1. I set a SRQ limit as
> > Sayantan> max_recv - 5. So, I should get the event when 5 WQEs are
> > Sayantan> consumed from the SRQ, right?
> >
> > Yes, your code is correct. The problem was that the mthca kernel
> > driver was dispatching SRQ events incorrectly, so the event never
> > reached userspace. I've checked in a fix for that, and I'm going to
> > queue the SRQ limit event stuff for 2.6.15 (now that I've seen it
> > working).
>
> I did some further testing with this.
Apparently, when the asynchronous > thread is first started, it gets the limit event (since no receives are > posted yet ...). But after that when the number of posted receives > actually drop below max_recv - 5, I am not able to see another limit > event. > > Do you think that this could happen in the current implementation? > > Thanks, > Sayantan. > > -- > http://www.cse.ohio-state.edu/~surs > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From tom at ammasso.com Sun Oct 9 10:10:18 2005 From: tom at ammasso.com (Tom Tucker) Date: Sun, 09 Oct 2005 12:10:18 -0500 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: References: Message-ID: <1128877818.24182.54.camel@mail.es335.com> On Sun, 2005-10-09 at 07:57 -0700, Sean Hefty wrote: > >It is theoretically possible to support all this on an IPoIB based > >network. Multiple subnets, multiple routes to remote peers, ICMP > >redirect, multiple IP addresses for each physical interface, yada yada > >yada. But IMHO, the only way to do this would be to tie directly into > >the existing routing, ARP, ICMP, etc... subsystems in Linux. Otherwise > >you'll end up recreating a gigantic (and I mean GIGANTIC) amount of > > The current implementation ties into the standard Linux ARP tables. If > connections were made over TCP/IP, using IPoIB, then I don't think that there > would be any issues. The issues only arise because of the desire to use TCP/IP > network addresses over a non-TCP/IP network. > > >code. This belief is why I've been a proponent of mapping GIDs to one > >and only one IP address and treating it for management purposes as the > >equivalent of an IP address. Without this, the whole mechanism for > >determining routes, etc.. breaks down. 
If you treat the GID like a MAC > >address -- it breaks, because a MAC address can have multiple IP > >addresses -- the observation that lead to the conclusion that ATS was > >broken in the first place. > > We should be able to handle the case where a GID has multiple IP addresses bound > to it. But even if we added a 1:1 restriction, the connection over IB issue > still exists. I agree, except for RARP. > > >I know there is significant resistance to this idea, but I just don't > >see how we get this generically resolved without binding the two > >addressing schemes more closely. With the current binding, I just don't > >think it works. > > Again, I don't think that the binding is the issue, so much as the desire to use > an address for a protocol that isn't actually being used for communication. Not to be pedantic, but if binding or mapping or somesuch weren't an issue we wouldn't need AT. > I > don't view a GID as an IP address because we're not sending and receiving IP > packets on the GID. IPoIB treats GIDs as only part of a MAC address, which I > think is the proper view. > > Anyway, returning back to the original problem of connecting to an IB gateway if > a given a destination IP address on a different subnet... I'm slowly convincing > myself that either the CMA or AT should do this. (I believe that the ib_addr > code will do this now, but still wasn't sure that we wanted it to.) > IMHO, you need a service separate from the CMA to do address translation. My (iWARP's) rationale for this is that there are two clients of the service, the CM and IP. For CM, you need it to elect a route and thereby a local interface. For IP you need it because routes change and ARP entries time out. BTW, can you educate me ... is the following what you're thinking: On the client side... 
- route is discovered by looking at the Linux routing table - local interface is IPoIB (looks at rdma_ptr embedded in netdev struct) - send ARP AT message over local IB interface At the gateway...bridging to IP - ARP AT query received on IB interface - Lookup route to destination IP address in gateway's route table. - If next hop's Ethernet address is already known, it is returned - Otherwise, local interface identified is IPoEthernet - New ARP query goes out on the local interface from the route - When response comes back, answer is returned. At the gateway...bridging to IPoIB - ARP AT message received on IB interface, delivered to AT - Lookup route to destination IP address in gateway's route table - If next hop's Ethernet address is already known, it is returned - otherwise, local interface identified in route is IPoIB - New ARP AT query goes out on the local interface - When response comes back, answer is returned. Thanks, > - Sean > > From surs at cse.ohio-state.edu Sun Oct 9 11:50:31 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Sun, 9 Oct 2005 14:50:31 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051009163005.GA26296@mellanox.co.il> References: <20051009151851.GA16147@cse.ohio-state.edu> <20051009163005.GA26296@mellanox.co.il> Message-ID: <20051009185029.GA16927@cse.ohio-state.edu> Jack, * On Oct,16 Jack Morgenstein wrote : > Sayantan, > The Limit Event must be re-armed after an event has occurred (it is a "one-shot"). > (i.e., modify-srq/set-limit must be re-invoked).This is compliant with the > IB Spec (see section 10.2.9.3, first paragraph). (Note that after each SRQ LWM > event, the limit for the SRQ gets reset back to zero -- i.e., disabled). > > Therefore, proper use of this feature is as follows (after creating the SRQ): > a. Post the SRQ WQEs > b. Arm the Limit to a non-zero value (less than the number of WQEs posted, > or the arming is useless -- you will immediately get the event). > c. 
If the number of posted WQEs falls below your limit, you will get an > event. > d. Handling the event: > 1) FIRST, post more WQEs to the SRQ to get the number of posted wqe's to be > greater than your desired limit. > 2) THEN, re-arm the event (i.e., modify the SRQ limit again to > be a non-zero value). Thanks for the detailed instructions. I am able to see the limit event exactly when the buffer count goes down. Thanks, Sayantan. -- http://www.cse.ohio-state.edu/~surs From braam at clusterfs.com Sun Oct 9 14:17:56 2005 From: braam at clusterfs.com (Peter J. Braam) Date: Sun, 9 Oct 2005 17:17:56 -0400 Subject: [openib-general] Lustre Network Driver - KDAPL or verbs? Message-ID: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> Cluster File Systems, Inc and its customers have been wondering if the Lustre Network Driver (LND) for OpenIb gen2, which we will begin to develop during the coming months, should be based on kdapl or verbs. The driver we plan to develop should strive to address several goals: - high reliability and performance - allow interoperability between user and kernel level - allow interoperability, or better, portability among different operating systems (Linux, OS X, Windows, Solaris) - be suitable for inclusion in the Linux kernel We are keen to hear some opinions! Thanks Peter Braam -------------- next part -------------- An HTML attachment was scrubbed... URL: From hozer at hozed.org Sun Oct 9 18:32:57 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Sun, 9 Oct 2005 20:32:57 -0500 Subject: [openib-general] IBM eHCA testing.. Message-ID: <20051010013256.GE4612@kalmia.hozed.org> What's the status on getting the ehca driver integrated into subversion? If there's something holding it up, can we at least get a version that can be dropped into drivers/infiniband/hw ? Also, one final note, is it really appropriate to have ehca/ebus in the infiniband directory? 
It's really a PPC64 specific driver that works for more than just the ehca device, correct? I have the correct port plugged in now, and I can see the logical HCA device in the output of 'ibnetdiscover' (from another node), but trying to bring up ib0 caused this: [ 381.453731] eHCA Infiniband Device Driver (Rel.: EHCA2_0025) [ 381.458602] xics_enable_irq: irq=36868: ibm_int_on returned fffffffd [ 393.378143] eHCA Infiniband Device Driver (Rel.: EHCA2_0025) [ 452.658083] PU0002 000b0075:ehca_define_sqp HCAD_ERROR Port 1 is not active. [ 452.658106] PU0002 000b0383:ehca_create_qp HCAD_ERROR ehca_define_sqp() failed rc=ffffffffffffffff [ 452.821917] PU0002 000b03aa:ehca_create_qp <<< failed ret=ffffffea [ 452.821939] ib_mad: Couldn't create ib_mad QP1 [ 453.313412] ib_mad: Couldn't open ehca0 port 1 [ 475.132318] PU0002 00060100:ehca_parse_ec EHCA port 1 is available. [ 518.249381] PU0007 000b00b9:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN r3=168 r4=1000000003000004 r5=2000000000000008 r6=8a40000000000000 r7=1e4e49000 r8=0 r9=0 r10=0 [ 518.249411] PU0007 000b00c0:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_OUT r3=ffffffffffffffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=800000000005aa18 r10=0 [ 518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffffffffffffffd3 ehca_qp=c00000000f2cd080 qp_num=8 [ 518.249460] ib0: failed to modify QP to init, ret = -22 [ 518.418976] ib0: ipoib_qp_create returned -22 [ 528.813491] Oops: Kernel access of bad area, sig: 11 [#1] [ 528.813505] SMP NR_CPUS=8 NUMA PSERIES LPAR [ 528.813517] Modules linked in: ib_ipoib ib_sa ib_mad hcad_mod ib_core ebus [ 528.813540] NIP: D000000000049C6C XER: 20000020 LR: D0000000000760A0 CTR: D000000000049C60 [ 528.813554] REGS: c00000000f1eb1d0 TRAP: 0300 Not tainted (2.6.13.3-power5) [ 528.813568] MSR: 8000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 22028422 [ 528.813580] DAR: 0000000000000000 DSISR: 0000000040000000 [ 528.813592] TASK: c00000000209a9a0[2021] 'ifconfig' THREAD: 
c00000000f1e8000 CPU: 0 [ 528.813605] GPR00: D0000000000760A0 C00000000F1EB450 D00000000005FFF0 0000000000000000 [ 528.813625] GPR04: C00000000F1EB548 0000000000000071 C00000000F1EB540 0000000000000001 [ 528.813645] GPR08: 000000000000000B 0000000000000001 0000000000000004 D000000000049C60 [ 528.813664] GPR12: D0000000000774C0 C0000000004B4000 00000000100C0000 00000000100A0000 [ 528.813685] GPR16: 0000000000000000 0000000000000000 0000000010020000 0000000010020000 [ 528.813704] GPR20: 000000001001E71C C0000001E466C000 FFFFFFFFFFFF8914 C0000001E46D4810 [ 528.813725] GPR24: C0000001E46D4800 C00000000F43B900 C00000000F1EBD10 0000000000000002 [ 528.813745] GPR28: 0000000000000000 C0000001E466C380 D000000000084640 C00000000F1EB548 [ 528.813768] NIP [d000000000049c6c] .ib_modify_qp+0xc/0x40 [ib_core] [ 528.813797] LR [d0000000000760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib] [ 528.813822] Call Trace: [ 528.813829] [c00000000f1eb450] [00000000434849c5] 0x434849c5 (unreliable) [ 528.813846] [c00000000f1eb4d0] [d0000000000760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib] [ 528.813873] [c00000000f1eb5f0] [d00000000007261c] .ipoib_ib_dev_open+0x2c/0x120 [ib_ipoib] [ 528.813899] [c00000000f1eb680] [d00000000006f38c] .ipoib_open+0x7c/0x190 [ib_ipoib] [ 528.813923] [c00000000f1eb720] [c00000000032a650] .dev_open+0xc0/0x120 [ 528.813942] [c00000000f1eb7c0] [c000000000328c70] .dev_change_flags+0x180/0x1c0 [ 528.813961] [c00000000f1eb860] [c00000000037a02c] .devinet_ioctl+0x81c/0x850 [ 528.813980] [c00000000f1eb970] [c00000000037a65c] .inet_ioctl+0x27c/0x2d0 [ 528.813998] [c00000000f1eba00] [c00000000031bc4c] .sock_ioctl+0x8c/0x440 [ 528.814016] [c00000000f1ebaa0] [c0000000000c22f0] .do_ioctl+0x60/0x120 [ 528.814033] [c00000000f1ebb40] [c0000000000c244c] .vfs_ioctl+0x9c/0x4d0 [ 528.814050] [c00000000f1ebbf0] [c0000000000c28cc] .sys_ioctl+0x4c/0xa0 [ 528.814066] [c00000000f1ebca0] [c00000000001bb24] .dev_ifsioc+0x84/0x390 [ 528.814084] [c00000000f1ebd70] [c0000000000e4d88] 
.compat_sys_ioctl+0x158/0x500 [ 528.814103] [c00000000f1ebe30] [c00000000000d300] syscall_exit+0x0/0x18 [ 528.814119] Instruction dump: [ 528.814126] 7c601b78 38210080 7c030378 e8010010 7c0803a6 4e800020 60000000 60000000 [ 528.814150] 60000000 7c0802a6 f8010010 f821ff81 e9490170 e80a0000 f8410028 [ 528.814174] <7>RTAS: event: 3, Type: Platform Error, Severity: 2 From ftillier at silverstorm.com Sun Oct 9 21:53:36 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Sun, 9 Oct 2005 21:53:36 -0700 Subject: [openib-general] Lustre Network Driver - KDAPL or verbs? In-Reply-To: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> Message-ID: <000301c5cd56$94101290$9e5aa8c0@infiniconsys.com> > From: Peter J. Braam [mailto:braam at clusterfs.com] > Sent: Sunday, October 09, 2005 2:18 PM > > Cluster File Systems, Inc and its customers have been wondering if the Lustre > Network Driver (LND) for OpenIb gen2, which we will begin to develop during > the coming months, should be based on kdapl or verbs. > > The driver we plan to develop should strive to address several goals: > - high reliability and performance > - allow interoperability between user and kernel level > - allow interoperability, or better, portability among different operating > systems (Linux, OS X, Windows, Solaris) > - be suitable for inclusion in the Linux kernel I think that suitability for inclusion in the Linux kernel is going to be mutually exclusive with portability between different operating systems. If you want to be in the Linux kernel, you need to be a native Linux driver, and not use any sorts of abstraction layers. Feedback to date on abstraction layers has been consistently clear that they will not be tolerated in the kernel. With the ongoing work to support both IB and iWarp devices under the OpenIB verbs, I think coding directly to verbs would be just fine. 
You'll likely want to use the higher level CM abstraction being developed now
for establishing connections in a transport neutral manner, but the verbs
themselves should be the same. Others more closely involved can likely give you
better guidance.

With all this said, I'm personally interested to see a cluster file system on
top of the OpenIB Windows stack, and since kDAPL doesn't exist in Windows at the
moment, interfacing to native verbs would be my preference. There really aren't
that many differences in verbs, though Windows will likely make you deal with
more things asynchronously depending on your IRQL. I'd be happy to field
specific questions about Windows on the openib-windows mailing list if you have
them.

Cheers,
- Fab

From IBMEHCAD at de.ibm.com Mon Oct 10 00:23:59 2005
From: IBMEHCAD at de.ibm.com (IBMEHCA DD)
Date: Mon, 10 Oct 2005 09:23:59 +0200
Subject: [openib-general] IBM eHCA testing..
In-Reply-To: <20051008020308.GZ4612@kalmia.hozed.org>
Message-ID:

This is caused by a complex interaction of ib_mad, hcad_mod and pSeries
firmware.

As you might already have noticed, an eHCA doesn't show up as a "port" but as
a switch in the fabric. The reason for this is partition support and
virtualisation in InfiniBand. If you want to give each partition in a system
its "own" IB adapter, it has to have its "own" LID(s) and therefore its own
GUIDs. The IB standard currently allows only one way to accomplish this: you
need a switch with multiple adapters behind it. So that's exactly how the eHCA
shows up in the fabric. In our case system firmware handles the SMA traffic
for that "switch" and for all "adapters" (running an SMA or SM on QP0 is
currently not supported).

This brings up another problem: you definitely won't want to allocate LIDs for
all "potentially possible" operating system partitions (not to be confused
with IB partitioning), otherwise you could come close to the 48000 LIDs/subnet
limit pretty quickly.
So you need some kind of signal from the operating system to system firmware, which in the eHCA case is H_DEFINE_AQP1, triggered by ib_create_qp with the IB_QPT_GSI parameter. Only AFTER that call does the handshaking between system firmware and the SM start: here's a new adapter active on a switch port... what's your GUID? Here's your LID, p_key, SM LID... and only after all that is it possible to send and receive packets from the fabric. The openib stack expects a port to be fully functional once this create_qp returns, and starts to do all sorts of modify QP and post send operations. So the only choice we have is to delay create_qp until the complete handshaking between system firmware and the SM has finished (until we see an IB_PORT_ACTIVE in hcad_mod). If we don't see that before EHCA_PORT_ACTIVE_TIMEOUT expires, we have to return an error code to openib; otherwise we're in serious trouble (tried that). Shirley already pointed out on the mailing list that ib_mad and others recover differently depending on the success of ib_create_qp(IB_QPT_GSI); ib_mad in particular decides that the best thing is to kill the complete adapter if that call fails on a single port. So that's the full explanation of ehca_nr_ports, and it hopefully answers your question. Troy Benjegerdes 08.10.2005 04:03 To Shirley Ma cc Pradeep Satyanarayana , Troy Benjegerdes , IBMEHCA DD/Germany/IBM at IBMDE, openib-general at openib.org, openib-general-bounces at openib.org Subject Re: [openib-general] IBM eHCA testing.. On Fri, Oct 07, 2005 at 09:33:27AM -0700, Shirley Ma wrote: > Hi, Troy, > > There is INSTALL file in the EHCA driver package. > In OpenPower 720 port 1 is at the top, port 2 is at the bottom. > In P570, port1 is at the bottom, port2 is at the top. Okay, I guess I should read more carefully ;) What is the issue with needing to use port 1? Can that be fixed in the driver, or does that need a firmware update? -------------- next part -------------- An HTML attachment was scrubbed...
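The delayed create_qp described in the eHCA reply above can be sketched in user space as a bounded wait. This is only an illustration under stated assumptions, not the eHCA driver's code: struct fake_port, fake_port_poll and wait_port_active are hypothetical names, one loop iteration stands in for one polling interval, and -ETIMEDOUT is an arbitrary choice of error code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a port that becomes active after some polls. */
struct fake_port {
    int polls_until_active;
};

/* Returns true once the simulated "port active" event has arrived. */
static bool fake_port_poll(struct fake_port *port)
{
    if (port->polls_until_active > 0) {
        port->polls_until_active--;
        return false;
    }
    return true;
}

/*
 * Wait for the port to become active, giving up after max_polls attempts.
 * A real driver would sleep between polls; this sketch only counts them.
 * Returns 0 on success, or -ETIMEDOUT (an assumed error code) on timeout.
 */
static int wait_port_active(struct fake_port *port, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (fake_port_poll(port))
            return 0;
    }
    return -ETIMEDOUT;
}
```

With a limit of 20 polls, a port that needs 2 polls succeeds, while one that needs 45 polls reports the timeout to the caller instead of continuing with an inactive port.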
URL: From info at vbdfsp.com Sun Oct 9 22:21:30 2005 From: info at vbdfsp.com (info at vbdfsp.com) Date: 10 Oct 2005 14:21:30 +0900 Subject: [openib-general] $BCK@-I,$:2T$2$k%7%9%F%`$G$9(B Message-ID: <20051010052130.13602.qmail@mail.vbdfsp.com> $B=w$N;R$H%"%]$r@\$d$jl9g$O(B awg_tokyo at yahoo.com.au $B"#(B==========================$B"#(B From yipeeyipeeyipeeyipee at yahoo.com Mon Oct 10 01:28:06 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Mon, 10 Oct 2005 08:28:06 +0000 (UTC) Subject: [openib-general] IRQ sharing on PCIe bus Message-ID: Hi, My setup is a 3GHz Xeon (x86_64) with a 2.6.13.2 kernel. A Mellanox memfree PCIe ddr HCA is connected. Why do I see IRQ sharing although I'm using msi_x and PCIe? Doesn't IRQ sharing only happen on older non PCIe busses? When insmod'ing ib_mthca.ko I see: ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) ib_mthca: Initializing 0000:06:00.0 IRQ for 0000:06:00.0[A] -> PIRQ 60, mask dcd8, excl 0000 -> newirq=10 -> got IRQ 10 PCI: Found IRQ 10 for device 0000:06:00.0 PCI: Sharing IRQ 10 with 0000:00:01.0 PCI: Sharing IRQ 10 with 0000:00:02.0 PCI: Sharing IRQ 10 with 0000:00:04.0 PCI: Sharing IRQ 10 with 0000:00:05.0 PCI: Sharing IRQ 10 with 0000:00:06.0 PCI: Sharing IRQ 10 with 0000:00:1d.0 PCI: Sharing IRQ 10 with 0000:07:04.0 PCI: Setting latency timer of device 0000:06:00.0 to 64 the /proc/pci is: PCI devices found: Bus 0, device 0, function 0: Class 0600: PCI device 8086:3590 (rev 12). Bus 0, device 0, function 1: Class ff00: PCI device 8086:3591 (rev 12). Bus 0, device 1, function 0: Class 0880: PCI device 8086:3594 (rev 12). IRQ 10. Non-prefetchable 32 bit memory at 0xfcdff000 [0xfcdfffff]. Bus 0, device 2, function 0: Class 0604: PCI device 8086:3595 (rev 12). IRQ 10. Master Capable. No bursts. Min Gnt=6. Bus 0, device 4, function 0: Class 0604: PCI device 8086:3597 (rev 12). IRQ 10. Master Capable. No bursts. Min Gnt=6. Bus 0, device 5, function 0: Class 0604: PCI device 8086:3598 (rev 12). 
IRQ 10. Master Capable. No bursts. Min Gnt=7. Bus 0, device 6, function 0: Class 0604: PCI device 8086:3599 (rev 12). IRQ 10. Master Capable. No bursts. Min Gnt=6. Bus 0, device 29, function 0: Class 0c03: PCI device 8086:24d2 (rev 2). IRQ 10. I/O at 0xd800 [0xd81f]. Bus 0, device 29, function 1: Class 0c03: PCI device 8086:24d4 (rev 2). IRQ 7. I/O at 0xd880 [0xd89f]. Bus 0, device 29, function 2: Class 0c03: PCI device 8086:24d7 (rev 2). IRQ 15. I/O at 0xdc00 [0xdc1f]. Bus 0, device 29, function 7: Class 0c03: PCI device 8086:24dd (rev 2). IRQ 5. Non-prefetchable 32 bit memory at 0xfcdfec00 [0xfcdfefff]. Bus 0, device 30, function 0: Class 0604: PCI device 8086:244e (rev 194). Master Capable. No bursts. Min Gnt=11. Bus 0, device 31, function 0: Class 0601: PCI device 8086:24d0 (rev 2). Bus 0, device 31, function 1: Class 0101: PCI device 8086:24db (rev 2). IRQ 15. I/O at 0xfc00 [0xfc0f]. Non-prefetchable 32 bit memory at 0x80100000 [0x801003ff]. Bus 0, device 31, function 3: Class 0c05: PCI device 8086:24d3 (rev 2). IRQ 11. I/O at 0x540 [0x55f]. Bus 1, device 0, function 0: Class 0604: PCI device 8086:0329 (rev 9). Master Capable. No bursts. Min Gnt=7. Bus 1, device 0, function 1: Class 0800: PCI device 8086:0326 (rev 9). Non-prefetchable 32 bit memory at 0xfcefe000 [0xfcefefff]. Bus 1, device 0, function 2: Class 0604: PCI device 8086:032a (rev 9). Master Capable. No bursts. Min Gnt=7. Bus 1, device 0, function 3: Class 0800: PCI device 8086:0327 (rev 9). Non-prefetchable 32 bit memory at 0xfceff000 [0xfcefffff]. Bus 6, device 0, function 0: Class 0c06: PCI device 15b3:6282 (rev 32). IRQ 10. Non-prefetchable 64 bit memory at 0xfcf00000 [0xfcffffff]. Prefetchable 64 bit memory at 0xfb800000 [0xfbffffff]. Bus 7, device 4, function 0: Class 0200: PCI device 8086:1076 (rev 5). IRQ 10. Master Capable. Latency=32. Min Gnt=255. Non-prefetchable 32 bit memory at 0xfebe0000 [0xfebfffff]. I/O at 0xec00 [0xec3f]. 
Bus 7, device 6, function 0: Class 0200: PCI device 8086:107c (rev 5). IRQ 15. Master Capable. Latency=32. Min Gnt=255. Non-prefetchable 32 bit memory at 0xfeba0000 [0xfebbffff]. Non-prefetchable 32 bit memory at 0xfeb80000 [0xfeb9ffff]. I/O at 0xe880 [0xe8bf]. Bus 7, device 12, function 0: Class 0300: PCI device 1002:4752 (rev 39). IRQ 11. Master Capable. Latency=32. Min Gnt=8. Non-prefetchable 32 bit memory at 0xfd000000 [0xfdffffff]. I/O at 0xe400 [0xe4ff]. Non-prefetchable 32 bit memory at 0xfebdb000 [0xfebdbfff]. Thanks, y From david at allinea.com Mon Oct 10 02:23:21 2005 From: david at allinea.com (David Lecomber) Date: Mon, 10 Oct 2005 10:23:21 +0100 Subject: [openib-general] ptrace peektext failure for Mellanox IBGD 1.7.0 based cluster Message-ID: <1128936201.26749.10.camel@delmo.priv.wark.uk.streamline-computing.com> Dear all, I'm having a kernel problem which I believe to be caused by the infiniband drivers on the system I am using. Kernel 2.6.11, Mellanox software stack IBGD 1.7.0. Essentially, once an MPI code is started, the kernel refuses to allow ptrace() access to the text segment (ie. where the program instructions lie), although it is possible to access the data segment. This means debugging is impossible (gdb, idb, ddt, etc.). The attached code demonstrates the problem. Untar, and then make. Run the 'mpi' program, and pick a line of it's output, paste into another shell. On the standard, non MPI test code, the ptrace reads are all successful. On the MPI test, it gives an error for the text segment reads.. Is this a known issue - are there any upgrades/fixes which should have been applied? I would appreciate if someone could run the test suggested on a really new setup, and see if the error happens. Regards David -- David Lecomber, CTO, Allinea Software tel: +44 1926 623231 fax: +44 1926 623232 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ib.tar Type: application/x-tar Size: 10240 bytes Desc: not available URL: From SCHICKHJ at de.ibm.com Mon Oct 10 03:53:23 2005 From: SCHICKHJ at de.ibm.com (Heiko J Schick) Date: Mon, 10 Oct 2005 12:53:23 +0200 Subject: [openib-general] IBM eHCA testing.. Message-ID: Hello Troy, below you will find our preliminary analysis about the problem you've reported on Oct 10 via the OpenIB mailing-list [1]: [1]: http://openib.org/pipermail/openib-general/2005-October/012353.html [ 381.453731] eHCA Infiniband Device Driver (Rel.: EHCA2_0025) [ 381.458602] xics_enable_irq: irq=36868: ibm_int_on returned fffffffd [ 393.378143] eHCA Infiniband Device Driver (Rel.: EHCA2_0025) [ 452.658083] PU0002 000b0075:ehca_define_sqp HCAD_ERROR Port 1 is not active. [ 452.658106] PU0002 000b0383:ehca_create_qp HCAD_ERROR ehca_define_sqp() failed rc=ffffffffffffffff [ 452.821917] PU0002 000b03aa:ehca_create_qp <<< failed ret=ffffffea [ 452.821939] ib_mad: Couldn't create ib_mad QP1 [ 453.313412] ib_mad: Couldn't open ehca0 port 1 [ 475.132318] PU0002 00060100:ehca_parse_ec EHCA port 1 is available. 
[ 518.249381] PU0007 000b00b9:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN r3=168 r4=1000000003000004 r5=2000000000000008 r6=8a40000000000000 r7=1e4e49000 r8=0 r9=0 r10=0 [ 518.249411] PU0007 000b00c0:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_OUT r3=ffffffffffffffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=800000000005aa18 r10=0 [ 518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffffffffffffffd3 ehca_qp=c00000000f2cd080 qp_num=8 [ 518.249460] ib0: failed to modify QP to init, ret = -22 [ 518.418976] ib0: ipoib_qp_create returned -22 [ 528.813491] Oops: Kernel access of bad area, sig: 11 [#1] [ 528.813505] SMP NR_CPUS=8 NUMA PSERIES LPAR [ 528.813517] Modules linked in: ib_ipoib ib_sa ib_mad hcad_mod ib_core ebus [ 528.813540] NIP: D000000000049C6C XER: 20000020 LR: D0000000000760A0 CTR: D000000000049C60 [ 528.813554] REGS: c00000000f1eb1d0 TRAP: 0300 Not tainted (2.6.13.3-power5) [ 528.813568] MSR: 8000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 22028422 [ 528.813580] DAR: 0000000000000000 DSISR: 0000000040000000 [ 528.813592] TASK: c00000000209a9a0[2021] 'ifconfig' THREAD: c00000000f1e8000 CPU: 0 [ 528.813605] GPR00: D0000000000760A0 C00000000F1EB450 D00000000005FFF0 0000000000000000 [ 528.813625] GPR04: C00000000F1EB548 0000000000000071 C00000000F1EB540 0000000000000001 [ 528.813645] GPR08: 000000000000000B 0000000000000001 0000000000000004 D000000000049C60 [ 528.813664] GPR12: D0000000000774C0 C0000000004B4000 00000000100C0000 00000000100A0000 [ 528.813685] GPR16: 0000000000000000 0000000000000000 0000000010020000 0000000010020000 [ 528.813704] GPR20: 000000001001E71C C0000001E466C000 FFFFFFFFFFFF8914 C0000001E46D4810 [ 528.813725] GPR24: C0000001E46D4800 C00000000F43B900 C00000000F1EBD10 0000000000000002 [ 528.813745] GPR28: 0000000000000000 C0000001E466C380 D000000000084640 C00000000F1EB548 [ 528.813768] NIP [d000000000049c6c] .ib_modify_qp+0xc/0x40 [ib_core] [ 528.813797] LR [d0000000000760a0] .ipoib_qp_create+0xe0/0x1c0 
[ib_ipoib] [ 528.813822] Call Trace: [ 528.813829] [c00000000f1eb450] [00000000434849c5] 0x434849c5 (unreliable) [ 528.813846] [c00000000f1eb4d0] [d0000000000760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib] [ 528.813873] [c00000000f1eb5f0] [d00000000007261c] .ipoib_ib_dev_open+0x2c/0x120 [ib_ipoib] [ 528.813899] [c00000000f1eb680] [d00000000006f38c] .ipoib_open+0x7c/0x190 [ib_ipoib] [ 528.813923] [c00000000f1eb720] [c00000000032a650] .dev_open+0xc0/0x120 [ 528.813942] [c00000000f1eb7c0] [c000000000328c70] .dev_change_flags+0x180/0x1c0 [ 528.813961] [c00000000f1eb860] [c00000000037a02c] .devinet_ioctl+0x81c/0x850 [ 528.813980] [c00000000f1eb970] [c00000000037a65c] .inet_ioctl+0x27c/0x2d0 [ 528.813998] [c00000000f1eba00] [c00000000031bc4c] .sock_ioctl+0x8c/0x440 [ 528.814016] [c00000000f1ebaa0] [c0000000000c22f0] .do_ioctl+0x60/0x120 [ 528.814033] [c00000000f1ebb40] [c0000000000c244c] .vfs_ioctl+0x9c/0x4d0 [ 528.814050] [c00000000f1ebbf0] [c0000000000c28cc] .sys_ioctl+0x4c/0xa0 [ 528.814066] [c00000000f1ebca0] [c00000000001bb24] .dev_ifsioc+0x84/0x390 [ 528.814084] [c00000000f1ebd70] [c0000000000e4d88] .compat_sys_ioctl+0x158/0x500 [ 528.814103] [c00000000f1ebe30] [c00000000000d300] syscall_exit+0x0/0x18 [ 528.814119] Instruction dump: [ 528.814126] 7c601b78 38210080 7c030378 e8010010 7c0803a6 4e800020 60000000 60000000 [ 528.814150] 60000000 7c0802a6 f8010010 f821ff81 e9490170 e80a0000 f8410028 [ 528.814174] <7>RTAS: event: 3, Type: Platform Error, Severity: 2 It looks like IPoIB uses resources that have already been freed. We don't receive a "port active" event for port 1 in time (within 20 seconds). The ib_mad stack tries to create an AQP1, and our eHCA InfiniBand Device Driver waits a maximum of 20 seconds for a "port active" event. With OpenSM, it seems we receive the "port active" event only after ca. 45 seconds. For the MAD and IPoIB stacks this means the following: MAD: ==== 1.
No AQP1 QP will exist for port 1, because of the missing "port active" event. 2. All resources are freed by the error handling routines in ib_mad: create_mad_qp reports an error to ib_mad_port_open, which destroys all allocated resources (workqueue, AQPs, MR, PD, CQ, etc.). 3. Multicast join requests to the SM won't work! IPoIB doesn't work on ifconfig ib0 xxx.xxx.xxx.xxx! IPoIB: ====== For IPoIB, a "port active" event that arrives too late is going to be a problem. 1. The function ipoib_add_one calls ipoib_add_port, which creates all IB resources (QPs, CQ, etc. function ipoib_dev_init -> ipoib_in_dev_init, ...) 2. Function ipoib_ib_dev_init (executed at startup / module load) calls ipoib_ib_dev_open, which wants to modify the IPoIB QP from INIT -> RTR -> RTS via ipoib_qp_create. 3. The first ib_modify_qp call (Reset2Init) in ipoib_qp_create fails, because the port is not active at that moment. See: [ 518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffffffffffffffd3 ... [ 518.249460] ib0: failed to modify QP to init, ret = -22 [ 518.418976] ib0: ipoib_qp_create returned -22 4. If that happens, the function ipoib_qp_create in ib_verbs.c will destroy the IPoIB QP. 5. A user enters ifconfig ib0 xxx.xxx.xxx.xxx, which executes ipoib_open. This function also executes ipoib_ib_dev_open, which wants to modify the IPoIB QP from INIT -> RTR -> RTS via ipoib_qp_create. 6. ib_modify_qp will cause a kernel panic (because priv->qp is NULL, see function ipoib_qp_create). Mit freundlichen Gruessen / Kind Regards Heiko Joerg Schick IBM Deutschland Entwicklung GmbH I/Ox Microcode Development Linux Infiniband Device Drivers Schoenaicher Str. 220 71032 Boeblingen E-Mail: schickhj at de.ibm.com External: 49-7031-16-0 x4219, t/l: 120-4219 -------------- next part -------------- An HTML attachment was scrubbed...
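The failure sequence in the IPoIB steps above can be sketched in user space. This is a hedged illustration only, not the IPoIB source: struct fake_priv, fake_qp_create, fake_open, the state numbers and the error codes are hypothetical stand-ins. The one point taken from the analysis is that a failed create leaves priv->qp NULL, and that a later open must not pass that pointer on.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver objects named in the analysis. */
struct fake_qp {
    int state;
};

struct fake_priv {
    struct fake_qp *qp;      /* NULL once the QP has been destroyed */
    struct fake_qp storage;  /* backing object, used instead of an allocator */
};

/*
 * Models steps 3-4: the first state change fails while the port is
 * inactive, so the QP is torn down and priv->qp is left NULL.
 */
static int fake_qp_create(struct fake_priv *priv, int port_active)
{
    priv->qp = &priv->storage;
    if (!port_active) {
        priv->qp = NULL;     /* QP destroyed on the error path */
        return -EINVAL;
    }
    priv->qp->state = 1;     /* stands in for INIT */
    return 0;
}

/*
 * Models steps 5-6 with a defensive check of the kind the analysis
 * points to: if an earlier attempt destroyed the QP, try to recreate
 * it and report failure rather than dereferencing NULL.
 */
static int fake_open(struct fake_priv *priv, int port_active)
{
    if (priv->qp == NULL) {
        int ret = fake_qp_create(priv, port_active);
        if (ret)
            return ret;
    }
    priv->qp->state = 2;     /* stands in for RTR/RTS */
    return 0;
}
```

Recreating the QP inside the open path is just one possible design; returning an error immediately when priv->qp is NULL would avoid the crash equally well.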
URL: From eli at mellanox.co.il Mon Oct 10 06:22:35 2005 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 10 Oct 2005 15:22:35 +0200 Subject: [openib-general] RE: ptrace peektext failure for Mellanox IBGD 1.7.0 based cluster Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3066244@mtlexch01.mtl.com> David, IBGD 1.7 does not support kernel 2.6.11 so I assume you have made changes to IBGD to make it compile. In the files you sent I can't see a call to ptrace with PTRACE_PEEKTEXT but I can see a call to PTRACE_PEEKDATA. Note that in the IBGD stack, registered buffers are not inherited by a child process when the parent forks. This is accomplished by setting the VM_DONTCOPY flag on the vma. This is done to preserve the parent's virtual-to-physical translation of each page by disabling COW on those pages. So the child may not even have these buffers in its address space, and this could be the reason why ptrace fails. Note also that IBGD 1.8 is the latest release and it does support kernel 2.6.11, so you may consider using it, though the description above also holds for IBGD 1.8. Eli -----Original Message----- From: David Lecomber [mailto:david at allinea.com] Sent: Monday, October 10, 2005 11:23 AM To: openib-general at openib.org Subject: ptrace peektext failure for Mellanox IBGD 1.7.0 based cluster Dear all, I'm having a kernel problem which I believe to be caused by the infiniband drivers on the system I am using. Kernel 2.6.11, Mellanox software stack IBGD 1.7.0. Essentially, once an MPI code is started, the kernel refuses to allow ptrace() access to the text segment (ie. where the program instructions lie), although it is possible to access the data segment. This means debugging is impossible (gdb, idb, ddt, etc.). The attached code demonstrates the problem. Untar, and then make. Run the 'mpi' program, and pick a line of it's output, paste into another shell. On the standard, non MPI test code, the ptrace reads are all successful.
On the MPI test, it gives an error for the text segment reads.. Is this a known issue - are there any upgrades/fixes which should have been applied? I would appreciate if someone could run the test suggested on a really new setup, and see if the error happens. Regards David -- David Lecomber, CTO, Allinea Software tel: +44 1926 623231 fax: +44 1926 623232 -------------- next part -------------- An HTML attachment was scrubbed... URL: From david at allinea.com Mon Oct 10 06:22:32 2005 From: david at allinea.com (David Lecomber) Date: Mon, 10 Oct 2005 14:22:32 +0100 Subject: [openib-general] RE: ptrace peektext failure for Mellanox IBGD 1.7.0 based cluster In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3066244@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3066244@mtlexch01.mtl.com> Message-ID: <1128950552.26749.36.camel@delmo.priv.wark.uk.streamline-computing.com> On Mon, 2005-10-10 at 15:22 +0200, Eli Cohen wrote: > David, > IBGD 1.7 does not support kernel 2.6.11 so I assume you have made > changes to IBGD to make it compile. > In the files you sent I can't see a call to ptrace with > PTRACE_PEEKTEXT but I can see a call to PTRACE_PEEKDATA. Note that in > the IBGD stack, registered buffers are not inherited by a child > process when a the parent forks. This is accomplished by setting the > VM_DONTCOPY flag on the vma. It is so done to retain the virtual to > physical translation of a page at the parent by disabling COW on the > pages. So the child may not even have these buffers in its address > space and this could be the reason why ptrace fails. > > Note also that IBGD 1.8 is the latest release and it does support > kernel 2.6.11 so you may consider using it, though the description > above holds also for IBGD 1.8 > > Eli Hi Eli, Thanks for looking at this. Peektext/peekdata are synonymous, at least in Linux (c.f. the man page for ptrace). Do you happen to have a 1.8 based machine you could try the example on for me (please!)? 
Do you have any suggestions for a way to work around this. All debuggers need to be able to read memory locations, and even write to them (for breakpoints) - so it's kind of essential! Regards David From mst at mellanox.co.il Mon Oct 10 06:57:24 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Oct 2005 15:57:24 +0200 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: References: Message-ID: <20051010135723.GT21551@mellanox.co.il> Quoting Sean Hefty : > Subject: [PATCH] [CMA] RDMA CM abstraction module > > The following patch adds in a basic RDMA connection management abstraction. > It is functional, but needs additional work for handling device removal, > plus several missing features. > > I'd like to merge this back into the trunk, and continue working on it > from there. > > This depends on the ib_addr module. > > Signed-off-by: Sean Hefty > > > > Index: include/rdma/rdma_cm.h > =================================================================== > --- include/rdma/rdma_cm.h (revision 0) > +++ include/rdma/rdma_cm.h (revision 0) > @@ -0,0 +1,201 @@ > > [........... snip ...............] > > + > +#if !defined(RDMA_CM_H) > +#define RDMA_CM_H > + > +#include > +#include > +#include > + > +/* > + * Upon receiving a device removal event, users must destroy the > associated > + * RDMA identifier and release all resources allocated with the device. 
> + */ > +enum rdma_event_type { > + RDMA_EVENT_ADDR_RESOLVED, > + RDMA_EVENT_ADDR_ERROR, > + RDMA_EVENT_ROUTE_RESOLVED, > + RDMA_EVENT_ROUTE_ERROR, > + RDMA_EVENT_CONNECT_REQUEST, > + RDMA_EVENT_CONNECT_ERROR, > + RDMA_EVENT_UNREACHABLE, > + RDMA_EVENT_REJECTED, > + RDMA_EVENT_ESTABLISHED, > + RDMA_EVENT_DISCONNECTED, > + RDMA_EVENT_DEVICE_REMOVAL, > +}; > + > +struct rdma_addr { > + struct sockaddr src_addr; > + struct sockaddr dst_addr; > + union { > + struct ib_addr ibaddr; > + } addr; > +}; > + > +struct rdma_route { > + struct rdma_addr addr; > + struct ib_sa_path_rec *path_rec; > + int num_paths; > +}; > + > +struct rdma_event { > + enum rdma_event_type event; > + int status; > + void *private_data; > + u8 private_data_len; > +}; Wouldn't it be a good idea to start names with rdma_cm or rdma_cma or something like that? For example, rdma_event_type is a bit confusing since it actually includes only CM events. Similar comments apply to other names. > +struct rdma_id; I propose renaming this to rdma_connection or something else more specific than just "id". Makes sense? > +/** > + * rdma_event_handler - Callback used to report user events. > + * > + * Notes: Users may not call rdma_destroy_id from this callback to destroy > + * the passed in id, or a corresponding listen id. Returning a > + * non-zero value from the callback will destroy the corresponding id. > + */ > +typedef int (*rdma_event_handler)(struct rdma_id *id, struct rdma_event *event); > + > +struct rdma_id { > + struct ib_device *device; > + void *context; > + struct ib_qp *qp; > + rdma_event_handler event_handler; > + struct rdma_route route; > +}; > + > +struct rdma_id* rdma_create_id(rdma_event_handler event_handler, void > *context); > + > +void rdma_destroy_id(struct rdma_id *id); > + > +/** > + * rdma_bind_addr - Bind an RDMA identifier to a source address and > + * associated RDMA device, if needed. > + * > + * @id: RDMA identifier. > + * @addr: Local address information.
Wildcard values are permitted. > + * > + * This associates a source address with the RDMA identifier before calling > + * rdma_listen. If a specific local address is given, the RDMA identifier will > + * be bound to a local RDMA device. > + */ > +int rdma_bind_addr(struct rdma_id *id, struct sockaddr *addr); > + > +/** > + * rdma_resolve_addr - Resolve destination and optional source addresses > + * from IP addresses to an RDMA address. If successful, the specified > + * rdma_id will be bound to a local device. > + * > + * @id: RDMA identifier. > + * @src_addr: Source address information. This parameter may be NULL. > + * @dst_addr: Destination address information. > + * @timeout_ms: Time to wait for resolution to complete. > + */ > +int rdma_resolve_addr(struct rdma_id *id, struct sockaddr *src_addr, > + struct sockaddr *dst_addr, int timeout_ms); > + > +/** > + * rdma_resolve_route - Resolve the RDMA address bound to the RDMA identifier > + * into route information needed to establish a connection. > + * > + * This is called on the client side of a connection, but its use is optional. > + * Users must have first called rdma_bind_addr to resolve a dst_addr > + * into an RDMA address before calling this routine. > + */ > +int rdma_resolve_route(struct rdma_id *id, int timeout_ms); Not sure I understand what this does, since the only extra parameter is timeout_ms. > +/** > + * rdma_create_qp - Allocate a QP and associate it with the specified RDMA > + * identifier. > + */ > +int rdma_create_qp(struct rdma_id *id, struct ib_pd *pd, > + struct ib_qp_init_attr *qp_init_attr); > + > +/** > + * rdma_destroy_qp - Deallocate the QP associated with the specified RDMA > + * identifier. > + * > + * Users must destroy any QP associated with an RDMA identifier before > + * destroying the RDMA ID. > + */ > +void rdma_destroy_qp(struct rdma_id *id); Not sure what the intended usage is. When does the user need to call this? 
> +struct rdma_conn_param { > + const void *private_data; > + u8 private_data_len; > + u8 responder_resources; > + u8 initiator_depth; > + u8 flow_control; > + u8 retry_count; /* ignored when accepting */ > + u8 rnr_retry_count; > +}; > + > +/** > + * rdma_connect - Initiate an active connection request. > + * > + * Users must have bound the rdma_id to a local device by having called > + * rdma_resolve_addr before calling this routine. Users may also resolve the > + * RDMA address to a route with rdma_resolve_route, but if a route has not > + * been resolved, a default route will be selected. > + * > + * Note that the QP must be in the INIT state. > + */ > +int rdma_connect(struct rdma_id *id, struct rdma_conn_param *conn_param); > + > +/** > + * rdma_listen - This function is called by the passive side to > + * listen for incoming connection requests. > + * > + * Users must have bound the rdma_id to a local address by calling > + * rdma_bind_addr before calling this routine. > + */ > +int rdma_listen(struct rdma_id *id); > + > +/** > + * rdma_accept - Called on the passive side to accept a connection request > + * > + * Note that the QP must be in the INIT state. > + */ > +int rdma_accept(struct rdma_id *id, struct rdma_conn_param > *conn_param); > + > +/** > + * rdma_reject - Called on the passive side to reject a connection request. > + */ > +int rdma_reject(struct rdma_id *id, const void *private_data, > + u8 private_data_len); > + > +/** > + * rdma_disconnect - This function disconnects the associated QP. > + */ > +int rdma_disconnect(struct rdma_id *id); > + > +#endif /* RDMA_CM_H */ > + > Index: core/cma.c > =================================================================== > --- core/cma.c (revision 0) > +++ core/cma.c (revision 0) > @@ -0,0 +1,1207 @@ > + > [ ......... snip .............. ] > > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include Are all of these headers really needed? 
For example, I dont see arp.h used anywhere. Am I missing something? > +MODULE_AUTHOR("Guy German"); > +MODULE_DESCRIPTION("Generic RDMA CM Agent"); > +MODULE_LICENSE("Dual BSD/GPL"); > + > +#define CMA_CM_RESPONSE_TIMEOUT 20 > +#define CMA_MAX_CM_RETRIES 3 > + > +static void cma_add_one(struct ib_device *device); > +static void cma_remove_one(struct ib_device *device); > + > +static struct ib_client cma_client = { > + .name = "cma", > + .add = cma_add_one, > + .remove = cma_remove_one > +}; > + > +static DEFINE_SPINLOCK(lock); > +static LIST_HEAD(dev_list); > + > +struct cma_device { > + struct list_head list; > + struct ib_device *device; > + __be64 node_guid; > + wait_queue_head_t wait; > + atomic_t refcount; > + struct list_head id_list; > +}; > + > +enum cma_state { > + CMA_IDLE, > + CMA_ADDR_QUERY, > + CMA_ADDR_RESOLVED, > + CMA_ROUTE_QUERY, > + CMA_ROUTE_RESOLVED, > + CMA_CONNECT, > + CMA_ADDR_BOUND, > + CMA_LISTEN, > + CMA_DEVICE_REMOVAL, > + CMA_DESTROYING > +}; > + > +/* > + * Device removal can occur at anytime, so we need extra handling to > + * serialize notifying the user of device removal with other callbacks. > + * We do this by disabling removal notification while a callback is in process, > + * and reporting it after the callback completes. 
> + */ > +struct rdma_id_private { > + struct rdma_id id; > + > + struct list_head list; > + struct cma_device *cma_dev; > + > + enum cma_state state; > + spinlock_t lock; > + wait_queue_head_t wait; > + atomic_t refcount; > + atomic_t dev_remove; > + > + int timeout_ms; > + struct ib_sa_query *query; > + int query_id; > + struct ib_cm_id *cm_id; > +}; > + > +struct cma_addr { > + u8 version; /* CMA version: 7:4, IP version: 3:0 */ > + u8 reserved; > + __be16 port; > + struct { > + union { > + struct in6_addr ip6; > + struct { > + __be32 pad[3]; > + __be32 addr; > + } ip4; > + } ver; > + } src_addr, dst_addr; > +}; > + > +static int cma_comp(struct rdma_id_private *id_priv, enum cma_state > comp) > +{ > + unsigned long flags; > + int ret; > + > + spin_lock_irqsave(&id_priv->lock, flags); > + ret = (id_priv->state == comp); > + spin_unlock_irqrestore(&id_priv->lock, flags); > + return ret; > +} > + > +static int cma_comp_exch(struct rdma_id_private *id_priv, > + enum cma_state comp, enum cma_state exch) > +{ > + unsigned long flags; > + int ret; > + > + spin_lock_irqsave(&id_priv->lock, flags); > + if ((ret = (id_priv->state == comp))) > + id_priv->state = exch; > + spin_unlock_irqrestore(&id_priv->lock, flags); > + return ret; > +} > + > +static enum cma_state cma_exch(struct rdma_id_private *id_priv, > + enum cma_state exch) > +{ > + unsigned long flags; > + enum cma_state old; > + > + spin_lock_irqsave(&id_priv->lock, flags); > + old = id_priv->state; > + id_priv->state = exch; > + spin_unlock_irqrestore(&id_priv->lock, flags); > + return old; > +} > + > +static inline u8 cma_get_ip_ver(struct cma_addr *addr) > +{ > + return addr->version & 0xF; > +} > + > +static inline u8 cma_get_cma_ver(struct cma_addr *addr) > +{ > + return addr->version >> 4; > +} > + > +static inline void cma_set_vers(struct cma_addr *addr, u8 cma_ver, u8 > ip_ver) > +{ > + addr->version = (cma_ver << 4) + (ip_ver & 0xF); > +} > + > +static int cma_acquire_ib_dev(struct rdma_id_private 
*id_priv, > + union ib_gid *gid) > +{ > + struct cma_device *cma_dev; > + unsigned long flags; > + int ret = -ENODEV; > + u8 port; > + > + spin_lock_irqsave(&lock, flags); > + list_for_each_entry(cma_dev, &dev_list, list) { > + ret = ib_find_cached_gid(cma_dev->device, gid, &port, NULL); > + if (!ret) { > + atomic_inc(&cma_dev->refcount); > + id_priv->cma_dev = cma_dev; > + id_priv->id.device = cma_dev->device; > + list_add_tail(&id_priv->list, &cma_dev->id_list); > + break; > + } > + } > + spin_unlock_irqrestore(&lock, flags); > + return ret; > +} > + > +static void cma_release_dev(struct rdma_id_private *id_priv) > +{ > + unsigned long flags; > + > + spin_lock_irqsave(&lock, flags); > + list_del(&id_priv->list); > + spin_unlock_irqrestore(&lock, flags); > + > + if (atomic_dec_and_test(&id_priv->cma_dev->refcount)) > + wake_up(&id_priv->cma_dev->wait); > +} > + > +static void cma_deref_id(struct rdma_id_private *id_priv) > +{ > + if (atomic_dec_and_test(&id_priv->refcount)) > + wake_up(&id_priv->wait); > +} > + > +struct rdma_id* rdma_create_id(rdma_event_handler event_handler, void > *context) > +{ > + struct rdma_id_private *id_priv; > + > + id_priv = kmalloc(sizeof *id_priv, GFP_KERNEL); > + if (!id_priv) > + return NULL; > + memset(id_priv, 0, sizeof *id_priv); > + > + id_priv->state = CMA_IDLE; > + id_priv->id.context = context; > + id_priv->id.event_handler = event_handler; > + spin_lock_init(&id_priv->lock); > + init_waitqueue_head(&id_priv->wait); > + atomic_set(&id_priv->refcount, 1); > + atomic_set(&id_priv->dev_remove, 1); > + > + return &id_priv->id; > +} > +EXPORT_SYMBOL(rdma_create_id); > + > +static int cma_init_ib_qp(struct rdma_id_private *id_priv, struct ib_qp > *qp) > +{ > + struct ib_qp_attr qp_attr; > + struct ib_sa_path_rec *path_rec; > + int ret; > + > + qp_attr.qp_state = IB_QPS_INIT; > + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; > + > + path_rec = id_priv->id.route.path_rec; > + ret = ib_find_cached_gid(id_priv->id.device, 
&path_rec->sgid, > + &qp_attr.port_num, NULL); > + if (ret) > + return ret; > + > + ret = ib_find_cached_pkey(id_priv->id.device, qp_attr.port_num, > + > id_priv->id.route.addr.addr.ibaddr.pkey, > + &qp_attr.pkey_index); > + if (ret) > + return ret; > + > + return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS | > + IB_QP_PKEY_INDEX | IB_QP_PORT); > +} > + > +int rdma_create_qp(struct rdma_id *id, struct ib_pd *pd, > + struct ib_qp_init_attr *qp_init_attr) > +{ > + struct rdma_id_private *id_priv; > + struct ib_qp *qp; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (id->device != pd->device) > + return -EINVAL; > + > + qp = ib_create_qp(pd, qp_init_attr); > + if (IS_ERR(qp)) > + return PTR_ERR(qp); > + > + switch (id->device->node_type) { > + case IB_NODE_CA: > + ret = cma_init_ib_qp(id_priv, qp); > + break; > + default: > + ret = -ENOSYS; > + break; > + } > + > + if (ret) > + goto err; > + > + id->qp = qp; > + return 0; > +err: > + ib_destroy_qp(qp); > + return ret; > +} > +EXPORT_SYMBOL(rdma_create_qp); What about replacing switch with one case statements with if statements. Like this: if (id->device->node_type == IB_NODE_CA) ret = cma_init_ib_qp(id_priv, qp); else ret = -ENOSYS; Or even ret = id->device->node_type == IB_NODE_CA ? cma_init_ib_qp(id_priv, qp) : -ENOSYS; I also wander why do we really need all these node_type checks. The code above seems to imply that rdma_create_qp will fail on non-CA. Why is that? > +void rdma_destroy_qp(struct rdma_id *id) > +{ > + ib_destroy_qp(id->qp); > +} > +EXPORT_SYMBOL(rdma_destroy_qp); > + > +static int cma_modify_ib_qp_rtr(struct rdma_id_private *id_priv) > +{ > + struct ib_qp_attr qp_attr; > + int qp_attr_mask, ret; > + > + /* Need to update QP attributes from default values. 
*/ > + qp_attr.qp_state = IB_QPS_INIT; > + ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, > &qp_attr_mask); > + if (ret) > + return ret; > + > + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); > + if (ret) > + return ret; > + > + qp_attr.qp_state = IB_QPS_RTR; > + ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, > &qp_attr_mask); > + if (ret) > + return ret; > + > + qp_attr.rq_psn = id_priv->id.qp->qp_num; > + return ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); > +} > + > +static int cma_modify_ib_qp_rts(struct rdma_id_private *id_priv) > +{ > + struct ib_qp_attr qp_attr; > + int qp_attr_mask, ret; > + > + qp_attr.qp_state = IB_QPS_RTS; > + ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, > &qp_attr_mask); > + if (ret) > + return ret; > + > + return ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); > +} > + > +static int cma_modify_qp_err(struct rdma_id *id) > +{ > + struct ib_qp_attr qp_attr; > + > + qp_attr.qp_state = IB_QPS_ERR; > + return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE); > +} > + > +static int cma_verify_addr(struct cma_addr *addr, > + struct sockaddr_in *ip_addr) > +{ > + if (cma_get_cma_ver(addr) != 1 || cma_get_ip_ver(addr) != 4) > + return -EINVAL; > + > + if (ip_addr->sin_port != be16_to_cpu(addr->port)) > + return -EINVAL; > + > + if (ip_addr->sin_addr.s_addr && > + (ip_addr->sin_addr.s_addr != be32_to_cpu(addr->dst_addr. 
> +					   ver.ip4.addr)))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int cma_notify_user(struct rdma_id_private *id_priv,
> +			   enum rdma_event_type type, int status,
> +			   void *data, u8 data_len)
> +{
> +	struct rdma_event event;
> +
> +	event.event = type;
> +	event.status = status;
> +	event.private_data = data;
> +	event.private_data_len = data_len;
> +
> +	return id_priv->id.event_handler(&id_priv->id, &event);
> +}
> +
> +static inline void cma_disable_dev_remove(struct rdma_id_private *id_priv)
> +{
> +	atomic_inc(&id_priv->dev_remove);
> +}
> +
> +static inline void cma_deref_dev(struct rdma_id_private *id_priv)
> +{
> +//	if (atomic_dec_and_test(&id_priv->dev_remove))
> +//		wake_up(&id_priv->wait);
> +//	return atomic_dec_and_test(&id_priv->dev_remove) ?
> +//		cma_notify_user(id_priv, RDMA_EVENT_DEVICE_REMOVAL, -ENODEV,
> +//				NULL, 0) : 0;
> +}

The above seems to need some cleanup.

Some of the comments above apply to the patch as a whole, so I'm
preserving the rest of it here for reference.
There are no more comments of mine below.
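
For what it's worth, the three spellings suggested above for rdma_create_qp should be interchangeable. A minimal standalone sketch, assuming a stub in place of cma_init_ib_qp and a made-up enum in place of the real node types (none of these names come from the patch):

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the IB node types; values are arbitrary. */
enum node_type { NODE_CA = 1, NODE_SWITCH, NODE_ROUTER };

/* Stub for the CA-specific QP setup; always succeeds in this sketch. */
static int init_ib_qp_stub(void)
{
	return 0;
}

/* Form used in the patch: a switch with a single real case. */
static int create_with_switch(enum node_type type)
{
	int ret;

	switch (type) {
	case NODE_CA:
		ret = init_ib_qp_stub();
		break;
	default:
		ret = -ENOSYS;
		break;
	}
	return ret;
}

/* First suggested form: plain if/else. */
static int create_with_if(enum node_type type)
{
	int ret;

	if (type == NODE_CA)
		ret = init_ib_qp_stub();
	else
		ret = -ENOSYS;
	return ret;
}

/* Second suggested form: conditional expression. */
static int create_with_ternary(enum node_type type)
{
	return type == NODE_CA ? init_ib_qp_stub() : -ENOSYS;
}
```

All three return 0 for the CA type and -ENOSYS for anything else, so the choice between them is purely one of readability.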
Thanks, MST ---------------------------------------------- > +static void cma_cancel_addr(struct rdma_id_private *id_priv) > +{ > + switch (id_priv->id.device->node_type) { > + case IB_NODE_CA: > + ib_addr_cancel(&id_priv->id.route.addr.addr.ibaddr); > + break; > + default: > + break; > + } > +} > + > +static void cma_cancel_route(struct rdma_id_private *id_priv) > +{ > + switch (id_priv->id.device->node_type) { > + case IB_NODE_CA: > + ib_sa_cancel_query(id_priv->query_id, id_priv->query); > + break; > + default: > + break; > + } > +} > + > +static void cma_cancel_operation(struct rdma_id_private *id_priv, > + enum cma_state state) > +{ > + switch (state) { > + case CMA_ADDR_QUERY: > + cma_cancel_addr(id_priv); > + break; > + case CMA_ROUTE_QUERY: > + cma_cancel_route(id_priv); > + break; > + default: > + break; > + } > +} > + > +static void cma_free_id(struct rdma_id_private *id_priv) > +{ > + if (id_priv->cma_dev) { > + switch (id_priv->id.device->node_type) { > + case IB_NODE_CA: > + if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) > + ib_destroy_cm_id(id_priv->cm_id); > + break; > + default: > + break; > + } > + cma_release_dev(id_priv); > + } > + > + atomic_dec(&id_priv->refcount); > + wait_event(id_priv->wait, !atomic_read(&id_priv->refcount)); > + > + kfree(id_priv->id.route.path_rec); > + kfree(id_priv); > +} > + > +void rdma_destroy_id(struct rdma_id *id) > +{ > + struct rdma_id_private *id_priv; > + enum cma_state state; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + > + state = cma_exch(id_priv, CMA_DESTROYING); > + cma_cancel_operation(id_priv, state); > + cma_free_id(id_priv); > +} > +EXPORT_SYMBOL(rdma_destroy_id); > + > +static int cma_rep_recv(struct rdma_id_private *id_priv) > +{ > + int ret; > + > + ret = cma_modify_ib_qp_rtr(id_priv); > + if (ret) > + goto reject; > + > + ret = cma_modify_ib_qp_rts(id_priv); > + if (ret) > + goto reject; > + > + ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0); > + if (ret) > + goto reject; > + > 
+ return 0; > +reject: > + cma_modify_qp_err(&id_priv->id); > + ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, > + NULL, 0, NULL, 0); > + return ret; > +} > + > +static int cma_rtu_recv(struct rdma_id_private *id_priv) > +{ > + int ret; > + > + ret = cma_modify_ib_qp_rts(id_priv); > + if (ret) > + goto reject; > + > + return 0; > +reject: > + cma_modify_qp_err(&id_priv->id); > + ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, > + NULL, 0, NULL, 0); > + return ret; > +} > + > +static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event > *ib_event) > +{ > + struct rdma_id_private *id_priv = cm_id->context; > + enum rdma_event_type event; > + u8 private_data_len = 0; > + int ret = 0, status = 0; > + > + if (!cma_comp(id_priv, CMA_CONNECT)) > + return 0; > + > + switch (ib_event->event) { > + case IB_CM_REQ_ERROR: > + case IB_CM_REP_ERROR: > + event = RDMA_EVENT_UNREACHABLE; > + status = -ETIMEDOUT; > + break; > + case IB_CM_REP_RECEIVED: > + status = cma_rep_recv(id_priv); > + event = status ? RDMA_EVENT_CONNECT_ERROR : > + RDMA_EVENT_ESTABLISHED; > + private_data_len = IB_CM_REP_PRIVATE_DATA_SIZE; > + break; > + case IB_CM_RTU_RECEIVED: > + status = cma_rtu_recv(id_priv); > + event = status ? 
RDMA_EVENT_CONNECT_ERROR : > + RDMA_EVENT_ESTABLISHED; > + break; > + case IB_CM_DREQ_ERROR: > + status = -ETIMEDOUT; /* fall through */ > + case IB_CM_DREQ_RECEIVED: > + case IB_CM_DREP_RECEIVED: > + event = RDMA_EVENT_DISCONNECTED; > + break; > + case IB_CM_TIMEWAIT_EXIT: > + case IB_CM_MRA_RECEIVED: > + /* ignore event */ > + goto out; > + case IB_CM_REJ_RECEIVED: > + cma_modify_qp_err(&id_priv->id); > + status = ib_event->param.rej_rcvd.reason; > + event = RDMA_EVENT_REJECTED; > + break; > + default: > + printk(KERN_ERR "RDMA CMA: unexpected IB CM event: %d", > + ib_event->event); > + goto out; > + } > + > + ret = cma_notify_user(id_priv, event, status, > ib_event->private_data, > + private_data_len); > + if (ret) { > + /* Destroy the CM ID by returning a non-zero value. */ > + id_priv->cm_id = NULL; > + rdma_destroy_id(&id_priv->id); > + } > +out: > + return ret; > +} > + > +static struct rdma_id_private* cma_new_id(struct rdma_id *listen_id, > + struct ib_cm_event *ib_event) > +{ > + struct rdma_id_private *id_priv; > + struct rdma_id *id; > + struct rdma_route *route; > + struct sockaddr_in *ip_addr; > + struct ib_sa_path_rec *path_rec; > + struct cma_addr *addr; > + int num_paths; > + > + ip_addr = (struct sockaddr_in *) &listen_id->route.addr.src_addr; > + if (cma_verify_addr(ib_event->private_data, ip_addr)) > + return NULL; > + > + num_paths = 1 + (ib_event->param.req_rcvd.alternate_path != NULL); > + path_rec = kmalloc(sizeof *path_rec * num_paths, GFP_KERNEL); > + if (!path_rec) > + return NULL; > + > + id = rdma_create_id(listen_id->event_handler, listen_id->context); > + if (!id) > + goto err; > + > + route = &id->route; > + route->addr.src_addr = listen_id->route.addr.src_addr; > + route->addr.dst_addr.sa_family = ip_addr->sin_family; > + > + ip_addr = (struct sockaddr_in *) &route->addr.dst_addr; > + addr = ib_event->private_data; > + ip_addr->sin_addr.s_addr = be32_to_cpu(addr->src_addr.ver.ip4.addr); > + > + route->num_paths = num_paths; > + 
route->path_rec = path_rec; > + path_rec[0] = *ib_event->param.req_rcvd.primary_path; > + if (num_paths == 2) > + path_rec[1] = *ib_event->param.req_rcvd.alternate_path; > + > + route->addr.addr.ibaddr.sgid = path_rec->dgid; > + route->addr.addr.ibaddr.dgid = path_rec->sgid; > + route->addr.addr.ibaddr.pkey = be16_to_cpu(path_rec->pkey); > + > + id_priv = container_of(id, struct rdma_id_private, id); > + id_priv->state = CMA_CONNECT; > + return id_priv; > +err: > + kfree(path_rec); > + return NULL; > +} > + > +static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event > *ib_event) > +{ > + struct rdma_id_private *listen_id, *conn_id; > + int offset, ret; > + > + listen_id = cm_id->context; > + conn_id = cma_new_id(&listen_id->id, ib_event); > + if (!conn_id) > + return -ENOMEM; > + > + ret = cma_acquire_ib_dev(conn_id, &conn_id->id.route.path_rec[0].sgid); > + if (ret) { > + ret = -ENODEV; > + goto err; > + } > + > + conn_id->cm_id = cm_id; > + cm_id->context = conn_id; > + cm_id->cm_handler = cma_ib_handler; > + conn_id->state = CMA_CONNECT; > + > + offset = sizeof(struct cma_addr); > + ret = cma_notify_user(conn_id, RDMA_EVENT_CONNECT_REQUEST, 0, > + ib_event->private_data + offset, > + IB_CM_REQ_PRIVATE_DATA_SIZE - offset); > + if (ret) { > + /* Destroy the CM ID by returning a non-zero value. 
*/ > + conn_id->cm_id = NULL; > + rdma_destroy_id(&conn_id->id); > + } > + return ret; > +err: > + rdma_destroy_id(&conn_id->id); > + return ret; > +} > + > +static __be64 cma_get_service_id(struct sockaddr *addr) > +{ > + return cpu_to_be64(((u64)IB_OPENIB_OUI << 48) + > + ((struct sockaddr_in *) addr)->sin_port); > +} > + > +static int cma_ib_listen(struct rdma_id_private *id_priv) > +{ > + __be64 svc_id; > + int ret; > + > + id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_req_handler, > + id_priv); > + if (IS_ERR(id_priv->cm_id)) > + return PTR_ERR(id_priv->cm_id); > + > + svc_id = cma_get_service_id(&id_priv->id.route.addr.src_addr); > + ret = ib_cm_listen(id_priv->cm_id, svc_id, 0); > + if (ret) > + ib_destroy_cm_id(id_priv->cm_id); > + > + return ret; > +} > + > +int rdma_listen(struct rdma_id *id) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN)) > + return -EINVAL; > + > + /* TODO: handle listen across multiple devices */ > + if (!id->device) { > + ret = -ENOSYS; > + goto err; > + } > + > + switch (id->device->node_type) { > + case IB_NODE_CA: > + ret = cma_ib_listen(id_priv); > + break; > + default: > + ret = -ENOSYS; > + break; > + } > + if (ret) > + goto err; > + > + return 0; > +err: > + cma_comp_exch(id_priv, CMA_LISTEN, CMA_ADDR_BOUND); > + return ret; > +}; > +EXPORT_SYMBOL(rdma_listen); > + > +static void cma_query_handler(int status, struct ib_sa_path_rec > *path_rec, > + void *context) > +{ > + struct rdma_id_private *id_priv = context; > + struct rdma_route *route = &id_priv->id.route; > + enum rdma_event_type event = RDMA_EVENT_ROUTE_RESOLVED; > + > + if (!status) { > + route->path_rec = kmalloc(sizeof *route->path_rec, GFP_KERNEL); > + if (route->path_rec) { > + route->num_paths = 1; > + *route->path_rec = *path_rec; > + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, > + CMA_ROUTE_RESOLVED)) > { > + 
kfree(route->path_rec); > + goto out; > + } > + } else > + status = -ENOMEM; > + } > + > + if (status) { > + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED)) > + goto out; > + event = RDMA_EVENT_ROUTE_ERROR; > + } > + > + if (cma_notify_user(id_priv, event, status, NULL, 0)) { > + cma_deref_id(id_priv); > + rdma_destroy_id(&id_priv->id); > + return; > + } > +out: > + cma_deref_id(id_priv); > +} > + > +static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int > timeout_ms) > +{ > + struct ib_addr *addr = &id_priv->id.route.addr.addr.ibaddr; > + struct ib_sa_path_rec path_rec; > + int ret; > + u8 port; > + > + ret = ib_find_cached_gid(id_priv->id.device, &addr->sgid, &port, NULL); > + if (ret) > + return -ENODEV; > + > + memset(&path_rec, 0, sizeof path_rec); > + path_rec.sgid = addr->sgid; > + path_rec.dgid = addr->dgid; > + path_rec.pkey = addr->pkey; > + path_rec.numb_path = 1; > + > + id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, > + port, &path_rec, > + IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | > + IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, > + timeout_ms, GFP_KERNEL, > + cma_query_handler, id_priv, &id_priv->query); > + > + return (id_priv->query_id < 0) ? 
id_priv->query_id : 0; > +} > + > +int rdma_resolve_route(struct rdma_id *id, int timeout_ms) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp_exch(id_priv, CMA_ADDR_RESOLVED, CMA_ROUTE_QUERY)) > + return -EINVAL; > + > + atomic_inc(&id_priv->refcount); > + switch (id->device->node_type) { > + case IB_NODE_CA: > + ret = cma_resolve_ib_route(id_priv, timeout_ms); > + break; > + default: > + ret = -ENOSYS; > + break; > + } > + if (ret) > + goto err; > + > + return 0; > +err: > + cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED); > + cma_deref_id(id_priv); > + return ret; > +} > +EXPORT_SYMBOL(rdma_resolve_route); > + > +static void addr_handler(int status, struct sockaddr *src_addr, > + struct ib_addr *ibaddr, void *context) > +{ > + struct rdma_id_private *id_priv = context; > + enum rdma_event_type event; > + > + if (!status) > + status = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); > + > + if (status) { > + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_IDLE)) > + goto out; > + event = RDMA_EVENT_ADDR_ERROR; > + } else { > + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) > + goto out; > + id_priv->id.route.addr.src_addr = *src_addr; > + event = RDMA_EVENT_ADDR_RESOLVED; > + } > + > + if (cma_notify_user(id_priv, event, status, NULL, 0)) { > + cma_deref_id(id_priv); > + rdma_destroy_id(&id_priv->id); > + return; > + } > +out: > + cma_deref_id(id_priv); > +} > + > +int rdma_resolve_addr(struct rdma_id *id, struct sockaddr *src_addr, > + struct sockaddr *dst_addr, int timeout_ms) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_QUERY)) > + return -EINVAL; > + > + atomic_inc(&id_priv->refcount); > + id->route.addr.dst_addr = *dst_addr; > + ret = ib_resolve_addr(src_addr, dst_addr, > &id->route.addr.addr.ibaddr, > + timeout_ms, 
addr_handler, id_priv); > + if (ret) > + goto err; > + > + return 0; > +err: > + cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_IDLE); > + cma_deref_id(id_priv); > + return ret; > +} > +EXPORT_SYMBOL(rdma_resolve_addr); > + > +int rdma_bind_addr(struct rdma_id *id, struct sockaddr *addr) > +{ > + struct rdma_id_private *id_priv; > + struct sockaddr_in *ip_addr = (struct sockaddr_in *) addr; > + struct ib_addr *ibaddr = &id->route.addr.addr.ibaddr; > + int ret; > + > + if (addr->sa_family != AF_INET) > + return -EINVAL; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_BOUND)) > + return -EINVAL; > + > + if (ip_addr->sin_addr.s_addr) { > + ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey); > + if (!ret) > + ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); > + } else > + ret = -ENOSYS; /* TODO: support wild card addresses */ > + > + if (ret) > + goto err; > + > + id->route.addr.src_addr = *addr; > + return 0; > +err: > + cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_IDLE); > + return ret; > +} > +EXPORT_SYMBOL(rdma_bind_addr); > + > +static void cma_format_addr(struct cma_addr *addr, struct rdma_route > *route) > +{ > + struct sockaddr_in *ip_addr; > + > + memset(addr, 0, sizeof *addr); > + cma_set_vers(addr, 1, 4); > + > + ip_addr = (struct sockaddr_in *) &route->addr.src_addr; > + addr->src_addr.ver.ip4.addr = cpu_to_be32(ip_addr->sin_addr.s_addr); > + > + ip_addr = (struct sockaddr_in *) &route->addr.dst_addr; > + addr->dst_addr.ver.ip4.addr = cpu_to_be32(ip_addr->sin_addr.s_addr); > + addr->port = cpu_to_be16(ip_addr->sin_port); > +} > + > +static int cma_connect_ib(struct rdma_id_private *id_priv, > + struct rdma_conn_param *conn_param) > +{ > + struct ib_cm_req_param req; > + struct rdma_route *route; > + struct cma_addr *addr; > + void *private_data; > + int ret; > + > + memset(&req, 0, sizeof req); > + req.private_data_len = sizeof *addr + conn_param->private_data_len; > + > + private_data = 
kmalloc(req.private_data_len, GFP_ATOMIC); > + if (!private_data) > + return -ENOMEM; > + > + id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler, > + id_priv); > + if (IS_ERR(id_priv->cm_id)) { > + ret = PTR_ERR(id_priv->cm_id); > + goto out; > + } > + > + addr = private_data; > + route = &id_priv->id.route; > + cma_format_addr(addr, route); > + > + if (conn_param->private_data && conn_param->private_data_len) > + memcpy(addr + 1, conn_param->private_data, > + conn_param->private_data_len); > + req.private_data = private_data; > + > + req.primary_path = &route->path_rec[0]; > + if (route->num_paths == 2) > + req.alternate_path = &route->path_rec[1]; > + > + req.service_id = cma_get_service_id(&route->addr.dst_addr); > + req.qp_num = id_priv->id.qp->qp_num; > + req.qp_type = IB_QPT_RC; > + req.starting_psn = req.qp_num; > + req.responder_resources = conn_param->responder_resources; > + req.initiator_depth = conn_param->initiator_depth; > + req.flow_control = conn_param->flow_control; > + req.retry_count = conn_param->retry_count; > + req.rnr_retry_count = conn_param->rnr_retry_count; > + req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; > + req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; > + req.max_cm_retries = CMA_MAX_CM_RETRIES; > + req.srq = id_priv->id.qp->srq ? 
1 : 0; > + > + ret = ib_send_cm_req(id_priv->cm_id, &req); > +out: > + kfree(private_data); > + return ret; > +} > + > +int rdma_connect(struct rdma_id *id, struct rdma_conn_param > *conn_param) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) > + return -EINVAL; > + > + switch (id->device->node_type) { > + case IB_NODE_CA: > + ret = cma_connect_ib(id_priv, conn_param); > + break; > + default: > + ret = -ENOSYS; > + break; > + } > + if (ret) > + goto err; > + > + return 0; > +err: > + cma_comp_exch(id_priv, CMA_CONNECT, CMA_ROUTE_RESOLVED); > + return ret; > +} > +EXPORT_SYMBOL(rdma_connect); > + > +static int cma_accept_ib(struct rdma_id_private *id_priv, > + struct rdma_conn_param *conn_param) > +{ > + struct ib_cm_rep_param rep; > + int ret; > + > + ret = cma_modify_ib_qp_rtr(id_priv); > + if (ret) > + return ret; > + > + memset(&rep, 0, sizeof rep); > + rep.qp_num = id_priv->id.qp->qp_num; > + rep.starting_psn = rep.qp_num; > + rep.private_data = conn_param->private_data; > + rep.private_data_len = conn_param->private_data_len; > + rep.responder_resources = conn_param->responder_resources; > + rep.initiator_depth = conn_param->initiator_depth; > + rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT; > + rep.failover_accepted = 0; > + rep.flow_control = conn_param->flow_control; > + rep.rnr_retry_count = conn_param->rnr_retry_count; > + rep.srq = id_priv->id.qp->srq ? 
1 : 0; > + > + return ib_send_cm_rep(id_priv->cm_id, &rep); > +} > + > +int rdma_accept(struct rdma_id *id, struct rdma_conn_param *conn_param) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp(id_priv, CMA_CONNECT)) > + return -EINVAL; > + > + switch (id->device->node_type) { > + case IB_NODE_CA: > + ret = cma_accept_ib(id_priv, conn_param); > + break; > + default: > + ret = -ENOSYS; > + break; > + } > + > + if (ret) > + goto reject; > + > + return 0; > +reject: > + cma_modify_qp_err(id); > + rdma_reject(id, NULL, 0); > + return ret; > +} > +EXPORT_SYMBOL(rdma_accept); > + > +int rdma_reject(struct rdma_id *id, const void *private_data, > + u8 private_data_len) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp(id_priv, CMA_CONNECT)) > + return -EINVAL; > + > + switch (id->device->node_type) { > + case IB_NODE_CA: > + ret = ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, > + NULL, 0, private_data, > private_data_len); > + break; > + default: > + ret = -ENOSYS; > + break; > + } > + return ret; > +}; > +EXPORT_SYMBOL(rdma_reject); > + > +int rdma_disconnect(struct rdma_id *id) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp(id_priv, CMA_CONNECT)) > + return -EINVAL; > + > + ret = cma_modify_qp_err(id); > + if (ret) > + goto out; > + > + switch (id->device->node_type) { > + case IB_NODE_CA: > + /* Initiate or respond to a disconnect. 
*/ > + if (ib_send_cm_dreq(id_priv->cm_id, NULL, 0)) > + ib_send_cm_drep(id_priv->cm_id, NULL, 0); > + break; > + default: > + break; > + } > +out: > + return ret; > +} > +EXPORT_SYMBOL(rdma_disconnect); > + > +/* TODO: add this to the device structure - see Roland's patch */ > +static __be64 get_ca_guid(struct ib_device *device) > +{ > + struct ib_device_attr *device_attr; > + __be64 guid; > + int ret; > + > + device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL); > + if (!device_attr) > + return 0; > + > + ret = ib_query_device(device, device_attr); > + guid = ret ? 0 : device_attr->node_guid; > + kfree(device_attr); > + return guid; > +} > + > +static void cma_add_one(struct ib_device *device) > +{ > + struct cma_device *cma_dev; > + unsigned long flags; > + > + cma_dev = kmalloc(sizeof *cma_dev, GFP_KERNEL); > + if (!cma_dev) > + return; > + > + cma_dev->device = device; > + cma_dev->node_guid = get_ca_guid(device); > + if (!cma_dev->node_guid) > + goto err; > + > + init_waitqueue_head(&cma_dev->wait); > + atomic_set(&cma_dev->refcount, 1); > + INIT_LIST_HEAD(&cma_dev->id_list); > + ib_set_client_data(device, &cma_client, cma_dev); > + > + spin_lock_irqsave(&lock, flags); > + list_add_tail(&cma_dev->list, &dev_list); > + spin_unlock_irqrestore(&lock, flags); > + return; > +err: > + kfree(cma_dev); > +} > + > +static int cma_remove_id_dev(struct rdma_id_private *id_priv) > +{ > + enum cma_state state; > + > + /* Record that we want to remove the device */ > + state = cma_exch(id_priv, CMA_DEVICE_REMOVAL); > + if (state == CMA_DESTROYING) > + return 0; > + > + /* TODO: wait until safe to process removal. */ > + > + /* Check for destruction from another callback. 
*/ > + if (!cma_comp(id_priv, CMA_DEVICE_REMOVAL)) > + return 0; > + > + return cma_notify_user(id_priv, RDMA_EVENT_DEVICE_REMOVAL, 0, > NULL, 0); > +} > + > +static void cma_process_remove(struct cma_device *cma_dev) > +{ > + struct list_head remove_list; > + struct rdma_id_private *id_priv; > + unsigned long flags; > + int ret; > + > + INIT_LIST_HEAD(&remove_list); > + > + spin_lock_irqsave(&lock, flags); > + while (!list_empty(&cma_dev->id_list)) { > + id_priv = list_entry(cma_dev->id_list.next, > + struct rdma_id_private, list); > + list_del(&id_priv->list); > + list_add_tail(&id_priv->list, &remove_list); > + atomic_inc(&id_priv->refcount); > + spin_unlock_irqrestore(&lock, flags); > + > + ret = cma_remove_id_dev(id_priv); > + cma_deref_id(id_priv); > + if (ret) > + rdma_destroy_id(&id_priv->id); > + > + spin_lock_irqsave(&lock, flags); > + } > + spin_unlock_irqrestore(&lock, flags); > + > + atomic_dec(&cma_dev->refcount); > + wait_event(cma_dev->wait, !atomic_read(&cma_dev->refcount)); > +} > + > +static void cma_remove_one(struct ib_device *device) > +{ > + struct cma_device *cma_dev; > + unsigned long flags; > + > + cma_dev = ib_get_client_data(device, &cma_client); > + if (!cma_dev) > + return; > + > + spin_lock_irqsave(&lock, flags); > + list_del(&cma_dev->list); > + spin_unlock_irqrestore(&lock, flags); > + > + cma_process_remove(cma_dev); > + kfree(cma_dev); > +} > + > +static int cma_init(void) > +{ > + return ib_register_client(&cma_client); > +} > + > +static void cma_cleanup(void) > +{ > + ib_unregister_client(&cma_client); > +} > + > +module_init(cma_init); > +module_exit(cma_cleanup); -- MST From mst at mellanox.co.il Mon Oct 10 06:58:00 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Mon, 10 Oct 2005 15:58:00 +0200
Subject: [openib-general] Re: Linux 2.6.13 Kernel Support Question
In-Reply-To: <43457E64.1010406@dbresearch.net>
References: <43457E64.1010406@dbresearch.net>
Message-ID: <20051010135800.GU21551@mellanox.co.il>

Quoting Sean Hubbell :
> Subject: Linux 2.6.13 Kernel Support Question
>
> Hello,
>
> Will openib still supply patches to the 2.6.13 Kernel or do I need to
> upgrade my kernel to 2.6.14?
>
> Thanks,
>
> Sean Hubbell

As Roland commented, once 2.6.14 is out the trunk will target it.

I keep patches to make trunk compile on older kernels under
https://openib.org/svn/gen2/branches/backport/

It's usually an uncomplicated exercise to add support for more kernels, so
I usually do it a couple of days after trunk switches to newer kernels, but
one has to keep in mind that testing is another matter.

Here at Mellanox people are testing against kernels that come with popular
distributions, so we are currently testing 2.6.9 on RHEL4, 2.6.11_FC4
(which is between 2.6.11 and 2.6.12) on FC4 and 2.6.11 on SuSE Pro 9.3.

Whether 2.6.13 will be tested at Mellanox depends on whether there is/will be
a distribution tested here that includes this kernel revision.

-- 
MST

From halr at voltaire.com Mon Oct 10 07:03:46 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Oct 2005 10:03:46 -0400
Subject: [openib-general] Re: [PATCH] Opensm - handling immediate error in vendor_send
In-Reply-To: <5zu0frvszk.fsf@mtl066.yok.mtl.com>
References: <5zu0frvszk.fsf@mtl066.yok.mtl.com>
Message-ID: <1128953025.4377.72.camel@hal.voltaire.com>

Hi Yael,

On Sun, 2005-10-09 at 07:18, Yael Kalka wrote:
> During our tests on Windows we encountered an issue that is caused due
> to some problem in the lower layer, but causes problem in the opensm.
> If the osm_vendor_send call fails immediately, we need to update
> several counters (currently, only qp0_mads_sent is decremented), and
> also call the dispatcher, if we reached qp0_mads_outstanding == 0 (in
> order to signal the state mgr).
> What we saw was that these counters weren't decremented, and thus the
> state mgr wasn't signalled, and the opensm didn't proceed in
> traversing through its stages.
> The following patch updates the relevant counters, and calls the
> dispatcher, if necessary.

Is there a similar issue with QP1 as well?

Also, in general, atomic_inc and atomic_dec deal with int32 quantities.
There is potential danger if they wrap from positive to negative or
vice versa. I don't think there is any code which deals with this.

I have some comments and questions on this patch embedded below.

-- Hal

>
> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka
> Index: opensm/osm_vl15intf.c
> ===================================================================
> --- opensm/osm_vl15intf.c (revision 3703)
> +++ opensm/osm_vl15intf.c (working copy)
> @@ -157,6 +157,8 @@ __osm_vl15_poller(
>
>      if( status != IB_SUCCESS )
>      {
> +      uint32_t outstanding;
> +      cl_status_t cl_status;
>        osm_log( p_vl->p_log, OSM_LOG_ERROR,
>                 "__osm_vl15_poller: ERR 3E03: "
>                 "MAD send failed (%s).\n",
> @@ -166,7 +168,64 @@ __osm_vl15_poller(
>          The MAD was never successfully sent, so
>          fix up the pre-incremented count values.
>        */
> +      /* Decrement qp0_mads_sent and qp0_mads_outstanding_on_wire
> +         that was incremented in the code above. */
>        mads_sent = cl_atomic_dec( &p_vl->p_stats->qp0_mads_sent );
> +      if( p_madw->resp_expected == TRUE )
> +        if ( !&p_vl->p_stats->qp0_mads_outstanding_on_wire )

Should this be !&p_vl->p_stats->qp0_mads_outstanding_on_wire or just
!p_vl->p_stats->qp0_mads_outstanding_on_wire?
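
A small standalone sketch may make the difference between the two spellings concrete. It is not OpenSM code: the struct and function names are hypothetical stand-ins, and it assumes the pointer passed in refers to a valid object.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the OpenSM statistics structure. */
struct stats {
	uint32_t mads_outstanding_on_wire;
};

/*
 * Negates the ADDRESS of the counter. The address of a member of a
 * valid object is never NULL, so this returns 0 regardless of the
 * counter's value (compilers may warn that the test is constant).
 */
static int zero_check_by_address(struct stats *s)
{
	return !&s->mads_outstanding_on_wire;
}

/* Negates the VALUE of the counter: 1 when it is zero, 0 otherwise. */
static int zero_check_by_value(struct stats *s)
{
	return !s->mads_outstanding_on_wire;
}
```

With the counter at zero, the address form still evaluates to 0, so an error branch guarded by it can never run, whereas the value form evaluates to 1. That suggests the second spelling, !p_vl->p_stats->qp0_mads_outstanding_on_wire, is the one that matches the log message.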
If it is the latter, should there be locking around it, like:

    CL_PLOCK_ACQUIRE( p_ctrl->p_lock );
    outstanding = p_ctrl->p_stats->qp0_mads_outstanding;
    CL_PLOCK_RELEASE( p_ctrl->p_lock );

Also, this appears to be debug code (it is not done in other places).
Why is it needed here?

> +          osm_log( p_vl->p_log, OSM_LOG_ERROR,
> +                   "__osm_vl15_poller: ERR 3E04: "
> +                   "Trying to dec qp0_mads_outstanding_on_wire=0. "
> +                   "Problem with transaction mgr!\n");

In this case, outstanding is not initialized, so what is supposed to
occur below when outstanding is checked against 0? (Should it be
initialized to 0? Do extra signals to the state manager (for
NO_PENDING_TRANSACTIONS) cause the wrong thing to occur?)

> +        else
> +          cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding_on_wire );
> +
> +      /* The following code is similar to the one in
> +         __osm_sm_mad_ctrl_retire_trans_mad. We need to decrement the
> +         qp0_mads_outstanding counter, and if we reached 0 - need to call
> +         the cl_disp_post with OSM_SIGNAL_NO_PENDING_TRANSACTION (in order
> +         to wake up the state mgr). */
> +      if ( !&p_vl->p_stats->qp0_mads_outstanding )
> +        osm_log( p_vl->p_log, OSM_LOG_ERROR,
> +                 "__osm_vl15_poller: ERR 3E05: "
> +                 "Trying to dec qp0_mads_outstanding=0. "
> +                 "Problem with transaction mgr!\n");
> +      else
> +        outstanding = cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding );
> +
> +      osm_log( p_vl->p_log, OSM_LOG_DEBUG,
> +               "__osm_vl15_poller: "
> +               "%u(%u) QP0 MADs outstanding.\n",
> +               p_vl->p_stats->qp0_mads_outstanding,outstanding );

Should the following precede this DEBUG call to osm_log?

    if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )

> +      if( outstanding == 0 )
> +      {
> +        /*
> +          The wire is clean.
> +          Signal the state manager.
> +        */
> +        if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) )
> +        {
> +          osm_log( p_vl->p_log, OSM_LOG_DEBUG,
> +                   "__osm_vl15_poller: "
> +                   "Posting Dispatcher message %s.\n",
> +                   osm_get_disp_msg_str( OSM_MSG_NO_SMPS_OUTSTANDING ) );
> +        }
> +
> +        cl_status = cl_disp_post( p_vl->h_disp,
> +                                  OSM_MSG_NO_SMPS_OUTSTANDING,
> +                                  (void *)OSM_SIGNAL_NO_PENDING_TRANSACTIONS,
> +                                  NULL,
> +                                  NULL );
> +
> +        if( cl_status != CL_SUCCESS )
> +        {
> +          osm_log( p_vl->p_log, OSM_LOG_ERROR,
> +                   "__osm_vl15_poller: ERR 3E06: "
> +                   "Dispatcher post message failed (%s).\n",
> +                   CL_STATUS_MSG( cl_status ) );
> +        }
> +      }
>      }
>      else
>      {

Also, the formatting has extra whitespace. (I fixed this by hand).

-- Hal

From rolandd at cisco.com Mon Oct 10 07:26:01 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 10 Oct 2005 07:26:01 -0700
Subject: [openib-general] Re: [PATCH] mthca: when creating a cq, check that requested cqes does not exceed HCA max
References: <52fyribmtc.fsf@cisco.com> <20051009084455.GA24993@mellanox.co.il>
Message-ID: <52zmphih3a.fsf@cisco.com>

Thanks, I extended this even further -- we might as well do similar
checking for QPs and SRQs while we're at it.

How does this seem?

 - R.
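
To make the arithmetic in the patch that follows easier to check, here is a minimal standalone sketch of the limit calculations and the CQ range check. The helper names are hypothetical and exist only for illustration; the expressions themselves mirror the ones in the diff, and the code assumes the size field is small enough that the shift does not overflow an int.

```c
#include <assert.h>
#include <errno.h>

/*
 * The firmware reports QP and SRQ sizes as a log2 field; the patch
 * turns that into a usable maximum of (1 << field) - 1 work requests.
 * Hypothetical helper name.
 */
static int max_wqes_from_log2(unsigned int field)
{
	return (1 << field) - 1;
}

/*
 * One CQE is kept spare so the hardware can tell an empty CQ from a
 * full one, so the usable maximum is one less than the reported size.
 * Hypothetical helper name.
 */
static int max_cqes_from_cq_sz(int max_cq_sz)
{
	return max_cq_sz - 1;
}

/* Range check of the kind added to CQ creation. */
static int check_cq_entries(int entries, int max_cqes)
{
	if (entries < 1 || entries > max_cqes)
		return -EINVAL;
	return 0;
}
```

For example, a reported CQ size of 65536 gives a usable limit of 65535, so a request for 65536 entries is rejected with -EINVAL while a request for 65535 is accepted.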
--- linux-kernel/infiniband/hw/mthca/mthca_dev.h	(revision 3704)
+++ linux-kernel/infiniband/hw/mthca/mthca_dev.h	(working copy)
@@ -128,12 +128,15 @@ struct mthca_limits {
 	int num_uars;
 	int max_sg;
 	int num_qps;
+	int max_wqes;
 	int reserved_qps;
 	int num_srqs;
+	int max_srq_wqes;
 	int reserved_srqs;
 	int num_eecs;
 	int reserved_eecs;
 	int num_cqs;
+	int max_cqes;
 	int reserved_cqs;
 	int num_eqs;
 	int reserved_eqs;
--- linux-kernel/infiniband/hw/mthca/mthca_main.c	(revision 3704)
+++ linux-kernel/infiniband/hw/mthca/mthca_main.c	(working copy)
@@ -162,9 +162,17 @@ static int __devinit mthca_dev_lim(struc
 	mdev->limits.pkey_table_len = dev_lim->max_pkeys;
 	mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay;
 	mdev->limits.max_sg = dev_lim->max_sg;
+	mdev->limits.max_wqes = dev_lim->max_qp_sz;
 	mdev->limits.reserved_qps = dev_lim->reserved_qps;
+	mdev->limits.max_srq_wqes = dev_lim->max_srq_sz;
 	mdev->limits.reserved_srqs = dev_lim->reserved_srqs;
 	mdev->limits.reserved_eecs = dev_lim->reserved_eecs;
+	/*
+	 * Subtract 1 from the limit because we need to allocate a
+	 * spare CQE so the HCA HW can tell the difference between an
+	 * empty CQ and a full CQ.
+	 */
+	mdev->limits.max_cqes = dev_lim->max_cq_sz - 1;
 	mdev->limits.reserved_cqs = dev_lim->reserved_cqs;
 	mdev->limits.reserved_eqs = dev_lim->reserved_eqs;
 	mdev->limits.reserved_mtts = dev_lim->reserved_mtts;
--- linux-kernel/infiniband/hw/mthca/mthca_provider.c	(revision 3704)
+++ linux-kernel/infiniband/hw/mthca/mthca_provider.c	(working copy)
@@ -90,14 +90,17 @@ static int mthca_query_device(struct ib_
 	props->max_mr_size = ~0ull;
 	props->max_qp = mdev->limits.num_qps - mdev->limits.reserved_qps;
-	props->max_qp_wr = 0xffff;
+	props->max_qp_wr = mdev->limits.max_wqes;
 	props->max_sge = mdev->limits.max_sg;
 	props->max_cq = mdev->limits.num_cqs - mdev->limits.reserved_cqs;
-	props->max_cqe = 0xffff;
+	props->max_cqe = mdev->limits.max_cqes;
 	props->max_mr = mdev->limits.num_mpts - mdev->limits.reserved_mrws;
 	props->max_pd = mdev->limits.num_pds - mdev->limits.reserved_pds;
 	props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift;
 	props->max_qp_init_rd_atom = 1 << mdev->qp_table.rdb_shift;
+	props->max_srq = mdev->limits.num_srqs - mdev->limits.reserved_srqs;
+	props->max_srq_wr = mdev->limits.max_srq_wqes;
+	props->max_srq_sge = mdev->limits.max_sg;
 	props->local_ca_ack_delay = mdev->limits.local_ca_ack_delay;
 	props->atomic_cap = mdev->limits.flags & DEV_LIM_FLAG_ATOMIC ?
 		IB_ATOMIC_HCA : IB_ATOMIC_NONE;
@@ -640,6 +643,9 @@ static struct ib_cq *mthca_create_cq(str
 	int nent;
 	int err;

+	if (entries < 1 || entries > to_mdev(ibdev)->limits.max_cqes)
+		return ERR_PTR(-EINVAL);
+
 	if (context) {
 		if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
 			return ERR_PTR(-EFAULT);
--- linux-kernel/infiniband/hw/mthca/mthca_cmd.c	(revision 3704)
+++ linux-kernel/infiniband/hw/mthca/mthca_cmd.c	(working copy)
@@ -933,9 +933,9 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev
 		goto out;

 	MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET);
-	dev_lim->max_srq_sz = 1 << field;
+	dev_lim->max_srq_sz = (1 << field) - 1;
 	MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET);
-	dev_lim->max_qp_sz = 1 << field;
+	dev_lim->max_qp_sz = (1 << field) - 1;
 	MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET);
 	dev_lim->reserved_qps = 1 << (field & 0xf);
 	MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET);
@@ -1045,6 +1045,8 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev
 		dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars);
 	mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n",
 		dev_lim->max_pds, dev_lim->reserved_mgms);
+	mthca_dbg(dev, "Max CQEs: %d, max WQEs: %d, max SRQ WQEs: %d\n",
+		dev_lim->max_cq_sz, dev_lim->max_qp_sz, dev_lim->max_srq_sz);

 	mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags);

--- linux-kernel/infiniband/hw/mthca/mthca_srq.c	(revision 3704)
+++ linux-kernel/infiniband/hw/mthca/mthca_srq.c	(working copy)
@@ -186,7 +186,8 @@ int mthca_alloc_srq(struct mthca_dev *de
 	int err;

 	/* Sanity check SRQ size before proceeding */
-	if (attr->max_wr > 16 << 20 || attr->max_sge > 64)
+	if (attr->max_wr > dev->limits.max_srq_wqes ||
+	    attr->max_sge > dev->limits.max_sg)
 		return -EINVAL;

 	srq->max = attr->max_wr;
--- linux-kernel/infiniband/hw/mthca/mthca_qp.c	(revision 3704)
+++ linux-kernel/infiniband/hw/mthca/mthca_qp.c	(working copy)
@@ -1112,8 +1112,10 @@ static int mthca_set_qp_size(struct mthc
 			     struct mthca_qp *qp)
 {
 	/* Sanity check QP size before proceeding */
-	if (cap->max_send_wr > 65536 || cap->max_recv_wr > 65536 ||
-	    cap->max_send_sge > 64 || cap->max_recv_sge > 64)
+	if (cap->max_send_wr > dev->limits.max_wqes ||
+	    cap->max_recv_wr > dev->limits.max_wqes ||
+	    cap->max_send_sge > dev->limits.max_sg ||
+	    cap->max_recv_sge > dev->limits.max_sg)
 		return -EINVAL;

 	if (mthca_is_memfree(dev)) {

From halr at voltaire.com Mon Oct 10 07:45:59 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Oct 2005 10:45:59 -0400
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To:
References:
Message-ID: <1128955559.4377.81.camel@hal.voltaire.com>

On Sun, 2005-10-09 at 10:19, Sean Hefty wrote:
> >I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ?
>
> I'm referring to the case that iWarp is running over TCP. I know that it can
> run over SCTP, but I'm not familiar with the details of that protocol. With
> TCP, this is an end-to-end connection, so layering iWarp over it, only the
> endpoints need to deal with it. I believe the same is true for SCTP.

Yes, SCTP is similar in those regards.

> >Doesn't a routing decision still need to be made at the IP layer ?
>
> Routing of the IP packets is done at the IP layer, but I don't see how this
> affects iWarp.

It does under the "covers", those covers being IP routing.

> >Doesn't the IP next hop need to be determined (e.g. gateway when the
> >destination is off the local IP subnet) ? Is there something that
> >precludes iWARP from working across IP subnets ?
>
> I can't think of anything that would preclude iWarp from working
> across subnets.

Doesn't the IP next hop need determining in that case ? Why is that not
relevant ? I don't think the iWARP connection is end to end in all
cases.
-- Hal

From halr at voltaire.com Mon Oct 10 07:56:49 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Oct 2005 10:56:49 -0400
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <1128877818.24182.54.camel@mail.es335.com>
References: <1128877818.24182.54.camel@mail.es335.com>
Message-ID: <1128956208.4377.103.camel@hal.voltaire.com>

Hi Tom,

On Sun, 2005-10-09 at 13:10, Tom Tucker wrote:
> On Sun, 2005-10-09 at 07:57 -0700, Sean Hefty wrote:
> > >It is theoretically possible to support all this on an IPoIB based
> > >network. Multiple subnets, multiple routes to remote peers, ICMP
> > >redirect, multiple IP addresses for each physical interface, yada yada
> > >yada. But IMHO, the only way to do this would be to tie directly into
> > >the existing routing, ARP, ICMP, etc... subsystems in Linux. Otherwise
> > >you'll end up recreating a gigantic (and I mean GIGANTIC) amount of
> >
> > The current implementation ties into the standard Linux ARP tables. If
> > connections were made over TCP/IP, using IPoIB, then I don't think that there
> > would be any issues. The issues only arise because of the desire to use TCP/IP
> > network addresses over a non-TCP/IP network.
> >
> > >code. This belief is why I've been a proponent of mapping GIDs to one
> > >and only one IP address and treating it for management purposes as the
> > >equivalent of an IP address. Without this, the whole mechanism for
> > >determining routes, etc.. breaks down. If you treat the GID like a MAC
> > >address -- it breaks, because a MAC address can have multiple IP
> > >addresses -- the observation that led to the conclusion that ATS was
> > >broken in the first place.
> >
> > We should be able to handle the case where a GID has multiple IP addresses bound
> > to it. But even if we added a 1:1 restriction, the connection over IB issue
> > still exists.
>
> I agree, except for RARP.

Not sure what you mean "except for RARP". Can you elaborate ?

[snip...]

> > I
> > don't view a GID as an IP address because we're not sending and receiving IP
> > packets on the GID. IPoIB treats GIDs as only part of a MAC address, which I
> > think is the proper view.
> >
> > Anyway, returning back to the original problem of connecting to an IB gateway if
> > given a destination IP address on a different subnet... I'm slowly convincing
> > myself that either the CMA or AT should do this. (I believe that the ib_addr
> > code will do this now, but still wasn't sure that we wanted it to.)
> >
>
> IMHO, you need a service separate from the CMA to do address
> translation. My (iWARP's) rationale for this is that there are two
> clients of the service, the CM and IP. For CM, you need it to elect a
> route and thereby a local interface. For IP you need it because routes
> change and ARP entries time out.
>
> BTW, can you educate me ... is the following what you're thinking:
>
> On the client side...
>
> - route is discovered by looking at the Linux routing table
> - local interface is IPoIB (looks at rdma_ptr embedded in netdev struct)
> - send ARP AT message over local IB interface

It's just a normal IPoIB ARP to the destination IP address initiated by
AT. (With ATS, it could have been an SA Get ServiceRecord as an
alternative).

I think the current CMA code handles client above and server but not
(bridging) gateway below.

> At the gateway...bridging to IP
> - ARP AT query received on IB interface
> - Lookup route to destination IP address in gateway's route table.
> - If next hop's Ethernet address is already known, it is returned
                  ^^^^^^^^
                  hardware (may not be ethernet)
> - Otherwise, local interface identified is IPoEthernet
> - New ARP query goes out on the local interface from the route
> - When response comes back, answer is returned.
> At the gateway...bridging to IPoIB
>
> - ARP AT message received on IB interface, delivered to AT
> - Lookup route to destination IP address in gateway's route table
> - If next hop's Ethernet address is already known, it is returned
> - otherwise, local interface identified in route is IPoIB
> - New ARP AT query goes out on the local interface
> - When response comes back, answer is returned.

-- Hal

From halr at voltaire.com Mon Oct 10 08:03:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Oct 2005 11:03:24 -0400
Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey
In-Reply-To: <52r7ayoa9l.fsf@cisco.com>
References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com>
Message-ID: <1128956603.4377.112.camel@hal.voltaire.com>

On Thu, 2005-10-06 at 12:55, Roland Dreier wrote:
> Did we ever figure out how to handle the hotplug issues with the
> lifetime of the struct ib_device pointer? Right now this API is
> unsafe, because a caller can get a pointer to a device that has
> already disappeared.

I think this can be handled as follows:

The netdev references would be maintained for the duration of each AT
call until it completes/times out. If subsequent calls are made based on
an ib_device which has been removed, an error could be returned, because
AT maintains a list of devices and validates the supplied device
against its list.

ipoib_get_info() would be called only with a valid device and the caller
holding a netdev reference for at least the duration of that call.

> Also if we do decide to add an API like this, the struct ipoib_info
> and ipoib_get_info() declarations should be in
> rather than in the private ipoib.h header.

OK.
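
The "validate the supplied device against its list" idea can be sketched in plain C as follows. This is only a minimal user-space illustration, assuming a small fixed-size table; `example_device`, `example_registry` and the helper names are hypothetical stand-ins and are not taken from AT or IPoIB. Locking is left out, and a real implementation would also have to deal with a removed device's pointer value being reused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_MAX_DEVICES 8

/* Hypothetical stand-in for struct ib_device. */
struct example_device {
	int id;
};

/* Hypothetical list of devices the translation service currently knows. */
struct example_registry {
	const struct example_device *devs[EXAMPLE_MAX_DEVICES];
};

/* Record a device when it is added; returns false if the table is full. */
static bool example_registry_add(struct example_registry *reg,
				 const struct example_device *dev)
{
	assert(reg != NULL && dev != NULL);
	for (size_t i = 0; i < EXAMPLE_MAX_DEVICES; i++) {
		if (reg->devs[i] == NULL) {
			reg->devs[i] = dev;
			return true;
		}
	}
	return false;
}

/* Forget a device when it is removed (hot-unplug). */
static void example_registry_remove(struct example_registry *reg,
				    const struct example_device *dev)
{
	for (size_t i = 0; i < EXAMPLE_MAX_DEVICES; i++) {
		if (reg->devs[i] == dev)
			reg->devs[i] = NULL;
	}
}

/*
 * Validate a caller-supplied device before using it.
 * Returns 0 if the device is still known, -1 (standing in for an
 * error such as -ENODEV) if it has been removed or was never added.
 */
static int example_check_device(const struct example_registry *reg,
				const struct example_device *dev)
{
	for (size_t i = 0; i < EXAMPLE_MAX_DEVICES; i++) {
		if (dev != NULL && reg->devs[i] == dev)
			return 0;
	}
	return -1;
}
```

A caller would run `example_check_device()` at the start of each request and fail the request early when it returns -1, instead of dereferencing a pointer to a device that has gone away.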
-- Hal

From caitlin.bestler at gmail.com Mon Oct 10 08:47:27 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Mon, 10 Oct 2005 08:47:27 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To:
References: <1128730364.4382.11557.camel@hal.voltaire.com>
Message-ID: <469958e00510100847v53bbc1baq726a3bf0e9561d90@mail.gmail.com>

On 10/9/05, Sean Hefty wrote:
>
> >I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ?
>
> I'm referring to the case that iWarp is running over TCP. I know that it can
> run over SCTP, but I'm not familiar with the details of that protocol. With
> TCP, this is an end-to-end connection, so layering iWarp over it, only the
> endpoints need to deal with it. I believe the same is true for SCTP.

The main impact of SCTP is that even the IP address can change under the
covers. So not only is there routing that is transparent to the RDMA
consumer, there is also selection of source/destination IP addresses.

From caitlin.bestler at gmail.com Mon Oct 10 08:50:59 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Mon, 10 Oct 2005 08:50:59 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <1128955559.4377.81.camel@hal.voltaire.com>
References: <1128955559.4377.81.camel@hal.voltaire.com>
Message-ID: <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com>

On 10 Oct 2005 10:45:59 -0400, Hal Rosenstock wrote:
>
> On Sun, 2005-10-09 at 10:19, Sean Hefty wrote:
> > >I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ?
> >
> > I'm referring to the case that iWarp is running over TCP. I know that it can
> > run over SCTP, but I'm not familiar with the details of that protocol. With
> > TCP, this is an end-to-end connection, so layering iWarp over it, only the
> > endpoints need to deal with it. I believe the same is true for SCTP.
>
> Yes, SCTP is similar in those regards.
>
> > >Doesn't a routing decision still need to be made at the IP layer ?
> >
> > Routing of the IP packets is done at the IP layer, but I don't see how this
> > affects iWarp.
>
> It does under the "covers", those covers being IP routing.
>
> > >Doesn't the IP next hop need to be determined (e.g. gateway when the
> > >destination is off the local IP subnet) ? Is there something that
> > >precludes iWARP from working across IP subnets ?
> >
> > I can't think of anything that would preclude iWarp from working
> > across subnets.
>
> Doesn't the IP next hop need determining in that case ? Why is that not
> relevant ? I don't think the iWARP connection is end to end in all
> cases.

Of course it's end to end. It's just that only the end points
understand that it is an iWARP connection.

Or more properly, the underlying transport (or "LLP") connections
are end to end, but the iWARP semantics exist only in the RDMA endpoints.

That is why iWARP works across multiple subnets. We've actually
done true worldwide connections. The existing IP network carries
the iWARP traffic because it is indeed just TCP traffic to the
intermediate network.

From robert.j.woodruff at intel.com Mon Oct 10 08:51:37 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 10 Oct 2005 08:51:37 -0700
Subject: [openib-general] Lustre Network Driver - KDAPL or verbs?
Message-ID: <1AC79F16F5C5284499BB9591B33D6F0005C505A8@orsmsx408>

Peter Braam wrote,
> Cluster File Systems, Inc and its customers have been wondering if the Lustre
> Network Driver (LND) for OpenIb gen2, which we will begin to develop during
> the coming months, should be based on kdapl or verbs.

>The driver we plan to develop should strive to address several goals:
> - high reliability and performance
> - allow interoperability between user and kernel level
> - allow interoperability, or better, portability among different operating systems (Linux, OS X, Windows, Solaris)
> - be suitable for inclusion in the Linux kernel

> We are keen to hear some opinions!

For Linux, I would target Sean's new CMA for connection establishment
and then the current IB verbs which are being modified to support both
iWarp and IB.

my 2 cents,

woody

From halr at voltaire.com Mon Oct 10 09:08:09 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Oct 2005 12:08:09 -0400
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com>
References: <1128955559.4377.81.camel@hal.voltaire.com> <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com>
Message-ID: <1128960287.4377.378.camel@hal.voltaire.com>

On Mon, 2005-10-10 at 11:50, Caitlin Bestler wrote:
> Doesn't the IP next hop need determining in that case ? Why is that not
> relevant ? I don't think the iWARP connection is end to end in all
> cases.
>
>
> Of course it's end to end. It's just that only the end points
> understand that it is an iWARP connection.

What about the case of iWARP <-> IB ?

> Or more properly, the underlying transport (or "LLP") connections
> are end to end, but the iWARP semantics exist only in the RDMA
> endpoints.
>
> That is why iWARP works across multiple subnets.
                                          ^^^^^^^
                                          IP subnets
> We've actually
> done true worldwide connections. The existing IP network carries
> the iWARP traffic because it is indeed just TCP traffic to the
> intermediate network.
-- Hal

From rolandd at cisco.com Mon Oct 10 09:22:34 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 10 Oct 2005 09:22:34 -0700
Subject: [openib-general] IRQ sharing on PCIe bus
In-Reply-To: (yipee's message of "Mon, 10 Oct 2005 08:28:06 +0000 (UTC)")
References:
Message-ID: <52d5mdibp1.fsf@cisco.com>

    yipee> Hi, My setup is a 3GHz Xeon (x86_64) with a 2.6.13.2
    yipee> kernel. A Mellanox memfree PCIe ddr HCA is connected. Why
    yipee> do I see IRQ sharing although I'm using msi_x and PCIe?
    yipee> Doesn't IRQ sharing only happen on older non PCIe busses?

I think the messages you see are coming from the ACPI interrupt
routing that is done when the driver calls pci_enable_device().
However, if you use MSI-X then that interrupt won't actually be used.
If you check /proc/interrupts you should see ib_mthca using 3
non-shared interrupts.

BTW, for "INTx emulation" on PCI Express, there are no physical
interrupt lines -- interrupts are asserted and deasserted with
messages. So PCI Express interrupts are unshared. However, the PCI
Express host bridge turns those interrupts into real interrupts to the
system's interrupt controller, and for that part of the story, it's
entirely possible for two different PCI Express devices to end up
sharing the same interrupt line.

 - R.

From mshefty at ichips.intel.com Mon Oct 10 10:21:02 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 10 Oct 2005 10:21:02 -0700
Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module
In-Reply-To: <20051010135723.GT21551@mellanox.co.il>
References: <20051010135723.GT21551@mellanox.co.il>
Message-ID: <434AA2FE.6000702@ichips.intel.com>

Thanks for the feedback. See below.

Michael S. Tsirkin wrote:
> Wouldn't it be a good idea to start names with rdma_cm
> or rdma_cma or something like that?
> For example, rdma_event_type is a bit confusing since this actually only
> includes CM events. Similar comments apply to other names.

I had that originally, but changed it. I figured that names like rdma_connect()
and rdma_listen() were clear enough that they were for connection management.

>>+struct rdma_id;
>
> I propose renaming this to rdma_connection or something
> else more specific than just "id". Makes sense?

I can change this to rdma_cm_id or rdma_cma or something else...

>>+int rdma_resolve_route(struct rdma_id *id, int timeout_ms);
>
> Not sure I understand what this does, since the only extra parameter is
> timeout_ms.

For IB, this results in a path record query based on the GIDs that were set with
the rdma_id from rdma_resolve_addr(). The GIDs are in
rdma_id.route.addr.ibaddr. The output is saved to rdma_id.route.path_rec. My
intent is to make this call optional in the future.

>>+int rdma_create_qp(struct rdma_id *id, struct ib_pd *pd,
>>+		   struct ib_qp_init_attr *qp_init_attr);
>>+
>>+void rdma_destroy_qp(struct rdma_id *id);
>
> Not sure what the intended usage is.
> When does the user need to call this?

The CMA needs to associate a QP with the rdma_id, and CMA will transition the QP
through its connection states. The rdma_create_qp() is called to allocate a QP
and transition it to the INIT state, so users can post receives to the QP. The
destroy call is a pass-through call provided simply for symmetry.

>>+#include
>>+#include
>>+#include
>>+#include
>>+#include
>>+#include
>>+#include
>>+#include
>>+#include
>>+#include
>
> Are all of these headers really needed?
> For example, I don't see arp.h used anywhere.
> Am I missing something?

They were needed at one point, but might not all be needed now. I will see
which ones can be removed. Some were only needed for address translation, which
was originally part of this file while I worked out its API.

> What about replacing switch with one case statements with if statements.
> Like this:
>
>	if (id->device->node_type == IB_NODE_CA)
>		ret = cma_init_ib_qp(id_priv, qp);
>	else
>		ret = -ENOSYS;

I tried to make it easy to modify the code to support iWarp, or some other RDMA
device. I'd prefer to leave these checks as switch statements for that reason,
or just remove them completely.

> I also wonder why do we really need all these node_type checks.
> The code above seems to imply that rdma_create_qp will fail
> on non-CA. Why is that?

The code doesn't set the right parameters to INIT for an iWarp QP.

>>+static inline void cma_deref_dev(struct rdma_id_private *id_priv)
>>+{
>>+//	if (atomic_dec_and_test(&id_priv->dev_remove))
>>+//		wake_up(&id_priv->wait);
>>+//	return atomic_dec_and_test(&id_priv->dev_remove) ?
>>+//	       cma_notify_user(id_priv, RDMA_EVENT_DEVICE_REMOVAL, -ENODEV,
>>+//			       NULL, 0) : 0;
>>+}
>
>
> The above seems to need some cleanup.

This has been cleaned up in my latest version. It was part of the initial
device removal handling code that didn't work. I decided to just try to get
connection establishment working, and then come back to fix device removal.

- Sean

From mshefty at ichips.intel.com Mon Oct 10 10:36:35 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 10 Oct 2005 10:36:35 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <1128877818.24182.54.camel@mail.es335.com>
References: <1128877818.24182.54.camel@mail.es335.com>
Message-ID: <434AA6A3.5090504@ichips.intel.com>

Tom Tucker wrote:
>>Again, I don't think that the binding is the issue, so much as the desire to use
>>an address for a protocol that isn't actually being used for communication.
>
> Not to be pedantic, but if binding or mapping or somesuch weren't an
> issue we wouldn't need AT.

We need AT because we're not using network addresses. If a client used an IP
address and ran over IP, we wouldn't need to do anything special.

> IMHO, you need a service separate from the CMA to do address
> translation. My (iWARP's) rationale for this is that there are two
> clients of the service, the CM and IP. For CM, you need it to elect a
> route and thereby a local interface. For IP you need it because routes
> change and ARP entries time out.

The connection management and address translation are separate services, with
the CMA calling the address translation for the user. You may want to look at
ib_addr for details on how the address translation works.

> - route is discovered by looking at the Linux routing table
    ^^^^^
address mapping from IP to GID/Pkey.

> - local interface is IPoIB (looks at rdma_ptr embedded in netdev struct)

The address translation looks only at the hardware and broadcast addresses. No
additional rdma_ptr is needed with ib_addr.

> - send ARP AT message over local IB interface

It sends a normal IP ARP to get the remote hardware address, which contains the
destination GID. An ARP is sent only if the mapping isn't available in the
local ARP table.

At this point, the client has the SGID, DGID, and PKey. It then issues a path
record query to obtain the "route" to the destination. The CMA doesn't really
care if that destination is the actual destination or some gateway.

- Sean

From mshefty at ichips.intel.com Mon Oct 10 10:40:16 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 10 Oct 2005 10:40:16 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <1128960287.4377.378.camel@hal.voltaire.com>
References: <1128955559.4377.81.camel@hal.voltaire.com> <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com> <1128960287.4377.378.camel@hal.voltaire.com>
Message-ID: <434AA780.60808@ichips.intel.com>

Hal Rosenstock wrote:
> What about the case of iWARP <-> IB ?

Crossing IB shouldn't matter. iWarp should simply cross the IB subnet using
IPoIB. You could build a gateway to make the transfer across IB more efficient,
but it's not required.
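
As an illustration of the ARP-based translation discussed in this thread, where the hardware address learned through ARP on an IPoIB interface carries the destination GID, here is a minimal sketch in plain C. It assumes the 20-byte IPoIB link-layer address layout (one flags byte and a 3-byte QPN, followed by the 16-byte GID); `example_gid` and both helper functions are hypothetical names for illustration and are not part of ib_addr.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_IPOIB_HW_ADDR_LEN 20	/* assumed IPoIB hardware address length */
#define EXAMPLE_IPOIB_GID_OFFSET   4	/* flags (1 byte) + QPN (3 bytes) come first */
#define EXAMPLE_GID_LEN           16

/* Hypothetical stand-in for union ib_gid. */
struct example_gid {
	uint8_t raw[EXAMPLE_GID_LEN];
};

/* Copy the 16-byte GID out of a 20-byte IPoIB hardware address. */
static void example_hw_addr_to_gid(const uint8_t *hw_addr,
				   struct example_gid *gid)
{
	assert(hw_addr != NULL && gid != NULL);
	memcpy(gid->raw, hw_addr + EXAMPLE_IPOIB_GID_OFFSET, EXAMPLE_GID_LEN);
}

/* Extract the 24-bit QPN held in bytes 1..3 of the hardware address. */
static uint32_t example_hw_addr_to_qpn(const uint8_t *hw_addr)
{
	assert(hw_addr != NULL);
	return ((uint32_t)hw_addr[1] << 16) |
	       ((uint32_t)hw_addr[2] << 8) |
	        (uint32_t)hw_addr[3];
}
```

Applying the same extraction to the local interface's own hardware address would give the source GID, and to the neighbour entry's address the destination GID; the P_Key comes from the interface itself, not from the address bytes.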
- Sean

From mshefty at ichips.intel.com Mon Oct 10 10:59:51 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 10 Oct 2005 10:59:51 -0700
Subject: [openib-general] [PATCH] [ADDR] address translation module for CMA
In-Reply-To:
References:
Message-ID: <434AAC17.1010709@ichips.intel.com>

Sean Hefty wrote:
> The following patch adds a simple IP to IB address translation module
> using ARP. It is based off AT and SDP, but kept as simple as possible.
>
> I would like to merge this back into the trunk, and apply other changes
> there.

I didn't see any objections, so I have committed this to the trunk as part of
the core software.

- Sean

From mshefty at ichips.intel.com Mon Oct 10 11:01:03 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 10 Oct 2005 11:01:03 -0700
Subject: [openib-general] [PATCH] [CMA] RDMA CM abstraction module
In-Reply-To:
References:
Message-ID: <434AAC5F.70301@ichips.intel.com>

Sean Hefty wrote:
> The following patch adds in a basic RDMA connection management abstraction.
> It is functional, but needs additional work for handling device removal, plus
> several missing features.
>
> I'd like to merge this back into the trunk, and continue working on it from
> there.

I didn't see any objections, so I have merged this into the trunk. Changes were
made from the original patch based on Michael's feedback, and device removal
handling was added.

- Sean

From mst at mellanox.co.il Mon Oct 10 11:07:50 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 10 Oct 2005 20:07:50 +0200
Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module
In-Reply-To: <434AA2FE.6000702@ichips.intel.com>
References: <434AA2FE.6000702@ichips.intel.com>
Message-ID: <20051010180750.GA5916@mellanox.co.il>

Quoting Sean Hefty :
> > Wouldn't it be a good idea to start names with rdma_cm
> > or rdma_cma or something like that?
> > For example, rdma_event_type is a bit confusing since this actually only
> > includes CM events. Similar comments apply to other names.
>
> I had that originally, but changed it. I figured that names like rdma_connect()
> and rdma_listen() were clear enough that they were for connection management.

Yes, fine, but names like rdma_event_type probably do need the prefix,
don't they?

> >>+struct rdma_id;
> >
> > I propose renaming this to rdma_connection or something
> > else more specific than just "id". Makes sense?
>
> I can change this to rdma_cm_id or rdma_cma or something else...

Maybe rdma_connection (these things encapsulate connection state)?
Or, rdma_sock or rdma_socket, since people are used to the fact that connections
are sockets?

> >>+int rdma_resolve_route(struct rdma_id *id, int timeout_ms);
> >
> > Not sure I understand what this does, since the only extra parameter is
> > timeout_ms.
>
> For IB, this results in a path record query based on the GIDs that were set with
> the rdma_id from rdma_resolve_addr(). The GIDs are in
> rdma_id.route.addr.ibaddr. The output is saved to rdma_id.route.path_rec. My
> intent is to make this call optional in the future.

I was trying to say, why doesn't rdma_connect just do this
transparently? Why do we need a separate call?

> >>+int rdma_create_qp(struct rdma_id *id, struct ib_pd *pd,
> >>+		   struct ib_qp_init_attr *qp_init_attr);
> >>+
> >>+void rdma_destroy_qp(struct rdma_id *id);
> >
> > Not sure what the intended usage is.
> > When does the user need to call this?
>
> The CMA needs to associate a QP with the rdma_id, and CMA will transition the QP
> through its connection states. The rdma_create_qp() is called to allocate a QP
> and transition it to the INIT state, so users can post receives to the QP. The
> destroy call is a pass-through call provided simply for symmetry.

What happens on the passive side?
May we need more than one qp per rdma_id?
Or is a new id created each time a connection request arrives?

-- 
MST

From mshefty at ichips.intel.com Mon Oct 10 11:15:57 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 10 Oct 2005 11:15:57 -0700
Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module
In-Reply-To: <20051010180750.GA5916@mellanox.co.il>
References: <434AA2FE.6000702@ichips.intel.com> <20051010180750.GA5916@mellanox.co.il>
Message-ID: <434AAFDD.90208@ichips.intel.com>

Michael S. Tsirkin wrote:
> Yes, fine, but names like rdma_event_type probably do need the prefix,
> don't they?

I'll fix this.

> Maybe rdma_connection (these things encapsulate connection state)?
> Or, rdma_sock or rdma_socket, since people are used to the fact that connections
> are sockets?

Any objection to rdma_socket?

>>>>+int rdma_resolve_route(struct rdma_id *id, int timeout_ms);
>
> I was trying to say, why doesn't rdma_connect just do this
> transparently? Why do we need a separate call?

Eventually rdma_connect will call this for the user if a route hasn't been
resolved. At some point though, the API will likely need to be expanded to
specify some sort of quality of service.

> What happens on the passive side?
> May we need more than one qp per rdma_id?
> Or is a new id created each time a connection request arrives?

A new identifier is created each time a connection request arrives. The goal is
to support a single listen across multiple devices, so listen id's will not
necessarily be bound to an ib_device. The new id will be bound to the device
that the connection request was received on.
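
The passive-side behaviour described here (one listen id that is not tied to a device, and a fresh id per incoming request bound to the receiving device) can be modelled with a small sketch. This is a toy model under stated assumptions, not the CMA implementation: `example_cm_id`, `example_listen()` and `example_accept_request()` are hypothetical names, and a device is represented by a plain integer index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_NO_DEVICE (-1)	/* marks an id that is not bound to a device */

/* Hypothetical, simplified connection identifier. */
struct example_cm_id {
	int device;	/* index of the bound device, or EXAMPLE_NO_DEVICE */
	bool listening;	/* true only for the listen id */
};

/* Create a listen id that is deliberately left unbound. */
static struct example_cm_id example_listen(void)
{
	struct example_cm_id id = { EXAMPLE_NO_DEVICE, true };
	return id;
}

/*
 * Handle a connection request that arrived on rx_device.
 * The listen id is left untouched; a new id is returned that is
 * bound to the device the request was received on.
 */
static struct example_cm_id
example_accept_request(const struct example_cm_id *listen_id, int rx_device)
{
	struct example_cm_id conn_id;

	assert(listen_id != NULL && listen_id->listening);
	conn_id.device = rx_device;
	conn_id.listening = false;
	return conn_id;
}
```

In this model one QP per id is enough on the passive side, because every connection request gets its own id and therefore its own place to attach a QP.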
- Sean

From rolandd at cisco.com Mon Oct 10 11:23:45 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 10 Oct 2005 11:23:45 -0700
Subject: [openib-general] Timeline of IPoIB performance
In-Reply-To: <1128738350.13945.369.camel@localhost> (Matt Leininger's message of "Fri, 07 Oct 2005 19:25:49 -0700")
References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost>
Message-ID: <521x2tgrim.fsf@cisco.com>

 > 2.6.12-rc5 in-kernel 1 405 <<<<<
 > 2.6.12-rc4 in-kernel 1 470 <<<<<

I was optimistic when I saw this, because the changeover to git
occurred with 2.6.12-rc2, so I thought I could use git bisect to track
down exactly when the performance regression happened.

However, I haven't been able to get numbers that are stable enough to
track this down. I have two systems, both HP DL145s with dual Opteron
875s and two-port mem-free PCI Express HCAs. I use MSI-X with the
completion interrupt affinity set to CPU 0, and "taskset 2" to run
netserver and netperf on CPU 1. With default netperf parameters (just
"-H otherguy") I get numbers between ~490 MB/sec and ~550 MB/sec for
2.6.12-rc4 and 2.6.12-rc5.

The numbers are quite consistent between reboots, but if I reboot the
system (even keeping the kernel identical), I see large performance
changes. Presumably something is happening like the cache coloring of
some hot data structures changing semi-randomly depending on the
timing of various initializations.

Matt, how stable are your numbers?

 - R.
From tom at ammasso.com Mon Oct 10 11:30:53 2005 From: tom at ammasso.com (Tom Tucker) Date: Mon, 10 Oct 2005 14:30:53 -0400 Subject: [openib-general] [RFC] IB address translation using ARP Message-ID: <8E9D028761D8264D910612167E8457E801195C45@mail2.ammasso.com> > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Monday, October 10, 2005 12:37 PM > To: Tom Tucker > Cc: Sean Hefty; Openib > Subject: Re: [openib-general] [RFC] IB address translation using ARP > > Tom Tucker wrote: > >>Again, I don't think that the binding is the issue, so much > as the desire to use > >>an address for a protocol that isn't actually being used > for communication. > > > > Not to be pedantic, but if binding or mapping or somesuch weren't an > > issue we wouldn't need AT. > > We need AT because we're not using network addresses. If a > client used an IP > address and ran over IP, we wouldn't need to do anything special. agreed. > > > IMHO, you need a service separate from the CMA to do address > > translation. My (iWARP's) rationale for this is that there are two > > clients of the service, the CM and IP. For CM, you need it > to elect a > > route and thereby a local interface. For IP you need it > because routes > > change and ARP entries time out. > > The connection management and address translation are > separate services, with > the CMA calling the address translation for the user. You > may want to look at > ib_addr for details on how the address translation works. Very cool. I've applied the patch and will take a look. > > > - route is discovered by looking at the Linux routing table > ^^^^^ > address mapping from IP to GID/Pkey. I think I understand where I'm upside down now. In my world, you don't know which interface to send the ARP request on until you've identified the local interface and you can't identify the local interface until you've looked up the route. Not all interface have a path to all remote peers. 
In your world, you can't look up the path record until you've identified the remote GID. What I don't get is, if you have more than one IB interface, which interface do you submit your IPoIB ARP request on? All of them? > > > - local interface is IPoIB (looks at rdma_ptr embedded in > netdev struct) > The address translation looks only at the hardware and > broadcast addresses. No > additional rdma_ptr is needed with ib_addr. > Cool, I must have misunderstood an earlier discussion. > > - send ARP AT message over local IB interface > It sends a normal IP ARP to get the remove hardware address, > which contains the > destination GID. An ARP is sent only if the mapping isn't > available in the > local ARP table. Not sure what a "normal IP ARP" message is. In my world, ARP and IP are peer protocols. ARP does not sit on top of IP, nor is it a special kind of IP message. Forgive my ignorance, but does IPoIB have ARP built into it? But regardless, how do you know which local interface to send the IP ARP message on? > > At this point, the client has the SGID, DGID, and PKey. It > then issues a path > record query to obtain the "route" to the destination. The > CMA doesn't really > care if that destination is the actual destination or some gateway. Thanks for the clarifications. > > - Sean > From mshefty at ichips.intel.com Mon Oct 10 11:43:51 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Oct 2005 11:43:51 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <8E9D028761D8264D910612167E8457E801195C45@mail2.ammasso.com> References: <8E9D028761D8264D910612167E8457E801195C45@mail2.ammasso.com> Message-ID: <434AB667.8060707@ichips.intel.com> Tom Tucker wrote: > I think I understand where I'm upside down now. In my world, > you don't know which interface to send the ARP request on > until you've identified the local interface and you can't > identify the local interface until you've looked up the route. 
> Not all interface have a path to all remote peers. We have the same restriction. I lookup the route based on the destination IP address to get the local interface. > In your world, you can't look up the path record until you've > identified the remote GID. What I don't get is, if you have more > than one IB interface, which interface do you submit your IPoIB ARP > request on? All of them? It's based on the device returned by the route lookup. I've attached the relevant code portion below. If the code below fails, I generate an ARP, wait for the reply, then re-execute the code. > Not sure what a "normal IP ARP" message is. In my world, ARP and > IP are peer protocols. ARP does not sit on top of IP, nor is it a > special kind of IP message. Forgive my ignorance, but does IPoIB > have ARP built into it? I was being confusing. The ARP is sent on the IPoIB net_device to map an IP address to the remote hardware address. There's nothing special about the ARP. - Sean

static int addr_resolve_remote(struct sockaddr_in *src_in,
			       struct sockaddr_in *dst_in,
			       struct ib_addr *addr)
{
	u32 src_ip = src_in->sin_addr.s_addr;
	u32 dst_ip = dst_in->sin_addr.s_addr;
	struct flowi fl;
	struct rtable *rt;
	struct neighbour *neigh;
	int ret;

	memset(&fl, 0, sizeof fl);
	fl.nl_u.ip4_u.daddr = dst_ip;
	fl.nl_u.ip4_u.saddr = src_ip;
	ret = ip_route_output_key(&rt, &fl);
	if (ret)
		goto out;

	neigh = neigh_lookup(&arp_tbl, &dst_ip, rt->idev->dev);
	if (!neigh) {
		ret = -ENODATA;
		goto err1;
	}

	if (!(neigh->nud_state & NUD_VALID)) {
		ret = -ENODATA;
		goto err2;
	}

	if (!src_ip) {
		src_in->sin_family = dst_in->sin_family;
		src_in->sin_addr.s_addr = rt->rt_src;
	}

	addr->sgid = *(union ib_gid *) (neigh->dev->dev_addr + 4);
	addr->dgid = *(union ib_gid *) (neigh->ha + 4);
	addr->pkey = addr_get_pkey(neigh->dev);
err2:
	neigh_release(neigh);
err1:
	ip_rt_put(rt);
out:
	return ret;
}

static void addr_send_arp(struct sockaddr_in *dst_in)
{
	struct rtable *rt;
	struct flowi fl;
	u32 dst_ip = dst_in->sin_addr.s_addr;

	memset(&fl, 0, sizeof fl);
	fl.nl_u.ip4_u.daddr = dst_ip;
	if (ip_route_output_key(&rt, &fl))
		return;

	arp_send(ARPOP_REQUEST, ETH_P_ARP, dst_ip, rt->idev->dev,
		 rt->rt_src, NULL, rt->idev->dev->dev_addr, NULL);
	ip_rt_put(rt);
}

From mshefty at ichips.intel.com Mon Oct 10 11:46:30 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Oct 2005 11:46:30 -0700 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <434AAFDD.90208@ichips.intel.com> References: <434AA2FE.6000702@ichips.intel.com> <20051010180750.GA5916@mellanox.co.il> <434AAFDD.90208@ichips.intel.com> Message-ID: <434AB706.8040101@ichips.intel.com> Sean Hefty wrote: > Michael S. Tsirkin wrote: > >> Yes, fine, but names like rdma_event_type probably do need the prefix, >> dont they? > > I'll fix this. I've just committed a patch to rename the prefix. - Sean From ftillier at silverstorm.com Mon Oct 10 12:13:15 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Mon, 10 Oct 2005 12:13:15 -0700 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <434AAFDD.90208@ichips.intel.com> Message-ID: <000601c5cdce$abf93440$9e5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Monday, October 10, 2005 11:16 AM > > Michael S. Tsirkin wrote: > > Maybe rdma_connection (these things encapsulate connectin state)? > > Or, rdma_sock or rdma_socket, since people are used to the fact that > > connections are sockets? > > Any objection to rdma_socket? I don't like rdma_socket, since you can't actually perform any I/O operations on the rdma_socket, unlike normal sockets. We're dealing only with the connection part of the problem, and the name should reflect that. So rdma_connection, rdma_conn, or rdma_cid seem more appropriate. - Fab From mst at mellanox.co.il Mon Oct 10 13:00:42 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 10 Oct 2005 22:00:42 +0200 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <434AAFDD.90208@ichips.intel.com> References: <434AAFDD.90208@ichips.intel.com> Message-ID: <20051010200042.GB6633@mellanox.co.il> Quoting Sean Hefty : > > Maybe rdma_connection (these things encapsulate connectin state)? > > Or, rdma_sock or rdma_socket, since people are used to the fact that connections > > are sockets? > > Any objection to rdma_socket? Fine with me, this makes the intent of bind/listen explicit. > >>>>+int rdma_resolve_route(struct rdma_id *id, int timeout_ms); > > > > I was trying to say, why doesnt rdma_connect just do this > > transparently? Why do we need a separate call? > > Eventually rdma_connect will call this for the user if a route hasn't been > resolved. At some point though, the API will likely need to be expanded to > specify some sort of quality of service. I thought that would also happen at connect time. No? -- MST From krause at cup.hp.com Mon Oct 10 12:53:29 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 10 Oct 2005 12:53:29 -0700 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <000601c5cdce$abf93440$9e5aa8c0@infiniconsys.com> References: <434AAFDD.90208@ichips.intel.com> <000601c5cdce$abf93440$9e5aa8c0@infiniconsys.com> Message-ID: <6.2.0.14.2.20051010125123.0238b4d0@esmail.cup.hp.com> At 12:13 PM 10/10/2005, Fab Tillier wrote: > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > Sent: Monday, October 10, 2005 11:16 AM > > > > Michael S. Tsirkin wrote: > > > Maybe rdma_connection (these things encapsulate connectin state)? > > > Or, rdma_sock or rdma_socket, since people are used to the fact that > > > connections are sockets? > > > > Any objection to rdma_socket? > >I don't like rdma_socket, since you can't actually perform any I/O >operations on >the rdma_socket, unlike normal sockets. 
We're dealing only with the >connection >part of the problem, and the name should reflect that. So rdma_connection, >rdma_conn, or rdma_cid seem more appropriate. Naming should not involve sockets as that is part of existing standards. There are also the new standard Sockets extension API available today that might be extended sometime in the future to include explicit RDMA support should people decide to bypass SDP and go straight to a more robust API definition. The Sockets Extensions already comprehend explicit memory management, async comms, etc. making a significant improvement over the existing sync Sockets as well as going further in solving areas like memory management beyond what was done in Winsocks. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Mon Oct 10 13:03:22 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Oct 2005 22:03:22 +0200 Subject: [openib-general] Re: Timeline of IPoIB performance In-Reply-To: <521x2tgrim.fsf@cisco.com> References: <521x2tgrim.fsf@cisco.com> Message-ID: <20051010200321.GC6633@mellanox.co.il> Hi Roland, Quoting r. Roland Dreier : > However, I haven't been able to get numbers that are stable enough to > track this down. Disabling irq balancing sometimes helps me make the numbers more stable. Hope this helps, -- MST From krause at cup.hp.com Mon Oct 10 12:56:19 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 10 Oct 2005 12:56:19 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <434AA780.60808@ichips.intel.com> References: <1128955559.4377.81.camel@hal.voltaire.com> <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com> <1128960287.4377.378.camel@hal.voltaire.com> <434AA780.60808@ichips.intel.com> Message-ID: <6.2.0.14.2.20051010125333.0259c078@esmail.cup.hp.com> At 10:40 AM 10/10/2005, Sean Hefty wrote: >Hal Rosenstock wrote: >>What about the case of iWARP <-> IB ? 
> >Crossing IB shouldn't matter. iWarp should simply cross the IB subnet >using IPoIB. You could build a gateway to make the transfer across IB >more efficient, but it's not required. I don't understand this statement. iWARP is RDMA based and if someone wanted to build a gateway with IB in between, it should be mapped to an IB RC connection 1:1. Going through IPoIB is a waste and would result in a very poor performing solution (not that such a solution would deliver stellar performance to start with. Prior similar solutions used ULP over IB and the gateway then provided ULP over TOE and would then be easily extended to do iWARP. In general, you would want to have defined domains for each interconnect and not try to add poor ROI superset functionality of one over the other - waste of time and money. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From krause at cup.hp.com Mon Oct 10 12:50:59 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 10 Oct 2005 12:50:59 -0700 Subject: [openib-general] IRQ sharing on PCIe bus In-Reply-To: <52d5mdibp1.fsf@cisco.com> References: <52d5mdibp1.fsf@cisco.com> Message-ID: <6.2.0.14.2.20051010124836.023afd08@esmail.cup.hp.com> At 09:22 AM 10/10/2005, Roland Dreier wrote: > yipee> Hi, My setup is a 3GHz Xeon (x86_64) with a 2.6.13.2 > yipee> kernel. A Mellanox memfree PCIe ddr HCA is connected. Why > yipee> do I see IRQ sharing although I'm using msi_x and PCIe? > yipee> Doesn't IRQ sharing only happen on older non PCIe busses? > >I think the messages you see are coming from the ACPI interrupt >routing that is done when the driver calls pci_enable_device(). >However, if you use MSI-X then that interrupt won't actually be used. >If you check /proc/interrupts you should see ib_mthca using 3 >non-shared interrupts. > >BTW, for "INTx emulation" on PCI Express, there are no physical >interrupt lines -- interrupts are asserted and deasserted with >messages. 
So PCI Express interrupts are unshared. They are messages upstream that any device. >However, the PCI Express host bridge turns those interrupts into real >interrupts to the system's interrupt controller, and for that part of the >story, it's entirely possible for two different PCI Express devices to end >up sharing the same interrupt line. Correct, the host bridge may map them to a "monarch" processor and thus any or all devices can share the same interrupt. This is why within the PCI-SIG we recommend using MSI-X and long-term, many of us would simply like to drop INTx and make MSI-X mandatory. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From hch at lst.de Mon Oct 10 13:09:21 2005 From: hch at lst.de (Christoph Hellwig) Date: Mon, 10 Oct 2005 22:09:21 +0200 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <6.2.0.14.2.20051010125123.0238b4d0@esmail.cup.hp.com> References: <434AAFDD.90208@ichips.intel.com> <000601c5cdce$abf93440$9e5aa8c0@infiniconsys.com> <6.2.0.14.2.20051010125123.0238b4d0@esmail.cup.hp.com> Message-ID: <20051010200921.GB25968@lst.de> On Mon, Oct 10, 2005 at 12:53:29PM -0700, Michael Krause wrote: > standards. There are also the new standard Sockets extension API available > today that might be extended sometime in the future to include explicit which is never going to get into linux. one more of these braindead standards people masturbating in a dark room and coming up with a frankenstein bastard cases. 
From rick.jones2 at hp.com Mon Oct 10 13:17:56 2005 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 10 Oct 2005 13:17:56 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <521x2tgrim.fsf@cisco.com> References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> Message-ID: <434ACC74.3020404@hp.com> Roland Dreier wrote: > > 2.6.12-rc5 in-kernel 1 405 <<<<< > > 2.6.12-rc4 in-kernel 1 470 <<<<< > > I was optimistic when I saw this, because the changeover to git > occurred with 2.6.12-rc2, so I thought I could use git bisect to track > down exactly when the performance regression happened. > > However, I haven't been able to get numbers that are stable enough to > track this down. I have two systems, both HP DL145s with dual Opteron > 875s and two-port mem-free PCI Express HCAs. I use MSI-X with the > completion interrupt affinity set to CPU 0, and "taskset 2" to run > netserver and netperf on CPU 1. > > With default netperf parameters (just "-H otherguy") I get numbers > between ~490 MB/sec and ~550 MB/sec for 2.6.12-rc4 and 2.6.12-rc5. > The numbers are quite consistent between reboots, but if I reboot the > system (even keeping the kernel identical), I see large performance > changes. Presumably something is happening like the cache coloring of > some hot data structures changing semi-randomly depending on the > timing of various initialations. Which rev of netperf are you using, and areyou using the "confidence intervals" options (-i, -I)? for a long time, the linux-unique behaviour of returning the overhead bytes for SO_[SND|RCV]BUF and them being 2X what one gives in setsockopt() gave netperf some trouble - the socket buffer would double in size each iteration on a confidence interval run. Later netperf versions (late 2.3, and 2.4.X) have a kludge for this. 
Slightly related to that, IIRC, the linux receiver code adjusts the advertised window as the connection goes along - how far the receive code opens the window may change from run to run - might that have an effect? If there is a way to get the linux receiver to simply advertise the full window from the beginning that might help minimize the number of variables. Are there large changes in service demand along with the large performance changes? FWIW, on later netperfs the -T option should allow you to specify the CPU on which netperf and/or netserver run, although I've had some trouble reliably detecting the right sched_setaffinity syntax among the releases. rick jones From vuhuong at mellanox.com Mon Oct 10 13:25:44 2005 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 10 Oct 2005 13:25:44 -0700 Subject: [openib-general] Re: [PATCH] SRP: don't use TX IU after freeing it In-Reply-To: <433C78A1.30207@mellanox.com> References: <52vf0kii49.fsf@cisco.com> <433C1821.6000809@mellanox.com> <52zmpvhll8.fsf@cisco.com> <433C78A1.30207@mellanox.com> Message-ID: <434ACE48.1030208@mellanox.com> Roland, >> >>That makes some sense. An issue is that FMRs are a fairly limited >>resource, and a system with many SRP targets where each target doesn't >>get much traffic could tie up a lot of FMRs. >> >> >> > You're right. For the same reason of unused port (ie. srp_host), I > create fmr resource per device and keep it in srp_device_data struct > > I put back fmr + your patch and it works well with my setup. > > Signed-off-by: Vu Pham > Have you got time to review this SRP's FMR patch? Thanks, vu From rolandd at cisco.com Mon Oct 10 13:53:29 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 13:53:29 -0700 Subject: [openib-general] Lustre Network Driver - KDAPL or verbs? In-Reply-To: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> (Peter J. 
Braam's message of "Sun, 9 Oct 2005 17:17:56 -0400") References: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> Message-ID: <52mzlhf60m.fsf@cisco.com> > The driver we plan to develop should strive to address several goals: > - high reliability and performance It seems unlikely that you would get more reliability or performance by adding another layer of software in your stack. > - allow interoperability between user and kernel level > - allow interoperability, or better, portability among different > operating systems (Linux, OS X, Windows, Solaris) Interoperability seems a function of designing an appropriate wire protocol rather than how you choose to implement the protocol. I believe that experience has proven that trying to maintain a single codebase portable to different OS kernels is always more work than just having separate codebases for separate kernels. Even trying to use the same code in both Linux kernel 2.4 and kernel 2.6 is enough of a pain that it's probably not worth it. > - be suitable for inclusion in the Linux kernel It is extremely unlikely that kDAPL will ever be included in the kernel. Does this last point mean that you are planning to try again and work on merging Lustre into the mainline kernel? - R. From rolandd at cisco.com Mon Oct 10 13:58:03 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 13:58:03 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <434ACC74.3020404@hp.com> (Rick Jones's message of "Mon, 10 Oct 2005 13:17:56 -0700") References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <434ACC74.3020404@hp.com> Message-ID: <52irw5f5t0.fsf@cisco.com> Rick> Which rev of netperf are you using, and areyou using the Rick> "confidence intervals" options (-i, -I)?
for a long time, Rick> the linux-unique behaviour of returning the overhead bytes Rick> for SO_[SND|RCV]BUF and them being 2X what one gives in Rick> setsockopt() gave netperf some trouble - the socket buffer Rick> would double in size each iteration on a confidence interval Rick> run. Later netperf versions (late 2.3, and 2.4.X) have a Rick> kludge for this. I believe it's netperf 2.2. I'm not using any confidence interval stuff. However, the variation is not between single runs of netperf -- if I do 5 runs of netperf in a row, I get roughly the same number from each run. For example, I might see something like TCP STREAM TEST to 192.168.145.2 : histogram Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3869.82 and then TCP STREAM TEST to 192.168.145.2 : histogram Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3862.41 for two successive runs. However, if I reboot the system into the same kernel (ie everything set up exactly the same), the same invocation of netperf might give TCP STREAM TEST to 192.168.145.2 : histogram Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 4389.20 Rick> Are there large changes in service demand along with the Rick> large performance changes? Not sure. How do I have netperf report service demand? - R. 
From mshefty at ichips.intel.com Mon Oct 10 13:59:09 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Oct 2005 13:59:09 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <6.2.0.14.2.20051010125333.0259c078@esmail.cup.hp.com> References: <1128955559.4377.81.camel@hal.voltaire.com> <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com> <1128960287.4377.378.camel@hal.voltaire.com> <434AA780.60808@ichips.intel.com> <6.2.0.14.2.20051010125333.0259c078@esmail.cup.hp.com> Message-ID: <434AD61D.4060205@ichips.intel.com> Michael Krause wrote: >>> What about the case of iWARP <-> IB ? >> >> Crossing IB shouldn't matter. iWarp should simply cross the IB subnet >> using IPoIB. You could build a gateway to make the transfer across IB >> more efficient, but it's not required. > > I don't understand this statement. iWARP is RDMA based and if someone I was referring to the case where both endpoints are running over iWarp, with IB being one of the subnets being crossed. I believe that you're referring to one side running over iWarp, and the other running over IB, with an application level gateway in between. For the latter case, I would think that the gateway needs to establish iWarp connections for any IP addresses that reside on the IB subnet behind it, with a separate IB connection on the back-end. It seems to me that this would occur transparently to the application using iWarp. - Sean From rolandd at cisco.com Mon Oct 10 14:03:26 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 14:03:26 -0700 Subject: [openib-general] Re: Timeline of IPoIB performance In-Reply-To: <20051010200321.GC6633@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 10 Oct 2005 22:03:22 +0200") References: <521x2tgrim.fsf@cisco.com> <20051010200321.GC6633@mellanox.co.il> Message-ID: <52ek6tf5k1.fsf@cisco.com> Michael> Disabling irq balancing sometimes helps me make the Michael> numbers more stable. 
I don't think that's an issue. I'm running on x86_64, which I don't think has the kernel irq balancer, and I'm not running a userspace IRQ balancer. I can see all the mthca interrupts going to the CPU I set through the smp_affinity file. - R. From rolandd at cisco.com Mon Oct 10 14:05:08 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 14:05:08 -0700 Subject: [openib-general] IRQ sharing on PCIe bus In-Reply-To: <6.2.0.14.2.20051010124836.023afd08@esmail.cup.hp.com> (Michael Krause's message of "Mon, 10 Oct 2005 12:50:59 -0700") References: <52d5mdibp1.fsf@cisco.com> <6.2.0.14.2.20051010124836.023afd08@esmail.cup.hp.com> Message-ID: <52achhf5h7.fsf@cisco.com> Roland> BTW, for "INTx emulation" on PCI Express, there are no Roland> physical interrupt lines -- interrupts are asserted and Roland> deasserted with messages. So PCI Express interrupts are Roland> unshared. Michael> They are messages upstream that any device. That doesn't parse for me. Was what I said wrong? - R. From rolandd at cisco.com Mon Oct 10 14:05:43 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 14:05:43 -0700 Subject: [openib-general] Re: [PATCH] SRP: don't use TX IU after freeing it In-Reply-To: <434ACE48.1030208@mellanox.com> (Vu Pham's message of "Mon, 10 Oct 2005 13:25:44 -0700") References: <52vf0kii49.fsf@cisco.com> <433C1821.6000809@mellanox.com> <52zmpvhll8.fsf@cisco.com> <433C78A1.30207@mellanox.com> <434ACE48.1030208@mellanox.com> Message-ID: <5264s5f5g8.fsf@cisco.com> Vu> Have you got time to review this SRP's FMR patch? Sorry, no. I haven't had much time to work on SRP for the past few weeks. - R. 
From krause at cup.hp.com Mon Oct 10 14:09:45 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 10 Oct 2005 14:09:45 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <434AD61D.4060205@ichips.intel.com> References: <1128955559.4377.81.camel@hal.voltaire.com> <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com> <1128960287.4377.378.camel@hal.voltaire.com> <434AA780.60808@ichips.intel.com> <6.2.0.14.2.20051010125333.0259c078@esmail.cup.hp.com> <434AD61D.4060205@ichips.intel.com> Message-ID: <6.2.0.14.2.20051010140748.025c5fa0@esmail.cup.hp.com> At 01:59 PM 10/10/2005, Sean Hefty wrote: >Michael Krause wrote: >>>>What about the case of iWARP <-> IB ? >>> >>>Crossing IB shouldn't matter. iWarp should simply cross the IB subnet >>>using IPoIB. You could build a gateway to make the transfer across IB >>>more efficient, but it's not required. >>I don't understand this statement. iWARP is RDMA based and if someone > >I was referring to the case where both endpoints are running over iWarp, >with IB being one of the subnets being crossed. I believe that you're >referring to one side running over iWarp, and the other running over IB, >with an application level gateway in between. > >For the latter case, I would think that the gateway needs to establish >iWarp connections for any IP addresses that reside on the IB subnet behind >it, with a separate IB connection on the back-end. It seems to me that >this would occur transparently to the application using iWarp. iWARP with IB in between seems like a waste of time to do (very small if any market for such a beast). IB HCA on a host with an iWARP edge device may be reasonable but again seems like a waste to construct. These types of corner usage models while of interest to comprehend to see if there is any architectural issues to insure they are not precluded really are just that, corner cases, and little time or effort should be spent on their support. 
Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Mon Oct 10 14:14:49 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 14:14:49 -0700 Subject: [openib-general] Re: [PATCH] SRP: don't use TX IU after freeing it In-Reply-To: <433C78A1.30207@mellanox.com> (Vu Pham's message of "Thu, 29 Sep 2005 16:28:33 -0700") References: <52vf0kii49.fsf@cisco.com> <433C1821.6000809@mellanox.com> <52zmpvhll8.fsf@cisco.com> <433C78A1.30207@mellanox.com> Message-ID: <521x2tf512.fsf@cisco.com> OK, a few trivial comments: > +struct srp_device_data { > + struct list_head *dev_list; > + struct ib_pd *pd; > + struct ib_mr *mr; > + struct ib_fmr_pool *fmr_pool; > +}; Why put a pointer to struct list_head here instead of just a struct list_head? If you just used the struct, then you wouldn't need this: > + srp_data->dev_list = kmalloc(sizeof *srp_data->dev_list, GFP_KERNEL); > + if (!srp_data->dev_list) > + goto free_params_attr; > @@ -94,10 +115,14 @@ struct srp_request { > struct scsi_cmnd *scmnd; > struct srp_iu *cmd; > struct srp_iu *tsk_mgmt; > + DECLARE_PCI_UNMAP_ADDR(direct_mapping) > struct completion done; > short next; > u8 cmd_done; > u8 tsk_status; > + struct srp_fmr *fmr_arr; > + u16 fmr_cnt; > + u16 in_use; > }; I can't find anywhere that the in_use flag is used. > +static int srp_map_fmr(struct srp_target_port *target, struct scatterlist *scat, > + int sg_cnt, struct srp_request *req) [...] > + return -ENOMEM; > + } else if (fmr_cnt <= 0) { fmr_cnt is unsigned so I think this is going to get you in trouble. Might as well make fmr_cnt a plain int to make things simpler. Also, it might be good to try and add some more comments explaining srp_map_fmr() -- it would definitely help me review. - R. 
From mshefty at ichips.intel.com Mon Oct 10 14:25:10 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Oct 2005 14:25:10 -0700 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <20051010200042.GB6633@mellanox.co.il> References: <434AAFDD.90208@ichips.intel.com> <20051010200042.GB6633@mellanox.co.il> Message-ID: <434ADC36.60101@ichips.intel.com> Michael S. Tsirkin wrote: >>Any objection to rdma_socket? > > Fine with me, this makes the intent of bind/listen explicit. I have rdma_cm_id right now, and will likely just keep it as that. >>>>>>+int rdma_resolve_route(struct rdma_id *id, int timeout_ms); >>> >>>I was trying to say, why doesnt rdma_connect just do this >>>transparently? Why do we need a separate call? >> >>Eventually rdma_connect will call this for the user if a route hasn't been >>resolved. At some point though, the API will likely need to be expanded to >>specify some sort of quality of service. > > I thought that would also happen at connect time. No? I went with the option of exposing the necessary functionality. Folding this into the connect call means that the user cannot view the returned route before deciding to establish a connection, and the CMA sets the timeout/retry policy for resolving routes. The only benefit of hiding this call is a slight code simplification for the user:

	case RDMA_CM_EVENT_ADDR_RESOLVED:
		ret = rdma_resolve_route(cma_id->context, timeout);
		if (ret)
			connect_error();
		break;
	case RDMA_CM_EVENT_ROUTE_RESOLVED:
		connect(cma_id->context);
		break;

becomes:

	case RDMA_CM_EVENT_ADDR_RESOLVED:
		connect(cma_id->context);
		break;

To make the API slightly easier to use, I thought of letting rdma_resolve_route() be optional. But, that makes it more difficult to handle device removal, and I'm not sure that it's even worth it. As for QoS, I'm not even sure that it shouldn't be specified when performing the address resolution, so that the correct device can be selected.
- Sean From iod00d at hp.com Mon Oct 10 14:26:52 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 10 Oct 2005 14:26:52 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <521x2tgrim.fsf@cisco.com> References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> Message-ID: <20051010212652.GG9613@esmail.cup.hp.com> On Mon, Oct 10, 2005 at 11:23:45AM -0700, Roland Dreier wrote: > > 2.6.12-rc5 in-kernel 1 405 <<<<< > > 2.6.12-rc4 in-kernel 1 470 <<<<< > > I was optimistic when I saw this, because the changeover to git > occurred with 2.6.12-rc2, so I thought I could use git bisect to track > down exactly when the performance regression happened. > > However, I haven't been able to get numbers that are stable enough to > track this down. I have two systems, both HP DL145s with dual Opteron > 875s and two-port mem-free PCI Express HCAs. I use MSI-X with the > completion interrupt affinity set to CPU 0, and "taskset 2" to run > netserver and netperf on CPU 1. As you know, opteron boxes are NUMA. I think you want MSI-X interrupt bound to the same CPU that's connected to the IO. Is CPU 0 closer to IO? I would bind netperf to CPU0 and netserver to CPU 1 on each box respectively. Or just try all 4 combinations to see which combinations are CPU bound vs memory/IO bound. > With default netperf parameters (just "-H otherguy") I get numbers > between ~490 MB/sec and ~550 MB/sec for 2.6.12-rc4 and 2.6.12-rc5. > The numbers are quite consistent between reboots, but if I reboot the > system (even keeping the kernel identical), I see large performance > changes. I gather you meant "tests" in the first phrase? (vs reboot). > Presumably something is happening like the cache coloring of > some hot data structures changing semi-randomly depending on the > timing of various initialations. My guess is based on the same premise. 
The mem-free card will be very sensitive to where its control data is allocated. Is either box configured to interleave memory from both CPUs? If it's interleaving, every other cacheline will be "local". Can you disable interleave and try different netperf/server bindings as suggested above? hth, grant From rick.jones2 at hp.com Mon Oct 10 14:22:40 2005 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 10 Oct 2005 14:22:40 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <52irw5f5t0.fsf@cisco.com> References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <434ACC74.3020404@hp.com> <52irw5f5t0.fsf@cisco.com> Message-ID: <434ADBA0.5070103@hp.com> Roland Dreier wrote: > Rick> Which rev of netperf are you using, and areyou using the > Rick> "confidence intervals" options (-i, -I)? for a long time, > Rick> the linux-unique behaviour of returning the overhead bytes > Rick> for SO_[SND|RCV]BUF and them being 2X what one gives in > Rick> setsockopt() gave netperf some trouble - the socket buffer > Rick> would double in size each iteration on a confidence interval > Rick> run. Later netperf versions (late 2.3, and 2.4.X) have a > Rick> kludge for this. > > I believe it's netperf 2.2. That's rather old. I literally just put 2.4.1 out on ftp.cup.hp.com - probably better to use that if possible. Not that it will change the variability, just that I like it when people are up-to-date on the versions :) If nothing else, the 2.4.X version(s) have a much improved (hopefully) manual in doc/ [If you are really masochistic, the very first release of netperf 4.0.0 source has happened. I can make no guarantees as to its actually working at the moment though :) Netperf4 is going to be the stream for the multiple-connection, multiple system tests rather than the single-connection nature of netperf2] > I'm not using any confidence interval stuff.
However, the variation > is not between single runs of netperf -- if I do 5 runs of netperf in > a row, I get roughly the same number from each run. For example, I > might see something like > > TCP STREAM TEST to 192.168.145.2 : histogram > Recv Send Send > Socket Socket Message Elapsed > Size Size Size Time Throughput > bytes bytes bytes secs. 10^6bits/sec > > 87380 16384 16384 10.00 3869.82 > > and then > > TCP STREAM TEST to 192.168.145.2 : histogram > Recv Send Send > Socket Socket Message Elapsed > Size Size Size Time Throughput > bytes bytes bytes secs. 10^6bits/sec > > 87380 16384 16384 10.00 3862.41 > > for two successive runs. However, if I reboot the system into the > same kernel (ie everything set up exactly the same), the same > invocation of netperf might give > > TCP STREAM TEST to 192.168.145.2 : histogram > Recv Send Send > Socket Socket Message Elapsed > Size Size Size Time Throughput > bytes bytes bytes secs. 10^6bits/sec > > 87380 16384 16384 10.00 4389.20 > > Rick> Are there large changes in service demand along with the > Rick> large performance changes? > > Not sure. How do I have netperf report service demand? Ask for CPU utilization with -c (local) and -C (remote). The /proc/stat stuff used by Linux does not need calibration (IIRC) so you don't have to worry about that. If cache effects are involved, you can make netperf "harder" or "easier" on the caches by altering the size of the send and/or recv buffer rings. By default they are one more than the socket buffer size divided by the send size, but you can make them larger or smaller with the -W option. These days I use a 128K socket buffer and 32K send for the "canonical" (although not default :) netperf TCP_STREAM test: netperf -H remote -c -C -- -s 128K -S 128K -m 32K In netperf-speak K == 1024, k == 1000, M == 2^20, m == 10^6, G == 2^30, g == 10^9...
rick jones From rolandd at cisco.com Mon Oct 10 14:44:21 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 14:44:21 -0700 Subject: [openib-general] IBM eHCA testing.. In-Reply-To: (IBMEHCA DD's message of "Mon, 10 Oct 2005 09:23:59 +0200") References: Message-ID: <52oe5xdp3e.fsf@cisco.com> IBMEHCA> So you need some kind of signal from the operating system IBMEHCA> to system firmware, which in the eHCA case is the IBMEHCA> H_DEFINE_AQP1 triggered by ib_create_qp with IB_QPT_GSI IBMEHCA> parameter. AFTER that call handshaking between system IBMEHCA> firmware and the SM will start, here's a new adapter IBMEHCA> active on a switch port... what's your guid? here's your IBMEHCA> LID, p_key, SM lid.... ...and after all that it's IBMEHCA> possible to send and receive packets from the fabric. IBMEHCA> The openib stack expects that a port is fully functional IBMEHCA> after this create_qp returns, and starts to do all sorts IBMEHCA> of modify QP and post send. So the only choice we have IBMEHCA> there is to delay create_qp until the complete IBMEHCA> handshaking between system firmware and the SM has IBMEHCA> finished (until we see a IB_PORT_ACTIVE in hcad_mod). If IBMEHCA> we don't see that until EHCA_PORT_ACTIVE_TIMEOUT we have IBMEHCA> to return an error code to openib, otherwise we're IBMEHCA> seriously in trouble (tried that). I think this scheme breaks the IB model. When consumers get access to an HCA, they expect to be able to access the HCA, even if an SM has not configured it (and even in the case no cable is connected). As an example of why this is useful, if the link won't come up, it's nice to be able to get to query the port's PMA counters to see if there are excessive errors or something like that. I understand that you don't want to have all HCAs always visible to the SM, but the scheme you've chosen puts an unneeded dependency between driver initialization and the external SM. 
It would be fine if creating QP1 triggered the transition of the port from DOWN to INIT so that it is discoverable by the SM, but there's no reason for creation of QP1 to wait to finish until the SM has brought the port up. (As a side note, Mellanox HCAs don't bring a port to INIT until the host driver has transitioned QP0 to the RTR state, which seems more sensible than waiting for QP1 to be created) I hope this can be fixed in firmware with your current HCA hardware. - R. From mlleinin at hpcn.ca.sandia.gov Mon Oct 10 16:25:07 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 10 Oct 2005 16:25:07 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <521x2tgrim.fsf@cisco.com> References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> Message-ID: <1128986707.13945.424.camel@localhost> On Mon, 2005-10-10 at 11:23 -0700, Roland Dreier wrote: > > 2.6.12-rc5 in-kernel 1 405 <<<<< > > 2.6.12-rc4 in-kernel 1 470 <<<<< > > I was optimistic when I saw this, because the changeover to git > occurred with 2.6.12-rc2, so I thought I could use git bisect to track > down exactly when the performance regression happened. > > However, I haven't been able to get numbers that are stable enough to > track this down. I have two systems, both HP DL145s with dual Opteron > 875s and two-port mem-free PCI Express HCAs. I use MSI-X with the > completion interrupt affinity set to CPU 0, and "taskset 2" to run > netserver and netperf on CPU 1. > > With default netperf parameters (just "-H otherguy") I get numbers > between ~490 MB/sec and ~550 MB/sec for 2.6.12-rc4 and 2.6.12-rc5. > The numbers are quite consistent between reboots, but if I reboot the > system (even keeping the kernel identical), I see large performance > changes. 
Presumably something is happening like the cache coloring of > some hot data structures changing semi-randomly depending on the > timing of various initializations. > > Matt, how stable are your numbers? Pretty consistent. Here are a few runs with 2.6.12-rc5 with reboots in between each run. I'm using netperf-2.3pl1. Run 1: TCP STREAM TEST to 10.128.20.6 Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. KBytes /s % T % T us/KB us/KB 87380 16384 16384 10.00 410302.39 99.89 92.09 4.869 4.489 Run 2: (after another reboot) TCP STREAM TEST to 10.128.20.6 Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. KBytes /s % T % T us/KB us/KB 87380 16384 16384 10.00 409510.33 99.89 91.59 4.879 4.473 Run 3: (after reboot) TCP STREAM TEST to 10.128.20.6 Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. KBytes /s % T % T us/KB us/KB 87380 16384 16384 10.00 404354.11 99.89 91.39 4.941 4.520 I see the same variance in netperf results if I don't reboot between runs. - Matt > From iod00d at hp.com Mon Oct 10 16:30:54 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 10 Oct 2005 16:30:54 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <20051010212652.GG9613@esmail.cup.hp.com> References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <20051010212652.GG9613@esmail.cup.hp.com> Message-ID: <20051010233054.GA11213@esmail.cup.hp.com> On Mon, Oct 10, 2005 at 02:26:52PM -0700, Grant Grundler wrote: ... > If it's interleaving, every other cacheline will be "local". 
ISTR AMD64 was page-interleaved, but then I got confused by documents describing "128-bit" 2-way interleave. I now realize the 128-bit is referring to interleave between two "banks" of memory behind each memory controller, i.e. 2 * 128-bit provides the 32-byte cacheline size that most x86 programs expect. Anyway, I'm hoping that we'll see a consistent result if node interleave is turned off. sorry for the confusion, grant From rolandd at cisco.com Mon Oct 10 16:38:13 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 16:38:13 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <1128986707.13945.424.camel@localhost> (Matt Leininger's message of "Mon, 10 Oct 2005 16:25:07 -0700") References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <1128986707.13945.424.camel@localhost> Message-ID: <52br1xdjtm.fsf@cisco.com> Matt> Pretty consistent. Here are a few runs with 2.6.12-rc5 Matt> with reboots in between each run. I'm using netperf-2.3pl1. That's interesting. I'm guessing you're using mem-ful HCAs? Given that your results are more stable than mine, if you're up for it, you could install git, clone Linus's tree, and then do a git bisect between 2.6.12-rc4 and 2.6.12-rc5 to narrow down the regression to a single commit (if in fact that's possible). - R. From mlleinin at hpcn.ca.sandia.gov Mon Oct 10 16:42:52 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 10 Oct 2005 16:42:52 -0700 Subject: [openib-general] Lustre Network Driver - KDAPL or verbs? In-Reply-To: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> References: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> Message-ID: <1128987772.13945.439.camel@localhost> On Sun, 2005-10-09 at 17:17 -0400, Peter J. 
Braam wrote: > Cluster File Systems, Inc and its customers have been wondering if the > Lustre Network Driver (LND) for OpenIb gen2, which we will begin to > develop during the coming months, should be based on kdapl or verbs. > > The driver we plan to develop should strive to address several goals: > - high reliability and performance > - allow interoperability between user and kernel level > - allow interoperability, or better, portability among different > operating systems (Linux, OS X, Windows, Solaris) > - be suitable for inclusion in the Linux kernel > These last two bullets are mutually exclusive. Submitting code for inclusion into Linux that contains an OS abstraction is a sure way to get your code rejected. It happened to the IBAL stack and it will happen again unless you focus on a Linux-specific "Lustre network driver". As a customer of IB products and Lustre, I'd recommend coding to the OpenIB Verbs layer and using the new CM code as it develops (as Fab described). It's not difficult to port from VAPI to OpenIB Verbs so your current VAPI NAL would be a good starting point. It would be great to see fewer Lustre kernel patches and more of Lustre in the Linux kernel. Thanks, - Matt From mlleinin at hpcn.ca.sandia.gov Mon Oct 10 16:44:57 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 10 Oct 2005 16:44:57 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <52br1xdjtm.fsf@cisco.com> References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <1128986707.13945.424.camel@localhost> <52br1xdjtm.fsf@cisco.com> Message-ID: <1128987897.13952.441.camel@localhost> On Mon, 2005-10-10 at 16:38 -0700, Roland Dreier wrote: > Matt> Pretty consistent. Here are a few runs with 2.6.12-rc5 > Matt> with reboots in between each run. I'm using netperf-2.3pl1. > > That's interesting. I'm guessing you're using mem-ful HCAs? Yes, I'm using mem-full HCAs. 
I could try reflashing the firmware for memfree if that's of interest. > > Given that your results are more stable than mine, if you're up for > it, you could install git, clone Linus's tree, and then do a git > bisect between 2.6.12-rc4 and 2.6.12-rc5 to narrow down the regression > to a single commit (if in fact that's possible). I was hoping someone else would do this. :) I'll start working on it tomorrow if no one else gets to it. Thanks, - Matt From rolandd at cisco.com Mon Oct 10 16:53:12 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 16:53:12 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <1128987897.13952.441.camel@localhost> (Matt Leininger's message of "Mon, 10 Oct 2005 16:44:57 -0700") References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <1128986707.13945.424.camel@localhost> <52br1xdjtm.fsf@cisco.com> <1128987897.13952.441.camel@localhost> Message-ID: <527jcldj4n.fsf@cisco.com> Matt> Yes, I'm using mem-full HCAs. I could try reflashing the Matt> firmware for memfree if that's of interest. No, probably not. If I get a chance I'll do the opposite (flash mem-free -> mem-full, since my HCAs do have memory) and see if it makes my results stable. Matt> I was hoping someone else would do this. :) I'll start Matt> working on it tomorrow if no one else gets to it. I might get a chance to do it tonight... I'll post if I do. - R. 
From halr at voltaire.com Mon Oct 10 17:33:28 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Oct 2005 20:33:28 -0400 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <434AA780.60808@ichips.intel.com> References: <1128955559.4377.81.camel@hal.voltaire.com> <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com> <1128960287.4377.378.camel@hal.voltaire.com> <434AA780.60808@ichips.intel.com> Message-ID: <1128990622.4377.3828.camel@hal.voltaire.com> On Mon, 2005-10-10 at 13:40, Sean Hefty wrote: > Hal Rosenstock wrote: > > What about the case of iWARP <-> IB ? > > Crossing IB shouldn't matter. iWarp should simply cross the IB subnet using > IPoIB. You could build a gateway to make the transfer across IB more efficient, > but it's not required. I was referring to gatewaying to an IB end client from iWARP. -- Hal From ak at suse.de Mon Oct 10 17:51:22 2005 From: ak at suse.de (Andi Kleen) Date: Tue, 11 Oct 2005 02:51:22 +0200 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <20051010233054.GA11213@esmail.cup.hp.com> References: <1128672413.13948.326.camel@localhost> <20051010212652.GG9613@esmail.cup.hp.com> <20051010233054.GA11213@esmail.cup.hp.com> Message-ID: <200510110251.22442.ak@suse.de> On Tuesday 11 October 2005 01:30, Grant Grundler wrote: > On Mon, Oct 10, 2005 at 02:26:52PM -0700, Grant Grundler wrote: > ... > > > If it's interleaving, every other cacheline will be "local". > > ISTR AMD64 was page-interleaved, but then I got confused by documents > describing "128-bit" 2-way interleave. I now realize the 128-bit > is referring to interleave between two "banks" of memory behind > each memory controller, i.e. 2 * 128-bit provides the 32-byte > cacheline size that most x86 programs expect. The cache line size on K7 and K8 is 64 bytes. > Anyway, I'm hoping that we'll see a consistent result if node interleave > is turned off. Yes, usually a good idea. 
-Andi From rolandd at cisco.com Mon Oct 10 21:03:35 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 21:03:35 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <527jcldj4n.fsf@cisco.com> (Roland Dreier's message of "Mon, 10 Oct 2005 16:53:12 -0700") References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <1128986707.13945.424.camel@localhost> <52br1xdjtm.fsf@cisco.com> <1128987897.13952.441.camel@localhost> <527jcldj4n.fsf@cisco.com> Message-ID: <52u0fod7jc.fsf@cisco.com> Roland> I might get a chance to do it tonight... I'll post if I do. I'm giving it a shot but I just can't reproduce this well on my systems. I do see a pretty big regression between 2.6.12-rc4 and 2.6.14-rc2, but 2.6.12-rc5 looks OK on my systems. I reflashed to FW 4.7.0 (mem-ful) and built netperf 2.4.1. With 2.6.12-rc4 I've seen runs as slow as: TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.145.2 (192.168.145.2) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. MBytes /s % S % U us/KB us/KB 87380 16384 16384 10.00 553.71 37.46 -1.00 2.642 -1.000 and with 2.6.12-rc5 I've seen runs as fast as: TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.145.2 (192.168.145.2) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. MBytes /s % S % U us/KB us/KB 87380 16384 16384 10.00 581.82 39.58 -1.00 2.657 -1.000 so not much difference there. 
With 2.6.14-rc2, the best of 10 runs was: TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.145.2 (192.168.145.2) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. MBytes /s % S % U us/KB us/KB 87380 16384 16384 10.01 497.00 39.71 -1.00 3.121 -1.000 so we've definitely lost something there. Time to do some more bisecting... - R. From SCHICKHJ at de.ibm.com Mon Oct 10 23:07:24 2005 From: SCHICKHJ at de.ibm.com (Heiko J Schick) Date: Tue, 11 Oct 2005 08:07:24 +0200 Subject: [openib-general] Wrong minor number for /dev/uat in README file Message-ID: Hello, I think the minor number for /dev/uat in /src/userspace/libibat/README is wrong. mknod /dev/infiniband/uat c 231 254 should be replaced by mknod /dev/infiniband/uat c 231 191 At least, the file /src/linux-kernel/infiniband/core/uat.c has the following content: enum { IB_UAT_MAJOR = 231, IB_UAT_MINOR = 191 }; Many thanks in advance! Mit freundlichen Gruessen / Kind Regards Heiko Joerg Schick IBM Deutschland Entwicklung GmbH I/Ox Microcode Development Linux Infiniband Device Drivers Schoenaicher Str. 220 71032 Boeblingen E-Mail: schickhj at de.ibm.com External: 49-7031-16-0 x4219, t/l: 120-4219 From yael at mellanox.co.il Tue Oct 11 01:28:31 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 11 Oct 2005 10:28:31 +0200 Subject: [openib-general] [PATCH] Opensm - handling immediate error in vendor_send new Message-ID: <5zslv8wj80.fsf@mtl066.yok.mtl.com> Hi Hal, Attached is a new patch with several fixes for this issue. I decided to remove the checking for zero in the atomic_dec after all, since as I mentioned before - clearing it is not a fix, and we will see the value in other infos in the log file. 
Thanks, Yael Signed-off-by: Yael Kalka Index: include/opensm/osm_vl15intf.h =================================================================== --- include/opensm/osm_vl15intf.h (revision 3704) +++ include/opensm/osm_vl15intf.h (working copy) @@ -55,11 +55,13 @@ #include #include #include +#include #include #include #include #include #include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -137,6 +139,9 @@ typedef struct _osm_vl15 osm_vendor_t *p_vend; osm_log_t *p_log; osm_stats_t *p_stats; + osm_subn_t *p_subn; + cl_disp_reg_handle_t h_disp; + cl_plock_t *p_lock; } osm_vl15_t; /* @@ -176,6 +181,15 @@ typedef struct _osm_vl15 * p_stats * Pointer to the OpenSM statistics block. * +* p_subn +* Pointer to the Subnet object for this subnet. +* +* h_disp +* Handle returned from dispatcher registration. +* +* p_lock +* Pointer to the serializing lock. +* * SEE ALSO * VL15 object *********/ @@ -265,7 +279,10 @@ osm_vl15_init( IN osm_vendor_t* const p_vend, IN osm_log_t* const p_log, IN osm_stats_t* const p_stats, - IN const int32_t max_wire_smps ); + IN const int32_t max_wire_smps, + IN osm_subn_t* const p_subn, + IN cl_dispatcher_t* const p_disp, + IN cl_plock_t* const p_lock ); /* * PARAMETERS * p_vl15 @@ -283,6 +300,15 @@ osm_vl15_init( * max_wire_smps * [in] Maximum number of MADs allowed on the wire at one time. * +* p_subn +* [in] Pointer to the subnet object. +* +* p_disp +* [in] Pointer to the dispatcher object. +* +* p_lock +* [in] Pointer to the OpenSM serializing lock. +* * RETURN VALUES * IB_SUCCESS if the VL15 object was initialized successfully. 
* Index: opensm/osm_opensm.c =================================================================== --- opensm/osm_opensm.c (revision 3704) +++ opensm/osm_opensm.c (working copy) @@ -257,7 +257,8 @@ osm_opensm_init( status = osm_vl15_init( &p_osm->vl15, p_osm->p_vendor, - &p_osm->log, &p_osm->stats, p_opt->max_wire_smps ); + &p_osm->log, &p_osm->stats, p_opt->max_wire_smps, + &p_osm->subn, &p_osm->disp, &p_osm->lock ); if( status != IB_SUCCESS ) goto Exit; Index: opensm/osm_vl15intf.c =================================================================== --- opensm/osm_vl15intf.c (revision 3704) +++ opensm/osm_vl15intf.c (working copy) @@ -157,6 +157,8 @@ __osm_vl15_poller( if( status != IB_SUCCESS ) { + uint32_t outstanding; + cl_status_t cl_status; osm_log( p_vl->p_log, OSM_LOG_ERROR, "__osm_vl15_poller: ERR 3E03: " "MAD send failed (%s).\n", @@ -166,7 +168,69 @@ __osm_vl15_poller( The MAD was never successfully sent, so fix up the pre-incremented count values. */ + /* Decrement qp0_mads_sent and qp0_mads_outstanding_on_wire + that was incremented in the code above. */ mads_sent = cl_atomic_dec( &p_vl->p_stats->qp0_mads_sent ); + if( p_madw->resp_expected == TRUE ) + cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding_on_wire ); + + /* + The following code is similar to the one in + __osm_sm_mad_ctrl_retire_trans_mad. We need to decrement the + qp0_mads_outstanding counter, and if we reached 0 - need to call + the cl_disp_post with OSM_SIGNAL_NO_PENDING_TRANSACTION (in order + to wake up the state mgr). + */ + cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); + + osm_log( p_vl->p_log, OSM_LOG_DEBUG, + "__osm_vl15_poller: " + "%u QP0 MADs outstanding.\n", + p_vl->p_stats->qp0_mads_outstanding ); + + /* + Acquire the lock non-exclusively. + Other modules that send MADs grab this lock exclusively. + These modules that are in the process of sending MADs + will hold the lock until they finish posting all the MADs + they plan to send. 
While the other module is sending MADs + the outstanding count may temporarily go to zero. + Thus, by grabbing the lock ourselves, we get an accurate + view of whether or not the number of outstanding MADs is + really zero. + */ + CL_PLOCK_ACQUIRE( p_vl->p_lock ); + outstanding = p_vl->p_stats->qp0_mads_outstanding; + CL_PLOCK_RELEASE( p_vl->p_lock ); + + if( outstanding == 0 ) + { + /* + The wire is clean. + Signal the state manager. + */ + if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_vl->p_log, OSM_LOG_DEBUG, + "__osm_vl15_poller: " + "Posting Dispatcher message %s.\n", + osm_get_disp_msg_str( OSM_MSG_NO_SMPS_OUTSTANDING ) ); + } + + cl_status = cl_disp_post( p_vl->h_disp, + OSM_MSG_NO_SMPS_OUTSTANDING, + (void *)OSM_SIGNAL_NO_PENDING_TRANSACTIONS, + NULL, + NULL ); + + if( cl_status != CL_SUCCESS ) + { + osm_log( p_vl->p_log, OSM_LOG_ERROR, + "__osm_vl15_poller: ERR 3E06: " + "Dispatcher post message failed (%s).\n", + CL_STATUS_MSG( cl_status ) ); + } + } } else { @@ -232,6 +296,7 @@ osm_vl15_construct( cl_qlist_init( &p_vl->rfifo ); cl_qlist_init( &p_vl->ufifo ); cl_thread_construct( &p_vl->poller ); + p_vl->h_disp = CL_DISP_INVALID_HANDLE; } /********************************************************************** @@ -281,6 +346,8 @@ osm_vl15_destroy( p_vl->state = OSM_VL15_STATE_INIT; cl_spinlock_destroy( &p_vl->lock ); + cl_disp_unregister( p_vl->h_disp ); + OSM_LOG_EXIT( p_vl->p_log ); } @@ -292,7 +359,11 @@ osm_vl15_init( IN osm_vendor_t* const p_vend, IN osm_log_t* const p_log, IN osm_stats_t* const p_stats, - IN const int32_t max_wire_smps ) + IN const int32_t max_wire_smps, + IN osm_subn_t* const p_subn, + IN cl_dispatcher_t* const p_disp, + IN cl_plock_t* const p_lock + ) { ib_api_status_t status = IB_SUCCESS; OSM_LOG_ENTER( p_log, osm_vl15_init ); @@ -301,6 +372,8 @@ osm_vl15_init( p_vl->p_log = p_log; p_vl->p_stats = p_stats; p_vl->max_wire_smps = max_wire_smps; + p_vl->p_subn = p_subn; + p_vl->p_lock = p_lock; status = 
cl_event_init( &p_vl->signal, FALSE ); if( status != IB_SUCCESS ) @@ -321,6 +394,21 @@ osm_vl15_init( if( status != IB_SUCCESS ) goto Exit; + p_vl->h_disp = cl_disp_register( + p_disp, + CL_DISP_MSGID_NONE, + NULL, + NULL ); + + if( p_vl->h_disp == CL_DISP_INVALID_HANDLE ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_vl15_init: ERR 3E01: " + "Dispatcher registration failed.\n" ); + status = IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + Exit: OSM_LOG_EXIT( p_log ); return( status ); From mohitka at noida.hcltech.com Tue Oct 11 02:58:19 2005 From: mohitka at noida.hcltech.com (Mohit Katiyar, Noida) Date: Tue, 11 Oct 2005 15:28:19 +0530 Subject: [openib-general] SRP & Infiniband Message-ID: <3E6BB9CEE261E2428AD25D0D553DC4970142EA36@HSDLNTD1110010.noida.hcltech.com> Hi all, I am a newbie just starting to investigate Infiniband and I have a question. I am not clear about the functionality of the user-level HCA driver. Is there a specification for it, or is it entirely vendor-specific? It is also said to be used for speed path operations. Does anyone have an idea how it accomplishes that? If I have SCSI storage devices in a SAN, can I use the SRP module to send some requests and the user-mode HCA library for some speed path operations? Basically, I want to know whether the user-mode HCA library can be used for speed path operations with SCSI devices and, if so, how (theoretical details only; the rest I will try myself). Thanks in advance for any help, Mohit Katiyar -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From halr at voltaire.com Tue Oct 11 03:50:15 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Oct 2005 06:50:15 -0400 Subject: [openib-general] Wrong minor number for /dev/uat in README file In-Reply-To: References: Message-ID: <1129027815.4377.6876.camel@hal.voltaire.com> On Tue, 2005-10-11 at 02:07, Heiko J Schick wrote: > Hello, > > I think the minor number for /dev/uat in /src/userspace/libibat/README is > wrong. > > mknod /dev/infiniband/uat c 231 254 > should be replaced by > mknod /dev/infiniband/uat c 231 191 > > At least, the file /src/linux-kernel/infiniband/core/uat.c has the > following content: > > enum { > IB_UAT_MAJOR = 231, > IB_UAT_MINOR = 191 > }; > > Many thanks in advance! Thanks. The README wasn't updated when this occurred (on 9/15). -- Hal From yael at mellanox.co.il Tue Oct 11 05:24:49 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 11 Oct 2005 14:24:49 +0200 Subject: [openib-general] [PATCH] Opensm - enabling erase of log file flag Message-ID: <5z1x2sxmum.fsf@mtl066.yok.mtl.com> Hi Hal, Currently the osm log file is accumulative. I've added an option to erase the log file before starting to write it. By default, the log is still accumulative. Attached is a patch for that. Thanks, Yael Signed-off-by: Yael Kalka Index: include/opensm/osm_subnet.h =================================================================== --- include/opensm/osm_subnet.h (revision 3704) +++ include/opensm/osm_subnet.h (working copy) @@ -220,6 +220,7 @@ typedef struct _osm_subn_opt uint8_t log_flags; char * dump_files_dir; char * log_file; + boolean_t accum_log_file; cl_map_t port_pro_ignore_guids; boolean_t port_profile_switch_nodes; uint32_t max_port_profile; @@ -319,6 +320,10 @@ typedef struct _osm_subn_opt * log_file * Name of the log file (or NULL) for stdout. * +* accum_log_file +* If TRUE (default) - the log file will be accumulated. +* If FALSE - the log file will be erased before starting current opensm run. 
+* * port_pro_ignore_guids * A map of guids to be ignored by port profiling. * Index: include/opensm/osm_log.h =================================================================== --- include/opensm/osm_log.h (revision 3704) +++ include/opensm/osm_log.h (working copy) @@ -218,7 +218,8 @@ osm_log_init( IN osm_log_t* const p_log, IN const boolean_t flush, IN const uint8_t log_flags, - IN const char *log_file) + IN const char *log_file, + IN const boolean_t accum_log_file ) { p_log->level = log_flags; p_log->flush = flush; @@ -229,10 +230,18 @@ osm_log_init( } else { + if (accum_log_file) p_log->out_port = fopen(log_file,"a+"); + else + p_log->out_port = fopen(log_file,"w+"); + if (!p_log->out_port) { + if (accum_log_file) printf("Cannot open %s for appending. Permission denied\n", log_file); + else + printf("Cannot open %s for writing. Permission denied\n", log_file); + return(IB_UNKNOWN_ERROR); } } Index: complib/cl_event_wheel.c =================================================================== --- complib/cl_event_wheel.c (revision 3704) +++ complib/cl_event_wheel.c (working copy) @@ -597,7 +597,7 @@ main () cl_event_wheel_construct( &event_wheel ); /* init */ - osm_log_init( &log, TRUE, 0xff, NULL); + osm_log_init( &log, TRUE, 0xff, NULL, FALSE); cl_event_wheel_init( &event_wheel, &log ); /* Start Playing */ Index: osmtest/osmtest.c =================================================================== --- osmtest/osmtest.c (revision 3704) +++ osmtest/osmtest.c (working copy) @@ -507,7 +507,7 @@ osmtest_init( IN osmtest_t * const p_osm osmtest_construct( p_osmt ); status = osm_log_init( &p_osmt->log, p_opt->force_log_flush, - 0x0001, p_opt->log_file ); + 0x0001, p_opt->log_file, TRUE ); if( status != IB_SUCCESS ) return ( status ); /* but we do not want any extra staff here */ Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 3704) +++ opensm/osm_subnet.c (working copy) @@ -427,6 +427,7 
@@ osm_subn_set_default_opt( p_opt->dump_files_dir = OSM_DEFAULT_TMP_DIR; p_opt->log_file = OSM_DEFAULT_LOG_FILE; + p_opt->accum_log_file = TRUE; p_opt->port_profile_switch_nodes = FALSE; p_opt->max_port_profile = 0xffffffff; p_opt->pfn_ui_pre_lid_assign = NULL; @@ -754,6 +755,10 @@ osm_subn_parse_conf_file( __osm_subn_opts_unpack_charp( "log_file" , p_key, p_val, &p_opts->log_file); + __osm_subn_opts_unpack_boolean( + "accum_log_file", + p_key, p_val, &p_opts->accum_log_file); + __osm_subn_opts_unpack_charp( "dump_files_dir" , p_key, p_val, &p_opts->dump_files_dir); @@ -920,6 +925,7 @@ osm_subn_write_conf_file( "force_log_flush %s\n\n" "# Log file to be used\n" "log_file %s\n\n" + "accum_log_file %s\n\n" "# The directory to hold the file OpenSM dumps\n" "dump_files_dir %s\n\n" "# If TRUE if OpenSM should disable multicast support\n" @@ -929,6 +935,7 @@ osm_subn_write_conf_file( p_opts->log_flags, p_opts->force_log_flush ? "TRUE" : "FALSE", p_opts->log_file, + p_opts->accum_log_file, p_opts->dump_files_dir, p_opts->no_multicast_option ? "TRUE" : "FALSE", p_opts->disable_multicast ? 
"TRUE" : "FALSE" Index: opensm/osm_db_files.c =================================================================== --- opensm/osm_db_files.c (revision 3704) +++ opensm/osm_db_files.c (working copy) @@ -673,7 +673,7 @@ main(int argc, char **argv) cl_list_construct( &keys ); cl_list_init( &keys, 10 ); - osm_log_init( &log, TRUE, 0xff, "/tmp/test_osm_db.log"); + osm_log_init( &log, TRUE, 0xff, "/tmp/test_osm_db.log", FALSE); osm_db_construct(&db); if (osm_db_init(&db, &log)) Index: opensm/osm_opensm.c =================================================================== --- opensm/osm_opensm.c (revision 3704) +++ opensm/osm_opensm.c (working copy) @@ -205,7 +205,7 @@ osm_opensm_init( osm_opensm_construct( p_osm ); status = osm_log_init( &p_osm->log, p_opt->force_log_flush, - p_opt->log_flags, p_opt->log_file ); + p_opt->log_flags, p_opt->log_file, p_opt->accum_log_file ); if( status != IB_SUCCESS ) return ( status ); Index: opensm/main.c =================================================================== --- opensm/main.c (revision 3704) +++ opensm/main.c (working copy) @@ -167,6 +167,11 @@ show_usage(void) " This option defines the log to be the given file.\n" " By default the log goes to /var/log/osm.log.\n" " For the log to go to standard output use -f stdout.\n\n"); + printf( "-e\n" + "--erase_log_file\n" + " This option will cause deletion of the log file \n" + " (if it previously exists). 
By default, the log file \n" + " is accumulative.\n\n"); printf( "-v\n" "--verbose\n" " This option increases the log verbosity level.\n" @@ -447,7 +452,7 @@ main( boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:d:g:l:s:t:vVhorc"; + const char * const short_option = "i:f:ed:g:l:s:t:vVhorc"; /* In the array below, the 2nd parameter specified the number @@ -467,6 +472,7 @@ main( { "verbose", 0, NULL, 'v'}, { "D", 1, NULL, 'D'}, { "log_file", 1, NULL, 'f'}, + { "erase_log_file",0, NULL, 'e'}, { "maxsmps", 1, NULL, 'n'}, { "V", 0, NULL, 'V'}, { "help", 0, NULL, 'h'}, @@ -636,6 +642,11 @@ main( opt.log_file = optarg; break; + case 'e': + opt.accum_log_file = FALSE; + printf(" Creating new log file\n"); + break; + case 'v': log_flags = (log_flags <<1 )|1; printf(" Verbose option -v (log flags = 0x%X)\n", log_flags ); From SCHICKHJ at de.ibm.com Tue Oct 11 05:43:42 2005 From: SCHICKHJ at de.ibm.com (Heiko J Schick) Date: Tue, 11 Oct 2005 14:43:42 +0200 Subject: [openib-general] IBM eHCA testing.. Message-ID: Hello Troy, this morning I've looked in detail into the problem you've reported on Oct 10 via the OpenIB mailing-list [1]. It seems that the kernel panic is an IPoIB issue. [1]: http://openib.org/pipermail/openib-general/2005-October/012353.html The following things happen: 1. modprobe hcad_mod ehca_nr_ports=1 The eHCA InfiniBand Device Driver is loaded. 2. modprobe ib_mad The ib_mad stack creates an AQP1. This will start the port activation process. By my count it will take more than 110 / 120 seconds to activate a port. Our device driver gets a timeout, which means that the port is NOT active, and ib_modify_qp will not work (for any QP, doesn't matter if it was created in the ib_mad stack or in the ib_ipoib stack). 3. modprobe ib_ipoib All resources for IPoIB are allocated (CQ, QPs, MR, etc.) 4. 
A user runs ifconfig ib0 xxx.xxx.xxx.xxx which executes the following functions: ipoib_open -> ipoib_ib_dev_open -> ipoib_qp_create. The user should see the following error message: l2:/home/schickhj/ibt/linstack/ehca2/ehca2 # ifconfig ib0 192.168.8.8 SIOCSIFFLAGS: Invalid argument 5. The function ipoib_qp_create modifies the QP from Reset 2 Init 2 RTR 2 RTS. If one of these three ib_modify_qp doesn't work, the IPoIB QP (priv->qp) will be destroyed (by the ipoib_qp_create error routine / out_fail) and priv->qp will be NULL. --> see /src/linux-kernel/infiniband/ulp/ipoib/ipoib_verbs.c function ipoib_qp_create 6. A user runs (again) ifconfig ib0 xxx.xxx.xxx which executes (again) the following functions: ipoib_open -> ipoib_ib_dev_open -> ipoib_qp_create 7. ipoib_qp_create wants to modify the IPoIB QP (priv->qp) which is NULL, because the QP was destroy earlier in time by the error handling routine in ipoib_qp_create (see 5.) I think this error could also show up on Mellanox based IB cards when ib_modify_qp failes in ipoib_qp_create. In dmesg you should see: (see 1.) eHCA Infiniband Device Driver (Rel.: ) xics_enable_irq: irq=9029: ibm_int_on returned fffffffd eHCA Infiniband Device Driver (Rel.: ) (see 2.) PU0000 000b0078:ehca_define_sqp HCAD_ERROR Port 1 is not active. PU0000 000b0387:ehca_create_qp HCAD_ERROR ehca_define_sqp() failed rc=ffffffffffffffff PU0000 000b03ae:ehca_create_qp <<< failed ret=ffffffea ib_mad: Couldn't create ib_mad QP1 ib_mad: Couldn't open ehca0 port 1 PU0001 00060103:ehca_parse_ec EHCA port 1 is available. PU0000 000b00bd:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN r3=168 r4=1001000503000004 r5=200100000000002c r6=8a40000000000000 3ed48000 r8=0 r9=0 r10=0 PU0000 000b00c4:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_OUT r3=ffffffffffffffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=800000000005aa18 r10=0 (see 4.) 
PU0000 000b0564:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffffffffffffffd3 ehca_qp=c000000003ba4e00 qp_num=2c
ib0: failed to modify QP to init, ret = -22
ib0: ipoib_qp_create returned -22

Mit freundlichen Gruessen / Kind Regards

Heiko Joerg Schick
IBM Deutschland Entwicklung GmbH
I/Ox Microcode Development
Linux Infiniband Device Drivers
Schoenaicher Str. 220
71032 Boeblingen
E-Mail: schickhj at de.ibm.com
External: 49-7031-16-0 x4219, t/l: 120-4219

From halr at voltaire.com Tue Oct 11 05:42:37 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Oct 2005 08:42:37 -0400
Subject: [openib-general] Re: [PATCH] Opensm - handling immediate error in vendor_send new
In-Reply-To: <5zslv8wj80.fsf@mtl066.yok.mtl.com>
References: <5zslv8wj80.fsf@mtl066.yok.mtl.com>
Message-ID: <1129034556.4377.7616.camel@hal.voltaire.com>

Hi Yael,

On Tue, 2005-10-11 at 04:28, Yael Kalka wrote:
> Attached is a new patch with several fixes for this issue.

Thanks. Applied. There were still extra whitespace issues which I fixed by hand. Please try to eliminate these so I don't have to do hand touch ups.

> I decided to remove the checking for zero in the atomic_dec after all,
> since as I mentioned before - clearing it is not a fix, and we will
> see the value in other infos in the log file.

But there is a danger if these counters wrap, right ?

Also, in looking further at the code, the same issue does not appear to occur for QP1 handling, right ?

-- Hal

From halr at voltaire.com Tue Oct 11 05:48:25 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Oct 2005 08:48:25 -0400
Subject: [openib-general] IBM eHCA testing..
In-Reply-To:
References:
Message-ID: <1129034904.4377.7666.camel@hal.voltaire.com>

Hi Heiko,

On Tue, 2005-10-11 at 08:43, Heiko J Schick wrote:
> this morning I've looked in detail into the problem you've reported on Oct
> 10 via the OpenIB mailing-list [1]. It seems that the kernel panic is an
> IPoIB issues.
>
> [1]: http://openib.org/pipermail/openib-general/2005-October/012353.html
>
> The following things appens:
>
> 1. modprobe hcad_mod ehca_nr_ports=1
> The eHCA InfiniBand Device Driver is loaded.
>
> 2. modprobe ib_mad
> The ib_mad stack creates an AQP1. This will start the port
> activation process.
> By my count it will take more than 110 / 120 seconds to activate a
> port.
> Our device driver gets a timeout, which means that the port is NOT
> active. and
> ib_modify_qp will not work (for any QP, doesn't matter if it was
> created in the ib_mad
> stack or in the ib_ipoib stack).

Where does this time to activate a port come from ? Is there some maximum time in which the eHCA firmware requires this to be completed ?

-- Hal

From SCHICKHJ at de.ibm.com Tue Oct 11 06:21:34 2005
From: SCHICKHJ at de.ibm.com (Heiko J Schick)
Date: Tue, 11 Oct 2005 15:21:34 +0200
Subject: [openib-general] IBM eHCA testing..
In-Reply-To: <1129034904.4377.7666.camel@hal.voltaire.com>
Message-ID:

Hello Hal,

normally the timeout is set to 30 seconds.

If you need more information about the "activation" please see [1].

[1]: http://openib.org/pipermail/openib-general/2005-October/012355.html

Mit freundlichen Gruessen / Kind Regards

Heiko Joerg Schick
IBM Deutschland Entwicklung GmbH
I/Ox Microcode Development
Linux Infiniband Device Drivers
Schoenaicher Str. 220
71032 Boeblingen
E-Mail: schickhj at de.ibm.com
External: 49-7031-16-0 x4219, t/l: 120-4219

Hal Rosenstock
11.10.2005 14:48
To Heiko J Schick/Germany/IBM at IBMDE
cc openib-general at openib.org, Christoph Raisch/Germany/IBM at IBMDE
Subject Re: Re: Re: [openib-general] IBM eHCA testing..

Hi Heiko,

On Tue, 2005-10-11 at 08:43, Heiko J Schick wrote:
> this morning I've looked in detail into the problem you've reported on Oct
> 10 via the OpenIB mailing-list [1]. It seems that the kernel panic is an
> IPoIB issues.
>
> [1]: http://openib.org/pipermail/openib-general/2005-October/012353.html
>
> The following things appens:
>
> 1. modprobe hcad_mod ehca_nr_ports=1
> The eHCA InfiniBand Device Driver is loaded.
>
> 2. modprobe ib_mad
> The ib_mad stack creates an AQP1. This will start the port
> activation process.
> By my count it will take more than 110 / 120 seconds to activate a
> port.
> Our device driver gets a timeout, which means that the port is NOT
> active. and
> ib_modify_qp will not work (for any QP, doesn't matter if it was
> created in the ib_mad
> stack or in the ib_ipoib stack).

Where does this time to activate a port come from ? Is there some maximum time in which the eHCA firmware requires this to be completed ?

-- Hal

From halr at voltaire.com Tue Oct 11 06:18:16 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Oct 2005 09:18:16 -0400
Subject: [openib-general] Re: [PATCH] Opensm - enabling erase of log file flag
In-Reply-To: <5z1x2sxmum.fsf@mtl066.yok.mtl.com>
References: <5z1x2sxmum.fsf@mtl066.yok.mtl.com>
Message-ID: <1129036689.4377.7915.camel@hal.voltaire.com>

Hi Yael,

On Tue, 2005-10-11 at 08:24, Yael Kalka wrote:
> Currently the osm log file is accumulative. I've added an option to
> erase the log file before starting to write it.
> By default, still, the log is still accumulative.
> Attached is a patch for that.

One minor comment on this...

> Thanks,
> Yael
>
> Signed-off-by: Yael Kalka
> Index: opensm/osm_subnet.c
> ===================================================================
> --- opensm/osm_subnet.c (revision 3704)
> +++ opensm/osm_subnet.c (working copy)
> @@ -920,6 +925,7 @@ osm_subn_write_conf_file(
> "force_log_flush %s\n\n"
> "# Log file to be used\n"
> "log_file %s\n\n"
> + "accum_log_file %s\n\n"
> "# The directory to hold the file OpenSM dumps\n"
> "dump_files_dir %s\n\n"
> "# If TRUE if OpenSM should disable multicast support\n"
> @@ -929,6 +935,7 @@ osm_subn_write_conf_file(
> p_opts->log_flags,
> p_opts->force_log_flush ? "TRUE" : "FALSE",
> p_opts->log_file,
> + p_opts->accum_log_file,

Shouldn't this line be:
p_opts->accum_log_file ? "TRUE" : "FALSE",

-- Hal

From jlentini at netapp.com Tue Oct 11 06:33:37 2005
From: jlentini at netapp.com (James Lentini)
Date: Tue, 11 Oct 2005 09:33:37 -0400 (EDT)
Subject: [openib-general] Lustre Network Driver - KDAPL or verbs?
In-Reply-To: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06>
References: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06>
Message-ID:

On Sun, 9 Oct 2005, Peter J. Braam wrote:
> Cluster File Systems, Inc and its customers have been wondering if the
> Lustre Network Driver (LND) for OpenIb gen2, which we will begin to
> develop during the coming months, should be based on kdapl or verbs.
>
> The driver we plan to develop should strive to address several goals:
> - high reliability and performance
> - allow interoperability between user and kernel level
> - allow interoperability, or better, portability among different
> operating systems (Linux, OS X, Windows, Solaris)
> - be suitable for inclusion in the Linux kernel
>
> We are keen to hear some opinions!
>
> Thanks
>
> Peter Braam

Hi Peter,

I am the maintainer of the kDAPL reference implementation.

If you are interested in portability, I would recommend kDAPL.

Earlier this year, there was an effort to modify the kDAPL API to make it acceptable for inclusion in the Linux kernel. After making these modifications, the OpenIB community still felt that the kDAPL API was not ready for merging into the upstream kernel. As a result, a new project was begun to develop an API capable of supporting both IB and iWARP and suitable for kernel inclusion.

At the present time, neither the kDAPL API nor the new RDMA API (verbs + CMA) has been sent upstream. The current thinking is that the RDMA API has a better chance than kDAPL.
james

From halr at voltaire.com Tue Oct 11 06:31:18 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Oct 2005 09:31:18 -0400
Subject: [openib-general] IBM eHCA testing..
In-Reply-To:
References:
Message-ID: <1129037478.4377.8015.camel@hal.voltaire.com>

Hi again Heiko,

On Tue, 2005-10-11 at 09:21, Heiko J Schick wrote:
> Hello Hal,
>
> normally the timeout is set to 30 seconds.

Why does there need to be a timeout for this ? There is no time defined in the IB spec for activating a port. The SM may or may not be up and it is implementation specific when it activates any particular port.

> If you need more information about the "activation" please see [1].
>
> [1]:
> http://openib.org/pipermail/openib-general/2005-October/012355.html

Yes, I saw that post yesterday.

-- Hal

From sinate at yahoo.com Tue Oct 11 06:37:00 2005
From: sinate at yahoo.com (Steven Wooding)
Date: Tue, 11 Oct 2005 14:37:00 +0100 (BST)
Subject: [openib-general] Compiling an application that calls ib_cm_* functions
Message-ID: <20051011133700.77105.qmail@web32506.mail.mud.yahoo.com>

Hi,

I wonder if someone could help me with compiling my IB application? The problem is that when I link my program, all of the ib_cm* function calls come up as "undefined reference", as do dlist_start and _dlist_mark_move (dlist_next in the code).

Here is my linking command:

icpc -o ib_comms_test1 ib_comms_test1.o ib_queue_pair.o ib_comms_manager.o -L/usr/local/lib -libcm -libat -libverbs -libumad -lsysfs -ldl

I get the same result when using g++.

The cmpost.c example compiles fine. I've tried to see what it is doing. It seems to link in the libibcm.la file, but when I try this with icpc or g++, they say they cannot recognise the file type.

Maybe someone can spot the simple mistake I'm making.

Cheers,

Steve.

From halr at voltaire.com Tue Oct 11 06:38:11 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Oct 2005 09:38:11 -0400
Subject: [openib-general] IBM eHCA testing..
In-Reply-To:
References:
Message-ID: <1129037891.4377.8074.camel@hal.voltaire.com>

Hi again Heiko,

On Tue, 2005-10-11 at 09:21, Heiko J Schick wrote:
> Hello Hal,
>
> normally the timeout is set to 30 seconds.

One more thing: How can the timeout be adjusted ? Is it a module parameter ?

-- Hal

From bardov at gmail.com Tue Oct 11 06:44:25 2005
From: bardov at gmail.com (Dan Bar Dov)
Date: Tue, 11 Oct 2005 15:44:25 +0200
Subject: [openib-general] Lustre Network Driver - KDAPL or verbs?
In-Reply-To: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06>
References: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06>
Message-ID:

Hi Peter,

I can testify from first-hand experience - we first developed ISER over KDAPL. It simplified our work since kDAPL was pretty stable at the time. We are now porting ISER to run over openIB-verbs + CMA. Although CMA is not there yet, the port does simplify the code compared to the kDAPL implementation.

Dan

On 10/9/05, Peter J. Braam wrote:
>
> Cluster File Systems, Inc and its customers have been wondering if the
> Lustre Network Driver (LND) for OpenIb gen2, which we will begin to develop
> during the coming months, should be based on kdapl or verbs.
>
> The driver we plan to develop should strive to address several goals:
> - high reliability and performance
> - allow interoperability between user and kernel level
> - allow interoperability, or better, portability among different operating
> systems (Linux, OS X, Windows, Solaris)
> - be suitable for inclusion in the Linux kernel
>
> We are keen to hear some opinions!
>
> Thanks
>
> Peter Braam
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>
>

From mst at mellanox.co.il Tue Oct 11 06:47:47 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 11 Oct 2005 15:47:47 +0200
Subject: [openib-general] Re: [PATCH] SDP: In sdp_link.c::do_link_path_lookup, handle interface table numbering holes
In-Reply-To: <1128091110.5270.1072.camel@hal.voltaire.com>
References: <1128091110.5270.1072.camel@hal.voltaire.com>
Message-ID: <20051011134747.GA17185@mellanox.co.il>

Quoting r. Hal Rosenstock :
> Subject: [PATCH] SDP: In sdp_link.c::do_link_path_lookup, handle interface table numbering holes
>
> SDP: In sdp_link.c::do_link_path_lookup, handle interface table
> numbering holes
> (similar to James Lentini's patch to at.c)
>
> (this is untested)
>
> Signed-off-by: Hal Rosenstock
>
> Index: sdp_link.c
> ===================================================================
> --- sdp_link.c (revision 3623)
> +++ sdp_link.c (working copy)
> @@ -354,7 +354,6 @@ static void do_link_path_lookup(struct s
> struct ipoib_dev_priv *priv;
> struct net_device *dev = NULL;
> struct rtable *rt;
> - int counter = 0;
> int result = 0;
> struct flowi fl = {
> .oif = info->dif, /* oif */
> @@ -435,7 +434,7 @@ static void do_link_path_lookup(struct s
>
> if (dev->flags & IFF_LOOPBACK) {
> dev_put(dev);
> - while ((dev = dev_get_by_index(++counter))) {
> + for (dev = dev_base; dev; dev = dev->next) {
> if (dev->type == ARPHRD_INFINIBAND &&
> (dev->flags & IFF_UP))
> break;
>

I think this list scan needs some kind of protection.
The following is what I checked in.
Does this need to be updated in other places as well?

Handle net interface table numbering holes
(similar to James Lentini's patch to at.c)

Signed-off-by: Michael S. Tsirkin
Signed-off-by: Hal Rosenstock

Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_link.c
===================================================================
--- linux-kernel.orig/drivers/infiniband/ulp/sdp/sdp_link.c 2005-10-11 13:48:30.000000000 +0200
+++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_link.c 2005-10-11 13:55:15.000000000 +0200
@@ -433,13 +433,15 @@ static void do_link_path_lookup(struct s

 if (dev->flags & IFF_LOOPBACK) {
 dev_put(dev);
- while ((dev = dev_get_by_index(++counter))) {
+ read_lock(&dev_base_lock);
+ for (dev = dev_base; dev; dev = dev->next) {
 if (dev->type == ARPHRD_INFINIBAND &&
- (dev->flags & IFF_UP))
+ (dev->flags & IFF_UP)) {
+ dev_hold(dev);
 break;
- else
- dev_put(dev);
+ }
 }
+ read_unlock(&dev_base_lock);
 }

 if (!dev) {

-- MST

From mst at mellanox.co.il Tue Oct 11 07:02:24 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 11 Oct 2005 16:02:24 +0200
Subject: [openib-general] [PATCH] reduce the number of included files in cma.c
Message-ID: <20051011140224.GB17185@mellanox.co.il>

Remove unnecessary includes from cma.c

Signed-off-by: Michael S. Tsirkin

Index: linux-2.6.13/drivers/infiniband/core/cma.c
===================================================================
--- linux-2.6.13/drivers/infiniband/core/cma.c (revision 3720)
+++ linux-2.6.13/drivers/infiniband/core/cma.c (working copy)
@@ -30,10 +30,6 @@
 */
 #include
 #include
-#include
-#include
-#include
-#include
 #include
 #include
 #include

-- MST

From tziporet at mellanox.co.il Tue Oct 11 09:11:42 2005
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 11 Oct 2005 18:11:42 +0200
Subject: [openib-general] segmentation fault in ibv_modify_srq
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E33E79B9@mtlexch01.mtl.com>

The SRQ limit event will also be supported on cards with memory (both Infinihost and Infinihost III).
If someone needs it now, we can provide a drop of the FW that supports it.
It will be officially released in Q4.
Tziporet

-----Original Message-----
From: Roland Dreier [mailto:rolandd at cisco.com]
Sent: Wednesday, October 05, 2005 9:42 PM
To: Sayantan Sur
Cc: openib-general at openib.org
Subject: Re: [openib-general] segmentation fault in ibv_modify_srq

Sayantan> Hello, This is in regard to the use of `ibv_modify_srq'
Sayantan> call. When I use this call, I get a segmentation
Sayantan> fault.

This is because the modify SRQ operation is not implemented at all in libmthca. Do you just want to set the SRQ limit? That's not so hard for me to implement. However, you should be aware that as far as I know, only mem-free HCAs generate the SRQ limited reached event.

- R.

From xma at us.ibm.com Tue Oct 11 09:13:20 2005
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 11 Oct 2005 09:13:20 -0700
Subject: [openib-general] IBM eHCA testing..
In-Reply-To: <1129037891.4377.8074.camel@hal.voltaire.com>
Message-ID:

The IB stack doesn't handle errors during client initialization. This problem is easy to reproduce by inducing errors (resource allocation failure or query failure) in mad_client or sa_client registration. I am working on a patch, but I am in class the whole week and don't have time to verify the patch. I hope the patch will be available early next week to fix the panic.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
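
The failure sequence Heiko describes earlier in this thread (steps 5 to 7: a failed ib_modify_qp in ipoib_qp_create destroys the QP and leaves priv->qp NULL, and a second ifconfig then uses it) can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the IPoIB code and not the patch Shirley mentions: struct fake_qp, struct fake_priv, fake_modify_qp, fake_qp_bringup and dev_open_guarded are all hypothetical names, and the NULL check is just one possible defensive measure.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queue pair; not the kernel's struct ib_qp. */
struct fake_qp {
    int state;          /* 0 = Reset, 1 = Init, 2 = RTR, 3 = RTS */
};

/* Hypothetical stand-in for the per-device private data (priv). */
struct fake_priv {
    struct fake_qp *qp;
    int port_active;    /* 0 models a port that never became active */
};

/* Models a modify-QP call that fails with -EINVAL while the port is inactive. */
int fake_modify_qp(struct fake_priv *priv, int new_state)
{
    if (!priv->port_active)
        return -EINVAL;
    priv->qp->state = new_state;
    return 0;
}

/* Models the error path from step 5: if any transition fails, the QP is
 * destroyed and the pointer is left NULL. */
int fake_qp_bringup(struct fake_priv *priv)
{
    int state;

    for (state = 1; state <= 3; state++) {
        int ret = fake_modify_qp(priv, state);
        if (ret) {
            free(priv->qp);
            priv->qp = NULL;
            return ret;
        }
    }
    return 0;
}

/* Guarded open: returns an error when an earlier failure already destroyed
 * the QP, instead of dereferencing NULL as in step 7. */
int dev_open_guarded(struct fake_priv *priv)
{
    assert(priv != NULL);
    if (!priv->qp)
        return -ENODEV;
    return fake_qp_bringup(priv);
}
```

With port_active set to 0, a first call to dev_open_guarded returns -EINVAL (the same error as the "Invalid argument" and "ret = -22" messages in the output quoted above) and leaves qp NULL; a second call returns -ENODEV rather than touching the destroyed QP.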