From halr at voltaire.com Sat Oct 1 04:32:35 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Oct 2005 07:32:35 -0400
Subject: [openib-general] [PATCH] OpenSM: osm_port_info_rcv.c::__osm_pi_rcv_process_router_port Fix router port handling
Message-ID: <1128166129.4401.1202.camel@hal.voltaire.com>

OpenSM: osm_port_info_rcv.c::__osm_pi_rcv_process_router_port
Fix router port handling

Signed-off-by: Hal Rosenstock

Index: osm_port_info_rcv.c
===================================================================
--- osm_port_info_rcv.c (revision 3623)
+++ osm_port_info_rcv.c (working copy)
@@ -411,6 +411,8 @@ __osm_pi_rcv_process_router_port(
 	     "Invalid base LID 0x%x corrected.\n",
 	     cl_ntoh16 ( orig_lid) );
+  __osm_pi_rcv_process_endport(p_rcv, p_physp, p_pi);
+
   OSM_LOG_EXIT( p_rcv->p_log );
 }

From rolandd at cisco.com Sat Oct 1 13:05:27 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Sat, 01 Oct 2005 13:05:27 -0700
Subject: [openib-general] Re: [PATCH] [mthca]: fixed fields in query_port
In-Reply-To: <20050928134107.GA23849@mellanox.co.il> (Jack Morgenstein's message of "Wed, 28 Sep 2005 16:41:07 +0300")
References: <20050928134107.GA23849@mellanox.co.il>
Message-ID: <52u0g1c8ag.fsf@cisco.com>

Thanks, applied and queued for 2.6.15.

I left out the max_vl_num part of the patch, because it doesn't make
sense to me to fill in the field and then later change the meaning of
the field.

In fact is there any reason to have the max_vl_num field be returned
from the query_port method? I don't see anything sensible a consumer
can do with the value, and I would think consumers should just be
using service levels rather than worrying about the next hop VL. So
maybe we should just delete the field entirely.

 - R.

From sean.hefty at intel.com Sat Oct 1 16:14:12 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Sat, 1 Oct 2005 16:14:12 -0700
Subject: [openib-general] Re: [RFC] IB address translation using ARP
In-Reply-To: <20050930081346.GB31930@mellanox.co.il>
Message-ID:

>I suspect the CM related part cant be easily shared between SDP and CMA,
>since the CM REQ format and the service record format for SDP are already
>set in stone, and are very SDP-specific.

I've given this some more thought, and I think that it makes sense for the
CMA to provide support for SDP, iSER, kDAPL, etc. to the extent that it can.
This requires the CMA to:

* send CM REQ private data using different formats
* know how to interpret received CM REQ private data
* map listen requests to service IDs correctly

One solution is to make the CMA protocol aware to some degree. Clients can
specify a protocol when binding a cma_id to a particular address.

In the simplest case, a user can tell the CMA to simply pass through all
private data. On the passive side, this means that the CMA does not provide
source address information. Apps must either extract the source information
from the private data themselves, or through some other means, such as ATS.

However, this doesn't help map connection or listen requests to IB service IDs.
And I'm not familiar with how SDP, iSER, kDAPL perform their mappings to know
if the CMA could do this without knowing being protocol aware. If this is the
case, then it makes sense to give the CMA some knowledge of the CM REQ private
data format.

- Sean

From sean.hefty at intel.com Sat Oct 1 16:18:30 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Sat, 1 Oct 2005 16:18:30 -0700
Subject: [openib-general] Re: [RFC] IB address translation using ARP
In-Reply-To:
Message-ID:

>However, this doesn't help map connection or listen requests to IB service IDs.
>And I'm not familiar with how SDP, iSER, kDAPL perform their mappings to know if
>the CMA could do this without knowing being protocol aware. If this is the

Er... how about "without being protocol aware" as opposed to "knowing being..."

From jackm at mellanox.co.il Sun Oct 2 00:30:25 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 2 Oct 2005 09:30:25 +0200
Subject: [openib-general] Re: [PATCH] [mthca]: fixed fields in query_port
In-Reply-To: <52u0g1c8ag.fsf@cisco.com>
References: <52u0g1c8ag.fsf@cisco.com>
Message-ID: <20051002073024.GA9873@mellanox.co.il>

On Sat, Oct 01, 2005 at 11:05:27PM +0300, Roland Dreier wrote:
> In fact is there any reason to have the max_vl_num field be returned
> from the query_port method? I don't see anything sensible a consumer
> can do with the value, and I would think consumers should just be
> using service levels rather than worrying about the next hop VL. So
> maybe we should just delete the field entirely.
>

I agree. That value is only of interest to the SM, for use in SL-to-VL mapping
(IB Spec 3.5.7) -- and the SM obtains this value via a MAD query. Applications
should use the SL field in packets for specifying a QoS (in the future) -- and
should not even be aware of VL's.

Anyone else have an opinion?

Jack

From jackm at mellanox.co.il Sun Oct 2 02:17:38 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 2 Oct 2005 11:17:38 +0200
Subject: [openib-general] [PATCH] mthca: when creating a cq, check that requested cqes does not exceed HCA max
Message-ID: <20051002091738.GB9873@mellanox.co.il>

Return an error if requested number of cq entries exceeds HCA max
(IB Spec 11.2.6.1).

Signed-off-by: Jack Morgenstein

Index: linux-kernel/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy)
@@ -134,6 +134,7 @@
 	int num_eecs;
 	int reserved_eecs;
 	int num_cqs;
+	int max_cqes;
 	int reserved_cqs;
 	int num_eqs;
 	int reserved_eqs;
Index: linux-kernel/infiniband/hw/mthca/mthca_main.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_main.c (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_main.c (working copy)
@@ -173,6 +173,7 @@
 	mdev->limits.reserved_pds = dev_lim->reserved_pds;
 	mdev->limits.port_width_cap = dev_lim->max_port_width;
 	mdev->limits.flags = dev_lim->flags;
+	mdev->limits.max_cqes = 0xffff; /* driver override */
 
 	/* IB_DEVICE_RESIZE_MAX_WR not supported by driver.
 	   May be doable since hardware supports it for SRQ.
Index: linux-kernel/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_provider.c (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_provider.c (working copy)
@@ -93,7 +93,7 @@
 	props->max_qp_wr = 0xffff;
 	props->max_sge = mdev->limits.max_sg;
 	props->max_cq = mdev->limits.num_cqs - mdev->limits.reserved_cqs;
-	props->max_cqe = 0xffff;
+	props->max_cqe = mdev->limits.max_cqes;
 	props->max_mr = mdev->limits.num_mpts - mdev->limits.reserved_mrws;
 	props->max_pd = mdev->limits.num_pds - mdev->limits.reserved_pds;
 	props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift;
@@ -639,7 +639,11 @@
 	struct mthca_cq *cq;
 	int nent;
 	int err;
+	struct mthca_dev* mdev = to_mdev(ibdev);
 
+	if (mdev->limits.max_cqes < entries || entries < 0)
+		return ERR_PTR(-EINVAL);
+
 	if (context) {
 		if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
 			return ERR_PTR(-EFAULT);

From jackm at mellanox.co.il Sun Oct 2 06:25:52 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 2 Oct 2005 15:25:52 +0200
Subject: [openib-general] [PATCH] mthca: check for illegal acl when registering an mr
Message-ID: <20051002132552.GC9873@mellanox.co.il>

Now check in kernel space for illegal combination of acl parameters
(per IB Spec 11.2.8.2).

Signed-off-by: Jack Morgenstein

Index: linux-kernel/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_provider.c (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_provider.c (working copy)
@@ -860,6 +860,10 @@
 	int i, j, k;
 	int err = 0;
 
+	if (acc & (IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_REMOTE_WRITE) &&
+	    !(acc & IB_ACCESS_LOCAL_WRITE))
+		return ERR_PTR(-EINVAL);
+
 	shift = ffs(region->page_size) - 1;
 
 	mr = kmalloc(sizeof *mr, GFP_KERNEL);

From jackm at mellanox.co.il Sun Oct 2 07:10:44 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 2 Oct 2005 16:10:44 +0200
Subject: [openib-general] [PATCH] mthca: fixes pkey_ix processing in mthca_modify_qp
Message-ID: <20051002141043.GD9873@mellanox.co.il>

Problem: When pkey-index provided > pkey_table_size, the pkey index used
in sending packets is pkey_index % pkey_table_size (64 for Mellanox HCAs).

Signed-off-by: Jack Morgenstein

Index: linux-kernel/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_qp.c (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_qp.c (working copy)
@@ -585,6 +585,13 @@
 			IB_QP_STATE));
 		return -EINVAL;
 	}
+
+	if ((attr_mask & IB_QP_PKEY_INDEX) &&
+	     attr->pkey_index >= dev->limits.pkey_table_len) {
+		mthca_dbg(dev, "PKey index (%u) too large. max is %d\n",
+			  attr->pkey_index,dev->limits.pkey_table_len-1);
+		return -EINVAL;
+	}
 
 	mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL);
 	if (IS_ERR(mailbox))

From jackm at mellanox.co.il Sun Oct 2 08:12:28 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 2 Oct 2005 17:12:28 +0200
Subject: [openib-general] [PATCH] mthca: check that QP is not already a member of a MCG before attach
Message-ID: <20051002151228.GE9873@mellanox.co.il>

The patch below avoids entering a QP as member of a multicast group
multiple times.

Signed-off-by: Jack Morgenstein

Index: linux-kernel/infiniband/hw/mthca/mthca_mcg.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_mcg.c (revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_mcg.c (working copy)
@@ -189,7 +189,12 @@
 	}
 
 	for (i = 0; i < MTHCA_QP_PER_MGM; ++i)
-		if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) {
+		if (mgm->qp[i] == cpu_to_be32(ibqp->qp_num | (1 << 31))) {
+			mthca_dbg(dev, "QP %06x already a member of MGM\n",
+				  ibqp->qp_num);
+			err = 0;
+			goto out;
+		} else if (!(mgm->qp[i] & cpu_to_be32(1 << 31))) {
 			mgm->qp[i] = cpu_to_be32(ibqp->qp_num | (1 << 31));
 			break;
 		}

From hch at lst.de Sun Oct 2 08:50:06 2005
From: hch at lst.de (Christoph Hellwig)
Date: Sun, 2 Oct 2005 17:50:06 +0200
Subject: [openib-general] [PATCH] mthca: check for illegal acl when registering an mr
In-Reply-To: <20051002132552.GC9873@mellanox.co.il>
References: <20051002132552.GC9873@mellanox.co.il>
Message-ID: <20051002155006.GA9896@lst.de>

On Sun, Oct 02, 2005 at 03:25:52PM +0200, Jack Morgenstein wrote:
> Now check in kernel space for illegal combination of acl parameters
> (per IB Spec 11.2.8.2).

The check should be in ib_uverbs_reg_mr(), not in every driver.

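
The access-flag rule discussed in this thread (remote write or remote
atomic access is only legal together with local write access) can be
sketched outside the kernel as a small predicate. This is a minimal
illustration, not the code that was applied: the ACCESS_* constants and
the helper name access_flags_valid() are stand-ins introduced here, and
real code would use the IB_ACCESS_* flags from the kernel headers.

```c
#include <stdbool.h>

/*
 * Stand-in flag bits for illustration only (assumption); kernel code
 * would use the IB_ACCESS_* definitions instead.
 */
enum access_flags {
	ACCESS_LOCAL_WRITE   = 1 << 0,
	ACCESS_REMOTE_WRITE  = 1 << 1,
	ACCESS_REMOTE_READ   = 1 << 2,
	ACCESS_REMOTE_ATOMIC = 1 << 3
};

/*
 * Hypothetical helper: returns true when the requested combination is
 * allowed, i.e. remote write/atomic is never requested without local
 * write. This mirrors the condition in the patches, inverted so that
 * "true" means "valid".
 */
static bool access_flags_valid(int acc)
{
	if ((acc & (ACCESS_REMOTE_ATOMIC | ACCESS_REMOTE_WRITE)) &&
	    !(acc & ACCESS_LOCAL_WRITE))
		return false;

	return true;
}
```

Placing one such check in the common entry point, as suggested above,
would mean each driver no longer has to repeat the test; a caller would
reject the request with -EINVAL whenever the predicate is false.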
From halr at voltaire.com Mon Oct 3 05:58:18 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Oct 2005 08:58:18 -0400
Subject: [openib-general] Re: [PATCH] [mthca]: fixed fields in query_port
In-Reply-To: <20051002073024.GA9873@mellanox.co.il>
References: <52u0g1c8ag.fsf@cisco.com> <20051002073024.GA9873@mellanox.co.il>
Message-ID: <1128344167.4401.7657.camel@hal.voltaire.com>

On Sun, 2005-10-02 at 03:30, Jack Morgenstein wrote:
> On Sat, Oct 01, 2005 at 11:05:27PM +0300, Roland Dreier wrote:
> > In fact is there any reason to have the max_vl_num field be returned
> > from the query_port method? I don't see anything sensible a consumer
> > can do with the value, and I would think consumers should just be
> > using service levels rather than worrying about the next hop VL. So
> > maybe we should just delete the field entirely.
> >
>
> I agree. That value is only of interest to the SM, for use in SL-to-VL mapping
> (IB Spec 3.5.7) -- and the SM obtains this value via a MAD query. Applications
> should use the SL field in packets for specifying a QoS (in the future) -- and
> should not even be aware of VL's.
>
> Anyone else have an opinion?

A diagnostics application could use this. Not sure if that is sufficient
justification to keep this in. This value can be retrieved via an SA
query or through SM MADs as long as the protection level is low enough.

-- Hal

From jlentini at netapp.com Mon Oct 3 07:45:05 2005
From: jlentini at netapp.com (James Lentini)
Date: Mon, 3 Oct 2005 10:45:05 -0400 (EDT)
Subject: [openib-general] Re: [PATCH] uDAPL cq channel support, sync with latest verbs
In-Reply-To:
References:
Message-ID:

On Fri, 30 Sep 2005, Arlin Davis wrote:

> James,
>
> Here is a patch to support CQ_WAIT_OBJECT with channels and sync
> with latest verbs. Tested with dapltest, dtest, netpipe, and
> Intel-MPI.

Committed in revision 3637

From halr at voltaire.com Mon Oct 3 07:41:08 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Oct 2005 10:41:08 -0400
Subject: [openib-general] [PATCH] af_packet: Allow for > 8 byte hardware addresses
Message-ID: <1128350467.4401.7746.camel@hal.voltaire.com>

Hi,

The following forward patch was accepted into 2.6.14 and affects OpenIB.
I placed this in
gen2/trunk/src/linux-kernel/patches/linux-2.6.13-af-packet.diff

af_packet: Allow for > 8 byte hardware addresses

The convention is that longer addresses will simply extend
the hardware address byte arrays at the end of sockaddr_ll and
packet_mreq.

Signed-off-by: Eric W. Biederman

-- Hal

From rolandd at cisco.com Mon Oct 3 09:13:51 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 03 Oct 2005 09:13:51 -0700
Subject: [openib-general] Re: [PATCH] mthca: when creating a cq, check that requested cqes does not exceed HCA max
In-Reply-To: <20051002091738.GB9873@mellanox.co.il> (Jack Morgenstein's message of "Sun, 2 Oct 2005 11:17:38 +0200")
References: <20051002091738.GB9873@mellanox.co.il>
Message-ID: <52fyribmtc.fsf@cisco.com>

Seems reasonable. However, looking back at the chip documentation, it
seems that the max CQEs should really be 0x1ffff rather than 0xffff as
I had it. Can you confirm?

Thanks,
  Roland

From rolandd at cisco.com Mon Oct 3 09:18:08 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 03 Oct 2005 09:18:08 -0700
Subject: [openib-general] [PATCH] mthca: check for illegal acl when registering an mr
In-Reply-To: <20051002155006.GA9896@lst.de> (Christoph Hellwig's message of "Sun, 2 Oct 2005 17:50:06 +0200")
References: <20051002132552.GC9873@mellanox.co.il> <20051002155006.GA9896@lst.de>
Message-ID: <52br26bmm7.fsf@cisco.com>

    Christoph> The check should be in ib_uverbs_reg_mr(), not in every driver.

Agreed -- I did it like this:

--- infiniband/core/uverbs_cmd.c (revision 3613)
+++ infiniband/core/uverbs_cmd.c (working copy)
@@ -396,6 +396,14 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverb
 	if ((cmd.start & ~PAGE_MASK) != (cmd.hca_va & ~PAGE_MASK))
 		return -EINVAL;
 
+	/*
+	 * Local write permission is required if remote write or
+	 * remote atomic permission is also requested.
+	 */
+	if (cmd.access_flags & (IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_REMOTE_WRITE) &&
+	    !(cmd.access_flags & IB_ACCESS_LOCAL_WRITE))
+		return -EINVAL;
+
 	obj = kmalloc(sizeof *obj, GFP_KERNEL);
 	if (!obj)
 		return -ENOMEM;

From rolandd at cisco.com Mon Oct 3 09:29:30 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 03 Oct 2005 09:29:30 -0700
Subject: [openib-general] some bugs that can be found using the gen2_basic in the contrib/mellanox folder
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E319B157@mtlexch01.mtl.com> (Dotan Barak's message of "Wed, 28 Sep 2005 16:43:01 +0300")
References: <6AB138A2AB8C8E4A98B9C0C3D52670E319B157@mtlexch01.mtl.com>
Message-ID: <527jcubm39.fsf@cisco.com>

I finally got a chance to try your tests. A few comments:

 - Several of the tests are buggy. See the patch below at least.

 - It would be much more useful if the COMPARE() macro printed the
   expected and actual value on failure.

 - Similarly, other macros should probably also print more context.
   For example, in something like:

	CHECK_PTR("ibv_create_qp", qp[i], goto cleanup);

   I would probably want to know the value of i on failure.

 - I don't believe some of the tests are really valid. For example,
   the max number of QPs doesn't have to be precisely correct -- no
   valid app is going to depend on being able to create exactly that
   number of QPs and no more.

 - In any case, I'm not convinced that this sort of negative testing
   is the most valuable thing to focus on right now.
   I think it would be better to have regression tests of basic
   functionality (sends, receives, RDMA, CQ polling, etc) and stress
   tests before testing whether a buggy app will get the right error
   value when passing invalid parameters.

 - R.

Index: test_cq.c
===================================================================
--- test_cq.c (revision 3639)
+++ test_cq.c (working copy)
@@ -106,6 +106,7 @@ int cq_2(
 {
 	struct ibv_context *ib_cont = NULL;
 	struct ibv_pd *pd = NULL;
+	struct ibv_comp_channel *channel = NULL;
 	struct ibv_cq *cq = NULL;
 	struct ibv_cq *event_cq = NULL;
 	struct ibv_qp *qp = NULL;
@@ -132,8 +133,11 @@ int cq_2(
 	pd = ibv_alloc_pd(ib_cont);
 	CHECK_PTR("ibv_alloc_pd", pd, goto cleanup);
 
+	channel = ibv_create_comp_channel(ib_cont);
+	CHECK_PTR("ibv_create_comp_channel", channel, goto cleanup);
+
 	cq_size = VL_range(rand_gen, 1, device_attr.max_cqe);
-	cq = ibv_create_cq(ib_cont, cq_size, (void *)&count, NULL, 0);
+	cq = ibv_create_cq(ib_cont, cq_size, (void *)&count, channel, 0);
 	CHECK_PTR("ibv_create_cq", cq, goto cleanup);
 
 	mr_size = VL_range(rand_gen, 1, 1024);
@@ -211,6 +215,7 @@ int cq_2(
 	CHECK_MALLOC(event_count, goto cleanup);
 	*event_count = 0;
+	rc = ibv_get_cq_event(channel, (void *)&event_cq, (void *)&event_count);
 	rc = ibv_get_cq_event(NULL, (void *)&event_cq, (void *)&event_count);
 	CHECK_VALUE("ibv_get_cq_event", rc, 0, goto cleanup);
Index: test_hca.c
===================================================================
--- test_hca.c (revision 3639)
+++ test_hca.c (working copy)
@@ -230,7 +230,7 @@ int hca_5(
 		j = port_attr.gid_tbl_len + VL_random(rand_gen, 0xFFFFFFFF - port_attr.gid_tbl_len);
 		rc = ibv_query_gid(ib_cont, i, j, &gid);
-		CHECK_VALUE("ibv_query_gid", rc, 0, goto cleanup);
+		CHECK_VALUE("ibv_query_gid", rc, -1, goto cleanup);
 	}
 	PASSED;
@@ -239,7 +239,7 @@ int hca_5(
 	i = VL_range(rand_gen, device_attr.phys_port_cnt + 1, 0xFF);
 	rc = ibv_query_gid(ib_cont, i, j, &gid);
-	CHECK_VALUE("ibv_query_gid", rc, 0, goto cleanup);
+	CHECK_VALUE("ibv_query_gid", rc, -1, goto cleanup);
 	PASSED;
 	test_result = 0;

From rolandd at cisco.com Mon Oct 3 09:32:40 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 03 Oct 2005 09:32:40 -0700
Subject: [PATCH] Check port number in query_port/modify_port (was: [openib-general] some bugs that can be found using the gen2_basic in the contrib/mellanox folder)
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E319B157@mtlexch01.mtl.com> (Dotan Barak's message of "Wed, 28 Sep 2005 16:43:01 +0300")
References: <6AB138A2AB8C8E4A98B9C0C3D52670E319B157@mtlexch01.mtl.com>
Message-ID: <523bniblxz.fsf@cisco.com>

I feel silly for spending time on this, but I made this change to make
a couple of your tests pass:

 - R.

--- infiniband/core/device.c (revision 3613)
+++ infiniband/core/device.c (working copy)
@@ -514,6 +514,12 @@ int ib_query_port(struct ib_device *devi
 		  u8 port_num,
 		  struct ib_port_attr *port_attr)
 {
+	if (device->node_type == IB_NODE_SWITCH) {
+		if (port_num)
+			return -EINVAL;
+	} else if (port_num < 1 || port_num > device->phys_port_cnt)
+		return -EINVAL;
+
 	return device->query_port(device, port_num, port_attr);
 }
 EXPORT_SYMBOL(ib_query_port);
@@ -583,6 +589,12 @@ int ib_modify_port(struct ib_device *dev
 		   u8 port_num, int port_modify_mask,
 		   struct ib_port_modify *port_modify)
 {
+	if (device->node_type == IB_NODE_SWITCH) {
+		if (port_num)
+			return -EINVAL;
+	} else if (port_num < 1 || port_num > device->phys_port_cnt)
+		return -EINVAL;
+
 	return device->modify_port(device, port_num, port_modify_mask,
 				   port_modify);
 }

From mlleini at ca.sandia.gov Mon Oct 3 11:05:37 2005
From: mlleini at ca.sandia.gov (Matt L. Leininger)
Date: Mon, 03 Oct 2005 11:05:37 -0700
Subject: [openib-general] OpenIB gen2 support ibv_create_cq
Message-ID: <1128362737.10484.267.camel@localhost>

The latest mvapich-gen2 does not compile with the latest OpenIB gen2
code base. The number of function arguments to ibv_create_cq has
changed from 3 to 5. This looks like a simple fix, but you may need to
support both the old and new API for ibv_create_cq. The current OpenIB
gen2 backport to 2.6.9 (for RedHat) uses the older API.

Woody, are there plans to update the 2.6.9 backports to svn version 3632
or more recent to fix this?

mvapich-gen2-1.0-102/mpid/ch_gen2/viainit.c ~line 118

static void create_cq(void)
{
    ibv_dev.cq_hndl = ibv_create_cq(ibv_dev.context, viadev_cq_size, NULL);

    if(!ibv_dev.cq_hndl) {
        error_abort_all(GEN_EXIT_ERR, "Error creating CQ\n");
    }
}

OpenIB verbs.h

extern struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe,
                                    void *cq_context,
                                    struct ibv_comp_channel *channel,
                                    int comp_vector);

Thanks,

 - Matt

From robert.j.woodruff at intel.com Mon Oct 3 11:09:18 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 3 Oct 2005 11:09:18 -0700
Subject: [openib-general] RE: OpenIB gen2 support ibv_create_cq
In-Reply-To: <1128362737.10484.267.camel@localhost>
Message-ID:

Matt wrote,
>Woody, are there plans to update the 2.6.9 backports to svn version 3632
>or more recent to fix this?

Yes.
I am working on testing the 2.6.9 backport for 3640 right now.
If all goes well, I should be done testing these within a day or so
and then I will push them out to SVN.

woody

From halr at voltaire.com Mon Oct 3 11:48:44 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Oct 2005 14:48:44 -0400
Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure
Message-ID: <1128365323.4397.38.camel@hal.voltaire.com>

netdevice.h: Add RDMA private pointer to the net_device structure

Signed-off-by: Hal Rosenstock

--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -366,6 +366,7 @@ struct net_device
 	void *ip6_ptr; /* IPv6 specific data */
 	void *ec_ptr; /* Econet specific data */
 	void *ax25_ptr; /* AX.25 specific data */
+	void *rdma_ptr; /* RDMA specific data */
 
 /*
  * Cache line mostly used on receive path (including eth_type_trans())

From shemminger at osdl.org Mon Oct 3 13:54:07 2005
From: shemminger at osdl.org (Stephen Hemminger)
Date: Mon, 3 Oct 2005 13:54:07 -0700
Subject: [openib-general] Re: [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure
In-Reply-To: <1128365323.4397.38.camel@hal.voltaire.com>
References: <1128365323.4397.38.camel@hal.voltaire.com>
Message-ID: <20051003135407.072aaff6@dxpl.pdx.osdl.net>

On 03 Oct 2005 14:48:44 -0400
Hal Rosenstock wrote:

> netdevice.h: Add RDMA private pointer to the net_device structure
>
> Signed-off-by: Hal Rosenstock

Who is going to use it? Is RDMA being submitted for code review?

--
Stephen Hemminger
OSDL http://developer.osdl.org/~shemminger

From halr at voltaire.com Mon Oct 3 13:53:52 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 03 Oct 2005 16:53:52 -0400
Subject: [openib-general] Re: [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure
In-Reply-To: <20051003135407.072aaff6@dxpl.pdx.osdl.net>
References: <1128365323.4397.38.camel@hal.voltaire.com> <20051003135407.072aaff6@dxpl.pdx.osdl.net>
Message-ID: <1128372832.4397.270.camel@hal.voltaire.com>

On Mon, 2005-10-03 at 16:54, Stephen Hemminger wrote:
> On 03 Oct 2005 14:48:44 -0400
> Hal Rosenstock wrote:
>
> > netdevice.h: Add RDMA private pointer to the net_device structure
> >
> > Signed-off-by: Hal Rosenstock
>
> Who is going to use it? Is RDMA being submitted for code review?

IB (and ultimately RDMA) will use it.

-- Hal

From rolandd at cisco.com Mon Oct 3 14:28:17 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 03 Oct 2005 14:28:17 -0700
Subject: [openib-general] OpenIB gen2 support ibv_create_cq
In-Reply-To: <1128362737.10484.267.camel@localhost> (Matt L. Leininger's message of "Mon, 03 Oct 2005 11:05:37 -0700")
References: <1128362737.10484.267.camel@localhost>
Message-ID: <52k6gu9tou.fsf@cisco.com>

    Matt> Woody, are there plans to update the 2.6.9 backports to svn
    Matt> version 3632 or more recent to fix this?

There's no need to backport anything. The latest libibverbs (1.0-rc3)
supports the new CQ API on all kernel ABIs.

 - R.

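
For readers following the ibv_create_cq thread, the change needed at a
three-argument call site such as the create_cq() quoted in the report
can be sketched as follows. This is only an illustration under stated
assumptions: the struct definitions and the body of ibv_create_cq()
below are stand-in stubs so the example compiles without libibverbs
(real code includes <infiniband/verbs.h> and links against the
library), and create_cq_compat() is a hypothetical name. The only
points taken from the thread are the five-argument prototype and the
use of NULL and 0 for the two new arguments when no completion channel
is wanted.

```c
#include <stddef.h>

/*
 * Stand-in types (assumption): minimal placeholders for the real
 * libibverbs structures, just enough for the sketch to compile.
 */
struct ibv_context { int unused; };
struct ibv_comp_channel { int unused; };
struct ibv_cq {
	int cqe;
	void *cq_context;
	struct ibv_comp_channel *channel;
};

/*
 * Stub with the five-argument prototype quoted in the thread. The body
 * is a placeholder that records its arguments; it is NOT the library
 * implementation.
 */
static struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe,
				    void *cq_context,
				    struct ibv_comp_channel *channel,
				    int comp_vector)
{
	static struct ibv_cq cq;

	(void)comp_vector;
	if (!context || cqe <= 0)
		return NULL;

	cq.cqe = cqe;
	cq.cq_context = cq_context;
	cq.channel = channel;
	return &cq;
}

/*
 * Hypothetical updated call site: the old three-argument call gains
 * channel = NULL (no completion channel) and comp_vector = 0.
 */
static struct ibv_cq *create_cq_compat(struct ibv_context *context,
				       int cq_size)
{
	return ibv_create_cq(context, cq_size, NULL, NULL, 0);
}
```

A caller that does want completion events would instead pass a channel
obtained from ibv_create_comp_channel(), as the test_cq.c patch earlier
in this archive does.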
From rolandd at cisco.com Mon Oct 3 14:29:08 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 14:29:08 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <1128365323.4397.38.camel@hal.voltaire.com> (Hal Rosenstock's message of "03 Oct 2005 14:48:44 -0400") References: <1128365323.4397.38.camel@hal.voltaire.com> Message-ID: <52fyri9tnf.fsf@cisco.com> Hal> netdevice.h: Add RDMA private pointer to the net_device structure I don't think there's any point in making this change until we have some code that will use the pointer. - R. From rolandd at cisco.com Mon Oct 3 14:30:30 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 14:30:30 -0700 Subject: [openib-general] Re: [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <20051003135407.072aaff6@dxpl.pdx.osdl.net> (Stephen Hemminger's message of "Mon, 3 Oct 2005 13:54:07 -0700") References: <1128365323.4397.38.camel@hal.voltaire.com> <20051003135407.072aaff6@dxpl.pdx.osdl.net> Message-ID: <52br269tl5.fsf@cisco.com> Stephen> Who is going to use it? Is RDMA being submitted for code Stephen> review? I agree that we should hold off on this until there's an in-tree user. However, just as a clarification, we're trying to move from "ib" to "rdma" nomenclature as we try to make the existing kernel InfiniBand layer a more generic layer that can support both IB and iWARP. So new code should use "rdma" names. - R.
From halr at voltaire.com Mon Oct 3 14:26:56 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Oct 2005 17:26:56 -0400 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <52fyri9tnf.fsf@cisco.com> References: <1128365323.4397.38.camel@hal.voltaire.com> <52fyri9tnf.fsf@cisco.com> Message-ID: <1128374816.4397.343.camel@hal.voltaire.com> On Mon, 2005-10-03 at 17:29, Roland Dreier wrote: > Hal> netdevice.h: Add RDMA private pointer to the net_device structure > > I don't think there's any point in making this change until we have > some code that will use the pointer. We will have this shortly. I have been waiting for this to propose the changes to SDP et al. -- Hal From davem at davemloft.net Mon Oct 3 14:34:07 2005 From: davem at davemloft.net (David S. Miller) Date: Mon, 03 Oct 2005 14:34:07 -0700 (PDT) Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <52fyri9tnf.fsf@cisco.com> References: <1128365323.4397.38.camel@hal.voltaire.com> <52fyri9tnf.fsf@cisco.com> Message-ID: <20051003.143407.49100316.davem@davemloft.net> From: Roland Dreier Date: Mon, 03 Oct 2005 14:29:08 -0700 > Hal> netdevice.h: Add RDMA private pointer to the net_device structure > > I don't think there's any point in making this change until we have > some code that will use the pointer. I definitely agree. From rolandd at cisco.com Mon Oct 3 14:35:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 14:35:39 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <1128374816.4397.343.camel@hal.voltaire.com> (Hal Rosenstock's message of "03 Oct 2005 17:26:56 -0400") References: <1128365323.4397.38.camel@hal.voltaire.com> <52fyri9tnf.fsf@cisco.com> <1128374816.4397.343.camel@hal.voltaire.com> Message-ID: <523bni9tck.fsf@cisco.com> Hal> We will have this shortly. 
I have been waiting for this to Hal> propose the changes to SDP et al. OK, but I don't think it makes sense to merge this upstream until there is in-tree code that will use it. - R. From halr at voltaire.com Mon Oct 3 14:50:21 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Oct 2005 17:50:21 -0400 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <523bni9tck.fsf@cisco.com> References: <1128365323.4397.38.camel@hal.voltaire.com> <52fyri9tnf.fsf@cisco.com> <1128374816.4397.343.camel@hal.voltaire.com> <523bni9tck.fsf@cisco.com> Message-ID: <1128375898.4397.389.camel@hal.voltaire.com> On Mon, 2005-10-03 at 17:35, Roland Dreier wrote: > Hal> We will have this shortly. I have been waiting for this to > Hal> propose the changes to SDP et al. > > OK, but I don't think it makes sense to merge this upstream until > there is in-tree code that will use it. I wanted to get this in so I could add the code to IPoIB to use this so SDP and others no longer poke at IPoIB's private data. This is a small change. Should this change be made locally (in OpenIB) first (and we'll have our own modified netdevice.h for a short time) ? -- Hal From nacc at us.ibm.com Mon Oct 3 15:15:54 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 3 Oct 2005 15:15:54 -0700 Subject: [openib-general] Latest build test results Message-ID: <20051003221553.GA27996@us.ibm.com> Hello, Here are the build results for 2.6.14-rc3 with and without the latest gen2 trunk. Looks like all the builds were successful, with some warnings: - ppc64 + gen2 with =y drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type - same for =m, plus *** Warning: ".ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined! *** Warning: ".ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! 
WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/core/ib_at.ko needs unknown symbol ip_dev_find WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko needs unknown symbol ip_dev_find - x86 + gen2 with =y drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_adaptor_release': drivers/infiniband/ulp/iser/iser_conn.c:195: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c:203: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c:206: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_establish': drivers/infiniband/ulp/iser/iser_conn.c:285: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_enable_rdma': drivers/infiniband/ulp/iser/iser_conn.c:357: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c:431: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_post_receive_control': drivers/infiniband/ulp/iser/iser_conn.c:933: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c:950: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_conn.c:981: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_memory.c: In function `iser_all_mem_add_to_dto': drivers/infiniband/ulp/iser/iser_memory.c:230: warning: cast from pointer to integer of different size drivers/infiniband/ulp/iser/iser_mod.c: In function `init_module': drivers/infiniband/ulp/iser/iser_mod.c:152: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_initiator.c: In function `iser_reg_rdma_mem': drivers/infiniband/ulp/iser/iser_initiator.c:62: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_initiator.c:67: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_initiator.c:80: warning: too few arguments for 
format drivers/infiniband/ulp/iser/iser_initiator.c:95: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_create_ia_pz_evd': drivers/infiniband/ulp/iser/iser_lkdapl.c:147: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_start_dto': drivers/infiniband/ulp/iser/iser_lkdapl.c:660: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_consume_events': drivers/infiniband/ulp/iser/iser_lkdapl.c:758: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_event_handler_thread': drivers/infiniband/ulp/iser/iser_lkdapl.c:800: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:819: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_conn_event': drivers/infiniband/ulp/iser/iser_lkdapl.c:846: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:849: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:852: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:855: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:858: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:861: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:864: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:867: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c:870: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_single_kdapl_event': drivers/infiniband/ulp/iser/iser_lkdapl.c:1116: warning: too few arguments for format drivers/infiniband/ulp/iser/iser_mod.c: In function `cleanup_module': drivers/infiniband/ulp/iser/iser_mod.c:241: warning: too few arguments for 
format drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type - same for =m, plus: *** Warning: "ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined! *** Warning: "ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko needs unknown symbol ip_dev_find WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/core/ib_at.ko needs unknown symbol ip_dev_find Mainline does not appear to have any issues on either ppc64 or x86, =m or =y. Thanks, Nish From rolandd at cisco.com Mon Oct 3 15:17:18 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 15:17:18 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <1128375898.4397.389.camel@hal.voltaire.com> (Hal Rosenstock's message of "03 Oct 2005 17:50:21 -0400") References: <1128365323.4397.38.camel@hal.voltaire.com> <52fyri9tnf.fsf@cisco.com> <1128374816.4397.343.camel@hal.voltaire.com> <523bni9tck.fsf@cisco.com> <1128375898.4397.389.camel@hal.voltaire.com> Message-ID: <52psqm8cup.fsf@cisco.com> Hal> I wanted to get this in so I could add the code to IPoIB to Hal> use this so SDP and others no longer poke at IPoIB's private Hal> data. This is a small change. Should this change be made Hal> locally (in OpenIB) first (and we'll have our own modified Hal> netdevice.h for a short time) ? Yes, I think that's the way to develop this sort of thing. - R. From panda at cse.ohio-state.edu Mon Oct 3 15:47:54 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Mon, 3 Oct 2005 18:47:54 -0400 (EDT) Subject: [openib-general] Re: OpenIB gen2 support ibv_create_cq In-Reply-To: <1128362737.10484.267.camel@localhost> from "Matt L. 
Leininger" at Oct 03, 2005 11:05:37 AM Message-ID: <200510032247.j93MlssL006110@xi.cse.ohio-state.edu> Matt, > The latest mvapich-gen2 does not compile with the latest OpenIB gen2 > code base. The number of function arguments to ibv_create_cq has > changed from 3 to 5. This looks like a simple fix, but you may need to > support both the old and new API for ibv_create_cq. The current OpenIB > gen2 backport to 2.6.9 (for RedHat) uses the older API. The patch has been included in the latest MVAPICH-Gen2 version checked into the SVN a few hours ago. MVAPICH-Gen2 now compiles against the latest Gen2 stack. If an older Gen2 stack is being used against the latest MVAPICH-Gen2, we have added a new flag (-DGEN2_OLD_CQ_VERB) for the code to be compiled with. More information on this has been added to mvapich.user_guide.pdf (Version 1.1). Hope this helps. Thanks, DK > Woody, are there plans to update the 2.6.9 backports to svn version 3632 > or more recent to fix this? > > > > > mvapich-gen2-1.0-102/mpid/ch_gen2/viainit.c ~line 118 > > static void create_cq(void) > { > ibv_dev.cq_hndl = ibv_create_cq(ibv_dev.context, > viadev_cq_size, NULL); > > if(!ibv_dev.cq_hndl) { > error_abort_all(GEN_EXIT_ERR, "Error creating CQ\n"); > } > } > > > > OpenIB verbs.h > > extern struct ibv_cq *ibv_create_cq(struct ibv_context *context, int > cqe, > void *cq_context, > struct ibv_comp_channel *channel, > int comp_vector); > > > Thanks, > > - Matt > > From pradeep at us.ibm.com Mon Oct 3 16:05:45 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 3 Oct 2005 16:05:45 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <1128375898.4397.389.camel@hal.voltaire.com> Message-ID: My understanding is that the refcnt will still need to be held (even after this change) even if SDP would not poke at IPoIB's private data. Is that true? Moreover there was discussion about getting this data from the CM REQ private data. 
So, what is the exact rationale for adding this to the net_device structure? Pradeep pradeep at us.ibm.com openib-general-bounces at openib.org wrote on 10/03/2005 02:50:21 PM: > On Mon, 2005-10-03 at 17:35, Roland Dreier wrote: > > Hal> We will have this shortly. I have been waiting for this to > > Hal> propose the changes to SDP et al. > > > > OK, but I don't think it makes sense to merge this upstream until > > there is in-tree code that will use it. > > I wanted to get this in so I could add the code to IPoIB to use this so SDP > and others no longer poke at IPoIB's private data. This is a small > change. Should this change be made locally (in OpenIB) first (and we'll > have our own modified netdevice.h for a short time) ? > > -- Hal > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Mon Oct 3 16:17:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Oct 2005 19:17:17 -0400 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: References: Message-ID: <1128381437.4397.594.camel@hal.voltaire.com> On Mon, 2005-10-03 at 19:05, Pradeep Satyanarayana wrote: > My understanding is that the refcnt will still need to be held (even > after this change) even if SDP would not poke at IPoIB's private data. > Is that true? Yes, that's an independent issue. > Moreover there was discussion about getting this data from the CM REQ > private data. So, what is the exact rationale for adding this to the > net_device structure? To get at the ib_device, port, and PKey which are needed for a subsequent SA path record request. 
-- Hal > Pradeep > pradeep at us.ibm.com > > openib-general-bounces at openib.org wrote on 10/03/2005 02:50:21 PM: > > > On Mon, 2005-10-03 at 17:35, Roland Dreier wrote: > > > Hal> We will have this shortly. I have been waiting for this > to > > > Hal> propose the changes to SDP et al. > > > > > > OK, but I don't think it makes sense to merge this upstream until > > > there is in-tree code that will use it. > > > > I wanted to get this in so I could add the code to IPoIB to use this > so SDP > > and others no longer poke at IPoIB's private data. This is a small > > change. Should this change be made locally (in OpenIB) first (and > we'll > > have our own modified netdevice.h for a short time) ? > > > > -- Hal > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From pradeep at us.ibm.com Mon Oct 3 16:52:55 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Mon, 3 Oct 2005 16:52:55 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <1128381437.4397.594.camel@hal.voltaire.com> Message-ID: Ok thanks for the explanation. So, I presume that means rdma_ptr will now point to ib_device? If so, one issue that strikes me as significant would be backward compatability. My view is that one could continue to use the IPoIB private data.
Pradeep pradeep at us.ibm.com Hal Rosenstock wrote on 10/03/2005 04:17:17 PM: > On Mon, 2005-10-03 at 19:05, Pradeep Satyanarayana wrote: > > My understanding is that the refcnt will still need to be held (even > > after this change) even if SDP would not poke at IPoIB's private data. > > Is that true? > > Yes, that's an independent issue. > > > Moreover there was discussion about getting this data from the CM REQ > > private data. So, what is the exact rationale for adding this to the > > net_device structure? > > To get at the ib_device, port, and PKey which are needed for a > subsequent SA path record request. > > -- Hal > > > Pradeep > > pradeep at us.ibm.com > > > > openib-general-bounces at openib.org wrote on 10/03/2005 02:50:21 PM: > > > > > On Mon, 2005-10-03 at 17:35, Roland Dreier wrote: > > > > Hal> We will have this shortly. I have been waiting for this > > to > > > > Hal> propose the changes to SDP et al. > > > > > > > > OK, but I don't think it makes sense to merge this upstream until > > > > there is in-tree code that will use it. > > > > > > I wanted to get this in so I could add the code to IPoIB to use this > > so SDP > > > and others no longer poke at IPoIB's private data. This is a small > > > change. Should this change be made locally (in OpenIB) first (and > > we'll > > > have our own modified netdevice.h for a short time) ? > > > > > > -- Hal > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rolandd at cisco.com Mon Oct 3 17:07:26 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 17:07:26 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: (Pradeep Satyanarayana's message of "Mon, 3 Oct 2005 16:52:55 -0700") References: Message-ID: <52ll1a87r5.fsf@cisco.com> Pradeep> If so, one issue that strikes me as significant would be Pradeep> backward compatability. My view is that one could Pradeep> continue to use the IPoIB private data. This is an in-kernel API. There's no reason to even think about backwards compatibility. - R. From sean.hefty at intel.com Mon Oct 3 17:09:58 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 3 Oct 2005 17:09:58 -0700 Subject: [openib-general] CMA and device removal Message-ID: >The idea with this is that a user of the CMA does not need to register for >device addition/removal, and track devices themselves. What I have right now >is something similar to this: > >rdma_create_id(); >rdma_bind_addr(id, optional src addr, dst addr); >rdma_resolve_route(id); /* optional - done by connect if not called */ >rdma_connect(id); I've committed a version of the CMA that attempts to handle device removal internally. When a device is removed, a device removal event is generated on a user's RDMA identifier, and the removal is delayed within the CMA until all references have been released. An updated version of the API is given below. The implementation has not been tested, and there are a couple of missing features: support for listening across all devices and automatic route resolution. The implementation is available under: svn/gen2/users/mshefty. - Sean /* * Copyright (c) 2005 Voltaire Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. 
* * This Software is licensed under one of the following licenses: * * 1) under the terms of the "Common Public License 1.0" a copy of which is * available from the Open Source Initiative, see * http://www.opensource.org/licenses/cpl.php. * * 2) under the terms of the "The BSD License" a copy of which is * available from the Open Source Initiative, see * http://www.opensource.org/licenses/bsd-license.php. * * 3) under the terms of the "GNU General Public License (GPL) Version 2" a * copy of which is available from the Open Source Initiative, see * http://www.opensource.org/licenses/gpl-license.php. * * Licensee has the right to choose one of the above licenses. * * Redistributions of source code must retain the above copyright * notice and one of the license notices. * * Redistributions in binary form must reproduce both the above copyright * notice, one of the license notices in the documentation * and/or other materials provided with the distribution. * */ #if !defined(RDMA_CMA_H) #define RDMA_CMA_H #include #include #include /* * Upon receiving a device removal event, users must destroy the associated * RDMA identifier and release all resources allocated with the device. */ enum rdma_event_type { RDMA_EVENT_ADDR_RESOLVED, RDMA_EVENT_ADDR_ERROR, RDMA_EVENT_ROUTE_RESOLVED, RDMA_EVENT_ROUTE_ERROR, RDMA_EVENT_CONNECT_REQUEST, RDMA_EVENT_CONNECT_ERROR, RDMA_EVENT_UNREACHABLE, RDMA_EVENT_REJECTED, RDMA_EVENT_ESTABLISHED, RDMA_EVENT_DISCONNECTED, RDMA_EVENT_DEVICE_REMOVAL, }; struct rdma_addr { struct sockaddr src_addr; struct sockaddr dst_addr; union { struct ib_addr ibaddr; } addr; }; struct rdma_route { struct rdma_addr addr; struct ib_sa_path_rec *path_rec; int num_paths; }; struct rdma_event { enum rdma_event_type event; int status; void *private_data; u8 private_data_len; }; struct rdma_id; /** * rdma_event_handler - Callback used to report user events. 
* * Notes: Users may not call rdma_destroy_id from this callback to destroy * the passed in id, or a corresponding listen id. Returning a * non-zero value from the callback will destroy the corresponding id. */ typedef int (*rdma_event_handler)(struct rdma_id *id, struct rdma_event *event); struct rdma_id { struct ib_device *device; void *context; struct ib_qp *qp; rdma_event_handler event_handler; struct rdma_route route; }; struct rdma_id* rdma_create_id(rdma_event_handler event_handler, void *context); void rdma_destroy_id(struct rdma_id *id); /** * rdma_bind_addr - Bind an RDMA identifier to a source address and * associated RDMA device, if needed. * * @id: RDMA identifier. * @addr: Local address information. Wildcard values are permitted. * * This associates a source address with the RDMA identifier before calling * rdma_listen. If a specific local address is given, the RDMA identifier will * be bound to a local RDMA device. */ int rdma_bind_addr(struct rdma_id *id, struct sockaddr *addr); /** * rdma_resolve_addr - Resolve destination and optional source addresses * from IP addresses to an RDMA address. If successful, the specified * rdma_id will be bound to a local device. * * @id: RDMA identifier. * @src_addr: Source address information. This parameter may be NULL. * @dst_addr: Destination address information. * @timeout_ms: Time to wait for resolution to complete. */ int rdma_resolve_addr(struct rdma_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr, int timeout_ms); /** * rdma_resolve_route - Resolve the RDMA address bound to the RDMA identifier * into route information needed to establish a connection. * * This is called on the client side of a connection, but its use is optional. * Users must have first called rdma_resolve_addr to resolve a dst_addr * into an RDMA address before calling this routine.
*/ int rdma_resolve_route(struct rdma_id *id, int timeout_ms); /** * rdma_init_qp - Associates a QP with a CMA identifier and initializes the * QP for use in establishing a connection. * * TODO: fix how to do this... doesn't work with iWarp... */ int rdma_init_qp(struct rdma_id *id, struct ib_qp *qp, int qp_access_flags); struct rdma_conn_param { const void *private_data; u8 private_data_len; u8 responder_resources; u8 initiator_depth; u8 flow_control; u8 retry_count; /* ignored when accepting */ u8 rnr_retry_count; }; /** * rdma_connect - Initiate an active connection request. * * Users must have bound the rdma_id to a local device by having called * rdma_resolve_addr before calling this routine. Users may also resolve the * RDMA address to a route with rdma_resolve_route, but if a route has not * been resolved, a default route will be selected. * * Note that the QP must be in the INIT state. */ int rdma_connect(struct rdma_id *id, struct rdma_conn_param *conn_param); /** * rdma_listen - This function is called by the passive side to * listen for incoming connection requests. * * Users must have bound the rdma_id to a local address by calling * rdma_bind_addr before calling this routine. */ int rdma_listen(struct rdma_id *id); /** * rdma_accept - Called on the passive side to accept a connection request * * Note that the QP must be in the INIT state. */ int rdma_accept(struct rdma_id *id, struct rdma_conn_param *conn_param); /** * rdma_reject - Called on the passive side to reject a connection request. */ int rdma_reject(struct rdma_id *id, const void *private_data, u8 private_data_len); /** * rdma_disconnect - This function disconnects the associated QP. 
*/ int rdma_disconnect(struct rdma_id *id); #endif /* RDMA_CMA_H */ From sean.hefty at intel.com Mon Oct 3 20:54:26 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 3 Oct 2005 20:54:26 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA privatepointer to the net_device structure In-Reply-To: <1128381437.4397.594.camel@hal.voltaire.com> Message-ID: >> Moreover there was discussion about getting this data from the CM REQ >> private data. So, what is the exact rationale for adding this to the >> net_device structure? > >To get at the ib_device, port, and PKey which are needed for a >subsequent SA path record request. We should be able to retrieve the device and port through GID matching. I'm not sure how safe it is to access the device pointer in the case of device removal. Reading the device pointer from the rdma_ptr would need to be synchronized with ipoib's device removal handling, but maybe that's handled by the reference on the net_device...? Does ipoib create a device per pkey associated with a port? Is it possible for a user to get at a pkey other than the one at index 0 given only an IP address? - Sean From rolandd at cisco.com Mon Oct 3 21:13:21 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 21:13:21 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA privatepointer to the net_device structure In-Reply-To: (Sean Hefty's message of "Mon, 3 Oct 2005 20:54:26 -0700") References: Message-ID: <52ek719axq.fsf@cisco.com> Sean> Does ipoib create a device per pkey associated with a port? Sean> Is it possible for a user to get at a pkey other than the Sean> one at index 0 given only an IP address? Yes to both. Each P_Key is a different IPoIB broadcast domain and a different netdevice/interface.
Routing could easily return an IPoIB interface with any P_Key. - R. From rolandd at cisco.com Mon Oct 3 21:14:53 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 03 Oct 2005 21:14:53 -0700 Subject: [openib-general] CMA and device removal In-Reply-To: (Sean Hefty's message of "Mon, 3 Oct 2005 17:09:58 -0700") References: Message-ID: <52achp9av6.fsf@cisco.com> >> The idea with this is that a user of the CMA does not need to >> register for device addition/removal, and track devices >> themselves. Not really related to this latest posting, but I think I forgot to reply earlier... in any case, I think this is a really good idea: have the CMA insulate consumers from device addition/removal, so that CMA consumers don't have to use the ib_register_client() API directly. - R.
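Editor's sketch, not part of the original thread: the CMA header posted in the "CMA and device removal" message drives everything through one event callback, and states that a non-zero return from the callback destroys the id. A client-side handler under that rule could be sketched roughly as below. The event names are copied from the posted header; the cut-down structs, the stub rdma_resolve_route/rdma_connect bodies, the call counters, and the 2000 ms timeout are assumptions added so the example is self-contained. QP setup through rdma_init_qp is left out.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char u8;

/* Event types as listed in the posted header. */
enum rdma_event_type {
    RDMA_EVENT_ADDR_RESOLVED, RDMA_EVENT_ADDR_ERROR,
    RDMA_EVENT_ROUTE_RESOLVED, RDMA_EVENT_ROUTE_ERROR,
    RDMA_EVENT_CONNECT_REQUEST, RDMA_EVENT_CONNECT_ERROR,
    RDMA_EVENT_UNREACHABLE, RDMA_EVENT_REJECTED,
    RDMA_EVENT_ESTABLISHED, RDMA_EVENT_DISCONNECTED,
    RDMA_EVENT_DEVICE_REMOVAL,
};

/* Cut-down stand-ins (assumption): only the fields this sketch touches. */
struct rdma_id;
struct rdma_event { enum rdma_event_type event; int status; };
typedef int (*rdma_event_handler)(struct rdma_id *id, struct rdma_event *event);
struct rdma_id { void *context; rdma_event_handler event_handler; };
struct rdma_conn_param { const void *private_data; u8 private_data_len;
                         u8 retry_count; };

/* Stubs (assumption): count calls instead of talking to hardware. */
static int route_calls, connect_calls;

static int rdma_resolve_route(struct rdma_id *id, int timeout_ms)
{
    (void)id; (void)timeout_ms;
    route_calls++;
    return 0;
}

static int rdma_connect(struct rdma_id *id, struct rdma_conn_param *conn_param)
{
    (void)id; (void)conn_param;
    connect_calls++;
    return 0;
}

/* Hypothetical client callback: advance one step per event. Any non-zero
 * return asks the CMA to destroy the id, so the handler never calls
 * rdma_destroy_id itself. */
static int client_event_handler(struct rdma_id *id, struct rdma_event *event)
{
    struct rdma_conn_param param = { NULL, 0, 0 };

    assert(id != NULL && event != NULL);
    switch (event->event) {
    case RDMA_EVENT_ADDR_RESOLVED:
        return rdma_resolve_route(id, 2000);
    case RDMA_EVENT_ROUTE_RESOLVED:
        return rdma_connect(id, &param);
    case RDMA_EVENT_ESTABLISHED:
        return 0;
    default:
        /* Errors, disconnects and device removal all end the id. */
        return -1;
    }
}
```

Treating device removal like any other terminal event is one way a consumer could avoid registering for device add/remove notifications itself, which is the benefit the thread attributes to the CMA.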
From IBMEHCAD at de.ibm.com Tue Oct 4 06:52:55 2005 From: IBMEHCAD at de.ibm.com (IBMEHCA DD) Date: Tue, 4 Oct 2005 15:52:55 +0200 Subject: [openib-general] moving IBM eHCA Device Driver to openib.org Message-ID: We're ready now to release the eHCA device driver to openib.org under http://openib.org/license.html Our assumption is the right place for that code would be: gen2/trunk/src/linux-kernel/infiniband/hw/ehca gen2/trunk/src/userspace/libehca We should probably modify the linux-kernel/infiniband/kconfig to only allow to compile the kernel part for ppc64 builds Please let us know if this is the right way to move our code from sourceforge to openib.org Thanks, Christoph Raisch ibm boeblingen lab -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Oct 4 06:57:00 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Oct 2005 09:57:00 -0400 Subject: [openib-general] moving IBM eHCA Device Driver to openib.org In-Reply-To: References: Message-ID: <1128434220.4397.3899.camel@hal.voltaire.com> Hi, On Tue, 2005-10-04 at 09:52, IBMEHCA DD wrote: > We're ready now to release the eHCA device driver to openib.org under > http://openib.org/license.html Glad to hear this :-) > Our assumption is the right place for that code would be: > > gen2/trunk/src/linux-kernel/infiniband/hw/ehca > gen2/trunk/src/userspace/libehca > > We should probably modify the linux-kernel/infiniband/kconfig to only > allow to compile the kernel part for ppc64 builds Yes (and the makefile there as well). > Please let us know if this is the right way to move our code from > sourceforge to openib.org Yes, that appears right to me. 
-- Hal From caitlinb at broadcom.com Tue Oct 4 07:21:20 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 4 Oct 2005 07:21:20 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020956@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock > Sent: Monday, October 03, 2005 4:17 PM > To: Pradeep Satyanarayana > Cc: openib-general-bounces at openib.org; openib-general at openib.org > Subject: Re: [openib-general] [PATCH] netdevice.h: Add RDMA > private pointer to the net_device structure > > On Mon, 2005-10-03 at 19:05, Pradeep Satyanarayana wrote: > > My understanding is that the refcnt will still need to be > held (even > > after this change) even if SDP would not poke at IPoIB's > private data. > > Is that true? > > Yes, that's an independent issue. > > > Moreover there was discussion about getting this data from > the CM REQ > > private data. So, what is the exact rationale for adding > this to the > > net_device structure? > > To get at the ib_device, port, and PKey which are needed for > a subsequent SA path record request. > In terms of justifying the field in the net_device structure you are saying that this holds data needed by and only understood by the rdma layer, but that is specific to the net_device. That makes sense. The only thing really missing is clarifying the intended scope of this data. I believe that the intent is for it to be transport specific, but not device specific. Is that correct? 
From halr at voltaire.com Tue Oct 4 08:01:14 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Oct 2005 11:01:14 -0400 Subject: [openib-general] [PATCH] ipv4/fib_frontend.c: (Re)export ip_dev_find for 2.6.14 Message-ID: <1128438073.4397.4105.camel@hal.voltaire.com> ipv4/fib_frontend.c: (Re)export ip_dev_find for 2.6.14 (There is emerging functionality (not yet pushed upstream) in the IB subsystem which relies on this being available. ip_dev_find is used to find a valid IPoIB device when the outgoing device returned by the route lookup (ip_route_output_key) is using the loopback interface. A valid IPoIB device is needed to perform sending an ARP and doing an IB path lookup so that an IB connection can be made). Signed-off-by: Hal Rosenstock

--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -661,4 +661,5 @@ void __init ip_fib_init(void)
 }
 
 EXPORT_SYMBOL(inet_addr_type);
+EXPORT_SYMBOL(ip_dev_find);
 EXPORT_SYMBOL(ip_rt_ioctl);

From mshefty at ichips.intel.com Tue Oct 4 09:36:27 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 04 Oct 2005 09:36:27 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020956@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020956@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <4342AF8B.60900@ichips.intel.com> Caitlin Bestler wrote: > That makes sense. The only thing really missing is clarifying > the intended scope of this data. I believe that the intent is > for it to be transport specific, but not device specific. Is > that correct? I'm trying to understand who would use this field and what it would contain. From discussions so far, it looks like only an IP to IB address translation mechanism would need it. And the only value that's required seems to be the pkey. Other values could be returned as well to possibly simplify things, but not sure that anything else is required.
- Sean From rolandd at cisco.com Tue Oct 4 09:43:09 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 04 Oct 2005 09:43:09 -0700 Subject: [openib-general] moving IBM eHCA Device Driver to openib.org In-Reply-To: (IBMEHCA DD's message of "Tue, 4 Oct 2005 15:52:55 +0200") References: Message-ID: <52ll196xnm.fsf@cisco.com> Congratulations on getting to this stage! > gen2/trunk/src/linux-kernel/infiniband/hw/ehca > gen2/trunk/src/userspace/libehca Yes, this is the right place to add the code. > We should probably modify the linux-kernel/infiniband/Kconfig to only > allow to compile the kernel part for ppc64 builds Yes, add source "drivers/infiniband/hw/ehca/Kconfig" to that Kconfig, and obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/ to the Makefile. - R. From caitlinb at broadcom.com Tue Oct 4 10:43:26 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 4 Oct 2005 10:43:26 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020963@NT-SJCA-0751.brcm.ad.broadcom.com> I've been trying to think of some iWARP uses, but haven't come up with any yet. But I have strong lingering suspicions that they will eventually be found and having this type of field will ensure that the data is placed where it belongs rather than another inappropriate peeking at another layer's data being the result. > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 04, 2005 9:36 AM > To: Caitlin Bestler > Cc: Hal Rosenstock; Pradeep Satyanarayana; openib-general at openib.org > Subject: Re: [openib-general] [PATCH] netdevice.h: Add RDMA > private pointer to the net_device structure > > Caitlin Bestler wrote: > > That makes sense. The only thing really missing is clarifying the > > intended scope of this data. I believe that the intent is > for it to be > > transport specific, but not device specific. Is that correct? 
> > I'm trying to understand who would use this field and what it > would contain. > From discussions so far, it looks like only an IP to IB > address translation mechanism would need it. And the only > value that's required seems to be the pkey. Other values > could be returned as well to possibly simplify things, but > not sure that anything else is required. > > - Sean > > From rolandd at cisco.com Tue Oct 4 10:51:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 04 Oct 2005 10:51:54 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020963@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Tue, 4 Oct 2005 10:43:26 -0700") References: <54AD0F12E08D1541B826BE97C98F99F1020963@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <52y8595fwl.fsf@cisco.com> Caitlin> I've been trying to think of some iWARP uses, but haven't Caitlin> come up with any yet. But I have strong lingering Caitlin> suspicions that they will eventually be found and having Caitlin> this type of field will ensure that the data is placed Caitlin> where it belongs rather than another inappropriate Caitlin> peeking at another layer's data being the result. I'm pretty sure iWARP needs the rdma_ptr member for exactly the same reason that IB needs it: to go from a struct net_device coming from route lookup on to a struct rdma_device. - R. From caitlinb at broadcom.com Tue Oct 4 10:55:30 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 4 Oct 2005 10:55:30 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020968@NT-SJCA-0751.brcm.ad.broadcom.com> I think a link from the rdma_device to the net_device is adequate for those purposes. 
> -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, October 04, 2005 10:52 AM > To: Caitlin Bestler > Cc: Sean Hefty; openib-general at openib.org > Subject: Re: [openib-general] [PATCH] netdevice.h: Add RDMA > private pointer to the net_device structure > > Caitlin> I've been trying to think of some iWARP uses, but haven't > Caitlin> come up with any yet. But I have strong lingering > Caitlin> suspicions that they will eventually be found and having > Caitlin> this type of field will ensure that the data is placed > Caitlin> where it belongs rather than another inappropriate > Caitlin> peeking at another layer's data being the result. > > I'm pretty sure iWARP needs the rdma_ptr member for exactly > the same reason that IB needs it: to go from a struct > net_device coming from route lookup on to a struct rdma_device. > > - R. > > From rolandd at cisco.com Tue Oct 4 11:01:28 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 04 Oct 2005 11:01:28 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020968@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Tue, 4 Oct 2005 10:55:30 -0700") References: <54AD0F12E08D1541B826BE97C98F99F1020968@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <52u0fx5fgn.fsf@cisco.com> Caitlin> I think a link from the rdma_device to the net_device is Caitlin> adequate for those purposes. It's the wrong direction though. It seems kind of ugly to have to iterate through the list of rdma_devices for every route lookup, even if the list is almost always short. - R. 
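Roland's point about the direction of the link can be illustrated with a small userspace sketch. This is not kernel code: `struct netdev_sketch`, `struct rdma_dev_sketch`, and all three helpers are hypothetical stand-ins, and only the `rdma_ptr` name comes from the thread. With just a link from the RDMA device to the net device, every route lookup result has to be matched by walking the device list; with a back-pointer stored in the net device, the lookup is a single dereference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device; only rdma_ptr is from the thread. */
struct netdev_sketch {
	const char *name;
	void *rdma_ptr;			/* back-pointer owned by the RDMA layer */
};

/* Hypothetical stand-in for an RDMA device that knows its net device. */
struct rdma_dev_sketch {
	const char *name;
	struct netdev_sketch *netdev;	/* forward link: RDMA device -> net device */
	struct rdma_dev_sketch *next;	/* singly linked list of all RDMA devices */
};

/* Forward links only: walk the whole list for every lookup, O(n). */
static struct rdma_dev_sketch *
rdma_find_by_walk(struct rdma_dev_sketch *head, const struct netdev_sketch *nd)
{
	struct rdma_dev_sketch *d;

	for (d = head; d != NULL; d = d->next)
		if (d->netdev == nd)
			return d;
	return NULL;
}

/* With a back-pointer in the net device: one dereference, O(1). */
static struct rdma_dev_sketch *
rdma_find_by_ptr(const struct netdev_sketch *nd)
{
	return nd->rdma_ptr;
}

/* Link an RDMA device and a net device in both directions. */
static void
rdma_attach(struct rdma_dev_sketch **head, struct rdma_dev_sketch *rd,
	    struct netdev_sketch *nd)
{
	assert(head != NULL && rd != NULL && nd != NULL);
	rd->netdev = nd;
	rd->next = *head;
	*head = rd;
	nd->rdma_ptr = rd;
}
```

Both lookups return the same device; the difference is only in how much work each route lookup costs, and the sketch ignores the device-removal synchronization questions raised elsewhere in the thread.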
From caitlinb at broadcom.com Tue Oct 4 11:06:59 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 4 Oct 2005 11:06:59 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure Message-ID: <54AD0F12E08D1541B826BE97C98F99F102096A@NT-SJCA-0751.brcm.ad.broadcom.com> Good point. That might be enough of a justification alone. And as already stated, I'm convinced there will be other uses. > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, October 04, 2005 11:01 AM > To: Caitlin Bestler > Cc: Sean Hefty; openib-general at openib.org > Subject: Re: [openib-general] [PATCH] netdevice.h: Add RDMA > private pointer to the net_device structure > > Caitlin> I think a link from the rdma_device to the net_device is > Caitlin> adequate for those purposes. > > It's the wrong direction though. It seems kind of ugly to > have to iterate through the list of rdma_devices for every > route lookup, even if the list is almost always short. > > - R. > > From viswa.krish at gmail.com Tue Oct 4 11:17:32 2005 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Tue, 4 Oct 2005 11:17:32 -0700 Subject: [openib-general] Vendor specific MAD support Message-ID: <4df28be40510041117t6f01b70fu488228a16b83b6b9@mail.gmail.com> Does openIB Gen2 stack umad/mad library support Vendor specific MAD extensions ? -Viswa -------------- next part -------------- An HTML attachment was scrubbed...
URL: From rolandd at cisco.com Tue Oct 4 11:20:08 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 04 Oct 2005 11:20:08 -0700 Subject: [openib-general] Vendor specific MAD support In-Reply-To: <4df28be40510041117t6f01b70fu488228a16b83b6b9@mail.gmail.com> (Viswanath Krishnamurthy's message of "Tue, 4 Oct 2005 11:17:32 -0700") References: <4df28be40510041117t6f01b70fu488228a16b83b6b9@mail.gmail.com> Message-ID: <52psql5elj.fsf@cisco.com> Viswanath> Does openIB Gen2 stack umad/mad library support Vendor Viswanath> specific MAD extensions ? The kernel's userspace MAD interface allows userspace to send and receive arbitrary MADs containing any data at all that userspace wants. I'm not sure what the existing libraries expose, but it's rather trivial to code directly to the kernel interface if required. - R. From halr at voltaire.com Tue Oct 4 11:54:32 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Oct 2005 14:54:32 -0400 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <4342AF8B.60900@ichips.intel.com> References: <54AD0F12E08D1541B826BE97C98F99F1020956@NT-SJCA-0751.brcm.ad.broadcom.com> <4342AF8B.60900@ichips.intel.com> Message-ID: <1128452025.4397.4580.camel@hal.voltaire.com> On Tue, 2005-10-04 at 12:36, Sean Hefty wrote: > Caitlin Bestler wrote: > > That makes sense. The only thing really missing is clarifying > > the intended scope of this data. I believe that the intent is > > for it to be transport specific, but not device specific. Is > > that correct? > > I'm trying to understand who would use this field and what it would contain. > From discussions so far, it looks like only an IP to IB address translation > mechanism would need it. And the only value that's required seems to be the > pkey. Other values could be returned as well to possibly simplify things, but > not sure that anything else is required. Also, ib_device and port as well as PKey. 
-- Hal From halr at voltaire.com Tue Oct 4 11:58:28 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Oct 2005 14:58:28 -0400 Subject: [openib-general] Vendor specific MAD support In-Reply-To: <4df28be40510041117t6f01b70fu488228a16b83b6b9@mail.gmail.com> References: <4df28be40510041117t6f01b70fu488228a16b83b6b9@mail.gmail.com> Message-ID: <1128452307.4397.4593.camel@hal.voltaire.com> On Tue, 2005-10-04 at 14:17, Viswanath Krishnamurthy wrote: > Does openIB Gen2 stack umad/mad library support Vendor specific MAD > extensions ? libibmad has some support for vendor MADs:

uint8_t * ib_vendor_call(void *data, ib_portid_t *portid, ib_vendor_call_t *call)

where:

typedef struct ib_vendor_call {
	uint method;
	uint mgmt_class;
	uint attrid;
	uint mod;
	uint32_t oui;
	uint timeout;
	ib_rmpp_hdr_t rmpp;
} ib_vendor_call_t;

You can look at ibping or ibsysstat (under diags) for use of this. -- Hal From mshefty at ichips.intel.com Tue Oct 4 12:26:33 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 04 Oct 2005 12:26:33 -0700 Subject: [openib-general] [PATCH] netdevice.h: Add RDMA private pointer to the net_device structure In-Reply-To: <1128452025.4397.4580.camel@hal.voltaire.com> References: <54AD0F12E08D1541B826BE97C98F99F1020956@NT-SJCA-0751.brcm.ad.broadcom.com> <4342AF8B.60900@ichips.intel.com> <1128452025.4397.4580.camel@hal.voltaire.com> Message-ID: <4342D769.2030700@ichips.intel.com> Hal Rosenstock wrote: >>I'm trying to understand who would use this field and what it would contain. >> From discussions so far, it looks like only an IP to IB address translation >>mechanism would need it. And the only value that's required seems to be the >>pkey. Other values could be returned as well to possibly simplify things, but >>not sure that anything else is required. > > Also, ib_device and port as well as PKey. The device and port can be retrieved by looking up the GID in a local device list, though it's a little inefficient.
I agree that these 3 values are ideal, but not sure that having them helps. (And returning the device pointer could actually lead to misuse.) What's still not clear to me is how an ib_device pointer would be used with respect to device removal. Ultimately a client needs to get a pointer to an ib_device that they can use for QP allocation, etc. I think that we need to examine the problem from a ULP's perspective, versus going up a single layer in the stack. For example, currently the CMA queries an address translation service to convert an IP address into a GID. The CMA searches its device list until it finds a match on the GID. This permits synchronization with device removal. Given the current device registration interface, it seems that a search through a device list is needed at some point. The only alternative that I can think of is to make use of a more complex reference counting scheme. - Sean From davem at davemloft.net Tue Oct 4 12:39:05 2005 From: davem at davemloft.net (David S. Miller) Date: Tue, 04 Oct 2005 12:39:05 -0700 (PDT) Subject: [openib-general] Re: [PATCH] ipv4/fib_frontend.c: (Re)export ip_dev_find for 2.6.14 In-Reply-To: <1128438073.4397.4105.camel@hal.voltaire.com> References: <1128438073.4397.4105.camel@hal.voltaire.com> Message-ID: <20051004.123905.56817889.davem@davemloft.net> From: Hal Rosenstock Date: 04 Oct 2005 11:01:14 -0400 > (There is emerging functionality (not yet pushed upstream) in the IB > subsystem which relies on this being available. ip_dev_find is used to > find a valid IPoIB device when the outgoing device returned by the route > lookup (ip_route_output_key) is using the loopback interface. A valid > IPoIB device is needed to perform sending an ARP and doing an IB path > lookup so that an IB connection can be made). Then add this when this "emerging functionality" is pushed upstream. 
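The GID-matching lookup Sean describes above (translate the IP address to a GID, then search the local device list for the port that owns it) could look roughly like the following userspace sketch. All type and function names here are hypothetical, each port is assumed to carry a single GID rather than a GID table, and nothing about locking or device removal is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_PORTS 2

/* A GID is a 128-bit identifier; raw bytes are enough for matching. */
struct gid_sketch {
	uint8_t raw[16];
};

/* Hypothetical device: one GID per port (an assumption of this sketch). */
struct ibdev_sketch {
	const char *name;
	int num_ports;
	struct gid_sketch port_gid[SKETCH_MAX_PORTS];
	struct ibdev_sketch *next;
};

/*
 * Walk the device list until a port GID matches.  On success, store the
 * 1-based port number in *port and return the device; otherwise return NULL.
 */
static struct ibdev_sketch *
find_dev_by_gid(struct ibdev_sketch *head, const struct gid_sketch *gid,
		int *port)
{
	struct ibdev_sketch *d;
	int p;

	assert(gid != NULL && port != NULL);
	for (d = head; d != NULL; d = d->next) {
		for (p = 0; p < d->num_ports; p++) {
			if (memcmp(d->port_gid[p].raw, gid->raw,
				   sizeof(gid->raw)) == 0) {
				*port = p + 1;	/* IB port numbers start at 1 */
				return d;
			}
		}
	}
	return NULL;
}
```

The search is linear in the number of local ports, which matches the "a little inefficient" remark; in exchange, the caller only ever gets a device it found in its own list.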
From wcchen at us.ibm.com Tue Oct 4 13:09:16 2005 From: wcchen at us.ibm.com (Winston Chen) Date: Tue, 4 Oct 2005 16:09:16 -0400 Subject: [openib-general] libibat/libibcm build mess Message-ID: Hi, Hal: Where can I find functions class_create() and class_device_create() called by ~/infiniband/core/uat.c ? Thanks, Winston Chen IBM RS/6000 SP Development 522 South Road, MS P963 Poughkeepsie, New York 12601 Tel: 1-845-433-8071 email: wcchen at us.ibm.com
From rolandd at cisco.com Tue Oct 4 13:25:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 04 Oct 2005 13:25:54 -0700 Subject: [openib-general] libibat/libibcm build mess In-Reply-To: (Winston Chen's message of "Tue, 4 Oct 2005 16:09:16 -0400") References: Message-ID: <52br25t4fh.fsf@cisco.com> Winston> Where can I find functions class_create() and Winston> class_device_create() called by ~/infiniband/core/uat.c ? They're in include/linux/device.h. What kernel version are you using? - R. From halr at voltaire.com Tue Oct 4 15:00:00 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Oct 2005 18:00:00 -0400 Subject: [openib-general] libibat/libibcm build mess In-Reply-To: References: Message-ID: <1128463199.4397.4605.camel@hal.voltaire.com> Hi Winston, On Tue, 2005-10-04 at 16:09, Winston Chen wrote: > Where can I find functions class_create() and class_device_create() > called by > ~/infiniband/core/uat.c ? Those functions are in 2.6.13 and beyond. Are you using a kernel older than that ?
There is a backpatch available: https://openib.org/svn/gen2/branches/backport/2.6.12/uat_3465_to_2_6_12.patch -- Hal > > Thanks, > > Winston Chen > IBM RS/6000 SP Development > 522 South Road, MS P963 > Poughkeepsie, New York 12601 > Tel: 1-845-433-8071 > email: wcchen at us.ibm.com > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rolandd at cisco.com Tue Oct 4 16:46:15 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 04 Oct 2005 16:46:15 -0700 Subject: [openib-general] [git pull] InfiniBand updates for 2.6.14 Message-ID: <52u0fwsv5k.fsf@cisco.com> Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get the following changes (full patch below): Michael S. 
Tsirkin: [IB] mthca: Fix memory leak on device close Roland Dreier: [IPoIB] Rename IPoIB's path_lookup() to avoid name clashes

 drivers/infiniband/hw/mthca/mthca_main.c  |   45 ++++++++++++++---------------
 drivers/infiniband/ulp/ipoib/ipoib_main.c |    4 +--
 2 files changed, 23 insertions(+), 26 deletions(-)

diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -503,6 +503,25 @@ err_free_aux:
 	return err;
 }
 
+static void mthca_free_icms(struct mthca_dev *mdev)
+{
+	u8 status;
+
+	mthca_free_icm_table(mdev, mdev->mcg_table.table);
+	if (mdev->mthca_flags & MTHCA_FLAG_SRQ)
+		mthca_free_icm_table(mdev, mdev->srq_table.table);
+	mthca_free_icm_table(mdev, mdev->cq_table.table);
+	mthca_free_icm_table(mdev, mdev->qp_table.rdb_table);
+	mthca_free_icm_table(mdev, mdev->qp_table.eqp_table);
+	mthca_free_icm_table(mdev, mdev->qp_table.qp_table);
+	mthca_free_icm_table(mdev, mdev->mr_table.mpt_table);
+	mthca_free_icm_table(mdev, mdev->mr_table.mtt_table);
+	mthca_unmap_eq_icm(mdev);
+
+	mthca_UNMAP_ICM_AUX(mdev, &status);
+	mthca_free_icm(mdev, mdev->fw.arbel.aux_icm);
+}
+
 static int __devinit mthca_init_arbel(struct mthca_dev *mdev)
 {
 	struct mthca_dev_lim dev_lim;
@@ -580,18 +599,7 @@ static int __devinit mthca_init_arbel(st
 	return 0;
 
 err_free_icm:
-	if (mdev->mthca_flags & MTHCA_FLAG_SRQ)
-		mthca_free_icm_table(mdev, mdev->srq_table.table);
-	mthca_free_icm_table(mdev, mdev->cq_table.table);
-	mthca_free_icm_table(mdev, mdev->qp_table.rdb_table);
-	mthca_free_icm_table(mdev, mdev->qp_table.eqp_table);
-	mthca_free_icm_table(mdev, mdev->qp_table.qp_table);
-	mthca_free_icm_table(mdev, mdev->mr_table.mpt_table);
-	mthca_free_icm_table(mdev, mdev->mr_table.mtt_table);
-	mthca_unmap_eq_icm(mdev);
-
-	mthca_UNMAP_ICM_AUX(mdev, &status);
-	mthca_free_icm(mdev, mdev->fw.arbel.aux_icm);
+	mthca_free_icms(mdev);
 
 err_stop_fw:
 	mthca_UNMAP_FA(mdev, &status);
@@ -611,18 +619,7 @@ static void mthca_close_hca(struct mthca
 	mthca_CLOSE_HCA(mdev, 0, &status);
 
 	if (mthca_is_memfree(mdev)) {
-		if (mdev->mthca_flags & MTHCA_FLAG_SRQ)
-			mthca_free_icm_table(mdev, mdev->srq_table.table);
-		mthca_free_icm_table(mdev, mdev->cq_table.table);
-		mthca_free_icm_table(mdev, mdev->qp_table.rdb_table);
-		mthca_free_icm_table(mdev, mdev->qp_table.eqp_table);
-		mthca_free_icm_table(mdev, mdev->qp_table.qp_table);
-		mthca_free_icm_table(mdev, mdev->mr_table.mpt_table);
-		mthca_free_icm_table(mdev, mdev->mr_table.mtt_table);
-		mthca_unmap_eq_icm(mdev);
-
-		mthca_UNMAP_ICM_AUX(mdev, &status);
-		mthca_free_icm(mdev, mdev->fw.arbel.aux_icm);
+		mthca_free_icms(mdev);
 
 		mthca_UNMAP_FA(mdev, &status);
 		mthca_free_icm(mdev, mdev->fw.arbel.fw_icm);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -474,7 +474,7 @@ err:
 	spin_unlock(&priv->lock);
 }
 
-static void path_lookup(struct sk_buff *skb, struct net_device *dev)
+static void ipoib_path_lookup(struct sk_buff *skb, struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(skb->dev);
 
@@ -569,7 +569,7 @@ static int ipoib_start_xmit(struct sk_bu
 
 	if (skb->dst && skb->dst->neighbour) {
 		if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
-			path_lookup(skb, dev);
+			ipoib_path_lookup(skb, dev);
 			goto out;
 		}
 

From halr at voltaire.com Wed Oct 5 03:25:03 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Oct 2005 06:25:03 -0400 Subject: [openib-general] [PATCH] ipv4/fib_frontend.c: (Re)export ip_dev_find for 2.6.14 Message-ID: <1128507902.4397.5400.camel@hal.voltaire.com> Hi, The following patch is currently needed for 2.6.14-rc3 (for SDP and AT).
I placed this in gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff -- Hal ipv4/fib_frontend.c: (Re)export ip_dev_find for 2.6.14 This was removed at 2.6.14 as part of a general cleanup as no one outside of IP currently is using this (but SDP and AT currently do) Signed-off-by: Hal Rosenstock

--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -661,4 +661,5 @@ void __init ip_fib_init(void)
 }
 
 EXPORT_SYMBOL(inet_addr_type);
+EXPORT_SYMBOL(ip_dev_find);
 EXPORT_SYMBOL(ip_rt_ioctl);

From xma at us.ibm.com Wed Oct 5 08:54:37 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Oct 2005 09:54:37 -0600 Subject: [openib-general] [PATCH]small cleanup in cache.c Message-ID: The first time ib_cache_update is called, both old_pkey_cache & old_gid_cache are NULL. Signed-off-by: Shirley Ma (xma at us.ibm.com)

diff -uprN infiniband/core/cache.c infiniband-patch/core/cache.c
--- infiniband/core/cache.c 2005-10-05 06:59:34.000000000 -0700
+++ infiniband-patch/core/cache.c 2005-10-05 08:55:42.550693304 -0700
@@ -252,8 +252,10 @@ static void ib_cache_update(struct ib_de
 
 	write_unlock_irq(&device->cache.lock);
 
-	kfree(old_pkey_cache);
-	kfree(old_gid_cache);
+	if (old_pkey_cache)
+		kfree(old_pkey_cache);
+	if (old_gid_cache)
+		kfree(old_gid_cache);
 	kfree(tprops);
 	return;
 

Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: freecache.patch Type: application/octet-stream Size: 522 bytes Desc: not available URL: From twbowman at gmail.com Wed Oct 5 08:56:14 2005 From: twbowman at gmail.com (Todd Bowman) Date: Wed, 5 Oct 2005 09:56:14 -0600 Subject: [openib-general] ib_cm_listen failure In-Reply-To: References: <433C2ADF.4010402@ichips.intel.com> Message-ID: On 9/30/05, James Lentini wrote: > > > > On Fri, 30 Sep 2005, Todd Bowman wrote: > > > udapl is using 0x115d3. How is this set and what value should it be? > > > > Todd > > On InfiniBand, uDAPL maps connection qualifiers onto service IDs > (SIDs). > > The connection qualifier is chosen by the uDAPL application when it > creates a Public Service Point (PSP) or Reserved Service Point (RSP). > > As Arlin noted, 0x115d3 is in the SDP range. The dapltest test tools > uses 0xB0de. I would try any value except those in the range > 0x10000-0x1fffff and 0xB0de. > > james > Here is a patch for dtest.c to remove the qualifier from the sdp range.

Index: userspace/dapl/test/dtest/dtest.c
===================================================================
--- userspace/dapl/test/dtest/dtest.c (revision 3547)
+++ userspace/dapl/test/dtest/dtest.c (working copy)
@@ -53,7 +53,7 @@
 #include "dat/udat.h"
 
 /* definitions */
-#define SERVER_CONN_QUAL 71123
+#define SERVER_CONN_QUAL 45248
 #define DTO_TIMEOUT (1000*1000*5)
 #define DTO_FLUSH_TIMEOUT (1000*1000*2)
 #define CONN_TIMEOUT (1000*1000*10)

-------------- next part -------------- An HTML attachment was scrubbed...
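The arithmetic behind the change is easy to check. The sketch below assumes the ranges exactly as James states them in the quoted message (0x10000-0x1fffff to be avoided because of SDP, and 0xB0de used by dapltest); the macro and function names are hypothetical and not part of dtest or uDAPL.

```c
#include <assert.h>
#include <stdint.h>

/* Ranges as quoted in the thread; the names are illustrative only. */
#define SDP_RANGE_FIRST    0x10000u
#define SDP_RANGE_LAST     0x1fffffu
#define DAPLTEST_CONN_QUAL 0xB0deu

/* Returns 1 if the qualifier avoids both the SDP range and dapltest's value. */
static int conn_qual_is_free(uint64_t qual)
{
	if (qual >= SDP_RANGE_FIRST && qual <= SDP_RANGE_LAST)
		return 0;
	if (qual == DAPLTEST_CONN_QUAL)
		return 0;
	return 1;
}

/* Sanity checks for the two values that appear in the patch. */
static void check_patch_values(void)
{
	assert(!conn_qual_is_free(71123));	/* 0x115d3: inside the SDP range */
	assert(conn_qual_is_free(45248));	/* 0xB0c0: outside it, and not 0xB0de */
}
```

In other words, the old `SERVER_CONN_QUAL` of 71123 is the 0x115d3 Todd reported, and the replacement 45248 is 0xB0c0, close to but distinct from dapltest's 0xB0de.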
URL:

From rolandd at cisco.com Wed Oct 5 09:04:54 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 05 Oct 2005 09:04:54 -0700
Subject: [openib-general] [PATCH]small cleanup in cache.c
In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 09:54:37 -0600")
References:
Message-ID: <52br24rluh.fsf@cisco.com>

    > -	kfree(old_pkey_cache);
    > -	kfree(old_gid_cache);
    > +	if (old_pkey_cache)
    > +		kfree(old_pkey_cache);
    > +	if (old_gid_cache)
    > +		kfree(old_gid_cache);

This isn't needed and in fact having this check is considered bad
kernel style. The first thing kfree() does is check if the pointer is
NULL, so duplicating this check in the caller just makes the code
bigger.

 - R.

From xma at us.ibm.com Wed Oct 5 09:09:02 2005
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 5 Oct 2005 09:09:02 -0700
Subject: [openib-general] [PATCH]small cleanup in cache.c
In-Reply-To: <52br24rluh.fsf@cisco.com>
Message-ID:

Yes, as long as it's on Linux it's safe.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From xma at us.ibm.com Wed Oct 5 09:52:53 2005
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 5 Oct 2005 09:52:53 -0700
Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA
Message-ID:

One HCA could support 256 ports. The current implementation doesn't support partially successful ports, which is a waste if any one port fails. Also, adding break points to induce errors in each client during registration triggers some of the potential problems. Here is my proposal to enable partial ports. Basically, an upper user would use the bitmap of successful ports of the client it depends on in place of the number of physical ports. I have done some research on each client for enabling partial ports on an HCA, created some patches and tested the idea. Please correct me if my understanding is wrong. Also, if you have other ideas, please share.

cache_client: This client allows partial ports. But ib_cache_update() might fail on a port whose pkey_cache or gid_cache fails to be generated, so the upper-level users can only be allowed on the successful ports, not on all of the HCA's physical ports. Its upper users include: ib_srp, ib_sdp, ib_uverbs, ib_umad, ib_cm, ib_ipoib, ib_sa, ib_mad.

mad_client: This client doesn't allow partial ports. I would like to suggest only enabling a port when both QP0 and QP1 are successful. I don't know of a case where QP0 can be used while QP1 is absent. (You can tell me if there is one.) The upper users are ib_umad, ib_cm, ib_sa.

cm_client: This client doesn't allow partial ports. To enable partial ports, the upper users ib_ucm, ib_srp and ib_sdp can only be allowed on the successful ports.

sa_client: This client doesn't allow partial ports. To enable partial ports, the upper users ib_ipoib, ib_srp, ib_sdp and ib_at can only be allowed on the successful ports.

ipoib_client: This client does allow partial ports.

The number of physical ports should be replaced by each client's successful ports. For example, ipoib_client will be allowed on the sa_client ports bitmap, sa_client on the mad_client ports bitmap, and mad_client on the cache_client ports bitmap.

Adding a bitmap field is not necessary: ib_cache, ib_device, ib_sa_device and cm_device already store all the port info. ib_uat, kdapl and ib_ping should be updated too.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
-------------- next part -------------- An HTML attachment was scrubbed...
URL: From pradeep at us.ibm.com Wed Oct 5 10:40:34 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Wed, 5 Oct 2005 10:40:34 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: Message-ID: One thing that strikes me is to have a single "bit map" (or it's equivalent, implemented in say ib_device). This single "bit map" corresponds to the physical ports. So, each of the higher level modules only references this "bit map" and one does not have mad client "bit map", sa client "bit map" and so on -is my understanding of your proposal correct? With multiple "bit maps" isn't there a risk of these not being in sync, resulting in hard to detect problems? Pradeep pradeep at us.ibm.com openib-general-bounces at openib.org wrote on 10/05/2005 09:52:53 AM: > > One HCA could support 256 ports. The current implementation doesn't > support partially successful ports, which would be a waste if any of > the port failure. And after adding some break points to induce > errors in each client during registration, some of the potential > problems will be triggered. Here is my proposal to enable partial > ports. Basically the upper user's physical ports number is going to > replaced by the successful ports bitmap of the client it depends on. > I have done some research on each client for enabling partially > ports on HCA, and created some patches and tested the idea. Please > correct if my understanding is wrong. Also if you have other idea, > please share. > > cache_client: This client allows partially ports. But > ib_cache_update() might fail on a port whose pkey_cache, gid_cache > fail to be generated, so all the upper level users can be only > allowed on the successful ports not the HCA's physical ports number. > There are 9 upper users there, they are: ib_srp,ib_sdp,ib_uverbs, > ib_umad,ib_cm, ib_ipoib,ib_sa,ib_mad. > > mad_client: This client doesn't allow partially ports. 
I would like > to suggestion only enable the ports when both QP0&QP1 are > successful. Don't know where QP0 can be used while QP1 is absent. > (You can tell me if there is a case.) The upper users are ib_umad, > ib_cm, ib_sa. > > cm_client: This client doesn't allow partial ports. To enable > partial ports, these upper users ib_ucm, ib_srp, ib_sdp can be only > allowed on the successful ports. > > sa_client: This client doesn't allow partial ports. To enable > partial ports, these upper users ib_ipoib, ib_srp, ib_sdp, ib_at can > be only allowed on the successful ports. > > ipoib_client: This client does allow partial ports. > > The number of physical ports should be replaced by each client's > successful ports. For example ipoib_client will be allowed on > sa_client ports bitmap, sa_client will be allowed on mad_client > ports bitmap, mad_client will be allowed on cache_client ports bitmap. > > Adding bitmap field is not necessary, the ib_cache, ib_device, > ib_sa_device, cm_device stored all the ports info. ib_uat & kdapl & > ib_ping should be updated too. > > Thanks > Shirley Ma > IBM Linux Technology Center > 15300 SW Koll Parkway > Beaverton, OR 97006-6063 > Phone(Fax): (503) 578-7638_______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Wed Oct 5 10:50:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 10:50:05 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 09:52:53 -0700") References: Message-ID: <523bnfsvjm.fsf@cisco.com> Shirley> One HCA could support 256 ports. 
The current Shirley> implementation doesn't support partially successful Shirley> ports, which would be a waste if any of the port Shirley> failure. What does "port failure" mean? If it just means that the port is not active, then I think the drivers should still be able to use the port. I don't know of anything in the IB spec that says a port can only be used if its link is up. It seems fantastically unlikely that we'll some HCA failure that means a particular port can never be used but the rest of the HCA continues to work. So I don't think it's worth spending time on that either. Right now my feeling is that we don't want to add the complication entailed by having to track individual HCA ports, just to work around a certain hardware/firmware quirk (which I would argue is in fact a bug). - R. From rolandd at cisco.com Wed Oct 5 10:51:29 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 10:51:29 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 09:52:53 -0700") References: Message-ID: <52y857rgwu.fsf@cisco.com> Shirley> mad_client: This client doesn't allow partially ports. I Shirley> would like to suggestion only enable the ports when both Shirley> QP0&QP1 are successful. Don't know where QP0 can be used Shirley> while QP1 is absent. (You can tell me if there is a Shirley> case.) The upper users are ib_umad, ib_cm, ib_sa. If the drivers can't access QP0 until the port is active, how does one run an SM? - R. From halr at voltaire.com Wed Oct 5 10:54:56 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Oct 2005 13:54:56 -0400 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <52y857rgwu.fsf@cisco.com> References: <52y857rgwu.fsf@cisco.com> Message-ID: <1128534895.4400.399.camel@hal.voltaire.com> On Wed, 2005-10-05 at 13:51, Roland Dreier wrote: > Shirley> mad_client: This client doesn't allow partially ports. 
I
> Shirley> would like to suggestion only enable the ports when both
> Shirley> QP0&QP1 are successful. Don't know where QP0 can be used
> Shirley> while QP1 is absent. (You can tell me if there is a
> Shirley> case.) The upper users are ib_umad, ib_cm, ib_sa.
>
> If the drivers can't access QP0 until the port is active, how does one
> run an SM?

or perhaps also a software based SMA ?

-- Hal

From surs at cse.ohio-state.edu Wed Oct 5 11:36:52 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Wed, 5 Oct 2005 14:36:52 -0400
Subject: [openib-general] segmentation fault in ibv_modify_srq
Message-ID: <20051005183649.GA9036@cse.ohio-state.edu>

Hello,

This is in regard to the use of `ibv_modify_srq' call. When I use this call, I get a segmentation fault. I have included the code snippet, output of strace -ewrite=all command and dmesg output below. I'd be glad if someone could help me get around the problem. Please let me know if additional debug information is required.

TIA,
Sayantan.

Platform: Opteron 2.2GHz, Tyan S2895 motherboard, 2GB memory
OS: Linux 2.6.13.1-smp, SuSe 9.3
Firmware: 5.1.0
OpenIB svn rev: 3665 (the revision number might be off by a little, but this version was checked out yesterday evening 04/10).

Code Snippet:
=============

static void create_srq(void)
{
    struct ibv_srq_init_attr srq_init_attr;
    struct ibv_srq_attr srq_attr;

    memset(&srq_init_attr, 0, sizeof(srq_init_attr));
    memset(&srq_attr, 0, sizeof(srq_attr));

    srq_init_attr.srq_context = ibv_dev.context;
    srq_init_attr.attr.max_wr = viadev_rq_size; // is 300.
    srq_init_attr.attr.max_sge = 1;
    srq_init_attr.attr.srq_limit = 10;

    ibv_dev.srq_hndl = ibv_create_srq(ibv_dev.ptag, &srq_init_attr);

    if(!ibv_dev.srq_hndl) {
        error_abort_all(GEN_EXIT_ERR, "Error creating SRQ\n");
    }

    srq_attr.max_wr = viadev_rq_size;
    srq_attr.max_sge = 1;
    srq_attr.srq_limit = 10;

    // Fails after this call
    if(ibv_modify_srq(ibv_dev.srq_hndl, &srq_attr, IBV_SRQ_LIMIT)) {
        error_abort_all(GEN_EXIT_ERR, "Couldn't modify SRQ limit\n");
    }

    fprintf(stderr,"[%d] limit %d\n", ibv_dev.me, srq_attr.srq_limit);
}

===========
Strace output
===========

[surs at ro0:osu_benchmarks] ../bin/mpirun_rsh -np 2 ro0 ro1 strace -ewrite -ewrite=all ./lat
write(3, "\0\0\0\0\4\0\4\0PT\317\377\377\177\0\0", 16write(3, "\0\0\0\0\4\0\4\0\20\370\233\377\377\177\0\0", 16) = 16
 | 00000 00 00 00 00 04 00 04 00 10 f8 9b ff ff 7f 00 00 ........ ........ |
write(3, "\3\0\0\0\4\0\3\0\320\367\233\377\377\177\0\0", 16) = 16
 | 00000 03 00 00 00 04 00 03 00 d0 f7 9b ff ff 7f 00 00 ........ ........ |
write(3, "\3\0\0\0\4\0\3\0 \370\233\377\377\177\0\0", 16) = 16
 | 00000 03 00 00 00 04 00 03 00 20 f8 9b ff ff 7f 00 00 ........ ....... |
write(3, "\2\0\0\0\6\0\n\0\340\367\233\377\377\177\0\0\1\335\324"..., 24) = 24
 | 00000 02 00 00 00 06 00 0a 00 e0 f7 9b ff ff 7f 00 00 ........ ........ |
 | 00010 01 dd d4 00 00 00 00 00 ........ |
) = 16
 | 00000 00 00 00 00 04 00 04 00 50 54 cf ff ff 7f 00 00 ........ PT...... |
write(3, "\3\0\0\0\4\0\3\0\20T\317\377\377\177\0\0", 16) = 16
 | 00000 03 00 00 00 04 00 03 00 10 54 cf ff ff 7f 00 00 ........ .T...... |
write(3, "\3\0\0\0\4\0\3\0`T\317\377\377\177\0\0", 16) = 16
 | 00000 03 00 00 00 04 00 03 00 60 54 cf ff ff 7f 00 00 ........ `T...... |
write(3, "\2\0\0\0\6\0\n\0 T\317\377\377\177\0\0\1\335\324\0\0\0"..., 24) = 24
 | 00000 02 00 00 00 06 00 0a 00 20 54 cf ff ff 7f 00 00 ........ T...... |
 | 00010 01 dd d4 00 00 00 00 00 ........
| write(3, "\t\0\0\0\f\0\3\0 S\317\377\377\177\0\0\0\20\325\0\0\0\0"..., 48) = 48 | 00000 09 00 00 00 0c 00 03 00 20 53 cf ff ff 7f 00 00 ........ S...... | write(3, "\t\0\0\0\f\0\3\0\340\366\233\377\377\177\0\0\0\20\325\0"..., 48) = 48 | 00000 09 00 00 00 0c 00 03 00 e0 f6 9b ff ff 7f 00 00 ........ ........ | | 00010 00 10 d5 00 00 00 00 00 00 00 20 00 00 00 00 00 ........ .. ..... | | 00020 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ........ ........ | write(3, "\22\0\0\0\22\0\4\0\260\367\233\377\377\177\0\0 \331\324"..., 72) = 72 | 00000 12 00 00 00 12 00 04 00 b0 f7 9b ff ff 7f 00 00 ........ ........ | | 00010 20 d9 d4 00 00 00 00 00 ff ff 00 00 00 00 00 00 ....... ........ | | 00020 ff ff ff ff 00 00 00 00 02 26 00 4c 07 00 12 00 ........ .&.L.... | | 00030 00 40 f5 00 00 00 00 00 00 20 f5 00 00 00 00 00 . at ...... . ...... | | 00040 00 00 00 00 ff 7f 00 00 ........ | write(3, "\t\0\0\0\f\0\3\0 \367\233\377\377\177\0\0\0`\365\0\0\0"..., 48) = 48 | 00000 09 00 00 00 0c 00 03 00 20 f7 9b ff ff 7f 00 00 ........ ....... | | 00010 00 60 f5 00 00 00 00 00 00 80 00 00 00 00 00 00 .`...... ........ | | 00020 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ........ ........ | write(3, " \0\0\0\16\0\3\0\340\367\233\377\377\177\0\0\0\1\325\0"..., 56) = 56 | 00000 20 00 00 00 0e 00 03 00 e0 f7 9b ff ff 7f 00 00 ....... ........ | | 00010 00 01 d5 00 00 00 00 00 01 00 00 00 2c 01 00 00 ........ ....,... | | 00010 00 10 d5 00 00 00 00 00 00 00 20 00 00 00 00 00 ........ .. ..... | | 00020 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 ........ ........ | write(3, "\22\0\0\0\22\0\4\0\360S\317\377\377\177\0\0 \331\324\0"..., 72) = 72 | 00000 12 00 00 00 12 00 04 00 f0 53 cf ff ff 7f 00 00 ........ .S...... | | 00010 20 d9 d4 00 00 00 00 00 ff ff 00 00 00 00 00 00 ....... ........ | | 00020 ff ff ff ff 00 00 00 00 02 26 00 4c 07 00 12 00 ........ .&.L.... | | 00030 00 40 f5 00 00 00 00 00 00 20 f5 00 00 00 00 00 . at ...... . ...... 
| | 00040 00 00 00 00 ff 7f 00 00 ........ | write(3, "\t\0\0\0\f\0\3\0`S\317\377\377\177\0\0\0`\365\0\0\0\0\0"..., 48) = 48 | 00000 09 00 00 00 0c 00 03 00 60 53 cf ff ff 7f 00 00 ........ `S...... | | 00010 00 60 f5 00 00 00 00 00 00 80 00 00 00 00 00 00 .`...... ........ | | 00020 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ........ ........ | write(3, " \0\0\0\16\0\3\0 T\317\377\377\177\0\0\0\1\325\0\0\0\0"..., 56) = 56 | 00020 01 00 00 00 0a 00 00 00 02 27 00 4c fe 7f 00 00 ........ .'.L.... | | 00030 00 20 f5 00 00 00 00 00 . ...... | --- SIGSEGV (Segmentation fault) @ 0 (0) --- | 00000 20 00 00 00 0e 00 03 00 20 54 cf ff ff 7f 00 00 ....... T...... | | 00010 00 01 d5 00 00 00 00 00 01 00 00 00 2c 01 00 00 ........ ....,... | | 00020 01 00 00 00 0a 00 00 00 02 27 00 4c fe 7f 00 00 ........ .'.L.... | | 00030 00 20 f5 00 00 00 00 00 . ...... | --- SIGSEGV (Segmentation fault) @ 0 (0) --- +++ killed by SIGSEGV +++ +++ killed by SIGSEGV +++ dmesg output ============ lat[18631]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fffff9748c8 error 14 lat[18755]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fffffb3aa58 error 14 lat[18777]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fffffe7bb88 error 14 lat[19128]: segfault at 0000000000000000 rip 0000000000000000 rsp 00007fffff942018 error 14 ============ -- http://www.cse.ohio-state.edu/~surs From rolandd at cisco.com Wed Oct 5 11:42:09 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 11:42:09 -0700 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051005183649.GA9036@cse.ohio-state.edu> (Sayantan Sur's message of "Wed, 5 Oct 2005 14:36:52 -0400") References: <20051005183649.GA9036@cse.ohio-state.edu> Message-ID: <52oe63reke.fsf@cisco.com> Sayantan> Hello, This is in regard to the use of `ibv_modify_srq' Sayantan> call. When I use this call, I get a segmentation Sayantan> fault. 
This is because the modify SRQ operation is not implemented at all in
libmthca. Do you just want to set the SRQ limit? That's not so hard
for me to implement. However, you should be aware that as far as I
know, only mem-free HCAs generate the SRQ limit reached event.

 - R.

From xma at us.ibm.com Wed Oct 5 11:56:09 2005
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 5 Oct 2005 11:56:09 -0700
Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA
In-Reply-To: <523bnfsvjm.fsf@cisco.com>
Message-ID:

Port failure here means that a SW client failed to initialize that port. It doesn't matter whether the link is up or down, or whether there is a hardware/firmware problem. If any of these SW errors occurs, the upper users can't use that port correctly, or even the whole device. This is easy to prove: set error points during client registration and then start the upper users. The problems can be a kernel hang or a kernel oops. For example, if mad_client fails to initialize its ports and you start ipoib_client, ifconfig will hang in the kernel. If sa_client fails, the ipoib multicast join will hit a kernel oops. Starting the upper users without checking that the resources they depend on were allocated is buggy.

It is definitely worth spending time to address this, and the complication is only added to client registration. The port info is stored in ib_device, ib_cache, ib_sa_device and cm_device, so it's not hard to fix.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From xma at us.ibm.com Wed Oct 5 12:04:29 2005
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 5 Oct 2005 12:04:29 -0700
Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA
In-Reply-To:
Message-ID:

> One thing that strikes me is to have a single "bit map" (or it's equivalent, implemented in say ib_device). This single "bit map" corresponds to the physical ports. So, each of the higher level modules only references this "bit map" and one does not have mad client "bit map", sa client "bit map" and so on -is my understanding of your proposal correct? With multiple "bit maps" isn't there a risk of these not being in sync, resulting in hard to detect problems?

There isn't really a bitmap there. I only used the term to make the idea easier to understand. Client registration happens in sequence, so the resource dependency needs to be checked before an upper client starts registering on that port.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

From rolandd at cisco.com Wed Oct 5 12:06:57 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 05 Oct 2005 12:06:57 -0700
Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA
In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 11:56:09 -0700")
References:
Message-ID: <52k6grrdf2.fsf@cisco.com>

    Shirley> The port failure means the SW clients initilization of
    Shirley> that port failure. Doesn't matter whether the link is
    Shirley> up/down or the hardware/firmare problem. If encountering
    Shirley> any of the SW errors, the upper users can't use that port
    Shirley> correctly, or even the whole device correctly. It's
    Shirley> easily to prove that if you set error points during
    Shirley> client registration and start the upper users. The
    Shirley> problems could be kernel hung, kernel oops. For example,
    Shirley> if mad_client initilization ports failure and you start
    Shirley> ipoib_client.
ifconfig will hung in kernel. If sa_client Shirley> failure, the ipoib multicast join will hit kernel Shirley> oops. Staring the upper users without checking the Shirley> depency resouce allocation is buggy. It is definitely Shirley> worth to spend time to address this. Yes, I agree we should fix the bugs in error handling during registration. However, I don't think that a mask of ports is the right answer -- it doesn't seem to address the real issue. We should just make sure that if, say, the MAD layer fails to initialize a device, then all clients that depend on the MAD layer don't try to use that device. I'm not sure what the right way to express these dependencies is, however. - R. From rolandd at cisco.com Wed Oct 5 12:09:40 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 12:09:40 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 12:04:29 -0700") References: Message-ID: <52fyrfrdaj.fsf@cisco.com> Shirley> There is not a really bitmap there. I just use it to be Shirley> easily understood. The client registration has Shirley> sequence. Checking resouce dependency is needed to start Shirley> upper client registration on that port. It's not a strict sequence, however. If the CM fails to initialize a device, then SDP and SRP cannot use that device. However, IPoIB can use the device just fine, even if it loads after the CM. Similarly, if SDP fails to initialize a device, then SRP should not be affected even if it loads after SDP. And so on. - R. 
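[Editor's sketch, not part of the original thread.] The point made in the two messages above, that usability depends on which layers each client needs rather than on load order, can be modelled in a few lines of plain C. Everything here is hypothetical: the `layer` enum, the `client` table and `client_can_use()` are illustration-only names, and the dependency sets are simply the ones listed earlier in the thread (SDP and SRP need the CM; IPoIB needs the SA and MAD layers but not the CM). It is not how the kernel expresses these dependencies.

```c
#include <stdbool.h>

/* Hypothetical layer identifiers; the names mirror the modules in the
 * thread but are not taken from any kernel header. */
enum layer { LAYER_CACHE, LAYER_MAD, LAYER_SA, LAYER_CM };

#define NEEDS(l) (1u << (l))

struct client {
	const char *name;
	unsigned int needs;	/* bitmask of layers this client depends on */
};

/* Dependency sets as described in the thread (an assumption of this
 * sketch, not read from the source tree). */
static const struct client clients[] = {
	{ "ib_ipoib", NEEDS(LAYER_CACHE) | NEEDS(LAYER_MAD) | NEEDS(LAYER_SA) },
	{ "ib_sdp",   NEEDS(LAYER_CACHE) | NEEDS(LAYER_MAD) | NEEDS(LAYER_SA) |
		      NEEDS(LAYER_CM) },
	{ "ib_srp",   NEEDS(LAYER_CACHE) | NEEDS(LAYER_MAD) | NEEDS(LAYER_SA) |
		      NEEDS(LAYER_CM) },
};

/* A client may use a device only if every layer it depends on managed
 * to initialize that device, regardless of the order modules loaded. */
static bool client_can_use(const struct client *c, unsigned int initialized)
{
	return (c->needs & initialized) == c->needs;
}
```

With the CM bit cleared in `initialized`, `client_can_use()` is true for the ib_ipoib entry and false for ib_sdp and ib_srp, which matches the example in the message above.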
From ftillier at silverstorm.com Wed Oct 5 12:10:48 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 5 Oct 2005 12:10:48 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: Message-ID: <000a01c5c9e0$8014c7f0$9601470a@infiniconsys.com> > From: Shirley Ma [mailto:xma at us.ibm.com] > Sent: Wednesday, October 05, 2005 11:56 AM > > The port failure means the SW clients initilization of that port failure. > Doesn't matter whether the link is up/down or the hardware/firmare problem. If > encountering any of the SW errors, the upper users can't use that port > correctly, or even the whole device correctly. It's easily to prove that if > you set error points during client registration and start the upper users. The > problems could be kernel hung, kernel oops. For example, if mad_client > initilization ports failure and you start ipoib_client. ifconfig will hung in > kernel. If sa_client failure, the ipoib multicast join will hit kernel oops. > Staring the upper users without checking the depency resouce allocation is > buggy. It is definitely worth to spend time to address this. This sounds like bugs in the code where we don't trap failures gracefully. I think fixing that is probably much more useful. There will always be situations where runtime errors can occur (memory allocation failure, for example), and all upper level protocols must handle failures of these calls. Putting in code and requiring every client to compare all the various bit fields they're interested in doesn't remove the need for proper error handling. Proper error handling should resolve both the ifconfig hang and multicast join oops. 
Just my $0.02 - Fab From surs at cse.ohio-state.edu Wed Oct 5 12:09:37 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 5 Oct 2005 15:09:37 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <52oe63reke.fsf@cisco.com> References: <20051005183649.GA9036@cse.ohio-state.edu> <52oe63reke.fsf@cisco.com> Message-ID: <20051005190934.GA9412@cse.ohio-state.edu> Roland, * On Oct,2 Roland Dreier wrote : > Sayantan> Hello, This is in regard to the use of `ibv_modify_srq' > Sayantan> call. When I use this call, I get a segmentation > Sayantan> fault. > > This is because the modify SRQ operation is not implemented at all in > libmthca. Do you just want to set the SRQ limit? That's not so hard > for me to implement. However, you should be aware that as far as I > know, only mem-free HCAs generate the SRQ limited reached event. Thanks for your reply. Yes, I want to set a SRQ limit. Yes, I am aware that only mem-free HCAs generate SRQ limit reached event. I am trying this on a Mem-free HCA. If you could implement this feature, that would be really great! Thanks, Sayantan. > > - R. -- http://www.cse.ohio-state.edu/~surs From ftillier at silverstorm.com Wed Oct 5 12:15:49 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Wed, 5 Oct 2005 12:15:49 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <52k6grrdf2.fsf@cisco.com> Message-ID: <000b01c5c9e1$332eb210$9601470a@infiniconsys.com> > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Wednesday, October 05, 2005 12:07 PM > > Shirley> The port failure means the SW clients initilization of > Shirley> that port failure. Doesn't matter whether the link is > Shirley> up/down or the hardware/firmare problem. If encountering > Shirley> any of the SW errors, the upper users can't use that port > Shirley> correctly, or even the whole device correctly. 
It's > Shirley> easily to prove that if you set error points during > Shirley> client registration and start the upper users. The > Shirley> problems could be kernel hung, kernel oops. For example, > Shirley> if mad_client initilization ports failure and you start > Shirley> ipoib_client. ifconfig will hung in kernel. If sa_client > Shirley> failure, the ipoib multicast join will hit kernel > Shirley> oops. Staring the upper users without checking the > Shirley> depency resouce allocation is buggy. It is definitely > Shirley> worth to spend time to address this. > > Yes, I agree we should fix the bugs in error handling during > registration. However, I don't think that a mask of ports is the > right answer -- it doesn't seem to address the real issue. We should > just make sure that if, say, the MAD layer fails to initialize a > device, then all clients that depend on the MAD layer don't try to use > that device. Shouldn't a user get an error (not an oops) if they try to use the MAD layer for a device that didn't initialize properly within the MAD layer? Doesn't the MAD layer trap that device requests are valid? It seems that adding such checks would be much simpler to implement, rather than trying to figure out how to express these limitations to the various ULPs. - Fab From mshefty at ichips.intel.com Wed Oct 5 12:16:21 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 05 Oct 2005 12:16:21 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <52k6grrdf2.fsf@cisco.com> References: <52k6grrdf2.fsf@cisco.com> Message-ID: <43442685.1070406@ichips.intel.com> Roland Dreier wrote: > Yes, I agree we should fix the bugs in error handling during > registration. However, I don't think that a mask of ports is the > right answer -- it doesn't seem to address the real issue. 
We should > just make sure that if, say, the MAD layer fails to initialize a > device, then all clients that depend on the MAD layer don't try to use > that device. I'm not sure what the right way to express these > dependencies is, however. One possibility is to have each layer verify the device/port parameters. The MAD layer can verify that the specified device/port are valid in ib_register_mad_agent(). Similar for other other modules. We also have the port capability mask available that could be used. - Sean From rolandd at cisco.com Wed Oct 5 12:16:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 12:16:24 -0700 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051005190934.GA9412@cse.ohio-state.edu> (Sayantan Sur's message of "Wed, 5 Oct 2005 15:09:37 -0400") References: <20051005183649.GA9036@cse.ohio-state.edu> <52oe63reke.fsf@cisco.com> <20051005190934.GA9412@cse.ohio-state.edu> Message-ID: <52br23rczb.fsf@cisco.com> Sayantan> If you could implement this feature, that would be Sayantan> really great! OK, there's not much left to do. I should have something to check in today. I'll let you know when it's ready. - R. From rolandd at cisco.com Wed Oct 5 12:24:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 05 Oct 2005 12:24:05 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <000b01c5c9e1$332eb210$9601470a@infiniconsys.com> (Fab Tillier's message of "Wed, 5 Oct 2005 12:15:49 -0700") References: <000b01c5c9e1$332eb210$9601470a@infiniconsys.com> Message-ID: <524q7vrcmi.fsf@cisco.com> Fab> Shouldn't a user get an error (not an oops) if they try to Fab> use the MAD layer for a device that didn't initialize Fab> properly within the MAD layer? Doesn't the MAD layer trap Fab> that device requests are valid? 
It seems that adding such
    Fab> checks would be much simpler to implement, rather than trying
    Fab> to figure out how to express these limitations to the various
    Fab> ULPs.

Yeah, I guess that makes sense, although it exercises the upper
layers' error paths more. All of the modules that export interfaces
used by other layers have to be prepared for a device that they failed
to initialize, and the upper layers have to be prepared for lower
layers to fail.

 - R.

From rolandd at cisco.com Wed Oct 5 12:25:00 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 05 Oct 2005 12:25:00 -0700
Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA
In-Reply-To: <000a01c5c9e0$8014c7f0$9601470a@infiniconsys.com> (Fab Tillier's message of "Wed, 5 Oct 2005 12:10:48 -0700")
References: <000a01c5c9e0$8014c7f0$9601470a@infiniconsys.com>
Message-ID: <52zmpnpy0j.fsf@cisco.com>

    Fab> Proper error handling should resolve both the ifconfig hang
    Fab> and multicast join oops.

To be honest, I'm not familiar with the ifconfig hang, but I don't
think the multicast join oops is caused by lack of error handling.
It's some small race somewhere.

 - R.

From rolandd at cisco.com Wed Oct 5 12:40:20 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 05 Oct 2005 12:40:20 -0700
Subject: [openib-general] segmentation fault in ibv_modify_srq
In-Reply-To: <20051005183649.GA9036@cse.ohio-state.edu> (Sayantan Sur's message of "Wed, 5 Oct 2005 14:36:52 -0400")
References: <20051005183649.GA9036@cse.ohio-state.edu>
Message-ID: <52vf0bpxaz.fsf@cisco.com>

OK, I just checked in an initial implementation of both setting the
SRQ limit with the modify SRQ verb, and also getting SRQ limit reached
events when they occur. You will need to update your kernel drivers,
libibverbs and libmthca to get this.

I've done zero testing, so please let me know how it works. You
should at least get an interesting new failure.

 - R.
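[Editor's sketch, not part of the original thread.] For readers following the SRQ discussion, here is a toy model in plain C of how an SRQ limit is commonly understood to behave. It assumes, and the thread itself does not confirm this, that the limit event fires once when the number of queued receive work requests drops below the limit, and that the limit is then disarmed until the consumer sets it again. The `toy_srq` type and its functions are invented for illustration and are not the libibverbs API.

```c
#include <stdbool.h>

/* Toy model of an SRQ limit watermark; hypothetical, not libibverbs. */
struct toy_srq {
	unsigned int posted;	/* receive WRs currently queued */
	unsigned int limit;	/* 0 means the limit event is disarmed */
};

/* Stands in for a modify-SRQ call that sets only the limit attribute. */
static void toy_srq_arm(struct toy_srq *srq, unsigned int limit)
{
	srq->limit = limit;
}

/* Stands in for posting n receive work requests. */
static void toy_srq_post(struct toy_srq *srq, unsigned int n)
{
	srq->posted += n;
}

/* Consume one WR. Returns true if this consumption would raise a
 * "limit reached" event. The event is modelled as one-shot: the limit
 * is cleared, so the consumer has to refill and re-arm. */
static bool toy_srq_consume(struct toy_srq *srq)
{
	if (srq->posted == 0)
		return false;
	srq->posted--;
	if (srq->limit != 0 && srq->posted < srq->limit) {
		srq->limit = 0;
		return true;
	}
	return false;
}
```

Under these assumptions, an application that wants continuous low-watermark notification would post more receives and re-arm the limit each time the event is delivered.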
From xma at us.ibm.com Wed Oct 5 13:49:50 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Oct 2005 13:49:50 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <524q7vrcmi.fsf@cisco.com> Message-ID: Fab> Shouldn't a user get an error (not an oops) if they try to Fab> use the MAD layer for a device that didn't initialize Fab> properly within the MAD layer? Doesn't the MAD layer trap Fab> that device requests are valid? It seems that adding such Fab> checks would be much simpler to implement, rather than trying Fab> to figure out how to express these limitations to the various Fab> ULPs. > Yeah, I guess that makes sense, although it exercises the upper > layers' error paths more. All of the modules that export interfaces > used by other layers have to be prepared for a device that they failed > to initialize, and the upper layers have to be prepared for lower > layers to fail. These two approches are both need to go through each layer. The difference is one prevents the error happen earlier, another one detects the error later, which would be a better solution if the error could happen later. It's necessary to modify the ib_mad, ib_sa, ib_cm, just act like ib_ipoib and ib_cache to continue initializing when one port encounting errors, instead of releasing all resouces. If you agree, I am creating as the first patch for review. How to handler the errors would be the second patch. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From jlentini at netapp.com Wed Oct 5 14:04:54 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 5 Oct 2005 17:04:54 -0400 (EDT)
Subject: [openib-general] ib_cm_listen failure
In-Reply-To:
References: <433C2ADF.4010402@ichips.intel.com>
Message-ID:

On Wed, 5 Oct 2005, Todd Bowman wrote:

> Here is a patch for dtest.c to remove the qualifier from the sdp range.
>
> Index: userspace/dapl/test/dtest/dtest.c
> ===================================================================
> --- userspace/dapl/test/dtest/dtest.c (revision 3547)
> +++ userspace/dapl/test/dtest/dtest.c (working copy)
> @@ -53,7 +53,7 @@
> #include "dat/udat.h"
>
> /* definitions */
> -#define SERVER_CONN_QUAL 71123
> +#define SERVER_CONN_QUAL 45248
> #define DTO_TIMEOUT (1000*1000*5)
> #define DTO_FLUSH_TIMEOUT (1000*1000*2)
> #define CONN_TIMEOUT (1000*1000*10)

Thanks Todd. I don't mean to nit pick, but do you mind throwing a
Signed-off-by line on it?

From rolandd at cisco.com Wed Oct 5 14:24:50 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 05 Oct 2005 14:24:50 -0700
Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA
In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 13:49:50 -0700")
References:
Message-ID: <52psqjpsgt.fsf@cisco.com>

    Shirley> It's necessary to modify the ib_mad, ib_sa, ib_cm, just
    Shirley> act like ib_ipoib and ib_cache to continue initializing
    Shirley> when one port encountering errors, instead of releasing all
    Shirley> resources. If you agree, I am creating as the first patch
    Shirley> for review. How to handle the errors would be the second
    Shirley> patch.

I don't agree that we want to handle "half-usable" devices where some
ports don't work. The only use for this seems to be working around
some problems with the current Galaxy HCA implementation, and there
must be a better way to handle this.

You're welcome to prove me wrong, but I think that handling ports that
are not usable and then become usable later is just going to be
horrible. And if we do that, then I think it would make sense to
handle ports starting out usable and then becoming unusable later --
and I think that's going to be even worse still.

I do agree that we want to handle errors in initialization better.
The ib_mad and ib_cm code actually looks OK to me (with a small bug in
ib_mad for which I'll post a patch shortly). I think something like
the patch below is all that's needed to fix ib_sa:

--- infiniband/core/sa_query.c (revision 3664)
+++ infiniband/core/sa_query.c (working copy)
@@ -583,10 +583,16 @@ int ib_sa_path_rec_get(struct ib_device
 {
 	struct ib_sa_path_query *query;
 	struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client);
-	struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port];
-	struct ib_mad_agent *agent = port->agent;
+	struct ib_sa_port *port;
+	struct ib_mad_agent *agent;
 	int ret;

+	if (!sa_dev)
+		return -ENODEV;
+
+	port = &sa_dev->port[port_num - sa_dev->start_port];
+	agent = port->agent;
+
 	query = kmalloc(sizeof *query, gfp_mask);
 	if (!query)
 		return -ENOMEM;
@@ -685,10 +691,16 @@ int ib_sa_service_rec_query(struct ib_de
 {
 	struct ib_sa_service_query *query;
 	struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client);
-	struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port];
-	struct ib_mad_agent *agent = port->agent;
+	struct ib_sa_port *port;
+	struct ib_mad_agent *agent;
 	int ret;

+	if (!sa_dev)
+		return -ENODEV;
+
+	port = &sa_dev->port[port_num - sa_dev->start_port];
+	agent = port->agent;
+
 	if (method != IB_MGMT_METHOD_GET &&
 	    method != IB_MGMT_METHOD_SET &&
 	    method != IB_SA_METHOD_DELETE)
@@ -768,10 +780,16 @@ int ib_sa_mcmember_rec_query(struct ib_d
 {
 	struct ib_sa_mcmember_query *query;
 	struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client);
-	struct ib_sa_port *port = &sa_dev->port[port_num -
sa_dev->start_port];
-	struct ib_mad_agent *agent = port->agent;
+	struct ib_sa_port *port;
+	struct ib_mad_agent *agent;
 	int ret;

+	if (!sa_dev)
+		return -ENODEV;
+
+	port = &sa_dev->port[port_num - sa_dev->start_port];
+	agent = port->agent;
+
 	query = kmalloc(sizeof *query, gfp_mask);
 	if (!query)
 		return -ENOMEM;

From rolandd at cisco.com Wed Oct 5 14:25:56 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 05 Oct 2005 14:25:56 -0700
Subject: [openib-general] [PATCH] Fix leak on MAD initialization failure
In-Reply-To: <52psqjpsgt.fsf@cisco.com> (Roland Dreier's message of "Wed, 05 Oct 2005 14:24:50 -0700")
References: <52psqjpsgt.fsf@cisco.com>
Message-ID: <52ll17psez.fsf_-_@cisco.com>

It seems that there is a bug in ib_mad_init_device(): if
ib_agent_port_open() fails for a given port, then the current code
doesn't call ib_mad_port_close() for that port. I think something
like the patch below is needed.

Signed-off-by: Roland Dreier

--- infiniband/core/mad.c (revision 3664)
+++ infiniband/core/mad.c (working copy)
@@ -2683,40 +2683,47 @@ static int ib_mad_port_close(struct ib_d

 static void ib_mad_init_device(struct ib_device *device)
 {
-	int num_ports, cur_port, i;
+	int start, end, i;

 	if (device->node_type == IB_NODE_SWITCH) {
-		num_ports = 1;
-		cur_port = 0;
+		start = 0;
+		end = 0;
 	} else {
-		num_ports = device->phys_port_cnt;
-		cur_port = 1;
+		start = 1;
+		end = device->phys_port_cnt;
 	}
-	for (i = 0; i < num_ports; i++, cur_port++) {
-		if (ib_mad_port_open(device, cur_port)) {
+
+	for (i = start; i <= end; i++) {
+		if (ib_mad_port_open(device, i)) {
 			printk(KERN_ERR PFX "Couldn't open %s port %d\n",
-				device->name, cur_port);
-			goto error_device_open;
+				device->name, i);
+			goto error;
 		}
-		if (ib_agent_port_open(device, cur_port)) {
+		if (ib_agent_port_open(device, i)) {
 			printk(KERN_ERR PFX "Couldn't open %s port %d "
 				"for agents\n",
-				device->name, cur_port);
-			goto error_device_open;
+				device->name, i);
+			goto error_agent;
 		}
 	}
 	return;

-error_device_open:
-	while (i > 0) {
-		cur_port--;
-		if (ib_agent_port_close(device, cur_port))
+error_agent:
+	if (ib_mad_port_close(device, i))
+		printk(KERN_ERR PFX "Couldn't close %s port %d\n",
+			device->name, i);
+
+error:
+	i--;
+
+	while (i >= start) {
+		if (ib_agent_port_close(device, i))
 			printk(KERN_ERR PFX "Couldn't close %s port %d "
 				"for agents\n",
-				device->name, cur_port);
-		if (ib_mad_port_close(device, cur_port))
+				device->name, i);
+		if (ib_mad_port_close(device, i))
 			printk(KERN_ERR PFX "Couldn't close %s port %d\n",
-				device->name, cur_port);
+				device->name, i);
 		i--;
 	}
 }

From surs at cse.ohio-state.edu Wed Oct 5 14:24:50 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Wed, 5 Oct 2005 17:24:50 -0400
Subject: [openib-general] segmentation fault in ibv_modify_srq
In-Reply-To: <52vf0bpxaz.fsf@cisco.com>
References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com>
Message-ID: <20051005212448.GA10612@cse.ohio-state.edu>

Roland,

* On Oct,5 Roland Dreier wrote :
> OK, I just checked in an initial implementation of both setting the
> SRQ limit with the modify SRQ verb, and also getting SRQ limit reached
> events when they occur. You will need to update your kernel drivers,
> libibverbs and libmthca to get this.

Thanks a lot for checking this in so quickly! I got the changes and
updated our systems.

>
> I've done zero testing, so please let me know how it works. You
> should at least get an interesting new failure.

With your changes the `ibv_modify_qp' works. I will have the "message
passing" part done sometime soon. If I see any failure, I'll report it
to this reflector.

Thanks,
Sayantan.

>
> - R.

--
http://www.cse.ohio-state.edu/~surs

From mlleini at ca.sandia.gov Wed Oct 5 14:32:02 2005
From: mlleini at ca.sandia.gov (Matt L.
Leininger) Date: Wed, 05 Oct 2005 14:32:02 -0700 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051005190934.GA9412@cse.ohio-state.edu> References: <20051005183649.GA9036@cse.ohio-state.edu> <52oe63reke.fsf@cisco.com> <20051005190934.GA9412@cse.ohio-state.edu> Message-ID: <1128547922.13952.184.camel@localhost> On Wed, 2005-10-05 at 15:09 -0400, Sayantan Sur wrote: > > This is because the modify SRQ operation is not implemented at all in > > libmthca. Do you just want to set the SRQ limit? That's not so hard > > for me to implement. However, you should be aware that as far as I > > know, only mem-free HCAs generate the SRQ limited reached event. > > Thanks for your reply. Yes, I want to set a SRQ limit. Yes, I am aware > that only mem-free HCAs generate SRQ limit reached event. I am trying > this on a Mem-free HCA. Is this due to memfree vs. memfull hardware or firmware difference? If you flash the memfull HCA with the memfree firmware (which I was told you can do) will the HCA generate an SRQ limit reached event? Thanks, - Matt From xma at us.ibm.com Wed Oct 5 14:59:33 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 5 Oct 2005 14:59:33 -0700 Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA In-Reply-To: <52psqjpsgt.fsf@cisco.com> Message-ID: > I don't agree that we want to handle "half-usable" devices where some > ports don't work. The only use for this seems to be working around > some problems with the current Galaxy HCA implementation, and there > must be a better way to handle this. > You're welcome to prove me wrong, but I think that handling ports that > are not usable and then become usable later is just going to be > horrible. And if we do that, then I think it would make sense to > handle ports starting out usable and then becoming unusable later -- > and I think that's going to be even worse still. I don't think we handle "half-usable" devices here. 
We treat each port as an individual "device" in many layers; ports are
independent of each other. For an HCA, which could have as many as 256
ports, I think it makes more sense to handle this per port, not per
HCA device.

Second, the IB SW stack shouldn't prevent any implementation from
handling ports that become usable later. The SW implementation should
support all kinds of HCA implementations. It doesn't matter whether
they are IBM HCAs or HCAs from other vendors in the future.

Third, the ib_cache & ib_ipoib implementations actually allow
"half-usable" devices. They allow other ports to initialize while one
port has errors.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rolandd at cisco.com Wed Oct 5 14:59:57 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 05 Oct 2005 14:59:57 -0700
Subject: [openib-general] segmentation fault in ibv_modify_srq
In-Reply-To: <1128547922.13952.184.camel@localhost> (Matt L. Leininger's message of "Wed, 05 Oct 2005 14:32:02 -0700")
References: <20051005183649.GA9036@cse.ohio-state.edu> <52oe63reke.fsf@cisco.com> <20051005190934.GA9412@cse.ohio-state.edu> <1128547922.13952.184.camel@localhost>
Message-ID: <52hdbvpqua.fsf@cisco.com>

    Matt> Is this due to memfree vs. memfull hardware or firmware
    Matt> difference? If you flash the memfull HCA with the memfree
    Matt> firmware (which I was told you can do) will the HCA generate
    Matt> an SRQ limit reached event?

I believe it's a firmware difference. There are basically three
Mellanox HCA chips:

  MT23108 - PCI-X - memfull only (FW 3.x.y)
  MT25208 - 2 port PCI Express - memfull (FW 4.x.y) or memfree (FW 5.x.y)
            memfree FW will work even if HCA board has memory on it.
            Obviously memfree FW is required if the HCA board has no memory.
  MT25204 - 1 port PCI Express - memfree only (FW 1.x.y)

Any HCA that works with memfree FW (i.e., any PCI Express HCA) should
be able to generate SRQ limit events. In the current FW release,
memfull HCAs do not generate SRQ limit events.

 - R.

From rolandd at cisco.com Wed Oct 5 15:57:18 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 05 Oct 2005 15:57:18 -0700
Subject: [openib-general] [PATCH]proposal for enabling partial ports on HCA
In-Reply-To: (Shirley Ma's message of "Wed, 5 Oct 2005 14:59:33 -0700")
References:
Message-ID: <52d5mjpo6p.fsf@cisco.com>

    Shirley> I don't think we handle "half-usable" devices here. We
    Shirley> treat each port as an individual "device" in many layers,
    Shirley> ports to ports are independent. For each HCA which could
    Shirley> be as many as 256 ports, I think it makes more sense to
    Shirley> handle per port, not per HCA device based.

The problem with this view is that the HCA is really the fundamental
object in the model described in the IB spec. Most transport
resources are attached to an HCA, not a port. In fact, with APM, a QP
might be attached to two different ports at the same time.

    Shirley> Second, The IB SW stack shouldn't prevent any
    Shirley> implementation from handling later ports becoming
    Shirley> usable. The SW implementation should support all kinds of
    Shirley> HCA implementations. Doesn't matter if it is IBM HCAs or
    Shirley> HCAs from other vendors in the future.

I definitely don't want to block support for IBM HCAs. However, at
the same time I don't want to make the IB stack more complex, more
error-prone, etc. just to work around what I would argue is a bug in
your firmware.

    Shirley> Third ib_cache & ib_ipoib implementation actually allow
    Shirley> "half-usable" devices. It allows other ports initializing
    Shirley> while one port has errors.

It seems cache.c actually bails out if it fails to allocate space for
one HCA port.
IPoIB does indeed proceed even if one port fails, but that's more
because there's no real reason to bail out halfway rather than wanting
to support half-usable devices.

I don't object much to making layers that really are per-port work
that way. What worries me is trying to fix everything to work sanely
with individual ports becoming usable or unusable after an HCA has
been attached to the system.

I guess we'll have to wait and see how convincing your patches are.

 - R.

From sean.hefty at intel.com Wed Oct 5 16:15:17 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 5 Oct 2005 16:15:17 -0700
Subject: [openib-general] [PATCH] Fix leak on MAD initialization failure
In-Reply-To: <52ll17psez.fsf_-_@cisco.com>
Message-ID:

>It seems that there is a bug in ib_mad_init_device(): if
>ib_agent_port_open() fails for a given port, then the current code
>doesn't call ib_mad_port_close() for that port. I think something
>like the patch below is needed.

The patch looks fine. Did you want to commit this, or have myself or
Hal do it?

- Sean

From xma at us.ibm.com Wed Oct 5 16:17:03 2005
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 5 Oct 2005 16:17:03 -0700
Subject: [openib-general] Re: [PATCH] Fix leak on MAD initialization failure
In-Reply-To: <52ll17psez.fsf_-_@cisco.com>
Message-ID:

Yes. I found the problem too.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rolandd at cisco.com Wed Oct 5 16:22:58 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 05 Oct 2005 16:22:58 -0700
Subject: [openib-general] [PATCH] Fix leak on MAD initialization failure
In-Reply-To: (Sean Hefty's message of "Wed, 5 Oct 2005 16:15:17 -0700")
References:
Message-ID: <527jcrpmzx.fsf@cisco.com>

    Sean> The patch looks fine. Did you want to commit this, or have
    Sean> myself or Hal do it?

I'll do it in a little while unless you beat me to it. - R. From surs at cse.ohio-state.edu Wed Oct 5 19:15:31 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 5 Oct 2005 22:15:31 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <52vf0bpxaz.fsf@cisco.com> References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> Message-ID: <20051006021529.GA14502@cse.ohio-state.edu> Roland, * On Oct,7 Roland Dreier wrote : > OK, I just checked in an initial implementation of both setting the > SRQ limit with the modify SRQ verb, and also getting SRP limit reached > events when the occur. You will need to update your kernel drivers, > libibverbs and libmthca to get this. > > I've done zero testing, so please let me know how it works. You > should at least get an interesting new failure. I am getting a segmentation fault after a couple of thousand messages are sent over SRQ (using ping-pong latency test). Here is a snippet from the core generated. Let me know what you think about this. Thanks, Sayantan. 
=============
#0 0x00002aaaab238faa in mthca_poll_cq (ibcq=0xd4b920, ne=1, wc=0x7fffff957f90) at cq.c:336
336 wc->wr_id = srq->wrid[wqe_index];
(gdb) bt
#0 0x00002aaaab238faa in mthca_poll_cq (ibcq=0xd4b920, ne=1, wc=0x7fffff957f90) at cq.c:336
#1 0x00000000004151f5 in MPID_DeviceCheck (blocking=MPID_BLOCKING) at verbs.h:746
#2 0x000000000042101c in MPID_RecvComplete (request=0x7fffff958030, status=0x7fffff958230, error_code=0x7fffff958184) at mpid_recv.c:90
#3 0x000000000041791c in MPID_RecvDatatype (comm_ptr=0xf5e9d0, buf=0x536280, count=2, dtype_ptr=0xd36f60, src_lrank=0, tag=1, context_id=0, status=0x7fffff958230, error_code=0x7fffff958184) at mpid_hrecv.c:89
#4 0x0000000000402586 in PMPI_Recv (buf=0x536280, count=2, datatype=, source=0, tag=1, comm=, status=0x7fffff958230) at recv.c:87
#5 0x00000000004020a9 in main ()
(gdb) f 0
#0 0x00002aaaab238faa in mthca_poll_cq (ibcq=0xd4b920, ne=1, wc=0x7fffff957f90) at cq.c:336
336 wc->wr_id = srq->wrid[wqe_index];
(gdb) list
331 	} else if ((*cur_qp)->ibv_qp.srq) {
332 		srq = to_msrq((*cur_qp)->ibv_qp.srq);
333 		wqe = htonl(cqe->wqe);
334 		wq = NULL;
335 		wqe_index = wqe >> srq->wqe_shift;
336 		wc->wr_id = srq->wrid[wqe_index];
337 		mthca_free_srq_wqe(srq, wqe);
338 	} else {
339 		wq = &(*cur_qp)->rq;
340 		wqe_index = ntohl(cqe->wqe) >> wq->wqe_shift;

>
> - R.

--
http://www.cse.ohio-state.edu/~surs

From oljpjqvhbvze at msn.com Wed Oct 5 19:06:29 2005
From: oljpjqvhbvze at msn.com (Eunice Hager)
Date: Thu, 6 Oct 2005 03:06:29 +0100
Subject: [openib-general] Suppress your appetite
Message-ID: <42.916.92.@msn.com>

You've seen it on "60 Minutes" and read the BBC News report -- now find
out just what everyone is talking about.

# Suppress your appetite and feel full and satisfied all day long
# Increase your energy levels
# Lose excess weight
# Increase your metabolism
# Burn body fat
# Burn calories
# Attack obesity

And more..
http://hrusmiafc.info/

# Suitable for vegetarians and vegans
# MAINTAIN your weight loss
# Make losing weight a sure guarantee
# Look your best during the summer months

http://hrusmiafc.info/

Regards,
Dr. Eunice Hager

From rolandd at cisco.com Wed Oct 5 21:35:11 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 05 Oct 2005 21:35:11 -0700
Subject: [openib-general] segmentation fault in ibv_modify_srq
In-Reply-To: <20051006021529.GA14502@cse.ohio-state.edu> (Sayantan Sur's message of "Wed, 5 Oct 2005 22:15:31 -0400")
References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> <20051006021529.GA14502@cse.ohio-state.edu>
Message-ID: <523bnfp8jk.fsf@cisco.com>

    Sayantan> I am getting a segmentation fault after a couple of
    Sayantan> thousand messages are sent over SRQ (using ping-pong
    Sayantan> latency test). Here is a snippet from the core
    Sayantan> generated.

Is it possible that you are posting one more receive to the SRQ than
the max capacity you requested when creating the SRQ?

What happens with the patch below applied to libmthca?

Thanks,
  Roland


--- libmthca/src/srq.c (revision 3664)
+++ libmthca/src/srq.c (working copy)
@@ -110,6 +110,13 @@ int mthca_tavor_post_srq_recv(struct ibv

 		wqe = get_wqe(srq, ind);
 		next_ind = *wqe_to_link(wqe);
+
+		if (next_ind < 0) {
+			err = -1;
+			*bad_wr = wr;
+			break;
+		}
+
 		prev_wqe = srq->last;
 		srq->last = wqe;

@@ -197,6 +204,12 @@ int mthca_arbel_post_srq_recv(struct ibv
 		wqe = get_wqe(srq, ind);
 		next_ind = *wqe_to_link(wqe);

+		if (next_ind < 0) {
+			err = -1;
+			*bad_wr = wr;
+			break;
+		}
+
 		((struct mthca_next_seg *) wqe)->nda_op =
 			htonl((next_ind << srq->wqe_shift) | 1);
 		((struct mthca_next_seg *) wqe)->ee_nds = 0;

From mst at mellanox.co.il Thu Oct 6 00:12:51 2005
From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 6 Oct 2005 09:12:51 +0200 Subject: [openib-general] Updating firmware In-Reply-To: <433D820B.10100@dbresearch.net> References: <433D820B.10100@dbresearch.net> Message-ID: <20051006071251.GC8114@mellanox.co.il> Quoting Sean Hubbell : > Michael, > > Would you like me to add autogen.sh and configure scripts to build > mstflint? The reason is that to compile this on my system (Dell > PowerEdge 2850 (2) 3.2 GHz running cAos 2.0 (with Patches) is not > resolving some of the require include paths. > > Sean Sean, thanks for offering help. So far, I managed to avoid the need for configure scripts, basically on account of the tool dependencies being so simple. Could you please explain what kind of problem are you facing? Is this a cross-compilation environment? How would configure scripts help? Thanks, -- MST From SCHICKHJ at de.ibm.com Thu Oct 6 05:14:43 2005 From: SCHICKHJ at de.ibm.com (Heiko J Schick) Date: Thu, 6 Oct 2005 14:14:43 +0200 Subject: [openib-general] [PATCH] libibat: little / big endian problems in example programs Message-ID: Hello, during (some) test with libibat I found out that the example programs include a little/big endian problem. 
Below you will find the patch for ats.c and att.c which will solve this problem on PPC64: Signed-off-by: Heiko Joerg Schick --- /home/source/trunk_3615_orig/src/userspace/libibat/examples/ats.c 2005-08-23 18:49:39.000000000 +0200 +++ ats.c 2005-10-06 13:42:02.492909848 +0200 @@ -225,7 +225,7 @@ int main(int argc, char **argv) } for (i = 0; i < MAX_REQ; i++) { - r = ib_at_route_by_ip(0x0100a8c0, 0, 0, + r = ib_at_route_by_ip(htonl(0xc0a80001), 0, 0, IB_AT_ROUTE_FORCE_ATS, att_rt + i, att_rt_comp + i, &req_id); --- /home/source/trunk_3615_orig/src/userspace/libibat/examples/att.c 2005-08-23 18:49:39.000000000 +0200 +++ att.c 2005-10-06 13:40:26.293891760 +0200 @@ -190,7 +190,7 @@ int main(int argc, char **argv) } for (i = 0; i < MAX_REQ; i++) { - r = ib_at_route_by_ip(0x0100a8c0, 0, 0, 0, + r = ib_at_route_by_ip(htonl(0xc0a80001), 0, 0, 0, att_rt + i, att_rt_comp + i, &req_id); #if __WORDSIZE == 64 BTW. Does the output of the uatt program looks alright? uatt: att_path_comp_fn: id 21 context 0x10012ae8 completed with rec_num 1 ===> slid 0xab dlid 0xae uatt: main: ib_at_route_by_ip: ret 1 errno 0 for request 21 id 0 0 uatt: att_rt_comp_fn: id 0 context 0x100135f0 completed with rec_num 1 ===> rt 0x100135f0 sgid 0xfe8000000000000067eafbe000040001 dgid 0xfe8000000000000067eafbe000040002 uatt: att_rt_comp_fn: ib_at_paths_by_route: ret 0 errno 0 id 22 22 uatt: att_path_comp_fn: id 22 context 0x10012b30 completed with rec_num 1 ===> slid 0xab dlid 0xae uatt: main: ib_at_route_by_ip: ret 1 errno 0 for request 22 id 0 0 uatt: att_rt_comp_fn: id 0 context 0x10013628 completed with rec_num 1 ===> rt 0x10013628 sgid 0xfe8000000000000067eafbe000040001 dgid 0xfe8000000000000067eafbe000040002 uatt: att_rt_comp_fn: ib_at_paths_by_route: ret 0 errno 0 id 23 23 uatt: att_path_comp_fn: id 23 context 0x10012b78 completed with rec_num 1 ===> slid 0xab dlid 0xae ... Many thanks in advance! 
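
The byte-order issue behind the patch above can be sketched as
follows. This is a minimal, standalone illustration that uses only
htonl() and inet_addr() from the standard sockets headers; the helper
names are invented for the example and are not part of libibat.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Copy a 32-bit value into out[] exactly as it is laid out in memory. */
static void addr_bytes(uint32_t addr, unsigned char out[4])
{
	memcpy(out, &addr, sizeof addr);
}

/* 192.168.0.1 in network byte order, correct on any host. */
static uint32_t portable_addr(void)
{
	return htonl(0xc0a80001);
}

/* The old hard-coded constant: only right on little-endian hosts. */
static uint32_t hardcoded_addr(void)
{
	return 0x0100a8c0;
}

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
	return htonl(1) != 1;
}
```

On a big-endian host such as PPC64, the hard-coded constant is laid
out in memory as 01 00 a8 c0, i.e. 1.0.168.192, whereas
htonl(0xc0a80001) yields the bytes c0 a8 00 01 on either kind of host.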
Mit freundlichen Gruessen / Kind Regards
Heiko Joerg Schick

IBM Deutschland Entwicklung GmbH
I/Ox Microcode Development
Linux Infiniband Device Drivers

Schoenaicher Str. 220
71032 Boeblingen
E-Mail: schickhj at de.ibm.com
External: 49-7031-16-0 x4219, t/l: 120-4219
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From halr at voltaire.com Thu Oct 6 05:47:25 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Oct 2005 08:47:25 -0400
Subject: [openib-general] [PATCH] libibat: little / big endian problems in example programs
In-Reply-To:
References:
Message-ID: <1128602844.4400.3586.camel@hal.voltaire.com>

On Thu, 2005-10-06 at 08:14, Heiko J Schick wrote:
> Hello,
>
> during (some) test with libibat I found out that the example programs
> include a little/big endian problem.
> Below you will find the patch for ats.c and att.c which will solve
> this problem on PPC64:
>
> Signed-off-by: Heiko Joerg Schick

Thanks. Applied.

> --- /home/source/trunk_3615_orig/src/userspace/libibat/examples/ats.c
> 2005-08-23 18:49:39.000000000 +0200
> +++ ats.c 2005-10-06 13:42:02.492909848 +0200
> @@ -225,7 +225,7 @@ int main(int argc, char **argv)
> }
>
> for (i = 0; i < MAX_REQ; i++) {
> - r = ib_at_route_by_ip(0x0100a8c0, 0, 0,
> + r = ib_at_route_by_ip(htonl(0xc0a80001), 0, 0,
> IB_AT_ROUTE_FORCE_ATS,
> att_rt + i, att_rt_comp + i,
> &req_id);

The patch didn't apply. It indicated it was malformed here. I think
your mailer line wrapped this. That needs to be turned off when
submitting patches.

>
> --- /home/source/trunk_3615_orig/src/userspace/libibat/examples/att.c
> 2005-08-23 18:49:39.000000000 +0200
> +++ att.c 2005-10-06 13:40:26.293891760 +0200
> @@ -190,7 +190,7 @@ int main(int argc, char **argv)
> }
>
> for (i = 0; i < MAX_REQ; i++) {
> - r = ib_at_route_by_ip(0x0100a8c0, 0, 0, 0,
> + r = ib_at_route_by_ip(htonl(0xc0a80001), 0, 0, 0,
> att_rt + i, att_rt_comp + i,
> &req_id);
>
> #if __WORDSIZE == 64
>
> BTW.
Does the output of the uatt program looks alright? Yes, that looks OK to me but would need to be verified with your subnet config. It looks like your test node was not 192.168.0.1 and had a LID of 0xab and the 192.168.0.1 node was a different node with LID 0xae. You could also verify the GIDs which were indicated as well. -- Hal > uatt: att_path_comp_fn: id 21 context 0x10012ae8 completed with > rec_num 1 > ===> slid 0xab dlid 0xae > uatt: main: ib_at_route_by_ip: ret 1 errno 0 for request 21 id 0 0 > uatt: att_rt_comp_fn: id 0 context 0x100135f0 completed with rec_num 1 > ===> rt 0x100135f0 sgid 0xfe8000000000000067eafbe000040001 dgid > 0xfe8000000000000067eafbe000040002 > uatt: att_rt_comp_fn: ib_at_paths_by_route: ret 0 errno 0 id 22 22 > uatt: att_path_comp_fn: id 22 context 0x10012b30 completed with > rec_num 1 > ===> slid 0xab dlid 0xae > uatt: main: ib_at_route_by_ip: ret 1 errno 0 for request 22 id 0 0 > uatt: att_rt_comp_fn: id 0 context 0x10013628 completed with rec_num 1 > ===> rt 0x10013628 sgid 0xfe8000000000000067eafbe000040001 dgid > 0xfe8000000000000067eafbe000040002 > uatt: att_rt_comp_fn: ib_at_paths_by_route: ret 0 errno 0 id 23 23 > uatt: att_path_comp_fn: id 23 context 0x10012b78 completed with > rec_num 1 > ===> slid 0xab dlid 0xae > ... > > Many thanks in advance! > > Mit freundlichen Gruessen / Kind Regards > Heiko Joerg Schick > > IBM Deutschland Entwicklung GmbH > I/Ox Microcode Development > Linux Infiniband Device Drivers > > Schoenaicher Str. 
220
> 71032 Boeblingen
> E-Mail: schickhj at de.ibm.com
> External: 49-7031-16-0 x4219, t/l: 120-4219
>
>
> ______________________________________________________________________
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From halr at voltaire.com Thu Oct 6 06:09:45 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Oct 2005 09:09:45 -0400
Subject: [openib-general] Re: [PATCH] Fix leak on MAD initialization failure
In-Reply-To: <52ll17psez.fsf_-_@cisco.com>
References: <52psqjpsgt.fsf@cisco.com> <52ll17psez.fsf_-_@cisco.com>
Message-ID: <1128604185.4382.1.camel@hal.voltaire.com>

On Wed, 2005-10-05 at 17:25, Roland Dreier wrote:
> It seems that there is a bug in ib_mad_init_device(): if
> ib_agent_port_open() fails for a given port, then the current code
> doesn't call ib_mad_port_close() for that port. I think something
> like the patch below is needed.

Yup, it missed calling ib_mad_port_close in the case where it was the
ib_agent_port_open which failed for a port.

Thanks. Applied.

-- Hal

From halr at voltaire.com Thu Oct 6 06:27:47 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Oct 2005 09:27:47 -0400
Subject: [openib-general] [PATCH] IPoIB: Backoff on send only joins as well
Message-ID: <1128605267.4382.57.camel@hal.voltaire.com>

IPoIB: Backoff on send only joins as well (as full member ones)
(This was part of the original patch but somehow doesn't appear to have
made it in).

Signed-off-by: Hal Rosenstock Index: ipoib_multicast.c =================================================================== --- ipoib_multicast.c (revision 3678) +++ ipoib_multicast.c (working copy) @@ -366,7 +366,7 @@ static int ipoib_mcast_sendonly_join(str IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE, - 1000, GFP_ATOMIC, + mcast->backoff * 1000, GFP_ATOMIC, ipoib_mcast_sendonly_join_complete, mcast, &mcast->query); if (ret < 0) { From surs at cse.ohio-state.edu Thu Oct 6 06:39:39 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Thu, 6 Oct 2005 09:39:39 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <523bnfp8jk.fsf@cisco.com> References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> <20051006021529.GA14502@cse.ohio-state.edu> <523bnfp8jk.fsf@cisco.com> Message-ID: <20051006133937.GA23901@cse.ohio-state.edu> * On Oct,10 Roland Dreier wrote : > Sayantan> I am getting a segmentation fault after a couple of > Sayantan> thousand messages are sent over SRQ (using ping-pong > Sayantan> latency test). Here is a snippet from the core > Sayantan> generated. > > Is it possible that you are posting one more receive to the SRQ than > the max capacity you requested when creating the SRQ? > > What happens with the patch below applied to libmthca? Upon inspection of my code, I found that there _is_ a possibility of posting more than srq config. I fixed that and the ping-pong test works. The patch you sent is good, it prevents the application from posting more than max. I will test out the limit event generation next. Thanks, Sayantan. 
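
The failure mode described above, posting more receives than the SRQ
was created with, can be illustrated with a simplified free-list
model. This is a sketch only: it assumes free slots are chained by
index with -1 marking the end of the chain, and struct slot_pool,
POOL_SLOTS and the pool_*() functions are hypothetical names, not the
libmthca data structures.

```c
#define POOL_SLOTS 4	/* hypothetical SRQ capacity for the example */

struct slot_pool {
	int next_free[POOL_SLOTS];	/* index of next free slot, -1 = end */
	int first_free;			/* head of the free chain, -1 = full */
};

/* Chain all slots together: 0 -> 1 -> ... -> POOL_SLOTS-1 -> -1. */
static void pool_init(struct slot_pool *p)
{
	int i;

	for (i = 0; i < POOL_SLOTS; i++)
		p->next_free[i] = (i + 1 < POOL_SLOTS) ? i + 1 : -1;
	p->first_free = 0;
}

/* Post one receive: returns the slot index used, or -1 if none is free. */
static int pool_post(struct slot_pool *p)
{
	int ind = p->first_free;

	if (ind < 0)
		return -1;	/* refuse instead of indexing with -1 */

	p->first_free = p->next_free[ind];
	return ind;
}

/* A receive completed: put its slot back on the free chain. */
static void pool_complete(struct slot_pool *p, int ind)
{
	p->next_free[ind] = p->first_free;
	p->first_free = ind;
}
```

Without the check in pool_post(), the fifth post would follow the -1
end marker and use it as an array index, which is the kind of
out-of-bounds access that can surface later as a crash on the
completion path.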
> > Thanks, > Roland > > > --- libmthca/src/srq.c (revision 3664) > +++ libmthca/src/srq.c (working copy) > @@ -110,6 +110,13 @@ int mthca_tavor_post_srq_recv(struct ibv > > wqe = get_wqe(srq, ind); > next_ind = *wqe_to_link(wqe); > + > + if (next_ind < 0) { > + err = -1; > + *bad_wr = wr; > + break; > + } > + > prev_wqe = srq->last; > srq->last = wqe; > > @@ -197,6 +204,12 @@ int mthca_arbel_post_srq_recv(struct ibv > wqe = get_wqe(srq, ind); > next_ind = *wqe_to_link(wqe); > > + if (next_ind < 0) { > + err = -1; > + *bad_wr = wr; > + break; > + } > + > ((struct mthca_next_seg *) wqe)->nda_op = > htonl((next_ind << srq->wqe_shift) | 1); > ((struct mthca_next_seg *) wqe)->ee_nds = 0; -- http://www.cse.ohio-state.edu/~surs From twbowman at gmail.com Thu Oct 6 07:13:22 2005 From: twbowman at gmail.com (Todd Bowman) Date: Thu, 6 Oct 2005 08:13:22 -0600 Subject: [openib-general] ib_cm_listen failure In-Reply-To: References: <433C2ADF.4010402@ichips.intel.com> Message-ID: On 10/5/05, James Lentini wrote: > > > > On Wed, 5 Oct 2005, Todd Bowman wrote: > > > Here is a patch for dtest.c to remove the qualifier from the sdp range. > > > > Index: userspace/dapl/test/dtest/dtest.c > > =================================================================== > > --- userspace/dapl/test/dtest/dtest.c (revision 3547) > > +++ userspace/dapl/test/dtest/dtest.c (working copy) > > @@ -53,7 +53,7 @@ > > #include "dat/udat.h" > > > > /* definitions */ > > -#define SERVER_CONN_QUAL 71123 > > +#define SERVER_CONN_QUAL 45248 > > #define DTO_TIMEOUT (1000*1000*5) > > #define DTO_FLUSH_TIMEOUT (1000*1000*2) > > #define CONN_TIMEOUT (1000*1000*10) > > Thanks Todd. I don't mean to nit pick, but do mind throwing a > Signed-off-by line on it? > No problem. 
Signed-off-by: Todd Bowman Index: userspace/dapl/test/dtest/dtest.c =================================================================== --- userspace/dapl/test/dtest/dtest.c (revision 3547) +++ userspace/dapl/test/dtest/dtest.c (working copy) @@ -53,7 +53,7 @@ #include "dat/udat.h" /* definitions */ -#define SERVER_CONN_QUAL 71123 +#define SERVER_CONN_QUAL 45248 #define DTO_TIMEOUT (1000*1000*5) #define DTO_FLUSH_TIMEOUT (1000*1000*2) #define CONN_TIMEOUT (1000*1000*10) -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Thu Oct 6 07:28:45 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 06 Oct 2005 07:28:45 -0700 Subject: [openib-general] Re: [PATCH] IPoIB: Backoff on send only joins as well In-Reply-To: <1128605267.4382.57.camel@hal.voltaire.com> (Hal Rosenstock's message of "06 Oct 2005 09:27:47 -0400") References: <1128605267.4382.57.camel@hal.voltaire.com> Message-ID: <52wtkqoh2a.fsf@cisco.com> Hal> IPoIB: Backoff on send only joins as well (as full member Hal> ones) (This was part of the original patch but somehow Hal> doesn't appear to have made it in). I left this part out intentionally because I don't see how it makes a difference. Maybe I'm missing something, but where does mcast->backoff get updated for send-only joins? Does this patch fix something in your testing? - R. From halr at voltaire.com Thu Oct 6 07:47:40 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 10:47:40 -0400 Subject: [openib-general] Re: [PATCH] IPoIB: Backoff on send only joins as well In-Reply-To: <52wtkqoh2a.fsf@cisco.com> References: <1128605267.4382.57.camel@hal.voltaire.com> <52wtkqoh2a.fsf@cisco.com> Message-ID: <1128610060.4382.397.camel@hal.voltaire.com> On Thu, 2005-10-06 at 10:28, Roland Dreier wrote: > Hal> IPoIB: Backoff on send only joins as well (as full member > Hal> ones) (This was part of the original patch but somehow > Hal> doesn't appear to have made it in). 
>
> I left this part out intentionally because I don't see how it makes a
> difference. Maybe I'm missing something, but where does
> mcast->backoff get updated for send-only joins?

OK. There is some code missing from the patch to do the backoff for
send-only joins.

> Does this patch fix something in your testing?

Shouldn't send-only joins back off like full-member ones?

-- Hal

From jlentini at netapp.com Thu Oct 6 08:07:19 2005
From: jlentini at netapp.com (James Lentini)
Date: Thu, 6 Oct 2005 11:07:19 -0400 (EDT)
Subject: [openib-general] ib_cm_listen failure
In-Reply-To:
References: <433C2ADF.4010402@ichips.intel.com>
Message-ID:

On Thu, 6 Oct 2005, Todd Bowman wrote:

> Here is a patch for dtest.c to remove the qualifier from the sdp range.

Thanks. Committed in revision 3683.

From halr at voltaire.com Thu Oct 6 08:41:51 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Oct 2005 11:41:51 -0400
Subject: [openib-general] [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey
Message-ID: <1128613310.4382.609.camel@hal.voltaire.com>

IPoIB: Add API to retrieve ib device, port, and pkey

(I'm also attaching my patch to at.c which uses this. If this is
accepted, I will make up a patch for SDP as well.)

Signed-off-by: Hal Rosenstock

Index: ipoib.h
===================================================================
--- ipoib.h (revision 3683)
+++ ipoib.h (working copy)
@@ -210,6 +210,12 @@ struct ipoib_neigh {
 	struct list_head list;
 };

+struct ipoib_info {
+	struct ib_device *dev;
+	int port;
+	u16 pkey;
+};
+
 static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh)
 {
 	return (struct ipoib_neigh **) (neigh->ha + 24 -
@@ -239,6 +245,8 @@ void ipoib_reap_ah(void *dev_ptr);

 void ipoib_flush_paths(struct net_device *dev);
 struct ipoib_dev_priv *ipoib_intf_alloc(const char *format);
+int ipoib_get_info(struct net_device *dev, struct ipoib_info *info);
+
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush(void *dev);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
Index: ipoib_ib.c
===================================================================
--- ipoib_ib.c (revision 3683)
+++ ipoib_ib.c (working copy)
@@ -38,6 +38,8 @@
 #include
 #include

+#include /* For ARPHRD_xxx */
+
 #include

 #include "ipoib.h"
@@ -569,6 +571,29 @@ int ipoib_ib_dev_init(struct net_device
 	return 0;
 }

+int ipoib_get_info(struct net_device *dev, struct ipoib_info *info)
+{
+	struct ipoib_dev_priv *priv;
+
+	if (!info)
+		return -EINVAL;
+
+	/* Make sure IPoIB interface */
+	if (dev->type != ARPHRD_INFINIBAND)
+		return -ENODEV;
+
+	priv = netdev_priv(dev);
+	/* PKey assigned yet ? */
+	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
+		return -ENOENT;
+
+	info->dev = priv->ca;
+	info->port = priv->port;
+	info->pkey = priv->pkey;
+	return 0;
+}
+EXPORT_SYMBOL(ipoib_get_info);
+
 void ipoib_ib_dev_flush(void *_dev)
 {
 	struct net_device *dev = (struct net_device *)_dev;
Index: at.c
===================================================================
--- at.c (revision 3683)
+++ at.c (working copy)
@@ -416,10 +416,10 @@ static void ib_at_ats_reg(void *data)
 static int resolve_ip(struct ib_at_src *src, u32 dst_ip, u32 src_ip,
 		      int tos, union ib_gid *dgid)
 {
-	struct ipoib_dev_priv *priv;
 	struct net_device *loopback = NULL;
 	struct net_device *ipoib_dev;
 	struct rtable *rt;
+	struct ipoib_info info;
 	struct flowi fl = {
 		.oif = 0, /* oif */
 		.nl_u = {
@@ -504,14 +504,16 @@ static int resolve_ip(struct ib_at_src *
 	}

 	/*
-	 * lookup local info.
+	 * Obtain ib_device, port, and PKey based on IPoIB net_device
 	 */
-	priv = ipoib_dev->priv;
-	src->netdev = ipoib_dev;
-	src->dev = priv->ca;
-	src->port = priv->port;
-	src->pkey = cpu_to_be16(priv->pkey);
+	if ((r = ipoib_get_info(ipoib_dev, &info))) {
+		DEBUG("ipoib_get_pkey failed %d", r);
+		goto done;
+	}
+	src->dev = info.dev;
+	src->port = info.port;
+	src->pkey = cpu_to_be16(info.pkey);
 	memcpy(&src->gid, ipoib_dev->dev_addr + 4, sizeof src->gid);

 	if (!dgid) {

From mshefty at ichips.intel.com Thu Oct 6 09:34:15 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 06 Oct 2005 09:34:15 -0700
Subject: [openib-general] [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey
In-Reply-To: <1128613310.4382.609.camel@hal.voltaire.com>
References: <1128613310.4382.609.camel@hal.voltaire.com>
Message-ID: <43455207.9010508@ichips.intel.com>

Hal Rosenstock wrote:
> IPoIB: Add API to retrieve ib device, port, and pkey
>
> (I'm also attaching my patch to at.c which uses this; If this is
> accepted, I will make up a patch for SDP as well.)

I didn't see any other way to retrieve the pkey associated with an IP address without this. For SDP, if we layered it over the CMA, would it still need to access this information? - Sean From bardov at gmail.com Thu Oct 6 09:40:40 2005 From: bardov at gmail.com (Dan Bar Dov) Date: Thu, 6 Oct 2005 19:40:40 +0300 Subject: [openib-general] Latest build test results In-Reply-To: <20051003221553.GA27996@us.ibm.com> References: <20051003221553.GA27996@us.ibm.com> Message-ID: I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. Dan On 10/4/05, Nishanth Aravamudan wrote: > Hello, > > Here are the build results for 2.6.14-rc3 with and without the latest > gen2 trunk. > > Looks like all the builds were successful, with some warnings: > > - ppc64 + gen2 with =y > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > - same for =m, plus > > *** Warning: ".ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined! > *** Warning: ".ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! 
> > WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/core/ib_at.ko needs unknown symbol ip_dev_find > WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko needs unknown symbol ip_dev_find > > - x86 + gen2 with =y > > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_adaptor_release': > drivers/infiniband/ulp/iser/iser_conn.c:195: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c:203: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c:206: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_establish': > drivers/infiniband/ulp/iser/iser_conn.c:285: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_enable_rdma': > drivers/infiniband/ulp/iser/iser_conn.c:357: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c:431: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_post_receive_control': > drivers/infiniband/ulp/iser/iser_conn.c:933: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c:950: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_conn.c:981: warning: too few arguments for format > > drivers/infiniband/ulp/iser/iser_memory.c: In function `iser_all_mem_add_to_dto': > drivers/infiniband/ulp/iser/iser_memory.c:230: warning: cast from pointer to integer of different size > > drivers/infiniband/ulp/iser/iser_mod.c: In function `init_module': > drivers/infiniband/ulp/iser/iser_mod.c:152: warning: too few arguments for format > > drivers/infiniband/ulp/iser/iser_initiator.c: In function `iser_reg_rdma_mem': > drivers/infiniband/ulp/iser/iser_initiator.c:62: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_initiator.c:67: warning: too few arguments for format > 
drivers/infiniband/ulp/iser/iser_initiator.c:80: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_initiator.c:95: warning: too few arguments for format > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_create_ia_pz_evd': > drivers/infiniband/ulp/iser/iser_lkdapl.c:147: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_start_dto': > drivers/infiniband/ulp/iser/iser_lkdapl.c:660: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_consume_events': > drivers/infiniband/ulp/iser/iser_lkdapl.c:758: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_event_handler_thread': > drivers/infiniband/ulp/iser/iser_lkdapl.c:800: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:819: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_conn_event': > drivers/infiniband/ulp/iser/iser_lkdapl.c:846: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:849: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:852: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:855: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:858: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:861: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:864: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:867: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c:870: warning: too few arguments for format > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_single_kdapl_event': > drivers/infiniband/ulp/iser/iser_lkdapl.c:1116: warning: too few arguments for format > > 
drivers/infiniband/ulp/iser/iser_mod.c: In function `cleanup_module': > drivers/infiniband/ulp/iser/iser_mod.c:241: warning: too few arguments for format > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > - same for =m, plus: > > *** Warning: "ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined! > *** Warning: "ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! > > WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/ulp/sdp/ib_sdp.ko needs unknown symbol ip_dev_find > WARNING: /lib/modules/2.6.14-rc3-git3-autokern1/kernel/drivers/infiniband/core/ib_at.ko needs unknown symbol ip_dev_find > > Mainline does not appear to have any issues on either ppc64 or x86, =m > or =y. > > Thanks, > Nish > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Thu Oct 6 09:45:39 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 12:45:39 -0400 Subject: [openib-general] [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <43455207.9010508@ichips.intel.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <43455207.9010508@ichips.intel.com> Message-ID: <1128616945.4382.839.camel@hal.voltaire.com> On Thu, 2005-10-06 at 12:34, Sean Hefty wrote: > Hal Rosenstock wrote: > > IPoIB: Add API to retrieve ib device, port, and pkey > > > > (I'm also attaching my patch to at.c which uses this; If this is > > accepted, I will make up a patch for SDP as well.) > > I didn't see any other way to retrieve the pkey associated with an IP address > without this. 

Yes, and I looked at getting the ib_device but there is no easy way so I
added them into the structure returned. Is CMA keeping a list of
ib_devices that it walks for this ?

> For SDP, if we layered it over the CMA, would it still need to access this
> information?

I'm not 100% sure. It partially depends on the CMA APIs. How is the
PathRecord request done ? That's what it's needed for.

-- Hal

From rolandd at cisco.com Thu Oct 6 09:55:34 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 06 Oct 2005 09:55:34 -0700
Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey
In-Reply-To: <1128613310.4382.609.camel@hal.voltaire.com> (Hal Rosenstock's message of "06 Oct 2005 11:41:51 -0400")
References: <1128613310.4382.609.camel@hal.voltaire.com>
Message-ID: <52r7ayoa9l.fsf@cisco.com>

Did we ever figure out how to handle the hotplug issues with the
lifetime of the struct ib_device pointer? Right now this API is
unsafe, because a caller can get a pointer to a device that has
already disappeared.

Also if we do decide to add an API like this, the struct ipoib_info
and ipoib_get_info() declarations should be in rather than in the
private ipoib.h header.

 - R.

From mshefty at ichips.intel.com Thu Oct 6 10:01:35 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 06 Oct 2005 10:01:35 -0700
Subject: [openib-general] [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey
In-Reply-To: <1128616945.4382.839.camel@hal.voltaire.com>
References: <1128613310.4382.609.camel@hal.voltaire.com> <43455207.9010508@ichips.intel.com> <1128616945.4382.839.camel@hal.voltaire.com>
Message-ID: <4345586F.7010001@ichips.intel.com>

Hal Rosenstock wrote:
>>I didn't see any other way to retrieve the pkey associated with an IP address
>>without this.
>
> Yes, and I looked at getting the ib_device but there is no easy way so I
> added them into the structure returned. Is CMA keeping a list of
> ib_devices that it walks for this ?

The CMA maintains a list of devices. The address translation code takes an IP
address and returns the corresponding GID. The CMA looks up the GID against its
list of devices. All synchronization for device removal is handled by the CMA.

Currently, the address translation code isn't aware of ib_devices. It's almost
a device independent IP to HW address translation mechanism.

A question that I have is how does the user know if the ib_device pointer is valid?

>>For SDP, if we layered it over the CMA, would it still need to access this
>>information?
>
> I'm not 100% sure. It partially depends on the CMA APIs. How is the
> PathRecord request done ? That's what it's needed for.

Right now, the CMA issues a path record request based on the SGID/DGID only. It
would be fairly easy to add the PKey to the request once the address translation
code returns it.

- Sean

From nacc at us.ibm.com Thu Oct 6 10:11:28 2005
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Thu, 6 Oct 2005 10:11:28 -0700
Subject: [openib-general] Latest build test results
In-Reply-To:
References: <20051003221553.GA27996@us.ibm.com>
Message-ID: <20051006171128.GA15908@us.ibm.com>

On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote:
> I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682.

Great! Thanks.

I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs
weren't running) now and will post the latest results.

Thanks,
Nish

From jcarr at linuxmachines.com Thu Oct 6 10:32:05 2005
From: jcarr at linuxmachines.com (Jeff Carr)
Date: Thu, 06 Oct 2005 10:32:05 -0700
Subject: [openib-general] Re: [git pull] InfiniBand fixes for 2.6.14
In-Reply-To: <524q85on6e.fsf@cisco.com>
References: <524q85on6e.fsf@cisco.com>
Message-ID: <43455F95.8000105@linuxmachines.com>

On 09/27/2005 09:01 PM, Roland Dreier wrote:
> Linus, please pull from
>
> master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus
>
> This tree is also available from kernel.org mirrors at:
>
> rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

When I pulled this yesterday, it didn't compile uverbs_main.c. It looks
like it's missing from include/rdma/ib_user_verbs.h

I'm wondering if I pulled your tree/branch correctly. Can you confirm
these would be the right instructions?

export \
IB="rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git"
git clone $IB ib
cd ib
git-read-tree -m HEAD
git-checkout-cache -q -f -u -a

At that point I have the master branch. Then I switch to your branch:

git checkout -f for-linus

Then, after the initial pull, if I wanted to update to the current
version I'd run:

git pull

Thanks,
Jeff

drivers/infiniband/core/uverbs_main.c: In function `ib_uverbs_write':
drivers/infiniband/core/uverbs_main.c:517: error: `IB_USER_VERBS_CMD_QUERY_PARAMS' undeclared (first use in this function)
drivers/infiniband/core/uverbs_main.c:517: error: (Each undeclared identifier is reported only once
drivers/infiniband/core/uverbs_main.c:517: error: for each function it appears in.)

From halr at voltaire.com Thu Oct 6 10:25:35 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Oct 2005 13:25:35 -0400
Subject: [openib-general] Latest build test results
In-Reply-To: <20051006171128.GA15908@us.ibm.com>
References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com>
Message-ID: <1128619535.4382.1039.camel@hal.voltaire.com>

On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote:
> On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote:
> > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682.
>
> Great! Thanks.
>
> I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs
> weren't running) now and will post the latest results.

You might also want to apply
https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff
to get rid of the AT and SDP warnings.

-- Hal

From twbowman at gmail.com Thu Oct 6 10:48:02 2005
From: twbowman at gmail.com (Todd Bowman)
Date: Thu, 6 Oct 2005 11:48:02 -0600
Subject: [openib-general] [PATCH] udapl: PPC64 cpuinfo change
Message-ID:

This patch in addition to "PPC64 atomic function additions" provides udapl
support on PPC64 platform.

/proc/cpuinfo on PPC64 prints different label for processor speed.

Signed-off-by: Todd Bowman

Index: userspace/dapl/test/dapltest/mdep/linux/dapl_mdep_user.c
===================================================================
--- userspace/dapl/test/dapltest/mdep/linux/dapl_mdep_user.c (revision 3547)
+++ userspace/dapl/test/dapltest/mdep/linux/dapl_mdep_user.c (working copy)
@@ -186,7 +186,12 @@ void
 )
 {
 #define DT_CPU_MHZ_BUFFER_SIZE 128
+
+#if defined (__PPC64__)
+#define DT_CPU_MHZ_MHZ "clock"
+#else
 #define DT_CPU_MHZ_MHZ "cpu MHz"
+#endif
 #define DT_CPU_MHZ_DELIMITER ":"

 FILE *fp;
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From twbowman at gmail.com Thu Oct 6 10:48:06 2005 From: twbowman at gmail.com (Todd Bowman) Date: Thu, 6 Oct 2005 11:48:06 -0600 Subject: [openib-general] [PATCH] udapl: PPC64 atomic function additions Message-ID: This patch in addition to "PPC64 cpuinfo change" provides udapl support on PPC64 platform. Added PPC64 dependent code to dapl_os_atomic_inc, dapl_os_atomic_dec, dapl_os_atomic_assign and DT_Mdep_GetTimeStamp. Also added PPC64 to platform checks. Signed-off-by: Todd Bowman Index: userspace/dapl/dapl/udapl/linux/dapl_osd.h =================================================================== --- userspace/dapl/dapl/udapl/linux/dapl_osd.h (revision 3547) +++ userspace/dapl/dapl/udapl/linux/dapl_osd.h (working copy) @@ -49,7 +49,7 @@ #error UNDEFINED OS TYPE #endif /* __linux__ */ -#if !defined (__i386__) && !defined (__ia64__) && !defined(__x86_64__) +#if !defined (__i386__) && !defined (__ia64__) && !defined(__x86_64__) && !defined(__PPC64__) #error UNDEFINED ARCH #endif @@ -78,7 +78,7 @@ #include #include -#ifdef __ia64__ +#if defined(__ia64__) || defined(__PPC64__) #include #include #endif @@ -162,6 +160,8 @@ IA64_FETCHADD (old_value,v,1,4); #endif +#elif defined(__PPC64__) + atomic_inc((atomic_t *) v); #else /* !__ia64__ */ __asm__ __volatile__ ( "lock;" "incl %0" @@ -190,6 +190,9 @@ IA64_FETCHADD (old_value,v,-1,4); #endif +#elif defined (__PPC64__) + atomic_dec((atomic_t *)v); + #else /* !__ia64__ */ __asm__ __volatile__ ( "lock;" "decl %0" @@ -230,6 +233,22 @@ current_value = ia64_cmpxchg("acq",v,match_value,new_value,4); +#elif defined(__PPC64__) + + __asm__ __volatile__ ( + EIEIO_ON_SMP +"1: lwarx %0,0,%2 # __cmpxchg_u64\n\ + cmpd 0,%0,%3\n\ + bne- 2f\n\ + stwcx. 
%4,0,%2\n\ + bne- 1b" + ISYNC_ON_SMP + "\n\ +2:" + : "=&r" (current_value), "=m" (*v) + : "r" (v), "r" (match_value), "r" (new_value), "m" (*v) + : "cc", "memory"); + #else __asm__ __volatile__ ( "lock; cmpxchgl %1, %2" Index: userspace/dapl/test/dapltest/mdep/linux/dapl_mdep_user.h =================================================================== --- userspace/dapl/test/dapltest/mdep/linux/dapl_mdep_user.h (revision 3547) +++ userspace/dapl/test/dapltest/mdep/linux/dapl_mdep_user.h (working copy) @@ -128,10 +128,20 @@ x = get_cycles (); return x; +#else +#if defined(__PPC64__) + unsigned int tbl, tbu0, tbu1; + do { + __asm__ __volatile__ ("mftbu %0" : "=r"(tbu0)); + __asm__ __volatile__ ("mftb %0" : "=r"(tbl)); + __asm__ __volatile__ ("mftbu %0" : "=r"(tbu1)); + } while (tbu0 != tbu1); + return (((unsigned long long)tbu0) << 32) | tbl; #else -#error "Non-Pentium Linux - unimplemented" +#error "Non-Pentium and Non-PPC Linux - unimplemented" #endif #endif +#endif } /* -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Thu Oct 6 10:50:15 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Oct 2005 10:50:15 -0700 Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <52r7ayoa9l.fsf@cisco.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com> Message-ID: <434563D7.6080601@ichips.intel.com> Roland Dreier wrote: > Did we ever figure out how to handle the hotplug issues with the > lifetime of the struct ib_device pointer? Right now this API is > unsafe, because a caller can get a pointer to a device that has > already disappeared. Is it possible to retrieve the pkey using net_device->class_dev? 

- Sean

From rolandd at cisco.com Thu Oct 6 10:51:25 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 06 Oct 2005 10:51:25 -0700
Subject: [openib-general] Re: [git pull] InfiniBand fixes for 2.6.14
In-Reply-To: <43455F95.8000105@linuxmachines.com> (Jeff Carr's message of "Thu, 06 Oct 2005 10:32:05 -0700")
References: <524q85on6e.fsf@cisco.com> <43455F95.8000105@linuxmachines.com>
Message-ID: <52mzlmo7oi.fsf@cisco.com>

    Jeff> When I pulled this yesterday, it didn't compile
    Jeff> uverbs_main.c. It looks like it's missing from
    Jeff> include/rdma/ib_user_verbs.h

    Jeff> I'm wondering if I pulled your tree/branch correctly. Can
    Jeff> you confirm these would be the right instructions?

Looks reasonable to me. I'm not sure what went wrong. Unfortunately I
just blew away that git tree and rebased against Linus's latest tree.
But everything from the for-linus branch should be in Linus's git
tree. Does Linus's tree build for you?

I just made a new infiniband git tree with an "upstream" branch for
changes I plan to merge in 2.6.15 and a for-linus branch (currently
empty) with 2.6.14 fixes. Once that hits the mirrors you could try
pulling that and see how it works for you.

> drivers/infiniband/core/uverbs_main.c: In function `ib_uverbs_write':
> drivers/infiniband/core/uverbs_main.c:517: error:
> `IB_USER_VERBS_CMD_QUERY_PARAMS' undeclared (first use in this function)
> drivers/infiniband/core/uverbs_main.c:517: error: (Each undeclared
> identifier is reported only once
> drivers/infiniband/core/uverbs_main.c:517: error: for each function it
> appears in.)

These error messages seem like your uverbs_main.c and ib_user_verbs.h
files got out of sync somehow. My tree looked OK to me so I don't
know how to explain this.

 - R.

From nacc at us.ibm.com Thu Oct 6 11:11:47 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 6 Oct 2005 11:11:47 -0700 Subject: [openib-general] Latest build test results In-Reply-To: <1128619535.4382.1039.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> Message-ID: <20051006181147.GB15908@us.ibm.com> On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > Great! Thanks. > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > weren't running) now and will post the latest results. > > You might also want to apply > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > to get rid of the AT and SDP warnings. I already submitted several jobs for 2.6.14-rc3-git6, but I'll redo the gen2 ones with that patch, thanks. Here are the results from 2.6.14-rc3-git6 + gen2 3683 Looks like x86 is broken in the current svn tree. x86 and ppc64 mainline is fine with both =y and =m ppc64 + gen2 =y drivers/infiniband/ulp/srp/ib_srp.c: In function `srp_process_rsp': drivers/infiniband/ulp/srp/ib_srp.c:650: warning: long long unsigned int format, u64 arg (arg 2) drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type ppc64 + gen2 =m same as above, plus *** Warning: ".ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined! *** Warning: ".ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! 
x86 + gen2 =y *FAILED* drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_adaptor_release': drivers/infiniband/ulp/iser/iser_conn.c:195: parse error before `)' drivers/infiniband/ulp/iser/iser_conn.c:203: parse error before `)' drivers/infiniband/ulp/iser/iser_conn.c:206: parse error before `)' drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_establish': drivers/infiniband/ulp/iser/iser_conn.c:284: parse error before `)' drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_post_receive_control': drivers/infiniband/ulp/iser/iser_conn.c:861: parse error before `)' drivers/infiniband/ulp/iser/iser_conn.c:873: parse error before `)' drivers/infiniband/ulp/iser/iser_initiator.c: In function `iser_reg_rdma_mem': drivers/infiniband/ulp/iser/iser_initiator.c:125: parse error before `)' drivers/infiniband/ulp/iser/iser_initiator.c:130: parse error before `)' drivers/infiniband/ulp/iser/iser_initiator.c:141: parse error before `)' drivers/infiniband/ulp/iser/iser_initiator.c:153: parse error before `)' drivers/infiniband/ulp/iser/iser_mod.c: In function `init_module': drivers/infiniband/ulp/iser/iser_mod.c:154: parse error before `)' drivers/infiniband/ulp/iser/iser_mod.c: In function `cleanup_module': drivers/infiniband/ulp/iser/iser_mod.c:243: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_create_ia_pz_evd': drivers/infiniband/ulp/iser/iser_lkdapl.c:147: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_consume_events': drivers/infiniband/ulp/iser/iser_lkdapl.c:691: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_event_handler_thread': drivers/infiniband/ulp/iser/iser_lkdapl.c:731: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:749: 
parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_conn_event': drivers/infiniband/ulp/iser/iser_lkdapl.c:776: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:779: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:782: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:785: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:788: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:791: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:794: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:797: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c:800: parse error before `)' drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_single_kdapl_event': drivers/infiniband/ulp/iser/iser_lkdapl.c:1025: parse error before `)' x86 + gen2 =m *FAILED* same as above Thanks, Nish From surs at cse.ohio-state.edu Thu Oct 6 11:46:54 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Thu, 6 Oct 2005 14:46:54 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051006133937.GA23901@cse.ohio-state.edu> References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> <20051006021529.GA14502@cse.ohio-state.edu> <523bnfp8jk.fsf@cisco.com> <20051006133937.GA23901@cse.ohio-state.edu> Message-ID: <20051006184652.GA27969@cse.ohio-state.edu> Roland, * On Oct,11 Sayantan Sur wrote : > I will test out the limit event generation next. I made some simple modifications to srq_pingpong.c to see if I am able to generate the IBV_EVENT_SRQ_LIMIT_REACHED event. I have attached my changes as a patch and the full file (for easy execution). I noticed that the test re-posts buffers only when the outstanding recv count is <= 1. I set a SRQ limit as max_recv - 5. So, I should get the event when 5 WQEs are consumed from the SRQ, right? As of now, I am not able to see the event happening. 
I'd be glad if you could see if this issue can be resolved. Thanks for your prompt help. Sayantan. -- http://www.cse.ohio-state.edu/~surs -------------- next part -------------- Index: srq_pingpong.c =================================================================== --- srq_pingpong.c (revision 3676) +++ srq_pingpong.c (working copy) @@ -36,6 +36,8 @@ # include #endif /* HAVE_CONFIG_H */ +#define _GNU_SOURCE + #include #include #include @@ -62,6 +64,8 @@ static int page_size; +static pthread_t limit_thread; + struct pingpong_context { struct ibv_context *context; struct ibv_comp_channel *channel; @@ -82,6 +86,25 @@ int psn; }; + +static void asyncwatch(struct ibv_context *context) +{ + struct ibv_async_event event; + + while (1) { + + if (ibv_get_async_event(context, &event)) { + fprintf(stderr,"Error getting event!\n"); + } + + fprintf(stderr, " event_type %d, port %d\n", event.event_type, + event.element.port_num); + fflush(stderr); + + ibv_ack_async_event(&event); + } +} + static uint16_t pp_get_local_lid(struct pingpong_context *ctx, int port) { struct ibv_port_attr attr; @@ -382,7 +405,11 @@ return NULL; } + pthread_create(&limit_thread, NULL, (void *) asyncwatch, (void *)ctx->context); + { + struct ibv_srq_attr srq_attr; + struct ibv_srq_init_attr attr = { .attr = { .max_wr = rx_depth, @@ -395,6 +422,15 @@ fprintf(stderr, "Couldn't create SRQ\n"); return NULL; } + + srq_attr.max_wr = rx_depth; + srq_attr.max_sge = 1; + srq_attr.srq_limit = rx_depth-5; + + if(ibv_modify_srq(ctx->srq, &srq_attr, IBV_SRQ_LIMIT)) { + fprintf(stderr,"Error modifying SRQ\n"); + exit(-1); + } } for (i = 0; i < num_qp; ++i) { @@ -434,6 +470,7 @@ } } + return ctx; } @@ -742,6 +779,8 @@ } } + fprintf(stderr,"routs %d\n", routs); + if (scnt < iters) { j = find_qp(wc[i].qp_num, ctx, num_qp); if (j < 0) { @@ -784,5 +823,7 @@ iters, usec / 1000000., usec / iters); } + sleep(3); + return 0; } -------------- next part -------------- A non-text attachment was scrubbed... 
Name: srq_pingpong.c
Type: text/x-csrc
Size: 19155 bytes
Desc: not available
URL:

From halr at voltaire.com Thu Oct 6 11:55:02 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Oct 2005 14:55:02 -0400
Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey
In-Reply-To: <434563D7.6080601@ichips.intel.com>
References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com> <434563D7.6080601@ichips.intel.com>
Message-ID: <1128624901.4382.1599.camel@hal.voltaire.com>

On Thu, 2005-10-06 at 13:50, Sean Hefty wrote:
> Roland Dreier wrote:
> > Did we ever figure out how to handle the hotplug issues with the
> > lifetime of the struct ib_device pointer? Right now this API is
> > unsafe, because a caller can get a pointer to a device that has
> > already disappeared.
>
> Is it possible to retrieve the pkey using net_device->class_dev?

I think so, but would that be any safer ? I think it might end up going
through the IPoIB device private data (or an API anyhow).

-- Hal

From rolandd at cisco.com Thu Oct 6 12:00:48 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 06 Oct 2005 12:00:48 -0700
Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey
In-Reply-To: <434563D7.6080601@ichips.intel.com> (Sean Hefty's message of "Thu, 06 Oct 2005 10:50:15 -0700")
References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com> <434563D7.6080601@ichips.intel.com>
Message-ID: <52irwao4gv.fsf@cisco.com>

    Sean> Is it possible to retrieve the pkey using
    Sean> net_device->class_dev?

Maybe, but even more direct would be taking it from net_device->broadcast.

 - R.

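For readers following the thread, here is a minimal sketch of what reading the
P_Key out of the broadcast address could look like. It assumes the P_Key sits
in bytes 8 and 9 of the 20-byte IPoIB broadcast hardware address, high byte
first; that layout is an assumption of this sketch. The helper name
ipoib_bcast_to_pkey and the IPOIB_HW_ADDR_LEN constant are hypothetical, not
existing kernel symbols, and the code is plain userspace C rather than kernel
code so it can be compiled on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 4 bytes of flags/QPN followed by a 16-byte multicast GID. */
#define IPOIB_HW_ADDR_LEN 20

/*
 * Hypothetical helper (not a kernel API): rebuild the 16-bit P_Key from a
 * broadcast hardware address, assuming byte 8 holds the high byte and
 * byte 9 the low byte.
 */
static uint16_t ipoib_bcast_to_pkey(const uint8_t bcast[IPOIB_HW_ADDR_LEN])
{
	assert(bcast != NULL);
	return (uint16_t)((bcast[8] << 8) | bcast[9]);
}
```

A caller would pass a copy of the device's broadcast address. This avoids
handing out a struct ib_device pointer at all, which is the lifetime concern
raised earlier in the thread, but it only yields the P_Key, not the device or
port.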
From halr at voltaire.com Thu Oct 6 12:03:48 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 15:03:48 -0400 Subject: [openib-general] [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <4345586F.7010001@ichips.intel.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <43455207.9010508@ichips.intel.com> <1128616945.4382.839.camel@hal.voltaire.com> <4345586F.7010001@ichips.intel.com> Message-ID: <1128625363.4382.1653.camel@hal.voltaire.com> On Thu, 2005-10-06 at 13:01, Sean Hefty wrote: > Hal Rosenstock wrote: > >>I didn't see any other way to retrieve the pkey associated with an IP address > >>without this. > > > > Yes, and I looked at getting the ib_device but there is no easy way so I > > added them into the structure returned. Is CMA keeping a list of > > ib_devices that it walks for this ? > > The CMA maintains a list of devices. The address translation code takes an IP > address and returns the corresponding GID. The CMA looks up the GID against its > list of devices. All synchronization for device removal is handled by the CMA. > > Currently, the address translation code isn't aware of ib_devices. It's almost > a device independent IP to HW address translation mechanism. > > A question that I have is how does the user know if the ib_device pointer is valid? The only way I see is that a user needs to register as a client and track device removals. Is there another way ? > >>For SDP, if we layered it over the CMA, would it still need to access this > >>information? > > > > I'm not 100% sure. It partially depends on the CMA APIs. How is the > > PathRecord request done ? That's what it's needed for. > > Right now, the CMA issues a path record request based on the SGID/DGID only. It > would be fairly easy to add the PKey to the request once the address translation > code returns it. How would the address translation code get it ? 
-- Hal From mshefty at ichips.intel.com Thu Oct 6 12:08:21 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Oct 2005 12:08:21 -0700 Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <52irwao4gv.fsf@cisco.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com> <434563D7.6080601@ichips.intel.com> <52irwao4gv.fsf@cisco.com> Message-ID: <43457625.1020702@ichips.intel.com> Roland Dreier wrote: > Sean> Is it possible to retrieve the pkey using > Sean> net_device->class_dev? > > Maybe, but even more direct would be taking it from net_device->broadcast. Okay - this is starting to make more sense to me now: priv->dev->broadcast[8] = priv->pkey >> 8; priv->dev->broadcast[9] = priv->pkey & 0xff; I assume that the broadcast address is well defined, and there's no issue reading it from there? If so, then I think it's a simple change to addr.c to extract it. - Sean From mshefty at ichips.intel.com Thu Oct 6 12:16:20 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Oct 2005 12:16:20 -0700 Subject: [openib-general] [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <1128625363.4382.1653.camel@hal.voltaire.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <43455207.9010508@ichips.intel.com> <1128616945.4382.839.camel@hal.voltaire.com> <4345586F.7010001@ichips.intel.com> <1128625363.4382.1653.camel@hal.voltaire.com> Message-ID: <43457804.6090506@ichips.intel.com> Hal Rosenstock wrote: >>The CMA maintains a list of devices. The address translation code takes an IP >>address and returns the corresponding GID. The CMA looks up the GID against its >>list of devices. All synchronization for device removal is handled by the CMA. > > The only way I see is that a user needs to register as a client and > track device removals. Is there another way ? The CMA will attempt to handle device removal internally. 
The basic operation is this: id = rdma_create_id(); rdma_resolve_addr(id...); /* associates a device with the ID */ /* wait for resolution to complete */ ib_alloc_pd(id->device...); ib_create_cq(id->device...); ib_create_qp(id->device...); rdma_connect(id); If a device is removed, the user will receive a callback with DEVICE_REMOVAL. The user must free all resources created using id->device, and destroy the id. The removal is blocked until the id is destroyed. >>Right now, the CMA issues a path record request based on the SGID/DGID only. It >>would be fairly easy to add the PKey to the request once the address translation >>code returns it. > > How would the address translation code get it ? Right now, it doesn't. But see Roland's message. It could be read directly from the broadcast address. - Sean From nacc at us.ibm.com Thu Oct 6 12:20:24 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 6 Oct 2005 12:20:24 -0700 Subject: [openib-general] Latest build test results In-Reply-To: <1128619535.4382.1039.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> Message-ID: <20051006192024.GC15908@us.ibm.com> On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > Great! Thanks. > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > weren't running) now and will post the latest results. > > You might also want to apply > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > to get rid of the AT and SDP warnings. 
This patch does remove the warning regarding undefined symbols during modpost, but does not remove the warnings drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type Thanks, Nish From halr at voltaire.com Thu Oct 6 12:23:19 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 15:23:19 -0400 Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <43457625.1020702@ichips.intel.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com> <434563D7.6080601@ichips.intel.com> <52irwao4gv.fsf@cisco.com> <43457625.1020702@ichips.intel.com> Message-ID: <1128626405.4382.1741.camel@hal.voltaire.com> On Thu, 2005-10-06 at 15:08, Sean Hefty wrote: > Roland Dreier wrote: > > Sean> Is it possible to retrieve the pkey using > > Sean> net_device->class_dev? > > > > Maybe, but even more direct would be taking it from net_device->broadcast. > > Okay - this is starting to make more sense to me now: > > priv->dev->broadcast[8] = priv->pkey >> 8; > priv->dev->broadcast[9] = priv->pkey & 0xff; > > I assume that the broadcast address is well defined, and there's no issue > reading it from there? If so, then I think it's a simple change to addr.c to > extract it. What stops the net_device from being pulled from underneath this ? Seems like a similar issue to me. The difference I see is that only net_devices need to be tracked rather than perhaps net_devices and ib_devices. 
-- Hal From halr at voltaire.com Thu Oct 6 12:26:41 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Oct 2005 15:26:41 -0400 Subject: [openib-general] Latest build test results In-Reply-To: <20051006192024.GC15908@us.ibm.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> Message-ID: <1128626684.4382.1762.camel@hal.voltaire.com> On Thu, 2005-10-06 at 15:20, Nishanth Aravamudan wrote: > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > Great! Thanks. > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > weren't running) now and will post the latest results. > > > > You might also want to apply > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > to get rid of the AT and SDP warnings. > > This patch does remove the warning regarding undefined symbols during > modpost, but does not remove the warnings > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type Right. Roland reported a change to struct packet_type in 2.6.14. I'll work on a patch for this too. Thanks. 
-- Hal From mshefty at ichips.intel.com Thu Oct 6 12:35:04 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 06 Oct 2005 12:35:04 -0700 Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey In-Reply-To: <1128626405.4382.1741.camel@hal.voltaire.com> References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com> <434563D7.6080601@ichips.intel.com> <52irwao4gv.fsf@cisco.com> <43457625.1020702@ichips.intel.com> <1128626405.4382.1741.camel@hal.voltaire.com> Message-ID: <43457C68.8020905@ichips.intel.com> Hal Rosenstock wrote: > What stops the net_device from being pulled from underneath this ? Seems > like a similar issue to me. The difference I see is that only > net_devices need to be tracked rather than perhaps net_devices and > ib_devices. A reference on the net_device needs to be held while this is being read. Net_devices already have reference counting that comes with them; this would need to be added to ib_devices. E.g. dev = ip_dev_find(ip); gid = dev->dev_addr + 4; pkey = get_pkey(dev->broadcast); dev_put(dev); could be used to convert a local IP address to a GID/PKey. I'm assuming that neigh_lookup() provides the same protection: that neigh->dev is valid while a reference on the neigh is held (until neigh_release is called). Does anyone know if this is the case? - Sean From shubbell at dbresearch.net Thu Oct 6 12:43:32 2005 From: shubbell at dbresearch.net (Sean Hubbell) Date: Thu, 06 Oct 2005 14:43:32 -0500 Subject: [openib-general] Linux 2.6.13 Kernel Support Question In-Reply-To: <1128626684.4382.1762.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> Message-ID: <43457E64.1010406@dbresearch.net> Hello, Will openib still supply patches to the 2.6.13 Kernel or do I need to upgrade my kernel to 2.6.14? 
Thanks, Sean Hubbell From rolandd at cisco.com Thu Oct 6 12:50:36 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 06 Oct 2005 12:50:36 -0700 Subject: [openib-general] Linux 2.6.13 Kernel Support Question In-Reply-To: <43457E64.1010406@dbresearch.net> (Sean Hubbell's message of "Thu, 06 Oct 2005 14:43:32 -0500") References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <43457E64.1010406@dbresearch.net> Message-ID: <52ek6yo25v.fsf@cisco.com> Sean> Hello, Will openib still supply patches to the 2.6.13 Kernel Sean> or do I need to upgrade my kernel to 2.6.14? 2.6.14 is not out yet, so the OpenIB subversion repository continues to be targeted at 2.6.13 (the latest full kernel release). Once 2.6.14 is released, we'll target that for development. If there are API changes from 2.6.13 to 2.6.14 that mean the subversion tree no longer works with 2.6.13, then if you want to use the latest subversion sources, you'll have to either upgrade to 2.6.14, find some contributed backport patches, or do the backporting yourself. - R. From shubbell at dbresearch.net Thu Oct 6 12:55:50 2005 From: shubbell at dbresearch.net (Sean Hubbell) Date: Thu, 06 Oct 2005 14:55:50 -0500 Subject: [openib-general] Linux 2.6.13 Kernel Support Question In-Reply-To: <52ek6yo25v.fsf@cisco.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <43457E64.1010406@dbresearch.net> <52ek6yo25v.fsf@cisco.com> Message-ID: <43458146.2090307@dbresearch.net> Roland Dreier wrote: > Sean> Hello, Will openib still supply patches to the 2.6.13 Kernel > Sean> or do I need to upgrade my kernel to 2.6.14?
> >2.6.14 is not out yet, so the OpenIB subversion repository continues >to be targeted at 2.6.13 (the latest full kernel release). Once >2.6.14 is released, we'll target that for development. If the are API >changes from 2.6.13 to 2.6.14 that mean the subversion tree no longer >works with 2.6.13, then if you want to use the latest subversion >sources, you'll have to either upgrade to 2.6.14, find some >contributed backport patches, or do the backporting yourself. > > - R. > > > > Thanks Roland. Sean Hubbell From rolandd at cisco.com Thu Oct 6 13:10:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 06 Oct 2005 13:10:42 -0700 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051006184652.GA27969@cse.ohio-state.edu> (Sayantan Sur's message of "Thu, 6 Oct 2005 14:46:54 -0400") References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> <20051006021529.GA14502@cse.ohio-state.edu> <523bnfp8jk.fsf@cisco.com> <20051006133937.GA23901@cse.ohio-state.edu> <20051006184652.GA27969@cse.ohio-state.edu> Message-ID: <52achmo18d.fsf@cisco.com> Sayantan> I noticed that the test re-posts buffers only when the Sayantan> outstanding recv count is <= 1. I set a SRQ limit as Sayantan> max_recv - 5. So, I should get the event when 5 WQEs are Sayantan> consumed from the SRQ, right? Yes, your code is correct. The problem was that the mthca kernel driver was dispatching SRQ events incorrectly, so the event never reached userspace. I've checked in a fix for that, and I'm going to queue the SRQ limit event stuff for 2.6.15 (now that I've seen it working). BTW, in your code, you have: fprintf(stderr, " event_type %d, port %d\n", event.event_type, event.element.port_num); it would be more sensible to print event.element.srq here, since you're expecting an SRQ event. - R. 
From surs at cse.ohio-state.edu Thu Oct 6 13:54:29 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Thu, 6 Oct 2005 16:54:29 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <52achmo18d.fsf@cisco.com> References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> <20051006021529.GA14502@cse.ohio-state.edu> <523bnfp8jk.fsf@cisco.com> <20051006133937.GA23901@cse.ohio-state.edu> <20051006184652.GA27969@cse.ohio-state.edu> <52achmo18d.fsf@cisco.com> Message-ID: <20051006205426.GA28969@cse.ohio-state.edu> Roland, * On Oct,13 Roland Dreier wrote : > Sayantan> I noticed that the test re-posts buffers only when the > Sayantan> outstanding recv count is <= 1. I set a SRQ limit as > Sayantan> max_recv - 5. So, I should get the event when 5 WQEs are > Sayantan> consumed from the SRQ, right? > > Yes, your code is correct. The problem was that the mthca kernel > driver was dispatching SRQ events incorrectly, so the event never > reached userspace. I've checked in a fix for that, and I'm going to > queue the SRQ limit event stuff for 2.6.15 (now that I've seen it > working). > > BTW, in your code, you have: > > fprintf(stderr, " event_type %d, port %d\n", event.event_type, > event.element.port_num); > > it would be more sensible to print event.element.srq here, since > you're expecting an SRQ event. Thanks for the fix!! I have updated our systems, and am able to see the event. Thanks for the tip too. My async function was a quick copy from the example asyncwatch.c :-) Thanks, Sayantan. > > - R. 
-- http://www.cse.ohio-state.edu/~surs From jlentini at netapp.com Thu Oct 6 14:00:02 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 6 Oct 2005 17:00:02 -0400 (EDT) Subject: [openib-general] Re: [PATCH] udapl: PPC64 cpuinfo change In-Reply-To: References: Message-ID: On Thu, 6 Oct 2005, Todd Bowman wrote: twbowm> This patch in addition to "PPC64 atomic function additions" provides udapl twbowm> support on PPC64 platform. twbowm> twbowm> /proc/cpuinfo on PPC64 prints different label for processor speed. Committed in revision 3687. From jlentini at netapp.com Thu Oct 6 14:00:24 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 6 Oct 2005 17:00:24 -0400 (EDT) Subject: [openib-general] Re: [PATCH] udapl: PPC64 atomic function additions In-Reply-To: References: Message-ID: On Thu, 6 Oct 2005, Todd Bowman wrote: > This patch in addition to "PPC64 cpuinfo change" provides udapl support on > PPC64 platform. > > Added PPC64 dependent code to dapl_os_atomic_inc, dapl_os_atomic_dec, > dapl_os_atomic_assign and DT_Mdep_GetTimeStamp. > Also added PPC64 to platform checks. Committed in revision 3687. From iod00d at hp.com Thu Oct 6 14:14:08 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 6 Oct 2005 14:14:08 -0700 Subject: [openib-general] [PATCH] udapl: PPC64 cpuinfo change In-Reply-To: References: Message-ID: <20051006211408.GF26238@esmail.cup.hp.com> On Thu, Oct 06, 2005 at 11:48:02AM -0600, Todd Bowman wrote: > /proc/cpuinfo on PPC64 prints different label for processor speed. ... ISTR the "clock" value in cpuinfo is NOT the same as the CPU MHz. Can you remind me if "clock" value * "mtfb" results in "wall clock" time units? If not, then use of DT_CPU_MHZ_MHZ needs to be reviewed since it typically makes that assumption. Also, if someone cares about sparc (hey Tom! 
:^) ), then might leverage the get_clock.c code on: http://svn.gnumonks.org/trunk/mmio_test/ hth, grant From robert.j.woodruff at intel.com Thu Oct 6 15:08:17 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 6 Oct 2005 15:08:17 -0700 Subject: [openib-general] RE: OpenIB gen2 support ibv_create_cq Message-ID: <1AC79F16F5C5284499BB9591B33D6F0005C17A28@orsmsx408> Matt wrote, >Woody, are there plans to update the 2.6.9 backports to svn version 3632 >or more recent to fix this? I just checked in new 2.6.9 backport patches for SVN rev. 3640 that should have this fix. woody From hozer at hozed.org Thu Oct 6 21:01:21 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 6 Oct 2005 23:01:21 -0500 Subject: [openib-general] [PATCH] udapl: PPC64 cpuinfo change In-Reply-To: <20051006211408.GF26238@esmail.cup.hp.com> References: <20051006211408.GF26238@esmail.cup.hp.com> Message-ID: <20051007040121.GW4612@kalmia.hozed.org> On Thu, Oct 06, 2005 at 02:14:08PM -0700, Grant Grundler wrote: > On Thu, Oct 06, 2005 at 11:48:02AM -0600, Todd Bowman wrote: > > /proc/cpuinfo on PPC64 prints different label for processor speed. > ... > > ISTR the "clock" value in cpuinfo is NOT the same as the CPU MHz. > Can you remind me if "clock" value * "mtfb" results in > "wall clock" time units? > > If not, then use of DT_CPU_MHZ_MHZ needs to be reviewed since > it typically makes that assumption. > > Also, if someone cares about sparc (hey Tom! :^) ), > then might leverage the get_clock.c code on: > http://svn.gnumonks.org/trunk/mmio_test/ Oh boy.... is there some reason 'gettimeofday' does not work? Trying to infer timebase/clock/rtsc frequency is going to be a mess. Think cpus that dynamically change frequency.. Laptops do now.. how long before something with infiniband does and breaks this code horribly? 
(think embedded systems) There are a couple of implementations of gettimeofday fully in userspace that hide the details and still read the high-res hardware counters. Google for 'vDSO gettimeofday'. From mlleinin at hpcn.ca.sandia.gov Fri Oct 7 01:06:53 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Fri, 07 Oct 2005 01:06:53 -0700 Subject: [openib-general] Timeline of IPoIB performance Message-ID: <1128672413.13948.326.camel@localhost> I'm seeing an IPoIB netperf performance drop off, up to 90 MB/s, when using kernels newer than 2.6.11.
This doesn't appear to be an OpenIB IPoIB issue since the in-kernel and a recent svn3687 snapshot both have the same performance (464 MB/s) with 2.6.11. I used the same kernel config file as a starting point for each of these kernel builds. Have there been any changes in Linux that would explain these results? All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0 dual EM64T 3.2 GHz PCIe IB HCA (memfull)

Kernel          OpenIB     msi_x  netperf (MB/s)
2.6.14-rc3      in-kernel  1      374
2.6.13.2        svn3627    1      386
2.6.13.2        in-kernel  1      394
2.6.12          in-kernel  1      406
2.6.11          in-kernel  1      464
2.6.11          svn3687    1      464
2.6.9-11.ELsmp  svn3513    1      425 (Woody's results, 3.6Ghz EM64T)

Thanks, - Matt From halr at voltaire.com Fri Oct 7 05:21:19 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Oct 2005 08:21:19 -0400 Subject: [openib-general] Latest build test results In-Reply-To: <20051006181147.GB15908@us.ibm.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006181147.GB15908@us.ibm.com> Message-ID: <1128687678.4382.6520.camel@hal.voltaire.com> On Thu, 2005-10-06 at 14:11, Nishanth Aravamudan wrote: > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > Great! Thanks. > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > weren't running) now and will post the latest results. > > > > You might also want to apply > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > to get rid of the AT and SDP warnings. > > I already submitted several jobs for 2.6.14-rc3-git6, but I'll redo the > gen2 ones with that patch, thanks. > > Here are the results from 2.6.14-rc3-git6 + gen2 3683 > > Looks like x86 is broken in the current svn tree. > > x86 and ppc64 mainline is fine with both =y and =m > > ppc64 + gen2 =y > > drivers/infiniband/ulp/srp/ib_srp.c: In function `srp_process_rsp': > drivers/infiniband/ulp/srp/ib_srp.c:650: warning: long long unsigned int format, u64 arg (arg 2) > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > ppc64 + gen2 =m > > same as above, plus > > *** Warning: ".ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined!
> *** Warning: ".ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! > > x86 + gen2 =y *FAILED* What gcc version are you using ? > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_adaptor_release': > drivers/infiniband/ulp/iser/iser_conn.c:195: parse error before `)' > drivers/infiniband/ulp/iser/iser_conn.c:203: parse error before `)' > drivers/infiniband/ulp/iser/iser_conn.c:206: parse error before `)' > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_establish': > drivers/infiniband/ulp/iser/iser_conn.c:284: parse error before `)' > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_post_receive_control': > drivers/infiniband/ulp/iser/iser_conn.c:861: parse error before `)' > drivers/infiniband/ulp/iser/iser_conn.c:873: parse error before `)' > > drivers/infiniband/ulp/iser/iser_initiator.c: In function `iser_reg_rdma_mem': > drivers/infiniband/ulp/iser/iser_initiator.c:125: parse error before `)' > drivers/infiniband/ulp/iser/iser_initiator.c:130: parse error before `)' > drivers/infiniband/ulp/iser/iser_initiator.c:141: parse error before `)' > drivers/infiniband/ulp/iser/iser_initiator.c:153: parse error before `)' > > drivers/infiniband/ulp/iser/iser_mod.c: In function `init_module': > drivers/infiniband/ulp/iser/iser_mod.c:154: parse error before `)' > drivers/infiniband/ulp/iser/iser_mod.c: In function `cleanup_module': > drivers/infiniband/ulp/iser/iser_mod.c:243: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_create_ia_pz_evd': > drivers/infiniband/ulp/iser/iser_lkdapl.c:147: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_consume_events': > drivers/infiniband/ulp/iser/iser_lkdapl.c:691: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_event_handler_thread': > drivers/infiniband/ulp/iser/iser_lkdapl.c:731: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:749: parse error before `)' 
> drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_conn_event': > drivers/infiniband/ulp/iser/iser_lkdapl.c:776: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:779: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:782: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:785: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:788: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:791: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:794: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:797: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c:800: parse error before `)' > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_single_kdapl_event': > drivers/infiniband/ulp/iser/iser_lkdapl.c:1025: parse error before `)' > > x86 + gen2 =m *FAILED* > > same as above Can you try this patch and see if it eliminates the iser errors ? Thanks. -- Hal Signed-off-by: Hal Rosenstock Index: iser.h =================================================================== --- iser.h (revision 3691) +++ iser.h (working copy) @@ -334,7 +334,7 @@ extern int iser_debug_level; do { \ if (iser_debug_level > 0) \ printk(KERN_DEBUG PFX "%s:" fmt,\ - __func__, ## arg); \ + __func__ , ## arg); \ } while (0) #define iser_err(fmt, arg...) \ From halr at voltaire.com Fri Oct 7 05:38:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Oct 2005 08:38:05 -0400 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <1128672413.13948.326.camel@localhost> References: <1128672413.13948.326.camel@localhost> Message-ID: <1128688684.4382.6629.camel@hal.voltaire.com> Hi Matt, On Fri, 2005-10-07 at 04:06, Matt Leininger wrote: > I'm seeing an IPoIB netperf performance drop off, up to 90 MB/s, when > using kernels newer than 2.6.11. 
This doesn't appear to be an OpenIB > IPoIB issue since the in-kernel and a recent svn3687 snapshot both have > the same performance (464 MB/s) with 2.6.11. I used the same kernel > config file as a starting point for each of these kernel builds. Have > there been any changes in Linux that would explain these results? > > > All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0 > dual EM64T 3.2 GHz PCIe IB HCA (memfull) > > Kernel OpenIB msi_x netperf (MB/s) > 2.6.14-rc3 in-kernel 1 374 > 2.6.13.2 svn3627 1 386 > 2.6.13.2 in-kernel 1 394 > 2.6.12 in-kernel 1 406 > 2.6.11 in-kernel 1 464 > 2.6.11 svn3687 1 464 > 2.6.9-11.ELsmp svn3513 1 425 (Woody's results, 3.6Ghz EM64T) There was already the following thread on netdev that I found: TCP Network performance degade from 2.4.18 to 2.6.10 http://marc.theaimsgroup.com/?l=linux-netdev&m=112792558832125&w=2 I think you should (cross)post this to netdev. -- Hal From nacc at us.ibm.com Fri Oct 7 06:47:46 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Fri, 7 Oct 2005 06:47:46 -0700 Subject: [openib-general] Latest build test results In-Reply-To: <1128687678.4382.6520.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006181147.GB15908@us.ibm.com> <1128687678.4382.6520.camel@hal.voltaire.com> Message-ID: <20051007134746.GA5972@us.ibm.com> On 07.10.2005 [08:21:19 -0400], Hal Rosenstock wrote: > On Thu, 2005-10-06 at 14:11, Nishanth Aravamudan wrote: > > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > > > Great! Thanks. > > > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > > weren't running) now and will post the latest results. 
> > > > > > You might also want to apply > > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > > to get rid of the AT and SDP warnings. > > > > I already submitted several jobs for 2.6.14-rc3-git6, but I'll redo the > > gen2 ones with that patch, thanks. > > > > Here are the results from 2.6.14-rc3-git6 + gen2 3683 > > > > Looks like x86 is broken in the current svn tree. > > > > x86 and ppc64 mainline is fine with both =y and =m > > > > ppc64 + gen2 =y > > > > drivers/infiniband/ulp/srp/ib_srp.c: In function `srp_process_rsp': > > drivers/infiniband/ulp/srp/ib_srp.c:650: warning: long long unsigned int format, u64 arg (arg 2) > > > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > > > ppc64 + gen2 =m > > > > same as above, plus > > > > *** Warning: ".ip_dev_find" [drivers/infiniband/ulp/sdp/ib_sdp.ko] undefined! > > *** Warning: ".ip_dev_find" [drivers/infiniband/core/ib_at.ko] undefined! > > > > x86 + gen2 =y *FAILED* > > What gcc version are you using ? 
I believe the build systems on all the automated machines are 2.95: Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs gcc version 2.95.4 20011002 (Debian prerelease) > > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_adaptor_release': > > drivers/infiniband/ulp/iser/iser_conn.c:195: parse error before `)' > > drivers/infiniband/ulp/iser/iser_conn.c:203: parse error before `)' > > drivers/infiniband/ulp/iser/iser_conn.c:206: parse error before `)' > > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_conn_establish': > > drivers/infiniband/ulp/iser/iser_conn.c:284: parse error before `)' > > drivers/infiniband/ulp/iser/iser_conn.c: In function `iser_post_receive_control': > > drivers/infiniband/ulp/iser/iser_conn.c:861: parse error before `)' > > drivers/infiniband/ulp/iser/iser_conn.c:873: parse error before `)' > > > > drivers/infiniband/ulp/iser/iser_initiator.c: In function `iser_reg_rdma_mem': > > drivers/infiniband/ulp/iser/iser_initiator.c:125: parse error before `)' > > drivers/infiniband/ulp/iser/iser_initiator.c:130: parse error before `)' > > drivers/infiniband/ulp/iser/iser_initiator.c:141: parse error before `)' > > drivers/infiniband/ulp/iser/iser_initiator.c:153: parse error before `)' > > > > drivers/infiniband/ulp/iser/iser_mod.c: In function `init_module': > > drivers/infiniband/ulp/iser/iser_mod.c:154: parse error before `)' > > drivers/infiniband/ulp/iser/iser_mod.c: In function `cleanup_module': > > drivers/infiniband/ulp/iser/iser_mod.c:243: parse error before `)' > > > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_create_ia_pz_evd': > > drivers/infiniband/ulp/iser/iser_lkdapl.c:147: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_consume_events': > > drivers/infiniband/ulp/iser/iser_lkdapl.c:691: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_event_handler_thread': > > drivers/infiniband/ulp/iser/iser_lkdapl.c:731: 
parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:749: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_conn_event': > > drivers/infiniband/ulp/iser/iser_lkdapl.c:776: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:779: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:782: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:785: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:788: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:791: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:794: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:797: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c:800: parse error before `)' > > drivers/infiniband/ulp/iser/iser_lkdapl.c: In function `iser_handle_single_kdapl_event': > > drivers/infiniband/ulp/iser/iser_lkdapl.c:1025: parse error before `)' > > > > x86 + gen2 =m *FAILED* > > > > same as above > > Can you try this patch and see if it eliminates the iser errors ? > Thanks. Will try it in a bit. 
Thanks, Nish From halr at voltaire.com Fri Oct 7 06:48:56 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Oct 2005 09:48:56 -0400 Subject: [openib-general] Latest build test results In-Reply-To: <1128626684.4382.1762.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> Message-ID: <1128692935.4382.7072.camel@hal.voltaire.com> On Thu, 2005-10-06 at 15:26, Hal Rosenstock wrote: > On Thu, 2005-10-06 at 15:20, Nishanth Aravamudan wrote: > > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > > > Great! Thanks. > > > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > > weren't running) now and will post the latest results. > > > > > > You might also want to apply > > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > > to get rid of the AT and SDP warnings. > > > > This patch does remove the warning regarding undefined symbols during > > modpost, but does not remove the warnings > > > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > Right. Roland reported a change to struct packet_type in 2.6.14. I'll > work on a patch for this too. Thanks. Can you try this patch for the above 2 warnings ? If it works, I check it into the patches directory. Thanks. 
-- Hal

Update arp_recv functions to latest 2.6.14 netdevice.h API for struct packet_type

Signed-off-by: Hal Rosenstock

Index: core/at.c
===================================================================
--- core/at.c	(revision 3691)
+++ core/at.c	(working copy)
@@ -1258,7 +1258,7 @@ static void ib_at_arp_work(void *data)
 }
 
 static int ib_at_arp_recv(struct sk_buff *skb, struct net_device *dev,
-			  struct packet_type *pt)
+			  struct packet_type *pt, struct net_device *orig_dev)
 {
 	struct arp_work *work;
 	struct arphdr *arp_hdr;
Index: ulp/sdp/sdp_link.c
===================================================================
--- ulp/sdp/sdp_link.c	(revision 3691)
+++ ulp/sdp/sdp_link.c	(working copy)
@@ -716,7 +716,7 @@ done:
  * sdp_link_arp_recv - receive all ARP packets
  */
 static int sdp_link_arp_recv(struct sk_buff *skb, struct net_device *dev,
-			     struct packet_type *pt)
+			     struct packet_type *pt, struct net_device *orig_dev)
 {
 	struct sdp_work *work;
 	struct arphdr *arp_hdr;

From hozer at hozed.org Fri Oct 7 07:12:07 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Fri, 7 Oct 2005 09:12:07 -0500
Subject: [openib-general] IBM eHCA testing..
Message-ID: <20051007141207.GX4612@kalmia.hozed.org>

I have two IBM eHCA cards installed and it appears that OpenSM
is happily talking to the firmware and bringing up the links.

So now I'm looking at the install instructions for the ehca2_EHCA2_0025.tgz
code drop, and wondering what (if any) issues there are with a 2.6.13
kernel, or later OpenIB svn drops.

Is there a later code drop I can get ahold of? Is the nr_ports issue
something in the driver? I wound up connecting to the lower port in the
Openpower720 machine.. do you know if that's port 1 or 2?
From nacc at us.ibm.com Fri Oct 7 07:16:39 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Fri, 7 Oct 2005 07:16:39 -0700 Subject: [openib-general] Latest build test results In-Reply-To: <1128692935.4382.7072.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <1128692935.4382.7072.camel@hal.voltaire.com> Message-ID: <20051007141639.GB5972@us.ibm.com> On 07.10.2005 [09:48:56 -0400], Hal Rosenstock wrote: > On Thu, 2005-10-06 at 15:26, Hal Rosenstock wrote: > > On Thu, 2005-10-06 at 15:20, Nishanth Aravamudan wrote: > > > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > > > > > Great! Thanks. > > > > > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > > > weren't running) now and will post the latest results. > > > > > > > > You might also want to apply > > > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > > > to get rid of the AT and SDP warnings. > > > > > > This patch does remove the warning regarding undefined symbols during > > > modpost, but does not remove the warnings > > > > > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > > > > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > > > Right. Roland reported a change to struct packet_type in 2.6.14. I'll > > work on a patch for this too. Thanks. > > Can you try this patch for the above 2 warnings ? If it works, I check > it into the patches directory. Thanks. 
Will try this along with the other patch you sent after I return from class (about 2 hours). Thanks, Nish From parks at lanl.gov Fri Oct 7 08:49:12 2005 From: parks at lanl.gov (Parks Fields) Date: Fri, 07 Oct 2005 09:49:12 -0600 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <1128672413.13948.326.camel@localhost> References: <1128672413.13948.326.camel@localhost> Message-ID: <6.2.3.4.2.20051007074938.01fefcf8@ccn-mail.lanl.gov> Matt, I have seen the same thing. I just didn't relate it to the Kernel. My IPoIB performance is down to ~340MB/sec with 2.6.12.1 and svn 3040. With 2.6.13 and svn 3490 the peak is 402MB/sec. At 02:06 AM 10/7/2005, Matt Leininger wrote: >I'm seeing an IPoIB netperf performance drop off, up to 90 MB/s, when >using kernels newer than 2.6.11. This doesn't appear to be an OpenIB >IPoIB issue since the in-kernel and a recent svn3687 snapshot both have >the same performance (464 MB/s) with 2.6.11. I used the same kernel >config file as a starting point for each of these kernel builds. Have >there been any changes in Linux that would explain these results?
>
>
>All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
>dual EM64T 3.2 GHz PCIe IB HCA (memfull)
>
>Kernel          OpenIB     msi_x  netperf (MB/s)
>2.6.14-rc3      in-kernel  1      374
>2.6.13.2        svn3627    1      386
>2.6.13.2        in-kernel  1      394
>2.6.12          in-kernel  1      406
>2.6.11          in-kernel  1      464
>2.6.11          svn3687    1      464
>2.6.9-11.ELsmp  svn3513    1      425  (Woody's results, 3.6Ghz EM64T)
>
> Thanks,
>
> - Matt
>
>
>
>_______________________________________________
>openib-general mailing list
>openib-general at openib.org
>http://openib.org/mailman/listinfo/openib-general
>
>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From rolandd at cisco.com Fri Oct 7 08:58:55 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 07 Oct 2005 08:58:55 -0700
Subject: [openib-general] Timeline of IPoIB performance
In-Reply-To: <1128672413.13948.326.camel@localhost> (Matt Leininger's message of "Fri, 07 Oct 2005 01:06:53 -0700")
References: <1128672413.13948.326.camel@localhost>
Message-ID: <52ek6xmi80.fsf@cisco.com>

Hmm, looks like something in the network stack must have changed.

 > 2.6.12 in-kernel 1 406
 > 2.6.11 in-kernel 1 464

This looks like the biggest dropoff. I can think of two things that
would be interesting to do if you or anyone else has time. First,
taking profiles of netperf runs between these two kernels and
comparing might be enlightening. Also, it would be useful to pin down
when the regression happened, so running the same test with 2.6.12-rc1
through 2.6.12-rc6 would be a good thing.

 - R.

From pradeep at us.ibm.com Fri Oct 7 09:14:04 2005
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Fri, 7 Oct 2005 09:14:04 -0700
Subject: [openib-general] IBM eHCA testing..
In-Reply-To: <20051007141207.GX4612@kalmia.hozed.org>
Message-ID:

I believe the lower port is port 1. I will defer to the EHCA team as
regards to issues with 2.6.13 (if any). We have minimally used both ports
on p570. So, my guess is that should work on a Openpower720.
Pradeep pradeep at us.ibm.com openib-general-bounces at openib.org wrote on 10/07/2005 07:12:07 AM: > I have two IBM eHCA cards installed and it appears that OpenSM > is happily talking to the firmware and bringing up the links. > > So now I'm looking at the install instructions for the ehca2_EHCA2_0025.tgz > code drop, and wondering what (if any) issues there are with a 2.6.13 > kernel, or later OpenIB svn drops. > > Is there a later code drop I can get ahold of? Is the nr_ports issue > something in the driver? I wound up connecting to the lower port in the > Openpower720 machine.. do you know if that's port 1 or 2? > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From krause at cup.hp.com Fri Oct 7 09:20:17 2005 From: krause at cup.hp.com (Michael Krause) Date: Fri, 07 Oct 2005 09:20:17 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020912@NT-SJCA-0751.brcm.a d.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020912@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <6.2.0.14.2.20051007091316.024dec70@esmail.cup.hp.com> At 06:38 AM 9/30/2005, Caitlin Bestler wrote: > > > > -----Original Message----- > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] On Behalf Of Roland Dreier > > Sent: Thursday, September 29, 2005 6:50 PM > > To: Sean Hefty > > Cc: Openib > > Subject: Re: [openib-general] [RFC] IB address translation using ARP > > > > Sean> Can you explain how RDMA works in this case? This is simply > > Sean> performing IP routing, and not IB routing, correct? Are you > > Sean> referring to a protocol running on top of IP or IB directly? 
> > Sean> Is the router establishing a second reliable connection on > > Sean> the backend? Does it simply translate headers as packets > > Sean> pass through in this case? > > > > I think the usage model is the following: you have some magic > > device that has an IB port on one side and "something else" > > on the other side. Think of something like a gateway that > > talks SDP on the IB side and TCP/IP on the other side. > > > > You configure your IPoIB routing so that this magic device is > > the next hop for talking to hosts on the IP network on the other side. > > > > Now someone tries to make an SDP connection to an IP address > > on the other side of the magic device. Routing tables + ARP > > give it the GID of the IB port of this magic device. It > > connects to the magic device and run SDP to talk to the magic > > device, and the magic device magically splices this into a > > TCP connection to the real destination. > > > > Or the same idea for an NFS/RDMA <-> NFS/UDP gateway, etc. > > > >Those examples are all basically application level gateways. >As such they would have no transport or connection setup >implications. The application level gateway simply offers >a service on network X that it fulfills on network Y. But >as far as network X is concerned the gateway IS the server. It must be viewed as such. The cross over point between the two domains represents independent management domains, trust domains, reliable delivery domains, etc. >I do not believe it is possible to construct a transport >layer gateway that bridges RDMA between IB and iWARP while >appearing to be a normal RDMA endpoint on both networks. >Higher level gateways will be possible for many >applications, but I don't see how that relates to >connection establishment. That would require having >an end-to-end reliable connection, complete with flow >control semantics, that bridged the two networks by >some method other than encapsulation or tunneling. 
We took steps to insure that both IB and iWARP could transmit packets in the main data path very efficiently between the two interconnects but it was never envisioned that a connection was truly end-to-end transparent across the gateway component. I think most of the architects would not support such an effort to define such a beast. There are many issues in attempting such an offering. Just examine all of the problems with the existing iSCSI to FC solutions; they ignore a number of customer issues and hence have been relegated in many customer minds as TTM, play toys not ready for prime time. This is one of the many reasons why iSCSI has not taken off as the hype portrayed. It would be best to define a CM architecture that enabled communication between like endpoints and avoid the gateway dilemma. Let the gateway provider work out such issues as there are many requirements already on each side of these interconnects. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From krause at cup.hp.com Fri Oct 7 09:29:19 2005 From: krause at cup.hp.com (Michael Krause) Date: Fri, 07 Oct 2005 09:29:19 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F7F9F14@taurus.voltaire.com > References: <35EA21F54A45CB47B879F21A91F4862F7F9F14@taurus.voltaire.com> Message-ID: <6.2.0.14.2.20051007092706.02504c98@esmail.cup.hp.com> At 06:24 AM 9/30/2005, Yaron Haviv wrote: > > -----Original Message----- > > From: Roland Dreier [mailto:rolandd at cisco.com] > > Sent: Thursday, September 29, 2005 9:50 PM > > To: Sean Hefty > > Cc: Yaron Haviv; Openib > > Subject: Re: [openib-general] [RFC] IB address translation using ARP > > > > I think the usage model is the following: you have some magic device > > that has an IB port on one side and "something else" on the other > > side. Think of something like a gateway that talks SDP on the IB side > > and TCP/IP on the other side. 
> > > >Also applicable to two IB ports, e.g. forwarding SDP traffic from one IB >partition to SDP on another partition (may even be the same port with >two P_Keys), and doing some load-balancing or traffic management in >between, overall there are many use cases for that. While I can envision how an endpoint could communicate with another in separate partitions, doing so really violates the spirit of the partitioning where endpoints must be in the same partition in order to see one another and communicate. Attempting to create an intermediary who has insights into both and then somehow is able to communicate how to find one another using some proprietary (can't be through standards that I can think of) method, seems like way too much complexity to be worth it. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From xma at us.ibm.com Fri Oct 7 09:33:27 2005 From: xma at us.ibm.com (Shirley Ma) Date: Fri, 7 Oct 2005 09:33:27 -0700 Subject: [openib-general] IBM eHCA testing.. In-Reply-To: Message-ID: Hi, Troy, There is INSTALL file in the EHCA driver package. In OpenPower 720 port 1 is at the top, port 2 is at the bottom. In P570, port1 is at the bottom, port2 is at the top. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.hefty at intel.com Fri Oct 7 09:40:04 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 7 Oct 2005 09:40:04 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <6.2.0.14.2.20051007091316.024dec70@esmail.cup.hp.com> Message-ID: >It would be best to define a CM architecture that enabled communication >between like endpoints and avoid the gateway dilemma. Let the gateway >provider work out such issues as there are many requirements already >on each side of these interconnects. 
I've given this some more thought since the original postings and agree with you. It doesn't seem right to me to have the CM establish a connection to something that is not the specified destination, under the assumption that whatever is being connected to is a gateway. I think it would be better for the application to determine that the actual destination is on a different subnet, locate the gateway, and issue a connection request to the gateway. - Sean From iod00d at hp.com Fri Oct 7 10:05:50 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 7 Oct 2005 10:05:50 -0700 Subject: [openib-general] [PATCH] udapl: PPC64 cpuinfo change In-Reply-To: <20051007040121.GW4612@kalmia.hozed.org> References: <20051006211408.GF26238@esmail.cup.hp.com> <20051007040121.GW4612@kalmia.hozed.org> Message-ID: <20051007170550.GD30308@esmail.cup.hp.com> On Thu, Oct 06, 2005 at 11:01:21PM -0500, Troy Benjegerdes wrote: > Oh boy.... is there some reason 'gettimeofday' does not work? In general, it doesn't work as well. > Trying to infer timebase/clock/rtsc frequency is going to be a mess. Using cycle counters is quite portable today and provides accurate results (with caveats on it's use). I'm open to using the next best thing once it's clear the cycle counters do NOT work. > Think cpus that dynamically change frequency.. Laptops do now.. > how long before something with infiniband does and breaks this > code horribly? (think embedded systems) I don't buy this argument. Most of the tests load the CPU and it essentially runs at a fixed frequency. A better argument is how to benchmark under virtualized environment. I think that is totally broken today regardless of what method one uses to measure time. > There are a couple of implementations of gettimeofday fully in userspace > that hide the details and still read the high-res hardware counters. Google > for 'vDSO gettimeofday'. 
Well, I'm sure Michael is open to patches on this for userspace/perftest stuff and like wise for James Lentini for uDAPL. grant From sean.hefty at intel.com Fri Oct 7 12:19:23 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 7 Oct 2005 12:19:23 -0700 Subject: [openib-general] [PATCH] [ADDR] address translation module for CMA Message-ID: The following patch adds a simple IP to IB address translation module using ARP. It is based off AT and SDP, but kept as simple as possible. I would like to merge this back into the trunk, and apply other changes there. Signed-off-by: Sean Hefty Index: include/rdma/ib_addr.h =================================================================== --- include/rdma/ib_addr.h (revision 0) +++ include/rdma/ib_addr.h (revision 0) @@ -0,0 +1,72 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ * + */ + +#if !defined(IB_ADDR_H) +#define IB_ADDR_H + +#include +#include + +struct ib_addr { + union ib_gid sgid; + union ib_gid dgid; + u16 pkey; +}; + +/** + * ib_translate_addr - Translate a local IP address to an Infiniband GID and + * PKey. + */ +int ib_translate_addr(struct sockaddr *addr, union ib_gid *gid, u16 *pkey); + +/** + * ib_resolve_addr - Resolve source and destination IP addresses to + * Infiniband network addresses. + * @src_addr: An optional source address to use in the resolution. If a + * source address is not provided, a usable address will be returned via + * the callback. + * @dst_addr: The destination address to resolve. + * @addr: A reference to a data location that will receive the resolved + * addresses. The data location must remain valid until the callback has + * been invoked. + * @timeout_ms: Amount of time to wait for the address resolution to complete. + * @callback: Call invoked once address resolution has completed, timed out, + * or been canceled. A status of 0 indicates success. + * @context: User-specified context associated with the call. + */ +int ib_resolve_addr(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct ib_addr *addr, int timeout_ms, + void (*callback)(int status, struct sockaddr *src_addr, + struct ib_addr *addr, void *context), + void *context); + +void ib_addr_cancel(struct ib_addr *addr); + +#endif /* IB_ADDR_H */ + Index: core/addr.c =================================================================== --- core/addr.c (revision 0) +++ core/addr.c (revision 0) @@ -0,0 +1,351 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("IB Address Translation"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct addr_req { + struct list_head list; + struct sockaddr src_addr; + struct sockaddr dst_addr; + struct ib_addr *addr; + void *context; + void (*callback)(int status, struct sockaddr *src_addr, + struct ib_addr *addr, void *context); + unsigned long timeout; + int status; +}; + +static void process_req(void *data); + +static DECLARE_MUTEX(mutex); +static LIST_HEAD(req_list); +static DECLARE_WORK(work, process_req, NULL); +static struct workqueue_struct *wq; + +static u16 addr_get_pkey(struct net_device *dev) +{ + return ((u16)dev->broadcast[8] << 8) | (u16)dev->broadcast[9]; +} + +int ib_translate_addr(struct sockaddr *addr, union ib_gid *gid, u16 *pkey) +{ + struct net_device *dev; + u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr; + + dev = ip_dev_find(ip); + if (!dev) + return -EADDRNOTAVAIL; + + *gid = *(union ib_gid *) (dev->dev_addr + 4); + *pkey = addr_get_pkey(dev); + dev_put(dev); + return 0; +} +EXPORT_SYMBOL(ib_translate_addr); + +static void set_timeout(unsigned long time) +{ + unsigned long delay; + + cancel_delayed_work(&work); + + delay = time - jiffies; + if ((long)delay <= 0) + delay = 1; + + queue_delayed_work(wq, &work, delay); +} + +static void queue_req(struct addr_req *req) +{ + struct addr_req *temp_req; + + down(&mutex); + list_for_each_entry_reverse(temp_req, &req_list, list) { + if (time_after(req->timeout, temp_req->timeout)) + break; + } + + list_add(&req->list, &temp_req->list); + + if (req_list.next == &req->list) + set_timeout(req->timeout); + up(&mutex); +} + +static void addr_send_arp(struct sockaddr_in *dst_in) +{ + struct rtable *rt; + struct flowi fl; + u32 dst_ip = dst_in->sin_addr.s_addr; + + memset(&fl, 0, sizeof fl); + fl.nl_u.ip4_u.daddr = dst_ip; + if (ip_route_output_key(&rt, &fl)) + return; + + 
arp_send(ARPOP_REQUEST, ETH_P_ARP, dst_ip, rt->idev->dev, rt->rt_src, + NULL, rt->idev->dev->dev_addr, NULL); + ip_rt_put(rt); +} + +static int addr_resolve_remote(struct sockaddr_in *src_in, + struct sockaddr_in *dst_in, + struct ib_addr *addr) +{ + u32 src_ip = src_in->sin_addr.s_addr; + u32 dst_ip = dst_in->sin_addr.s_addr; + struct flowi fl; + struct rtable *rt; + struct neighbour *neigh; + int ret; + + memset(&fl, 0, sizeof fl); + fl.nl_u.ip4_u.daddr = dst_ip; + fl.nl_u.ip4_u.saddr = src_ip; + ret = ip_route_output_key(&rt, &fl); + if (ret) + goto out; + + neigh = neigh_lookup(&arp_tbl, &dst_ip, rt->idev->dev); + if (!neigh) { + ret = -ENODATA; + goto err1; + } + + if (!(neigh->nud_state & NUD_VALID)) { + ret = -ENODATA; + goto err2; + } + + if (!src_ip) { + src_in->sin_family = dst_in->sin_family; + src_in->sin_addr.s_addr = rt->rt_src; + } + + addr->sgid = *(union ib_gid *) (neigh->dev->dev_addr + 4); + addr->dgid = *(union ib_gid *) (neigh->ha + 4); + addr->pkey = addr_get_pkey(neigh->dev); + +err2: + neigh_release(neigh); +err1: + ip_rt_put(rt); +out: + return ret; +} + +static void process_req(void *data) +{ + struct addr_req *req, *temp_req; + struct sockaddr_in *src_in, *dst_in; + struct list_head done_list; + + INIT_LIST_HEAD(&done_list); + + down(&mutex); + list_for_each_entry_safe(req, temp_req, &req_list, list) { + if (req->status) { + src_in = (struct sockaddr_in *) &req->src_addr; + dst_in = (struct sockaddr_in *) &req->dst_addr; + req->status = addr_resolve_remote(src_in, dst_in, + req->addr); + } + if (req->status && time_after(jiffies, req->timeout)) + req->status = -ETIMEDOUT; + else if (req->status == -ENODATA) + continue; + + list_del(&req->list); + list_add_tail(&req->list, &done_list); + } + + if (!list_empty(&req_list)) { + req = list_entry(req_list.next, struct addr_req, list); + set_timeout(req->timeout); + } + up(&mutex); + + list_for_each_entry_safe(req, temp_req, &done_list, list) { + list_del(&req->list); + 
req->callback(req->status, &req->src_addr, req->addr, + req->context); + kfree(req); + } +} + +static int addr_resolve_local(struct sockaddr_in *src_in, + struct sockaddr_in *dst_in, + struct ib_addr *addr) +{ + struct net_device *dev; + u32 src_ip = src_in->sin_addr.s_addr; + u32 dst_ip = dst_in->sin_addr.s_addr; + int ret = 0; + + dev = ip_dev_find(dst_ip); + if (!dev) + return -EADDRNOTAVAIL; + + if (!src_ip) { + src_in->sin_family = dst_in->sin_family; + src_in->sin_addr.s_addr = dst_ip; + addr->sgid = *(union ib_gid *) (dev->dev_addr + 4); + addr->pkey = addr_get_pkey(dev); + } else { + ret = ib_translate_addr((struct sockaddr *)src_in, + &addr->sgid, &addr->pkey); + if (ret) + goto out; + } + + addr->dgid = *(union ib_gid *) (dev->dev_addr + 4); +out: + dev_put(dev); + return ret; +} + +int ib_resolve_addr(struct sockaddr *src_addr, struct sockaddr *dst_addr, + struct ib_addr *addr, int timeout_ms, + void (*callback)(int status, struct sockaddr *src_addr, + struct ib_addr *addr, void *context), + void *context) +{ + struct sockaddr_in *src_in, *dst_in; + struct addr_req *req; + int ret = 0; + + req = kmalloc(sizeof *req, GFP_KERNEL); + if (!req) + return -ENOMEM; + memset(req, 0, sizeof *req); + + if (src_addr) + req->src_addr = *src_addr; + req->dst_addr = *dst_addr; + req->addr = addr; + req->callback = callback; + req->context = context; + + src_in = (struct sockaddr_in *) &req->src_addr; + dst_in = (struct sockaddr_in *) &req->dst_addr; + + req->status = addr_resolve_local(src_in, dst_in, addr); + if (req->status == -EADDRNOTAVAIL) + req->status = addr_resolve_remote(src_in, dst_in, addr); + + switch (req->status) { + case 0: + req->timeout = jiffies; + queue_req(req); + break; + case -ENODATA: + req->timeout = msecs_to_jiffies(timeout_ms) + jiffies; + queue_req(req); + addr_send_arp(dst_in); + break; + default: + ret = req->status; + kfree(req); + break; + } + return ret; +} +EXPORT_SYMBOL(ib_resolve_addr); + +void ib_addr_cancel(struct ib_addr *addr) +{ 
+	struct addr_req *req, *temp_req;
+
+	down(&mutex);
+	list_for_each_entry_safe(req, temp_req, &req_list, list) {
+		if (req->addr == addr) {
+			req->status = -ECANCELED;
+			req->timeout = jiffies;
+			list_del(&req->list);
+			list_add(&req->list, &req_list);
+			set_timeout(req->timeout);
+			break;
+		}
+	}
+	up(&mutex);
+}
+EXPORT_SYMBOL(ib_addr_cancel);
+
+static int addr_arp_recv(struct sk_buff *skb, struct net_device *dev,
+			 struct packet_type *pkt)
+{
+	struct arphdr *arp_hdr;
+
+	arp_hdr = (struct arphdr *) skb->nh.raw;
+
+	if (dev->type == ARPHRD_INFINIBAND &&
+	    (arp_hdr->ar_op == __constant_htons(ARPOP_REQUEST) ||
+	     arp_hdr->ar_op == __constant_htons(ARPOP_REPLY)))
+		set_timeout(jiffies);
+
+	kfree_skb(skb);
+	return 0;
+}
+
+static struct packet_type addr_arp = {
+	.type = __constant_htons(ETH_P_ARP),
+	.func = addr_arp_recv,
+	.af_packet_priv = (void*) 1,
+};
+
+static int addr_init(void)
+{
+	wq = create_singlethread_workqueue("ib_addr");
+	if (!wq)
+		return -ENOMEM;
+
+	dev_add_pack(&addr_arp);
+	return 0;
+}
+
+static void addr_cleanup(void)
+{
+	dev_remove_pack(&addr_arp);
+	destroy_workqueue(wq);
+}
+
+module_init(addr_init);
+module_exit(addr_cleanup);

From pradeep at us.ibm.com Fri Oct 7 12:19:48 2005
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Fri, 7 Oct 2005 12:19:48 -0700
Subject: [openib-general] Questions about mad_test
Message-ID:

I am hoping someone will be able to help me out with a few answers, saving me some debug time or effort spent on something that is already known.

I was trying to execute mad_test and found that it errors out. For some reason it does not like the DR Path that I gave it.

1. I ran ibnetdiscover and got the set of LIDs that I use as the DR Path. Is that the correct way to go about it? It always errors out with something like:
   hop 0 != 0 or
   hop 1 != dev_port

2. Also, there is an expectation of there being a device /dev/infiniband/mthca0/ports/1/mad (using all defaults in this case). Is that correct?
Any specific major and minor numbers I must use?

3. Anything else that I am missing?

I am using this from trunk 3675 on 2.6.13 kernel.

Thanks in advance for all the help!

Pradeep
pradeep at us.ibm.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sean.hefty at intel.com Fri Oct 7 12:27:44 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 7 Oct 2005 12:27:44 -0700
Subject: [openib-general] [PATCH] [CMA] RDMA CM abstraction module
Message-ID:

The following patch adds in a basic RDMA connection management abstraction. It is functional, but needs additional work for handling device removal, plus several missing features.

I'd like to merge this back into the trunk, and continue working on it from there. This depends on the ib_addr module.

Signed-off-by: Sean Hefty

Index: include/rdma/rdma_cm.h
===================================================================
--- include/rdma/rdma_cm.h	(revision 0)
+++ include/rdma/rdma_cm.h	(revision 0)
@@ -0,0 +1,201 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *    copy of which is available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ */
+
+#if !defined(RDMA_CM_H)
+#define RDMA_CM_H
+
+#include
+#include
+#include
+
+/*
+ * Upon receiving a device removal event, users must destroy the associated
+ * RDMA identifier and release all resources allocated with the device.
+ */
+enum rdma_event_type {
+	RDMA_EVENT_ADDR_RESOLVED,
+	RDMA_EVENT_ADDR_ERROR,
+	RDMA_EVENT_ROUTE_RESOLVED,
+	RDMA_EVENT_ROUTE_ERROR,
+	RDMA_EVENT_CONNECT_REQUEST,
+	RDMA_EVENT_CONNECT_ERROR,
+	RDMA_EVENT_UNREACHABLE,
+	RDMA_EVENT_REJECTED,
+	RDMA_EVENT_ESTABLISHED,
+	RDMA_EVENT_DISCONNECTED,
+	RDMA_EVENT_DEVICE_REMOVAL,
+};
+
+struct rdma_addr {
+	struct sockaddr src_addr;
+	struct sockaddr dst_addr;
+	union {
+		struct ib_addr ibaddr;
+	} addr;
+};
+
+struct rdma_route {
+	struct rdma_addr addr;
+	struct ib_sa_path_rec *path_rec;
+	int num_paths;
+};
+
+struct rdma_event {
+	enum rdma_event_type event;
+	int status;
+	void *private_data;
+	u8 private_data_len;
+};
+
+struct rdma_id;
+
+/**
+ * rdma_event_handler - Callback used to report user events.
+ *
+ * Notes: Users may not call rdma_destroy_id from this callback to destroy
+ *   the passed in id, or a corresponding listen id. Returning a
+ *   non-zero value from the callback will destroy the corresponding id.
+ */
+typedef int (*rdma_event_handler)(struct rdma_id *id, struct rdma_event *event);
+
+struct rdma_id {
+	struct ib_device *device;
+	void *context;
+	struct ib_qp *qp;
+	rdma_event_handler event_handler;
+	struct rdma_route route;
+};
+
+struct rdma_id* rdma_create_id(rdma_event_handler event_handler, void *context);
+
+void rdma_destroy_id(struct rdma_id *id);
+
+/**
+ * rdma_bind_addr - Bind an RDMA identifier to a source address and
+ *   associated RDMA device, if needed.
+ *
+ * @id: RDMA identifier.
+ * @addr: Local address information. Wildcard values are permitted.
+ *
+ * This associates a source address with the RDMA identifier before calling
+ * rdma_listen. If a specific local address is given, the RDMA identifier will
+ * be bound to a local RDMA device.
+ */
+int rdma_bind_addr(struct rdma_id *id, struct sockaddr *addr);
+
+/**
+ * rdma_resolve_addr - Resolve destination and optional source addresses
+ *   from IP addresses to an RDMA address. If successful, the specified
+ *   rdma_id will be bound to a local device.
+ *
+ * @id: RDMA identifier.
+ * @src_addr: Source address information. This parameter may be NULL.
+ * @dst_addr: Destination address information.
+ * @timeout_ms: Time to wait for resolution to complete.
+ */
+int rdma_resolve_addr(struct rdma_id *id, struct sockaddr *src_addr,
+		      struct sockaddr *dst_addr, int timeout_ms);
+
+/**
+ * rdma_resolve_route - Resolve the RDMA address bound to the RDMA identifier
+ *   into route information needed to establish a connection.
+ *
+ * This is called on the client side of a connection, but its use is optional.
+ * Users must have first called rdma_resolve_addr to resolve a dst_addr
+ * into an RDMA address before calling this routine.
+ */
+int rdma_resolve_route(struct rdma_id *id, int timeout_ms);
+
+/**
+ * rdma_create_qp - Allocate a QP and associate it with the specified RDMA
+ * identifier.
+ */
+int rdma_create_qp(struct rdma_id *id, struct ib_pd *pd,
+		   struct ib_qp_init_attr *qp_init_attr);
+
+/**
+ * rdma_destroy_qp - Deallocate the QP associated with the specified RDMA
+ * identifier.
+ *
+ * Users must destroy any QP associated with an RDMA identifier before
+ * destroying the RDMA ID.
+ */
+void rdma_destroy_qp(struct rdma_id *id);
+
+struct rdma_conn_param {
+	const void *private_data;
+	u8 private_data_len;
+	u8 responder_resources;
+	u8 initiator_depth;
+	u8 flow_control;
+	u8 retry_count;		/* ignored when accepting */
+	u8 rnr_retry_count;
+};
+
+/**
+ * rdma_connect - Initiate an active connection request.
+ *
+ * Users must have bound the rdma_id to a local device by having called
+ * rdma_resolve_addr before calling this routine. Users may also resolve the
+ * RDMA address to a route with rdma_resolve_route, but if a route has not
+ * been resolved, a default route will be selected.
+ *
+ * Note that the QP must be in the INIT state.
+ */
+int rdma_connect(struct rdma_id *id, struct rdma_conn_param *conn_param);
+
+/**
+ * rdma_listen - This function is called by the passive side to
+ *   listen for incoming connection requests.
+ *
+ * Users must have bound the rdma_id to a local address by calling
+ * rdma_bind_addr before calling this routine.
+ */
+int rdma_listen(struct rdma_id *id);
+
+/**
+ * rdma_accept - Called on the passive side to accept a connection request
+ *
+ * Note that the QP must be in the INIT state.
+ */
+int rdma_accept(struct rdma_id *id, struct rdma_conn_param *conn_param);
+
+/**
+ * rdma_reject - Called on the passive side to reject a connection request.
+ */
+int rdma_reject(struct rdma_id *id, const void *private_data,
+		u8 private_data_len);
+
+/**
+ * rdma_disconnect - This function disconnects the associated QP.
+ */
+int rdma_disconnect(struct rdma_id *id);
+
+#endif /* RDMA_CM_H */
+
Index: core/cma.c
===================================================================
--- core/cma.c	(revision 0)
+++ core/cma.c	(revision 0)
@@ -0,0 +1,1207 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc. All rights reserved.
+ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *    available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *    copy of which is available from the Open Source Initiative, see
+ *    http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+MODULE_AUTHOR("Guy German");
+MODULE_DESCRIPTION("Generic RDMA CM Agent");
+MODULE_LICENSE("Dual BSD/GPL");
+
+#define CMA_CM_RESPONSE_TIMEOUT 20
+#define CMA_MAX_CM_RETRIES 3
+
+static void cma_add_one(struct ib_device *device);
+static void cma_remove_one(struct ib_device *device);
+
+static struct ib_client cma_client = {
+	.name = "cma",
+	.add = cma_add_one,
+	.remove = cma_remove_one
+};
+
+static DEFINE_SPINLOCK(lock);
+static LIST_HEAD(dev_list);
+
+struct cma_device {
+	struct list_head list;
+	struct ib_device *device;
+	__be64 node_guid;
+	wait_queue_head_t wait;
+	atomic_t refcount;
+	struct list_head id_list;
+};
+
+enum cma_state {
+	CMA_IDLE,
+	CMA_ADDR_QUERY,
+	CMA_ADDR_RESOLVED,
+	CMA_ROUTE_QUERY,
+	CMA_ROUTE_RESOLVED,
+	CMA_CONNECT,
+	CMA_ADDR_BOUND,
+	CMA_LISTEN,
+	CMA_DEVICE_REMOVAL,
+	CMA_DESTROYING
+};
+
+/*
+ * Device removal can occur at anytime, so we need extra handling to
+ * serialize notifying the user of device removal with other callbacks.
+ * We do this by disabling removal notification while a callback is in process,
+ * and reporting it after the callback completes.
+ */
+struct rdma_id_private {
+	struct rdma_id id;
+
+	struct list_head list;
+	struct cma_device *cma_dev;
+
+	enum cma_state state;
+	spinlock_t lock;
+	wait_queue_head_t wait;
+	atomic_t refcount;
+	atomic_t dev_remove;
+
+	int timeout_ms;
+	struct ib_sa_query *query;
+	int query_id;
+	struct ib_cm_id *cm_id;
+};
+
+struct cma_addr {
+	u8 version;	/* CMA version: 7:4, IP version: 3:0 */
+	u8 reserved;
+	__be16 port;
+	struct {
+		union {
+			struct in6_addr ip6;
+			struct {
+				__be32 pad[3];
+				__be32 addr;
+			} ip4;
+		} ver;
+	} src_addr, dst_addr;
+};
+
+static int cma_comp(struct rdma_id_private *id_priv, enum cma_state comp)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&id_priv->lock, flags);
+	ret = (id_priv->state == comp);
+	spin_unlock_irqrestore(&id_priv->lock, flags);
+	return ret;
+}
+
+static int cma_comp_exch(struct rdma_id_private *id_priv,
+			 enum cma_state comp, enum cma_state exch)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&id_priv->lock, flags);
+	if ((ret = (id_priv->state == comp)))
+		id_priv->state = exch;
+	spin_unlock_irqrestore(&id_priv->lock, flags);
+	return ret;
+}
+
+static enum cma_state cma_exch(struct rdma_id_private *id_priv,
+			       enum cma_state exch)
+{
+	unsigned long flags;
+	enum cma_state old;
+
+	spin_lock_irqsave(&id_priv->lock, flags);
+	old = id_priv->state;
+	id_priv->state = exch;
+	spin_unlock_irqrestore(&id_priv->lock, flags);
+	return old;
+}
+
+static inline u8 cma_get_ip_ver(struct cma_addr *addr)
+{
+	return addr->version & 0xF;
+}
+
+static inline u8 cma_get_cma_ver(struct cma_addr *addr)
+{
+	return addr->version >> 4;
+}
+
+static inline void cma_set_vers(struct cma_addr *addr, u8 cma_ver, u8 ip_ver)
+{
+	addr->version = (cma_ver << 4) + (ip_ver & 0xF);
+}
+
+static int cma_acquire_ib_dev(struct rdma_id_private *id_priv,
+			      union ib_gid *gid)
+{
+	struct cma_device *cma_dev;
+	unsigned long flags;
+	int ret = -ENODEV;
+	u8 port;
+
+	spin_lock_irqsave(&lock, flags);
+	list_for_each_entry(cma_dev, &dev_list, list) {
+		ret = ib_find_cached_gid(cma_dev->device, gid, &port, NULL);
+		if (!ret) {
+			atomic_inc(&cma_dev->refcount);
+			id_priv->cma_dev = cma_dev;
+			id_priv->id.device = cma_dev->device;
+			list_add_tail(&id_priv->list, &cma_dev->id_list);
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&lock, flags);
+	return ret;
+}
+
+static void cma_release_dev(struct rdma_id_private *id_priv)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&lock, flags);
+	list_del(&id_priv->list);
+	spin_unlock_irqrestore(&lock, flags);
+
+	if (atomic_dec_and_test(&id_priv->cma_dev->refcount))
+		wake_up(&id_priv->cma_dev->wait);
+}
+
+static void cma_deref_id(struct rdma_id_private *id_priv)
+{
+	if (atomic_dec_and_test(&id_priv->refcount))
+		wake_up(&id_priv->wait);
+}
+
+struct rdma_id* rdma_create_id(rdma_event_handler event_handler, void *context)
+{
+	struct rdma_id_private *id_priv;
+
+	id_priv = kmalloc(sizeof *id_priv, GFP_KERNEL);
+	if (!id_priv)
+		return NULL;
+	memset(id_priv, 0, sizeof *id_priv);
+
+	id_priv->state = CMA_IDLE;
+	id_priv->id.context = context;
+	id_priv->id.event_handler = event_handler;
+	spin_lock_init(&id_priv->lock);
+	init_waitqueue_head(&id_priv->wait);
+	atomic_set(&id_priv->refcount, 1);
+	atomic_set(&id_priv->dev_remove, 1);
+
+	return &id_priv->id;
+}
+EXPORT_SYMBOL(rdma_create_id);
+
+static int cma_init_ib_qp(struct rdma_id_private *id_priv, struct ib_qp *qp)
+{
+	struct ib_qp_attr qp_attr;
+	struct ib_sa_path_rec *path_rec;
+	int ret;
+
+	qp_attr.qp_state = IB_QPS_INIT;
+	qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE;
+
+	path_rec = id_priv->id.route.path_rec;
+	ret = ib_find_cached_gid(id_priv->id.device, &path_rec->sgid,
+				 &qp_attr.port_num, NULL);
+	if (ret)
+		return ret;
+
+	ret = ib_find_cached_pkey(id_priv->id.device, qp_attr.port_num,
+				  id_priv->id.route.addr.addr.ibaddr.pkey,
+				  &qp_attr.pkey_index);
+	if (ret)
+		return ret;
+
+	return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS |
+					  IB_QP_PKEY_INDEX | IB_QP_PORT);
+}
+
+int rdma_create_qp(struct rdma_id *id, struct ib_pd *pd,
+		   struct ib_qp_init_attr *qp_init_attr)
+{
+	struct rdma_id_private *id_priv;
+	struct ib_qp *qp;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (id->device != pd->device)
+		return -EINVAL;
+
+	qp = ib_create_qp(pd, qp_init_attr);
+	if (IS_ERR(qp))
+		return PTR_ERR(qp);
+
+	switch (id->device->node_type) {
+	case IB_NODE_CA:
+		ret = cma_init_ib_qp(id_priv, qp);
+		break;
+	default:
+		ret = -ENOSYS;
+		break;
+	}
+
+	if (ret)
+		goto err;
+
+	id->qp = qp;
+	return 0;
+err:
+	ib_destroy_qp(qp);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_create_qp);
+
+void rdma_destroy_qp(struct rdma_id *id)
+{
+	ib_destroy_qp(id->qp);
+}
+EXPORT_SYMBOL(rdma_destroy_qp);
+
+static int cma_modify_ib_qp_rtr(struct rdma_id_private *id_priv)
+{
+	struct ib_qp_attr qp_attr;
+	int qp_attr_mask, ret;
+
+	/* Need to update QP attributes from default values. */
+	qp_attr.qp_state = IB_QPS_INIT;
+	ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask);
+	if (ret)
+		return ret;
+
+	ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask);
+	if (ret)
+		return ret;
+
+	qp_attr.qp_state = IB_QPS_RTR;
+	ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask);
+	if (ret)
+		return ret;
+
+	qp_attr.rq_psn = id_priv->id.qp->qp_num;
+	return ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask);
+}
+
+static int cma_modify_ib_qp_rts(struct rdma_id_private *id_priv)
+{
+	struct ib_qp_attr qp_attr;
+	int qp_attr_mask, ret;
+
+	qp_attr.qp_state = IB_QPS_RTS;
+	ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, &qp_attr_mask);
+	if (ret)
+		return ret;
+
+	return ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask);
+}
+
+static int cma_modify_qp_err(struct rdma_id *id)
+{
+	struct ib_qp_attr qp_attr;
+
+	qp_attr.qp_state = IB_QPS_ERR;
+	return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE);
+}
+
+static int cma_verify_addr(struct cma_addr *addr,
+			   struct sockaddr_in *ip_addr)
+{
+	if (cma_get_cma_ver(addr) != 1 || cma_get_ip_ver(addr) != 4)
+		return -EINVAL;
+
+	if (ip_addr->sin_port != be16_to_cpu(addr->port))
+		return -EINVAL;
+
+	if (ip_addr->sin_addr.s_addr &&
+	    (ip_addr->sin_addr.s_addr != be32_to_cpu(addr->dst_addr.
+						     ver.ip4.addr)))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int cma_notify_user(struct rdma_id_private *id_priv,
+			   enum rdma_event_type type, int status,
+			   void *data, u8 data_len)
+{
+	struct rdma_event event;
+
+	event.event = type;
+	event.status = status;
+	event.private_data = data;
+	event.private_data_len = data_len;
+
+	return id_priv->id.event_handler(&id_priv->id, &event);
+}
+
+static inline void cma_disable_dev_remove(struct rdma_id_private *id_priv)
+{
+	atomic_inc(&id_priv->dev_remove);
+}
+
+static inline void cma_deref_dev(struct rdma_id_private *id_priv)
+{
+//	if (atomic_dec_and_test(&id_priv->dev_remove))
+//		wake_up(&id_priv->wait);
+//	return atomic_dec_and_test(&id_priv->dev_remove) ?
+//	       cma_notify_user(id_priv, RDMA_EVENT_DEVICE_REMOVAL, -ENODEV,
+//			       NULL, 0) : 0;
+}
+
+static void cma_cancel_addr(struct rdma_id_private *id_priv)
+{
+	switch (id_priv->id.device->node_type) {
+	case IB_NODE_CA:
+		ib_addr_cancel(&id_priv->id.route.addr.addr.ibaddr);
+		break;
+	default:
+		break;
+	}
+}
+
+static void cma_cancel_route(struct rdma_id_private *id_priv)
+{
+	switch (id_priv->id.device->node_type) {
+	case IB_NODE_CA:
+		ib_sa_cancel_query(id_priv->query_id, id_priv->query);
+		break;
+	default:
+		break;
+	}
+}
+
+static void cma_cancel_operation(struct rdma_id_private *id_priv,
+				 enum cma_state state)
+{
+	switch (state) {
+	case CMA_ADDR_QUERY:
+		cma_cancel_addr(id_priv);
+		break;
+	case CMA_ROUTE_QUERY:
+		cma_cancel_route(id_priv);
+		break;
+	default:
+		break;
+	}
+}
+
+static void cma_free_id(struct rdma_id_private *id_priv)
+{
+	if (id_priv->cma_dev) {
+		switch (id_priv->id.device->node_type) {
+		case IB_NODE_CA:
+			if (id_priv->cm_id && !IS_ERR(id_priv->cm_id))
+				ib_destroy_cm_id(id_priv->cm_id);
+			break;
+		default:
+			break;
+		}
+		cma_release_dev(id_priv);
+	}
+
+	atomic_dec(&id_priv->refcount);
+	wait_event(id_priv->wait, !atomic_read(&id_priv->refcount));
+
+	kfree(id_priv->id.route.path_rec);
+	kfree(id_priv);
+}
+
+void rdma_destroy_id(struct rdma_id *id)
+{
+	struct rdma_id_private *id_priv;
+	enum cma_state state;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+
+	state = cma_exch(id_priv, CMA_DESTROYING);
+	cma_cancel_operation(id_priv, state);
+	cma_free_id(id_priv);
+}
+EXPORT_SYMBOL(rdma_destroy_id);
+
+static int cma_rep_recv(struct rdma_id_private *id_priv)
+{
+	int ret;
+
+	ret = cma_modify_ib_qp_rtr(id_priv);
+	if (ret)
+		goto reject;
+
+	ret = cma_modify_ib_qp_rts(id_priv);
+	if (ret)
+		goto reject;
+
+	ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0);
+	if (ret)
+		goto reject;
+
+	return 0;
+reject:
+	cma_modify_qp_err(&id_priv->id);
+	ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED,
+		       NULL, 0, NULL, 0);
+	return ret;
+}
+
+static int cma_rtu_recv(struct rdma_id_private *id_priv)
+{
+	int ret;
+
+	ret = cma_modify_ib_qp_rts(id_priv);
+	if (ret)
+		goto reject;
+
+	return 0;
+reject:
+	cma_modify_qp_err(&id_priv->id);
+	ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED,
+		       NULL, 0, NULL, 0);
+	return ret;
+}
+
+static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event)
+{
+	struct rdma_id_private *id_priv = cm_id->context;
+	enum rdma_event_type event;
+	u8 private_data_len = 0;
+	int ret = 0, status = 0;
+
+	if (!cma_comp(id_priv, CMA_CONNECT))
+		return 0;
+
+	switch (ib_event->event) {
+	case IB_CM_REQ_ERROR:
+	case IB_CM_REP_ERROR:
+		event = RDMA_EVENT_UNREACHABLE;
+		status = -ETIMEDOUT;
+		break;
+	case IB_CM_REP_RECEIVED:
+		status = cma_rep_recv(id_priv);
+		event = status ? RDMA_EVENT_CONNECT_ERROR :
+				 RDMA_EVENT_ESTABLISHED;
+		private_data_len = IB_CM_REP_PRIVATE_DATA_SIZE;
+		break;
+	case IB_CM_RTU_RECEIVED:
+		status = cma_rtu_recv(id_priv);
+		event = status ? RDMA_EVENT_CONNECT_ERROR :
+				 RDMA_EVENT_ESTABLISHED;
+		break;
+	case IB_CM_DREQ_ERROR:
+		status = -ETIMEDOUT; /* fall through */
+	case IB_CM_DREQ_RECEIVED:
+	case IB_CM_DREP_RECEIVED:
+		event = RDMA_EVENT_DISCONNECTED;
+		break;
+	case IB_CM_TIMEWAIT_EXIT:
+	case IB_CM_MRA_RECEIVED:
+		/* ignore event */
+		goto out;
+	case IB_CM_REJ_RECEIVED:
+		cma_modify_qp_err(&id_priv->id);
+		status = ib_event->param.rej_rcvd.reason;
+		event = RDMA_EVENT_REJECTED;
+		break;
+	default:
+		printk(KERN_ERR "RDMA CMA: unexpected IB CM event: %d",
+		       ib_event->event);
+		goto out;
+	}
+
+	ret = cma_notify_user(id_priv, event, status, ib_event->private_data,
+			      private_data_len);
+	if (ret) {
+		/* Destroy the CM ID by returning a non-zero value. */
+		id_priv->cm_id = NULL;
+		rdma_destroy_id(&id_priv->id);
+	}
+out:
+	return ret;
+}
+
+static struct rdma_id_private* cma_new_id(struct rdma_id *listen_id,
+					  struct ib_cm_event *ib_event)
+{
+	struct rdma_id_private *id_priv;
+	struct rdma_id *id;
+	struct rdma_route *route;
+	struct sockaddr_in *ip_addr;
+	struct ib_sa_path_rec *path_rec;
+	struct cma_addr *addr;
+	int num_paths;
+
+	ip_addr = (struct sockaddr_in *) &listen_id->route.addr.src_addr;
+	if (cma_verify_addr(ib_event->private_data, ip_addr))
+		return NULL;
+
+	num_paths = 1 + (ib_event->param.req_rcvd.alternate_path != NULL);
+	path_rec = kmalloc(sizeof *path_rec * num_paths, GFP_KERNEL);
+	if (!path_rec)
+		return NULL;
+
+	id = rdma_create_id(listen_id->event_handler, listen_id->context);
+	if (!id)
+		goto err;
+
+	route = &id->route;
+	route->addr.src_addr = listen_id->route.addr.src_addr;
+	route->addr.dst_addr.sa_family = ip_addr->sin_family;
+
+	ip_addr = (struct sockaddr_in *) &route->addr.dst_addr;
+	addr = ib_event->private_data;
+	ip_addr->sin_addr.s_addr = be32_to_cpu(addr->src_addr.ver.ip4.addr);
+
+	route->num_paths = num_paths;
+	route->path_rec = path_rec;
+	path_rec[0] = *ib_event->param.req_rcvd.primary_path;
+	if (num_paths == 2)
+		path_rec[1] = *ib_event->param.req_rcvd.alternate_path;
+
+	route->addr.addr.ibaddr.sgid = path_rec->dgid;
+	route->addr.addr.ibaddr.dgid = path_rec->sgid;
+	route->addr.addr.ibaddr.pkey = be16_to_cpu(path_rec->pkey);
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	id_priv->state = CMA_CONNECT;
+	return id_priv;
+err:
+	kfree(path_rec);
+	return NULL;
+}
+
+static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event)
+{
+	struct rdma_id_private *listen_id, *conn_id;
+	int offset, ret;
+
+	listen_id = cm_id->context;
+	conn_id = cma_new_id(&listen_id->id, ib_event);
+	if (!conn_id)
+		return -ENOMEM;
+
+	ret = cma_acquire_ib_dev(conn_id, &conn_id->id.route.path_rec[0].sgid);
+	if (ret) {
+		ret = -ENODEV;
+		goto err;
+	}
+
+	conn_id->cm_id = cm_id;
+	cm_id->context = conn_id;
+	cm_id->cm_handler = cma_ib_handler;
+	conn_id->state = CMA_CONNECT;
+
+	offset = sizeof(struct cma_addr);
+	ret = cma_notify_user(conn_id, RDMA_EVENT_CONNECT_REQUEST, 0,
+			      ib_event->private_data + offset,
+			      IB_CM_REQ_PRIVATE_DATA_SIZE - offset);
+	if (ret) {
+		/* Destroy the CM ID by returning a non-zero value. */
+		conn_id->cm_id = NULL;
+		rdma_destroy_id(&conn_id->id);
+	}
+	return ret;
+err:
+	rdma_destroy_id(&conn_id->id);
+	return ret;
+}
+
+static __be64 cma_get_service_id(struct sockaddr *addr)
+{
+	return cpu_to_be64(((u64)IB_OPENIB_OUI << 48) +
+	       ((struct sockaddr_in *) addr)->sin_port);
+}
+
+static int cma_ib_listen(struct rdma_id_private *id_priv)
+{
+	__be64 svc_id;
+	int ret;
+
+	id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_req_handler,
+					 id_priv);
+	if (IS_ERR(id_priv->cm_id))
+		return PTR_ERR(id_priv->cm_id);
+
+	svc_id = cma_get_service_id(&id_priv->id.route.addr.src_addr);
+	ret = ib_cm_listen(id_priv->cm_id, svc_id, 0);
+	if (ret)
+		ib_destroy_cm_id(id_priv->cm_id);
+
+	return ret;
+}
+
+int rdma_listen(struct rdma_id *id)
+{
+	struct rdma_id_private *id_priv;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN))
+		return -EINVAL;
+
+	/* TODO: handle listen across multiple devices */
+	if (!id->device) {
+		ret = -ENOSYS;
+		goto err;
+	}
+
+	switch (id->device->node_type) {
+	case IB_NODE_CA:
+		ret = cma_ib_listen(id_priv);
+		break;
+	default:
+		ret = -ENOSYS;
+		break;
+	}
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	cma_comp_exch(id_priv, CMA_LISTEN, CMA_ADDR_BOUND);
+	return ret;
+};
+EXPORT_SYMBOL(rdma_listen);
+
+static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec,
+			      void *context)
+{
+	struct rdma_id_private *id_priv = context;
+	struct rdma_route *route = &id_priv->id.route;
+	enum rdma_event_type event = RDMA_EVENT_ROUTE_RESOLVED;
+
+	if (!status) {
+		route->path_rec = kmalloc(sizeof *route->path_rec, GFP_KERNEL);
+		if (route->path_rec) {
+			route->num_paths = 1;
+			*route->path_rec = *path_rec;
+			if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY,
+					   CMA_ROUTE_RESOLVED)) {
+				kfree(route->path_rec);
+				goto out;
+			}
+		} else
+			status = -ENOMEM;
+	}
+
+	if (status) {
+		if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED))
+			goto out;
+		event = RDMA_EVENT_ROUTE_ERROR;
+	}
+
+	if (cma_notify_user(id_priv, event, status, NULL, 0)) {
+		cma_deref_id(id_priv);
+		rdma_destroy_id(&id_priv->id);
+		return;
+	}
+out:
+	cma_deref_id(id_priv);
+}
+
+static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms)
+{
+	struct ib_addr *addr = &id_priv->id.route.addr.addr.ibaddr;
+	struct ib_sa_path_rec path_rec;
+	int ret;
+	u8 port;
+
+	ret = ib_find_cached_gid(id_priv->id.device, &addr->sgid, &port, NULL);
+	if (ret)
+		return -ENODEV;
+
+	memset(&path_rec, 0, sizeof path_rec);
+	path_rec.sgid = addr->sgid;
+	path_rec.dgid = addr->dgid;
+	path_rec.pkey = addr->pkey;
+	path_rec.numb_path = 1;
+
+	id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device,
+				port, &path_rec,
+				IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID |
+				IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH,
+				timeout_ms, GFP_KERNEL,
+				cma_query_handler, id_priv, &id_priv->query);
+
+	return (id_priv->query_id < 0) ? id_priv->query_id : 0;
+}
+
+int rdma_resolve_route(struct rdma_id *id, int timeout_ms)
+{
+	struct rdma_id_private *id_priv;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp_exch(id_priv, CMA_ADDR_RESOLVED, CMA_ROUTE_QUERY))
+		return -EINVAL;
+
+	atomic_inc(&id_priv->refcount);
+	switch (id->device->node_type) {
+	case IB_NODE_CA:
+		ret = cma_resolve_ib_route(id_priv, timeout_ms);
+		break;
+	default:
+		ret = -ENOSYS;
+		break;
+	}
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED);
+	cma_deref_id(id_priv);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_resolve_route);
+
+static void addr_handler(int status, struct sockaddr *src_addr,
+			 struct ib_addr *ibaddr, void *context)
+{
+	struct rdma_id_private *id_priv = context;
+	enum rdma_event_type event;
+
+	if (!status)
+		status = cma_acquire_ib_dev(id_priv, &ibaddr->sgid);
+
+	if (status) {
+		if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_IDLE))
+			goto out;
+		event = RDMA_EVENT_ADDR_ERROR;
+	} else {
+		if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED))
+			goto out;
+		id_priv->id.route.addr.src_addr = *src_addr;
+		event = RDMA_EVENT_ADDR_RESOLVED;
+	}
+
+	if (cma_notify_user(id_priv, event, status, NULL, 0)) {
+		cma_deref_id(id_priv);
+		rdma_destroy_id(&id_priv->id);
+		return;
+	}
+out:
+	cma_deref_id(id_priv);
+}
+
+int rdma_resolve_addr(struct rdma_id *id, struct sockaddr *src_addr,
+		      struct sockaddr *dst_addr, int timeout_ms)
+{
+	struct rdma_id_private *id_priv;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_QUERY))
+		return -EINVAL;
+
+	atomic_inc(&id_priv->refcount);
+	id->route.addr.dst_addr = *dst_addr;
+	ret = ib_resolve_addr(src_addr, dst_addr, &id->route.addr.addr.ibaddr,
+			      timeout_ms, addr_handler, id_priv);
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_IDLE);
+	cma_deref_id(id_priv);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_resolve_addr);
+
+int rdma_bind_addr(struct rdma_id *id, struct sockaddr *addr)
+{
+	struct rdma_id_private *id_priv;
+	struct sockaddr_in *ip_addr = (struct sockaddr_in *) addr;
+	struct ib_addr *ibaddr = &id->route.addr.addr.ibaddr;
+	int ret;
+
+	if (addr->sa_family != AF_INET)
+		return -EINVAL;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_BOUND))
+		return -EINVAL;
+
+	if (ip_addr->sin_addr.s_addr) {
+		ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey);
+		if (!ret)
+			ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid);
+	} else
+		ret = -ENOSYS; /* TODO: support wild card addresses */
+
+	if (ret)
+		goto err;
+
+	id->route.addr.src_addr = *addr;
+	return 0;
+err:
+	cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_IDLE);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_bind_addr);
+
+static void cma_format_addr(struct cma_addr *addr, struct rdma_route *route)
+{
+	struct sockaddr_in *ip_addr;
+
+	memset(addr, 0, sizeof *addr);
+	cma_set_vers(addr, 1, 4);
+
+	ip_addr = (struct sockaddr_in *) &route->addr.src_addr;
+	addr->src_addr.ver.ip4.addr = cpu_to_be32(ip_addr->sin_addr.s_addr);
+
+	ip_addr = (struct sockaddr_in *) &route->addr.dst_addr;
+	addr->dst_addr.ver.ip4.addr = cpu_to_be32(ip_addr->sin_addr.s_addr);
+	addr->port = cpu_to_be16(ip_addr->sin_port);
+}
+
+static int cma_connect_ib(struct rdma_id_private *id_priv,
+			  struct rdma_conn_param *conn_param)
+{
+	struct ib_cm_req_param req;
+	struct rdma_route *route;
+	struct cma_addr *addr;
+	void *private_data;
+	int ret;
+
+	memset(&req, 0, sizeof req);
+	req.private_data_len = sizeof *addr + conn_param->private_data_len;
+
+	private_data = kmalloc(req.private_data_len, GFP_ATOMIC);
+	if (!private_data)
+		return -ENOMEM;
+
+	id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler,
+					 id_priv);
+	if (IS_ERR(id_priv->cm_id)) {
+		ret = PTR_ERR(id_priv->cm_id);
+		goto out;
+	}
+
+	addr = private_data;
+	route = &id_priv->id.route;
+	cma_format_addr(addr, route);
+
+	if (conn_param->private_data && conn_param->private_data_len)
+		memcpy(addr + 1, conn_param->private_data,
+		       conn_param->private_data_len);
+	req.private_data = private_data;
+
+	req.primary_path = &route->path_rec[0];
+	if (route->num_paths == 2)
+		req.alternate_path = &route->path_rec[1];
+
+	req.service_id = cma_get_service_id(&route->addr.dst_addr);
+	req.qp_num = id_priv->id.qp->qp_num;
+	req.qp_type = IB_QPT_RC;
+	req.starting_psn = req.qp_num;
+	req.responder_resources = conn_param->responder_resources;
+	req.initiator_depth = conn_param->initiator_depth;
+	req.flow_control = conn_param->flow_control;
+	req.retry_count = conn_param->retry_count;
+	req.rnr_retry_count = conn_param->rnr_retry_count;
+	req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
+	req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
+	req.max_cm_retries = CMA_MAX_CM_RETRIES;
+	req.srq = id_priv->id.qp->srq ? 1 : 0;
+
+	ret = ib_send_cm_req(id_priv->cm_id, &req);
+out:
+	kfree(private_data);
+	return ret;
+}
+
+int rdma_connect(struct rdma_id *id, struct rdma_conn_param *conn_param)
+{
+	struct rdma_id_private *id_priv;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT))
+		return -EINVAL;
+
+	switch (id->device->node_type) {
+	case IB_NODE_CA:
+		ret = cma_connect_ib(id_priv, conn_param);
+		break;
+	default:
+		ret = -ENOSYS;
+		break;
+	}
+	if (ret)
+		goto err;
+
+	return 0;
+err:
+	cma_comp_exch(id_priv, CMA_CONNECT, CMA_ROUTE_RESOLVED);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_connect);
+
+static int cma_accept_ib(struct rdma_id_private *id_priv,
+			 struct rdma_conn_param *conn_param)
+{
+	struct ib_cm_rep_param rep;
+	int ret;
+
+	ret = cma_modify_ib_qp_rtr(id_priv);
+	if (ret)
+		return ret;
+
+	memset(&rep, 0, sizeof rep);
+	rep.qp_num = id_priv->id.qp->qp_num;
+	rep.starting_psn = rep.qp_num;
+	rep.private_data = conn_param->private_data;
+	rep.private_data_len = conn_param->private_data_len;
+	rep.responder_resources = conn_param->responder_resources;
+	rep.initiator_depth = conn_param->initiator_depth;
+	rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT;
+	rep.failover_accepted = 0;
+	rep.flow_control = conn_param->flow_control;
+	rep.rnr_retry_count = conn_param->rnr_retry_count;
+	rep.srq = id_priv->id.qp->srq ? 1 : 0;
+
+	return ib_send_cm_rep(id_priv->cm_id, &rep);
+}
+
+int rdma_accept(struct rdma_id *id, struct rdma_conn_param *conn_param)
+{
+	struct rdma_id_private *id_priv;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp(id_priv, CMA_CONNECT))
+		return -EINVAL;
+
+	switch (id->device->node_type) {
+	case IB_NODE_CA:
+		ret = cma_accept_ib(id_priv, conn_param);
+		break;
+	default:
+		ret = -ENOSYS;
+		break;
+	}
+
+	if (ret)
+		goto reject;
+
+	return 0;
+reject:
+	cma_modify_qp_err(id);
+	rdma_reject(id, NULL, 0);
+	return ret;
+}
+EXPORT_SYMBOL(rdma_accept);
+
+int rdma_reject(struct rdma_id *id, const void *private_data,
+		u8 private_data_len)
+{
+	struct rdma_id_private *id_priv;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp(id_priv, CMA_CONNECT))
+		return -EINVAL;
+
+	switch (id->device->node_type) {
+	case IB_NODE_CA:
+		ret = ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED,
+				     NULL, 0, private_data, private_data_len);
+		break;
+	default:
+		ret = -ENOSYS;
+		break;
+	}
+	return ret;
+};
+EXPORT_SYMBOL(rdma_reject);
+
+int rdma_disconnect(struct rdma_id *id)
+{
+	struct rdma_id_private *id_priv;
+	int ret;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	if (!cma_comp(id_priv, CMA_CONNECT))
+		return -EINVAL;
+
+	ret = cma_modify_qp_err(id);
+	if (ret)
+		goto out;
+
+	switch (id->device->node_type) {
+	case IB_NODE_CA:
+		/* Initiate or respond to a disconnect.
*/ + if (ib_send_cm_dreq(id_priv->cm_id, NULL, 0)) + ib_send_cm_drep(id_priv->cm_id, NULL, 0); + break; + default: + break; + } +out: + return ret; +} +EXPORT_SYMBOL(rdma_disconnect); + +/* TODO: add this to the device structure - see Roland's patch */ +static __be64 get_ca_guid(struct ib_device *device) +{ + struct ib_device_attr *device_attr; + __be64 guid; + int ret; + + device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL); + if (!device_attr) + return 0; + + ret = ib_query_device(device, device_attr); + guid = ret ? 0 : device_attr->node_guid; + kfree(device_attr); + return guid; +} + +static void cma_add_one(struct ib_device *device) +{ + struct cma_device *cma_dev; + unsigned long flags; + + cma_dev = kmalloc(sizeof *cma_dev, GFP_KERNEL); + if (!cma_dev) + return; + + cma_dev->device = device; + cma_dev->node_guid = get_ca_guid(device); + if (!cma_dev->node_guid) + goto err; + + init_waitqueue_head(&cma_dev->wait); + atomic_set(&cma_dev->refcount, 1); + INIT_LIST_HEAD(&cma_dev->id_list); + ib_set_client_data(device, &cma_client, cma_dev); + + spin_lock_irqsave(&lock, flags); + list_add_tail(&cma_dev->list, &dev_list); + spin_unlock_irqrestore(&lock, flags); + return; +err: + kfree(cma_dev); +} + +static int cma_remove_id_dev(struct rdma_id_private *id_priv) +{ + enum cma_state state; + + /* Record that we want to remove the device */ + state = cma_exch(id_priv, CMA_DEVICE_REMOVAL); + if (state == CMA_DESTROYING) + return 0; + + /* TODO: wait until safe to process removal. */ + + /* Check for destruction from another callback. 
*/ + if (!cma_comp(id_priv, CMA_DEVICE_REMOVAL)) + return 0; + + return cma_notify_user(id_priv, RDMA_EVENT_DEVICE_REMOVAL, 0, NULL, 0); +} + +static void cma_process_remove(struct cma_device *cma_dev) +{ + struct list_head remove_list; + struct rdma_id_private *id_priv; + unsigned long flags; + int ret; + + INIT_LIST_HEAD(&remove_list); + + spin_lock_irqsave(&lock, flags); + while (!list_empty(&cma_dev->id_list)) { + id_priv = list_entry(cma_dev->id_list.next, + struct rdma_id_private, list); + list_del(&id_priv->list); + list_add_tail(&id_priv->list, &remove_list); + atomic_inc(&id_priv->refcount); + spin_unlock_irqrestore(&lock, flags); + + ret = cma_remove_id_dev(id_priv); + cma_deref_id(id_priv); + if (ret) + rdma_destroy_id(&id_priv->id); + + spin_lock_irqsave(&lock, flags); + } + spin_unlock_irqrestore(&lock, flags); + + atomic_dec(&cma_dev->refcount); + wait_event(cma_dev->wait, !atomic_read(&cma_dev->refcount)); +} + +static void cma_remove_one(struct ib_device *device) +{ + struct cma_device *cma_dev; + unsigned long flags; + + cma_dev = ib_get_client_data(device, &cma_client); + if (!cma_dev) + return; + + spin_lock_irqsave(&lock, flags); + list_del(&cma_dev->list); + spin_unlock_irqrestore(&lock, flags); + + cma_process_remove(cma_dev); + kfree(cma_dev); +} + +static int cma_init(void) +{ + return ib_register_client(&cma_client); +} + +static void cma_cleanup(void) +{ + ib_unregister_client(&cma_client); +} + +module_init(cma_init); +module_exit(cma_cleanup); From yaronh at voltaire.com Fri Oct 7 12:52:29 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Fri, 7 Oct 2005 21:52:29 +0200 Subject: [openib-general] [RFC] IB address translation using ARP Message-ID: <35EA21F54A45CB47B879F21A91F4862F7FA3A1@taurus.voltaire.com> > ________________________________________ > From: Michael Krause [mailto:krause at cup.hp.com] > Sent: Friday, October 07, 2005 12:29 PM > To: Yaron Haviv > Cc: Openib > Subject: RE: [openib-general] [RFC] IB address translation 
using ARP
>
> At 06:24 AM 9/30/2005, Yaron Haviv wrote:
> > -----Original Message-----
> > From: Roland Dreier [ mailto:rolandd at cisco.com]
> > Sent: Thursday, September 29, 2005 9:50 PM
> > To: Sean Hefty
> > Cc: Yaron Haviv; Openib
> > Subject: Re: [openib-general] [RFC] IB address translation using ARP
> >
> > I think the usage model is the following: you have some magic device
> > that has an IB port on one side and "something else" on the other
> > side. Think of something like a gateway that talks SDP on the IB side
> > and TCP/IP on the other side.
> >
> >Also applicable to two IB ports, e.g. forwarding SDP traffic from one IB
> >partition to SDP on another partition (may even be the same port with
> >two P_Keys), and doing some load-balancing or traffic management in
> >between, overall there are many use cases for that.
>
> While I can envision how an endpoint could communicate with another in
> separate partitions, doing so really violates the spirit of the
> partitioning where endpoints must be in the same partition in order to see
> one another and communicate.

Mike,

This is exactly the same case as two IPoIB interfaces over the same port
with two partitions configured and IP routing between them, or a layer 7
proxy that connects two network segments. I don't see anything wrong with
such a model.

> Attempting to create an intermediary who has
> insights into both and then somehow is able to communicate how to find one
> another using some proprietary (can't be through standards that I can
> think of) method, seems like way too much complexity to be worth it.
>

Assuming the ULPs on both sides are standard, how the proxy is built and
how it functions is application dependent, just as people build proxies
for XML that don't need to obey any standard besides being transparent to
both sides.

OpenIB should not block the ability to provide gateway/proxy
functionality, or to route traffic beyond a single IP addressing hop.
This just matches IB to capabilities already available in iWarp.

Yaron

From yaronh at voltaire.com Fri Oct 7 12:59:00 2005
From: yaronh at voltaire.com (Yaron Haviv)
Date: Fri, 7 Oct 2005 21:59:00 +0200
Subject: [openib-general] [RFC] IB address translation using ARP
Message-ID: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com>

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hefty
> Sent: Friday, October 07, 2005 12:40 PM
> To: 'Michael Krause'; Caitlin Bestler
> Cc: Openib
> Subject: RE: [openib-general] [RFC] IB address translation using ARP
>
> >It would be best to define a CM architecture that enabled communication
> >between like endpoints and avoid the gateway dilemma. Let the gateway
> >provider work out such issues as there are many requirements already
> >on each side of these interconnects.
>
>
> I've given this some more thought since the original postings and agree
> with you. It doesn't seem right to me to have the CM establish a
> connection to something that is not the specified destination, under the
> assumption that whatever is being connected to is a gateway. I think it
> would be better for the application to determine that the actual
> destination is on a different subnet, locate the gateway, and issue a
> connection request to the gateway.
>
> - Sean
>

Sean, I believe this is exactly how it has been proposed.
The gateway is the endpoint in IB, and the IB CM request is done against
the gateway; the gateway may decide to create its own connection on the
other side based on IB headers or private data or even application data
(depending on the type of the gateway). This just requires that traffic
targeted to a certain IP range/subnet/non-local will end up in the
gateway without the need to specify address by address individually
(just like it's done in IP).

Yaron

> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From mshefty at ichips.intel.com Fri Oct 7 13:10:34 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 07 Oct 2005 13:10:34 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com>
References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com>
Message-ID: <4346D63A.2070801@ichips.intel.com>

Yaron Haviv wrote:
> Sean, I believe this is exactly how it has been proposed
> The gateway is the endpoint in IB, and the IB CM request is done against
> the gateway, the gateway may decide to create its own connection on the

Yes - I agree with that. I'm referring to the RDMA connection manager,
versus the IB connection manager.

> targeted to a certain IP range/subnet/non-local will end up in the
> gateway without the need to specify address by address individually
> (just like it's done in IP)

IP is connectionless, so I'm not sure how to relate from IP to the RDMA CM.
With TCP, the connection is to the actual endpoint, not the IP router. This
seems more similar to an application requesting a connection to a proxy server.
- Sean

From halr at voltaire.com Fri Oct 7 13:07:37 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Oct 2005 16:07:37 -0400
Subject: [openib-general] Re: Questions about mad_test
In-Reply-To:
References:
Message-ID: <1128715656.4382.9844.camel@hal.voltaire.com>

Hi Pradeep,

On Fri, 2005-10-07 at 15:19, Pradeep Satyanarayana wrote:
> I am hoping some one will be able to help me out with a few answers
> saving me some debug time, or having to expend effort on something
> that is already known.
> I was trying to execute mad_test and found that it errors out.

What is your command invocation ? Can you send the output of
ibnetdiscover ?

> For some reason it does not like the DR Path that I gave it.
>
> 1. I ran ibnetdiscover and got the set of LIDs that I use is DR Path.
> Is that correct way to go about it?
> It always errors out with something like: hop 0 != 0 or hop 1 !=
> dev_port

It's telling you the DR path you specified is invalid. LIDs go "direct"
and are hardware forwarded (via LID routing). DR uses a list of next hop
(switch) ports (and not LIDs) and is usually forwarded in firmware or
software, although that is more an implementation detail than an
architectural one. See IBA 1.2 14.2.2 p.797 on for more on DR SMPs
(MADs).

> 2. Also there is an expectation of there being a device
> /dev/infiniband/mthca0/ports/1/mad (using all defaults in this case)
> -is that correct? Any specific major and minor numbers I must use?

No. It just accesses those and some /sys/class/infiniband infiniband_mad
files.

> 3. Anything else that I am missing?
>
> I am using this from trunk 3675 on 2.6.13 kernel.
>
> Thanks in advance for all the help!
>
> Pradeep
> pradeep at us.ibm.com

From halr at voltaire.com Fri Oct 7 13:17:00 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Oct 2005 16:17:00 -0400
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <4346D63A.2070801@ichips.intel.com>
References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com>
Message-ID: <1128716018.4382.9900.camel@hal.voltaire.com>

On Fri, 2005-10-07 at 16:10, Sean Hefty wrote:
> Yaron Haviv wrote:
> > Sean, I believe this is exactly how it has been proposed
> > The gateway is the endpoint in IB, and the IB CM request is done against
> > the gateway, the gateway may decide to create its own connection on the
>
> Yes - I agree with that. I'm referring to the RDMA connection manager, versus
> the IB connection manager.
>
> > targeted to a certain IP range/subnet/non-local will end up in the
> > gateway without the need to specify address by address individually
> > (just like it's done in IP)
>
> IP is connectionless, so I'm not sure how to relate from IP to the RDMA CM.

IP is connectionless but has been implemented on top of connection
oriented link layers which may gateway to other connection oriented link
layers or non connection oriented link layers. I think it is analogous
to that.

-- Hal

> With TCP, the connection is to the actual endpoint, not the IP router. This
> seems more similar to an application requesting a connection to a proxy server.
>
> - Sean
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From mshefty at ichips.intel.com Fri Oct 7 14:02:09 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 07 Oct 2005 14:02:09 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <1128716018.4382.9900.camel@hal.voltaire.com>
References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com>
Message-ID: <4346E251.9080109@ichips.intel.com>

Hal Rosenstock wrote:
>>IP is connectionless, so I'm not sure how to relate from IP to the RDMA CM.
>
>
> IP is connectionless but has been implemented on top of connection
> oriented link layers which may gateway to other connection oriented link
> layers or non connection oriented link layers. I think it is analogous
> to that.

I didn't think that IP was even being run in this case. Aren't we talking
about an application level gateway?

If the RDMA CM ran a protocol that ensured that data sent from the source
reached the actual destination, then this would make more sense to me. But
the protocol is coming from the client.

I just don't think that the RDMA CM should connect to a gateway under the
assumption that a client is running a protocol that operates this way. If
the source and destination were both running iWarp, then wouldn't a
connection be established to the actual destination, and not a gateway?
- Sean

From halr at voltaire.com Fri Oct 7 14:08:35 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Oct 2005 17:08:35 -0400
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <4346E251.9080109@ichips.intel.com>
References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com> <4346E251.9080109@ichips.intel.com>
Message-ID: <1128719144.4382.10255.camel@hal.voltaire.com>

On Fri, 2005-10-07 at 17:02, Sean Hefty wrote:
> Hal Rosenstock wrote:
> >>IP is connectionless, so I'm not sure how to relate from IP to the RDMA CM.
> >
> >
> > IP is connectionless but has been implemented on top of connection
> > oriented link layers which may gateway to other connection oriented link
> > layers or non connection oriented link layers. I think it is analogous
> > to that.
>
> I didn't think that IP was even being run in this case. Aren't we talking about
> an application level gateway?

Yes.

> If the RDMA CM ran a protocol that ensured that data sent from the source
> reached the actual destination, then this would make
> more sense to me. But the protocol is coming from the client.

Wouldn't the gateway/host reject or drop the connection if it couldn't do
what was required ?

> I just don't think that the RDMA CM should connect to a gateway under the
> assumption that a client is running a protocol that operates this way. If the
> source and destination were both running iWarp, then wouldn't a connection be
> established to the actual destination, and not a gateway?

Would it shortcut the connection across IP subnets or go through a
gateway in that case ?
-- Hal

From mshefty at ichips.intel.com Fri Oct 7 14:30:43 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 07 Oct 2005 14:30:43 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <1128719144.4382.10255.camel@hal.voltaire.com>
References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com> <4346E251.9080109@ichips.intel.com> <1128719144.4382.10255.camel@hal.voltaire.com>
Message-ID: <4346E903.8030601@ichips.intel.com>

Hal Rosenstock wrote:
>> If the RDMA CM ran a protocol that ensured that data sent from the source
>> reached the actual destination, then this would make more sense to me. But
>> the protocol is coming from the client.
>
> Wouldn't the gateway/host reject or drop the connection if it couldn't do
> what was required ?

I would assume so, and maybe that's sufficient. The one problem that I see
if this feature weren't in the RDMA CM is that clients may need to be
transport aware. (Assuming that an iWarp connection would go directly to
the destination.)

- Sean

From halr at voltaire.com Fri Oct 7 16:48:00 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Oct 2005 19:48:00 -0400
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <4346E903.8030601@ichips.intel.com>
References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com> <4346E251.9080109@ichips.intel.com> <1128719144.4382.10255.camel@hal.voltaire.com> <4346E903.8030601@ichips.intel.com>
Message-ID: <1128728790.4382.11354.camel@hal.voltaire.com>

On Fri, 2005-10-07 at 17:30, Sean Hefty wrote:
> Hal Rosenstock wrote:
> >> If the RDMA CM ran a protocol that ensured that data sent from the source
> >> reached the actual destination, then this would make more sense to me. But
> >> the protocol is coming from the client.
> >
> > Wouldn't the gateway/host reject or drop the connection if it couldn't do
> > what was required ?
>
> I would assume so, and maybe that's sufficient. The one problem that I see if
> this feature weren't in the RDMA CM is that clients may need to be transport
> aware. (Assuming that an iWarp connection would go directly to the destination.)

Would an iWARP connection jump across IP subnets ? It would need to
determine that it could do this (ala NHRP with ATM). Also, could there
be other RDMA networks between them (like IB) ?

-- Hal

From mshefty at ichips.intel.com Fri Oct 7 16:57:48 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 07 Oct 2005 16:57:48 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <1128728790.4382.11354.camel@hal.voltaire.com>
References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com> <4346E251.9080109@ichips.intel.com> <1128719144.4382.10255.camel@hal.voltaire.com> <4346E903.8030601@ichips.intel.com> <1128728790.4382.11354.camel@hal.voltaire.com>
Message-ID: <43470B7C.7060600@ichips.intel.com>

Hal Rosenstock wrote:
> Would an iWARP connection jump across IP subnets ? It would need to
> determine that it could do this (ala NHRP with ATM). Also, could there
> be other RDMA networks between them (like IB) ?

If iWarp is on top of TCP, I don't think that it would care about IP subnets.
- Sean

From halr at voltaire.com Fri Oct 7 17:13:18 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Oct 2005 20:13:18 -0400
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <43470B7C.7060600@ichips.intel.com>
References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com> <4346E251.9080109@ichips.intel.com> <1128719144.4382.10255.camel@hal.voltaire.com> <4346E903.8030601@ichips.intel.com> <1128728790.4382.11354.camel@hal.voltaire.com> <43470B7C.7060600@ichips.intel.com>
Message-ID: <1128730364.4382.11557.camel@hal.voltaire.com>

On Fri, 2005-10-07 at 19:57, Sean Hefty wrote:
> Hal Rosenstock wrote:
> > Would an iWARP connection jump across IP subnets ? It would need to
> > determine that it could do this (ala NHRP with ATM). Also, could there
> > be other RDMA networks between them (like IB) ?
>
> If iWarp is on top of TCP, I don't think that it would care about IP subnets.

I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ?
Doesn't a routing decision still need to be made at the IP layer ?
Doesn't the IP next hop need to be determined (e.g. gateway when the
destination is off the local IP subnet) ? Is there something that
precludes iWARP from working across IP subnets ?

-- Hal

From rolandd at cisco.com Fri Oct 7 18:16:37 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 07 Oct 2005 18:16:37 -0700
Subject: [openib-general] Timeline of IPoIB performance
In-Reply-To: <1128672413.13948.326.camel@localhost> (Matt Leininger's message of "Fri, 07 Oct 2005 01:06:53 -0700")
References: <1128672413.13948.326.camel@localhost>
Message-ID: <52br20lsei.fsf@cisco.com>

I wonder if this BIC bug has anything to do with it:

http://lkml.org/lkml/2005/10/7/230

From hozer at hozed.org Fri Oct 7 19:03:08 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Fri, 7 Oct 2005 21:03:08 -0500
Subject: [openib-general] IBM eHCA testing..
In-Reply-To:
References:
Message-ID: <20051008020308.GZ4612@kalmia.hozed.org>

On Fri, Oct 07, 2005 at 09:33:27AM -0700, Shirley Ma wrote:
> Hi, Troy,
>
> There is INSTALL file in the EHCA driver package.
> In OpenPower 720 port 1 is at the top, port 2 is at the bottom.
> In P570, port1 is at the bottom, port2 is at the top.

Okay, I guess I should read more carefully ;)

What is the issue with needing to use port 1? Can that be fixed in the
driver, or does that need a firmware update?

From mlleinin at hpcn.ca.sandia.gov Fri Oct 7 19:22:56 2005
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Fri, 07 Oct 2005 19:22:56 -0700
Subject: [openib-general] Timeline of IPoIB performance
In-Reply-To: <52br20lsei.fsf@cisco.com>
References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com>
Message-ID: <1128738176.13952.365.camel@localhost>

On Fri, 2005-10-07 at 18:16 -0700, Roland Dreier wrote:
> I wonder if this BIC bug has anything to do with it: http://lkml.org/lkml/2005/10/7/230
>

I'm not sure this helps. I'm seeing the performance drop-off happen
between 2.6.12-rc4 (470 MB/s) and 2.6.12-rc5 (405 MB/s). I'll send out
my new data and cc netdev.

  - Matt

From mlleinin at hpcn.ca.sandia.gov Fri Oct 7 19:25:49 2005
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Fri, 07 Oct 2005 19:25:49 -0700
Subject: [openib-general] Timeline of IPoIB performance
In-Reply-To: <52br20lsei.fsf@cisco.com>
References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com>
Message-ID: <1128738350.13945.369.camel@localhost>

I'm adding netdev to this thread to see if they can help.

I'm seeing an IPoIB (IP over InfiniBand) netperf performance drop-off of
up to 90 MB/s when using kernels newer than 2.6.11. This doesn't appear
to be an OpenIB IPoIB issue, since the older in-kernel IB for 2.6.11 and
a recent svn3687 snapshot both have the same performance (464 MB/s) with
2.6.11.
I used the same kernel config file as a starting point for each of these
kernel builds. Have there been any changes in Linux that would explain
these results?

Here is the hardware setup and netperf results using
'netperf -f -M -c -C -H IPoIB_ADDRESS'

All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)

Kernel            OpenIB      msi_x   netperf (MB/s)
2.6.14-rc3        in-kernel   1       374
2.6.13.2          svn3627     1       386
2.6.13.2          in-kernel   1       394
2.6.12.5-lustre   in-kernel   1       399
2.6.12.5          in-kernel   1       402
2.6.12            in-kernel   1       406
2.6.12-rc6        in-kernel   1       407
2.6.12-rc5        in-kernel   1       405   <<<<<
2.6.12-rc4        in-kernel   1       470   <<<<<
2.6.12-rc3        in-kernel   1       466
2.6.12-rc2        in-kernel   1       469
2.6.12-rc1        in-kernel   1       466
2.6.11            in-kernel   1       464
2.6.11            svn3687     1       464
2.6.9-11.ELsmp    svn3513     1       425   (Woody's results, 3.6Ghz EM64T)

Thanks,

  - Matt

References: <35EA21F54A45CB47B879F21A91F4862F7FA3A2@taurus.voltaire.com> <4346D63A.2070801@ichips.intel.com> <1128716018.4382.9900.camel@hal.voltaire.com> <4346E251.9080109@ichips.intel.com> <1128719144.4382.10255.camel@hal.voltaire.com> <4346E903.8030601@ichips.intel.com> <1128728790.4382.11354.camel@hal.voltaire.com> <43470B7C.7060600@ichips.intel.com> <1128730364.4382.11557.camel@hal.voltaire.com>
Message-ID: <1128829186.25001.76.camel@mail.es335.com>

On Fri, 2005-10-07 at 20:13 -0400, Hal Rosenstock wrote:
> On Fri, 2005-10-07 at 19:57, Sean Hefty wrote:
> > Hal Rosenstock wrote:
> > > Would an iWARP connection jump across IP subnets ? It would need to
> > > determine that it could do this (ala NHRP with ATM). Also, could there
> > > be other RDMA networks between them (like IB) ?
> >
> > If iWarp is on top of TCP, I don't think that it would care about IP subnets.
>
> I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ?
> Doesn't a routing decision still need to be made at the IP layer ?
> Doesn't the IP next hop need to be determined (e.g. gateway when the
> destination is off the local IP subnet) ? Is there something that
> precludes iWARP from working across IP subnets ?
>
> -- Hal
>

I've just read through this entire thread for the first time, and I sense
considerable confusion about how IP routing works. I know I'm confused ;-)

With sockets, the path to the remote peer is determined *after* the
connection request is submitted by the app (connect(...)). The app has no
idea which local interface will ultimately handle this connection or what
the path (route) is to the remote peer. It simply says
connect(67.65.105.4, ...). In fact, TCP doesn't know this either!
Like Hal suggests, the connect request (SYN packet) gets all the way down
to IP where the least cost route is selected, and if not already known,
the Ethernet address is determined (ARP) for the next hop. The reasons
for this are varied but include: routes may change and Ethernet addresses
for next hops may change, all within the lifetime of a connection, almost
certainly so if the connection lasts more than 15 minutes.

The route identifies the local interface and the next hop IP. An interface
is only ever on a single subnet. The ARP broadcast is issued on this
interface and is only on this one subnet. We're not broadcasting across
subnets. Note that the local interface is "logical", and a single Ethernet
NIC may have multiple IP addresses and may in fact be on multiple subnets
if using VLAN.

It is theoretically possible to support all this on an IPoIB based
network. Multiple subnets, multiple routes to remote peers, ICMP redirect,
multiple IP addresses for each physical interface, yada yada yada. But
IMHO, the only way to do this would be to tie directly into the existing
routing, ARP, ICMP, etc... subsystems in Linux. Otherwise you'll end up
recreating a gigantic (and I mean GIGANTIC) amount of code.

This belief is why I've been a proponent of mapping GIDs to one and only
one IP address and treating it for management purposes as the equivalent
of an IP address. Without this, the whole mechanism for determining
routes, etc.. breaks down. If you treat the GID like a MAC address -- it
breaks, because a MAC address can have multiple IP addresses -- the
observation that led to the conclusion that ATS was broken in the first
place.

I know there is significant resistance to this idea, but I just don't see
how we get this generically resolved without binding the two addressing
schemes more closely. With the current binding, I just don't think it
works.

If I'm off in the weeds, please let me know ... and I'll cease spouting
off.
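
To make the sockets point above concrete, here is a minimal sketch,
assuming POSIX sockets. The helper name local_addr_for is made up for
illustration and is not part of any library. It connects a UDP socket,
which sends nothing on the wire, and then asks the kernel which source
address its route lookup picked. The caller only ever names the
destination, never the interface.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: report the local IPv4 address the kernel would
 * use to reach dst_ip. connect() on a UDP socket sends no packets; it
 * only makes the kernel run its route lookup and fix the source address.
 * Returns 0 and fills buf on success, -1 on error.
 */
static int local_addr_for(const char *dst_ip, char *buf, socklen_t len)
{
	struct sockaddr_in dst, src;
	socklen_t slen = sizeof src;
	int ret = -1;
	int fd;

	assert(dst_ip && buf);

	memset(&dst, 0, sizeof dst);
	dst.sin_family = AF_INET;
	dst.sin_port = htons(9);	/* arbitrary port, nothing is sent */
	if (inet_pton(AF_INET, dst_ip, &dst.sin_addr) != 1)
		return -1;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	/* The caller never names an interface; routing picks it here. */
	if (connect(fd, (struct sockaddr *) &dst, sizeof dst) == 0 &&
	    getsockname(fd, (struct sockaddr *) &src, &slen) == 0 &&
	    inet_ntop(AF_INET, &src.sin_addr, buf, len))
		ret = 0;

	close(fd);
	return ret;
}
```

Called with a loopback destination this should report the loopback
address; for a remote destination the answer depends on the routing
table at the time of the call, which is the behaviour described above.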
> _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jackm at mellanox.co.il Sun Oct 9 01:44:55 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Sun, 9 Oct 2005 10:44:55 +0200 Subject: [openib-general] Re: [PATCH] mthca: when creating a cq, check that requested cqes does not exceed HCA max In-Reply-To: <52fyribmtc.fsf@cisco.com> References: <52fyribmtc.fsf@cisco.com> Message-ID: <20051009084455.GA24993@mellanox.co.il> Hi, I'm proposing a better fix. see below. On Mon, Oct 03, 2005 at 06:13:51PM +0200, Roland Dreier wrote: > Seems reasonable. However, looking back at the chip documentation, it > seems that the max CQEs should really be 0x1ffff rather than 0xffff as > I had it. Can you confirm? > > Thanks, > Roland -------------------------------------------------- Best to take the actual max cqes from QUERY_DEV_LIMS -- new patch below. The "- 1" is there because the cq needs one spare cqe (circular list logic). 
Jack

Signed-off-by: Jack Morgenstein

Index: linux-kernel/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_dev.h	(revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_dev.h	(working copy)
@@ -134,6 +134,7 @@
 	int num_eecs;
 	int reserved_eecs;
 	int num_cqs;
+	int max_cqes;
 	int reserved_cqs;
 	int num_eqs;
 	int reserved_eqs;
Index: linux-kernel/infiniband/hw/mthca/mthca_main.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_main.c	(revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_main.c	(working copy)
@@ -173,6 +173,7 @@
 	mdev->limits.reserved_pds = dev_lim->reserved_pds;
 	mdev->limits.port_width_cap = dev_lim->max_port_width;
 	mdev->limits.flags = dev_lim->flags;
+	mdev->limits.max_cqes = dev_lim->max_cq_sz - 1;
 
 	/* IB_DEVICE_RESIZE_MAX_WR not supported by driver.
 	   May be doable since hardware supports it for SRQ.
Index: linux-kernel/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-kernel/infiniband/hw/mthca/mthca_provider.c	(revision 3632)
+++ linux-kernel/infiniband/hw/mthca/mthca_provider.c	(working copy)
@@ -93,7 +93,7 @@
 	props->max_qp_wr = 0xffff;
 	props->max_sge = mdev->limits.max_sg;
 	props->max_cq = mdev->limits.num_cqs - mdev->limits.reserved_cqs;
-	props->max_cqe = 0xffff;
+	props->max_cqe = mdev->limits.max_cqes;
 	props->max_mr = mdev->limits.num_mpts - mdev->limits.reserved_mrws;
 	props->max_pd = mdev->limits.num_pds - mdev->limits.reserved_pds;
 	props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift;
@@ -639,7 +639,11 @@
 	struct mthca_cq *cq;
 	int nent;
 	int err;
+	struct mthca_dev* mdev = to_mdev(ibdev);
 
+	if (mdev->limits.max_cqes < entries || entries < 0)
+		return ERR_PTR(-EINVAL);
+
 	if (context) {
 		if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
 			return ERR_PTR(-EFAULT);

From yael at mellanox.co.il Sun Oct 9 04:18:23 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: 09 Oct 2005 13:18:23 +0200
Subject: [openib-general] [PATCH] Opensm - handling immediate error in vendor_send
Message-ID: <5zu0frvszk.fsf@mtl066.yok.mtl.com>

Hi Hal,

During our tests on Windows we encountered an issue that is caused by a problem in the lower layer but leads to a problem in opensm. If the osm_vendor_send call fails immediately, we need to update several counters (currently, only qp0_mads_sent is decremented), and also call the dispatcher if we reached qp0_mads_outstanding == 0 (in order to signal the state mgr). What we saw was that these counters weren't decremented, and thus the state mgr wasn't signalled, and opensm didn't proceed through its stages.

The following patch updates the relevant counters and calls the dispatcher if necessary.
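The bookkeeping described above can be sketched in isolation. The following is a simplified, single-threaded model with hypothetical names (the real code uses atomic counters and the complib dispatcher, neither of which is reproduced here): counters are bumped before the send, so an immediate failure has to undo all of them and raise the "nothing outstanding" signal if the count drops to zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the statistics block and the dispatcher. */
struct toy_stats {
    int mads_sent;
    int mads_outstanding;
    int signals_posted;  /* counts "no SMPs outstanding" notifications */
};

/* Counters are pre-incremented before the MAD is handed to the transport. */
static void toy_account_send(struct toy_stats *s)
{
    assert(s != NULL);
    s->mads_sent++;
    s->mads_outstanding++;
}

/* The send failed immediately: undo everything toy_account_send() did. */
static void toy_undo_failed_send(struct toy_stats *s)
{
    assert(s != NULL);
    s->mads_sent--;
    if (s->mads_outstanding > 0 && --s->mads_outstanding == 0)
        s->signals_posted++;  /* wire is clean: wake the state manager */
}
```

If only mads_sent were rolled back, mads_outstanding would never return to zero after a failed send and the state manager would wait indefinitely, which matches the stall described in the message.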
Thanks, Yael Signed-off-by: Yael Kalka Index: include/opensm/osm_vl15intf.h =================================================================== --- include/opensm/osm_vl15intf.h (revision 3703) +++ include/opensm/osm_vl15intf.h (working copy) @@ -60,6 +60,7 @@ #include #include #include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -137,6 +138,8 @@ typedef struct _osm_vl15 osm_vendor_t *p_vend; osm_log_t *p_log; osm_stats_t *p_stats; + osm_subn_t *p_subn; + cl_disp_reg_handle_t h_disp; } osm_vl15_t; /* @@ -176,6 +179,12 @@ typedef struct _osm_vl15 * p_stats * Pointer to the OpenSM statistics block. * +* p_subn +* Pointer to the Subnet object for this subnet. +* +* h_disp +* Handle returned from dispatcher registration. +* * SEE ALSO * VL15 object *********/ @@ -265,7 +274,9 @@ osm_vl15_init( IN osm_vendor_t* const p_vend, IN osm_log_t* const p_log, IN osm_stats_t* const p_stats, - IN const int32_t max_wire_smps ); + IN const int32_t max_wire_smps, + IN osm_subn_t* const p_subn, + IN cl_dispatcher_t* const p_disp ); /* * PARAMETERS * p_vl15 @@ -283,6 +294,12 @@ osm_vl15_init( * max_wire_smps * [in] Maximum number of MADs allowed on the wire at one time. * +* p_subn +* [in] Pointer to the subnet object. +* +* p_disp +* [in] Pointer to the dispatcher object. +* * RETURN VALUES * IB_SUCCESS if the VL15 object was initialized successfully. 
* Index: opensm/osm_opensm.c =================================================================== --- opensm/osm_opensm.c (revision 3703) +++ opensm/osm_opensm.c (working copy) @@ -257,7 +257,7 @@ osm_opensm_init( status = osm_vl15_init( &p_osm->vl15, p_osm->p_vendor, - &p_osm->log, &p_osm->stats, p_opt->max_wire_smps ); + &p_osm->log, &p_osm->stats, p_opt->max_wire_smps, &p_osm->subn, &p_osm->disp ); if( status != IB_SUCCESS ) goto Exit; Index: opensm/osm_vl15intf.c =================================================================== --- opensm/osm_vl15intf.c (revision 3703) +++ opensm/osm_vl15intf.c (working copy) @@ -157,6 +157,8 @@ __osm_vl15_poller( if( status != IB_SUCCESS ) { + uint32_t outstanding; + cl_status_t cl_status; osm_log( p_vl->p_log, OSM_LOG_ERROR, "__osm_vl15_poller: ERR 3E03: " "MAD send failed (%s).\n", @@ -166,7 +168,64 @@ __osm_vl15_poller( The MAD was never successfully sent, so fix up the pre-incremented count values. */ + /* Decrement qp0_mads_sent and qp0_mads_outstanding_on_wire + that was incremented in the code above. */ mads_sent = cl_atomic_dec( &p_vl->p_stats->qp0_mads_sent ); + if( p_madw->resp_expected == TRUE ) + if ( !&p_vl->p_stats->qp0_mads_outstanding_on_wire ) + osm_log( p_vl->p_log, OSM_LOG_ERROR, + "__osm_vl15_poller: ERR 3E04: " + "Trying to dec qp0_mads_outstanding_on_wire=0. " + "Problem with transaction mgr!\n"); + else + cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding_on_wire ); + + /* The following code is similar to the one in + __osm_sm_mad_ctrl_retire_trans_mad. We need to decrement the + qp0_mads_outstanding counter, and if we reached 0 - need to call + the cl_disp_post with OSM_SIGNAL_NO_PENDING_TRANSACTION (in order + to wake up the state mgr). */ + if ( !&p_vl->p_stats->qp0_mads_outstanding ) + osm_log( p_vl->p_log, OSM_LOG_ERROR, + "__osm_vl15_poller: ERR 3E05: " + "Trying to dec qp0_mads_outstanding=0. 
" + "Problem with transaction mgr!\n"); + else + outstanding = cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); + + osm_log( p_vl->p_log, OSM_LOG_DEBUG, + "__osm_vl15_poller: " + "%u(%u) QP0 MADs outstanding.\n", + p_vl->p_stats->qp0_mads_outstanding,outstanding ); + + if( outstanding == 0 ) + { + /* + The wire is clean. + Signal the state manager. + */ + if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_vl->p_log, OSM_LOG_DEBUG, + "__osm_vl15_poller: " + "Posting Dispatcher message %s.\n", + osm_get_disp_msg_str( OSM_MSG_NO_SMPS_OUTSTANDING ) ); + } + + cl_status = cl_disp_post( p_vl->h_disp, + OSM_MSG_NO_SMPS_OUTSTANDING, + (void *)OSM_SIGNAL_NO_PENDING_TRANSACTIONS, + NULL, + NULL ); + + if( cl_status != CL_SUCCESS ) + { + osm_log( p_vl->p_log, OSM_LOG_ERROR, + "__osm_vl15_poller: ERR 3E06: " + "Dispatcher post message failed (%s).\n", + CL_STATUS_MSG( cl_status ) ); + } + } } else { @@ -232,6 +291,7 @@ osm_vl15_construct( cl_qlist_init( &p_vl->rfifo ); cl_qlist_init( &p_vl->ufifo ); cl_thread_construct( &p_vl->poller ); + p_vl->h_disp = CL_DISP_INVALID_HANDLE; } /********************************************************************** @@ -281,6 +341,8 @@ osm_vl15_destroy( p_vl->state = OSM_VL15_STATE_INIT; cl_spinlock_destroy( &p_vl->lock ); + cl_disp_unregister( p_vl->h_disp ); + OSM_LOG_EXIT( p_vl->p_log ); } @@ -292,7 +354,10 @@ osm_vl15_init( IN osm_vendor_t* const p_vend, IN osm_log_t* const p_log, IN osm_stats_t* const p_stats, - IN const int32_t max_wire_smps ) + IN const int32_t max_wire_smps, + IN osm_subn_t* const p_subn, + IN cl_dispatcher_t* const p_disp + ) { ib_api_status_t status = IB_SUCCESS; OSM_LOG_ENTER( p_log, osm_vl15_init ); @@ -301,6 +366,7 @@ osm_vl15_init( p_vl->p_log = p_log; p_vl->p_stats = p_stats; p_vl->max_wire_smps = max_wire_smps; + p_vl->p_subn = p_subn; status = cl_event_init( &p_vl->signal, FALSE ); if( status != IB_SUCCESS ) @@ -321,6 +387,21 @@ osm_vl15_init( if( status != IB_SUCCESS ) goto Exit; + 
p_vl->h_disp = cl_disp_register( + p_disp, + CL_DISP_MSGID_NONE, + NULL, + NULL ); + + if( p_vl->h_disp == CL_DISP_INVALID_HANDLE ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_vl15_init: ERR 3E01: " + "Dispatcher registration failed.\n" ); + status = IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + Exit: OSM_LOG_EXIT( p_log ); return( status ); From sean.hefty at intel.com Sun Oct 9 07:19:37 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 9 Oct 2005 07:19:37 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <1128730364.4382.11557.camel@hal.voltaire.com> Message-ID: >I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ? I'm referring to the case that iWarp is running over TCP. I know that it can run over SCTP, but I'm not familiar with the details of that protocol. With TCP, this is an end-to-end connection, so layering iWarp over it, only the endpoints need to deal with it. I believe the same is true for SCTP. >Doesn't a routing decision still need to be made at the IP layer ? Routing of the IP packets is done at the IP layer, but I don't see how this affects iWarp. >Doesn't the IP next hop need to be determined (e.g. gateway when the >destination is off the local IP subnet) ? Is there something that >precludes iWARP from working across IP subnets ? I can't think of anything that would preclude iWarp from working across subnets. - Sean From sean.hefty at intel.com Sun Oct 9 07:57:04 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Sun, 9 Oct 2005 07:57:04 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <1128829186.25001.76.camel@mail.es335.com> Message-ID: >It is theoretically possible to support all this on an IPoIB based >network. Multiple subnets, multiple routes to remote peers, ICMP >redirect, multiple IP addresses for each physical interface, yada yada >yada. 
>But IMHO, the only way to do this would be to tie directly into
>the existing routing, ARP, ICMP, etc... subsystems in Linux. Otherwise
>you'll end up recreating a gigantic (and I mean GIGANTIC) amount of

The current implementation ties into the standard Linux ARP tables. If connections were made over TCP/IP, using IPoIB, then I don't think that there would be any issues. The issues only arise because of the desire to use TCP/IP network addresses over a non-TCP/IP network.

>code. This belief is why I've been a proponent of mapping GIDs to one
>and only one IP address and treating it for management purposes as the
>equivalent of an IP address. Without this, the whole mechanism for
>determining routes, etc.. breaks down. If you treat the GID like a MAC
>address -- it breaks, because a MAC address can have multiple IP
>addresses -- the observation that lead to the conclusion that ATS was
>broken in the first place.

We should be able to handle the case where a GID has multiple IP addresses bound to it. But even if we added a 1:1 restriction, the connection over IB issue still exists.

>I know there is significant resistance to this idea, but I just don't
>see how we get this generically resolved without binding the two
>addressing schemes more closely. With the current binding, I just don't
>think it works.

Again, I don't think that the binding is the issue, so much as the desire to use an address for a protocol that isn't actually being used for communication. I don't view a GID as an IP address because we're not sending and receiving IP packets on the GID. IPoIB treats GIDs as only part of a MAC address, which I think is the proper view.

Anyway, returning to the original problem of connecting to an IB gateway when given a destination IP address on a different subnet... I'm slowly convincing myself that either the CMA or AT should do this. (I believe that the ib_addr code will do this now, but I still wasn't sure that we wanted it to.)
- Sean From surs at cse.ohio-state.edu Sun Oct 9 08:18:53 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Sun, 9 Oct 2005 11:18:53 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <52achmo18d.fsf@cisco.com> References: <20051005183649.GA9036@cse.ohio-state.edu> <52vf0bpxaz.fsf@cisco.com> <20051006021529.GA14502@cse.ohio-state.edu> <523bnfp8jk.fsf@cisco.com> <20051006133937.GA23901@cse.ohio-state.edu> <20051006184652.GA27969@cse.ohio-state.edu> <52achmo18d.fsf@cisco.com> Message-ID: <20051009151851.GA16147@cse.ohio-state.edu> Roland, * On Oct,13 Roland Dreier wrote : > Sayantan> I noticed that the test re-posts buffers only when the > Sayantan> outstanding recv count is <= 1. I set a SRQ limit as > Sayantan> max_recv - 5. So, I should get the event when 5 WQEs are > Sayantan> consumed from the SRQ, right? > > Yes, your code is correct. The problem was that the mthca kernel > driver was dispatching SRQ events incorrectly, so the event never > reached userspace. I've checked in a fix for that, and I'm going to > queue the SRQ limit event stuff for 2.6.15 (now that I've seen it > working). I did some further testing with this. Apparently, when the asynchronous thread is first started, it gets the limit event (since no receives are posted yet ...). But after that when the number of posted receives actually drop below max_recv - 5, I am not able to see another limit event. Do you think that this could happen in the current implementation? Thanks, Sayantan. 
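The symptom described above (one limit event at start-up, then none) is what a one-shot limit would produce if it is never re-armed. The toy model below assumes exactly that: the limit is disarmed, i.e. reset to zero, each time it fires. All names are hypothetical and none of this is the libibverbs API; it only models the counting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of an SRQ with a one-shot low-watermark limit. */
struct toy_srq {
    unsigned posted;  /* receive WQEs currently posted */
    unsigned limit;   /* 0 = disarmed */
    unsigned events;  /* limit events delivered so far */
};

static void toy_fire_if_below(struct toy_srq *q)
{
    if (q->limit != 0 && q->posted < q->limit) {
        q->events++;
        q->limit = 0;  /* assumption: firing disarms the limit */
    }
}

/* Arming with fewer WQEs posted than the limit fires immediately. */
static void toy_arm(struct toy_srq *q, unsigned limit)
{
    assert(q != NULL);
    q->limit = limit;
    toy_fire_if_below(q);
}

static void toy_post(struct toy_srq *q, unsigned n)
{
    assert(q != NULL);
    q->posted += n;
}

static void toy_consume(struct toy_srq *q)
{
    assert(q != NULL);
    if (q->posted > 0)
        q->posted--;
    toy_fire_if_below(q);
}
```

In this model, arming before any receives are posted delivers one event and leaves the limit disarmed, so draining the queue afterwards produces nothing further. Posting first and then arming again yields the next event once the count drops below the limit.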
--
http://www.cse.ohio-state.edu/~surs

From jackm at mellanox.co.il Sun Oct 9 09:30:05 2005
From: jackm at mellanox.co.il (Jack Morgenstein)
Date: Sun, 9 Oct 2005 18:30:05 +0200
Subject: [openib-general] segmentation fault in ibv_modify_srq
In-Reply-To: <20051009151851.GA16147@cse.ohio-state.edu>
References: <20051009151851.GA16147@cse.ohio-state.edu>
Message-ID: <20051009163005.GA26296@mellanox.co.il>

Sayantan,

The Limit Event must be re-armed after an event has occurred (it is a "one-shot"). (i.e., modify-srq/set-limit must be re-invoked). This is compliant with the IB Spec (see section 10.2.9.3, first paragraph). (Note that after each SRQ LWM event, the limit for the SRQ gets reset back to zero -- i.e., disabled).

Therefore, proper use of this feature is as follows (after creating the SRQ):
a. Post the SRQ WQEs
b. Arm the Limit to a non-zero value (less than the number of WQEs posted, or the arming is useless -- you will immediately get the event).
c. If the number of posted WQEs falls below your limit, you will get an event.
d. Handling the event:
   1) FIRST, post more WQEs to the SRQ to get the number of posted wqe's to be greater than your desired limit.
   2) THEN, re-arm the event (i.e., modify the SRQ limit again to be a non-zero value).

Jack

-----Original Message-----
On Sun, Oct 09, 2005 at 05:18:53PM +0200, Sayantan Sur wrote:
> Roland,
>
> * On Oct,13 Roland Dreier wrote :
> > Sayantan> I noticed that the test re-posts buffers only when the
> > Sayantan> outstanding recv count is <= 1. I set a SRQ limit as
> > Sayantan> max_recv - 5. So, I should get the event when 5 WQEs are
> > Sayantan> consumed from the SRQ, right?
> >
> > Yes, your code is correct. The problem was that the mthca kernel
> > driver was dispatching SRQ events incorrectly, so the event never
> > reached userspace. I've checked in a fix for that, and I'm going to
> > queue the SRQ limit event stuff for 2.6.15 (now that I've seen it
> > working).
>
> I did some further testing with this.
Apparently, when the asynchronous > thread is first started, it gets the limit event (since no receives are > posted yet ...). But after that when the number of posted receives > actually drop below max_recv - 5, I am not able to see another limit > event. > > Do you think that this could happen in the current implementation? > > Thanks, > Sayantan. > > -- > http://www.cse.ohio-state.edu/~surs > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From tom at ammasso.com Sun Oct 9 10:10:18 2005 From: tom at ammasso.com (Tom Tucker) Date: Sun, 09 Oct 2005 12:10:18 -0500 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: References: Message-ID: <1128877818.24182.54.camel@mail.es335.com> On Sun, 2005-10-09 at 07:57 -0700, Sean Hefty wrote: > >It is theoretically possible to support all this on an IPoIB based > >network. Multiple subnets, multiple routes to remote peers, ICMP > >redirect, multiple IP addresses for each physical interface, yada yada > >yada. But IMHO, the only way to do this would be to tie directly into > >the existing routing, ARP, ICMP, etc... subsystems in Linux. Otherwise > >you'll end up recreating a gigantic (and I mean GIGANTIC) amount of > > The current implementation ties into the standard Linux ARP tables. If > connections were made over TCP/IP, using IPoIB, then I don't think that there > would be any issues. The issues only arise because of the desire to use TCP/IP > network addresses over a non-TCP/IP network. > > >code. This belief is why I've been a proponent of mapping GIDs to one > >and only one IP address and treating it for management purposes as the > >equivalent of an IP address. Without this, the whole mechanism for > >determining routes, etc.. breaks down. 
If you treat the GID like a MAC > >address -- it breaks, because a MAC address can have multiple IP > >addresses -- the observation that lead to the conclusion that ATS was > >broken in the first place. > > We should be able to handle the case where a GID has multiple IP addresses bound > to it. But even if we added a 1:1 restriction, the connection over IB issue > still exists. I agree, except for RARP. > > >I know there is significant resistance to this idea, but I just don't > >see how we get this generically resolved without binding the two > >addressing schemes more closely. With the current binding, I just don't > >think it works. > > Again, I don't think that the binding is the issue, so much as the desire to use > an address for a protocol that isn't actually being used for communication. Not to be pedantic, but if binding or mapping or somesuch weren't an issue we wouldn't need AT. > I > don't view a GID as an IP address because we're not sending and receiving IP > packets on the GID. IPoIB treats GIDs as only part of a MAC address, which I > think is the proper view. > > Anyway, returning back to the original problem of connecting to an IB gateway if > a given a destination IP address on a different subnet... I'm slowly convincing > myself that either the CMA or AT should do this. (I believe that the ib_addr > code will do this now, but still wasn't sure that we wanted it to.) > IMHO, you need a service separate from the CMA to do address translation. My (iWARP's) rationale for this is that there are two clients of the service, the CM and IP. For CM, you need it to elect a route and thereby a local interface. For IP you need it because routes change and ARP entries time out. BTW, can you educate me ... is the following what you're thinking: On the client side... 
- route is discovered by looking at the Linux routing table - local interface is IPoIB (looks at rdma_ptr embedded in netdev struct) - send ARP AT message over local IB interface At the gateway...bridging to IP - ARP AT query received on IB interface - Lookup route to destination IP address in gateway's route table. - If next hop's Ethernet address is already known, it is returned - Otherwise, local interface identified is IPoEthernet - New ARP query goes out on the local interface from the route - When response comes back, answer is returned. At the gateway...bridging to IPoIB - ARP AT message received on IB interface, delivered to AT - Lookup route to destination IP address in gateway's route table - If next hop's Ethernet address is already known, it is returned - otherwise, local interface identified in route is IPoIB - New ARP AT query goes out on the local interface - When response comes back, answer is returned. Thanks, > - Sean > > From surs at cse.ohio-state.edu Sun Oct 9 11:50:31 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Sun, 9 Oct 2005 14:50:31 -0400 Subject: [openib-general] segmentation fault in ibv_modify_srq In-Reply-To: <20051009163005.GA26296@mellanox.co.il> References: <20051009151851.GA16147@cse.ohio-state.edu> <20051009163005.GA26296@mellanox.co.il> Message-ID: <20051009185029.GA16927@cse.ohio-state.edu> Jack, * On Oct,16 Jack Morgenstein wrote : > Sayantan, > The Limit Event must be re-armed after an event has occurred (it is a "one-shot"). > (i.e., modify-srq/set-limit must be re-invoked).This is compliant with the > IB Spec (see section 10.2.9.3, first paragraph). (Note that after each SRQ LWM > event, the limit for the SRQ gets reset back to zero -- i.e., disabled). > > Therefore, proper use of this feature is as follows (after creating the SRQ): > a. Post the SRQ WQEs > b. Arm the Limit to a non-zero value (less than the number of WQEs posted, > or the arming is useless -- you will immediately get the event). > c. 
If the number of posted WQEs falls below your limit, you will get an > event. > d. Handling the event: > 1) FIRST, post more WQEs to the SRQ to get the number of posted wqe's to be > greater than your desired limit. > 2) THEN, re-arm the event (i.e., modify the SRQ limit again to > be a non-zero value). Thanks for the detailed instructions. I am able to see the limit event exactly when the buffer count goes down. Thanks, Sayantan. -- http://www.cse.ohio-state.edu/~surs From braam at clusterfs.com Sun Oct 9 14:17:56 2005 From: braam at clusterfs.com (Peter J. Braam) Date: Sun, 9 Oct 2005 17:17:56 -0400 Subject: [openib-general] Lustre Network Driver - KDAPL or verbs? Message-ID: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> Cluster File Systems, Inc and its customers have been wondering if the Lustre Network Driver (LND) for OpenIb gen2, which we will begin to develop during the coming months, should be based on kdapl or verbs. The driver we plan to develop should strive to address several goals: - high reliability and performance - allow interoperability between user and kernel level - allow interoperability, or better, portability among different operating systems (Linux, OS X, Windows, Solaris) - be suitable for inclusion in the Linux kernel We are keen to hear some opinions! Thanks Peter Braam -------------- next part -------------- An HTML attachment was scrubbed... URL: From hozer at hozed.org Sun Oct 9 18:32:57 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Sun, 9 Oct 2005 20:32:57 -0500 Subject: [openib-general] IBM eHCA testing.. Message-ID: <20051010013256.GE4612@kalmia.hozed.org> What's the status on getting the ehca driver integrated into subversion? If there's something holding it up, can we at least get a version that can be dropped into drivers/infiniband/hw ? Also, one final note, is it really appropriate to have ehca/ebus in the infiniband directory? 
It's really a PPC64 specific driver that works for more than just the ehca device, correct? I have the correct port plugged in now, and I can see the logical HCA device in the output of 'ibnetdiscover' (from another node), but trying to bring up ib0 caused this: [ 381.453731] eHCA Infiniband Device Driver (Rel.: EHCA2_0025) [ 381.458602] xics_enable_irq: irq=36868: ibm_int_on returned fffffffd [ 393.378143] eHCA Infiniband Device Driver (Rel.: EHCA2_0025) [ 452.658083] PU0002 000b0075:ehca_define_sqp HCAD_ERROR Port 1 is not active. [ 452.658106] PU0002 000b0383:ehca_create_qp HCAD_ERROR ehca_define_sqp() failed rc=ffffffffffffffff [ 452.821917] PU0002 000b03aa:ehca_create_qp <<< failed ret=ffffffea [ 452.821939] ib_mad: Couldn't create ib_mad QP1 [ 453.313412] ib_mad: Couldn't open ehca0 port 1 [ 475.132318] PU0002 00060100:ehca_parse_ec EHCA port 1 is available. [ 518.249381] PU0007 000b00b9:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN r3=168 r4=1000000003000004 r5=2000000000000008 r6=8a40000000000000 r7=1e4e49000 r8=0 r9=0 r10=0 [ 518.249411] PU0007 000b00c0:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_OUT r3=ffffffffffffffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=800000000005aa18 r10=0 [ 518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffffffffffffffd3 ehca_qp=c00000000f2cd080 qp_num=8 [ 518.249460] ib0: failed to modify QP to init, ret = -22 [ 518.418976] ib0: ipoib_qp_create returned -22 [ 528.813491] Oops: Kernel access of bad area, sig: 11 [#1] [ 528.813505] SMP NR_CPUS=8 NUMA PSERIES LPAR [ 528.813517] Modules linked in: ib_ipoib ib_sa ib_mad hcad_mod ib_core ebus [ 528.813540] NIP: D000000000049C6C XER: 20000020 LR: D0000000000760A0 CTR: D000000000049C60 [ 528.813554] REGS: c00000000f1eb1d0 TRAP: 0300 Not tainted (2.6.13.3-power5) [ 528.813568] MSR: 8000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 22028422 [ 528.813580] DAR: 0000000000000000 DSISR: 0000000040000000 [ 528.813592] TASK: c00000000209a9a0[2021] 'ifconfig' THREAD: 
c00000000f1e8000 CPU: 0 [ 528.813605] GPR00: D0000000000760A0 C00000000F1EB450 D00000000005FFF0 0000000000000000 [ 528.813625] GPR04: C00000000F1EB548 0000000000000071 C00000000F1EB540 0000000000000001 [ 528.813645] GPR08: 000000000000000B 0000000000000001 0000000000000004 D000000000049C60 [ 528.813664] GPR12: D0000000000774C0 C0000000004B4000 00000000100C0000 00000000100A0000 [ 528.813685] GPR16: 0000000000000000 0000000000000000 0000000010020000 0000000010020000 [ 528.813704] GPR20: 000000001001E71C C0000001E466C000 FFFFFFFFFFFF8914 C0000001E46D4810 [ 528.813725] GPR24: C0000001E46D4800 C00000000F43B900 C00000000F1EBD10 0000000000000002 [ 528.813745] GPR28: 0000000000000000 C0000001E466C380 D000000000084640 C00000000F1EB548 [ 528.813768] NIP [d000000000049c6c] .ib_modify_qp+0xc/0x40 [ib_core] [ 528.813797] LR [d0000000000760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib] [ 528.813822] Call Trace: [ 528.813829] [c00000000f1eb450] [00000000434849c5] 0x434849c5 (unreliable) [ 528.813846] [c00000000f1eb4d0] [d0000000000760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib] [ 528.813873] [c00000000f1eb5f0] [d00000000007261c] .ipoib_ib_dev_open+0x2c/0x120 [ib_ipoib] [ 528.813899] [c00000000f1eb680] [d00000000006f38c] .ipoib_open+0x7c/0x190 [ib_ipoib] [ 528.813923] [c00000000f1eb720] [c00000000032a650] .dev_open+0xc0/0x120 [ 528.813942] [c00000000f1eb7c0] [c000000000328c70] .dev_change_flags+0x180/0x1c0 [ 528.813961] [c00000000f1eb860] [c00000000037a02c] .devinet_ioctl+0x81c/0x850 [ 528.813980] [c00000000f1eb970] [c00000000037a65c] .inet_ioctl+0x27c/0x2d0 [ 528.813998] [c00000000f1eba00] [c00000000031bc4c] .sock_ioctl+0x8c/0x440 [ 528.814016] [c00000000f1ebaa0] [c0000000000c22f0] .do_ioctl+0x60/0x120 [ 528.814033] [c00000000f1ebb40] [c0000000000c244c] .vfs_ioctl+0x9c/0x4d0 [ 528.814050] [c00000000f1ebbf0] [c0000000000c28cc] .sys_ioctl+0x4c/0xa0 [ 528.814066] [c00000000f1ebca0] [c00000000001bb24] .dev_ifsioc+0x84/0x390 [ 528.814084] [c00000000f1ebd70] [c0000000000e4d88] 
.compat_sys_ioctl+0x158/0x500 [ 528.814103] [c00000000f1ebe30] [c00000000000d300] syscall_exit+0x0/0x18 [ 528.814119] Instruction dump: [ 528.814126] 7c601b78 38210080 7c030378 e8010010 7c0803a6 4e800020 60000000 60000000 [ 528.814150] 60000000 7c0802a6 f8010010 f821ff81 e9490170 e80a0000 f8410028 [ 528.814174] <7>RTAS: event: 3, Type: Platform Error, Severity: 2 From ftillier at silverstorm.com Sun Oct 9 21:53:36 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Sun, 9 Oct 2005 21:53:36 -0700 Subject: [openib-general] Lustre Network Driver - KDAPL or verbs? In-Reply-To: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> Message-ID: <000301c5cd56$94101290$9e5aa8c0@infiniconsys.com> > From: Peter J. Braam [mailto:braam at clusterfs.com] > Sent: Sunday, October 09, 2005 2:18 PM > > Cluster File Systems, Inc and its customers have been wondering if the Lustre > Network Driver (LND) for OpenIb gen2, which we will begin to develop during > the coming months, should be based on kdapl or verbs. > > The driver we plan to develop should strive to address several goals: > - high reliability and performance > - allow interoperability between user and kernel level > - allow interoperability, or better, portability among different operating > systems (Linux, OS X, Windows, Solaris) > - be suitable for inclusion in the Linux kernel I think that suitability for inclusion in the Linux kernel is going to be mutually exclusive with portability between different operating systems. If you want to be in the Linux kernel, you need to be a native Linux driver, and not use any sorts of abstraction layers. Feedback to date on abstraction layers has been consistently clear that they will not be tolerated in the kernel. With the ongoing work to support both IB and iWarp devices under the OpenIB verbs, I think coding directly to verbs would be just fine. 
You'll likely want to use the higher level CM abstraction being developed now for establishing connections in a transport neutral manner, but the verbs themselves should be the same. Others more closely involved can likely give you better guidance.

With all this said, I'm personally interested to see a cluster file system on top of the OpenIB Windows stack, and since kDAPL doesn't exist in Windows at the moment, interfacing to native verbs would be my preference. There really aren't that many differences in verbs, though Windows will likely make you deal with more things asynchronously depending on your IRQL. I'd be happy to field specific questions about Windows on the openib-windows mailing list if you have them.

Cheers,

- Fab

From IBMEHCAD at de.ibm.com Mon Oct 10 00:23:59 2005
From: IBMEHCAD at de.ibm.com (IBMEHCA DD)
Date: Mon, 10 Oct 2005 09:23:59 +0200
Subject: [openib-general] IBM eHCA testing..
In-Reply-To: <20051008020308.GZ4612@kalmia.hozed.org>
Message-ID:

This is caused by a complex interaction of ib_mad, hcad_mod and pSeries firmware.

As you might already have noticed, an eHCA doesn't show up as a "port" but as a switch in the fabric. The reason for this is partition support and virtualisation in InfiniBand. If you want to give each partition in a system its "own" IB adapter, it has to have its "own" LID(s) and therefore its own GUIDs. The IB standard currently allows only one way to accomplish this: you need a switch with multiple adapters behind it. So that's exactly how the eHCA shows up in the fabric. In our case system firmware handles the SMA traffic for that "switch" and for all "adapters" (running an SMA or SM on QP0 is currently not supported).

This brings up another problem: you definitely won't want to allocate LIDs for all "potentially possible" operating system partitions (not to be confused with IB partitioning), otherwise you could come close to the 48000 LIDs/subnet limit pretty quickly.
So you need some kind of signal from the operating system to system firmware, which in the eHCA case is the H_DEFINE_AQP1 triggered by ib_create_qp with the IB_QPT_GSI parameter. AFTER that call, handshaking between system firmware and the SM will start: here's a new adapter active on a switch port... what's your GUID? here's your LID, p_key, SM LID.... ...and after all that it's possible to send and receive packets from the fabric.

The openib stack expects that a port is fully functional after this create_qp returns, and starts to do all sorts of modify QP and post send. So the only choice we have there is to delay create_qp until the complete handshaking between system firmware and the SM has finished (until we see an IB_PORT_ACTIVE in hcad_mod). If we don't see that within EHCA_PORT_ACTIVE_TIMEOUT we have to return an error code to openib, otherwise we're seriously in trouble (tried that).

Shirley already pointed out on the mailing list that ib_mad and others have different recovery depending on the success of ib_create_qp(IB_QPT_GSI); in particular, ib_mad decides the best thing is to kill the complete adapter if that call fails on a single port.

So that's the full explanation of ehca_nr_ports, and hopefully it answers your question....

Troy Benjegerdes
08.10.2005 04:03
To Shirley Ma
cc Pradeep Satyanarayana , Troy Benjegerdes , IBMEHCA DD/Germany/IBM at IBMDE, openib-general at openib.org, openib-general-bounces at openib.org
Subject Re: [openib-general] IBM eHCA testing..

On Fri, Oct 07, 2005 at 09:33:27AM -0700, Shirley Ma wrote:
> Hi, Troy,
>
> There is INSTALL file in the EHCA driver package.
> In OpenPower 720 port 1 is at the top, port 2 is at the bottom.
> In P570, port1 is at the bottom, port2 is at the top.

Okay, I guess I should read more carefully ;)

What is the issue with needing to use port 1? Can that be fixed in the driver, or does that need a firmware update?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From yipeeyipeeyipeeyipee at yahoo.com Mon Oct 10 01:28:06 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Mon, 10 Oct 2005 08:28:06 +0000 (UTC) Subject: [openib-general] IRQ sharing on PCIe bus Message-ID: Hi, My setup is a 3GHz Xeon (x86_64) with a 2.6.13.2 kernel. A Mellanox memfree PCIe ddr HCA is connected. Why do I see IRQ sharing although I'm using msi_x and PCIe? Doesn't IRQ sharing only happen on older non PCIe busses? When insmod'ing ib_mthca.ko I see: ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) ib_mthca: Initializing 0000:06:00.0 IRQ for 0000:06:00.0[A] -> PIRQ 60, mask dcd8, excl 0000 -> newirq=10 -> got IRQ 10 PCI: Found IRQ 10 for device 0000:06:00.0 PCI: Sharing IRQ 10 with 0000:00:01.0 PCI: Sharing IRQ 10 with 0000:00:02.0 PCI: Sharing IRQ 10 with 0000:00:04.0 PCI: Sharing IRQ 10 with 0000:00:05.0 PCI: Sharing IRQ 10 with 0000:00:06.0 PCI: Sharing IRQ 10 with 0000:00:1d.0 PCI: Sharing IRQ 10 with 0000:07:04.0 PCI: Setting latency timer of device 0000:06:00.0 to 64 the /proc/pci is: PCI devices found: Bus 0, device 0, function 0: Class 0600: PCI device 8086:3590 (rev 12). Bus 0, device 0, function 1: Class ff00: PCI device 8086:3591 (rev 12). Bus 0, device 1, function 0: Class 0880: PCI device 8086:3594 (rev 12). IRQ 10. Non-prefetchable 32 bit memory at 0xfcdff000 [0xfcdfffff]. Bus 0, device 2, function 0: Class 0604: PCI device 8086:3595 (rev 12). IRQ 10. Master Capable. No bursts. Min Gnt=6. Bus 0, device 4, function 0: Class 0604: PCI device 8086:3597 (rev 12). IRQ 10. Master Capable. No bursts. Min Gnt=6. Bus 0, device 5, function 0: Class 0604: PCI device 8086:3598 (rev 12).
IRQ 10. Master Capable. No bursts. Min Gnt=7. Bus 0, device 6, function 0: Class 0604: PCI device 8086:3599 (rev 12). IRQ 10. Master Capable. No bursts. Min Gnt=6. Bus 0, device 29, function 0: Class 0c03: PCI device 8086:24d2 (rev 2). IRQ 10. I/O at 0xd800 [0xd81f]. Bus 0, device 29, function 1: Class 0c03: PCI device 8086:24d4 (rev 2). IRQ 7. I/O at 0xd880 [0xd89f]. Bus 0, device 29, function 2: Class 0c03: PCI device 8086:24d7 (rev 2). IRQ 15. I/O at 0xdc00 [0xdc1f]. Bus 0, device 29, function 7: Class 0c03: PCI device 8086:24dd (rev 2). IRQ 5. Non-prefetchable 32 bit memory at 0xfcdfec00 [0xfcdfefff]. Bus 0, device 30, function 0: Class 0604: PCI device 8086:244e (rev 194). Master Capable. No bursts. Min Gnt=11. Bus 0, device 31, function 0: Class 0601: PCI device 8086:24d0 (rev 2). Bus 0, device 31, function 1: Class 0101: PCI device 8086:24db (rev 2). IRQ 15. I/O at 0xfc00 [0xfc0f]. Non-prefetchable 32 bit memory at 0x80100000 [0x801003ff]. Bus 0, device 31, function 3: Class 0c05: PCI device 8086:24d3 (rev 2). IRQ 11. I/O at 0x540 [0x55f]. Bus 1, device 0, function 0: Class 0604: PCI device 8086:0329 (rev 9). Master Capable. No bursts. Min Gnt=7. Bus 1, device 0, function 1: Class 0800: PCI device 8086:0326 (rev 9). Non-prefetchable 32 bit memory at 0xfcefe000 [0xfcefefff]. Bus 1, device 0, function 2: Class 0604: PCI device 8086:032a (rev 9). Master Capable. No bursts. Min Gnt=7. Bus 1, device 0, function 3: Class 0800: PCI device 8086:0327 (rev 9). Non-prefetchable 32 bit memory at 0xfceff000 [0xfcefffff]. Bus 6, device 0, function 0: Class 0c06: PCI device 15b3:6282 (rev 32). IRQ 10. Non-prefetchable 64 bit memory at 0xfcf00000 [0xfcffffff]. Prefetchable 64 bit memory at 0xfb800000 [0xfbffffff]. Bus 7, device 4, function 0: Class 0200: PCI device 8086:1076 (rev 5). IRQ 10. Master Capable. Latency=32. Min Gnt=255. Non-prefetchable 32 bit memory at 0xfebe0000 [0xfebfffff]. I/O at 0xec00 [0xec3f]. 
Bus 7, device 6, function 0: Class 0200: PCI device 8086:107c (rev 5). IRQ 15. Master Capable. Latency=32. Min Gnt=255. Non-prefetchable 32 bit memory at 0xfeba0000 [0xfebbffff]. Non-prefetchable 32 bit memory at 0xfeb80000 [0xfeb9ffff]. I/O at 0xe880 [0xe8bf]. Bus 7, device 12, function 0: Class 0300: PCI device 1002:4752 (rev 39). IRQ 11. Master Capable. Latency=32. Min Gnt=8. Non-prefetchable 32 bit memory at 0xfd000000 [0xfdffffff]. I/O at 0xe400 [0xe4ff]. Non-prefetchable 32 bit memory at 0xfebdb000 [0xfebdbfff]. Thanks, y From david at allinea.com Mon Oct 10 02:23:21 2005 From: david at allinea.com (David Lecomber) Date: Mon, 10 Oct 2005 10:23:21 +0100 Subject: [openib-general] ptrace peektext failure for Mellanox IBGD 1.7.0 based cluster Message-ID: <1128936201.26749.10.camel@delmo.priv.wark.uk.streamline-computing.com> Dear all, I'm having a kernel problem which I believe to be caused by the infiniband drivers on the system I am using. Kernel 2.6.11, Mellanox software stack IBGD 1.7.0. Essentially, once an MPI code is started, the kernel refuses to allow ptrace() access to the text segment (ie. where the program instructions lie), although it is possible to access the data segment. This means debugging is impossible (gdb, idb, ddt, etc.). The attached code demonstrates the problem. Untar, and then make. Run the 'mpi' program, and pick a line of it's output, paste into another shell. On the standard, non MPI test code, the ptrace reads are all successful. On the MPI test, it gives an error for the text segment reads.. Is this a known issue - are there any upgrades/fixes which should have been applied? I would appreciate if someone could run the test suggested on a really new setup, and see if the error happens. Regards David -- David Lecomber, CTO, Allinea Software tel: +44 1926 623231 fax: +44 1926 623232 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ib.tar Type: application/x-tar Size: 10240 bytes Desc: not available URL: From SCHICKHJ at de.ibm.com Mon Oct 10 03:53:23 2005 From: SCHICKHJ at de.ibm.com (Heiko J Schick) Date: Mon, 10 Oct 2005 12:53:23 +0200 Subject: [openib-general] IBM eHCA testing.. Message-ID: Hello Troy, below you will find our preliminary analysis about the problem you've reported on Oct 10 via the OpenIB mailing-list [1]: [1]: http://openib.org/pipermail/openib-general/2005-October/012353.html [ 381.453731] eHCA Infiniband Device Driver (Rel.: EHCA2_0025) [ 381.458602] xics_enable_irq: irq=36868: ibm_int_on returned fffffffd [ 393.378143] eHCA Infiniband Device Driver (Rel.: EHCA2_0025) [ 452.658083] PU0002 000b0075:ehca_define_sqp HCAD_ERROR Port 1 is not active. [ 452.658106] PU0002 000b0383:ehca_create_qp HCAD_ERROR ehca_define_sqp() failed rc=ffffffffffffffff [ 452.821917] PU0002 000b03aa:ehca_create_qp <<< failed ret=ffffffea [ 452.821939] ib_mad: Couldn't create ib_mad QP1 [ 453.313412] ib_mad: Couldn't open ehca0 port 1 [ 475.132318] PU0002 00060100:ehca_parse_ec EHCA port 1 is available. 
[ 518.249381] PU0007 000b00b9:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN r3=168 r4=1000000003000004 r5=2000000000000008 r6=8a40000000000000 r7=1e4e49000 r8=0 r9=0 r10=0 [ 518.249411] PU0007 000b00c0:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_OUT r3=ffffffffffffffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=800000000005aa18 r10=0 [ 518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffffffffffffffd3 ehca_qp=c00000000f2cd080 qp_num=8 [ 518.249460] ib0: failed to modify QP to init, ret = -22 [ 518.418976] ib0: ipoib_qp_create returned -22 [ 528.813491] Oops: Kernel access of bad area, sig: 11 [#1] [ 528.813505] SMP NR_CPUS=8 NUMA PSERIES LPAR [ 528.813517] Modules linked in: ib_ipoib ib_sa ib_mad hcad_mod ib_core ebus [ 528.813540] NIP: D000000000049C6C XER: 20000020 LR: D0000000000760A0 CTR: D000000000049C60 [ 528.813554] REGS: c00000000f1eb1d0 TRAP: 0300 Not tainted (2.6.13.3-power5) [ 528.813568] MSR: 8000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 22028422 [ 528.813580] DAR: 0000000000000000 DSISR: 0000000040000000 [ 528.813592] TASK: c00000000209a9a0[2021] 'ifconfig' THREAD: c00000000f1e8000 CPU: 0 [ 528.813605] GPR00: D0000000000760A0 C00000000F1EB450 D00000000005FFF0 0000000000000000 [ 528.813625] GPR04: C00000000F1EB548 0000000000000071 C00000000F1EB540 0000000000000001 [ 528.813645] GPR08: 000000000000000B 0000000000000001 0000000000000004 D000000000049C60 [ 528.813664] GPR12: D0000000000774C0 C0000000004B4000 00000000100C0000 00000000100A0000 [ 528.813685] GPR16: 0000000000000000 0000000000000000 0000000010020000 0000000010020000 [ 528.813704] GPR20: 000000001001E71C C0000001E466C000 FFFFFFFFFFFF8914 C0000001E46D4810 [ 528.813725] GPR24: C0000001E46D4800 C00000000F43B900 C00000000F1EBD10 0000000000000002 [ 528.813745] GPR28: 0000000000000000 C0000001E466C380 D000000000084640 C00000000F1EB548 [ 528.813768] NIP [d000000000049c6c] .ib_modify_qp+0xc/0x40 [ib_core] [ 528.813797] LR [d0000000000760a0] .ipoib_qp_create+0xe0/0x1c0 
[ib_ipoib]
[ 528.813822] Call Trace:
[ 528.813829] [c00000000f1eb450] [00000000434849c5] 0x434849c5 (unreliable)
[ 528.813846] [c00000000f1eb4d0] [d0000000000760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib]
[ 528.813873] [c00000000f1eb5f0] [d00000000007261c] .ipoib_ib_dev_open+0x2c/0x120 [ib_ipoib]
[ 528.813899] [c00000000f1eb680] [d00000000006f38c] .ipoib_open+0x7c/0x190 [ib_ipoib]
[ 528.813923] [c00000000f1eb720] [c00000000032a650] .dev_open+0xc0/0x120
[ 528.813942] [c00000000f1eb7c0] [c000000000328c70] .dev_change_flags+0x180/0x1c0
[ 528.813961] [c00000000f1eb860] [c00000000037a02c] .devinet_ioctl+0x81c/0x850
[ 528.813980] [c00000000f1eb970] [c00000000037a65c] .inet_ioctl+0x27c/0x2d0
[ 528.813998] [c00000000f1eba00] [c00000000031bc4c] .sock_ioctl+0x8c/0x440
[ 528.814016] [c00000000f1ebaa0] [c0000000000c22f0] .do_ioctl+0x60/0x120
[ 528.814033] [c00000000f1ebb40] [c0000000000c244c] .vfs_ioctl+0x9c/0x4d0
[ 528.814050] [c00000000f1ebbf0] [c0000000000c28cc] .sys_ioctl+0x4c/0xa0
[ 528.814066] [c00000000f1ebca0] [c00000000001bb24] .dev_ifsioc+0x84/0x390
[ 528.814084] [c00000000f1ebd70] [c0000000000e4d88] .compat_sys_ioctl+0x158/0x500
[ 528.814103] [c00000000f1ebe30] [c00000000000d300] syscall_exit+0x0/0x18
[ 528.814119] Instruction dump:
[ 528.814126] 7c601b78 38210080 7c030378 e8010010 7c0803a6 4e800020 60000000 60000000
[ 528.814150] 60000000 7c0802a6 f8010010 f821ff81 e9490170 e80a0000 f8410028
[ 528.814174] <7>RTAS: event: 3, Type: Platform Error, Severity: 2

It looks as if IPoIB uses resources which have already been freed. We don't receive a "port active" event for port 1 in time (within 20 seconds). This matters because the ib_mad stack tries to create an AQP1, and there our eHCA InfiniBand Device Driver waits for a maximum of 20 seconds for a "port active" event. It seems that when OpenSM is used we receive the "port active" event only after about 45 seconds.

For the MAD and IPoIB stack this means the following:

MAD:
====

1.
No AQP1 QP will exist for port 1, because of the missing "port active" event.
2. All resources are freed by the error handling routines in ib_mad: create_mad_qp reports an error to ib_mad_port_open, which destroys all allocated resources (workqueue, AQPs, MR, PD, CQ, etc.).
3. Multicast join requests to the SM won't work !!!
   IPoIB doesn't work on ifconfig ib0 xxx.xxx.xxx.xxx !!!

IPoIB:
======

For IPoIB, a "port active" event which arrives too late is going to be a problem.

1. The function ipoib_add_one calls ipoib_add_port, which creates all IB resources (QPs, CQ, etc.; function ipoib_dev_init -> ipoib_ib_dev_init, ...).
2. The function ipoib_ib_dev_init (executed at startup / module load) calls ipoib_ib_dev_open, which wants to modify the IPoIB QP from INIT -> RTR -> RTS via ipoib_qp_create.
3. The first ib_modify_qp call (Reset2Init) in ipoib_qp_create fails, because the port is not active at that moment. See:

   [ 518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffffffffffffffd3 ...
   [ 518.249460] ib0: failed to modify QP to init, ret = -22
   [ 518.418976] ib0: ipoib_qp_create returned -22

4. If that happens, the function ipoib_qp_create in ipoib_verbs.c will destroy the IPoIB QP.
5. A user enters ifconfig ib0 xxx.xxx.xxx.xxx, which executes ipoib_open. This function also executes ipoib_ib_dev_open, which wants to modify the IPoIB QP from INIT -> RTR -> RTS via ipoib_qp_create.
6. ib_modify_qp will cause a kernel panic (because priv->qp is NULL, see function ipoib_qp_create).

Mit freundlichen Gruessen / Kind Regards

Heiko Joerg Schick

IBM Deutschland Entwicklung GmbH
I/Ox Microcode Development
Linux Infiniband Device Drivers
Schoenaicher Str. 220
71032 Boeblingen

E-Mail: schickhj at de.ibm.com
External: 49-7031-16-0 x4219, t/l: 120-4219
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From eli at mellanox.co.il Mon Oct 10 06:22:35 2005 From: eli at mellanox.co.il (Eli Cohen) Date: Mon, 10 Oct 2005 15:22:35 +0200 Subject: [openib-general] RE: ptrace peektext failure for Mellanox IBGD 1.7.0 based cluster Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E3066244@mtlexch01.mtl.com> David, IBGD 1.7 does not support kernel 2.6.11 so I assume you have made changes to IBGD to make it compile. In the files you sent I can't see a call to ptrace with PTRACE_PEEKTEXT but I can see a call to PTRACE_PEEKDATA. Note that in the IBGD stack, registered buffers are not inherited by a child process when a the parent forks. This is accomplished by setting the VM_DONTCOPY flag on the vma. It is so done to retain the virtual to physical translation of a page at the parent by disabling COW on the pages. So the child may not even have these buffers in its address space and this could be the reason why ptrace fails. Note also that IBGD 1.8 is the latest release and it does support kernel 2.6.11 so you may consider using it, though the description above holds also for IBGD 1.8 Eli -----Original Message----- From: David Lecomber [mailto:david at allinea.com] Sent: Monday, October 10, 2005 11:23 AM To: openib-general at openib.org Subject: ptrace peektext failure for Mellanox IBGD 1.7.0 based cluster Dear all, I'm having a kernel problem which I believe to be caused by the infiniband drivers on the system I am using. Kernel 2.6.11, Mellanox software stack IBGD 1.7.0. Essentially, once an MPI code is started, the kernel refuses to allow ptrace() access to the text segment (ie. where the program instructions lie), although it is possible to access the data segment. This means debugging is impossible (gdb, idb, ddt, etc.). The attached code demonstrates the problem. Untar, and then make. Run the 'mpi' program, and pick a line of it's output, paste into another shell. On the standard, non MPI test code, the ptrace reads are all successful. 
On the MPI test, it gives an error for the text segment reads.. Is this a known issue - are there any upgrades/fixes which should have been applied? I would appreciate if someone could run the test suggested on a really new setup, and see if the error happens. Regards David -- David Lecomber, CTO, Allinea Software tel: +44 1926 623231 fax: +44 1926 623232 -------------- next part -------------- An HTML attachment was scrubbed... URL: From david at allinea.com Mon Oct 10 06:22:32 2005 From: david at allinea.com (David Lecomber) Date: Mon, 10 Oct 2005 14:22:32 +0100 Subject: [openib-general] RE: ptrace peektext failure for Mellanox IBGD 1.7.0 based cluster In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E3066244@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E3066244@mtlexch01.mtl.com> Message-ID: <1128950552.26749.36.camel@delmo.priv.wark.uk.streamline-computing.com> On Mon, 2005-10-10 at 15:22 +0200, Eli Cohen wrote: > David, > IBGD 1.7 does not support kernel 2.6.11 so I assume you have made > changes to IBGD to make it compile. > In the files you sent I can't see a call to ptrace with > PTRACE_PEEKTEXT but I can see a call to PTRACE_PEEKDATA. Note that in > the IBGD stack, registered buffers are not inherited by a child > process when a the parent forks. This is accomplished by setting the > VM_DONTCOPY flag on the vma. It is so done to retain the virtual to > physical translation of a page at the parent by disabling COW on the > pages. So the child may not even have these buffers in its address > space and this could be the reason why ptrace fails. > > Note also that IBGD 1.8 is the latest release and it does support > kernel 2.6.11 so you may consider using it, though the description > above holds also for IBGD 1.8 > > Eli Hi Eli, Thanks for looking at this. Peektext/peekdata are synonymous, at least in Linux (c.f. the man page for ptrace). Do you happen to have a 1.8 based machine you could try the example on for me (please!)? 
Do you have any suggestions for a way to work around this. All debuggers need to be able to read memory locations, and even write to them (for breakpoints) - so it's kind of essential! Regards David From mst at mellanox.co.il Mon Oct 10 06:57:24 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Oct 2005 15:57:24 +0200 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: References: Message-ID: <20051010135723.GT21551@mellanox.co.il> Quoting Sean Hefty : > Subject: [PATCH] [CMA] RDMA CM abstraction module > > The following patch adds in a basic RDMA connection management abstraction. > It is functional, but needs additional work for handling device removal, > plus several missing features. > > I'd like to merge this back into the trunk, and continue working on it > from there. > > This depends on the ib_addr module. > > Signed-off-by: Sean Hefty > > > > Index: include/rdma/rdma_cm.h > =================================================================== > --- include/rdma/rdma_cm.h (revision 0) > +++ include/rdma/rdma_cm.h (revision 0) > @@ -0,0 +1,201 @@ > > [........... snip ...............] > > + > +#if !defined(RDMA_CM_H) > +#define RDMA_CM_H > + > +#include > +#include > +#include > + > +/* > + * Upon receiving a device removal event, users must destroy the > associated > + * RDMA identifier and release all resources allocated with the device. 
> + */
> +enum rdma_event_type {
> +	RDMA_EVENT_ADDR_RESOLVED,
> +	RDMA_EVENT_ADDR_ERROR,
> +	RDMA_EVENT_ROUTE_RESOLVED,
> +	RDMA_EVENT_ROUTE_ERROR,
> +	RDMA_EVENT_CONNECT_REQUEST,
> +	RDMA_EVENT_CONNECT_ERROR,
> +	RDMA_EVENT_UNREACHABLE,
> +	RDMA_EVENT_REJECTED,
> +	RDMA_EVENT_ESTABLISHED,
> +	RDMA_EVENT_DISCONNECTED,
> +	RDMA_EVENT_DEVICE_REMOVAL,
> +};
> +
> +struct rdma_addr {
> +	struct sockaddr src_addr;
> +	struct sockaddr dst_addr;
> +	union {
> +		struct ib_addr ibaddr;
> +	} addr;
> +};
> +
> +struct rdma_route {
> +	struct rdma_addr addr;
> +	struct ib_sa_path_rec *path_rec;
> +	int num_paths;
> +};
> +
> +struct rdma_event {
> +	enum rdma_event_type event;
> +	int status;
> +	void *private_data;
> +	u8 private_data_len;
> +};

Wouldn't it be a good idea to start names with rdma_cm or rdma_cma or something like that? For example, rdma_event_type is a bit confusing since this actually only includes CM events. Similar comments apply to other names.

> +struct rdma_id;

I propose renaming this to rdma_connection or something else more specific than just "id". Makes sense?

> +/**
> + * rdma_event_handler - Callback used to report user events.
> + *
> + * Notes: Users may not call rdma_destroy_id from this callback to destroy
> + * the passed in id, or a corresponding listen id. Returning a
> + * non-zero value from the callback will destroy the corresponding id.
> + */
> +typedef int (*rdma_event_handler)(struct rdma_id *id, struct rdma_event *event);
> +
> +struct rdma_id {
> +	struct ib_device *device;
> +	void *context;
> +	struct ib_qp *qp;
> +	rdma_event_handler event_handler;
> +	struct rdma_route route;
> +};
> +
> +struct rdma_id* rdma_create_id(rdma_event_handler event_handler, void *context);
> +
> +void rdma_destroy_id(struct rdma_id *id);
> +
> +/**
> + * rdma_bind_addr - Bind an RDMA identifier to a source address and
> + * associated RDMA device, if needed.
> + *
> + * @id: RDMA identifier.
> + * @addr: Local address information.
Wildcard values are permitted. > + * > + * This associates a source address with the RDMA identifier before calling > + * rdma_listen. If a specific local address is given, the RDMA identifier will > + * be bound to a local RDMA device. > + */ > +int rdma_bind_addr(struct rdma_id *id, struct sockaddr *addr); > + > +/** > + * rdma_resolve_addr - Resolve destination and optional source addresses > + * from IP addresses to an RDMA address. If successful, the specified > + * rdma_id will be bound to a local device. > + * > + * @id: RDMA identifier. > + * @src_addr: Source address information. This parameter may be NULL. > + * @dst_addr: Destination address information. > + * @timeout_ms: Time to wait for resolution to complete. > + */ > +int rdma_resolve_addr(struct rdma_id *id, struct sockaddr *src_addr, > + struct sockaddr *dst_addr, int timeout_ms); > + > +/** > + * rdma_resolve_route - Resolve the RDMA address bound to the RDMA identifier > + * into route information needed to establish a connection. > + * > + * This is called on the client side of a connection, but its use is optional. > + * Users must have first called rdma_bind_addr to resolve a dst_addr > + * into an RDMA address before calling this routine. > + */ > +int rdma_resolve_route(struct rdma_id *id, int timeout_ms); Not sure I understand what this does, since the only extra parameter is timeout_ms. > +/** > + * rdma_create_qp - Allocate a QP and associate it with the specified RDMA > + * identifier. > + */ > +int rdma_create_qp(struct rdma_id *id, struct ib_pd *pd, > + struct ib_qp_init_attr *qp_init_attr); > + > +/** > + * rdma_destroy_qp - Deallocate the QP associated with the specified RDMA > + * identifier. > + * > + * Users must destroy any QP associated with an RDMA identifier before > + * destroying the RDMA ID. > + */ > +void rdma_destroy_qp(struct rdma_id *id); Not sure what the intended usage is. When does the user need to call this? 
> +struct rdma_conn_param { > + const void *private_data; > + u8 private_data_len; > + u8 responder_resources; > + u8 initiator_depth; > + u8 flow_control; > + u8 retry_count; /* ignored when accepting */ > + u8 rnr_retry_count; > +}; > + > +/** > + * rdma_connect - Initiate an active connection request. > + * > + * Users must have bound the rdma_id to a local device by having called > + * rdma_resolve_addr before calling this routine. Users may also resolve the > + * RDMA address to a route with rdma_resolve_route, but if a route has not > + * been resolved, a default route will be selected. > + * > + * Note that the QP must be in the INIT state. > + */ > +int rdma_connect(struct rdma_id *id, struct rdma_conn_param *conn_param); > + > +/** > + * rdma_listen - This function is called by the passive side to > + * listen for incoming connection requests. > + * > + * Users must have bound the rdma_id to a local address by calling > + * rdma_bind_addr before calling this routine. > + */ > +int rdma_listen(struct rdma_id *id); > + > +/** > + * rdma_accept - Called on the passive side to accept a connection request > + * > + * Note that the QP must be in the INIT state. > + */ > +int rdma_accept(struct rdma_id *id, struct rdma_conn_param > *conn_param); > + > +/** > + * rdma_reject - Called on the passive side to reject a connection request. > + */ > +int rdma_reject(struct rdma_id *id, const void *private_data, > + u8 private_data_len); > + > +/** > + * rdma_disconnect - This function disconnects the associated QP. > + */ > +int rdma_disconnect(struct rdma_id *id); > + > +#endif /* RDMA_CM_H */ > + > Index: core/cma.c > =================================================================== > --- core/cma.c (revision 0) > +++ core/cma.c (revision 0) > @@ -0,0 +1,1207 @@ > + > [ ......... snip .............. ] > > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include Are all of these headers really needed? 
For example, I don't see arp.h used anywhere. Am I missing something?

> +MODULE_AUTHOR("Guy German");
> +MODULE_DESCRIPTION("Generic RDMA CM Agent");
> +MODULE_LICENSE("Dual BSD/GPL");
> +
> +#define CMA_CM_RESPONSE_TIMEOUT 20
> +#define CMA_MAX_CM_RETRIES 3
> +
> +static void cma_add_one(struct ib_device *device);
> +static void cma_remove_one(struct ib_device *device);
> +
> +static struct ib_client cma_client = {
> +	.name = "cma",
> +	.add = cma_add_one,
> +	.remove = cma_remove_one
> +};
> +
> +static DEFINE_SPINLOCK(lock);
> +static LIST_HEAD(dev_list);
> +
> +struct cma_device {
> +	struct list_head list;
> +	struct ib_device *device;
> +	__be64 node_guid;
> +	wait_queue_head_t wait;
> +	atomic_t refcount;
> +	struct list_head id_list;
> +};
> +
> +enum cma_state {
> +	CMA_IDLE,
> +	CMA_ADDR_QUERY,
> +	CMA_ADDR_RESOLVED,
> +	CMA_ROUTE_QUERY,
> +	CMA_ROUTE_RESOLVED,
> +	CMA_CONNECT,
> +	CMA_ADDR_BOUND,
> +	CMA_LISTEN,
> +	CMA_DEVICE_REMOVAL,
> +	CMA_DESTROYING
> +};
> +
> +/*
> + * Device removal can occur at anytime, so we need extra handling to
> + * serialize notifying the user of device removal with other callbacks.
> + * We do this by disabling removal notification while a callback is in process,
> + * and reporting it after the callback completes.
> + */ > +struct rdma_id_private { > + struct rdma_id id; > + > + struct list_head list; > + struct cma_device *cma_dev; > + > + enum cma_state state; > + spinlock_t lock; > + wait_queue_head_t wait; > + atomic_t refcount; > + atomic_t dev_remove; > + > + int timeout_ms; > + struct ib_sa_query *query; > + int query_id; > + struct ib_cm_id *cm_id; > +}; > + > +struct cma_addr { > + u8 version; /* CMA version: 7:4, IP version: 3:0 */ > + u8 reserved; > + __be16 port; > + struct { > + union { > + struct in6_addr ip6; > + struct { > + __be32 pad[3]; > + __be32 addr; > + } ip4; > + } ver; > + } src_addr, dst_addr; > +}; > + > +static int cma_comp(struct rdma_id_private *id_priv, enum cma_state > comp) > +{ > + unsigned long flags; > + int ret; > + > + spin_lock_irqsave(&id_priv->lock, flags); > + ret = (id_priv->state == comp); > + spin_unlock_irqrestore(&id_priv->lock, flags); > + return ret; > +} > + > +static int cma_comp_exch(struct rdma_id_private *id_priv, > + enum cma_state comp, enum cma_state exch) > +{ > + unsigned long flags; > + int ret; > + > + spin_lock_irqsave(&id_priv->lock, flags); > + if ((ret = (id_priv->state == comp))) > + id_priv->state = exch; > + spin_unlock_irqrestore(&id_priv->lock, flags); > + return ret; > +} > + > +static enum cma_state cma_exch(struct rdma_id_private *id_priv, > + enum cma_state exch) > +{ > + unsigned long flags; > + enum cma_state old; > + > + spin_lock_irqsave(&id_priv->lock, flags); > + old = id_priv->state; > + id_priv->state = exch; > + spin_unlock_irqrestore(&id_priv->lock, flags); > + return old; > +} > + > +static inline u8 cma_get_ip_ver(struct cma_addr *addr) > +{ > + return addr->version & 0xF; > +} > + > +static inline u8 cma_get_cma_ver(struct cma_addr *addr) > +{ > + return addr->version >> 4; > +} > + > +static inline void cma_set_vers(struct cma_addr *addr, u8 cma_ver, u8 > ip_ver) > +{ > + addr->version = (cma_ver << 4) + (ip_ver & 0xF); > +} > + > +static int cma_acquire_ib_dev(struct rdma_id_private 
*id_priv, > + union ib_gid *gid) > +{ > + struct cma_device *cma_dev; > + unsigned long flags; > + int ret = -ENODEV; > + u8 port; > + > + spin_lock_irqsave(&lock, flags); > + list_for_each_entry(cma_dev, &dev_list, list) { > + ret = ib_find_cached_gid(cma_dev->device, gid, &port, NULL); > + if (!ret) { > + atomic_inc(&cma_dev->refcount); > + id_priv->cma_dev = cma_dev; > + id_priv->id.device = cma_dev->device; > + list_add_tail(&id_priv->list, &cma_dev->id_list); > + break; > + } > + } > + spin_unlock_irqrestore(&lock, flags); > + return ret; > +} > + > +static void cma_release_dev(struct rdma_id_private *id_priv) > +{ > + unsigned long flags; > + > + spin_lock_irqsave(&lock, flags); > + list_del(&id_priv->list); > + spin_unlock_irqrestore(&lock, flags); > + > + if (atomic_dec_and_test(&id_priv->cma_dev->refcount)) > + wake_up(&id_priv->cma_dev->wait); > +} > + > +static void cma_deref_id(struct rdma_id_private *id_priv) > +{ > + if (atomic_dec_and_test(&id_priv->refcount)) > + wake_up(&id_priv->wait); > +} > + > +struct rdma_id* rdma_create_id(rdma_event_handler event_handler, void > *context) > +{ > + struct rdma_id_private *id_priv; > + > + id_priv = kmalloc(sizeof *id_priv, GFP_KERNEL); > + if (!id_priv) > + return NULL; > + memset(id_priv, 0, sizeof *id_priv); > + > + id_priv->state = CMA_IDLE; > + id_priv->id.context = context; > + id_priv->id.event_handler = event_handler; > + spin_lock_init(&id_priv->lock); > + init_waitqueue_head(&id_priv->wait); > + atomic_set(&id_priv->refcount, 1); > + atomic_set(&id_priv->dev_remove, 1); > + > + return &id_priv->id; > +} > +EXPORT_SYMBOL(rdma_create_id); > + > +static int cma_init_ib_qp(struct rdma_id_private *id_priv, struct ib_qp > *qp) > +{ > + struct ib_qp_attr qp_attr; > + struct ib_sa_path_rec *path_rec; > + int ret; > + > + qp_attr.qp_state = IB_QPS_INIT; > + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; > + > + path_rec = id_priv->id.route.path_rec; > + ret = ib_find_cached_gid(id_priv->id.device, 
&path_rec->sgid,
> +				 &qp_attr.port_num, NULL);
> +	if (ret)
> +		return ret;
> +
> +	ret = ib_find_cached_pkey(id_priv->id.device, qp_attr.port_num,
> +				  id_priv->id.route.addr.addr.ibaddr.pkey,
> +				  &qp_attr.pkey_index);
> +	if (ret)
> +		return ret;
> +
> +	return ib_modify_qp(qp, &qp_attr, IB_QP_STATE | IB_QP_ACCESS_FLAGS |
> +					  IB_QP_PKEY_INDEX | IB_QP_PORT);
> +}
> +
> +int rdma_create_qp(struct rdma_id *id, struct ib_pd *pd,
> +		   struct ib_qp_init_attr *qp_init_attr)
> +{
> +	struct rdma_id_private *id_priv;
> +	struct ib_qp *qp;
> +	int ret;
> +
> +	id_priv = container_of(id, struct rdma_id_private, id);
> +	if (id->device != pd->device)
> +		return -EINVAL;
> +
> +	qp = ib_create_qp(pd, qp_init_attr);
> +	if (IS_ERR(qp))
> +		return PTR_ERR(qp);
> +
> +	switch (id->device->node_type) {
> +	case IB_NODE_CA:
> +		ret = cma_init_ib_qp(id_priv, qp);
> +		break;
> +	default:
> +		ret = -ENOSYS;
> +		break;
> +	}
> +
> +	if (ret)
> +		goto err;
> +
> +	id->qp = qp;
> +	return 0;
> +err:
> +	ib_destroy_qp(qp);
> +	return ret;
> +}
> +EXPORT_SYMBOL(rdma_create_qp);

What about replacing switch statements that have only one case with if statements? Like this:

	if (id->device->node_type == IB_NODE_CA)
		ret = cma_init_ib_qp(id_priv, qp);
	else
		ret = -ENOSYS;

Or even

	ret = id->device->node_type == IB_NODE_CA ?
		cma_init_ib_qp(id_priv, qp) : -ENOSYS;

I also wonder why we really need all these node_type checks. The code above seems to imply that rdma_create_qp will fail on non-CA. Why is that?

> +void rdma_destroy_qp(struct rdma_id *id)
> +{
> +	ib_destroy_qp(id->qp);
> +}
> +EXPORT_SYMBOL(rdma_destroy_qp);
> +
> +static int cma_modify_ib_qp_rtr(struct rdma_id_private *id_priv)
> +{
> +	struct ib_qp_attr qp_attr;
> +	int qp_attr_mask, ret;
> +
> +	/* Need to update QP attributes from default values.
*/ > + qp_attr.qp_state = IB_QPS_INIT; > + ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, > &qp_attr_mask); > + if (ret) > + return ret; > + > + ret = ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); > + if (ret) > + return ret; > + > + qp_attr.qp_state = IB_QPS_RTR; > + ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, > &qp_attr_mask); > + if (ret) > + return ret; > + > + qp_attr.rq_psn = id_priv->id.qp->qp_num; > + return ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); > +} > + > +static int cma_modify_ib_qp_rts(struct rdma_id_private *id_priv) > +{ > + struct ib_qp_attr qp_attr; > + int qp_attr_mask, ret; > + > + qp_attr.qp_state = IB_QPS_RTS; > + ret = ib_cm_init_qp_attr(id_priv->cm_id, &qp_attr, > &qp_attr_mask); > + if (ret) > + return ret; > + > + return ib_modify_qp(id_priv->id.qp, &qp_attr, qp_attr_mask); > +} > + > +static int cma_modify_qp_err(struct rdma_id *id) > +{ > + struct ib_qp_attr qp_attr; > + > + qp_attr.qp_state = IB_QPS_ERR; > + return ib_modify_qp(id->qp, &qp_attr, IB_QP_STATE); > +} > + > +static int cma_verify_addr(struct cma_addr *addr, > + struct sockaddr_in *ip_addr) > +{ > + if (cma_get_cma_ver(addr) != 1 || cma_get_ip_ver(addr) != 4) > + return -EINVAL; > + > + if (ip_addr->sin_port != be16_to_cpu(addr->port)) > + return -EINVAL; > + > + if (ip_addr->sin_addr.s_addr && > + (ip_addr->sin_addr.s_addr != be32_to_cpu(addr->dst_addr. 
> + ver.ip4.addr))) > + return -EINVAL; > + > + return 0; > +} > + > +static int cma_notify_user(struct rdma_id_private *id_priv, > + enum rdma_event_type type, int status, > + void *data, u8 data_len) > +{ > + struct rdma_event event; > + > + event.event = type; > + event.status = status; > + event.private_data = data; > + event.private_data_len = data_len; > + > + return id_priv->id.event_handler(&id_priv->id, &event); > +} > + > +static inline void cma_disable_dev_remove(struct rdma_id_private > *id_priv) > +{ > + atomic_inc(&id_priv->dev_remove); > +} > + > +static inline void cma_deref_dev(struct rdma_id_private *id_priv) > +{ > +// if (atomic_dec_and_test(&id_priv->dev_remove)) > +// wake_up(&id_priv->wait); > +// return atomic_dec_and_test(&id_priv->dev_remove) ? > +// cma_notify_user(id_priv, RDMA_EVENT_DEVICE_REMOVAL, -ENODEV, > +// NULL, 0) : 0; > +} The above seems to need some cleanup. Some of the comments above apply to the patch as a whole, so I'm preserving the rest of it here for reference. I have no further comments below.
Thanks, MST ---------------------------------------------- > +static void cma_cancel_addr(struct rdma_id_private *id_priv) > +{ > + switch (id_priv->id.device->node_type) { > + case IB_NODE_CA: > + ib_addr_cancel(&id_priv->id.route.addr.addr.ibaddr); > + break; > + default: > + break; > + } > +} > + > +static void cma_cancel_route(struct rdma_id_private *id_priv) > +{ > + switch (id_priv->id.device->node_type) { > + case IB_NODE_CA: > + ib_sa_cancel_query(id_priv->query_id, id_priv->query); > + break; > + default: > + break; > + } > +} > + > +static void cma_cancel_operation(struct rdma_id_private *id_priv, > + enum cma_state state) > +{ > + switch (state) { > + case CMA_ADDR_QUERY: > + cma_cancel_addr(id_priv); > + break; > + case CMA_ROUTE_QUERY: > + cma_cancel_route(id_priv); > + break; > + default: > + break; > + } > +} > + > +static void cma_free_id(struct rdma_id_private *id_priv) > +{ > + if (id_priv->cma_dev) { > + switch (id_priv->id.device->node_type) { > + case IB_NODE_CA: > + if (id_priv->cm_id && !IS_ERR(id_priv->cm_id)) > + ib_destroy_cm_id(id_priv->cm_id); > + break; > + default: > + break; > + } > + cma_release_dev(id_priv); > + } > + > + atomic_dec(&id_priv->refcount); > + wait_event(id_priv->wait, !atomic_read(&id_priv->refcount)); > + > + kfree(id_priv->id.route.path_rec); > + kfree(id_priv); > +} > + > +void rdma_destroy_id(struct rdma_id *id) > +{ > + struct rdma_id_private *id_priv; > + enum cma_state state; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + > + state = cma_exch(id_priv, CMA_DESTROYING); > + cma_cancel_operation(id_priv, state); > + cma_free_id(id_priv); > +} > +EXPORT_SYMBOL(rdma_destroy_id); > + > +static int cma_rep_recv(struct rdma_id_private *id_priv) > +{ > + int ret; > + > + ret = cma_modify_ib_qp_rtr(id_priv); > + if (ret) > + goto reject; > + > + ret = cma_modify_ib_qp_rts(id_priv); > + if (ret) > + goto reject; > + > + ret = ib_send_cm_rtu(id_priv->cm_id, NULL, 0); > + if (ret) > + goto reject; > + > 
+ return 0; > +reject: > + cma_modify_qp_err(&id_priv->id); > + ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, > + NULL, 0, NULL, 0); > + return ret; > +} > + > +static int cma_rtu_recv(struct rdma_id_private *id_priv) > +{ > + int ret; > + > + ret = cma_modify_ib_qp_rts(id_priv); > + if (ret) > + goto reject; > + > + return 0; > +reject: > + cma_modify_qp_err(&id_priv->id); > + ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, > + NULL, 0, NULL, 0); > + return ret; > +} > + > +static int cma_ib_handler(struct ib_cm_id *cm_id, struct ib_cm_event > *ib_event) > +{ > + struct rdma_id_private *id_priv = cm_id->context; > + enum rdma_event_type event; > + u8 private_data_len = 0; > + int ret = 0, status = 0; > + > + if (!cma_comp(id_priv, CMA_CONNECT)) > + return 0; > + > + switch (ib_event->event) { > + case IB_CM_REQ_ERROR: > + case IB_CM_REP_ERROR: > + event = RDMA_EVENT_UNREACHABLE; > + status = -ETIMEDOUT; > + break; > + case IB_CM_REP_RECEIVED: > + status = cma_rep_recv(id_priv); > + event = status ? RDMA_EVENT_CONNECT_ERROR : > + RDMA_EVENT_ESTABLISHED; > + private_data_len = IB_CM_REP_PRIVATE_DATA_SIZE; > + break; > + case IB_CM_RTU_RECEIVED: > + status = cma_rtu_recv(id_priv); > + event = status ? 
RDMA_EVENT_CONNECT_ERROR : > + RDMA_EVENT_ESTABLISHED; > + break; > + case IB_CM_DREQ_ERROR: > + status = -ETIMEDOUT; /* fall through */ > + case IB_CM_DREQ_RECEIVED: > + case IB_CM_DREP_RECEIVED: > + event = RDMA_EVENT_DISCONNECTED; > + break; > + case IB_CM_TIMEWAIT_EXIT: > + case IB_CM_MRA_RECEIVED: > + /* ignore event */ > + goto out; > + case IB_CM_REJ_RECEIVED: > + cma_modify_qp_err(&id_priv->id); > + status = ib_event->param.rej_rcvd.reason; > + event = RDMA_EVENT_REJECTED; > + break; > + default: > + printk(KERN_ERR "RDMA CMA: unexpected IB CM event: %d", > + ib_event->event); > + goto out; > + } > + > + ret = cma_notify_user(id_priv, event, status, > ib_event->private_data, > + private_data_len); > + if (ret) { > + /* Destroy the CM ID by returning a non-zero value. */ > + id_priv->cm_id = NULL; > + rdma_destroy_id(&id_priv->id); > + } > +out: > + return ret; > +} > + > +static struct rdma_id_private* cma_new_id(struct rdma_id *listen_id, > + struct ib_cm_event *ib_event) > +{ > + struct rdma_id_private *id_priv; > + struct rdma_id *id; > + struct rdma_route *route; > + struct sockaddr_in *ip_addr; > + struct ib_sa_path_rec *path_rec; > + struct cma_addr *addr; > + int num_paths; > + > + ip_addr = (struct sockaddr_in *) &listen_id->route.addr.src_addr; > + if (cma_verify_addr(ib_event->private_data, ip_addr)) > + return NULL; > + > + num_paths = 1 + (ib_event->param.req_rcvd.alternate_path != NULL); > + path_rec = kmalloc(sizeof *path_rec * num_paths, GFP_KERNEL); > + if (!path_rec) > + return NULL; > + > + id = rdma_create_id(listen_id->event_handler, listen_id->context); > + if (!id) > + goto err; > + > + route = &id->route; > + route->addr.src_addr = listen_id->route.addr.src_addr; > + route->addr.dst_addr.sa_family = ip_addr->sin_family; > + > + ip_addr = (struct sockaddr_in *) &route->addr.dst_addr; > + addr = ib_event->private_data; > + ip_addr->sin_addr.s_addr = be32_to_cpu(addr->src_addr.ver.ip4.addr); > + > + route->num_paths = num_paths; > + 
route->path_rec = path_rec; > + path_rec[0] = *ib_event->param.req_rcvd.primary_path; > + if (num_paths == 2) > + path_rec[1] = *ib_event->param.req_rcvd.alternate_path; > + > + route->addr.addr.ibaddr.sgid = path_rec->dgid; > + route->addr.addr.ibaddr.dgid = path_rec->sgid; > + route->addr.addr.ibaddr.pkey = be16_to_cpu(path_rec->pkey); > + > + id_priv = container_of(id, struct rdma_id_private, id); > + id_priv->state = CMA_CONNECT; > + return id_priv; > +err: > + kfree(path_rec); > + return NULL; > +} > + > +static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event > *ib_event) > +{ > + struct rdma_id_private *listen_id, *conn_id; > + int offset, ret; > + > + listen_id = cm_id->context; > + conn_id = cma_new_id(&listen_id->id, ib_event); > + if (!conn_id) > + return -ENOMEM; > + > + ret = cma_acquire_ib_dev(conn_id, &conn_id->id.route.path_rec[0].sgid); > + if (ret) { > + ret = -ENODEV; > + goto err; > + } > + > + conn_id->cm_id = cm_id; > + cm_id->context = conn_id; > + cm_id->cm_handler = cma_ib_handler; > + conn_id->state = CMA_CONNECT; > + > + offset = sizeof(struct cma_addr); > + ret = cma_notify_user(conn_id, RDMA_EVENT_CONNECT_REQUEST, 0, > + ib_event->private_data + offset, > + IB_CM_REQ_PRIVATE_DATA_SIZE - offset); > + if (ret) { > + /* Destroy the CM ID by returning a non-zero value. 
*/ > + conn_id->cm_id = NULL; > + rdma_destroy_id(&conn_id->id); > + } > + return ret; > +err: > + rdma_destroy_id(&conn_id->id); > + return ret; > +} > + > +static __be64 cma_get_service_id(struct sockaddr *addr) > +{ > + return cpu_to_be64(((u64)IB_OPENIB_OUI << 48) + > + ((struct sockaddr_in *) addr)->sin_port); > +} > + > +static int cma_ib_listen(struct rdma_id_private *id_priv) > +{ > + __be64 svc_id; > + int ret; > + > + id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_req_handler, > + id_priv); > + if (IS_ERR(id_priv->cm_id)) > + return PTR_ERR(id_priv->cm_id); > + > + svc_id = cma_get_service_id(&id_priv->id.route.addr.src_addr); > + ret = ib_cm_listen(id_priv->cm_id, svc_id, 0); > + if (ret) > + ib_destroy_cm_id(id_priv->cm_id); > + > + return ret; > +} > + > +int rdma_listen(struct rdma_id *id) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN)) > + return -EINVAL; > + > + /* TODO: handle listen across multiple devices */ > + if (!id->device) { > + ret = -ENOSYS; > + goto err; > + } > + > + switch (id->device->node_type) { > + case IB_NODE_CA: > + ret = cma_ib_listen(id_priv); > + break; > + default: > + ret = -ENOSYS; > + break; > + } > + if (ret) > + goto err; > + > + return 0; > +err: > + cma_comp_exch(id_priv, CMA_LISTEN, CMA_ADDR_BOUND); > + return ret; > +}; > +EXPORT_SYMBOL(rdma_listen); > + > +static void cma_query_handler(int status, struct ib_sa_path_rec > *path_rec, > + void *context) > +{ > + struct rdma_id_private *id_priv = context; > + struct rdma_route *route = &id_priv->id.route; > + enum rdma_event_type event = RDMA_EVENT_ROUTE_RESOLVED; > + > + if (!status) { > + route->path_rec = kmalloc(sizeof *route->path_rec, GFP_KERNEL); > + if (route->path_rec) { > + route->num_paths = 1; > + *route->path_rec = *path_rec; > + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, > + CMA_ROUTE_RESOLVED)) > { > + 
kfree(route->path_rec); > + goto out; > + } > + } else > + status = -ENOMEM; > + } > + > + if (status) { > + if (!cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED)) > + goto out; > + event = RDMA_EVENT_ROUTE_ERROR; > + } > + > + if (cma_notify_user(id_priv, event, status, NULL, 0)) { > + cma_deref_id(id_priv); > + rdma_destroy_id(&id_priv->id); > + return; > + } > +out: > + cma_deref_id(id_priv); > +} > + > +static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int > timeout_ms) > +{ > + struct ib_addr *addr = &id_priv->id.route.addr.addr.ibaddr; > + struct ib_sa_path_rec path_rec; > + int ret; > + u8 port; > + > + ret = ib_find_cached_gid(id_priv->id.device, &addr->sgid, &port, NULL); > + if (ret) > + return -ENODEV; > + > + memset(&path_rec, 0, sizeof path_rec); > + path_rec.sgid = addr->sgid; > + path_rec.dgid = addr->dgid; > + path_rec.pkey = addr->pkey; > + path_rec.numb_path = 1; > + > + id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, > + port, &path_rec, > + IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | > + IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, > + timeout_ms, GFP_KERNEL, > + cma_query_handler, id_priv, &id_priv->query); > + > + return (id_priv->query_id < 0) ? 
id_priv->query_id : 0; > +} > + > +int rdma_resolve_route(struct rdma_id *id, int timeout_ms) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp_exch(id_priv, CMA_ADDR_RESOLVED, CMA_ROUTE_QUERY)) > + return -EINVAL; > + > + atomic_inc(&id_priv->refcount); > + switch (id->device->node_type) { > + case IB_NODE_CA: > + ret = cma_resolve_ib_route(id_priv, timeout_ms); > + break; > + default: > + ret = -ENOSYS; > + break; > + } > + if (ret) > + goto err; > + > + return 0; > +err: > + cma_comp_exch(id_priv, CMA_ROUTE_QUERY, CMA_ADDR_RESOLVED); > + cma_deref_id(id_priv); > + return ret; > +} > +EXPORT_SYMBOL(rdma_resolve_route); > + > +static void addr_handler(int status, struct sockaddr *src_addr, > + struct ib_addr *ibaddr, void *context) > +{ > + struct rdma_id_private *id_priv = context; > + enum rdma_event_type event; > + > + if (!status) > + status = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); > + > + if (status) { > + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_IDLE)) > + goto out; > + event = RDMA_EVENT_ADDR_ERROR; > + } else { > + if (!cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_ADDR_RESOLVED)) > + goto out; > + id_priv->id.route.addr.src_addr = *src_addr; > + event = RDMA_EVENT_ADDR_RESOLVED; > + } > + > + if (cma_notify_user(id_priv, event, status, NULL, 0)) { > + cma_deref_id(id_priv); > + rdma_destroy_id(&id_priv->id); > + return; > + } > +out: > + cma_deref_id(id_priv); > +} > + > +int rdma_resolve_addr(struct rdma_id *id, struct sockaddr *src_addr, > + struct sockaddr *dst_addr, int timeout_ms) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_QUERY)) > + return -EINVAL; > + > + atomic_inc(&id_priv->refcount); > + id->route.addr.dst_addr = *dst_addr; > + ret = ib_resolve_addr(src_addr, dst_addr, > &id->route.addr.addr.ibaddr, > + timeout_ms, 
addr_handler, id_priv); > + if (ret) > + goto err; > + > + return 0; > +err: > + cma_comp_exch(id_priv, CMA_ADDR_QUERY, CMA_IDLE); > + cma_deref_id(id_priv); > + return ret; > +} > +EXPORT_SYMBOL(rdma_resolve_addr); > + > +int rdma_bind_addr(struct rdma_id *id, struct sockaddr *addr) > +{ > + struct rdma_id_private *id_priv; > + struct sockaddr_in *ip_addr = (struct sockaddr_in *) addr; > + struct ib_addr *ibaddr = &id->route.addr.addr.ibaddr; > + int ret; > + > + if (addr->sa_family != AF_INET) > + return -EINVAL; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_BOUND)) > + return -EINVAL; > + > + if (ip_addr->sin_addr.s_addr) { > + ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey); > + if (!ret) > + ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid); > + } else > + ret = -ENOSYS; /* TODO: support wild card addresses */ > + > + if (ret) > + goto err; > + > + id->route.addr.src_addr = *addr; > + return 0; > +err: > + cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_IDLE); > + return ret; > +} > +EXPORT_SYMBOL(rdma_bind_addr); > + > +static void cma_format_addr(struct cma_addr *addr, struct rdma_route > *route) > +{ > + struct sockaddr_in *ip_addr; > + > + memset(addr, 0, sizeof *addr); > + cma_set_vers(addr, 1, 4); > + > + ip_addr = (struct sockaddr_in *) &route->addr.src_addr; > + addr->src_addr.ver.ip4.addr = cpu_to_be32(ip_addr->sin_addr.s_addr); > + > + ip_addr = (struct sockaddr_in *) &route->addr.dst_addr; > + addr->dst_addr.ver.ip4.addr = cpu_to_be32(ip_addr->sin_addr.s_addr); > + addr->port = cpu_to_be16(ip_addr->sin_port); > +} > + > +static int cma_connect_ib(struct rdma_id_private *id_priv, > + struct rdma_conn_param *conn_param) > +{ > + struct ib_cm_req_param req; > + struct rdma_route *route; > + struct cma_addr *addr; > + void *private_data; > + int ret; > + > + memset(&req, 0, sizeof req); > + req.private_data_len = sizeof *addr + conn_param->private_data_len; > + > + private_data = 
kmalloc(req.private_data_len, GFP_ATOMIC); > + if (!private_data) > + return -ENOMEM; > + > + id_priv->cm_id = ib_create_cm_id(id_priv->id.device, cma_ib_handler, > + id_priv); > + if (IS_ERR(id_priv->cm_id)) { > + ret = PTR_ERR(id_priv->cm_id); > + goto out; > + } > + > + addr = private_data; > + route = &id_priv->id.route; > + cma_format_addr(addr, route); > + > + if (conn_param->private_data && conn_param->private_data_len) > + memcpy(addr + 1, conn_param->private_data, > + conn_param->private_data_len); > + req.private_data = private_data; > + > + req.primary_path = &route->path_rec[0]; > + if (route->num_paths == 2) > + req.alternate_path = &route->path_rec[1]; > + > + req.service_id = cma_get_service_id(&route->addr.dst_addr); > + req.qp_num = id_priv->id.qp->qp_num; > + req.qp_type = IB_QPT_RC; > + req.starting_psn = req.qp_num; > + req.responder_resources = conn_param->responder_resources; > + req.initiator_depth = conn_param->initiator_depth; > + req.flow_control = conn_param->flow_control; > + req.retry_count = conn_param->retry_count; > + req.rnr_retry_count = conn_param->rnr_retry_count; > + req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; > + req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; > + req.max_cm_retries = CMA_MAX_CM_RETRIES; > + req.srq = id_priv->id.qp->srq ? 
1 : 0; > + > + ret = ib_send_cm_req(id_priv->cm_id, &req); > +out: > + kfree(private_data); > + return ret; > +} > + > +int rdma_connect(struct rdma_id *id, struct rdma_conn_param > *conn_param) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp_exch(id_priv, CMA_ROUTE_RESOLVED, CMA_CONNECT)) > + return -EINVAL; > + > + switch (id->device->node_type) { > + case IB_NODE_CA: > + ret = cma_connect_ib(id_priv, conn_param); > + break; > + default: > + ret = -ENOSYS; > + break; > + } > + if (ret) > + goto err; > + > + return 0; > +err: > + cma_comp_exch(id_priv, CMA_CONNECT, CMA_ROUTE_RESOLVED); > + return ret; > +} > +EXPORT_SYMBOL(rdma_connect); > + > +static int cma_accept_ib(struct rdma_id_private *id_priv, > + struct rdma_conn_param *conn_param) > +{ > + struct ib_cm_rep_param rep; > + int ret; > + > + ret = cma_modify_ib_qp_rtr(id_priv); > + if (ret) > + return ret; > + > + memset(&rep, 0, sizeof rep); > + rep.qp_num = id_priv->id.qp->qp_num; > + rep.starting_psn = rep.qp_num; > + rep.private_data = conn_param->private_data; > + rep.private_data_len = conn_param->private_data_len; > + rep.responder_resources = conn_param->responder_resources; > + rep.initiator_depth = conn_param->initiator_depth; > + rep.target_ack_delay = CMA_CM_RESPONSE_TIMEOUT; > + rep.failover_accepted = 0; > + rep.flow_control = conn_param->flow_control; > + rep.rnr_retry_count = conn_param->rnr_retry_count; > + rep.srq = id_priv->id.qp->srq ? 
1 : 0; > + > + return ib_send_cm_rep(id_priv->cm_id, &rep); > +} > + > +int rdma_accept(struct rdma_id *id, struct rdma_conn_param *conn_param) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp(id_priv, CMA_CONNECT)) > + return -EINVAL; > + > + switch (id->device->node_type) { > + case IB_NODE_CA: > + ret = cma_accept_ib(id_priv, conn_param); > + break; > + default: > + ret = -ENOSYS; > + break; > + } > + > + if (ret) > + goto reject; > + > + return 0; > +reject: > + cma_modify_qp_err(id); > + rdma_reject(id, NULL, 0); > + return ret; > +} > +EXPORT_SYMBOL(rdma_accept); > + > +int rdma_reject(struct rdma_id *id, const void *private_data, > + u8 private_data_len) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp(id_priv, CMA_CONNECT)) > + return -EINVAL; > + > + switch (id->device->node_type) { > + case IB_NODE_CA: > + ret = ib_send_cm_rej(id_priv->cm_id, IB_CM_REJ_CONSUMER_DEFINED, > + NULL, 0, private_data, > private_data_len); > + break; > + default: > + ret = -ENOSYS; > + break; > + } > + return ret; > +}; > +EXPORT_SYMBOL(rdma_reject); > + > +int rdma_disconnect(struct rdma_id *id) > +{ > + struct rdma_id_private *id_priv; > + int ret; > + > + id_priv = container_of(id, struct rdma_id_private, id); > + if (!cma_comp(id_priv, CMA_CONNECT)) > + return -EINVAL; > + > + ret = cma_modify_qp_err(id); > + if (ret) > + goto out; > + > + switch (id->device->node_type) { > + case IB_NODE_CA: > + /* Initiate or respond to a disconnect. 
*/ > + if (ib_send_cm_dreq(id_priv->cm_id, NULL, 0)) > + ib_send_cm_drep(id_priv->cm_id, NULL, 0); > + break; > + default: > + break; > + } > +out: > + return ret; > +} > +EXPORT_SYMBOL(rdma_disconnect); > + > +/* TODO: add this to the device structure - see Roland's patch */ > +static __be64 get_ca_guid(struct ib_device *device) > +{ > + struct ib_device_attr *device_attr; > + __be64 guid; > + int ret; > + > + device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL); > + if (!device_attr) > + return 0; > + > + ret = ib_query_device(device, device_attr); > + guid = ret ? 0 : device_attr->node_guid; > + kfree(device_attr); > + return guid; > +} > + > +static void cma_add_one(struct ib_device *device) > +{ > + struct cma_device *cma_dev; > + unsigned long flags; > + > + cma_dev = kmalloc(sizeof *cma_dev, GFP_KERNEL); > + if (!cma_dev) > + return; > + > + cma_dev->device = device; > + cma_dev->node_guid = get_ca_guid(device); > + if (!cma_dev->node_guid) > + goto err; > + > + init_waitqueue_head(&cma_dev->wait); > + atomic_set(&cma_dev->refcount, 1); > + INIT_LIST_HEAD(&cma_dev->id_list); > + ib_set_client_data(device, &cma_client, cma_dev); > + > + spin_lock_irqsave(&lock, flags); > + list_add_tail(&cma_dev->list, &dev_list); > + spin_unlock_irqrestore(&lock, flags); > + return; > +err: > + kfree(cma_dev); > +} > + > +static int cma_remove_id_dev(struct rdma_id_private *id_priv) > +{ > + enum cma_state state; > + > + /* Record that we want to remove the device */ > + state = cma_exch(id_priv, CMA_DEVICE_REMOVAL); > + if (state == CMA_DESTROYING) > + return 0; > + > + /* TODO: wait until safe to process removal. */ > + > + /* Check for destruction from another callback. 
*/ > + if (!cma_comp(id_priv, CMA_DEVICE_REMOVAL)) > + return 0; > + > + return cma_notify_user(id_priv, RDMA_EVENT_DEVICE_REMOVAL, 0, > NULL, 0); > +} > + > +static void cma_process_remove(struct cma_device *cma_dev) > +{ > + struct list_head remove_list; > + struct rdma_id_private *id_priv; > + unsigned long flags; > + int ret; > + > + INIT_LIST_HEAD(&remove_list); > + > + spin_lock_irqsave(&lock, flags); > + while (!list_empty(&cma_dev->id_list)) { > + id_priv = list_entry(cma_dev->id_list.next, > + struct rdma_id_private, list); > + list_del(&id_priv->list); > + list_add_tail(&id_priv->list, &remove_list); > + atomic_inc(&id_priv->refcount); > + spin_unlock_irqrestore(&lock, flags); > + > + ret = cma_remove_id_dev(id_priv); > + cma_deref_id(id_priv); > + if (ret) > + rdma_destroy_id(&id_priv->id); > + > + spin_lock_irqsave(&lock, flags); > + } > + spin_unlock_irqrestore(&lock, flags); > + > + atomic_dec(&cma_dev->refcount); > + wait_event(cma_dev->wait, !atomic_read(&cma_dev->refcount)); > +} > + > +static void cma_remove_one(struct ib_device *device) > +{ > + struct cma_device *cma_dev; > + unsigned long flags; > + > + cma_dev = ib_get_client_data(device, &cma_client); > + if (!cma_dev) > + return; > + > + spin_lock_irqsave(&lock, flags); > + list_del(&cma_dev->list); > + spin_unlock_irqrestore(&lock, flags); > + > + cma_process_remove(cma_dev); > + kfree(cma_dev); > +} > + > +static int cma_init(void) > +{ > + return ib_register_client(&cma_client); > +} > + > +static void cma_cleanup(void) > +{ > + ib_unregister_client(&cma_client); > +} > + > +module_init(cma_init); > +module_exit(cma_cleanup); -- MST From mst at mellanox.co.il Mon Oct 10 06:58:00 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 10 Oct 2005 15:58:00 +0200 Subject: [openib-general] Re: Linux 2.6.13 Kernel Support Question In-Reply-To: <43457E64.1010406@dbresearch.net> References: <43457E64.1010406@dbresearch.net> Message-ID: <20051010135800.GU21551@mellanox.co.il> Quoting Sean Hubbell : > Subject: Linux 2.6.13 Kernel Support Question > > Hello, > > Will openib still supply patches to the 2.6.13 Kernel or do I need to > upgrade my kernel to 2.6.14? > > Thanks, > > Sean Hubbell As Roland commented, once 2.6.14 is out the trunk will target it. I keep patches to make trunk compile on older kernels under https://openib.org/svn/gen2/branches/backport/ It's usually an uncomplicated exercise to add support for more kernels, so I usually do it a couple of days after trunk switches to newer kernels, but one has to keep in mind that testing is another matter. Here at Mellanox people are testing against kernels that come with popular distributions, so we are currently testing 2.6.9 on RHEL4, 2.6.11_FC4 (which is between 2.6.11 and 2.6.12) on FC4 and 2.6.11 on SuSE Pro 9.3. Whether 2.6.13 will be tested at Mellanox depends on whether there is/will be a distribution tested here that will include this kernel revision. -- MST From halr at voltaire.com Mon Oct 10 07:03:46 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Oct 2005 10:03:46 -0400 Subject: [openib-general] Re: [PATCH] Opensm - handling immediate error in vendor_send In-Reply-To: <5zu0frvszk.fsf@mtl066.yok.mtl.com> References: <5zu0frvszk.fsf@mtl066.yok.mtl.com> Message-ID: <1128953025.4377.72.camel@hal.voltaire.com> Hi Yael, On Sun, 2005-10-09 at 07:18, Yael Kalka wrote: > During our tests on Windows we encountered an issue that is caused by > some problem in the lower layer, but causes a problem in the opensm.
> If the osm_vendor_send call fails immediately, we need to update > several counters (currently, only qp0_mads_sent is decremented), and > also call the dispatcher, if we reached qp0_mads_outstanding == 0 (in > order to signal the state mgr). > What we saw was that these counters weren't decremented, and thus the > state mgr wasn't signalled, and the opensm didn't proceed in > traversing through its stages. > The following patch updates the relevant counters, and calls the > dispatcher, if necessary. Is there a similar issue with QP1 as well ? Also, in general, atomic_inc and atomic_dec deal with int32 quantities. There is potential danger if they wrap from positive to negative or vice versa. I don't think there is any code which deals with this. I have some comments and questions on this patch embedded below. -- Hal > > Thanks, > Yael > > Signed-off-by: Yael Kalka > Index: opensm/osm_vl15intf.c > =================================================================== > --- opensm/osm_vl15intf.c (revision 3703) > +++ opensm/osm_vl15intf.c (working copy) > @@ -157,6 +157,8 @@ __osm_vl15_poller( > > if( status != IB_SUCCESS ) > { > + uint32_t outstanding; > + cl_status_t cl_status; > osm_log( p_vl->p_log, OSM_LOG_ERROR, > "__osm_vl15_poller: ERR 3E03: " > "MAD send failed (%s).\n", > @@ -166,7 +168,64 @@ __osm_vl15_poller( > The MAD was never successfully sent, so > fix up the pre-incremented count values. > */ > + /* Decrement qp0_mads_sent and qp0_mads_outstanding_on_wire > + that was incremented in the code above. */ > mads_sent = cl_atomic_dec( &p_vl->p_stats->qp0_mads_sent ); > + if( p_madw->resp_expected == TRUE ) > + if ( !&p_vl->p_stats->qp0_mads_outstanding_on_wire ) Should this be !&p_vl->p_stats->qp0_mads_outstanding_on_wire or just !p_vl->p_stats->qp0_mads_outstanding_on_wire ?
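The difference between the two spellings can be seen in a minimal standalone sketch. It assumes a hypothetical `struct stats` with a single counter field rather than the real OpenSM types, and the helper names are made up for illustration. Taking the address of a member of a valid object never gives a null pointer, so the `!&...` form is always false, while the plain `!...` form tests the counter value itself.

```c
#include <stdint.h>

/* Hypothetical stand-in for the OpenSM statistics structure. */
struct stats {
    uint32_t qp0_mads_outstanding_on_wire;
};

/*
 * Same meaning as !&p->qp0_mads_outstanding_on_wire: this tests the
 * address of the member, which is never null for a valid p, so the
 * result is always 0 regardless of the counter value.
 */
static int counter_address_is_null(struct stats *p)
{
    uint32_t *addr = &p->qp0_mads_outstanding_on_wire;
    return !addr;
}

/* Same meaning as !p->qp0_mads_outstanding_on_wire: tests the value. */
static int counter_value_is_zero(const struct stats *p)
{
    return !p->qp0_mads_outstanding_on_wire;
}
```

With the counter at zero, only the value test reports it, so an error branch guarded by the address form would never be taken. This sketch says nothing about locking or atomicity of the read.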
If it is the latter (testing the counter value), should there be locking around it like: CL_PLOCK_ACQUIRE( p_ctrl->p_lock ); outstanding = p_ctrl->p_stats->qp0_mads_outstanding; CL_PLOCK_RELEASE( p_ctrl->p_lock ); Also, this appears to be debug code (not in other places) ? Why is it needed here ? > + osm_log( p_vl->p_log, OSM_LOG_ERROR, > + "__osm_vl15_poller: ERR 3E04: " > + "Trying to dec qp0_mads_outstanding_on_wire=0. " > + "Problem with transaction mgr!\n"); In this case, outstanding is not initialized, so what is supposed to occur below when outstanding is checked against 0 ? (Should it be initialized to 0 ? Do extra signals to the state manager (for NO_PENDING_TRANSACTIONS) cause the wrong thing to occur ?). > + else > + cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding_on_wire ); > + > + /* The following code is similar to the one in > + __osm_sm_mad_ctrl_retire_trans_mad. We need to decrement the > + qp0_mads_outstanding counter, and if we reached 0 - need to call > + the cl_disp_post with OSM_SIGNAL_NO_PENDING_TRANSACTION (in order > + to wake up the state mgr). */ > + if ( !&p_vl->p_stats->qp0_mads_outstanding ) > + osm_log( p_vl->p_log, OSM_LOG_ERROR, > + "__osm_vl15_poller: ERR 3E05: " > + "Trying to dec qp0_mads_outstanding=0. " > + "Problem with transaction mgr!\n"); > + else > + outstanding = cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); > + > + osm_log( p_vl->p_log, OSM_LOG_DEBUG, > + "__osm_vl15_poller: " > + "%u(%u) QP0 MADs outstanding.\n", > + p_vl->p_stats->qp0_mads_outstanding,outstanding ); Should the following precede this DEBUG call to osm_log: if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) ) > + if( outstanding == 0 ) > + { > + /* > + The wire is clean. > + Signal the state manager.
> + */ > + if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) ) > + { > + osm_log( p_vl->p_log, OSM_LOG_DEBUG, > + "__osm_vl15_poller: " > + "Posting Dispatcher message %s.\n", > + osm_get_disp_msg_str( OSM_MSG_NO_SMPS_OUTSTANDING ) ); > + } > + > + cl_status = cl_disp_post( p_vl->h_disp, > + OSM_MSG_NO_SMPS_OUTSTANDING, > + (void *)OSM_SIGNAL_NO_PENDING_TRANSACTIONS, > + NULL, > + NULL ); > + > + if( cl_status != CL_SUCCESS ) > + { > + osm_log( p_vl->p_log, OSM_LOG_ERROR, > + "__osm_vl15_poller: ERR 3E06: " > + "Dispatcher post message failed (%s).\n", > + CL_STATUS_MSG( cl_status ) ); > + } > + } > } > else > { Also, the formatting has extra whitespace. (I fixed this by hand). -- Hal From rolandd at cisco.com Mon Oct 10 07:26:01 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 07:26:01 -0700 Subject: [openib-general] Re: [PATCH] mthca: when creating a cq, check that requested cqes does not exceed HCA max References: <52fyribmtc.fsf@cisco.com> <20051009084455.GA24993@mellanox.co.il> Message-ID: <52zmphih3a.fsf@cisco.com> Thanks, I extended this even further -- we might as well do similar checking for QPs and SRQs while we're at it. How does this seem? - R. 
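For the CQ case, the checking described above can be sketched in isolation roughly as follows. This is a simplified illustration under stated assumptions, not the mthca code: `struct dev_limits`, `init_limits` and `check_cq_entries` are hypothetical names, and the only behaviour taken from the patch is that one CQE is held back from the device maximum and that out-of-range requests get -EINVAL.

```c
#include <errno.h>

/* Hypothetical, simplified limits record; names are illustrative only. */
struct dev_limits {
    int max_cq_sz;  /* raw CQ size reported by the device */
    int max_cqes;   /* largest CQE count a consumer may request */
};

/*
 * One CQE is held back so that the hardware can tell a full CQ
 * from an empty one.
 */
static void init_limits(struct dev_limits *lim, int max_cq_sz)
{
    lim->max_cq_sz = max_cq_sz;
    lim->max_cqes = max_cq_sz - 1;
}

/* Returns 0 if the request fits within the limit, -EINVAL otherwise. */
static int check_cq_entries(const struct dev_limits *lim, int entries)
{
    if (entries < 1 || entries > lim->max_cqes)
        return -EINVAL;
    return 0;
}
```

With a device maximum of 1024, a request for 1023 entries passes while requests for 0 or 1024 are rejected. The QP and SRQ checks follow the same pattern against their own limits.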
--- linux-kernel/infiniband/hw/mthca/mthca_dev.h (revision 3704) +++ linux-kernel/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -128,12 +128,15 @@ struct mthca_limits { int num_uars; int max_sg; int num_qps; + int max_wqes; int reserved_qps; int num_srqs; + int max_srq_wqes; int reserved_srqs; int num_eecs; int reserved_eecs; int num_cqs; + int max_cqes; int reserved_cqs; int num_eqs; int reserved_eqs; --- linux-kernel/infiniband/hw/mthca/mthca_main.c (revision 3704) +++ linux-kernel/infiniband/hw/mthca/mthca_main.c (working copy) @@ -162,9 +162,17 @@ static int __devinit mthca_dev_lim(struc mdev->limits.pkey_table_len = dev_lim->max_pkeys; mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay; mdev->limits.max_sg = dev_lim->max_sg; + mdev->limits.max_wqes = dev_lim->max_qp_sz; mdev->limits.reserved_qps = dev_lim->reserved_qps; + mdev->limits.max_srq_wqes = dev_lim->max_srq_sz; mdev->limits.reserved_srqs = dev_lim->reserved_srqs; mdev->limits.reserved_eecs = dev_lim->reserved_eecs; + /* + * Subtract 1 from the limit because we need to allocate a + * spare CQE so the HCA HW can tell the difference between an + * empty CQ and a full CQ. 
+ */ + mdev->limits.max_cqes = dev_lim->max_cq_sz - 1; mdev->limits.reserved_cqs = dev_lim->reserved_cqs; mdev->limits.reserved_eqs = dev_lim->reserved_eqs; mdev->limits.reserved_mtts = dev_lim->reserved_mtts; --- linux-kernel/infiniband/hw/mthca/mthca_provider.c (revision 3704) +++ linux-kernel/infiniband/hw/mthca/mthca_provider.c (working copy) @@ -90,14 +90,17 @@ static int mthca_query_device(struct ib_ props->max_mr_size = ~0ull; props->max_qp = mdev->limits.num_qps - mdev->limits.reserved_qps; - props->max_qp_wr = 0xffff; + props->max_qp_wr = mdev->limits.max_wqes; props->max_sge = mdev->limits.max_sg; props->max_cq = mdev->limits.num_cqs - mdev->limits.reserved_cqs; - props->max_cqe = 0xffff; + props->max_cqe = mdev->limits.max_cqes; props->max_mr = mdev->limits.num_mpts - mdev->limits.reserved_mrws; props->max_pd = mdev->limits.num_pds - mdev->limits.reserved_pds; props->max_qp_rd_atom = 1 << mdev->qp_table.rdb_shift; props->max_qp_init_rd_atom = 1 << mdev->qp_table.rdb_shift; + props->max_srq = mdev->limits.num_srqs - mdev->limits.reserved_srqs; + props->max_srq_wr = mdev->limits.max_srq_wqes; + props->max_srq_sge = mdev->limits.max_sg; props->local_ca_ack_delay = mdev->limits.local_ca_ack_delay; props->atomic_cap = mdev->limits.flags & DEV_LIM_FLAG_ATOMIC ? 
IB_ATOMIC_HCA : IB_ATOMIC_NONE; @@ -640,6 +643,9 @@ static struct ib_cq *mthca_create_cq(str int nent; int err; + if (entries < 1 || entries > to_mdev(ibdev)->limits.max_cqes) + return ERR_PTR(-EINVAL); + if (context) { if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) return ERR_PTR(-EFAULT); --- linux-kernel/infiniband/hw/mthca/mthca_cmd.c (revision 3704) +++ linux-kernel/infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -933,9 +933,9 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev goto out; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_SRQ_SZ_OFFSET); - dev_lim->max_srq_sz = 1 << field; + dev_lim->max_srq_sz = (1 << field) - 1; MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_SZ_OFFSET); - dev_lim->max_qp_sz = 1 << field; + dev_lim->max_qp_sz = (1 << field) - 1; MTHCA_GET(field, outbox, QUERY_DEV_LIM_RSVD_QP_OFFSET); dev_lim->reserved_qps = 1 << (field & 0xf); MTHCA_GET(field, outbox, QUERY_DEV_LIM_MAX_QP_OFFSET); @@ -1045,6 +1045,8 @@ int mthca_QUERY_DEV_LIM(struct mthca_dev dev_lim->max_pds, dev_lim->reserved_pds, dev_lim->reserved_uars); mthca_dbg(dev, "Max QP/MCG: %d, reserved MGMs: %d\n", dev_lim->max_pds, dev_lim->reserved_mgms); + mthca_dbg(dev, "Max CQEs: %d, max WQEs: %d, max SRQ WQEs: %d\n", + dev_lim->max_cq_sz, dev_lim->max_qp_sz, dev_lim->max_srq_sz); mthca_dbg(dev, "Flags: %08x\n", dev_lim->flags); --- linux-kernel/infiniband/hw/mthca/mthca_srq.c (revision 3704) +++ linux-kernel/infiniband/hw/mthca/mthca_srq.c (working copy) @@ -186,7 +186,8 @@ int mthca_alloc_srq(struct mthca_dev *de int err; /* Sanity check SRQ size before proceeding */ - if (attr->max_wr > 16 << 20 || attr->max_sge > 64) + if (attr->max_wr > dev->limits.max_srq_wqes || + attr->max_sge > dev->limits.max_sg) return -EINVAL; srq->max = attr->max_wr; --- linux-kernel/infiniband/hw/mthca/mthca_qp.c (revision 3704) +++ linux-kernel/infiniband/hw/mthca/mthca_qp.c (working copy) @@ -1112,8 +1112,10 @@ static int mthca_set_qp_size(struct mthc struct mthca_qp *qp) { /* Sanity check QP size before 
proceeding */ - if (cap->max_send_wr > 65536 || cap->max_recv_wr > 65536 || - cap->max_send_sge > 64 || cap->max_recv_sge > 64) + if (cap->max_send_wr > dev->limits.max_wqes || + cap->max_recv_wr > dev->limits.max_wqes || + cap->max_send_sge > dev->limits.max_sg || + cap->max_recv_sge > dev->limits.max_sg) return -EINVAL; if (mthca_is_memfree(dev)) { From halr at voltaire.com Mon Oct 10 07:45:59 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Oct 2005 10:45:59 -0400 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: References: Message-ID: <1128955559.4377.81.camel@hal.voltaire.com> On Sun, 2005-10-09 at 10:19, Sean Hefty wrote: > >I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ? > > I'm referring to the case that iWarp is running over TCP. I know that it can > run over SCTP, but I'm not familiar with the details of that protocol. With > TCP, this is an end-to-end connection, so layering iWarp over it, only the > endpoints need to deal with it. I believe the same is true for SCTP. Yes, SCTP is similar in those regards. > >Doesn't a routing decision still need to be made at the IP layer ? > > Routing of the IP packets is done at the IP layer, but I don't see how this > affects iWarp. It does under the "covers", those covers being IP routing. > >Doesn't the IP next hop need to be determined (e.g. gateway when the > >destination is off the local IP subnet) ? Is there something that > >precludes iWARP from working across IP subnets ? > > I can't think of anything that would preclude iWarp from working > across subnets. Doesn't the IP next hop need determining in that case ? Why is that not relevant ? I don't think the iWARP connection is end to end in all cases. 
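The next-hop question raised here can be illustrated with a minimal sketch. It assumes IPv4 addresses held in host byte order and a single default gateway; `next_hop` and its parameters are hypothetical names used only for illustration and are not taken from any OpenIB or kernel code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: choose the IPv4 next hop for dst.
 * All addresses are in host byte order to keep the sketch simple.
 */
static uint32_t next_hop(uint32_t local_ip, uint32_t netmask,
			 uint32_t gateway, uint32_t dst)
{
	/* Destination on the local subnet: resolve (ARP for) dst itself. */
	if ((dst & netmask) == (local_ip & netmask))
		return dst;

	/* Destination off the local subnet: resolve the gateway instead. */
	return gateway;
}
```

Under these assumptions, whichever address is returned is the one whose hardware address has to be resolved before anything can be sent.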
-- Hal From halr at voltaire.com Mon Oct 10 07:56:49 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Oct 2005 10:56:49 -0400 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <1128877818.24182.54.camel@mail.es335.com> References: <1128877818.24182.54.camel@mail.es335.com> Message-ID: <1128956208.4377.103.camel@hal.voltaire.com> Hi Tom, On Sun, 2005-10-09 at 13:10, Tom Tucker wrote: > On Sun, 2005-10-09 at 07:57 -0700, Sean Hefty wrote: > > >It is theoretically possible to support all this on an IPoIB based > > >network. Multiple subnets, multiple routes to remote peers, ICMP > > >redirect, multiple IP addresses for each physical interface, yada yada > > >yada. But IMHO, the only way to do this would be to tie directly into > > >the existing routing, ARP, ICMP, etc... subsystems in Linux. Otherwise > > >you'll end up recreating a gigantic (and I mean GIGANTIC) amount of > > > > The current implementation ties into the standard Linux ARP tables. If > > connections were made over TCP/IP, using IPoIB, then I don't think that there > > would be any issues. The issues only arise because of the desire to use TCP/IP > > network addresses over a non-TCP/IP network. > > > > >code. This belief is why I've been a proponent of mapping GIDs to one > > >and only one IP address and treating it for management purposes as the > > >equivalent of an IP address. Without this, the whole mechanism for > > >determining routes, etc.. breaks down. If you treat the GID like a MAC > > >address -- it breaks, because a MAC address can have multiple IP > > >addresses -- the observation that lead to the conclusion that ATS was > > >broken in the first place. > > > > We should be able to handle the case where a GID has multiple IP addresses bound > > to it. But even if we added a 1:1 restriction, the connection over IB issue > > still exists. > > I agree, except for RARP. Not sure what you mean "except for RARP". Can you elaborate ? [snip...] 
> > I > > don't view a GID as an IP address because we're not sending and receiving IP > > packets on the GID. IPoIB treats GIDs as only part of a MAC address, which I > > think is the proper view. > > > > Anyway, returning back to the original problem of connecting to an IB gateway if > > a given a destination IP address on a different subnet... I'm slowly convincing > > myself that either the CMA or AT should do this. (I believe that the ib_addr > > code will do this now, but still wasn't sure that we wanted it to.) > > > > IMHO, you need a service separate from the CMA to do address > translation. My (iWARP's) rationale for this is that there are two > clients of the service, the CM and IP. For CM, you need it to elect a > route and thereby a local interface. For IP you need it because routes > change and ARP entries time out. > > BTW, can you educate me ... is the following what you're thinking: > > On the client side... > > - route is discovered by looking at the Linux routing table > - local interface is IPoIB (looks at rdma_ptr embedded in netdev struct) > - send ARP AT message over local IB interface It's just a normal IPoIB ARP to the destination IP address initiated by AT. (With ATS, it could have been an SA Get ServiceRecord as an alternative). I think the current CMA code handles client above and server but not (bridging) gateway below. > At the gateway...bridging to IP > - ARP AT query received on IB interface > - Lookup route to destination IP address in gateway's route table. > - If next hop's Ethernet address is already known, it is returned ^^^^^^^^ hardware (may not be ethernet) > - Otherwise, local interface identified is IPoEthernet > - New ARP query goes out on the local interface from the route > - When response comes back, answer is returned. 
> At the gateway...bridging to IPoIB
>
> - ARP AT message received on IB interface, delivered to AT
> - Lookup route to destination IP address in gateway's route table
> - If next hop's Ethernet address is already known, it is returned
> - otherwise, local interface identified in route is IPoIB
> - New ARP AT query goes out on the local interface
> - When response comes back, answer is returned.

-- Hal

From halr at voltaire.com Mon Oct 10 08:03:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Oct 2005 11:03:24 -0400
Subject: [openib-general] Re: [PATCH] IPoIB: Add API to retrieve ib device, port, and pkey
In-Reply-To: <52r7ayoa9l.fsf@cisco.com>
References: <1128613310.4382.609.camel@hal.voltaire.com> <52r7ayoa9l.fsf@cisco.com>
Message-ID: <1128956603.4377.112.camel@hal.voltaire.com>

On Thu, 2005-10-06 at 12:55, Roland Dreier wrote:
> Did we ever figure out how to handle the hotplug issues with the
> lifetime of the struct ib_device pointer? Right now this API is
> unsafe, because a caller can get a pointer to a device that has
> already disappeared.

I think this can be handled as follows:
The netdev references would be maintained for the duration of each AT call
until it completes/times out. If subsequent calls are made based on an
ib_device which has been removed, an error could be returned, based on
AT maintaining a list of devices and validating the supplied device
against its list. ipoib_get_info() would be called only with a valid
device and the caller holding a netdev reference for at least the
duration of that call.

> Also if we do decide to add an API like this, the struct ipoib_info
> and ipoib_get_info() declarations should be in
> rather than in the private ipoib.h header.

OK.
-- Hal

From caitlin.bestler at gmail.com Mon Oct 10 08:47:27 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Mon, 10 Oct 2005 08:47:27 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To:
References: <1128730364.4382.11557.camel@hal.voltaire.com>
Message-ID: <469958e00510100847v53bbc1baq726a3bf0e9561d90@mail.gmail.com>

On 10/9/05, Sean Hefty wrote:
>
> >I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ?
>
> I'm referring to the case that iWarp is running over TCP. I know that it can
> run over SCTP, but I'm not familiar with the details of that protocol. With
> TCP, this is an end-to-end connection, so layering iWarp over it, only the
> endpoints need to deal with it. I believe the same is true for SCTP.

The main impact of SCTP is that even the IP address can change
under the covers. So not only is there routing that is transparent
to the RDMA consumer, there is also selection of source/destination
IP addresses.

From caitlin.bestler at gmail.com Mon Oct 10 08:50:59 2005
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Mon, 10 Oct 2005 08:50:59 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <1128955559.4377.81.camel@hal.voltaire.com>
References: <1128955559.4377.81.camel@hal.voltaire.com>
Message-ID: <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com>

On 10 Oct 2005 10:45:59 -0400, Hal Rosenstock wrote:
>
> On Sun, 2005-10-09 at 10:19, Sean Hefty wrote:
> > >I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ?
> >
> > I'm referring to the case that iWarp is running over TCP. I know that it can
> > run over SCTP, but I'm not familiar with the details of that protocol. With
> > TCP, this is an end-to-end connection, so layering iWarp over it, only the
> > endpoints need to deal with it. I believe the same is true for SCTP.
>
> Yes, SCTP is similar in those regards.
>
> > >Doesn't a routing decision still need to be made at the IP layer ?
> >
> > Routing of the IP packets is done at the IP layer, but I don't see how this
> > affects iWarp.
>
> It does under the "covers", those covers being IP routing.
>
> > >Doesn't the IP next hop need to be determined (e.g. gateway when the
> > >destination is off the local IP subnet) ? Is there something that
> > >precludes iWARP from working across IP subnets ?
> >
> > I can't think of anything that would preclude iWarp from working
> > across subnets.
>
> Doesn't the IP next hop need determining in that case ? Why is that not
> relevant ? I don't think the iWARP connection is end to end in all
> cases.

Of course it's end to end. It's just that only the end points
understand that it is an iWARP connection.

Or more properly, the underlying transport (or "LLP") connections
are end to end, but the iWARP semantics exist only in the RDMA
endpoints.

That is why iWARP works across multiple subnets. We've actually
done true worldwide connections. The existing IP network carries
the iWARP traffic because it is indeed just TCP traffic to the
intermediate network.

From robert.j.woodruff at intel.com Mon Oct 10 08:51:37 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 10 Oct 2005 08:51:37 -0700
Subject: [openib-general] Lustre Network Driver - KDAPL or verbs?
Message-ID: <1AC79F16F5C5284499BB9591B33D6F0005C505A8@orsmsx408>

Peter Braam wrote,
> Cluster File Systems, Inc and its customers have been wondering if the Lustre
> Network Driver (LND) for OpenIb gen2, which we will begin to develop during
> the coming months, should be based on kdapl or verbs.
>The driver we plan to develop should strive to address several goals: > - high reliability and performance > - allow interoperability between user and kernel level > - allow interoperability, or better, portability among different operating systems (Linux, OS X, Windows, Solaris) > - be suitable for inclusion in the Linux kernel > We are keen to hear some opinions! For Linux, I would target Sean's new CMA for connection establishment and then the current IB verbs which are being modified to support both iWarp and IB. my 2 cents, woody From halr at voltaire.com Mon Oct 10 09:08:09 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Oct 2005 12:08:09 -0400 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com> References: <1128955559.4377.81.camel@hal.voltaire.com> <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com> Message-ID: <1128960287.4377.378.camel@hal.voltaire.com> On Mon, 2005-10-10 at 11:50, Caitlin Bestler wrote: > Doesn't the IP next hop need determining in that case ? Why is > that not > relevant ? I don't think the iWARP connection is end to end in > all > cases. > > > Of course it's end to end. It's just that only the end points > understand that it is an iWARP connection. What about the case of iWARP <-> IB ? > Or more properly, the underlying transport (or "LLP") connections > are end to end, but the iWARP semantics exist only in the RDMA > endpoints. > > That is why iWARP works across multiple subnets. ^^^^^^^ IP subnets > We've actually > done true worldwide connections. The exisitng IP network carries > the iWARP traffic because it is indeed just TCP traffic to the > intermediate network. 
-- Hal From rolandd at cisco.com Mon Oct 10 09:22:34 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 09:22:34 -0700 Subject: [openib-general] IRQ sharing on PCIe bus In-Reply-To: (yipee's message of "Mon, 10 Oct 2005 08:28:06 +0000 (UTC)") References: Message-ID: <52d5mdibp1.fsf@cisco.com> yipee> Hi, My setup is a 3GHz Xeon (x86_64) with a 2.6.13.2 yipee> kernel. A Mellanox memfree PCIe ddr HCA is connected. Why yipee> do I see IRQ sharing although I'm using msi_x and PCIe? yipee> Doesn't IRQ sharing only happen on older non PCIe busses? I think the messages you see are coming from the ACPI interrupt routing that is done when the driver calls pci_enable_device(). However, if you use MSI-X then that interrupt won't actually be used. If you check /proc/interrupts you should see ib_mthca using 3 non-shared interrupts. BTW, for "INTx emulation" on PCI Express, there are no physical interrupt lines -- interrupts are asserted and deasserted with messages. So PCI Express interrupts are unshared. However, the PCI Express host bridge turns those interrupts into real interrupts to the system's interrupt controller, and for that part of the story, it's entirely possible for two different PCI Express devices to end up sharing the same interrupt line. - R. From mshefty at ichips.intel.com Mon Oct 10 10:21:02 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Oct 2005 10:21:02 -0700 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <20051010135723.GT21551@mellanox.co.il> References: <20051010135723.GT21551@mellanox.co.il> Message-ID: <434AA2FE.6000702@ichips.intel.com> Thanks for the feedback. See below. Michael S. Tsirkin wrote: > Wouldnt is be a good idea to start names with rdma_cm > or rdma_cma or something like that? > For example, rdma_event_type is a bit confusing since this actually only > includes CM events. Similiar comments apply to other names. I had that originally, but changed it. 
I figured that names like rdma_connect() and rdma_listen() were clear enough that they were for connection management. >>+struct rdma_id; > > I propose renaming this to rdma_connection or something > else more specific than just "id". Makes sense? I can change this to rdma_cm_id or rdma_cma or something else... >>+int rdma_resolve_route(struct rdma_id *id, int timeout_ms); > > Not sure I understand what this does, since the only extra parameter is > timeout_ms. For IB, this results in a path record query based on the GIDs that were set with the rdma_id from rdma_resolve_addr(). The GIDs are in rdma_id.route.addr.ibaddr. The output is saved to rdma_id.route.path_rec. My intent is to make this call optional in the future. >>+int rdma_create_qp(struct rdma_id *id, struct ib_pd *pd, >>+ struct ib_qp_init_attr *qp_init_attr); >>+ >>+void rdma_destroy_qp(struct rdma_id *id); > > Not sure what the intended usage is. > When does the user need to call this? The CMA needs to associate a QP with the rdma_id, and CMA will transition the QP through its connection states. The rdma_create_qp() is called to allocate a QP and transition it to the INIT state, so users can post receives to the QP. The destroy call is a pass-through call provided simply for symmetry. >>+#include >>+#include >>+#include >>+#include >>+#include >>+#include >>+#include >>+#include >>+#include >>+#include > > Are all of these headers really needed? > For example, I dont see arp.h used anywhere. > Am I missing something? They were needed at one point, but might not all be needed now. I will see which ones can be removed. Some were only needed for address translation, which was originally part of this file while I worked out its API. > What about replacing switch with one case statements with if statements. 
> Like this: > > if (id->device->node_type == IB_NODE_CA) > ret = cma_init_ib_qp(id_priv, qp); > else > ret = -ENOSYS; I tried to make it easy to modify the code to support iWarp, or some other RDMA device. I'd prefer to leave these checks as switch statements for that reason, or just remove them completely. > I also wander why do we really need all these node_type checks. > The code above seems to imply that rdma_create_qp will fail > on non-CA. Why is that? The code doesn't set the right parameters to INIT for an iWarp QP. >>+static inline void cma_deref_dev(struct rdma_id_private *id_priv) >>+{ >>+// if (atomic_dec_and_test(&id_priv->dev_remove)) >>+// wake_up(&id_priv->wait); >>+// return atomic_dec_and_test(&id_priv->dev_remove) ? >>+// cma_notify_user(id_priv, RDMA_EVENT_DEVICE_REMOVAL, -ENODEV, >>+// NULL, 0) : 0; >>+} > > > The above seems to need some cleanup. This has been cleaned up in my latest version. It was part of the initial device removal handling code that didn't work. I decided to just try to get connection establishment working, and then come back to fix device removal. - Sean From mshefty at ichips.intel.com Mon Oct 10 10:36:35 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Oct 2005 10:36:35 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <1128877818.24182.54.camel@mail.es335.com> References: <1128877818.24182.54.camel@mail.es335.com> Message-ID: <434AA6A3.5090504@ichips.intel.com> Tom Tucker wrote: >>Again, I don't think that the binding is the issue, so much as the desire to use >>an address for a protocol that isn't actually being used for communication. > > Not to be pedantic, but if binding or mapping or somesuch weren't an > issue we wouldn't need AT. We need AT because we're not using network addresses. If a client used an IP address and ran over IP, we wouldn't need to do anything special. > IMHO, you need a service separate from the CMA to do address > translation. 
My (iWARP's) rationale for this is that there are two
> clients of the service, the CM and IP. For CM, you need it to elect a
> route and thereby a local interface. For IP you need it because routes
> change and ARP entries time out.

The connection management and address translation are separate services, with
the CMA calling the address translation for the user. You may want to look at
ib_addr for details on how the address translation works.

> - route is discovered by looking at the Linux routing table
    ^^^^^
    address mapping from IP to GID/Pkey.

> - local interface is IPoIB (looks at rdma_ptr embedded in netdev struct)

The address translation looks only at the hardware and broadcast addresses. No
additional rdma_ptr is needed with ib_addr.

> - send ARP AT message over local IB interface

It sends a normal IP ARP to get the remote hardware address, which contains the
destination GID. An ARP is sent only if the mapping isn't available in the
local ARP table.

At this point, the client has the SGID, DGID, and PKey. It then issues a path
record query to obtain the "route" to the destination. The CMA doesn't really
care if that destination is the actual destination or some gateway.

- Sean

From mshefty at ichips.intel.com Mon Oct 10 10:40:16 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 10 Oct 2005 10:40:16 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <1128960287.4377.378.camel@hal.voltaire.com>
References: <1128955559.4377.81.camel@hal.voltaire.com> <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com> <1128960287.4377.378.camel@hal.voltaire.com>
Message-ID: <434AA780.60808@ichips.intel.com>

Hal Rosenstock wrote:
> What about the case of iWARP <-> IB ?

Crossing IB shouldn't matter. iWarp should simply cross the IB subnet using
IPoIB. You could build a gateway to make the transfer across IB more efficient,
but it's not required.
- Sean From mshefty at ichips.intel.com Mon Oct 10 10:59:51 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Oct 2005 10:59:51 -0700 Subject: [openib-general] [PATCH] [ADDR] address translation module for CMA In-Reply-To: References: Message-ID: <434AAC17.1010709@ichips.intel.com> Sean Hefty wrote: > The following patch adds a simple IP to IB address translation module > using ARP. It is based off AT and SDP, but kept as simple as possible. > > I would like to merge this back into the trunk, and apply other changes > there. I didn't see any objections, so I have committed this to the trunk as part of the core software. - Sean From mshefty at ichips.intel.com Mon Oct 10 11:01:03 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Oct 2005 11:01:03 -0700 Subject: [openib-general] [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: References: Message-ID: <434AAC5F.70301@ichips.intel.com> Sean Hefty wrote: > The following patch adds in a basic RDMA connection management abstraction. > It is functional, but needs additional work for handling device removal, plus > several missing features. > > I'd like to merge this back into the trunk, and continue working on it from > there. I didn't see any objections, so I have merged this into the trunk. Changes were made from the original patch based on Michael's feedback, and device removal handling was added. - Sean From mst at mellanox.co.il Mon Oct 10 11:07:50 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Oct 2005 20:07:50 +0200 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <434AA2FE.6000702@ichips.intel.com> References: <434AA2FE.6000702@ichips.intel.com> Message-ID: <20051010180750.GA5916@mellanox.co.il> Quoting Sean Hefty : > > Wouldnt is be a good idea to start names with rdma_cm > > or rdma_cma or something like that? > > For example, rdma_event_type is a bit confusing since this actually only > > includes CM events. 
Similar comments apply to other names.
>
> I had that originally, but changed it. I figured that names like rdma_connect()
> and rdma_listen() were clear enough that they were for connection management.

Yes, fine, but names like rdma_event_type probably do need the prefix,
don't they?

> >>+struct rdma_id;
> >
> > I propose renaming this to rdma_connection or something
> > else more specific than just "id". Makes sense?
>
> I can change this to rdma_cm_id or rdma_cma or something else...

Maybe rdma_connection (these things encapsulate connection state)?
Or, rdma_sock or rdma_socket, since people are used to the fact that connections
are sockets?

> >>+int rdma_resolve_route(struct rdma_id *id, int timeout_ms);
> >
> > Not sure I understand what this does, since the only extra parameter is
> > timeout_ms.
>
> For IB, this results in a path record query based on the GIDs that were set with
> the rdma_id from rdma_resolve_addr(). The GIDs are in
> rdma_id.route.addr.ibaddr. The output is saved to rdma_id.route.path_rec. My
> intent is to make this call optional in the future.

I was trying to say, why doesn't rdma_connect just do this
transparently? Why do we need a separate call?

> >>+int rdma_create_qp(struct rdma_id *id, struct ib_pd *pd,
> >>+ struct ib_qp_init_attr *qp_init_attr);
> >>+
> >>+void rdma_destroy_qp(struct rdma_id *id);
> >
> > Not sure what the intended usage is.
> > When does the user need to call this?
>
> The CMA needs to associate a QP with the rdma_id, and CMA will transition the QP
> through its connection states. The rdma_create_qp() is called to allocate a QP
> and transition it to the INIT state, so users can post receives to the QP. The
> destroy call is a pass-through call provided simply for symmetry.

What happens on the passive side?
May we need more than one qp per rdma_id?
Or is a new id created each time a connection request arrives?
-- MST

From mshefty at ichips.intel.com Mon Oct 10 11:15:57 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 10 Oct 2005 11:15:57 -0700
Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module
In-Reply-To: <20051010180750.GA5916@mellanox.co.il>
References: <434AA2FE.6000702@ichips.intel.com> <20051010180750.GA5916@mellanox.co.il>
Message-ID: <434AAFDD.90208@ichips.intel.com>

Michael S. Tsirkin wrote:
> Yes, fine, but names like rdma_event_type probably do need the prefix,
> don't they?

I'll fix this.

> Maybe rdma_connection (these things encapsulate connection state)?
> Or, rdma_sock or rdma_socket, since people are used to the fact that connections
> are sockets?

Any objection to rdma_socket?

>>>>+int rdma_resolve_route(struct rdma_id *id, int timeout_ms);
>
> I was trying to say, why doesn't rdma_connect just do this
> transparently? Why do we need a separate call?

Eventually rdma_connect will call this for the user if a route hasn't been
resolved. At some point though, the API will likely need to be expanded to
specify some sort of quality of service.

> What happens on the passive side?
> May we need more than one qp per rdma_id?
> Or is a new id created each time a connection request arrives?

A new identifier is created each time a connection request arrives. The goal is
to support a single listen across multiple devices, so listen id's will not
necessarily be bound to an ib_device. The new id will be bound to the device
that the connection request was received on.
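The passive-side behaviour described above can be sketched roughly as follows. This is only an illustration under assumptions: `sketch_device`, `sketch_id` and `sketch_conn_request()` are hypothetical stand-ins, not the CMA's actual types or functions, and user-space `calloc()` stands in for kernel allocation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ib_device. */
struct sketch_device {
	int index;
};

/* Hypothetical stand-in for the connection identifier. */
struct sketch_id {
	struct sketch_device *device;	/* NULL while not bound to a device */
	int listening;
};

/*
 * Called when a connection request arrives on dev for a listening id.
 * The listen id itself stays unbound; a new id, bound to the receiving
 * device, is returned (or NULL on failure).
 */
static struct sketch_id *sketch_conn_request(const struct sketch_id *listen_id,
					     struct sketch_device *dev)
{
	struct sketch_id *id;

	if (!listen_id || !listen_id->listening)
		return NULL;

	id = calloc(1, sizeof *id);
	if (!id)
		return NULL;

	id->device = dev;	/* bind the new id to the receiving device */
	return id;
}
```

In this sketch one listen id can serve requests from several devices, and each accepted request gets its own id, which is where a per-connection QP would be attached.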
- Sean

From rolandd at cisco.com Mon Oct 10 11:23:45 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 10 Oct 2005 11:23:45 -0700
Subject: [openib-general] Timeline of IPoIB performance
In-Reply-To: <1128738350.13945.369.camel@localhost> (Matt Leininger's message of "Fri, 07 Oct 2005 19:25:49 -0700")
References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost>
Message-ID: <521x2tgrim.fsf@cisco.com>

 > 2.6.12-rc5 in-kernel 1 405 <<<<<
 > 2.6.12-rc4 in-kernel 1 470 <<<<<

I was optimistic when I saw this, because the changeover to git
occurred with 2.6.12-rc2, so I thought I could use git bisect to track
down exactly when the performance regression happened.

However, I haven't been able to get numbers that are stable enough to
track this down. I have two systems, both HP DL145s with dual Opteron
875s and two-port mem-free PCI Express HCAs. I use MSI-X with the
completion interrupt affinity set to CPU 0, and "taskset 2" to run
netserver and netperf on CPU 1. With default netperf parameters (just
"-H otherguy") I get numbers between ~490 MB/sec and ~550 MB/sec for
2.6.12-rc4 and 2.6.12-rc5.

The numbers are quite consistent between reboots, but if I reboot the
system (even keeping the kernel identical), I see large performance
changes. Presumably something is happening like the cache coloring of
some hot data structures changing semi-randomly depending on the
timing of various initializations.

Matt, how stable are your numbers?

 - R.
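For readers unfamiliar with affinity masks: the argument in "taskset 2" is a bitmask in which bit N selects CPU N, so mask 0x2 means CPU 1, matching the setup described above. A minimal sketch of that decoding, where `mask_to_cpus` is a hypothetical helper written for this note and not part of taskset or the kernel:

```c
/*
 * Hypothetical helper: decode an affinity bitmask into CPU numbers.
 * Bit N set in mask means CPU N is allowed. Stores at most max entries
 * in cpus[] and returns how many were stored.
 */
static int mask_to_cpus(unsigned long mask, int *cpus, int max)
{
	int n = 0;

	for (int cpu = 0; mask != 0 && n < max; cpu++, mask >>= 1) {
		if (mask & 1UL)
			cpus[n++] = cpu;
	}
	return n;
}
```

With this decoding, a mask of 0x1 corresponds to the CPU that handles the completion interrupt in the setup above, and 0x2 to the CPU running netserver and netperf.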
From tom at ammasso.com Mon Oct 10 11:30:53 2005 From: tom at ammasso.com (Tom Tucker) Date: Mon, 10 Oct 2005 14:30:53 -0400 Subject: [openib-general] [RFC] IB address translation using ARP Message-ID: <8E9D028761D8264D910612167E8457E801195C45@mail2.ammasso.com> > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Monday, October 10, 2005 12:37 PM > To: Tom Tucker > Cc: Sean Hefty; Openib > Subject: Re: [openib-general] [RFC] IB address translation using ARP > > Tom Tucker wrote: > >>Again, I don't think that the binding is the issue, so much > as the desire to use > >>an address for a protocol that isn't actually being used > for communication. > > > > Not to be pedantic, but if binding or mapping or somesuch weren't an > > issue we wouldn't need AT. > > We need AT because we're not using network addresses. If a > client used an IP > address and ran over IP, we wouldn't need to do anything special. agreed. > > > IMHO, you need a service separate from the CMA to do address > > translation. My (iWARP's) rationale for this is that there are two > > clients of the service, the CM and IP. For CM, you need it > to elect a > > route and thereby a local interface. For IP you need it > because routes > > change and ARP entries time out. > > The connection management and address translation are > separate services, with > the CMA calling the address translation for the user. You > may want to look at > ib_addr for details on how the address translation works. Very cool. I've applied the patch and will take a look. > > > - route is discovered by looking at the Linux routing table > ^^^^^ > address mapping from IP to GID/Pkey. I think I understand where I'm upside down now. In my world, you don't know which interface to send the ARP request on until you've identified the local interface and you can't identify the local interface until you've looked up the route. Not all interface have a path to all remote peers. 
In your world, you can't look up the path record until you've identified the remote GID. What I don't get is, if you have more than one IB interface, which interface do you submit your IPoIB ARP request on? All of them? > > > - local interface is IPoIB (looks at rdma_ptr embedded in > netdev struct) > The address translation looks only at the hardware and > broadcast addresses. No > additional rdma_ptr is needed with ib_addr. > Cool, I must have misunderstood an earlier discussion. > > - send ARP AT message over local IB interface > It sends a normal IP ARP to get the remove hardware address, > which contains the > destination GID. An ARP is sent only if the mapping isn't > available in the > local ARP table. Not sure what a "normal IP ARP" message is. In my world, ARP and IP are peer protocols. ARP does not sit on top of IP, nor is it a special kind of IP message. Forgive my ignorance, but does IPoIB have ARP built into it? But regardless, how do you know which local interface to send the IP ARP message on? > > At this point, the client has the SGID, DGID, and PKey. It > then issues a path > record query to obtain the "route" to the destination. The > CMA doesn't really > care if that destination is the actual destination or some gateway. Thanks for the clarifications. > > - Sean > From mshefty at ichips.intel.com Mon Oct 10 11:43:51 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Oct 2005 11:43:51 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <8E9D028761D8264D910612167E8457E801195C45@mail2.ammasso.com> References: <8E9D028761D8264D910612167E8457E801195C45@mail2.ammasso.com> Message-ID: <434AB667.8060707@ichips.intel.com> Tom Tucker wrote: > I think I understand where I'm upside down now. In my world, > you don't know which interface to send the ARP request on > until you've identified the local interface and you can't > identify the local interface until you've looked up the route. 
> Not all interface have a path to all remote peers.

We have the same restriction. I look up the route based on the destination IP
address to get the local interface.

> In your world, you can't look up the path record until you've
> identified the remote GID. What I don't get is, if you have more
> than one IB interface, which interface do you submit your IPoIB ARP
> request on? All of them?

It's based on the device returned by the route lookup. I've attached the
relevant code portion below. If the code below fails, I generate an ARP, wait
for the reply, then re-execute the code.

> Not sure what a "normal IP ARP" message is. In my world, ARP and
> IP are peer protocols. ARP does not sit on top of IP, nor is it a
> special kind of IP message. Forgive my ignorance, but does IPoIB
> have ARP built into it?

I was being confusing. The ARP is sent on the IPoIB net_device to map an IP
address to the remote hardware address. There's nothing special about the ARP.

- Sean

static int addr_resolve_remote(struct sockaddr_in *src_in,
			       struct sockaddr_in *dst_in,
			       struct ib_addr *addr)
{
	u32 src_ip = src_in->sin_addr.s_addr;
	u32 dst_ip = dst_in->sin_addr.s_addr;
	struct flowi fl;
	struct rtable *rt;
	struct neighbour *neigh;
	int ret;

	memset(&fl, 0, sizeof fl);
	fl.nl_u.ip4_u.daddr = dst_ip;
	fl.nl_u.ip4_u.saddr = src_ip;
	ret = ip_route_output_key(&rt, &fl);
	if (ret)
		goto out;

	neigh = neigh_lookup(&arp_tbl, &dst_ip, rt->idev->dev);
	if (!neigh) {
		ret = -ENODATA;
		goto err1;
	}

	if (!(neigh->nud_state & NUD_VALID)) {
		ret = -ENODATA;
		goto err2;
	}

	if (!src_ip) {
		src_in->sin_family = dst_in->sin_family;
		src_in->sin_addr.s_addr = rt->rt_src;
	}

	addr->sgid = *(union ib_gid *) (neigh->dev->dev_addr + 4);
	addr->dgid = *(union ib_gid *) (neigh->ha + 4);
	addr->pkey = addr_get_pkey(neigh->dev);
err2:
	neigh_release(neigh);
err1:
	ip_rt_put(rt);
out:
	return ret;
}

static void addr_send_arp(struct sockaddr_in *dst_in)
{
	struct rtable *rt;
	struct flowi fl;
	u32 dst_ip = dst_in->sin_addr.s_addr;

	memset(&fl, 0, sizeof fl);
	fl.nl_u.ip4_u.daddr = dst_ip;
	if (ip_route_output_key(&rt, &fl))
		return;

	arp_send(ARPOP_REQUEST, ETH_P_ARP, dst_ip, rt->idev->dev,
		 rt->rt_src, NULL, rt->idev->dev->dev_addr, NULL);
	ip_rt_put(rt);
}

From mshefty at ichips.intel.com Mon Oct 10 11:46:30 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 10 Oct 2005 11:46:30 -0700
Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module
In-Reply-To: <434AAFDD.90208@ichips.intel.com>
References: <434AA2FE.6000702@ichips.intel.com> <20051010180750.GA5916@mellanox.co.il> <434AAFDD.90208@ichips.intel.com>
Message-ID: <434AB706.8040101@ichips.intel.com>

Sean Hefty wrote:
> Michael S. Tsirkin wrote:
>
>> Yes, fine, but names like rdma_event_type probably do need the prefix,
>> don't they?
>
> I'll fix this.

I've just committed a patch to rename the prefix.

- Sean

From ftillier at silverstorm.com Mon Oct 10 12:13:15 2005
From: ftillier at silverstorm.com (Fab Tillier)
Date: Mon, 10 Oct 2005 12:13:15 -0700
Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module
In-Reply-To: <434AAFDD.90208@ichips.intel.com>
Message-ID: <000601c5cdce$abf93440$9e5aa8c0@infiniconsys.com>

> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Monday, October 10, 2005 11:16 AM
>
> Michael S. Tsirkin wrote:
> > Maybe rdma_connection (these things encapsulate connection state)?
> > Or, rdma_sock or rdma_socket, since people are used to the fact that
> > connections are sockets?
>
> Any objection to rdma_socket?

I don't like rdma_socket, since you can't actually perform any I/O operations on
the rdma_socket, unlike normal sockets. We're dealing only with the connection
part of the problem, and the name should reflect that. So rdma_connection,
rdma_conn, or rdma_cid seem more appropriate.

- Fab

From mst at mellanox.co.il Mon Oct 10 13:00:42 2005
From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 10 Oct 2005 22:00:42 +0200 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <434AAFDD.90208@ichips.intel.com> References: <434AAFDD.90208@ichips.intel.com> Message-ID: <20051010200042.GB6633@mellanox.co.il> Quoting Sean Hefty : > > Maybe rdma_connection (these things encapsulate connectin state)? > > Or, rdma_sock or rdma_socket, since people are used to the fact that connections > > are sockets? > > Any objection to rdma_socket? Fine with me, this makes the intent of bind/listen explicit. > >>>>+int rdma_resolve_route(struct rdma_id *id, int timeout_ms); > > > > I was trying to say, why doesnt rdma_connect just do this > > transparently? Why do we need a separate call? > > Eventually rdma_connect will call this for the user if a route hasn't been > resolved. At some point though, the API will likely need to be expanded to > specify some sort of quality of service. I thought that would also happen at connect time. No? -- MST From krause at cup.hp.com Mon Oct 10 12:53:29 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 10 Oct 2005 12:53:29 -0700 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <000601c5cdce$abf93440$9e5aa8c0@infiniconsys.com> References: <434AAFDD.90208@ichips.intel.com> <000601c5cdce$abf93440$9e5aa8c0@infiniconsys.com> Message-ID: <6.2.0.14.2.20051010125123.0238b4d0@esmail.cup.hp.com> At 12:13 PM 10/10/2005, Fab Tillier wrote: > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > Sent: Monday, October 10, 2005 11:16 AM > > > > Michael S. Tsirkin wrote: > > > Maybe rdma_connection (these things encapsulate connectin state)? > > > Or, rdma_sock or rdma_socket, since people are used to the fact that > > > connections are sockets? > > > > Any objection to rdma_socket? > >I don't like rdma_socket, since you can't actually perform any I/O >operations on >the rdma_socket, unlike normal sockets. 
We're dealing only with the >connection >part of the problem, and the name should reflect that. So rdma_connection, >rdma_conn, or rdma_cid seem more appropriate. Naming should not involve sockets as that is part of existing standards. There are also the new standard Sockets extension API available today that might be extended sometime in the future to include explicit RDMA support should people decide to bypass SDP and go straight to a more robust API definition. The Sockets Extensions already comprehend explicit memory management, async comms, etc. making a significant improvement over the existing sync Sockets as well as going further in solving areas like memory management beyond what was done in Winsocks. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Mon Oct 10 13:03:22 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 10 Oct 2005 22:03:22 +0200 Subject: [openib-general] Re: Timeline of IPoIB performance In-Reply-To: <521x2tgrim.fsf@cisco.com> References: <521x2tgrim.fsf@cisco.com> Message-ID: <20051010200321.GC6633@mellanox.co.il> Hi Roland, Quoting r. Roland Dreier : > However, I haven't been able to get numbers that are stable enough to > track this down. Disabling irq balancing sometimes helps me make the numbers more stable. Hope this helps, -- MST From krause at cup.hp.com Mon Oct 10 12:56:19 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 10 Oct 2005 12:56:19 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <434AA780.60808@ichips.intel.com> References: <1128955559.4377.81.camel@hal.voltaire.com> <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com> <1128960287.4377.378.camel@hal.voltaire.com> <434AA780.60808@ichips.intel.com> Message-ID: <6.2.0.14.2.20051010125333.0259c078@esmail.cup.hp.com> At 10:40 AM 10/10/2005, Sean Hefty wrote: >Hal Rosenstock wrote: >>What about the case of iWARP <-> IB ? 
> >Crossing IB shouldn't matter. iWarp should simply cross the IB subnet >using IPoIB. You could build a gateway to make the transfer across IB >more efficient, but it's not required. I don't understand this statement. iWARP is RDMA based and if someone wanted to build a gateway with IB in between, it should be mapped to an IB RC connection 1:1. Going through IPoIB is a waste and would result in a very poorly performing solution (not that such a solution would deliver stellar performance to start with). Prior similar solutions used ULP over IB and the gateway then provided ULP over TOE and would then be easily extended to do iWARP. In general, you would want to have defined domains for each interconnect and not try to add poor ROI superset functionality of one over the other - waste of time and money. Mike From krause at cup.hp.com Mon Oct 10 12:50:59 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 10 Oct 2005 12:50:59 -0700 Subject: [openib-general] IRQ sharing on PCIe bus In-Reply-To: <52d5mdibp1.fsf@cisco.com> References: <52d5mdibp1.fsf@cisco.com> Message-ID: <6.2.0.14.2.20051010124836.023afd08@esmail.cup.hp.com> At 09:22 AM 10/10/2005, Roland Dreier wrote: > yipee> Hi, My setup is a 3GHz Xeon (x86_64) with a 2.6.13.2 > yipee> kernel. A Mellanox memfree PCIe ddr HCA is connected. Why > yipee> do I see IRQ sharing although I'm using msi_x and PCIe? > yipee> Doesn't IRQ sharing only happen on older non PCIe busses? > >I think the messages you see are coming from the ACPI interrupt >routing that is done when the driver calls pci_enable_device(). >However, if you use MSI-X then that interrupt won't actually be used. >If you check /proc/interrupts you should see ib_mthca using 3 >non-shared interrupts. > >BTW, for "INTx emulation" on PCI Express, there are no physical >interrupt lines -- interrupts are asserted and deasserted with >messages.
So PCI Express interrupts are unshared. They are messages upstream that any device. >However, the PCI Express host bridge turns those interrupts into real >interrupts to the system's interrupt controller, and for that part of the >story, it's entirely possible for two different PCI Express devices to end >up sharing the same interrupt line. Correct, the host bridge may map them to a "monarch" processor and thus any or all devices can share the same interrupt. This is why within the PCI-SIG we recommend using MSI-X and long-term, many of us would simply like to drop INTx and make MSI-X mandatory. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From hch at lst.de Mon Oct 10 13:09:21 2005 From: hch at lst.de (Christoph Hellwig) Date: Mon, 10 Oct 2005 22:09:21 +0200 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <6.2.0.14.2.20051010125123.0238b4d0@esmail.cup.hp.com> References: <434AAFDD.90208@ichips.intel.com> <000601c5cdce$abf93440$9e5aa8c0@infiniconsys.com> <6.2.0.14.2.20051010125123.0238b4d0@esmail.cup.hp.com> Message-ID: <20051010200921.GB25968@lst.de> On Mon, Oct 10, 2005 at 12:53:29PM -0700, Michael Krause wrote: > standards. There are also the new standard Sockets extension API available > today that might be extended sometime in the future to include explicit which is never going to get into linux. one more of these braindead standards people masturbating in a dark room and coming up with a frankenstein bastard cases. 
From rick.jones2 at hp.com Mon Oct 10 13:17:56 2005 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 10 Oct 2005 13:17:56 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <521x2tgrim.fsf@cisco.com> References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> Message-ID: <434ACC74.3020404@hp.com> Roland Dreier wrote: > > 2.6.12-rc5 in-kernel 1 405 <<<<< > > 2.6.12-rc4 in-kernel 1 470 <<<<< > > I was optimistic when I saw this, because the changeover to git > occurred with 2.6.12-rc2, so I thought I could use git bisect to track > down exactly when the performance regression happened. > > However, I haven't been able to get numbers that are stable enough to > track this down. I have two systems, both HP DL145s with dual Opteron > 875s and two-port mem-free PCI Express HCAs. I use MSI-X with the > completion interrupt affinity set to CPU 0, and "taskset 2" to run > netserver and netperf on CPU 1. > > With default netperf parameters (just "-H otherguy") I get numbers > between ~490 MB/sec and ~550 MB/sec for 2.6.12-rc4 and 2.6.12-rc5. > The numbers are quite consistent between reboots, but if I reboot the > system (even keeping the kernel identical), I see large performance > changes. Presumably something is happening like the cache coloring of > some hot data structures changing semi-randomly depending on the > timing of various initialations. Which rev of netperf are you using, and areyou using the "confidence intervals" options (-i, -I)? for a long time, the linux-unique behaviour of returning the overhead bytes for SO_[SND|RCV]BUF and them being 2X what one gives in setsockopt() gave netperf some trouble - the socket buffer would double in size each iteration on a confidence interval run. Later netperf versions (late 2.3, and 2.4.X) have a kludge for this. 
Slightly related to that, IIRC, the linux receiver code adjusts the advertised window as the connection goes along - how far the receive code opens the window may change from run to run - might that have an effect? If there is a way to get the linux receiver to simply advertise the full window from the beginning that might help minimize the number of variables. Are there large changes in service demand along with the large performance changes? FWIW, on later netperfs the -T option should allow you to specify the CPU on which netperf and/or netserver run, although I've had some trouble reliably detecting the right sched_setaffinity syntax among the releases. rick jones From vuhuong at mellanox.com Mon Oct 10 13:25:44 2005 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 10 Oct 2005 13:25:44 -0700 Subject: [openib-general] Re: [PATCH] SRP: don't use TX IU after freeing it In-Reply-To: <433C78A1.30207@mellanox.com> References: <52vf0kii49.fsf@cisco.com> <433C1821.6000809@mellanox.com> <52zmpvhll8.fsf@cisco.com> <433C78A1.30207@mellanox.com> Message-ID: <434ACE48.1030208@mellanox.com> Roland, >> >>That makes some sense. An issue is that FMRs are a fairly limited >>resource, and a system with many SRP targets where each target doesn't >>get much traffic could tie up a lot of FMRs. >> >> >> > You're right. For the same reason of unused port (ie. srp_host), I > create fmr resource per device and keep it in srp_device_data struct > > I put back fmr + your patch and it works well with my setup. > > Signed-off-by: Vu Pham > Have you got time to review this SRP's FMR patch? Thanks, vu From rolandd at cisco.com Mon Oct 10 13:53:29 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 13:53:29 -0700 Subject: [openib-general] Lustre Network Driver - KDAPL or verbs? In-Reply-To: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> (Peter J. 
Braam's message of "Sun, 9 Oct 2005 17:17:56 -0400") References: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> Message-ID: <52mzlhf60m.fsf@cisco.com> > The driver we plan to develop should strive to address several goals: > - high reliability and performance It seems unlikely that you would get more reliability or performance by adding another layer of software in your stack. > - allow interoperability between user and kernel level > - allow interoperability, or better, portability among different > operating systems (Linux, OS X, Windows, Solaris) Interoperability seems to be a function of designing an appropriate wire protocol rather than how you choose to implement the protocol. I believe that experience has proven that trying to maintain a single codebase portable to different OS kernels is always more work than just having separate codebases for separate kernels. Even trying to use the same code in both Linux kernel 2.4 and kernel 2.6 is enough of a pain that it's probably not worth it. > - be suitable for inclusion in the Linux kernel It is extremely unlikely that kDAPL will ever be included in the kernel. Does this last point mean that you are planning to try again and work on merging Lustre into the mainline kernel? - R. From rolandd at cisco.com Mon Oct 10 13:58:03 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 13:58:03 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <434ACC74.3020404@hp.com> (Rick Jones's message of "Mon, 10 Oct 2005 13:17:56 -0700") References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <434ACC74.3020404@hp.com> Message-ID: <52irw5f5t0.fsf@cisco.com> Rick> Which rev of netperf are you using, and are you using the Rick> "confidence intervals" options (-i, -I)?
for a long time, Rick> the linux-unique behaviour of returning the overhead bytes Rick> for SO_[SND|RCV]BUF and them being 2X what one gives in Rick> setsockopt() gave netperf some trouble - the socket buffer Rick> would double in size each iteration on a confidence interval Rick> run. Later netperf versions (late 2.3, and 2.4.X) have a Rick> kludge for this. I believe it's netperf 2.2. I'm not using any confidence interval stuff. However, the variation is not between single runs of netperf -- if I do 5 runs of netperf in a row, I get roughly the same number from each run. For example, I might see something like TCP STREAM TEST to 192.168.145.2 : histogram Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3869.82 and then TCP STREAM TEST to 192.168.145.2 : histogram Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3862.41 for two successive runs. However, if I reboot the system into the same kernel (ie everything set up exactly the same), the same invocation of netperf might give TCP STREAM TEST to 192.168.145.2 : histogram Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 4389.20 Rick> Are there large changes in service demand along with the Rick> large performance changes? Not sure. How do I have netperf report service demand? - R. 
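The netperf figures above are reported in 10^6 bits/sec, while the numbers quoted earlier in the thread (~490 to ~550 MB/sec) are in megabytes. A minimal sketch of the conversion, assuming 1 MB means 10^6 bytes; the function names are illustrative and are not part of netperf:

```c
#include <assert.h>

/* Convert netperf's default TCP_STREAM unit (10^6 bits/sec) to
 * 10^6 bytes/sec. Assumes decimal megabytes, which is an assumption
 * made for this sketch. */
static double mbits_to_mbytes(double mbits_per_sec)
{
	return mbits_per_sec / 8.0;
}

/* Percentage change going from one throughput figure to another. */
static double percent_change(double from, double to)
{
	assert(from > 0.0);
	return (to - from) / from * 100.0;
}
```

With the figures above, 3869.82 works out to about 483.73 MB/sec and 4389.20 to 548.65 MB/sec, a difference of roughly 13% between two boots of the same kernel.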
From mshefty at ichips.intel.com Mon Oct 10 13:59:09 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Oct 2005 13:59:09 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <6.2.0.14.2.20051010125333.0259c078@esmail.cup.hp.com> References: <1128955559.4377.81.camel@hal.voltaire.com> <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com> <1128960287.4377.378.camel@hal.voltaire.com> <434AA780.60808@ichips.intel.com> <6.2.0.14.2.20051010125333.0259c078@esmail.cup.hp.com> Message-ID: <434AD61D.4060205@ichips.intel.com> Michael Krause wrote: >>> What about the case of iWARP <-> IB ? >> >> Crossing IB shouldn't matter. iWarp should simply cross the IB subnet >> using IPoIB. You could build a gateway to make the transfer across IB >> more efficient, but it's not required. > > I don't understand this statement. iWARP is RDMA based and if someone I was referring to the case where both endpoints are running over iWarp, with IB being one of the subnets being crossed. I believe that you're referring to one side running over iWarp, and the other running over IB, with an application level gateway in between. For the latter case, I would think that the gateway needs to establish iWarp connections for any IP addresses that reside on the IB subnet behind it, with a separate IB connection on the back-end. It seems to me that this would occur transparently to the application using iWarp. - Sean From rolandd at cisco.com Mon Oct 10 14:03:26 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 14:03:26 -0700 Subject: [openib-general] Re: Timeline of IPoIB performance In-Reply-To: <20051010200321.GC6633@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 10 Oct 2005 22:03:22 +0200") References: <521x2tgrim.fsf@cisco.com> <20051010200321.GC6633@mellanox.co.il> Message-ID: <52ek6tf5k1.fsf@cisco.com> Michael> Disabling irq balancing sometimes helps me make the Michael> numbers more stable. 
I don't think that's an issue. I'm running on x86_64, which I don't think has the kernel irq balancer, and I'm not running a userspace IRQ balancer. I can see all the mthca interrupts going to the CPU I set through the smp_affinity file. - R. From rolandd at cisco.com Mon Oct 10 14:05:08 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 14:05:08 -0700 Subject: [openib-general] IRQ sharing on PCIe bus In-Reply-To: <6.2.0.14.2.20051010124836.023afd08@esmail.cup.hp.com> (Michael Krause's message of "Mon, 10 Oct 2005 12:50:59 -0700") References: <52d5mdibp1.fsf@cisco.com> <6.2.0.14.2.20051010124836.023afd08@esmail.cup.hp.com> Message-ID: <52achhf5h7.fsf@cisco.com> Roland> BTW, for "INTx emulation" on PCI Express, there are no Roland> physical interrupt lines -- interrupts are asserted and Roland> deasserted with messages. So PCI Express interrupts are Roland> unshared. Michael> They are messages upstream that any device. That doesn't parse for me. Was what I said wrong? - R. From rolandd at cisco.com Mon Oct 10 14:05:43 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 14:05:43 -0700 Subject: [openib-general] Re: [PATCH] SRP: don't use TX IU after freeing it In-Reply-To: <434ACE48.1030208@mellanox.com> (Vu Pham's message of "Mon, 10 Oct 2005 13:25:44 -0700") References: <52vf0kii49.fsf@cisco.com> <433C1821.6000809@mellanox.com> <52zmpvhll8.fsf@cisco.com> <433C78A1.30207@mellanox.com> <434ACE48.1030208@mellanox.com> Message-ID: <5264s5f5g8.fsf@cisco.com> Vu> Have you got time to review this SRP's FMR patch? Sorry, no. I haven't had much time to work on SRP for the past few weeks. - R. 
From krause at cup.hp.com Mon Oct 10 14:09:45 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 10 Oct 2005 14:09:45 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <434AD61D.4060205@ichips.intel.com> References: <1128955559.4377.81.camel@hal.voltaire.com> <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com> <1128960287.4377.378.camel@hal.voltaire.com> <434AA780.60808@ichips.intel.com> <6.2.0.14.2.20051010125333.0259c078@esmail.cup.hp.com> <434AD61D.4060205@ichips.intel.com> Message-ID: <6.2.0.14.2.20051010140748.025c5fa0@esmail.cup.hp.com> At 01:59 PM 10/10/2005, Sean Hefty wrote: >Michael Krause wrote: >>>>What about the case of iWARP <-> IB ? >>> >>>Crossing IB shouldn't matter. iWarp should simply cross the IB subnet >>>using IPoIB. You could build a gateway to make the transfer across IB >>>more efficient, but it's not required. >>I don't understand this statement. iWARP is RDMA based and if someone > >I was referring to the case where both endpoints are running over iWarp, >with IB being one of the subnets being crossed. I believe that you're >referring to one side running over iWarp, and the other running over IB, >with an application level gateway in between. > >For the latter case, I would think that the gateway needs to establish >iWarp connections for any IP addresses that reside on the IB subnet behind >it, with a separate IB connection on the back-end. It seems to me that >this would occur transparently to the application using iWarp. iWARP with IB in between seems like a waste of time to do (very small if any market for such a beast). IB HCA on a host with an iWARP edge device may be reasonable but again seems like a waste to construct. These types of corner usage models while of interest to comprehend to see if there is any architectural issues to insure they are not precluded really are just that, corner cases, and little time or effort should be spent on their support. 
Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Mon Oct 10 14:14:49 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 14:14:49 -0700 Subject: [openib-general] Re: [PATCH] SRP: don't use TX IU after freeing it In-Reply-To: <433C78A1.30207@mellanox.com> (Vu Pham's message of "Thu, 29 Sep 2005 16:28:33 -0700") References: <52vf0kii49.fsf@cisco.com> <433C1821.6000809@mellanox.com> <52zmpvhll8.fsf@cisco.com> <433C78A1.30207@mellanox.com> Message-ID: <521x2tf512.fsf@cisco.com> OK, a few trivial comments: > +struct srp_device_data { > + struct list_head *dev_list; > + struct ib_pd *pd; > + struct ib_mr *mr; > + struct ib_fmr_pool *fmr_pool; > +}; Why put a pointer to struct list_head here instead of just a struct list_head? If you just used the struct, then you wouldn't need this: > + srp_data->dev_list = kmalloc(sizeof *srp_data->dev_list, GFP_KERNEL); > + if (!srp_data->dev_list) > + goto free_params_attr; > @@ -94,10 +115,14 @@ struct srp_request { > struct scsi_cmnd *scmnd; > struct srp_iu *cmd; > struct srp_iu *tsk_mgmt; > + DECLARE_PCI_UNMAP_ADDR(direct_mapping) > struct completion done; > short next; > u8 cmd_done; > u8 tsk_status; > + struct srp_fmr *fmr_arr; > + u16 fmr_cnt; > + u16 in_use; > }; I can't find anywhere that the in_use flag is used. > +static int srp_map_fmr(struct srp_target_port *target, struct scatterlist *scat, > + int sg_cnt, struct srp_request *req) [...] > + return -ENOMEM; > + } else if (fmr_cnt <= 0) { fmr_cnt is unsigned so I think this is going to get you in trouble. Might as well make fmr_cnt a plain int to make things simpler. Also, it might be good to try and add some more comments explaining srp_map_fmr() -- it would definitely help me review. - R. 
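The remark about fmr_cnt being unsigned can be seen in isolation. A minimal sketch, assuming a mapping routine that reports failure with a negative return value; map_pages() and both callers are hypothetical stand-ins for illustration, not code from the SRP patch:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mapping routine: returns the number of mappings on
 * success, or a negative errno-style value on failure. */
static int map_pages(int want)
{
	return want > 0 ? want : -12;	/* -12 mirrors -ENOMEM */
}

/* Result stored in an unsigned 16-bit counter: a negative error wraps
 * to a large positive value, so "<= 0" only catches an exact zero. */
static int failed_unsigned(int want)
{
	uint16_t cnt = (uint16_t) map_pages(want);

	return cnt <= 0;
}

/* Result stored in a plain int: the error check behaves as intended. */
static int failed_signed(int want)
{
	int cnt = map_pages(want);

	return cnt <= 0;
}

/* Self-check spelling out the difference between the two variants. */
static void check_error_detection(void)
{
	assert(!failed_unsigned(0));	/* failure goes unnoticed */
	assert(failed_signed(0));	/* failure is detected */
}
```

Keeping the counter as a plain int, as suggested in the review, avoids the wraparound without any extra casts.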
From mshefty at ichips.intel.com Mon Oct 10 14:25:10 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 10 Oct 2005 14:25:10 -0700 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <20051010200042.GB6633@mellanox.co.il> References: <434AAFDD.90208@ichips.intel.com> <20051010200042.GB6633@mellanox.co.il> Message-ID: <434ADC36.60101@ichips.intel.com> Michael S. Tsirkin wrote: >>Any objection to rdma_socket? > > Fine with me, this makes the intent of bind/listen explicit. I have rdma_cm_id right now, and will likely just keep it as that. >>>>>>+int rdma_resolve_route(struct rdma_id *id, int timeout_ms); >>> >>>I was trying to say, why doesn't rdma_connect just do this >>>transparently? Why do we need a separate call? >> >>Eventually rdma_connect will call this for the user if a route hasn't been >>resolved. At some point though, the API will likely need to be expanded to >>specify some sort of quality of service. > > I thought that would also happen at connect time. No? I went with the option of exposing the necessary functionality. Folding this into the connect call means that the user cannot view the returned route before deciding to establish a connection, and the CMA sets the timeout/retry policy for resolving routes. The only benefit of hiding this call is a slight code simplification for the user:

	case RDMA_CM_EVENT_ADDR_RESOLVED:
		ret = rdma_resolve_route(cma_id->context, timeout);
		if (ret)
			connect_error();
		break;
	case RDMA_CM_EVENT_ROUTE_RESOLVED:
		connect(cma_id->context);
		break;

becomes:

	case RDMA_CM_EVENT_ADDR_RESOLVED:
		connect(cma_id->context);
		break;

To make the API slightly easier to use, I thought of letting rdma_resolve_route() be optional. But, that makes it more difficult to handle device removal, and I'm not sure that it's even worth it. As for QoS, I'm not even sure that it shouldn't be specified when performing the address resolution, so that the correct device can be selected.
- Sean From iod00d at hp.com Mon Oct 10 14:26:52 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 10 Oct 2005 14:26:52 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <521x2tgrim.fsf@cisco.com> References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> Message-ID: <20051010212652.GG9613@esmail.cup.hp.com> On Mon, Oct 10, 2005 at 11:23:45AM -0700, Roland Dreier wrote: > > 2.6.12-rc5 in-kernel 1 405 <<<<< > > 2.6.12-rc4 in-kernel 1 470 <<<<< > > I was optimistic when I saw this, because the changeover to git > occurred with 2.6.12-rc2, so I thought I could use git bisect to track > down exactly when the performance regression happened. > > However, I haven't been able to get numbers that are stable enough to > track this down. I have two systems, both HP DL145s with dual Opteron > 875s and two-port mem-free PCI Express HCAs. I use MSI-X with the > completion interrupt affinity set to CPU 0, and "taskset 2" to run > netserver and netperf on CPU 1. As you know, opteron boxes are NUMA. I think you want MSI-X interrupt bound to the same CPU that's connected to the IO. Is CPU 0 closer to IO? I would bind netperf to CPU0 and netserver to CPU 1 on each box respectively. Or just try all 4 combinations to see which combinations are CPU bound vs memory/IO bound. > With default netperf parameters (just "-H otherguy") I get numbers > between ~490 MB/sec and ~550 MB/sec for 2.6.12-rc4 and 2.6.12-rc5. > The numbers are quite consistent between reboots, but if I reboot the > system (even keeping the kernel identical), I see large performance > changes. I gather you meant "tests" in the first phrase? (vs reboot). > Presumably something is happening like the cache coloring of > some hot data structures changing semi-randomly depending on the > timing of various initialations. My guess is based on the same premise. 
The mem-free card will be very sensitive to where its control data is allocated. Is either box configured to interleave memory from both CPUs? If it's interleaving, every other cacheline will be "local". Can you disable interleave and try different netperf/server bindings as suggested above? hth, grant From rick.jones2 at hp.com Mon Oct 10 14:22:40 2005 From: rick.jones2 at hp.com (Rick Jones) Date: Mon, 10 Oct 2005 14:22:40 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <52irw5f5t0.fsf@cisco.com> References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <434ACC74.3020404@hp.com> <52irw5f5t0.fsf@cisco.com> Message-ID: <434ADBA0.5070103@hp.com> Roland Dreier wrote: > Rick> Which rev of netperf are you using, and are you using the > Rick> "confidence intervals" options (-i, -I)? for a long time, > Rick> the linux-unique behaviour of returning the overhead bytes > Rick> for SO_[SND|RCV]BUF and them being 2X what one gives in > Rick> setsockopt() gave netperf some trouble - the socket buffer > Rick> would double in size each iteration on a confidence interval > Rick> run. Later netperf versions (late 2.3, and 2.4.X) have a > Rick> kludge for this. > > I believe it's netperf 2.2. That's rather old. I literally just put 2.4.1 out on ftp.cup.hp.com - probably better to use that if possible. Not that it will change the variability, just that I like it when people are up-to-date on the versions :) If nothing else, the 2.4.X version(s) have a much improved (hopefully) manual in doc/ [If you are really masochistic, the very first release of netperf 4.0.0 source has happened. I can make no guarantees as to its actually working at the moment though :) Netperf4 is going to be the stream for the multiple-connection, multiple system tests rather than the single-connection nature of netperf2] > I'm not using any confidence interval stuff.
However, the variation > is not between single runs of netperf -- if I do 5 runs of netperf in > a row, I get roughly the same number from each run. For example, I > might see something like > > TCP STREAM TEST to 192.168.145.2 : histogram > Recv Send Send > Socket Socket Message Elapsed > Size Size Size Time Throughput > bytes bytes bytes secs. 10^6bits/sec > > 87380 16384 16384 10.00 3869.82 > > and then > > TCP STREAM TEST to 192.168.145.2 : histogram > Recv Send Send > Socket Socket Message Elapsed > Size Size Size Time Throughput > bytes bytes bytes secs. 10^6bits/sec > > 87380 16384 16384 10.00 3862.41 > > for two successive runs. However, if I reboot the system into the > same kernel (ie everything set up exactly the same), the same > invocation of netperf might give > > TCP STREAM TEST to 192.168.145.2 : histogram > Recv Send Send > Socket Socket Message Elapsed > Size Size Size Time Throughput > bytes bytes bytes secs. 10^6bits/sec > > 87380 16384 16384 10.00 4389.20 > > Rick> Are there large changes in service demand along with the > Rick> large performance changes? > > Not sure. How do I have netperf report service demand? Ask for CPU utilization with -c (local) and -C (remote). The /proc/stat stuff used by Linux does not need calibration (IIRC) so you don't have to worry about that. If cache effects are involved, you can make netperf "harder" or "easier" on the caches by altering the size of the send and/or recv buffer rings. By default they are one more than the socket buffer size divided by the send size, but you can make them larger or smaller with the -W option. These days I use a 128K socket buffer and 32K send for the "canonical" (although not default :) netperf TCP_STREAM test: netperf -H remote -c -C -- -s 128K -S 128K -m 32K In netperf-speak K == 1024, k == 1000, M == 2^20, m == 10^6, G == 2^30, g == 10^9...
rick jones From rolandd at cisco.com Mon Oct 10 14:44:21 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 14:44:21 -0700 Subject: [openib-general] IBM eHCA testing.. In-Reply-To: (IBMEHCA DD's message of "Mon, 10 Oct 2005 09:23:59 +0200") References: Message-ID: <52oe5xdp3e.fsf@cisco.com> IBMEHCA> So you need some kind of signal from the operating system IBMEHCA> to system firmware, which in the eHCA case is the IBMEHCA> H_DEFINE_AQP1 triggered by ib_create_qp with IB_QPT_GSI IBMEHCA> parameter. AFTER that call handshaking between system IBMEHCA> firmware and the SM will start, here's a new adapter IBMEHCA> active on a switch port... what's your guid? here's your IBMEHCA> LID, p_key, SM lid.... ...and after all that it's IBMEHCA> possible to send and receive packets from the fabric. IBMEHCA> The openib stack expects that a port is fully functional IBMEHCA> after this create_qp returns, and starts to do all sorts IBMEHCA> of modify QP and post send. So the only choice we have IBMEHCA> there is to delay create_qp until the complete IBMEHCA> handshaking between system firmware and the SM has IBMEHCA> finished (until we see a IB_PORT_ACTIVE in hcad_mod). If IBMEHCA> we don't see that until EHCA_PORT_ACTIVE_TIMEOUT we have IBMEHCA> to return an error code to openib, otherwise we're IBMEHCA> seriously in trouble (tried that). I think this scheme breaks the IB model. When consumers get access to an HCA, they expect to be able to access the HCA, even if an SM has not configured it (and even in the case no cable is connected). As an example of why this is useful, if the link won't come up, it's nice to be able to get to query the port's PMA counters to see if there are excessive errors or something like that. I understand that you don't want to have all HCAs always visible to the SM, but the scheme you've chosen puts an unneeded dependency between driver initialization and the external SM. 
It would be fine if creating QP1 triggered the transition of the port from DOWN to INIT so that it is discoverable by the SM, but there's no reason for creation of QP1 to wait to finish until the SM has brought the port up. (As a side note, Mellanox HCAs don't bring a port to INIT until the host driver has transitioned QP0 to the RTR state, which seems more sensible than waiting for QP1 to be created) I hope this can be fixed in firmware with your current HCA hardware. - R. From mlleinin at hpcn.ca.sandia.gov Mon Oct 10 16:25:07 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 10 Oct 2005 16:25:07 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <521x2tgrim.fsf@cisco.com> References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> Message-ID: <1128986707.13945.424.camel@localhost> On Mon, 2005-10-10 at 11:23 -0700, Roland Dreier wrote: > > 2.6.12-rc5 in-kernel 1 405 <<<<< > > 2.6.12-rc4 in-kernel 1 470 <<<<< > > I was optimistic when I saw this, because the changeover to git > occurred with 2.6.12-rc2, so I thought I could use git bisect to track > down exactly when the performance regression happened. > > However, I haven't been able to get numbers that are stable enough to > track this down. I have two systems, both HP DL145s with dual Opteron > 875s and two-port mem-free PCI Express HCAs. I use MSI-X with the > completion interrupt affinity set to CPU 0, and "taskset 2" to run > netserver and netperf on CPU 1. > > With default netperf parameters (just "-H otherguy") I get numbers > between ~490 MB/sec and ~550 MB/sec for 2.6.12-rc4 and 2.6.12-rc5. > The numbers are quite consistent between reboots, but if I reboot the > system (even keeping the kernel identical), I see large performance > changes. 
Presumably something is happening like the cache coloring of > some hot data structures changing semi-randomly depending on the > timing of various initialations. > > Matt, how stable are your numbers? Pretty consistent. Here are a few runs with 2.6.12-rc5 with reboots in between each run. I'm using netperf-2.3pl1. Run 1: TCP STREAM TEST to 10.128.20.6 Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. KBytes /s % T % T us/KB us/KB 87380 16384 16384 10.00 410302.39 99.89 92.09 4.869 4.489 Run 2: (after another reboot) TCP STREAM TEST to 10.128.20.6 Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. KBytes /s % T % T us/KB us/KB 87380 16384 16384 10.00 409510.33 99.89 91.59 4.879 4.473 Run 3: (after reboot) TCP STREAM TEST to 10.128.20.6 Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. KBytes /s % T % T us/KB us/KB 87380 16384 16384 10.00 404354.11 99.89 91.39 4.941 4.520 I see the same variance in netperf results if I don't reboot between runs. - Matt > From iod00d at hp.com Mon Oct 10 16:30:54 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 10 Oct 2005 16:30:54 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <20051010212652.GG9613@esmail.cup.hp.com> References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <20051010212652.GG9613@esmail.cup.hp.com> Message-ID: <20051010233054.GA11213@esmail.cup.hp.com> On Mon, Oct 10, 2005 at 02:26:52PM -0700, Grant Grundler wrote: ... > If it's interleaving, every other cacheline will be "local". 
ISTR AMD64 was page-interleaved but then got confused by documents describing "128-bit" 2-way interleave. I now realize the 128bit is refering to interleave between two "banks" of memory behind each memory controller. ie 2 * 128-bit provides in the 32-byte cacheline size that most x86 programs expect. Anyway, I'm hoping that we'll see a consistent result if node interleave is turned off. sorry for the confusion, grant From rolandd at cisco.com Mon Oct 10 16:38:13 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 16:38:13 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <1128986707.13945.424.camel@localhost> (Matt Leininger's message of "Mon, 10 Oct 2005 16:25:07 -0700") References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <1128986707.13945.424.camel@localhost> Message-ID: <52br1xdjtm.fsf@cisco.com> Matt> Pretty consistent. Here are a few runs with 2.6.12-rc5 Matt> with reboots in between each run. I'm using netperf-2.3pl1. That's interesting. I'm guessing you're using mem-ful HCAs? Given that your results are more stable than mine, if you're up for it, you could install git, clone Linus's tree, and then do a git bisect between 2.6.12-rc4 and 2.6.12-rc5 to narrow down the regression to a single commit (if in fact that's possible). - R. From mlleinin at hpcn.ca.sandia.gov Mon Oct 10 16:42:52 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 10 Oct 2005 16:42:52 -0700 Subject: [openib-general] Lustre Network Driver - KDAPL or verbs? In-Reply-To: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> References: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> Message-ID: <1128987772.13945.439.camel@localhost> On Sun, 2005-10-09 at 17:17 -0400, Peter J. 
Braam wrote: > Cluster File Systems, Inc and its customers have been wondering if the > Lustre Network Driver (LND) for OpenIb gen2, which we will begin to > develop during the coming months, should be based on kdapl or verbs. > > The driver we plan to develop should strive to address several goals: > - high reliability and performance > - allow interoperability between user and kernel level > - allow interoperability, or better, portability among different > operating systems (Linux, OS X, Windows, Solaris) > - be suitable for inclusion in the Linux kernel > These last two bullets are mutually exclusive. Submitting code for inclusion in Linux that contains an OS abstraction is a sure way to get your code rejected. It happened to the IBAL stack and it will happen again unless you focus on a Linux-specific "Lustre network driver". As a customer of IB products and Lustre, I'd recommend coding to the OpenIB Verbs layer and using the new CM code as it develops (as Fab described). It's not difficult to port from VAPI to OpenIB Verbs so your current VAPI NAL would be a good starting point. It would be great to see fewer Lustre kernel patches and more of Lustre in the Linux kernel. Thanks, - Matt From mlleinin at hpcn.ca.sandia.gov Mon Oct 10 16:44:57 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 10 Oct 2005 16:44:57 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <52br1xdjtm.fsf@cisco.com> References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <1128986707.13945.424.camel@localhost> <52br1xdjtm.fsf@cisco.com> Message-ID: <1128987897.13952.441.camel@localhost> On Mon, 2005-10-10 at 16:38 -0700, Roland Dreier wrote: > Matt> Pretty consistent. Here are a few runs with 2.6.12-rc5 > Matt> with reboots in between each run. I'm using netperf-2.3pl1. > > That's interesting. I'm guessing you're using mem-ful HCAs? Yes, I'm using mem-full HCAs.
I could try reflashing the firmware for memfree if that's of interest. > > Given that your results are more stable than mine, if you're up for > it, you could install git, clone Linus's tree, and then do a git > bisect between 2.6.12-rc4 and 2.6.12-rc5 to narrow down the regression > to a single commit (if in fact that's possible). I was hoping someone else would do this. :) I'll start working on it tomorrow if no one else gets to it. Thanks, - Matt From rolandd at cisco.com Mon Oct 10 16:53:12 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 16:53:12 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <1128987897.13952.441.camel@localhost> (Matt Leininger's message of "Mon, 10 Oct 2005 16:44:57 -0700") References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <1128986707.13945.424.camel@localhost> <52br1xdjtm.fsf@cisco.com> <1128987897.13952.441.camel@localhost> Message-ID: <527jcldj4n.fsf@cisco.com> Matt> Yes, I'm using mem-full HCAs. I could try reflashing the Matt> firmware for memfree if that's of interest. No, probably not. If I get a chance I'll do the opposite (flash mem-free -> mem-full, since my HCAs do have memory) and see if it makes my results stable. Matt> I was hoping someone else would do this. :) I'll start Matt> working on it tomorrow if no one else gets to it. I might get a chance to do it tonight... I'll post if I do. - R. 
From halr at voltaire.com Mon Oct 10 17:33:28 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 10 Oct 2005 20:33:28 -0400 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <434AA780.60808@ichips.intel.com> References: <1128955559.4377.81.camel@hal.voltaire.com> <469958e00510100850r106feb56x17e0fbddb9a5ee83@mail.gmail.com> <1128960287.4377.378.camel@hal.voltaire.com> <434AA780.60808@ichips.intel.com> Message-ID: <1128990622.4377.3828.camel@hal.voltaire.com> On Mon, 2005-10-10 at 13:40, Sean Hefty wrote: > Hal Rosenstock wrote: > > What about the case of iWARP <-> IB ? > > Crossing IB shouldn't matter. iWarp should simply cross the IB subnet using > IPoIB. You could build a gateway to make the transfer across IB more efficient, > but it's not required. I was referring to gatewaying to an IB end client from iWARP. -- Hal From ak at suse.de Mon Oct 10 17:51:22 2005 From: ak at suse.de (Andi Kleen) Date: Tue, 11 Oct 2005 02:51:22 +0200 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <20051010233054.GA11213@esmail.cup.hp.com> References: <1128672413.13948.326.camel@localhost> <20051010212652.GG9613@esmail.cup.hp.com> <20051010233054.GA11213@esmail.cup.hp.com> Message-ID: <200510110251.22442.ak@suse.de> On Tuesday 11 October 2005 01:30, Grant Grundler wrote: > On Mon, Oct 10, 2005 at 02:26:52PM -0700, Grant Grundler wrote: > ... > > > If it's interleaving, every other cacheline will be "local". > > ISTR AMD64 was page-interleaved but then got confused by documents > describing "128-bit" 2-way interleave. I now realize the 128bit > is refering to interleave between two "banks" of memory behind > each memory controller. ie 2 * 128-bit provides in the 32-byte > cacheline size that most x86 programs expect. The cache line size on K7 and K8 is 64 bytes. > Anyway, I'm hoping that we'll see a consistent result if node interleave > is turned off. Yes usually a good idea. 
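As a rough illustration of the two interleave granularities discussed in this exchange, here is a minimal C sketch. It assumes a simplified model in which consecutive fixed-size chunks of physical memory are dealt out to the nodes in turn; the function name node_of and the 4 KiB page size are assumptions made for the example, and real Opteron address decoding is more involved.

```c
#include <stdint.h>

#define CACHELINE_SHIFT 6    /* 64-byte cache lines, as on K7/K8 */
#define PAGE_SHIFT      12   /* assumes 4 KiB pages */

/* Node that owns a physical address when memory is interleaved across
 * `nodes` nodes in chunks of (1 << granule_shift) bytes.
 * Hypothetical helper for illustration only. */
unsigned node_of(uint64_t phys_addr, unsigned granule_shift, unsigned nodes)
{
    return (unsigned)((phys_addr >> granule_shift) % nodes);
}
```

On a two-node system, node_of(addr, CACHELINE_SHIFT, 2) alternates with every 64-byte line, which is the "every other cacheline is local" picture, while node_of(addr, PAGE_SHIFT, 2) keeps each whole page on one node and alternates only from page to page.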
-Andi From rolandd at cisco.com Mon Oct 10 21:03:35 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 10 Oct 2005 21:03:35 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <527jcldj4n.fsf@cisco.com> (Roland Dreier's message of "Mon, 10 Oct 2005 16:53:12 -0700") References: <1128672413.13948.326.camel@localhost> <52br20lsei.fsf@cisco.com> <1128738350.13945.369.camel@localhost> <521x2tgrim.fsf@cisco.com> <1128986707.13945.424.camel@localhost> <52br1xdjtm.fsf@cisco.com> <1128987897.13952.441.camel@localhost> <527jcldj4n.fsf@cisco.com> Message-ID: <52u0fod7jc.fsf@cisco.com> Roland> I might get a chance to do it tonight... I'll post if I do. I'm giving it a shot but I just can't reproduce this well on my systems. I do see a pretty big regression between 2.6.12-rc4 and 2.6.14-rc2, but 2.6.12-rc5 looks OK on my systems. I reflashed to FW 4.7.0 (mem-ful) and built netperf 2.4.1. With 2.6.12-rc4 I've seen runs as slow as: TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.145.2 (192.168.145.2) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. MBytes /s % S % U us/KB us/KB 87380 16384 16384 10.00 553.71 37.46 -1.00 2.642 -1.000 and with 2.6.12-rc5 I've seen runs as fast as: TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.145.2 (192.168.145.2) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. MBytes /s % S % U us/KB us/KB 87380 16384 16384 10.00 581.82 39.58 -1.00 2.657 -1.000 so not much difference there. 
With 2.6.14-rc2, the best of 10 runs was: TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.145.2 (192.168.145.2) port 0 AF_INET Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. MBytes /s % S % U us/KB us/KB 87380 16384 16384 10.01 497.00 39.71 -1.00 3.121 -1.000 so we've definitely lost something there. Time to do some more bisecting... - R. From SCHICKHJ at de.ibm.com Mon Oct 10 23:07:24 2005 From: SCHICKHJ at de.ibm.com (Heiko J Schick) Date: Tue, 11 Oct 2005 08:07:24 +0200 Subject: [openib-general] Wrong minor number for /dev/uat in README file Message-ID: Hello, I think the minor number for /dev/uat in /src/userspace/libibat/README is wrong. mknod /dev/infiniband/uat c 231 254 should be replaced by mknod /dev/infiniband/uat c 231 191 At least, the file /src/linux-kernel/infiniband/core/uat.c has the following content: enum { IB_UAT_MAJOR = 231, IB_UAT_MINOR = 191 }; Many thanks in advance! Mit freundlichen Gruessen / Kind Regards Heiko Joerg Schick IBM Deutschland Entwicklung GmbH I/Ox Microcode Development Linux Infiniband Device Drivers Schoenaicher Str. 220 71032 Boeblingen E-Mail: schickhj at de.ibm.com External: 49-7031-16-0 x4219, t/l: 120-4219 From yael at mellanox.co.il Tue Oct 11 01:28:31 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 11 Oct 2005 10:28:31 +0200 Subject: [openib-general] [PATCH] Opensm - handling immediate error in vendor_send new Message-ID: <5zslv8wj80.fsf@mtl066.yok.mtl.com> Hi Hal, Attached is a new patch with several fixes for this issue. I decided to remove the checking for zero in the atomic_dec after all, since as I mentioned before - clearing it is not a fix, and we will see the value in other infos in the log file. 
Thanks, Yael Signed-off-by: Yael Kalka Index: include/opensm/osm_vl15intf.h =================================================================== --- include/opensm/osm_vl15intf.h (revision 3704) +++ include/opensm/osm_vl15intf.h (working copy) @@ -55,11 +55,13 @@ #include #include #include +#include #include #include #include #include #include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -137,6 +139,9 @@ typedef struct _osm_vl15 osm_vendor_t *p_vend; osm_log_t *p_log; osm_stats_t *p_stats; + osm_subn_t *p_subn; + cl_disp_reg_handle_t h_disp; + cl_plock_t *p_lock; } osm_vl15_t; /* @@ -176,6 +181,15 @@ typedef struct _osm_vl15 * p_stats * Pointer to the OpenSM statistics block. * +* p_subn +* Pointer to the Subnet object for this subnet. +* +* h_disp +* Handle returned from dispatcher registration. +* +* p_lock +* Pointer to the serializing lock. +* * SEE ALSO * VL15 object *********/ @@ -265,7 +279,10 @@ osm_vl15_init( IN osm_vendor_t* const p_vend, IN osm_log_t* const p_log, IN osm_stats_t* const p_stats, - IN const int32_t max_wire_smps ); + IN const int32_t max_wire_smps, + IN osm_subn_t* const p_subn, + IN cl_dispatcher_t* const p_disp, + IN cl_plock_t* const p_lock ); /* * PARAMETERS * p_vl15 @@ -283,6 +300,15 @@ osm_vl15_init( * max_wire_smps * [in] Maximum number of MADs allowed on the wire at one time. * +* p_subn +* [in] Pointer to the subnet object. +* +* p_disp +* [in] Pointer to the dispatcher object. +* +* p_lock +* [in] Pointer to the OpenSM serializing lock. +* * RETURN VALUES * IB_SUCCESS if the VL15 object was initialized successfully. 
* Index: opensm/osm_opensm.c =================================================================== --- opensm/osm_opensm.c (revision 3704) +++ opensm/osm_opensm.c (working copy) @@ -257,7 +257,8 @@ osm_opensm_init( status = osm_vl15_init( &p_osm->vl15, p_osm->p_vendor, - &p_osm->log, &p_osm->stats, p_opt->max_wire_smps ); + &p_osm->log, &p_osm->stats, p_opt->max_wire_smps, + &p_osm->subn, &p_osm->disp, &p_osm->lock ); if( status != IB_SUCCESS ) goto Exit; Index: opensm/osm_vl15intf.c =================================================================== --- opensm/osm_vl15intf.c (revision 3704) +++ opensm/osm_vl15intf.c (working copy) @@ -157,6 +157,8 @@ __osm_vl15_poller( if( status != IB_SUCCESS ) { + uint32_t outstanding; + cl_status_t cl_status; osm_log( p_vl->p_log, OSM_LOG_ERROR, "__osm_vl15_poller: ERR 3E03: " "MAD send failed (%s).\n", @@ -166,7 +168,69 @@ __osm_vl15_poller( The MAD was never successfully sent, so fix up the pre-incremented count values. */ + /* Decrement qp0_mads_sent and qp0_mads_outstanding_on_wire + that was incremented in the code above. */ mads_sent = cl_atomic_dec( &p_vl->p_stats->qp0_mads_sent ); + if( p_madw->resp_expected == TRUE ) + cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding_on_wire ); + + /* + The following code is similar to the one in + __osm_sm_mad_ctrl_retire_trans_mad. We need to decrement the + qp0_mads_outstanding counter, and if we reached 0 - need to call + the cl_disp_post with OSM_SIGNAL_NO_PENDING_TRANSACTION (in order + to wake up the state mgr). + */ + cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); + + osm_log( p_vl->p_log, OSM_LOG_DEBUG, + "__osm_vl15_poller: " + "%u QP0 MADs outstanding.\n", + p_vl->p_stats->qp0_mads_outstanding ); + + /* + Acquire the lock non-exclusively. + Other modules that send MADs grab this lock exclusively. + These modules that are in the process of sending MADs + will hold the lock until they finish posting all the MADs + they plan to send. 
While the other module is sending MADs + the outstanding count may temporarily go to zero. + Thus, by grabbing the lock ourselves, we get an accurate + view of whether or not the number of outstanding MADs is + really zero. + */ + CL_PLOCK_ACQUIRE( p_vl->p_lock ); + outstanding = p_vl->p_stats->qp0_mads_outstanding; + CL_PLOCK_RELEASE( p_vl->p_lock ); + + if( outstanding == 0 ) + { + /* + The wire is clean. + Signal the state manager. + */ + if( osm_log_is_active( p_vl->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_vl->p_log, OSM_LOG_DEBUG, + "__osm_vl15_poller: " + "Posting Dispatcher message %s.\n", + osm_get_disp_msg_str( OSM_MSG_NO_SMPS_OUTSTANDING ) ); + } + + cl_status = cl_disp_post( p_vl->h_disp, + OSM_MSG_NO_SMPS_OUTSTANDING, + (void *)OSM_SIGNAL_NO_PENDING_TRANSACTIONS, + NULL, + NULL ); + + if( cl_status != CL_SUCCESS ) + { + osm_log( p_vl->p_log, OSM_LOG_ERROR, + "__osm_vl15_poller: ERR 3E06: " + "Dispatcher post message failed (%s).\n", + CL_STATUS_MSG( cl_status ) ); + } + } } else { @@ -232,6 +296,7 @@ osm_vl15_construct( cl_qlist_init( &p_vl->rfifo ); cl_qlist_init( &p_vl->ufifo ); cl_thread_construct( &p_vl->poller ); + p_vl->h_disp = CL_DISP_INVALID_HANDLE; } /********************************************************************** @@ -281,6 +346,8 @@ osm_vl15_destroy( p_vl->state = OSM_VL15_STATE_INIT; cl_spinlock_destroy( &p_vl->lock ); + cl_disp_unregister( p_vl->h_disp ); + OSM_LOG_EXIT( p_vl->p_log ); } @@ -292,7 +359,11 @@ osm_vl15_init( IN osm_vendor_t* const p_vend, IN osm_log_t* const p_log, IN osm_stats_t* const p_stats, - IN const int32_t max_wire_smps ) + IN const int32_t max_wire_smps, + IN osm_subn_t* const p_subn, + IN cl_dispatcher_t* const p_disp, + IN cl_plock_t* const p_lock + ) { ib_api_status_t status = IB_SUCCESS; OSM_LOG_ENTER( p_log, osm_vl15_init ); @@ -301,6 +372,8 @@ osm_vl15_init( p_vl->p_log = p_log; p_vl->p_stats = p_stats; p_vl->max_wire_smps = max_wire_smps; + p_vl->p_subn = p_subn; + p_vl->p_lock = p_lock; status = 
cl_event_init( &p_vl->signal, FALSE ); if( status != IB_SUCCESS ) @@ -321,6 +394,21 @@ osm_vl15_init( if( status != IB_SUCCESS ) goto Exit; + p_vl->h_disp = cl_disp_register( + p_disp, + CL_DISP_MSGID_NONE, + NULL, + NULL ); + + if( p_vl->h_disp == CL_DISP_INVALID_HANDLE ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_vl15_init: ERR 3E01: " + "Dispatcher registration failed.\n" ); + status = IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + Exit: OSM_LOG_EXIT( p_log ); return( status ); From mohitka at noida.hcltech.com Tue Oct 11 02:58:19 2005 From: mohitka at noida.hcltech.com (Mohit Katiyar, Noida) Date: Tue, 11 Oct 2005 15:28:19 +0530 Subject: [openib-general] SRP & Infiniband Message-ID: <3E6BB9CEE261E2428AD25D0D553DC4970142EA36@HSDLNTD1110010.noida.hcltech.com> Hi all, I am a newcomer to InfiniBand, still at the investigation stage, and I have a question. I am not clear about what the user-level HCA driver does. Is there a specification for it, or is it entirely vendor-specific? It is also said to be used for speed path operations. Does anyone know how it accomplishes that? If I have SCSI storage devices in a SAN, can I use the SRP module to send some requests and the user-mode HCA library for some speed path operations? Basically, I want to know whether the user-mode HCA library can be used for speed path operations on SCSI devices. If yes, how can it be used? (Theoretical details are enough; I will try the rest myself.) Thanks in advance for all the help. Mohit Katiyar -------------- next part -------------- An HTML attachment was scrubbed...
URL: From halr at voltaire.com Tue Oct 11 03:50:15 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Oct 2005 06:50:15 -0400 Subject: [openib-general] Wrong minor number for /dev/uat in README file In-Reply-To: References: Message-ID: <1129027815.4377.6876.camel@hal.voltaire.com> On Tue, 2005-10-11 at 02:07, Heiko J Schick wrote: > Hello, > > I think the minor number for /dev/uat in /src/userspace/libibat/README is > wrong. > > mknod /dev/infiniband/uat c 231 254 > should be replaced by > mknod /dev/infiniband/uat c 231 191 > > At least, the file /src/linux-kernel/infiniband/core/uat.c has the > following content: > > enum { > IB_UAT_MAJOR = 231, > IB_UAT_MINOR = 191 > }; > > Many thanks in advance! Thanks. The README wasn't updated when this occured (on 9/15). -- Hal From yael at mellanox.co.il Tue Oct 11 05:24:49 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 11 Oct 2005 14:24:49 +0200 Subject: [openib-general] [PATCH] Opensm - enabling erase of log file flag Message-ID: <5z1x2sxmum.fsf@mtl066.yok.mtl.com> Hi Hal, Currently the osm log file is accumulative. I've added an option to erase the log file before starting to write it. By default, still, the log is still accumulative. Attached is a patch for that. Thanks, Yael Signed-off-by: Yael Kalka Index: include/opensm/osm_subnet.h =================================================================== --- include/opensm/osm_subnet.h (revision 3704) +++ include/opensm/osm_subnet.h (working copy) @@ -220,6 +220,7 @@ typedef struct _osm_subn_opt uint8_t log_flags; char * dump_files_dir; char * log_file; + boolean_t accum_log_file; cl_map_t port_pro_ignore_guids; boolean_t port_profile_switch_nodes; uint32_t max_port_profile; @@ -319,6 +320,10 @@ typedef struct _osm_subn_opt * log_file * Name of the log file (or NULL) for stdout. * +* accum_log_file +* If TRUE (default) - the log file will be accumulated. +* If FALSE - the log file will be erased before starting current opensm run. 
+* * port_pro_ignore_guids * A map of guids to be ignored by port profiling. * Index: include/opensm/osm_log.h =================================================================== --- include/opensm/osm_log.h (revision 3704) +++ include/opensm/osm_log.h (working copy) @@ -218,7 +218,8 @@ osm_log_init( IN osm_log_t* const p_log, IN const boolean_t flush, IN const uint8_t log_flags, - IN const char *log_file) + IN const char *log_file, + IN const boolean_t accum_log_file ) { p_log->level = log_flags; p_log->flush = flush; @@ -229,10 +230,18 @@ osm_log_init( } else { + if (accum_log_file) p_log->out_port = fopen(log_file,"a+"); + else + p_log->out_port = fopen(log_file,"w+"); + if (!p_log->out_port) { + if (accum_log_file) printf("Cannot open %s for appending. Permission denied\n", log_file); + else + printf("Cannot open %s for writing. Permission denied\n", log_file); + return(IB_UNKNOWN_ERROR); } } Index: complib/cl_event_wheel.c =================================================================== --- complib/cl_event_wheel.c (revision 3704) +++ complib/cl_event_wheel.c (working copy) @@ -597,7 +597,7 @@ main () cl_event_wheel_construct( &event_wheel ); /* init */ - osm_log_init( &log, TRUE, 0xff, NULL); + osm_log_init( &log, TRUE, 0xff, NULL, FALSE); cl_event_wheel_init( &event_wheel, &log ); /* Start Playing */ Index: osmtest/osmtest.c =================================================================== --- osmtest/osmtest.c (revision 3704) +++ osmtest/osmtest.c (working copy) @@ -507,7 +507,7 @@ osmtest_init( IN osmtest_t * const p_osm osmtest_construct( p_osmt ); status = osm_log_init( &p_osmt->log, p_opt->force_log_flush, - 0x0001, p_opt->log_file ); + 0x0001, p_opt->log_file, TRUE ); if( status != IB_SUCCESS ) return ( status ); /* but we do not want any extra staff here */ Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 3704) +++ opensm/osm_subnet.c (working copy) @@ -427,6 +427,7 
@@ osm_subn_set_default_opt( p_opt->dump_files_dir = OSM_DEFAULT_TMP_DIR; p_opt->log_file = OSM_DEFAULT_LOG_FILE; + p_opt->accum_log_file = TRUE; p_opt->port_profile_switch_nodes = FALSE; p_opt->max_port_profile = 0xffffffff; p_opt->pfn_ui_pre_lid_assign = NULL; @@ -754,6 +755,10 @@ osm_subn_parse_conf_file( __osm_subn_opts_unpack_charp( "log_file" , p_key, p_val, &p_opts->log_file); + __osm_subn_opts_unpack_boolean( + "accum_log_file", + p_key, p_val, &p_opts->accum_log_file); + __osm_subn_opts_unpack_charp( "dump_files_dir" , p_key, p_val, &p_opts->dump_files_dir); @@ -920,6 +925,7 @@ osm_subn_write_conf_file( "force_log_flush %s\n\n" "# Log file to be used\n" "log_file %s\n\n" + "accum_log_file %s\n\n" "# The directory to hold the file OpenSM dumps\n" "dump_files_dir %s\n\n" "# If TRUE if OpenSM should disable multicast support\n" @@ -929,6 +935,7 @@ osm_subn_write_conf_file( p_opts->log_flags, p_opts->force_log_flush ? "TRUE" : "FALSE", p_opts->log_file, + p_opts->accum_log_file ? "TRUE" : "FALSE", p_opts->dump_files_dir, p_opts->no_multicast_option ? "TRUE" : "FALSE", p_opts->disable_multicast ?
"TRUE" : "FALSE" Index: opensm/osm_db_files.c =================================================================== --- opensm/osm_db_files.c (revision 3704) +++ opensm/osm_db_files.c (working copy) @@ -673,7 +673,7 @@ main(int argc, char **argv) cl_list_construct( &keys ); cl_list_init( &keys, 10 ); - osm_log_init( &log, TRUE, 0xff, "/tmp/test_osm_db.log"); + osm_log_init( &log, TRUE, 0xff, "/tmp/test_osm_db.log", FALSE); osm_db_construct(&db); if (osm_db_init(&db, &log)) Index: opensm/osm_opensm.c =================================================================== --- opensm/osm_opensm.c (revision 3704) +++ opensm/osm_opensm.c (working copy) @@ -205,7 +205,7 @@ osm_opensm_init( osm_opensm_construct( p_osm ); status = osm_log_init( &p_osm->log, p_opt->force_log_flush, - p_opt->log_flags, p_opt->log_file ); + p_opt->log_flags, p_opt->log_file, p_opt->accum_log_file ); if( status != IB_SUCCESS ) return ( status ); Index: opensm/main.c =================================================================== --- opensm/main.c (revision 3704) +++ opensm/main.c (working copy) @@ -167,6 +167,11 @@ show_usage(void) " This option defines the log to be the given file.\n" " By default the log goes to /var/log/osm.log.\n" " For the log to go to standard output use -f stdout.\n\n"); + printf( "-e\n" + "--erase_log_file\n" + " This option will cause deletion of the log file \n" + " (if it previously exists). 
By default, the log file \n" + " is accumulative.\n\n"); printf( "-v\n" "--verbose\n" " This option increases the log verbosity level.\n" @@ -447,7 +452,7 @@ main( boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:d:g:l:s:t:vVhorc"; + const char * const short_option = "i:f:ed:g:l:s:t:vVhorc"; /* In the array below, the 2nd parameter specified the number @@ -467,6 +472,7 @@ main( { "verbose", 0, NULL, 'v'}, { "D", 1, NULL, 'D'}, { "log_file", 1, NULL, 'f'}, + { "erase_log_file",0, NULL, 'e'}, { "maxsmps", 1, NULL, 'n'}, { "V", 0, NULL, 'V'}, { "help", 0, NULL, 'h'}, @@ -636,6 +642,11 @@ main( opt.log_file = optarg; break; + case 'e': + opt.accum_log_file = FALSE; + printf(" Creating new log file\n"); + break; + case 'v': log_flags = (log_flags <<1 )|1; printf(" Verbose option -v (log flags = 0x%X)\n", log_flags ); From SCHICKHJ at de.ibm.com Tue Oct 11 05:43:42 2005 From: SCHICKHJ at de.ibm.com (Heiko J Schick) Date: Tue, 11 Oct 2005 14:43:42 +0200 Subject: [openib-general] IBM eHCA testing.. Message-ID: Hello Troy, this morning I looked in detail into the problem you reported on Oct 10 via the OpenIB mailing-list [1]. It seems that the kernel panic is an IPoIB issue. [1]: http://openib.org/pipermail/openib-general/2005-October/012353.html The following happens: 1. modprobe hcad_mod ehca_nr_ports=1 The eHCA InfiniBand Device Driver is loaded. 2. modprobe ib_mad The ib_mad stack creates an AQP1. This will start the port activation process. By my count it will take more than 110 / 120 seconds to activate a port. Our device driver gets a timeout, which means that the port is NOT active, and ib_modify_qp will not work (for any QP, doesn't matter if it was created in the ib_mad stack or in the ib_ipoib stack). 3. modprobe ib_ipoib All resources for IPoIB are allocated (CQ, QPs, MR, etc.) 4.
A user runs ifconfig ib0 xxx.xxx.xxx.xxx which executes the following functions: ipoib_open -> ipoib_ib_dev_open -> ipoib_qp_create. The user should see the following error message: l2:/home/schickhj/ibt/linstack/ehca2/ehca2 # ifconfig ib0 192.168.8.8 SIOCSIFFLAGS: Invalid argument 5. The function ipoib_qp_create modifies the QP from Reset to Init to RTR to RTS. If one of these three ib_modify_qp calls fails, the IPoIB QP (priv->qp) will be destroyed (by the ipoib_qp_create error routine / out_fail) and priv->qp will be NULL. --> see /src/linux-kernel/infiniband/ulp/ipoib/ipoib_verbs.c function ipoib_qp_create 6. A user runs (again) ifconfig ib0 xxx.xxx.xxx which executes (again) the following functions: ipoib_open -> ipoib_ib_dev_open -> ipoib_qp_create 7. ipoib_qp_create wants to modify the IPoIB QP (priv->qp) which is NULL, because the QP was destroyed earlier by the error handling routine in ipoib_qp_create (see 5.) I think this error could also show up on Mellanox-based IB cards when ib_modify_qp fails in ipoib_qp_create. In dmesg you should see: (see 1.) eHCA Infiniband Device Driver (Rel.: ) xics_enable_irq: irq=9029: ibm_int_on returned fffffffd eHCA Infiniband Device Driver (Rel.: ) (see 2.) PU0000 000b0078:ehca_define_sqp HCAD_ERROR Port 1 is not active. PU0000 000b0387:ehca_create_qp HCAD_ERROR ehca_define_sqp() failed rc=ffffffffffffffff PU0000 000b03ae:ehca_create_qp <<< failed ret=ffffffea ib_mad: Couldn't create ib_mad QP1 ib_mad: Couldn't open ehca0 port 1 PU0001 00060103:ehca_parse_ec EHCA port 1 is available. PU0000 000b00bd:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN r3=168 r4=1001000503000004 r5=200100000000002c r6=8a40000000000000 3ed48000 r8=0 r9=0 r10=0 PU0000 000b00c4:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_OUT r3=ffffffffffffffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=800000000005aa18 r10=0 (see 4.)
PU0000 000b0564:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffffffffffffffd3 ehca_qp=c000000003ba4e00 qp_num=2c
ib0: failed to modify QP to init, ret = -22
ib0: ipoib_qp_create returned -22

Mit freundlichen Gruessen / Kind Regards

Heiko Joerg Schick
IBM Deutschland Entwicklung GmbH
I/Ox Microcode Development
Linux Infiniband Device Drivers
Schoenaicher Str. 220
71032 Boeblingen
E-Mail: schickhj at de.ibm.com
External: 49-7031-16-0 x4219, t/l: 120-4219

From halr at voltaire.com Tue Oct 11 05:42:37 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Oct 2005 08:42:37 -0400
Subject: [openib-general] Re: [PATCH] Opensm - handling immediate error in vendor_send new
In-Reply-To: <5zslv8wj80.fsf@mtl066.yok.mtl.com>
References: <5zslv8wj80.fsf@mtl066.yok.mtl.com>
Message-ID: <1129034556.4377.7616.camel@hal.voltaire.com>

Hi Yael,

On Tue, 2005-10-11 at 04:28, Yael Kalka wrote:
> Attached is a new patch with several fixes for this issue.

Thanks. Applied. There were still extra whitespace issues which I fixed by hand. Please try to eliminate these so I don't have to do hand touch ups.

> I decided to remove the checking for zero in the atomic_dec after all,
> since as I mentioned before - clearing it is not a fix, and we will
> see the value in other infos in the log file.

But there is danger if these counters wrap, right ?

Also, in looking further at the code, the same issue does not appear to occur for QP1 handling, right ?

-- Hal

From halr at voltaire.com Tue Oct 11 05:48:25 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Oct 2005 08:48:25 -0400
Subject: [openib-general] IBM eHCA testing..
In-Reply-To:
References:
Message-ID: <1129034904.4377.7666.camel@hal.voltaire.com>

Hi Heiko,

On Tue, 2005-10-11 at 08:43, Heiko J Schick wrote:
> this morning I've looked in detail into the problem you've reported on Oct
> 10 via the OpenIB mailing-list [1]. It seems that the kernel panic is an
> IPoIB issues.
> > [1]: http://openib.org/pipermail/openib-general/2005-October/012353.html > > The following things appens: > > 1. modprobe hcad_mod ehca_nr_ports=1 > The eHCA InfiniBand Device Driver is loaded. > > 2. modprobe ib_mad > The ib_mad stack creates an AQP1. This will start the port > activation process. > By my count it will take more than 110 / 120 seconds to activate a > port. > Our device driver gets a timeout, which means that the port is NOT > active. and > ib_modify_qp will not work (for any QP, doesn't matter if it was > created in the ib_mad > stack or in the ib_ipoib stack). Where does this time to activate a port come from ? Is there some maximum time in which the eHCA firmware requires this to be completed ? -- Hal From SCHICKHJ at de.ibm.com Tue Oct 11 06:21:34 2005 From: SCHICKHJ at de.ibm.com (Heiko J Schick) Date: Tue, 11 Oct 2005 15:21:34 +0200 Subject: [openib-general] IBM eHCA testing.. In-Reply-To: <1129034904.4377.7666.camel@hal.voltaire.com> Message-ID: Hello Hal, normally the timeout is set to 30 seconds. If you need more information about the "activation" please see [1]. [1]: http://openib.org/pipermail/openib-general/2005-October/012355.html Mit freundlichen Gruessen / Kind Regards Heiko Joerg Schick IBM Deutschland Entwicklung GmbH I/Ox Microcode Development Linux Infiniband Device Drivers Schoenaicher Str. 220 71032 Boeblingen E-Mail: schickhj at de.ibm.com External: 49-7031-16-0 x4219, t/l: 120-4219 Hal Rosenstock 11.10.2005 14:48 To Heiko J Schick/Germany/IBM at IBMDE cc openib-general at openib.org, Christoph Raisch/Germany/IBM at IBMDE Subject Re: Re: Re: [openib-general] IBM eHCA testing.. Hi Heiko, On Tue, 2005-10-11 at 08:43, Heiko J Schick wrote: > this morning I've looked in detail into the problem you've reported on Oct > 10 via the OpenIB mailing-list [1]. It seems that the kernel panic is an > IPoIB issues. > > [1]: http://openib.org/pipermail/openib-general/2005-October/012353.html > > The following things appens: > > 1. 
modprobe hcad_mod ehca_nr_ports=1 > The eHCA InfiniBand Device Driver is loaded. > > 2. modprobe ib_mad > The ib_mad stack creates an AQP1. This will start the port > activation process. > By my count it will take more than 110 / 120 seconds to activate a > port. > Our device driver gets a timeout, which means that the port is NOT > active. and > ib_modify_qp will not work (for any QP, doesn't matter if it was > created in the ib_mad > stack or in the ib_ipoib stack). Where does this time to activate a port come from ? Is there some maximum time in which the eHCA firmware requires this to be completed ? -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Tue Oct 11 06:18:16 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Oct 2005 09:18:16 -0400 Subject: [openib-general] Re: [PATCH] Opensm - enabling erase of log file flag In-Reply-To: <5z1x2sxmum.fsf@mtl066.yok.mtl.com> References: <5z1x2sxmum.fsf@mtl066.yok.mtl.com> Message-ID: <1129036689.4377.7915.camel@hal.voltaire.com> Hi Yael, On Tue, 2005-10-11 at 08:24, Yael Kalka wrote: > Currently the osm log file is accumulative. I've added an option to > erase the log file before starting to write it. > By default, still, the log is still accumulative. > Attached is a patch for that. One minor comment on this... > Thanks, > Yael > > Signed-off-by: Yael Kalka > Index: opensm/osm_subnet.c > =================================================================== > --- opensm/osm_subnet.c (revision 3704) > +++ opensm/osm_subnet.c (working copy) > @@ -920,6 +925,7 @@ osm_subn_write_conf_file( > "force_log_flush %s\n\n" > "# Log file to be used\n" > "log_file %s\n\n" > + "accum_log_file %s\n\n" > "# The directory to hold the file OpenSM dumps\n" > "dump_files_dir %s\n\n" > "# If TRUE if OpenSM should disable multicast support\n" > @@ -929,6 +935,7 @@ osm_subn_write_conf_file( > p_opts->log_flags, > p_opts->force_log_flush ? 
"TRUE" : "FALSE", > p_opts->log_file, > + p_opts->accum_log_file, Shouldn't this line be: p_opts->accum_log_file ? "TRUE" : "FALSE", -- Hal From jlentini at netapp.com Tue Oct 11 06:33:37 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 11 Oct 2005 09:33:37 -0400 (EDT) Subject: [openib-general] Lustre Network Driver - KDAPL or verbs? In-Reply-To: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> References: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06> Message-ID: On Sun, 9 Oct 2005, Peter J. Braam wrote: > Cluster File Systems, Inc and its customers have been wondering if the > Lustre Network Driver (LND) for OpenIb gen2, which we will begin to > develop during the coming months, should be based on kdapl or verbs. > > The driver we plan to develop should strive to address several goals: > - high reliability and performance > - allow interoperability between user and kernel level > - allow interoperability, or better, portability among different > operating systems (Linux, OS X, Windows, Solaris) > - be suitable for inclusion in the Linux kernel > > We are keen to hear some opinions! > > Thanks > > Peter Braam Hi Peter, I am the maintainer of the kDAPL reference implementation. If you are interested in portability, I would recommend kDAPL. Earlier this year, there was an effort to modify the kDAPL API to make it acceptable for inclusion in the Linux kernel. After making these modifications, the OpenIB community still felt that the kDAPL API was not ready for merging into the upstream kernel. As a result, a new project was begun to develop an API capable of supporting both IB and iWARP and suitable for kernel inclusion. At the present time, neither the kDAPL API or the new RDMA API (verbs + CMA) has been sent upstream. The current thinking is that the RDMA API has a better chance than kDAPL. 
james

From halr at voltaire.com Tue Oct 11 06:31:18 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Oct 2005 09:31:18 -0400
Subject: [openib-general] IBM eHCA testing..
In-Reply-To:
References:
Message-ID: <1129037478.4377.8015.camel@hal.voltaire.com>

Hi again Heiko,

On Tue, 2005-10-11 at 09:21, Heiko J Schick wrote:
> Hello Hal,
>
> normally the timeout is set to 30 seconds.

Why does there need to be a timeout for this ? There is no time defined in the IB spec for activating a port. The SM may or may not be up and it is implementation specific when it activates any particular port.

> If you need more information about the "activation" please see [1].
>
> [1]:
> http://openib.org/pipermail/openib-general/2005-October/012355.html

Yes, I saw that post yesterday.

-- Hal

From sinate at yahoo.com Tue Oct 11 06:37:00 2005
From: sinate at yahoo.com (Steven Wooding)
Date: Tue, 11 Oct 2005 14:37:00 +0100 (BST)
Subject: [openib-general] Compiling an application that calls ib_cm_* functions
Message-ID: <20051011133700.77105.qmail@web32506.mail.mud.yahoo.com>

Hi,

I wonder if someone could help me with compiling my IB application? The problem is that when I go to link my program, all of the ib_cm* function calls come up as "undefined reference", as do dlist_start and _dlist_mark_move (dlist_next in the code).

Here is my linking command:

icpc -o ib_comms_test1 ib_comms_test1.o ib_queue_pair.o ib_comms_manager.o -L/usr/local/lib -libcm -libat -libverbs -libumad -lsysfs -ldl

I get the same result when using g++.

The cmpost.c example compiles fine. I've tried to see what it is doing. It seems to link in the libibcm.la file, but when I try this with icpc or g++, they say they cannot recognise the file type.

Maybe someone can spot the simple mistake I'm making.

Cheers,

Steve.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From halr at voltaire.com Tue Oct 11 06:38:11 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Oct 2005 09:38:11 -0400
Subject: [openib-general] IBM eHCA testing..
In-Reply-To:
References:
Message-ID: <1129037891.4377.8074.camel@hal.voltaire.com>

Hi again Heiko,

On Tue, 2005-10-11 at 09:21, Heiko J Schick wrote:
> Hello Hal,
>
> normally the timeout is set to 30 seconds.

One more thing: How can the timeout be adjusted ? Is it a module parameter ?

-- Hal

From bardov at gmail.com Tue Oct 11 06:44:25 2005
From: bardov at gmail.com (Dan Bar Dov)
Date: Tue, 11 Oct 2005 15:44:25 +0200
Subject: [openib-general] Lustre Network Driver - KDAPL or verbs?
In-Reply-To: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06>
References: <9025E129D3FCD340A7BA67E342D10E7A0D34DA2B@ms06>
Message-ID:

Hi Peter,

I can testify from first-hand experience: we first developed ISER over kDAPL. It simplified our work since kDAPL was pretty stable at the time.

We are now porting ISER to run over openIB-verbs + CMA. Although CMA is not there yet, the port does simplify the code compared to the kDAPL implementation.

Dan

On 10/9/05, Peter J. Braam wrote:
>
> Cluster File Systems, Inc and its customers have been wondering if the
> Lustre Network Driver (LND) for OpenIb gen2, which we will begin to develop
> during the coming months, should be based on kdapl or verbs.
>
> The driver we plan to develop should strive to address several goals:
> - high reliability and performance
> - allow interoperability between user and kernel level
> - allow interoperability, or better, portability among different operating
> systems (Linux, OS X, Windows, Solaris)
> - be suitable for inclusion in the Linux kernel
>
> We are keen to hear some opinions!
> > Thanks > > Peter Braam > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From mst at mellanox.co.il Tue Oct 11 06:47:47 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Oct 2005 15:47:47 +0200 Subject: [openib-general] Re: [PATCH] SDP: In sdp_link.c::do_link_path_lookup, handle interface table numbering holes In-Reply-To: <1128091110.5270.1072.camel@hal.voltaire.com> References: <1128091110.5270.1072.camel@hal.voltaire.com> Message-ID: <20051011134747.GA17185@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: [PATCH] SDP: In sdp_link.c::do_link_path_lookup, handle interface table numbering holes > > SDP: In sdp_link.c::do_link_path_lookup, handle interface table > numbering holes > (similar to James Lentini's patch to at.c) > > (this is untested) > > Signed-off-by: Hal Rosenstock > > Index: sdp_link.c > =================================================================== > --- sdp_link.c (revision 3623) > +++ sdp_link.c (working copy) > @@ -354,7 +354,6 @@ static void do_link_path_lookup(struct s > struct ipoib_dev_priv *priv; > struct net_device *dev = NULL; > struct rtable *rt; > - int counter = 0; > int result = 0; > struct flowi fl = { > .oif = info->dif, /* oif */ > @@ -435,7 +434,7 @@ static void do_link_path_lookup(struct s > > if (dev->flags & IFF_LOOPBACK) { > dev_put(dev); > - while ((dev = dev_get_by_index(++counter))) { > + for (dev = dev_base; dev; dev = dev->next) { > if (dev->type == ARPHRD_INFINIBAND && > (dev->flags & IFF_UP)) > break; > I think this list scan needs some kind of protection. The following is what I checked in. Does this needs to be updated in other places as well? Handle net interface table numbering holes (similar to James Lentini's patch to at.c) Signed-off-by: Michael S. 
Tsirkin Signed-off-by: Hal Rosenstock Index: linux-kernel/drivers/infiniband/ulp/sdp/sdp_link.c =================================================================== --- linux-kernel.orig/drivers/infiniband/ulp/sdp/sdp_link.c 2005-10-11 13:48:30.000000000 +0200 +++ linux-kernel/drivers/infiniband/ulp/sdp/sdp_link.c 2005-10-11 13:55:15.000000000 +0200 @@ -433,13 +433,15 @@ static void do_link_path_lookup(struct s if (dev->flags & IFF_LOOPBACK) { dev_put(dev); - while ((dev = dev_get_by_index(++counter))) { + read_lock(&dev_base_lock); + for (dev = dev_base; dev; dev = dev->next) { if (dev->type == ARPHRD_INFINIBAND && - (dev->flags & IFF_UP)) + (dev->flags & IFF_UP)) { + dev_hold(dev); break; - else - dev_put(dev); + } } + read_unlock(&dev_base_lock); } if (!dev) { -- MST From mst at mellanox.co.il Tue Oct 11 07:02:24 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Oct 2005 16:02:24 +0200 Subject: [openib-general] [PATCH] reduce the number of included files in cma.c Message-ID: <20051011140224.GB17185@mellanox.co.il> Remove unnecessary includes from cma.c Signed-off-by: Michael S. Tsirkin Index: linux-2.6.13/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.13/drivers/infiniband/core/cma.c (revision 3720) +++ linux-2.6.13/drivers/infiniband/core/cma.c (working copy) @@ -30,10 +30,6 @@ */ #include #include -#include -#include -#include -#include #include #include #include -- MST From tziporet at mellanox.co.il Tue Oct 11 09:11:42 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 11 Oct 2005 18:11:42 +0200 Subject: [openib-general] segmentation fault in ibv_modify_srq Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E33E79B9@mtlexch01.mtl.com> SRQ limit event will be supported also on cards with memory (both Infinihost and Infinihost III) If someone need it nowadays we can give a drop of FW that supports it. It will be officially released in Q4. 
Tziporet

-----Original Message-----
From: Roland Dreier [mailto:rolandd at cisco.com]
Sent: Wednesday, October 05, 2005 9:42 PM
To: Sayantan Sur
Cc: openib-general at openib.org
Subject: Re: [openib-general] segmentation fault in ibv_modify_srq

Sayantan> Hello, This is in regard to the use of `ibv_modify_srq'
Sayantan> call. When I use this call, I get a segmentation
Sayantan> fault.

This is because the modify SRQ operation is not implemented at all in libmthca. Do you just want to set the SRQ limit? That's not so hard for me to implement.

However, you should be aware that as far as I know, only mem-free HCAs generate the SRQ limit reached event.

- R.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From xma at us.ibm.com Tue Oct 11 09:13:20 2005
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 11 Oct 2005 09:13:20 -0700
Subject: [openib-general] IBM eHCA testing..
In-Reply-To: <1129037891.4377.8074.camel@hal.voltaire.com>
Message-ID:

The IB stack doesn't handle errors during client initialization. This problem is easy to reproduce by inducing errors (resource allocation failure or query failure) in mad_client or sa_client registration. I am working on a patch, but I am in class the whole week and don't have time to verify it. I hope the patch will be available early next week to fix the panic.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From amitk at mellanox.co.il Tue Oct 11 10:38:57 2005
From: amitk at mellanox.co.il (Amit Krig)
Date: Tue, 11 Oct 2005 19:38:57 +0200
Subject: [openib-general] some bugs that can be found using the gen2_basic in the contrib/mellanox folder
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E311B0A6@mtlexch01.mtl.com>

Hi Roland,

Dotan is on vacation until the end of the month (Ami will send an update).

Regarding the max QP number: the main reason for the test is to see that we are in the ballpark. Your point was taken, and we will focus on some heavy data movement; from there we will continue to some error flows.

-----Original Message-----
From: Roland Dreier [mailto:rolandd at cisco.com]
Sent: Monday, October 03, 2005 7:30 PM
To: Dotan Barak
Cc: openib-general at openib.org
Subject: Re: [openib-general] some bugs that can be found using the gen2_basic in the contrib/mellanox folder

I finally got a chance to try your tests. A few comments:

- Several of the tests are buggy. See the patch below at least.

- It would be much more useful if the COMPARE() macro printed the expected and actual value on failure.

- Similarly, other macros should probably also print more context. For example, in something like:

CHECK_PTR("ibv_create_qp", qp[i], goto cleanup);

I would probably want to know the value of i on failure.

- I don't believe some of the tests are really valid. For example, the max number of QPs doesn't have to be precisely correct -- no valid app is going to depend on being able to create exactly that number of QPs and no more.

- In any case, I'm not convinced that this sort of negative testing is the most valuable thing to focus on right now. I think it would be better to have regression tests of basic functionality (sends, receives, RDMA, CQ polling, etc) and stress tests before testing whether a buggy app will get the right error value when passing invalid parameters.

- R.
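
The COMPARE() and CHECK_PTR() suggestions above could look roughly like the following. This is only a minimal sketch: the macro names COMPARE_VERBOSE and CHECK_PTR_IDX and the two helper functions are hypothetical, since the real gen2_basic macro definitions are not shown in this thread; the sketch only assumes the "name, value, action on failure" calling style visible in the CHECK_PTR line quoted above.

```c
#include <stdio.h>

/* Hypothetical macro names; the real gen2_basic macros are not shown here. */

/* Report a value mismatch with both the expected and the actual value. */
#define COMPARE_VERBOSE(name, actual, expected, on_fail)               \
	do {                                                           \
		long a_ = (long)(actual);                              \
		long e_ = (long)(expected);                            \
		if (a_ != e_) {                                        \
			fprintf(stderr, "%s: expected %ld, got %ld\n", \
				(name), e_, a_);                       \
			on_fail;                                       \
		}                                                      \
	} while (0)

/* Report a NULL pointer together with the loop index that produced it. */
#define CHECK_PTR_IDX(name, ptr, idx, on_fail)                         \
	do {                                                           \
		if ((ptr) == NULL) {                                   \
			fprintf(stderr, "%s: NULL at index %d\n",      \
				(name), (int)(idx));                   \
			on_fail;                                       \
		}                                                      \
	} while (0)

/* Returns 0 when rc matches the expected value, -1 otherwise. */
static int check_rc(const char *name, int rc, int expected)
{
	COMPARE_VERBOSE(name, rc, expected, return -1);
	return 0;
}

/* Returns the index of the first NULL entry, or -1 if none is NULL. */
static int first_null_index(void *const *objs, int n)
{
	int i;

	for (i = 0; i < n; i++)
		CHECK_PTR_IDX("create_object", objs[i], i, goto fail);
	return -1;
fail:
	return i;
}
```

The failure action is passed as a statement (goto or return) rather than break, because break inside the do/while(0) wrapper would only leave the macro body and not the caller's loop.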
Index: test_cq.c =================================================================== --- test_cq.c (revision 3639) +++ test_cq.c (working copy) @@ -106,6 +106,7 @@ int cq_2( { struct ibv_context *ib_cont = NULL; struct ibv_pd *pd = NULL; + struct ibv_comp_channel *channel = NULL; struct ibv_cq *cq = NULL; struct ibv_cq *event_cq = NULL; struct ibv_qp *qp = NULL; @@ -132,8 +133,11 @@ int cq_2( pd = ibv_alloc_pd(ib_cont); CHECK_PTR("ibv_alloc_pd", pd, goto cleanup); + channel = ibv_create_comp_channel(ib_cont); + CHECK_PTR("ibv_create_comp_channel", channel, goto cleanup); + cq_size = VL_range(rand_gen, 1, device_attr.max_cqe); - cq = ibv_create_cq(ib_cont, cq_size, (void *)&count, NULL, 0); + cq = ibv_create_cq(ib_cont, cq_size, (void *)&count, channel, 0); CHECK_PTR("ibv_create_cq", cq, goto cleanup); mr_size = VL_range(rand_gen, 1, 1024); @@ -211,6 +215,7 @@ int cq_2( CHECK_MALLOC(event_count, goto cleanup); *event_count = 0; + rc = ibv_get_cq_event(channel, (void *)&event_cq, (void +*)&event_count); rc = ibv_get_cq_event(NULL, (void *)&event_cq, (void *)&event_count); CHECK_VALUE("ibv_get_cq_event", rc, 0, goto cleanup); Index: test_hca.c =================================================================== --- test_hca.c (revision 3639) +++ test_hca.c (working copy) @@ -230,7 +230,7 @@ int hca_5( j = port_attr.gid_tbl_len + VL_random(rand_gen, 0xFFFFFFFF - port_attr.gid_tbl_len); rc = ibv_query_gid(ib_cont, i, j, &gid); - CHECK_VALUE("ibv_query_gid", rc, 0, goto cleanup); + CHECK_VALUE("ibv_query_gid", rc, -1, goto cleanup); } PASSED; @@ -239,7 +239,7 @@ int hca_5( i = VL_range(rand_gen, device_attr.phys_port_cnt + 1, 0xFF); rc = ibv_query_gid(ib_cont, i, j, &gid); - CHECK_VALUE("ibv_query_gid", rc, 0, goto cleanup); + CHECK_VALUE("ibv_query_gid", rc, -1, goto cleanup); PASSED; test_result = 0; _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, 
please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Tue Oct 11 10:39:25 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 11 Oct 2005 10:39:25 -0700 Subject: [openib-general] [CMA] blocking in rdma_listen() Message-ID: <434BF8CD.8060409@ichips.intel.com> Does anyone have any objection to rdma_listen() blocking? I'm working on adding support for listening across any device, but need to synchronize with device addition/removal. - Sean From rolandd at cisco.com Tue Oct 11 10:45:53 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 11 Oct 2005 10:45:53 -0700 Subject: [openib-general] SRP & Infiniband In-Reply-To: <3E6BB9CEE261E2428AD25D0D553DC4970142EA36@HSDLNTD1110010.noida.hcltech.com> (Mohit Katiyar's message of "Tue, 11 Oct 2005 15:28:19 +0530") References: <3E6BB9CEE261E2428AD25D0D553DC4970142EA36@HSDLNTD1110010.noida.hcltech.com> Message-ID: <52k6gkaqwe.fsf@cisco.com> Mohit> I am not clear about the functionalities of the user level Mohit> HCA driver? Are there any specifications for it or it is Mohit> totally vendor based? The userspace interface is based on the "verbs" described in chapter 11 of the IB spec, but there is no formal API spec. Mohit> It is also said it is used in speed path operations? Does Mohit> anyone has any ideas how does it do accomplishes it? The kernel sets up a mapping of HCA registers into userspace, and then userspace can talk directly to the IB hardware without going through the kernel. Mohit> If I have SCSI storage devices in a SAN then can I use SRP Mohit> module to send some request and User mode HCA library for Mohit> some speed path operation? Basically I wanted to know that Mohit> for SCSI devices can User mode HCA library be used for Mohit> speed path operations . 
If yes the how they can be Mohit> used(Only theoretical details rest I wil try) It would be theoretically possible to implement a userspace process that connects to an SRP target and implement SRP in userspace. However, I don't think this would be any better than using a kernel SRP driver along with direct IO from userspace. - R. From rolandd at cisco.com Tue Oct 11 10:56:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 11 Oct 2005 10:56:59 -0700 Subject: [openib-general] IBM eHCA testing.. In-Reply-To: (Heiko J. Schick's message of "Tue, 11 Oct 2005 14:43:42 +0200") References: Message-ID: <52ek6saqdw.fsf@cisco.com> Heiko> 7. ipoib_qp_create wants to modify the IPoIB QP (priv->qp) Heiko> which is NULL, because the QP was destroy earlier in time Heiko> by the error handling routine in ipoib_qp_create (see 5.) Heiko> I think this error could also show up on Mellanox based IB Heiko> cards when ib_modify_qp failes in ipoib_qp_create. Yes, this is a bug. I think something like the patch below is needed -- ipoib_qp_create() should not destroy the QP on failure, since it no longer creates the QP. In fact we should fix the name as well, since creation of the QP has moved elsewhere. I'll check this in and queue it for 2.6.15. 
Thanks, Roland --- infiniband/ulp/ipoib/ipoib_verbs.c (revision 3707) +++ infiniband/ulp/ipoib/ipoib_verbs.c (working copy) @@ -92,7 +92,7 @@ int ipoib_mcast_detach(struct net_device return ret; } -int ipoib_qp_create(struct net_device *dev) +int ipoib_init_qp(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; @@ -149,10 +149,11 @@ int ipoib_qp_create(struct net_device *d return 0; out_fail: - ib_destroy_qp(priv->qp); - priv->qp = NULL; + qp_attr.qp_state = IB_QPS_RESET; + if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE)) + ipoib_warn(priv, "Failed to modify QP to RESET state\n"); - return -EINVAL; + return ret; } int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) --- infiniband/ulp/ipoib/ipoib.h (revision 3707) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -277,7 +277,7 @@ int ipoib_mcast_attach(struct net_device int ipoib_mcast_detach(struct net_device *dev, u16 mlid, union ib_gid *mgid); -int ipoib_qp_create(struct net_device *dev); +int ipoib_init_qp(struct net_device *dev); int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca); void ipoib_transport_dev_cleanup(struct net_device *dev); --- infiniband/ulp/ipoib/ipoib_ib.c (revision 3707) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -387,9 +387,9 @@ int ipoib_ib_dev_open(struct net_device struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; - ret = ipoib_qp_create(dev); + ret = ipoib_init_qp(dev); if (ret) { - ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret); + ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret); return -1; } From vuhuong at mellanox.com Tue Oct 11 11:03:15 2005 From: vuhuong at mellanox.com (Vu Pham) Date: Tue, 11 Oct 2005 11:03:15 -0700 Subject: [openib-general] Re: [PATCH] SRP: don't use TX IU after freeing it In-Reply-To: <521x2tf512.fsf@cisco.com> References: <52vf0kii49.fsf@cisco.com> <433C1821.6000809@mellanox.com> <52zmpvhll8.fsf@cisco.com> <433C78A1.30207@mellanox.com> 
<521x2tf512.fsf@cisco.com>
Message-ID: <434BFE63.7080800@mellanox.com>

Roland,

Thanks for reviewing it. Responding to your feedback, I have prepared a new patch (attached).

>
> Why put a pointer to struct list_head here instead of just a struct
> list_head? If you just used the struct, then you wouldn't need this:
>

Done. Using struct list_head instead of pointer

> > + u16 in_use;
> > };
>
> I can't find anywhere that the in_use flag is used.
>

Removed

> > +static int srp_map_fmr(struct srp_target_port *target, struct scatterlist *scat,
> > + int sg_cnt, struct srp_request *req)
>
> [...]
>
> > + return -ENOMEM;
> > + } else if (fmr_cnt <= 0) {
>
> fmr_cnt is unsigned so I think this is going to get you in trouble.
> Might as well make fmr_cnt a plain int to make things simpler.
>

In previous patch, fmr_cnt was already declared as int

> Also, it might be good to try and add some more comments explaining
> srp_map_fmr() -- it would definitely help me review.
>

I added some comments - Hope they help your review (instead of confusing you more :))

Signed-off-by: Vu Pham

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: srp.patch.Oct112005
URL:

From steve_wooding at keysounds.co.uk Tue Oct 11 11:14:12 2005
From: steve_wooding at keysounds.co.uk (Steve Wooding)
Date: Tue, 11 Oct 2005 19:14:12 +0100
Subject: [openib-general] Re: Compiling an application that calls ib_cm_* functions
Message-ID: <434C00F4.8010403@keysounds.co.uk>

OK. It's that extern "C" issue again. Maybe someone could put the "ifdef __cplusplus" code block at the top of the cm.h file so future C++ programmers don't have to put extern "C" {} around the #include.

Cheers,

Steve.

Steve wrote:

> Hi,
>
> I wonder if someone could help me with compiling my IB application?
> The problem is when I go to link my program I get all of the ib_cm*
> function calls
> come up as "undefined reference". Also dlist_start and
> _dlist_mark_move (dlist_next in the code).
> > Here is my linking command: > icpc -o ib_comms_test1 ib_comms_test1.o ib_queue_pair.o > ib_comms_manager.o -L/usr/local/lib -libcm -libat -libverbs -libumad > -lsysfs -ldl > > Get the same result when using g++ > The cmpost.c example compiles fine. I've tried to see what it is > doing. It seems to link-in the libibcm.la file, > but when I try this with icpc or g++, they say they cannot recogised > the file type. > > Maybe someone can spot the simple mistake I'm making. > > Cheers, > > Steve. From krause at cup.hp.com Tue Oct 11 12:03:04 2005 From: krause at cup.hp.com (Michael Krause) Date: Tue, 11 Oct 2005 12:03:04 -0700 Subject: [openib-general] IRQ sharing on PCIe bus In-Reply-To: <52achhf5h7.fsf@cisco.com> References: <52d5mdibp1.fsf@cisco.com> <6.2.0.14.2.20051010124836.023afd08@esmail.cup.hp.com> <52achhf5h7.fsf@cisco.com> Message-ID: <6.2.0.14.2.20051011115855.024b0248@esmail.cup.hp.com> At 02:05 PM 10/10/2005, Roland Dreier wrote: > Roland> BTW, for "INTx emulation" on PCI Express, there are no > Roland> physical interrupt lines -- interrupts are asserted and > Roland> deasserted with messages. So PCI Express interrupts are > Roland> unshared. > > Michael> They are messages upstream that any device. ^ sent Sorry. Insert "sent" above. >That doesn't parse for me. Was what I said wrong? No. Just clarifying that they are not unique per device. INTx being a message does not change the fundamental semantics of a "wire" being asserted. Hence, if the wire was shared before, then there is no reason why this would not be the same with PCIe sans. It really is an OS issue as to how INTx interrupts are assigned to different processors and to what extent then end up being shared. The host bridge can play some tricks as well as you noted. Again, the goal within the PCI-SIG is to move people to MSI-X and to eliminate INTx long-term. 
In fact, one area under development is asking the SIG's members whether INTx can be eliminated entirely which would go a long ways to simplifying designs both in hardware and software. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Tue Oct 11 12:46:36 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 11 Oct 2005 12:46:36 -0700 Subject: [openib-general] InfiniPath driver announcement In-Reply-To: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Wed, 28 Sep 2005 12:50:07 -0700") References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> Message-ID: <52y84z96qr.fsf@cisco.com> I started working on some final cleanups of the uverbs interface before merging stuff onto the trunk. The patch below mixes some simple cleanups with a slight change to the work request posting interface. I changed the ABI so that the work requests are passed as part of the same write as the command, and modified the implementation to copy the work requests one by one instead of in one giant chunk. I also added a WQE size field to allow for future device-specific extensions to work request posting. The following is compile-tested only, and I haven't modified the userspace library to match, but I wanted to give you some idea of what I was doing in case you had some comments or started working on it too. What do you think? - R. --- infiniband/include/rdma/ib_user_verbs.h (revision 3725) +++ infiniband/include/rdma/ib_user_verbs.h (working copy) @@ -89,8 +89,11 @@ enum { * Make sure that all structs defined in this file remain laid out so * that they pack the same way on 32-bit and 64-bit architectures (to * avoid incompatibility between 32-bit userspace and 64-bit kernels). - * In particular do not use pointer types -- pass pointers in __u64 - * instead. + * Specifically: + * - Do not use pointer types -- pass pointers in __u64 instead. 
+ * - Make sure that any structure larger than 4 bytes is padded to a + * multiple of 8 bytes. Otherwise the structure size will be + * different between 32-bit and 64-bit architectures. */ struct ib_uverbs_async_event_desc { @@ -284,12 +287,12 @@ struct ib_uverbs_wc { __u8 sl; __u8 dlid_path_bits; __u8 port_num; - __u8 reserved; /* Align struct to 8 bytes */ + __u8 reserved; }; struct ib_uverbs_poll_cq_resp { __u32 count; - __u32 reserved; /* Align struct to 8 bytes */ + __u32 reserved; struct ib_uverbs_wc wc[]; }; @@ -417,20 +420,20 @@ struct ib_uverbs_send_wr { struct { __u64 remote_addr; __u32 rkey; - __u32 reserved; /* Align struct to 8 bytes */ + __u32 reserved; } rdma; struct { __u64 remote_addr; __u64 compare_add; __u64 swap; __u32 rkey; - __u32 reserved; /* Align struct to 8 bytes */ + __u32 reserved; } atomic; struct { __u32 ah; __u32 remote_qpn; __u32 remote_qkey; - __u32 reserved; /* Align struct to 8 bytes */ + __u32 reserved; } ud; } wr; }; @@ -440,8 +443,7 @@ struct ib_uverbs_post_send { __u32 qp_handle; __u32 wr_count; __u32 sge_count; - __u32 reserved; /* Align struct to 8 bytes */ - __u64 wr; + __u32 wqe_size; }; struct ib_uverbs_post_send_resp { @@ -451,7 +453,7 @@ struct ib_uverbs_post_send_resp { struct ib_uverbs_recv_wr { __u64 wr_id; __u32 num_sge; - __u32 reserved; /* Align struct to 8 bytes */ + __u32 reserved; }; struct ib_uverbs_post_recv { @@ -459,8 +461,7 @@ struct ib_uverbs_post_recv { __u32 qp_handle; __u32 wr_count; __u32 sge_count; - __u32 reserved; /* Align struct to 8 bytes */ - __u64 wr; + __u32 wqe_size; }; struct ib_uverbs_post_recv_resp { @@ -472,47 +473,38 @@ struct ib_uverbs_post_srq_recv { __u32 srq_handle; __u32 wr_count; __u32 sge_count; - __u32 reserved; /* Align struct to 8 bytes */ - __u64 wr; + __u32 wqe_size; }; struct ib_uverbs_post_srq_recv_resp { __u32 bad_wr; }; -union ib_uverbs_gid { - __u8 raw[16]; - struct { - __u64 subnet_prefix; - __u64 interface_id; - } global; -}; - -struct ibv_m_global_route { - union 
ib_uverbs_gid dgid; +struct ib_uverbs_global_route { + __u8 dgid[16]; __u32 flow_label; __u8 sgid_index; __u8 hop_limit; __u8 traffic_class; - __u8 reserved; /* Align struct to 8 bytes */ + __u8 reserved; }; struct ib_uverbs_ah_attr { - struct ibv_m_global_route grh; + struct ib_uverbs_global_route grh; __u16 dlid; __u8 sl; __u8 src_path_bits; __u8 static_rate; __u8 is_global; __u8 port_num; - __u8 reserved; /* Align struct to 8 bytes */ + __u8 reserved; }; struct ib_uverbs_create_ah { __u64 response; __u64 user_handle; __u32 pd_handle; - __u32 reserved; /* Align struct to 8 bytes */ + __u32 reserved; struct ib_uverbs_ah_attr attr; }; --- infiniband/core/uverbs_cmd.c (revision 3725) +++ infiniband/core/uverbs_cmd.c (working copy) @@ -680,6 +680,10 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + /* Don't let userspace make us allocate a huge buffer */ + if (cmd.ne > 256) + return -ENOMEM; + wc = kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL); if (!wc) return -ENOMEM; @@ -699,24 +703,25 @@ ssize_t ib_uverbs_poll_cq(struct ib_uver } resp->count = ib_poll_cq(cq, cmd.ne, wc); - for(i = 0; i < cmd.ne; i++) { - resp->wc[i].wr_id = wc[i].wr_id; - resp->wc[i].status = wc[i].status; - resp->wc[i].opcode = wc[i].opcode; - resp->wc[i].vendor_err = wc[i].vendor_err; - resp->wc[i].byte_len = wc[i].byte_len; - resp->wc[i].imm_data = wc[i].imm_data; - resp->wc[i].qp_num = wc[i].qp_num; - resp->wc[i].src_qp = wc[i].src_qp; - resp->wc[i].wc_flags = wc[i].wc_flags; - resp->wc[i].pkey_index = wc[i].pkey_index; - resp->wc[i].slid = wc[i].slid; - resp->wc[i].sl = wc[i].sl; + for (i = 0; i < resp->count; i++) { + resp->wc[i].wr_id = wc[i].wr_id; + resp->wc[i].status = wc[i].status; + resp->wc[i].opcode = wc[i].opcode; + resp->wc[i].vendor_err = wc[i].vendor_err; + resp->wc[i].byte_len = wc[i].byte_len; + resp->wc[i].imm_data = wc[i].imm_data; + resp->wc[i].qp_num = wc[i].qp_num; + resp->wc[i].src_qp = wc[i].src_qp; + resp->wc[i].wc_flags 
= wc[i].wc_flags; + resp->wc[i].pkey_index = wc[i].pkey_index; + resp->wc[i].slid = wc[i].slid; + resp->wc[i].sl = wc[i].sl; resp->wc[i].dlid_path_bits = wc[i].dlid_path_bits; - resp->wc[i].port_num = wc[i].port_num; + resp->wc[i].port_num = wc[i].port_num; } - if (copy_to_user((void __user *)cmd.response, resp, rsize)) + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) ret = -EFAULT; out: @@ -741,15 +746,12 @@ ssize_t ib_uverbs_req_notify_cq(struct i down(&ib_uverbs_idr_mutex); cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); - if (!cq || cq->uobject->context != file->ucontext) - goto out; - - ib_req_notify_cq(cq, cmd.solicited ? IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); - - ret = in_len; - -out: + if (cq && cq->uobject->context == file->ucontext) { + ib_req_notify_cq(cq, cmd.solicited ? IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); + ret = in_len; + } up(&ib_uverbs_idr_mutex); + return ret; } @@ -1097,195 +1099,296 @@ ssize_t ib_uverbs_post_send(struct ib_uv { struct ib_uverbs_post_send cmd; struct ib_uverbs_post_send_resp resp; - struct ib_uverbs_send_wr *m_wr, *j; - struct ib_send_wr *wr, *i, *bad_wr; - struct ib_sge *s; + struct ib_uverbs_send_wr *user_wr; + struct ib_send_wr *wr = NULL, *last, *next, *bad_wr; struct ib_qp *qp; - int size; - int count; + int i, sg_ind; ssize_t ret = -EINVAL; - resp.bad_wr = 0; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + if (in_len < sizeof cmd + cmd.wqe_size * cmd.wr_count + + cmd.sge_count * sizeof (struct ib_uverbs_sge)) + return -EINVAL; + + if (cmd.wqe_size < sizeof (struct ib_uverbs_send_wr)) + return -EINVAL; + + /* Don't let userspace make us allocate a huge buffer */ + if (cmd.wqe_size > 4096) + return -ENOMEM; + + user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL); + if (!user_wr) + return -ENOMEM; + down(&ib_uverbs_idr_mutex); qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); if (!qp || qp->uobject->context != file->ucontext) goto out; - size = (cmd.wr_count * sizeof *wr) + 
(cmd.sge_count * sizeof *s); - m_wr = kmalloc(size, GFP_KERNEL); - if (!m_wr) { - ret = -ENOMEM; - goto out; - } - - if (copy_from_user(m_wr, (void __user *)cmd.wr, size)) { - ret = -EFAULT; - goto wrout; - } + sg_ind = 0; + last = NULL; + for (i = 0; i < cmd.wr_count; ++i) { + if (copy_from_user(user_wr, + buf + sizeof cmd + i * cmd.wqe_size, + cmd.wqe_size)) { + ret = -EFAULT; + goto out; + } - wr = kmalloc(cmd.wr_count * sizeof *wr, GFP_KERNEL); - if (!wr) { - ret = -ENOMEM; - goto wrout; - } + /* Don't let userspace make us allocate a huge buffer */ + if (user_wr->num_sge > 256) { + ret = -ENOMEM; + goto out; + } - s = (struct ib_sge *)(m_wr + cmd.wr_count); + if (user_wr->num_sge + sg_ind > cmd.sge_count) { + ret = -EINVAL; + goto out; + } - i = wr; - j = m_wr; - count = 0; - while (count++ < cmd.wr_count) { - struct ib_send_wr *t = i++; - struct ib_uverbs_send_wr *u = j++; + next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) + + user_wr->num_sge * sizeof (struct ib_sge), + GFP_KERNEL); + if (!next) { + ret = -ENOMEM; + goto out; + } - if (count < cmd.wr_count) - t->next = i; + if (!last) + wr = next; else - t->next = NULL; + last->next = next; + last = next; - t->wr_id = u->wr_id; - t->num_sge = u->num_sge; - t->opcode = u->opcode; - t->send_flags = u->send_flags; - t->imm_data = u->imm_data; + next->next = NULL; + next->wr_id = user_wr->wr_id; + next->num_sge = user_wr->num_sge; + next->opcode = user_wr->opcode; + next->send_flags = user_wr->send_flags; + next->imm_data = user_wr->imm_data; if (qp->qp_type == IB_QPT_UD) { - t->wr.ud.ah = idr_find(&ib_uverbs_ah_idr, u->wr.ud.ah); - if (!t->wr.ud.ah) - goto kwrout; - t->wr.ud.remote_qpn = u->wr.ud.remote_qpn; - t->wr.ud.remote_qkey = u->wr.ud.remote_qkey; + next->wr.ud.ah = idr_find(&ib_uverbs_ah_idr, + user_wr->wr.ud.ah); + if (!next->wr.ud.ah) { + ret = -EINVAL; + goto out; + } + next->wr.ud.remote_qpn = user_wr->wr.ud.remote_qpn; + next->wr.ud.remote_qkey = user_wr->wr.ud.remote_qkey; } else { - 
switch (t->opcode) { + switch (next->opcode) { case IB_WR_RDMA_WRITE: case IB_WR_RDMA_WRITE_WITH_IMM: case IB_WR_RDMA_READ: - t->wr.rdma.remote_addr = u->wr.rdma.remote_addr; - t->wr.rdma.rkey = u->wr.rdma.rkey; + next->wr.rdma.remote_addr = + user_wr->wr.rdma.remote_addr; + next->wr.rdma.rkey = + user_wr->wr.rdma.rkey; break; case IB_WR_ATOMIC_CMP_AND_SWP: case IB_WR_ATOMIC_FETCH_AND_ADD: - t->wr.atomic.remote_addr = - u->wr.atomic.remote_addr; - t->wr.atomic.compare_add = - u->wr.atomic.compare_add; - t->wr.atomic.swap = u->wr.atomic.swap; - t->wr.atomic.rkey = u->wr.atomic.rkey; + next->wr.atomic.remote_addr = + user_wr->wr.atomic.remote_addr; + next->wr.atomic.compare_add = + user_wr->wr.atomic.compare_add; + next->wr.atomic.swap = user_wr->wr.atomic.swap; + next->wr.atomic.rkey = user_wr->wr.atomic.rkey; break; default: break; } } - if (t->num_sge) { - t->sg_list = s; - s += t->num_sge; + if (next->num_sge) { + next->sg_list = (void *) next + + ALIGN(sizeof *next, sizeof (struct ib_sge)); + if (copy_from_user(next->sg_list, + buf + sizeof cmd + + cmd.wr_count * cmd.wqe_size + + sg_ind * sizeof (struct ib_sge), + next->num_sge * sizeof (struct ib_sge))) { + ret = -EFAULT; + goto out; + } + sg_ind += next->num_sge; } else - t->sg_list = NULL; + next->sg_list = NULL; } + resp.bad_wr = 0; ret = qp->device->post_send(qp, wr, &bad_wr); - resp.bad_wr = ret ? (bad_wr - wr) + 1 : 0; - -kwrout: - kfree(wr); + if (ret) { + for (next = wr; next; next = next->next) { + if (next == bad_wr) + break; + ++resp.bad_wr; + } + } -wrout: - kfree(m_wr); + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + ret = -EFAULT; out: up(&ib_uverbs_idr_mutex); - if (copy_to_user((void __user *) (unsigned long) cmd.response, - &resp, sizeof resp)) - ret = -EFAULT; + while (wr) { + next = wr->next; + kfree(wr); + wr = next; + } + + kfree(user_wr); return ret ? 
ret : in_len; } +static struct ib_recv_wr *ib_uverbs_unmarshall_recv(const char __user *buf, + int in_len, + u32 wr_count, + u32 sge_count, + u32 wqe_size) +{ + struct ib_uverbs_recv_wr *user_wr; + struct ib_recv_wr *wr = NULL, *last, *next; + int sg_ind; + int i; + int ret; + + if (in_len < wqe_size * wr_count + + sge_count * sizeof (struct ib_uverbs_sge)) + return ERR_PTR(-EINVAL); + + if (wqe_size < sizeof (struct ib_uverbs_recv_wr)) + return ERR_PTR(-EINVAL); + + /* Don't let userspace make us allocate a huge buffer */ + if (wqe_size > 4096) + return ERR_PTR(-ENOMEM); + + user_wr = kmalloc(wqe_size, GFP_KERNEL); + if (!user_wr) + return ERR_PTR(-ENOMEM); + + sg_ind = 0; + last = NULL; + for (i = 0; i < wr_count; ++i) { + if (copy_from_user(user_wr, buf + i * wqe_size, + wqe_size)) { + ret = -EFAULT; + goto err; + } + + /* Don't let userspace make us allocate a huge buffer */ + if (user_wr->num_sge > 256) { + ret = -ENOMEM; + goto err; + } + + if (user_wr->num_sge + sg_ind > sge_count) { + ret = -EINVAL; + goto err; + } + + next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) + + user_wr->num_sge * sizeof (struct ib_sge), + GFP_KERNEL); + if (!next) { + ret = -ENOMEM; + goto err; + } + + if (!last) + wr = next; + else + last->next = next; + last = next; + + next->next = NULL; + next->wr_id = user_wr->wr_id; + next->num_sge = user_wr->num_sge; + + if (next->num_sge) { + next->sg_list = (void *) next + + ALIGN(sizeof *next, sizeof (struct ib_sge)); + if (copy_from_user(next->sg_list, + buf + wr_count * wqe_size + + sg_ind * sizeof (struct ib_sge), + next->num_sge * sizeof (struct ib_sge))) { + ret = -EFAULT; + goto err; + } + sg_ind += next->num_sge; + } else + next->sg_list = NULL; + } + + kfree(user_wr); + return wr; + +err: + kfree(user_wr); + + while (wr) { + next = wr->next; + kfree(wr); + wr = next; + } + + return ERR_PTR(ret); +} + ssize_t ib_uverbs_post_recv(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) { struct 
ib_uverbs_post_recv cmd; struct ib_uverbs_post_recv_resp resp; - struct ib_uverbs_recv_wr *m_wr, *j; - struct ib_recv_wr *wr, *i, *bad_wr; - struct ib_sge *s; + struct ib_recv_wr *wr, *next, *bad_wr; struct ib_qp *qp; - int size; - int count; ssize_t ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd, + in_len - sizeof cmd, cmd.wr_count, + cmd.sge_count, cmd.wqe_size); + if (IS_ERR(wr)) + return PTR_ERR(wr); + down(&ib_uverbs_idr_mutex); qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); if (!qp || qp->uobject->context != file->ucontext) goto out; - size = (cmd.wr_count * sizeof *m_wr) + (cmd.sge_count * sizeof *s); - m_wr = kmalloc(size, GFP_KERNEL); - if (!m_wr) { - ret = -ENOMEM; - goto out; - } - - if (copy_from_user(m_wr, (void __user *)cmd.wr, size)) { - ret = -EFAULT; - goto wrout; - } - - wr = kmalloc(cmd.wr_count * sizeof *wr, GFP_KERNEL); - if (!wr) { - ret = -ENOMEM; - goto wrout; - } - - s = (struct ib_sge *)(m_wr + cmd.wr_count); - - i = wr; - j = m_wr; - count = 0; - while (count++ < cmd.wr_count) { - struct ib_recv_wr *t = i++; - struct ib_uverbs_recv_wr *u = j++; - - if (count < cmd.wr_count) - t->next = i; - else - t->next = NULL; - - t->wr_id = u->wr_id; - t->num_sge = u->num_sge; - - if (t->num_sge) { - t->sg_list = s; - s += t->num_sge; - } else - t->sg_list = NULL; - } - + resp.bad_wr = 0; ret = qp->device->post_recv(qp, wr, &bad_wr); - resp.bad_wr = ret ? (bad_wr - wr) + 1 : 0; + if (ret) + for (next = wr; next; next = next->next) { + if (next == bad_wr) + break; + ++resp.bad_wr; + } + + up(&ib_uverbs_idr_mutex); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) ret = -EFAULT; - kfree(wr); - -wrout: - kfree(m_wr); - out: - up(&ib_uverbs_idr_mutex); + while (wr) { + next = wr->next; + kfree(wr); + wr = next; + } return ret ? 
ret : in_len; } @@ -1294,80 +1397,48 @@ ssize_t ib_uverbs_post_srq_recv(struct i const char __user *buf, int in_len, int out_len) { - struct ib_uverbs_post_srq_recv cmd; + struct ib_uverbs_post_srq_recv cmd; struct ib_uverbs_post_srq_recv_resp resp; - struct ib_uverbs_recv_wr *m_wr, *j; - struct ib_recv_wr *wr, *i, *bad_wr; - struct ib_sge *s; - struct ib_srq *srq; - int size; - int count; - ssize_t ret = -EFAULT; + struct ib_recv_wr *wr, *next, *bad_wr; + struct ib_srq *srq; + ssize_t ret = -EINVAL; if (copy_from_user(&cmd, buf, sizeof cmd)) return -EFAULT; + wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd, + in_len - sizeof cmd, cmd.wr_count, + cmd.sge_count, cmd.wqe_size); + if (IS_ERR(wr)) + return PTR_ERR(wr); + down(&ib_uverbs_idr_mutex); srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); if (!srq || srq->uobject->context != file->ucontext) goto out; - size = (cmd.wr_count * sizeof *m_wr) + (cmd.sge_count * sizeof *s); - m_wr = kmalloc(size, GFP_KERNEL); - if (!m_wr) { - ret = -ENOMEM; - goto out; - } - - if (copy_from_user(m_wr, (void __user *)cmd.wr, size)) { - goto wrout; - } - - wr = kmalloc(cmd.wr_count * sizeof *wr, GFP_KERNEL); - if (!wr) { - ret = -ENOMEM; - goto wrout; - } - - s = (struct ib_sge *)(m_wr + cmd.wr_count); - - i = wr; - j = m_wr; - count = 0; - while (count++ < cmd.wr_count) { - struct ib_recv_wr *t = i++; - struct ib_uverbs_recv_wr *u = j++; - - if (count < cmd.wr_count) - t->next = i; - else - t->next = NULL; - - t->wr_id = u->wr_id; - t->num_sge = u->num_sge; - - if (t->num_sge) { - t->sg_list = s; - s += t->num_sge; - } else - t->sg_list = NULL; - } - + resp.bad_wr = 0; ret = srq->device->post_srq_recv(srq, wr, &bad_wr); - resp.bad_wr = ret ? 
(bad_wr - wr) + 1 : 0; + if (ret) + for (next = wr; next; next = next->next) { + if (next == bad_wr) + break; + ++resp.bad_wr; + } + + up(&ib_uverbs_idr_mutex); if (copy_to_user((void __user *) (unsigned long) cmd.response, &resp, sizeof resp)) ret = -EFAULT; - kfree(wr); - -wrout: - kfree(m_wr); - out: - up(&ib_uverbs_idr_mutex); + while (wr) { + next = wr->next; + kfree(wr); + wr = next; + } return ret ? ret : in_len; } @@ -1405,19 +1476,16 @@ ssize_t ib_uverbs_create_ah(struct ib_uv uobj->user_handle = cmd.user_handle; uobj->context = file->ucontext; - attr.dlid = cmd.attr.dlid; - attr.sl = cmd.attr.sl; - attr.src_path_bits = cmd.attr.src_path_bits; - attr.static_rate = cmd.attr.static_rate; - attr.port_num = cmd.attr.port_num; - attr.grh.flow_label = cmd.attr.grh.flow_label; - attr.grh.sgid_index = cmd.attr.grh.sgid_index; - attr.grh.hop_limit = cmd.attr.grh.hop_limit; + attr.dlid = cmd.attr.dlid; + attr.sl = cmd.attr.sl; + attr.src_path_bits = cmd.attr.src_path_bits; + attr.static_rate = cmd.attr.static_rate; + attr.port_num = cmd.attr.port_num; + attr.grh.flow_label = cmd.attr.grh.flow_label; + attr.grh.sgid_index = cmd.attr.grh.sgid_index; + attr.grh.hop_limit = cmd.attr.grh.hop_limit; attr.grh.traffic_class = cmd.attr.grh.traffic_class; - attr.grh.dgid.global.subnet_prefix = - cmd.attr.grh.dgid.global.subnet_prefix; - attr.grh.dgid.global.interface_id = - cmd.attr.grh.dgid.global.interface_id; + memcpy(attr.grh.dgid.raw, cmd.attr.grh.dgid, 16); ah = ib_create_ah(pd, &attr); if (IS_ERR(ah)) { From rjwalsh at pathscale.com Tue Oct 11 12:59:46 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Tue, 11 Oct 2005 12:59:46 -0700 Subject: [openib-general] InfiniPath driver announcement In-Reply-To: <52y84z96qr.fsf@cisco.com> References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> Message-ID: <1129060787.29804.7.camel@hematite.internal.keyresearch.com> On Tue, 2005-10-11 at 12:46 -0700, Roland Dreier wrote: > I 
started working on some final cleanups of the uverbs interface > before merging stuff onto the trunk. The patch below mixes some > simple cleanups with a slight change to the work request posting > interface. I changed the ABI so that the work requests are passed as > part of the same write as the command, and modified the implementation > to copy the work requests one by one instead of in one giant chunk. > > I also added a WQE size field to allow for future device-specific > extensions to work request posting. > > The following is compile-tested only, and I haven't modified the > userspace library to match, but I wanted to give you some idea of what > I was doing in case you had some comments or started working on it too. > > What do you think? I'll spend some time today or tomorrow looking at this, getting it integrated and finishing the userland stuff. Thanks for doing this! Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From krause at cup.hp.com Tue Oct 11 13:01:59 2005 From: krause at cup.hp.com (Michael Krause) Date: Tue, 11 Oct 2005 13:01:59 -0700 Subject: [openib-general] Re: [PATCH] [CMA] RDMA CM abstraction module In-Reply-To: <20051010200921.GB25968@lst.de> References: <434AAFDD.90208@ichips.intel.com> <000601c5cdce$abf93440$9e5aa8c0@infiniconsys.com> <6.2.0.14.2.20051010125123.0238b4d0@esmail.cup.hp.com> <20051010200921.GB25968@lst.de> Message-ID: <6.2.0.14.2.20051011125725.02502300@esmail.cup.hp.com> At 01:09 PM 10/10/2005, Christoph Hellwig wrote: >On Mon, Oct 10, 2005 at 12:53:29PM -0700, Michael Krause wrote: > > standards. 
There are also the new standard Sockets extension API > available > > today that might be extended sometime in the future to include explicit > >which is never going to get into linux. one more of these braindead >standards people masturbating in a dark room and coming up with a >frankenstein bastard cases. Everyone is free to have an opinion. Sockets extensions are not braindead nor created using whatever methods you envision. The extensions were created by Sockets engineers with 20+ years of experience. But, hey, why put any faith into people who develop and implement Sockets for a living? One day perhaps you'll learn a bit of professionalism and perhaps open your mind to the fact that there are people out in the world besides yourself who don't take an NIH approach to the world and are actually qualified engineers who have a clue. All you get with these constant unprofessional diatribes is a continual loss in credibility. But, hey, that is just an opinion. BTW, do you feel the same way about the people who created IB? How about iWARP? How about PCIe? Are all of the engineers who work on trying to accelerate technology, its performance, etc. who take into account and try to find a balanced approach to problem solving simply all in dark little rooms? All of these specs are created by companies. Those same companies fund open source efforts and many of the people working here. One last thing, I'm not the only person who feels this way about your unprofessional behavior. There are many others who simply don't want to bother writing or have simply written you off as whatever. Sad state to be in and I suspect you don't care since you view them all as in dark little rooms anyway. Just something you might want to keep in mind. There is a much larger world out there where people value other people's professional opinions and ideas. They don't simply discount what they produce because it was not done in whatever form you prefer. It is called reality. Get used to it. 
Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlentini at netapp.com Tue Oct 11 14:18:09 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 11 Oct 2005 17:18:09 -0400 (EDT) Subject: [openib-general] [CMA] blocking in rdma_listen() In-Reply-To: <434BF8CD.8060409@ichips.intel.com> References: <434BF8CD.8060409@ichips.intel.com> Message-ID: On Tue, 11 Oct 2005, Sean Hefty wrote: > Does anyone have any objection to rdma_listen() blocking? > > I'm working on adding support for listening across any device, but need to > synchronize with device addition/removal. I have a strong objection to making it block. Our goal is to provide an interface with semantics similar to the sockets interface. A socket's listen function does not block (e.g. inet_listen). Since not blocking is what ULPs expect, kDAPL's listen function does not block. The same should be true of the CMA function. james From mst at mellanox.co.il Tue Oct 11 14:23:14 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Oct 2005 23:23:14 +0200 Subject: [openib-general] Re: 2.6.14 heads up: ip_dev_find() not exported In-Reply-To: <52slvr7w1l.fsf@cisco.com> References: <52slvr7w1l.fsf@cisco.com> Message-ID: <20051011212314.GA18896@mellanox.co.il> Quoting r. Roland Dreier : > Subject: 2.6.14 heads up: ip_dev_find() not exported > > I noticed while compiling against an up-to-date kernel tree that SDP > and IBAT both use the function ip_dev_find(). The EXPORT_SYMBOL for > this function was removed during the 2.6.14 devel cycle. > > I haven't looked yet at what this function does, how SDP and IBAT use > it or what it could be replaced by. But now would be a good time to > figure out whether we need to ask for it to be re-exported, or if > there's a better alternative to do whatever it does for us. > > - R. Guys, did anyone figure out yet how we can find a device by its address without ip_dev_find? 
To remind you all, we use it to handle cases where the address is local and so ip_route_output_key gets us a loopback device. If not, is it too late to ask for it to be re-exported to modules? -- MST From mshefty at ichips.intel.com Tue Oct 11 14:27:02 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 11 Oct 2005 14:27:02 -0700 Subject: [openib-general] [CMA] blocking in rdma_listen() In-Reply-To: References: <434BF8CD.8060409@ichips.intel.com> Message-ID: <434C2E26.9050505@ichips.intel.com> James Lentini wrote: > Our goal is to provide an interface with semantics similar to the > sockets interface. A socket's listen function does not block (e.g. > inet_listen). > > Since not blocking is what ULPs expect, kDAPL's listen function does > not block. The same should be true of the CMA function. From what I can see, kDAPL connect and listen calls can block, as does inet_listen. I'm referring to the thread blocking within the call, specifically on a semaphore and memory allocation using GFP_KERNEL. I am not referring to listen blocking until a connection request is received. - Sean From mshefty at ichips.intel.com Tue Oct 11 14:30:02 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 11 Oct 2005 14:30:02 -0700 Subject: [openib-general] Re: 2.6.14 heads up: ip_dev_find() not exported In-Reply-To: <20051011212314.GA18896@mellanox.co.il> References: <52slvr7w1l.fsf@cisco.com> <20051011212314.GA18896@mellanox.co.il> Message-ID: <434C2EDA.5020007@ichips.intel.com> Michael S. Tsirkin wrote: > Guys, did anyone figure out yet how we can find a device by its address > without ip_dev_find? I wrote ib_addr to call ip_dev_find(). I didn't see a cleaner way to do this. > If not, is it too late to ask for it to be re-exported to modules? Hal already tried to re-export it. The response was that exporting it will only be accepted once code is submitted for inclusion that calls it. 
- Sean From mst at mellanox.co.il Tue Oct 11 14:33:15 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Oct 2005 23:33:15 +0200 Subject: [openib-general] Re: [CMA] blocking in rdma_listen() In-Reply-To: <434BF8CD.8060409@ichips.intel.com> References: <434BF8CD.8060409@ichips.intel.com> Message-ID: <20051011213315.GB18896@mellanox.co.il> Quoting Sean Hefty : > Subject: [CMA] blocking in rdma_listen() > > Does anyone have any objection to rdma_listen() blocking? > > I'm working on adding support for listening across any device, but need to > synchronize with device addition/removal. > > - Sean Sean, when you say "blocking", do you mean "might sleep"? If so, I don't have any objections. -- MST From mshefty at ichips.intel.com Tue Oct 11 14:35:36 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 11 Oct 2005 14:35:36 -0700 Subject: [openib-general] Re: [CMA] blocking in rdma_listen() In-Reply-To: <20051011213315.GB18896@mellanox.co.il> References: <434BF8CD.8060409@ichips.intel.com> <20051011213315.GB18896@mellanox.co.il> Message-ID: <434C3028.8040800@ichips.intel.com> Michael S. Tsirkin wrote: > Sean, when you say "blocking", do you mean "might sleep"? > If so, I don't have any objections. Yes - I mean might sleep. I wasn't very clear on that. - Sean From jlentini at netapp.com Tue Oct 11 14:37:02 2005 From: jlentini at netapp.com (James Lentini) Date: Tue, 11 Oct 2005 17:37:02 -0400 (EDT) Subject: [openib-general] [CMA] blocking in rdma_listen() In-Reply-To: <434C2E26.9050505@ichips.intel.com> References: <434BF8CD.8060409@ichips.intel.com> <434C2E26.9050505@ichips.intel.com> Message-ID: On Tue, 11 Oct 2005, Sean Hefty wrote: > James Lentini wrote: > > Our goal is to provide an interface with semantics similar to the sockets > > interface. A socket's listen function does not block (e.g. inet_listen). > > Since not blocking is what ULPs expect, kDAPL's listen function does not > > block. The same should be true of the CMA function. 
> > From what I can see, kDAPL connect and listen calls can block, as does > inet_listen. I'm referring to the thread blocking within the call, > specifically on a semaphore and memory allocation using GFP_KERNEL. I am not > referring to listen blocking until a connection request is received. I thought you meant blocking for a connection request to arrive. You're right: the kDAPL and inet_listen functions can block for the reasons you list. I'm ok with rdma_listen() also blocking for these reasons. james From nacc at us.ibm.com Tue Oct 11 14:45:21 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Tue, 11 Oct 2005 14:45:21 -0700 Subject: [openib-general] Latest build test results In-Reply-To: <1128692935.4382.7072.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <1128692935.4382.7072.camel@hal.voltaire.com> Message-ID: <20051011214521.GM5972@us.ibm.com> On 07.10.2005 [09:48:56 -0400], Hal Rosenstock wrote: > On Thu, 2005-10-06 at 15:26, Hal Rosenstock wrote: > > On Thu, 2005-10-06 at 15:20, Nishanth Aravamudan wrote: > > > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > > > > > Great! Thanks. > > > > > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > > > weren't running) now and will post the latest results. > > > > > > > > You might also want to apply > > > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > > > to get rid of the AT and SDP warnings. 
> > > > > > This patch does remove the warning regarding undefined symbols during > > > modpost, but does not remove the warnings > > > > > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > > > > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > > > Right. Roland reported a change to struct packet_type in 2.6.14. I'll > > work on a patch for this too. Thanks. > > Can you try this patch for the above 2 warnings ? If it works, I check > it into the patches directory. Thanks. > > -- Hal > > Update arp_recv functions to latest 2.6.14 netdevice.h API for struct > packet_type Sorry for the delay, I haven't yet had time to test the patches :/ I'll try to get to it tonight or tomorrow. Is there any way you can send me patches against the kernel tree as opposed to the svn repo? It makes my side of things *a lot* easier, as right now I have to take your patch against svn and either hand-edit or patch my checkout and then diff against the current kernel tree. Thanks, Nish From hozer at hozed.org Tue Oct 11 14:45:28 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Tue, 11 Oct 2005 16:45:28 -0500 Subject: [openib-general] IBM eHCA testing.. In-Reply-To: References: <1129037891.4377.8074.camel@hal.voltaire.com> Message-ID: <20051011214527.GH4612@kalmia.hozed.org> On Tue, Oct 11, 2005 at 09:13:20AM -0700, Shirley Ma wrote: > The IB stack doesn't handle errors during client initialization. This > problem is easy to reproduce by inducing errors (resouce allocation > failure or query failure) in mad_client or sa_client registration. I am > working on a patch, but I am in class the whole week, don't have time to > verify the patch. I hope the patch will be available early next week to > fix the panic. I'd be happy to verify the patch, but I need to get the latest version of the ehca driver, ideally already integrated into the subversion tree. 
Otherwise a tar.gz I can extract and drop in drivers/infiniband/hw/ehca would work just fine. I'm still not sure I got an answer as to why the ehca is so sensitive to which port is plugged in. From mst at mellanox.co.il Tue Oct 11 14:50:22 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 11 Oct 2005 23:50:22 +0200 Subject: [openib-general] Re: Re: 2.6.14 heads up: ip_dev_find() not exported In-Reply-To: <434C2EDA.5020007@ichips.intel.com> References: <434C2EDA.5020007@ichips.intel.com> Message-ID: <20051011215022.GC18896@mellanox.co.il> Quoting r. Sean Hefty : > > Guys, did anyone figure out yet how we can find a device by its address > > without ip_dev_find? > > I wrote ib_addr to call ip_dev_find(). I didn't see a cleaner way to do this. > > > If not, is it too late to ask for it to be re-exported to modules? > > Hal already tried to re-export it. The response was that exporting it will only > be accepted once code is submitted for inclusion that calls it. > > - Sean > Hmm, maybe posting addr.c on lkml will help? -- MST From mshefty at ichips.intel.com Tue Oct 11 14:52:29 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 11 Oct 2005 14:52:29 -0700 Subject: [openib-general] Re: 2.6.14 heads up: ip_dev_find() not exported In-Reply-To: <20051011215022.GC18896@mellanox.co.il> References: <434C2EDA.5020007@ichips.intel.com> <20051011215022.GC18896@mellanox.co.il> Message-ID: <434C341D.1080605@ichips.intel.com> Michael S. Tsirkin wrote: > Hmm, maybe posting addr.c on lkml will help? It probably needs to be reviewed and tested a little more first. Plus, the only user of it at the moment is the CMA. We may find that to add addr.c, we need a user, which requires the cma, which requires yet another user... - Sean From mst at mellanox.co.il Tue Oct 11 15:03:58 2005 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 12 Oct 2005 00:03:58 +0200 Subject: [openib-general] Re: Latest build test results In-Reply-To: <20051011214521.GM5972@us.ibm.com> References: <20051011214521.GM5972@us.ibm.com> Message-ID: <20051011220358.GD18896@mellanox.co.il> Quoting Nishanth Aravamudan : > Is there anyway you can send me patches against the kernel tree as > opposed to the svn repo? It makes my side of things *a lot* easier, as > right now I have to take your patch against svn and either hand-edit or > patch my checkout and then diff against the current kernel tree. In case this is useful to others, I am using the following trick with softlinks to create -p1 patches suitable to applying to kernel (requires svn client revision 1.2.3 and up): cd trunk/src/linux-kernel/ ln -s . drivers cd ../ svn diff --diff-cmd "/usr/bin/diff" -x -up linux-kernel/drivers/infiniband I've put this information in the FAQ. -- MST From mst at mellanox.co.il Tue Oct 11 15:13:39 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 12 Oct 2005 00:13:39 +0200 Subject: [openib-general] Re: 2.6.14 heads up: ip_dev_find() not exported In-Reply-To: <434C341D.1080605@ichips.intel.com> References: <434C341D.1080605@ichips.intel.com> Message-ID: <20051011221339.GE18896@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: 2.6.14 heads up: ip_dev_find() not exported > > Michael S. Tsirkin wrote: > > Hmm, maybe posting addr.c on lkml will help? > > It probably needs to be reviewed and tested a little more first. Certainly, but maybe thats a good way to get more review. > Plus, the only > user of it at the moment is the CMA. We may find that to add addr.c, we need a > user, which requires the cma, which requires yet another user... > > - Sean Hmm. BTW, we need to add something for userspace? Userspace can already get at GIDs, I think, but how does it get the IPoIB pkey? 
-- MST From rjwalsh at pathscale.com Tue Oct 11 15:17:04 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Tue, 11 Oct 2005 15:17:04 -0700 Subject: [openib-general] InfiniPath driver announcement In-Reply-To: <52y84z96qr.fsf@cisco.com> References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> Message-ID: <1129069024.29804.24.camel@hematite.internal.keyresearch.com> Some comments on the patch: > + /* Don't let userspace make us allocate a huge buffer */ > + if (cmd.ne > 256) > + return -ENOMEM; > + Is this necessary? Won't the following fail with ENOMEM anyway if cmd.ne is too big: > wc = kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL); > if (!wc) > return -ENOMEM; Same here: > + /* Don't let userspace make us allocate a huge buffer */ > + if (cmd.wqe_size > 4096) > + return -ENOMEM; > + > + user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL); > + if (!user_wr) > + return -ENOMEM; > + What meaning do those numbers have, exactly? i.e. the 4096 number above? > + if (ret) { > + for (next = wr; next; next = next->next) { > + if (next == bad_wr) > + break; > + ++resp.bad_wr; > + } > + } Will this work? If bad_wr is the first wr, then resp.bad_wr will be zero. The current user code (which has to change anyway) assumes 0 == "no bad wr" and 1 == "the first wr is the bad wr", etc. > + while (wr) { > + next = wr->next; > + kfree(wr); > + wr = next; > + } > + > + kfree(user_wr); One reason why I originally allocated one big wr area instead of a bunch of smaller ones was to keep the cost of this down. Is it a good idea to be doing this with a bunch of kmallocs? This is all I've had a chance to look at for the moment. More later. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From mshefty at ichips.intel.com Tue Oct 11 15:16:57 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 11 Oct 2005 15:16:57 -0700 Subject: [openib-general] Re: 2.6.14 heads up: ip_dev_find() not exported In-Reply-To: <20051011221339.GE18896@mellanox.co.il> References: <434C341D.1080605@ichips.intel.com> <20051011221339.GE18896@mellanox.co.il> Message-ID: <434C39D9.30704@ichips.intel.com> Michael S. Tsirkin wrote: > Hmm. > BTW, we need to add something for userspace? > Userspace can already get at GIDs, I think, but how does it get the > IPoIB pkey? Something needs to be done for userspace, but I'm not entirely sure what yet. I've given it some thought, but was deferring doing too much until I had a couple of missing areas completed in the kernel CMA first. I think that the pkey is exported by ipoib through /sys/class/net/ib0. - Sean From rolandd at cisco.com Tue Oct 11 15:25:50 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 11 Oct 2005 15:25:50 -0700 Subject: [openib-general] InfiniPath driver announcement In-Reply-To: <1129069024.29804.24.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Tue, 11 Oct 2005 15:17:04 -0700") References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> <1129069024.29804.24.camel@hematite.internal.keyresearch.com> Message-ID: <52oe5v8zdd.fsf@cisco.com> Robert> Is this necessary? Won't the following fail with ENOMEM Robert> anyway if cmd.ne is too big: Yeah, you're right. For a while in the kernel tree, an oversized kmalloc() triggered a bug, but it was reverted: commit dbdb90450059e17e8e005ebd3ce0a1fd6008a0c8 Author: Andrew Morton Date: Fri Sep 23 13:24:10 2005 -0700 [PATCH] revert oversized kmalloc check As davem points out, this wasn't such a great idea. There may be some code which does: size = 1024*1024; while (kmalloc(size, ...) 
== 0) size /= 2; which will now explode. Cc: "David S. Miller" Cc: Christoph Lameter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds I'll delete these checks. Robert> Will this work? If bad_wr is the first wr, then Robert> resp.bad_wr will be zero. The current user code (which Robert> has to change anyway) assumes 0 == "no bad wr" and 1 == Robert> "the first wr is the bad wr", etc. I missed that. I'll fix up the code locally. Robert> One reason why I originally allocated one big wr area Robert> instead of a bunch of smaller ones was to keep the cost of Robert> this down. Is it a good idea to be doing this with a Robert> bunch of kmallocs? I think the common case is probably posting a single work request. And kmalloc() is pretty cheap. So I think this is OK -- better than failing when memory gets fragmented. - R. From halr at voltaire.com Tue Oct 11 18:27:15 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Oct 2005 21:27:15 -0400 Subject: [openib-general] Latest build test results In-Reply-To: <20051011214521.GM5972@us.ibm.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <1128692935.4382.7072.camel@hal.voltaire.com> <20051011214521.GM5972@us.ibm.com> Message-ID: <1129080434.4377.12024.camel@hal.voltaire.com> Hi Nish, On Tue, 2005-10-11 at 17:45, Nishanth Aravamudan wrote: > On 07.10.2005 [09:48:56 -0400], Hal Rosenstock wrote: > > On Thu, 2005-10-06 at 15:26, Hal Rosenstock wrote: > > > On Thu, 2005-10-06 at 15:20, Nishanth Aravamudan wrote: > > > > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > > > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > > > > > > > Great! Thanks. 
> > > > > > > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > > > > weren't running) now and will post the latest results. > > > > > > > > > > You might also want to apply > > > > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > > > > to get rid of the AT and SDP warnings. > > > > > > > > This patch does remove the warning regarding undefined symbols during > > > > modpost, but does not remove the warnings > > > > > > > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > > > > > > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > > > > > Right. Roland reported a change to struct packet_type in 2.6.14. I'll > > > work on a patch for this too. Thanks. > > > > Can you try this patch for the above 2 warnings ? If it works, I check > > it into the patches directory. Thanks. > > > > -- Hal > > > > Update arp_recv functions to latest 2.6.14 netdevice.h API for struct > > packet_type > > Sorry for the delay, I haven't yet had time to test the patches :/ I'll > try to get to it tonight or tomorrow. > > Is there anyway you can send me patches against the kernel tree as > opposed to the svn repo? It makes my side of things *a lot* easier, as > right now I have to take your patch against svn and either hand-edit or > patch my checkout and then diff against the current kernel tree. Since you were reporting iSER, AT, and SDP compile warnings/errors, aren't you using the latest OpenIB svn tree with 2.6.14-rc3 ? Which patches are you referring to ? Was it the fib_frontend.c one ? Not sure why they would need any manual fixup. At least that one was pretty straightforward. 2.6.14-rc4 is out now. 
-- Hal From nacc at us.ibm.com Tue Oct 11 18:39:30 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Tue, 11 Oct 2005 18:39:30 -0700 Subject: [openib-general] Latest build test results In-Reply-To: <1129080434.4377.12024.camel@hal.voltaire.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <1128692935.4382.7072.camel@hal.voltaire.com> <20051011214521.GM5972@us.ibm.com> <1129080434.4377.12024.camel@hal.voltaire.com> Message-ID: <20051012013930.GB13157@us.ibm.com> On 11.10.2005 [21:27:15 -0400], Hal Rosenstock wrote: > Hi Nish, > > On Tue, 2005-10-11 at 17:45, Nishanth Aravamudan wrote: > > On 07.10.2005 [09:48:56 -0400], Hal Rosenstock wrote: > > > On Thu, 2005-10-06 at 15:26, Hal Rosenstock wrote: > > > > On Thu, 2005-10-06 at 15:20, Nishanth Aravamudan wrote: > > > > > On 06.10.2005 [13:25:35 -0400], Hal Rosenstock wrote: > > > > > > On Thu, 2005-10-06 at 13:11, Nishanth Aravamudan wrote: > > > > > > > On 06.10.2005 [19:40:40 +0300], Dan Bar Dov wrote: > > > > > > > > I've fixed the 2.6.14-rc3 compilation warnings with iSER on x86 in version 3682. > > > > > > > > > > > > > > Great! Thanks. > > > > > > > > > > > > > > I'm re-running the tests (due to a subtle flaw in my PATH, my cronjobs > > > > > > > weren't running) now and will post the latest results. > > > > > > > > > > > > You might also want to apply > > > > > > https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/linux-2.6.14-rc3-fib-frontend.diff > > > > > > to get rid of the AT and SDP warnings. 
> > > > > > > > > > This patch does remove the warning regarding undefined symbols during > > > > > modpost, but does not remove the warnings > > > > > > > > > > drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type > > > > > > > > > > drivers/infiniband/ulp/sdp/sdp_link.c:752: warning: initialization from incompatible pointer type > > > > > > > > Right. Roland reported a change to struct packet_type in 2.6.14. I'll > > > > work on a patch for this too. Thanks. > > > > > > Can you try this patch for the above 2 warnings ? If it works, I check > > > it into the patches directory. Thanks. > > > > > > -- Hal > > > > > > Update arp_recv functions to latest 2.6.14 netdevice.h API for struct > > > packet_type > > > > Sorry for the delay, I haven't yet had time to test the patches :/ I'll > > try to get to it tonight or tomorrow. > > > > Is there anyway you can send me patches against the kernel tree as > > opposed to the svn repo? It makes my side of things *a lot* easier, as > > right now I have to take your patch against svn and either hand-edit or > > patch my checkout and then diff against the current kernel tree. > > Since you were reporting iSER, AT, and SDP compile warnings/errors, > aren't you using the latest OpenIB svn tree with 2.6.14-rc3 ? Yes; but you have to understand that the automated build system I have access to 1) does not have external internet access (i.e., to the svn tree) and 2) only builds kernels unless I manually send commands to the terminal. So, the way I'm doing things is: Send in 4 jobs for mainline (x86 and ppc64 with =y and =m) and then generate a patch of the latest svn tree against the current -git release (a patch to the kernel) and send it in as a parameter to my builds to test the latest svn tree. This leads to another 4 jobs (x86 and ppc64 with =y and =m). I'm *only* doing kernel build testing right now. > Which patches are you referring to ? Was it the fib_frontend.c one ? 
> Not sure why they would need any manual fixup. At least that one was > pretty straightforward. In the sense that I have to edit them to kernel relative paths, not in the content of the patch. To test any patch in the system I have access to, it needs to be a normal kernel patch (-p1 applicable to the base tree). Going through and manually applying patches to the svn tree and then regenerating the diff completely defeats the purpose of automated compilation testing. > 2.6.14-rc4 is out now. Yes, I know. Thanks, Nish From halr at voltaire.com Tue Oct 11 20:15:27 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Oct 2005 23:15:27 -0400 Subject: [openib-general] Latest build test results In-Reply-To: <20051012013930.GB13157@us.ibm.com> References: <20051003221553.GA27996@us.ibm.com> <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <1128692935.4382.7072.camel@hal.voltaire.com> <20051011214521.GM5972@us.ibm.com> <1129080434.4377.12024.camel@hal.voltaire.com> <20051012013930.GB13157@us.ibm.com> Message-ID: <1129086927.4377.12455.camel@hal.voltaire.com> Hi again Nish, On Tue, 2005-10-11 at 21:39, Nishanth Aravamudan wrote: > > > > Update arp_recv functions to latest 2.6.14 netdevice.h API for struct > > > > packet_type > > > > > > Sorry for the delay, I haven't yet had time to test the patches :/ I'll > > > try to get to it tonight or tomorrow. > > > > > > Is there anyway you can send me patches against the kernel tree as > > > opposed to the svn repo? It makes my side of things *a lot* easier, as > > > right now I have to take your patch against svn and either hand-edit or > > > patch my checkout and then diff against the current kernel tree. > > > > Since you were reporting iSER, AT, and SDP compile warnings/errors, > > aren't you using the latest OpenIB svn tree with 2.6.14-rc3 ? 
> > Yes; but you have to understand that the automated build system I have > access to 1) does not have external internet access (i.e., to the svn > tree) and 2) only builds kernels unless I manually send commands to the > terminal. > > So, the way I'm doing things is: > > Send in 4 jobs for mainline (x86 and ppc64 with =y and =m) and then > generate a patch of the latest svn tree against the current -git release > (a patch to the kernel) and send it in as a parameter to my builds to > test the latest svn tree. This leads to another 4 jobs (x86 and ppc64 > with =y and =m). > > I'm *only* doing kernel build testing right now. > > > Which patches are you referring to ? Was it the fib_frontend.c one ? > > Not sure why they would need any manual fixup. At least that one was > > pretty straightforward. > > In the sense that I have to edit them to kernel relative paths, not in > the content of the patch. To test any patch in the system I have access > to, it needs to be a normal kernel patch (-p1 applicable to the base > tree). > > Going through and manually applying patches to the svn tree and then > regenerating the diff completely defeats the purpose of automated > compilation testing. OK. Do you need any patches regenerated or is this more for the future ? -- Hal From rolandd at cisco.com Tue Oct 11 20:39:26 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 11 Oct 2005 20:39:26 -0700 Subject: [openib-general] DMA mapping abuses in MAD layer Message-ID: <52fyr78kup.fsf@cisco.com> I recently got a chance to play with an eval board for the PowerPC 440SPe -- an embedded system with PCI Express support where the PCI bus is not cache coherent with the CPU. Of course I plugged an HCA in and tried out our current drivers. It turns out that everything works pretty well, except the HCA's ports never make it past INIT. I did some debugging, and the reason for this is that the MAD layer doesn't quite use the DMA mapping API properly. 
Once we call dma_map_single() on a buffer, the CPU may not touch that buffer until after the corresponding dma_unmap_single().

On mainstream architectures, it turns out that we can get away with violating this rule. However, on non-cache-coherent architectures like PowerPC 4xx, dma_map_single(..., DMA_TO_DEVICE) does a cache flush, which makes sure that the contents of the CPU's cache are really written to memory. If a driver then changes the contents of the buffer after the call to dma_map_single(), then it's quite likely that the change will be made only in the CPU's cache and the device will end up DMA-ing the old data.

The problem I hit is in ib_post_send_mad(), specifically:

	smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr;
	if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) {
		ret = handle_outgoing_dr_smp(mad_agent_priv, smp,
					     send_wr);

Basically, when the MAD layer goes to send a directed route reply, it changes the MAD buffer after the DMA mapping is done. The HCA doesn't see the change, the wrong packet gets sent and the SM never sees replies to its queries.

Adding a PPC-specific cache flush call after the call to handle_outgoing_dr_smp() fixes things to the point that the port can be brought to ACTIVE, and in fact IPoIB works as well. However, this is just a kludge -- the real fix will need to be more invasive. It seems that the whole interface to the MAD layer may need to be reorganized to avoid doing this.

It looks like there is a similar problem with ib_create_send_mad(): it does DMA mapping on a buffer that is then returned for the caller to modify.

Finally, some of the MAD structures like struct ib_mad_private look risky to me, since kernel data might potentially share a cache line with DMA buffers. See for a nice writeup of the class of bug that might be lurking.

Sorry for missing all of this when the MAD layer was first being developed and reviewed.

 - R.
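
[Editorial illustration, not part of the original thread.] The ordering problem described above can be sketched in plain userspace C. This is only a toy model under stated assumptions: it does not use the kernel DMA API, and every name in it (struct sim_buf, sim_dma_map(), fixup_header() and so on) is hypothetical. The "mapping" is modelled as a one-time copy from the CPU's view of a buffer to the device's view, which roughly stands in for the cache flush that dma_map_single(..., DMA_TO_DEVICE) performs on a non-cache-coherent platform. A real cache may write a late change back at some point; the model simply assumes it never does before the send.

```c
#include <assert.h>
#include <string.h>

#define BUF_SIZE 64

/*
 * Hypothetical model of a non-cache-coherent system: the CPU writes to
 * cpu_view (its cache), while the device can only read dev_view (memory).
 */
struct sim_buf {
	unsigned char cpu_view[BUF_SIZE];
	unsigned char dev_view[BUF_SIZE];
	int mapped;
};

static void sim_buf_init(struct sim_buf *b)
{
	memset(b, 0, sizeof(*b));
}

/* Stand-in for the flush done at map time: copy CPU view to memory once. */
static void sim_dma_map(struct sim_buf *b)
{
	assert(!b->mapped);
	memcpy(b->dev_view, b->cpu_view, BUF_SIZE);
	b->mapped = 1;
}

static void sim_dma_unmap(struct sim_buf *b)
{
	assert(b->mapped);
	b->mapped = 0;
}

/* Stand-in for a late header change such as the directed route fixup. */
static void fixup_header(struct sim_buf *b, unsigned char value)
{
	b->cpu_view[0] = value;
}

/* Problematic order: map first, then modify. Returns the byte "sent". */
static unsigned char send_map_then_modify(struct sim_buf *b,
					  unsigned char value)
{
	unsigned char sent;

	sim_dma_map(b);
	fixup_header(b, value);	/* change stays in the CPU's view only */
	sent = b->dev_view[0];	/* device still reads the old data */
	sim_dma_unmap(b);
	return sent;
}

/* Safe order: finish all CPU changes, then map. Returns the byte "sent". */
static unsigned char send_modify_then_map(struct sim_buf *b,
					  unsigned char value)
{
	unsigned char sent;

	fixup_header(b, value);
	sim_dma_map(b);		/* flush happens after the change */
	sent = b->dev_view[0];
	sim_dma_unmap(b);
	return sent;
}
```

In this model, calling send_map_then_modify() on a zeroed buffer sends the stale first byte, while send_modify_then_map() sends the updated one. That is the same reasoning behind deferring the mapping until the MAD layer has finished touching the buffer.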
From rolandd at cisco.com Tue Oct 11 21:08:27 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 11 Oct 2005 21:08:27 -0700 Subject: [openib-general] InfiniPath driver announcement In-Reply-To: <1129069024.29804.24.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Tue, 11 Oct 2005 15:17:04 -0700") References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> <1129069024.29804.24.camel@hematite.internal.keyresearch.com> Message-ID: <527jcj8jic.fsf@cisco.com> Here is an updated kernel patch and a matching libibverbs patch (both against your branch). Still compile tested only. I'll get my Pathscale system set up to do some testing of this code tomorrow morning and see if this stuff actually works. - R. -------------- next part -------------- A non-text attachment was scrubbed... Name: ipath-kernel.diff Type: text/x-patch Size: 18858 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ipath-libibverbs.diff Type: text/x-patch Size: 11480 bytes Desc: not available URL: From nacc at us.ibm.com Tue Oct 11 21:17:46 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Tue, 11 Oct 2005 21:17:46 -0700 Subject: [openib-general] Latest build test results In-Reply-To: <1129086927.4377.12455.camel@hal.voltaire.com> References: <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <1128692935.4382.7072.camel@hal.voltaire.com> <20051011214521.GM5972@us.ibm.com> <1129080434.4377.12024.camel@hal.voltaire.com> <20051012013930.GB13157@us.ibm.com> <1129086927.4377.12455.camel@hal.voltaire.com> Message-ID: <20051012041746.GC13157@us.ibm.com> On 11.10.2005 [23:15:27 -0400], Hal Rosenstock wrote: > Hi again Nish, > > On Tue, 2005-10-11 at 21:39, Nishanth Aravamudan wrote: > > > > > Update arp_recv functions to latest 2.6.14 netdevice.h API for struct > > > > > packet_type > > > > > > > > Sorry for the delay, I haven't yet had time to test the patches :/ I'll > > > > try to get to it tonight or tomorrow. > > > > > > > > Is there anyway you can send me patches against the kernel tree as > > > > opposed to the svn repo? It makes my side of things *a lot* easier, as > > > > right now I have to take your patch against svn and either hand-edit or > > > > patch my checkout and then diff against the current kernel tree. > > > > > > Since you were reporting iSER, AT, and SDP compile warnings/errors, > > > aren't you using the latest OpenIB svn tree with 2.6.14-rc3 ? > > > > Yes; but you have to understand that the automated build system I have > > access to 1) does not have external internet access (i.e., to the svn > > tree) and 2) only builds kernels unless I manually send commands to the > > terminal. 
> > > > So, the way I'm doing things is: > > > > Send in 4 jobs for mainline (x86 and ppc64 with =y and =m) and then > > generate a patch of the latest svn tree against the current -git release > > (a patch to the kernel) and send it in as a parameter to my builds to > > test the latest svn tree. This leads to another 4 jobs (x86 and ppc64 > > with =y and =m). > > > > I'm *only* doing kernel build testing right now. > > > > > Which patches are you referring to ? Was it the fib_frontend.c one ? > > > Not sure why they would need any manual fixup. At least that one was > > > pretty straightforward. > > > > In the sense that I have to edit them to kernel relative paths, not in > > the content of the patch. To test any patch in the system I have access > > to, it needs to be a normal kernel patch (-p1 applicable to the base > > tree). > > > > Going through and manually applying patches to the svn tree and then > > regenerating the diff completely defeats the purpose of automated > > compilation testing. > > OK. Do you need any patches regenerated or is this more for the future ? If you could regen the patches, that would definitely speed things up for me, but I can handle these few, it's not a big deal. Definitely, in the future, it makes it an almost instantaneous build test if I have the kernel-relative patch. Thanks, Nish From sean.hefty at intel.com Tue Oct 11 21:24:43 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 11 Oct 2005 21:24:43 -0700 Subject: [openib-general] DMA mapping abuses in MAD layer In-Reply-To: <52fyr78kup.fsf@cisco.com> Message-ID: >properly. Once we call dma_map_single() on a buffer, the CPU may not >touch that buffer until after the corresponding dma_unmap_single(). It sounds like we need to change how the mapping is done. Can we let the MAD layer always control the mapping? Considering how RMPP works, I'm not sure what else we could do. >is just a cludge -- the real fix will need to be more invasive. 
It >seems that the whole interface to the MAD layer may need to be >reorganized to avoid doing this. We really just need to change the post_send_mad routine, don't we? The original intent around that API was to permit posting the WR directly onto the QP. Since this isn't the case, what about changing post send to take as input an ib_mad_send_buf, with the work request and SGE fields removed? This could permit some additional optimization, such as avoiding additional allocations within the post send call. (Taking it a step further, we could create a new structure to permit using a received MAD as input to a send.) >It looks like there is a similar problem with ib_create_send_mad(): it >does DMA mapping on a buffer that is then returned for the caller to modify. If we pass the send_buf into post_send_mad, then the mapping could be deferred. >Finally, some of the MAD structures like struct ib_mad_private look >risky to me, since kernel data might potentially share a cache line >with DMA buffers. See for a nice >writeup of the class of bug that might be lurking. This sounds like a separate issue, is that the case? - Sean From rolandd at cisco.com Tue Oct 11 21:35:35 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 11 Oct 2005 21:35:35 -0700 Subject: [openib-general] DMA mapping abuses in MAD layer In-Reply-To: (Sean Hefty's message of "Tue, 11 Oct 2005 21:24:43 -0700") References: Message-ID: <52y84z73oo.fsf@cisco.com> Sean> It sounds like we need to change how the mapping is done. Sean> Can we let the MAD layer always control the mapping? I guess so. Another alternative would be for the consumer to provide some sort of callback interface to handle the mapping, but that doesn't seem feasible. Sean> We really just need to change the post_send_mad routine, Sean> don't we? I guess so -- and remove the DMA mapping call from ib_create_send_mad(). Sean> The original intent around that API was to permit posting Sean> the WR directly onto the QP. 
Since this isn't the case, Sean> what about changing post send to take as input an Sean> ib_mad_send_buf, with the work request and SGE fields Sean> removed? We probably still want to handle gather lists for posting sends I think. Another (rather unrelated) issue that I just noticed the other day is that something like sending a response to a GetTable request for PortInfo for every port in a large fabric is going to end up sending a very large RMPP message, probably too large to fit in a single kmalloc()ed buffer. So I don't think we should require that all send requests have a single gather entry. Roland> Finally, some of the MAD structures like struct Roland> ib_mad_private look risky to me, since kernel data might Roland> potentially share a cache line with DMA buffers. See Roland> for a nice writeup of the Roland> class of bug that might be lurking. Sean> This sounds like a separate issue, is that the case? Yes. In fact I'm not sure there's really a bug there. It's just something questionable that I saw while trying to find the real problem on 440SPe. - R. From ftillier at silverstorm.com Tue Oct 11 22:59:27 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Tue, 11 Oct 2005 22:59:27 -0700 Subject: [openib-general] DMA mapping abuses in MAD layer In-Reply-To: <52fyr78kup.fsf@cisco.com> Message-ID: <001b01c5cef2$1bfcb180$9e5aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, October 11, 2005 8:39 PM > > On mainstream architectures, it turns out that we can get away with > violating this rule. However, on non-cache-coherent architectures > like PowerPC 4xx, dma_map_single(..., DMA_TO_DEVICE) does a cache > flush, which makes sure that the contents of the CPU's cache are > really written to memory. If a driver then changes the contents of > the buffer after the call to dma_map_single(), then it's quite likely > that the change will be made only in the CPU's cache and the device > will end up DMA-ing the old data. 
> > The problem I hit is in ib_post_send_mad(), specifically: > > smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; > if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { > ret = handle_outgoing_dr_smp(mad_agent_priv, smp, > send_wr); > > basically, when the MAD layer goes to send a directed route reply, it > changes the MAD buffer after the DMA mapping is done. The HCA > doesn't see the change, the wrong packet gets sent and the SM never > sees replies to its queries. > > Adding a PPC-specific cache flush call after the call to > handle_outgoing_dr_smp() fixes things to the point that the port can > be brought to ACTIVE, and in fact IPoIB works as well. However, this > is just a cludge -- the real fix will need to be more invasive. It > seems that the whole interface to the MAD layer may need to be > reorganized to avoid doing this. Why not just use inline sends for the special QPs and remove the need to perform any DMA mappings on the send side altogether? - Fab From yael at mellanox.co.il Tue Oct 11 23:52:08 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Wed, 12 Oct 2005 08:52:08 +0200 Subject: [openib-general] RE: [PATCH] Opensm - enabling erase of log file flag Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2351@mtlexch01.mtl.com> You are right. Thanks! Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Tuesday, October 11, 2005 3:18 PM To: Yael Kalka Cc: openib-general at openib.org; Eitan Zahavi Subject: Re: [PATCH] Opensm - enabling erase of log file flag Hi Yael, On Tue, 2005-10-11 at 08:24, Yael Kalka wrote: > Currently the osm log file is accumulative. I've added an option to > erase the log file before starting to write it. > By default, still, the log is still accumulative. > Attached is a patch for that. One minor comment on this... 
> Thanks, > Yael > > Signed-off-by: Yael Kalka > Index: opensm/osm_subnet.c > =================================================================== > --- opensm/osm_subnet.c (revision 3704) > +++ opensm/osm_subnet.c (working copy) > @@ -920,6 +925,7 @@ osm_subn_write_conf_file( > "force_log_flush %s\n\n" > "# Log file to be used\n" > "log_file %s\n\n" > + "accum_log_file %s\n\n" > "# The directory to hold the file OpenSM dumps\n" > "dump_files_dir %s\n\n" > "# If TRUE if OpenSM should disable multicast support\n" > @@ -929,6 +935,7 @@ osm_subn_write_conf_file( > p_opts->log_flags, > p_opts->force_log_flush ? "TRUE" : "FALSE", > p_opts->log_file, > + p_opts->accum_log_file, Shouldn't this line be: p_opts->accum_log_file ? "TRUE" : "FALSE", -- Hal -------------- next part -------------- An HTML attachment was scrubbed... URL: From yael at mellanox.co.il Wed Oct 12 00:53:28 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Wed, 12 Oct 2005 09:53:28 +0200 Subject: [openib-general] RE: [PATCH] Opensm - handling immediate error in vendor_send new Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2352@mtlexch01.mtl.com> Hi Hal, Hal Rosenstock wrote: > Hi Yael, > > On Tue, 2005-10-11 at 04:28, Yael Kalka wrote: > > Attached is a new patch with several fixes for this issue. > > Thanks. Applied. > > There were still extra whitespace issues which I fixed by hand. Please > try to eliminate these so I don't have to do hand touch ups. > I will. Sorry. > > I decided to remove the checking for zero in the atomic_dec after all, > > since as I mentioned before - clearing it is not a fix, and we will > > see the value in other infos in the log file. > > But there is danger is these counters wrap, right ? > There is still some danger - as you noted - the counters can wrap. This will happen if there is some problem in the lower layer. For example - if we get the same mad twice, and we allocated it already for another request (after getting the first answer). 
It shouldn't happen if the lower layer is functioning correctly. > Also, in looking further at the code, the same issue does not appear to > occur for QP1 handling, right ? > No. There is no such issue in the QP1 handling. > -- Hal > -------------- next part -------------- An HTML attachment was scrubbed... URL: From IBMEHCAD at de.ibm.com Wed Oct 12 02:36:59 2005 From: IBMEHCAD at de.ibm.com (IBMEHCA DD) Date: Wed, 12 Oct 2005 11:36:59 +0200 Subject: [openib-general] IBM eHCA testing.. In-Reply-To: <52oe5xdp3e.fsf@cisco.com> Message-ID: This is basically the answer why its so "sensitive" which port is plugged. We're working on a solution to that problem. But currently we only see a chance to change this behaviour by also changing the firmware interface, which needs to be coordinated with firmware development. Roland Dreier wrote on 10.10.2005 23:44:21: > IBMEHCA> So you need some kind of signal from the operating system > IBMEHCA> to system firmware, which in the eHCA case is the > IBMEHCA> H_DEFINE_AQP1 triggered by ib_create_qp with IB_QPT_GSI > IBMEHCA> parameter. AFTER that call handshaking between system > IBMEHCA> firmware and the SM will start, here's a new adapter > IBMEHCA> active on a switch port... what's your guid? here's your > IBMEHCA> LID, p_key, SM lid.... ...and after all that it's > IBMEHCA> possible to send and receive packets from the fabric. > IBMEHCA> The openib stack expects that a port is fully functional > IBMEHCA> after this create_qp returns, and starts to do all sorts > IBMEHCA> of modify QP and post send. So the only choice we have > IBMEHCA> there is to delay create_qp until the complete > IBMEHCA> handshaking between system firmware and the SM has > IBMEHCA> finished (until we see a IB_PORT_ACTIVE in hcad_mod). If > IBMEHCA> we don't see that until EHCA_PORT_ACTIVE_TIMEOUT we have > IBMEHCA> to return an error code to openib, otherwise we're > IBMEHCA> seriously in trouble (tried that). > > I think this scheme breaks the IB model. 
When consumers get access to > an HCA, they expect to be able to access the HCA, even if an SM has > not configured it (and even in the case no cable is connected). As an > example of why this is useful, if the link won't come up, it's nice to > be able to get to query the port's PMA counters to see if there are > excessive errors or something like that. > I understand that you don't want to have all HCAs always visible to > the SM, but the scheme you've chosen puts an unneeded dependency > between driver initialization and the external SM. It would be fine > if creating QP1 triggered the transition of the port from DOWN to INIT > so that it is discoverable by the SM, but there's no reason for > creation of QP1 to wait to finish until the SM has brought the port up. > (As a side note, Mellanox HCAs don't bring a port to INIT until the > host driver has transitioned QP0 to the RTR state, which seems more > sensible than waiting for QP1 to be created) > I hope this can be fixed in firmware with your current HCA hardware. > - R. From IBMEHCAD at de.ibm.com Wed Oct 12 04:04:37 2005 From: IBMEHCAD at de.ibm.com (IBMEHCA DD) Date: Wed, 12 Oct 2005 13:04:37 +0200 Subject: [openib-general] Re: IBM eHCA testing..
In-Reply-To: <20051007141207.GX4612@kalmia.hozed.org> Message-ID: I just released the ehca2_0028 which uses svn 3615 on https://sourceforge.net/projects/ibmehcad/ As you might notice the license already has changed to the openib.org license. With 2.6.13 we had the non-issue that our main focus was on 2.6.5-7.191 and we're only now moving to the latest kernel. We're currently reworking the kernel-user interface of hcad_mod and libehca to also support a 32bit userspace libehca (to be released in the next few days). That will be the initial version to move to the openib.org svn. Christoph Troy Benjegerdes wrote on 07.10.2005 16:12:07: > I have two IBM eHCA cards installed and it appears that OpenSM > is happily talking to the firmware and bringing up the links. > So now I'm looking at the install instructions for the ehca2_EHCA2_0025.tgz > code drop, and wondering what (if any) issues there are with a 2.6.13 > kernel, or later OpenIB svn drops. > Is there a later code drop I can get ahold of? Is the nr_ports issue > something in the driver? I wound up connecting to the lower port in the > Openpower720 machine.. do you know if that's port 1 or 2? From IBMEHCAD at de.ibm.com Wed Oct 12 04:18:15 2005 From: IBMEHCAD at de.ibm.com (IBMEHCA DD) Date: Wed, 12 Oct 2005 13:18:15 +0200 Subject: [openib-general] Re: IBM eHCA testing.. In-Reply-To: <20051007141207.GX4612@kalmia.hozed.org> Message-ID: Troy Benjegerdes wrote on 07.10.2005 16:12:07: > I have two IBM eHCA cards installed and it appears that OpenSM > is happily talking to the firmware and bringing up the links. > So now I'm looking at the install instructions for the ehca2_EHCA2_0025.tgz > code drop, and wondering what (if any) issues there are with a 2.6.13 > kernel, or later OpenIB svn drops. There's not really an issue with 2.6.13, but we did focus so far on 2.6.5-7.191, which will change now.
> Is there a later code drop I can get ahold of? I just released ehca2 0028 to https://sourceforge.net/projects/ibmehcad/ This version uses svn 3615 You might also notice that all licenses in there now should be acceptable to openib.org. We're currently modifying the user-kernel interface between libehca and hcad_mod to also support 32 bit versions of libehca. That should then be the initial release version to the openib.org svn. Christoph -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Wed Oct 12 04:48:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Oct 2005 07:48:05 -0400 Subject: [openib-general] RE: [PATCH] Opensm - enabling erase of log file flag In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2351@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2351@mtlexch01.mtl.com> Message-ID: <1129117681.4377.14541.camel@hal.voltaire.com> On Wed, 2005-10-12 at 02:52, Yael Kalka wrote: > You are right. Thanks! Thanks. Applied. -- Hal > Yael > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, October 11, 2005 3:18 PM > To: Yael Kalka > Cc: openib-general at openib.org; Eitan Zahavi > Subject: Re: [PATCH] Opensm - enabling erase of log file flag > > > Hi Yael, > > On Tue, 2005-10-11 at 08:24, Yael Kalka wrote: > > Currently the osm log file is accumulative. I've added an option to > > erase the log file before starting to write it. > > By default, still, the log is still accumulative. > > Attached is a patch for that. > > One minor comment on this... 
> > > Thanks, > > Yael > > > > Signed-off-by: Yael Kalka > > > Index: opensm/osm_subnet.c > > =================================================================== > > --- opensm/osm_subnet.c (revision 3704) > > +++ opensm/osm_subnet.c (working copy) > > > @@ -920,6 +925,7 @@ osm_subn_write_conf_file( > > "force_log_flush %s\n\n" > > "# Log file to be used\n" > > "log_file %s\n\n" > > + "accum_log_file %s\n\n" > > "# The directory to hold the file OpenSM dumps\n" > > "dump_files_dir %s\n\n" > > "# If TRUE if OpenSM should disable multicast support\n" > > @@ -929,6 +935,7 @@ osm_subn_write_conf_file( > > p_opts->log_flags, > > p_opts->force_log_flush ? "TRUE" : "FALSE", > > p_opts->log_file, > > + p_opts->accum_log_file, > > Shouldn't this line be: > p_opts->accum_log_file ? "TRUE" : "FALSE", > > -- Hal > From halr at voltaire.com Wed Oct 12 06:20:54 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Oct 2005 09:20:54 -0400 Subject: [openib-general] RE: [PATCH] Opensm - handling immediate error in vendor_send new In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2352@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2352@mtlexch01.mtl.com> Message-ID: <1129123242.4377.14819.camel@hal.voltaire.com> Hi again Yael, On Wed, 2005-10-12 at 03:53, Yael Kalka wrote: > > > I decided to remove the checking for zero in the atomic_dec after > all, > > > since as I mentioned before - clearing it is not a fix, and we > will > > > see the value in other infos in the log file. > > > > But there is danger is these counters wrap, right ? > > > There is still some danger - as you noted - the counters can wrap. > This will happen if there is some problem in the lower layer. > For example - if we get the same mad twice, and we allocated it > already for another request (after getting the first answer). > It shouldn't happen if the lower layer is functioning correctly. I think that it's more than lower layer malfunction that can cause this to occur. 
-- Hal From rolandd at cisco.com Wed Oct 12 08:22:11 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 12 Oct 2005 08:22:11 -0700 Subject: [openib-general] DMA mapping abuses in MAD layer In-Reply-To: <001b01c5cef2$1bfcb180$9e5aa8c0@infiniconsys.com> (Fab Tillier's message of "Tue, 11 Oct 2005 22:59:27 -0700") References: <001b01c5cef2$1bfcb180$9e5aa8c0@infiniconsys.com> Message-ID: <52r7aq7obg.fsf@cisco.com> Fab> Why not just use inline sends for the special QPs and remove Fab> the need to perform any DMA mappings on the send side Fab> altogether? Not all HCAs necessarily support inline sends, so we can't use them in core code. In fact, I don't think that even all Mellanox HCAs would be able to handle big enough inline sends. Sending a 256-byte MAD would require a 512-byte WQE to put the payload in inline data, and if I recall correctly, MT25204 can't handle WQEs that big. - R.
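The arithmetic behind the 512-byte figure in the message above can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes WQE sizes are rounded up to a power of two and that some fixed header precedes the inline payload. The 32-byte header value and the helper names (round_up_pow2, inline_wqe_size, fits_inline) are hypothetical and are not taken from any HCA documentation or driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-WQE overhead that sits in front of the inline
 * payload. Assumption for illustration only; real values depend
 * on the HCA and the driver. */
#define WQE_HEADER_BYTES 32u

/* Round n up to the next power of two (returns 1 for n == 0). */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    /* Guard against shifting past the top bit of size_t. */
    assert(n <= (SIZE_MAX >> 1) + 1);
    while (p < n)
        p <<= 1;
    return p;
}

/* WQE size needed to carry payload_len bytes as inline data,
 * assuming power-of-two WQE sizes. */
static size_t inline_wqe_size(size_t payload_len)
{
    return round_up_pow2(payload_len + WQE_HEADER_BYTES);
}

/* Nonzero if payload_len bytes can be sent inline on a device
 * whose largest supported WQE is max_wqe bytes. */
static int fits_inline(size_t payload_len, size_t max_wqe)
{
    return inline_wqe_size(payload_len) <= max_wqe;
}
```

Under these assumptions any nonzero header pushes a 256-byte MAD past the 256-byte WQE size, so the next usable size is 512 bytes, and a device limited to 256-byte WQEs could not send the MAD inline.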
From johnip at sgi.com Wed Oct 12 08:30:12 2005 From: johnip at sgi.com (John Partridge) Date: Wed, 12 Oct 2005 10:30:12 -0500 Subject: [openib-general] mvapich-gen2 IA64 compile problem Message-ID: <434D2C04.7040603@sgi.com> I am compiling mvapich-gen2 on a IA64 (SGI Altix) and have hit an issue with ibverbs_const.h DEFAULT_MTU is defined for IA32, X86_64 and EM64T, but not IA64, so please may I propose this small patch to fix this problem :- --- ibverbs_const.h 2005-10-10 15:43:41.615100090 -0500 +++ ibverbs_const.h-johnip 2005-10-10 15:46:14.696637248 -0500 @@ -20,11 +20,10 @@ #define DEFAULT_MAX_RECV_WQE (300) #define DEFAULT_MAX_SEND_SGE (1) #define DEFAULT_MAX_RECV_SGE (1) -#if defined(_IA32_) || defined(_X86_64_) -#define DEFAULT_MTU (IBV_MTU_1024) -#endif #if defined(_EM64T_) #define DEFAULT_MTU (IBV_MTU_2048) +#else +#define DEFAULT_MTU (IBV_MTU_1024) #endif #define DEFAULT_MAX_RDMA_SIZE (1048576) #define DEFAULT_PSN (0) So the new code would look like this :- #if defined(_EM64T_) #define DEFAULT_MTU (IBV_MTU_2048) #else #define DEFAULT_MTU (IBV_MTU_1024) #endif John -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From krause at cup.hp.com Wed Oct 12 08:23:42 2005 From: krause at cup.hp.com (Michael Krause) Date: Wed, 12 Oct 2005 08:23:42 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <1128955559.4377.81.camel@hal.voltaire.com> References: <1128955559.4377.81.camel@hal.voltaire.com> Message-ID: <6.2.0.14.2.20051012082051.02285688@esmail.cup.hp.com> At 07:45 AM 10/10/2005, Hal Rosenstock wrote: >On Sun, 2005-10-09 at 10:19, Sean Hefty wrote: > > >I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ? > > > > I'm referring to the case that iWarp is running over TCP. I know that > it can > > run over SCTP, but I'm not familiar with the details of that > protocol. 
With > > TCP, this is an end-to-end connection, so layering iWarp over it, only the > > endpoints need to deal with it. I believe the same is true for SCTP. > >Yes, SCTP is similar in those regards. SCTP creates a connection and then multiplexes a set of sessions over it. You can conceptually think of it as akin to IB RD but where all QP are bound to the same EEC. > > >Doesn't a routing decision still need to be made at the IP layer ? > > > > Routing of the IP packets is done at the IP layer, but I don't see how this > > affects iWarp. > >It does under the "covers", those covers being IP routing. iWARP uses IP routing so there is zero difference between iWARP and any other IP-based protocol suite that operates above the IP layer. > > >Doesn't the IP next hop need to be determined (e.g. gateway when the > > >destination is off the local IP subnet) ? Is there something that > > >precludes iWARP from working across IP subnets ? > > > > I can't think of anything that would preclude iWarp from working > > across subnets. > >Doesn't the IP next hop need determining in that case ? Why is that not >relevant ? I don't think the iWARP connection is end to end in all >cases. TCP / SCTP are end-to-end thus iWARP is end-to-end. The fact that there is an intermediate router / gateway between does not matter. That is just a bit of IP routing to forward the packets. The ARP / ND protocols determine the next hop for the IP layer thus iWARP just like TCP/SCTP is not affected or cognizant of the underlying fabric topology. Mike -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mshefty at ichips.intel.com Wed Oct 12 09:27:45 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 12 Oct 2005 09:27:45 -0700 Subject: [openib-general] DMA mapping abuses in MAD layer In-Reply-To: <52y84z73oo.fsf@cisco.com> References: <52y84z73oo.fsf@cisco.com> Message-ID: <434D3981.9030507@ichips.intel.com> Roland Dreier wrote: > We probably still want to handle gather lists for posting sends I > think. Another (rather unrelated) issue that I just noticed the other > day is that something like sending a response to a GetTable request > for PortInfo for every port in a large fabric is going to end up > sending a very large RMPP message, probably too large to fit in a > single kmalloc()ed buffer. So I don't think we should require that > all send requests have a single gather entry. We can change the ib_mad_send_buf to allow chaining them together. The single SGE restriction was just a limitation of the initial implementation. Supporting an arbitrary breaking of a MAD buffer across multiple SGEs is difficult, but if we can control the SGE sizes, this should be doable. I think using ib_mad_send_buf in post_send_mad makes supporting this easier. Does anyone else have any other ideas on how to fix this issue? - Sean From rolandd at cisco.com Wed Oct 12 09:53:57 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 12 Oct 2005 09:53:57 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: (Herbert Xu's message of "Tue, 11 Oct 2005 22:10:01 +1000") References: Message-ID: <52mzle7k2i.fsf@cisco.com> Herbert> Try reverting the changeset Herbert> 314324121f9b94b2ca657a494cf2b9cb0e4a28cc Herbert> which lies between these two points and may be relevant. Matt, I pulled this out of git for you. 
I guess Herbert is suggesting to patch -R the below against 2.6.12-rc5: diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 79835a6..5bad504 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4355,16 +4355,7 @@ int tcp_rcv_established(struct sock *sk, goto no_ack; } - if (eaten) { - if (tcp_in_quickack_mode(tp)) { - tcp_send_ack(sk); - } else { - tcp_send_delayed_ack(sk); - } - } else { - __tcp_ack_snd_check(sk, 0); - } - + __tcp_ack_snd_check(sk, 0); no_ack: if (eaten) __kfree_skb(skb); From caitlinb at broadcom.com Wed Oct 12 09:59:53 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 12 Oct 2005 09:59:53 -0700 Subject: [openib-general] [RFC] IB address translation using ARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F10209EA@NT-SJCA-0751.brcm.ad.broadcom.com> ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Michael Krause Sent: Wednesday, October 12, 2005 8:24 AM To: Hal Rosenstock; Sean Hefty Cc: Openib Subject: RE: [openib-general] [RFC] IB address translation using ARP At 07:45 AM 10/10/2005, Hal Rosenstock wrote: On Sun, 2005-10-09 at 10:19, Sean Hefty wrote: > >I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ? > > I'm referring to the case that iWarp is running over TCP. I know that it can > run over SCTP, but I'm not familiar with the details of that protocol. With > TCP, this is an end-to-end connection, so layering iWarp over it, only the > endpoints need to deal with it. I believe the same is true for SCTP. Yes, SCTP is similar in those regards. SCTP creates a connection and then multiplexes a set of sessions over it. You can conceptually think of it as akin to IB RD but where all QP are bound to the same EEC. SCTP preserves all QP to QP semantics, including buffers posted to specific buffers and credits. So SCTP will allows multiple in-flight messages for each RDMA stream in the association. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From panda at cse.ohio-state.edu Wed Oct 12 10:01:30 2005 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Wed, 12 Oct 2005 13:01:30 -0400 (EDT) Subject: [openib-general] mvapich-gen2 IA64 compile problem In-Reply-To: <434D2C04.7040603@sgi.com> from "John Partridge" at Oct 12, 2005 10:30:12 AM Message-ID: <200510121701.j9CH1UGk015952@xi.cse.ohio-state.edu> Hi John, > I am compiling mvapich-gen2 on a IA64 (SGI Altix) and have hit an > issue with ibverbs_const.h DEFAULT_MTU is defined for IA32, X86_64 > and EM64T, but not IA64, so please may I propose this small patch to > fix this problem :- Thanks for your note and the patch. We have not done extensive testing of current mvapich-gen2 for IA64 platform (we have done it for other platforms). We will incorporate your patch, test it out, and push it out to the OpenIB/SVN soon. Thanks again for sending us the patch. Thanks, DK > > --- ibverbs_const.h 2005-10-10 15:43:41.615100090 -0500 > +++ ibverbs_const.h-johnip 2005-10-10 15:46:14.696637248 -0500 > @@ -20,11 +20,10 @@ > #define DEFAULT_MAX_RECV_WQE (300) > #define DEFAULT_MAX_SEND_SGE (1) > #define DEFAULT_MAX_RECV_SGE (1) > -#if defined(_IA32_) || defined(_X86_64_) > -#define DEFAULT_MTU (IBV_MTU_1024) > -#endif > #if defined(_EM64T_) > #define DEFAULT_MTU (IBV_MTU_2048) > +#else > +#define DEFAULT_MTU (IBV_MTU_1024) > #endif > #define DEFAULT_MAX_RDMA_SIZE (1048576) > #define DEFAULT_PSN (0) > > So the new code would look like this :- > > #if defined(_EM64T_) > #define DEFAULT_MTU (IBV_MTU_2048) > #else > #define DEFAULT_MTU (IBV_MTU_1024) > #endif > > John > > -- > John Partridge > > Silicon Graphics Inc > Tel: 651-683-3428 > Vnet: 233-3428 > E-Mail: johnip at sgi.com > From rolandd at cisco.com Wed Oct 12 10:05:17 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 12 Oct 2005 10:05:17 -0700 Subject: [openib-general] mvapich-gen2 IA64 compile problem In-Reply-To: 
<434D2C04.7040603@sgi.com> (John Partridge's message of "Wed, 12 Oct 2005 10:30:12 -0500") References: <434D2C04.7040603@sgi.com> Message-ID: <52irw27jjm.fsf@cisco.com> > #if defined(_EM64T_) > #define DEFAULT_MTU (IBV_MTU_2048) > #else > #define DEFAULT_MTU (IBV_MTU_1024) > #endif This is a sticky issue. This seems fine for now, but what we really want is something like: #if MELLANOX_PCI_X_HCA #define DEFAULT_MTU (IBV_MTU_1024) #else #define DEFAULT_MTU (IBV_MTU_2048) #endif But I'm not sure how to handle this. - R. From cap at nsc.liu.se Wed Oct 12 10:22:29 2005 From: cap at nsc.liu.se (Peter =?iso-8859-1?q?Kjellstr=F6m?=) Date: Wed, 12 Oct 2005 19:22:29 +0200 Subject: [openib-general] mvapich-gen2 IA64 compile problem In-Reply-To: <52irw27jjm.fsf@cisco.com> References: <434D2C04.7040603@sgi.com> <52irw27jjm.fsf@cisco.com> Message-ID: <200510121922.37174.cap@nsc.liu.se> On Wednesday 12 October 2005 19.05, Roland Dreier wrote: > > #if defined(_EM64T_) > > #define DEFAULT_MTU (IBV_MTU_2048) > > #else > > #define DEFAULT_MTU (IBV_MTU_1024) > > #endif > > This is a sticky issue. This seems fine for now, but what we really > want is something like: > > #if MELLANOX_PCI_X_HCA > #define DEFAULT_MTU (IBV_MTU_1024) > #else > #define DEFAULT_MTU (IBV_MTU_2048) > #endif If that is the purpose it fails badly since I have both EM64T machines with PCI-X and AMD64 machines with PCI-express. Or am I missing something here? /Peter > > But I'm not sure how to handle this. > > - R. -- ------------------------------------------------------------ Peter Kjellström | National Supercomputer Centre | Sweden | http://www.nsc.liu.se -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From rolandd at cisco.com Wed Oct 12 10:29:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 12 Oct 2005 10:29:39 -0700 Subject: [openib-general] mvapich-gen2 IA64 compile problem In-Reply-To: <200510121922.37174.cap@nsc.liu.se> (Peter Kjellström's message of "Wed, 12 Oct 2005 19:22:29 +0200") References: <434D2C04.7040603@sgi.com> <52irw27jjm.fsf@cisco.com> <200510121922.37174.cap@nsc.liu.se> Message-ID: <52ek6q7if0.fsf@cisco.com> Peter> If that is the purpose it fails badly since I have both Peter> EM64T machines with PCI-X and AMD64 machines with Peter> PCI-express. Or am I missing something here? Nope, the fact that it's messed up is what I was trying to point out. From a quick look at the code, it looks like you should set the environment variable VIADEV_DEFAULT_MTU to "MTU1024" on PCI-X systems and "MTU2048" on PCI Express systems. In general, one wants to use the largest possible MTU to get maximum performance. However, Mellanox PCI-X HCAs have a quirk that causes an MTU of 1024 to be faster than an MTU of 2048 for RC transport. This quirk does not exist in PCI Express HCAs, and presumably PathScale and IBM HCAs also perform best with the maximum possible MTU. So Mellanox PCI-X HCAs are the only case where we want to use a lower MTU. - R. From rolandd at cisco.com Wed Oct 12 10:55:40 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 12 Oct 2005 10:55:40 -0700 Subject: [openib-general] InfiniPath driver announcement In-Reply-To: <1129069024.29804.24.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Tue, 11 Oct 2005 15:17:04 -0700") References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> <1129069024.29804.24.camel@hematite.internal.keyresearch.com> Message-ID: <52ache7h7n.fsf@cisco.com> I got my system set up again.
I needed the following patch to work with the latest kernel (which no longer has io_remap_page_range). I'm also throwing in a warning cleanup. I'm still seeing some problems with the SM bringing up the port, which I'll start debugging now. BTW, ipath_mmap() probably needs to be rewritten or broken up into subfunctions -- the fact that the parameters of io_remap_pfn_range() are squeezed so hard against the right margin indicates that there are too many levels of { } in the function. - R. --- infiniband/hw/ipath/ib_ipath/ipath_mad.c (revision 3742) +++ infiniband/hw/ipath/ib_ipath/ipath_mad.c (working copy) @@ -724,6 +724,7 @@ static u32 get_counter(struct ipath_ibde case IB_PMA_PORT_RCV_PKTS: return (u32) dev->ipath_rpkts; case IB_PMA_PORT_XMIT_WAIT: + default: return 0; } } --- infiniband/hw/ipath/ipath_core/infinipath_core.c (revision 3742) +++ infiniband/hw/ipath/ipath_core/infinipath_core.c (working copy) @@ -3143,13 +3143,13 @@ static int ipath_mmap(struct file *fp, s VM_DONTCOPY | VM_DONTEXPAND | VM_IO | VM_SHM | VM_LOCKED; ret = - io_remap_page_range(vm, - vm->vm_start, - phys, - vm->vm_end - - vm->vm_start, - vm-> - vm_page_prot); + io_remap_pfn_range(vm, + vm->vm_start, + phys >> PAGE_SHIFT, + vm->vm_end - + vm->vm_start, + vm-> + vm_page_prot); } } else if (pgaddr == pd->port_piobufs) { /* @@ -3206,16 +3206,16 @@ static int ipath_mmap(struct file *fp, s | VM_IO | VM_SHM | VM_LOCKED; ret = - io_remap_page_range(vm, - vm-> - vm_start, - phys, - vm-> - vm_end - - vm-> - vm_start, - vm-> - vm_page_prot); + io_remap_pfn_range(vm, + vm-> + vm_start, + phys >> PAGE_SHIFT, + vm-> + vm_end - + vm-> + vm_start, + vm-> + vm_page_prot); } } } else if (pgaddr == (uint64_t) pd->port_rcvegr_phys) { From mlleinin at hpcn.ca.sandia.gov Wed Oct 12 11:28:26 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Wed, 12 Oct 2005 11:28:26 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <52mzle7k2i.fsf@cisco.com> References: 
<52mzle7k2i.fsf@cisco.com> Message-ID: <1129141706.13945.509.camel@localhost> On Wed, 2005-10-12 at 09:53 -0700, Roland Dreier wrote: > Herbert> Try reverting the changeset > > Herbert> 314324121f9b94b2ca657a494cf2b9cb0e4a28cc > > Herbert> which lies between these two points and may be relevant. > > Matt, I pulled this out of git for you. I guess Herbert is suggesting > to patch -R the below against 2.6.12-rc5: I applied your patch suggested by Herbert: http://www.mail-archive.com/openib-general%40openib.org/msg11415.html to my 2.6.12-rc5 tree and IPoIB performance improved back to the ~475 MB/s range for my EM64T system. The data is below. I'm building/testing 2.6.14-rc4 with and without this patch now.

All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)

Kernel            OpenIB      msi_x   netperf (MB/s)
2.6.14-rc3        in-kernel   1       374
2.6.13.2          svn3627     1       386
2.6.13.2          in-kernel   1       394
2.6.12.5-lustre   in-kernel   1       399
2.6.12.5          in-kernel   1       402
2.6.12            in-kernel   1       406
2.6.12-rc6        in-kernel   1       407
2.6.12-rc5        in-kernel   1       405   <<<<
2.6.12-rc5                                  <<<<
 - remove changeset 314324121f9b94b2ca657a494cf2b9cb0e4a28cc   <<<<
                  in-kernel   1       474   <<<<
2.6.12-rc4        in-kernel   1       470
2.6.12-rc3        in-kernel   1       466
2.6.12-rc2        in-kernel   1       469
2.6.12-rc1        in-kernel   1       466
2.6.11            in-kernel   1       464
2.6.11            svn3687     1       464
2.6.9-11.ELsmp    svn3513     1       425   (Woody's results, 3.6Ghz EM64T)

- Matt From surs at cse.ohio-state.edu Wed Oct 12 11:33:52 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed, 12 Oct 2005 14:33:52 -0400 Subject: [openib-general] mvapich-gen2 IA64 compile problem In-Reply-To: <52ek6q7if0.fsf@cisco.com> References: <434D2C04.7040603@sgi.com> <52irw27jjm.fsf@cisco.com> <200510121922.37174.cap@nsc.liu.se> <52ek6q7if0.fsf@cisco.com> Message-ID: <20051012183350.GA30852@cse.ohio-state.edu> Hello, * On Oct,5 Roland Dreier wrote : > Peter> If that is the purpose it fails badly since I have both > Peter> EM64T machines with PCI-X and AMD64 machines with > Peter> PCI-express.
Or am I missing something here? > > Nope, the fact that it's messed up is what I was trying to point out. > From a quick look at the code, it looks like you should set the > environment variable VIADEV_DEFAULT_MTU to "MTU1024" on PCI-X systems > and "MTU2048" on PCI Express systems. > > In general, one wants to use the largest possible MTU to get maximum > performance. However, Mellanox PCI-X HCAs have a quirk that causes an > MTU of 1024 to be faster than an MTU of 2048 for RC transport. This > quirk does not exist in PCI Express HCAs, and presumably PathScale and > IBM HCAs also perform best with the maximum possible MTU. So Mellanox > PCI-X HCAs are the only case where we want to use a lower MTU. Since the MTU size can affect the performance and there are several different types of HCAs available in the market, we are working towards making a runtime decision of the MTU instead of compiling it in, or asking the user to input. This patch should be available sometime soon. Thanks, Sayantan. > > - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- http://www.cse.ohio-state.edu/~surs
From krause at cup.hp.com Wed Oct 12 11:47:10 2005 From: krause at cup.hp.com (Michael Krause) Date: Wed, 12 Oct 2005 11:47:10 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10209EA@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F10209EA@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <6.2.0.14.2.20051012114147.02482b28@esmail.cup.hp.com> At 09:59 AM 10/12/2005, Caitlin Bestler wrote: > > > >---------- >From: openib-general-bounces at openib.org >[mailto:openib-general-bounces at openib.org] On Behalf Of Michael Krause >Sent: Wednesday, October 12, 2005 8:24 AM >To: Hal Rosenstock; Sean Hefty >Cc: Openib >Subject: RE: [openib-general] [RFC] IB address translation using ARP > >At 07:45 AM 10/10/2005, Hal Rosenstock wrote: >>On Sun, 2005-10-09 at 10:19, Sean Hefty wrote: >> > >I think iWARP can be on top of TCP or SCTP. But why wouldn't it care ? >> > >> > I'm referring to the case that iWarp is running over TCP. I know that >> it can >> > run over SCTP, but I'm not familiar with the details of that >> protocol. With >> > TCP, this is an end-to-end connection, so layering iWarp over it, only the >> > endpoints need to deal with it. I believe the same is true for SCTP. >> >>Yes, SCTP is similar in those regards. > >SCTP creates a connection and then multiplexes a set of sessions over >it. You can conceptually think of it as akin to IB RD but where all QP >are bound to the same EEC. > > >SCTP preserves all QP to QP semantics, including buffers posted to specific >buffers and credits. So SCTP will allows multiple in-flight messages for each >RDMA stream in the association. Yep.
This is where iWARP differs from IB RD in that IB restricts this to a single in-flight message per EEC at a time while iWARP allows multiple in-flight over either transport type supported. The logic behind why IB RD was constructed the way it was is somewhat complex but one of the core requirements was to enable a QP to communicate across multiple EEC while preserving an ordering domain within an EEC. Given all of this needed to be implemented in hardware, i.e. without host software intervention, for both main data path and error management, the restriction to a single message was required. I and several others had created a proprietary RDMA RC followed by a RD implementation 10+ years ago so we had a reasonable understanding of the error / complexity trade-offs. Given the distances were within a usec of each other and one could support multiple EEC per endnode pair, the performance / scaling impacts were not seen as overly restrictive and met the software application usage models quite nicely. Anyway, there are differences between iWARP / SCTP and IB RD so people cannot equate them beyond some base conceptual level aspects. Mike From sean.hefty at intel.com Wed Oct 12 12:22:20 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 12 Oct 2005 12:22:20 -0700 Subject: [openib-general] [PATCH] [CMA] add support for listening on any RDMA device Message-ID: The following patch permits listening on a port number only. All connection requests received on any RDMA device for that port number are routed to the listening client.
Signed-off-by: Sean Hefty

Index: core/cma.c
===================================================================
--- core/cma.c	(revision 3724)
+++ core/cma.c	(working copy)
@@ -51,8 +51,9 @@ static struct ib_client cma_client = {
 	.remove = cma_remove_one
 };
 
-static DEFINE_SPINLOCK(lock);
 static LIST_HEAD(dev_list);
+static LIST_HEAD(listen_any_list);
+static DECLARE_MUTEX(mutex);
 
 struct cma_device {
 	struct list_head list;
@@ -86,6 +87,7 @@ struct rdma_id_private {
 	struct rdma_cm_id id;
 
 	struct list_head list;
+	struct list_head listen_list;
 	struct cma_device *cma_dev;
 
 	enum cma_state state;
@@ -168,26 +170,39 @@ static inline void cma_set_vers(struct c
 	addr->version = (cma_ver << 4) + (ip_ver & 0xF);
 }
 
+static void cma_attach_to_dev(struct rdma_id_private *id_priv,
+			      struct cma_device *cma_dev)
+{
+	atomic_inc(&cma_dev->refcount);
+	id_priv->cma_dev = cma_dev;
+	id_priv->id.device = cma_dev->device;
+	list_add_tail(&id_priv->list, &cma_dev->id_list);
+}
+
+static void cma_detach_from_dev(struct rdma_id_private *id_priv)
+{
+	list_del(&id_priv->list);
+	if (atomic_dec_and_test(&id_priv->cma_dev->refcount))
+		wake_up(&id_priv->cma_dev->wait);
+	id_priv->cma_dev = NULL;
+}
+
 static int cma_acquire_ib_dev(struct rdma_id_private *id_priv,
 			      union ib_gid *gid)
 {
 	struct cma_device *cma_dev;
-	unsigned long flags;
 	int ret = -ENODEV;
 	u8 port;
 
-	spin_lock_irqsave(&lock, flags);
+	down(&mutex);
 	list_for_each_entry(cma_dev, &dev_list, list) {
 		ret = ib_find_cached_gid(cma_dev->device, gid, &port, NULL);
 		if (!ret) {
-			atomic_inc(&cma_dev->refcount);
-			id_priv->cma_dev = cma_dev;
-			id_priv->id.device = cma_dev->device;
-			list_add_tail(&id_priv->list, &cma_dev->id_list);
+			cma_attach_to_dev(id_priv, cma_dev);
 			break;
 		}
 	}
-	spin_unlock_irqrestore(&lock, flags);
+	up(&mutex);
 	return ret;
 }
 
@@ -221,6 +236,7 @@ struct rdma_cm_id* rdma_create_id(rdma_c
 	atomic_set(&id_priv->refcount, 1);
 	init_waitqueue_head(&id_priv->wait_remove);
 	atomic_set(&id_priv->dev_remove, 0);
+	INIT_LIST_HEAD(&id_priv->listen_list);
 
 	return &id_priv->id;
 }
@@ -353,6 +369,11 @@ static int cma_verify_addr(struct cma_ad
 	return 0;
 }
 
+static inline int cma_any_addr(struct sockaddr *addr)
+{
+	return ((struct sockaddr_in *) addr)->sin_addr.s_addr == 0;
+}
+
 static int cma_notify_user(struct rdma_id_private *id_priv,
 			   enum rdma_cm_event_type type, int status,
 			   void *data, u8 data_len)
@@ -389,6 +410,44 @@ static void cma_cancel_route(struct rdma
 	}
 }
 
+static inline int cma_internal_listen(struct rdma_id_private *id_priv)
+{
+	return (id_priv->state == CMA_LISTEN) && id_priv->cma_dev &&
+	       cma_any_addr(&id_priv->id.route.addr.src_addr);
+}
+
+static void cma_destroy_listen(struct rdma_id_private *id_priv)
+{
+	cma_exch(id_priv, CMA_DESTROYING);
+
+	if (id_priv->cm_id && !IS_ERR(id_priv->cm_id))
+		ib_destroy_cm_id(id_priv->cm_id);
+
+	list_del(&id_priv->listen_list);
+	if (id_priv->cma_dev)
+		cma_detach_from_dev(id_priv);
+
+	atomic_dec(&id_priv->refcount);
+	wait_event(id_priv->wait, !atomic_read(&id_priv->refcount));
+
+	kfree(id_priv);
+}
+
+static void cma_cancel_listens(struct rdma_id_private *id_priv)
+{
+	struct rdma_id_private *dev_id_priv;
+
+	down(&mutex);
+	list_del(&id_priv->list);
+
+	while (!list_empty(&id_priv->listen_list)) {
+		dev_id_priv = list_entry(id_priv->listen_list.next,
+					 struct rdma_id_private, listen_list);
+		cma_destroy_listen(dev_id_priv);
+	}
+	up(&mutex);
+}
+
 static void cma_cancel_operation(struct rdma_id_private *id_priv,
 				 enum cma_state state)
 {
@@ -399,6 +458,11 @@ static void cma_cancel_operation(struct
 	case CMA_ROUTE_QUERY:
 		cma_cancel_route(id_priv);
 		break;
+	case CMA_LISTEN:
+		if (cma_any_addr(&id_priv->id.route.addr.src_addr) &&
+		    !id_priv->cma_dev)
+			cma_cancel_listens(id_priv);
+		break;
 	default:
 		break;
 	}
@@ -408,7 +472,6 @@ void rdma_destroy_id(struct rdma_cm_id *
 {
 	struct rdma_id_private *id_priv;
 	enum cma_state state;
-	unsigned long flags;
 
 	id_priv = container_of(id, struct rdma_id_private, id);
 	state = cma_exch(id_priv, CMA_DESTROYING);
@@ -418,12 +481,9 @@ void rdma_destroy_id(struct rdma_cm_id *
 		ib_destroy_cm_id(id_priv->cm_id);
 
 	if (id_priv->cma_dev) {
-		spin_lock_irqsave(&lock, flags);
-		list_del(&id_priv->list);
-		spin_unlock_irqrestore(&lock, flags);
-
-		if (atomic_dec_and_test(&id_priv->cma_dev->refcount))
-			wake_up(&id_priv->cma_dev->wait);
+		down(&mutex);
+		cma_detach_from_dev(id_priv);
+		up(&mutex);
 	}
 
 	atomic_dec(&id_priv->refcount);
@@ -660,6 +720,77 @@ static int cma_ib_listen(struct rdma_id_
 	return ret;
 }
 
+static int cma_duplicate_listen(struct rdma_id_private *id_priv)
+{
+	struct rdma_id_private *cur_id_priv;
+	struct sockaddr_in *cur_addr, *new_addr;
+
+	new_addr = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr;
+	list_for_each_entry(cur_id_priv, &listen_any_list, listen_list) {
+		cur_addr = (struct sockaddr_in *)
+			    &cur_id_priv->id.route.addr.src_addr;
+		if (cur_addr->sin_port == new_addr->sin_port)
+			return -EADDRINUSE;
+	}
+	return 0;
+}
+
+static int cma_listen_handler(struct rdma_cm_id *id,
+			      struct rdma_cm_event *event)
+{
+	struct rdma_id_private *id_priv = id->context;
+
+	id->context = id_priv->id.context;
+	id->event_handler = id_priv->id.event_handler;
+	return id_priv->id.event_handler(id, event);
+}
+
+static void cma_listen_on_dev(struct rdma_id_private *id_priv,
+			      struct cma_device *cma_dev)
+{
+	struct rdma_id_private *dev_id_priv;
+	struct rdma_cm_id *id;
+	int ret;
+
+	id = rdma_create_id(cma_listen_handler, id_priv);
+	if (IS_ERR(id))
+		return;
+
+	dev_id_priv = container_of(id, struct rdma_id_private, id);
+	ret = rdma_bind_addr(id, &id_priv->id.route.addr.src_addr);
+	if (ret)
+		goto err;
+
+	cma_attach_to_dev(dev_id_priv, cma_dev);
+	list_add_tail(&dev_id_priv->listen_list, &id_priv->listen_list);
+
+	ret = rdma_listen(id);
+	if (ret)
+		goto err;
+
+	return;
+err:
+	cma_destroy_listen(dev_id_priv);
+}
+
+static int cma_listen_on_all(struct rdma_id_private *id_priv)
+{
+	struct cma_device *cma_dev;
+	int ret;
+
+	down(&mutex);
+	ret = cma_duplicate_listen(id_priv);
+	if (ret)
+		goto out;
+
+	list_add_tail(&id_priv->list, &listen_any_list);
+	list_for_each_entry(cma_dev, &dev_list, list)
+		cma_listen_on_dev(id_priv, cma_dev);
+out:
+	up(&mutex);
+	return ret;
+}
+
 int rdma_listen(struct rdma_cm_id *id)
 {
 	struct rdma_id_private *id_priv;
@@ -669,20 +800,18 @@ int rdma_listen(struct rdma_cm_id *id)
 	if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN))
 		return -EINVAL;
 
-	/* TODO: handle listen across multiple devices */
-	if (!id->device) {
-		ret = -ENOSYS;
-		goto err;
-	}
+	if (id->device) {
+		switch (id->device->node_type) {
+		case IB_NODE_CA:
+			ret = cma_ib_listen(id_priv);
+			break;
+		default:
+			ret = -ENOSYS;
+			break;
+		}
+	} else
+		ret = cma_listen_on_all(id_priv);
 
-	switch (id->device->node_type) {
-	case IB_NODE_CA:
-		ret = cma_ib_listen(id_priv);
-		break;
-	default:
-		ret = -ENOSYS;
-		break;
-	}
 	if (ret)
 		goto err;
 
@@ -850,7 +979,6 @@ EXPORT_SYMBOL(rdma_resolve_addr);
 int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr)
 {
 	struct rdma_id_private *id_priv;
-	struct sockaddr_in *ip_addr = (struct sockaddr_in *) addr;
 	struct ib_addr *ibaddr = &id->route.addr.addr.ibaddr;
 	int ret;
 
@@ -861,12 +989,14 @@ int rdma_bind_addr(struct rdma_cm_id *id
 	if (!cma_comp_exch(id_priv, CMA_IDLE, CMA_ADDR_BOUND))
 		return -EINVAL;
 
-	if (ip_addr->sin_addr.s_addr) {
+	if (cma_any_addr(addr)) {
+		id->route.addr.src_addr = *addr;
+		ret = 0;
+	} else {
 		ret = ib_translate_addr(addr, &ibaddr->sgid, &ibaddr->pkey);
 		if (!ret)
 			ret = cma_acquire_ib_dev(id_priv, &ibaddr->sgid);
-	} else
-		ret = -ENOSYS; /* TODO: support wild card addresses */
+	}
 
 	if (ret)
 		goto err;
@@ -1102,7 +1232,7 @@ static __be64 get_ca_guid(struct ib_devi
 static void cma_add_one(struct ib_device *device)
 {
 	struct cma_device *cma_dev;
-	unsigned long flags;
+	struct rdma_id_private *id_priv;
 
 	cma_dev = kmalloc(sizeof *cma_dev, GFP_KERNEL);
 	if (!cma_dev)
@@ -1118,9 +1248,11 @@ static void cma_add_one(struct ib_device
 	INIT_LIST_HEAD(&cma_dev->id_list);
 	ib_set_client_data(device, &cma_client, cma_dev);
 
-	spin_lock_irqsave(&lock, flags);
+	down(&mutex);
 	list_add_tail(&cma_dev->list, &dev_list);
-	spin_unlock_irqrestore(&lock, flags);
+	list_for_each_entry(id_priv, &listen_any_list, list)
+		cma_listen_on_dev(id_priv, cma_dev);
+	up(&mutex);
 	return;
 err:
 	kfree(cma_dev);
@@ -1150,28 +1282,33 @@ static void cma_process_remove(struct cm
 {
 	struct list_head remove_list;
 	struct rdma_id_private *id_priv;
-	unsigned long flags;
 	int ret;
 
 	INIT_LIST_HEAD(&remove_list);
 
-	spin_lock_irqsave(&lock, flags);
+	down(&mutex);
 	while (!list_empty(&cma_dev->id_list)) {
 		id_priv = list_entry(cma_dev->id_list.next,
 				     struct rdma_id_private, list);
+
+		if (cma_internal_listen(id_priv)) {
+			cma_destroy_listen(id_priv);
+			continue;
+		}
+
 		list_del(&id_priv->list);
 		list_add_tail(&id_priv->list, &remove_list);
 		atomic_inc(&id_priv->refcount);
-		spin_unlock_irqrestore(&lock, flags);
+		up(&mutex);
 
 		ret = cma_remove_id_dev(id_priv);
 		cma_deref_id(id_priv);
 		if (ret)
 			rdma_destroy_id(&id_priv->id);
 
-		spin_lock_irqsave(&lock, flags);
+		down(&mutex);
 	}
-	spin_unlock_irqrestore(&lock, flags);
+	up(&mutex);
 
 	atomic_dec(&cma_dev->refcount);
 	wait_event(cma_dev->wait, !atomic_read(&cma_dev->refcount));
@@ -1180,15 +1317,14 @@ static void cma_process_remove(struct cm
 static void cma_remove_one(struct ib_device *device)
 {
 	struct cma_device *cma_dev;
-	unsigned long flags;
 
 	cma_dev = ib_get_client_data(device, &cma_client);
 	if (!cma_dev)
 		return;
 
-	spin_lock_irqsave(&lock, flags);
+	down(&mutex);
 	list_del(&cma_dev->list);
-	spin_unlock_irqrestore(&lock, flags);
+	up(&mutex);
 
 	cma_process_remove(cma_dev);
 	kfree(cma_dev);

From hycsw at ca.sandia.gov Wed Oct 12 12:34:07 2005
From: hycsw at ca.sandia.gov (Helen Chen)
Date: Wed, 12 Oct 2005 12:34:07 -0700
Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Message-ID: <434D652F.5060804@ca.sandia.gov>

Hi,

I am running stock IB
stack distributed with the 2.6.12-5 kernel from gen2. We installed
Lustre 1.4.4 to run on top of IPoIB. When I ran concurrent IOzone
sessions from 8 clients to 4 servers, I got "ib0: failed to allocate
receive buffer" in dmesg, along with corresponding IOzone read/write
errors. If I don't terminate my IOzone sessions, the ib0 interface
eventually shuts down. Increasing net.core.netdev_max_backlog from 300
to 3000 didn't solve the problem. Is there another parameter to tweak?

BTW, I am attaching the entries from dmesg for your information.

Thanks,
Helen
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dmesg.log
URL:

From johnip at sgi.com Wed Oct 12 12:37:36 2005
From: johnip at sgi.com (John Partridge)
Date: Wed, 12 Oct 2005 14:37:36 -0500
Subject: [openib-general] mvapich-gen2 IA64 compile problem
In-Reply-To: <52ek6q7if0.fsf@cisco.com>
References: <434D2C04.7040603@sgi.com> <52irw27jjm.fsf@cisco.com> <200510121922.37174.cap@nsc.liu.se> <52ek6q7if0.fsf@cisco.com>
Message-ID: <434D6600.3030009@sgi.com>

Roland Dreier wrote:
> Peter> If that is the purpose it fails badly since I have both
> Peter> EM64T machines with PCI-X and AMD64 machines with
> Peter> PCI-express. Or am I missing something here?
>
> Nope, the fact that it's messed up is what I was trying to point out.
> From a quick look at the code, it looks like you should set the
> environment variable VIADEV_DEFAULT_MTU to "MTU1024" on PCI-X systems
> and "MTU2048" on PCI Express systems.
>
> In general, one wants to use the largest possible MTU to get maximum
> performance.

Yes, I was not sure if IA64 should be 2048 or 1024 when I proposed the
patch. If I get time I will try to test out which is best for IA64.

John

--
John Partridge

Silicon Graphics Inc
Tel: 651-683-3428
Vnet: 233-3428
E-Mail: johnip at sgi.com

From krause at cup.hp.com Wed Oct 12 12:39:56 2005
From: krause at cup.hp.com (Michael Krause)
Date: Wed, 12 Oct 2005 12:39:56 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <1128877818.24182.54.camel@mail.es335.com>
References: <1128877818.24182.54.camel@mail.es335.com>
Message-ID: <6.2.0.14.2.20051012122533.025939e0@esmail.cup.hp.com>

Isn't this getting a bit more complex than it needs to be? Let me see
if I have this correct:

1. Applications want to use existing API to identify remote endnodes /
   services.

2. Endnodes are identified by an IPv4 / v6 address and services by a
   port number.

3. The existing network stacks already comprehend how to discover
   routes to endnodes using ARP / ND. These protocols can determine
   whether there is a single or multiple IP addresses and store these
   in the local network stack route table.

4. Route tables can contain any number of layer 2 and 3 addresses
   (function of implementation) and various policies can be constructed
   to make an intelligent decision on which layer 2 and 3 addresses to
   return to an application.

5. iWARP can use the existing infrastructure without modification so no
   changes are required to make it work.

6. IB uses a different layer 2 address - not just a 48-bit MAC - thus
   while different than Ethernet, conceptually works just the same.
   Both can support multiple IP addresses per layer 2 address as it is
   really just a matter of replicating the information on a per IP
   address basis.

7. When a route look up occurs, a set of IP addresses are returned.
   Depending upon the kernel interface, one can also return the layer 2
   information either as part of this look up or as a separate query to
   the route table.

8. Layer 2 information provides the necessary data to construct CM
   messages or to identify the path for the IP over IB ULP.

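Points 6 through 8 can be illustrated with a small user-space sketch.
It is not code from the OpenIB tree or the Linux neighbour subsystem;
the type and function names (`l2_addr`, `neigh_entry`, `neigh_lookup`,
`l2_equal`) are hypothetical, and it assumes only that an Ethernet MAC
is 6 bytes and an IPoIB hardware address is 20 bytes (QPN plus GID).
The table keeps one entry per IP address, so several IP addresses can
share the same layer 2 address by replicating it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define L2_ETH_ALEN    6	/* 48-bit Ethernet MAC */
#define L2_IPOIB_ALEN 20	/* IPoIB hardware address: flags/QPN (4) + GID (16) */
#define L2_MAX_ALEN   L2_IPOIB_ALEN

/* Hypothetical variable-length layer 2 address. */
struct l2_addr {
	uint8_t len;
	uint8_t bytes[L2_MAX_ALEN];
};

/* Hypothetical table entry: one row per IP address (point 6). */
struct neigh_entry {
	uint32_t ipv4;		/* IPv4 address used as the lookup key */
	struct l2_addr l2;	/* layer 2 data handed to the CM or IPoIB (point 8) */
};

/* Return the entry for ipv4, or NULL if the table has no such address. */
static const struct neigh_entry *
neigh_lookup(const struct neigh_entry *tbl, size_t n, uint32_t ipv4)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tbl[i].ipv4 == ipv4)
			return &tbl[i];
	return NULL;
}

/* Two IP addresses sit on the same port when their layer 2 data match. */
static int l2_equal(const struct l2_addr *a, const struct l2_addr *b)
{
	return a->len == b->len && memcmp(a->bytes, b->bytes, a->len) == 0;
}
```

A caller would look up the destination IP, then pass `l2.bytes` to
whatever builds the CM request; only the length differs between the
Ethernet and IB cases.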
So, from the above, it seems the IP and IB world can operate using the
same code and work just fine. So, where is the problem? Is it really
just how management assigns IP address to IB interfaces and how an
application should select or be informed of which IP address to use and
thus transparently identifies the IB port? Where is the connection
establishment problem? The application does not see any difference.
The network stack only acts as a repository for routing information
unless running directly over IP over IB thus is not impacted. The
middleware simply needs to extract the layer 2 information thus obtains
the requisite data to construct the CM messages when going straight to
IB (there is no change required here for iWARP as this is all native to
its operation). What am I missing here?

Mike

At 10:10 AM 10/9/2005, Tom Tucker wrote:
>On Sun, 2005-10-09 at 07:57 -0700, Sean Hefty wrote:
> > >It is theoretically possible to support all this on an IPoIB based
> > >network. Multiple subnets, multiple routes to remote peers, ICMP
> > >redirect, multiple IP addresses for each physical interface, yada yada
> > >yada. But IMHO, the only way to do this would be to tie directly into
> > >the existing routing, ARP, ICMP, etc... subsystems in Linux. Otherwise
> > >you'll end up recreating a gigantic (and I mean GIGANTIC) amount of
> >
> > The current implementation ties into the standard Linux ARP tables. If
> > connections were made over TCP/IP, using IPoIB, then I don't think that
> > there would be any issues. The issues only arise because of the desire
> > to use TCP/IP network addresses over a non-TCP/IP network.
> >
> > >code. This belief is why I've been a proponent of mapping GIDs to one
> > >and only one IP address and treating it for management purposes as the
> > >equivalent of an IP address. Without this, the whole mechanism for
> > >determining routes, etc.. breaks down. If you treat the GID like a MAC
> > >address -- it breaks, because a MAC address can have multiple IP
> > >addresses -- the observation that led to the conclusion that ATS was
> > >broken in the first place.
> >
> > We should be able to handle the case where a GID has multiple IP
> > addresses bound to it. But even if we added a 1:1 restriction, the
> > connection over IB issue still exists.
>
>I agree, except for RARP.
>
> > >I know there is significant resistance to this idea, but I just don't
> > >see how we get this generically resolved without binding the two
> > >addressing schemes more closely. With the current binding, I just don't
> > >think it works.
> >
> > Again, I don't think that the binding is the issue, so much as the
> > desire to use an address for a protocol that isn't actually being used
> > for communication.
>
>Not to be pedantic, but if binding or mapping or somesuch weren't an
>issue we wouldn't need AT.
>
> > I don't view a GID as an IP address because we're not sending and
> > receiving IP packets on the GID. IPoIB treats GIDs as only part of a
> > MAC address, which I think is the proper view.
> >
> > Anyway, returning back to the original problem of connecting to an IB
> > gateway if given a destination IP address on a different subnet...
> > I'm slowly convincing myself that either the CMA or AT should do this.
> > (I believe that the ib_addr code will do this now, but still wasn't
> > sure that we wanted it to.)
>
>IMHO, you need a service separate from the CMA to do address
>translation. My (iWARP's) rationale for this is that there are two
>clients of the service, the CM and IP. For CM, you need it to elect a
>route and thereby a local interface. For IP you need it because routes
>change and ARP entries time out.
>
>BTW, can you educate me ... is the following what you're thinking:
>
>On the client side...
>
>- route is discovered by looking at the Linux routing table
>- local interface is IPoIB (looks at rdma_ptr embedded in netdev struct)
>- send ARP AT message over local IB interface
>
>At the gateway...bridging to IP
>
>- ARP AT query received on IB interface
>- Lookup route to destination IP address in gateway's route table.
>- If next hop's Ethernet address is already known, it is returned
>- Otherwise, local interface identified is IPoEthernet
>- New ARP query goes out on the local interface from the route
>- When response comes back, answer is returned.
>
>At the gateway...bridging to IPoIB
>
>- ARP AT message received on IB interface, delivered to AT
>- Lookup route to destination IP address in gateway's route table
>- If next hop's Ethernet address is already known, it is returned
>- otherwise, local interface identified in route is IPoIB
>- New ARP AT query goes out on the local interface
>- When response comes back, answer is returned.
>
>Thanks,
>
> > - Sean

From rolandd at cisco.com Wed Oct 12 12:48:58 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 12 Oct 2005 12:48:58 -0700
Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
In-Reply-To: <434D652F.5060804@ca.sandia.gov> (Helen Chen's message of "Wed, 12 Oct 2005 12:34:07 -0700")
References: <434D652F.5060804@ca.sandia.gov>
Message-ID: <521x2q7byt.fsf@cisco.com>

Basically, you are running out of GFP_ATOMIC memory. IPoIB's handling
of these allocation errors can definitely be improved, but one thing
you could try in the meantime would be to increase
/proc/sys/vm/min_free_kbytes.

 - R.

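For readers who want to script the suggestion above, the following is a
minimal sketch, not part of IPoIB or any OpenIB tool. The helper names
`read_sysctl_long` and `write_sysctl_long` are made up for illustration;
the only thing taken from the message is the path
/proc/sys/vm/min_free_kbytes. Writing to that file requires root, and a
suitable value depends on the machine, so none is suggested here.

```c
#include <stdio.h>

/*
 * Read one integer from a sysctl-style file such as
 * /proc/sys/vm/min_free_kbytes. Returns -1 if the file cannot be
 * opened or does not start with a number.
 */
static long read_sysctl_long(const char *path)
{
	FILE *f = fopen(path, "r");
	long val;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/*
 * Write one integer to a sysctl-style file. Returns 0 on success and
 * -1 on failure (for /proc/sys this usually means "not root").
 */
static int write_sysctl_long(const char *path, long val)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", val) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would typically read the current value first, log it, and only
then write a larger one, so the change can be reverted if it does not
help.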
From braam at clusterfs.com Wed Oct 12 13:32:43 2005
From: braam at clusterfs.com (Peter J. Braam)
Date: Wed, 12 Oct 2005 16:32:43 -0400
Subject: [openib-general] Lustre Network Driver - KDAPL or verbs?
Message-ID: <9025E129D3FCD340A7BA67E342D10E7A0D4D8530@ms06>

Thanks for everyone's feedback! I think we will go in the verbs
direction.

For those of you who asked about the Lustre kernel patch and about
getting Lustre into the kernel, here is the status. When this was last
discussed, we immediately adapted a patch to meet the kernel
community's requests, but we didn't have time to move it off a
development branch. We are now returning to this and we will see how it
goes.

Thanks!

- Peter -

From rjwalsh at pathscale.com Wed Oct 12 13:37:17 2005
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Wed, 12 Oct 2005 13:37:17 -0700
Subject: [openib-general] InfiniPath driver announcement
In-Reply-To: <52ache7h7n.fsf@cisco.com>
References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> <1129069024.29804.24.camel@hematite.internal.keyresearch.com> <52ache7h7n.fsf@cisco.com>
Message-ID: <1129149438.25062.4.camel@hematite.internal.keyresearch.com>

> I got my system set up again. I needed the following patch to work
> with the latest kernel (which no longer has io_remap_page_range). I'm
> also throwing in a warning cleanup.

Thanks for catching those. I'll check this fix in today.

> I'm still seeing some problems with the SM bringing up the port, which
> I'll start debugging now.

Huh. Works OK for us. Any more news on this one?

> BTW, ipath_mmap() probably needs to be rewritten or broken up into
> subfunctions -- the fact that the parameters of io_remap_pfn_range()
> are squeezed so hard against the right margin indicates that there are
> too many levels of { } in the function.

:-) Probably not for this release, but I do feel your pain. We're about
to do a new software release here, so I can't make any cosmetic change
any more, just functional (like the above.)

Regards,
 Robert.

--
Robert Walsh                                 Email: rjwalsh at pathscale.com
PathScale, Inc.                              Phone: +1 650 934 8117
2071 Stierlin Court, Suite 200                 Fax: +1 650 428 1969
Mountain View, CA 94043

From johnip at sgi.com Wed Oct 12 13:38:23 2005
From: johnip at sgi.com (John Partridge)
Date: Wed, 12 Oct 2005 15:38:23 -0500
Subject: [openib-general] mvapich-gen2 IA64 compile problem
In-Reply-To: <200510121701.j9CH1UGk015952@xi.cse.ohio-state.edu>
References: <200510121701.j9CH1UGk015952@xi.cse.ohio-state.edu>
Message-ID: <434D743F.4050208@sgi.com>

Dhabaleswar Panda wrote:
> Thanks for your note and the patch. We have not done extensive testing
> of current mvapich-gen2 for IA64 platform (we have done it for other
> platforms). We will incorporate your patch, test it out, and push it
> out to the OpenIB/SVN soon.

Please let me know if I can help with IA64 testing; I would be happy to
do what I can.

I have just hit another issue (which I don't believe is just an IA64
issue) while compiling viapriv.o and viainit.o for overtake. It appears
that some of the structure definitions are missing, has anyone else
seen this or have I missed a config step ?

Errors below. Can you help ?

Thanks
John

make overtake
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/bin/mpicc -DHAVE_MPICHCONF_H -DHAVE_STDLIB_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STRING_H=1 -DUSE_STDARG=1 -DHAVE_LONG_DOUBLE=1 -DHAVE_LONG_LONG_INT=1 -DHAVE_PROTOTYPES=1 -DHAVE_SIGNAL_H=1 -DHAVE_SIGACTION=1 -DHAVE_SLEEP=1 -DHAVE_SYSCONF=1 -c overtake.c
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/bin/mpicc -DHAVE_MPICHCONF_H -DHAVE_STDLIB_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STRING_H=1 -DUSE_STDARG=1 -DHAVE_LONG_DOUBLE=1 -DHAVE_LONG_LONG_INT=1 -DHAVE_PROTOTYPES=1 -DHAVE_SIGNAL_H=1 -DHAVE_SIGACTION=1 -DHAVE_SLEEP=1 -DHAVE_SYSCONF=1 -c test.c
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/bin/mpicc -o overtake overtake.o test.o
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viapriv.o)(.text+0x72): In function `register_memory':
: undefined reference to `ibv_reg_mr'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viapriv.o)(.text+0xd2): In function `deregister_memory':
: undefined reference to `ibv_dereg_mr'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x42): In function `open_hca':
: undefined reference to `ibv_get_devices'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x62): In function `open_hca':
: undefined reference to `dlist_start'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x82): In function `open_hca':
: undefined reference to `_dlist_mark_move'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x132): In function `open_hca':
: undefined reference to `ibv_open_device'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x2a2): In function `open_hca':
: undefined reference to `ibv_alloc_pd'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x442): In function `get_lid':
: undefined reference to `ibv_query_port'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x612): In function `create_cq':
: undefined reference to `ibv_create_cq'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x952): In function `create_qps':
: undefined reference to `ibv_create_qp'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0xca2): In function `create_qps':
: undefined reference to `ibv_modify_qp'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x1392): In function `ib_qp_enable':
: undefined reference to `ibv_modify_qp'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x1692): In function `ib_qp_enable':
: undefined reference to `ibv_modify_qp'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x1922): In function `ib_finalize':
: undefined reference to `ibv_destroy_qp'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x1a92): In function `ib_finalize':
: undefined reference to `ibv_destroy_cq'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x1bd2): In function `ib_finalize':
: undefined reference to `ibv_dealloc_pd'
/usr/src/openib/gen2/trunk/src/userspace/mpi/mvapich-gen2/lib/libmpich.a(viainit.o)(.text+0x1d12): In function `ib_finalize':
: undefined reference to `ibv_close_device'
collect2: ld returned 1 exit status
make[4]: *** [overtake] Error 1
make[3]: [linktest] Error 2 (ignored)
Could not link a C program with MPI libraries
make[3]: *** [linktest] Error 1
make[2]: *** [linktest] Error 2
make[1]: *** [mpi-lib-test] Error 2
make: *** [mpi] Error 2

--
John Partridge

Silicon Graphics Inc
Tel: 651-683-3428
Vnet: 233-3428
E-Mail: johnip at sgi.com

From rolandd at cisco.com Wed Oct 12 13:53:18 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 12 Oct 2005 13:53:18 -0700
Subject: [openib-general] InfiniPath driver announcement
In-Reply-To: <1129149438.25062.4.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Wed, 12 Oct 2005 13:37:17 -0700")
References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> <1129069024.29804.24.camel@hematite.internal.keyresearch.com> <52ache7h7n.fsf@cisco.com> <1129149438.25062.4.camel@hematite.internal.keyresearch.com>
Message-ID: <52psqa5uf5.fsf@cisco.com>

    Roland> I'm still seeing some problems with the SM bringing up the
    Roland> port, which I'll start debugging now.

    Robert> Huh. Works OK for us. Any more news on this one?

It's intermittent for me. It started working after I added some
debugging prints. The failure mode was that when the SM sent a Set of
PortInfo to bring the port to ACTIVE, the ipath driver timed out after
5 seconds of waiting for its local set of port state.

    Robert> :-) Probably not for this release, but I do feel your
    Robert> pain. We're about to do a new software release here, so I
    Robert> can't make any cosmetic change any more, just functional
    Robert> (like the above.)

OK, no hurry. It's going to be a requirement for an upstream merge,
though.

BTW, I've found a bunch of bugs in the uverbs changes I made. I'll
post new patches once I have it working here.

 - R.

From rolandd at cisco.com Wed Oct 12 13:53:54 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 12 Oct 2005 13:53:54 -0700
Subject: [openib-general] mvapich-gen2 IA64 compile problem
In-Reply-To: <434D743F.4050208@sgi.com> (John Partridge's message of "Wed, 12 Oct 2005 15:38:23 -0500")
References: <200510121701.j9CH1UGk015952@xi.cse.ohio-state.edu> <434D743F.4050208@sgi.com>
Message-ID: <52ll0y5ue5.fsf@cisco.com>

    John> I have just hit another issue (which I don't believe is just
    John> an IA64 issue) while compiling viapriv.o and viainit.o for
    John> overtake. It appears that some of the structure definitions
    John> are missing, has anyone else seen this or have I missed a
    John> config step ?

It looks like you don't have libibverbs installed -- the linker isn't
finding a lot of symbols from that library.

 - R.

From rjwalsh at pathscale.com Wed Oct 12 14:16:06 2005
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Wed, 12 Oct 2005 14:16:06 -0700
Subject: [openib-general] InfiniPath driver announcement
In-Reply-To: <52psqa5uf5.fsf@cisco.com>
References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> <1129069024.29804.24.camel@hematite.internal.keyresearch.com> <52ache7h7n.fsf@cisco.com> <1129149438.25062.4.camel@hematite.internal.keyresearch.com> <52psqa5uf5.fsf@cisco.com>
Message-ID: <1129151766.25062.7.camel@hematite.internal.keyresearch.com>

> Robert> :-) Probably not for this release, but I do feel your
> Robert> pain. We're about to do a new software release here, so I
> Robert> can't make any cosmetic change any more, just functional
> Robert> (like the above.)
>
> OK, no hurry. It's going to be a requirement for an upstream merge,
> though.

Fair enough. In the meantime, I've pulled the tabs in a little like
this:

	ret =
	    io_remap_pfn_range(...)

so that it looks a little less weird. Still not perfect, but more
readable for the moment. I'm going to compile and test here before
doing a check-in, but expect one soon.

> BTW, I've found a bunch of bugs in the uverbs changes I made. I'll
> post new patches once I have it working here.

OK.

--
Robert Walsh                                 Email: rjwalsh at pathscale.com
PathScale, Inc.                              Phone: +1 650 934 8117
2071 Stierlin Court, Suite 200                 Fax: +1 650 428 1969
Mountain View, CA 94043

From mshefty at ichips.intel.com Wed Oct 12 14:35:45 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 12 Oct 2005 14:35:45 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <6.2.0.14.2.20051012122533.025939e0@esmail.cup.hp.com>
References: <1128877818.24182.54.camel@mail.es335.com> <6.2.0.14.2.20051012122533.025939e0@esmail.cup.hp.com>
Message-ID: <434D81B1.1040309@ichips.intel.com>

Michael Krause wrote:
> 1. Applications want to use existing API to identify remote endnodes /
> services.

To clarify, the applications want to use IP based addressing to identify
remote endnodes. The connection API is under development.

> 7. When a route look up occurs, a set of IP addresses are returned.
> Depending upon the kernel interface, one can also return the layer 2
> information either as part of this look up or as a separate query to the
> route table.

For IB, the route lookup returns a set of IB network addresses that are
associated with the IP network addresses specified by the user. See
comments below.

> So, from the above, it seems the IP and IB world can operate using the
> same code and work just fine. So, where is the problem? Is it really
> just how management assigns IP address to IB interfaces and how an
> application should select or be informed of which IP address to use and
> thus transparently identifies the IB port? Where is the connection
> establishment problem? The application does not see any difference.
> The network stack only acts as a repository for routing information
> unless running directly over IP over IB thus is not impacted. The
> middleware simply needs to extract the layer 2 information thus obtains
> the requisite data to construct the CM messages when going straight to
> IB (there is no change required here for iWARP as this is all native to
> its operation).
> What am I missing here?

The problem is that an application wants to use the network address from
one protocol (IP), but run over a different network protocol (IB). The
solution is to translate IP addresses to a layer 2 address. Converting
a local IP address involves making a couple of calls using the standard
Linux interfaces. Converting a remote IP address requires more work,
and currently there are two possible solutions. One is to simply use
ARP, which is what ib_addr, sdp, and ib_at do. A second solution is to
use the address translation service defined by DAT, which is supported
by ib_at.

On the server side of a connection request, a reverse mapping is
desired. But with IB, IP was not involved as part of the connection.
So, the receiver of a connection request needs a mechanism to identify
which IP address the sender used when connecting. The solution is to
pass the source IP address in the private data of the REQ.

From an application's viewpoint, the address translation is done for
them by the CMA. But the CMA uses ib_addr to perform the actual
translation.

Is this the information that you were looking for?

- Sean

From johnip at sgi.com Wed Oct 12 14:51:17 2005
From: johnip at sgi.com (John Partridge)
Date: Wed, 12 Oct 2005 16:51:17 -0500
Subject: [openib-general] mvapich-gen2 IA64 compile problem
In-Reply-To: <52ll0y5ue5.fsf@cisco.com>
References: <200510121701.j9CH1UGk015952@xi.cse.ohio-state.edu> <434D743F.4050208@sgi.com> <52ll0y5ue5.fsf@cisco.com>
Message-ID: <434D8555.3070401@sgi.com>

OK thanks

John

Roland Dreier wrote:
> John> I have just hit another issue (which I don't believe is just
> John> an IA64 issue) while compiling viapriv.o and viainit.o for
> John> overtake. It appears that some of the structure definitions
> John> are missing, has anyone else seen this or have I missed a
> John> config step ?
>
> It looks like you don't have libibverbs installed -- the linker isn't
> finding a lot of symbols from that library.
>
> - R.

--
John Partridge

Silicon Graphics Inc
Tel: 651-683-3428
Vnet: 233-3428
E-Mail: johnip at sgi.com

From rolandd at cisco.com Wed Oct 12 15:02:07 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 12 Oct 2005 15:02:07 -0700
Subject: [openib-general] InfiniPath driver announcement
In-Reply-To: <1129069024.29804.24.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Tue, 11 Oct 2005 15:17:04 -0700")
References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> <1129069024.29804.24.camel@hematite.internal.keyresearch.com>
Message-ID: <528xwy5r8g.fsf@cisco.com>

OK, here's my latest set of patches. With these changes, userspace
verbs work for me (tested between my one ipath system and a system with
a Mellanox HCA). I do see a few anomalies, but I don't think they're
caused by the generic uverbs changes I made, since I'm not touching the
ipath driver:

 - ibv_rc_pingpong is very very slow (like 26 Mb/sec with default
   params).

 - ibv_rc_pingpong on loopback doesn't work:

    $ ibv_rc_pingpong & sleep 1 && ibv_rc_pingpong localhost
    [2] 15790
      local address:  LID 0x000a, QPN 0x00006c, PSN 0xf277e7
      local address:  LID 0x000a, QPN 0x00006d, PSN 0xdcd50e
      remote address: LID 0x000a, QPN 0x00006d, PSN 0xdcd50e
      remote address: LID 0x000a, QPN 0x00006c, PSN 0xf277e7
    Failed status 12 for wr_id 2

 - ibv_ud_pingpong on loopback doesn't work either:

    $ ibv_ud_pingpong & sleep 1 && ibv_ud_pingpong localhost
    [3] 15800
      local address:  LID 0x000a, QPN 0x00006e, PSN 0xa622b2
      local address:  LID 0x000a, QPN 0x00006f, PSN 0x350102
      remote address: LID 0x000a, QPN 0x00006f, PSN 0x350102
      remote address: LID 0x000a, QPN 0x00006e, PSN 0xa622b2
    Couldn't post send

As a bonus, I'm throwing in a patch for libipathoib to fix the RPM build
on Fedora Core 4, and delete the unused ipathoib-abi.h header. (You'll
have to do svn rm yourself to completely kill the file, if you apply
this patch).

BTW, I think it would be a good idea to rename libipathoib to something
like "libipath" (or "libipathverbs" or whatever). I don't think we want
to put "oib" or "openib" in the name, since in the long run we're just
going to be part of the standard Linux IB (or RDMA) drivers.

 - R.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ipath-kernel.diff
Type: text/x-patch
Size: 15318 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ipath-libibverbs.diff
Type: text/x-patch
Size: 12603 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ipath-libipathoib.diff
Type: text/x-patch
Size: 4350 bytes
Desc: not available
URL:

From caitlinb at broadcom.com Wed Oct 12 15:14:11 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Wed, 12 Oct 2005 15:14:11 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
Message-ID: <54AD0F12E08D1541B826BE97C98F99F10209F7@NT-SJCA-0751.brcm.ad.broadcom.com>

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hefty
> Sent: Wednesday, October 12, 2005 2:36 PM
> To: Michael Krause
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] [RFC] IB address translation using ARP
>
> Michael Krause wrote:
> > 1. Applications want to use existing API to identify remote
> > endnodes / services.
>
> To clarify, the applications want to use IP based addressing
> to identify remote endnodes. The connection API is under development.
>

No, I think Mike's comment was dead on. Applications want to use the
existing API. They want to use the existing API even when the API is
clearly defective.

Note that there are several generations of host-resolution APIs for the
IP world, with the earlier ones clearly being heavily inferior (not
thread safe, not IPv4/IPv6 neutral, etc).
But they have not been eliminated.

Why? Because applications want to use the existing API.

If application developers were rational and totally open to adopting new ideas instantly, then the active side would ask to make a connection to a *service*, not to a host with a service qualifier.

A new API may be under development to meet new needs. But keep in mind that the application developers expect it to be as close to what they are used to as possible, and will grumble that it is not 100% compatible.

From johnip at sgi.com Wed Oct 12 15:18:59 2005
From: johnip at sgi.com (John Partridge)
Date: Wed, 12 Oct 2005 17:18:59 -0500
Subject: [openib-general] mvapich-gen2 IA64 compile problem
In-Reply-To: <52ll0y5ue5.fsf@cisco.com>
References: <200510121701.j9CH1UGk015952@xi.cse.ohio-state.edu> <434D743F.4050208@sgi.com> <52ll0y5ue5.fsf@cisco.com>
Message-ID: <434D8BD3.2070004@sgi.com>

Roland,

Actually, I just checked (and reinstalled in case there was a problem) and libibverbs is installed OK and I still get the problem.

From /usr/local/lib :-

-rw-r--r-- 1 root root 495K Oct 12 16:54 libibverbs.a
-rwxr-xr-x 1 root root 849 Oct 12 16:54 libibverbs.la
lrwxrwxrwx 1 root root 19 Oct 12 16:54 libibverbs.so -> libibverbs.so.1.0.0
lrwxrwxrwx 1 root root 19 Oct 12 16:54 libibverbs.so.1 -> libibverbs.so.1.0.0
-rwxr-xr-x 1 root root 203K Oct 12 16:54 libibverbs.so.1.0.0

So I checked my /etc/ld.so.conf :-

root on mig133 > cat /etc/ld.so.conf
# ld.so.conf autogenerated by env-update; make all changes to
# contents of /etc/env.d directory
/usr/local/lib
/usr/ia64-unknown-linux-gnu/lib
/usr/lib/gcc-lib/ia64-unknown-linux-gnu/3.3.2
/usr/local/lib/infiniband
/usr/local/ib/lib

I also made sure I did an ldconfig before attempting the mvapich-gen2 build.

Do you think I missed something else?

Thanks
John

Roland Dreier wrote: > John> I have just hit another issue (which I don't believe is just > John> an IA64 issue) while compiling viapriv.o and viainit.o for > John> overtake.
It appears that some of the structure definitions > John> are missing, has anyone else seen this or have I missed a > John> config step ? > > It looks like you don't have libibverbs installed -- the linker isn't > finding a lot of symbols from that library. > > - R. -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From mshefty at ichips.intel.com Wed Oct 12 15:19:41 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 12 Oct 2005 15:19:41 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10209F7@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F10209F7@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <434D8BFD.5050800@ichips.intel.com> Caitlin Bestler wrote: > No, I think Mike's comment was dead on. Applications want to > use the existing API. They want to use the existing API even > when the API is clearly defective. Note that there are several > generations of host-resolution APIs for the IP world, with the > earlier ones clearly being heavily inferior (not thread safe, > not IPv4/IPv6 neutral, etc). But they have not been eliminated. What existing API are you referring to? We have SDP to support standard sockets. - Sean From rolandd at cisco.com Wed Oct 12 15:21:55 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 12 Oct 2005 15:21:55 -0700 Subject: [openib-general] mvapich-gen2 IA64 compile problem In-Reply-To: <434D8BD3.2070004@sgi.com> (John Partridge's message of "Wed, 12 Oct 2005 17:18:59 -0500") References: <200510121701.j9CH1UGk015952@xi.cse.ohio-state.edu> <434D743F.4050208@sgi.com> <52ll0y5ue5.fsf@cisco.com> <434D8BD3.2070004@sgi.com> Message-ID: <524q7m5qbg.fsf@cisco.com> John> Roland, Actually, I just checked (and reinstalled in case John> there was a problem) and libibverbs is installed OK and I John> still get the problem. Not sure then. 
The linker isn't finding symbols that are in libibverbs. Maybe you need to add "-libverbs" somewhere? - R. From rjwalsh at pathscale.com Wed Oct 12 15:51:18 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Wed, 12 Oct 2005 15:51:18 -0700 Subject: [openib-general] InfiniPath driver announcement In-Reply-To: <52ache7h7n.fsf@cisco.com> References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> <1129069024.29804.24.camel@hematite.internal.keyresearch.com> <52ache7h7n.fsf@cisco.com> Message-ID: <1129157478.25062.11.camel@hematite.internal.keyresearch.com> > I got my system set up again. I needed the following patch to work > with the latest kernel (which no longer has io_remap_page_range). I'm > also throwing in a warning cleanup. Applied. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rjwalsh at pathscale.com Wed Oct 12 15:56:19 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Wed, 12 Oct 2005 15:56:19 -0700 Subject: [openib-general] InfiniPath driver announcement In-Reply-To: <528xwy5r8g.fsf@cisco.com> References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> <1129069024.29804.24.camel@hematite.internal.keyresearch.com> <528xwy5r8g.fsf@cisco.com> Message-ID: <1129157780.25062.16.camel@hematite.internal.keyresearch.com> > OK, here's my latest set of patches. With these changes, userspace > verbs work for me (tested between my one ipath system and a system > with a Mellanox HCA). OK - I'll start applying these now. Expect check-ins before the end of the day. > - ibv_rc_pingpong is very very slow (like 26 Mb/sec with default params). 
We haven't touched performance yet. We'll be looking at that real soon. That said, 26Mb/sec is slower than we see. We'll get back to you after we investigate this a bit. > - ibv_rc_pingpong on loopback doesn't work: Huh. That should work. I'll look at it. > - ibv_ud_pingpong on loopback doesn't work either: Ditto. > As a bonus, I'm throwing in a patch for libipathoib to fix the RPM > build on Fedora Core 4, and delete the unused ipathoib-abi.h header. > (You'll have to do svn rm yourself to completely kill the file, if you > apply this patch). Great. Thanks. > BTW, I think it would be a good idea to rename libipathoib to > something like "libipath" (or "libipathverbs" or whatever). I don't > think we want to put "oib" or "openib" in the name, since in the long > run we're just going to be part of the standard Linux IB (or RDMA) > drivers. I'll think up a better name. libipath is already taken with something else we ship, so I can't use that. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rjwalsh at pathscale.com Wed Oct 12 16:13:51 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Wed, 12 Oct 2005 16:13:51 -0700 Subject: [openib-general] InfiniPath driver announcement In-Reply-To: <528xwy5r8g.fsf@cisco.com> References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> <1129069024.29804.24.camel@hematite.internal.keyresearch.com> <528xwy5r8g.fsf@cisco.com> Message-ID: <1129158831.25062.18.camel@hematite.internal.keyresearch.com> > BTW, I think it would be a good idea to rename libipathoib to > something like "libipath" (or "libipathverbs" or whatever). 
I don't > think we want to put "oib" or "openib" in the name, since in the long > run we're just going to be part of the standard Linux IB (or RDMA) > drivers. libipathverbs went over well here. The next thing is, can svn handle directory name changes? -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rolandd at cisco.com Wed Oct 12 16:21:01 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 12 Oct 2005 16:21:01 -0700 Subject: [openib-general] InfiniPath driver announcement In-Reply-To: <1129158831.25062.18.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Wed, 12 Oct 2005 16:13:51 -0700") References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> <1129069024.29804.24.camel@hematite.internal.keyresearch.com> <528xwy5r8g.fsf@cisco.com> <1129158831.25062.18.camel@hematite.internal.keyresearch.com> Message-ID: <52vf02490i.fsf@cisco.com> Robert> libipathverbs went over well here. The next thing is, Robert> can svn handle directory name changes? Yes, it should be able to. Or we can just rename it when it comes over to the trunk. - R. From sean.hefty at intel.com Wed Oct 12 16:39:09 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 12 Oct 2005 16:39:09 -0700 Subject: [openib-general] [PATCH] [ADDR] return gateway GID for non-local IP addresses Message-ID: The following patch returns the GID of the IP gateway for non-local subnet IP addresses. Hal, does this change look correct to you? I don't have an easy way to test this fully.
Signed-off-by: Sean Hefty Index: core/addr.c =================================================================== --- core/addr.c (revision 3707) +++ core/addr.c (working copy) @@ -121,8 +121,8 @@ static void addr_send_arp(struct sockadd if (ip_route_output_key(&rt, &fl)) return; - arp_send(ARPOP_REQUEST, ETH_P_ARP, dst_ip, rt->idev->dev, rt->rt_src, - NULL, rt->idev->dev->dev_addr, NULL); + arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev, + rt->rt_src, NULL, rt->idev->dev->dev_addr, NULL); ip_rt_put(rt); } @@ -144,7 +144,7 @@ static int addr_resolve_remote(struct so if (ret) goto out; - neigh = neigh_lookup(&arp_tbl, &dst_ip, rt->idev->dev); + neigh = neigh_lookup(&arp_tbl, &rt->rt_gateway, rt->idev->dev); if (!neigh) { ret = -ENODATA; goto err1; From rjwalsh at pathscale.com Wed Oct 12 17:42:19 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Wed, 12 Oct 2005 17:42:19 -0700 Subject: [openib-general] InfiniPath driver announcement In-Reply-To: <528xwy5r8g.fsf@cisco.com> References: <1127937007.6858.7.camel@hematite.internal.keyresearch.com> <52y84z96qr.fsf@cisco.com> <1129069024.29804.24.camel@hematite.internal.keyresearch.com> <528xwy5r8g.fsf@cisco.com> Message-ID: <1129164139.25062.36.camel@hematite.internal.keyresearch.com> > BTW, I think it would be a good idea to rename libipathoib to > something like "libipath" (or "libipathverbs" or whatever). I don't > think we want to put "oib" or "openib" in the name, since in the long > run we're just going to be part of the standard Linux IB (or RDMA) > drivers. Done. I just blew away the old directory and created a new one. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From mlleinin at hpcn.ca.sandia.gov Wed Oct 12 18:24:32 2005 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Wed, 12 Oct 2005 18:24:32 -0700 Subject: [openib-general] Timeline of IPoIB performance In-Reply-To: <1129141706.13945.509.camel@localhost> References: <52mzle7k2i.fsf@cisco.com> <1129141706.13945.509.camel@localhost> Message-ID: <1129166672.13948.530.camel@localhost> On Wed, 2005-10-12 at 11:28 -0700, Matt Leininger wrote: > On Wed, 2005-10-12 at 09:53 -0700, Roland Dreier wrote: > > Herbert> Try reverting the changeset > > > > Herbert> 314324121f9b94b2ca657a494cf2b9cb0e4a28cc > > > > Herbert> which lies between these two points and may be relevant. > > > > Matt, I pulled this out of git for you. I guess Herbert is suggesting > > to patch -R the below against 2.6.12-rc5: > I applied your patch suggest by Herbert: > > http://www.mail-archive.com/openib-general%40openib.org/msg11415.html > I backed out this patch out of a few other kernels and always see a performance improvement. This gets back ~50-60 MB/s of the 90-100 MB/s drop off in IPoIB performance. Is it still worth testing the TSO patches that Herbert suggested for some of the 2.6.13-rc kernels? 
Thanks,

  - Matt

All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)

Kernel           OpenIB      msi_x   netperf (MB/s)
2.6.14-rc4       in-kernel   1       434 (backed out patch)
2.6.14-rc4       in-kernel   1       385
2.6.13.2         svn3627     1       446 (backed out patch)
2.6.13.2         svn3627     1       386
2.6.13.2         in-kernel   1       394
2.6.12.5         in-kernel   1       464 (backed out patch)
2.6.12.5         in-kernel   1       402
2.6.12-rc6       in-kernel   1       470 (backed out patch)
2.6.12-rc6       in-kernel   1       407
2.6.12-rc5       in-kernel   1       474 (backed out patch)
2.6.12-rc5       in-kernel   1       405
2.6.9-11.ELsmp   svn3513     1       425 (Woody's results, 3.6Ghz EM64T)

From herbert at gondor.apana.org.au Wed Oct 12 18:48:14 2005
From: herbert at gondor.apana.org.au (Herbert Xu)
Date: Thu, 13 Oct 2005 11:48:14 +1000
Subject: [openib-general] Timeline of IPoIB performance
In-Reply-To: <1129166672.13948.530.camel@localhost>
References: <52mzle7k2i.fsf@cisco.com> <1129141706.13945.509.camel@localhost> <1129166672.13948.530.camel@localhost>
Message-ID: <20051013014814.GA3688@gondor.apana.org.au>

On Wed, Oct 12, 2005 at 06:24:32PM -0700, Matt Leininger wrote:
>
> Is it still worth testing the TSO patches that Herbert suggested for
> some of the 2.6.13-rc kernels?

If you're still seeing a performance regression compared to 2.6.12-rc4, then yes (According to the figures in your message there does seem to be a bit of loss after the release of 2.6.12).

The patch you reverted may degrade the performance on the receiver. The TSO patches may be causing some degradation on your sender.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

From hozer at hozed.org Wed Oct 12 18:53:05 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 12 Oct 2005 20:53:05 -0500 Subject: [openib-general] IBM eHCA testing..
In-Reply-To: References: <52oe5xdp3e.fsf@cisco.com> Message-ID: <20051013015305.GK4612@kalmia.hozed.org> What is the turnaround time on a firmware change? If we can get an update, I think that would be the best solution. I'll be happy to test this. On Wed, Oct 12, 2005 at 11:36:59AM +0200, IBMEHCA DD wrote: > This is basically the answer why its so "sensitive" which port is plugged. > We're working on a solution to that problem. > But currently we only see a chance to change this behaviour by also > changing the firmware interface, > which needs to be coordinated with firmware development. > > Roland Dreier wrote on 10.10.2005 23:44:21: > > > IBMEHCA> So you need some kind of signal from the operating system > > IBMEHCA> to system firmware, which in the eHCA case is the > > IBMEHCA> H_DEFINE_AQP1 triggered by ib_create_qp with IB_QPT_GSI > > IBMEHCA> parameter. AFTER that call handshaking between system > > IBMEHCA> firmware and the SM will start, here's a new adapter > > IBMEHCA> active on a switch port... what's your guid? here's your > > IBMEHCA> LID, p_key, SM lid.... ...and after all that it's > > IBMEHCA> possible to send and receive packets from the fabric. > > IBMEHCA> The openib stack expects that a port is fully functional > > IBMEHCA> after this create_qp returns, and starts to do all sorts > > IBMEHCA> of modify QP and post send. So the only choice we have > > IBMEHCA> there is to delay create_qp until the complete > > IBMEHCA> handshaking between system firmware and the SM has > > IBMEHCA> finished (until we see a IB_PORT_ACTIVE in hcad_mod). If > > IBMEHCA> we don't see that until EHCA_PORT_ACTIVE_TIMEOUT we have > > IBMEHCA> to return an error code to openib, otherwise we're > > IBMEHCA> seriously in trouble (tried that). 
> >

From surs at cse.ohio-state.edu Wed Oct 12 21:33:44 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Thu, 13 Oct 2005 00:33:44 -0400
Subject: [openib-general] mvapich-gen2 IA64 compile problem
In-Reply-To: <434D8BD3.2070004@sgi.com>
References: <200510121701.j9CH1UGk015952@xi.cse.ohio-state.edu> <434D743F.4050208@sgi.com> <52ll0y5ue5.fsf@cisco.com> <434D8BD3.2070004@sgi.com>
Message-ID: <20051013043343.GA5823@cse.ohio-state.edu>

Hi John,

* On Oct,6 John Partridge wrote :
> Roland,
>
> Actually, I just checked (and reinstalled in case there was a problem)
> and libibverbs is installed OK and I still get the problem.

The mvapich.make.[gcc,icc,pgi] script in the top level directory of MVAPICH-Gen2 includes all the library paths and appropriate -l's.

Can you please tell us if you are using this script? There is a user guide in the distribution too (called: mvapich.user_guide.pdf), which lists some common troubleshooting issues when installing/using MVAPICH.

Thanks,
Sayantan.

--
http://www.cse.ohio-state.edu/~surs

From mohitka at noida.hcltech.com Thu Oct 13 00:10:20 2005
From: mohitka at noida.hcltech.com (Mohit Katiyar, Noida)
Date: Thu, 13 Oct 2005 12:40:20 +0530
Subject: [openib-general] Migration Solution
Message-ID: <3E6BB9CEE261E2428AD25D0D553DC4970145BB40@HSDLNTD1110010.noida.hcltech.com>

Hi all,

If anyone can suggest some good possible solution for migrating from

Clients --------FC Switch ---------> SAN connection

To

Clients-------> IB network-----------> SAN Connection

The most economical I can think of is

Clients ---------> IB Switch --------> IB FC gateway-------> FC Switch--------> SAN

But performance enhancement is doubtful

The Expensive but high performance will be

Clients --------> IB Switch---------> SAN

Does anyone have any other ideas or any other middle way?
Thanks Mohit From halr at voltaire.com Thu Oct 13 04:54:04 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Oct 2005 07:54:04 -0400 Subject: [openib-general] Re: [PATCH] [ADDR] return gateway GID for non-local IP addresses In-Reply-To: References: Message-ID: <1129204444.4402.1210.camel@hal.voltaire.com> On Wed, 2005-10-12 at 19:39, Sean Hefty wrote: > The following patch returns the GID of the IP gateway for non-local > subnet IP addresses. > > Hal, does this change look correct to you? I don't have an easy way > to test this fully. Yes, this looks right. I think the address resolution part can be tested without a real gateway for the connection by just adding a route off the IPoIB subnet to some other endnode and trying to connect to something on that remote destination subnet. You should at least see the ARP complete for that next hop and the connect (perhaps) fail depending on the discrimination in the passive side on the IP address passed in the private data. -- Hal From halr at voltaire.com Thu Oct 13 04:59:34 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Oct 2005 07:59:34 -0400 Subject: [openib-general] Migration Solution In-Reply-To: <3E6BB9CEE261E2428AD25D0D553DC4970145BB40@HSDLNTD1110010.noida.hcltech.com> References: <3E6BB9CEE261E2428AD25D0D553DC4970145BB40@HSDLNTD1110010.noida.hcltech.com> Message-ID: <1129204451.4402.1212.camel@hal.voltaire.com> On Thu, 2005-10-13 at 03:10, Mohit Katiyar, Noida wrote: > Hi all, > If anyone can suggest some good possible solution for migrating from > Clients --------FC Switch ---------> SAN connection > To > Clients-------> IB network-----------> SAN Connection It depends on your storage. There are two choices here: iSER based IB storage and SRP based IB storage. 
> The most economical I can think of is > Clients ---------> IB Switch --------> IB FC gateway-------> FC > Switch--------> SAN > But performance enhancement is doubtful > The Expensive but high performance will be > Clients --------> IB Switch---------> SAN Yes, this is more direct and is higher performance, but is it more expensive? The tradeoff is the cost of the IB FC gateway versus the cost delta of the native IB v. FC based storage. The main issue is the availability of the native IB storage solutions (I think several are emerging) and the initiator side (there are iSER and SRP initiators available for OpenIB). > Does anyone have any other ideas or any other middle way? Not that I am aware of. -- Hal > Thanks > Mohit From johnip at sgi.com Thu Oct 13 08:07:40 2005 From: johnip at sgi.com (John Partridge) Date: Thu, 13 Oct 2005 10:07:40 -0500 Subject: [openib-general] mvapich-gen2 IA64 compile problem In-Reply-To: <20051013043343.GA5823@cse.ohio-state.edu> References: <200510121701.j9CH1UGk015952@xi.cse.ohio-state.edu> <434D743F.4050208@sgi.com> <52ll0y5ue5.fsf@cisco.com> <434D8BD3.2070004@sgi.com> <20051013043343.GA5823@cse.ohio-state.edu> Message-ID: <434E783C.4080201@sgi.com> Sayantan, Thanks for the reply. I was just using make in the mvapich-gen2 directory, which may call the script; I don't know. I'll take a look at the doc you suggested and go through the troubleshooting in there. John Sayantan Sur wrote: > Hi John, > > * On Oct,6 John Partridge wrote : > >>Roland, >> >>Actually, I just checked (and reinstalled in case there was a problem) >>and libibverbs is installed OK and I still get the problem.
> > > The mvapich.make.[gcc,icc,pgi] script in the top level directory of > MVAPICH-Gen2 includes all the library paths and appropriate -l's. > > Can you please tell us if you are using this script? There is a user > guide in the distribution too (called: mvapich.user_guide.pdf), which > lists some common troubleshooting issues when installing/using MVAPICH. > > Thanks, > Sayantan. > -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From ardavis at ichips.intel.com Thu Oct 13 09:42:10 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 13 Oct 2005 09:42:10 -0700 Subject: [openib-general] QP with large starting sequence adds latency to RDMA READ??? In-Reply-To: <433AD711.2000602@ichips.intel.com> References: <433099C0.1070408@ichips.intel.com> <433AD711.2000602@ichips.intel.com> Message-ID: <434E8E62.1080909@ichips.intel.com> Sean Hefty wrote: > Arlin Davis wrote: > >> I just noticed some RDMA read performance issues that seem to be >> related to the QP starting sequence number. If I set the starting >> sequence to 1 then all is fine but if I set it to 0x10000 then it >> seems to add ~40us to my 32KB RDMA read operation (polling for >> completions). Has anyone seen anything like this? > > > Has anyone else noticed this issue? You could try to reproduce this > by using the rdma_bw test and changing the PSN. > > - Sean > I added a starting PSN and RDMA READ option to the rdma_bw test and was able to reproduce on a PCI-E adapter with 4.6.2 firmware. I retried on a system with 4.7.0 and it looks like the problem is fixed. However, I see nothing about this problem in the "bug fix" list in the release notes. Can someone at Mellanox confirm this problem with RDMA reads and add to release notes as a fix so it is documented somewhere? 
http://www.mellanox.com/products/fw_images/fw-25208-4_7_0-release_notes.pdf -arlin From ftillier at silverstorm.com Thu Oct 13 09:50:24 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 13 Oct 2005 09:50:24 -0700 Subject: [openib-general] QP with large starting sequence adds latencyto RDMA READ??? In-Reply-To: <434E8E62.1080909@ichips.intel.com> Message-ID: <000b01c5d016$35cecb50$9e5aa8c0@infiniconsys.com> > From: Arlin Davis [mailto:ardavis at ichips.intel.com] > Sent: Thursday, October 13, 2005 9:42 AM > > Sean Hefty wrote: > > > Arlin Davis wrote: > > > >> I just noticed some RDMA read performance issues that seem to be > >> related to the QP starting sequence number. If I set the starting > >> sequence to 1 then all is fine but if I set it to 0x10000 then it > >> seems to add ~40us to my 32KB RDMA read operation (polling for > >> completions). Has anyone seen anything like this? > > > > > > Has anyone else noticed this issue? You could try to reproduce this > > by using the rdma_bw test and changing the PSN. > > > > - Sean > > > > I added a starting PSN and RDMA READ option to the rdma_bw test and was > able to reproduce on a PCI-E adapter with 4.6.2 firmware. I retried on a > system with 4.7.0 and it looks like the problem is fixed. However, I > see nothing about this problem in the "bug fix" list in the release > notes. Can someone at Mellanox confirm this problem with RDMA reads and > add to release notes as a fix so it is documented somewhere? > > http://www.mellanox.com/products/fw_images/fw-25208-4_7_0-release_notes.pdf Note that I have seen similar behavior (drop in bandwidth) correlated to starting PSN using Winsock Direct under Windows, so this doesn't seem to be a uDAPL or Linux issue. As for Arlin, the issue disappeared in firmware 4.7.0, and I too would like to see some confirmation that there was an issue and that it was fixed. 
Thanks, - Fab From arlin.r.davis at intel.com Thu Oct 13 10:22:19 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 13 Oct 2005 10:22:19 -0700 Subject: [openib-general] [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN Message-ID: Michael, The patch adds command line options for RDMA reads and starting PSN. I used these modifications to help isolate the RDMA read performance degradation with 4.6.2 firmware. -arlin Signed-off by: Arlin Davis Index: rdma_bw.c =================================================================== --- rdma_bw.c (revision 3768) +++ rdma_bw.c (working copy) @@ -304,7 +304,9 @@ static struct pingpong_context *pp_init_ * The Consumer is not allowed to assign Remote Write or Remote Atomic to * a Memory Region that has not been assigned Local Write. */ ctx->mr = ibv_reg_mr(ctx->pd, ctx->buf, size * 2, - IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_LOCAL_WRITE); + IBV_ACCESS_REMOTE_WRITE | + IBV_ACCESS_REMOTE_READ | + IBV_ACCESS_LOCAL_WRITE); if (!ctx->mr) { fprintf(stderr, "Couldn't allocate MR\n"); return NULL; @@ -345,7 +347,9 @@ static struct pingpong_context *pp_init_ attr.qp_state = IBV_QPS_INIT; attr.pkey_index = 0; attr.port_num = port; - attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE; + attr.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | + IBV_ACCESS_REMOTE_READ | + IBV_ACCESS_LOCAL_WRITE; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | @@ -370,7 +374,7 @@ static int pp_connect_ctx(struct pingpon attr.path_mtu = IBV_MTU_2048; attr.dest_qp_num = dest->qpn; attr.rq_psn = dest->psn; - attr.max_dest_rd_atomic = 1; + attr.max_dest_rd_atomic = 4; attr.min_rnr_timer = 12; attr.ah_attr.is_global = 0; attr.ah_attr.dlid = dest->lid; @@ -394,7 +398,7 @@ static int pp_connect_ctx(struct pingpon attr.retry_cnt = 7; attr.rnr_retry = 7; attr.sq_psn = my_psn; - attr.max_rd_atomic = 1; + attr.max_rd_atomic = 4; if (ibv_modify_qp(ctx->qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT | @@ -417,6 +421,7 @@ static void usage(const char *argv0) 
printf("\n"); printf("Options:\n"); printf(" -p, --port= listen on/connect to port (default 18515)\n"); + printf(" -P, --starting_psn starting sequence on QP (default random)\n"); printf(" -d, --ib-dev= use IB device (default first device found)\n"); printf(" -i, --ib-port= use port of IB device (default 1)\n"); printf(" -s, --size= size of message to exchange (default 65536)\n"); @@ -487,6 +492,8 @@ int main(int argc, char *argv[]) int scnt, ccnt; int sockfd; int duplex = 0; + int rdma_read = 0; + int starting_psn = 0; struct ibv_qp *qp; cycles_t *tposted; @@ -498,16 +505,18 @@ int main(int argc, char *argv[]) static struct option long_options[] = { { .name = "port", .has_arg = 1, .val = 'p' }, + { .name = "starting_psn", .has_arg = 1, .val = 'P' }, { .name = "ib-dev", .has_arg = 1, .val = 'd' }, { .name = "ib-port", .has_arg = 1, .val = 'i' }, { .name = "size", .has_arg = 1, .val = 's' }, { .name = "iters", .has_arg = 1, .val = 'n' }, { .name = "tx-depth", .has_arg = 1, .val = 't' }, { .name = "bidirectional", .has_arg = 0, .val = 'b' }, + { .name = "rdma_read", .has_arg = 0, .val = 'r' }, { 0 } }; - c = getopt_long(argc, argv, "p:d:i:s:n:t:b", long_options, NULL); + c = getopt_long(argc, argv, "p:P:d:i:s:n:t:br", long_options, NULL); if (c == -1) break; @@ -520,6 +529,14 @@ int main(int argc, char *argv[]) } break; + case 'P': + starting_psn = strtol(optarg, NULL, 0); + if (starting_psn < 0 || starting_psn > 0xffffff) { + usage(argv[0]); + return 1; + } + break; + case 'd': ib_devname = strdupa(optarg); break; @@ -567,6 +584,10 @@ int main(int argc, char *argv[]) duplex = 1; break; + case 'r': + rdma_read = 1; + break; + default: usage(argv[0]); return 1; @@ -615,7 +636,11 @@ int main(int argc, char *argv[]) my_dest.lid = pp_get_local_lid(ctx, ib_port); my_dest.qpn = ctx->qp->qp_num; - my_dest.psn = lrand48() & 0xffffff; + if (!starting_psn) + my_dest.psn = lrand48() & 0xffffff; + else + my_dest.psn = starting_psn; + if (!my_dest.lid) { fprintf(stderr, "Local lid 0x0 detected.
Is an SM running?\n"); return 1; @@ -624,9 +649,10 @@ int main(int argc, char *argv[]) my_dest.vaddr = (uintptr_t)ctx->buf + ctx->size; printf(" local address: LID %#04x, QPN %#06x, PSN %#06x " - "RKey %#08x VAddr %#016Lx\n", + "RKey %#08x VAddr %#016Lx %s\n", my_dest.lid, my_dest.qpn, my_dest.psn, - my_dest.rkey, my_dest.vaddr); + my_dest.rkey, my_dest.vaddr, + rdma_read ? "RDMA_READ":"RDMA_WRITE"); if (servername) { sockfd = pp_client_connect(servername, port); @@ -643,10 +669,11 @@ int main(int argc, char *argv[]) if (!rem_dest) return 1; - printf(" remote address: LID %#04x, QPN %#06x, PSN %#06x, " - "RKey %#08x VAddr %#016Lx\n", + printf(" remote address: LID %#04x, QPN %#06x, PSN %#06x " + "RKey %#08x VAddr %#016Lx %s\n", rem_dest->lid, rem_dest->qpn, rem_dest->psn, - rem_dest->rkey, rem_dest->vaddr); + rem_dest->rkey, rem_dest->vaddr, + rdma_read ? "RDMA_READ":"RDMA_WRITE"); if (pp_connect_ctx(ctx, ib_port, my_dest.psn, rem_dest)) return 1; @@ -675,7 +702,11 @@ int main(int argc, char *argv[]) ctx->wr.wr_id = PINGPONG_RDMA_WRID; ctx->wr.sg_list = &ctx->list; ctx->wr.num_sge = 1; - ctx->wr.opcode = IBV_WR_RDMA_WRITE; + if (rdma_read) + ctx->wr.opcode = IBV_WR_RDMA_READ; + else + ctx->wr.opcode = IBV_WR_RDMA_WRITE; + ctx->wr.send_flags = IBV_SEND_SIGNALED; ctx->wr.next = NULL; From krause at cup.hp.com Thu Oct 13 10:29:09 2005 From: krause at cup.hp.com (Michael Krause) Date: Thu, 13 Oct 2005 10:29:09 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F10209F7@NT-SJCA-0751.brcm.a d.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F10209F7@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <6.2.0.14.2.20051013101446.022d3ba0@esmail.cup.hp.com> At 03:14 PM 10/12/2005, Caitlin Bestler wrote: > > > > -----Original Message----- > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hefty > > Sent: Wednesday, October 12, 2005 2:36 PM > > 
To: Michael Krause > > Cc: openib-general at openib.org > > Subject: Re: [openib-general] [RFC] IB address translation using ARP > > > > Michael Krause wrote: > > > 1. Applications want to use existing API to identify remote > > endnodes / > > > services. > > > > To clarify, the applications want to use IP based addressing > > to identify remote endnodes. The connection API is under development. > > > > >No, I think Mike's comment was dead on. Applications want to >use the existing API. They want to use the existing API even >when the API is clearly defective. Note that there are several >generations of host-resolution APIs for the IP world, with the >earlier ones clearly being heavily inferior (not thread safe, >not IPv4/IPv6 neutral, etc). But they have not been eliminated. > >Why? Because applications want to use the existing API. > >If application developers were rational and totally open to >adopting new ideas instantly then the active side would ask to >make a connection to a *service*, not to a host with a service >qualifier. > >A new API may be under development to meet new needs. But keep in >mind that the application developers expect it to be as close to >what they are used to as possible, and will grumble that it is >not 100% compatible. This all comes down to economics, which is why some ULPs such as SDP are created. Let's examine SDP for a moment. The purpose of SDP is to enable synchronous and asynchronous Sockets applications to transparently run unmodified over an RDMA-capable interconnect. Unmodified means no source code changes and no recompile required (this is possible if the Sockets library is a shared library and dynamically linked). The first part of unmodified means that the existing address / service resolution API calls work (further, no change to the address family, etc. is required to make this work either). Hence, pick any of the get* API calls that are in use today and they should just work. How does this work? 
The SDP implementation takes on the burden for the application developer. For iWARP, there really isn't anything special that has to be done as these calls all should provide the necessary information. The port mapper protocol would be invoked which would map to the actual RDMA listen QP and target RNIC. For IB, there is some additional work both in using the SID and in resolving the IP address to the IB address vector, but the work isn't that hard to implement (we know this because this has all been implemented on various OSes within the industry). The same will be true for NFS/RDMA and iSER - again all use the existing interfaces to identify the address / service and map to an address vector (and again, all of this has been implemented on various OSes within the industry). The above makes ISVs and customers very happy as they can take advantage of RDMA technologies without having to go through the lengthy and expensive qualification process that comes when any application is modified / recompiled. This keeps costs low and improves TTM. As for the RDMA connection API, that is simply attempting to abstract to a common interface that any ULP implementation can use to access either iWARP or IB. The RDMA connection API should not be viewed as something end application developers will use; it is aimed at middleware developers. This allows everyone to use IP addresses, port spaces, etc. through the existing application API while allowing RDMA to transparently add some intelligence to the process and eventually enable new capabilities like policy management (e.g. how best to map ULP QoS needs to a given path, service rate, etc.) without perturbing everything above. Keeping things transparent is best for all. Attempting to require end application developers to modify their code will result in slower adoption and reduced utilization of RDMA technologies within the industry. It really is all about economics and re-using the existing ecosystem / infrastructure. 
Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From caitlinb at broadcom.com Thu Oct 13 10:55:47 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 13 Oct 2005 10:55:47 -0700 Subject: [openib-general] [RFC] IB address translation using ARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020A03@NT-SJCA-0751.brcm.ad.broadcom.com> I agree with Mike's analysis. But I'd also like to point out that even when source compatibility is not a requirement, source familiarity is. That is, even when recoding is feasible the API should only introduce new concepts as required to improve efficiency. The shift from the socket model to QP/CQ is challenging enough as is. It's also where the benefit is. Changing how the application requests and accepts connections is just piling on more things for the developers to learn onto an already very full plate, and with nowhere near the same benefit. The simple, IP/DNS-centric methods that Mike outlined will work on either iWARP or IB, and are very easily understood by those familiar with existing sockets/IP network development. The more complex models provide minor enhancements for rare corner cases at the very heavy cost of requiring the developer to understand a lot more about network topology. -------------- next part -------------- An HTML attachment was scrubbed... URL: From arlin.r.davis at intel.com Thu Oct 13 11:23:38 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 13 Oct 2005 11:23:38 -0700 Subject: [openib-general] [PATCH] uDAPL async QP/CQ error handling fixed Message-ID: James, Patch will fix the async error handling and callback mappings. QP/CQ error mappings were totally screwed up. Updated TODO list. 
-arlin Signed-off by: Arlin Davis Index: dapl/openib/TODO =================================================================== --- dapl/openib/TODO (revision 3768) +++ dapl/openib/TODO (working copy) @@ -1,12 +1,10 @@ IB Verbs: - CQ resize -- mulitple CQ event support - memory window support DAPL: - reinit EP needs a QP timewait completion notification -- direct cq_wait_object when multi-CQ verbs event support arrives - shared receive queue support Under discussion: Index: dapl/openib/dapl_ib_util.c =================================================================== --- dapl/openib/dapl_ib_util.c (revision 3768) +++ dapl/openib/dapl_ib_util.c (working copy) @@ -214,8 +214,11 @@ DAT_RETURN dapls_ib_open_hca ( /* Get list of all IB devices, find match, open */ dev_list = ibv_get_devices(); dlist_start(dev_list); - dlist_for_each_data(dev_list,hca_ptr->ib_trans.ib_dev,struct ibv_device) { - if (!strcmp(ibv_get_device_name(hca_ptr->ib_trans.ib_dev),hca_name)) + dlist_for_each_data(dev_list, + hca_ptr->ib_trans.ib_dev, + struct ibv_device) { + if (!strcmp(ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + hca_name)) break; } @@ -226,20 +229,22 @@ DAT_RETURN dapls_ib_open_hca ( return DAT_INTERNAL_ERROR; } - dapl_dbg_log (DAPL_DBG_TYPE_UTIL," open_hca: Found dev %s %016llx\n", - ibv_get_device_name(hca_ptr->ib_trans.ib_dev), - (unsigned long long)bswap_64(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev))); + dapl_dbg_log ( + DAPL_DBG_TYPE_UTIL," open_hca: Found dev %s %016llx\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + (unsigned long long) + bswap_64(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev))); hca_ptr->ib_hca_handle = ibv_open_device(hca_ptr->ib_trans.ib_dev); if (!hca_ptr->ib_hca_handle) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, " open_hca: IB dev open failed for %s\n", - ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); + ibv_get_device_name(hca_ptr->ib_trans.ib_dev)); return DAT_INTERNAL_ERROR; } hca_ptr->ib_trans.ib_ctx = hca_ptr->ib_hca_handle; - /* set inline max 
with enviromment or default, get local lid and gid 0 */ + /* set inline max with env or default, get local lid and gid 0 */ hca_ptr->ib_trans.max_inline_send = dapl_os_get_env_val("DAPL_MAX_INLINE", INLINE_SEND_DEFAULT); @@ -253,15 +258,17 @@ DAT_RETURN dapls_ib_open_hca ( } dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " open_hca: GID subnet %016llx id %016llx\n", - (unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.subnet_prefix), - (unsigned long long)bswap_64(hca_ptr->ib_trans.gid.global.interface_id) ); + " open_hca: GID subnet %016llx id %016llx\n", + (unsigned long long) + bswap_64(hca_ptr->ib_trans.gid.global.subnet_prefix), + (unsigned long long) + bswap_64(hca_ptr->ib_trans.gid.global.interface_id)); /* get the IP address of the device using GID */ if (dapli_get_hca_addr(hca_ptr)) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, " open_hca: ERR ib_at_ips_by_gid for %s\n", - ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); + ibv_get_device_name(hca_ptr->ib_trans.ib_dev)); goto bail; } @@ -310,15 +317,23 @@ DAT_RETURN dapls_ib_open_hca ( write(g_ib_pipe[1], "w", sizeof "w"); dapl_os_unlock(&g_hca_lock); - dapl_dbg_log (DAPL_DBG_TYPE_UTIL, - " open_hca: %s, port %d, %s %d.%d.%d.%d INLINE_MAX=%d\n", - ibv_get_device_name(hca_ptr->ib_trans.ib_dev), hca_ptr->port_num, - ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_family == AF_INET ? "AF_INET":"AF_INET6", - ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 0 & 0xff, - ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 8 & 0xff, - ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 16 & 0xff, - ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 24 & 0xff, - hca_ptr->ib_trans.max_inline_send ); + dapl_dbg_log ( + DAPL_DBG_TYPE_UTIL, + " open_hca: %s, port %d, %s %d.%d.%d.%d INLINE_MAX=%d\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + hca_ptr->port_num, + ((struct sockaddr_in *) + &hca_ptr->hca_address)->sin_family == AF_INET ? 
+ "AF_INET":"AF_INET6", + ((struct sockaddr_in *) + &hca_ptr->hca_address)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *) + &hca_ptr->hca_address)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *) + &hca_ptr->hca_address)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *) + &hca_ptr->hca_address)->sin_addr.s_addr >> 24 & 0xff, + hca_ptr->ib_trans.max_inline_send ); hca_ptr->ib_trans.d_hca = hca_ptr; return DAT_SUCCESS; @@ -370,7 +385,7 @@ DAT_RETURN dapls_ib_close_hca ( IN DAP sleep.tv_sec = 0; sleep.tv_nsec = 10000000; /* 10 ms */ dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " ib_thread_destroy: waiting on hca %p destroy\n"); + " ib_thread_destroy: wait on hca %p destroy\n"); nanosleep (&sleep, &remain); } return (DAT_SUCCESS); @@ -425,19 +440,26 @@ DAT_RETURN dapls_ib_query_hca ( if (ia_attr != NULL) { ia_attr->adapter_name[DAT_NAME_MAX_LENGTH - 1] = '\0'; ia_attr->vendor_name[DAT_NAME_MAX_LENGTH - 1] = '\0'; - ia_attr->ia_address_ptr = (DAT_IA_ADDRESS_PTR)&hca_ptr->hca_address; + ia_attr->ia_address_ptr = + (DAT_IA_ADDRESS_PTR)&hca_ptr->hca_address; dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " query_hca: %s %s %d.%d.%d.%d\n", ibv_get_device_name(hca_ptr->ib_trans.ib_dev), - ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_family == AF_INET ? "AF_INET":"AF_INET6", - ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 0 & 0xff, - ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 8 & 0xff, - ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 16 & 0xff, - ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 24 & 0xff ); + ((struct sockaddr_in *) + ia_attr->ia_address_ptr)->sin_family == AF_INET ? 
+ "AF_INET":"AF_INET6", + ((struct sockaddr_in *) + ia_attr->ia_address_ptr)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *) + ia_attr->ia_address_ptr)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *) + ia_attr->ia_address_ptr)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *) + ia_attr->ia_address_ptr)->sin_addr.s_addr >> 24 & 0xff); ia_attr->hardware_version_major = dev_attr.hw_ver; - ia_attr->hardware_version_minor = dev_attr.fw_ver; + /* ia_attr->hardware_version_minor = dev_attr.fw_ver; */ ia_attr->max_eps = dev_attr.max_qp; ia_attr->max_dto_per_ep = dev_attr.max_qp_wr; ia_attr->max_rdma_read_per_ep = dev_attr.max_qp_rd_atom; @@ -468,7 +490,6 @@ DAT_RETURN dapls_ib_query_hca ( ia_attr->max_mtu_size, ia_attr->max_rdma_size, ia_attr->max_iov_segments_per_dto, ia_attr->max_lmrs, ia_attr->max_rmrs ); - } if (ep_attr != NULL) { @@ -522,27 +543,28 @@ DAT_RETURN dapls_ib_setup_async_callback ib_hca_transport_t *hca_ptr; dapl_dbg_log (DAPL_DBG_TYPE_UTIL, - " setup_async_cb: ia %p type %d handle %p cb %p ctx %p\n", + " setup_async_cb: ia %p type %d hdl %p cb %p ctx %p\n", ia_ptr, handler_type, evd_ptr, callback, context); hca_ptr = &ia_ptr->hca_ptr->ib_trans; switch(handler_type) { case DAPL_ASYNC_UNAFILIATED: - hca_ptr->async_unafiliated = callback; + hca_ptr->async_unafiliated = + (ib_async_handler_t)callback; hca_ptr->async_un_ctx = context; break; case DAPL_ASYNC_CQ_ERROR: - hca_ptr->async_cq_error = callback; - hca_ptr->async_cq_ctx = context; + hca_ptr->async_cq_error = + (ib_async_cq_handler_t)callback; break; case DAPL_ASYNC_CQ_COMPLETION: - hca_ptr->async_cq = callback; - hca_ptr->async_ctx = context; + hca_ptr->async_cq = + (ib_async_dto_handler_t)callback; break; case DAPL_ASYNC_QP_ERROR: - hca_ptr->async_qp_error = callback; - hca_ptr->async_qp_ctx = context; + hca_ptr->async_qp_error = + (ib_async_qp_handler_t)callback; break; default: break; @@ -573,7 +595,6 @@ void dapli_ib_thread_destroy(void) int retries = 10; 
dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_destroy(%d)\n", getpid()); - /* * wait for async thread to terminate. * pthread_join would be the correct method @@ -623,34 +644,42 @@ void dapli_async_event_cb(struct _ib_hca case IBV_EVENT_CQ_ERR: { - dapl_dbg_log(DAPL_DBG_TYPE_WARN, - " dapli_async_event CQ ERR %d\n", - event.event_type); + struct dapl_ep *evd_ptr = + event.element.cq->cq_context; + + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_async_event CQ (%p) ERR %d\n", + evd_ptr, event.event_type); /* report up if async callback still setup */ if (hca->async_cq_error) hca->async_cq_error(hca->ib_ctx, + event.element.cq, &event, - hca->async_cq_ctx); + (void*)evd_ptr); break; } case IBV_EVENT_COMM_EST: { - /* Received messages on connected QP before RTU */ - struct dapl_ep *ep_ptr = event.element.qp->qp_context; + /* Received msgs on connected QP before RTU */ + struct dapl_ep *ep_ptr = + event.element.qp->qp_context; /* TODO: cannot process COMM_EST until ibv * guarantees valid QP context for events. * Race conditions exist with QP destroy call. * For now, assume the RTU will arrive. 
*/ - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " dapli_async_event COMM_EST (qp=%p)\n", - event.element.qp); + dapl_dbg_log( + DAPL_DBG_TYPE_UTIL, + " dapli_async_event COMM_EST(qp=%p)\n", + event.element.qp); if (!DAPL_BAD_HANDLE(ep_ptr, DAPL_MAGIC_EP) && ep_ptr->cm_handle != IB_INVALID_HANDLE) - ib_cm_establish(ep_ptr->cm_handle->cm_id); + ib_cm_establish( + ep_ptr->cm_handle->cm_id); break; } @@ -662,15 +691,20 @@ void dapli_async_event_cb(struct _ib_hca case IBV_EVENT_SRQ_LIMIT_REACHED: case IBV_EVENT_SQ_DRAINED: { - dapl_dbg_log(DAPL_DBG_TYPE_WARN, - " dapli_async_event QP ERR %d\n", - event.event_type); + struct dapl_ep *ep_ptr = + event.element.qp->qp_context; + + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " dapli_async_event QP (%p) ERR %d\n", + ep_ptr, event.event_type); /* report up if async callback still setup */ if (hca->async_qp_error) hca->async_qp_error(hca->ib_ctx, + event.element.qp, &event, - hca->async_qp_ctx); + (void*)ep_ptr); break; } case IBV_EVENT_PATH_MIG: Index: dapl/openib/dapl_ib_util.h =================================================================== --- dapl/openib/dapl_ib_util.h (revision 3768) +++ dapl/openib/dapl_ib_util.h (working copy) @@ -141,7 +141,7 @@ typedef enum ibv_send_flags ib_send_op_t typedef struct ibv_sge ib_data_segment_t; typedef enum ibv_qp_state ib_qp_state_t; typedef enum ibv_event_type ib_async_event_type; -typedef struct ibv_async_event ib_error_record_t; +typedef struct ibv_async_event ib_error_record_t; /* CQ notifications */ typedef enum @@ -222,12 +222,30 @@ typedef struct ibv_comp_channel *ib_wait * ibv_post_recv - Return 0, -1 & bad_wr */ -/* async handler for CQ, QP, and unafiliated */ +/* async handler for DTO, CQ, QP, and unafiliated */ +typedef void (*ib_async_dto_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef void (*ib_async_cq_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_cq_handle_t ib_cq_handle, + IN ib_error_record_t *err_code, + 
IN void *context); + +typedef void (*ib_async_qp_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_qp_handle_t ib_qp_handle, + IN ib_error_record_t *err_code, + IN void *context); + typedef void (*ib_async_handler_t)( IN ib_hca_handle_t ib_hca_handle, IN ib_error_record_t *err_code, IN void *context); + /* ib_hca_transport_t, specific to this implementation */ typedef struct _ib_hca_transport { @@ -244,12 +262,9 @@ typedef struct _ib_hca_transport union ibv_gid gid; ib_async_handler_t async_unafiliated; void *async_un_ctx; - ib_async_handler_t async_cq_error; - void *async_ctx; - ib_async_handler_t async_cq; - void *async_cq_ctx; - ib_async_handler_t async_qp_error; - void *async_qp_ctx; + ib_async_cq_handler_t async_cq_error; + ib_async_dto_handler_t async_cq; + ib_async_qp_handler_t async_qp_error; } ib_hca_transport_t; From mst at mellanox.co.il Thu Oct 13 12:14:07 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 13 Oct 2005 21:14:07 +0200 Subject: [openib-general] Re: Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer In-Reply-To: <521x2q7byt.fsf@cisco.com> References: <521x2q7byt.fsf@cisco.com> Message-ID: <20051013191407.GC13356@mellanox.co.il> Quoting r. Roland Dreier : > IPoIB's handling of these allocation errors can definitely be improved Yes, it seems that if such an allocation fails IPoIB may never repost the receive buffer. Is that right? -- MST From mst at mellanox.co.il Thu Oct 13 12:34:36 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 13 Oct 2005 21:34:36 +0200 Subject: [openib-general] Re: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN In-Reply-To: References: Message-ID: <20051013193436.GA13514@mellanox.co.il> Quoting r. Arlin Davis : > Subject: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN > > Michael, > > The patch adds command line options for RDMA reads and starting PSN. 
I > used these modifications to > help isolate the RDMA read performance degradation with 4.6.2 firmware. > > -arlin Thanks Arlin. I plan to look into integrating this. One question: for which psn values do you see performance drop on 4.6.0 FW? -- MST From rolandd at cisco.com Thu Oct 13 12:33:28 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 12:33:28 -0700 Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer In-Reply-To: <20051013191407.GC13356@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 13 Oct 2005 21:14:07 +0200") References: <521x2q7byt.fsf@cisco.com> <20051013191407.GC13356@mellanox.co.il> Message-ID: <52irw12ovr.fsf@cisco.com> Michael> Yes, it seems that if such an allocation fails IPoIB may Michael> never repost the receive buffer. Is that right? I think so. My plan is to change the receive handling of IPoIB slightly, so that if it can't allocate a new receive buffer, it reposts the old buffer and drops the packet it just received. - R. From mst at mellanox.co.il Thu Oct 13 12:39:24 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 13 Oct 2005 21:39:24 +0200 Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer In-Reply-To: <52irw12ovr.fsf@cisco.com> References: <52irw12ovr.fsf@cisco.com> Message-ID: <20051013193924.GB13514@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer > > Michael> Yes, it seems that if such an allocation fails IPoIB may > Michael> never repost the receive buffer. Is that right? > > I think so. > > My plan is to change the receive handling of IPoIB slightly, so that > if it can't allocate a new receive buffer, it reposts the old buffer > and drops the packet it just received. Sounds like a good idea. 
-- MST From jlentini at netapp.com Thu Oct 13 13:33:21 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 13 Oct 2005 16:33:21 -0400 (EDT) Subject: [openib-general] Re: [PATCH] uDAPL async QP/CQ error handling fixed In-Reply-To: References: Message-ID: On Thu, 13 Oct 2005, Arlin Davis wrote: > James, > > Patch will fix the async error handling and callback mappings. QP/CQ > error mappings were totally screwed up. Updated TODO list. > > -arlin Committed in revision 3774. From hycsw at ca.sandia.gov Thu Oct 13 13:38:31 2005 From: hycsw at ca.sandia.gov (Helen Chen) Date: Thu, 13 Oct 2005 13:38:31 -0700 (PDT) Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer Message-ID: <200510132038.NAA24914@ca.sandia.gov> Roland, Thank you for your response. That fixed my initial buffer allocation failure. After we tuned the Lustre and reran same IOZONE tests again, we got the following problem. Was there an actual network interrupt? 
If so, the problem is not obvious now; the two nodes are pinging over IPoIB. Please advise. Thanks, Helen ---- Dmesg Report from Lustre server ----- NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 1846 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 2846 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 3846 Lustre: A connection with 192.168.2.79 timed out; the network or that node may be down. LustreError: 10501:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024f ip 192.168.2.79:1021 LustreError: 10793:0:(ldlm_lib.c:506:target_handle_reconnect()) 460e5_lov2_7d3910bb5c reconnecting ----- Dmesg from Lustre client (192.168.2.79) ------ NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 1965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 2965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 3965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 4965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 5965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 6965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 7965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 8965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 9965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 10965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 11965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 12965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 13965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 14965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 15965 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 16965 NETDEV WATCHDOG: 
ib0: transmit timed out ib0: transmit timeout: latency 17965 Lustre: 10035:0:(socknal_cb.c:1326:ksocknal_process_receive()) [f6256000] EOF from 0xc0a80253 ip 192.168.2.83:988 LustreError: 10169:0:(client.c:568:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -107 req at d3585600 x13853/t0 o400->on5-ost2_UUID at NID_on5-ib_UUID:6 lens 64/64 ref 1 fl Rpc:RN/0/0 rc 0/-107 LustreError: Connection to service on5-ost2 via nid 192.168.2.76 was lost; in progress operations using this service will wait for recovery to complete. Lustre: 10169:0:(import.c:142:ptlrpc_set_import_discon()) OSC_on8_on5-ost2_MNT_on8-ib_2: connection lost to on5-ost2_UUID at NID_on5-ib_UUID LustreError: This client was evicted by on5-ost2; in progress operations using this service will fail. LustreError: 10413:0:(rw.c:1253:ll_readpage()) page c1538cc0 map f6193328 index 825344 flags 20001023 count 3 priv e91da940: lock match failed: rc -5 LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at d01f2200 x13862/t0 o3->on5-ost2_UUID at NID_on5-ib_UUID:6 lens 328/280 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at d51ea400 x13868/t0 o3->on5-ost2_UUID at NID_on5-ib_UUID:6 lens 328/280 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) previously skipped 4 similar messages LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at d3c7ea00 x13880/t0 o3->on5-ost2_UUID at NID_on5-ib_UUID:6 lens 328/280 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) previously skipped 11 similar messages Lustre: A connection with 192.168.2.75 timed out; the network or that node may be down. LustreError: 10041:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024b ip 192.168.2.75:988 Lustre: Connection restored to service on5-ost2 using nid 192.168.2.76. 
Lustre: 10496:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on8_on5-ost2_MNT_on8-ib_2: connection restored to on5-ost2_UUID at NID_on5-ib_UUID LustreError: 10169:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129234515, 101s ago) req at f6233e00 x13850/t0 o400->on12-mds2_UUID at NID_on12-ib_UUID:12 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0 LustreError: Connection to service on12-mds2 via nid 192.168.2.83 was lost; in progress operations using this service will wait for recovery to complete. Lustre: 10169:0:(import.c:142:ptlrpc_set_import_discon()) MDC_on8_on12-mds2_MNT_on8-ib_2: connection lost to on12-mds2_UUID at NID_on12-ib_UUID Lustre: Connection restored to service on3-ost2 using nid 192.168.2.74. Lustre: 10170:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on8_on3-ost2_MNT_on8-ib_2: connection restored to on3-ost2_UUID at NID_on3-ib_UUID From ardavis at ichips.intel.com Thu Oct 13 13:48:47 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 13 Oct 2005 13:48:47 -0700 Subject: [openib-general] Re: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN In-Reply-To: <20051013193436.GA13514@mellanox.co.il> References: <20051013193436.GA13514@mellanox.co.il> Message-ID: <434EC82F.4080908@ichips.intel.com> Michael S. Tsirkin wrote: >Quoting r. Arlin Davis : > > >>Subject: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN >> >>Michael, >> >>The patch adds command line options for RDMA reads and starting PSN. I >>used these modifications to >>help isolate the RDMA read performance degradation with 4.6.2 firmware. >> >>-arlin >> >> > >Thanks Arlin. I plan to look into integrating this. >One question: for which psn values do you see performance drop on 4.6.0 FW? > > > > A quick run at 1 and then 0x100000 dropped from 682MB/s to 49MB/s for 32KB buffers. What is really strange is that it takes a couple runs to start seeing the drop in performance. PSN=1 no problems... 
[ardavis at iclust-20 perftest]$ ./rdma_bw -P 0x1 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0x20406, PSN 0x0001 RKey 0x0c0032 VAddr 0x00000000514000 RDMA_READ remote address: LID 0x05, QPN 0x20406, PSN 0x0001 RKey 0x0c0032 VAddr 0x00000000513000 RDMA_READ Bandwidth peak (#0 to #999): 682.504 MB/sec Bandwidth average: 682.501 MB/sec Service Demand peak (#0 to #999): 5138 cycles/KB Service Demand Avg : 5138 cycles/KB [ardavis at iclust-20 perftest]$ ./rdma_bw -P 0x1 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0x30406, PSN 0x0001 RKey 0x120032 VAddr 0x00000000514000 RDMA_READ remote address: LID 0x05, QPN 0x30406, PSN 0x0001 RKey 0x120032 VAddr 0x00000000513000 RDMA_READ Bandwidth peak (#0 to #990): 682.496 MB/sec Bandwidth average: 682.496 MB/sec Service Demand peak (#0 to #990): 5138 cycles/KB Service Demand Avg : 5138 cycles/KB [ardavis at iclust-20 perftest]$ ./rdma_bw -P 0x1 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0x40406, PSN 0x0001 RKey 0x180032 VAddr 0x00000000514000 RDMA_READ remote address: LID 0x05, QPN 0x40406, PSN 0x0001 RKey 0x180032 VAddr 0x00000000513000 RDMA_READ Bandwidth peak (#0 to #990): 682.5 MB/sec Bandwidth average: 682.499 MB/sec Service Demand peak (#0 to #990): 5138 cycles/KB Service Demand Avg : 5138 cycles/KB PSN=0x100000 (start to see problems after first run) [ardavis at iclust-20 perftest]$ ./rdma_bw -P 0x100000 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0xb0406, PSN 0x100000 RKey 0x420032 VAddr 0x00000000514000 RDMA_READ remote address: LID 0x05, QPN 0x90406, PSN 0x100000 RKey 0x360032 VAddr 0x00000000513000 RDMA_READ Bandwidth peak (#0 to #996): 682.5 MB/sec Bandwidth average: 682.499 MB/sec Service Demand peak (#0 to #996): 5138 cycles/KB Service Demand Avg : 5138 cycles/KB [ardavis at iclust-20 perftest]$ ./rdma_bw -P 0x100000 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0xc0406, PSN 0x100000 RKey 0x480032 VAddr 0x00000000514000 RDMA_READ remote address: LID 0x05, QPN 0xa0406, PSN 0x100000 
RKey 0x3c0032 VAddr 0x00000000513000 RDMA_READ Bandwidth peak (#0 to #0): 48.5441 MB/sec Bandwidth average: 47.4502 MB/sec Service Demand peak (#0 to #0): 72244 cycles/KB Service Demand Avg : 73909 cycles/KB [ardavis at iclust-20 perftest]$ ./rdma_bw -P 0x100000 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0xd0406, PSN 0x100000 RKey 0x4e0032 VAddr 0x00000000514000 RDMA_READ remote address: LID 0x05, QPN 0xb0406, PSN 0x100000 RKey 0x420032 VAddr 0x00000000513000 RDMA_READ Bandwidth peak (#0 to #0): 48.4803 MB/sec Bandwidth average: 47.4501 MB/sec Service Demand peak (#0 to #0): 72339 cycles/KB Service Demand Avg : 73909 cycles/KB PSN = 1 (first run is bad, and then it is back to normal) [ardavis at iclust-20 perftest]$ ./rdma_bw -P 0x1 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0xe0406, PSN 0x0001 RKey 0x540032 VAddr 0x00000000514000 RDMA_READ remote address: LID 0x05, QPN 0xc0406, PSN 0x0001 RKey 0x480032 VAddr 0x00000000513000 RDMA_READ Bandwidth peak (#0 to #0): 48.5798 MB/sec Bandwidth average: 47.4502 MB/sec Service Demand peak (#0 to #0): 72190 cycles/KB Service Demand Avg : 73909 cycles/KB [ardavis at iclust-20 perftest]$ ./rdma_bw -P 0x1 -s 32768 -r iclust-19 local address: LID 0x02, QPN 0xf0406, PSN 0x0001 RKey 0x5a0032 VAddr 0x00000000514000 RDMA_READ remote address: LID 0x05, QPN 0xd0406, PSN 0x0001 RKey 0x4e0032 VAddr 0x00000000513000 RDMA_READ Bandwidth peak (#0 to #990): 682.492 MB/sec Bandwidth average: 682.49 MB/sec Service Demand peak (#0 to #990): 5138 cycles/KB Service Demand Avg : 5138 cycles/KB -arlin From rolandd at cisco.com Thu Oct 13 13:52:49 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 13:52:49 -0700 Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer In-Reply-To: <200510132038.NAA24914@ca.sandia.gov> (Helen Chen's message of "Thu, 13 Oct 2005 13:38:31 -0700 (PDT)") References: <200510132038.NAA24914@ca.sandia.gov> Message-ID: 
<52fyr516n2.fsf@cisco.com> Helen> Roland, Thank you for your response. That fixed my initial Helen> buffer allocation failure. After we tuned the Lustre and Helen> reran same IOZONE tests again, we got the following Helen> problem. Was there an actual network interrupt? If so, the Helen> problem is not obvious now; the two nodes are pinging over Helen> IPoIB. Please advice. That's very odd. This message: Helen> NETDEV WATCHDOG: ib0: transmit timed out Helen> ib0: transmit timeout: latency 1846 says that we are not seeing send completions from the HCA. However, are you saying that even when you are seeing this message, ping over IPoIB is working? - R. From hycsw at ca.sandia.gov Thu Oct 13 14:21:16 2005 From: hycsw at ca.sandia.gov (Helen Chen) Date: Thu, 13 Oct 2005 14:21:16 -0700 (PDT) Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer Message-ID: <200510132121.OAA29376@ca.sandia.gov> Roland, >From rolandd at cisco.com Thu Oct 13 13:53:05 2005 > > Helen> Roland, Thank you for your response. That fixed my initial > Helen> buffer allocation failure. After we tuned the Lustre and > Helen> reran same IOZONE tests again, we got the following > Helen> problem. Was there an actual network interrupt? If so, the > Helen> problem is not obvious now; the two nodes are pinging over > Helen> IPoIB. Please advice. > >That's very odd. This message: > > Helen> NETDEV WATCHDOG: ib0: transmit timed out > Helen> ib0: transmit timeout: latency 1846 > >says that we are not seeing send completions from the HCA. However, >are you saying that even when you are seeing this message, ping over >IPoIB is working? > No, I didn't know there was any problem until IOZONE reported a read error from the Lustre Client. BTW, the backend storage is iSCSI over 10 GbE using jumbo frames. This problem only appeared after our tuning effort: we increased the iSCSI payload to 1 MB, and increased the TCP window to 512 KB from 256 KB. 
I will shrink my TCP window and see if the problem goes away. Thanks, Helen From rolandd at cisco.com Thu Oct 13 14:21:28 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 14:21:28 -0700 Subject: [openib-general] [RFC] Kernel uverbs changes for PathScale merge Message-ID: <52br1t15bb.fsf@cisco.com> Here are the changes to the kernel part of userspace verbs required to support PathScale's driver. I'm now happy with them and ready to commit them to the svn trunk and queue them for 2.6.15. This will allow the PathScale hardware-specific driver to be moved to the trunk as well, although quite a bit of cleanup is necessary before merging the driver upstream. Does anyone have any comments on these changes before I commit? Thanks, Roland --- infiniband/include/rdma/ib_user_verbs.h (revision 3707) +++ infiniband/include/rdma/ib_user_verbs.h (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -88,8 +89,11 @@ enum { * Make sure that all structs defined in this file remain laid out so * that they pack the same way on 32-bit and 64-bit architectures (to * avoid incompatibility between 32-bit userspace and 64-bit kernels). - * In particular do not use pointer types -- pass pointers in __u64 - * instead. + * Specifically: + * - Do not use pointer types -- pass pointers in __u64 instead. + * - Make sure that any structure larger than 4 bytes is padded to a + * multiple of 8 bytes. Otherwise the structure size will be + * different between 32-bit and 64-bit architectures. 
*/ struct ib_uverbs_async_event_desc { @@ -261,6 +265,42 @@ struct ib_uverbs_create_cq_resp { __u32 cqe; }; +struct ib_uverbs_poll_cq { + __u64 response; + __u32 cq_handle; + __u32 ne; + __u64 wc; +}; + +struct ib_uverbs_wc { + __u64 wr_id; + __u32 status; + __u32 opcode; + __u32 vendor_err; + __u32 byte_len; + __u32 imm_data; + __u32 qp_num; + __u32 src_qp; + __u32 wc_flags; + __u16 pkey_index; + __u16 slid; + __u8 sl; + __u8 dlid_path_bits; + __u8 port_num; + __u8 reserved; +}; + +struct ib_uverbs_poll_cq_resp { + __u32 count; + __u32 reserved; + struct ib_uverbs_wc wc[0]; +}; + +struct ib_uverbs_req_notify_cq { + __u32 cq_handle; + __u32 solicited_only; +}; + struct ib_uverbs_destroy_cq { __u64 response; __u32 cq_handle; @@ -358,6 +398,127 @@ struct ib_uverbs_destroy_qp_resp { __u32 events_reported; }; +/* + * Note: the ib_uverbs_sge structure isn't used anywhere, as the ib_sge + * structure is packed the same way on 32-bit and 64-bit architectures + * in both kernel and user space. It's just here to document the ABI. 
+ */ + +struct ib_uverbs_sge { + __u64 addr; + __u32 length; + __u32 lkey; +}; + +struct ib_uverbs_send_wr { + __u64 wr_id; + __u32 num_sge; + __u32 opcode; + __u32 send_flags; + __u32 imm_data; + union { + struct { + __u64 remote_addr; + __u32 rkey; + __u32 reserved; + } rdma; + struct { + __u64 remote_addr; + __u64 compare_add; + __u64 swap; + __u32 rkey; + __u32 reserved; + } atomic; + struct { + __u32 ah; + __u32 remote_qpn; + __u32 remote_qkey; + __u32 reserved; + } ud; + } wr; +}; + +struct ib_uverbs_post_send { + __u64 response; + __u32 qp_handle; + __u32 wr_count; + __u32 sge_count; + __u32 wqe_size; + struct ib_uverbs_send_wr send_wr[0]; +}; + +struct ib_uverbs_post_send_resp { + __u32 bad_wr; +}; + +struct ib_uverbs_recv_wr { + __u64 wr_id; + __u32 num_sge; + __u32 reserved; +}; + +struct ib_uverbs_post_recv { + __u64 response; + __u32 qp_handle; + __u32 wr_count; + __u32 sge_count; + __u32 wqe_size; + struct ib_uverbs_recv_wr recv_wr[0]; +}; + +struct ib_uverbs_post_recv_resp { + __u32 bad_wr; +}; + +struct ib_uverbs_post_srq_recv { + __u64 response; + __u32 srq_handle; + __u32 wr_count; + __u32 sge_count; + __u32 wqe_size; + struct ib_uverbs_recv_wr recv[0]; +}; + +struct ib_uverbs_post_srq_recv_resp { + __u32 bad_wr; +}; + +struct ib_uverbs_global_route { + __u8 dgid[16]; + __u32 flow_label; + __u8 sgid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 reserved; +}; + +struct ib_uverbs_ah_attr { + struct ib_uverbs_global_route grh; + __u16 dlid; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; + __u8 reserved; +}; + +struct ib_uverbs_create_ah { + __u64 response; + __u64 user_handle; + __u32 pd_handle; + __u32 reserved; + struct ib_uverbs_ah_attr attr; +}; + +struct ib_uverbs_create_ah_resp { + __u32 ah_handle; +}; + +struct ib_uverbs_destroy_ah { + __u32 ah_handle; +}; + struct ib_uverbs_attach_mcast { __u8 gid[16]; __u32 qp_handle; --- infiniband/core/uverbs_main.c (revision 3740) +++ 
infiniband/core/uverbs_main.c (working copy) @@ -3,6 +3,7 @@ * Copyright (c) 2005 Cisco Systems. All rights reserved. * Copyright (c) 2005 Mellanox Technologies. All rights reserved. * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -86,10 +87,17 @@ static ssize_t (*uverbs_cmd_table[])(str [IB_USER_VERBS_CMD_DEREG_MR] = ib_uverbs_dereg_mr, [IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL] = ib_uverbs_create_comp_channel, [IB_USER_VERBS_CMD_CREATE_CQ] = ib_uverbs_create_cq, + [IB_USER_VERBS_CMD_POLL_CQ] = ib_uverbs_poll_cq, + [IB_USER_VERBS_CMD_REQ_NOTIFY_CQ] = ib_uverbs_req_notify_cq, [IB_USER_VERBS_CMD_DESTROY_CQ] = ib_uverbs_destroy_cq, [IB_USER_VERBS_CMD_CREATE_QP] = ib_uverbs_create_qp, [IB_USER_VERBS_CMD_MODIFY_QP] = ib_uverbs_modify_qp, [IB_USER_VERBS_CMD_DESTROY_QP] = ib_uverbs_destroy_qp, + [IB_USER_VERBS_CMD_POST_SEND] = ib_uverbs_post_send, + [IB_USER_VERBS_CMD_POST_RECV] = ib_uverbs_post_recv, + [IB_USER_VERBS_CMD_POST_SRQ_RECV] = ib_uverbs_post_srq_recv, + [IB_USER_VERBS_CMD_CREATE_AH] = ib_uverbs_create_ah, + [IB_USER_VERBS_CMD_DESTROY_AH] = ib_uverbs_destroy_ah, [IB_USER_VERBS_CMD_ATTACH_MCAST] = ib_uverbs_attach_mcast, [IB_USER_VERBS_CMD_DETACH_MCAST] = ib_uverbs_detach_mcast, [IB_USER_VERBS_CMD_CREATE_SRQ] = ib_uverbs_create_srq, @@ -111,7 +119,13 @@ static int ib_dealloc_ucontext(struct ib down(&ib_uverbs_idr_mutex); - /* XXX Free AHs */ + list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) { + struct ib_ah *ah = idr_find(&ib_uverbs_ah_idr, uobj->id); + idr_remove(&ib_uverbs_ah_idr, uobj->id); + ib_destroy_ah(ah); + list_del(&uobj->list); + kfree(uobj); + } list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) { struct ib_qp *qp = idr_find(&ib_uverbs_qp_idr, uobj->id); --- infiniband/core/uverbs.h (revision 3707) +++ 
infiniband/core/uverbs.h (working copy) @@ -3,6 +3,7 @@ * Copyright (c) 2005 Cisco Systems. All rights reserved. * Copyright (c) 2005 Mellanox Technologies. All rights reserved. * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -140,10 +141,17 @@ IB_UVERBS_DECLARE_CMD(reg_mr); IB_UVERBS_DECLARE_CMD(dereg_mr); IB_UVERBS_DECLARE_CMD(create_comp_channel); IB_UVERBS_DECLARE_CMD(create_cq); +IB_UVERBS_DECLARE_CMD(poll_cq); +IB_UVERBS_DECLARE_CMD(req_notify_cq); IB_UVERBS_DECLARE_CMD(destroy_cq); IB_UVERBS_DECLARE_CMD(create_qp); IB_UVERBS_DECLARE_CMD(modify_qp); IB_UVERBS_DECLARE_CMD(destroy_qp); +IB_UVERBS_DECLARE_CMD(post_send); +IB_UVERBS_DECLARE_CMD(post_recv); +IB_UVERBS_DECLARE_CMD(post_srq_recv); +IB_UVERBS_DECLARE_CMD(create_ah); +IB_UVERBS_DECLARE_CMD(destroy_ah); IB_UVERBS_DECLARE_CMD(attach_mcast); IB_UVERBS_DECLARE_CMD(detach_mcast); IB_UVERBS_DECLARE_CMD(create_srq); --- infiniband/core/uverbs_cmd.c (revision 3707) +++ infiniband/core/uverbs_cmd.c (working copy) @@ -665,6 +665,93 @@ err: return ret; } +ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_poll_cq cmd; + struct ib_uverbs_poll_cq_resp *resp; + struct ib_cq *cq; + struct ib_wc *wc; + int ret = 0; + int i; + int rsize; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + wc = kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL); + if (!wc) + return -ENOMEM; + + rsize = sizeof *resp + cmd.ne * sizeof(struct ib_uverbs_wc); + resp = kmalloc(rsize, GFP_KERNEL); + if (!resp) { + ret = -ENOMEM; + goto out_wc; + } + + down(&ib_uverbs_idr_mutex); + cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); + if (!cq || cq->uobject->context != file->ucontext) { + ret = -EINVAL; + goto out; + } + + resp->count = ib_poll_cq(cq, 
cmd.ne, wc); + + for (i = 0; i < resp->count; i++) { + resp->wc[i].wr_id = wc[i].wr_id; + resp->wc[i].status = wc[i].status; + resp->wc[i].opcode = wc[i].opcode; + resp->wc[i].vendor_err = wc[i].vendor_err; + resp->wc[i].byte_len = wc[i].byte_len; + resp->wc[i].imm_data = wc[i].imm_data; + resp->wc[i].qp_num = wc[i].qp_num; + resp->wc[i].src_qp = wc[i].src_qp; + resp->wc[i].wc_flags = wc[i].wc_flags; + resp->wc[i].pkey_index = wc[i].pkey_index; + resp->wc[i].slid = wc[i].slid; + resp->wc[i].sl = wc[i].sl; + resp->wc[i].dlid_path_bits = wc[i].dlid_path_bits; + resp->wc[i].port_num = wc[i].port_num; + } + + if (copy_to_user((void __user *) (unsigned long) cmd.response, resp, rsize)) + ret = -EFAULT; + +out: + up(&ib_uverbs_idr_mutex); + kfree(resp); + +out_wc: + kfree(wc); + return ret ? ret : in_len; +} + +ssize_t ib_uverbs_req_notify_cq(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_req_notify_cq cmd; + struct ib_cq *cq; + int ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + down(&ib_uverbs_idr_mutex); + cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle); + if (cq && cq->uobject->context == file->ucontext) { + ib_req_notify_cq(cq, cmd.solicited_only ? + IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); + ret = in_len; + } + up(&ib_uverbs_idr_mutex); + + return ret; +} + ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) @@ -1003,6 +1090,468 @@ out: return ret ? 
ret : in_len; } +ssize_t ib_uverbs_post_send(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_post_send cmd; + struct ib_uverbs_post_send_resp resp; + struct ib_uverbs_send_wr *user_wr; + struct ib_send_wr *wr = NULL, *last, *next, *bad_wr; + struct ib_qp *qp; + int i, sg_ind; + ssize_t ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + if (in_len < sizeof cmd + cmd.wqe_size * cmd.wr_count + + cmd.sge_count * sizeof (struct ib_uverbs_sge)) + return -EINVAL; + + if (cmd.wqe_size < sizeof (struct ib_uverbs_send_wr)) + return -EINVAL; + + user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL); + if (!user_wr) + return -ENOMEM; + + down(&ib_uverbs_idr_mutex); + + qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); + if (!qp || qp->uobject->context != file->ucontext) + goto out; + + sg_ind = 0; + last = NULL; + for (i = 0; i < cmd.wr_count; ++i) { + if (copy_from_user(user_wr, + buf + sizeof cmd + i * cmd.wqe_size, + cmd.wqe_size)) { + ret = -EFAULT; + goto out; + } + + if (user_wr->num_sge + sg_ind > cmd.sge_count) { + ret = -EINVAL; + goto out; + } + + next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) + + user_wr->num_sge * sizeof (struct ib_sge), + GFP_KERNEL); + if (!next) { + ret = -ENOMEM; + goto out; + } + + if (!last) + wr = next; + else + last->next = next; + last = next; + + next->next = NULL; + next->wr_id = user_wr->wr_id; + next->num_sge = user_wr->num_sge; + next->opcode = user_wr->opcode; + next->send_flags = user_wr->send_flags; + next->imm_data = user_wr->imm_data; + + if (qp->qp_type == IB_QPT_UD) { + next->wr.ud.ah = idr_find(&ib_uverbs_ah_idr, + user_wr->wr.ud.ah); + if (!next->wr.ud.ah) { + ret = -EINVAL; + goto out; + } + next->wr.ud.remote_qpn = user_wr->wr.ud.remote_qpn; + next->wr.ud.remote_qkey = user_wr->wr.ud.remote_qkey; + } else { + switch (next->opcode) { + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + case IB_WR_RDMA_READ: + 
next->wr.rdma.remote_addr = + user_wr->wr.rdma.remote_addr; + next->wr.rdma.rkey = + user_wr->wr.rdma.rkey; + break; + case IB_WR_ATOMIC_CMP_AND_SWP: + case IB_WR_ATOMIC_FETCH_AND_ADD: + next->wr.atomic.remote_addr = + user_wr->wr.atomic.remote_addr; + next->wr.atomic.compare_add = + user_wr->wr.atomic.compare_add; + next->wr.atomic.swap = user_wr->wr.atomic.swap; + next->wr.atomic.rkey = user_wr->wr.atomic.rkey; + break; + default: + break; + } + } + + if (next->num_sge) { + next->sg_list = (void *) next + + ALIGN(sizeof *next, sizeof (struct ib_sge)); + if (copy_from_user(next->sg_list, + buf + sizeof cmd + + cmd.wr_count * cmd.wqe_size + + sg_ind * sizeof (struct ib_sge), + next->num_sge * sizeof (struct ib_sge))) { + ret = -EFAULT; + goto out; + } + sg_ind += next->num_sge; + } else + next->sg_list = NULL; + } + + resp.bad_wr = 0; + ret = qp->device->post_send(qp, wr, &bad_wr); + if (ret) + for (next = wr; next; next = next->next) { + ++resp.bad_wr; + if (next == bad_wr) + break; + } + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + ret = -EFAULT; + +out: + up(&ib_uverbs_idr_mutex); + + while (wr) { + next = wr->next; + kfree(wr); + wr = next; + } + + kfree(user_wr); + + return ret ? 
ret : in_len; +} + +static struct ib_recv_wr *ib_uverbs_unmarshall_recv(const char __user *buf, + int in_len, + u32 wr_count, + u32 sge_count, + u32 wqe_size) +{ + struct ib_uverbs_recv_wr *user_wr; + struct ib_recv_wr *wr = NULL, *last, *next; + int sg_ind; + int i; + int ret; + + if (in_len < wqe_size * wr_count + + sge_count * sizeof (struct ib_uverbs_sge)) + return ERR_PTR(-EINVAL); + + if (wqe_size < sizeof (struct ib_uverbs_recv_wr)) + return ERR_PTR(-EINVAL); + + user_wr = kmalloc(wqe_size, GFP_KERNEL); + if (!user_wr) + return ERR_PTR(-ENOMEM); + + sg_ind = 0; + last = NULL; + for (i = 0; i < wr_count; ++i) { + if (copy_from_user(user_wr, buf + i * wqe_size, + wqe_size)) { + ret = -EFAULT; + goto err; + } + + if (user_wr->num_sge + sg_ind > sge_count) { + ret = -EINVAL; + goto err; + } + + next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) + + user_wr->num_sge * sizeof (struct ib_sge), + GFP_KERNEL); + if (!next) { + ret = -ENOMEM; + goto err; + } + + if (!last) + wr = next; + else + last->next = next; + last = next; + + next->next = NULL; + next->wr_id = user_wr->wr_id; + next->num_sge = user_wr->num_sge; + + if (next->num_sge) { + next->sg_list = (void *) next + + ALIGN(sizeof *next, sizeof (struct ib_sge)); + if (copy_from_user(next->sg_list, + buf + wr_count * wqe_size + + sg_ind * sizeof (struct ib_sge), + next->num_sge * sizeof (struct ib_sge))) { + ret = -EFAULT; + goto err; + } + sg_ind += next->num_sge; + } else + next->sg_list = NULL; + } + + kfree(user_wr); + return wr; + +err: + kfree(user_wr); + + while (wr) { + next = wr->next; + kfree(wr); + wr = next; + } + + return ERR_PTR(ret); +} + +ssize_t ib_uverbs_post_recv(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_post_recv cmd; + struct ib_uverbs_post_recv_resp resp; + struct ib_recv_wr *wr, *next, *bad_wr; + struct ib_qp *qp; + ssize_t ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + wr = 
ib_uverbs_unmarshall_recv(buf + sizeof cmd, + in_len - sizeof cmd, cmd.wr_count, + cmd.sge_count, cmd.wqe_size); + if (IS_ERR(wr)) + return PTR_ERR(wr); + + down(&ib_uverbs_idr_mutex); + + qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle); + if (!qp || qp->uobject->context != file->ucontext) + goto out; + + resp.bad_wr = 0; + ret = qp->device->post_recv(qp, wr, &bad_wr); + if (ret) + for (next = wr; next; next = next->next) { + ++resp.bad_wr; + if (next == bad_wr) + break; + } + + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + ret = -EFAULT; + +out: + up(&ib_uverbs_idr_mutex); + + while (wr) { + next = wr->next; + kfree(wr); + wr = next; + } + + return ret ? ret : in_len; +} + +ssize_t ib_uverbs_post_srq_recv(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_post_srq_recv cmd; + struct ib_uverbs_post_srq_recv_resp resp; + struct ib_recv_wr *wr, *next, *bad_wr; + struct ib_srq *srq; + ssize_t ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd, + in_len - sizeof cmd, cmd.wr_count, + cmd.sge_count, cmd.wqe_size); + if (IS_ERR(wr)) + return PTR_ERR(wr); + + down(&ib_uverbs_idr_mutex); + + srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle); + if (!srq || srq->uobject->context != file->ucontext) + goto out; + + resp.bad_wr = 0; + ret = srq->device->post_srq_recv(srq, wr, &bad_wr); + if (ret) + for (next = wr; next; next = next->next) { + ++resp.bad_wr; + if (next == bad_wr) + break; + } + + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) + ret = -EFAULT; + +out: + up(&ib_uverbs_idr_mutex); + + while (wr) { + next = wr->next; + kfree(wr); + wr = next; + } + + return ret ? 
ret : in_len; +} + +ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file, + const char __user *buf, int in_len, + int out_len) +{ + struct ib_uverbs_create_ah cmd; + struct ib_uverbs_create_ah_resp resp; + struct ib_uobject *uobj; + struct ib_pd *pd; + struct ib_ah *ah; + struct ib_ah_attr attr; + int ret; + + if (out_len < sizeof resp) + return -ENOSPC; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + uobj = kmalloc(sizeof *uobj, GFP_KERNEL); + if (!uobj) + return -ENOMEM; + + down(&ib_uverbs_idr_mutex); + + pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle); + if (!pd || pd->uobject->context != file->ucontext) { + ret = -EINVAL; + goto err_up; + } + + uobj->user_handle = cmd.user_handle; + uobj->context = file->ucontext; + + attr.dlid = cmd.attr.dlid; + attr.sl = cmd.attr.sl; + attr.src_path_bits = cmd.attr.src_path_bits; + attr.static_rate = cmd.attr.static_rate; + attr.port_num = cmd.attr.port_num; + attr.grh.flow_label = cmd.attr.grh.flow_label; + attr.grh.sgid_index = cmd.attr.grh.sgid_index; + attr.grh.hop_limit = cmd.attr.grh.hop_limit; + attr.grh.traffic_class = cmd.attr.grh.traffic_class; + memcpy(attr.grh.dgid.raw, cmd.attr.grh.dgid, 16); + + ah = ib_create_ah(pd, &attr); + if (IS_ERR(ah)) { + ret = PTR_ERR(ah); + goto err_up; + } + + ah->uobject = uobj; + +retry: + if (!idr_pre_get(&ib_uverbs_ah_idr, GFP_KERNEL)) { + ret = -ENOMEM; + goto err_destroy; + } + + ret = idr_get_new(&ib_uverbs_ah_idr, ah, &uobj->id); + + if (ret == -EAGAIN) + goto retry; + if (ret) + goto err_destroy; + + resp.ah_handle = uobj->id; + + if (copy_to_user((void __user *) (unsigned long) cmd.response, + &resp, sizeof resp)) { + ret = -EFAULT; + goto err_idr; + } + + down(&file->mutex); + list_add_tail(&uobj->list, &file->ucontext->ah_list); + up(&file->mutex); + + up(&ib_uverbs_idr_mutex); + + return in_len; + +err_idr: + idr_remove(&ib_uverbs_ah_idr, uobj->id); + +err_destroy: + ib_destroy_ah(ah); + +err_up: + up(&ib_uverbs_idr_mutex); + + kfree(uobj); + 
return ret; +} + +ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file, + const char __user *buf, int in_len, int out_len) +{ + struct ib_uverbs_destroy_ah cmd; + struct ib_ah *ah; + struct ib_uobject *uobj; + int ret = -EINVAL; + + if (copy_from_user(&cmd, buf, sizeof cmd)) + return -EFAULT; + + down(&ib_uverbs_idr_mutex); + + ah = idr_find(&ib_uverbs_ah_idr, cmd.ah_handle); + if (!ah || ah->uobject->context != file->ucontext) + goto out; + + uobj = ah->uobject; + + ret = ib_destroy_ah(ah); + if (ret) + goto out; + + idr_remove(&ib_uverbs_ah_idr, cmd.ah_handle); + + down(&file->mutex); + list_del(&uobj->list); + up(&file->mutex); + + kfree(uobj); + +out: + up(&ib_uverbs_idr_mutex); + + return ret ? ret : in_len; +} + ssize_t ib_uverbs_attach_mcast(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) From rolandd at cisco.com Thu Oct 13 14:22:26 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 14:22:26 -0700 Subject: [openib-general] [RFC] libibverbs changes for PathScale merge Message-ID: <527jch159p.fsf@cisco.com> Here are the changes to libibverbs required to support PathScale's driver. Again, I'm happy with them and would just like to get comments on them before I commit them to svn. Thanks, Roland --- libibverbs/include/infiniband/driver.h (revision 3774) +++ libibverbs/include/infiniband/driver.h (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -92,6 +93,8 @@ extern int ibv_cmd_create_cq(struct ibv_ int comp_vector, struct ibv_cq *cq, struct ibv_create_cq *cmd, size_t cmd_size, struct ibv_create_cq_resp *resp, size_t resp_size); +extern int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); +extern int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited); extern int ibv_cmd_destroy_cq(struct ibv_cq *cq); extern int ibv_cmd_create_srq(struct ibv_pd *pd, @@ -111,6 +114,15 @@ extern int ibv_cmd_modify_qp(struct ibv_ enum ibv_qp_attr_mask attr_mask, struct ibv_modify_qp *cmd, size_t cmd_size); extern int ibv_cmd_destroy_qp(struct ibv_qp *qp); +extern int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, + struct ibv_send_wr **bad_wr); +extern int ibv_cmd_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); +extern int ibv_cmd_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr, + struct ibv_recv_wr **bad_wr); +extern int ibv_cmd_create_ah(struct ibv_pd *pd, struct ibv_ah *ah, + struct ibv_ah_attr *attr); +extern int ibv_cmd_destroy_ah(struct ibv_ah *ah); extern int ibv_cmd_attach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); extern int ibv_cmd_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid); --- libibverbs/include/infiniband/verbs.h (revision 3774) +++ libibverbs/include/infiniband/verbs.h (working copy) @@ -2,6 +2,7 @@ * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. * Copyright (c) 2004 Intel Corporation. All rights reserved. * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -488,6 +489,7 @@ struct ibv_qp { uint32_t handle; uint32_t qp_num; enum ibv_qp_state state; + enum ibv_qp_type qp_type; pthread_mutex_t mutex; pthread_cond_t cond; @@ -513,6 +515,7 @@ struct ibv_cq { struct ibv_ah { struct ibv_context *context; struct ibv_pd *pd; + uint32_t handle; }; struct ibv_device; --- libibverbs/include/infiniband/kern-abi.h (revision 3774) +++ libibverbs/include/infiniband/kern-abi.h (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -93,8 +94,11 @@ enum { * Make sure that all structs defined in this file remain laid out so * that they pack the same way on 32-bit and 64-bit architectures (to * avoid incompatibility between 32-bit userspace and 64-bit kernels). - * In particular do not use pointer types -- pass pointers in __u64 - * instead. + * Specifically: + * - Do not use pointer types -- pass pointers in __u64 instead. + * - Make sure that any structure larger than 4 bytes is padded to a + * multiple of 8 bytes. Otherwise the structure size will be + * different between 32-bit and 64-bit architectures. 
*/ struct ibv_kern_async_event { @@ -298,6 +302,47 @@ struct ibv_create_cq_resp { __u32 cqe; }; +struct ibv_kern_wc { + __u64 wr_id; + __u32 status; + __u32 opcode; + __u32 vendor_err; + __u32 byte_len; + __u32 imm_data; + __u32 qp_num; + __u32 src_qp; + __u32 wc_flags; + __u16 pkey_index; + __u16 slid; + __u8 sl; + __u8 dlid_path_bits; + __u8 port_num; + __u8 reserved; +}; + +struct ibv_poll_cq { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u32 cq_handle; + __u32 ne; +}; + +struct ibv_poll_cq_resp { + __u32 count; + __u32 reserved; + struct ibv_kern_wc wc[0]; +}; + +struct ibv_req_notify_cq { + __u32 command; + __u16 in_words; + __u16 out_words; + __u32 cq_handle; + __u32 solicited; +}; + struct ibv_destroy_cq { __u32 command; __u16 in_words; @@ -400,6 +445,130 @@ struct ibv_destroy_qp_resp { __u32 events_reported; }; +struct ibv_kern_send_wr { + __u64 wr_id; + __u32 num_sge; + __u32 opcode; + __u32 send_flags; + __u32 imm_data; + union { + struct { + __u64 remote_addr; + __u32 rkey; + __u32 reserved; + } rdma; + struct { + __u64 remote_addr; + __u64 compare_add; + __u64 swap; + __u32 rkey; + __u32 reserved; + } atomic; + struct { + __u32 ah; + __u32 remote_qpn; + __u32 remote_qkey; + __u32 reserved; + } ud; + } wr; +}; + +struct ibv_post_send { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u32 qp_handle; + __u32 wr_count; + __u32 sge_count; + __u32 wqe_size; + struct ibv_kern_send_wr send_wr[0]; +}; + +struct ibv_post_send_resp { + __u32 bad_wr; +}; + +struct ibv_kern_recv_wr { + __u64 wr_id; + __u32 num_sge; + __u32 reserved; +}; + +struct ibv_post_recv { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u32 qp_handle; + __u32 wr_count; + __u32 sge_count; + __u32 wqe_size; + struct ibv_kern_recv_wr recv_wr[0]; +}; + +struct ibv_post_recv_resp { + __u32 bad_wr; +}; + +struct ibv_post_srq_recv { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u32 
srq_handle; + __u32 wr_count; + __u32 sge_count; + __u32 wqe_size; + struct ibv_kern_recv_wr recv_wr[0]; +}; + +struct ibv_post_srq_recv_resp { + __u32 bad_wr; +}; + +struct ibv_kern_global_route { + __u8 dgid[16]; + __u32 flow_label; + __u8 sgid_index; + __u8 hop_limit; + __u8 traffic_class; + __u8 reserved; +}; + +struct ibv_kern_ah_attr { + struct ibv_kern_global_route grh; + __u16 dlid; + __u8 sl; + __u8 src_path_bits; + __u8 static_rate; + __u8 is_global; + __u8 port_num; + __u8 reserved; +}; + +struct ibv_create_ah { + __u32 command; + __u16 in_words; + __u16 out_words; + __u64 response; + __u64 user_handle; + __u32 pd_handle; + __u32 reserved; + struct ibv_kern_ah_attr attr; +}; + +struct ibv_create_ah_resp { + __u32 handle; +}; + +struct ibv_destroy_ah { + __u32 command; + __u16 in_words; + __u16 out_words; + __u32 ah_handle; +}; + struct ibv_attach_mcast { __u32 command; __u16 in_words; --- libibverbs/src/libibverbs.map (revision 3774) +++ libibverbs/src/libibverbs.map (working copy) @@ -41,6 +41,8 @@ IBVERBS_1.0 { ibv_cmd_reg_mr; ibv_cmd_dereg_mr; ibv_cmd_create_cq; + ibv_cmd_poll_cq; + ibv_cmd_req_notify_cq; ibv_cmd_destroy_cq; ibv_cmd_create_srq; ibv_cmd_modify_srq; @@ -48,6 +50,11 @@ IBVERBS_1.0 { ibv_cmd_create_qp; ibv_cmd_modify_qp; ibv_cmd_destroy_qp; + ibv_cmd_post_send; + ibv_cmd_post_recv; + ibv_cmd_post_srq_recv; + ibv_cmd_create_ah; + ibv_cmd_destroy_ah; ibv_cmd_attach_mcast; ibv_cmd_detach_mcast; local: *; --- libibverbs/src/cmd.c (revision 3774) +++ libibverbs/src/cmd.c (working copy) @@ -1,5 +1,6 @@ /* * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -304,6 +305,65 @@ int ibv_cmd_create_cq(struct ibv_context return 0; } +int ibv_cmd_poll_cq(struct ibv_cq *ibcq, int ne, struct ibv_wc *wc) +{ + struct ibv_poll_cq cmd; + struct ibv_poll_cq_resp *resp; + int i; + int rsize; + int ret; + + rsize = sizeof *resp + ne * sizeof(struct ibv_kern_wc); + resp = malloc(rsize); + if (!resp) + return -1; + + IBV_INIT_CMD_RESP(&cmd, sizeof cmd, POLL_CQ, resp, rsize); + cmd.cq_handle = ibcq->handle; + cmd.ne = ne; + + if (write(ibcq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd) { + ret = -1; + goto out; + } + + for (i = 0; i < resp->count; i++) { + wc[i].wr_id = resp->wc[i].wr_id; + wc[i].status = resp->wc[i].status; + wc[i].opcode = resp->wc[i].opcode; + wc[i].vendor_err = resp->wc[i].vendor_err; + wc[i].byte_len = resp->wc[i].byte_len; + wc[i].imm_data = resp->wc[i].imm_data; + wc[i].qp_num = resp->wc[i].qp_num; + wc[i].src_qp = resp->wc[i].src_qp; + wc[i].wc_flags = resp->wc[i].wc_flags; + wc[i].pkey_index = resp->wc[i].pkey_index; + wc[i].slid = resp->wc[i].slid; + wc[i].sl = resp->wc[i].sl; + wc[i].dlid_path_bits = resp->wc[i].dlid_path_bits; + } + + ret = resp->count; + +out: + free(resp); + return ret; +} + +int ibv_cmd_req_notify_cq(struct ibv_cq *ibcq, int solicited) +{ + struct ibv_req_notify_cq cmd; + + IBV_INIT_CMD(&cmd, sizeof cmd, REQ_NOTIFY_CQ); + cmd.cq_handle = ibcq->handle; + cmd.solicited = solicited ? 
0 : 1;
+
+	if (write(ibcq->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd)
+		return errno;
+
+	return 0;
+}
+
 static int ibv_cmd_destroy_cq_v1(struct ibv_cq *cq)
 {
 	struct ibv_destroy_cq_v1 cmd;
@@ -441,6 +501,7 @@ int ibv_cmd_create_qp(struct ibv_pd *pd,
 
 	qp->handle = resp.qp_handle;
 	qp->qp_num = resp.qpn;
+	qp->qp_type = attr->qp_type;
 
 	return 0;
 }
@@ -518,6 +579,251 @@ static int ibv_cmd_destroy_qp_v1(struct
 	return 0;
 }
 
+int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr,
+		      struct ibv_send_wr **bad_wr)
+{
+	struct ibv_post_send *cmd;
+	struct ibv_post_send_resp resp;
+	struct ibv_send_wr *i;
+	struct ibv_kern_send_wr *n, *tmp;
+	struct ibv_sge *s;
+	unsigned wr_count = 0;
+	unsigned sge_count = 0;
+	int size;
+	int ret = 0;
+
+	for (i = wr; i; i = i->next) {
+		wr_count++;
+		sge_count += i->num_sge;
+	}
+
+	size = sizeof *cmd + wr_count * sizeof *n + sge_count * sizeof *s;
+	cmd = alloca(size);
+
+	IBV_INIT_CMD_RESP(cmd, size, POST_SEND, &resp, sizeof resp);
+	cmd->qp_handle = ibqp->handle;
+	cmd->wr_count = wr_count;
+	cmd->sge_count = sge_count;
+	cmd->wqe_size = sizeof *n;
+
+	n = (struct ibv_kern_send_wr *) ((void *) cmd + sizeof *cmd);
+	s = (struct ibv_sge *) (n + wr_count);
+
+	tmp = n;
+	for (i = wr; i; i = i->next) {
+		tmp->wr_id = i->wr_id;
+		tmp->num_sge = i->num_sge;
+		tmp->opcode = i->opcode;
+		tmp->send_flags = i->send_flags;
+		tmp->imm_data = i->imm_data;
+		if (ibqp->qp_type == IBV_QPT_UD) {
+			tmp->wr.ud.ah = i->wr.ud.ah->handle;
+			tmp->wr.ud.remote_qpn = i->wr.ud.remote_qpn;
+			tmp->wr.ud.remote_qkey = i->wr.ud.remote_qkey;
+		} else {
+			switch(i->opcode) {
+			case IBV_WR_RDMA_WRITE:
+			case IBV_WR_RDMA_WRITE_WITH_IMM:
+			case IBV_WR_RDMA_READ:
+				tmp->wr.rdma.remote_addr =
+					i->wr.rdma.remote_addr;
+				tmp->wr.rdma.rkey = i->wr.rdma.rkey;
+				break;
+			case IBV_WR_ATOMIC_CMP_AND_SWP:
+			case IBV_WR_ATOMIC_FETCH_AND_ADD:
+				tmp->wr.atomic.remote_addr =
+					i->wr.atomic.remote_addr;
+				tmp->wr.atomic.compare_add =
+					i->wr.atomic.compare_add;
+
				tmp->wr.atomic.swap = i->wr.atomic.swap;
+				tmp->wr.atomic.rkey = i->wr.atomic.rkey;
+				break;
+			default:
+				break;
+			}
+		}
+
+		if (tmp->num_sge) {
+			memcpy(s, i->sg_list, tmp->num_sge * sizeof *s);
+			s += tmp->num_sge;
+		}
+
+		tmp++;
+	}
+
+	resp.bad_wr = 0;
+	if (write(ibqp->context->cmd_fd, cmd, size) != size)
+		ret = errno;
+
+	wr_count = resp.bad_wr;
+	if (wr_count) {
+		i = wr;
+		while (--wr_count)
+			i = i->next;
+		*bad_wr = i;
+	}
+
+	return ret;
+}
+
+int ibv_cmd_post_recv(struct ibv_qp *ibqp, struct ibv_recv_wr *wr,
+		      struct ibv_recv_wr **bad_wr)
+{
+	struct ibv_post_recv *cmd;
+	struct ibv_post_recv_resp resp;
+	struct ibv_recv_wr *i;
+	struct ibv_kern_recv_wr *n, *tmp;
+	struct ibv_sge *s;
+	unsigned wr_count = 0;
+	unsigned sge_count = 0;
+	int size;
+	int ret = 0;
+
+	for (i = wr; i; i = i->next) {
+		wr_count++;
+		sge_count += i->num_sge;
+	}
+
+	size = sizeof *cmd + wr_count * sizeof *n + sge_count * sizeof *s;
+	cmd = alloca(size);
+
+	IBV_INIT_CMD_RESP(cmd, size, POST_RECV, &resp, sizeof resp);
+	cmd->qp_handle = ibqp->handle;
+	cmd->wr_count = wr_count;
+	cmd->sge_count = sge_count;
+	cmd->wqe_size = sizeof *n;
+
+	n = (struct ibv_kern_recv_wr *) ((void *) cmd + sizeof *cmd);
+	s = (struct ibv_sge *) (n + wr_count);
+
+	tmp = n;
+	for (i = wr; i; i = i->next) {
+		tmp->wr_id = i->wr_id;
+		tmp->num_sge = i->num_sge;
+
+		if (tmp->num_sge) {
+			memcpy(s, i->sg_list, tmp->num_sge * sizeof *s);
+			s += tmp->num_sge;
+		}
+
+		tmp++;
+	}
+
+	resp.bad_wr = 0;
+	if (write(ibqp->context->cmd_fd, cmd, size) != size)
+		ret = errno;
+
+	wr_count = resp.bad_wr;
+	if (wr_count) {
+		i = wr;
+		while (--wr_count)
+			i = i->next;
+		*bad_wr = i;
+	}
+
+	return ret;
+}
+
+int ibv_cmd_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr,
+			  struct ibv_recv_wr **bad_wr)
+{
+	struct ibv_post_srq_recv *cmd;
+	struct ibv_post_srq_recv_resp resp;
+	struct ibv_recv_wr *i;
+	struct ibv_kern_recv_wr *n, *tmp;
+	struct ibv_sge *s;
+	unsigned wr_count = 0;
+	unsigned
sge_count = 0;
+	int size;
+	int ret = 0;
+
+	for (i = wr; i; i = i->next) {
+		wr_count++;
+		sge_count += i->num_sge;
+	}
+
+	size = sizeof *cmd + wr_count * sizeof *n + sge_count * sizeof *s;
+	cmd = alloca(size);
+
+	IBV_INIT_CMD_RESP(cmd, size, POST_SRQ_RECV, &resp, sizeof resp);
+	cmd->srq_handle = srq->handle;
+	cmd->wr_count = wr_count;
+	cmd->sge_count = sge_count;
+	cmd->wqe_size = sizeof *n;
+
+	n = (struct ibv_kern_recv_wr *) ((void *) cmd + sizeof *cmd);
+	s = (struct ibv_sge *) (n + wr_count);
+
+	tmp = n;
+	for (i = wr; i; i = i->next) {
+		tmp->wr_id = i->wr_id;
+		tmp->num_sge = i->num_sge;
+
+		if (tmp->num_sge) {
+			memcpy(s, i->sg_list, tmp->num_sge * sizeof *s);
+			s += tmp->num_sge;
+		}
+
+		tmp++;
+	}
+
+	resp.bad_wr = 0;
+	if (write(srq->context->cmd_fd, cmd, size) != size)
+		ret = errno;
+
+	wr_count = resp.bad_wr;
+	if (wr_count) {
+		i = wr;
+		while (--wr_count)
+			i = i->next;
+		*bad_wr = i;
+	}
+
+	return ret;
+}
+
+int ibv_cmd_create_ah(struct ibv_pd *pd, struct ibv_ah *ah,
+		      struct ibv_ah_attr *attr)
+{
+	struct ibv_create_ah cmd;
+	struct ibv_create_ah_resp resp;
+
+	IBV_INIT_CMD_RESP(&cmd, sizeof cmd, CREATE_AH, &resp, sizeof resp);
+	cmd.user_handle = (uintptr_t) ah;
+	cmd.pd_handle = pd->handle;
+	cmd.attr.dlid = attr->dlid;
+	cmd.attr.sl = attr->sl;
+	cmd.attr.src_path_bits = attr->src_path_bits;
+	cmd.attr.static_rate = attr->static_rate;
+	cmd.attr.is_global = attr->is_global;
+	cmd.attr.port_num = attr->port_num;
+	cmd.attr.grh.flow_label = attr->grh.flow_label;
+	cmd.attr.grh.sgid_index = attr->grh.sgid_index;
+	cmd.attr.grh.hop_limit = attr->grh.hop_limit;
+	cmd.attr.grh.traffic_class = attr->grh.traffic_class;
+	memcpy(cmd.attr.grh.dgid, attr->grh.dgid.raw, 16);
+
+	if (write(pd->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd)
+		return errno;
+
+	ah->handle = resp.handle;
+
+	return 0;
+}
+
+int ibv_cmd_destroy_ah(struct ibv_ah *ah)
+{
+	struct ibv_destroy_ah cmd;
+
+	IBV_INIT_CMD(&cmd, sizeof cmd, DESTROY_AH);
+
	cmd.ah_handle = ah->handle;
+
+	if (write(ah->context->cmd_fd, &cmd, sizeof cmd) != sizeof cmd)
+		return errno;
+
+	return 0;
+}
+
 int ibv_cmd_destroy_qp(struct ibv_qp *qp)
 {
 	struct ibv_destroy_qp cmd;

From mst at mellanox.co.il Thu Oct 13 14:35:06 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 13 Oct 2005 23:35:06 +0200
Subject: [openib-general] Re: [RFC] Kernel uverbs changes for PathScale merge
In-Reply-To: <52br1t15bb.fsf@cisco.com>
References: <52br1t15bb.fsf@cisco.com>
Message-ID: <20051013213506.GB13857@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: [RFC] Kernel uverbs changes for PathScale merge
>
> Here are the changes to the kernel part of userspace verbs required to
> support PathScale's driver. I'm now happy with them and ready to
> commit them to the svn trunk and queue them for 2.6.15. This will
> allow the PathScale hardware-specific driver to be move to the trunk
> as well, although quite a bit of cleanup is necessary before merging
> the driver upstream.
>
> Does anyone have any comments on these changes before I commit?

What prevents the user from passing e.g. poll cq command on mthca device?
If that happens, it seems that ib_poll_cq will then crash.

Is there a mask somewhere that lets the device specify which
uverbs commands are allowed for it?
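
The kind of mask asked about here can be pictured with a small userspace sketch. Everything in it is hypothetical and chosen only for illustration (`demo_device`, `demo_cmd_bit`, `demo_cmd_allowed`, and the `DEMO_CMD_*` numbers); none of these names come from the posted patches. The idea is that each device advertises one bit per command it implements, and the dispatcher refuses any command whose bit is clear before it ever reaches the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical command numbers, for illustration only. */
enum demo_cmd {
	DEMO_CMD_CREATE_CQ,
	DEMO_CMD_POLL_CQ,
	DEMO_CMD_DESTROY_CQ,
	DEMO_CMD_MAX
};

/* Hypothetical device: bit N of cmd_mask set means command N is allowed. */
struct demo_device {
	uint64_t cmd_mask;
};

/* Bit for one command; a 64-bit mask can describe at most 64 commands. */
static uint64_t demo_cmd_bit(unsigned int cmd)
{
	assert(cmd < 64);
	return 1ull << cmd;
}

/* Return 1 if the device accepts the command, 0 otherwise. */
static int demo_cmd_allowed(const struct demo_device *dev, unsigned int cmd)
{
	if (cmd >= DEMO_CMD_MAX)
		return 0;	/* unknown command number */
	return (dev->cmd_mask & demo_cmd_bit(cmd)) != 0;
}
```

A device that polls its CQs entirely in userspace would leave the `DEMO_CMD_POLL_CQ` bit clear, so `demo_cmd_allowed()` returns 0 for it and the caller can fail the request with an error instead of calling a method the driver never implemented.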

> --- infiniband/core/uverbs_cmd.c (revision 3707)
> +++ infiniband/core/uverbs_cmd.c (working copy)
> @@ -665,6 +665,93 @@ err:
>  	return ret;
>  }
>  
> +ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
> +			  const char __user *buf, int in_len,
> +			  int out_len)
> +{
> +	struct ib_uverbs_poll_cq cmd;
> +	struct ib_uverbs_poll_cq_resp *resp;
> +	struct ib_cq *cq;
> +	struct ib_wc *wc;
> +	int ret = 0;
> +	int i;
> +	int rsize;
> +
> +	if (copy_from_user(&cmd, buf, sizeof cmd))
> +		return -EFAULT;
> +
> +	wc = kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL);
> +	if (!wc)
> +		return -ENOMEM;
> +
> +	rsize = sizeof *resp + cmd.ne * sizeof(struct ib_uverbs_wc);
> +	resp = kmalloc(rsize, GFP_KERNEL);
> +	if (!resp) {
> +		ret = -ENOMEM;
> +		goto out_wc;
> +	}
> +
> +	down(&ib_uverbs_idr_mutex);
> +	cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle);
> +	if (!cq || cq->uobject->context != file->ucontext) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	resp->count = ib_poll_cq(cq, cmd.ne, wc);
> +
> +	for (i = 0; i < resp->count; i++) {
> +		resp->wc[i].wr_id = wc[i].wr_id;
> +		resp->wc[i].status = wc[i].status;
> +		resp->wc[i].opcode = wc[i].opcode;
> +		resp->wc[i].vendor_err = wc[i].vendor_err;
> +		resp->wc[i].byte_len = wc[i].byte_len;
> +		resp->wc[i].imm_data = wc[i].imm_data;
> +		resp->wc[i].qp_num = wc[i].qp_num;
> +		resp->wc[i].src_qp = wc[i].src_qp;
> +		resp->wc[i].wc_flags = wc[i].wc_flags;
> +		resp->wc[i].pkey_index = wc[i].pkey_index;
> +		resp->wc[i].slid = wc[i].slid;
> +		resp->wc[i].sl = wc[i].sl;
> +		resp->wc[i].dlid_path_bits = wc[i].dlid_path_bits;
> +		resp->wc[i].port_num = wc[i].port_num;
> +	}
> +
> +	if (copy_to_user((void __user *) (unsigned long) cmd.response, resp, rsize))
> +		ret = -EFAULT;
> +
> +out:
> +	up(&ib_uverbs_idr_mutex);
> +	kfree(resp);
> +
> +out_wc:
> +	kfree(wc);
> +	return ret ?
ret : in_len;
> +}

-- 
MST

From rjwalsh at pathscale.com Thu Oct 13 14:33:47 2005
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Thu, 13 Oct 2005 14:33:47 -0700
Subject: [openib-general] [RFC] libibverbs changes for PathScale merge
In-Reply-To: <527jch159p.fsf@cisco.com>
References: <527jch159p.fsf@cisco.com>
Message-ID: <1129239227.17665.6.camel@hematite.internal.keyresearch.com>

> @@ -488,6 +489,7 @@ struct ibv_qp {
>  	uint32_t handle;
>  	uint32_t qp_num;
>  	enum ibv_qp_state state;
> +	enum ibv_qp_type qp_type;
>  
>  	pthread_mutex_t mutex;
>  	pthread_cond_t cond;

Since qp_type is now in ibv_qp, it probably no longer needs to be in
mthca_qp. This is just a minor optimization.

Regards,
 Robert.

-- 
Robert Walsh Email: rjwalsh at pathscale.com
PathScale, Inc. Phone: +1 650 934 8117
2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969
Mountain View, CA 94043
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 481 bytes
Desc: This is a digitally signed message part
URL:

From rolandd at cisco.com Thu Oct 13 14:39:16 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 13 Oct 2005 14:39:16 -0700
Subject: [openib-general] [RFC] libibverbs changes for PathScale merge
In-Reply-To: <1129239227.17665.6.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Thu, 13 Oct 2005 14:33:47 -0700")
References: <527jch159p.fsf@cisco.com> <1129239227.17665.6.camel@hematite.internal.keyresearch.com>
Message-ID: <52y84xyu4b.fsf@cisco.com>

    Robert> Since qp_type is now in ibv_qp, it probably no longer
    Robert> needs to be in mthca_qp. This is just a minor
    Robert> optimization.

Yep, I'll make that change too.

 - R.

From rolandd at cisco.com Thu Oct 13 14:40:51 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 13 Oct 2005 14:40:51 -0700
Subject: [openib-general] Re: [RFC] Kernel uverbs changes for PathScale merge
In-Reply-To: <20051013213506.GB13857@mellanox.co.il> (Michael S.
Tsirkin's message of "Thu, 13 Oct 2005 23:35:06 +0200")
References: <52br1t15bb.fsf@cisco.com> <20051013213506.GB13857@mellanox.co.il>
Message-ID: <52u0flyu1o.fsf@cisco.com>

    Michael> What prevents the user from passing e.g. poll cq command
    Michael> on mthca device? If that happens, it seems that
    Michael> ib_poll_cq will then crash.

    Michael> Is there a mask somewhere that lets the device specify
    Michael> which uverbs commands are allowed for it?

Hmm, excellent point. A mask would be one way to avoid this -- let me
think about whether there's a better way to handle this.

Thanks,
  Roland

From rolandd at cisco.com Thu Oct 13 15:03:39 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 13 Oct 2005 15:03:39 -0700
Subject: [openib-general] Re: [RFC] Kernel uverbs changes for PathScale merge
In-Reply-To: <52u0flyu1o.fsf@cisco.com> (Roland Dreier's message of "Thu, 13 Oct 2005 14:40:51 -0700")
References: <52br1t15bb.fsf@cisco.com> <20051013213506.GB13857@mellanox.co.il> <52u0flyu1o.fsf@cisco.com>
Message-ID: <52hdblyszo.fsf@cisco.com>

OK, here's a new patch that adds a mask of allowed userspace commands
set by the kernel low-level driver.

Thanks, good catch Michael...

 - R.

--- include/rdma/ib_user_verbs.h (revision 3707)
+++ include/rdma/ib_user_verbs.h (working copy)
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2005 Topspin Communications. All rights reserved.
  * Copyright (c) 2005 Cisco Systems. All rights reserved.
+ * Copyright (c) 2005 PathScale, Inc. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses. You may choose to be licensed under the terms of the GNU
@@ -88,8 +89,11 @@ enum {
  * Make sure that all structs defined in this file remain laid out so
  * that they pack the same way on 32-bit and 64-bit architectures (to
  * avoid incompatibility between 32-bit userspace and 64-bit kernels).
- * In particular do not use pointer types -- pass pointers in __u64
- * instead.
+ * Specifically:
+ *  - Do not use pointer types -- pass pointers in __u64 instead.
+ *  - Make sure that any structure larger than 4 bytes is padded to a
+ *    multiple of 8 bytes. Otherwise the structure size will be
+ *    different between 32-bit and 64-bit architectures.
  */
 
 struct ib_uverbs_async_event_desc {
@@ -261,6 +265,42 @@ struct ib_uverbs_create_cq_resp {
 	__u32 cqe;
 };
 
+struct ib_uverbs_poll_cq {
+	__u64 response;
+	__u32 cq_handle;
+	__u32 ne;
+	__u64 wc;
+};
+
+struct ib_uverbs_wc {
+	__u64 wr_id;
+	__u32 status;
+	__u32 opcode;
+	__u32 vendor_err;
+	__u32 byte_len;
+	__u32 imm_data;
+	__u32 qp_num;
+	__u32 src_qp;
+	__u32 wc_flags;
+	__u16 pkey_index;
+	__u16 slid;
+	__u8 sl;
+	__u8 dlid_path_bits;
+	__u8 port_num;
+	__u8 reserved;
+};
+
+struct ib_uverbs_poll_cq_resp {
+	__u32 count;
+	__u32 reserved;
+	struct ib_uverbs_wc wc[0];
+};
+
+struct ib_uverbs_req_notify_cq {
+	__u32 cq_handle;
+	__u32 solicited_only;
+};
+
 struct ib_uverbs_destroy_cq {
 	__u64 response;
 	__u32 cq_handle;
@@ -358,6 +398,127 @@ struct ib_uverbs_destroy_qp_resp {
 	__u32 events_reported;
 };
 
+/*
+ * Note: the ib_uverbs_sge structure isn't used anywhere, as the ib_sge
+ * structure is packed the same way on 32-bit and 64-bit architectures
+ * in both kernel and user space. It's just here to document the ABI.
+ */
+
+struct ib_uverbs_sge {
+	__u64 addr;
+	__u32 length;
+	__u32 lkey;
+};
+
+struct ib_uverbs_send_wr {
+	__u64 wr_id;
+	__u32 num_sge;
+	__u32 opcode;
+	__u32 send_flags;
+	__u32 imm_data;
+	union {
+		struct {
+			__u64 remote_addr;
+			__u32 rkey;
+			__u32 reserved;
+		} rdma;
+		struct {
+			__u64 remote_addr;
+			__u64 compare_add;
+			__u64 swap;
+			__u32 rkey;
+			__u32 reserved;
+		} atomic;
+		struct {
+			__u32 ah;
+			__u32 remote_qpn;
+			__u32 remote_qkey;
+			__u32 reserved;
+		} ud;
+	} wr;
+};
+
+struct ib_uverbs_post_send {
+	__u64 response;
+	__u32 qp_handle;
+	__u32 wr_count;
+	__u32 sge_count;
+	__u32 wqe_size;
+	struct ib_uverbs_send_wr send_wr[0];
+};
+
+struct ib_uverbs_post_send_resp {
+	__u32 bad_wr;
+};
+
+struct ib_uverbs_recv_wr {
+	__u64 wr_id;
+	__u32 num_sge;
+	__u32 reserved;
+};
+
+struct ib_uverbs_post_recv {
+	__u64 response;
+	__u32 qp_handle;
+	__u32 wr_count;
+	__u32 sge_count;
+	__u32 wqe_size;
+	struct ib_uverbs_recv_wr recv_wr[0];
+};
+
+struct ib_uverbs_post_recv_resp {
+	__u32 bad_wr;
+};
+
+struct ib_uverbs_post_srq_recv {
+	__u64 response;
+	__u32 srq_handle;
+	__u32 wr_count;
+	__u32 sge_count;
+	__u32 wqe_size;
+	struct ib_uverbs_recv_wr recv[0];
+};
+
+struct ib_uverbs_post_srq_recv_resp {
+	__u32 bad_wr;
+};
+
+struct ib_uverbs_global_route {
+	__u8 dgid[16];
+	__u32 flow_label;
+	__u8 sgid_index;
+	__u8 hop_limit;
+	__u8 traffic_class;
+	__u8 reserved;
+};
+
+struct ib_uverbs_ah_attr {
+	struct ib_uverbs_global_route grh;
+	__u16 dlid;
+	__u8 sl;
+	__u8 src_path_bits;
+	__u8 static_rate;
+	__u8 is_global;
+	__u8 port_num;
+	__u8 reserved;
+};
+
+struct ib_uverbs_create_ah {
+	__u64 response;
+	__u64 user_handle;
+	__u32 pd_handle;
+	__u32 reserved;
+	struct ib_uverbs_ah_attr attr;
+};
+
+struct ib_uverbs_create_ah_resp {
+	__u32 ah_handle;
+};
+
+struct ib_uverbs_destroy_ah {
+	__u32 ah_handle;
+};
+
 struct ib_uverbs_attach_mcast {
 	__u8 gid[16];
 	__u32 qp_handle;
--- include/rdma/ib_verbs.h (revision 3707)
+++ include/rdma/ib_verbs.h
(working copy)
@@ -951,6 +951,7 @@ struct ib_device {
 		IB_DEV_UNREGISTERED
 	} reg_state;
 
+	u64 uverbs_cmd_mask;
 	int uverbs_abi_ver;
 
 	u8 node_type;
--- core/uverbs_main.c (revision 3740)
+++ core/uverbs_main.c (working copy)
@@ -3,6 +3,7 @@
  * Copyright (c) 2005 Cisco Systems. All rights reserved.
  * Copyright (c) 2005 Mellanox Technologies. All rights reserved.
  * Copyright (c) 2005 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2005 PathScale, Inc. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses. You may choose to be licensed under the terms of the GNU
@@ -86,10 +87,17 @@ static ssize_t (*uverbs_cmd_table[])(str
 	[IB_USER_VERBS_CMD_DEREG_MR] = ib_uverbs_dereg_mr,
 	[IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL] = ib_uverbs_create_comp_channel,
 	[IB_USER_VERBS_CMD_CREATE_CQ] = ib_uverbs_create_cq,
+	[IB_USER_VERBS_CMD_POLL_CQ] = ib_uverbs_poll_cq,
+	[IB_USER_VERBS_CMD_REQ_NOTIFY_CQ] = ib_uverbs_req_notify_cq,
 	[IB_USER_VERBS_CMD_DESTROY_CQ] = ib_uverbs_destroy_cq,
 	[IB_USER_VERBS_CMD_CREATE_QP] = ib_uverbs_create_qp,
 	[IB_USER_VERBS_CMD_MODIFY_QP] = ib_uverbs_modify_qp,
 	[IB_USER_VERBS_CMD_DESTROY_QP] = ib_uverbs_destroy_qp,
+	[IB_USER_VERBS_CMD_POST_SEND] = ib_uverbs_post_send,
+	[IB_USER_VERBS_CMD_POST_RECV] = ib_uverbs_post_recv,
+	[IB_USER_VERBS_CMD_POST_SRQ_RECV] = ib_uverbs_post_srq_recv,
+	[IB_USER_VERBS_CMD_CREATE_AH] = ib_uverbs_create_ah,
+	[IB_USER_VERBS_CMD_DESTROY_AH] = ib_uverbs_destroy_ah,
 	[IB_USER_VERBS_CMD_ATTACH_MCAST] = ib_uverbs_attach_mcast,
 	[IB_USER_VERBS_CMD_DETACH_MCAST] = ib_uverbs_detach_mcast,
 	[IB_USER_VERBS_CMD_CREATE_SRQ] = ib_uverbs_create_srq,
@@ -111,7 +119,13 @@ static int ib_dealloc_ucontext(struct ib
 
 	down(&ib_uverbs_idr_mutex);
 
-	/* XXX Free AHs */
+	list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) {
+		struct ib_ah *ah = idr_find(&ib_uverbs_ah_idr, uobj->id);
+		idr_remove(&ib_uverbs_ah_idr, uobj->id);
+		ib_destroy_ah(ah);
+		list_del(&uobj->list);
+		kfree(uobj);
+	}
 	list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) {
 		struct ib_qp *qp = idr_find(&ib_uverbs_qp_idr, uobj->id);
@@ -514,7 +528,8 @@ static ssize_t ib_uverbs_write(struct fi
 
 	if (hdr.command < 0 ||
 	    hdr.command >= ARRAY_SIZE(uverbs_cmd_table) ||
-	    !uverbs_cmd_table[hdr.command])
+	    !uverbs_cmd_table[hdr.command] ||
+	    !(file->device->ib_dev->uverbs_cmd_mask & (1ull << hdr.command)))
 		return -EINVAL;
 
 	if (!file->ucontext &&
--- core/uverbs.h (revision 3707)
+++ core/uverbs.h (working copy)
@@ -3,6 +3,7 @@
  * Copyright (c) 2005 Cisco Systems. All rights reserved.
  * Copyright (c) 2005 Mellanox Technologies. All rights reserved.
  * Copyright (c) 2005 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2005 PathScale, Inc. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses. You may choose to be licensed under the terms of the GNU
@@ -140,10 +141,17 @@ IB_UVERBS_DECLARE_CMD(reg_mr);
 IB_UVERBS_DECLARE_CMD(dereg_mr);
 IB_UVERBS_DECLARE_CMD(create_comp_channel);
 IB_UVERBS_DECLARE_CMD(create_cq);
+IB_UVERBS_DECLARE_CMD(poll_cq);
+IB_UVERBS_DECLARE_CMD(req_notify_cq);
 IB_UVERBS_DECLARE_CMD(destroy_cq);
 IB_UVERBS_DECLARE_CMD(create_qp);
 IB_UVERBS_DECLARE_CMD(modify_qp);
 IB_UVERBS_DECLARE_CMD(destroy_qp);
+IB_UVERBS_DECLARE_CMD(post_send);
+IB_UVERBS_DECLARE_CMD(post_recv);
+IB_UVERBS_DECLARE_CMD(post_srq_recv);
+IB_UVERBS_DECLARE_CMD(create_ah);
+IB_UVERBS_DECLARE_CMD(destroy_ah);
 IB_UVERBS_DECLARE_CMD(attach_mcast);
 IB_UVERBS_DECLARE_CMD(detach_mcast);
 IB_UVERBS_DECLARE_CMD(create_srq);
--- core/uverbs_cmd.c (revision 3707)
+++ core/uverbs_cmd.c (working copy)
@@ -665,6 +665,93 @@ err:
 	return ret;
 }
 
+ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
+			  const char __user *buf, int in_len,
+			  int out_len)
+{
+	struct ib_uverbs_poll_cq cmd;
+	struct ib_uverbs_poll_cq_resp *resp;
+	struct ib_cq *cq;
+	struct ib_wc *wc;
+	int ret = 0;
+	int i;
+	int rsize;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	wc =
kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL);
+	if (!wc)
+		return -ENOMEM;
+
+	rsize = sizeof *resp + cmd.ne * sizeof(struct ib_uverbs_wc);
+	resp = kmalloc(rsize, GFP_KERNEL);
+	if (!resp) {
+		ret = -ENOMEM;
+		goto out_wc;
+	}
+
+	down(&ib_uverbs_idr_mutex);
+	cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle);
+	if (!cq || cq->uobject->context != file->ucontext) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	resp->count = ib_poll_cq(cq, cmd.ne, wc);
+
+	for (i = 0; i < resp->count; i++) {
+		resp->wc[i].wr_id = wc[i].wr_id;
+		resp->wc[i].status = wc[i].status;
+		resp->wc[i].opcode = wc[i].opcode;
+		resp->wc[i].vendor_err = wc[i].vendor_err;
+		resp->wc[i].byte_len = wc[i].byte_len;
+		resp->wc[i].imm_data = wc[i].imm_data;
+		resp->wc[i].qp_num = wc[i].qp_num;
+		resp->wc[i].src_qp = wc[i].src_qp;
+		resp->wc[i].wc_flags = wc[i].wc_flags;
+		resp->wc[i].pkey_index = wc[i].pkey_index;
+		resp->wc[i].slid = wc[i].slid;
+		resp->wc[i].sl = wc[i].sl;
+		resp->wc[i].dlid_path_bits = wc[i].dlid_path_bits;
+		resp->wc[i].port_num = wc[i].port_num;
+	}
+
+	if (copy_to_user((void __user *) (unsigned long) cmd.response, resp, rsize))
+		ret = -EFAULT;
+
+out:
+	up(&ib_uverbs_idr_mutex);
+	kfree(resp);
+
+out_wc:
+	kfree(wc);
+	return ret ? ret : in_len;
+}
+
+ssize_t ib_uverbs_req_notify_cq(struct ib_uverbs_file *file,
+				const char __user *buf, int in_len,
+				int out_len)
+{
+	struct ib_uverbs_req_notify_cq cmd;
+	struct ib_cq *cq;
+	int ret = -EINVAL;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	down(&ib_uverbs_idr_mutex);
+	cq = idr_find(&ib_uverbs_cq_idr, cmd.cq_handle);
+	if (cq && cq->uobject->context == file->ucontext) {
+		ib_req_notify_cq(cq, cmd.solicited_only ?
+					IB_CQ_SOLICITED : IB_CQ_NEXT_COMP);
+		ret = in_len;
+	}
+	up(&ib_uverbs_idr_mutex);
+
+	return ret;
+}
+
 ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file,
 			     const char __user *buf, int in_len,
 			     int out_len)
@@ -1003,6 +1090,468 @@ out:
 	return ret ?
ret : in_len;
 }
 
+ssize_t ib_uverbs_post_send(struct ib_uverbs_file *file,
+			    const char __user *buf, int in_len,
+			    int out_len)
+{
+	struct ib_uverbs_post_send cmd;
+	struct ib_uverbs_post_send_resp resp;
+	struct ib_uverbs_send_wr *user_wr;
+	struct ib_send_wr *wr = NULL, *last, *next, *bad_wr;
+	struct ib_qp *qp;
+	int i, sg_ind;
+	ssize_t ret = -EINVAL;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	if (in_len < sizeof cmd + cmd.wqe_size * cmd.wr_count +
+	    cmd.sge_count * sizeof (struct ib_uverbs_sge))
+		return -EINVAL;
+
+	if (cmd.wqe_size < sizeof (struct ib_uverbs_send_wr))
+		return -EINVAL;
+
+	user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL);
+	if (!user_wr)
+		return -ENOMEM;
+
+	down(&ib_uverbs_idr_mutex);
+
+	qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle);
+	if (!qp || qp->uobject->context != file->ucontext)
+		goto out;
+
+	sg_ind = 0;
+	last = NULL;
+	for (i = 0; i < cmd.wr_count; ++i) {
+		if (copy_from_user(user_wr,
+				   buf + sizeof cmd + i * cmd.wqe_size,
+				   cmd.wqe_size)) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		if (user_wr->num_sge + sg_ind > cmd.sge_count) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
+			       user_wr->num_sge * sizeof (struct ib_sge),
+			       GFP_KERNEL);
+		if (!next) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		if (!last)
+			wr = next;
+		else
+			last->next = next;
+		last = next;
+
+		next->next = NULL;
+		next->wr_id = user_wr->wr_id;
+		next->num_sge = user_wr->num_sge;
+		next->opcode = user_wr->opcode;
+		next->send_flags = user_wr->send_flags;
+		next->imm_data = user_wr->imm_data;
+
+		if (qp->qp_type == IB_QPT_UD) {
+			next->wr.ud.ah = idr_find(&ib_uverbs_ah_idr,
+						  user_wr->wr.ud.ah);
+			if (!next->wr.ud.ah) {
+				ret = -EINVAL;
+				goto out;
+			}
+			next->wr.ud.remote_qpn = user_wr->wr.ud.remote_qpn;
+			next->wr.ud.remote_qkey = user_wr->wr.ud.remote_qkey;
+		} else {
+			switch (next->opcode) {
+			case IB_WR_RDMA_WRITE:
+			case IB_WR_RDMA_WRITE_WITH_IMM:
+			case IB_WR_RDMA_READ:
+
				next->wr.rdma.remote_addr =
+					user_wr->wr.rdma.remote_addr;
+				next->wr.rdma.rkey =
+					user_wr->wr.rdma.rkey;
+				break;
+			case IB_WR_ATOMIC_CMP_AND_SWP:
+			case IB_WR_ATOMIC_FETCH_AND_ADD:
+				next->wr.atomic.remote_addr =
+					user_wr->wr.atomic.remote_addr;
+				next->wr.atomic.compare_add =
+					user_wr->wr.atomic.compare_add;
+				next->wr.atomic.swap = user_wr->wr.atomic.swap;
+				next->wr.atomic.rkey = user_wr->wr.atomic.rkey;
+				break;
+			default:
+				break;
+			}
+		}
+
+		if (next->num_sge) {
+			next->sg_list = (void *) next +
+				ALIGN(sizeof *next, sizeof (struct ib_sge));
+			if (copy_from_user(next->sg_list,
+					   buf + sizeof cmd +
+					   cmd.wr_count * cmd.wqe_size +
+					   sg_ind * sizeof (struct ib_sge),
+					   next->num_sge * sizeof (struct ib_sge))) {
+				ret = -EFAULT;
+				goto out;
+			}
+			sg_ind += next->num_sge;
+		} else
+			next->sg_list = NULL;
+	}
+
+	resp.bad_wr = 0;
+	ret = qp->device->post_send(qp, wr, &bad_wr);
+	if (ret)
+		for (next = wr; next; next = next->next) {
+			++resp.bad_wr;
+			if (next == bad_wr)
+				break;
+		}
+
+	if (copy_to_user((void __user *) (unsigned long) cmd.response,
+			 &resp, sizeof resp))
+		ret = -EFAULT;
+
+out:
+	up(&ib_uverbs_idr_mutex);
+
+	while (wr) {
+		next = wr->next;
+		kfree(wr);
+		wr = next;
+	}
+
+	kfree(user_wr);
+
+	return ret ?
ret : in_len;
+}
+
+static struct ib_recv_wr *ib_uverbs_unmarshall_recv(const char __user *buf,
+						    int in_len,
+						    u32 wr_count,
+						    u32 sge_count,
+						    u32 wqe_size)
+{
+	struct ib_uverbs_recv_wr *user_wr;
+	struct ib_recv_wr *wr = NULL, *last, *next;
+	int sg_ind;
+	int i;
+	int ret;
+
+	if (in_len < wqe_size * wr_count +
+	    sge_count * sizeof (struct ib_uverbs_sge))
+		return ERR_PTR(-EINVAL);
+
+	if (wqe_size < sizeof (struct ib_uverbs_recv_wr))
+		return ERR_PTR(-EINVAL);
+
+	user_wr = kmalloc(wqe_size, GFP_KERNEL);
+	if (!user_wr)
+		return ERR_PTR(-ENOMEM);
+
+	sg_ind = 0;
+	last = NULL;
+	for (i = 0; i < wr_count; ++i) {
+		if (copy_from_user(user_wr, buf + i * wqe_size,
+				   wqe_size)) {
+			ret = -EFAULT;
+			goto err;
+		}
+
+		if (user_wr->num_sge + sg_ind > sge_count) {
+			ret = -EINVAL;
+			goto err;
+		}
+
+		next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
+			       user_wr->num_sge * sizeof (struct ib_sge),
+			       GFP_KERNEL);
+		if (!next) {
+			ret = -ENOMEM;
+			goto err;
+		}
+
+		if (!last)
+			wr = next;
+		else
+			last->next = next;
+		last = next;
+
+		next->next = NULL;
+		next->wr_id = user_wr->wr_id;
+		next->num_sge = user_wr->num_sge;
+
+		if (next->num_sge) {
+			next->sg_list = (void *) next +
+				ALIGN(sizeof *next, sizeof (struct ib_sge));
+			if (copy_from_user(next->sg_list,
+					   buf + wr_count * wqe_size +
+					   sg_ind * sizeof (struct ib_sge),
+					   next->num_sge * sizeof (struct ib_sge))) {
+				ret = -EFAULT;
+				goto err;
+			}
+			sg_ind += next->num_sge;
+		} else
+			next->sg_list = NULL;
+	}
+
+	kfree(user_wr);
+	return wr;
+
+err:
+	kfree(user_wr);
+
+	while (wr) {
+		next = wr->next;
+		kfree(wr);
+		wr = next;
+	}
+
+	return ERR_PTR(ret);
+}
+
+ssize_t ib_uverbs_post_recv(struct ib_uverbs_file *file,
+			    const char __user *buf, int in_len,
+			    int out_len)
+{
+	struct ib_uverbs_post_recv cmd;
+	struct ib_uverbs_post_recv_resp resp;
+	struct ib_recv_wr *wr, *next, *bad_wr;
+	struct ib_qp *qp;
+	ssize_t ret = -EINVAL;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	wr =
ib_uverbs_unmarshall_recv(buf + sizeof cmd,
+				       in_len - sizeof cmd, cmd.wr_count,
+				       cmd.sge_count, cmd.wqe_size);
+	if (IS_ERR(wr))
+		return PTR_ERR(wr);
+
+	down(&ib_uverbs_idr_mutex);
+
+	qp = idr_find(&ib_uverbs_qp_idr, cmd.qp_handle);
+	if (!qp || qp->uobject->context != file->ucontext)
+		goto out;
+
+	resp.bad_wr = 0;
+	ret = qp->device->post_recv(qp, wr, &bad_wr);
+	if (ret)
+		for (next = wr; next; next = next->next) {
+			++resp.bad_wr;
+			if (next == bad_wr)
+				break;
+		}
+
+
+	if (copy_to_user((void __user *) (unsigned long) cmd.response,
+			 &resp, sizeof resp))
+		ret = -EFAULT;
+
+out:
+	up(&ib_uverbs_idr_mutex);
+
+	while (wr) {
+		next = wr->next;
+		kfree(wr);
+		wr = next;
+	}
+
+	return ret ? ret : in_len;
+}
+
+ssize_t ib_uverbs_post_srq_recv(struct ib_uverbs_file *file,
+				const char __user *buf, int in_len,
+				int out_len)
+{
+	struct ib_uverbs_post_srq_recv cmd;
+	struct ib_uverbs_post_srq_recv_resp resp;
+	struct ib_recv_wr *wr, *next, *bad_wr;
+	struct ib_srq *srq;
+	ssize_t ret = -EINVAL;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	wr = ib_uverbs_unmarshall_recv(buf + sizeof cmd,
+				       in_len - sizeof cmd, cmd.wr_count,
+				       cmd.sge_count, cmd.wqe_size);
+	if (IS_ERR(wr))
+		return PTR_ERR(wr);
+
+	down(&ib_uverbs_idr_mutex);
+
+	srq = idr_find(&ib_uverbs_srq_idr, cmd.srq_handle);
+	if (!srq || srq->uobject->context != file->ucontext)
+		goto out;
+
+	resp.bad_wr = 0;
+	ret = srq->device->post_srq_recv(srq, wr, &bad_wr);
+	if (ret)
+		for (next = wr; next; next = next->next) {
+			++resp.bad_wr;
+			if (next == bad_wr)
+				break;
+		}
+
+
+	if (copy_to_user((void __user *) (unsigned long) cmd.response,
+			 &resp, sizeof resp))
+		ret = -EFAULT;
+
+out:
+	up(&ib_uverbs_idr_mutex);
+
+	while (wr) {
+		next = wr->next;
+		kfree(wr);
+		wr = next;
+	}
+
+	return ret ?
ret : in_len;
+}
+
+ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file,
+			    const char __user *buf, int in_len,
+			    int out_len)
+{
+	struct ib_uverbs_create_ah cmd;
+	struct ib_uverbs_create_ah_resp resp;
+	struct ib_uobject *uobj;
+	struct ib_pd *pd;
+	struct ib_ah *ah;
+	struct ib_ah_attr attr;
+	int ret;
+
+	if (out_len < sizeof resp)
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	uobj = kmalloc(sizeof *uobj, GFP_KERNEL);
+	if (!uobj)
+		return -ENOMEM;
+
+	down(&ib_uverbs_idr_mutex);
+
+	pd = idr_find(&ib_uverbs_pd_idr, cmd.pd_handle);
+	if (!pd || pd->uobject->context != file->ucontext) {
+		ret = -EINVAL;
+		goto err_up;
+	}
+
+	uobj->user_handle = cmd.user_handle;
+	uobj->context = file->ucontext;
+
+	attr.dlid = cmd.attr.dlid;
+	attr.sl = cmd.attr.sl;
+	attr.src_path_bits = cmd.attr.src_path_bits;
+	attr.static_rate = cmd.attr.static_rate;
+	attr.port_num = cmd.attr.port_num;
+	attr.grh.flow_label = cmd.attr.grh.flow_label;
+	attr.grh.sgid_index = cmd.attr.grh.sgid_index;
+	attr.grh.hop_limit = cmd.attr.grh.hop_limit;
+	attr.grh.traffic_class = cmd.attr.grh.traffic_class;
+	memcpy(attr.grh.dgid.raw, cmd.attr.grh.dgid, 16);
+
+	ah = ib_create_ah(pd, &attr);
+	if (IS_ERR(ah)) {
+		ret = PTR_ERR(ah);
+		goto err_up;
+	}
+
+	ah->uobject = uobj;
+
+retry:
+	if (!idr_pre_get(&ib_uverbs_ah_idr, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto err_destroy;
+	}
+
+	ret = idr_get_new(&ib_uverbs_ah_idr, ah, &uobj->id);
+
+	if (ret == -EAGAIN)
+		goto retry;
+	if (ret)
+		goto err_destroy;
+
+	resp.ah_handle = uobj->id;
+
+	if (copy_to_user((void __user *) (unsigned long) cmd.response,
+			 &resp, sizeof resp)) {
+		ret = -EFAULT;
+		goto err_idr;
+	}
+
+	down(&file->mutex);
+	list_add_tail(&uobj->list, &file->ucontext->ah_list);
+	up(&file->mutex);
+
+	up(&ib_uverbs_idr_mutex);
+
+	return in_len;
+
+err_idr:
+	idr_remove(&ib_uverbs_ah_idr, uobj->id);
+
+err_destroy:
+	ib_destroy_ah(ah);
+
+err_up:
+	up(&ib_uverbs_idr_mutex);
+
+	kfree(uobj);
+
	return ret;
+}
+
+ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
+			     const char __user *buf, int in_len, int out_len)
+{
+	struct ib_uverbs_destroy_ah cmd;
+	struct ib_ah *ah;
+	struct ib_uobject *uobj;
+	int ret = -EINVAL;
+
+	if (copy_from_user(&cmd, buf, sizeof cmd))
+		return -EFAULT;
+
+	down(&ib_uverbs_idr_mutex);
+
+	ah = idr_find(&ib_uverbs_ah_idr, cmd.ah_handle);
+	if (!ah || ah->uobject->context != file->ucontext)
+		goto out;
+
+	uobj = ah->uobject;
+
+	ret = ib_destroy_ah(ah);
+	if (ret)
+		goto out;
+
+	idr_remove(&ib_uverbs_ah_idr, cmd.ah_handle);
+
+	down(&file->mutex);
+	list_del(&uobj->list);
+	up(&file->mutex);
+
+	kfree(uobj);
+
+out:
+	up(&ib_uverbs_idr_mutex);
+
+	return ret ? ret : in_len;
+}
+
 ssize_t ib_uverbs_attach_mcast(struct ib_uverbs_file *file,
 			       const char __user *buf, int in_len,
 			       int out_len)
--- hw/mthca/mthca_provider.c (revision 3710)
+++ hw/mthca/mthca_provider.c (working copy)
@@ -37,6 +37,7 @@
  */
 
 #include
+#include
 #include
 
 #include "mthca_dev.h"
@@ -1077,6 +1078,25 @@ int mthca_register_device(struct mthca_d
 	dev->ib_dev.owner = THIS_MODULE;
 
 	dev->ib_dev.uverbs_abi_ver = MTHCA_UVERBS_ABI_VERSION;
+	dev->ib_dev.uverbs_cmd_mask =
+		(1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+		(1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
+		(1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
+		(1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
+		(1ull << IB_USER_VERBS_CMD_REG_MR) |
+		(1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_QP) |
+		(1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
+		(1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) |
+		(1ull << IB_USER_VERBS_CMD_DETACH_MCAST) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_SRQ) |
+		(1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ);
 	dev->ib_dev.node_type =
IB_NODE_CA; dev->ib_dev.phys_port_cnt = dev->limits.num_ports; dev->ib_dev.dma_device = &dev->pdev->dev; From rolandd at cisco.com Thu Oct 13 15:07:10 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 15:07:10 -0700 Subject: [openib-general] Re: [RFC] Kernel uverbs changes for PathScale merge In-Reply-To: <52hdblyszo.fsf@cisco.com> (Roland Dreier's message of "Thu, 13 Oct 2005 15:03:39 -0700") References: <52br1t15bb.fsf@cisco.com> <20051013213506.GB13857@mellanox.co.il> <52u0flyu1o.fsf@cisco.com> <52hdblyszo.fsf@cisco.com> Message-ID: <52d5m9ystt.fsf@cisco.com> And here's a patch to ipath to make it work with the uverbs command mask... Index: infiniband/hw/ipath/ib_ipath/ipath_openib.c =================================================================== --- infiniband/hw/ipath/ib_ipath/ipath_openib.c (revision 3758) +++ infiniband/hw/ipath/ib_ipath/ipath_openib.c (working copy) @@ -5733,6 +5733,32 @@ static int ipath_register_ib_device(cons strlcpy(dev->name, "infinipath_ib%d", IB_DEVICE_NAME_MAX); dev->uverbs_abi_ver = IPATH_UVERBS_ABI_VERSION; + dev->uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_CREATE_AH) | + (1ull << IB_USER_VERBS_CMD_DESTROY_AH) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV) | + (1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) | + (1ull << 
IB_USER_VERBS_CMD_DETACH_MCAST) | + (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | + (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | + (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); dev->node_type = IB_NODE_CA; dev->phys_port_cnt = 1; dev->dma_device = ipath_layer_get_pcidev(t); From hycsw at ca.sandia.gov Thu Oct 13 15:07:18 2005 From: hycsw at ca.sandia.gov (Helen Chen) Date: Thu, 13 Oct 2005 15:07:18 -0700 (PDT) Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer Message-ID: <200510132207.PAA11912@ca.sandia.gov> Roland, It doesn't seem like shrinking the TCP window had helped. I captured the Dmesg log from Lustre server and associated client reporting IOZONE error. BTW, this problem is a moving target so it is hard to believe that it is hardware related(?) BTW, I am using the mellanox DDR switch and HCA. Thanks, Helen ------- Dmesg from Lustre server ------ NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 1638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 2638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 3638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 4638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 5638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 6638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 7638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 8638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 9638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 10638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 11638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 12638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 13638 NETDEV WATCHDOG: ib0: transmit 
timed out ib0: transmit timeout: latency 14638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 15638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 16638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 17638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 18638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 19638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 20638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 21638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 22638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 23638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 24638 LustreError: 12471:0:(ost_handler.c:735:ost_brw_write()) @@@ timeout on bulk GET req at f5d8e000 x20249/t0 o4->@:-1 lens 328/288 ref 0 fl Interpret:/0/0 rc 0/0 LustreError: 12485:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm error evicting 5f9e7_lov2_e307d728c2 at NET_0xc0a80249_UUID id 192.168.2.73-12345 LustreError: 12468:0:(ost_handler.c:735:ost_brw_write()) @@@ timeout on bulk GET req at d51dfa00 x20359/t0 o4->@:-1 lens 328/288 ref 0 fl Interpret:/0/0 rc 0/0 LustreError: 12468:0:(ost_handler.c:735:ost_brw_write()) previously skipped 1 similar messages LustreError: 12477:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm error evicting 30326_lov2_7ce4b0bf00 at NET_0xc0a8024e_UUID id 192.168.2.78-12345 LustreError: 12477:0:(filter.c:1728:filter_grant_sanity_check()) filter_disconnect: tot_granted 48570368 != fo_tot_granted 49618944 LustreError: 12477:0:(filter.c:1731:filter_grant_sanity_check()) filter_disconnect: tot_pending 7340032 != fo_tot_pending 8388608 Lustre: A connection with 192.168.2.80 timed out; the network or that node may be down. 
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80250 ip 192.168.2.80:1022 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 25638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 26638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 27638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 28638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 29638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 30638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 31638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 32638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 33638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 34638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 35638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 36638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 37638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 38638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 39638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 40638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 41638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 42638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 43638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 44638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 45638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 46638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 47638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 48638 
NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 49638 LustreError: A timeout occurred receiving data from 192.168.2.73; the network or that node may be down. LustreError: 12189:0:(socknal_cb.c:2214:ksocknal_find_timed_out_conn()) Timed out RX from 0xc0a80249 f2630000 192.168.2.73 LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80249 ip 192.168.2.73:1021 LustreError: 12189:0:(socknal.c:1329:ksocknal_destroy_conn()) Completing partial receive from 0xc0a8024e, ip 192.168.2.78:1021, with error LustreError: 12189:0:(events.c:320:server_bulk_callback()) event type 5, status 19, desc eb0c8000 LustreError: 12189:0:(events.c:320:server_bulk_callback()) event type 5, status 19, desc f2603000 LustreError: 12468:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm error evicting 30326_lov2_7ce4b0bf00 at NET_0xc0a8024e_UUID id 192.168.2.78-12345 LustreError: 12468:0:(ost_handler.c:822:ost_brw_write()) previously skipped 6 similar messages NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 50638 Lustre: A connection with 192.168.2.79 timed out; the network or that node may be down. 
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024f ip 192.168.2.79:1021 LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) previously skipped 1 similar messages NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 51638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 52638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 53638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 54638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 55638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 56638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 57638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 58638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 59638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 60638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 61638 Lustre: A connection with 192.168.2.72 timed out; the network or that node may be down. 
Lustre: previously skipped 3 similar messages LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80248 ip 192.168.2.72:1021 LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) previously skipped 3 similar messages NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 62638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 63638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 64638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 65638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 66638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 67638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 68638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 69638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 70638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 71638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 72638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 73638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 74638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 75638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 76638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 77638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 78638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 79638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 80638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 81638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 82638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 83638 
NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 84638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 85638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 86638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 87638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 88638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 89638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 90638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 91638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 92638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 93638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 94638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 95638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 96638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 97638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 98638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 99638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 100638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 101638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 102638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 103638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 104638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 105638 NETDEV WATCHDOG: ib0: transmit timed out ib0: transmit timeout: latency 106638 LustreError: 12458:0:(ldlm_lib.c:506:target_handle_reconnect()) 709aa0a3-a6a1-4134-b2b4-805212eb9430 reconnecting Lustre: 12470:0:(filter.c:2645:filter_set_info()) on3-ost1: received MDS connection 
(0xbc2765ac563141df) Lustre: 12486:0:(filter.c:2082:filter_destroy_precreated()) on3-ost2: deleting orphan objects from 6 to 67 Lustre: 12583:0:(llog_cat.c:352:llog_cat_process_cb()) processing log 0x149423e:3575f5db at index 2 of catalog 0x149423a Lustre: 12583:0:(filter_log.c:235:filter_recov_log_mds_ost_cb()) fetch generation log, send cookie Lustre: 12583:0:(llog.c:287:llog_process()) recovery from log: 0x149423e:3575f5db stopped LustreError: 12456:0:(ldlm_lib.c:506:target_handle_reconnect()) 8ebea_lov2_7a4510c13a reconnecting LustreError: 12488:0:(ldlm_lib.c:506:target_handle_reconnect()) e24e8_lov1_13fb4ed690 reconnecting LustreError: 12488:0:(ldlm_lib.c:506:target_handle_reconnect()) previously skipped 1 similar messages LustreError: 12456:0:(ldlm_lib.c:506:target_handle_reconnect()) previously skipped 1 similar messages LustreError: 12461:0:(ldlm_lib.c:506:target_handle_reconnect()) 97cda_lov2_81558eef0b reconnecting LustreError: 12462:0:(ldlm_lib.c:506:target_handle_reconnect()) 03c5b_lov2_084e2d0661 reconnecting LustreError: 12462:0:(ldlm_lib.c:506:target_handle_reconnect()) previously skipped 1 similar messages LustreError: 12467:0:(ldlm_lib.c:506:target_handle_reconnect()) 8da95_lov1_79a1a2e0bd reconnecting LustreError: 12467:0:(ldlm_lib.c:506:target_handle_reconnect()) previously skipped 4 similar messages LustreError: 12454:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239844, 100s ago) req at ea8d0800 x5/t0 o401->@NET_0xc0a80253_UUID:15 lens 104/64 ref 1 fl Rpc:/0/0 rc 0/0 LustreError: 12454:0:(recov_thread.c:410:log_commit_thread()) commit f538e000:f7679e80 drop 1 cookies: rc -110 --------- Dmesg from Lustre client ----------------------- Lustre: A connection with 192.168.2.74 timed out; the network or that node may be down. 
LustreError: 11145:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024a ip 192.168.2.74:988 LustreError: 11143:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) Error -113 connecting 192.168.2.73/1022 -> 192.168.2.74/988 LustreError: Host 192.168.2.74 was unreachable; the network or that node may be down, or Lustre may be misconfigured. LustreError: 11143:0:(socknal_cb.c:2103:ksocknal_autoconnect()) Deleting packet type 1 len 64 (0xc0a80249 192.168.2.73->0xc0a8024a 192.168.2.73) LustreError: 11143:0:(events.c:61:request_out_callback()) @@@ type 8, status 19 req at f615f600 x20271/t0 o400->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 64/64 ref 2 fl Rpc:N/0/0 rc 0/0 LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239884, 3s ago) req at f615f600 x20271/t0 o400->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0 LustreError: Connection to service on3-ost2 via nid 192.168.2.74 was lost; in progress operations using this service will wait for recovery to complete. Lustre: 11269:0:(import.c:142:ptlrpc_set_import_discon()) OSC_on2_on3-ost2_MNT_on2-ib_2: connection lost to on3-ost2_UUID at NID_on3-ib_UUID LustreError: 11270:0:(lib-move.c:1510:lib_api_put()) Error sending PUT to 0xc0a8024a: 19 LustreError: 11141:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) Error -113 connecting 192.168.2.73/1022 -> 192.168.2.74/988 LustreError: Host 192.168.2.74 was unreachable; the network or that node may be down, or Lustre may be misconfigured. 
LustreError: 11141:0:(socknal_cb.c:2103:ksocknal_autoconnect()) Deleting packet type 1 len 240 (0xc0a80249 192.168.2.73->0xc0a8024a 192.168.2.73) LustreError: 11141:0:(socknal_cb.c:2103:ksocknal_autoconnect()) previously skipped 1 similar messages LustreError: 11141:0:(events.c:61:request_out_callback()) @@@ type 8, status 19 req at f66a9600 x20283/t0 o8->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 240/144 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 11141:0:(events.c:61:request_out_callback()) previously skipped 3 similar messages LustreError: 11270:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239912, 3s ago) req at f66a9600 x20283/t0 o8->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 240/144 ref 1 fl Rpc:/0/0 rc 0/0 LustreError: 11270:0:(client.c:945:ptlrpc_expire_one_request()) previously skipped 3 similar messages LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239819, 100s ago) req at f528ca00 x20242/t0 o4->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) previously skipped 1 similar messages LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239834, 100s ago) req at f66a9a00 x20256/t0 o400->on3-ost1_UUID at NID_on3-ib_UUID:6 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0 LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) previously skipped 8 similar messages Lustre: Connection restored to service on3-ost1 using nid 192.168.2.74. Lustre: 11270:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on2_on3-ost1_MNT_on2-ib: connection restored to on3-ost1_UUID at NID_on3-ib_UUID LustreError: This client was evicted by on3-ost2; in progress operations using this service will fail. 
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at f528c600 x20302/t0 o4->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at f528c200 x20303/t0 o4->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at d7c41c00 x20305/t0 o4->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at d7c41800 x20306/t0 o4->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at f607d600 x20307/t0 o4->on3-ost2_UUID at NID_on3-ib_UUID:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0 LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c1925dc0 failed: -5 LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c1779840 failed: -5 LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 275 similar messages LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c177d820 failed: -5 LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 485 similar messages LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c1792560 failed: -5 LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 815 similar messages LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c18dd440 failed: -5 LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 1399 similar messages LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c18e3600 failed: -5 LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 2637 similar 
messages Lustre: Connection restored to service on3-ost2 using nid 192.168.2.74. Lustre: 11530:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on2_on3-ost2_MNT_on2-ib_2: connection restored to on3-ost2_UUID at NID_on3-ib_UUID >From hycsw Thu Oct 13 14:21:18 2005 A From: hycsw (Helen Chen) Message-Id: <200510132121.OAA29376 at ca.sandia.gov> To: hycsw at ca, rolandd at cisco.com Subject: Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer Cc: hycsw at sandia.gov, openib-general at openib.org Status: R Roland, >From rolandd at cisco.com Thu Oct 13 13:53:05 2005 > > Helen> Roland, Thank you for your response. That fixed my initial > Helen> buffer allocation failure. After we tuned the Lustre and > Helen> reran same IOZONE tests again, we got the following > Helen> problem. Was there an actual network interrupt? If so, the > Helen> problem is not obvious now; the two nodes are pinging over > Helen> IPoIB. Please advice. > >That's very odd. This message: > > Helen> NETDEV WATCHDOG: ib0: transmit timed out > Helen> ib0: transmit timeout: latency 1846 > >says that we are not seeing send completions from the HCA. However, >are you saying that even when you are seeing this message, ping over >IPoIB is working? > No, I didn't know there was any problem until IOZONE reported a read error from the Lustre Client. BTW, the backend storage is iSCSI over 10 GbE using jumbo frames. This problem only appeared after our tuning effort: we increased the iSCSI payload to 1 MB, and increased the TCP window to 512 KB from 256 KB. I will shrink my TCP window and see if the problem goes away. 
Thanks, Helen From rolandd at cisco.com Thu Oct 13 15:12:54 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 15:12:54 -0700 Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer In-Reply-To: <200510132207.PAA11912@ca.sandia.gov> (Helen Chen's message of "Thu, 13 Oct 2005 15:07:18 -0700 (PDT)") References: <200510132207.PAA11912@ca.sandia.gov> Message-ID: <528xwxysk9.fsf@cisco.com> Helen> It doesn't seem like shrinking the TCP window had helped. Helen> I captured the Dmesg log from Lustre server and associated Helen> client reporting IOZONE error. What is the state of the system after you start seeing the ib0 transmit timeout messages? Does IPoIB work at all? Is the HCA responsive at all -- for example, what do you see if you do cat /sys/class/infiniband/mthca0/ports/1/state or cat /sys/class/infiniband/mthca0/ports/1/counters/* Helen> BTW, this problem is a moving target so it is hard to Helen> believe that it is hardware related(?) BTW, I am using the Helen> mellanox DDR switch and HCA. Not sure what you mean by a moving target... the symptoms really look like a crash of the HCA firmware to me. Thanks, Roland From troy at scl.ameslab.gov Thu Oct 13 15:46:47 2005 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Thu, 13 Oct 2005 17:46:47 -0500 Subject: [openib-general] Re: IBM eHCA testing.. In-Reply-To: References: <20051007141207.GX4612@kalmia.hozed.org> Message-ID: <20051013224647.GC7707@minbar.scl.ameslab.gov> On Wed, Oct 12, 2005 at 01:04:37PM +0200, IBMEHCA DD wrote: > I just released the ehca2_0028 which uses svn 3615 on > https://sourceforge.net/projects/ibmehcad/ > As you might notice the license already has changed to the openib.org > license. > > With 2.6.13 we had the non-issue that our main focus was on 2.6.5-7.191 > and we're only now moving to the latest kernel. I just built against svn 3774, and 2.6.13.3, with the timeout set to 120 seconds. 
There's some bad interaction going on with OpenSM. p5l2:~# modprobe hcad_mod ehca_nr_ports=1 [ 6186.855237] eBus Device Driver [ 6186.907578] eHCA Infiniband Device Driver (Rel.: EHCA2_0028) [ 6186.912203] xics_enable_irq: irq=36868: ibm_int_on returned fffffffd p5l2:~# modprobe ib_ipoib ****hang for awhile.. entries appear in osm.log *** [ 6309.683651] PU0003 00060103:ehca_parse_ec EHCA port 1 is available. [ 6310.253303] kernel BUG in dma_map_single at arch/ppc64/kernel/dma.c:86! [ 6310.253320] Oops: Exception in kernel mode, sig: 5 [#1] [ 6310.253339] SMP NR_CPUS=8 NUMA PSERIES LPAR [ 6310.253364] Modules linked in: ib_mad hcad_mod ib_core ebus [ 6310.253383] NIP: C00000000000FA10 XER: 00000020 LR: C00000000000F9B0 CTR: C00000000000F980 [ 6310.253400] REGS: c00000000f3bb770 TRAP: 0700 Not tainted (2.6.13.3-power5) [ 6310.253421] MSR: 8000000000029032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 24002444 [ 6310.253436] DAR: 0000000000000000 DSISR: 0000000000000000 [ 6310.253471] TASK: c00000000209f060[1874] 'modprobe' THREAD: c00000000f3b8000CPU: 7 [ 6310.253492] GPR00: C0000000004B3660 C00000000F3BB9F0 C0000000005EE948 C0000001DBEC5C18 [ 6310.253513] GPR04: C0000003CB5B1D0C 0000000000000128 0000000000000002 0000000000000008 [ 6310.253532] GPR08: C0000003CBD5EEE8 0000000000000000 C00000000F67FC00 C00000000000F980 [ 6310.253553] GPR12: D0000000000621D0 C0000000004B7800 0000000010017078 0000000000000000 [ 6310.253609] GPR16: 0000000000000000 0000000000000000 0000000000000001 0000000000000001 [ 6310.253665] GPR20: C000000008DE7800 0000000000000002 0000000000000001 C00000000F67FDC8 [ 6310.253688] GPR24: C00000000F67FD40 0000000000000002 C0000001DBEC5C18 0000000000000002 [ 6310.253708] GPR28: 0000000000000128 C0000003CB5B1D0C D00000000006EB00 C0000003CB5B1C80 [ 6310.253731] NIP [c00000000000fa10] .dma_map_single+0x90/0xc0 [ 6310.253753] LR [c00000000000f9b0] .dma_map_single+0x30/0xc0 [ 6310.253778] Call Trace: [ 6310.253797] [c00000000f3bb9f0] [c000000008de7800] 
0xc000000008de7800 (unreliable) [ 6310.253838] [c00000000f3bba90] [d00000000005aee8] .ib_mad_post_receive_mads+0xb8/0x270 [ib_mad] [ 6310.253880] [c00000000f3bbb80] [d00000000005c840] .ib_mad_init_device+0x350/0x660 [ib_mad] [ 6310.253905] [c00000000f3bbc70] [d00000000004d0bc] .ib_register_client+0xdc/0x150 [ib_core] [ 6310.253936] [c00000000f3bbd00] [d000000000061e6c] .ib_mad_init_module+0x8c/0xf0 [ib_mad] [ 6310.253999] [c00000000f3bbd90] [c000000000070720] .sys_init_module+0x1e0/0x4d0 [ 6310.254030] [c00000000f3bbe30] [c00000000000d300] syscall_exit+0x0/0x18 [ 6310.254045] Instruction dump: [ 6310.254053] 4e800421 e8410028 382100a0 e8010010 eb41ffd0 eb61ffd8 eb81ffe0 eba1ffe8 [ 6310.254089] 7c0803a6 4e800020 60000000 60000000 <0fe00000> 382100a0 38600000e8010010 [ 6310.254206] Segmentation fault I'm also attaching part of an opensm log file. (the full copy is at http://scl.ameslab.gov/~troy/osm-ehca.log ) The IBM galaxy adapters are at: Initial path: [0][1][16] Initial path: [0][1][13] -------------- next part -------------- 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Oct 13 10:42:05 978875 [42FFF970] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=16) -- dropping. Oct 13 10:42:05 978883 [42FFF970] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 2 DR SLID 0x0 DR DLID 0x0 Oct 13 10:42:05 978892 [42FFF970] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT). 
Oct 13 10:42:05 978925 [42FFF970] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x2 trans_id................0x1810 attr_id.................0x16 (P_KeyTable) resv....................0x0 attr_mod................0x3E0000 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][1][16] Return path: [0][0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Oct 13 10:42:06 378879 [42FFF970] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=16) -- dropping. Oct 13 10:42:06 378891 [42FFF970] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 2 DR SLID 0x0 DR DLID 0x0 Oct 13 10:42:06 378900 [42FFF970] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT). 
Oct 13 10:42:06 378934 [42FFF970] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x2 trans_id................0x1811 attr_id.................0x16 (P_KeyTable) resv....................0x0 attr_mod................0x3F0000 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][1][16] Return path: [0][0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Oct 13 10:42:06 806879 [42FFF970] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=16) -- dropping. Oct 13 10:42:06 806887 [42FFF970] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 3 DR SLID 0x0 DR DLID 0x0 Oct 13 10:42:06 806896 [42FFF970] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT). 
Oct 13 10:42:06 806930 [42FFF970] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x3 trans_id................0x1835 attr_id.................0x16 (P_KeyTable) resv....................0x0 attr_mod................0x10000 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][1][16][2] Return path: [0][0][0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 From rjwalsh at pathscale.com Thu Oct 13 16:00:12 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Thu, 13 Oct 2005 16:00:12 -0700 Subject: [openib-general] Re: [RFC] Kernel uverbs changes for PathScale merge In-Reply-To: <52d5m9ystt.fsf@cisco.com> References: <52br1t15bb.fsf@cisco.com> <20051013213506.GB13857@mellanox.co.il> <52u0flyu1o.fsf@cisco.com> <52hdblyszo.fsf@cisco.com> <52d5m9ystt.fsf@cisco.com> Message-ID: <1129244412.17665.49.camel@hematite.internal.keyresearch.com> > And here's a patch to ipath to make it work with the uverbs command mask... Roland, Since the rest of the patch needed to get this working isn't applied to either the trunk or the ipath branch yet (and since the branch will be going away shortly), can you just apply this patch to the trunk when you do the merge? Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From xma at us.ibm.com Thu Oct 13 16:01:52 2005 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 13 Oct 2005 16:01:52 -0700 Subject: [openib-general] Re: IBM eHCA testing.. In-Reply-To: <20051013224647.GC7707@minbar.scl.ameslab.gov> Message-ID: I am not sure whether this is related to dma_addr_t. Could you please try the patch below? > http://ozlabs.org/pipermail/linuxppc64-dev/2005-July/004662.html1 Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Thu Oct 13 16:02:26 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 16:02:26 -0700 Subject: [openib-general] Re: [RFC] Kernel uverbs changes for PathScale merge In-Reply-To: <1129244412.17665.49.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Thu, 13 Oct 2005 16:00:12 -0700") References: <52br1t15bb.fsf@cisco.com> <20051013213506.GB13857@mellanox.co.il> <52u0flyu1o.fsf@cisco.com> <52hdblyszo.fsf@cisco.com> <52d5m9ystt.fsf@cisco.com> <1129244412.17665.49.camel@hematite.internal.keyresearch.com> Message-ID: <524q7lyq9p.fsf@cisco.com> Robert> Since the rest of the patch needed to get this working Robert> isn't applied to either the trunk or the ipath branch yet Robert> (and since the branch will be going away shortly), can you Robert> just apply this patch to the trunk when you do the merge? Sure, no problem. - R. From rolandd at cisco.com Thu Oct 13 16:04:03 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 16:04:03 -0700 Subject: [openib-general] Re: IBM eHCA testing..
In-Reply-To: (Shirley Ma's message of "Thu, 13 Oct 2005 16:01:52 -0700") References: Message-ID: <52vf01xbmk.fsf@cisco.com> > http://ozlabs.org/pipermail/linuxppc64-dev/2005-July/004662.html1 delete the '1' from the end of the URL... - R. From rolandd at cisco.com Thu Oct 13 16:14:46 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 16:14:46 -0700 Subject: [PATCH, please test] IPoIB: recycle RX bufs (was: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer) In-Reply-To: <52irw12ovr.fsf@cisco.com> (Roland Dreier's message of "Thu, 13 Oct 2005 12:33:28 -0700") References: <521x2q7byt.fsf@cisco.com> <20051013191407.GC13356@mellanox.co.il> <52irw12ovr.fsf@cisco.com> Message-ID: <52r7apxb4p.fsf_-_@cisco.com> Roland> My plan is to change the receive handling of IPoIB Roland> slightly, so that if it can't allocate a new receive Roland> buffer, it reposts the old buffer and drops the packet it Roland> just received. Here's a patch that changes IPoIB to use this scheme. This should be much more robust when the system gets low on GFP_ATOMIC memory. I'd appreciate it if people could stress test and benchmark this. It works well for me, but I'm wondering if this patch has any effect on performance (either better or worse). Helen, it would be especially interesting if you could run your test with this patch and without increasing min_free_kbytes, since you are able to reproduce GFP_ATOMIC failures. I'd be curious to know what you see in /sys/class/net/ib0/statistics/rx_dropped after running the test. 
Thanks, Roland --- infiniband/ulp/ipoib/ipoib_main.c (revision 3707) +++ infiniband/ulp/ipoib/ipoib_main.c (working copy) @@ -729,7 +729,7 @@ int ipoib_dev_init(struct net_device *de /* Allocate RX/TX "rings" to hold queued skbs */ - priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf), + priv->rx_ring = kmalloc(IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf), GFP_KERNEL); if (!priv->rx_ring) { printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n", @@ -737,9 +737,9 @@ int ipoib_dev_init(struct net_device *de goto out; } memset(priv->rx_ring, 0, - IPOIB_RX_RING_SIZE * sizeof (struct ipoib_buf)); + IPOIB_RX_RING_SIZE * sizeof (struct ipoib_rx_buf)); - priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf), + priv->tx_ring = kmalloc(IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf), GFP_KERNEL); if (!priv->tx_ring) { printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n", @@ -747,7 +747,7 @@ int ipoib_dev_init(struct net_device *de goto out_rx_ring_cleanup; } memset(priv->tx_ring, 0, - IPOIB_TX_RING_SIZE * sizeof (struct ipoib_buf)); + IPOIB_TX_RING_SIZE * sizeof (struct ipoib_tx_buf)); /* priv->tx_head & tx_tail are already 0 */ --- infiniband/ulp/ipoib/ipoib.h (revision 3726) +++ infiniband/ulp/ipoib/ipoib.h (working copy) @@ -100,7 +100,12 @@ struct ipoib_pseudoheader { struct ipoib_mcast; -struct ipoib_buf { +struct ipoib_rx_buf { + struct sk_buff *skb; + dma_addr_t mapping; +}; + +struct ipoib_tx_buf { struct sk_buff *skb; DECLARE_PCI_UNMAP_ADDR(mapping) }; @@ -150,14 +155,14 @@ struct ipoib_dev_priv { unsigned int admin_mtu; unsigned int mcast_mtu; - struct ipoib_buf *rx_ring; + struct ipoib_rx_buf *rx_ring; - spinlock_t tx_lock; - struct ipoib_buf *tx_ring; - unsigned tx_head; - unsigned tx_tail; - struct ib_sge tx_sge; - struct ib_send_wr tx_wr; + spinlock_t tx_lock; + struct ipoib_tx_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + struct ib_sge tx_sge; + struct ib_send_wr tx_wr; struct 
ib_wc ibwc[IPOIB_NUM_WC]; --- infiniband/ulp/ipoib/ipoib_ib.c (revision 3726) +++ infiniband/ulp/ipoib/ipoib_ib.c (working copy) @@ -95,57 +95,65 @@ void ipoib_free_ah(struct kref *kref) } } -static inline int ipoib_ib_receive(struct ipoib_dev_priv *priv, - unsigned int wr_id, - dma_addr_t addr) -{ - struct ib_sge list = { - .addr = addr, - .length = IPOIB_BUF_SIZE, - .lkey = priv->mr->lkey, - }; - struct ib_recv_wr param = { - .wr_id = wr_id | IPOIB_OP_RECV, - .sg_list = &list, - .num_sge = 1, - }; +static int ipoib_ib_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sge list; + struct ib_recv_wr param; struct ib_recv_wr *bad_wr; + int ret; + + list.addr = priv->rx_ring[id].mapping; + list.length = IPOIB_BUF_SIZE; + list.lkey = priv->mr->lkey; + + param.next = NULL; + param.wr_id = id | IPOIB_OP_RECV; + param.sg_list = &list; + param.num_sge = 1; + + ret = ib_post_recv(priv->qp, &param, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret); + dma_unmap_single(priv->ca->dma_device, + priv->rx_ring[id].mapping, + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); + dev_kfree_skb_any(priv->rx_ring[id].skb); + priv->rx_ring[id].skb = NULL; + } - return ib_post_recv(priv->qp, &param, &bad_wr); + return ret; } -static int ipoib_ib_post_receive(struct net_device *dev, int id) +static int ipoib_alloc_rx_skb(struct net_device *dev, int id) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct sk_buff *skb; dma_addr_t addr; - int ret; skb = dev_alloc_skb(IPOIB_BUF_SIZE + 4); - if (!skb) { - ipoib_warn(priv, "failed to allocate receive buffer\n"); - - priv->rx_ring[id].skb = NULL; + if (!skb) return -ENOMEM; - } - skb_reserve(skb, 4); /* 16 byte align IP header */ - priv->rx_ring[id].skb = skb; + + /* + * IB will leave a 40 byte gap for a GRH and IPoIB adds a 4 byte + * header. So we need 4 more bytes to get to 48 and align the + * IP header to a multiple of 16.
+ */ + skb_reserve(skb, 4); + addr = dma_map_single(priv->ca->dma_device, skb->data, IPOIB_BUF_SIZE, DMA_FROM_DEVICE); - pci_unmap_addr_set(&priv->rx_ring[id], mapping, addr); - - ret = ipoib_ib_receive(priv, id, addr); - if (ret) { - ipoib_warn(priv, "ipoib_ib_receive failed for buf %d (%d)\n", - id, ret); - dma_unmap_single(priv->ca->dma_device, addr, - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); + if (unlikely(dma_mapping_error(addr))) { dev_kfree_skb_any(skb); - priv->rx_ring[id].skb = NULL; + return -EIO; } - return ret; + priv->rx_ring[id].skb = skb; + priv->rx_ring[id].mapping = addr; + + return 0; } static int ipoib_ib_post_receives(struct net_device *dev) @@ -154,6 +162,10 @@ static int ipoib_ib_post_receives(struct int i; for (i = 0; i < IPOIB_RX_RING_SIZE; ++i) { + if (ipoib_alloc_rx_skb(dev, i)) { + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + return -ENOMEM; + } if (ipoib_ib_post_receive(dev, i)) { ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); return -EIO; @@ -176,28 +188,36 @@ static void ipoib_ib_handle_wc(struct ne wr_id &= ~IPOIB_OP_RECV; if (wr_id < IPOIB_RX_RING_SIZE) { - struct sk_buff *skb = priv->rx_ring[wr_id].skb; - - priv->rx_ring[wr_id].skb = NULL; + struct sk_buff *skb = priv->rx_ring[wr_id].skb; + dma_addr_t addr = priv->rx_ring[wr_id].mapping; - dma_unmap_single(priv->ca->dma_device, - pci_unmap_addr(&priv->rx_ring[wr_id], - mapping), - IPOIB_BUF_SIZE, - DMA_FROM_DEVICE); - - if (wc->status != IB_WC_SUCCESS) { + if (unlikely(wc->status != IB_WC_SUCCESS)) { if (wc->status != IB_WC_WR_FLUSH_ERR) ipoib_warn(priv, "failed recv event " "(status=%d, wrid=%d vend_err %x)\n", wc->status, wr_id, wc->vendor_err); + dma_unmap_single(priv->ca->dma_device, addr, + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); dev_kfree_skb_any(skb); + priv->rx_ring[wr_id].skb = NULL; return; } + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. 
+ */ + if (unlikely(ipoib_alloc_rx_skb(dev, wr_id))) { + ++priv->stats.rx_dropped; + goto repost; + } + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); + dma_unmap_single(priv->ca->dma_device, addr, + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); + skb_put(skb, wc->byte_len); skb_pull(skb, IB_GRH_BYTES); @@ -220,8 +240,8 @@ static void ipoib_ib_handle_wc(struct ne dev_kfree_skb_any(skb); } - /* repost receive */ - if (ipoib_ib_post_receive(dev, wr_id)) + repost: + if (unlikely(ipoib_ib_post_receive(dev, wr_id))) ipoib_warn(priv, "ipoib_ib_post_receive failed " "for buf %d\n", wr_id); } else @@ -229,7 +249,7 @@ static void ipoib_ib_handle_wc(struct ne wr_id); } else { - struct ipoib_buf *tx_req; + struct ipoib_tx_buf *tx_req; unsigned long flags; if (wr_id >= IPOIB_TX_RING_SIZE) { @@ -302,7 +322,7 @@ void ipoib_send(struct net_device *dev, struct ipoib_ah *address, u32 qpn) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_buf *tx_req; + struct ipoib_tx_buf *tx_req; dma_addr_t addr; if (skb->len > dev->mtu + INFINIBAND_ALEN) { @@ -468,7 +488,7 @@ int ipoib_ib_dev_stop(struct net_device struct ib_qp_attr qp_attr; int attr_mask; unsigned long begin; - struct ipoib_buf *tx_req; + struct ipoib_tx_buf *tx_req; int i; /* Kill the existing QP and allocate a new one */ From hycsw at ca.sandia.gov Thu Oct 13 16:15:38 2005 From: hycsw at ca.sandia.gov (Helen Chen) Date: Thu, 13 Oct 2005 16:15:38 -0700 (PDT) Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer Message-ID: <200510132315.QAA25127@ca.sandia.gov> Roland, So you are right, it is not a moving target. After repeating the IOZONE tests several times, I narrowed the culprit down to server on3-ib. Parallel I/O had made it a bit difficult to chase down :-( BTW, the state of the IPoIB network seemed fine after the failed test, and the mthca counters are moving up nicely.
Do you still think this is a crash of the HCA firmware? Should I call Mellanox? Thanks, Helen ---------- Original Message ----------------- >From rolandd at cisco.com Thu Oct 13 15:13:16 2005 > > Helen> It doesn't seem like shrinking the TCP window had helped. > Helen> I captured the Dmesg log from Lustre server and associated > Helen> client reporting IOZONE error. > >What is the state of the system after you start seeing the ib0 >transmit time out messages? Does IPoIB work at all? Is the HCA >responsive at all -- for example what do you see if you do > > cat /sys/class/infiniband/mthca0/ports/1/state > >or > > cat /sys/class/infiniband/mthca0/ports/1/counters/* > > Helen> BTW, this problem is a moving target so it is hard to > Helen> believe that it is hardware related(?) BTW, I am using the > Helen> mellanox DDR switch and HCA. > >Not sure what you mean by a moving target... the symptoms really look >like a crash of the HCA firmware to me. > >Thanks, > Roland > From mshefty at ichips.intel.com Thu Oct 13 16:17:45 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 13 Oct 2005 16:17:45 -0700 Subject: [openib-general] DMA mapping abuses in MAD layer In-Reply-To: <434D3981.9030507@ichips.intel.com> References: <52y84z73oo.fsf@cisco.com> <434D3981.9030507@ichips.intel.com> Message-ID: <434EEB19.3010202@ichips.intel.com> Sean Hefty wrote: > Does anyone else have any other ideas on how to fix this issue? The current MAD interface requires the user to have code similar to this: send_buf->sge.addr = dma_map_single(mad_agent->device->dma_device, buf, buf_size, DMA_TO_DEVICE); pci_unmap_addr_set(send_buf, mapping, send_buf->sge.addr); This is consistent with how an ib_send_wr would be formatted for other QPs. Another possibility, however, is to let the user do: send_buf->sge.addr = (unsigned long) buf; And then have the MAD layer perform the mapping/unmapping immediately before and after posting to the QP. 
This keeps the syntax of the current interface, but still requires user changes. Any preference to pursuing this change or modifying ib_post_send_mad to take an ib_mad_send_buf? - Sean From xma at us.ibm.com Thu Oct 13 16:18:16 2005 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 13 Oct 2005 16:18:16 -0700 Subject: [openib-general] Re: IBM eHCA testing.. In-Reply-To: <52vf01xbmk.fsf@cisco.com> Message-ID: Thanks. It's strange the copy-paste gave an extra 1. Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Thu Oct 13 16:19:22 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 16:19:22 -0700 Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer In-Reply-To: <200510132315.QAA25127@ca.sandia.gov> (Helen Chen's message of "Thu, 13 Oct 2005 16:15:38 -0700 (PDT)") References: <200510132315.QAA25127@ca.sandia.gov> Message-ID: <52mzldxax1.fsf@cisco.com> Helen> BTW, the state of the IPoIB network seemed fine after the Helen> failed test, and the mthca counters are moving up nicely. Even on the server on3-ib? Helen> Do you still think this is a crash of the HCA firmware? Helen> Should I call Mellanox? Not if IPoIB is working on the systems printing the TX time out messages. However, if everything stops working on one of your systems, then yes, an HCA crash is likely. I'm still unclear on what is happening. Do you see TX time out messages on a particular server, but IPoIB and mthca counters still work fine on that same server? Or is it just the rest of the fabric that continues working?
Thanks, Roland From rolandd at cisco.com Thu Oct 13 16:21:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 16:21:39 -0700 Subject: [openib-general] DMA mapping abuses in MAD layer In-Reply-To: <434EEB19.3010202@ichips.intel.com> (Sean Hefty's message of "Thu, 13 Oct 2005 16:17:45 -0700") References: <52y84z73oo.fsf@cisco.com> <434D3981.9030507@ichips.intel.com> <434EEB19.3010202@ichips.intel.com> Message-ID: <52irw1xat8.fsf@cisco.com> Sean> Any preference to pursuing this change or modifying Sean> ib_post_send_mad to take an ib_mad_send_buf? I think it's going to be confusing to cast a virtual address to a long and then ignore the lkey field. So I would go with a new interface not built on ib_sge. On the other hand, maybe struct sg_list is what we should be using?? (Just thinking out loud here, so to speak) - R. From sean.hefty at intel.com Thu Oct 13 16:25:57 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 13 Oct 2005 16:25:57 -0700 Subject: [openib-general] [PATCH] [SA Query] Change sa_query MAD allocation Message-ID: This patch changes sa_query to allocate MADs using the ib_create_send_mad() routine. The intent behind this change was to eventually change ib_post_send_mad() to take an ib_mad_send_buf as input, but see the "DMA mapping abuses in MAD layer" thread. We may want to go with an alternate solution. However, I'm posting the patch since it's usable even without changes to ib_post_send_mad().
Signed-off-by: Sean Hefty Index: sa_query.c =================================================================== --- sa_query.c (revision 3692) +++ sa_query.c (working copy) @@ -74,9 +74,8 @@ struct ib_sa_query { void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); void (*release)(struct ib_sa_query *); struct ib_sa_port *port; - struct ib_sa_mad *mad; + struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; - DECLARE_PCI_UNMAP_ADDR(mapping) int id; }; @@ -426,6 +425,7 @@ void ib_sa_cancel_query(int id, struct i { unsigned long flags; struct ib_mad_agent *agent; + u64 wr_id; spin_lock_irqsave(&idr_lock, flags); if (idr_find(&query_idr, id) != query) { @@ -433,9 +433,10 @@ void ib_sa_cancel_query(int id, struct i return; } agent = query->port->agent; + wr_id = (unsigned long) query->mad_buf; spin_unlock_irqrestore(&idr_lock, flags); - ib_cancel_mad(agent, id); + ib_cancel_mad(agent, wr_id); } EXPORT_SYMBOL(ib_sa_cancel_query); @@ -455,73 +456,51 @@ static void init_mad(struct ib_sa_mad *m spin_unlock_irqrestore(&tid_lock, flags); } +static void acquire_ah(struct ib_sa_port *port, struct ib_sa_query *query) +{ + unsigned long flags; + + spin_lock_irqsave(&port->ah_lock, flags); + kref_get(&port->sm_ah->ref); + query->sm_ah = port->sm_ah; + spin_unlock_irqrestore(&port->ah_lock, flags); +} + static int send_mad(struct ib_sa_query *query, int timeout_ms) { struct ib_sa_port *port = query->port; + struct ib_send_wr *bad_wr; unsigned long flags; - int ret; - struct ib_sge gather_list; - struct ib_send_wr *bad_wr, wr = { - .opcode = IB_WR_SEND, - .sg_list = &gather_list, - .num_sge = 1, - .send_flags = IB_SEND_SIGNALED, - .wr = { - .ud = { - .mad_hdr = &query->mad->mad_hdr, - .remote_qpn = 1, - .remote_qkey = IB_QP1_QKEY, - .timeout_ms = timeout_ms, - } - } - }; + int ret, id; retry: if (!idr_pre_get(&query_idr, GFP_ATOMIC)) return -ENOMEM; spin_lock_irqsave(&idr_lock, flags); - ret = idr_get_new(&query_idr, query, &query->id); + ret = 
idr_get_new(&query_idr, query, &id); spin_unlock_irqrestore(&idr_lock, flags); if (ret == -EAGAIN) goto retry; if (ret) return ret; - wr.wr_id = query->id; - - spin_lock_irqsave(&port->ah_lock, flags); - kref_get(&port->sm_ah->ref); - query->sm_ah = port->sm_ah; - wr.wr.ud.ah = port->sm_ah->ah; - spin_unlock_irqrestore(&port->ah_lock, flags); - - gather_list.addr = dma_map_single(port->agent->device->dma_device, - query->mad, - sizeof (struct ib_sa_mad), - DMA_TO_DEVICE); - gather_list.length = sizeof (struct ib_sa_mad); - gather_list.lkey = port->agent->mr->lkey; - pci_unmap_addr_set(query, mapping, gather_list.addr); + query->mad_buf->send_wr.wr.ud.timeout_ms = timeout_ms; + query->mad_buf->context[0] = query; + query->id = id; - ret = ib_post_send_mad(port->agent, &wr, &bad_wr); + ret = ib_post_send_mad(port->agent, &query->mad_buf->send_wr, &bad_wr); if (ret) { - dma_unmap_single(port->agent->device->dma_device, - pci_unmap_addr(query, mapping), - sizeof (struct ib_sa_mad), - DMA_TO_DEVICE); - kref_put(&query->sm_ah->ref, free_sm_ah); spin_lock_irqsave(&idr_lock, flags); - idr_remove(&query_idr, query->id); + idr_remove(&query_idr, id); spin_unlock_irqrestore(&idr_lock, flags); } /* * It's not safe to dereference query any more, because the * send may already have completed and freed the query in - * another context. So use wr.wr_id, which has a copy of the - * query's id. + * another context. */ - return ret ? ret : wr.wr_id; + return ret ? 
ret : id; } static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, @@ -543,7 +522,6 @@ static void ib_sa_path_rec_callback(stru static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) { - kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); } @@ -585,42 +563,53 @@ int ib_sa_path_rec_get(struct ib_device struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; struct ib_mad_agent *agent = port->agent; + struct ib_sa_mad *mad; int ret; query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); - if (!query->sa_query.mad) { - kfree(query); - return -ENOMEM; + + acquire_ah(port, &query->sa_query); + query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + query->sa_query.sm_ah->ah, + 0, IB_MGMT_MAD_DATA - + IB_MGMT_SA_DATA, + IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; } query->callback = callback; query->context = context; - init_mad(query->sa_query.mad, agent); + mad = (struct ib_sa_mad *) query->sa_query.mad_buf->mad; + init_mad(mad, agent); - query->sa_query.callback = callback ? ib_sa_path_rec_callback : NULL; - query->sa_query.release = ib_sa_path_rec_release; - query->sa_query.port = port; - query->sa_query.mad->mad_hdr.method = IB_MGMT_METHOD_GET; - query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); - query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + query->sa_query.callback = callback ? 
ib_sa_path_rec_callback : NULL; + query->sa_query.release = ib_sa_path_rec_release; + query->sa_query.port = port; + mad->mad_hdr.method = IB_MGMT_METHOD_GET; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + mad->sa_hdr.comp_mask = comp_mask; - ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), - rec, query->sa_query.mad->data); + ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), rec, mad->data); *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); - if (ret < 0) { - *sa_query = NULL; - kfree(query->sa_query.mad); - kfree(query); - } + if (ret < 0) + goto err2; return ret; +err2: + *sa_query = NULL; + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kref_put(&query->sa_query.sm_ah->ref, free_sm_ah); + kfree(query); + return ret; } EXPORT_SYMBOL(ib_sa_path_rec_get); @@ -643,7 +632,6 @@ static void ib_sa_service_rec_callback(s static void ib_sa_service_rec_release(struct ib_sa_query *sa_query) { - kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_service_query, sa_query)); } @@ -687,6 +675,7 @@ int ib_sa_service_rec_query(struct ib_de struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; struct ib_mad_agent *agent = port->agent; + struct ib_sa_mad *mad; int ret; if (method != IB_MGMT_METHOD_GET && @@ -697,38 +686,48 @@ int ib_sa_service_rec_query(struct ib_de query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); - if (!query->sa_query.mad) { - kfree(query); - return -ENOMEM; + + acquire_ah(port, &query->sa_query); + query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + query->sa_query.sm_ah->ah, + 0, IB_MGMT_MAD_DATA - + IB_MGMT_SA_DATA, + IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; } query->callback = callback; query->context = context; - init_mad(query->sa_query.mad, agent); + mad = 
(struct ib_sa_mad *) query->sa_query.mad_buf->mad; + init_mad(mad, agent); - query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; - query->sa_query.release = ib_sa_service_rec_release; - query->sa_query.port = port; - query->sa_query.mad->mad_hdr.method = method; - query->sa_query.mad->mad_hdr.attr_id = - cpu_to_be16(IB_SA_ATTR_SERVICE_REC); - query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; + query->sa_query.release = ib_sa_service_rec_release; + query->sa_query.port = port; + mad->mad_hdr.method = method; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_SERVICE_REC); + mad->sa_hdr.comp_mask = comp_mask; ib_pack(service_rec_table, ARRAY_SIZE(service_rec_table), - rec, query->sa_query.mad->data); + rec, mad->data); *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); - if (ret < 0) { - *sa_query = NULL; - kfree(query->sa_query.mad); - kfree(query); - } + if (ret < 0) + goto err2; return ret; +err2: + *sa_query = NULL; + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kref_put(&query->sa_query.sm_ah->ref, free_sm_ah); + kfree(query); + return ret; } EXPORT_SYMBOL(ib_sa_service_rec_query); @@ -751,7 +750,6 @@ static void ib_sa_mcmember_rec_callback( static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) { - kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } @@ -770,42 +768,54 @@ int ib_sa_mcmember_rec_query(struct ib_d struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; struct ib_mad_agent *agent = port->agent; + struct ib_sa_mad *mad; int ret; query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); - if (!query->sa_query.mad) { - kfree(query); - return -ENOMEM; + + acquire_ah(port, &query->sa_query); + 
query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + query->sa_query.sm_ah->ah, + 0, IB_MGMT_MAD_DATA - + IB_MGMT_SA_DATA, + IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; } query->callback = callback; query->context = context; - init_mad(query->sa_query.mad, agent); + mad = (struct ib_sa_mad *) query->sa_query.mad_buf->mad; + init_mad(mad, agent); - query->sa_query.callback = callback ? ib_sa_mcmember_rec_callback : NULL; - query->sa_query.release = ib_sa_mcmember_rec_release; - query->sa_query.port = port; - query->sa_query.mad->mad_hdr.method = method; - query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); - query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + query->sa_query.callback = callback ? ib_sa_mcmember_rec_callback : NULL; + query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.port = port; + mad->mad_hdr.method = method; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); + mad->sa_hdr.comp_mask = comp_mask; ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), - rec, query->sa_query.mad->data); + rec, mad->data); *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); - if (ret < 0) { - *sa_query = NULL; - kfree(query->sa_query.mad); - kfree(query); - } + if (ret < 0) + goto err2; return ret; +err2: + *sa_query = NULL; + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kref_put(&query->sa_query.sm_ah->ref, free_sm_ah); + kfree(query); + return ret; } EXPORT_SYMBOL(ib_sa_mcmember_rec_query); @@ -813,14 +823,11 @@ static void send_handler(struct ib_mad_a struct ib_mad_send_wc *mad_send_wc) { struct ib_sa_query *query; + struct ib_mad_send_buf *mad_buf; unsigned long flags; - spin_lock_irqsave(&idr_lock, flags); - query = idr_find(&query_idr, mad_send_wc->wr_id); - spin_unlock_irqrestore(&idr_lock, flags); - - if (!query) - return; + mad_buf = (struct ib_mad_send_buf *)(unsigned long)mad_send_wc->wr_id; + query = 
mad_buf->context[0]; if (query->callback) switch (mad_send_wc->status) { @@ -838,30 +845,25 @@ static void send_handler(struct ib_mad_a break; } - dma_unmap_single(agent->device->dma_device, - pci_unmap_addr(query, mapping), - sizeof (struct ib_sa_mad), - DMA_TO_DEVICE); - kref_put(&query->sm_ah->ref, free_sm_ah); - - query->release(query); - spin_lock_irqsave(&idr_lock, flags); - idr_remove(&query_idr, mad_send_wc->wr_id); + idr_remove(&query_idr, query->id); spin_unlock_irqrestore(&idr_lock, flags); + + ib_free_send_mad(query->mad_buf); + kref_put(&query->sm_ah->ref, free_sm_ah); + query->release(query); } static void recv_handler(struct ib_mad_agent *mad_agent, struct ib_mad_recv_wc *mad_recv_wc) { struct ib_sa_query *query; - unsigned long flags; + struct ib_mad_send_buf *mad_buf; - spin_lock_irqsave(&idr_lock, flags); - query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); - spin_unlock_irqrestore(&idr_lock, flags); + mad_buf = (void *) (unsigned long) mad_recv_wc->wc->wr_id; + query = mad_buf->context[0]; - if (query && query->callback) { + if (query->callback) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) query->callback(query, mad_recv_wc->recv_buf.mad->mad_hdr.status ? From rolandd at cisco.com Thu Oct 13 16:35:09 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 16:35:09 -0700 Subject: [openib-general] Re: [PATCH] [SA Query] Change sa_query MAD allocation In-Reply-To: (Sean Hefty's message of "Thu, 13 Oct 2005 16:25:57 -0700") References: Message-ID: <52ek6pxa6q.fsf@cisco.com> Thanks, I'll read this over. What's the motivation here? To shift over to ib_create_send_mad() so that all the MAD-related DMA mapping stuff is in one place, to make it easier to fix? - R. 
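Note on the wr_id handling in the sa_query patch above: the new send_handler() no longer looks the query up in the idr. It casts the 64-bit wr_id from the completion back to a struct ib_mad_send_buf pointer and reads the query from context[0]. What follows is only a minimal user-space sketch of that round trip, not the kernel code. The demo_* types and functions are hypothetical stand-ins invented for illustration, and the sketch assumes, as the patch appears to, that the cookie reported in the completion is the address of the send buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures touched by the patch;
 * only the fields needed for the illustration are modelled. */
struct demo_query {
	int id;
};

struct demo_send_buf {
	void *context[2];	/* plays the role of mad_buf->context[] */
};

/* Encode the buffer address as a 64-bit work-request cookie. */
uint64_t demo_buf_to_wr_id(struct demo_send_buf *buf)
{
	return (uint64_t)(uintptr_t)buf;
}

/* Recover the owning query from a cookie seen in a completion. */
struct demo_query *demo_wr_id_to_query(uint64_t wr_id)
{
	struct demo_send_buf *buf = (struct demo_send_buf *)(uintptr_t)wr_id;

	assert(buf != NULL);
	return buf->context[0];
}

/* Round trip: returns the id of the query recovered from the cookie. */
int demo_roundtrip(int id)
{
	struct demo_query query = { .id = id };
	struct demo_send_buf buf = { .context = { &query, NULL } };

	return demo_wr_id_to_query(demo_buf_to_wr_id(&buf))->id;
}
```

The cast goes through uintptr_t (the patch uses unsigned long, the usual kernel idiom) so that the pointer-to-integer conversion is well defined on both 32-bit and 64-bit builds. The scheme is only safe while the buffer outlives every completion that can still carry its address.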
From hycsw at ca.sandia.gov Thu Oct 13 16:38:12 2005 From: hycsw at ca.sandia.gov (Helen Chen) Date: Thu, 13 Oct 2005 16:38:12 -0700 (PDT) Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer Message-ID: <200510132338.QAA15017@ca.sandia.gov> Roland, >From rolandd at cisco.com Thu Oct 13 16:19:30 2005 > > Helen> BTW, the state of the IPoIB network seemed fine after the > Helen> failed test, and the mthca counters are moving up nicely. > >Even on the server on3-ib? Yes, even on the server on3-ib. > > Helen> Do you still think this is a crash of the HCA firmware? > Helen> Should I call Mellanox? > >Not if IPoIB is working on the systems printing the TX time out >messages. However, if everything stops working on one of your >systems, then yes, an HCA crash is likely. > >I'm still unclear on what is happening. Do you see TX time >out messages on a particular server, but IPoIB and mthca counters >still work fine on that same server? Or is it just the rest of the >fabric that continues working? > Not in real time. My observations were made after the fact. I suppose I can launch another test and watch the counters in real time if you believe that is necessary? >Thanks, > Roland Thank you so much for the speedy fix. I will apply the patch and stress test it as soon as possible. Helen :-) From mshefty at ichips.intel.com Thu Oct 13 17:05:06 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 13 Oct 2005 17:05:06 -0700 Subject: [openib-general] Re: [PATCH] [SA Query] Change sa_query MAD allocation In-Reply-To: <52ek6pxa6q.fsf@cisco.com> References: <52ek6pxa6q.fsf@cisco.com> Message-ID: <434EF632.7020104@ichips.intel.com> Roland Dreier wrote: > Thanks, I'll read this over. > > What's the motivation here? To shift over to ib_create_send_mad() so > that all the MAD-related DMA mapping stuff is in one place, to make it > easier to fix?
Yes - the motivation is to fix the DMA mapping issue that you pointed out by changing ib_post_send_mad() to take an ib_mad_send_buf as input. There are three places that I see where ib_post_send_mad() is called without using ib_create_send_mad(): sa_query, mthca_mad, and agent. (Their implementation pre-dates the call.) My intent was to patch each of these separately to use ib_create_send_mad(), then apply a patch to convert the API. If the API does not change to take an ib_mad_send_buf, then it's your call whether to apply the patch. - Sean From rolandd at cisco.com Thu Oct 13 17:54:36 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 13 Oct 2005 17:54:36 -0700 Subject: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer In-Reply-To: <200510132338.QAA15017@ca.sandia.gov> (Helen Chen's message of "Thu, 13 Oct 2005 16:38:12 -0700 (PDT)") References: <200510132338.QAA15017@ca.sandia.gov> Message-ID: <521x2oyl2r.fsf@cisco.com> Helen> Not in real time. My observations were made after the fact. Helen> I suppose I can launch another test and watch the counters in Helen> real time if you believe that is necessary? That might be interesting. Assuming the HCA continues to work fine, and IPoIB recovers, the only theory I can come up with is that something is causing interrupts to be held off for a long time, so the IPoIB driver doesn't get to see sends completing. But I don't know what such a workload might be. Perhaps something else you're running (Lustre?, iSCSI?) holds a lock for a long time and causes the timeout. But it's not clear to me why the TX watchdog would get to run if the interrupt handler doesn't get to run. - R. From halr at voltaire.com Thu Oct 13 22:13:45 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Oct 2005 01:13:45 -0400 Subject: [openib-general] Re: IBM eHCA testing.. In-Reply-To: <20051013224647.GC7707@minbar.scl.ameslab.gov> References: <20051007141207.GX4612@kalmia.hozed.org> <20051013224647.GC7707@minbar.scl.ameslab.gov> Message-ID: <1129266824.4402.5286.camel@hal.voltaire.com> On Thu, 2005-10-13 at 18:46, Troy Benjegerdes wrote: > I'm also attaching part of an opensm log file. > > (the full copy is at http://scl.ameslab.gov/~troy/osm-ehca.log ) > > The IBM galaxy adapters are at: > Initial path: [0][1][16] > Initial path: [0][1][13] > The OpenSM is just saying that an SMP transaction it issued (in this case, SM Get P_KeyTable) is timing out (no response made it back to OpenSM). BTW, what svn rev is OpenSM up to? -- Hal From hch at lst.de Fri Oct 14 02:26:24 2005 From: hch at lst.de (Christoph Hellwig) Date: Fri, 14 Oct 2005 11:26:24 +0200 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <6.2.0.14.2.20051013101446.022d3ba0@esmail.cup.hp.com> References: <54AD0F12E08D1541B826BE97C98F99F10209F7@NT-SJCA-0751.brcm.ad.broadcom.com> <6.2.0.14.2.20051013101446.022d3ba0@esmail.cup.hp.com> Message-ID: <20051014092624.GA11046@lst.de> On Thu, Oct 13, 2005 at 10:29:09AM -0700, Michael Krause wrote: > This all comes down to economics which is why some ULPs such as SDP are > created. Let's examine SDP for a moment.
The purpose of SDP is to enable > synchronous and asynchronous Sockets applications to transparently run > unmodified over a RDMA capable interconnect. Unmodified means no source > code changes and no recompile required (this is possible if the Sockets > library is a shared library and dynamically linked). The first part of > unmodified means that the existing address / service resolution API calls > work (further, no change to the address family, etc. is required to make > this work either). Hence, pick any of the get* API calls that are in use > today and they should just work. That's not how SDP is going to work on Linux, though. We're not into your crude hacks to let broken applications work with new technology business. Applications will have to use SDP directly or via getaddrinfo and we will never put in a broken sockets switch. And can you _please_ stop all this time to market and similar business crap? That simply doesn't matter when designing something properly. From hch at lst.de Fri Oct 14 02:28:23 2005 From: hch at lst.de (Christoph Hellwig) Date: Fri, 14 Oct 2005 11:28:23 +0200 Subject: [openib-general] DMA mapping abuses in MAD layer In-Reply-To: <434EEB19.3010202@ichips.intel.com> References: <52y84z73oo.fsf@cisco.com> <434D3981.9030507@ichips.intel.com> <434EEB19.3010202@ichips.intel.com> Message-ID: <20051014092823.GB11046@lst.de> On Thu, Oct 13, 2005 at 04:17:45PM -0700, Sean Hefty wrote: > Sean Hefty wrote: > >Does anyone else have any other ideas on how to fix this issue? > > The current MAD interface requires the user to have code similar to this: > > send_buf->sge.addr = dma_map_single(mad_agent->device->dma_device, > buf, buf_size, DMA_TO_DEVICE); > pci_unmap_addr_set(send_buf, mapping, send_buf->sge.addr); > > This is consistent with how an ib_send_wr would be formatted for other QPs.
> Another possibility, however, is to let the user do: > > send_buf->sge.addr = (unsigned long) buf; > > And then have the MAD layer perform the mapping/unmapping immediately > before and after posting to the QP. This keeps the syntax of the current > interface, but still requires user changes. If you change behaviour you should change the interface, in this case you'd _really_ want to pass down the buffer as void pointer and not cast it to a dma_addr_t - that would in fact break on ppc64 where dma_addr_t is a 32bit data type and a pointer is 64bits wide.
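A minimal sketch of the truncation concern raised in the DMA mapping thread above, where a 64-bit pointer value is stored in a 32-bit address type. This is plain user-space C, not kernel code. The names narrow_dma_addr_t, store_as_dma_addr(), survives_round_trip() and check_truncation() are hypothetical and exist only for illustration, and the 32-bit width is an assumption chosen to mirror the ppc64 case described in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a DMA address type that is only 32 bits wide. */
typedef uint32_t narrow_dma_addr_t;

/* Store a 64-bit pointer value in the narrow type, as a plain cast would. */
narrow_dma_addr_t store_as_dma_addr(uint64_t ptr_value)
{
	return (narrow_dma_addr_t) ptr_value;	/* upper 32 bits are discarded */
}

/* Returns 1 if the value survives the round trip, 0 if it was truncated. */
int survives_round_trip(uint64_t ptr_value)
{
	return (uint64_t) store_as_dma_addr(ptr_value) == ptr_value;
}

/* Self-check: a low address survives, a high address loses its top half. */
void check_truncation(void)
{
	assert(survives_round_trip(0x00000000deadbeefULL) == 1);
	assert(survives_round_trip(0xc0000000deadbeefULL) == 0);
	assert(store_as_dma_addr(0xc0000000deadbeefULL) == 0xdeadbeefU);
}
```

The point of the sketch is only that the cast compiles without complaint and silently loses the upper half of the value, which is why passing the buffer down as a void pointer, as suggested in the thread, avoids the problem at the interface level.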
From caitlinb at broadcom.com Fri Oct 14 08:38:18 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 14 Oct 2005 08:38:18 -0700 Subject: [openib-general] [RFC] IB address translation using ARP Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020A11@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of > Christoph Hellwig > Sent: Friday, October 14, 2005 2:26 AM > To: Michael Krause > Cc: openib-general at openib.org > Subject: Re: [openib-general] [RFC] IB address translation using ARP > > On Thu, Oct 13, 2005 at 10:29:09AM -0700, Michael Krause wrote: > > This all comes down to economics which is why some ULP such > > as SDP are created. Let's examine SDP for a moment. The purpose > > of SDP is to enable synchronous and asynchronous Sockets applications to > > transparently run unmodified over a RDMA capable interconnect. > > Unmodified means no source code changes and no recompile required > > (this is possible if the Sockets library is a shared library and > > dynamically linked). The first part of unmodified means that the > > existing address / service resolution API calls work (further, no > > change to the address family, etc. is required to make this work either). > > Hence, pick any of the get* API calls that > > are in use today and they should just work. > > That's not how SDP is going to work on Linux, though. We're > not into your crude hacks to let broken applications work > with new technology business. Applications will have to use > SDP directly or via getaddrinfo and we will never put in a > broken sockets switch. > I can't think of a better example of something that is truly brain dead than an application *written* to use Sockets Direct Protocol.
The protocol offers *zero* advantages to the network or to the application over direct use of RDMA (or of TOE) unless you presume that the application will continue to use a sockets API. If the application is using a QP/CQ API it does not need, and should not use, SDP. The sole technical merit of SDP is its ability to support streaming semantics that very precisely match the current semantics for sockets over TCP without requiring byte stream support in the hardware. It has poorer network utilization compared to TOE, and every objection raised to TOE applies to SDP. So if you aren't preserving the sockets API what is the point in using the protocol? > And can you _please_ stop all this time to market and > similar business crap? That simply doesn't matter when > designing something properly. If we really were to play stop-the-world-while-I-redesign-it games then the resulting solution would not use sockets, TCP or even Linux. Real solutions, from NICs through Operating Systems, recognize that their legacy is part of their strength as well as a nuisance. From krause at cup.hp.com Fri Oct 14 09:04:04 2005 From: krause at cup.hp.com (Michael Krause) Date: Fri, 14 Oct 2005 09:04:04 -0700 Subject: [PATCH, please test] IPoIB: recycle RX bufs (was: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer) In-Reply-To: <52r7apxb4p.fsf_-_@cisco.com> References: <521x2q7byt.fsf@cisco.com> <20051013191407.GC13356@mellanox.co.il> <52irw12ovr.fsf@cisco.com> <52r7apxb4p.fsf_-_@cisco.com> Message-ID: <6.2.0.14.2.20051014090128.025210d8@esmail.cup.hp.com> At 04:14 PM 10/13/2005, Roland Dreier wrote: > Roland> My plan is to change the receive handling of IPoIB > Roland> slightly, so that if it can't allocate a new receive > Roland> buffer, it reposts the old buffer and drops the packet it > Roland> just received. > >Here's a patch that changes IPoIB to use this scheme.
This should be >much more robust when the system gets low on GFP_ATOMIC memory. > >I'd appreciate it if people could stress test and benchmark this. It >works well for me, but I'm wondering if this patch has any effect on >performance (either better or worse). > >Helen, it would be especially interesting if you could run your test >with this patch and without increasing min_free_kbytes, since you are >able to reproduce GFP_ATOMIC failures. I'd be curious to know what >you see in /sys/class/net/ib0/statistics/rx_dropped after running the test. As a general rule, dropping a packet that has traversed the network is frowned upon by the IETF (this is perhaps more due to the view of the network being all of IP and not just within a data center). I understand the idea but given it is UD, the HCA can effectively drop the packet without causing any side effects which should result in lower host CPU / I/O / memory utilization. Mike From troy at scl.ameslab.gov Fri Oct 14 09:08:33 2005 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Fri, 14 Oct 2005 11:08:33 -0500 Subject: [openib-general] Re: IBM eHCA testing.. In-Reply-To: <1129266824.4402.5286.camel@hal.voltaire.com> References: <20051007141207.GX4612@kalmia.hozed.org> <20051013224647.GC7707@minbar.scl.ameslab.gov> <1129266824.4402.5286.camel@hal.voltaire.com> Message-ID: <434FD801.3000508@scl.ameslab.gov> Hal Rosenstock wrote: >On Thu, 2005-10-13 at 18:46, Troy Benjegerdes wrote: > > >>I'm also attaching part of an opensm log file. >> >>(the full copy is at http://scl.ameslab.gov/~troy/osm-ehca.log ) >> >>The IBM galaxy adapters are at: >> Initial path: [0][1][16] >> Initial path: [0][1][13] >> >> >> > >The OpenSM is just saying that a SMP transaction it issued (in this >case, SM Get P_KeyTable) is timing out (no response made it back to >OpenSM). > >BTW, what svn rev is OpenSM up to ?
> >-- Hal > > So, how about a patch to opensm to report what svn rev it was built from ;) I just discovered another problem.. We have been running PVFS2 over IPoIB on the same subnet, and in debugging this, I restarted opensm several times, and somewhere in the stack a PVFS2 write failed. I wouldn't think that a short downtime of the SM from restarting it would cause any IPoIB TCP sessions to fall over.. From rolandd at cisco.com Fri Oct 14 09:14:13 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 14 Oct 2005 09:14:13 -0700 Subject: [openib-general] Re: [PATCH, please test] IPoIB: recycle RX bufs In-Reply-To: <6.2.0.14.2.20051014090128.025210d8@esmail.cup.hp.com> (Michael Krause's message of "Fri, 14 Oct 2005 09:04:04 -0700") References: <521x2q7byt.fsf@cisco.com> <20051013191407.GC13356@mellanox.co.il> <52irw12ovr.fsf@cisco.com> <52r7apxb4p.fsf_-_@cisco.com> <6.2.0.14.2.20051014090128.025210d8@esmail.cup.hp.com> Message-ID: <52k6ggvzxm.fsf@cisco.com> Michael> As a general rule, dropping a packet that has traversed Michael> the network is frowned upon by the IETF (this is perhaps Michael> more due to the view of the network being all of IP and Michael> not just within a data center). I understand the idea Michael> but given it is UD, the HCA can effectively drop the Michael> packet without causing any side effects which should Michael> result in lower host CPU / I/O / memory utilization. I think you accidentally left your patch out of the email ;) - R.
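For reference, the receive scheme under discussion in this thread (when a replacement receive buffer cannot be allocated, keep the old buffer posted and drop the packet just received) can be sketched in plain user-space C. This is an illustration only and is not the IPoIB patch. The names rx_slot, rx_stats, rx_complete(), alloc_ok() and alloc_fail() are hypothetical, and the allocator hook is an assumption added so that an allocation failure can be simulated.

```c
#include <assert.h>
#include <stdlib.h>

#define RX_BUF_SIZE 2048

/* One receive slot: the buffer currently posted to the (imaginary) hardware. */
struct rx_slot {
	char *buf;
};

/* Counters loosely modelled on per-interface network statistics. */
struct rx_stats {
	unsigned long rx_packets;
	unsigned long rx_dropped;
};

/* Hypothetical allocator hook so that failure can be simulated. */
typedef char *(*alloc_fn)(size_t size);

char *alloc_ok(size_t size)   { return malloc(size); }
char *alloc_fail(size_t size) { (void) size; return NULL; }

/*
 * Handle one receive completion on a slot.
 * Returns the filled buffer to pass up the stack (the caller must free it),
 * or NULL if the packet was dropped and the old buffer stays posted.
 */
char *rx_complete(struct rx_slot *slot, struct rx_stats *stats, alloc_fn alloc)
{
	char *fresh;
	char *filled;

	assert(slot && stats && alloc);

	fresh = alloc(RX_BUF_SIZE);
	if (!fresh) {
		/* No memory: recycle the old buffer and count a drop. */
		stats->rx_dropped++;
		return NULL;
	}

	filled = slot->buf;	/* hand the received data up the stack */
	slot->buf = fresh;	/* post the replacement buffer */
	stats->rx_packets++;
	return filled;
}
```

In this sketch the slot never ends up without a buffer, so the receive ring cannot drain under memory pressure; the cost is that the packet in the recycled buffer is lost and shows up in the drop counter.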
From mshefty at ichips.intel.com Fri Oct 14 09:18:31 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 14 Oct 2005 09:18:31 -0700 Subject: [openib-general] DMA mapping abuses in MAD layer In-Reply-To: <20051014092823.GB11046@lst.de> References: <52y84z73oo.fsf@cisco.com> <434D3981.9030507@ichips.intel.com> <434EEB19.3010202@ichips.intel.com> <20051014092823.GB11046@lst.de> Message-ID: <434FDA57.2040309@ichips.intel.com> Christoph Hellwig wrote: > If you change behaviour you should change the interface, in this case > you'd _really_ want to pass down the buffer as void pointer and not cast > it to a dma_addr_t - that would in fact break on ppc64 where dma_addr_t > is a 32bit data type and a pointer is 64bits wide. To clarify, I'm proceeding down the path of changing: int ib_post_send_mad(struct ib_mad_agent *mad_agent, struct ib_send_wr *send_wr, struct ib_send_wr **bad_send_wr); to something like: int ib_post_send_mad(struct ib_mad_agent *mad_agent, struct ib_mad_send_buf *send_buf, struct ib_mad_send_buf **bad_send_buf); (with minor changes to struct ib_mad_send_buf) This should permit chaining together sends, plus control how memory is allocated for very large MAD transfers. I threw out the other possibility to ensure that there's not a better solution that we're missing. I believe that the other solution requires substantially fewer changes to the existing codebase. - Sean From ftillier at silverstorm.com Fri Oct 14 09:23:30 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 14 Oct 2005 09:23:30 -0700 Subject: [openib-general] Re: IBM eHCA testing.. In-Reply-To: <434FD801.3000508@scl.ameslab.gov> Message-ID: <001101c5d0db$9ec987b0$9e5aa8c0@infiniconsys.com> > From: Troy Benjegerdes [mailto:troy at scl.ameslab.gov] > Sent: Friday, October 14, 2005 9:09 AM > > I just discovered another problem.. 
We have been running PVFS2 over > IPoIB on the same subnet, and in debugging this, I restarted opensm > several times, and somewhere in the stack a PVFS2 write failed. I > wouldn't think that a short downtime of the SM from restarting it would > cause any IPoIB TCP sessions to fall over.. If the path has already been resolved, traffic (even multicast) between existing nodes will survive the SM going down. You run into issues if you try to talk to a new node and attempt to contact the SM for a path record, or if you try to bring up a new interface. - Fab From halr at voltaire.com Fri Oct 14 10:41:13 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Oct 2005 13:41:13 -0400 Subject: [openib-general] Re: IBM eHCA testing.. In-Reply-To: <434FD801.3000508@scl.ameslab.gov> References: <20051007141207.GX4612@kalmia.hozed.org> <20051013224647.GC7707@minbar.scl.ameslab.gov> <1129266824.4402.5286.camel@hal.voltaire.com> <434FD801.3000508@scl.ameslab.gov> Message-ID: <1129311672.16900.89.camel@hal.voltaire.com> On Fri, 2005-10-14 at 12:08, Troy Benjegerdes wrote: > Hal Rosenstock wrote: > > >On Thu, 2005-10-13 at 18:46, Troy Benjegerdes wrote: > > > > > >>I'm also attaching part of an opensm log file. > >> > >>(the full copy is at http://scl.ameslab.gov/~troy/osm-ehca.log ) > >> > >>The IBM galaxy adapters are at: > >> Initial path: [0][1][16] > >> Initial path: [0][1][13] > >> > >> > >> > > > >The OpenSM is just saying that a SMP transaction it issued (in this > >case, SM Get P_KeyTable) is timing out (no response made it back to > >OpenSM). > > > >BTW, what svn rev is OpenSM up to ? > > > >-- Hal > > > > > So, how about a patch to opensm to report what svn rev it was built from ;) Can you do svn info in the userspace/management/osm directory ? > I just discovered another problem.. We have been running PVFS2 over > IPoIB on the same subnet, and in debugging this, I restarted opensm > several times, and somewhere in the stack a PVFS2 write failed.
I > wouldn't think that a short downtime of the SM from restarting it would > cause any IPoIB TCP sessions to fall over.. As Fab indicated, there are a number of places where the SM/SA is needed: 1. SA PathRecords (used when a path to a new IP end node is needed or an existing one times out) 2. SA MCMemberRecord joins, queries, and leaves (used when an interface is up'ed, down'ed, etc.) Is this on an existing TCP session ? Is it OpenIB IPoIB clients at each end ? What svn version is being used for this ? -- Hal From mshefty at ichips.intel.com Fri Oct 14 10:52:55 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 14 Oct 2005 10:52:55 -0700 Subject: [openib-general] [PATCH] [SA Query] Change sa_query MAD allocation In-Reply-To: References: Message-ID: <434FF077.1090309@ichips.intel.com> Sean Hefty wrote: > + query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, > + query->sa_query.sm_ah->ah, > + 0, IB_MGMT_MAD_DATA - > + IB_MGMT_SA_DATA, > + IB_MGMT_SA_DATA, gfp_mask); My testing worked with this, but the call above is wrong. The header size is not calculated correctly. It should be something like: query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, query->sa_query.sm_ah->ah, 0, sizeof(struct ib_mad) - IB_MGMT_SA_DATA, IB_MGMT_SA_DATA, gfp_mask); This error appears in two other places in the patch. - Sean From sean.hefty at intel.com Fri Oct 14 11:18:10 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 14 Oct 2005 11:18:10 -0700 Subject: [openib-general] [PATCHv2] [SA Query] Change sa_query MAD allocation In-Reply-To: <434FF077.1090309@ichips.intel.com> Message-ID: Here's an updated version.
Signed-off-by: Sean Hefty Index: sa_query.c =================================================================== --- sa_query.c (revision 3692) +++ sa_query.c (working copy) @@ -74,9 +74,8 @@ struct ib_sa_query { void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); void (*release)(struct ib_sa_query *); struct ib_sa_port *port; - struct ib_sa_mad *mad; + struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; - DECLARE_PCI_UNMAP_ADDR(mapping) int id; }; @@ -426,6 +425,7 @@ void ib_sa_cancel_query(int id, struct i { unsigned long flags; struct ib_mad_agent *agent; + u64 wr_id; spin_lock_irqsave(&idr_lock, flags); if (idr_find(&query_idr, id) != query) { @@ -433,9 +433,10 @@ void ib_sa_cancel_query(int id, struct i return; } agent = query->port->agent; + wr_id = (unsigned long) query->mad_buf; spin_unlock_irqrestore(&idr_lock, flags); - ib_cancel_mad(agent, id); + ib_cancel_mad(agent, wr_id); } EXPORT_SYMBOL(ib_sa_cancel_query); @@ -455,73 +456,51 @@ static void init_mad(struct ib_sa_mad *m spin_unlock_irqrestore(&tid_lock, flags); } +static void acquire_ah(struct ib_sa_port *port, struct ib_sa_query *query) +{ + unsigned long flags; + + spin_lock_irqsave(&port->ah_lock, flags); + kref_get(&port->sm_ah->ref); + query->sm_ah = port->sm_ah; + spin_unlock_irqrestore(&port->ah_lock, flags); +} + static int send_mad(struct ib_sa_query *query, int timeout_ms) { struct ib_sa_port *port = query->port; + struct ib_send_wr *bad_wr; unsigned long flags; - int ret; - struct ib_sge gather_list; - struct ib_send_wr *bad_wr, wr = { - .opcode = IB_WR_SEND, - .sg_list = &gather_list, - .num_sge = 1, - .send_flags = IB_SEND_SIGNALED, - .wr = { - .ud = { - .mad_hdr = &query->mad->mad_hdr, - .remote_qpn = 1, - .remote_qkey = IB_QP1_QKEY, - .timeout_ms = timeout_ms, - } - } - }; + int ret, id; retry: if (!idr_pre_get(&query_idr, GFP_ATOMIC)) return -ENOMEM; spin_lock_irqsave(&idr_lock, flags); - ret = idr_get_new(&query_idr, query, &query->id); + ret = 
idr_get_new(&query_idr, query, &id); spin_unlock_irqrestore(&idr_lock, flags); if (ret == -EAGAIN) goto retry; if (ret) return ret; - wr.wr_id = query->id; - - spin_lock_irqsave(&port->ah_lock, flags); - kref_get(&port->sm_ah->ref); - query->sm_ah = port->sm_ah; - wr.wr.ud.ah = port->sm_ah->ah; - spin_unlock_irqrestore(&port->ah_lock, flags); - - gather_list.addr = dma_map_single(port->agent->device->dma_device, - query->mad, - sizeof (struct ib_sa_mad), - DMA_TO_DEVICE); - gather_list.length = sizeof (struct ib_sa_mad); - gather_list.lkey = port->agent->mr->lkey; - pci_unmap_addr_set(query, mapping, gather_list.addr); + query->mad_buf->send_wr.wr.ud.timeout_ms = timeout_ms; + query->mad_buf->context[0] = query; + query->id = id; - ret = ib_post_send_mad(port->agent, &wr, &bad_wr); + ret = ib_post_send_mad(port->agent, &query->mad_buf->send_wr, &bad_wr); if (ret) { - dma_unmap_single(port->agent->device->dma_device, - pci_unmap_addr(query, mapping), - sizeof (struct ib_sa_mad), - DMA_TO_DEVICE); - kref_put(&query->sm_ah->ref, free_sm_ah); spin_lock_irqsave(&idr_lock, flags); - idr_remove(&query_idr, query->id); + idr_remove(&query_idr, id); spin_unlock_irqrestore(&idr_lock, flags); } /* * It's not safe to dereference query any more, because the * send may already have completed and freed the query in - * another context. So use wr.wr_id, which has a copy of the - * query's id. + * another context. */ - return ret ? ret : wr.wr_id; + return ret ? 
ret : id; } static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, @@ -543,7 +522,6 @@ static void ib_sa_path_rec_callback(stru static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) { - kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); } @@ -585,42 +563,53 @@ int ib_sa_path_rec_get(struct ib_device struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; struct ib_mad_agent *agent = port->agent; + struct ib_sa_mad *mad; int ret; query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); - if (!query->sa_query.mad) { - kfree(query); - return -ENOMEM; + + acquire_ah(port, &query->sa_query); + query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + query->sa_query.sm_ah->ah, + 0, sizeof *mad - + IB_MGMT_SA_DATA, + IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; } query->callback = callback; query->context = context; - init_mad(query->sa_query.mad, agent); + mad = (struct ib_sa_mad *) query->sa_query.mad_buf->mad; + init_mad(mad, agent); - query->sa_query.callback = callback ? ib_sa_path_rec_callback : NULL; - query->sa_query.release = ib_sa_path_rec_release; - query->sa_query.port = port; - query->sa_query.mad->mad_hdr.method = IB_MGMT_METHOD_GET; - query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); - query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + query->sa_query.callback = callback ? 
ib_sa_path_rec_callback : NULL; + query->sa_query.release = ib_sa_path_rec_release; + query->sa_query.port = port; + mad->mad_hdr.method = IB_MGMT_METHOD_GET; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + mad->sa_hdr.comp_mask = comp_mask; - ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), - rec, query->sa_query.mad->data); + ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), rec, mad->data); *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); - if (ret < 0) { - *sa_query = NULL; - kfree(query->sa_query.mad); - kfree(query); - } + if (ret < 0) + goto err2; return ret; +err2: + *sa_query = NULL; + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kref_put(&query->sa_query.sm_ah->ref, free_sm_ah); + kfree(query); + return ret; } EXPORT_SYMBOL(ib_sa_path_rec_get); @@ -643,7 +632,6 @@ static void ib_sa_service_rec_callback(s static void ib_sa_service_rec_release(struct ib_sa_query *sa_query) { - kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_service_query, sa_query)); } @@ -687,6 +675,7 @@ int ib_sa_service_rec_query(struct ib_de struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; struct ib_mad_agent *agent = port->agent; + struct ib_sa_mad *mad; int ret; if (method != IB_MGMT_METHOD_GET && @@ -697,38 +686,48 @@ int ib_sa_service_rec_query(struct ib_de query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); - if (!query->sa_query.mad) { - kfree(query); - return -ENOMEM; + + acquire_ah(port, &query->sa_query); + query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + query->sa_query.sm_ah->ah, + 0, sizeof *mad - + IB_MGMT_SA_DATA, + IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; } query->callback = callback; query->context = context; - init_mad(query->sa_query.mad, agent); + mad = (struct 
ib_sa_mad *) query->sa_query.mad_buf->mad; + init_mad(mad, agent); - query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; - query->sa_query.release = ib_sa_service_rec_release; - query->sa_query.port = port; - query->sa_query.mad->mad_hdr.method = method; - query->sa_query.mad->mad_hdr.attr_id = - cpu_to_be16(IB_SA_ATTR_SERVICE_REC); - query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; + query->sa_query.release = ib_sa_service_rec_release; + query->sa_query.port = port; + mad->mad_hdr.method = method; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_SERVICE_REC); + mad->sa_hdr.comp_mask = comp_mask; ib_pack(service_rec_table, ARRAY_SIZE(service_rec_table), - rec, query->sa_query.mad->data); + rec, mad->data); *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); - if (ret < 0) { - *sa_query = NULL; - kfree(query->sa_query.mad); - kfree(query); - } + if (ret < 0) + goto err2; return ret; +err2: + *sa_query = NULL; + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kref_put(&query->sa_query.sm_ah->ref, free_sm_ah); + kfree(query); + return ret; } EXPORT_SYMBOL(ib_sa_service_rec_query); @@ -751,7 +750,6 @@ static void ib_sa_mcmember_rec_callback( static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) { - kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } @@ -770,42 +768,54 @@ int ib_sa_mcmember_rec_query(struct ib_d struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); struct ib_sa_port *port = &sa_dev->port[port_num - sa_dev->start_port]; struct ib_mad_agent *agent = port->agent; + struct ib_sa_mad *mad; int ret; query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); - if (!query->sa_query.mad) { - kfree(query); - return -ENOMEM; + + acquire_ah(port, &query->sa_query); + 
query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + query->sa_query.sm_ah->ah, + 0, sizeof *mad - + IB_MGMT_SA_DATA, + IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; } query->callback = callback; query->context = context; - init_mad(query->sa_query.mad, agent); + mad = (struct ib_sa_mad *) query->sa_query.mad_buf->mad; + init_mad(mad, agent); - query->sa_query.callback = callback ? ib_sa_mcmember_rec_callback : NULL; - query->sa_query.release = ib_sa_mcmember_rec_release; - query->sa_query.port = port; - query->sa_query.mad->mad_hdr.method = method; - query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); - query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + query->sa_query.callback = callback ? ib_sa_mcmember_rec_callback : NULL; + query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.port = port; + mad->mad_hdr.method = method; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); + mad->sa_hdr.comp_mask = comp_mask; ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), - rec, query->sa_query.mad->data); + rec, mad->data); *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); - if (ret < 0) { - *sa_query = NULL; - kfree(query->sa_query.mad); - kfree(query); - } + if (ret < 0) + goto err2; return ret; +err2: + *sa_query = NULL; + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kref_put(&query->sa_query.sm_ah->ref, free_sm_ah); + kfree(query); + return ret; } EXPORT_SYMBOL(ib_sa_mcmember_rec_query); @@ -813,14 +823,11 @@ static void send_handler(struct ib_mad_a struct ib_mad_send_wc *mad_send_wc) { struct ib_sa_query *query; + struct ib_mad_send_buf *mad_buf; unsigned long flags; - spin_lock_irqsave(&idr_lock, flags); - query = idr_find(&query_idr, mad_send_wc->wr_id); - spin_unlock_irqrestore(&idr_lock, flags); - - if (!query) - return; + mad_buf = (struct ib_mad_send_buf *)(unsigned long)mad_send_wc->wr_id; + query = 
mad_buf->context[0]; if (query->callback) switch (mad_send_wc->status) { @@ -838,30 +845,25 @@ static void send_handler(struct ib_mad_a break; } - dma_unmap_single(agent->device->dma_device, - pci_unmap_addr(query, mapping), - sizeof (struct ib_sa_mad), - DMA_TO_DEVICE); - kref_put(&query->sm_ah->ref, free_sm_ah); - - query->release(query); - spin_lock_irqsave(&idr_lock, flags); - idr_remove(&query_idr, mad_send_wc->wr_id); + idr_remove(&query_idr, query->id); spin_unlock_irqrestore(&idr_lock, flags); + + ib_free_send_mad(query->mad_buf); + kref_put(&query->sm_ah->ref, free_sm_ah); + query->release(query); } static void recv_handler(struct ib_mad_agent *mad_agent, struct ib_mad_recv_wc *mad_recv_wc) { struct ib_sa_query *query; - unsigned long flags; + struct ib_mad_send_buf *mad_buf; - spin_lock_irqsave(&idr_lock, flags); - query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); - spin_unlock_irqrestore(&idr_lock, flags); + mad_buf = (void *) (unsigned long) mad_recv_wc->wc->wr_id; + query = mad_buf->context[0]; - if (query && query->callback) { + if (query->callback) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) query->callback(query, mad_recv_wc->recv_buf.mad->mad_hdr.status ? From sean.hefty at intel.com Fri Oct 14 11:25:21 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 14 Oct 2005 11:25:21 -0700 Subject: [openib-general] [PATCH] [MTHCA] change mthca MAD allocation Message-ID: This patch changes mthca_mad to allocate MADs using ib_create_send_mad(). 
Signed-off-by: Sean Hefty Index: mthca_mad.c =================================================================== --- mthca_mad.c (revision 3692) +++ mthca_mad.c (working copy) @@ -46,11 +46,6 @@ enum { MTHCA_VENDOR_CLASS2 = 0xa }; -struct mthca_trap_mad { - struct ib_mad *mad; - DECLARE_PCI_UNMAP_ADDR(mapping) -}; - static void update_sm_ah(struct mthca_dev *dev, u8 port_num, u16 lid, u8 sl) { @@ -116,48 +111,19 @@ static void forward_trap(struct mthca_de struct ib_mad *mad) { int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; - struct mthca_trap_mad *tmad; - struct ib_sge gather_list; - struct ib_send_wr *bad_wr, wr = { - .opcode = IB_WR_SEND, - .sg_list = &gather_list, - .num_sge = 1, - .send_flags = IB_SEND_SIGNALED, - .wr = { - .ud = { - .remote_qpn = qpn, - .remote_qkey = qpn ? IB_QP1_QKEY : 0, - .timeout_ms = 0 - } - } - }; + struct ib_mad_send_buf *send_buf; + struct ib_send_wr *bad_wr; struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; int ret; unsigned long flags; if (agent) { - tmad = kmalloc(sizeof *tmad, GFP_KERNEL); - if (!tmad) - return; - - tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); - if (!tmad->mad) { - kfree(tmad); - return; - } - - memcpy(tmad->mad, mad, sizeof *mad); + /* AH set below */ + send_buf = ib_create_send_mad(agent, qpn, 0, NULL, 0, + sizeof *mad - IB_MGMT_MAD_DATA, + IB_MGMT_MAD_DATA, GFP_KERNEL); - wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; - wr.wr_id = (unsigned long) tmad; - - gather_list.addr = dma_map_single(agent->device->dma_device, - tmad->mad, - sizeof *tmad->mad, - DMA_TO_DEVICE); - gather_list.length = sizeof *tmad->mad; - gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; - pci_unmap_addr_set(tmad, mapping, gather_list.addr); + memcpy(send_buf->mad, mad, sizeof *mad); /* * We rely here on the fact that MLX QPs don't use the @@ -166,21 +132,15 @@ static void forward_trap(struct mthca_de * it's OK for our devices). 
*/ spin_lock_irqsave(&dev->sm_lock, flags); - wr.wr.ud.ah = dev->sm_ah[port_num - 1]; - if (wr.wr.ud.ah) - ret = ib_post_send_mad(agent, &wr, &bad_wr); + send_buf->send_wr.wr.ud.ah = dev->sm_ah[port_num - 1]; + if (send_buf->send_wr.wr.ud.ah) + ret = ib_post_send_mad(agent, &send_buf->send_wr, &bad_wr); else ret = -EINVAL; spin_unlock_irqrestore(&dev->sm_lock, flags); - if (ret) { - dma_unmap_single(agent->device->dma_device, - pci_unmap_addr(tmad, mapping), - sizeof *tmad->mad, - DMA_TO_DEVICE); - kfree(tmad->mad); - kfree(tmad); - } + if (ret) + ib_free_send_mad(send_buf); } } @@ -267,15 +227,7 @@ int mthca_process_mad(struct ib_device * static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) { - struct mthca_trap_mad *tmad = - (void *) (unsigned long) mad_send_wc->wr_id; - - dma_unmap_single(agent->device->dma_device, - pci_unmap_addr(tmad, mapping), - sizeof *tmad->mad, - DMA_TO_DEVICE); - kfree(tmad->mad); - kfree(tmad); + ib_free_send_mad((void *) (unsigned long) mad_send_wc->wr_id); } int mthca_create_agents(struct mthca_dev *dev) From brett at scl.ameslab.gov Fri Oct 14 13:22:47 2005 From: brett at scl.ameslab.gov (Brett Bode) Date: Fri, 14 Oct 2005 15:22:47 -0500 Subject: [Fwd: Re: [openib-general] Re: IBM eHCA testing..] In-Reply-To: <43500695.2090108@scl.ameslab.gov> References: <43500695.2090108@scl.ameslab.gov> Message-ID: <9fe7e2b2f954a437c18321a167d587df@scl.ameslab.gov> On Oct 14, 2005, at 2:27 PM, Troy Benjegerdes wrote: > > > From: Hal Rosenstock > Date: October 14, 2005 12:41:13 PM CDT > To: Troy Benjegerdes > Cc: IBMEHCA DD , openib-general at openib.org > Subject: Re: [openib-general] Re: IBM eHCA testing.. > > > On Fri, 2005-10-14 at 12:08, Troy Benjegerdes wrote: >> Hal Rosenstock wrote: >> >>> On Thu, 2005-10-13 at 18:46, Troy Benjegerdes wrote: >>> >>> >>>> I'm also attaching part of an opensm log file. 
>>>> >>>> (the full copy is at http://scl.ameslab.gov/~troy/osm-ehca.log ) >>>> >>>> The IBM galaxy adapters are at: >>>> Initial path: [0][1][16] >>>> Initial path: [0][1][13] >>>> >>>> >>>> >>> >>> The OpenSM is just saying that a SMP transaction it issued (in this >>> case, SM Get P_KeyTable) is timing out (no response made it back to >>> OpenSM). >>> >>> BTW, what svn rev is OpenSM up to ? >>> >>> -- Hal >>> >>> >> So, how about a patch to opensm to report what svn rev it was built >> from ;) > > Can you do svn info in the userspace/management/osm directory ? Path: . URL: https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd Revision: 3493 Node Kind: directory Schedule: normal Last Changed Author: roland Last Changed Rev: 3487 Last Changed Date: 2005-09-19 17:59:27 -0500 (Mon, 19 Sep 2005) Properties Last Updated: 2005-02-15 16:24:20 -0600 (Tue, 15 Feb 2005) > >> I just discovered another problem.. We have been running pfvs2 over >> IPoIB on the same subnet, and in debugging this, I restarted opensm >> several times, and somewhere in the stack a PVFS2 write failed. I >> wouldn't think that a short downtime of the SM from restarting it >> would >> cause any IPoIB TCP sessions to fall over.. > > As Fab indicated, there are a number of places where the SM/SA is > needed: > 1. SA PathRecords (used when a path to a new IP end node is needed or > an > existing one timesout) > 2. SA MCMemberRecord joins, queries, and leaves (used when an interface > is up'ed, down'ed, etc.) > > Is this on an existing TCP session ? Is it OpenIB IPoIB clients at each > end ? What svn version is being used for this ? > > -- Hal > It looks like each client node maintains an open TCP stream to each of the servers. pvfs2 appears to not be very robust to failure. However the pvfs2 folks just released a new version which changes their network protocol somewhat. 
I plan to get the new version installed next week and will see if it handles things a bit more robustly. Brett From halr at voltaire.com Fri Oct 14 13:30:32 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Oct 2005 16:30:32 -0400 Subject: [Fwd: Re: [openib-general] Re: IBM eHCA testing..] In-Reply-To: <9fe7e2b2f954a437c18321a167d587df@scl.ameslab.gov> References: <43500695.2090108@scl.ameslab.gov> <9fe7e2b2f954a437c18321a167d587df@scl.ameslab.gov> Message-ID: <1129321635.16900.134.camel@hal.voltaire.com> On Fri, 2005-10-14 at 16:22, Brett Bode wrote: > On Oct 14, 2005, at 2:27 PM, Troy Benjegerdes wrote: > > > > > > > From: Hal Rosenstock > > Date: October 14, 2005 12:41:13 PM CDT > > To: Troy Benjegerdes > > Cc: IBMEHCA DD , openib-general at openib.org > > Subject: Re: [openib-general] Re: IBM eHCA testing.. > > > > > > On Fri, 2005-10-14 at 12:08, Troy Benjegerdes wrote: > >> Hal Rosenstock wrote: > >> > >>> On Thu, 2005-10-13 at 18:46, Troy Benjegerdes wrote: > >>> > >>> > >>>> I'm also attaching part of an opensm log file. > >>>> > >>>> (the full copy is at http://scl.ameslab.gov/~troy/osm-ehca.log ) > >>>> > >>>> The IBM galaxy adapters are at: > >>>> Initial path: [0][1][16] > >>>> Initial path: [0][1][13] > >>>> > >>>> > >>>> > >>> > >>> The OpenSM is just saying that a SMP transaction it issued (in this > >>> case, SM Get P_KeyTable) is timing out (no response made it back to > >>> OpenSM). > >>> > >>> BTW, what svn rev is OpenSM up to ? > >>> > >>> -- Hal > >>> > >>> > >> So, how about a patch to opensm to report what svn rev it was built > >> from ;) > > > > Can you do svn info in the userspace/management/osm directory ? > > Path: . 
> URL: https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband > Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd > Revision: 3493 > Node Kind: directory > Schedule: normal > Last Changed Author: roland > Last Changed Rev: 3487 > Last Changed Date: 2005-09-19 17:59:27 -0500 (Mon, 19 Sep 2005) > Properties Last Updated: 2005-02-15 16:24:20 -0600 (Tue, 15 Feb 2005) If you update and rebuild OpenSM, you will get rid of messages like: Oct 13 10:35:38 366848 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x8:0x0] for guid:0x0002c90108ccc571. Oct 13 10:35:38 366866 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x13:0x1] for guid:0x0002550000039e80. Oct 13 10:35:38 366880 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x5:0x0] for guid:0x00066a00a0000441. Oct 13 10:35:38 366894 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x10:0x1] for guid:0x0002c90108cd0b71. Oct 13 10:35:38 366907 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x11:0x1] for guid:0x00066a00a000044e. Oct 13 10:35:38 366921 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x14:0x1] for guid:0x0002550000038500. Oct 13 10:35:38 366934 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x9:0x0] for guid:0x0002c90200402782. Oct 13 10:35:38 366948 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xa:0x0] for guid:0x0002c90108cd98c1. Oct 13 10:35:38 366961 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xd:0x0] for guid:0x0002c90108cd84a1. Oct 13 10:35:38 366975 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xe:0x0] for guid:0x0002c90200402917. Oct 13 10:35:38 366988 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x1:0x0] for guid:0x0002c90200402781. 
Oct 13 10:35:38 367001 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xb:0x0] for guid:0x0002c90108cd9bd1. Oct 13 10:35:38 367015 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x15:0x1] for guid:0x0002550000038580. Oct 13 10:35:38 367028 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x2:0x0] for guid:0x0002c90200402915. Oct 13 10:35:38 367042 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x6:0x0] for guid:0x00066a00a0000444. Oct 13 10:35:38 367055 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x4:0x0] for guid:0x00066a00a000043c. Oct 13 10:35:38 367068 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0xc:0x0] for guid:0x0002c90108cd85f1. Oct 13 10:35:38 367082 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x7:0x0] for guid:0x00066a00a0000458. Oct 13 10:35:38 367095 [AB448A30] -> __osm_lid_mgr_validate_db: ERR 0312: Illegal LID range [0x3:0x0] for guid:0x0002c900001cee10. I don't think this causes any harm though. There are some other fixes you will pick up. > >> I just discovered another problem.. We have been running pfvs2 over > >> IPoIB on the same subnet, and in debugging this, I restarted opensm > >> several times, and somewhere in the stack a PVFS2 write failed. I > >> wouldn't think that a short downtime of the SM from restarting it > >> would > >> cause any IPoIB TCP sessions to fall over.. > > > > As Fab indicated, there are a number of places where the SM/SA is > > needed: > > 1. SA PathRecords (used when a path to a new IP end node is needed or > > an > > existing one timesout) > > 2. SA MCMemberRecord joins, queries, and leaves (used when an interface > > is up'ed, down'ed, etc.) > > > > Is this on an existing TCP session ? Is it OpenIB IPoIB clients at each > > end ? What svn version is being used for this ? 
> > > > -- Hal > > > It looks like each client node maintains an open TCP stream to each of > the servers. pvfs2 appears to not be very robust to failure. However > the pvfs2 folks just released a new version which changes their network > protocol somewhat. I plan to get the new version installed next week > and will see if it handles things a bit more robustly. Is this running on top of OpenIB IPoIB ? If so, what svn version for IPoIB ? Is it the same as OpenSM (3487) ? If so, that should be recent enough and contains the SA reregistration fix for IPoIB. cd linux-kernel/infiniband/ulp/ipoib/ svn info -- Hal From suri at baymicrosystems.com Fri Oct 14 14:22:40 2005 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Fri, 14 Oct 2005 17:22:40 -0400 Subject: [openib-general] rdma/ib_verbs.h In-Reply-To: <1129311672.16900.89.camel@hal.voltaire.com> Message-ID: <200510142122.j9ELMe87025163@ns1.baymicrosystems.com> Folks: While writing a switch driver, I noticed that the alloc_pd and create_cq function signatures are different depending on rdma/ib_verbs.h vs. ib_verbs.h. I don't need RDMA for now, so going with the func signature as in ib_verbs.h is OK or is there a necessity to use rdma/ib_verbs.h? Thanks in advance. Suri From mshefty at ichips.intel.com Fri Oct 14 14:28:05 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 14 Oct 2005 14:28:05 -0700 Subject: [openib-general] rdma/ib_verbs.h In-Reply-To: <200510142122.j9ELMe87025163@ns1.baymicrosystems.com> References: <200510142122.j9ELMe87025163@ns1.baymicrosystems.com> Message-ID: <435022E5.7020204@ichips.intel.com> Suresh Shelvapille wrote: > While writing a switch driver, I noticed that the alloc_pd and create_cq > function signatures are different depending on rdma/ib_verbs.h vs. > ib_verbs.h. I don't need RDMA for now, so going with the func signature as > in ib_verbs.h is OK or is there a necessity to use rdma/ib_verbs.h? It sounds like you have an older file on your system. 
The correct file is in infiniband/rdma. - Sean From rolandd at cisco.com Fri Oct 14 14:30:29 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 14 Oct 2005 14:30:29 -0700 Subject: [openib-general] rdma/ib_verbs.h In-Reply-To: <200510142122.j9ELMe87025163@ns1.baymicrosystems.com> (Suresh Shelvapille's message of "Fri, 14 Oct 2005 17:22:40 -0400") References: <200510142122.j9ELMe87025163@ns1.baymicrosystems.com> Message-ID: <523bn3wzuy.fsf@cisco.com> Suresh> Folks: While writing a switch driver, I noticed that the Suresh> alloc_pd and create_cq function signatures are different Suresh> depending on rdma/ib_verbs.h vs. ib_verbs.h. I don't need Suresh> RDMA for now, so going with the func signature as in Suresh> ib_verbs.h is OK or is there a necessity to use Suresh> rdma/ib_verbs.h? Which two files are you comparing? There should only be one ib_verbs.h file in your kernel tree. Otherwise you're just asking for trouble. You may be getting confused because the include files were moved from drivers/infiniband/ to include/rdma/ between kernel version 2.6.13 and 2.6.14-rc1. However the files were just moved. You should just work against the latest version of ib_verbs.h when writing your driver. - R. From suri at baymicrosystems.com Fri Oct 14 14:38:01 2005 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Fri, 14 Oct 2005 17:38:01 -0400 Subject: [openib-general] rdma/ib_verbs.h In-Reply-To: <523bn3wzuy.fsf@cisco.com> Message-ID: <200510142138.j9ELc187025583@ns1.baymicrosystems.com> I am not sure when I SVNed but, I have the same files under both include/ and include/rdma in my copy of the Infiniband tree!! So, there should not be any header files under 'include', but they should all be under include/rdma? 
Please clarify...thanks, Suri > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Friday, October 14, 2005 5:30 PM > To: Suresh Shelvapille > Cc: openib-general at openib.org > Subject: Re: [openib-general] rdma/ib_verbs.h > > Suresh> Folks: While writing a switch driver, I noticed that the > Suresh> alloc_pd and create_cq function signatures are different > Suresh> depending on rdma/ib_verbs.h vs. ib_verbs.h. I don't need > Suresh> RDMA for now, so going with the func signature as in > Suresh> ib_verbs.h is OK or is there a necessity to use > Suresh> rdma/ib_verbs.h? > > Which two files are you comparing? There should only be one > ib_verbs.h file in your kernel tree. Otherwise you're just asking for > trouble. You may be getting confused because the include files were > moved from drivers/infiniband/ to include/rdma/ between kernel version > 2.6.13 and 2.6.14-rc1. > > However the files were just moved. You should just work against the > latest version of ib_verbs.h when writing your driver. > > - R. From rolandd at cisco.com Fri Oct 14 14:44:58 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 14 Oct 2005 14:44:58 -0700 Subject: [openib-general] rdma/ib_verbs.h In-Reply-To: <200510142138.j9ELc187025583@ns1.baymicrosystems.com> (Suresh Shelvapille's message of "Fri, 14 Oct 2005 17:38:01 -0400") References: <200510142138.j9ELc187025583@ns1.baymicrosystems.com> Message-ID: <52u0fjvkmd.fsf@cisco.com> Suresh> I am not sure when I SVNed but, I have the same files Suresh> under both include/ and include/rdma in my copy of the Suresh> Infiniband tree!! Suresh> So, there should not be any header files under 'include', Suresh> but they should all be under include/rdma? 
Right:

    $ svn ls https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/include
    rdma/
    $ svn ls https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/include/rdma
    ib_addr.h
    ib_at.h
    ib_cache.h
    ib_cm.h
    ib_fmr_pool.h
    ib_mad.h
    ib_pack.h
    ib_sa.h
    ib_smi.h
    ib_user_at.h
    ib_user_cm.h
    ib_user_mad.h
    ib_user_verbs.h
    ib_verbs.h
    rdma_cm.h

There is no revision in the subversion tree where both include/ and
include/rdma/ have files in them, so I'm not sure how you managed to
create your tree.

 - R.

From iod00d at hp.com Fri Oct 14 15:39:53 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 14 Oct 2005 15:39:53 -0700
Subject: [openib-general] [RFC] IB address translation using ARP
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020A11@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F1020A11@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <20051014223953.GB27904@esmail.cup.hp.com>

On Fri, Oct 14, 2005 at 08:38:18AM -0700, Caitlin Bestler wrote:
> > That's not how SDP is going to work on Linux, though. We're
> > not into your crude hacks to let broken applications work
> > with new technology business. Applications will have to use
> > SDP directly or via getaddrinfo and we will never put in a
> > broken sockets switch.
>
> I can't think of a better example of something that is truly
> brain dead than an application *written* to use Sockets Direct
> Protocol.

That wasn't hch's point. His point was that the kernel would never make
SDP transparent to user space. But using LD_PRELOAD and libsdp.so,
SDP can be used transparently by the application. It's the user's
(ok, maybe sysadmin's) choice.

...
> So if you aren't preserving the sockets API what is
> the point in using the protocol?

Yes, the intent is to let the application continue using the sockets API.
But if the sysadmin is asking for AF_INET or AF_INET6, then they want
TCP/IP *plus* netfilter and other features in the linux kernel.
Not something else.
If the sysadmin decides they don't need netfilter/tcp, then they can use
LD_PRELOAD as noted above.

> > And can you _please_ stop all this time to market and
> > similar business crap? That simply doesn't matter when
> > designing something properly.
>
> If we really were to play stop-the-world-while-I-redesign-it
> games then the resulting solution would not use sockets, TCP
> or even Linux.

Well, the linux kernel doesn't play "stop-the-world-while-I-redesign-it".
The revolution happened (open source collaborative development).
Linux kernel development is now an evolution.

Rule #1 for linux kernel development: "labor is free"

We *know* that's not true in commercial reality. But kernel
development just works that way because of its origins and
Linus likes it that way. If someone wants something changed in the
linux kernel, they can develop/submit the changes themselves or
pay someone to do the work. In either case, Linus doesn't pay for it.

Seems like a sufficient number of smart people agree with him and
play the game the way he has defined it. The folks who do NOT like
his game grab some version of the source tree and do what they
like with it (as long as they meet licensing requirements).
That's ok too.

> Real solutions, from NICs through Operating
> Systems, recognize that their legacy is part of their strength
> as well as a nuisance.

Legacy is definitely a linux strength.
Open source does NOT ignore legacy applications:

1) Anyone can continue to update and run on the linux kernel version
   they have source code for if they don't want to (or can't) change
   the application or newer kernels break the ABI. Many people are
   still very happy using 2.4 linux kernels.

   [ Linux kernel has no ABI obligation to closed source apps
   given the availability of source code. That's what vendors like RH,
   SuSE, and their competitors are paid to provide - support for
   stable ABI.
]

2) kernel developers DO modify open source user programs to work
   with "updated" kernel interfaces if there is a clear advantage.
   scsitools and pciutils might be good examples.
   X.org might be a more contemporary one.

3) kernel developers do NOT break an API/ABI just because it's
   Tuesday and they had a bad burrito for lunch. They eat their
   own dogfood and don't want to have ABI "events" on their box
   once a week either. Some ABIs have been deprecated or intentionally
   broken to improve things. But that's not the norm.
   We know it's not painless.

4) deprecated functionality is clearly marked and only removed after
   a reasonably long period (at least 12 months, usually 2-3 years).
   I know apps live longer than that.

I live in many worlds: "paid to provide stable ABI", "be good citizen,
make changes available upstream", and "upstream is cost effective for HP".

hth,
grant

From sean.hefty at intel.com Fri Oct 14 16:00:44 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 14 Oct 2005 16:00:44 -0700
Subject: [openib-general] [PATCH] [MAD/Agent] convert agent.c to use ib_create_send_mad()
Message-ID:

The following patch converts agent.c to call ib_create_send_mad() when
sending a MAD. I also cleaned up the code in several ways:

- Removed agent_priv.h. (Whoever commits will need to do svn rm.)
- Moved ib_agent_port_list_lock internal to agent.c.
- Removed unused code from __ib_get_agent_port().
- Simplified agents to be generic send MAD agents for QP0/1.
- Removed unneeded send tracking.
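
[Editor's sketch] The rewritten agent_send_response() in the patch below acquires an address handle, then a send buffer, then posts, and unwinds in reverse order through the err2/err1 labels when a step fails. The following is a minimal user-space illustration of that unwinding pattern under stated assumptions: all demo_* names are hypothetical, malloc() stands in for the kernel allocators, the completion is simulated inline, and the function returns int only so the outcome can be checked (the patch's function returns void).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical resources standing in for the address handle and send buffer. */
struct demo_ah { int unused; };
struct demo_buf { struct demo_ah *ah; };

static int demo_live_objects;	/* allocations not yet released */

static struct demo_ah *demo_create_ah(void)
{
	struct demo_ah *ah = malloc(sizeof *ah);

	if (ah)
		demo_live_objects++;
	return ah;
}

static void demo_destroy_ah(struct demo_ah *ah)
{
	free(ah);
	demo_live_objects--;
}

static struct demo_buf *demo_create_buf(struct demo_ah *ah)
{
	struct demo_buf *buf = malloc(sizeof *buf);

	if (buf) {
		buf->ah = ah;
		demo_live_objects++;
	}
	return buf;
}

static void demo_free_buf(struct demo_buf *buf)
{
	free(buf);
	demo_live_objects--;
}

/* Completion side: releases what a successful post handed over. */
static void demo_send_handler(struct demo_buf *buf)
{
	assert(buf != NULL);
	demo_destroy_ah(buf->ah);
	demo_free_buf(buf);
}

/* post_rc simulates the post result: 0 on success, nonzero on failure. */
static int demo_send_response(int post_rc)
{
	struct demo_ah *ah;
	struct demo_buf *buf;

	ah = demo_create_ah();
	if (!ah)
		return -1;

	buf = demo_create_buf(ah);
	if (!buf)
		goto err1;

	if (post_rc)
		goto err2;

	/* Posted: ownership passes to the completion handler (run inline here). */
	demo_send_handler(buf);
	return 0;

err2:
	demo_free_buf(buf);
err1:
	demo_destroy_ah(ah);
	return -1;
}
```

In this sketch each label releases exactly what had been acquired before the failing step, so both the success and the failure path leave demo_live_objects at zero.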
Signed-off-by: Sean Hefty Index: agent.c =================================================================== --- agent.c (revision 3692) +++ agent.c (working copy) @@ -36,58 +36,41 @@ * * $Id$ */ +#include "agent.h" +#include "smi.h" -#include -#include - -#include +#define SPFX "ib_agent: " -#include "smi.h" -#include "agent_priv.h" -#include "mad_priv.h" -#include "agent.h" +struct ib_agent_port_private { + struct list_head port_list; + struct ib_mad_agent *agent[2]; +}; -spinlock_t ib_agent_port_list_lock; +static DEFINE_SPINLOCK(ib_agent_port_list_lock); static LIST_HEAD(ib_agent_port_list); -/* - * Caller must hold ib_agent_port_list_lock - */ -static inline struct ib_agent_port_private * -__ib_get_agent_port(struct ib_device *device, int port_num, - struct ib_mad_agent *mad_agent) +static struct ib_agent_port_private * +__ib_get_agent_port(struct ib_device *device, int port_num) { struct ib_agent_port_private *entry; - BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ - - if (device) { - list_for_each_entry(entry, &ib_agent_port_list, port_list) { - if (entry->smp_agent->device == device && - entry->port_num == port_num) - return entry; - } - } else { - list_for_each_entry(entry, &ib_agent_port_list, port_list) { - if ((entry->smp_agent == mad_agent) || - (entry->perf_mgmt_agent == mad_agent)) - return entry; - } + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if (entry->agent[0]->device == device && + entry->agent[0]->port_num == port_num) + return entry; } return NULL; } -static inline struct ib_agent_port_private * -ib_get_agent_port(struct ib_device *device, int port_num, - struct ib_mad_agent *mad_agent) +static struct ib_agent_port_private * +ib_get_agent_port(struct ib_device *device, int port_num) { struct ib_agent_port_private *entry; unsigned long flags; spin_lock_irqsave(&ib_agent_port_list_lock, flags); - entry = __ib_get_agent_port(device, port_num, mad_agent); + entry = __ib_get_agent_port(device, port_num); 
spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); - return entry; } @@ -99,192 +82,67 @@ int smi_check_local_dr_smp(struct ib_smp if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) return 1; - port_priv = ib_get_agent_port(device, port_num, NULL); + + port_priv = ib_get_agent_port(device, port_num); if (!port_priv) { printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " - "not open\n", - device->name, port_num); + "not open\n", device->name, port_num); return 1; } - return smi_check_local_smp(port_priv->smp_agent, smp); + return smi_check_local_smp(port_priv->agent[0], smp); } -static int agent_mad_send(struct ib_mad_agent *mad_agent, - struct ib_agent_port_private *port_priv, - struct ib_mad_private *mad_priv, - struct ib_grh *grh, - struct ib_wc *wc) -{ - struct ib_agent_send_wr *agent_send_wr; - struct ib_sge gather_list; - struct ib_send_wr send_wr; - struct ib_send_wr *bad_send_wr; - struct ib_ah_attr ah_attr; - unsigned long flags; - int ret = 1; - - agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); - if (!agent_send_wr) - goto out; - agent_send_wr->mad = mad_priv; - - gather_list.addr = dma_map_single(mad_agent->device->dma_device, - &mad_priv->mad, - sizeof(mad_priv->mad), - DMA_TO_DEVICE); - gather_list.length = sizeof(mad_priv->mad); - gather_list.lkey = mad_agent->mr->lkey; - - send_wr.next = NULL; - send_wr.opcode = IB_WR_SEND; - send_wr.sg_list = &gather_list; - send_wr.num_sge = 1; - send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ - send_wr.wr.ud.timeout_ms = 0; - send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; - - ah_attr.dlid = wc->slid; - ah_attr.port_num = mad_agent->port_num; - ah_attr.src_path_bits = wc->dlid_path_bits; - ah_attr.sl = wc->sl; - ah_attr.static_rate = 0; - ah_attr.ah_flags = 0; /* No GRH */ - if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - if (wc->wc_flags & IB_WC_GRH) { - ah_attr.ah_flags = IB_AH_GRH; - /* Should sgid be looked up ? 
*/ - ah_attr.grh.sgid_index = 0; - ah_attr.grh.hop_limit = grh->hop_limit; - ah_attr.grh.flow_label = be32_to_cpu( - grh->version_tclass_flow) & 0xfffff; - ah_attr.grh.traffic_class = (be32_to_cpu( - grh->version_tclass_flow) >> 20) & 0xff; - memcpy(ah_attr.grh.dgid.raw, - grh->sgid.raw, - sizeof(ah_attr.grh.dgid)); - } - } - - agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); - if (IS_ERR(agent_send_wr->ah)) { - printk(KERN_ERR SPFX "No memory for address handle\n"); - kfree(agent_send_wr); - goto out; - } - - send_wr.wr.ud.ah = agent_send_wr->ah; - if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - send_wr.wr.ud.pkey_index = wc->pkey_index; - send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; - } else { /* for SMPs */ - send_wr.wr.ud.pkey_index = 0; - send_wr.wr.ud.remote_qkey = 0; - } - send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr; - send_wr.wr_id = (unsigned long)agent_send_wr; - - pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); - - /* Send */ - spin_lock_irqsave(&port_priv->send_list_lock, flags); - if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - dma_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(agent_send_wr, mapping), - sizeof(mad_priv->mad), - DMA_TO_DEVICE); - ib_destroy_ah(agent_send_wr->ah); - kfree(agent_send_wr); - } else { - list_add_tail(&agent_send_wr->send_list, - &port_priv->send_posted_list); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - ret = 0; - } - -out: - return ret; -} - -int agent_send(struct ib_mad_private *mad, - struct ib_grh *grh, - struct ib_wc *wc, - struct ib_device *device, - int port_num) +void agent_send_response(struct ib_mad *mad, struct ib_grh *grh, + struct ib_wc *wc, struct ib_device *device, + int port_num, int qpn) { struct ib_agent_port_private *port_priv; - struct ib_mad_agent *mad_agent; + struct ib_mad_agent *agent; + struct ib_mad_send_buf *send_buf; + struct ib_send_wr 
*bad_wr; + struct ib_ah *ah; - port_priv = ib_get_agent_port(device, port_num, NULL); - if (!port_priv) { - printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", - device->name, port_num); - return 1; - } + port_priv = ib_get_agent_port(device, port_num); + if (!port_priv) + return; - /* Get mad agent based on mgmt_class in MAD */ - switch (mad->mad.mad.mad_hdr.mgmt_class) { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - case IB_MGMT_CLASS_SUBN_LID_ROUTED: - mad_agent = port_priv->smp_agent; - break; - case IB_MGMT_CLASS_PERF_MGMT: - mad_agent = port_priv->perf_mgmt_agent; - break; - default: - return 1; - } + agent = port_priv->agent[qpn]; + ah = ib_create_ah_from_wc(agent->qp->pd, wc, grh, port_num); + if (IS_ERR(ah)) + return; - return agent_mad_send(mad_agent, port_priv, mad, grh, wc); + send_buf = ib_create_send_mad(agent, wc->src_qp, wc->pkey_index, ah, 0, + sizeof *mad - IB_MGMT_MAD_DATA, + IB_MGMT_MAD_DATA, GFP_KERNEL); + if (IS_ERR(send_buf)) + goto err1; + + *send_buf->mad = *mad; + if (ib_post_send_mad(agent, &send_buf->send_wr, &bad_wr)) + goto err2; + return; +err2: + ib_free_send_mad(send_buf); +err1: + ib_destroy_ah(ah); } static void agent_send_handler(struct ib_mad_agent *mad_agent, struct ib_mad_send_wc *mad_send_wc) { - struct ib_agent_port_private *port_priv; - struct ib_agent_send_wr *agent_send_wr; - unsigned long flags; + struct ib_mad_send_buf *send_buf; - /* Find matching MAD agent */ - port_priv = ib_get_agent_port(NULL, 0, mad_agent); - if (!port_priv) { - printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " - "agent %p\n", mad_agent); - return; - } - - agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; - spin_lock_irqsave(&port_priv->send_list_lock, flags); - /* Remove completed send from posted send MAD list */ - list_del(&agent_send_wr->send_list); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - - dma_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(agent_send_wr, mapping), 
- sizeof(agent_send_wr->mad->mad), - DMA_TO_DEVICE); - - ib_destroy_ah(agent_send_wr->ah); - - /* Release allocated memory */ - kmem_cache_free(ib_mad_cache, agent_send_wr->mad); - kfree(agent_send_wr); + send_buf = (void *)(unsigned long) mad_send_wc->wr_id; + ib_destroy_ah(send_buf->send_wr.wr.ud.ah); + ib_free_send_mad(send_buf); } int ib_agent_port_open(struct ib_device *device, int port_num) { - int ret; struct ib_agent_port_private *port_priv; unsigned long flags; - - /* First, check if port already open for SMI */ - port_priv = ib_get_agent_port(device, port_num, NULL); - if (port_priv) { - printk(KERN_DEBUG SPFX "%s port %d already open\n", - device->name, port_num); - return 0; - } + int ret; /* Create new device info */ port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); @@ -293,32 +151,25 @@ int ib_agent_port_open(struct ib_device ret = -ENOMEM; goto error1; } - memset(port_priv, 0, sizeof *port_priv); - port_priv->port_num = port_num; - spin_lock_init(&port_priv->send_list_lock); - INIT_LIST_HEAD(&port_priv->send_posted_list); - - /* Obtain send only MAD agent for SM class (SMI QP) */ - port_priv->smp_agent = ib_register_mad_agent(device, port_num, - IB_QPT_SMI, - NULL, 0, - &agent_send_handler, - NULL, NULL); - if (IS_ERR(port_priv->smp_agent)) { - ret = PTR_ERR(port_priv->smp_agent); + /* Obtain send only MAD agent for SMI QP */ + port_priv->agent[0] = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->agent[0])) { + ret = PTR_ERR(port_priv->agent[0]); goto error2; } - /* Obtain send only MAD agent for PerfMgmt class (GSI QP) */ - port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, - IB_QPT_GSI, - NULL, 0, - &agent_send_handler, - NULL, NULL); - if (IS_ERR(port_priv->perf_mgmt_agent)) { - ret = PTR_ERR(port_priv->perf_mgmt_agent); + /* Obtain send only MAD agent for GSI QP */ + port_priv->agent[1] = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, NULL, 0, 
+ &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->agent[1])) { + ret = PTR_ERR(port_priv->agent[1]); goto error3; } @@ -329,7 +180,7 @@ int ib_agent_port_open(struct ib_device return 0; error3: - ib_unregister_mad_agent(port_priv->smp_agent); + ib_unregister_mad_agent(port_priv->agent[0]); error2: kfree(port_priv); error1: @@ -342,7 +193,7 @@ int ib_agent_port_close(struct ib_device unsigned long flags; spin_lock_irqsave(&ib_agent_port_list_lock, flags); - port_priv = __ib_get_agent_port(device, port_num, NULL); + port_priv = __ib_get_agent_port(device, port_num); if (port_priv == NULL) { spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); printk(KERN_ERR SPFX "Port %d not found\n", port_num); @@ -351,9 +202,8 @@ int ib_agent_port_close(struct ib_device list_del(&port_priv->port_list); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); - ib_unregister_mad_agent(port_priv->perf_mgmt_agent); - ib_unregister_mad_agent(port_priv->smp_agent); + ib_unregister_mad_agent(port_priv->agent[1]); + ib_unregister_mad_agent(port_priv->agent[0]); kfree(port_priv); - return 0; } Index: agent.h =================================================================== --- agent.h (revision 3692) +++ agent.h (working copy) @@ -39,17 +39,14 @@ #ifndef __AGENT_H_ #define __AGENT_H_ -extern spinlock_t ib_agent_port_list_lock; +#include -extern int ib_agent_port_open(struct ib_device *device, - int port_num); +extern int ib_agent_port_open(struct ib_device *device, int port_num); extern int ib_agent_port_close(struct ib_device *device, int port_num); -extern int agent_send(struct ib_mad_private *mad, - struct ib_grh *grh, - struct ib_wc *wc, - struct ib_device *device, - int port_num); +extern void agent_send_response(struct ib_mad *mad, struct ib_grh *grh, + struct ib_wc *wc, struct ib_device *device, + int port_num, int qpn); #endif /* __AGENT_H_ */ Index: mad.c =================================================================== --- mad.c (revision 3692) +++ mad.c 
(working copy)
@@ -1728,11 +1728,11 @@ local:
             if (ret & IB_MAD_RESULT_CONSUMED)
                 goto out;
             if (ret & IB_MAD_RESULT_REPLY) {
-                /* Send response */
-                if (!agent_send(response, &recv->grh, wc,
-                                port_priv->device,
-                                port_priv->port_num))
-                    response = NULL;
+                agent_send_response(&response->mad.mad,
+                                    &recv->grh, wc,
+                                    port_priv->device,
+                                    port_priv->port_num,
+                                    qp_info->qp->qp_num);
                 goto out;
             }
         }
@@ -2761,7 +2761,6 @@ static int __init ib_mad_init_module(voi
     int ret;
 
     spin_lock_init(&ib_mad_port_list_lock);
-    spin_lock_init(&ib_agent_port_list_lock);
 
     ib_mad_cache = kmem_cache_create("ib_mad",
                                      sizeof(struct ib_mad_private),
Index: smi.h
===================================================================
--- smi.h (revision 3692)
+++ smi.h (working copy)
@@ -35,10 +35,11 @@
  *
  * $Id$
  */
-
 #ifndef __SMI_H_
 #define __SMI_H_
 
+#include
+
 int smi_handle_dr_smp_recv(struct ib_smp *smp,
                            u8 node_type,
                            int port_num,

From troy at scl.ameslab.gov Fri Oct 14 16:19:55 2005
From: troy at scl.ameslab.gov (Troy Benjegerdes)
Date: Fri, 14 Oct 2005 18:19:55 -0500
Subject: [openib-general] Cray XD1 and OpenSM.. (ignoreing certain guids?)
Message-ID: <20051014231954.GC8748@minbar.scl.ameslab.gov>

In the interest of plugging absolutely everything I have with InfiniBand
ports together and seeing what falls over, I connected a Cray XD1 to a
small (2 machine) InfiniBand network running OpenSM.

Ideally, I'd like to find out what sort of minimal emulation code needs
to be running on the XD1 nodes to answer standards-compliant SM
requests. Failing that, is there a way we can tell OpenSM to ignore parts
of the IB network? I am seeing a lot of stuff in the osm.log like this:

(port 16 is on the internal Mellanox chip in the XD1)

Oct 14 18:07:22 646959 [43005960] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=15) -- dropping.
Oct 14 18:07:22 646972 [43005960] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 3 DR SLID 0x0 DR DLID 0x0
Oct 14 18:07:22 646983 [43005960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT).
Oct 14 18:07:22 647020 [43005960] -> SMP dump:
    base_ver................0x1
    mgmt_class..............0x81
    class_ver...............0x1
    method..................0x1 (SubnGet)
    D bit...................0x0
    status..................0x0
    hop_ptr.................0x0
    hop_count...............0x3
    trans_id................0x1420
    attr_id.................0x15 (PortInfo)
    resv....................0x0
    attr_mod................0x0
    m_key...................0x0000000000000000
    dr_slid.................0xFFFF
    dr_dlid.................0xFFFF

    Initial path: [0][1][4][16]
    Return path: [0][0][0][0]
    Reserved: [0][0][0][0][0][0][0]

    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

From rolandd at cisco.com Fri Oct 14 17:46:46 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 14 Oct 2005 17:46:46 -0700
Subject: [openib-general] Initial ipath review brain dump
Message-ID: <524q7jvc7d.fsf@cisco.com>

Now that I got through reviewing the generic parts of the PathScale
merge and the low-level driver is on the trunk, I started looking
through the real driver. I'm only about a third of the way through
infinipath_core.c, but here's a quick dump of what I see as needing
work so far:

flatten source from four dirs into a single directory?

Makefiles/Kconfig shouldn't have huge copyright notices (or any
copyright notices at all for that matter)

You need better Kconfig help text.

Consistent naming convention ipath vs infinipath?

get rid of "openib" uses -- these are just Linux drivers and I don't
think OpenIB nomenclature is appropriate or helpful.

You already depend on PCI_MSI in Kconfig -- no need to test in source file

this must be a bug (since various places do ipath_sma_first++ etc):
    static volatile unsigned ipath_sma_first; /* oldest sma packet index */
use lock around it instead

In general "volatile" is a pretty good marker for bugs and is almost
never correct in a declaration.

get rid of infinipath_stats typedef -- just use struct infinipath_stats

no need to have a sysctl for debug level -- just use module param

move /proc files to sysfs/debugfs as appropriate

no need to test #ifndef HAVE_COMPAT_IOCTL -- new kernels have it.
same for HAVE_UNLOCKED_IOCTL

It's OK to maintain backport patches separately but the driver that
goes upstream shouldn't have this obsolete code.

Don't hard-code limit of 4 devices

dev = ++chip_idx; is a bug if PCI probing becomes multi-threaded.
just allocate dev structs as needed rather than having a static table

Add required PCI cap/HT stuff to drivers/pci and linux/pci_regs.h
instead of hiding in ipath_setup_htconfig(). It does seem an
extension to pci_find_capability() is required to handle multiple
capabilities with the same ID.

make debugging code simpler (consider relayfs or just printk -- if you
temporarily turn off messages going to the console with dmesg -n, and
have CONFIG_PRINTK_TIME=y, I think printk does everything you want)

put .owner = THIS_MODULE in ipath_fops instead of fooling with
try_module_get()/module_put() -- that's racy.

From halr at voltaire.com Fri Oct 14 18:48:24 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 Oct 2005 21:48:24 -0400
Subject: [openib-general] Cray XD1 and OpenSM.. (ignoreing certain guids?)
In-Reply-To: <20051014231954.GC8748@minbar.scl.ameslab.gov> References: <20051014231954.GC8748@minbar.scl.ameslab.gov> Message-ID: <1129340904.16900.198.camel@hal.voltaire.com> On Fri, 2005-10-14 at 19:19, Troy Benjegerdes wrote: > In the interest of plugging absolutely everything I have with infiniband > ports together and seeing what falls over, I connected a Cray XD1 to a > small (2 machine) infiniband network running OpenSM. > > Ideally, I'd like to find out what sort of minimal emulation code needs > to be running on the XD1 nodes to answer standards compliant SM > requests. You need an SMA on the Cray node. > Failing that, Is there a way we can tell OpenSM to ignore parts > of the IB network? I am seeing a lot of stuff in the osm.log like this: > > (port 16 is on the internal mellanox chip in the XD1) > > ct 14 18:07:22 646959 [43005960] -> umad_receiver: ERR 5409: send > completed with error (method=1 attr=15) -- dropping. > Oct 14 18:07:22 646972 [43005960] -> umad_receiver: ERR 5411: DR SMP hop > ptr 0 hop count 3 DR SLID 0x0 DR DLID 0x0 > Oct 14 18:07:22 646983 [43005960] -> __osm_sm_mad_ctrl_send_err_cb: ERR > 3113: MAD completed in error (IB_TIMEOUT). 
> Oct 14 18:07:22 647020 [43005960] -> SMP dump: > base_ver................0x1 > mgmt_class..............0x81 > class_ver...............0x1 > method..................0x1 (SubnGet) > D bit...................0x0 > status..................0x0 > hop_ptr.................0x0 > hop_count...............0x3 > trans_id................0x1420 > attr_id.................0x15 (PortInfo) > resv....................0x0 > attr_mod................0x0 > m_key...................0x0000000000000000 > dr_slid.................0xFFFF > dr_dlid.................0xFFFF > > Initial path: [0][1][4][16] > Return path: [0][0][0][0] > Reserved: [0][0][0][0][0][0][0] > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 I'm unaware of such an option. Not sure how you would specify which nodes to ignore. Why would you want them on the net if they are to be ignored ? Nodes are supposed to be IB compliant: SMA is a required component of all nodes. 
-- Hal From nacc at us.ibm.com Fri Oct 14 22:24:56 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Fri, 14 Oct 2005 22:24:56 -0700 Subject: [openib-general] Latest build test results In-Reply-To: <1129086927.4377.12455.camel@hal.voltaire.com> References: <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <1128692935.4382.7072.camel@hal.voltaire.com> <20051011214521.GM5972@us.ibm.com> <1129080434.4377.12024.camel@hal.voltaire.com> <20051012013930.GB13157@us.ibm.com> <1129086927.4377.12455.camel@hal.voltaire.com> Message-ID: <20051015052456.GF28213@us.ibm.com> On 11.10.2005 [23:15:27 -0400], Hal Rosenstock wrote: > Hi again Nish, > > On Tue, 2005-10-11 at 21:39, Nishanth Aravamudan wrote: > > > > > Update arp_recv functions to latest 2.6.14 netdevice.h API for struct > > > > > packet_type > > > > > > > > Sorry for the delay, I haven't yet had time to test the patches :/ I'll > > > > try to get to it tonight or tomorrow. > > > > > > > > Is there anyway you can send me patches against the kernel tree as > > > > opposed to the svn repo? It makes my side of things *a lot* easier, as > > > > right now I have to take your patch against svn and either hand-edit or > > > > patch my checkout and then diff against the current kernel tree. > > > > > > Since you were reporting iSER, AT, and SDP compile warnings/errors, > > > aren't you using the latest OpenIB svn tree with 2.6.14-rc3 ? > > > > Yes; but you have to understand that the automated build system I have > > access to 1) does not have external internet access (i.e., to the svn > > tree) and 2) only builds kernels unless I manually send commands to the > > terminal. 
> >
> > So, the way I'm doing things is:
> >
> > Send in 4 jobs for mainline (x86 and ppc64 with =y and =m) and then
> > generate a patch of the latest svn tree against the current -git release
> > (a patch to the kernel) and send it in as a parameter to my builds to
> > test the latest svn tree. This leads to another 4 jobs (x86 and ppc64
> > with =y and =m).
> >
> > I'm *only* doing kernel build testing right now.
> >
> > > Which patches are you referring to ? Was it the fib_frontend.c one ?
> > > Not sure why they would need any manual fixup. At least that one was
> > > pretty straightforward.
> >
> > In the sense that I have to edit them to kernel relative paths, not in
> > the content of the patch. To test any patch in the system I have access
> > to, it needs to be a normal kernel patch (-p1 applicable to the base
> > tree).
> >
> > Going through and manually applying patches to the svn tree and then
> > regenerating the diff completely defeats the purpose of automated
> > compilation testing.
>
> OK. Do you need any patches regenerated or is this more for the future ?

Please check in the at.c, sdp_link.c and iser.h fixes, as now gen2 code
builds on x86 and ppc64 with only the following warning (which I believe
is new)

drivers/infiniband/core/addr.c:330: warning: initialization from incompatible pointer type

when the patches are applied. Without them the x86 build fails
completely and the ppc64 build emits several warnings.

Sorry for the *long* delay, it took a bit of effort to get the patches
to cooperate with our automated build system. Thanks to Hal for his
quick response and generous patience in waiting for my ack.

So, officially, I give

Acked-by: Nishanth Aravamudan

to the at.c, sdp_link.c and iser.h fixes.
Thanks,
Nish

References: <20051014231954.GC8748@minbar.scl.ameslab.gov> <1129340904.16900.198.camel@hal.voltaire.com>
Message-ID: <1129368676.16900.592.camel@hal.voltaire.com>

On Fri, 2005-10-14 at 21:48, Hal Rosenstock wrote:
> On Fri, 2005-10-14 at 19:19, Troy Benjegerdes wrote:
> > In the interest of plugging absolutely everything I have with infiniband
> > ports together and seeing what falls over, I connected a Cray XD1 to a
> > small (2 machine) infiniband network running OpenSM.
> >
> > Ideally, I'd like to find out what sort of minimal emulation code needs
> > to be running on the XD1 nodes to answer standards compliant SM
> > requests.
>
> You need an SMA on the Cray node.
>
> > Failing that, Is there a way we can tell OpenSM to ignore parts
> > of the IB network? I am seeing a lot of stuff in the osm.log like this:
> >
> > (port 16 is on the internal mellanox chip in the XD1)
> >
> > ct 14 18:07:22 646959 [43005960] -> umad_receiver: ERR 5409: send
> > completed with error (method=1 attr=15) -- dropping.
> > Oct 14 18:07:22 646972 [43005960] -> umad_receiver: ERR 5411: DR SMP hop
> > ptr 0 hop count 3 DR SLID 0x0 DR DLID 0x0
> > Oct 14 18:07:22 646983 [43005960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
> > 3113: MAD completed in error (IB_TIMEOUT).
> > Oct 14 18:07:22 647020 [43005960] -> SMP dump:
> > base_ver................0x1
> > mgmt_class..............0x81
> > class_ver...............0x1
> > method..................0x1 (SubnGet)
> > D bit...................0x0
> > status..................0x0
> > hop_ptr.................0x0
> > hop_count...............0x3
> > trans_id................0x1420
> > attr_id.................0x15 (PortInfo)
> > resv....................0x0
> > attr_mod................0x0
> > m_key...................0x0000000000000000
> > dr_slid.................0xFFFF
> > dr_dlid.................0xFFFF
> >
> > Initial path: [0][1][4][16]
> > Return path: [0][0][0][0]
> > Reserved: [0][0][0][0][0][0][0]
> >
> > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> I'm unaware of such an option. Not sure how you would specify which
> nodes to ignore. Why would you want them on the net if they are to be
> ignored ?
>
> Nodes are supposed to be IB compliant: SMA is a required component of
> all nodes.

So I presume there is no SMA for the Cray XD1. If someone is going to
implement this, we can document what portion of the SMA needs to be
implemented to work with OpenSM. That wouldn't necessarily guarantee it
should work with any SM as other SMs may rely on some slightly different
things or do things in a slightly different way since there is much more
flexibility allowed on the SM side.
-- Hal From halr at voltaire.com Sat Oct 15 03:05:54 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Oct 2005 06:05:54 -0400 Subject: [openib-general] Latest build test results In-Reply-To: <20051015052456.GF28213@us.ibm.com> References: <20051006171128.GA15908@us.ibm.com> <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <1128692935.4382.7072.camel@hal.voltaire.com> <20051011214521.GM5972@us.ibm.com> <1129080434.4377.12024.camel@hal.voltaire.com> <20051012013930.GB13157@us.ibm.com> <1129086927.4377.12455.camel@hal.voltaire.com> <20051015052456.GF28213@us.ibm.com> Message-ID: <1129370754.16900.664.camel@hal.voltaire.com> Hi Nish, On Sat, 2005-10-15 at 01:24, Nishanth Aravamudan wrote: > On 11.10.2005 [23:15:27 -0400], Hal Rosenstock wrote: > > Hi again Nish, > > > > On Tue, 2005-10-11 at 21:39, Nishanth Aravamudan wrote: > > > > > > Update arp_recv functions to latest 2.6.14 netdevice.h API for struct > > > > > > packet_type > > > > > > > > > > Sorry for the delay, I haven't yet had time to test the patches :/ I'll > > > > > try to get to it tonight or tomorrow. > > > > > > > > > > Is there anyway you can send me patches against the kernel tree as > > > > > opposed to the svn repo? It makes my side of things *a lot* easier, as > > > > > right now I have to take your patch against svn and either hand-edit or > > > > > patch my checkout and then diff against the current kernel tree. > > > > > > > > Since you were reporting iSER, AT, and SDP compile warnings/errors, > > > > aren't you using the latest OpenIB svn tree with 2.6.14-rc3 ? > > > > > > Yes; but you have to understand that the automated build system I have > > > access to 1) does not have external internet access (i.e., to the svn > > > tree) and 2) only builds kernels unless I manually send commands to the > > > terminal. 
> > > > > > So, the way I'm doing things is: > > > > > > Send in 4 jobs for mainline (x86 and ppc64 with =y and =m) and then > > > generate a patch of the latest svn tree against the current -git release > > > (a patch to the kernel) and send it in as a parameter to my builds to > > > test the latest svn tree. This leads to another 4 jobs (x86 and ppc64 > > > with =y and =m). > > > > > > I'm *only* doing kernel build testing right now. > > > > > > > Which patches are you referring to ? Was it the fib_frontend.c one ? > > > > Not sure why they would need any manual fixup. At least that one was > > > > pretty straightforward. > > > > > > In the sense that I have to edit them to kernel relative paths, not in > > > the content of the patch. To test any patch in the system I have access > > > to, it needs to be a normal kernel patch (-p1 applicable to the base > > > tree). > > > > > > Going through and manually applying patches to the svn tree and then > > > regenerating the diff completely defeats the purpose of automated > > > compilation testing. > > > > OK. Do you need any patches regenerated or is this more for the future ? > > Please check-in the at.c, sdp_link.c and iser.h fixes, as now gen2 code > builds on x86 and ppc64 with only the following warning (which I believe > is new) > > drivers/infiniband/core/addr.c:330: warning: initialization from incompatible pointer type > > when the patches are applied. Without them the x86 build fails > completely and the ppc64 build emits several warning. > > Sorry for the *long* delay, it took a bit of effort to get the patches > to cooperate with our automated build system. Thanks to Hal for his > quick response and generous patience in waiting for my ack. > > So, officially, I give > > Acked-by: Nishanth Aravamudan > > to the at.c, sdp_link.c and iser.h fixes. Thanks for trying out these patches. Sorry for the manual intervention. 
I regenerated the patches for fib_frontend.c, at.c, and sdp_link.c and they are in linux-kernel/patches. Hopefully these will work with your automated build system. These are found in linux-kernel/patches as: linux-2.6.14-rc3-at.diff linux-2.6.14-rc3-fib-frontend.diff linux-2.6.14-rc3-sdp_link.diff Dan will be checking in the iser.h fix. -- Hal From halr at voltaire.com Sat Oct 15 03:46:06 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Oct 2005 06:46:06 -0400 Subject: [openib-general] Re: [PATCH] [MAD/Agent] convert agent.c to use ib_create_send_mad() In-Reply-To: References: Message-ID: <1129373165.16900.679.camel@hal.voltaire.com> On Fri, 2005-10-14 at 19:00, Sean Hefty wrote: > The following patch converts agent.c to call ib_create_send_mad() when > sending a MAD. I also cleaned up the code in several ways: > > - Removed agent_priv.h. (Whoever commits will need to do svn rm.) > - Moved ib_agent_port_list_lock internal to agent.c. > - Removed unused code from __ib_get_agent_port(). > - Simplified agents to be generic send MAD agents for QP0/1. > - Removed unneeded send tracking. > > Signed-off-by: Sean Hefty Looks good. One comment below on agent_send_response. Have you tested this ? 
-- Hal > Index: agent.c > =================================================================== > --- agent.c (revision 3692) > +++ agent.c (working copy) > @@ -36,58 +36,41 @@ > * > * $Id$ > */ > +#include "agent.h" > +#include "smi.h" > > -#include > -#include > - > -#include > +#define SPFX "ib_agent: " > > -#include "smi.h" > -#include "agent_priv.h" > -#include "mad_priv.h" > -#include "agent.h" > +struct ib_agent_port_private { > + struct list_head port_list; > + struct ib_mad_agent *agent[2]; > +}; > > -spinlock_t ib_agent_port_list_lock; > +static DEFINE_SPINLOCK(ib_agent_port_list_lock); > static LIST_HEAD(ib_agent_port_list); > > -/* > - * Caller must hold ib_agent_port_list_lock > - */ > -static inline struct ib_agent_port_private * > -__ib_get_agent_port(struct ib_device *device, int port_num, > - struct ib_mad_agent *mad_agent) > +static struct ib_agent_port_private * > +__ib_get_agent_port(struct ib_device *device, int port_num) > { > struct ib_agent_port_private *entry; > > - BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ > - > - if (device) { > - list_for_each_entry(entry, &ib_agent_port_list, port_list) { > - if (entry->smp_agent->device == device && > - entry->port_num == port_num) > - return entry; > - } > - } else { > - list_for_each_entry(entry, &ib_agent_port_list, port_list) { > - if ((entry->smp_agent == mad_agent) || > - (entry->perf_mgmt_agent == mad_agent)) > - return entry; > - } > + list_for_each_entry(entry, &ib_agent_port_list, port_list) { > + if (entry->agent[0]->device == device && > + entry->agent[0]->port_num == port_num) > + return entry; > } > return NULL; > } > > -static inline struct ib_agent_port_private * > -ib_get_agent_port(struct ib_device *device, int port_num, > - struct ib_mad_agent *mad_agent) > +static struct ib_agent_port_private * > +ib_get_agent_port(struct ib_device *device, int port_num) > { > struct ib_agent_port_private *entry; > unsigned long flags; > > 
spin_lock_irqsave(&ib_agent_port_list_lock, flags); > - entry = __ib_get_agent_port(device, port_num, mad_agent); > + entry = __ib_get_agent_port(device, port_num); > spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); > - > return entry; > } > > @@ -99,192 +82,67 @@ int smi_check_local_dr_smp(struct ib_smp > > if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) > return 1; > - port_priv = ib_get_agent_port(device, port_num, NULL); > + > + port_priv = ib_get_agent_port(device, port_num); > if (!port_priv) { > printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " > - "not open\n", > - device->name, port_num); > + "not open\n", device->name, port_num); > return 1; > } > > - return smi_check_local_smp(port_priv->smp_agent, smp); > + return smi_check_local_smp(port_priv->agent[0], smp); > } > > -static int agent_mad_send(struct ib_mad_agent *mad_agent, > - struct ib_agent_port_private *port_priv, > - struct ib_mad_private *mad_priv, > - struct ib_grh *grh, > - struct ib_wc *wc) > -{ > - struct ib_agent_send_wr *agent_send_wr; > - struct ib_sge gather_list; > - struct ib_send_wr send_wr; > - struct ib_send_wr *bad_send_wr; > - struct ib_ah_attr ah_attr; > - unsigned long flags; > - int ret = 1; > - > - agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); > - if (!agent_send_wr) > - goto out; > - agent_send_wr->mad = mad_priv; > - > - gather_list.addr = dma_map_single(mad_agent->device->dma_device, > - &mad_priv->mad, > - sizeof(mad_priv->mad), > - DMA_TO_DEVICE); > - gather_list.length = sizeof(mad_priv->mad); > - gather_list.lkey = mad_agent->mr->lkey; > - > - send_wr.next = NULL; > - send_wr.opcode = IB_WR_SEND; > - send_wr.sg_list = &gather_list; > - send_wr.num_sge = 1; > - send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ > - send_wr.wr.ud.timeout_ms = 0; > - send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; > - > - ah_attr.dlid = wc->slid; > - ah_attr.port_num = mad_agent->port_num; > - ah_attr.src_path_bits = wc->dlid_path_bits; 
> - ah_attr.sl = wc->sl; > - ah_attr.static_rate = 0; > - ah_attr.ah_flags = 0; /* No GRH */ > - if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { > - if (wc->wc_flags & IB_WC_GRH) { > - ah_attr.ah_flags = IB_AH_GRH; > - /* Should sgid be looked up ? */ > - ah_attr.grh.sgid_index = 0; > - ah_attr.grh.hop_limit = grh->hop_limit; > - ah_attr.grh.flow_label = be32_to_cpu( > - grh->version_tclass_flow) & 0xfffff; > - ah_attr.grh.traffic_class = (be32_to_cpu( > - grh->version_tclass_flow) >> 20) & 0xff; > - memcpy(ah_attr.grh.dgid.raw, > - grh->sgid.raw, > - sizeof(ah_attr.grh.dgid)); > - } > - } > - > - agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); > - if (IS_ERR(agent_send_wr->ah)) { > - printk(KERN_ERR SPFX "No memory for address handle\n"); > - kfree(agent_send_wr); > - goto out; > - } > - > - send_wr.wr.ud.ah = agent_send_wr->ah; > - if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { > - send_wr.wr.ud.pkey_index = wc->pkey_index; > - send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; > - } else { /* for SMPs */ > - send_wr.wr.ud.pkey_index = 0; > - send_wr.wr.ud.remote_qkey = 0; > - } > - send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr; > - send_wr.wr_id = (unsigned long)agent_send_wr; > - > - pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); > - > - /* Send */ > - spin_lock_irqsave(&port_priv->send_list_lock, flags); > - if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { > - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); > - dma_unmap_single(mad_agent->device->dma_device, > - pci_unmap_addr(agent_send_wr, mapping), > - sizeof(mad_priv->mad), > - DMA_TO_DEVICE); > - ib_destroy_ah(agent_send_wr->ah); > - kfree(agent_send_wr); > - } else { > - list_add_tail(&agent_send_wr->send_list, > - &port_priv->send_posted_list); > - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); > - ret = 0; > - } > - > -out: > - return ret; > -} > - > -int agent_send(struct ib_mad_private *mad, > - 
struct ib_grh *grh,
> -               struct ib_wc *wc,
> -               struct ib_device *device,
> -               int port_num)
> +void agent_send_response(struct ib_mad *mad, struct ib_grh *grh,
   ^^^^ int
Shouldn't this be left as int (and set error returns internal to this
routine where they occur) ? There seem to be a number of them although
the number has been reduced.

> +                         struct ib_wc *wc, struct ib_device *device,
> +                         int port_num, int qpn)
>  {
>          struct ib_agent_port_private *port_priv;
> -        struct ib_mad_agent *mad_agent;
> +        struct ib_mad_agent *agent;
> +        struct ib_mad_send_buf *send_buf;
> +        struct ib_send_wr *bad_wr;
> +        struct ib_ah *ah;
>
> -        port_priv = ib_get_agent_port(device, port_num, NULL);
> -        if (!port_priv) {
> -                printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n",
> -                       device->name, port_num);
> -                return 1;
> -        }
> +        port_priv = ib_get_agent_port(device, port_num);
> +        if (!port_priv)
> +                return;
>
> -        /* Get mad agent based on mgmt_class in MAD */
> -        switch (mad->mad.mad.mad_hdr.mgmt_class) {
> -                case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE:
> -                case IB_MGMT_CLASS_SUBN_LID_ROUTED:
> -                        mad_agent = port_priv->smp_agent;
> -                        break;
> -                case IB_MGMT_CLASS_PERF_MGMT:
> -                        mad_agent = port_priv->perf_mgmt_agent;
> -                        break;
> -                default:
> -                        return 1;
> -        }
> +        agent = port_priv->agent[qpn];
> +        ah = ib_create_ah_from_wc(agent->qp->pd, wc, grh, port_num);
> +        if (IS_ERR(ah))
> +                return;
>
> -        return agent_mad_send(mad_agent, port_priv, mad, grh, wc);
> +        send_buf = ib_create_send_mad(agent, wc->src_qp, wc->pkey_index, ah, 0,
> +                                      sizeof *mad - IB_MGMT_MAD_DATA,
> +                                      IB_MGMT_MAD_DATA, GFP_KERNEL);
> +        if (IS_ERR(send_buf))
> +                goto err1;
> +
> +        *send_buf->mad = *mad;
> +        if (ib_post_send_mad(agent, &send_buf->send_wr, &bad_wr))
> +                goto err2;
> +        return;
> +err2:
> +        ib_free_send_mad(send_buf);
> +err1:
> +        ib_destroy_ah(ah);
>  }
>
>  static void agent_send_handler(struct ib_mad_agent *mad_agent,
>                                 struct ib_mad_send_wc *mad_send_wc)
>  {
> -        struct ib_agent_port_private *port_priv;
> -        struct
ib_agent_send_wr *agent_send_wr; > - unsigned long flags; > + struct ib_mad_send_buf *send_buf; > > - /* Find matching MAD agent */ > - port_priv = ib_get_agent_port(NULL, 0, mad_agent); > - if (!port_priv) { > - printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " > - "agent %p\n", mad_agent); > - return; > - } > - > - agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; > - spin_lock_irqsave(&port_priv->send_list_lock, flags); > - /* Remove completed send from posted send MAD list */ > - list_del(&agent_send_wr->send_list); > - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); > - > - dma_unmap_single(mad_agent->device->dma_device, > - pci_unmap_addr(agent_send_wr, mapping), > - sizeof(agent_send_wr->mad->mad), > - DMA_TO_DEVICE); > - > - ib_destroy_ah(agent_send_wr->ah); > - > - /* Release allocated memory */ > - kmem_cache_free(ib_mad_cache, agent_send_wr->mad); > - kfree(agent_send_wr); > + send_buf = (void *)(unsigned long) mad_send_wc->wr_id; > + ib_destroy_ah(send_buf->send_wr.wr.ud.ah); > + ib_free_send_mad(send_buf); > } > > int ib_agent_port_open(struct ib_device *device, int port_num) > { > - int ret; > struct ib_agent_port_private *port_priv; > unsigned long flags; > - > - /* First, check if port already open for SMI */ > - port_priv = ib_get_agent_port(device, port_num, NULL); > - if (port_priv) { > - printk(KERN_DEBUG SPFX "%s port %d already open\n", > - device->name, port_num); > - return 0; > - } > + int ret; > > /* Create new device info */ > port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); > @@ -293,32 +151,25 @@ int ib_agent_port_open(struct ib_device > ret = -ENOMEM; > goto error1; > } > - > memset(port_priv, 0, sizeof *port_priv); > - port_priv->port_num = port_num; > - spin_lock_init(&port_priv->send_list_lock); > - INIT_LIST_HEAD(&port_priv->send_posted_list); > - > - /* Obtain send only MAD agent for SM class (SMI QP) */ > - port_priv->smp_agent = ib_register_mad_agent(device, port_num, > - 
IB_QPT_SMI, > - NULL, 0, > - &agent_send_handler, > - NULL, NULL); > > - if (IS_ERR(port_priv->smp_agent)) { > - ret = PTR_ERR(port_priv->smp_agent); > + /* Obtain send only MAD agent for SMI QP */ > + port_priv->agent[0] = ib_register_mad_agent(device, port_num, > + IB_QPT_SMI, NULL, 0, > + &agent_send_handler, > + NULL, NULL); > + if (IS_ERR(port_priv->agent[0])) { > + ret = PTR_ERR(port_priv->agent[0]); > goto error2; > } > > - /* Obtain send only MAD agent for PerfMgmt class (GSI QP) */ > - port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, > - IB_QPT_GSI, > - NULL, 0, > - &agent_send_handler, > - NULL, NULL); > - if (IS_ERR(port_priv->perf_mgmt_agent)) { > - ret = PTR_ERR(port_priv->perf_mgmt_agent); > + /* Obtain send only MAD agent for GSI QP */ > + port_priv->agent[1] = ib_register_mad_agent(device, port_num, > + IB_QPT_GSI, NULL, 0, > + &agent_send_handler, > + NULL, NULL); > + if (IS_ERR(port_priv->agent[1])) { > + ret = PTR_ERR(port_priv->agent[1]); > goto error3; > } > > @@ -329,7 +180,7 @@ int ib_agent_port_open(struct ib_device > return 0; > > error3: > - ib_unregister_mad_agent(port_priv->smp_agent); > + ib_unregister_mad_agent(port_priv->agent[0]); > error2: > kfree(port_priv); > error1: > @@ -342,7 +193,7 @@ int ib_agent_port_close(struct ib_device > unsigned long flags; > > spin_lock_irqsave(&ib_agent_port_list_lock, flags); > - port_priv = __ib_get_agent_port(device, port_num, NULL); > + port_priv = __ib_get_agent_port(device, port_num); > if (port_priv == NULL) { > spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); > printk(KERN_ERR SPFX "Port %d not found\n", port_num); > @@ -351,9 +202,8 @@ int ib_agent_port_close(struct ib_device > list_del(&port_priv->port_list); > spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); > > - ib_unregister_mad_agent(port_priv->perf_mgmt_agent); > - ib_unregister_mad_agent(port_priv->smp_agent); > + ib_unregister_mad_agent(port_priv->agent[1]); > + 
ib_unregister_mad_agent(port_priv->agent[0]); > kfree(port_priv); > - > return 0; > } > Index: agent.h > =================================================================== > --- agent.h (revision 3692) > +++ agent.h (working copy) > @@ -39,17 +39,14 @@ > #ifndef __AGENT_H_ > #define __AGENT_H_ > > -extern spinlock_t ib_agent_port_list_lock; > +#include > > -extern int ib_agent_port_open(struct ib_device *device, > - int port_num); > +extern int ib_agent_port_open(struct ib_device *device, int port_num); > > extern int ib_agent_port_close(struct ib_device *device, int port_num); > > -extern int agent_send(struct ib_mad_private *mad, > - struct ib_grh *grh, > - struct ib_wc *wc, > - struct ib_device *device, > - int port_num); > +extern void agent_send_response(struct ib_mad *mad, struct ib_grh *grh, > + struct ib_wc *wc, struct ib_device *device, > + int port_num, int qpn); > > #endif /* __AGENT_H_ */ > Index: mad.c > =================================================================== > --- mad.c (revision 3692) > +++ mad.c (working copy) > @@ -1728,11 +1728,11 @@ local: > if (ret & IB_MAD_RESULT_CONSUMED) > goto out; > if (ret & IB_MAD_RESULT_REPLY) { > - /* Send response */ > - if (!agent_send(response, &recv->grh, wc, > - port_priv->device, > - port_priv->port_num)) > - response = NULL; > + agent_send_response(&response->mad.mad, > + &recv->grh, wc, > + port_priv->device, > + port_priv->port_num, > + qp_info->qp->qp_num); > goto out; > } > } > @@ -2761,7 +2761,6 @@ static int __init ib_mad_init_module(voi > int ret; > > spin_lock_init(&ib_mad_port_list_lock); > - spin_lock_init(&ib_agent_port_list_lock); > > ib_mad_cache = kmem_cache_create("ib_mad", > sizeof(struct ib_mad_private), > Index: smi.h > =================================================================== > --- smi.h (revision 3692) > +++ smi.h (working copy) > @@ -35,10 +35,11 @@ > * > * $Id$ > */ > - > #ifndef __SMI_H_ > #define __SMI_H_ > > +#include > + > int smi_handle_dr_smp_recv(struct ib_smp 
*smp, > u8 node_type, > int port_num, > > > From mohitka at noida.hcltech.com Sat Oct 15 06:07:59 2005 From: mohitka at noida.hcltech.com (Mohit Katiyar, Noida) Date: Sat, 15 Oct 2005 18:37:59 +0530 Subject: [openib-general] Infiniband over FC Message-ID: <3E6BB9CEE261E2428AD25D0D553DC497014E6B06@HSDLNTD1110010.noida.hcltech.com> Hi all, I just cant clear a doubt about IB. In the first figure given below the max speed that can be obtained between the client and the IO storage is 2Gb/s Figure 1 While in the figure given below the client to IB FC gateway speed is > 10 GB/s and from Gateway to I/O storage is 2GB/s and if port aggregation is applied at gateway then 4GB/s. So the total effective speed from client to I/O storage can max be reached at 4GB/s Figure 2 So can anyone explain me am I correct in my approach? Are there any other advantages in shifting from figure 1 architecture to figure 2 architecture? It does not seem any advantageous in shifting from FC SAN to IB FC SAN through such a pattern? Can anyone help me in deciding about this?? Thanks in advance Mohit Katiyar -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 3623 bytes Desc: image001.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.gif Type: image/gif Size: 4485 bytes Desc: image002.gif URL: From mohitka at noida.hcltech.com Sat Oct 15 06:28:38 2005 From: mohitka at noida.hcltech.com (Mohit Katiyar, Noida) Date: Sat, 15 Oct 2005 18:58:38 +0530 Subject: [openib-general] IB and FC Message-ID: <3E6BB9CEE261E2428AD25D0D553DC497014E6B0B@HSDLNTD1110010.noida.hcltech.com> Hi all, Sorry previous mail got scrapped due to HTML pictures so now with text pictures I just cant clear a doubt about IB. 
In the first figure given below the max speed that can be obtained between the client and the IO storage is 2Gb/s Client --------| Client --------| |--- FC Switch---| . | | | . |---FC Cables--| |-----I/O storage . | Each client | | Client --------| connected |--- FC Switch---| To both switch Figure 1 While in the figure given below the client to IB FC gateway speed is > 10 GB/s and from Gateway to I/O storage is 2GB/s and if port aggregation is applied at gateway then 4GB/s. So the total effective speed from client to I/O storage can max be reached at 4GB/s IB cables Client --------| Client --------| |----- FC Switch---| . | IB cables | | . |------------IB FC ------ | |---FC Cables------I/O storage . | Gateway/Router | | Client --------| |----- FC Switch---| Figure 2 So can anyone explain me am I correct in my approach? Are there any other advantages in shifting from figure 1 architecture to figure 2 architecture? It does not seem any advantageous in shifting from FC SAN to IB FC SAN through such a pattern? Can anyone help me in deciding about this?? Thanks in advance Mohit Katiyar From mohitka at noida.hcltech.com Sat Oct 15 06:40:01 2005 From: mohitka at noida.hcltech.com (Mohit Katiyar, Noida) Date: Sat, 15 Oct 2005 19:10:01 +0530 Subject: [openib-general] IB and FC Message-ID: <3E6BB9CEE261E2428AD25D0D553DC497014E6B11@HSDLNTD1110010.noida.hcltech.com> Hi all, Sorry previous mail got scrapped due to HTML pictures andsecond due to tabs so I hope this reaches correctly I just cant clear a doubt about IB. 
In the first figure given below the max speed that can be obtained between the client and the IO storage is 2Gb/s Client --------| Client --------| |--- FC Switch---| | | | |---FC Cables--| |-----I/O storage | Each client | | Client --------| connected |--- FC Switch---| To both switch Figure 1 While in the figure given below the client to IB FC gateway speed is > 10 GB/s and from Gateway to I/O storage is 2GB/s and if port aggregation is applied at gateway then 4GB/s. So the total effective speed from client to I/O storage can max be reached at 4GB/s IB cables Client ----| Client -- -| |----- FC Switch---| . | IB cables | | . |--IB FC ------ | |-------I/O storage | Gateway | | Client ----| Router |----- FC Switch---| Figure 2 So can anyone explain me am I correct in my approach? Are there any other advantages in shifting from figure 1 architecture to figure 2 architecture? It does not seem any advantageous in shifting from FC SAN to IB FC SAN through such a pattern? Can anyone help me in deciding about this?? Thanks in advance Mohit Katiyar From troy at scl.ameslab.gov Sat Oct 15 07:13:05 2005 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Sat, 15 Oct 2005 09:13:05 -0500 Subject: [openib-general] Cray XD1 and OpenSM.. (ignoreing certain guids?) In-Reply-To: <1129368676.16900.592.camel@hal.voltaire.com> References: <20051014231954.GC8748@minbar.scl.ameslab.gov> <1129340904.16900.198.camel@hal.voltaire.com> <1129368676.16900.592.camel@hal.voltaire.com> Message-ID: <43510E71.3080106@scl.ameslab.gov> >>I'm unaware of such an option. Not sure how you would specify which >>nodes to ignore. Why would you want them on the net if they are to be >>ignored ? >> >>Nodes are supposed to be IB compliant: SMA is a required component of >>all nodes. >> >> > >So I presume there is no SMA for the Cray XD1. If someone is going to >implement this, we can document what portion of the SMA needs to >implemented to work with OpenSM. 
That wouldn't necessarily guarantee it >should work with any SM as other SMs may rely on some slightly different >things or do things in a slightly different way since there is much more >flexibility allowed on the SM side. > >-- Hal > > I would be interested to know what minimal set of the SMA needs to be implemented. Does anyone have any SMA code in openib project yet? If so it might be a (theoretically) simple matter of running that code on the Cray XD1 nodes, and having the cray hardware pass the SMA packets up to the linux kernel. Is the SMA implemented in firmware on mellanox cards? And if there's a 'soft-sma' available, could it be made to work on pathscale cards as well? From eitan at mellanox.co.il Sat Oct 15 08:54:07 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 15 Oct 2005 17:54:07 +0200 Subject: [openib-general] Cray XD1 and OpenSM.. (ignoreing certain gui ds?) Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E306935A@mtlexch01.mtl.com> Hi Troy, The only "soft SMA" I am aware off is part of the ib management simulator code. See: sma.cpp in https://openib.org/svn/gen2/utils/src/linux-user/ibmgtsim/src/sma.cpp Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Troy Benjegerdes [mailto:troy at scl.ameslab.gov] > Sent: Saturday, October 15, 2005 4:13 PM > To: Hal Rosenstock > Cc: xd1-kernel at lists.scl.ameslab.gov; openib-general at openib.org > Subject: Re: [openib-general] Cray XD1 and OpenSM.. (ignoreing certain guids?) > > > >>I'm unaware of such an option. Not sure how you would specify which > >>nodes to ignore. Why would you want them on the net if they are to be > >>ignored ? > >> > >>Nodes are supposed to be IB compliant: SMA is a required component of > >>all nodes. > >> > >> > > > >So I presume there is no SMA for the Cray XD1. 
If someone is going to > >implement this, we can document what portion of the SMA needs to > >implemented to work with OpenSM. That wouldn't necessarily guarantee it > >should work with any SM as other SMs may rely on some slightly different > >things or do things in a slightly different way since there is much more > >flexibility allowed on the SM side. > > > >-- Hal > > > > > I would be interested to know what minimal set of the SMA needs to be > implemented. Does anyone have any SMA code in openib project yet? If so > it might be a (theoretically) simple matter of running that code on the > Cray XD1 nodes, and having the cray hardware pass the SMA packets up to > the linux kernel. > > Is the SMA implemented in firmware on mellanox cards? And if there's a > 'soft-sma' available, could it be made to work on pathscale cards as well? > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Sat Oct 15 09:17:42 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Oct 2005 12:17:42 -0400 Subject: [openib-general] Cray XD1 and OpenSM.. (ignoreing certain guids?) In-Reply-To: <43510E71.3080106@scl.ameslab.gov> References: <20051014231954.GC8748@minbar.scl.ameslab.gov> <1129340904.16900.198.camel@hal.voltaire.com> <1129368676.16900.592.camel@hal.voltaire.com> <43510E71.3080106@scl.ameslab.gov> Message-ID: <1129393061.16900.1649.camel@hal.voltaire.com> On Sat, 2005-10-15 at 10:13, Troy Benjegerdes wrote: > I would be interested to know what minimal set of the SMA needs to be > implemented. OK. Let me know when you would start to implement. I will document what is needed before then. > Does anyone have any SMA code in openib project yet? 
PathScale has one but it is married to the PathScale hardware. Look in ipath/src/linux-kernel/infiniband/hw/ipath/ib_ipath > If so > it might be a (theoretically) simple matter of running that code on the > Cray XD1 nodes, and having the cray hardware pass the SMA packets up to > the linux kernel. It's more than that: in addition to the IB hardware/driver difference, it will need to be ported from Linux to whatever Cray OS is. > Is the SMA implemented in firmware on mellanox cards? Yes. > And if there's a > 'soft-sma' available, could it be made to work on pathscale cards as well? PathScale has an soft SMA for their HCAs already (for Linux at least). -- Hal From nacc at us.ibm.com Sat Oct 15 09:56:19 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Sat, 15 Oct 2005 09:56:19 -0700 Subject: [PATCH] core/addr: fix compilation warning {was Re: [openib-general] Latest build test results} In-Reply-To: <1129370754.16900.664.camel@hal.voltaire.com> References: <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <1128692935.4382.7072.camel@hal.voltaire.com> <20051011214521.GM5972@us.ibm.com> <1129080434.4377.12024.camel@hal.voltaire.com> <20051012013930.GB13157@us.ibm.com> <1129086927.4377.12455.camel@hal.voltaire.com> <20051015052456.GF28213@us.ibm.com> <1129370754.16900.664.camel@hal.voltaire.com> Message-ID: <20051015165619.GG28213@us.ibm.com> On 15.10.2005 [06:05:54 -0400], Hal Rosenstock wrote: > Hi Nish, > > On Sat, 2005-10-15 at 01:24, Nishanth Aravamudan wrote: > > On 11.10.2005 [23:15:27 -0400], Hal Rosenstock wrote: > > > Hi again Nish, > > > > > > On Tue, 2005-10-11 at 21:39, Nishanth Aravamudan wrote: > > > > > > > Update arp_recv functions to latest 2.6.14 netdevice.h API for struct > > > > > > > packet_type > > > > > > > > > > > > Sorry for the delay, I haven't yet had time to test the patches :/ I'll > > > > > > try to get to it tonight or tomorrow. 
> > > > > > > > > > > > Is there anyway you can send me patches against the kernel tree as > > > > > > opposed to the svn repo? It makes my side of things *a lot* easier, as > > > > > > right now I have to take your patch against svn and either hand-edit or > > > > > > patch my checkout and then diff against the current kernel tree. > > > > > > > > > > Since you were reporting iSER, AT, and SDP compile warnings/errors, > > > > > aren't you using the latest OpenIB svn tree with 2.6.14-rc3 ? > > > > > > > > Yes; but you have to understand that the automated build system I have > > > > access to 1) does not have external internet access (i.e., to the svn > > > > tree) and 2) only builds kernels unless I manually send commands to the > > > > terminal. > > > > > > > > So, the way I'm doing things is: > > > > > > > > Send in 4 jobs for mainline (x86 and ppc64 with =y and =m) and then > > > > generate a patch of the latest svn tree against the current -git release > > > > (a patch to the kernel) and send it in as a parameter to my builds to > > > > test the latest svn tree. This leads to another 4 jobs (x86 and ppc64 > > > > with =y and =m). > > > > > > > > I'm *only* doing kernel build testing right now. > > > > > > > > > Which patches are you referring to ? Was it the fib_frontend.c one ? > > > > > Not sure why they would need any manual fixup. At least that one was > > > > > pretty straightforward. > > > > > > > > In the sense that I have to edit them to kernel relative paths, not in > > > > the content of the patch. To test any patch in the system I have access > > > > to, it needs to be a normal kernel patch (-p1 applicable to the base > > > > tree). > > > > > > > > Going through and manually applying patches to the svn tree and then > > > > regenerating the diff completely defeats the purpose of automated > > > > compilation testing. > > > > > > OK. Do you need any patches regenerated or is this more for the future ? 
> > > > Please check-in the at.c, sdp_link.c and iser.h fixes, as now gen2 code > > builds on x86 and ppc64 with only the following warning (which I believe > > is new) > > > > drivers/infiniband/core/addr.c:330: warning: initialization from incompatible pointer type > > > > when the patches are applied. Without them the x86 build fails > > completely and the ppc64 build emits several warning. > > > > Sorry for the *long* delay, it took a bit of effort to get the patches > > to cooperate with our automated build system. Thanks to Hal for his > > quick response and generous patience in waiting for my ack. > > > > So, officially, I give > > > > Acked-by: Nishanth Aravamudan > > > > to the at.c, sdp_link.c and iser.h fixes. > > Thanks for trying out these patches. Sorry for the manual intervention. No problem; it unfortunately wasn't as simple as just editing the patches, as when I did so, all the hunks would fail. So I just regenerated them locally against current-git. Not a big deal, since these were small changes. > I regenerated the patches for fib_frontend.c, at.c, and sdp_link.c and > they are in linux-kernel/patches. Hopefully these will work with your > automated build system. These are found in linux-kernel/patches as: > linux-2.6.14-rc3-at.diff > linux-2.6.14-rc3-fib-frontend.diff > linux-2.6.14-rc3-sdp_link.diff Those look great and are identical to the ones I generated. So, my Acked-by applies to those 3. > Dan will be checking in the iser.h fix. Ok, great! I think the core/addr.c problem can be fixed with the following: Thanks, Nish Description: Fix a compilation warning in core/addr.c due to packet_type's func member changing prototype. 
Signed-off-by: Nishanth Aravamudan --- diff -urpN linux-2.6.14-rc4-git4-dev/drivers/infiniband/core/addr.c linux-2.6.14-rc4-git4-dev2/drivers/infiniband/core/addr.c --- linux-2.6.14-rc4-git4-dev/drivers/infiniband/core/addr.c 2005-10-15 09:52:26.000000000 -0700 +++ linux-2.6.14-rc4-git4-dev2/drivers/infiniband/core/addr.c 2005-10-15 09:52:56.000000000 -0700 @@ -310,7 +310,7 @@ void ib_addr_cancel(struct ib_addr *addr EXPORT_SYMBOL(ib_addr_cancel); static int addr_arp_recv(struct sk_buff *skb, struct net_device *dev, - struct packet_type *pkt) + struct packet_type *pkt, struct net_device *orig_dev) { struct arphdr *arp_hdr; From rjwalsh at pathscale.com Sat Oct 15 13:46:45 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sat, 15 Oct 2005 13:46:45 -0700 Subject: [openib-general] Re: Initial ipath review brain dump In-Reply-To: <524q7jvc7d.fsf@cisco.com> References: <524q7jvc7d.fsf@cisco.com> Message-ID: <1129409205.4027.0.camel@hematite.internal.keyresearch.com> > Now that I got through reviewing the generic parts of the PathScale > merge and the low-level driver is on the trunk, I started looking > through the real driver. I'm only about a third of the way through > infinipath_core.c, but here's a quick dump of what I see as needing > work so far: Thanks for the feedback. We're going to review all of these early next week and we'll get back to you then. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From halr at voltaire.com Sat Oct 15 14:51:04 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Oct 2005 17:51:04 -0400 Subject: [PATCH] core/addr: fix compilation warning {was Re: [openib-general] Latest build test results} In-Reply-To: <20051015165619.GG28213@us.ibm.com> References: <1128619535.4382.1039.camel@hal.voltaire.com> <20051006192024.GC15908@us.ibm.com> <1128626684.4382.1762.camel@hal.voltaire.com> <1128692935.4382.7072.camel@hal.voltaire.com> <20051011214521.GM5972@us.ibm.com> <1129080434.4377.12024.camel@hal.voltaire.com> <20051012013930.GB13157@us.ibm.com> <1129086927.4377.12455.camel@hal.voltaire.com> <20051015052456.GF28213@us.ibm.com> <1129370754.16900.664.camel@hal.voltaire.com> <20051015165619.GG28213@us.ibm.com> Message-ID: <1129413064.16900.2657.camel@hal.voltaire.com> On Sat, 2005-10-15 at 12:56, Nishanth Aravamudan wrote: > > Thanks for trying out these patches. Sorry for the manual intervention. > > No problem; it unfortunately wasn't as simple as just editing the > patches, as when I did so, all the hunks would fail. So I just > regenerated them locally against current-git. Not a big deal, since > these were small changes. > > > I regenerated the patches for fib_frontend.c, at.c, and sdp_link.c and > > they are in linux-kernel/patches. Hopefully these will work with your > > automated build system. These are found in linux-kernel/patches as: > > linux-2.6.14-rc3-at.diff > > linux-2.6.14-rc3-fib-frontend.diff > > linux-2.6.14-rc3-sdp_link.diff > > Those look great and are identical to the ones I generated. So, my > Acked-by applies to those 3. > > > Dan will be checking in the iser.h fix. > > Ok, great! > > I think the core/addr.c problem can be fixed with the following: > > Thanks, > Nish > > Description: Fix a compilation warning in core/addr.c due to > packet_type's func member changing prototype. 
> > Signed-off-by: Nishanth Aravamudan > > --- > > diff -urpN linux-2.6.14-rc4-git4-dev/drivers/infiniband/core/addr.c linux-2.6.14-rc4-git4-dev2/drivers/infiniband/core/addr.c > --- linux-2.6.14-rc4-git4-dev/drivers/infiniband/core/addr.c 2005-10-15 09:52:26.000000000 -0700 > +++ linux-2.6.14-rc4-git4-dev2/drivers/infiniband/core/addr.c 2005-10-15 09:52:56.000000000 -0700 > @@ -310,7 +310,7 @@ void ib_addr_cancel(struct ib_addr *addr > EXPORT_SYMBOL(ib_addr_cancel); > > static int addr_arp_recv(struct sk_buff *skb, struct net_device *dev, > - struct packet_type *pkt) > + struct packet_type *pkt, struct net_device *orig_dev) > { > struct arphdr *arp_hdr; Thanks. I just added that one as linux-kernel/patches/linux-2.6.14-rc3-addr.diff -- Hal From troy at scl.ameslab.gov Sat Oct 15 17:22:57 2005 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Sat, 15 Oct 2005 19:22:57 -0500 Subject: [openib-general] Cray XD1 and OpenSM.. (ignoreing certain guids?) In-Reply-To: <1129393061.16900.1649.camel@hal.voltaire.com> References: <20051014231954.GC8748@minbar.scl.ameslab.gov> <1129340904.16900.198.camel@hal.voltaire.com> <1129368676.16900.592.camel@hal.voltaire.com> <43510E71.3080106@scl.ameslab.gov> <1129393061.16900.1649.camel@hal.voltaire.com> Message-ID: <43519D61.8040208@scl.ameslab.gov> >It's more than that: in addition to the IB hardware/driver difference, >it will need to be ported from Linux to whatever Cray OS is. > > > The Cray XD1 is actually running Linux.. I've even managed to build and boot my own kernel on one. They are actually using a derivative of the OpenIB SDP code that has been ported to what they call "RapidArray", which is infiniband at the wire-protocol level. They don't implement any of the higher level stuff (like, obviously, SMA) > > >>And if there's a >>'soft-sma' available, could it be made to work on pathscale cards as well? >> >> > >PathScale has an soft SMA for their HCAs already (for Linux at least). 
> >-- Hal > > > I'm going to take a look at pathscale's sma when I get a chance. (which in reality probably won't be until after SC) From tom at ammasso.com Sun Oct 16 07:39:53 2005 From: tom at ammasso.com (Tom Tucker) Date: Sun, 16 Oct 2005 09:39:53 -0500 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020A03@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020A03@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1129473593.25345.51.camel@mail.es335.com> At 50,000 feet, I don't think anyone disagrees with these lines of reasoning, however, there are some practical design issues that don't yield to the architectural rubric of "design by rule of least astonishment". It may be more complex than it needs to be; so propose an API, submit a patch.
I think the current CMA could probably be better. This will give everyone something concrete to consider. IMHO, at this point, these philosophical arguments serve only to consume network bandwidth. On Thu, 2005-10-13 at 13:55 -0400, Caitlin Bestler wrote: > I agree with Mike's analysis. But I'd also like to point out that even > when source compatibility is not a requirement, source familiarity > is. That is, even when recoding is feasible the API should only > introduce new concepts as required to improve efficiency. The > shift from socket model to QP/CQ is challenging enough as is. > It's also where the benefit is. Changing how the application > requests and accepts connections is just piling on more things > for the developers to learn onto an already very full plate, and > with nowhere near the same benefit. > > The simple, IP/DNS-centric methods that Mike outlined will > work on either iWARP or IB, and are very easily understood > by those familiar with existing sockets/IP network development. > The more complex models provide minor enhancements for > very corner cases at the very heavy concept of requiring > the developer to understand a lot more about network topology. > Per the above, I don't view these issues as "minor enhancements" or "corner cases"; these are features of the network software layer that most applications rely on. From mst at mellanox.co.il Sun Oct 16 07:52:21 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 16 Oct 2005 16:52:21 +0200 Subject: [openib-general] [PATCH] comment fix Message-ID: <20051016145221.GI2608@mellanox.co.il> Fix comment for ibv_ack_cq_events Signed-off-by: Michael S.
Tsirkin Index: include/infiniband/verbs.h =================================================================== --- include/infiniband/verbs.h (revision 3788) +++ include/infiniband/verbs.h (working copy) @@ -718,7 +718,7 @@ extern int ibv_get_cq_event(struct ibv_c struct ibv_cq **cq, void **cq_context); /** - * ibv_ack_cq_events - Free an async event + * ibv_ack_cq_events - Free a CQ event * @cq: CQ to acknowledge events for * @nevents: Number of events to acknowledge. * -- MST From steve_wooding at keysounds.co.uk Sun Oct 16 08:26:17 2005 From: steve_wooding at keysounds.co.uk (Steve Wooding) Date: Sun, 16 Oct 2005 16:26:17 +0100 Subject: [openib-general] How to debug QP INIT->RTR -22 error Message-ID: <43527119.2050103@keysounds.co.uk> Hi there, I'm trying to make a QP connection using the CM, but the active side cannot get to the RTR state. ibv_modify_qp returns errorno -22, invalid argument. What would the best way to find out exactly what the error is and help me fix my app (just to say, it is only my app that's broken, nothing else)? Would turning kernel debugging on be helpful at all? Thanks everyone, Steve. From mst at mellanox.co.il Sun Oct 16 08:43:37 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 16 Oct 2005 17:43:37 +0200 Subject: [openib-general] Re: How to debug QP INIT->RTR -22 error In-Reply-To: <43527119.2050103@keysounds.co.uk> References: <43527119.2050103@keysounds.co.uk> Message-ID: <20051016154337.GJ2608@mellanox.co.il> Quoting r. Steve Wooding : > Subject: How to debug QP INIT->RTR -22 error > > Hi there, > > I'm trying to make a QP connection using the CM, but the active side > cannot get to the RTR state. ibv_modify_qp returns errorno -22, invalid > argument. > > What would the best way to find out exactly what the error is and help > me fix my app (just to say, it is only my app that's broken, nothing > else)? Would turning kernel debugging on be helpful at all? > > Thanks everyone, > > > Steve. 
Yes, enabling debug messages in mthca is typically helpful. The relevant code with tests is in mthca_qp.c, function mthca_modify_qp, and in uverbs_cmd.c, ib_uverbs_modify_qp. -- MST From steve_wooding at keysounds.co.uk Sun Oct 16 09:05:47 2005 From: steve_wooding at keysounds.co.uk (Steve Wooding) Date: Sun, 16 Oct 2005 17:05:47 +0100 Subject: [openib-general] Re: How to debug QP INIT->RTR -22 error In-Reply-To: <20051016154337.GJ2608@mellanox.co.il> References: <43527119.2050103@keysounds.co.uk> <20051016154337.GJ2608@mellanox.co.il> Message-ID: <43527A5B.4080505@keysounds.co.uk> Thanks Michael, Just looking at the kernel-space code gives me a few things to check with my app. Cheers, Steve. Michael S. Tsirkin wrote: >Quoting r. Steve Wooding : > > >>Subject: How to debug QP INIT->RTR -22 error >> >>Hi there, >> >>I'm trying to make a QP connection using the CM, but the active side >>cannot get to the RTR state. ibv_modify_qp returns errorno -22, invalid >>argument. >> >>What would the best way to find out exactly what the error is and help >>me fix my app (just to say, it is only my app that's broken, nothing >>else)? Would turning kernel debugging on be helpful at all? >> >>Thanks everyone, >> >> >>Steve. >> >> > >Yes, enabling debug messages in mthca is typically helpful. >The relevant code with tests is in mthca_qp.c, function mthca_modify_qp, >and in uverbs_cmd.c, ib_uverbs_modify_qp. > > > From gdror at mellanox.co.il Sun Oct 16 23:27:44 2005 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Mon, 17 Oct 2005 08:27:44 +0200 Subject: [openib-general] IB and FC Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E335CA23@mtlexch01.mtl.com> > From: Mohit Katiyar, Noida [mailto:mohitka at noida.hcltech.com] > Sent: Saturday, October 15, 2005 3:40 PM > > While in the figure given below the client to IB FC gateway speed is > > 10 GB/s and from Gateway to I/O storage is 2GB/s and if port > aggregation > is applied at gateway then 4GB/s. 
So the total effective speed from > client to I/O storage can max be reached at 4GB/s > > IB cables > Client ----| > Client -- -| |----- FC Switch---| > . | IB cables | | > . |--IB FC ------ | |-------I/O storage > | Gateway | | > Client ----| Router |----- FC Switch---| > > Figure 2 > > > > So can anyone explain me am I correct in my approach? Are there any > other advantages in shifting from figure 1 architecture to figure 2 > architecture? The Gateway from IB to FC can also be a storage virtualization device, in which case it may stripe data amongst multiple FC devices. In this case you can get higher bandwidth (aggregate) to the storage boxes, because there are going to be many of them. Caching may also be doable in the gateway. This may also be an intermediate solution that will enable you to connect native IB storage boxes in the future. In which case you're going to be able to connect both your existing FC storage boxes and new IB storage boxes to the IB fabric. > > It does not seem any advantageous in shifting from FC SAN to IB FC SAN > through such a pattern? Another reason can be cost. If your clients already have IB adapters because they are doing clustering or for other reason, then why buy a FC adapter to each client ? Just use the IB as a consolidated fabric and through the GW you can access the storage. You saved the cost of the FC adapters. -------------- next part -------------- An HTML attachment was scrubbed... URL: From yipeeyipeeyipeeyipee at yahoo.com Mon Oct 17 01:26:08 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Mon, 17 Oct 2005 08:26:08 +0000 (UTC) Subject: [openib-general] /dev file locations Message-ID: Hi, Is there a reason why the uverbs0, umad0 & umad1 character device files are placed in /dev/infiniband/, while uat & ucm0 are placed directly in /dev/ ? Won't it be more consistent if they are placed at the same directory? 
Thanks, y From yipeeyipeeyipeeyipee at yahoo.com Mon Oct 17 02:43:15 2005 From: yipeeyipeeyipeeyipee at yahoo.com (yipee) Date: Mon, 17 Oct 2005 09:43:15 +0000 (UTC) Subject: [openib-general] statically linked userspace program failure Message-ID: Hi, I'm trying to create a small statically-linked program that uses the various uverbs interfaces. Unfortunately I'm having some problems with libibverbs.a. Function init_drivers() outputs the message: libibverbs: Warning: no userspace device-specific driver found for uverbs0 driver search path: /usr/local/lib/infiniband Even though I statically compiled the program, it is still looking for shared libraries in /usr/local/lib/infiniband and returns an empty device list. It seems like the 'driver_list' list remains empty because load_driver() is only called once with NULL. Consequently init_drivers() never does a dlist_push(device_list, dev), so 'device_list' remains empty and ibverbs_init() returns an empty device list to my program. Any comments? Am I doing something wrong? How can I fix this problem? Thanks, y From hch at lst.de Mon Oct 17 04:30:00 2005 From: hch at lst.de (Christoph Hellwig) Date: Mon, 17 Oct 2005 13:30:00 +0200 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020A11@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020A11@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <20051017113000.GA5140@lst.de> On Fri, Oct 14, 2005 at 08:38:18AM -0700, Caitlin Bestler wrote: > I can't think of a better example of something that is truly > brain dead than an application *written* to use Sockets Direct > Protocol. I think you confuse "specifically written to support" with "specifically written to support only". And yes, in the days of getaddrinfo, writing an application specific to a protocol instead of IP+Stream or Dgram semantics is a pretty bad idea.
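For readers following the getaddrinfo point above, here is a minimal sketch of what "IP+Stream semantics" can look like in application code. It is an illustration under stated assumptions, not anything posted in this thread: the helper names resolve_stream() and connect_stream() are hypothetical, and whether a given system maps a stream socket onto SDP, TCP or anything else is outside what this sketch shows. The application only asks for a stream endpoint by name and service and takes whatever address family the resolver returns.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask the resolver for stream endpoints only,
 * without naming an address family or transport protocol.
 * Returns 0 on success; the caller frees *res with freeaddrinfo(). */
int resolve_stream(const char *host, const char *service,
                   struct addrinfo **res)
{
    struct addrinfo hints;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* any address family */
    hints.ai_socktype = SOCK_STREAM; /* stream semantics only */

    return getaddrinfo(host, service, &hints, res);
}

/* Hypothetical helper: try each candidate endpoint in order.
 * Returns a connected socket descriptor, or -1 if none worked. */
int connect_stream(const char *host, const char *service)
{
    struct addrinfo *res, *ai;
    int fd = -1;

    if (resolve_stream(host, service, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        /* Family, type and protocol all come from the resolver. */
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break; /* connected; keep this descriptor */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}
```

Nothing in the two helpers names a transport, so the same source would keep working if the endpoints behind a host name changed; that is the property the message above argues for.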
From hch at lst.de Mon Oct 17 04:32:22 2005 From: hch at lst.de (Christoph Hellwig) Date: Mon, 17 Oct 2005 13:32:22 +0200 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <20051014223953.GB27904@esmail.cup.hp.com> References: <54AD0F12E08D1541B826BE97C98F99F1020A11@NT-SJCA-0751.brcm.ad.broadcom.com> <20051014223953.GB27904@esmail.cup.hp.com> Message-ID: <20051017113222.GB5140@lst.de> On Fri, Oct 14, 2005 at 03:39:53PM -0700, Grant Grundler wrote: > Open source does NOT ignore legacy applications: > 1) Anyone can continue to update and run on the linux kernel version > they have source code for if they don't want to (or can't) change > the application or newer kernels break the ABI. > Many people are still very happy using 2.4 linux kernels. Actually, if your application plays by the rules and breaks with a new kernel, that's a major bug. We definitely guarantee that applications that use the defined syscall interface work on new kernels indefinitely. That doesn't mean they will get all the new features, though. From mst at mellanox.co.il Mon Oct 17 05:17:27 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 17 Oct 2005 14:17:27 +0200 Subject: [openib-general] statically linked userspace program failure In-Reply-To: References: Message-ID: <20051017121727.GB5995@mellanox.co.il> Quoting yipee : > Subject: [openib-general] statically linked userspace program failure > > Hi, > > I'm trying to create a small statically-linked program that uses the > various > uverbs interfaces. Unfortunately I'm having some problems with > libibverbs.a. > libibverbs currently looks up the symbol openib_driver_init. You need to export that from your program, and if you do, it will hook to that before scanning the plugin directory /usr/local/lib/infiniband. First, build mthca with --enable-static.
This will produce mthca.a in /usr/local/lib/infiniband Now, link your program with this library, adding flags -rdynamic -u openib_driver_init Here's a link to discussion on this topic in the archives See this message: http://article.gmane.org/gmane.linux.drivers.openib/11977 In this thread: http://thread.gmane.org/gmane.linux.drivers.openib/11283 -- MST From jlentini at netapp.com Mon Oct 17 06:14:02 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 17 Oct 2005 09:14:02 -0400 (EDT) Subject: [openib-general] /dev file locations In-Reply-To: References: Message-ID: On Mon, 17 Oct 2005, yipee wrote: > Hi, > > Is there a reason why the uverbs0, umad0 & umad1 character device files are > placed in /dev/infiniband/, while uat & ucm0 are placed directly in /dev/ ? > Won't it be more consistent if they are placed at the same directory? > > > Thanks, > y Your udev rules control where uat and ucm0 are placed. There is an explanation of how to set these up correctly in the Installation Cheat Sheet: https://openib.org/tiki/tiki-index.php?page=Installation+Cheat+Sheet From halr at voltaire.com Mon Oct 17 06:16:24 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Oct 2005 09:16:24 -0400 Subject: [openib-general] Cray XD1 and OpenSM.. (ignoreing certain guids?) In-Reply-To: <43519D61.8040208@scl.ameslab.gov> References: <20051014231954.GC8748@minbar.scl.ameslab.gov> <1129340904.16900.198.camel@hal.voltaire.com> <1129368676.16900.592.camel@hal.voltaire.com> <43510E71.3080106@scl.ameslab.gov> <1129393061.16900.1649.camel@hal.voltaire.com> <43519D61.8040208@scl.ameslab.gov> Message-ID: <1129554879.16900.13526.camel@hal.voltaire.com> On Sat, 2005-10-15 at 20:22, Troy Benjegerdes wrote: > >It's more than that: in addition to the IB hardware/driver difference, > >it will need to be ported from Linux to whatever Cray OS is. > > > > > > > The Cray XD1 is actually running Linux.. I've even managed to build and > boot my own kernel on one. 
They are actually using a derivative of the > OpenIB SDP code that has been ported to what they call "RapidArray", > which is infiniband at the wire-protocol level. They don't implement any > of the higher level stuff (like, obviously, SMA) > > > > > > >>And if there's a > >>'soft-sma' available, could it be made to work on pathscale cards as well? > >> > >> > > > >PathScale has an soft SMA for their HCAs already (for Linux at least). > > > >-- Hal > > > > > > > I'm going to take a look at pathscale's sma when I get a chance. (which > in reality probably won't be until after SC) Here is a quick spec: SMA (for HCA device) Get NodeInfo Get NodeDescription Get/Set PortInfo Get/Set P_KeyTable (OpenSM only currently gets the P_KeyTable) Get/Set SLtoVLMappingTable (OpenSM does not get or set the SLtoVLMappingTable currently) Get/Set VLArbitrationTable if supported in IB device (OpenSM does not currently get or set VLArbitrationTable) Get/Set GUIDInfo (OpenSM does not get or set GUIDInfo currently) The bulk of the work is in PortInfo support. Also, you might want to implement PMA as well: Get/Set PortCounters Get/Set PortSamplesControl Get PortSamplesResult Get ClassPortInfo perquery only uses PortCounters. -- Hal From halr at voltaire.com Mon Oct 17 07:07:43 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Oct 2005 10:07:43 -0400 Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a Message-ID: <1129558062.16900.13783.camel@hal.voltaire.com> Hi Heiko, On Mon, 2005-10-17 at 09:54, Heiko J Schick wrote: > Hello Roland and Hal, > > did you changed the mailing-list settings, because it seems that I can > sent anymore to "openib-general". Must I be a member nowdays? I > apologize when you received my message twice. You shouldn't need to be a member to send. It's an open list. > I have some basic question about address translation in OpenIB > (libibat, ib_uat, and ib_at). > > When I run "uatt" I will get the output below. 
To me it seems that > function ib_at_route_by_ip work just fine. At least I receive a > callback and gets the SGID, DGID, etc. But I'm not sure how > ib_at_cancel works. This functions always reportes -1 (EPERM / > Operation not permitted) as return code. I don't think it is always -1, but that's what is currently returned if there is no pending request to cancel. > It seems to me that ib_at_cancel in > /trunk/src/linux-kernel/infiniband/core/at.c only reports -1 when > lookup_req_id founds no corresponding pending request with the same > ID. So is it ok that ib_cancel_at reports -EPERM? EPERM is 1, so -1 and -EPERM are the same value. The comments say: /** * ib_at_cancel - cancel possible active asynchronous operation * @req_id: asynchronous request ID * * Return 0 if canceled, -1 if cancel failed (e.g. bad ID) */ > When should ib_at_cancel normally called? To terminate a pending request (if the callback to some AT request has not been issued). It does no harm to call it even if the callback has been invoked. > XXXXXXXXXXX:/tmp/heiko # ./uatt > uatt: main: src ip address c0a80841 > uatt: main: dest ip address c0a80841 > uatt: main: uat test start > uatt: main: ib_at_route_by_ip: ret 1 errno 0 for request 1 id 0 0 > uatt: att_rt_comp_fn: id 0 context 0x10013258 completed with rec_num 1 > ===> rt 0x10013258 sgid 0xfe8000000000000002e625f000020003 dgid > 0xfe8000000000000002e625f000020003 > uatt: att_rt_comp_fn: ib_at_paths_by_route: ret 0 errno 0 id 1 1 > uatt: att_path_comp_fn: id 1 context 0x10012658 completed with rec_num > 1 > ===> slid 0x7 dlid 0x7 > uatt: main: ib_at_route_by_ip: ret 1 errno 0 for request 2 id 0 0 > uatt: att_rt_comp_fn: id 0 context 0x10013290 completed with rec_num 1 > ===> rt 0x10013290 sgid 0xfe8000000000000002e625f000020003 dgid > 0xfe8000000000000002e625f000020003 > uatt: att_rt_comp_fn: ib_at_paths_by_route: ret 0 errno 0 id 2 2 > ... > uatt: main: sleeping for 30 secs > ...
> uatt: main: uat test cleanup > uatt: main: cancel but no rt id 0 ret -1 errno 1 > uatt: main: cancel but no path id 1 ret -1 errno 1 > uatt: main: cancel but no rt id 0 ret -1 errno 1 > uatt: main: cancel but no path id 2 ret -1 errno 1 > > If I understood everything correctly the normal sequence is like: > > 1. Execute ib_at_route_by_ip and check return code agains >0 =0 <0, > etc. > 2. Callback will be executed and I can process the received > information included in struct ib_at_ib_route *rt (context) > 3. After some timeout cancel pending requests with ib_at_cancel Yes, but no requests were pending in this test execution. > I've modified the att.c testcase and run in the route completion > function ibv_get_device_name with parameter rt->out_dev. The source > code looks like: > > static void att_rt_comp_fn(uint64_t req_id, void *context, int > rec_num) > { > struct ib_at_ib_route *rt = context; > int r, i; > uint64_t req_id2; > char *ib_dev_name; > > printf("rt->out_dev: %p\n", rt->out_dev); > ibv_get_device_name(rt->out_dev); > ... > > Should this code work, because it seems that out_dev is a kernel > address (platform: PPC64) which cannot accessed by a userspace > program. Via GDB I can see that rt has the following content: > > The address is rt->out_dev = 0xc0000000cffaa800 which looks like a > kernel address. Yes, this is a bug which has been previously pointed out on the list and not fixed. 
-- Hal > Starting program: /home/schickhj/heiko/att -s 3232237633 -d 3232237633 > [Thread debugging using libthread_db enabled] > [New Thread 549758242848 (LWP 3430)] > uatt: main: src ip address c0a80841 > uatt: main: dest ip address c0a80841 > uatt: main: uat test start > uatt: main: ib_at_route_by_ip: ret 1 errno 0 for request 1 id 0 0 > [Switching to Thread 549758242848 (LWP 3430)] > > Breakpoint 1, att_rt_comp_fn (req_id=0, context=0x10013208, rec_num=1) > at att.c:139 > 139 struct ib_at_ib_route *rt = context; > (gdb) bt > > (gdb) print /x *rt > $1 = {sgid = {raw = {0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, > 0xe6, 0x25, 0xf0, 0x0, 0x2, 0x0, 0x3}, global = { > subnet_prefix = 0xfe80000000000000, interface_id = > 0x2e625f000020003}}, dgid = {raw = {0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, > 0x0, 0x0, 0x2, 0xe6, 0x25, 0xf0, 0x0, 0x2, 0x0, 0x3}, global = > {subnet_prefix = 0xfe80000000000000, > interface_id = 0x2e625f000020003}}, out_dev = > 0xc0000000cffaa800, out_port = 0x1, attr = {qos_tag = 0x0, pkey = > 0xffff, > multi_path_type = 0x0}} > > Program received signal SIGSEGV, Segmentation fault. > [Switching to Thread 549758242848 (LWP 3605)] > ibv_get_device_name (device=0xc0000000cffaa800) at device.c:62 > 62 device.c: No such file or directory. > in device.c > (gdb) p /x *device > Cannot access memory at address 0xc0000000cffaa800 > > Mit freundlichen Gruessen / Kind Regards > Heiko Joerg Schick > > IBM Deutschland Entwicklung GmbH > I/Ox Microcode Development > Linux Infiniband Device Drivers > > Schoenaicher Str. 
220 > 71032 Boeblingen > E-Mail: schickhj at de.ibm.com > External: 49-7031-16-0 x4219, t/l: 120-4219 From halr at voltaire.com Mon Oct 17 07:16:51 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Oct 2005 10:16:51 -0400 Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a In-Reply-To: <1129558062.16900.13783.camel@hal.voltaire.com> References: <1129558062.16900.13783.camel@hal.voltaire.com> Message-ID: <1129558611.16900.13824.camel@hal.voltaire.com> On Mon, 2005-10-17 at 10:07, Hal Rosenstock wrote: > > Should this code work, because it seems that out_dev is a kernel > > address (platform: PPC64) which cannot accessed by a userspace > > program. Via GDB I can see that rt has the following content: > > > > The address is rt->out_dev = 0xc0000000cffaa800 which looks like a > > kernel address. > > Yes, this is a bug which has been previously pointed out on the list and > not fixed. The fix for this involves an ABI change: it should return the GID of the outgoing IB device. -- Hal From jim.ryan at intel.com Mon Oct 17 07:24:14 2005 From: jim.ryan at intel.com (Ryan, Jim) Date: Mon, 17 Oct 2005 07:24:14 -0700 Subject: [openib-general] FW: Open letter to OpenIB membership re-SC05 and response to last board meeting Message-ID: I am cross-posting this email to the general mailing list to bring attention to the activities at SC'05 related to InfiniBand. This is important work you may not be aware of Thanks, Jim Ryan -----Original Message----- From: Bill Boas [mailto:bboas at llnl.gov] Sent: Sunday, October 16, 2005 1:21 PM To: Ryan, Jim; openib-promoters at openib.org Cc: Rupert Dance; Phamdo, Tuan; tom tucker Subject: Open letter to OpenIB membership re-SC05 and response to last board meeting Jim, and Fellow Members, I apologize for not participating in the last Board meeting where the IBTA CIWG and OpenIB presentation was made to those on the call and voted on. 
The slides presented contain serious misconceptions of the state of the OpenIB core stack (not some of the ULPs) that call into question the value of the full set of members of OpenIB accepting the proposal as presented and the subsequent time to market and financial implications. First the misconceptions: Slide 1 Current status - it is rather simplistic to state that each vendor provides its own unique stack and that interoperability is not well established. Many customers like the Labs have mixes of HCA's from one vendor working, and some in production, with anothers vendors switches. It is in the subnet managers that the mixing has not taken place so extensively, and the SMs are involved more in configuring and management than interoperability. One of the main goals, already well on its way to achievement, is to have only ONE CORE stack in the Linux distributions that all IB vendors, mobo vendors, OEMs (IBM, dell, Sun etc.), database, filesytem and storage (Oracle, NetApp and DDN etc.) [I'm going to label these the IB EcoSystem"], AND IB customers (Labs, NSF sites, enterprise data centers, embedded systems, etc.) ALL use in their OWN product releases and that customers use at their sites. This consistency derives from the open source nature of the core stack and its resilience and readiness for use UBIQUITOUSLY can only derive from widespread EcoSystem and Customer beta testing. Slide 2 - Motivation - each of the statements on this slide are substantially erroneous IMHO. Interoperability between IB vendor h/w and s/w is regularly deployed today with what we used to call Gen1 and with some sites already using Gen2 (i.e. OpenIB release 1.0 candidates) in production end users i.e customers understand "vendor lock" is probably not a problem today and will go away with the release of OpenIB Rel 1.0 and its distribution made available to customers. 
Motivation -Cont The major purpose of the OpenIB sponsorship of the IB infrastructure at SC05 in November is to demonstrate in a public forum that the IB EcoSystem and its Customers (albeit in HPC at this time) are "THERE NOW". I agree that the Enterprise community may feel that, even if HPC shows OpenIB rel 1.0 is "THERE", more proof may be needed, but a Plugfest is not the forum for this, I suggest. The solution is getting more customers from the Enterprise community to understand that in an open source ecosystem, validation that new or updated software is ready for their production use is their responsibility (if they want the full value of open source), or they can turn to a Tier 1 or 2 or integrator to provide that assurance/comfort for them at some cost they are willing to bear. Slide 3 - Technical details - many details of what needs to be tested and verified in a beta environment over a sustained period of time are missing (perhaps as Lamprey and UNH are not part of the ecosystem or substantial customers they don't yet understand what needs to be done or the open source and open community environment necessary). verbs, kdapl, portals NAL, udapl, iser, cm are all missing ..... as customers we may have issues about the physical layer, links and cables but the issues are not solvable at a plugfest.... Event Logistics - SC05 and participation in SCinet05-IB is OPEN to ANYONE, there are products and members from the whole EcoSystem and Customers (more than just HPC), there's more variety of hardware, software and applications there than UNH could ever put together, anyone can demonstrate or test anything from anywhere in the world because it's open to the Internet, if the exhibitors allow it. The occurrence of problems will be in the open. And BTW there's data center fiber, campus, MAN and WAN IB compliance and interoperability being shown also.
The press will be there and can be shown, or see, IB in action - they need not be wary of a press release issued by those who have self- interest in its content. So here's a suggestion or two for activities that I hope others will organize at SC05: 1) "IBTA CI and Interop" stakeholders prepare tests to run over SCinet05-IB (see diagram attached) to ferret out the problems that they think exist and document the results of those tests on this and the openib-general mail lists for the whole community to learn about. If Lamprey and UNH wish to participate and learn I'm sure the IB EcoSystem will welcome them and help. 2) Most IMPORTANTLY, (as SC05 is not a marketing trade show per se, but a technical and professional conference where testing and experimentation is encouraged,) the engineering, software developers, the QA and support teams of each of the IB vendors, Tier 1and 2s, Mother board and commodity system vendors, Integrators, software distributors and Customers (they are all at SC05) TEST THE HECK OUT the CORE STACK and the ULPs and their customers' APPLICATIONS (SCinet05-IB should be up and running without interruption for approx 4 x 24 hour days and open to the world) 3) All members of the IB EcoSystem should also post on these mail lists and the OpenIB web site any issues with OpenIB, Operating Systems, and Applications they find on the SCinet05-IB infrastructure. 4) At SC05 OpenIB organizes (Jim can you arrange that?) a meeting of the membership to review in real time what has been learned and decide on whether and when OpenIB release 1.0 (and what that actually is) can be released. I hope this email generates a lot of discussion and debate......My hope is that SC05 will prove OpenIB release 1.0 is ready for release in December 05.. we'll see... In another email we will update the members on the XNET booth demo an integrated OpenIB and iWARP stack that Tom Tucker et al. 
are working on courtesy of Sandia and NetApp and will be a useful input to the integration working group referenced in the minutes. Respectfully, Bill. At 04:09 PM 10/12/2005, Ryan, Jim wrote: >Let me know if comments/corrections on the attached. Note the >request for the integration team to start work ASAP > >Thanks, Jim Ryan, Chairman, OpenIB > > > >_______________________________________________ >openib-promoters mailing list >openib-promoters at openib.org >http://openib.org/mailman/listinfo/openib-promoters Bill Boas bboas at llnl.gov ICCD LLNL, B-453, R-2018 Wk: 925-422-4110 7000 East Ave, L-555 Cell: 925-337-2224 Livermore, CA 94551 Pgr: 877-203-2248 -------------- next part -------------- A non-text attachment was scrubbed... Name: Layout of Infiniband Links in SCinet 05 - IB 10_15_05.ppt Type: application/octet-stream Size: 56832 bytes Desc: Layout of Infiniband Links in SCinet 05 - IB 10_15_05.ppt URL: From hch at lst.de Mon Oct 17 08:44:26 2005 From: hch at lst.de (Christoph Hellwig) Date: Mon, 17 Oct 2005 17:44:26 +0200 Subject: [openib-general] FW: Open letter to OpenIB membership re-SC05 and response to last board meeting In-Reply-To: References: Message-ID: <20051017154426.GA9348@lst.de> On Mon, Oct 17, 2005 at 07:24:14AM -0700, Ryan, Jim wrote: > I am cross-posting this email to the general mailing list to bring > attention to the activities at SC'05 related to InfiniBand. This is > important work you may not be aware of > > Thanks, Jim Ryan Who cares? This sounds like really horrible bitching. OpenIB Gen2 works nicely today and everything else fortunately becomes irrelevant. Whether a slide is slightly wrong or not is something you should talk to the author about.
From hch at lst.de Mon Oct 17 08:45:57 2005 From: hch at lst.de (Christoph Hellwig) Date: Mon, 17 Oct 2005 17:45:57 +0200 Subject: [openib-general] FW: Open letter to OpenIB membership re-SC05 and response to last board meeting In-Reply-To: <20051017154426.GA9348@lst.de> References: <20051017154426.GA9348@lst.de> Message-ID: <20051017154557.GB9348@lst.de> On Mon, Oct 17, 2005 at 05:44:26PM +0200, Christoph Hellwig wrote: > On Mon, Oct 17, 2005 at 07:24:14AM -0700, Ryan, Jim wrote: > > I am cross-posting this email to the general mailing list to bring > > attention to the activities at SC'05 related to InfiniBand. This is > > important work you may not be aware of > > > > Thanks, Jim Ryan > > Who cares? This sounds like really horrible bitching. OpenIB Gen2 works > nicely today and everything else fortunately becomes irrelevant. > Whether a slide is slightly wrong or not is something you should talk > to the author about. That being said, this looks like the typical policy bullshit once a board and business people are involved. Sounds like it's finally time to kill the unholy openib organization and let the people who get the really nice work done get their work done without all the bullshit involved. From paul.baxter at dsl.pipex.com Mon Oct 17 09:25:03 2005 From: paul.baxter at dsl.pipex.com (Paul Baxter) Date: Mon, 17 Oct 2005 17:25:03 +0100 Subject: [openib-general] FW: Open letter to OpenIB membership re-SC05 andresponse to last board meeting References: <20051017154426.GA9348@lst.de> <20051017154557.GB9348@lst.de> Message-ID: <006b01c5d337$557992e0$8000000a@blorp> > That being said, this looks like the typical policy bullshit once a > board and business people are involved. Sounds like it's finally time > to kill the unholy openib organization and let the people who get the > really nice work done get their work done without all the bullshit > involved. I suggest you back up your opinions with action and avoid future postings to the OpenIB mailing lists.
I, for one, am sick of your constant 'holier than thou' attitude. OpenIB may have a few faults but it has accomplished a lot in this niche market. Regards Paul Baxter From hch at lst.de Mon Oct 17 09:28:59 2005 From: hch at lst.de (Christoph Hellwig) Date: Mon, 17 Oct 2005 18:28:59 +0200 Subject: [openib-general] FW: Open letter to OpenIB membership re-SC05 andresponse to last board meeting In-Reply-To: <006b01c5d337$557992e0$8000000a@blorp> References: <20051017154557.GB9348@lst.de> <006b01c5d337$557992e0$8000000a@blorp> Message-ID: <20051017162859.GB10163@lst.de> On Mon, Oct 17, 2005 at 05:25:03PM +0100, Paul Baxter wrote: > I suggest you back up your opinions with action and avoid future postings > to the OpenIB mailing lists. > > I, for one, am sick of your constant 'holier than thou' attitude. > > OpenIB may have a few faults but it has accomplished a lot in this niche > market. As I said I have lots of respect for people like Roland to actually get all the work done. The political bullshit on the other hand is a constant annoyance. There's a reason the other linux subsystems don't have a trade organization with a political agenda behind them. From mshefty at ichips.intel.com Mon Oct 17 09:42:13 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 17 Oct 2005 09:42:13 -0700 Subject: [openib-general] Re: [PATCH] [MAD/Agent] convert agent.c to use ib_create_send_mad() In-Reply-To: <1129373165.16900.679.camel@hal.voltaire.com> References: <1129373165.16900.679.camel@hal.voltaire.com> Message-ID: <4353D465.5080203@ichips.intel.com> Hal Rosenstock wrote: > Looks good. One comment below on agent_send_response. > > Have you tested this ? I did run with this change (along with the related changes to mthca and sa_query). I was able to bring up the node, run ipoib, cmtest, and cmatose. 
>>+void agent_send_response(struct ib_mad *mad, struct ib_grh *grh, > > ^^^^ > int > > Shouldn't this be left as int (and set error returns internal to this > routine where they occur) ? There seem to be a number of them although > the number has been reduced. From what I could see, this was only called from mad.c, which no longer uses the return value. - Sean From mshefty at ichips.intel.com Mon Oct 17 09:49:08 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 17 Oct 2005 09:49:08 -0700 Subject: [openib-general] How to debug QP INIT->RTR -22 error In-Reply-To: <43527119.2050103@keysounds.co.uk> References: <43527119.2050103@keysounds.co.uk> Message-ID: <4353D604.6080509@ichips.intel.com> Steve Wooding wrote: > I'm trying to make a QP connection using the CM, but the active side > cannot get to the RTR state. ibv_modify_qp returns errorno -22, invalid > argument. How are you setting the QP attributes? You can try using the ib_cm_init_qp_attr() call to set the attributes if you're not. - Sean From halr at voltaire.com Mon Oct 17 09:45:19 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Oct 2005 12:45:19 -0400 Subject: [openib-general] Re: [PATCH] [MAD/Agent] convert agent.c to use ib_create_send_mad() In-Reply-To: <4353D465.5080203@ichips.intel.com> References: <1129373165.16900.679.camel@hal.voltaire.com> <4353D465.5080203@ichips.intel.com> Message-ID: <1129567518.16900.14584.camel@hal.voltaire.com> On Mon, 2005-10-17 at 12:42, Sean Hefty wrote: > >>+void agent_send_response(struct ib_mad *mad, struct ib_grh *grh, > > > > ^^^^ > > int > > > > Shouldn't this be left as int (and set error returns internal to this > > routine where they occur) ? There seem to be a number of them although > > the number has been reduced. > > From what I could see, this was only called from mad.c, which no longer uses > the return value. Yes, but why not ? (I think that was also part of your change). 
At least, the errors should be indicated with printk's so it can be seen in the log what failure occured in agent_send_response (like failed ib_create_ah_from_wc or ib_create_send_mad or ib_post_send_mad). -- Hal From krause at cup.hp.com Mon Oct 17 09:53:09 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 17 Oct 2005 09:53:09 -0700 Subject: [openib-general] IB and FC In-Reply-To: <3E6BB9CEE261E2428AD25D0D553DC497014E6B0B@HSDLNTD1110010.no ida.hcltech.com> References: <3E6BB9CEE261E2428AD25D0D553DC497014E6B0B@HSDLNTD1110010.noida.hcltech.com> Message-ID: <6.2.0.14.2.20051017095225.025edc88@esmail.cup.hp.com> These types of discussions should be taken up with IB technology / OEM vendors directly as they have nothing to do with development. Mike At 06:28 AM 10/15/2005, Mohit Katiyar, Noida wrote: >Hi all, >Sorry previous mail got scrapped due to HTML pictures so now with text >pictures >I just cant clear a doubt about IB. >In the first figure given below the max speed that can be obtained >between the client and the IO storage is 2Gb/s > >Client --------| >Client --------| |--- FC Switch---| > . | | | > . |---FC Cables--| |-----I/O storage > . | Each client | | >Client --------| connected |--- FC Switch---| > To both switch > > > Figure 1 > > > >While in the figure given below the client to IB FC gateway speed is > >10 GB/s and from Gateway to I/O storage is 2GB/s and if port aggregation >is applied at gateway then 4GB/s. So the total effective speed from >client to I/O storage can max be reached at 4GB/s > > IB cables >Client --------| >Client --------| |----- FC >Switch---| > . | IB cables | >| > . |------------IB FC ------ | |---FC >Cables------I/O storage > . | Gateway/Router | >| >Client --------| |----- FC >Switch---| > > > Figure 2 > > > >So can anyone explain me am I correct in my approach? Are there any >other advantages in shifting from figure 1 architecture to figure 2 >architecture? 
> >It does not seem any advantageous in shifting from FC SAN to IB FC SAN >through such a pattern? > >Can anyone help me in deciding about this?? > > > > > > > >Thanks in advance > > > >Mohit Katiyar >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Mon Oct 17 10:01:43 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 17 Oct 2005 10:01:43 -0700 Subject: [openib-general] Re: [PATCH] [MAD/Agent] convert agent.c to use ib_create_send_mad() In-Reply-To: <1129567518.16900.14584.camel@hal.voltaire.com> References: <1129373165.16900.679.camel@hal.voltaire.com> <4353D465.5080203@ichips.intel.com> <1129567518.16900.14584.camel@hal.voltaire.com> Message-ID: <4353D8F7.40602@ichips.intel.com> Hal Rosenstock wrote: > Yes, but why not ? (I think that was also part of your change). The agent code now allocates a buffer for the MAD by calling ib_create_send_mad(). The input MAD buffer is copied to the send buffer, then transmitted. The result is that the input MAD buffer is always available for posting on the receive queue. > At least, the errors should be indicated with printk's so it can be seen > in the log what failure occured in agent_send_response (like failed > ib_create_ah_from_wc or ib_create_send_mad or ib_post_send_mad). IMO, the print statements would be better placed in agent_send_response(), which knows the context surrounding the failure, than by a caller. I'm fine with having the function return a value, but the return value will still be ignored by mad.c. I will change the function from void to int though, so that a caller will be aware that a failure did occur. 
- Sean From sean.hefty at intel.com Mon Oct 17 10:11:17 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 17 Oct 2005 10:11:17 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <1129473593.25345.51.camel@mail.es335.com> Message-ID: >I think the current CMA could probably be better. Can you please clarify what you would change to the CMA API or implementation? I would rather get changes in sooner, rather than waiting until it has been pushed upstream. And to be clear, the current interface is not attempting to abstract QPs, CQs, or other hardware resources. - Sean From sean.hefty at intel.com Mon Oct 17 10:37:56 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 17 Oct 2005 10:37:56 -0700 Subject: [PATCH] core/addr: fix compilation warning {was Re:[openib-general] Latest build test results} In-Reply-To: <1129413064.16900.2657.camel@hal.voltaire.com> Message-ID: >> static int addr_arp_recv(struct sk_buff *skb, struct net_device *dev, >> - struct packet_type *pkt) >> + struct packet_type *pkt, struct net_device *orig_dev) >> { >> struct arphdr *arp_hdr; > >Thanks. I just added that one as >linux-kernel/patches/linux-2.6.14-rc3-addr.diff Thanks, Hal. Does anyone know the reason for this change? - Sean From sean.hefty at intel.com Mon Oct 17 10:48:16 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 17 Oct 2005 10:48:16 -0700 Subject: [openib-general] [PATCHv2] [MAD/Agent] convert agent.c touse ib_create_send_mad() In-Reply-To: <1129567518.16900.14584.camel@hal.voltaire.com> Message-ID: Here's an updated version that defines agent_send_response as returning an int. 
Signed-off-by: Sean Hefty Index: agent.c =================================================================== --- agent.c (revision 3794) +++ agent.c (working copy) @@ -36,58 +36,41 @@ * * $Id$ */ +#include "agent.h" +#include "smi.h" -#include -#include - -#include +#define SPFX "ib_agent: " -#include "smi.h" -#include "agent_priv.h" -#include "mad_priv.h" -#include "agent.h" +struct ib_agent_port_private { + struct list_head port_list; + struct ib_mad_agent *agent[2]; +}; -spinlock_t ib_agent_port_list_lock; +static DEFINE_SPINLOCK(ib_agent_port_list_lock); static LIST_HEAD(ib_agent_port_list); -/* - * Caller must hold ib_agent_port_list_lock - */ -static inline struct ib_agent_port_private * -__ib_get_agent_port(struct ib_device *device, int port_num, - struct ib_mad_agent *mad_agent) +static struct ib_agent_port_private * +__ib_get_agent_port(struct ib_device *device, int port_num) { struct ib_agent_port_private *entry; - BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ - - if (device) { - list_for_each_entry(entry, &ib_agent_port_list, port_list) { - if (entry->smp_agent->device == device && - entry->port_num == port_num) - return entry; - } - } else { - list_for_each_entry(entry, &ib_agent_port_list, port_list) { - if ((entry->smp_agent == mad_agent) || - (entry->perf_mgmt_agent == mad_agent)) - return entry; - } + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if (entry->agent[0]->device == device && + entry->agent[0]->port_num == port_num) + return entry; } return NULL; } -static inline struct ib_agent_port_private * -ib_get_agent_port(struct ib_device *device, int port_num, - struct ib_mad_agent *mad_agent) +static struct ib_agent_port_private * +ib_get_agent_port(struct ib_device *device, int port_num) { struct ib_agent_port_private *entry; unsigned long flags; spin_lock_irqsave(&ib_agent_port_list_lock, flags); - entry = __ib_get_agent_port(device, port_num, mad_agent); + entry = __ib_get_agent_port(device, port_num); 
spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); - return entry; } @@ -99,192 +82,71 @@ int smi_check_local_dr_smp(struct ib_smp if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) return 1; - port_priv = ib_get_agent_port(device, port_num, NULL); + + port_priv = ib_get_agent_port(device, port_num); if (!port_priv) { printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " - "not open\n", - device->name, port_num); + "not open\n", device->name, port_num); return 1; } - return smi_check_local_smp(port_priv->smp_agent, smp); + return smi_check_local_smp(port_priv->agent[0], smp); } -static int agent_mad_send(struct ib_mad_agent *mad_agent, - struct ib_agent_port_private *port_priv, - struct ib_mad_private *mad_priv, - struct ib_grh *grh, - struct ib_wc *wc) -{ - struct ib_agent_send_wr *agent_send_wr; - struct ib_sge gather_list; - struct ib_send_wr send_wr; - struct ib_send_wr *bad_send_wr; - struct ib_ah_attr ah_attr; - unsigned long flags; - int ret = 1; - - agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); - if (!agent_send_wr) - goto out; - agent_send_wr->mad = mad_priv; - - gather_list.addr = dma_map_single(mad_agent->device->dma_device, - &mad_priv->mad, - sizeof(mad_priv->mad), - DMA_TO_DEVICE); - gather_list.length = sizeof(mad_priv->mad); - gather_list.lkey = mad_agent->mr->lkey; - - send_wr.next = NULL; - send_wr.opcode = IB_WR_SEND; - send_wr.sg_list = &gather_list; - send_wr.num_sge = 1; - send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ - send_wr.wr.ud.timeout_ms = 0; - send_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; - - ah_attr.dlid = wc->slid; - ah_attr.port_num = mad_agent->port_num; - ah_attr.src_path_bits = wc->dlid_path_bits; - ah_attr.sl = wc->sl; - ah_attr.static_rate = 0; - ah_attr.ah_flags = 0; /* No GRH */ - if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - if (wc->wc_flags & IB_WC_GRH) { - ah_attr.ah_flags = IB_AH_GRH; - /* Should sgid be looked up ? 
*/ - ah_attr.grh.sgid_index = 0; - ah_attr.grh.hop_limit = grh->hop_limit; - ah_attr.grh.flow_label = be32_to_cpu( - grh->version_tclass_flow) & 0xfffff; - ah_attr.grh.traffic_class = (be32_to_cpu( - grh->version_tclass_flow) >> 20) & 0xff; - memcpy(ah_attr.grh.dgid.raw, - grh->sgid.raw, - sizeof(ah_attr.grh.dgid)); - } - } - - agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); - if (IS_ERR(agent_send_wr->ah)) { - printk(KERN_ERR SPFX "No memory for address handle\n"); - kfree(agent_send_wr); - goto out; - } - - send_wr.wr.ud.ah = agent_send_wr->ah; - if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - send_wr.wr.ud.pkey_index = wc->pkey_index; - send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; - } else { /* for SMPs */ - send_wr.wr.ud.pkey_index = 0; - send_wr.wr.ud.remote_qkey = 0; - } - send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr; - send_wr.wr_id = (unsigned long)agent_send_wr; - - pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); - - /* Send */ - spin_lock_irqsave(&port_priv->send_list_lock, flags); - if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - dma_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(agent_send_wr, mapping), - sizeof(mad_priv->mad), - DMA_TO_DEVICE); - ib_destroy_ah(agent_send_wr->ah); - kfree(agent_send_wr); - } else { - list_add_tail(&agent_send_wr->send_list, - &port_priv->send_posted_list); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - ret = 0; - } - -out: - return ret; -} - -int agent_send(struct ib_mad_private *mad, - struct ib_grh *grh, - struct ib_wc *wc, - struct ib_device *device, - int port_num) +int agent_send_response(struct ib_mad *mad, struct ib_grh *grh, + struct ib_wc *wc, struct ib_device *device, + int port_num, int qpn) { struct ib_agent_port_private *port_priv; - struct ib_mad_agent *mad_agent; - - port_priv = ib_get_agent_port(device, port_num, NULL); - if (!port_priv) { - 
printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", - device->name, port_num); - return 1; - } + struct ib_mad_agent *agent; + struct ib_mad_send_buf *send_buf; + struct ib_send_wr *bad_wr; + struct ib_ah *ah; + int ret; - /* Get mad agent based on mgmt_class in MAD */ - switch (mad->mad.mad.mad_hdr.mgmt_class) { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - case IB_MGMT_CLASS_SUBN_LID_ROUTED: - mad_agent = port_priv->smp_agent; - break; - case IB_MGMT_CLASS_PERF_MGMT: - mad_agent = port_priv->perf_mgmt_agent; - break; - default: - return 1; - } + port_priv = ib_get_agent_port(device, port_num); + if (!port_priv) + return -ENODEV; - return agent_mad_send(mad_agent, port_priv, mad, grh, wc); + agent = port_priv->agent[qpn]; + ah = ib_create_ah_from_wc(agent->qp->pd, wc, grh, port_num); + if (IS_ERR(ah)) + return PTR_ERR(ah); + + send_buf = ib_create_send_mad(agent, wc->src_qp, wc->pkey_index, ah, 0, + sizeof *mad - IB_MGMT_MAD_DATA, + IB_MGMT_MAD_DATA, GFP_KERNEL); + if (IS_ERR(send_buf)) { + ret = PTR_ERR(send_buf); + goto err1; + } + + *send_buf->mad = *mad; + if ((ret = ib_post_send_mad(agent, &send_buf->send_wr, &bad_wr))) + goto err2; + return 0; +err2: + ib_free_send_mad(send_buf); +err1: + ib_destroy_ah(ah); + return ret; } static void agent_send_handler(struct ib_mad_agent *mad_agent, struct ib_mad_send_wc *mad_send_wc) { - struct ib_agent_port_private *port_priv; - struct ib_agent_send_wr *agent_send_wr; - unsigned long flags; - - /* Find matching MAD agent */ - port_priv = ib_get_agent_port(NULL, 0, mad_agent); - if (!port_priv) { - printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " - "agent %p\n", mad_agent); - return; - } + struct ib_mad_send_buf *send_buf; - agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; - spin_lock_irqsave(&port_priv->send_list_lock, flags); - /* Remove completed send from posted send MAD list */ - list_del(&agent_send_wr->send_list); - spin_unlock_irqrestore(&port_priv->send_list_lock, 
flags); - - dma_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(agent_send_wr, mapping), - sizeof(agent_send_wr->mad->mad), - DMA_TO_DEVICE); - - ib_destroy_ah(agent_send_wr->ah); - - /* Release allocated memory */ - kmem_cache_free(ib_mad_cache, agent_send_wr->mad); - kfree(agent_send_wr); + send_buf = (void *)(unsigned long) mad_send_wc->wr_id; + ib_destroy_ah(send_buf->send_wr.wr.ud.ah); + ib_free_send_mad(send_buf); } int ib_agent_port_open(struct ib_device *device, int port_num) { - int ret; struct ib_agent_port_private *port_priv; unsigned long flags; - - /* First, check if port already open for SMI */ - port_priv = ib_get_agent_port(device, port_num, NULL); - if (port_priv) { - printk(KERN_DEBUG SPFX "%s port %d already open\n", - device->name, port_num); - return 0; - } + int ret; /* Create new device info */ port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); @@ -293,32 +155,25 @@ int ib_agent_port_open(struct ib_device ret = -ENOMEM; goto error1; } - memset(port_priv, 0, sizeof *port_priv); - port_priv->port_num = port_num; - spin_lock_init(&port_priv->send_list_lock); - INIT_LIST_HEAD(&port_priv->send_posted_list); - - /* Obtain send only MAD agent for SM class (SMI QP) */ - port_priv->smp_agent = ib_register_mad_agent(device, port_num, - IB_QPT_SMI, - NULL, 0, - &agent_send_handler, - NULL, NULL); - if (IS_ERR(port_priv->smp_agent)) { - ret = PTR_ERR(port_priv->smp_agent); + /* Obtain send only MAD agent for SMI QP */ + port_priv->agent[0] = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->agent[0])) { + ret = PTR_ERR(port_priv->agent[0]); goto error2; } - /* Obtain send only MAD agent for PerfMgmt class (GSI QP) */ - port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, - IB_QPT_GSI, - NULL, 0, - &agent_send_handler, - NULL, NULL); - if (IS_ERR(port_priv->perf_mgmt_agent)) { - ret = PTR_ERR(port_priv->perf_mgmt_agent); + /* Obtain send only MAD agent 
for GSI QP */ + port_priv->agent[1] = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->agent[1])) { + ret = PTR_ERR(port_priv->agent[1]); goto error3; } @@ -329,7 +184,7 @@ int ib_agent_port_open(struct ib_device return 0; error3: - ib_unregister_mad_agent(port_priv->smp_agent); + ib_unregister_mad_agent(port_priv->agent[0]); error2: kfree(port_priv); error1: @@ -342,7 +197,7 @@ int ib_agent_port_close(struct ib_device unsigned long flags; spin_lock_irqsave(&ib_agent_port_list_lock, flags); - port_priv = __ib_get_agent_port(device, port_num, NULL); + port_priv = __ib_get_agent_port(device, port_num); if (port_priv == NULL) { spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); printk(KERN_ERR SPFX "Port %d not found\n", port_num); @@ -351,9 +206,8 @@ int ib_agent_port_close(struct ib_device list_del(&port_priv->port_list); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); - ib_unregister_mad_agent(port_priv->perf_mgmt_agent); - ib_unregister_mad_agent(port_priv->smp_agent); + ib_unregister_mad_agent(port_priv->agent[1]); + ib_unregister_mad_agent(port_priv->agent[0]); kfree(port_priv); - return 0; } Index: agent.h =================================================================== --- agent.h (revision 3794) +++ agent.h (working copy) @@ -39,17 +39,14 @@ #ifndef __AGENT_H_ #define __AGENT_H_ -extern spinlock_t ib_agent_port_list_lock; +#include -extern int ib_agent_port_open(struct ib_device *device, - int port_num); +extern int ib_agent_port_open(struct ib_device *device, int port_num); extern int ib_agent_port_close(struct ib_device *device, int port_num); -extern int agent_send(struct ib_mad_private *mad, - struct ib_grh *grh, - struct ib_wc *wc, - struct ib_device *device, - int port_num); +extern int agent_send_response(struct ib_mad *mad, struct ib_grh *grh, + struct ib_wc *wc, struct ib_device *device, + int port_num, int qpn); #endif /* __AGENT_H_ */ Index: smi.h 
=================================================================== --- smi.h (revision 3794) +++ smi.h (working copy) @@ -35,10 +35,11 @@ * * $Id$ */ - #ifndef __SMI_H_ #define __SMI_H_ +#include + int smi_handle_dr_smp_recv(struct ib_smp *smp, u8 node_type, int port_num, Index: mad.c =================================================================== --- mad.c (revision 3794) +++ mad.c (working copy) @@ -1728,11 +1728,11 @@ local: if (ret & IB_MAD_RESULT_CONSUMED) goto out; if (ret & IB_MAD_RESULT_REPLY) { - /* Send response */ - if (!agent_send(response, &recv->grh, wc, - port_priv->device, - port_priv->port_num)) - response = NULL; + agent_send_response(&response->mad.mad, + &recv->grh, wc, + port_priv->device, + port_priv->port_num, + qp_info->qp->qp_num); goto out; } } @@ -2761,7 +2761,6 @@ static int __init ib_mad_init_module(voi int ret; spin_lock_init(&ib_mad_port_list_lock); - spin_lock_init(&ib_agent_port_list_lock); ib_mad_cache = kmem_cache_create("ib_mad", sizeof(struct ib_mad_private), From rolandd at cisco.com Mon Oct 17 10:54:21 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 17 Oct 2005 10:54:21 -0700 Subject: [PATCH] core/addr: fix compilation warning {was Re:[openib-general] Latest build test results} In-Reply-To: (Sean Hefty's message of "Mon, 17 Oct 2005 10:37:56 -0700") References: Message-ID: <52d5m4qbaq.fsf@cisco.com> Sean> Does anyone know the reason for this change? From the git tree, I can tell you that the change is from the commit: diff-tree f2ccd8fa06c8e302116e71df372f5c1f83432e03 (from b6b99eb5409d75ae35390057cd28f3aedfbd4cf4) Author: David S. Miller Date: Tue Aug 9 19:34:12 2005 -0700 [NET]: Kill skb->real_dev Bonding just wants the device before the skb_bond() decapsulation occurs, so simply pass that original device into packet_type->func() as an argument. It remains to be seen whether we can use this same exact thing to get rid of skb->input_dev as well. Signed-off-by: David S.
Miller it looks like the networking stack now passes the real device into packet_type.func as orig_dev, rather than having the function look at skb->real_dev. - R. From ttucker at es335.com Mon Oct 17 11:19:50 2005 From: ttucker at es335.com (Tom Tucker) Date: Mon, 17 Oct 2005 13:19:50 -0500 Subject: [openib-general] iWARP Driver Update Message-ID: <1129573190.3774.10.camel@trinity.austin.ammasso.com> We've just checked in a significant patch to the iWARP branch. The driver is now functional to the point of establishing a connection and doing RDMA_SEND/RDMA_RECV. OpenIB now has a functioning iWARP driver! I didn't submit this as a patch for review because it was huge. In any case, please have a look if you're interested. BTW, the CM piece of this code is NOT integrated with Sean's latest CMA. After reviewing the CMA stuff, I think we need to change the IW CM to reflect his design. For what it's worth, I think the CMA stuff is really coming together by the way. Tom From nacc at us.ibm.com Mon Oct 17 12:01:40 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 17 Oct 2005 12:01:40 -0700 Subject: [PATCH] core/addr: fix compilation warning {was Re:[openib-general] Latest build test results} In-Reply-To: <52d5m4qbaq.fsf@cisco.com> References: <52d5m4qbaq.fsf@cisco.com> Message-ID: <20051017190140.GJ28213@us.ibm.com> On 17.10.2005 [10:54:21 -0700], Roland Dreier wrote: > Sean> Does anyone know the reason for this change? > > From the git tree, I can tell you that the change is from the commit: > > diff-tree f2ccd8fa06c8e302116e71df372f5c1f83432e03 (from b6b99eb5409d75ae35390057cd28f3aedfbd4cf4) > Author: David S. Miller > Date: Tue Aug 9 19:34:12 2005 -0700 > > [NET]: Kill skb->real_dev > > Bonding just wants the device before the skb_bond() > decapsulation occurs, so simply pass that original > device into packet_type->func() as an argument. > > It remains to be seen whether we can use this same > exact thing to get rid of skb->input_dev as well. 
> > Signed-off-by: David S. Miller > > it looks like the networking stack now passes the real device into > packet_type.func as orig_dev, rather than having the function look at > skb->real_dev. So, if I understand David's comment correctly, it's a structure size reduction and a small API change? That probably means we've caught all the bugs that exist right now (compilation-wise), as I'm not seeing any in the svn kernel tree with the handful of patches in this thread. And I guess there aren't any users of real_dev in the tree that we need to worry about, as those would have spat warnings too. Thanks, Nish From tom at ammasso.com Mon Oct 17 12:25:50 2005 From: tom at ammasso.com (Tom Tucker) Date: Mon, 17 Oct 2005 14:25:50 -0500 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: References: Message-ID: <1129577150.3774.25.camel@trinity.austin.ammasso.com> On Mon, 2005-10-17 at 10:11 -0700, Sean Hefty wrote: > >I think the current CMA could probably be better. > > Can you please clarify what you would change to the CMA API or implementation? > I would rather get changes in sooner, rather than waiting until it has been > pushed upstream. At first blush, the API looks good to me. The kinds of changes I was pondering were related to hiding some of the routing issues. For example, if the app. doesn't bind the rdma_cm_id prior to calling rdma_connect, the code will lookup and use the default route instead of returning -EINVAL. These kinds of things allow the app to use bind if they want control, or not use bind (and simplify the code) if they are happy to take the defaults. I was planning to do a patch and submit it for review, but if you'd prefer talking through it -- that's fine too. > And to be clear, the current interface is not attempting to abstract QPs, CQs, > or other hardware resources. > Absolutely.
> - Sean > From halr at voltaire.com Mon Oct 17 12:10:06 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Oct 2005 15:10:06 -0400 Subject: [openib-general] Re: [PATCH] [MAD/Agent] convert agent.c to use ib_create_send_mad() In-Reply-To: <4353D8F7.40602@ichips.intel.com> References: <1129373165.16900.679.camel@hal.voltaire.com> <4353D465.5080203@ichips.intel.com> <1129567518.16900.14584.camel@hal.voltaire.com> <4353D8F7.40602@ichips.intel.com> Message-ID: <1129576084.16900.15502.camel@hal.voltaire.com> On Mon, 2005-10-17 at 13:01, Sean Hefty wrote: > Hal Rosenstock wrote: > > Yes, but why not ? (I think that was also part of your change). > > The agent code now allocates a buffer for the MAD by calling > ib_create_send_mad(). The input MAD buffer is copied to the send buffer, then > transmitted. The result is that the input MAD buffer is always available for > posting on the receive queue. What about the other resources needed ? Are they always available ? > > At least, the errors should be indicated with printk's so it can be seen > > in the log what failure occured in agent_send_response (like failed > > ib_create_ah_from_wc or ib_create_send_mad or ib_post_send_mad). > > IMO, the print statements would be better placed in agent_send_response(), which > knows the context surrounding the failure, than by a caller. That's what I was suggesting (but I didn't see them in your updated patch). > I'm fine with having the function return a value, but the return value will > still be ignored by mad.c. I will change the function from void to int though, > so that a caller will be aware that a failure did occur. OK. 
-- Hal From mshefty at ichips.intel.com Mon Oct 17 12:23:45 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 17 Oct 2005 12:23:45 -0700 Subject: [openib-general] [RFC] IB address translation using ARP In-Reply-To: <1129577150.3774.25.camel@trinity.austin.ammasso.com> References: <1129577150.3774.25.camel@trinity.austin.ammasso.com> Message-ID: <4353FA41.7070301@ichips.intel.com> Tom Tucker wrote: > At first blush, the API looks good to me. The kinds of changes I was > pondering were related to hiding some of the routing issues. For > example, if the app. doesn't bind the rdma_cm_id prior to calling > rdma_connect, the code will lookup and use the default route instead of > returning -EINVAL. From an app's perspective, they need to perform the following on the client side: rdma_create_id(); rdma_resolve_addr(); rdma_create_qp(); rdma_resolve_route(); rdma_connect(); Before rdma_resolve_addr() is called, the rdma_cm_id is not associated with a local device. So, rdma_resolve_addr() must be called before a QP can be allocated. I had planned on making rdma_resolve_route() optional, but it complicates device removal handling. It can still be done, but only saves the client about 2 lines of code. Note that both rdma_resolve_addr() and rdma_resolve_route() are asynchronous for IB. > I was planning to do a patch and submit it for review, but if you'd > prefer talking through it -- that's fine Either will work. I can accept a patch or modify the CMA directly if it's a fairly straightforward change. 
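The client-side ordering described above can be pictured as a small set of preconditions. The following is only a user-space sketch under stated assumptions: the sketch_* functions and struct sketch_cm_id are hypothetical stand-ins and do not call or reproduce the real rdma_cm API, and every step is synchronous here even though address and route resolution are asynchronous for IB. It simply shows a QP being refused with -EINVAL until address resolution has associated the id with a local device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an rdma_cm_id; not the real structure. */
struct sketch_cm_id {
	bool bound_to_device;	/* set once address resolution picks a local device */
	bool has_qp;
	bool has_route;
	bool connected;
};

/* Models rdma_create_id(): a fresh id is not tied to any device yet. */
static void sketch_create_id(struct sketch_cm_id *id)
{
	assert(id);
	id->bound_to_device = false;
	id->has_qp = false;
	id->has_route = false;
	id->connected = false;
}

/* Models rdma_resolve_addr(): associates the id with a local device. */
static int sketch_resolve_addr(struct sketch_cm_id *id)
{
	id->bound_to_device = true;
	return 0;
}

/* Models rdma_create_qp(): needs a device, so address resolution comes first. */
static int sketch_create_qp(struct sketch_cm_id *id)
{
	if (!id->bound_to_device)
		return -EINVAL;
	id->has_qp = true;
	return 0;
}

/* Models rdma_resolve_route(): also needs the local device to be known. */
static int sketch_resolve_route(struct sketch_cm_id *id)
{
	if (!id->bound_to_device)
		return -EINVAL;
	id->has_route = true;
	return 0;
}

/* Models rdma_connect(): in this sketch both a QP and a route are required. */
static int sketch_connect(struct sketch_cm_id *id)
{
	if (!id->has_qp || !id->has_route)
		return -EINVAL;
	id->connected = true;
	return 0;
}
```

Making route resolution optional, as discussed above, would amount to dropping the has_route check in sketch_connect() and resolving a default route there instead.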
- Sean From mshefty at ichips.intel.com Mon Oct 17 12:38:10 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 17 Oct 2005 12:38:10 -0700 Subject: [openib-general] Re: [PATCH] [MAD/Agent] convert agent.c to use ib_create_send_mad() In-Reply-To: <1129576084.16900.15502.camel@hal.voltaire.com> References: <1129373165.16900.679.camel@hal.voltaire.com> <4353D465.5080203@ichips.intel.com> <1129567518.16900.14584.camel@hal.voltaire.com> <4353D8F7.40602@ichips.intel.com> <1129576084.16900.15502.camel@hal.voltaire.com> Message-ID: <4353FDA2.7030309@ichips.intel.com> Hal Rosenstock wrote: >>The agent code now allocates a buffer for the MAD by calling >>ib_create_send_mad(). The input MAD buffer is copied to the send buffer, then >>transmitted. The result is that the input MAD buffer is always available for >>posting on the receive queue. > > What about the other resources needed ? Are they always available ? Not sure what you're referring to here. The changes to mad.c were very minor. >>IMO, the print statements would be better placed in agent_send_response(), which >> knows the context surrounding the failure, than by a caller. > > That's what I was suggesting (but I didn't see them in your updated > patch). No - I didn't add debug print statements to the code. There were only a couple of prints in the original code. Do you want debug prints added? 
- Sean From halr at voltaire.com Mon Oct 17 12:39:28 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Oct 2005 15:39:28 -0400 Subject: [openib-general] Re: [PATCH] [MAD/Agent] convert agent.c to use ib_create_send_mad() In-Reply-To: <4353FDA2.7030309@ichips.intel.com> References: <1129373165.16900.679.camel@hal.voltaire.com> <4353D465.5080203@ichips.intel.com> <1129567518.16900.14584.camel@hal.voltaire.com> <4353D8F7.40602@ichips.intel.com> <1129576084.16900.15502.camel@hal.voltaire.com> <4353FDA2.7030309@ichips.intel.com> Message-ID: <1129577968.16900.15644.camel@hal.voltaire.com> On Mon, 2005-10-17 at 15:38, Sean Hefty wrote: > Hal Rosenstock wrote: > >>The agent code now allocates a buffer for the MAD by calling > >>ib_create_send_mad(). The input MAD buffer is copied to the send buffer, then > >>transmitted. The result is that the input MAD buffer is always available for > >>posting on the receive queue. > > > > What about the other resources needed ? Are they always available ? > > Not sure what you're referring to here. The changes to mad.c were very minor. I was referring to the calls in agent_send_response which could fail. > >>IMO, the print statements would be better placed in agent_send_response(), which > >> knows the context surrounding the failure, than by a caller. > > > > That's what I was suggesting (but I didn't see them in your updated > > patch). > > No - I didn't add debug print statements to the code. There were only a couple > of prints in the original code. Do you want debug prints added? Yes. It can be done as a subsequent patch. I just think we want to know when those calls in agent_send_response do fail as they may explain some other external behavior (e.g. a lack of response by the SMA and/or PMA). 
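As a rough picture of the request above, here is a minimal user-space sketch of goto-style unwinding with one message per failure point. Everything in it is hypothetical: sketch_step(), the fail_at knob, and the sketch_log buffer only simulate the three calls in agent_send_response() that can fail, and snprintf() stands in for printk(). It is not the patch under discussion.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

static char sketch_log[128];	/* stands in for the kernel log */
static int sketch_live;		/* simulated resources currently held */

/* Simulated acquisition: returns err when told to fail, else takes a resource. */
static int sketch_step(int fail, int err)
{
	if (fail)
		return err;
	sketch_live++;
	return 0;
}

static void sketch_release(void)
{
	sketch_live--;
	assert(sketch_live >= 0);
}

/* fail_at: 0 = success, 1 = address handle, 2 = send buffer, 3 = post */
static int sketch_send_response(int fail_at)
{
	int ret;

	sketch_log[0] = '\0';

	ret = sketch_step(fail_at == 1, -ENOMEM);	/* models creating the AH */
	if (ret) {
		snprintf(sketch_log, sizeof sketch_log,
			 "create ah failed: %d", ret);
		return ret;
	}

	ret = sketch_step(fail_at == 2, -ENOMEM);	/* models allocating the send buffer */
	if (ret) {
		snprintf(sketch_log, sizeof sketch_log,
			 "create send mad failed: %d", ret);
		goto err1;
	}

	if (fail_at == 3) {				/* models posting the send */
		ret = -EINVAL;
		snprintf(sketch_log, sizeof sketch_log,
			 "post send failed: %d", ret);
		goto err2;
	}

	/*
	 * On success the patch leaves cleanup to the send completion
	 * handler; the sketch releases both resources immediately.
	 */
	sketch_release();
	sketch_release();
	return 0;

err2:
	sketch_release();	/* give back the send buffer */
err1:
	sketch_release();	/* give back the address handle */
	return ret;
}
```

With a message at each exit, a missing SMA or PMA response could be matched to the specific call that failed rather than to a bare error code.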
-- Hal From mshefty at ichips.intel.com Mon Oct 17 13:03:29 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 17 Oct 2005 13:03:29 -0700 Subject: [openib-general] Re: [PATCH] [MAD/Agent] convert agent.c to use ib_create_send_mad() In-Reply-To: <1129577968.16900.15644.camel@hal.voltaire.com> References: <1129373165.16900.679.camel@hal.voltaire.com> <4353D465.5080203@ichips.intel.com> <1129567518.16900.14584.camel@hal.voltaire.com> <4353D8F7.40602@ichips.intel.com> <1129576084.16900.15502.camel@hal.voltaire.com> <4353FDA2.7030309@ichips.intel.com> <1129577968.16900.15644.camel@hal.voltaire.com> Message-ID: <43540391.50707@ichips.intel.com> Hal Rosenstock wrote: > I was referring to the calls in agent_send_response which could fail. It may not be obvious from reading the patch, but agent_send_response() now consists of: ib_get_agent_port(), ib_create_ah_from_wc(), ib_create_send_mad(), ib_post_send_mad(). The wc and grh params are only needed to create the AH, with cleanup handled by the completion callback. agent_send_response() no longer accesses anything from its parameters after the call returns. > Yes. It can be done as a subsequent patch. I just think we want to know > when those calls in agent_send_response do fail as they may explain some > other external behavior (e.g. a lack of response by the SMA and/or PMA). I'll add it to this one. I'm guessing that we may want to commit everything at once, after ib_post_send_mad() has been updated. - Sean From Federico.Sacerdoti at deshaw.com Mon Oct 17 13:05:04 2005 From: Federico.Sacerdoti at deshaw.com (Sacerdoti, Federico) Date: Mon, 17 Oct 2005 16:05:04 -0400 Subject: [openib-general] Announce: openib gen2 Rocks roll Message-ID: Based on the help I received from this list, I am pleased to make available a Rocks roll that provides openib gen2 drivers for Rocks clusters. The roll is based on a SVN snapshot from Sept 22nd, and provides a Linux kernel 2.9.13 as well as tweaks to the ulimits for memory locks, etc.
It includes both the openib drivers and an mvapich-gen2 compiled with gcc 3.4.3 20041212 (Red Hat 3.4.3-9.EL4). This roll has been used for performance and correctness testing here at D.E. Shaw R&D, but has not been widely stressed and may still contain bugs. We have tested it with Topspin HCAs (Mellanox). It is appropriate for clusters running Rocks version 4.0. The RPMs contained within may work with Linux systems based on RHEL4. The Rocks Cluster team has graciously hosted both the binary roll iso and source: ftp://ftp.rocksclusters.org/pub/contrib/openib/ See http://www.rocksclusters.org/Rocks/ for information on Rocks and http://www.rocksclusters.org/rocks-documentation/reference-guide/3.2.0/roll-file.html#ROLL-DEVEL for information about building rolls. Thank you again for your help, Federico From suri at baymicrosystems.com Mon Oct 17 13:15:11 2005 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Mon, 17 Oct 2005 16:15:11 -0400 Subject: [openib-general] initialization -udev In-Reply-To: <1129577968.16900.15644.camel@hal.voltaire.com> Message-ID: <200510172015.j9HKFB87009663@ns1.baymicrosystems.com> Folks: Another basic question, I just built my 2.6.10 kernel with core Infiniband stack. I only have CONFIG_INFINIBAND=y and nothing else (I don't need anything else for now!). Couple of questions: 1. Do I need to load any of the kernel modules by hand? I don't have etc/udev directory (none of the modules listed in MST's installation cheat sheet are applicable to me I think!) 2. What should I see to confirm that the core Infiniband stack is loaded correctly?
Thanks, Suri From rolandd at cisco.com Mon Oct 17 13:41:07 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 17 Oct 2005 13:41:07 -0700 Subject: [openib-general] initialization -udev In-Reply-To: <200510172015.j9HKFB87009663@ns1.baymicrosystems.com> (Suresh Shelvapille's message of "Mon, 17 Oct 2005 16:15:11 -0400") References: <200510172015.j9HKFB87009663@ns1.baymicrosystems.com> Message-ID: <528xwrri58.fsf@cisco.com> Suresh> Folks: Another basic question, I just built my 2.6.10 Suresh> kernel with core Infiniband stack. I only have Suresh> CONFIG_INFINIBAND=y and nothing else (I don't need Suresh> anything else for now!). Couple of questions: Suresh> 1. Do I need to load any of the kernel modules by hand? I Suresh> don't have etc/udev directory (none of the modules listed Suresh> in MST's installation cheat sheet are applicable to me I Suresh> think!) 2. What should I see to confirm that the core Suresh> Infiniband stack is loaded correctly? If you have set all config options to 'y' or 'n' then there are no IB kernel modules to load. I don't think anything is printed by the core IB stack so I wouldn't expect to see anything when it loads. You could add a printk() to ib_core_init() if you're worried. BTW, any particular reason you are using such an old kernel instead of 2.6.13 or 2.6.14-rc4? - R. From halr at voltaire.com Mon Oct 17 13:42:21 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Oct 2005 16:42:21 -0400 Subject: [openib-general] initialization -udev In-Reply-To: <200510172015.j9HKFB87009663@ns1.baymicrosystems.com> References: <200510172015.j9HKFB87009663@ns1.baymicrosystems.com> Message-ID: <1129581740.16900.15907.camel@hal.voltaire.com> Hi Suri, On Mon, 2005-10-17 at 16:15, Suresh Shelvapille wrote: > Folks: > Another basic question, I just built my 2.6.10 kernel with core Infiniband > stack. I only have CONFIG_INFINIBAND=y and nothing else (I don't need > anything else for now!). If you want modules, set it to m, not y.
> Couple of questions: > > 1. Do I need to load any of the kernel modules by hand? I don't have > etc/udev directory (none of the modules listed in MST's installation cheat > sheet are applicable to me I think!) > 2. What should I see to confirm that the core Infiniband stack is loaded > correctly? You should do an /sbin/lsmod | grep ib_ and see that ib_mthca (or whatever your driver is), ib_mad, and ib_core are loaded. I think that's all you need right now. -- Hal > > Thanks, > Suri > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From suri at baymicrosystems.com Mon Oct 17 14:16:13 2005 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Mon, 17 Oct 2005 17:16:13 -0400 Subject: [openib-general] initialization -udev In-Reply-To: <1129581740.16900.15907.camel@hal.voltaire.com> Message-ID: <200510172116.j9HLGD87011165@ns1.baymicrosystems.com> 1. I did not add the CONFIG_INFINIBAND=y by hand. I used make menuconfig and the GUI let me choose "y" to select the Infiniband core. And the result was, in .config which I presume is the one generated by the GUI has CONFIG_INFINIBAND=y. From what you say, the ib_core and ib_mad modules are not loaded if CONFIG_INFINIBAND=y and not 'm'? 2. When I do /sbin/lsmod --I see nothing though.... 3. BTW I have not loaded my device driver yet...I will try it once the core modules get loaded correctly!
Thanks a lot in advance, Suri > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Monday, October 17, 2005 4:42 PM > To: Suresh Shelvapille > Cc: openib-general at openib.org > Subject: Re: [openib-general] initialization -udev > > Hi Suri, > > On Mon, 2005-10-17 at 16:15, Suresh Shelvapille wrote: > > Folks: > > Another basic question, I just built my 2.6.10 kernel with core > Infiniband > > stack. I only have CONFIG_INFINIBAND=y and nothing else (I don't need > > anything else for now!). > > If you want modules, set it to m not y. > > > Couple of questions: > > > > 1. Do I need to load any of the kernel modules by hand? I don't have > > etc/udev directory (none of the modules listed in MST's installation > cheat > > sheet are applicable to me I think!) > > 2. What should I see to confirm that the core Infiniband stack is loaded > > correctly? > > You should do an /sbin/lsmod | grep ib_ and see that ib_mthca (or > whatever your driver is), ib_mad, and ib_core are loaded. I think that's > allz you need right now. > > -- Hal > > > > > Thanks, > > Suri > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From rolandd at cisco.com Mon Oct 17 14:23:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 17 Oct 2005 14:23:05 -0700 Subject: [openib-general] initialization -udev In-Reply-To: <200510172116.j9HLGD87011165@ns1.baymicrosystems.com> (Suresh Shelvapille's message of "Mon, 17 Oct 2005 17:16:13 -0400") References: <200510172116.j9HLGD87011165@ns1.baymicrosystems.com> Message-ID: <524q7frg7a.fsf@cisco.com> Suresh> From what you say, the ib_core and ib_mad modules are not Suresh> loaded if CONFIG_INFINIBAND=y and not 'm'? If you say 'y' instead of 'm' then no modules are built. 
IB support is linked directly into your kernel instead. Suresh> 2. When I do /sbin/lsmod --I see nothing though.... Right, the code is linked directly into your kernel, so there are no modules to list. - R. From nacc at us.ibm.com Mon Oct 17 14:23:49 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 17 Oct 2005 14:23:49 -0700 Subject: [openib-general] initialization -udev In-Reply-To: <200510172116.j9HLGD87011165@ns1.baymicrosystems.com> References: <1129581740.16900.15907.camel@hal.voltaire.com> <200510172116.j9HLGD87011165@ns1.baymicrosystems.com> Message-ID: <20051017212349.GK28213@us.ibm.com> On 17.10.2005 [17:16:13 -0400], Suresh Shelvapille wrote: > 1. I did not add the CONFIG_INFINIBAND=y by hand. I used make menuconfig and > the GUI let me choose "y" to select the Infiniband core. And the result was, > in .config which I presume is the one generated by the GUI has > CONFIG_INFINIBAND=y. Please don't top-post. > From what you say, the ib_core and ib_mad modules are not loaded if > CONFIG_INFINIBAND=y and not 'm'? =y means built-in. There are *no* modules if you select =y. =m means build it as a module. Thanks, Nish From halr at voltaire.com Mon Oct 17 14:23:17 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 17 Oct 2005 17:23:17 -0400 Subject: [openib-general] initialization -udev In-Reply-To: <200510172116.j9HLGD87011165@ns1.baymicrosystems.com> References: <200510172116.j9HLGD87011165@ns1.baymicrosystems.com> Message-ID: <1129584196.16900.16106.camel@hal.voltaire.com> On Mon, 2005-10-17 at 17:16, Suresh Shelvapille wrote: > 1. I did not add the CONFIG_INFINIBAND=y by hand. I used make menuconfig and > the GUI let me choose "y" to select the Infiniband core. And the result was, > in .config which I presume is the one generated by the GUI has > CONFIG_INFINIBAND=y. > > From what you say, the ib_core and ib_mad modules are not loaded if > CONFIG_INFINIBAND=y and not 'm'? Then it's built into the kernel (rather than them being modules). > 2.
When I do /sbin/lsmod --I see nothing though.... That's consistent with the above. > 3. BTW I have not loaded my device driver yet...I will try it once the core > modules get loaded correctly! If you want modules, you need to reconfigure, setting CONFIG_INFINIBAND to m, and rebuild. You will then need to load the ib_core and ib_mad modules by hand if you are not loading your driver first. It would likely pull them in based on the dependencies in your driver. -- Hal > Thanks a lot in advance, > Suri > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Monday, October 17, 2005 4:42 PM > > To: Suresh Shelvapille > > Cc: openib-general at openib.org > > Subject: Re: [openib-general] initialization -udev > > > > Hi Suri, > > > > On Mon, 2005-10-17 at 16:15, Suresh Shelvapille wrote: > > > Folks: > > > Another basic question, I just built my 2.6.10 kernel with core > > Infiniband > > > stack. I only have CONFIG_INFINIBAND=y and nothing else (I don't need > > > anything else for now!). > > > > If you want modules, set it to m not y. > > > > > Couple of questions: > > > > > > 1. Do I need to load any of the kernel modules by hand? I don't have > > > etc/udev directory (none of the modules listed in MST's installation > > cheat > > > sheet are applicable to me I think!) > > > 2. What should I see to confirm that the core Infiniband stack is loaded > > > correctly? > > > > You should do an /sbin/lsmod | grep ib_ and see that ib_mthca (or > > whatever your driver is), ib_mad, and ib_core are loaded. I think that's > > all you need right now. 
> > > > -- Hal > > > > > > > > Thanks, > > > Suri > > > > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > > general > From rolandd at cisco.com Mon Oct 17 15:41:49 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 17 Oct 2005 15:41:49 -0700 Subject: [openib-general] Re: Initial ipath review brain dump In-Reply-To: <1129409205.4027.0.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Sat, 15 Oct 2005 13:46:45 -0700") References: <524q7jvc7d.fsf@cisco.com> <1129409205.4027.0.camel@hematite.internal.keyresearch.com> Message-ID: <52vezvpxzm.fsf@cisco.com> I came up with the patch below, which lets drivers do something like the following: for (pos = pci_find_capability(pdev, ); pos; pos = pci_find_next_capability(pdev, pos, )) { /* ... */ } I think this works well for infinipath_core.c. What do you think? If it looks OK to you, I'll send it on to Greg K-H for (I hope) inclusion in 2.6.15. [Assuming Linus releases 2.6.14 within the next few days, then the 2.6.15 window for core changes will close by the end of next week, so it's a good idea to get any generic stuff merged ASAP] - R. 
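The loop above can be exercised outside the kernel. What follows is only a minimal user-space sketch, assuming the standard PCI capability layout (first-capability pointer at config offset 0x34, capability ID at offset 0 and next pointer at offset 1 of each entry). The names cfg_read8, find_capability, find_next_capability, count_caps and make_fake_cfg are hypothetical stand-ins, not kernel APIs, and the fake config space with two instances of capability 0x08 is made-up test data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CAP_LIST_HEAD 0x34 /* config offset holding the first-capability pointer */
#define CAP_LIST_ID   0    /* capability ID, relative to the capability position */
#define CAP_LIST_NEXT 1    /* next pointer, relative to the capability position */

/* Hypothetical stand-in for pci_read_config_byte() on a 256-byte fake device. */
static uint8_t cfg_read8(const uint8_t *cfg, uint8_t off)
{
    assert(cfg != NULL);
    return cfg[off];
}

/* Follow the pointer stored at ptr_off and walk the list looking for cap. */
static int find_from(const uint8_t *cfg, uint8_t ptr_off, int cap)
{
    int ttl = 48; /* bound the walk in case the list is corrupt or loops */
    uint8_t pos = cfg_read8(cfg, ptr_off) & 0xFC;

    while (ttl--) {
        uint8_t id;

        if (pos < 0x40) /* pointers into the standard header end the list */
            break;
        id = cfg_read8(cfg, (uint8_t)(pos + CAP_LIST_ID));
        if (id == 0xff)
            break;
        if (id == cap)
            return pos;
        pos = cfg_read8(cfg, (uint8_t)(pos + CAP_LIST_NEXT)) & 0xFC;
    }
    return 0;
}

/* First instance of cap, or 0 if there is none. */
static int find_capability(const uint8_t *cfg, int cap)
{
    return find_from(cfg, CAP_LIST_HEAD, cap);
}

/* Next instance of cap after the capability at pos, or 0 if there is none. */
static int find_next_capability(const uint8_t *cfg, uint8_t pos, int cap)
{
    return find_from(cfg, (uint8_t)(pos + CAP_LIST_NEXT), cap);
}

/* Same loop shape as in the message: visit every instance of one capability. */
static int count_caps(const uint8_t *cfg, int cap)
{
    int n = 0;
    int pos;

    for (pos = find_capability(cfg, cap);
         pos;
         pos = find_next_capability(cfg, (uint8_t)pos, cap))
        n++;
    return n;
}

/* Made-up device: 0x40 (ID 0x08) -> 0x50 (ID 0x05) -> 0x60 (ID 0x08) -> end. */
static void make_fake_cfg(uint8_t cfg[256])
{
    memset(cfg, 0, 256);
    cfg[CAP_LIST_HEAD] = 0x40;
    cfg[0x40] = 0x08; cfg[0x41] = 0x50;
    cfg[0x50] = 0x05; cfg[0x51] = 0x60;
    cfg[0x60] = 0x08; cfg[0x61] = 0x00;
}
```

With the fake device built by make_fake_cfg(), count_caps(cfg, 0x08) visits the entries at 0x40 and 0x60, which is the situation the proposed interface is meant to handle: a capability ID that appears more than once.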
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 259d247..b852959 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -120,6 +120,33 @@ int pci_find_capability(struct pci_dev * } /** + * pci_find_next_capability - Find next capability after current position + * @dev: PCI device to query + * @pos: Position to search from + * @cap: capability code + */ +int pci_find_next_capability(struct pci_dev *dev, u8 pos, int cap) +{ + u8 id; + int ttl = 48; + + while (ttl--) { + pci_read_config_byte(dev, pos + PCI_CAP_LIST_NEXT, &pos); + pos &= ~3; + if (pos < 0x40) + break; + pci_read_config_byte(dev, pos + PCI_CAP_LIST_ID, &id); + if (id == 0xff) + break; + if (id == cap) + return pos; + } + + return 0; +} +EXPORT_SYMBOL(pci_find_next_capability); + +/** * pci_bus_find_capability - query for devices' capabilities * @bus: the PCI bus to query * @devfn: PCI device to query diff --git a/include/linux/pci.h b/include/linux/pci.h index 7349058..8016d14 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -337,6 +337,7 @@ struct pci_dev *pci_find_device (unsigne struct pci_dev *pci_find_device_reverse (unsigned int vendor, unsigned int device, const struct pci_dev *from); struct pci_dev *pci_find_slot (unsigned int bus, unsigned int devfn); int pci_find_capability (struct pci_dev *dev, int cap); +int pci_find_next_capability (struct pci_dev *dev, u8 pos, int cap); int pci_find_ext_capability (struct pci_dev *dev, int cap); struct pci_bus * pci_find_next_bus(const struct pci_bus *from); @@ -546,6 +547,7 @@ static inline int pci_assign_resource(st static inline int pci_register_driver(struct pci_driver *drv) { return 0;} static inline void pci_unregister_driver(struct pci_driver *drv) { } static inline int pci_find_capability (struct pci_dev *dev, int cap) {return 0; } +static inline int pci_find_next_capability (struct pci_dev *dev, u8 post, int cap) {return 0; } static inline int pci_find_ext_capability (struct pci_dev *dev, int cap) {return 0; } static 
inline const struct pci_device_id *pci_match_device(const struct pci_device_id *ids, const struct pci_dev *dev) { return NULL; } From rjwalsh at pathscale.com Mon Oct 17 16:01:39 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Mon, 17 Oct 2005 16:01:39 -0700 Subject: [openib-general] Re: Initial ipath review brain dump In-Reply-To: <52vezvpxzm.fsf@cisco.com> References: <524q7jvc7d.fsf@cisco.com> <1129409205.4027.0.camel@hematite.internal.keyresearch.com> <52vezvpxzm.fsf@cisco.com> Message-ID: <1129590099.16851.17.camel@hematite.internal.keyresearch.com> > I came up with the patch below, which lets drivers do something like > the following: > > for (pos = pci_find_capability(pdev, ); > pos; > pos = pci_find_next_capability(pdev, pos, )) { > /* ... */ > } > > I think this works well for infinipath_core.c. What do you think? If > it looks OK to you, I'll send it on to Greg K-H for (I hope) inclusion > in 2.6.15. Hi Roland, This looks reasonable enough, but we're a little short on cycles over here at the moment to fully test it right away. > [Assuming Linus releases 2.6.14 within the next few days, then the > 2.6.15 window for core changes will close by the end of next week, so > it's a good idea to get any generic stuff merged ASAP] Right. Regards, Robert. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 481 bytes Desc: This is a digitally signed message part URL: From rolandd at cisco.com Mon Oct 17 16:06:21 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 17 Oct 2005 16:06:21 -0700 Subject: [openib-general] Re: Initial ipath review brain dump In-Reply-To: <1129590099.16851.17.camel@hematite.internal.keyresearch.com> (Robert Walsh's message of "Mon, 17 Oct 2005 16:01:39 -0700") References: <524q7jvc7d.fsf@cisco.com> <1129409205.4027.0.camel@hematite.internal.keyresearch.com> <52vezvpxzm.fsf@cisco.com> <1129590099.16851.17.camel@hematite.internal.keyresearch.com> Message-ID: <52r7ajpwuq.fsf@cisco.com> Robert> This looks reasonable enough, but we're a little short on Robert> cycles over here at the moment to fully test it right Robert> away. Cool. It works fine in my tests, I just wanted to make sure the interface was OK with you. - R. From sinate at yahoo.com Tue Oct 18 03:42:20 2005 From: sinate at yahoo.com (Steven Wooding) Date: Tue, 18 Oct 2005 11:42:20 +0100 (BST) Subject: [openib-general] Strange output when calling ibv_poll_cq function Message-ID: <20051018104220.89933.qmail@web32508.mail.mud.yahoo.com> Hi, I got a strange problem that I can't figure out. Turning kernel debugging on might help, but I thought I'd run it by the mailing list to see if anyone has come across this before. When calling the ibv_poll_cq() function I get the following printed to standard output: [ 0] 00620406 [ 4] 15000000 [ 8] 02000000 [ c] 00040000 [10] 04330000 [14] 00000000 [18] 00000002 [1c] fe100000 The output is not exactly the same each time. The indexes in square brackets are the same, but the eight digit number field changes (though not much). Also, the data I'm sending does not arrive (though this could be some other problem with my app). I'm using svn 3470 on x86_64 platform. Thanks, Steve.
-------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Tue Oct 18 09:19:29 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 18 Oct 2005 09:19:29 -0700 Subject: [openib-general] Strange output when calling ibv_poll_cq function In-Reply-To: <20051018104220.89933.qmail@web32508.mail.mud.yahoo.com> (Steven Wooding's message of "Tue, 18 Oct 2005 11:42:20 +0100 (BST)") References: <20051018104220.89933.qmail@web32508.mail.mud.yahoo.com> Message-ID: <5264rupzla.fsf@cisco.com> > When calling the ibv_poll_cq() function I get the following printed to standard output: > [ 0] 00620406 > [ 4] 15000000 > [ 8] 02000000 > [ c] 00040000 > [10] 04330000 > [14] 00000000 > [18] 00000002 > [1c] fe100000 > The output is not exactly the same each time. The indexes in > square brackets are the same, but the eight digit number field > changes (though not much). This is some old debugging code from libmthca, which dumps some hardware-format data every time a completion with error is polled. I've changed it so that these dumps only occur for errors that are likely to indicate a driver bug. However, reading the completion contents, I see that it is a receive completion with status "local protection error." So something is wrong with the receive request you posted -- the address is out of bounds, you used the wrong L_Key, or something like that. > Also, the data I'm sending does not arrive (though this could be > some other problem with my app). I'm using svn 3470 on x86_64 > platform. Not surprising: your receive work request is completing unsuccessfully. - R. From pradeep at us.ibm.com Tue Oct 18 11:05:18 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Tue, 18 Oct 2005 11:05:18 -0700 Subject: [openib-general] Bug in recv_handler() -sa_query.c file? 
In-Reply-To: <5264rupzla.fsf@cisco.com> Message-ID: I was trying out the grmpp module on a Power machine and kept running into the following error "grmpp: failed path record query: -22". Debugging this revealed that this was a problem with the following code in recv_handler() in the sa_query.c file: if (query && query->callback) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) query->callback(query, mad_recv_wc->recv_buf.mad->mad_hdr.status ? -EINVAL : 0, (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad); Now the mad_hdr.status field is declared as __be16. So, should the check be (mad_recv_wc->recv_buf.mad->mad_hdr.status & 0xff) before we return EINVAL? That change seemed to fix the grmpp problem for me. Pradeep pradeep at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Tue Oct 18 11:11:46 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 18 Oct 2005 11:11:46 -0700 Subject: [openib-general] Re: Bug in recv_handler() -sa_query.c file? In-Reply-To: (Pradeep Satyanarayana's message of "Tue, 18 Oct 2005 11:05:18 -0700") References: Message-ID: <52u0feoftp.fsf@cisco.com> Pradeep> Now mad_hdr.status field is declared as __be16. So, Pradeep> should the check be Pradeep> (mad_recv_wc->recv_buf.mad->mad_hdr.status & 0xff) Pradeep> before we return EINVAL? I don't see why. For one thing, that would be an endianness bug, since as you say, the status field is in big-endian order, so the test would be different depending on whether the host is big- or little-endian. Also, all 16 bits of the status field should be zero if the request succeeds. What value do you see in the status field in the failed response? What SM are you using? - R. 
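To make the endianness point concrete, here is a minimal user-space sketch. It assumes only that the 16-bit status travels in big-endian (network) byte order, as the __be16 declaration says. The helpers load_as_le, load_as_be, masked_test_flags and any_bit_set are hypothetical names used for illustration, and the wire bytes 06 00 are just a sample value. The sketch models what an unconverted "& 0xff" test would see on each kind of host, and contrasts it with a test of all 16 bits, which does not depend on byte order because zero is zero either way.

```c
#include <assert.h>
#include <stdint.h>

/* What a little-endian host gets when it loads the two wire bytes unconverted. */
static uint16_t load_as_le(const uint8_t wire[2])
{
    assert(wire != 0);
    return (uint16_t)(wire[0] | (wire[1] << 8));
}

/* What a big-endian host gets when it loads the same two wire bytes unconverted. */
static uint16_t load_as_be(const uint8_t wire[2])
{
    assert(wire != 0);
    return (uint16_t)((wire[0] << 8) | wire[1]);
}

/* The proposed masked test, applied to a raw (unconverted) 16-bit load. */
static int masked_test_flags(uint16_t raw)
{
    return (raw & 0xff) != 0;
}

/* Byte-order-independent test: is any of the 16 status bits set? */
static int any_bit_set(const uint8_t wire[2])
{
    assert(wire != 0);
    return (wire[0] | wire[1]) != 0;
}
```

For the sample bytes 06 00, the masked test reports an error on a little-endian host (raw value 0x0006) but not on a big-endian one (raw value 0x0600), while any_bit_set() reports an error on both. That is the host-dependent behaviour the reply warns about.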
From Arkady.Kanevsky at netapp.com Tue Oct 18 11:16:42 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 18 Oct 2005 14:16:42 -0400 Subject: iWARP emulation protocol (was: [openib-general] RDMA connection andaddress translation API) Message-ID: Enclosed is the proposal to IBTA to add this functionality to CM protocol. The main issue is that there is no protocol that provides both src and dest IP addresses and ports and provides 64 bytes of private data to users simultaneously. The last slide outlines 3 possibilities on how to address this problem but each of them has its shortcomings. The proposed protocol will be used by both kernel and user space Consumers. There are existing Consumers that rely on 64 bytes of private data. In order to avoid duplicate discussions happening on different reflectors, please use openib-general at openib.org mailing list (accessible by all) for this thread. Feel free to cc dat and ibta swg reflectors. Thanks, Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Thursday, August 25, 2005 3:48 PM > To: Yaron Haviv > Cc: openib-general at openib.org > Subject: iWARP emulation protocol (was: [openib-general] RDMA > connection andaddress translation API) > > > Yaron> I send my proposal from 2004 re-send again as text > Yaron> (attached) Also addresses the ServiceID issue, this can be > Yaron> a baseline for discussions Feel free to change > > I think this protocol is going in exactly the right > direction. Before you sent this email, I had independently > reached the conclusion that what is desired is not a > transport neutral API, but rather a general protocol for > emulating iWARP on IB. 
Then it's easy to build an API that > covers both native iWARP and emulated iWARP on IB, and use > that for iSER and NFS/RDMA. > > This has some nice properties. For example, the high-level > connection API doesn't have to have a 64-bit service ID > parameter any more -- we can just pass in 16-bit TCP ports, > and map them to IB service IDs. Also, it's easy to put some > filtering in the userspace CM to forbid connections with > source port < 1024 from unprivileged processes. Then > listeners can have some level of trust in the source IP if > the source port is privileged. > > I think that in light of the emerging consensus on using the > IB CM private data to carry IP address information, we can > stop worrying about ATS. We can implement this private data > mechanism immediately, using a service ID base coming from > the OpenIB OUI. Once we have the design nailed down, then we > can go to the IBTA or IETF and standardize a final service ID base. > > I have a few minor quibbles with this proposal. I think it > would be better to have only the IP version, source and > destination IPs and local in the CM private data. The other > fields don't seem generic to all protocols. If we do put the > extra fields in the generic private data, then we need an API > to set them on active connect and get them on passive > connect, and I don't think it's worth it. 
> > So I would suggest that there be no REP private data, and > that the REQ private data just be something like:
>
>      0                   1                   2                   3
>      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 00 |                        Src IP (127-96)                        |
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 04 |                        Src IP ( 95-64)                        |
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 08 |                        Src IP ( 63-32)                        |
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 12 |                        Src IP ( 31-00)                        |
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 16 |                        Dst IP (127-96)                        |
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 20 |                        Dst IP ( 95-64)                        |
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 24 |                        Dst IP ( 63-32)                        |
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 28 |                        Dst IP ( 31-00)                        |
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 32 | IPVer |       Reserved        |           TCP Port            |
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>
> - R. > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: IP Address Support by InfiniBand CM.pdf Type: application/octet-stream Size: 56035 bytes Desc: IP Address Support by InfiniBand CM.pdf URL: From rolandd at cisco.com Tue Oct 18 11:19:03 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 18 Oct 2005 11:19:03 -0700 Subject: [openib-general] Re: iWARP emulation protocol In-Reply-To: (Arkady Kanevsky's message of "Tue, 18 Oct 2005 14:16:42 -0400") References: Message-ID: <52psq2ofhk.fsf@cisco.com> Arkady> The proposed protocol will be used by both kernel and user Arkady> space Consumers. There are existing Consumers that rely Arkady> on 64 bytes of private data. Which consumers are these? - R. From Arkady.Kanevsky at netapp.com Tue Oct 18 11:37:53 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 18 Oct 2005 14:37:53 -0400 Subject: [openib-general] RE: iWARP emulation protocol Message-ID: uDAPL users. Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, October 18, 2005 2:19 PM > To: Kanevsky, Arkady > Cc: Yaron Haviv; openib-general at openib.org; > dat-discussions at yahoogroups.com; swg at infinibandta.org > Subject: Re: iWARP emulation protocol > > > Arkady> The proposed protocol will be used by both kernel and user > Arkady> space Consumers. There are existing Consumers that rely > Arkady> on 64 bytes of private data. > > Which consumers are these? > > - R. > From rolandd at cisco.com Tue Oct 18 11:41:07 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 18 Oct 2005 11:41:07 -0700 Subject: [openib-general] Re: iWARP emulation protocol In-Reply-To: (Arkady Kanevsky's message of "Tue, 18 Oct 2005 14:37:53 -0400") References: Message-ID: <52irvuoegs.fsf@cisco.com> Arkady> uDAPL users. 
1) http://www.zip.com.au/~akpm/linux/patches/stuff/top-posting.txt 2) Are there real users or is this a generic uDAPL API thing? - R. From mshefty at ichips.intel.com Tue Oct 18 11:45:38 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 18 Oct 2005 11:45:38 -0700 Subject: iWARP emulation protocol (was: [openib-general] RDMA connection andaddress translation API) In-Reply-To: References: Message-ID: <435542D2.7070005@ichips.intel.com> Kanevsky, Arkady wrote: > Enclosed is the proposal to IBTA to add this functionality to CM > protocol. > > The main issue is that there is no protocol that provides > both src and dest IP addresses and ports and > provide 64 bytes of private > data to users simultaneously. > The last slide outlines 3 possibilities on how to address this problem > but each of them has its short comings. For the REQ to find its way to the destination, the destination address must be known beforehand. We shouldn't need to pass any data in the REP. The CMA passes both the source and destination address information in the REQ, but only uses the destination to validate against a listen request. The source address is passed to the user. The slides should also discuss how to map from a TCP/IP address to a service ID, so that a REQ can match up with the correct listener. The approach currently taken by the CMA is to use the openib OUI << 48 + TCP port number. - Sean From pradeep at us.ibm.com Tue Oct 18 11:46:58 2005 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Tue, 18 Oct 2005 11:46:58 -0700 Subject: [openib-general] Re: Bug in recv_handler() -sa_query.c file? In-Reply-To: <52u0feoftp.fsf@cisco.com> Message-ID: Good point about the endianness bug. That could be something. I got 0x600 for status (printed as %x). We have a Topspin switch which runs the SM Pradeep pradeep at us.ibm.com Roland Dreier wrote on 10/18/2005 11:11:46 AM: > Pradeep> Now mad_hdr.status field is declared as __be16. 
So, > Pradeep> should the check be > > Pradeep> (mad_recv_wc->recv_buf.mad->mad_hdr.status & 0xff) > > Pradeep> before we return EINVAL? > > I don't see why. For one thing, that would be an endianness bug, > since as you say, the status field is in big-endian order, so the test > would be different depending on whether the host is big- or little-endian. > > Also, all 16 bits of the status field should be zero if the request > succeeds. What value do you see in the status field in the failed > response? What SM are you using? > > - R. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rolandd at cisco.com Tue Oct 18 11:52:52 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 18 Oct 2005 11:52:52 -0700 Subject: [openib-general] Re: iWARP emulation protocol In-Reply-To: (Arkady Kanevsky's message of "Tue, 18 Oct 2005 14:16:42 -0400") References: Message-ID: <52ek6iodx7.fsf@cisco.com> [closed dat-discussions list snipped from Cc list] I have some comments about the proposal. Unfortunately I can't quote from a PDF file but I'll try to make it clear what I'm talking about. The proposal doesn't talk about mapping from TCP port numbers into a 16-bit range of IB service IDs. I think this is necessary. Also, putting the destination address in the REP message doesn't make sense to me. The destination IP and port number is something that the initiator of the connection is sending to the destination, not the other way around. The passive side of the connection (receiver of the REQ) needs the destination IP as part of the REQ so that it can decide whether to accept the connection; the active side (sender of the REQ) knows who it is trying to talk to, so having the address information in the REP is not useful. As I said above I believe the destination port should be encoded in the service ID, but the destination IP address should be in the REQ message. 
This consumes 16 more bytes of private data, but I would still like to understand whether there are real applications using 64 bytes of private data, or if this is just a uDAPL spec issue. - R. From mshefty at ichips.intel.com Tue Oct 18 11:55:11 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 18 Oct 2005 11:55:11 -0700 Subject: iWARP emulation protocol (was: [openib-general] RDMA connection andaddress translation API) In-Reply-To: References: Message-ID: <4355450F.7000808@ichips.intel.com> Kanevsky, Arkady wrote: > Enclosed is the proposal to IBTA to add this functionality to CM > protocol. A couple of other notes. Combine major/minor version into a single version, which is what you essentially have anyway. I have no clue what "zero based virtual address exception" means, but that and the SI bit seem out of place in a header containing TCP/IP address information. I would say save the two bits and have a cleaner header. - Sean From Arkady.Kanevsky at netapp.com Tue Oct 18 11:59:09 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 18 Oct 2005 14:59:09 -0400 Subject: iWARP emulation protocol (was: [openib-general] RDMA connection andaddress translation API) Message-ID: Sean, > For the REQ to find its way to the destination, the > destination address must be > known beforehand. We shouldn't need to pass any data in the > REP. The CMA > passes both the source and destination address information in > the REQ, but only > uses the destination to validate against a listen request. > The source address > is passed to the user. CM passes IB addresses of both src and dest in REQ. How locally dest IP address is mapped to dest IB GID|LID is defined by IPoIB. We can request IBTA to define it also. But the goal is to define a protocol part in IBTA. You are correct that if rely on CM storing the IP address of the dest it is not needed to be passed back in REP. If we do not need to know that response came from a different IP address. 
Or a different port. > The slides should also discuss how to map from a TCP/IP > address to a service ID, > so that a REQ can match up with the correct listener. The > approach currently > taken by the CMA is to use the openib OUI << 48 + TCP port number. > Correct. If we want IBTA to define a full mapping of addresses and ports then yes. But that does not change the protocol, it is local agreement that must be the same on both sides of the connection. I will include it in the next version. Thanks, Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 From Arkady.Kanevsky at netapp.com Tue Oct 18 12:00:22 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 18 Oct 2005 15:00:22 -0400 Subject: iWARP emulation protocol (was: [openib-general] RDMA connection andaddress translation API) Message-ID: I think it is better to use some of the CM REQ reserved field for it so it will be separate from Addressing. Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 18, 2005 2:55 PM > To: Kanevsky, Arkady > Cc: Roland Dreier; Yaron Haviv; swg at infinibandta.org; > dat-discussions at yahoogroups.com; openib-general at openib.org > Subject: Re: iWARP emulation protocol (was: [openib-general] > RDMA connection andaddress translation API) > > > Kanevsky, Arkady wrote: > > Enclosed is the proposal to IBTA to add this functionality to CM > > protocol. > > A couple of other notes. > > Combine major/minor version into a single version, which is > what you essentially > have anyway. 
> > I have no clue what "zero based virtual address exception" > means, but that and > the SI bit seem out of place in a header containing TCP/IP > address information. > I would say save the two bits and have a cleaner header. > > - Sean > From rolandd at cisco.com Tue Oct 18 12:01:01 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 18 Oct 2005 12:01:01 -0700 Subject: [openib-general] Re: Bug in recv_handler() -sa_query.c file? In-Reply-To: (Pradeep Satyanarayana's message of "Tue, 18 Oct 2005 11:46:58 -0700") References: Message-ID: <524q7eodjm.fsf@cisco.com> Pradeep> Good point about the endianness bug. That could be Pradeep> something. I got 0x600 for status (printed as %x). We Pradeep> have a Topspin switch which runs the SM 0x0600 for the status means "insufficient components" in the SA query (IB spec vol 1 table 188). I believe that the shipping version of the Topspin SM does not support all valid PathRecord component masks. So what's probably happening is that the query is really failing and the sa_query module is correctly returning an error. - R. From yaronh at voltaire.com Tue Oct 18 12:08:35 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Tue, 18 Oct 2005 21:08:35 +0200 Subject: [openib-general] RE: iWARP emulation protocol Message-ID: <35EA21F54A45CB47B879F21A91F4862F85669C@taurus.voltaire.com> > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Tuesday, October 18, 2005 2:53 PM > To: Kanevsky, Arkady > Cc: Roland Dreier; Yaron Haviv; openib-general at openib.org; > swg at infinibandta.org > Subject: Re: iWARP emulation protocol > > > The proposal doesn't talk about mapping from TCP port numbers into a > 16-bit range of IB service IDs. I think this is necessary. > I agree, that's part of the other proposals > Also, putting the destination address in the REP message doesn't make > sense to me. 
The destination IP and port number is something that the > initiator of the connection is sending to the destination, not the > other way around. The passive side of the connection (receiver of the > REQ) needs the destination IP as part of the REQ so that it can decide > whether to accept the connection; the active side (sender of the REQ) > knows who it is trying to talk to, so having the address information > in the REP is not useful. Also Agree, REP just needs few fields (ver, capabilities) Yaron From mshefty at ichips.intel.com Tue Oct 18 12:11:43 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 18 Oct 2005 12:11:43 -0700 Subject: iWARP emulation protocol (was: [openib-general] RDMA connection andaddress translation API) In-Reply-To: References: Message-ID: <435548EF.10201@ichips.intel.com> Kanevsky, Arkady wrote: > CM passes IB addresses of both src and dest in REQ. > How locally dest IP address is mapped to dest IB GID|LID is > defined by IPoIB. > We can request IBTA to define it also. > But the goal is to define a protocol part in IBTA. The mapping from an IP address to a GID is controlled by a system administrator. ARP can be used to resolve the IP address to the GID, but there still needs to be a way to map the TCP port number to a service ID, which goes across the wire, and needs to be defined. Right now, the service ID is the only indicator that a CM or other recipient has that the private data has a particular format. An alternative is to grab a reserved bit from the CM REQ to indicate that this header is present, and ignore the service ID in such cases (provided the destination TCP/IP address is given in the private data). > You are correct that if rely on CM storing the IP address of the dest > it is not needed to be passed back in REP. > If we do not need to know that response came from a different IP > address. > Or a different port. 
Why would you want to establish a connection using an address that's different from that specified by the requester? - Sean From caitlinb at broadcom.com Tue Oct 18 12:16:25 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 18 Oct 2005 12:16:25 -0700 Subject: [openib-general] Re: iWARP emulation protocol Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020A5C@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: Roland Dreier > Sent: Tuesday, October 18, 2005 11:41 AM > To: Kanevsky, Arkady > Subject: [openib-general] Re: iWARP emulation protocol > > Arkady> uDAPL users. > > > 2) Are there real users or is this a generic uDAPL API thing? > uDAPL vs. kDAPL is irrelevant here. The user or Kernel Consumer making the connection does not know whether their peer is running in user or kernel, nor should they. Every discussion of reducing the guaranteed private data size in DAPL has produced adverse reactions from application developers. They're either very good actors or were working on actual applications. An additional space preserving option that Arkady did not mention is limiting the IP alias service to IPv4 addresses. Anyone who really wants IPv6 addresses can get their SM to assign IPv6 compatible GIDs. Of course the flat IPv6 option is far simpler, and probably should be used unless a specific application is identified where those extra 96 bits makes the difference between making the private data be rewritten or left as is. From halr at voltaire.com Tue Oct 18 12:10:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Oct 2005 15:10:31 -0400 Subject: [openib-general] [RFC] OpenSM Interactive Console Message-ID: <1129662629.16900.23196.camel@hal.voltaire.com> Currently, OpenSM does not support an interactive console. There has been a desire to introduce the ability to change certain parameters (as well as display things) once OpenSM has started. This patch introduces the first most basic commands: help and loglevel. 
I am investigating adding smpriority to this. The console is invoked by specifying -console as an option on the opensm command line. If you have a request for a command you would like in the console, I would like to compile a list of these. Comments ? -- Hal Index: include/opensm/osm_subnet.h =================================================================== --- include/opensm/osm_subnet.h (revision 3801) +++ include/opensm/osm_subnet.h (working copy) @@ -221,6 +221,7 @@ typedef struct _osm_subn_opt char * dump_files_dir; char * log_file; boolean_t accum_log_file; + boolean_t console; cl_map_t port_prof_ignore_guids; boolean_t port_profile_switch_nodes; uint32_t max_port_profile; Index: include/opensm/osm_console.h =================================================================== --- include/opensm/osm_console.h (revision 0) +++ include/opensm/osm_console.h (revision 0) @@ -0,0 +1,56 @@ +/* + * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution.
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#ifndef _OSM_CONSOLE_H_ +#define _OSM_CONSOLE_H_ + +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +void osm_console(osm_opensm_t *p_osm); + +END_C_DECLS + +#endif /* _OSM_CONSOLE_H_ */ Property changes on: include/opensm/osm_console.h ___________________________________________________________________ Name: svn:keywords + Id Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 3801) +++ opensm/osm_subnet.c (working copy) @@ -399,6 +399,7 @@ osm_subn_set_default_opt( p_opt->m_key_lease_period = 0; p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS; p_opt->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE; + p_opt->console = FALSE; p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC; /* by default we will consider waiting for 50x transaction timeout normal */ p_opt->max_msg_fifo_timeout = 50*OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC; Index: opensm/osm_console.c =================================================================== --- opensm/osm_console.c (revision 0) +++ opensm/osm_console.c (revision 0) @@ -0,0 +1,184 @@ +/* + * Copyright (c) 2005 Voltaire, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#define _GNU_SOURCE /* for getline */ +#include "stdio.h" +#include + +#define OSM_COMMAND_LINE_LEN 120 +#define OSM_COMMAND_PROMPT "$ " + +struct command { + char *name; + void (*help_function)(void); + void (*parse_function)(char **p_last, osm_opensm_t *p_osm); +}; + +static const struct command console_cmds[]; + +static inline char *next_token(char **p_last) +{ + return strtok_r(NULL, " \t\n", p_last); +} + +static void help_command() +{ + int i; + + printf("Supported commands and syntax:\n"); + printf("help []\n"); + /* skip help command */ + for (i = 1; console_cmds[i].name; i++) + console_cmds[i].help_function(); +} + +static void help_loglevel() +{ + printf("loglevel []\n"); +} + +/* more help routines go here */ + +static void help_parse(char **p_last, osm_opensm_t *p_osm) +{ + char *p_cmd; + int i, found = 0; + + p_cmd = next_token(p_last); + if (!p_cmd) + help_command(); + else { + for (i = 1; console_cmds[i].name; i++) { + if (!strcmp(p_cmd, console_cmds[i].name)) { + found = 1; + console_cmds[i].help_function(); + break; + } + } + if (!found) { + printf("Command %s not found\n\n", p_cmd); + help_command(); + } + } +} + +static void loglevel_parse(char **p_last, osm_opensm_t *p_osm) +{ + char *p_cmd; + int level; + + p_cmd = next_token(p_last); + if (!p_cmd) + printf("Current log level is 0x%x\n", osm_log_get_level(&p_osm->log)); + else { + /* Handle x, 0x, and decimal specification of log level */ + if (!strncmp(p_cmd, "x", 1)) { + p_cmd++; + level = strtoul(p_cmd, NULL, 16); + } else { + if (!strncmp(p_cmd, "0x", 2)) { + p_cmd += 2; + level = strtoul(p_cmd, NULL, 16); + } else + level = strtol(p_cmd, NULL, 10); + } + if ((level >= 0) && (level < 256)) { + printf("Setting log level to 0x%x\n", level); + osm_log_set_level(&p_osm->log, level); + } else + printf("Invalid log level 0x%x\n", level); + } +} + +/* more parse routines go here */ + +static const struct command 
console_cmds[] = +{ + { "help", &help_command, &help_parse}, + { "loglevel", &help_loglevel, &loglevel_parse}, + { NULL, NULL, NULL} /* end of array */ +}; + +static void parse_cmd_line(char *line, osm_opensm_t *p_osm) +{ + char *p_cmd, *p_last; + int i, found = 0; + + /* find first token which is the command */ + p_cmd = strtok_r(line, " \t\n", &p_last); + if (p_cmd) { + for (i = 0; console_cmds[i].name; i++) { + if (!strcmp(p_cmd, console_cmds[i].name)) { + found = 1; + console_cmds[i].parse_function(&p_last, p_osm); + break; + } + } + if (!found) { + printf("Command %s not found\n\n", p_cmd); + help_command(); + } + } else { + printf("Error parsing command line: %s\n", line); + return; + } +} + +void osm_console(osm_opensm_t *p_osm) +{ + char *p_line; + size_t len; + ssize_t n; + + printf("\nOpenSM Console\n\n"); + while (1) { + printf("%s", OSM_COMMAND_PROMPT); + p_line = NULL; + /* Get input line */ + n = getline(&p_line, &len, stdin); + if (n > 0) { + /* Parse and act on input */ + parse_cmd_line(p_line, p_osm); + free(p_line); + } else { + printf("Input error\n"); + fflush(stdin); + } + } +} + Property changes on: opensm/osm_console.c ___________________________________________________________________ Name: svn:keywords + Id Index: opensm/main.c =================================================================== --- opensm/main.c (revision 3801) +++ opensm/main.c (working copy) @@ -61,6 +61,7 @@ #include #include #include +#include /******************************************************************** D E F I N E G L O B A L V A R I A B L E S @@ -157,6 +158,8 @@ show_usage(void) " SMPs.\n" " Without -maxsmps, OpenSM defaults to a maximum of\n" " one outstanding SMP.\n\n" ); + printf( "-console\n" + " This option brings up the OpenSM console.\n\n" ); printf( "-i \n" "-ignore-guids \n" " This option provides the means to define a set of ports\n" @@ -368,6 +371,7 @@ parse_ignore_guids_file(IN char *guids_f uint64_t port_guid; ib_api_status_t status =
IB_SUCCESS; unsigned int port_num; + OSM_LOG_ENTER( &p_osm->log, parse_ignore_guids_file ); fh = fopen( guids_file_name, "r" ); @@ -474,6 +478,7 @@ main( { "log_file", 1, NULL, 'f'}, { "erase_log_file",0, NULL, 'e'}, { "maxsmps", 1, NULL, 'n'}, + { "console", 0, NULL, 'q'}, { "V", 0, NULL, 'V'}, { "help", 0, NULL, 'h'}, { "once", 0, NULL, 'o'}, @@ -577,6 +582,14 @@ main( printf(" Max wire smp's = %d\n", opt.max_wire_smps); break; + case 'q': + /* + * OpenSM interactive console + */ + opt.console = TRUE; + printf(" Enabling OpenSM interactive console\n"); + break; + case 'd': dbg_lvl = strtol(optarg, NULL, 0); printf(" d level = 0x%x\n", dbg_lvl); @@ -796,7 +809,10 @@ main( be implemented in this loop. */ while( !osm_exit_flag ) - cl_thread_suspend( 10000 ); + if (opt.console) + osm_console(&osm); + else + cl_thread_suspend( 10000 ); } #if 0 Index: opensm/Makefile.am =================================================================== --- opensm/Makefile.am (revision 3801) +++ opensm/Makefile.am (working copy) @@ -25,8 +25,8 @@ libopensm_la_LDFLAGS = -version-info $(o libopensm_la_DEPENDENCIES = $(srcdir)/libopensm.map bin_PROGRAMS = opensm -opensm_SOURCES = main.c osm_db_files.c osm_db_pack.c \ - osm_drop_mgr.c osm_fwd_tbl.c \ +opensm_SOURCES = main.c osm_console.c osm_db_files.c \ + osm_db_pack.c osm_drop_mgr.c osm_fwd_tbl.c \ osm_inform.c osm_lid_mgr.c osm_lin_fwd_rcv.c \ osm_lin_fwd_rcv_ctrl.c osm_lin_fwd_tbl.c osm_link_mgr.c \ osm_matrix.c osm_mcast_fwd_rcv.c osm_mcast_fwd_rcv_ctrl.c \ From Arkady.Kanevsky at netapp.com Tue Oct 18 12:24:10 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 18 Oct 2005 15:24:10 -0400 Subject: [openib-general] Re: iWARP emulation protocol Message-ID: > > An additional space preserving option that Arkady did not > mention is limiting the IP alias service to IPv4 addresses. > Anyone who really wants IPv6 addresses can get their SM to > assign IPv6 compatible GIDs.
Of course the flat IPv6 option > is far simpler, and probably should be used unless a specific > application is identified where those extra 96 bits makes the > difference between making the private data be rewritten or left as is. > This can be an extension to proposal 3 of last page. Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Tuesday, October 18, 2005 3:16 PM > To: Roland Dreier; Kanevsky, Arkady > Cc: swg at infinibandta.org; dat-discussions at yahoogroups.com; > openib-general at openib.org > Subject: RE: [openib-general] Re: iWARP emulation protocol > > > > > > -----Original Message----- > > From: Roland Dreier > > Sent: Tuesday, October 18, 2005 11:41 AM > > To: Kanevsky, Arkady > > Subject: [openib-general] Re: iWARP emulation protocol > > > > Arkady> uDAPL users. > > > > > > 2) Are there real users or is this a generic uDAPL API thing? > > > > uDAPL vs. kDAPL is irrelevant here. The user or Kernel > Consumer making the connection does not know whether their > peer is running in user or kernel, nor should they. > > Every discussion of reducing the guaranteed private data size > in DAPL has produced adverse reactions from application > developers. They're either very good actors or were working > on actual applications. > > An additional space preserving option that Arkady did not > mention is limiting the IP alias service to IPv4 addresses. > Anyone who really wants IPv6 addresses can get their SM to > assign IPv6 compatible GIDs. Of course the flat IPv6 option > is far simpler, and probably should be used unless a specific > application is identified where those extra 96 bits makes the > difference between making the private data be rewritten or left as is. 
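To make the size trade-off in this exchange concrete, here is a minimal sketch in C. The struct layouts and the names hyp_addr_hdr_v6, hyp_addr_hdr_v4 and hyp_map_ipv4 are hypothetical and invented for illustration; nothing in the thread fixes a wire format. The sketch assumes the private-data header carries a source and a destination address plus two 16-bit ports, and that a flat IPv6 field would hold an IPv4 address in the IPv4-mapped form (::ffff:a.b.c.d), which is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flat layout: every address occupies a full 128-bit field. */
struct hyp_addr_hdr_v6 {
    uint8_t  src_ip[16];
    uint8_t  dst_ip[16];
    uint16_t src_port;
    uint16_t dst_port;
};

/* Hypothetical IPv4-only layout: 32-bit addresses. */
struct hyp_addr_hdr_v4 {
    uint8_t  src_ip[4];
    uint8_t  dst_ip[4];
    uint16_t src_port;
    uint16_t dst_port;
};

/* Bits saved per address by the IPv4-only layout: 128 - 32 = 96. */
enum { HYP_BITS_SAVED_PER_ADDR = (16 - 4) * 8 };

/* Store an IPv4 address in a 128-bit field as ::ffff:a.b.c.d
 * (assumed encoding; the thread does not specify one). */
void hyp_map_ipv4(const uint8_t v4[4], uint8_t out[16])
{
    assert(v4 != NULL && out != NULL);
    memset(out, 0, 10);      /* 80 leading zero bits */
    out[10] = 0xff;          /* 16 one bits */
    out[11] = 0xff;
    memcpy(out + 12, v4, 4); /* the 32-bit IPv4 address */
}
```

With two addresses per header, the IPv4-only layout in this sketch is 24 bytes smaller than the flat one, which is where the 96 bits per address mentioned above come from.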
> From mshefty at ichips.intel.com Tue Oct 18 12:29:56 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 18 Oct 2005 12:29:56 -0700 Subject: [openib-general] RE: iWARP emulation protocol In-Reply-To: References: Message-ID: <43554D34.6090104@ichips.intel.com> Kanevsky, Arkady wrote: > uDAPL users. I'm not sure how much we should care about higher level abstractions for this discussion. We should do what's right for IB. Abstractions that want to use IP addresses can either use the standard protocol defined by the IBTA or define their own private data. To me, it seems that the most flexible solution is to pass the source and destination IP address in the CM REQ. We can then define a standard mapping from TCP port numbers to IB service records, or change the CM version to read into the private data. What's wrong with this approach? - Sean From ftillier at silverstorm.com Tue Oct 18 12:32:57 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Tue, 18 Oct 2005 12:32:57 -0700 Subject: [openib-general] [RFC] OpenSM Interactive Console In-Reply-To: <1129662629.16900.23196.camel@hal.voltaire.com> Message-ID: <000301c5d41a$bf261770$9e5aa8c0@infiniconsys.com> > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, October 18, 2005 12:11 PM > > If you have a request for a command you would like in the console, I > would like to compile a list of these. I think it would be great to have console commands to dump information from the SM - like linear and multicast forwarding tables, service registrations, LID assignment, etc. Maybe there's a way already to do this interactively, but I'm not aware of one. If there is, please ignore me. 
Thanks, - Fab From Arkady.Kanevsky at netapp.com Tue Oct 18 13:26:15 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 18 Oct 2005 16:26:15 -0400 Subject: [openib-general] RE: iWARP emulation protocol Message-ID: Sean wrote: > I'm not sure how much we should care about higher level > abstractions for this > discussion. We should do what's right for IB. Abstractions > that want to use IP > addresses can either use the standard protocol defined by the > IBTA or define > their own private data. Correct. But we should define a standard protocol suited for most apps to avoid the creation of multiple app-specific protocols. > > To me, it seems that the most flexible solution is to pass > the source and > destination IP address in the CM REQ. I agree. This is the cleanest and simplest to define. But it impacts some existing apps. That is why DAT has 64 bytes private data req. We do not want to lose too many users by the time we define the complete solution stack. > We can then define a > standard mapping > from TCP port numbers to IB service records, or change the CM > version to read > into the private data. What's wrong with this approach? It is the "standard" mapping which we just spent 1 hour discussing at SWG. What is that standard mapping if it is native IB? IPoIB as intermediate layer? SDP as intermediate layer? What is the standard TCP port for iSER (pick your ULP) native over RDMA vs. the same ULP over IPoIB? This has to be defined. But is it part of the IP address and TCP port info sharing between 2 sides of the connection proposal or a separate proposal? I think it is a separate proposal but both will have to be in place to support iWARP emulation. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd.
Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 From robert.j.woodruff at intel.com Tue Oct 18 13:35:15 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 18 Oct 2005 13:35:15 -0700 Subject: [openib-general] [PATCH] Fix to backport infinipath_core.c to 2.6.9 Message-ID: <1AC79F16F5C5284499BB9591B33D6F0005DB2CF2@orsmsx408> In trying to backport svn3796 back to 2.6.9, I got a compile error on infinipath_core.c. Here is a patch that I think fixes the problem, but have no way to test it. Can you review the patch and if it looks OK, I will include it in my next set of backport patches ? woody diff -Naurp linux-2.6.9/drivers/infiniband/hw/ipath/ipath_core/infinipath_core.c linux-2.6.9-openib-drivers-svn3796-fixups/drivers/infiniband/hw/ipath/ip ath_core/infinipath_core.c --- linux-2.6.9/drivers/infiniband/hw/ipath/ipath_core/infinipath_core.c 2005-10-17 13:25:44.000000000 -0700 +++ linux-2.6.9-openib-drivers-svn3796-fixups/drivers/infiniband/hw/ipath/ip ath_core/infinipath_core.c 2005-10-17 16:00:25.000000000 -0700 @@ -1173,7 +1173,6 @@ MODULE_DEVICE_TABLE(pci, infinipath_pci_ static struct pci_driver infinipath_driver = { .name = MODNAME, - .owner = THIS_MODULE, .probe = infinipath_init_one, .remove = __devexit_p(infinipath_remove_one), .id_table = infinipath_pci_tbl, @@ -3143,7 +3142,7 @@ static int ipath_mmap(struct file *fp, s VM_DONTCOPY | VM_DONTEXPAND | VM_IO | VM_SHM | VM_LOCKED; ret = - io_remap_pfn_range(vm, vm->vm_start, phys >> PAGE_SHIFT, + remap_page_range(vm, vm->vm_start, phys, vm->vm_end - vm->vm_start, vm->vm_page_prot); } @@ -3202,7 +3201,7 @@ static int ipath_mmap(struct file *fp, s | VM_IO | VM_SHM | VM_LOCKED; ret = - io_remap_pfn_range(vm, vm->vm_start, phys >> PAGE_SHIFT, + remap_page_range(vm, vm->vm_start, phys, vm->vm_end - vm->vm_start, vm->vm_page_prot); } From halr at voltaire.com Tue Oct 18 13:29:50 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Oct 2005 16:29:50 -0400 Subject: 
[openib-general] [RFC] OpenSM Interactive Console In-Reply-To: <000301c5d41a$bf261770$9e5aa8c0@infiniconsys.com> References: <000301c5d41a$bf261770$9e5aa8c0@infiniconsys.com> Message-ID: <1129667389.16900.23596.camel@hal.voltaire.com> Hi Fab, On Tue, 2005-10-18 at 15:32, Fab Tillier wrote: > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Tuesday, October 18, 2005 12:11 PM > > > > If you have a request for a command you would like in the console, I > > would like to compile a list of these. > > I think it would be great to have console commands to dump information from the > SM - like linear and multicast forwarding tables, service registrations, LID > assignment, etc. Maybe there's a way already to do this interactively, but I'm > not aware of one. If there is, please ignore me. Yes, that's one area I was contemplating covering: both SM and SA queries. This can currently be done with some IBA defined limitations with IBIS. One can also obtain SM attributes via smpquery from the nodes in the network. -- Hal From rolandd at cisco.com Tue Oct 18 13:44:52 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 18 Oct 2005 13:44:52 -0700 Subject: [openib-general] Re: [PATCH] Fix to backport infinipath_core.c to 2.6.9 In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0005DB2CF2@orsmsx408> (Robert J. Woodruff's message of "Tue, 18 Oct 2005 13:35:15 -0700") References: <1AC79F16F5C5284499BB9591B33D6F0005DB2CF2@orsmsx408> Message-ID: <52wtkamu63.fsf@cisco.com> Robert> In trying to backport svn3796 back to 2.6.9, I got a Robert> compile error on infinipath_core.c. Here is a patch that I Robert> think fixes the problem, but have no way to test it. Can Robert> you review the patch and if it looks OK, I will include it Robert> in my next set of backport patches ? That looks like it should work to me. Incidentally, I hadn't noticed that struct pci_driver had added a .owner member. Does anyone know what it's used for? 
It seems that very few in-tree drivers set it to anything. - R. From eitan at mellanox.co.il Tue Oct 18 14:11:15 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 18 Oct 2005 23:11:15 +0200 Subject: [openib-general] Re: [RFC] OpenSM Interactive Console In-Reply-To: <1129662629.16900.23196.camel@hal.voltaire.com> References: <1129662629.16900.23196.camel@hal.voltaire.com> Message-ID: <435564F3.6060409@mellanox.co.il> Hal Rosenstock wrote: > Currently, OpenSM does not support an interactive console. There has > been a desire to introduce the ability to change certain parameters (as > well as display things) once OpenSM has started. This patch introduces > the first most basic commands: help and loglevel. I am investigating > adding smpriority to this. The console is invoked by specifying -console > as an option on the opensm command line. > > If you have a request for a command you would like in the console, I > would like to compile a list of these. > > Comments ? OpenSM gen1 has a nice TCL API (named osmsh) that lets you do all that and much more. Setting ALL options is supported. It also provides Tcl access to the SM Database so you can write your own reports on FDB/MC-FDB etc. Interactive control on the discovery and fabric settings sequence allows "single stepping" too. The OpenSM user manual provides an extensive description of it, including some programming examples. Porting of osmsh to gen2 should be very simple. I do not see why we need to invent yet another way to do these things. Instead I would recommend including the osm Tcl extension in the gen2 trunk and putting it to work. EZ From sean.hefty at intel.com Tue Oct 18 14:22:27 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 18 Oct 2005 14:22:27 -0700 Subject: [openib-general] [RFC] MAD API changes to fix DMA mapping issues Message-ID: I'm working on a patch to fix the DMA mapping issues that Roland reported earlier on the mail list.
The proposed solution involves the following changes to ib_mad.h and ib_verbs.h. DMA mapping is performed immediately before posting the work request, with unmapping coming after polling the corresponding completion off the CQ. Comments? - Sean Index: ib_mad.h =================================================================== --- ib_mad.h (revision 3796) +++ ib_mad.h (working copy) @@ -203,26 +207,26 @@ struct ib_class_port_info /** * ib_mad_send_buf - MAD data buffer and work request for sends. + * @next: A pointer used to chain together MADs for posting. * @mad: References an allocated MAD data buffer. The size of the data * buffer is specified in the @send_wr.length field. - * @mapping: DMA mapping information. * @mad_agent: MAD agent that allocated the buffer. + * @ah: The address handle to use when sending the MAD. * @context: User-controlled context fields. - * @send_wr: An initialized work request structure used when sending the MAD. - * The wr_id field of the work request is initialized to reference this - * data structure. - * @sge: A scatter-gather list referenced by the work request. + * @timeout_ms: Time to wait for a response. + * @retries: Number of times to retry a request for a response. * * Users are responsible for initializing the MAD buffer itself, with the * exception of specifying the payload length field in any RMPP MAD. */ struct ib_mad_send_buf { + struct ib_mad_send_buf *next; struct ib_mad *mad; - DECLARE_PCI_UNMAP_ADDR(mapping) struct ib_mad_agent *mad_agent; + struct ib_ah *ah; void *context[2]; - struct ib_send_wr send_wr; - struct ib_sge sge; + int timeout_ms; + int retries; }; /** @@ -287,7 +291,7 @@ typedef void (*ib_mad_send_handler)(stru * or @mad_send_wc. */ typedef void (*ib_mad_snoop_handler)(struct ib_mad_agent *mad_agent, - struct ib_send_wr *send_wr, + struct ib_mad_send_buf *send_buf, struct ib_mad_send_wc *mad_send_wc); /** @@ -334,13 +338,13 @@ struct ib_mad_agent { /** * ib_mad_send_wc - MAD send completion information. 
- * @wr_id: Work request identifier associated with the send MAD request. + * @send_buf: Send MAD data buffer associated with the send MAD request. * @status: Completion status. * @vendor_err: Optional vendor error information returned with a failed * request. */ struct ib_mad_send_wc { - u64 wr_id; + struct ib_mad_send_buf *send_buf; enum ib_wc_status status; u32 vendor_err; }; @@ -366,7 +370,7 @@ struct ib_mad_recv_buf { * @rmpp_list: Specifies a list of RMPP reassembled received MAD buffers. * @mad_len: The length of the received MAD, without duplicated headers. * - * For received response, the wr_id field of the wc is set to the wr_id + * For received response, the wr_id contains a pointer to the ib_mad_send_buf * for the corresponding send request. */ struct ib_mad_recv_wc { @@ -463,9 +467,9 @@ int ib_unregister_mad_agent(struct ib_ma /** * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated * with the registered client. - * @mad_agent: Specifies the associated registration to post the send to. - * @send_wr: Specifies the information needed to send the MAD(s). - * @bad_send_wr: Specifies the MAD on which an error was encountered. + * @send_buf: Specifies the information needed to send the MAD(s). + * @bad_send_buf: Specifies the MAD on which an error was encountered. This + * parameter is optional if only a single MAD is posted. * * Sent MADs are not guaranteed to complete in the order that they were posted. * @@ -479,9 +483,8 @@ int ib_unregister_mad_agent(struct ib_ma * defined data being transferred. The paylen_newwin field should be * specified in network-byte order. */ -int ib_post_send_mad(struct ib_mad_agent *mad_agent, - struct ib_send_wr *send_wr, - struct ib_send_wr **bad_send_wr); +int ib_post_send_mad(struct ib_mad_send_buf *send_buf, + struct ib_mad_send_buf **bad_send_buf); /** * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer. 
@@ -507,23 +510,25 @@ void ib_free_recv_mad(struct ib_mad_recv /** * ib_cancel_mad - Cancels an outstanding send MAD operation. * @mad_agent: Specifies the registration associated with sent MAD. - * @wr_id: Indicates the work request identifier of the MAD to cancel. + * @send_buf: Indicates the MAD to cancel. * * MADs will be returned to the user through the corresponding * ib_mad_send_handler. */ -void ib_cancel_mad(struct ib_mad_agent *mad_agent, u64 wr_id); +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + struct ib_mad_send_buf *send_buf); /** * ib_modify_mad - Modifies an outstanding send MAD operation. * @mad_agent: Specifies the registration associated with sent MAD. - * @wr_id: Indicates the work request identifier of the MAD to modify. + * @send_buf: Indicates the MAD to modify. * @timeout_ms: New timeout value for sent MAD. * * This call will reset the timeout value for a sent MAD to the specified * value. */ -int ib_modify_mad(struct ib_mad_agent *mad_agent, u64 wr_id, u32 timeout_ms); +int ib_modify_mad(struct ib_mad_agent *mad_agent, + struct ib_mad_send_buf *send_buf, u32 timeout_ms); /** * ib_redirect_mad_qp - Registers a QP for MAD services. @@ -572,7 +577,6 @@ int ib_process_mad_wc(struct ib_mad_agen * @remote_qpn: Specifies the QPN of the receiving node. * @pkey_index: Specifies which PKey the MAD will be sent using. This field * is valid only if the remote_qpn is QP 1. - * @ah: References the address handle used to transfer to the remote node. * @rmpp_active: Indicates if the send will enable RMPP. * @hdr_len: Indicates the size of the data header of the MAD. This length * should include the common MAD header, RMPP header, plus any class @@ -582,11 +586,10 @@ int ib_process_mad_wc(struct ib_mad_agen * additional padding that may be necessary. * @gfp_mask: GFP mask used for the memory allocation. * - * This is a helper routine that may be used to allocate a MAD. Users are - * not required to allocate outbound MADs using this call. 
The returned - * MAD send buffer will reference a data buffer usable for sending a MAD, along + * This routine allocates a MAD for sending. The returned MAD send buffer + * will reference a data buffer usable for sending a MAD, along * with an initialized work request structure. Users may modify the returned - * MAD data buffer or work request before posting the send. + * MAD data buffer before posting the send. * * The returned data buffer will be cleared. Users are responsible for * initializing the common MAD and any class specific headers. If @rmpp_active @@ -594,7 +597,7 @@ int ib_process_mad_wc(struct ib_mad_agen */ struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, u32 remote_qpn, u16 pkey_index, - struct ib_ah *ah, int rmpp_active, + int rmpp_active, int hdr_len, int data_len, unsigned int __nocast gfp_mask); Index: ib_verbs.h =================================================================== --- ib_verbs.h (revision 3796) +++ ib_verbs.h (working copy) @@ -595,11 +595,8 @@ struct ib_send_wr { } atomic; struct { struct ib_ah *ah; - struct ib_mad_hdr *mad_hdr; u32 remote_qpn; u32 remote_qkey; - int timeout_ms; /* valid for MADs only */ - int retries; /* valid for MADs only */ u16 pkey_index; /* valid for GSI only */ u8 port_num; /* valid for DR SMPs on switch only */ } ud; From rolandd at cisco.com Tue Oct 18 14:36:05 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 18 Oct 2005 14:36:05 -0700 Subject: [openib-general] [RFC] MAD API changes to fix DMA mapping issues In-Reply-To: (Sean Hefty's message of "Tue, 18 Oct 2005 14:22:27 -0700") References: Message-ID: <52ek6imrsq.fsf@cisco.com> This seems reasonable to me -- I like getting rid of the overloading of struct ib_wr with MAD-only fields. A few specific questions/comments about ib_mad_send_buf: > struct ib_mad_send_buf { > + struct ib_mad_send_buf *next; > struct ib_mad *mad; > - DECLARE_PCI_UNMAP_ADDR(mapping) Do we want to get rid of this field? 
It seems like we'll need to keep track of the DMA mapping somewhere, and this is as good a place as any. > struct ib_mad_agent *mad_agent; > + struct ib_ah *ah; > void *context[2]; I know this isn't being changed, but what was the original reason for needing two context slots? > - struct ib_send_wr send_wr; > - struct ib_sge sge; > + int timeout_ms; > + int retries; > }; Finally, I don't see anywhere that the length of the data buffer is stored in the structure, so how does ib_post_send_mad() know how much to send when it gets called? - R. From rolandd at cisco.com Tue Oct 18 14:46:53 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 18 Oct 2005 14:46:53 -0700 Subject: [openib-general] Re: [PATCH] mthca: check that QP is not already a member of a MCG before attach In-Reply-To: <20051002151228.GE9873@mellanox.co.il> (Jack Morgenstein's message of "Sun, 2 Oct 2005 17:12:28 +0200") References: <20051002151228.GE9873@mellanox.co.il> Message-ID: <527jcamraq.fsf@cisco.com> Thanks, applied and queued for 2.6.15 (at long last). From mshefty at ichips.intel.com Tue Oct 18 14:58:01 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 18 Oct 2005 14:58:01 -0700 Subject: [openib-general] [RFC] MAD API changes to fix DMA mapping issues In-Reply-To: <52ek6imrsq.fsf@cisco.com> References: <52ek6imrsq.fsf@cisco.com> Message-ID: <43556FE9.6010009@ichips.intel.com> Roland Dreier wrote: > > struct ib_mad_send_buf { > > + struct ib_mad_send_buf *next; > > struct ib_mad *mad; > > - DECLARE_PCI_UNMAP_ADDR(mapping) > > Do we want to get rid of this field? It seems like we'll need to keep > track of the DMA mapping somewhere, and this is as good a place as any. The implementation merges struct ib_mad_send_buf with struct ib_mad_send_wr_private, which was already used internally to track sends.
Mapping has been moved into ib_mad_send_wr_private: @@ -118,9 +118,10 @@ struct ib_mad_send_wr_private { struct ib_mad_list_head mad_list; struct list_head agent_list; struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_buf send_buf; + DECLARE_PCI_UNMAP_ADDR(mapping) struct ib_send_wr send_wr; struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; - u64 wr_id; /* client WR ID */ __be64 tid; unsigned long timeout; int retries; > > struct ib_mad_agent *mad_agent; > > + struct ib_ah *ah; > > void *context[2]; > > I know this isn't being changed, but what was the original reason for > needing two context slots? I usually find that 2 context fields are convenient for middleware. In this case, the CM uses one context to reference the cm_id associated with a message, and the second context records the state that the message was sent in. > > - struct ib_send_wr send_wr; > > - struct ib_sge sge; > > + int timeout_ms; > > + int retries; > > }; > > Finally, I don't see anywhere that the length of the data buffer is > stored in the structure, so how does ib_post_send_mad() know how much > to send when it gets called? The length is stored in send_wr in ib_mad_send_wr_private. ib_post_send_mad() was copying the work request passed in by the user to the one stored in mad_send_wr_private. If it makes more sense, I can move the work request from mad_send_wr_private to mad_send_buf. - Sean From rolandd at cisco.com Tue Oct 18 15:06:22 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 18 Oct 2005 15:06:22 -0700 Subject: [openib-general] [RFC] MAD API changes to fix DMA mapping issues In-Reply-To: <43556FE9.6010009@ichips.intel.com> (Sean Hefty's message of "Tue, 18 Oct 2005 14:58:01 -0700") References: <52ek6imrsq.fsf@cisco.com> <43556FE9.6010009@ichips.intel.com> <521x2imqe9.fsf@cisco.com> Message-ID: <521x2imqe9.fsf@cisco.com> Sean> The implementation merges struct ib_mad_send_buf with struct Sean> ib_mad_send_wr_private, which was already used internally to Sean> track sends.
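The arrangement described in this exchange, a public send buffer embedded in a larger private tracking structure, can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the hyp_ types are trimmed, hypothetical stand-ins for ib_mad_send_buf and ib_mad_send_wr_private, and the thread does not show how the implementation actually gets from the public structure back to the private one. The container_of-style pointer arithmetic used here is simply the usual kernel idiom for that step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed stand-in for the public send buffer. */
struct hyp_send_buf {
    void *mad;
    void *context[2];
    int   timeout_ms;
    int   retries;
};

/* Hypothetical, trimmed stand-in for the private tracking structure. */
struct hyp_send_wr_private {
    unsigned long long  dma_mapping; /* stands in for the DMA unmap address */
    size_t              length;      /* stands in for the length kept in send_wr */
    struct hyp_send_buf send_buf;    /* embedded public part handed to clients */
};

/* Userspace equivalent of the kernel's container_of(). */
#define hyp_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the private structure from the public pointer a client holds. */
struct hyp_send_wr_private *hyp_to_private(struct hyp_send_buf *buf)
{
    assert(buf != NULL);
    return hyp_container_of(buf, struct hyp_send_wr_private, send_buf);
}
```

Under this layout the client never sees the mapping or the length, yet a post routine handed only the public pointer can still reach both.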
Got it -- I didn't realize that the ib_mad_send_buf structure was
contained in a bigger private structure that has all that extra stuff.

I'm still a little confused as to where the data buffer actually is.
Is it pointed to by the struct ib_mad *mad member? If so, it seems a
little odd to make the pointer have type struct ib_mad *, since struct
ib_mad is exactly 256 bytes long.

Also, how will it work to post a send for a very large RMPP message?
The next member seems to be used to chain separate transactions
together -- I don't see a way to have multiple buffers that all
contain data for the same message.

 - R.

From kjreilly at us.ibm.com Tue Oct 18 15:40:47 2005
From: kjreilly at us.ibm.com (Kevin Reilly)
Date: Tue, 18 Oct 2005 18:40:47 -0400
Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a
Message-ID:

On Mon, 2005-10-18 at 10:07, Kevin Reilly wrote:
>On Mon, 2005-10-17 at 10:07, Hal Rosenstock wrote:
>> > Should this code work, because it seems that out_dev is a kernel
>> > address (platform: PPC64) which cannot be accessed by a userspace
>> > program. Via GDB I can see that rt has the following content:
>> >
>> > The address is rt->out_dev = 0xc0000000cffaa800 which looks like a
>> > kernel address.
>>
>> Yes, this is a bug which has been previously pointed out on the list and
>> not fixed.
>
>The fix for this involves an ABI change: it should return the GID of the
>outgoing IB device.
>
>-- Hal

Should we (IBM) work on submitting a patch for this? Returning the GID or
the device_name would be a good fix. I guess our reluctance is that we've
heard that this address translation library function might be deprecated in
favor of another interface. Having neither leaves us without a method to
translate healthy "heartbeat-able" IP interfaces to HCAs where we can run
things over verbs.

Kevin J.
Reilly
STSM, HPC Architecture
-Federation/HPS Chief Engineer
-HPC interconnect architect
(office) 845-433-7976 (tieline) 8-293-7976

From mshefty at ichips.intel.com Tue Oct 18 15:56:00 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 18 Oct 2005 15:56:00 -0700
Subject: [openib-general] [RFC] MAD API changes to fix DMA mapping issues
In-Reply-To: <521x2imqe9.fsf@cisco.com>
References: <52ek6imrsq.fsf@cisco.com> <43556FE9.6010009@ichips.intel.com> <521x2imqe9.fsf@cisco.com>
Message-ID: <43557D80.602@ichips.intel.com>

Roland Dreier wrote:
> I'm still a little confused as to where the data buffer actually is.
> Is it pointed to by the struct ib_mad *mad member? If so, it seems a
> little odd to make the pointer have type struct ib_mad *, since struct
> ib_mad is exactly 256 bytes long.

Yes - it's pointed to by struct ib_mad. This is how ib_mad_send_buf worked
before. I can change the pointer to void*, but then it requires casting to
ib_mad or ib_mad_hdr wherever it is used. I've also considered changing it
to a pointer to a union of the different MAD types.

> Also, how will it work to post a send for a very large RMPP message?
> The next member seems to be used to chain separate transactions
> together -- I don't see a way to have multiple buffers that all
> contain data for the same message.

I've given this some thought, but don't have a decent answer yet. The next
member is intended for separate transactions, but isn't used by any clients
at this point. I guess that we could either add an sg_list or a next_buffer
pointer for this purpose, where next_buffer points to a structure that
resembles an SGE.
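
As a rough illustration of the next_buffer idea mentioned above, the sketch below chains several SGE-like descriptors that all belong to one message, which is distinct from the next member that chains separate transactions. It is plain userspace C, and every name in it (struct mad_seg, mad_msg_length, mad_msg_segments) is hypothetical, not part of any existing MAD API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical SGE-like descriptor for one piece of a large message. */
struct mad_seg {
	void *addr;                  /* start of this piece of the payload */
	size_t length;               /* number of bytes in this piece */
	struct mad_seg *next_buffer; /* next piece of the SAME message, or NULL */
};

/* Sum the payload length over every piece of one message. */
static size_t mad_msg_length(const struct mad_seg *seg)
{
	size_t total = 0;

	for (; seg; seg = seg->next_buffer)
		total += seg->length;
	return total;
}

/* Count how many pieces make up one message. */
static size_t mad_msg_segments(const struct mad_seg *seg)
{
	size_t count = 0;

	for (; seg; seg = seg->next_buffer)
		count++;
	return count;
}
```

A send path built this way could walk the chain once to learn the total length and again to fill in one scatter/gather entry per piece; whether that or a fixed sg_list array is preferable is exactly the open question in the message.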
- Sean From iod00d at hp.com Tue Oct 18 16:15:33 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 18 Oct 2005 16:15:33 -0700 Subject: [openib-general] Re: iWARP emulation protocol In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020A5C@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020A5C@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <20051018231533.GB12879@esmail.cup.hp.com> On Tue, Oct 18, 2005 at 12:16:25PM -0700, Caitlin Bestler wrote: > > 2) Are there real users or is this a generic uDAPL API thing? > > uDAPL vs. kDAPL is irrelevant here. The user or Kernel Consumer > making the connection does not know whether their peer is running > in user or kernel, nor should they. Caitlin, I didn't see an answer to Roland's question in your reply. There is no kDAPL in linux. So yes, I agree uDAPL v kDAPL is irrelevant. > Every discussion of reducing the guaranteed private data size > in DAPL has produced adverse reactions from application developers. > They're either very good actors or were working on actual applications. Roland (and the rest of us) would like to see someone name a real consumer of the proposed interface. ie who depends on this change? Then the dependency for that use/user can be discussed and appropriate tradeoffs made. Make sense? hth, grant From caitlinb at broadcom.com Tue Oct 18 16:40:54 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 18 Oct 2005 16:40:54 -0700 Subject: [openib-general] Re: iWARP emulation protocol Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020A63@NT-SJCA-0751.brcm.ad.broadcom.com> > > Roland (and the rest of us) would like to see someone name a > real consumer of the proposed interface. ie who depends on > this change? > Then the dependency for that use/user can be discussed and > appropriate tradeoffs made. Make sense? > Unfortunately not every application that is under development, or even deployed, can be discussed in a google-searchable public forum. 
That especially applies to user-mode development. So I could have actually tested such applications and still not be free to cite them here. With any luck some of them are following the discussion and will jump in on their own. Unfortunately, since they are developing to uDAPL they are unlikely to be following this discussion. From iod00d at hp.com Tue Oct 18 16:43:50 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 18 Oct 2005 16:43:50 -0700 Subject: [openib-general] Re: [PATCH] Fix to backport infinipath_core.c to 2.6.9 In-Reply-To: <52wtkamu63.fsf@cisco.com> References: <1AC79F16F5C5284499BB9591B33D6F0005DB2CF2@orsmsx408> <52wtkamu63.fsf@cisco.com> Message-ID: <20051018234350.GC12879@esmail.cup.hp.com> On Tue, Oct 18, 2005 at 01:44:52PM -0700, Roland Dreier wrote: > Robert> In trying to backport svn3796 back to 2.6.9, I got a > Robert> compile error on infinipath_core.c. Here is a patch that I > Robert> think fixes the problem, but have no way to test it. Can > Robert> you review the patch and if it looks OK, I will include it > Robert> in my next set of backport patches ? > > That looks like it should work to me. > > Incidentally, I hadn't noticed that struct pci_driver had added a > .owner member. Does anyone know what it's used for? It seems that > very few in-tree drivers set it to anything. Uhm, in 2.6.13, i get: grundler <496>find -name \*.c | xargs fgrep ".owner =" | wc 569 2429 32166 I am getting some false positives in that count though. I expect the key uses are for sysfs: ./fs/sysfs/bin.c: if (!try_module_get(attr->attr.owner)) ./fs/sysfs/bin.c: module_put(attr->attr.owner); ./fs/sysfs/bin.c: module_put(attr->attr.owner); and some similar code in ALSA (sound/). It looks like the module_get/put is just managing the reference count for driver modules when someone opens a /sys file owned by that driver. 
grant From rolandd at cisco.com Tue Oct 18 16:51:19 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 18 Oct 2005 16:51:19 -0700 Subject: [openib-general] Re: [PATCH] Fix to backport infinipath_core.c to 2.6.9 In-Reply-To: <20051018234350.GC12879@esmail.cup.hp.com> (Grant Grundler's message of "Tue, 18 Oct 2005 16:43:50 -0700") References: <1AC79F16F5C5284499BB9591B33D6F0005DB2CF2@orsmsx408> <52wtkamu63.fsf@cisco.com> <20051018234350.GC12879@esmail.cup.hp.com> Message-ID: <52d5m2l6yw.fsf@cisco.com> > Uhm, in 2.6.13, i get: > grundler <496>find -name \*.c | xargs fgrep ".owner =" | wc > 569 2429 32166 > I am getting some false positives in that count though. Yes, many many false positives. By far the majority of your hits are against other structures like struct file_operations that have an owner member. > I expect the key uses are for sysfs: > ./fs/sysfs/bin.c: if (!try_module_get(attr->attr.owner)) > ./fs/sysfs/bin.c: module_put(attr->attr.owner); > ./fs/sysfs/bin.c: module_put(attr->attr.owner); This owner field is in struct attribute and gets set elsewhere. I asked on lkml and got the answer in http://lkml.org/lkml/2005/10/18/169 which is that it's not really used for ref counting yet, but it does put a symlink in sysfs that may be useful. - R. From iod00d at hp.com Tue Oct 18 17:01:58 2005 From: iod00d at hp.com (Grant Grundler) Date: Tue, 18 Oct 2005 17:01:58 -0700 Subject: [openib-general] Re: iWARP emulation protocol In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020A63@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020A63@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <20051019000158.GD12879@esmail.cup.hp.com> On Tue, Oct 18, 2005 at 04:40:54PM -0700, Caitlin Bestler wrote: > > Roland (and the rest of us) would like to see someone name a > > real consumer of the proposed interface. ie who depends on > > this change? > > Then the dependency for that use/user can be discussed and > > appropriate tradeoffs made. 
Make sense?
>
> Unfortunately not every application that is under development,
> or even deployed, can be discussed in a google-searchable
> public forum. That especially applies to user-mode development.

Well, this is open source. While I don't want to preclude
closed source development, it's usually necessary to have an
open source consumer that any open source developer can test with.

> So I could have actually tested such applications and still
> not be free to cite them here.

Understood. I'm not asking *you* to cite one unless you happen
to own one of the consumers.

> With any luck some of them
> are following the discussion and will jump in on their own.
> Unfortunately, since they are developing to uDAPL they are
> unlikely to be following this discussion.

It doesn't help that the DAT yahoo-groups.com mailing list is
rejecting my replies. It would be helpful if someone following this
forum could share Roland's question with the DAT mailing list if it
didn't make it there already and possibly explain why naming a
consumer is necessary.

hth,
grant

From mohitka at noida.hcltech.com Tue Oct 18 22:41:07 2005
From: mohitka at noida.hcltech.com (Mohit Katiyar, Noida)
Date: Wed, 19 Oct 2005 11:11:07 +0530
Subject: [openib-general] I/O controllers
Message-ID: <3E6BB9CEE261E2428AD25D0D553DC4970159DB21@HSDLNTD1110010.noida.hcltech.com>

Hi all,
Can anyone tell me whether a specific I/O controller is needed for the
connection between the TCA and SCSI devices, or whether any I/O controller
will work between the TCA and SCSI devices?

Mohit Katiyar
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Arkady.Kanevsky at netapp.com Wed Oct 19 06:40:04 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Wed, 19 Oct 2005 09:40:04 -0400
Subject: [openib-general] Re: iWARP emulation protocol
Message-ID:

Grant,
The developers of the application(s) in question are aware of the
discussion. I will leave it to them to respond.
I bring the discussion points to the weekly DAT Collaborative meeting,
which we have every Wednesday.
I apologize that the DAT Collaborative charter does not allow submitting
contributions without joining the DAT Collaborative. But this is no
different from Linux not accepting any contributions without a proper
license. But rest assured that as Chair I bring the concerns and
suggestions stated in the email discussion to the DAT meetings.

Arkady

Arkady Kanevsky                email: arkady at netapp.com
Network Appliance              phone: 781-768-5395
375 Totten Pond Rd.            Fax: 781-895-1195
Waltham, MA 02451-2010         central phone: 781-768-5300

> -----Original Message-----
> From: Grant Grundler [mailto:iod00d at hp.com]
> Sent: Tuesday, October 18, 2005 8:02 PM
> To: Caitlin Bestler
> Cc: Grant Grundler; Roland Dreier; Kanevsky, Arkady;
> swg at infinibandta.org; dat-discussions at yahoogroups.com;
> openib-general at openib.org
> Subject: Re: [openib-general] Re: iWARP emulation protocol
>
>
> On Tue, Oct 18, 2005 at 04:40:54PM -0700, Caitlin Bestler wrote:
> > > Roland (and the rest of us) would like to see someone name a
> > > real consumer of the proposed interface. ie who depends on
> > > this change?
> > > Then the dependency for that use/user can be discussed and
> > > appropriate tradeoffs made. Make sense?
> >
> > Unfortunately not every application that is under development, or even
> > deployed, can be discussed in a google-searchable public forum. That
> > especially applies to user-mode development.
>
> Well, this is open source. While I don't want to preclude
> closed source development, it's usually necessary to have an
> open source consumer that any open source developer can test with.
>
> > So I could have actually tested such applications and still not be
> > free to cite them here.
>
> Understood. I'm not asking *you* to cite one unless you
> happen to own one of the consumers.
>
> > With any luck some of them
> > are following the discussion and will jump in on their own.
> > Unfortunately, since they are developing to uDAPL they are unlikely to
> > be following this discussion.
>
> It doesn't help that the DAT yahoo-groups.com mailing list is
> rejecting my replies. It would be helpful if someone
> following this forum could share Roland's question with DAT
> mailing list if it didn't make it there already and possibly
> explain why naming a consumer is necessary.
>
> hth,
> grant
>

From krause at cup.hp.com Wed Oct 19 06:42:03 2005
From: krause at cup.hp.com (Michael Krause)
Date: Wed, 19 Oct 2005 06:42:03 -0700
Subject: [openib-general] I/O controllers
In-Reply-To: <3E6BB9CEE261E2428AD25D0D553DC4970159DB21@HSDLNTD1110010.noida.hcltech.com>
References: <3E6BB9CEE261E2428AD25D0D553DC4970159DB21@HSDLNTD1110010.noida.hcltech.com>
Message-ID: <6.2.0.14.2.20051019064128.022a7408@esmail.cup.hp.com>

At 10:41 PM 10/18/2005, Mohit Katiyar, Noida wrote:
>Hi all,
>Can anyone tell me are there any specific I/O controller for the
>connection between the TCA and SCSI devices or any I/O controller will
>work between the TCA and SCSI devices

See various IB vendors for their offerings which include attachment to
various I/O device types. Their web pages contain plenty of appropriate
information.

Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From IBMEHCAD at de.ibm.com Wed Oct 19 07:58:46 2005
From: IBMEHCAD at de.ibm.com (IBMEHCA DD)
Date: Wed, 19 Oct 2005 16:58:46 +0200
Subject: [openib-general] moving IBM eHCA Device Driver to openib.org
In-Reply-To: <52ll196xnm.fsf@cisco.com>
Message-ID:

I put out an initial set of files (as discussed) on openib.org svn.
The Kconfig option will follow when I've verified that it really compiles
as is in svn.
The same set with makefiles and install scripts is available from
sourceforge as ehca2_0033

Christoph

Roland Dreier wrote on 04.10.2005 18:43:09:

> Congratulations on getting to this stage!
> > gen2/trunk/src/linux-kernel/infiniband/hw/ehca
> > gen2/trunk/src/userspace/libehca
>
> Yes, this is the right place to add the code.
> > We should probably modify the linux-kernel/infiniband/Kconfig to only
> > allow to compile the kernel part for ppc64 builds
>
> Yes, add
>     source "drivers/infiniband/hw/ehca/Kconfig"
> to that Kconfig, and
>     obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/
> to the Makefile.
> - R.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sinate at yahoo.com Wed Oct 19 07:59:04 2005
From: sinate at yahoo.com (Steven Wooding)
Date: Wed, 19 Oct 2005 15:59:04 +0100 (BST)
Subject: [openib-general] Support for UC connections using the CM API?
Message-ID: <20051019145904.3332.qmail@web32502.mail.mud.yahoo.com>

Hi there,

I was wondering whether the CM API currently (I'm using svn 3470) supports
establishing UC connections? I have the RC transport type working fine
using the CM.

I've found that somewhere between me sending and receiving the REQ message,
the qp_type variable changes from UC to RC (3 to 2). I've checked the value
just before the user-space call ib_cm_send_req() and just after receiving
the CM event that contains the REQ, so I believe I've ruled out a bug in my
app. So this must mean it is switched somewhere in the kernel-space driver.

I had a quick look in the kernel-space code, but I'm not really sure what's
going on. It could be a bug in either cm_req_get_qp_type() or
cm_req_set_qp_type() in cm_msgs.h.

Anyway, perhaps you could confirm whether the CM supports UC and if so,
look into this possible bug.

Thank you very much.

Regards,

Steve.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From arlin.r.davis at intel.com Wed Oct 19 08:29:42 2005
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Wed, 19 Oct 2005 08:29:42 -0700
Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulation protocol
Message-ID: <59278FC0C48A994BABABD069571E45680C9C5773@orsmsx401.amr.corp.intel.com>

Arkady,

Intel MPI (real consumer of uDAPL) has no problem with this change.

-arlin
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From halr at voltaire.com Wed Oct 19 08:23:41 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Oct 2005 11:23:41 -0400
Subject: [openib-general] Re: [RFC] OpenSM Interactive Console
In-Reply-To: <435564F3.6060409@mellanox.co.il>
References: <1129662629.16900.23196.camel@hal.voltaire.com> <435564F3.6060409@mellanox.co.il>
Message-ID: <1129735418.16900.33396.camel@hal.voltaire.com>

On Tue, 2005-10-18 at 17:11, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > Currently, OpenSM does not support an interactive console. There has
> > been a desire to introduce the ability to change certain parameters (as
> > well as display things) once OpenSM has started. This patch introduces
> > the first most basic commands: help and loglevel. I am investigating
> > adding smpriority to this. The console is invoked by specifying -console
> > as an option on the opensm command line.
> >
> > If you have a request for a command you would like in the console, I
> > would like to compile a list of these.
> >
> > Comments ?
>
> OpenSM gen1 has a nice TCL API (named osmsh) that lets you do all that
> and much more.
> Setting ALL options is supported.
> It also provides a Tcl access to the SM Database so you can write your own
> reports on FDB/MC-FDB etc.
> Interactive control on the discovery and fabric settings sequence allows
> "single stepping" too.

IMO osmsh is more a debugger's tool. It relies on OpenSM globals and
internal SM data structures rather than well defined APIs which might
isolate the user from changes. (It exposes the internals of the SM, so SM
modifications may cause scripts using osmsh to stop working and, worse than
that, osmsh scripts may cause serious SM bugs.)
I think there is a place for a "safer" console. Perhaps there are levels of access privileges where some can do RO things and others have RW access. > The OpenSM user manual provides extensive description of it, > including some programming examples. What OpenSM documentation ? I didn't see any with the 1.8.0 release. > Porting of osmsh to gen2 should be very simple. Is someone working on doing this ? > I do not see why we need to invent yet another way to do these things. > Instead I would recommend including osm Tcl extension in the gen2 trunk > and put it to work. -- Hal From Arkady.Kanevsky at netapp.com Wed Oct 19 08:31:44 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 19 Oct 2005 11:31:44 -0400 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Message-ID: Arlin, just to clarify, Intel MPI will not have problems with useing less than 64 bytes of private data. If a solution will provide you with 48 bytes of private data will it be sufficient? Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -----Original Message----- From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] Sent: Wednesday, October 19, 2005 11:30 AM To: dat-discussions at yahoogroups.com; Grant Grundler Cc: swg at infinibandta.org; openib-general at openib.org Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Arkady, Intel MPI (real consumer of uDAPL) has no problem with this change. 
-arlin ________________________________ From: dat-discussions at yahoogroups.com [mailto:dat-discussions at yahoogroups.com] On Behalf Of Kanevsky, Arkady Sent: Wednesday, October 19, 2005 6:40 AM To: Grant Grundler; Caitlin Bestler Cc: Roland Dreier; swg at infinibandta.org; dat-discussions at yahoogroups.com; openib-general at openib.org Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulation protocol Grant, The developers of the application(s) in questions are aware of the discussion. I will leave it to them to respond. I bring the discussion point at the weekly DAT Collaborative meeting which we have every Wednesday. I appologize that the DAT Collaborative charter does not allow to submit contribution without joining DAT Collaborative. But this is no different from Linux not accepting any contrubutions without proper license. Byt be rest assure that as a Chair I bring the concerns and suggestions stated in email discussion at the DAT meetings. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Grant Grundler [mailto:iod00d at hp.com] > Sent: Tuesday, October 18, 2005 8:02 PM > To: Caitlin Bestler > Cc: Grant Grundler; Roland Dreier; Kanevsky, Arkady; > swg at infinibandta.org; dat-discussions at yahoogroups.com; > openib-general at openib.org > Subject: Re: [openib-general] Re: iWARP emulation protocol > > > On Tue, Oct 18, 2005 at 04:40:54PM -0700, Caitlin Bestler wrote: > > > Roland (and the rest of us) would like to see someone name a > > > real consumer of the proposed interface. ie who depends on > > > this change? > > > Then the dependency for that use/user can be discussed and > > > appropriate tradeoffs made. Make sense? > > > > Unfortunately not every application that is under > development, or even > > deployed, can be discussed in a google-searchable public > forum. 
That > > especially applies to user-mode development. > > Well, this is open source. While I don't want to preclude > closed source developement, it's usually necessary to have an > open source consumer that any open source developer can test with. > > > So I could have actually tested such applications and still not be > > free to cite them here. > > Understood. I'm not asking *you* to cite one unless you > happen to own one of the consumers. > > > With any luck some of them > > are following the discussion and will jump in on their own. > > Unfortunately, since they are developing to uDAPL they are > unlikely to > > be following this discussion. > > It doesn't help that the DAT yahoo-groups.com mailing list is > rejecting my replies. It would be helpful if someone > following this forum could share Roland's question with DAT > mailing list if it didn't make it there already and possibly > explain why naming a consumer is necessary. > > hth, > grant > ________________________________ YAHOO! GROUPS LINKS * Visit your group "dat-discussions " on the web. * To unsubscribe from this group, send an email to: dat-discussions-unsubscribe at yahoogroups.com * Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service . ________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From arlin.r.davis at intel.com Wed Oct 19 08:32:38 2005 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Wed, 19 Oct 2005 08:32:38 -0700 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Message-ID: <59278FC0C48A994BABABD069571E45680C9C578D@orsmsx401.amr.corp.intel.com> Yes, 48 bytes would be sufficient. 
________________________________ From: Kanevsky, Arkady [mailto:Arkady.Kanevsky at netapp.com] Sent: Wednesday, October 19, 2005 8:32 AM To: Davis, Arlin R; dat-discussions at yahoogroups.com; Grant Grundler Cc: swg at infinibandta.org; openib-general at openib.org Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Arlin, just to clarify, Intel MPI will not have problems with useing less than 64 bytes of private data. If a solution will provide you with 48 bytes of private data will it be sufficient? Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -----Original Message----- From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] Sent: Wednesday, October 19, 2005 11:30 AM To: dat-discussions at yahoogroups.com; Grant Grundler Cc: swg at infinibandta.org; openib-general at openib.org Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Arkady, Intel MPI (real consumer of uDAPL) has no problem with this change. -arlin ________________________________ From: dat-discussions at yahoogroups.com [mailto:dat-discussions at yahoogroups.com] On Behalf Of Kanevsky, Arkady Sent: Wednesday, October 19, 2005 6:40 AM To: Grant Grundler; Caitlin Bestler Cc: Roland Dreier; swg at infinibandta.org; dat-discussions at yahoogroups.com; openib-general at openib.org Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulation protocol Grant, The developers of the application(s) in questions are aware of the discussion. I will leave it to them to respond. I bring the discussion point at the weekly DAT Collaborative meeting which we have every Wednesday. I appologize that the DAT Collaborative charter does not allow to submit contribution without joining DAT Collaborative. But this is no different from Linux not accepting any contrubutions without proper license. 
Byt be rest assure that as a Chair I bring the concerns and suggestions stated in email discussion at the DAT meetings. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Grant Grundler [mailto:iod00d at hp.com] > Sent: Tuesday, October 18, 2005 8:02 PM > To: Caitlin Bestler > Cc: Grant Grundler; Roland Dreier; Kanevsky, Arkady; > swg at infinibandta.org; dat-discussions at yahoogroups.com; > openib-general at openib.org > Subject: Re: [openib-general] Re: iWARP emulation protocol > > > On Tue, Oct 18, 2005 at 04:40:54PM -0700, Caitlin Bestler wrote: > > > Roland (and the rest of us) would like to see someone name a > > > real consumer of the proposed interface. ie who depends on > > > this change? > > > Then the dependency for that use/user can be discussed and > > > appropriate tradeoffs made. Make sense? > > > > Unfortunately not every application that is under > development, or even > > deployed, can be discussed in a google-searchable public > forum. That > > especially applies to user-mode development. > > Well, this is open source. While I don't want to preclude > closed source developement, it's usually necessary to have an > open source consumer that any open source developer can test with. > > > So I could have actually tested such applications and still not be > > free to cite them here. > > Understood. I'm not asking *you* to cite one unless you > happen to own one of the consumers. > > > With any luck some of them > > are following the discussion and will jump in on their own. > > Unfortunately, since they are developing to uDAPL they are > unlikely to > > be following this discussion. > > It doesn't help that the DAT yahoo-groups.com mailing list is > rejecting my replies. 
It would be helpful if someone > following this forum could share Roland's question with DAT > mailing list if it didn't make it there already and possibly > explain why naming a consumer is necessary. > > hth, > grant > From richard.frank at oracle.com Wed Oct 19 09:08:29 2005 From: richard.frank at oracle.com (Richard Frank) Date: Wed, 19 Oct 2005 12:08:29 -0400 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol References: Message-ID: <005301c5d4c7$59290c90$0300a8c0@YOURA06808D9DE> Oracle currently depends on 64 bytes of private data for connect and accept. ----- Original Message ----- From: Kanevsky, Arkady To: Davis, Arlin R ; dat-discussions at yahoogroups.com ; Grant Grundler Cc: swg at infinibandta.org ; openib-general at openib.org Sent: Wednesday, October 19, 2005 11:31 AM Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Arlin, just to clarify, Intel MPI will not have problems with using less than 64 bytes of private data. If a solution provides you with 48 bytes of private data, will that be sufficient? Arkady
From eitan at mellanox.co.il Wed Oct 19 09:28:09 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 19 Oct 2005 18:28:09 +0200 Subject: [openib-general] Re: [RFC] OpenSM Interactive Console In-Reply-To: <1129735418.16900.33396.camel@hal.voltaire.com> References: <1129735418.16900.33396.camel@hal.voltaire.com> Message-ID: <43567419.6050205@mellanox.co.il> Hal Rosenstock wrote: > On Tue, 2005-10-18 at 17:11, Eitan Zahavi wrote: > >>Hal Rosenstock wrote: >> >>>Currently, OpenSM does not support an interactive console. There has >>>been a desire to introduce the ability to change certain parameters (as >>>well as display things) once OpenSM has started. This patch introduces >>>the first most basic commands: help and loglevel. I am investigating >>>adding smpriority to this. The console is invoked by specifying -console >>>as an option on the opensm command line. >>> >>>If you have a request for a command you would like in the console, I >>>would like to compile a list of these. >>> >>>Comments ? >> >>OpenSM gen1 has a nice TCL API (named osmsh) that lets you do all that >>and much more. >>Setting ALL options is supported. >>It also provides a Tcl access to the SM Database so you can write your own >>reports on FDB/MC-FDB etc. >>Interactive control on the discovery and fabric settings sequence allows >>"single stepping" too.
> > > IMO osmsh is more a debugger's tool. It relies on OpenSM globals and > internal SM data structures rather than well defined APIs which might > isolate the user from changes. (It exposes the internals of the SM and > SM modifications may cause scripts using osmsh to stop working, and > worse than that, osmsh scripts may cause serious SM bugs.) What is unsafe in running the following basic code?
    osm_opts configure -log_file $log_file_name.
    osm_init
    osm_bind $guid
    osm_sweep
    osm_set_verbosity 0xffff
> > I think there is a place for a "safer" console. Perhaps there are levels > of access privileges where some can do RO things and others have RW > access. How would this privilege right be granted? > > >>The OpenSM user manual provides extensive description of it, >>including some programming examples. > > > What OpenSM documentation ? I didn't see any with the 1.8.0 release. It is in the 1.7.1 and 1.7.0 manuals too. > > >>Porting of osmsh to gen2 should be very simple. > > > Is someone working on doing this ? No - but if needed we can do that. > > >>I do not see why we need to invent yet another way to do these things. >>Instead I would recommend including osm Tcl extension in the gen2 trunk >>and put it to work. > > > -- Hal > From mshefty at ichips.intel.com Wed Oct 19 09:42:59 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 19 Oct 2005 09:42:59 -0700 Subject: [openib-general] Support for UC connections using the CM API? In-Reply-To: <20051019145904.3332.qmail@web32502.mail.mud.yahoo.com> References: <20051019145904.3332.qmail@web32502.mail.mud.yahoo.com> Message-ID: <43567793.7000505@ichips.intel.com> Steven Wooding wrote: > I was wondering whether the CM API currently (I'm currently using svn > 3470) supports establishing UC connections? I have the RC transport type > working fine using the CM. Unless there's a bug, nothing in the CM should prevent UC from working.
> I had a quick look in the kernel space code, but I'm not really sure > what's going on. Could be a bug is either cm_req_get_qp_type() or > cm_req_set_qp_type() in cm_msgs.h. I checked the kernel code, and I didn't see any obvious issues there. > Anyway, perhaps you could confirm whether the CM supports UC and if so, > look in to this possible bug. I'll look into this more. If you have time, you could change cmpost and ucmpost to use UC and run those. This would help narrow down if the issue is in the kernel, userspace, or the application. (I'm testing some MAD changes, and will try this myself once I'm done testing.) - Sean From halr at voltaire.com Wed Oct 19 09:43:56 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Oct 2005 12:43:56 -0400 Subject: [openib-general] Re: [RFC] OpenSM Interactive Console In-Reply-To: <43567419.6050205@mellanox.co.il> References: <1129735418.16900.33396.camel@hal.voltaire.com> <43567419.6050205@mellanox.co.il> Message-ID: <1129740235.16900.33953.camel@hal.voltaire.com> On Wed, 2005-10-19 at 12:28, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > On Tue, 2005-10-18 at 17:11, Eitan Zahavi wrote: > > > >>Hal Rosenstock wrote: > >> > >>>Currently, OpenSM does not support an interactive console. There has > >>>been a desire to introduce the ability to change certain parameters > > > > (as > > > >>>well as display things) once OpenSM has started. This patch > > > > introduces > > > >>>the first most basic commands: help and loglevel. I am investgating > >>>adding smpriority to this. The console is invoked by specifying > > > > -console > > > >>>as an option on the opensm command line. > >>> > >>>If you have a request for a command you would like in the console, I > >>>would like to compile a list of these. > >>> > >>>Comments ? > >> > >>OpenSM gen1 has a nice TCL API (named osmsh) that lets you do all that > > > > > >>and much more. > >>Setting ALL options is supported. 
> >>It also provides a Tcl access to the SM Database so you can write your > > > > own > > > >>reports on FDB/MC-FDB etc. > >>Interactive control on the discovery and fabric settings sequence > > > > allows > > > >>"single stepping" too. > > > > > > IMO osmsh is more a debugger's tool. It relies on OpenSM globals and > > internal SM data structures rather than well defined APIs which might > > isolate the user from changes. (It exposes the internals of the SM and > > SM modifications may cause scripts using osmsh) to stop working, and > > worse than that, osmsh scripts may cause serious SM bugs. > What is unsafe in running the following basic code? > osm_opts configure -log_file $log_file_name. > osm_init > osm_bind $guid > osm_sweep > osm_set_verbosity 0xffff Are you saying there is no use of globals and internal SM data structures by osmsh or just for that particular flow ? > > I think there is a place for a "safer" console. Perhaps there are levels > > of access privileges where some can do RO things and others have RW > > access. > How would this privilege right be granted? Based on user and/or perhaps group. > >>The OpenSM user manual provides extensive description of it, > >>including some programming examples. > > > > > > What OpenSM documentation ? I didn't see any with the 1.8.0 release. > It is in the 1.7.1 1.7.0 manuals too. How do you get the old versions of this ? > >>Porting of osmsh to gen2 should be very simple. > > > > > > Is someone working on doing this ? > No - but if needed we can do that. > > > > > >>I do not see why we need to invent yet another way to do these things. > >>Instead I would recommend including osm Tcl extension in the gen2 > > > > trunk > > > >>and put it to work. 
> > > > > > -- Hal > > > From rolandd at cisco.com Wed Oct 19 09:57:08 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 19 Oct 2005 09:57:08 -0700 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol In-Reply-To: <005301c5d4c7$59290c90$0300a8c0@YOURA06808D9DE> (Richard Frank's message of "Wed, 19 Oct 2005 12:08:29 -0400") References: <005301c5d4c7$59290c90$0300a8c0@YOURA06808D9DE> Message-ID: <52psq1igwr.fsf@cisco.com> Richard> Oracle currently depends on 64 bytes of private Richard> data for connect and accept. How do you work with RDS, which is not connection oriented and hence does not even have connect or accept? - R. From mshefty at ichips.intel.com Wed Oct 19 10:00:02 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 19 Oct 2005 10:00:02 -0700 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol In-Reply-To: <005301c5d4c7$59290c90$0300a8c0@YOURA06808D9DE> References: <005301c5d4c7$59290c90$0300a8c0@YOURA06808D9DE> Message-ID: <43567B92.8040508@ichips.intel.com> Richard Frank wrote: > Oracle currently depends on 64 bytes of private data for connect and > accept. Is any of that data used to exchange address information? It's impossible to provide both the source and destination address in the CM REQ private data and still give the user 64 bytes. The source address is needed for the reverse GID->IP lookup. Can we make do without the destination address? - Sean From Arkady.Kanevsky at netapp.com Wed Oct 19 10:13:40 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 19 Oct 2005 13:13:40 -0400 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Message-ID: Sean, if you look at the proposal, it shows 2 ways to address this. 1. Have 2 protocols. One just sends the SRC IP address and port, and provides 64 bytes to the ULP. Another one sends both SRC and DEST info and leaves 48(+-) bytes of private data for the ULP. 2. Have 2 protocols. Split IPv4 and IPv6 methods.
For IPv4, send SRC and DST addressing and 64 bytes of ULP private data. For IPv6 we have several options. a. GID=IPv6 address b. use a second CM frame to carry ULP private data. c. others But having multiple versions supported is not pleasant. It loses the simple backwards compatibility of the current protocol, which just formats the CM private data field. Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Wednesday, October 19, 2005 1:00 PM > To: Richard Frank > Cc: swg at infinibandta.org; dat-discussions at yahoogroups.com; > openib-general at openib.org; Davis, Arlin R > Subject: Re: [dat-discussions] RE: [openib-general] Re: iWARP > emulationprotocol > > > Richard Frank wrote: > > Oracle currently depends on 64 bytes of private data for connect and > > accept. > > Is any of that data used to exchange address information? > > It's impossible to provide both the source and destination > address in the CM REQ > private data and still give the user 64 bytes. The source > address is needed for > the reverse GID->IP lookup. Can we make do without the > destination address? > > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rolandd at cisco.com Wed Oct 19 10:32:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 19 Oct 2005 10:32:42 -0700 Subject: [openib-general] moving IBM eHCA Device Driver to openib.org In-Reply-To: (IBMEHCA DD's message of "Wed, 19 Oct 2005 16:58:46 +0200") References: Message-ID: <52ll0pif9h.fsf@cisco.com> IBMEHCA> I put out a initial set of files (as discussed) on IBMEHCA> openib.org svn.
The Kconfig option will follow when I've IBMEHCA> verified that it really compiles as is in svn. Great, it's good to have the code in svn at last. From a first look I don't see any ebus files checked in, and I don't see them in the upstream kernel yet either. I don't follow the linux ppc64 mailing lists so I don't know the status of merging ebus. For that matter what is the status of moving dma_addr_t to u64 on ppc64? I'll read over the rest of the code and post some comments soon. - R. From xma at us.ibm.com Wed Oct 19 10:34:54 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 19 Oct 2005 11:34:54 -0600 Subject: [openib-general] moving IBM eHCA Device Driver to openib.org In-Reply-To: Message-ID: Here is the Kconfig/Makefile patch I used to build ehca as part of the Linux build. diff -urN infiniband/Kconfig infiniband-patch/Kconfig --- infiniband/Kconfig 2005-10-17 09:28:41.000000000 -0700 +++ infiniband-patch/Kconfig 2005-10-17 09:51:08.000000000 -0700 @@ -30,6 +30,8 @@ source "drivers/infiniband/hw/mthca/Kconfig" +source "drivers/infiniband/hw/ehca/Kconfig" + source "drivers/infiniband/ulp/ipoib/Kconfig" source "drivers/infiniband/ulp/sdp/Kconfig" diff -urN infiniband/hw/ehca/Kconfig infiniband-patch/hw/ehca/Kconfig --- infiniband/hw/ehca/Kconfig 1969-12-31 16:00:00.000000000 -0800 +++ infiniband-patch/hw/ehca/Kconfig 2005-10-17 09:51:55.000000000 -0700 @@ -0,0 +1,6 @@ +config INFINIBAND_EHCA + tristate "IBM EHCA support" + depends on IBMEBUS && INFINIBAND + ---help--- + This is a low-level driver for IBM eBUS host + channel adapters (HCAs).
diff -urN infiniband/hw/ehca/Makefile infiniband-patch/hw/ehca/Makefile --- infiniband/hw/ehca/Makefile 2005-10-17 09:45:19.000000000 -0700 +++ infiniband-patch/hw/ehca/Makefile 2005-10-17 09:55:33.000000000 -0700 @@ -46,7 +46,7 @@ endif # GEN2_PATH_KERNEL = drivers #for gen2 code in kernel -obj-m += hcad_mod.o +obj-$(CONFIG_INFINIBAND_EHCA) += hcad_mod.o hcad_mod-objs = ehca_main.o ehca_hca.o ipz_pt_fn.o ehca_classes.o ehca_av.o \ ehca_pd.o ehca_mrmw.o ehca_cq.o ehca_sqp.o ehca_qp.o hcp_sense.o \ @@ -59,6 +59,7 @@ EXTRA_CFLAGS +=-DP_SERIES -DEHCA_USE_HCALL -DEHCA_USE_HCALL_KERNEL \ + -Idrivers/infiniband/include \ -I$(src)/. \ -I$(GEN2_PATH_KERNEL)/infiniband/include/rdma \ -I$(GEN2_PATH_KERNEL)/infiniband/core Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 IBMEHCA DD Sent by: openib-general-bounces at openib.org 10/19/2005 07:58 AM To openib-general at openib.org cc Subject Re: [openib-general] moving IBM eHCA Device Driver to openib.org I put out a initial set of files (as discussed) on openib.org svn. The Kconfig option will follow when I've verified that it really compiles as is in svn. The same set with makefiles and install scripts is available from sourceforge as ehca2_0033 Christoph Roland Dreier wrote on 04.10.2005 18:43:09: > Congratulations on getting to this stage! > > gen2/trunk/src/linux-kernel/infiniband/hw/ehca > > gen2/trunk/src/userspace/libehca > > Yes, this is the right place to add the code. > > We should probably modify the linux-kernel/infiniband/Kconfig to only > > allow to compile the kernel part for ppc64 builds > > Yes, add > source "drivers/infiniband/hw/ehca/Kconfig" > to that Kconfig, and > obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/ > to the Makefile. 
> - R. -------------- next part -------------- A non-text attachment was scrubbed... Name: ehca.config.patch Type: application/octet-stream Size: 1555 bytes Desc: not available URL: From mshefty at ichips.intel.com Wed Oct 19 10:37:58 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 19 Oct 2005 10:37:58 -0700 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol In-Reply-To: References: Message-ID: <43568476.9030905@ichips.intel.com> Kanevsky, Arkady wrote: > if you look at the proposal, it shows 2 ways to address this. I did notice this. > 1. Have 2 protocols. > One just sends the SRC IP address and port, and provides 64 bytes to the ULP. > Another one sends both SRC and DEST info and leaves 48(+-) bytes of > private data for the ULP. If the goal is to make the mapping from IP address to IB address transparent, then I think we want a single protocol. Ideally, the application shouldn't need to know if they're connecting over iWarp, IB, or any other RDMA NIC. Any solution that makes IB appear different than iWarp makes this more difficult to accomplish. > 2. Have 2 protocols. > Split IPv4 and IPv6 methods. Same issue as above. This makes IB connections appear differently than an iWarp connection. This is why I asked if the destination address is required. If it is, then the applications need to make do with less private data. > For IPv4, send SRC and DST addressing and 64 bytes of ULP private data. > For IPv6 we have several options. > a. GID=IPv6 address Unless an IP packet can be sent to a GID and be processed, I don't consider a GID equal to an IPv6 address.
I also don't think that we should require system administrators to add GIDs to IB ports just because they want to add an IP address to a system. > b. use a second CM frame to carry ULP private data. An application can make do with no private data. They can transfer whatever data they need once the connection has been established, like all TCP applications do. Adding more CM messages to pass the same data that should be passed over the user's QP is the wrong approach. In fact another alternative is to make all CM private data reserved. - Sean From Arkady.Kanevsky at netapp.com Wed Oct 19 10:47:46 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 19 Oct 2005 13:47:46 -0400 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Message-ID: But TCP connections do not need to pre-allocate recv buffers like RDMA does. If all RDMA connection attributes can be modified without breaking a live connection, like RDMA read credits, then the analogy with TCP will work. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Wednesday, October 19, 2005 1:38 PM > To: Kanevsky, Arkady > Cc: Richard Frank; swg at infinibandta.org; > dat-discussions at yahoogroups.com; openib-general at openib.org; > Davis, Arlin R > Subject: Re: [dat-discussions] RE: [openib-general] Re: iWARP > emulationprotocol > > > Kanevsky, Arkady wrote: > > if you look at the proposal, it shows 2 ways to address this. > > I did notice this. > > > 1. Have 2 protocols. > > One just sends the SRC IP address and port, and provides 64 bytes to the ULP. > > Another one sends both SRC and DEST info and leaves 48(+-) bytes of > > private data for the ULP. > > If the goal is to make the mapping from IP address to IB > address transparent, > then I think we want a single protocol.
Ideally, the > application shouldn't need > to know if they're connecting over iWarp, IB, or any other > RDMA NIC. Any > solution that makes IB appear different than iWarp makes this > more difficult to > accomplish. > > > 2. Have 2 protocols. > > Split IPv4 and IPv6 methods. > > Same issue as above. This makes IB connections appear > differently than an iWarp > connection. This is why I asked if the destination address > is required. If it > is, then the applications need to make do with less private data. > > > For IPv4, send SRC and DST addressing and 64 bytes of ULP private data. > > For IPv6 we have several options. a. GID=IPv6 address > > Unless an IP packet can be sent to a GID and be processed, I > don't consider a > GID equal to an IPv6 address. I also don't think that we > should require system > administrators to add GIDs to IB ports just because they want > to add an IP > address to a system. > > > b. use a second CM frame to carry ULP private data. > > An application can make do with no private data. They can > transfer whatever > data they need once the connection has been established, like all TCP > applications do. Adding more CM messages to pass the same > data that should be > passed over the user's QP is the wrong approach. > > In fact another alternative is to make all CM private data reserved. > > - Sean > From xma at us.ibm.com Wed Oct 19 10:49:53 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 19 Oct 2005 10:49:53 -0700 Subject: [openib-general] moving IBM eHCA Device Driver to openib.org In-Reply-To: <52ll0pif9h.fsf@cisco.com> Message-ID: > For that matter what is the status of moving dma_addr_t to u64 on ppc64? The patch is already in the -mm tree. It will be in 2.6.15. The eBUS patch will be submitted upstream for review within a couple of days. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From mshefty at ichips.intel.com Wed Oct 19 11:21:20 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 19 Oct 2005 11:21:20 -0700 Subject: [openib-general] [RFC] MAD API changes to fix DMA mapping issues In-Reply-To: <43557D80.602@ichips.intel.com> References: <52ek6imrsq.fsf@cisco.com> <43556FE9.6010009@ichips.intel.com> <521x2imqe9.fsf@cisco.com> <43557D80.602@ichips.intel.com> Message-ID: <43568EA0.3080700@ichips.intel.com> Sean Hefty wrote: >> I'm still a little confused as to where the data buffer actually is. >> Is it pointed to by the struct ib_mad *mad member? If so, it seems a >> little odd to make the pointer have type struct ib_mad *, since struct >> ib_mad is exactly 256 bytes long. > > Yes - it's pointed to by struct ib_mad. This is how ib_mad_send_buf > worked before. I can change the pointer to void*, but then it requires > casting to ib_mad or ib_mad_hdr wherever it is used. I've also > considered changing it to a pointer to a union of the different MAD types. Thinking about this more, making the buffer a void* may actually make it easier than struct ib_mad*, since explicit casting wouldn't be necessary. There are several places in the code where there are casts to ib_rmpp_mad, ib_smp, ib_vendor_mad, etc. I'd like to make supporting very large sends a separate change at this point though. That change could possibly combine the ib_mad_send_buf and ib_mad_recv_wc structures together.
- Sean From rolandd at cisco.com Wed Oct 19 11:34:32 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 19 Oct 2005 11:34:32 -0700 Subject: [openib-general] [RFC] MAD API changes to fix DMA mapping issues In-Reply-To: <43568EA0.3080700@ichips.intel.com> (Sean Hefty's message of "Wed, 19 Oct 2005 11:21:20 -0700") References: <52ek6imrsq.fsf@cisco.com> <43556FE9.6010009@ichips.intel.com> <521x2imqe9.fsf@cisco.com> <43557D80.602@ichips.intel.com> <43568EA0.3080700@ichips.intel.com> Message-ID: <52hdbdicef.fsf@cisco.com> Sean> I'd like to make supporting very large sends a separate Sean> change at this point though. That change could possibly Sean> combine the ib_mad_send_buf and ib_mad_recv_wc structures Sean> together. Fair enough. One approach would be something like struct scsi_cmnd: it has a void *buffer member, which is either a pointer to the data buffer itself, or a pointer to an array of struct sg_list that point to the pieces of the buffer. - R. From nacc at us.ibm.com Wed Oct 19 12:05:44 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Wed, 19 Oct 2005 12:05:44 -0700 Subject: [openib-general] moving IBM eHCA Device Driver to openib.org In-Reply-To: <52ll0pif9h.fsf@cisco.com> References: <52ll0pif9h.fsf@cisco.com> Message-ID: <20051019190544.GP28213@us.ibm.com> On 19.10.2005 [10:32:42 -0700], Roland Dreier wrote: > IBMEHCA> I put out a initial set of files (as discussed) on > IBMEHCA> openib.org svn. The Kconfig option will follow when I've > IBMEHCA> verified that it really compiles as is in svn. > > Great, it's good to have the code in svn at last. > > >From a first look I don't see any ebus files checked in, and I don't > see them in the upstream kernel yet either. I don't follow the linux > ppc64 mailing lists so I don't know the status of merging ebus. > > For that matter what is the status of moving dma_addr_t to u64 on ppc64? I believe the dma_addr_t change already exists in -mm and is pending for 2.6.15. 
Thanks, Nish From liran at mellanox.co.il Wed Oct 19 12:33:45 2005 From: liran at mellanox.co.il (Liran Sorani) Date: Wed, 19 Oct 2005 21:33:45 +0200 Subject: [openib-general] InfiniBand Test Project (IBTP) - Update Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E35AB293@mtlexch01.mtl.com> Hi, We've updated the IBTP tree with Osmtest sources for both the ibal (WinIB) and Gen2 stacks: https://openib.org/svn/trunk/contrib/mellanox/ibtp/ibal/ulp/opensm/user/osmtest https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management/osm/osmtest Osmtest is the main verification tool for OpenSM and includes various SA (good/bad) flows. Each directory contains a short README file with setup and usage information. > Liran Sorani > Mellanox Technologies LTD. > mailto:liran at mellanox.co.il > Phone: +972(4)9097200 Ext: 214 > Israel, Yokneam P.O.B 586 ZIP 20692 From hozer at hozed.org Wed Oct 19 13:09:37 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 19 Oct 2005 15:09:37 -0500 Subject: [openib-general] where is IB_WARN defined? Message-ID: <20051019200937.GJ30127@kalmia.hozed.org> I'm trying to rebuild opensm, and the libibumad configure is failing because IB_WARN is apparently not defined anyplace I can find it. -- -------------------------------------------------------------------------- Troy Benjegerdes 'da hozer' hozer at hozed.org Someone asked me why I work on this free (http://www.fsf.org/philosophy/) software stuff and not get a real job. Charles Shultz had the best answer: "Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life." -- Charles Shultz From troy at scl.ameslab.gov Wed Oct 19 13:16:37 2005 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Wed, 19 Oct 2005 15:16:37 -0500 Subject: [openib-general] EHCA ipoib error..
Message-ID: <20051019201637.GE8748@minbar.scl.ameslab.gov> I get the following errors when trying to bring up ipoib:
10:~# modprobe ib_hcad_mod ehca_nr_ports=1
FATAL: Module ib_hcad_mod not found.
10:~# modprobe hcad_mod ehca_nr_ports=1
[ 1401.993165] eHCA Infiniband Device Driver (Rel.: EHCA2_0028)
[ 1401.994486] xics_enable_irq: irq=36868: ibm_int_on returned fffffffd
[ 1489.471617] PU0000 00060103:ehca_parse_ec EHCA port 1 is available.
10:~# modprobe ib_ipoib
10:~# ifconfig ib0 10.40.4.52 netmask 255.255.0.0
[ 1697.767708] PU0000 000b0544:internal_modify_qp HCAD_ERROR ehca_qp=c0000001cd7b2c80 qp_num=1 req_mask=21 opt_mask=88040 submitted_mask=3 qp_type=1
[ 1697.767734] ib_mad: mad_error_handler - ib_modify_qp to RTS : -22
From halr at voltaire.com Wed Oct 19 13:24:57 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Oct 2005 16:24:57 -0400 Subject: [openib-general] InfiniBand Test Project (IBTP) - Update Message-ID: <1129753494.16900.35548.camel@hal.voltaire.com> On Wed, 2005-10-19 at 15:33, Liran Sorani wrote: > Hi , > We've updated IBTP tree with Osmtest sources both on ibal (WinIB) and > Gen2 stacks : > https://openib.org/svn/trunk/contrib/mellanox/ibtp/ibal/ulp/opensm/user/osmtest > > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management/osm/osmtest > > Osmtest is the main verification tool for OpenSM , include various SA > (Good / Bad) flows. > Attached to each directory a short README file for setup and usage > information. How is the Linux one different from osmtest in the trunk ? Also, (nit): I think https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management/osm/osmtest/Makefile.in is a generated file and should be removed. -- Hal > > Liran Sorani > > Mellanox Technologies LTD.
> > mailto:liran at mellanox.co.il > > Phone: +972(4)9097200 Ext: 214 > > Israel, Yokneam P.O.B 586 ZIP 20692 > > > > > > > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Wed Oct 19 13:30:06 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Oct 2005 16:30:06 -0400 Subject: [openib-general] where is IB_WARN defined? In-Reply-To: <20051019200937.GJ30127@kalmia.hozed.org> References: <20051019200937.GJ30127@kalmia.hozed.org> Message-ID: <1129753800.16900.35586.camel@hal.voltaire.com> On Wed, 2005-10-19 at 16:09, Troy Benjegerdes wrote: > I'm trying to rebuild opensm, and the libibumad configure is failing because > IB_WARN is apparently not defined anyplace I can find it. It should be IBWARN. Where is IB_WARN or do you mean IBWARN ? It should be found in libibcommon/include/infiniband/common.h What svn version are you using ? Have you updated all your management libraries ? -- Hal From halr at voltaire.com Wed Oct 19 13:37:14 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Oct 2005 16:37:14 -0400 Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a In-Reply-To: References: Message-ID: <1129754232.16900.35630.camel@hal.voltaire.com> On Tue, 2005-10-18 at 18:40, Kevin Reilly wrote: > > > On Mon, 2005-10-18 at 10:07, Kevin Reilly wrote: > >On Mon, 2005-10-17 at 10:07, Hal Rosenstock wrote: > >> > Should this code work, because it seems that out_dev is a kernel > >> > address (platform: PPC64) which cannot accessed by a userspace > >> > program. Via GDB I can see that rt has the following content: > >> > > >> > The address is rt->out_dev = 0xc0000000cffaa800 which looks like a > >> > kernel address. 
> >> > >> Yes, this is a bug which has been previously pointed out on the list and > >> not fixed. > > > >The fix for this involves an ABI change: it should return the GID of the > >outgoing IB device. > > > >-- Hal > > Should we (IBM) work on submitting a patch for this? That's up to you. > Returning the GID or the device_name would be good fix. Yes, either of these could be made to work. > I guess our reluctance is that we've heard the this address translation > library function might be depreciated for another interface? Yes, that has been my reluctance as well. It appears AT is likely to be superceeded by CMA. > Having neither leaves us without a method to translate healthy > "heartbeat-able" IP interfaces to HCAs where we can run things over verbs. As of now, there is no user space CMA although work is likely to commence on this shortly. You should make sure the APIs suit your needs. -- Hal > Kevin J. Reilly > STSM, HPC Architecture > -Federation/HPS Chief Engineer > -HPC interconnect architect > (office) 845-433-7976 (tieline) 8-293-7976 > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sinate at yahoo.com Wed Oct 19 14:24:57 2005 From: sinate at yahoo.com (Steven Wooding) Date: Wed, 19 Oct 2005 22:24:57 +0100 (BST) Subject: [openib-general] Support for UC connections using the CM API? In-Reply-To: <43567793.7000505@ichips.intel.com> Message-ID: <20051019212457.5220.qmail@web32502.mail.mud.yahoo.com> Hi Sean, I've modified cmpost to try to UC and get similar results as my app. 
The changes I made to cmpost.c were to change RC to UC (two places) and
to remove the following req and rep parameters, as I believe these are not
required for UC:

req.retry_count = 5;
rep.rnr_retry_count = req->rnr_retry_count;

I also put in some print statements to observe the value of qp_type.
Here are the results:

Client-side output:

starting client
req.qp_type = 3
Received REJ
Error sending REQ or REP
receiving data transfers
initiating data transfers
data transfers complete
test complete

(note to anybody else reading this thread; the last four lines do not
mean the data got transferred successfully, as no error checking is
done on the connect_events() function)

Server-side output:

starting server
event->param.req_rcvd.qp_type = 2
failed to modify QP to RTR: 22
failing connection request
initiating data transfers
receiving data transfers
data transfers complete
test complete

So basically the server-side thinks the QP being requested is an RC,
not the required UC.

Hope this helps Sean.

Cheers,

Steve.

--- Sean Hefty wrote:
> I'll look into this more. If you have time, you
> could change cmpost and ucmpost
> to use UC and run those. This would help narrow
> down if the issue is in the
> kernel, userspace, or the application. (I'm testing
> some MAD changes, and will
> try this myself once I'm done testing.)
>

___________________________________________________________
Yahoo! Messenger - NEW crystal clear PC to PC calling worldwide with voicemail http://uk.messenger.yahoo.com

From rolandd at cisco.com Wed Oct 19 14:31:08 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 19 Oct 2005 14:31:08 -0700
Subject: [openib-general] Support for UC connections using the CM API?
In-Reply-To: <20051019212457.5220.qmail@web32502.mail.mud.yahoo.com> (Steven Wooding's message of "Wed, 19 Oct 2005 22:24:57 +0100 (BST)")
References: <20051019212457.5220.qmail@web32502.mail.mud.yahoo.com>
Message-ID: <528xwpi483.fsf@cisco.com>

Does the patch below help?
It looks like there is a missing break
after the UC case in the kernel CM, so the code falls through and
overwrites the field with the value for RC.

 - R.

diff --git a/drivers/infiniband/core/cm_msgs.h b/drivers/infiniband/core/cm_msgs.h
--- a/drivers/infiniband/core/cm_msgs.h
+++ b/drivers/infiniband/core/cm_msgs.h
@@ -186,6 +186,7 @@ static inline void cm_req_set_qp_type(st
                 req_msg->offset40 = cpu_to_be32((be32_to_cpu(
                                                   req_msg->offset40) &
                                                    0xFFFFFFF9) | 0x2);
+                break;
         default:
                 req_msg->offset40 = cpu_to_be32(be32_to_cpu(
                                                  req_msg->offset40) &

From mshefty at ichips.intel.com Wed Oct 19 14:34:42 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 19 Oct 2005 14:34:42 -0700
Subject: [openib-general] Support for UC connections using the CM API?
In-Reply-To: <528xwpi483.fsf@cisco.com>
References: <20051019212457.5220.qmail@web32502.mail.mud.yahoo.com> <528xwpi483.fsf@cisco.com>
Message-ID: <4356BBF2.6070905@ichips.intel.com>

Roland Dreier wrote:
> Does the patch below help? It looks like there is a missing break
> after the UC case in the kernel CM, so the code falls through and
> overwrites the field with the value for RC.

Good catch. I overlooked this about 10 times now... The break should be there.

- Sean

From jlentini at netapp.com Wed Oct 19 14:43:52 2005
From: jlentini at netapp.com (James Lentini)
Date: Wed, 19 Oct 2005 17:43:52 -0400 (EDT)
Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol
In-Reply-To: <52psq1igwr.fsf@cisco.com>
References: <005301c5d4c7$59290c90$0300a8c0@YOURA06808D9DE> <52psq1igwr.fsf@cisco.com>
Message-ID:

On Wed, 19 Oct 2005, Roland Dreier wrote:

roland> Richard> MessageOracle currently depends on 64 bytes of private
roland> Richard> data for connect and accept.
roland>
roland> How do you work with RDS, which is not connection oriented and hence
roland> does not even have connect or accept?

The D is somewhat misleading. It refers to the functionality provider
to the consumer application.
Internally, RDS uses reliable connections, see rds_cm_init_active_connection() at: https://openib.org/svn/trunk/contrib/silverstorm/rds/rds_cm.c or the SilverStorm presentation at the last OpenIB workshop: http://openib.org/docs/oib_wkshp_082205/Reliable_Datagram_Sockets.ppt From sinate at yahoo.com Wed Oct 19 14:33:51 2005 From: sinate at yahoo.com (Steven Wooding) Date: Wed, 19 Oct 2005 22:33:51 +0100 (BST) Subject: [openib-general] Strange output when calling ibv_poll_cq function In-Reply-To: <5264rupzla.fsf@cisco.com> Message-ID: <20051019213352.96385.qmail@web32507.mail.mud.yahoo.com> Sorry Roland, My fault. I had the wrong access flags set when I registered the memory region. Thanks for reply though. Cheers, Steve. --- Roland Dreier wrote: > However, reading the completion contents, I see that > it is a receive > completion with status "local protection error." So > something is > wrong with the receive request you posted -- the > address is out of > bounds, you used the wrong L_Key, or something like > that. ___________________________________________________________ Yahoo! Messenger - NEW crystal clear PC to PC calling worldwide with voicemail http://uk.messenger.yahoo.com From rolandd at cisco.com Wed Oct 19 14:56:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Wed, 19 Oct 2005 14:56:39 -0700 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol In-Reply-To: (James Lentini's message of "Wed, 19 Oct 2005 17:43:52 -0400 (EDT)") References: <005301c5d4c7$59290c90$0300a8c0@YOURA06808D9DE> <52psq1igwr.fsf@cisco.com> Message-ID: <52zmp5goh4.fsf@cisco.com> James> The D is somewhat misleading. It refers to the James> functionality provider to the consumer application. Right, that's what we're talking about. The RDS implementation only needs a few bytes of private data on top of the IP address info. So the RDS implementation itself is clearly OK with any of the proposals being discussed here. 
However, Rick mentioned that Oracle needs 64 bytes of private data in both directions for connections. My question was how Oracle works on top of RDS, which does not provide any private data to consumers. - R. From sinate at yahoo.com Wed Oct 19 14:44:41 2005 From: sinate at yahoo.com (Steven Wooding) Date: Wed, 19 Oct 2005 22:44:41 +0100 (BST) Subject: [openib-general] Support for UC connections using the CM API? In-Reply-To: <528xwpi483.fsf@cisco.com> Message-ID: <20051019214441.99009.qmail@web32507.mail.mud.yahoo.com> Roland, That looks like that's the problem. I'll try your patch out tomorrow. I did look at that code, but did not spot the missing break. Thanks a lot. Cheers, Steve. ___________________________________________________________ Yahoo! Messenger - NEW crystal clear PC to PC calling worldwide with voicemail http://uk.messenger.yahoo.com From richard.frank at oracle.com Wed Oct 19 16:12:33 2005 From: richard.frank at oracle.com (Richard Frank) Date: Wed, 19 Oct 2005 19:12:33 -0400 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol References: <005301c5d4c7$59290c90$0300a8c0@YOURA06808D9DE><52psq1igwr.fsf@cisco.com> <52zmp5goh4.fsf@cisco.com> Message-ID: <007201c5d502$972bd410$0300a8c0@YOURA06808D9DE> Oracle's uDAPL ipc implementation uses 64 bytes of private connection data - currently - some of this is the result of having 64 bytes to use at the start - so we designed around this. We can probably reduce this somewhat. And of course if we want to rewrite our connection handling for uDAPL (add our own wire protocol) we can probably skip using the uDAPL connection data all together. For RDS we use our own connection data sent via datagrams which has always been part of the Oracle UDP ipc implementation. 
----- Original Message ----- From: "Roland Dreier" To: "James Lentini" Cc: "Richard Frank" ; ; ; "Davis, Arlin R" Sent: Wednesday, October 19, 2005 5:56 PM Subject: Re: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol > James> The D is somewhat misleading. It refers to the > James> functionality provider to the consumer application. > > Right, that's what we're talking about. The RDS implementation only > needs a few bytes of private data on top of the IP address info. So > the RDS implementation itself is clearly OK with any of the proposals > being discussed here. > > However, Rick mentioned that Oracle needs 64 bytes of private data in > both directions for connections. My question was how Oracle works on > top of RDS, which does not provide any private data to consumers. > > - R. > From richard.frank at oracle.com Wed Oct 19 16:19:00 2005 From: richard.frank at oracle.com (Richard Frank) Date: Wed, 19 Oct 2005 19:19:00 -0400 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Message-ID: <008601c5d503$7dc1d3c0$0300a8c0@YOURA06808D9DE> It's probably fine to go ahead and reduce the IPC private data - I think we (Oracle) can work around this. ----- Original Message ----- From: "Richard Frank" To: "James Lentini" ; "Roland Dreier" Cc: ; ; "Davis, Arlin R" Sent: Wednesday, October 19, 2005 7:12 PM Subject: Re: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol > Oracle's uDAPL ipc implementation uses 64 bytes of private connection > data - currently - some of this is the result of having 64 bytes to use at > the start - so we designed around this. We can probably reduce this > somewhat. And of course if we want to rewrite our connection handling for > uDAPL (add our own wire protocol) we can probably skip using the uDAPL > connection data all together. > > For RDS we use our own connection data sent via datagrams which has always > been part of the Oracle UDP ipc implementation. 
> > ----- Original Message ----- > From: "Roland Dreier" > To: "James Lentini" > Cc: "Richard Frank" ; ; > ; "Davis, Arlin R" > Sent: Wednesday, October 19, 2005 5:56 PM > Subject: Re: [dat-discussions] RE: [openib-general] Re: iWARP > emulationprotocol > > >> James> The D is somewhat misleading. It refers to the >> James> functionality provider to the consumer application. >> >> Right, that's what we're talking about. The RDS implementation only >> needs a few bytes of private data on top of the IP address info. So >> the RDS implementation itself is clearly OK with any of the proposals >> being discussed here. >> >> However, Rick mentioned that Oracle needs 64 bytes of private data in >> both directions for connections. My question was how Oracle works on >> top of RDS, which does not provide any private data to consumers. >> >> - R. >> > From jcarr at linuxmachines.com Wed Oct 19 16:54:15 2005 From: jcarr at linuxmachines.com (Jeff Carr) Date: Wed, 19 Oct 2005 16:54:15 -0700 Subject: [openib-general] How to debug QP INIT->RTR -22 error In-Reply-To: <43527119.2050103@keysounds.co.uk> References: <43527119.2050103@keysounds.co.uk> Message-ID: <4356DCA7.3010705@linuxmachines.com> On 10/16/05 08:26, Steve Wooding wrote: > Hi there, > > I'm trying to make a QP connection using the CM, but the active side > cannot get to the RTR state. ibv_modify_qp returns errorno -22, invalid > argument. > > What would the best way to find out exactly what the error is and help > me fix my app (just to say, it is only my app that's broken, nothing > else)? Would turning kernel debugging on be helpful at all? Also useful is to make sure you have non-zero lids on both nodes: cat /sys/class/infiniband/mthca0/ports/1/lid cat /sys/class/infiniband/mthca0/ports/1/sm_lid Enjoy, Jeff From hozer at hozed.org Wed Oct 19 18:02:43 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 19 Oct 2005 20:02:43 -0500 Subject: [openib-general] where is IB_WARN defined? 
In-Reply-To: <1129753800.16900.35586.camel@hal.voltaire.com> References: <20051019200937.GJ30127@kalmia.hozed.org> <1129753800.16900.35586.camel@hal.voltaire.com> Message-ID: <20051020010243.GK30127@kalmia.hozed.org> On Wed, Oct 19, 2005 at 04:30:06PM -0400, Hal Rosenstock wrote: > On Wed, 2005-10-19 at 16:09, Troy Benjegerdes wrote: > > I'm trying to rebuild opensm, and the libibumad configure is failing because > > IB_WARN is apparently not defined anyplace I can find it. > > It should be IBWARN. Where is IB_WARN or do you mean IBWARN ? > > It should be found in libibcommon/include/infiniband/common.h > > What svn version are you using ? Have you updated all your management > libraries ? Hrrm.. it looks like for some reason the top level makefile didn't rebuild libibcommon. Also, is there a clean way we could use a common autocong/automake setup for the diags? The build spends more time running configure than compiling ;) From halr at voltaire.com Wed Oct 19 18:07:21 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Oct 2005 21:07:21 -0400 Subject: [openib-general] where is IB_WARN defined? In-Reply-To: <20051020010243.GK30127@kalmia.hozed.org> References: <20051019200937.GJ30127@kalmia.hozed.org> <1129753800.16900.35586.camel@hal.voltaire.com> <20051020010243.GK30127@kalmia.hozed.org> Message-ID: <1129770440.16900.36992.camel@hal.voltaire.com> On Wed, 2005-10-19 at 21:02, Troy Benjegerdes wrote: > On Wed, Oct 19, 2005 at 04:30:06PM -0400, Hal Rosenstock wrote: > > On Wed, 2005-10-19 at 16:09, Troy Benjegerdes wrote: > > > I'm trying to rebuild opensm, and the libibumad configure is failing because > > > IB_WARN is apparently not defined anyplace I can find it. > > > > It should be IBWARN. Where is IB_WARN or do you mean IBWARN ? > > > > It should be found in libibcommon/include/infiniband/common.h > > > > What svn version are you using ? Have you updated all your management > > libraries ? > > Hrrm.. 
it looks like for some reason the top level makefile didn't
> rebuild libibcommon.

Not sure why that would be.

In my top level generated Makefile,
LIBS:=libibcommon libibumad libibmad

        @for i in $(LIBS); do\
                if [ -x $$i/autogen.sh ]; then\
                        if !(cd $$i; ./autogen.sh && ./configure && make && make install); then exit 1; fi\
                fi\
        done

What does yours look like ?

> Also, is there a clean way we could use a common autocong/automake setup
> for the diags? The build spends more time running configure than
> compiling ;)

I've had it on my list for a while to combine all the diags
subdirectories into 1 and just have one diags build to simplify this but
I haven't gotten to this yet.

-- Hal

From hozer at hozed.org Wed Oct 19 18:23:05 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Wed, 19 Oct 2005 20:23:05 -0500
Subject: [openib-general] [RFC] OpenSM Interactive Console
In-Reply-To: <1129662629.16900.23196.camel@hal.voltaire.com>
References: <1129662629.16900.23196.camel@hal.voltaire.com>
Message-ID: <20051020012305.GL30127@kalmia.hozed.org>

On Tue, Oct 18, 2005 at 03:10:31PM -0400, Hal Rosenstock wrote:
> Currently, OpenSM does not support an interactive console. There has
> been a desire to introduce the ability to change certain parameters (as
> well as display things) once OpenSM has started. This patch introduces
> the first most basic commands: help and loglevel. I am investgating
> adding smpriority to this. The console is invoked by specifying -console
> as an option on the opensm command line.
>
> If you have a request for a command you would like in the console, I
> would like to compile a list of these.
>
> Comments ?
As well as a console, I'd like an API for some way for external programs
(say a cluster queue manager) to be able to query the SM (or the sm + some
helper library) for the following things:

* Topology
* guid/lid/IPoIB address/switch port mappings
* link state

Future neat things to do:

* An interface to dynamically partition the fabric
* Register for notifications for certain events (excessive traffic
  queueing, or error counts)

From hozer at hozed.org Wed Oct 19 19:25:20 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Wed, 19 Oct 2005 21:25:20 -0500
Subject: [openib-general] EHCA-0028 userspace build fails with openib svn 3774
Message-ID: <20051020022519.GM30127@kalmia.hozed.org>

make[1]: Entering directory `/usr/src/openib-src/userspace/libehca'
if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I.-O3 -g -Wall -D_GNU_SOURCE -DP_SERIES -I../libibverbs/include -Isrc -g -O2 -MTsrc_libehca_la-ehca_umain.lo -MD -MP -MF ".deps/src_libehca_la-ehca_umain.Tpo" -c -o src_libehca_la-ehca_umain.lo `test -f 'src/ehca_umain.c' || echo './'`src/ehca_umain.c; \
then mv -f ".deps/src_libehca_la-ehca_umain.Tpo" ".deps/src_libehca_la-ehca_umain.Plo"; else rm -f ".deps/src_libehca_la-ehca_umain.Tpo"; exit 1; fi
mkdir .libs
gcc -DHAVE_CONFIG_H -I. -I. -I.
-O3 -g -Wall -D_GNU_SOURCE -DP_SERIES -I../libibverbs/include -Isrc -g -O2 -MT src_libehca_la-ehca_umain.lo -MD -MP -MF .deps/src_libehca_la-ehca_umain.Tpo -c src/ehca_umain.c -fPIC -DPIC -o .libs/src_libehca_la-ehca_umain.o src/ehca_umain.c: In function 'ehcau_query_device': src/ehca_umain.c:66: warning: passing argument 3 of 'ibv_cmd_query_device' fromincompatible pointer type src/ehca_umain.c:66: warning: passing argument 4 of 'ibv_cmd_query_device' makes pointer from integer without a cast src/ehca_umain.c:66: error: too few arguments to function 'ibv_cmd_query_device' make[1]: *** [src_libehca_la-ehca_umain.lo] Error 1 From hozer at hozed.org Wed Oct 19 20:17:12 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Wed, 19 Oct 2005 22:17:12 -0500 Subject: [openib-general] where is IB_WARN defined? In-Reply-To: <1129770440.16900.36992.camel@hal.voltaire.com> References: <20051019200937.GJ30127@kalmia.hozed.org> <1129753800.16900.35586.camel@hal.voltaire.com> <20051020010243.GK30127@kalmia.hozed.org> <1129770440.16900.36992.camel@hal.voltaire.com> Message-ID: <20051020031712.GN30127@kalmia.hozed.org> > > Hrrm.. it looks like for some reason the top level makefile didn't > > rebuild libibcommon. > > Not sure why that would be. > > In my top level generated Makefile, > LIBS:=libibcommon libibumad libibmad > > @for i in $(LIBS); do\ > if [ -x $$i/autogen.sh ]; then\ > if !(cd $$i; ./autogen.sh && ./configure && make > && make install); then exit 1; fi\ > fi\ > done > > What does yours look like ? It looks like the same thing, so I'm going to assume for the moment this was some error that scrolled by that didn't occur when I did it manually. > > > Also, is there a clean way we could use a common autocong/automake setup > > for the diags? 
The build spends more time running configure than > > compiling ;) > > I've had it on my list for a while to combine all the diags > subdirectories into 1 and just have one diags build to simplify this but > I haven't gotten to this yet. > > -- Hal From schihei at de.ibm.com Wed Oct 19 23:46:12 2005 From: schihei at de.ibm.com (Heiko J Schick) Date: Thu, 20 Oct 2005 08:46:12 +0200 Subject: [openib-general] EHCA-0028 userspace build fails with openib svn 3774 In-Reply-To: <20051020022519.GM30127@kalmia.hozed.org> References: <20051020022519.GM30127@kalmia.hozed.org> Message-ID: <43573D34.5010902@de.ibm.com> Hello Troy, this problem should be solved in EHCA2_0033. The EHCA2_0028 package was only tested with OpenIB trunk 3615. The problem is EHCA_0028 doesn't included the raw_fw_ver pointer for ibv_cmd_query_device. Please use EHCA2_0033 available via: 1: https://openib.org/svn/gen2/trunk/src/userspace/libehca/ 2: http://prdownloads.sourceforge.net/ibmehcad/ehca2_EHCA2_0033.tgz?download We've included yesterday EHCA2_0033 into the OpenIB tree. Troy Benjegerdes wrote: > make[1]: Entering directory `/usr/src/openib-src/userspace/libehca' > if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. > -I.-O3 -g -Wall -D_GNU_SOURCE -DP_SERIES -I../libibverbs/include -Isrc > -g -O2 -MTsrc_libehca_la-ehca_umain.lo -MD -MP -MF > ".deps/src_libehca_la-ehca_umain.Tpo" -c -o src_libehca_la-ehca_umain.lo > `test -f 'src/ehca_umain.c' || echo './'`src/ehca_umain.c; \ > then mv -f ".deps/src_libehca_la-ehca_umain.Tpo" > ".deps/src_libehca_la-ehca_umain.Plo"; else rm -f > ".deps/src_libehca_la-ehca_umain.Tpo"; exit 1; fi > mkdir .libs > gcc -DHAVE_CONFIG_H -I. -I. -I. 
-O3 -g -Wall -D_GNU_SOURCE -DP_SERIES > -I../libibverbs/include -Isrc -g -O2 -MT src_libehca_la-ehca_umain.lo > -MD -MP -MF .deps/src_libehca_la-ehca_umain.Tpo -c src/ehca_umain.c > -fPIC -DPIC -o .libs/src_libehca_la-ehca_umain.o > src/ehca_umain.c: In function 'ehcau_query_device': > src/ehca_umain.c:66: warning: passing argument 3 of > 'ibv_cmd_query_device' fromincompatible pointer type > src/ehca_umain.c:66: warning: passing argument 4 of > 'ibv_cmd_query_device' makes pointer from integer without a cast > src/ehca_umain.c:66: error: too few arguments to function > 'ibv_cmd_query_device' > make[1]: *** [src_libehca_la-ehca_umain.lo] Error 1 > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -- Mit freundlichen Gruessen / Kind Regards Heiko Joerg Schick ---------------------------------------------------------------------- Heiko J Schick I/O Firmware Development II Linux InfiniBand Device Drivers IBM Deutschland Entwicklung GmbH external: 49-07031-16-0 x4219 Schoenaicher Str. 220 t/l: 120-4129 71032 Boeblingen email: schickhj at de.ibm.com ---------------------------------------------------------------------- From liran at mellanox.co.il Thu Oct 20 00:49:11 2005 From: liran at mellanox.co.il (Liran Sorani) Date: Thu, 20 Oct 2005 09:49:11 +0200 Subject: [openib-general] InfiniBand Test Project (IBTP) - Update Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E35AB2E7@mtlexch01.mtl.com> Hi , Hal . The Linux & WinIB are the same , except for several cosmetic changes . Regarding Makefile.in , it's an outcome of autogen , I'll remove it . 
-----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Wednesday, October 19, 2005 10:25 PM To: Liran Sorani Cc: openib-general at openib.org Subject: Re: [openib-general] InfiniBand Test Project (IBTP) - Update On Wed, 2005-10-19 at 15:33, Liran Sorani wrote: > Hi , > We've updated IBTP tree with Osmtest sources both on ibal (WinIB) and > Gen2 stacks : > https://openib.org/svn/trunk/contrib/mellanox/ibtp/ibal/ulp/opensm/user/osmt est > > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management /osm/osmtest > > Osmtest is the main verification tool for OpenSM , include various SA > (Good / Bad) flows. > Attached to each directory a short README file for setup and usage > information. How is the Linux one different from osmtest in the trunk ? Also, (nit): I think https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management /osm/osmtest/Makefile.in is a generated file and should be removed. -- Hal > > Liran Sorani > > Mellanox Technologies LTD. > > mailto:liran at mellanox.co.il > > Phone: +972(4)9097200 Ext: 214 > > Israel, Yokneam P.O.B 586 ZIP 20692 > > > > > > > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL:
From jackm at mellanox.co.il Thu Oct 20 04:04:44 2005 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Thu, 20 Oct 2005 13:04:44 +0200 Subject: [openib-general] [PATCH] fix page_size_cap value in ib_query_device for mellanox provider Message-ID: <20051020110443.GA7198@mellanox.co.il> NOTE!! This patch also affects interpretation of page_size_cap field in the ib_device_attr struct in file ib_verbs.h (i.e., for all providers). The page_size_cap field is interpreted here as a bitmap of power-of-2 page sizes supported by the device. Signed-off-by: Jack Morgenstein Index: linux-kernel/drivers/infiniband/include/rdma/ib_verbs.h =================================================================== --- linux-kernel/drivers/infiniband/include/rdma/ib_verbs.h (revision 3827) +++ linux-kernel/drivers/infiniband/include/rdma/ib_verbs.h (working copy) @@ -91,6 +91,7 @@ __be64 node_guid; __be64 sys_image_guid; u64 max_mr_size; + /* page_size_cap is a bitmap of supported power-of-2 page sizes. */ u64 page_size_cap; u32 vendor_id; u32 vendor_part_id; Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- linux-kernel/drivers/infiniband/hw/mthca/mthca_dev.h (revision 3827) +++ linux-kernel/drivers/infiniband/hw/mthca/mthca_dev.h (working copy) @@ -154,6 +154,8 @@ int reserved_mcgs; int num_pds; int reserved_pds; + /* page_size_cap is a bitmap of supported power-of-2 page sizes.
*/ + u32 page_size_cap; u32 flags; u8 port_width_cap; }; Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_main.c =================================================================== --- linux-kernel/drivers/infiniband/hw/mthca/mthca_main.c (revision 3827) +++ linux-kernel/drivers/infiniband/hw/mthca/mthca_main.c (working copy) @@ -168,6 +168,7 @@ mdev->limits.max_srq_wqes = dev_lim->max_srq_sz; mdev->limits.reserved_srqs = dev_lim->reserved_srqs; mdev->limits.reserved_eecs = dev_lim->reserved_eecs; + mdev->limits.page_size_cap = ~(u32)(dev_lim->min_page_sz - 1); /* * Subtract 1 from the limit because we need to allocate a * spare CQE so the HCA HW can tell the difference between an Index: linux-kernel/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- linux-kernel/drivers/infiniband/hw/mthca/mthca_provider.c (revision 3827) +++ linux-kernel/drivers/infiniband/hw/mthca/mthca_provider.c (working copy) @@ -111,6 +111,7 @@ props->max_mcast_qp_attach = MTHCA_QP_PER_MGM; props->max_total_mcast_qp_attach = props->max_mcast_qp_attach * props->max_mcast_grp; + props->page_size_cap = (u64)mdev->limits.page_size_cap; err = 0; out: From halr at voltaire.com Thu Oct 20 03:54:12 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Oct 2005 06:54:12 -0400 Subject: [openib-general] [RFC] OpenSM Interactive Console In-Reply-To: <20051020012305.GL30127@kalmia.hozed.org> References: <1129662629.16900.23196.camel@hal.voltaire.com> <20051020012305.GL30127@kalmia.hozed.org> Message-ID: <1129805651.16900.39956.camel@hal.voltaire.com> On Wed, 2005-10-19 at 21:23, Troy Benjegerdes wrote: > On Tue, Oct 18, 2005 at 03:10:31PM -0400, Hal Rosenstock wrote: > > Currently, OpenSM does not support an interactive console. There has > > been a desire to introduce the ability to change certain parameters (as > > well as display things) once OpenSM has started. 
This patch introduces > > the first most basic commands: help and loglevel. I am investgating > > adding smpriority to this. The console is invoked by specifying -console > > as an option on the opensm command line. > > > > If you have a request for a command you would like in the console, I > > would like to compile a list of these. > > > > Comments ? > > As well as a console, I'd like an API for some way for external programs > (say a cluster queue manager) to be able to query the SM (or the sm + some > helper library) for the following things: > > * Topology This can be done via SA queries currently. > * guid/lid/IPoIB address/switch port mappings The SM does not know (see) IPoIB addresses. The only thing it sees is the part of the subnet address. The rest can be done via SA queries currently. > * link state This can be done via SA query currently. This argues for a higher layer API to make these queries easy. > Future neat things to do: > > * An interface to dynamically partition the fabric Is this referring to IB partitioning ? > * Register for notifications for certain events (excessive traffic > queueing, or error counts) Not sure what you mean by excessive traffic queuing. It is the event set which is of interest to me. Are there others ? There are a set of events which can be subscribed to currently. The ones along these lines are local link integrity threshold reached on a port, excessive buffer overrun threshold reached on a port, flow control and update watchdog timer expired on a switch port. If you are referring to the PortCounters, these would need to be polled (at some periodicity) and then an event created as there is no event for this defined in IBA. Higher layer APIs could help with this area too. Thanks for the input. 
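
As a rough illustration of the polling idea above (read PortCounters periodically, then create an event locally), here is a minimal C sketch. It is only an assumption about how such a poller might decide when to raise an event: counter_delta(), threshold_exceeded(), and the threshold value are hypothetical names, not OpenSM or IBA definitions, and the code that actually fetches the counters is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Errors accumulated between two polls of the same counter.
 * Assumption: the counter only grows unless it was reset between
 * polls, in which case the current value is taken as the delta.
 */
static uint32_t counter_delta(uint32_t prev, uint32_t cur)
{
	return cur >= prev ? cur - prev : cur;
}

/*
 * True when the poller should create its own "threshold exceeded"
 * event for this polling interval. The threshold is a hypothetical,
 * locally configured value.
 */
static bool threshold_exceeded(uint32_t prev, uint32_t cur,
			       uint32_t threshold)
{
	return counter_delta(prev, cur) > threshold;
}
```

A poller along these lines would call threshold_exceeded() once per port and counter at each interval, keeping the previous sample for the next pass.
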
-- Hal From halr at voltaire.com Thu Oct 20 04:00:50 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Oct 2005 07:00:50 -0400 Subject: [openib-general] InfiniBand Test Project (IBTP) - Update In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E35AB2E7@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E35AB2E7@mtlexch01.mtl.com> Message-ID: <1129805900.16900.39986.camel@hal.voltaire.com> On Thu, 2005-10-20 at 03:49, Liran Sorani wrote: > Hi , Hal . > The Linux & WinIB are the same , except for several cosmetic changes . I was referring to the (differences in the) Linux one in ibtp and the Linux one under gen2/trunk. > Regarding Makefile.in , it's an outcome of autogen , I'll remove it . Thanks. -- Hal > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, October 19, 2005 10:25 PM > To: Liran Sorani > Cc: openib-general at openib.org > Subject: Re: [openib-general] InfiniBand Test Project (IBTP) - Update > > > On Wed, 2005-10-19 at 15:33, Liran Sorani wrote: > > Hi , > > We've updated IBTP tree with Osmtest sources both on ibal (WinIB) > and > > Gen2 stacks : > > > https://openib.org/svn/trunk/contrib/mellanox/ibtp/ibal/ulp/opensm/user/osmtest > > > > > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management/osm/osmtest > > > > Osmtest is the main verification tool for OpenSM , include various > SA > > (Good / Bad) flows. > > Attached to each directory a short README file for setup and usage > > information. > > How is the Linux one different from osmtest in the trunk ? > > Also, (nit): > I think > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management/osm/osmtest/Makefile.in > is a generated file and should be removed. > > -- Hal > > > > Liran Sorani > > > Mellanox Technologies LTD. 
> > > mailto:liran at mellanox.co.il > > > Phone: +972(4)9097200 Ext: 214 > > > Israel, Yokneam P.O.B 586 ZIP 20692 > > > > > > > > > > > > > > > ______________________________________________________________________ > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From Arkady.Kanevsky at netapp.com Thu Oct 20 06:25:37 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 20 Oct 2005 09:25:37 -0400 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Message-ID: OK. I will update the proposal for IBTA based on this feedback and all other feedback posted. I will still separate private data usage proposal and port mapping one. If your Apps depends on 64 bytes of private data, please, raise your voice now. ARkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Richard Frank [mailto:richard.frank at oracle.com] > Sent: Wednesday, October 19, 2005 7:19 PM > To: Richard Frank; Lentini, James; Roland Dreier > Cc: swg at infinibandta.org; openib-general at openib.org; Davis, Arlin R > Subject: Re: [dat-discussions] RE: [openib-general] Re: iWARP > emulationprotocol > > > It's probably fine to go ahead and reduce the IPC private > data - I think we > (Oracle) can work around this. 
> > > ----- Original Message ----- > From: "Richard Frank" > To: "James Lentini" ; "Roland Dreier" > > Cc: ; ; > "Davis, Arlin R" > > Sent: Wednesday, October 19, 2005 7:12 PM > Subject: Re: [dat-discussions] RE: [openib-general] Re: iWARP > emulationprotocol > > > > Oracle's uDAPL ipc implementation uses 64 bytes of private > connection > > data - currently - some of this is the result of having 64 > bytes to use at > > the start - so we designed around this. We can probably reduce this > > somewhat. And of course if we want to rewrite our > connection handling for > > uDAPL (add our own wire protocol) we can probably skip > using the uDAPL > > connection data all together. > > > > For RDS we use our own connection data sent via datagrams which has > > always > > been part of the Oracle UDP ipc implementation. > > > > ----- Original Message ----- > > From: "Roland Dreier" > > To: "James Lentini" > > Cc: "Richard Frank" ; > ; > > ; "Davis, Arlin R" > > > Sent: Wednesday, October 19, 2005 5:56 PM > > Subject: Re: [dat-discussions] RE: [openib-general] Re: iWARP > > emulationprotocol > > > > > >> James> The D is somewhat misleading. It refers to the > >> James> functionality provider to the consumer application. > >> > >> Right, that's what we're talking about. The RDS > implementation only > >> needs a few bytes of private data on top of the IP address > info. So > >> the RDS implementation itself is clearly OK with any of > the proposals > >> being discussed here. > >> > >> However, Rick mentioned that Oracle needs 64 bytes of > private data in > >> both directions for connections. My question was how > Oracle works on > >> top of RDS, which does not provide any private data to consumers. > >> > >> - R. 
> >> > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From hozer at hozed.org Thu Oct 20 06:53:44 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 20 Oct 2005 08:53:44 -0500 Subject: [openib-general] [RFC] OpenSM Interactive Console In-Reply-To: <1129805651.16900.39956.camel@hal.voltaire.com> References: <1129662629.16900.23196.camel@hal.voltaire.com> <20051020012305.GL30127@kalmia.hozed.org> <1129805651.16900.39956.camel@hal.voltaire.com> Message-ID: <20051020135344.GP30127@kalmia.hozed.org> > > * Topology > > This can be done via SA queries currently. > > > * guid/lid/IPoIB address/switch port mappings > > The SM does not know (see) IPoIB addresses. The only thing it sees is > the part of the subnet address. > > The rest can be done via SA queries currently. > > > * link state > > This can be done via SA query currently. > > This argues for a higher layer API to make these queries easy. > > > Future neat things to do: > > > > * An interface to dynamically partition the fabric > > Is this referring to IB partitioning ? I think so, but IB partitioning may not actually map to what I'm interested in. From the high-level (application) point of view, I want to ensure that communication traffic for one cluster job minimally affects another job. > > * Register for notifications for certain events (excessive traffic > > queueing, or error counts) > > Not sure what you mean by excessive traffic queuing. I guess I'd like to know whenever utilization on a single link exceeds 90%, or the queuing delay exceeds XXX nanoseconds. > It is the event set which is of interest to me. Are there others ? > > There are a set of events which can be subscribed to currently. 
The ones > along these lines are local link integrity threshold reached on a port, > excessive buffer overrun threshold reached on a port, flow control and > update watchdog timer expired on a switch port. > > If you are referring to the PortCounters, these would need to be polled > (at some periodicity) and then an event created as there is no event for > this defined in IBA. > > Higher layer APIs could help with this area too. Some of this stuff may not necessarily belong in the OpenSM process either.. Stuff like getting IPoIB address from GUID's would be usefull in a library, but isn't the SM's responsibility. From hozer at hozed.org Thu Oct 20 07:13:01 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 20 Oct 2005 09:13:01 -0500 Subject: [openib-general] EHCA-0028 userspace build fails with openib svn 3774 In-Reply-To: <43573D34.5010902@de.ibm.com> References: <20051020022519.GM30127@kalmia.hozed.org> <43573D34.5010902@de.ibm.com> Message-ID: <20051020141301.GQ30127@kalmia.hozed.org> On Thu, Oct 20, 2005 at 08:46:12AM +0200, Heiko J Schick wrote: > Hello Troy, > > this problem should be solved in EHCA2_0033. The EHCA2_0028 package was > only tested with OpenIB trunk 3615. The problem is EHCA_0028 doesn't > included the raw_fw_ver pointer for ibv_cmd_query_device. > > Please use EHCA2_0033 available via: > 1: https://openib.org/svn/gen2/trunk/src/userspace/libehca/ > 2: http://prdownloads.sourceforge.net/ibmehcad/ehca2_EHCA2_0033.tgz?download > > We've included yesterday EHCA2_0033 into the OpenIB tree. A couple of nits.. infiniband/hw/ehca/Makefile doesn't fit 80 columns very well. I also see a bunch of warnings about: drivers/infiniband/hw/ehca/./hcp_if.h:1823: warning: ISO C90 forbids mixed declarations and code I need the following patches to build. However, I appear to be unable to unload the previous 0028 hcad_mod, so I think I'm going to have to reboot to actually test it. 
Index: infiniband/Kconfig
===================================================================
--- infiniband/Kconfig (revision 3828)
+++ infiniband/Kconfig (working copy)
@@ -32,6 +32,8 @@
 source "drivers/infiniband/hw/mthca/Kconfig"
+source "drivers/infiniband/hw/ehca/Kconfig"
+
 source "drivers/infiniband/ulp/ipoib/Kconfig"
 source "drivers/infiniband/ulp/sdp/Kconfig"
Index: infiniband/Makefile
===================================================================
--- infiniband/Makefile (revision 3828)
+++ infiniband/Makefile (working copy)
@@ -1,6 +1,7 @@
 obj-$(CONFIG_INFINIBAND) += core/
 obj-$(CONFIG_IPATH_CORE) += hw/ipath/
 obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/
+obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/
 obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SDP) += ulp/sdp/
 obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/
Index: infiniband/hw/ehca/Makefile
===================================================================
--- infiniband/hw/ehca/Makefile (revision 3828)
+++ infiniband/hw/ehca/Makefile (working copy)
@@ -37,15 +37,6 @@
 # $Id: Makefile.kernel_prod,v 1.13 2005/10/13 15:01:16 schickhj Exp $
 #
-
-
-#make for kernel 2.6 build
-
-ifndef GEN2_PATH_KERNEL
-GEN2_PATH_KERNEL = /home/source/trunk_3745/src/linux-kernel
-endif
-# GEN2_PATH_KERNEL = drivers #for gen2 code in kernel
-
 obj-m += hcad_mod.o
 hcad_mod-objs = ehca_main.o ehca_hca.o ipz_pt_fn.o ehca_classes.o ehca_av.o \
@@ -60,7 +51,6 @@
 EXTRA_CFLAGS +=-DP_SERIES -DEHCA_USE_HCALL -DEHCA_USE_HCALL_KERNEL \
   -I$(src)/. \
-  -I$(GEN2_PATH_KERNEL)/infiniband/include/rdma \
-  -I$(GEN2_PATH_KERNEL)/infiniband/core
+  -Idrivers/infiniband/include

From jimmy.hill at us.ibm.com Thu Oct 20 07:31:32 2005
From: jimmy.hill at us.ibm.com (Jimmy Hill)
Date: Thu, 20 Oct 2005 09:31:32 -0500
Subject: [openib-general] Re: private data...
In-Reply-To: <005301c5d4c7$59290c90$0300a8c0@YOURA06808D9DE> Message-ID: A Linux uDAPL-based system infrastructure application I am working on at IBM currently depends on 64-bytes of Private Data for Connect and Accept as well. -- jimmy Oracle currently depends on 64 bytes of private data for connect and accept. ----- Original Message ----- From: Kanevsky, Arkady To: Davis, Arlin R ; dat-discussions at yahoogroups.com ; Grant Grundler Cc: swg at infinibandta.org ; openib-general at openib.org Sent: Wednesday, October 19, 2005 11:31 AM Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Arlin, just to clarify, Intel MPI will not have problems with useing less than 64 bytes of private data. If a solution will provide you with 48 bytes of private data will it be sufficient? Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -----Original Message----- From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] Sent: Wednesday, October 19, 2005 11:30 AM To: dat-discussions at yahoogroups.com; Grant Grundler Cc: swg at infinibandta.org; openib-general at openib.org Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Arkady, Intel MPI (real consumer of uDAPL) has no problem with this change. -arlin From: dat-discussions at yahoogroups.com [mailto:dat-discussions at yahoogroups.com] On Behalf Of Kanevsky, Arkady Sent: Wednesday, October 19, 2005 6:40 AM To: Grant Grundler; Caitlin Bestler Cc: Roland Dreier; swg at infinibandta.org; dat-discussions at yahoogroups.com; openib-general at openib.org Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulation protocol Grant, The developers of the application(s) in questions are aware of the discussion. I will leave it to them to respond. I bring the discussion point at the weekly DAT Collaborative meeting which we have every Wednesday. 
I apologize that the DAT Collaborative charter does not allow contributions
to be submitted without joining the DAT Collaborative. But this is no
different from Linux not accepting contributions without a proper license.
Rest assured that, as Chair, I bring the concerns and suggestions raised in
the email discussion to the DAT meetings.

Arkady

Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance                     phone: 781-768-5395
375 Totten Pond Rd.                   Fax: 781-895-1195
Waltham, MA 02451-2010                central phone: 781-768-5300

> -----Original Message-----
> From: Grant Grundler [mailto:iod00d at hp.com]
> Sent: Tuesday, October 18, 2005 8:02 PM
> To: Caitlin Bestler
> Cc: Grant Grundler; Roland Dreier; Kanevsky, Arkady;
> swg at infinibandta.org; dat-discussions at yahoogroups.com;
> openib-general at openib.org
> Subject: Re: [openib-general] Re: iWARP emulation protocol
>
>
> On Tue, Oct 18, 2005 at 04:40:54PM -0700, Caitlin Bestler wrote:
> > > Roland (and the rest of us) would like to see someone name a
> > > real consumer of the proposed interface. ie who depends on
> > > this change?
> > > Then the dependency for that use/user can be discussed and
> > > appropriate tradeoffs made. Make sense?
> >
> > Unfortunately not every application that is under development, or even
> > deployed, can be discussed in a google-searchable public forum. That
> > especially applies to user-mode development.
>
> Well, this is open source. While I don't want to preclude
> closed source development, it's usually necessary to have an
> open source consumer that any open source developer can test with.
>
> > So I could have actually tested such applications and still not be
> > free to cite them here.
>
> Understood. I'm not asking *you* to cite one unless you
> happen to own one of the consumers.
>
> > With any luck some of them
> > are following the discussion and will jump in on their own.
> > Unfortunately, since they are developing to uDAPL they are > unlikely to > > be following this discussion. > > It doesn't help that the DAT yahoo-groups.com mailing list is > rejecting my replies. It would be helpful if someone > following this forum could share Roland's question with DAT > mailing list if it didn't make it there already and possibly > explain why naming a consumer is necessary. > > hth, > grant From halr at voltaire.com Thu Oct 20 07:37:31 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Oct 2005 10:37:31 -0400 Subject: [openib-general] [RFC] OpenSM Interactive Console In-Reply-To: <20051020135344.GP30127@kalmia.hozed.org> References: <1129662629.16900.23196.camel@hal.voltaire.com> <20051020012305.GL30127@kalmia.hozed.org> <1129805651.16900.39956.camel@hal.voltaire.com> <20051020135344.GP30127@kalmia.hozed.org> Message-ID: <1129819050.16900.41208.camel@hal.voltaire.com> On Thu, 2005-10-20 at 09:53, Troy Benjegerdes wrote: > > > * Topology > > > > This can be done via SA queries currently. > > > > > * guid/lid/IPoIB address/switch port mappings > > > > The SM does not know (see) IPoIB addresses. The only thing it sees is > > the part of the subnet address. > > > > The rest can be done via SA queries currently. > > > > > * link state > > > > This can be done via SA query currently. > > > > This argues for a higher layer API to make these queries easy. > > > > > Future neat things to do: > > > > > > * An interface to dynamically partition the fabric > > > > Is this referring to IB partitioning ?
> > I think so, but IB partitioning may not actually map to what I'm > interested in. From the high-level (application) point of view, I want to > ensure that communication traffic for one cluster job minimally affects > another job. Do the set of end nodes overlap for jobs ? This might be via using different SLs rather than different (IB) partitions depending on the requirement. In any case, there is more work here than just this API. > > > * Register for notifications for certain events (excessive traffic > > > queueing, or error counts) > > > > Not sure what you mean by excessive traffic queuing. > > I guess I'd like to know whenever utilization on a single link exceeds > 90%, or whatever % you would want to be notified about (with sampling/polling at some interval (assuming there is no IB defined event for these). > or the queuing delay exceeds XXX nanoseconds. I think you are talking more in the abstract here. I need to think about this one some more as to if/how to determine something like this for IB. > > It is the event set which is of interest to me. Are there others ? > > > > There are a set of events which can be subscribed to currently. The ones > > along these lines are local link integrity threshold reached on a port, > > excessive buffer overrun threshold reached on a port, flow control and > > update watchdog timer expired on a switch port. > > > > If you are referring to the PortCounters, these would need to be polled > > (at some periodicity) and then an event created as there is no event for > > this defined in IBA. > > > > Higher layer APIs could help with this area too. > > Some of this stuff may not necessarily belong in the OpenSM process either.. > Stuff like getting IPoIB address from GUID's would be usefull in a > library, but isn't the SM's responsibility. There are a couple of approaches I can imagine for obtaining the mappings of GUID to IPoIB address(es). 1. Vendor specific MADs could be implemented for this but this is ugly. 
Interaction would be required to register and unregister each IPoIB address with the vendor specific agent for this. 2. OpenSM node needs to be on either all IPoIB subnets or those of "interest". It could then do the equivalent of a broadcast ping on each IPoIB subnet and match the ARP/neighbor entries with the GUID requested. Note that the same GUID can have multiple IP addresses on the same or different subnets. A RARP based approach won't work as the QPN is also part of the IPoIB hardware address. -- Hal From jbarker at lanl.gov Thu Oct 20 07:50:31 2005 From: jbarker at lanl.gov (James W. Barker) Date: Thu, 20 Oct 2005 08:50:31 -0600 Subject: [openib-general] Build problem with util/mad_test Message-ID: <6.2.3.4.2.20051020084309.02007f10@cic-mail.lanl.gov> All, When following the instructions posted in the installation cheat sheet I encountered a problem with the build of the mad_test. I execute: (cd util/mad_test && ./autogen.sh && ./configure) and get the following error message: -bash: cd: util/mad_test: No such file or directory I have been unable to locate the mad_test, any thoughts? Thanks, Jim Barker From jlentini at netapp.com Thu Oct 20 08:03:02 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 20 Oct 2005 11:03:02 -0400 (EDT) Subject: [openib-general] Re: [PATCH] SDP: In sdp_link.c::do_link_path_lookup, handle interface table numbering holes In-Reply-To: <20051011134747.GA17185@mellanox.co.il> References: <1128091110.5270.1072.camel@hal.voltaire.com> <20051011134747.GA17185@mellanox.co.il> Message-ID: On Tue, 11 Oct 2005, Michael S. Tsirkin wrote: mst> I think this list scan needs some kind of protection. mst> The following is what I checked in. Does this needs to be updated mst> in other places as well? I agree that SDP must also obtain the dev_base_lock. The update to IBAT, on which the SDP patch was based, used the dev_base_lock (svn diff -r 3317:3547 core/at.c). The dev_base list isn't searched in any other places. 
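
[Editorial aside on the GUID-to-IPoIB mapping discussion above.] Hal's remark that a RARP-style lookup fails because the QPN is part of the IPoIB hardware address can be made concrete with a small sketch. It assumes the 20-octet IPoIB link-layer address layout (one flags/reserved octet, a 24-bit QPN, then the 16-octet port GID) and treats the low 64 bits of the GID as the interface identifier, which is normally the port GUID. The struct and function names are hypothetical and are not taken from the OpenIB tree.

```c
#include <stdint.h>
#include <string.h>

#define IPOIB_HWADDR_LEN 20 /* assumed: 1 flags octet + 3 QPN octets + 16 GID octets */

/* Hypothetical container for the decoded fields. */
struct ipoib_hwaddr_fields {
    uint32_t qpn;       /* 24-bit queue pair number */
    uint8_t  gid[16];   /* full port GID, network byte order */
    uint64_t port_guid; /* low 64 bits of the GID, host order */
};

/* Split a 20-octet IPoIB hardware address into QPN, GID and GUID. */
void ipoib_hwaddr_parse(const uint8_t hw[IPOIB_HWADDR_LEN],
                        struct ipoib_hwaddr_fields *out)
{
    int i;

    /* Octet 0 is flags/reserved; octets 1..3 hold the QPN, big-endian. */
    out->qpn = ((uint32_t)hw[1] << 16) | ((uint32_t)hw[2] << 8) | hw[3];

    /* Octets 4..19 are the GID: subnet prefix, then interface ID. */
    memcpy(out->gid, hw + 4, sizeof out->gid);

    out->port_guid = 0;
    for (i = 0; i < 8; i++)
        out->port_guid = (out->port_guid << 8) | hw[12 + i];
}
```

Two interfaces on the same port share the GUID but may differ in QPN, so the GUID alone does not identify a complete hardware address to look up.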
From hozer at hozed.org Thu Oct 20 08:04:33 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Thu, 20 Oct 2005 10:04:33 -0500
Subject: [openib-general] Re: ehca testing
In-Reply-To:
References: <20051020144020.GR30127@kalmia.hozed.org>
Message-ID: <20051020150432.GS30127@kalmia.hozed.org>

On Thu, Oct 20, 2005 at 04:47:07PM +0200, Christoph Raisch wrote:
> Can't promise the opensm part, but we'll try.
> We have some intel machines with mellanox cards. A second possibility
> would be to just use the mellanox cards in our power5 boxes for opensm and
> see what happens.

This is strange.. This machine has a mellanox card, but no ehca card. It
looks like when hcad_mod and ib_mthca are both loaded something conflicts.

10:/usr/lib/infiniband# modprobe ib_ipoib
[ 1920.053089] mthca0: ib_query_pkey port 0 failed (ret = -22)
10:/usr/lib/infiniband# lsmod
Module                  Size  Used by
ib_ipoib               57200  0
ib_sa                  19704  1 ib_ipoib
hcad_mod              989040  0
ib_uverbs              50896  0
openafs               847024  3
ib_mthca              154656  0
ib_mad                 54692  2 ib_sa,ib_mthca
ib_core                61488  6 ib_ipoib,ib_sa,hcad_mod,ib_uverbs,ib_mthca,ib_mad

...
[ 145.817358] ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
[ 145.817373] ib_mthca: Initializing 0000:d9:00.0
[ 145.818257] PCI: Enabling device: (0000:d9:00.0), cmd 142
[ 152.401571] openafs: module license 'http://www.openafs.org/dl/license10.html' taints kernel.
[ 152.404580] Found system call table at 0xc000000000013e68 (scan: close+ioctl)
[ 152.420156] Starting AFS cache scan...Memory cache: Allocating 12500 dcache entries...found 0 non-empty cache files (0%).
[ 1580.877118] eHCA Infiniband Device Driver (Rel.: EHCA2_0033) [ 1920.053089] mthca0: ib_query_pkey port 0 failed (ret = -22) From halr at voltaire.com Thu Oct 20 08:11:46 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Oct 2005 11:11:46 -0400 Subject: [openib-general] Build problem with util/mad_test In-Reply-To: <6.2.3.4.2.20051020084309.02007f10@cic-mail.lanl.gov> References: <6.2.3.4.2.20051020084309.02007f10@cic-mail.lanl.gov> Message-ID: <1129820891.16900.41392.camel@hal.voltaire.com> On Thu, 2005-10-20 at 10:50, James W. Barker wrote: > When following the instructions posted in the installation cheat > sheet I encountered a problem with the build of the mad_test. I execute: > (cd util/mad_test && ./autogen.sh && ./configure) > > and get the following error message: > -bash: cd: util/mad_test: No such file or directory > > I have been unable to locate the mad_test, any thoughts? The installation cheat sheet needs updating. mad_test no longer exists. Unfortunately with svn directories are not removed. I just took care of updating the installation cheat sheet on the wiki. Thanks. -- Hal > Thanks, > Jim Barker > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From halr at voltaire.com Thu Oct 20 08:15:05 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Oct 2005 11:15:05 -0400 Subject: [openib-general] Re: [PATCH] SDP: In sdp_link.c::do_link_path_lookup, handle interface table numbering holes In-Reply-To: References: <1128091110.5270.1072.camel@hal.voltaire.com> <20051011134747.GA17185@mellanox.co.il> Message-ID: <1129821073.16900.41413.camel@hal.voltaire.com> On Thu, 2005-10-20 at 11:03, James Lentini wrote: > On Tue, 11 Oct 2005, Michael S. Tsirkin wrote: > > mst> I think this list scan needs some kind of protection. 
> mst> The following is what I checked in. Does this needs to be updated > mst> in other places as well? > > I agree that SDP must also obtain the dev_base_lock. I think this was directed at the patch I supplied. I mistakenly left that out. I think the code Michael committed in sdp_link.c obtains the dev_base_lock appropriately. -- Hal > The update to IBAT, on which the SDP patch was based, used the > dev_base_lock (svn diff -r 3317:3547 core/at.c). > > The dev_base list isn't searched in any other places. From mshefty at ichips.intel.com Thu Oct 20 08:51:02 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 08:51:02 -0700 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol In-Reply-To: References: Message-ID: <4357BCE6.5050108@ichips.intel.com> Kanevsky, Arkady wrote: > I will update the proposal for IBTA based on this feedback > and all other feedback posted. > I will still separate private data usage proposal > and port mapping one. Again, I think that these should be in the same proposal. The CM REQ carries the IB transport layer address. The goal here is to map another transport layer address to the IB one. The source port is included in the private data. By not including the destination port, there's an assumption that it's provided somewhere else in the CM REQ. We should either make this explicit, or put the destination port in the private data as well. - Sean From hozer at hozed.org Thu Oct 20 09:24:17 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 20 Oct 2005 11:24:17 -0500 Subject: [openib-general] EHCA-0028 userspace build fails with openib svn 3774 Message-ID: <20051020162417.GU30127@kalmia.hozed.org> On Thu, Oct 20, 2005 at 08:46:12AM +0200, Heiko J Schick wrote: > Hello Troy, > > this problem should be solved in EHCA2_0033. The EHCA2_0028 package was > only tested with OpenIB trunk 3615. The problem is EHCA_0028 doesn't > included the raw_fw_ver pointer for ibv_cmd_query_device. 
EHCA_0033 (the version in openib subversion) still has some serious
problems. When the module is unloaded, it doesn't clean up kernel threads,
and setting a different value of 'port_act_time' when loading the module the
second time does not have any effect.

The firmware is also broken and not responding to certain opensm queries.
This is part of osm.log..

Oct 20 11:16:59 946091 [43005960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x16 trans_id=0x9567) -- dropping.
Oct 20 11:16:59 946115 [43005960] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 3 DR SLID 0x0 DR DLID 0x0
Oct 20 11:16:59 946126 [43005960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT).
Oct 20 11:16:59 946155 [43005960] -> SMP dump:
        base_ver................0x1
        mgmt_class..............0x81
        class_ver...............0x1
        method..................0x1 (SubnGet)
        D bit...................0x0
        status..................0x0
        hop_ptr.................0x0
        hop_count...............0x3
        trans_id................0x9567
        attr_id.................0x16 (P_KeyTable)
        resv....................0x0
        attr_mod................0x10000
        m_key...................0x0000000000000000
        dr_slid.................0xFFFF
        dr_dlid.................0xFFFF

        Initial path: [0][1][13][2]
        Return path: [0][0][0][0]
        Reserved: [0][0][0][0][0][0][0]

        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

From caitlinb at broadcom.com Thu Oct 20 09:25:55 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Thu, 20 Oct 2005 09:25:55 -0700
Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol
Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020A8B@NT-SJCA-0751.brcm.ad.broadcom.com>

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hefty
> Sent: Thursday,
October 20, 2005 8:51 AM > To: Kanevsky, Arkady > Cc: swg at infinibandta.org; openib-general at openib.org; Lentini, > James; Davis, Arlin R > Subject: Re: [dat-discussions] RE: [openib-general] Re: iWARP > emulationprotocol > > Kanevsky, Arkady wrote: > > I will update the proposal for IBTA based on this feedback and all > > other feedback posted. > > I will still separate private data usage proposal and port mapping > > one. > > Again, I think that these should be in the same proposal. > The CM REQ carries the IB transport layer address. The goal > here is to map another transport layer address to the IB one. > The source port is included in the private data. By not > including the destination port, there's an assumption that > it's provided somewhere else in the CM REQ. We should either > make this explicit, or put the destination port in the > private data as well. > Under the general programming model for an IP-centric daemon, the listener can assume that connection requests will be for the TCP port that the listen was issued upon. However, the daemon typically listens on *all* addresses that the system supports. It is not uncommon for the application to note which destination address was actually requested and to vary the service provided based upon that. This is what makes it possible for single machines to host vast numbers of web sites. It is less common, but still requiring support, for the daemon to differentiate service based upon the source address. It is more common to simply refuse service based upon the source address, which can be handled by the CM or firewall itself rather than by the application, but there are exceptions. Some web-sites have intranet versus internet verions. Some file servers control access lists based upon source address. It is actually quite effective when combined with network authentication of source addresses. 
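
[Editorial aside.] The IP-centric daemon model described above, listening on all addresses and then checking which destination address a client actually asked for, can be sketched with plain BSD sockets. This is only an illustration of the TCP behaviour that the private-data proposal is trying to preserve over IB; it is not the CMA or uDAPL API. The helper names `listen_any` and `accept_with_addrs` are hypothetical, and the sketch is IPv4-only.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Listen on every local IPv4 address. Passing port 0 lets the kernel
 * pick a free port. Returns the listening fd, or -1 on error. */
int listen_any(unsigned short port)
{
    struct sockaddr_in sin;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY); /* wildcard: all addresses */
    sin.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&sin, sizeof sin) < 0 ||
        listen(fd, 8) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Accept one connection. *src receives the peer (source) address and
 * *dst the local address the peer connected to (the destination it
 * requested). Returns the connected fd, or -1 on error. */
int accept_with_addrs(int lfd, struct sockaddr_in *dst,
                      struct sockaddr_in *src)
{
    socklen_t slen = sizeof *src;
    socklen_t dlen = sizeof *dst;
    int cfd = accept(lfd, (struct sockaddr *)src, &slen);

    if (cfd < 0)
        return -1;
    /* getsockname() on the accepted socket yields the specific local
     * address, even though the listener was bound to the wildcard. */
    if (getsockname(cfd, (struct sockaddr *)dst, &dlen) < 0) {
        close(cfd);
        return -1;
    }
    return cfd;
}
```

A daemon would call `accept_with_addrs()` in its accept loop and select the service (or refuse it) based on `dst` and `src`. Over IB the CM REQ does not carry these IP addresses by itself, which is why the thread discusses placing them in private data.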
From pradeep at us.ibm.com Thu Oct 20 09:35:02 2005
From: pradeep at us.ibm.com (Pradeep Satyanarayana)
Date: Thu, 20 Oct 2005 09:35:02 -0700
Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a
In-Reply-To: <1129754232.16900.35630.camel@hal.voltaire.com>
Message-ID:

openib-general-bounces at openib.org wrote on 10/19/2005 01:37:14 PM:

> On Tue, 2005-10-18 at 18:40, Kevin Reilly wrote:
> > >
> > > On Mon, 2005-10-18 at 10:07, Kevin Reilly wrote:
> > >On Mon, 2005-10-17 at 10:07, Hal Rosenstock wrote:
> > >> > Should this code work, because it seems that out_dev is a kernel
> > >> > address (platform: PPC64) which cannot be accessed by a userspace
> > >> > program. Via GDB I can see that rt has the following content:
> > >> >
> > >> > The address is rt->out_dev = 0xc0000000cffaa800 which looks like a
> > >> > kernel address.
> > >>
> > >> Yes, this is a bug which has been previously pointed out on the list and
> > >> not fixed.
> > >
> > >The fix for this involves an ABI change: it should return the GID of the
> > >outgoing IB device.
> > >
> > >-- Hal
> >
> > Should we (IBM) work on submitting a patch for this?
>
> That's up to you.
>
> > Returning the GID or the device_name would be a good fix.
>
> Yes, either of these could be made to work.
>
> > I guess our reluctance is that we've heard that this address translation
> > library function might be deprecated for another interface?
>
> Yes, that has been my reluctance as well. It appears AT is likely to be
> superseded by CMA.
>

Is there a ballpark estimate (or a plan) of when CMA will be ready?
Estimates like "by end of Q4 2005" or "end of Q1 2006" will help us decide
whether we should submit a patch for this bug or wait for CMA.

Pradeep
pradeep at us.ibm.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From Arkady.Kanevsky at netapp.com Thu Oct 20 09:36:04 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 20 Oct 2005 12:36:04 -0400 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Message-ID: The updated proposal will have IP addresses and TCP ports of src and dst in private data. How TCP ports are mapped to IB service IDs is a separate proposal. Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Thursday, October 20, 2005 11:51 AM > To: Kanevsky, Arkady > Cc: Richard Frank; Lentini, James; Roland Dreier; > swg at infinibandta.org; openib-general at openib.org; Davis, Arlin R > Subject: Re: [dat-discussions] RE: [openib-general] Re: iWARP > emulationprotocol > > > Kanevsky, Arkady wrote: > > I will update the proposal for IBTA based on this feedback and all > > other feedback posted. I will still separate private data usage > > proposal and port mapping one. > > Again, I think that these should be in the same proposal. > The CM REQ carries > the IB transport layer address. The goal here is to map > another transport layer > address to the IB one. The source port is included in the > private data. By not > including the destination port, there's an assumption that > it's provided > somewhere else in the CM REQ. We should either make this > explicit, or put the > destination port in the private data as well. > > - Sean > From Arkady.Kanevsky at netapp.com Thu Oct 20 09:39:45 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 20 Oct 2005 12:39:45 -0400 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol Message-ID: with both SRC and DST IP addresses and TCP ports all these models will be supported. 
Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Thursday, October 20, 2005 12:26 PM > To: Sean Hefty; Kanevsky, Arkady > Cc: swg at infinibandta.org; openib-general at openib.org; Lentini, > James; Davis, Arlin R > Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP > emulationprotocol > > > > > > -----Original Message----- > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hefty > > Sent: Thursday, October 20, 2005 8:51 AM > > To: Kanevsky, Arkady > > Cc: swg at infinibandta.org; openib-general at openib.org; Lentini, > > James; Davis, Arlin R > > Subject: Re: [dat-discussions] RE: [openib-general] Re: iWARP > > emulationprotocol > > > > Kanevsky, Arkady wrote: > > > I will update the proposal for IBTA based on this feedback and all > > > other feedback posted. > > > I will still separate private data usage proposal and > port mapping > > > one. > > > > Again, I think that these should be in the same proposal. > > The CM REQ carries the IB transport layer address. The goal > > here is to map another transport layer address to the IB one. > > The source port is included in the private data. By not > > including the destination port, there's an assumption that > > it's provided somewhere else in the CM REQ. We should either > > make this explicit, or put the destination port in the > > private data as well. > > > > Under the general programming model for an IP-centric daemon, > the listener can assume that connection requests will be for > the TCP port that the listen was issued upon. > > However, the daemon typically listens on *all* addresses that > the system supports. 
It is not uncommon for the application > to note which destination address was actually requested and > to vary the service provided based upon that. This is what > makes it possible for single machines to host vast numbers of > web sites. > > It is less common, but still requiring support, for the > daemon to differentiate service based upon the source > address. It is more common to simply refuse service based > upon the source > address, which can be handled by the CM or firewall itself > rather than by the application, but there are exceptions. > Some web-sites have intranet versus internet versions. Some > file servers control access lists based upon source address. > It is actually quite effective when combined with network > authentication of source addresses. > From mshefty at ichips.intel.com Thu Oct 20 09:40:54 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 09:40:54 -0700 Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulationprotocol In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020A8B@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020A8B@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <4357C896.60002@ichips.intel.com> Caitlin Bestler wrote: > However, the daemon typically listens on *all* addresses that > the system supports. It is not uncommon for the application > to note which destination address was actually requested and > to vary the service provided based upon that. This is what makes > it possible for single machines to host vast numbers of web sites. The CMA supports listening on a port across all addresses, with the requested destination address reported to the client. - Sean From Arkady.Kanevsky at netapp.com Thu Oct 20 09:41:51 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 20 Oct 2005 12:41:51 -0400 Subject: [openib-general] FW upgrade for TopSpin cards Message-ID: I want to upgrade FW on several TopSpin cards I have. 
There is tvflash utility in gen2/trunk/src/userspace/tvflash I tried to build tvflash on 2.6.13.3 system I have. I get a bunch of warnings (see below). gcc version is gcc version 4.0.0 20050519 (Red Hat 4.0.0-8). What's the story? Can I use OpenIB tvflash to upgrade FW on a TopSpin card? Can I use OpenIB mstflint for it? Which version of the utilities should I use? Why warning when I build it? Arkady ****************************************** # make make: Warning: File `.deps/src_tvflash-tvflash.Po' has modification time 1.8e+04 s in the future make all-am make[1]: Entering directory `/u/arkady/openib/gen2/trunk/src/userspace/tvflash' make[1]: Warning: File `.deps/src_tvflash-tvflash.Po' has modification time 1.8e+04 s in the future if gcc -DHAVE_CONFIG_H -I. -I. -I. -Wall -g -O2 -MT src_tvflash-tvflash.o -MD -MP -MF ".deps/src_tvflash-tvflash.Tpo" -c -o src_tvflash-tvflash.o `test -f 'src/tvflash.c' || echo './'`src/tvflash.c; \ then mv -f ".deps/src_tvflash-tvflash.Tpo" ".deps/src_tvflash-tvflash.Po"; else rm -f ".deps/src_tvflash-tvflash.Tpo"; exit 1; fi src/tvflash.c: In function 'parse_guid': src/tvflash.c:112: warning: pointer targets in passing argument 1 of '__builtin_strchr' differ in signedness src/tvflash.c:117: warning: pointer targets in passing argument 1 of 'strrchr' differ in signedness src/tvflash.c:117: warning: pointer targets in assignment differ in signedness src/tvflash.c:135: warning: pointer targets in passing argument 1 of 'strrchr' differ in signedness src/tvflash.c:135: warning: pointer targets in assignment differ in signedness src/tvflash.c:205: warning: pointer targets in passing argument 1 of 'strtol' differ in signedness src/tvflash.c: In function 'identify_board': src/tvflash.c:702: warning: pointer targets in passing argument 1 of 'strncasecmp' differ in signedness src/tvflash.c: In function 'flash_image_read_from_file': src/tvflash.c:828: warning: pointer targets in assignment differ in signedness src/tvflash.c:830: warning: 
pointer targets in assignment differ in signedness src/tvflash.c:832: warning: pointer targets in assignment differ in signedness src/tvflash.c:844: warning: pointer targets in assignment differ in signedness src/tvflash.c: In function 'flash_check_failsafe': src/tvflash.c:905: warning: pointer targets in passing argument 2 of 'validate_image' differ in signedness src/tvflash.c:911: warning: pointer targets in passing argument 2 of 'validate_image' differ in signedness src/tvflash.c: In function 'create_ver_str': src/tvflash.c:1033: warning: pointer targets in passing argument 1 of 'snprintf' differ in signedness src/tvflash.c:1039: warning: pointer targets in passing argument 1 of 'snprintf' differ in signedness src/tvflash.c:1044: warning: pointer targets in passing argument 1 of 'snprintf' differ in signedness src/tvflash.c:1046: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c: In function 'identify_hca': src/tvflash.c:1278: warning: pointer targets in passing argument 1 of 'sscanf' differ in signedness src/tvflash.c: In function 'identify_firmware': src/tvflash.c:1399: warning: pointer targets in passing argument 1 of 'sscanf' differ in signedness src/tvflash.c: In function 'upload_firmware': src/tvflash.c:1813: warning: pointer targets in passing argument 1 of 'parse_guid' differ in signedness src/tvflash.c:1932: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1932: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1932: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1932: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1932: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1932: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness 
src/tvflash.c:1932: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1932: warning: pointer targets in passing argument 1 of 'strncmp' differ in signedness src/tvflash.c:1936: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1936: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1936: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1936: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1936: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1936: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1936: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1936: warning: pointer targets in passing argument 1 of 'strncmp' differ in signedness src/tvflash.c:1940: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1940: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1940: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1940: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1940: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1940: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1940: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1940: warning: pointer targets in passing argument 1 of 'strncmp' differ in signedness src/tvflash.c:1944: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness 
src/tvflash.c:1944: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1944: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1944: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1944: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1944: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1944: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1944: warning: pointer targets in passing argument 1 of 'strncmp' differ in signedness src/tvflash.c:1948: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1948: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1948: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1948: warning: pointer targets in passing argument 1 of 'strlen' differ in signedness src/tvflash.c:1948: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1948: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1948: warning: pointer targets in passing argument 1 of '__builtin_strcmp' differ in signedness src/tvflash.c:1948: warning: pointer targets in passing argument 1 of 'strncmp' differ in signedness gcc -g -O2 -o src/tvflash src_tvflash-tvflash.o -lpci make[1]: warning: Clock skew detected. Your build may be incomplete. make[1]: Leaving directory `/u/arkady/openib/gen2/trunk/src/userspace/tvflash' make: warning: Clock skew detected. Your build may be incomplete. Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. 
Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Thu Oct 20 09:39:51 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Oct 2005 12:39:51 -0400 Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a In-Reply-To: References: Message-ID: <1129826390.16900.41898.camel@hal.voltaire.com> On Thu, 2005-10-20 at 12:35, Pradeep Satyanarayana wrote: > Is there a ballpark estimate (or a plan) of when CMA will be ready? > Estimates like by end of Q4 2005 > or end of Q1 2006 will help us make some decisions if we should submit > a patch for this bug or wait > for CMA. CMA is ready now. It's up to Sean to say what he thinks for user CMA availability (for planning purposes). -- Hal From mshefty at ichips.intel.com Thu Oct 20 09:47:40 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 09:47:40 -0700 Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a In-Reply-To: References: Message-ID: <4357CA2C.9030203@ichips.intel.com> Pradeep Satyanarayana wrote: > Is there a ballpark estimate (or a plan) of when CMA will be ready? > Estimates like by end of Q4 2005 > or end of Q1 2006 will help us make some decisions if we should submit a > patch for this bug or wait > for CMA. The kernel CMA is ready today. An additional change will be required at some point once the iWarp Emulation Protocol is defined, but that will be minor. Work on the user CMA should begin by the end of this week. I estimate that it will take about 4 weeks to complete. - Sean From krause at cup.hp.com Thu Oct 20 09:59:44 2005 From: krause at cup.hp.com (Michael Krause) Date: Thu, 20 Oct 2005 09:59:44 -0700 Subject: [openib-general] Re: [swg] Re: private data... 
In-Reply-To: References: <005301c5d4c7$59290c90$0300a8c0@YOURA06808D9DE> Message-ID: <6.2.0.14.2.20051020095636.026d5c18@esmail.cup.hp.com> This is really an IBTA issue to resolve and to insure that backward compatibility with existing applications is maintained. Hence, this exercise of who is broken or not is inherently flawed in that one cannot comprehend all implementations that may exist. Therefore, the spec should use either a new version number or a reserved bit to indicate that there is a defined format to the private data portion or not. This is no different than what is done in other technologies such as PCIe. Those applications that require the existing semantics will be confined to the existing associated infrastructure. Those that want the new IP semantics set the bit / version and operate within the restricted private data space available. It is that simple. Mike At 07:31 AM 10/20/2005, Jimmy Hill wrote: >A Linux uDAPL-based system infrastructure application I am working on at >IBM currently depends on 64-bytes of Private Data for Connect and Accept >as well. > >-- jimmy > > > > >Oracle currently depends on 64 bytes of private data for connect and accept. > > >----- Original Message ----- >From: Kanevsky, Arkady >To: Davis, Arlin R ; >dat-discussions at yahoogroups.com ; >Grant Grundler >Cc: swg at infinibandta.org ; >openib-general at openib.org >Sent: Wednesday, October 19, 2005 11:31 AM >Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP >emulationprotocol > >Arlin, >just to clarify, Intel MPI will not have problems with using less than 64 >bytes >of private data. >If a solution will provide you with 48 bytes of private data will it be >sufficient? >Arkady > > >Arkady Kanevsky email: >arkady at netapp.com >Network Appliance phone: 781-768-5395 >375 Totten Pond Rd. 
Fax: 781-895-1195 >Waltham, MA 02451-2010 central phone: 781-768-5300 > > >-----Original Message----- >From: Davis, Arlin R [mailto:arlin.r.davis at intel.com] >Sent: Wednesday, October 19, 2005 11:30 AM >To: >dat-discussions at yahoogroups.com; >Grant Grundler >Cc: swg at infinibandta.org; >openib-general at openib.org >Subject: RE: [dat-discussions] RE: [openib-general] Re: iWARP >emulationprotocol > >Arkady, > >Intel MPI (real consumer of uDAPL) has no problem with this change. > >-arlin > > > >---------- > >From: dat-discussions at yahoogroups.com >[mailto:dat-discussions at yahoogroups.com] On Behalf Of Kanevsky, Arkady >Sent: Wednesday, October 19, 2005 6:40 AM >To: Grant Grundler; Caitlin Bestler >Cc: Roland Dreier; swg at infinibandta.org; dat-discussions at yahoogroups.com; >openib-general at openib.org >Subject: [dat-discussions] RE: [openib-general] Re: iWARP emulation protocol > >Grant, >The developers of the application(s) in question are aware of the >discussion. >I will leave it to them to respond. > >I bring the discussion points to the weekly DAT Collaborative meeting >which we have every Wednesday. > >I apologize that the DAT Collaborative charter does not allow >submitting contributions without joining the DAT Collaborative. >But this is no different from Linux not accepting any contributions >without a proper license. >But rest assured that as Chair I bring the concerns >and suggestions stated in the email discussion to the DAT meetings. > >Arkady > >Arkady Kanevsky email: arkady at netapp.com >Network Appliance phone: 781-768-5395 >375 Totten Pond Rd. 
Fax: 781-895-1195 >Waltham, MA 02451-2010 central phone: 781-768-5300 > > > > > -----Original Message----- > > From: Grant Grundler [mailto:iod00d at hp.com] > > Sent: Tuesday, October 18, 2005 8:02 PM > > To: Caitlin Bestler > > Cc: Grant Grundler; Roland Dreier; Kanevsky, Arkady; > > swg at infinibandta.org; dat-discussions at yahoogroups.com; > > openib-general at openib.org > > Subject: Re: [openib-general] Re: iWARP emulation protocol > > > > > > On Tue, Oct 18, 2005 at 04:40:54PM -0700, Caitlin Bestler wrote: > > > > Roland (and the rest of us) would like to see someone name a > > > > real consumer of the proposed interface. ie who depends on > > > > this change? > > > > Then the dependency for that use/user can be discussed and > > > > appropriate tradeoffs made. Make sense? > > > > > > Unfortunately not every application that is under > > development, or even > > > deployed, can be discussed in a google-searchable public > > forum. That > > > especially applies to user-mode development. > > > > Well, this is open source. While I don't want to preclude > > closed source development, it's usually necessary to have an > > open source consumer that any open source developer can test with. > > > > > So I could have actually tested such applications and still not be > > > free to cite them here. > > > > Understood. I'm not asking *you* to cite one unless you > > happen to own one of the consumers. > > > > > With any luck some of them > > > are following the discussion and will jump in on their own. > > > Unfortunately, since they are developing to uDAPL they are > > unlikely to > > > be following this discussion. > > > > It doesn't help that the DAT yahoo-groups.com mailing list is > > rejecting my replies. It would be helpful if someone > > following this forum could share Roland's question with the DAT > > mailing list if it didn't make it there already and possibly > > explain why naming a consumer is necessary. 
> > > > hth, > > grant > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Thu Oct 20 10:24:44 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 10:24:44 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <6.2.0.14.2.20051020095636.026d5c18@esmail.cup.hp.com> References: <005301c5d4c7$59290c90$0300a8c0@YOURA06808D9DE> <6.2.0.14.2.20051020095636.026d5c18@esmail.cup.hp.com> Message-ID: <4357D2DC.4040600@ichips.intel.com> Michael Krause wrote: > This is really an IBTA issue to resolve and to insure that backward > compatibility with existing applications is maintained. Hence, this > exercise of who is broken or not is inherently flawed in that one cannot > comprehend all implementations that may exist. Therefore, the spec > should use either a new version number or a reserved bit to indicate > that there is a defined format to the private data portion or not. > This is no different than what is done in other technologies such as > PCIe. Those applications that require the existing semantics will be > confined to the existing associated infrastructure. Those that want the > new IP semantics set the bit / version and operate within the restricted > private data space available. It is that simple. If we use an IBTA assigned service ID, I think that this can be defined without using a reserved bit or changing a version number. The two possible implementations that I see are using a single service ID, or mapping port numbers to a range of assigned service IDs. 
- Sean From ftillier at silverstorm.com Thu Oct 20 10:34:16 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 20 Oct 2005 10:34:16 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <6.2.0.14.2.20051020095636.026d5c18@esmail.cup.hp.com> Message-ID: <001501c5d59c$7fbd3d10$9e5aa8c0@infiniconsys.com> > From: Michael Krause [mailto:krause at cup.hp.com] > Sent: Thursday, October 20, 2005 10:00 AM > > This is really an IBTA issue to resolve and to insure that backward > compatibility with existing applications is maintained. Hence, this exercise > of who is broken or not is inherently flawed in that one cannot comprehend all > implementations that may exist. Therefore, the spec should use either a new > version number or a reserved bit to indicate that there is a defined format to > the private data portion or not. This is no different than what is done in > other technologies such as PCIe. Those applications that require the existing > semantics will be confined to the existing associated infrastructure. Those > that want the new IP semantics set the bit / version and operate within the > restricted private data space available. It is that simple. While I agree with you, the issue at hand is that DAPL tries to do both - providing IP semantics to the application *and* 64-bytes of private data. While the IBTA may use a reserved bit to differentiate native IB or IP-enhanced connection establishment MADs, if DAPL is to use this feature then DAPL clients will lose some of their private data. This gets us back to how to handle DAPL clients that depend on the full 64 bytes of private data and how to support them, which is a DAPL issue IMO and not an IBTA issue. The IBTA should do what's right for IB independently of DAPL, and define a proper IP-enhanced CM protocol. - Fab From jbarker at lanl.gov Thu Oct 20 10:34:27 2005 From: jbarker at lanl.gov (James W. 
Barker) Date: Thu, 20 Oct 2005 11:34:27 -0600 Subject: [openib-general] Building userspace verbs libraries problem Message-ID: <6.2.3.4.2.20051020113042.021da7c0@cic-mail.lanl.gov> All, When building userspace verbs libraries per the installation cheat sheet, I execute: (cd libibverbs && ./autogen.sh && ./configure && make && make install) which generates the error: configure: error: sysfs_open_class() not found. libibverbs requires libsysfs. Am I missing an RPM (libsysfs?). Thanks, Jim Barker From rjwalsh at pathscale.com Thu Oct 20 10:35:32 2005 From: rjwalsh at pathscale.com (Robert Walsh) Date: Thu, 20 Oct 2005 10:35:32 -0700 Subject: [openib-general] Building userspace verbs libraries problem In-Reply-To: <6.2.3.4.2.20051020113042.021da7c0@cic-mail.lanl.gov> References: <6.2.3.4.2.20051020113042.021da7c0@cic-mail.lanl.gov> Message-ID: <1129829732.16558.13.camel@phosphene.durables.org> > configure: error: sysfs_open_class() not found. libibverbs > requires libsysfs. > > Am I missing an RPM (libsysfs?). sysfsutils-devel on Fedora/RedHat systems. Relies on sysfsutils. Regards, Rober. -- Robert Walsh Email: rjwalsh at pathscale.com PathScale, Inc. Phone: +1 650 934 8117 2071 Stierlin Court, Suite 200 Fax: +1 650 428 1969 Mountain View, CA 94043. From rolandd at cisco.com Thu Oct 20 10:48:01 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 20 Oct 2005 10:48:01 -0700 Subject: [openib-general] FW upgrade for TopSpin cards In-Reply-To: (Arkady Kanevsky's message of "Thu, 20 Oct 2005 12:41:51 -0400") References: Message-ID: <52ek6gf5bi.fsf@cisco.com> Arkady> I get a bunch of warnings (see below). All of the warnings look benign (although you might want to synchronize the clock between your build system and your file server). Arkady> Can I use OpenIB tvflash to upgrade FW on a TopSpin card? Yes. Arkady> Can I use OpenIB mstflint for it? Yes. Arkady> Which version of the utilities should I use? I would use the latest subversion revision. 
Arkady> Why warning when I build it? Because gcc 4.0 added a bunch of semi-bogus pointer sign warnings, and your clocks are out of synch. - R. From rolandd at cisco.com Thu Oct 20 10:48:35 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 20 Oct 2005 10:48:35 -0700 Subject: [openib-general] Re: ehca testing In-Reply-To: <20051020150432.GS30127@kalmia.hozed.org> (Troy Benjegerdes's message of "Thu, 20 Oct 2005 10:04:33 -0500") References: <20051020144020.GR30127@kalmia.hozed.org> <20051020150432.GS30127@kalmia.hozed.org> Message-ID: <52ach4f5ak.fsf@cisco.com> Troy> This is strange.. This machine has a mellanox card, but no Troy> ehca card. It looks like when hcad_mod and ib_mthca are Troy> both loaded something conflicts. Have you confirmed that it works without hcad_mod loaded? - R. From Arkady.Kanevsky at netapp.com Thu Oct 20 10:50:20 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 20 Oct 2005 13:50:20 -0400 Subject: [openib-general] Re: [swg] Re: private data... Message-ID: Fab, you are correct. But this is a DAPL issue, not an IBTA one. As long as the IBTA defines support for both the current CM with full private data and an enhanced-semantics CM with reduced private data but socket addressing model support, and OpenIB exposes access to both of them, the rest is a DAPL issue. OpenIB will not have any backwards compatibility issue because this is the first version of DAPL they will support. But, of course, it would be nice if it can support apps written to the current version of DAPL. Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Fab Tillier [mailto:ftillier at silverstorm.com] > Sent: Thursday, October 20, 2005 1:34 PM > To: 'Michael Krause'; dat-discussions at yahoogroups.com > Cc: swg at infinibandta.org; openib-general at openib.org > Subject: RE: [openib-general] Re: [swg] Re: private data... 
> > > > From: Michael Krause [mailto:krause at cup.hp.com] > > Sent: Thursday, October 20, 2005 10:00 AM > > > > This is really an IBTA issue to resolve and to insure that backward > > compatibility with existing applications is maintained. > Hence, this > > exercise of who is broken or not is inherently flawed in that one > > cannot comprehend all implementations that may exist. > Therefore, the > > spec should use either a new version number or a reserved > bit to indicate that there is a defined format to > > the private data portion or not. This is no different > than what is done in > > other technologies such as PCIe. Those applications that > require the > > existing semantics will be confined to the existing associated > > infrastructure. Those that want the new IP semantics set the bit / > > version and operate within the restricted private data space > > available. It is that simple. > > While I agree with you, the issue at hand is that DAPL tries > to do both - providing IP semantics to the application *and* > 64-bytes of private data. While the IBTA may use a reserved > bit to differentiate native IB or IP-enhanced connection > establishment MADs, if DAPL is to use this feature then DAPL > clients will lose some of their private data. This gets us > back to how to handle DAPL clients that depend on the full 64 > bytes of private data and how to support them, which is a > DAPL issue IMO and not an IBTA issue. The IBTA should do > what's right for IB independently of DAPL, and define a > proper IP-enhanced CM protocol. 
> > - Fab > > > From Arkady.Kanevsky at netapp.com Thu Oct 20 10:52:34 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 20 Oct 2005 13:52:34 -0400 Subject: [openib-general] FW upgrade for TopSpin cards Message-ID: Thanks Roland. I was worried about pointer sign warnings. Clock is not an issue. Do you plan to fix the sources so that the gcc 4.0 warnings are not generated? Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Thursday, October 20, 2005 1:48 PM > To: Kanevsky, Arkady > Cc: openib-general at openib.org > Subject: Re: [openib-general] FW upgrade for TopSpin cards > > > Arkady> I get a bunch of warnings (see below). > > All of the warnings look benign (although you might want to > synchronize the clock between your build system and your file server). > > Arkady> Can I use OpenIB tvflash to upgrade FW on a TopSpin card? > > Yes. > > Arkady> Can I use OpenIB mstflint for it? > > Yes. > > Arkady> Which version of the utilities should I use? > > I would use the latest subversion revision. > > Arkady> Why warning when I build it? > > Because gcc 4.0 added a bunch of semi-bogus pointer sign > warnings, and your clocks are out of synch. > > - R. 
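For reference, a minimal sketch of what the gcc 4.0 pointer-sign warnings quoted above are about, and how an explicit cast quiets them without changing behavior. The helper last_colon() is hypothetical and is not taken from tvflash.c; the sketch only assumes, as the build log suggests, that an unsigned char buffer is being handed to a string function declared with plain char pointers:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from tvflash.c): find the last ':' in a byte buffer. */
static unsigned char *last_colon(unsigned char *buf)
{
	assert(buf != NULL);

	/*
	 * strrchr() takes and returns plain char pointers. Passing buf
	 * directly, or assigning the result to an unsigned char pointer,
	 * is what gcc 4.0 reports as "pointer targets ... differ in
	 * signedness". The explicit casts below make the conversion
	 * visible and silence the warning; the bytes examined are the same.
	 */
	return (unsigned char *)strrchr((const char *)buf, ':');
}
```

Whether to add casts like this or to change the buffer types to plain char is a judgment call for whoever maintains the source; either approach would be expected to remove this class of warning.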
> From hozer at hozed.org Thu Oct 20 10:56:03 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 20 Oct 2005 12:56:03 -0500 Subject: [openib-general] Re: ehca testing In-Reply-To: <52ach4f5ak.fsf@cisco.com> References: <20051020144020.GR30127@kalmia.hozed.org> <20051020150432.GS30127@kalmia.hozed.org> <52ach4f5ak.fsf@cisco.com> Message-ID: <20051020175603.GV30127@kalmia.hozed.org> On Thu, Oct 20, 2005 at 10:48:35AM -0700, Roland Dreier wrote: > Troy> This is strange.. This machine has a mellanox card, but no > Troy> ehca card. It looks like when hcad_mod and ib_mthca are > Troy> both loaded something conflicts. > > Have you confirmed that it works without hcad_mod loaded? I've since found I have the same problem without hcad_mod. I don't see any errors in dmesg except for: [ 7415.421699] mthca0: ib_query_pkey port 0 failed (ret = -22) Any ideas? What debug options should I try next? From iod00d at hp.com Thu Oct 20 11:02:32 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 20 Oct 2005 11:02:32 -0700 Subject: [openib-general] FW upgrade for TopSpin cards In-Reply-To: References: Message-ID: <20051020180232.GC21274@esmail.cup.hp.com> On Thu, Oct 20, 2005 at 12:41:51PM -0400, Kanevsky, Arkady wrote: > I want to upgrade FW on several TopSpin cards I have. > > There is tvflash utility in gen2/trunk/src/userspace/tvflash ... > Can I use OpenIB tvflash to upgrade FW on a TopSpin card? > Can I use OpenIB mstflint for it? > Which version of the utilities should I use? tvflash works for me on ia64 machines. However, mstflint seems to be better supported and I would recommend that. 
grant From rolandd at cisco.com Thu Oct 20 11:03:28 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 20 Oct 2005 11:03:28 -0700 Subject: [openib-general] Re: ehca testing In-Reply-To: <20051020175603.GV30127@kalmia.hozed.org> (Troy Benjegerdes's message of "Thu, 20 Oct 2005 12:56:03 -0500") References: <20051020144020.GR30127@kalmia.hozed.org> <20051020150432.GS30127@kalmia.hozed.org> <52ach4f5ak.fsf@cisco.com> <20051020175603.GV30127@kalmia.hozed.org> Message-ID: <52sluwdq1b.fsf@cisco.com> Troy> I've since found I have the same problem without hcad_mod. I Troy> don't see any errors in dmesg except for: Troy> [ 7415.421699] mthca0: ib_query_pkey port 0 failed (ret = -22) It's strange that IPoIB is querying port 0 of a CA. Could you have mismatched versions of modules, so that some were compiled with a different version of ? You could add printk calls to ipoib_add_one and mthca_register_device and make sure that they see the same value of node_type for the struct ib_device (make sure to add the printk to mthca_register_device after the place where it assigns the node_type field). - R. From ftillier at silverstorm.com Thu Oct 20 11:07:20 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 20 Oct 2005 11:07:20 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <4357D2DC.4040600@ichips.intel.com> Message-ID: <001701c5d5a1$1e0d82a0$9e5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Thursday, October 20, 2005 10:25 AM > > If we use an IBTA assigned service ID, I think that this can be > defined without using a reserved bit or changing a version number. > The two possible implementations that I see are using a single > service ID, or mapping port numbers to a range of assigned service > IDs. I would personally rather see a reserved bit get used. Imagine a system that has two protocols installed that use IP addressing. 
That system might want to have different apps listening on the same port number over both, even though the protocols are different. Having a reserved bit in the REQ indicate the presence of IP addressing information (including source and destination port numbers) in the private data seems most flexible to me. - Fab From rolandd at cisco.com Thu Oct 20 11:09:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 20 Oct 2005 11:09:24 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <001701c5d5a1$1e0d82a0$9e5aa8c0@infiniconsys.com> (Fab Tillier's message of "Thu, 20 Oct 2005 11:07:20 -0700") References: <001701c5d5a1$1e0d82a0$9e5aa8c0@infiniconsys.com> Message-ID: <52oe5kdprf.fsf@cisco.com> Fab> I would personally rather see a reserved bit get used. Fab> Imagine a system that has two protocols installed that use IP Fab> addressing. That system might want to have different apps Fab> listening on the same port number over both, even though the Fab> protocols are different. I disagree. The port number is part of an IP address, and it doesn't make sense to have two different services listening to the same port. You can't do it over TCP or iWARP, and I don't see any reason for IB to support this. - R. From mshefty at ichips.intel.com Thu Oct 20 11:11:17 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 11:11:17 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <001701c5d5a1$1e0d82a0$9e5aa8c0@infiniconsys.com> References: <001701c5d5a1$1e0d82a0$9e5aa8c0@infiniconsys.com> Message-ID: <4357DDC5.2080002@ichips.intel.com> Fab Tillier wrote: > I would personally rather see a reserved bit get used. Imagine a system that > has two protocols installed that use IP addressing. That system might want to > have different apps listening on the same port number over both, even though the > protocols are different. I don't think that this maps well to TCP. Apps need to listen on different ports. 
> Having a reserved bit in the REQ indicate the presence of IP addressing > information (including source and destination port numbers) in the private data > seems most flexible to me. How would a reserved bit help here? How does the CM know which app to give the request to? My preference is to use the service ID, with a mapping that looks like: (OPENIB_OUI << 48) + port number because that makes my job easier. :) - Sean From ftillier at silverstorm.com Thu Oct 20 11:19:45 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 20 Oct 2005 11:19:45 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <4357DDC5.2080002@ichips.intel.com> Message-ID: <001801c5d5a2$dba77450$9e5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Thursday, October 20, 2005 11:11 AM > > Fab Tillier wrote: > > I would personally rather see a reserved bit get used. > > Imagine a system that has two protocols installed that > > use IP addressing. That system might want to have different > > apps listening on the same port number over both, even though > > the protocols are different. > > I don't think that this maps well to TCP. Apps need to listen on > different ports. Are DAPL apps TCP apps? I thought they just wanted to use IP addresses for connection establishment, but weren't actual TCP apps. If DAPL apps aren't TCP apps, should they block usage of the TCP port from real TCP apps? > > Having a reserved bit in the REQ indicate the presence of IP > > addressing information (including source and destination port > > numbers) in the private data seems most flexible to me. > > How would a reserved bit help here? How does the CM know which > app to give the request to? Based on the ServiceID provided by the applications on both sides of the connection. 
- Fab From ftillier at silverstorm.com Thu Oct 20 11:22:52 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 20 Oct 2005 11:22:52 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <4357DDC5.2080002@ichips.intel.com> Message-ID: <001901c5d5a3$498fe600$9e5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Thursday, October 20, 2005 11:11 AM > > Fab Tillier wrote: > > I would personally rather see a reserved bit get used. > > Imagine a system that has two protocols installed that > > use IP addressing. That system might want to have different > > apps listening on the same port number over both, even though > > the protocols are different. > > I don't think that this maps well to TCP. Apps need to listen on > different ports. Are DAPL apps TCP apps? I thought they just wanted to use IP addresses for connection establishment, but weren't actual TCP apps. If DAPL apps aren't TCP apps, should they block usage of the TCP port from real TCP apps? > > Having a reserved bit in the REQ indicate the presence of IP > > addressing information (including source and destination port > > numbers) in the private data seems most flexible to me. > > How would a reserved bit help here? How does the CM know which > app to give the request to? Based on the ServiceID provided by the applications on both sides of the connection. > My preference is to use the service ID, with a mapping that looks like: > > (OPENIB_OUI << 48) + port number > > because that makes my job easier. :) I think having a range of service IDs defined for TCP applications makes sense. So for TCP apps, the port number would be encapsulated in the SID as you suggest, and non-TCP apps that want to use IP addresses for connection establishment wouldn't care about ports and would use their own SID. This eliminates the need to put the port numbers in the private data - only the source and destination IP addresses. 
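A minimal sketch of the port-to-service-ID mapping discussed above, under stated assumptions. The prefix value and every `example_` name are hypothetical placeholders, not assigned IBTA or OpenIB numbers or existing APIs. The formula in the thread is written as (OPENIB_OUI << 48) + port number; because a 24-bit OUI shifted left by 48 would not fit in a 64-bit service ID, the sketch treats the upper 48 bits as one opaque prefix and keeps the port in the low 16 bits.

```c
#include <stdint.h>

/* Hypothetical prefix for a "TCP port space" range of service IDs.
 * Placeholder value only; the low 16 bits are zero so a port fits there. */
#define EXAMPLE_SID_PREFIX    UINT64_C(0x0102030405060000)
#define EXAMPLE_SID_PORT_MASK UINT64_C(0xFFFF)

/* Build a 64-bit service ID from a 16-bit port number. */
static inline uint64_t example_port_to_sid(uint16_t port)
{
    return EXAMPLE_SID_PREFIX | (uint64_t)port;
}

/* Return 1 if the service ID lies inside the port-mapped range, else 0. */
static inline int example_sid_is_port_mapped(uint64_t sid)
{
    return (sid & ~EXAMPLE_SID_PORT_MASK) == EXAMPLE_SID_PREFIX;
}

/* Recover the port number; only meaningful if the range check passed. */
static inline uint16_t example_sid_to_port(uint64_t sid)
{
    return (uint16_t)(sid & EXAMPLE_SID_PORT_MASK);
}
```

With a mapping of this shape the CM can route on the service ID alone, and the private data would only need to carry the source and destination IP addresses.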
- Fab From caitlinb at broadcom.com Thu Oct 20 11:30:06 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 20 Oct 2005 11:30:06 -0700 Subject: [openib-general] Re: [swg] Re: private data... Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020A9B@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Fab Tillier > Sent: Thursday, October 20, 2005 11:20 AM > To: 'Sean Hefty' > Cc: swg at infinibandta.org; dat-discussions at yahoogroups.com; > openib-general at openib.org > Subject: RE: [openib-general] Re: [swg] Re: private data... > > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > Sent: Thursday, October 20, 2005 11:11 AM > > > > Fab Tillier wrote: > > > I would personally rather see a reserved bit get used. > > > Imagine a system that has two protocols installed that use IP > > > addressing. That system might want to have different > apps listening > > > on the same port number over both, even though the protocols are > > > different. > > > > I don't think that this maps well to TCP. Apps need to listen on > > different ports. > > Are DAPL apps TCP apps? I thought they just wanted to use IP > addresses for connection establishment, but weren't actual > TCP apps. If DAPL apps aren't TCP apps, should they block > usage of the TCP port from real TCP apps? > > > > Having a reserved bit in the REQ indicate the presence of IP > > > addressing information (including source and destination port > > > numbers) in the private data seems most flexible to me. > > > > How would a reserved bit help here? How does the CM know > which app to > > give the request to? > > Based on the ServiceID provided by the applications on both > sides of the connection. > The closest thing that you come to having "two services" on one TCP port would be iSER-style services. 
Full emulation of this would require establishing an SDP connection, using SDP to exchange messages to establish that the remote peer was RDMA capable, and then "converting" the socket to RDMA mode (i.e., disabling the SDP handling). Using RDMA to simulate sending of streaming mode messages for the purpose of determining whether the remote peer supports RDMA is at the minimum rather strange. The effort to benefit ratio on that one is far from convincing. Supporting transport neutral connection setup for applications that *know* they are using RDMA semantics is 90% of the benefit at way less than 90% of the effort. For what it's worth the SCTP adaptation of iWARP does *not* support this TCP feature. So if an IP protocol doesn't think it is worth emulating then why should IB do it? The transport neutral approach is to say that the application determines whether RDMA is supported. If it knows RDMA is supported it establishes a connection to a TCP port that is advertised as supporting RDMA. If the network in use is IB then there is no need to send a wire message to establish that RDMA is supported. From krause at cup.hp.com Thu Oct 20 11:31:57 2005 From: krause at cup.hp.com (Michael Krause) Date: Thu, 20 Oct 2005 11:31:57 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <52oe5kdprf.fsf@cisco.com> References: <001701c5d5a1$1e0d82a0$9e5aa8c0@infiniconsys.com> <52oe5kdprf.fsf@cisco.com> Message-ID: <6.2.0.14.2.20051020113001.022f8d80@esmail.cup.hp.com> At 11:09 AM 10/20/2005, Roland Dreier wrote: > Fab> I would personally rather see a reserved bit get used. > Fab> Imagine a system that has two protocols installed that use IP > Fab> addressing. That system might want to have different apps > Fab> listening on the same port number over both, even though the > Fab> protocols are different. > >I disagree. The port number is part of an IP address, and it doesn't >make sense to have two different services listening to the same port. 
>You can't do it over TCP or iWARP, and I don't see any reason for IB >to support this. This is one of the reasons why there is a SDP port mapper defined for iWARP. The application listens on a defined service port but based on policy outside of the protocol, the application instance may be redirected to different IP address or port to transparently operate over a RDMA interconnect. So, the application listens on one port while the RDMA infrastructure transparently listens on a separate port and potentially IP address. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlentini at netapp.com Thu Oct 20 11:39:24 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 20 Oct 2005 14:39:24 -0400 (EDT) Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <001801c5d5a2$dba77450$9e5aa8c0@infiniconsys.com> References: <001801c5d5a2$dba77450$9e5aa8c0@infiniconsys.com> Message-ID: On Thu, 20 Oct 2005, Fab Tillier wrote: > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > Sent: Thursday, October 20, 2005 11:11 AM > > > > Fab Tillier wrote: > > > I would personally rather see a reserved bit get used. > > > Imagine a system that has two protocols installed that > > > use IP addressing. That system might want to have different > > > apps listening on the same port number over both, even though > > > the protocols are different. I don't understand what you mean by this. Do you want two apps listening on the same service id? > > I don't think that this maps well to TCP. Apps need to listen on > > different ports. > > Are DAPL apps TCP apps? DAPL doesn't mandate any particular network transport. However the API uses sockaddrs to hold network addresses. The specification says that these should contain IP addresses. > I thought they just wanted to use IP addresses for connection > establishment, but weren't actual TCP apps. If DAPL apps aren't TCP > apps, should they block usage of the TCP port from real TCP apps? 
They should not. > > > Having a reserved bit in the REQ indicate the presence of IP > > > addressing information (including source and destination port > > > numbers) in the private data seems most flexible to me. > > > > How would a reserved bit help here? How does the CM know which > > app to give the request to? > > Based on the ServiceID provided by the applications on both sides of the > connection. You'd need to add a parameter to specify whether or not the bit should be set to the call for listening on a service id, right? I like Sean's idea better. Have a well know service id or range of service ids on which this protocol is used. I think of it as a service running on top of the CM protocol for using IP addresses on native IB. I don't think it should be mandatory for every CM connection. james From hozer at hozed.org Thu Oct 20 11:44:36 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 20 Oct 2005 13:44:36 -0500 Subject: [openib-general] Re: ehca testing In-Reply-To: <52sluwdq1b.fsf@cisco.com> References: <20051020144020.GR30127@kalmia.hozed.org> <20051020150432.GS30127@kalmia.hozed.org> <52ach4f5ak.fsf@cisco.com> <20051020175603.GV30127@kalmia.hozed.org> <52sluwdq1b.fsf@cisco.com> Message-ID: <20051020184435.GW30127@kalmia.hozed.org> On Thu, Oct 20, 2005 at 11:03:28AM -0700, Roland Dreier wrote: > Troy> I've since found I have the same problem without hcad_mod. I > Troy> don't see any errors in dmesg except for: > > Troy> [ 7415.421699] mthca0: ib_query_pkey port 0 failed (ret = -22) > > It's strange that IPoIB is querying port 0 of a CA. Could you have > mismatched versions of modules, so that some were compiled with a > different version of ? > > You could add printk calls to ipoib_add_one and mthca_register_device > and make sure that they see the same value of node_type for the struct > ib_device (make sure to add the printk to mthca_register_device after > the place where it assigns the node_type field). 
> There must be some poorly specified dependency information in the kernel makefiles, since I just did a clean build and it loads fine now. I've been updating from subversion occasionally, and doing 'make modules'. Apparently I had an old object file that didn't get rebuilt. From Arkady.Kanevsky at netapp.com Thu Oct 20 11:52:29 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 20 Oct 2005 14:52:29 -0400 Subject: [swg] RE: [openib-general] Re: [swg] Re: private data... Message-ID: It may be beneficial to split the range of SID for one TCP port into several. That way all the "base TCP ports equivalent" will have the same mapping. Mapping to support SDP will use SDP assigned Service IDs. And IPoIB will have its own range similar to SDP for TCP ports. The RDMA "native" will use the "base" TCP ports. We can also add more ports for each of these categories to support multiple SIDs. But we can just as well use a portmapper for them. But transparent adaptable clients that support all 3 (or more) of the methods will have to try each of the assigned Service IDs using the appropriate transport to see if the server supports it. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Lentini, James > Sent: Thursday, October 20, 2005 2:39 PM > To: Fab Tillier > Cc: 'Sean Hefty'; swg at infinibandta.org; > dat-discussions at yahoogroups.com; openib-general at openib.org > Subject: [swg] RE: [openib-general] Re: [swg] Re: private data... > > > > > On Thu, 20 Oct 2005, Fab Tillier wrote: > > > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > > Sent: Thursday, October 20, 2005 11:11 AM > > > > > > Fab Tillier wrote: > > > > I would personally rather see a reserved bit get used. > Imagine a > > > > system that has two protocols installed that use IP > addressing.
> > > > That system might want to have different apps listening on the > > > > same port number over both, even though the protocols are > > > > different. > > I don't understand what you mean by this. Do you want two apps > listening on the same service id? > > > > I don't think that this maps well to TCP. Apps need to listen on > > > different ports. > > > > Are DAPL apps TCP apps? > > DAPL doesn't mandate any particular network transport. > However the API > uses sockaddrs to hold network addresses. The specification says that > these should contain IP addresses. > > > I thought they just wanted to use IP addresses for connection > > establishment, but weren't actual TCP apps. If DAPL apps > aren't TCP > > apps, should they block usage of the TCP port from real TCP apps? > > They should not. > > > > > Having a reserved bit in the REQ indicate the presence of IP > > > > addressing information (including source and destination port > > > > numbers) in the private data seems most flexible to me. > > > > > > How would a reserved bit help here? How does the CM know > which app > > > to give the request to? > > > > Based on the ServiceID provided by the applications on both > sides of > > the connection. > > You'd need to add a parameter to specify whether or not the > bit should > be set to the call for listening on a service id, right? > > I like Sean's idea better. Have a well know service id or range of > service ids on which this protocol is used. I think of it as > a service > running on top of the CM protocol for using IP addresses on > native IB. > I don't think it should be mandatory for every CM connection. > > james > From ftillier at silverstorm.com Thu Oct 20 12:00:23 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 20 Oct 2005 12:00:23 -0700 Subject: [openib-general] Re: [swg] Re: private data... 
In-Reply-To: Message-ID: <001c01c5d5a8$89aa7980$9e5aa8c0@infiniconsys.com> > From: James Lentini [mailto:jlentini at netapp.com] > Sent: Thursday, October 20, 2005 11:39 AM > > I like Sean's idea better. Have a well know service id or range of > service ids on which this protocol is used. I think of it as a service > running on top of the CM protocol for using IP addresses on native IB. > I don't think it should be mandatory for every CM connection. The well known service ID implies that a DAPL application *would* prevent a TCP application from using a particular port, which seems to conflict with your statement that DAPL apps shouldn't prevent TCP apps from working. That's not to say you couldn't have one range of service IDs for TCP applications, and another range for DAPL applications, and yet another range per protocol or application that wishes to use IP addressing during connection establishment. However, this doesn't extend the CM protocol, but just creates an ad-hoc group of protocols that happen to define the first 32 bytes of their private data similarly. Having a bit in the CM REQ indicate whether the first 32 bytes of private data contain the source and destination IP addresses allows any app using any service ID to use IP addresses as source and destination identifiers regardless of what protocol they actually use once the connection is established. Defining service ID ranges for particular protocols then becomes the responsibility of the organizations defining such protocols and the owner of the OUI with which the service ID ranges are defined, and is outside the scope of the IBTA. - Fab From caitlinb at broadcom.com Thu Oct 20 11:59:56 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 20 Oct 2005 11:59:56 -0700 Subject: [openib-general] Re: [swg] Re: private data... Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020A9D@NT-SJCA-0751.brcm.ad.broadcom.com> > > > > Are DAPL apps TCP apps? > > DAPL doesn't mandate any particular network transport.
> However the API uses sockaddrs to hold network addresses. The > specification says that these should contain IP addresses. > > > I thought they just wanted to use IP addresses for connection > > establishment, but weren't actual TCP apps. If DAPL apps > > aren't TCP apps, should they block usage of the TCP port > > from real TCP apps? > > They should not. > DAPL defines two connection models. One is presumed to be transport neutral and assumes RDMA capability. There is one exchange of private data to enable configuring RDMA properly. This is the original mapping, and what should have priority in Connection Management discussions. It is the only mode currently proposed for iWARP support because the other modes bring up more stack integration issues. There is a second mode where a SOCK_STREAM handle can be used to exchange streaming mode messages (with TCP semantics) and then either continue in streaming mode or convert to RDMA mode. That one is not worth handling in a transport neutral manner. When the transport neutral model is mapped to iWARP the TCP port is indeed pre-empted. But that does not mean that the InfiniBand mapping must do the same. The only thing that needs to be guaranteed is that a DAPL client (or any RDMA client that assumes RDMA mode) will be able to request a connection without risk that the connection request will be answered by a streaming mode server. iWARP solves that with the MPA Request/Response exchange. But transport neutral applications do not need to know what mechanism is used to prevent streaming/RDMA mode mismatches. There is also no need to guarantee that there is no streaming mode service using the same conceptual TCP port number, but it is one way to ensure that the connection requests cannot be misdirected. From mshefty at ichips.intel.com Thu Oct 20 12:11:01 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 12:11:01 -0700 Subject: [openib-general] Re: [swg] Re: private data...
In-Reply-To: <001c01c5d5a8$89aa7980$9e5aa8c0@infiniconsys.com> References: <001c01c5d5a8$89aa7980$9e5aa8c0@infiniconsys.com> Message-ID: <4357EBC5.1060906@ichips.intel.com> Fab Tillier wrote: > That's not to say you couldn't have one range of service IDs for TCP > applications, and another range for DAPL applications, and yet another range per > protocol or application that wishes to use IP addressing during connection > establishment. However, this doesn't extend the CM protocol, but just creates > an ad-hoc group of protocols that happen to define the first 32-bytes of their > private data similarly. If applications map their "port" numbers to different service IDs, then there's no need to define the private data at all. The CM can perform its job without changes and route based purely on service IDs. The only reason to use a reserve bit or change the version is if the CM needs to look into the private data. The definition of private data is an issue for an upper level connection manager. My hope is that this can be defined such that the upper level connection manager can support multiple transports, so I don't have to build an upper level upper level connection manager. Eventually an application that uses or pretends to use a port number must deal with the fact that another application may want to use that same number. For applications that are transport neutral, this is a problem. For applications that aren't transport neutral, they can use the native addressing for their specific transport. > Having a bit in the CM REQ indicate whether the first 32-bytes of private data > contain the source and destination IP addresses allows any app using any service > ID to use IP addresses as source and destination identifiers regardless of what > protocol they actually use once the connection is established. What does the CM do with this bit? 
- Sean From mshefty at ichips.intel.com Thu Oct 20 12:18:30 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 12:18:30 -0700 Subject: [swg] RE: [openib-general] Re: [swg] Re: private data... In-Reply-To: References: Message-ID: <4357ED86.7090200@ichips.intel.com> Kanevsky, Arkady wrote: > It may be benefitial to split the range of SID for one TCP port into > several. > That way the all "base TCP ports equivalent" will have the same > mapping. > Mapping to support SDP will use SDP assigned Service IDs. > And IPoIB will have its own range similar to SDP for TCP ports. > The RDMA "native" will use the "base" TCP ports. > We can also add more ports for each of these categeories > to support multiple SIDs. But we as well > can use portmapper for them. > > But transparent adoptable clients that support all 3 (or more of the > methods) > will have to try each of the assigned Service IDs using appropriate > transport > to see if server supports it. > Arkady An application that wants to connect to destination port 53 is expecting a particular application. The benefit of letting the application connect to different applications depending on a transport that may be selected by some underlying software is questionable to me. If we're going to define a protocol that passes TCP/IP addresses in private data, then the addresses should behave as close to TCP port numbers as possible. Two CM REQs that contain the same address should expect to reach the same destination. - Sean From tom at opengridcomputing.com Thu Oct 20 12:45:01 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 20 Oct 2005 14:45:01 -0500 Subject: [openib-general] rdma_bind_addr question Message-ID: <1129837501.10748.9.camel@trinity.austin.ammasso.com> Sean: I'm looking at the CMA code from the perspective of adding iWARP support and there may be an issue relative to identifying the transport given a net_device ptr.
For example, rdma_bind_addr calls ib_translate_addr which in turn calls ip_dev_find to map an IP address to a net_device ptr. Right now the code (ib_translate_addr) seems to assume that the device is for an IPoIB device. Going forward, how do we know whether the underlying net_device is for an IPoIB device, an iWARP device, or a dumb Ethernet device? For the first two, we will take one of two paths in the CMA, for the dumb Ethernet device I presume we will return an error. Ideas? Tom From mshefty at ichips.intel.com Thu Oct 20 12:35:34 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 12:35:34 -0700 Subject: [openib-general] Re: rdma_bind_addr question In-Reply-To: <1129837501.10748.9.camel@trinity.austin.ammasso.com> References: <1129837501.10748.9.camel@trinity.austin.ammasso.com> Message-ID: <4357F186.1070407@ichips.intel.com> Tom Tucker wrote: > Right now the code (ib_translate_addr) seems to assume that the device > is for an IPoIB device. Going forward, how do we know whether the > underlying net_device is for an IPoIB device, an iWARP device, or a dumb > Ethernet device? The code does assume this currently. The code should check that the net_device type = ARPHRD_INFINIBAND. Without this check, the CMA will simply error out later when mapping the returned address to a GID. > For the first two, we will take one of two paths in the CMA, for the > dumb Ethernet device I presume we will return an error. I guess the proper thing to return is -ENODEV from ib_translate_addr() if we have the wrong device type. The CMA could then check iWarp devices. 
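The device-type check described above can be sketched in userspace C as follows. This is an illustration under assumptions, not the actual ib_translate_addr() code: `struct ex_net_device` and `ex_check_ib_device()` are hypothetical stand-ins, and the ARPHRD values are copied from Linux's <linux/if_arp.h> so the sketch builds outside the kernel tree.

```c
#include <errno.h>

/* ARPHRD_* hardware types as defined in <linux/if_arp.h>; redefined here
 * under example names so the sketch compiles outside the kernel. */
#define EX_ARPHRD_ETHER      1
#define EX_ARPHRD_INFINIBAND 32

/* Stand-in for the single field of struct net_device the check needs. */
struct ex_net_device {
    unsigned short type;   /* ARPHRD_* hardware type */
};

/* Hypothetical helper: accept only IPoIB devices, else return -ENODEV. */
static int ex_check_ib_device(const struct ex_net_device *dev)
{
    if (!dev)
        return -ENODEV;                 /* no device found for the address */
    if (dev->type != EX_ARPHRD_INFINIBAND)
        return -ENODEV;                 /* wrong device type, e.g. Ethernet */
    return 0;
}
```

A caller that receives -ENODEV from the IB path could then go on to try other transports, which is where an iWARP lookup would fit.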
- Sean From mshefty at ichips.intel.com Thu Oct 20 12:37:49 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 12:37:49 -0700 Subject: [openib-general] Re: rdma_bind_addr question In-Reply-To: <4357F186.1070407@ichips.intel.com> References: <1129837501.10748.9.camel@trinity.austin.ammasso.com> <4357F186.1070407@ichips.intel.com> Message-ID: <4357F20D.2080007@ichips.intel.com> Sean Hefty wrote: > net_device type = ARPHRD_INFINIBAND. Without this check, the CMA will > simply error out later when mapping the returned address to a GID. ^^^^^^^^^error out when looking up the returned "GID". From tom at opengridcomputing.com Thu Oct 20 13:03:23 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 20 Oct 2005 15:03:23 -0500 Subject: [openib-general] Re: rdma_bind_addr question In-Reply-To: <4357F186.1070407@ichips.intel.com> References: <1129837501.10748.9.camel@trinity.austin.ammasso.com> <4357F186.1070407@ichips.intel.com> Message-ID: <1129838603.10748.19.camel@trinity.austin.ammasso.com> Cool, that's the answer I was looking for. On Thu, 2005-10-20 at 12:35 -0700, Sean Hefty wrote: > Tom Tucker wrote: > > Right now the code (ib_translate_addr) seems to assume that the device > > is for an IPoIB device. Going forward, how do we know whether the > > underlying net_device is for an IPoIB device, an iWARP device, or a dumb > > Ethernet device? > > The code does assume this currently. The code should check that the net_device > type = ARPHRD_INFINIBAND. Without this check, the CMA will simply error out > later when mapping the returned address to a GID. > > > For the first two, we will take one of two paths in the CMA, for the > > dumb Ethernet device I presume we will return an error. > > I guess the proper thing to return is -ENODEV from ib_translate_addr() if we > have the wrong device type. The CMA could then check iWarp devices. 
> > - Sean From ttucker at es335.com Thu Oct 20 13:05:28 2005 From: ttucker at es335.com (Tom Tucker) Date: Thu, 20 Oct 2005 15:05:28 -0500 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <001c01c5d5a8$89aa7980$9e5aa8c0@infiniconsys.com> References: <001c01c5d5a8$89aa7980$9e5aa8c0@infiniconsys.com> Message-ID: <1129838728.10748.23.camel@trinity.austin.ammasso.com> If TCP/IP addresses and port numbers are being used to identify hosts and services then IMHO there should be no ambiguity (read overlap) between these port numbers and IP addresses on different transports. This means, specifically, that if an IB application is listening on a host on port number 600, then this port number/IP address pair is CONSUMED. Having another application listening on the same port over the "native" stack should not happen and if it does chaos will absolutely ensue. This also applies to iWARP/IP applications. Note that the same port number on DIFFERENT IP addresses is absolutely fine. In fact, with CMA it will be possible for a service (e.g. NFS) to listen on a given port on an IB, an iWARP, AND a "native" TCP interface all at the same time. No problem because the IP addresses are different. Note that without integration with the host stack the implementation cannot enforce this, but we should assume that the management tools and/or documentation will. Otherwise, we end up down an incredibly deep rathole -- and it's dark down there. On Thu, 2005-10-20 at 12:00 -0700, Fab Tillier wrote: > > From: James Lentini [mailto:jlentini at netapp.com] > > Sent: Thursday, October 20, 2005 11:39 AM > > > > I like Sean's idea better. Have a well know service id or range of > > service ids on which this protocol is used. I think of it as a service > > running on top of the CM protocol for using IP addresses on native IB. > > I don't think it should be mandatory for every CM connection.
> > The well known service ID implies that a DAPL application *would* prevent a TCP > application from using a particular port, which seems to conflict your statement > that DAPL apps shouldn't prevent TCP apps from working. > > That's not to say you couldn't have one range of service IDs for TCP > applications, and another range for DAPL applications, and yet another range per > protocol or application that wishes to use IP addressing during connection > establishment. However, this doesn't extend the CM protocol, but just creates > an ad-hoc group of protocols that happen to define the first 32-bytes of their > private data similarly. > > Having a bit in the CM REQ indicate whether the first 32-bytes of private data > contain the source and destination IP addresses allows any app using any service > ID to use IP addresses as source and destination identifiers regardless of what > protocol they actually use once the connection is established. > > Defining service ID ranges for particular protocols then becomes the > responsibility of the organizations defining such protocols and the owner of the > OUI with which the service ID ranges are defined, and is outside the scope of > the IBTA. > > - Fab > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tom at opengridcomputing.com Thu Oct 20 13:08:00 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 20 Oct 2005 15:08:00 -0500 Subject: [openib-general] Re: rdma_bind_addr question In-Reply-To: <4357F186.1070407@ichips.intel.com> References: <1129837501.10748.9.camel@trinity.austin.ammasso.com> <4357F186.1070407@ichips.intel.com> Message-ID: <1129838880.10748.25.camel@trinity.austin.ammasso.com> BTW -- I think this means that we need an ARPHRD_IWARP type. 
On Thu, 2005-10-20 at 12:35 -0700, Sean Hefty wrote: > Tom Tucker wrote: > > Right now the code (ib_translate_addr) seems to assume that the device > > is for an IPoIB device. Going forward, how do we know whether the > > underlying net_device is for an IPoIB device, an iWARP device, or a dumb > > Ethernet device? > > The code does assume this currently. The code should check that the net_device > type = ARPHRD_INFINIBAND. Without this check, the CMA will simply error out > later when mapping the returned address to a GID. > > > For the first two, we will take one of two paths in the CMA, for the > > dumb Ethernet device I presume we will return an error. > > I guess the proper thing to return is -ENODEV from ib_translate_addr() if we > have the wrong device type. The CMA could then check iWarp devices. > > - Sean From jlentini at netapp.com Thu Oct 20 12:59:16 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 20 Oct 2005 15:59:16 -0400 (EDT) Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <001c01c5d5a8$89aa7980$9e5aa8c0@infiniconsys.com> References: <001c01c5d5a8$89aa7980$9e5aa8c0@infiniconsys.com> Message-ID: On Thu, 20 Oct 2005, Fab Tillier wrote: > > From: James Lentini [mailto:jlentini at netapp.com] > > Sent: Thursday, October 20, 2005 11:39 AM > > > > I like Sean's idea better. Have a well known service id or range of > > service ids on which this protocol is used. I think of it as a service > > running on top of the CM protocol for using IP addresses on native IB. > > I don't think it should be mandatory for every CM connection. > > The well known service ID implies that a DAPL application *would* > prevent a TCP application from using a particular port, which seems > to conflict with your statement that DAPL apps shouldn't prevent TCP apps > from working. I don't understand what you mean by TCP application.
I assumed you meant an application that uses the Berkeley sockets API to communicate over TCP, but I see now that is not what you meant. This IBTA proposal does not involve any interactions with the TCP protocol stack. > That's not to say you couldn't have one range of service IDs for TCP > applications, What do you mean by "TCP applications" in this context? > and another range for DAPL applications, I don't see a reason why DAPL applications couldn't take advantage of the services being provided by the proposed protocol. > and yet another range per protocol or application that wishes to use > IP addressing during connection establishment. How are the applications in this group different from the "TCP applications" above? > However, this doesn't extend the CM protocol, but just creates an > ad-hoc group of protocols that happen to define the first 32-bytes > of their private data similarly. > > Having a bit in the CM REQ indicate whether the first 32-bytes of > private data contain the source and destination IP addresses allows > any app using any service ID to use IP addresses as source and > destination identifiers regardless of what protocol they actually > use once the connection is established. For a particular protocol, I would expect this addressing service either to be used or not used. I can't envision a situation where you would want the protocol to use this service in some situations and not use the service in others. If multiple protocols are going to be using the same service id (sometimes a server for protocol X is listening on service ID Z, sometimes a server for protocol Y is listening on service ID Z,...) and their use of this service isn't consistent, then I agree that the CM bit solves this problem. > Defining service ID ranges for particular protocols then becomes the > responsibility of the organizations defining such protocols and the > owner of the OUI with which the service ID ranges are defined, and > is outside the scope of the IBTA.
This is a good benefit. I still think viewing this as a new service that uses a well known service id is cleaner. Then the addressing protocol and CM protocol aren't tied together. From ftillier at silverstorm.com Thu Oct 20 13:05:10 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 20 Oct 2005 13:05:10 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <4357EBC5.1060906@ichips.intel.com> Message-ID: <001d01c5d5b1$947b1870$9e5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Thursday, October 20, 2005 12:11 PM > > Fab Tillier wrote: > > That's not to say you couldn't have one range of service IDs for TCP > > applications, and another range for DAPL applications, and yet another > > range per protocol or application that wishes to use IP addressing > > during connection establishment. However, this doesn't extend the > > CM protocol, but just creates an ad-hoc group of protocols that happen > > to define the first 32-bytes of their private data similarly. > > If applications map their "port" numbers to different service IDs, then > there's no need to define the private data at all. The CM can perform > its job without changes and route based purely on service IDs. The only > reason to use a reserve bit or change the version is if the CM needs to > look into the private data. > > The definition of private data is an issue for an upper level connection > manager. My hope is that this can be defined such that the upper level > connection manager can support multiple transports, so I don't have to > build an upper level upper level connection manager. My understanding was that we want the IBTA to add a section in the IB spec to define this higher-level connection management protocol, specifically the use of the first 32-bytes of the private data in the REQ to contain the source and destination IP addresses associated with the source and destination GIDs in the primary and alternate paths. 
If that's not the case, then why is the IBTA SW working group involved here? Why do they care? If my understanding is correct, the bit would have meaning to this higher-level connection management protocol, and not to the lower level IB connection management protocol. Defining a range of service IDs for protocols that use this feature creates a bound group that then requires a rev of the spec anytime someone else wants in on the fun. I think defining the higher level protocol without restricting the scope of service IDs would be beneficial. > > Having a bit in the CM REQ indicate whether the first 32-bytes of > > private data contain the source and destination IP addresses allows > > any app using any service ID to use IP addresses as source and > > destination identifiers regardless of what protocol they actually > > use once the connection is established. > > What does the CM do with this bit? The IB CM does nothing. A higher-level, IP addressing aware CM protocol defined by the IBTA would. If a connection request comes in on a particular SID handled by the higher level CM and doesn't have the bit set, then the request should be rejected as malformed. If the bit is set, the higher level CM could check that the source and destination IP addresses provided match the GIDs specified in the primary and alternate paths. - Fab From mshefty at ichips.intel.com Thu Oct 20 13:09:29 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 13:09:29 -0700 Subject: [openib-general] Re: rdma_bind_addr question In-Reply-To: <1129838880.10748.25.camel@trinity.austin.ammasso.com> References: <1129837501.10748.9.camel@trinity.austin.ammasso.com> <4357F186.1070407@ichips.intel.com> <1129838880.10748.25.camel@trinity.austin.ammasso.com> Message-ID: <4357F979.9030105@ichips.intel.com> Tom Tucker wrote: > BTW -- I think this means that we need an ARPHRD_IWARP type. 
It may be that the CMA can simply look in its local device list for an ib_device that has a given MAC address. I don't know the detail of how iWarp will work with this. Will iWarp need a call similar to ib_translate_addr() to translate an IP address into a MAC address? Are the MAC addresses stored with the ib_device somehow? - Sean From rolandd at cisco.com Thu Oct 20 13:17:44 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 20 Oct 2005 13:17:44 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <001d01c5d5b1$947b1870$9e5aa8c0@infiniconsys.com> (Fab Tillier's message of "Thu, 20 Oct 2005 13:05:10 -0700") References: <001d01c5d5b1$947b1870$9e5aa8c0@infiniconsys.com> Message-ID: <52k6g8djtj.fsf@cisco.com> Fab> My understanding was that we want the IBTA to add a section Fab> in the IB spec to define this higher-level connection Fab> management protocol, specifically the use of the first Fab> 32-bytes of the private data in the REQ to contain the source Fab> and destination IP addresses associated with the source and Fab> destination GIDs in the primary and alternate paths. Yes, but there's no point in doing this unless there's a defined range of service IDs to map TCP ports onto. If every protocol needs to define its own service ID mapping, then the protocol might as well define how it uses the IB CM private data to carry IP addressing info. This is exactly what SDP does today. However, this solution is apparently not acceptable for NFS/RDMA. Hence the current discussion. - R. 
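[Editor's note] The two pieces debated in this thread, a fixed 32-byte address header at the start of the CM REQ private data and a reserved service ID range onto which TCP-style port numbers are mapped, can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the struct name `ip_cm_hdr`, the prefix `IP_CM_SID_PREFIX`, the 16-bytes-per-address layout, and all helper names are hypothetical and are not taken from the IB specification, the IBTA proposal, or any OpenIB source file.

```c
/*
 * Illustrative sketch only. Every name and constant here is hypothetical;
 * the thread does not define a layout or a service ID range.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the first 32 bytes of REQ private data:
 * source address then destination address, 16 bytes each, so that an
 * IPv6 address fits and an IPv4 address occupies the last 4 bytes. */
struct ip_cm_hdr {
    uint8_t src_addr[16];
    uint8_t dst_addr[16];
};

_Static_assert(sizeof(struct ip_cm_hdr) == 32, "header must be 32 bytes");

/* Hypothetical service ID prefix; a real range would be assigned by the IBTA. */
#define IP_CM_SID_PREFIX    0x0000000001000000ULL
#define IP_CM_SID_PORT_MASK 0xFFFFULL

/* Map a 16-bit port number into the assumed service ID range. */
static inline uint64_t ip_cm_port_to_sid(uint16_t port)
{
    return IP_CM_SID_PREFIX | (uint64_t)port;
}

/* Reverse mapping: returns 1 and stores the port if sid lies in the
 * assumed range, returns 0 (leaving *port untouched) otherwise. */
static inline int ip_cm_sid_to_port(uint64_t sid, uint16_t *port)
{
    if ((sid & ~IP_CM_SID_PORT_MASK) != IP_CM_SID_PREFIX)
        return 0;
    *port = (uint16_t)(sid & IP_CM_SID_PORT_MASK);
    return 1;
}

/* Fill the header with two IPv4 addresses given as 4 bytes each in
 * network byte order; the unused leading bytes are zeroed. */
static inline void ip_cm_fill_ipv4(struct ip_cm_hdr *hdr,
                                   const uint8_t src[4],
                                   const uint8_t dst[4])
{
    assert(hdr != NULL);
    memset(hdr, 0, sizeof(*hdr));
    memcpy(&hdr->src_addr[12], src, 4);
    memcpy(&hdr->dst_addr[12], dst, 4);
}
```

With a mapping like this the CM itself still routes purely on service ID, and only the listener on a service ID in the reserved range would interpret the first 32 bytes of private data. Whether a separate bit in the REQ is also needed is exactly the question the thread leaves open.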
From tom at opengridcomputing.com Thu Oct 20 13:46:04 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 20 Oct 2005 15:46:04 -0500 Subject: [openib-general] Re: rdma_bind_addr question In-Reply-To: <4357F979.9030105@ichips.intel.com> References: <1129837501.10748.9.camel@trinity.austin.ammasso.com> <4357F186.1070407@ichips.intel.com> <1129838880.10748.25.camel@trinity.austin.ammasso.com> <4357F979.9030105@ichips.intel.com> Message-ID: <1129841164.10748.37.camel@trinity.austin.ammasso.com> On Thu, 2005-10-20 at 13:09 -0700, Sean Hefty wrote: > Tom Tucker wrote: > > BTW -- I think this means that we need an ARPHRD_IWARP type. > > It may be that the CMA can simply look in its local device list for an ib_device > that has a given MAC address. The issues (I think) are a) being able to appropriately fail a bad IP address (i.e. not an RDMA device), b) interpreting the addresses in the net_device structure, and c) muxing the request to either the IB_CM or IW_CM respectively. Unfortunately, I'm still working through the design and don't have it all figured out yet. I will absolutely be posting preliminary patches for comment and review. Thanks, > I don't know the detail of how iWarp will work with this. Will iWarp need a > call similar to ib_translate_addr() to translate an IP address into a MAC > address? Are the MAC addresses stored with the ib_device somehow? > > - Sean From ftillier at silverstorm.com Thu Oct 20 13:25:52 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 20 Oct 2005 13:25:52 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: Message-ID: <001e01c5d5b4$78dd0f30$9e5aa8c0@infiniconsys.com> > From: James Lentini [mailto:jlentini at netapp.com] > Sent: Thursday, October 20, 2005 12:59 PM > > On Thu, 20 Oct 2005, Fab Tillier wrote: > > > > From: James Lentini [mailto:jlentini at netapp.com] > > > Sent: Thursday, October 20, 2005 11:39 AM > > > > > > I like Sean's idea better. 
Have a well known service id or range of > > > service ids on which this protocol is used. I think of it as a service > > > running on top of the CM protocol for using IP addresses on native IB. > > > I don't think it should be mandatory for every CM connection. > > > > The well known service ID implies that a DAPL application *would* > > prevent a TCP application from using a particular port, which seems > > to conflict with your statement that DAPL apps shouldn't prevent TCP apps > > from working. > > I don't understand what you mean by TCP application. I assumed you > meant an application that uses the Berkeley sockets API to communicate > over TCP, but I see now that is not what you meant. This IBTA > proposal does not involve any interactions with the TCP protocol > stack. I meant a TCP application that was re-routed over IB through the use of some protocol (SDP-like). SDP itself isn't a good example because it already handles the IP addressing issues itself in the hello message. > > That's not to say you couldn't have one range of service IDs for TCP > > applications, > > What do you mean by "TCP applications" in this context? Applications that expect TCP-like behavior with respect to IP address and port usage. > > and another range for DAPL applications, > > I don't see a reason why DAPL applications couldn't take advantage of > the services being provided by the proposed protocol. It depends on whether DAPL expects to consume a full TCP address (IP+port), or is just using the IP addresses to 'facilitate' connection establishment. > > and yet another range per protocol or application that wishes to use > > IP addressing during connection establishment. > > How are the applications in this group different from the "TCP > applications" above? An application may wish to use IP addresses (without port numbers) to allow users to easily specify addressing information in a way they are familiar with.
However, such an application may not care about the port number at all, and there's no need to force it to claim a port (and thus prevent someone who cares about port numbers from getting one). DAPL to me fell into this category, but maybe it falls into the "TCP" category. > > However, this doesn't extend the CM protocol, but just creates an > > ad-hoc group of protocols that happen to define the first 32-bytes > > of their private data similarly. > > > > Having a bit in the CM REQ indicate whether the first 32-bytes of > > private data contain the source and destination IP addresses allows > > any app using any service ID to use IP addresses as source and > > destination identifiers regardless of what protocol they actually > > use once the connection is established. > > For a particular protocol, I would expect this addressing service > either to be used or not used. I can't envision a situation where you > would want the protocol to use this service in some situations and not > use the service in others. Using a bit allows the protocol to be used independently of the service ID, allowing any client, using any service ID, to use the facility if it so desires. I wasn't advocating allowing arbitrary use of the protocol with any given service ID, and I agree with you that the protocol would be either used or not given a particular service ID. > If multiple protocols are going to be using the same service id (sometimes > a server for protocol X is listening on service ID Z, sometimes > a server for protocol Y is listening on service ID Z,...) and their > use of this service isn't consistent, then I agree that the CM bit > solves this problem. The CM bit allows protocol usage to be clear and independent of service ID. It comes down to whether we want to tie protocol use with a set of SIDs, rather than defining a protocol generically, and just tying SID usage to protocol use.
- Fab From mshefty at ichips.intel.com Thu Oct 20 13:30:10 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 13:30:10 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <001d01c5d5b1$947b1870$9e5aa8c0@infiniconsys.com> References: <001d01c5d5b1$947b1870$9e5aa8c0@infiniconsys.com> Message-ID: <4357FE52.2040505@ichips.intel.com> Fab Tillier wrote: > If my understanding is correct, the bit would have meaning to this higher-level > connection management protocol, and not to the lower level IB connection > management protocol. Defining a range of service IDs for protocols that use > this feature creates a bound group that then requires a rev of the spec anytime > someone else wants in on the fun. I think defining the higher level protocol > without restricting the scope of service IDs would be beneficial. I'll use the first bit in the 2nd byte in my service ID to indicate this then. That bit's reserved. :) Using a bit in the REQ means that the higher level connection management protocol needs to receive and process CM REQs. How does the REQ get routed to the higher level CM? If it's based on service ID, then why is the bit needed at all? If I'm routing based on this bit, then I could just as easily define this protocol to exist on a single service ID, and still route on service ID. The upper level CM can then demultiplex to the correct application based on the addresses found in the private data. Using a reserved bit is essentially adding a 65th bit to the service ID. In any case, I don't see how defining this private data format without specifying which service IDs use it is all that useful. - Sean From ftillier at silverstorm.com Thu Oct 20 13:34:39 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 20 Oct 2005 13:34:39 -0700 Subject: [openib-general] Re: [swg] Re: private data... 
In-Reply-To: <52k6g8djtj.fsf@cisco.com> Message-ID: <001f01c5d5b5$b3b99cd0$9e5aa8c0@infiniconsys.com> > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Thursday, October 20, 2005 1:18 PM > > Fab> My understanding was that we want the IBTA to add a section > Fab> in the IB spec to define this higher-level connection > Fab> management protocol, specifically the use of the first > Fab> 32-bytes of the private data in the REQ to contain the source > Fab> and destination IP addresses associated with the source and > Fab> destination GIDs in the primary and alternate paths. > > Yes, but there's no point in doing this unless there's a defined range > of service IDs to map TCP ports onto. If every protocol needs to > define its own service ID mapping, then the protocol might as well > define how it uses the IB CM private data to carry IP addressing info. > This is exactly what SDP does today. However, this solution is > apparently not acceptable for NFS/RDMA. Hence the current discussion. I'm not saying we shouldn't define a range of service IDs, I'm questioning whether we should restrict the use of this protocol to just the defined range of service IDs. I think there's a benefit in having different protocols use a well-established and defined way of mapping IP addresses to IB. I'd like to see us define the protocol independent of the service ID. We can then establish a service ID range to be used with this protocol for NFS/RDMA, or for more generic TCP mappings, but these are two different issues to me. - Fab From mshefty at ichips.intel.com Thu Oct 20 13:39:59 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 13:39:59 -0700 Subject: [openib-general] Re: [swg] Re: private data... 
In-Reply-To: <001f01c5d5b5$b3b99cd0$9e5aa8c0@infiniconsys.com> References: <001f01c5d5b5$b3b99cd0$9e5aa8c0@infiniconsys.com> Message-ID: <4358009F.2080505@ichips.intel.com> Fab Tillier wrote: > I'd like to see us define the protocol independent of the service ID. We can > then establish a service ID range to be used with this protocol for NFS/RDMA, or > for more generic TCP mappings, but these are two different issues to me. But the protocol (if you define a private data format as a protocol) has no meaning to the CM. It only has meaning to the application that's listening on the service ID. Using a reserved bit in the REQ mixes the CM's protocol (which is to process REQs, REPs, etc.) with that of the application. - Sean From ftillier at silverstorm.com Thu Oct 20 13:44:58 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 20 Oct 2005 13:44:58 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <4357FE52.2040505@ichips.intel.com> Message-ID: <002001c5d5b7$23f30df0$9e5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Thursday, October 20, 2005 1:30 PM > > Using a bit in the REQ means that the higher level connection management > protocol needs to receive and process CM REQs. How does the REQ get routed to > the higher level CM? If it's based on service ID, then why is the bit needed > at all? If I'm routing based on this bit, then I could just as easily define > this > protocol to exist on a single service ID, and still route on service ID. The > upper level CM can then demultiplex to the correct application based on the > addresses found in the private data. > > Using a reserved bit is essentially adding a 65th bit to the service ID. I disagree. Using a reserved bit indicates that the first 32-bytes of private data have a known format and can be evaluated by an entity shared by multiple clients (the CMA). 
The service ID on the other hand indicates what protocol is implemented over the connection once it is established. > In any case, I don't see how defining this private data format without > specifying which service IDs use it is all that useful. You can do both, but I think they are separate. The protocol can be useful outside the scope of DAPL or NFS/RDMA. WSD could use it, and then use a higher-level CM to do all the IP to IB path management rather than duplicating it. - Fab From ftillier at silverstorm.com Thu Oct 20 13:52:16 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Thu, 20 Oct 2005 13:52:16 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <4358009F.2080505@ichips.intel.com> Message-ID: <002101c5d5b8$28703c30$9e5aa8c0@infiniconsys.com> > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Thursday, October 20, 2005 1:40 PM > > Fab Tillier wrote: > > I'd like to see us define the protocol independent of the service ID. > > We can then establish a service ID range to be used with this protocol > > for NFS/RDMA, or for more generic TCP mappings, but these are two > > different issues to me. > > But the protocol (if you define a private data format as a protocol) has no > meaning to the CM. It only has meaning to the application that's listening on > the service ID. The same can be said of the starting local QPN, responder resource, initiator depth, starting PSN, MTU, and so forth. The CM doesn't care about these - the application does, as these settings affect how it configures its QP and what features of its protocol it can use. > Using a reserved bit in the REQ mixes the CM's protocol > (which is to process REQs, REPs, etc.) with that of the application. There are a number of fields that are not used by the CM state machine that are included in these MADs already. 
These fields are defined in the CM protocol not because they impact MAD processing in the CM, but because they represent minimum information needed to configure a QP and client. - Fab From sinate at yahoo.com Thu Oct 20 14:01:33 2005 From: sinate at yahoo.com (Steven Wooding) Date: Thu, 20 Oct 2005 22:01:33 +0100 (BST) Subject: [openib-general] Support for UC connections using the CM API? In-Reply-To: <4356BBF2.6070905@ichips.intel.com> Message-ID: <20051020210133.27820.qmail@web32504.mail.mud.yahoo.com> --- Sean Hefty wrote: > Good catch. I overlooked this about 10 times now... > The break should be there. > > - Sean The break indeed fixes the qp_type value at the receiver of the req message. However, this is not enough to successfully make the UC connection using the CM. The qp_attr_mask returned by the cm_init_qp_attr() function still assumes an RC connection is required. To get around this, I overwrote the qp_attr_mask with a mask suitable for UC, after the cm_init_qp_attr() call. This needs to be done for qp state transitions INIT->RTR and RTR->RTS. This is OK for now, but would really need to be done by the CM code. I had a look at where the mask is set in cm.c (cm_init_qp_rtr_attr() and cm_init_qp_rts_attr()) but I was unsure how to make the mask depend on the QP type. Maybe you have a better idea of how to do this. Anyway, hope this helps. Cheers, Steve. From Arkady.Kanevsky at netapp.com Thu Oct 20 14:06:49 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 20 Oct 2005 17:06:49 -0400 Subject: [dat-discussions] RE: [openib-general] Re: [swg] Re: private data... Message-ID: The real issue is how the Consumer specifies whether to use the current CM protocol with unformatted private data or the proposed CM with formatted private data.
So the extra bit tells the CM how to interpret the private data. A connection request is delivered to the Service ID specified in the request. The local CM does not have to know which TCP port that Service ID corresponds to. The CM does not care about TCP ports, but the ULP above the CM does, which is why it will get formatted private data. It may even be possible for that ULP to handle both formats of REQs. In one case it will get only private data and can make basic assumptions based on it, and in the other it will get the full socket address and can fully distinguish requestors based on it. The IBTA plans to define a new CM which formats the private data. DAPL may decide to support only the new CM format, or may decide to support both with some caveats. But the issue for the OpenIB CM still remains: should it support both or not, and if yes (for backwards compatibility), how should it expose this? The translation of an IP address to IB is done by IPoIB. There is a need for a facility that translates a TCP port to a Service ID. The mapping should be defined by the IBTA; that is what is being discussed there. This is part of the protocol definition, not the API. Once this is defined, a ULP can decide on which Service ID(s) to listen. A requestor can send a connection request to a specific Service ID (IB specific) or use a higher-level abstraction, the TCP port. The CM may be capable of translating a TCP port to a Service ID based on the ULP. For example, iSER over IPoIB will be mapped to one Service ID and native iSER over IB will be mapped to another. But this is not simple. On the other hand, every intermediate-level protocol (SDP, IPoIB) can do the conversion. But this is also hard, and is an extension of an existing protocol, or at least a facility on top of it. Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd.
Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Fab Tillier [mailto:ftillier at silverstorm.com] > Sent: Thursday, October 20, 2005 4:45 PM > To: 'Sean Hefty' > Cc: Lentini, James; swg at infinibandta.org; > dat-discussions at yahoogroups.com; openib-general at openib.org > Subject: [dat-discussions] RE: [openib-general] Re: [swg] Re: > private data... > > > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > Sent: Thursday, October 20, 2005 1:30 PM > > > > Using a bit in the REQ means that the higher level connection > > management protocol needs to receive and process CM REQs. How does > > the REQ get routed to the higher level CM? If it's based > on service > > ID, then why is the bit needed at all? If I'm routing > based on this > > bit, then I could just as easily define this protocol to exist on a > > single service ID, and still route on service ID. The > upper level CM > > can then demultiplex to the correct application based on > the addresses > > found in the private data. > > > > Using a reserved bit is essentially adding a 65th bit to > the service > > ID. > > I disagree. Using a reserved bit indicates that the first > 32-bytes of private data have a known format and can be > evaluated by an entity shared by multiple clients (the CMA). > > The service ID on the other hand indicates what protocol is > implemented over the connection once it is established. > > > In any case, I don't see how defining this private data > format without > > specifying which service IDs use it is all that useful. > > You can do both, but I think they are separate. The protocol > can be useful outside the scope of DAPL or NFS/RDMA. WSD > could use it, and then use a higher-level CM to do all the IP > to IB path management rather than duplicating it. > > - Fab
From sean.hefty at intel.com Thu Oct 20 14:22:50 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 14:22:50 -0700 Subject: [openib-general] Re: [swg] Re: private data... In-Reply-To: <002101c5d5b8$28703c30$9e5aa8c0@infiniconsys.com> Message-ID: >The same can be said of the starting local QPN, responder resource, initiator >depth, starting PSN, MTU, and so forth. The CM doesn't care about these - the >application does, as these settings affect how it configures its QP and what >features of its protocol it can use. Not exactly the same. The "connection" cares about these, and must be included as part of the connection protocol. >There are a number of fields that are not used by the CM state machine that are >included in these MADs already. These fields are defined in the CM protocol >not because they impact MAD processing in the CM, but because they represent >minimum information needed to configure a QP and client. Exactly. The IP address does not configure the QP. What you're advocating is that a service ID can support two private data formats depending on if a bit in the CM REQ is set or not. (If only a single format is supported, then the bit is not needed.) This is the wrong place to store this information. The format of the data beyond the addressing information is not conveyed by this bit, so additional information about the private data format is still needed.
You can grab several reserved bits from the REQ and define it as a "private data version", but then apps that care about this could just as easily record the version in the private data itself. - Sean From Arkady.Kanevsky at netapp.com Thu Oct 20 14:26:03 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 20 Oct 2005 17:26:03 -0400 Subject: [swg] RE: [openib-general] Re: [swg] Re: private data... Message-ID: But that require changes to CM APIs vs a module on top of it to parse and populate private data field. Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:sean.hefty at intel.com] > Sent: Thursday, October 20, 2005 5:23 PM > To: 'Fab Tillier'; 'Sean Hefty' > Cc: swg at infinibandta.org; openib-general at openib.org > Subject: [swg] RE: [openib-general] Re: [swg] Re: private data... > > > >The same can be said of the starting local QPN, responder resource, > >initiator depth, starting PSN, MTU, and so forth. The CM > doesn't care > >about these - the application does, as these settings affect how it > >configures its QP and what features of its protocol it can use. > > Not exactly the same. The "connection" cares about these, > and must be included as part of the connection protocol. > > >There are a number of fields that are not used by the CM > state machine > >that are included in these MADs already. These fields are > defined in > >the CM protocol not because they impact MAD processing in > the CM, but > >because they represent minimum information needed to > configure a QP and > >client. > > Exactly. The IP address does not configure the QP. > > What you're advocating is that a service ID can support two > private data formats depending on if a bit in the CM REQ is > set or not. (If only a single format is supported, then the > bit is not needed.) 
This is the wrong place to store this > information. The format of the data beyond the addressing > information is not conveyed by this bit, so additional > information about the private data format is still needed. > > You can grab several reserved bits from the REQ and define it > as a "private data version", but then apps that care about > this could just as easily record the version in the private > data itself. > > - Sean > From caitlinb at broadcom.com Thu Oct 20 14:58:07 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Thu, 20 Oct 2005 14:58:07 -0700 Subject: [openib-general] Semantics of transport neutral connection establishment (was Re: [swg] Re: private data...) Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020AA5@NT-SJCA-0751.brcm.ad.broadcom.com> I believe a review is in order of what the implementer of a transport neutral daemon for an RDMA protocol would expect from a Connection Management service: -- It expects that it can listen for connection requests on a specific 16-bit port number (with traditional TCP port number semantics) on either a specific IP Address or for all IP Addresses associated with the network device. -- It will receive connection requests that were initiated by active peers that wish to establish a reliable connection for the purpose of exchanging RDMA messages. This Connection Request will identify: 1) The remote IP Address of the active peer. This will be authenticated in the sense that the address is known to have more meaning than just being a value made up by a remote user-mode peer. If it is a lie then privileged software is complicit in the lie. The address may be even more authenticated than that. 2) The destination IP address that the active peer requested. That is, if the network device supports multiple addresses concurrently (as with a web farm) the connection request will identify *which* address was specified by the remote active peer.
3) Private Data supplied by the remote peer to establish its identity, the required characteristics of the desired connection and/or other application specific purposes. The private data is supplied prior to connection establishment specifically to enable selection/configuration of the RDMA QP. Note that on a transport neutral basis the passive side application cannot assume that the QP is fully configured to match credit requirements of the remote peer -- it must configure QP capacities itself. -- It will NOT receive connection requests from remote peers seeking to connect with similar services based upon streaming socket semantics (SDP or plain TCP). -- If it so chooses, it may accept the connection request by supplying a compatibly configured RDMA QP and response private data. -- If it so chooses, it may reject the connection request. Many of these requirements point to why the additional data is needed, and why taking the first N bytes of the existing private data is required. The key requirement that I believe requires that "65th bit" is that a client seeking a streaming mode daemon cannot initiate a connection with an RDMA mode daemon and start mis-exchanging data. If anyone cares to google you can find out that I had a low opinion of the value of this requirement when it was discussed in the IETF RDDP WG. Well, actually it wasn't discussed, it was imposed by fiat by the Transport Area directors. But despite my low opinion of it, it is part of IP based RDMA connection establishment. And in the interest of transport neutrality an InfiniBand option to emulate IP based connection establishment should emulate it as well. From sean.hefty at intel.com Thu Oct 20 15:01:30 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 15:01:30 -0700 Subject: [swg] RE: [openib-general] Re: [swg] Re: private data... In-Reply-To: Message-ID: >But that requires changes to CM APIs vs a module on top of it >to parse and populate private data field.
I wasn't advocating this change. What I think needs to be defined here is a *service* that provides TCP/IP connection semantics, similar to the definition of SDP. Applications can make use of this service or not, but the goal is that all services that use TCP/IP addressing to establish a connection would do so. OpenIB would provide an implementation of this service. The service is defined by one or more service IDs, plus a private data format. Moving beyond defining this service to changing the CM REQ, or separating the definition of the service into a private data protocol and application defined service IDs seems like a step in the wrong direction. - Sean From hozer at hozed.org Thu Oct 20 15:08:00 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 20 Oct 2005 17:08:00 -0500 Subject: [openib-general] Re: ehca testing In-Reply-To: <52sluwdq1b.fsf@cisco.com> References: <20051020144020.GR30127@kalmia.hozed.org> <20051020150432.GS30127@kalmia.hozed.org> <52ach4f5ak.fsf@cisco.com> <20051020175603.GV30127@kalmia.hozed.org> <52sluwdq1b.fsf@cisco.com> Message-ID: <20051020220759.GX30127@kalmia.hozed.org> On Thu, Oct 20, 2005 at 11:03:28AM -0700, Roland Dreier wrote: > Troy> I've since found I have the same problem without hcad_mod. I > Troy> don't see any errors in dmesg except for: > > Troy> [ 7415.421699] mthca0: ib_query_pkey port 0 failed (ret = -22) > > It's strange that IPoIB is querying port 0 of a CA. Could you have > mismatched versions of modules, so that some were compiled with a > different version of ? There is some sort of strange initialization error going on here. When ib_mthca is loaded by udev on startup, and then I modprobe ib_ipoib, I get the ib_query_pkey error. But unloading all the ib modules and reloading them manually works just fine.
(I have noticed a 5-10 second delay before ipoib starts working right) This is kernel 2.6.13.3 svnversion 3829 From rolandd at cisco.com Thu Oct 20 15:32:13 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 20 Oct 2005 15:32:13 -0700 Subject: [openib-general] Re: ehca testing In-Reply-To: <20051020220759.GX30127@kalmia.hozed.org> (Troy Benjegerdes's message of "Thu, 20 Oct 2005 17:08:00 -0500") References: <20051020144020.GR30127@kalmia.hozed.org> <20051020150432.GS30127@kalmia.hozed.org> <52ach4f5ak.fsf@cisco.com> <20051020175603.GV30127@kalmia.hozed.org> <52sluwdq1b.fsf@cisco.com> <20051020220759.GX30127@kalmia.hozed.org> Message-ID: <52br1jes5u.fsf@cisco.com> Troy> There is some sort of strange initialization error going on here. Yes, very strange. Can you add printk(KERN_ERR "hca->node_type = %d\n", hca->node_type); to the beginning of ipoib_add_port(), and printk(KERN_ERR "dev->ib_dev.node_type = %d\n", dev->ib_dev.node_type); right before the call to ib_register_device() in mthca_register_device() and send the output that you get when hotplug loads ib_mthca vs. when you load ib_mthca by hand? Thanks, Roland From kjreilly at us.ibm.com Thu Oct 20 18:26:47 2005 From: kjreilly at us.ibm.com (Kevin Reilly) Date: Thu, 20 Oct 2005 21:26:47 -0400 Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a In-Reply-To: <4357CA2C.9030203@ichips.intel.com> Message-ID: Will the CMA have the same function ib_at_route_by_ip() that we are using in libibat? /** * ib_at_route_by_ip - asynchronously resolve ip address to ib route * @dst_ip: destination ip * @src_ip: source ip - optional * @tos: ip type of service * @flags: ib_at_route_flags * @ib_route: out structure * @async_comp: asynchronous callback structure - optional * @req_id: pointer for request ID * * Resolve the specified dst_ip to a &struct ib_route structure. * src_ip can be provided to force a specific output interface.
* flags can be used to select resolving method; currently IB-ARP or ATS. * * See ib_at_completion structure documentation for asynchronous * operation details. */ int ib_at_route_by_ip(uint32_t dst_ip, uint32_t src_ip, int tos, uint16_t flags, struct ib_at_ib_route *ib_route, struct ib_at_completion *async_comp, uint64_t *req_id); Kevin J. Reilly STSM, HPC Architecture -Federation/HPS Chief Engineer -HPC interconnect architect (office) 845-433-7976 (tieline) 8-293-7976 From: Sean Hefty Date: 10/20/2005 12:47 PM To: Pradeep Satyanarayana/Beaverton/IBM at IBMUS cc: Hal Rosenstock, Kevin Reilly/Poughkeepsie/IBM at IBMUS, openib-general-bounces at openib.org, "openib-general at openib.org" Subject: Re: [openib-general] Re: Questions about libibat, ib_uat, and ib_a Pradeep Satyanarayana wrote: > Is there a ballpark estimate (or a plan) of when CMA will be ready? > Estimates like by end of Q4 2005 > or end of Q1 2006 will help us make some decisions if we should submit a > patch for this bug or wait > for CMA. The kernel CMA is ready today. An additional change will be required at some point once the iWarp Emulation Protocol is defined, but that will be minor. Work on the user CMA should begin by the end of this week. I estimate that it will take about 4 weeks to complete. - Sean From ttucker at es335.com Thu Oct 20 20:22:34 2005 From: ttucker at es335.com (Tom Tucker) Date: Thu, 20 Oct 2005 22:22:34 -0500 Subject: [swg] RE: [openib-general] Re: [swg] Re: private data... In-Reply-To: References: Message-ID: <1129864954.21779.2.camel@mail.es335.com> I agree. On Thu, 2005-10-20 at 15:01 -0700, Sean Hefty wrote: > >But that requires changes to CM APIs vs a module on top of it > >to parse and populate private data field. > > I wasn't advocating this change. What I think needs to be defined here is a > *service* that provides TCP/IP connection semantics, similar to the definition > of SDP.
Applications can make use of this service or not, but the goal is that > all services that use TCP/IP addressing to establish a connection would do so. > OpenIB would provide an implementation of this service. > > The service is defined by one or more service IDs, plus a private data format. > > Moving beyond defining this service to changing the CM REQ, or separating the > definition of the service into a private data protocol and application defined > service IDs seem like a step in the wrong direction. > > - Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Thu Oct 20 21:02:43 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 20 Oct 2005 21:02:43 -0700 Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a In-Reply-To: Message-ID: >Will the CMA have that the same function ib_at_route_by_ip() that we are >using in libibat? The CMA will resolve IB routes based on source/destination TCP/IP addresses if that is what you are looking for. It will then establish connections based on those routes. You may want to look at rdma_cm.h in the include/rdma directory to ensure that it meets your needs. - Sean From ttucker at es335.com Thu Oct 20 21:26:22 2005 From: ttucker at es335.com (Tom Tucker) Date: Thu, 20 Oct 2005 23:26:22 -0500 Subject: [openib-general] Semantics of transport neutral connection establishment (was Re: [swg] Re: private data...) In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020AA5@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020AA5@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1129868783.21779.58.camel@mail.es335.com> I think this is a useful discussion, however, I would point out that some of the information being exchanged doesn't have to be in the private data. 
It could be exchanged in an untagged send/recv after the connection is established; which has the benefit of allowing the application to use an arbitrarily large chunk of data to authenticate and authorize the remote peer instead of trying to boil the ocean in 64B of data. It might be better to only solve the core problem ... which I think is identifying the service and QP configuration. On Thu, 2005-10-20 at 14:58 -0700, Caitlin Bestler wrote: > I believe a review of what the implementer of a transport neutral > daemon for an RDMA protocol would be expecting from a Connection > Management service: > > -- It expects that it can listen for connection requests on a specific > 16-bit port number (with traditional TCP port number semantics) on > either a specific IP Address or for all IP Addresses associated with > the network device. I would expand this to say "...or on all interfaces that have an IP address". From my reading, it seems to me that the current CMA does this. > -- It will receive connection requests that were initiated by active peers > that wish to establish a reliable connection for the purpose of > exchanging RDMA messges. > > This Connection Request will identify: > 1) The remote IP Address of the active peer. > This will be > authenticated > in the sense that the address is known to have more meaning than > just being a value made up by a remote user-mode peer. If it is a > lie > then privileged software is complicit in the lie. The address may be > even more authenticated than that. Are you saying that it should not be possible for a user mode peer to masquerade as another host? If this is what you're saying, then I don't think it is any more secure done in the kernel than in user mode because the remote peer has no way of knowing where the data was prepared. I think that if authentication is the purpose of the remote address, don't bother. 
If the active peer needs to be authenticated, do it after connection establishment when you can exchange signatures of sufficient size to be useful. Am I missing something here? > 2) The destination IP address that the active peer requested. That > is, if > the network device supports multiple addresses concurrently (as with > a > web farm) the connection request will identify *which* address was > specified by the remote active peer. ... and port. I agree this is needed because it is part of the "local service signature" aka service id aka port number/ip address" > 3) Private Data supplied by the remote peer to establish its > identity, IMHO, private data is useless (or at least insecure) for this purpose for the reasons mentioned above. > the required characteristics of the desired connection and/or other > application specific purposes. The private data is supplied prior to > connection establishment specifically to enable > selection/configuration of the RDMA QP. I agree that this is a useful purpose for private data. > Note that on a transport neutral basis the passive > side > application cannot assume that the QP is fully configured to match > credit > requirements of the remote peer -- it must configure QP capacities > itself. > -- It will NOT receive connection requests from remote peers seeking to > connect > with similar services based upon streaming socket semantics (SDP or plain > TCP). What specifically are you saying here? That the app won't see the connect request until after an MPA Start Request has been received? > > -- If it so chooses, it may accept the connection request by supplying a > compatibly > configured RDMA QP and response private data. > -- If it so chooses, it may reject the conneciton request. > Many of these requirements point to why the additional data is needed, and > why taking > the first N bytes of the existing private data is requried. 
> > The key requirement that I believe requires that "65th bit" is that a client > seeking > a streaming mode daemon cannot initiate a connection with an RDMA mode daemon > and > start mis-exchanging data. Are you referring to the MPA Start Key thingy again? I don't think the IB guys have this issue. > > If anyone cares to google you can find out that I had a low opinion of the > value > of this requirement when it was discussed in the IETF RDDP WG. Well, > actually it > wasn't discussed, it was imposed by fiat by the Transport Area directors. > > But despite my low opinion of it, it is part of IP based RDMA connection > establishment. > And in the interest of transport neutrality an InfiniBand option to emulate > IP based > connection establishment should emulate it as well. > erf. From tom at ipperformance.com Thu Oct 20 21:41:26 2005 From: tom at ipperformance.com (Tom Tucker) Date: Thu, 20 Oct 2005 23:41:26 -0500 Subject: [openib-general] Re: rdma_bind_addr question In-Reply-To: <4357F979.9030105@ichips.intel.com> References: <1129837501.10748.9.camel@trinity.austin.ammasso.com> <4357F186.1070407@ichips.intel.com> <1129838880.10748.25.camel@trinity.austin.ammasso.com> <4357F979.9030105@ichips.intel.com> Message-ID: <1129869686.21779.67.camel@mail.es335.com> On Thu, 2005-10-20 at 13:09 -0700, Sean Hefty wrote: > Tom Tucker wrote: > > BTW -- I think this means that we need an ARPHRD_IWARP type. > > It may be that the CMA can simply look in its local device list for an ib_device > that has a given MAC address. After pondering this, I think you're correct. There is one issue, however. Currently, the GID is stored beginning at the fourth byte of the dev_addr for IPoIB, but the Ethernet MAC address begins at byte 0.
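To make the two layouts concrete, here is a minimal sketch in C. It assumes the usual 20-byte IPoIB hardware address (4 bytes of flags/QPN followed by the 16-byte GID) and a 6-byte Ethernet MAC; the macro and function names are made up for illustration and are not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IPOIB_HW_ADDR_LEN 20 /* assumed: 4 bytes flags/QPN + 16-byte GID */
#define IPOIB_GID_OFFSET   4 /* GID starts after the flags/QPN field */
#define GID_LEN           16
#define ETH_MAC_LEN        6 /* Ethernet MAC starts at byte 0 */

/* Copy the GID out of an IPoIB-style hardware address. */
static void ipoib_hw_addr_gid(const uint8_t *hw, uint8_t *gid)
{
	assert(hw != NULL && gid != NULL);
	memcpy(gid, hw + IPOIB_GID_OFFSET, GID_LEN);
}

/* Compare a MAC against the first 6 bytes of an Ethernet-style dev_addr.
 * Returns 1 on a match, 0 otherwise. */
static int eth_mac_matches(const uint8_t *dev_addr, const uint8_t *mac)
{
	assert(dev_addr != NULL && mac != NULL);
	return memcmp(dev_addr, mac, ETH_MAC_LEN) == 0;
}
```

The point of the sketch is only that the interesting bytes sit at different offsets in the two cases, so one search routine cannot treat both address types the same way without some adjustment.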
Is it possible to move the leading 4B quantity so that it follows the GID? If so, we could pad the dev_addr for iWARP devices with zeroes and use the exact same code to search the cma device table. If the device is found, it already has a type in the ib_device structure to distinguish between IB and iWARP devices. If the caller gave us an IP address for a dumb Ethernet device, we would go looking for it in the cma device list and simply not find it. It would still fail, just later, and would avoid a new ARPHRD type. > > I don't know the detail of how iWarp will work with this. Will iWarp need a > call similar to ib_translate_addr() to translate an IP address into a MAC > address? Are the MAC addresses stored with the ib_device somehow? > > - Sean From schihei at de.ibm.com Thu Oct 20 23:53:49 2005 From: schihei at de.ibm.com (Heiko J Schick) Date: Fri, 21 Oct 2005 08:53:49 +0200 Subject: [openib-general] build libibverbs with --libdir parameter Message-ID: <4358907D.8060507@de.ibm.com> Hello, I've seen that libibverbs uses the --libdir configure parameter as the default path for all userspace driver libraries (e.g. mthca.so, libehca.so, ...). For the 64-bit version we must use --libdir /usr/local/lib64 to get libibverbs installed in /usr/local/lib64 and not in /usr/local/lib. I can't use --prefix. When I now build an RPM and use --libdir, libibverbs runs into problems because RPM installs the library into its own temporary directory (/var/tmp/usr/local/lib64). The default library path is then /var/tmp/usr/local/lib64/infiniband. Is there some way to change this behaviour? The only option I can see is to patch the DEFAULT_PATH define in init.c for RPM builds. -- Mit freundlichen Gruessen / Kind Regards Heiko Joerg Schick ---------------------------------------------------------------------- Heiko J Schick I/O Firmware Development II Linux InfiniBand Device Drivers IBM Deutschland Entwicklung GmbH external: 49-07031-16-0 x4219 Schoenaicher Str.
220 t/l: 120-4129 71032 Boeblingen email: schickhj at de.ibm.com ---------------------------------------------------------------------- From gabhijit at pantasys.com Fri Oct 21 06:45:28 2005 From: gabhijit at pantasys.com (Abhijit Gadgil) Date: Fri, 21 Oct 2005 19:15:28 +0530 Subject: [openib-general] Adding static entry to arp table? Message-ID: <1129902328.5494.4.camel@psmith.ind.pantasys.com> Hi all, Is there a patch (to ip utility or Linux kernel), which can add static entry to the arp table using ip neigh command? I am using gen1 based stack, but didn't find anything after a 'grep' in the gen2 stack as well?
any pointers? Thanks and regards. -abhijit [caitlin] Comments inline [/caitlin] -----Original Message----- From: Tom Tucker [mailto:ttucker at es335.com] Sent: Thu 10/20/2005 9:26 PM To: Caitlin Bestler Cc: Sean Hefty; Fab Tillier; swg at infinibandta.org; openib-general at openib.org Subject: Re: [openib-general] Semantics of transport neutral connection establishment (was Re: [swg] Re: private data...) I think this is a useful discussion, however, I would point out that some of the information being exchanged doesn't have to be in the private data. It could be exchanged in an untagged send/recv after the connection is established; which has the benefit of allowing the application to use an arbitrarily large chunk of data to authenticate and authorize the remote peer instead of trying to boil the ocean in 64B of data. It might be better to only solve the core problem ... which I think is identifying the service and QP configuration. On Thu, 2005-10-20 at 14:58 -0700, Caitlin Bestler wrote: > I believe a review of what the implementer of a transport neutral > daemon for an RDMA protocol would be expecting from a Connection > Management service: > > -- It expects that it can listen for connection requests on a specific > 16-bit port number (with traditional TCP port number semantics) on > either a specific IP Address or for all IP Addresses associated with > the network device. I would expand this to say "...or on all interfaces that have an IP address". From my reading, it seems to me that the current CMA does this. > -- It will receive connection requests that were initiated by active peers > that wish to establish a reliable connection for the purpose of > exchanging RDMA messges. > > This Connection Request will identify: > 1) The remote IP Address of the active peer. > This will beauthenticated > in the sense that the address is known to have more meaning than > just being a value made up by a remote user-mode peer.
If it is a > lie then privileged software is complicit in the lie. The address may be > even more authenticated than that. Are you saying that it should not be possible for a user mode peer to masquerade as another host? If this is what you're saying, then I don't think it is any more secure done in the kernel than in user mode because the remote peer has no way of knowing where the data was prepared. I think that if authentication is the purpose of the remote address, don't bother. If the active peer needs to be authenticated, do it after connection establishment when you can exchange signatures of sufficient size to be useful. Am I missing something here? [caitlin] I'm merely noting what a TCP daemon is able to assume today as part of IP connection setup. The remote IP address supplied may be heavily authenticated (the local network actively prevents IP spoofing by checking routes, etc.) or next to worthless (the only guarantee is that the forger had root access on some machine). Regardless of whether this level of authentication is advisable, it is *exactly* the assumption that many servers make today. This includes many NFS configurations. When the network is isolated from external connections, and the network administrator is confident that they control "root" for all machines within the local network this can be a significant level of defense. Within a corporate intranet, for example, this may be the mechanism to ensure that marketing does not examine internal engineering documents. I wouldn't recommend it for protecting HR files, but it can be quite adequate for many purposes. More importantly, if this guarantee is not provided then an explicit warning should be made. For example, unless the CM header itself is marked as having IP data in it there is no way to know that a user mode application hasn't simply made up an IP address and submitted it as part of a normal CM request's private data.
[/caitlin] > 2) The destination IP address that the active peer requested. That > is, if the network device supports multiple addresses concurrently (as with > a > web farm) the connection request will identify *which* address was > specified by the remote active peer. ... and port. I agree this is needed because it is part of the "local service signature" aka service id aka port number/ip address [caitlin] But the port was implicit in the listen. There is no need for it to be part of the Connection Request reported to the listener. If there is a desire to conserve Service IDs then the wire protocol could certainly include the TCP port in the CM Private Data. But that would be transparent to the application. The transparency to the application is my concern. I'll confess to not seeing any great need to avoid allocating 64K Service IDs out of a 64-bit range -- but that's an issue for the IB developers to work out. [/caitlin] > 3) Private Data supplied by the remote peer to establish its > identity, IMHO, private data is useless (or at least insecure) for this purpose for the reasons mentioned above. > the required characteristics of the desired connection and/or other > application specific purposes. The private data is supplied prior to > connection establishment specifically to enable > selection/configuration of the RDMA QP. I agree that this is a useful purpose for private data. [caitlin] Agreed. This is why private data exists. Application developers SHOULD use it only for this purpose, and use Send/Recv to exchange other information. But the problem is once you label it "private data" the application developers tend to think of it as their data that they can use any way they see fit. [/caitlin] > Note that on a transport neutral basis the passive > side application cannot assume that the QP is fully configured to match > credit requirements of the remote peer -- it must configure QP capacities > itself.
> -- It will NOT receive connection requests from remote peers seeking to > connect > with similar services based upon streaming socket semantics (SDP or plain > TCP). What specifically are you saying here? That the app won't see the connect request until after an MPA Start Request has been received? [caitlin] Correct, it is not an IP-semantics RDMA Connection Request until there is no doubt that it is. Over iWARP/MPA that is established by a valid MPA Request frame. Over iWARP/SCTP it is established by an association parameter. For IP-Address-based IB CM that implies that we have to know that the Private Data actually is in the IP-Address bearing format. If a stray TCP client who has no idea that they are connecting to an RDMA daemon initiates a TCP connection that connection will be handled strictly by the iWARP CM and never reported as an iWARP connection request to the Consumer. Similarly, if a stray IB client that has no idea what these "IP semantics" options are about attempts to connect to a Service that wants the IP-format CM Private Data the connection request should not be reported to the Consumer. Especially not with garbled private data and addressing data taken from the intended private data. That would be a change in the semantics for a transport neutral daemon. It is not responsible for figuring out if the client really meant to connect with it. Sorting out those who do not know how to speak the relevant Connection Establishment protocol should remain a responsibility of the Connection Manager (for whatever connection establishment protocol is in use) and not shifted to the listener. [/caitlin] > > -- If it so chooses, it may accept the connection request by supplying a > compatibly > configured RDMA QP and response private data. > -- If it so chooses, it may reject the connection request. > Many of these requirements point to why the additional data is needed, and > why taking > the first N bytes of the existing private data is required.
> > The key requirement that I believe requires that "65th bit" is that a client > seeking > a streaming mode daemon cannot initiate a connection with an RDMA mode daemon > and > start mis-exchanging data. Are you referring to the MPA Start Key thingy again? I don't think the IB guys have this issue. [caitlin] They do. But it's not distinguishing a plain TCP client from an iWARP client. It's distinguishing a Connection Request from a client who wants an IP semantics connection from one who is using the base CM protocol. [/caitlin] From ttucker at es335.com Fri Oct 21 07:20:16 2005 From: ttucker at es335.com (ttucker at es335.com) Date: Fri, 21 Oct 2005 09:20:16 -0500 Subject: [openib-general] Adding static entry to arp table? In-Reply-To: <1129902328.5494.4.camel@psmith.ind.pantasys.com> References: <1129902328.5494.4.camel@psmith.ind.pantasys.com> Message-ID: <20051021092016.68qivum10ks4o0cw@www.opengridcomputing.com> Do you need it to be an API, or can you just use the arp -s command? Quoting Abhijit Gadgil: > Hi all, > > Is there a patch (to ip utility or Linux kernel), which can add static > entry to the arp table using ip neigh command? I am using gen1 based > stack, but didn't find anything after a 'grep' in the gen2 stack as > well? any pointers? > > Thanks and regards. > > -abhijit From gabhijit at pantasys.com Fri Oct 21 07:31:47 2005 From: gabhijit at pantasys.com (Abhijit Gadgil) Date: Fri, 21 Oct 2005 20:01:47 +0530 Subject: [openib-general] Adding static entry to arp table? In-Reply-To: <20051021092016.68qivum10ks4o0cw@www.opengridcomputing.com> References: <1129902328.5494.4.camel@psmith.ind.pantasys.com> <20051021092016.68qivum10ks4o0cw@www.opengridcomputing.com> Message-ID: <1129905107.5494.16.camel@psmith.ind.pantasys.com> There is no need for the API, I am using the 'ip neigh' command, because the 'arp -s' has got some issues with the hardware address length. Does 'arp -s' actually work? -abhijit On Fri, 2005-10-21 at 19:50, ttucker at es335.com wrote: > Do you need it to be an API, or can you just use the arp -s command? > > Quoting Abhijit Gadgil : > > > Hi all, > > > > Is there a patch (to ip utility or Linux kernel), which can add static > > entry to the arp table using ip neigh command? I am using gen1 based > > stack, but didn't find anything after a 'grep' in the gen2 stack as > > well? any pointers? > > > > Thanks and regards.
> > > > -abhijit > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > From swise at opengridcomputing.com Fri Oct 21 07:47:35 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 21 Oct 2005 09:47:35 -0500 Subject: [openib-general] Re: rdma_bind_addr question References: <1129837501.10748.9.camel@trinity.austin.ammasso.com><4357F186.1070407@ichips.intel.com><1129838880.10748.25.camel@trinity.austin.ammasso.com><4357F979.9030105@ichips.intel.com> <1129869686.21779.67.camel@mail.es335.com> Message-ID: <004401c5d64e$60783150$d5000a0a@STEVO> I think we can simplify this without changing the format of how the IPoIB mac addr is stored. assume the IPoIB netdevs have type ARPHRD_INFINIBAND, and iWARP netdevs are type ARPHRD_ETHERNET. The CMA would search only the IWARP openib devices and compare the 6 byte MAC address if the netdev type is ETHERNET, and search only the IB openib devices matching on the full 20 byte IPoIB mac addr. Doing this removes the possibility that a 6 byte ethernet mac addr accidentally matches the first 6 bytes of the IPoIB mac addr... Steve. ----- Original Message ----- From: "Tom Tucker" To: "Sean Hefty" Cc: Sent: Thursday, October 20, 2005 11:41 PM Subject: [openib-general] Re: rdma_bind_addr question > On Thu, 2005-10-20 at 13:09 -0700, Sean Hefty wrote: >> Tom Tucker wrote: >> > BTW -- I think this means that we need an ARPHRD_IWARP type. >> >> It may be that the CMA can simply look in its local device list for an >> ib_device >> that has a given MAC address. > > After pondering this, I think you're correct. There is one issue, > however. Currently, the GID is stored beginning at the fourth byte of > the dev_addr for IBoIB, but the Ethernet MAC address begins at byte 0. 
> Is it possible to move this 4B quantity to follow the GID? If so, we > could pad the dev_addr for iWARP devices with zeroes and use the exact > same code to search the cma device table. If the device is found, it > already has a type in the ib_device structure to distinguish between IB > and iWARP devices. > > If the caller gave us an IP address for a dumb Ethernet device, we would > go looking for it in the cma device list and simply not find it. It > would still fail, just later, and would avoid a new ARPHRD type. > >> >> I don't know the detail of how iWarp will work with this. Will iWarp >> need a >> call similar to ib_translate_addr() to translate an IP address into a MAC >> address? Are the MAC addresses stored with the ib_device somehow? >> >> - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From Arkady.Kanevsky at netapp.com Fri Oct 21 08:02:37 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 21 Oct 2005 11:02:37 -0400 Subject: [openib-general] configuring ipoib Message-ID: How do you configure ipoib? I used "ifconfig ib0 ip_address" which works fine. But if I have several ports on an HCA how do I specify which port ip_address should be associated with? Ditto if you have multiple cards. Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom at ipperformance.com Fri Oct 21 08:09:27 2005 From: tom at ipperformance.com (Tom Tucker) Date: Fri, 21 Oct 2005 10:09:27 -0500 Subject: [openib-general] Adding static entry to arp table? 
In-Reply-To: <1129905107.5494.16.camel@psmith.ind.pantasys.com> References: <1129902328.5494.4.camel@psmith.ind.pantasys.com> <20051021092016.68qivum10ks4o0cw@www.opengridcomputing.com> <1129905107.5494.16.camel@psmith.ind.pantasys.com> Message-ID: <20051021100927.dhp1kcgs0swkw0c8@www.opengridcomputing.com> I've used it for Ethernet MAC addresses. I haven't tried it for non-ARPHRD_ETHERNET types. Quoting Abhijit Gadgil : > There is no need for the API, I am using the 'ip neigh' command, because > the 'arp -s' has got some issues with the hardware address length. > > Does 'arp -s' actually work? > > -abhijit > > > On Fri, 2005-10-21 at 19:50, ttucker at es335.com wrote: >> Do you need it to be an API, or can you just use the arp -s command? >> >> Quoting Abhijit Gadgil : >> >> > Hi all, >> > >> > Is there a patch (to ip utility or Linux kernel), which can add static >> > entry to the arp table using ip neigh command? I am using gen1 based >> > stack, but didn't find anything after a 'grep' in the gen2 stack as >> > well? any pointers? >> > >> > Thanks and regards. >> > >> > -abhijit >> > >> > >> > _______________________________________________ >> > openib-general mailing list >> > openib-general at openib.org >> > http://openib.org/mailman/listinfo/openib-general >> > >> > To unsubscribe, please visit >> > http://openib.org/mailman/listinfo/openib-general >> > >> >> >> > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Fri Oct 21 08:08:38 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Oct 2005 11:08:38 -0400 Subject: [openib-general] configuring ipoib In-Reply-To: References: Message-ID: <1129907306.16900.50211.camel@hal.voltaire.com> On Fri, 2005-10-21 at 11:02, Kanevsky, Arkady wrote: > How do you configure ipoib? 
> I used "ifconfig ib0 ip_address" which works fine. > But if I have several ports on an HCA how do I specify which port > ip_address should be associated with? > Ditto if you have multiple cards. HCA port 0 = ib0 HCA port 1 = ib1 I think additional HCAs would be: ib2/ib3, etc. -- Hal > Thanks, > > > Arkady Kanevsky email: arkady at netapp.com > > Network Appliance phone: 781-768-5395 > > 375 Totten Pond Rd. Fax: 781-895-1195 > > Waltham, MA 02451-2010 central phone: 781-768-5300 > > > > > > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tom at ipperformance.com Fri Oct 21 08:53:25 2005 From: tom at ipperformance.com (Tom Tucker) Date: Fri, 21 Oct 2005 10:53:25 -0500 Subject: [openib-general] Semantics of transport neutral connection establishment (was Re: [swg] Re: private data...) In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1025DFD@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1025DFD@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <20051021105325.olh7016x7o8o4884@www.opengridcomputing.com> It is not unreasonable and certainly desireable that a CMA that uses TCP/IP addressing should be able to provide to the consumer the four-tuple that uniquely identifies the session. No argument. No problem. It's the *other* rationales being suggested that are killing me. - The receiving application knows nothing about whether or not the source address has been spoofed. Only the firewall does and even its tests are weak (if the next hop is the same for both the real and spoofed address you can't tell). So these arguments don't convince me. This is not about security. - NFS does need the source address for determining whether or not it will allow the mount. No problem. 
I'm not arguing about that. I'm arguing about some presumption of security that simply does not exist. - Making statements about what the local node can assume about how the remote node prepared the data (kernel vs. user) will get us absolutely slapped by the Linux crowd. I'm gonna duck -- you're on your own ;-) - In general, it seems that this Private Data protocol is getting overloaded with requirements that extend beyond identiciation of the 4-tuple. I am suggesting that OpenIB apps should exchange this kind of info in a different way instead of establishing a Private Data protocol generic to all ULP. BTW: I don't disagree that a CM client shouldn't need to be worried about incoming requests from non-RDMA clients. Quoting Caitlin Bestler : > [caitlin] > Comments inline > [/caitlin] > > > -----Original Message----- > From: Tom Tucker [mailto:ttucker at es335.com] > Sent: Thu 10/20/2005 9:26 PM > To: Caitlin Bestler > Cc: Sean Hefty; Fab Tillier; swg at infinibandta.org; openib-general at openib.org > Subject: Re: [openib-general] Semantics of transport neutral connection > establishment (was Re: [swg] Re: private data...) > > > I think this is a useful discussion, however, I would point out that > some of the information being exchanged doesn't have to be in the > private data. It could be exchanged in an untagged send/recv after the > connection is established; which has the benefit of allowing the > application to use an arbitrarily large chunk of data to authenticate > and authorize the remote peer instead of trying to boil the ocean in 64B > of data. > > It might be better to only solve the core problem ... which I think is > identifying the service and QP configuration. 
> > On Thu, 2005-10-20 at 14:58 -0700, Caitlin Bestler wrote: >> I believe a review of what the implementer of a transport neutral >> daemon for an RDMA protocol would be expecting from a Connection >> Management service: >> >> -- It expects that it can listen for connection requests on a specific >> 16-bit port number (with traditional TCP port number semantics) on >> either a specific IP Address or for all IP Addresses associated with >> the network device. > > I would expand this to say "...or on all interfaces that have an IP > address". From my reading, it seems to me that the current CMA does > this. > >> -- It will receive connection requests that were initiated by active peers >> that wish to establish a reliable connection for the purpose of >> exchanging RDMA messges. >> >> This Connection Request will identify: >> 1) The remote IP Address of the active peer. >> This will beauthenticated >> in the sense that the address is known to have more meaning than >> just being a value made up by a remote user-mode peer. If it is a >> lie then privileged software is complicit in the lie. The address may be >> even more authenticated than that. > > Are you saying that it should not be possible for a user mode peer to > masquerade as another host? If this is what you're saying, then I don't > think it is any more secure done in the kernel than in user mode because > the remote peer has no way of knowing where the data was prepared. > > I think that if authentication is the purpose of the remote address, > don't bother. If the active peer needs to be authenticated, do it after > connection establishment when you can exchange signatures of sufficient > size to be useful. > > Am I missing something here? > > > [caitlin] > I'm merely noting what a TCP daemon is able to assume today as > part of IP connection setup. The remote IP address supplied may > be heavily authenticated (the local network actively prevents IP > spoofing by checking routes, etc.) 
or next to worthless (the only > guarantee si that the forger had root access on some machine). > > Regardless of whether this level of authentication is advisable, it > is *exactly* the assumption that many servers make today. This > includes many NFS configurations. When the network is isolated > from external connections, and the network administrator is > confident that they control "root" for all machines within the > local network this can be a signifigant level of defense. Within > a corporate intranet, for example, this may be the mechanism > to ensure that marketing does not examine internla engineering > documents. I wouldn't recommend it for protecting HR files, but > it can be quite adequate for many purposes. > > More importantly, if this guarantee is not provided then an > explicit warning should be made. For example, unless the > CM header itself is mariked as having IP data in it there is no > way to know that a user mode application simply hasn't made > up an IP address and submitted as part of a normal CM requests > private data. > [/caitlin] > > > >> 2) The destination IP address that the active peer requested. That >> is, if the network device supports multiple addresses concurrently (as with >> a >> web farm) the connection request will identify *which* address was >> specified by the remote active peer. > > ... and port. I agree this is needed because it is part of the "local > service signature" aka service id aka port number/ip address" > > [caitlin] > But the port was implicit in the listen. There is no need for it to > be part of the Connection Request reported to the listener. If there > is a desire to conserve Service IDs then the wire protocol could certainly > include the TCP port in the CM Private Data. But that would be transparent > to the application. > > The transparency to the applicaion is my concern. 
I'll confess to not > seeing any great need to avoid allocating 64K Service IDs out of a > 64-bit range -- but that's an issue for the IB developers to work out. > [/caitlin] > >> 3) Private Data supplied by the remote peer to establish its >> identity, > > IMHO, private data is useless (or at least insecure) for this purpose > for the reasons mentioned above. > >> the required characteristics of the desired connection and/or > other >> application specific purposes. The private data is supplied prior > to >> connection establishment specifically to enable >> selection/configuration of the RDMA QP. > > I agree that this is a useful purpose for private data. > > [caitlin] > Agreed. This is why private data exists. Application developers > SHOULD use it only for this purpose, and use Send/Recv to exchange > other information. > > But the problem is once you label it "private data" the application > developers tend to think of it as their data that they can use anyway > they see fit. > [/caitlin] > >> Note that on a transport neutral basis the passive >> side application cannot assume that the QP is fully configured to match >> credit requirements of the remote peer -- it must configure QP capacities >> itself. > >> -- It will NOT receive connection requests from remote peers seeking to >> connect >> with similar services based upon streaming socket semantics (SDP or > plain >> TCP). > > What specifically are you saying here? That the app won't see the > connect request until after an MPA Start Request has been received? > > [caitlin] > Correct, it is not an IP-ssemantics RDMA Connection Request until > there is no doubt that it is. > > Over iWarp/MPA that is established by a valid MPA Request frame. > Over iWARP/SCTP it is established by an association parameter. > > For IP-Address-based IB CM that implies that we have to know that > the Private Data actually is in the IP-Address bearing format. 
> > If a stray TCP client who has no idea that they are connecting to an > RDMA daemon initiates a TCP connection that connection will be > handled strictly by the iWARP CM and never reported as an iWARP > connection request to the Consumer. > > Similarly, if a stray IB client that has no idea what these "IP semantics" > options are about attempts to connect to a Service that wants the > IP-format CM Private Data the connection request should not be > reported to the Consumer. Especially not with garbled private data > and addressign data taken from the intended private data. > > That would be a change in the smenatics for a transport neutral > daemon. It is not responsible for figuring out if the client really > meant to connect with it. Sorting out those who do not know how > to speak the relevant Connection Establishment protocol should > remain a responsibiloity of the Connection Manager (for whatever > connection establishment protocol is in use) and not shifted to > the listener. > [/caitlin] > >> >> -- If it so chooses, it may accept the connection request by supplying a >> compatibly >> configured RDMA QP and response private data. >> -- If it so chooses, it may reject the conneciton request. > >> Many of these requirements point to why the additional data is needed, and >> why taking >> the first N bytes of the existing private data is requried. >> >> The key requirement that I belive requires that "65th bit" is that a client >> seeking >> a streaming mode daemon cannot initiate a connection with an RDMA mode > daemon >> and >> start mis-exchanging data. > > Are you referring to the MPA Start Key thingy again? I don't think the > IB guys don't have this issue. > > [caitlin] > They do. But it's not distinquishing a plain TCP client from an iWARP > client. It's distinquishing a Connection Request from a client who wants > an IP semantics connection from one who is using the base CM protocol. 
> [/caintlin] > > > > >

From iod00d at hp.com Fri Oct 21 08:58:20 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 21 Oct 2005 08:58:20 -0700
Subject: [openib-general] configuring ipoib
In-Reply-To:
References:
Message-ID: <20051021155820.GC25476@esmail.cup.hp.com>

On Fri, Oct 21, 2005 at 11:02:37AM -0400, Kanevsky, Arkady wrote:
> How do you configure ipoib?
> I used "ifconfig ib0 ip_address" which works fine.
> But if I have several ports on an HCA how do I specify which port
> ip_address should be associated with?

Nit: For linux, Christoph Hellwig (and others) have explained that the ip address is associated with the host, NOT any card. The route to the subnet is associated with the card.

> Ditto if you have multiple cards.

iowa:/usr/src/linux-2.6.13# lspci -vt -d 15b3:
-+-[c0]---01.0-[c1]----00.0  Mellanox Technologies MT23108 InfiniHost
 +-[40]---01.0-[41]----00.0  Mellanox Technologies MT23108 InfiniHost
 \-[00]-
iowa:/usr/src/linux-2.6.13# ifconfig -a | fgrep ib
ib0       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
ib1       Link encap:UNSPEC  HWaddr 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
ib2       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
ib3       Link encap:UNSPEC  HWaddr 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00

hth,
grant

From iod00d at hp.com Fri Oct 21 09:09:42 2005
From: iod00d at hp.com (Grant Grundler)
Date: Fri, 21 Oct 2005 09:09:42 -0700
Subject: [openib-general] Semantics of transport neutral connection establishment (was Re: [swg] Re: private data...)
In-Reply-To: <20051021105325.olh7016x7o8o4884@www.opengridcomputing.com>
References: <54AD0F12E08D1541B826BE97C98F99F1025DFD@NT-SJCA-0751.brcm.ad.broadcom.com> <20051021105325.olh7016x7o8o4884@www.opengridcomputing.com>
Message-ID: <20051021160942.GD25476@esmail.cup.hp.com>

On Fri, Oct 21, 2005 at 10:53:25AM -0500, Tom Tucker wrote:
...
> BTW: I don't disagree that a CM client shouldn't need to be worried about
> incoming requests from non-RDMA clients.

my parser is tripping all over this one...did you mean:
"I agree a CM client can ignore non-RDMA clients"?
or
"I agree a CM client should only get requests from RDMA clients"?

thanks,
grant

From surs at cse.ohio-state.edu Fri Oct 21 09:17:28 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Fri, 21 Oct 2005 12:17:28 -0400
Subject: [openib-general] uDAPL open HCA problem
Message-ID: <20051021161727.GA23980@cse.ohio-state.edu>

Hello,

I have udapl over Gen2 setup on our cluster and am able to run udapl
programs. However, sometimes I get this error (after a few runs of the
same program):

open_hca: ERR ib_at_ips_by_gid for mthca0
dapls_ib_open_hca failed 40000

The machine is a AMD Opteron (Tyan S2895), with Mellanox MemFree cards
(fw ver 5.1.0).

lsmod on my machine shows this:

[surs at ro0:~] lsmod | grep ^ib
ib_ipoib               48008  0
ib_uat                 14840  0
ib_at                  25696  1 ib_uat
ib_sa                  17804  2 ib_ipoib,ib_at
ib_ucm                 22280  0
ib_cm                  37744  1 ib_ucm
ib_uverbs              35992  0
ib_umad                18208  0
ib_mthca              122656  0
ib_mad                 44072  4 ib_sa,ib_cm,ib_umad,ib_mthca
ib_core                56192  8 ib_ipoib,ib_sa,ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad

My infiniband devices are (created by hand):

[surs at ro0:~] ls -l /dev/infiniband/
total 0
crw-rw-rw-  1 root root 231, 191 2005-10-20 21:13 uat
crw-rw-rw-  1 root root 231, 224 2005-10-20 21:12 ucm0
crwxrwxrwx  1 root root 231, 192 2005-09-21 04:37 umad0
crwxrwxrwx  1 root root 231, 192 2005-09-16 19:29 uverbs0
crwxrwxrwx  1 root root 231, 192 2005-09-16 19:29 uverbs1

I'd really appreciate if someone could help me understand what might be
going wrong.

Thanks,
Sayantan.
-- http://www.cse.ohio-state.edu/~surs From iod00d at hp.com Fri Oct 21 09:23:43 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 21 Oct 2005 09:23:43 -0700 Subject: [openib-general] uDAPL open HCA problem In-Reply-To: <20051021161727.GA23980@cse.ohio-state.edu> References: <20051021161727.GA23980@cse.ohio-state.edu> Message-ID: <20051021162343.GE25476@esmail.cup.hp.com> On Fri, Oct 21, 2005 at 12:17:28PM -0400, Sayantan Sur wrote: > Hello, > > I have udapl over Gen2 setup on our cluster and am able to run udapl > programs. However, sometimes I get this error (after a few runs of the > same program): > > open_hca: ERR ib_at_ips_by_gid for mthca0 > dapls_ib_open_hca failed 40000 > > The machine is a AMD Opteron (Tyan S2895), with Mellanox MemFree cards > (fw ver 5.1.0). Folks here will still need to know: 1) Which kernel version? 2) Which SVN version of GEN2 are you using? hth, grant > > lsmod on my machine shows this: > > [surs at ro0:~] lsmod | grep ^ib > ib_ipoib 48008 0 > ib_uat 14840 0 > ib_at 25696 1 ib_uat > ib_sa 17804 2 ib_ipoib,ib_at > ib_ucm 22280 0 > ib_cm 37744 1 ib_ucm > ib_uverbs 35992 0 > ib_umad 18208 0 > ib_mthca 122656 0 > ib_mad 44072 4 ib_sa,ib_cm,ib_umad,ib_mthca > ib_core 56192 8 > ib_ipoib,ib_sa,ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad > > My infiniband devices are (created by hand): > > [surs at ro0:~] ls -l /dev/infiniband/ > total 0 > crw-rw-rw- 1 root root 231, 191 2005-10-20 21:13 uat > crw-rw-rw- 1 root root 231, 224 2005-10-20 21:12 ucm0 > crwxrwxrwx 1 root root 231, 192 2005-09-21 04:37 umad0 > crwxrwxrwx 1 root root 231, 192 2005-09-16 19:29 uverbs0 > crwxrwxrwx 1 root root 231, 192 2005-09-16 19:29 uverbs1 > > > I'd really appreciate if someone could help me understand what might be > going wrong. > > Thanks, > Sayantan. 
> > -- > http://www.cse.ohio-state.edu/~surs > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Fri Oct 21 09:24:51 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 21 Oct 2005 09:24:51 -0700 Subject: [openib-general] Support for UC connections using the CM API? In-Reply-To: <20051020210133.27820.qmail@web32504.mail.mud.yahoo.com> References: <20051020210133.27820.qmail@web32504.mail.mud.yahoo.com> Message-ID: <43591653.4020101@ichips.intel.com> Steven Wooding wrote: > I had a look at where the mask is set in cm.c > (cm_init_qp_rtr_attr() and cm_init_qp_rts_attr()) but > I was unsure how to make the mask depend on the QP > type. Maybe you have a better idea of how to do this. I will take a look at this later today or early next week. Thanks for the feedback. - Sean From tom at ipperformance.com Fri Oct 21 09:29:19 2005 From: tom at ipperformance.com (Tom Tucker) Date: Fri, 21 Oct 2005 11:29:19 -0500 Subject: [openib-general] Semantics of transport neutral connection establishment (was Re: [swg] Re: private data...) In-Reply-To: <20051021160942.GD25476@esmail.cup.hp.com> References: <54AD0F12E08D1541B826BE97C98F99F1025DFD@NT-SJCA-0751.brcm.ad.broadcom.com> <20051021105325.olh7016x7o8o4884@www.opengridcomputing.com> <20051021160942.GD25476@esmail.cup.hp.com> Message-ID: <20051021112919.zw38qcn6skwsk4ck@www.opengridcomputing.com> Quoting Grant Grundler : > On Fri, Oct 21, 2005 at 10:53:25AM -0500, Tom Tucker wrote: > ... >> BTW: I don't disagree that a CM client shouldn't need to be worried about >> incoming requests from non-RDMA clients. > > my parser is tripping all over this one...did you mean: > "I agree a CM client can ignore non-RDMA clients"? > or > "I agree a CM client should only get requests from RDMA clients"? 
Specifically, it means that a CMA consumer over an iWARP transport will not see connect events until after MPA Start Key negotiation. I don't know if this helped the parser or not...

Here's another try...practically speaking if a user does: telnet that an OpenIB app listening on won't see a connect event unless the user is extremely adept at typing in MPA headers...

>
> thanks,
> grant
>

From mshefty at ichips.intel.com Fri Oct 21 09:53:27 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 21 Oct 2005 09:53:27 -0700
Subject: [openib-general] TCP/IP connection service over IB
In-Reply-To:
References:
Message-ID: <43591D07.5050709@ichips.intel.com>

Based on the discussions that have taken place so far, I believe that we're trying to define something more complex than what's needed. Here's what I think that we want:

At a minimum, we need an assigned service ID that identifies a TCP/IP connection service. For simplicity of the implementation, I would use an ID similar to that defined for SDP:

0x00 14 05 xx xx xx xx xx

I don't know that the SWG or IBTA needs to be involved in defining the protocol beyond assigning the service ID.

The connection service can define service IDs as:

0x00 14 05 00 00 00 dst port

And a private data format for the CM REQ:

version(8) | reserved(8) | src port (16)
src ip (16)
dst ip (16)
user private data (56) /* for version 1 */

Other private data would be left unchanged, though if we wanted to get more sophisticated, we could define REJ codes to indicate bad addresses/version/etc.

Not surprisingly, this is exactly what's implemented in the CMA and working today.

On a related note, it would be convenient if SDP were changed to run over this protocol.
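
A minimal C sketch of the layout proposed above may make the sizes easier to check. It is only an illustration under stated assumptions, not the CMA code: the names `IP_CM_SID_PREFIX`, `struct ip_cm_req_data`, `ip_cm_service_id()` and `ip_cm_req_data_init()` are hypothetical, the first row of the format is read as bit widths and the remaining rows as byte counts (4 + 16 + 16 + 56 = 92 bytes, the size of the CM REQ private data field), and the source port is assumed to be stored in network byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed prefix: bytes 00 14 05 00 00 00, with the low 16 bits
 * carrying the destination port. */
#define IP_CM_SID_PREFIX 0x0014050000000000ULL

/* Hypothetical structure; field order follows the format in the message. */
struct ip_cm_req_data {
	uint8_t version;       /* 8 bits */
	uint8_t reserved;      /* 8 bits */
	uint8_t src_port[2];   /* 16 bits, network byte order (assumption) */
	uint8_t src_ip[16];    /* 16 bytes */
	uint8_t dst_ip[16];    /* 16 bytes */
	uint8_t user_data[56]; /* 56 bytes left to the consumer in version 1 */
};

/* Only byte arrays are used, so no padding is expected. */
_Static_assert(sizeof(struct ip_cm_req_data) == 92,
	       "layout must fill the 92-byte CM REQ private data");

/* Build the service ID for a destination port, as a host-order value. */
static uint64_t ip_cm_service_id(uint16_t dst_port)
{
	return IP_CM_SID_PREFIX | dst_port;
}

/* Fill in the header fields and zero the user private data area. */
static void ip_cm_req_data_init(struct ip_cm_req_data *d, uint16_t src_port,
				const uint8_t src_ip[16],
				const uint8_t dst_ip[16])
{
	assert(d != NULL);
	memset(d, 0, sizeof(*d));
	d->version = 1;
	d->src_port[0] = (uint8_t)(src_port >> 8);
	d->src_port[1] = (uint8_t)(src_port & 0xff);
	memcpy(d->src_ip, src_ip, 16);
	memcpy(d->dst_ip, dst_ip, 16);
}
```

With these assumptions, a listener on port 80 would map to service ID 0x0014050000000050, and a consumer would have 56 of the 92 private data bytes for its own use. How an IPv4 address is placed in the 16-byte address fields is not stated in the message, so the sketch takes raw 16-byte arrays and leaves that open.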
- Sean From surs at cse.ohio-state.edu Fri Oct 21 10:50:13 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Fri, 21 Oct 2005 13:50:13 -0400 Subject: [openib-general] uDAPL open HCA problem In-Reply-To: <20051021162343.GE25476@esmail.cup.hp.com> References: <20051021161727.GA23980@cse.ohio-state.edu> <20051021162343.GE25476@esmail.cup.hp.com> Message-ID: <20051021175011.GA25479@cse.ohio-state.edu> * On Oct,2 Grant Grundler wrote : > Folks here will still need to know: Ooops. Forgot to plug int that information! > 1) Which kernel version? 2.6.13.1 > 2) Which SVN version of GEN2 are you using? 3685. Thanks, Sayantan. -- http://www.cse.ohio-state.edu/~surs From Arkady.Kanevsky at netapp.com Fri Oct 21 11:21:23 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 21 Oct 2005 14:21:23 -0400 Subject: [openib-general] configuring ipoib Message-ID: Thanks guys. Please, excuse my terminology. No I can route a single IP address to an IB port. But how do I route 2 (or more) IP addresses to the same IB port? If I specify the same ib# it just changes an associated IP address for the port. If I specify next ib# it returns an error since that ib# does not have a port behind it. Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Grant Grundler [mailto:iod00d at hp.com] > Sent: Friday, October 21, 2005 11:58 AM > To: Kanevsky, Arkady > Cc: openib-general at openib.org > Subject: Re: [openib-general] configuring ipoib > > > On Fri, Oct 21, 2005 at 11:02:37AM -0400, Kanevsky, Arkady wrote: > > How do you configure ipoib? > > I used "ifconfig ib0 ip_address" which works fine. > > But if I have several ports on an HCA how do I specify which port > > ip_address should be associated with? 
> > Nit: For linux, Christoph Hellwig (and others) have explained > that the ip address is associated with the host, NOT any > card. The route to the subnet is associated with the card. > > > Ditto if you have multiple cards. > > iowa:/usr/src/linux-2.6.13# lspci -vt -d 15b3: > -+-[c0]---01.0-[c1]----00.0 Mellanox Technologies MT23108 InfiniHost > +-[40]---01.0-[41]----00.0 Mellanox Technologies MT23108 InfiniHost > \-[00]- > iowa:/usr/src/linux-2.6.13# ifconfig -a | fgrep ib > ib0 Link encap:UNSPEC HWaddr > 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > ib1 Link encap:UNSPEC HWaddr > 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00 > ib2 Link encap:UNSPEC HWaddr > 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > ib3 Link encap:UNSPEC HWaddr > 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00 > > > hth, > grant > From sean.hefty at intel.com Fri Oct 21 11:27:28 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 21 Oct 2005 11:27:28 -0700 Subject: [openib-general] [PATCH] Fix for MAD layer DMA mappings Message-ID: The following patch should fix the MAD layer's DMA mapping issue. This patch includes all related patches that were previously posted. The fix involved changing the MAD layer API. All callers must now use the MAD layer to allocate and free send MADs. DMA mappings are done by the MAD layer. This patch was tested in the kernel, but not for userspace. Since the code makes significant changes to the MAD layer, I'd like to get some wider testing. 
Signed-off-by: Sean Hefty

Index: trunk/src/linux-kernel/infiniband/include/rdma/ib_verbs.h
===================================================================
--- trunk/src/linux-kernel/infiniband/include/rdma/ib_verbs.h (revision 3830)
+++ trunk/src/linux-kernel/infiniband/include/rdma/ib_verbs.h (working copy)
@@ -595,11 +595,8 @@ struct ib_send_wr {
 		} atomic;
 		struct {
 			struct ib_ah *ah;
-			struct ib_mad_hdr *mad_hdr;
 			u32 remote_qpn;
 			u32 remote_qkey;
-			int timeout_ms; /* valid for MADs only */
-			int retries; /* valid for MADs only */
 			u16 pkey_index; /* valid for GSI only */
 			u8 port_num; /* valid for DR SMPs on switch only */
 		} ud;
Index: trunk/src/linux-kernel/infiniband/include/rdma/ib_mad.h
===================================================================
--- trunk/src/linux-kernel/infiniband/include/rdma/ib_mad.h (revision 3830)
+++ trunk/src/linux-kernel/infiniband/include/rdma/ib_mad.h (working copy)
@@ -109,10 +109,14 @@
 #define IB_QP_SET_QKEY 0x80000000
 
 enum {
+	IB_MGMT_MAD_HDR = 24,
 	IB_MGMT_MAD_DATA = 232,
+	IB_MGMT_RMPP_HDR = 36,
 	IB_MGMT_RMPP_DATA = 220,
+	IB_MGMT_VENDOR_HDR = 40,
 	IB_MGMT_VENDOR_DATA = 216,
-	IB_MGMT_SA_DATA = 200
+	IB_MGMT_SA_HDR = 56,
+	IB_MGMT_SA_DATA = 200,
 };
 
 struct ib_mad_hdr {
@@ -203,26 +207,26 @@ struct ib_class_port_info
 
 /**
  * ib_mad_send_buf - MAD data buffer and work request for sends.
+ * @next: A pointer used to chain together MADs for posting.
  * @mad: References an allocated MAD data buffer. The size of the data
  * buffer is specified in the @send_wr.length field.
- * @mapping: DMA mapping information.
  * @mad_agent: MAD agent that allocated the buffer.
+ * @ah: The address handle to use when sending the MAD.
  * @context: User-controlled context fields.
- * @send_wr: An initialized work request structure used when sending the MAD.
- * The wr_id field of the work request is initialized to reference this
- * data structure.
- * @sge: A scatter-gather list referenced by the work request.
+ * @timeout_ms: Time to wait for a response.
+ * @retries: Number of times to retry a request for a response.
  *
  * Users are responsible for initializing the MAD buffer itself, with the
  * exception of specifying the payload length field in any RMPP MAD.
  */
 struct ib_mad_send_buf {
-	struct ib_mad *mad;
-	DECLARE_PCI_UNMAP_ADDR(mapping)
+	struct ib_mad_send_buf *next;
+	void *mad;
 	struct ib_mad_agent *mad_agent;
+	struct ib_ah *ah;
 	void *context[2];
-	struct ib_send_wr send_wr;
-	struct ib_sge sge;
+	int timeout_ms;
+	int retries;
 };
 
 /**
@@ -287,7 +291,7 @@ typedef void (*ib_mad_send_handler)(stru
  * or @mad_send_wc.
  */
 typedef void (*ib_mad_snoop_handler)(struct ib_mad_agent *mad_agent,
-				     struct ib_send_wr *send_wr,
+				     struct ib_mad_send_buf *send_buf,
 				     struct ib_mad_send_wc *mad_send_wc);
 
 /**
@@ -334,13 +338,13 @@ struct ib_mad_agent {
 
 /**
  * ib_mad_send_wc - MAD send completion information.
- * @wr_id: Work request identifier associated with the send MAD request.
+ * @send_buf: Send MAD data buffer associated with the send MAD request.
  * @status: Completion status.
  * @vendor_err: Optional vendor error information returned with a failed
  * request.
  */
 struct ib_mad_send_wc {
-	u64 wr_id;
+	struct ib_mad_send_buf *send_buf;
 	enum ib_wc_status status;
 	u32 vendor_err;
 };
@@ -366,7 +370,7 @@ struct ib_mad_recv_buf {
  * @rmpp_list: Specifies a list of RMPP reassembled received MAD buffers.
  * @mad_len: The length of the received MAD, without duplicated headers.
  *
- * For received response, the wr_id field of the wc is set to the wr_id
+ * For received response, the wr_id contains a pointer to the ib_mad_send_buf
  * for the corresponding send request.
  */
 struct ib_mad_recv_wc {
@@ -463,9 +467,9 @@ int ib_unregister_mad_agent(struct ib_ma
 /**
  * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated
  * with the registered client.
- * @mad_agent: Specifies the associated registration to post the send to.
- * @send_wr: Specifies the information needed to send the MAD(s).
- * @bad_send_wr: Specifies the MAD on which an error was encountered.
+ * @send_buf: Specifies the information needed to send the MAD(s).
+ * @bad_send_buf: Specifies the MAD on which an error was encountered. This
+ * parameter is optional if only a single MAD is posted.
  *
  * Sent MADs are not guaranteed to complete in the order that they were posted.
  *
@@ -479,9 +483,8 @@ int ib_unregister_mad_agent(struct ib_ma
  * defined data being transferred. The paylen_newwin field should be
  * specified in network-byte order.
  */
-int ib_post_send_mad(struct ib_mad_agent *mad_agent,
-		     struct ib_send_wr *send_wr,
-		     struct ib_send_wr **bad_send_wr);
+int ib_post_send_mad(struct ib_mad_send_buf *send_buf,
+		     struct ib_mad_send_buf **bad_send_buf);
 
 /**
  * ib_coalesce_recv_mad - Coalesces received MAD data into a single buffer.
@@ -507,23 +510,25 @@ void ib_free_recv_mad(struct ib_mad_recv
 /**
  * ib_cancel_mad - Cancels an outstanding send MAD operation.
  * @mad_agent: Specifies the registration associated with sent MAD.
- * @wr_id: Indicates the work request identifier of the MAD to cancel.
+ * @send_buf: Indicates the MAD to cancel.
  *
  * MADs will be returned to the user through the corresponding
  * ib_mad_send_handler.
  */
-void ib_cancel_mad(struct ib_mad_agent *mad_agent, u64 wr_id);
+void ib_cancel_mad(struct ib_mad_agent *mad_agent,
+		   struct ib_mad_send_buf *send_buf);
 
 /**
  * ib_modify_mad - Modifies an outstanding send MAD operation.
  * @mad_agent: Specifies the registration associated with sent MAD.
- * @wr_id: Indicates the work request identifier of the MAD to modify.
+ * @send_buf: Indicates the MAD to modify.
  * @timeout_ms: New timeout value for sent MAD.
  *
  * This call will reset the timeout value for a sent MAD to the specified
  * value.
*/ -int ib_modify_mad(struct ib_mad_agent *mad_agent, u64 wr_id, u32 timeout_ms); +int ib_modify_mad(struct ib_mad_agent *mad_agent, + struct ib_mad_send_buf *send_buf, u32 timeout_ms); /** * ib_redirect_mad_qp - Registers a QP for MAD services. @@ -572,7 +577,6 @@ int ib_process_mad_wc(struct ib_mad_agen * @remote_qpn: Specifies the QPN of the receiving node. * @pkey_index: Specifies which PKey the MAD will be sent using. This field * is valid only if the remote_qpn is QP 1. - * @ah: References the address handle used to transfer to the remote node. * @rmpp_active: Indicates if the send will enable RMPP. * @hdr_len: Indicates the size of the data header of the MAD. This length * should include the common MAD header, RMPP header, plus any class @@ -582,11 +586,10 @@ int ib_process_mad_wc(struct ib_mad_agen * additional padding that may be necessary. * @gfp_mask: GFP mask used for the memory allocation. * - * This is a helper routine that may be used to allocate a MAD. Users are - * not required to allocate outbound MADs using this call. The returned - * MAD send buffer will reference a data buffer usable for sending a MAD, along + * This routine allocates a MAD for sending. The returned MAD send buffer + * will reference a data buffer usable for sending a MAD, along * with an initialized work request structure. Users may modify the returned - * MAD data buffer or work request before posting the send. + * MAD data buffer before posting the send. * * The returned data buffer will be cleared. Users are responsible for * initializing the common MAD and any class specific headers. 
If @rmpp_active @@ -594,7 +597,7 @@ int ib_process_mad_wc(struct ib_mad_agen */ struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, u32 remote_qpn, u16 pkey_index, - struct ib_ah *ah, int rmpp_active, + int rmpp_active, int hdr_len, int data_len, unsigned int __nocast gfp_mask); Index: trunk/src/linux-kernel/infiniband/core/agent.c =================================================================== --- trunk/src/linux-kernel/infiniband/core/agent.c (revision 3830) +++ trunk/src/linux-kernel/infiniband/core/agent.c (working copy) @@ -36,58 +36,41 @@ * * $Id$ */ +#include "agent.h" +#include "smi.h" -#include -#include - -#include +#define SPFX "ib_agent: " -#include "smi.h" -#include "agent_priv.h" -#include "mad_priv.h" -#include "agent.h" +struct ib_agent_port_private { + struct list_head port_list; + struct ib_mad_agent *agent[2]; +}; -spinlock_t ib_agent_port_list_lock; +static DEFINE_SPINLOCK(ib_agent_port_list_lock); static LIST_HEAD(ib_agent_port_list); -/* - * Caller must hold ib_agent_port_list_lock - */ -static inline struct ib_agent_port_private * -__ib_get_agent_port(struct ib_device *device, int port_num, - struct ib_mad_agent *mad_agent) +static struct ib_agent_port_private * +__ib_get_agent_port(struct ib_device *device, int port_num) { struct ib_agent_port_private *entry; - BUG_ON(!(!!device ^ !!mad_agent)); /* Exactly one MUST be (!NULL) */ - - if (device) { - list_for_each_entry(entry, &ib_agent_port_list, port_list) { - if (entry->smp_agent->device == device && - entry->port_num == port_num) - return entry; - } - } else { - list_for_each_entry(entry, &ib_agent_port_list, port_list) { - if ((entry->smp_agent == mad_agent) || - (entry->perf_mgmt_agent == mad_agent)) - return entry; - } + list_for_each_entry(entry, &ib_agent_port_list, port_list) { + if (entry->agent[0]->device == device && + entry->agent[0]->port_num == port_num) + return entry; } return NULL; } -static inline struct ib_agent_port_private * 
-ib_get_agent_port(struct ib_device *device, int port_num, - struct ib_mad_agent *mad_agent) +static struct ib_agent_port_private * +ib_get_agent_port(struct ib_device *device, int port_num) { struct ib_agent_port_private *entry; unsigned long flags; spin_lock_irqsave(&ib_agent_port_list_lock, flags); - entry = __ib_get_agent_port(device, port_num, mad_agent); + entry = __ib_get_agent_port(device, port_num); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); - return entry; } @@ -99,192 +82,76 @@ int smi_check_local_dr_smp(struct ib_smp if (smp->mgmt_class != IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) return 1; - port_priv = ib_get_agent_port(device, port_num, NULL); + + port_priv = ib_get_agent_port(device, port_num); if (!port_priv) { printk(KERN_DEBUG SPFX "smi_check_local_dr_smp %s port %d " - "not open\n", - device->name, port_num); + "not open\n", device->name, port_num); return 1; } - return smi_check_local_smp(port_priv->smp_agent, smp); + return smi_check_local_smp(port_priv->agent[0], smp); } -static int agent_mad_send(struct ib_mad_agent *mad_agent, - struct ib_agent_port_private *port_priv, - struct ib_mad_private *mad_priv, - struct ib_grh *grh, - struct ib_wc *wc) -{ - struct ib_agent_send_wr *agent_send_wr; - struct ib_sge gather_list; - struct ib_send_wr send_wr; - struct ib_send_wr *bad_send_wr; - struct ib_ah_attr ah_attr; - unsigned long flags; - int ret = 1; - - agent_send_wr = kmalloc(sizeof(*agent_send_wr), GFP_KERNEL); - if (!agent_send_wr) - goto out; - agent_send_wr->mad = mad_priv; - - gather_list.addr = dma_map_single(mad_agent->device->dma_device, - &mad_priv->mad, - sizeof(mad_priv->mad), - DMA_TO_DEVICE); - gather_list.length = sizeof(mad_priv->mad); - gather_list.lkey = mad_agent->mr->lkey; - - send_wr.next = NULL; - send_wr.opcode = IB_WR_SEND; - send_wr.sg_list = &gather_list; - send_wr.num_sge = 1; - send_wr.wr.ud.remote_qpn = wc->src_qp; /* DQPN */ - send_wr.wr.ud.timeout_ms = 0; - send_wr.send_flags = IB_SEND_SIGNALED | 
IB_SEND_SOLICITED; - - ah_attr.dlid = wc->slid; - ah_attr.port_num = mad_agent->port_num; - ah_attr.src_path_bits = wc->dlid_path_bits; - ah_attr.sl = wc->sl; - ah_attr.static_rate = 0; - ah_attr.ah_flags = 0; /* No GRH */ - if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - if (wc->wc_flags & IB_WC_GRH) { - ah_attr.ah_flags = IB_AH_GRH; - /* Should sgid be looked up ? */ - ah_attr.grh.sgid_index = 0; - ah_attr.grh.hop_limit = grh->hop_limit; - ah_attr.grh.flow_label = be32_to_cpu( - grh->version_tclass_flow) & 0xfffff; - ah_attr.grh.traffic_class = (be32_to_cpu( - grh->version_tclass_flow) >> 20) & 0xff; - memcpy(ah_attr.grh.dgid.raw, - grh->sgid.raw, - sizeof(ah_attr.grh.dgid)); - } - } - - agent_send_wr->ah = ib_create_ah(mad_agent->qp->pd, &ah_attr); - if (IS_ERR(agent_send_wr->ah)) { - printk(KERN_ERR SPFX "No memory for address handle\n"); - kfree(agent_send_wr); - goto out; - } - - send_wr.wr.ud.ah = agent_send_wr->ah; - if (mad_priv->mad.mad.mad_hdr.mgmt_class == IB_MGMT_CLASS_PERF_MGMT) { - send_wr.wr.ud.pkey_index = wc->pkey_index; - send_wr.wr.ud.remote_qkey = IB_QP1_QKEY; - } else { /* for SMPs */ - send_wr.wr.ud.pkey_index = 0; - send_wr.wr.ud.remote_qkey = 0; - } - send_wr.wr.ud.mad_hdr = &mad_priv->mad.mad.mad_hdr; - send_wr.wr_id = (unsigned long)agent_send_wr; - - pci_unmap_addr_set(agent_send_wr, mapping, gather_list.addr); - - /* Send */ - spin_lock_irqsave(&port_priv->send_list_lock, flags); - if (ib_post_send_mad(mad_agent, &send_wr, &bad_send_wr)) { - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - dma_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(agent_send_wr, mapping), - sizeof(mad_priv->mad), - DMA_TO_DEVICE); - ib_destroy_ah(agent_send_wr->ah); - kfree(agent_send_wr); - } else { - list_add_tail(&agent_send_wr->send_list, - &port_priv->send_posted_list); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - ret = 0; - } - -out: - return ret; -} - -int agent_send(struct 
ib_mad_private *mad, - struct ib_grh *grh, - struct ib_wc *wc, - struct ib_device *device, - int port_num) +int agent_send_response(struct ib_mad *mad, struct ib_grh *grh, + struct ib_wc *wc, struct ib_device *device, + int port_num, int qpn) { struct ib_agent_port_private *port_priv; - struct ib_mad_agent *mad_agent; + struct ib_mad_agent *agent; + struct ib_mad_send_buf *send_buf; + struct ib_ah *ah; + int ret; - port_priv = ib_get_agent_port(device, port_num, NULL); + port_priv = ib_get_agent_port(device, port_num); if (!port_priv) { - printk(KERN_DEBUG SPFX "agent_send %s port %d not open\n", - device->name, port_num); - return 1; + printk(KERN_ERR SPFX "Unable to find port agent\n"); + return -ENODEV; } - /* Get mad agent based on mgmt_class in MAD */ - switch (mad->mad.mad.mad_hdr.mgmt_class) { - case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE: - case IB_MGMT_CLASS_SUBN_LID_ROUTED: - mad_agent = port_priv->smp_agent; - break; - case IB_MGMT_CLASS_PERF_MGMT: - mad_agent = port_priv->perf_mgmt_agent; - break; - default: - return 1; + agent = port_priv->agent[qpn]; + ah = ib_create_ah_from_wc(agent->qp->pd, wc, grh, port_num); + if (IS_ERR(ah)) { + ret = PTR_ERR(ah); + printk(KERN_ERR SPFX "ib_create_ah_from_wc error:%d\n", ret); + return ret; + } + + send_buf = ib_create_send_mad(agent, wc->src_qp, wc->pkey_index, 0, + IB_MGMT_MAD_HDR, IB_MGMT_MAD_DATA, + GFP_KERNEL); + if (IS_ERR(send_buf)) { + ret = PTR_ERR(send_buf); + printk(KERN_ERR SPFX "ib_create_send_mad error:%d\n", ret); + goto err1; + } + + memcpy(send_buf->mad, mad, sizeof *mad); + send_buf->ah = ah; + if ((ret = ib_post_send_mad(send_buf, NULL))) { + printk(KERN_ERR SPFX "ib_post_send_mad error:%d\n", ret); + goto err2; } - - return agent_mad_send(mad_agent, port_priv, mad, grh, wc); + return 0; +err2: + ib_free_send_mad(send_buf); +err1: + ib_destroy_ah(ah); + return ret; } static void agent_send_handler(struct ib_mad_agent *mad_agent, struct ib_mad_send_wc *mad_send_wc) { - struct ib_agent_port_private 
*port_priv; - struct ib_agent_send_wr *agent_send_wr; - unsigned long flags; - - /* Find matching MAD agent */ - port_priv = ib_get_agent_port(NULL, 0, mad_agent); - if (!port_priv) { - printk(KERN_ERR SPFX "agent_send_handler: no matching MAD " - "agent %p\n", mad_agent); - return; - } - - agent_send_wr = (struct ib_agent_send_wr *)(unsigned long)mad_send_wc->wr_id; - spin_lock_irqsave(&port_priv->send_list_lock, flags); - /* Remove completed send from posted send MAD list */ - list_del(&agent_send_wr->send_list); - spin_unlock_irqrestore(&port_priv->send_list_lock, flags); - - dma_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(agent_send_wr, mapping), - sizeof(agent_send_wr->mad->mad), - DMA_TO_DEVICE); - - ib_destroy_ah(agent_send_wr->ah); - - /* Release allocated memory */ - kmem_cache_free(ib_mad_cache, agent_send_wr->mad); - kfree(agent_send_wr); + ib_destroy_ah(mad_send_wc->send_buf->ah); + ib_free_send_mad(mad_send_wc->send_buf); } int ib_agent_port_open(struct ib_device *device, int port_num) { - int ret; struct ib_agent_port_private *port_priv; unsigned long flags; - - /* First, check if port already open for SMI */ - port_priv = ib_get_agent_port(device, port_num, NULL); - if (port_priv) { - printk(KERN_DEBUG SPFX "%s port %d already open\n", - device->name, port_num); - return 0; - } + int ret; /* Create new device info */ port_priv = kmalloc(sizeof *port_priv, GFP_KERNEL); @@ -293,32 +160,25 @@ int ib_agent_port_open(struct ib_device ret = -ENOMEM; goto error1; } - memset(port_priv, 0, sizeof *port_priv); - port_priv->port_num = port_num; - spin_lock_init(&port_priv->send_list_lock); - INIT_LIST_HEAD(&port_priv->send_posted_list); - - /* Obtain send only MAD agent for SM class (SMI QP) */ - port_priv->smp_agent = ib_register_mad_agent(device, port_num, - IB_QPT_SMI, - NULL, 0, - &agent_send_handler, - NULL, NULL); - if (IS_ERR(port_priv->smp_agent)) { - ret = PTR_ERR(port_priv->smp_agent); + /* Obtain send only MAD agent for SMI QP */ + 
port_priv->agent[0] = ib_register_mad_agent(device, port_num, + IB_QPT_SMI, NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->agent[0])) { + ret = PTR_ERR(port_priv->agent[0]); goto error2; } - /* Obtain send only MAD agent for PerfMgmt class (GSI QP) */ - port_priv->perf_mgmt_agent = ib_register_mad_agent(device, port_num, - IB_QPT_GSI, - NULL, 0, - &agent_send_handler, - NULL, NULL); - if (IS_ERR(port_priv->perf_mgmt_agent)) { - ret = PTR_ERR(port_priv->perf_mgmt_agent); + /* Obtain send only MAD agent for GSI QP */ + port_priv->agent[1] = ib_register_mad_agent(device, port_num, + IB_QPT_GSI, NULL, 0, + &agent_send_handler, + NULL, NULL); + if (IS_ERR(port_priv->agent[1])) { + ret = PTR_ERR(port_priv->agent[1]); goto error3; } @@ -329,7 +189,7 @@ int ib_agent_port_open(struct ib_device return 0; error3: - ib_unregister_mad_agent(port_priv->smp_agent); + ib_unregister_mad_agent(port_priv->agent[0]); error2: kfree(port_priv); error1: @@ -342,7 +202,7 @@ int ib_agent_port_close(struct ib_device unsigned long flags; spin_lock_irqsave(&ib_agent_port_list_lock, flags); - port_priv = __ib_get_agent_port(device, port_num, NULL); + port_priv = __ib_get_agent_port(device, port_num); if (port_priv == NULL) { spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); printk(KERN_ERR SPFX "Port %d not found\n", port_num); @@ -351,9 +211,8 @@ int ib_agent_port_close(struct ib_device list_del(&port_priv->port_list); spin_unlock_irqrestore(&ib_agent_port_list_lock, flags); - ib_unregister_mad_agent(port_priv->perf_mgmt_agent); - ib_unregister_mad_agent(port_priv->smp_agent); + ib_unregister_mad_agent(port_priv->agent[1]); + ib_unregister_mad_agent(port_priv->agent[0]); kfree(port_priv); - return 0; } Index: trunk/src/linux-kernel/infiniband/core/mad_rmpp.c =================================================================== --- trunk/src/linux-kernel/infiniband/core/mad_rmpp.c (revision 3830) +++ trunk/src/linux-kernel/infiniband/core/mad_rmpp.c (working copy) 
@@ -103,12 +103,12 @@ void ib_cancel_rmpp_recvs(struct ib_mad_ static int data_offset(u8 mgmt_class) { if (mgmt_class == IB_MGMT_CLASS_SUBN_ADM) - return offsetof(struct ib_sa_mad, data); + return IB_MGMT_SA_HDR; else if ((mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START) && (mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END)) - return offsetof(struct ib_vendor_mad, data); + return IB_MGMT_VENDOR_HDR; else - return offsetof(struct ib_rmpp_mad, data); + return IB_MGMT_RMPP_HDR; } static void format_ack(struct ib_rmpp_mad *ack, @@ -135,21 +135,18 @@ static void ack_recv(struct mad_rmpp_rec struct ib_mad_recv_wc *recv_wc) { struct ib_mad_send_buf *msg; - struct ib_send_wr *bad_send_wr; - int hdr_len, ret; + int ret; - hdr_len = sizeof(struct ib_mad_hdr) + sizeof(struct ib_rmpp_hdr); msg = ib_create_send_mad(&rmpp_recv->agent->agent, recv_wc->wc->src_qp, - recv_wc->wc->pkey_index, rmpp_recv->ah, 1, - hdr_len, sizeof(struct ib_rmpp_mad) - hdr_len, - GFP_KERNEL); + recv_wc->wc->pkey_index, 1, IB_MGMT_RMPP_HDR, + IB_MGMT_RMPP_DATA, GFP_KERNEL); if (!msg) return; - format_ack((struct ib_rmpp_mad *) msg->mad, - (struct ib_rmpp_mad *) recv_wc->recv_buf.mad, rmpp_recv); - ret = ib_post_send_mad(&rmpp_recv->agent->agent, &msg->send_wr, - &bad_send_wr); + format_ack(msg->mad, (struct ib_rmpp_mad *) recv_wc->recv_buf.mad, + rmpp_recv); + msg->ah = rmpp_recv->ah; + ret = ib_post_send_mad(msg, NULL); if (ret) ib_free_send_mad(msg); } @@ -160,30 +157,31 @@ static int alloc_response_msg(struct ib_ { struct ib_mad_send_buf *m; struct ib_ah *ah; - int hdr_len; ah = ib_create_ah_from_wc(agent->qp->pd, recv_wc->wc, recv_wc->recv_buf.grh, agent->port_num); if (IS_ERR(ah)) return PTR_ERR(ah); - hdr_len = sizeof(struct ib_mad_hdr) + sizeof(struct ib_rmpp_hdr); m = ib_create_send_mad(agent, recv_wc->wc->src_qp, - recv_wc->wc->pkey_index, ah, 1, hdr_len, - sizeof(struct ib_rmpp_mad) - hdr_len, - GFP_KERNEL); + recv_wc->wc->pkey_index, 1, + IB_MGMT_RMPP_HDR, IB_MGMT_RMPP_DATA, GFP_KERNEL); if 
(IS_ERR(m)) { ib_destroy_ah(ah); return PTR_ERR(m); } + m->ah = ah; *msg = m; return 0; } -static void free_msg(struct ib_mad_send_buf *msg) +void ib_rmpp_send_handler(struct ib_mad_send_wc *mad_send_wc) { - ib_destroy_ah(msg->send_wr.wr.ud.ah); - ib_free_send_mad(msg); + struct ib_rmpp_mad *rmpp_mad = mad_send_wc->send_buf->mad; + + if (rmpp_mad->rmpp_hdr.rmpp_type != IB_MGMT_RMPP_TYPE_ACK) + ib_destroy_ah(mad_send_wc->send_buf->ah); + ib_free_send_mad(mad_send_wc->send_buf); } static void nack_recv(struct ib_mad_agent_private *agent, @@ -191,14 +189,13 @@ static void nack_recv(struct ib_mad_agen { struct ib_mad_send_buf *msg; struct ib_rmpp_mad *rmpp_mad; - struct ib_send_wr *bad_send_wr; int ret; ret = alloc_response_msg(&agent->agent, recv_wc, &msg); if (ret) return; - rmpp_mad = (struct ib_rmpp_mad *) msg->mad; + rmpp_mad = msg->mad; memcpy(rmpp_mad, recv_wc->recv_buf.mad, data_offset(recv_wc->recv_buf.mad->mad_hdr.mgmt_class)); @@ -210,9 +207,11 @@ static void nack_recv(struct ib_mad_agen rmpp_mad->rmpp_hdr.seg_num = 0; rmpp_mad->rmpp_hdr.paylen_newwin = 0; - ret = ib_post_send_mad(&agent->agent, &msg->send_wr, &bad_send_wr); - if (ret) - free_msg(msg); + ret = ib_post_send_mad(msg, NULL); + if (ret) { + ib_destroy_ah(msg->ah); + ib_free_send_mad(msg); + } } static void recv_timeout_handler(void *data) @@ -585,7 +584,7 @@ static int send_next_seg(struct ib_mad_s int timeout; u32 paylen; - rmpp_mad = (struct ib_rmpp_mad *)mad_send_wr->send_wr.wr.ud.mad_hdr; + rmpp_mad = mad_send_wr->send_buf.mad; ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE); rmpp_mad->rmpp_hdr.seg_num = cpu_to_be32(mad_send_wr->seg_num); @@ -612,7 +611,7 @@ static int send_next_seg(struct ib_mad_s } /* 2 seconds for an ACK until we can find the packet lifetime */ - timeout = mad_send_wr->send_wr.wr.ud.timeout_ms; + timeout = mad_send_wr->send_buf.timeout_ms; if (!timeout || timeout > 2000) mad_send_wr->timeout = msecs_to_jiffies(2000); mad_send_wr->seg_num++; @@ -640,7 
+639,7 @@ static void abort_send(struct ib_mad_age wc.status = IB_WC_REM_ABORT_ERR; wc.vendor_err = rmpp_status; - wc.wr_id = mad_send_wr->wr_id; + wc.send_buf = &mad_send_wr->send_buf; ib_mad_complete_send_wr(mad_send_wr, &wc); return; out: @@ -694,12 +693,12 @@ static void process_rmpp_ack(struct ib_m if (seg_num > mad_send_wr->last_ack) { mad_send_wr->last_ack = seg_num; - mad_send_wr->retries = mad_send_wr->send_wr.wr.ud.retries; + mad_send_wr->retries = mad_send_wr->send_buf.retries; } mad_send_wr->newwin = newwin; if (mad_send_wr->last_ack == mad_send_wr->total_seg) { /* If no response is expected, the ACK completes the send */ - if (!mad_send_wr->send_wr.wr.ud.timeout_ms) { + if (!mad_send_wr->send_buf.timeout_ms) { struct ib_mad_send_wc wc; ib_mark_mad_done(mad_send_wr); @@ -707,13 +706,13 @@ static void process_rmpp_ack(struct ib_m wc.status = IB_WC_SUCCESS; wc.vendor_err = 0; - wc.wr_id = mad_send_wr->wr_id; + wc.send_buf = &mad_send_wr->send_buf; ib_mad_complete_send_wr(mad_send_wr, &wc); return; } if (mad_send_wr->refcount == 1) - ib_reset_mad_timeout(mad_send_wr, mad_send_wr-> - send_wr.wr.ud.timeout_ms); + ib_reset_mad_timeout(mad_send_wr, + mad_send_wr->send_buf.timeout_ms); } else if (mad_send_wr->refcount == 1 && mad_send_wr->seg_num < mad_send_wr->newwin && mad_send_wr->seg_num <= mad_send_wr->total_seg) { @@ -842,7 +841,7 @@ int ib_send_rmpp_mad(struct ib_mad_send_ struct ib_rmpp_mad *rmpp_mad; int i, total_len, ret; - rmpp_mad = (struct ib_rmpp_mad *)mad_send_wr->send_wr.wr.ud.mad_hdr; + rmpp_mad = mad_send_wr->send_buf.mad; if (!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & IB_MGMT_RMPP_FLAG_ACTIVE)) return IB_RMPP_RESULT_UNHANDLED; @@ -863,7 +862,7 @@ int ib_send_rmpp_mad(struct ib_mad_send_ mad_send_wr->total_seg = (total_len - mad_send_wr->data_offset) / (sizeof(struct ib_rmpp_mad) - mad_send_wr->data_offset); - mad_send_wr->pad = total_len - offsetof(struct ib_rmpp_mad, data) - + mad_send_wr->pad = total_len - IB_MGMT_RMPP_HDR - 
be32_to_cpu(rmpp_mad->rmpp_hdr.paylen_newwin); /* We need to wait for the final ACK even if there isn't a response */ @@ -878,23 +877,15 @@ int ib_process_rmpp_send_wc(struct ib_ma struct ib_mad_send_wc *mad_send_wc) { struct ib_rmpp_mad *rmpp_mad; - struct ib_mad_send_buf *msg; int ret; - rmpp_mad = (struct ib_rmpp_mad *)mad_send_wr->send_wr.wr.ud.mad_hdr; + rmpp_mad = mad_send_wr->send_buf.mad; if (!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & IB_MGMT_RMPP_FLAG_ACTIVE)) return IB_RMPP_RESULT_UNHANDLED; /* RMPP not active */ - if (rmpp_mad->rmpp_hdr.rmpp_type != IB_MGMT_RMPP_TYPE_DATA) { - msg = (struct ib_mad_send_buf *) (unsigned long) - mad_send_wc->wr_id; - if (rmpp_mad->rmpp_hdr.rmpp_type == IB_MGMT_RMPP_TYPE_ACK) - ib_free_send_mad(msg); - else - free_msg(msg); + if (rmpp_mad->rmpp_hdr.rmpp_type != IB_MGMT_RMPP_TYPE_DATA) return IB_RMPP_RESULT_INTERNAL; /* ACK, STOP, or ABORT */ - } if (mad_send_wc->status != IB_WC_SUCCESS || mad_send_wr->status != IB_WC_SUCCESS) @@ -905,7 +896,7 @@ int ib_process_rmpp_send_wc(struct ib_ma if (mad_send_wr->last_ack == mad_send_wr->total_seg) { mad_send_wr->timeout = - msecs_to_jiffies(mad_send_wr->send_wr.wr.ud.timeout_ms); + msecs_to_jiffies(mad_send_wr->send_buf.timeout_ms); return IB_RMPP_RESULT_PROCESSED; /* Send done */ } @@ -926,7 +917,7 @@ int ib_retry_rmpp(struct ib_mad_send_wr_ struct ib_rmpp_mad *rmpp_mad; int ret; - rmpp_mad = (struct ib_rmpp_mad *)mad_send_wr->send_wr.wr.ud.mad_hdr; + rmpp_mad = mad_send_wr->send_buf.mad; if (!(ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & IB_MGMT_RMPP_FLAG_ACTIVE)) return IB_RMPP_RESULT_UNHANDLED; /* RMPP not active */ Index: trunk/src/linux-kernel/infiniband/core/cm.c =================================================================== --- trunk/src/linux-kernel/infiniband/core/cm.c (revision 3830) +++ trunk/src/linux-kernel/infiniband/core/cm.c (working copy) @@ -175,8 +175,7 @@ static int cm_alloc_msg(struct cm_id_pri m = ib_create_send_mad(mad_agent, cm_id_priv->id.remote_cm_qpn, 
cm_id_priv->av.pkey_index, - ah, 0, sizeof(struct ib_mad_hdr), - sizeof(struct ib_mad)-sizeof(struct ib_mad_hdr), + 0, IB_MGMT_MAD_HDR, IB_MGMT_MAD_DATA, GFP_ATOMIC); if (IS_ERR(m)) { ib_destroy_ah(ah); @@ -184,7 +183,8 @@ static int cm_alloc_msg(struct cm_id_pri } /* Timeout set by caller if response is expected. */ - m->send_wr.wr.ud.retries = cm_id_priv->max_cm_retries; + m->ah = ah; + m->retries = cm_id_priv->max_cm_retries; atomic_inc(&cm_id_priv->refcount); m->context[0] = cm_id_priv; @@ -205,20 +205,20 @@ static int cm_alloc_response_msg(struct return PTR_ERR(ah); m = ib_create_send_mad(port->mad_agent, 1, mad_recv_wc->wc->pkey_index, - ah, 0, sizeof(struct ib_mad_hdr), - sizeof(struct ib_mad)-sizeof(struct ib_mad_hdr), + 0, IB_MGMT_MAD_HDR, IB_MGMT_MAD_DATA, GFP_ATOMIC); if (IS_ERR(m)) { ib_destroy_ah(ah); return PTR_ERR(m); } + m->ah = ah; *msg = m; return 0; } static void cm_free_msg(struct ib_mad_send_buf *msg) { - ib_destroy_ah(msg->send_wr.wr.ud.ah); + ib_destroy_ah(msg->ah); if (msg->context[0]) cm_deref_id(msg->context[0]); ib_free_send_mad(msg); @@ -677,8 +677,7 @@ retest: break; case IB_CM_SIDR_REQ_SENT: cm_id->state = IB_CM_IDLE; - ib_cancel_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg); + ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); spin_unlock_irqrestore(&cm_id_priv->lock, flags); break; case IB_CM_SIDR_REQ_RCVD: @@ -689,8 +688,7 @@ retest: case IB_CM_MRA_REQ_RCVD: case IB_CM_REP_SENT: case IB_CM_MRA_REP_RCVD: - ib_cancel_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg); + ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); /* Fall through */ case IB_CM_REQ_RCVD: case IB_CM_MRA_REQ_SENT: @@ -707,8 +705,7 @@ retest: ib_send_cm_dreq(cm_id, NULL, 0); goto retest; case IB_CM_DREQ_SENT: - ib_cancel_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg); + ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); cm_enter_timewait(cm_id_priv); 
spin_unlock_irqrestore(&cm_id_priv->lock, flags); break; @@ -882,7 +879,6 @@ int ib_send_cm_req(struct ib_cm_id *cm_i struct ib_cm_req_param *param) { struct cm_id_private *cm_id_priv; - struct ib_send_wr *bad_send_wr; struct cm_req_msg *req_msg; unsigned long flags; int ret; @@ -934,7 +930,7 @@ int ib_send_cm_req(struct ib_cm_id *cm_i req_msg = (struct cm_req_msg *) cm_id_priv->msg->mad; cm_format_req(req_msg, cm_id_priv, param); cm_id_priv->tid = req_msg->hdr.tid; - cm_id_priv->msg->send_wr.wr.ud.timeout_ms = cm_id_priv->timeout_ms; + cm_id_priv->msg->timeout_ms = cm_id_priv->timeout_ms; cm_id_priv->msg->context[1] = (void *) (unsigned long) IB_CM_REQ_SENT; cm_id_priv->local_qpn = cm_req_get_local_qpn(req_msg); @@ -943,8 +939,7 @@ int ib_send_cm_req(struct ib_cm_id *cm_i cm_req_get_primary_local_ack_timeout(req_msg); spin_lock_irqsave(&cm_id_priv->lock, flags); - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &cm_id_priv->msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(cm_id_priv->msg, NULL); if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); goto error2; @@ -967,7 +962,6 @@ static int cm_issue_rej(struct cm_port * void *ari, u8 ari_length) { struct ib_mad_send_buf *msg = NULL; - struct ib_send_wr *bad_send_wr; struct cm_rej_msg *rej_msg, *rcv_msg; int ret; @@ -990,7 +984,7 @@ static int cm_issue_rej(struct cm_port * memcpy(rej_msg->ari, ari, ari_length); } - ret = ib_post_send_mad(port->mad_agent, &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) cm_free_msg(msg); @@ -1170,7 +1164,6 @@ static void cm_dup_req_handler(struct cm struct cm_id_private *cm_id_priv) { struct ib_mad_send_buf *msg = NULL; - struct ib_send_wr *bad_send_wr; unsigned long flags; int ret; @@ -1199,8 +1192,7 @@ static void cm_dup_req_handler(struct cm } spin_unlock_irqrestore(&cm_id_priv->lock, flags); - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, &msg->send_wr, - &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) goto 
free; return; @@ -1364,7 +1356,6 @@ int ib_send_cm_rep(struct ib_cm_id *cm_i struct cm_id_private *cm_id_priv; struct ib_mad_send_buf *msg; struct cm_rep_msg *rep_msg; - struct ib_send_wr *bad_send_wr; unsigned long flags; int ret; @@ -1386,11 +1377,10 @@ int ib_send_cm_rep(struct ib_cm_id *cm_i rep_msg = (struct cm_rep_msg *) msg->mad; cm_format_rep(rep_msg, cm_id_priv, param); - msg->send_wr.wr.ud.timeout_ms = cm_id_priv->timeout_ms; + msg->timeout_ms = cm_id_priv->timeout_ms; msg->context[1] = (void *) (unsigned long) IB_CM_REP_SENT; - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); cm_free_msg(msg); @@ -1428,7 +1418,6 @@ int ib_send_cm_rtu(struct ib_cm_id *cm_i { struct cm_id_private *cm_id_priv; struct ib_mad_send_buf *msg; - struct ib_send_wr *bad_send_wr; unsigned long flags; void *data; int ret; @@ -1455,8 +1444,7 @@ int ib_send_cm_rtu(struct ib_cm_id *cm_i cm_format_rtu((struct cm_rtu_msg *) msg->mad, cm_id_priv, private_data, private_data_len); - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); cm_free_msg(msg); @@ -1501,7 +1489,6 @@ static void cm_dup_rep_handler(struct cm struct cm_id_private *cm_id_priv; struct cm_rep_msg *rep_msg; struct ib_mad_send_buf *msg = NULL; - struct ib_send_wr *bad_send_wr; unsigned long flags; int ret; @@ -1529,8 +1516,7 @@ static void cm_dup_rep_handler(struct cm goto unlock; spin_unlock_irqrestore(&cm_id_priv->lock, flags); - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, &msg->send_wr, - &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) goto free; goto deref; @@ -1598,8 +1584,7 @@ static int cm_rep_handler(struct cm_work /* todo: handle peer_to_peer */ - ib_cancel_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg); 
+ ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); @@ -1633,8 +1618,7 @@ static int cm_establish_handler(struct c goto out; } - ib_cancel_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg); + ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); @@ -1673,8 +1657,7 @@ static int cm_rtu_handler(struct cm_work } cm_id_priv->id.state = IB_CM_ESTABLISHED; - ib_cancel_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg); + ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); @@ -1711,7 +1694,6 @@ int ib_send_cm_dreq(struct ib_cm_id *cm_ { struct cm_id_private *cm_id_priv; struct ib_mad_send_buf *msg; - struct ib_send_wr *bad_send_wr; unsigned long flags; int ret; @@ -1733,11 +1715,10 @@ int ib_send_cm_dreq(struct ib_cm_id *cm_ cm_format_dreq((struct cm_dreq_msg *) msg->mad, cm_id_priv, private_data, private_data_len); - msg->send_wr.wr.ud.timeout_ms = cm_id_priv->timeout_ms; + msg->timeout_ms = cm_id_priv->timeout_ms; msg->context[1] = (void *) (unsigned long) IB_CM_DREQ_SENT; - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) { cm_enter_timewait(cm_id_priv); spin_unlock_irqrestore(&cm_id_priv->lock, flags); @@ -1771,7 +1752,6 @@ int ib_send_cm_drep(struct ib_cm_id *cm_ { struct cm_id_private *cm_id_priv; struct ib_mad_send_buf *msg; - struct ib_send_wr *bad_send_wr; unsigned long flags; void *data; int ret; @@ -1801,8 +1781,7 @@ int ib_send_cm_drep(struct ib_cm_id *cm_ cm_format_drep((struct cm_drep_msg *) msg->mad, cm_id_priv, private_data, private_data_len); - ret = 
ib_post_send_mad(cm_id_priv->av.port->mad_agent, &msg->send_wr, - &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); cm_free_msg(msg); @@ -1819,7 +1798,6 @@ static int cm_dreq_handler(struct cm_wor struct cm_id_private *cm_id_priv; struct cm_dreq_msg *dreq_msg; struct ib_mad_send_buf *msg = NULL; - struct ib_send_wr *bad_send_wr; unsigned long flags; int ret; @@ -1838,8 +1816,7 @@ static int cm_dreq_handler(struct cm_wor switch (cm_id_priv->id.state) { case IB_CM_REP_SENT: case IB_CM_DREQ_SENT: - ib_cancel_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg); + ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); break; case IB_CM_ESTABLISHED: case IB_CM_MRA_REP_RCVD: @@ -1853,8 +1830,7 @@ static int cm_dreq_handler(struct cm_wor cm_id_priv->private_data_len); spin_unlock_irqrestore(&cm_id_priv->lock, flags); - if (ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr)) + if (ib_post_send_mad(msg, NULL)) cm_free_msg(msg); goto deref; default: @@ -1901,8 +1877,7 @@ static int cm_drep_handler(struct cm_wor } cm_enter_timewait(cm_id_priv); - ib_cancel_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg); + ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); ret = atomic_inc_and_test(&cm_id_priv->work_count); if (!ret) list_add_tail(&work->list, &cm_id_priv->work_list); @@ -1927,7 +1902,6 @@ int ib_send_cm_rej(struct ib_cm_id *cm_i { struct cm_id_private *cm_id_priv; struct ib_mad_send_buf *msg; - struct ib_send_wr *bad_send_wr; unsigned long flags; int ret; @@ -1971,8 +1945,7 @@ int ib_send_cm_rej(struct ib_cm_id *cm_i if (ret) goto out; - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) cm_free_msg(msg); @@ -2048,8 +2021,7 @@ static int cm_rej_handler(struct cm_work case IB_CM_MRA_REQ_RCVD: case IB_CM_REP_SENT: case IB_CM_MRA_REP_RCVD: - 
ib_cancel_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg); + ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); /* fall through */ case IB_CM_REQ_RCVD: case IB_CM_MRA_REQ_SENT: @@ -2059,8 +2031,7 @@ static int cm_rej_handler(struct cm_work cm_reset_to_idle(cm_id_priv); break; case IB_CM_DREQ_SENT: - ib_cancel_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg); + ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); /* fall through */ case IB_CM_REP_RCVD: case IB_CM_MRA_REP_SENT: @@ -2095,7 +2066,6 @@ int ib_send_cm_mra(struct ib_cm_id *cm_i { struct cm_id_private *cm_id_priv; struct ib_mad_send_buf *msg; - struct ib_send_wr *bad_send_wr; void *data; unsigned long flags; int ret; @@ -2119,8 +2089,7 @@ int ib_send_cm_mra(struct ib_cm_id *cm_i cm_format_mra((struct cm_mra_msg *) msg->mad, cm_id_priv, CM_MSG_RESPONSE_REQ, service_timeout, private_data, private_data_len); - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) goto error2; cm_id->state = IB_CM_MRA_REQ_SENT; @@ -2133,8 +2102,7 @@ int ib_send_cm_mra(struct ib_cm_id *cm_i cm_format_mra((struct cm_mra_msg *) msg->mad, cm_id_priv, CM_MSG_RESPONSE_REP, service_timeout, private_data, private_data_len); - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) goto error2; cm_id->state = IB_CM_MRA_REP_SENT; @@ -2147,8 +2115,7 @@ int ib_send_cm_mra(struct ib_cm_id *cm_i cm_format_mra((struct cm_mra_msg *) msg->mad, cm_id_priv, CM_MSG_RESPONSE_OTHER, service_timeout, private_data, private_data_len); - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) goto error2; cm_id->lap_state = IB_CM_MRA_LAP_SENT; @@ -2210,14 +2177,14 @@ static int cm_mra_handler(struct cm_work case IB_CM_REQ_SENT: if 
(cm_mra_get_msg_mraed(mra_msg) != CM_MSG_RESPONSE_REQ || ib_modify_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg, timeout)) + cm_id_priv->msg, timeout)) goto out; cm_id_priv->id.state = IB_CM_MRA_REQ_RCVD; break; case IB_CM_REP_SENT: if (cm_mra_get_msg_mraed(mra_msg) != CM_MSG_RESPONSE_REP || ib_modify_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg, timeout)) + cm_id_priv->msg, timeout)) goto out; cm_id_priv->id.state = IB_CM_MRA_REP_RCVD; break; @@ -2225,7 +2192,7 @@ static int cm_mra_handler(struct cm_work if (cm_mra_get_msg_mraed(mra_msg) != CM_MSG_RESPONSE_OTHER || cm_id_priv->id.lap_state != IB_CM_LAP_SENT || ib_modify_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg, timeout)) + cm_id_priv->msg, timeout)) goto out; cm_id_priv->id.lap_state = IB_CM_MRA_LAP_RCVD; break; @@ -2288,7 +2255,6 @@ int ib_send_cm_lap(struct ib_cm_id *cm_i { struct cm_id_private *cm_id_priv; struct ib_mad_send_buf *msg; - struct ib_send_wr *bad_send_wr; unsigned long flags; int ret; @@ -2309,11 +2275,10 @@ int ib_send_cm_lap(struct ib_cm_id *cm_i cm_format_lap((struct cm_lap_msg *) msg->mad, cm_id_priv, alternate_path, private_data, private_data_len); - msg->send_wr.wr.ud.timeout_ms = cm_id_priv->timeout_ms; + msg->timeout_ms = cm_id_priv->timeout_ms; msg->context[1] = (void *) (unsigned long) IB_CM_ESTABLISHED; - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); cm_free_msg(msg); @@ -2357,7 +2322,6 @@ static int cm_lap_handler(struct cm_work struct cm_lap_msg *lap_msg; struct ib_cm_lap_event_param *param; struct ib_mad_send_buf *msg = NULL; - struct ib_send_wr *bad_send_wr; unsigned long flags; int ret; @@ -2391,8 +2355,7 @@ static int cm_lap_handler(struct cm_work cm_id_priv->private_data_len); spin_unlock_irqrestore(&cm_id_priv->lock, flags); - if 
(ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr)) + if (ib_post_send_mad(msg, NULL)) cm_free_msg(msg); goto deref; default: @@ -2448,7 +2411,6 @@ int ib_send_cm_apr(struct ib_cm_id *cm_i { struct cm_id_private *cm_id_priv; struct ib_mad_send_buf *msg; - struct ib_send_wr *bad_send_wr; unsigned long flags; int ret; @@ -2471,8 +2433,7 @@ int ib_send_cm_apr(struct ib_cm_id *cm_i cm_format_apr((struct cm_apr_msg *) msg->mad, cm_id_priv, status, info, info_length, private_data, private_data_len); - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); cm_free_msg(msg); @@ -2511,8 +2472,7 @@ static int cm_apr_handler(struct cm_work goto out; } cm_id_priv->id.lap_state = IB_CM_LAP_IDLE; - ib_cancel_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg); + ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); cm_id_priv->msg = NULL; ret = atomic_inc_and_test(&cm_id_priv->work_count); @@ -2587,7 +2547,6 @@ int ib_send_cm_sidr_req(struct ib_cm_id { struct cm_id_private *cm_id_priv; struct ib_mad_send_buf *msg; - struct ib_send_wr *bad_send_wr; unsigned long flags; int ret; @@ -2610,13 +2569,12 @@ int ib_send_cm_sidr_req(struct ib_cm_id cm_format_sidr_req((struct cm_sidr_req_msg *) msg->mad, cm_id_priv, param); - msg->send_wr.wr.ud.timeout_ms = cm_id_priv->timeout_ms; + msg->timeout_ms = cm_id_priv->timeout_ms; msg->context[1] = (void *) (unsigned long) IB_CM_SIDR_REQ_SENT; spin_lock_irqsave(&cm_id_priv->lock, flags); if (cm_id->state == IB_CM_IDLE) - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); else ret = -EINVAL; @@ -2730,7 +2688,6 @@ int ib_send_cm_sidr_rep(struct ib_cm_id { struct cm_id_private *cm_id_priv; struct ib_mad_send_buf *msg; - struct ib_send_wr *bad_send_wr; unsigned long flags; int ret; @@ -2752,8 
+2709,7 @@ int ib_send_cm_sidr_rep(struct ib_cm_id cm_format_sidr_rep((struct cm_sidr_rep_msg *) msg->mad, cm_id_priv, param); - ret = ib_post_send_mad(cm_id_priv->av.port->mad_agent, - &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) { spin_unlock_irqrestore(&cm_id_priv->lock, flags); cm_free_msg(msg); @@ -2806,8 +2762,7 @@ static int cm_sidr_rep_handler(struct cm goto out; } cm_id_priv->id.state = IB_CM_IDLE; - ib_cancel_mad(cm_id_priv->av.port->mad_agent, - (unsigned long) cm_id_priv->msg); + ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg); spin_unlock_irqrestore(&cm_id_priv->lock, flags); cm_format_sidr_rep_event(work); @@ -2875,9 +2830,7 @@ discard: static void cm_send_handler(struct ib_mad_agent *mad_agent, struct ib_mad_send_wc *mad_send_wc) { - struct ib_mad_send_buf *msg; - - msg = (struct ib_mad_send_buf *)(unsigned long)mad_send_wc->wr_id; + struct ib_mad_send_buf *msg = mad_send_wc->send_buf; switch (mad_send_wc->status) { case IB_WC_SUCCESS: Index: trunk/src/linux-kernel/infiniband/core/agent.h =================================================================== --- trunk/src/linux-kernel/infiniband/core/agent.h (revision 3830) +++ trunk/src/linux-kernel/infiniband/core/agent.h (working copy) @@ -39,17 +39,14 @@ #ifndef __AGENT_H_ #define __AGENT_H_ -extern spinlock_t ib_agent_port_list_lock; +#include -extern int ib_agent_port_open(struct ib_device *device, - int port_num); +extern int ib_agent_port_open(struct ib_device *device, int port_num); extern int ib_agent_port_close(struct ib_device *device, int port_num); -extern int agent_send(struct ib_mad_private *mad, - struct ib_grh *grh, - struct ib_wc *wc, - struct ib_device *device, - int port_num); +extern int agent_send_response(struct ib_mad *mad, struct ib_grh *grh, + struct ib_wc *wc, struct ib_device *device, + int port_num, int qpn); #endif /* __AGENT_H_ */ Index: trunk/src/linux-kernel/infiniband/core/mad_rmpp.h 
=================================================================== --- trunk/src/linux-kernel/infiniband/core/mad_rmpp.h (revision 3830) +++ trunk/src/linux-kernel/infiniband/core/mad_rmpp.h (working copy) @@ -51,6 +51,8 @@ ib_process_rmpp_recv_wc(struct ib_mad_ag int ib_process_rmpp_send_wc(struct ib_mad_send_wr_private *mad_send_wr, struct ib_mad_send_wc *mad_send_wc); +void ib_rmpp_send_handler(struct ib_mad_send_wc *mad_send_wc); + void ib_cancel_rmpp_recvs(struct ib_mad_agent_private *agent); int ib_retry_rmpp(struct ib_mad_send_wr_private *mad_send_wr); Index: trunk/src/linux-kernel/infiniband/core/sa_query.c =================================================================== --- trunk/src/linux-kernel/infiniband/core/sa_query.c (revision 3830) +++ trunk/src/linux-kernel/infiniband/core/sa_query.c (working copy) @@ -74,9 +74,8 @@ struct ib_sa_query { void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *); void (*release)(struct ib_sa_query *); struct ib_sa_port *port; - struct ib_sa_mad *mad; + struct ib_mad_send_buf *mad_buf; struct ib_sa_sm_ah *sm_ah; - DECLARE_PCI_UNMAP_ADDR(mapping) int id; }; @@ -426,6 +425,7 @@ void ib_sa_cancel_query(int id, struct i { unsigned long flags; struct ib_mad_agent *agent; + struct ib_mad_send_buf *mad_buf; spin_lock_irqsave(&idr_lock, flags); if (idr_find(&query_idr, id) != query) { @@ -433,9 +433,10 @@ void ib_sa_cancel_query(int id, struct i return; } agent = query->port->agent; + mad_buf = query->mad_buf; spin_unlock_irqrestore(&idr_lock, flags); - ib_cancel_mad(agent, id); + ib_cancel_mad(agent, mad_buf); } EXPORT_SYMBOL(ib_sa_cancel_query); @@ -455,73 +456,49 @@ static void init_mad(struct ib_sa_mad *m spin_unlock_irqrestore(&tid_lock, flags); } +static void acquire_ah(struct ib_sa_port *port, struct ib_sa_query *query) +{ + unsigned long flags; + + spin_lock_irqsave(&port->ah_lock, flags); + kref_get(&port->sm_ah->ref); + query->sm_ah = port->sm_ah; + spin_unlock_irqrestore(&port->ah_lock, flags); +} + 
static int send_mad(struct ib_sa_query *query, int timeout_ms) { - struct ib_sa_port *port = query->port; unsigned long flags; - int ret; - struct ib_sge gather_list; - struct ib_send_wr *bad_wr, wr = { - .opcode = IB_WR_SEND, - .sg_list = &gather_list, - .num_sge = 1, - .send_flags = IB_SEND_SIGNALED, - .wr = { - .ud = { - .mad_hdr = &query->mad->mad_hdr, - .remote_qpn = 1, - .remote_qkey = IB_QP1_QKEY, - .timeout_ms = timeout_ms, - } - } - }; + int ret, id; retry: if (!idr_pre_get(&query_idr, GFP_ATOMIC)) return -ENOMEM; spin_lock_irqsave(&idr_lock, flags); - ret = idr_get_new(&query_idr, query, &query->id); + ret = idr_get_new(&query_idr, query, &id); spin_unlock_irqrestore(&idr_lock, flags); if (ret == -EAGAIN) goto retry; if (ret) return ret; - wr.wr_id = query->id; + query->mad_buf->timeout_ms = timeout_ms; + query->mad_buf->context[0] = query; + query->id = id; - spin_lock_irqsave(&port->ah_lock, flags); - kref_get(&port->sm_ah->ref); - query->sm_ah = port->sm_ah; - wr.wr.ud.ah = port->sm_ah->ah; - spin_unlock_irqrestore(&port->ah_lock, flags); - - gather_list.addr = dma_map_single(port->agent->device->dma_device, - query->mad, - sizeof (struct ib_sa_mad), - DMA_TO_DEVICE); - gather_list.length = sizeof (struct ib_sa_mad); - gather_list.lkey = port->agent->mr->lkey; - pci_unmap_addr_set(query, mapping, gather_list.addr); - - ret = ib_post_send_mad(port->agent, &wr, &bad_wr); + ret = ib_post_send_mad(query->mad_buf, NULL); if (ret) { - dma_unmap_single(port->agent->device->dma_device, - pci_unmap_addr(query, mapping), - sizeof (struct ib_sa_mad), - DMA_TO_DEVICE); - kref_put(&query->sm_ah->ref, free_sm_ah); spin_lock_irqsave(&idr_lock, flags); - idr_remove(&query_idr, query->id); + idr_remove(&query_idr, id); spin_unlock_irqrestore(&idr_lock, flags); } /* * It's not safe to dereference query any more, because the * send may already have completed and freed the query in - * another context. So use wr.wr_id, which has a copy of the - * query's id. 
+ * another context. */ - return ret ? ret : wr.wr_id; + return ret ? ret : id; } static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query, @@ -543,7 +520,6 @@ static void ib_sa_path_rec_callback(stru static void ib_sa_path_rec_release(struct ib_sa_query *sa_query) { - kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_path_query, sa_query)); } @@ -585,6 +561,7 @@ int ib_sa_path_rec_get(struct ib_device struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); struct ib_sa_port *port; struct ib_mad_agent *agent; + struct ib_sa_mad *mad; int ret; if (!sa_dev) @@ -596,37 +573,46 @@ int ib_sa_path_rec_get(struct ib_device query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); - if (!query->sa_query.mad) { - kfree(query); - return -ENOMEM; + + acquire_ah(port, &query->sa_query); + query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + 0, IB_MGMT_SA_HDR, + IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; } + query->sa_query.mad_buf->ah = query->sa_query.sm_ah->ah; query->callback = callback; query->context = context; - init_mad(query->sa_query.mad, agent); + mad = query->sa_query.mad_buf->mad; + init_mad(mad, agent); - query->sa_query.callback = callback ? ib_sa_path_rec_callback : NULL; - query->sa_query.release = ib_sa_path_rec_release; - query->sa_query.port = port; - query->sa_query.mad->mad_hdr.method = IB_MGMT_METHOD_GET; - query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); - query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + query->sa_query.callback = callback ? 
ib_sa_path_rec_callback : NULL; + query->sa_query.release = ib_sa_path_rec_release; + query->sa_query.port = port; + mad->mad_hdr.method = IB_MGMT_METHOD_GET; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_PATH_REC); + mad->sa_hdr.comp_mask = comp_mask; - ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), - rec, query->sa_query.mad->data); + ib_pack(path_rec_table, ARRAY_SIZE(path_rec_table), rec, mad->data); *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); - if (ret < 0) { - *sa_query = NULL; - kfree(query->sa_query.mad); - kfree(query); - } + if (ret < 0) + goto err2; return ret; +err2: + *sa_query = NULL; + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kref_put(&query->sa_query.sm_ah->ref, free_sm_ah); + kfree(query); + return ret; } EXPORT_SYMBOL(ib_sa_path_rec_get); @@ -649,7 +635,6 @@ static void ib_sa_service_rec_callback(s static void ib_sa_service_rec_release(struct ib_sa_query *sa_query) { - kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_service_query, sa_query)); } @@ -693,6 +678,7 @@ int ib_sa_service_rec_query(struct ib_de struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); struct ib_sa_port *port; struct ib_mad_agent *agent; + struct ib_sa_mad *mad; int ret; if (!sa_dev) @@ -709,38 +695,47 @@ int ib_sa_service_rec_query(struct ib_de query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); - if (!query->sa_query.mad) { - kfree(query); - return -ENOMEM; + + acquire_ah(port, &query->sa_query); + query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + 0, IB_MGMT_SA_HDR, + IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; } + query->sa_query.mad_buf->ah = query->sa_query.sm_ah->ah; query->callback = callback; query->context = context; - init_mad(query->sa_query.mad, agent); + mad = query->sa_query.mad_buf->mad; + init_mad(mad, agent); - 
query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; - query->sa_query.release = ib_sa_service_rec_release; - query->sa_query.port = port; - query->sa_query.mad->mad_hdr.method = method; - query->sa_query.mad->mad_hdr.attr_id = - cpu_to_be16(IB_SA_ATTR_SERVICE_REC); - query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + query->sa_query.callback = callback ? ib_sa_service_rec_callback : NULL; + query->sa_query.release = ib_sa_service_rec_release; + query->sa_query.port = port; + mad->mad_hdr.method = method; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_SERVICE_REC); + mad->sa_hdr.comp_mask = comp_mask; ib_pack(service_rec_table, ARRAY_SIZE(service_rec_table), - rec, query->sa_query.mad->data); + rec, mad->data); *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); - if (ret < 0) { - *sa_query = NULL; - kfree(query->sa_query.mad); - kfree(query); - } + if (ret < 0) + goto err2; return ret; +err2: + *sa_query = NULL; + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kref_put(&query->sa_query.sm_ah->ref, free_sm_ah); + kfree(query); + return ret; } EXPORT_SYMBOL(ib_sa_service_rec_query); @@ -763,7 +758,6 @@ static void ib_sa_mcmember_rec_callback( static void ib_sa_mcmember_rec_release(struct ib_sa_query *sa_query) { - kfree(sa_query->mad); kfree(container_of(sa_query, struct ib_sa_mcmember_query, sa_query)); } @@ -782,6 +776,7 @@ int ib_sa_mcmember_rec_query(struct ib_d struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); struct ib_sa_port *port; struct ib_mad_agent *agent; + struct ib_sa_mad *mad; int ret; if (!sa_dev) @@ -793,53 +788,56 @@ int ib_sa_mcmember_rec_query(struct ib_d query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; - query->sa_query.mad = kmalloc(sizeof *query->sa_query.mad, gfp_mask); - if (!query->sa_query.mad) { - kfree(query); - return -ENOMEM; + + acquire_ah(port, &query->sa_query); + query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + 0, IB_MGMT_SA_HDR, 
+ IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; } + query->sa_query.mad_buf->ah = query->sa_query.sm_ah->ah; query->callback = callback; query->context = context; - init_mad(query->sa_query.mad, agent); + mad = query->sa_query.mad_buf->mad; + init_mad(mad, agent); - query->sa_query.callback = callback ? ib_sa_mcmember_rec_callback : NULL; - query->sa_query.release = ib_sa_mcmember_rec_release; - query->sa_query.port = port; - query->sa_query.mad->mad_hdr.method = method; - query->sa_query.mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); - query->sa_query.mad->sa_hdr.comp_mask = comp_mask; + query->sa_query.callback = callback ? ib_sa_mcmember_rec_callback : NULL; + query->sa_query.release = ib_sa_mcmember_rec_release; + query->sa_query.port = port; + mad->mad_hdr.method = method; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_MC_MEMBER_REC); + mad->sa_hdr.comp_mask = comp_mask; ib_pack(mcmember_rec_table, ARRAY_SIZE(mcmember_rec_table), - rec, query->sa_query.mad->data); + rec, mad->data); *sa_query = &query->sa_query; ret = send_mad(&query->sa_query, timeout_ms); - if (ret < 0) { - *sa_query = NULL; - kfree(query->sa_query.mad); - kfree(query); - } + if (ret < 0) + goto err2; return ret; +err2: + *sa_query = NULL; + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kref_put(&query->sa_query.sm_ah->ref, free_sm_ah); + kfree(query); + return ret; } EXPORT_SYMBOL(ib_sa_mcmember_rec_query); static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) { - struct ib_sa_query *query; + struct ib_sa_query *query = mad_send_wc->send_buf->context[0]; unsigned long flags; - spin_lock_irqsave(&idr_lock, flags); - query = idr_find(&query_idr, mad_send_wc->wr_id); - spin_unlock_irqrestore(&idr_lock, flags); - - if (!query) - return; - if (query->callback) switch (mad_send_wc->status) { case IB_WC_SUCCESS: @@ -856,30 +854,25 @@ static void send_handler(struct ib_mad_a break; } - 
dma_unmap_single(agent->device->dma_device, - pci_unmap_addr(query, mapping), - sizeof (struct ib_sa_mad), - DMA_TO_DEVICE); - kref_put(&query->sm_ah->ref, free_sm_ah); - - query->release(query); - spin_lock_irqsave(&idr_lock, flags); - idr_remove(&query_idr, mad_send_wc->wr_id); + idr_remove(&query_idr, query->id); spin_unlock_irqrestore(&idr_lock, flags); + + ib_free_send_mad(mad_send_wc->send_buf); + kref_put(&query->sm_ah->ref, free_sm_ah); + query->release(query); } static void recv_handler(struct ib_mad_agent *mad_agent, struct ib_mad_recv_wc *mad_recv_wc) { struct ib_sa_query *query; - unsigned long flags; + struct ib_mad_send_buf *mad_buf; - spin_lock_irqsave(&idr_lock, flags); - query = idr_find(&query_idr, mad_recv_wc->wc->wr_id); - spin_unlock_irqrestore(&idr_lock, flags); + mad_buf = (void *) (unsigned long) mad_recv_wc->wc->wr_id; + query = mad_buf->context[0]; - if (query && query->callback) { + if (query->callback) { if (mad_recv_wc->wc->status == IB_WC_SUCCESS) query->callback(query, mad_recv_wc->recv_buf.mad->mad_hdr.status ? 
Index: trunk/src/linux-kernel/infiniband/core/user_mad.c =================================================================== --- trunk/src/linux-kernel/infiniband/core/user_mad.c (revision 3830) +++ trunk/src/linux-kernel/infiniband/core/user_mad.c (working copy) @@ -96,7 +96,6 @@ struct ib_umad_file { }; struct ib_umad_packet { - struct ib_ah *ah; struct ib_mad_send_buf *msg; struct list_head list; int length; @@ -139,10 +138,9 @@ static void send_handler(struct ib_mad_a struct ib_mad_send_wc *send_wc) { struct ib_umad_file *file = agent->context; - struct ib_umad_packet *timeout, *packet = - (void *) (unsigned long) send_wc->wr_id; + struct ib_umad_packet *timeout, *packet = send_wc->send_buf->context[0]; - ib_destroy_ah(packet->msg->send_wr.wr.ud.ah); + ib_destroy_ah(packet->msg->ah); ib_free_send_mad(packet->msg); if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { @@ -268,11 +266,11 @@ static ssize_t ib_umad_write(struct file struct ib_umad_packet *packet; struct ib_mad_agent *agent; struct ib_ah_attr ah_attr; - struct ib_send_wr *bad_wr; + struct ib_ah *ah; struct ib_rmpp_mad *rmpp_mad; u8 method; __be64 *tid; - int ret, length, hdr_len, data_len, rmpp_hdr_size; + int ret, length, hdr_len, rmpp_hdr_size; int rmpp_active = 0; if (count < sizeof (struct ib_user_mad)) @@ -321,9 +319,9 @@ static ssize_t ib_umad_write(struct file ah_attr.grh.traffic_class = packet->mad.hdr.traffic_class; } - packet->ah = ib_create_ah(agent->qp->pd, &ah_attr); - if (IS_ERR(packet->ah)) { - ret = PTR_ERR(packet->ah); + ah = ib_create_ah(agent->qp->pd, &ah_attr); + if (IS_ERR(ah)) { + ret = PTR_ERR(ah); goto err_up; } @@ -337,12 +335,10 @@ static ssize_t ib_umad_write(struct file /* Validate that the management class can support RMPP */ if (rmpp_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_ADM) { - hdr_len = offsetof(struct ib_sa_mad, data); - data_len = length - hdr_len; + hdr_len = IB_MGMT_SA_HDR; } else if ((rmpp_mad->mad_hdr.mgmt_class >= IB_MGMT_CLASS_VENDOR_RANGE2_START) && 
(rmpp_mad->mad_hdr.mgmt_class <= IB_MGMT_CLASS_VENDOR_RANGE2_END)) { - hdr_len = offsetof(struct ib_vendor_mad, data); - data_len = length - hdr_len; + hdr_len = IB_MGMT_VENDOR_HDR; } else { ret = -EINVAL; goto err_ah; @@ -353,25 +349,23 @@ static ssize_t ib_umad_write(struct file ret = -EINVAL; goto err_ah; } - hdr_len = offsetof(struct ib_mad, data); - data_len = length - hdr_len; + hdr_len = IB_MGMT_MAD_HDR; } packet->msg = ib_create_send_mad(agent, be32_to_cpu(packet->mad.hdr.qpn), - 0, packet->ah, rmpp_active, - hdr_len, data_len, + 0, rmpp_active, + hdr_len, length - hdr_len, GFP_KERNEL); if (IS_ERR(packet->msg)) { ret = PTR_ERR(packet->msg); goto err_ah; } - packet->msg->send_wr.wr.ud.timeout_ms = packet->mad.hdr.timeout_ms; - packet->msg->send_wr.wr.ud.retries = packet->mad.hdr.retries; - - /* Override send WR WRID initialized in ib_create_send_mad */ - packet->msg->send_wr.wr_id = (unsigned long) packet; + packet->msg->ah = ah; + packet->msg->timeout_ms = packet->mad.hdr.timeout_ms; + packet->msg->retries = packet->mad.hdr.retries; + packet->msg->context[0] = packet; if (!rmpp_active) { /* Copy message from user into send buffer */ @@ -403,17 +397,17 @@ static ssize_t ib_umad_write(struct file * transaction ID matches the agent being used to send the * MAD. 
*/ - method = packet->msg->mad->mad_hdr.method; + method = ((struct ib_mad_hdr *) packet->msg->mad)->method; if (!(method & IB_MGMT_METHOD_RESP) && method != IB_MGMT_METHOD_TRAP_REPRESS && method != IB_MGMT_METHOD_SEND) { - tid = &packet->msg->mad->mad_hdr.tid; + tid = &((struct ib_mad_hdr *) packet->msg->mad)->tid; *tid = cpu_to_be64(((u64) agent->hi_tid) << 32 | (be64_to_cpup(tid) & 0xffffffff)); } - ret = ib_post_send_mad(agent, &packet->msg->send_wr, &bad_wr); + ret = ib_post_send_mad(packet->msg, NULL); if (ret) goto err_msg; @@ -425,7 +419,7 @@ err_msg: ib_free_send_mad(packet->msg); err_ah: - ib_destroy_ah(packet->ah); + ib_destroy_ah(ah); err_up: up_read(&file->agent_mutex); Index: trunk/src/linux-kernel/infiniband/core/mad.c =================================================================== --- trunk/src/linux-kernel/infiniband/core/mad.c (revision 3830) +++ trunk/src/linux-kernel/infiniband/core/mad.c (working copy) @@ -579,7 +579,7 @@ static void dequeue_mad(struct ib_mad_li } static void snoop_send(struct ib_mad_qp_info *qp_info, - struct ib_send_wr *send_wr, + struct ib_mad_send_buf *send_buf, struct ib_mad_send_wc *mad_send_wc, int mad_snoop_flags) { @@ -597,7 +597,7 @@ static void snoop_send(struct ib_mad_qp_ atomic_inc(&mad_snoop_priv->refcount); spin_unlock_irqrestore(&qp_info->snoop_lock, flags); mad_snoop_priv->agent.snoop_handler(&mad_snoop_priv->agent, - send_wr, mad_send_wc); + send_buf, mad_send_wc); if (atomic_dec_and_test(&mad_snoop_priv->refcount)) wake_up(&mad_snoop_priv->wait); spin_lock_irqsave(&qp_info->snoop_lock, flags); @@ -654,10 +654,10 @@ static void build_smp_wc(u64 wr_id, u16 * Return < 0 if error */ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, - struct ib_smp *smp, - struct ib_send_wr *send_wr) + struct ib_mad_send_wr_private *mad_send_wr) { int ret; + struct ib_smp *smp = mad_send_wr->send_buf.mad; unsigned long flags; struct ib_mad_local_private *local; struct ib_mad_private *mad_priv; @@ -666,6
+666,7 @@ static int handle_outgoing_dr_smp(struct struct ib_device *device = mad_agent_priv->agent.device; u8 port_num = mad_agent_priv->agent.port_num; struct ib_wc mad_wc; + struct ib_send_wr *send_wr = &mad_send_wr->send_wr; if (!smi_handle_dr_smp_send(smp, device->node_type, port_num)) { ret = -EINVAL; @@ -745,13 +746,7 @@ static int handle_outgoing_dr_smp(struct goto out; } - local->send_wr = *send_wr; - local->send_wr.sg_list = local->sg_list; - memcpy(local->sg_list, send_wr->sg_list, - sizeof *send_wr->sg_list * send_wr->num_sge); - local->send_wr.next = NULL; - local->tid = send_wr->wr.ud.mad_hdr->tid; - local->wr_id = send_wr->wr_id; + local->mad_send_wr = mad_send_wr; /* Reference MAD agent until send side of local completion handled */ atomic_inc(&mad_agent_priv->refcount); /* Queue local completion to local list */ @@ -781,17 +776,17 @@ static int get_buf_length(int hdr_len, i struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, u32 remote_qpn, u16 pkey_index, - struct ib_ah *ah, int rmpp_active, + int rmpp_active, int hdr_len, int data_len, unsigned int __nocast gfp_mask) { struct ib_mad_agent_private *mad_agent_priv; - struct ib_mad_send_buf *send_buf; + struct ib_mad_send_wr_private *mad_send_wr; int buf_size; void *buf; - mad_agent_priv = container_of(mad_agent, - struct ib_mad_agent_private, agent); + mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, + agent); buf_size = get_buf_length(hdr_len, data_len); if ((!mad_agent->rmpp_version && @@ -799,45 +794,40 @@ struct ib_mad_send_buf * ib_create_send_ (!rmpp_active && buf_size > sizeof(struct ib_mad))) return ERR_PTR(-EINVAL); - buf = kmalloc(sizeof *send_buf + buf_size, gfp_mask); + buf = kmalloc(sizeof *mad_send_wr + buf_size, gfp_mask); if (!buf) return ERR_PTR(-ENOMEM); - memset(buf, 0, sizeof *send_buf + buf_size); + memset(buf, 0, sizeof *mad_send_wr + buf_size); - send_buf = buf + buf_size; - send_buf->mad = buf; + mad_send_wr = buf + buf_size; + 
mad_send_wr->send_buf.mad = buf; - send_buf->sge.addr = dma_map_single(mad_agent->device->dma_device, - buf, buf_size, DMA_TO_DEVICE); - pci_unmap_addr_set(send_buf, mapping, send_buf->sge.addr); - send_buf->sge.length = buf_size; - send_buf->sge.lkey = mad_agent->mr->lkey; - - send_buf->send_wr.wr_id = (unsigned long) send_buf; - send_buf->send_wr.sg_list = &send_buf->sge; - send_buf->send_wr.num_sge = 1; - send_buf->send_wr.opcode = IB_WR_SEND; - send_buf->send_wr.send_flags = IB_SEND_SIGNALED; - send_buf->send_wr.wr.ud.ah = ah; - send_buf->send_wr.wr.ud.mad_hdr = &send_buf->mad->mad_hdr; - send_buf->send_wr.wr.ud.remote_qpn = remote_qpn; - send_buf->send_wr.wr.ud.remote_qkey = IB_QP_SET_QKEY; - send_buf->send_wr.wr.ud.pkey_index = pkey_index; + mad_send_wr->mad_agent_priv = mad_agent_priv; + mad_send_wr->sg_list[0].length = buf_size; + mad_send_wr->sg_list[0].lkey = mad_agent->mr->lkey; + + mad_send_wr->send_wr.wr_id = (unsigned long) mad_send_wr; + mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; + mad_send_wr->send_wr.num_sge = 1; + mad_send_wr->send_wr.opcode = IB_WR_SEND; + mad_send_wr->send_wr.send_flags = IB_SEND_SIGNALED; + mad_send_wr->send_wr.wr.ud.remote_qpn = remote_qpn; + mad_send_wr->send_wr.wr.ud.remote_qkey = IB_QP_SET_QKEY; + mad_send_wr->send_wr.wr.ud.pkey_index = pkey_index; if (rmpp_active) { - struct ib_rmpp_mad *rmpp_mad; - rmpp_mad = (struct ib_rmpp_mad *)send_buf->mad; + struct ib_rmpp_mad *rmpp_mad = mad_send_wr->send_buf.mad; rmpp_mad->rmpp_hdr.paylen_newwin = cpu_to_be32(hdr_len - - offsetof(struct ib_rmpp_mad, data) + data_len); + IB_MGMT_RMPP_HDR + data_len); rmpp_mad->rmpp_hdr.rmpp_version = mad_agent->rmpp_version; rmpp_mad->rmpp_hdr.rmpp_type = IB_MGMT_RMPP_TYPE_DATA; ib_set_rmpp_flags(&rmpp_mad->rmpp_hdr, IB_MGMT_RMPP_FLAG_ACTIVE); } - send_buf->mad_agent = mad_agent; + mad_send_wr->send_buf.mad_agent = mad_agent; atomic_inc(&mad_agent_priv->refcount); - return send_buf; + return &mad_send_wr->send_buf; } 
EXPORT_SYMBOL(ib_create_send_mad); @@ -847,10 +837,6 @@ void ib_free_send_mad(struct ib_mad_send mad_agent_priv = container_of(send_buf->mad_agent, struct ib_mad_agent_private, agent); - - dma_unmap_single(send_buf->mad_agent->device->dma_device, - pci_unmap_addr(send_buf, mapping), - send_buf->sge.length, DMA_TO_DEVICE); kfree(send_buf->mad); if (atomic_dec_and_test(&mad_agent_priv->refcount)) @@ -861,8 +847,10 @@ EXPORT_SYMBOL(ib_free_send_mad); int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr) { struct ib_mad_qp_info *qp_info; - struct ib_send_wr *bad_send_wr; struct list_head *list; + struct ib_send_wr *bad_send_wr; + struct ib_mad_agent *mad_agent; + struct ib_sge *sge; unsigned long flags; int ret; @@ -871,10 +859,17 @@ int ib_send_mad(struct ib_mad_send_wr_pr mad_send_wr->send_wr.wr_id = (unsigned long)&mad_send_wr->mad_list; mad_send_wr->mad_list.mad_queue = &qp_info->send_queue; + mad_agent = mad_send_wr->send_buf.mad_agent; + sge = mad_send_wr->sg_list; + sge->addr = dma_map_single(mad_agent->device->dma_device, + mad_send_wr->send_buf.mad, sge->length, + DMA_TO_DEVICE); + pci_unmap_addr_set(mad_send_wr, mapping, sge->addr); + spin_lock_irqsave(&qp_info->send_queue.lock, flags); if (qp_info->send_queue.count < qp_info->send_queue.max_active) { - ret = ib_post_send(mad_send_wr->mad_agent_priv->agent.qp, - &mad_send_wr->send_wr, &bad_send_wr); + ret = ib_post_send(mad_agent->qp, &mad_send_wr->send_wr, + &bad_send_wr); list = &qp_info->send_queue.list; } else { ret = 0; @@ -886,6 +881,11 @@ int ib_send_mad(struct ib_mad_send_wr_pr list_add_tail(&mad_send_wr->mad_list.list, list); } spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); + if (ret) + dma_unmap_single(mad_agent->device->dma_device, + pci_unmap_addr(mad_send_wr, mapping), + sge->length, DMA_TO_DEVICE); + return ret; } @@ -893,45 +893,28 @@ int ib_send_mad(struct ib_mad_send_wr_pr * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated * with the registered client */ 
-int ib_post_send_mad(struct ib_mad_agent *mad_agent, - struct ib_send_wr *send_wr, - struct ib_send_wr **bad_send_wr) +int ib_post_send_mad(struct ib_mad_send_buf *send_buf, + struct ib_mad_send_buf **bad_send_buf) { - int ret = -EINVAL; struct ib_mad_agent_private *mad_agent_priv; - - /* Validate supplied parameters */ - if (!bad_send_wr) - goto error1; - - if (!mad_agent || !send_wr) - goto error2; - - if (!mad_agent->send_handler) - goto error2; - - mad_agent_priv = container_of(mad_agent, - struct ib_mad_agent_private, - agent); + struct ib_mad_send_buf *next_send_buf; + struct ib_mad_send_wr_private *mad_send_wr; + unsigned long flags; + int ret = -EINVAL; /* Walk list of send WRs and post each on send list */ - while (send_wr) { - unsigned long flags; - struct ib_send_wr *next_send_wr; - struct ib_mad_send_wr_private *mad_send_wr; - struct ib_smp *smp; + for (; send_buf; send_buf = next_send_buf) { - /* Validate more parameters */ - if (send_wr->num_sge > IB_MAD_SEND_REQ_MAX_SG) - goto error2; - - if (send_wr->wr.ud.timeout_ms && !mad_agent->recv_handler) - goto error2; + mad_send_wr = container_of(send_buf, + struct ib_mad_send_wr_private, + send_buf); + mad_agent_priv = mad_send_wr->mad_agent_priv; - if (!send_wr->wr.ud.mad_hdr) { - printk(KERN_ERR PFX "MAD header must be supplied " - "in WR %p\n", send_wr); - goto error2; + if (!send_buf->mad_agent->send_handler || + (send_buf->timeout_ms && + !send_buf->mad_agent->recv_handler)) { + ret = -EINVAL; + goto error; } /* @@ -939,40 +922,24 @@ int ib_post_send_mad(struct ib_mad_agent * current one completes, and the user modifies the work * request associated with the completion */ - next_send_wr = (struct ib_send_wr *)send_wr->next; + next_send_buf = send_buf->next; + mad_send_wr->send_wr.wr.ud.ah = send_buf->ah; - smp = (struct ib_smp *)send_wr->wr.ud.mad_hdr; - if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { - ret = handle_outgoing_dr_smp(mad_agent_priv, smp, - send_wr); + if (((struct 
ib_mad_hdr *) send_buf->mad)->mgmt_class == + IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) { + ret = handle_outgoing_dr_smp(mad_agent_priv, + mad_send_wr); if (ret < 0) /* error */ - goto error2; + goto error; else if (ret == 1) /* locally consumed */ - goto next; - } - - /* Allocate MAD send WR tracking structure */ - mad_send_wr = kmalloc(sizeof *mad_send_wr, GFP_ATOMIC); - if (!mad_send_wr) { - printk(KERN_ERR PFX "No memory for " - "ib_mad_send_wr_private\n"); - ret = -ENOMEM; - goto error2; + continue; } - memset(mad_send_wr, 0, sizeof *mad_send_wr); - mad_send_wr->send_wr = *send_wr; - mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list; - memcpy(mad_send_wr->sg_list, send_wr->sg_list, - sizeof *send_wr->sg_list * send_wr->num_sge); - mad_send_wr->wr_id = send_wr->wr_id; - mad_send_wr->tid = send_wr->wr.ud.mad_hdr->tid; - mad_send_wr->mad_agent_priv = mad_agent_priv; + mad_send_wr->tid = ((struct ib_mad_hdr *) send_buf->mad)->tid; /* Timeout will be updated after send completes */ - mad_send_wr->timeout = msecs_to_jiffies(send_wr->wr. - ud.timeout_ms); - mad_send_wr->retries = mad_send_wr->send_wr.wr.ud.retries; - /* One reference for each work request to QP + response */ + mad_send_wr->timeout = msecs_to_jiffies(send_buf->timeout_ms); + mad_send_wr->retries = send_buf->retries; + /* Reference for work request to QP + response */ mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0); mad_send_wr->status = IB_WC_SUCCESS; @@ -995,16 +962,13 @@ int ib_post_send_mad(struct ib_mad_agent list_del(&mad_send_wr->agent_list); spin_unlock_irqrestore(&mad_agent_priv->lock, flags); atomic_dec(&mad_agent_priv->refcount); - goto error2; + goto error; } -next: - send_wr = next_send_wr; } return 0; - -error2: - *bad_send_wr = send_wr; -error1: +error: + if (bad_send_buf) + *bad_send_buf = send_buf; return ret; } EXPORT_SYMBOL(ib_post_send_mad); @@ -1447,8 +1411,7 @@ find_mad_agent(struct ib_mad_port_privat * of MAD. 
*/ hi_tid = be64_to_cpu(mad->mad_hdr.tid) >> 32; - list_for_each_entry(entry, &port_priv->agent_list, - agent_list) { + list_for_each_entry(entry, &port_priv->agent_list, agent_list) { if (entry->agent.hi_tid == hi_tid) { mad_agent = entry; break; @@ -1571,8 +1534,7 @@ ib_find_send_mad(struct ib_mad_agent_pri */ list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, agent_list) { - if (is_data_mad(mad_agent_priv, - mad_send_wr->send_wr.wr.ud.mad_hdr) && + if (is_data_mad(mad_agent_priv, mad_send_wr->send_buf.mad) && mad_send_wr->tid == tid && mad_send_wr->timeout) { /* Verify request has not been canceled */ return (mad_send_wr->status == IB_WC_SUCCESS) ? @@ -1628,14 +1590,14 @@ static void ib_mad_complete_recv(struct spin_unlock_irqrestore(&mad_agent_priv->lock, flags); /* Defined behavior is to complete response before request */ - mad_recv_wc->wc->wr_id = mad_send_wr->wr_id; + mad_recv_wc->wc->wr_id = (unsigned long) &mad_send_wr->send_buf; mad_agent_priv->agent.recv_handler(&mad_agent_priv->agent, mad_recv_wc); atomic_dec(&mad_agent_priv->refcount); mad_send_wc.status = IB_WC_SUCCESS; mad_send_wc.vendor_err = 0; - mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_send_wc.send_buf = &mad_send_wr->send_buf; ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); } else { mad_agent_priv->agent.recv_handler(&mad_agent_priv->agent, @@ -1728,11 +1690,11 @@ local: if (ret & IB_MAD_RESULT_CONSUMED) goto out; if (ret & IB_MAD_RESULT_REPLY) { - /* Send response */ - if (!agent_send(response, &recv->grh, wc, - port_priv->device, - port_priv->port_num)) - response = NULL; + agent_send_response(&response->mad.mad, + &recv->grh, wc, + port_priv->device, + port_priv->port_num, + qp_info->qp->qp_num); goto out; } } @@ -1866,15 +1828,15 @@ void ib_mad_complete_send_wr(struct ib_m if (mad_send_wr->status != IB_WC_SUCCESS ) mad_send_wc->status = mad_send_wr->status; - if (ret != IB_RMPP_RESULT_INTERNAL) + if (ret == IB_RMPP_RESULT_INTERNAL) + ib_rmpp_send_handler(mad_send_wc); + 
else mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, mad_send_wc); /* Release reference on agent taken when sending */ if (atomic_dec_and_test(&mad_agent_priv->refcount)) wake_up(&mad_agent_priv->wait); - - kfree(mad_send_wr); return; done: spin_unlock_irqrestore(&mad_agent_priv->lock, flags); @@ -1888,6 +1850,7 @@ static void ib_mad_send_done_handler(str struct ib_mad_qp_info *qp_info; struct ib_mad_queue *send_queue; struct ib_send_wr *bad_send_wr; + struct ib_mad_send_wc mad_send_wc; unsigned long flags; int ret; @@ -1898,6 +1861,9 @@ static void ib_mad_send_done_handler(str qp_info = send_queue->qp_info; retry: + dma_unmap_single(mad_send_wr->send_buf.mad_agent->device->dma_device, + pci_unmap_addr(mad_send_wr, mapping), + mad_send_wr->sg_list[0].length, DMA_TO_DEVICE); queued_send_wr = NULL; spin_lock_irqsave(&send_queue->lock, flags); list_del(&mad_list->list); @@ -1914,17 +1880,17 @@ retry: } spin_unlock_irqrestore(&send_queue->lock, flags); - /* Restore client wr_id in WC and complete send */ - wc->wr_id = mad_send_wr->wr_id; + mad_send_wc.send_buf = &mad_send_wr->send_buf; + mad_send_wc.status = wc->status; + mad_send_wc.vendor_err = wc->vendor_err; if (atomic_read(&qp_info->snoop_count)) - snoop_send(qp_info, &mad_send_wr->send_wr, - (struct ib_mad_send_wc *)wc, + snoop_send(qp_info, &mad_send_wr->send_buf, &mad_send_wc, IB_MAD_SNOOP_SEND_COMPLETIONS); - ib_mad_complete_send_wr(mad_send_wr, (struct ib_mad_send_wc *)wc); + ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); if (queued_send_wr) { ret = ib_post_send(qp_info->qp, &queued_send_wr->send_wr, - &bad_send_wr); + &bad_send_wr); if (ret) { printk(KERN_ERR PFX "ib_post_send failed: %d\n", ret); mad_send_wr = queued_send_wr; @@ -2066,38 +2032,37 @@ static void cancel_mads(struct ib_mad_ag list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr, &cancel_list, agent_list) { - mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_send_wc.send_buf = &mad_send_wr->send_buf; + 
list_del(&mad_send_wr->agent_list); mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, &mad_send_wc); - - list_del(&mad_send_wr->agent_list); - kfree(mad_send_wr); atomic_dec(&mad_agent_priv->refcount); } } static struct ib_mad_send_wr_private* -find_send_by_wr_id(struct ib_mad_agent_private *mad_agent_priv, u64 wr_id) +find_send_wr(struct ib_mad_agent_private *mad_agent_priv, + struct ib_mad_send_buf *send_buf) { struct ib_mad_send_wr_private *mad_send_wr; list_for_each_entry(mad_send_wr, &mad_agent_priv->wait_list, agent_list) { - if (mad_send_wr->wr_id == wr_id) + if (&mad_send_wr->send_buf == send_buf) return mad_send_wr; } list_for_each_entry(mad_send_wr, &mad_agent_priv->send_list, agent_list) { - if (is_data_mad(mad_agent_priv, - mad_send_wr->send_wr.wr.ud.mad_hdr) && - mad_send_wr->wr_id == wr_id) + if (is_data_mad(mad_agent_priv, mad_send_wr->send_buf.mad) && + &mad_send_wr->send_buf == send_buf) return mad_send_wr; } return NULL; } -int ib_modify_mad(struct ib_mad_agent *mad_agent, u64 wr_id, u32 timeout_ms) +int ib_modify_mad(struct ib_mad_agent *mad_agent, + struct ib_mad_send_buf *send_buf, u32 timeout_ms) { struct ib_mad_agent_private *mad_agent_priv; struct ib_mad_send_wr_private *mad_send_wr; @@ -2107,7 +2072,7 @@ int ib_modify_mad(struct ib_mad_agent *m mad_agent_priv = container_of(mad_agent, struct ib_mad_agent_private, agent); spin_lock_irqsave(&mad_agent_priv->lock, flags); - mad_send_wr = find_send_by_wr_id(mad_agent_priv, wr_id); + mad_send_wr = find_send_wr(mad_agent_priv, send_buf); if (!mad_send_wr || mad_send_wr->status != IB_WC_SUCCESS) { spin_unlock_irqrestore(&mad_agent_priv->lock, flags); return -EINVAL; @@ -2119,7 +2084,7 @@ int ib_modify_mad(struct ib_mad_agent *m mad_send_wr->refcount -= (mad_send_wr->timeout > 0); } - mad_send_wr->send_wr.wr.ud.timeout_ms = timeout_ms; + mad_send_wr->send_buf.timeout_ms = timeout_ms; if (active) mad_send_wr->timeout = msecs_to_jiffies(timeout_ms); else @@ -2130,9 +2095,10 @@ int 
ib_modify_mad(struct ib_mad_agent *m } EXPORT_SYMBOL(ib_modify_mad); -void ib_cancel_mad(struct ib_mad_agent *mad_agent, u64 wr_id) +void ib_cancel_mad(struct ib_mad_agent *mad_agent, + struct ib_mad_send_buf *send_buf) { - ib_modify_mad(mad_agent, wr_id, 0); + ib_modify_mad(mad_agent, send_buf, 0); } EXPORT_SYMBOL(ib_cancel_mad); @@ -2166,10 +2132,9 @@ static void local_completions(void *data * Defined behavior is to complete response * before request */ - build_smp_wc(local->wr_id, + build_smp_wc((unsigned long) local->mad_send_wr, be16_to_cpu(IB_LID_PERMISSIVE), - 0 /* pkey index */, - recv_mad_agent->agent.port_num, &wc); + 0, recv_mad_agent->agent.port_num, &wc); local->mad_priv->header.recv_wc.wc = &wc; local->mad_priv->header.recv_wc.mad_len = @@ -2196,11 +2161,11 @@ local_send_completion: /* Complete send */ mad_send_wc.status = IB_WC_SUCCESS; mad_send_wc.vendor_err = 0; - mad_send_wc.wr_id = local->wr_id; + mad_send_wc.send_buf = &local->mad_send_wr->send_buf; if (atomic_read(&mad_agent_priv->qp_info->snoop_count)) - snoop_send(mad_agent_priv->qp_info, &local->send_wr, - &mad_send_wc, - IB_MAD_SNOOP_SEND_COMPLETIONS); + snoop_send(mad_agent_priv->qp_info, + &local->mad_send_wr->send_buf, + &mad_send_wc, IB_MAD_SNOOP_SEND_COMPLETIONS); mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, &mad_send_wc); @@ -2221,8 +2186,7 @@ static int retry_send(struct ib_mad_send if (!mad_send_wr->retries--) return -ETIMEDOUT; - mad_send_wr->timeout = msecs_to_jiffies(mad_send_wr->send_wr. 
- wr.ud.timeout_ms); + mad_send_wr->timeout = msecs_to_jiffies(mad_send_wr->send_buf.timeout_ms); if (mad_send_wr->mad_agent_priv->agent.rmpp_version) { ret = ib_retry_rmpp(mad_send_wr); @@ -2285,11 +2249,10 @@ static void timeout_sends(void *data) mad_send_wc.status = IB_WC_RESP_TIMEOUT_ERR; else mad_send_wc.status = mad_send_wr->status; - mad_send_wc.wr_id = mad_send_wr->wr_id; + mad_send_wc.send_buf = &mad_send_wr->send_buf; mad_agent_priv->agent.send_handler(&mad_agent_priv->agent, &mad_send_wc); - kfree(mad_send_wr); atomic_dec(&mad_agent_priv->refcount); spin_lock_irqsave(&mad_agent_priv->lock, flags); } @@ -2761,7 +2724,6 @@ static int __init ib_mad_init_module(voi int ret; spin_lock_init(&ib_mad_port_list_lock); - spin_lock_init(&ib_agent_port_list_lock); ib_mad_cache = kmem_cache_create("ib_mad", sizeof(struct ib_mad_private), Index: trunk/src/linux-kernel/infiniband/core/agent_priv.h =================================================================== --- trunk/src/linux-kernel/infiniband/core/agent_priv.h (revision 3830) +++ trunk/src/linux-kernel/infiniband/core/agent_priv.h (working copy) @@ -1,62 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Mellanox Technologies Ltd. All rights reserved. - * Copyright (c) 2004, 2005 Infinicon Corporation. All rights reserved. - * Copyright (c) 2004, 2005 Intel Corporation. All rights reserved. - * Copyright (c) 2004, 2005 Topspin Corporation. All rights reserved. - * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. 
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - * $Id$ - */ - -#ifndef __IB_AGENT_PRIV_H__ -#define __IB_AGENT_PRIV_H__ - -#include - -#define SPFX "ib_agent: " - -struct ib_agent_send_wr { - struct list_head send_list; - struct ib_ah *ah; - struct ib_mad_private *mad; - DECLARE_PCI_UNMAP_ADDR(mapping) -}; - -struct ib_agent_port_private { - struct list_head port_list; - struct list_head send_posted_list; - spinlock_t send_list_lock; - int port_num; - struct ib_mad_agent *smp_agent; /* SM class */ - struct ib_mad_agent *perf_mgmt_agent; /* PerfMgmt class */ -}; - -#endif /* __IB_AGENT_PRIV_H__ */ Index: trunk/src/linux-kernel/infiniband/core/mad_priv.h =================================================================== --- trunk/src/linux-kernel/infiniband/core/mad_priv.h (revision 3830) +++ trunk/src/linux-kernel/infiniband/core/mad_priv.h (working copy) @@ -118,9 +118,10 @@ struct ib_mad_send_wr_private { struct ib_mad_list_head mad_list; struct list_head agent_list; struct ib_mad_agent_private *mad_agent_priv; + struct ib_mad_send_buf send_buf; + DECLARE_PCI_UNMAP_ADDR(mapping) struct ib_send_wr send_wr; struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; - u64 wr_id; /* client WR ID */ __be64 tid; unsigned long timeout; int retries; @@ -141,10 +142,7 @@ struct ib_mad_local_private { struct list_head completion_list; struct ib_mad_private *mad_priv; struct ib_mad_agent_private *recv_mad_agent; - struct ib_send_wr send_wr; - struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG]; - u64 wr_id; /* client WR ID */ - __be64 tid; + struct ib_mad_send_wr_private *mad_send_wr; }; struct ib_mad_mgmt_method_table { Index: trunk/src/linux-kernel/infiniband/core/ping.c =================================================================== --- trunk/src/linux-kernel/infiniband/core/ping.c (revision 3830) +++ trunk/src/linux-kernel/infiniband/core/ping.c (working copy) @@ -108,7 +108,6 @@ static void pingd_recv_handler(struct ib struct ib_ah *ah; struct ib_mad_send_buf *msg; struct ib_vendor_mad *vend; - struct ib_send_wr 
*bad_send_wr; int ret; /* Find matching MAD agent */ @@ -128,17 +127,17 @@ static void pingd_recv_handler(struct ib } msg = ib_create_send_mad(mad_agent, mad_recv_wc->wc->src_qp, - mad_recv_wc->wc->pkey_index, ah, 0, - offsetof(struct ib_vendor_mad, data), - mad_recv_wc->mad_len - - offsetof(struct ib_vendor_mad, data), + mad_recv_wc->wc->pkey_index, 0, + IB_MGMT_VENDOR_HDR, + mad_recv_wc->mad_len - IB_MGMT_VENDOR_HDR, GFP_KERNEL); if (IS_ERR(msg)) { printk(KERN_ERR SPFX "pingd_recv_handler: failed to create response MAD\n"); goto error2; } - vend = (struct ib_vendor_mad *) msg->mad; + msg->ah = ah; + vend = msg->mad; memcpy(vend, mad_recv_wc->recv_buf.mad, sizeof(*vend)); vend->mad_hdr.method |= IB_MGMT_METHOD_RESP; vend->mad_hdr.status = 0; @@ -149,7 +148,7 @@ static void pingd_recv_handler(struct ib system_utsname.nodename, system_utsname.domainname); /* Send response */ - ret = ib_post_send_mad(mad_agent, &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (!ret) { ib_free_recv_mad(mad_recv_wc); return; @@ -167,10 +166,9 @@ error1: static void pingd_send_handler(struct ib_mad_agent *mad_agent, struct ib_mad_send_wc *mad_send_wc) { - struct ib_mad_send_buf *msg; + struct ib_mad_send_buf *msg = mad_send_wc->send_buf; - msg = (struct ib_mad_send_buf *) (unsigned long) mad_send_wc->wr_id; - ib_destroy_ah(msg->send_wr.wr.ud.ah); + ib_destroy_ah(msg->ah); if (mad_send_wc->status != IB_WC_SUCCESS) printk(KERN_ERR SPFX "pingd_send_handler: Error sending MAD: %d\n", mad_send_wc->status); ib_free_send_mad(msg); Index: trunk/src/linux-kernel/infiniband/core/smi.h =================================================================== --- trunk/src/linux-kernel/infiniband/core/smi.h (revision 3830) +++ trunk/src/linux-kernel/infiniband/core/smi.h (working copy) @@ -35,10 +35,11 @@ * * $Id$ */ - #ifndef __SMI_H_ #define __SMI_H_ +#include + int smi_handle_dr_smp_recv(struct ib_smp *smp, u8 node_type, int port_num, Index: 
trunk/src/linux-kernel/infiniband/hw/mthca/mthca_mad.c =================================================================== --- trunk/src/linux-kernel/infiniband/hw/mthca/mthca_mad.c (revision 3830) +++ trunk/src/linux-kernel/infiniband/hw/mthca/mthca_mad.c (working copy) @@ -46,11 +46,6 @@ enum { MTHCA_VENDOR_CLASS2 = 0xa }; -struct mthca_trap_mad { - struct ib_mad *mad; - DECLARE_PCI_UNMAP_ADDR(mapping) -}; - static void update_sm_ah(struct mthca_dev *dev, u8 port_num, u16 lid, u8 sl) { @@ -116,49 +111,14 @@ static void forward_trap(struct mthca_de struct ib_mad *mad) { int qpn = mad->mad_hdr.mgmt_class != IB_MGMT_CLASS_SUBN_LID_ROUTED; - struct mthca_trap_mad *tmad; - struct ib_sge gather_list; - struct ib_send_wr *bad_wr, wr = { - .opcode = IB_WR_SEND, - .sg_list = &gather_list, - .num_sge = 1, - .send_flags = IB_SEND_SIGNALED, - .wr = { - .ud = { - .remote_qpn = qpn, - .remote_qkey = qpn ? IB_QP1_QKEY : 0, - .timeout_ms = 0 - } - } - }; + struct ib_mad_send_buf *send_buf; struct ib_mad_agent *agent = dev->send_agent[port_num - 1][qpn]; int ret; unsigned long flags; if (agent) { - tmad = kmalloc(sizeof *tmad, GFP_KERNEL); - if (!tmad) - return; - - tmad->mad = kmalloc(sizeof *tmad->mad, GFP_KERNEL); - if (!tmad->mad) { - kfree(tmad); - return; - } - - memcpy(tmad->mad, mad, sizeof *mad); - - wr.wr.ud.mad_hdr = &tmad->mad->mad_hdr; - wr.wr_id = (unsigned long) tmad; - - gather_list.addr = dma_map_single(agent->device->dma_device, - tmad->mad, - sizeof *tmad->mad, - DMA_TO_DEVICE); - gather_list.length = sizeof *tmad->mad; - gather_list.lkey = to_mpd(agent->qp->pd)->ntmr.ibmr.lkey; - pci_unmap_addr_set(tmad, mapping, gather_list.addr); - + send_buf = ib_create_send_mad(agent, qpn, 0, 0, IB_MGMT_MAD_HDR, + IB_MGMT_MAD_DATA, GFP_ATOMIC); /* * We rely here on the fact that MLX QPs don't use the * address handle after the send is posted (this is @@ -166,21 +126,15 @@ static void forward_trap(struct mthca_de * it's OK for our devices). 
*/ spin_lock_irqsave(&dev->sm_lock, flags); - wr.wr.ud.ah = dev->sm_ah[port_num - 1]; - if (wr.wr.ud.ah) - ret = ib_post_send_mad(agent, &wr, &bad_wr); + memcpy(send_buf->mad, mad, sizeof *mad); + if ((send_buf->ah = dev->sm_ah[port_num - 1])) + ret = ib_post_send_mad(send_buf, NULL); else ret = -EINVAL; spin_unlock_irqrestore(&dev->sm_lock, flags); - if (ret) { - dma_unmap_single(agent->device->dma_device, - pci_unmap_addr(tmad, mapping), - sizeof *tmad->mad, - DMA_TO_DEVICE); - kfree(tmad->mad); - kfree(tmad); - } + if (ret) + ib_free_send_mad(send_buf); } } @@ -267,15 +221,7 @@ int mthca_process_mad(struct ib_device * static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) { - struct mthca_trap_mad *tmad = - (void *) (unsigned long) mad_send_wc->wr_id; - - dma_unmap_single(agent->device->dma_device, - pci_unmap_addr(tmad, mapping), - sizeof *tmad->mad, - DMA_TO_DEVICE); - kfree(tmad->mad); - kfree(tmad); + ib_free_send_mad(mad_send_wc->send_buf); } int mthca_create_agents(struct mthca_dev *dev) Index: utils/src/linux-kernel/infiniband/util/madeye/madeye.c =================================================================== --- utils/src/linux-kernel/infiniband/util/madeye/madeye.c (revision 3481) +++ utils/src/linux-kernel/infiniband/util/madeye/madeye.c (working copy) @@ -391,22 +391,19 @@ static void print_smp(struct ib_smp *smp } static void snoop_smi_handler(struct ib_mad_agent *mad_agent, - struct ib_send_wr *send_wr, + struct ib_mad_send_buf *send_buf, struct ib_mad_send_wc *mad_send_wc) { - if (!smp && send_wr->wr.ud.mad_hdr->mgmt_class != mgmt_class) + struct ib_mad_hdr *hdr = send_buf->mad; + + if (!smp && hdr->mgmt_class != mgmt_class) return; - if (attr_id && send_wr->wr.ud.mad_hdr->attr_id != attr_id) + if (attr_id && hdr->attr_id != attr_id) return; printk("Madeye:sent SMP\n"); - if (send_wr->num_sge > 1) { - printk("Madeye:sent SMP - multiple sg entries\n"); - print_mad_hdr(send_wr->wr.ud.mad_hdr); - } else { - 
printk("Madeye:sent SMP\n"); - print_smp((struct ib_smp *)send_wr->wr.ud.mad_hdr); - } + printk("Madeye:sent SMP\n"); + print_smp(send_buf->mad); } static void recv_smi_handler(struct ib_mad_agent *mad_agent, @@ -440,25 +437,21 @@ static int is_rmpp_mad(struct ib_mad_hdr } static void snoop_gsi_handler(struct ib_mad_agent *mad_agent, - struct ib_send_wr *send_wr, + struct ib_mad_send_buf *send_buf, struct ib_mad_send_wc *mad_send_wc) { - struct ib_mad_hdr *hdr = send_wr->wr.ud.mad_hdr; - struct ib_rmpp_mad *mad; + struct ib_mad_hdr *hdr = send_buf->mad; - if (!gmp && send_wr->wr.ud.mad_hdr->mgmt_class != mgmt_class) + if (!gmp && hdr->mgmt_class != mgmt_class) return; - if (attr_id && send_wr->wr.ud.mad_hdr->attr_id != attr_id) + if (attr_id && hdr->attr_id != attr_id) return; printk("Madeye:sent GMP\n"); print_mad_hdr(hdr); - if (is_rmpp_mad(hdr)) { - mad = (struct ib_rmpp_mad *) hdr; - print_rmpp_hdr(&mad->rmpp_hdr); - } - + if (is_rmpp_mad(hdr)) + print_rmpp_hdr(&((struct ib_rmpp_mad *) hdr)->rmpp_hdr); } static void recv_gsi_handler(struct ib_mad_agent *mad_agent, Index: utils/src/linux-kernel/infiniband/util/grmpp/grmpp.c =================================================================== --- utils/src/linux-kernel/infiniband/util/grmpp/grmpp.c (revision 3375) +++ utils/src/linux-kernel/infiniband/util/grmpp/grmpp.c (working copy) @@ -168,10 +168,9 @@ static struct ib_ah * create_ah(void) static void format_send(struct ib_mad_send_buf *msg, int id) { - struct ib_vendor_mad *mad; + struct ib_vendor_mad *mad = msg->mad; u64 hi_tid, low_tid; - mad = (struct ib_vendor_mad *) msg->mad; mad->mad_hdr.base_version = IB_MGMT_BASE_VERSION; mad->mad_hdr.mgmt_class = GRMPP_MGMT_CLASS; mad->mad_hdr.class_version = 1; @@ -182,8 +181,9 @@ static void format_send(struct ib_mad_se mad->mad_hdr.tid = cpu_to_be64(hi_tid | low_tid); if (responses) - msg->send_wr.wr.ud.timeout_ms = 7000; - msg->send_wr.wr.ud.retries = 3; + msg->timeout_ms = 7000; + msg->retries = 3; + msg->ah = 
test.ah; mad->oui[0] = (u8) (IB_OPENIB_OUI >> 16); mad->oui[1] = (u8) ((IB_OPENIB_OUI >> 8) & 0xFF); @@ -193,27 +193,24 @@ static void format_send(struct ib_mad_se static int send_msgs(void) { struct ib_mad_send_buf *msg; - struct ib_send_wr *bad_send_wr; int i, ret = 0; - if (!rmpp && message_size > sizeof(struct ib_vendor_mad) - - offsetof(struct ib_vendor_mad, data)) { + if (!rmpp && message_size > IB_MGMT_VENDOR_DATA) { printk("grmpp: no RMPP reducing message size\n"); - message_size = sizeof(struct ib_vendor_mad) - - offsetof(struct ib_vendor_mad, data); + message_size = IB_MGMT_VENDOR_DATA; } for (i = 0; i < message_count && !ret; i++) { - msg = ib_create_send_mad(test.agent, 1, 0, test.ah, rmpp, - offsetof(struct ib_vendor_mad, data), - message_size, GFP_KERNEL); + msg = ib_create_send_mad(test.agent, 1, 0, rmpp, + IB_MGMT_VENDOR_HDR, message_size, + GFP_KERNEL); if (IS_ERR(msg)) { ret = PTR_ERR(msg); break; } format_send(msg, i); - ret = ib_post_send_mad(test.agent, &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) ib_free_send_mad(msg); } @@ -225,7 +222,6 @@ static void send_response(struct ib_mad_ { struct ib_mad_send_buf *msg; struct ib_vendor_mad *send_mad; - struct ib_send_wr *bad_send_wr; struct ib_ah *ah; int ret; @@ -238,12 +234,11 @@ static void send_response(struct ib_mad_ } msg = ib_create_send_mad(test.agent, recv_wc->wc->src_qp, - recv_wc->wc->pkey_index, ah, + recv_wc->wc->pkey_index, ib_get_rmpp_flags(&mad->rmpp_hdr) & IB_MGMT_RMPP_FLAG_ACTIVE, - offsetof(struct ib_vendor_mad, data), - recv_wc->mad_len - - offsetof(struct ib_vendor_mad, data), + IB_MGMT_VENDOR_HDR, + recv_wc->mad_len - IB_MGMT_VENDOR_HDR, GFP_KERNEL); if (IS_ERR(msg)) { printk("grmpp: Error creating response MAD: %d\n", @@ -251,15 +246,16 @@ static void send_response(struct ib_mad_ goto error1; } - send_mad = (struct ib_vendor_mad *) msg->mad; + send_mad = msg->mad; memcpy(send_mad, mad, offsetof(struct ib_mad, data)); send_mad->mad_hdr.method |= 
IB_MGMT_METHOD_RESP; send_mad->oui[0] = (u8) (IB_OPENIB_OUI >> 16); send_mad->oui[1] = (u8) ((IB_OPENIB_OUI >> 8) & 0xFF); send_mad->oui[2] = (u8) (IB_OPENIB_OUI & 0xFF); - msg->send_wr.wr.ud.retries = 3; + msg->retries = 3; + msg->ah = test.ah; - ret = ib_post_send_mad(test.agent, &msg->send_wr, &bad_send_wr); + ret = ib_post_send_mad(msg, NULL); if (ret) { printk("grmpp: Error sending response MAD: %d\n", ret); goto error2; @@ -274,17 +270,13 @@ error1: static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *send_wc) { - struct ib_mad_send_buf *msg; - - msg = (struct ib_mad_send_buf *) (unsigned long) send_wc->wr_id; - if (is_server) - ib_destroy_ah(msg->send_wr.wr.ud.ah); + ib_destroy_ah(send_wc->send_buf->ah); atomic_inc(&test.sends); if (send_wc->status != IB_WC_SUCCESS) printk("grmpp: Error sending MAD: %d\n", send_wc->status); - ib_free_send_mad(msg); + ib_free_send_mad(send_wc->send_buf); if (!is_server && atomic_dec_and_test(&test.sends_left)) wake_up(&test.wait); From swise at opengridcomputing.com Fri Oct 21 11:30:55 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 21 Oct 2005 13:30:55 -0500 Subject: [openib-general] configuring ipoib References: Message-ID: <008c01c5d66d$98e263c0$d5000a0a@STEVO> Normally, in linux, you specify IP aliases thusly: ifconfig : So to add a 2nd address to eth0, do this: ifconfig eth0:1 1.2.3.4 And a 3rd like this: ifconfig eth0:2 5.6.7.8 etc... I would assume IPoIB interfaces behave the same way... by the way, the can be anything string. ----- Original Message ----- From: "Kanevsky, Arkady" To: "Grant Grundler" Cc: Sent: Friday, October 21, 2005 1:21 PM Subject: RE: [openib-general] configuring ipoib Thanks guys. Please, excuse my terminology. No I can route a single IP address to an IB port. But how do I route 2 (or more) IP addresses to the same IB port? If I specify the same ib# it just changes an associated IP address for the port. 
If I specify next ib# it returns an error since that ib# does not have a
port behind it.

Arkady Kanevsky          email: arkady at netapp.com
Network Appliance        phone: 781-768-5395
375 Totten Pond Rd.      Fax: 781-895-1195
Waltham, MA 02451-2010   central phone: 781-768-5300

> -----Original Message-----
> From: Grant Grundler [mailto:iod00d at hp.com]
> Sent: Friday, October 21, 2005 11:58 AM
> To: Kanevsky, Arkady
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] configuring ipoib
>
>
> On Fri, Oct 21, 2005 at 11:02:37AM -0400, Kanevsky, Arkady wrote:
> > How do you configure ipoib?
> > I used "ifconfig ib0 ip_address" which works fine.
> > But if I have several ports on an HCA how do I specify which port
> > ip_address should be associated with?
>
> Nit: For linux, Christoph Hellwig (and others) have explained
> that the ip address is associated with the host, NOT any
> card. The route to the subnet is associated with the card.
>
> > Ditto if you have multiple cards.
>
> iowa:/usr/src/linux-2.6.13# lspci -vt -d 15b3:
> -+-[c0]---01.0-[c1]----00.0 Mellanox Technologies MT23108 InfiniHost
>  +-[40]---01.0-[41]----00.0 Mellanox Technologies MT23108 InfiniHost
>  \-[00]-
> iowa:/usr/src/linux-2.6.13# ifconfig -a | fgrep ib
> ib0 Link encap:UNSPEC HWaddr
> 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> ib1 Link encap:UNSPEC HWaddr
> 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
> ib2 Link encap:UNSPEC HWaddr
> 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> ib3 Link encap:UNSPEC HWaddr
> 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
>
>
> hth,
> grant
>
_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From Arkady.Kanevsky at netapp.com Fri Oct 21 11:52:13 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Fri, 21 Oct 2005 14:52:13 -0400
Subject: [openib-general] configuring ipoib
Message-ID:

that works.
Thanks Steve.

Arkady Kanevsky          email: arkady at netapp.com
Network Appliance        phone: 781-768-5395
375 Totten Pond Rd.      Fax: 781-895-1195
Waltham, MA 02451-2010   central phone: 781-768-5300

> -----Original Message-----
> From: Steve Wise [mailto:swise at opengridcomputing.com]
> Sent: Friday, October 21, 2005 2:31 PM
> To: Kanevsky, Arkady; Grant Grundler
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] configuring ipoib
>
>
> Normally, in linux, you specify IP aliases thusly:
>
> ifconfig :
>
> So to add a 2nd address to eth0, do this:
>
> ifconfig eth0:1 1.2.3.4
>
> And a 3rd like this:
>
> ifconfig eth0:2 5.6.7.8
>
> etc...
>
> I would assume IPoIB interfaces behave the same way...
>
> by the way, the can be anything string.
>
>
> ----- Original Message -----
> From: "Kanevsky, Arkady"
> To: "Grant Grundler"
> Cc:
> Sent: Friday, October 21, 2005 1:21 PM
> Subject: RE: [openib-general] configuring ipoib
>
>
> Thanks guys.
> Please, excuse my terminology.
> No I can route a single IP address to an IB port.
> But how do I route 2 (or more) IP addresses to the same IB port?
>
> If I specify the same ib# it just changes an associated IP
> address for the port. If I specify next ib# it returns an
> error since that ib# does not have a port behind it.
>
> Arkady Kanevsky          email: arkady at netapp.com
> Network Appliance        phone: 781-768-5395
> 375 Totten Pond Rd.      Fax: 781-895-1195
> Waltham, MA 02451-2010   central phone: 781-768-5300
>
>
>
> > -----Original Message-----
> > From: Grant Grundler [mailto:iod00d at hp.com]
> > Sent: Friday, October 21, 2005 11:58 AM
> > To: Kanevsky, Arkady
> > Cc: openib-general at openib.org
> > Subject: Re: [openib-general] configuring ipoib
> >
> >
> > On Fri, Oct 21, 2005 at 11:02:37AM -0400, Kanevsky, Arkady wrote:
> > > How do you configure ipoib?
> > > I used "ifconfig ib0 ip_address" which works fine.
> > > But if I have several ports on an HCA how do I specify which port
> > > ip_address should be associated with?
> >
> > Nit: For linux, Christoph Hellwig (and others) have
> explained that the
> > ip address is associated with the host, NOT any card. The
> route to the
> > subnet is associated with the card.
> >
> > > Ditto if you have multiple cards.
> >
> > iowa:/usr/src/linux-2.6.13# lspci -vt -d 15b3:
> > -+-[c0]---01.0-[c1]----00.0 Mellanox Technologies MT23108
> InfiniHost
> >  +-[40]---01.0-[41]----00.0 Mellanox Technologies MT23108
> InfiniHost
> >  \-[00]-
> > iowa:/usr/src/linux-2.6.13# ifconfig -a | fgrep ib
> > ib0 Link encap:UNSPEC HWaddr
> > 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> > ib1 Link encap:UNSPEC HWaddr
> > 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
> > ib2 Link encap:UNSPEC HWaddr
> > 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> > ib3 Link encap:UNSPEC HWaddr
> > 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
> >
> >
> > hth,
> > grant
> >
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From ardavis at ichips.intel.com Fri Oct 21 11:59:46 2005
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Fri, 21 Oct 2005 11:59:46 -0700
Subject: [openib-general] uDAPL open HCA problem
In-Reply-To: <20051021161727.GA23980@cse.ohio-state.edu>
References: <20051021161727.GA23980@cse.ohio-state.edu>
Message-ID: <43593AA2.6040501@ichips.intel.com>

Sayantan Sur wrote:

>Hello,
>
>I have udapl over Gen2 setup on our cluster and am able to run udapl
>programs. However, sometimes I get this error (after a few runs of the
>same program):
>
> open_hca: ERR ib_at_ips_by_gid for mthca0
>dapls_ib_open_hca failed 40000
>
>

uDAPL uses uAT to get the IP address using the GID (ATS records via SA)
of the local device/port. The SA query for this record is failing for
some reason. Did your SM bounce during this time? Did you bounce or
reconfigure the IPoIB network device?

You can set "env DAPL_DBG_TYPE=0xffff" for more information.

-arlin

>The machine is a AMD Opteron (Tyan S2895), with Mellanox MemFree cards
>(fw ver 5.1.0).
>
>lsmod on my machine shows this:
>
>[surs at ro0:~] lsmod | grep ^ib
>ib_ipoib 48008 0
>ib_uat 14840 0
>ib_at 25696 1 ib_uat
>ib_sa 17804 2 ib_ipoib,ib_at
>ib_ucm 22280 0
>ib_cm 37744 1 ib_ucm
>ib_uverbs 35992 0
>ib_umad 18208 0
>ib_mthca 122656 0
>ib_mad 44072 4 ib_sa,ib_cm,ib_umad,ib_mthca
>ib_core 56192 8
>ib_ipoib,ib_sa,ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad
>
>My infiniband devices are (created by hand):
>
>[surs at ro0:~] ls -l /dev/infiniband/
>total 0
>crw-rw-rw- 1 root root 231, 191 2005-10-20 21:13 uat
>crw-rw-rw- 1 root root 231, 224 2005-10-20 21:12 ucm0
>crwxrwxrwx 1 root root 231, 192 2005-09-21 04:37 umad0
>crwxrwxrwx 1 root root 231, 192 2005-09-16 19:29 uverbs0
>crwxrwxrwx 1 root root 231, 192 2005-09-16 19:29 uverbs1
>
>
>I'd really appreciate if someone could help me understand what might be
>going wrong.
>
>Thanks,
>Sayantan.
>
>
>

From Arkady.Kanevsky at netapp.com Fri Oct 21 12:23:08 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Fri, 21 Oct 2005 15:23:08 -0400
Subject: [openib-general] FW upgrade for TopSpin cards
Message-ID:

Roland,
sorry to bug you on that but...
I have a Cisco HCA (PCI-X)
hca_type MTS23108
hw_rev a1
fw_ver 1.18.0

hca_type and hw_rev are clearly Mellanox nomenclature.
I suspect that this is Cisco FW version #.
But all OpenIB documentation is with respect to Mellanox nomenclature.
For example from http://www.openib.org/docs/ipoib_faq.txt

1. Verify the firmware version via
   cat /sys/class/infiniband/mthca0/fw_ver
   For PCI-X HCAs, version 3.2.0 is recommended. For PCIe HCAs,
   version 4.5.3 is recommended.
*********************************
Is there analogous documentation for Cisco FW?
Where is that FW (this is Cougar card)? Are Cisco FWs and Mellanox FW the same? If yes what is the correspondance between the 2 numbering schemas. While this specific question is for Cougar card, the answer should be generic and cover all HCAs. Can the documentation be updated to cover all supported HW regardless of the vendor? Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Roland Dreier [mailto:rolandd at cisco.com] > Sent: Thursday, October 20, 2005 1:48 PM > To: Kanevsky, Arkady > Cc: openib-general at openib.org > Subject: Re: [openib-general] FW upgrade for TopSpin cards > > > Arkady> I get a bunch of warnings (see below). > > All of the warnings look benign (although you might want to > synchronize the clock between your build system and your file server). > > Arkady> Can I use OpenIB tvflash to upgrade FW on a TopSpin card? > > Yes. > > Arkady> Can I use OpenIB mstflint for it? > > Yes. > > Arkady> Which version of the utilities should I use? > > I would use the latest subversion revision. > > Arkady> Why warning when I build it? > > Because gcc 4.0 added a bunch of semi-bogus pointer sign > warnings, and you clocks are out of synch. > > - R. > From jlentini at netapp.com Fri Oct 21 12:24:11 2005 From: jlentini at netapp.com (James Lentini) Date: Fri, 21 Oct 2005 15:24:11 -0400 (EDT) Subject: [openib-general] TCP/IP connection service over IB In-Reply-To: <43591D07.5050709@ichips.intel.com> References: <43591D07.5050709@ichips.intel.com> Message-ID: sean> At a minimum, we need an assigned service ID to identifies a sean> TCP/IP connection service. 
For simplicity of the
sean> implementation, I would use an ID similar to that defined for
sean> SDP:
sean>
sean>    0x00 14 05 xx xx xx xx xx
sean>
sean> I don't know that the SWG or IBTA needs to be involved defining
sean> the protocol beyond assigning the service ID.

Standardizing the protocol will ensure interoperability.

sean> The connection service can define service IDs as:
sean>
sean>    0x00 14 05 00 00 00 dst port
sean>
sean> And a private data format for the CM REQ:
sean>
sean>    version(8) | reserved(8) | src port (16)
sean>    src ip (16)
sean>    dst ip (16)
sean>    user private data (56)  /* for version 1 */

Are the numbers in parens in bytes or bits? It looks like a mixture to
me.

sean> Other private data would be left unchanged, though if we wanted
sean> to get more sophisticated, we could define REJ codes to indicate
sean> bad addresses/version/etc. Not surprisingly, this is exactly
sean> what's implemented in the CMA and working today.

I agree. We should keep it simple.

sean> On a related note, it would be convenient if SDP were changed to
sean> run over this protocol.

Agreed.

From mshefty at ichips.intel.com Fri Oct 21 12:28:25 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 21 Oct 2005 12:28:25 -0700
Subject: [openib-general] TCP/IP connection service over IB
In-Reply-To:
References: <43591D07.5050709@ichips.intel.com>
Message-ID: <43594159.3000202@ichips.intel.com>

James Lentini wrote:
> Standardizing the protocol will ensure interoperability.

Agreed - just didn't know if this was the responsibility of the SWG.

> sean>    version(8) | reserved(8) | src port (16)

version(1) | reserved(1) | src port (2)

> sean>    src ip (16)
> sean>    dst ip (16)
> sean>    user private data (56)  /* for version 1 */
>
> Are the numbers in parens in bytes or bits? It looks like a mixture to
> me.

Uhm.. they were a mix. Changed above to bytes.
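
For readers following the layout, here is a minimal sketch of the REQ private data format as corrected above (all sizes in bytes), together with the proposed service ID. The struct, macro and function names are hypothetical and appear nowhere in the thread; storing the source port in network byte order is also an assumption. The field sizes add up to the 92 bytes of private data that a CM REQ carries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names: the thread fixes only the field order and sizes. */
#define CMA_SID_PREFIX   0x0014050000000000ULL /* 0x00 14 05 00 00 00 | dst port */
#define CMA_REQ_PRIV_LEN 92                    /* CM REQ private data, in bytes */

/* Byte arrays only, so the compiler inserts no padding. */
struct cma_req_hdr {
	uint8_t version;       /* 1 byte */
	uint8_t reserved;      /* 1 byte */
	uint8_t src_port[2];   /* 2 bytes; network byte order is assumed */
	uint8_t src_ip[16];    /* 16 bytes */
	uint8_t dst_ip[16];    /* 16 bytes */
	uint8_t user_data[56]; /* 56 bytes of user private data (version 1) */
};

static_assert(sizeof(struct cma_req_hdr) == CMA_REQ_PRIV_LEN,
	      "header plus user data must fill the REQ private data");
static_assert(offsetof(struct cma_req_hdr, user_data) == 36,
	      "the connection service header takes 36 bytes");

/* Service ID for a given destination port: prefix in the top six bytes. */
static inline uint64_t cma_service_id(uint16_t dst_port)
{
	return CMA_SID_PREFIX | dst_port;
}

/* Store the source port most significant byte first. */
static inline void cma_set_src_port(struct cma_req_hdr *hdr, uint16_t port)
{
	hdr->src_port[0] = (uint8_t)(port >> 8);
	hdr->src_port[1] = (uint8_t)(port & 0xff);
}
```

With this layout, destination port 80 would map to service ID 0x0014050000000050.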
- Sean From mshefty at ichips.intel.com Fri Oct 21 12:34:28 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 21 Oct 2005 12:34:28 -0700 Subject: [openib-general] TCP/IP connection service over IB In-Reply-To: <43591D07.5050709@ichips.intel.com> References: <43591D07.5050709@ichips.intel.com> Message-ID: <435942C4.80308@ichips.intel.com> Sean Hefty wrote: > version(8) | reserved(8) | src port (16) > src ip (16) > dst ip (16) > user private data (56) /* for version 1 */ Random thought... if the src and dst IP addresses will always be on the same network, the data could be layed out as: network addr (x) src host addr (y) dst host addr (y) This could save enough space to provide 64 bytes of user private data. Although my preference would be to keep it simpler. (I'm not that familiar with IPv6 addressing. How does it define network versus host addressing?) - Sean From iod00d at hp.com Fri Oct 21 12:36:11 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 21 Oct 2005 12:36:11 -0700 Subject: [openib-general] configuring ipoib In-Reply-To: References: Message-ID: <20051021193611.GF25476@esmail.cup.hp.com> On Fri, Oct 21, 2005 at 02:21:23PM -0400, Kanevsky, Arkady wrote: > Thanks guys. > Please, excuse my terminology. > No I can route a single IP address to an IB port. ok > But how do I route 2 (or more) IP addresses to the same IB port? You can associate multiple subnets with an interface and indicate which IP address this host should respond to: ifconfig ib:1 e.g. 
    ifconfig ib0 10.0.0.81 netmask 255.255.255.0
    ifconfig ib1 10.0.1.81 netmask 255.255.255.0
    ifconfig ib2 10.0.2.81 netmask 255.255.255.0
    ifconfig ib3 10.0.3.81 netmask 255.255.255.0
    ifconfig ib3:1 10.0.3.99 netmask 255.255.255.0

route -n output now looks like:

iowa:/usr/src/linux-2.6.13# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 ib0
10.0.1.0        0.0.0.0         255.255.255.0   U     0      0        0 ib1
10.0.2.0        0.0.0.0         255.255.255.0   U     0      0        0 ib2
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
10.0.3.0        0.0.0.0         255.255.255.0   U     0      0        0 ib3

iowa:/usr/src/linux-2.6.13# ifconfig ib3
ib3       Link encap:UNSPEC  HWaddr 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.0.3.81  Bcast:10.255.255.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:2044  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

iowa:/usr/src/linux-2.6.13# ifconfig ib3:1
ib3:1     Link encap:UNSPEC  HWaddr 00-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.0.3.99  Bcast:10.255.255.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:2044  Metric:1

> If I specify the same ib# it just changes an associated IP address for
> the port.

eh?
Either you aren't getting it or are very persistent with
the "confusing" terminology. I'll try one more time in case
it's the former.

Connect a linux box (e.g. laptop) to two subnets and configure both
NICs like normal for two distinct subnets. Disconnect the second cable.
Then from another box that has its default route set to the first
subnet, try to ping the IP you "associated with the second port".
The linux box will respond.

E.g. I have two private subnets: 192.168.0.0/24 and 192.168.1.0/24.
Both are connected to my squid (http cache) server that also provides
NAT to the outside world. The squid server does NOT do routing.

iowa:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
0.0.0.0         192.168.1.1     0.0.0.0         UG    0      0        0 eth0

iowa:~# ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1): 56 data bytes
64 bytes from 192.168.1.1: icmp_seq=0 ttl=64 time=0.1 ms
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.1 ms

--- 192.168.1.1 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/0.1 ms

Obviously, I can ping the gateway. My point is the gateway will
also respond to other subnets that it owns:

iowa:~# ping 192.168.0.1
PING 192.168.0.1 (192.168.0.1): 56 data bytes
64 bytes from 192.168.0.1: icmp_seq=0 ttl=64 time=0.1 ms
64 bytes from 192.168.0.1: icmp_seq=1 ttl=64 time=0.1 ms

--- 192.168.0.1 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/0.1 ms

iowa:~# ping 192.168.0.11
PING 192.168.0.11 (192.168.0.11): 56 data bytes

--- 192.168.0.11 ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss

192.168.0.11 is alive and owned by a third machine on that subnet.
Just trying to demonstrate that the squid server is NOT doing
any routing.

> If I specify next ib# it returns an error since that ib# does
> not have a port behind it.

Right. No port means no ib#.

hth,
grant

From jlentini at netapp.com Fri Oct 21 12:37:59 2005
From: jlentini at netapp.com (James Lentini)
Date: Fri, 21 Oct 2005 15:37:59 -0400 (EDT)
Subject: [openib-general] TCP/IP connection service over IB
In-Reply-To: <43594159.3000202@ichips.intel.com>
References: <43591D07.5050709@ichips.intel.com> <43594159.3000202@ichips.intel.com>
Message-ID:

On Fri, 21 Oct 2005, Sean Hefty wrote:

> James Lentini wrote:
> > Standardizing the protocol will ensure interoperability.
>
> Agreed - just didn't know if this was the responsibility of the SWG.

The SWG has agreed to take it on.
I think it is appropriate for the SWG to work on this. > > sean> version(8) | reserved(8) | src port (16) > version(1) | reserved(1) | src port (2) > > sean> src ip (16) > > sean> dst ip (16) > > sean> user private data (56) /* for version 1 */ > > > > Are the numbers in parens in bytes or bits? It looks like a mixture to me. > > Uhm.. they were a mix. Changed above to bytes. Ok. I assume that your 1 byte of version information is broken into 2 4-bit pieces, one for the protocol version and one for the IP version. What about making the src and dst ip fields variable length based on the IP version (4 bytes for IPv4 and 16 bytes for IPv6). That would provide more private data for IPv4 networks at the expense of a variable sized header and all the complexity it entails. From mshefty at ichips.intel.com Fri Oct 21 12:44:56 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 21 Oct 2005 12:44:56 -0700 Subject: [openib-general] TCP/IP connection service over IB In-Reply-To: References: <43591D07.5050709@ichips.intel.com> <43594159.3000202@ichips.intel.com> Message-ID: <43594538.7030806@ichips.intel.com> James Lentini wrote: > Ok. I assume that your 1 byte of version information is broken into 2 > 4-bit pieces, one for the protocol version and one for the IP version. That is correct. > What about making the src and dst ip fields variable length based on > the IP version (4 bytes for IPv4 and 16 bytes for IPv6). > > That would provide more private data for IPv4 networks at the expense > of a variable sized header and all the complexity it entails. That's a possibility that wouldn't add that much complexity. See my other message for yet another approach though. I'm just not sure that it helps an app much to have different private data sizes based on the address size, unless the app is written specifically for IPv4. 
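
To make the trade-off concrete, here is a rough sketch of the two ideas discussed above: a version byte split into two 4-bit fields, and address fields whose length depends on the IP version. The function names are hypothetical, and which nibble holds which version is an assumption (the thread only confirms that the byte is split in two). The 92-byte total is the private data size of a CM REQ.

```c
#include <assert.h>
#include <stdint.h>

#define REQ_PRIV_LEN 92 /* CM REQ private data, in bytes */
#define FIXED_LEN    4  /* version(1) + reserved(1) + src port(2) */

static_assert(REQ_PRIV_LEN - FIXED_LEN - 2 * 16 == 56,
	      "fixed 16-byte addresses leave 56 bytes of user data");

/* Assumption: protocol version in the high nibble, IP version in the low. */
static inline uint8_t pack_version(uint8_t proto_ver, uint8_t ip_ver)
{
	return (uint8_t)(((proto_ver & 0x0f) << 4) | (ip_ver & 0x0f));
}

static inline uint8_t proto_version(uint8_t v) { return v >> 4; }
static inline uint8_t ip_version(uint8_t v)    { return v & 0x0f; }

/* Bytes per address under the variable-length proposal; 0 if unknown. */
static inline unsigned addr_len(uint8_t ip_ver)
{
	switch (ip_ver) {
	case 4:  return 4;
	case 6:  return 16;
	default: return 0;
	}
}

/* User private data left after the header, or -1 for an unknown version. */
static inline int user_data_len(uint8_t ip_ver)
{
	unsigned alen = addr_len(ip_ver);

	if (alen == 0)
		return -1;
	return REQ_PRIV_LEN - FIXED_LEN - 2 * (int)alen;
}
```

Under these assumptions an IPv4 request would leave 80 bytes for the user instead of 56, which is the extra space being weighed against the cost of a variable-sized header.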
- Sean From ftillier at silverstorm.com Fri Oct 21 12:50:48 2005 From: ftillier at silverstorm.com (Fab Tillier) Date: Fri, 21 Oct 2005 12:50:48 -0700 Subject: [openib-general] TCP/IP connection service over IB In-Reply-To: Message-ID: <003201c5d678$bcde9310$9e5aa8c0@infiniconsys.com> > From: James Lentini [mailto:jlentini at netapp.com] > Sent: Friday, October 21, 2005 12:38 PM > > On Fri, 21 Oct 2005, Sean Hefty wrote: > > > > sean> version(8) | reserved(8) | src port (16) > > version(1) | reserved(1) | src port (2) > > > sean> src ip (16) > > > sean> dst ip (16) > > > sean> user private data (56) /* for version 1 */ > > > > > > Are the numbers in parens in bytes or bits? It looks like a mixture to me. > > > > Uhm.. they were a mix. Changed above to bytes. > > Ok. I assume that your 1 byte of version information is broken into 2 > 4-bit pieces, one for the protocol version and one for the IP version. Doesn't leading-zero-padding the IPv4 addresses to be 16 bytes eliminates the need for an IP version field? - Fab From rolandd at cisco.com Fri Oct 21 13:02:31 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 21 Oct 2005 13:02:31 -0700 Subject: [openib-general] FW upgrade for TopSpin cards In-Reply-To: (Arkady Kanevsky's message of "Fri, 21 Oct 2005 15:23:08 -0400") References: Message-ID: <52vezqd4fc.fsf@cisco.com> Arkady> I have a Cisco HCA (PCI-X) hca_type MTS23108 hw_rev a1 Arkady> fw_ver 1.18.0 Arkady> hca_type and hw_rev are clearly Mellanox nomenclature. I Arkady> suspect that this is Cisco FW version #. No, it's just a very old Mellanox FW version. Arkady> Is there analogous documentation for Cisco FW? Where is Arkady> that FW (this is Cougar card)? Are Cisco FWs and Mellanox Arkady> FW the same? If yes what is the correspondance between Arkady> the 2 numbering schemas. Cisco FW and Mellanox FW are virtually identical. The version numbering is the same; the only difference is that Cisco FW will have things like the NodeDescription customized. 
You can use Mellanox FW on an HCA from Cisco, and vice versa. - R. From rolandd at cisco.com Fri Oct 21 13:04:53 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 21 Oct 2005 13:04:53 -0700 Subject: [openib-general] build libibverbs with --libdir parameter In-Reply-To: <4358907D.8060507@de.ibm.com> (Heiko J. Schick's message of "Fri, 21 Oct 2005 08:53:49 +0200") References: <4358907D.8060507@de.ibm.com> Message-ID: <52u0fad4be.fsf@cisco.com> Heiko> Is there some way to change this behaviour? The only optin Heiko> I can see is to patch the DEFAULT_PATH define in init.c for Heiko> RPM builds. I'm not sure I understand the problem you're having. When I build libibverbs RPMs for x86_64 (using the spec file in the libibverbs tarball and building under Fedora Core 4), I get the following: $ rpm -ql libibverbs /usr/lib64/libibverbs.so.1 /usr/lib64/libibverbs.so.1.0.0 /usr/share/doc/libibverbs-1.0 /usr/share/doc/libibverbs-1.0/AUTHORS /usr/share/doc/libibverbs-1.0/COPYING /usr/share/doc/libibverbs-1.0/ChangeLog /usr/share/doc/libibverbs-1.0/README $ rpm -ql libmthca /usr/lib64/infiniband/mthca.so /usr/share/doc/libmthca-1.0 /usr/share/doc/libmthca-1.0/AUTHORS /usr/share/doc/libmthca-1.0/COPYING /usr/share/doc/libmthca-1.0/ChangeLog /usr/share/doc/libmthca-1.0/README and everything works fine. The libibverbs search path is /usr/lib64/infiniband as it should be. Apparently rpmbuild is passing the correct configure and install options to make lib64 work fine. Can you be more specific about the problems you see? Thanks, Roland From rolandd at cisco.com Fri Oct 21 13:24:48 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 21 Oct 2005 13:24:48 -0700 Subject: [openib-general] Adding static entry to arp table? 
In-Reply-To: <1129902328.5494.4.camel@psmith.ind.pantasys.com> (Abhijit Gadgil's message of "Fri, 21 Oct 2005 19:15:28 +0530") References: <1129902328.5494.4.camel@psmith.ind.pantasys.com> Message-ID: <52pspyd3e7.fsf@cisco.com> Abhijit> Hi all, Is there a patch (to ip utility or Linux kernel), Abhijit> which can add static entry to the arp table using ip Abhijit> neigh command? I am using gen1 based stack, but didn't Abhijit> find anything after a 'grep' in the gen2 stack as well? Abhijit> any pointers? With the current Linux IB drivers ("gen2"), "ip neigh add" should work fine. There is probably a way to add ARP entries with your gen1-based stack, but you would have to ask your vendor for details. - R. From rolandd at cisco.com Fri Oct 21 13:25:31 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 21 Oct 2005 13:25:31 -0700 Subject: [openib-general] configuring ipoib In-Reply-To: (Arkady Kanevsky's message of "Fri, 21 Oct 2005 14:21:23 -0400") References: Message-ID: <52ll0md3d0.fsf@cisco.com> Arkady> Thanks guys. Please, excuse my terminology. No I can Arkady> route a single IP address to an IB port. But how do I Arkady> route 2 (or more) IP addresses to the same IB port? Just use "ip addr add" once for each address you want to configure. - R. From swise at opengridcomputing.com Fri Oct 21 13:34:31 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 21 Oct 2005 15:34:31 -0500 Subject: [openib-general] TCP/IP connection service over IB References: <43591D07.5050709@ichips.intel.com> <435942C4.80308@ichips.intel.com> Message-ID: <003c01c5d67e$f9921150$d5000a0a@STEVO> > Random thought... if the src and dst IP addresses will always be on the > same network, the data could be layed out as: > > network addr (x) > src host addr (y) > dst host addr (y) > > This could save enough space to provide 64 bytes of user private data. > Although my preference would be to keep it simpler. (I'm not that > familiar with IPv6 addressing. 
How does it define network versus host > addressing?) I don't think you want to make the assumption that src and dst addrs are on the same IP network. While that may be true for a set of IB hosts on a common IB switch set as a single IP subnet, there may be TCP/IB bridge/gateway products that allow remote IP hosts to connect into an IB cluster and those could certainly be on a remote subnet. IPv6 addrs define networks in a similar manner to IPv4: IE some number of the bits in the address define the network number, and the remaining define the host number. From tom at opengridcomputing.com Fri Oct 21 14:08:14 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 21 Oct 2005 16:08:14 -0500 Subject: [openib-general] TCP/IP connection service over IB In-Reply-To: <43594538.7030806@ichips.intel.com> References: <43591D07.5050709@ichips.intel.com> <43594159.3000202@ichips.intel.com> <43594538.7030806@ichips.intel.com> Message-ID: <1129928894.4255.0.camel@trinity.austin.ammasso.com> Sean: I'm thinking that for iWARP, there won't be anything in the Private Data at all except consumer private data. Is that your expectation?
On Fri, 2005-10-21 at 12:44 -0700, Sean Hefty wrote: > James Lentini wrote: > > Ok. I assume that your 1 byte of version information is broken into 2 > > 4-bit pieces, one for the protocol version and one for the IP version. > > That is correct. > > > What about making the src and dst ip fields variable length based on > > the IP version (4 bytes for IPv4 and 16 bytes for IPv6). > > > > That would provide more private data for IPv4 networks at the expense > > of a variable sized header and all the complexity it entails. > > That's a possibility that wouldn't add that much complexity. See my other > message for yet another approach though. I'm just not sure that it helps an app > much to have different private data sizes based on the address size, unless the > app is written specifically for IPv4. > > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Fri Oct 21 14:02:19 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 21 Oct 2005 14:02:19 -0700 Subject: [openib-general] TCP/IP connection service over IB In-Reply-To: <1129928894.4255.0.camel@trinity.austin.ammasso.com> References: <43591D07.5050709@ichips.intel.com> <43594159.3000202@ichips.intel.com> <43594538.7030806@ichips.intel.com> <1129928894.4255.0.camel@trinity.austin.ammasso.com> Message-ID: <4359575B.5020302@ichips.intel.com> Tom Tucker wrote: > I'm thinking that for iWARP, there won't be anything in the Private Data > at all except consumer private data. Is that your expectation? I believe so. This is only trying to define a TCP/IP connection service over IB. I'm assuming that there's no need to define something similar for iWarp. Does SCTP share the same port space as TCP? Is any mapping between them required? 
- Sean From caitlinb at broadcom.com Fri Oct 21 14:04:47 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 21 Oct 2005 14:04:47 -0700 Subject: [openib-general] TCP/IP connection service over IB Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020AC5@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Steve Wise > Sent: Friday, October 21, 2005 1:35 PM > To: Sean Hefty > Cc: swg at infinibandta.org; openib-general at openib.org > Subject: Re: [openib-general] TCP/IP connection service over IB > > > Random thought... if the src and dst IP addresses will always be on > > the same network, the data could be layed out as: > > > > network addr (x) > > src host addr (y) > > dst host addr (y) > > > > This could save enough space to provide 64 bytes of user > private data. > > Although my preference would be to keep it simpler. (I'm not that > > familiar with IPv6 addressing. How does it define network > versus host > > addressing?) > > I don't think you want to make the assumption that src and > dst addrs are on the same IP network. While that may be true > for a set of IB hosts on a common IB switch set as a single > IP subnet, there may be TCP/IB bridge/gateway products that > allow remote IP hosts to connect into an IB cluster and those > could certainly be on a remote subnet. > > IPv6 addrs define networks in a similar manner to IPv4: IE > some number of the bits in the address define the network > number, and the remaining define the host number. > More relevantly, GIDs are syntactically identical to IPv6. The only real difference between a GID and an IPv6 address is who/how the network portion is assigned. While IPv4-only applications exist, they are not supposed to. Certainly no new API or protocol should encourage an application to be IPv4 dependent. 
The rationale for using only the IPV4 format would be that GIDs *could* be assigned that are valid IPv6 addresses. Hence no translation would be needed. Relying on assignment of IPV6 compatible GIDs may be undesirable, however, because HCAs generally cannot accept a large number of assigned GIDs. The amount of private data supported does vary on network characteristics. A responsible application should be allowed to piggy-back additional data when the larger size is support (512 bytes over IP networks). Forcing the application to have to use an additional round-trip over the network would not be network friendly. However, it should be clear to application developers that they are expected to make their applications work in the minimum size guaranteed -- and that the minimum size is adequate for the core purpose of enabling the QP to be selected/configured. From caitlinb at broadcom.com Fri Oct 21 14:06:44 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Fri, 21 Oct 2005 14:06:44 -0700 Subject: [openib-general] TCP/IP connection service over IB Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020AC6@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hefty > Sent: Friday, October 21, 2005 2:02 PM > To: Tom Tucker > Cc: swg at infinibandta.org; openib-general > Subject: Re: [openib-general] TCP/IP connection service over IB > > Tom Tucker wrote: > > I'm thinking that for iWARP, there won't be anything in the Private > > Data at all except consumer private data. Is that your expectation? > > I believe so. This is only trying to define a TCP/IP > connection service over IB. I'm assuming that there's no > need to define something similar for iWarp. > > Does SCTP share the same port space as TCP? Is any mapping > between them required? > It's basically the same as with TCP and UDP. 
It's a 16 bit number, and most people do not use the same port number to mean *different* things over the different IP transports. From sean.hefty at intel.com Fri Oct 21 14:06:57 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 21 Oct 2005 14:06:57 -0700 Subject: [openib-general] Support for UC connections using the CM API? In-Reply-To: <20051020210133.27820.qmail@web32504.mail.mud.yahoo.com> Message-ID: >I had a look at where the mask is set in cm.c >(cm_init_qp_rtr_attr() and cm_init_qp_rts_attr()) but >I was unsure how to make the mask depend on the QP >type. Maybe you have a better idea of how to do this. Here's a patch (edited by hand, so let me know if there's any issue applying it) that should permit UC connections over the CM. I was able to test this using cmpost. Signed-off-by: Sean Hefty Index: cm.c =================================================================== --- cm.c (revision 3830) +++ cm.c (working copy) @@ -135,6 +135,7 @@ __be64 tid; __be32 local_qpn; __be32 remote_qpn; + enum ib_qp_type qp_type; __be32 sq_psn; __be32 rq_psn; int timeout_ms; @@ -926,6 +923,7 @@ cm_id_priv->responder_resources = param->responder_resources; cm_id_priv->retry_count = param->retry_count; cm_id_priv->path_mtu = param->primary_path->mtu; + cm_id_priv->qp_type = param->qp_type; ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg); if (ret) @@ -1320,6 +1314,7 @@ cm_req_get_primary_local_ack_timeout(req_msg); cm_id_priv->retry_count = cm_req_get_retry_count(req_msg); cm_id_priv->rnr_retry_count = cm_req_get_rnr_retry_count(req_msg); + cm_id_priv->qp_type = cm_req_get_qp_type(req_msg); cm_format_req_event(work, cm_id_priv, &listen_cm_id_priv->id); cm_process_work(cm_id_priv, work); @@ -3079,10 +3035,10 @@ case IB_CM_ESTABLISHED: *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT; - qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE; + qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE; if 
(cm_id_priv->responder_resources) - qp_attr->qp_access_flags |= IB_ACCESS_REMOTE_WRITE | - IB_ACCESS_REMOTE_READ; + qp_attr->qp_access_flags |= IB_ACCESS_REMOTE_READ; qp_attr->pkey_index = cm_id_priv->av.pkey_index; qp_attr->port_num = cm_id_priv->av.port->port_num; ret = 0; @@ -3112,14 +3068,18 @@ case IB_CM_MRA_REP_RCVD: case IB_CM_ESTABLISHED: *qp_attr_mask = IB_QP_STATE | IB_QP_AV | IB_QP_PATH_MTU | - IB_QP_DEST_QPN | IB_QP_RQ_PSN | - IB_QP_MAX_DEST_RD_ATOMIC | IB_QP_MIN_RNR_TIMER; + IB_QP_DEST_QPN | IB_QP_RQ_PSN; qp_attr->ah_attr = cm_id_priv->av.ah_attr; qp_attr->path_mtu = cm_id_priv->path_mtu; qp_attr->dest_qp_num = be32_to_cpu(cm_id_priv->remote_qpn); qp_attr->rq_psn = be32_to_cpu(cm_id_priv->rq_psn); - qp_attr->max_dest_rd_atomic = cm_id_priv->responder_resources; - qp_attr->min_rnr_timer = 0; + if (cm_id_priv->qp_type == IB_QPT_RC) { + *qp_attr_mask |= IB_QP_MAX_DEST_RD_ATOMIC | + IB_QP_MIN_RNR_TIMER; + qp_attr->max_dest_rd_atomic = + cm_id_priv->responder_resources; + qp_attr->min_rnr_timer = 0; + } if (cm_id_priv->alt_av.ah_attr.dlid) { *qp_attr_mask |= IB_QP_ALT_PATH; qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr; @@ -3148,14 +3108,17 @@ case IB_CM_REP_SENT: case IB_CM_MRA_REP_RCVD: case IB_CM_ESTABLISHED: - *qp_attr_mask = IB_QP_STATE | IB_QP_TIMEOUT | IB_QP_RETRY_CNT | - IB_QP_RNR_RETRY | IB_QP_SQ_PSN | - IB_QP_MAX_QP_RD_ATOMIC; - qp_attr->timeout = cm_id_priv->local_ack_timeout; - qp_attr->retry_cnt = cm_id_priv->retry_count; - qp_attr->rnr_retry = cm_id_priv->rnr_retry_count; + *qp_attr_mask = IB_QP_STATE | IB_QP_SQ_PSN; qp_attr->sq_psn = be32_to_cpu(cm_id_priv->sq_psn); - qp_attr->max_rd_atomic = cm_id_priv->initiator_depth; + if (cm_id_priv->qp_type == IB_QPT_RC) { + *qp_attr_mask |= IB_QP_TIMEOUT | IB_QP_RETRY_CNT | + IB_QP_RNR_RETRY | + IB_QP_MAX_QP_RD_ATOMIC; + qp_attr->timeout = cm_id_priv->local_ack_timeout; + qp_attr->retry_cnt = cm_id_priv->retry_count; + qp_attr->rnr_retry = cm_id_priv->rnr_retry_count; + qp_attr->max_rd_atomic = 
cm_id_priv->initiator_depth; + } if (cm_id_priv->alt_av.ah_attr.dlid) { *qp_attr_mask |= IB_QP_PATH_MIG_STATE; qp_attr->path_mig_state = IB_MIG_REARM; From rolandd at cisco.com Fri Oct 21 14:12:56 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 21 Oct 2005 14:12:56 -0700 Subject: [openib-general] TCP/IP connection service over IB In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020AC6@NT-SJCA-0751.brcm.ad.broadcom.com> (Caitlin Bestler's message of "Fri, 21 Oct 2005 14:06:44 -0700") References: <54AD0F12E08D1541B826BE97C98F99F1020AC6@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <524q7ad15z.fsf@cisco.com> Caitlin> It's basically the same as with TCP and UDP. It's a 16 Caitlin> bit number, and most people do not use the same port Caitlin> number to mean *different* things over the different IP Caitlin> transports. But, just to be clear, the port number spaces are disjoint. It's possible and valid to have one TCP socket bound to a given IP/port number, and another UDP socket bound to the same IP/port number. I do agree that assigned port numbers generally have the same meaning across all transports. For example, both TCP port 111 and UDP port 111 are the sunrpc portmapper. - R. From rolandd at cisco.com Fri Oct 21 14:58:04 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 21 Oct 2005 14:58:04 -0700 Subject: [openib-general] moving IBM eHCA Device Driver to openib.org In-Reply-To: (IBMEHCA DD's message of "Wed, 19 Oct 2005 16:58:46 +0200") References: Message-ID: <52r7aebkib.fsf@cisco.com> Here are a few quick notes based on spending a little while skimming through some of the ehca source code: ehca_asm.h: this stuff should be in include/asm and mostly already is: asm_sync_mem is just the standard mb() from . mftb() is get_tb() from so you just need to add prefetch_zero to include/asm-ppc64 ehca_classes.c: why have ehca_module_new? it seems to be used to allocate a single instance of a struct that should just be module-global variables. 
ehca_classes_pSeries.h: Why have EHCA_MEMPAGESIZE and EHCA_MEMPAGESIZE_MASK? Can they be different from PAGE_SIZE and PAGE_MASK? Get rid of typedefs of struct hcp_eq_handle -- just use structs directly in code. Can struct hcp_modify_qp_control_block declaration be made readable? ehca_common.h: why have typedef of ehca_redcode_t? Just use long directly. ntohd() seems to duplicate be64_to_cpu() except without proper __be64 annotation -- why is it needed? ehca_irq.c: in ehca_comp_event_callback(), what protects against the CQ being destroyed out from under you? ehca_kernel.h: ehca_sleep() and ehca_msleep() duplicate msleep_interruptible(). Why is assert() needed instead of just using BUG_ON()? Why have MIN() and MAX() instead of just using min()/max()? ehca_kv_to_g() looks rather horrible -- why are you working with kernel virtual addresses at all? ehca_kr_to_g() looks like it is a substitute for dma_map_single(). Why not use the correct DMA API instead? ehca_main.c: ehca_module_exit() -- I don't see where ehca_wq is destroyed. ehca_module_exit() -- what prevents ehca_poll_eq() from running after the module text is gone? You don't wait for the kthread to actually stop, and it might be in the middle of the sleep. From chai.15 at osu.edu Fri Oct 21 16:43:51 2005 From: chai.15 at osu.edu (LEI CHAI) Date: Fri, 21 Oct 2005 19:43:51 -0400 Subject: [openib-general] uDAPL open HCA problem Message-ID: <6f14f72b91.72b916f14f@osu.edu> Hi, I'm from the same lab as Sayantan. Thanks for your suggestion. Currently we could not reproduce the problem, however, we meet another problem. When I try to tear down a connection between two nodes I often get some messages like this: [ 0] 005e0406 [ 4] 00000000 [ 8] 00000000 [ c] 00000000 [10] 05f90000 [14] 00000000 [18] 00000008 [1c] fe100000 The program can run and exit though. After using the debug option as you suggested I got the following log.
It starts from the point where I start to free the resources and disconnect the nodes: dapl_lmr_free (0x76f3b0) dapl_lmr_free (0x76f4e0) dapl_lmr_free (0x76f650) dapli_cq_event_cb(0x5c40c0) dapli_cm_event() dapli_cm_event: EVENT=0x7 ID=0x76fa70 CTX=0x76fb00 passive_cb: conn 0x76fb00 id 7797360 event 7 dapli_async_event_cb(0x5c40c0) dapl_lmr_free (0x76fee0) dapl_lmr_free (0x7a9150) dapl_lmr_free (0x7a9280) dapl_lmr_free (0x7a93b0) dapl_lmr_free (0x7a94e0) dapl_lmr_free (0x7a9610) dapl_lmr_free (0x7a9740) dapl_lmr_free (0x7a9870) dapl_lmr_free (0x7a99a0) dapl_lmr_free (0x7a9ad0) dapl_ep_disconnect (0x69b070, 1) disconnect(ep 0x69b070, conn 0x76f7a0, id 7797184 flags 1) dapl_ep_disconnect () returns 0x0 dapli_cq_event_cb(0x5c4410) dapli_cm_event() dapli_cm_event: EVENT=0x8 ID=0x76f9c0 CTX=0x76f7a0 active_cb: conn 0x76f7a0 id 7797184 event 8 dapli_async_event_cb(0x5c4410) dapl_evd_wait (0x5c89b0, -1, 1, 0x7fffffebf7c0, 0x7fffffebf7bc) dapl_evd_wait: EVD 0x5c89b0, CQ (nil) dapl_evd_wait (0x5c89b0, -1, 1, 0x7fffff9a9b50, 0x7fffff9a9b4c) dapl_evd_wait: EVD 0x5c89b0, CQ (nil) dapli_cq_event_cb(0x5c4410) dapli_cm_event() dapli_cm_event: EVENT=0x9 ID=0x76f9c0 CTX=0x76f7a0 active_cb: conn 0x76f7a0 id 7797184 event 9 --> dapl_evd_connection_callback: ctxt: 0x69b070 event: 1 cm_handle 0x76f7a0 dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005 disconnect(ep 0x69b070, conn 0x76f7a0, id 7797184 flags 0) destroy_cm_id: conn 0x76f7a0 id 7797184 modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406 dapli_evd_post_event: Called with event # 4005 dapl_evd_connection_callback () returns active_cb: DESTROY conn 0x76f7a0 id 7797184 dapli_async_event_cb(0x5c4410) dapl_evd_wait () returns 0x0 dapli_cq_event_cb(0x5c40c0) dapli_cm_event() dapli_cm_event: EVENT=0x9 ID=0x76fa70 CTX=0x76fb00 passive_cb: conn 0x76fb00 id 7797360 event 9 --> dapl_cr_callback! 
context: 0x5c8b20 event: 1 cm_handle 0x76fb00 dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005 disconnect(ep 0x69b070, conn 0x76fb00, id 7797360 flags 0) destroy_cm_id: conn 0x76fb00 id 7797360 modify_qp: qp 0x69b3a0, state 6 qp_num 0x4e0406 dapli_evd_post_event: Called with event # 4005 dapl_evd_wait () returns 0x0 dapli_async_event_cb(0x5c40c0) dapli_cq_event_cb(0x5c40c0) dapli_cm_event() dapli_cm_event: EVENT=0x7 ID=0x76f910 CTX=0x7a9120 passive_cb: conn 0x7a9120 id 7797008 event 7 dapli_async_event_cb(0x5c40c0) dapl_ep_disconnect (0x69bd20, 1) disconnect(ep 0x69bd20, conn 0x76fa50, id 7797872 flags 1) dapl_ep_disconnect () returns 0x0 dapl_evd_wait (0x5ccb00, -1, 1, 0x7fffffebf7c0, 0x7fffffebf7b8) dapl_evd_wait: EVD 0x5ccb00, CQ (nil) dapli_cq_event_cb(0x5c4410) dapli_cm_event() dapli_cm_event: EVENT=0x8 ID=0x76fc70 CTX=0x76fa50 active_cb: conn 0x76fa50 id 7797872 event 8 dapli_async_event_cb(0x5c4410) dapl_evd_wait (0x5ccb00, -1, 1, 0x7fffff9a9b50, 0x7fffff9a9b48) dapl_evd_wait: EVD 0x5ccb00, CQ (nil) dapli_cq_event_cb(0x5c4410) dapli_cm_event() dapli_cm_event: EVENT=0x9 ID=0x76fc70 CTX=0x76fa50 active_cb: conn 0x76fa50 id 7797872 event 9 --> dapl_evd_connection_callback: ctxt: 0x69bd20 event: 1 cm_handle 0x76fa50 dapli_cq_event_cb(0x5c40c0) dapli_cm_event() dapli_cm_event: EVENT=0x9 ID=0x76f910 CTX=0x7a9120 passive_cb: conn 0x7a9120 id 7797008 event 9 --> dapl_cr_callback! 
context: 0x5ccc70 event: 1 cm_handle 0x7a9120 dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005 disconnect(ep 0x69bd20, conn 0x7a9120, id 7797008 flags 0) destroy_cm_id: conn 0x7a9120 id 7797008 modify_qp: qp 0x76f220, state 6 qp_num 0x4e0407 dapli_evd_post_event: Called with event # 4005 dapl_evd_wait () returns 0x0 dapl_ep_free (0x69b070) dapl_ep_disconnect (0x69b070, 0) dapl_ep_disconnect () returns 0x0 dapl_ep_free: Free EP: b, ep 0x69b070 qp_state 1 qp_handle 69b3a0 qp_free: ep_ptr 0x69b070 qp 0x69b3a0 modify_qp: qp 0x69b3a0, state 6 qp_num 0x4e0406 dapli_async_event_cb(0x5c40c0) dapls_ib_get_dat_event: event(passive) ib=0x1 dat=0x4005 disconnect(ep 0x69bd20, conn 0x76fa50, id 7797872 flags 0) destroy_cm_id: conn 0x76fa50 id 7797872 modify_qp: qp 0x76f220, state 6 qp_num 0x2c0407 dapli_evd_post_event: Called with event # 4005 dapl_evd_connection_callback () returns active_cb: DESTROY conn 0x76fa50 id 7797872 dapli_async_event_cb(0x5c4410) dapl_evd_wait () returns 0x0 dapl_ep_free (0x69b070) dapl_ep_disconnect (0x69b070, 0) dapl_ep_disconnect () returns 0x0 dapl_ep_free: Free EP: b, ep 0x69b070 qp_state 1 qp_handle 69b3a0 qp_free: ep_ptr 0x69b070 qp 0x69b3a0 modify_qp: qp 0x69b3a0, state 6 qp_num 0x2c0406 >>> dapl_psp_free 0x5c8b20 >>> dapl_psp_free: state 1 cr_list_count 0 remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5c8b20 cm_ptr 0x5c8be0) destroy_cm_id: conn 0x5c8be0 id 6065664 dapl_evd_free (0x5c89b0) dapl_evd_free () returns 0x0 dapl_evd_free (0x5c8840) dapl_evd_free () returns 0x0 dapl_evd_free (0x5c85e0) [ 0] 002c0406 [ 4] 00000000 [ 8] 00000000 [ c] 00000000 [10] 05f90000 [14] 00000000 [18] 00000008 [1c] fe100000 >>> dapl_psp_free 0x5c8b20 >>> dapl_psp_free: state 1 cr_list_count 0 remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5c8b20 cm_ptr 0x5c8be0) destroy_cm_id: conn 0x5c8be0 id 6065664 dapl_evd_free (0x5c89b0) dapl_evd_free () returns 0x0 dapl_evd_free (0x5c8840) dapl_evd_free () returns 0x0 cq_object_destroy: wait_obj=0x5c8750 dapl_evd_free () 
returns 0x0 dapl_ep_free (0x69bd20) dapl_ep_disconnect (0x69bd20, 0) dapl_ep_disconnect () returns 0x0 dapl_ep_free: Free EP: b, ep 0x69bd20 qp_state 1 qp_handle 76f220 qp_free: ep_ptr 0x69bd20 qp 0x76f220 modify_qp: qp 0x76f220, state 6 qp_num 0x2c0407 >>> dapl_psp_free 0x5ccc70 >>> dapl_psp_free: state 1 cr_list_count 0 remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5ccc70 cm_ptr 0x5ccd30) destroy_cm_id: conn 0x5ccd30 id 6082384 dapl_evd_free (0x5ccb00) dapl_evd_free () returns 0x0 dapl_evd_free (0x5cc990) dapl_evd_free () returns 0x0 dapl_evd_free (0x5cc730) cq_object_destroy: wait_obj=0x5cc8a0 dapl_evd_free () returns 0x0 dapl_pz_free (0x5c8510) dapl_ia_query (0x5c8000, (nil), 0x0, (nil), 0x3ffffff, 0x7fffff9a9900) dapl_ia_query () returns 0x0 dapl_ia_close (0x5c8000, 1) setup_async_cb: ia 0x5c8000 type 0 hdl (nil) cb (nil) ctx (nil) setup_async_cb: ia 0x5c8000 type 1 hdl (nil) cb (nil) ctx (nil) setup_async_cb: ia 0x5c8000 type 3 hdl (nil) cb (nil) ctx (nil) dapl_evd_free (0x5c80f0) dapl_evd_free () returns 0x0 close_hca: 0x5c4390->0x5ca3b0 ib_thread_destroy: wait on hca 0x2 destroy dapl_evd_free (0x5c85e0) cq_object_destroy: wait_obj=0x5c8750 dapl_evd_free () returns 0x0 dapl_ep_free (0x69bd20) dapl_ep_disconnect (0x69bd20, 0) dapl_ep_disconnect () returns 0x0 dapl_ep_free: Free EP: b, ep 0x69bd20 qp_state 1 qp_handle 76f220 qp_free: ep_ptr 0x69bd20 qp 0x76f220 modify_qp: qp 0x76f220, state 6 qp_num 0x4e0407 >>> dapl_psp_free 0x5ccc70 >>> dapl_psp_free: state 1 cr_list_count 0 remove_listener(ia_ptr 0x5c8000 sp_ptr 0x5ccc70 cm_ptr 0x5ccd30) destroy_cm_id: conn 0x5ccd30 id 6082384 dapl_evd_free (0x5ccb00) dapl_evd_free () returns 0x0 dapl_evd_free (0x5cc990) dapl_evd_free () returns 0x0 dapl_evd_free (0x5cc730) cq_object_destroy: wait_obj=0x5cc8a0 dapl_evd_free () returns 0x0 dapl_pz_free (0x5c8510) dapl_ia_query (0x5c8000, (nil), 0x0, (nil), 0x3ffffff, 0x7fffffebf570) dapl_ia_query () returns 0x0 dapl_ia_close (0x5c8000, 1) setup_async_cb: ia 0x5c8000 type 0 hdl 
(nil) cb (nil) ctx (nil) setup_async_cb: ia 0x5c8000 type 1 hdl (nil) cb (nil) ctx (nil) setup_async_cb: ia 0x5c8000 type 3 hdl (nil) cb (nil) ctx (nil) dapl_evd_free (0x5c80f0) dapl_evd_free () returns 0x0 close_hca: 0x5c4040->0x5ca3b0 DAPL: Stopped (dapl_fini) dapl_ib_release: ib_thread_destroy(8512) ib_thread_destroy: waiting for ib_thread ib_thread(8512) EXIT DAPL: Stopped (dapl_fini) dapl_ib_release: ib_thread_destroy(8081) ib_thread_destroy: waiting for ib_thread ib_thread(8081) EXIT ib_thread_destroy(8512) exit ib_thread_destroy(8081) exit Any suggestions would be highly appreciated. Thanks. Lei ----- Original Message ----- From: Arlin Davis Date: Friday, October 21, 2005 2:59 pm Subject: Re: [openib-general] uDAPL open HCA problem > Sayantan Sur wrote: > > >Hello, > > > >I have udapl over Gen2 setup on our cluster and am able to run udapl > >programs. However, sometimes I get this error (after a few runs > of the > >same program): > > > > open_hca: ERR ib_at_ips_by_gid for mthca0 > >dapls_ib_open_hca failed 40000 > > > > > > uDAPL uses uAT to get the IP address using the GID (ATS records > via SA) > of the local device/port. The SA query for this record is failing > for > some reason. Did your SM bounce during this time? Did you bounce > or > reconfigure the IPoIB network device? > > You can set "env DAPL_DBG_TYPE=0xffff" for more information. > > -arlin > > >The machine is a AMD Opteron (Tyan S2895), with Mellanox MemFree > cards>(fw ver 5.1.0). 
> > > >lsmod on my machine shows this: > > > >[surs at ro0:~] lsmod | grep ^ib > >ib_ipoib 48008 0 > >ib_uat 14840 0 > >ib_at 25696 1 ib_uat > >ib_sa 17804 2 ib_ipoib,ib_at > >ib_ucm 22280 0 > >ib_cm 37744 1 ib_ucm > >ib_uverbs 35992 0 > >ib_umad 18208 0 > >ib_mthca 122656 0 > >ib_mad 44072 4 ib_sa,ib_cm,ib_umad,ib_mthca > >ib_core 56192 8 > >ib_ipoib,ib_sa,ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad > > > >My infiniband devices are (created by hand): > > > >[surs at ro0:~] ls -l /dev/infiniband/ > >total 0 > >crw-rw-rw- 1 root root 231, 191 2005-10-20 21:13 uat > >crw-rw-rw- 1 root root 231, 224 2005-10-20 21:12 ucm0 > >crwxrwxrwx 1 root root 231, 192 2005-09-21 04:37 umad0 > >crwxrwxrwx 1 root root 231, 192 2005-09-16 19:29 uverbs0 > >crwxrwxrwx 1 root root 231, 192 2005-09-16 19:29 uverbs1 > > > > > >I'd really appreciate if someone could help me understand what > might be > >going wrong. > > > >Thanks, > >Sayantan. > > > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From rolandd at cisco.com Fri Oct 21 16:48:31 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 21 Oct 2005 16:48:31 -0700 Subject: [openib-general] uDAPL open HCA problem In-Reply-To: <6f14f72b91.72b916f14f@osu.edu> (LEI CHAI's message of "Fri, 21 Oct 2005 19:43:51 -0400") References: <6f14f72b91.72b916f14f@osu.edu> Message-ID: <521x2ebfe8.fsf@cisco.com> LEI> Hi, I'm from the same lab as Sayantan. Thanks for your LEI> suggestion. Currently we could not reproduce the problem, LEI> however, we meet another problem. 
When I try to tear down a LEI> connection between two nodes I often get some messages like LEI> this: LEI> [ 0] 005e0406 [ 4] 00000000 [ 8] 00000000 [ c] 00000000 LEI> [10] 05f90000 [14] 00000000 [18] 00000008 [1c] fe100000 That's OK, it's just showing that you polled a "work request flushed" status from a completion queue. The latest version of libmthca should no longer print these messages. - R. From chai.15 at osu.edu Fri Oct 21 18:14:16 2005 From: chai.15 at osu.edu (LEI CHAI) Date: Fri, 21 Oct 2005 21:14:16 -0400 Subject: [openib-general] uDAPL open HCA problem Message-ID: <87e828301b.8301b87e82@osu.edu> Hi, Thank you very much for your reply. Now the open HCA problem comes back :-( Here is the log message: [chail at ro0] mpiexec -n 2 ./a.out DAPL: NOT Setting Loopback dapl_ib_init: ib_thread_init(12016) dapl_ia_open (ib0, 8, 0x7ffffff28668, 0xd9da48) open_hca: mthca0 - 0xdb3390 ib_thread(12016,0x40200960): ENTER: pipe 8 at 4 open_hca: Found dev mthca0 0002c902004002e8 open_hca: GID subnet fe80000000000000 id 0002c902004002e9 ips_by_gid: RET 0 at_rec 0x7ffffff283d0 -> id 2861 dapli_at_event_cb() ip_comp_handler: rec 0x7ffffff283d0 ->id 2861 id 2861 num -22 3afa6000 ip_comp_handler: resolution err -22 retry 1 ip_comp_handler: ips_by_gid 0 rec 0x7ffffff283d0->id 2862 dapli_at_event_cb() ip_comp_handler: rec 0x7ffffff283d0 ->id 2862 id 2862 num -22 0 ip_comp_handler: resolution err -22 retry 2 ip_comp_handler: ips_by_gid 0 rec 0x7ffffff283d0->id 2863 dapli_at_event_cb() ip_comp_handler: rec 0x7ffffff283d0 ->id 2863 id 2863 num -22 0 ip_comp_handler: resolution err -22 retry 3 ip_comp_handler: ips_by_gid 0 rec 0x7ffffff283d0->id 2864 dapli_at_event_cb() ip_comp_handler: rec 0x7ffffff283d0 ->id 2864 id 2864 num -22 0 ip_comp_handler: resolution err -22 retry 4 ip_comp_handler: ERR: at_rec 0x7ffffff283d0, id 2864 num -22 open_hca: ERR ib_at_ips_by_gid for mthca0 dapls_ib_open_hca failed 40000 dapl_ia_open () returns 0x40000 DAPL: Stopped (dapl_fini) dapl_ib_release: 
ib_thread_destroy(12016) ib_thread_destroy: waiting for ib_thread ib_thread(12016) EXIT [rdma_udapl_priv.c:640] error(262144): Cannot open IA DAPL: NOT Setting Loopback dapl_ib_init: ib_thread_init(11337) dapl_ia_open (ib0, 8, 0x7fffffa8d618, 0xd9da48) open_hca: mthca0 - 0xdb3390 ib_thread(11337,0x40800960): ENTER: pipe 8 at 4 open_hca: Found dev mthca0 0002c90200400314 open_hca: GID subnet fe80000000000000 id 0002c90200400315 ips_by_gid: RET 0 at_rec 0x7fffffa8d380 -> id 4627 dapli_at_event_cb() ip_comp_handler: rec 0x7fffffa8d380 ->id 4627 id 4627 num -22 3c66c000 ip_comp_handler: resolution err -22 retry 1 ip_comp_handler: ips_by_gid 0 rec 0x7fffffa8d380->id 4628 dapli_at_event_cb() ip_comp_handler: rec 0x7fffffa8d380 ->id 4628 id 4628 num -22 0 ip_comp_handler: resolution err -22 retry 2 [rdma_udapl_priv.c:640] error(262144): Cannot open IA ip_comp_handler: ips_by_gid 0 rec 0x7fffffa8d380->id 4629 dapli_at_event_cb() ip_comp_handler: rec 0x7fffffa8d380 ->id 4629 id 4629 num -22 0 ip_comp_handler: resolution err -22 retry 3 ip_comp_handler: ips_by_gid 0 rec 0x7fffffa8d380->id 4630 dapli_at_event_cb() ip_comp_handler: rec 0x7fffffa8d380 ->id 4630 id 4630 num -22 0 ip_comp_handler: resolution err -22 retry 4 ip_comp_handler: ERR: at_rec 0x7fffffa8d380, id 4630 num -22 open_hca: ERR ib_at_ips_by_gid for mthca0 dapls_ib_open_hca failed 40000 dapl_ia_open () returns 0x40000 DAPL: Stopped (dapl_fini) dapl_ib_release: ib_thread_destroy(11337) ib_thread_destroy: waiting for ib_thread ib_thread(11337) EXIT ib_thread_destroy(12016) exit rank 0 in job 421 ro0_33361 caused collective abort of all ranks exit status of rank 0: return code 1 Any idea what is going on? Thanks. Lei ----- Original Message ----- From: Roland Dreier Date: Friday, October 21, 2005 7:48 pm Subject: Re: [openib-general] uDAPL open HCA problem > LEI> Hi, I'm from the same lab as Sayantan. Thanks for your > LEI> suggestion. 
Currently we could not reproduce the problem, > LEI> however, we meet another problem. When I try to tear > down a > LEI> connection between two nodes I often get some messages like > LEI> this: > > LEI> [ 0] 005e0406 [ 4] 00000000 [ 8] 00000000 [ c] 00000000 > LEI> [10] 05f90000 [14] 00000000 [18] 00000008 [1c] fe100000 > > That's OK, it's just showing that you polled a "work request flushed" > status from a completion queue. The latest version of libmthca should > no longer print these messages. > > - R. > From rolandd at cisco.com Fri Oct 21 20:09:24 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 21 Oct 2005 20:09:24 -0700 Subject: [openib-general] uDAPL open HCA problem In-Reply-To: <87e828301b.8301b87e82@osu.edu> (LEI CHAI's message of "Fri, 21 Oct 2005 21:14:16 -0400") References: <87e828301b.8301b87e82@osu.edu> Message-ID: <52sluu9riz.fsf@cisco.com> LEI> Any idea what is going on? Nope, sorry. Looks like something is going wrong inside ib_at. - R.
From sinate at yahoo.com Sat Oct 22 00:57:44 2005 From: sinate at yahoo.com (Steven Wooding) Date: Sat, 22 Oct 2005 08:57:44 +0100 (BST) Subject: [openib-general] Support for UC connections using the CM API? In-Reply-To: Message-ID: <20051022075744.42416.qmail@web32507.mail.mud.yahoo.com> --- Sean Hefty wrote: > > Here's a patch (edited by hand, so let me know if > there's any issue > applying it) that should permit UC connections over > the CM. I was able to > test this using cmpost. Thanks for the quick response. I'll try this patch out next week (Monday), but it looks good. Regards, Steve. From tom at ipperformance.com Sat Oct 22 07:58:28 2005 From: tom at ipperformance.com (Tom Tucker) Date: Sat, 22 Oct 2005 09:58:28 -0500 Subject: [openib-general] TCP/IP connection service over IB In-Reply-To: <4359575B.5020302@ichips.intel.com> References: <43591D07.5050709@ichips.intel.com> <43594159.3000202@ichips.intel.com> <43594538.7030806@ichips.intel.com> <1129928894.4255.0.camel@trinity.austin.ammasso.com> <4359575B.5020302@ichips.intel.com> Message-ID: <1129993109.21779.77.camel@mail.es335.com> On Fri, 2005-10-21 at 14:02 -0700, Sean Hefty wrote: > Tom Tucker wrote: > > I'm thinking that for iWARP, there won't be anything in the Private Data > > at all except consumer private data. Is that your expectation?
> > I believe so. This is only trying to define a TCP/IP connection service over > IB. I'm assuming that there's no need to define something similar for iWarp. Just wanted to be sure. > > Does SCTP share the same port space as TCP? Is any mapping between them required? No, different port spaces. I don't think there is a need for a mapping. I'm sure Caitlin will pipe up if there is. > > - Sean From rolandd at cisco.com Sat Oct 22 10:03:53 2005 From: rolandd at cisco.com (Roland Dreier) Date: Sat, 22 Oct 2005 10:03:53 -0700 Subject: [openib-general] [git pull] InfiniBand fix for 2.6.14-rc5 Message-ID: <52fyqta3gm.fsf@cisco.com> Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get the following change, which fixes a subtle bug biting some users: [IB] mthca: Always re-arm EQs in mthca_tavor_interrupt() We should always re-arm an event queue's interrupt in mthca_tavor_interrupt() if the corresponding bit is set in the event cause register (ECR), even if we didn't find any entries in the EQ. If we don't, then there's a window where we miss an EQ entry and then get stuck because we don't get another EQ event. 
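The re-arm rule described in this commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the driver code: `struct eq_sim`, `eq_sim_poll()`, `handle_interrupt()`, `NUM_EQ` and the counters are hypothetical names invented for the illustration, and the ECR is modelled as a plain bitmask argument instead of a device register.

```c
#include <stdint.h>

#define NUM_EQ 4 /* hypothetical; stands in for the driver's EQ count */

/* Simplified model of one event queue (not the real struct mthca_eq). */
struct eq_sim {
	uint32_t eqn_mask; /* this EQ's bit in the event cause register */
	int pending;       /* entries currently waiting in the queue */
	int ci_updates;    /* times the consumer index was written back */
	int rearm_count;   /* times the EQ interrupt was re-armed */
};

/* Drain the queue; returns nonzero if any entry was consumed. */
static int eq_sim_poll(struct eq_sim *eq)
{
	int had_entries = eq->pending > 0;

	eq->pending = 0;
	return had_entries;
}

/*
 * Returns 0 if no cause bit is set (interrupt not ours), 1 otherwise.
 * Every EQ whose cause bit is set is re-armed, even when polling it
 * found no entries. Skipping the re-arm in that case is the window
 * the commit message describes.
 */
int handle_interrupt(struct eq_sim eqs[NUM_EQ], uint32_t ecr)
{
	int i;

	if (!ecr)
		return 0;

	for (i = 0; i < NUM_EQ; ++i) {
		if (ecr & eqs[i].eqn_mask) {
			if (eq_sim_poll(&eqs[i]))
				eqs[i].ci_updates++; /* only if work was found */
			eqs[i].rearm_count++;        /* unconditional re-arm */
		}
	}
	return 1;
}
```

In this model an EQ that is flagged in the cause mask but turns out to be empty still has its `rearm_count` incremented, while its `ci_updates` stays unchanged.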
Signed-off-by: Roland Dreier drivers/infiniband/hw/mthca/mthca_eq.c | 23 ++++++++++++----------- 1 files changed, 12 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c --- a/drivers/infiniband/hw/mthca/mthca_eq.c +++ b/drivers/infiniband/hw/mthca/mthca_eq.c @@ -396,20 +396,21 @@ static irqreturn_t mthca_tavor_interrupt writel(dev->eq_table.clr_mask, dev->eq_table.clr_int); ecr = readl(dev->eq_regs.tavor.ecr_base + 4); - if (ecr) { - writel(ecr, dev->eq_regs.tavor.ecr_base + - MTHCA_ECR_CLR_BASE - MTHCA_ECR_BASE + 4); - - for (i = 0; i < MTHCA_NUM_EQ; ++i) - if (ecr & dev->eq_table.eq[i].eqn_mask && - mthca_eq_int(dev, &dev->eq_table.eq[i])) { + if (!ecr) + return IRQ_NONE; + + writel(ecr, dev->eq_regs.tavor.ecr_base + + MTHCA_ECR_CLR_BASE - MTHCA_ECR_BASE + 4); + + for (i = 0; i < MTHCA_NUM_EQ; ++i) + if (ecr & dev->eq_table.eq[i].eqn_mask) { + if (mthca_eq_int(dev, &dev->eq_table.eq[i])) tavor_set_eq_ci(dev, &dev->eq_table.eq[i], dev->eq_table.eq[i].cons_index); - tavor_eq_req_not(dev, dev->eq_table.eq[i].eqn); - } - } + tavor_eq_req_not(dev, dev->eq_table.eq[i].eqn); + } - return IRQ_RETVAL(ecr); + return IRQ_HANDLED; } static irqreturn_t mthca_tavor_msi_x_interrupt(int irq, void *eq_ptr, From liran at mellanox.co.il Sun Oct 23 00:01:16 2005 From: liran at mellanox.co.il (Liran Sorani) Date: Sun, 23 Oct 2005 09:01:16 +0200 Subject: [openib-general] InfiniBand Test Project (IBTP) - Update Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E35AB46D@mtlexch01.mtl.com> Currently only a minor bug fix in osmt_service flow , and cosmetics changes to fit WinIb stack . -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thursday, October 20, 2005 1:01 PM To: Liran Sorani Cc: openib-general at openib.org Subject: RE: [openib-general] InfiniBand Test Project (IBTP) - Update On Thu, 2005-10-20 at 03:49, Liran Sorani wrote: > Hi , Hal . 
> The Linux & WinIB are the same , except for several cosmetic changes . I was referring to the (differences in the) Linux one in ibtp and the Linux one under gen2/trunk. > Regarding Makefile.in , it's an outcome of autogen , I'll remove it . Thanks. -- Hal > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, October 19, 2005 10:25 PM > To: Liran Sorani > Cc: openib-general at openib.org > Subject: Re: [openib-general] InfiniBand Test Project (IBTP) - Update > > > On Wed, 2005-10-19 at 15:33, Liran Sorani wrote: > > Hi , > > We've updated IBTP tree with Osmtest sources both on ibal (WinIB) and > > Gen2 stacks : > > https://openib.org/svn/trunk/contrib/mellanox/ibtp/ibal/ulp/opensm/user/osmtest > > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management/osm/osmtest > > > > Osmtest is the main verification tool for OpenSM , include various SA > > (Good / Bad) flows. > > Attached to each directory a short README file for setup and usage > > information. > > How is the Linux one different from osmtest in the trunk ? > > Also, (nit): > I think > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management/osm/osmtest/Makefile.in > is a generated file and should be removed. > > -- Hal > > > > Liran Sorani > > > Mellanox Technologies LTD. > > > mailto:liran at mellanox.co.il > > > Phone: +972(4)9097200 Ext: 214 > > > Israel, Yokneam P.O.B 586 ZIP 20692 > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From info at mjhgdr.com Sun Oct 23 00:42:30 2005 From: info at mjhgdr.com (info at mjhgdr.com) Date: 23 Oct 2005 16:42:30 +0900 Subject: [openib-general] $BD>%"%I65$($^$9!#(B Message-ID: <20051023074230.3869.qmail@mail.mjhgdr.com> $B!V$O$8$a$^$7$F!#%^%j$C$F$$$$$^$9!#$$$-$J$j$N%a!<%k$4$a$s$J(B $B$5$$!#(B $BAjCL$K>h$C$FM_$7$/$F!"%a!<%k$7$F$_$^$7$?!#CK$N?M$G$9$h$M!)(B $B!!%a!<%k=P$7$?;~$+$iF,$NCf$O$3$N=P2q$$$N;v$G0l?'@w$^$C$A$c(B $B$C$F$$$^$9!#$=$A$i$O;d$N;v$I$&;W$$$^$9$+!)(B($B6[D%46!*!)!K$:(B $B$C$HG:$s$G$$$^$7$?!#7k6I<+J,$+$i$3$&$7$F%a!<%kAw$i$J$$$H0l(B $BJb$b?J$^$J$/>!%"%I65$($^$9(B $B!#7h$7$F5?$C$F$O$$$^$;$s$,!"=c?h$J=P2q$$$K$7$?$/$F!"$A$g$C(B $B$H?5=E$K$J$C$F$$$^$9!#(B $B!!2q$C$F$/$l$k$J$i!";d$NCf$G?.MQ$G$-$k?M$@$H;W$$$^$9$N$G!"(B $B46$K8@$$$^$9$H!"=c?h$J=P2q$$$rCg2p$7$F$*$j$^$9(B $B!#%W%i%$%P%7!pJs$N3NG'$OL5NAEPO?$+$i"-(B http://www.otakkujp.net?imasugu $B%3%A%i$N=w at -$@$1$G$J$/!"L5NA$G2q0w$K$J$i$l$?J}$X$N%"%I8x3+(B $B0[@->R2p$bKhF|9T$C$F$*$j$^$9$N$G!"Hs>o$KJXMx$G$9!#(B $B5qH](B iranai at otakkujp.net From loneill at the-book-shop.net Sun Oct 23 03:42:51 2005 From: loneill at the-book-shop.net (Angelo Sanders) Date: Sun, 23 Oct 2005 12:42:51 +0200 Subject: [openib-general] Stop throwing away your money Message-ID: <000001c5d7ce$53daa500$0100007f@localhost> Finally the real thing- no more ripoffs! Enhancment Patches are hot right now, VERY hot! Unfortunately, most are cheap imitiations and do very little to increase your size and stamina. Well this is the real thing, not an imitation! One of the very originals, the absolutely strongest Patch available, anywhere! A top team of British scientists and medical doctors have worked to develop the state-of-the-art Pen1s Enlargment Patch delivery system which automatically increases pen1s size up to 3-4 full inches. The patches are the easiest and most effective way to increase your size. 
From jlentini at netapp.com Sun Oct 23 18:04:39 2005 From: jlentini at netapp.com (James Lentini) Date: Sun, 23 Oct 2005 21:04:39 -0400 (EDT) Subject: [openib-general] uDAPL open HCA problem In-Reply-To: <87e828301b.8301b87e82@osu.edu> References: <87e828301b.8301b87e82@osu.edu> Message-ID: On Fri, 21 Oct 2005, LEI CHAI wrote: > ips_by_gid: RET 0 at_rec 0x7fffffa8d380 -> id 4627 > dapli_at_event_cb() > ip_comp_handler: rec 0x7fffffa8d380 ->id 4627 id 4627 num -22 3c66c000 > ip_comp_handler: resolution err -22 retry 1 > ip_comp_handler: ips_by_gid 0 rec 0x7fffffa8d380->id 4628 > dapli_at_event_cb() > ip_comp_handler: rec 0x7fffffa8d380 ->id 4628 id 4628 num -22 0 > ip_comp_handler: resolution err -22 retry 2 > [rdma_udapl_priv.c:640] error(262144): Cannot open IA > ip_comp_handler: ips_by_gid 0 rec 0x7fffffa8d380->id 4629 > dapli_at_event_cb() > ip_comp_handler: rec 0x7fffffa8d380 ->id 4629 id 4629 num -22 0 > ip_comp_handler: resolution err -22 retry 3 > ip_comp_handler: ips_by_gid 0 rec 0x7fffffa8d380->id 4630 > dapli_at_event_cb() > ip_comp_handler: rec 0x7fffffa8d380 ->id 4630 id 4630 num -22 0 > ip_comp_handler: resolution err -22 retry 4 > ip_comp_handler: ERR: at_rec 0x7fffffa8d380, id 4630 num -22 > open_hca: ERR ib_at_ips_by_gid for mthca0 ib_at_ips_by_gid is failing again. Have you setup an IPoIB address?
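The question about the IPoIB address can also be checked from a small program. What follows is only a minimal sketch: `iface_has_ipv4` is a hypothetical helper, not part of uDAPL or the OpenIB stack, and the interface name `ib0` used in the note below is an assumption (the thread does not name the interface). If the -22 in the log is a negated errno, which is also an assumption, it would correspond to EINVAL on Linux.

```c
#include <ifaddrs.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not an OpenIB or uDAPL call).
 * Returns 1 if the named interface has at least one IPv4 address,
 * 0 if it has none (or does not exist), and -1 if the interface
 * list cannot be read. */
int iface_has_ipv4(const char *ifname)
{
    struct ifaddrs *list, *ifa;
    int found = 0;

    if (getifaddrs(&list) != 0)
        return -1;

    for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL)
            continue;               /* entry without an address */
        if (ifa->ifa_addr->sa_family == AF_INET &&
            strcmp(ifa->ifa_name, ifname) == 0) {
            found = 1;
            break;
        }
    }
    freeifaddrs(list);
    return found;
}
```

Calling `iface_has_ipv4("ib0")` before opening the IA would tell you whether an IPv4 address is configured on that interface. It does not show that address resolution itself will succeed.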
From yaronh at voltaire.com Sun Oct 23 22:50:33 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Mon, 24 Oct 2005 07:50:33 +0200 Subject: [dat-discussions] RE: [openib-general] Re: [swg] Re: private data... Message-ID: <35EA21F54A45CB47B879F21A91F4862F856BE9@taurus.voltaire.com> > -----Original Message----- > From: dat-discussions at yahoogroups.com [mailto:dat- > discussions at yahoogroups.com] On Behalf Of Kanevsky, Arkady > Sent: Thursday, October 20, 2005 5:07 PM > To: dat-discussions at yahoogroups.com; Sean Hefty > Cc: Lentini, James; swg at infinibandta.org; openib-general at openib.org > Subject: RE: [dat-discussions] RE: [openib-general] Re: [swg] Re: private > data... > > > Once this is defined ULP can decide on which Service ID(s) to listen. > Requestor can send conn req to a specific Service ID (IB specific) > or use higher level abstraction - TCP port. > CM may be capable to translate TCP port to Service ID based on ULP. > For example, iSER over IPoIB will be mapped to one Service ID and > native iSER over IB will be mapped to another. But this is not simple.
> On another hand every intermediate level protocol (SDP, IPoIB) can > do conversion. But this is also hard and is extension of existing > protocol. A small correction: there is no iSER over IPoIB, just iSER over native RDMA. There can be an iSCSI/TCP session running over IPoIB, but then it's a connectionless UD session (without a ServiceID). Also, the iSER spec defines that iSCSI/iSER has precedence over iSCSI/TCP. To add to the ongoing discussion, one of the major benefits of maintaining the TCP port numbers for RDMA protocols is the ability to leverage existing naming services and configuration mechanisms. E.g., NFS uses port mappers, and other protocols use DHCP, DNS, SLP, iSNS, well-defined numbers, or other mechanisms. This way the upper layers beyond the transport stay the same and don't care whether it's IB, iWarp, or even plain TCP. If we don't preserve a simple/linear port mapping, we probably need to reinvent name services for RDMA as well. Yaron From eitan at mellanox.co.il Mon Oct 24 00:08:55 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 24 Oct 2005 09:08:55 +0200 Subject: [openib-general] [RFC] OpenSM Interactive Console Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30693AA@mtlexch01.mtl.com> I would suggest using SNMP for the tasks below. The IETF IPoIB group has defined an SNMP MIB that can support the required functionality. Everything except dynamic partitioning (OpenSM does not have a partition manager at the moment) and forwarding of Performance Monitoring traps (which are generated by the PM) can be done through osmsh or through an SA client today. EZ Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O.
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Troy Benjegerdes [mailto:hozer at hozed.org] > Sent: Thursday, October 20, 2005 3:23 AM > To: Hal Rosenstock > Cc: openib-general at openib.org > Subject: Re: [openib-general] [RFC] OpenSM Interactive Console > > On Tue, Oct 18, 2005 at 03:10:31PM -0400, Hal Rosenstock wrote: > > Currently, OpenSM does not support an interactive console. There has > > been a desire to introduce the ability to change certain parameters (as > > well as display things) once OpenSM has started. This patch introduces > > the first most basic commands: help and loglevel. I am investgating > > adding smpriority to this. The console is invoked by specifying -console > > as an option on the opensm command line. > > > > If you have a request for a command you would like in the console, I > > would like to compile a list of these. > > > > Comments ? > > As well as a console, I'd like an API for some way for external programs > (say a cluster queue manager) to be able to query the SM (or the sm + some > helper library) for the following things: > > * Topology > * guid/lid/IPoIB address/switch port mappings > * link state > > Future neat things to do: > > * An interface to dynamically partition the fabric > * Register for notifications for certain events (excessive traffic > queueing, or error counts) > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From yael at mellanox.co.il Mon Oct 24 01:59:28 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 24 Oct 2005 10:59:28 +0200 Subject: [openib-general] [PATCH] Opensm - enabling updn algorithm Message-ID: <5zhdb7xpcf.fsf@mtl066.yok.mtl.com> Hi Hal, I noticed that somewhere in the merge the flags of the up-down algorithm were dropped from the main.c. Also - you've added a patch to enable stack_dump in debug mode, but this patch breaks our compilation under gen1. I've added a flag to check that the stack compiled is gen2. Thanks, Yael Signed-off-by: Yael Kalka Index: osm/opensm/main.c =================================================================== --- osm/opensm/main.c (revision 3843) +++ osm/opensm/main.c (working copy) @@ -452,7 +452,7 @@ main( boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:s:t:vVhorc"; + const char * const short_option = "i:f:ed:g:l:s:t:a:uvVhorc"; /* In the array below, the 2nd parameter specified the number @@ -498,7 +498,7 @@ main( exit(1); } -#ifdef _DEBUG_ +#if defined (_DEBUG_) && defined (OSM_VENDOR_INTF_OPENIB) enable_stack_dump(1); #endif From halr at voltaire.com Mon Oct 24 02:50:46 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Oct 2005 05:50:46 -0400 Subject: [openib-general] Re: [PATCH] Opensm - enabling updn algorithm In-Reply-To: <5zhdb7xpcf.fsf@mtl066.yok.mtl.com> References: <5zhdb7xpcf.fsf@mtl066.yok.mtl.com> Message-ID: <1130147445.4397.11595.camel@hal.voltaire.com> On Mon, 2005-10-24 at 04:59, Yael Kalka wrote: > Hi Hal, > > I noticed that somewhere in the merge the flags of the up-down algorithm > were dropped from the main.c. > Also - you've added a patch to enable stack_dump in debug mode, but > this patch breaks our compilation under gen1. I've added a flag to check > that the stack compiled is gen2. Thanks. Applied. 
-- Hal From mohitka at noida.hcltech.com Mon Oct 24 03:09:49 2005 From: mohitka at noida.hcltech.com (Mohit Katiyar, Noida) Date: Mon, 24 Oct 2005 15:39:49 +0530 Subject: [openib-general] iSER details Message-ID: <3E6BB9CEE261E2428AD25D0D553DC4970169CD8D@HSDLNTD1110010.noida.hcltech.com> Hi, Can anyone tell me where I can find the specification of the iSER protocol on InfiniBand? I could not find any document that provides a specification specific to InfiniBand; all the docs were about iWarp. I would appreciate any guidance. Thanks in advance, Mohit -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Mon Oct 24 03:31:04 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Oct 2005 06:31:04 -0400 Subject: [openib-general] iSER details In-Reply-To: <3E6BB9CEE261E2428AD25D0D553DC4970169CD8D@HSDLNTD1110010.noida.hcltech.com> References: <3E6BB9CEE261E2428AD25D0D553DC4970169CD8D@HSDLNTD1110010.noida.hcltech.com> Message-ID: <1130149864.4397.11891.camel@hal.voltaire.com> On Mon, 2005-10-24 at 06:09, Mohit Katiyar, Noida wrote: > Can anyone tell me where can I find the specifications of iSER > protocol on Infiniband. I could not find any document which provides > specification specially according to Infiniband, all the doc were on > iWarp. If anyone can guide me in this There are 2 relevant I-Ds: iSCSI Extensions for RDMA Specification http://www.ietf.org/internet-drafts/draft-ietf-ips-iser-05.txt and Generalization of iSER for InfiniBand and other Network Protocols http://www.ietf.org/internet-drafts/draft-hufferd-iser-ib-01.txt At the last IETF IPS WG meeting in Paris, the sense of the meeting was: Sense of room: Want to proceed towards applying these changes (after careful review and WG rough consensus) to the approved iSER draft so that there is one draft that is broadly applicable rather than the current iSER draft plus a draft that modifies that draft to broaden it.
so these changes for IB will be folded into an upcoming version of the iSER I-D. -- Hal From halr at voltaire.com Mon Oct 24 04:25:49 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Oct 2005 07:25:49 -0400 Subject: [openib-general] [RFC] OpenSM Interactive Console In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E30693AA@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E30693AA@mtlexch01.mtl.com> Message-ID: <1130153148.4397.12286.camel@hal.voltaire.com> On Mon, 2005-10-24 at 03:08, Eitan Zahavi wrote: > I would suggest to use SNMP for the tasks below. IETF IPoIB group has > defined an SNMP MIB that can support the required functionality below. The IETF SNMP MIBs are one way of presenting the information to the outside world. There are other possible management interfaces. The SNMP MIB instrumentation would need to use lower layer APIs to get this information out of the SM. > Everything but the dynamic partitioning (OpenSM does not have > partition manager to this moment) What Troy meant by partitioning is not necessarily IB partitioning. > and forwarding of Performance > Monitoring traps (which are generated by the PM) can be done through > osmsh or through SA client today. What PerfMgr are you referring to ? -- Hal > EZ > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Troy Benjegerdes [mailto:hozer at hozed.org] > > Sent: Thursday, October 20, 2005 3:23 AM > > To: Hal Rosenstock > > Cc: openib-general at openib.org > > Subject: Re: [openib-general] [RFC] OpenSM Interactive Console > > > > On Tue, Oct 18, 2005 at 03:10:31PM -0400, Hal Rosenstock wrote: > > > Currently, OpenSM does not support an interactive console. There > has > > > been a desire to introduce the ability to change certain > parameters (as > > > well as display things) once OpenSM has started. 
This patch > introduces > > > the first most basic commands: help and loglevel. I am > investgating > > > adding smpriority to this. The console is invoked by specifying > -console > > > as an option on the opensm command line. > > > > > > If you have a request for a command you would like in the console, > I > > > would like to compile a list of these. > > > > > > Comments ? > > > > As well as a console, I'd like an API for some way for external > programs > > (say a cluster queue manager) to be able to query the SM (or the sm > + some > > helper library) for the following things: > > > > * Topology > > * guid/lid/IPoIB address/switch port mappings > > * link state > > > > Future neat things to do: > > > > * An interface to dynamically partition the fabric > > * Register for notifications for certain events (excessive traffic > > queueing, or error counts) > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general >
From yaronh at voltaire.com Mon Oct 24 09:17:04 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Mon, 24 Oct 2005 18:17:04 +0200 Subject: [openib-general] iSER details Message-ID: <35EA21F54A45CB47B879F21A91F4862F856C86@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Hal Rosenstock > Sent: Monday, October 24, 2005 6:31 AM > To: Mohit Katiyar, Noida > Cc: openib-general at openib.org > Subject: Re: [openib-general] iSER details > > On Mon, 2005-10-24 at 06:09, Mohit Katiyar, Noida wrote: > > Can anyone tell me where can I find the specifications of iSER > > protocol on Infiniband. I could not find any document which provides > > specification specially according to Infiniband, all the doc were on > > iWarp.
If anyone can guide me in this > > There are 2 relevant I-Ds: > > iSCSI Extensions for RDMA Specification > http://www.ietf.org/internet-drafts/draft-ietf-ips-iser-05.txt > As Hal indicate the iSER-05 IETF draft already incorporates InfiniBand, and already passed last call status. There aren't many differenced between IB and iWarp, IBTA is also working on the IP address mapping over InfiniBand that will be leveraged by iSER/IB and NFS/RDMA, and few other clarifications/issues. Note one key difference in the IETF draft is that IB negotiate the Login over the RC connection, where in iWarp its over a TCP connection (and than transition to RDMA RC). Some more detailed material can be found on http://www.haifa.il.ibm.com/satran/ips/iSER-in-an-IB-network-V9.pdf It's a little old but many sections are still relevant Yaron
From mshefty at ichips.intel.com Mon Oct 24 10:58:07 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Oct 2005 10:58:07 -0700 Subject: [openib-general] device add/remove in userspace Message-ID: <435D20AF.7040609@ichips.intel.com> Is there a way for a userspace application to know if a device has been added or removed? I'm porting the CMA to userspace. One of the features of the kernel CMA is that listen requests can span all devices, with device add/remove handled automatically. I'd like to expose similar functionality in userspace, but it appears that the device list is built once when ibv_get_devices() is called.
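One crude way for an application to notice device changes on its own is to rescan a directory of device entries and compare snapshots. The sketch below is an illustration under stated assumptions, not something the thread prescribes: it assumes devices show up as entries under `/sys/class/infiniband`, all names (`take_snapshot`, `count_new`, `struct dev_snapshot`) are hypothetical, and polling is used only for simplicity, where an event-driven mechanism would avoid the repeated scans.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEVS 32
#define NAME_LEN 64

/* Hypothetical snapshot of the device names seen in one scan. */
struct dev_snapshot {
    int count;
    char names[MAX_DEVS][NAME_LEN];
};

/* Records the entry names found in dir (at most MAX_DEVS of them).
 * Returns 0 on success, -1 if dir cannot be opened. */
int take_snapshot(const char *dir, struct dev_snapshot *snap)
{
    DIR *d = opendir(dir);
    struct dirent *e;

    snap->count = 0;
    if (d == NULL)
        return -1;

    while (snap->count < MAX_DEVS && (e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        /* Names longer than NAME_LEN - 1 are truncated in this sketch. */
        snprintf(snap->names[snap->count], NAME_LEN, "%s", e->d_name);
        snap->count++;
    }
    closedir(d);
    return 0;
}

/* Returns 1 if name is present in the snapshot, 0 otherwise. */
static int snapshot_has(const struct dev_snapshot *snap, const char *name)
{
    int i;

    for (i = 0; i < snap->count; i++)
        if (strcmp(snap->names[i], name) == 0)
            return 1;
    return 0;
}

/* Counts names present in 'now' but missing from 'before'.
 * Swapping the arguments counts removed devices instead. */
int count_new(const struct dev_snapshot *before,
              const struct dev_snapshot *now)
{
    int i, n = 0;

    for (i = 0; i < now->count; i++)
        if (!snapshot_has(before, now->names[i]))
            n++;
    return n;
}
```

Taking a snapshot of `/sys/class/infiniband` periodically and comparing it with the previous one would show entries such as `mthca0` appearing or disappearing. The application would still have to reopen or release the affected devices itself.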
- Sean From rolandd at cisco.com Mon Oct 24 11:26:18 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 24 Oct 2005 11:26:18 -0700 Subject: [openib-general] Re: device add/remove in userspace In-Reply-To: <435D20AF.7040609@ichips.intel.com> (Sean Hefty's message of "Mon, 24 Oct 2005 10:58:07 -0700") References: <435D20AF.7040609@ichips.intel.com> Message-ID: <52ek6a7ovp.fsf@cisco.com> Sean> Is there a way for a userspace application to know if a Sean> device has been added or removed? Sean> I'm porting the CMA to userspace. One of the features of Sean> the kernel CMA is that listen requests can span all devices, Sean> with device add/remove handled automatically. I'd like to Sean> expose similar functionality in userspace, but it appears Sean> that the device list is built once when ibv_get_devices() is Sean> called. We don't really handle this right now. It could probably be made to work on top of hotplug/udev/hal/something but it seems tricky to me. - R. From eitan at mellanox.co.il Mon Oct 24 11:24:57 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 24 Oct 2005 20:24:57 +0200 Subject: [openib-general] Re: [RFC] OpenSM Interactive Console In-Reply-To: <1129740235.16900.33953.camel@hal.voltaire.com> References: <1129740235.16900.33953.camel@hal.voltaire.com> Message-ID: <435D26F9.7040202@mellanox.co.il> Hal Rosenstock wrote: > On Wed, 2005-10-19 at 12:28, Eitan Zahavi wrote: > >>Hal Rosenstock wrote: >> >>>On Tue, 2005-10-18 at 17:11, Eitan Zahavi wrote: >>> >>> >>>>Hal Rosenstock wrote: >>>> >>>> >>>>>Currently, OpenSM does not support an interactive console. There > > has > >>>>>been a desire to introduce the ability to change certain parameters >>> >>>(as >>> >>> >>>>>well as display things) once OpenSM has started. This patch >>> >>>introduces >>> >>> >>>>>the first most basic commands: help and loglevel. I am investgating >>>>>adding smpriority to this. 
The console is invoked by specifying >>> >>>-console >>> >>> >>>>>as an option on the opensm command line. >>>>> >>>>>If you have a request for a command you would like in the console, > > I > >>>>>would like to compile a list of these. >>>>> >>>>>Comments ? >>>> >>>>OpenSM gen1 has a nice TCL API (named osmsh) that lets you do all > > that > >>> >>>>and much more. >>>>Setting ALL options is supported. >>>>It also provides a Tcl access to the SM Database so you can write > > your > >>>own >>> >>> >>>>reports on FDB/MC-FDB etc. >>>>Interactive control on the discovery and fabric settings sequence >>> >>>allows >>> >>> >>>>"single stepping" too. >>> >>> >>>IMO osmsh is more a debugger's tool. It relies on OpenSM globals and >>>internal SM data structures rather than well defined APIs which > > might > >>>isolate the user from changes. (It exposes the internals of the SM > > and > >>>SM modifications may cause scripts using osmsh) to stop working, and >>>worse than that, osmsh scripts may cause serious SM bugs. >> >>What is unsafe in running the following basic code? >>osm_opts configure -log_file $log_file_name. >>osm_init >>osm_bind $guid >>osm_sweep >>osm_set_verbosity 0xffff > > > Are you saying there is no use of globals and internal SM data > structures by osmsh or just for that particular flow ? All I say is that one can get his flows as complicated as he wants... For the trivial features you support in the console there is no complicated equivalent flow in osmsh. For sophisticated requirements you will eventually end up re-writing osmsh. > > >>>I think there is a place for a "safer" console. Perhaps there are > > levels > >>>of access privileges where some can do RO things and others have RW >>>access. >> >>How would this privilege right be granted? > > > Based on user and/or perhaps group. I do not understand how the permission to write flow through the console is going to be administered. 
> > >>>>The OpenSM user manual provides extensive description of it, >>>>including some programming examples. >>> >>> >>>What OpenSM documentation ? I didn't see any with the 1.8.0 release. >> >>It is in the 1.7.1 1.7.0 manuals too. > > > How do you get the old versions of this ? It is in the main trunk ... https://openib.org/svn/gen2/trunk/src/userspace/management/osm/doc/OpenSM_UM.pdf > > >>>>Porting of osmsh to gen2 should be very simple. >>> >>> >>>Is someone working on doing this ? >> >>No - but if needed we can do that. >> >>> >>>>I do not see why we need to invent yet another way to do these > > things. > >>>>Instead I would recommend including osm Tcl extension in the gen2 >>> >>>trunk >>> >>> >>>>and put it to work. >>> >>> >>>-- Hal >>> >> > From krause at cup.hp.com Mon Oct 24 11:35:27 2005 From: krause at cup.hp.com (Michael Krause) Date: Mon, 24 Oct 2005 11:35:27 -0700 Subject: [openib-general] TCP/IP connection service over IB In-Reply-To: <003201c5d678$bcde9310$9e5aa8c0@infiniconsys.com> References: <003201c5d678$bcde9310$9e5aa8c0@infiniconsys.com> Message-ID: <6.2.0.14.2.20051024113443.0214cef0@esmail.cup.hp.com> At 12:50 PM 10/21/2005, Fab Tillier wrote: > > From: James Lentini [mailto:jlentini at netapp.com] > > Sent: Friday, October 21, 2005 12:38 PM > > > > On Fri, 21 Oct 2005, Sean Hefty wrote: > > > > > > sean> version(8) | reserved(8) | src port (16) > > > version(1) | reserved(1) | src port (2) > > > > sean> src ip (16) > > > > sean> dst ip (16) > > > > sean> user private data (56) /* for version 1 */ > > > > > > > > Are the numbers in parens in bytes or bits? It looks like a mixture > to me. > > > > > > Uhm.. they were a mix. Changed above to bytes. > > > > Ok. I assume that your 1 byte of version information is broken into 2 > > 4-bit pieces, one for the protocol version and one for the IP version. > >Doesn't leading-zero-padding the IPv4 addresses to be 16 bytes eliminates the >need for an IP version field? Not really. 
The same logic was used in the SDP port mapper for iWARP where there was still an IP version provided so that the space remained constant while the end node would know how to parse the message. Mike -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Mon Oct 24 11:37:17 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Oct 2005 11:37:17 -0700 Subject: [openib-general] Re: device add/remove in userspace In-Reply-To: <52ek6a7ovp.fsf@cisco.com> References: <435D20AF.7040609@ichips.intel.com> <52ek6a7ovp.fsf@cisco.com> Message-ID: <435D29DD.6030407@ichips.intel.com> Roland Dreier wrote: > Sean> Is there a way for a userspace application to know if a > Sean> device has been added or removed? > > We don't really handle this right now. It could probably be made to > work on top of hotplug/udev/hal/something but it seems tricky to me. At this point, I'm still trying to decide if the userspace CMA should talk to the userspace CM or kernel CMA, so I'm not sure what's needed yet. How difficult would it be to support a call such as ibv_open_device_by_guid() or ibv_open_device_by_name()? - Sean From eitan at mellanox.co.il Mon Oct 24 11:38:10 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 24 Oct 2005 20:38:10 +0200 Subject: [openib-general] [RFC] OpenSM Interactive Console In-Reply-To: <1130153148.4397.12286.camel@hal.voltaire.com> References: <1130153148.4397.12286.camel@hal.voltaire.com> Message-ID: <435D2A12.8040508@mellanox.co.il> Hal Rosenstock wrote: > On Mon, 2005-10-24 at 03:08, Eitan Zahavi wrote: > >>I would suggest to use SNMP for the tasks below. IETF IPoIB group has >>defined an SNMP MIB that can support the required functionality below. > > > The IETF SNMP MIBs are one way of presenting the information to the > outside world. There are other possible management interfaces. 
The SNMP > MIB instrumentation would need to use lower layer APIs to get this > information out of the SM. Yes, but the IETF SM MIB is the only one that is close to a standard way. It does not require a low-level interface if it is integrated into the OpenSM code. One way to do that is by extending OpenSM with an AgentX interface. IMO one clear advantage of using SNMP for SM integration is that the code will work with any SM that is IETF compliant. Also, if you want to write a "client server" type of application on top of an SM, you can either stick to sending MADs, which translates into an SA-client-based application, or better, stay with a known management protocol (like SNMP) rather than develop yet another protocol for doing exactly the same things SNMP already supports. > > >>Everything but the dynamic partitioning (OpenSM does not have >>partition manager to this moment) > > > What Troy meant by partitioning is not necessarily IB partitioning. How are you sure about that? Troy - please comment. > > >>and forwarding of Performance >>Monitoring traps (which are generated by the PM) can be done through >>osmsh or through SA client today. > > > What PerfMgr are you referring to ? No specific one. But the specification does not require the SM too. For various reasons (like load) it might make more sense to have the PM distributed. Anyway, my point is that the SM is not the owner of PM trap reporting. It is the PM that should support Reporting (i.e., InformInfo registration and Trap forwarding) for PM traps. But the spec does not define such traps anyway. > > -- Hal > > >>EZ >> >>Eitan Zahavi >>Design Technology Director >>Mellanox Technologies LTD >>Tel:+972-4-9097208 >>Fax:+972-4-9593245 >>P.O.
Box 586 Yokneam 20692 ISRAEL >> >> >> >>>-----Original Message----- >>>From: Troy Benjegerdes [mailto:hozer at hozed.org] >>>Sent: Thursday, October 20, 2005 3:23 AM >>>To: Hal Rosenstock >>>Cc: openib-general at openib.org >>>Subject: Re: [openib-general] [RFC] OpenSM Interactive Console >>> >>>On Tue, Oct 18, 2005 at 03:10:31PM -0400, Hal Rosenstock wrote: >>> >>>>Currently, OpenSM does not support an interactive console. There >> >>has >> >>>>been a desire to introduce the ability to change certain >> >>parameters (as >> >>>>well as display things) once OpenSM has started. This patch >> >>introduces >> >>>>the first most basic commands: help and loglevel. I am >> >>investgating >> >>>>adding smpriority to this. The console is invoked by specifying >> >>-console >> >>>>as an option on the opensm command line. >>>> >>>>If you have a request for a command you would like in the console, >> >>I >> >>>>would like to compile a list of these. >>>> >>>>Comments ? >>> >>>As well as a console, I'd like an API for some way for external >> >>programs >> >>>(say a cluster queue manager) to be able to query the SM (or the sm >> >>+ some >> >>>helper library) for the following things: >>> >>>* Topology >>>* guid/lid/IPoIB address/switch port mappings >>>* link state >>> >>>Future neat things to do: >>> >>>* An interface to dynamically partition the fabric >>>* Register for notifications for certain events (excessive traffic >>> queueing, or error counts) >>>_______________________________________________ >>>openib-general mailing list >>>openib-general at openib.org >>>http://openib.org/mailman/listinfo/openib-general >>> >>>To unsubscribe, please visit >> >>http://openib.org/mailman/listinfo/openib-general >> > > From ttucker at opengridcomputing.com Mon Oct 24 12:09:37 2005 From: ttucker at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Oct 2005 14:09:37 -0500 Subject: [openib-general] Re: device add/remove in userspace In-Reply-To: <435D29DD.6030407@ichips.intel.com> 
References: <435D20AF.7040609@ichips.intel.com> <52ek6a7ovp.fsf@cisco.com> <435D29DD.6030407@ichips.intel.com> Message-ID: <1130180977.4203.23.camel@trinity.austin.ammasso.com> On Mon, 2005-10-24 at 11:37 -0700, Sean Hefty wrote: > Roland Dreier wrote: > > Sean> Is there a way for a userspace application to know if a > > Sean> device has been added or removed? > > > > We don't really handle this right now. It could probably be made to > > work on top of hotplug/udev/hal/something but it seems tricky to me. > > At this point, I'm still trying to decide if the userspace CMA should talk to > the userspace CM or kernel CMA, so I'm not sure what's needed yet. I'm not sure of all the issues you're considering, but it seems to me at first blush that the user space stuff should talk the kernel CMA. If you don't do it in the kernel: - you will end up replicating transport dependent connection management logic in the user mode library - you will have to export, support and maintain a much larger number of kernel services - implementing security/provisioning policy in user mode is trickier than in the kernel > How difficult would it be to support a call such as ibv_open_device_by_guid() or > ibv_open_device_by_name()? 
> - Sean

From rolandd at cisco.com Mon Oct 24 11:56:18 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 24 Oct 2005 11:56:18 -0700
Subject: [openib-general] Re: device add/remove in userspace
In-Reply-To: <435D29DD.6030407@ichips.intel.com> (Sean Hefty's message of "Mon, 24 Oct 2005 11:37:17 -0700")
References: <435D20AF.7040609@ichips.intel.com> <52ek6a7ovp.fsf@cisco.com> <435D29DD.6030407@ichips.intel.com>
Message-ID: <5264rm7nhp.fsf@cisco.com>

    Sean> How difficult would it be to support a call such as
    Sean> ibv_open_device_by_guid() or ibv_open_device_by_name()?

I don't think something like that would be hard.

 - R.

From philippe.gregoire at cea.fr Mon Oct 24 11:57:20 2005
From: philippe.gregoire at cea.fr (Philippe Gregoire)
Date: Mon, 24 Oct 2005 20:57:20 +0200
Subject: [openib-general] last version for 2.6.9 backport
Message-ID: <200510241858.UAA00199@styx.bruyeres.cea.fr>

Is 3640 the latest version for which a backport to 2.6.9 has been done?
Or is someone working on a newer version today?

thanks
Philippe

From mshefty at ichips.intel.com Mon Oct 24 12:00:29 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 24 Oct 2005 12:00:29 -0700
Subject: [openib-general] Re: device add/remove in userspace
In-Reply-To: <1130180977.4203.23.camel@trinity.austin.ammasso.com>
References: <435D20AF.7040609@ichips.intel.com> <52ek6a7ovp.fsf@cisco.com> <435D29DD.6030407@ichips.intel.com> <1130180977.4203.23.camel@trinity.austin.ammasso.com>
Message-ID: <435D2F4D.3000306@ichips.intel.com>

Tom Tucker wrote:
> I'm not sure of all the issues you're considering, but it seems to me at
> first blush that the user space stuff should talk the kernel CMA.
If you
> don't do it in the kernel:
> - you will end up replicating transport dependent connection management
>   logic in the user mode library
> - you will have to export, support and maintain a much larger number of
>   kernel services
> - implementing security/provisioning policy in user mode is trickier
>   than in the kernel

My hope is that the userspace CMA can talk to the kernel CMA. Otherwise, I
need to expose the kernel address translation and SA query services, plus
deal with more complicated event handling.

The issue is that the kernel CMA performs QP transitions, which would need
to change for userspace QPs. To transition the QPs properly, the userspace
CMA would need to do so based on IB CM messages, and not kernel CMA events.

- Sean

From jbarker at lanl.gov Mon Oct 24 12:18:30 2005
From: jbarker at lanl.gov (James W. Barker)
Date: Mon, 24 Oct 2005 13:18:30 -0600
Subject: [openib-general] max locked memory 32 Kb
Message-ID: <6.2.3.4.2.20051024131202.021ea2f0@cic-mail.lanl.gov>

All,

Following the procedure outlined in your installation cheat sheet, after
make modules modules_install and a reboot, the max locked memory
(ulimit -a) changes from unlimited to 32Kb. The only changes made in the
.config file should be associated with

"Scroll to "device drivers", select it, select Infiniband support. Select
userspace support, Mellanox HCA support."
"Exit saving kernel configuration. "
"make modules modules_install"

Any thoughts?

Thanks,
Jim Barker

James W. Barker, Ph.D.
Los Alamos National Laboratory
Computer and Computational Sciences Division
Advanced Computing Laboratory - Resilient Technologies Team
505-665-9558

From kingman at austin.rr.com Mon Oct 24 13:01:41 2005
From: kingman at austin.rr.com (John Kingman)
Date: Mon, 24 Oct 2005 15:01:41 -0500 (CDT)
Subject: [openib-general] [PATCH] [SRP] Fix bug w/ SRP task mgmt iu size
Message-ID:

This patch fixes the problem of task management ius being the wrong length.

Signed-off-by: John Kingman kingman at storagegear.com

Index: ib_srp.c
===================================================================
--- ib_srp.c	(revision 3852)
+++ ib_srp.c	(working copy)
@@ -1136,7 +1136,7 @@ static int srp_send_tsk_mgmt(struct scsi
 	tsk_mgmt->tsk_mgmt_func = func;
 	tsk_mgmt->task_tag = req_index;
 
-	if (__srp_post_send(target, iu, sizeof tsk_mgmt))
+	if (__srp_post_send(target, iu, sizeof *tsk_mgmt))
 		goto out;
 
 	req->tsk_mgmt = iu;

From rolandd at cisco.com Mon Oct 24 13:13:18 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 24 Oct 2005 13:13:18 -0700
Subject: [openib-general] Re: [PATCH] [SRP] Fix bug w/ SRP task mgmt iu size
In-Reply-To: (John Kingman's message of "Mon, 24 Oct 2005 15:01:41 -0500 (CDT)")
References:
Message-ID: <521x2a7jxd.fsf@cisco.com>

Thanks, applied.

 > Signed-off-by: John Kingman kingman at storagegear.com

Your email is supposed to be in angle brackets like "< >" -- cf
Documentation/SubmittingPatches in the kernel tree. I fixed this in
the commit message I used.

 - R.

From robert.j.woodruff at intel.com Mon Oct 24 13:17:09 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 24 Oct 2005 13:17:09 -0700
Subject: [openib-general] last version for 2.6.9 backport
In-Reply-To: <200510241858.UAA00199@styx.bruyeres.cea.fr>
Message-ID:

Philippe Gregoire wrote,
>Is 3640 the latest version for which a backport to 2.6.9 has been done?
>Or is someone working on a newer version today?

I have a newer one based on 3796 that I just finished testing.
I will commit it today.

woody

From robert.j.woodruff at intel.com Mon Oct 24 13:27:12 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Mon, 24 Oct 2005 13:27:12 -0700
Subject: [openib-general] last version for 2.6.9 backport
In-Reply-To:
Message-ID:

Bob Woodruff wrote,
>>Philippe Gregoire wrote,
>>Is 3640 the latest version for which a backport to 2.6.9 has been done?
>>Or is someone working on a newer version today?

>I have a newer one based on 3796 that I just finished testing.
>I will commit it today.

>woody

New version of 2.6.9 backport patches committed in svn3854.

woody

From rolandd at cisco.com Mon Oct 24 13:54:25 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 24 Oct 2005 13:54:25 -0700
Subject: [openib-general] [PATCH/RFC] mthca: report catastrophic errors
Message-ID: <52wtk263ge.fsf@cisco.com>

I just committed the following patch, which adds some initial support
for detecting and reporting catastrophic errors reported by Mellanox
HCAs. We start a periodic timer which polls the catastrophic error
reporting buffer in device memory. If an error is detected, we dump
the contents of the buffer for post-mortem debugging, and report a
fatal asynchronous error to higher levels.

In the future we can try to recover from these errors by resetting the
device, but this will require some work in higher-level code as well.
Let's get this in now, so that we at least get catastrophic errors
reported in logs.

Comments and criticisms gratefully accepted.

 - R.

--- infiniband/hw/mthca/mthca_provider.c	(revision 3852)
+++ infiniband/hw/mthca/mthca_provider.c	(working copy)
@@ -1175,10 +1175,13 @@ int mthca_register_device(struct mthca_d
 		}
 	}
 
+	mthca_start_catas_poll(dev);
+
 	return 0;
 }
 
 void mthca_unregister_device(struct mthca_dev *dev)
 {
+	mthca_stop_catas_poll(dev);
 	ib_unregister_device(&dev->ib_dev);
 }
--- infiniband/hw/mthca/mthca_catas.c	(revision 0)
+++ infiniband/hw/mthca/mthca_catas.c	(revision 0)
@@ -0,0 +1,151 @@
+/*
+ * Copyright (c) 2005 Cisco Systems. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include "mthca_dev.h" + +enum { + MTHCA_CATAS_POLL_INTERVAL = 5 * HZ, + + MTHCA_CATAS_TYPE_INTERNAL = 0, + MTHCA_CATAS_TYPE_UPLINK = 3, + MTHCA_CATAS_TYPE_DDR = 4, + MTHCA_CATAS_TYPE_PARITY = 5, +}; + +static DEFINE_SPINLOCK(catas_lock); + +static void handle_catas(struct mthca_dev *dev) +{ + struct ib_event event; + const char *type; + int i; + + event.device = &dev->ib_dev; + event.event = IB_EVENT_DEVICE_FATAL; + event.element.port_num = 0; + + ib_dispatch_event(&event); + + switch (swab32(readl(dev->catas_err.map)) >> 24) { + case MTHCA_CATAS_TYPE_INTERNAL: + type = "internal error"; + break; + case MTHCA_CATAS_TYPE_UPLINK: + type = "uplink bus error"; + break; + case MTHCA_CATAS_TYPE_DDR: + type = "DDR data error"; + break; + case MTHCA_CATAS_TYPE_PARITY: + type = "internal parity error"; + break; + default: + type = "unknown error"; + break; + } + + mthca_err(dev, "Catastrophic error detected: %s\n", type); + for (i = 0; i < dev->catas_err.size; ++i) + mthca_err(dev, " buf[%02x]: %08x\n", + i, swab32(readl(dev->catas_err.map + i))); +} + +static void poll_catas(unsigned long dev_ptr) +{ + struct mthca_dev *dev = (struct mthca_dev *) dev_ptr; + unsigned long flags; + int i; + + for (i = 0; i < dev->catas_err.size; ++i) + if (readl(dev->catas_err.map + i)) { + handle_catas(dev); + return; + } + + spin_lock_irqsave(&catas_lock, flags); + if (dev->catas_err.stop) + mod_timer(&dev->catas_err.timer, + jiffies + MTHCA_CATAS_POLL_INTERVAL); + spin_unlock_irqrestore(&catas_lock, flags); + + return; +} + +void mthca_start_catas_poll(struct mthca_dev *dev) +{ + init_timer(&dev->catas_err.timer); + dev->catas_err.stop = 0; + dev->catas_err.map = NULL; + + if (!request_mem_region(dev->catas_err.addr, + dev->catas_err.size * 4, + DRV_NAME)) { + mthca_warn(dev, "couldn't request catastrophic error region " + "at 0x%llx/0x%x\n", + (unsigned long long) dev->catas_err.addr, + dev->catas_err.size * 4); + return; + } + + dev->catas_err.map = 
ioremap(dev->catas_err.addr, dev->catas_err.size * 4); + if (!dev->catas_err.map) { + mthca_warn(dev, "couldn't map catastrophic error region " + "at 0x%llx/0x%x\n", + (unsigned long long) dev->catas_err.addr, + dev->catas_err.size * 4); + release_mem_region(dev->catas_err.addr, + dev->catas_err.size * 4); + return; + } + + dev->catas_err.timer.data = (unsigned long) dev; + dev->catas_err.timer.function = poll_catas; + dev->catas_err.timer.expires = jiffies + MTHCA_CATAS_POLL_INTERVAL; + add_timer(&dev->catas_err.timer); +} + +void mthca_stop_catas_poll(struct mthca_dev *dev) +{ + spin_lock_irq(&catas_lock); + dev->catas_err.stop = 1; + spin_unlock_irq(&catas_lock); + + del_timer_sync(&dev->catas_err.timer); + + if (dev->catas_err.map) { + iounmap(dev->catas_err.map); + release_mem_region(dev->catas_err.addr, + dev->catas_err.size * 4); + } +} --- infiniband/hw/mthca/mthca_cmd.c (revision 3852) +++ infiniband/hw/mthca/mthca_cmd.c (working copy) @@ -1,6 +1,7 @@ /* * Copyright (c) 2004, 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Mellanox Technologies. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -706,9 +707,13 @@ int mthca_QUERY_FW(struct mthca_dev *dev MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET); dev->cmd.max_cmds = 1 << lg; + MTHCA_GET(dev->catas_err.addr, outbox, QUERY_FW_ERR_START_OFFSET); + MTHCA_GET(dev->catas_err.size, outbox, QUERY_FW_ERR_SIZE_OFFSET); mthca_dbg(dev, "FW version %012llx, max commands %d\n", (unsigned long long) dev->fw_ver, dev->cmd.max_cmds); + mthca_dbg(dev, "Catastrophic error buffer at 0x%llx, size 0x%x\n", + (unsigned long long) dev->catas_err.addr, dev->catas_err.size); if (mthca_is_memfree(dev)) { MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET); --- infiniband/hw/mthca/Makefile (revision 3852) +++ infiniband/hw/mthca/Makefile (working copy) @@ -9,4 +9,5 @@ obj-$(CONFIG_INFINIBAND_MTHCA) += ib_mth ib_mthca-y := mthca_main.o mthca_cmd.o mthca_profile.o mthca_reset.o \ mthca_allocator.o mthca_eq.o mthca_pd.o mthca_cq.o \ mthca_mr.o mthca_qp.o mthca_av.o mthca_mcg.o mthca_mad.o \ - mthca_provider.o mthca_memfree.o mthca_uar.o mthca_srq.o + mthca_provider.o mthca_memfree.o mthca_uar.o mthca_srq.o \ + mthca_catas.o --- infiniband/hw/mthca/mthca_dev.h (revision 3852) +++ infiniband/hw/mthca/mthca_dev.h (working copy) @@ -258,6 +258,14 @@ struct mthca_mcg_table { struct mthca_icm_table *table; }; +struct mthca_catas_err { + u64 addr; + u32 __iomem *map; + unsigned long stop; + u32 size; + struct timer_list timer; +}; + struct mthca_dev { struct ib_device ib_dev; struct pci_dev *pdev; @@ -318,6 +326,8 @@ struct mthca_dev { struct mthca_av_table av_table; struct mthca_mcg_table mcg_table; + struct mthca_catas_err catas_err; + struct mthca_uar driver_uar; struct mthca_db_table *db_tab; struct mthca_pd driver_pd; @@ -405,6 +415,9 @@ void mthca_cleanup_mcg_table(struct mthc int mthca_register_device(struct mthca_dev *dev); void mthca_unregister_device(struct mthca_dev *dev); +void mthca_start_catas_poll(struct mthca_dev *dev); +void 
mthca_stop_catas_poll(struct mthca_dev *dev); + int mthca_uar_alloc(struct mthca_dev *dev, struct mthca_uar *uar); void mthca_uar_free(struct mthca_dev *dev, struct mthca_uar *uar); From ardavis at ichips.intel.com Mon Oct 24 13:59:19 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Mon, 24 Oct 2005 13:59:19 -0700 Subject: [openib-general] Re: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN In-Reply-To: <20051013193436.GA13514@mellanox.co.il> References: <20051013193436.GA13514@mellanox.co.il> Message-ID: <435D4B27.2010208@ichips.intel.com> Michael S. Tsirkin wrote: >Thanks Arlin. I plan to look into integrating this. >One question: for which psn values do you see performance drop on 4.6.0 FW? > > > Any luck isolating this performance problem? I just want to understand the cause so I know for sure 4.7 FW is a solid fix. Didn't see anything in the 4.7 release notes that covered this issue. From halr at voltaire.com Mon Oct 24 14:09:56 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Oct 2005 17:09:56 -0400 Subject: [openib-general] Re: [RFC] OpenSM Interactive Console In-Reply-To: <435D26F9.7040202@mellanox.co.il> References: <1129740235.16900.33953.camel@hal.voltaire.com> <435D26F9.7040202@mellanox.co.il> Message-ID: <1130187995.4397.15081.camel@hal.voltaire.com> On Mon, 2005-10-24 at 14:24, Eitan Zahavi wrote: > > How do you get the old versions of this ? > It is in the main trunk ... > https://openib.org/svn/gen2/trunk/src/userspace/management/osm/doc/OpenSM_UM.pdf That's older than 1.7.1 1.7.0 manuals you had mentioned. Maybe nothing has changed in osmsh between the version of the manual in the trunk and currently. 
-- Hal From halr at voltaire.com Mon Oct 24 14:13:09 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Oct 2005 17:13:09 -0400 Subject: [openib-general] [RFC] OpenSM Interactive Console In-Reply-To: <435D2A12.8040508@mellanox.co.il> References: <1130153148.4397.12286.camel@hal.voltaire.com> <435D2A12.8040508@mellanox.co.il> Message-ID: <1130188003.4397.15083.camel@hal.voltaire.com> On Mon, 2005-10-24 at 14:38, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > On Mon, 2005-10-24 at 03:08, Eitan Zahavi wrote: > > > >>I would suggest to use SNMP for the tasks below. IETF IPoIB group has > >>defined an SNMP MIB that can support the required functionality below. > > > > > > The IETF SNMP MIBs are one way of presenting the information to the > > outside world. There are other possible management interfaces. The SNMP > > MIB instrumentation would need to use lower layer APIs to get this > > information out of the SM. > Yes but the IETF SM MIB is the only one that is close to a standard way. > It does not require low level interface if it will integrate into the OpenSM code. > One way to do it is buy extending OpenSM with an AgentX interface. > > IMO one clear advantage of using SNMP for SM integration is that the code will work with any SM that is IETF compliant. > Also if you want to write a "client server" type of application on top of an SM you > can either stick to sending MADs which translate into SA client based application or > you better stay with some known protocol for management (like SNMP) and not develop yet another protocol for > doing exactly the same things as SNMP already supports. There are limitations in the SNMP MIBs. One is that they are RO so they are more for monitoring. Also, many environments do not use SNMP. It is unclear how much of a requirement it is to manage any SM or how many other SMs support the SM MIB. (There are other IB associated MIBs too). 
> >>Everything but the dynamic partitioning (OpenSM does not have > >>partition manager to this moment) > > > > > > What Troy meant by partitioning is not necessarily IB partitioning. > How are you sure about that? Troy - please comment. I think you missed an email on this. > >>and forwarding of Performance > >>Monitoring traps (which are generated by the PM) can be done through > >>osmsh or through SA client today. > > > > > > What PerfMgr are you referring to ? > No specific one. But the specification does not require the SM too. Huh ? What spec ? An SM is required in a subnet. There is no subnet without this. There is a subnet without a PerfMgr. > For various reasons (like load) it might make more sense to have the PM distributed. Sure. Also, the PerfMgr need not be colocated with the SM anyhow. > Anyway, my point is that the SM is not the owner of PM trap reporting. It is the PM that > should support Reporting (I.e InformInfo registration and Trap forwarding) for PM traps. > But the spec does not define such traps anyway. My point was that the PerfMgr is beyond the IBA spec. It is only the PMA that is defined and has no traps so these will all need synthesis by the PerfMgr. -- Hal From nacc at us.ibm.com Mon Oct 24 14:26:58 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 24 Oct 2005 14:26:58 -0700 Subject: [openib-general] [PATCH/RFC] mthca: report catastrophic errors In-Reply-To: <52wtk263ge.fsf@cisco.com> References: <52wtk263ge.fsf@cisco.com> Message-ID: <20051024212658.GA7300@us.ibm.com> On 24.10.2005 [13:54:25 -0700], Roland Dreier wrote: > I just committed the following patch, which adds some initial support > for detecting and reporting catastrophic errors reported by Mellanox > HCAs. We start a periodic timer which polls the catastrophic error > reporting buffer in device memory. If an error is detected, we dump > the contents of the buffer for port-mortem debugging, and report a > fatal asynchronous error to higher levels. 
> > In the future we can try to recover from these errors by resetting the > device, but this will require some work in higher-level code as well. > Let's get this in now, so that we at least get catastrophic errors > reported in logs. > > Comments and criticisms gratefully accepted. > > - R. > > --- infiniband/hw/mthca/mthca_provider.c (revision 3852) > +++ infiniband/hw/mthca/mthca_provider.c (working copy) > +void mthca_start_catas_poll(struct mthca_dev *dev) > +{ > + init_timer(&dev->catas_err.timer); > + dev->catas_err.stop = 0; > + dev->catas_err.map = NULL; > + > + if (!request_mem_region(dev->catas_err.addr, > + dev->catas_err.size * 4, > + DRV_NAME)) { > + mthca_warn(dev, "couldn't request catastrophic error region " > + "at 0x%llx/0x%x\n", > + (unsigned long long) dev->catas_err.addr, > + dev->catas_err.size * 4); > + return; > + } > + > + dev->catas_err.map = ioremap(dev->catas_err.addr, dev->catas_err.size * 4); > + if (!dev->catas_err.map) { > + mthca_warn(dev, "couldn't map catastrophic error region " > + "at 0x%llx/0x%x\n", > + (unsigned long long) dev->catas_err.addr, > + dev->catas_err.size * 4); > + release_mem_region(dev->catas_err.addr, > + dev->catas_err.size * 4); > + return; > + } > + > + dev->catas_err.timer.data = (unsigned long) dev; > + dev->catas_err.timer.function = poll_catas; > + dev->catas_err.timer.expires = jiffies + MTHCA_CATAS_POLL_INTERVAL; I know akpm has been harping on this only recently (I have yet to audit all the kernel, but will get around to it eventually), but these three inits can be done via setup_timer() now. Thanks, Nish From sinate at yahoo.com Mon Oct 24 14:30:44 2005 From: sinate at yahoo.com (Steven Wooding) Date: Mon, 24 Oct 2005 22:30:44 +0100 (BST) Subject: [openib-general] Support for UC connections using the CM API? In-Reply-To: Message-ID: <20051024213044.89732.qmail@web32505.mail.mud.yahoo.com> Hi Sean, Your patch works if I put the IB_QP_MAX_QP_RD_ATOMIC mask into the UC (defualt) QP attr mask. 
Otherwise fine. Thanks, Steve. --- Sean Hefty wrote: > >I had a look at where the mask is set in cm.c > >(cm_init_qp_rtr_attr() and cm_init_qp_rts_attr()) > but > >I was unsure how to make the mask depend on the QP > >type. Maybe you have a better idea of how to do > this. > > Here's a patch (edited by hand, so let me know if > there's any issue > applying it) that should permit UC connections over > the CM. I was able to > test this using cmpost. > > Signed-off-by: Sean Hefty > > > Index: cm.c > =================================================================== > --- cm.c (revision 3830) > +++ cm.c (working copy) > @@ -135,6 +135,7 @@ > __be64 tid; > __be32 local_qpn; > __be32 remote_qpn; > + enum ib_qp_type qp_type; > __be32 sq_psn; > __be32 rq_psn; > int timeout_ms; > @@ -926,6 +923,7 @@ > cm_id_priv->responder_resources = > param->responder_resources; > cm_id_priv->retry_count = param->retry_count; > cm_id_priv->path_mtu = param->primary_path->mtu; > + cm_id_priv->qp_type = param->qp_type; > > ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg); > if (ret) > @@ -1320,6 +1314,7 @@ > cm_req_get_primary_local_ack_timeout(req_msg); > cm_id_priv->retry_count = > cm_req_get_retry_count(req_msg); > cm_id_priv->rnr_retry_count = > cm_req_get_rnr_retry_count(req_msg); > + cm_id_priv->qp_type = cm_req_get_qp_type(req_msg); > > cm_format_req_event(work, cm_id_priv, > &listen_cm_id_priv->id); > cm_process_work(cm_id_priv, work); > @@ -3079,10 +3035,10 @@ > case IB_CM_ESTABLISHED: > *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS > | > IB_QP_PKEY_INDEX | IB_QP_PORT; > - qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE; > + qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE > | > + IB_ACCESS_REMOTE_WRITE; > if (cm_id_priv->responder_resources) > - qp_attr->qp_access_flags |= > IB_ACCESS_REMOTE_WRITE | > - IB_ACCESS_REMOTE_READ; > + qp_attr->qp_access_flags |= > IB_ACCESS_REMOTE_READ; > qp_attr->pkey_index = cm_id_priv->av.pkey_index; > qp_attr->port_num = > 
cm_id_priv->av.port->port_num; > ret = 0; > @@ -3112,14 +3068,18 @@ > case IB_CM_MRA_REP_RCVD: > case IB_CM_ESTABLISHED: > *qp_attr_mask = IB_QP_STATE | IB_QP_AV | > IB_QP_PATH_MTU | > - IB_QP_DEST_QPN | IB_QP_RQ_PSN | > - IB_QP_MAX_DEST_RD_ATOMIC | IB_QP_MIN_RNR_TIMER; > + IB_QP_DEST_QPN | IB_QP_RQ_PSN; > qp_attr->ah_attr = cm_id_priv->av.ah_attr; > qp_attr->path_mtu = cm_id_priv->path_mtu; > qp_attr->dest_qp_num = > be32_to_cpu(cm_id_priv->remote_qpn); > qp_attr->rq_psn = > be32_to_cpu(cm_id_priv->rq_psn); > - qp_attr->max_dest_rd_atomic = > cm_id_priv->responder_resources; > - qp_attr->min_rnr_timer = 0; > + if (cm_id_priv->qp_type == IB_QPT_RC) { > + *qp_attr_mask |= IB_QP_MAX_DEST_RD_ATOMIC | > + IB_QP_MIN_RNR_TIMER; > + qp_attr->max_dest_rd_atomic = > + cm_id_priv->responder_resources; > + qp_attr->min_rnr_timer = 0; > + } > if (cm_id_priv->alt_av.ah_attr.dlid) { > *qp_attr_mask |= IB_QP_ALT_PATH; > qp_attr->alt_ah_attr = > cm_id_priv->alt_av.ah_attr; > @@ -3148,14 +3108,17 @@ > case IB_CM_REP_SENT: > case IB_CM_MRA_REP_RCVD: > case IB_CM_ESTABLISHED: > - *qp_attr_mask = IB_QP_STATE | IB_QP_TIMEOUT | > IB_QP_RETRY_CNT | > - IB_QP_RNR_RETRY | IB_QP_SQ_PSN | > - IB_QP_MAX_QP_RD_ATOMIC; > - qp_attr->timeout = cm_id_priv->local_ack_timeout; > - qp_attr->retry_cnt = cm_id_priv->retry_count; > - qp_attr->rnr_retry = cm_id_priv->rnr_retry_count; > + *qp_attr_mask = IB_QP_STATE | IB_QP_SQ_PSN; > qp_attr->sq_psn = > be32_to_cpu(cm_id_priv->sq_psn); > - qp_attr->max_rd_atomic = > cm_id_priv->initiator_depth; > + if (cm_id_priv->qp_type == IB_QPT_RC) { > + *qp_attr_mask |= IB_QP_TIMEOUT | IB_QP_RETRY_CNT > | > + IB_QP_RNR_RETRY | > + IB_QP_MAX_QP_RD_ATOMIC; > + qp_attr->timeout = > cm_id_priv->local_ack_timeout; > + qp_attr->retry_cnt = cm_id_priv->retry_count; > + qp_attr->rnr_retry = > cm_id_priv->rnr_retry_count; > + qp_attr->max_rd_atomic = > cm_id_priv->initiator_depth; > + } > if (cm_id_priv->alt_av.ah_attr.dlid) { > *qp_attr_mask |= IB_QP_PATH_MIG_STATE; > 
qp_attr->path_mig_state = IB_MIG_REARM;
>

From mshefty at ichips.intel.com Mon Oct 24 15:16:46 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 24 Oct 2005 15:16:46 -0700
Subject: [openib-general] Support for UC connections using the CM API?
In-Reply-To: <20051024213044.89732.qmail@web32505.mail.mud.yahoo.com>
References: <20051024213044.89732.qmail@web32505.mail.mud.yahoo.com>
Message-ID: <435D5D4E.5000106@ichips.intel.com>

Steven Wooding wrote:
> Your patch works if I put the IB_QP_MAX_QP_RD_ATOMIC
> mask into the UC (default) QP attr mask. Otherwise
> fine.

Why was this needed? Atomic and reads apply only to RC. Were you seeing
an error?

- Sean

From rolandd at cisco.com Mon Oct 24 15:33:57 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 24 Oct 2005 15:33:57 -0700
Subject: [openib-general] [PATCH/RFC] mthca: report catastrophic errors
In-Reply-To: <20051024212658.GA7300@us.ibm.com> (Nishanth Aravamudan's message of "Mon, 24 Oct 2005 14:26:58 -0700")
References: <52wtk263ge.fsf@cisco.com> <20051024212658.GA7300@us.ibm.com>
Message-ID: <52sluq5yui.fsf@cisco.com>

    Nishanth> I know akpm has been harping on this only recently (I
    Nishanth> have yet to audit all the kernel, but will get around to
    Nishanth> it eventually), but these three inits can be done via
    Nishanth> setup_timer() now.

Hmm, I can't find anything like setup_timer() in Linus's latest tree.
Is this an -mm thing? If so I'll wait until it hits mainline to
update this.

 - R.
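
[Editor's sketch] The setup_timer() suggestion in the exchange above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `mock_timer`, `mock_init_timer()`, `mock_setup_timer()`, `mock_dev` and `mock_poll()` are hypothetical stand-ins written for this note, not kernel code, and the `initialized` field is invented to make the effect observable. It assumes the helper bundles the init call with the `.function` and `.data` assignments and leaves `.expires` to the caller, which is my reading of the helper being discussed rather than something the thread states.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct timer_list. Only the fields
 * mentioned in the thread are modelled; "initialized" is hypothetical. */
struct mock_timer {
	unsigned long expires;
	void (*function)(unsigned long);
	unsigned long data;
	int initialized;
};

/* Stand-in for init_timer(): marks the timer as ready for use. */
static void mock_init_timer(struct mock_timer *timer)
{
	assert(timer != NULL);
	timer->expires = 0;
	timer->initialized = 1;
}

/* Stand-in for the proposed helper: one call replaces the separate
 * init call and the .function and .data assignments. The expiry time
 * is still set by the caller afterwards (an assumption, see above). */
static void mock_setup_timer(struct mock_timer *timer,
			     void (*function)(unsigned long),
			     unsigned long data)
{
	timer->function = function;
	timer->data = data;
	mock_init_timer(timer);
}

/* Hypothetical device with a poll counter, in place of struct mthca_dev. */
struct mock_dev {
	int polls;
};

/* Callback in the style of poll_catas(): recover the device from the
 * unsigned long cookie, then do the periodic work. */
static void mock_poll(unsigned long dev_ptr)
{
	struct mock_dev *dev = (struct mock_dev *) dev_ptr;

	dev->polls++;
}
```

With these stand-ins, a caller would write `mock_setup_timer(&t, mock_poll, (unsigned long) &dev);` and then set `t.expires` before arming the timer, instead of assigning each field by hand.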

From mshefty at ichips.intel.com Mon Oct 24 15:38:00 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 24 Oct 2005 15:38:00 -0700
Subject: [openib-general] [PATCH] Fix for MAD layer DMA mappings
In-Reply-To:
References:
Message-ID: <435D6248.8040001@ichips.intel.com>

Sean Hefty wrote:
> The following patch should fix the MAD layer's DMA mapping issue. This
> patch includes all related patches that were previously posted. The fix
> involved changing the MAD layer API. All callers must now use the MAD
> layer to allocate and free send MADs. DMA mappings are done by the MAD
> layer.

Are there any objections to committing this?

- Sean

From rolandd at cisco.com Mon Oct 24 15:41:47 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 24 Oct 2005 15:41:47 -0700
Subject: [openib-general] Support for UC connections using the CM API?
In-Reply-To: <435D5D4E.5000106@ichips.intel.com> (Sean Hefty's message of "Mon, 24 Oct 2005 15:16:46 -0700")
References: <20051024213044.89732.qmail@web32505.mail.mud.yahoo.com> <435D5D4E.5000106@ichips.intel.com>
Message-ID: <52oe5e5yhg.fsf@cisco.com>

    Sean> Why was this needed? Atomic and reads apply only to RC.
    Sean> Were you seeing an error?

I think it's a bug in mthca. I'll post a patch shortly.

 - R.

From rolandd at cisco.com Mon Oct 24 15:42:54 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 24 Oct 2005 15:42:54 -0700
Subject: [openib-general] [PATCH] Fix for MAD layer DMA mappings
In-Reply-To: <435D6248.8040001@ichips.intel.com> (Sean Hefty's message of "Mon, 24 Oct 2005 15:38:00 -0700")
References: <435D6248.8040001@ichips.intel.com>
Message-ID: <52k6g25yfl.fsf@cisco.com>

    Sean> Are there any objections to committing this?

Sorry, I missed it on Friday. Let me give it a quick read through --
I'll NAK it by tonight if I see anything but I don't anticipate problems.

 - R.
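
[Editor's sketch] The "Support for UC connections using the CM API?" thread turns on which attribute bits accompany the RTR and RTS transitions for each QP type. The model below follows the branches in Sean Hefty's cm.c patch as quoted in this archive: a common base mask, with the read/atomic, timeout and retry bits added only for RC. Everything named `MOCK_*` or `mock_*` is hypothetical and the bit values are arbitrary; this is not the ib_verbs API, only a sketch of the selection logic.

```c
#include <assert.h>

/* Hypothetical QP types and attribute bits; values are arbitrary. */
enum mock_qp_type { MOCK_QPT_RC, MOCK_QPT_UC };

enum {
	MOCK_QP_STATE              = 1 << 0,
	MOCK_QP_AV                 = 1 << 1,
	MOCK_QP_PATH_MTU           = 1 << 2,
	MOCK_QP_DEST_QPN           = 1 << 3,
	MOCK_QP_RQ_PSN             = 1 << 4,
	MOCK_QP_MAX_DEST_RD_ATOMIC = 1 << 5,
	MOCK_QP_MIN_RNR_TIMER      = 1 << 6,
	MOCK_QP_SQ_PSN             = 1 << 7,
	MOCK_QP_TIMEOUT            = 1 << 8,
	MOCK_QP_RETRY_CNT          = 1 << 9,
	MOCK_QP_RNR_RETRY          = 1 << 10,
	MOCK_QP_MAX_QP_RD_ATOMIC   = 1 << 11,
};

/* Mask for the transition to RTR: shared bits first, then the
 * responder-side read/atomic and RNR bits for RC only. */
static int mock_rtr_mask(enum mock_qp_type type)
{
	int mask = MOCK_QP_STATE | MOCK_QP_AV | MOCK_QP_PATH_MTU |
		   MOCK_QP_DEST_QPN | MOCK_QP_RQ_PSN;

	assert(type == MOCK_QPT_RC || type == MOCK_QPT_UC);
	if (type == MOCK_QPT_RC)
		mask |= MOCK_QP_MAX_DEST_RD_ATOMIC | MOCK_QP_MIN_RNR_TIMER;
	return mask;
}

/* Mask for the transition to RTS: shared bits first, then the
 * timeout, retry and initiator-depth bits for RC only. */
static int mock_rts_mask(enum mock_qp_type type)
{
	int mask = MOCK_QP_STATE | MOCK_QP_SQ_PSN;

	assert(type == MOCK_QPT_RC || type == MOCK_QPT_UC);
	if (type == MOCK_QPT_RC)
		mask |= MOCK_QP_TIMEOUT | MOCK_QP_RETRY_CNT |
			MOCK_QP_RNR_RETRY | MOCK_QP_MAX_QP_RD_ATOMIC;
	return mask;
}
```

Under this model a UC caller never supplies `MOCK_QP_MAX_QP_RD_ATOMIC`, so a provider that lists that bit as required for UC would reject the transition, which matches the symptom Steven Wooding reported.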
From rolandd at cisco.com Mon Oct 24 15:53:53 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 24 Oct 2005 15:53:53 -0700 Subject: [openib-general] Support for UC connections using the CM API? In-Reply-To: <435D5D4E.5000106@ichips.intel.com> (Sean Hefty's message of "Mon, 24 Oct 2005 15:16:46 -0700") References: <20051024213044.89732.qmail@web32505.mail.mud.yahoo.com> <435D5D4E.5000106@ichips.intel.com> Message-ID: <52fyqq5xxa.fsf@cisco.com> I think something like this will fix things. Does this look right to everyone? - R. --- infiniband/hw/mthca/mthca_qp.c (revision 3852) +++ infiniband/hw/mthca/mthca_qp.c (working copy) @@ -338,8 +338,7 @@ static const struct { [UC] = (IB_QP_AV | IB_QP_PATH_MTU | IB_QP_DEST_QPN | - IB_QP_RQ_PSN | - IB_QP_MAX_DEST_RD_ATOMIC), + IB_QP_RQ_PSN), [RC] = (IB_QP_AV | IB_QP_PATH_MTU | IB_QP_DEST_QPN | @@ -368,8 +367,7 @@ static const struct { .trans = MTHCA_TRANS_RTR2RTS, .req_param = { [UD] = IB_QP_SQ_PSN, - [UC] = (IB_QP_SQ_PSN | - IB_QP_MAX_QP_RD_ATOMIC), + [UC] = IB_QP_SQ_PSN, [RC] = (IB_QP_TIMEOUT | IB_QP_RETRY_CNT | IB_QP_RNR_RETRY | @@ -446,8 +444,6 @@ static const struct { [UD] = (IB_QP_PKEY_INDEX | IB_QP_QKEY), [UC] = (IB_QP_AV | - IB_QP_MAX_QP_RD_ATOMIC | - IB_QP_MAX_DEST_RD_ATOMIC | IB_QP_CUR_STATE | IB_QP_ALT_PATH | IB_QP_ACCESS_FLAGS | @@ -478,7 +474,7 @@ static const struct { .opt_param = { [UD] = (IB_QP_CUR_STATE | IB_QP_QKEY), - [UC] = (IB_QP_CUR_STATE), + [UC] = IB_QP_CUR_STATE, [RC] = (IB_QP_CUR_STATE | IB_QP_MIN_RNR_TIMER), [MLX] = (IB_QP_CUR_STATE | From mshefty at ichips.intel.com Mon Oct 24 16:18:39 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Oct 2005 16:18:39 -0700 Subject: [openib-general] Support for UC connections using the CM API? 
In-Reply-To: <52fyqq5xxa.fsf@cisco.com> References: <20051024213044.89732.qmail@web32505.mail.mud.yahoo.com> <435D5D4E.5000106@ichips.intel.com> <52fyqq5xxa.fsf@cisco.com> Message-ID: <435D6BCF.3060806@ichips.intel.com> Roland Dreier wrote: > I think something like this will fix things. Does this look right to everyone? Looks fine to me. Thanks for catching this. - Sean From nacc at us.ibm.com Mon Oct 24 16:48:34 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 24 Oct 2005 16:48:34 -0700 Subject: [openib-general] [PATCH/RFC] mthca: report catastrophic errors In-Reply-To: <52sluq5yui.fsf@cisco.com> References: <52wtk263ge.fsf@cisco.com> <20051024212658.GA7300@us.ibm.com> <52sluq5yui.fsf@cisco.com> Message-ID: <20051024234834.GC7300@us.ibm.com> On 24.10.2005 [15:33:57 -0700], Roland Dreier wrote: > Nishanth> I know akpm has been harping on this only recently (I > Nishanth> have yet to audit all the kernel, but will get around to > Nishanth> it eventually), but these three inits can be done via > Nishanth> setup_timer() now. > > Hmm, I can't find anything like setup_timer() in Linus's latest tree. > Is this an -mm thing? If so I'll wait until it hits mainline to > update this. Ah yes, sorry. I will probably do the IB update(s) for you when I hit all the other trees (eventually....) Thanks, Nish From mshefty at ichips.intel.com Mon Oct 24 17:06:01 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 24 Oct 2005 17:06:01 -0700 Subject: [openib-general] RFC userspace CMA Message-ID: <435D76E9.5040404@ichips.intel.com> I'm soliciting any comments that anyone might have on the general design for the userspace CMA before I get too far into the implementation. - The API will match the kernel API for the most part. The exception is that event handling will match other userspace libraries (get/ack event). - There will be a single CMA device exported through /sys/class/infiniband. 
- The kernel CMA will be modified to remove the requirement to use rdma_create_qp(). Users that want to allocate and manage their own QP states will be able to specify QP attributes (qpn, qp_type, srq) through the rdma_conn_param structure. - The kernel CMA will expose a new call, rdma_init_qp_attr() to initialize QP attributes used to modify the state of the QP. The call will be similar to the infiniband CM routine. Use of this call is optional. The CMA will automatically transition QPs created by rdma_create_qp(). - The uCMA will open devices for users and return them the device context with related events. The uCMA will close the device if there are no rdma_cma_id's associated with it. - To support device add, the uCMA will need a new verb's call: ibv_open_device_by_guid(). If a connection request occurs for a device that is not yet known by the uCMA, it will open the device. Comments? - Sean From ted.kim at sun.com Mon Oct 24 17:17:22 2005 From: ted.kim at sun.com (Ted H. Kim) Date: Mon, 24 Oct 2005 17:17:22 -0700 Subject: [swg] Re: [openib-general] TCP/IP connection service over IB In-Reply-To: <4359575B.5020302@ichips.intel.com> References: <43591D07.5050709@ichips.intel.com> <43594159.3000202@ichips.intel.com> <43594538.7030806@ichips.intel.com> <1129928894.4255.0.camel@trinity.austin.ammasso.com> <4359575B.5020302@ichips.intel.com> Message-ID: <435D7992.7000705@sun.com> Just a comment on this one point ... Sean Hefty wrote: >> I'm thinking that for iWARP, there won't be anything in the Private Data >> at all except consumer private data. Is that your expectation? > > > I believe so. This is only trying to define a TCP/IP connection service > over IB. I'm assuming that there's no need to define something similar > for iWarp. Not sure if this is relevant for your intended base, but ITAPI 2.0 for iWARP has a 16-byte IOH (IRD/ORD header) in MPA "private data" to note RDMA initiator/responder limits like the corresponding fields in the IB CM protocol. 
-ted From tom at opengridcomputing.com Mon Oct 24 18:06:08 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 24 Oct 2005 20:06:08 -0500 Subject: [swg] Re: [openib-general] TCP/IP connection service over IB In-Reply-To: <435D7992.7000705@sun.com> References: <43591D07.5050709@ichips.intel.com> <43594159.3000202@ichips.intel.com> <43594538.7030806@ichips.intel.com> <1129928894.4255.0.camel@trinity.austin.ammasso.com> <4359575B.5020302@ichips.intel.com> <435D7992.7000705@sun.com> Message-ID: <1130202368.6405.11.camel@trinity.austin.ammasso.com> Ted: I think it's relevant, so let's make sure my assumptions are correct: - The ITAPI will be a "ULP" on OpenIB - The ITAPI will create the IRD/ORD headers in its private data and submit this as part of its connection establishment. - The ITAPI consumer at the remote peer will use this data to configure its local QP before accepting the connection Over IB, the IRD/ORD private data will be preceded by a "private data header" that contains the source and destination IP addresses, source port, etc. The remote peer will not see this header as part of the private data, but rather will see it in the CMA event in the upcall. Over iWARP/MPA, there will be nothing else in the private data except what was provided by the consumer (ITAPI in this case). The reason is that this extra information (IP addressing info) is in the protocol header proper. On Mon, 2005-10-24 at 17:17 -0700, Ted H. Kim wrote: > Just a comment on this one point ... > > Sean Hefty wrote: > >> I'm thinking that for iWARP, there won't be anything in the Private Data > >> at all except consumer private data. Is that your expectation? > > > > > > I believe so. This is only trying to define a TCP/IP connection service > > over IB. I'm assuming that there's no need to define something similar > > for iWarp.
> > Not sure if this is relevant for your intended base, but ITAPI 2.0 > for iWARP has a 16-byte IOH (IRD/ORD header) > in MPA "private data" to note RDMA initiator/responder limits > like the corresponding fields in the IB CM protocol. > > -ted > From rolandd at cisco.com Mon Oct 24 21:33:30 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 24 Oct 2005 21:33:30 -0700 Subject: [openib-general] [PATCH] Fix for MAD layer DMA mappings In-Reply-To: (Sean Hefty's message of "Fri, 21 Oct 2005 11:27:28 -0700") References: Message-ID: <524q765i79.fsf@cisco.com> Yeah, this looks fine to check in. I have a couple of trivial cleanups I'd like to do for sa_query.c and user_mad.c, but I can do them after you commit. Also, this chunk: > --- trunk/src/linux-kernel/infiniband/core/smi.h (revision 3830) > +++ trunk/src/linux-kernel/infiniband/core/smi.h (working copy) > @@ -35,10 +35,11 @@ > * > * $Id$ > */ > - > #ifndef __SMI_H_ > #define __SMI_H_ > looks bogus -- I'd prefer to keep that blank line there, since it looks ugly to me to have the #ifndef right after the closing */. - R. From liran at mellanox.co.il Mon Oct 24 22:35:01 2005 From: liran at mellanox.co.il (Liran Sorani) Date: Tue, 25 Oct 2005 07:35:01 +0200 Subject: [openib-general] Osmtest removal from Gen2 main trunk Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E35AB6BD@mtlexch01.mtl.com> Hi , Hal . Since now the Osmtest is updated (in all stack flavours) under ibtp repository (https://openib.org/svn/trunk/contrib/mellanox/ibtp/), I'd like to remove it from main trunk : https://openib.org/svn/gen2/trunk/src/userspace/management/osm/osmtest. New updates will be checked into ibtp repository only , thanks . 
-----Original Message----- From: Liran Sorani Sent: Sunday, October 23, 2005 9:01 AM To: 'Hal Rosenstock'; Liran Sorani Cc: openib-general at openib.org Subject: RE: [openib-general] InfiniBand Test Project (IBTP) - Update Currently only a minor bug fix in osmt_service flow , and cosmetics changes to fit WinIb stack . -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thursday, October 20, 2005 1:01 PM To: Liran Sorani Cc: openib-general at openib.org Subject: RE: [openib-general] InfiniBand Test Project (IBTP) - Update On Thu, 2005-10-20 at 03:49, Liran Sorani wrote: > Hi , Hal . > The Linux & WinIB are the same , except for several cosmetic changes . I was referring to the (differences in the) Linux one in ibtp and the Linux one under gen2/trunk. > Regarding Makefile.in , it's an outcome of autogen , I'll remove it . Thanks. -- Hal > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, October 19, 2005 10:25 PM > To: Liran Sorani > Cc: openib-general at openib.org > Subject: Re: [openib-general] InfiniBand Test Project (IBTP) - Update > > > On Wed, 2005-10-19 at 15:33, Liran Sorani wrote: > > Hi , > > We've updated IBTP tree with Osmtest sources both on ibal (WinIB) > and > > Gen2 stacks : > > > https://openib.org/svn/trunk/contrib/mellanox/ibtp/ibal/ulp/opensm/user/osmt est > > > > > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management /osm/osmtest > > > > Osmtest is the main verification tool for OpenSM , include various > SA > > (Good / Bad) flows. > > Attached to each directory a short README file for setup and usage > > information. > > How is the Linux one different from osmtest in the trunk ? > > Also, (nit): > I think > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management /osm/osmtest/Makefile.in > is a generated file and should be removed. > > -- Hal > > > > Liran Sorani > > > Mellanox Technologies LTD. 
> > > mailto:liran at mellanox.co.il > > > Phone: +972(4)9097200 Ext: 214 > > > Israel, Yokneam P.O.B 586 ZIP 20692 > > > > > > > > > > > > > > > ______________________________________________________________________ > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eitan at mellanox.co.il Tue Oct 25 02:19:59 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 25 Oct 2005 11:19:59 +0200 Subject: [openib-general] [RFC] OpenSM Interactive Console In-Reply-To: <1130188003.4397.15083.camel@hal.voltaire.com> References: <1130188003.4397.15083.camel@hal.voltaire.com> Message-ID: <435DF8BF.9080905@mellanox.co.il> Hal Rosenstock wrote: > On Mon, 2005-10-24 at 14:38,
Eitan Zahavi wrote: > >>Hal Rosenstock wrote: >> >>>On Mon, 2005-10-24 at 03:08, Eitan Zahavi wrote: >>> >>> >>>>I would suggest to use SNMP for the tasks below. IETF IPoIB group > > has > >>>>defined an SNMP MIB that can support the required functionality > > below. > >>> >>>The IETF SNMP MIBs are one way of presenting the information to the >>>outside world. There are other possible management interfaces. The > > SNMP > >>>MIB instrumentation would need to use lower layer APIs to get this >>>information out of the SM. >> >>Yes but the IETF SM MIB is the only one that is close to a standard > > way. > >>It does not require low level interface if it will integrate into the > > OpenSM code. > >>One way to do it is buy extending OpenSM with an AgentX interface. >> >>IMO one clear advantage of using SNMP for SM integration is that the > > code will work with any SM that is IETF compliant. > >>Also if you want to write a "client server" type of application on top > > of an SM you > >>can either stick to sending MADs which translate into SA client based > > application or > >>you better stay with some known protocol for management (like SNMP) > > and not develop yet another protocol for > >>doing exactly the same things as SNMP already supports. > > > There are limitations in the SNMP MIBs. One is that they are RO so they > are more for monitoring. Also, many environments do not use SNMP. It is > unclear how much of a requirement it is to manage any SM or how many > other SMs support the SM MIB. (There are other IB associated MIBs too). SNMP MIBs are certainly not just RO a simple example from the SM MIB: ibSmPortInfoLMC OBJECT-TYPE SYNTAX Unsigned32(0..7) MAX-ACCESS read-write STATUS current DESCRIPTION "LID mask for multipath support. User should take extra caution when setting this value, since any change will effect packet routing." ::= { ibSmPortInfoEntry 19 } I agree that it is possible that currently no SM is supporting the SM MIB. 
But it does make sense to have ALL of them support it, so that they can be activated/deactivated and configured in the same manner. Most Unix distributions and Windows boxes include a standard SNMP agent and client, so it does not take more than simple bash or C code to interact with the SM if it supports SNMP. > > >>>>Everything but the dynamic partitioning (OpenSM does not have >>>>partition manager to this moment) >>> >>> >>>What Troy meant by partitioning is not necessarily IB partitioning. >> >>How are you sure about that? Troy - please comment. > > > I think you missed an email on this. > > >>>>and forwarding of Performance >>>>Monitoring traps (which are generated by the PM) can be done through >>>>osmsh or through SA client today. >>> >>> >>>What PerfMgr are you referring to ? >> >>No specific one. But the specification does not require the SM too. > > > Huh ? What spec ? An SM is required in a subnet. There is no subnet > without this. There is a subnet without a PerfMgr. Yes, it's a typo; I meant PM. SM is a requirement. You know I did not mean that. > > >>For various reasons (like load) it might make more sense to have the > > PM distributed. > > Sure. Also, the PerfMgr need not be colocated with the SM anyhow. > > >>Anyway, my point is that the SM is not the owner of PM trap reporting. > > It is the PM that > >>should support Reporting (I.e InformInfo registration and Trap > > forwarding) for PM traps. > >>But the spec does not define such traps anyway. > > > My point was that the PerfMgr is beyond the IBA spec. It is only the PMA > that is defined and has no traps so these will all need synthesis by the > PerfMgr. Agree.
> > -- Hal > From eitan at mellanox.co.il Tue Oct 25 02:20:51 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 25 Oct 2005 11:20:51 +0200 Subject: [openib-general] Re: [RFC] OpenSM Interactive Console In-Reply-To: <1130187995.4397.15081.camel@hal.voltaire.com> References: <1130187995.4397.15081.camel@hal.voltaire.com> Message-ID: <435DF8F3.10200@mellanox.co.il> Hal Rosenstock wrote: > On Mon, 2005-10-24 at 14:24, Eitan Zahavi wrote: > >>>How do you get the old versions of this ? >> >>It is in the main trunk ... >> > > https://openib.org/svn/gen2/trunk/src/userspace/management/osm/doc/OpenS > M_UM.pdf > > That's older than 1.7.1 1.7.0 manuals you had mentioned. Maybe nothing > has changed in osmsh between the version of the manual in the trunk and > currently. Yes correct no API changed > > -- Hal > From umaxx at oleco.net Tue Oct 25 02:39:35 2005 From: umaxx at oleco.net (Joerg Zinke) Date: Tue, 25 Oct 2005 11:39:35 +0200 Subject: [openib-general] question about poll_cq() Message-ID: <20051025113935.42db75ac@marvin.local> hi, i hope it's ok to ask this question here. i just want to know more about poll_cq(). the standard seems to define only the input/ouput-params. after reading the code i figured out that there is some kind of "mapping" from libibverbs to the device specific "plugin" (mthca). so it looks like that the ibv_poll_cq() call from the userspace (e.g. ibv_xx_pingpong) finally ends in a call of: mthca_poll_cq() in mthca_cq.c so my question is: can someone give me a brief summary what mthca_poll_cq (or mthca_poll_one()) is really doing? i want to know how polling the completion queue really works, there must be some kind of low-level acknowledge/response- message if a new entry enters the cq? where is the completion queue really located (on a rdma operation) - local or remote? or maybe give me some hints/links to docs where i can read more about it. thanks. 
regards, joerg zinke From caitlinb at broadcom.com Tue Oct 25 08:24:46 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 25 Oct 2005 08:24:46 -0700 Subject: [openib-general] question about poll_cq() Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020AF0@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Joerg Zinke > Sent: Tuesday, October 25, 2005 2:40 AM > To: openib-general at openib.org > Subject: [openib-general] question about poll_cq() > > hi, > > i hope it's ok to ask this question here. > i just want to know more about poll_cq(). > the standard seems to define only the input/ouput-params. > after reading the code i figured out that there is some kind > of "mapping" from libibverbs to the device specific "plugin" (mthca). > so it looks like that the ibv_poll_cq() call from the userspace (e.g. > ibv_xx_pingpong) finally ends in a call of: mthca_poll_cq() > in mthca_cq.c > > so my question is: can someone give me a brief summary what > mthca_poll_cq (or mthca_poll_one()) is really doing? > i want to know how polling the completion queue really works, > there must be some kind of low-level acknowledge/response- > message if a new entry enters the cq? > where is the completion queue really located (on a rdma > operation) - local or remote? > Understanding how a given device implements poll_cq is legitimate if the purpose is debugging and/or understanding memory/bus utilization. However, it is one of those things that you SHOULD NOT rely on when writing code that uses poll_cq. What will vary over time and model is where the work completions are stored (in device and/or host memory), how they are formatted, and whether they are self-contained or reference other data (such as the original work request).
The work completion that you get from a successful poll_cq may never have existed as that sequence of bytes until you made the call to poll_cq. If these things were not deliberately undefined then there would be no need for both ibv_poll_cq and device-specific methods. For a discussion on what you can assume about a CQ across all devices I'd suggest reviewing the IBTA and/or RDMAC verbs. But basically, about all that is guaranteed is that it is a reliable, ordered queue that the Consumer MUST provision adequately so as to avoid overflows. From mshefty at ichips.intel.com Tue Oct 25 08:42:41 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 25 Oct 2005 08:42:41 -0700 Subject: [openib-general] [PATCH] Fix for MAD layer DMA mappings In-Reply-To: References: Message-ID: <435E5271.3040507@ichips.intel.com> Sean Hefty wrote: > The following patch should fix the MAD layer's DMA mapping issue. This > patch includes all related patches that were previously posted. The fix > involved changing the MAD layer API. All callers must now use the MAD > layer to allocate and free send MADs. DMA mappings are done by the MAD > layer. These changes have been committed. Let me know if I broke anything. - Sean From caitlinb at broadcom.com Tue Oct 25 08:53:28 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 25 Oct 2005 08:53:28 -0700 Subject: [openib-general] RFC userspace CMA Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020AF2@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hefty > Sent: Monday, October 24, 2005 5:06 PM > To: openib > Subject: [openib-general] RFC userspace CMA > > I'm soliciting any comments that anyone might have on the > general design for the userspace CMA before I get too far > into the implementation. > > - The API will match the kernel API for the most part.
The > exception is that event handling will match other userspace > libraries (get/ack event). > > - There will be a single CMA device exported through > /sys/class/infiniband. > > - The kernel CMA will be modified to remove the requirement > to use rdma_create_qp(). Users that want to allocate and > manage their own QP states will be able to specify QP > attributes (qpn, qp_type, srq) through the rdma_conn_param structure. > Why? Every CM interface I've dealt with has had the Consumer create and configure the QP on each end. On the active side the QP is supplied with the connect request. On the passive side it is supplied with the accept. State modifications and other configuration changes were done by the CM based on the Consumer having passed in the handle. > - The kernel CMA will expose a new call, rdma_init_qp_attr() > to initialize QP attributes used to modify the state of the > QP. The call will be similar to the infiniband CM routine. > Use of this call is optional. The CMA will automatically > transition QPs created by rdma_create_qp(). > > - The uCMA will open devices for users and return them the > device context with related events. The uCMA will close the > device if there are no rdma_cma_id's associated with it. > > - To support device add, the uCMA will need a new verb's call: > ibv_open_device_by_guid(). If a connection request occurs > for a device that is not yet known by the uCMA, it will open > the device. > Why does the uCMA need to open HCAs? Why does it have to be anything other than a front-end to the kCMA? I can see where a user-mode daemon might get a connection request that could only be answered with a QP on a device that it had not previously opened, and that opening that device based on information in the Connection Request would be useful -- but that still doesn't have the uCMA opening the device. More importantly, the problem can be solved just as easily by having the listener listen only on RDMA devices that it has already opened.
From Arkady.Kanevsky at netapp.com Tue Oct 25 09:00:09 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 25 Oct 2005 12:00:09 -0400 Subject: [openib-general] round 2 - proposal for socket based connection model Message-ID: Dear OpenIB, SWG and DAT members, enclosed is the second version of the proposal. There are really 2 proposals that are related. The first one is encoding the IP 5-tuple into REQ private data with small additional info for versioning and IB capabilities. The second is just a couple of ideas, not a real proposal, on mapping of IP ports to IB Service IDs. Thanks everybody for tons of feedback and deep discussions. I apologize if I have missed something. Happy reading, Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: IP Address Support by InfiniBand CM_v2.pdf Type: application/octet-stream Size: 55124 bytes Desc: IP Address Support by InfiniBand CM_v2.pdf URL: From mshefty at ichips.intel.com Tue Oct 25 09:07:25 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 25 Oct 2005 09:07:25 -0700 Subject: [openib-general] RFC userspace CMA In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020AF2@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020AF2@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <435E583D.50309@ichips.intel.com> Caitlin Bestler wrote: >>- The kernel CMA will be modified to remove the requirement >>to use rdma_create_qp(). Users that want to allocate and >>manage their own QP states will be able to specify QP >>attributes (qpn, qp_type, srq) through the rdma_conn_param structure. > > Why? If the userspace CMA talks to the kernel CMA, then the kernel CMA cannot transition the QP.
There's not even a valid handle. The alternative is to have the userspace CMA talk to userspace IB CM, SA query, and address translation modules. A user of the kernel CMA can still call rdma_create_qp() and have the kernel CMA transition it for them. The same is true for userspace applications. > Why does the uCMA need to open HCAs? Why does it have to be > anything other than a front-end to the kCMA? The kernel CMA abstracts device addition/removal from the user. To accomplish the same goals with the userspace CMA, the uCMA needs to open/close the device. If the user opens and closes the device, then API changes are necessary. I don't see any benefit for the user to open the device, since it requires users to search for devices based on some sort of identifier. > I can see where a user-mode daemon might get a connection > request that could only be answered with a QP on a device > that it had not previously opened, and that opening that > device based on information in the Connection Request would > be useful -- but that still doesn't have the uCMA opening > the device. More importantly, the problem can be solved > just as easily by having the listener listen only on rdma > devices that it has already opened. This results in the listen call operating on an RDMA device rather than on an IP address; listening on an IP address is the intent of the API. - Sean From rolandd at cisco.com Tue Oct 25 09:19:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 25 Oct 2005 09:19:25 -0700 Subject: [openib-general] question about poll_cq() In-Reply-To: <20051025113935.42db75ac@marvin.local> (Joerg Zinke's message of "Tue, 25 Oct 2005 11:39:35 +0200") References: <20051025113935.42db75ac@marvin.local> Message-ID: <52r7a94liq.fsf@cisco.com> Joerg> hi, i hope it's ok to ask this question here. i just want Joerg> to know more about poll_cq(). the standard seems to define Joerg> only the input/ouput-params.
after reading the code i Joerg> figured out that there is some kind of "mapping" from Joerg> libibverbs to the device specific "plugin" (mthca). so it Joerg> looks like that the ibv_poll_cq() call from the userspace Joerg> (e.g. ibv_xx_pingpong) finally ends in a call of: Joerg> mthca_poll_cq() in mthca_cq.c Yes, if the underlying device is a Mellanox HCA. PathScale HCAs will end up in a function in libipathverbs, and IBM eHCAs will end up in libehca. Joerg> so my question is: can someone give me a brief summary what Joerg> mthca_poll_cq (or mthca_poll_one()) is really doing? i Joerg> want to know how polling the completion queue really works, Joerg> there must be some kind of low-level acknowledge/response- Joerg> message if a new entry enters the cq? where is the Joerg> completion queue really located (on a rdma operation) - Joerg> local or remote? A very brief sketch of what happens is that the device-specific implementation of CQs for Mellanox HCAs allocates a circular buffer in memory and passes the address to the hardware. The buffer is divided into fixed-size chunks, each of which represents one completion entry. Initially the buffer is cleared out, and every time the hardware adds an entry onto the completion queue, it sets a bit in that chunk to show that the entry is now valid. The driver polls the CQ by looking to see if the next chunk has the bit set. If it does, then the driver translates the entry from hardware format into standard struct ibv_wc format; if it doesn't, then the driver returns status indicating that the CQ is empty. Completion queues are always located in local system memory. - R. From Arkady.Kanevsky at netapp.com Tue Oct 25 09:18:42 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 25 Oct 2005 12:18:42 -0400 Subject: [openib-general] RE: round 2 - proposal for socket based connection model Message-ID: Fixed source -> destination for IP address on page 4. 
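Roland Dreier's description of CQ polling above (a circular buffer of fixed-size entries, a per-entry bit set by the hardware, and translation into a generic work completion) can be sketched roughly as follows. This is only an illustration under stated assumptions: the entry layout, the CQE_VALID bit, and every sketch_* name are hypothetical and are not the real mthca structures or the libibverbs API, and the "hardware" is simulated by an ordinary function. As Caitlin Bestler notes in the same thread, the actual format is device specific and consumers should not depend on it.

```c
#include <stdint.h>
#include <string.h>

#define CQ_ENTRIES 4        /* hypothetical ring size */
#define CQE_VALID  0x1u     /* hypothetical "entry is valid" bit */

/* Hypothetical hardware-format completion entry (not the real mthca layout). */
struct hw_cqe {
    uint32_t flags;         /* CQE_VALID is set once the entry has been written */
    uint32_t qpn;
    uint64_t wr_id;
    uint32_t byte_len;
};

/* Generic completion returned to the consumer, loosely modelled on struct ibv_wc. */
struct sketch_wc {
    uint64_t wr_id;
    uint32_t qp_num;
    uint32_t byte_len;
};

struct sketch_cq {
    struct hw_cqe ring[CQ_ENTRIES];  /* lives in local system memory */
    unsigned cons_index;             /* next entry the driver will examine */
    unsigned prod_index;             /* next entry the simulated hardware fills */
};

/* The buffer starts out cleared, so no entry is marked valid. */
static void sketch_cq_init(struct sketch_cq *cq)
{
    memset(cq, 0, sizeof(*cq));
}

/* Stand-in for the HCA adding a completion; returns -1 if the CQ would overflow. */
static int sketch_hw_complete(struct sketch_cq *cq, uint32_t qpn,
                              uint64_t wr_id, uint32_t byte_len)
{
    struct hw_cqe *cqe = &cq->ring[cq->prod_index % CQ_ENTRIES];

    if (cqe->flags & CQE_VALID)
        return -1;                   /* consumer has not drained this slot yet */
    cqe->qpn = qpn;
    cqe->wr_id = wr_id;
    cqe->byte_len = byte_len;
    cqe->flags |= CQE_VALID;         /* mark valid last, after the payload */
    cq->prod_index++;
    return 0;
}

/* Poll up to num_entries completions; returns how many were found (0 = CQ empty). */
static int sketch_poll_cq(struct sketch_cq *cq, int num_entries,
                          struct sketch_wc *wc)
{
    int n = 0;

    while (n < num_entries) {
        struct hw_cqe *cqe = &cq->ring[cq->cons_index % CQ_ENTRIES];

        if (!(cqe->flags & CQE_VALID))
            break;                   /* next slot not written yet: CQ is empty */

        /* Translate from the hardware format into the generic format. */
        wc[n].wr_id = cqe->wr_id;
        wc[n].qp_num = cqe->qpn;
        wc[n].byte_len = cqe->byte_len;

        cqe->flags &= ~CQE_VALID;    /* hand the slot back to the producer */
        cq->cons_index++;
        n++;
    }
    return n;
}
```

A caller would loop on sketch_poll_cq() until it returns 0, much as an application loops on ibv_poll_cq(). The overflow check in sketch_hw_complete() reflects the point made in the thread that the consumer must size the CQ so that it never overflows.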
Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -----Original Message----- From: Kanevsky, Arkady Sent: Tuesday, October 25, 2005 12:00 PM To: openib-general at openib.org; dat-discussions at yahoogroups.com; swg at infinibandta.org Subject: round 2 - proposal for socket based connection model Dear OpenIB, SWG and DAT members, enclosed is the second version of the proposal. There are really 2 proposals that are related. The first one is encoding the IP 5-tuple into REQ private data with small additional info for versioning and IB capabilities. The second is just a couple of ideas, not a real proposal, on mapping of IP ports to IB Service IDs. Thanks everybody for tons of feedback and deep discussions. I apologize if I have missed something. Happy reading, Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: IP Address Support by InfiniBand CM_v2.pdf Type: application/octet-stream Size: 55126 bytes Desc: IP Address Support by InfiniBand CM_v2.pdf URL: From caitlinb at broadcom.com Tue Oct 25 09:35:42 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 25 Oct 2005 09:35:42 -0700 Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020AF4@NT-SJCA-0751.brcm.ad.broadcom.com> On an IP network, a non-privileged user is generally not capable of forging a source IP address and is typically prevented from using certain source ports.

I would propose that the CM [MAY|SHOULD|MUST] enforce that a
non-privileged user can only use a Source IP Address and Port that
they would have been able to use following the normal stack path (or
what it would have been in the case that there is no conventional IP
stack associated with this path).

So if IPoIB is installed, you would not be able to use any address
that you would have been blocked from using over IPoIB. Or at least
you would not be guaranteed that you could.

I think that MUST is the correct level of enforcement, but it needs to
be clear that the CM and OS *MAY* do this checking and that a
userspace IB application cannot use the IB stack to perform IP
spoofing.

From Arkady.Kanevsky at netapp.com Tue Oct 25 09:39:12 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Tue, 25 Oct 2005 12:39:12 -0400
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
Message-ID:

Caitlin,
how does it change the proposed protocol?
Arkady

From caitlinb at broadcom.com Tue Oct 25 09:45:17 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Tue, 25 Oct 2005 09:45:17 -0700
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020AF7@NT-SJCA-0751.brcm.ad.broadcom.com>

I believe it requires a CM protocol version change, or a "IP Address
Header present" bit.

Basically, userspace consumers can supply *any* 72 bytes of private
data currently.
To maintain backwards compatibility you need an authenticator that
says "this IP header data vouched for by privileged components on this
end", and that authenticator cannot be within the private data.

The equivalent guarantee is provided on IP networks by the fact that
raw sockets are not accessible by non-privileged applications.

From mshefty at ichips.intel.com Tue Oct 25 09:56:01 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 25 Oct 2005 09:56:01 -0700
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020AF7@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F1020AF7@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <435E63A1.7070102@ichips.intel.com>

Caitlin Bestler wrote:
> I believe it requires a CM protocol version change, or a "IP Address
> Header present" bit.
>
> Basically, userspace consumers can supply *any* 72 bytes of private
> data currently.
> To maintain backwards compatibility you need an authenticator that
> says "this IP header data vouched for by privileged components on
> this end", and that authenticator cannot be within the private data.

I believe that the solution is to keep the CM protocol as is.  The CM
private data should be completely controlled by the service.  The IB
CM does not care if an IP address is in the private data or not.

My reading of the proposal is that it defines a private data format
that a particular service may or may not use.

- Sean

From Arkady.Kanevsky at netapp.com Tue Oct 25 10:03:15 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Tue, 25 Oct 2005 13:03:15 -0400
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
Message-ID:

Correct.
But this does bring up the question of how the responder CM knows that
it needs to parse the private data. I suspect this will be done via a
new version of CM. But usage of some of the CM REQ reserved fields is
also possible.
In other words, the current CM version assumes that CM only supports
one version and there is no need to support more than 1 version.
This proposal may change this assumption.
Arkady

From caitlinb at broadcom.com Tue Oct 25 10:03:11 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Tue, 25 Oct 2005 10:03:11 -0700
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020AFB@NT-SJCA-0751.brcm.ad.broadcom.com>

> -----Original Message-----
> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Tuesday, October 25, 2005 9:56 AM
> To: Caitlin Bestler
> Cc: Kanevsky, Arkady; dat-discussions at yahoogroups.com;
> openib-general at openib.org; swg at infinibandta.org
> Subject: Re: [openib-general] RE: [dat-discussions] round 2 -
> proposal for socket based connection model
>
> Caitlin Bestler wrote:
> > I believe it requires a CM protocol version change, or a
> > "IP Address Header present" bit.
> >
> > Basically, userspace consumers can supply *any* 72 bytes of
> > private data currently.
> > To maintain backwards compatibility you need an authenticator
> > that says "this IP header data vouched for by privileged
> > components on this end", and that authenticator cannot be within
> > the private data.
>
> I believe that the solution is to keep the CM protocol as is.
> The CM private data should be completely controlled by the
> service.  The IB CM does not care if an IP address is in the
> private data or not.
>
> My reading of the proposal is that it defines a private data
> format that a particular service may or may not use.
>

Is that because you do not agree that there is a problem?
Or is it that you think the gap between this and existing IP
connection semantics is small enough that it is better to cover
it with a disclosure than by changing the CM protocol?

How would you advise an application that uses the remote address to
check an Access Control List (such as an NFS daemon) to treat
this data?

On an IP network the remote IP Address/port was vouched for
by the remote kernel at the minimum, and MAY have been authenticated
by each routing element along the way. Private data supplied through
the existing CM protocol has neither of those safeguards.

From sean.hefty at intel.com Tue Oct 25 10:04:56 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 25 Oct 2005 10:04:56 -0700
Subject: [openib-general] round 2 - proposal for socket based connectionmodel
In-Reply-To:
Message-ID:

> Dear OpenIB, SWG and DAT members,
> enclosed is the second version of the proposal.
> There are really 2 proposals that are related.
> The first one is encoding IP 5-tuple into REQ private data
> with small additional info for versioning and IB capabilities.
> The second is just a couple of ideas, not a real proposal,
> on mapping of IP ports to IB Service IDs.

Comments on the private data format:

Combine major/minor version into a single field.  There's no advantage
to having two fields, so keep it simple.

Remove ZB and SI bits.  These are unrelated to socket addressing.

If the destination port number is encoded in a service ID, then it can
be removed from the private data.

The transport protocol number could also be encoded in the service ID
and removed from the private data.

Actually, the version, IP version, and source port could all be
encoded in the service ID, limiting the private data to just 32 bytes
of IP addresses.

- Sean

From mshefty at ichips.intel.com Tue Oct 25 10:07:44 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 25 Oct 2005 10:07:44 -0700
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
In-Reply-To:
References:
Message-ID: <435E6660.6030705@ichips.intel.com>

Kanevsky, Arkady wrote:
> Correct.
> But this does bring up the question of how the responder CM knows
> that it needs to parse the private data. I suspect this will be
> done via a new version of CM.
> But usage of some of the CM REQ reserved fields is also possible.
> In other words, the current CM version assumes that CM only supports
> one version and there is no need to support more than 1 version.

The responder knows how to parse the private data based on the service
ID that they're listening on.  This is how it's done today, and how it
will still need to be done.  What is the motivation to change it?

What data is beyond the addressing?  How does the responder know how
to interpret that?

- Sean

From caitlinb at broadcom.com Tue Oct 25 10:21:51 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Tue, 25 Oct 2005 10:21:51 -0700
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020AFC@NT-SJCA-0751.brcm.ad.broadcom.com>

> -----Original Message-----
> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Tuesday, October 25, 2005 10:08 AM
> To: Kanevsky, Arkady
> Cc: Caitlin Bestler; dat-discussions at yahoogroups.com;
> openib-general at openib.org; swg at infinibandta.org
> Subject: Re: [openib-general] RE: [dat-discussions] round 2 -
> proposal for socket based connection model
>
> Kanevsky, Arkady wrote:
> > Correct.
> > But this does bring up the question of how the responder CM
> > knows that it needs to parse the private data. I suspect this
> > will be done via a new version of CM.
> > But usage of some of the CM REQ reserved fields is also possible.
> > In other words, the current CM version assumes that CM only
> > supports one version and there is no need to support more than
> > 1 version.
>
> The responder knows how to parse the private data based on
> the service ID that they're listening on.  This is how it's
> done today, and how it will still need to be done.  What is
> the motivation to change it?
>
> What data is beyond the addressing?  How does the responder
> know how to interpret that?
>

I agree, the listener is responsible for knowing what format
the Private Data is supposed to be in. Therefore it knows in
advance what portions of it are relevant to the CM (the IP
address information and/or the ITAPI IRD/ORD pre-header).
So the listen request can specify the required CM parsing.

But that does not prevent a non-privileged application from
forging the IP address information. These connection requests
are being presented to daemons as though they had the same
degree of authentication as address headers in an IP network
could have. The latter can be quite high when switches and
routers validate source addresses versus arriving ports.

From tom at opengridcomputing.com Tue Oct 25 10:23:33 2005
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 25 Oct 2005 12:23:33 -0500
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020AF4@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F1020AF4@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <1130261013.9790.2.camel@trinity.austin.ammasso.com>

What does this have to do with the protocol?

From Arkady.Kanevsky at netapp.com Tue Oct 25 10:25:51 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Tue, 25 Oct 2005 13:25:51 -0400
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
Message-ID:

Think of a single API that supports iWARP and IB (transport
independent API).
To a connection listener it provides the IP 5-tuple + private data.
For IB it means that CM parses REQ and extracts IP 5-tuple as separate
fields from private data.
Listener does not parse the private data encoding of the proposal.
So CM needs to know if it needs to encode the IP 5-tuple on the
requestor side and if it needs to parse it on the responder side.
Arkady

From mshefty at ichips.intel.com Tue Oct 25 10:26:55 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 25 Oct 2005 10:26:55 -0700
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020AFB@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F1020AFB@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <435E6ADF.8060103@ichips.intel.com>

Caitlin Bestler wrote:
> Is that because you do not agree that there is a problem?
> Or is it that you think the gap between this and existing IP
> connection semantics is small enough that it is better to cover
> it with a disclosure than by changing the CM protocol?

I would define the problem as: applications want to connect over IB
using IP addressing.  Defining the CM REQ private data is only a small
part of the solution (reverse lookup).

> On an IP network the remote IP Address/port was vouched for
> by the remote kernel at the minimum, and MAY have been authenticated
> by each routing element along the way. Private data supplied through
> the existing CM protocol has neither of those safeguards.

I think that security is a separate issue outside of this.  I have no
idea what OS is running on a remote system, let alone how it may have
verified an address.

That said, the kernel CMA would set this data based on information
that it collects.  But only users of the CMA would have this
additional protection.

- Sean

From caitlinb at broadcom.com Tue Oct 25 10:31:16 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Tue, 25 Oct 2005 10:31:16 -0700
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020AFD@NT-SJCA-0751.brcm.ad.broadcom.com>

> -----Original Message-----
> From: Tom Tucker [mailto:tom at opengridcomputing.com]
> Sent: Tuesday, October 25, 2005 10:24 AM
> To: Caitlin Bestler
> Cc: DAT Collaborative; openib-general at openib.org; swg at infinibandta.org
> Subject: Re: [openib-general] RE: [dat-discussions] round 2 -
> proposal for socket based connection model
>
> What does this have to do with the protocol?
>

It's a whopping big security vulnerability.

The application is left with an expectation that the address
is more validated than it is. Admittedly even on an IP network
it is not perfectly authenticated, but with this protocol the
remote address information is far less authenticated and
trivially spoofed.

From mshefty at ichips.intel.com Tue Oct 25 10:34:16 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 25 Oct 2005 10:34:16 -0700
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
In-Reply-To:
References:
Message-ID: <435E6C98.5020406@ichips.intel.com>

Kanevsky, Arkady wrote:
> Think of a single API that supports iWARP and IB (transport
> independent API).

The CMA implements this today and did not require any changes to the
IB CM.

> To a connection listener it provides the IP 5-tuple + private data.
> For IB it means that CM parses REQ and extracts IP 5-tuple as
> separate fields from private data.

Why push this down into the CM?  The CM should operate on IB
addresses, not IP addresses.  The mapping of IP addresses to IB
addresses is done at a higher level.

> Listener does not parse the private data encoding of the proposal.

The listener is the one who cares about the IP addressing.
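
[Editorial sketch] To illustrate the point above that the listener, which knows from its service ID what private-data format to expect, is the party that decodes the IP addressing: the proposal's actual layout is in the scrubbed PDF attachment and is not reproduced anywhere in this thread, so the byte layout below is purely hypothetical (a version byte, an IP version byte, a source port, and two 16-byte addresses, loosely following the fields named in the comments on the private data format). The names sketch_parse_private_data and struct sketch_ip_hdr are illustrative only and are not part of any CM or CMA API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical private-data header, NOT the layout from the proposal:
 *   byte  0      format version
 *   byte  1      IP version (4 or 6)
 *   bytes 2-3    source port, network byte order
 *   bytes 4-19   source IP address (16 bytes)
 *   bytes 20-35  destination IP address (16 bytes)
 */
#define SKETCH_HDR_LEN 36

struct sketch_ip_hdr {
    uint8_t  version;
    uint8_t  ip_version;
    uint16_t src_port;      /* host byte order after parsing */
    uint8_t  src_addr[16];
    uint8_t  dst_addr[16];
};

/*
 * Decode the hypothetical header from REQ private data.
 * Returns 0 on success, -1 if the data is too short or malformed.
 * Nothing here authenticates the addresses; the sender's word is
 * taken as-is, which is the concern raised elsewhere in the thread.
 */
static int sketch_parse_private_data(const uint8_t *pd, size_t len,
                                     struct sketch_ip_hdr *out)
{
    assert(pd != NULL && out != NULL);

    if (len < SKETCH_HDR_LEN)
        return -1;
    if (pd[1] != 4 && pd[1] != 6)
        return -1;

    out->version    = pd[0];
    out->ip_version = pd[1];
    out->src_port   = (uint16_t)((pd[2] << 8) | pd[3]);
    memcpy(out->src_addr, pd + 4, 16);
    memcpy(out->dst_addr, pd + 20, 16);
    return 0;
}
```

In this sketch the CM never looks inside the buffer; only the service that registered the listen calls the parser, and bytes after the header remain opaque consumer data.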

- Sean

From tom at opengridcomputing.com Tue Oct 25 10:37:11 2005
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 25 Oct 2005 12:37:11 -0500
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020AFC@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F1020AFC@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <1130261831.9790.16.camel@trinity.austin.ammasso.com>

On Tue, 2005-10-25 at 10:21 -0700, Caitlin Bestler wrote:
> I agree, the listener is responsible for knowing what format
> the Private Data is supposed to be in. Therefore it knows in
> advance what portions of it are relevant to the CM (the IP
> address information and/or the ITAPI IRD/ORD pre-header).
> So the listen request can specify the required CM parsing.
>
> But that does not prevent a non-privileged application from
> forging the IP address information. These connection requests
> are being presented to daemons as though they had the same
> degree of authentication as address headers in an IP network
> could have. The latter can be quite high when switches and
> routers validate source addresses versus arriving ports.

I believe that the assurances you are talking about are peculiar to an
implementation, not to the network.

The CMA is what is preparing the private data header, not the app. WRT
an IB CM app, it could very easily pretend to be a "CMA App" and build
its own private data that spoofed the address. How would the local CM
know that it is supposed to verify this? Where is the service
id/private data format mapping database?

In short, I think we are mixing many different things together here.

> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From rolandd at cisco.com Tue Oct 25 10:42:18 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Tue, 25 Oct 2005 10:42:18 -0700
Subject: [openib-general] [PATCH] Fix for MAD layer DMA mappings
In-Reply-To: (Sean Hefty's message of "Fri, 21 Oct 2005 11:27:28 -0700")
References:
Message-ID: <52ll0h4hol.fsf@cisco.com>

BTW, I just tried this on my PPC 4xx system, and the MAD layer works
fine now.  The port makes it to active and IPoIB works as well.

Thanks,
  Roland

From Arkady.Kanevsky at netapp.com Tue Oct 25 10:43:07 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Tue, 25 Oct 2005 13:43:07 -0400
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
Message-ID:

Sean,
The reason IBTA is interested in addressing the IP address issue
is that multiple ULPs and APIs want to support the
socket based connection model. Sure, each one of them
can define its own protocol (for private data).
But this will not ensure interoperability.
Arkady

From mshefty at ichips.intel.com Tue Oct 25 10:51:46 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 25 Oct 2005 10:51:46 -0700
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
In-Reply-To:
References:
Message-ID: <435E70B2.4070400@ichips.intel.com>

Kanevsky, Arkady wrote:
> Sean,
> The reason IBTA is interested in addressing the IP address issue
> is that multiple ULPs and APIs want to support the
> socket based connection model. Sure, each one of them
> can define its own protocol (for private data).
> But this will not ensure interoperability.

There's no interoperability between different ULPs anyway.  Each does
define its own protocol.  Trying to standardize part of the CM REQ
private data doesn't help in this regard.

- Sean

From caitlinb at broadcom.com Tue Oct 25 10:51:48 2005
From: caitlinb at broadcom.com (Caitlin Bestler)
Date: Tue, 25 Oct 2005 10:51:48 -0700
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020B00@NT-SJCA-0751.brcm.ad.broadcom.com>

>
> I believe that the assurances you are talking about are
> peculiar to an implementation, not to the network.
>

I disagree. Anytime you send an IP datagram on an IP network
you are expected to provide an authentic source address. Any
intermediate network device MAY enforce that rule and drop packets
with invalid source addresses.

IP Addresses stored in private data, by contrast, are guaranteed
to pass all middleboxes unmolested without review or validation.

This is not a spoofer taking advantage of a lazy network admin,
this is a spoofer being given a "get out of jail free" card that
says the network admin is not even allowed to do spot checks.

> The CMA is what is preparing the private data header, not the
> app. WRT an IB CM app, it could very easily pretend to be a
> "CMA App" and build its own private data that spoofed the
> address. How would the local CM know that it is supposed to
> verify this? Where is the service id/private data format
> mapping database?
>
> In short, I think we are mixing many different things together here.
>

For the very same reasons that a userspace consumer is not allowed
to pretend to be the CM itself, it should not be allowed to just
make up Source IP Addresses. If it's going to lie it needs to be
a privileged liar.

Preserving the existing CM infrastructure is fine, but not if it
forces us to take something that should be authenticated by
privileged software and simply trust that userspace code will
fill it in correctly.

From Arkady.Kanevsky at netapp.com Tue Oct 25 10:59:22 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Tue, 25 Oct 2005 13:59:22 -0400
Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model
Message-ID:

It is APIs, not ULPs, that are the concern.
Each ULP can define its own protocol.
But APIs can not.
But defining a protocol for each ULP is also bad.
This proposal defines it for all ULPs.
If ULP uses API, it does the parsing.
If ULP uses verbs it can do the parsing and encoding itself.
But in the latter case it will have to have a different
ULP CM for each transport. Bad idea.
Arkady

Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance                     phone: 781-768-5395
375 Totten Pond Rd.                   Fax: 781-895-1195
Waltham, MA 02451-2010                central phone: 781-768-5300

> -----Original Message-----
> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Tuesday, October 25, 2005 1:52 PM
> To: Kanevsky, Arkady
> Cc: Caitlin Bestler; openib-general at openib.org; swg at infinibandta.org
> Subject: Re: [openib-general] RE: [dat-discussions] round 2 -
> proposal for socket based connection model
>
>
> Kanevsky, Arkady wrote:
> > Sean,
> > The reason IBTA is interested to address IP address issue
> > is because of multiple ULPs and APIs want to support
> > socket based connection model.
Sure each one of them > > can define its own protocol (for private data). > > But this will not ensure interoperability. > > There's no interoperability between different ULPs anyway. > Each does define its > own protocol. Trying to standardize part of the CM REQ > private data doesn't > help in this regard. > > - Sean > From rolandd at cisco.com Tue Oct 25 10:59:27 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 25 Oct 2005 10:59:27 -0700 Subject: [openib-general] Support for UC connections using the CM API? In-Reply-To: <52fyqq5xxa.fsf@cisco.com> (Roland Dreier's message of "Mon, 24 Oct 2005 15:53:53 -0700") References: <20051024213044.89732.qmail@web32505.mail.mud.yahoo.com> <435D5D4E.5000106@ichips.intel.com> <52fyqq5xxa.fsf@cisco.com> Message-ID: <52hdb54gw0.fsf@cisco.com> OK, I checked in the mthca QP transition table fix. By the way, I think we probably need to consolidate the checking of required/optional QP modify attributes so that mthca, ipath and ehca don't each have their own copy of the code (and their own bugs). - R. From mshefty at ichips.intel.com Tue Oct 25 11:09:42 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 25 Oct 2005 11:09:42 -0700 Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model In-Reply-To: References: Message-ID: <435E74E6.3000103@ichips.intel.com> Kanevsky, Arkady wrote: > It is APIs not ULPs that are concern. Yes - and an application that wants to use IP addressing instead of IB addressing should use a different API than that of the IB CM. Trying to define the IB CM to use anybody's favorite transport/network address is the wrong solution to the problem. That is a service level issue best left to the service that's trying to perform the mapping. > Each ULP can define its own protocol. Each ULP does define its own protocol - connection or otherwise. SDP cannot talk to IPoIB which cannot talk to SRP. > If ULP uses API, it does the parsing. 
APIs are merely an interface. What needs to be defined is a service that can do the parsing. For Linux, the API to that service should be defined using the standard open source method. - Sean From tom at opengridcomputing.com Tue Oct 25 11:13:06 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 25 Oct 2005 13:13:06 -0500 Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020B00@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020B00@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <1130263986.9790.45.camel@trinity.austin.ammasso.com> On Tue, 2005-10-25 at 10:51 -0700, Caitlin Bestler wrote: > > > > > > I believe that the assurances you are talking about are > > peculiar to an implementation, not to the network. > > > > I disagree. Anytime you send an IP datagram on an IP network > you are expected to provide an authentic source address. Any > intermediate network device MAY enforce that rule and drop > packets with invalid source addresses. > I don't see anything in the protocol specs (RFC 791, RFC 793, ...) that talks about this, so we just have to agree to disagree. :-) > IP Addresses stored in private data, by contrast, are guaranteed > to pass all middleboxes unmolested without review of validation. > This is not a spoofer taking advantage of a lazy network admin, > this is a spoofer being given a "get out of jail free" card that > says the network admin is not even allowed to do spot checks. > > > The CMA is what is preparing the private data header, not the > > app. WRT a IB CM app, it could very easily pretend to be a > > "CMA App" and build it's own private data that spoofed the > > address. How would the local CM know that it is supposed to > > verify this? Where is the service id/private data format > > mapping database? > > > > In short, I think we are mixing many different things together here. 
> > > > > > For the very same reasons that a userspace consumer is not allowed > to pretend to be the CM itself, it should not be allowed to just > make up Source IP Addresses. If it's going to lie it needs to be > a privileged liar. > > Preserving the existing CM infrastructure is fine, but not if it > forces us to take something that should be authenticated by privileged > software and simply trust that userspace code will fill it in correctly. From caitlinb at broadcom.com Tue Oct 25 11:15:28 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 25 Oct 2005 11:15:28 -0700 Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020B02@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 25, 2005 11:10 AM > To: Kanevsky, Arkady > Cc: Caitlin Bestler; openib-general at openib.org; swg at infinibandta.org > Subject: Re: [openib-general] RE: [dat-discussions] round 2 - > proposal for socket based connection model > > Kanevsky, Arkady wrote: > > It is APIs not ULPs that are concern. > > Yes - and an application that wants to use IP addressing > instead of IB addressing should use a different API than that > of the IB CM. Trying to define the IB CM to use anybody's > favorite transport/network address is the wrong solution to > the problem. That is a service level issue best left to the > service that's trying to perform the mapping. > What you are proposing is an API that purports to have the semantics of TCP/IP connection establishment that can be implemented under non-IP transports such as InfiniBand. However, as proposed the mapping of this API to InfiniBand does *not* implement the semantics of TCP/IP connection establishment in that the remote address presented to the listener has been subject to *no* authentication. 
That is a change in the API that has an impact on the application. It is creating a requirement for the application to validate the remote identity greater than it would face for TCP/IP connection establishment. From arlin.r.davis at intel.com Tue Oct 25 11:17:50 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Tue, 25 Oct 2005 11:17:50 -0700 Subject: [openib-general] [PATCH] new uDAPL openIB provider using socket CM Message-ID: James, Here is a patch to add an optional openIB uDAPL provider that uses the socket CM for anyone having problems scaling out with the uCM/uAT version. To build the new provider, simply "make VERBS=openib_scm". This version does not require IPoIB, uCM, or uAT. -arlin Signed-off-by: Arlin Davis Index: dapl/udapl/Makefile =================================================================== --- dapl/udapl/Makefile (revision 3848) +++ dapl/udapl/Makefile (working copy) @@ -139,6 +139,16 @@ CFLAGS += -I/usr/local/include/infinib endif # +# OpenIB provider with Socket CM +# +ifeq ($(VERBS),openib_scm) +PROVIDER = $(TOPDIR)/../openib_scm +CFLAGS += -DOPENIB +CFLAGS += -DCQ_WAIT_OBJECT +CFLAGS += -I/usr/local/include/infiniband +endif + +# # If an implementation supports CM and DTO completions on the same EVD # then DAPL_MERGE_CM_DTO should be set # CFLAGS += -DDAPL_MERGE_CM_DTO=1 @@ -251,6 +261,13 @@ PROVIDER_SRCS = dapl_ib_util.c dapl_ib_ PROVIDER_SRCS += dapl_ib_cm.c dapl_ib_mem.c endif +ifeq ($(VERBS),openib_scm) +LDFLAGS += -libverbs +LDFLAGS += -rpath /usr/local/lib -L /usr/local/lib +PROVIDER_SRCS = dapl_ib_util.c dapl_ib_cq.c dapl_ib_qp.c \ + dapl_ib_cm.c dapl_ib_mem.c +endif + UDAPL_SRCS = dapl_init.c \ dapl_evd_create.c \ dapl_evd_query.c \ Index: dapl/openib_scm/dapl_ib_dto.h =================================================================== --- dapl/openib_scm/dapl_ib_dto.h (revision 0) +++ dapl/openib_scm/dapl_ib_dto.h (revision 0) @@ -0,0 +1,261 @@ +/* + * This Software is licensed under both of the following two licenses: +
* + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * OR + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, either one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_dto.h + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - DTO operations and CQE macros + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + **************************************************************************/ +#ifndef _DAPL_IB_DTO_H_ +#define _DAPL_IB_DTO_H_ + +#include "dapl_ib_util.h" + +#define DEFAULT_DS_ENTRIES 8 + +STATIC _INLINE_ int dapls_cqe_opcode(ib_work_completion_t *cqe_p); + +/* + * dapls_ib_post_recv + * + * Provider specific Post RECV function + */ +STATIC _INLINE_ DAT_RETURN +dapls_ib_post_recv ( + IN DAPL_EP *ep_ptr, + IN DAPL_COOKIE *cookie, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov ) +{ + ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; + ib_data_segment_t *ds_array_p; + struct ibv_recv_wr wr; + struct ibv_recv_wr *bad_wr; + DAT_COUNT i, total_len; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " post_rcv: ep %p cookie %p segs %d l_iov %p\n", + ep_ptr, cookie, segments, local_iov); + + if ( segments <= DEFAULT_DS_ENTRIES ) + ds_array_p = ds_array; + else + ds_array_p = dapl_os_alloc(segments * sizeof(ib_data_segment_t)); + + if (NULL == ds_array_p) + return (DAT_INSUFFICIENT_RESOURCES); + + /* setup work request */ + total_len = 0; + wr.next = 0; + wr.num_sge = 0; + wr.wr_id = (uint64_t)(uintptr_t)cookie; + wr.sg_list = ds_array_p; + + for (i = 0; i < segments; i++ ) { + if ( !local_iov[i].segment_length ) + continue; + + ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; + ds_array_p->length = local_iov[i].segment_length; + ds_array_p->lkey = local_iov[i].lmr_context; + + dapl_dbg_log ( DAPL_DBG_TYPE_EP, + " post_rcv: l_key 0x%x va %p len %d\n", + ds_array_p->lkey, ds_array_p->addr, + ds_array_p->length ); + + total_len += ds_array_p->length; + wr.num_sge++; + ds_array_p++; + } + + if (cookie != NULL) + cookie->val.dto.size = total_len; + + if (ibv_post_recv(ep_ptr->qp_handle, &wr, &bad_wr)) + return( dapl_convert_errno(EFAULT,"ibv_recv") ); + + return DAT_SUCCESS; +} + + +/* + * dapls_ib_post_send + * + * Provider specific Post SEND function + */ +STATIC _INLINE_ DAT_RETURN +dapls_ib_post_send ( + IN DAPL_EP *ep_ptr, + IN ib_send_op_type_t op_type, + IN 
DAPL_COOKIE *cookie, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_COMPLETION_FLAGS completion_flags) +{ + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " post_snd: ep %p op %d ck %p sgs %d l_iov %p r_iov %p f %d\n", + ep_ptr, op_type, cookie, segments, local_iov, + remote_iov, completion_flags); + + ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; + ib_data_segment_t *ds_array_p; + struct ibv_send_wr wr; + struct ibv_send_wr *bad_wr; + ib_hca_transport_t *ibt_ptr = &ep_ptr->header.owner_ia->hca_ptr->ib_trans; + DAT_COUNT i, total_len; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " post_snd: ep %p cookie %p segs %d l_iov %p\n", + ep_ptr, cookie, segments, local_iov); + + if( segments <= DEFAULT_DS_ENTRIES ) + ds_array_p = ds_array; + else + ds_array_p = dapl_os_alloc(segments * sizeof(ib_data_segment_t)); + + if (NULL == ds_array_p) + return (DAT_INSUFFICIENT_RESOURCES); + + /* setup the work request */ + wr.next = 0; + wr.opcode = op_type; + wr.num_sge = 0; + wr.send_flags = 0; + wr.wr_id = (uint64_t)(uintptr_t)cookie; + wr.sg_list = ds_array_p; + total_len = 0; + + for (i = 0; i < segments; i++ ) { + if ( !local_iov[i].segment_length ) + continue; + + ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; + ds_array_p->length = local_iov[i].segment_length; + ds_array_p->lkey = local_iov[i].lmr_context; + + dapl_dbg_log ( DAPL_DBG_TYPE_EP, + " post_snd: lkey 0x%x va %p len %d \n", + ds_array_p->lkey, ds_array_p->addr, + ds_array_p->length ); + + total_len += ds_array_p->length; + wr.num_sge++; + ds_array_p++; + } + + if (cookie != NULL) + cookie->val.dto.size = total_len; + + if ((op_type == OP_RDMA_WRITE) || (op_type == OP_RDMA_READ)) { + wr.wr.rdma.remote_addr = remote_iov->target_address; + wr.wr.rdma.rkey = remote_iov->rmr_context; + dapl_dbg_log ( DAPL_DBG_TYPE_EP, + " post_snd_rdma: rkey 0x%x va %#016Lx\n", + wr.wr.rdma.rkey, wr.wr.rdma.remote_addr ); + } + + /* inline data for send or write ops */ + if 
((total_len <= ibt_ptr->max_inline_send ) && + ((op_type == OP_SEND) || (op_type == OP_RDMA_WRITE))) + wr.send_flags |= IBV_SEND_INLINE; + + /* set completion flags in work request */ + wr.send_flags |= (DAT_COMPLETION_SUPPRESS_FLAG & + completion_flags) ? 0 : IBV_SEND_SIGNALED; + wr.send_flags |= (DAT_COMPLETION_BARRIER_FENCE_FLAG & + completion_flags) ? IBV_SEND_FENCE : 0; + wr.send_flags |= (DAT_COMPLETION_SOLICITED_WAIT_FLAG & + completion_flags) ? IBV_SEND_SOLICITED : 0; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " post_snd: op 0x%x flags 0x%x sglist %p, %d\n", + wr.opcode, wr.send_flags, wr.sg_list, wr.num_sge); + + if (ibv_post_send(ep_ptr->qp_handle, &wr, &bad_wr)) + return( dapl_convert_errno(EFAULT,"ibv_send") ); + + dapl_dbg_log (DAPL_DBG_TYPE_EP," post_snd: returned\n"); + return DAT_SUCCESS; +} + +STATIC _INLINE_ DAT_RETURN +dapls_ib_optional_prv_dat ( + IN DAPL_CR *cr_ptr, + IN const void *event_data, + OUT DAPL_CR **cr_pp) +{ + return DAT_SUCCESS; +} + +STATIC _INLINE_ int dapls_cqe_opcode(ib_work_completion_t *cqe_p) +{ + switch (cqe_p->opcode) { + case IBV_WC_SEND: + return (OP_SEND); + case IBV_WC_RDMA_WRITE: + return (OP_RDMA_WRITE); + case IBV_WC_RDMA_READ: + return (OP_RDMA_READ); + case IBV_WC_COMP_SWAP: + return (OP_COMP_AND_SWAP); + case IBV_WC_FETCH_ADD: + return (OP_FETCH_AND_ADD); + case IBV_WC_BIND_MW: + return (OP_BIND_MW); + case IBV_WC_RECV: + return (OP_RECEIVE); + case IBV_WC_RECV_RDMA_WITH_IMM: + return (OP_RECEIVE_IMM); + default: + return (OP_INVALID); + } +} + +#define DAPL_GET_CQE_OPTYPE(cqe_p) dapls_cqe_opcode(cqe_p) +#define DAPL_GET_CQE_WRID(cqe_p) ((ib_work_completion_t*)cqe_p)->wr_id +#define DAPL_GET_CQE_STATUS(cqe_p) ((ib_work_completion_t*)cqe_p)->status +#define DAPL_GET_CQE_BYTESNUM(cqe_p) ((ib_work_completion_t*)cqe_p)->byte_len +#define DAPL_GET_CQE_IMMED_DATA(cqe_p) ((ib_work_completion_t*)cqe_p)->imm_data + +#endif /* _DAPL_IB_DTO_H_ */ Index: dapl/openib_scm/dapl_ib_util.c
=================================================================== --- dapl/openib_scm/dapl_ib_util.c (revision 0) +++ dapl/openib_scm/dapl_ib_util.c (revision 0) @@ -0,0 +1,471 @@ +/* + * This Software is licensed under both of the following two licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * OR + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, either one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_util.c + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - init, open, close, utilities + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + **************************************************************************/ +#ifdef RCSID +static const char rcsid[] = "$Id: $"; +#endif + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_ib_util.h" + +#include +#include +#include +#include +#include + +int g_dapl_loopback_connection = 0; + +/* just get IP address for hostname */ +DAT_RETURN getipaddr( char *addr, int addr_len) +{ + struct sockaddr_in *ipv4_addr = (struct sockaddr_in*)addr; + struct hostent *h_ptr; + struct utsname ourname; + + if ( uname( &ourname ) < 0 ) + return DAT_INTERNAL_ERROR; + + h_ptr = gethostbyname( ourname.nodename ); + if ( h_ptr == NULL ) + return DAT_INTERNAL_ERROR; + + if ( h_ptr->h_addrtype == AF_INET ) { + ipv4_addr = (struct sockaddr_in*) addr; + ipv4_addr->sin_family = AF_INET; + dapl_os_memcpy( &ipv4_addr->sin_addr, h_ptr->h_addr_list[0], 4 ); + } else + return DAT_INVALID_ADDRESS; + + return DAT_SUCCESS; +} + +/* + * dapls_ib_init, dapls_ib_release + * + * Initialize Verb related items for device open + * + * Input: + * none + * + * Output: + * none + * + * Returns: + * 0 success, -1 error + * + */ +int32_t dapls_ib_init (void) +{ + return 0; +} + +int32_t dapls_ib_release (void) +{ + return 0; +} + +/* + * dapls_ib_open_hca + * + * Open HCA + * + * Input: + * *hca_name pointer to provider device name + * *ib_hca_handle_p pointer to provide HCA handle + * + * Output: + * none + * + * Return: + * DAT_SUCCESS + * dapl_convert_errno + * + */ +DAT_RETURN dapls_ib_open_hca ( + IN IB_HCA_NAME hca_name, + IN DAPL_HCA *hca_ptr) +{ + struct dlist *dev_list; + int opts; + DAT_RETURN dat_status = DAT_SUCCESS; + + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " open_hca: %s - %p\n", hca_name, hca_ptr ); + + /* Get list of all IB devices, find match, open */ + dev_list = ibv_get_devices(); + dlist_start(dev_list); + dlist_for_each_data(dev_list,hca_ptr->ib_trans.ib_dev,struct ibv_device) { + if (!strcmp(ibv_get_device_name(hca_ptr->ib_trans.ib_dev),hca_name)) + break; + } + 
+ if (!hca_ptr->ib_trans.ib_dev) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: IB device %s not found\n", + hca_name); + return DAT_INTERNAL_ERROR; + } + + dapl_dbg_log (DAPL_DBG_TYPE_UTIL," open_hca: Found dev %s %016llx\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + (unsigned long long)bswap_64(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev))); + + hca_ptr->ib_hca_handle = ibv_open_device(hca_ptr->ib_trans.ib_dev); + if (!hca_ptr->ib_hca_handle) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: IB dev open failed for %s\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); + return DAT_INTERNAL_ERROR; + } + + /* set inline max with environment or default */ + hca_ptr->ib_trans.max_inline_send = + dapl_os_get_env_val ( "DAPL_MAX_INLINE", INLINE_SEND_DEFAULT ); + + /* initialize cq_lock */ + dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.cq_lock); + if (dat_status != DAT_SUCCESS) + { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: failed to init cq_lock\n"); + goto bail; + } + + /* EVD events without direct CQ channels, non-blocking */ + hca_ptr->ib_trans.ib_cq = + ibv_create_comp_channel(hca_ptr->ib_hca_handle); + opts = fcntl(hca_ptr->ib_trans.ib_cq->fd, F_GETFL); /* uCQ */ + if (opts < 0 || fcntl(hca_ptr->ib_trans.ib_cq->fd, + F_SETFL, opts | O_NONBLOCK) < 0) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: ERR with CQ FD\n" ); + goto bail; + } + + if (dapli_cq_thread_init(hca_ptr)) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: cq_thread_init failed for %s\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); + goto bail; + } + + /* initialize cr_list lock */ + dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.lock); + if (dat_status != DAT_SUCCESS) + { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: failed to init lock\n"); + goto bail; + } + + /* initialize CM list for listens on this HCA */ + dapl_llist_init_head(&hca_ptr->ib_trans.list); + + /* create thread to process inbound connect request */ + hca_ptr->ib_trans.cr_state =
IB_THREAD_INIT; + dat_status = dapl_os_thread_create(cr_thread, + (void*)hca_ptr, + &hca_ptr->ib_trans.thread ); + if (dat_status != DAT_SUCCESS) + { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: failed to create thread\n"); + goto bail; + } + + /* wait for thread */ + while (hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 20000000; /* 20 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " open_hca: waiting for cr_thread\n"); + nanosleep (&sleep, &remain); + } + + /* get the IP address of the device */ + dat_status = getipaddr((char*)&hca_ptr->hca_address, + sizeof(DAT_SOCK_ADDR6) ); + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " open_hca: %s, port %d, %s %d.%d.%d.%d\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), hca_ptr->port_num, + ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_family == AF_INET ? "AF_INET":"AF_INET6", + ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 24 & 0xff ); + + return dat_status; +bail: + ibv_close_device(hca_ptr->ib_hca_handle); + hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; + return DAT_INTERNAL_ERROR; +} + + +/* + * dapls_ib_close_hca + * + * Close HCA + * + * Input: + * DAPL_HCA provide CA handle + * + * Output: + * none + * + * Return: + * DAT_SUCCESS + * dapl_convert_errno + * + */ +DAT_RETURN dapls_ib_close_hca ( IN DAPL_HCA *hca_ptr ) +{ + dapl_dbg_log (DAPL_DBG_TYPE_UTIL," close_hca: %p\n",hca_ptr); + + dapli_cq_thread_destroy(hca_ptr); + + if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) { + if (ibv_close_device(hca_ptr->ib_hca_handle)) + return(dapl_convert_errno(errno,"ib_close_device")); + hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; + } + + dapl_os_lock_destroy(&hca_ptr->ib_trans.cq_lock); + + /* destroy
cr_thread and lock */ + hca_ptr->ib_trans.cr_state = IB_THREAD_CANCEL; + while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 20000000; /* 20 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " close_hca: waiting for cr_thread\n"); + nanosleep (&sleep, &remain); + } + dapl_os_lock_destroy(&hca_ptr->ib_trans.lock); + + return (DAT_SUCCESS); +} + +/* + * dapls_ib_query_hca + * + * Query the hca attribute + * + * Input: + * hca_handl hca handle + * ia_attr attribute of the ia + * ep_attr attribute of the ep + * ip_addr ip address of DET NIC + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_HANDLE + */ + +DAT_RETURN dapls_ib_query_hca ( + IN DAPL_HCA *hca_ptr, + OUT DAT_IA_ATTR *ia_attr, + OUT DAT_EP_ATTR *ep_attr, + OUT DAT_SOCK_ADDR6 *ip_addr) +{ + struct ibv_device_attr dev_attr; + struct ibv_port_attr port_attr; + + if (hca_ptr->ib_hca_handle == NULL) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR," query_hca: BAD handle\n"); + return (DAT_INVALID_HANDLE); + } + + /* local IP address of device, set during ia_open */ + if (ip_addr != NULL) + memcpy(ip_addr, &hca_ptr->hca_address, sizeof(DAT_SOCK_ADDR6)); + + if (ia_attr == NULL && ep_attr == NULL) + return DAT_SUCCESS; + + /* query verbs for this device and port attributes */ + if (ibv_query_device(hca_ptr->ib_hca_handle, &dev_attr) || + ibv_query_port(hca_ptr->ib_hca_handle, + hca_ptr->port_num, &port_attr)) + return(dapl_convert_errno(errno,"ib_query_hca")); + + if (ia_attr != NULL) { + ia_attr->adapter_name[DAT_NAME_MAX_LENGTH - 1] = '\0'; + ia_attr->vendor_name[DAT_NAME_MAX_LENGTH - 1] = '\0'; + ia_attr->ia_address_ptr = (DAT_IA_ADDRESS_PTR)&hca_ptr->hca_address; + + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " query_hca: %s %s %d.%d.%d.%d\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_family == AF_INET ? 
"AF_INET":"AF_INET6", + ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 24 & 0xff ); + + ia_attr->hardware_version_major = dev_attr.hw_ver; + /* ia_attr->hardware_version_minor = dev_attr.fw_ver; */ + ia_attr->max_eps = dev_attr.max_qp; + ia_attr->max_dto_per_ep = dev_attr.max_qp_wr; + ia_attr->max_rdma_read_per_ep = dev_attr.max_qp_rd_atom; + ia_attr->max_evds = dev_attr.max_cq; + ia_attr->max_evd_qlen = dev_attr.max_cqe; + ia_attr->max_iov_segments_per_dto = dev_attr.max_sge; + ia_attr->max_lmrs = dev_attr.max_mr; + ia_attr->max_lmr_block_size = dev_attr.max_mr_size; + ia_attr->max_rmrs = dev_attr.max_mw; + ia_attr->max_lmr_virtual_address = dev_attr.max_mr_size; + ia_attr->max_rmr_target_address = dev_attr.max_mr_size; + ia_attr->max_pzs = dev_attr.max_pd; + ia_attr->max_mtu_size = port_attr.max_msg_sz; + ia_attr->max_rdma_size = port_attr.max_msg_sz; + ia_attr->num_transport_attr = 0; + ia_attr->transport_attr = NULL; + ia_attr->num_vendor_attr = 0; + ia_attr->vendor_attr = NULL; + + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " query_hca: (%x.%x) ep %d ep_q %d evd %d evd_q %d\n", + ia_attr->hardware_version_major, + ia_attr->hardware_version_minor, + ia_attr->max_eps, ia_attr->max_dto_per_ep, + ia_attr->max_evds, ia_attr->max_evd_qlen ); + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " query_hca: msg %llu rdma %llu iov %d lmr %d rmr %d\n", + ia_attr->max_mtu_size, ia_attr->max_rdma_size, + ia_attr->max_iov_segments_per_dto, ia_attr->max_lmrs, + ia_attr->max_rmrs ); + + } + + if (ep_attr != NULL) { + ep_attr->max_mtu_size = port_attr.max_msg_sz; + ep_attr->max_rdma_size = port_attr.max_msg_sz; + ep_attr->max_recv_dtos = dev_attr.max_qp_wr; + ep_attr->max_request_dtos = dev_attr.max_qp_wr; + ep_attr->max_recv_iov = 
dev_attr.max_sge; + ep_attr->max_request_iov = dev_attr.max_sge; + ep_attr->max_rdma_read_in = dev_attr.max_qp_rd_atom; + ep_attr->max_rdma_read_out= dev_attr.max_qp_rd_atom; + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " query_hca: MAX msg %llu dto %d iov %d rdma i%d,o%d\n", + ep_attr->max_mtu_size, + ep_attr->max_recv_dtos, ep_attr->max_recv_iov, + ep_attr->max_rdma_read_in, ep_attr->max_rdma_read_out); + } + + return DAT_SUCCESS; +} + +/* + * dapls_ib_setup_async_callback + * + * Set up an asynchronous callbacks of various kinds + * + * Input: + * ia_handle IA handle + * handler_type type of handler to set up + * callback_handle handle param for completion callbacks + * callback callback routine pointer + * context argument for callback routine + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN dapls_ib_setup_async_callback ( + IN DAPL_IA *ia_ptr, + IN DAPL_ASYNC_HANDLER_TYPE handler_type, + IN DAPL_EVD *evd_ptr, + IN ib_async_handler_t callback, + IN void *context ) + +{ + ib_hca_transport_t *hca_ptr; + + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " setup_async_cb: ia %p type %d handle %p cb %p ctx %p\n", + ia_ptr, handler_type, evd_ptr, callback, context); + + hca_ptr = &ia_ptr->hca_ptr->ib_trans; + switch(handler_type) + { + case DAPL_ASYNC_UNAFILIATED: + hca_ptr->async_unafiliated = + (ib_async_handler_t)callback; + hca_ptr->async_un_ctx = context; + break; + case DAPL_ASYNC_CQ_ERROR: + hca_ptr->async_cq_error = + (ib_async_cq_handler_t)callback; + break; + case DAPL_ASYNC_CQ_COMPLETION: + hca_ptr->async_cq = + (ib_async_dto_handler_t)callback; + break; + case DAPL_ASYNC_QP_ERROR: + hca_ptr->async_qp_error = + (ib_async_qp_handler_t)callback; + break; + default: + break; + } + return DAT_SUCCESS; +} + Index: dapl/openib_scm/dapl_ib_mem.c =================================================================== --- dapl/openib_scm/dapl_ib_mem.c (revision 0) +++ dapl/openib_scm/dapl_ib_mem.c 
(revision 0) @@ -0,0 +1,392 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ + +/********************************************************************** + * + * MODULE: dapl_det_mem.c + * + * PURPOSE: Intel DET APIs: Memory windows, registration, + * and protection domain + * + * $Id: $ + * + **********************************************************************/ + +#include /* for IOCTL's */ +#include /* for socket(2) and related bits and pieces */ +#include /* for socket(2) */ +#include /* for struct ifreq */ +#include /* for ARPHRD_ETHER */ +#include /* for _SC_CLK_TCK */ + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_lmr_util.h" + +/* + * dapls_convert_privileges + * + * Convert LMR privileges to provider + * + * Input: + * DAT_MEM_PRIV_FLAGS + * + * Output: + * none + * + * Returns: + * ibv_access_flags + * + */ +STATIC _INLINE_ int +dapls_convert_privileges ( + IN DAT_MEM_PRIV_FLAGS privileges) +{ + int access = 0; + + /* + * if (DAT_MEM_PRIV_LOCAL_READ_FLAG & privileges) do nothing + */ + if (DAT_MEM_PRIV_LOCAL_WRITE_FLAG & privileges) + access |= IBV_ACCESS_LOCAL_WRITE; + if (DAT_MEM_PRIV_REMOTE_WRITE_FLAG & privileges) + access |= IBV_ACCESS_REMOTE_WRITE; + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + access |= IBV_ACCESS_REMOTE_READ; + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + access |= IBV_ACCESS_REMOTE_READ; + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + access |= IBV_ACCESS_REMOTE_READ; + + return access; +} + +/* + * dapl_ib_pd_alloc + * + * Alloc a PD + * + * Input: + * ia_handle IA handle + * pz pointer to PZ struct + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_pd_alloc ( + IN DAPL_IA *ia_ptr, + IN DAPL_PZ *pz ) +{ + /* get a protection domain */ + pz->pd_handle = ibv_alloc_pd(ia_ptr->hca_ptr->ib_hca_handle); + if (!pz->pd_handle) + return(dapl_convert_errno(ENOMEM,"alloc_pd")); + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " pd_alloc: pd_handle=%p\n", + pz->pd_handle ); + + return DAT_SUCCESS; +} + +/* + * 
dapl_ib_pd_free + * + * Free a PD + * + * Input: + * ia_handle IA handle + * PZ_ptr pointer to PZ struct + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_pd_free ( + IN DAPL_PZ *pz ) +{ + if (pz->pd_handle != IB_INVALID_HANDLE) { + if (ibv_dealloc_pd(pz->pd_handle)) + return(dapl_convert_errno(errno,"dealloc_pd")); + pz->pd_handle = IB_INVALID_HANDLE; + } + return DAT_SUCCESS; +} + +/* + * dapl_ib_mr_register + * + * Register a virtual memory region + * + * Input: + * ia_handle IA handle + * lmr pointer to dapl_lmr struct + * virt_addr virtual address of beginning of mem region + * length length of memory region + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mr_register ( + IN DAPL_IA *ia_ptr, + IN DAPL_LMR *lmr, + IN DAT_PVOID virt_addr, + IN DAT_VLEN length, + IN DAT_MEM_PRIV_FLAGS privileges) +{ + ib_pd_handle_t ib_pd_handle; + + ib_pd_handle = ((DAPL_PZ *)lmr->param.pz_handle)->pd_handle; + + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + " mr_register: ia=%p, lmr=%p va=%p ln=%d pv=0x%x\n", + ia_ptr, lmr, virt_addr, length, privileges ); + + /* TODO: shared memory */ + if (lmr->param.mem_type == DAT_MEM_TYPE_SHARED_VIRTUAL) { + dapl_dbg_log( DAPL_DBG_TYPE_ERR, + " mr_register_shared: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); + } + + /* local read is default on IB */ + lmr->mr_handle = + ibv_reg_mr(((DAPL_PZ *)lmr->param.pz_handle)->pd_handle, + virt_addr, + length, + dapls_convert_privileges(privileges)); + + if (!lmr->mr_handle) + return(dapl_convert_errno(ENOMEM,"reg_mr")); + + lmr->param.lmr_context = lmr->mr_handle->lkey; + lmr->param.rmr_context = lmr->mr_handle->rkey; + lmr->param.registered_size = length; + lmr->param.registered_address = (DAT_VADDR)(uintptr_t) virt_addr; + + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + " mr_register: mr=%p h %x pd %p ctx %p ,lkey=0x%x, rkey=0x%x priv=%x\n", + 
lmr->mr_handle, lmr->mr_handle->handle, + lmr->mr_handle->pd, + lmr->mr_handle->context, + lmr->mr_handle->lkey, + lmr->mr_handle->rkey, + dapls_convert_privileges(privileges) ); + + return DAT_SUCCESS; +} + +/* + * dapls_ib_mr_deregister + * + * Free a memory region + * + * Input: + * lmr pointer to dapl_lmr struct + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_mr_deregister ( + IN DAPL_LMR *lmr ) +{ + if (lmr->mr_handle != IB_INVALID_HANDLE) { + if (ibv_dereg_mr(lmr->mr_handle)) + return(dapl_convert_errno(errno,"dereg_mr")); + lmr->mr_handle = IB_INVALID_HANDLE; + } + return DAT_SUCCESS; +} + + +/* + * dapls_ib_mr_register_shared + * + * Register a virtual memory region + * + * Input: + * ia_ptr IA handle + * lmr pointer to dapl_lmr struct + * virt_addr virtual address of beginning of mem region + * length length of memory region + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mr_register_shared ( + IN DAPL_IA *ia_ptr, + IN DAPL_LMR *lmr, + IN DAT_MEM_PRIV_FLAGS privileges ) +{ + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mr_register_shared: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * dapls_ib_mw_alloc + * + * Bind a protection domain to a memory window + * + * Input: + * rmr Initialized rmr to hold binding handles + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mw_alloc ( + IN DAPL_RMR *rmr ) +{ + + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_alloc: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * dapls_ib_mw_free + * + * Release bindings of a protection domain to a memory window + * + * Input: + * rmr Initialized rmr to hold binding handles + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_mw_free ( + IN DAPL_RMR *rmr
) +{ + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_free: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * dapls_ib_mw_bind + * + * Bind a protection domain to a memory window + * + * Input: + * rmr Initialized rmr to hold binding handles + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_PARAMETER; + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mw_bind ( + IN DAPL_RMR *rmr, + IN DAPL_LMR *lmr, + IN DAPL_EP *ep, + IN DAPL_COOKIE *cookie, + IN DAT_VADDR virtual_address, + IN DAT_VLEN length, + IN DAT_MEM_PRIV_FLAGS mem_priv, + IN DAT_BOOLEAN is_signaled) +{ + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_bind: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * dapls_ib_mw_unbind + * + * Unbind a protection domain from a memory window + * + * Input: + * rmr Initialized rmr to hold binding handles + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_PARAMETER; + * DAT_INVALID_STATE; + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mw_unbind ( + IN DAPL_RMR *rmr, + IN DAPL_EP *ep, + IN DAPL_COOKIE *cookie, + IN DAT_BOOLEAN is_signaled ) +{ + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_unbind: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * Local variables: + * c-indent-level: 4 + * c-basic-offset: 4 + * tab-width: 8 + * End: + */ + Index: dapl/openib_scm/dapl_ib_cm.c =================================================================== --- dapl/openib_scm/dapl_ib_cm.c (revision 0) +++ dapl/openib_scm/dapl_ib_cm.c (revision 0) @@ -0,0 +1,1073 @@ +/* + * This Software is licensed under both of the following two licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. 
+ * OR + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, either one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_cm.c + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - connection management + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + **************************************************************************/ + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_evd_util.h" +#include "dapl_cr_util.h" +#include "dapl_name_service.h" +#include "dapl_ib_util.h" + +#include +#include +#include +#include +#include + +/* prototypes */ +static uint16_t dapli_get_lid( struct ibv_device *dev, int port ); + +static DAT_RETURN dapli_socket_connect ( DAPL_EP *ep_ptr, + DAT_IA_ADDRESS_PTR r_addr, + DAT_CONN_QUAL r_qual, + DAT_COUNT p_size, + DAT_PVOID p_data ); + +static DAT_RETURN dapli_socket_listen ( DAPL_IA *ia_ptr, + DAT_CONN_QUAL serviceID, + DAPL_SP *sp_ptr ); + +static DAT_RETURN dapli_socket_accept( ib_cm_srvc_handle_t cm_ptr ); + +static DAT_RETURN dapli_socket_accept_final( DAPL_EP *ep_ptr, + DAPL_CR *cr_ptr, + DAT_COUNT p_size, + DAT_PVOID p_data ); + +/* XXX temporary hack to get lid */ +static uint16_t dapli_get_lid(IN struct ibv_device *dev, IN int port) +{ + char path[128]; + char val[16]; + char name[256]; + + if (sysfs_get_mnt_path(path, sizeof path)) { + fprintf(stderr, "Couldn't find sysfs mount.\n"); + return 0; + } + sprintf(name, "%s/class/infiniband/%s/ports/%d/lid", path, + ibv_get_device_name(dev), port); + + if (sysfs_read_attribute_value(name, val, sizeof val)) { + fprintf(stderr, "Couldn't read LID at %s\n", name); + return 0; + } + return strtol(val, NULL, 0); +} + +/* + * ACTIVE: Create socket, connect, and exchange QP information + */ +static DAT_RETURN +dapli_socket_connect ( DAPL_EP *ep_ptr, + DAT_IA_ADDRESS_PTR r_addr, + DAT_CONN_QUAL r_qual, + DAT_COUNT p_size, + DAT_PVOID p_data ) +{ + ib_cm_handle_t cm_ptr; + DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; + int len, opt = 1; + struct iovec iovec[2]; + short rtu_data = htons(0x0E0F); + + dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: r_qual %d\n", r_qual); + + /* + * Allocate CM and initialize + */ + if ((cm_ptr = dapl_os_alloc(sizeof(*cm_ptr))) == NULL ) { + return DAT_INSUFFICIENT_RESOURCES; + } + + (void) 
dapl_os_memzero( cm_ptr, sizeof( *cm_ptr ) ); + cm_ptr->socket = -1; + + /* create, connect, sockopt, and exchange QP information */ + if ((cm_ptr->socket = socket(AF_INET,SOCK_STREAM,0)) < 0 ) { + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + return DAT_INSUFFICIENT_RESOURCES; + } + + ((struct sockaddr_in*)r_addr)->sin_port = htons(r_qual); + + if ( connect(cm_ptr->socket, r_addr, sizeof(*r_addr)) < 0 ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " connect: %s on r_qual %d\n", + strerror(errno), (unsigned int)r_qual); + close( cm_ptr->socket ); /* do not leak the socket on connect failure */ + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + return DAT_INVALID_ADDRESS; + } + setsockopt(cm_ptr->socket,IPPROTO_TCP,TCP_NODELAY,&opt,sizeof(opt)); + + /* Send QP info, IA address, and private data */ + cm_ptr->dst.qpn = ep_ptr->qp_handle->qp_num; + cm_ptr->dst.port = ia_ptr->hca_ptr->port_num; + cm_ptr->dst.lid = dapli_get_lid( ia_ptr->hca_ptr->ib_trans.ib_dev, + ia_ptr->hca_ptr->port_num ); + cm_ptr->dst.ia_address = ia_ptr->hca_ptr->hca_address; + cm_ptr->dst.p_size = p_size; + iovec[0].iov_base = &cm_ptr->dst; + iovec[0].iov_len = sizeof(ib_qp_cm_t); + if ( p_size ) { + iovec[1].iov_base = p_data; + iovec[1].iov_len = p_size; + } + len = writev( cm_ptr->socket, iovec, (p_size ?
2:1) ); + if ( len != (p_size + sizeof(ib_qp_cm_t)) ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " connect write: ERR %s, wcnt=%d\n", + strerror(errno), len); + goto bail; + } + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " connect: SRC port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", + cm_ptr->dst.port, cm_ptr->dst.lid, + cm_ptr->dst.qpn, cm_ptr->dst.p_size ); + + /* read DST information into cm_ptr, overwrite SRC info */ + len = readv( cm_ptr->socket, iovec, 1 ); + if ( len != sizeof(ib_qp_cm_t) ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " connect read: ERR %s, rcnt=%d\n", + strerror(errno), len); + goto bail; + } + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " connect: DST port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", + cm_ptr->dst.port, cm_ptr->dst.lid, + cm_ptr->dst.qpn, cm_ptr->dst.p_size ); + + /* validate private data size before reading */ + if ( cm_ptr->dst.p_size > IB_MAX_REP_PDATA_SIZE ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " connect read: psize (%d) wrong\n", + cm_ptr->dst.p_size ); + goto bail; + } + + /* read private data into cm_handle if any present */ + if ( cm_ptr->dst.p_size ) { + iovec[0].iov_base = cm_ptr->p_data; + iovec[0].iov_len = cm_ptr->dst.p_size; + len = readv( cm_ptr->socket, iovec, 1 ); + if ( len != cm_ptr->dst.p_size ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " connect read pdata: ERR %s, rcnt=%d\n", + strerror(errno), len); + goto bail; + } + } + + /* modify QP to RTR and then to RTS with remote info */ + if ( dapls_modify_qp_state( ep_ptr->qp_handle, + IBV_QPS_RTR, &cm_ptr->dst ) != DAT_SUCCESS ) + goto bail; + + if ( dapls_modify_qp_state( ep_ptr->qp_handle, + IBV_QPS_RTS, &cm_ptr->dst ) != DAT_SUCCESS ) + goto bail; + + ep_ptr->qp_state = IB_QP_STATE_RTS; + + /* complete handshake after final QP state change */ + write(cm_ptr->socket, &rtu_data, sizeof(rtu_data) ); + + /* init cm_handle and post the event with private data */ + ep_ptr->cm_handle = cm_ptr; + dapl_dbg_log( DAPL_DBG_TYPE_EP," ACTIVE: connected!\n" ); + dapl_evd_connection_callback( ep_ptr->cm_handle, + 
IB_CME_CONNECTED, + cm_ptr->p_data, + ep_ptr ); + return DAT_SUCCESS; + +bail: + /* close socket, free cm structure and post error event */ + if ( cm_ptr->socket >= 0 ) + close(cm_ptr->socket); + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + dapls_ib_reinit_ep( ep_ptr ); /* reset QP state */ + + dapl_evd_connection_callback( ep_ptr->cm_handle, + IB_CME_LOCAL_FAILURE, + NULL, + ep_ptr ); + return DAT_INTERNAL_ERROR; +} + + +/* + * PASSIVE: Create socket, listen, accept, exchange QP information + */ +static DAT_RETURN +dapli_socket_listen ( DAPL_IA *ia_ptr, + DAT_CONN_QUAL serviceID, + DAPL_SP *sp_ptr ) +{ + struct sockaddr_in addr; + ib_cm_srvc_handle_t cm_ptr = NULL; + int opt = 1; + DAT_RETURN dat_status = DAT_SUCCESS; + + dapl_dbg_log ( DAPL_DBG_TYPE_EP, + " listen(ia_ptr %p ServiceID %d sp_ptr %p)\n", + ia_ptr, serviceID, sp_ptr); + + /* Allocate CM and initialize */ + if ((cm_ptr = dapl_os_alloc(sizeof(*cm_ptr))) == NULL) + return DAT_INSUFFICIENT_RESOURCES; + + (void) dapl_os_memzero( cm_ptr, sizeof( *cm_ptr ) ); + + cm_ptr->socket = cm_ptr->l_socket = -1; + cm_ptr->sp = sp_ptr; + cm_ptr->hca_ptr = ia_ptr->hca_ptr; + + /* bind, listen, set sockopt, accept, exchange data */ + if ((cm_ptr->l_socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + "socket for listen returned %d\n", errno); + dat_status = DAT_INSUFFICIENT_RESOURCES; + goto bail; + } + + setsockopt(cm_ptr->l_socket,SOL_SOCKET,SO_REUSEADDR,&opt,sizeof(opt)); + addr.sin_port = htons(serviceID); + addr.sin_family = AF_INET; + addr.sin_addr.s_addr = INADDR_ANY; + + if (( bind( cm_ptr->l_socket,(struct sockaddr*)&addr, sizeof(addr) ) < 0) || + (listen( cm_ptr->l_socket, 128 ) < 0) ) { + + dapl_dbg_log( DAPL_DBG_TYPE_ERR, + " listen: ERROR %s on conn_qual 0x%x\n", + strerror(errno),serviceID); + + if ( errno == EADDRINUSE ) + dat_status = DAT_CONN_QUAL_IN_USE; + else + dat_status = DAT_CONN_QUAL_UNAVAILABLE; + + goto bail; + } + + /* set cm_handle for this service point, save 
listen socket */ + sp_ptr->cm_srvc_handle = cm_ptr; + + /* add to SP->CR thread list */ + dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&cm_ptr->entry); + dapl_os_lock( &cm_ptr->hca_ptr->ib_trans.lock ); + dapl_llist_add_tail(&cm_ptr->hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cm_ptr->entry, cm_ptr); + dapl_os_unlock(&cm_ptr->hca_ptr->ib_trans.lock); + + dapl_dbg_log( DAPL_DBG_TYPE_CM, + " listen: qual 0x%x cr %p s_fd %d\n", + ntohs(serviceID), cm_ptr, cm_ptr->l_socket ); + + return dat_status; +bail: + dapl_dbg_log( DAPL_DBG_TYPE_ERR, + " listen: ERROR on conn_qual 0x%x\n",serviceID); + if ( cm_ptr->l_socket >= 0 ) + close( cm_ptr->l_socket ); + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + return dat_status; +} + + +/* + * PASSIVE: send local QP information, private data, and wait for + * active side to respond with QP RTS/RTR status + */ +static DAT_RETURN +dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) +{ + ib_cm_handle_t acm_ptr; + void *p_data = NULL; + int len; + DAT_RETURN dat_status = DAT_SUCCESS; + + /* Allocate accept CM and initialize */ + if ((acm_ptr = dapl_os_alloc(sizeof(*acm_ptr))) == NULL) + return DAT_INSUFFICIENT_RESOURCES; + + (void) dapl_os_memzero( acm_ptr, sizeof( *acm_ptr ) ); + + acm_ptr->socket = -1; + acm_ptr->sp = cm_ptr->sp; + acm_ptr->hca_ptr = cm_ptr->hca_ptr; + + len = sizeof(acm_ptr->dst.ia_address); + acm_ptr->socket = accept(cm_ptr->l_socket, + (struct sockaddr*)&acm_ptr->dst.ia_address, + &len ); + + if ( acm_ptr->socket < 0 ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept: ERR %s on FD %d l_cr %p\n", + strerror(errno),cm_ptr->l_socket,cm_ptr); + dat_status = DAT_INTERNAL_ERROR; + goto bail; + } + + /* read in DST QP info, IA address. 
check for private data */ + len = read( acm_ptr->socket, &acm_ptr->dst, sizeof(ib_qp_cm_t) ); + if ( len != sizeof(ib_qp_cm_t) ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept read: ERR %s, rcnt=%d\n", + strerror(errno), len); + dat_status = DAT_INTERNAL_ERROR; + goto bail; + + } + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " accept: DST port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", + acm_ptr->dst.port, acm_ptr->dst.lid, + acm_ptr->dst.qpn, acm_ptr->dst.p_size ); + + /* validate private data size before reading */ + if ( acm_ptr->dst.p_size > IB_MAX_REQ_PDATA_SIZE ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept read: psize (%d) wrong\n", + acm_ptr->dst.p_size ); + dat_status = DAT_INTERNAL_ERROR; + goto bail; + } + + /* read private data into cm_handle if any present */ + if ( acm_ptr->dst.p_size ) { + len = read( acm_ptr->socket, + acm_ptr->p_data, acm_ptr->dst.p_size ); + if ( len != acm_ptr->dst.p_size ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept read pdata: ERR %s, rcnt=%d\n", + strerror(errno), len ); + dat_status = DAT_INTERNAL_ERROR; + goto bail; + } + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " accept: psize=%d read\n", + acm_ptr->dst.p_size); + p_data = acm_ptr->p_data; + } + + /* trigger CR event and return SUCCESS */ + dapls_cr_callback( acm_ptr, + IB_CME_CONNECTION_REQUEST_PENDING, + p_data, + acm_ptr->sp ); + + return DAT_SUCCESS; + +bail: + if ( acm_ptr->socket >=0 ) + close( acm_ptr->socket ); + dapl_os_free( acm_ptr, sizeof( *acm_ptr ) ); + return DAT_INTERNAL_ERROR; +} + + +static DAT_RETURN +dapli_socket_accept_final( DAPL_EP *ep_ptr, + DAPL_CR *cr_ptr, + DAT_COUNT p_size, + DAT_PVOID p_data ) +{ + DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; + ib_cm_handle_t cm_ptr = cr_ptr->ib_cm_handle; + ib_qp_cm_t qp_cm; + struct iovec iovec[2]; + int len; + short rtu_data = 0; + + if (p_size > IB_MAX_REP_PDATA_SIZE) + return DAT_LENGTH_ERROR; + + /* must have a accepted socket */ + if ( cm_ptr->socket < 0 ) + return DAT_INTERNAL_ERROR; + + /* modify QP to RTR and then to RTS 
with remote info already read */ + if ( dapls_modify_qp_state( ep_ptr->qp_handle, + IBV_QPS_RTR, &cm_ptr->dst ) != DAT_SUCCESS ) + goto bail; + + if ( dapls_modify_qp_state( ep_ptr->qp_handle, + IBV_QPS_RTS, &cm_ptr->dst ) != DAT_SUCCESS ) + goto bail; + + ep_ptr->qp_state = IB_QP_STATE_RTS; + + /* Send QP info, IA address, and private data */ + qp_cm.qpn = ep_ptr->qp_handle->qp_num; + qp_cm.port = ia_ptr->hca_ptr->port_num; + qp_cm.lid = dapli_get_lid( ia_ptr->hca_ptr->ib_trans.ib_dev, + ia_ptr->hca_ptr->port_num ); + qp_cm.ia_address = ia_ptr->hca_ptr->hca_address; + qp_cm.p_size = p_size; + iovec[0].iov_base = &qp_cm; + iovec[0].iov_len = sizeof(ib_qp_cm_t); + if (p_size) { + iovec[1].iov_base = p_data; + iovec[1].iov_len = p_size; + } + len = writev( cm_ptr->socket, iovec, (p_size ? 2:1) ); + if (len != (p_size + sizeof(ib_qp_cm_t))) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept_final: ERR %s, wcnt=%d\n", + strerror(errno), len); + goto bail; + } + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " accept_final: SRC port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", + qp_cm.port, qp_cm.lid, qp_cm.qpn, qp_cm.p_size ); + + /* complete handshake after final QP state change */ + len = read(cm_ptr->socket, &rtu_data, sizeof(rtu_data) ); + if ( len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept_final: ERR %s, rcnt=%d rdata=%x\n", + strerror(errno), len, ntohs(rtu_data) ); + goto bail; + } + + /* final data exchange if remote QP state is good to go */ + dapl_dbg_log( DAPL_DBG_TYPE_EP," PASSIVE: connected!\n" ); + dapls_cr_callback ( cm_ptr, IB_CME_CONNECTED, NULL, cm_ptr->sp ); + return DAT_SUCCESS; + +bail: + dapl_dbg_log( DAPL_DBG_TYPE_ERR," accept_final: ERR !QP_RTR_RTS \n"); + if ( cm_ptr->socket >= 0 ) + close( cm_ptr->socket ); + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + dapls_ib_reinit_ep( ep_ptr ); /* reset QP state */ + + return DAT_INTERNAL_ERROR; +} + + +/* + * dapls_ib_connect + * + * Initiate a connection with the passive listener
on another node + * + * Input: + * ep_handle, + * remote_ia_address, + * remote_conn_qual, + * prd_size size of private data and structure + * prd_prt pointer to private data structure + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN +dapls_ib_connect ( + IN DAT_EP_HANDLE ep_handle, + IN DAT_IA_ADDRESS_PTR remote_ia_address, + IN DAT_CONN_QUAL remote_conn_qual, + IN DAT_COUNT private_data_size, + IN void *private_data ) +{ + DAPL_EP *ep_ptr; + ib_qp_handle_t qp_ptr; + + dapl_dbg_log ( DAPL_DBG_TYPE_EP, + " connect(ep_handle %p ....)\n", ep_handle); + /* + * Sanity check + */ + if ( NULL == ep_handle ) + return DAT_SUCCESS; + + ep_ptr = (DAPL_EP*)ep_handle; + qp_ptr = ep_ptr->qp_handle; + + return (dapli_socket_connect( ep_ptr, remote_ia_address, + remote_conn_qual, + private_data_size, private_data )); +} + +/* + * dapls_ib_disconnect + * + * Disconnect an EP + * + * Input: + * ep_handle, + * disconnect_flags + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * + */ +DAT_RETURN +dapls_ib_disconnect ( + IN DAPL_EP *ep_ptr, + IN DAT_CLOSE_FLAGS close_flags ) +{ + ib_cm_handle_t cm_ptr = ep_ptr->cm_handle; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + "dapls_ib_disconnect(ep_handle %p ....)\n", + ep_ptr); + + if ( cm_ptr->socket >= 0 ) { + close( cm_ptr->socket ); + cm_ptr->socket = -1; + } + + /* reinit to modify QP state */ + dapls_ib_reinit_ep(ep_ptr); + + if ( ep_ptr->cr_ptr ) { + dapls_cr_callback ( ep_ptr->cm_handle, + IB_CME_DISCONNECTED, + NULL, + ((DAPL_CR *)ep_ptr->cr_ptr)->sp_ptr ); + } else { + dapl_evd_connection_callback ( ep_ptr->cm_handle, + IB_CME_DISCONNECTED, + NULL, + ep_ptr ); + ep_ptr->cm_handle = NULL; + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + } + return DAT_SUCCESS; +} + +/* + * dapls_ib_disconnect_clean + * + * Clean up outstanding connection data. This routine is invoked + * after the final disconnect callback has occurred. 
Only on the + * ACTIVE side of a connection. + * + * Input: + * ep_ptr DAPL_EP + * active Indicates active side of connection + * + * Output: + * none + * + * Returns: + * void + * + */ +void +dapls_ib_disconnect_clean ( + IN DAPL_EP *ep_ptr, + IN DAT_BOOLEAN active, + IN const ib_cm_events_t ib_cm_event ) +{ + return; +} + +/* + * dapls_ib_setup_conn_listener + * + * Have the CM set up a connection listener. + * + * Input: + * ibm_hca_handle HCA handle + * qp_handle QP handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INTERNAL_ERROR + * DAT_CONN_QUAL_UNAVAILABLE + * DAT_CONN_QUAL_IN_USE + * + */ +DAT_RETURN +dapls_ib_setup_conn_listener ( + IN DAPL_IA *ia_ptr, + IN DAT_UINT64 ServiceID, + IN DAPL_SP *sp_ptr ) +{ + return (dapli_socket_listen( ia_ptr, ServiceID, sp_ptr )); +} + + +/* + * dapls_ib_remove_conn_listener + * + * Have the CM remove a connection listener. + * + * Input: + * ia_handle IA handle + * ServiceID IB Channel Service ID + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_remove_conn_listener ( + IN DAPL_IA *ia_ptr, + IN DAPL_SP *sp_ptr ) +{ + ib_cm_srvc_handle_t cm_ptr = sp_ptr->cm_srvc_handle; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + "dapls_ib_remove_conn_listener(ia_ptr %p sp_ptr %p cm_ptr %p)\n", + ia_ptr, sp_ptr, cm_ptr ); + + /* close listen socket and mark it so that cr_thread frees the entry */ + if ( cm_ptr != NULL ) { + if ( cm_ptr->l_socket >= 0 ) { + close( cm_ptr->l_socket ); + cm_ptr->l_socket = -1; + } + /* cr_thread will free */ + sp_ptr->cm_srvc_handle = NULL; + } + return DAT_SUCCESS; +} + +/* + * dapls_ib_accept_connection + * + * Perform necessary steps to accept a connection + * + * Input: + * cr_handle + * ep_handle + * private_data_size + * private_data + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INTERNAL_ERROR + * + */ +DAT_RETURN +dapls_ib_accept_connection ( + IN DAT_CR_HANDLE
cr_handle, + IN DAT_EP_HANDLE ep_handle, + IN DAT_COUNT p_size, + IN const DAT_PVOID p_data ) +{ + DAPL_CR *cr_ptr; + DAPL_EP *ep_ptr; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + "dapls_ib_accept_connection(cr %p ep %p prd %p,%d)\n", + cr_handle, ep_handle, p_data, p_size ); + + cr_ptr = (DAPL_CR *) cr_handle; + ep_ptr = (DAPL_EP *) ep_handle; + + /* allocate and attach a QP if necessary */ + if ( ep_ptr->qp_state == DAPL_QP_STATE_UNATTACHED ) { + DAT_RETURN status; + status = dapls_ib_qp_alloc( ep_ptr->header.owner_ia, + ep_ptr, ep_ptr ); + if ( status != DAT_SUCCESS ) + return status; + } + + return ( dapli_socket_accept_final(ep_ptr, cr_ptr, p_size, p_data) ); +} + + +/* + * dapls_ib_reject_connection + * + * Reject a connection + * + * Input: + * cr_handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INTERNAL_ERROR + * + */ +DAT_RETURN +dapls_ib_reject_connection ( + IN ib_cm_handle_t ib_cm_handle, + IN int reject_reason ) +{ + ib_cm_srvc_handle_t cm_ptr = ib_cm_handle; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + "dapls_ib_reject_connection(cm_handle %p reason %x)\n", + ib_cm_handle, reject_reason ); + + /* just close the socket and return */ + if ( cm_ptr->socket > 0 ) { + close( cm_ptr->socket ); + cm_ptr->socket = -1; + } + return DAT_SUCCESS; +} + +/* + * dapls_ib_cm_remote_addr + * + * Obtain the remote IP address given a connection + * + * Input: + * cr_handle + * + * Output: + * remote_ia_address: where to place the remote address + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_HANDLE + * + */ +DAT_RETURN +dapls_ib_cm_remote_addr ( + IN DAT_HANDLE dat_handle, + OUT DAT_SOCK_ADDR6 *remote_ia_address ) +{ + DAPL_HEADER *header; + ib_cm_handle_t ib_cm_handle; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + "dapls_ib_cm_remote_addr(dat_handle %p, ....)\n", + dat_handle ); + + header = (DAPL_HEADER *)dat_handle; + + if (header->magic == DAPL_MAGIC_EP) + ib_cm_handle = ((DAPL_EP *) dat_handle)->cm_handle; + else if (header->magic == DAPL_MAGIC_CR) + 
ib_cm_handle = ((DAPL_CR *) dat_handle)->ib_cm_handle; + else + return DAT_INVALID_HANDLE; + + dapl_os_memcpy( remote_ia_address, + &ib_cm_handle->dst.ia_address, + sizeof(DAT_SOCK_ADDR6) ); + + return DAT_SUCCESS; +} + +/* + * dapls_ib_private_data_size + * + * Return the size of private data given a connection op type + * + * Input: + * prd_ptr private data pointer + * conn_op connection operation type + * + * If prd_ptr is NULL, this is a query for the max size supported by + * the provider, otherwise it is the actual size of the private data + * contained in prd_ptr. + * + * + * Output: + * None + * + * Returns: + * length of private data + * + */ +int dapls_ib_private_data_size ( + IN DAPL_PRIVATE *prd_ptr, + IN DAPL_PDATA_OP conn_op) +{ + int size; + + switch (conn_op) + { + case DAPL_PDATA_CONN_REQ: + { + size = IB_MAX_REQ_PDATA_SIZE; + break; + } + case DAPL_PDATA_CONN_REP: + { + size = IB_MAX_REP_PDATA_SIZE; + break; + } + case DAPL_PDATA_CONN_REJ: + { + size = IB_MAX_REJ_PDATA_SIZE; + break; + } + case DAPL_PDATA_CONN_DREQ: + { + size = IB_MAX_DREQ_PDATA_SIZE; + break; + } + case DAPL_PDATA_CONN_DREP: + { + size = IB_MAX_DREP_PDATA_SIZE; + break; + } + default: + { + size = 0; + } + + } /* end case */ + + return size; +} + +/* + * Map all socket CM event codes to the DAT equivalent.
+ */ +#define DAPL_IB_EVENT_CNT 11 + +static struct ib_cm_event_map +{ + const ib_cm_events_t ib_cm_event; + DAT_EVENT_NUMBER dat_event_num; + } ib_cm_event_map[DAPL_IB_EVENT_CNT] = { + /* 00 */ { IB_CME_CONNECTED, + DAT_CONNECTION_EVENT_ESTABLISHED}, + /* 01 */ { IB_CME_DISCONNECTED, + DAT_CONNECTION_EVENT_DISCONNECTED}, + /* 02 */ { IB_CME_DISCONNECTED_ON_LINK_DOWN, + DAT_CONNECTION_EVENT_DISCONNECTED}, + /* 03 */ { IB_CME_CONNECTION_REQUEST_PENDING, + DAT_CONNECTION_REQUEST_EVENT}, + /* 04 */ { IB_CME_CONNECTION_REQUEST_PENDING_PRIVATE_DATA, + DAT_CONNECTION_REQUEST_EVENT}, + /* 05 */ { IB_CME_DESTINATION_REJECT, + DAT_CONNECTION_EVENT_NON_PEER_REJECTED}, + /* 06 */ { IB_CME_DESTINATION_REJECT_PRIVATE_DATA, + DAT_CONNECTION_EVENT_PEER_REJECTED}, + /* 07 */ { IB_CME_DESTINATION_UNREACHABLE, + DAT_CONNECTION_EVENT_UNREACHABLE}, + /* 08 */ { IB_CME_TOO_MANY_CONNECTION_REQUESTS, + DAT_CONNECTION_EVENT_NON_PEER_REJECTED}, + /* 09 */ { IB_CME_LOCAL_FAILURE, + DAT_CONNECTION_EVENT_BROKEN}, + /* 10 */ { IB_CM_LOCAL_FAILURE, + DAT_CONNECTION_EVENT_BROKEN} +}; + +/* + * dapls_ib_get_dat_event + * + * Return a DAT connection event given a provider CM event. + * + * Input: + * ib_cm_event event provided to the dapl callback routine + * active switch indicating active or passive connection + * + * Output: + * none + * + * Returns: + * DAT_EVENT_NUMBER of translated provider value + */ +DAT_EVENT_NUMBER +dapls_ib_get_dat_event ( + IN const ib_cm_events_t ib_cm_event, + IN DAT_BOOLEAN active) +{ + DAT_EVENT_NUMBER dat_event_num; + int i; + + active = active; + + if (ib_cm_event > IB_CM_LOCAL_FAILURE) + return (DAT_EVENT_NUMBER) 0; + + dat_event_num = 0; + for (i = 0; i < DAPL_IB_EVENT_CNT; i++) { + if (ib_cm_event == ib_cm_event_map[i].ib_cm_event) { + dat_event_num = ib_cm_event_map[i].dat_event_num; + break; + } + } + dapl_dbg_log (DAPL_DBG_TYPE_CALLBACK, + "dapls_ib_get_dat_event: event translate(%s) ib=0x%x dat=0x%x\n", + active ?
"active" : "passive", ib_cm_event, dat_event_num); + + return dat_event_num; +} + + +/* + * dapls_ib_get_cm_event + * + * Return a provider CM event given a DAT connection event. + * + * Input: + * dat_event_num DAT event we need an equivalent CM event for + * (the first matching map entry is used) + * + * Output: + * none + * + * Returns: + * ib_cm_event of translated DAPL value + */ +ib_cm_events_t +dapls_ib_get_cm_event ( + IN DAT_EVENT_NUMBER dat_event_num) +{ + ib_cm_events_t ib_cm_event; + int i; + + ib_cm_event = 0; + for (i = 0; i < DAPL_IB_EVENT_CNT; i++) { + if ( dat_event_num == ib_cm_event_map[i].dat_event_num ) { + ib_cm_event = ib_cm_event_map[i].ib_cm_event; + break; + } + } + return ib_cm_event; +} + +/* async CR processing thread to avoid blocking applications */ +void cr_thread(void *arg) +{ + struct dapl_hca *hca_ptr = arg; + ib_cm_srvc_handle_t cr, next_cr; + int max_fd; + fd_set rfd,rfds; + struct timeval to; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread: ENTER hca %p\n",hca_ptr); + + dapl_os_lock( &hca_ptr->ib_trans.lock ); + hca_ptr->ib_trans.cr_state = IB_THREAD_RUN; + while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) { + + FD_ZERO( &rfds ); + max_fd = -1; + + if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list)) + next_cr = dapl_llist_peek_head (&hca_ptr->ib_trans.list); + else + next_cr = NULL; + + while (next_cr) { + cr = next_cr; + dapl_dbg_log (DAPL_DBG_TYPE_CM," thread: cm_ptr %p\n", cr ); + if (cr->l_socket == -1 || + hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) { + + dapl_dbg_log(DAPL_DBG_TYPE_CM," thread: Freeing %p\n", cr); + next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry ); + dapl_llist_remove_entry(&hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry); + dapl_os_free( cr, sizeof(*cr) ); + continue; + } + + FD_SET( cr->l_socket, &rfds ); /* add to select set */ + if ( cr->l_socket > max_fd ) + max_fd = cr->l_socket; + + /* individual select poll to check for work 
*/ + FD_ZERO(&rfd); + FD_SET(cr->l_socket, &rfd); + dapl_os_unlock(&hca_ptr->ib_trans.lock); + to.tv_sec = 0; + to.tv_usec = 0; + if ( select(cr->l_socket + 1,&rfd, NULL, NULL, &to) < 0) { + dapl_dbg_log (DAPL_DBG_TYPE_CM, + " thread: ERR %s on cr %p sk %d\n", + strerror(errno), cr, cr->l_socket); + close(cr->l_socket); + cr->l_socket = -1; + } else if ( FD_ISSET(cr->l_socket, &rfd) && + dapli_socket_accept(cr)) { + close(cr->l_socket); + cr->l_socket = -1; + } + dapl_os_lock( &hca_ptr->ib_trans.lock ); + next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry ); + } + dapl_os_unlock( &hca_ptr->ib_trans.lock ); + to.tv_sec = 0; + to.tv_usec = 100000; /* wakeup and check destroy */ + select(max_fd + 1, &rfds, NULL, NULL, &to); + dapl_os_lock( &hca_ptr->ib_trans.lock ); + } + dapl_os_unlock( &hca_ptr->ib_trans.lock ); + hca_ptr->ib_trans.cr_state = IB_THREAD_EXIT; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread(hca %p) exit\n",hca_ptr); +} + +/* + * Local variables: + * c-indent-level: 4 + * c-basic-offset: 4 + * tab-width: 8 + * End: + */ Index: dapl/openib_scm/dapl_ib_qp.c =================================================================== --- dapl/openib_scm/dapl_ib_qp.c (revision 0) +++ dapl/openib_scm/dapl_ib_qp.c (revision 0) @@ -0,0 +1,399 @@ +/* + * Copyright (c) 2002-2003, Network Appliance, Inc. All rights reserved. + * + * This Software is licensed under either one of the following two licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * OR + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. 
+ * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, either one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/********************************************************************** + * + * MODULE: dapl_det_qp.c + * + * PURPOSE: QP routines for access to DET Verbs + * + * $Id: $ + **********************************************************************/ + +#include "dapl.h" +#include "dapl_adapter_util.h" + +/* + * dapl_ib_qp_alloc + * + * Alloc a QP + * + * Input: + * *ep_ptr pointer to EP INFO + * ib_hca_handle provider HCA handle + * ib_pd_handle provider protection domain handle + * cq_recv provider recv CQ handle + * cq_send provider send CQ handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INTERNAL_ERROR + * + */ +DAT_RETURN +dapls_ib_qp_alloc ( + IN DAPL_IA *ia_ptr, + IN DAPL_EP *ep_ptr, + IN DAPL_EP *ep_ctx_ptr ) +{ + DAT_EP_ATTR *attr; + DAPL_EVD *rcv_evd, *req_evd; + ib_cq_handle_t rcv_cq, req_cq; + ib_pd_handle_t ib_pd_handle; + struct ibv_qp_init_attr qp_create; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " qp_alloc: ia_ptr %p ep_ptr %p ep_ctx_ptr %p\n", + ia_ptr, ep_ptr, ep_ctx_ptr); + + attr = &ep_ptr->param.ep_attr; + ib_pd_handle = ((DAPL_PZ *)ep_ptr->param.pz_handle)->pd_handle; + rcv_evd = (DAPL_EVD *) ep_ptr->param.recv_evd_handle; + req_evd = (DAPL_EVD *) ep_ptr->param.request_evd_handle; + + /* + * DAT allows usage model of EP's with no EVD's but IB does not. + * Create a CQ with zero entries under the covers to support and + * catch any invalid posting. 
+ */ + if ( rcv_evd != DAT_HANDLE_NULL ) + rcv_cq = rcv_evd->ib_cq_handle; + else if (!ia_ptr->hca_ptr->ib_trans.ib_cq_empty) + rcv_cq = ia_ptr->hca_ptr->ib_trans.ib_cq_empty; + else { + struct ibv_comp_channel *channel = + ia_ptr->hca_ptr->ib_trans.ib_cq; +#ifdef CQ_WAIT_OBJECT + if (rcv_evd->cq_wait_obj_handle) + channel = rcv_evd->cq_wait_obj_handle; +#endif + /* Call IB verbs to create CQ */ + rcv_cq = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle, + 0, NULL, channel, 0); + + if (rcv_cq == IB_INVALID_HANDLE) + return(dapl_convert_errno(ENOMEM, "create_cq")); + + ia_ptr->hca_ptr->ib_trans.ib_cq_empty = rcv_cq; + } + if (req_evd != DAT_HANDLE_NULL) + req_cq = req_evd->ib_cq_handle; + else + req_cq = ia_ptr->hca_ptr->ib_trans.ib_cq_empty; + + /* Setup attributes and create qp */ + dapl_os_memzero((void*)&qp_create, sizeof(qp_create)); + qp_create.send_cq = req_cq; + qp_create.recv_cq = rcv_cq; + qp_create.cap.max_send_wr = attr->max_request_dtos; + qp_create.cap.max_recv_wr = attr->max_recv_dtos; + qp_create.cap.max_send_sge = attr->max_request_iov; + qp_create.cap.max_recv_sge = attr->max_recv_iov; + qp_create.cap.max_inline_data = ia_ptr->hca_ptr->ib_trans.max_inline_send; + qp_create.qp_type = IBV_QPT_RC; + qp_create.qp_context = (void*)ep_ptr; + + ep_ptr->qp_handle = ibv_create_qp( ib_pd_handle, &qp_create); + if (!ep_ptr->qp_handle) + return(dapl_convert_errno(ENOMEM, "create_qp")); + + dapl_dbg_log ( DAPL_DBG_TYPE_EP, + " qp_alloc: qpn %p sq %d,%d rq %d,%d\n", + ep_ptr->qp_handle->qp_num, + qp_create.cap.max_send_wr,qp_create.cap.max_send_sge, + qp_create.cap.max_recv_wr,qp_create.cap.max_recv_sge ); + + /* Setup QP attributes for INIT state on the way out */ + if (dapls_modify_qp_state(ep_ptr->qp_handle, + IBV_QPS_INIT, + NULL ) != DAT_SUCCESS ) { + ibv_destroy_qp(ep_ptr->qp_handle); + ep_ptr->qp_handle = IB_INVALID_HANDLE; + return DAT_INTERNAL_ERROR; + } + + ep_ptr->qp_state = IB_QP_STATE_INIT; + return DAT_SUCCESS; +} + +/* + * dapl_ib_qp_free + * + * 
Free a QP + * + * Input: + * ia_handle IA handle + * *ep_ptr pointer to EP INFO + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * dapl_convert_errno + * + */ +DAT_RETURN +dapls_ib_qp_free ( + IN DAPL_IA *ia_ptr, + IN DAPL_EP *ep_ptr ) +{ + dapl_dbg_log (DAPL_DBG_TYPE_EP, " qp_free: ep_ptr %p qp %p\n", + ep_ptr, ep_ptr->qp_handle); + + if (ep_ptr->qp_handle != IB_INVALID_HANDLE) { + /* force error state to flush queue, then destroy */ + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, NULL); + + if (ibv_destroy_qp(ep_ptr->qp_handle)) + return(dapl_convert_errno(errno,"destroy_qp")); + + ep_ptr->qp_handle = IB_INVALID_HANDLE; + ep_ptr->qp_state = IB_QP_STATE_ERROR; + } + + return DAT_SUCCESS; +} + +/* + * dapl_ib_qp_modify + * + * Set the QP to the parameters specified in an EP_PARAM + * + * The EP_PARAM structure that is provided has been + * sanitized such that only non-zero values are valid. + * + * Input: + * ib_hca_handle HCA handle + * qp_handle QP handle + * ep_attr Sanitized EP Params + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN +dapls_ib_qp_modify ( + IN DAPL_IA *ia_ptr, + IN DAPL_EP *ep_ptr, + IN DAT_EP_ATTR *attr ) +{ + struct ibv_qp_attr qp_attr; + + if (ep_ptr->qp_handle == IB_INVALID_HANDLE) + return DAT_INVALID_PARAMETER; + + /* + * EP state, qp_handle state should be an indication + * of current state but the only way to be sure is with + * a user mode ibv_query_qp call which is NOT available + */ + + /* move to error state if necessary */ + if ((ep_ptr->qp_state == IB_QP_STATE_ERROR) && + (ep_ptr->qp_handle->state != IBV_QPS_ERR)) { + ep_ptr->qp_state = IB_QP_STATE_ERROR; + return (dapls_modify_qp_state(ep_ptr->qp_handle, + IBV_QPS_ERR, NULL)); + } + + /* + * Check if we have the right qp_state to modify attributes + */ + if ((ep_ptr->qp_handle->state != IBV_QPS_RTR ) && + (ep_ptr->qp_handle->state != IBV_QPS_RTS )) + return DAT_INVALID_STATE; 
+ + /* Adjust to current EP attributes */ + dapl_os_memzero((void*)&qp_attr, sizeof(qp_attr)); + qp_attr.cap.max_send_wr = attr->max_request_dtos; + qp_attr.cap.max_recv_wr = attr->max_recv_dtos; + qp_attr.cap.max_send_sge = attr->max_request_iov; + qp_attr.cap.max_recv_sge = attr->max_recv_iov; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + "modify_qp: qp %p sq %d,%d, rq %d,%d\n", + ep_ptr->qp_handle, + qp_attr.cap.max_send_wr, qp_attr.cap.max_send_sge, + qp_attr.cap.max_recv_wr, qp_attr.cap.max_recv_sge ); + + if (ibv_modify_qp(ep_ptr->qp_handle, &qp_attr, IBV_QP_CAP)) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + "modify_qp: modify ep %p qp %p failed\n", + ep_ptr, ep_ptr->qp_handle); + return(dapl_convert_errno(errno,"modify_qp_state")); + } + + return DAT_SUCCESS; +} + +/* + * dapls_ib_reinit_ep + * + * Move the QP to INIT state again. + * + * Input: + * ep_ptr DAPL_EP + * + * Output: + * none + * + * Returns: + * void + * + */ +void +dapls_ib_reinit_ep ( + IN DAPL_EP *ep_ptr) +{ + + if ( ep_ptr->qp_handle != IB_INVALID_HANDLE ) { + /* move to RESET state and then to INIT */ + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_RESET, 0); + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_INIT, 0); + ep_ptr->qp_state = IB_QP_STATE_INIT; + } + + /* TODO: When IB-CM is implemented, handle timewait before + * allowing re-use of this QP + */ +} + +/* + * Generic QP modify for init, reset, error, RTS, RTR + */ +DAT_RETURN +dapls_modify_qp_state ( IN ib_qp_handle_t qp_handle, + IN ib_qp_state_t qp_state, + IN ib_qp_cm_t *qp_cm ) +{ + struct ibv_qp_attr qp_attr; + enum ibv_qp_attr_mask mask = IBV_QP_STATE; + + dapl_os_memzero((void*)&qp_attr, sizeof(qp_attr)); + qp_attr.qp_state = qp_state; + + switch (qp_state) { + /* additional attributes with RTR and RTS */ + case IBV_QPS_RTR: + { + mask |= IBV_QP_AV | + IBV_QP_PATH_MTU | + IBV_QP_DEST_QPN | + IBV_QP_RQ_PSN | + IBV_QP_MAX_DEST_RD_ATOMIC | + IBV_QP_MIN_RNR_TIMER; + qp_attr.qp_state = IBV_QPS_RTR; + qp_attr.path_mtu = IBV_MTU_1024; + 
qp_attr.dest_qp_num = qp_cm->qpn; + qp_attr.rq_psn = 1; + qp_attr.max_dest_rd_atomic = 8; + qp_attr.min_rnr_timer = 12; + qp_attr.ah_attr.is_global = 0; + qp_attr.ah_attr.dlid = qp_cm->lid; + qp_attr.ah_attr.sl = 0; + qp_attr.ah_attr.src_path_bits = 0; + qp_attr.ah_attr.port_num = qp_cm->port; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " modify_qp_rtr: qpn %x lid %x port %x\n", + qp_cm->qpn,qp_cm->lid,qp_cm->port ); + break; + } + case IBV_QPS_RTS: + { + mask |= IBV_QP_TIMEOUT | + IBV_QP_RETRY_CNT | + IBV_QP_RNR_RETRY | + IBV_QP_SQ_PSN | + IBV_QP_MAX_QP_RD_ATOMIC; + qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.timeout = 14; + qp_attr.retry_cnt = 7; + qp_attr.rnr_retry = 7; + qp_attr.sq_psn = 1; + qp_attr.max_rd_atomic = 8; + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " modify_qp_rts: psn %x or %x\n", + qp_attr.sq_psn, qp_attr.max_rd_atomic ); + break; + } + case IBV_QPS_INIT: + { + DAPL_IA *ia_ptr; + DAPL_EP *ep_ptr; + /* need to find way back to port num */ + ep_ptr = (DAPL_EP*)qp_handle->qp_context; + if (ep_ptr) + ia_ptr = ep_ptr->header.owner_ia; + else + break; + + mask |= IBV_QP_PKEY_INDEX | + IBV_QP_PORT | + IBV_QP_ACCESS_FLAGS; + + qp_attr.pkey_index = 0; + qp_attr.port_num = ia_ptr->hca_ptr->port_num; + qp_attr.qp_access_flags = + IBV_ACCESS_LOCAL_WRITE | + IBV_ACCESS_REMOTE_WRITE | + IBV_ACCESS_REMOTE_READ | + IBV_ACCESS_REMOTE_ATOMIC; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " modify_qp_init: pi %x port %x acc %x\n", + qp_attr.pkey_index, qp_attr.port_num, + qp_attr.qp_access_flags ); + break; + } + default: + break; + + } + + if (ibv_modify_qp(qp_handle, &qp_attr, mask)) + return(dapl_convert_errno(errno,"modify_qp_state")); + + return DAT_SUCCESS; +} + +/* + * Local variables: + * c-indent-level: 4 + * c-basic-offset: 4 + * tab-width: 8 + * End: + */ Index: dapl/openib_scm/README =================================================================== --- dapl/openib_scm/README (revision 0) +++ dapl/openib_scm/README (revision 0) @@ -0,0 +1,40 @@ + +OpenIB uDAPL provider using 
socket-based CM, in lieu of uCM/uAT, to set up QP/channels. + +to build: + +cd dapl/udapl +make VERBS=openib_scm clean +make VERBS=openib_scm + + +Modifications to common code: + +- added dapl/openib_scm directory + + dapl/udapl/Makefile + +New files for openib_scm provider + + dapl/openib/dapl_ib_cq.c + dapl/openib/dapl_ib_dto.h + dapl/openib/dapl_ib_mem.c + dapl/openib/dapl_ib_qp.c + dapl/openib/dapl_ib_util.c + dapl/openib/dapl_ib_util.h + dapl/openib/dapl_ib_cm.c + +A simple dapl test just for openib_scm testing... + + test/dtest/dtest.c + test/dtest/makefile + + server: dtest -s + client: dtest -h hostname + +known issues: + + no memory windows support in ibverbs, dat_create_rmr fails. + + + Index: dapl/openib_scm/dapl_ib_util.h =================================================================== --- dapl/openib_scm/dapl_ib_util.h (revision 0) +++ dapl/openib_scm/dapl_ib_util.h (revision 0) @@ -0,0 +1,355 @@ +/* + * This Software is licensed under both of the following two licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * in the file LICENSE.txt in the root directory. The license is also + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * OR + * + * 2) under the terms of the "The BSD License" a copy of which is in the file + * LICENSE2.txt in the root directory. The license is also available from + * the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * Licensee has the right to choose either one of the above two licenses. + * + * Redistributions of source code must retain both the above copyright + * notice and either one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, either one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_util.h + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - definitions, prototypes, + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + **************************************************************************/ + +#ifndef _DAPL_IB_UTIL_H_ +#define _DAPL_IB_UTIL_H_ + +#include "verbs.h" +#include + +#ifndef __cplusplus +#define false 0 +#define true 1 +#endif /*__cplusplus */ + +/* Typedefs to map common DAPL provider types to IB verbs */ +typedef struct ibv_qp *ib_qp_handle_t; +typedef struct ibv_cq *ib_cq_handle_t; +typedef struct ibv_pd *ib_pd_handle_t; +typedef struct ibv_mr *ib_mr_handle_t; +typedef struct ibv_mw *ib_mw_handle_t; +typedef struct ibv_wc ib_work_completion_t; + +/* HCA context type maps to IB verbs */ +typedef struct ibv_context *ib_hca_handle_t; +typedef ib_hca_handle_t dapl_ibal_ca_t; + +/* CM mappings, user CM not complete use SOCKETS */ + +/* destination info to exchange until real IB CM shows up */ +typedef struct _ib_qp_cm +{ + uint32_t qpn; + uint16_t lid; + uint16_t port; + int p_size; + DAT_SOCK_ADDR6 ia_address; + +} ib_qp_cm_t; + +/* + * dapl_llist_entry in dapl.h but dapl.h depends on provider + * typedef's in this file first. 
move dapl_llist_entry out of dapl.h + */ +struct ib_llist_entry +{ + struct dapl_llist_entry *flink; + struct dapl_llist_entry *blink; + void *data; + struct dapl_llist_entry *list_head; +}; + +struct ib_cm_handle +{ + struct ib_llist_entry entry; + int socket; + int l_socket; + struct dapl_hca *hca_ptr; + DAT_HANDLE cr; + DAT_HANDLE sp; + ib_qp_cm_t dst; + unsigned char p_data[256]; +}; + +typedef struct ib_cm_handle *ib_cm_handle_t; +typedef ib_cm_handle_t ib_cm_srvc_handle_t; + +DAT_RETURN getipaddr(char *addr, int addr_len); + +/* CM events */ +typedef enum +{ + IB_CME_CONNECTED, + IB_CME_DISCONNECTED, + IB_CME_DISCONNECTED_ON_LINK_DOWN, + IB_CME_CONNECTION_REQUEST_PENDING, + IB_CME_CONNECTION_REQUEST_PENDING_PRIVATE_DATA, + IB_CME_DESTINATION_REJECT, + IB_CME_DESTINATION_REJECT_PRIVATE_DATA, + IB_CME_DESTINATION_UNREACHABLE, + IB_CME_TOO_MANY_CONNECTION_REQUESTS, + IB_CME_LOCAL_FAILURE, + IB_CM_LOCAL_FAILURE + +} ib_cm_events_t; + +/* prototype for cm thread */ +void cr_thread (void *arg); + +/* Operation and state mappings */ +typedef enum ibv_send_flags ib_send_op_type_t; +typedef struct ibv_sge ib_data_segment_t; +typedef enum ibv_qp_state ib_qp_state_t; +typedef enum ibv_event_type ib_async_event_type; +typedef struct ibv_async_event ib_error_record_t; + +/* CQ notifications */ +typedef enum +{ + IB_NOTIFY_ON_NEXT_COMP, + IB_NOTIFY_ON_SOLIC_COMP + +} ib_notification_type_t; + +/* other mappings */ +typedef int ib_bool_t; +typedef union ibv_gid GID; +typedef char *IB_HCA_NAME; +typedef uint16_t ib_hca_port_t; +typedef uint32_t ib_comp_handle_t; + +#ifdef CQ_WAIT_OBJECT +typedef struct ibv_comp_channel *ib_wait_obj_handle_t; +#endif + +/* Definitions */ +#define IB_INVALID_HANDLE NULL + +/* inline send rdma threshold */ +#define INLINE_SEND_DEFAULT 128 + +/* CM private data areas */ +#define IB_MAX_REQ_PDATA_SIZE 92 +#define IB_MAX_REP_PDATA_SIZE 196 +#define IB_MAX_REJ_PDATA_SIZE 148 +#define IB_MAX_DREQ_PDATA_SIZE 220 +#define IB_MAX_DREP_PDATA_SIZE 224 + 
+/* DTO OPs, ordered for DAPL ENUM definitions ???*/ +#define OP_RDMA_WRITE IBV_WR_RDMA_WRITE +#define OP_RDMA_WRITE_IMM IBV_WR_RDMA_WRITE_WITH_IMM +#define OP_SEND IBV_WR_SEND +#define OP_SEND_IMM IBV_WR_SEND_WITH_IMM +#define OP_RDMA_READ IBV_WR_RDMA_READ +#define OP_COMP_AND_SWAP IBV_WR_ATOMIC_CMP_AND_SWP +#define OP_FETCH_AND_ADD IBV_WR_ATOMIC_FETCH_AND_ADD +#define OP_RECEIVE 7 /* internal op */ +#define OP_RECEIVE_IMM 8 /* internal op */ +#define OP_BIND_MW 9 /* internal op */ +#define OP_INVALID 0xff + +/* Definitions to map QP state */ +#define IB_QP_STATE_RESET IBV_QPS_RESET +#define IB_QP_STATE_INIT IBV_QPS_INIT +#define IB_QP_STATE_RTR IBV_QPS_RTR +#define IB_QP_STATE_RTS IBV_QPS_RTS +#define IB_QP_STATE_SQD IBV_QPS_SQD +#define IB_QP_STATE_SQE IBV_QPS_SQE +#define IB_QP_STATE_ERROR IBV_QPS_ERR + +/* Definitions for ibverbs/mthca return codes, should be defined in verbs.h */ +/* some are errno and some are -n values */ + +/** + * ibv_get_device_name - Return kernel device name + * ibv_get_device_guid - Return device's node GUID + * ibv_open_device - Return ibv_context or NULL + * ibv_close_device - Return 0, (errno?) + * ibv_get_async_event - Return 0, -1 + * ibv_alloc_pd - Return ibv_pd, NULL + * ibv_dealloc_pd - Return 0, errno + * ibv_reg_mr - Return ibv_mr, NULL + * ibv_dereg_mr - Return 0, errno + * ibv_create_cq - Return ibv_cq, NULL + * ibv_destroy_cq - Return 0, errno + * ibv_get_cq_event - Return 0 & ibv_cq/context, int + * ibv_poll_cq - Return n & ibv_wc, 0 ok, -1 empty, -2 error + * ibv_req_notify_cq - Return 0 (void?) 
+ * ibv_create_qp - Return ibv_qp, NULL + * ibv_modify_qp - Return 0, errno + * ibv_destroy_qp - Return 0, errno + * ibv_post_send - Return 0, -1 & bad_wr + * ibv_post_recv - Return 0, -1 & bad_wr + */ + +/* async handler for DTO, CQ, QP, and unaffiliated */ +typedef void (*ib_async_dto_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef void (*ib_async_cq_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_cq_handle_t ib_cq_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef void (*ib_async_qp_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_qp_handle_t ib_qp_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef void (*ib_async_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef enum +{ + IB_THREAD_INIT, + IB_THREAD_RUN, + IB_THREAD_CANCEL, + IB_THREAD_EXIT + +} ib_thread_state_t; + +/* ib_hca_transport_t, specific to this implementation */ +typedef struct _ib_hca_transport +{ + struct ibv_device *ib_dev; + ib_cq_handle_t ib_cq_empty; + DAPL_OS_LOCK cq_lock; + int max_inline_send; + ib_thread_state_t cq_state; + DAPL_OS_THREAD cq_thread; + struct ibv_comp_channel *ib_cq; + int cr_state; + DAPL_OS_THREAD thread; + DAPL_OS_LOCK lock; + struct dapl_llist_entry *list; + ib_async_handler_t async_unafiliated; + void *async_un_ctx; + ib_async_cq_handler_t async_cq_error; + ib_async_dto_handler_t async_cq; + ib_async_qp_handler_t async_qp_error; + +} ib_hca_transport_t; + +/* provider specific fields for shared memory support */ +typedef uint32_t ib_shm_transport_t; + +/* prototypes */ +int32_t dapls_ib_init (void); +int32_t dapls_ib_release (void); +void cq_thread (void *arg); +void cr_thread(void *arg); +int dapli_cq_thread_init(struct dapl_hca *hca_ptr); +void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr); + + +DAT_RETURN +dapls_modify_qp_state ( IN ib_qp_handle_t qp_handle, + IN ib_qp_state_t 
qp_state, + IN ib_qp_cm_t *qp_cm ); + +/* inline functions */ +STATIC _INLINE_ IB_HCA_NAME dapl_ib_convert_name (IN char *name) +{ + /* use ascii; name of local device */ + return dapl_os_strdup(name); +} + +STATIC _INLINE_ void dapl_ib_release_name (IN IB_HCA_NAME name) +{ + return; +} + +/* + * Convert errno to DAT_RETURN values + */ +STATIC _INLINE_ DAT_RETURN +dapl_convert_errno( IN int err, IN const char *str ) +{ + if (!err) return DAT_SUCCESS; + +#if DAPL_DBG + if ((err != EAGAIN) && (err != ETIME) && (err != ETIMEDOUT)) + dapl_dbg_log (DAPL_DBG_TYPE_ERR," %s %s\n", str, strerror(err)); +#endif + + switch( err ) + { + case EOVERFLOW : return DAT_LENGTH_ERROR; + case EACCES : return DAT_PRIVILEGES_VIOLATION; + case ENXIO : + case ERANGE : + case EPERM : return DAT_PROTECTION_VIOLATION; + case EINVAL : + case EBADF : + case ENOENT : + case ENOTSOCK : return DAT_INVALID_HANDLE; + case EISCONN : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_CONNECTED; + case ECONNREFUSED : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_NOTREADY; + case ETIME : + case ETIMEDOUT : return DAT_TIMEOUT_EXPIRED; + case ENETUNREACH: return DAT_INVALID_ADDRESS | DAT_INVALID_ADDRESS_UNREACHABLE; + case EADDRINUSE : return DAT_CONN_QUAL_IN_USE; + case EALREADY : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_ACTCONNPENDING; + case ENOSPC : + case ENOMEM : + case E2BIG : + case EDQUOT : return DAT_INSUFFICIENT_RESOURCES; + case EAGAIN : return DAT_QUEUE_EMPTY; + case EINTR : return DAT_INTERRUPTED_CALL; + case EAFNOSUPPORT : return DAT_INVALID_ADDRESS | DAT_INVALID_ADDRESS_MALFORMED; + case EFAULT : + default : return DAT_INTERNAL_ERROR; + } + } + +/* + * Definitions required only for DAT 1.1 builds + */ +#define IB_ACCESS_LOCAL_READ IBV_ACCESS_LOCAL_WRITE +#define IB_ACCESS_LOCAL_WRITE IBV_ACCESS_LOCAL_WRITE +#define IB_ACCESS_REMOTE_READ IBV_ACCESS_REMOTE_READ +#define IB_ACCESS_REMOTE_WRITE IBV_ACCESS_REMOTE_WRITE +#define IB_ACCESS_MW_BIND IBV_ACCESS_LOCAL_WRITE +#define 
IB_ACCESS_ATOMIC + +#endif /* _DAPL_IB_UTIL_H_ */ Index: dapl/openib_scm/dapl_ib_cq.c =================================================================== --- dapl/openib_scm/dapl_ib_cq.c (revision 0) +++ dapl/openib_scm/dapl_ib_cq.c (revision 0) @@ -0,0 +1,619 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_cq.c + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - completion queue + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + **************************************************************************/ + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_lmr_util.h" +#include "dapl_evd_util.h" +#include "dapl_ring_buffer_util.h" +#include +#include + +int dapli_cq_thread_init(struct dapl_hca *hca_ptr) +{ + DAT_RETURN dat_status; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%p)\n", hca_ptr); + + /* create thread to process CQ completion events */ + hca_ptr->ib_trans.cq_state = IB_THREAD_INIT; + dat_status = dapl_os_thread_create(cq_thread, (void*)hca_ptr, &hca_ptr->ib_trans.cq_thread); + if (dat_status != DAT_SUCCESS) + { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " cq_thread_init: failed to create thread\n"); + return 1; + } + + /* wait for thread to start */ + while (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 20000000; /* 20 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " cq_thread_init: waiting for cq_thread\n"); + nanosleep (&sleep, &remain); + } + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%d) exit\n",getpid()); + return 0; +} + +void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr) +{ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%p)\n", hca_ptr); + + if (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) + return; + + /* cancel cq_thread and wait for it to exit */ + hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL; + pthread_kill(hca_ptr->ib_trans.cq_thread, SIGUSR1); + dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p) cancel\n",hca_ptr); + while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 200000000; /* 200 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " cq_thread_destroy: waiting for cq_thread\n"); + nanosleep (&sleep, &remain); + } + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d) exit\n",getpid()); +} + +/* catch the signal */ +static void ib_cq_handler(int signum) +{ + return; +} + +void cq_thread( void 
*arg ) +{ + struct dapl_hca *hca_ptr = arg; + struct dapl_evd *evd_ptr; + struct ibv_cq *ibv_cq = NULL; + sigset_t sigset; + + sigemptyset(&sigset); + sigaddset(&sigset,SIGUSR1); + pthread_sigmask(SIG_UNBLOCK, &sigset, NULL); + signal(SIGUSR1, ib_cq_handler); + + hca_ptr->ib_trans.cq_state = IB_THREAD_RUN; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: ENTER hca %p\n",hca_ptr); + + /* wait on DTO event, or signal to abort */ + while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) { + struct pollfd cq_fd = { + .fd = hca_ptr->ib_trans.ib_cq->fd, + .events = POLLIN, + .revents = 0 + }; + if ((poll(&cq_fd, 1, -1) == 1) && + (!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq, + &ibv_cq, (void*)&evd_ptr))) { + + if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) { + ibv_ack_cq_events(ibv_cq, 1); + return; + } + + /* process DTO event via callback */ + dapl_evd_dto_callback ( hca_ptr->ib_hca_handle, + evd_ptr->ib_cq_handle, + (void*)evd_ptr ); + + ibv_ack_cq_events(ibv_cq, 1); + } + } + hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca %p \n", hca_ptr); +} + + +/* + * Map all verbs DTO completion codes to the DAT equivalent. 
+ * + * Not returned by verbs: DAT_DTO_ERR_PARTIAL_PACKET + */ +static struct ib_status_map +{ + int ib_status; + DAT_DTO_COMPLETION_STATUS dat_status; +} ib_status_map[] = { + /* 00 */ { IBV_WC_SUCCESS, DAT_DTO_SUCCESS}, + /* 01 */ { IBV_WC_LOC_LEN_ERR, DAT_DTO_ERR_LOCAL_LENGTH}, + /* 02 */ { IBV_WC_LOC_QP_OP_ERR, DAT_DTO_ERR_LOCAL_EP}, + /* 03 */ { IBV_WC_LOC_EEC_OP_ERR, DAT_DTO_ERR_TRANSPORT}, + /* 04 */ { IBV_WC_LOC_PROT_ERR, DAT_DTO_ERR_LOCAL_PROTECTION}, + /* 05 */ { IBV_WC_WR_FLUSH_ERR, DAT_DTO_ERR_FLUSHED}, + /* 06 */ { IBV_WC_MW_BIND_ERR, DAT_RMR_OPERATION_FAILED}, + /* 07 */ { IBV_WC_BAD_RESP_ERR, DAT_DTO_ERR_BAD_RESPONSE}, + /* 08 */ { IBV_WC_LOC_ACCESS_ERR, DAT_DTO_ERR_LOCAL_PROTECTION}, + /* 09 */ { IBV_WC_REM_INV_REQ_ERR, DAT_DTO_ERR_REMOTE_RESPONDER}, + /* 10 */ { IBV_WC_REM_ACCESS_ERR, DAT_DTO_ERR_REMOTE_ACCESS}, + /* 11 */ { IBV_WC_REM_OP_ERR, DAT_DTO_ERR_REMOTE_RESPONDER}, + /* 12 */ { IBV_WC_RETRY_EXC_ERR, DAT_DTO_ERR_TRANSPORT}, + /* 13 */ { IBV_WC_RNR_RETRY_EXC_ERR, DAT_DTO_ERR_RECEIVER_NOT_READY}, + /* 14 */ { IBV_WC_LOC_RDD_VIOL_ERR, DAT_DTO_ERR_LOCAL_PROTECTION}, + /* 15 */ { IBV_WC_REM_INV_RD_REQ_ERR, DAT_DTO_ERR_REMOTE_RESPONDER}, + /* 16 */ { IBV_WC_REM_ABORT_ERR, DAT_DTO_ERR_REMOTE_RESPONDER}, + /* 17 */ { IBV_WC_INV_EECN_ERR, DAT_DTO_ERR_TRANSPORT}, + /* 18 */ { IBV_WC_INV_EEC_STATE_ERR, DAT_DTO_ERR_TRANSPORT}, + /* 19 */ { IBV_WC_FATAL_ERR, DAT_DTO_ERR_TRANSPORT}, + /* 20 */ { IBV_WC_RESP_TIMEOUT_ERR, DAT_DTO_ERR_RECEIVER_NOT_READY}, + /* 21 */ { IBV_WC_GENERAL_ERR, DAT_DTO_ERR_TRANSPORT}, +}; + +/* + * dapls_ib_get_dto_status + * + * Return the DAT status of a DTO operation + * + * Input: + * cqe_ptr pointer to completion queue entry + * + * Output: + * none + * + * Returns: + * Value from ib_status_map table above + */ + +DAT_DTO_COMPLETION_STATUS +dapls_ib_get_dto_status ( + IN ib_work_completion_t *cqe_ptr) +{ + uint32_t ib_status; + int i; + + ib_status = DAPL_GET_CQE_STATUS (cqe_ptr); + + /* + * Due to the implementation of verbs 
completion code, we need to + * search the table for the correct value rather than assuming + * linear distribution. + */ + for (i = 0; i <= IBV_WC_GENERAL_ERR; i++) { + if (ib_status == ib_status_map[i].ib_status) { + if ( ib_status != IBV_WC_SUCCESS ) { + dapl_dbg_log (DAPL_DBG_TYPE_DTO_COMP_ERR, + " DTO completion ERROR: %d: op %#x\n", + ib_status, DAPL_GET_CQE_OPTYPE (cqe_ptr)); + } + return ib_status_map[i].dat_status; + } + } + + dapl_dbg_log (DAPL_DBG_TYPE_DTO_COMP_ERR, + " DTO completion ERROR: %d: op %#x\n", + ib_status, + DAPL_GET_CQE_OPTYPE (cqe_ptr)); + + return DAT_DTO_FAILURE; +} + +DAT_RETURN dapls_ib_get_async_event ( + IN ib_error_record_t *err_record, + OUT DAT_EVENT_NUMBER *async_event) +{ + DAT_RETURN dat_status = DAT_SUCCESS; + int err_code = err_record->event_type; + + switch (err_code) { + /* OVERFLOW error */ + case IBV_EVENT_CQ_ERR: + *async_event = DAT_ASYNC_ERROR_EVD_OVERFLOW; + break; + /* INTERNAL errors */ + case IBV_EVENT_DEVICE_FATAL: + *async_event = DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR; + break; + /* CATASTROPHIC errors */ + case IBV_EVENT_PORT_ERR: + *async_event = DAT_ASYNC_ERROR_IA_CATASTROPHIC; + break; + /* BROKEN QP error */ + case IBV_EVENT_SQ_DRAINED: + case IBV_EVENT_QP_FATAL: + case IBV_EVENT_QP_REQ_ERR: + case IBV_EVENT_QP_ACCESS_ERR: + *async_event = DAT_ASYNC_ERROR_EP_BROKEN; + break; + + /* connection completion */ + case IBV_EVENT_COMM_EST: + *async_event = DAT_CONNECTION_EVENT_ESTABLISHED; + break; + + /* TODO: process HW state changes */ + case IBV_EVENT_PATH_MIG: + case IBV_EVENT_PATH_MIG_ERR: + case IBV_EVENT_PORT_ACTIVE: + case IBV_EVENT_LID_CHANGE: + case IBV_EVENT_PKEY_CHANGE: + case IBV_EVENT_SM_CHANGE: + default: + dat_status = DAT_ERROR (DAT_NOT_IMPLEMENTED, 0); + } + return dat_status; +} + +/* + * dapl_ib_cq_alloc + * + * Alloc a CQ + * + * Input: + * ia_handle IA handle + * evd_ptr pointer to EVD struct + * cqlen minimum QLen + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * 
DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_cq_alloc ( + IN DAPL_IA *ia_ptr, + IN DAPL_EVD *evd_ptr, + IN DAT_COUNT *cqlen ) +{ + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + "dapls_ib_cq_alloc: evd %p cqlen=%d \n", evd_ptr, *cqlen ); + + struct ibv_comp_channel *channel = ia_ptr->hca_ptr->ib_trans.ib_cq; + +#ifdef CQ_WAIT_OBJECT + if (evd_ptr->cq_wait_obj_handle) + channel = evd_ptr->cq_wait_obj_handle; +#endif + + /* Call IB verbs to create CQ */ + evd_ptr->ib_cq_handle = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle, + *cqlen, + evd_ptr, + channel, 0); + + if (evd_ptr->ib_cq_handle == IB_INVALID_HANDLE) + return DAT_INSUFFICIENT_RESOURCES; + + /* arm cq for events */ + dapls_set_cq_notify(ia_ptr, evd_ptr); + + /* update with returned cq entry size */ + *cqlen = evd_ptr->ib_cq_handle->cqe; + + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + "dapls_ib_cq_alloc: new_cq %p cqlen=%d \n", + evd_ptr->ib_cq_handle, *cqlen ); + + return DAT_SUCCESS; +} + + +/* + * dapl_ib_cq_resize + * + * Alloc a CQ + * + * Input: + * ia_handle IA handle + * evd_ptr pointer to EVD struct + * cqlen minimum QLen + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN +dapls_ib_cq_resize ( + IN DAPL_IA *ia_ptr, + IN DAPL_EVD *evd_ptr, + IN DAT_COUNT *cqlen ) +{ + ib_cq_handle_t new_cq; + struct ibv_comp_channel *channel = ia_ptr->hca_ptr->ib_trans.ib_cq; + + /* IB verbs doe not support resize. Try to re-create CQ + * with new size. Can only be done if QP is not attached. + * destroy EBUSY == QP still attached. 
+ */ + +#ifdef CQ_WAIT_OBJECT + if (evd_ptr->cq_wait_obj_handle) + channel = evd_ptr->cq_wait_obj_handle; +#endif + + /* Call IB verbs to create CQ */ + new_cq = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle, *cqlen, + evd_ptr, channel, 0); + + if (new_cq == IB_INVALID_HANDLE) + return DAT_INSUFFICIENT_RESOURCES; + + /* destroy the original and replace if successful */ + if (ibv_destroy_cq(evd_ptr->ib_cq_handle)) { + ibv_destroy_cq(new_cq); + return(dapl_convert_errno(errno,"resize_cq")); + } + + /* update EVD with new cq handle and size */ + evd_ptr->ib_cq_handle = new_cq; + *cqlen = new_cq->cqe; + + /* arm cq for events */ + dapls_set_cq_notify (ia_ptr, evd_ptr); + + return DAT_SUCCESS; +} + +/* + * dapls_ib_cq_free + * + * destroy a CQ + * + * Input: + * ia_handle IA handle + * evd_ptr pointer to EVD struct + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN dapls_ib_cq_free ( + IN DAPL_IA *ia_ptr, + IN DAPL_EVD *evd_ptr) +{ + if ( evd_ptr->ib_cq_handle != IB_INVALID_HANDLE ) { + /* copy all entries on CQ to EVD before destroying */ + dapls_evd_copy_cq(evd_ptr); + if (ibv_destroy_cq(evd_ptr->ib_cq_handle)) + return(dapl_convert_errno(errno,"destroy_cq")); + evd_ptr->ib_cq_handle = IB_INVALID_HANDLE; + } + return DAT_SUCCESS; +} + +/* + * dapls_set_cq_notify + * + * Set the CQ notification for next + * + * Input: + * hca_handl hca handle + * DAPL_EVD evd handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * dapl_convert_errno + */ +DAT_RETURN dapls_set_cq_notify ( + IN DAPL_IA *ia_ptr, + IN DAPL_EVD *evd_ptr) +{ + if (ibv_req_notify_cq( evd_ptr->ib_cq_handle, 0 )) + return(dapl_convert_errno(errno,"notify_cq")); + else + return DAT_SUCCESS; +} + +/* + * dapls_ib_completion_notify + * + * Set the CQ notification type + * + * Input: + * hca_handl hca handle + * evd_ptr evd handle + * type notification type + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * dapl_convert_errno + */ 
+DAT_RETURN dapls_ib_completion_notify ( + IN ib_hca_handle_t hca_handle, + IN DAPL_EVD *evd_ptr, + IN ib_notification_type_t type) +{ + if (ibv_req_notify_cq( evd_ptr->ib_cq_handle, type )) + return(dapl_convert_errno(errno,"notify_cq_type")); + else + return DAT_SUCCESS; +} + +/* + * dapls_ib_completion_poll + * + * CQ poll for completions + * + * Input: + * hca_handl hca handle + * evd_ptr evd handle + * wc_ptr work completion + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_QUEUE_EMPTY + * + */ +DAT_RETURN dapls_ib_completion_poll ( + IN DAPL_HCA *hca_ptr, + IN DAPL_EVD *evd_ptr, + IN ib_work_completion_t *wc_ptr) +{ + int ret; + + ret = ibv_poll_cq(evd_ptr->ib_cq_handle, 1, wc_ptr); + if (ret == 1) + return DAT_SUCCESS; + + return DAT_QUEUE_EMPTY; +} + +#ifdef CQ_WAIT_OBJECT + +/* NEW common wait objects for providers with direct CQ wait objects */ +DAT_RETURN +dapls_ib_wait_object_create ( + IN DAPL_EVD *evd_ptr, + IN ib_wait_obj_handle_t *p_cq_wait_obj_handle ) +{ + dapl_dbg_log ( DAPL_DBG_TYPE_CM, + " cq_object_create: (%p,%p)\n", + evd_ptr, p_cq_wait_obj_handle ); + + /* set cq_wait object to evd_ptr */ + *p_cq_wait_obj_handle = + ibv_create_comp_channel(evd_ptr->header.owner_ia->hca_ptr->ib_hca_handle); + + return DAT_SUCCESS; +} + +DAT_RETURN +dapls_ib_wait_object_destroy ( + IN ib_wait_obj_handle_t p_cq_wait_obj_handle) +{ + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + " cq_object_destroy: wait_obj=%p\n", + p_cq_wait_obj_handle ); + + ibv_destroy_comp_channel(p_cq_wait_obj_handle); + + return DAT_SUCCESS; +} + +DAT_RETURN +dapls_ib_wait_object_wakeup ( + IN ib_wait_obj_handle_t p_cq_wait_obj_handle) +{ + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + " cq_object_wakeup: wait_obj=%p\n", + p_cq_wait_obj_handle ); + + /* no wake up mechanism */ + return DAT_SUCCESS; +} + +DAT_RETURN +dapls_ib_wait_object_wait ( + IN ib_wait_obj_handle_t p_cq_wait_obj_handle, + IN u_int32_t timeout) +{ + struct dapl_evd *evd_ptr; + struct ibv_cq *ibv_cq = NULL; + void 
*ibv_ctx = NULL; + int status = 0; + int timeout_ms = -1; + struct pollfd cq_fd = { + .fd = p_cq_wait_obj_handle->fd, + .events = POLLIN, + .revents = 0 + }; + + dapl_dbg_log ( DAPL_DBG_TYPE_CM, + " cq_object_wait: CQ channel %p time %d\n", + p_cq_wait_obj_handle, timeout ); + + /* uDAPL timeout values in usecs */ + if (timeout != DAT_TIMEOUT_INFINITE) + timeout_ms = timeout/1000; + + status = poll(&cq_fd, 1, timeout_ms); + + /* returned event */ + if (status > 0) { + if (!ibv_get_cq_event(p_cq_wait_obj_handle, + &ibv_cq, (void*)&evd_ptr)) { + ibv_ack_cq_events(ibv_cq, 1); + } + status = 0; + + /* timeout */ + } else if (status == 0) + status = ETIMEDOUT; + + dapl_dbg_log (DAPL_DBG_TYPE_CM, + " cq_object_wait: RET evd %p ibv_cq %p ibv_ctx %p %s\n", + evd_ptr, ibv_cq,ibv_ctx,strerror(errno)); + + return(dapl_convert_errno(status,"cq_wait_object_wait")); + +} +#endif + +/* + * Local variables: + * c-indent-level: 4 + * c-basic-offset: 4 + * tab-width: 8 + * End: + */ + From mshefty at ichips.intel.com Tue Oct 25 11:20:57 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 25 Oct 2005 11:20:57 -0700 Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020B02@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020B02@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <435E7789.6030708@ichips.intel.com> Caitlin Bestler wrote: > What you are proposing is an API that purports to have the > semantics of TCP/IP connection establishment that can be > implemented under non-IP transports such as InfiniBand. > > However, as proposed the mapping of this API to InfiniBand > does *not* implement the semantics of TCP/IP connection > establishment in that the remote address presented to > the listener has been subject to *no* authentication. > > That is a change in the API that has an impact on the > application. 
It is creating a requirement for the application > to validate the remote identity greater than it would face > for TCP/IP connection establishment. What API proposal are you referring to? If you're referring to the CMA, there's only a kernel (privileged) component in existence. It sets the IP address in the private data. What is the issue that you're referring to? - Sean From caitlinb at broadcom.com Tue Oct 25 11:23:34 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 25 Oct 2005 11:23:34 -0700 Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020B04@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: Tom Tucker [mailto:tom at opengridcomputing.com] > Sent: Tuesday, October 25, 2005 11:13 AM > To: Caitlin Bestler > Cc: Sean Hefty; Kanevsky, Arkady; swg at infinibandta.org; DAT > Collaborative; openib-general at openib.org > Subject: RE: [openib-general] RE: [dat-discussions] round 2 - > proposal for socket based connection model > > On Tue, 2005-10-25 at 10:51 -0700, Caitlin Bestler wrote: > > > > > > > > > > I believe that the assurances you are talking about are > peculiar to > > > an implementation, not to the network. > > > > > > > I disagree. Anytime you send an IP datagram on an IP > network you are > > expected to provide an authentic source address. Any intermediate > > network device MAY enforce that rule and drop packets with invalid > > source addresses. > > > > I don't see anything in the protocol specs (RFC 791, RFC 793, > ...) that talks about this, so we just have to agree to disagree. :-) > Joe Touch's current draft on spoofing prevention covers this well in Section 3.2 (draft-ietf-tcpm-tcp-antispoof-02). IP networks can prevent address spoofing at the network layer using IPSec or by having border routers/filters validate the source address of incoming packets against routing rules. 
The latter is covered in RFC 3704 "Ingress Filtering for Multihomed Networks" and RFC 2267 "Network Ingress Filtering: Defeating Denial of Service Attacks which employ IP Source Address Spoofing" (since obsoleted by RFC 2827, which has the same title). And more generally, in a TCP network a non-privileged client is NOT allowed to bind to an arbitrary address and is NOT allowed to send raw Ethernet frames to bypass the host stack. From caitlinb at broadcom.com Tue Oct 25 11:28:59 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 25 Oct 2005 11:28:59 -0700 Subject: [openib-general] RE: [dat-discussions] round 2 - proposal for socket based connection model Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020B06@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 25, 2005 11:21 AM > To: Caitlin Bestler > Cc: Kanevsky, Arkady; openib-general at openib.org; swg at infinibandta.org > Subject: Re: [openib-general] RE: [dat-discussions] round 2 - > proposal for socket based connection model > > Caitlin Bestler wrote: > > What you are proposing is an API that purports to have the > semantics > > of TCP/IP connection establishment that can be implemented under > > non-IP transports such as InfiniBand. > > > > However, as proposed the mapping of this API to InfiniBand > does *not* > > implement the semantics of TCP/IP connection establishment > in that the > > remote address presented to the listener has been subject to *no* > > authentication. > > > > That is a change in the API that has an impact on the > application. It > > is creating a requirement for the application to validate the remote > > identity greater than it would face for TCP/IP connection > > establishment. > > What API proposal are you referring to? > > If you're referring to the CMA, there's only a kernel > (privileged) component in existence. It sets the IP address > in the private data. What is the issue that you're referring to? 
> > - Sean > The remote peer will be able to use an existing CM to send a forged IP address. There is nothing the receiving CMA or consumer (no matter how privileged) can do to detect this without the cooperation of privileged components on the remote end. They cannot know that the cooperation they are receiving from the remote end is from a privileged entity unless it comes from a privileged QP and is not part of the existing pass-through data. And playing the "I'm only in kernel" ostrich game doesn't help. Any connection establishment protocol has to make sense from both user and kernel modes and needs to be symmetric. But it needs to include clear controls on who is trusted to provide what information, and what information MUST come from a privileged entity. A source IP address that can come from a non-privileged entity is NOT consistent with IP network connection establishment semantics. From Arkady.Kanevsky at netapp.com Tue Oct 25 12:08:26 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 25 Oct 2005 15:08:26 -0400 Subject: [openib-general] round 2 - proposal for socket based connectionmodel Message-ID: Sean, answers in-line. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 -----Original Message----- From: Sean Hefty [mailto:sean.hefty at intel.com] Sent: Tuesday, October 25, 2005 1:05 PM To: Kanevsky, Arkady; openib-general at openib.org; swg at infinibandta.org Subject: RE: [openib-general] round 2 - proposal for socket based connectionmodel Dear OpenIB, SWG and DAT members, enclosed is the second version of the proposal. There are really 2 proposals that are related. The first one is encoding the IP 5-tuple into REQ private data with small additional info for versioning and IB capabilities. The second is just a couple of ideas, not a real proposal, on mapping of IP ports to IB Service IDs. 
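To make the two ideas above concrete, here is a minimal sketch in C. It is not the layout from the proposal slides (those are not reproduced in this thread): the field order, the field widths, the reserved byte, the names `ip5_tuple`, `ip5_pack` and `ip5_service_id`, and the service ID prefix are all assumptions chosen for illustration. The only figures taken from the discussion are that two 16-byte IP addresses occupy 32 bytes and that the IB CM REQ carries 92 bytes of private data, which is where the "60 bytes left for the Consumer" number in this thread comes from.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IB_REQ_PRIV_DATA_LEN 92  /* private data bytes carried by an IB CM REQ */
#define IP5_HDR_LEN          40  /* size of the HYPOTHETICAL header below */

/* Hypothetical 5-tuple header; field order and widths are assumptions. */
struct ip5_tuple {
    uint8_t  version;       /* single combined major/minor version field */
    uint8_t  ip_version;    /* 4 or 6 */
    uint8_t  protocol;      /* IP protocol number, e.g. 6 for TCP */
    uint16_t src_port;      /* host byte order in this struct */
    uint16_t dst_port;      /* host byte order in this struct */
    uint8_t  src_addr[16];  /* an IPv4 address would use only part of this */
    uint8_t  dst_addr[16];
};

/*
 * Serialize field by field so the wire image depends neither on compiler
 * padding nor on host endianness. Ports are written big-endian.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
static size_t ip5_pack(const struct ip5_tuple *t, uint8_t *buf, size_t len)
{
    if (len < IP5_HDR_LEN)
        return 0;
    buf[0] = t->version;
    buf[1] = t->ip_version;
    buf[2] = t->protocol;
    buf[3] = 0;                              /* reserved (assumption) */
    buf[4] = (uint8_t)(t->src_port >> 8);
    buf[5] = (uint8_t)(t->src_port & 0xff);
    buf[6] = (uint8_t)(t->dst_port >> 8);
    buf[7] = (uint8_t)(t->dst_port & 0xff);
    memcpy(buf + 8,  t->src_addr, 16);
    memcpy(buf + 24, t->dst_addr, 16);
    return IP5_HDR_LEN;
}

/* Placeholder prefix for illustration only; not a value from this thread. */
#define IP5_SID_PREFIX 0x0000000001000000ULL

/* Hypothetical port-to-service-ID mapping: prefix | protocol | port. */
static uint64_t ip5_service_id(uint8_t protocol, uint16_t port)
{
    return IP5_SID_PREFIX | ((uint64_t)protocol << 16) | port;
}
```

With this assumed layout the header takes 40 of the 92 REQ private data bytes, leaving 52 for the Consumer. If the ports, protocol and versions were moved into the service ID, only the 32 address bytes would remain and 60 bytes would be left. Note that nothing in such a header is authenticated; it only shows where the bytes would go.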
Comments on the private data format: Combine major/minor version into a single field. There's no advantage to having two fields, so keep it simple. [AK] agree Remove ZB and SI bits. These are unrelated to socket addressing. [AK] It is true that these are unrelated to socket addressing. But since several ULPs over IB need this info, it can be added to the generic CM extensions for IB. I will rename the proposal to deal with it. I prefer a single private data formatting proposal rather than several layered on top of each other. If the IBTA thinks this is generic enough and wants to redefine some reserved fields for it - good. This is captured in the discussion slides. If the destination port number is encoded in a service ID, then it can be removed from the private data. [AK] This depends on how port mapping to Service ID is done. But if SDP incorporates this into its hello-world protocol, this may still be needed. With the 64-byte Consumer private data requirement relaxed, saving 2 bytes will not make much difference. The transport protocol number could also be encoded in the service ID and removed from the private data. Actually, the version, IP version, and source port could all be encoded in the service ID, limiting the private data to just 32 bytes of IP addresses. [AK] Encoding the IP version into the Service ID sounds strange. A Service ID is a port equivalent. Sure, it is much larger than IP ports, but why should CM extensions encode more than the port into it? Even with this, Consumer private data is still only 60 bytes (not the old 64-byte requirement). - Sean From mshefty at ichips.intel.com Tue Oct 25 12:21:39 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 25 Oct 2005 12:21:39 -0700 Subject: [openib-general] round 2 - proposal for socket based connectionmodel In-Reply-To: References: Message-ID: <435E85C3.3020802@ichips.intel.com> Kanevsky, Arkady wrote: > Sean, > answers in-line. 
> Arkady At this point, I'm just going to disagree with this approach and move on with the current implementation of the CMA. What's needed is a service that provides IB connections using TCP/IP addressing. I don't believe this proposal meets this goal. To meet the requirement of connecting over IB using TCP/IP addressing, I believe that we need a service with a reserved service identifier or range of identifiers, a mechanism for mapping between IP and IB addresses, and a mechanism for reversing the mapping. I don't see where the proposal addresses the bulk of the work that's required, nor do I think that it will present an API to the user that does not expose IB related addressing (such as service IDs). - Sean From Arkady.Kanevsky at netapp.com Tue Oct 25 12:38:35 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 25 Oct 2005 15:38:35 -0400 Subject: [openib-general] round 2 - proposal for socket based connectionmodel Message-ID: What are you trying to achieve? I am trying to define an IB REQ protocol extension that supports the exchange of the IP connection 5-tuple between connection requestor and responder, and to define a mapping between the IP 5-tuple and IB entities. That way a ULP that was written for TCP/IP, UDP/IP, SCTP/IP (and so on) can use an RDMA transport without change. Modifying a ULP so that it knows whether it runs on top of IB vs. iWARP vs. (any other RDMA transport) is a bad idea. It is one thing to choose the proper port to connect to. It is completely different to ask the ULP to parse private data in a transport-specific way. The same protocol must support both user level ULPs and kernel level ULPs. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. 
Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 25, 2005 3:22 PM > To: Kanevsky, Arkady > Cc: Sean Hefty; openib-general at openib.org; swg at infinibandta.org > Subject: Re: [openib-general] round 2 - proposal for socket > based connectionmodel > > > Kanevsky, Arkady wrote: > > Sean, > > answers in-line. > > Arkady > > At this point, I'm just going to disagree with this approach > and move on with > the current implementation of the CMA. What's needed is a > service that provides > IB connections using TCP/IP addressing. I don't believe this > proposal meets > this goal. > > To meet the requirement of connecting over IB using TCP/IP > addressing, I believe > that we need a service with a reserved service identifier or range of > identifiers, a mechanism for mapping between IP and IB > addresses, and a > mechanism for reversing the mapping. > > I don't see where the proposal addresses the bulk of the work > that's required, > nor do I think that it will present an API to the user that > does not expose IB > related addressing (such as service IDs). > > - Sean > From robert.j.woodruff at intel.com Tue Oct 25 12:50:37 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 25 Oct 2005 12:50:37 -0700 Subject: [openib-general] [PATCH] new uDAPL openIB provider using socket CM Message-ID: <1AC79F16F5C5284499BB9591B33D6F0005ED7B55@orsmsx408> Arlin wrote, >James, >Here is a patch to add an optional openIB uDAPL provider that uses the socket CM >for anyone having >problems scaling out with the uCM/uAT version. To build the new provider, simply >"make >VERBS=openib_scm". This version does not require IPoIB, uCM, or uAT. >-arlin I have been using this DAPL provider with my testing of MPI and it seems stable. It has been tested up to 128 nodes. 
The uAT version, on the other hand, still seems to have problems, so I would be glad to see this version be put into the tree also. Sayantan, you may also want to try this socket based CM uDAPL provider for your work. It seems more stable than the uAT version in my testing thus far. woody From ted.kim at sun.com Tue Oct 25 13:16:51 2005 From: ted.kim at sun.com (Ted H. Kim) Date: Tue, 25 Oct 2005 13:16:51 -0700 Subject: [swg] Re: [openib-general] TCP/IP connection service over IB In-Reply-To: <1130202368.6405.11.camel@trinity.austin.ammasso.com> References: <43591D07.5050709@ichips.intel.com> <43594159.3000202@ichips.intel.com> <43594538.7030806@ichips.intel.com> <1129928894.4255.0.camel@trinity.austin.ammasso.com> <4359575B.5020302@ichips.intel.com> <435D7992.7000705@sun.com> <1130202368.6405.11.camel@trinity.austin.ammasso.com> Message-ID: <435E92B3.6020201@sun.com> Tom, Some comments inline ... Tom Tucker wrote: > I think it's relevant, so let's make sure my assumptions are correct: > > - The ITAPI will be a "ULP" on OpenIB ITAPI is like uDAPL, so if uDAPL is a "ULP" then the answer is yes. The point is that for uDAPL you have the actual "app" running over uDAPL. So I guess it's a matter of terminology whether uDAPL is a ULP or some sort of middleware with the app being the "ULP". > - The ITAPI will create the IRD/ORD headers in its private data and > submit this as part of its connection establishment. > - The ITAPI consumer at the remote peer will use this data to configure > its local QP before accepting the connection > > Over IB, the IRD/ORD private data will be prepended with a "private data > header" that contains the source and destination IP addresses, source > port, etc... The remote peer will not see this data as part of the > private data, but rather will see it in the CMA event in the upcall. Over IB, the IRD/ORD data is already built into the standard CM stuff (i.e. the "responder resources" and "initiator depth" fields of REQ and REP). 
So no additional demands are made on private data for IB in ITAPI for the IOH purpose. Of course the ITAPI app (like a uDAPL app) can also use private data for app-specific/ULP reasons. > Over iWARP/MPA, there will be nothing else in the private data except > what was provided by the consumer (ITAPI in this case). The reason being > that this extra information (IP addressing info) is in the protocol > header proper. Just to restate for clarity, ITAPI for iWARP will use the first 16 bytes of MPA private data for the IOH (IRD/ORD header). The rest is usable for app/ULP reasons. I should point out that there was once a proposal for an RDDP IETF draft which would have sub-divided the MPA private data into a "middleware" section and an "app" section. The idea was to be sure that the app/ULP and middleware (e.g. the IOH) uses of private data would not step on each other. I think this idea did not progress, mostly because the author (John Carrier, formerly of Adaptec) changed jobs and was no longer working on iWARP stuff. While not directly proposed, this idea could have been carried over to IB. Some of the ideas on this thread are already implicitly doing this middleware (for IP addressing purpose) vs ULP/app split. -ted From caitlinb at broadcom.com Tue Oct 25 13:37:25 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 25 Oct 2005 13:37:25 -0700 Subject: [swg] Re: [openib-general] TCP/IP connection service over IB Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020B0D@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message From Ted Kim ----- > > > I should point out that there was once a proposal of doing a > RDDP IETF draft which would have sub-divided the MPA private > data into a "middleware" section and an "app" section. The > idea was to be sure that the app/ULP and middleware (e.g. the > IOH) uses of private data would not step on each other. 
I > think this idea did not progress, mostly because the author > (John Carrier, formerly of Adaptec) changed jobs and was no > longer working on iWARP stuff. > > While not directly proposed, this idea could have been > carried over to IB. > Some of the ideas on this thread are already implicitly doing > this middleware (for IP addressing purpose) vs ULP/app split. > > > -ted > From a spec-minimalist viewpoint there is no real benefit in having the wire protocol distinguish between payload provided by a non-privileged middleware library and a non-privileged application. It might be a really nice and convenient thing for it to do, but there is no real harm in having the middleware do the marking. The real issue here is that data that had been in a privileged header that had been implicitly validated by successful routing in both directions is being replaced by data that is traveling opaquely, directly from non-privileged code on the peer. On an IP network you cannot successfully establish a connection where the remote IP address has no correlation with reality and is totally unreviewable by network administrators. The network administrator can also always block connections from certain addresses as a matter of policy and the application cannot override that. Typically the system administrator can as well. Passing IP addresses in non-privileged private data is an entirely different issue. 
From surs at cse.ohio-state.edu Tue Oct 25 13:49:57 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Tue, 25 Oct 2005 16:49:57 -0400 Subject: [openib-general] [PATCH] new uDAPL openIB provider using socket CM In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0005ED7B55@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0005ED7B55@orsmsx408> Message-ID: <20051025204955.GA19984@cse.ohio-state.edu> Hi Woody, * On Oct,2 Woodruff, Robert J wrote : > Arlin wrote, > >James, > > >Here is a patch to add an optional openIB uDAPL provider that uses the > socket CM >for anyone having > >problems scaling out with the uCM/uAT version. To build the new > provider, simply >"make > >VERBS=openib_scm". This version does not require IPoIB, uCM, or uAT. > > >-arlin > > I have been using this DAPL provider with my testing of MPI and > it seems stable. It has been tested up to 128 nodes. > The uAT version on the other hand seems to still have problems > so I would be glad to see this version be put into the tree also. > > Sayantan, you may also want to try this socket based CM > uDAPL provide for your work. It seems more stable than the uAT version > in my testing thus far. Thanks for this information! We will try this out in our lab. Sincerely, Sayantan. 
-- http://www.cse.ohio-state.edu/~surs From surs at cse.ohio-state.edu Tue Oct 25 13:58:03 2005 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Tue, 25 Oct 2005 16:58:03 -0400 Subject: [openib-general] uDAPL open HCA problem In-Reply-To: References: <87e828301b.8301b87e82@osu.edu> Message-ID: <20051025205801.GA20054@cse.ohio-state.edu> * On Oct,10 James Lentini wrote : > > > On Fri, 21 Oct 2005, LEI CHAI wrote: > > > ips_by_gid: RET 0 at_rec 0x7fffffa8d380 -> id 4627 > > dapli_at_event_cb() > > ip_comp_handler: rec 0x7fffffa8d380 ->id 4627 id 4627 num -22 3c66c000 > > ip_comp_handler: resolution err -22 retry 1 > > ip_comp_handler: ips_by_gid 0 rec 0x7fffffa8d380->id 4628 > > dapli_at_event_cb() > > ip_comp_handler: rec 0x7fffffa8d380 ->id 4628 id 4628 num -22 0 > > ip_comp_handler: resolution err -22 retry 2 > > [rdma_udapl_priv.c:640] error(262144): Cannot open IA > > ip_comp_handler: ips_by_gid 0 rec 0x7fffffa8d380->id 4629 > > dapli_at_event_cb() > > ip_comp_handler: rec 0x7fffffa8d380 ->id 4629 id 4629 num -22 0 > > ip_comp_handler: resolution err -22 retry 3 > > ip_comp_handler: ips_by_gid 0 rec 0x7fffffa8d380->id 4630 > > dapli_at_event_cb() > > ip_comp_handler: rec 0x7fffffa8d380 ->id 4630 id 4630 num -22 0 > > ip_comp_handler: resolution err -22 retry 4 > > ip_comp_handler: ERR: at_rec 0x7fffffa8d380, id 4630 num -22 > > open_hca: ERR ib_at_ips_by_gid for mthca0 > > ib_at_ips_by_gid is failing again. Have you setup an IPoIB address? Sorry for the late reply :-( Yes, we have IPoIB setup. This happens intermittently. As suggested by Woody, we will also try out the scm version. Thanks, Sayantan. 
===== ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:150.1.110.4 Bcast:150.1.255.255 Mask:255.255.0.0 inet6 addr: fe80::202:c902:40:315/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) lsmod | grep ^ib [surs at ro1:~] lsmod | grep ^ib ib_ipoib 48008 0 ib_uat 14840 0 ib_at 25696 1 ib_uat ib_sa 17804 2 ib_ipoib,ib_at ib_ucm 22280 0 ib_cm 37744 1 ib_ucm ib_uverbs 35992 0 ib_umad 18208 0 ib_mthca 122656 0 ib_mad 44072 4 ib_sa,ib_cm,ib_umad,ib_mthca ib_core 56192 8 ib_ipoib,ib_sa,ib_ucm,ib_cm,ib_uverbs,ib_umad,ib_mthca,ib_mad Error: [chail at ro1:osu_benchmarks] ../bin/mpiexec -n 2 ./a.out DAPL: NOT Setting Loopback dapl_ib_init: ib_thread_init(7629) dapl_ia_open (ib0, 8, 0x7fffffe20a18, 0x5ae728) open_hca: mthca0 - 0x5c4390 ib_thread(7629,0x40200960): ENTER: pipe 8 at 4 open_hca: Found dev mthca0 0002c90200400314 open_hca: GID subnet fe80000000000000 id 0002c90200400315 ips_by_gid: RET 0 at_rec 0x7fffffe20780 -> id 37 dapli_at_event_cb() ip_comp_handler: rec 0x7fffffe20780 ->id 37 id 37 num -22 3c77c000 ip_comp_handler: resolution err -22 retry 1 ip_comp_handler: ips_by_gid 0 rec 0x7fffffe20780->id 38 dapli_at_event_cb() ip_comp_handler: rec 0x7fffffe20780 ->id 38 id 38 num -22 0 ip_comp_handler: resolution err -22 retry 2 ip_comp_handler: ips_by_gid 0 rec 0x7fffffe20780->id 39 dapli_at_event_cb() ip_comp_handler: rec 0x7fffffe20780 ->id 39 id 39 num -22 0 ip_comp_handler: resolution err -22 retry 3 ip_comp_handler: ips_by_gid 0 rec 0x7fffffe20780->id 40 dapli_at_event_cb() ip_comp_handler: rec 0x7fffffe20780 ->id 40 id 40 num -22 0 ip_comp_handler: resolution err -22 retry 4 ip_comp_handler: ERR: at_rec 0x7fffffe20780, id 40 num -22 open_hca: ERR ib_at_ips_by_gid for mthca0 dapls_ib_open_hca failed 40000 dapl_ia_open () returns 0x40000 DAPL: 
Stopped (dapl_fini) dapl_ib_release: ib_thread_destroy(7629) ib_thread_destroy: waiting for ib_thread ib_thread(7629) EXIT [rdma_udapl_priv.c:640] error(262144): Cannot open IA DAPL: NOT Setting Loopback dapl_ib_init: ib_thread_init(7630) dapl_ia_open (ib0, 8, 0x7fffffe55578, 0x5ae728) open_hca: mthca0 - 0x5c4390 ib_thread(7630,0x40200960): ENTER: pipe 8 at 4 open_hca: Found dev mthca0 0002c90200400314 open_hca: GID subnet fe80000000000000 id 0002c90200400315 ips_by_gid: RET 0 at_rec 0x7fffffe552e0 -> id 41 dapli_at_event_cb() ip_comp_handler: rec 0x7fffffe552e0 ->id 41 id 41 num -22 3c77c000 ip_comp_handler: resolution err -22 retry 1 ip_comp_handler: ips_by_gid 0 rec 0x7fffffe552e0->id 42 dapli_at_event_cb() ip_comp_handler: rec 0x7fffffe552e0 ->id 42 id 42 num -22 0 ip_comp_handler: resolution err -22 retry 2 ip_comp_handler: ips_by_gid 0 rec 0x7fffffe552e0->id 43 dapli_at_event_cb() ip_comp_handler: rec 0x7fffffe552e0 ->id 43 id 43 num -22 0 ip_comp_handler: resolution err -22 retry 3 ip_comp_handler: ips_by_gid 0 rec 0x7fffffe552e0->id 44 dapli_at_event_cb() ip_comp_handler: rec 0x7fffffe552e0 ->id 44 id 44 num -22 0 ip_comp_handler: resolution err -22 retry 4 ip_comp_handler: ERR: at_rec 0x7fffffe552e0, id 44 num -22 [rdma_udapl_priv.c:640] error(262144): Cannot open IA open_hca: ERR ib_at_ips_by_gid for mthca0 dapls_ib_open_hca failed 40000 dapl_ia_open () returns 0x40000 > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- http://www.cse.ohio-state.edu/~surs From rolandd at cisco.com Tue Oct 25 14:14:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 25 Oct 2005 14:14:39 -0700 Subject: [openib-general] [PATCH] Minor mad_rmpp.c cleanup Message-ID: <523bmp47uo.fsf@cisco.com> This changes alloc_response_msg() in mad_rmpp.c to return the struct it allocates 
directly (or an error code a la ERR_PTR), rather than returning a status and passing the struct back in a pointer param. This simplifies the code and gets rid of warnings like drivers/infiniband/core/mad_rmpp.c: In function nack_recv: drivers/infiniband/core/mad_rmpp.c:192: warning: msg may be used uninitialized in this function with newer versions of gcc. Signed-off-by: Roland Dreier --- infiniband/core/mad_rmpp.c (revision 3865) +++ infiniband/core/mad_rmpp.c (working copy) @@ -151,28 +151,27 @@ static void ack_recv(struct mad_rmpp_rec ib_free_send_mad(msg); } -static int alloc_response_msg(struct ib_mad_agent *agent, - struct ib_mad_recv_wc *recv_wc, - struct ib_mad_send_buf **msg) +static struct ib_mad_send_buf *alloc_response_msg(struct ib_mad_agent *agent, + struct ib_mad_recv_wc *recv_wc) { - struct ib_mad_send_buf *m; + struct ib_mad_send_buf *msg; struct ib_ah *ah; ah = ib_create_ah_from_wc(agent->qp->pd, recv_wc->wc, recv_wc->recv_buf.grh, agent->port_num); if (IS_ERR(ah)) - return PTR_ERR(ah); + return (void *) ah; - m = ib_create_send_mad(agent, recv_wc->wc->src_qp, - recv_wc->wc->pkey_index, 1, - IB_MGMT_RMPP_HDR, IB_MGMT_RMPP_DATA, GFP_KERNEL); - if (IS_ERR(m)) { + msg = ib_create_send_mad(agent, recv_wc->wc->src_qp, + recv_wc->wc->pkey_index, 1, + IB_MGMT_RMPP_HDR, IB_MGMT_RMPP_DATA, + GFP_KERNEL); + if (IS_ERR(msg)) ib_destroy_ah(ah); - return PTR_ERR(m); - } - m->ah = ah; - *msg = m; - return 0; + else + msg->ah = ah; + + return msg; } void ib_rmpp_send_handler(struct ib_mad_send_wc *mad_send_wc) @@ -191,8 +190,8 @@ static void nack_recv(struct ib_mad_agen struct ib_rmpp_mad *rmpp_mad; int ret; - ret = alloc_response_msg(&agent->agent, recv_wc, &msg); - if (ret) + msg = alloc_response_msg(&agent->agent, recv_wc); + if (IS_ERR(msg)) return; rmpp_mad = msg->mad; From tom at opengridcomputing.com Tue Oct 25 14:40:46 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 25 Oct 2005 16:40:46 -0500 Subject: [openib-general] round 2 - proposal 
for socket based connection model In-Reply-To: References: Message-ID: <1130276446.9790.85.camel@trinity.austin.ammasso.com> Arkady: I don't think anyone disagrees with your goals. Unfortunately additional requirements on the implementation were coupled with the specification of the private data format (protocol). This peripheral discussion derailed any attempt to discuss the protocol. Attempts to separate the protocol discussion from the implementation failed. And so here we are... On Tue, 2005-10-25 at 15:38 -0400, Kanevsky, Arkady wrote: > What are you trying to achieve? > > I am trying to define an IB REQ protocol extension that > support IP connection 5-tuple exchange between connection > requestor and responder. > And define mapping between IP 5-tuple and IB entities. > > That way ULP which was written to TCP/IP, UDP/IP, CSTP/IP (and so on) > can use RDMA transport without change. > To modify ULP to know that it runs on top of IB vs. iWARP > vs. (any other RDMA transport) is bad idea. > It is one thing to choose proper port to connect. > Completely different to ask ULP to parse private data > in transport specific way. > > The same protocol must support both user level ULPs > and kernel level ULPs. > Arkady > > Arkady Kanevsky email: arkady at netapp.com > Network Appliance phone: 781-768-5395 > 375 Totten Pond Rd. Fax: 781-895-1195 > Waltham, MA 02451-2010 central phone: 781-768-5300 > > > > > -----Original Message----- > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > Sent: Tuesday, October 25, 2005 3:22 PM > > To: Kanevsky, Arkady > > Cc: Sean Hefty; openib-general at openib.org; swg at infinibandta.org > > Subject: Re: [openib-general] round 2 - proposal for socket > > based connectionmodel > > > > > > Kanevsky, Arkady wrote: > > > Sean, > > > answers in-line. > > > Arkady > > > > At this point, I'm just going to disagree with this approach > > and move on with > > the current implementation of the CMA. 
What's needed is a > > service that provides > > IB connections using TCP/IP addressing. I don't believe this > > proposal meets > > this goal. > > > > To meet the requirement of connecting over IB using TCP/IP > > addressing, I believe > > that we need a service with a reserved service identifier or range of > > identifiers, a mechanism for mapping between IP and IB > > addresses, and a > > mechanism for reversing the mapping. > > > > I don't see where the proposal addresses the bulk of the work > > that's required, > > nor do I think that it will present an API to the user that > > does not expose IB > > related addressing (such as service IDs). > > > > - Sean > > From tom at opengridcomputing.com Tue Oct 25 14:52:12 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 25 Oct 2005 16:52:12 -0500 Subject: [swg] Re: [openib-general] TCP/IP connection service over IB In-Reply-To: <435E92B3.6020201@sun.com> References: <43591D07.5050709@ichips.intel.com> <43594159.3000202@ichips.intel.com> <43594538.7030806@ichips.intel.com> <1129928894.4255.0.camel@trinity.austin.ammasso.com> <4359575B.5020302@ichips.intel.com> <435D7992.7000705@sun.com> <1130202368.6405.11.camel@trinity.austin.ammasso.com> <435E92B3.6020201@sun.com> Message-ID: <1130277132.9790.97.camel@trinity.austin.ammasso.com> On Tue, 2005-10-25 at 13:16 -0700, Ted H. Kim wrote: > Tom, > > Some comments inline ... > > > Tom Tucker wrote: > > I think it's relevant, so let's make sure my assumptions are correct: > > > > - The ITAPI will be a "ULP" on OpenIB > > ITAPI is like uDAPL, so if uDAPL is a "ULP" then the answer is yes. > The point is that for uDAPL you have the actual "app" running over > uDAPL.
So I guess it's a matter of terminology whether uDAPL is > a ULP or is it some sort of middleware with the app being the "ULP". > Yeah, you're right, the terminology is probably a little goofy. The reason for the goofosity is that some of the "ULPs" really are protocols (iSER, IPoIB), and some are APIs (DAPL, MPI). All use the same interface to register with OpenIB. But that said, yes, ITAPI is like uDAPL. > > > - The ITAPI will create the IRD/ORD headers in its private data and > > submit this as part of its connection establishment. > > - The ITAPI consumer at the remote peer will use this data to configure > > its local QP before accepting the connection > > > > Over IB, the IRD/ORD private data will be prepended with a "private data > > header" that contains the source and destination IP addresses, source > > port, etc... The remote peer will not see this data as part of the > > private data, but rather will see it in the CMA event in the upcall. > > Over IB, the IRD/ORD data is already built in to the standard CM > stuff (i.e. the "responder resources" and "initiator depth" fields of > REQ and REP). So no additional demands are made on private data for IB > in ITAPI for the IOH purpose. Of course the ITAPI app (like a uDAPL app) > can also use private data for app specific/ULP reasons. OK -- bad example. Sorry. This is a weird one. On iWARP, you need the private data header to pass this stuff along and on IB, you don't. What I was trying to say is that "whatever the private data", on IB it will get a private data header prepended and on iWARP, it won't. > > > > Over iWARP/MPA, there will be nothing else in the private data except > > what was provided by the consumer (ITAPI in this case). The reason being > > that this extra information (IP addressing info) is in the protocol > > header proper. > > Just to restate for clarity, ITAPI for iWARP will use the first 16 bytes of > MPA private data for the IOH (IRD/ORD header). The rest is usable for > app/ULP reasons.
Yessir. And in fact, the ITAPI CM will strip this stuff before presenting it to the app. > > > I should point out that there was once a proposal of doing a RDDP IETF > draft which would have sub-divided the MPA private data into a > "middleware" section and an "app" section. The idea was to be sure that > the app/ULP and middleware (e.g. the IOH) uses of private data would not > step on each other. I think this idea did not progress, mostly because > the author (John Carrier, formerly of Adaptec) changed jobs and was no > longer working on iWARP stuff. > > While not directly proposed, this idea could have been carried over to IB. > Some of the ideas on this thread are already implicitly > doing this middleware (for IP addressing purpose) vs ULP/app split. > I think we are grappling with a lot of these layering issues now. We are also grappling with protocol vs. implementation issues. Keep it coming, because this is exactly the kind of feedback I think we need. > -ted > From tom at opengridcomputing.com Tue Oct 25 14:55:46 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 25 Oct 2005 16:55:46 -0500 Subject: [openib-general] round 2 - proposal for socket based connection model In-Reply-To: <1130276446.9790.85.camel@trinity.austin.ammasso.com> References: <1130276446.9790.85.camel@trinity.austin.ammasso.com> Message-ID: <1130277346.9790.100.camel@trinity.austin.ammasso.com> Arkady: I may actually have a constructive comment about the protocol (private data format). One thing I noticed is that *almost* everything in the private data header is available in the native iWARP protocol header except the ZB and SI bits. If these bits become part of the canonical private data header, then does that require an iWARP transport to use the header too even though only two bits are useful? Sorry if this is a dumb question, Tom On Tue, 2005-10-25 at 16:40 -0500, Tom Tucker wrote: > Arkady: > > I don't think anyone disagrees with your goals. 
Unfortunately additional > requirements on the implementation were coupled with the specification > of the private data format (protocol). This peripheral discussion > derailed any attempt to discuss the protocol. > > Attempts to separate the protocol discussion from the implementation > failed. And so here we are... > > > On Tue, 2005-10-25 at 15:38 -0400, Kanevsky, Arkady wrote: > > What are you trying to achieve? > > > > I am trying to define an IB REQ protocol extension that > > support IP connection 5-tuple exchange between connection > > requestor and responder. > > And define mapping between IP 5-tuple and IB entities. > > > > That way ULP which was written to TCP/IP, UDP/IP, CSTP/IP (and so on) > > can use RDMA transport without change. > > To modify ULP to know that it runs on top of IB vs. iWARP > > vs. (any other RDMA transport) is bad idea. > > It is one thing to choose proper port to connect. > > Completely different to ask ULP to parse private data > > in transport specific way. > > > > The same protocol must support both user level ULPs > > and kernel level ULPs. > > Arkady > > > > Arkady Kanevsky email: arkady at netapp.com > > Network Appliance phone: 781-768-5395 > > 375 Totten Pond Rd. Fax: 781-895-1195 > > Waltham, MA 02451-2010 central phone: 781-768-5300 > > > > > > > > > -----Original Message----- > > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > > Sent: Tuesday, October 25, 2005 3:22 PM > > > To: Kanevsky, Arkady > > > Cc: Sean Hefty; openib-general at openib.org; swg at infinibandta.org > > > Subject: Re: [openib-general] round 2 - proposal for socket > > > based connectionmodel > > > > > > > > > Kanevsky, Arkady wrote: > > > > Sean, > > > > answers in-line. > > > > Arkady > > > > > > At this point, I'm just going to disagree with this approach > > > and move on with > > > the current implementation of the CMA. What's needed is a > > > service that provides > > > IB connections using TCP/IP addressing. 
I don't believe this > > > proposal meets > > > this goal. > > > > > > To meet the requirement of connecting over IB using TCP/IP > > > addressing, I believe > > > that we need a service with a reserved service identifier or range of > > > identifiers, a mechanism for mapping between IP and IB > > > addresses, and a > > > mechanism for reversing the mapping. > > > > > > I don't see where the proposal addresses the bulk of the work > > > that's required, > > > nor do I think that it will present an API to the user that > > > does not expose IB > > > related addressing (such as service IDs). > > > > > > - Sean > > > From nacc at us.ibm.com Tue Oct 25 15:04:46 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Tue, 25 Oct 2005 15:04:46 -0700 Subject: [openib-general] Automated userspace build error Message-ID: <20051025220446.GA27205@us.ibm.com> Hi all, I'm trying to add at least the build portion of userspace testing to my daily kernel build tests, but am running into the following failure in libibcm: checking dynamic linker charactericonfigure: error: ibv_get_devices() not found. libibcm requires libibcm. Not sure why the log got overrun, but obviously the latter part is concerning. Is there a circular dependency somewhere?
Thanks, Nish From rolandd at cisco.com Tue Oct 25 15:09:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 25 Oct 2005 15:09:42 -0700 Subject: [openib-general] Automated userspace build error In-Reply-To: <20051025220446.GA27205@us.ibm.com> (Nishanth Aravamudan's message of "Tue, 25 Oct 2005 15:04:46 -0700") References: <20051025220446.GA27205@us.ibm.com> Message-ID: <52u0f52qqh.fsf@cisco.com> > checking dynamic linker charactericonfigure: error: > ibv_get_devices() not found. libibcm requires libibcm. The last error seeming like a circular dependency is just a typo, fixed by the following patch (already checked in). As for why your build is failing, it seems that the libibcm configure is not finding an install of libibverbs. Without knowing what your setup is like, it's hard to speculate why that might be. - R. --- libibcm/configure.in (revision 3861) +++ libibcm/configure.in (working copy) @@ -26,7 +26,7 @@ dnl Checks for libraries if test "$disable_libcheck" != "yes" then AC_CHECK_LIB(ibverbs, ibv_get_devices, [], - AC_MSG_ERROR([ibv_get_devices() not found. libibcm requires libibcm.])) + AC_MSG_ERROR([ibv_get_devices() not found. libibcm requires libibverbs.])) AC_CHECK_LIB(ibat, ib_at_route_by_ip, [], AC_MSG_ERROR([ib_at_route_by_ip() not found. libibcm requires libat.])) fi From mshefty at ichips.intel.com Tue Oct 25 15:10:09 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 25 Oct 2005 15:10:09 -0700 Subject: [openib-general] Re: [PATCH] Minor mad_rmpp.c cleanup In-Reply-To: <523bmp47uo.fsf@cisco.com> References: <523bmp47uo.fsf@cisco.com> Message-ID: <435EAD41.8050807@ichips.intel.com> Roland Dreier wrote: > This changes alloc_response_msg() in mad_rmpp.c to return the struct > it allocates directly (or an error code a la ERR_PTR), rather than > returning a status and passing the struct back in a pointer param. 
> This simplifies the code and gets rid of warnings like > > drivers/infiniband/core/mad_rmpp.c: In function nack_recv: > drivers/infiniband/core/mad_rmpp.c:192: warning: msg may be used uninitialized in this function > > with newer versions of gcc. Thanks! Committed. - Sean From nacc at us.ibm.com Tue Oct 25 15:18:49 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Tue, 25 Oct 2005 15:18:49 -0700 Subject: [openib-general] Automated userspace build error In-Reply-To: <52u0f52qqh.fsf@cisco.com> References: <20051025220446.GA27205@us.ibm.com> <52u0f52qqh.fsf@cisco.com> Message-ID: <20051025221849.GB27205@us.ibm.com> On 25.10.2005 [15:09:42 -0700], Roland Dreier wrote: > > checking dynamic linker charactericonfigure: error: > > ibv_get_devices() not found. libibcm requires libibcm. > > The last error seeming like a circular dependency is just a typo, > fixed by the following patch (already checked in). Hrm, well, I'm testing the latest svn (3865), did the patch just get checked in? > As for why your build is failing, it seems that the libibcm configure > is not finding an install of libibverbs. Without knowing what your > setup is like, it's hard to speculate why that might be. True true; let me do some digging and figure out what's going on. Thanks, Nish From rolandd at cisco.com Tue Oct 25 15:22:56 2005 From: rolandd at cisco.com (Roland Dreier) Date: Tue, 25 Oct 2005 15:22:56 -0700 Subject: [openib-general] Automated userspace build error In-Reply-To: <20051025221849.GB27205@us.ibm.com> (Nishanth Aravamudan's message of "Tue, 25 Oct 2005 15:18:49 -0700") References: <20051025220446.GA27205@us.ibm.com> <52u0f52qqh.fsf@cisco.com> <20051025221849.GB27205@us.ibm.com> Message-ID: <52pspt2q4f.fsf@cisco.com> Nishanth> Hrm, well, I'm testing the latest svn (3865), did the Nishanth> patch just get checked in? Yeah, I only noticed it and fixed it after your original email. I just meant that I had already checked it in before sending my reply. 
Sorry for the confusion... - R. From swise at opengridcomputing.com Tue Oct 25 15:24:35 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 25 Oct 2005 17:24:35 -0500 Subject: [openib-general] round 2 - proposal for socket basedconnection model References: <1130276446.9790.85.camel@trinity.austin.ammasso.com> <1130277346.9790.100.camel@trinity.austin.ammasso.com> Message-ID: <008d01c5d9b2$e1b615e0$020010ac@haggard> Why does an application care whether the remote implementation supports ZB? Whether memory regions can be described with zero-based rkeys or not doesn't matter on an end-to-end level. It's only a local issue. So ZB shouldn't be there IMO. ----- Original Message ----- From: "Tom Tucker" To: "Kanevsky, Arkady" Cc: ; Sent: Tuesday, October 25, 2005 4:55 PM Subject: RE: [openib-general] round 2 - proposal for socket basedconnection model > Arkady: > > I may actually have a constructive comment about the protocol (private > data format). One thing I noticed is that *almost* everything in the > private data header is available in the native iWARP protocol header > except the ZB and SI bits. If these bits become part of the canonical > private data header, then does that require an iWARP transport to use > the header too even though only two bits are useful? > > Sorry if this is a dumb question, > > Tom > > On Tue, 2005-10-25 at 16:40 -0500, Tom Tucker wrote: >> Arkady: >> >> I don't think anyone disagrees with your goals. Unfortunately >> additional >> requirements on the implementation were coupled with the >> specification >> of the private data format (protocol). This peripheral discussion >> derailed any attempt to discuss the protocol. >> >> Attempts to separate the protocol discussion from the implementation >> failed. And so here we are... >> >> >> On Tue, 2005-10-25 at 15:38 -0400, Kanevsky, Arkady wrote: >> > What are you trying to achieve?
>> > >> > I am trying to define an IB REQ protocol extension that >> > support IP connection 5-tuple exchange between connection >> > requestor and responder. >> > And define mapping between IP 5-tuple and IB entities. >> > >> > That way ULP which was written to TCP/IP, UDP/IP, CSTP/IP (and so >> > on) >> > can use RDMA transport without change. >> > To modify ULP to know that it runs on top of IB vs. iWARP >> > vs. (any other RDMA transport) is bad idea. >> > It is one thing to choose proper port to connect. >> > Completely different to ask ULP to parse private data >> > in transport specific way. >> > >> > The same protocol must support both user level ULPs >> > and kernel level ULPs. >> > Arkady >> > >> > Arkady Kanevsky email: arkady at netapp.com >> > Network Appliance phone: 781-768-5395 >> > 375 Totten Pond Rd. Fax: 781-895-1195 >> > Waltham, MA 02451-2010 central phone: 781-768-5300 >> > >> > >> > >> > > -----Original Message----- >> > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] >> > > Sent: Tuesday, October 25, 2005 3:22 PM >> > > To: Kanevsky, Arkady >> > > Cc: Sean Hefty; openib-general at openib.org; swg at infinibandta.org >> > > Subject: Re: [openib-general] round 2 - proposal for socket >> > > based connectionmodel >> > > >> > > >> > > Kanevsky, Arkady wrote: >> > > > Sean, >> > > > answers in-line. >> > > > Arkady >> > > >> > > At this point, I'm just going to disagree with this approach >> > > and move on with >> > > the current implementation of the CMA. What's needed is a >> > > service that provides >> > > IB connections using TCP/IP addressing. I don't believe this >> > > proposal meets >> > > this goal. >> > > >> > > To meet the requirement of connecting over IB using TCP/IP >> > > addressing, I believe >> > > that we need a service with a reserved service identifier or >> > > range of >> > > identifiers, a mechanism for mapping between IP and IB >> > > addresses, and a >> > > mechanism for reversing the mapping. 
>> > > >> > > I don't see where the proposal addresses the bulk of the work >> > > that's required, >> > > nor do I think that it will present an API to the user that >> > > does not expose IB >> > > related addressing (such as service IDs). >> > > >> > > - Sean >> > > From caitlinb at broadcom.com Tue Oct 25 15:39:17 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Tue, 25 Oct 2005 15:39:17 -0700 Subject: [openib-general] round 2 - proposal for socket based connection model Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020B14@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Tom Tucker > Sent: Tuesday, October 25, 2005 2:56 PM > To: Kanevsky, Arkady > Cc: swg at infinibandta.org; openib-general at openib.org > Subject: RE: [openib-general] round 2 - proposal for socket > based connection model > > Arkady: > > I may actually have a constructive comment about the protocol > (private data format). One thing I noticed is that *almost* > everything in the private data header is available in the > native iWARP protocol header except the ZB and SI bits.
If > these bits become part of the canonical private data header, > then does that require an iWARP transport to use the header > too even though only two bits are useful? > > Sorry if this is a dumb question, > I'm not sure I followed why these were needed myself. From mshefty at ichips.intel.com Tue Oct 25 15:43:39 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 25 Oct 2005 15:43:39 -0700 Subject: [openib-general] round 2 - proposal for socket based connectionmodel In-Reply-To: References: Message-ID: <435EB51B.5050701@ichips.intel.com> Kanevsky, Arkady wrote: > What are you trying to achieve? I'm trying to define a connection *service* for Infiniband that uses TCP/IP addresses as its user interface. That service will have its own protocol, in much the same way that SDP, SRP, etc. do today. > I am trying to define an IB REQ protocol extension that > support IP connection 5-tuple exchange between connection > requestor and responder. Why? What need is there for a protocol extension to the IB CM? To me, this is similar to setting a bit in the CM REQ to indicate that the private data format looks like SDP's private data. The format of the _private_ data shouldn't be known to the CM; that's why it's private data. > And define mapping between IP 5-tuple and IB entities. No mapping between IP <-> IB addresses was defined in the proposal. Defining this mapping is required to make this work. Right now, the mapping is the responsibility of every user. > That way ULP which was written to TCP/IP, UDP/IP, CSTP/IP (and so on) > can use RDMA transport without change. A ULP written to TCP/IP can use an RDMA transport without change. They use SDP. However, an application that wants to take advantage of QP semantics must change. (And if they want to take full advantage of RDMA, they'll likely need to be re-architected as well.) The goal in that case becomes to permit them to establish connections using TCP/IP addresses. 
To meet this goal, we need to define how to map IP addresses to and from IB addresses. That mapping is part of the protocol, and is missing from the proposal. And if the application isn't going to know that it's running on InfiniBand, then the mapping must also include mapping to a destination service ID. > To modify ULP to know that it runs on top of IB vs. iWARP > vs. (any other RDMA transport) is bad idea. > It is one thing to choose proper port to connect. > Completely different to ask ULP to parse private data > in transport specific way. > The same protocol must support both user level ULPs > and kernel level ULPs. Defining an interface that allows a ULP to use either iWARP, IB, or some other random RDMA transport is an implementation issue. However, it requires something that maps IP to IB addresses (including service IDs). To be more concrete, you've gone from having source and destination TCP/IP addresses to including them in a CM REQ. What translated the source and destination IP addresses into GIDs and a PKey? Who converted those into IB routing information? How was the destination of the CM REQ determined? What service ID was selected? - Sean From mshefty at ichips.intel.com Tue Oct 25 16:06:45 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 25 Oct 2005 16:06:45 -0700 Subject: [openib-general] RFC userspace CMA In-Reply-To: <435D76E9.5040404@ichips.intel.com> References: <435D76E9.5040404@ichips.intel.com> Message-ID: <435EBA85.7050107@ichips.intel.com> Sean Hefty wrote: > - The kernel CMA will expose a new call, rdma_init_qp_attr() to > initialize QP attributes used to modify the state of the QP. The call > will be similar to the infiniband CM routine. Use of this call is > optional. The CMA will automatically transition QPs created by > rdma_create_qp(). The changes are more involved than this. To handle the QP transitions in userspace, the kernel CMA needs to generate another event: CONNECT_RESPONSE.
It will also need an additional API: rdma_establish(). (We can overload rdma_accept() in place of rdma_establish().) Basically, the 3-way handshake used by IB needs to be exposed. Use of either of these can be limited to those users that do not associate a QP with their rdma_cm_id. Alternatively, the uCMA kernel component can be integrated with the kernel CMA and make use of private interfaces. - Sean From robert.j.woodruff at intel.com Tue Oct 25 16:07:23 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 25 Oct 2005 16:07:23 -0700 Subject: [openib-general] round 2 - proposal for socketbased connectionmodel Message-ID: <1AC79F16F5C5284499BB9591B33D6F0005ED7FBB@orsmsx408> Sean wrote: >Kanevsky, Arkady wrote: >> What are you trying to achieve? >I'm trying to define a connection *service* for Infiniband that uses TCP/IP >addresses as its user interface. That service will have its own protocol, in >much the same way that SDP, SRP, etc. do today. Seems like we have two proposals. At a high level, I see no substantial benefit of one method over the other; both would facilitate connecting two machines using IP addresses. However, one of the two proposals has code in the trunk that people can start to code to today, while the other is a paper tiger that has no code. Rather than recommend an entirely new model (with no code), why not instead comment on the code that exists and send patches if it needs to be improved? Isn't that the way that open source is supposed to work? my 2 cents, woody From Arkady.Kanevsky at netapp.com Tue Oct 25 17:48:50 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 25 Oct 2005 20:48:50 -0400 Subject: [swg] Re: [openib-general] TCP/IP connection service over IB Message-ID: DAPL also strips this private data header and presents the IP addresses and ports to the Consumer as items separate from the Consumer private data.
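A minimal sketch of that header/private-data split, in C. The header layout below (a version byte, a source port, and IPv4 source and destination addresses) is purely hypothetical; it is not the format from the proposal, nor what any DAPL provider actually uses. The names struct pdata_hdr, struct conn_request and split_private_data() are likewise made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical private data header; NOT the layout from the proposal. */
struct pdata_hdr {
	uint8_t  version;
	uint8_t  reserved;
	uint16_t src_port;	/* network byte order */
	uint32_t src_ip;	/* IPv4, network byte order */
	uint32_t dst_ip;	/* IPv4, network byte order */
};

/* What the Consumer is shown: addressing kept apart from its own data. */
struct conn_request {
	struct pdata_hdr addr;
	const uint8_t *consumer_data;	/* points into the raw buffer */
	size_t consumer_len;
};

/*
 * Split raw REQ private data into the (hypothetical) header and the
 * Consumer's own private data.  Returns 0 on success, or -1 if the
 * buffer is too short to contain a header.
 */
static int split_private_data(const uint8_t *raw, size_t raw_len,
			      struct conn_request *req)
{
	if (raw_len < sizeof(req->addr))
		return -1;

	memcpy(&req->addr, raw, sizeof(req->addr));
	req->consumer_data = raw + sizeof(req->addr);
	req->consumer_len = raw_len - sizeof(req->addr);
	return 0;
}
```

On a transport that already carries the addressing in its own protocol header, a layer like this would presumably skip the split and hand the whole buffer to the Consumer unchanged.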
Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Tom Tucker [mailto:tom at opengridcomputing.com] > Sent: Tuesday, October 25, 2005 5:52 PM > To: Ted H. Kim > Cc: swg at infinibandta.org; openib-general > Subject: Re: [swg] Re: [openib-general] TCP/IP connection > service over IB > > > On Tue, 2005-10-25 at 13:16 -0700, Ted H. Kim wrote: > > Tom, > > > > Some comments inline ... > > > > > > Tom Tucker wrote: > > > I think it's relevant, so let's make sure my assumptions are > > > correct: > > > > > > - The ITAPI will be a "ULP" on OpenIB > > > > ITAPI is like uDAPL, so if uDAPL is a "ULP" then the answer is yes. > > The point is that for uDAPL you have the actual "app" running over > > uDAPL. So I guess it's a matter of terminology whether > uDAPL is a ULP > > or is it some sort of middleware with the app being the "ULP". > > > > Yeah, you're right, the terminology is probably a little > goofy. The reason for the goofosity is that some of the "ULPs" > really are protocols (iSER, IPoIB), and some are APIs (DAPL, > MPI). All use the same interface > to register with OpenIB. > > But that said, yes, ITAPI is like uDAPL. > > > > > > - The ITAPI will create the IRD/ORD headers in its > private data and > > > submit this as part of its connection establishment. > > > - The ITAPI consumer at the remote peer will use this data to > > > configure its local QP before accepting the connection > > > > > > Over IB, the IRD/ORD private data will be prepended with > a "private > > > data header" that contains the source and destination IP > addresses, > > > source port, etc... The remote peer will not see this > data as part > > > of the private data, but rather will see it in the CMA > event in the > > > upcall. > > > > Over IB, the IRD/ORD data is already built in to the > standard CM stuff > > (i.e.
the "responder resources" and "initiator depth" fields of REQ > > and REP). So no additional demands are made on private data > for IB in > > ITAPI for the IOH purpose. Of course the ITAPI app (like a > uDAPL app) > > can also use private data for app specific/ULP reasons. > > OK -- bad example. Sorry. This is a weird one. On iWARP, you > need the private data header to pass this stuff along and on > IB, you don't. What I was trying to say is that "whatever the > private data", on IB it will get a private data header > prepended and on iWARP, it won't. > > > > > > > > Over iWARP/MPA, there will be nothing else in the private data > > > except what was provided by the consumer (ITAPI in this > case). The > > > reason being that this extra information (IP addressing > info) is in > > > the protocol header proper. > > > > Just to restate for clarity, ITAPI for iWARP will use the first 16 > > bytes of MPA private data for the IOH (IRD/ORD header). The rest is > > usable for app/ULP reasons. > > Yessir. And in fact, the ITAPI CM will strip this stuff > before presenting it to the app. > > > > > > > I should point out that there was once a proposal of doing > a RDDP IETF > > draft which would have sub-divided the MPA private data into a > > "middleware" section and an "app" section. The idea was to be sure > > that the app/ULP and middleware (e.g. the IOH) uses of private data > > would not step on each other. I think this idea did not progress, > > mostly because the author (John Carrier, formerly of > Adaptec) changed > > jobs and was no longer working on iWARP stuff. > > > > While not directly proposed, this idea could have been > carried over to > > IB. Some of the ideas on this thread are already implicitly > doing this > > middleware (for IP addressing purpose) vs ULP/app split. > > > > I think we are grappling with a lot of these layering issues > now. We are also grappling with protocol vs. implementation issues.
> > Keep it coming, because this is exactly the kind of feedback > I think we need. > > > -ted > > From Arkady.Kanevsky at netapp.com Tue Oct 25 17:53:41 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 25 Oct 2005 20:53:41 -0400 Subject: [openib-general] round 2 - proposal for socket basedconnection model Message-ID: No. iWARP does not have to pass this info. The info is needed for IB because ZB and SI were introduced in the IBTA 1.2 specs as optional functionality. So if a ULP wants to use that functionality, it needs to find out whether the remote side supports it. This is needed for backwards compatibility. For example, the iSER protocol defines the use of remote invalidate, but obviously that cannot be done if the remote side does not support it. I do not recall right now whether iWARP defined that functionality as required or optional. Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Tom Tucker [mailto:tom at opengridcomputing.com] > Sent: Tuesday, October 25, 2005 5:56 PM > To: Kanevsky, Arkady > Cc: swg at infinibandta.org; openib-general at openib.org > Subject: RE: [openib-general] round 2 - proposal for socket > basedconnection model > > > Arkady: > > I may actually have a constructive comment about the protocol > (private data format). One thing I noticed is that *almost* > everything in the private data header is available in the > native iWARP protocol header except the ZB and SI bits.
If > these bits become part of the canonical private data header, > then does that require an iWARP transport to use the header > too even though only two bits are useful? > > Sorry if this is a dumb question, > > Tom > > On Tue, 2005-10-25 at 16:40 -0500, Tom Tucker wrote: > > Arkady: > > > > I don't think anyone disagrees with your goals. Unfortunately > > additional requirements on the implementation were coupled with the > > specification of the private data format (protocol). This > peripheral > > discussion derailed any attempt to discuss the protocol. > > > > Attempts to separate the protocol discussion from the > implementation > > failed. And so here we are... > > > > > > On Tue, 2005-10-25 at 15:38 -0400, Kanevsky, Arkady wrote: > > > What are you trying to achieve? > > > > > > I am trying to define an IB REQ protocol extension that > support IP > > > connection 5-tuple exchange between connection requestor and > > > responder. And define mapping between IP 5-tuple and IB entities. > > > > > > That way ULP which was written to TCP/IP, UDP/IP, CSTP/IP (and so > > > on) can use RDMA transport without change. To modify ULP to know > > > that it runs on top of IB vs. iWARP vs. (any other RDMA > transport) > > > is bad idea. It is one thing to choose proper port to connect. > > > Completely different to ask ULP to parse private data > > > in transport specific way. > > > > > > The same protocol must support both user level ULPs > > > and kernel level ULPs. > > > Arkady > > > > > > Arkady Kanevsky email: arkady at netapp.com > > > Network Appliance phone: 781-768-5395 > > > 375 Totten Pond Rd. 
Fax: 781-895-1195 > > > Waltham, MA 02451-2010 central phone: 781-768-5300 > > > > > > > > > > > > > -----Original Message----- > > > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > > > Sent: Tuesday, October 25, 2005 3:22 PM > > > > To: Kanevsky, Arkady > > > > Cc: Sean Hefty; openib-general at openib.org; swg at infinibandta.org > > > > Subject: Re: [openib-general] round 2 - proposal for socket > > > > based connectionmodel > > > > > > > > > > > > Kanevsky, Arkady wrote: > > > > > Sean, > > > > > answers in-line. > > > > > Arkady > > > > > > > > At this point, I'm just going to disagree with this approach > > > > and move on with > > > > the current implementation of the CMA. What's needed is a > > > > service that provides > > > > IB connections using TCP/IP addressing. I don't believe this > > > > proposal meets > > > > this goal. > > > > > > > > To meet the requirement of connecting over IB using TCP/IP > > > > addressing, I believe > > > > that we need a service with a reserved service > identifier or range of > > > > identifiers, a mechanism for mapping between IP and IB > > > > addresses, and a > > > > mechanism for reversing the mapping. > > > > > > > > I don't see where the proposal addresses the bulk of the work > > > > that's required, > > > > nor do I think that it will present an API to the user that > > > > does not expose IB > > > > related addressing (such as service IDs). 
> > > > > > > > - Sean > From Arkady.Kanevsky at netapp.com Tue Oct 25 18:11:16 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Tue, 25 Oct 2005 21:11:16 -0400 Subject: [openib-general] round 2 - proposal for socket based connectionmodel Message-ID: Sean Hefty wrote: > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Tuesday, October 25, 2005 6:44 PM > To: Kanevsky, Arkady > Cc: Sean Hefty; openib-general at openib.org; swg at infinibandta.org > Subject: Re: [openib-general] round 2 - proposal for socket > based connectionmodel > > > Kanevsky, Arkady wrote: > > What are you trying to achieve? > > I'm trying to define a connection *service* for Infiniband > that uses TCP/IP > addresses as its user interface. That service will have its > own protocol, in > much the same way that SDP, SRP, etc. do today. > > > I am trying to define an IB REQ protocol extension that support IP > > connection 5-tuple exchange between connection requestor and > > responder. > > Why? What need is there for a protocol extension to the IB > CM? To me, this is > similar to setting a bit in the CM REQ to indicate that the > private data format > looks like SDP's private data. The format of the _private_ > data shouldn't be > known to the CM; that's why it's private data. There is no requirement that the remote side uses the same Linux CM. So in order to achieve interoperability you need a protocol. 
SDP hello-world protocol is defined for SDP. We are defining an equivalent that is ULP independent. If CM is not involved then it is ULP that populate the 5-tuple info on requestor side and parses it on the remote side. Thus, make ULP CM IB specific. This is what we are trying to avoid. ULP should not change regardless whether or not it is running on IB, iWARP, VIA or any other RDMA transport. iWARP does not need private data to pass 5-tuple. > > > And define mapping between IP 5-tuple and IB entities. > > No mapping between IP <-> IB addresses was defined in the > proposal. Defining > this mapping is required to make this work. Right now, the > mapping is the > responsibility of every user. > > > That way ULP which was written to TCP/IP, UDP/IP, CSTP/IP > (and so on) > > can use RDMA transport without change. > > A ULP written to TCP/IP can use an RDMA transport without > change. They use SDP. > However, an application that wants to take advantage of QP > semantics must > change. (And if they want to take full advantage of RDMA, > they'll likely need > to be re-architected as well.) The goal in that case becomes > to permit them to > establish connections using TCP/IP addresses. > > To meet this goal, we need to define how to map IP address to > and from IB > addresses. That mapping is part of the protocol, and is > missing from the > proposal. And if the application isn't going to know that > they're running on > Infiniband, then the mapping must also include mapping to a > destination service ID. > > > To modify ULP to know that it runs on top of IB vs. iWARP > > vs. (any other RDMA transport) is bad idea. > > It is one thing to choose proper port to connect. > > Completely different to ask ULP to parse private data > > in transport specific way. > > The same protocol must support both user level ULPs > > and kernel level ULPs. > > Defining an interface that allows a ULP to use either iWarp, > IB, or some other > random RDMA transport is an implementation issue. 
However, > it requires > something that maps IP to IB addresses (including service IDs). > > To be more concrete, you've gone from having source and > destination TCP/IP > addresses to including them in a CM REQ. What translated the > source and > destination IP addresses into GIDs and a PKey? Who converted > those into IB > routing information? How was the destination of the CM REQ > determined? What > service ID was selected? IPoIB defines IP -> GID Port -> IB Service ID (part of this proposal) Pkey is configuration setup done by administrator. Ditto for VLAN. > > - Sean > From krause at cup.hp.com Tue Oct 25 18:41:48 2005 From: krause at cup.hp.com (Michael Krause) Date: Tue, 25 Oct 2005 18:41:48 -0700 Subject: [swg] Re: [openib-general] round 2 - proposal for socket based connectionmodel In-Reply-To: <435EB51B.5050701@ichips.intel.com> References: <435EB51B.5050701@ichips.intel.com> Message-ID: <6.2.0.14.2.20051025183202.02750518@esmail.cup.hp.com> Just to correct one comment: A ULP written to TCP/IP can use RDMA transport without change. An example is SDP not that the ULP must use what SDP uses. Also, please keep in mind that SDP on iWARP uses the port mapper protocol to obtain the IP address and port to target for the connection request. So, the TCP connection establishment is to the RDMA listen endpoint from the start and the SDP hello exchange then fills in the rest of the parameters required to determine whether the connection should proceed and what resources should be configured when the response is generated. I will also re-iterate what another person stated and that is to separate out the interface from the wire protocol. IBTA defines wire protocols / semantics while OpenIB is defining its API to communicate the wire protocol and associated semantics. I agree with that person on this point and their other point on the need for the IBTA to construct a solid spec for the wire protocol and associated semantics. 
OpenIB will then determine how best to implement but these are separate efforts and it would be more productive for all to table the discussion for now. The original request was whether something would break if the private data size was changed. It was noted that one cannot know what will or will not break thus the requirement is to provide a method for software to note the difference in the layout. How is for the IBTA to specify. Just a thought...... Mike At 03:43 PM 10/25/2005, Sean Hefty wrote: >Kanevsky, Arkady wrote: >>What are you trying to achieve? > >I'm trying to define a connection *service* for Infiniband that uses >TCP/IP addresses as its user interface. That service will have its own >protocol, in much the same way that SDP, SRP, etc. do today. > >>I am trying to define an IB REQ protocol extension that >>support IP connection 5-tuple exchange between connection >>requestor and responder. > >Why? What need is there for a protocol extension to the IB CM? To me, >this is similar to setting a bit in the CM REQ to indicate that the >private data format looks like SDP's private data. The format of the >_private_ data shouldn't be known to the CM; that's why it's private data. > >>And define mapping between IP 5-tuple and IB entities. > >No mapping between IP <-> IB addresses was defined in the >proposal. Defining this mapping is required to make this work. Right >now, the mapping is the responsibility of every user. > >>That way ULP which was written to TCP/IP, UDP/IP, CSTP/IP (and so on) >>can use RDMA transport without change. > >A ULP written to TCP/IP can use an RDMA transport without change. They >use SDP. However, an application that wants to take advantage of QP >semantics must change. (And if they want to take full advantage of RDMA, >they'll likely need to be re-architected as well.) The goal in that case >becomes to permit them to establish connections using TCP/IP addresses. 
> >To meet this goal, we need to define how to map IP address to and from IB >addresses. That mapping is part of the protocol, and is missing from the >proposal. And if the application isn't going to know that they're running >on Infiniband, then the mapping must also include mapping to a destination >service ID. > >>To modify ULP to know that it runs on top of IB vs. iWARP >>vs. (any other RDMA transport) is bad idea. >>It is one thing to choose proper port to connect. >>Completely different to ask ULP to parse private data >>in transport specific way. >>The same protocol must support both user level ULPs >>and kernel level ULPs. > >Defining an interface that allows a ULP to use either iWarp, IB, or some >other random RDMA transport is an implementation issue. However, it >requires something that maps IP to IB addresses (including service IDs). > >To be more concrete, you've gone from having source and destination TCP/IP >addresses to including them in a CM REQ. What translated the source and >destination IP addresses into GIDs and a PKey? Who converted those into >IB routing information? How was the destination of the CM REQ >determined? What service ID was selected? > >- Sean -------------- next part -------------- An HTML attachment was scrubbed... URL: From umaxx at oleco.net Wed Oct 26 03:00:02 2005 From: umaxx at oleco.net (Joerg Zinke) Date: Wed, 26 Oct 2005 12:00:02 +0200 Subject: [openib-general] question about poll_cq() In-Reply-To: <52r7a94liq.fsf@cisco.com> References: <20051025113935.42db75ac@marvin.local> <52r7a94liq.fsf@cisco.com> Message-ID: <20051026120002.1f9f1584@marvin.local> On Tue, 25 Oct 2005 09:19:25 -0700 Roland Dreier wrote: > > A very brief sketch of what happens is that the device-specific > implementation of CQs for Mellanox HCAs allocates a circular buffer in > memory and passes the address to the hardware. The buffer is divided > into fixed-size chunks, each of which represents one completion entry. 
> Initially the buffer is cleared out, and every time the hardware adds > an entry onto the completion queue, it sets a bit in that chunk to > show that the entry is now valid. The driver polls the CQ by looking > to see if the next chunk has the bit set. If it does, then the driver > translates the entry from hardware format into standard struct ibv_wc > format; if it doesn't, then the driver returns status indicating that > the CQ is empty. > > Completion queues are always located in local system memory. > > - R. > thanks for your reply. that's all i wanted to know. joerg From tom at ipperformance.com Wed Oct 26 05:00:43 2005 From: tom at ipperformance.com (Tom Tucker) Date: Wed, 26 Oct 2005 07:00:43 -0500 Subject: [openib-general] RFC userspace CMA In-Reply-To: <435EBA85.7050107@ichips.intel.com> References: <435D76E9.5040404@ichips.intel.com> <435EBA85.7050107@ichips.intel.com> Message-ID: <1130328043.18967.13.camel@mail.es335.com> Sean: FYI, I've started writing the iw_cm that sits below the rdma_cm. Here's the general picture I have in mind. +---------+ | RDMA CM | +-+-----+-+ | | +----+ +----+ | | +---------+ +----+----+ | IB CM | | IW CM | +----+----+ +----+----+ | | ____+_____ ____+_____ +---------+| +---------+| | IB devs || | IW devs || +---------+ +---------+ The purpose of the IW CM is to abstract the two different connection models used by the iWARP side: offloaded and host integrated, and to act as a shim between device specific connection data structures and the rdma_cm data structures. I am also migrating the current iw_cm.h file to match the interfaces in the rdma_cm more closely. In general, the IW CM methods look very much like sockets connect, listen, and accept. There is an iw_cm_id like the ib_cm_id that encapsulates the 5-tuple, a callback for IW CM events and a "provider handle" that represents the adapter "connection cookie". The iw_cm_id is passed to connect, accept, etc... 
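As a rough illustration of the iw_cm_id described above, here is a minimal C sketch. It is not the actual iw_cm.h: every name in it (sketch_iw_cm_id, sketch_five_tuple, sketch_iw_cm_deliver, the event values) is hypothetical, and the field layout is an assumption based only on the three pieces mentioned: the 5-tuple, an event callback, and an opaque provider handle.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch only; none of these names come from iw_cm.h. */

enum sketch_iw_event {
    SKETCH_IW_CONNECT_REQUEST,
    SKETCH_IW_CONNECT_REPLY,
    SKETCH_IW_ESTABLISHED,
    SKETCH_IW_CLOSE
};

/* The connection 5-tuple (IPv4 only in this sketch). */
struct sketch_five_tuple {
    uint32_t local_addr;
    uint32_t remote_addr;
    uint16_t local_port;
    uint16_t remote_port;
    uint8_t  protocol;      /* e.g. 6 for TCP */
};

struct sketch_iw_cm_id;

/* Callback invoked for IW CM events on this id. */
typedef int (*sketch_iw_cm_handler)(struct sketch_iw_cm_id *id,
                                    enum sketch_iw_event event);

struct sketch_iw_cm_id {
    struct sketch_five_tuple tuple;   /* addressing for this endpoint */
    sketch_iw_cm_handler handler;     /* consumer's event callback */
    void *provider_handle;            /* adapter "connection cookie" */
    void *context;                    /* consumer's private pointer */
};

/* Zero the id and attach the consumer's callback and context. */
static inline void sketch_iw_cm_id_init(struct sketch_iw_cm_id *id,
                                        sketch_iw_cm_handler handler,
                                        void *context)
{
    assert(id != NULL);
    memset(id, 0, sizeof(*id));
    id->handler = handler;
    id->context = context;
}

/* Hand an event to the consumer; returns -1 if no handler is set. */
static inline int sketch_iw_cm_deliver(struct sketch_iw_cm_id *id,
                                       enum sketch_iw_event event)
{
    assert(id != NULL);
    if (id->handler == NULL)
        return -1;
    return id->handler(id, event);
}

/* Example handler: counts events in an int reached through context. */
static inline int sketch_count_events(struct sketch_iw_cm_id *id,
                                      enum sketch_iw_event event)
{
    (void)event;
    *(int *)id->context += 1;
    return 0;
}
```

A consumer would fill in the tuple, call sketch_iw_cm_id_init() with its own handler, and pass the id to whatever connect/listen/accept entry points the IW CM ends up exposing; those entry points are deliberately left out here because their signatures are not given in the message.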
One big difference between the IW_CM and the IB_CM is that the IW_CM does not implement the connection state machine (three way handshake) like it does in the IB_CM. This greatly simplifies the code. Another big difference is that the IW_CM does not implement the service id database (port number space). This is either in the adapter or native stack depending on the model. This means that calls like listen with a local port wildcard can't return until the "listen_reply" comes back from the adapter. I welcome all comments on this especially now that it's early and there's a lot of options and not much code yet. On Tue, 2005-10-25 at 16:06 -0700, Sean Hefty wrote: > Sean Hefty wrote: > > - The kernel CMA will expose a new call, rdma_init_qp_attr() to > > initialize QP attributes used to modify the state of the QP. The call > > will be similar to the infiniband CM routine. Use of this call is > > optional. The CMA will automatically transition QPs created by > > rdma_create_qp(). > > The changes are more involved than this. To handle the QP transitions in > userspace, the kernel CMA needs to generate another event: CONNECT_RESPONSE. It > will also need an additional API: rdma_establish(). (We can overload > rdma_accept() in place of rdma_establish().) Basically, the 3-way handshake > used by IB needs to be exposed. > > Use of either of these can be limited to those users that do not associate a QP > with their rdma_cm_id. Alternatively, the uCMA kernel component can be > integrated with the kernel CMA and make use of private interfaces. 
> > - Sean > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From liran at mellanox.co.il Wed Oct 26 05:58:02 2005 From: liran at mellanox.co.il (Liran Sorani) Date: Wed, 26 Oct 2005 14:58:02 +0200 Subject: [openib-general] Osmtest Gen2 - Update Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E35AB831@mtlexch01.mtl.com> Hi, Gen2 Osmtest now supports a configuration flag (--with-top-tree=) for the following purposes: 1. Predefine the source userspace library and header files to compile and link with. 2. Compile Osmtest as part of the Gen2 stack. For more info please refer to the README file. Link: https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management/osm/osmtest Liran Sorani Mellanox Technologies LTD. mailto:liran at mellanox.co.il Phone: +972(4)9097200 Ext: 214 Israel, Yokneam P.O.B 586 ZIP 20692 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Wed Oct 26 06:08:16 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 26 Oct 2005 15:08:16 +0200 Subject: [openib-general] Re: Re: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN In-Reply-To: <435D4B27.2010208@ichips.intel.com> References: <435D4B27.2010208@ichips.intel.com> Message-ID: <20051026130816.GJ4769@mellanox.co.il> Quoting Arlin Davis : > Subject: Re: Re: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN > > Michael S. Tsirkin wrote: > > >Thanks Arlin. I plan to look into integrating this. > >One question: for which psn values do you see performance drop on 4.6.0 > >FW? > > Any luck isolating this performance problem? I just want to understand > the cause so I know for sure 4.7 FW is a solid fix. Didn't see anything > in the 4.7 release notes that covered this issue. > Hi, Arlin. 
Unfortunately, I don't have an answer yet. This is not something we have intentionally fixed with 4.7, but it could be fixed as a byproduct of another fix. I plan to follow through and get back to you when we have an answer. Thanks, -- MST From mst at mellanox.co.il Wed Oct 26 07:09:58 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 26 Oct 2005 16:09:58 +0200 Subject: [openib-general] Re: RFC userspace CMA In-Reply-To: <435D76E9.5040404@ichips.intel.com> References: <435D76E9.5040404@ichips.intel.com> Message-ID: <20051026140958.GL4769@mellanox.co.il> Quoting Sean Hefty : > Subject: RFC userspace CMA > > I'm soliciting any comments that anyone might have on the general design for the > userspace CMA before I get too far into the implementation. > > - The API will match the kernel API for the most part. The exception is that > event handling will match other userspace libraries (get/ack event). > > - There will be a single CMA device exported through /sys/class/infiniband. > > - The kernel CMA will be modified to remove the requirement to use > rdma_create_qp(). Users that want to allocate and manage their own QP states > will be able to specify QP attributes (qpn, qp_type, srq) through the > rdma_conn_param structure. > > - The kernel CMA will expose a new call, rdma_init_qp_attr() to initialize QP > attributes used to modify the state of the QP. The call will be similar to the > infiniband CM routine. Use of this call is optional. The CMA will > automatically transition QPs created by rdma_create_qp(). > > - The uCMA will open devices for users and return them the device context with > related events. The uCMA will close the device if there are no rdma_cma_id's > associated with it. > > - To support device add, the uCMA will need a new verbs call: > ibv_open_device_by_guid(). If a connection request occurs for a device that is > not yet known by the uCMA, it will open the device. > > Comments? > > - Sean Sounds like a lot of work :). 
Are there benefits to this approach as opposed to implementing everything in a library on top of ucm/uverbs? -- MST From IBMEHCAD at de.ibm.com Wed Oct 26 07:56:08 2005 From: IBMEHCAD at de.ibm.com (IBMEHCA DD) Date: Wed, 26 Oct 2005 16:56:08 +0200 Subject: [openib-general] prototype version of ebus driver Message-ID: on kernel 2.6.13 and 14 a "ebus" driver is needed to enable the ehca driver on power5. I just uploaded a prototype patch to gen2/users/ehca svn 3879 Christoph R. -------------- next part -------------- An HTML attachment was scrubbed... URL: From itamar at mellanox.co.il Wed Oct 26 08:25:26 2005 From: itamar at mellanox.co.il (Itamar Rabenstein) Date: Wed, 26 Oct 2005 17:25:26 +0200 Subject: [openib-general] opensm problem ??? Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E35AB8A6@mtlexch01.mtl.com> Hi All, I am running openib gen2 svn rev 3872 (kernel + user). my system is EM64T (x86_64) + SUSE9.3 + k2.6.13.4 I have arbel in memfree mode (fw 5.1.132) . my 2 ports are connected in loopback. I am running opensm but the links are not getting into ACTIVE. in the osm.log i see Oct 26 16:59:25 366150 [43005960] -> __osm_vl15_poller: 1 QP0 MADs on wire, 1 outstanding, 0 unicasts sent, 1 total sent. 
Oct 26 16:59:33 937993 [44007960] -> umad_receiver: ERR 5404: recv error on MAD sized umad (Interrupted system call) Does it works for others ? Itamar The full osm.log is Oct 26 16:59:25 348084 [AB446CA0] -> OpenSM Rev:openib-1.1.0 Oct 26 16:59:25 355661 [0000] -> OpenSM Rev:openib-1.1.0 Oct 26 16:59:25 355661 [AB446CA0] -> OpenSM Rev:openib-1.1.0 Oct 26 16:59:25 355810 [AB446CA0] -> osm_opensm_init: [ Oct 26 16:59:25 356149 [AB446CA0] -> osm_vendor_new: [ Oct 26 16:59:25 356202 [AB446CA0] -> osm_vendor_init: [ Oct 26 16:59:25 356296 [AB446CA0] -> osm_vendor_init: ] Oct 26 16:59:25 356306 [AB446CA0] -> osm_vendor_new: ] Oct 26 16:59:25 356316 [AB446CA0] -> osm_mad_pool_init: [ Oct 26 16:59:25 356427 [AB446CA0] -> osm_mad_pool_init: ] Oct 26 16:59:25 356436 [AB446CA0] -> osm_vl15_init: [ Oct 26 16:59:25 356473 [AB446CA0] -> osm_vl15_init: ] Oct 26 16:59:25 356484 [AB446CA0] -> osm_db_init: [ Oct 26 16:59:25 356503 [AB446CA0] -> osm_db_init: ] Oct 26 16:59:25 356512 [AB446CA0] -> osm_sm_init: [ Oct 26 16:59:25 356534 [AB446CA0] -> osm_sm_mad_ctrl_init: [ Oct 26 16:59:25 356546 [AB446CA0] -> osm_sm_mad_ctrl_init: ] Oct 26 16:59:25 356554 [AB446CA0] -> osm_req_init: [ Oct 26 16:59:25 356565 [AB446CA0] -> osm_req_init: ] Oct 26 16:59:25 356573 [AB446CA0] -> osm_req_ctrl_init: [ Oct 26 16:59:25 356583 [AB446CA0] -> osm_req_ctrl_init: ] Oct 26 16:59:25 356591 [AB446CA0] -> osm_resp_init: [ Oct 26 16:59:25 356600 [AB446CA0] -> osm_resp_init: ] Oct 26 16:59:25 356608 [AB446CA0] -> osm_ni_rcv_init: [ Oct 26 16:59:25 356619 [AB446CA0] -> osm_ni_rcv_init: ] Oct 26 16:59:25 356627 [AB446CA0] -> osm_ni_rcv_ctrl_init: [ Oct 26 16:59:25 356636 [AB446CA0] -> osm_ni_rcv_ctrl_init: ] Oct 26 16:59:25 356644 [AB446CA0] -> osm_pi_rcv_init: [ Oct 26 16:59:25 356652 [AB446CA0] -> osm_pi_rcv_init: ] Oct 26 16:59:25 356661 [AB446CA0] -> osm_pi_rcv_ctrl_init: [ Oct 26 16:59:25 356669 [AB446CA0] -> osm_pi_rcv_ctrl_init: ] Oct 26 16:59:25 356677 [AB446CA0] -> osm_si_rcv_init: [ Oct 26 16:59:25 
356688 [AB446CA0] -> osm_si_rcv_init: ] Oct 26 16:59:25 356696 [AB446CA0] -> osm_si_rcv_ctrl_init: [ Oct 26 16:59:25 356705 [AB446CA0] -> osm_si_rcv_ctrl_init: ] Oct 26 16:59:25 356713 [AB446CA0] -> osm_nd_rcv_init: [ Oct 26 16:59:25 356721 [AB446CA0] -> osm_nd_rcv_init: ] Oct 26 16:59:25 356729 [AB446CA0] -> osm_nd_rcv_ctrl_init: [ Oct 26 16:59:25 356738 [AB446CA0] -> osm_nd_rcv_ctrl_init: ] Oct 26 16:59:25 356746 [AB446CA0] -> osm_lid_mgr_init: [ Oct 26 16:59:25 356805 [AB446CA0] -> osm_db_domain_init: [ Oct 26 16:59:25 356782 [43005960] -> __osm_vl15_poller: [ Oct 26 16:59:25 356863 [AB446CA0] -> osm_db_domain_init: ] Oct 26 16:59:25 356878 [AB446CA0] -> osm_db_restore: [ Oct 26 16:59:25 356912 [AB446CA0] -> __osm_lid_mgr_validate_db: [ Oct 26 16:59:25 356930 [AB446CA0] -> __osm_lid_mgr_validate_db: ] Oct 26 16:59:25 356939 [AB446CA0] -> osm_lid_mgr_init: ] Oct 26 16:59:25 356947 [AB446CA0] -> osm_ucast_mgr_init: [ Oct 26 16:59:25 356959 [AB446CA0] -> osm_ucast_mgr_init: ] Oct 26 16:59:25 356967 [AB446CA0] -> osm_link_mgr_init: [ Oct 26 16:59:25 356978 [AB446CA0] -> osm_link_mgr_init: ] Oct 26 16:59:25 356987 [AB446CA0] -> osm_state_mgr_init: [ Oct 26 16:59:25 356998 [AB446CA0] -> osm_state_mgr_init: ] Oct 26 16:59:25 357006 [AB446CA0] -> osm_state_mgr_ctrl_init: [ Oct 26 16:59:25 357015 [AB446CA0] -> osm_state_mgr_ctrl_init: ] Oct 26 16:59:25 357024 [AB446CA0] -> osm_drop_mgr_init: [ Oct 26 16:59:25 357032 [AB446CA0] -> osm_drop_mgr_init: ] Oct 26 16:59:25 357041 [AB446CA0] -> osm_lft_rcv_init: [ Oct 26 16:59:25 357049 [AB446CA0] -> osm_lft_rcv_init: ] Oct 26 16:59:25 357057 [AB446CA0] -> osm_lft_rcv_ctrl_init: [ Oct 26 16:59:25 357099 [AB446CA0] -> osm_lft_rcv_ctrl_init: ] Oct 26 16:59:25 357108 [AB446CA0] -> osm_mft_rcv_init: [ Oct 26 16:59:25 357116 [AB446CA0] -> osm_mft_rcv_init: ] Oct 26 16:59:25 357124 [AB446CA0] -> osm_mft_rcv_ctrl_init: [ Oct 26 16:59:25 357133 [AB446CA0] -> osm_mft_rcv_ctrl_init: ] Oct 26 16:59:25 357141 [AB446CA0] -> 
osm_sweep_fail_ctrl_init: [ Oct 26 16:59:25 357150 [AB446CA0] -> osm_sweep_fail_ctrl_init: ] Oct 26 16:59:25 357159 [AB446CA0] -> osm_sminfo_rcv_init: [ Oct 26 16:59:25 357167 [AB446CA0] -> osm_sminfo_rcv_init: ] Oct 26 16:59:25 357176 [AB446CA0] -> osm_sminfo_rcv_ctrl_init: [ Oct 26 16:59:25 357184 [AB446CA0] -> osm_sminfo_rcv_ctrl_init: ] Oct 26 16:59:25 357193 [AB446CA0] -> osm_trap_rcv_init: [ Oct 26 16:59:25 357204 [AB446CA0] -> cl_event_wheel_init: [ Oct 26 16:59:25 357214 [AB446CA0] -> cl_event_wheel_init: ] Oct 26 16:59:25 357222 [AB446CA0] -> osm_trap_rcv_init: ] Oct 26 16:59:25 357231 [AB446CA0] -> osm_trap_rcv_ctrl_init: [ Oct 26 16:59:25 357240 [AB446CA0] -> osm_trap_rcv_ctrl_init: ] Oct 26 16:59:25 357248 [AB446CA0] -> osm_sm_state_mgr_init: [ Oct 26 16:59:25 357261 [AB446CA0] -> osm_sm_state_mgr_init: ] Oct 26 16:59:25 357269 [AB446CA0] -> osm_mcast_mgr_init: [ Oct 26 16:59:25 357277 [AB446CA0] -> osm_mcast_mgr_init: ] Oct 26 16:59:25 357286 [AB446CA0] -> osm_slvl_rcv_init: [ Oct 26 16:59:25 357294 [AB446CA0] -> osm_slvl_rcv_init: ] Oct 26 16:59:25 357302 [AB446CA0] -> osm_slvl_rcv_ctrl_init: [ Oct 26 16:59:25 357311 [AB446CA0] -> osm_slvl_rcv_ctrl_init: ] Oct 26 16:59:25 357320 [AB446CA0] -> osm_vla_rcv_init: [ Oct 26 16:59:25 357328 [AB446CA0] -> osm_vla_rcv_init: ] Oct 26 16:59:25 357336 [AB446CA0] -> osm_vla_rcv_ctrl_init: [ Oct 26 16:59:25 357344 [AB446CA0] -> osm_vla_rcv_ctrl_init: ] Oct 26 16:59:25 357353 [AB446CA0] -> osm_pkey_rcv_init: [ Oct 26 16:59:25 357361 [AB446CA0] -> osm_pkey_rcv_init: ] Oct 26 16:59:25 357369 [AB446CA0] -> osm_pkey_rcv_ctrl_init: [ Oct 26 16:59:25 357378 [AB446CA0] -> osm_pkey_rcv_ctrl_init: ] Oct 26 16:59:25 357409 [AB446CA0] -> osm_sm_init: ] Oct 26 16:59:25 357419 [AB446CA0] -> osm_sa_init: [ Oct 26 16:59:25 357428 [AB446CA0] -> osm_sa_resp_init: [ Oct 26 16:59:25 357441 [AB446CA0] -> osm_sa_resp_init: ] Oct 26 16:59:25 357450 [AB446CA0] -> osm_sa_mad_ctrl_init: [ Oct 26 16:59:25 357461 [AB446CA0] -> 
osm_sa_mad_ctrl_init: ] Oct 26 16:59:25 357470 [AB446CA0] -> osm_cpi_rcv_init: [ Oct 26 16:59:25 357478 [AB446CA0] -> osm_cpi_rcv_init: ] Oct 26 16:59:25 357486 [AB446CA0] -> osm_cpi_rcv_ctrl_init: [ Oct 26 16:59:25 357495 [AB446CA0] -> osm_cpi_rcv_ctrl_init: ] Oct 26 16:59:25 357503 [AB446CA0] -> osm_nr_rcv_init: [ Oct 26 16:59:25 357524 [AB446CA0] -> osm_nr_rcv_init: ] Oct 26 16:59:25 357533 [AB446CA0] -> osm_nr_rcv_ctrl_init: [ Oct 26 16:59:25 357542 [AB446CA0] -> osm_nr_rcv_ctrl_init: ] Oct 26 16:59:25 357550 [AB446CA0] -> osm_pir_rcv_init: [ Oct 26 16:59:25 357566 [AB446CA0] -> osm_pir_rcv_init: ] Oct 26 16:59:25 357575 [AB446CA0] -> osm_pir_rcv_ctrl_init: [ Oct 26 16:59:25 357584 [AB446CA0] -> osm_pir_rcv_ctrl_init: ] Oct 26 16:59:25 357592 [AB446CA0] -> osm_lr_rcv_init: [ Oct 26 16:59:25 357603 [AB446CA0] -> osm_lr_rcv_init: ] Oct 26 16:59:25 357611 [AB446CA0] -> osm_lr_rcv_ctrl_init: [ Oct 26 16:59:25 357619 [AB446CA0] -> osm_lr_rcv_ctrl_init: ] Oct 26 16:59:25 357627 [AB446CA0] -> osm_pr_rcv_init: [ Oct 26 16:59:25 357644 [AB446CA0] -> osm_pr_rcv_init: ] Oct 26 16:59:25 357653 [AB446CA0] -> osm_pr_rcv_ctrl_init: [ Oct 26 16:59:25 357664 [AB446CA0] -> osm_pr_rcv_ctrl_init: ] Oct 26 16:59:25 357672 [AB446CA0] -> osm_smir_rcv_init: [ Oct 26 16:59:25 357681 [AB446CA0] -> osm_smir_rcv_init: ] Oct 26 16:59:25 357689 [AB446CA0] -> osm_smir_ctrl_init: [ Oct 26 16:59:25 357698 [AB446CA0] -> osm_smir_ctrl_init: ] Oct 26 16:59:25 357706 [AB446CA0] -> osm_mcmr_rcv_init: [ Oct 26 16:59:25 357722 [AB446CA0] -> osm_mcmr_rcv_init: ] Oct 26 16:59:25 357731 [AB446CA0] -> osm_mcmr_rcv_ctrl_init: [ Oct 26 16:59:25 357740 [AB446CA0] -> osm_mcmr_rcv_ctrl_init: ] Oct 26 16:59:25 357748 [AB446CA0] -> osm_sr_rcv_init: [ Oct 26 16:59:25 357776 [AB446CA0] -> osm_sr_rcv_init: ] Oct 26 16:59:25 357797 [AB446CA0] -> osm_sr_rcv_ctrl_init: [ Oct 26 16:59:25 357807 [AB446CA0] -> osm_sr_rcv_ctrl_init: ] Oct 26 16:59:25 357815 [AB446CA0] -> osm_infr_rcv_init: [ Oct 26 16:59:25 357823 
[AB446CA0] -> osm_infr_rcv_init: ] Oct 26 16:59:25 357831 [AB446CA0] -> osm_infr_rcv_ctrl_init: [ Oct 26 16:59:25 357840 [AB446CA0] -> osm_infr_rcv_ctrl_init: ] Oct 26 16:59:25 357848 [AB446CA0] -> osm_vlarb_rec_rcv_init: [ Oct 26 16:59:25 357864 [AB446CA0] -> osm_vlarb_rec_rcv_init: ] Oct 26 16:59:25 357873 [AB446CA0] -> osm_vlarb_rec_rcv_ctrl_init: [ Oct 26 16:59:25 357882 [AB446CA0] -> osm_vlarb_rec_rcv_ctrl_init: ] Oct 26 16:59:25 357890 [AB446CA0] -> osm_slvl_rec_rcv_init: [ Oct 26 16:59:25 357905 [AB446CA0] -> osm_slvl_rec_rcv_init: ] Oct 26 16:59:25 357913 [AB446CA0] -> osm_slvl_rec_rcv_ctrl_init: [ Oct 26 16:59:25 357922 [AB446CA0] -> osm_slvl_rec_rcv_ctrl_init: ] Oct 26 16:59:25 357930 [AB446CA0] -> osm_pkey_rec_rcv_init: [ Oct 26 16:59:25 357941 [AB446CA0] -> osm_pkey_rec_rcv_init: ] Oct 26 16:59:25 357949 [AB446CA0] -> osm_pkey_rec_rcv_ctrl_init: [ Oct 26 16:59:25 357957 [AB446CA0] -> osm_pkey_rec_rcv_ctrl_init: ] Oct 26 16:59:25 357966 [AB446CA0] -> osm_lftr_rcv_init: [ Oct 26 16:59:25 357986 [AB446CA0] -> osm_lftr_rcv_init: ] Oct 26 16:59:25 357995 [AB446CA0] -> osm_lftr_rcv_ctrl_init: [ Oct 26 16:59:25 358004 [AB446CA0] -> osm_lftr_rcv_ctrl_init: ] Oct 26 16:59:25 358012 [AB446CA0] -> osm_sa_init: ] Oct 26 16:59:25 358020 [AB446CA0] -> osm_opensm_create_mcgroups: [ Oct 26 16:59:25 358028 [AB446CA0] -> osm_sa_create_template_record_ipoib: [ Oct 26 16:59:25 358040 [AB446CA0] -> osm_mcmr_rcv_create_new_mgrp: [ Oct 26 16:59:25 358048 [AB446CA0] -> __get_new_mlid: [ Oct 26 16:59:25 358057 [AB446CA0] -> __get_new_mlid: No multicast groups found using minimal mlid:0xC000 Oct 26 16:59:25 358068 [AB446CA0] -> __get_new_mlid: ] Oct 26 16:59:25 358076 [AB446CA0] -> osm_mcmr_rcv_create_new_mgrp: Getting new mlid 0xc000. Oct 26 16:59:25 358085 [AB446CA0] -> __validate_requested_mgid: [ Oct 26 16:59:25 358095 [AB446CA0] -> __validate_requested_mgid: MGID Signed as 0x401B. 
Oct 26 16:59:25 358104 [AB446CA0] -> __validate_requested_mgid: Skipping MGID Validation for IPoIB Signed (0x401B) MGIDs. Oct 26 16:59:25 358113 [AB446CA0] -> __validate_requested_mgid: ] Oct 26 16:59:25 358122 [AB446CA0] -> __mgrp_request_is_realizable: [ Oct 26 16:59:25 358130 [AB446CA0] -> __mgrp_request_is_realizable: ] Oct 26 16:59:25 358143 [AB446CA0] -> osm_mgrp_send_create_notice: [ Oct 26 16:59:25 358154 [AB446CA0] -> osm_report_notice: [ Oct 26 16:59:25 358163 [AB446CA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 Oct 26 16:59:25 358178 [AB446CA0] -> osm_report_notice: ] Oct 26 16:59:25 358186 [AB446CA0] -> osm_mgrp_send_create_notice: ] Oct 26 16:59:25 358195 [AB446CA0] -> osm_mcmr_rcv_create_new_mgrp: ] Oct 26 16:59:25 358203 [AB446CA0] -> osm_mcmr_rcv_create_new_mgrp: [ Oct 26 16:59:25 358211 [AB446CA0] -> __get_new_mlid: [ Oct 26 16:59:25 358246 [AB446CA0] -> __get_new_mlid: Found mgrp with lid:0xC000 MGID: 0xff12401bffff0000 : 0x00000000ffffffff Oct 26 16:59:25 358256 [AB446CA0] -> __get_new_mlid: Found available mlid:0xC001 at idx:1 Oct 26 16:59:25 358267 [AB446CA0] -> __get_new_mlid: ] Oct 26 16:59:25 358275 [AB446CA0] -> osm_mcmr_rcv_create_new_mgrp: Getting new mlid 0xc001. Oct 26 16:59:25 358283 [AB446CA0] -> __validate_requested_mgid: [ Oct 26 16:59:25 358292 [AB446CA0] -> __validate_requested_mgid: MGID Signed as 0x401B. Oct 26 16:59:25 358300 [AB446CA0] -> __validate_requested_mgid: Skipping MGID Validation for IPoIB Signed (0x401B) MGIDs. 
Oct 26 16:59:25 358309 [AB446CA0] -> __validate_requested_mgid: ] Oct 26 16:59:25 358317 [AB446CA0] -> __mgrp_request_is_realizable: [ Oct 26 16:59:25 358325 [AB446CA0] -> __mgrp_request_is_realizable: ] Oct 26 16:59:25 358334 [AB446CA0] -> osm_mgrp_send_create_notice: [ Oct 26 16:59:25 358342 [AB446CA0] -> osm_report_notice: [ Oct 26 16:59:25 358350 [AB446CA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000 Oct 26 16:59:25 358376 [AB446CA0] -> osm_report_notice: ] Oct 26 16:59:25 358385 [AB446CA0] -> osm_mgrp_send_create_notice: ] Oct 26 16:59:25 358393 [AB446CA0] -> osm_mcmr_rcv_create_new_mgrp: ] Oct 26 16:59:25 358401 [AB446CA0] -> osm_sa_create_template_record_ipoib: ] Oct 26 16:59:25 358409 [AB446CA0] -> osm_opensm_create_mcgroups: ] Oct 26 16:59:25 358420 [AB446CA0] -> updn_construct: [ Oct 26 16:59:25 358431 [AB446CA0] -> updn_construct: ] Oct 26 16:59:25 358439 [AB446CA0] -> updn_init: [ Oct 26 16:59:25 358449 [AB446CA0] -> updn_init: ] Oct 26 16:59:25 358457 [AB446CA0] -> osm_opensm_init: ] Oct 26 16:59:25 358478 [AB446CA0] -> osm_vendor_get_all_port_attr: [ Oct 26 16:59:25 358571 [43806960] -> __osm_sm_sweeper: [ Oct 26 16:59:25 358586 [43806960] -> __osm_sm_sweeper: Masking ^C Signals. Oct 26 16:59:25 362483 [AB446CA0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c9000100d051) as the default port. Oct 26 16:59:25 362502 [AB446CA0] -> osm_vendor_get_all_port_attr: ] Oct 26 16:59:25 362533 [AB446CA0] -> osm_opensm_bind: [ Oct 26 16:59:25 362543 [AB446CA0] -> osm_sm_bind: [ Oct 26 16:59:25 362551 [AB446CA0] -> osm_sm_mad_ctrl_bind: [ Oct 26 16:59:25 362560 [AB446CA0] -> osm_sm_mad_ctrl_bind: Binding to port 0x2c9000100d051. Oct 26 16:59:25 362570 [AB446CA0] -> osm_vendor_bind: [ Oct 26 16:59:25 362579 [AB446CA0] -> osm_vendor_bind: Binding to port 0x2c9000100d051. 
Oct 26 16:59:25 362588 [AB446CA0] -> osm_vendor_open_port: [ Oct 26 16:59:25 365367 [AB446CA0] -> umad_receiver_init: [ Oct 26 16:59:25 365416 [AB446CA0] -> umad_receiver_init: ] Oct 26 16:59:25 365425 [AB446CA0] -> osm_vendor_open_port: ] Oct 26 16:59:25 365470 [44007960] -> umad_receiver: [ Oct 26 16:59:25 365547 [AB446CA0] -> osm_vendor_bind: ] Oct 26 16:59:25 365557 [AB446CA0] -> osm_sm_mad_ctrl_bind: ] Oct 26 16:59:25 365565 [AB446CA0] -> osm_sm_bind: ] Oct 26 16:59:25 365574 [AB446CA0] -> osm_sa_bind: [ Oct 26 16:59:25 365582 [AB446CA0] -> osm_sa_mad_ctrl_bind: [ Oct 26 16:59:25 365590 [AB446CA0] -> osm_sa_mad_ctrl_bind: Binding to port GUID 0x2c9000100d051. Oct 26 16:59:25 365599 [AB446CA0] -> osm_vendor_bind: [ Oct 26 16:59:25 365607 [AB446CA0] -> osm_vendor_bind: Binding to port 0x2c9000100d051. Oct 26 16:59:25 365616 [AB446CA0] -> osm_vendor_open_port: [ Oct 26 16:59:25 365624 [AB446CA0] -> osm_vendor_open_port: ] Oct 26 16:59:25 365673 [AB446CA0] -> osm_vendor_bind: ] Oct 26 16:59:25 365682 [AB446CA0] -> osm_sa_mad_ctrl_bind: ] Oct 26 16:59:25 365691 [AB446CA0] -> osm_sa_bind: ] Oct 26 16:59:25 365699 [AB446CA0] -> osm_opensm_bind: ] Oct 26 16:59:25 365707 [AB446CA0] -> osm_sm_sweep: [ Oct 26 16:59:25 365718 [AB446CA0] -> osm_state_mgr_process: [ Oct 26 16:59:25 365730 [AB446CA0] -> osm_state_mgr_process: Received signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_IDLE. Oct 26 16:59:25 365743 [AB446CA0] -> osm_sm_state_mgr_process: [ Oct 26 16:59:25 365753 [AB446CA0] -> osm_sm_state_mgr_process: Received signal OSM_SM_SIGNAL_INIT in state IB_SMINFO_STATE_INIT. 
Oct 26 16:59:25 365763 [AB446CA0] -> __osm_sm_state_mgr_discovering_msg:

******************************************************************
******************** ENTERING SM DISCOVERING STATE ***************
******************************************************************

Oct 26 16:59:25 365773 [AB446CA0] -> osm_sm_state_mgr_process: ]
Oct 26 16:59:25 365784 [AB446CA0] -> __osm_state_mgr_sweep_hop_0: [
Oct 26 16:59:25 365793 [AB446CA0] -> __osm_state_mgr_sweep_heavy_msg:

******************************************************************
******************** INITIATING HEAVY SWEEP **********************
******************************************************************

Oct 26 16:59:25 365810 [AB446CA0] -> osm_req_get: [
Oct 26 16:59:25 365819 [AB446CA0] -> osm_mad_pool_get: [
Oct 26 16:59:25 365831 [AB446CA0] -> osm_vendor_get: [
Oct 26 16:59:25 365839 [AB446CA0] -> osm_vendor_get: Acquiring UMAD for p_madw = 0x56d658, size = 256.
Oct 26 16:59:25 365849 [AB446CA0] -> osm_vendor_get: Acquired UMAD 0x588430, size = 256.
Oct 26 16:59:25 365879 [AB446CA0] -> osm_vendor_get: ]
Oct 26 16:59:25 365890 [AB446CA0] -> osm_mad_pool_get: Acquired p_madw = 0x56d640, p_mad = 0x588468, size = 256.
Oct 26 16:59:25 365905 [AB446CA0] -> osm_mad_pool_get: ]
Oct 26 16:59:25 365918 [AB446CA0] -> osm_req_get: Getting NodeInfo (0x11), modifier = 0x0, TID = 0x1234.
Oct 26 16:59:25 365929 [AB446CA0] -> osm_vl15_post: [
Oct 26 16:59:25 365937 [AB446CA0] -> osm_vl15_post: Posting p_madw = 0x0x56d640.
Oct 26 16:59:25 365946 [AB446CA0] -> osm_vl15_post: 0 QP0 MADs on wire, 1 QP0 MADs outstanding.
Oct 26 16:59:25 365955 [AB446CA0] -> osm_vl15_poll: [
Oct 26 16:59:25 365964 [AB446CA0] -> osm_vl15_poll: Signalling poller thread.
Oct 26 16:59:25 365979 [AB446CA0] -> osm_vl15_poll: ]
Oct 26 16:59:25 365986 [43005960] -> __osm_vl15_poller: Servicing p_madw = 0x56d640.
Oct 26 16:59:25 365991 [AB446CA0] -> osm_vl15_post: ]
Oct 26 16:59:25 366024 [AB446CA0] -> osm_req_get: ]
Oct 26 16:59:25 366036 [AB446CA0] -> __osm_state_mgr_sweep_hop_0: ]
Oct 26 16:59:25 366047 [AB446CA0] -> osm_state_mgr_process: ]
Oct 26 16:59:25 366059 [AB446CA0] -> osm_sm_sweep: ]
Oct 26 16:59:25 366076 [43005960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x0
				trans_id................0x1234
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0]
				Return path: [0]
				Reserved: [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Oct 26 16:59:25 366094 [43005960] -> osm_vendor_send: [
Oct 26 16:59:25 366127 [43005960] -> osm_vendor_send: Completed Sending Request p_madw = 0x56d640.
Oct 26 16:59:25 366138 [43005960] -> osm_vendor_send: ]
Oct 26 16:59:25 366150 [43005960] -> __osm_vl15_poller: 1 QP0 MADs on wire, 1 outstanding, 0 unicasts sent, 1 total sent.
Oct 26 16:59:33 937993 [44007960] -> umad_receiver: ERR 5404: recv error on MAD sized umad (Interrupted system call)
Oct 26 16:59:35 365254 [43806960] -> osm_state_mgr_process: [
Oct 26 16:59:35 365276 [43806960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_SWEEP_HEAVY_SELF.
Oct 26 16:59:35 365287 [43806960] -> osm_state_mgr_process: ]
Oct 26 16:59:45 369718 [43806960] -> osm_state_mgr_process: [
Oct 26 16:59:45 369771 [43806960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_SWEEP_HEAVY_SELF.
Oct 26 16:59:45 369782 [43806960] -> osm_state_mgr_process: ]
Oct 26 16:59:55 374174 [43806960] -> osm_state_mgr_process: [
Oct 26 16:59:55 374233 [43806960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_SWEEP_HEAVY_SELF.
Oct 26 16:59:55 374245 [43806960] -> osm_state_mgr_process: ]
Oct 26 17:00:05 378630 [43806960] -> osm_state_mgr_process: [
Oct 26 17:00:05 378681 [43806960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_SWEEP_HEAVY_SELF.
Oct 26 17:00:05 378693 [43806960] -> osm_state_mgr_process: ]
Oct 26 17:00:15 383073 [43806960] -> osm_state_mgr_process: [
Oct 26 17:00:15 383104 [43806960] -> osm_state_mgr_process: Received signal OSM_SIGNAL_SWEEP in state OSM_SM_STATE_SWEEP_HEAVY_SELF.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mshefty at ichips.intel.com Wed Oct 26 08:31:51 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 26 Oct 2005 08:31:51 -0700
Subject: [openib-general] Re: RFC userspace CMA
In-Reply-To: <20051026140958.GL4769@mellanox.co.il>
References: <435D76E9.5040404@ichips.intel.com>
	<20051026140958.GL4769@mellanox.co.il>
Message-ID: <435FA167.70101@ichips.intel.com>

Michael S. Tsirkin wrote:
> Sounds like a lot of work :).

I think that it's less work than the alternative.

> Are there benefits to this approach as opposed to implementing everything
> in a library on top of ucm/uverbs?

I considered, and continue to consider implementing on top of ucm. The
drawbacks are: it requires more kernel modules: one for the CM, one for SA
query, and one for address translation. It complicates the event model,
since the uCMA must now deal with events from three different sources,
rather than one. And it duplicates a majority of the kernel CMA code in
userspace.
- Sean

From robert.j.woodruff at intel.com Wed Oct 26 08:33:08 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Wed, 26 Oct 2005 08:33:08 -0700
Subject: [openib-general] round 2 - proposal for socketbased connectionmodel
Message-ID: <1AC79F16F5C5284499BB9591B33D6F0005ED846A@orsmsx408>

Arkady wrote,
>This is what we are trying to avoid.
>ULP should not change regardless whether or not it is running
>on IB, iWARP, VIA or any other RDMA transport.

The whole point of the CMA is that the ULP can code to an API that
is independent of RDMA interconnect. The CMA wire protocol can be
documented to allow non-Linux hosts to connect to a Linux box using
the same protocol. There is no need to change the existing IB CM
protocol to accomplish this. All that is needed is to document that
CMA protocol (contained in the private data field of the IB CM requests).

woody

From mshefty at ichips.intel.com Wed Oct 26 08:41:14 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 26 Oct 2005 08:41:14 -0700
Subject: [openib-general] RFC userspace CMA
In-Reply-To: <1130328043.18967.13.camel@mail.es335.com>
References: <435D76E9.5040404@ichips.intel.com>
	<435EBA85.7050107@ichips.intel.com>
	<1130328043.18967.13.camel@mail.es335.com>
Message-ID: <435FA39A.7040700@ichips.intel.com>

Tom Tucker wrote:
> FYI, I've started writing the iw_cm that sits below the rdma_cm. Here's
> the general picture I have in mind.
>
>         +---------+
>         | RDMA CM |
>         +-+-----+-+
>           |     |
>      +----+     +----+
>      |               |
> +---------+     +----+----+
> |  IB CM  |     |  IW CM  |
> +----+----+     +----+----+
>      |               |
>  ____+_____      ____+_____
> +---------+|    +---------+|
> | IB devs ||    | IW devs ||
> +---------+     +---------+

This is what I was envisioning as well.

> I am also migrating the current iw_cm.h file to match the interfaces in
> the rdma_cm more closely.

Note that there are still some changes occurring to the rdma_cm to support
userspace.
I'm concerned about how well these changes map to iWarp, since the
changes expose the three-way CM handshake used by IB.

> In general, the IW CM methods look very much like sockets connect,
> listen, and accept. There is an iw_cm_id like the ib_cm_id that
> encapsulates the 5-tuple, a callback for IW CM events and a "provider
> handle" that represents the adapter "connection cookie". The iw_cm_id is
> passed to connect, accept, etc...

Something that didn't make sense for the kernel rdma_cm running over IB was
adding a backlog parameter to the listen request. (The IB CM is callback
driven, so there's not really a backlog.) I will probably add this to the
userspace API. Does iWarp need a backlog parameter in the kernel?

> depending on the model. This means that calls like listen with a local
> port wildcard can't return until the "listen_reply" comes back from the
> adapter.

I didn't quite follow this. Right now, the rdma_cm only tries to support
wildcard IP addresses. Are you wanting to support listening on any port as
well? What is a listen_reply?

- Sean

From mst at mellanox.co.il Wed Oct 26 08:49:29 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 26 Oct 2005 17:49:29 +0200
Subject: [openib-general] Re: RFC userspace CMA
In-Reply-To: <435FA167.70101@ichips.intel.com>
References: <435FA167.70101@ichips.intel.com>
Message-ID: <20051026154929.GN4769@mellanox.co.il>

Quoting Sean Hefty :
> I considered, and continue to consider implementing on top of ucm. The
> drawbacks are: it requires more kernel modules: one for the CM, one for SA
> query, and one for address translation.

Can't address translation be done with the existing kernel/user interface?
-- MST

From robert.j.woodruff at intel.com Wed Oct 26 08:50:51 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Wed, 26 Oct 2005 08:50:51 -0700
Subject: [openib-general] RFC userspace CMA
Message-ID: <1AC79F16F5C5284499BB9591B33D6F0005ED8501@orsmsx408>

Tom Tucker wrote,

        +---------+
        | RDMA CM |
        +-+-----+-+
          |     |
     +----+     +----+
     |               |
+---------+     +----+----+
|  IB CM  |     |  IW CM  |
+----+----+     +----+----+
     |               |
 ____+_____      ____+_____
+---------+|    +---------+|
| IB devs ||    | IW devs ||
+---------+     +---------+

>I welcome all comments on this especially now that it's early and
>there's a lot of options and not much code yet.

Looks like the right approach to me and I am glad to see someone start
to implement an IW CM under the RDMA CM. This will allow any problems
with the current RDMA CM API and implementation to be flushed out early
so we can make any needed changes before too many people code to it.

woody

From mshefty at ichips.intel.com Wed Oct 26 08:51:40 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 26 Oct 2005 08:51:40 -0700
Subject: [openib-general] Re: RFC userspace CMA
In-Reply-To: <20051026154929.GN4769@mellanox.co.il>
References: <435FA167.70101@ichips.intel.com>
	<20051026154929.GN4769@mellanox.co.il>
Message-ID: <435FA60C.4070708@ichips.intel.com>

Michael S. Tsirkin wrote:
> Quoting Sean Hefty :
>
>>I considered, and continue to consider implementing on top of ucm. The
>>drawbacks are: it requires more kernel modules: one for the CM, one for SA
>>query, and one for address translation.
>
>
> Can't address translation be done with the existing kernel/user interface?

There's no kernel/user interface for ib_addr, which is what the kernel CMA
uses. To use the ib_at kernel/user interface, ib_at would need to be fixed
to avoid crashing the system. ib_addr is based off of the ib_at/sdp
implementations, but limited to ARP translation only.

It would also require userspace components for other RDMA CMs, such as iWarp.
- Sean From yaronh at voltaire.com Wed Oct 26 09:21:00 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Wed, 26 Oct 2005 18:21:00 +0200 Subject: [openib-general] RE: [dat-discussions] round 2 - proposal forsocket based connection model Message-ID: <35EA21F54A45CB47B879F21A91F4862F856E15@taurus.voltaire.com> > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Kanevsky, Arkady > Sent: Tuesday, October 25, 2005 1:26 PM > To: Sean Hefty > Cc: swg at infinibandta.org; openib-general at openib.org; dat- > discussions at yahoogroups.com > Subject: RE: [openib-general] RE: [dat-discussions] round 2 - proposal > forsocket based connection model > > Think of a single API that supports iWARP and IB (transport independent > API). > To a connection listener it provides the IP 5-tuple + private data. > For IB it means that CM parses REQ and extracts IP 5-tuple as separate > fields from private data. > Listener does not parse the private data encoding of the proposal. > > So CM need to know if it need to encode IP 5-tuple on requestor side > and if need to parse on responder side. > Arkady > Arkady, I agree with Sean you can encode the Dest Port in the ServiceID And if you really want to verify its using that format you can look at the upper 48 bits in the serviceID. We may need to distinguish between Explicit RDMA protocols (iSER, NFS-RDMA, RDP, etc') and Implicit RDMA (SDP, where the Socket application doesn't know it is using RDMA), this can be done in 3 ways: a. port mapper, b. different ServiceID prefix, or c. a bit in the CM REQ Header. Also I'm not sure why we need the Protocol (UDP, TCP, SCTP, ..) since we emulate RDMA we shouldn't care if its TCP or SCTP, and UDP is unconnected and cant drive RDMA anyway Yaron > > Arkady Kanevsky email: arkady at netapp.com > Network Appliance phone: 781-768-5395 > 375 Totten Pond Rd. 
Fax: 781-895-1195 > Waltham, MA 02451-2010 central phone: 781-768-5300 > > > > > -----Original Message----- > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > Sent: Tuesday, October 25, 2005 1:08 PM > > To: Kanevsky, Arkady > > Cc: Caitlin Bestler; dat-discussions at yahoogroups.com; > > openib-general at openib.org; swg at infinibandta.org > > Subject: Re: [openib-general] RE: [dat-discussions] round 2 - > > proposal for socket based connection model > > > > > > Kanevsky, Arkady wrote: > > > Correct. > > > But this does bring the question how responder CM knows > > that it need > > > to parse the private data. I suspect this will be done via > > new version > > > of CM. But a suage of some of the CM REQ reserved fields are also > > > possible. Anotherwords the current CM version assumes that CM only > > > supports one version and there is no need to support more than 1 > > > version. > > > > The responder knows how to parse the private data based on > > the service ID that > > they're listening on. This is how it's done today, and how > > it will still need > > to be done. What is the motivation to change it? > > > > What data is beyond the addressing? How does the responder > > know how to > > interpret that? 
> > > > - Sean > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From yaronh at voltaire.com Wed Oct 26 09:23:36 2005 From: yaronh at voltaire.com (Yaron Haviv) Date: Wed, 26 Oct 2005 18:23:36 +0200 Subject: [swg] RE: [openib-general] round 2 - proposal for socket based connection model Message-ID: <35EA21F54A45CB47B879F21A91F4862F856E16@taurus.voltaire.com> > -----Original Message----- > From: Caitlin Bestler [mailto:caitlinb at broadcom.com] > Sent: Tuesday, October 25, 2005 6:39 PM > To: Tom Tucker; Kanevsky, Arkady > Cc: swg at infinibandta.org; openib-general at openib.org > Subject: [swg] RE: [openib-general] round 2 - proposal for socket based > connection model > > > > > -----Original Message----- > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] On Behalf Of Tom Tucker > > Sent: Tuesday, October 25, 2005 2:56 PM > > To: Kanevsky, Arkady > > Cc: swg at infinibandta.org; openib-general at openib.org > > Subject: RE: [openib-general] round 2 - proposal for socket > > based connection model > > > > Arkady: > > > > I may actually have a constructive comment about the protocol > > (private data format). One thing I noticed is that *almost* > > everything in the private data header is available in the > > native iWARP protocol header except the ZB and SI bits. If > > these bits become part of the canonical private data header, > > then does that require an iWARP transport to use the header > > too even though only two bits are useful? > > > > Sorry if this is a dumb question, > > > > I'm not sure I followed why these were needed myself. I believe ZBTO and Remote Invalidation are mandatory in iWarp, right ? 
There are two new RDMA features that are available in iWarp and are new to
IB (optional in version 1.2).

A ULP that is supposed to run on both may want to know whether the peer
supports them, so it can use the correct verbs. For example, if the peer
doesn't support remote invalidation, the ULP will need to use the Send verb
and invalidate the FMR locally; if the peer does support it, the ULP can use
the new "Send with Invalidate" verb, which can improve performance and
security.

I don't see why iWarp needs to negotiate it; the CMA can just return true on
both bits in the iWarp case.

These are generic parameters that will be needed by more than one ULP that
wants to know which verbs are supported by the generic RDMA layer; that's
why they are in the generic portion of the header.

Yaron

From robert.j.woodruff at intel.com Wed Oct 26 09:24:10 2005
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Wed, 26 Oct 2005 09:24:10 -0700
Subject: [openib-general] last version for 2.6.9 backport
Message-ID: <1AC79F16F5C5284499BB9591B33D6F0005ED85DF@orsmsx408>

Robert J Woodruff wrote,
>New version of 2.6.9 backport patches committed in svn3854.

>woody

BTW. I was not able to test the pathscale driver as I do not have
any of their H/W, so if someone that has H/W could test it, that would
be great.

woody

From mshefty at ichips.intel.com Wed Oct 26 09:34:04 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 26 Oct 2005 09:34:04 -0700
Subject: [openib-general] CMA service ID space versus private data bytes
Message-ID: <435FAFFC.6080604@ichips.intel.com>

There's a trade-off between service ID space used by the CMA and the amount
of private data available to the user. Currently, the CMA reserves 64k of
service ID space and provides 56 bytes of user private data.

We can give the user 60 bytes of private data space by shifting 3 bytes
(plus 1 reserved byte) from the private data into the service ID. This
results in the CMA reserving 2^40 IDs (about 6% of the total range).
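
The sizes in this trade-off can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a 64-bit service ID, reads "64k" as 16 reserved bits, and uses a hypothetical helper name, `reserved_fraction`, that does not come from the CMA code.

```python
# Assumption: IB service IDs are 64-bit values.
SERVICE_ID_BITS = 64


def reserved_fraction(reserved_bits: int, total_bits: int = SERVICE_ID_BITS) -> float:
    """Return the share of the ID space taken by 2**reserved_bits IDs."""
    return 2 ** reserved_bits / 2 ** total_bits


# 64k IDs reserved today, read here as 2**16 (an assumption).
current_bits = 16
# Three private-data bytes moved into the service ID add 24 bits.
shifted_bits = 3 * 8
proposed_bits = current_bits + shifted_bits

for label, bits in (("current", current_bits), ("proposed", proposed_bits)):
    share = reserved_fraction(bits)
    print(f"{label}: 2^{bits} IDs, {share:.3e} of the range ({share * 100:.3e} %)")
```

By this arithmetic, 2^40 IDs are 1/2^24 of a 64-bit space, which is on the order of 0.000006% and so far smaller than 6%.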
How important is the private data to people versus the conservation of service IDs? - Sean From halr at voltaire.com Wed Oct 26 09:36:20 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 26 Oct 2005 18:36:20 +0200 Subject: [openib-general] RE: Osmtest removal from Gen2 main trunk Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C50@taurus.voltaire.com> Hi Liran, I'm out at SC05 staging. Can this wait until I get back (no later than early next week) ? I want to do a side by side comparison before osmtest is removed from the trunk. -- Hal ________________________________ From: Liran Sorani [mailto:liran at mellanox.co.il] Sent: Tue 10/25/2005 1:35 AM To: Hal Rosenstock Cc: openib-general at openib.org Subject: Osmtest removal from Gen2 main trunk Hi , Hal . Since now the Osmtest is updated (in all stack flavours) under ibtp repository (https://openib.org/svn/trunk/contrib/mellanox/ibtp/), I'd like to remove it from main trunk : https://openib.org/svn/gen2/trunk/src/userspace/management/osm/osmtest. New updates will be checked into ibtp repository only , thanks . -----Original Message----- From: Liran Sorani Sent: Sunday, October 23, 2005 9:01 AM To: 'Hal Rosenstock'; Liran Sorani Cc: openib-general at openib.org Subject: RE: [openib-general] InfiniBand Test Project (IBTP) - Update Currently only a minor bug fix in osmt_service flow , and cosmetics changes to fit WinIb stack . -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thursday, October 20, 2005 1:01 PM To: Liran Sorani Cc: openib-general at openib.org Subject: RE: [openib-general] InfiniBand Test Project (IBTP) - Update On Thu, 2005-10-20 at 03:49, Liran Sorani wrote: > Hi , Hal . > The Linux & WinIB are the same , except for several cosmetic changes . I was referring to the (differences in the) Linux one in ibtp and the Linux one under gen2/trunk. > Regarding Makefile.in , it's an outcome of autogen , I'll remove it . Thanks. 
-- Hal > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, October 19, 2005 10:25 PM > To: Liran Sorani > Cc: openib-general at openib.org > Subject: Re: [openib-general] InfiniBand Test Project (IBTP) - Update > > > On Wed, 2005-10-19 at 15:33, Liran Sorani wrote: > > Hi , > > We've updated IBTP tree with Osmtest sources both on ibal (WinIB) > and > > Gen2 stacks : > > > https://openib.org/svn/trunk/contrib/mellanox/ibtp/ibal/ulp/opensm/user/osmtest > > > > > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management/osm/osmtest > > > > Osmtest is the main verification tool for OpenSM , include various > SA > > (Good / Bad) flows. > > Attached to each directory a short README file for setup and usage > > information. > > How is the Linux one different from osmtest in the trunk ? > > Also, (nit): > I think > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management/osm/osmtest/Makefile.in > is a generated file and should be removed. > > -- Hal > > > > Liran Sorani > > > Mellanox Technologies LTD. > > > mailto:liran at mellanox.co.il > > > Phone: +972(4)9097200 Ext: 214 > > > Israel, Yokneam P.O.B 586 ZIP 20692 > > > > > > > > > > > > > > > ______________________________________________________________________ > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Wed Oct 26 09:42:57 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 26 Oct 2005 18:42:57 +0200 Subject: [openib-general] OSM: console patch breaks OpenSM make dist Message-ID: <435FB211.4030305@mellanox.co.il> Hi Hal, The last commit you have done breaks the OpenSM make dist. Seems you did not try to make dist. 
I was also surprised that you have committed OpenSM console patch while it
is still under discussion.

The following patch resolves that issue:

Index: osm/include/Makefile.am
===================================================================
--- osm/include/Makefile.am	(revision 3880)
+++ osm/include/Makefile.am	(working copy)
@@ -35,6 +35,7 @@ EXTRA_DIST = \
 	$(srcdir)/opensm/osm_sa_service_record.h \
 	$(srcdir)/opensm/osm_sa_response.h \
 	$(srcdir)/opensm/osm_node.h \
+	$(srcdir)/opensm/osm_console.h \
 	$(srcdir)/opensm/osm_sa_slvl_record_ctrl.h \
 	$(srcdir)/opensm/osm_req.h \
 	$(srcdir)/opensm/osm_mcm_info.h \

From rolandd at cisco.com Wed Oct 26 09:56:20 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Wed, 26 Oct 2005 09:56:20 -0700
Subject: [openib-general] CMA service ID space versus private data bytes
In-Reply-To: <435FAFFC.6080604@ichips.intel.com> (Sean Hefty's message of
	"Wed, 26 Oct 2005 09:34:04 -0700")
References: <435FAFFC.6080604@ichips.intel.com>
Message-ID: <52hdb41akr.fsf@cisco.com>

    Sean> This results in the CMA reserving 2^40 IDs (about 6% of the
    Sean> total range).

Just to be nitpicky -- 2^40 service IDs is 1 / 2^24 of the total
number of service IDs (2^64), which is about .000006%.

 - R.

From halr at voltaire.com Wed Oct 26 10:01:26 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 26 Oct 2005 19:01:26 +0200
Subject: [openib-general] RE: console patch breaks OpenSM make dist
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C58@taurus.voltaire.com>

Hi Eitan,

I needed it for SCinet 05 staging. It can easily be backed out.

-- Hal

________________________________

From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
Sent: Wed 10/26/2005 12:42 PM
To: Hal Rosenstock
Cc: OPENIB GENERAL
Subject: OSM: console patch breaks OpenSM make dist

Hi Hal,

The last commit you have done breaks the OpenSM make dist.
Seems you did not try to make dist.

I was also surprised that you have committed OpenSM console patch while it
is still under discussion.
The following patch resolves that issue:

Index: osm/include/Makefile.am
===================================================================
--- osm/include/Makefile.am	(revision 3880)
+++ osm/include/Makefile.am	(working copy)
@@ -35,6 +35,7 @@ EXTRA_DIST = \
 	$(srcdir)/opensm/osm_sa_service_record.h \
 	$(srcdir)/opensm/osm_sa_response.h \
 	$(srcdir)/opensm/osm_node.h \
+	$(srcdir)/opensm/osm_console.h \
 	$(srcdir)/opensm/osm_sa_slvl_record_ctrl.h \
 	$(srcdir)/opensm/osm_req.h \
 	$(srcdir)/opensm/osm_mcm_info.h \

From mshefty at ichips.intel.com Wed Oct 26 10:01:17 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 26 Oct 2005 10:01:17 -0700
Subject: [openib-general] CMA service ID space versus private data bytes
In-Reply-To: <52hdb41akr.fsf@cisco.com>
References: <435FAFFC.6080604@ichips.intel.com> <52hdb41akr.fsf@cisco.com>
Message-ID: <435FB65D.1020101@ichips.intel.com>

Roland Dreier wrote:
>     Sean> This results in the CMA reserving 2^40 IDs (about 6% of the
>     Sean> total range).
>
> Just to be nitpicky -- 2^40 service IDs is 1 / 2^24 of the total
> number of service IDs (2^64), which is about .000006%.

Uhm.. good catch. That makes a difference in how important it is to conserve
those IDs. Thanks.

- Sean

From halr at voltaire.com Wed Oct 26 10:09:12 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: Wed, 26 Oct 2005 19:09:12 +0200
Subject: [openib-general] FW: [openib-commits] r3881 -gen2/trunk/src/userspace/management/osm/include
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C59@taurus.voltaire.com>

Eitan,

You should not be checking things into the trunk like this. The procedure is
to submit them to the maintainer and the maintainer verifies them and checks
them in.
-- Hal

________________________________

From: openib-commits-bounces at openib.org on behalf of eitan at openib.org
Sent: Wed 10/26/2005 12:50 PM
To: openib-commits at openib.org
Subject: [openib-commits] r3881 -gen2/trunk/src/userspace/management/osm/include

Author: eitan
Date: 2005-10-26 09:50:25 -0700 (Wed, 26 Oct 2005)
New Revision: 3881

Modified:
   gen2/trunk/src/userspace/management/osm/include/Makefile.am
Log:
Missing osm_console.h in DIST list

Modified: gen2/trunk/src/userspace/management/osm/include/Makefile.am
===================================================================
--- gen2/trunk/src/userspace/management/osm/include/Makefile.am	2005-10-26 16:26:15 UTC (rev 3880)
+++ gen2/trunk/src/userspace/management/osm/include/Makefile.am	2005-10-26 16:50:25 UTC (rev 3881)
@@ -35,6 +35,7 @@
 	$(srcdir)/opensm/osm_sa_service_record.h \
 	$(srcdir)/opensm/osm_sa_response.h \
 	$(srcdir)/opensm/osm_node.h \
+	$(srcdir)/opensm/osm_console.h \
 	$(srcdir)/opensm/osm_sa_slvl_record_ctrl.h \
 	$(srcdir)/opensm/osm_req.h \
 	$(srcdir)/opensm/osm_mcm_info.h \

_______________________________________________
openib-commits mailing list
openib-commits at openib.org
http://openib.org/mailman/listinfo/openib-commits

From Arkady.Kanevsky at netapp.com Wed Oct 26 10:11:02 2005
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Wed, 26 Oct 2005 13:11:02 -0400
Subject: [swg] RE: [openib-general] RE: [dat-discussions] round 2 - proposal forsocket based connection model
Message-ID:

Of course, you can encode versions into the service ID.
But that will mix concepts, and I do not believe it is worth it
just to provide a couple more bytes of Consumer private data.
This encoding will not be enough to give the Consumer 64 bytes
of private data.

The port numbers are mapped differently for different protocol
numbers (families). If we are only concerned with TCP port mapping,
this will not be needed. But the ULP right now makes its decision by
the standard socket 5-tuple, which does include it.
I prefer that we do not require any changes in ULP to run over IB. We can do that in the API if there is no need to support more than just TCP. IN this case API can always return the protocol number for TCP to a Consumer. One concern I have is that some existing ULPs (say SDP) rely on the existing format of the private data. Thus, it would not want to use this CM encoding. I do not want to force it to change. Thus, a bit in CM which indicate whether encoding is present looks like a right approach. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Yaron Haviv [mailto:yaronh at voltaire.com] > Sent: Wednesday, October 26, 2005 12:21 PM > To: Kanevsky, Arkady; Sean Hefty > Cc: swg at infinibandta.org; openib-general at openib.org; > dat-discussions at yahoogroups.com > Subject: [swg] RE: [openib-general] RE: [dat-discussions] > round 2 - proposal forsocket based connection model > > > > -----Original Message----- > > From: openib-general-bounces at openib.org [mailto:openib-general- > > bounces at openib.org] On Behalf Of Kanevsky, Arkady > > Sent: Tuesday, October 25, 2005 1:26 PM > > To: Sean Hefty > > Cc: swg at infinibandta.org; openib-general at openib.org; dat- > > discussions at yahoogroups.com > > Subject: RE: [openib-general] RE: [dat-discussions] round 2 > - proposal > > forsocket based connection model > > > > Think of a single API that supports iWARP and IB (transport > independent > > API). > > To a connection listener it provides the IP 5-tuple + private data. > > For IB it means that CM parses REQ and extracts IP 5-tuple > as separate > > fields from private data. Listener does not parse the private data > > encoding of the proposal. > > > > So CM need to know if it need to encode IP 5-tuple on > requestor side > > and if need to parse on responder side. 
Arkady > > > > Arkady, I agree with Sean you can encode the Dest Port in the > ServiceID > And if you really want to verify its using that format you can look at > the upper 48 bits in the serviceID. > > We may need to distinguish between Explicit RDMA protocols (iSER, > NFS-RDMA, RDP, etc') and Implicit RDMA (SDP, where the Socket > application doesn't know it is using RDMA), this can be done > in 3 ways: > a. port mapper, b. different ServiceID prefix, or c. a bit in > the CM REQ > Header. > > Also I'm not sure why we need the Protocol (UDP, TCP, SCTP, > ..) since we > emulate RDMA we shouldn't care if its TCP or SCTP, and UDP is > unconnected and cant drive RDMA anyway > > Yaron > > > > > > Arkady Kanevsky email: arkady at netapp.com > > Network Appliance phone: 781-768-5395 > > 375 Totten Pond Rd. Fax: 781-895-1195 > > Waltham, MA 02451-2010 central phone: 781-768-5300 > > > > > > > > > -----Original Message----- > > > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > > > Sent: Tuesday, October 25, 2005 1:08 PM > > > To: Kanevsky, Arkady > > > Cc: Caitlin Bestler; dat-discussions at yahoogroups.com; > > > openib-general at openib.org; swg at infinibandta.org > > > Subject: Re: [openib-general] RE: [dat-discussions] round 2 - > > > proposal for socket based connection model > > > > > > > > > Kanevsky, Arkady wrote: > > > > Correct. > > > > But this does bring the question how responder CM knows > > > that it need > > > > to parse the private data. I suspect this will be done via > > > new version > > > > of CM. But a suage of some of the CM REQ reserved > fields are also > > > > possible. Anotherwords the current CM version assumes > that CM only > > > > supports one version and there is no need to support more than 1 > > > > version. > > > > > > The responder knows how to parse the private data based on > > > the service ID that > > > they're listening on. This is how it's done today, and how > > > it will still need > > > to be done. 
What is the motivation to change it? > > > > > > What data is beyond the addressing? How does the responder > > > know how to > > > interpret that? > > > > > > - Sean > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib- > > general > From Arkady.Kanevsky at netapp.com Wed Oct 26 10:15:38 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 26 Oct 2005 13:15:38 -0400 Subject: [openib-general] round 2 - proposal for socketbased connectionmodel Message-ID: This is the whole purpose of the protocol. It is OS independent and ensures interoperability. Nobody will change their OS protocol implementation so it can communicate to Linux (or any other OS or vendor) that invented its own protocol... It is not OS (linux no exception) job to invent protocols. But I think this argument have been bitten enough already. Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance phone: 781-768-5395 375 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 > -----Original Message----- > From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] > Sent: Wednesday, October 26, 2005 11:33 AM > To: Kanevsky, Arkady; Sean Hefty > Cc: swg at infinibandta.org; openib-general at openib.org > Subject: RE: [openib-general] round 2 - proposal for > socketbased connectionmodel > > > Arkady wrote, > >This is what we are trying to avoid. > >ULP should not change regardless whether or not it is running on IB, > >iWARP, VIA or any other RDMA transport. > > The whole point of the CMA is that the ULP can code to an > API that is independent of RDMA interconnect. The > CMA wire protocol can be documented to allow > non-Linux hosts to connect to a Linux box using > the same protocol. 
There is no need to change the existing > IB CM protocol to accomplish this. All that is needed is > to document that CMA protocol (contained in the private data > field of the IB CM requests). > > woody > > > From caitlinb at broadcom.com Wed Oct 26 10:27:54 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 26 Oct 2005 10:27:54 -0700 Subject: [openib-general] RFC userspace CMA Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020B2F@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Tom Tucker > Sent: Wednesday, October 26, 2005 5:01 AM > To: Sean Hefty > Cc: openib > Subject: Re: [openib-general] RFC userspace CMA > > Sean: > > FYI, I've started writing the iw_cm that sits below the > rdma_cm. Here's the general picture I have in mind. > > +---------+ > | RDMA CM | > +-+-----+-+ > | | > +----+ +----+ > | | > +---------+ +----+----+ > | IB CM | | IW CM | > +----+----+ +----+----+ > | | > ____+_____ ____+_____ > +---------+| +---------+| > | IB devs || | IW devs || > +---------+ +---------+ > > The purpose of the IW CM is to abstract the two different > connection models used by the iWARP side: offloaded and host > integrated, and to act as a shim between device specific > connection data structures and the rdma_cm data structures. > How much logic is really in the RDMA CM? If it is sufficiently small, which is what my expectation is, would it make sense to simply make the IB CM and IW CM conform to the same polymorphic interface? (Making the "RDMA CM" little more than a re-directing inline function). But if there is any substantial portion of common logic then the above structure definitely makes sense. My main concern is that the Connection Request reported up from here has the same semantics over IB CM as it does over IW CM. The remote IP Address of an established connection is pretty fairly validated. 
It originated from privileged code on the remote side (presumably the kernel) and the routers actually got packets back to that machine using that address. So a daemon using that address for client validation is totally reasonable, over an IP network. An equivalent degree of reliability should be there for IB. Using the SM to validate a translation achieved that. There are certainly other ways to achieve it. But we need to be clear that it is part of what the application is expecting when it is told that remote IP Address X is requesting a connection. From mshefty at ichips.intel.com Wed Oct 26 10:40:24 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 26 Oct 2005 10:40:24 -0700 Subject: [openib-general] RFC userspace CMA In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020B2F@NT-SJCA-0751.brcm.ad.broadcom.com> References: <54AD0F12E08D1541B826BE97C98F99F1020B2F@NT-SJCA-0751.brcm.ad.broadcom.com> Message-ID: <435FBF88.9070809@ichips.intel.com> Caitlin Bestler wrote: > How much logic is really in the RDMA CM? > > If it is sufficiently small, which is what my expectation is, > would it make sense to simply make the IB CM and IW CM conform > to the same polymorphic interface? (Making the "RDMA CM" little > more than a re-directing inline function). > > But if there is any substantial portion of common logic > then the above structure definitely makes sense. The kernel CMA is about 660 lines of code. It performs QP transitions for the user, abstracts device remove/addition, plus controls the mapping from IP to IB. (The mapping function makes use of external modules, such as ib_addr and sa_query.) In some places, the implementation of the API of the RDMA CM does little more than redirecting to the IW/IB CM, coupled with synchronization for device removal. However, it also handles the IB CM callbacks to provide simpler connection notification. 
- Sean From halr at voltaire.com Wed Oct 26 10:43:51 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Wed, 26 Oct 2005 19:43:51 +0200 Subject: [openib-general] [RFC] OpenSM Interactive Console Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C51@taurus.voltaire.com> Hi Eitan, I sit corrected. There are R/W parameters in the SM MIB as you indicate. I was thinking of all the other IPoIB MIBs. It's been a while since I looked at the SM MIB. Also, the SM MIB (draft-ietf-ipoib-subnet-manager-mib-00) expired a while ago. At a minimum, it needs to be dusted off. That would include updating it for IBA 1.2. -- Hal ________________________________ From: Eitan Zahavi [mailto:eitan at mellanox.co.il] Sent: Tue 10/25/2005 5:19 AM To: Hal Rosenstock Cc: Troy Benjegerdes; openib-general at openib.org Subject: Re: [openib-general] [RFC] OpenSM Interactive Console Hal Rosenstock wrote: > On Mon, 2005-10-24 at 14:38, Eitan Zahavi wrote: > >>Hal Rosenstock wrote: >> >>>On Mon, 2005-10-24 at 03:08, Eitan Zahavi wrote: >>> >>> >>>>I would suggest to use SNMP for the tasks below. IETF IPoIB group > > has > >>>>defined an SNMP MIB that can support the required functionality > > below. > >>> >>>The IETF SNMP MIBs are one way of presenting the information to the >>>outside world. There are other possible management interfaces. The > > SNMP > >>>MIB instrumentation would need to use lower layer APIs to get this >>>information out of the SM. >> >>Yes but the IETF SM MIB is the only one that is close to a standard > > way. > >>It does not require low level interface if it will integrate into the > > OpenSM code. > >>One way to do it is buy extending OpenSM with an AgentX interface. >> >>IMO one clear advantage of using SNMP for SM integration is that the > > code will work with any SM that is IETF compliant. 
> >>Also if you want to write a "client server" type of application on top > > of an SM you > >>can either stick to sending MADs which translate into SA client based > > application or > >>you better stay with some known protocol for management (like SNMP) > > and not develop yet another protocol for > >>doing exactly the same things as SNMP already supports. > > > There are limitations in the SNMP MIBs. One is that they are RO so they > are more for monitoring. Also, many environments do not use SNMP. It is > unclear how much of a requirement it is to manage any SM or how many > other SMs support the SM MIB. (There are other IB associated MIBs too). SNMP MIBs are certainly not just RO a simple example from the SM MIB: ibSmPortInfoLMC OBJECT-TYPE SYNTAX Unsigned32(0..7) MAX-ACCESS read-write STATUS current DESCRIPTION "LID mask for multipath support. User should take extra caution when setting this value, since any change will effect packet routing." ::= { ibSmPortInfoEntry 19 } I agree that it is possible that currently no SM is supporting the SM MIB. But it does make sense to have ALL of the them support it. Such that they can be activated/deactivated and configured in the manner. Most unix distributions and windows box have standard SNMP agent and client included in them So it does not take more then simple bash or C code to interact with the SM if it supports SNMP. > > >>>>Everything but the dynamic partitioning (OpenSM does not have >>>>partition manager to this moment) >>> >>> >>>What Troy meant by partitioning is not necessarily IB partitioning. >> >>How are you sure about that? Troy - please comment. > > > I think you missed an email on this. > > >>>>and forwarding of Performance >>>>Monitoring traps (which are generated by the PM) can be done through >>>>osmsh or through SA client today. >>> >>> >>>What PerfMgr are you referring to ? >> >>No specific one. But the specification does not require the SM too. > > > Huh ? What spec ? 
An SM is required in a subnet. There is no subnet > without this. There is a subnet without a PerfMgr. Yes its a typo I meant PM. SM is a requirement. You know I did not mean that. > > >>For various reasons (like load) it might make more sense to have the > > PM distributed. > > Sure. Also, the PerfMgr need not be colocated with the SM anyhow. > > >>Anyway, my point is that the SM is not the owner of PM trap reporting. > > It is the PM that > >>should support Reporting (I.e InformInfo registration and Trap > > forwarding) for PM traps. > >>But the spec does not define such traps anyway. > > > My point was that the PerfMgr is beyond the IBA spec. It is only the PMA > that is defined and has no traps so these will all need synthesis by the > PerfMgr. Agree. > > -- Hal > From caitlinb at broadcom.com Wed Oct 26 10:45:00 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 26 Oct 2005 10:45:00 -0700 Subject: [openib-general] RFC userspace CMA Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020B32@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: Sean Hefty [mailto:mshefty at ichips.intel.com] > Sent: Wednesday, October 26, 2005 10:40 AM > To: Caitlin Bestler > Cc: Tom Tucker; openib > Subject: Re: [openib-general] RFC userspace CMA > > Caitlin Bestler wrote: > > How much logic is really in the RDMA CM? > > > > If it is sufficiently small, which is what my expectation > is, would it > > make sense to simply make the IB CM and IW CM conform to the same > > polymorphic interface? (Making the "RDMA CM" little more than a > > re-directing inline function). > > > > But if there is any substantial portion of common logic > then the above > > structure definitely makes sense. > > The kernel CMA is about 660 lines of code. It performs QP > transitions for the user, abstracts device remove/addition, > plus controls the mapping from IP to IB. > (The mapping function makes use of external modules, such > as ib_addr and > sa_query.) 
> > In some places, the implementation of the API of the RDMA CM > does little more than redirecting to the IW/IB CM, coupled > with synchronization for device removal. However, it also > handles the IB CM callbacks to provide simpler connection > notification. > So it sounds like the justification for the RDMA CM being a distinct module is to centralize handling of device addition and removal. Beyond that you are incorporating IB-specific but device-independent logic. As a goal, the iWARP side should be migrating there as well. From tom at opengridcomputing.com Wed Oct 26 10:47:32 2005 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 26 Oct 2005 12:47:32 -0500 Subject: [openib-general] RFC userspace CMA In-Reply-To: <435FA39A.7040700@ichips.intel.com> References: <435D76E9.5040404@ichips.intel.com> <435EBA85.7050107@ichips.intel.com> <1130328043.18967.13.camel@mail.es335.com> <435FA39A.7040700@ichips.intel.com> Message-ID: <1130348852.12337.6.camel@trinity.austin.ammasso.com> On Wed, 2005-10-26 at 08:41 -0700, Sean Hefty wrote: > Tom Tucker wrote: > > FYI, I've started writing the iw_cm that sits below the rdma_cm. Here's > > the general picture I have in mind. > > > > +---------+ > > | RDMA CM | > > +-+-----+-+ > > | | > > +----+ +----+ > > | | > > +---------+ +----+----+ > > | IB CM | | IW CM | > > +----+----+ +----+----+ > > | | > > ____+_____ ____+_____ > > +---------+| +---------+| > > | IB devs || | IW devs || > > +---------+ +---------+ > > This is what I was envisioning as well. > > > I am also migrating the current iw_cm.h file to match the interfaces in > > the rdma_cm more closely. > > Note that there are still some changes occurring to the rdma_cm to support > userspace. I'm concerned about how well these changes map to iWarp, since the > changes expose the three-way CM handshake used by IB. > > > In general, the IW CM methods look very much like sockets connect, > > listen, and accept. 
There is an iw_cm_id like the ib_cm_id that
> > encapsulates the 5-tuple, a callback for IW CM events and a "provider
> > handle" that represents the adapter "connection cookie". The iw_cm_id is
> > passed to connect, accept, etc...
>
> Something that didn't make sense for the kernel rdma_cm running over IB was
> adding a backlog parameter to the listen request. (The IB CM is callback
> driven, so there's not really a backlog.) I will probably add this to the
> userspace API. Does iWarp need a backlog parameter in the kernel?

It is needed by some adapters. For the AMSO1100 it's passed down to the
adapter to reserve syn cache entries for incoming connections.

>
> > depending on the model. This means that calls like listen with a local
> > port wildcard can't return until the "listen_reply" comes back from the
> > adapter.
>
> I didn't quite follow this. Right now, the rdma_cm only tries to support
> wildcard IP addresses. Are you wanting to support listening on any port as
> well? What is a listen_reply?
>

Yes, it's funky. Basically, the listen in this context is the combination
of a 'bind' and a 'listen'. If you specify 0 for a port number on bind,
the stack will allocate one for you. MPI uses this to allocate a port and
then advertises this port to a central server (node of rank 0) who tells
the other servers how to contact each other. This avoids having to
allocate a well-known port for each node in the MPI cluster and allows
multiple apps to run concurrently without allocating additional
well-known ports.

The listen_reply from the adapter returns the port chosen and the status
of the listen request. It is somewhat analogous to the
insert_listen_.... in the IB CM framework.
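
[Illustration of the "port 0" behaviour described above, using plain TCP
sockets rather than the iWARP adapter interface. This is only a sketch:
the function name listen_on_assigned_port() is made up for this example
and is not part of the iw_cm or rdma_cm code. It binds to port 0, lets the
stack pick a port, and reads the chosen port back with getsockname(),
which plays roughly the role the listen_reply plays for the adapter.]

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind a TCP socket to port 0 on the loopback address, start listening,
 * and report the port the stack picked.
 *
 * Returns the port in host byte order, or -1 on error. On success *fd_out
 * holds the listening socket; the caller is responsible for closing it.
 * (Hypothetical helper, for illustration only.)
 */
int listen_on_assigned_port(int *fd_out)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);   /* 0 means "assign a port for me" */

    /* The backlog of 8 is an arbitrary value chosen for this sketch. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 8) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }

    *fd_out = fd;
    return ntohs(addr.sin_port);
}
```

[An MPI-style launcher would then advertise the returned port to the rank 0
node instead of relying on a well-known port per process.]
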
> - Sean
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From mshefty at ichips.intel.com Wed Oct 26 10:54:09 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 26 Oct 2005 10:54:09 -0700
Subject: [openib-general] RFC userspace CMA
In-Reply-To: <54AD0F12E08D1541B826BE97C98F99F1020B32@NT-SJCA-0751.brcm.ad.broadcom.com>
References: <54AD0F12E08D1541B826BE97C98F99F1020B32@NT-SJCA-0751.brcm.ad.broadcom.com>
Message-ID: <435FC2C1.3010602@ichips.intel.com>

Caitlin Bestler wrote:
> So it sounds like the justification for the RDMA CM being
> a distinct module is to centralize handling of device addition
> and removal. Beyond that you are incorporating IB-specific
> but device-independent logic. As a goal, the iWARP side
> should be migrating there as well.

As a general rule, the code is organized into a group of general functions
that are transport independent, and functions that are specific to a given
transport. For example, rdma_accept() is written as:

	if (!cma_comp(id_priv, CMA_CONNECT))
		return -EINVAL;

	switch (id->device->node_type) {
	case IB_NODE_CA:
		ret = cma_accept_ib(id_priv, conn_param);
		break;
	default:
		ret = -ENOSYS;
		break;
	}

	if (ret)
		goto reject;

	return 0;

The IB specific code is in separate functions from the transport
independent code, but shares the same file. I did not want to try to define
common lower-level interfaces, such as a cma_accept_iwarp(), at this point.
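
[A self-contained sketch of the same dispatch pattern, for readers without
the kernel tree at hand. Every name here (struct conn_id, generic_accept(),
accept_ib(), the node_type enum) is a hypothetical stand-in, not the actual
CMA code; only the shape follows the fragment above: a transport-independent
entry point checks state, then switches on the node type and calls a
transport-specific helper, returning -ENOSYS for transports it has no
helper for.]

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real node types and id structure. */
enum node_type { NODE_IB_CA, NODE_IWARP_RNIC, NODE_UNKNOWN };

struct conn_id {
    enum node_type node_type;
    int connecting;   /* non-zero while the id is in the connect state */
    int accepted;     /* set by the transport-specific accept helper */
};

/* Transport-specific part: what an IB-only accept helper might do. */
int accept_ib(struct conn_id *id)
{
    id->accepted = 1;
    return 0;
}

/* Transport-independent entry point that dispatches on node type. */
int generic_accept(struct conn_id *id)
{
    if (id == NULL || !id->connecting)
        return -EINVAL;          /* wrong state for an accept */

    switch (id->node_type) {
    case NODE_IB_CA:
        return accept_ib(id);
    default:
        return -ENOSYS;          /* no helper for this transport yet */
    }
}
```

[Adding another transport in this layout would mean one more case label and
one more helper, with the state check left untouched.]
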
- Sean

From mshefty at ichips.intel.com Wed Oct 26 11:02:21 2005
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 26 Oct 2005 11:02:21 -0700
Subject: [openib-general] RFC userspace CMA
In-Reply-To: <1130348852.12337.6.camel@trinity.austin.ammasso.com>
References: <435D76E9.5040404@ichips.intel.com>
	<435EBA85.7050107@ichips.intel.com>
	<1130328043.18967.13.camel@mail.es335.com>
	<435FA39A.7040700@ichips.intel.com>
	<1130348852.12337.6.camel@trinity.austin.ammasso.com>
Message-ID: <435FC4AD.2070701@ichips.intel.com>

Tom Tucker wrote:
>>Something that didn't make sense for the kernel rdma_cm running over IB was
>>adding a backlog parameter to the listen request. (The IB CM is callback
>>driven, so there's not really a backlog.) I will probably add this to the
>>userspace API. Does iWarp need a backlog parameter in the kernel?
>
> It is needed by some adapters. For the AMSO1100 it's passed down to the
> adapter to reserve syn cache entries for incoming connections.

There's no problem adding this. It's just that for IB, the user can control
the backlog themselves, which gives a little more flexibility. I suppose we
can define it as the maximum number of unacknowledged connection requests
that a user can have, where acknowledging a connection request is done by
accepting or rejecting the connection. I'll work on adding this after
getting the userspace CMA up.

> Yes it's funky. Basically, the listen in this context is the combination
> of a 'bind' and a 'listen'. If you specify 0 for a port number on bind,
> the stack will allocate one for you.

I need to think about this more. I'm not sure how to handle a bind of 0
over IB. Can you handle this on the bind only, as opposed to listen?
- Sean

From tom at opengridcomputing.com Wed Oct 26 11:35:47 2005
From: tom at opengridcomputing.com (Tom Tucker)
Date: Wed, 26 Oct 2005 13:35:47 -0500
Subject: [openib-general] RFC userspace CMA
In-Reply-To: <435FC4AD.2070701@ichips.intel.com>
References: <435D76E9.5040404@ichips.intel.com>
	<435EBA85.7050107@ichips.intel.com>
	<1130328043.18967.13.camel@mail.es335.com>
	<435FA39A.7040700@ichips.intel.com>
	<1130348852.12337.6.camel@trinity.austin.ammasso.com>
	<435FC4AD.2070701@ichips.intel.com>
Message-ID: <1130351747.12735.5.camel@trinity.austin.ammasso.com>

Ah, I think I understand the confusion. 0 in this case, doesn't mean
"wildcard", it means "assign a port for me". I think you already handle
this in IB... from ib_cm_listen ...

	spin_lock_irqsave(&cm.lock, flags);
	if (service_id == IB_CM_ASSIGN_SERVICE_ID) {
--->		cm_id->service_id = cpu_to_be64(cm.listen_service_id++);
--->		cm_id->service_mask = __constant_cpu_to_be64(~0ULL);
	} else {
		cm_id->service_id = service_id;
		cm_id->service_mask = service_mask;
	}
	cur_cm_id_priv = cm_insert_listen(cm_id_priv);

On Wed, 2005-10-26 at 11:02 -0700, Sean Hefty wrote:
> Tom Tucker wrote:
> >>Something that didn't make sense for the kernel rdma_cm running over IB was
> >>adding a backlog parameter to the listen request. (The IB CM is callback
> >>driven, so there's not really a backlog.) I will probably add this to the
> >>userspace API. Does iWarp need a backlog parameter in the kernel?
> >
> > It is needed by some adapters. For the AMSO1100 it's passed down to the
> > adapter to reserve syn cache entries for incoming connections.
>
> There's no problem adding this. It just that for IB, the user can control the
> backlog themselves, which gives a little more flexibility. I suppose we can
> define it as the maximum number of unacknowledged connection requests that a
> user can have. Where acknowledging a connection request is done by accepting or
> rejecting the connection.
I'll work on adding this after getting the userspace > CMA up. > > > Yes it's funky. Basically, the listen in this context is the combination > > of a 'bind' and a 'listen'. If you specify 0 for a port number on bind, > > the stack will allocate one for you. > > I need to think about this more. I'm not sure how to handle a bind of 0 over > IB. Can you handle this on the bind only, as opposed to listen? > > - Sean From jlentini at netapp.com Wed Oct 26 12:01:27 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 26 Oct 2005 15:01:27 -0400 (EDT) Subject: [openib-general] Re: [PATCH] new uDAPL openIB provider using socket CM In-Reply-To: References: Message-ID: On Tue, 25 Oct 2005, Arlin Davis wrote: > James, > > Here is a patch to add an optional openIB uDAPL provider that uses the socket CM for anyone having > problems scaling out with the uCM/uAT version. To build the new provider, simply "make > VERBS=openib_scm". This version does not require IPoIB, uCM, or uAT. > > -arlin > > Signed-off by: Arlin Davis Some of the new files in dapl/openib_scm use this license: dapl/openib_scm/dapl_ib_util.c > +/* > + * This Software is licensed under both of the following two licenses: > + * > + * 1) under the terms of the "Common Public License 1.0" a copy of which is > + * in the file LICENSE.txt in the root directory. The license is also > + * available from the Open Source Initiative, see > + * http://www.opensource.org/licenses/cpl.php. > + * OR > + * > + * 2) under the terms of the "The BSD License" a copy of which is in the file > + * LICENSE2.txt in the root directory. The license is also available from > + * the Open Source Initiative, see > + * http://www.opensource.org/licenses/bsd-license.php. > + * > + * Licensee has the right to choose either one of the above two licenses. > + * > + * Redistributions of source code must retain both the above copyright > + * notice and either one of the license notices. 
> + * > + * Redistributions in binary form must reproduce both the above copyright > + * notice, either one of the license notices in the documentation > + * and/or other materials provided with the distribution. > + */ and other files use this license dapl/openib_scm/dapl_ib_mem.c > +/* > + * This Software is licensed under one of the following licenses: > + * > + * 1) under the terms of the "Common Public License 1.0" a copy of which is > + * available from the Open Source Initiative, see > + * http://www.opensource.org/licenses/cpl.php. > + * > + * 2) under the terms of the "The BSD License" a copy of which is > + * available from the Open Source Initiative, see > + * http://www.opensource.org/licenses/bsd-license.php. > + * > + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a > + * copy of which is available from the Open Source Initiative, see > + * http://www.opensource.org/licenses/gpl-license.php. > + * > + * Licensee has the right to choose one of the above licenses. > + * > + * Redistributions of source code must retain the above copyright > + * notice and one of the license notices. > + * > + * Redistributions in binary form must reproduce both the above copyright > + * notice, one of the license notices in the documentation > + * and/or other materials provided with the distribution. > + */ I'd like all of the files to use the later. Is that acceptable to you? If so, please send a new patch with this change. james From arlin.r.davis at intel.com Wed Oct 26 12:46:25 2005 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 26 Oct 2005 12:46:25 -0700 Subject: [openib-general] [PATCH #2] new uDAPL openIB provider using socket CM, corrected license headers In-Reply-To: Message-ID: James, This version includes updated license headers per your request. 
-arlin Signed-off by: Arlin Davis Index: dapl/udapl/Makefile =================================================================== --- dapl/udapl/Makefile (revision 3848) +++ dapl/udapl/Makefile (working copy) @@ -139,6 +139,16 @@ CFLAGS += -I/usr/local/include/infinib endif # +# OpenIB provider with Socket CM +# +ifeq ($(VERBS),openib_scm) +PROVIDER = $(TOPDIR)/../openib_scm +CFLAGS += -DOPENIB +CFLAGS += -DCQ_WAIT_OBJECT +CFLAGS += -I/usr/local/include/infiniband +endif + +# # If an implementation supports CM and DTO completions on the same EVD # then DAPL_MERGE_CM_DTO should be set # CFLAGS += -DDAPL_MERGE_CM_DTO=1 @@ -251,6 +261,13 @@ PROVIDER_SRCS = dapl_ib_util.c dapl_ib_ PROVIDER_SRCS += dapl_ib_cm.c dapl_ib_mem.c endif +ifeq ($(VERBS),openib_scm) +LDFLAGS += -libverbs +LDFLAGS += -rpath /usr/local/lib -L /usr/local/lib +PROVIDER_SRCS = dapl_ib_util.c dapl_ib_cq.c dapl_ib_qp.c \ + dapl_ib_cm.c dapl_ib_mem.c +endif + UDAPL_SRCS = dapl_init.c \ dapl_evd_create.c \ dapl_evd_query.c \ Index: dapl/openib_scm/dapl_ib_dto.h =================================================================== --- dapl/openib_scm/dapl_ib_dto.h (revision 0) +++ dapl/openib_scm/dapl_ib_dto.h (revision 0) @@ -0,0 +1,262 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. 
+ * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_dto.h + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - DTO operations and CQE macros + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + **************************************************************************/ +#ifndef _DAPL_IB_DTO_H_ +#define _DAPL_IB_DTO_H_ + +#include "dapl_ib_util.h" + +#define DEFAULT_DS_ENTRIES 8 + +STATIC _INLINE_ int dapls_cqe_opcode(ib_work_completion_t *cqe_p); + +/* + * dapls_ib_post_recv + * + * Provider specific Post RECV function + */ +STATIC _INLINE_ DAT_RETURN +dapls_ib_post_recv ( + IN DAPL_EP *ep_ptr, + IN DAPL_COOKIE *cookie, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov ) +{ + ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; + ib_data_segment_t *ds_array_p; + struct ibv_recv_wr wr; + struct ibv_recv_wr *bad_wr; + DAT_COUNT i, total_len; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " post_rcv: ep %p cookie %p segs %d l_iov %p\n", + ep_ptr, cookie, segments, local_iov); + + if ( segments <= DEFAULT_DS_ENTRIES ) + ds_array_p = ds_array; + else + ds_array_p = dapl_os_alloc(segments * sizeof(ib_data_segment_t)); + + if (NULL == ds_array_p) + return (DAT_INSUFFICIENT_RESOURCES); + + /* setup work request */ + total_len = 0; + wr.next = 0; + wr.num_sge = 0; + wr.wr_id = (uint64_t)(uintptr_t)cookie; + wr.sg_list = ds_array_p; + + for (i = 0; i < segments; i++ ) { + if ( 
!local_iov[i].segment_length ) + continue; + + ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; + ds_array_p->length = local_iov[i].segment_length; + ds_array_p->lkey = local_iov[i].lmr_context; + + dapl_dbg_log ( DAPL_DBG_TYPE_EP, + " post_rcv: l_key 0x%x va %p len %d\n", + ds_array_p->lkey, ds_array_p->addr, + ds_array_p->length ); + + total_len += ds_array_p->length; + wr.num_sge++; + ds_array_p++; + } + + if (cookie != NULL) + cookie->val.dto.size = total_len; + + if (ibv_post_recv(ep_ptr->qp_handle, &wr, &bad_wr)) + return( dapl_convert_errno(EFAULT,"ibv_recv") ); + + return DAT_SUCCESS; +} + + +/* + * dapls_ib_post_send + * + * Provider specific Post SEND function + */ +STATIC _INLINE_ DAT_RETURN +dapls_ib_post_send ( + IN DAPL_EP *ep_ptr, + IN ib_send_op_type_t op_type, + IN DAPL_COOKIE *cookie, + IN DAT_COUNT segments, + IN DAT_LMR_TRIPLET *local_iov, + IN const DAT_RMR_TRIPLET *remote_iov, + IN DAT_COMPLETION_FLAGS completion_flags) +{ + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " post_snd: ep %p op %d ck %p sgs %d l_iov %p r_iov %p f %d\n", + ep_ptr, op_type, cookie, segments, local_iov, + remote_iov, completion_flags); + + ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; + ib_data_segment_t *ds_array_p; + struct ibv_send_wr wr; + struct ibv_send_wr *bad_wr; + ib_hca_transport_t *ibt_ptr = &ep_ptr->header.owner_ia->hca_ptr->ib_trans; + DAT_COUNT i, total_len; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " post_snd: ep %p cookie %p segs %d l_iov %p\n", + ep_ptr, cookie, segments, local_iov); + + if( segments <= DEFAULT_DS_ENTRIES ) + ds_array_p = ds_array; + else + ds_array_p = dapl_os_alloc(segments * sizeof(ib_data_segment_t)); + + if (NULL == ds_array_p) + return (DAT_INSUFFICIENT_RESOURCES); + + /* setup the work request */ + wr.next = 0; + wr.opcode = op_type; + wr.num_sge = 0; + wr.send_flags = 0; + wr.wr_id = (uint64_t)(uintptr_t)cookie; + wr.sg_list = ds_array_p; + total_len = 0; + + for (i = 0; i < segments; i++ ) { + if ( !local_iov[i].segment_length 
) + continue; + + ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; + ds_array_p->length = local_iov[i].segment_length; + ds_array_p->lkey = local_iov[i].lmr_context; + + dapl_dbg_log ( DAPL_DBG_TYPE_EP, + " post_snd: lkey 0x%x va %p len %d \n", + ds_array_p->lkey, ds_array_p->addr, + ds_array_p->length ); + + total_len += ds_array_p->length; + wr.num_sge++; + ds_array_p++; + } + + if (cookie != NULL) + cookie->val.dto.size = total_len; + + if ((op_type == OP_RDMA_WRITE) || (op_type == OP_RDMA_READ)) { + wr.wr.rdma.remote_addr = remote_iov->target_address; + wr.wr.rdma.rkey = remote_iov->rmr_context; + dapl_dbg_log ( DAPL_DBG_TYPE_EP, + " post_snd_rdma: rkey 0x%x va %#016Lx\n", + wr.wr.rdma.rkey, wr.wr.rdma.remote_addr ); + } + + /* inline data for send or write ops */ + if ((total_len <= ibt_ptr->max_inline_send ) && + ((op_type == OP_SEND) || (op_type == OP_RDMA_WRITE))) + wr.send_flags |= IBV_SEND_INLINE; + + /* set completion flags in work request */ + wr.send_flags |= (DAT_COMPLETION_SUPPRESS_FLAG & + completion_flags) ? 0 : IBV_SEND_SIGNALED; + wr.send_flags |= (DAT_COMPLETION_BARRIER_FENCE_FLAG & + completion_flags) ? IBV_SEND_FENCE : 0; + wr.send_flags |= (DAT_COMPLETION_SOLICITED_WAIT_FLAG & + completion_flags) ? 
IBV_SEND_SOLICITED : 0; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " post_snd: op 0x%x flags 0x%x sglist %p, %d\n", + wr.opcode, wr.send_flags, wr.sg_list, wr.num_sge); + + if (ibv_post_send(ep_ptr->qp_handle, &wr, &bad_wr)) + return( dapl_convert_errno(EFAULT,"ibv_send") ); + + dapl_dbg_log (DAPL_DBG_TYPE_EP," post_snd: returned\n"); + return DAT_SUCCESS; +} + +STATIC _INLINE_ DAT_RETURN +dapls_ib_optional_prv_dat ( + IN DAPL_CR *cr_ptr, + IN const void *event_data, + OUT DAPL_CR **cr_pp) +{ + return DAT_SUCCESS; +} + +STATIC _INLINE_ int dapls_cqe_opcode(ib_work_completion_t *cqe_p) +{ + switch (cqe_p->opcode) { + case IBV_WC_SEND: + return (OP_SEND); + case IBV_WC_RDMA_WRITE: + return (OP_RDMA_WRITE); + case IBV_WC_RDMA_READ: + return (OP_RDMA_READ); + case IBV_WC_COMP_SWAP: + return (OP_COMP_AND_SWAP); + case IBV_WC_FETCH_ADD: + return (OP_FETCH_AND_ADD); + case IBV_WC_BIND_MW: + return (OP_BIND_MW); + case IBV_WC_RECV: + return (OP_RECEIVE); + case IBV_WC_RECV_RDMA_WITH_IMM: + return (OP_RECEIVE_IMM); + default: + return (OP_INVALID); + } +} + +#define DAPL_GET_CQE_OPTYPE(cqe_p) dapls_cqe_opcode(cqe_p) +#define DAPL_GET_CQE_WRID(cqe_p) ((ib_work_completion_t*)cqe_p)->wr_id +#define DAPL_GET_CQE_STATUS(cqe_p) ((ib_work_completion_t*)cqe_p)->status +#define DAPL_GET_CQE_BYTESNUM(cqe_p) ((ib_work_completion_t*)cqe_p)->byte_len +#define DAPL_GET_CQE_IMMED_DATA(cqe_p) ((ib_work_completion_t*)cqe_p)->imm_data + +#endif /* _DAPL_IB_DTO_H_ */ Index: dapl/openib_scm/dapl_ib_util.c =================================================================== --- dapl/openib_scm/dapl_ib_util.c (revision 0) +++ dapl/openib_scm/dapl_ib_util.c (revision 0) @@ -0,0 +1,472 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. 
+ * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_util.c + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - init, open, close, utilities + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + **************************************************************************/ +#ifdef RCSID +static const char rcsid[] = "$Id: $"; +#endif + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_ib_util.h" + +#include +#include +#include +#include +#include + +int g_dapl_loopback_connection = 0; + +/* just get IP address for hostname */ +DAT_RETURN getipaddr( char *addr, int addr_len) +{ + struct sockaddr_in *ipv4_addr = (struct sockaddr_in*)addr; + struct hostent *h_ptr; + struct utsname ourname; + + if ( uname( &ourname ) < 0 ) + return DAT_INTERNAL_ERROR; + + h_ptr = gethostbyname( ourname.nodename ); + if ( h_ptr == NULL ) + return DAT_INTERNAL_ERROR; + + if ( h_ptr->h_addrtype == AF_INET ) { + ipv4_addr = (struct sockaddr_in*) addr; + ipv4_addr->sin_family = AF_INET; + dapl_os_memcpy( &ipv4_addr->sin_addr, h_ptr->h_addr_list[0], 4 ); + } else + return DAT_INVALID_ADDRESS; + + return DAT_SUCCESS; +} + +/* + * dapls_ib_init, dapls_ib_release + * + * Initialize Verb related items for device open + * + * Input: + * none + * + * Output: + * none + * + * Returns: + * 0 success, -1 error + * + */ +int32_t dapls_ib_init (void) +{ + return 0; +} + +int32_t dapls_ib_release (void) +{ + return 0; +} + +/* + * dapls_ib_open_hca + * + * Open HCA + * + * Input: + * *hca_name pointer to provider device name + * *ib_hca_handle_p pointer to provide HCA handle + * + * Output: + * none + * + * Return: + * DAT_SUCCESS + * dapl_convert_errno + * + */ +DAT_RETURN dapls_ib_open_hca ( + IN IB_HCA_NAME hca_name, + IN DAPL_HCA *hca_ptr) +{ + struct dlist *dev_list; + int opts; + DAT_RETURN dat_status = DAT_SUCCESS; + + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " open_hca: %s - %p\n", hca_name, hca_ptr ); + + /* Get list of all IB devices, find match, open */ + dev_list = ibv_get_devices(); + dlist_start(dev_list); + dlist_for_each_data(dev_list,hca_ptr->ib_trans.ib_dev,struct ibv_device) { + if (!strcmp(ibv_get_device_name(hca_ptr->ib_trans.ib_dev),hca_name)) + break; + } + 
+ if (!hca_ptr->ib_trans.ib_dev) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: IB device %s not found\n", + hca_name); + return DAT_INTERNAL_ERROR; + } + + dapl_dbg_log (DAPL_DBG_TYPE_UTIL," open_hca: Found dev %s %016llx\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + (unsigned long long)bswap_64(ibv_get_device_guid(hca_ptr->ib_trans.ib_dev))); + + hca_ptr->ib_hca_handle = ibv_open_device(hca_ptr->ib_trans.ib_dev); + if (!hca_ptr->ib_hca_handle) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: IB dev open failed for %s\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); + return DAT_INTERNAL_ERROR; + } + + /* set inline max with enviroment or default */ + hca_ptr->ib_trans.max_inline_send = + dapl_os_get_env_val ( "DAPL_MAX_INLINE", INLINE_SEND_DEFAULT ); + + /* initialize cq_lock */ + dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.cq_lock); + if (dat_status != DAT_SUCCESS) + { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: failed to init cq_lock\n"); + goto bail; + } + + /* EVD events without direct CQ channels, non-blocking */ + hca_ptr->ib_trans.ib_cq = + ibv_create_comp_channel(hca_ptr->ib_hca_handle); + opts = fcntl(hca_ptr->ib_trans.ib_cq->fd, F_GETFL); /* uCQ */ + if (opts < 0 || fcntl(hca_ptr->ib_trans.ib_cq->fd, + F_SETFL, opts | O_NONBLOCK) < 0) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: ERR with CQ FD\n" ); + goto bail; + } + + if (dapli_cq_thread_init(hca_ptr)) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: cq_thread_init failed for %s\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev) ); + goto bail; + } + + /* initialize cr_list lock */ + dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.lock); + if (dat_status != DAT_SUCCESS) + { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: failed to init lock\n"); + goto bail; + } + + /* initialize CM list for listens on this HCA */ + dapl_llist_init_head(&hca_ptr->ib_trans.list); + + /* create thread to process inbound connect request */ + hca_ptr->ib_trans.cr_state = 
IB_THREAD_INIT; + dat_status = dapl_os_thread_create(cr_thread, + (void*)hca_ptr, + &hca_ptr->ib_trans.thread ); + if (dat_status != DAT_SUCCESS) + { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + " open_hca: failed to create thread\n"); + goto bail; + } + + /* wait for thread */ + while (hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 20000000; /* 20 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " open_hca: waiting for cr_thread\n"); + nanosleep (&sleep, &remain); + } + + /* get the IP address of the device */ + dat_status = getipaddr((char*)&hca_ptr->hca_address, + sizeof(DAT_SOCK_ADDR6) ); + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " open_hca: %s, port %d, %s %d.%d.%d.%d\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), hca_ptr->port_num, + ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_family == AF_INET ? "AF_INET":"AF_INET6", + ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *)&hca_ptr->hca_address)->sin_addr.s_addr >> 24 & 0xff ); + + return dat_status; +bail: + ibv_close_device(hca_ptr->ib_hca_handle); + hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; + return DAT_INTERNAL_ERROR; +} + + +/* + * dapls_ib_close_hca + * + * Close HCA + * + * Input: + * DAPL_HCA provide CA handle + * + * Output: + * none + * + * Return: + * DAT_SUCCESS + * dapl_convert_errno + * + */ +DAT_RETURN dapls_ib_close_hca ( IN DAPL_HCA *hca_ptr ) +{ + dapl_dbg_log (DAPL_DBG_TYPE_UTIL," close_hca: %p\n",hca_ptr); + + dapli_cq_thread_destroy(hca_ptr); + + if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) { + if (ibv_close_device(hca_ptr->ib_hca_handle)) + return(dapl_convert_errno(errno,"ib_close_device")); + hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; + } + + dapl_os_lock_destroy(&hca_ptr->ib_trans.cq_lock); + + /* destroy 
cr_thread and lock */ + hca_ptr->ib_trans.cr_state = IB_THREAD_CANCEL; + while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 20000000; /* 20 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " close_hca: waiting for cr_thread\n"); + nanosleep (&sleep, &remain); + } + dapl_os_lock_destroy(&hca_ptr->ib_trans.lock); + + return (DAT_SUCCESS); +} + +/* + * dapls_ib_query_hca + * + * Query the hca attribute + * + * Input: + * hca_handl hca handle + * ia_attr attribute of the ia + * ep_attr attribute of the ep + * ip_addr ip address of DET NIC + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_HANDLE + */ + +DAT_RETURN dapls_ib_query_hca ( + IN DAPL_HCA *hca_ptr, + OUT DAT_IA_ATTR *ia_attr, + OUT DAT_EP_ATTR *ep_attr, + OUT DAT_SOCK_ADDR6 *ip_addr) +{ + struct ibv_device_attr dev_attr; + struct ibv_port_attr port_attr; + + if (hca_ptr->ib_hca_handle == NULL) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR," query_hca: BAD handle\n"); + return (DAT_INVALID_HANDLE); + } + + /* local IP address of device, set during ia_open */ + if (ip_addr != NULL) + memcpy(ip_addr, &hca_ptr->hca_address, sizeof(DAT_SOCK_ADDR6)); + + if (ia_attr == NULL && ep_attr == NULL) + return DAT_SUCCESS; + + /* query verbs for this device and port attributes */ + if (ibv_query_device(hca_ptr->ib_hca_handle, &dev_attr) || + ibv_query_port(hca_ptr->ib_hca_handle, + hca_ptr->port_num, &port_attr)) + return(dapl_convert_errno(errno,"ib_query_hca")); + + if (ia_attr != NULL) { + ia_attr->adapter_name[DAT_NAME_MAX_LENGTH - 1] = '\0'; + ia_attr->vendor_name[DAT_NAME_MAX_LENGTH - 1] = '\0'; + ia_attr->ia_address_ptr = (DAT_IA_ADDRESS_PTR)&hca_ptr->hca_address; + + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " query_hca: %s %s %d.%d.%d.%d\n", + ibv_get_device_name(hca_ptr->ib_trans.ib_dev), + ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_family == AF_INET ? 
"AF_INET":"AF_INET6", + ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *)ia_attr->ia_address_ptr)->sin_addr.s_addr >> 24 & 0xff ); + + ia_attr->hardware_version_major = dev_attr.hw_ver; + /* ia_attr->hardware_version_minor = dev_attr.fw_ver; */ + ia_attr->max_eps = dev_attr.max_qp; + ia_attr->max_dto_per_ep = dev_attr.max_qp_wr; + ia_attr->max_rdma_read_per_ep = dev_attr.max_qp_rd_atom; + ia_attr->max_evds = dev_attr.max_cq; + ia_attr->max_evd_qlen = dev_attr.max_cqe; + ia_attr->max_iov_segments_per_dto = dev_attr.max_sge; + ia_attr->max_lmrs = dev_attr.max_mr; + ia_attr->max_lmr_block_size = dev_attr.max_mr_size; + ia_attr->max_rmrs = dev_attr.max_mw; + ia_attr->max_lmr_virtual_address = dev_attr.max_mr_size; + ia_attr->max_rmr_target_address = dev_attr.max_mr_size; + ia_attr->max_pzs = dev_attr.max_pd; + ia_attr->max_mtu_size = port_attr.max_msg_sz; + ia_attr->max_rdma_size = port_attr.max_msg_sz; + ia_attr->num_transport_attr = 0; + ia_attr->transport_attr = NULL; + ia_attr->num_vendor_attr = 0; + ia_attr->vendor_attr = NULL; + + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " query_hca: (%x.%x) ep %d ep_q %d evd %d evd_q %d\n", + ia_attr->hardware_version_major, + ia_attr->hardware_version_minor, + ia_attr->max_eps, ia_attr->max_dto_per_ep, + ia_attr->max_evds, ia_attr->max_evd_qlen ); + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " query_hca: msg %llu rdma %llu iov %d lmr %d rmr %d\n", + ia_attr->max_mtu_size, ia_attr->max_rdma_size, + ia_attr->max_iov_segments_per_dto, ia_attr->max_lmrs, + ia_attr->max_rmrs ); + + } + + if (ep_attr != NULL) { + ep_attr->max_mtu_size = port_attr.max_msg_sz; + ep_attr->max_rdma_size = port_attr.max_msg_sz; + ep_attr->max_recv_dtos = dev_attr.max_qp_wr; + ep_attr->max_request_dtos = dev_attr.max_qp_wr; + ep_attr->max_recv_iov = 
dev_attr.max_sge; + ep_attr->max_request_iov = dev_attr.max_sge; + ep_attr->max_rdma_read_in = dev_attr.max_qp_rd_atom; + ep_attr->max_rdma_read_out= dev_attr.max_qp_rd_atom; + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " query_hca: MAX msg %llu dto %d iov %d rdma i%d,o%d\n", + ep_attr->max_mtu_size, + ep_attr->max_recv_dtos, ep_attr->max_recv_iov, + ep_attr->max_rdma_read_in, ep_attr->max_rdma_read_out); + } + + return DAT_SUCCESS; +} + +/* + * dapls_ib_setup_async_callback + * + * Set up an asynchronous callbacks of various kinds + * + * Input: + * ia_handle IA handle + * handler_type type of handler to set up + * callback_handle handle param for completion callbacks + * callback callback routine pointer + * context argument for callback routine + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN dapls_ib_setup_async_callback ( + IN DAPL_IA *ia_ptr, + IN DAPL_ASYNC_HANDLER_TYPE handler_type, + IN DAPL_EVD *evd_ptr, + IN ib_async_handler_t callback, + IN void *context ) + +{ + ib_hca_transport_t *hca_ptr; + + dapl_dbg_log (DAPL_DBG_TYPE_UTIL, + " setup_async_cb: ia %p type %d handle %p cb %p ctx %p\n", + ia_ptr, handler_type, evd_ptr, callback, context); + + hca_ptr = &ia_ptr->hca_ptr->ib_trans; + switch(handler_type) + { + case DAPL_ASYNC_UNAFILIATED: + hca_ptr->async_unafiliated = + (ib_async_handler_t)callback; + hca_ptr->async_un_ctx = context; + break; + case DAPL_ASYNC_CQ_ERROR: + hca_ptr->async_cq_error = + (ib_async_cq_handler_t)callback; + break; + case DAPL_ASYNC_CQ_COMPLETION: + hca_ptr->async_cq = + (ib_async_dto_handler_t)callback; + break; + case DAPL_ASYNC_QP_ERROR: + hca_ptr->async_qp_error = + (ib_async_qp_handler_t)callback; + break; + default: + break; + } + return DAT_SUCCESS; +} + Index: dapl/openib_scm/dapl_ib_mem.c =================================================================== --- dapl/openib_scm/dapl_ib_mem.c (revision 0) +++ dapl/openib_scm/dapl_ib_mem.c 
(revision 0) @@ -0,0 +1,392 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ + +/********************************************************************** + * + * MODULE: dapl_det_mem.c + * + * PURPOSE: Intel DET APIs: Memory windows, registration, + * and protection domain + * + * $Id: $ + * + **********************************************************************/ + +#include /* for IOCTL's */ +#include /* for socket(2) and related bits and pieces */ +#include /* for socket(2) */ +#include /* for struct ifreq */ +#include /* for ARPHRD_ETHER */ +#include /* for _SC_CLK_TCK */ + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_lmr_util.h" + +/* + * dapls_convert_privileges + * + * Convert LMR privileges to provider + * + * Input: + * DAT_MEM_PRIV_FLAGS + * + * Output: + * none + * + * Returns: + * ibv_access_flags + * + */ +STATIC _INLINE_ int +dapls_convert_privileges ( + IN DAT_MEM_PRIV_FLAGS privileges) +{ + int access = 0; + + /* + * if (DAT_MEM_PRIV_LOCAL_READ_FLAG & privileges) do nothing + */ + if (DAT_MEM_PRIV_LOCAL_WRITE_FLAG & privileges) + access |= IBV_ACCESS_LOCAL_WRITE; + if (DAT_MEM_PRIV_REMOTE_WRITE_FLAG & privileges) + access |= IBV_ACCESS_REMOTE_WRITE; + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + access |= IBV_ACCESS_REMOTE_READ; + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + access |= IBV_ACCESS_REMOTE_READ; + if (DAT_MEM_PRIV_REMOTE_READ_FLAG & privileges) + access |= IBV_ACCESS_REMOTE_READ; + + return access; +} + +/* + * dapl_ib_pd_alloc + * + * Alloc a PD + * + * Input: + * ia_handle IA handle + * pz pointer to PZ struct + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_pd_alloc ( + IN DAPL_IA *ia_ptr, + IN DAPL_PZ *pz ) +{ + /* get a protection domain */ + pz->pd_handle = ibv_alloc_pd(ia_ptr->hca_ptr->ib_hca_handle); + if (!pz->pd_handle) + return(dapl_convert_errno(ENOMEM,"alloc_pd")); + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " pd_alloc: pd_handle=%p\n", + pz->pd_handle ); + + return DAT_SUCCESS; +} + +/* + * 
dapl_ib_pd_free + * + * Free a PD + * + * Input: + * ia_handle IA handle + * PZ_ptr pointer to PZ struct + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_pd_free ( + IN DAPL_PZ *pz ) +{ + if (pz->pd_handle != IB_INVALID_HANDLE) { + if (ibv_dealloc_pd(pz->pd_handle)) + return(dapl_convert_errno(errno,"dealloc_pd")); + pz->pd_handle = IB_INVALID_HANDLE; + } + return DAT_SUCCESS; +} + +/* + * dapl_ib_mr_register + * + * Register a virtual memory region + * + * Input: + * ia_handle IA handle + * lmr pointer to dapl_lmr struct + * virt_addr virtual address of beginning of mem region + * length length of memory region + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mr_register ( + IN DAPL_IA *ia_ptr, + IN DAPL_LMR *lmr, + IN DAT_PVOID virt_addr, + IN DAT_VLEN length, + IN DAT_MEM_PRIV_FLAGS privileges) +{ + ib_pd_handle_t ib_pd_handle; + + ib_pd_handle = ((DAPL_PZ *)lmr->param.pz_handle)->pd_handle; + + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + " mr_register: ia=%p, lmr=%p va=%p ln=%d pv=0x%x\n", + ia_ptr, lmr, virt_addr, length, privileges ); + + /* TODO: shared memory */ + if (lmr->param.mem_type == DAT_MEM_TYPE_SHARED_VIRTUAL) { + dapl_dbg_log( DAPL_DBG_TYPE_ERR, + " mr_register_shared: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); + } + + /* local read is default on IB */ + lmr->mr_handle = + ibv_reg_mr(((DAPL_PZ *)lmr->param.pz_handle)->pd_handle, + virt_addr, + length, + dapls_convert_privileges(privileges)); + + if (!lmr->mr_handle) + return(dapl_convert_errno(ENOMEM,"reg_mr")); + + lmr->param.lmr_context = lmr->mr_handle->lkey; + lmr->param.rmr_context = lmr->mr_handle->rkey; + lmr->param.registered_size = length; + lmr->param.registered_address = (DAT_VADDR)(uintptr_t) virt_addr; + + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + " mr_register: mr=%p h %x pd %p ctx %p ,lkey=0x%x, rkey=0x%x priv=%x\n", + 
lmr->mr_handle, lmr->mr_handle->handle, + lmr->mr_handle->pd, + lmr->mr_handle->context, + lmr->mr_handle->lkey, + lmr->mr_handle->rkey, + length, dapls_convert_privileges(privileges) ); + + return DAT_SUCCESS; +} + +/* + * dapl_ib_mr_deregister + * + * Free a memory region + * + * Input: + * lmr pointer to dapl_lmr struct + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_mr_deregister ( + IN DAPL_LMR *lmr ) +{ + if (lmr->mr_handle != IB_INVALID_HANDLE) { + if (ibv_dereg_mr(lmr->mr_handle)) + return(dapl_convert_errno(errno,"dereg_pd")); + lmr->mr_handle = IB_INVALID_HANDLE; + } + return DAT_SUCCESS; +} + + +/* + * dapl_ib_mr_register_shared + * + * Register a virtual memory region + * + * Input: + * ia_ptr IA handle + * lmr pointer to dapl_lmr struct + * virt_addr virtual address of beginning of mem region + * length length of memory region + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mr_register_shared ( + IN DAPL_IA *ia_ptr, + IN DAPL_LMR *lmr, + IN DAT_MEM_PRIV_FLAGS privileges ) +{ + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mr_register_shared: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * dapls_ib_mw_alloc + * + * Bind a protection domain to a memory window + * + * Input: + * rmr Initialized rmr to hold binding handles + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mw_alloc ( + IN DAPL_RMR *rmr ) +{ + + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_alloc: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * dapls_ib_mw_free + * + * Release bindings of a protection domain to a memory window + * + * Input: + * rmr Initialized rmr to hold binding handles + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_mw_free ( + IN DAPL_RMR *rmr 
) +{ + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_free: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * dapls_ib_mw_bind + * + * Bind a protection domain to a memory window + * + * Input: + * rmr Initialized rmr to hold binding handles + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_PARAMETER; + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mw_bind ( + IN DAPL_RMR *rmr, + IN DAPL_LMR *lmr, + IN DAPL_EP *ep, + IN DAPL_COOKIE *cookie, + IN DAT_VADDR virtual_address, + IN DAT_VLEN length, + IN DAT_MEM_PRIV_FLAGS mem_priv, + IN DAT_BOOLEAN is_signaled) +{ + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_bind: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * dapls_ib_mw_unbind + * + * Unbind a protection domain from a memory window + * + * Input: + * rmr Initialized rmr to hold binding handles + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_PARAMETER; + * DAT_INVALID_STATE; + * DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_mw_unbind ( + IN DAPL_RMR *rmr, + IN DAPL_EP *ep, + IN DAPL_COOKIE *cookie, + IN DAT_BOOLEAN is_signaled ) +{ + dapl_dbg_log(DAPL_DBG_TYPE_ERR," mw_unbind: NOT IMPLEMENTED\n"); + return DAT_ERROR (DAT_NOT_IMPLEMENTED, DAT_NO_SUBTYPE); +} + +/* + * Local variables: + * c-indent-level: 4 + * c-basic-offset: 4 + * tab-width: 8 + * End: + */ + Index: dapl/openib_scm/dapl_ib_cm.c =================================================================== --- dapl/openib_scm/dapl_ib_cm.c (revision 0) +++ dapl/openib_scm/dapl_ib_cm.c (revision 0) @@ -0,0 +1,1074 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. 
+ * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_cm.c + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - connection management + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + **************************************************************************/ + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_evd_util.h" +#include "dapl_cr_util.h" +#include "dapl_name_service.h" +#include "dapl_ib_util.h" + +#include +#include +#include +#include +#include + +/* prototypes */ +static uint16_t dapli_get_lid( struct ibv_device *dev, int port ); + +static DAT_RETURN dapli_socket_connect ( DAPL_EP *ep_ptr, + DAT_IA_ADDRESS_PTR r_addr, + DAT_CONN_QUAL r_qual, + DAT_COUNT p_size, + DAT_PVOID p_data ); + +static DAT_RETURN dapli_socket_listen ( DAPL_IA *ia_ptr, + DAT_CONN_QUAL serviceID, + DAPL_SP *sp_ptr ); + +static DAT_RETURN dapli_socket_accept( ib_cm_srvc_handle_t cm_ptr ); + +static DAT_RETURN dapli_socket_accept_final( DAPL_EP *ep_ptr, + DAPL_CR *cr_ptr, + DAT_COUNT p_size, + DAT_PVOID p_data ); + +/* XXX temporary hack to get lid */ +static uint16_t dapli_get_lid(IN struct ibv_device *dev, IN int port) +{ + char path[128]; + char val[16]; + char name[256]; + + if (sysfs_get_mnt_path(path, sizeof path)) { + fprintf(stderr, "Couldn't find sysfs mount.\n"); + return 0; + } + sprintf(name, "%s/class/infiniband/%s/ports/%d/lid", path, + ibv_get_device_name(dev), port); + + if (sysfs_read_attribute_value(name, val, sizeof val)) { + fprintf(stderr, "Couldn't read LID at %s\n", name); + return 0; + } + return strtol(val, NULL, 0); +} + +/* + * ACTIVE: Create socket, connect, and exchange QP information + */ +static DAT_RETURN +dapli_socket_connect ( DAPL_EP *ep_ptr, + DAT_IA_ADDRESS_PTR r_addr, + DAT_CONN_QUAL r_qual, + DAT_COUNT p_size, + DAT_PVOID p_data ) +{ + ib_cm_handle_t cm_ptr; + DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; + int len, opt = 1; + struct iovec iovec[2]; + short rtu_data = htons(0x0E0F); + + dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: r_qual %d\n", r_qual); + + /* + * Allocate CM and initialize + */ + if ((cm_ptr = dapl_os_alloc(sizeof(*cm_ptr))) == NULL ) { + return DAT_INSUFFICIENT_RESOURCES; + } + + (void) 
dapl_os_memzero( cm_ptr, sizeof( *cm_ptr ) ); + cm_ptr->socket = -1; + + /* create, connect, sockopt, and exchange QP information */ + if ((cm_ptr->socket = socket(AF_INET,SOCK_STREAM,0)) < 0 ) { + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + return DAT_INSUFFICIENT_RESOURCES; + } + + ((struct sockaddr_in*)r_addr)->sin_port = htons(r_qual); + + if ( connect(cm_ptr->socket, r_addr, sizeof(*r_addr)) < 0 ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " connect: %s on r_qual %d\n", + strerror(errno), (unsigned int)r_qual); + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + return DAT_INVALID_ADDRESS; + } + setsockopt(cm_ptr->socket,IPPROTO_TCP,TCP_NODELAY,&opt,sizeof(opt)); + + /* Send QP info, IA address, and private data */ + cm_ptr->dst.qpn = ep_ptr->qp_handle->qp_num; + cm_ptr->dst.port = ia_ptr->hca_ptr->port_num; + cm_ptr->dst.lid = dapli_get_lid( ia_ptr->hca_ptr->ib_trans.ib_dev, + ia_ptr->hca_ptr->port_num ); + cm_ptr->dst.ia_address = ia_ptr->hca_ptr->hca_address; + cm_ptr->dst.p_size = p_size; + iovec[0].iov_base = &cm_ptr->dst; + iovec[0].iov_len = sizeof(ib_qp_cm_t); + if ( p_size ) { + iovec[1].iov_base = p_data; + iovec[1].iov_len = p_size; + } + len = writev( cm_ptr->socket, iovec, (p_size ? 
2:1) ); + if ( len != (p_size + sizeof(ib_qp_cm_t)) ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " connect write: ERR %s, wcnt=%d\n", + strerror(errno), len); + goto bail; + } + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " connect: SRC port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", + cm_ptr->dst.port, cm_ptr->dst.lid, + cm_ptr->dst.qpn, cm_ptr->dst.p_size ); + + /* read DST information into cm_ptr, overwrite SRC info */ + len = readv( cm_ptr->socket, iovec, 1 ); + if ( len != sizeof(ib_qp_cm_t) ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " connect read: ERR %s, rcnt=%d\n", + strerror(errno), len); + goto bail; + } + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " connect: DST port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", + cm_ptr->dst.port, cm_ptr->dst.lid, + cm_ptr->dst.qpn, cm_ptr->dst.p_size ); + + /* validate private data size before reading */ + if ( cm_ptr->dst.p_size > IB_MAX_REP_PDATA_SIZE ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " connect read: psize (%d) wrong\n", + cm_ptr->dst.p_size ); + goto bail; + } + + /* read private data into cm_handle if any present */ + if ( cm_ptr->dst.p_size ) { + iovec[0].iov_base = cm_ptr->p_data; + iovec[0].iov_len = cm_ptr->dst.p_size; + len = readv( cm_ptr->socket, iovec, 1 ); + if ( len != cm_ptr->dst.p_size ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " connect read pdata: ERR %s, rcnt=%d\n", + strerror(errno), len); + goto bail; + } + } + + /* modify QP to RTR and then to RTS with remote info */ + if ( dapls_modify_qp_state( ep_ptr->qp_handle, + IBV_QPS_RTR, &cm_ptr->dst ) != DAT_SUCCESS ) + goto bail; + + if ( dapls_modify_qp_state( ep_ptr->qp_handle, + IBV_QPS_RTS, &cm_ptr->dst ) != DAT_SUCCESS ) + goto bail; + + ep_ptr->qp_state = IB_QP_STATE_RTS; + + /* complete handshake after final QP state change */ + write(cm_ptr->socket, &rtu_data, sizeof(rtu_data) ); + + /* init cm_handle and post the event with private data */ + ep_ptr->cm_handle = cm_ptr; + dapl_dbg_log( DAPL_DBG_TYPE_EP," ACTIVE: connected!\n" ); + dapl_evd_connection_callback( ep_ptr->cm_handle, + 
IB_CME_CONNECTED, + cm_ptr->p_data, + ep_ptr ); + return DAT_SUCCESS; + +bail: + /* close socket, free cm structure and post error event */ + if ( cm_ptr->socket >= 0 ) + close(cm_ptr->socket); + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + dapls_ib_reinit_ep( ep_ptr ); /* reset QP state */ + + dapl_evd_connection_callback( ep_ptr->cm_handle, + IB_CME_LOCAL_FAILURE, + NULL, + ep_ptr ); + return DAT_INTERNAL_ERROR; +} + + +/* + * PASSIVE: Create socket, listen, accept, exchange QP information + */ +static DAT_RETURN +dapli_socket_listen ( DAPL_IA *ia_ptr, + DAT_CONN_QUAL serviceID, + DAPL_SP *sp_ptr ) +{ + struct sockaddr_in addr; + ib_cm_srvc_handle_t cm_ptr = NULL; + int opt = 1; + DAT_RETURN dat_status = DAT_SUCCESS; + + dapl_dbg_log ( DAPL_DBG_TYPE_EP, + " listen(ia_ptr %p ServiceID %d sp_ptr %p)\n", + ia_ptr, serviceID, sp_ptr); + + /* Allocate CM and initialize */ + if ((cm_ptr = dapl_os_alloc(sizeof(*cm_ptr))) == NULL) + return DAT_INSUFFICIENT_RESOURCES; + + (void) dapl_os_memzero( cm_ptr, sizeof( *cm_ptr ) ); + + cm_ptr->socket = cm_ptr->l_socket = -1; + cm_ptr->sp = sp_ptr; + cm_ptr->hca_ptr = ia_ptr->hca_ptr; + + /* bind, listen, set sockopt, accept, exchange data */ + if ((cm_ptr->l_socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + "socket for listen returned %d\n", errno); + dat_status = DAT_INSUFFICIENT_RESOURCES; + goto bail; + } + + setsockopt(cm_ptr->l_socket,SOL_SOCKET,SO_REUSEADDR,&opt,sizeof(opt)); + addr.sin_port = htons(serviceID); + addr.sin_family = AF_INET; + addr.sin_addr.s_addr = INADDR_ANY; + + if (( bind( cm_ptr->l_socket,(struct sockaddr*)&addr, sizeof(addr) ) < 0) || + (listen( cm_ptr->l_socket, 128 ) < 0) ) { + + dapl_dbg_log( DAPL_DBG_TYPE_ERR, + " listen: ERROR %s on conn_qual 0x%x\n", + strerror(errno),serviceID); + + if ( errno == EADDRINUSE ) + dat_status = DAT_CONN_QUAL_IN_USE; + else + dat_status = DAT_CONN_QUAL_UNAVAILABLE; + + goto bail; + } + + /* set cm_handle for this service point, save 
listen socket */ + sp_ptr->cm_srvc_handle = cm_ptr; + + /* add to SP->CR thread list */ + dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&cm_ptr->entry); + dapl_os_lock( &cm_ptr->hca_ptr->ib_trans.lock ); + dapl_llist_add_tail(&cm_ptr->hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cm_ptr->entry, cm_ptr); + dapl_os_unlock(&cm_ptr->hca_ptr->ib_trans.lock); + + dapl_dbg_log( DAPL_DBG_TYPE_CM, + " listen: qual 0x%x cr %p s_fd %d\n", + ntohs(serviceID), cm_ptr, cm_ptr->l_socket ); + + return dat_status; +bail: + dapl_dbg_log( DAPL_DBG_TYPE_ERR, + " listen: ERROR on conn_qual 0x%x\n",serviceID); + if ( cm_ptr->l_socket >= 0 ) + close( cm_ptr->l_socket ); + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + return dat_status; +} + + +/* + * PASSIVE: send local QP information, private data, and wait for + * active side to respond with QP RTS/RTR status + */ +static DAT_RETURN +dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) +{ + ib_cm_handle_t acm_ptr; + void *p_data = NULL; + int len; + DAT_RETURN dat_status = DAT_SUCCESS; + + /* Allocate accept CM and initialize */ + if ((acm_ptr = dapl_os_alloc(sizeof(*acm_ptr))) == NULL) + return DAT_INSUFFICIENT_RESOURCES; + + (void) dapl_os_memzero( acm_ptr, sizeof( *acm_ptr ) ); + + acm_ptr->socket = -1; + acm_ptr->sp = cm_ptr->sp; + acm_ptr->hca_ptr = cm_ptr->hca_ptr; + + len = sizeof(acm_ptr->dst.ia_address); + acm_ptr->socket = accept(cm_ptr->l_socket, + (struct sockaddr*)&acm_ptr->dst.ia_address, + &len ); + + if ( acm_ptr->socket < 0 ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept: ERR %s on FD %d l_cr %p\n", + strerror(errno),cm_ptr->l_socket,cm_ptr); + dat_status = DAT_INTERNAL_ERROR; + goto bail; + } + + /* read in DST QP info, IA address. 
check for private data */ + len = read( acm_ptr->socket, &acm_ptr->dst, sizeof(ib_qp_cm_t) ); + if ( len != sizeof(ib_qp_cm_t) ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept read: ERR %s, rcnt=%d\n", + strerror(errno), len); + dat_status = DAT_INTERNAL_ERROR; + goto bail; + + } + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " accept: DST port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", + acm_ptr->dst.port, acm_ptr->dst.lid, + acm_ptr->dst.qpn, acm_ptr->dst.p_size ); + + /* validate private data size before reading */ + if ( acm_ptr->dst.p_size > IB_MAX_REQ_PDATA_SIZE ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept read: psize (%d) wrong\n", + acm_ptr->dst.p_size ); + dat_status = DAT_INTERNAL_ERROR; + goto bail; + } + + /* read private data into cm_handle if any present */ + if ( acm_ptr->dst.p_size ) { + len = read( acm_ptr->socket, + acm_ptr->p_data, acm_ptr->dst.p_size ); + if ( len != acm_ptr->dst.p_size ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept read pdata: ERR %s, rcnt=%d\n", + strerror(errno), len ); + dat_status = DAT_INTERNAL_ERROR; + goto bail; + } + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " accept: psize=%d read\n", + acm_ptr->dst.p_size); + p_data = acm_ptr->p_data; + } + + /* trigger CR event and return SUCCESS */ + dapls_cr_callback( acm_ptr, + IB_CME_CONNECTION_REQUEST_PENDING, + p_data, + acm_ptr->sp ); + + return DAT_SUCCESS; + +bail: + if ( acm_ptr->socket >=0 ) + close( acm_ptr->socket ); + dapl_os_free( acm_ptr, sizeof( *acm_ptr ) ); + return DAT_INTERNAL_ERROR; +} + + +static DAT_RETURN +dapli_socket_accept_final( DAPL_EP *ep_ptr, + DAPL_CR *cr_ptr, + DAT_COUNT p_size, + DAT_PVOID p_data ) +{ + DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; + ib_cm_handle_t cm_ptr = cr_ptr->ib_cm_handle; + ib_qp_cm_t qp_cm; + struct iovec iovec[2]; + int len; + short rtu_data = 0; + + if (p_size > IB_MAX_REP_PDATA_SIZE) + return DAT_LENGTH_ERROR; + + /* must have a accepted socket */ + if ( cm_ptr->socket < 0 ) + return DAT_INTERNAL_ERROR; + + /* modify QP to RTR and then to RTS 
with remote info already read */ + if ( dapls_modify_qp_state( ep_ptr->qp_handle, + IBV_QPS_RTR, &cm_ptr->dst ) != DAT_SUCCESS ) + goto bail; + + if ( dapls_modify_qp_state( ep_ptr->qp_handle, + IBV_QPS_RTS, &cm_ptr->dst ) != DAT_SUCCESS ) + goto bail; + + ep_ptr->qp_state = IB_QP_STATE_RTS; + + /* Send QP info, IA address, and private data */ + qp_cm.qpn = ep_ptr->qp_handle->qp_num; + qp_cm.port = ia_ptr->hca_ptr->port_num; + qp_cm.lid = dapli_get_lid( ia_ptr->hca_ptr->ib_trans.ib_dev, + ia_ptr->hca_ptr->port_num ); + qp_cm.ia_address = ia_ptr->hca_ptr->hca_address; + qp_cm.p_size = p_size; + iovec[0].iov_base = &qp_cm; + iovec[0].iov_len = sizeof(ib_qp_cm_t); + if (p_size) { + iovec[1].iov_base = p_data; + iovec[1].iov_len = p_size; + } + len = writev( cm_ptr->socket, iovec, (p_size ? 2:1) ); + if (len != (p_size + sizeof(ib_qp_cm_t))) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept_final: ERR %s, wcnt=%d\n", + strerror(errno), len); + goto bail; + } + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " accept_final: SRC port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", + qp_cm.port, qp_cm.lid, qp_cm.qpn, qp_cm.p_size ); + + /* complete handshake after final QP state change */ + len = read(cm_ptr->socket, &rtu_data, sizeof(rtu_data) ); + if ( len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept_final: ERR %s, rcnt=%d rdata=%x\n", + strerror(errno), len, ntohs(rtu_data) ); + goto bail; + } + + /* final data exchange if remote QP state is good to go */ + dapl_dbg_log( DAPL_DBG_TYPE_EP," PASSIVE: connected!\n" ); + dapls_cr_callback ( cm_ptr, IB_CME_CONNECTED, NULL, cm_ptr->sp ); + return DAT_SUCCESS; + +bail: + dapl_dbg_log( DAPL_DBG_TYPE_ERR," accept_final: ERR !QP_RTR_RTS \n"); + if ( cm_ptr->socket >= 0 ) + close( cm_ptr->socket ); + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + dapls_ib_reinit_ep( ep_ptr ); /* reset QP state */ + + return DAT_INTERNAL_ERROR; +} + + +/* + * dapls_ib_connect + * + * Initiate a connection with the passive listener 
on another node + * + * Input: + * ep_handle, + * remote_ia_address, + * remote_conn_qual, + * prd_size size of private data and structure + * prd_prt pointer to private data structure + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN +dapls_ib_connect ( + IN DAT_EP_HANDLE ep_handle, + IN DAT_IA_ADDRESS_PTR remote_ia_address, + IN DAT_CONN_QUAL remote_conn_qual, + IN DAT_COUNT private_data_size, + IN void *private_data ) +{ + DAPL_EP *ep_ptr; + ib_qp_handle_t qp_ptr; + + dapl_dbg_log ( DAPL_DBG_TYPE_EP, + " connect(ep_handle %p ....)\n", ep_handle); + /* + * Sanity check + */ + if ( NULL == ep_handle ) + return DAT_SUCCESS; + + ep_ptr = (DAPL_EP*)ep_handle; + qp_ptr = ep_ptr->qp_handle; + + return (dapli_socket_connect( ep_ptr, remote_ia_address, + remote_conn_qual, + private_data_size, private_data )); +} + +/* + * dapls_ib_disconnect + * + * Disconnect an EP + * + * Input: + * ep_handle, + * disconnect_flags + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * + */ +DAT_RETURN +dapls_ib_disconnect ( + IN DAPL_EP *ep_ptr, + IN DAT_CLOSE_FLAGS close_flags ) +{ + ib_cm_handle_t cm_ptr = ep_ptr->cm_handle; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + "dapls_ib_disconnect(ep_handle %p ....)\n", + ep_ptr); + + if ( cm_ptr->socket >= 0 ) { + close( cm_ptr->socket ); + cm_ptr->socket = -1; + } + + /* reinit to modify QP state */ + dapls_ib_reinit_ep(ep_ptr); + + if ( ep_ptr->cr_ptr ) { + dapls_cr_callback ( ep_ptr->cm_handle, + IB_CME_DISCONNECTED, + NULL, + ((DAPL_CR *)ep_ptr->cr_ptr)->sp_ptr ); + } else { + dapl_evd_connection_callback ( ep_ptr->cm_handle, + IB_CME_DISCONNECTED, + NULL, + ep_ptr ); + ep_ptr->cm_handle = NULL; + dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + } + return DAT_SUCCESS; +} + +/* + * dapls_ib_disconnect_clean + * + * Clean up outstanding connection data. This routine is invoked + * after the final disconnect callback has occurred. 
Only on the + * ACTIVE side of a connection. + * + * Input: + * ep_ptr DAPL_EP + * active Indicates active side of connection + * + * Output: + * none + * + * Returns: + * void + * + */ +void +dapls_ib_disconnect_clean ( + IN DAPL_EP *ep_ptr, + IN DAT_BOOLEAN active, + IN const ib_cm_events_t ib_cm_event ) +{ + return; +} + +/* + * dapl_ib_setup_conn_listener + * + * Have the CM set up a connection listener. + * + * Input: + * ibm_hca_handle HCA handle + * qp_handle QP handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INTERNAL_ERROR + * DAT_CONN_QUAL_UNAVAILABLE + * DAT_CONN_QUAL_IN_USE + * + */ +DAT_RETURN +dapls_ib_setup_conn_listener ( + IN DAPL_IA *ia_ptr, + IN DAT_UINT64 ServiceID, + IN DAPL_SP *sp_ptr ) +{ + return (dapli_socket_listen( ia_ptr, ServiceID, sp_ptr )); +} + + +/* + * dapl_ib_remove_conn_listener + * + * Have the CM remove a connection listener. + * + * Input: + * ia_handle IA handle + * ServiceID IB Channel Service ID + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_STATE + * + */ +DAT_RETURN +dapls_ib_remove_conn_listener ( + IN DAPL_IA *ia_ptr, + IN DAPL_SP *sp_ptr ) +{ + ib_cm_srvc_handle_t cm_ptr = sp_ptr->cm_srvc_handle; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + "dapls_ib_remove_conn_listener(ia_ptr %p sp_ptr %p cm_ptr %p)\n", + ia_ptr, sp_ptr, cm_ptr ); + + /* close listen socket, clear cm_srvc_handle and return */ + if ( cm_ptr != NULL ) { + if ( cm_ptr->l_socket >= 0 ) { + close( cm_ptr->l_socket ); + cm_ptr->l_socket = -1; + } + /* cr_thread will free */ + sp_ptr->cm_srvc_handle = NULL; + } + return DAT_SUCCESS; +} + +/* + * dapls_ib_accept_connection + * + * Perform necessary steps to accept a connection + * + * Input: + * cr_handle + * ep_handle + * private_data_size + * private_data + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INTERNAL_ERROR + * + */ +DAT_RETURN +dapls_ib_accept_connection ( + IN DAT_CR_HANDLE 
cr_handle, + IN DAT_EP_HANDLE ep_handle, + IN DAT_COUNT p_size, + IN const DAT_PVOID p_data ) +{ + DAPL_CR *cr_ptr; + DAPL_EP *ep_ptr; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + "dapls_ib_accept_connection(cr %p ep %p prd %p,%d)\n", + cr_handle, ep_handle, p_data, p_size ); + + cr_ptr = (DAPL_CR *) cr_handle; + ep_ptr = (DAPL_EP *) ep_handle; + + /* allocate and attach a QP if necessary */ + if ( ep_ptr->qp_state == DAPL_QP_STATE_UNATTACHED ) { + DAT_RETURN status; + status = dapls_ib_qp_alloc( ep_ptr->header.owner_ia, + ep_ptr, ep_ptr ); + if ( status != DAT_SUCCESS ) + return status; + } + + return ( dapli_socket_accept_final(ep_ptr, cr_ptr, p_size, p_data) ); +} + + +/* + * dapls_ib_reject_connection + * + * Reject a connection + * + * Input: + * cr_handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INTERNAL_ERROR + * + */ +DAT_RETURN +dapls_ib_reject_connection ( + IN ib_cm_handle_t ib_cm_handle, + IN int reject_reason ) +{ + ib_cm_srvc_handle_t cm_ptr = ib_cm_handle; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + "dapls_ib_reject_connection(cm_handle %p reason %x)\n", + ib_cm_handle, reject_reason ); + + /* just close the socket and return; fd 0 is a valid descriptor */ + if ( cm_ptr->socket >= 0 ) { + close( cm_ptr->socket ); + cm_ptr->socket = -1; + } + return DAT_SUCCESS; +} + +/* + * dapls_ib_cm_remote_addr + * + * Obtain the remote IP address given a connection + * + * Input: + * cr_handle + * + * Output: + * remote_ia_address: where to place the remote address + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_HANDLE + * + */ +DAT_RETURN +dapls_ib_cm_remote_addr ( + IN DAT_HANDLE dat_handle, + OUT DAT_SOCK_ADDR6 *remote_ia_address ) +{ + DAPL_HEADER *header; + ib_cm_handle_t ib_cm_handle; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + "dapls_ib_cm_remote_addr(dat_handle %p, ....)\n", + dat_handle ); + + header = (DAPL_HEADER *)dat_handle; + + if (header->magic == DAPL_MAGIC_EP) + ib_cm_handle = ((DAPL_EP *) dat_handle)->cm_handle; + else if (header->magic == DAPL_MAGIC_CR) + 
ib_cm_handle = ((DAPL_CR *) dat_handle)->ib_cm_handle; + else + return DAT_INVALID_HANDLE; + + dapl_os_memcpy( remote_ia_address, + &ib_cm_handle->dst.ia_address, + sizeof(DAT_SOCK_ADDR6) ); + + return DAT_SUCCESS; +} + +/* + * dapls_ib_private_data_size + * + * Return the size of private data given a connection op type + * + * Input: + * prd_ptr private data pointer + * conn_op connection operation type + * + * If prd_ptr is NULL, this is a query for the max size supported by + * the provider, otherwise it is the actual size of the private data + * contained in prd_ptr. + * + * + * Output: + * None + * + * Returns: + * length of private data + * + */ +int dapls_ib_private_data_size ( + IN DAPL_PRIVATE *prd_ptr, + IN DAPL_PDATA_OP conn_op) +{ + int size; + + switch (conn_op) + { + case DAPL_PDATA_CONN_REQ: + { + size = IB_MAX_REQ_PDATA_SIZE; + break; + } + case DAPL_PDATA_CONN_REP: + { + size = IB_MAX_REP_PDATA_SIZE; + break; + } + case DAPL_PDATA_CONN_REJ: + { + size = IB_MAX_REJ_PDATA_SIZE; + break; + } + case DAPL_PDATA_CONN_DREQ: + { + size = IB_MAX_DREQ_PDATA_SIZE; + break; + } + case DAPL_PDATA_CONN_DREP: + { + size = IB_MAX_DREP_PDATA_SIZE; + break; + } + default: + { + size = 0; + } + + } /* end case */ + + return size; +} + +/* + * Map all socket CM event codes to the DAT equivalent. 
+ */ +#define DAPL_IB_EVENT_CNT 11 + +static struct ib_cm_event_map +{ + const ib_cm_events_t ib_cm_event; + DAT_EVENT_NUMBER dat_event_num; + } ib_cm_event_map[DAPL_IB_EVENT_CNT] = { + /* 00 */ { IB_CME_CONNECTED, + DAT_CONNECTION_EVENT_ESTABLISHED}, + /* 01 */ { IB_CME_DISCONNECTED, + DAT_CONNECTION_EVENT_DISCONNECTED}, + /* 02 */ { IB_CME_DISCONNECTED_ON_LINK_DOWN, + DAT_CONNECTION_EVENT_DISCONNECTED}, + /* 03 */ { IB_CME_CONNECTION_REQUEST_PENDING, + DAT_CONNECTION_REQUEST_EVENT}, + /* 04 */ { IB_CME_CONNECTION_REQUEST_PENDING_PRIVATE_DATA, + DAT_CONNECTION_REQUEST_EVENT}, + /* 05 */ { IB_CME_DESTINATION_REJECT, + DAT_CONNECTION_EVENT_NON_PEER_REJECTED}, + /* 06 */ { IB_CME_DESTINATION_REJECT_PRIVATE_DATA, + DAT_CONNECTION_EVENT_PEER_REJECTED}, + /* 07 */ { IB_CME_DESTINATION_UNREACHABLE, + DAT_CONNECTION_EVENT_UNREACHABLE}, + /* 08 */ { IB_CME_TOO_MANY_CONNECTION_REQUESTS, + DAT_CONNECTION_EVENT_NON_PEER_REJECTED}, + /* 09 */ { IB_CME_LOCAL_FAILURE, + DAT_CONNECTION_EVENT_BROKEN}, + /* 10 */ { IB_CM_LOCAL_FAILURE, + DAT_CONNECTION_EVENT_BROKEN} +}; + +/* + * dapls_ib_get_cm_event + * + * Return a DAT connection event given a provider CM event. + * + * Input: + * dat_event_num DAT event we need an equivelent CM event for + * + * Output: + * none + * + * Returns: + * ib_cm_event of translated DAPL value + */ +DAT_EVENT_NUMBER +dapls_ib_get_dat_event ( + IN const ib_cm_events_t ib_cm_event, + IN DAT_BOOLEAN active) +{ + DAT_EVENT_NUMBER dat_event_num; + int i; + + active = active; + + if (ib_cm_event > IB_CM_LOCAL_FAILURE) + return (DAT_EVENT_NUMBER) 0; + + dat_event_num = 0; + for (i = 0; i < DAPL_IB_EVENT_CNT; i++) { + if (ib_cm_event == ib_cm_event_map[i].ib_cm_event) { + dat_event_num = ib_cm_event_map[i].dat_event_num; + break; + } + } + dapl_dbg_log (DAPL_DBG_TYPE_CALLBACK, + "dapls_ib_get_dat_event: event translate(%s) ib=0x%x dat=0x%x\n", + active ? 
"active" : "passive", ib_cm_event, dat_event_num); + + return dat_event_num; +} + + +/* + * dapls_ib_get_dat_event + * + * Return a DAT connection event given a provider CM event. + * + * Input: + * ib_cm_event event provided to the dapl callback routine + * active switch indicating active or passive connection + * + * Output: + * none + * + * Returns: + * DAT_EVENT_NUMBER of translated provider value + */ +ib_cm_events_t +dapls_ib_get_cm_event ( + IN DAT_EVENT_NUMBER dat_event_num) +{ + ib_cm_events_t ib_cm_event; + int i; + + ib_cm_event = 0; + for (i = 0; i < DAPL_IB_EVENT_CNT; i++) { + if ( dat_event_num == ib_cm_event_map[i].dat_event_num ) { + ib_cm_event = ib_cm_event_map[i].ib_cm_event; + break; + } + } + return ib_cm_event; +} + +/* async CR processing thread to avoid blocking applications */ +void cr_thread(void *arg) +{ + struct dapl_hca *hca_ptr = arg; + ib_cm_srvc_handle_t cr, next_cr; + int max_fd; + fd_set rfd,rfds; + struct timeval to; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread: ENTER hca %p\n",hca_ptr); + + dapl_os_lock( &hca_ptr->ib_trans.lock ); + hca_ptr->ib_trans.cr_state = IB_THREAD_RUN; + while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) { + + FD_ZERO( &rfds ); + max_fd = -1; + + if (!dapl_llist_is_empty(&hca_ptr->ib_trans.list)) + next_cr = dapl_llist_peek_head (&hca_ptr->ib_trans.list); + else + next_cr = NULL; + + while (next_cr) { + cr = next_cr; + dapl_dbg_log (DAPL_DBG_TYPE_CM," thread: cm_ptr %p\n", cr ); + if (cr->l_socket == -1 || + hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) { + + dapl_dbg_log(DAPL_DBG_TYPE_CM," thread: Freeing %p\n", cr); + next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry ); + dapl_llist_remove_entry(&hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry); + dapl_os_free( cr, sizeof(*cr) ); + continue; + } + + FD_SET( cr->l_socket, &rfds ); /* add to select set */ + if ( cr->l_socket > max_fd ) + max_fd = cr->l_socket; + + /* individual select poll to check for work 
*/ + FD_ZERO(&rfd); + FD_SET(cr->l_socket, &rfd); + dapl_os_unlock(&hca_ptr->ib_trans.lock); + to.tv_sec = 0; + to.tv_usec = 0; + if ( select(cr->l_socket + 1,&rfd, NULL, NULL, &to) < 0) { + dapl_dbg_log (DAPL_DBG_TYPE_CM, + " thread: ERR %s on cr %p sk %d\n", + strerror(errno), cr, cr->l_socket); + close(cr->l_socket); + cr->l_socket = -1; + } else if ( FD_ISSET(cr->l_socket, &rfd) && + dapli_socket_accept(cr)) { + close(cr->l_socket); + cr->l_socket = -1; + } + dapl_os_lock( &hca_ptr->ib_trans.lock ); + next_cr = dapl_llist_next_entry(&hca_ptr->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cr->entry ); + } + dapl_os_unlock( &hca_ptr->ib_trans.lock ); + to.tv_sec = 0; + to.tv_usec = 100000; /* wakeup and check destroy */ + select(max_fd + 1, &rfds, NULL, NULL, &to); + dapl_os_lock( &hca_ptr->ib_trans.lock ); + } + dapl_os_unlock( &hca_ptr->ib_trans.lock ); + hca_ptr->ib_trans.cr_state = IB_THREAD_EXIT; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cr_thread(hca %p) exit\n",hca_ptr); +} + +/* + * Local variables: + * c-indent-level: 4 + * c-basic-offset: 4 + * tab-width: 8 + * End: + */ Index: dapl/openib_scm/dapl_ib_qp.c =================================================================== --- dapl/openib_scm/dapl_ib_qp.c (revision 0) +++ dapl/openib_scm/dapl_ib_qp.c (revision 0) @@ -0,0 +1,398 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. 
+ * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/********************************************************************** + * + * MODULE: dapl_det_qp.c + * + * PURPOSE: QP routines for access to DET Verbs + * + * $Id: $ + **********************************************************************/ + +#include "dapl.h" +#include "dapl_adapter_util.h" + +/* + * dapl_ib_qp_alloc + * + * Alloc a QP + * + * Input: + * *ep_ptr pointer to EP INFO + * ib_hca_handle provider HCA handle + * ib_pd_handle provider protection domain handle + * cq_recv provider recv CQ handle + * cq_send provider send CQ handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INTERNAL_ERROR + * + */ +DAT_RETURN +dapls_ib_qp_alloc ( + IN DAPL_IA *ia_ptr, + IN DAPL_EP *ep_ptr, + IN DAPL_EP *ep_ctx_ptr ) +{ + DAT_EP_ATTR *attr; + DAPL_EVD *rcv_evd, *req_evd; + ib_cq_handle_t rcv_cq, req_cq; + ib_pd_handle_t ib_pd_handle; + struct ibv_qp_init_attr qp_create; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " qp_alloc: ia_ptr %p ep_ptr %p ep_ctx_ptr %p\n", + ia_ptr, ep_ptr, ep_ctx_ptr); + + attr = &ep_ptr->param.ep_attr; + ib_pd_handle = ((DAPL_PZ *)ep_ptr->param.pz_handle)->pd_handle; + rcv_evd = (DAPL_EVD *) ep_ptr->param.recv_evd_handle; + req_evd = (DAPL_EVD *) ep_ptr->param.request_evd_handle; + + /* + * DAT allows usage model of EP's with no EVD's but IB does not. + * Create a CQ with zero entries under the covers to support and + * catch any invalid posting. 
+ */ + if ( rcv_evd != DAT_HANDLE_NULL ) + rcv_cq = rcv_evd->ib_cq_handle; + else if (!ia_ptr->hca_ptr->ib_trans.ib_cq_empty) + rcv_cq = ia_ptr->hca_ptr->ib_trans.ib_cq_empty; + else { + struct ibv_comp_channel *channel = + ia_ptr->hca_ptr->ib_trans.ib_cq; +#ifdef CQ_WAIT_OBJECT + if (rcv_evd->cq_wait_obj_handle) + channel = rcv_evd->cq_wait_obj_handle; +#endif + /* Call IB verbs to create CQ */ + rcv_cq = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle, + 0, NULL, channel, 0); + + if (rcv_cq == IB_INVALID_HANDLE) + return(dapl_convert_errno(ENOMEM, "create_cq")); + + ia_ptr->hca_ptr->ib_trans.ib_cq_empty = rcv_cq; + } + if (req_evd != DAT_HANDLE_NULL) + req_cq = req_evd->ib_cq_handle; + else + req_cq = ia_ptr->hca_ptr->ib_trans.ib_cq_empty; + + /* Setup attributes and create qp */ + dapl_os_memzero((void*)&qp_create, sizeof(qp_create)); + qp_create.send_cq = req_cq; + qp_create.recv_cq = rcv_cq; + qp_create.cap.max_send_wr = attr->max_request_dtos; + qp_create.cap.max_recv_wr = attr->max_recv_dtos; + qp_create.cap.max_send_sge = attr->max_request_iov; + qp_create.cap.max_recv_sge = attr->max_recv_iov; + qp_create.cap.max_inline_data = ia_ptr->hca_ptr->ib_trans.max_inline_send; + qp_create.qp_type = IBV_QPT_RC; + qp_create.qp_context = (void*)ep_ptr; + + ep_ptr->qp_handle = ibv_create_qp( ib_pd_handle, &qp_create); + if (!ep_ptr->qp_handle) + return(dapl_convert_errno(ENOMEM, "create_qp")); + + dapl_dbg_log ( DAPL_DBG_TYPE_EP, + " qp_alloc: qpn %p sq %d,%d rq %d,%d\n", + ep_ptr->qp_handle->qp_num, + qp_create.cap.max_send_wr,qp_create.cap.max_send_sge, + qp_create.cap.max_recv_wr,qp_create.cap.max_recv_sge ); + + /* Setup QP attributes for INIT state on the way out */ + if (dapls_modify_qp_state(ep_ptr->qp_handle, + IBV_QPS_INIT, + NULL ) != DAT_SUCCESS ) { + ibv_destroy_qp(ep_ptr->qp_handle); + ep_ptr->qp_handle = IB_INVALID_HANDLE; + return DAT_INTERNAL_ERROR; + } + + ep_ptr->qp_state = IB_QP_STATE_INIT; + return DAT_SUCCESS; +} + +/* + * dapl_ib_qp_free + * + * 
Free a QP + * + * Input: + * ia_handle IA handle + * *ep_ptr pointer to EP INFO + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * dapl_convert_errno + * + */ +DAT_RETURN +dapls_ib_qp_free ( + IN DAPL_IA *ia_ptr, + IN DAPL_EP *ep_ptr ) +{ + dapl_dbg_log (DAPL_DBG_TYPE_EP, " qp_free: ep_ptr %p qp %p\n", + ep_ptr, ep_ptr->qp_handle); + + if (ep_ptr->qp_handle != IB_INVALID_HANDLE) { + /* force error state to flush queue, then destroy */ + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_ERR, NULL); + + if (ibv_destroy_qp(ep_ptr->qp_handle)) + return(dapl_convert_errno(errno,"destroy_qp")); + + ep_ptr->qp_handle = IB_INVALID_HANDLE; + ep_ptr->qp_state = IB_QP_STATE_ERROR; + } + + return DAT_SUCCESS; +} + +/* + * dapl_ib_qp_modify + * + * Set the QP to the parameters specified in an EP_PARAM + * + * The EP_PARAM structure that is provided has been + * sanitized such that only non-zero values are valid. + * + * Input: + * ib_hca_handle HCA handle + * qp_handle QP handle + * ep_attr Sanitized EP Params + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INSUFFICIENT_RESOURCES + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN +dapls_ib_qp_modify ( + IN DAPL_IA *ia_ptr, + IN DAPL_EP *ep_ptr, + IN DAT_EP_ATTR *attr ) +{ + struct ibv_qp_attr qp_attr; + + if (ep_ptr->qp_handle == IB_INVALID_HANDLE) + return DAT_INVALID_PARAMETER; + + /* + * EP state, qp_handle state should be an indication + * of current state but the only way to be sure is with + * a user mode ibv_query_qp call which is NOT available + */ + + /* move to error state if necessary */ + if ((ep_ptr->qp_state == IB_QP_STATE_ERROR) && + (ep_ptr->qp_handle->state != IBV_QPS_ERR)) { + ep_ptr->qp_state = IB_QP_STATE_ERROR; + return (dapls_modify_qp_state(ep_ptr->qp_handle, + IBV_QPS_ERR, NULL)); + } + + /* + * Check if we have the right qp_state to modify attributes + */ + if ((ep_ptr->qp_handle->state != IBV_QPS_RTR ) && + (ep_ptr->qp_handle->state != IBV_QPS_RTS )) + return DAT_INVALID_STATE; 
+ + /* Adjust to current EP attributes */ + dapl_os_memzero((void*)&qp_attr, sizeof(qp_attr)); + qp_attr.cap.max_send_wr = attr->max_request_dtos; + qp_attr.cap.max_recv_wr = attr->max_recv_dtos; + qp_attr.cap.max_send_sge = attr->max_request_iov; + qp_attr.cap.max_recv_sge = attr->max_recv_iov; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + "modify_qp: qp %p sq %d,%d, rq %d,%d\n", + ep_ptr->qp_handle, + qp_attr.cap.max_send_wr, qp_attr.cap.max_send_sge, + qp_attr.cap.max_recv_wr, qp_attr.cap.max_recv_sge ); + + if (ibv_modify_qp(ep_ptr->qp_handle, &qp_attr, IBV_QP_CAP)) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + "modify_qp: modify ep %p qp %p failed\n", + ep_ptr, ep_ptr->qp_handle); + return(dapl_convert_errno(errno,"modify_qp_state")); + } + + return DAT_SUCCESS; +} + +/* + * dapls_ib_reinit_ep + * + * Move the QP to INIT state again. + * + * Input: + * ep_ptr DAPL_EP + * + * Output: + * none + * + * Returns: + * void + * + */ +void +dapls_ib_reinit_ep ( + IN DAPL_EP *ep_ptr) +{ + + if ( ep_ptr->qp_handle != IB_INVALID_HANDLE ) { + /* move to RESET state and then to INIT */ + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_RESET, 0); + dapls_modify_qp_state(ep_ptr->qp_handle, IBV_QPS_INIT, 0); + ep_ptr->qp_state = IB_QP_STATE_INIT; + } + + /* TODO: When IB-CM is implement then handle timewait before + * allowing re-use of this QP + */ +} + +/* + * Generic QP modify for init, reset, error, RTS, RTR + */ +DAT_RETURN +dapls_modify_qp_state ( IN ib_qp_handle_t qp_handle, + IN ib_qp_state_t qp_state, + IN ib_qp_cm_t *qp_cm ) +{ + struct ibv_qp_attr qp_attr; + enum ibv_qp_attr_mask mask = IBV_QP_STATE; + + dapl_os_memzero((void*)&qp_attr, sizeof(qp_attr)); + qp_attr.qp_state = qp_state; + + switch (qp_state) { + /* additional attributes with RTR and RTS */ + case IBV_QPS_RTR: + { + mask |= IBV_QP_AV | + IBV_QP_PATH_MTU | + IBV_QP_DEST_QPN | + IBV_QP_RQ_PSN | + IBV_QP_MAX_DEST_RD_ATOMIC | + IBV_QP_MIN_RNR_TIMER; + qp_attr.qp_state = IBV_QPS_RTR; + qp_attr.path_mtu = IBV_MTU_1024; + 
qp_attr.dest_qp_num = qp_cm->qpn; + qp_attr.rq_psn = 1; + qp_attr.max_dest_rd_atomic = 8; + qp_attr.min_rnr_timer = 12; + qp_attr.ah_attr.is_global = 0; + qp_attr.ah_attr.dlid = qp_cm->lid; + qp_attr.ah_attr.sl = 0; + qp_attr.ah_attr.src_path_bits = 0; + qp_attr.ah_attr.port_num = qp_cm->port; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " modify_qp_rtr: qpn %x lid %x port %x\n", + qp_cm->qpn,qp_cm->lid,qp_cm->port ); + break; + } + case IBV_QPS_RTS: + { + mask |= IBV_QP_TIMEOUT | + IBV_QP_RETRY_CNT | + IBV_QP_RNR_RETRY | + IBV_QP_SQ_PSN | + IBV_QP_MAX_QP_RD_ATOMIC; + qp_attr.qp_state = IBV_QPS_RTS; + qp_attr.timeout = 14; + qp_attr.retry_cnt = 7; + qp_attr.rnr_retry = 7; + qp_attr.sq_psn = 1; + qp_attr.max_rd_atomic = 8; + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " modify_qp_rts: psn %x or %x\n", + qp_attr.sq_psn, qp_attr.max_rd_atomic ); + break; + } + case IBV_QPS_INIT: + { + DAPL_IA *ia_ptr; + DAPL_EP *ep_ptr; + /* need to find way back to port num */ + ep_ptr = (DAPL_EP*)qp_handle->qp_context; + if (ep_ptr) + ia_ptr = ep_ptr->header.owner_ia; + else + break; + + mask |= IBV_QP_PKEY_INDEX | + IBV_QP_PORT | + IBV_QP_ACCESS_FLAGS; + + qp_attr.pkey_index = 0; + qp_attr.port_num = ia_ptr->hca_ptr->port_num; + qp_attr.qp_access_flags = + IBV_ACCESS_LOCAL_WRITE | + IBV_ACCESS_REMOTE_WRITE | + IBV_ACCESS_REMOTE_READ | + IBV_ACCESS_REMOTE_ATOMIC; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, + " modify_qp_init: pi %x port %x acc %x\n", + qp_attr.pkey_index, qp_attr.port_num, + qp_attr.qp_access_flags ); + break; + } + default: + break; + + } + + if (ibv_modify_qp(qp_handle, &qp_attr, mask)) + return(dapl_convert_errno(errno,"modify_qp_state")); + + return DAT_SUCCESS; +} + +/* + * Local variables: + * c-indent-level: 4 + * c-basic-offset: 4 + * tab-width: 8 + * End: + */ Index: dapl/openib_scm/README =================================================================== --- dapl/openib_scm/README (revision 0) +++ dapl/openib_scm/README (revision 0) @@ -0,0 +1,40 @@ + +OpenIB uDAPL provider using 
socket-based CM, in lieu of uCM/uAT, to set up QP/channels. + +to build: + +cd dapl/udapl +make VERBS=openib_scm clean +make VERBS=openib_scm + + +Modifications to common code: + +- added dapl/openib_scm directory + + dapl/udapl/Makefile + +New files for openib_scm provider + + dapl/openib/dapl_ib_cq.c + dapl/openib/dapl_ib_dto.h + dapl/openib/dapl_ib_mem.c + dapl/openib/dapl_ib_qp.c + dapl/openib/dapl_ib_util.c + dapl/openib/dapl_ib_util.h + dapl/openib/dapl_ib_cm.c + +A simple dapl test just for openib_scm testing... + + test/dtest/dtest.c + test/dtest/makefile + + server: dtest -s + client: dtest -h hostname + +known issues: + + no memory windows support in ibverbs, dat_create_rmr fails. + + + Index: dapl/openib_scm/dapl_ib_util.h =================================================================== --- dapl/openib_scm/dapl_ib_util.h (revision 0) +++ dapl/openib_scm/dapl_ib_util.h (revision 0) @@ -0,0 +1,356 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. 
+ */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_util.h + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - definitions, prototypes, + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + **************************************************************************/ + +#ifndef _DAPL_IB_UTIL_H_ +#define _DAPL_IB_UTIL_H_ + +#include "verbs.h" +#include + +#ifndef __cplusplus +#define false 0 +#define true 1 +#endif /*__cplusplus */ + +/* Typedefs to map common DAPL provider types to IB verbs */ +typedef struct ibv_qp *ib_qp_handle_t; +typedef struct ibv_cq *ib_cq_handle_t; +typedef struct ibv_pd *ib_pd_handle_t; +typedef struct ibv_mr *ib_mr_handle_t; +typedef struct ibv_mw *ib_mw_handle_t; +typedef struct ibv_wc ib_work_completion_t; + +/* HCA context type maps to IB verbs */ +typedef struct ibv_context *ib_hca_handle_t; +typedef ib_hca_handle_t dapl_ibal_ca_t; + +/* CM mappings, user CM not complete use SOCKETS */ + +/* destination info to exchange until real IB CM shows up */ +typedef struct _ib_qp_cm +{ + uint32_t qpn; + uint16_t lid; + uint16_t port; + int p_size; + DAT_SOCK_ADDR6 ia_address; + +} ib_qp_cm_t; + +/* + * dapl_llist_entry in dapl.h but dapl.h depends on provider + * typedef's in this file first. 
move dapl_llist_entry out of dapl.h + */ +struct ib_llist_entry +{ + struct dapl_llist_entry *flink; + struct dapl_llist_entry *blink; + void *data; + struct dapl_llist_entry *list_head; +}; + +struct ib_cm_handle +{ + struct ib_llist_entry entry; + int socket; + int l_socket; + struct dapl_hca *hca_ptr; + DAT_HANDLE cr; + DAT_HANDLE sp; + ib_qp_cm_t dst; + unsigned char p_data[256]; +}; + +typedef struct ib_cm_handle *ib_cm_handle_t; +typedef ib_cm_handle_t ib_cm_srvc_handle_t; + +DAT_RETURN getipaddr(char *addr, int addr_len); + +/* CM events */ +typedef enum +{ + IB_CME_CONNECTED, + IB_CME_DISCONNECTED, + IB_CME_DISCONNECTED_ON_LINK_DOWN, + IB_CME_CONNECTION_REQUEST_PENDING, + IB_CME_CONNECTION_REQUEST_PENDING_PRIVATE_DATA, + IB_CME_DESTINATION_REJECT, + IB_CME_DESTINATION_REJECT_PRIVATE_DATA, + IB_CME_DESTINATION_UNREACHABLE, + IB_CME_TOO_MANY_CONNECTION_REQUESTS, + IB_CME_LOCAL_FAILURE, + IB_CM_LOCAL_FAILURE + +} ib_cm_events_t; + +/* prototype for cm thread */ +void cr_thread (void *arg); + +/* Operation and state mappings */ +typedef enum ibv_send_flags ib_send_op_type_t; +typedef struct ibv_sge ib_data_segment_t; +typedef enum ibv_qp_state ib_qp_state_t; +typedef enum ibv_event_type ib_async_event_type; +typedef struct ibv_async_event ib_error_record_t; + +/* CQ notifications */ +typedef enum +{ + IB_NOTIFY_ON_NEXT_COMP, + IB_NOTIFY_ON_SOLIC_COMP + +} ib_notification_type_t; + +/* other mappings */ +typedef int ib_bool_t; +typedef union ibv_gid GID; +typedef char *IB_HCA_NAME; +typedef uint16_t ib_hca_port_t; +typedef uint32_t ib_comp_handle_t; + +#ifdef CQ_WAIT_OBJECT +typedef struct ibv_comp_channel *ib_wait_obj_handle_t; +#endif + +/* Definitions */ +#define IB_INVALID_HANDLE NULL + +/* inline send rdma threshold */ +#define INLINE_SEND_DEFAULT 128 + +/* CM private data areas */ +#define IB_MAX_REQ_PDATA_SIZE 92 +#define IB_MAX_REP_PDATA_SIZE 196 +#define IB_MAX_REJ_PDATA_SIZE 148 +#define IB_MAX_DREQ_PDATA_SIZE 220 +#define IB_MAX_DREP_PDATA_SIZE 224 + 
+/* DTO OPs, ordered for DAPL ENUM definitions ???*/ +#define OP_RDMA_WRITE IBV_WR_RDMA_WRITE +#define OP_RDMA_WRITE_IMM IBV_WR_RDMA_WRITE_WITH_IMM +#define OP_SEND IBV_WR_SEND +#define OP_SEND_IMM IBV_WR_SEND_WITH_IMM +#define OP_RDMA_READ IBV_WR_RDMA_READ +#define OP_COMP_AND_SWAP IBV_WR_ATOMIC_CMP_AND_SWP +#define OP_FETCH_AND_ADD IBV_WR_ATOMIC_FETCH_AND_ADD +#define OP_RECEIVE 7 /* internal op */ +#define OP_RECEIVE_IMM 8 /* internal op */ +#define OP_BIND_MW 9 /* internal op */ +#define OP_INVALID 0xff + +/* Definitions to map QP state */ +#define IB_QP_STATE_RESET IBV_QPS_RESET +#define IB_QP_STATE_INIT IBV_QPS_INIT +#define IB_QP_STATE_RTR IBV_QPS_RTR +#define IB_QP_STATE_RTS IBV_QPS_RTS +#define IB_QP_STATE_SQD IBV_QPS_SQD +#define IB_QP_STATE_SQE IBV_QPS_SQE +#define IB_QP_STATE_ERROR IBV_QPS_ERR + +/* Definitions for ibverbs/mthca return codes, should be defined in verbs.h */ +/* some are errno and some are -n values */ + +/** + * ibv_get_device_name - Return kernel device name + * ibv_get_device_guid - Return device's node GUID + * ibv_open_device - Return ibv_context or NULL + * ibv_close_device - Return 0, (errno?) + * ibv_get_async_event - Return 0, -1 + * ibv_alloc_pd - Return ibv_pd, NULL + * ibv_dealloc_pd - Return 0, errno + * ibv_reg_mr - Return ibv_mr, NULL + * ibv_dereg_mr - Return 0, errno + * ibv_create_cq - Return ibv_cq, NULL + * ibv_destroy_cq - Return 0, errno + * ibv_get_cq_event - Return 0 & ibv_cq/context, int + * ibv_poll_cq - Return n & ibv_wc, 0 ok, -1 empty, -2 error + * ibv_req_notify_cq - Return 0 (void?)
+ * ibv_create_qp - Return ibv_qp, NULL + * ibv_modify_qp - Return 0, errno + * ibv_destroy_qp - Return 0, errno + * ibv_post_send - Return 0, -1 & bad_wr + * ibv_post_recv - Return 0, -1 & bad_wr + */ + +/* async handler for DTO, CQ, QP, and unaffiliated */ +typedef void (*ib_async_dto_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef void (*ib_async_cq_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_cq_handle_t ib_cq_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef void (*ib_async_qp_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_qp_handle_t ib_qp_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef void (*ib_async_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef enum +{ + IB_THREAD_INIT, + IB_THREAD_RUN, + IB_THREAD_CANCEL, + IB_THREAD_EXIT + +} ib_thread_state_t; + +/* ib_hca_transport_t, specific to this implementation */ +typedef struct _ib_hca_transport +{ + struct ibv_device *ib_dev; + ib_cq_handle_t ib_cq_empty; + DAPL_OS_LOCK cq_lock; + int max_inline_send; + ib_thread_state_t cq_state; + DAPL_OS_THREAD cq_thread; + struct ibv_comp_channel *ib_cq; + int cr_state; + DAPL_OS_THREAD thread; + DAPL_OS_LOCK lock; + struct dapl_llist_entry *list; + ib_async_handler_t async_unafiliated; + void *async_un_ctx; + ib_async_cq_handler_t async_cq_error; + ib_async_dto_handler_t async_cq; + ib_async_qp_handler_t async_qp_error; + +} ib_hca_transport_t; + +/* provider specific fields for shared memory support */ +typedef uint32_t ib_shm_transport_t; + +/* prototypes */ +int32_t dapls_ib_init (void); +int32_t dapls_ib_release (void); +void cq_thread (void *arg); +void cr_thread(void *arg); +int dapli_cq_thread_init(struct dapl_hca *hca_ptr); +void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr); + + +DAT_RETURN +dapls_modify_qp_state ( IN ib_qp_handle_t qp_handle, + IN ib_qp_state_t
qp_state, + IN ib_qp_cm_t *qp_cm ); + +/* inline functions */ +STATIC _INLINE_ IB_HCA_NAME dapl_ib_convert_name (IN char *name) +{ + /* use ascii; name of local device */ + return dapl_os_strdup(name); +} + +STATIC _INLINE_ void dapl_ib_release_name (IN IB_HCA_NAME name) +{ + return; +} + +/* + * Convert errno to DAT_RETURN values + */ +STATIC _INLINE_ DAT_RETURN +dapl_convert_errno( IN int err, IN const char *str ) +{ + if (!err) return DAT_SUCCESS; + +#if DAPL_DBG + if ((err != EAGAIN) && (err != ETIME) && (err != ETIMEDOUT)) + dapl_dbg_log (DAPL_DBG_TYPE_ERR," %s %s\n", str, strerror(err)); +#endif + + switch( err ) + { + case EOVERFLOW : return DAT_LENGTH_ERROR; + case EACCES : return DAT_PRIVILEGES_VIOLATION; + case ENXIO : + case ERANGE : + case EPERM : return DAT_PROTECTION_VIOLATION; + case EINVAL : + case EBADF : + case ENOENT : + case ENOTSOCK : return DAT_INVALID_HANDLE; + case EISCONN : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_CONNECTED; + case ECONNREFUSED : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_NOTREADY; + case ETIME : + case ETIMEDOUT : return DAT_TIMEOUT_EXPIRED; + case ENETUNREACH: return DAT_INVALID_ADDRESS | DAT_INVALID_ADDRESS_UNREACHABLE; + case EADDRINUSE : return DAT_CONN_QUAL_IN_USE; + case EALREADY : return DAT_INVALID_STATE | DAT_INVALID_STATE_EP_ACTCONNPENDING; + case ENOSPC : + case ENOMEM : + case E2BIG : + case EDQUOT : return DAT_INSUFFICIENT_RESOURCES; + case EAGAIN : return DAT_QUEUE_EMPTY; + case EINTR : return DAT_INTERRUPTED_CALL; + case EAFNOSUPPORT : return DAT_INVALID_ADDRESS | DAT_INVALID_ADDRESS_MALFORMED; + case EFAULT : + default : return DAT_INTERNAL_ERROR; + } + } + +/* + * Definitions required only for DAT 1.1 builds + */ +#define IB_ACCESS_LOCAL_READ IBV_ACCESS_LOCAL_WRITE +#define IB_ACCESS_LOCAL_WRITE IBV_ACCESS_LOCAL_WRITE +#define IB_ACCESS_REMOTE_READ IBV_ACCESS_REMOTE_READ +#define IB_ACCESS_REMOTE_WRITE IBV_ACCESS_REMOTE_WRITE +#define IB_ACCESS_MW_BIND IBV_ACCESS_LOCAL_WRITE +#define 
IB_ACCESS_ATOMIC + +#endif /* _DAPL_IB_UTIL_H_ */ Index: dapl/openib_scm/dapl_ib_cq.c =================================================================== --- dapl/openib_scm/dapl_ib_cq.c (revision 0) +++ dapl/openib_scm/dapl_ib_cq.c (revision 0) @@ -0,0 +1,619 @@ +/* + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + * available from the Open Source Initiative, see + * http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + * copy of which is available from the Open Source Initiative, see + * http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ + +/*************************************************************************** + * + * Module: uDAPL + * + * Filename: dapl_ib_cq.c + * + * Author: Arlin Davis + * + * Created: 3/10/2005 + * + * Description: + * + * The uDAPL openib provider - completion queue + * + **************************************************************************** + * Source Control System Information + * + * $Id: $ + * + * Copyright (c) 2005 Intel Corporation. All rights reserved. 
+ * + **************************************************************************/ + +#include "dapl.h" +#include "dapl_adapter_util.h" +#include "dapl_lmr_util.h" +#include "dapl_evd_util.h" +#include "dapl_ring_buffer_util.h" +#include +#include + +int dapli_cq_thread_init(struct dapl_hca *hca_ptr) +{ + DAT_RETURN dat_status; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%p)\n", hca_ptr); + + /* create thread to process inbound connect request */ + hca_ptr->ib_trans.cq_state = IB_THREAD_INIT; + dat_status = dapl_os_thread_create(cq_thread, (void*)hca_ptr, &hca_ptr->ib_trans.cq_thread); + if (dat_status != DAT_SUCCESS) + { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " cq_thread_init: failed to create thread\n"); + return 1; + } + + /* wait for thread to start */ + while (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 20000000; /* 20 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " cq_thread_init: waiting for cq_thread\n"); + nanosleep (&sleep, &remain); + } + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_init(%d) exit\n",getpid()); + return 0; +} + +void dapli_cq_thread_destroy(struct dapl_hca *hca_ptr) +{ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%p)\n", hca_ptr); + + if (hca_ptr->ib_trans.cq_state != IB_THREAD_RUN) + return; + + /* destroy cr_thread and lock */ + hca_ptr->ib_trans.cq_state = IB_THREAD_CANCEL; + pthread_kill(hca_ptr->ib_trans.cq_thread, SIGUSR1); + dapl_dbg_log(DAPL_DBG_TYPE_CM," cq_thread_destroy(%p) cancel\n",hca_ptr); + while (hca_ptr->ib_trans.cq_state != IB_THREAD_EXIT) { + struct timespec sleep, remain; + sleep.tv_sec = 0; + sleep.tv_nsec = 200000000; /* 200 ms */ + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " cq_thread_destroy: waiting for cq_thread\n"); + nanosleep (&sleep, &remain); + } + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread_destroy(%d) exit\n",getpid()); +} + +/* catch the signal */ +static void ib_cq_handler(int signum) +{ + return; +} + +void cq_thread( void 
*arg ) +{ + struct dapl_hca *hca_ptr = arg; + struct dapl_evd *evd_ptr; + struct ibv_cq *ibv_cq = NULL; + sigset_t sigset; + + sigemptyset(&sigset); + sigaddset(&sigset,SIGUSR1); + pthread_sigmask(SIG_UNBLOCK, &sigset, NULL); + signal(SIGUSR1, ib_cq_handler); + + hca_ptr->ib_trans.cq_state = IB_THREAD_RUN; + + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: ENTER hca %p\n",hca_ptr); + + /* wait on DTO event, or signal to abort */ + while (hca_ptr->ib_trans.cq_state == IB_THREAD_RUN) { + struct pollfd cq_fd = { + .fd = hca_ptr->ib_trans.ib_cq->fd, + .events = POLLIN, + .revents = 0 + }; + if ((poll(&cq_fd, 1, -1) == 1) && + (!ibv_get_cq_event(hca_ptr->ib_trans.ib_cq, + &ibv_cq, (void*)&evd_ptr))) { + + if (DAPL_BAD_HANDLE(evd_ptr, DAPL_MAGIC_EVD)) { + ibv_ack_cq_events(ibv_cq, 1); + return; + } + + /* process DTO event via callback */ + dapl_evd_dto_callback ( hca_ptr->ib_hca_handle, + evd_ptr->ib_cq_handle, + (void*)evd_ptr ); + + ibv_ack_cq_events(ibv_cq, 1); + } + } + hca_ptr->ib_trans.cq_state = IB_THREAD_EXIT; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL," cq_thread: EXIT: hca %p \n", hca_ptr); +} + + +/* + * Map all verbs DTO completion codes to the DAT equivalent.
+ * + * Not returned by verbs: DAT_DTO_ERR_PARTIAL_PACKET + */ +static struct ib_status_map +{ + int ib_status; + DAT_DTO_COMPLETION_STATUS dat_status; +} ib_status_map[] = { + /* 00 */ { IBV_WC_SUCCESS, DAT_DTO_SUCCESS}, + /* 01 */ { IBV_WC_LOC_LEN_ERR, DAT_DTO_ERR_LOCAL_LENGTH}, + /* 02 */ { IBV_WC_LOC_QP_OP_ERR, DAT_DTO_ERR_LOCAL_EP}, + /* 03 */ { IBV_WC_LOC_EEC_OP_ERR, DAT_DTO_ERR_TRANSPORT}, + /* 04 */ { IBV_WC_LOC_PROT_ERR, DAT_DTO_ERR_LOCAL_PROTECTION}, + /* 05 */ { IBV_WC_WR_FLUSH_ERR, DAT_DTO_ERR_FLUSHED}, + /* 06 */ { IBV_WC_MW_BIND_ERR, DAT_RMR_OPERATION_FAILED}, + /* 07 */ { IBV_WC_BAD_RESP_ERR, DAT_DTO_ERR_BAD_RESPONSE}, + /* 08 */ { IBV_WC_LOC_ACCESS_ERR, DAT_DTO_ERR_LOCAL_PROTECTION}, + /* 09 */ { IBV_WC_REM_INV_REQ_ERR, DAT_DTO_ERR_REMOTE_RESPONDER}, + /* 10 */ { IBV_WC_REM_ACCESS_ERR, DAT_DTO_ERR_REMOTE_ACCESS}, + /* 11 */ { IBV_WC_REM_OP_ERR, DAT_DTO_ERR_REMOTE_RESPONDER}, + /* 12 */ { IBV_WC_RETRY_EXC_ERR, DAT_DTO_ERR_TRANSPORT}, + /* 13 */ { IBV_WC_RNR_RETRY_EXC_ERR, DAT_DTO_ERR_RECEIVER_NOT_READY}, + /* 14 */ { IBV_WC_LOC_RDD_VIOL_ERR, DAT_DTO_ERR_LOCAL_PROTECTION}, + /* 15 */ { IBV_WC_REM_INV_RD_REQ_ERR, DAT_DTO_ERR_REMOTE_RESPONDER}, + /* 16 */ { IBV_WC_REM_ABORT_ERR, DAT_DTO_ERR_REMOTE_RESPONDER}, + /* 17 */ { IBV_WC_INV_EECN_ERR, DAT_DTO_ERR_TRANSPORT}, + /* 18 */ { IBV_WC_INV_EEC_STATE_ERR, DAT_DTO_ERR_TRANSPORT}, + /* 19 */ { IBV_WC_FATAL_ERR, DAT_DTO_ERR_TRANSPORT}, + /* 20 */ { IBV_WC_RESP_TIMEOUT_ERR, DAT_DTO_ERR_RECEIVER_NOT_READY}, + /* 21 */ { IBV_WC_GENERAL_ERR, DAT_DTO_ERR_TRANSPORT}, +}; + +/* + * dapls_ib_get_dto_status + * + * Return the DAT status of a DTO operation + * + * Input: + * cqe_ptr pointer to completion queue entry + * + * Output: + * none + * + * Returns: + * Value from ib_status_map table above + */ + +DAT_DTO_COMPLETION_STATUS +dapls_ib_get_dto_status ( + IN ib_work_completion_t *cqe_ptr) +{ + uint32_t ib_status; + int i; + + ib_status = DAPL_GET_CQE_STATUS (cqe_ptr); + + /* + * Due to the implementation of verbs 
completion code, we need to + * search the table for the correct value rather than assuming + * linear distribution. + */ + for (i = 0; i <= IBV_WC_GENERAL_ERR; i++) { + if (ib_status == ib_status_map[i].ib_status) { + if ( ib_status != IBV_WC_SUCCESS ) { + dapl_dbg_log (DAPL_DBG_TYPE_DTO_COMP_ERR, + " DTO completion ERROR: %d: op %#x\n", + ib_status, DAPL_GET_CQE_OPTYPE (cqe_ptr)); + } + return ib_status_map[i].dat_status; + } + } + + dapl_dbg_log (DAPL_DBG_TYPE_DTO_COMP_ERR, + " DTO completion ERROR: %d: op %#x\n", + ib_status, + DAPL_GET_CQE_OPTYPE (cqe_ptr)); + + return DAT_DTO_FAILURE; +} + +DAT_RETURN dapls_ib_get_async_event ( + IN ib_error_record_t *err_record, + OUT DAT_EVENT_NUMBER *async_event) +{ + DAT_RETURN dat_status = DAT_SUCCESS; + int err_code = err_record->event_type; + + switch (err_code) { + /* OVERFLOW error */ + case IBV_EVENT_CQ_ERR: + *async_event = DAT_ASYNC_ERROR_EVD_OVERFLOW; + break; + /* INTERNAL errors */ + case IBV_EVENT_DEVICE_FATAL: + *async_event = DAT_ASYNC_ERROR_PROVIDER_INTERNAL_ERROR; + break; + /* CATASTROPHIC errors */ + case IBV_EVENT_PORT_ERR: + *async_event = DAT_ASYNC_ERROR_IA_CATASTROPHIC; + break; + /* BROKEN QP error */ + case IBV_EVENT_SQ_DRAINED: + case IBV_EVENT_QP_FATAL: + case IBV_EVENT_QP_REQ_ERR: + case IBV_EVENT_QP_ACCESS_ERR: + *async_event = DAT_ASYNC_ERROR_EP_BROKEN; + break; + + /* connection completion */ + case IBV_EVENT_COMM_EST: + *async_event = DAT_CONNECTION_EVENT_ESTABLISHED; + break; + + /* TODO: process HW state changes */ + case IBV_EVENT_PATH_MIG: + case IBV_EVENT_PATH_MIG_ERR: + case IBV_EVENT_PORT_ACTIVE: + case IBV_EVENT_LID_CHANGE: + case IBV_EVENT_PKEY_CHANGE: + case IBV_EVENT_SM_CHANGE: + default: + dat_status = DAT_ERROR (DAT_NOT_IMPLEMENTED, 0); + } + return dat_status; +} + +/* + * dapl_ib_cq_alloc + * + * Alloc a CQ + * + * Input: + * ia_handle IA handle + * evd_ptr pointer to EVD struct + * cqlen minimum QLen + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * 
DAT_INSUFFICIENT_RESOURCES + * + */ +DAT_RETURN +dapls_ib_cq_alloc ( + IN DAPL_IA *ia_ptr, + IN DAPL_EVD *evd_ptr, + IN DAT_COUNT *cqlen ) +{ + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + "dapls_ib_cq_alloc: evd %p cqlen=%d \n", evd_ptr, *cqlen ); + + struct ibv_comp_channel *channel = ia_ptr->hca_ptr->ib_trans.ib_cq; + +#ifdef CQ_WAIT_OBJECT + if (evd_ptr->cq_wait_obj_handle) + channel = evd_ptr->cq_wait_obj_handle; +#endif + + /* Call IB verbs to create CQ */ + evd_ptr->ib_cq_handle = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle, + *cqlen, + evd_ptr, + channel, 0); + + if (evd_ptr->ib_cq_handle == IB_INVALID_HANDLE) + return DAT_INSUFFICIENT_RESOURCES; + + /* arm cq for events */ + dapls_set_cq_notify(ia_ptr, evd_ptr); + + /* update with returned cq entry size */ + *cqlen = evd_ptr->ib_cq_handle->cqe; + + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + "dapls_ib_cq_alloc: new_cq %p cqlen=%d \n", + evd_ptr->ib_cq_handle, *cqlen ); + + return DAT_SUCCESS; +} + + +/* + * dapl_ib_cq_resize + * + * Resize a CQ + * + * Input: + * ia_handle IA handle + * evd_ptr pointer to EVD struct + * cqlen minimum QLen + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN +dapls_ib_cq_resize ( + IN DAPL_IA *ia_ptr, + IN DAPL_EVD *evd_ptr, + IN DAT_COUNT *cqlen ) +{ + ib_cq_handle_t new_cq; + struct ibv_comp_channel *channel = ia_ptr->hca_ptr->ib_trans.ib_cq; + + /* IB verbs does not support resize. Try to re-create CQ + * with new size. Can only be done if QP is not attached. + * destroy EBUSY == QP still attached.
+ */ + +#ifdef CQ_WAIT_OBJECT + if (evd_ptr->cq_wait_obj_handle) + channel = evd_ptr->cq_wait_obj_handle; +#endif + + /* Call IB verbs to create CQ */ + new_cq = ibv_create_cq(ia_ptr->hca_ptr->ib_hca_handle, *cqlen, + evd_ptr, channel, 0); + + if (new_cq == IB_INVALID_HANDLE) + return DAT_INSUFFICIENT_RESOURCES; + + /* destroy the original and replace if successful */ + if (ibv_destroy_cq(evd_ptr->ib_cq_handle)) { + ibv_destroy_cq(new_cq); + return(dapl_convert_errno(errno,"resize_cq")); + } + + /* update EVD with new cq handle and size */ + evd_ptr->ib_cq_handle = new_cq; + *cqlen = new_cq->cqe; + + /* arm cq for events */ + dapls_set_cq_notify (ia_ptr, evd_ptr); + + return DAT_SUCCESS; +} + +/* + * dapls_ib_cq_free + * + * destroy a CQ + * + * Input: + * ia_handle IA handle + * evd_ptr pointer to EVD struct + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_PARAMETER + * + */ +DAT_RETURN dapls_ib_cq_free ( + IN DAPL_IA *ia_ptr, + IN DAPL_EVD *evd_ptr) +{ + if ( evd_ptr->ib_cq_handle != IB_INVALID_HANDLE ) { + /* copy all entries on CQ to EVD before destroying */ + dapls_evd_copy_cq(evd_ptr); + if (ibv_destroy_cq(evd_ptr->ib_cq_handle)) + return(dapl_convert_errno(errno,"destroy_cq")); + evd_ptr->ib_cq_handle = IB_INVALID_HANDLE; + } + return DAT_SUCCESS; +} + +/* + * dapls_set_cq_notify + * + * Set the CQ notification for next + * + * Input: + * hca_handl hca handle + * DAPL_EVD evd handle + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * dapl_convert_errno + */ +DAT_RETURN dapls_set_cq_notify ( + IN DAPL_IA *ia_ptr, + IN DAPL_EVD *evd_ptr) +{ + if (ibv_req_notify_cq( evd_ptr->ib_cq_handle, 0 )) + return(dapl_convert_errno(errno,"notify_cq")); + else + return DAT_SUCCESS; +} + +/* + * dapls_ib_completion_notify + * + * Set the CQ notification type + * + * Input: + * hca_handl hca handle + * evd_ptr evd handle + * type notification type + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * dapl_convert_errno + */ 
+DAT_RETURN dapls_ib_completion_notify ( + IN ib_hca_handle_t hca_handle, + IN DAPL_EVD *evd_ptr, + IN ib_notification_type_t type) +{ + if (ibv_req_notify_cq( evd_ptr->ib_cq_handle, type )) + return(dapl_convert_errno(errno,"notify_cq_type")); + else + return DAT_SUCCESS; +} + +/* + * dapls_ib_completion_poll + * + * CQ poll for completions + * + * Input: + * hca_handl hca handle + * evd_ptr evd handle + * wc_ptr work completion + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_QUEUE_EMPTY + * + */ +DAT_RETURN dapls_ib_completion_poll ( + IN DAPL_HCA *hca_ptr, + IN DAPL_EVD *evd_ptr, + IN ib_work_completion_t *wc_ptr) +{ + int ret; + + ret = ibv_poll_cq(evd_ptr->ib_cq_handle, 1, wc_ptr); + if (ret == 1) + return DAT_SUCCESS; + + return DAT_QUEUE_EMPTY; +} + +#ifdef CQ_WAIT_OBJECT + +/* NEW common wait objects for providers with direct CQ wait objects */ +DAT_RETURN +dapls_ib_wait_object_create ( + IN DAPL_EVD *evd_ptr, + IN ib_wait_obj_handle_t *p_cq_wait_obj_handle ) +{ + dapl_dbg_log ( DAPL_DBG_TYPE_CM, + " cq_object_create: (%p,%p)\n", + evd_ptr, p_cq_wait_obj_handle ); + + /* set cq_wait object to evd_ptr */ + *p_cq_wait_obj_handle = + ibv_create_comp_channel(evd_ptr->header.owner_ia->hca_ptr->ib_hca_handle); + + return DAT_SUCCESS; +} + +DAT_RETURN +dapls_ib_wait_object_destroy ( + IN ib_wait_obj_handle_t p_cq_wait_obj_handle) +{ + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + " cq_object_destroy: wait_obj=%p\n", + p_cq_wait_obj_handle ); + + ibv_destroy_comp_channel(p_cq_wait_obj_handle); + + return DAT_SUCCESS; +} + +DAT_RETURN +dapls_ib_wait_object_wakeup ( + IN ib_wait_obj_handle_t p_cq_wait_obj_handle) +{ + dapl_dbg_log ( DAPL_DBG_TYPE_UTIL, + " cq_object_wakeup: wait_obj=%p\n", + p_cq_wait_obj_handle ); + + /* no wake up mechanism */ + return DAT_SUCCESS; +} + +DAT_RETURN +dapls_ib_wait_object_wait ( + IN ib_wait_obj_handle_t p_cq_wait_obj_handle, + IN u_int32_t timeout) +{ + struct dapl_evd *evd_ptr; + struct ibv_cq *ibv_cq = NULL; + void 
*ibv_ctx = NULL; + int status = 0; + int timeout_ms = -1; + struct pollfd cq_fd = { + .fd = p_cq_wait_obj_handle->fd, + .events = POLLIN, + .revents = 0 + }; + + dapl_dbg_log ( DAPL_DBG_TYPE_CM, + " cq_object_wait: CQ channel %p time %d\n", + p_cq_wait_obj_handle, timeout ); + + /* uDAPL timeout values in usecs */ + if (timeout != DAT_TIMEOUT_INFINITE) + timeout_ms = timeout/1000; + + status = poll(&cq_fd, 1, timeout_ms); + + /* returned event */ + if (status > 0) { + if (!ibv_get_cq_event(p_cq_wait_obj_handle, + &ibv_cq, (void*)&evd_ptr)) { + ibv_ack_cq_events(ibv_cq, 1); + } + status = 0; + + /* timeout */ + } else if (status == 0) + status = ETIMEDOUT; + + dapl_dbg_log (DAPL_DBG_TYPE_CM, + " cq_object_wait: RET evd %p ibv_cq %p ibv_ctx %p %s\n", + evd_ptr, ibv_cq,ibv_ctx,strerror(errno)); + + return(dapl_convert_errno(status,"cq_wait_object_wait")); + +} +#endif + +/* + * Local variables: + * c-indent-level: 4 + * c-basic-offset: 4 + * tab-width: 8 + * End: + */ + From mshefty at ichips.intel.com Wed Oct 26 13:11:34 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 26 Oct 2005 13:11:34 -0700 Subject: [openib-general] Re: device add/remove in userspace In-Reply-To: <52ek6a7ovp.fsf@cisco.com> References: <435D20AF.7040609@ichips.intel.com> <52ek6a7ovp.fsf@cisco.com> Message-ID: <435FE2F6.4050508@ichips.intel.com> Roland Dreier wrote: > Sean> Is there a way for a userspace application to know if a > Sean> device has been added or removed? > > We don't really handle this right now. It could probably be made to > work on top of hotplug/udev/hal/something but it seems tricky to me. Any idea on how we currently want to handle this? Can a device remove occur if a usermode app is using the device? The kernel CMA will block until all connection identifiers associated with the removed device are freed. I export the device removal up to userspace through the uCMA, but that leaves the kernel blocked until the userspace application cleans up. 
Doing something more intelligent requires a lot more effort. - Sean From mst at mellanox.co.il Wed Oct 26 13:44:19 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 26 Oct 2005 22:44:19 +0200 Subject: [openib-general] Re: RFC userspace CMA In-Reply-To: <435FA60C.4070708@ichips.intel.com> References: <435FA60C.4070708@ichips.intel.com> Message-ID: <20051026204419.GA7541@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: RFC userspace CMA > > Michael S. Tsirkin wrote: > > Quoting Sean Hefty : > > > >>I considered, and continue to consider implementing on top of ucm. The > >>drawbacks are: it requires more kernel modules: one for the CM, one for SA > >>query, and one for address translation. > > > > > > Can't address translation be done with the existing kernel/user interface? > > There's no kernel/user interface for ib_addr, which is what the kernel CMA uses. > To use the ib_at kernel/user interface, ib_at would need to be fixed to avoid > crashing the system. ib_addr is based off of the ib_at/sdp implementations, but > limited to ARP translation only. > > It would also require userspace components for other RDMA CMs, such as iWarp. > > - Sean > But I mean, we can already send ARP packets from userspace, can't we? -- MST From jlentini at netapp.com Wed Oct 26 14:10:49 2005 From: jlentini at netapp.com (James Lentini) Date: Wed, 26 Oct 2005 17:10:49 -0400 (EDT) Subject: [openib-general] Re: [PATCH #2] new uDAPL openIB provider using socket CM, corrected license headers In-Reply-To: References: Message-ID: > James, > > This version includes updated license headers per your request. > > -arlin Committed in revision 3882. From mst at mellanox.co.il Wed Oct 26 14:28:12 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 26 Oct 2005 23:28:12 +0200 Subject: [openib-general] Re: device add/remove in userspace In-Reply-To: <435FE2F6.4050508@ichips.intel.com> References: <435FE2F6.4050508@ichips.intel.com> Message-ID: <20051026212812.GA7760@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: device add/remove in userspace > > Roland Dreier wrote: > > Sean> Is there a way for a userspace application to know if a > > Sean> device has been added or removed? > > > > We don't really handle this right now. It could probably be made to > > work on top of hotplug/udev/hal/something but it seems tricky to me. > > Any idea on how we currently want to handle this? Can a device remove occur if > a usermode app is using the device? Well, an easy way out would be to pass an event up to the application, and block device removal until the application closes all resources. This doesn't seem to be hard to do, given the existing mechanisms for passing asynchronous events up to userspace. Ideally, of course, we would free the device without waiting for userspace, but that may be trickier to accomplish: we would probably need to modify all QPs to error state, then close them and close all CQs. Further, we would need to unmap the device memory mapped into the application's memory. > The kernel CMA will block until all connection identifiers associated with the > removed device are freed. I export the device removal up to userspace through > the uCMA, but that leaves the kernel blocked until the userspace application > cleans up. Doing something more intelligent requires a lot more effort. > > - Sean I think this is reasonable, short-term.
-- MST From mshefty at ichips.intel.com Wed Oct 26 14:53:09 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 26 Oct 2005 14:53:09 -0700 Subject: [openib-general] Re: RFC userspace CMA In-Reply-To: <20051026204419.GA7541@mellanox.co.il> References: <435FA60C.4070708@ichips.intel.com> <20051026204419.GA7541@mellanox.co.il> Message-ID: <435FFAC5.3040205@ichips.intel.com> Michael S. Tsirkin wrote: > But I mean, we can already send ARP packets from userspace, cant we? That I don't know. Are there APIs to send requests and view responses in userspace? - Sean From caitlinb at broadcom.com Wed Oct 26 15:57:55 2005 From: caitlinb at broadcom.com (Caitlin Bestler) Date: Wed, 26 Oct 2005 15:57:55 -0700 Subject: [openib-general] Re: RFC userspace CMA Message-ID: <54AD0F12E08D1541B826BE97C98F99F1020B3A@NT-SJCA-0751.brcm.ad.broadcom.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of > Michael S. Tsirkin > Sent: Wednesday, October 26, 2005 1:44 PM > To: Sean Hefty > Cc: openib > Subject: [openib-general] Re: RFC userspace CMA > > > But I mean, we can already send ARP packets from userspace, cant we? > No, non-privileged users are not allowed to modify the ARP table, open /dev/arp or to send raw Ethernet. You can use ARP to query from non-privileged userspace. But nothing beyond that. If you check the man page you'll also note that the ARP daemon specifically listens to ensure that nobody else is impersonating it. That's exactly the type of safety check that is blocked if IP addresses are passed via private data where the fact that the data is an IP address is not defined in the wire protocol. 
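As an illustration of the read-only query path mentioned above, here is a minimal sketch, assuming a Linux host and the `SIOCGARP` ioctl described in arp(7). The helper names `format_mac` and `arp_cache_lookup` are hypothetical and are not part of any OpenIB code. The sketch only reads an entry that is already in the kernel's ARP cache; it does not send ARP packets or modify the table. It copies a 6-byte Ethernet address, so as written it is not directly applicable to IPoIB's longer link-layer addresses.

```c
#include <arpa/inet.h>
#include <net/if_arp.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: format a 6-byte hardware address as aa:bb:cc:dd:ee:ff. */
static void format_mac(const unsigned char mac[6], char *out, size_t len)
{
    snprintf(out, len, "%02x:%02x:%02x:%02x:%02x:%02x",
             mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
}

/*
 * Hypothetical helper: look up a dotted-quad IPv4 address in the kernel ARP
 * cache for the given interface. Returns 0 and fills mac on success, or -1
 * if the address is malformed, the ioctl fails, or the entry is incomplete.
 * Assumption: the entry holds a 6-byte (Ethernet) hardware address.
 */
static int arp_cache_lookup(const char *ifname, const char *ip,
                            unsigned char mac[6])
{
    struct arpreq req;
    struct sockaddr_in *sin;
    int fd, rc;

    memset(&req, 0, sizeof(req));
    sin = (struct sockaddr_in *)&req.arp_pa;
    sin->sin_family = AF_INET;
    if (inet_pton(AF_INET, ip, &sin->sin_addr) != 1)
        return -1;                      /* not a valid IPv4 address */
    strncpy(req.arp_dev, ifname, sizeof(req.arp_dev) - 1);

    /* Any AF_INET socket will do; it is only a handle for the ioctl. */
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    rc = ioctl(fd, SIOCGARP, &req);     /* read-only query of the cache */
    close(fd);

    /* ATF_COM marks a completed entry with a usable hardware address. */
    if (rc < 0 || !(req.arp_flags & ATF_COM))
        return -1;
    memcpy(mac, req.arp_ha.sa_data, 6);
    return 0;
}
```

A caller would pass an interface name and a peer address, then print the result with `format_mac`. A return of -1 only means that no completed entry was found in the cache, not that the peer is unreachable.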
From nacc at us.ibm.com Wed Oct 26 16:51:37 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Wed, 26 Oct 2005 16:51:37 -0700 Subject: [openib-general] Automated userspace build error In-Reply-To: <52pspt2q4f.fsf@cisco.com> References: <20051025220446.GA27205@us.ibm.com> <52u0f52qqh.fsf@cisco.com> <20051025221849.GB27205@us.ibm.com> <52pspt2q4f.fsf@cisco.com> Message-ID: <20051026235137.GA6369@us.ibm.com> On 25.10.2005 [15:22:56 -0700], Roland Dreier wrote: > Nishanth> Hrm, well, I'm testing the latest svn (3865), did the > Nishanth> patch just get checked in? > > Yeah, I only noticed it and fixed it after your original email. I > just meant that I had already checked it in before sending my reply. > Sorry for the confusion... No worries, I figured that's what happened. On a related note, do you (or anyone else) have any suggestions for build-testing all of the userspace components? There isn't a top-level Makefile of any kind to make it easy :/ Thanks, Nish From robert.j.woodruff at intel.com Wed Oct 26 17:15:05 2005 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 26 Oct 2005 17:15:05 -0700 Subject: [openib-general] Automated userspace build error Message-ID: <1AC79F16F5C5284499BB9591B33D6F0005F0FEFB@orsmsx408> Nish wrote, >On a related note, do you (or anyone else) have any suggestions for >build-testing all of the userspace components? There isn't a top-level >Makefile of any kind to make it easy :/ >Thanks, >Nish If you look at the openib download page, Makia posted a userspace source RPM, although it is a bit out of date. I also have a similar build procedure that I use internally, basically building all of the usermode components and then building an RPM to allow easy installation on other nodes for testing. There are also .spec files for most of the individual libraries, if you prefer to build RPMs for individual libraries. I find it easier just to lump it all into one big usermode component RPM and one kernel-mode component RPM.
woody From johann at pathscale.com Wed Oct 26 21:40:34 2005 From: johann at pathscale.com (Johann George) Date: Wed, 26 Oct 2005 21:40:34 -0700 Subject: [openib-general] last version for 2.6.9 backport In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0005ED85DF@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0005ED85DF@orsmsx408> Message-ID: <20051027044034.GA26386@cuprite.internal.keyresearch.com> > BTW. I was not able to test the pathscale driver as I do not > have any of their H/W, so if someone that has H/W could > test it, that would be great. I think we might have some hardware lying around. :-) Will try it out. Johann From liran at mellanox.co.il Wed Oct 26 22:49:33 2005 From: liran at mellanox.co.il (Liran Sorani) Date: Thu, 27 Oct 2005 07:49:33 +0200 Subject: [openib-general] RE: Osmtest removal from Gen2 main trunk Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E35AB931@mtlexch01.mtl.com> Hi , Hal . No problem , it can wait till next week. -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Wednesday, October 26, 2005 6:36 PM To: Liran Sorani Cc: openib-general at openib.org Subject: RE: Osmtest removal from Gen2 main trunk Hi Liran, I'm out at SC05 staging. Can this wait until I get back (no later than early next week) ? I want to do a side by side comparison before osmtest is removed from the trunk. -- Hal ________________________________ From: Liran Sorani [mailto:liran at mellanox.co.il] Sent: Tue 10/25/2005 1:35 AM To: Hal Rosenstock Cc: openib-general at openib.org Subject: Osmtest removal from Gen2 main trunk Hi , Hal . Since now the Osmtest is updated (in all stack flavours) under ibtp repository (https://openib.org/svn/trunk/contrib/mellanox/ibtp/), I'd like to remove it from main trunk : https://openib.org/svn/gen2/trunk/src/userspace/management/osm/osmtest. New updates will be checked into ibtp repository only , thanks . 
-----Original Message----- From: Liran Sorani Sent: Sunday, October 23, 2005 9:01 AM To: 'Hal Rosenstock'; Liran Sorani Cc: openib-general at openib.org Subject: RE: [openib-general] InfiniBand Test Project (IBTP) - Update Currently only a minor bug fix in osmt_service flow , and cosmetics changes to fit WinIb stack . -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Thursday, October 20, 2005 1:01 PM To: Liran Sorani Cc: openib-general at openib.org Subject: RE: [openib-general] InfiniBand Test Project (IBTP) - Update On Thu, 2005-10-20 at 03:49, Liran Sorani wrote: > Hi , Hal . > The Linux & WinIB are the same , except for several cosmetic changes . I was referring to the (differences in the) Linux one in ibtp and the Linux one under gen2/trunk. > Regarding Makefile.in , it's an outcome of autogen , I'll remove it . Thanks. -- Hal > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, October 19, 2005 10:25 PM > To: Liran Sorani > Cc: openib-general at openib.org > Subject: Re: [openib-general] InfiniBand Test Project (IBTP) - Update > > > On Wed, 2005-10-19 at 15:33, Liran Sorani wrote: > > Hi , > > We've updated IBTP tree with Osmtest sources both on ibal (WinIB) > and > > Gen2 stacks : > > > https://openib.org/svn/trunk/contrib/mellanox/ibtp/ibal/ulp/opensm/user/osmtest > > > > > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management/osm/osmtest > > > > Osmtest is the main verification tool for OpenSM , include various > SA > > (Good / Bad) flows. > > Attached to each directory a short README file for setup and usage > > information. > > How is the Linux one different from osmtest in the trunk ? > > Also, (nit): > I think > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/management/osm/osmtest/Makefile.in > is a generated file and should be removed. > > -- Hal > > > > Liran Sorani > > > Mellanox Technologies LTD.
> > > mailto:liran at mellanox.co.il > > > Phone: +972(4)9097200 Ext: 214 > > > Israel, Yokneam P.O.B 586 ZIP 20692 > > > > > > > > > > > > > > > ______________________________________________________________________ > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eitan at mellanox.co.il Wed Oct 26 23:03:57 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 27 Oct 2005 08:03:57 +0200 Subject: [openib-general] [RFC] OpenSM Interactive Console Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E361882D@mtlexch01.mtl.com> Yes this MIB needs some cleanup. I would love to hear from the community some feedback regarding SM MIB usefulness. In the past we did not get any push for interactive SM or online configurable SM so I did not see any reason to work on it. I do not think it is a huge task to make SM MIB work with OpenSM. At least not the 90% of it that I glanced through. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, October 26, 2005 7:44 PM > To: Eitan Zahavi > Cc: Troy Benjegerdes; openib-general at openib.org > Subject: RE: [openib-general] [RFC] OpenSM Interactive Console > > Hi Eitan, > > I sit corrected. There are R/W parameters in the SM MIB as you indicate. I was > thinking of all the other IPoIB MIBs. It's been a while since I looked at the SM MIB. > > Also, the SM MIB (draft-ietf-ipoib-subnet-manager-mib-00) expired a while ago. At a > minimum, it needs to be dusted off. That would include updating it for IBA 1.2. 
> > -- Hal > > ________________________________ > > From: Eitan Zahavi [mailto:eitan at mellanox.co.il] > Sent: Tue 10/25/2005 5:19 AM > To: Hal Rosenstock > Cc: Troy Benjegerdes; openib-general at openib.org > Subject: Re: [openib-general] [RFC] OpenSM Interactive Console > > > > Hal Rosenstock wrote: > > On Mon, 2005-10-24 at 14:38, Eitan Zahavi wrote: > > > >>Hal Rosenstock wrote: > >> > >>>On Mon, 2005-10-24 at 03:08, Eitan Zahavi wrote: > >>> > >>> > >>>>I would suggest to use SNMP for the tasks below. IETF IPoIB group > > > > has > > > >>>>defined an SNMP MIB that can support the required functionality > > > > below. > > > >>> > >>>The IETF SNMP MIBs are one way of presenting the information to the > >>>outside world. There are other possible management interfaces. The > > > > SNMP > > > >>>MIB instrumentation would need to use lower layer APIs to get this > >>>information out of the SM. > >> > >>Yes but the IETF SM MIB is the only one that is close to a standard > > > > way. > > > >>It does not require low level interface if it will integrate into the > > > > OpenSM code. > > > >>One way to do it is buy extending OpenSM with an AgentX interface. > >> > >>IMO one clear advantage of using SNMP for SM integration is that the > > > > code will work with any SM that is IETF compliant. > > > >>Also if you want to write a "client server" type of application on top > > > > of an SM you > > > >>can either stick to sending MADs which translate into SA client based > > > > application or > > > >>you better stay with some known protocol for management (like SNMP) > > > > and not develop yet another protocol for > > > >>doing exactly the same things as SNMP already supports. > > > > > > There are limitations in the SNMP MIBs. One is that they are RO so they > > are more for monitoring. Also, many environments do not use SNMP. It is > > unclear how much of a requirement it is to manage any SM or how many > > other SMs support the SM MIB. 
(There are other IB associated MIBs too). > > SNMP MIBs are certainly not just RO a simple example from the SM MIB: > ibSmPortInfoLMC OBJECT-TYPE > SYNTAX Unsigned32(0..7) > MAX-ACCESS read-write > STATUS current > DESCRIPTION > "LID mask for multipath support. User should take extra caution > when setting this value, since any change will effect packet > routing." > ::= { ibSmPortInfoEntry 19 } > > > I agree that it is possible that currently no SM is supporting the SM MIB. > But it does make sense to have ALL of the them support it. Such that they can > be activated/deactivated and configured in the manner. > > Most unix distributions and windows box have standard SNMP agent and client > included in them > So it does not take more then simple bash or C code to interact with the SM if it > supports SNMP. > > > > > > >>>>Everything but the dynamic partitioning (OpenSM does not have > >>>>partition manager to this moment) > >>> > >>> > >>>What Troy meant by partitioning is not necessarily IB partitioning. > >> > >>How are you sure about that? Troy - please comment. > > > > > > I think you missed an email on this. > > > > > >>>>and forwarding of Performance > >>>>Monitoring traps (which are generated by the PM) can be done through > >>>>osmsh or through SA client today. > >>> > >>> > >>>What PerfMgr are you referring to ? > >> > >>No specific one. But the specification does not require the SM too. > > > > > > Huh ? What spec ? An SM is required in a subnet. There is no subnet > > without this. There is a subnet without a PerfMgr. > Yes its a typo I meant PM. SM is a requirement. You know I did not mean that. > > > > > >>For various reasons (like load) it might make more sense to have the > > > > PM distributed. > > > > Sure. Also, the PerfMgr need not be colocated with the SM anyhow. > > > > > >>Anyway, my point is that the SM is not the owner of PM trap reporting. 
> > > > It is the PM that > > > >>should support Reporting (I.e InformInfo registration and Trap > > > > forwarding) for PM traps. > > > >>But the spec does not define such traps anyway. > > > > > > My point was that the PerfMgr is beyond the IBA spec. It is only the PMA > > that is defined and has no traps so these will all need synthesis by the > > PerfMgr. > Agree. > > > > -- Hal > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Thu Oct 27 04:44:32 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 27 Oct 2005 13:44:32 +0200 Subject: [openib-general] [RFC] OpenSM Interactive Console Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C69@taurus.voltaire.com> There have been requests for this CLI functionality from at least the labs. It has been discussed on the list. Also, there was the following comment in OpenSM::main.c: /* Sit here forever In the future, some sort of console interactivity could be implemented in this loop. */ -- Hal ________________________________ From: Eitan Zahavi [mailto:eitan at mellanox.co.il] Sent: Thu 10/27/2005 2:03 AM To: Hal Rosenstock; Eitan Zahavi Cc: Troy Benjegerdes; openib-general at openib.org Subject: RE: [openib-general] [RFC] OpenSM Interactive Console Yes this MIB needs some cleanup. I would love to hear from the community some feedback regarding SM MIB usefulness. In the past we did not get any push for interactive SM or online configurable SM so I did not see any reason to work on it. I do not think it is a huge task to make SM MIB work with OpenSM. At least not the 90% of it that I glanced through. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. 
Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, October 26, 2005 7:44 PM > To: Eitan Zahavi > Cc: Troy Benjegerdes; openib-general at openib.org > Subject: RE: [openib-general] [RFC] OpenSM Interactive Console > > Hi Eitan, > > I sit corrected. There are R/W parameters in the SM MIB as you indicate. I was > thinking of all the other IPoIB MIBs. It's been a while since I looked at the SM MIB. > > Also, the SM MIB (draft-ietf-ipoib-subnet-manager-mib-00) expired a while ago. At a > minimum, it needs to be dusted off. That would include updating it for IBA 1.2. > > -- Hal > > ________________________________ > > From: Eitan Zahavi [mailto:eitan at mellanox.co.il] > Sent: Tue 10/25/2005 5:19 AM > To: Hal Rosenstock > Cc: Troy Benjegerdes; openib-general at openib.org > Subject: Re: [openib-general] [RFC] OpenSM Interactive Console > > > > Hal Rosenstock wrote: > > On Mon, 2005-10-24 at 14:38, Eitan Zahavi wrote: > > > >>Hal Rosenstock wrote: > >> > >>>On Mon, 2005-10-24 at 03:08, Eitan Zahavi wrote: > >>> > >>> > >>>>I would suggest to use SNMP for the tasks below. IETF IPoIB group > > > > has > > > >>>>defined an SNMP MIB that can support the required functionality > > > > below. > > > >>> > >>>The IETF SNMP MIBs are one way of presenting the information to the > >>>outside world. There are other possible management interfaces. The > > > > SNMP > > > >>>MIB instrumentation would need to use lower layer APIs to get this > >>>information out of the SM. > >> > >>Yes but the IETF SM MIB is the only one that is close to a standard > > > > way. > > > >>It does not require low level interface if it will integrate into the > > > > OpenSM code. > > > >>One way to do it is buy extending OpenSM with an AgentX interface. > >> > >>IMO one clear advantage of using SNMP for SM integration is that the > > > > code will work with any SM that is IETF compliant. 
> > > >>Also if you want to write a "client server" type of application on top > > > > of an SM you > > > >>can either stick to sending MADs which translate into SA client based > > > > application or > > > >>you better stay with some known protocol for management (like SNMP) > > > > and not develop yet another protocol for > > > >>doing exactly the same things as SNMP already supports. > > > > > > There are limitations in the SNMP MIBs. One is that they are RO so they > > are more for monitoring. Also, many environments do not use SNMP. It is > > unclear how much of a requirement it is to manage any SM or how many > > other SMs support the SM MIB. (There are other IB associated MIBs too). > > SNMP MIBs are certainly not just RO a simple example from the SM MIB: > ibSmPortInfoLMC OBJECT-TYPE > SYNTAX Unsigned32(0..7) > MAX-ACCESS read-write > STATUS current > DESCRIPTION > "LID mask for multipath support. User should take extra caution > when setting this value, since any change will effect packet > routing." > ::= { ibSmPortInfoEntry 19 } > > > I agree that it is possible that currently no SM is supporting the SM MIB. > But it does make sense to have ALL of the them support it. Such that they can > be activated/deactivated and configured in the manner. > > Most unix distributions and windows box have standard SNMP agent and client > included in them > So it does not take more then simple bash or C code to interact with the SM if it > supports SNMP. > > > > > > >>>>Everything but the dynamic partitioning (OpenSM does not have > >>>>partition manager to this moment) > >>> > >>> > >>>What Troy meant by partitioning is not necessarily IB partitioning. > >> > >>How are you sure about that? Troy - please comment. > > > > > > I think you missed an email on this. > > > > > >>>>and forwarding of Performance > >>>>Monitoring traps (which are generated by the PM) can be done through > >>>>osmsh or through SA client today. > >>> > >>> > >>>What PerfMgr are you referring to ? 
> >> > >>No specific one. But the specification does not require the SM too. > > > > > > Huh ? What spec ? An SM is required in a subnet. There is no subnet > > without this. There is a subnet without a PerfMgr. > Yes its a typo I meant PM. SM is a requirement. You know I did not mean that. > > > > > >>For various reasons (like load) it might make more sense to have the > > > > PM distributed. > > > > Sure. Also, the PerfMgr need not be colocated with the SM anyhow. > > > > > >>Anyway, my point is that the SM is not the owner of PM trap reporting. > > > > It is the PM that > > > >>should support Reporting (I.e InformInfo registration and Trap > > > > forwarding) for PM traps. > > > >>But the spec does not define such traps anyway. > > > > > > My point was that the PerfMgr is beyond the IBA spec. It is only the PMA > > that is defined and has no traps so these will all need synthesis by the > > PerfMgr. > Agree. > > > > -- Hal > > From mst at mellanox.co.il Thu Oct 27 05:31:31 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 27 Oct 2005 14:31:31 +0200 Subject: [openib-general] Re: Automated userspace build error In-Reply-To: <20051026235137.GA6369@us.ibm.com> References: <20051026235137.GA6369@us.ibm.com> Message-ID: <20051027123131.GU4769@mellanox.co.il> Quoting r. Nishanth Aravamudan : > Subject: Re: Automated userspace build error > > On 25.10.2005 [15:22:56 -0700], Roland Dreier wrote: > > Nishanth> Hrm, well, I'm testing the latest svn (3865), did the > > Nishanth> patch just get checked in? > > > > Yeah, I only noticed it and fixed it after your original email. I > > just meant that I had already checked it in before sending my reply. > > Sorry for the confusion... > > No worries, I figured that's what happened. > > On a related note, do you (or anyone else) have any suggestions for > build-testing all of the userspace components? 
There isn't a top-level > Makefile of any kind to make it easy :/ > > Thanks, > Nish Yes, look at scripts in https://openib.org/svn/trunk/contrib/mellanox/scripts You can also, basically, cut and paste stuff from the FAQ page, but that relies on performing make as root. -- MST From yael at mellanox.co.il Thu Oct 27 06:04:25 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 27 Oct 2005 15:04:25 +0200 Subject: [openib-general] [PATCH] Opensm - fix lmc algorithm Message-ID: <5zfyqnxg9y.fsf@mtl066.yok.mtl.com> Hi Hal, We noticed a problem in the lmc assignment algorithm. In the current code - when trying to run opensm with lmc > 0, the opensm goes into infinite loop. Debugging the problem we noticed that there is a problem with the lid assignment, and we changed the algorithm. The change is in the osm_lid_mgr_init_sweep function. We have done some testing to the new code, and it seems that the lmc assignment is ok with the fix. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 3848) +++ opensm/osm_lid_mgr.c (working copy) @@ -337,7 +337,7 @@ __osm_lid_mgr_init_sweep( uint16_t max_defined_lid; uint16_t max_persistent_lid; uint16_t max_discovered_lid; - uint16_t lid, l; + uint16_t lid; uint16_t disc_min_lid; uint16_t disc_max_lid; uint16_t db_min_lid; @@ -349,16 +349,23 @@ __osm_lid_mgr_init_sweep( osm_port_t *p_port; cl_qmap_t *p_port_guid_tbl; uint8_t lmc_num_lids = (uint8_t)(1 << p_mgr->p_subn->opt.lmc); + uint16_t lmc_mask; + uint16_t req_lid, num_lids; OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_init_sweep ); + if (p_mgr->p_subn->opt.lmc) + lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1); + else + lmc_mask = 0xffff; + /* if we came out of standby we need to discard any previous guid 2 lid info we might had */ if ( p_mgr->p_subn->coming_out_of_standby == TRUE ) { osm_db_clear( p_mgr->p_g2l ); for (lid = 0; lid < 
cl_ptr_vector_get_size(&p_mgr->used_lids); lid++) - cl_ptr_vector_set(&p_mgr->used_lids, lid, NULL); + cl_ptr_vector_set(p_persistent_vec, lid, NULL); } /* we need to cleanup the empty ranges list */ @@ -375,7 +382,7 @@ __osm_lid_mgr_init_sweep( /* we if are on the first sweep and in re-assign lids mode we should ignore all the available info and simply define one - hufe empty range */ + huge empty range */ if ((p_mgr->p_subn->first_time_master_sweep == TRUE) && (p_mgr->p_subn->opt.reassign_lids == TRUE )) { @@ -398,6 +405,34 @@ __osm_lid_mgr_init_sweep( osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid); for (lid = disc_min_lid; lid <= disc_max_lid; lid++) cl_ptr_vector_set(p_discovered_vec, lid, p_port ); + /* make sure the guid2lid entry is valid. If not - clean it. */ + if (!osm_db_guid2lid_get( p_mgr->p_g2l, + cl_ntoh64(osm_port_get_guid(p_port)), + &db_min_lid, &db_max_lid)) + { + if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != + IB_NODE_TYPE_SWITCH) + num_lids = lmc_num_lids; + else + num_lids = 1; + + if ((num_lids != 1) && + (((db_min_lid & lmc_mask) != db_min_lid) || + (db_max_lid - db_min_lid + 1 < num_lids)) ) + { + /* Not alligned, or not wide enough - remove the entry */ + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, + "__osm_lid_mgr_init_sweep: " + "Cleaning persistent entry for guid:0x%016" PRIx64 + " illegal range:[0x%x:0x%x] \n", + cl_ntoh64(osm_port_get_guid(p_port)), db_min_lid, + db_max_lid ); + osm_db_guid2lid_delete( p_mgr->p_g2l, + cl_ntoh64(osm_port_get_guid(p_port))); + for ( lid = db_min_lid ; lid <= db_max_lid ; lid++ ) + cl_ptr_vector_set(p_persistent_vec, lid, NULL); + } + } } /* @@ -434,7 +469,7 @@ __osm_lid_mgr_init_sweep( { is_free = TRUE; /* first check to see if the lid is used by a persistent assignment */ - if ((lid < max_persistent_lid) && cl_ptr_vector_get(p_persistent_vec, lid)) + if ((lid <= max_persistent_lid) && cl_ptr_vector_get(p_persistent_vec, lid)) { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, 
"__osm_lid_mgr_init_sweep: " @@ -442,62 +477,86 @@ __osm_lid_mgr_init_sweep( lid); is_free = FALSE; } - - /* check the discovered port if there is one */ - if ((lid < max_discovered_lid) && - (p_port = (osm_port_t *)cl_ptr_vector_get(p_discovered_vec, lid))) + else { - /* get the lid range of that port - but we know how many lids we - are about to assign to it */ - osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid); - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != - IB_NODE_TYPE_SWITCH) - disc_max_lid = disc_min_lid + lmc_num_lids - 1; - + /* check this is a discovered port */ + CL_ASSERT(lid <= max_discovered_lid); + if ((p_port = (osm_port_t *)cl_ptr_vector_get(p_discovered_vec, lid))) + { + /* we have a port. Now lets see if we can preserve its lid range. */ + /* For that - we need to make sure: + 1. The port has a (legal) persistancy entry. Then the local lid + is free (we will use the persistancy value). + 2. Can the port keep its local assignment? + a. Make sure the lid a alligned. + b. Make sure all needed lids (for the lmc) are free according + to persistancy table. + */ /* qualify the guid of the port is not persistently mapped to another range */ if (!osm_db_guid2lid_get( p_mgr->p_g2l, cl_ntoh64(osm_port_get_guid(p_port)), &db_min_lid, &db_max_lid)) { - /* ok there is an asignment - is it the same ? */ - if ((disc_min_lid == db_min_lid) && (disc_max_lid == db_max_lid)) - { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " - "[0x%04x,0x%04x] is not free as it was discovered " - " and mapped by the persistent db.\n", - disc_min_lid, disc_max_lid); - is_free = FALSE; + "0x%04x is free as it was discovered " + "but mapped by the persistent db to [0x%04x:0x%04x].\n", + lid, db_min_lid, db_max_lid); + } + else + { + /* can the port keep its assignment ? 
*/ + /* get the lid range of that port, and the required number + of lids we are about to assign to it */ + osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid); + if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != + IB_NODE_TYPE_SWITCH) + { + disc_max_lid = disc_min_lid + lmc_num_lids - 1; + num_lids = lmc_num_lids; } else { + num_lids = 1; + } + /* Make sure the lid is alligned */ + if ((num_lids != 1) && ((disc_min_lid & lmc_mask) != disc_min_lid)) + { + /* The lid cannot be used */ osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " - "[0x%04x,0x%04x] is free as it was discovered" - " but mapped to range: [0x%x:0x%x] by the persistent db.\n", - disc_min_lid, disc_max_lid, db_min_lid, db_max_lid); - for (l = disc_min_lid; l <= disc_max_lid; l++) - cl_ptr_vector_set(p_discovered_vec, l, NULL); - } + "0x%04x is free as it was discovered " + "but not alligned. \n", + lid ); } else { + /* check that all needed lids are not persistantly mapped */ + is_free = FALSE; + for ( req_lid = disc_min_lid + 1 ; req_lid <= disc_max_lid ; req_lid++ ) + { + if ((req_lid <= max_persistent_lid) && cl_ptr_vector_get(p_persistent_vec, req_lid)) + { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " - "0x%04x is not free as it was discovered" - " and there is no persistent db entry for it.\n", + "0x%04x is free as it was discovered " + "but mapped. \n", lid); - is_free = FALSE; + is_free = TRUE; + break; + } } - - /* if there is more then one lid on that port - and the discovered port - is going to retain its lids advance to the max lid */ if (is_free == FALSE) { + /* This port will use its local lid, and consume the entire required lid range. + Thus we can skip that range. 
*/ lid = disc_max_lid; } } + } + } + } if (is_free) { @@ -1300,7 +1359,6 @@ osm_lid_mgr_process_subnet( /* the proc returns the fact it sent a set port info */ if (__osm_lid_mgr_set_physp_pi( p_mgr, p_physp, cl_hton16( min_lid_ho ))) p_mgr->send_set_reqs = TRUE; - } } /* all ports */ From hozer at hozed.org Thu Oct 27 07:29:57 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 27 Oct 2005 09:29:57 -0500 Subject: [openib-general] [RFC] OpenSM Interactive Console In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E361882D@mtlexch01.mtl.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E361882D@mtlexch01.mtl.com> Message-ID: <20051027142956.GF3275@kalmia.hozed.org> For me, the only purpose for an SNMP MIB would be to get the information into a network management system. In my case, I'll be using something that's open-source or has a plugin architecture like Nagios, and I'd really rather just have the network management system communicate with the subnet manager or SMA packets directly rather than introducing an extra translation to SNMP. SNMP is only useful to me because it is (in theory) an interoperable cross-vendor standard. In the infiniband case, we already have a cross-vendor standard implementation (OpenIB), and adding SNMP is another dependency and layer of complexity that can break and be difficult to set up. If I knew of an open-source tool that was actually able to use SNMP to query a random ethernet vendor's switch and be able to tell me what port a particular MAC address was plugged into, I might be more positive. But as far as I know, each vendor's SNMP implementation is broken in subtly different ways, so that this gets to be a nightmare to actually implement. I guess the point of all this is to find an end-user use-case for the SM MIB, and work back from there to decide if having a MIB actually helps solve the problem. On Thu, Oct 27, 2005 at 08:03:57AM +0200, Eitan Zahavi wrote: > Yes this MIB needs some cleanup.
> I would love to hear from the community some feedback regarding SM MIB > usefulness. > > In the past we did not get any push for interactive SM or online > configurable SM so I did not see any reason to work on it. > > I do not think it is a huge task to make SM MIB work with OpenSM. At least > not the 90% of it that I glanced through. > > > Eitan Zahavi > Design Technology Director > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Wednesday, October 26, 2005 7:44 PM > > To: Eitan Zahavi > > Cc: Troy Benjegerdes; openib-general at openib.org > > Subject: RE: [openib-general] [RFC] OpenSM Interactive Console > > > > Hi Eitan, > > > > I sit corrected. There are R/W parameters in the SM MIB as you indicate. I > was > > thinking of all the other IPoIB MIBs. It's been a while since I looked at > the SM MIB. > > > > Also, the SM MIB (draft-ietf-ipoib-subnet-manager-mib-00) expired a while > ago. At a > > minimum, it needs to be dusted off. That would include updating it for IBA > 1.2. > > > > -- Hal > > > > ________________________________ > > > > From: Eitan Zahavi [mailto:eitan at mellanox.co.il] > > Sent: Tue 10/25/2005 5:19 AM > > To: Hal Rosenstock > > Cc: Troy Benjegerdes; openib-general at openib.org > > Subject: Re: [openib-general] [RFC] OpenSM Interactive Console > > > > > > > > Hal Rosenstock wrote: > > > On Mon, 2005-10-24 at 14:38, Eitan Zahavi wrote: > > > > > >>Hal Rosenstock wrote: > > >> > > >>>On Mon, 2005-10-24 at 03:08, Eitan Zahavi wrote: > > >>> > > >>> > > >>>>I would suggest to use SNMP for the tasks below. IETF IPoIB group > > > > > > has > > > > > >>>>defined an SNMP MIB that can support the required functionality > > > > > > below. > > > > > >>> > > >>>The IETF SNMP MIBs are one way of presenting the information to the > > >>>outside world. There are other possible management interfaces. 
The > > > > > > SNMP > > > > > >>>MIB instrumentation would need to use lower layer APIs to get this > > >>>information out of the SM. > > >> > > >>Yes but the IETF SM MIB is the only one that is close to a standard > > > > > > way. > > > > > >>It does not require low level interface if it will integrate into the > > > > > > OpenSM code. > > > > > >>One way to do it is buy extending OpenSM with an AgentX interface. > > >> > > >>IMO one clear advantage of using SNMP for SM integration is that the > > > > > > code will work with any SM that is IETF compliant. > > > > > >>Also if you want to write a "client server" type of application on top > > > > > > of an SM you > > > > > >>can either stick to sending MADs which translate into SA client based > > > > > > application or > > > > > >>you better stay with some known protocol for management (like SNMP) > > > > > > and not develop yet another protocol for > > > > > >>doing exactly the same things as SNMP already supports. > > > > > > > > > There are limitations in the SNMP MIBs. One is that they are RO so they > > > are more for monitoring. Also, many environments do not use SNMP. It is > > > unclear how much of a requirement it is to manage any SM or how many > > > other SMs support the SM MIB. (There are other IB associated MIBs too). > > > > SNMP MIBs are certainly not just RO a simple example from the SM MIB: > > ibSmPortInfoLMC OBJECT-TYPE > > SYNTAX Unsigned32(0..7) > > MAX-ACCESS read-write > > STATUS current > > DESCRIPTION > > "LID mask for multipath support. User should take extra caution > > when setting this value, since any change will effect packet > > routing." > > ::= { ibSmPortInfoEntry 19 } > > > > > > I agree that it is possible that currently no SM is supporting the SM MIB. > > But it does make sense to have ALL of the them support it. Such that they > can > > be activated/deactivated and configured in the manner. 
> > > > Most unix distributions and windows box have standard SNMP agent and > client > > included in them > > So it does not take more then simple bash or C code to interact with the > SM if it > > supports SNMP. > > > > > > > > > > >>>>Everything but the dynamic partitioning (OpenSM does not have > > >>>>partition manager to this moment) > > >>> > > >>> > > >>>What Troy meant by partitioning is not necessarily IB partitioning. > > >> > > >>How are you sure about that? Troy - please comment. > > > > > > > > > I think you missed an email on this. > > > > > > > > >>>>and forwarding of Performance > > >>>>Monitoring traps (which are generated by the PM) can be done through > > >>>>osmsh or through SA client today. > > >>> > > >>> > > >>>What PerfMgr are you referring to ? > > >> > > >>No specific one. But the specification does not require the SM too. > > > > > > > > > Huh ? What spec ? An SM is required in a subnet. There is no subnet > > > without this. There is a subnet without a PerfMgr. > > Yes its a typo I meant PM. SM is a requirement. You know I did not mean > that. > > > > > > > > >>For various reasons (like load) it might make more sense to have the > > > > > > PM distributed. > > > > > > Sure. Also, the PerfMgr need not be colocated with the SM anyhow. > > > > > > > > >>Anyway, my point is that the SM is not the owner of PM trap reporting. > > > > > > It is the PM that > > > > > >>should support Reporting (I.e InformInfo registration and Trap > > > > > > forwarding) for PM traps. > > > > > >>But the spec does not define such traps anyway. > > > > > > > > > My point was that the PerfMgr is beyond the IBA spec. It is only the PMA > > > that is defined and has no traps so these will all need synthesis by the > > > PerfMgr. > > Agree. 
> > >
> > > -- Hal
> >

--
--------------------------------------------------------------------------
Troy Benjegerdes                'da hozer'                hozer at hozed.org

Somone asked me why I work on this free (http://www.fsf.org/philosophy/)
software stuff and not get a real job. Charles Shultz had the best answer:

"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's why
I draw cartoons. It's my life." -- Charles Shultz

From eitan at mellanox.co.il Thu Oct 27 08:30:39 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 27 Oct 2005 17:30:39 +0200
Subject: [openib-general] [RFC] OpenSM Interactive Console
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E361883D@mtlexch01.mtl.com>

Hi Hal,

I still think that a "server" like behavior is much preferable to having
the SM sit there and wait for console inputs. The SM is a service and thus
should run like a daemon. MIB is just a standard way to avoid the need to
define our own protocol to do that.

In your implementation the SM should be put in console mode from the first
invocation and thus will need a dedicated terminal.

Even with osmsh one could implement (using standard Tcl sockets) a simple
server that could just wait for remote commands (I can provide the code as
I have done zillions of such servers).

The MIB is nicer and I think it is not very complicated to implement. At
least not the trivial groups of setting SM parameters.

The more I think about it the more I get convinced we need to do it.

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Thursday, October 27, 2005 1:45 PM
> To: Eitan Zahavi
> Cc: Troy Benjegerdes; openib-general at openib.org
> Subject: RE: [openib-general] [RFC] OpenSM Interactive Console
>
> There have been requests for this CLI functionality from at least the labs. It has been
> discussed on the list.
>
> Also, there was the following comment in OpenSM::main.c:
>
> /*
>    Sit here forever
>    In the future, some sort of console interactivity could
>    be implemented in this loop.
> */
>
> -- Hal
>
> ________________________________
>
> From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
> Sent: Thu 10/27/2005 2:03 AM
> To: Hal Rosenstock; Eitan Zahavi
> Cc: Troy Benjegerdes; openib-general at openib.org
> Subject: RE: [openib-general] [RFC] OpenSM Interactive Console
>
> Yes this MIB needs some cleanup.
> I would love to hear from the community some feedback regarding SM MIB
> usefulness.
>
> In the past we did not get any push for interactive SM or online configurable SM so I
> did not see any reason to work on it.
>
> I do not think it is a huge task to make SM MIB work with OpenSM. At least not the
> 90% of it that I glanced through.
>
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Wednesday, October 26, 2005 7:44 PM
> > To: Eitan Zahavi
> > Cc: Troy Benjegerdes; openib-general at openib.org
> > Subject: RE: [openib-general] [RFC] OpenSM Interactive Console
> >
> > Hi Eitan,
> >
> > I sit corrected. There are R/W parameters in the SM MIB as you indicate. I was
> > thinking of all the other IPoIB MIBs. It's been a while since I looked at the SM MIB.
> >
> > Also, the SM MIB (draft-ietf-ipoib-subnet-manager-mib-00) expired a while ago.
> > At a minimum, it needs to be dusted off. That would include updating it
> > for IBA 1.2.
> >
> > -- Hal
> >
> > ________________________________
> >
> > From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
> > Sent: Tue 10/25/2005 5:19 AM
> > To: Hal Rosenstock
> > Cc: Troy Benjegerdes; openib-general at openib.org
> > Subject: Re: [openib-general] [RFC] OpenSM Interactive Console
> >
> > Hal Rosenstock wrote:
> > > On Mon, 2005-10-24 at 14:38, Eitan Zahavi wrote:
> > >
> > >>Hal Rosenstock wrote:
> > >>
> > >>>On Mon, 2005-10-24 at 03:08, Eitan Zahavi wrote:
> > >>>
> > >>>>I would suggest to use SNMP for the tasks below. IETF IPoIB group has
> > >>>>defined an SNMP MIB that can support the required functionality below.
> > >>>
> > >>>The IETF SNMP MIBs are one way of presenting the information to the
> > >>>outside world. There are other possible management interfaces. The SNMP
> > >>>MIB instrumentation would need to use lower layer APIs to get this
> > >>>information out of the SM.
> > >>
> > >>Yes but the IETF SM MIB is the only one that is close to a standard way.
> > >>It does not require low level interface if it will integrate into the
> > >>OpenSM code.
> > >>One way to do it is buy extending OpenSM with an AgentX interface.
> > >>
> > >>IMO one clear advantage of using SNMP for SM integration is that the
> > >>code will work with any SM that is IETF compliant.
> > >>Also if you want to write a "client server" type of application on top
> > >>of an SM you can either stick to sending MADs which translate into SA
> > >>client based application or you better stay with some known protocol for
> > >>management (like SNMP) and not develop yet another protocol for
> > >>doing exactly the same things as SNMP already supports.
> > >
> > > There are limitations in the SNMP MIBs. One is that they are RO so they
> > > are more for monitoring. Also, many environments do not use SNMP. It is
> > > unclear how much of a requirement it is to manage any SM or how many
> > > other SMs support the SM MIB. (There are other IB associated MIBs too).
> >
> > SNMP MIBs are certainly not just RO a simple example from the SM MIB:
> > ibSmPortInfoLMC OBJECT-TYPE
> >     SYNTAX      Unsigned32(0..7)
> >     MAX-ACCESS  read-write
> >     STATUS      current
> >     DESCRIPTION
> >         "LID mask for multipath support. User should take extra caution
> >         when setting this value, since any change will effect packet
> >         routing."
> >     ::= { ibSmPortInfoEntry 19 }
> >
> > I agree that it is possible that currently no SM is supporting the SM MIB.
> > But it does make sense to have ALL of the them support it. Such that they can
> > be activated/deactivated and configured in the manner.
> >
> > Most unix distributions and windows box have standard SNMP agent and client
> > included in them
> > So it does not take more then simple bash or C code to interact with the SM if it
> > supports SNMP.
> >
> > >>>>Everything but the dynamic partitioning (OpenSM does not have
> > >>>>partition manager to this moment)
> > >>>
> > >>>What Troy meant by partitioning is not necessarily IB partitioning.
> > >>
> > >>How are you sure about that? Troy - please comment.
> > >
> > > I think you missed an email on this.
> > >
> > >>>>and forwarding of Performance
> > >>>>Monitoring traps (which are generated by the PM) can be done through
> > >>>>osmsh or through SA client today.
> > >>>
> > >>>What PerfMgr are you referring to ?
> > >>
> > >>No specific one. But the specification does not require the SM too.
> > >
> > > Huh ? What spec ? An SM is required in a subnet. There is no subnet
> > > without this. There is a subnet without a PerfMgr.
> > Yes its a typo I meant PM. SM is a requirement. You know I did not mean that.
> > >
> > >>For various reasons (like load) it might make more sense to have the
> > >>PM distributed.
> > >
> > > Sure. Also, the PerfMgr need not be colocated with the SM anyhow.
> > >
> > >>Anyway, my point is that the SM is not the owner of PM trap reporting.
> > >>It is the PM that should support Reporting (I.e InformInfo registration
> > >>and Trap forwarding) for PM traps.
> > >>But the spec does not define such traps anyway.
> > >
> > > My point was that the PerfMgr is beyond the IBA spec. It is only the PMA
> > > that is defined and has no traps so these will all need synthesis by the
> > > PerfMgr.
> > Agree.
> > >
> > > -- Hal

From robert.j.woodruff at intel.com Thu Oct 27 08:58:50 2005
From: robert.j.woodruff at intel.com (Bob Woodruff)
Date: Thu, 27 Oct 2005 08:58:50 -0700
Subject: [openib-general] ifup/ifdown scripts don't work with IPoIB
Message-ID:

I was trying to set up my system to use the normal
/etc/sysconfig/network-scripts/ifcfg-ib0
and have the interface brought up at startup using /sbin/ifup, as it does
with Ethernet. I am running on a RedHat EL4.0 U2 distribution.

My config file looks like this,

# OpenIB IPoIB Controller
DEVICE=ib0
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.0.1
NETMASK=255.255.255.0
BROADCAST=192.168.0.255

When I run /sbin/ifup ib0, I get

[root at iclust-1 woody]# /sbin/ifup ib0
Error, some other host already uses address 192.168.0.1.

Looking at the ifup script, it does a

    if ! arping -q -c 2 -w 3 -D -I ${REALDEVICE} ${IPADDR} ; then
        echo $"Error, some other host already uses address ${IPADDR}."
        exit 1
    fi

If I run the arping command manually, I get

arping -c 2 -w 3 -D -I ib0 102.168.0.1
ARPING 102.168.0.1 from 0.0.0.0 ib0
Sent 2 probes (2 broadcast(s))
Received -1 response(s)

but when I run it on the eth0 device, I get

arping -c 2 -w 3 -D -I eth0 10.0.0.1
ARPING 10.0.0.1 from 0.0.0.0 eth0
Sent 2 probes (2 broadcast(s))
Received 0 response(s)

So why does arping report -1 responses on IPoIB, rather than 0 as it
does with Ethernet? Is this a problem with IPoIB or with the ifup script?

woody

From eitan at mellanox.co.il Thu Oct 27 09:16:39 2005
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 27 Oct 2005 18:16:39 +0200
Subject: [openib-general] osm_console.c - compilation warnings
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E361883E@mtlexch01.mtl.com>

Hi Hal

I think you are missing #include <stdlib.h>

As I get the following warnings:
osm_console.c: In function `loglevel_parse':
osm_console.c:112: warning: implicit declaration of function `strtoul'
osm_console.c:118: warning: implicit declaration of function `strtol'
osm_console.c: In function `osm_console':
osm_console.c:177: warning: implicit declaration of function `free'

EZ

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

From iod00d at hp.com Thu Oct 27 09:17:02 2005
From: iod00d at hp.com (Grant Grundler)
Date: Thu, 27 Oct 2005 09:17:02 -0700
Subject: [openib-general] ifup/ifdown scripts don't work with IPoIB
In-Reply-To:
References:
Message-ID: <20051027161702.GB18189@esmail.cup.hp.com>

On Thu, Oct 27, 2005 at 08:58:50AM -0700, Bob Woodruff wrote:
> If I run the arping command manually, I get
>
> arping -c 2 -w 3 -D -I ib0 102.168.0.1

What does it say when you use *192* for the first byte?
(This may not be the only problem...but need to get that right too)

grant

From surs at cse.ohio-state.edu Thu Oct 27 09:21:56 2005
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Thu, 27 Oct 2005 12:21:56 -0400
Subject: [openib-general] PGI compiler issue with dat_platform_specific.h
Message-ID: <20051027162154.GA23710@cse.ohio-state.edu>

Hi,

We ran into some troubles when compiling the OpenIB dapl provider with
the PGI compiler. I believe this should appear in both ibat-cm and the
scm based providers. Has anyone compiled DAPL/Gen2 with PGI? Is there a
quick workaround for this?

----
PGC-W-0221-Redefinition of symbol UINT64_C (/usr/include/stdint.h: 304)
PGC-S-0040-Illegal use of symbol, u_int64_t (/home/1/surs/projects/Gen2/dapl_scm_patch/dapl/dat/include/dat/dat_platform_specific.h: 139)
PGC/x86-64 Linux/x86-64 6.0-5: compilation completed with severe errors
----

Our machine is SuSe 9.3, with linux kernel version 2.6.13.1 and OpenIB
svn #3882.

Thanks,
Sayantan.

--
http://www.cse.ohio-state.edu/~surs

From hozer at hozed.org Thu Oct 27 09:23:36 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Thu, 27 Oct 2005 11:23:36 -0500
Subject: [openib-general] ib_mthca panic on PPC64
In-Reply-To: <52br1jes5u.fsf@cisco.com>
References: <20051020144020.GR30127@kalmia.hozed.org>
	<20051020150432.GS30127@kalmia.hozed.org>
	<52ach4f5ak.fsf@cisco.com>
	<20051020175603.GV30127@kalmia.hozed.org>
	<52sluwdq1b.fsf@cisco.com>
	<20051020220759.GX30127@kalmia.hozed.org>
	<52br1jes5u.fsf@cisco.com>
Message-ID: <20051027162336.GH3275@kalmia.hozed.org>

I got this the other day (before I had a chance to add the debug code)

p5l0:~#
[443954.161068] mthca0: ib_query_pkey port 0 failed (ret = -22)
[443988.334644] mthca0: ib_query_pkey port 0 failed (ret = -22)
[444037.579342] ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
[444037.579360] ib_mthca: Initializing 0000:d9:00.0
[444101.503664] ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
[444101.503682]
ib_mthca: Initializing 0000:d9:00.0 [444107.815375] Oops: Kernel access of bad area, sig: 7 [#1] [444107.815389] SMP NR_CPUS=8 NUMA PSERIES LPAR [444107.815401] Modules linked in: ib_ipoib ib_sa ib_mthca ib_mad ib_core openaf s [444107.815425] NIP: D0000000098BF638 XER: 20000018 LR: C000000000057B2C CTR: D0 000000098BF5D0 [444107.815440] REGS: c0000001ee79b490 TRAP: 0300 Tainted: P (2.6.13.3-p ower5) [444107.815455] MSR: 8000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 2800 0084 [444107.815469] DAR: d000010082189a04 DSISR: 0000000040000000 [444107.815481] TASK: c0000001ee7950e0[0] 'swapper' THREAD: c0000001ee798000 CPU : 6 [444107.815494] GPR00: 0000000000000010 C0000001EE79B710 D0000000098D6540 D00001 0082189A04 [444107.815515] GPR04: 0000000000000008 00000001009D0180 0000000000000000 000000 0000000800 [444107.815535] GPR08: C0000003DDA91910 0000000000000000 C0000001EE79B840 D000010082189A04 [444107.815556] GPR12: 0000000048000082 C0000000004BF400 0000000000000000 0000000000C00060 [444107.815576] GPR16: 0000000000000006 0000000000000000 0000000000000000 0000000000000000 [444107.815595] GPR20: 0000000000000000 C0000000005F7ED8 C0000000005F7F40 C000000000606500 [444107.815617] GPR24: C0000001ECEFC498 C0000001EE79B840 C0000001EE798000 C0000003DDA91000 [444107.815639] GPR28: 0000000000000100 C0000003DDA91000 D0000000098D4EC0 0000000000000000 [444107.815661] NIP [d0000000098bf638] .poll_catas+0x68/0x2f0 [ib_mthca] [444107.815699] LR [c000000000057b2c] .run_timer_softirq+0x15c/0x260 [444107.815717] Call Trace: [444107.815725] [c0000001ee79b710] [c0000001ee79b7c0] 0xc0000001ee79b7c0 (unreliable) [444107.815744] [c0000001ee79b7d0] [c000000000057b2c] .run_timer_softirq+0x15c/0x260 [444107.815764] [c0000001ee79b890] [c000000000051e68] .__do_softirq+0xe8/0x1c0 [444107.815783] [c0000001ee79b950] [c000000000051fc4] .do_softirq+0x84/0x90 [444107.815801] [c0000001ee79b9d0] [c0000000000108f0] .timer_interrupt+0xd0/0x41 0 [444107.815821] [c0000001ee79bad0] [c00000000000a2b4] 
decrementer_common+0xb4/0x100 [444107.815838] --- Exception: 901 at .pseries_dedicated_idle+0x104/0x280 [444107.815857] LR = .pseries_dedicated_idle+0x1e0/0x280 [444107.815868] [c0000001ee79be90] [c00000000000f460] .cpu_idle+0x40/0x60 [444107.815886] [c0000001ee79bf00] [c000000000032fa0] .start_secondary+0x120/0x150 [444107.815905] [c0000001ee79bf90] [c00000000000ba7c] .enable_64b_mode+0x0/0x28 [444107.815922] Instruction dump: [444107.815930] 3be00000 48000020 2fab0000 381f0001 7c1f07b4 409e0058 801d0908 7f9f0040 [444107.815955] 409c00c8 e97d08f8 7be91764 7c6b4a14 <7c001c2c> 0c000000 4c00012c 780b0020 [444107.815983] <0>Kernel panic - not syncing: Fatal exception in interrupt [444107.815998] From hozer at hozed.org Thu Oct 27 09:36:42 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Thu, 27 Oct 2005 11:36:42 -0500 Subject: [openib-general] Re: ehca testing In-Reply-To: <52br1jes5u.fsf@cisco.com> References: <20051020144020.GR30127@kalmia.hozed.org> <20051020150432.GS30127@kalmia.hozed.org> <52ach4f5ak.fsf@cisco.com> <20051020175603.GV30127@kalmia.hozed.org> <52sluwdq1b.fsf@cisco.com> <20051020220759.GX30127@kalmia.hozed.org> <52br1jes5u.fsf@cisco.com> Message-ID: <20051027163642.GI3275@kalmia.hozed.org> On Thu, Oct 20, 2005 at 03:32:13PM -0700, Roland Dreier wrote: > Troy> There is some sort of strange initializiation error going on here.. > > Yes, very strange. Can you add > > printk(KERN_ERR "hca->node_type = %d\n", hca->node_type); > > to the beginning of ipoib_add_port(), and > > printk(KERN_ERR "dev->ib_dev.node_type = %d\n", dev->ib_dev.node_type); > > right before the call to ib_register_device() in > mthca_register_device() and send the output that you get when hotplug > loads ib_mthca vs. when you load ib_mthca by hand? 
When loaded at boot: [586811.915831] ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) [586811.915849] ib_mthca: Initializing 0000:d9:00.0 [586811.916634] PCI: Enabling device: (0000:d9:00.0), cmd 142 [586818.501595] openafs: module license 'http://www.openafs.org/dl/license10.html' taints kernel. [586818.504651] Found system call table at 0xc000000000013e68 (scan: close+ioctl) [586818.520240] Starting AFS cache scan...Memory cache: Allocating 12500 dcacheentries...found 0 non-empty cache files (0%). [586875.848354] afs: Lost contact with volume location server 147.155.137.10 incell scl.ameslab.gov [586875.848374] afs: Lost contact with volume location server 147.155.137.10 incell scl.ameslab.gov [587154.758768] hca->node_type = 236 [587154.760578] hca->node_type = 236 [587154.761511] hca->node_type = 236 [587154.761572] mthca0: ib_query_pkey port 3 failed (ret = -22) [587154.761584] hca->node_type = 236 [587154.761633] mthca0: ib_query_pkey port 4 failed (ret = -22) [587154.761644] hca->node_type = 236 [587154.762506] hca->node_type = 236 [587154.763422] hca->node_type = 236 [587154.763480] mthca0: ib_query_pkey port 7 failed (ret = -22) [587154.763491] hca->node_type = 236 [587154.763542] mthca0: ib_query_pkey port 8 failed (ret = -22) [587154.763553] hca->node_type = 236 [587154.765698] hca->node_type = 236 [587154.767136] hca->node_type = 236 [587154.767312] mthca0: ib_query_pkey port 11 failed (ret = -22) [587154.767324] hca->node_type = 236 [587154.767455] mthca0: ib_query_pkey port 12 failed (ret = -22) [587154.767471] hca->node_type = 236 [587154.769140] hca->node_type = 236 [587154.772116] hca->node_type = 236 [587154.772180] mthca0: ib_query_pkey port 15 failed (ret = -22) [587154.772192] hca->node_type = 236 [587154.772243] mthca0: ib_query_pkey port 16 failed (ret = -22) [587154.772255] hca->node_type = 236 [587154.773401] hca->node_type = 236 [587154.776817] hca->node_type = 236 [587154.776974] mthca0: ib_query_pkey port 19 failed (ret = 
-22) [587154.776986] hca->node_type = 236 [587154.778179] mthca0: ib_query_pkey port 20 failed (ret = -22) [587154.778198] hca->node_type = 236 [587154.780159] hca->node_type = 236 [587154.785406] hca->node_type = 236 [587154.785512] mthca0: ib_query_pkey port 23 failed (ret = -22) [587154.785523] hca->node_type = 236 [587154.785582] mthca0: ib_query_pkey port 24 failed (ret = -22) [587154.785599] hca->node_type = 236 [587154.789427] hca->node_type = 236 [587154.794314] hca->node_type = 236 [587154.794458] mthca0: ib_query_pkey port 27 failed (ret = -22) [587154.794474] hca->node_type = 236 [587154.794634] mthca0: ib_query_pkey port 28 failed (ret = -22) [587154.794646] hca->node_type = 236 [587154.797133] hca->node_type = 236 [587154.803507] hca->node_type = 236 [587154.803597] mthca0: ib_query_pkey port 31 failed (ret = -22) [587154.803608] hca->node_type = 236 [587154.803667] mthca0: ib_query_pkey port 32 failed (ret = -22) [587154.803679] hca->node_type = 236 [587154.820947] hca->node_type = 236 [587154.829795] hca->node_type = 236 [587154.831921] mthca0: ib_query_pkey port 35 failed (ret = -22) [587154.831934] hca->node_type = 236 [587154.834932] mthca0: ib_query_pkey port 36 failed (ret = -22) [587154.834946] hca->node_type = 236 [587154.844314] hca->node_type = 236 [587154.853591] hca->node_type = 236 [587154.853680] mthca0: ib_query_pkey port 39 failed (ret = -22) [587154.853692] hca->node_type = 236 [587154.853745] mthca0: ib_query_pkey port 40 failed (ret = -22) [587154.853761] hca->node_type = 236 [587154.869483] hca->node_type = 236 [587154.874749] hca->node_type = 236 [587154.874952] mthca0: ib_query_pkey port 43 failed (ret = -22) [587154.874969] hca->node_type = 236 [587154.875609] mthca0: ib_query_pkey port 44 failed (ret = -22) [587154.875624] hca->node_type = 236 [587154.894612] hca->node_type = 236 [587154.908058] hca->node_type = 236 [587154.909244] mthca0: ib_query_pkey port 47 failed (ret = -22) [587154.909261] hca->node_type = 236 
[587154.909323] mthca0: ib_query_pkey port 48 failed (ret = -22) [587154.909334] hca->node_type = 236 [587154.918749] hca->node_type = 236 [587154.939629] hca->node_type = 236 [587154.939729] mthca0: ib_query_pkey port 51 failed (ret = -22) [587154.939745] hca->node_type = 236 [587154.939866] mthca0: ib_query_pkey port 52 failed (ret = -22) [587154.939883] hca->node_type = 236 [587154.957219] hca->node_type = 236 [587154.971523] hca->node_type = 236 [587154.971643] mthca0: ib_query_pkey port 55 failed (ret = -22) [587154.971664] hca->node_type = 236 [587154.972717] mthca0: ib_query_pkey port 56 failed (ret = -22) [587154.972733] hca->node_type = 236 [587154.984707] hca->node_type = 236 [587154.999129] hca->node_type = 236 [587154.999963] mthca0: ib_query_pkey port 59 failed (ret = -22) [587154.999976] hca->node_type = 236 [587155.000264] mthca0: ib_query_pkey port 60 failed (ret = -22) [587155.000282] hca->node_type = 236 [587155.012766] hca->node_type = 236 [587155.041105] hca->node_type = 236 [587155.041178] mthca0: ib_query_pkey port 63 failed (ret = -22) [587155.041189] hca->node_type = 236 [587155.041319] mthca0: ib_query_pkey port 64 failed (ret = -22) [587155.041332] hca->node_type = 236 [587155.066730] hca->node_type = 236 [587155.077348] hca->node_type = 236 [587155.077576] mthca0: ib_query_pkey port 67 failed (ret = -22) [587155.077593] hca->node_type = 236 [587155.077883] mthca0: ib_query_pkey port 68 failed (ret = -22) [587155.077896] hca->node_type = 236 [587155.097490] hca->node_type = 236 [587155.117809] hca->node_type = 236 [587155.117946] mthca0: ib_query_pkey port 71 failed (ret = -22) [587155.117962] hca->node_type = 236 [587155.118016] mthca0: ib_query_pkey port 72 failed (ret = -22) [587155.118031] hca->node_type = 236 [587155.138066] hca->node_type = 236 [587155.170056] hca->node_type = 236 [587155.170137] mthca0: ib_query_pkey port 75 failed (ret = -22) [587155.170153] hca->node_type = 236 [587155.170213] mthca0: ib_query_pkey port 76 failed 
(ret = -22) [587155.170225] hca->node_type = 236 [587155.205813] hca->node_type = 236 [587155.238014] hca->node_type = 236 [587155.238154] mthca0: ib_query_pkey port 79 failed (ret = -22) [587155.238168] hca->node_type = 236 [587155.238242] mthca0: ib_query_pkey port 80 failed (ret = -22) [587155.238256] hca->node_type = 236 [587155.266483] hca->node_type = 236 [587155.381938] hca->node_type = 236 [587155.382011] mthca0: ib_query_pkey port 83 failed (ret = -22) [587155.382027] hca->node_type = 236 [587155.382113] mthca0: ib_query_pkey port 84 failed (ret = -22) [587155.382125] hca->node_type = 236 [587155.418259] hca->node_type = 236 [587155.457782] hca->node_type = 236 [587155.457870] mthca0: ib_query_pkey port 87 failed (ret = -22) [587155.457886] hca->node_type = 236 [587155.457953] mthca0: ib_query_pkey port 88 failed (ret = -22) [587155.457966] hca->node_type = 236 [587155.477128] hca->node_type = 236 [587155.501172] hca->node_type = 236 [587155.501235] mthca0: ib_query_pkey port 91 failed (ret = -22) [587155.501245] hca->node_type = 236 [587155.501312] mthca0: ib_query_pkey port 92 failed (ret = -22) [587155.501323] hca->node_type = 236 [587155.580150] hca->node_type = 236 [587155.611763] hca->node_type = 236 [587155.611842] mthca0: ib_query_pkey port 95 failed (ret = -22) [587155.611855] hca->node_type = 236 [587155.611913] mthca0: ib_query_pkey port 96 failed (ret = -22) [587155.611929] hca->node_type = 236 [587155.663057] hca->node_type = 236 [587155.692342] hca->node_type = 236 [587155.692482] mthca0: ib_query_pkey port 99 failed (ret = -22) [587155.692494] hca->node_type = 236 [587155.692554] mthca0: ib_query_pkey port 100 failed (ret = -22) [587155.692572] hca->node_type = 236 [587155.759843] hca->node_type = 236 [587155.808226] hca->node_type = 236 [587155.808297] mthca0: ib_query_pkey port 103 failed (ret = -22) [587155.808317] hca->node_type = 236 [587155.808370] mthca0: ib_query_pkey port 104 failed (ret = -22) [587155.808383] hca->node_type = 236 
[587155.847076] hca->node_type = 236 [587155.870709] hca->node_type = 236 [587155.870781] mthca0: ib_query_pkey port 107 failed (ret = -22) [587155.870797] hca->node_type = 236 [587155.870857] mthca0: ib_query_pkey port 108 faile6 [587155.986258] mthca0: ib_query_pkey port 111 failed (ret = -22) [587155.986269] hca->node_type = 236 [587155.986338] mthca0: ib_query_pkey port 112 failed (ret = -22) [587155.986353] hca->node_type = 236 [587156.020368] hca->node_type = 236 [587156.068549] hca->node_type = 236 [587156.068626] mthca0: ib_query_pkey port 115 failed (ret = -22) [587156.068643] hca->node_type = 236 [587156.068700] mthca0: ib_query_pkey port 116 failed (ret = -22) [587156.068719] hca->node_type = 236 p5l1:~# p5l1:~# p5l1:~# p5l1:~# # reload...... p5l1:~# p5l1:~# rmmod ib_ipoib p5l1:~# rmmod ib_mad ERROR: Module ib_mad is in use by ib_sa,ib_mthca p5l1:~# rmmod ib_sa p5l1:~# rmmod ib_mthca p5l1:~# rmmod ib_mad p5l1:~# rmmod ib_core p5l1:~# p5l1:~# modprobe ib_mthca p5l1:~# modprobe . 
[587324.500037] ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) [587324.500056] ib_mthca: Initializing 0000:d9:00.0 [587325.778913] dev->ib_dev.node_type = 1 [587330.812591] Oops: Kernel access of bad area, sig: 7 [#1] [587330.812605] SMP NR_CPUS=8 NUMA PSERIES LPAR [587330.812618] Modules linked in: ib_mthca ib_mad ib_core openafs [587330.812637] NIP: D0000000098BF558 XER: 2000000B LR: C000000000057B2C CTR: D0 000000098BF4F0 [587330.812653] REGS: c0000001e3fb3490 TRAP: 0300 Tainted: P (2.6.13.3-p ower5) [587330.812669] MSR: 8000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 2800 0084 [587330.812682] DAR: d000010082187a04 DSISR: 0000000040000000 [587330.812694] TASK: c0000003dbf4d640[0] 'swapper' THREAD: c0000001e3fb0000 CPU : 5 [587330.812708] GPR00: 0000000000000010 C0000001E3FB3710 D0000000098D64C0 D00001 0082187A04 [587330.812729] GPR04: 0000000000000008 000000010003727D 0000000000000000 00000000000007D0 [587330.812748] GPR08: C0000001E3E08910 0000000000000000 C0000001E3FB3840 D000010082187A04 [587330.812770] GPR12: 0000000048000082 C0000000004BEC00 0000000000000000 000000000FA8536C [587330.812790] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [587330.812809] GPR20: 0000000000000000 C0000000005F7ED8 C0000000005F7F40 C000000000606500 [587330.812830] GPR24: C0000001EAE84498 C0000001E3FB3840 C0000001E3FB0000 C0000001E3E08000 [587330.812852] GPR28: 0000000000000100 C0000001E3E08000 D0000000098D4E40 0000000000000000 [587330.812875] NIP [d0000000098bf558] .poll_catas+0x68/0x2f0 [ib_mthca] [587330.812914] LR [c000000000057b2c] .run_timer_softirq+0x15c/0x260 [587330.812932] Call Trace: [587330.812940] [c0000001e3fb3710] [c0000001e3fb37d0] 0xc0000001e3fb37d0 (unreliable) [587330.812959] [c0000001e3fb37d0] [c000000000057b2c] .run_timer_softirq+0x15c/0x260 [587330.812979] [c0000001e3fb3890] [c000000000051e68] .__do_softirq+0xe8/0x1c0 [587330.812997] [c0000001e3fb3950] [c000000000051fc4] .do_softirq+0x84/0x90 [587330.813016] 
[c0000001e3fb39d0] [c0000000000108f0] .timer_interrupt+0xd0/0x41 0 [587330.813036] [c0000001e3fb3ad0] [c00000000000a2b4] decrementer_common+0xb4/0x100 [587330.813052] --- Exception: 901 at .pseries_dedicated_idle+0x108/0x280 [587330.813071] LR = .pseries_dedicated_idle+0x1e0/0x280 [587330.813083] [c0000001e3fb3e90] [c00000000000f460] .cpu_idle+0x40/0x60 [587330.813101] [c0000001e3fb3f00] [c000000000032fa0] .start_secondary+0x120/0x150 [587330.813120] [c0000001e3fb3f90] [c00000000000ba7c] .enable_64b_mode+0x0/0x28 [587330.813136] Instruction dump: [587330.813144] 3be00000 48000020 2fab0000 381f0001 7c1f07b4 409e0058 801d0908 7f9f0040 [587330.813169] 409c00c8 e97d08f8 7be91764 7c6b4a14 <7c001c2c> 0c000000 4c00012c 780b0020 [587330.813193] <0>Kernel panic - not syncing: Fatal exception in interrupt [587330.813208] From robert.j.woodruff at intel.com Thu Oct 27 09:37:29 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Thu, 27 Oct 2005 09:37:29 -0700 Subject: [openib-general] ifup/ifdown scripts don't work with IPoIB In-Reply-To: <20051027161702.GB18189@esmail.cup.hp.com> Message-ID: Grant wrote, >What does it say when you use *192* for the first byte? 
Same thing, I had a typo in the first email,

arping -c 2 -w 3 -D -I ib0 192.168.0.1
ARPING 192.168.0.1 from 0.0.0.0 ib0
Sent 2 probes (2 broadcast(s))
Received -1 response(s)

woody

From rolandd at cisco.com Thu Oct 27 10:03:17 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Thu, 27 Oct 2005 10:03:17 -0700
Subject: [openib-general] Re: ehca testing
In-Reply-To: <20051027163642.GI3275@kalmia.hozed.org> (Troy Benjegerdes's
	message of "Thu, 27 Oct 2005 11:36:42 -0500")
References: <20051020144020.GR30127@kalmia.hozed.org>
	<20051020150432.GS30127@kalmia.hozed.org>
	<52ach4f5ak.fsf@cisco.com>
	<20051020175603.GV30127@kalmia.hozed.org>
	<52sluwdq1b.fsf@cisco.com>
	<20051020220759.GX30127@kalmia.hozed.org>
	<52br1jes5u.fsf@cisco.com>
	<20051027163642.GI3275@kalmia.hozed.org>
Message-ID: <52k6fyyjsa.fsf@cisco.com>

OK, looks like you have two problems.

First of all, you seem to have two versions of ib_mthca, one of which
gets picked up by hotplug on boot and one of which gets picked up by
modprobe. Notice how you don't see the

    dev->ib_dev.node_type = 1

line when mthca runs on boot? The only explanation I can come up with
for that would be that you have an old version of it in an initrd or
something that's screwing things up.

As for the crash in poll_catas, I understand what's going on there.
The catastrophic error polling code is ioremap()ing a PCI address
instead of the correct CPU address. They're different on pSeries but
not on most other architectures, so I didn't see problems in testing.
I'll commit a fix for that problem shortly.

 - R.

From sean.hefty at intel.com Thu Oct 27 10:06:04 2005
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 27 Oct 2005 10:06:04 -0700
Subject: [openib-general] [PATCH] add node_guid to struct ib_device
Message-ID:

Here's a modified version of Roland's original patch that adds only the
node_guid to struct ib_device.

Signed-off-by: Sean Hefty

I'll rework my other patches based on this change.
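
For readers following the diff: it replaces each kmalloc() plus memset() pair with kzalloc(), moves the repeated query MAD header setup into init_query_mad(), and copies 8 bytes from offset 12 of the NodeInfo response into node_guid. A minimal userspace sketch of that pattern is given here. The `sketch_` names, the simplified struct layout, and the constant values are illustrative assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the kernel's struct ib_smp. */
struct sketch_smp {
	uint8_t  base_version;
	uint8_t  mgmt_class;
	uint8_t  class_version;
	uint8_t  method;
	uint16_t attr_id;
	uint32_t attr_mod;
	uint8_t  data[64];
};

/* Illustrative values only; the real constants live in the kernel headers. */
enum {
	SKETCH_CLASS_SUBN_LID_ROUTED = 0x01,
	SKETCH_METHOD_GET            = 0x01
};

/* Userspace analogue of kzalloc(): allocate and zero in one step. */
static void *sketch_zalloc(size_t size)
{
	void *ret = malloc(size);

	if (ret)
		memset(ret, 0, size);
	return ret;
}

/* Mirrors init_query_mad() in the patch: set the fields every query shares. */
static void sketch_init_query_mad(struct sketch_smp *mad)
{
	mad->base_version  = 1;
	mad->mgmt_class    = SKETCH_CLASS_SUBN_LID_ROUTED;
	mad->class_version = 1;
	mad->method        = SKETCH_METHOD_GET;
}

/*
 * Copy the 8-byte node GUID from offset 12 of the response data, as the
 * patch does with memcpy(). The bytes stay in wire (big-endian) order.
 */
static void sketch_copy_node_guid(const struct sketch_smp *out_mad,
				  uint8_t guid[8])
{
	assert(12 + 8 <= sizeof(out_mad->data));
	memcpy(guid, out_mad->data + 12, 8);
}
```

A caller would allocate the request with sketch_zalloc(), call sketch_init_query_mad(), then set only the per-query attr_id and attr_mod, which is the shape each query function takes after the patch.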
Index: include/rdma/ib_verbs.h =================================================================== --- include/rdma/ib_verbs.h (revision 3861) +++ include/rdma/ib_verbs.h (working copy) @@ -951,6 +951,7 @@ u64 uverbs_cmd_mask; int uverbs_abi_ver; + __be64 node_guid; u8 node_type; u8 phys_port_cnt; }; Index: hw/mthca/mthca_dev.h =================================================================== --- hw/mthca/mthca_dev.h (revision 3830) +++ hw/mthca/mthca_dev.h (working copy) @@ -290,7 +290,7 @@ u64 ddr_end; MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock) - struct semaphore cap_mask_mutex; + struct semaphore dev_attr_mutex; void __iomem *hcr; void __iomem *kar; @@ -528,4 +528,17 @@ return dev->mthca_flags & MTHCA_FLAG_MEMFREE; } +/* + * XXX remove once 2.6.14 is released. + */ +static inline void *mthca_kzalloc(size_t size, unsigned int __nocast flags) +{ + void *ret = kmalloc(size, flags); + if (ret) + memset(ret, 0, size); + return ret; +} +#undef kzalloc +#define kzalloc(s, f) mthca_kzalloc(s, f); + #endif /* MTHCA_DEV_H */ Index: hw/mthca/mthca_provider.c =================================================================== --- hw/mthca/mthca_provider.c (revision 3830) +++ hw/mthca/mthca_provider.c (working copy) @@ -45,6 +45,14 @@ #include "mthca_user.h" #include "mthca_memfree.h" +static void init_query_mad(struct ib_smp *mad) +{ + mad->base_version = 1; + mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + mad->class_version = 1; + mad->method = IB_MGMT_METHOD_GET; +} + static int mthca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) { @@ -55,7 +63,7 @@ u8 status; - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); if (!in_mad || !out_mad) goto out; @@ -64,12 +72,8 @@ props->fw_ver = mdev->fw_ver; - memset(in_mad, 0, sizeof *in_mad); - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - 
in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; err = mthca_MAD_IFC(mdev, 1, 1, 1, NULL, NULL, in_mad, out_mad, @@ -127,20 +131,16 @@ int err = -ENOMEM; u8 status; - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); if (!in_mad || !out_mad) goto out; memset(props, 0, sizeof *props); - memset(in_mad, 0, sizeof *in_mad); - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; - in_mad->attr_mod = cpu_to_be32(port); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad, @@ -185,7 +185,7 @@ int err; u8 status; - if (down_interruptible(&to_mdev(ibdev)->cap_mask_mutex)) + if (down_interruptible(&to_mdev(ibdev)->dev_attr_mutex)) return -ERESTARTSYS; err = mthca_query_port(ibdev, port, &attr); @@ -207,7 +207,7 @@ } out: - up(&to_mdev(ibdev)->cap_mask_mutex); + up(&to_mdev(ibdev)->dev_attr_mutex); return err; } @@ -219,18 +219,14 @@ int err = -ENOMEM; u8 status; - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); if (!in_mad || !out_mad) goto out; - memset(in_mad, 0, sizeof *in_mad); - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE; - in_mad->attr_mod = cpu_to_be32(index / 32); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE; + in_mad->attr_mod = cpu_to_be32(index / 32); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad, @@ -258,18 +254,14 @@ 
int err = -ENOMEM; u8 status; - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); if (!in_mad || !out_mad) goto out; - memset(in_mad, 0, sizeof *in_mad); - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; - in_mad->attr_mod = cpu_to_be32(port); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad, @@ -283,13 +275,9 @@ memcpy(gid->raw, out_mad->data + 8, 8); - memset(in_mad, 0, sizeof *in_mad); - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_GUID_INFO; - in_mad->attr_mod = cpu_to_be32(index / 8); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_GUID_INFO; + in_mad->attr_mod = cpu_to_be32(index / 8); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad, @@ -1069,11 +1057,48 @@ &class_device_attr_board_id }; +static int mthca_init_node_data(struct mthca_dev *dev) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; + + err = mthca_MAD_IFC(dev, 1, 1, + 1, NULL, NULL, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(&dev->ib_dev.node_guid, out_mad->data + 12, 8); + +out: + kfree(in_mad); + kfree(out_mad); + return err; +} + int mthca_register_device(struct mthca_dev *dev) { int ret; int i; + ret = mthca_init_node_data(dev); + if 
(ret) + return ret; + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); dev->ib_dev.owner = THIS_MODULE; @@ -1160,7 +1185,7 @@ dev->ib_dev.post_recv = mthca_tavor_post_receive; } - init_MUTEX(&dev->cap_mask_mutex); + init_MUTEX(&dev->dev_attr_mutex); ret = ib_register_device(&dev->ib_dev); if (ret) From halr at voltaire.com Thu Oct 27 10:12:59 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 27 Oct 2005 19:12:59 +0200 Subject: [openib-general] ifup/ifdown scripts don't work with IPoIB Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C7F@taurus.voltaire.com> I think arping needs a minor change to work for IB due to the difference in the HW addresses for IPoIB and other LAN MACs. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Bob Woodruff Sent: Thu 10/27/2005 12:37 PM To: 'Grant Grundler' Cc: openib-general at openib.org Subject: RE: [openib-general] ifup/ifdown scripts don't work with IPoIB Grant wrote, >What does it say when you use *192* for the first byte? Same thing, I had a typo in first email, arping -c 2 -w 3 -D -I ib0 192.168.0.1 ARPING 192.168.0.1 from 0.0.0.0 ib0 Sent 2 probes (2 broadcast(s)) Received -1 response(s) woody _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From higley at dbresearch.net Thu Oct 27 10:28:47 2005 From: higley at dbresearch.net (Jay Higley) Date: Thu, 27 Oct 2005 12:28:47 -0500 Subject: [openib-general] OpenSM causes kernel trap Message-ID: <43610E4F.3030103@dbresearch.net> I am trying to start up opensm on a Dell PowerEdge 2850 with a Mellanox based infiniband card. We are using the x86-64 Architecture. The kernel is recompiled with the latest stack from subversion, and all of the modules load OK. However, when I try to start opensm I get the following error. 
After this, then modules can not be successfully removed from the kernel and opensm is not successfully running. I can send the output from opensm's log file if anyone is interested. Thanks. -Jay Higley Oct 27 12:07:17 riba OpenSM[3321]: OpenSM Rev:openib-1.1.0 Oct 27 12:07:17 riba kernel: Unable to handle kernel paging request at ffffffffffffffff RIP: Oct 27 12:07:17 riba kernel: {kfree+107} Oct 27 12:07:17 riba kernel: PGD 103027 PUD 5619067 PMD 0 Oct 27 12:07:17 riba kernel: Oops: 0000 [1] SMP Oct 27 12:07:17 riba kernel: CPU 3 Oct 27 12:07:17 riba kernel: Modules linked in: nfsd exportfs lockd nfs_acl ipv6 sunrpc ib_uverbs ib_at ib_sdp ib_ucm ib_cm ib_ping ib_mthca ib_umad binfmt_misc dm_mod video thermal processor fan container button battery ac ehci_hcd uhci_hcd pcspkr floppy parport_pc parport ib_ipoib ib_sa ib_mad ib_core e1000 snd_pcm_oss snd_pcm snd_timer snd_page_alloc snd_mixer_oss snd soundcore ext3 jbd megaraid_mbox megaraid_mm sd_mod scsi_mod Oct 27 12:07:17 riba kernel: Pid: 1783, comm: ib_mad1 Not tainted 2.6.13.4-86.caos.smp Oct 27 12:07:17 riba kernel: RIP: 0010:[] {kfree+107} Oct 27 12:07:17 riba kernel: RSP: 0018:ffff81013df97db8 EFLAGS: 00010006 Oct 27 12:07:17 riba kernel: RAX: 0000000000000003 RBX: ffffffffffffffff RCX: ffff81013fd93518 Oct 27 12:07:17 riba kernel: RDX: 0000000000762000 RSI: 0000000000000292 RDI: ffff810004b02028 Oct 27 12:07:17 riba kernel: RBP: ffff81010e000000 R08: ffff81013df96000 R09: 0000000000000000 Oct 27 12:07:17 riba kernel: R10: 0000000000000001 R11: 00000000ffffffff R12: ffff81013e600e10 Oct 27 12:07:17 riba kernel: R13: ffff810037deb000 R14: ffff81013e600e78 R15: ffffffff880e5190 Oct 27 12:07:17 riba kernel: FS: 0000000000000000(0000) GS:ffffffff804f3980(0000) knlGS:0000000000000000 Oct 27 12:07:17 riba kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Oct 27 12:07:17 riba kernel: CR2: ffffffffffffffff CR3: 000000013907a000 CR4: 00000000000006e0 Oct 27 12:07:17 riba kernel: Process ib_mad1 (pid: 1783, 
threadinfo ffff81013df96000, task ffff81013e40a1b0) Oct 27 12:07:17 riba kernel: Stack: 0000000000000286 ffff81013e600e10 ffff81013f3db180 ffffffff880e272e Oct 27 12:07:17 riba kernel: ffff81013df97e28 ffffffff8817113f ffff81013e40a3c8 ffff81013fd93500 Oct 27 12:07:17 riba kernel: ffff81013e600e00 0000000000000292 Oct 27 12:07:17 riba kernel: Call Trace:{:ib_mad:ib_free_send_mad+14} {:ib_umad:send_handler+63} Oct 27 12:07:17 riba kernel: {:ib_mad:timeout_sends+404} {__wake_up+67} Oct 27 12:07:17 riba kernel: {worker_thread+498} {default_wake_function+0} Oct 27 12:07:17 riba kernel: {__wake_up_common+64} {default_wake_function+0} Oct 27 12:07:17 riba kernel: {keventd_create_kthread+0} {worker_thread+0} Oct 27 12:07:17 riba kernel: {keventd_create_kthread+0} {kthread+217} Oct 27 12:07:17 riba kernel: {child_rip+8} {keventd_create_kthread+0} Oct 27 12:07:17 riba kernel: {kthread+0} {child_rip+0} Oct 27 12:07:17 riba kernel: Oct 27 12:07:17 riba kernel: Oct 27 12:07:17 riba kernel: Code: 8b 03 3b 43 04 73 04 89 c0 eb 0a 48 89 de e8 a2 03 00 00 8b Oct 27 12:07:17 riba kernel: RIP {kfree+107} RSP Oct 27 12:07:17 riba kernel: CR2: ffffffffffffffff From rolandd at cisco.com Thu Oct 27 10:34:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 27 Oct 2005 10:34:39 -0700 Subject: [openib-general] Re: ib_mthca panic on PPC64 In-Reply-To: <20051027162336.GH3275@kalmia.hozed.org> (Troy Benjegerdes's message of "Thu, 27 Oct 2005 11:23:36 -0500") References: <20051020144020.GR30127@kalmia.hozed.org> <20051020150432.GS30127@kalmia.hozed.org> <52ach4f5ak.fsf@cisco.com> <20051020175603.GV30127@kalmia.hozed.org> <52sluwdq1b.fsf@cisco.com> <20051020220759.GX30127@kalmia.hozed.org> <52br1jes5u.fsf@cisco.com> <20051027162336.GH3275@kalmia.hozed.org> Message-ID: <52ek66yic0.fsf@cisco.com> OK, the latest svn should work again. - R. 
From rolandd at cisco.com Thu Oct 27 10:39:10 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 27 Oct 2005 10:39:10 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <43610E4F.3030103@dbresearch.net> (Jay Higley's message of "Thu, 27 Oct 2005 12:28:47 -0500") References: <43610E4F.3030103@dbresearch.net> Message-ID: <528xweyi4h.fsf@cisco.com> Sean, looks like your MAD send buf stuff may have broken send timeouts. Any quick ideas before I dig into this? - R. From robert.j.woodruff at intel.com Thu Oct 27 10:42:25 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Thu, 27 Oct 2005 10:42:25 -0700 Subject: [openib-general] ifup/ifdown scripts don't work with IPoIB In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175C7F@taurus.voltaire.com> Message-ID: Hal wrote, >I think arping needs a minor change to work for IB due to the difference in the >HW addresses for IPoIB and other LAN MACs. >-- Hal Yep. That is the conclusion that we came to also. A work around for now, one can just remove the arping check in ifup if the device is an ib device. Not perfect, but allows it to work for ib devices and the normal ifcfg-xxxx scripts. Something like, if [ "x`echo ${REALDEVICE} | sed -e "s/^ib.//"`" != "x" ]; then if ! arping -q -c 2 -w 3 -D -I ${REALDEVICE} ${IPADDR} ; then echo $"Error, some other host already uses address ${IPADDR}." exit 1 fi fi woody From mshefty at ichips.intel.com Thu Oct 27 10:44:50 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 27 Oct 2005 10:44:50 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <528xweyi4h.fsf@cisco.com> References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> Message-ID: <43611212.902@ichips.intel.com> Roland Dreier wrote: > Sean, looks like your MAD send buf stuff may have broken send > timeouts. Any quick ideas before I dig into this? No quick ideas why. I'll start looking into this as well. 
- Sean From mshefty at ichips.intel.com Thu Oct 27 10:51:29 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 27 Oct 2005 10:51:29 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <528xweyi4h.fsf@cisco.com> References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> Message-ID: <436113A1.2020105@ichips.intel.com> Roland Dreier wrote: > Sean, looks like your MAD send buf stuff may have broken send > timeouts. Any quick ideas before I dig into this? I think that the send_handler in user_mad.c is broken. - Sean From jlentini at netapp.com Thu Oct 27 10:54:21 2005 From: jlentini at netapp.com (James Lentini) Date: Thu, 27 Oct 2005 13:54:21 -0400 (EDT) Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <528xweyi4h.fsf@cisco.com> References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> Message-ID: On Thu, 27 Oct 2005, Roland Dreier wrote: > Sean, looks like your MAD send buf stuff may have broken send > timeouts. Any quick ideas before I dig into this? Itamar also had a problem with the MAD layer on x86_64: http://openib.org/pipermail/openib-general/2005-October/013029.html From iod00d at hp.com Thu Oct 27 11:03:57 2005 From: iod00d at hp.com (Grant Grundler) Date: Thu, 27 Oct 2005 11:03:57 -0700 Subject: [openib-general] ifup/ifdown scripts don't work with IPoIB In-Reply-To: References: <20051027161702.GB18189@esmail.cup.hp.com> Message-ID: <20051027180357.GD18189@esmail.cup.hp.com> On Thu, Oct 27, 2005 at 09:37:29AM -0700, Bob Woodruff wrote: > Grant wrote, > >What does it say when you use *192* for the first byte? > > Same thing, I had a typo in first email, > > arping -c 2 -w 3 -D -I ib0 192.168.0.1 > ARPING 192.168.0.1 from 0.0.0.0 ib0 > Sent 2 probes (2 broadcast(s)) > Received -1 response(s) Hrm...wouldn't that be a bug in arping program? How can one get "-1" responses? 
And I can't reproduce that here (ia64-linux): gsyprf3:~# ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:10.0.0.51 Bcast:10.0.0.255 Mask:255.255.255.0 UP BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) gsyprf3:~# arping -c 2 -w 3 -D -I ib0 10.0.0.51 ARPING 10.0.0.51 from 0.0.0.0 ib0 Sent 2 probes (2 broadcast(s)) Received 0 response(s) gsyprf3:~# arping -c 2 -w 3 -D -I ib0 10.0.0.55 ARPING 10.0.0.55 from 0.0.0.0 ib0 Sent 2 probes (2 broadcast(s)) Received 0 response(s) There is no 10.0.0.55 IP in use on this network. I don't understand if the above result is correct or not and I did RTFM. BTW, I'm using Debian "iputils-arping 20020927-2". grant From rolandd at cisco.com Thu Oct 27 11:05:32 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 27 Oct 2005 11:05:32 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <436113A1.2020105@ichips.intel.com> (Sean Hefty's message of "Thu, 27 Oct 2005 10:51:29 -0700") References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> Message-ID: <524q72ygwj.fsf@cisco.com> Sean> I think that the send_handler in user_mad.c is broken. I don't see anything obviously wrong -- in Jay's log, the call to ib_free_send_mad() is crashing. When can it be wrong to do that from the send handler? - R. From robert.j.woodruff at intel.com Thu Oct 27 11:08:38 2005 From: robert.j.woodruff at intel.com (Bob Woodruff) Date: Thu, 27 Oct 2005 11:08:38 -0700 Subject: [openib-general] ifup/ifdown scripts don't work with IPoIB In-Reply-To: <20051027180357.GD18189@esmail.cup.hp.com> Message-ID: Grant wrote, >> >What does it say when you use *192* for the first byte? 
>> >> Same thing, I had a typo in first email, >> >> arping -c 2 -w 3 -D -I ib0 192.168.0.1 >> ARPING 192.168.0.1 from 0.0.0.0 ib0 >> Sent 2 probes (2 broadcast(s)) >> Received -1 response(s) >Hrm...wouldn't that be a bug in arping program? >How can one get "-1" responses? >And I can't reproduce that here (ia64-linux): The theory is that it is a bug in arping (and maybe other raw network frame utilities) not understanding the different size MAC of IPoIB. woody From mshefty at ichips.intel.com Thu Oct 27 11:13:15 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 27 Oct 2005 11:13:15 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <524q72ygwj.fsf@cisco.com> References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> <524q72ygwj.fsf@cisco.com> Message-ID: <436118BB.4010809@ichips.intel.com> Roland Dreier wrote: > Sean> I think that the send_handler in user_mad.c is broken. > > I don't see anything obviously wrong -- in Jay's log, the call to > ib_free_send_mad() is crashing. When can it be wrong to do that from > the send handler? I don't see anything off there either. Timeouts seem to work fine with CM testing, so I'm guessing that the issue is somewhere in user_mad.c. I'm trying to see if there's anything wrong in ib_umad_write() that might cause it to crash on the completion. - Sean From nacc at us.ibm.com Thu Oct 27 12:06:38 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 27 Oct 2005 12:06:38 -0700 Subject: [openib-general] Automated userspace build error In-Reply-To: <1AC79F16F5C5284499BB9591B33D6F0005F0FEFB@orsmsx408> References: <1AC79F16F5C5284499BB9591B33D6F0005F0FEFB@orsmsx408> Message-ID: <20051027190638.GB28730@us.ibm.com> On 26.10.2005 [17:15:05 -0700], Woodruff, Robert J wrote: > Nish wrote, > >On a related note, do you (or anyone else) have any suggestions for > >build-testing all of the userspace components? 
There isn't a top-level > >Makefile of any kind to make it easy :/ > > >Thanks, > >Nish > > If you look at the openib download page, Makia posted a userspace > source RPM, although it is a bit out of date. RPMs aren't necessarily useful, but the means to get there might be. > I also have a similar build procedure that I use > internally, basically building all of the usermode components > and then building an RPM to allow easy installation on other > nodes for testing. There are also .spec files for most of the individual > libraries, if you prefer to build RPMs for individual libraries. > I find it easier just to lump it all into one big usermode component RPM > and > one kernel-mode component RPM. Yes, that's my goal. But I don't necessarily want to install the libraries. Just build them. I will take a look at the SRPM you mentioned above. Thanks, Nish From mshefty at ichips.intel.com Thu Oct 27 12:07:10 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 27 Oct 2005 12:07:10 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <436118BB.4010809@ichips.intel.com> References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> <524q72ygwj.fsf@cisco.com> <436118BB.4010809@ichips.intel.com> Message-ID: <4361255E.7010400@ichips.intel.com> Sean Hefty wrote: > I don't see anything off there either. Timeouts seem to work fine with > CM testing, so I'm guessing that the issue is somewhere in user_mad.c. > I'm trying to see if there's anything wrong in ib_umad_write() that > might cause it to crash on the completion. Re-testing with grmpp, I didn't hit any issues running with or without RMPP. ib_umad_write() can be cleaned up a little, but the only bug I saw was accessing packet->length after calling ib_post_send_mad(). The send_handler() will free the packet, so there's a race there. This doesn't seem related to this crash though.
- Sean From nacc at us.ibm.com Thu Oct 27 12:12:35 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Thu, 27 Oct 2005 12:12:35 -0700 Subject: [openib-general] Re: Automated userspace build error In-Reply-To: <20051027123131.GU4769@mellanox.co.il> References: <20051026235137.GA6369@us.ibm.com> <20051027123131.GU4769@mellanox.co.il> Message-ID: <20051027191235.GE28730@us.ibm.com> On 27.10.2005 [14:31:31 +0200], Michael S. Tsirkin wrote: > Quoting r. Nishanth Aravamudan : > > Subject: Re: Automated userspace build error > > > > On 25.10.2005 [15:22:56 -0700], Roland Dreier wrote: > > > Nishanth> Hrm, well, I'm testing the latest svn (3865), did the > > > Nishanth> patch just get checked in? > > > > > > Yeah, I only noticed it and fixed it after your original email. I > > > just meant that I had already checked it in before sending my reply. > > > Sorry for the confusion... > > > > No worries, I figured that's what happened. > > > > On a related note, do you (or anyone else) have any suggestions for > > build-testing all of the userspace components? There isn't a top-level > > Makefile of any kind to make it easy :/ > > > > Thanks, > > Nish > > Yes, look at scripts in > https://openib.org/svn/trunk/contrib/mellanox/scripts > > You can also, basically, cut and paste stuff from the FAQ page, > but that relies on performing make as root. Which luckily I can do (or is that unluckily -- if I screw up, the machine tends to fall over ;). Thanks for the pointer! -Nish From halr at voltaire.com Thu Oct 27 12:18:29 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 27 Oct 2005 21:18:29 +0200 Subject: [openib-general] OpenSM causes kernel trap Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C84@taurus.voltaire.com> I think that is likely a different issue. 
-- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of James Lentini Sent: Thu 10/27/2005 1:54 PM To: Roland Dreier Cc: openib-general at openib.org Subject: Re: [openib-general] OpenSM causes kernel trap On Thu, 27 Oct 2005, Roland Dreier wrote: > Sean, looks like your MAD send buf stuff may have broken send > timeouts. Any quick ideas before I dig into this? Itamar also had a problem with the MAD layer on x86_64: http://openib.org/pipermail/openib-general/2005-October/013029.html From lindahl at pathscale.com Thu Oct 27 13:24:40 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Thu, 27 Oct 2005 13:24:40 -0700 Subject: [openib-general] [RFC] OpenSM Interactive Console In-Reply-To: <20051027142956.GF3275@kalmia.hozed.org> References: <6AB138A2AB8C8E4A98B9C0C3D52670E361882D@mtlexch01.mtl.com> <20051027142956.GF3275@kalmia.hozed.org> Message-ID: <20051027202440.GA4832@greglaptop.internal.keyresearch.com> On Thu, Oct 27, 2005 at 09:29:57AM -0500, Troy Benjegerdes wrote: > I guess the point of all this is to find an end-user use-case for the SM > MIB, and work back from there to decide if having a MIB actually helps > solve the problem. The end-use case is likely to be something like "an enterprise which insists on managing as much as possible through HP OpenView." Which isn't anyone in HPC, hence the current lack of interest. Now, the things you'd actually want to monitor for a cluster aren't really the normal stuff that's in MIBs. I'd want to know if a cable was unexpectedly unplugged, or if a node was up but its IB connection wasn't. I'd like to know if a link had an unusual error rate.
-- greg From rolandd at cisco.com Thu Oct 27 13:38:25 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 27 Oct 2005 13:38:25 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <4361255E.7010400@ichips.intel.com> (Sean Hefty's message of "Thu, 27 Oct 2005 12:07:10 -0700") References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> <524q72ygwj.fsf@cisco.com> <436118BB.4010809@ichips.intel.com> <4361255E.7010400@ichips.intel.com> Message-ID: <52r7a6wv9a.fsf@cisco.com> Sean> the only bug I saw was accessing packet->length after Sean> calling ib_post_send_mad(). The send_handler() will free Sean> the packet, so there's a race there. Good catch. Seems like the below patch is the right fix: we start out with length = count - sizeof (struct ib_user_mad); and then do packet->length = length; so in return sizeof (struct ib_user_mad_hdr) + packet->length; we're really just returning count -- in ib_user_mad.h, the definition of struct ib_user_mad is: struct ib_user_mad { struct ib_user_mad_hdr hdr; __u8 data[0]; }; so sizeof struct ib_user_mad == struct ib_user_mad_hdr. Hal, am I missing something? Was there any reason to write the return statement like that, or is it OK to just return count directly? - R. 
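As a rough illustration of the two points above, here is a small user-space C sketch (not the kernel code). The names user_mad_hdr, user_mad, packet, post_and_complete and model_write, and the header fields, are hypothetical stand-ins assumed only for this example; the real definitions live in ib_user_mad.h and user_mad.c. It shows that a trailing zero-length data array adds nothing to sizeof, so header size plus payload length equals count, and that returning count avoids reading the packet after the send path may have freed it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct ib_user_mad_hdr; fields are illustrative. */
struct user_mad_hdr {
    unsigned int id;
    unsigned int status;
    unsigned int timeout_ms;
    unsigned int retries;
    unsigned int length;
};

/* Header followed by a trailing array that contributes no size of its own. */
struct user_mad {
    struct user_mad_hdr hdr;
    unsigned char data[];   /* C99 spelling of the GNU data[0] idiom */
};

/* Compile-time check of the size identity discussed above. */
_Static_assert(sizeof(struct user_mad) == sizeof(struct user_mad_hdr),
               "trailing array must not add to sizeof");

/* Hypothetical packet that owns a copy of the payload. */
struct packet {
    size_t length;
    unsigned char payload[];
};

/* Models posting a send whose completion handler frees the packet. */
static void post_and_complete(struct packet *p)
{
    free(p);
}

/* Models a write() handler: returns bytes consumed, or -1 on error. */
static long model_write(const unsigned char *buf, size_t count)
{
    struct packet *p;
    size_t length;

    assert(buf != NULL);
    if (count < sizeof(struct user_mad))
        return -1;

    length = count - sizeof(struct user_mad);
    p = malloc(sizeof *p + length);
    if (!p)
        return -1;

    p->length = length;
    memcpy(p->payload, buf + sizeof(struct user_mad), length);

    post_and_complete(p);   /* p may already be freed; do not read it again */

    /* Equals sizeof(struct user_mad_hdr) + length, without touching p. */
    return (long)count;
}
```

Calling model_write() on a buffer at least as large as the header returns the byte count it was given, and the function never dereferences the packet after handing it off. That is the property the one-line change to the return statement is after.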
--- infiniband/core/user_mad.c (revision 3867) +++ infiniband/core/user_mad.c (working copy) @@ -414,7 +414,7 @@ static ssize_t ib_umad_write(struct file up_read(&file->agent_mutex); - return sizeof (struct ib_user_mad_hdr) + packet->length; + return count; err_msg: ib_free_send_mad(packet->msg); From kingman at austin.rr.com Thu Oct 27 13:41:04 2005 From: kingman at austin.rr.com (John Kingman) Date: Thu, 27 Oct 2005 15:41:04 -0500 (CDT) Subject: [openib-general] [PATCH] [SRP] srp_cm_handler expanded response handling Message-ID: This patch expands the srp_cm_handler code to recognize more response cases and provides a place holder for future code to handle SRP target exceptions such as IB_CM_REJ_CONSUMER_DEFINED with reason code 0x00010002 (requested max_it_iu_len too large). Patch has been tested with our target. Signed-off-by: John Kingman Index: ib_srp.c =================================================================== --- ib_srp.c (revision 3883) +++ ib_srp.c (working copy) @@ -975,6 +975,7 @@ static int srp_cm_handler(struct ib_cm_i struct ib_qp_attr *qp_attr = NULL; int attr_mask = 0; int comp = 0; + int rsp_opcode = 0; switch (event->event) { case IB_CM_REQ_ERROR: @@ -985,17 +986,20 @@ static int srp_cm_handler(struct ib_cm_i case IB_CM_REP_RECEIVED: comp = 1; + rsp_opcode = *(u8 *) event->private_data; - { + if (rsp_opcode == SRP_LOGIN_RSP) { struct srp_login_rsp *rsp = event->private_data; - /* XXX check that opcode is SRP RSP */ - target->max_ti_iu_len = be32_to_cpu(rsp->max_ti_iu_len); target->req_lim = be32_to_cpu(rsp->req_lim_delta); target->scsi_host->can_queue = min(target->req_lim, target->scsi_host->can_queue); + } else { + printk(KERN_WARNING PFX "Unhandled RSP opcode %#x\n", rsp_opcode); + target->status = -ECONNRESET; + break; } target->status = srp_alloc_iu_bufs(target); @@ -1043,7 +1047,8 @@ static int srp_cm_handler(struct ib_cm_i printk(KERN_DEBUG PFX "REJ received\n"); comp = 1; - if (event->param.rej_rcvd.reason == 
IB_CM_REJ_PORT_CM_REDIRECT) { + switch (event->param.rej_rcvd.reason) { + case IB_CM_REJ_PORT_CM_REDIRECT: cpi = event->param.rej_rcvd.ari; target->path.dlid = cpi->redirect_lid; target->path.pkey = cpi->redirect_pkey; @@ -1052,23 +1057,52 @@ static int srp_cm_handler(struct ib_cm_i target->status = target->path.dlid ? SRP_DLID_REDIRECT : SRP_PORT_REDIRECT; - } else if (topspin_workarounds && - !memcmp(&target->ioc_guid, topspin_oui, 3) && - event->param.rej_rcvd.reason == IB_CM_REJ_PORT_REDIRECT) { - /* - * Topspin/Cisco SRP gateways incorrectly send - * reject reason code 25 when they mean 24 - * (port redirect). - */ - memcpy(target->path.dgid.raw, - event->param.rej_rcvd.ari, 16); - - printk(KERN_DEBUG PFX "Topspin/Cisco redirect to target port GID %016llx%016llx\n", - (unsigned long long) be64_to_cpu(target->path.dgid.global.subnet_prefix), - (unsigned long long) be64_to_cpu(target->path.dgid.global.interface_id)); + break; - target->status = SRP_PORT_REDIRECT; - } else { + case IB_CM_REJ_PORT_REDIRECT: + if (topspin_workarounds && + !memcmp(&target->ioc_guid, topspin_oui, 3)) { + /* + * Topspin/Cisco SRP gateways incorrectly send + * reject reason code 25 when they mean 24 + * (port redirect). 
+ */ + memcpy(target->path.dgid.raw, + event->param.rej_rcvd.ari, 16); + + printk(KERN_DEBUG PFX "Topspin/Cisco redirect to target port GID %016llx%016llx\n", + (unsigned long long) be64_to_cpu(target->path.dgid.global.subnet_prefix), + (unsigned long long) be64_to_cpu(target->path.dgid.global.interface_id)); + + target->status = SRP_PORT_REDIRECT; + } else { + printk(KERN_WARNING " REJ reason: IB_CM_REJ_PORT_REDIRECT\n"); + target->status = -ECONNRESET; + } + break; + + case IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID: + printk(KERN_WARNING " REJ reason: IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID\n"); + target->status = -ECONNRESET; + break; + + case IB_CM_REJ_CONSUMER_DEFINED: + if(*(u8 *) event->private_data == SRP_LOGIN_REJ) { + struct srp_login_rej *rej = event->private_data; + u32 reason = be32_to_cpu(rej->reason); + + if (reason == 0x00010002) + printk(KERN_WARNING PFX + "SRP_LOGIN_REJ: requested max_it_iu_len too large\n"); + else + printk(KERN_WARNING PFX + "SRP LOGIN REJECTED, reason 0x%8.8x\n", reason); + } else + printk(KERN_WARNING " REJ reason: IB_CM_REJ_CONSUMER_DEFINED\n"); + target->status = -ECONNRESET; + break; + + default: printk(KERN_WARNING " REJ reason 0x%x\n", event->param.rej_rcvd.reason); target->status = -ECONNRESET; From rolandd at cisco.com Thu Oct 27 13:47:39 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 27 Oct 2005 13:47:39 -0700 Subject: [openib-general] Re: [PATCH] [SRP] srp_cm_handler expanded response handling In-Reply-To: (John Kingman's message of "Thu, 27 Oct 2005 15:41:04 -0500 (CDT)") References: Message-ID: <52irviwutw.fsf@cisco.com> Looks good, except: > + if (reason == 0x00010002) can you add enums for all these SRP_LOGIN_REJ reason codes rather than open-coding this magic number here? 
Thanks, Roland From kingman at storagegear.com Thu Oct 27 14:18:07 2005 From: kingman at storagegear.com (John Kingman) Date: Thu, 27 Oct 2005 16:18:07 -0500 (CDT) Subject: [openib-general] Re: [PATCH] [SRP] srp_cm_handler expanded response handling In-Reply-To: <52irviwutw.fsf@cisco.com> References: <52irviwutw.fsf@cisco.com> Message-ID: On Thu, 27 Oct 2005, Roland Dreier wrote: >Looks good, except: > > > + if (reason == 0x00010002) > >can you add enums for all these SRP_LOGIN_REJ reason codes rather than >open-coding this magic number here? OK. Thanks, John Signed-off-by: John Kingman Index: ib_srp.h =================================================================== --- ib_srp.h (revision 3884) +++ ib_srp.h (working copy) @@ -76,6 +76,16 @@ enum srp_target_state { SRP_TARGET_REMOVED }; +enum srp_login_rej_reason { + SRP_UNABLE_ESTABLISH_CHANNEL = 0x00010000, + SRP_INSUFFICIENT_RESOURCES = 0x00010001, + SRP_REQ_IT_IU_LENGTH_TOO_LARGE = 0x00010002, + SRP_UNABLE_ASSOCIATE_CHANNEL = 0x00010003, + SRP_UNSUPPORTED_DESCRIPTOR_FMT = 0x00010004, + SRP_MULTI_CHANNEL_UNSUPPORTED = 0x00010005, + SRP_CHANNEL_LIMIT_REACHED = 0x00010006 +}; + struct srp_host { u8 initiator_port_id[16]; struct ib_device *dev; Index: ib_srp.c =================================================================== --- ib_srp.c (revision 3883) +++ ib_srp.c (working copy) @@ -975,6 +975,7 @@ static int srp_cm_handler(struct ib_cm_i struct ib_qp_attr *qp_attr = NULL; int attr_mask = 0; int comp = 0; + int rsp_opcode = 0; switch (event->event) { case IB_CM_REQ_ERROR: @@ -985,17 +986,20 @@ static int srp_cm_handler(struct ib_cm_i case IB_CM_REP_RECEIVED: comp = 1; + rsp_opcode = *(u8 *) event->private_data; - { + if (rsp_opcode == SRP_LOGIN_RSP) { struct srp_login_rsp *rsp = event->private_data; - /* XXX check that opcode is SRP RSP */ - target->max_ti_iu_len = be32_to_cpu(rsp->max_ti_iu_len); target->req_lim = be32_to_cpu(rsp->req_lim_delta); target->scsi_host->can_queue = min(target->req_lim, 
target->scsi_host->can_queue); + } else { + printk(KERN_WARNING PFX "Unhandled RSP opcode %#x\n", rsp_opcode); + target->status = -ECONNRESET; + break; } target->status = srp_alloc_iu_bufs(target); @@ -1043,7 +1047,8 @@ static int srp_cm_handler(struct ib_cm_i printk(KERN_DEBUG PFX "REJ received\n"); comp = 1; - if (event->param.rej_rcvd.reason == IB_CM_REJ_PORT_CM_REDIRECT) { + switch (event->param.rej_rcvd.reason) { + case IB_CM_REJ_PORT_CM_REDIRECT: cpi = event->param.rej_rcvd.ari; target->path.dlid = cpi->redirect_lid; target->path.pkey = cpi->redirect_pkey; @@ -1052,23 +1057,52 @@ static int srp_cm_handler(struct ib_cm_i target->status = target->path.dlid ? SRP_DLID_REDIRECT : SRP_PORT_REDIRECT; - } else if (topspin_workarounds && - !memcmp(&target->ioc_guid, topspin_oui, 3) && - event->param.rej_rcvd.reason == IB_CM_REJ_PORT_REDIRECT) { - /* - * Topspin/Cisco SRP gateways incorrectly send - * reject reason code 25 when they mean 24 - * (port redirect). - */ - memcpy(target->path.dgid.raw, - event->param.rej_rcvd.ari, 16); - - printk(KERN_DEBUG PFX "Topspin/Cisco redirect to target port GID %016llx%016llx\n", - (unsigned long long) be64_to_cpu(target->path.dgid.global.subnet_prefix), - (unsigned long long) be64_to_cpu(target->path.dgid.global.interface_id)); + break; - target->status = SRP_PORT_REDIRECT; - } else { + case IB_CM_REJ_PORT_REDIRECT: + if (topspin_workarounds && + !memcmp(&target->ioc_guid, topspin_oui, 3)) { + /* + * Topspin/Cisco SRP gateways incorrectly send + * reject reason code 25 when they mean 24 + * (port redirect). 
+ */ + memcpy(target->path.dgid.raw, + event->param.rej_rcvd.ari, 16); + + printk(KERN_DEBUG PFX "Topspin/Cisco redirect to target port GID %016llx%016llx\n", + (unsigned long long) be64_to_cpu(target->path.dgid.global.subnet_prefix), + (unsigned long long) be64_to_cpu(target->path.dgid.global.interface_id)); + + target->status = SRP_PORT_REDIRECT; + } else { + printk(KERN_WARNING " REJ reason: IB_CM_REJ_PORT_REDIRECT\n"); + target->status = -ECONNRESET; + } + break; + + case IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID: + printk(KERN_WARNING " REJ reason: IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID\n"); + target->status = -ECONNRESET; + break; + + case IB_CM_REJ_CONSUMER_DEFINED: + if(*(u8 *) event->private_data == SRP_LOGIN_REJ) { + struct srp_login_rej *rej = event->private_data; + u32 reason = be32_to_cpu(rej->reason); + + if (reason == SRP_REQ_IT_IU_LENGTH_TOO_LARGE) + printk(KERN_WARNING PFX + "SRP_LOGIN_REJ: requested max_it_iu_len too large\n"); + else + printk(KERN_WARNING PFX + "SRP LOGIN REJECTED, reason 0x%8.8x\n", reason); + } else + printk(KERN_WARNING " REJ reason: IB_CM_REJ_CONSUMER_DEFINED\n"); + target->status = -ECONNRESET; + break; + + default: printk(KERN_WARNING " REJ reason 0x%x\n", event->param.rej_rcvd.reason); target->status = -ECONNRESET; From mshefty at ichips.intel.com Thu Oct 27 14:34:59 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 27 Oct 2005 14:34:59 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <52r7a6wv9a.fsf@cisco.com> References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> <524q72ygwj.fsf@cisco.com> <436118BB.4010809@ichips.intel.com> <4361255E.7010400@ichips.intel.com> <52r7a6wv9a.fsf@cisco.com> Message-ID: <43614803.6080308@ichips.intel.com> Roland Dreier wrote: > Good catch. Seems like the below patch is the right fix: > we start out with Fix looks right to me. > packet->length = length; I don't think that this assignment is needed. 
Once the packet is sent, it is simply freed. - Sean From Arkady.Kanevsky at netapp.com Thu Oct 27 14:46:47 2005 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Thu, 27 Oct 2005 17:46:47 -0400 Subject: [openib-general] ping over IPoIB does not work between 2 cards on the same host Message-ID: I have a host with 2 HCAs (dual port each, but I only connected one port per machine) connected to a switch. With IPoIB configured, pinging each card's own IP address works. I can also ping other machines whose HCA cards are configured with IPoIB, and I can ping both local IP addresses from the remote machine(s). Details: ifconfig ib1 192.168.0.1 netmask 255.255.0.0 ifconfig ib3 192.168.0.3 netmask 255.255.0.0 On remote machine: ifconfig ib0 192.168.1.0 netmask 255.255.0.0 Locally: ping -I ib3 192.168.0.3 PING 192.168.0.3 (192.168.97.3) from 192.168.0.3 ib3: 56(84) bytes of data. 64 bytes from 192.168.0.3: icmp_seq=0 ttl=64 time=0.028 ms ping -I ib1 192.168.0.1 PING 192.168.0.1 (192.168.97.1) from 192.168.0.1 ib1: 56(84) bytes of data. 64 bytes from 192.168.0.1: icmp_seq=0 ttl=64 time=0.028 ms # ping -I ib3 192.168.1.0 PING 192.168.1.0 (192.168.1.0) from 192.168.0.3 ib3: 56(84) bytes of data. 64 bytes from 192.168.1.0: icmp_seq=0 ttl=64 time=1.81 ms From remote host: # ping -I ib0 192.168.0.1 PING 192.168.0.1 (192.168.0.1) from 192.168.1.0 ib0: 56(84) bytes of data. 64 bytes from 192.168.0.1: icmp_seq=0 ttl=64 time=0.086 ms # ping -I ib0 192.168.0.3 PING 192.168.0.3 (192.168.0.3) from 192.168.1.0 ib0: 56(84) bytes of data. 64 bytes from 192.168.0.1: icmp_seq=0 ttl=64 time=0.086 ms Locally between 2 cards: # ping -I ib3 192.168.0.1 PING 192.168.0.1 (192.168.0.1) from 192.168.0.3 ib3: 56(84) bytes of data. From 192.168.0.3 icmp_seq=1 Destination Host Unreachable From 192.168.0.3 icmp_seq=2 Destination Host Unreachable From 192.168.0.3 icmp_seq=3 Destination Host Unreachable Arkady Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. 
phone: 781-768-5395 275 Totten Pond Rd. Fax: 781-895-1195 Waltham, MA 02451-2010 central phone: 781-768-5300 From rolandd at cisco.com Thu Oct 27 15:17:26 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 27 Oct 2005 15:17:26 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <43614803.6080308@ichips.intel.com> (Sean Hefty's message of "Thu, 27 Oct 2005 14:34:59 -0700") References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> <524q72ygwj.fsf@cisco.com> <436118BB.4010809@ichips.intel.com> <4361255E.7010400@ichips.intel.com> <52r7a6wv9a.fsf@cisco.com> <43614803.6080308@ichips.intel.com> Message-ID: <52vezivc3t.fsf@cisco.com> OK, I think I found it. The problem was that ib_umad_write() wrote through packet->msg in a few places where it should have used packet->msg->mad, and therefore corrupted the address of the buffer. I'll commit the patch below in a little while, which fixes this issue and the packet->length race that Sean spotted, unless someone sees a problem with it: --- infiniband/core/user_mad.c (revision 3867) +++ infiniband/core/user_mad.c (working copy) @@ -297,8 +297,6 @@ static ssize_t ib_umad_write(struct file goto err; } - packet->length = length; - down_read(&file->agent_mutex); agent = file->agent[packet->mad.hdr.id]; @@ -398,12 +396,12 @@ static ssize_t ib_umad_write(struct file * transaction ID matches the agent being used to send the * MAD. 
*/ - method = ((struct ib_mad_hdr *) packet->msg)->method; + method = ((struct ib_mad_hdr *) packet->msg->mad)->method; if (!(method & IB_MGMT_METHOD_RESP) && method != IB_MGMT_METHOD_TRAP_REPRESS && method != IB_MGMT_METHOD_SEND) { - tid = &((struct ib_mad_hdr *) packet->msg)->tid; + tid = &((struct ib_mad_hdr *) packet->msg->mad)->tid; *tid = cpu_to_be64(((u64) agent->hi_tid) << 32 | (be64_to_cpup(tid) & 0xffffffff)); } @@ -414,7 +412,7 @@ static ssize_t ib_umad_write(struct file up_read(&file->agent_mutex); - return sizeof (struct ib_user_mad_hdr) + packet->length; + return count; err_msg: ib_free_send_mad(packet->msg); From rolandd at cisco.com Thu Oct 27 15:19:11 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 27 Oct 2005 15:19:11 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <52vezivc3t.fsf@cisco.com> (Roland Dreier's message of "Thu, 27 Oct 2005 15:17:26 -0700") References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> <524q72ygwj.fsf@cisco.com> <436118BB.4010809@ichips.intel.com> <4361255E.7010400@ichips.intel.com> <52r7a6wv9a.fsf@cisco.com> <43614803.6080308@ichips.intel.com> <52vezivc3t.fsf@cisco.com> Message-ID: <52r7a6vc0w.fsf@cisco.com> BTW, Jay, can you confirm that this patch fixes your problem too? Thanks, Roland From hbchen at lanl.gov Thu Oct 27 15:33:35 2005 From: hbchen at lanl.gov (Hb Chen) Date: Thu, 27 Oct 2005 16:33:35 -0600 Subject: [openib-general] Boot over IB - support in Bproc status? Message-ID: <436155BF.4000305@lanl.gov> Hi, Can anyone point out the current status of Boot over IB - support in Bproc? Also, what other solutions are there for "mass boot over IB" now? (openSM, SRP...) Thanks. HB LANL CCN-9 From gshipman at lanl.gov Thu Oct 27 15:33:47 2005 From: gshipman at lanl.gov (Galen M. Shipman) Date: Thu, 27 Oct 2005 16:33:47 -0600 Subject: [openib-general] SRQ limit reached async event. 
Message-ID: <1C361243-6C9D-432A-9763-B766580D7C49@lanl.gov> Hello, Does anyone know if openib supports the SRQ limit asynchronous event? I am working with Mellanox verbs right now and it doesn't seem to support this. I say this because I have to set the srq_limit attribute via VAPI_modify_srq in order to get the event; unfortunately, when I call VAPI_modify_srq I get: error in VAPI_modify_srq: Not implemented Any insight is appreciated. Thanks, Galen From sean.hefty at intel.com Thu Oct 27 15:41:37 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 27 Oct 2005 15:41:37 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <52vezivc3t.fsf@cisco.com> Message-ID: >OK, I think I found it. The problem was that ib_umad_write() wrote >through packet->msg in a few places where it should have used >packet->msg->mad, and therefore corrupted the address of the buffer. Yep - that appears to be the issue. I've attached another patch that includes your fixes, plus adds some additional code cleanup. 
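[Illustrative note] The bug class discussed above (writing through the wrapper pointer packet->msg instead of the payload pointer packet->msg->mad) can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel code: struct mad_hdr, struct send_buf and set_hi_tid() are hypothetical stand-ins, and the header keeps only the fields needed to place the transaction ID at byte offset 8.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified MAD header: only enough fields to put the
 * 8-byte transaction ID at offset 8, stored big-endian. */
struct mad_hdr {
    uint8_t  base_version;
    uint8_t  mgmt_class;
    uint8_t  class_version;
    uint8_t  method;
    uint16_t status;
    uint16_t class_specific;
    uint8_t  tid[8];
};

/* Hypothetical send-buffer wrapper: it points at the MAD, it is not the MAD. */
struct send_buf {
    struct send_buf *next;
    void            *mad;   /* payload buffer that begins with struct mad_hdr */
};

/* Put hi_tid in the upper 32 bits of the TID, keeping the lower 32 bits. */
static void set_hi_tid(struct send_buf *msg, uint32_t hi_tid)
{
    /* Go through msg->mad. Casting msg itself to struct mad_hdr * would
     * overwrite the wrapper's own fields (here, bytes at offset 8) instead
     * of the header in the payload buffer. */
    struct mad_hdr *hdr = msg->mad;

    hdr->tid[0] = (uint8_t)(hi_tid >> 24);
    hdr->tid[1] = (uint8_t)(hi_tid >> 16);
    hdr->tid[2] = (uint8_t)(hi_tid >> 8);
    hdr->tid[3] = (uint8_t)hi_tid;
}
```

In this model, on a typical 64-bit layout the mad pointer sits at offset 8 of struct send_buf, the same offset as tid in struct mad_hdr, which is consistent with the "corrupted the address of the buffer" symptom described in the thread.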
Signed-off-by: Sean Hefty Index: user_mad.c =================================================================== --- user_mad.c (revision 3861) +++ user_mad.c (working copy) @@ -99,7 +99,6 @@ struct ib_mad_send_buf *msg; struct list_head list; int length; - DECLARE_PCI_UNMAP_ADDR(mapping) struct ib_user_mad mad; }; @@ -138,24 +137,23 @@ struct ib_mad_send_wc *send_wc) { struct ib_umad_file *file = agent->context; - struct ib_umad_packet *timeout, *packet = send_wc->send_buf->context[0]; + struct ib_umad_packet *timeout; + struct ib_umad_packet *packet = send_wc->send_buf->context[0]; ib_destroy_ah(packet->msg->ah); ib_free_send_mad(packet->msg); if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) { - timeout = kmalloc(sizeof *timeout + sizeof (struct ib_mad_hdr), - GFP_KERNEL); + timeout = kmalloc(sizeof *timeout + IB_MGMT_MAD_HDR, GFP_KERNEL); if (!timeout) goto out; - memset(timeout, 0, sizeof *timeout + sizeof (struct ib_mad_hdr)); + memset(timeout, 0, sizeof *timeout + IB_MGMT_MAD_HDR); - timeout->length = sizeof (struct ib_mad_hdr); + timeout->length = IB_MGMT_MAD_HDR; timeout->mad.hdr.id = packet->mad.hdr.id; timeout->mad.hdr.status = ETIMEDOUT; - memcpy(timeout->mad.data, packet->mad.data, - sizeof (struct ib_mad_hdr)); + memcpy(timeout->mad.data, packet->mad.data, IB_MGMT_MAD_HDR); if (!queue_packet(file, agent, timeout)) return; @@ -245,7 +243,7 @@ else ret = -ENOSPC; } else if (copy_to_user(buf, &packet->mad, - packet->length + sizeof (struct ib_user_mad))) + packet->length + sizeof (struct ib_user_mad))) ret = -EFAULT; else ret = packet->length + sizeof (struct ib_user_mad); @@ -270,22 +268,19 @@ struct ib_rmpp_mad *rmpp_mad; u8 method; __be64 *tid; - int ret, length, hdr_len, rmpp_hdr_size; + int ret, length, hdr_len, copy_offset; int rmpp_active = 0; if (count < sizeof (struct ib_user_mad)) return -EINVAL; length = count - sizeof (struct ib_user_mad); - packet = kmalloc(sizeof *packet + sizeof(struct ib_mad_hdr) + - sizeof (struct ib_rmpp_hdr), GFP_KERNEL); 
+ packet = kmalloc(sizeof *packet + IB_MGMT_RMPP_HDR, GFP_KERNEL); if (!packet) return -ENOMEM; if (copy_from_user(&packet->mad, buf, - sizeof (struct ib_user_mad) + - sizeof (struct ib_mad_hdr) + - sizeof (struct ib_rmpp_hdr))) { + sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR)) { ret = -EFAULT; goto err; } @@ -296,8 +291,6 @@ goto err; } - packet->length = length; - down_read(&file->agent_mutex); agent = file->agent[packet->mad.hdr.id]; @@ -344,12 +337,10 @@ goto err_ah; } rmpp_active = 1; + copy_offset = IB_MGMT_RMPP_HDR; } else { - if (length > sizeof (struct ib_mad)) { - ret = -EINVAL; - goto err_ah; - } hdr_len = IB_MGMT_MAD_HDR; + copy_offset = IB_MGMT_MAD_HDR; } packet->msg = ib_create_send_mad(agent, @@ -363,32 +354,18 @@ } packet->msg->ah = ah; - packet->msg->timeout_ms = packet->mad.hdr.timeout_ms; + packet->msg->timeout_ms = packet->mad.hdr.timeout_ms; packet->msg->retries = packet->mad.hdr.retries; packet->msg->context[0] = packet; - if (!rmpp_active) { - /* Copy message from user into send buffer */ - if (copy_from_user(packet->msg->mad, - buf + sizeof (struct ib_user_mad), length)) { - ret = -EFAULT; - goto err_msg; - } - } else { - rmpp_hdr_size = sizeof (struct ib_mad_hdr) + - sizeof (struct ib_rmpp_hdr); - - /* Only copy MAD headers (RMPP header in place) */ - memcpy(packet->msg->mad, packet->mad.data, - sizeof (struct ib_mad_hdr)); - - /* Now, copy rest of message from user into send buffer */ - if (copy_from_user(((struct ib_rmpp_mad *) packet->msg->mad)->data, - buf + sizeof (struct ib_user_mad) + rmpp_hdr_size, - length - rmpp_hdr_size)) { - ret = -EFAULT; - goto err_msg; - } + /* Copy MAD headers (RMPP header in place) */ + memcpy(packet->msg->mad, packet->mad.data, IB_MGMT_MAD_HDR); + /* Now, copy rest of message from user into send buffer */ + if (copy_from_user(packet->msg->mad + copy_offset, + buf + sizeof (struct ib_user_mad) + copy_offset, + length - copy_offset)) { + ret = -EFAULT; + goto err_msg; } /* @@ -397,12 +374,12 @@ * 
transaction ID matches the agent being used to send the * MAD. */ - method = ((struct ib_mad_hdr *) packet->msg)->method; + method = ((struct ib_mad_hdr *) packet->msg->mad)->method; if (!(method & IB_MGMT_METHOD_RESP) && method != IB_MGMT_METHOD_TRAP_REPRESS && method != IB_MGMT_METHOD_SEND) { - tid = &((struct ib_mad_hdr *) packet->msg)->tid; + tid = &((struct ib_mad_hdr *) packet->msg->mad)->tid; *tid = cpu_to_be64(((u64) agent->hi_tid) << 32 | (be64_to_cpup(tid) & 0xffffffff)); } @@ -413,17 +390,14 @@ up_read(&file->agent_mutex); - return sizeof (struct ib_user_mad_hdr) + packet->length; + return count; err_msg: ib_free_send_mad(packet->msg); - err_ah: ib_destroy_ah(ah); - err_up: up_read(&file->agent_mutex); - err: kfree(packet); return ret; From rolandd at cisco.com Thu Oct 27 15:43:48 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 27 Oct 2005 15:43:48 -0700 Subject: [openib-general] SRQ limit reached async event. In-Reply-To: <1C361243-6C9D-432A-9763-B766580D7C49@lanl.gov> (Galen M. Shipman's message of "Thu, 27 Oct 2005 16:33:47 -0600") References: <1C361243-6C9D-432A-9763-B766580D7C49@lanl.gov> Message-ID: <52mzkuvavv.fsf@cisco.com> Galen> Does anyone now if openib supports the SRQ limit Galen> asynchronous event? Yes, openib verbs and the mthca driver supports this. However, with current firmware, you will only receive this event for mem-free HCAs (firmware versions 5.x and 1.x). For mem-ful HCAs (firmware versions 3.x and 4.x), you will need to use as-yet-unreleased firmware for the event to be generated. - R. From bohra at cs.rutgers.edu Thu Oct 27 17:26:10 2005 From: bohra at cs.rutgers.edu (Aniruddha Bohra) Date: Thu, 27 Oct 2005 20:26:10 -0400 Subject: [openib-general] OpenSM crash with today's trunk Message-ID: <43617022.7000803@cs.rutgers.edu> Hello, I updated the OpenIB stack today and I get the following error on starting OpenSM. 
The verbose log is available at http://www.cs.rutgers.edu/~bohra/osm-v.log # opensm -V -d10 -r ------------------------------------------------- OpenSM Rev:openib-1.1.0 Command Line Arguments: Big V selected d level = 0xa Reassign LIDs Log File: /var/log/osm.log ------------------------------------------------- OpenSM Rev:openib-1.1.0 Using default guid 0x2c901081e7471 Error from osm_opensm_bind (0x2A) Exiting SM Segmentation fault Please let me know what I can do to debug this. Thanks Aniruddha From rolandd at cisco.com Thu Oct 27 20:35:59 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 27 Oct 2005 20:35:59 -0700 Subject: [openib-general] OpenSM crash with today's trunk In-Reply-To: <43617022.7000803@cs.rutgers.edu> (Aniruddha Bohra's message of "Thu, 27 Oct 2005 20:26:10 -0400") References: <43617022.7000803@cs.rutgers.edu> Message-ID: <52irviuxcw.fsf@cisco.com> I believe that this is in r3889. - R. From rolandd at cisco.com Thu Oct 27 20:49:16 2005 From: rolandd at cisco.com (Roland Dreier) Date: Thu, 27 Oct 2005 20:49:16 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: (Sean Hefty's message of "Thu, 27 Oct 2005 15:41:37 -0700") References: Message-ID: <52ek66uwqr.fsf@cisco.com> Thanks, I committed just the packet->msg => packet->msg->mad fix as one changeset, and the rest of this patch (along with some kmalloc()+memset() => kzalloc() cleanups now that 2.6.14 is out) as a second changeset. - R. From tziporet at mellanox.co.il Fri Oct 28 00:55:30 2005 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Fri, 28 Oct 2005 09:55:30 +0200 Subject: [openib-general] SRQ limit reached async event. Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E35ABB83@mtlexch01.mtl.com> Which HCA are you using? Till lately SRQ limit event was supported only for mem-free HCAs. 
Now it is supported for full-mem too, but you need a special FW for this (the 4.7.400 release will be out next week). In gen2 it is already implemented for both types of HCAs, and it will work if you have the correct FW. In VAPI (gen1) you need an update of VAPI for this, since we blocked it for full-mem cards. Tziporet -----Original Message----- From: Galen M. Shipman [mailto:gshipman at lanl.gov] Sent: Friday, October 28, 2005 12:34 AM To: openib-general at openib.org Subject: [openib-general] SRQ limit reached async event. Hello, Does anyone know if openib supports the SRQ limit asynchronous event? I am working with Mellanox verbs right now and it doesn't seem to support this. I say this because I have to set the srq_limit attribute via VAPI_modify_srq in order to get the event; unfortunately, when I call VAPI_modify_srq I get: error in VAPI_modify_srq: Not implemented Any insight is appreciated. Thanks, Galen From bohra at cs.rutgers.edu Fri Oct 28 04:57:18 2005 From: bohra at cs.rutgers.edu (Aniruddha Bohra) Date: Fri, 28 Oct 2005 07:57:18 -0400 Subject: [openib-general] OpenSM crash with today's trunk In-Reply-To: <52irviuxcw.fsf@cisco.com> References: <43617022.7000803@cs.rutgers.edu> <52irviuxcw.fsf@cisco.com> Message-ID: <4362121E.3090500@cs.rutgers.edu> Roland Dreier wrote: >I believe that this is in r3889. > > - R. > > I tried with r3888 and r3891 with the same result. 
Aniruddha From suri at baymicrosystems.com Fri Oct 28 05:40:09 2005 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Fri, 28 Oct 2005 08:40:09 -0400 Subject: [openib-general] IB traffic generators In-Reply-To: <43614803.6080308@ichips.intel.com> Message-ID: <200510281240.j9SCe987022114@ns1.baymicrosystems.com> Folks: can you please point me to some traffic generators out there? Thanks, Suri From halr at voltaire.com Fri Oct 28 05:51:15 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 28 Oct 2005 14:51:15 +0200 Subject: [openib-general] IB traffic generators Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175C9D@taurus.voltaire.com> Hi Suri, The only traffic generator I am aware of is from Agilent (E2950 series), but they discontinued their IB support a while ago. I'm not sure if it is still available from them. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Suresh Shelvapille Sent: Fri 10/28/2005 8:40 AM To: openib-general at openib.org Subject: [openib-general] IB traffic generators Folks: can you please point me to some traffic generators out there? Thanks, Suri From ebiederman at lnxi.com Fri Oct 28 06:25:12 2005 From: ebiederman at lnxi.com (Eric W. Biederman) Date: Fri, 28 Oct 2005 07:25:12 -0600 Subject: [openib-general] Boot over IB - support in Bproc status? In-Reply-To: <436155BF.4000305@lanl.gov> (Hb Chen's message of "Thu, 27 Oct 2005 16:33:35 -0600") References: <436155BF.4000305@lanl.gov> Message-ID: Hb Chen writes: > Hi, > Can anyone point out the current status of Boot over IB - support in Bproc? We have it working here :) kexec appears to work fine with the openIB stack. 
The raw packet interfaces are a little more difficult to use in the kernel because of the long MAC address. But no real problems. > Also what is the other solution about "mass boot over IB" now? (openSM, SRP...) Eric From rolandd at cisco.com Fri Oct 28 07:47:28 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 28 Oct 2005 07:47:28 -0700 Subject: [openib-general] OpenSM crash with today's trunk In-Reply-To: <4362121E.3090500@cs.rutgers.edu> (Aniruddha Bohra's message of "Fri, 28 Oct 2005 07:57:18 -0400") References: <43617022.7000803@cs.rutgers.edu> <52irviuxcw.fsf@cisco.com> <4362121E.3090500@cs.rutgers.edu> Message-ID: <52zmotu29r.fsf@cisco.com> Aniruddha> I tried with r3888 and r3891 with the same result. Oh well, I guess this is a different bug. Is there an oops or anything in your kernel log, or is this just a userspace crash? If it's just opensm crashing then I'm not much use in debugging. - R. From rolandd at cisco.com Fri Oct 28 07:53:48 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 28 Oct 2005 07:53:48 -0700 Subject: [openib-general] Re: [PATCH] add node_guid to struct ib_device In-Reply-To: (Sean Hefty's message of "Thu, 27 Oct 2005 10:06:04 -0700") References: Message-ID: <52vezhu1z7.fsf@cisco.com> Thanks, I applied the following version (doesn't add a private kzalloc() now that 2.6.14 is out and doesn't rename cap_mask_mutex). By the way, the ipath and ehca drivers will need something similar. - R. 
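[Illustrative note] The cleanup pattern mentioned above (replacing kmalloc()+memset() with a zeroing allocator, and factoring the repeated query setup into one helper) can be sketched in userspace C with calloc() standing in for kzalloc(). This is a sketch under assumptions: struct smp_query, init_query() and alloc_query() are hypothetical names, the struct is a cut-down stand-in for the real SMP layout, and the constant values are used here for illustration only.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative values for the subnet-management class and the Get method. */
enum {
    MGMT_CLASS_SUBN_LID_ROUTED = 0x01,
    MGMT_METHOD_GET            = 0x01
};

/* Hypothetical, cut-down stand-in for a subnet management packet. */
struct smp_query {
    uint8_t  base_version;
    uint8_t  mgmt_class;
    uint8_t  class_version;
    uint8_t  method;
    uint16_t attr_id;
    uint32_t attr_mod;
};

/* Fill in the fields that every query shares. */
static void init_query(struct smp_query *q)
{
    q->base_version  = 1;
    q->mgmt_class    = MGMT_CLASS_SUBN_LID_ROUTED;
    q->class_version = 1;
    q->method        = MGMT_METHOD_GET;
}

/* Allocate a zeroed query (calloc replaces malloc + memset), then set the
 * common fields and the per-query attribute. Returns NULL on failure;
 * the caller frees the result with free(). */
static struct smp_query *alloc_query(uint16_t attr_id, uint32_t attr_mod)
{
    struct smp_query *q = calloc(1, sizeof *q);

    if (!q)
        return NULL;
    init_query(q);
    q->attr_id  = attr_id;
    q->attr_mod = attr_mod;
    return q;
}
```

A caller would then write something like alloc_query(attr, port) once per query instead of repeating the five field assignments at every call site.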
--- include/rdma/ib_verbs.h (revision 3861) +++ include/rdma/ib_verbs.h (working copy) @@ -951,6 +951,7 @@ u64 uverbs_cmd_mask; int uverbs_abi_ver; + __be64 node_guid; u8 node_type; u8 phys_port_cnt; }; --- hw/mthca/mthca_provider.c (revision 3830) +++ hw/mthca/mthca_provider.c (working copy) @@ -45,6 +45,14 @@ #include "mthca_user.h" #include "mthca_memfree.h" +static void init_query_mad(struct ib_smp *mad) +{ + mad->base_version = 1; + mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; + mad->class_version = 1; + mad->method = IB_MGMT_METHOD_GET; +} + static int mthca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) { @@ -55,7 +63,7 @@ u8 status; - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); if (!in_mad || !out_mad) goto out; @@ -64,12 +72,8 @@ props->fw_ver = mdev->fw_ver; - memset(in_mad, 0, sizeof *in_mad); - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; err = mthca_MAD_IFC(mdev, 1, 1, 1, NULL, NULL, in_mad, out_mad, @@ -127,20 +131,16 @@ int err = -ENOMEM; u8 status; - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); if (!in_mad || !out_mad) goto out; memset(props, 0, sizeof *props); - memset(in_mad, 0, sizeof *in_mad); - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; - in_mad->attr_mod = cpu_to_be32(port); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad, @@ -219,18 
+219,14 @@ int err = -ENOMEM; u8 status; - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); if (!in_mad || !out_mad) goto out; - memset(in_mad, 0, sizeof *in_mad); - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE; - in_mad->attr_mod = cpu_to_be32(index / 32); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PKEY_TABLE; + in_mad->attr_mod = cpu_to_be32(index / 32); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad, @@ -258,18 +254,14 @@ int err = -ENOMEM; u8 status; - in_mad = kmalloc(sizeof *in_mad, GFP_KERNEL); + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); if (!in_mad || !out_mad) goto out; - memset(in_mad, 0, sizeof *in_mad); - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; - in_mad->attr_mod = cpu_to_be32(port); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad->attr_mod = cpu_to_be32(port); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad, @@ -283,13 +275,9 @@ memcpy(gid->raw, out_mad->data + 8, 8); - memset(in_mad, 0, sizeof *in_mad); - in_mad->base_version = 1; - in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED; - in_mad->class_version = 1; - in_mad->method = IB_MGMT_METHOD_GET; - in_mad->attr_id = IB_SMP_ATTR_GUID_INFO; - in_mad->attr_mod = cpu_to_be32(index / 8); + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_GUID_INFO; + in_mad->attr_mod = cpu_to_be32(index / 8); err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad, @@ -1069,11 +1057,48 @@ &class_device_attr_board_id }; +static int mthca_init_node_data(struct 
mthca_dev *dev) +{ + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; + int err = -ENOMEM; + u8 status; + + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + init_query_mad(in_mad); + in_mad->attr_id = IB_SMP_ATTR_NODE_INFO; + + err = mthca_MAD_IFC(dev, 1, 1, + 1, NULL, NULL, in_mad, out_mad, + &status); + if (err) + goto out; + if (status) { + err = -EINVAL; + goto out; + } + + memcpy(&dev->ib_dev.node_guid, out_mad->data + 12, 8); + +out: + kfree(in_mad); + kfree(out_mad); + return err; +} + int mthca_register_device(struct mthca_dev *dev) { int ret; int i; + ret = mthca_init_node_data(dev); + if (ret) + return ret; + strlcpy(dev->ib_dev.name, "mthca%d", IB_DEVICE_NAME_MAX); dev->ib_dev.owner = THIS_MODULE; From swise at opengridcomputing.com Fri Oct 28 07:58:58 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 28 Oct 2005 09:58:58 -0500 Subject: [openib-general] cq callback question Message-ID: <001c01c5dbd0$1fb78ad0$d5000a0a@STEVO> This may seem like a dumb question, but can a kernel ULP assume that after returning from ib_destroy_qp(), there will be no more callbacks for that QP on the associated cq event handler? From bohra at cs.rutgers.edu Fri Oct 28 08:28:25 2005 From: bohra at cs.rutgers.edu (Aniruddha Bohra) Date: Fri, 28 Oct 2005 11:28:25 -0400 Subject: [openib-general] OpenSM crash with today's trunk In-Reply-To: <52zmotu29r.fsf@cisco.com> References: <43617022.7000803@cs.rutgers.edu> <52irviuxcw.fsf@cisco.com> <4362121E.3090500@cs.rutgers.edu> <52zmotu29r.fsf@cisco.com> Message-ID: <43624399.3040908@cs.rutgers.edu> Roland Dreier wrote: > Aniruddha> I tried with r3888 and r3891 with the same result. > >Oh well, I guess this is a different bug. Is there an oops or >anything in your kernel log, or is this just a userspace crash? 
> > This is what I see : Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0 Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM Is this useful? Aniruddha From eitan at mellanox.co.il Fri Oct 28 09:05:41 2005 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 28 Oct 2005 18:05:41 +0200 Subject: [openib-general] OpenSM crash with today's trunk Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E361884B@mtlexch01.mtl.com> This means you have another SM or application already registered for handling SubnetManagement packets. Thus OpenSM fails to start (register as the handler for such requests). The crash is a bug that should be solved. Eitan Zahavi Design Technology Director Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Aniruddha Bohra [mailto:bohra at cs.rutgers.edu] > Sent: Friday, October 28, 2005 5:28 PM > To: Roland Dreier > Cc: openib-general at openib.org > Subject: Re: [openib-general] OpenSM crash with today's trunk > > Roland Dreier wrote: > > > Aniruddha> I tried with r3888 and r3891 with the same result. > > > >Oh well, I guess this is a different bug. Is there an oops or > >anything in your kernel log, or is this just a userspace crash? > > > > > This is what I see : > Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0 > Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use > Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM > > Is this useful? > > Aniruddha > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mshefty at ichips.intel.com Fri Oct 28 09:01:19 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 28 Oct 2005 09:01:19 -0700 Subject: [openib-general] OpenSM crash with today's trunk In-Reply-To: <43624399.3040908@cs.rutgers.edu> References: <43617022.7000803@cs.rutgers.edu> <52irviuxcw.fsf@cisco.com> <4362121E.3090500@cs.rutgers.edu> <52zmotu29r.fsf@cisco.com> <43624399.3040908@cs.rutgers.edu> Message-ID: <43624B4F.6080500@ichips.intel.com> Aniruddha Bohra wrote: >> Oh well, I guess this is a different bug. Is there an oops or >> anything in your kernel log, or is this just a userspace crash? >> > This is what I see : > Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0 > Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use > Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM > > Is this useful? Is there any chance opensm is already running on the system? It sounds like something has already registered to receive the same MADs that opensm wants to receive. - Sean From halr at voltaire.com Fri Oct 28 09:02:52 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 28 Oct 2005 18:02:52 +0200 Subject: [openib-general] OpenSM crash with today's trunk Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com> Or perhaps something crashed and didn't clean up properly. Does this occur immediately after a boot ? ________________________________ From: openib-general-bounces at openib.org on behalf of Sean Hefty Sent: Fri 10/28/2005 12:01 PM To: Aniruddha Bohra Cc: openib-general at openib.org Subject: Re: [openib-general] OpenSM crash with today's trunk Aniruddha Bohra wrote: >> Oh well, I guess this is a different bug. Is there an oops or >> anything in your kernel log, or is this just a userspace crash? 
>> > This is what I see : > Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0 > Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use > Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM > > Is this useful? Is there any chance opensm is already running on the system? It sounds like something has already registered to receive the same MADs that opensm wants to receive. - Sean _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From iod00d at hp.com Fri Oct 28 10:09:29 2005 From: iod00d at hp.com (Grant Grundler) Date: Fri, 28 Oct 2005 10:09:29 -0700 Subject: [openib-general] IB traffic generators In-Reply-To: <200510281240.j9SCe987022114@ns1.baymicrosystems.com> References: <43614803.6080308@ichips.intel.com> <200510281240.j9SCe987022114@ns1.baymicrosystems.com> Message-ID: <20051028170929.GA22677@esmail.cup.hp.com> On Fri, Oct 28, 2005 at 08:40:09AM -0400, Suresh Shelvapille wrote: > can you please point me to some traffic generators out there. Hypothetically one could use IPoIB and pktgen driver to generate UDP-like traffic. Someone more experienced than I could rewrite pktgen driver to use OpenIB Verbs API to produce "raw" IB traffic. ib_pktgen would be a cool ULP to have for testing. grant From sean.hefty at intel.com Fri Oct 28 10:30:38 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 28 Oct 2005 10:30:38 -0700 Subject: [openib-general] RE: [PATCH] add node_guid to struct ib_device In-Reply-To: <52vezhu1z7.fsf@cisco.com> Message-ID: >Thanks, I applied the following version (doesn't add a private kzalloc() >now that 2.6.14 is out and doesn't rename cap_mask_mutex). Thanks. I forgot to include the changes to sysfs.c in my previous patch. Not sure if we want to wait on this until the other drivers have been updated. 
We'll probably want to remove node_guid from the device attributes as well. Signed-off-by: Sean Hefty Index: sysfs.c =================================================================== --- sysfs.c (revision 3892) +++ sysfs.c (working copy) @@ -622,21 +622,15 @@ static ssize_t show_node_guid(struct class_device *cdev, char *buf) { struct ib_device *dev = container_of(cdev, struct ib_device, class_dev); - struct ib_device_attr attr; - ssize_t ret; if (!ibdev_is_alive(dev)) return -ENODEV; - ret = ib_query_device(dev, &attr); - if (ret) - return ret; - return sprintf(buf, "%04x:%04x:%04x:%04x\n", - be16_to_cpu(((__be16 *) &attr.node_guid)[0]), - be16_to_cpu(((__be16 *) &attr.node_guid)[1]), - be16_to_cpu(((__be16 *) &attr.node_guid)[2]), - be16_to_cpu(((__be16 *) &attr.node_guid)[3])); + be16_to_cpu(((__be16 *) &dev->node_guid)[0]), + be16_to_cpu(((__be16 *) &dev->node_guid)[1]), + be16_to_cpu(((__be16 *) &dev->node_guid)[2]), + be16_to_cpu(((__be16 *) &dev->node_guid)[3])); } static CLASS_DEVICE_ATTR(node_type, S_IRUGO, show_node_type, NULL); From bohra at cs.rutgers.edu Fri Oct 28 10:50:23 2005 From: bohra at cs.rutgers.edu (Aniruddha Bohra) Date: Fri, 28 Oct 2005 13:50:23 -0400 Subject: [openib-general] OpenSM crash with today's trunk In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com> Message-ID: <436264DF.5090609@cs.rutgers.edu> Hal Rosenstock wrote: >Or perhaps something crashed and didn't clean up properly. Does this occur immediately after a boot ? > > > This is after a clean reboot. There are two systems on the switch and this is the only active one. I will reboot both and see again. 
Thanks Aniruddha >________________________________ > >From: openib-general-bounces at openib.org on behalf of Sean Hefty >Sent: Fri 10/28/2005 12:01 PM >To: Aniruddha Bohra >Cc: openib-general at openib.org >Subject: Re: [openib-general] OpenSM crash with today's trunk > > > >Aniruddha Bohra wrote: > > >>>Oh well, I guess this is a different bug. Is there an oops or >>>anything in your kernel log, or is this just a userspace crash? >>> >>> >>> >>This is what I see : >>Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0 >>Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use >>Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM >> >>Is this useful? >> >> > >Is there any chance opensm is already running on the system? It sounds like >something has already registered to receive the same MADs that opensm wants to >receive. > >- Sean >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From bohra at cs.rutgers.edu Fri Oct 28 11:08:42 2005 From: bohra at cs.rutgers.edu (Aniruddha Bohra) Date: Fri, 28 Oct 2005 14:08:42 -0400 Subject: [openib-general] OpenSM crash with today's trunk In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com> Message-ID: <4362692A.80207@cs.rutgers.edu> Hal Rosenstock wrote: >Or perhaps something crashed and didn't clean up properly. Does this occur immediately after a boot ? > > > After a fresh reboot of the machines on the switch, I get the log at http://www.cs.rutgers.edu/~bohra/osm-v2.log The opensm process does not crash but hangs. The state of the port never changes. 
Now there is an OOPS in the dmesg : Oct 28 13:52:13 hora-3 OpenSM[5168]: OpenSM Rev:openib-1.1.0 Oct 28 13:52:14 hora-3 kernel: Unable to handle kernel paging request at virtual address 09000010 Oct 28 13:52:14 hora-3 kernel: printing eip: Oct 28 13:52:14 hora-3 kernel: f883f12d Oct 28 13:52:14 hora-3 kernel: *pde = 00000000 Oct 28 13:52:14 hora-3 kernel: Oops: 0000 [#1] Oct 28 13:52:14 hora-3 kernel: SMP Oct 28 13:52:14 hora-3 kernel: Modules linked in: ib_uverbs ib_umad ipv6 i2c_dev i2c_core sunrpc dm_mod video button battery ac uhci_hcd hw_random ib_mthca ib_mad ib_core e1000 floppy Oct 28 13:52:14 hora-3 kernel: CPU: 1 Oct 28 13:52:14 hora-3 kernel: EIP: 0060:[] Not tainted VLI Oct 28 13:52:14 hora-3 kernel: EFLAGS: 00010286 (2.6.13bohra) Oct 28 13:52:14 hora-3 kernel: EIP is at ib_post_send_mad+0x1c/0x1b1 [ib_mad] Oct 28 13:52:14 hora-3 kernel: eax: 09000000 ebx: c1a7d900 ecx: c1a7d918 edx: 00000000 Oct 28 13:52:14 hora-3 kernel: esi: c1a7d918 edi: f6571f68 ebp: f6571efc esp: f6571ed8 Oct 28 13:52:14 hora-3 kernel: ds: 007b es: 007b ss: 0068 Oct 28 13:52:14 hora-3 kernel: Process opensm (pid: 5224, threadinfo=f6570000 task=f7dfb020) Oct 28 13:52:14 hora-3 kernel: Stack: f883ef5a 00000000 c1a7d800 080bd018 f6571efc 00000000 f6a42900 a0f684f6 Oct 28 13:52:14 hora-3 kernel: f6571f68 f6571f74 f88f1728 00000000 00000018 000000e8 000000d0 f6a42948 Oct 28 13:52:14 hora-3 kernel: f68bda24 00000000 00000009 a0f684f6 00000009 c1a7d918 00000000 00000100 Oct 28 13:52:14 hora-3 kernel: Call Trace: Oct 28 13:52:14 hora-3 kernel: [] show_stack+0x7c/0x92 Oct 28 13:52:14 hora-3 kernel: [] show_registers+0x152/0x1ca Oct 28 13:52:14 hora-3 kernel: [] die+0xf4/0x16f Oct 28 13:52:14 hora-3 kernel: [] do_page_fault+0x463/0x649 Oct 28 13:52:14 hora-3 kernel: [] error_code+0x4f/0x54 Oct 28 13:52:14 hora-3 kernel: [] ib_umad_write+0x2d0/0x30e [ib_umad] Oct 28 13:52:14 hora-3 kernel: [] vfs_write+0x155/0x15a Oct 28 13:52:14 hora-3 kernel: [] sys_write+0x3d/0x64 Oct 28 13:52:14 hora-3
kernel: [] sysenter_past_esp+0x54/0x75 Oct 28 13:52:14 hora-3 kernel: Code: e8 d8 63 af c7 89 d8 83 c4 0c 5b 5e 5f 5d c3 55 89 e5 57 56 89 c6 53 83 ec 18 85 f6 89 55 f0 0f 84 ff 00 00 00 8b 46 08 8d 5e e8 <8b> 50 10 8b 7b 14 85 d2 0f 84 7c 01 00 00 8b 4e 18 85 c9 74 0b Thanks Aniruddha >________________________________ > >From: openib-general-bounces at openib.org on behalf of Sean Hefty >Sent: Fri 10/28/2005 12:01 PM >To: Aniruddha Bohra >Cc: openib-general at openib.org >Subject: Re: [openib-general] OpenSM crash with today's trunk > > > >Aniruddha Bohra wrote: > > >>>Oh well, I guess this is a different bug. Is there an oops or >>>anything in your kernel log, or is this just a userspace crash? >>> >>> >>> >>This is what I see : >>Oct 27 22:03:34 hora-3 OpenSM[7995]: OpenSM Rev:openib-1.1.0 >>Oct 27 22:03:34 hora-3 kernel: ib_mad: Method 1 already in use >>Oct 27 22:03:34 hora-3 OpenSM[7995]: Exiting SM >> >>Is this useful? >> >> > >Is there any chance opensm is already running on the system? It sounds like >something has already registered to receive the same MADs that opensm wants to >receive. 
> >- Sean >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From higley at dbresearch.net Fri Oct 28 12:01:33 2005 From: higley at dbresearch.net (Jay Higley) Date: Fri, 28 Oct 2005 14:01:33 -0500 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <52r7a6vc0w.fsf@cisco.com> References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> <524q72ygwj.fsf@cisco.com> <436118BB.4010809@ichips.intel.com> <4361255E.7010400@ichips.intel.com> <52r7a6wv9a.fsf@cisco.com> <43614803.6080308@ichips.intel.com> <52vezivc3t.fsf@cisco.com> <52r7a6vc0w.fsf@cisco.com> Message-ID: <4362758D.20407@dbresearch.net> Roland Dreier wrote: >BTW, Jay, can you confirm that this patch fixes your problem too? > >Thanks, > Roland > > > > I updated to version 3891 and tried it with the 2.6.13.4 Kernel that I was using and got unresolved symbol errors for kzalloc. I upgraded the kernel to 2.6.14 and tried agin and got the below compile errors. As an aside, when I was running the unpatched openSM on a single-processor system I occasionally got it to start up, but it would hang and the port state would never change. The same sort of behavior as in the "OpenSM crash with today's trunk" thread. 
-Jay Higley
CC [M] drivers/infiniband/core/addr.o
CC [M] net/sched/em_text.o
CC [M] net/sctp/outqueue.o
CC [M] net/sunrpc/xprt.o
drivers/infiniband/core/addr.c:330: warning: initialization from incompatible pointer type
CC [M] net/sctp/ulpqueue.o
CC [M] drivers/infiniband/core/at.o
drivers/infiniband/core/at.c:1547: warning: initialization from incompatible pointer type
CC [M] drivers/infiniband/core/cm.o
CC [M] net/sctp/command.o
drivers/infiniband/core/cm.c: In function `cm_alloc_msg':
drivers/infiniband/core/cm.c:179: error: `IB_MGMT_MAD_HDR' undeclared (first use in this function)
drivers/infiniband/core/cm.c:179: error: (Each undeclared identifier is reported only once
drivers/infiniband/core/cm.c:179: error: for each function it appears in.)
drivers/infiniband/core/cm.c:180: error: too few arguments to function `ib_create_send_mad'
drivers/infiniband/core/cm.c:187: error: structure has no member named `ah'
drivers/infiniband/core/cm.c:188: error: structure has no member named `retries'
drivers/infiniband/core/cm.c: In function `cm_alloc_response_msg':
drivers/infiniband/core/cm.c:209: error: `IB_MGMT_MAD_HDR' undeclared (first use in this function)
drivers/infiniband/core/cm.c:210: error: too few arguments to function `ib_create_send_mad'
drivers/infiniband/core/cm.c:215: error: structure has no member named `ah'
drivers/infiniband/core/cm.c: In function `cm_free_msg':
drivers/infiniband/core/cm.c:222: error: structure has no member named `ah'
drivers/infiniband/core/cm.c: In function `cm_insert_listen':
drivers/infiniband/core/cm.c:371: error: structure has no member named `device'
drivers/infiniband/core/cm.c:371: error: structure has no member named `device'
drivers/infiniband/core/cm.c:374: error: structure has no member named `device'
drivers/infiniband/core/cm.c:374: error: structure has no member named `device'
drivers/infiniband/core/cm.c:376: error: structure has no member named `device'
drivers/infiniband/core/cm.c:376: error: structure has no member named `device'
drivers/infiniband/core/cm.c: In function `cm_find_listen':
drivers/infiniband/core/cm.c:398: error: structure has no member named `device'
drivers/infiniband/core/cm.c:401: error: structure has no member named `device'
drivers/infiniband/core/cm.c:403: error: structure has no member named `device'
drivers/infiniband/core/cm.c: At top level:
drivers/infiniband/core/cm.c:543: error: conflicting types for 'ib_create_cm_id'
include/rdma/ib_cm.h:306: error: previous declaration of 'ib_create_cm_id' was here
drivers/infiniband/core/cm.c:543: error: conflicting types for 'ib_create_cm_id'
include/rdma/ib_cm.h:306: error: previous declaration of 'ib_create_cm_id' was here
drivers/infiniband/core/cm.c: In function `ib_create_cm_id':
drivers/infiniband/core/cm.c:553: error: structure has no member named `device'
drivers/infiniband/core/cm.c: In function `ib_destroy_cm_id':
drivers/infiniband/core/cm.c:681: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c:692: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c:709: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `ib_send_cm_req':
drivers/infiniband/core/cm.c:935: error: structure has no member named `timeout_ms'
drivers/infiniband/core/cm.c:944: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:944: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_issue_rej':
drivers/infiniband/core/cm.c:989: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:989: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_dup_req_handler':
drivers/infiniband/core/cm.c:1197: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1197: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_match_req':
drivers/infiniband/core/cm.c:1237: error: structure has no member named `device'
drivers/infiniband/core/cm.c: In function `ib_send_cm_rep':
drivers/infiniband/core/cm.c:1383: error: structure has no member named `timeout_ms'
drivers/infiniband/core/cm.c:1386: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1386: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `ib_send_cm_rtu':
drivers/infiniband/core/cm.c:1450: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1450: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_dup_rep_handler':
drivers/infiniband/core/cm.c:1522: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1522: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_rep_handler':
drivers/infiniband/core/cm.c:1590: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `cm_establish_handler':
drivers/infiniband/core/cm.c:1624: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `cm_rtu_handler':
drivers/infiniband/core/cm.c:1663: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `ib_send_cm_dreq':
drivers/infiniband/core/cm.c:1721: error: structure has no member named `timeout_ms'
drivers/infiniband/core/cm.c:1724: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1724: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `ib_send_cm_drep':
drivers/infiniband/core/cm.c:1787: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1787: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_dreq_handler':
drivers/infiniband/core/cm.c:1822: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c:1836: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1836: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_drep_handler':
drivers/infiniband/core/cm.c:1883: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `ib_send_cm_rej':
drivers/infiniband/core/cm.c:1951: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:1951: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_rej_handler':
drivers/infiniband/core/cm.c:2027: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c:2037: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `ib_send_cm_mra':
drivers/infiniband/core/cm.c:2095: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2095: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c:2108: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2108: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c:2121: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2121: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_mra_handler':
drivers/infiniband/core/cm.c:2183: warning: passing arg 2 of `ib_modify_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c:2190: warning: passing arg 2 of `ib_modify_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c:2198: warning: passing arg 2 of `ib_modify_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `ib_send_cm_lap':
drivers/infiniband/core/cm.c:2281: error: structure has no member named `timeout_ms'
drivers/infiniband/core/cm.c:2284: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2284: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_lap_handler':
drivers/infiniband/core/cm.c:2361: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2361: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `ib_send_cm_apr':
drivers/infiniband/core/cm.c:2439: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2439: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_apr_handler':
drivers/infiniband/core/cm.c:2478: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `ib_send_cm_sidr_req':
drivers/infiniband/core/cm.c:2575: error: structure has no member named `timeout_ms'
drivers/infiniband/core/cm.c:2580: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2580: error: too few arguments to function `ib_post_send_mad'
CC [M] net/sctp/tsnmap.o
drivers/infiniband/core/cm.c: In function `cm_sidr_req_handler':
drivers/infiniband/core/cm.c:2644: error: structure has no member named `device'
drivers/infiniband/core/cm.c: In function `ib_send_cm_sidr_rep':
drivers/infiniband/core/cm.c:2715: warning: passing arg 1 of `ib_post_send_mad' from incompatible pointer type
drivers/infiniband/core/cm.c:2715: error: too few arguments to function `ib_post_send_mad'
drivers/infiniband/core/cm.c: In function `cm_sidr_rep_handler':
drivers/infiniband/core/cm.c:2768: warning: passing arg 2 of `ib_cancel_mad' makes integer from pointer without a cast
drivers/infiniband/core/cm.c: In function `cm_send_handler':
drivers/infiniband/core/cm.c:2836: error: structure has no member named `send_buf'
CC [M] net/sunrpc/sched.o
make[3]: *** [drivers/infiniband/core/cm.o] Error 1
make[2]: *** [drivers/infiniband/core] Error 2
make[1]: *** [drivers/infiniband] Error 2
make: *** [drivers] Error 2
make: *** Waiting for unfinished jobs....
From mshefty at ichips.intel.com Fri Oct 28 12:06:28 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 28 Oct 2005 12:06:28 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <4362758D.20407@dbresearch.net> References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> <524q72ygwj.fsf@cisco.com> <436118BB.4010809@ichips.intel.com> <4361255E.7010400@ichips.intel.com> <52r7a6wv9a.fsf@cisco.com> <43614803.6080308@ichips.intel.com> <52vezivc3t.fsf@cisco.com> <52r7a6vc0w.fsf@cisco.com> <4362758D.20407@dbresearch.net> Message-ID: <436276B4.10604@ichips.intel.com> Jay Higley wrote: > I updated to version 3891 and tried it with the 2.6.13.4 Kernel that I > was using and got unresolved symbol errors for kzalloc. I upgraded the > kernel to 2.6.14 and tried agin and got the below compile errors.
As an > aside, when I was running the unpatched openSM on a single-processor > system I occasionally got it to start up, but it would hang and the port > state would never change. The same sort of behavior as in the "OpenSM > crash with today's trunk" thread. It looks like you have old header files (possibly the original ones shipped with 2.6.14). I'm updating my systems to 2.6.14 at the moment, and will start testing this once done. - Sean From rolandd at cisco.com Fri Oct 28 12:14:06 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 28 Oct 2005 12:14:06 -0700 Subject: [openib-general] Re: [PATCH] add node_guid to struct ib_device In-Reply-To: (Sean Hefty's message of "Fri, 28 Oct 2005 10:30:38 -0700") References: Message-ID: <52mzkttpxd.fsf@cisco.com> Sean> Thanks. I forgot to include the changes to sysfs.c in my Sean> previous patch. Not sure if we want to wait on this until Sean> the other drivers have been updated. We'll probably want to Sean> remove node_guid from the device attributes as well. Yes, I think that needs to wait until you or someone else updates ipath and ehca. - R. From rolandd at cisco.com Fri Oct 28 12:16:09 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 28 Oct 2005 12:16:09 -0700 Subject: [openib-general] OpenSM crash with today's trunk In-Reply-To: <4362692A.80207@cs.rutgers.edu> (Aniruddha Bohra's message of "Fri, 28 Oct 2005 14:08:42 -0400") References: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com> <4362692A.80207@cs.rutgers.edu> Message-ID: <52irvhtpty.fsf@cisco.com> > Now there is an OOPS in the dmesg : This really looks like the bug I fixed in r3889. What svn rev are your kernel modules built from? - R. 
From higley at dbresearch.net Fri Oct 28 12:23:24 2005 From: higley at dbresearch.net (Jay Higley) Date: Fri, 28 Oct 2005 14:23:24 -0500 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <436276B4.10604@ichips.intel.com> References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> <524q72ygwj.fsf@cisco.com> <436118BB.4010809@ichips.intel.com> <4361255E.7010400@ichips.intel.com> <52r7a6wv9a.fsf@cisco.com> <43614803.6080308@ichips.intel.com> <52vezivc3t.fsf@cisco.com> <52r7a6vc0w.fsf@cisco.com> <4362758D.20407@dbresearch.net> <436276B4.10604@ichips.intel.com> Message-ID: <43627AAC.6050602@dbresearch.net> Sean Hefty wrote: > Jay Higley wrote: > >> I updated to version 3891 and tried it with the 2.6.13.4 Kernel that >> I was using and got unresolved symbol errors for kzalloc. I upgraded >> the kernel to 2.6.14 and tried agin and got the below compile >> errors. As an aside, when I was running the unpatched openSM on a >> single-processor system I occasionally got it to start up, but it >> would hang and the port state would never change. The same sort of >> behavior as in the "OpenSM crash with today's trunk" thread. > > > It looks like you have old header files (possibly the original ones > shipped with 2.6.14). > > I'm updating my systems to 2.6.14 at the moment, and will start > testing this once done. > > - Sean > > I am using the source downloaded from kernel.org for 2.6.14 with only the sk98lin and infiniband patches. What newer headers are you referring to? Where are they supposed to be located?
-Jay From rolandd at cisco.com Fri Oct 28 12:30:14 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 28 Oct 2005 12:30:14 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <43627AAC.6050602@dbresearch.net> (Jay Higley's message of "Fri, 28 Oct 2005 14:23:24 -0500") References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> <524q72ygwj.fsf@cisco.com> <436118BB.4010809@ichips.intel.com> <4361255E.7010400@ichips.intel.com> <52r7a6wv9a.fsf@cisco.com> <43614803.6080308@ichips.intel.com> <52vezivc3t.fsf@cisco.com> <52r7a6vc0w.fsf@cisco.com> <4362758D.20407@dbresearch.net> <436276B4.10604@ichips.intel.com> <43627AAC.6050602@dbresearch.net> Message-ID: <52vezhsam1.fsf@cisco.com> Jay> I am using the source dowloaded from kernel.org for 2.6.14 Jay> with only the sk98lin and infiniband patches. What newer Jay> headers are you refering to? Where are the supposed to be Jay> located? If you link a subversion tree into your kernel's drivers/infiniband subdirectory, then you have to rm -rf include/rdma in your kernel tree, or else the build will pick up the old headers from the kernel tree instead of the new headers from the subversion tree. - R. From rolandd at cisco.com Fri Oct 28 12:34:43 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 28 Oct 2005 12:34:43 -0700 Subject: [openib-general] cq callback question In-Reply-To: <001c01c5dbd0$1fb78ad0$d5000a0a@STEVO> (Steve Wise's message of "Fri, 28 Oct 2005 09:58:58 -0500") References: <001c01c5dbd0$1fb78ad0$d5000a0a@STEVO> Message-ID: <52r7a5saek.fsf@cisco.com> Steve> This may seem like a dumb question, but can a kernel ULP Steve> assume that after returning from ib_destroy_qp(), there Steve> will be no more callbacks for that QP on the associated cq Steve> event handler? No, I don't think that's a valid assumption, at least with the current code. 
Also, there's no requirement in Documentation/infiniband/core_locking.txt that destroy QP operations synchronize against CQ callbacks. It is valid to assume that no callbacks will happen after ib_destroy_cq() returns. - R. From higley at dbresearch.net Fri Oct 28 13:13:29 2005 From: higley at dbresearch.net (Jay Higley) Date: Fri, 28 Oct 2005 15:13:29 -0500 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <52vezhsam1.fsf@cisco.com> References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> <524q72ygwj.fsf@cisco.com> <436118BB.4010809@ichips.intel.com> <4361255E.7010400@ichips.intel.com> <52r7a6wv9a.fsf@cisco.com> <43614803.6080308@ichips.intel.com> <52vezivc3t.fsf@cisco.com> <52r7a6vc0w.fsf@cisco.com> <4362758D.20407@dbresearch.net> <436276B4.10604@ichips.intel.com> <43627AAC.6050602@dbresearch.net> <52vezhsam1.fsf@cisco.com> Message-ID: <43628669.8080405@dbresearch.net> Roland Dreier wrote: > Jay> I am using the source dowloaded from kernel.org for 2.6.14 > Jay> with only the sk98lin and infiniband patches.
What newer > Jay> headers are you refering to? Where are the supposed to be > Jay> located? > >If you link a subversion tree into your kernel's drivers/infiniband >subdirectory, then you have to rm -rf include/rdma in your kernel >tree, or else the build will pick up the old headers from the kernel >tree instead of the new headers from the subversion tree. > > - R. > > > > Thanks. I'll try that. I also looked into "user_mad.c" and see that you don't have the same compatibility defines for kzalloc that you used in sdp. I've attempted to duplicate them and am trying the recompile with 2.6.13 right now. -Jay From rolandd at cisco.com Fri Oct 28 13:19:26 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 28 Oct 2005 13:19:26 -0700 Subject: [openib-general] OpenSM causes kernel trap In-Reply-To: <43628669.8080405@dbresearch.net> (Jay Higley's message of "Fri, 28 Oct 2005 15:13:29 -0500") References: <43610E4F.3030103@dbresearch.net> <528xweyi4h.fsf@cisco.com> <436113A1.2020105@ichips.intel.com> <524q72ygwj.fsf@cisco.com> <436118BB.4010809@ichips.intel.com> <4361255E.7010400@ichips.intel.com> <52r7a6wv9a.fsf@cisco.com> <43614803.6080308@ichips.intel.com> <52vezivc3t.fsf@cisco.com> <52r7a6vc0w.fsf@cisco.com> <4362758D.20407@dbresearch.net> <436276B4.10604@ichips.intel.com> <43627AAC.6050602@dbresearch.net> <52vezhsam1.fsf@cisco.com> <43628669.8080405@dbresearch.net> Message-ID: <52k6fxs8c1.fsf@cisco.com> Jay> I also looked into "user_mad.c" and see that you don't have Jay> the same compatibility defines for kzalloc that you used in Jay> sdp. Right, now that 2.6.14 is out, we won't try to maintain backward compatibility with 2.6.13 in the main subversion trunk. - R. 
From bohra at cs.rutgers.edu Fri Oct 28 13:26:13 2005 From: bohra at cs.rutgers.edu (Aniruddha Bohra) Date: Fri, 28 Oct 2005 16:26:13 -0400 Subject: [openib-general] OpenSM crash with today's trunk In-Reply-To: <52irvhtpty.fsf@cisco.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com> <4362692A.80207@cs.rutgers.edu> <52irvhtpty.fsf@cisco.com> Message-ID: <43628965.60902@cs.rutgers.edu> Roland Dreier wrote: > > Now there is an OOPS in the dmesg : > >This really looks like the bug I fixed in r3889. What svn rev are >your kernel modules built from? > > - R. > > With 3892 I now get the following warnings on compilation: WARNING: /lib/modules/2.6.13bohra/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko needs unknown symbol kzalloc WARNING: /lib/modules/2.6.13bohra/kernel/drivers/infiniband/core/ib_umad.ko needs unknown symbol kzalloc Aniruddha From bohra at cs.rutgers.edu Fri Oct 28 13:31:24 2005 From: bohra at cs.rutgers.edu (Aniruddha Bohra) Date: Fri, 28 Oct 2005 16:31:24 -0400 Subject: [openib-general] OpenSM crash with today's trunk In-Reply-To: <52irvhtpty.fsf@cisco.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com> <4362692A.80207@cs.rutgers.edu> <52irvhtpty.fsf@cisco.com> Message-ID: <43628A9C.6090108@cs.rutgers.edu> Roland Dreier wrote: > > Now there is an OOPS in the dmesg : > >This really looks like the bug I fixed in r3889. What svn rev are >your kernel modules built from? > > - R. 
> > And of course, the module does not load : Oct 28 16:21:57 hora-3 kernel: ib_mthca: Unknown symbol kzalloc Oct 28 16:21:58 hora-3 kernel: ib_umad: Unknown symbol kzalloc Aniruddha From rolandd at cisco.com Fri Oct 28 13:25:48 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 28 Oct 2005 13:25:48 -0700 Subject: [openib-general] OpenSM crash with today's trunk In-Reply-To: <43628965.60902@cs.rutgers.edu> (Aniruddha Bohra's message of "Fri, 28 Oct 2005 16:26:13 -0400") References: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com> <4362692A.80207@cs.rutgers.edu> <52irvhtpty.fsf@cisco.com> <43628965.60902@cs.rutgers.edu> Message-ID: <52fyqls81f.fsf@cisco.com> > With 3892 I now get the following warnings on compilation: > WARNING: > /lib/modules/2.6.13bohra/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko > needs unknown symbol kzalloc > WARNING: > /lib/modules/2.6.13bohra/kernel/drivers/infiniband/core/ib_umad.ko > needs unknown symbol kzalloc Yes, kzalloc() was added in 2.6.14. Now that 2.6.14 has been released, the subversion trunk is targeted against that kernel rather than the old 2.6.13 release. - R. From bohra at cs.rutgers.edu Fri Oct 28 13:51:56 2005 From: bohra at cs.rutgers.edu (Aniruddha Bohra) Date: Fri, 28 Oct 2005 16:51:56 -0400 Subject: [openib-general] OpenSM crash with today's trunk In-Reply-To: <52fyqls81f.fsf@cisco.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com> <4362692A.80207@cs.rutgers.edu> <52irvhtpty.fsf@cisco.com> <43628965.60902@cs.rutgers.edu> <52fyqls81f.fsf@cisco.com> Message-ID: <43628F6C.9070308@cs.rutgers.edu> Roland Dreier wrote: > > With 3892 I now get the following warnings on compilation: > > WARNING: > > /lib/modules/2.6.13bohra/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko > > needs unknown symbol kzalloc > > WARNING: > > /lib/modules/2.6.13bohra/kernel/drivers/infiniband/core/ib_umad.ko > > needs unknown symbol kzalloc > >Yes, kzalloc() was added in 2.6.14. 
>Now that 2.6.14 has been
>released, the subversion trunk is targeted against that kernel rather
>than the old 2.6.13 release.
>
> - R.
>
>
OK so, what options do I have right now -- compile a new kernel and
apply patches and continue, or is there some patch that I can apply ?

Thanks
Aniruddha

From rolandd at cisco.com Fri Oct 28 13:56:56 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 28 Oct 2005 13:56:56 -0700
Subject: [openib-general] OpenSM crash with today's trunk
In-Reply-To: <43628F6C.9070308@cs.rutgers.edu> (Aniruddha Bohra's message of "Fri, 28 Oct 2005 16:51:56 -0400")
References: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com>
 <4362692A.80207@cs.rutgers.edu> <52irvhtpty.fsf@cisco.com>
 <43628965.60902@cs.rutgers.edu> <52fyqls81f.fsf@cisco.com>
 <43628F6C.9070308@cs.rutgers.edu>
Message-ID: <52acgts6lj.fsf@cisco.com>

    > OK so, what options do I have right now -- compile a new kernel and
    > apply patches and
    > continue, or is there some patch that I can apply ?

I don't think anyone has prepared a kzalloc() patch, but just adding
something like

        static void *kzalloc(size_t size, unsigned int flags)
        {
                void *ret = kmalloc(size, flags);
                if (ret)
                        memset(ret, 0, size);
                return ret;
        }

to files that use kzalloc() should let you use 2.6.13 (assuming there
are no other incompatibilities).

 - R.

From rolandd at cisco.com Fri Oct 28 14:15:14 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 28 Oct 2005 14:15:14 -0700
Subject: [openib-general] Re: [PATCH] [SRP] srp_cm_handler expanded response handling
In-Reply-To: (John Kingman's message of "Thu, 27 Oct 2005 16:18:07 -0500 (CDT)")
References: <52irviwutw.fsf@cisco.com>
Message-ID: <521x25s5r1.fsf@cisco.com>

Thanks, applied.

 - R.
From bohra at cs.rutgers.edu Fri Oct 28 15:00:58 2005
From: bohra at cs.rutgers.edu (Aniruddha Bohra)
Date: Fri, 28 Oct 2005 18:00:58 -0400
Subject: uDAPL Problem : [WasRe: [openib-general] OpenSM crash with today's trunk
In-Reply-To: <52acgts6lj.fsf@cisco.com>
References: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com>
 <4362692A.80207@cs.rutgers.edu> <52irvhtpty.fsf@cisco.com>
 <43628965.60902@cs.rutgers.edu> <52fyqls81f.fsf@cisco.com>
 <43628F6C.9070308@cs.rutgers.edu> <52acgts6lj.fsf@cisco.com>
Message-ID: <43629F9A.3070704@cs.rutgers.edu>

Roland Dreier wrote:

>    > OK so, what options do I have right now -- compile a new kernel and
>    > apply patches and
>    > continue, or is there some patch that I can apply ?
>
>I don't think anyone has prepared a kzalloc() patch, but just adding
>something like
>
>        static void *kzalloc(size_t size, unsigned int flags)
>        {
>                void *ret = kmalloc(size, flags);
>                if (ret)
>                        memset(ret, 0, size);
>                return ret;
>        }
>
>to files that use kzalloc() should let you use 2.6.13 (assuming there
>are no other incompatibilities).
>
>
>
Thanks, that works.
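
For anyone trying the same workaround on an older kernel, here is a
minimal userspace sketch of the fallback quoted above. It is only an
illustration: malloc() stands in for kmalloc(), so there is no flags
argument, and the names demo_zalloc, DEMO_KERNEL_VERSION and
demo_needs_zalloc_fallback are hypothetical. The version encoding is
assumed to mirror the kernel's KERNEL_VERSION() macro, and the 2.6.14
cutoff is taken from the note in this thread that kzalloc() was added in
that release.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed to mirror the kernel's KERNEL_VERSION(a, b, c) encoding. */
#define DEMO_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical helper: nonzero if the given kernel predates kzalloc(). */
static int demo_needs_zalloc_fallback(int version_code)
{
        return version_code < DEMO_KERNEL_VERSION(2, 6, 14);
}

/* Userspace analogue of the fallback: allocate, then zero on success. */
static void *demo_zalloc(size_t size)
{
        void *ret = malloc(size);

        if (ret)
                memset(ret, 0, size);
        return ret;
}
```

In a real module the helper would presumably sit behind an #if on
LINUX_VERSION_CODE, so that it is compiled only for kernels older than
2.6.14 and does not clash with the kernel's own definition on newer ones.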
Now, I have a problem with udapl :

The following is a code snippet from :
dapl_ib_dto.h

        for (i = 0; i < segments; i++ ) {
                if ( !local_iov[i].segment_length )
                        continue;

                ds_array_p->addr = (uint64_t) local_iov[i].virtual_address;
                ds_array_p->length = local_iov[i].segment_length;
                ds_array_p->lkey = local_iov[i].lmr_context;

                dapl_dbg_log ( DAPL_DBG_TYPE_EP,
                               " post_snd: lkey 0x%x va %p len %d \n",
                               ds_array_p->lkey, ds_array_p->addr,
                               ds_array_p->length );

                total_len += ds_array_p->length;
                wr.num_sge++;
                ds_array_p++;
        }

The following is the relevant part of the log with DAPL_DBG_TYPE=0xffff

dapl_ep_post_send (0x8087110, 2, 0x80f9910, %P, b5f395bc)
 post_snd: ep 0x8087110 op 2 ck 0x8087374 sgs 2 l_iov 0x80f9910 r_iov 0xbfc29060 f 0
 post_snd: ep 0x8087110 cookie 0x8087374 segs 2 l_iov 0x80f9910
 post_snd: lkey 0x10de003b va 0xb5f3976c len 0
 post_snd: lkey 0x10de003b va 0xb5f39924 len 0
                                         ^^^^^^^^

From the above loop, how is this possible :
If local_iov[i].segment_length == 0, it should not be printed. And if
the assignment is successful, len must not be 0.

Any ideas? Of course following this, the ep is disconnected in the next
step :(

Also a minor patch, you can see that %P is printed as %P and not used as
a format character.

Index: common/dapl_ep_post_rdma_write.c
===================================================================
--- common/dapl_ep_post_rdma_write.c    (revision 3892)
+++ common/dapl_ep_post_rdma_write.c    (working copy)
@@ -78,7 +78,7 @@
     DAT_RETURN dat_status;
 
     dapl_dbg_log (DAPL_DBG_TYPE_API,
-                 "dapl_ep_post_rdma_write (%p, %d, %p, %P, %p, %x)\n",
+                 "dapl_ep_post_rdma_write (%p, %d, %p, %p, %p, %x)\n",
                  ep_handle,
                  num_segments,
                  local_iov,
Index: common/dapl_ep_post_send.c
===================================================================
--- common/dapl_ep_post_send.c    (revision 3892)
+++ common/dapl_ep_post_send.c    (working copy)
@@ -75,7 +75,7 @@
     DAT_RETURN dat_status;
 
     dapl_dbg_log (DAPL_DBG_TYPE_API,
-                 "dapl_ep_post_send (%p, %d, %p, %P, %x)\n",
+                 "dapl_ep_post_send (%p, %d, %p, %p, %x)\n",
                  ep_handle,
                  num_segments,
                  local_iov,
Index: common/dapl_srq_post_recv.c
===================================================================
--- common/dapl_srq_post_recv.c    (revision 3892)
+++ common/dapl_srq_post_recv.c    (working copy)
@@ -79,7 +79,7 @@
     DAT_RETURN dat_status;
 
     dapl_dbg_log (DAPL_DBG_TYPE_API,
-                 "dapl_srq_post_recv (%p, %d, %p, %P)\n",
+                 "dapl_srq_post_recv (%p, %d, %p, %p)\n",
                  srq_handle,
                  num_segments,
                  local_iov,
Index: common/dapl_ep_post_recv.c
===================================================================
--- common/dapl_ep_post_recv.c    (revision 3892)
+++ common/dapl_ep_post_recv.c    (working copy)
@@ -79,7 +79,7 @@
     DAT_RETURN dat_status;
 
     dapl_dbg_log (DAPL_DBG_TYPE_API,
-                 "dapl_ep_post_recv (%p, %d, %p, %P, %x)\n",
+                 "dapl_ep_post_recv (%p, %d, %p, %p, %x)\n",
                  ep_handle,
                  num_segments,
                  local_iov,

Thanks
Aniruddha

From rolandd at cisco.com Fri Oct 28 15:42:38 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 28 Oct 2005 15:42:38 -0700
Subject: [openib-general] [PATCH] fix umad object lifetime stuff
Message-ID: <528xwdqn4x.fsf@cisco.com>

I just committed the following patch for user_mad.c, which fixes
various issues with possibly freeing various data
structures before the last reference is gone. For example, cdev_del() might return before the last reference to the cdev is gone, so freeing a structure containing the cdev is wrong at that point. (Side note: it's essentially impossible to use cdev_init() safely unless the cdev in question is statically allocated as part of the module). Something like this is probably required for ucm and anything else that exports a character device, since everyone seems to have copied my bad user_mad code. But I haven't had a chance to do anything beyond user_mad and uverbs so far... - R. --- infiniband/core/user_mad.c (revision 3890) +++ infiniband/core/user_mad.c (working copy) @@ -64,18 +64,39 @@ enum { IB_UMAD_MINOR_BASE = 0 }; +/* + * Our lifetime rules for these structs are the following: each time a + * device special file is opened, we look up the corresponding struct + * ib_umad_port by minor in the umad_port[] table while holding the + * port_lock. If this lookup succeeds, we take a reference on the + * ib_umad_port's struct ib_umad_device while still holding the + * port_lock; if the lookup fails, we fail the open(). We drop these + * references in the corresponding close(). + * + * In addition to references coming from open character devices, there + * is one more reference to each ib_umad_device representing the + * module's reference taken when allocating the ib_umad_device in + * ib_umad_add_one(). + * + * When destroying an ib_umad_device, we clear all of its + * ib_umad_ports from umad_port[] while holding port_lock before + * dropping the module's reference to the ib_umad_device. This is + * always safe because any open() calls will either succeed and obtain + * a reference before we clear the umad_port[] entries, or fail after + * we clear the umad_port[] entries. 
+ */ + struct ib_umad_port { - int devnum; - struct cdev dev; - struct class_device class_dev; - - int sm_devnum; - struct cdev sm_dev; - struct class_device sm_class_dev; + struct cdev *dev; + struct class_device *class_dev; + + struct cdev *sm_dev; + struct class_device *sm_class_dev; struct semaphore sm_sem; struct ib_device *ib_dev; struct ib_umad_device *umad_dev; + int dev_num; u8 port_num; }; @@ -102,13 +123,25 @@ struct ib_umad_packet { struct ib_user_mad mad; }; +static struct class *umad_class; + static const dev_t base_dev = MKDEV(IB_UMAD_MAJOR, IB_UMAD_MINOR_BASE); -static spinlock_t map_lock; + +static DEFINE_SPINLOCK(port_lock); +static struct ib_umad_port *umad_port[IB_UMAD_MAX_PORTS]; static DECLARE_BITMAP(dev_map, IB_UMAD_MAX_PORTS * 2); static void ib_umad_add_one(struct ib_device *device); static void ib_umad_remove_one(struct ib_device *device); +static void ib_umad_release_dev(struct kref *ref) +{ + struct ib_umad_device *dev = + container_of(ref, struct ib_umad_device, ref); + + kfree(dev); +} + static int queue_packet(struct ib_umad_file *file, struct ib_mad_agent *agent, struct ib_umad_packet *packet) @@ -534,13 +567,23 @@ static long ib_umad_ioctl(struct file *f static int ib_umad_open(struct inode *inode, struct file *filp) { - struct ib_umad_port *port = - container_of(inode->i_cdev, struct ib_umad_port, dev); + struct ib_umad_port *port; struct ib_umad_file *file; + spin_lock(&port_lock); + port = umad_port[iminor(inode) - IB_UMAD_MINOR_BASE]; + if (port) + kref_get(&port->umad_dev->ref); + spin_unlock(&port_lock); + + if (!port) + return -ENXIO; + file = kzalloc(sizeof *file, GFP_KERNEL); - if (!file) + if (!file) { + kref_put(&port->umad_dev->ref, ib_umad_release_dev); return -ENOMEM; + } spin_lock_init(&file->recv_lock); init_rwsem(&file->agent_mutex); @@ -556,6 +599,7 @@ static int ib_umad_open(struct inode *in static int ib_umad_close(struct inode *inode, struct file *filp) { struct ib_umad_file *file = filp->private_data; + struct 
ib_umad_device *dev = file->port->umad_dev; struct ib_umad_packet *packet, *tmp; int i; @@ -570,6 +614,8 @@ static int ib_umad_close(struct inode *i kfree(file); + kref_put(&dev->ref, ib_umad_release_dev); + return 0; } @@ -586,30 +632,46 @@ static struct file_operations umad_fops static int ib_umad_sm_open(struct inode *inode, struct file *filp) { - struct ib_umad_port *port = - container_of(inode->i_cdev, struct ib_umad_port, sm_dev); + struct ib_umad_port *port; struct ib_port_modify props = { .set_port_cap_mask = IB_PORT_SM }; int ret; + spin_lock(&port_lock); + port = umad_port[iminor(inode) - IB_UMAD_MINOR_BASE - IB_UMAD_MAX_PORTS]; + if (port) + kref_get(&port->umad_dev->ref); + spin_unlock(&port_lock); + + if (!port) + return -ENXIO; + if (filp->f_flags & O_NONBLOCK) { - if (down_trylock(&port->sm_sem)) - return -EAGAIN; + if (down_trylock(&port->sm_sem)) { + ret = -EAGAIN; + goto fail; + } } else { - if (down_interruptible(&port->sm_sem)) - return -ERESTARTSYS; + if (down_interruptible(&port->sm_sem)) { + ret = -ERESTARTSYS; + goto fail; + } } ret = ib_modify_port(port->ib_dev, port->port_num, 0, &props); if (ret) { up(&port->sm_sem); - return ret; + goto fail; } filp->private_data = port; return 0; + +fail: + kref_put(&port->umad_dev->ref, ib_umad_release_dev); + return ret; } static int ib_umad_sm_close(struct inode *inode, struct file *filp) @@ -623,6 +685,8 @@ static int ib_umad_sm_close(struct inode ret = ib_modify_port(port->ib_dev, port->port_num, 0, &props); up(&port->sm_sem); + kref_put(&port->umad_dev->ref, ib_umad_release_dev); + return ret; } @@ -642,6 +706,9 @@ static ssize_t show_ibdev(struct class_d { struct ib_umad_port *port = class_get_devdata(class_dev); + if (!port) + return -ENODEV; + return sprintf(buf, "%s\n", port->ib_dev->name); } static CLASS_DEVICE_ATTR(ibdev, S_IRUGO, show_ibdev, NULL); @@ -650,38 +717,13 @@ static ssize_t show_port(struct class_de { struct ib_umad_port *port = class_get_devdata(class_dev); + if (!port) + return 
-ENODEV; + return sprintf(buf, "%d\n", port->port_num); } static CLASS_DEVICE_ATTR(port, S_IRUGO, show_port, NULL); -static void ib_umad_release_dev(struct kref *ref) -{ - struct ib_umad_device *dev = - container_of(ref, struct ib_umad_device, ref); - - kfree(dev); -} - -static void ib_umad_release_port(struct class_device *class_dev) -{ - struct ib_umad_port *port = class_get_devdata(class_dev); - - if (class_dev == &port->class_dev) { - cdev_del(&port->dev); - clear_bit(port->devnum, dev_map); - } else { - cdev_del(&port->sm_dev); - clear_bit(port->sm_devnum, dev_map); - } - - kref_put(&port->umad_dev->ref, ib_umad_release_dev); -} - -static struct class umad_class = { - .name = "infiniband_mad", - .release = ib_umad_release_port -}; - static ssize_t show_abi_version(struct class *class, char *buf) { return sprintf(buf, "%d\n", IB_USER_MAD_ABI_VERSION); @@ -691,89 +733,102 @@ static CLASS_ATTR(abi_version, S_IRUGO, static int ib_umad_init_port(struct ib_device *device, int port_num, struct ib_umad_port *port) { - spin_lock(&map_lock); - port->devnum = find_first_zero_bit(dev_map, IB_UMAD_MAX_PORTS); - if (port->devnum >= IB_UMAD_MAX_PORTS) { - spin_unlock(&map_lock); + spin_lock(&port_lock); + port->dev_num = find_first_zero_bit(dev_map, IB_UMAD_MAX_PORTS); + if (port->dev_num >= IB_UMAD_MAX_PORTS) { + spin_unlock(&port_lock); return -1; } - port->sm_devnum = find_next_zero_bit(dev_map, IB_UMAD_MAX_PORTS * 2, IB_UMAD_MAX_PORTS); - if (port->sm_devnum >= IB_UMAD_MAX_PORTS * 2) { - spin_unlock(&map_lock); - return -1; - } - set_bit(port->devnum, dev_map); - set_bit(port->sm_devnum, dev_map); - spin_unlock(&map_lock); + set_bit(port->dev_num, dev_map); + spin_unlock(&port_lock); port->ib_dev = device; port->port_num = port_num; init_MUTEX(&port->sm_sem); - cdev_init(&port->dev, &umad_fops); - port->dev.owner = THIS_MODULE; - kobject_set_name(&port->dev.kobj, "umad%d", port->devnum); - if (cdev_add(&port->dev, base_dev + port->devnum, 1)) + port->dev = cdev_alloc(); 
+ if (!port->dev) return -1; - - port->class_dev.class = &umad_class; - port->class_dev.dev = device->dma_device; - port->class_dev.devt = port->dev.dev; - - snprintf(port->class_dev.class_id, BUS_ID_SIZE, "umad%d", port->devnum); - - if (class_device_register(&port->class_dev)) + port->dev->owner = THIS_MODULE; + port->dev->ops = &umad_fops; + kobject_set_name(&port->dev->kobj, "umad%d", port->dev_num); + if (cdev_add(port->dev, base_dev + port->dev_num, 1)) goto err_cdev; - class_set_devdata(&port->class_dev, port); - kref_get(&port->umad_dev->ref); + port->class_dev = class_device_create(umad_class, port->dev->dev, + device->dma_device, + "umad%d", port->dev_num); + if (IS_ERR(port->class_dev)) + goto err_cdev; - if (class_device_create_file(&port->class_dev, &class_device_attr_ibdev)) + if (class_device_create_file(port->class_dev, &class_device_attr_ibdev)) goto err_class; - if (class_device_create_file(&port->class_dev, &class_device_attr_port)) + if (class_device_create_file(port->class_dev, &class_device_attr_port)) goto err_class; - cdev_init(&port->sm_dev, &umad_sm_fops); - port->sm_dev.owner = THIS_MODULE; - kobject_set_name(&port->dev.kobj, "issm%d", port->sm_devnum - IB_UMAD_MAX_PORTS); - if (cdev_add(&port->sm_dev, base_dev + port->sm_devnum, 1)) - return -1; - - port->sm_class_dev.class = &umad_class; - port->sm_class_dev.dev = device->dma_device; - port->sm_class_dev.devt = port->sm_dev.dev; - - snprintf(port->sm_class_dev.class_id, BUS_ID_SIZE, "issm%d", port->sm_devnum - IB_UMAD_MAX_PORTS); + port->sm_dev = cdev_alloc(); + if (!port->sm_dev) + goto err_class; + port->sm_dev->owner = THIS_MODULE; + port->sm_dev->ops = &umad_sm_fops; + kobject_set_name(&port->dev->kobj, "issm%d", port->dev_num); + if (cdev_add(port->sm_dev, base_dev + port->dev_num + IB_UMAD_MAX_PORTS, 1)) + goto err_sm_cdev; - if (class_device_register(&port->sm_class_dev)) + port->sm_class_dev = class_device_create(umad_class, port->sm_dev->dev, + device->dma_device, + "issm%d", 
port->dev_num); + if (IS_ERR(port->sm_class_dev)) goto err_sm_cdev; - class_set_devdata(&port->sm_class_dev, port); - kref_get(&port->umad_dev->ref); + class_set_devdata(port->class_dev, port); + class_set_devdata(port->sm_class_dev, port); - if (class_device_create_file(&port->sm_class_dev, &class_device_attr_ibdev)) + if (class_device_create_file(port->sm_class_dev, &class_device_attr_ibdev)) goto err_sm_class; - if (class_device_create_file(&port->sm_class_dev, &class_device_attr_port)) + if (class_device_create_file(port->sm_class_dev, &class_device_attr_port)) goto err_sm_class; + spin_lock(&port_lock); + umad_port[port->dev_num] = port; + spin_unlock(&port_lock); + return 0; err_sm_class: - class_device_unregister(&port->sm_class_dev); + class_device_destroy(umad_class, port->sm_dev->dev); err_sm_cdev: - cdev_del(&port->sm_dev); + cdev_del(port->sm_dev); err_class: - class_device_unregister(&port->class_dev); + class_device_destroy(umad_class, port->dev->dev); err_cdev: - cdev_del(&port->dev); - clear_bit(port->devnum, dev_map); + cdev_del(port->dev); + clear_bit(port->dev_num, dev_map); return -1; } +static void ib_umad_kill_port(struct ib_umad_port *port) +{ + class_set_devdata(port->class_dev, NULL); + class_set_devdata(port->sm_class_dev, NULL); + + class_device_destroy(umad_class, port->dev->dev); + class_device_destroy(umad_class, port->sm_dev->dev); + + cdev_del(port->dev); + cdev_del(port->sm_dev); + + spin_lock(&port_lock); + umad_port[port->dev_num] = NULL; + spin_unlock(&port_lock); + + clear_bit(port->dev_num, dev_map); +} + static void ib_umad_add_one(struct ib_device *device) { struct ib_umad_device *umad_dev; @@ -809,10 +864,8 @@ static void ib_umad_add_one(struct ib_de return; err: - while (--i >= s) { - class_device_unregister(&umad_dev->port[i - s].class_dev); - class_device_unregister(&umad_dev->port[i - s].sm_class_dev); - } + while (--i >= s) + ib_umad_kill_port(&umad_dev->port[i]); kref_put(&umad_dev->ref, ib_umad_release_dev); } @@ 
-825,10 +878,8 @@ static void ib_umad_remove_one(struct ib if (!umad_dev) return; - for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) { - class_device_unregister(&umad_dev->port[i].class_dev); - class_device_unregister(&umad_dev->port[i].sm_class_dev); - } + for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) + ib_umad_kill_port(&umad_dev->port[i]); kref_put(&umad_dev->ref, ib_umad_release_dev); } @@ -837,8 +888,6 @@ static int __init ib_umad_init(void) { int ret; - spin_lock_init(&map_lock); - ret = register_chrdev_region(base_dev, IB_UMAD_MAX_PORTS * 2, "infiniband_mad"); if (ret) { @@ -846,13 +895,14 @@ static int __init ib_umad_init(void) goto out; } - ret = class_register(&umad_class); - if (ret) { + umad_class = class_create(THIS_MODULE, "infiniband_mad"); + if (IS_ERR(umad_class)) { + ret = PTR_ERR(umad_class); printk(KERN_ERR "user_mad: couldn't create class infiniband_mad\n"); goto out_chrdev; } - ret = class_create_file(&umad_class, &class_attr_abi_version); + ret = class_create_file(umad_class, &class_attr_abi_version); if (ret) { printk(KERN_ERR "user_mad: couldn't create abi_version attribute\n"); goto out_class; @@ -867,7 +917,7 @@ static int __init ib_umad_init(void) return 0; out_class: - class_unregister(&umad_class); + class_destroy(umad_class); out_chrdev: unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS * 2); @@ -879,7 +929,7 @@ out: static void __exit ib_umad_cleanup(void) { ib_unregister_client(&umad_client); - class_unregister(&umad_class); + class_destroy(umad_class); unregister_chrdev_region(base_dev, IB_UMAD_MAX_PORTS * 2); } From ardavis at ichips.intel.com Fri Oct 28 16:07:56 2005 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 28 Oct 2005 16:07:56 -0700 Subject: uDAPL Problem : [WasRe: [openib-general] OpenSM crash with today's trunk In-Reply-To: <43629F9A.3070704@cs.rutgers.edu> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com> <4362692A.80207@cs.rutgers.edu> 
 <52irvhtpty.fsf@cisco.com> <43628965.60902@cs.rutgers.edu>
 <52fyqls81f.fsf@cisco.com> <43628F6C.9070308@cs.rutgers.edu>
 <52acgts6lj.fsf@cisco.com> <43629F9A.3070704@cs.rutgers.edu>
Message-ID: <4362AF4C.7080403@ichips.intel.com>

Aniruddha Bohra wrote:

>
> Now, I have a problem with udapl :
>
> The following is a code snippet from :
> dapl_ib_dto.h
>
>        for (i = 0; i < segments; i++ ) {
>                if ( !local_iov[i].segment_length )
>                        continue;
>
>                ds_array_p->addr = (uint64_t)
> local_iov[i].virtual_address;
>                ds_array_p->length = local_iov[i].segment_length;
>                ds_array_p->lkey = local_iov[i].lmr_context;
>
>                dapl_dbg_log ( DAPL_DBG_TYPE_EP,
>                               " post_snd: lkey 0x%x va %p len %d \n",
>                               ds_array_p->lkey, ds_array_p->addr,
>                               ds_array_p->length );
>
>                total_len += ds_array_p->length;
>                wr.num_sge++;
>                ds_array_p++;
>        }
>
> The following is the relevant part of the log with DAPL_DBG_TYPE=0xffff
>
> dapl_ep_post_send (0x8087110, 2, 0x80f9910, %P, b5f395bc)
> post_snd: ep 0x8087110 op 2 ck 0x8087374 sgs 2 l_iov 0x80f9910 r_iov
> 0xbfc29060 f 0
> post_snd: ep 0x8087110 cookie 0x8087374 segs 2 l_iov 0x80f9910
> post_snd: lkey 0x10de003b va 0xb5f3976c len 0
> post_snd: lkey 0x10de003b va 0xb5f39924 len 0
>
>                                         ^^^^^^^^
>
> From the above loop, how is this possible :
> If local_iov[i].segment_length == 0, it should not be printed. And
> if the assignment is successful, len must not be 0.
>
> Any ideas? Of course following this, the ep is disconnected in the
> next step :(

local_iov (LMR) length is 64 bits and the ibv_sge (ds_array) length is
32 bits, so it truncates. Sounds like you set up a transfer greater
than 4GB-1?

If you query the device via uDAPL you will see the max limits (2GB):

 query_hca: (a0.0) ep 64512 ep_q 65535 evd 65408 evd_q 131071
 query_hca: msg 2147483648 rdma 2147483648 iov 59 lmr 131056 rmr 0

-arlin
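
The truncation described above can be reproduced outside uDAPL. The
sketch below uses hypothetical stand-in structs (demo_iov, demo_sge)
that keep only the field widths mentioned in this thread: a 64-bit
segment length on the uDAPL side and a 32-bit length in the
scatter/gather entry. A length that is an exact multiple of 4 GiB passes
the nonzero check in the loop yet is stored as 0, which would match the
"len 0" lines in the log. demo_copy_checked is one possible guard, not
something taken from the uDAPL source.

```c
#include <stdint.h>

/* Hypothetical stand-ins; only the field widths matter here. */
struct demo_iov {
        uint64_t virtual_address;
        uint64_t segment_length;   /* 64-bit length on the uDAPL side */
        uint32_t lmr_context;
};

struct demo_sge {
        uint64_t addr;
        uint32_t length;           /* 32-bit length in the SGE */
        uint32_t lkey;
};

/* Mirrors the assignments in the post_snd loop: no range check. */
static void demo_copy_unchecked(struct demo_sge *sge,
                                const struct demo_iov *iov)
{
        sge->addr = iov->virtual_address;
        sge->length = (uint32_t) iov->segment_length; /* high 32 bits dropped */
        sge->lkey = iov->lmr_context;
}

/* Returns 0 on success, -1 if the length does not fit in 32 bits. */
static int demo_copy_checked(struct demo_sge *sge,
                             const struct demo_iov *iov)
{
        if (iov->segment_length > UINT32_MAX)
                return -1;
        demo_copy_unchecked(sge, iov);
        return 0;
}
```

With segment_length set to 1ULL << 32, the unchecked copy leaves
sge.length at 0 even though the source length is nonzero, while the
checked variate rejects the segment before anything is posted.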

From rolandd at cisco.com Fri Oct 28 16:40:47 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 28 Oct 2005 16:40:47 -0700
Subject: [openib-general] [git pull] InfiniBand updates for 2.6.14
Message-ID: <523bmlqkg0.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    rsync://rsync.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

The pull will get the following changes:

Jack Morgenstein:
      [IB] Add checks to multicast attach and detach
      [IB] mthca: Report correct atomic capability
      [IB] mthca: Fill in more fields in query_port method
      [IB] mthca: Better limit checking and reporting
      [IB] mthca: Don't enter QP into MCG more than once.

Roland Dreier:
      [IB] uverbs: ABI-breaking fixes for userspace verbs
      [IB] uverbs: Fix up resource creation error paths
      [IB] uverbs: Add device-specific ABI version attribute
      [IB] uverbs: reject invalid memory registration permission flags
      [IB] Check port number in ib_query_port()/ib_modify_port()
      [IB] mthca: SRQ limit reached events
      [IB] mthca: detect SRQ overflow
      [IB] Fix leak on MAD initialization failure
      [IPoIB] Rename ipoib_create_qp() -> ipoib_init_qp() and fix error cleanup
      [IB] uverbs: unlock correctly in error paths
      [IB] fail SA queries if device initialization failed
      [IB] uverbs: Add a mask of device methods allowed for userspace
      [IB] uverbs: Add ABI structures for more commands
      [IB] uverbs: Implement more commands
      [IB] ucm: quiet sparse warnings
      [IPoIB] Improve ipoib_timeout() output
      [IB] mthca: Use enum in mthca_alloc_db() prototype
      [IB] mthca: Add struct pci_driver.owner field
      [IB] Fail sysfs queries after device is unregistered
      [IB] cm: Add missing break in switch
      [IB] user_mad: trivial coding style fixes
      [IB] user_mad: Use class_device.devt
      [IB] mthca: Always re-arm EQs in mthca_tavor_interrupt()
      Merge master.kernel.org:/.../torvalds/linux-2.6
      [IB] Add idr_destroy() calls on module unload
      Manual merge of for-linus to upstream (fix conflicts in drivers/infiniband/core/ucm.c)
      [IB] mthca: correct modify QP attribute masks for UC
      [IB] simplify mad_rmpp.c:alloc_response_msg()
      [IB] mthca: first pass at catastrophic error reporting
      [IB] ib_umad: fix crash when freeing send buffers
      [IPoIB] Drop RX packets when out of memory
      [IB] umad: Fix device lifetime problems
      [IB] uverbs: Fix device lifetime problems
      Merge master.kernel.org:/.../torvalds/linux-2.6
      [IB] fix up class_device_create() calls

Sean Hefty:
      [IB] merge ucm.h into ucm.c
      [IB] CM: bind IDs to a specific device
      [IB] CM: Fix initialization of QP attributes for UC QPs.
      [IB] Fix MAD layer DMA mappings to avoid touching data buffer once mapped
      [IB] ib_umad: various cleanups

 drivers/infiniband/core/agent.c | 301 ++-------
 drivers/infiniband/core/agent.h | 13
 drivers/infiniband/core/agent_priv.h | 62 --
 drivers/infiniband/core/cm.c | 217 +++----
 drivers/infiniband/core/cm_msgs.h | 1
 drivers/infiniband/core/device.c | 12
 drivers/infiniband/core/mad.c | 329 +++++-----
 drivers/infiniband/core/mad_priv.h | 8
 drivers/infiniband/core/mad_rmpp.c | 112 ++-
 drivers/infiniband/core/mad_rmpp.h | 2
 drivers/infiniband/core/sa_query.c | 272 ++++----
 drivers/infiniband/core/smi.h | 2
 drivers/infiniband/core/sysfs.c | 16
 drivers/infiniband/core/ucm.c | 267 ++++++--
 drivers/infiniband/core/ucm.h | 83 ---
 drivers/infiniband/core/user_mad.c | 403 ++++++------
 drivers/infiniband/core/uverbs.h | 62 +-
 drivers/infiniband/core/uverbs_cmd.c | 858 +++++++++++++++++++++-----
 drivers/infiniband/core/uverbs_main.c | 503 ++++++++++-----
 drivers/infiniband/core/verbs.c | 18 -
 drivers/infiniband/hw/mthca/Makefile | 3
 drivers/infiniband/hw/mthca/mthca_catas.c | 153 +++++
 drivers/infiniband/hw/mthca/mthca_cmd.c | 11
 drivers/infiniband/hw/mthca/mthca_dev.h | 22 +
 drivers/infiniband/hw/mthca/mthca_eq.c | 21 +
 drivers/infiniband/hw/mthca/mthca_mad.c | 72 --
 drivers/infiniband/hw/mthca/mthca_main.c | 11
 drivers/infiniband/hw/mthca/mthca_mcg.c | 11
 drivers/infiniband/hw/mthca/mthca_memfree.c | 3
 drivers/infiniband/hw/mthca/mthca_memfree.h | 3
 drivers/infiniband/hw/mthca/mthca_provider.c | 49 +
 drivers/infiniband/hw/mthca/mthca_qp.c | 16
 drivers/infiniband/hw/mthca/mthca_srq.c | 43 +
 drivers/infiniband/hw/mthca/mthca_user.h | 6
 drivers/infiniband/ulp/ipoib/ipoib.h | 23 -
 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 122 ++--
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 15
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 9
 include/rdma/ib_cm.h | 10
 include/rdma/ib_mad.h | 66 +-
 include/rdma/ib_user_cm.h | 10
 include/rdma/ib_user_verbs.h | 222 +++++--
 include/rdma/ib_verbs.h | 6
 43 files changed, 2675 insertions(+), 1773 deletions(-)
 delete mode 100644 drivers/infiniband/core/agent_priv.h
 delete mode 100644 drivers/infiniband/core/ucm.h
 create mode 100644 drivers/infiniband/hw/mthca/mthca_catas.c

From akpm at osdl.org Fri Oct 28 17:12:18 2005
From: akpm at osdl.org (Andrew Morton)
Date: Fri, 28 Oct 2005 17:12:18 -0700
Subject: [openib-general] Re: [git pull] InfiniBand updates for 2.6.14
In-Reply-To: <523bmlqkg0.fsf@cisco.com>
References: <523bmlqkg0.fsf@cisco.com>
Message-ID: <20051028171218.2b8e71e7.akpm@osdl.org>

Roland Dreier wrote:
>
> 43 files changed, 2675 insertions(+), 1773 deletions(-)

That's rather a lot of code. AFAIK it hasn't been past linux-kernel. It
hasn't been in -mm.

Can we please

a) arrange for the current infiniband devel tree to be included in -mm and

b) arrange for infiniband patches to get wider review than this?

Thanks.

From rolandd at cisco.com Fri Oct 28 17:18:42 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Fri, 28 Oct 2005 17:18:42 -0700
Subject: [openib-general] Re: [git pull] InfiniBand updates for 2.6.14
In-Reply-To: <20051028171218.2b8e71e7.akpm@osdl.org> (Andrew Morton's message of "Fri, 28 Oct 2005 17:12:18 -0700")
References: <523bmlqkg0.fsf@cisco.com> <20051028171218.2b8e71e7.akpm@osdl.org>
Message-ID: <52y84dp44d.fsf@cisco.com>

    Andrew> a) arrange for the current infiniband devel tree to be
    Andrew> included in -mm and

Sure. How do you want to handle that? The way I've been working
lately is to merge things onto my "upstream" branch when I intend for
them to go to Linus eventually, and merge that onto the "for-linus"
branch when I'm going to ask Linus to pull. I guess it would make
sense for you to grab the upstream branch for -mm.

    Andrew> b) arrange for infiniband patches to get wider review than this?

No objection from me. How do you suggest I do that? Post things to
linux-kernel as I merge them into git?

Thanks,
  Roland

From akpm at osdl.org Fri Oct 28 17:39:01 2005
From: akpm at osdl.org (Andrew Morton)
Date: Fri, 28 Oct 2005 17:39:01 -0700
Subject: [openib-general] Re: [git pull] InfiniBand updates for 2.6.14
In-Reply-To: <52y84dp44d.fsf@cisco.com>
References: <523bmlqkg0.fsf@cisco.com> <20051028171218.2b8e71e7.akpm@osdl.org>
 <52y84dp44d.fsf@cisco.com>
Message-ID: <20051028173901.6bd2c302.akpm@osdl.org>

Roland Dreier wrote:
>
> Andrew> a) arrange for the current infiniband devel tree to be
> Andrew> included in -mm and
>
> Sure. How do you want to handle that? The way I've been working
> lately is to merge things onto my "upstream" branch when I intend for
> them to go to Linus eventually, and merge that onto the "for-linus"
> branch when I'm going to ask Linus to pull. I guess it would make
> sense for you to grab the upstream branch for -mm.

That suits.
I'll include
master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git#upstream

> Andrew> b) arrange for infiniband patches to get wider review than this?
>
> No objection from me. How do you suggest I do that? Post things to
> linux-kernel as I merge them into git?

That would be suitable, I guess. It's a bit of a hassle, but some bugs
will likely be found, and useful suggestions will be made.

From hozer at hozed.org Fri Oct 28 18:10:30 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Fri, 28 Oct 2005 20:10:30 -0500
Subject: [openib-general] Re: ehca testing
In-Reply-To: <52k6fyyjsa.fsf@cisco.com>
References: <20051020144020.GR30127@kalmia.hozed.org>
 <20051020150432.GS30127@kalmia.hozed.org> <52ach4f5ak.fsf@cisco.com>
 <20051020175603.GV30127@kalmia.hozed.org> <52sluwdq1b.fsf@cisco.com>
 <20051020220759.GX30127@kalmia.hozed.org> <52br1jes5u.fsf@cisco.com>
 <20051027163642.GI3275@kalmia.hozed.org> <52k6fyyjsa.fsf@cisco.com>
Message-ID: <20051029011030.GO3275@kalmia.hozed.org>

On Thu, Oct 27, 2005 at 10:03:17AM -0700, Roland Dreier wrote:
> OK, looks like you have two problems. First of all, you seem to have
> two versions of ib_mthca, one of which gets picked up by hotplug on
> boot and one of which gets picked up by modprobe. Notice how you
> don't see the
>
>     dev->ib_dev.node_type = 1
>
> line when mthca runs on boot? The only explanation I can come up with
> for that would be that you have an old version of it in an initrd or
> something that's screwing thing up.

Whoops, that's exactly what's going on..

Now to figure out how to not have IB stuff included in my initrd..
From rolandd at cisco.com Fri Oct 28 19:39:51 2005 From: rolandd at cisco.com (Roland Dreier) Date: Fri, 28 Oct 2005 19:39:51 -0700 Subject: [openib-general] Re: [git pull] InfiniBand updates for 2.6.14 In-Reply-To: <20051028173901.6bd2c302.akpm@osdl.org> (Andrew Morton's message of "Fri, 28 Oct 2005 17:39:01 -0700") References: <523bmlqkg0.fsf@cisco.com> <20051028171218.2b8e71e7.akpm@osdl.org> <52y84dp44d.fsf@cisco.com> <20051028173901.6bd2c302.akpm@osdl.org> Message-ID: <52k6fxoxl4.fsf@cisco.com> Andrew> That would be suitable, I guess. It's a bit of a hassle, Andrew> but some bugs will likely be found, and useful suggestions Andrew> will be made. No objections here... the more people I can get reading patches, the better. I'll see about scripting something to make it a semi-automatic part of my workflow. - R. From hozer at hozed.org Fri Oct 28 19:50:27 2005 From: hozer at hozed.org (Troy Benjegerdes) Date: Fri, 28 Oct 2005 21:50:27 -0500 Subject: [openib-general] prototype version of ebus driver In-Reply-To: References: Message-ID: <20051029025027.GP3275@kalmia.hozed.org> On Wed, Oct 26, 2005 at 04:56:08PM +0200, IBMEHCA DD wrote: > on kernel 2.6.13 and 14 a "ebus" driver is needed to enable the ehca > driver on power5. > I just uploaded a prototype patch to gen2/users/ehca svn 3879 > Please get some responses from the PPC64 maintainers, or possibly linux-kernel. I'd like to see ehca get reviewed as well, but it may be a little early for that ;) From sinate at yahoo.com Sat Oct 29 03:03:14 2005 From: sinate at yahoo.com (Steven Wooding) Date: Sat, 29 Oct 2005 11:03:14 +0100 (BST) Subject: [openib-general] Missing ib_al.h file? Message-ID: <20051029100315.38826.qmail@web32507.mail.mud.yahoo.com> Hi, I am trying to get my app to use serviceRecords with the SA. Anyway, my problem is that a file called ib_al.h seems to be missing. It should be in trunk/...osm/include/iba/ directory, along with ib_types.h. 
The file osm_vendor_al.h includes it, which is included by osm_vendor.h. I notice that file selects the vendor. I have Mellonox IB cards, so have I got that right? Thanks, Steve. ___________________________________________________________ How much free photo storage do you get? Store your holiday snaps for FREE with Yahoo! Photos http://uk.photos.yahoo.com
From halr at voltaire.com Sat Oct 29 08:39:18 2005 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 29 Oct 2005 17:39:18 +0200 Subject: [openib-general] Missing ib_al.h file? Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5175CC0@taurus.voltaire.com> Hi, That's an IBAL file (gen1). You need to build with VENDOR=openib to use this which should not need that file. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Steven Wooding Sent: Sat 10/29/2005 6:03 AM To: openib-general at openib.org Subject: [openib-general] Missing ib_al.h file? Hi, I am trying to get my app to use serviceRecords with the SA. Anyway, my problem is that a file called ib_al.h seems to be missing. It should be in trunk/...osm/include/iba/ directory, along with ib_types.h. The file osm_vendor_al.h includes it, which is included by osm_vendor.h. I notice that file selects the vendor. I have Mellonox IB cards, so have I got that right? Thanks, Steve. ___________________________________________________________ How much free photo storage do you get? Store your holiday snaps for FREE with Yahoo!
Photos http://uk.photos.yahoo.com _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From kjreilly at us.ibm.com Sat Oct 29 15:21:43 2005 From: kjreilly at us.ibm.com (Kevin Reilly) Date: Sat, 29 Oct 2005 18:21:43 -0400 Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a Message-ID: Thanks Sean, I think the rdma_resolve_addr() does what we want. Translate a local IP to a ib_device structure that i can use in the ibverbs. What we want to do is pretty simple and we won't need to create a connection. Can we have a discussion on the timeframe for this? Kevin J.
Reilly STSM, HPC Architecture -Federation/HPS Chief Engineer -HPC interconnect architect (office) 845-433-7976 (tieline) 8-293-7976 From bos at pathscale.com Sat Oct 29 16:55:58 2005 From: bos at pathscale.com (Bryan O'Sullivan) Date: Sat, 29 Oct 2005 16:55:58 -0700 Subject: [openib-general] [PATCH] Build RPM packages cleanly Message-ID: <1130630158.9725.8.camel@camp4.serpentine.com> This patch makes it next to trivial to build RPM packages of OpenIB on reasonably recent RPM-based distributions (e.g. Fedora, SuSE). The patch has a few major components: * Cleanups of RPM spec files that either never built or have bit-rotted since they were written. * Cleanups of autotools scripts so that they work properly with the spec files in all cases. * A spec file for OpenSM, which was the only major component of OpenIB not packages. * A spec file for the kernel modules, to make it possible to run the latest, greatest OpenIB kernel components on unpatched kernels. This has been tested with pristine Fedora kernels. * A shell script that automates the entire RPM build process, so you don't need any special smarts to do a build. I think that the last three items above are pretty significant steps in making OpenIB more approachable to users who aren't necessarily bleeding-edge hackers. I have tested the RPMs that this stuff builds on Fedora Core 3 and 4, and SuSE 9.3. Both i386 and x86_64 RPMs build happily, and on x86_64 machines, the two arches coexist without any conflicts or problems. If people are interested, I can make yum repositories of the relevant built RPMs available for recent Fedora and SuSE releases.
Hello!
Now that 2.6.14 is out, patches to make svn trunk compile against 2.6.13 and older kernels have been uploaded to https://openib.org/svn/gen2/branches/backport Enjoy, -- MST From yael at mellanox.co.il Sun Oct 30 03:54:33 2005 From: yael at mellanox.co.il (Yael Kalka) Date: Sun, 30 Oct 2005 13:54:33 +0200 Subject: [openib-general] Patches for Opensm Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E2392@mtlexch01.mtl.com> Hi Hal, I noticed that you've checked in a change to the osm trunk few days ago without sending a patch regarding it. Since I am the owner of the opensm tree under Windows, and I am trying to keep the Windows tree as similar as possible to the Linux tree - I want to know about checkins to the osm tree, so I can add the patches to the Windows tree as well. Please send an e-mail with a patch when you commit changes to the osm tree.
Thanks, Yael -----Original Message----- From: Yael Kalka [mailto:yael at mellanox.co.il] Sent: Thursday, October 27, 2005 3:04 PM To: halr at voltaire.com Cc: openib-general at openib.org; eitan at mellanox.co.il; yael at mellanox.co.il Subject: [PATCH] Opensm - fix lmc algorithm Hi Hal, We noticed a problem in the lmc assignment algorithm. In the current code - when trying to run opensm with lmc > 0, the opensm goes into infinite loop. Debugging the problem we noticed that there is a problem with the lid assignment, and we changed the algorithm. The change is in the osm_lid_mgr_init_sweep function. We have done some testing to the new code, and it seems that the lmc assignment is ok with the fix. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 3848) +++ opensm/osm_lid_mgr.c (working copy) @@ -337,7 +337,7 @@ __osm_lid_mgr_init_sweep( uint16_t max_defined_lid; uint16_t max_persistent_lid; uint16_t max_discovered_lid; - uint16_t lid, l; + uint16_t lid; uint16_t disc_min_lid; uint16_t disc_max_lid; uint16_t db_min_lid; @@ -349,16 +349,23 @@ __osm_lid_mgr_init_sweep( osm_port_t *p_port; cl_qmap_t *p_port_guid_tbl; uint8_t lmc_num_lids = (uint8_t)(1 << p_mgr->p_subn->opt.lmc); + uint16_t lmc_mask; + uint16_t req_lid, num_lids; OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_init_sweep ); + if (p_mgr->p_subn->opt.lmc) + lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1); + else + lmc_mask = 0xffff; + /* if we came out of standby we need to discard any previous guid 2 lid info we might had */ if ( p_mgr->p_subn->coming_out_of_standby == TRUE ) { osm_db_clear( p_mgr->p_g2l ); for (lid = 0; lid < cl_ptr_vector_get_size(&p_mgr->used_lids); lid++) - cl_ptr_vector_set(&p_mgr->used_lids, lid, NULL); + cl_ptr_vector_set(p_persistent_vec, lid, NULL); } /* we need to cleanup the empty ranges list */ @@ -375,7 +382,7 @@ __osm_lid_mgr_init_sweep( /* we if are on the 
first sweep and in re-assign lids mode we should ignore all the available info and simply define one - hufe empty range */ + huge empty range */ if ((p_mgr->p_subn->first_time_master_sweep == TRUE) && (p_mgr->p_subn->opt.reassign_lids == TRUE )) { @@ -398,6 +405,34 @@ __osm_lid_mgr_init_sweep( osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid); for (lid = disc_min_lid; lid <= disc_max_lid; lid++) cl_ptr_vector_set(p_discovered_vec, lid, p_port ); + /* make sure the guid2lid entry is valid. If not - clean it. */ + if (!osm_db_guid2lid_get( p_mgr->p_g2l, + cl_ntoh64(osm_port_get_guid(p_port)), + &db_min_lid, &db_max_lid)) + { + if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != + IB_NODE_TYPE_SWITCH) + num_lids = lmc_num_lids; + else + num_lids = 1; + + if ((num_lids != 1) && + (((db_min_lid & lmc_mask) != db_min_lid) || + (db_max_lid - db_min_lid + 1 < num_lids)) ) + { + /* Not alligned, or not wide enough - remove the entry */ + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, + "__osm_lid_mgr_init_sweep: " + "Cleaning persistent entry for guid:0x%016" PRIx64 + " illegal range:[0x%x:0x%x] \n", + cl_ntoh64(osm_port_get_guid(p_port)), db_min_lid, + db_max_lid ); + osm_db_guid2lid_delete( p_mgr->p_g2l, + cl_ntoh64(osm_port_get_guid(p_port))); + for ( lid = db_min_lid ; lid <= db_max_lid ; lid++ ) + cl_ptr_vector_set(p_persistent_vec, lid, NULL); + } + } } /* @@ -434,7 +469,7 @@ __osm_lid_mgr_init_sweep( { is_free = TRUE; /* first check to see if the lid is used by a persistent assignment */ - if ((lid < max_persistent_lid) && cl_ptr_vector_get(p_persistent_vec, lid)) + if ((lid <= max_persistent_lid) && cl_ptr_vector_get(p_persistent_vec, lid)) { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " @@ -442,62 +477,86 @@ __osm_lid_mgr_init_sweep( lid); is_free = FALSE; } - - /* check the discovered port if there is one */ - if ((lid < max_discovered_lid) && - (p_port = (osm_port_t *)cl_ptr_vector_get(p_discovered_vec, lid))) + else { - /* 
get the lid range of that port - but we know how many lids we - are about to assign to it */ - osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid); - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != - IB_NODE_TYPE_SWITCH) - disc_max_lid = disc_min_lid + lmc_num_lids - 1; - + /* check this is a discovered port */ + CL_ASSERT(lid <= max_discovered_lid); + if ((p_port = (osm_port_t *)cl_ptr_vector_get(p_discovered_vec, lid))) + { + /* we have a port. Now lets see if we can preserve its lid range. */ + /* For that - we need to make sure: + 1. The port has a (legal) persistancy entry. Then the local lid + is free (we will use the persistancy value). + 2. Can the port keep its local assignment? + a. Make sure the lid a alligned. + b. Make sure all needed lids (for the lmc) are free according + to persistancy table. + */ /* qualify the guid of the port is not persistently mapped to another range */ if (!osm_db_guid2lid_get( p_mgr->p_g2l, cl_ntoh64(osm_port_get_guid(p_port)), &db_min_lid, &db_max_lid)) { - /* ok there is an asignment - is it the same ? */ - if ((disc_min_lid == db_min_lid) && (disc_max_lid == db_max_lid)) - { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " - "[0x%04x,0x%04x] is not free as it was discovered " - " and mapped by the persistent db.\n", - disc_min_lid, disc_max_lid); - is_free = FALSE; + "0x%04x is free as it was discovered " + "but mapped by the persistent db to [0x%04x:0x%04x].\n", + lid, db_min_lid, db_max_lid); + } + else + { + /* can the port keep its assignment ? 
*/ + /* get the lid range of that port, and the required number + of lids we are about to assign to it */ + osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid); + if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != + IB_NODE_TYPE_SWITCH) + { + disc_max_lid = disc_min_lid + lmc_num_lids - 1; + num_lids = lmc_num_lids; } else { + num_lids = 1; + } + /* Make sure the lid is alligned */ + if ((num_lids != 1) && ((disc_min_lid & lmc_mask) != disc_min_lid)) + { + /* The lid cannot be used */ osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " - "[0x%04x,0x%04x] is free as it was discovered" - " but mapped to range: [0x%x:0x%x] by the persistent db.\n", - disc_min_lid, disc_max_lid, db_min_lid, db_max_lid); - for (l = disc_min_lid; l <= disc_max_lid; l++) - cl_ptr_vector_set(p_discovered_vec, l, NULL); - } + "0x%04x is free as it was discovered " + "but not alligned. \n", + lid ); } else { + /* check that all needed lids are not persistantly mapped */ + is_free = FALSE; + for ( req_lid = disc_min_lid + 1 ; req_lid <= disc_max_lid ; req_lid++ ) + { + if ((req_lid <= max_persistent_lid) && cl_ptr_vector_get(p_persistent_vec, req_lid)) + { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " - "0x%04x is not free as it was discovered" - " and there is no persistent db entry for it.\n", + "0x%04x is free as it was discovered " + "but mapped. \n", lid); - is_free = FALSE; + is_free = TRUE; + break; + } } - - /* if there is more then one lid on that port - and the discovered port - is going to retain its lids advance to the max lid */ if (is_free == FALSE) { + /* This port will use its local lid, and consume the entire required lid range. + Thus we can skip that range. 
*/ lid = disc_max_lid; } } + } + } + } if (is_free) { @@ -1300,7 +1359,6 @@ osm_lid_mgr_process_subnet( /* the proc returns the fact it sent a set port info */ if (__osm_lid_mgr_set_physp_pi( p_mgr, p_physp, cl_hton16( min_lid_ho ))) p_mgr->send_set_reqs = TRUE; - } } /* all ports */ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Sun Oct 30 04:36:22 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 30 Oct 2005 14:36:22 +0200 Subject: [openib-general] 2.6.14 patches Message-ID: <20051030123622.GD4769@mellanox.co.il> Hi! Sean, Hal, now that 2.6.14 is out, do you plan to apply the patches in https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/? Once you do, I'll put reverted patches in the backport directory. I suggest we then remove the rest of the 2.6.14-rc3 files from the patches directory except linux-2.6.14-fib-frontend.diff - what do you guys think? I already did this for SDP and for linux-2.6.14-rc3-sdp_link.diff I took the liberty to rename linux-2.6.14-rc3-fib-frontend.diff to linux-2.6.14-fib-frontend.diff, since the patch is for 2.6.14 as well. Thanks, -- MST From sinate at yahoo.com Sun Oct 30 08:23:30 2005 From: sinate at yahoo.com (Steven Wooding) Date: Sun, 30 Oct 2005 16:23:30 +0000 (GMT) Subject: [openib-general] Missing ib_al.h file? In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5175CC0@taurus.voltaire.com> Message-ID: <20051030162330.38088.qmail@web32512.mail.mud.yahoo.com> Thanks Hal. That makes sense. I'll give that a go. Cheers, Steve. --- Hal Rosenstock wrote: > Hi, > > That's an IBAL file (gen1). You need to build with > VENDOR=openib to use this which should not need that > file. > > -- Hal > ___________________________________________________________ Yahoo! 
Messenger - NEW crystal clear PC to PC calling worldwide with voicemail http://uk.messenger.yahoo.com

From amip at mellanox.co.il Sun Oct 30 08:38:14 2005
From: amip at mellanox.co.il (Ami Parlmuter)
Date: Sun, 30 Oct 2005 18:38:14 +0200
Subject: [openib-general] SRQ freezes up
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E35ABDA4@mtlexch01.mtl.com>

Running ibv_srq_pingpong exposes two bugs in the SRQ:

1. A failure to post RRs to the SRQ after polling completions sent to it (the verb ibv_post_srq_recv fails, returning -1).
2. As a direct result of this, the other side gets a bad completion with a RETRY EXCEEDED error, and then the machine freezes up.

The first bug has been there for quite some time; the second only happens from REV 3890 (the previous version I tested was 3382).

The command lines I used with the test:

server: /usr/local/bin/ibv_srq_pingpong --port=19872 --ib-dev=mthca0 --ib-port=1 -n 10000 --num-qp=200 --rx-depth=5
client: /usr/local/bin/ibv_srq_pingpong --port=19872 --ib-dev=mthca0 --ib-port=1 -n 10000 --num-qp=200 --rx-depth=5 SERVER_IP_ADDR

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From halr at voltaire.com Sun Oct 30 08:51:21 2005
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 Oct 2005 11:51:21 -0500
Subject: [openib-general] Missing ib_al.h file?
In-Reply-To: <20051030162330.38088.qmail@web32512.mail.mud.yahoo.com>
References: <20051030162330.38088.qmail@web32512.mail.mud.yahoo.com>
Message-ID: <1130691080.4425.904.camel@hal.voltaire.com>

On Sun, 2005-10-30 at 11:23, Steven Wooding wrote:
> Thanks Hal.
>
> That makes sense. I'll give that a go.

This is built as part of libosmvendor so if you build OpenSM, you will have this to link with.

-- Hal

> Cheers,
>
> Steve.
>
> --- Hal Rosenstock wrote:
>
> > Hi,
> >
> > That's an IBAL file (gen1). You need to build with
> > VENDOR=openib to use this which should not need that
> > file.
> >
> > -- Hal
> >
>
>
>
>
>
>
> ___________________________________________________________
> Yahoo! Messenger - NEW crystal clear PC to PC calling worldwide with voicemail http://uk.messenger.yahoo.com

From rolandd at cisco.com Sun Oct 30 09:52:49 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Sun, 30 Oct 2005 09:52:49 -0800
Subject: [openib-general] Re: SRQ freezes up
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E35ABDA4@mtlexch01.mtl.com> (Ami Parlmuter's message of "Sun, 30 Oct 2005 18:38:14 +0200")
References: <6AB138A2AB8C8E4A98B9C0C3D52670E35ABDA4@mtlexch01.mtl.com>
Message-ID: <52y84anb7y.fsf@cisco.com>

>>>>> "Ami" == Ami Parlmuter writes:

    Ami> running ibv_srq_pingpong pops up two bugs in the SRQ. 1. a
    Ami> failure to RRs to the SRQ after polling completions sent to
    Ami> it (the verb ibv_post_srq_recv fails returning -1) 2. as a
    Ami> direct result of this, the other side gets a bad completion
    Ami> with RETRY EXCEEDED error, and then the machine freezes up

Anything printed in the console from the kernel when this happens?

    Ami> the first bug has been there for quit some time,

Any reason you kept it a secret until now?

    Ami> the second only happens from REV 3890 (when the previous
    Ami> version I tested was 3382)

I wasn't able to duplicate the exact symptoms you see, but I fixed a couple of bugs that your test showed for me: one in the uverbs kernel module that can cause a kernel panic, and one in the srq_pingpong example that would cause a CQ overrun.

Do you still see problems with the latest svn code?

 - R.

From hozer at hozed.org Sun Oct 30 15:55:04 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Sun, 30 Oct 2005 17:55:04 -0600
Subject: [openib-general] opensm errors with ehca
Message-ID: <20051030235504.GT3275@kalmia.hozed.org>

The firmware on the IBM eHCA causes opensm to spit out these kinds of errors all the time..
Is there a way we can either not send P_KeyTable requests to any eHCA guids, or figure out what (if anything) is broken in their firmware? Is this a spec violation, or just ambiguities in implementation?

Oct 30 17:49:46 053820 [43005960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x16 trans_id=0x158c) -- dropping.
Oct 30 17:49:46 053830 [43005960] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 2 DR SLID 0x0 DR DLID 0x0
Oct 30 17:49:46 053839 [43005960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT).
Oct 30 17:49:46 053861 [43005960] -> SMP dump:
    base_ver................0x1
    mgmt_class..............0x81
    class_ver...............0x1
    method..................0x1 (SubnGet)
    D bit...................0x0
    status..................0x0
    hop_ptr.................0x0
    hop_count...............0x2
    trans_id................0x158c
    attr_id.................0x16 (P_KeyTable)
    resv....................0x0
    attr_mod................0x260000
    m_key...................0x0000000000000000
    dr_slid.................0xFFFF
    dr_dlid.................0xFFFF

    Initial path: [0][1][16]
    Return path: [0][0][0]
    Reserved: [0][0][0][0][0][0][0]

From tziporet at mellanox.co.il Sun Oct 30 22:44:26 2005
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 31 Oct 2005 08:44:26 +0200
Subject: [openib-general] SRQ limit reached async event.
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E366C190@mtlexch01.mtl.com>

FW 4.7.400 for Arbel mem-full was officially released yesterday.
Tavor (3.x) release will be by the end of the year.

Tziporet

-----Original Message-----
From: Roland Dreier [mailto:rolandd at cisco.com]
Sent: Friday, October 28, 2005 12:44 AM
To: Galen M. Shipman
Cc: openib-general at openib.org
Subject: Re: [openib-general] SRQ limit reached async event.

    Galen> Does anyone now if openib supports the SRQ limit
    Galen> asynchronous event?

Yes, openib verbs and the mthca driver supports this.
However, with current firmware, you will only receive this event for mem-free HCAs (firmware versions 5.x and 1.x). For mem-ful HCAs (firmware versions 3.x and 4.x), you will need to use as-yet-unreleased firmware for the event to be generated.

 - R.
_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tziporet at mellanox.co.il Sun Oct 30 22:50:05 2005
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 31 Oct 2005 08:50:05 +0200
Subject: [openib-general] Re: [PATCH] SRP: don't use TX IU after freeing it
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E366C191@mtlexch01.mtl.com>

Hi Roland,

When do you expect to apply the FMRs patch for SRP?

Thanks,
Tziporet

-----Original Message-----
From: Vu Pham [mailto:vuhuong at mellanox.com]
Sent: Tuesday, October 11, 2005 8:03 PM
To: Roland Dreier
Cc: kingman at storagegear.com; openib-general at openib.org
Subject: [openib-general] Re: [PATCH] SRP: don't use TX IU after freeing it

Roland,

Thanks for reviewing it. Responding to your feedback, I prepared a new patch (attached).

>
> Why put a pointer to struct list_head here instead of just a struct
> list_head? If you just used the struct, then you wouldn't need this:
>
Done. Using struct list_head instead of pointer

> > + u16 in_use;
> > };
>
> I can't find anywhere that the in_use flag is used.
>
Removed

> > +static int srp_map_fmr(struct srp_target_port *target, struct scatterlist *scat,
> > + int sg_cnt, struct srp_request *req)
>
> [...]
>
> > + return -ENOMEM;
>
> > + } else if (fmr_cnt <= 0) {
>
> fmr_cnt is unsigned so I think this is going to get you in trouble.
> Might as well make fmr_cnt a plain int to make things simpler.
>
In the previous patch, fmr_cnt was already declared as int

> Also, it might be good to try and add some more comments explaining
> srp_map_fmr() -- it would definitely help me review.
>
I added some comments - Hope they help your review (instead of confusing you more :))

Signed-off-by: Vu Pham
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hozer at hozed.org Sun Oct 30 23:17:03 2005
From: hozer at hozed.org (Troy Benjegerdes)
Date: Mon, 31 Oct 2005 01:17:03 -0600
Subject: [openib-general] libehca causes segfault when not physically present..
Message-ID: <20051031071703.GU3275@kalmia.hozed.org>

On an Openpower720 system with a Mellanox HCA (and no IBM ehca installed), I get the following when trying to run ibv_rc_pingpong:

Starting program: /usr/src/openib-src/userspace/libibverbs/examples/.libs/ibv_rc_pingpong
[Thread debugging using libthread_db enabled]
[New Thread 4398046660640 (LWP 6167)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 4398046660640 (LWP 6167)]
hipz_galpa_store (galpa={fw_handle = 0}, offset=48, value=0) at src/hcp_phyp.c:72
72 *(u64 *) addr = value;
(gdb) bt
#0 hipz_galpa_store (galpa={fw_handle = 0}, offset=48, value=0) at src/hcp_phyp.c:72
#1 0x0000000010001b7c in pp_post_recv (ctx=0x100177d0, n=-3807848) at verbs.h:844
#2 0x0000000010002364 in main (argc=Variable "argc" is not available. ) at examples/rc_pingpong.c:566

I assume this means something somewhere is not actually checking sysfs to see if the driver is actually there and active.

From mst at mellanox.co.il Mon Oct 31 01:20:36 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 31 Oct 2005 11:20:36 +0200
Subject: [openib-general] [PATCH] libmthca: fix max_inline_data calculation
Message-ID: <20051031092036.GK4769@mellanox.co.il>

Hello, Roland!
Please consider this patch.

Fix max_inline_data calculation: we add mthca_bind_seg when calculating wqe size, so we should subtract mthca_bind_seg as well.
Signed-off-by: Michael S. Tsirkin

Index: userspace/libmthca/src/qp.c
===================================================================
--- userspace.orig/libmthca/src/qp.c 2005-10-30 09:25:43.000000000 +0200
+++ userspace/libmthca/src/qp.c 2005-10-31 11:15:10.000000000 +0200
@@ -787,7 +787,8 @@ void mthca_return_cap(struct ibv_pd *pd,
 		break;
 
 	default:
-		cap->max_inline_data -= sizeof (struct mthca_raddr_seg);
+		/* bind seg is as big as atomic + raddr segs */
+		cap->max_inline_data -= sizeof (struct mthca_bind_seg);
 		break;
 	}

--
MST

From mst at mellanox.co.il Mon Oct 31 01:54:55 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 31 Oct 2005 11:54:55 +0200
Subject: [openib-general] [PATCH] libmthca: fix qp max_send_sge calculation
Message-ID: <20051031095455.GN4769@mellanox.co.il>

Hello, Roland!

Currently, max_send_sge reported for a qp is based not only on the max_send_sge requested, but also on the max_inline_data requested. While this may help in the sense that for some combinations of these values the user may have an actual use for a bigger value of max_send_sge, this also creates situations where max_send_sge reported exceeds the maximum s/g value supported by the HCA, so attempts to post such a work request will fail in strange ways.

A simple fix is to avoid touching max_gs for send, same as we do for receive.

---

Avoid setting max_send_sge to a value bigger than supported by the HCA.

Signed-off-by: Michael S. Tsirkin

Index: userspace/libmthca/src/qp.c
===================================================================
--- userspace.orig/libmthca/src/qp.c 2005-10-31 11:31:14.000000000 +0200
+++ userspace/libmthca/src/qp.c 2005-10-31 11:31:38.000000000 +0200
@@ -682,13 +682,14 @@ out:
 int mthca_alloc_qp_buf(struct ibv_pd *pd, struct ibv_qp_cap *cap,
 		       enum ibv_qp_type type, struct mthca_qp *qp)
 {
-	int size;
+	int size, max_sge;
 
 	qp->rq.max_gs = cap->max_recv_sge;
-	qp->sq.max_gs = align(cap->max_inline_data + sizeof (struct mthca_inline_seg),
+	qp->sq.max_gs = cap->max_send_sge;
+	max_sge = align(cap->max_inline_data + sizeof (struct mthca_inline_seg),
 			sizeof (struct mthca_data_seg)) / sizeof (struct mthca_data_seg);
-	if (qp->sq.max_gs < cap->max_send_sge)
-		qp->sq.max_gs = cap->max_send_sge;
+	if (max_sge < cap->max_send_sge)
+		max_sge = cap->max_send_sge;
 
 	qp->wrid = malloc((qp->rq.max + qp->sq.max) * sizeof (uint64_t));
 	if (!qp->wrid)
@@ -702,7 +703,7 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd
 		; /* nothing */
 
 	size = sizeof (struct mthca_next_seg) +
-		qp->sq.max_gs * sizeof (struct mthca_data_seg);
+		max_sge * sizeof (struct mthca_data_seg);
 	switch (type) {
 	case IBV_QPT_UD:
 		if (mthca_is_memfree(pd->context))

--
MST

From mst at mellanox.co.il Mon Oct 31 03:09:00 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 31 Oct 2005 13:09:00 +0200
Subject: [openib-general] uverbs_cmd warning
Message-ID: <20051031110900.GR4769@mellanox.co.il>

Hello!
I am seeing these warnings when building core:

  CC [M] drivers/infiniband/core/uverbs_main.o
drivers/infiniband/core/uverbs_main.c: In function `ib_uverbs_cq_event_handler':
drivers/infiniband/core/uverbs_main.c:450: warning: passing arg 1 of `ib_uverbs_async_handler' from incompatible pointer type
  CC [M] drivers/infiniband/core/uverbs_cmd.o
drivers/infiniband/core/uverbs_cmd.c: In function `ib_uverbs_create_cq':
drivers/infiniband/core/uverbs_cmd.c:605: warning: assignment from incompatible pointer type

Am I doing something wrong?

--
MST

From mst at mellanox.co.il Mon Oct 31 03:19:05 2005
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 31 Oct 2005 13:19:05 +0200
Subject: [openib-general] [PATCH] uverbs: fix typo
Message-ID: <20051031111904.GA31134@mellanox.co.il>

Fix typo in uverbs.h

Signed-off-by: Michael S. Tsirkin

Index: linux-2.6.14/drivers/infiniband/core/uverbs.h
===================================================================
--- linux-2.6.14.orig/drivers/infiniband/core/uverbs.h
+++ linux-2.6.14/drivers/infiniband/core/uverbs.h
@@ -113,7 +113,7 @@ struct ib_uevent_object {
 
 struct ib_ucq_object {
 	struct ib_uobject uobject;
-	struct ib_uverb_file *uverbs_file;
+	struct ib_uverbs_file *uverbs_file;
 	struct list_head comp_list;
 	struct list_head async_list;
 	u32 comp_events_reported;

--
MST

From yael at mellanox.co.il Mon Oct 31 03:34:30 2005
From: yael at mellanox.co.il (Yael Kalka)
Date: Mon, 31 Oct 2005 13:34:30 +0200
Subject: [openib-general] RE: Patches for Opensm
Message-ID: <6AB138A2AB8C8E4A98B9C0C3D52670E30E239B@mtlexch01.mtl.com>

Hi Hal,

whitespace/typo changes do not interest me, but if you can send me personal mail regarding other changes (I can do fine with the new version number or something) - that will be great.
Thanks, Yael -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Sunday, October 30, 2005 2:36 PM To: Yael Kalka Cc: openib-general at openib.org; Eitan Zahavi Subject: Re: Patches for Opensm Hi Yael, On Sun, 2005-10-30 at 06:54, Yael Kalka wrote: > I noticed that you've checked in a change to the osm trunk few days > ago without > sending a patch regarding it. > Since I am the owner of the opensm tree under Windows, and I am trying > to keep > the Windows tree as similar as possible to the Linux tree - I want to > know > about checkins to the osm tree, so I can add the patches to the > Windows tree as well. > Please send an e-mail with a patch when you commit changes to the osm > tree. I have not been doing this for minor and cosmetic (whitespace/typo) changes so there is more than just that. I don't think this is worthy of bothering the list so there are 3 choices: 1. Sync to the tree and diff 2. Subscribe to openib-commits (you will get all commits) 3. Personal email Let me know your preference. Thanks. -- Hal From yael at mellanox.co.il Mon Oct 31 05:42:07 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 31 Oct 2005 15:42:07 +0200 Subject: [openib-general] [PATCH] Opensm - fix lmc algorithm - new Message-ID: <5zek61yf9s.fsf@mtl066.yok.mtl.com> Hi Hal, Since you haven't applied this fix yet - please take this new one. There was a wrong CL_ASSERT in my original patch. I'm also adding my explanation from the previous mail regarding the patch: We noticed a problem in the lmc assignment algorithm. In the current code - when trying to run opensm with lmc > 0, opensm goes into an infinite loop. Debugging the problem we noticed that there is a problem with the lid assignment, and we changed the algorithm. The change is in the osm_lid_mgr_init_sweep function. We have done some testing of the new code, and it seems that the lmc assignment is ok with the fix.
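The core rule the patch enforces on a LID range can be shown in isolation. What follows is only a minimal sketch, assuming that a non-switch port with LMC = n needs 2^n consecutive LIDs and that the base LID must have its low n bits clear. The helper names lmc_mask() and lid_range_ok() are hypothetical and are not OpenSM functions; only the mask expression mirrors the lmc_mask computation in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask that clears the low `lmc` bits of a LID (same expression as the patch). */
static uint16_t lmc_mask(uint8_t lmc)
{
	return lmc ? (uint16_t)~((1u << lmc) - 1u) : 0xffff;
}

/*
 * Hypothetical helper: true when [min_lid, max_lid] can serve a port that
 * needs (1 << lmc) consecutive LIDs. Assumes min_lid <= max_lid.
 */
static bool lid_range_ok(uint16_t min_lid, uint16_t max_lid, uint8_t lmc)
{
	uint16_t num_lids = (uint16_t)(1u << lmc);

	if (num_lids == 1)
		return true;	/* single LID: no alignment requirement */
	if ((min_lid & lmc_mask(lmc)) != min_lid)
		return false;	/* base LID is not aligned to the LMC */
	/* range must be wide enough to hold every LID of the port */
	return (uint16_t)(max_lid - min_lid + 1) >= num_lids;
}
```

With lmc = 2, for example, the range [8, 11] passes, [6, 9] fails the alignment test, and [8, 9] fails the width test. These are the two conditions under which the patch deletes a persistent guid2lid entry.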
Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 3915) +++ opensm/osm_lid_mgr.c (working copy) @@ -337,7 +337,7 @@ __osm_lid_mgr_init_sweep( uint16_t max_defined_lid; uint16_t max_persistent_lid; uint16_t max_discovered_lid; - uint16_t lid, l; + uint16_t lid; uint16_t disc_min_lid; uint16_t disc_max_lid; uint16_t db_min_lid; @@ -349,16 +349,23 @@ __osm_lid_mgr_init_sweep( osm_port_t *p_port; cl_qmap_t *p_port_guid_tbl; uint8_t lmc_num_lids = (uint8_t)(1 << p_mgr->p_subn->opt.lmc); + uint16_t lmc_mask; + uint16_t req_lid, num_lids; OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_init_sweep ); + if (p_mgr->p_subn->opt.lmc) + lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1); + else + lmc_mask = 0xffff; + /* if we came out of standby we need to discard any previous guid 2 lid info we might had */ if ( p_mgr->p_subn->coming_out_of_standby == TRUE ) { osm_db_clear( p_mgr->p_g2l ); for (lid = 0; lid < cl_ptr_vector_get_size(&p_mgr->used_lids); lid++) - cl_ptr_vector_set(&p_mgr->used_lids, lid, NULL); + cl_ptr_vector_set(p_persistent_vec, lid, NULL); } /* we need to cleanup the empty ranges list */ @@ -375,7 +382,7 @@ __osm_lid_mgr_init_sweep( /* we if are on the first sweep and in re-assign lids mode we should ignore all the available info and simply define one - hufe empty range */ + huge empty range */ if ((p_mgr->p_subn->first_time_master_sweep == TRUE) && (p_mgr->p_subn->opt.reassign_lids == TRUE )) { @@ -398,6 +405,34 @@ __osm_lid_mgr_init_sweep( osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid); for (lid = disc_min_lid; lid <= disc_max_lid; lid++) cl_ptr_vector_set(p_discovered_vec, lid, p_port ); + /* make sure the guid2lid entry is valid. If not - clean it. 
*/ + if (!osm_db_guid2lid_get( p_mgr->p_g2l, + cl_ntoh64(osm_port_get_guid(p_port)), + &db_min_lid, &db_max_lid)) + { + if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != + IB_NODE_TYPE_SWITCH) + num_lids = lmc_num_lids; + else + num_lids = 1; + + if ((num_lids != 1) && + (((db_min_lid & lmc_mask) != db_min_lid) || + (db_max_lid - db_min_lid + 1 < num_lids)) ) + { + /* Not alligned, or not wide enough - remove the entry */ + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, + "__osm_lid_mgr_init_sweep: " + "Cleaning persistent entry for guid:0x%016" PRIx64 + " illegal range:[0x%x:0x%x] \n", + cl_ntoh64(osm_port_get_guid(p_port)), db_min_lid, + db_max_lid ); + osm_db_guid2lid_delete( p_mgr->p_g2l, + cl_ntoh64(osm_port_get_guid(p_port))); + for ( lid = db_min_lid ; lid <= db_max_lid ; lid++ ) + cl_ptr_vector_set(p_persistent_vec, lid, NULL); + } + } } /* @@ -434,7 +469,7 @@ __osm_lid_mgr_init_sweep( { is_free = TRUE; /* first check to see if the lid is used by a persistent assignment */ - if ((lid < max_persistent_lid) && cl_ptr_vector_get(p_persistent_vec, lid)) + if ((lid <= max_persistent_lid) && cl_ptr_vector_get(p_persistent_vec, lid)) { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " @@ -442,62 +477,85 @@ __osm_lid_mgr_init_sweep( lid); is_free = FALSE; } - - /* check the discovered port if there is one */ - if ((lid < max_discovered_lid) && - (p_port = (osm_port_t *)cl_ptr_vector_get(p_discovered_vec, lid))) + else { - /* get the lid range of that port - but we know how many lids we - are about to assign to it */ - osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid); - if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != - IB_NODE_TYPE_SWITCH) - disc_max_lid = disc_min_lid + lmc_num_lids - 1; - + /* check this is a discovered port */ + if (lid <= max_discovered_lid && (p_port = (osm_port_t *)cl_ptr_vector_get(p_discovered_vec, lid))) + { + /* we have a port. Now lets see if we can preserve its lid range. 
*/ + /* For that - we need to make sure: + 1. The port has a (legal) persistancy entry. Then the local lid + is free (we will use the persistancy value). + 2. Can the port keep its local assignment? + a. Make sure the lid a alligned. + b. Make sure all needed lids (for the lmc) are free according + to persistancy table. + */ /* qualify the guid of the port is not persistently mapped to another range */ if (!osm_db_guid2lid_get( p_mgr->p_g2l, cl_ntoh64(osm_port_get_guid(p_port)), &db_min_lid, &db_max_lid)) { - /* ok there is an asignment - is it the same ? */ - if ((disc_min_lid == db_min_lid) && (disc_max_lid == db_max_lid)) - { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " - "[0x%04x,0x%04x] is not free as it was discovered " - " and mapped by the persistent db.\n", - disc_min_lid, disc_max_lid); - is_free = FALSE; + "0x%04x is free as it was discovered " + "but mapped by the persistent db to [0x%04x:0x%04x].\n", + lid, db_min_lid, db_max_lid); + } + else + { + /* can the port keep its assignment ? */ + /* get the lid range of that port, and the required number + of lids we are about to assign to it */ + osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid); + if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) != + IB_NODE_TYPE_SWITCH) + { + disc_max_lid = disc_min_lid + lmc_num_lids - 1; + num_lids = lmc_num_lids; } else { + num_lids = 1; + } + /* Make sure the lid is alligned */ + if ((num_lids != 1) && ((disc_min_lid & lmc_mask) != disc_min_lid)) + { + /* The lid cannot be used */ osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " - "[0x%04x,0x%04x] is free as it was discovered" - " but mapped to range: [0x%x:0x%x] by the persistent db.\n", - disc_min_lid, disc_max_lid, db_min_lid, db_max_lid); - for (l = disc_min_lid; l <= disc_max_lid; l++) - cl_ptr_vector_set(p_discovered_vec, l, NULL); - } + "0x%04x is free as it was discovered " + "but not alligned. 
\n", + lid ); } else { + /* check that all needed lids are not persistantly mapped */ + is_free = FALSE; + for ( req_lid = disc_min_lid + 1 ; req_lid <= disc_max_lid ; req_lid++ ) + { + if ((req_lid <= max_persistent_lid) && cl_ptr_vector_get(p_persistent_vec, req_lid)) + { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_lid_mgr_init_sweep: " - "0x%04x is not free as it was discovered" - " and there is no persistent db entry for it.\n", + "0x%04x is free as it was discovered " + "but mapped. \n", lid); - is_free = FALSE; + is_free = TRUE; + break; + } } - - /* if there is more then one lid on that port - and the discovered port - is going to retain its lids advance to the max lid */ if (is_free == FALSE) { + /* This port will use its local lid, and consume the entire required lid range. + Thus we can skip that range. */ lid = disc_max_lid; } } + } + } + } if (is_free) { @@ -1300,7 +1358,6 @@ osm_lid_mgr_process_subnet( /* the proc returns the fact it sent a set port info */ if (__osm_lid_mgr_set_physp_pi( p_mgr, p_physp, cl_hton16( min_lid_ho ))) p_mgr->send_set_reqs = TRUE; - } } /* all ports */ From yael at mellanox.co.il Mon Oct 31 05:49:59 2005 From: yael at mellanox.co.il (Yael Kalka) Date: 31 Oct 2005 15:49:59 +0200 Subject: [openib-general] [PATCH] Opensm - race in opensm signalling Message-ID: <5zd5llyewo.fsf@mtl066.yok.mtl.com> Hi Hal, During our Windows testing we've encountered a case where for some reason the opensm changes the state of its port to down, and then brings it back up. After debugging it, we found out that the reason for that is a possible race when signaling "OSM_SIGNAL_NO_PENDING_TRANSACTIONS" to the osm_state_mgr_process. The qp0_mads_outstanding is decremented, and only later is checked if reaches zero. So if 2 threads decrement the qp0_mads_outstanding, and they are running simultanously, they can both signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS! This, of course, results in a big mess in the osm_state_mgr_process flow. 
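The race described above can be reduced to a few lines. This is a sketch using C11 atomics rather than the complib cl_atomic_dec() call. The function names are hypothetical, and the counter stands in for qp0_mads_outstanding. Note that atomic_fetch_sub() returns the value held before the subtraction, whereas the patch's use of cl_atomic_dec() implies that it returns the value after the decrement, so the comparison here is against 1 rather than 0.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Racy pattern: decrement, then re-read the counter.
 * If two threads both decrement before either re-reads, both see 0
 * and both would signal "no pending transactions".
 */
static bool retire_racy(atomic_int *outstanding)
{
	atomic_fetch_sub(outstanding, 1);
	return atomic_load(outstanding) == 0;
}

/*
 * Fixed pattern: decide from the value produced by the decrement itself.
 * Exactly one caller observes the transition from 1 to 0.
 */
static bool retire_fixed(atomic_int *outstanding)
{
	/* atomic_fetch_sub returns the value held before the subtraction */
	return atomic_fetch_sub(outstanding, 1) == 1;
}
```

Starting from a count of 2, two calls to retire_fixed() report "reached zero" once, whatever the interleaving. With retire_racy() the answer depends on when each thread re-reads the counter.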
The following patch fixes this issue. Thanks, Yael Signed-off-by: Yael Kalka Index: opensm/osm_vl15intf.c =================================================================== --- opensm/osm_vl15intf.c (revision 3915) +++ opensm/osm_vl15intf.c (working copy) @@ -183,28 +183,13 @@ __osm_vl15_poller( the cl_disp_post with OSM_SIGNAL_NO_PENDING_TRANSACTION (in order to wake up the state mgr). */ - cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); + outstanding = cl_atomic_dec( &p_vl->p_stats->qp0_mads_outstanding ); osm_log( p_vl->p_log, OSM_LOG_DEBUG, "__osm_vl15_poller: " "%u QP0 MADs outstanding.\n", p_vl->p_stats->qp0_mads_outstanding ); - /* - Acquire the lock non-exclusively. - Other modules that send MADs grab this lock exclusively. - These modules that are in the process of sending MADs - will hold the lock until they finish posting all the MADs - they plan to send. While the other module is sending MADs - the outstanding count may temporarily go to zero. - Thus, by grabbing the lock ourselves, we get an accurate - view of whether or not the number of outstanding MADs is - really zero. - */ - CL_PLOCK_ACQUIRE( p_vl->p_lock ); - outstanding = p_vl->p_stats->qp0_mads_outstanding; - CL_PLOCK_RELEASE( p_vl->p_lock ); - if( outstanding == 0 ) { /* Index: opensm/osm_sm_mad_ctrl.c =================================================================== --- opensm/osm_sm_mad_ctrl.c (revision 3915) +++ opensm/osm_sm_mad_ctrl.c (working copy) @@ -99,7 +99,7 @@ __osm_sm_mad_ctrl_retire_trans_mad( osm_mad_pool_put( p_ctrl->p_mad_pool, p_madw ); - cl_atomic_dec( &p_ctrl->p_stats->qp0_mads_outstanding ); + outstanding = cl_atomic_dec( &p_ctrl->p_stats->qp0_mads_outstanding ); if( osm_log_is_active( p_ctrl->p_log, OSM_LOG_DEBUG ) ) { @@ -109,21 +109,6 @@ __osm_sm_mad_ctrl_retire_trans_mad( p_ctrl->p_stats->qp0_mads_outstanding ); } - /* - Acquire the lock non-exclusively. - Other modules that send MADs grab this lock exclusively. 
- These modules that are in the process of sending MADs - will hold the lock until they finish posting all the MADs - they plan to send. While the other module is sending MADs - the outstanding count may temporarily go to zero. - Thus, by grabbing the lock ourselves, we get an accurate - view of whether or not the number of outstanding MADs is - really zero. - */ - CL_PLOCK_ACQUIRE( p_ctrl->p_lock ); - outstanding = p_ctrl->p_stats->qp0_mads_outstanding; - CL_PLOCK_RELEASE( p_ctrl->p_lock ); - if( outstanding == 0 ) { /* From rolandd at cisco.com Mon Oct 31 07:01:08 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 07:01:08 -0800 Subject: [openib-general] libehca causes segfault when not physically present.. In-Reply-To: <20051031071703.GU3275@kalmia.hozed.org> (Troy Benjegerdes's message of "Mon, 31 Oct 2005 01:17:03 -0600") References: <20051031071703.GU3275@kalmia.hozed.org> Message-ID: <52d5lln32j.fsf@cisco.com> Troy> I assume this means something somewhere is not actually Troy> checking sysfs to see if the driver is actually there and Troy> active. Yes, if you look at openib_driver_init() in ehca_uinit.c, you'll actually see the line: /* @@TODO check vendor and device numbers */ - R. 
References: <20051027162154.GA23710@cse.ohio-state.edu> Message-ID: On Thu, 27 Oct 2005, Sayantan Sur wrote: > Hi, > > We ran into some troubles when compiling the OpenIB dapl provider with > the PGI compiler. I believe this should appear in both ibat-cm and the > scm based providers. > > Has anyone compiled DAPL/Gen2 with PGI? Nobody has reported this problem before. Therefore it is likely that nobody has tried to compile the code with the PGI compiler recently. > Is there a quick workaround for this? > > ---- > PGC-W-0221-Redefinition of symbol UINT64_C (/usr/include/stdint.h: 304) The DAT headers define UINT64_C on line 147 of dat/dat_platform_specific.h The definition there is guarded by an ifndef You must be including stdint.h sometime after you include udat.h If you include stdint.h before you include udat.h, the problem should go away. > PGC-S-0040-Illegal use of symbol, u_int64_t > (/home/1/surs/projects/Gen2/dapl_scm > _patch/dapl/dat/include/dat/dat_platform_specific.h: 139) Did you try changing u_int64_t to uint64_t?
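The include-order point can be illustrated with a short sketch. The guarded fallback below is a hypothetical stand-in written in the style described above, not the actual text of dat_platform_specific.h, and big_constant() is an illustrative name only. When stdint.h comes first, the guard sees that UINT64_C already exists and the fallback is skipped, so nothing is redefined.

```c
#include <stdint.h>	/* included first: provides UINT64_C and uint64_t */

/* Guarded fallback: only takes effect when stdint.h did not define the macro. */
#ifndef UINT64_C
#define UINT64_C(c) c ## ULL
#endif

/* Uses the standard uint64_t spelling rather than the BSD-style u_int64_t. */
static uint64_t big_constant(void)
{
	return UINT64_C(0x100000000);
}
```

Reversing the order, with the fallback first and stdint.h afterwards, is the situation in which a strict compiler can report a redefinition at the stdint.h line.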
> PGC/x86-64 Linux/x86-64 6.0-5: compilation completed with severe errors > ---- > > Our machine is SuSe 9.3, with linux kernel version 2.6.13.1 and OpenIB > svn #3882. > > Thanks, > Sayantan. From jlentini at netapp.com Mon Oct 31 07:58:30 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 31 Oct 2005 10:58:30 -0500 (EST) Subject: uDAPL Problem : [WasRe: [openib-general] OpenSM crash with today's trunk In-Reply-To: <43629F9A.3070704@cs.rutgers.edu> References: <5CE025EE7D88BA4599A2C8FEFCF226F5175CA4@taurus.voltaire.com> <4362692A.80207@cs.rutgers.edu> <52irvhtpty.fsf@cisco.com> <43628965.60902@cs.rutgers.edu> <52fyqls81f.fsf@cisco.com> <43628F6C.9070308@cs.rutgers.edu> <52acgts6lj.fsf@cisco.com> <43629F9A.3070704@cs.rutgers.edu> Message-ID: Thanks for the patch Aniruddha. Can you resend with a signed-off-by line? See "How do I submit source code patches?" at https://openib.org/tiki/tiki-index.php?page=OpenIBFAQ > Also a minor patch, you can see that %P is printed as %P and not used as > a format character. 
> > Index: common/dapl_ep_post_rdma_write.c > =================================================================== > --- common/dapl_ep_post_rdma_write.c (revision 3892) > +++ common/dapl_ep_post_rdma_write.c (working copy) > @@ -78,7 +78,7 @@ > DAT_RETURN dat_status; > > dapl_dbg_log (DAPL_DBG_TYPE_API, > - "dapl_ep_post_rdma_write (%p, %d, %p, %P, %p, %x)\n", > + "dapl_ep_post_rdma_write (%p, %d, %p, %p, %p, %x)\n", > ep_handle, > num_segments, > local_iov, > Index: common/dapl_ep_post_send.c > =================================================================== > --- common/dapl_ep_post_send.c (revision 3892) > +++ common/dapl_ep_post_send.c (working copy) > @@ -75,7 +75,7 @@ > DAT_RETURN dat_status; > > dapl_dbg_log (DAPL_DBG_TYPE_API, > - "dapl_ep_post_send (%p, %d, %p, %P, %x)\n", > + "dapl_ep_post_send (%p, %d, %p, %p, %x)\n", > ep_handle, > num_segments, > local_iov, > Index: common/dapl_srq_post_recv.c > =================================================================== > --- common/dapl_srq_post_recv.c (revision 3892) > +++ common/dapl_srq_post_recv.c (working copy) > @@ -79,7 +79,7 @@ > DAT_RETURN dat_status; > > dapl_dbg_log (DAPL_DBG_TYPE_API, > - "dapl_srq_post_recv (%p, %d, %p, %P)\n", > + "dapl_srq_post_recv (%p, %d, %p, %p)\n", > srq_handle, > num_segments, > local_iov, > Index: common/dapl_ep_post_recv.c > =================================================================== > --- common/dapl_ep_post_recv.c (revision 3892) > +++ common/dapl_ep_post_recv.c (working copy) > @@ -79,7 +79,7 @@ > DAT_RETURN dat_status; > > dapl_dbg_log (DAPL_DBG_TYPE_API, > - "dapl_ep_post_recv (%p, %d, %p, %P, %x)\n", > + "dapl_ep_post_recv (%p, %d, %p, %p, %x)\n", > ep_handle, > num_segments, > local_iov, > > Thanks > Aniruddha > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit 
http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Mon Oct 31 08:26:21 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 31 Oct 2005 18:26:21 +0200 Subject: [openib-general] libmthca problem: max_inline_size Message-ID: <20051031162621.GD31134@mellanox.co.il> Hello, Roland! Currently, libmthca calculates the qp capability max_inline_size based on the wqe size (wqe_shift) field. However, especially for memfree HCAs, the maximum descriptor size supported by the HCA may be less than the full wqe size (this is reported in the max_desc_sz or max_desc_sz_sq/max_desc_sz_rq), and as a result the max inline size returned to the caller is too large. I see three possible solutions: 1. Do not increase max_inline_size beyond what is given by the user. This is unfortunate, since a bigger inline size may provide an optimization opportunity. 2. Return the actual QP capability in create qp command. This is an ABI change, although the library can be made to work in a backward compatible way. 3. Add a command (device specific) to query the max descriptor size supported by the HCA (returned by query dev lim) and calculate max_inline_size based on that. Again, this is an ABI change. I am leaning towards the second option (2.) since this way the resulting capability calculations will all be in one place in the kernel. There's a similar problem in mthca_arbel_post_send where the inline data size is checked against the expression int max_size = (1 << qp->sq.wqe_shift) - sizeof *seg - size * 16; I would imagine the way to fix that is to add a max_inline_size field to the mthca_qp structure and use that instead of 1 << qp->sq.wqe_shift. Please let me know what you think. Thanks, -- MST From mst at mellanox.co.il Mon Oct 31 08:28:00 2005 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 31 Oct 2005 18:28:00 +0200 Subject: [openib-general] Re: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN In-Reply-To: <435D4B27.2010208@ichips.intel.com> References: <435D4B27.2010208@ichips.intel.com> Message-ID: <20051031162759.GE31134@mellanox.co.il> Quoting Arlin Davis : > Subject: Re: [openib-general] Re: [PATCH] perftest/rdma_bw; add support for RDMA read and starting PSN > Any luck isolating this performance problem? I just want to understand > the cause so I know for sure 4.7 FW is a solid fix. Didn't see anything > in the 4.7 release notes that covered this issue. > Hello, Arlin! We have verified that, while the issue didnt appear in the release notes, 4.7 is a solid fix for this issue. Thanks, -- MST From rolandd at cisco.com Mon Oct 31 08:59:44 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 08:59:44 -0800 Subject: [openib-general] Re: [PATCH] uverbs: fix typo In-Reply-To:
<20051031111904.GA31134@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 31 Oct 2005 13:19:05 +0200") References: <20051031111904.GA31134@mellanox.co.il> Message-ID: <52r7a1lj0f.fsf@cisco.com> Thanks, applied. - R. From rolandd at cisco.com Mon Oct 31 09:03:00 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 09:03:00 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fix max_inline_data calculation In-Reply-To: <20051031092036.GK4769@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 31 Oct 2005 11:20:36 +0200") References: <20051031092036.GK4769@mellanox.co.il> Message-ID: <52mzkpliuz.fsf@cisco.com> Michael> Fix max_inline_data calculation: we add mthca_bind_seg Michael> when calculating wqe size, so we should substract Michael> mthca_bind_seg as well. Hmm, I would prefer to fix this in a more correct way. A bind WQE will never have inline data, so we should just make sure that the calculated WQE is large enough for bind requests, rather than adding the possibility of a bind segment to every WQE. - R. From rolandd at cisco.com Mon Oct 31 09:03:58 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 09:03:58 -0800 Subject: [openib-general] Re: [PATCH] libmthca: fix qp max_send_sge calculation In-Reply-To: <20051031095455.GN4769@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 31 Oct 2005 11:54:55 +0200") References: <20051031095455.GN4769@mellanox.co.il> Message-ID: <52irvdlitd.fsf@cisco.com> Hmm, again I'd rather do the calculation correctly and just compare against the HCA's actual capabilities to make sure we don't overflow that. - R. From rolandd at cisco.com Mon Oct 31 09:06:30 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 09:06:30 -0800 Subject: [openib-general] Re: libmthca problem: max_inline_size In-Reply-To: <20051031162621.GD31134@mellanox.co.il> (Michael S. 
Tsirkin's message of "Mon, 31 Oct 2005 18:26:21 +0200") References: <20051031162621.GD31134@mellanox.co.il> Message-ID: <52ek61lip5.fsf@cisco.com> > 2. Return the actual QP capability in create qp command. > This is an ABI change, although the library can be made to work in a > backward compatible way. > 3. Add a command (device specific) to query the max descriptor size supported > by the HCA (returned by query dev lim) and calculate max_inline_size > based on that. > Again, this is an ABI change. > I am inclining towards the second option (2.) since this way the > resulting capability calculations will be all in one place in kernel. I think we need a combination of 2. and 3. because the WQE shifts and buffers from userspace need to match up with the kernel. For 3. there's no need for a completely new command. We could return extra device-dependent values from the GET_CONTEXT command, or even just add some sysfs attributes to the mthca device (similar to the fw_ver attribute). > There's a similiar problem in mthca_arbel_post_send where the > inline data size is checked against the expression > int max_size = (1 << qp->sq.wqe_shift) - sizeof *seg - size * 16; > I would imagine the way to fix that is to add a max_inline_size field > to the mthca_qp structure and is that instead of 1 << qp->sq.wqe_shift. Yes, that makes sense. - R. From rolandd at cisco.com Mon Oct 31 09:08:15 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 09:08:15 -0800 Subject: [openib-general] Re: [PATCH] SRP: don't use TX IU after freeing it In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E366C191@mtlexch01.mtl.com> (Tziporet Koren's message of "Mon, 31 Oct 2005 08:50:05 +0200") References: <6AB138A2AB8C8E4A98B9C0C3D52670E366C191@mtlexch01.mtl.com> Message-ID: <52acgplim8.fsf@cisco.com> Tziporet> When do you expect to apply the FMRs patch for SRP? My current plan is to try and get the existing code (without FMRs) merged for 2.6.15 and then merge the FMR changes.
Looking back at the patch one more time, I realize that it needs updating to handle devices that only support standard verbs -- we can't assume that all devices will support Mellanox-style FMRs. - R. From rolandd at cisco.com Mon Oct 31 09:23:06 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 09:23:06 -0800 Subject: [openib-general] [PATCH/RFC] IB: Add SCSI RDMA Protocol (SRP) initiator Message-ID: <52wtjtk3d1.fsf@cisco.com> I've posted this several times for review and gotten some (but not very much) feedback. Is there any objection to me asking Linus to pull this for 2.6.15? Thanks, Roland Add an InfiniBand SCSI RDMA Protocol (SRP) initiator. This lets us talk to InfiniBand SRP targets (storage devices). Signed-off-by: Roland Dreier --- drivers/infiniband/Kconfig | 2 drivers/infiniband/Makefile | 1 drivers/infiniband/ulp/srp/Kbuild | 3 drivers/infiniband/ulp/srp/Kconfig | 11 drivers/infiniband/ulp/srp/ib_srp.c | 1650 +++++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/srp/ib_srp.h | 334 +++++++ 6 files changed, 2001 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/srp/Kbuild create mode 100644 drivers/infiniband/ulp/srp/Kconfig create mode 100644 drivers/infiniband/ulp/srp/ib_srp.c create mode 100644 drivers/infiniband/ulp/srp/ib_srp.h applies-to: d918cd1ba0ef9afa692cef281afee2f6d6634a1e 6424304e0c52070dc39c7bf329542267c72754be diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 325d502..bdf0891 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -33,4 +33,6 @@ source "drivers/infiniband/hw/mthca/Kcon source "drivers/infiniband/ulp/ipoib/Kconfig" +source "drivers/infiniband/ulp/srp/Kconfig" + endmenu diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index d256cf7..a43fb34 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -1,3 +1,4 @@ obj-$(CONFIG_INFINIBAND) += core/ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mthca/
obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ +obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ diff --git a/drivers/infiniband/ulp/srp/Kbuild b/drivers/infiniband/ulp/srp/Kbuild new file mode 100644 index 0000000..f966e94 --- /dev/null +++ b/drivers/infiniband/ulp/srp/Kbuild @@ -0,0 +1,3 @@ +EXTRA_CFLAGS += -Idrivers/infiniband/include + +obj-$(CONFIG_INFINIBAND_SRP) += ib_srp.o diff --git a/drivers/infiniband/ulp/srp/Kconfig b/drivers/infiniband/ulp/srp/Kconfig new file mode 100644 index 0000000..8fe3be4 --- /dev/null +++ b/drivers/infiniband/ulp/srp/Kconfig @@ -0,0 +1,11 @@ +config INFINIBAND_SRP + tristate "InfiniBand SCSI RDMA Protocol" + depends on INFINIBAND && SCSI + ---help--- + Support for the SCSI RDMA Protocol over InfiniBand. This + allows you to access storage devices that speak SRP over + InfiniBand. + + The SRP protocol is defined by the INCITS T10 technical + committee. See . + diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c new file mode 100644 index 0000000..f2cee76 --- /dev/null +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -0,0 +1,1650 @@ +/* + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_srp.c 3895 2005-10-28 21:20:11Z roland $ + */ + +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include +#include + +#include + +#include "ib_srp.h" + +#define DRV_NAME "ib_srp" +#define PFX DRV_NAME ": " +#define DRV_VERSION "0.01" +#define DRV_RELDATE "January 11, 2005" + +MODULE_AUTHOR("Roland Dreier"); +MODULE_DESCRIPTION("InfiniBand SCSI RDMA Protocol driver"); +MODULE_LICENSE("Dual BSD/GPL"); + +static int topspin_workarounds = 1; + +module_param(topspin_workarounds, int, 0444); +MODULE_PARM_DESC(topspin_workarounds, + "Enable workarounds for Topspin/Cisco SRP target bugs if != 0"); + +static const u8 topspin_oui[3] = { 0x00, 0x05, 0xad }; + +static void srp_add_one(struct ib_device *device); +static void srp_remove_one(struct ib_device *device); +static void srp_completion(struct ib_cq *cq, void *target_ptr); +static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); + +static struct ib_client srp_client = { + .name = "srp", + .add = srp_add_one, + .remove = srp_remove_one +}; + +static inline struct srp_target_port *host_to_target(struct Scsi_Host *host) +{ + return (struct srp_target_port *) host->hostdata; +} + +static const char *srp_target_info(struct Scsi_Host 
*host) +{ + return host_to_target(host)->target_name; +} + +static struct srp_iu *srp_alloc_iu(struct srp_host *host, size_t size, + gfp_t gfp_mask, + enum dma_data_direction direction) +{ + struct srp_iu *iu; + + iu = kmalloc(sizeof *iu, gfp_mask); + if (!iu) + goto out; + + iu->buf = kzalloc(size, gfp_mask); + if (!iu->buf) + goto out_free_iu; + + iu->dma = dma_map_single(host->dev->dma_device, iu->buf, size, direction); + if (dma_mapping_error(iu->dma)) + goto out_free_buf; + + iu->size = size; + iu->direction = direction; + + return iu; + +out_free_buf: + kfree(iu->buf); +out_free_iu: + kfree(iu); +out: + return NULL; +} + +static void srp_free_iu(struct srp_host *host, struct srp_iu *iu) +{ + if (!iu) + return; + + dma_unmap_single(host->dev->dma_device, iu->dma, iu->size, iu->direction); + kfree(iu->buf); + kfree(iu); +} + +static void srp_qp_event(struct ib_event *event, void *context) +{ + printk(KERN_ERR PFX "QP event %d\n", event->event); +} + +static int srp_init_qp(struct srp_target_port *target, + struct ib_qp *qp) +{ + struct ib_qp_attr *attr; + int ret; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) + return -ENOMEM; + + ret = ib_find_cached_pkey(target->srp_host->dev, + target->srp_host->port, + be16_to_cpu(target->path.pkey), + &attr->pkey_index); + if (ret) + return ret; + + attr->qp_state = IB_QPS_INIT; + attr->qp_access_flags = (IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE); + attr->port_num = target->srp_host->port; + + return ib_modify_qp(qp, attr, + IB_QP_STATE | + IB_QP_PKEY_INDEX | + IB_QP_ACCESS_FLAGS | + IB_QP_PORT); +} + +static int srp_create_target_ib(struct srp_target_port *target) +{ + struct ib_qp_init_attr *init_attr = NULL; + int ret; + + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) + return -ENOMEM; + + target->cq = ib_create_cq(target->srp_host->dev, srp_completion, + NULL, target, SRP_CQ_SIZE); + if (IS_ERR(target->cq)) { + ret = PTR_ERR(target->cq); + goto out; + } + + 
ib_req_notify_cq(target->cq, IB_CQ_NEXT_COMP); + + init_attr->event_handler = srp_qp_event; + init_attr->cap.max_send_wr = SRP_SQ_SIZE; + init_attr->cap.max_recv_wr = SRP_RQ_SIZE; + init_attr->cap.max_recv_sge = 1; + init_attr->cap.max_send_sge = 1; + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + init_attr->qp_type = IB_QPT_RC; + init_attr->send_cq = target->cq; + init_attr->recv_cq = target->cq; + + target->qp = ib_create_qp(target->srp_host->pd, init_attr); + if (IS_ERR(target->qp)) { + ret = PTR_ERR(target->qp); + ib_destroy_cq(target->cq); + goto out; + } + + ret = srp_init_qp(target, target->qp); + if (ret) { + ib_destroy_qp(target->qp); + ib_destroy_cq(target->cq); + goto out; + } + +out: + kfree(init_attr); + return ret; +} + +static void srp_free_target_ib(struct srp_target_port *target) +{ + int i; + + ib_destroy_qp(target->qp); + ib_destroy_cq(target->cq); + + for (i = 0; i < SRP_RQ_SIZE; ++i) + srp_free_iu(target->srp_host, target->rx_ring[i]); + for (i = 0; i < SRP_SQ_SIZE + 1; ++i) + srp_free_iu(target->srp_host, target->tx_ring[i]); +} + +static void srp_path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *target_ptr) +{ + struct srp_target_port *target = target_ptr; + + target->status = status; + if (status) + printk(KERN_ERR PFX "Got failed path rec status %d\n", status); + else + target->path = *pathrec; + complete(&target->done); +} + +static int srp_lookup_path(struct srp_target_port *target) +{ + target->path.numb_path = 1; + + init_completion(&target->done); + + target->path_query_id = ib_sa_path_rec_get(target->srp_host->dev, + target->srp_host->port, + &target->path, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + SRP_PATH_REC_TIMEOUT_MS, + GFP_KERNEL, + srp_path_rec_completion, + target, &target->path_query); + if (target->path_query_id < 0) + return target->path_query_id; + + wait_for_completion(&target->done); + + if (target->status < 0) + printk(KERN_WARNING PFX 
"Path record query failed\n"); + + return target->status; +} + +static int srp_send_req(struct srp_target_port *target) +{ + struct { + struct ib_cm_req_param param; + struct srp_login_req priv; + } *req = NULL; + int status; + + req = kzalloc(sizeof *req, GFP_KERNEL); + if (!req) + return -ENOMEM; + + req->param.primary_path = &target->path; + req->param.alternate_path = NULL; + req->param.service_id = target->service_id; + req->param.qp_num = target->qp->qp_num; + req->param.qp_type = target->qp->qp_type; + req->param.starting_psn = 0; /* XXX */ + req->param.private_data = &req->priv; + req->param.private_data_len = sizeof req->priv; + req->param.responder_resources = 4; + req->param.remote_cm_response_timeout = 20; + req->param.flow_control = 1; + req->param.local_cm_response_timeout = 20; + req->param.retry_count = 7; + req->param.rnr_retry_count = 7; + req->param.max_cm_retries = 15; + + req->priv.opcode = SRP_LOGIN_REQ; + req->priv.tag = 0; + req->priv.req_it_iu_len = cpu_to_be32(SRP_MAX_IU_LEN); + req->priv.req_buf_fmt = cpu_to_be16(SRP_BUF_FORMAT_DIRECT | + SRP_BUF_FORMAT_INDIRECT); + memcpy(req->priv.initiator_port_id, target->srp_host->initiator_port_id, 16); + /* + * Topspin/Cisco SRP targets will reject our login unless we + * zero out the first 8 bytes of our initiator port ID. The + * second 8 bytes must be our local node GUID, but we always + * use that anyway. 
+ */ + if (topspin_workarounds && !memcmp(&target->ioc_guid, topspin_oui, 3)) { + printk(KERN_DEBUG PFX "Topspin/Cisco initiator port ID workaround " + "activated for target GUID %016llx\n", + (unsigned long long) be64_to_cpu(target->ioc_guid)); + memset(req->priv.initiator_port_id, 0, 8); + } + memcpy(req->priv.target_port_id, &target->id_ext, 8); + memcpy(req->priv.target_port_id + 8, &target->ioc_guid, 8); + + status = ib_send_cm_req(target->cm_id, &req->param); + + kfree(req); + + return status; +} + +static void srp_disconnect_target(struct srp_target_port *target) +{ + /* XXX should send SRP_I_LOGOUT request */ + + init_completion(&target->done); + ib_send_cm_dreq(target->cm_id, NULL, 0); + wait_for_completion(&target->done); +} + +static void srp_remove_work(void *target_ptr) +{ + struct srp_target_port *target = target_ptr; + + spin_lock_irq(target->scsi_host->host_lock); + if (target->state != SRP_TARGET_DEAD) { + spin_unlock_irq(target->scsi_host->host_lock); + scsi_host_put(target->scsi_host); + return; + } + target->state = SRP_TARGET_REMOVED; + spin_unlock_irq(target->scsi_host->host_lock); + + down(&target->srp_host->target_mutex); + list_del(&target->list); + up(&target->srp_host->target_mutex); + + scsi_remove_host(target->scsi_host); + ib_destroy_cm_id(target->cm_id); + srp_free_target_ib(target); + scsi_host_put(target->scsi_host); + /* And another put to really free the target port... */ + scsi_host_put(target->scsi_host); +} + +static int srp_connect_target(struct srp_target_port *target) +{ + int ret; + + ret = srp_lookup_path(target); + if (ret) + return ret; + + while (1) { + init_completion(&target->done); + ret = srp_send_req(target); + if (ret) + return ret; + wait_for_completion(&target->done); + + /* + * The CM event handling code will set status to + * SRP_PORT_REDIRECT if we get a port redirect REJ + * back, or SRP_DLID_REDIRECT if we get a lid/qp + * redirect REJ back. 
+ */ + switch (target->status) { + case 0: + return 0; + + case SRP_PORT_REDIRECT: + ret = srp_lookup_path(target); + if (ret) + return ret; + break; + + case SRP_DLID_REDIRECT: + break; + + default: + return target->status; + } + } +} + +static int srp_reconnect_target(struct srp_target_port *target) +{ + struct ib_cm_id *new_cm_id; + struct ib_qp_attr qp_attr; + struct srp_request *req; + struct ib_wc wc; + int ret; + int i; + + spin_lock_irq(target->scsi_host->host_lock); + if (target->state != SRP_TARGET_LIVE) { + spin_unlock_irq(target->scsi_host->host_lock); + return -EAGAIN; + } + target->state = SRP_TARGET_CONNECTING; + spin_unlock_irq(target->scsi_host->host_lock); + + srp_disconnect_target(target); + /* + * Now get a new local CM ID so that we avoid confusing the + * target in case things are really fouled up. + */ + new_cm_id = ib_create_cm_id(target->srp_host->dev, + srp_cm_handler, target); + if (IS_ERR(new_cm_id)) { + ret = PTR_ERR(new_cm_id); + goto err; + } + ib_destroy_cm_id(target->cm_id); + target->cm_id = new_cm_id; + + qp_attr.qp_state = IB_QPS_RESET; + ret = ib_modify_qp(target->qp, &qp_attr, IB_QP_STATE); + if (ret) + goto err; + + ret = srp_init_qp(target, target->qp); + if (ret) + goto err; + + while (ib_poll_cq(target->cq, 1, &wc) > 0) + ; /* nothing */ + + list_for_each_entry(req, &target->req_queue, list) { + req->scmnd->result = DID_RESET << 16; + req->scmnd->scsi_done(req->scmnd); + } + + target->rx_head = 0; + target->tx_head = 0; + target->tx_tail = 0; + target->req_head = 0; + for (i = 0; i < SRP_SQ_SIZE - 1; ++i) + target->req_ring[i].next = i + 1; + target->req_ring[SRP_SQ_SIZE - 1].next = -1; + INIT_LIST_HEAD(&target->req_queue); + + ret = srp_connect_target(target); + if (ret) + goto err; + + spin_lock_irq(target->scsi_host->host_lock); + if (target->state == SRP_TARGET_CONNECTING) { + ret = 0; + target->state = SRP_TARGET_LIVE; + } else + ret = -EAGAIN; + spin_unlock_irq(target->scsi_host->host_lock); + + return ret; + +err: + 
printk(KERN_ERR PFX "reconnect failed (%d), removing target port.\n", ret); + + /* + * We couldn't reconnect, so kill our target port off. + * However, we have to defer the real removal because we might + * be in the context of the SCSI error handler now, which + * would deadlock if we call scsi_remove_host(). + */ + spin_lock_irq(target->scsi_host->host_lock); + if (target->state == SRP_TARGET_CONNECTING) { + target->state = SRP_TARGET_DEAD; + INIT_WORK(&target->work, srp_remove_work, target); + schedule_work(&target->work); + } + spin_unlock_irq(target->scsi_host->host_lock); + + return ret; +} + +static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_target_port *target, + struct srp_request *req) +{ + struct srp_cmd *cmd = req->cmd->buf; + int len; + u8 fmt; + + if (!scmnd->request_buffer || scmnd->sc_data_direction == DMA_NONE) + return sizeof (struct srp_cmd); + + if (scmnd->sc_data_direction != DMA_FROM_DEVICE && + scmnd->sc_data_direction != DMA_TO_DEVICE) { + printk(KERN_WARNING PFX "Unhandled data direction %d\n", + scmnd->sc_data_direction); + return -EINVAL; + } + + if (scmnd->use_sg) { + struct scatterlist *scat = scmnd->request_buffer; + int n; + int i; + + n = dma_map_sg(target->srp_host->dev->dma_device, + scat, scmnd->use_sg, scmnd->sc_data_direction); + + if (n == 1) { + struct srp_direct_buf *buf = (void *) cmd->add_data; + + fmt = SRP_DATA_DESC_DIRECT; + + buf->va = cpu_to_be64(sg_dma_address(scat)); + buf->key = cpu_to_be32(target->srp_host->mr->rkey); + buf->len = cpu_to_be32(sg_dma_len(scat)); + + len = sizeof (struct srp_cmd) + + sizeof (struct srp_direct_buf); + } else { + struct srp_indirect_buf *buf = (void *) cmd->add_data; + u32 datalen = 0; + + fmt = SRP_DATA_DESC_INDIRECT; + + if (scmnd->sc_data_direction == DMA_TO_DEVICE) + cmd->data_out_desc_cnt = n; + else + cmd->data_in_desc_cnt = n; + + buf->table_desc.va = cpu_to_be64(req->cmd->dma + + sizeof *cmd + + sizeof *buf); + buf->table_desc.key = + 
cpu_to_be32(target->srp_host->mr->rkey); + buf->table_desc.len = + cpu_to_be32(n * sizeof (struct srp_direct_buf)); + + for (i = 0; i < n; ++i) { + buf->desc_list[i].va = cpu_to_be64(sg_dma_address(&scat[i])); + buf->desc_list[i].key = + cpu_to_be32(target->srp_host->mr->rkey); + buf->desc_list[i].len = cpu_to_be32(sg_dma_len(&scat[i])); + + datalen += sg_dma_len(&scat[i]); + } + + buf->len = cpu_to_be32(datalen); + + len = sizeof (struct srp_cmd) + + sizeof (struct srp_indirect_buf) + + n * sizeof (struct srp_direct_buf); + } + } else { + struct srp_direct_buf *buf = (void *) cmd->add_data; + dma_addr_t dma; + + dma = dma_map_single(target->srp_host->dev->dma_device, + scmnd->request_buffer, scmnd->request_bufflen, + scmnd->sc_data_direction); + if (dma_mapping_error(dma)) { + printk(KERN_WARNING PFX "unable to map %p/%d (dir %d)\n", + scmnd->request_buffer, (int) scmnd->request_bufflen, + scmnd->sc_data_direction); + return -EINVAL; + } + + pci_unmap_addr_set(req, direct_mapping, dma); + + buf->va = cpu_to_be64(dma); + buf->key = cpu_to_be32(target->srp_host->mr->rkey); + buf->len = cpu_to_be32(scmnd->request_bufflen); + + fmt = SRP_DATA_DESC_DIRECT; + + len = sizeof (struct srp_cmd) + sizeof (struct srp_direct_buf); + } + + if (scmnd->sc_data_direction == DMA_TO_DEVICE) + cmd->buf_fmt = fmt << 4; + else + cmd->buf_fmt = fmt; + + + return len; +} + +static void srp_unmap_data(struct scsi_cmnd *scmnd, + struct srp_target_port *target, + struct srp_request *req) +{ + if (!scmnd->request_buffer || + (scmnd->sc_data_direction != DMA_TO_DEVICE && + scmnd->sc_data_direction != DMA_FROM_DEVICE)) + return; + + if (scmnd->use_sg) + dma_unmap_sg(target->srp_host->dev->dma_device, + (struct scatterlist *) scmnd->request_buffer, + scmnd->use_sg, scmnd->sc_data_direction); + else + dma_unmap_single(target->srp_host->dev->dma_device, + pci_unmap_addr(req, direct_mapping), + scmnd->request_bufflen, + scmnd->sc_data_direction); +} + +static void srp_process_rsp(struct 
srp_target_port *target, struct srp_rsp *rsp) +{ + struct srp_request *req; + struct scsi_cmnd *scmnd; + unsigned long flags; + s32 delta; + + delta = (s32) be32_to_cpu(rsp->req_lim_delta); + + spin_lock_irqsave(target->scsi_host->host_lock, flags); + + target->req_lim += delta; + + req = &target->req_ring[rsp->tag & ~SRP_TAG_TSK_MGMT]; + + if (unlikely(rsp->tag & SRP_TAG_TSK_MGMT)) { + if (be32_to_cpu(rsp->resp_data_len) < 4) + req->tsk_status = -1; + else + req->tsk_status = rsp->data[3]; + complete(&req->done); + } else { + scmnd = req->scmnd; + if (!scmnd) + printk(KERN_ERR "Null scmnd for RSP w/tag %016llx\n", + (unsigned long long) rsp->tag); + scmnd->result = rsp->status; + + if (rsp->flags & SRP_RSP_FLAG_SNSVALID) { + memcpy(scmnd->sense_buffer, rsp->data + + be32_to_cpu(rsp->resp_data_len), + min_t(int, be32_to_cpu(rsp->sense_data_len), + SCSI_SENSE_BUFFERSIZE)); + } + + if (rsp->flags & (SRP_RSP_FLAG_DOOVER | SRP_RSP_FLAG_DOUNDER)) + scmnd->resid = be32_to_cpu(rsp->data_out_res_cnt); + else if (rsp->flags & (SRP_RSP_FLAG_DIOVER | SRP_RSP_FLAG_DIUNDER)) + scmnd->resid = be32_to_cpu(rsp->data_in_res_cnt); + + srp_unmap_data(scmnd, target, req); + + if (!req->tsk_mgmt) { + req->scmnd = NULL; + scmnd->host_scribble = (void *) -1L; + scmnd->scsi_done(scmnd); + + list_del(&req->list); + req->next = target->req_head; + target->req_head = rsp->tag & ~SRP_TAG_TSK_MGMT; + } else + req->cmd_done = 1; + } + + spin_unlock_irqrestore(target->scsi_host->host_lock, flags); +} + +static void srp_reconnect_work(void *target_ptr) +{ + struct srp_target_port *target = target_ptr; + + srp_reconnect_target(target); +} + +static void srp_handle_recv(struct srp_target_port *target, struct ib_wc *wc) +{ + struct srp_iu *iu; + u8 opcode; + + iu = target->rx_ring[wc->wr_id & ~SRP_OP_RECV]; + + dma_sync_single_for_cpu(target->srp_host->dev->dma_device, iu->dma, + target->max_ti_iu_len, DMA_FROM_DEVICE); + + opcode = *(u8 *) iu->buf; + + if (0) { + int i; + + printk(KERN_ERR PFX 
"recv completion, opcode 0x%02x\n", opcode); + + for (i = 0; i < wc->byte_len; ++i) { + if (i % 8 == 0) + printk(KERN_ERR " [%02x] ", i); + printk(" %02x", ((u8 *) iu->buf)[i]); + if ((i + 1) % 8 == 0) + printk("\n"); + } + + if (wc->byte_len % 8) + printk("\n"); + } + + switch (opcode) { + case SRP_RSP: + srp_process_rsp(target, iu->buf); + break; + + case SRP_T_LOGOUT: + /* XXX Handle target logout */ + printk(KERN_WARNING PFX "Got target logout request\n"); + break; + + default: + printk(KERN_WARNING PFX "Unhandled SRP opcode 0x%02x\n", opcode); + break; + } + + dma_sync_single_for_device(target->srp_host->dev->dma_device, iu->dma, + target->max_ti_iu_len, DMA_FROM_DEVICE); +} + +static void srp_completion(struct ib_cq *cq, void *target_ptr) +{ + struct srp_target_port *target = target_ptr; + struct ib_wc wc; + unsigned long flags; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(cq, 1, &wc) > 0) { + if (wc.status) { + printk(KERN_ERR PFX "failed %s status %d\n", + wc.wr_id & SRP_OP_RECV ? 
"receive" : "send", + wc.status); + spin_lock_irqsave(target->scsi_host->host_lock, flags); + if (target->state == SRP_TARGET_LIVE) + schedule_work(&target->work); + spin_unlock_irqrestore(target->scsi_host->host_lock, flags); + break; + } + + if (wc.wr_id & SRP_OP_RECV) + srp_handle_recv(target, &wc); + else + ++target->tx_tail; + } +} + +static int __srp_post_recv(struct srp_target_port *target) +{ + struct srp_iu *iu; + struct ib_sge list; + struct ib_recv_wr wr, *bad_wr; + unsigned int next; + int ret; + + next = target->rx_head & (SRP_RQ_SIZE - 1); + wr.wr_id = next | SRP_OP_RECV; + iu = target->rx_ring[next]; + + list.addr = iu->dma; + list.length = iu->size; + list.lkey = target->srp_host->mr->lkey; + + wr.next = NULL; + wr.sg_list = &list; + wr.num_sge = 1; + + ret = ib_post_recv(target->qp, &wr, &bad_wr); + if (!ret) + ++target->rx_head; + + return ret; +} + +static int srp_post_recv(struct srp_target_port *target) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(target->scsi_host->host_lock, flags); + ret = __srp_post_recv(target); + spin_unlock_irqrestore(target->scsi_host->host_lock, flags); + + return ret; +} + +/* + * Must be called with target->scsi_host->host_lock held to protect + * req_lim and tx_head. + */ +static struct srp_iu *__srp_get_tx_iu(struct srp_target_port *target) +{ + if (target->tx_head - target->tx_tail >= SRP_SQ_SIZE) + return NULL; + + return target->tx_ring[target->tx_head & SRP_SQ_SIZE]; +} + +/* + * Must be called with target->scsi_host->host_lock held to protect + * req_lim and tx_head. 
+ */ +static int __srp_post_send(struct srp_target_port *target, + struct srp_iu *iu, int len) +{ + struct ib_sge list; + struct ib_send_wr wr, *bad_wr; + int ret = 0; + + if (target->req_lim < 1) { + printk(KERN_ERR PFX "Target has req_lim %d\n", target->req_lim); + return -EAGAIN; + } + + list.addr = iu->dma; + list.length = len; + list.lkey = target->srp_host->mr->lkey; + + wr.next = NULL; + wr.wr_id = target->tx_head & SRP_SQ_SIZE; + wr.sg_list = &list; + wr.num_sge = 1; + wr.opcode = IB_WR_SEND; + wr.send_flags = IB_SEND_SIGNALED; + + ret = ib_post_send(target->qp, &wr, &bad_wr); + + if (!ret) { + ++target->tx_head; + --target->req_lim; + } + + return ret; +} + +static int srp_queuecommand(struct scsi_cmnd *scmnd, + void (*done)(struct scsi_cmnd *)) +{ + struct srp_target_port *target = host_to_target(scmnd->device->host); + struct srp_request *req; + struct srp_iu *iu; + struct srp_cmd *cmd; + long req_index; + int len; + + if (target->state == SRP_TARGET_CONNECTING) + goto err; + + if (target->state == SRP_TARGET_DEAD || + target->state == SRP_TARGET_REMOVED) { + scmnd->result = DID_BAD_TARGET << 16; + done(scmnd); + return 0; + } + + iu = __srp_get_tx_iu(target); + if (!iu) + goto err; + + dma_sync_single_for_cpu(target->srp_host->dev->dma_device, iu->dma, + SRP_MAX_IU_LEN, DMA_TO_DEVICE); + + req_index = target->req_head; + + scmnd->scsi_done = done; + scmnd->result = 0; + scmnd->host_scribble = (void *) req_index; + + cmd = iu->buf; + memset(cmd, 0, sizeof *cmd); + + cmd->opcode = SRP_CMD; + cmd->lun = cpu_to_be64((u64) scmnd->device->lun << 48); + cmd->tag = req_index; + memcpy(cmd->cdb, scmnd->cmnd, scmnd->cmd_len); + + req = &target->req_ring[req_index]; + + req->scmnd = scmnd; + req->cmd = iu; + req->cmd_done = 0; + req->tsk_mgmt = NULL; + + len = srp_map_data(scmnd, target, req); + if (len < 0) { + printk(KERN_ERR PFX "Failed to map data\n"); + goto err; + } + + if (__srp_post_recv(target)) { + printk(KERN_ERR PFX "Recv failed\n"); + goto err_unmap; 
+ } + + dma_sync_single_for_device(target->srp_host->dev->dma_device, iu->dma, + SRP_MAX_IU_LEN, DMA_TO_DEVICE); + + if (__srp_post_send(target, iu, len)) { + printk(KERN_ERR PFX "Send failed\n"); + goto err_unmap; + } + + target->req_head = req->next; + list_add_tail(&req->list, &target->req_queue); + + return 0; + +err_unmap: + srp_unmap_data(scmnd, target, req); + +err: + return SCSI_MLQUEUE_HOST_BUSY; +} + +static int srp_alloc_iu_bufs(struct srp_target_port *target) +{ + int i; + + for (i = 0; i < SRP_RQ_SIZE; ++i) { + target->rx_ring[i] = srp_alloc_iu(target->srp_host, + target->max_ti_iu_len, + GFP_KERNEL, DMA_FROM_DEVICE); + if (!target->rx_ring[i]) + goto err; + } + + for (i = 0; i < SRP_SQ_SIZE + 1; ++i) { + target->tx_ring[i] = srp_alloc_iu(target->srp_host, + SRP_MAX_IU_LEN, + GFP_KERNEL, DMA_TO_DEVICE); + if (!target->tx_ring[i]) + goto err; + } + + return 0; + +err: + for (i = 0; i < SRP_RQ_SIZE; ++i) { + srp_free_iu(target->srp_host, target->rx_ring[i]); + target->rx_ring[i] = NULL; + } + + for (i = 0; i < SRP_SQ_SIZE + 1; ++i) { + srp_free_iu(target->srp_host, target->tx_ring[i]); + target->tx_ring[i] = NULL; + } + + return -ENOMEM; +} + +static void srp_cm_rej_handler(struct ib_cm_id *cm_id, + struct ib_cm_event *event, + struct srp_target_port *target) +{ + struct ib_class_port_info *cpi; + int opcode; + + switch (event->param.rej_rcvd.reason) { + case IB_CM_REJ_PORT_CM_REDIRECT: + cpi = event->param.rej_rcvd.ari; + target->path.dlid = cpi->redirect_lid; + target->path.pkey = cpi->redirect_pkey; + cm_id->remote_cm_qpn = be32_to_cpu(cpi->redirect_qp) & 0x00ffffff; + memcpy(target->path.dgid.raw, cpi->redirect_gid, 16); + + target->status = target->path.dlid ? + SRP_DLID_REDIRECT : SRP_PORT_REDIRECT; + break; + + case IB_CM_REJ_PORT_REDIRECT: + if (topspin_workarounds && + !memcmp(&target->ioc_guid, topspin_oui, 3)) { + /* + * Topspin/Cisco SRP gateways incorrectly send + * reject reason code 25 when they mean 24 + * (port redirect). 
+ */ + memcpy(target->path.dgid.raw, + event->param.rej_rcvd.ari, 16); + + printk(KERN_DEBUG PFX "Topspin/Cisco redirect to target port GID %016llx%016llx\n", + (unsigned long long) be64_to_cpu(target->path.dgid.global.subnet_prefix), + (unsigned long long) be64_to_cpu(target->path.dgid.global.interface_id)); + + target->status = SRP_PORT_REDIRECT; + } else { + printk(KERN_WARNING " REJ reason: IB_CM_REJ_PORT_REDIRECT\n"); + target->status = -ECONNRESET; + } + break; + + case IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID: + printk(KERN_WARNING " REJ reason: IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID\n"); + target->status = -ECONNRESET; + break; + + case IB_CM_REJ_CONSUMER_DEFINED: + opcode = *(u8 *) event->private_data; + if (opcode == SRP_LOGIN_REJ) { + struct srp_login_rej *rej = event->private_data; + u32 reason = be32_to_cpu(rej->reason); + + if (reason == SRP_LOGIN_REJ_REQ_IT_IU_LENGTH_TOO_LARGE) + printk(KERN_WARNING PFX + "SRP_LOGIN_REJ: requested max_it_iu_len too large\n"); + else + printk(KERN_WARNING PFX + "SRP LOGIN REJECTED, reason 0x%08x\n", reason); + } else + printk(KERN_WARNING " REJ reason: IB_CM_REJ_CONSUMER_DEFINED," + " opcode 0x%02x\n", opcode); + target->status = -ECONNRESET; + break; + + default: + printk(KERN_WARNING " REJ reason 0x%x\n", + event->param.rej_rcvd.reason); + target->status = -ECONNRESET; + } +} + +static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct srp_target_port *target = cm_id->context; + struct ib_qp_attr *qp_attr = NULL; + int attr_mask = 0; + int comp = 0; + int opcode = 0; + + switch (event->event) { + case IB_CM_REQ_ERROR: + printk(KERN_DEBUG PFX "Sending CM REQ failed\n"); + comp = 1; + target->status = -ECONNRESET; + break; + + case IB_CM_REP_RECEIVED: + comp = 1; + opcode = *(u8 *) event->private_data; + + if (opcode == SRP_LOGIN_RSP) { + struct srp_login_rsp *rsp = event->private_data; + + target->max_ti_iu_len = be32_to_cpu(rsp->max_ti_iu_len); + target->req_lim = be32_to_cpu(rsp->req_lim_delta); 
+ + target->scsi_host->can_queue = min(target->req_lim, + target->scsi_host->can_queue); + } else { + printk(KERN_WARNING PFX "Unhandled RSP opcode %#x\n", opcode); + target->status = -ECONNRESET; + break; + } + + target->status = srp_alloc_iu_bufs(target); + if (target->status) + break; + + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) { + target->status = -ENOMEM; + break; + } + + qp_attr->qp_state = IB_QPS_RTR; + target->status = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (target->status) + break; + + qp_attr->rq_psn = 0; /* XXX */ + attr_mask |= IB_QP_RQ_PSN; + + target->status = ib_modify_qp(target->qp, qp_attr, attr_mask); + if (target->status) + break; + + target->status = srp_post_recv(target); + if (target->status) + break; + + qp_attr->qp_state = IB_QPS_RTS; + target->status = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (target->status) + break; + + target->status = ib_modify_qp(target->qp, qp_attr, attr_mask); + if (target->status) + break; + + target->status = ib_send_cm_rtu(cm_id, NULL, 0); + if (target->status) + break; + + break; + + case IB_CM_REJ_RECEIVED: + printk(KERN_DEBUG PFX "REJ received\n"); + comp = 1; + + srp_cm_rej_handler(cm_id, event, target); + break; + + case IB_CM_MRA_RECEIVED: + printk(KERN_ERR PFX "MRA received\n"); + break; + + case IB_CM_DREP_RECEIVED: + break; + + case IB_CM_TIMEWAIT_EXIT: + printk(KERN_ERR PFX "connection closed\n"); + + comp = 1; + target->status = 0; + break; + + default: + printk(KERN_WARNING PFX "Unhandled CM event %d\n", event->event); + break; + } + + if (comp) + complete(&target->done); + + kfree(qp_attr); + + return 0; +} + +static int srp_send_tsk_mgmt(struct scsi_cmnd *scmnd, u8 func) +{ + struct srp_target_port *target = host_to_target(scmnd->device->host); + struct srp_request *req; + struct srp_iu *iu; + struct srp_tsk_mgmt *tsk_mgmt; + int req_index; + int ret = FAILED; + + spin_lock_irq(target->scsi_host->host_lock); + + if (scmnd->host_scribble == (void *) 
-1L) + goto out; + + req_index = (long) scmnd->host_scribble; + printk(KERN_ERR "Abort for req_index %d\n", req_index); + + req = &target->req_ring[req_index]; + init_completion(&req->done); + + iu = __srp_get_tx_iu(target); + if (!iu) + goto out; + + tsk_mgmt = iu->buf; + memset(tsk_mgmt, 0, sizeof *tsk_mgmt); + + tsk_mgmt->opcode = SRP_TSK_MGMT; + tsk_mgmt->lun = cpu_to_be64((u64) scmnd->device->lun << 48); + tsk_mgmt->tag = req_index | SRP_TAG_TSK_MGMT; + tsk_mgmt->tsk_mgmt_func = func; + tsk_mgmt->task_tag = req_index; + + if (__srp_post_send(target, iu, sizeof *tsk_mgmt)) + goto out; + + req->tsk_mgmt = iu; + + spin_unlock_irq(target->scsi_host->host_lock); + if (!wait_for_completion_timeout(&req->done, + msecs_to_jiffies(SRP_ABORT_TIMEOUT_MS))) + return FAILED; + spin_lock_irq(target->scsi_host->host_lock); + + if (req->cmd_done) { + list_del(&req->list); + req->next = target->req_head; + target->req_head = req_index; + + scmnd->scsi_done(scmnd); + } else if (!req->tsk_status) { + scmnd->result = DID_ABORT << 16; + ret = SUCCESS; + } + +out: + spin_unlock_irq(target->scsi_host->host_lock); + return ret; +} + +static int srp_abort(struct scsi_cmnd *scmnd) +{ + printk(KERN_ERR "SRP abort called\n"); + + return srp_send_tsk_mgmt(scmnd, SRP_TSK_ABORT_TASK); +} + +static int srp_reset_device(struct scsi_cmnd *scmnd) +{ + printk(KERN_ERR "SRP reset_device called\n"); + + return srp_send_tsk_mgmt(scmnd, SRP_TSK_LUN_RESET); +} + +static int srp_reset_host(struct scsi_cmnd *scmnd) +{ + struct srp_target_port *target = host_to_target(scmnd->device->host); + int ret = FAILED; + + printk(KERN_ERR PFX "SRP reset_host called\n"); + + if (!srp_reconnect_target(target)) + ret = SUCCESS; + + return ret; +} + +static struct scsi_host_template srp_template = { + .module = THIS_MODULE, + .name = DRV_NAME, + .info = srp_target_info, + .queuecommand = srp_queuecommand, + .eh_abort_handler = srp_abort, + .eh_device_reset_handler = srp_reset_device, + .eh_host_reset_handler = 
srp_reset_host, + .can_queue = SRP_SQ_SIZE, + .this_id = -1, + .sg_tablesize = SRP_MAX_INDIRECT, + .cmd_per_lun = SRP_SQ_SIZE, + .use_clustering = ENABLE_CLUSTERING +}; + +static int srp_add_target(struct srp_host *host, struct srp_target_port *target) +{ + sprintf(target->target_name, "SRP.T10:%016llX", + (unsigned long long) be64_to_cpu(target->id_ext)); + + if (scsi_add_host(target->scsi_host, host->dev->dma_device)) + return -ENODEV; + + down(&host->target_mutex); + list_add_tail(&target->list, &host->target_list); + up(&host->target_mutex); + + target->state = SRP_TARGET_LIVE; + + /* XXX: are we supposed to have a definition of SCAN_WILD_CARD ?? */ + scsi_scan_target(&target->scsi_host->shost_gendev, + 0, target->scsi_id, ~0, 0); + + return 0; +} + +static void srp_release_class_dev(struct class_device *class_dev) +{ + struct srp_host *host = + container_of(class_dev, struct srp_host, class_dev); + + complete(&host->released); +} + +static struct class srp_class = { + .name = "infiniband_srp", + .release = srp_release_class_dev +}; + +/* + * Target ports are added by writing + * + * id_ext=,ioc_guid=,dgid=, + * pkey=,service_id= + * + * to the add_target sysfs attribute. 
+ */ +enum { + SRP_OPT_ERR = 0, + SRP_OPT_ID_EXT = 1 << 0, + SRP_OPT_IOC_GUID = 1 << 1, + SRP_OPT_DGID = 1 << 2, + SRP_OPT_PKEY = 1 << 3, + SRP_OPT_SERVICE_ID = 1 << 4, + SRP_OPT_MAX_SECT = 1 << 5, + SRP_OPT_ALL = (SRP_OPT_ID_EXT | + SRP_OPT_IOC_GUID | + SRP_OPT_DGID | + SRP_OPT_PKEY | + SRP_OPT_SERVICE_ID), +}; + +static match_table_t srp_opt_tokens = { + { SRP_OPT_ID_EXT, "id_ext=%s" }, + { SRP_OPT_IOC_GUID, "ioc_guid=%s" }, + { SRP_OPT_DGID, "dgid=%s" }, + { SRP_OPT_PKEY, "pkey=%x" }, + { SRP_OPT_SERVICE_ID, "service_id=%s" }, + { SRP_OPT_MAX_SECT, "max_sect=%d" }, + { SRP_OPT_ERR, NULL } +}; + +static int srp_parse_options(const char *buf, struct srp_target_port *target) +{ + char *options; + char *p; + char dgid[3]; + substring_t args[MAX_OPT_ARGS]; + int opt_mask = 0; + int token; + int ret = -EINVAL; + int i; + + options = kstrdup(buf, GFP_KERNEL); + if (!options) + return -ENOMEM; + + while ((p = strsep(&options, ",")) != NULL) { + if (!*p) + continue; + + token = match_token(p, srp_opt_tokens, args); + opt_mask |= token; + + switch (token) { + case SRP_OPT_ID_EXT: + p = match_strdup(args); + target->id_ext = cpu_to_be64(simple_strtoull(p, NULL, 16)); + kfree(p); + break; + + case SRP_OPT_IOC_GUID: + p = match_strdup(args); + target->ioc_guid = cpu_to_be64(simple_strtoull(p, NULL, 16)); + kfree(p); + break; + + case SRP_OPT_DGID: + p = match_strdup(args); + if (strlen(p) != 32) + goto out; + + for (i = 0; i < 16; ++i) { + strlcpy(dgid, p + i * 2, 3); + target->path.dgid.raw[i] = simple_strtoul(dgid, NULL, 16); + } + break; + + case SRP_OPT_PKEY: + if (match_hex(args, &token)) + goto out; + target->path.pkey = cpu_to_be16(token); + break; + + case SRP_OPT_SERVICE_ID: + p = match_strdup(args); + target->service_id = cpu_to_be64(simple_strtoull(p, NULL, 16)); + kfree(p); + break; + + case SRP_OPT_MAX_SECT: + if (match_int(args, &token)) + goto out; + target->scsi_host->max_sectors = token; + break; + + default: + goto out; + } + } + + if (opt_mask == 
SRP_OPT_ALL) + ret = 0; + +out: + kfree(options); + return ret; +} + +static ssize_t srp_create_target(struct class_device *class_dev, + const char *buf, size_t count) +{ + struct srp_host *host = + container_of(class_dev, struct srp_host, class_dev); + struct Scsi_Host *target_host; + struct srp_target_port *target; + int ret; + int i; + + target_host = scsi_host_alloc(&srp_template, + sizeof (struct srp_target_port)); + if (!target_host) + return -ENOMEM; + + target = host_to_target(target_host); + memset(target, 0, sizeof *target); + + target->scsi_host = target_host; + target->srp_host = host; + + INIT_WORK(&target->work, srp_reconnect_work, target); + + for (i = 0; i < SRP_SQ_SIZE - 1; ++i) + target->req_ring[i].next = i + 1; + target->req_ring[SRP_SQ_SIZE - 1].next = -1; + INIT_LIST_HEAD(&target->req_queue); + + ret = srp_parse_options(buf, target); + if (ret) + goto err; + + ib_get_cached_gid(host->dev, host->port, 0, &target->path.sgid); + + printk(KERN_DEBUG PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x " + "service_id %016llx dgid %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + (unsigned long long) be64_to_cpu(target->id_ext), + (unsigned long long) be64_to_cpu(target->ioc_guid), + be16_to_cpu(target->path.pkey), + (unsigned long long) be64_to_cpu(target->service_id), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[0]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[2]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[4]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[6]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[8]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[10]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[12]), + (int) be16_to_cpu(*(__be16 *) &target->path.dgid.raw[14])); + + ret = srp_create_target_ib(target); + if (ret) + goto err; + + target->cm_id = ib_create_cm_id(host->dev, srp_cm_handler, target); + if (IS_ERR(target->cm_id)) { + ret = PTR_ERR(target->cm_id); + goto 
err_free; + } + + ret = srp_connect_target(target); + if (ret) { + printk(KERN_ERR PFX "Connection failed\n"); + goto err_cm_id; + } + + ret = srp_add_target(host, target); + if (ret) + goto err_disconnect; + + return count; + +err_disconnect: + srp_disconnect_target(target); + +err_cm_id: + ib_destroy_cm_id(target->cm_id); + +err_free: + srp_free_target_ib(target); + +err: + scsi_host_put(target_host); + + return ret; +} + +static CLASS_DEVICE_ATTR(add_target, S_IWUSR, NULL, srp_create_target); + +static struct srp_host *srp_add_port(struct ib_device *device, + __be64 node_guid, u8 port) +{ + struct srp_host *host; + + host = kzalloc(sizeof *host, GFP_KERNEL); + if (!host) + return NULL; + + INIT_LIST_HEAD(&host->target_list); + init_MUTEX(&host->target_mutex); + init_completion(&host->released); + host->dev = device; + host->port = port; + + host->initiator_port_id[7] = port; + memcpy(host->initiator_port_id + 8, &node_guid, 8); + + host->pd = ib_alloc_pd(device); + if (IS_ERR(host->pd)) + goto err_free; + + host->mr = ib_get_dma_mr(host->pd, + IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE); + if (IS_ERR(host->mr)) + goto err_pd; + + host->class_dev.class = &srp_class; + host->class_dev.dev = device->dma_device; + snprintf(host->class_dev.class_id, BUS_ID_SIZE, "srp-%s-%d", + device->name, port); + + if (class_device_register(&host->class_dev)) + goto err_mr; + if (class_device_create_file(&host->class_dev, &class_device_attr_add_target)) + goto err_class; + /* XXX ibdev / port files as well */ + + return host; + +err_class: + class_device_unregister(&host->class_dev); + +err_mr: + ib_dereg_mr(host->mr); + +err_pd: + ib_dealloc_pd(host->pd); + +err_free: + kfree(host); + + return NULL; +} + +static void srp_add_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct srp_host *host; + struct ib_device_attr *dev_attr; + int s, e, p; + + dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL); + if (!dev_attr) + return; + + if 
(ib_query_device(device, dev_attr)) { + printk(KERN_WARNING PFX "Couldn't query node GUID for %s.\n", + device->name); + goto out; + } + + dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL); + if (!dev_list) + goto out; + + INIT_LIST_HEAD(dev_list); + + if (device->node_type == IB_NODE_SWITCH) { + s = 0; + e = 0; + } else { + s = 1; + e = device->phys_port_cnt; + } + + for (p = s; p <= e; ++p) { + host = srp_add_port(device, dev_attr->node_guid, p); + if (host) + list_add_tail(&host->list, dev_list); + } + + ib_set_client_data(device, &srp_client, dev_list); + +out: + kfree(dev_attr); +} + +static void srp_remove_one(struct ib_device *device) +{ + struct list_head *dev_list; + struct srp_host *host, *tmp_host; + LIST_HEAD(target_list); + struct srp_target_port *target, *tmp_target; + unsigned long flags; + + dev_list = ib_get_client_data(device, &srp_client); + + list_for_each_entry_safe(host, tmp_host, dev_list, list) { + class_device_unregister(&host->class_dev); + /* + * Wait for the sysfs entry to go away, so that no new + * target ports can be created. + */ + wait_for_completion(&host->released); + + /* + * Mark all target ports as removed, so we stop queueing + * commands and don't try to reconnect. + */ + down(&host->target_mutex); + list_for_each_entry_safe(target, tmp_target, + &host->target_list, list) { + spin_lock_irqsave(target->scsi_host->host_lock, flags); + if (target->state != SRP_TARGET_REMOVED) + target->state = SRP_TARGET_REMOVED; + spin_unlock_irqrestore(target->scsi_host->host_lock, flags); + } + up(&host->target_mutex); + + /* + * Wait for any reconnection tasks that may have + * started before we marked our target ports as + * removed, and any target port removal tasks. 
+ */ + flush_scheduled_work(); + + list_for_each_entry_safe(target, tmp_target, + &host->target_list, list) { + scsi_remove_host(target->scsi_host); + srp_disconnect_target(target); + ib_destroy_cm_id(target->cm_id); + srp_free_target_ib(target); + scsi_host_put(target->scsi_host); + } + + ib_dereg_mr(host->mr); + ib_dealloc_pd(host->pd); + kfree(host); + } + + kfree(dev_list); +} + +static int __init srp_init_module(void) +{ + int ret; + + ret = class_register(&srp_class); + if (ret) { + printk(KERN_ERR PFX "couldn't register class infiniband_srp\n"); + return ret; + } + + ret = ib_register_client(&srp_client); + if (ret) { + printk(KERN_ERR PFX "couldn't register IB client\n"); + class_unregister(&srp_class); + return ret; + } + + return 0; +} + +static void __exit srp_cleanup_module(void) +{ + ib_unregister_client(&srp_client); + class_unregister(&srp_class); +} + +module_init(srp_init_module); +module_exit(srp_cleanup_module); diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h new file mode 100644 index 0000000..f536252 --- /dev/null +++ b/drivers/infiniband/ulp/srp/ib_srp.h @@ -0,0 +1,334 @@ +/* + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id: ib_srp.h 3893 2005-10-28 21:15:40Z roland $ + */ + +#ifndef IB_SRP_H +#define IB_SRP_H + +#include +#include + +#include + +#include +#include + +#include +#include +#include + +enum { + SRP_PATH_REC_TIMEOUT_MS = 1000, + SRP_ABORT_TIMEOUT_MS = 5000, + + SRP_PORT_REDIRECT = 1, + SRP_DLID_REDIRECT = 2, + + SRP_MAX_IU_LEN = 256, + + SRP_RQ_SHIFT = 6, + SRP_RQ_SIZE = 1 << SRP_RQ_SHIFT, + SRP_SQ_SIZE = SRP_RQ_SIZE - 1, + SRP_CQ_SIZE = SRP_SQ_SIZE + SRP_RQ_SIZE, + + SRP_TAG_TSK_MGMT = 1 << (SRP_RQ_SHIFT + 1) +}; + +#define SRP_OP_RECV (1 << 31) +#define SRP_MAX_INDIRECT ((SRP_MAX_IU_LEN - \ + sizeof (struct srp_cmd) - \ + sizeof (struct srp_indirect_buf)) / 16) + +enum srp_target_state { + SRP_TARGET_LIVE, + SRP_TARGET_CONNECTING, + SRP_TARGET_DEAD, + SRP_TARGET_REMOVED +}; + +struct srp_host { + u8 initiator_port_id[16]; + struct ib_device *dev; + u8 port; + struct ib_pd *pd; + struct ib_mr *mr; + struct class_device class_dev; + struct list_head target_list; + struct semaphore target_mutex; + struct completion released; + struct list_head list; +}; + +struct srp_request { + struct list_head list; + struct scsi_cmnd *scmnd; + struct srp_iu *cmd; + struct srp_iu *tsk_mgmt; + DECLARE_PCI_UNMAP_ADDR(direct_mapping) + struct completion done; + short 
next; + u8 cmd_done; + u8 tsk_status; +}; + +struct srp_target_port { + __be64 id_ext; + __be64 ioc_guid; + __be64 service_id; + struct srp_host *srp_host; + struct Scsi_Host *scsi_host; + char target_name[32]; + unsigned int scsi_id; + + struct ib_sa_path_rec path; + struct ib_sa_query *path_query; + int path_query_id; + + struct ib_cm_id *cm_id; + struct ib_cq *cq; + struct ib_qp *qp; + + int max_ti_iu_len; + s32 req_lim; + + unsigned rx_head; + struct srp_iu *rx_ring[SRP_RQ_SIZE]; + + unsigned tx_head; + unsigned tx_tail; + struct srp_iu *tx_ring[SRP_SQ_SIZE + 1]; + + int req_head; + struct list_head req_queue; + struct srp_request req_ring[SRP_SQ_SIZE]; + + struct work_struct work; + + struct list_head list; + struct completion done; + int status; + enum srp_target_state state; +}; + +struct srp_iu { + dma_addr_t dma; + void *buf; + size_t size; + enum dma_data_direction direction; +}; + +/* + * SRP protocol definitions + */ + +enum { + SRP_LOGIN_REQ = 0x00, + SRP_TSK_MGMT = 0x01, + SRP_CMD = 0x02, + SRP_I_LOGOUT = 0x03, + SRP_LOGIN_RSP = 0xc0, + SRP_RSP = 0xc1, + SRP_LOGIN_REJ = 0xc2, + SRP_T_LOGOUT = 0x80, + SRP_CRED_REQ = 0x81, + SRP_AER_REQ = 0x82, + SRP_CRED_RSP = 0x41, + SRP_AER_RSP = 0x42 +}; + +enum { + SRP_BUF_FORMAT_DIRECT = 1 << 1, + SRP_BUF_FORMAT_INDIRECT = 1 << 2 +}; + +enum { + SRP_NO_DATA_DESC = 0, + SRP_DATA_DESC_DIRECT = 1, + SRP_DATA_DESC_INDIRECT = 2 +}; + +enum { + SRP_TSK_ABORT_TASK = 0x01, + SRP_TSK_ABORT_TASK_SET = 0x02, + SRP_TSK_CLEAR_TASK_SET = 0x04, + SRP_TSK_LUN_RESET = 0x08, + SRP_TSK_CLEAR_ACA = 0x40 +}; + +enum srp_login_rej_reason { + SRP_LOGIN_REJ_UNABLE_ESTABLISH_CHANNEL = 0x00010000, + SRP_LOGIN_REJ_INSUFFICIENT_RESOURCES = 0x00010001, + SRP_LOGIN_REJ_REQ_IT_IU_LENGTH_TOO_LARGE = 0x00010002, + SRP_LOGIN_REJ_UNABLE_ASSOCIATE_CHANNEL = 0x00010003, + SRP_LOGIN_REJ_UNSUPPORTED_DESCRIPTOR_FMT = 0x00010004, + SRP_LOGIN_REJ_MULTI_CHANNEL_UNSUPPORTED = 0x00010005, + SRP_LOGIN_REJ_CHANNEL_LIMIT_REACHED = 0x00010006 +}; + +struct 
srp_direct_buf { + __be64 va; + __be32 key; + __be32 len; +}; + +/* + * We need the packed attribute because the SRP spec puts the list of + * descriptors at an offset of 20, which is not aligned to the size + * of struct srp_direct_buf. + */ +struct srp_indirect_buf { + struct srp_direct_buf table_desc; + __be32 len; + struct srp_direct_buf desc_list[0] __attribute__((packed)); +}; + +enum { + SRP_MULTICHAN_SINGLE = 0, + SRP_MULTICHAN_MULTI = 1 +}; + +struct srp_login_req { + u8 opcode; + u8 reserved1[7]; + u64 tag; + __be32 req_it_iu_len; + u8 reserved2[4]; + __be16 req_buf_fmt; + u8 req_flags; + u8 reserved3[5]; + u8 initiator_port_id[16]; + u8 target_port_id[16]; +}; + +struct srp_login_rsp { + u8 opcode; + u8 reserved1[3]; + __be32 req_lim_delta; + u64 tag; + __be32 max_it_iu_len; + __be32 max_ti_iu_len; + __be16 buf_fmt; + u8 rsp_flags; + u8 reserved2[25]; +}; + +struct srp_login_rej { + u8 opcode; + u8 reserved1[3]; + __be32 reason; + u64 tag; + u8 reserved2[8]; + __be16 buf_fmt; + u8 reserved3[6]; +}; + +struct srp_i_logout { + u8 opcode; + u8 reserved[7]; + u64 tag; +}; + +struct srp_t_logout { + u8 opcode; + u8 sol_not; + u8 reserved[2]; + __be32 reason; + u64 tag; +}; + +/* + * We need the packed attribute because the SRP spec only aligns the + * 8-byte LUN field to 4 bytes. + */ +struct srp_tsk_mgmt { + u8 opcode; + u8 sol_not; + u8 reserved1[6]; + u64 tag; + u8 reserved2[4]; + __be64 lun __attribute__((packed)); + u8 reserved3[2]; + u8 tsk_mgmt_func; + u8 reserved4; + u64 task_tag; + u8 reserved5[8]; +}; + +/* + * We need the packed attribute because the SRP spec only aligns the + * 8-byte LUN field to 4 bytes. 
+ */ +struct srp_cmd { + u8 opcode; + u8 sol_not; + u8 reserved1[3]; + u8 buf_fmt; + u8 data_out_desc_cnt; + u8 data_in_desc_cnt; + u64 tag; + u8 reserved2[4]; + __be64 lun __attribute__((packed)); + u8 reserved3; + u8 task_attr; + u8 reserved4; + u8 add_cdb_len; + u8 cdb[16]; + u8 add_data[0]; +}; + +enum { + SRP_RSP_FLAG_RSPVALID = 1 << 0, + SRP_RSP_FLAG_SNSVALID = 1 << 1, + SRP_RSP_FLAG_DOOVER = 1 << 2, + SRP_RSP_FLAG_DOUNDER = 1 << 3, + SRP_RSP_FLAG_DIOVER = 1 << 4, + SRP_RSP_FLAG_DIUNDER = 1 << 5 +}; + +struct srp_rsp { + u8 opcode; + u8 sol_not; + u8 reserved1[2]; + __be32 req_lim_delta; + u64 tag; + u8 reserved2[2]; + u8 flags; + u8 status; + __be32 data_out_res_cnt; + __be32 data_in_res_cnt; + __be32 sense_data_len; + __be32 resp_data_len; + u8 data[0]; +}; + +#endif /* IB_SRP_H */ From higley at dbresearch.net Mon Oct 31 09:24:11 2005 From: higley at dbresearch.net (Jay Higley) Date: Mon, 31 Oct 2005 11:24:11 -0600 Subject: [openib-general] ip_dev_find symbol missing In-Reply-To: <20051031155634.3934.qmail@mail.jjyhd.com> References: <20051031155634.3934.qmail@mail.jjyhd.com> Message-ID: <4366533B.3060101@dbresearch.net> I compiled the openIB stack with the 2.6.14 kernel and three modules (ib_sdp, ib_at, and ib_addr) will not load due to missing symbol "ip_dev_find". I see the source for this routine in kernel/net, but apparently it didn't get compiled into my kernel. Does anyone know what configure options need to be set to enable this? Also, are there some missing dependency checks in the infiband stack that allowed the modules to be compiled without this symbol being present? -Jay Higley From jsquyres at open-mpi.org Mon Oct 31 09:28:51 2005 From: jsquyres at open-mpi.org (Jeff Squyres) Date: Mon, 31 Oct 2005 12:28:51 -0500 Subject: [openib-general] Question about locked pages Message-ID: Greetings. 
I'm writing up some FAQ entries for Open MPI and I'm adding a question about "ulimit -l" for OpenIB (i.e., how users may wish to increase their locked pages limit). However, it's unclear to me exactly what needs to happen -- do users both need to "ulimit -l unlimited" (or some large number) *and* set /etc/sysctl.conf values for kernel.shmall and kernel.shmmax to unlimited (or a large number)? Or does performing one of those obviate the need for the other? Here's my preliminary FAQ entry about this -- comments and suggestions would be welcome: http://www.open-mpi.org/faq/?category=infiniband#ib-locked-pages If someone could provide me with details (or point me to the relevant docs), I'd greatly appreciate it. Specifically, I'd rather have Correct information -- or HREF out to Correct information -- rather than include hearsay and 3rd party "this worked for me" information (which is what I have right now ;-) ). Many thanks. -- {+} Jeff Squyres {+} The Open MPI Project {+} http://www.open-mpi.org/ From johann at pathscale.com Mon Oct 31 09:29:13 2005 From: johann at pathscale.com (Johann George) Date: Mon, 31 Oct 2005 09:29:13 -0800 Subject: [openib-general] ip_dev_find symbol missing In-Reply-To: <4366533B.3060101@dbresearch.net> References: <20051031155634.3934.qmail@mail.jjyhd.com> <4366533B.3060101@dbresearch.net> Message-ID: <20051031172913.GA12728@cuprite.internal.keyresearch.com> > I compiled the openIB stack with the 2.6.14 kernel and three modules > (ib_sdp, ib_at, and ib_addr) will not load due to missing symbol > "ip_dev_find". I cannot get the current version (3915) of the OpenIB stack to compile. It gives me: drivers/infiniband/core/cm.c:2836: error: structure has no member named `send_buf' Or did I forget to apply some patches? 
Johann From mshefty at ichips.intel.com Mon Oct 31 09:30:43 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 31 Oct 2005 09:30:43 -0800 Subject: [openib-general] Re: Questions about libibat, ib_uat, and ib_a In-Reply-To: References: Message-ID: <436654C3.8060507@ichips.intel.com> Kevin Reilly wrote: > Thanks Sean, > I think the rdma_resolve_addr() does what we want. Translate a local IP > to a ib_device structure that i can use in the ibverbs. > What we want to do is pretty simple and we won't need to create a > connection. Based on your description, I think that rdma_bind_addr() may work better. The bind call works synchronously, whereas, resolve is an asynchronous operation. The difference is that bind translates a local address only, and resolve translates remote and an optional local address. > Can we have a discussion on the timeframe for this? These are already working in the kernel. A userspace implementation should be available in 1-2 weeks. I finished coding the necessary userspace support kernel module on Friday of last week, and am now starting on the userspace library. - Sean From rolandd at cisco.com Mon Oct 31 09:39:37 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 09:39:37 -0800 Subject: [openib-general] Question about locked pages In-Reply-To: (Jeff Squyres's message of "Mon, 31 Oct 2005 12:28:51 -0500") References: Message-ID: <52k6ftk2li.fsf@cisco.com> Jeff> However, it's unclear to me exactly what needs to happen -- Jeff> do users both need to "ulimit -l unlimited" (or some large Jeff> number) *and* set /etc/sysctl.conf values for kernel.shmall Jeff> and kernel.shmmax to unlimited (or a large number)? Or does Jeff> performing one of those obviate the need for the other? I believe the changing the ulimit for locked pages is all that is needed. Does changing shmall and shmmax have any effect? 
I thought those were limits on the total amount of shared memory allowed for the system, not limits on locked/pinned memory. - R. From mshefty at ichips.intel.com Mon Oct 31 09:42:28 2005 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 31 Oct 2005 09:42:28 -0800 Subject: [openib-general] ip_dev_find symbol missing In-Reply-To: <4366533B.3060101@dbresearch.net> References: <20051031155634.3934.qmail@mail.jjyhd.com> <4366533B.3060101@dbresearch.net> Message-ID: <43665784.40804@ichips.intel.com> Jay Higley wrote: > I compiled the openIB stack with the 2.6.14 kernel and three modules > (ib_sdp, ib_at, and ib_addr) will not load due to missing symbol > "ip_dev_find". I see the source for this routine in kernel/net, but > apparently it didn't get compiled into my kernel. Does anyone know what > configure options need to be set to enable this? Also, are there some > missing dependency checks in the infiband stack that allowed the modules > to be compiled without this symbol being present? The export of this symbol was removed in 2.6.14. You'll need to add: EXPORT_SYMBOL(ip_dev_find); to net/ipv4/fib_frontend.c to compile these modules now. - Sean From rolandd at cisco.com Mon Oct 31 09:48:47 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 09:48:47 -0800 Subject: [openib-general] ip_dev_find symbol missing In-Reply-To: <43665784.40804@ichips.intel.com> (Sean Hefty's message of "Mon, 31 Oct 2005 09:42:28 -0800") References: <20051031155634.3934.qmail@mail.jjyhd.com> <4366533B.3060101@dbresearch.net> <43665784.40804@ichips.intel.com> Message-ID: <52fyqhk268.fsf@cisco.com> Sean> The export of this symbol was removed in 2.6.14. You'll Sean> need to add: Sean> EXPORT_SYMBOL(ip_dev_find); Sean> to net/ipv4/fib_frontend.c to compile these modules now. ...and BTW a patch to do this is in the svn tree at linux-kernel/patches/linux-2.6.14-fib-frontend.diff - R. 
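A note on the locked-pages question in this thread: the limit that "ulimit -l" reports is the RLIMIT_MEMLOCK resource limit, which a process can also read with getrlimit(). What follows is only a minimal user-space sketch, not anything taken from Open MPI or the OpenIB tree; the helper names query_memlock_limit, raise_memlock_to_hard and print_memlock_limit are made up for illustration. It reads the limit and raises the soft limit up to the hard limit, which is as far as an unprivileged process can go.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Hypothetical helper: read the calling process's locked-memory limit.
 * This is the same limit the shell reports with "ulimit -l".
 * Returns 0 on success, -1 on failure (errno set by getrlimit).
 */
int query_memlock_limit(struct rlimit *rl)
{
	return getrlimit(RLIMIT_MEMLOCK, rl);
}

/*
 * Hypothetical helper: raise the soft limit to the current hard limit.
 * An unprivileged process may do this, but it may not raise the hard
 * limit itself.
 */
int raise_memlock_to_hard(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;

	rl.rlim_cur = rl.rlim_max;
	return setrlimit(RLIMIT_MEMLOCK, &rl);
}

/* Hypothetical helper: print one limit value, in bytes or "unlimited". */
void print_memlock_limit(const char *label, rlim_t value)
{
	if (value == RLIM_INFINITY)
		printf("%s: unlimited\n", label);
	else
		printf("%s: %llu bytes\n", label, (unsigned long long) value);
}
```

A caller would invoke query_memlock_limit() and pass rlim_cur and rlim_max to print_memlock_limit(). Note that getrlimit() works in bytes, whereas bash's "ulimit -l" works in kilobytes. Raising the hard limit still needs privilege or system configuration, which is the part of the question this sketch does not answer.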
From sean.hefty at intel.com Mon Oct 31 09:59:27 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 31 Oct 2005 09:59:27 -0800 Subject: [openib-general] RE: 2.6.14 patches In-Reply-To: <20051030123622.GD4769@mellanox.co.il> Message-ID: >Sean, Hal, now that 2.6.14 is out, do you plan to apply >the patches in https://openib.org/svn/gen2/trunk/src/linux-kernel/patches/? >Once you do, I'll put reverted patches in the backport directory. I'll apply the patch to addr.c shortly. Thanks for the reminder. - Sean From bohra at cs.rutgers.edu Mon Oct 31 10:21:23 2005 From: bohra at cs.rutgers.edu (Aniruddha Bohra) Date: Mon, 31 Oct 2005 13:21:23 -0500 Subject: [openib-general] [PATCH] uDAPL : Fix debug printfs Message-ID: <436660A3.4060607@cs.rutgers.edu> Fix printing of debug statements. Signed off by : Aniruddha Bohra Index: common/dapl_ep_post_rdma_write.c =================================================================== --- common/dapl_ep_post_rdma_write.c (revision 3892) +++ common/dapl_ep_post_rdma_write.c (working copy) @@ -78,7 +78,7 @@ DAT_RETURN dat_status; dapl_dbg_log (DAPL_DBG_TYPE_API, - "dapl_ep_post_rdma_write (%p, %d, %p, %P, %p, %x)\n", + "dapl_ep_post_rdma_write (%p, %d, %p, %p, %p, %x)\n", ep_handle, num_segments, local_iov, Index: common/dapl_ep_post_send.c =================================================================== --- common/dapl_ep_post_send.c (revision 3892) +++ common/dapl_ep_post_send.c (working copy) @@ -75,7 +75,7 @@ DAT_RETURN dat_status; dapl_dbg_log (DAPL_DBG_TYPE_API, - "dapl_ep_post_send (%p, %d, %p, %P, %x)\n", + "dapl_ep_post_send (%p, %d, %p, %p, %x)\n", ep_handle, num_segments, local_iov, Index: common/dapl_srq_post_recv.c =================================================================== --- common/dapl_srq_post_recv.c (revision 3892) +++ common/dapl_srq_post_recv.c (working copy) @@ -79,7 +79,7 @@ DAT_RETURN dat_status; dapl_dbg_log (DAPL_DBG_TYPE_API, - "dapl_srq_post_recv (%p, %d, %p, %P)\n", + "dapl_srq_post_recv 
(%p, %d, %p, %p)\n", srq_handle, num_segments, local_iov, Index: common/dapl_ep_post_recv.c =================================================================== --- common/dapl_ep_post_recv.c (revision 3892) +++ common/dapl_ep_post_recv.c (working copy) @@ -79,7 +79,7 @@ DAT_RETURN dat_status; dapl_dbg_log (DAPL_DBG_TYPE_API, - "dapl_ep_post_recv (%p, %d, %p, %P, %x)\n", + "dapl_ep_post_recv (%p, %d, %p, %p, %x)\n", ep_handle, num_segments, local_iov, From jsquyres at open-mpi.org Mon Oct 31 10:25:20 2005 From: jsquyres at open-mpi.org (Jeff Squyres) Date: Mon, 31 Oct 2005 13:25:20 -0500 Subject: [openib-general] Question about locked pages In-Reply-To: <52k6ftk2li.fsf@cisco.com> References: <52k6ftk2li.fsf@cisco.com> Message-ID: <08b93b5d18f11dc475a28b6a3966e403@open-mpi.org> On Oct 31, 2005, at 12:39 PM, Roland Dreier wrote: > Jeff> However, it's unclear to me exactly what needs to happen -- > Jeff> do users both need to "ulimit -l unlimited" (or some large > Jeff> number) *and* set /etc/sysctl.conf values for kernel.shmall > Jeff> and kernel.shmmax to unlimited (or a large number)? Or does > Jeff> performing one of those obviate the need for the other? > > I believe the changing the ulimit for locked pages is all that is > needed. Does changing shmall and shmmax have any effect? I thought > those were limits on the total amount of shared memory allowed for the > system, not limits on locked/pinned memory. Ditto (I thought those were shmem values / didn't think they had any effect on Open IB). The information that I got was third-hand, which is why I posted here to ask about it. :-) I'll remove them from the FAQ entry -- any other comments? 
-- {+} Jeff Squyres {+} The Open MPI Project {+} http://www.open-mpi.org/ From jlentini at netapp.com Mon Oct 31 10:28:57 2005 From: jlentini at netapp.com (James Lentini) Date: Mon, 31 Oct 2005 13:28:57 -0500 (EST) Subject: [openib-general] Re: [PATCH] uDAPL : Fix debug printfs In-Reply-To: <436660A3.4060607@cs.rutgers.edu> References: <436660A3.4060607@cs.rutgers.edu> Message-ID: On Mon, 31 Oct 2005, Aniruddha Bohra wrote: > Fix printing of debug statements. Committed in revision 3917. Your patch didn't apply cleanly. Tabs were replaced by spaces. I fixed it up by hand. If you see an error, let me know. Thanks, james From shubbell at dbresearch.net Mon Oct 31 10:29:28 2005 From: shubbell at dbresearch.net (Sean Hubbell) Date: Mon, 31 Oct 2005 12:29:28 -0600 Subject: [openib-general] ip_dev_find symbol missing In-Reply-To: <52fyqhk268.fsf@cisco.com> References: <20051031155634.3934.qmail@mail.jjyhd.com> <4366533B.3060101@dbresearch.net> <43665784.40804@ichips.intel.com> <52fyqhk268.fsf@cisco.com> Message-ID: <43666288.6050700@dbresearch.net> Roland Dreier wrote: > Sean> The export of this symbol was removed in 2.6.14. You'll > Sean> need to add: > > Sean> EXPORT_SYMBOL(ip_dev_find); > > Sean> to net/ipv4/fib_frontend.c to compile these modules now. > >...and BTW a patch to do this is in the svn tree at > > linux-kernel/patches/linux-2.6.14-fib-frontend.diff > > - R. > > Are there any other patches that should be applied when updating to kernel 2.6.14? Sean -- Sean Hubbell Senior Software Engineer deciBel Research, Inc.
(256) 426-8957 From nacc at us.ibm.com Mon Oct 31 10:49:24 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 31 Oct 2005 10:49:24 -0800 Subject: [openib-general] ppc64 compilation failure Message-ID: <20051031184924.GD6246@us.ibm.com> Hi Roland, Looks like ppc64 build with 2.6.14-git3 and svn 3918 is busted: drivers/infiniband/core/uat.c: In function `ib_uat_init': drivers/infiniband/core/uat.c:837: warning: passing arg 2 of `class_device_create' makes pointer from integer without a cast drivers/infiniband/core/uat.c:837: warning: passing arg 3 of `class_device_create' makes integer from pointer without a cast drivers/infiniband/core/uat.c:837: warning: passing arg 4 of `class_device_create' from incompatible pointer type drivers/infiniband/core/uat.c:837: error: too few arguments to function `class_device_create' drivers/infiniband/core/uverbs_main.c: In function `ib_uverbs_add_one': drivers/infiniband/core/uverbs_main.c:759: warning: passing arg 2 of `class_device_create' makes pointer from integer without a cast drivers/infiniband/core/uverbs_main.c:759: warning: passing arg 3 of `class_device_create' makes integer from pointer without a cast drivers/infiniband/core/uverbs_main.c:759: warning: passing arg 4 of `class_device_create' from incompatible pointer type drivers/infiniband/core/uverbs_main.c:759: warning: passing arg 5 of `class_device_create' makes pointer from integer without a cast drivers/infiniband/core/user_mad.c: In function `ib_umad_init_port': drivers/infiniband/core/user_mad.c:760: warning: passing arg 2 of `class_device_create' makes pointer from integer without a cast drivers/infiniband/core/user_mad.c:760: warning: passing arg 3 of `class_device_create' makes integer from pointer without a cast drivers/infiniband/core/user_mad.c:760: warning: passing arg 4 of `class_device_create' from incompatible pointer type drivers/infiniband/core/user_mad.c:760: warning: passing arg 5 of `class_device_create' makes pointer from integer 
without a cast drivers/infiniband/core/user_mad.c:780: warning: passing arg 2 of `class_device_create' makes pointer from integer without a cast drivers/infiniband/core/user_mad.c:780: warning: passing arg 3 of `class_device_create' makes integer from pointer without a cast drivers/infiniband/core/user_mad.c:780: warning: passing arg 4 of `class_device_create' from incompatible pointer type drivers/infiniband/core/user_mad.c:780: warning: passing arg 5 of `class_device_create' makes pointer from integer without a cast First one causes the build to fail obviously (error). Thanks, Nish From nacc at us.ibm.com Mon Oct 31 11:03:40 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Mon, 31 Oct 2005 11:03:40 -0800 Subject: [openib-general] ppc64 compilation failure In-Reply-To: <20051031184924.GD6246@us.ibm.com> References: <20051031184924.GD6246@us.ibm.com> Message-ID: <20051031190340.GE6246@us.ibm.com> On 31.10.2005 [10:49:24 -0800], Nishanth Aravamudan wrote: > Hi Roland, > > Looks like ppc64 build with 2.6.14-git3 and svn 3918 is busted: Only the ppc64 build had finished when I sent this mail, but the same happens on x86, with an additional: drivers/infiniband/ulp/iser/iser_mod.c:59: warning: large integer implicitly truncated to unsigned type Thanks, Nish From mst at mellanox.co.il Mon Oct 31 11:36:44 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 31 Oct 2005 21:36:44 +0200 Subject: [openib-general] Re: libmthca problem: max_inline_size In-Reply-To: <52ek61lip5.fsf@cisco.com> References: <52ek61lip5.fsf@cisco.com> Message-ID: <20051031193644.GA708@mellanox.co.il> Quoting Roland Dreier : > Subject: Re: libmthca problem: max_inline_size > > > 2. Return the actual QP capability in create qp command. > > This is an ABI change, although the library can be made to work in a > > backward compatible way. > > 3. 
Add a command (device specific) to query the max descriptor size supported > > by the HCA (returned by query dev lim) and calculate max_inline_size > > based on that. > > Again, this is an ABI change. > > > I am inclining towards the second option (2.) since this way the > > resulting capability calculations will be all in one place in kernel. > > I think we need a combination of 2. and 3. because the WQE shifts and > buffers from userspace need to match up with the kernel. It's slightly unclear to me what is meant by "the combination of 2 and 3". With 2., the kernel returns back the actual capabilities supported, so we don't need to know the max descriptor size in userspace, since kernel will do the checks and return the actual qp capabilities back to us. Maybe we'll see as we work on the implementation. > For 3. there's no need for a completely new command. We could return > extra device-dependent values from the GET_CONTEXT command, or even > just add some sysfs attributes to the mthca device (similar to the > fw_ver attribute). > > > There's a similar problem in mthca_arbel_post_send where the > > inline data size is checked against the expression > > > int max_size = (1 << qp->sq.wqe_shift) - sizeof *seg - size * 16; > > > I would imagine the way to fix that is to add a max_inline_size field > > to the mthca_qp structure and use that instead of 1 << qp->sq.wqe_shift. > > Yes, that makes sense. Okay ... do you want me and Jack to prepare such a patch, or would you rather do it yourself? -- MST From mst at mellanox.co.il Mon Oct 31 11:51:01 2005 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 31 Oct 2005 21:51:01 +0200 Subject: [openib-general] Re: ip_dev_find symbol missing In-Reply-To: <43666288.6050700@dbresearch.net> References: <43666288.6050700@dbresearch.net> Message-ID: <20051031195101.GC708@mellanox.co.il> Quoting r.
Sean Hubbell : > Subject: Re: ip_dev_find symbol missing > > Roland Dreier wrote: > > > Sean> The export of this symbol was removed in 2.6.14. You'll > > Sean> need to add: > > > > Sean> EXPORT_SYMBOL(ip_dev_find); > > > > Sean> to net/ipv4/fib_frontend.c to compile these modules now. > > > >...and BTW a patch to do this is in the svn tree at > > > > linux-kernel/patches/linux-2.6.14-fib-frontend.diff > > > > - R. > > > Are there any other patches that should be applied when updating to > kernel 2.6.14? > > Sean Since you are using the svn trunk, please note that in 2.6.14, you need to link linux-kernel/infiniband/include/rdma to include/rdma, in addition to linking linux-kernel/infiniband to drivers/infiniband. Roland, maybe we should just remove the EXTRA_CFLAGS hacks from the makefiles; it seems that the fact that there are two ways to find a header is only creating confusion. What do you say? -- MST From rolandd at cisco.com Mon Oct 31 11:55:56 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 11:55:56 -0800 Subject: [openib-general] Re: libmthca problem: max_inline_size In-Reply-To: <20051031193644.GA708@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 31 Oct 2005 21:36:44 +0200") References: <52ek61lip5.fsf@cisco.com> <20051031193644.GA708@mellanox.co.il> Message-ID: <52vezdihpv.fsf@cisco.com> Michael> It's slightly unclear to me what is meant by "the Michael> combination of 2 and 3". With 2., the kernel returns Michael> back the actual capabilities supported, so we don't need Michael> to know the max descriptor size in userspace, since Michael> kernel will do the checks and return the actual qp Michael> capabilities back to us. Yeah, I guess you're right. If userspace allocates a QP with too-big WQEs, the kernel will just fail the request. Michael> Okay ... do you want me and Jack to prepare such a patch, Michael> or would you rather do it yourself? If you have time now, that would be great. Otherwise I'll add it to my TODO list.
- R. From rolandd at cisco.com Mon Oct 31 11:58:49 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 11:58:49 -0800 Subject: [openib-general] Question about locked pages In-Reply-To: <08b93b5d18f11dc475a28b6a3966e403@open-mpi.org> (Jeff Squyres's message of "Mon, 31 Oct 2005 13:25:20 -0500") References: <52k6ftk2li.fsf@cisco.com> <08b93b5d18f11dc475a28b6a3966e403@open-mpi.org> Message-ID: <52r7a1ihl2.fsf@cisco.com> Jeff> Ditto (I thought those were shmem values / didn't think they Jeff> had any effect on Open IB). The information that I got was Jeff> third-hand, which is why I posted here to ask about it. :-) Jeff> I'll remove them from the FAQ entry -- any other comments? Well, a normal user can't use "ulimit -l" to increase their limit on locked memory. However, I've never really looked into what the cleanest way to increase the limit is. /etc/security/limits.conf is part of the answer, but ssh+privilege separation can cause that to break as well. - R. From swise at opengridcomputing.com Mon Oct 31 12:35:01 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 31 Oct 2005 14:35:01 -0600 Subject: [openib-general] possible CMA bug Message-ID: <013701c5de5a$9124ad10$d5000a0a@STEVO> Hi, I'm using the new rdma cma interface and I've perhaps stumbled onto a bug. I'm trying to bind to port 9999 on both IB ports of a mthca device. The IPoIB interfaces for the HCA are configured as two separate subnets. The second rdma_listen() always fails with EBUSY. Maybe this is a limitation in the CMA design, but TCP stacks allow binding to the same port on different ip addresses. And the CMA interface allows it too as long as the two ip addresses map to different IB devices. Whether this should work or not, I am seeing a crash when I try to destroy the cm_id after the rdma_listen() failure. Here is a log of the event (printks from my krping module in branches/iwarp/utils/src/linux-kernel/infiniband/krping).
It seems as though the cm_id is being destroyed twice, but I don't think the krping module is doing it... krping: proc write |verbose,server,addr=192.168.80.154,port=9999| krping: verbose krping: server krping: ipaddr (192.168.80.154), nbo 0x(9a50a8c0) krping: port hbo 0x270f nbo 0xf27 krping: created cm_id ffff81003f376800 krping: rdma_bind_addr worked krping: created pd ffff81003feac600 krping: created cq ffff81007a1df080 krping: create listener krping: rdma_listen error -16 krping: listen error -16 krping: destroying cq ffff81007a1df080 krping: dealloc pd ffff81003feac600 krping: destroy cm_id ffff81003f376800 idr_remove called for id=2 which is not allocated. Call Trace:{idr_remove+244} {:ib_cm:ib_destroy_cm_id+408} {printk+141} {:rdma_cm:cma_exch+70} {:rdma_cm:rdma_destroy_id+57} {:ib_mthca:mthca_free+44} {:ib_mthca:mthca_free_mr+213} {:ib_krping:krping_write_proc+6657} {__d_lookup+297} {dput+54} {__follow_mount+52} {do_lookup+100} {proc_file_write+39} {vfs_write+233} {sys_write+83} {system_call+126} From swise at opengridcomputing.com Mon Oct 31 12:39:57 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 31 Oct 2005 14:39:57 -0600 Subject: [openib-general] possible CMA bug References: <013701c5de5a$9124ad10$d5000a0a@STEVO> Message-ID: <014401c5de5b$47922f00$d5000a0a@STEVO> I've traced it down to cma_ib_listen(). It destroys the cm_id if the listen fails. It probably shouldn't, correct? IE the cm_id is owned by the ULP who called rdma_create_id() and should be destroyed by that ULP... Steve. --- snipit from cma_ib_listen() --- ret = ib_cm_listen(id_priv->cm_id, svc_id, 0); if (ret) ib_destroy_cm_id(id_priv->cm_id);
From halr at voltaire.com Mon Oct 31 12:42:27 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Oct 2005 15:42:27 -0500 Subject: [openib-general] Re: [PATCH] Opensm - fix lmc algorithm - new In-Reply-To: <5zek61yf9s.fsf@mtl066.yok.mtl.com> References: <5zek61yf9s.fsf@mtl066.yok.mtl.com> Message-ID: <1130791346.15904.281.camel@hal.voltaire.com> On Mon, 2005-10-31 at 08:42, Yael Kalka wrote: > Hi Hal, > > Since you haven't applied this fix yet - please take this new one. > There was a wrong CL_ASSERT in my original patch. > I'm also adding my explanation from previous mail regarding the patch: > We noticed a problem in the lmc assignment algorithm. > In the current code - when trying to run opensm with lmc > 0, the > opensm goes into infinite loop. > Debugging the problem we noticed that there is a problem with the > lid assignment, and we changed the algorithm. The change is in the > osm_lid_mgr_init_sweep function. > We have done some testing to the new code, and it seems that the lmc > assignment is ok with the fix. Thanks. Applied.
-- Hal From halr at voltaire.com Mon Oct 31 12:45:44 2005 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Oct 2005 15:45:44 -0500 Subject: [openib-general] Re: [PATCH] Opensm - race in opensm signalling In-Reply-To: <5zd5llyewo.fsf@mtl066.yok.mtl.com> References: <5zd5llyewo.fsf@mtl066.yok.mtl.com> Message-ID: <1130791411.15904.289.camel@hal.voltaire.com> On Mon, 2005-10-31 at 08:49, Yael Kalka wrote: > Hi Hal, > > During our Windows testing we've encountered a case where for some > reason the opensm changes the state of its port to down, and then > brings it back up. > After debugging it, we found out that the reason for that is a > possible race when signaling "OSM_SIGNAL_NO_PENDING_TRANSACTIONS" to > the osm_state_mgr_process. > The qp0_mads_outstanding is decremented, and only later is it checked whether it > reaches zero. So if 2 threads decrement the qp0_mads_outstanding, and > they are running simultaneously, they can both signal > OSM_SIGNAL_NO_PENDING_TRANSACTIONS! > This, of course, results in a big mess in the osm_state_mgr_process > flow. > The following patch fixes this issue. I did see this at staging in the Linux version of this too: Oct 29 18:19:36 894556 [B6F63BB0] -> __osm_state_mgr_signal_error: ERR 3303: Invalid signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS(3) in state OSM_SM_STATE_IDLE. Thanks. Applied. -- Hal From swise at opengridcomputing.com Mon Oct 31 12:54:22 2005 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 31 Oct 2005 14:54:22 -0600 Subject: [openib-general] [PATCH] fix for a bug in cma_ib_listen() References: <013701c5de5a$9124ad10$d5000a0a@STEVO> <014401c5de5b$47922f00$d5000a0a@STEVO> Message-ID: <017c01c5de5d$456a0cf0$d5000a0a@STEVO> Fix for bug in cma_ib_listen(). Set cm_id to NULL after destroying the listen ib_cm_id so rdma_destroy_id() doesn't try to destroy it again later. This fixes the crash I'm seeing when I destroy a cma_id after rdma_listen() fails...
Signed-off-by: Steve Wise Index: cma.c =================================================================== --- cma.c (revision 3860) +++ cma.c (working copy) @@ -713,8 +713,10 @@ svc_id = cma_get_service_id(&id_priv->id.route.addr.src_addr); ret = ib_cm_listen(id_priv->cm_id, svc_id, 0); - if (ret) + if (ret) { ib_destroy_cm_id(id_priv->cm_id); + id_priv->cm_id = NULL; + } return ret; } From sean.hefty at intel.com Mon Oct 31 12:57:44 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 31 Oct 2005 12:57:44 -0800 Subject: [openib-general] possible CMA bug In-Reply-To: <013701c5de5a$9124ad10$d5000a0a@STEVO> Message-ID: >I'm using the new rdma cma interface and I've perhaps stumbled onto a bug. I'm >trying to bind to port 9999 on both IB ports of a >mthca device. The IPoIB interfaces for the HCA are configured as two separate >subnets. The second rdma_listen() always fails with How are you binding the address to the rdma_cm_id? Are you binding based on the port number only, or binding to a port and IP address? >EBUSY. Maybe this is a limitation in the CMA design, but TCP stacks allow >binding to the same port on different ip addresses. And What's happening is that both listen requests are being mapped to the same service ID. The second listen request then fails. The IB CM maps listens to a device, and not a port on that device. Fixing this will require adding some additional demultiplexing code to the CMA. >the CMA interface allows it too as long as the two ip addresses map to >different IB devices. Whether this should work or not, I am >seeing a crash when I try to destroy the cm_id after the rdma_listen() failure. The crash shouldn't be happening. cma_ib_listen() should clear the cm_id pointer after destroying it to prevent it from being destroyed a second time. I'll get a patch for this shortly.
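For readers following the double-destroy discussion in this thread, here is a minimal userspace sketch of the "clear the pointer after the error-path destroy" idea. Everything in it is a hypothetical stand-in: `fake_cm_id`, `fake_destroy_cm_id()`, `listener`, `listen_fail()` and `listener_destroy()` are invented for illustration and are not the ib_cm or rdma_cm API. It only assumes that the later teardown skips a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ib_cm_id; not the kernel type. */
struct fake_cm_id {
	int id;
};

/* Counts destroy calls so the sketch can detect a double destroy. */
static int destroy_calls;

static void fake_destroy_cm_id(struct fake_cm_id *cm_id)
{
	destroy_calls++;
	free(cm_id);
}

/* Hypothetical stand-in for the private id that owns a cm_id. */
struct listener {
	struct fake_cm_id *cm_id;
};

/* Error path of a failed listen: destroy the cm_id and clear the pointer. */
static int listen_fail(struct listener *l, int err)
{
	if (err) {
		fake_destroy_cm_id(l->cm_id);
		l->cm_id = NULL;	/* so teardown does not destroy it again */
	}
	return err;
}

/* Teardown: only destroy a cm_id that is still owned. */
static void listener_destroy(struct listener *l)
{
	if (l->cm_id) {
		fake_destroy_cm_id(l->cm_id);
		l->cm_id = NULL;
	}
}

/* Runs the failing-listen scenario and returns how many destroys happened. */
static int run_failed_listen_scenario(void)
{
	struct listener l;

	destroy_calls = 0;
	l.cm_id = malloc(sizeof(*l.cm_id));
	if (!l.cm_id)
		return -1;
	l.cm_id->id = 2;

	listen_fail(&l, -16);	/* -16 is -EBUSY, the error in the krping log */
	listener_destroy(&l);
	return destroy_calls;
}
```

Without the `l->cm_id = NULL` line in `listen_fail()`, the teardown would free the same object a second time, which is the userspace analogue of the idr_remove complaint in the trace.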
- Sean From sean.hefty at intel.com Mon Oct 31 13:04:09 2005 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 31 Oct 2005 13:04:09 -0800 Subject: [openib-general] [PATCH] fix for a bug in cma_ib_listen() In-Reply-To: <017c01c5de5d$456a0cf0$d5000a0a@STEVO> Message-ID: >Fix for bug in cma_ib_listen(). > >Set cm_id to NULL after destroying the listen ib_cm_id so rdma_destory_id() >doesn't try and destroy it again later. Thanks - applied. - Sean From rolandd at cisco.com Mon Oct 31 14:34:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 22:34:42 +0000 Subject: [openib-general] [git patch review 1/5] [IB] mthca: report asynchronous CQ events Message-ID: <1130798082548-646b24d6f405c5f5@cisco.com> Implement reporting asynchronous CQ events in Mellanox HCA driver. Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_cq.c | 31 ++++++++++++++++++++++++++++++- drivers/infiniband/hw/mthca/mthca_dev.h | 4 +++- drivers/infiniband/hw/mthca/mthca_eq.c | 4 +++- 3 files changed, 36 insertions(+), 3 deletions(-) applies-to: d918cd1ba0ef9afa692cef281afee2f6d6634a1e affcd50546d4788b7849e2b2e2ec7bc50d64c5f8 diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 8600b6c..f98e235 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -208,7 +208,7 @@ static inline void update_cons_index(str } } -void mthca_cq_event(struct mthca_dev *dev, u32 cqn) +void mthca_cq_completion(struct mthca_dev *dev, u32 cqn) { struct mthca_cq *cq; @@ -224,6 +224,35 @@ void mthca_cq_event(struct mthca_dev *de cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); } +void mthca_cq_event(struct mthca_dev *dev, u32 cqn, + enum ib_event_type event_type) +{ + struct mthca_cq *cq; + struct ib_event event; + + spin_lock(&dev->cq_table.lock); + + cq = mthca_array_get(&dev->cq_table.cq, cqn & (dev->limits.num_cqs - 1)); + + if (cq) + 
atomic_inc(&cq->refcount); + spin_unlock(&dev->cq_table.lock); + + if (!cq) { + mthca_warn(dev, "Async event for bogus CQ %08x\n", cqn); + return; + } + + event.device = &dev->ib_dev; + event.event = event_type; + event.element.cq = &cq->ibcq; + if (cq->ibcq.event_handler) + cq->ibcq.event_handler(&event, cq->ibcq.cq_context); + + if (atomic_dec_and_test(&cq->refcount)) + wake_up(&cq->wait); +} + void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn, struct mthca_srq *srq) { diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index 7e68bd4..e7e5d3b 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -460,7 +460,9 @@ int mthca_init_cq(struct mthca_dev *dev, struct mthca_cq *cq); void mthca_free_cq(struct mthca_dev *dev, struct mthca_cq *cq); -void mthca_cq_event(struct mthca_dev *dev, u32 cqn); +void mthca_cq_completion(struct mthca_dev *dev, u32 cqn); +void mthca_cq_event(struct mthca_dev *dev, u32 cqn, + enum ib_event_type event_type); void mthca_cq_clean(struct mthca_dev *dev, u32 cqn, u32 qpn, struct mthca_srq *srq); diff --git a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c index e5a047a..34d68e5 100644 --- a/drivers/infiniband/hw/mthca/mthca_eq.c +++ b/drivers/infiniband/hw/mthca/mthca_eq.c @@ -292,7 +292,7 @@ static int mthca_eq_int(struct mthca_dev case MTHCA_EVENT_TYPE_COMP: disarm_cqn = be32_to_cpu(eqe->event.comp.cqn) & 0xffffff; disarm_cq(dev, eq->eqn, disarm_cqn); - mthca_cq_event(dev, disarm_cqn); + mthca_cq_completion(dev, disarm_cqn); break; case MTHCA_EVENT_TYPE_PATH_MIG: @@ -364,6 +364,8 @@ static int mthca_eq_int(struct mthca_dev eqe->event.cq_err.syndrome == 1 ? 
"overrun" : "access violation", be32_to_cpu(eqe->event.cq_err.cqn) & 0xffffff); + mthca_cq_event(dev, be32_to_cpu(eqe->event.cq_err.cqn), + IB_EVENT_CQ_ERR); break; case MTHCA_EVENT_TYPE_EQ_OVERFLOW: --- 0.99.9 From rolandd at cisco.com Mon Oct 31 14:34:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 22:34:42 +0000 Subject: [openib-general] [git patch review 4/5] [IB] mthca: Avoid SRQ free WQE list corruption In-Reply-To: <1130798082548-f241f7f48ee0a31b@cisco.com> Message-ID: <1130798082548-8e2587fc62785f94@cisco.com> Fix wqe_to_link() to use a structure field that we know is definitely always unused for receive work requests, so that it really avoids the free list corruption bug that the comment claims it does. Signed-off-by: Roland Dreier --- drivers/infiniband/hw/mthca/mthca_srq.c | 13 +++++++------ 1 files changed, 7 insertions(+), 6 deletions(-) applies-to: ffd7eba03f29dd2932dd32ac4adc2921bde7644b e5b251a24a9cd34a7ef98e361eb94e7ab122a554 diff --git a/drivers/infiniband/hw/mthca/mthca_srq.c b/drivers/infiniband/hw/mthca/mthca_srq.c index 64f70aa..292f55b 100644 --- a/drivers/infiniband/hw/mthca/mthca_srq.c +++ b/drivers/infiniband/hw/mthca/mthca_srq.c @@ -75,15 +75,16 @@ static void *get_wqe(struct mthca_srq *s /* * Return a pointer to the location within a WQE that we're using as a - * link when the WQE is in the free list. We use an offset of 4 - * because in the Tavor case, posting a WQE may overwrite the first - * four bytes of the previous WQE. The offset avoids corrupting our - * free list if the WQE has already completed and been put on the free - * list when we post the next WQE. + * link when the WQE is in the free list. We use the imm field + * because in the Tavor case, posting a WQE may overwrite the next + * segment of the previous WQE, but a receive WQE will never touch the + * imm field. 
This avoids corrupting our free list if the previous + * WQE has already completed and been put on the free list when we + * post the next WQE. */ static inline int *wqe_to_link(void *wqe) { - return (int *) (wqe + 4); + return (int *) (wqe + offsetof(struct mthca_next_seg, imm)); } static void mthca_tavor_init_srq_context(struct mthca_dev *dev, --- 0.99.9 From rolandd at cisco.com Mon Oct 31 14:34:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 22:34:42 +0000 Subject: [openib-general] [git patch review 3/5] [IB] uverbs: Avoid NULL pointer deref on CQ async event In-Reply-To: <1130798082548-c351d7732f360685@cisco.com> Message-ID: <1130798082548-f241f7f48ee0a31b@cisco.com> Userspace CQs that have no completion event channel attached end up with their cq_context set to NULL. However, asynchronous events like "CQ overrun" can still occur on such CQs, so add a uverbs_file member to struct ib_ucq_object that we can follow to deliver these events. Signed-off-by: Roland Dreier --- drivers/infiniband/core/uverbs.h | 1 + drivers/infiniband/core/uverbs_cmd.c | 1 + drivers/infiniband/core/uverbs_main.c | 9 +++------ 3 files changed, 5 insertions(+), 6 deletions(-) applies-to: e7fbd856e7522b65d309e9dfd425541d8f45a0bd 7162a3e0db34e914a8bc5bf74bbae0b386310cf8 diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 031cdf3..ecb8301 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -113,6 +113,7 @@ struct ib_uevent_object { struct ib_ucq_object { struct ib_uobject uobject; + struct ib_uverbs_file *uverbs_file; struct list_head comp_list; struct list_head async_list; u32 comp_events_reported; diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 8c89abc..63a7415 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -602,6 +602,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uv uobj->uobject.user_handle = 
cmd.user_handle; uobj->uobject.context = file->ucontext; + uobj->uverbs_file = file; uobj->comp_events_reported = 0; uobj->async_events_reported = 0; INIT_LIST_HEAD(&uobj->comp_list); diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index 0eb38f4..e58a7b2 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -442,13 +442,10 @@ static void ib_uverbs_async_handler(stru void ib_uverbs_cq_event_handler(struct ib_event *event, void *context_ptr) { - struct ib_uverbs_event_file *ev_file = context_ptr; - struct ib_ucq_object *uobj; + struct ib_ucq_object *uobj = container_of(event->element.cq->uobject, + struct ib_ucq_object, uobject); - uobj = container_of(event->element.cq->uobject, - struct ib_ucq_object, uobject); - - ib_uverbs_async_handler(ev_file->uverbs_file, uobj->uobject.user_handle, + ib_uverbs_async_handler(uobj->uverbs_file, uobj->uobject.user_handle, event->event, &uobj->async_list, &uobj->async_events_reported); --- 0.99.9 From rolandd at cisco.com Mon Oct 31 14:34:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 22:34:42 +0000 Subject: [openib-general] [git patch review 5/5] [IPoIB] cleanups: fix comment, remove useless variables In-Reply-To: <1130798082548-8e2587fc62785f94@cisco.com> Message-ID: <1130798082548-b095f02a09987549@cisco.com> Minor cleanups: fix a misleading comment, and get rid of attr_mask variables that are only used to hold constants (just use the constants directly). 
Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 12 ++++++------ drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 4 +--- 2 files changed, 7 insertions(+), 9 deletions(-) applies-to: c29760bafd7107252389712965ad7e4ed0791a82 3bc12e75b23c0499cc2c0873a5f77494be173761 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 192fef8..0a6f578 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -486,15 +486,16 @@ int ipoib_ib_dev_stop(struct net_device { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; - int attr_mask; unsigned long begin; struct ipoib_tx_buf *tx_req; int i; - /* Kill the existing QP and allocate a new one */ + /* + * Move our QP to the error state and then reinitialize in + * when all work requests have completed or have been flushed. + */ qp_attr.qp_state = IB_QPS_ERR; - attr_mask = IB_QP_STATE; - if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE)) ipoib_warn(priv, "Failed to modify QP to ERROR state\n"); /* Wait for all sends and receives to complete */ @@ -541,8 +542,7 @@ int ipoib_ib_dev_stop(struct net_device timeout: qp_attr.qp_state = IB_QPS_RESET; - attr_mask = IB_QP_STATE; - if (ib_modify_qp(priv->qp, &qp_attr, attr_mask)) + if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE)) ipoib_warn(priv, "Failed to modify QP to RESET state\n"); /* Wait for all AHs to be reaped */ diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index b5902a7..e829e10 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -41,7 +41,6 @@ int ipoib_mcast_attach(struct net_device { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr *qp_attr; - int attr_mask; int ret; u16 pkey_index; @@ -59,8 +58,7 @@ int ipoib_mcast_attach(struct net_device /* set correct QKey for QP */ 
qp_attr->qkey = priv->qkey; - attr_mask = IB_QP_QKEY; - ret = ib_modify_qp(priv->qp, qp_attr, attr_mask); + ret = ib_modify_qp(priv->qp, qp_attr, IB_QP_QKEY); if (ret) { ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret); goto out; --- 0.99.9 From rolandd at cisco.com Mon Oct 31 14:34:42 2005 From: rolandd at cisco.com (Roland Dreier) Date: Mon, 31 Oct 2005 22:34:42 +0000 Subject: [openib-general] [git patch review 2/5] [IPoIB] use spin_trylock_irqsave() In-Reply-To: <1130798082548-646b24d6f405c5f5@cisco.com> Message-ID: <1130798082548-c351d7732f360685@cisco.com> Use spin_trylock_irqsave() in ipoib_start_xmit() instead of reinventing it out of local_irq_save(), spin_trylock() and local_irq_restore(). Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 +---- 1 files changed, 1 insertions(+), 4 deletions(-) applies-to: e4e6a0f5f2203569b6ada4c101a146c3a4f24c28 a20583a7c2e35d80b1dfc1f60c9729498838725e diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index cd4f423..273d5f4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -551,11 +551,8 @@ static int ipoib_start_xmit(struct sk_bu struct ipoib_neigh *neigh; unsigned long flags; - local_irq_save(flags); - if (!spin_trylock(&priv->tx_lock)) { - local_irq_restore(flags); + if (!spin_trylock_irqsave(&priv->tx_lock, flags)) return NETDEV_TX_LOCKED; - } /* * Check if our queue is stopped. Since we have the LLTX bit --- 0.99.9 From kingman at storagegear.com Mon Oct 31 15:45:36 2005 From: kingman at storagegear.com (John Kingman) Date: Mon, 31 Oct 2005 17:45:36 -0600 (CST) Subject: [openib-general] [PATCH] [SRP] support for it_iu length negotiation Message-ID: At SRP login, the client indicates to the target a requested maximum initiator to target IU (it_iu) length. If the target cannot handle this length, it rejects the login. 
If the target can handle the requested length, on the other hand, it will accept the login and respond with its own maximum it_iu length. Currently, ib_srp does not use the maximum it_iu length suggested by the target, and if the target rejects the value used by ib_srp, no connection will be made. This patch adds the ability for ib_srp to retry a login if the target rejects the login because the maximum it_iu value used by ib_srp is too large. The ib_srp client will reduce its requested maximum it_iu length to a minimum value and retry the login. On a successful login, the ib_srp client will set the maximum it_iu length it will use to the maximum it_iu length requested by the target, within the bounds of the minimum and maximum it_iu lengths it can support. The size of the indirect memory descriptor table built by ib_srp is established at build time, based on an internal maximum IU size. In order to accommodate the changes described above, this patch modifies ib_srp so that if the maximum it_iu length established with the target is less than the internal maximum IU size specified when ib_srp was built, only those indirect memory descriptors that will fit in the established maximum it_iu length (the partial descriptor list) will be sent to the target. Any indirect memory descriptors beyond that must be retrieved by the target via RDMA read, as described in the SRP documentation. The patch has been tested with our target.
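The length arithmetic described above can be sketched in plain C. This is an illustration under stated assumptions, not the driver code: the `sketch_` names are hypothetical, the 48-byte and 20-byte structure sizes are assumed for the example, and only the 16-byte descriptor size (the "/ 16" in the macro the patch replaces) and the 256 and 1024 limits come from the patch itself.

```c
#include <assert.h>

/*
 * Assumed sizes for this sketch only. 16 bytes per direct descriptor
 * matches the "/ 16" in the macro the patch replaces; the other two
 * sizes are illustrative and are not taken from ib_srp.h.
 */
enum {
	SKETCH_CMD_LEN      = 48,	/* assumed sizeof(struct srp_cmd) */
	SKETCH_INDIRECT_LEN = 20,	/* assumed sizeof(struct srp_indirect_buf) */
	SKETCH_DIRECT_LEN   = 16,	/* one direct descriptor */

	SKETCH_MIN_IU_LEN = SKETCH_CMD_LEN + SKETCH_INDIRECT_LEN,
	SKETCH_MAX_IU_LEN = 1024,	/* SRP_MAX_IU_LEN in the patch */
	SKETCH_REQ_IU_LEN = 256		/* SRP_REQ_IU_LEN in the patch */
};

/* Mirrors SRP_NUM_PARTIALS(): descriptors that fit after the fixed part. */
static int sketch_num_partials(int iu_len)
{
	return (iu_len - SKETCH_MIN_IU_LEN) / SKETCH_DIRECT_LEN;
}

/*
 * Clamp the target's max_it_iu_len the way the login-response hunk does.
 * Returns the length to use, or -1 if the target's value is below the
 * minimum this initiator can work with.
 */
static int sketch_agree_it_iu_len(int target_max)
{
	if (target_max > SKETCH_MAX_IU_LEN)
		return SKETCH_MAX_IU_LEN;
	if (target_max < SKETCH_MIN_IU_LEN)
		return -1;
	return target_max;
}

/* Descriptor count reported in the IU: capped by the partial list size. */
static int sketch_inline_descs(int n, int max_partial_desc)
{
	return n < max_partial_desc ? n : max_partial_desc;
}
```

With these assumed sizes, a 256-byte IU leaves room for 11 descriptors in the partial list and a 1024-byte IU for 59; a command that maps 20 scatter/gather entries over a 256-byte IU would report 11 inline and leave the rest for the target to fetch.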
Signed-off-by: John Kingman Index: ib_srp.h =================================================================== --- ib_srp.h (revision 3914) +++ ib_srp.h (working copy) @@ -53,8 +53,10 @@ enum { SRP_PORT_REDIRECT = 1, SRP_DLID_REDIRECT = 2, + SRP_LOGIN_RETRY = 3, - SRP_MAX_IU_LEN = 256, + SRP_MAX_IU_LEN = 1024, /* our maximum it_iu size */ + SRP_REQ_IU_LEN = 256, /* it_iu size to request initially */ SRP_RQ_SHIFT = 6, SRP_RQ_SIZE = 1 << SRP_RQ_SHIFT, @@ -65,9 +67,11 @@ enum { }; #define SRP_OP_RECV (1 << 31) -#define SRP_MAX_INDIRECT ((SRP_MAX_IU_LEN - \ - sizeof (struct srp_cmd) - \ - sizeof (struct srp_indirect_buf)) / 16) +#define SRP_MIN_IU_LEN (sizeof (struct srp_cmd) + \ + sizeof (struct srp_indirect_buf)) +#define SRP_NUM_PARTIALS(x) \ + (((x) - SRP_MIN_IU_LEN) / sizeof (struct srp_direct_buf)) +#define SRP_MAX_INDIRECT SRP_NUM_PARTIALS(SRP_MAX_IU_LEN) enum srp_target_state { SRP_TARGET_LIVE, @@ -117,8 +121,10 @@ struct srp_target_port { struct ib_cm_id *cm_id; struct ib_cq *cq; struct ib_qp *qp; - int max_ti_iu_len; + int max_it_iu_len; + int req_it_iu_len; + int max_partial_desc; s32 req_lim; unsigned rx_head; Index: ib_srp.c =================================================================== --- ib_srp.c (revision 3914) +++ ib_srp.c (working copy) @@ -295,7 +297,7 @@ static int srp_send_req(struct srp_targe req->priv.opcode = SRP_LOGIN_REQ; req->priv.tag = 0; - req->priv.req_it_iu_len = cpu_to_be32(SRP_MAX_IU_LEN); + req->priv.req_it_iu_len = cpu_to_be32(target->req_it_iu_len); req->priv.req_buf_fmt = cpu_to_be16(SRP_BUF_FORMAT_DIRECT | SRP_BUF_FORMAT_INDIRECT); memcpy(req->priv.initiator_port_id, target->srp_host->initiator_port_id, 16); @@ -386,6 +388,7 @@ static int srp_connect_target(struct srp return ret; break; + case SRP_LOGIN_RETRY: case SRP_DLID_REDIRECT: break; @@ -528,11 +531,6 @@ static int srp_map_data(struct scsi_cmnd fmt = SRP_DATA_DESC_INDIRECT; - if (scmnd->sc_data_direction == DMA_TO_DEVICE) - cmd->data_out_desc_cnt = n; - else - 
cmd->data_in_desc_cnt = n; - buf->table_desc.va = cpu_to_be64(req->cmd->dma + sizeof *cmd + sizeof *buf); @@ -552,6 +550,20 @@ static int srp_map_data(struct scsi_cmnd buf->len = cpu_to_be32(datalen); + /* + * buf->len includes the SRP indirect descriptor table. + * We only include in the descriptor count (n) and len + * the descriptors that fit in partial descriptor list + * of the it_iu. Any beyond that will be rdma read by + * the target. + */ + n = min(target->max_partial_desc, n); + + if (scmnd->sc_data_direction == DMA_TO_DEVICE) + cmd->data_out_desc_cnt = n; + else + cmd->data_in_desc_cnt = n; + len = sizeof (struct srp_cmd) + sizeof (struct srp_indirect_buf) + n * sizeof (struct srp_direct_buf); @@ -1003,10 +1015,19 @@ static void srp_cm_rej_handler(struct ib struct srp_login_rej *rej = event->private_data; u32 reason = be32_to_cpu(rej->reason); - if (reason == SRP_LOGIN_REJ_REQ_IT_IU_LENGTH_TOO_LARGE) - printk(KERN_WARNING PFX + if (reason == SRP_LOGIN_REJ_REQ_IT_IU_LENGTH_TOO_LARGE) { + printk(KERN_DEBUG PFX "SRP_LOGIN_REJ: requested max_it_iu_len too large\n"); - else + /* + * Retry with minimum it_iu length and let target + * suggest max it_iu length. 
+ */ + if (target->req_it_iu_len > SRP_MIN_IU_LEN) { + target->req_it_iu_len = SRP_MIN_IU_LEN; + target->status = SRP_LOGIN_RETRY; + break; + } + } else printk(KERN_WARNING PFX "SRP LOGIN REJECTED, reason 0x%08x\n", reason); } else @@ -1045,6 +1066,17 @@ static int srp_cm_handler(struct ib_cm_i struct srp_login_rsp *rsp = event->private_data; target->max_ti_iu_len = be32_to_cpu(rsp->max_ti_iu_len); + target->max_it_iu_len = be32_to_cpu(rsp->max_it_iu_len); + if (target->max_it_iu_len > SRP_MAX_IU_LEN) + target->max_it_iu_len = SRP_MAX_IU_LEN; + else if (target->max_it_iu_len < SRP_MIN_IU_LEN) { + printk(KERN_ERR PFX "Invalid rsp->max_it_iu_len: %d\n", + target->max_it_iu_len); + target->status = -ECONNRESET; + break; + } + target->max_partial_desc = + SRP_NUM_PARTIALS(target->max_it_iu_len); target->req_lim = be32_to_cpu(rsp->req_lim_delta); target->scsi_host->can_queue = min(target->req_lim, @@ -1227,7 +1259,8 @@ static struct scsi_host_template srp_tem .eh_host_reset_handler = srp_reset_host, .can_queue = SRP_SQ_SIZE, .this_id = -1, - .sg_tablesize = SRP_MAX_INDIRECT, + .sg_tablesize = (SRP_MAX_INDIRECT < SG_ALL ? + SRP_MAX_INDIRECT : SG_ALL), .cmd_per_lun = SRP_SQ_SIZE, .use_clustering = ENABLE_CLUSTERING }; @@ -1434,6 +1467,8 @@ static ssize_t srp_create_target(struct ret = PTR_ERR(target->cm_id); goto err_free; } + target->req_it_iu_len = SRP_REQ_IU_LEN; + target->max_partial_desc = SRP_NUM_PARTIALS(SRP_REQ_IU_LEN); ret = srp_connect_target(target); if (ret) { From iod00d at hp.com Mon Oct 31 16:28:11 2005 From: iod00d at hp.com (Grant Grundler) Date: Mon, 31 Oct 2005 16:28:11 -0800 Subject: [openib-general] [PATCH/RFC] IB: Add SCSI RDMA Protocol (SRP) initiator In-Reply-To: <52wtjtk3d1.fsf@cisco.com> References: <52wtjtk3d1.fsf@cisco.com> Message-ID: <20051101002811.GD3107@esmail.cup.hp.com> On Mon, Oct 31, 2005 at 09:23:06AM -0800, Roland Dreier wrote: > I've posted this several times for review and gotten some (but not > very much) feedback. 
Has anyone purchased an IB SRP target for use with Linux? I've seen references to "Cisco SFS 3001 Multifabric Server Switch" (TS90) with the optional FC gateway stuff. Anyway, while I have a TS90, I don't have the FC GW. If someone sent me one, I'd plug it into my test ring. I have a switch and 2Gb/s FC JBODs. Are any native IB/SRP storage devices available? (Note that I'm asking only out of curiosity. I'm not going to rush out and buy one for development.) > Is there any objection to me asking Linus to pull this for 2.6.15? I don't have anything. Just some nits: > +#define DRV_VERSION "0.01" > +#define DRV_RELDATE "January 11, 2005" Implies the driver hasn't changed since Jan 11. Is that correct? (I find that hard to believe if you got feedback) Revision numbers are cheap - just roll it to 0.9 (or whatever) and apply a current date. > +MODULE_AUTHOR("Roland Dreier"); > +MODULE_DESCRIPTION("InfiniBand SCSI RDMA Protocol driver"); I'd add "initiator" here unless you think this driver could support targets in the future too. I do realize the difference between initiator and target for RDMA is a lot smaller than it was for traditional parallel SCSI implementations. In fact, I'm wondering if one could be implemented for SRP entirely in userspace. > +static int srp_create_target_ib(struct srp_target_port *target) > +{ > + struct ib_qp_init_attr *init_attr = NULL; Don't need the NULL assignment here. BTW, does gcc just throw this away since it gets overwritten? > + int ret; > + > + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); > + if (!init_attr) > + return -ENOMEM; > + > + target->cq = ib_create_cq(target->srp_host->dev, srp_completion, > + NULL, target, SRP_CQ_SIZE); > + if (IS_ERR(target->cq)) { > + ret = PTR_ERR(target->cq); > + goto out; > + } Could this be "adjusted" to read: if (ret = PTR_ERR(target->qp)) { ... I'm sure I do NOT understand the utility of "IS_ERR" in this case. Most uses of "IS_ERR" seem superfluous. ...
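On the IS_ERR() question, a userspace sketch of the error-pointer convention may help. The `sketch_` functions below are hypothetical re-implementations modeled on the kernel's include/linux/err.h, not the kernel macros themselves, and the 4095 threshold is an assumption (the exact cutoff has varied between kernel versions). The point it illustrates: PTR_ERR() simply reinterprets the pointer as a long, so for a valid pointer it is nonzero, and a bare `if (ret = PTR_ERR(ptr))` would treat every successful create as a failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed cutoff: the top 4095 addresses are reserved for error codes. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode: just reinterpret the pointer bits as a long. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only if the pointer falls in the reserved error range. */
static inline int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static int real_object;

/* Hypothetical create function: returns an object or an encoded error. */
static void *sketch_create(int fail)
{
	return fail ? sketch_err_ptr(-ENOMEM) : (void *)&real_object;
}

/* The IS_ERR-then-PTR_ERR pattern from the quoted code. */
static long sketch_check(void *ptr)
{
	if (sketch_is_err(ptr))
		return sketch_ptr_err(ptr);
	return 0;
}
```

So the range check is what separates "this is an errno in disguise" from "this is a real object"; decoding alone cannot tell the two apart.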
> + target->qp = ib_create_qp(target->srp_host->pd, init_attr); > + if (IS_ERR(target->qp)) { > + ret = PTR_ERR(target->qp); > + ib_destroy_cq(target->cq); > + goto out; > + } > + > + ret = srp_init_qp(target, target->qp); > + if (ret) { > + ib_destroy_qp(target->qp); > + ib_destroy_cq(target->cq); > + goto out; > + } The second "goto out" can be dropped. Falls through anyway. I'm ambiviently if it's good coding style or not. (ie in case someone adds code later). > + > +out: > + kfree(init_attr); > + return ret; > +} ... > +static int srp_lookup_path(struct srp_target_port *target) > +{ > + target->path.numb_path = 1; > + > + init_completion(&target->done); > + > + target->path_query_id = ib_sa_path_rec_get(target->srp_host->dev, > + target->srp_host->port, > + &target->path, > + IB_SA_PATH_REC_DGID | > + IB_SA_PATH_REC_SGID | > + IB_SA_PATH_REC_NUMB_PATH | > + IB_SA_PATH_REC_PKEY, My preference is to put the '|' on the next line with expressions behind it. That way the last line is obviously not a standalone usage when seen with grep or similar line-based text tool. > + SRP_PATH_REC_TIMEOUT_MS, > + GFP_KERNEL, > + srp_path_rec_completion, > + target, &target->path_query); ... > + req->param.starting_psn = 0; /* XXX */ There are still 6 "XXX" markers...don't want to suggest they need to be fixed. > + req->param.private_data = &req->priv; > + req->param.private_data_len = sizeof req->priv; > + req->param.responder_resources = 4; > + req->param.remote_cm_response_timeout = 20; > + req->param.flow_control = 1; > + req->param.local_cm_response_timeout = 20; > + req->param.retry_count = 7; > + req->param.rnr_retry_count = 7; > + req->param.max_cm_retries = 15; Are these retry counts specified by some standard or just "this ought to be enough" kind of numbers? If the latter, another "XXX" about making them system tunables (e.g. MOD_PARM or /sys) would be good. 
> +	/*
> +	 * Topspin/Cisco SRP targets will reject our login unless we
> +	 * zero out the first 8 bytes of our initiator port ID.  The
> +	 * second 8 bytes must be our local node GUID, but we always
> +	 * use that anyway.
> +	 */

...

> +static int srp_connect_target(struct srp_target_port *target)
> +{
> +	int ret;
> +
> +	ret = srp_lookup_path(target);
> +	if (ret)
> +		return ret;
> +
> +	while (1) {
> +		init_completion(&target->done);
> +		ret = srp_send_req(target);
> +		if (ret)
> +			return ret;
> +		wait_for_completion(&target->done);
> +
> +		/*
> +		 * The CM event handling code will set status to
> +		 * SRP_PORT_REDIRECT if we get a port redirect REJ
> +		 * back, or SRP_DLID_REDIRECT if we get a lid/qp
> +		 * redirect REJ back.
> +		 */
> +		switch (target->status) {
> +		case 0:
> +			return 0;
> +
> +		case SRP_PORT_REDIRECT:
> +			ret = srp_lookup_path(target);
> +			if (ret)
> +				return ret;
> +			break;
> +
> +		case SRP_DLID_REDIRECT:
> +			break;
> +
> +		default:
> +			return target->status;
> +		}
> +	}

Maybe add this for lint?
	/* NOTREACHED */

> +}

Maybe lint is smart enough to realize that these days.

> +static int srp_reconnect_target(struct srp_target_port *target)
> +{
...
> +	ib_destroy_cm_id(target->cm_id);
> +	target->cm_id = new_cm_id;

Is it explained someplace why we drop the old cm_id and create
a new one in this case?
I'm hoping this was explained elsewhere and I just missed it.

...

> +	while (ib_poll_cq(target->cq, 1, &wc) > 0)
> +		; /* nothing */

Does a "relax_cpu()" belong in here?

OK, out of time. I scanned the last couple of hundred lines
and didn't see any nits there.
hth,
grant

From iod00d at hp.com Mon Oct 31 16:41:48 2005
From: iod00d at hp.com (Grant Grundler)
Date: Mon, 31 Oct 2005 16:41:48 -0800
Subject: [openib-general] [PATCH] [SRP] support for it_iu length negotiation
In-Reply-To:
References:
Message-ID: <20051101004148.GG3107@esmail.cup.hp.com>

On Mon, Oct 31, 2005 at 05:45:36PM -0600, John Kingman wrote:
> On a successful login, the ib_srp client will set the maximum it_iu
> length it will use to the maximum it_iu length requested by the
> target, within the bounds of the minimum and maximum it_iu lengths it
> can support.
...
> Index: ib_srp.c
> ===================================================================
> --- ib_srp.c	(revision 3914)
> +++ ib_srp.c	(working copy)

It would be good if we started updating the driver versions
that are embedded in the driver. It makes tracking down
"known bugs" and "known fixes" a lot easier later once distros
start sending this out to customers.

Openib code will become a collection of separate drivers
once we have "stable" releases that are used by customers.
Each driver will need its own revision history once some
peon like me has to maintain it.

I'm going through this exercise with the tg3 driver for "IOX Core LAN"
issues and Dave Miller's made my life *a lot* easier by regularly
rolling the tg3 version # and release date.
thanks,
grant
References: <52wtjtk3d1.fsf@cisco.com>
Message-ID: <20051101110409V.fujita.tomonori@lab.ntt.co.jp>

From: Roland Dreier
Subject: [PATCH/RFC] IB: Add SCSI RDMA Protocol (SRP) initiator
Date: Mon, 31 Oct 2005 09:23:06 -0800

> I've posted this several times for review and gotten some (but not
> very much) feedback.  Is there any objection to me asking Linus to
> pull this for 2.6.15?

Any reason the existing SRP definitions
(drivers/scsi/ibmvscsi/srp.h) don't work for you?

From rolandd at cisco.com Mon Oct 31 20:51:49 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 31 Oct 2005 20:51:49 -0800
Subject: [openib-general] [PATCH/RFC] IB: Add SCSI RDMA Protocol (SRP) initiator
In-Reply-To: <20051101002811.GD3107@esmail.cup.hp.com> (Grant Grundler's message of "Mon, 31 Oct 2005 16:28:11 -0800")
References: <52wtjtk3d1.fsf@cisco.com> <20051101002811.GD3107@esmail.cup.hp.com>
Message-ID: <52mzkpgeca.fsf@cisco.com>

> Has anyone purchased an IB SRP target for use with Linux?
> I've seen references to "Cisco SFS 3001 Multifabric Server Switch"
> (TS90) with the optional FC gateway stuff.

Yes, we have actually sold some...

> Are any native IB/SRP storage devices available?

I don't know what the release status of the various products is, but
Data Direct, Engenio and Mellanox have all talked about native IB/SRP
targets, and judging by John Kingman's activity, it's a safe bet that
StorageGear has something cooking as well.

> Implies the driver hasn't changed since Jan 11. Is that correct?

Nope, I bumped it to 0.2 and put it in the modinfo.

> I'd add "initiator" here unless you think this driver could
> support targets in the future too.

It's definitely an initiator, so I changed that.

> Don't need the NULL assignment here.

Fixed.

> Could this be "adjusted" to read:
> 	if (ret = PTR_ERR(target->qp)) {
> 		...
>
> I'm sure I do NOT understand the utility of "IS_ERR" in this case.
> Most uses of "IS_ERR" seem superfluous.

I don't think this sort of change will work.  IS_ERR() is only true
if the pointer (as an unsigned long) is in the range -1000L ... -1L.
But PTR_ERR() will be true if the pointer is non-NULL.

> There are still 6 "XXX" markers...don't want to suggest they need
> to be fixed.

I fixed the easy ones...

> Are these retry counts specified by some standard or just
> "this ought to be enough" kind of numbers?
> If the latter, another "XXX" about making them system tunables
> (e.g. MOD_PARM or /sys) would be good.

Nope, no spec.  I added a comment talking about this issue.

> Is it explained someplace why we drop the old cm_id and create
> a new one in this case?
> I'm hoping this was explained elsewhere and I just missed it.

Yes, a few lines earlier:

	/*
	 * Now get a new local CM ID so that we avoid confusing the
	 * target in case things are really fouled up.
	 */

> > +	while (ib_poll_cq(target->cq, 1, &wc) > 0)
> > +		; /* nothing */

> does a "relax_cpu()" belong in here?

I don't think so.
No entries can be added to the CQ while we're in that
loop -- I just want to go through the CQ and throw away any of the
entries that are there.  So it's not busy-waiting -- it's just
iterating through the queue until it drains it.

Thanks,
  Roland

From rolandd at cisco.com Mon Oct 31 20:55:23 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 31 Oct 2005 20:55:23 -0800
Subject: [openib-general] Re: [PATCH/RFC] IB: Add SCSI RDMA Protocol (SRP) initiator
In-Reply-To: <20051101110409V.fujita.tomonori@lab.ntt.co.jp> (FUJITA Tomonori's message of "Tue, 01 Nov 2005 11:04:09 +0900")
References: <52wtjtk3d1.fsf@cisco.com> <20051101110409V.fujita.tomonori@lab.ntt.co.jp>
Message-ID: <52irvdge6c.fsf@cisco.com>

    FUJITA> Any reason the existing SRP definitions
    FUJITA> (drivers/scsi/ibmvscsi/srp.h) don't work for you?

Wow ... I never realized that ibmvscsi was an SRP initiator as well.

Anyway, looking at drivers/scsi/ibmvscsi/srp.h, the main problem I see
is that the file has a bunch of bitfields that are big-endian only
(which makes sense because the driver can only be compiled for pSeries
or iSeries anyway).

But I have no objection to moving the file to include/scsi/srp.h,
adding a bunch of

	#if defined(__LITTLE_ENDIAN_BITFIELD)
	#elif defined(__BIG_ENDIAN_BITFIELD)
	#endif

and adding a few missing defines, and then converting ib_srp to use
the same file.

Does that seem like the right thing to do?
Thanks,
  Roland

From hch at lst.de Mon Oct 31 20:58:00 2005
From: hch at lst.de (Christoph Hellwig)
Date: Tue, 1 Nov 2005 05:58:00 +0100
Subject: [openib-general] Re: [PATCH/RFC] IB: Add SCSI RDMA Protocol (SRP) initiator
In-Reply-To: <52irvdge6c.fsf@cisco.com>
References: <52wtjtk3d1.fsf@cisco.com> <20051101110409V.fujita.tomonori@lab.ntt.co.jp> <52irvdge6c.fsf@cisco.com>
Message-ID: <20051101045800.GA25519@lst.de>

On Mon, Oct 31, 2005 at 08:55:23PM -0800, Roland Dreier wrote:
> Anyway, looking at drivers/scsi/ibmvscsi/srp.h, the main problem I see
> is that the file has a bunch of bitfields that are big-endian only
> (which makes sense because the driver can only be compiled for pSeries
> or iSeries anyway).
>
> But I have no objection to moving the file to include/scsi/srp.h,
> adding a bunch of
>
> 	#if defined(__LITTLE_ENDIAN_BITFIELD)
> 	#elif defined(__BIG_ENDIAN_BITFIELD)
> 	#endif
>
> and adding a few missing defines, and then converting ib_srp to use
> the same file.
>
> Does that seem like the right thing to do?

No.  Bitfields for accessing hardware/wire datastructures are wrong and
will always break in some circumstances.  Your header is much better.

> Thanks,
>   Roland

From rolandd at cisco.com Mon Oct 31 21:00:52 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 31 Oct 2005 21:00:52 -0800
Subject: [openib-general] Re: [PATCH] [SRP] support for it_iu length negotiation
In-Reply-To: (John Kingman's message of "Mon, 31 Oct 2005 17:45:36 -0600 (CST)")
References:
Message-ID: <52ek61gdx7.fsf@cisco.com>

Thanks for the patch.  However, I would like to hold off on new
features for the SRP driver to get it merged into 2.6.15.
There's about another week in the 2.6.15 merge window, so either way
the delay shouldn't be too long.

With that said I don't think I like this patch.  I don't think it's a
win to allocate 1 KB IUs when we'll almost never have gather/scatter
lists that big.  Even the 256 byte IUs that the current driver uses
seem on the borderline of being too big.

Also, is it really a win to have the target fetch a large indirect
buffer list?  It seems like it would be better for performance to give
the SCSI layer a limit on the size of the gather/scatter list we
support so that our indirect buffer lists always fit in the IUs we
send.

 - R.

From rolandd at cisco.com Mon Oct 31 21:03:35 2005
From: rolandd at cisco.com (Roland Dreier)
Date: Mon, 31 Oct 2005 21:03:35 -0800
Subject: [openib-general] Re: [PATCH/RFC] IB: Add SCSI RDMA Protocol (SRP) initiator
In-Reply-To: <20051101045800.GA25519@lst.de> (Christoph Hellwig's message of "Tue, 1 Nov 2005 05:58:00 +0100")
References: <52wtjtk3d1.fsf@cisco.com> <20051101110409V.fujita.tomonori@lab.ntt.co.jp> <52irvdge6c.fsf@cisco.com> <20051101045800.GA25519@lst.de>
Message-ID: <52acgpgdso.fsf@cisco.com>

    Christoph> No.  Bitfields for accessing hardware/wire
    Christoph> datastructures are wrong and will always break in some
    Christoph> circumstances.  Your header is much better.

OK, that's my feeling as well.

Would it make sense for me to split the pure SRP spec structures and
so on into a separate file and put it in include/scsi/srp.h?  Then we
can move ibmvscsi towards using that file.

 - R.
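To make the bitfield objection concrete, here is a small userspace sketch of the usual alternative: treat wire data as plain bytes and extract fields with shifts and masks, so the result depends neither on the compiler's bitfield layout nor on host byte order. The layout used here (one byte holding a 4-bit format code in the high nibble and a 4-bit count in the low nibble, plus a 32-bit big-endian length) and all `sketch_*` names are invented for illustration; they are not taken from the SRP specification or from either header discussed in this thread.

```c
#include <stdint.h>

/* Hypothetical one-byte field: high nibble = format, low nibble = count. */
#define SKETCH_FMT_SHIFT 4
#define SKETCH_FMT_MASK  0xf0u
#define SKETCH_CNT_MASK  0x0fu

/* Extract the format code from the high nibble. */
static inline unsigned sketch_get_fmt(uint8_t b)
{
	return (b & SKETCH_FMT_MASK) >> SKETCH_FMT_SHIFT;
}

/* Extract the count from the low nibble. */
static inline unsigned sketch_get_cnt(uint8_t b)
{
	return b & SKETCH_CNT_MASK;
}

/* Pack both fields into one byte; out-of-range bits are masked off. */
static inline uint8_t sketch_pack(unsigned fmt, unsigned cnt)
{
	return (uint8_t)(((fmt << SKETCH_FMT_SHIFT) & SKETCH_FMT_MASK) |
			 (cnt & SKETCH_CNT_MASK));
}

/* Read a 32-bit big-endian value byte by byte, independent of host order. */
static inline uint32_t sketch_get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Store a 32-bit value in big-endian order, independent of host order. */
static inline void sketch_put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}
```

The same source compiles to the same wire bytes on little-endian and big-endian hosts, which is the property the paired `#if defined(..._ENDIAN_BITFIELD)` approach tries to recover by hand.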
From hch at lst.de Mon Oct 31 21:04:23 2005
From: hch at lst.de (Christoph Hellwig)
Date: Tue, 1 Nov 2005 06:04:23 +0100
Subject: [openib-general] Re: [PATCH/RFC] IB: Add SCSI RDMA Protocol (SRP) initiator
In-Reply-To: <52acgpgdso.fsf@cisco.com>
References: <52wtjtk3d1.fsf@cisco.com> <20051101110409V.fujita.tomonori@lab.ntt.co.jp> <52irvdge6c.fsf@cisco.com> <20051101045800.GA25519@lst.de> <52acgpgdso.fsf@cisco.com>
Message-ID: <20051101050423.GA25691@lst.de>

On Mon, Oct 31, 2005 at 09:03:35PM -0800, Roland Dreier wrote:
>     Christoph> No.  Bitfields for accessing hardware/wire
>     Christoph> datastructures are wrong and will always break in some
>     Christoph> circumstances.  Your header is much better.
>
> OK, that's my feeling as well.
>
> Would it make sense for me to split the pure SRP spec structures and
> so on into a separate file and put it in include/scsi/srp.h?  Then we
> can move ibmvscsi towards using that file.

Sounds like a good idea, yes.