From eli at dev.mellanox.co.il Thu May 1 00:01:25 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Thu, 01 May 2008 10:01:25 +0300
Subject: [ofa-general] [PATCH] IB/ipoib: fix net queue lockup
In-Reply-To: References: <1209577156.1790.11.camel@mtls03>
Message-ID: <1209625285.1790.24.camel@mtls03>

On Wed, 2008-04-30 at 20:05 -0700, Roland Dreier wrote:
> > we have seen a few other cases where a large tx queue is needed. I
> > think we should choose a larger default value than the current 64.
>
> maybe yes, maybe no... what are the cases where it is needed?
>
> The send queue is basically acting as a "shock absorber" for bursty
> traffic. If the queue is filling up because of a steady traffic rate,
> then making the queue bigger means it will just take a little longer to
> fill. The way a longer send queue helps, I guess, is if the send queue is
> emptying out before the transmit queue is woken up...

I agree, but I want to have a larger buffer to absorb larger peaks. For
example, after applying this patch I tested how many times the net queue
is stopped and woken up when running four streams of netperf, udp, small
packets. When using the default 64 tx queue size it happened 500 times.
When I used a 256 tx queue size it happened only 37 times. This makes me
think that we have larger peaks that a larger queue size can help handle.
Also, looking for example at the Broadcom bnx2 driver on my machine, it
uses a 1000 tx queue len.

From jackm at dev.mellanox.co.il Thu May 1 00:04:27 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 1 May 2008 10:04:27 +0300
Subject: [ofa-general] Re: [PATCH] ib_mthca: use log values instead of numeric values when specifying HCA resource maxes in module parameters
In-Reply-To: References: <200804291822.57820.jackm@dev.mellanox.co.il>
Message-ID: <200805011004.27307.jackm@dev.mellanox.co.il>

On Tuesday 29 April 2008 19:48, Roland Dreier wrote:
> given that mthca has had the old interface for nearly a year and a half,
> what do we gain from changing it now?

We gain clarity and consistency. The mlx4 driver in OFED 1.3 uses log
values in the module parameters (patch for mlx4 that I submitted in
October 2007).

> > I put a check in the patch for detecting if the user specified a log or not,
> > to make the transition from the old method (of numbers instead of logs)
> > easier.
>
> Yes, that is nice. Would the plan be just to allow both methods?

Good idea, but it cannot be done for all parameters. "max rdb per qp" is by
default 4 rdb's per qp (log = 2). If an administrator supplies ONLY this
parameter in an "options" line for ib_mthca, how can I tell if the value is
a log or a number? (Say the administrator places the following line:
"options ib_mthca num_rdb=4" -- how will I know whether the admin means a
log, or just a number?)

Maybe the best solution is to change the parameter name from "num_xxx" to
"log_num_xxx". That way, if the administrator is using an old
/etc/modprobe.conf file, with lines like "options ib_mthca num_xxxx=20000",
then ib_mthca will fail to load, and there will be lines in
/var/log/messages like "ib_mthca: Unknown parameter `num_cq'".

Please note also that very few customers are using this module-parameter
capability as yet.

> it would make sense for mlx4 to allow setting parameter
> values by value and not by log, and then we end up with all the same
> code in both places, and so why not just have mlx4 set by value the same
> way as mthca?

OFED 1.3 has the patch I submitted for mlx4 in October 07, and this already
uses logs, not values. We would then be confusing Hermon customers if we
change this to values.

I think it is healthiest to:
1. Use the ib_mthca patch I submitted, but change the parameter names from
"num_xxx" to "log_num_xxx"
2. Take the mlx4 patch as is (maybe adding a check that values are <31).

- Jack
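As a sketch of what Jack's rename might look like in practice (the parameter
name log_num_qp and its default here are hypothetical, not the actual
ib_mthca patch): an old "options ib_mthca num_qp=..." line then fails with
"Unknown parameter", and the handler bounds the log value the way Jack
suggests for mlx4.

#include <linux/module.h>
#include <linux/moduleparam.h>

static int log_num_qp = 16;	/* hypothetical default: 2^16 QPs */

static int set_log_num_qp(const char *val, struct kernel_param *kp)
{
	int ret = param_set_int(val, kp);	/* parse the log value */

	if (ret)
		return ret;
	if (log_num_qp < 0 || log_num_qp > 30)	/* keep 1 << log in int range */
		return -EINVAL;			/* module load fails loudly */
	return 0;
}

module_param_call(log_num_qp, set_log_num_qp, param_get_int,
		  &log_num_qp, 0444);
MODULE_PARM_DESC(log_num_qp, "log2 of the number of QPs to allocate (0..30)");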
From kensandars at hotmail.com Thu May 1 01:24:15 2008
From: kensandars at hotmail.com (Ken Sandars)
Date: Thu, 1 May 2008 18:24:15 +1000
Subject: [Ips] [Stgt-devel] [ofa-general] Re: Calculating the VA in iSER header
In-Reply-To: <39C75744D164D948A170E9792AF8E7CAF60D50@exil.voltaire.com>
References: <4804B03C.6060507@voltaire.com> <694d48600804160122l1cc97b8aka8986ee6deb7dec8@mail.gmail.com> <20080416144830.GC23861@osc.edu> <694d48600804170413g4d54cd9g447abd345a1f6301@mail.gmail.com> <20080429170516.GA8857@osc.edu> <39C75744D164D948A170E9792AF8E7CAF60D50@exil.voltaire.com>
Message-ID:

>> [Ken] It appears the current Linux iSER initiator does not send the HELLO
>> message when the connection transits to full feature phase. The stgt
>> target also ignores this message (if it were to appear).

[Ken] The IBTA document does not mention the HELLO/HELLOREPLY messages.
Implementing this message exchange gives a distinction between the current
implementations and those that will correctly calculate the write_va (as per
Pete Wyckoff's option 3).

>> [Ken] Both of these implementations use a non-conformant iSER header (they
>> add write_va and read_va fields, which incidentally do not appear to be
>> used). Are these changes documented anywhere in the IB domain, or are
>> these variations needed for another reason?
>
> [Erez] Take a look at the iSER for IB annex:
> http://www.infinibandta.org/members/spec/Annex_iSER.PDF

[Ken] Ouch. That link requires a username/password. Looks like it is only
available to members of the InfiniBand Trade Association. Fortunately I
gained access to it with username "open" and password "standard". ;-)

[Ken] Neither of these implementations sends or examines the iSER CM REQ/REP
message private data. The document doesn't define what action to take when
this message is absent. Interestingly, when the target reports that "ZBVA
shall be used for this connection" and "the target shall issue Send with
Invalidate as needed", then it appears the iSER header specified in RFC 5046
should be used for control-type PDUs. Is there any plan to conform with the
list of requirements for IBTA compliance?

Cheers,
Ken

From kliteyn at dev.mellanox.co.il Thu May 1 04:48:25 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 01 May 2008 14:48:25 +0300
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: log matched QoS criteria
Message-ID: <4819AE09.2020509@dev.mellanox.co.il>

Adding log messages for matched criteria of the QoS policy rule.
Signed-off-by: Yevgeny Kliteynik
---
 opensm/opensm/osm_qos_policy.c |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
index 6c81872..185ccc0 100644
--- a/opensm/opensm/osm_qos_policy.c
+++ b/opensm/opensm/osm_qos_policy.c
@@ -624,6 +624,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
+		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"__qos_policy_get_match_rule_by_params: "
+			"Source port matched.\n");
 	}

 	/* If a match rule has Destination groups, PR request dest. has to be in this list */
@@ -637,6 +640,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
+		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"__qos_policy_get_match_rule_by_params: "
+			"Destination port matched.\n");
 	}

 	/* If a match rule has QoS classes, PR request HAS
@@ -655,7 +661,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
-
+		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"__qos_policy_get_match_rule_by_params: "
+			"QoS Class matched.\n");
 	}

 	/* If a match rule has Service IDs, PR request HAS
@@ -675,7 +683,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
-
+		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"__qos_policy_get_match_rule_by_params: "
+			"Service ID matched.\n");
 	}

 	/* If a match rule has PKeys, PR request HAS
@@ -694,7 +704,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
-
+		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"__qos_policy_get_match_rule_by_params: "
+			"PKey matched.\n");
 	}

 	/* if we got here, then this match-rule matched this PR request */
--
1.5.1.4

From tziporet at mellanox.co.il Thu May 1 05:30:33 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 1 May 2008 15:30:33 +0300
Subject: [ofa-general] Reminder: OFED 1.3.1-rc1 is planned for next week
Message-ID: <6C2C79E72C305246B504CBA17B5500C903EBFA77@mtlexch01.mtl.com>

Hi All,
OFED 1.3.1-rc1 is planned for Tuesday next week (May 6).
Please send all your patches/new packages (e.g. MPI) by the end of this week
so we will be able to integrate them and have rc1 on time.
Thanks,
Tziporet

From kliteyn at dev.mellanox.co.il Thu May 1 05:36:46 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 01 May 2008 15:36:46 +0300
Subject: [ofa-general] [PATCH] opensm/osm_qos_policy.c: log matched QoS criteria
In-Reply-To: <4819AE09.2020509@dev.mellanox.co.il>
References: <4819AE09.2020509@dev.mellanox.co.il>
Message-ID: <4819B95E.8080801@dev.mellanox.co.il>

Hi Sasha,

Please ignore this patch - it is using the old osm_log.
I'll repost v2 of this patch.

-- Yevgeny
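For readers comparing the v1 quoted below with the v2 reposted later in this
digest: the only change is the logging style. v1 calls the old osm_log() and
hard-codes the function name into every message, while the newer OSM_LOG()
helper presumably supplies the caller's name itself, roughly along these
lines (a hypothetical reconstruction, not the actual OpenSM definition):

/* v1: the function name is repeated by hand at each call site, e.g.
 *   osm_log(p_log, OSM_LOG_DEBUG,
 *           "__qos_policy_get_match_rule_by_params: Source port matched.\n");
 *
 * v2: a wrapper like this can prepend __func__ automatically, so the
 * call site shrinks to
 *   OSM_LOG(p_log, OSM_LOG_DEBUG, "Source port matched.\n");
 */
#define OSM_LOG(log, level, fmt, ...) \
	osm_log(log, level, "%s: " fmt, __func__, ## __VA_ARGS__)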
Yevgeny Kliteynik wrote:
> Adding log messages for matched criteria of
> the QoS policy rule.
>
> Signed-off-by: Yevgeny Kliteynik
> ---
>  opensm/opensm/osm_qos_policy.c |   18 +++++++++++++++---
>  1 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
> index 6c81872..185ccc0 100644
> --- a/opensm/opensm/osm_qos_policy.c
> +++ b/opensm/opensm/osm_qos_policy.c
> @@ -624,6 +624,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
>  			list_iterator = cl_list_next(list_iterator);
>  			continue;
>  		}
> +		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
> +			"__qos_policy_get_match_rule_by_params: "
> +			"Source port matched.\n");
>  	}
>
>  	/* If a match rule has Destination groups, PR request dest. has to be in this list */
> @@ -637,6 +640,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
>  			list_iterator = cl_list_next(list_iterator);
>  			continue;
>  		}
> +		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
> +			"__qos_policy_get_match_rule_by_params: "
> +			"Destination port matched.\n");
>  	}
>
>  	/* If a match rule has QoS classes, PR request HAS
> @@ -655,7 +661,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
>  			list_iterator = cl_list_next(list_iterator);
>  			continue;
>  		}
> -
> +		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
> +			"__qos_policy_get_match_rule_by_params: "
> +			"QoS Class matched.\n");
>  	}
>
>  	/* If a match rule has Service IDs, PR request HAS
> @@ -675,7 +683,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
>  			list_iterator = cl_list_next(list_iterator);
>  			continue;
>  		}
> -
> +		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
> +			"__qos_policy_get_match_rule_by_params: "
> +			"Service ID matched.\n");
>  	}
>
>  	/* If a match rule has PKeys, PR request HAS
> @@ -694,7 +704,9 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
>  			list_iterator = cl_list_next(list_iterator);
>  			continue;
>  		}
> -
> +		osm_log(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
> +			"__qos_policy_get_match_rule_by_params: "
> +			"PKey matched.\n");
>  	}
>
>  	/* if we got here, then this match-rule matched this PR request */

From monis at Voltaire.COM Thu May 1 05:48:27 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Thu, 01 May 2008 15:48:27 +0300
Subject: [ofa-general] Re: [PATCH] IB/core: handle race between elements in work queues after event
In-Reply-To: References: <48187E5A.7040809@Voltaire.COM>
Message-ID: <4819BC1B.1040909@Voltaire.COM>

Thanks for the comments. I'll resend soon.

From monis at Voltaire.COM Thu May 1 05:52:57 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Thu, 01 May 2008 15:52:57 +0300
Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event
In-Reply-To: <48187E5A.7040809@Voltaire.COM>
References: <48187E5A.7040809@Voltaire.COM>
Message-ID: <4819BD29.7080002@Voltaire.COM>

I made some changes according to Or and Roland's comments.

This patch solves a race between work elements that are carried out after an
event occurs. When the SM address handle becomes invalid and needs an update,
it is handled by a work item in the global workqueue. On the other hand, this
event is also handled in ib_ipoib by queuing a work item in the
ipoib_workqueue that does the mcast join. Although queuing is in the right
order, it is done to two different workqueues, and so there is no guarantee
that the first to be queued is the first to be executed.
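A toy sketch of that ordering hazard (hypothetical names, not the actual
sa_query/ipoib code): both work items are queued back to back, but nothing
orders their execution because they sit on different workqueues.

#include <linux/workqueue.h>

/* assume both work items were set up with INIT_WORK() at init time */
static struct workqueue_struct *ipoib_wq;	/* stands in for ipoib_workqueue */
static struct work_struct update_ah_work;	/* would run update_sm_ah() */
static struct work_struct mcast_join_work;	/* would run the mcast join */

static void handle_port_event(void)
{
	/* queued first, on the global (keventd) workqueue */
	schedule_work(&update_ah_work);
	/* queued second, on ipoib's private workqueue */
	queue_work(ipoib_wq, &mcast_join_work);
	/*
	 * mcast_join_work may nevertheless run first and use the stale
	 * SM address handle -- the race the patch below closes by
	 * NULLing sm_ah and returning -EAGAIN until it is rebuilt.
	 */
}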
The patch sets the SM address handle to NULL, and until update_sm_ah() is
called, any request that needs sm_ah is rejected with -EAGAIN.

For consumers, the patch doesn't make things worse. Before the patch, MADs
were sent to the wrong SM; now they are blocked before they are sent.
Consumers can be improved if they examine the return code and respond to
EAGAIN properly, but even without an improvement the situation is not
getting worse, and in some cases it gets better. Specifically for this
issue:

* Callers of ib_sa_mcmember_rec_query() seem to handle the error returns
  properly, but without checking specifically for EAGAIN.
* Callers of ib_sa_path_rec_get() handle error returns, but not with a retry.
* I didn't find any caller of ib_sa_service_rec_query().

Signed-off-by: Moni Levy
Signed-off-by: Moni Shoua
---
 drivers/infiniband/core/sa_query.c |   26 ++++++++++++++++++++++----
 1 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index cf474ec..a2e61d7 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -413,9 +413,20 @@ static void ib_sa_event(struct ib_event_
 	    event->event == IB_EVENT_PKEY_CHANGE ||
 	    event->event == IB_EVENT_SM_CHANGE ||
 	    event->event == IB_EVENT_CLIENT_REREGISTER) {
-		struct ib_sa_device *sa_dev;
-		sa_dev = container_of(handler, typeof(*sa_dev), event_handler);
-
+		unsigned long flags;
+		struct ib_sa_device *sa_dev =
+			container_of(handler, typeof(*sa_dev), event_handler);
+		struct ib_sa_port *port =
+			&sa_dev->port[event->element.port_num - sa_dev->start_port];
+		struct ib_sa_sm_ah *sm_ah;
+
+		spin_lock_irqsave(&port->ah_lock, flags);
+		sm_ah = port->sm_ah;
+		port->sm_ah = NULL;
+		spin_unlock_irqrestore(&port->ah_lock, flags);
+
+		if (sm_ah)
+			kref_put(&sm_ah->ref, free_sm_ah);
 		schedule_work(&sa_dev->port[event->element.port_num -
 					    sa_dev->start_port].update_task);
 	}
@@ -663,6 +674,8 @@ int ib_sa_path_rec_get(struct ib_sa_clie
 		return -ENODEV;

 	port = &sa_dev->port[port_num - sa_dev->start_port];
+	if (!port->sm_ah)
+		return -EAGAIN;
 	agent = port->agent;

 	query = kmalloc(sizeof *query, gfp_mask);
@@ -780,6 +793,9 @@ int ib_sa_service_rec_query(struct ib_sa
 		return -ENODEV;

 	port = &sa_dev->port[port_num - sa_dev->start_port];
+	if (!port->sm_ah)
+		return -EAGAIN;
+
 	agent = port->agent;

 	if (method != IB_MGMT_METHOD_GET &&
@@ -877,8 +893,10 @@ int ib_sa_mcmember_rec_query(struct ib_s
 		return -ENODEV;

 	port = &sa_dev->port[port_num - sa_dev->start_port];
-	agent = port->agent;
+	if (!port->sm_ah)
+		return -EAGAIN;

+	agent = port->agent;
 	query = kmalloc(sizeof *query, gfp_mask);
 	if (!query)
 		return -ENOMEM;
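As a sketch of the consumer-side improvement suggested above (a hypothetical
wrapper, not part of this patch): a caller can treat -EAGAIN as "the SM
address handle is being refreshed" and simply retry a little later instead
of failing the operation.

#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

/* set up once with INIT_DELAYED_WORK(&ctx->work, sa_retry_fn) */
struct sa_retry_ctx {
	struct delayed_work work;
	int (*issue_query)(void *priv);	/* wraps e.g. ib_sa_path_rec_get() */
	void *priv;
};

static void sa_retry_fn(struct work_struct *work)
{
	struct sa_retry_ctx *ctx =
		container_of(work, struct sa_retry_ctx, work.work);

	/* sm_ah still NULL: SM info is being refreshed, try again shortly */
	if (ctx->issue_query(ctx->priv) == -EAGAIN)
		schedule_delayed_work(&ctx->work, msecs_to_jiffies(100));
}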
From David.Shue.ctr at rl.af.mil Thu May 1 06:09:12 2008
From: David.Shue.ctr at rl.af.mil (Shue, David CTR USAF AFMC AFRL/RITB)
Date: Thu, 1 May 2008 09:09:12 -0400
Subject: [ofa-general] Infiniband Card Trouble
Message-ID:

Hello,

I have used the OFED-1.3 software to communicate with the current cards I
have. These cards come up as "MT23108" in the logs, and I am not sure who
the manufacturer is. I was able to program the cards, and even install
MPICH2 and run tests.

I have recently obtained new IB cards from HP, "HP PCI-X 2-port 4X Fabric
(HPC) Adapter"
http://h20000.www2.hp.com/bizsupport/TechSupport/Home.jsp?lang=en&cc=id&prodTypeId=12883&prodSeriesId=460713&lang=en&cc=id
and these cards do not work the same. The machine boots up fine with the
card in, and also shows this card as a Mellanox "MT23108", yet the two cards
are visibly different in every way. Is the MT23108 a certain platform for
IB? I am new to the entire IB technology.

This is the history of what I did.
1) Staged the machine with RHEL v5
2) Installed the IB card
3) Booted the machine up
4) Can see the card looking at "lspci" and "dmesg", but nothing in the
network area or under "ifconfig" (just like with the first cards)
5) I then installed the OFED-1.3 software to communicate with and configure
the card
6) When I go to start the card (instead of rebooting, but I have tried both
ways) with /etc/init.d/openib start, it all fails. I then look in the log
file and see a bunch of "unknown symbol..." and "disagrees..." messages for
all items: ib_uverbs, ib_umad, iw_cxgb3, ib_path, mlx_ib, and so on.
7) When I reboot, the machine reaches "UDEV" of the reboot stage, hangs for
a little bit, and then many errors show and the machine won't boot, unless I
take the card out. If I uninstall the OFED software, it will reboot fine
with the card still in.

The card from HP giving me problems does not appear to have any drivers for
it. It looks like HP supports it on Windows and HP-UX.

I'm looking for any help you can provide.
Thanks in advance,
Dave

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
David Shue
Systems Specialist
Computer Sciences Corporation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

From dorfman.eli at gmail.com Thu May 1 06:50:55 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Thu, 1 May 2008 16:50:55 +0300
Subject: [ofa-general] Re: [Ips] Calculating the VA in iSER header
In-Reply-To: <20080429170516.GA8857@osc.edu>
References: <4804B03C.6060507@voltaire.com> <694d48600804160122l1cc97b8aka8986ee6deb7dec8@mail.gmail.com> <20080416144830.GC23861@osc.edu> <694d48600804170413g4d54cd9g447abd345a1f6301@mail.gmail.com> <20080429170516.GA8857@osc.edu>
Message-ID: <694d48600805010650l75c2f70bx662456ce85e7e9a5@mail.gmail.com>

On Tue, Apr 29, 2008 at 8:05 PM, Pete Wyckoff wrote:
> dorfman.eli at gmail.com wrote on Thu, 17 Apr 2008 14:13 +0300:
> > > On Wed, Apr 16, 2008 at 6:46 PM, Roland Dreier wrote:
> > > > Agree with the interpretation of the spec, and it's probably a bit
> > > > clearer that way too. But we have working initiators and targets
> > > > that do it the "wrong" way.
> > >
> > > Yes... I guess the key question is whether there are any initiators that
> > > do things the "right" way.
> > >
> > > > 1. Flag day: all initiators and targets change at the same time.
> > > > Will see data corruption if someone unluckily runs one or the other
> > > > using old non-fixed code.
> > >
> > > Seems unacceptable to me... it doesn't make sense at all to break every
> > > setup in the world just to be "right" according to the spec.
> >
> > This will break only when both initiator and target will use
> > InitialR2T=No, which means allow unsolicited data.
> > As far as I know, STGT is not very common (and its version in RHEL5.1
> > is considered experimental). Its default is also InitialR2T=Yes.
> > Voltaire's iSCSI over iSER target also uses default InitialR2T=Yes.
> > So it seems that nothing will break.
>
> I finally got a chance to look at this just now. I think you mean
> default is InitialR2T=No above, which means no unsolicited data.
> That is the default case, and true, the two different meanings
> of the initiator-supplied VA coincide.

InitialR2T=Yes means that R2T is required, hence no unsolicited data.
Only if both sides, initiator and target, agree on InitialR2T=No is the
first data burst unsolicited.

> But you missed the impact of immediate data. We run with the
> defaults (I think) that say the first write request packet should be
> filled with a bit of the coming data stream. From iscsid.conf:
>
> # To enable immediate data (i.e., the initiator sends unsolicited data
> # with the iSCSI command packet), uncomment the following line:
> #
> # The default is Yes
> node.session.iscsi.ImmediateData = Yes
>
> Looking at the offset printed out by your patch, it is indeed
> non-zero for the first RDMA read. Please correct me if I am
> mistaken about this---you must have tested all four variations of
> with and without the patches on initiator and target side, but I did
> not.

You are right about the ImmediateData=Yes. I really missed that, so this
patch will after all break the current target implementation and cause
data corruption.

I suggest postponing this patch until we implement the iSER HELLO message,
and then adding this patch together with the corresponding target patch.
This will allow the current initiator to work with the current target, and
a new initiator to work with a new target. I still think we should do that,
since future iSER implementations will probably rely on the spec.

> Hence I am still a bit unhappy about having to deal with the
> fallout, with no way to detect it. For our local use, I'll keep an
> older version of stgt in use until we switch to a new kernel, then
> merge up the target side change. It is a bother, but I can deal
> with it. For other institutions, this lockstep upgrade requirement
> will not be obvious until they debug the resulting data corruption.
>
> Still, I do understand why it would be nice to conform to the spec,
> and it is maybe a bit cleaner that way too. Maybe you can help with
> the bug reports on stgt-devel during the transition, and maintain
> and publish a patch to let it work with old kernels.
>
> -- Pete
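To make the two VA conventions in this thread concrete, here is a
hypothetical target-side helper (not stgt code): the "old" scheme trusts the
advertised VA, which the initiator has already advanced past the unsolicited
data, and keeps a private cursor; the spec-conformant scheme advertises the
start of the buffer and statelessly adds the R2T data offset to every read.

#include <stdint.h>

/* old convention: advertised VA already skips the unsolicited data */
static uint64_t read_addr_old(uint64_t advertised_va, uint64_t *cursor,
			      uint32_t len)
{
	uint64_t addr = advertised_va + *cursor;

	*cursor += len;		/* only works if segments are read in order */
	return addr;
}

/* spec convention: advertised VA is the start of the buffer */
static uint64_t read_addr_spec(uint64_t advertised_va, uint32_t r2t_data_offset)
{
	return advertised_va + r2t_data_offset;	/* stateless, any order */
}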
From kliteyn at dev.mellanox.co.il Thu May 1 07:11:08 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 01 May 2008 17:11:08 +0300
Subject: [ofa-general] [PATCH v2] opensm/osm_qos_policy.c: log matched QoS criteria
Message-ID: <4819CF7C.6040606@dev.mellanox.co.il>

Adding log messages for matched criteria of the QoS policy rule.

Signed-off-by: Yevgeny Kliteynik
---
 opensm/opensm/osm_qos_policy.c |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c
index 6c81872..ebe3a7f 100644
--- a/opensm/opensm/osm_qos_policy.c
+++ b/opensm/opensm/osm_qos_policy.c
@@ -598,10 +598,13 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 {
 	osm_qos_match_rule_t *p_qos_match_rule = NULL;
 	cl_list_iterator_t list_iterator;
+	osm_log_t * p_log = &p_qos_policy->p_subn->p_osm->log;

 	if (!cl_list_count(&p_qos_policy->qos_match_rules))
 		return NULL;

+	OSM_LOG_ENTER(p_log);
+
 	/* Go over all QoS match rules and find the one
 	   that matches the request */

 	list_iterator = cl_list_head(&p_qos_policy->qos_match_rules);
@@ -624,6 +627,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
+		OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"Source port matched.\n");
 	}

 	/* If a match rule has Destination groups, PR request dest.
 	   has to be in this list */
@@ -637,6 +642,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
+		OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"Destination port matched.\n");
 	}

 	/* If a match rule has QoS classes, PR request HAS
@@ -655,7 +662,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
-
+		OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"QoS Class matched.\n");
 	}

 	/* If a match rule has Service IDs, PR request HAS
@@ -675,7 +683,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
-
+		OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"Service ID matched.\n");
 	}

 	/* If a match rule has PKeys, PR request HAS
@@ -694,13 +703,16 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params(
 			list_iterator = cl_list_next(list_iterator);
 			continue;
 		}
-
+		OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG,
+			"PKey matched.\n");
 	}

 	/* if we got here, then this match-rule matched this PR request */

 	break;
 	}

+	OSM_LOG_EXIT(p_log);
+
 	if (list_iterator == cl_list_end(&p_qos_policy->qos_match_rules))
 		return NULL;
--
1.5.1.4

From tziporet at dev.mellanox.co.il Thu May 1 07:13:23 2008
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 01 May 2008 17:13:23 +0300
Subject: [ofa-general] Infiniband Card Trouble
In-Reply-To: References:
Message-ID: <4819D003.70508@mellanox.co.il>

Shue, David CTR USAF AFMC AFRL/RITB wrote:
>
> Hello,
>
> I have used the OFED-1.3 software to communicate with the current
> cards I have. These cards come up as “MT23108” in the logs, and I am
> not sure who the manufacturer is. I was able to program the cards,
> and even install MPICH2 and run tests.
>
> I have recently obtained new IB cards from HP “*HP PCI-X 2-port 4X
> Fabric (HPC) Adapter”
> http://h20000.www2.hp.com/bizsupport/TechSupport/Home.jsp?lang=en&cc=id&prodTypeId=12883&prodSeriesId=460713&lang=en&cc=id
>
> *and these cards do not work the same. The machine boots up fine with
> the card in, and shows the card as Mellanox “MT23108” also? The two
> cards are visibly different in every way. Is the MT23108 a certain
> platform for IB?
>
Yes, it is. They are Mellanox PCI-X cards.
Maybe you need to upgrade the FW for the new card.
You can get the new FW and burn it using the instructions on the Mellanox
web site:
http://www.mellanox.com/support/firmware_download.php
Your card is a dual-port InfiniHost PCI-X HCA card (Cougar Cub).

> This is the history of what I did.
>
> 1) Staged the machine RH EL v5
>
> 2) Install the IB card
>
> 3) Boot machine up
>
> 4) Can see the card looking at “lspci” and “dmesg” but nothing in the
> network area or under “ifconfig” (Just like with the first cards)
>
Can you send the output of lspci -vv?

> 5) I then install the OFED-1.3 software to communicate and configure
> the card
>
> 6) When I go to start the card (instead of reboot but have tried both
> ways) /etc/init.d/openib start, it all fails. I then look in the log
> file and see a bunch of “unknown symbol…” and “disagrees…” for all
> items of ib_uverbs, ib_umad, iw_cxgb3, ib_path, mlx_ib, and so on.
>
> 7) When I reboot, the machine reaches “UDEV” of the reboot stage,
> hangs for a little bit, and then many errors show and the machine
> won’t boot, unless I take the card out. If I uninstall the OFED
> software, it will reboot fine with the card still in.
> The card from HP giving me problems, does not appear to have any
> drivers for it. It looks like HP supports it to work on Windows, and
> HPUX.
>
What is the machine type you use? Is it IA64?

Tziporet

From dorfman.eli at gmail.com Thu May 1 07:18:45 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Thu, 1 May 2008 17:18:45 +0300
Subject: [Ips] [Stgt-devel] [ofa-general] Re: Calculating the VA in iSER header
In-Reply-To: References: <4804B03C.6060507@voltaire.com> <694d48600804160122l1cc97b8aka8986ee6deb7dec8@mail.gmail.com> <20080416144830.GC23861@osc.edu> <694d48600804170413g4d54cd9g447abd345a1f6301@mail.gmail.com> <20080429170516.GA8857@osc.edu> <39C75744D164D948A170E9792AF8E7CAF60D50@exil.voltaire.com>
Message-ID: <694d48600805010718n7f02a30ev38d35c50e926d02@mail.gmail.com>

On Thu, May 1, 2008 at 11:24 AM, Ken Sandars wrote:
>
> >> [Ken] It appears the current Linux iSER initiator does not send the
> >> HELLO message when the connection transits to full feature phase. The
> >> stgt target also ignores this message (if it were to appear).
>
> [Ken] The IBTA document does not mention the HELLO/HELLOREPLY messages.
> Implementing this message exchange gives a distinction between the
> current implementations and those that will correctly calculate the
> write_va (as per Pete Wyckoff's option 3).

I agree.

> >> [Ken] Both of these implementations use a non-conformant iSER header
> >> (they add write_va and read_va fields, which incidentally do not
> >> appear to be used). Are these changes documented anywhere in the IB
> >> domain, or are these variations needed for another reason?
> >
> > [Erez] Take a look at the iSER for IB annex:
> > http://www.infinibandta.org/members/spec/Annex_iSER.PDF
>
> [Ken] Ouch. That link requires a username/password. Looks like it is only
> available to members of the InfiniBand Trade Association. Fortunately I
> gained access to it with username "open" and password "standard". ;-)
>
> [Ken] Neither of these implementations sends or examines the iSER CM
> REQ/REP message private data. The document doesn't define what action to
> take when this message is absent. Interestingly, when the target reports
> that "ZBVA shall be used for this connection" and "the target shall issue
> Send with Invalidate as needed", then it appears the iSER header
> specified in RFC 5046 should be used for control-type PDUs. Is there any
> plan to conform with the list of requirements for IBTA compliance?

At the moment these capabilities (ZBVA, Send with Invalidate) are not
supported in the driver, though they seem to be supported by the ConnectX
HCA. Hence, the iSER implementations do not send/examine them. This may be
added to the CM REQ/REP with the current defaults, but in order to use
these capabilities, code would need to be added to the HCA driver and iSER.

> Cheers,
> Ken
From pw at osc.edu Thu May 1 07:26:18 2008
From: pw at osc.edu (Pete Wyckoff)
Date: Thu, 1 May 2008 10:26:18 -0400
Subject: [ofa-general] Re: [Ips] Calculating the VA in iSER header
In-Reply-To: <694d48600805010650l75c2f70bx662456ce85e7e9a5@mail.gmail.com>
References: <4804B03C.6060507@voltaire.com> <694d48600804160122l1cc97b8aka8986ee6deb7dec8@mail.gmail.com> <20080416144830.GC23861@osc.edu> <694d48600804170413g4d54cd9g447abd345a1f6301@mail.gmail.com> <20080429170516.GA8857@osc.edu> <694d48600805010650l75c2f70bx662456ce85e7e9a5@mail.gmail.com>
Message-ID: <20080501142618.GA19304@osc.edu>

dorfman.eli at gmail.com wrote on Thu, 01 May 2008 16:50 +0300:
> InitialR2T=Yes means that R2T is required, hence no unsolicited data.
> Only if both sides, initiator and target, agree on InitialR2T=No is the
> first data burst unsolicited.

Thanks for the explanation. I keep getting that backwards.

> On Tue, Apr 29, 2008 at 8:05 PM, Pete Wyckoff wrote:
> > But you missed the impact of immediate data. We run with the
> > defaults (I think) that say the first write request packet should be
> > filled with a bit of the coming data stream. From iscsid.conf:
> >
> > # To enable immediate data (i.e., the initiator sends unsolicited data
> > # with the iSCSI command packet), uncomment the following line:
> > #
> > # The default is Yes
> > node.session.iscsi.ImmediateData = Yes
> >
> > Looking at the offset printed out by your patch, it is indeed
> > non-zero for the first RDMA read. Please correct me if I am
> > mistaken about this---you must have tested all four variations of
> > with and without the patches on initiator and target side, but I did
> > not.
>
> You are right about the ImmediateData=Yes. I really missed that, so this
> patch will after all break the current target implementation and cause
> data corruption.
> I suggest postponing this patch until we implement the iSER HELLO
> message, and then adding this patch together with the corresponding
> target patch. This will allow the current initiator to work with the
> current target, and a new initiator to work with a new target.
> I still think we should do that, since future iSER implementations will
> probably rely on the spec.

We might as well do the Hello message exchange anyway. As Ken points
out, the spec would approve. We could even use this opportunity to set
the IRD and ORD too, but I'm not sure exactly how that would work in IB
once the connection is up.

Here we're not proposing a new bit in the Hello message to indicate "VA
starts before unsol data"; rather, the lack of a Hello message indicates
an old initiator that gets the VA wrong. That will be easy to detect in
targets.

I wonder if, by supporting Hello, we could remove the use of private
data as specified in the IBTA annex? These negotiated parameters (ZBVA
and Send w/Inval) could be in the Hello exchange. We still have the
need to put VAs in the iSER header to support the non-ZBVA case
(pre-ConnectX IB), though.

Once we get the Hello worked out, it might be time to update RFC 5046
to encompass this hardware model too.

-- Pete
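A sketch of the detection Pete describes, with hypothetical field names (the
real fix would live in the target's iSER code): a target that has seen a
Hello from the initiator can use the spec interpretation directly, while the
absence of a Hello marks an old initiator whose advertised VA is already
advanced past the unsolicited data, assuming the target can recover that
length from the command.

#include <stdbool.h>
#include <stdint.h>

struct iser_conn_compat {
	bool saw_hello;	/* set when a Hello arrives at full feature phase */
};

static uint64_t write_base_va(const struct iser_conn_compat *c,
			      uint64_t advertised_va, uint32_t unsol_len)
{
	if (c->saw_hello)
		return advertised_va;		/* spec: start of buffer */
	return advertised_va - unsol_len;	/* old: undo the skew */
}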
From dorfman.eli at gmail.com Thu May 1 07:32:13 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Thu, 1 May 2008 17:32:13 +0300
Subject: [ofa-general] Re: [PATCH 2/2] IB/iSER: Use offset from r2t header for rdma
In-Reply-To: <694d48600804270555i6ee55843x51c416294fec6397@mail.gmail.com>
References: <694d48600804270555i6ee55843x51c416294fec6397@mail.gmail.com>
Message-ID: <694d48600805010732p43bed1a7q75dd8d8512b275f2@mail.gmail.com>

On Sun, Apr 27, 2008 at 3:55 PM, Eli Dorfman wrote:
> Use offset from r2t header for rdma instead of using
> the internal offset counter.
>
> Signed-off-by: Eli Dorfman
> ---
>  usr/iscsi/iscsi_rdma.c |   16 +++++-----------
>  1 files changed, 5 insertions(+), 11 deletions(-)
>
> diff --git a/usr/iscsi/iscsi_rdma.c b/usr/iscsi/iscsi_rdma.c
> index d46ddff..84f5949 100644
> --- a/usr/iscsi/iscsi_rdma.c
> +++ b/usr/iscsi/iscsi_rdma.c
> @@ -1447,28 +1447,22 @@ static int iscsi_rdma_rdma_read(struct iscsi_connection *conn)
>  	struct iscsi_r2t_rsp *r2t = (struct iscsi_r2t_rsp *) &conn->rsp.bhs;
>  	uint8_t *buf;
>  	uint32_t len;
> +	uint32_t offset;
>  	int ret;
>
>  	buf = (uint8_t *) task->data + task->offset;
>  	len = be32_to_cpu(r2t->data_length);
> +	offset = be32_to_cpu(r2t->data_offset);
>
> -	dprintf("len %u stag %x va %llx\n",
> +	dprintf("len %u stag %x va %llx offset %x\n",
>  		len, itask->rem_write_stag,
> -		(unsigned long long) itask->rem_write_va);
> +		(unsigned long long) itask->rem_write_va, offset);
>
>  	ret = iser_post_rdma_wr(ci, task, buf, len, IBV_WR_RDMA_READ,
> -				itask->rem_write_va, itask->rem_write_stag);
> +				itask->rem_write_va + offset, itask->rem_write_stag);
>  	if (ret < 0)
>  		return ret;
>
> -	/*
> -	 * Initiator registers the entire buffer, but gives us a VA that
> -	 * is advanced by immediate + unsolicited data amounts. Advance
> -	 * rem_va as we read, knowing that the target always grabs segments
> -	 * in order.
> -	 */
> -	itask->rem_write_va += len;
> -
>  	return 0;
> }
>
> --
> 1.5.5

Please do not apply this patch until we decide how to sync this with the
initiator side. See the following discussion for details:
http://www.ietf.org/mail-archive/web/ips/current/msg02506.html

I tend to agree with Pete's option (3): implementing the iSER HELLO message
in the initiator and target, then adding this patch and the corresponding
initiator patch, so that we have:
Old initiator working with old target, AND
New initiator working with new target.

Eli

From dorfman.eli at gmail.com Thu May 1 07:35:48 2008
From: dorfman.eli at gmail.com (Eli Dorfman)
Date: Thu, 1 May 2008 17:35:48 +0300
Subject: [ofa-general] Re: [PATCH 1/2] IB/iSER: Do not add unsolicited data offset to VA in iSER header
In-Reply-To: <694d48600804270553u36b776ame9695a8858dd278@mail.gmail.com>
References: <694d48600804270553u36b776ame9695a8858dd278@mail.gmail.com>
Message-ID: <694d48600805010735k4836e955jabde51ddaf85d645@mail.gmail.com>

On Sun, Apr 27, 2008 at 3:53 PM, Eli Dorfman wrote:
> iSER initiator sends a VA (in the iSER header) which includes
> an offset for the unsolicited data (which is wrong according to the spec).
> > Signed-off-by: Eli Dorfman > Signed-off-by: Erez Zilber > --- > drivers/infiniband/ulp/iser/iser_initiator.c | 6 +++--- > 1 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c > b/drivers/infiniband/ulp/iser/iser_initiator.c > index 08dc81c..5c2bbc6 100644 > --- a/drivers/infiniband/ulp/iser/iser_initiator.c > +++ b/drivers/infiniband/ulp/iser/iser_initiator.c > @@ -154,12 +154,12 @@ iser_prepare_write_cmd(struct iscsi_cmd_task *ctask, > if (unsol_sz < edtl) { > hdr->flags |= ISER_WSV; > hdr->write_stag = cpu_to_be32(regd_buf->reg.rkey); > - hdr->write_va = cpu_to_be64(regd_buf->reg.va + unsol_sz); > + hdr->write_va = cpu_to_be64(regd_buf->reg.va); > > iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X " > - "VA:%#llX + unsol:%d\n", > + "VA:%#llX\n", > ctask->itt, regd_buf->reg.rkey, > - (unsigned long long)regd_buf->reg.va, unsol_sz); > + (unsigned long long)regd_buf->reg.va); > } > > if (imm_sz > 0) { > -- > 1.5.5 > Please do not apply this patch until we decide how to sync this with the target side. Thanks, Eli From shemminger at vyatta.com Thu May 1 07:56:06 2008 From: shemminger at vyatta.com (Stephen Hemminger) Date: Thu, 1 May 2008 07:56:06 -0700 Subject: [ofa-general] Re: [PATCH 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <20080430171955.31725.7771.stgit@localhost.localdomain> References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171955.31725.7771.stgit@localhost.localdomain> Message-ID: <20080501075606.4963afa3@extreme> On Wed, 30 Apr 2008 22:49:55 +0530 Ramachandra K wrote: > From: Amar Mudrankit > > The sysfs interface for the QLogic VNIC driver is implemented through > this patch. > > Signed-off-by: Ramachandra K > Signed-off-by: Poornima Kamath > --- > > drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1127 +++++++++++++++++++++++++++ > drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h | 62 + > 2 files changed, 1189 insertions(+), 0 deletions(-) > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h > > diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c > new file mode 100644 > index 0000000..7e70b0c > --- /dev/null > +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c > @@ -0,0 +1,1127 @@ > +/* > + * Copyright (c) 2006 QLogic, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +#include > +#include > +#include > + > +#include "vnic_util.h" > +#include "vnic_config.h" > +#include "vnic_ib.h" > +#include "vnic_viport.h" > +#include "vnic_main.h" > +#include "vnic_stats.h" > + > +/* > + * target eiocs are added by writing > + * > + * ioc_guid=,dgid=,pkey=,name= > + * to the create_primary sysfs attribute. > + */ > +enum { > + VNIC_OPT_ERR = 0, > + VNIC_OPT_IOC_GUID = 1 << 0, > + VNIC_OPT_DGID = 1 << 1, > + VNIC_OPT_PKEY = 1 << 2, > + VNIC_OPT_NAME = 1 << 3, > + VNIC_OPT_INSTANCE = 1 << 4, > + VNIC_OPT_RXCSUM = 1 << 5, > + VNIC_OPT_TXCSUM = 1 << 6, > + VNIC_OPT_HEARTBEAT = 1 << 7, > + VNIC_OPT_IOC_STRING = 1 << 8, > + VNIC_OPT_IB_MULTICAST = 1 << 9, > + VNIC_OPT_ALL = (VNIC_OPT_IOC_GUID | > + VNIC_OPT_DGID | VNIC_OPT_NAME | VNIC_OPT_PKEY), > +}; > + > +static match_table_t vnic_opt_tokens = { > + {VNIC_OPT_IOC_GUID, "ioc_guid=%s"}, > + {VNIC_OPT_DGID, "dgid=%s"}, > + {VNIC_OPT_PKEY, "pkey=%x"}, > + {VNIC_OPT_NAME, "name=%s"}, > + {VNIC_OPT_INSTANCE, "instance=%d"}, > + {VNIC_OPT_RXCSUM, "rx_csum=%s"}, > + {VNIC_OPT_TXCSUM, "tx_csum=%s"}, > + {VNIC_OPT_HEARTBEAT, "heartbeat=%d"}, > + {VNIC_OPT_IOC_STRING, "ioc_string=\"%s"}, > + {VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"}, > + {VNIC_OPT_ERR, NULL} > +}; > NO 1. Most of this shouldn't be done via sysfs (rx_csum, tx_csum, ...) 2. Sysfs is one value per file not name=value From shemminger at vyatta.com Thu May 1 07:58:16 2008 From: shemminger at vyatta.com (Stephen Hemminger) Date: Thu, 1 May 2008 07:58:16 -0700 Subject: [ofa-general] Re: [PATCH 11/13] QLogic VNIC: Driver utility file - implements various utility macros In-Reply-To: <20080430172126.31725.48554.stgit@localhost.localdomain> References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430172126.31725.48554.stgit@localhost.localdomain> Message-ID: <20080501075816.1010ec3a@extreme> On Wed, 30 Apr 2008 22:51:26 +0530 Ramachandra K wrote: > From: Poornima Kamath > > This patch adds the driver utility file which mainly contains utility > macros for debugging of QLogic VNIC driver. > > Signed-off-by: Ramachandra K > Signed-off-by: Amar Mudrankit > --- > > drivers/infiniband/ulp/qlgc_vnic/vnic_util.h | 251 ++++++++++++++++++++++++++ > 1 files changed, 251 insertions(+), 0 deletions(-) > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_util.h > > diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h > new file mode 100644 > index 0000000..4d7d540 > --- /dev/null > +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h > @@ -0,0 +1,251 @@ > +/* > + * Copyright (c) 2006 QLogic, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +#ifndef VNIC_UTIL_H_INCLUDED > +#define VNIC_UTIL_H_INCLUDED > + > +#define MODULE_NAME "QLGC_VNIC" > + > +#define VNIC_MAJORVERSION 1 > +#define VNIC_MINORVERSION 1 > + > +#define is_power_of2(value) (((value) & ((value - 1))) == 0) > +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1))) In kernel.h already > +extern u32 vnic_debug; Use msg level macros instead? > + > +enum { > + DEBUG_IB_INFO = 0x00000001, > + DEBUG_IB_FUNCTION = 0x00000002, > + DEBUG_IB_FSTATUS = 0x00000004, > + DEBUG_IB_ASSERTS = 0x00000008, > + DEBUG_CONTROL_INFO = 0x00000010, > + DEBUG_CONTROL_FUNCTION = 0x00000020, > + DEBUG_CONTROL_PACKET = 0x00000040, > + DEBUG_CONFIG_INFO = 0x00000100, > + DEBUG_DATA_INFO = 0x00001000, > + DEBUG_DATA_FUNCTION = 0x00002000, > + DEBUG_NETPATH_INFO = 0x00010000, > + DEBUG_VIPORT_INFO = 0x00100000, > + DEBUG_VIPORT_FUNCTION = 0x00200000, > + DEBUG_LINK_STATE = 0x00400000, > + DEBUG_VNIC_INFO = 0x01000000, > + DEBUG_VNIC_FUNCTION = 0x02000000, > + DEBUG_MCAST_INFO = 0x04000000, > + DEBUG_MCAST_FUNCTION = 0x08000000, > + DEBUG_SYS_INFO = 0x10000000, > + DEBUG_SYS_VERBOSE = 0x40000000 > +}; > + > +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_DEBUG > +#define PRINT(level, x, fmt, arg...) \ > + printk(level "%s: %s: %s, line %d: " fmt, \ > + MODULE_NAME, x, __FILE__, __LINE__, ##arg) > + > +#define PRINT_CONDITIONAL(level, x, condition, fmt, arg...) \ > + do { \ > + if (condition) \ > + printk(level "%s: %s: %s, line %d: " fmt, \ > + MODULE_NAME, x, __FILE__, __LINE__, \ > + ##arg); \ > + } while (0) > +#else > +#define PRINT(level, x, fmt, arg...) \ > + printk(level "%s: " fmt, MODULE_NAME, ##arg) > + > +#define PRINT_CONDITIONAL(level, x, condition, fmt, arg...) \ > + do { \ > + if (condition) \ > + printk(level "%s: %s: " fmt, \ > + MODULE_NAME, x, ##arg); \ > + } while (0) > +#endif /*CONFIG_INFINIBAND_QLGC_VNIC_DEBUG*/ > + > +#define IB_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "IB", fmt, ##arg) > +#define IB_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "IB", fmt, ##arg) > + > +#define IB_FUNCTION(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "IB", \ > + (vnic_debug & DEBUG_IB_FUNCTION), \ > + fmt, ##arg) > + > +#define IB_INFO(fmt, arg...) 
\ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "IB", \ > + (vnic_debug & DEBUG_IB_INFO), \ > + fmt, ##arg) > + > +#define IB_ASSERT(x) \ > + do { \ > + if ((vnic_debug & DEBUG_IB_ASSERTS) && !(x)) \ > + panic("%s assertion failed, file: %s," \ > + " line %d: ", \ > + MODULE_NAME, __FILE__, __LINE__) \ > + } while (0) > + > +#define CONTROL_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "CONTROL", fmt, ##arg) > +#define CONTROL_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "CONTROL", fmt, ##arg) > + > +#define CONTROL_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "CONTROL", \ > + (vnic_debug & DEBUG_CONTROL_INFO), \ > + fmt, ##arg) > + > +#define CONTROL_FUNCTION(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "CONTROL", \ > + (vnic_debug & DEBUG_CONTROL_FUNCTION), \ > + fmt, ##arg) > + > +#define CONTROL_PACKET(pkt) \ > + do { \ > + if (vnic_debug & DEBUG_CONTROL_PACKET) \ > + control_log_control_packet(pkt); \ > + } while (0) > + > +#define CONFIG_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "CONFIG", fmt, ##arg) > +#define CONFIG_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "CONFIG", fmt, ##arg) > + > +#define CONFIG_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "CONFIG", \ > + (vnic_debug & DEBUG_CONFIG_INFO), \ > + fmt, ##arg) > + > +#define DATA_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "DATA", fmt, ##arg) > +#define DATA_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "DATA", fmt, ##arg) > + > +#define DATA_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "DATA", \ > + (vnic_debug & DEBUG_DATA_INFO), \ > + fmt, ##arg) > + > +#define DATA_FUNCTION(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "DATA", \ > + (vnic_debug & DEBUG_DATA_FUNCTION), \ > + fmt, ##arg) > + > + > +#define MCAST_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "MCAST", fmt, ##arg) > +#define MCAST_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "MCAST", fmt, ##arg) > + > +#define MCAST_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "MCAST", \ > + (vnic_debug & DEBUG_MCAST_INFO), \ > + fmt, ##arg) > + > +#define MCAST_FUNCTION(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "MCAST", \ > + (vnic_debug & DEBUG_MCAST_FUNCTION), \ > + fmt, ##arg) > + > +#define NETPATH_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "NETPATH", fmt, ##arg) > +#define NETPATH_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "NETPATH", fmt, ##arg) > + > +#define NETPATH_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "NETPATH", \ > + (vnic_debug & DEBUG_NETPATH_INFO), \ > + fmt, ##arg) > + > +#define VIPORT_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "VIPORT", fmt, ##arg) > +#define VIPORT_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "VIPORT", fmt, ##arg) > + > +#define VIPORT_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "VIPORT", \ > + (vnic_debug & DEBUG_VIPORT_INFO), \ > + fmt, ##arg) > + > +#define VIPORT_FUNCTION(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "VIPORT", \ > + (vnic_debug & DEBUG_VIPORT_FUNCTION), \ > + fmt, ##arg) > + > +#define LINK_STATE(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "LINK", \ > + (vnic_debug & DEBUG_LINK_STATE), \ > + fmt, ##arg) > + > +#define VNIC_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "NIC", fmt, ##arg) > +#define VNIC_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "NIC", fmt, ##arg) > +#define VNIC_INIT(fmt, arg...) \ > + PRINT(KERN_INFO, "NIC", fmt, ##arg) > + > +#define VNIC_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "NIC", \ > + (vnic_debug & DEBUG_VNIC_INFO), \ > + fmt, ##arg) > + > +#define VNIC_FUNCTION(fmt, arg...) 
\ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "NIC", \ > + (vnic_debug & DEBUG_VNIC_FUNCTION), \ > + fmt, ##arg) > + > +#define SYS_PRINT(fmt, arg...) \ > + PRINT(KERN_INFO, "SYS", fmt, ##arg) > +#define SYS_ERROR(fmt, arg...) \ > + PRINT(KERN_ERR, "SYS", fmt, ##arg) > + > +#define SYS_INFO(fmt, arg...) \ > + PRINT_CONDITIONAL(KERN_INFO, \ > + "SYS", \ > + (vnic_debug & DEBUG_SYS_INFO), \ > + fmt, ##arg) > + > +#endif /* VNIC_UTIL_H_INCLUDED */ Many of these are already in standard macros pr_info, pr_err etc. From rdreier at cisco.com Thu May 1 08:07:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 May 2008 08:07:18 -0700 Subject: [ofa-general] [PATCH] IB/ipoib: fix net queue lockup In-Reply-To: <1209625285.1790.24.camel@mtls03> (Eli Cohen's message of "Thu, 01 May 2008 10:01:25 +0300") References: <1209577156.1790.11.camel@mtls03> <1209625285.1790.24.camel@mtls03> Message-ID: > I agree, but I want to have a larger buffer to absorb larger picks. For > example, after applying this patch I tested how many times the net queue > is stopped and woken up when running four streams of netperf, udp, small > packets. When using the default 64 tx queue size it happened 500 times. > When I used a 256 tx queue size it happened only 37 times. This makes me > think that we have larger picks that a larger queue size can help > handle. OK, that makes sense -- although did you see any performance difference? > Also looking for example on Broadcom bnx2 driver on my machine, it uses > a 1000 tx queue len. Isn't that the software queue above the hardware? (That's what txqueuelen in ifconfig is reporting) From rdreier at cisco.com Thu May 1 08:08:56 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 01 May 2008 08:08:56 -0700 Subject: [ofa-general] Infiniband Card Trouble In-Reply-To: (David Shue's message of "Thu, 1 May 2008 09:09:12 -0400") References: Message-ID: > 7) When I reboot, the machine reaches "UDEV" of the reboot stage, > hangs for a little bit, and then many errors show and the machine won't > boot, unless I take the card out. If I uninstall the OFED software, it > will reboot fine with the card still in. The card from HP giving me > problems, does not appear to have any drivers for it. It looks like HP > supports it to work on Windows, and HPUX. What are the errors? This is too vague to solve without the actual console output. From michael.heinz at qlogic.com Thu May 1 08:20:31 2008 From: michael.heinz at qlogic.com (Mike Heinz) Date: Thu, 1 May 2008 10:20:31 -0500 Subject: [ofa-general] Infiniband Card Trouble In-Reply-To: References: Message-ID: #6 makes it sound like it's an ofed installation issue rather than the HCA itself. Could you post the relevant /var/log/messages? Messages from ib_mthca would be especially important. In addition, the output from mstflint -d q could also be useful. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Shue, David CTR USAF AFMC AFRL/RITB Sent: Thursday, May 01, 2008 9:09 AM To: general at lists.openfabrics.org Subject: [ofa-general] Infiniband Card Trouble Hello, I have used the OFED-1.3 software to communicate with the current cards I have. These cards come up as "MT23108" in the logs, and I am not sure whom the manufacturer is. I was able to program the cards, and even install MPICH2 and run tests. 
I have recently obtained new IB cards from HP, "HP PCI-X 2-port 4X Fabric
(HPC) Adapter"
http://h20000.www2.hp.com/bizsupport/TechSupport/Home.jsp?lang=en&cc=id&prodTypeId=12883&prodSeriesId=460713&lang=en&cc=id
and these cards do not work the same. The machine boots up fine with the
card in, and also shows this card as a Mellanox "MT23108", yet the two cards
are visibly different in every way. Is the MT23108 a certain platform for
IB? I am new to the entire IB technology.

This is the history of what I did.
1) Staged the machine with RHEL v5
2) Installed the IB card
3) Booted the machine up
4) Can see the card looking at "lspci" and "dmesg", but nothing in the
network area or under "ifconfig" (just like with the first cards)
5) I then installed the OFED-1.3 software to communicate with and configure
the card
6) When I go to start the card (instead of rebooting, but I have tried both
ways) with /etc/init.d/openib start, it all fails. I then look in the log
file and see a bunch of "unknown symbol..." and "disagrees..." messages for
all items: ib_uverbs, ib_umad, iw_cxgb3, ib_path, mlx_ib, and so on.
7) When I reboot, the machine reaches "UDEV" of the reboot stage, hangs for
a little bit, and then many errors show and the machine won't boot, unless I
take the card out. If I uninstall the OFED software, it will reboot fine
with the card still in.

The card from HP giving me problems does not appear to have any drivers for
it. It looks like HP supports it on Windows and HP-UX.

I'm looking for any help you can provide.
Thanks in advance,
Dave

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
David Shue
Systems Specialist
Computer Sciences Corporation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

From eli at dev.mellanox.co.il Thu May 1 08:37:06 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Thu, 01 May 2008 18:37:06 +0300
Subject: [ofa-general] [PATCH] IB/ipoib: fix net queue lockup
In-Reply-To: References: <1209577156.1790.11.camel@mtls03> <1209625285.1790.24.camel@mtls03>
Message-ID: <1209656226.1790.39.camel@mtls03>

On Thu, 2008-05-01 at 08:07 -0700, Roland Dreier wrote:
> OK, that makes sense -- although did you see any performance difference?

Yes. With four streams on a 4-core machine, the senders sum up to:
898 * 10^6 bits/sec @ 256 tx queue length
756 * 10^6 bits/sec @ 64 tx queue length

> > Also, looking for example at the Broadcom bnx2 driver on my machine, it
> > uses a 1000 tx queue len.
>
> Isn't that the software queue above the hardware? (That's what
> txqueuelen in ifconfig is reporting)

I wrongly assumed this was the hardware queue size. Now I looked at the
driver code, and if I did not make any mistake in the calculations, the
hardware queue size is... 256.
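The stop/wake events Eli is counting come from the usual tx-ring pattern,
sketched here for a hypothetical driver (not the actual ipoib code): the
ring is the "shock absorber", and every time it fills, the stack pays a
stop/wake round trip.

#include <linux/netdevice.h>

#define TX_RING_SIZE 256	/* the queue length under discussion */

struct toy_priv {
	unsigned int tx_head, tx_tail;	/* producer/consumer counters */
};

static int toy_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct toy_priv *p = netdev_priv(dev);

	/* ... post skb into ring slot tx_head % TX_RING_SIZE ... */
	if (++p->tx_head - p->tx_tail == TX_RING_SIZE)
		netif_stop_queue(dev);	/* ring full: stop before overflow */
	return NETDEV_TX_OK;
}

static void toy_tx_complete(struct net_device *dev)
{
	struct toy_priv *p = netdev_priv(dev);

	p->tx_tail++;			/* one send completion reaped */
	if (netif_queue_stopped(dev) &&
	    p->tx_head - p->tx_tail <= TX_RING_SIZE / 2)
		netif_wake_queue(dev);	/* hysteresis: wake at half-empty */
}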
From ramachandra.kuchimanchi at qlogic.com Thu May 1 09:02:14 2008
From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra (Contractor - ))
Date: Thu, 1 May 2008 11:02:14 -0500
Subject: [ofa-general] RE: [PATCH 08/13] QLogic VNIC: sysfs interface implementation for the driver
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171955.31725.7771.stgit@localhost.localdomain> <20080501075606.4963afa3@extreme>
Message-ID:

Stephen,

Stephen Hemminger [mailto:shemminger at vyatta.com] wrote:

>> On Wed, 30 Apr 2008 22:49:55 +0530
>> Ramachandra K wrote:

>> +static match_table_t vnic_opt_tokens = {
>> +	{VNIC_OPT_IOC_GUID, "ioc_guid=%s"},
>> +	{VNIC_OPT_DGID, "dgid=%s"},
>> +	{VNIC_OPT_PKEY, "pkey=%x"},
>> +	{VNIC_OPT_NAME, "name=%s"},
>> +	{VNIC_OPT_INSTANCE, "instance=%d"},
>> +	{VNIC_OPT_RXCSUM, "rx_csum=%s"},
>> +	{VNIC_OPT_TXCSUM, "tx_csum=%s"},
>> +	{VNIC_OPT_HEARTBEAT, "heartbeat=%d"},
>> +	{VNIC_OPT_IOC_STRING, "ioc_string=\"%s"},
>> +	{VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"},
>> +	{VNIC_OPT_ERR, NULL}
>> +};

> NO
> 1. Most of this shouldn't be done via sysfs (rx_csum, tx_csum, ...)
> 2. Sysfs is one value per file not name=value

The VNIC driver needs multiple parameters (IOC GUID, DGID, etc.) from user
space to connect to the EVIC. For this, the "name=value" mechanism is used
for a write-only sysfs file as an input method to the driver.

The driver follows the one-value-per-file sysfs rule when it returns any
data, with each readable file returning only a single value.

Regards,
Ram

From ramachandra.kuchimanchi at qlogic.com Thu May 1 09:18:53 2008
From: ramachandra.kuchimanchi at qlogic.com (Kuchimanchi, Ramachandra (Contractor - ))
Date: Thu, 1 May 2008 11:18:53 -0500
Subject: [ofa-general] RE: [PATCH 11/13] QLogic VNIC: Driver utility file - implements various utility macros
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430172126.31725.48554.stgit@localhost.localdomain> <20080501075816.1010ec3a@extreme>
Message-ID:

Stephen,

Stephen Hemminger [mailto:shemminger at vyatta.com] wrote:

> Ramachandra K wrote:

>> +#define is_power_of2(value) (((value) & ((value - 1))) == 0)
>> +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1)))

> In kernel.h already

Will fix this. Thanks.

>> +extern u32 vnic_debug;

> Use msg level macros instead?

I am sorry, I did not understand this comment.

> +
> +#define SYS_INFO(fmt, arg...) \
> +	PRINT_CONDITIONAL(KERN_INFO, \
> +			  "SYS", \
> +			  (vnic_debug & DEBUG_SYS_INFO), \
> +			  fmt, ##arg)
> +
> +#endif /* VNIC_UTIL_H_INCLUDED */

> Many of these are already in standard macros pr_info, pr_err etc.

These macros are for providing a debug log level functionality through
the vnic_debug module parameter.

Regards,
Ram
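For context, "msg level macros" presumably refers to the stock netif_msg_*
infrastructure in netdevice.h, which gives the same runtime filtering as
vnic_debug without driver-private PRINT wrappers. A minimal sketch with
hypothetical driver names, not the VNIC code:

#include <linux/module.h>
#include <linux/netdevice.h>

static int debug = -1;			/* -1 lets netif_msg_init() choose */
module_param(debug, int, 0);
MODULE_PARM_DESC(debug, "message level bitmap (see NETIF_MSG_*)");

struct toyvnic_priv {
	u32 msg_enable;			/* tested by the netif_msg_*() macros */
	struct net_device *netdev;
};

static void toyvnic_init_msg(struct toyvnic_priv *priv)
{
	priv->msg_enable = netif_msg_init(debug,
					  NETIF_MSG_DRV | NETIF_MSG_LINK);

	if (netif_msg_link(priv))	/* would replace e.g. LINK_STATE(...) */
		printk(KERN_INFO "%s: link is up\n", priv->netdev->name);
}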
On Thu, May 1, 2008 at 9:32 PM,  wrote:
>
> Stephen,
>
> Stephen Hemminger [mailto:shemminger at vyatta.com] wrote:
>
> >> On Wed, 30 Apr 2008 22:49:55 +0530
> >> Ramachandra K wrote:
>
> >> +static match_table_t vnic_opt_tokens = {
> >> +	{VNIC_OPT_IOC_GUID, "ioc_guid=%s"},
> >> +	{VNIC_OPT_DGID, "dgid=%s"},
> >> +	{VNIC_OPT_PKEY, "pkey=%x"},
> >> +	{VNIC_OPT_NAME, "name=%s"},
> >> +	{VNIC_OPT_INSTANCE, "instance=%d"},
> >> +	{VNIC_OPT_RXCSUM, "rx_csum=%s"},
> >> +	{VNIC_OPT_TXCSUM, "tx_csum=%s"},
> >> +	{VNIC_OPT_HEARTBEAT, "heartbeat=%d"},
> >> +	{VNIC_OPT_IOC_STRING, "ioc_string=\"%s"},
> >> +	{VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"},
> >> +	{VNIC_OPT_ERR, NULL}
> >> +};
> >>
>
> > NO
> > 1. Most of this shouldn't be done via sysfs (rx_csum, tx_csum, ...)
> > 2. Sysfs is one value per file not name=value
>
> The VNIC driver needs multiple parameters (IOC GUID, DGID, etc.) from
> user space to connect to the EVIC. For this, the "name=value"
> mechanism is used on a single write-only sysfs file as the input
> method to the driver.
>
> The driver does follow the one-value-per-file sysfs rule whenever it
> returns data: each readable file returns only a single value.
>
> Regards,
> Ram

From ramachandra.kuchimanchi at qlogic.com  Thu May  1 10:01:10 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 1 May 2008 22:31:10 +0530
Subject: [ofa-general] RE: [PATCH 11/13] QLogic VNIC: Driver utility file -
	implements various utility macros
In-Reply-To: 
References: <20080430171028.31725.86190.stgit@localhost.localdomain>
	<20080430172126.31725.48554.stgit@localhost.localdomain>
	<20080501075816.1010ec3a@extreme>
Message-ID: <71d336490805011001he7359dfw831470986d87b385@mail.gmail.com>

Sorry for the resend. Original mail bounced from netdev.

On Thu, May 1, 2008 at 9:48 PM  wrote:
> Stephen,
>
> Stephen Hemminger [mailto:shemminger at vyatta.com] wrote:
>
> > Ramachandra K wrote:
>
> >> +#define is_power_of2(value) (((value) & ((value - 1))) == 0)
> >> +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1)))
>
> > In kernel.h already
>
> Will fix this. Thanks.
>
> >> +extern u32 vnic_debug;
>
> > Use msg level macros instead?
>
> I am sorry, I did not understand this comment.
>
> > +
> > +#define SYS_INFO(fmt, arg...) \
> > +	PRINT_CONDITIONAL(KERN_INFO, \
> > +			  "SYS", \
> > +			  (vnic_debug & DEBUG_SYS_INFO), \
> > +			  fmt, ##arg)
> > +
> > +#endif	/* VNIC_UTIL_H_INCLUDED */
>
> > Many of these are already in standard macros pr_info, pr_err etc.
>
> These macros provide a debug log level facility controlled through the
> vnic_debug module parameter.
>
> Regards,
> Ram

From David.Shue.ctr at rl.af.mil  Thu May  1 10:09:28 2008
From: David.Shue.ctr at rl.af.mil (Shue, David CTR USAF AFMC AFRL/RITB)
Date: Thu, 1 May 2008 13:09:28 -0400
Subject: [ofa-general] RE: HP PCI-X 2-port 4X Fabric (HPC) Adapter
In-Reply-To: <48188364.9030809@systemfabricworks.com>
References: <00d101c8aac6$1e06ae60$6401a8c0@YOURCB10AA3FFD>
	<48188364.9030809@systemfabricworks.com>
Message-ID: 

ALL:

I appreciate everyone's help. This is where I stand; I am putting the
info inline in the email in case the firewall decides to block an
attachment.

I wanted to update the FW as some have suggested. I tried using both
"flint" and "mstflint", and both give the same output, shown below,
which does not report any PSID, so I cannot pick out the correct
Mellanox FW update. I am also including the dmesg, lspci, and mst
status information.

****flint****

./flint -d /dev/mst/mt23108_pci_cr0 query
Image type:      Failsafe
I.S. Version:    1
Chip Revision:   A1
Description:     Node             Port1            Port2            Sys image
GUIDs:           0008f10403972174 0008f10403972175 0008f10403972176 0008f10403972177
Board ID:
VSD:
PSID:

***** MST STATUS *****

mst status
MST modules:
------------
MST PCI module loaded
MST PCI configuration module loaded
MST Calibre (I2C) module is not loaded

MST devices:
------------
/dev/mst/mt23108_pciconf0  - PCI configuration cycles access.
                             bus:dev.fn=09:01.0 addr.reg=88 data.reg=92
                             Chip revision is: A1
/dev/mst/mt23108_pci_cr0   - PCI direct access.
                             bus:dev.fn=0a:00.0 bar=0xd8000000 size=0x100000
                             Chip revision is: A1
/dev/mst/mt23108_pci_ddr0  - PCI direct access.
                             bus:dev.fn=0a:00.0 bar=0xc0000000 size=0x8000000
/dev/mst/mt23108_pci_uar0  - PCI direct access.
                             bus:dev.fn=0a:00.0 bar=0xc8000000 size=0x800000

****DMESG****

ib_mthca 0000:0a:00.0: HCA FW version 3.3.2 is old (3.4.0 is current).
ib_mthca 0000:0a:00.0: If you have problems, try updating your HCA FW.

****LSPCI****

09:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
0a:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)

The original card (manufactured in Israel), which works under OFED,
returned "HP_0040000001" when I ran "ibv_devinfo". These boards were
ordered at different times, but the same part number was used. The one
that does not work is clearly labeled "HP" and says it was manufactured
in the USA. The IB card that works does not clearly state a
manufacturer, but it would appear to be HP, seeing that the board_id is
"HP_0040000001".

Does anyone have any ideas where I should go with all this at this
point? If you require any further logs, please let me know. I
appreciate your help GREATLY.

-Dave

-----Original Message-----
From: David McMillen [mailto:davem at systemfabricworks.com]
Sent: Wednesday, April 30, 2008 10:34 AM
To: Shue, David CTR USAF AFMC AFRL/RITB
Subject: Re: HP PCI-X 2-port 4X Fabric (HPC) Adapter

________________________________

	From: Shue, David CTR USAF AFMC AFRL/RITB
[mailto:David.Shue.ctr at rl.af.mil]
	Sent: Wednesday, April 30, 2008 6:17 AM
	To: membership at openfabrics.org
	Subject: HP PCI-X 2-port 4X Fabric (HPC) Adapter

	I have used the OFED-1.3 software to communicate to the Mellanox
HPC I use. However, the OFED-1.3 does not appear to work with the
subject HPC card. The card is an HPC 380299-B21. Is there any
information you may provide in how to communicate to this card? Thank
you.

	>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	David Shue
	Systems Specialist
	Computer Sciences Corporation
	<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

This appears to be an older PCI-X card, and it may have a firmware
level that is too old to be supported. There may be some information
printed by the driver on startup, so look at dmesg or
/var/log/messages.

According to the OFED 1.3 release notes, the firmware level needs to
be 3.5.000, which happens to be the latest released by Mellanox.

Does the ibv_devinfo command give any output? If so, it will show the
firmware level. Otherwise, perhaps the tvflash -i command will tell
you. If that does not work, please send me the output of both "lspci"
and "lspci -n" and I will see if there is any obvious reason from the
PCI identification.

You can get new firmware from Mellanox at
http://www.mellanox.com/support/firmware_table_IH.php

Be very careful to match up the PSID of your existing card with the
firmware, as there are enough differences between the models that the
wrong firmware might render your card useless.

Let me know how this works out for you, or if you need more information.
Dave McMillen

From arlin.r.davis at intel.com  Thu May  1 10:52:08 2008
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Thu, 1 May 2008 10:52:08 -0700
Subject: [ofa-general] [ANNOUNCE] dapl-1.2.6 and dapl-2.0.8 release
Message-ID: <000401c8abb4$1334a7a0$f9b7020a@amr.corp.intel.com>

New releases for uDAPL v1 (1.2.6) and v2 (2.0.8) are available at:

http://www.openfabrics.org/downloads/dapl

md5sum: 752ae54a93b4883c88b41241f52db4ab  dapl-1.2.6.tar.gz
md5sum: a48f9da59318c395bcc6ad170226764a  dapl-2.0.8.tar.gz

Vlad, please pull into OFED 1.3.1 using package spec files and installing:

dapl-1.2.6-1
dapl-devel-1.2.6-1
dapl-2.0.8-1
dapl-utils-2.0.8-1
dapl-devel-2.0.8-1
dapl-debuginfo-2.0.8-1

tags: dapl-1.2.6-1, dapl-2.0.8-1

Summary of changes since the last release:

v2    - add private data exchange with reject
v1,v2 - better error reporting in non-debug builds
v1,v2 - update only OFA entries in dat.conf, cooperate with non-OFA providers
v1,v2 - support for zero byte operations, iov==NULL
v1,v2 - multi-transport support for inline data and private data differences
v1,v2 - fix memory leaks and other reported bugs since OFED 1.3

Thanks,

-arlin

From andrea at qumranet.com  Thu May  1 11:12:56 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Thu, 1 May 2008 20:12:56 +0200
Subject: [ofa-general] mmu notifier-core v14->v15 diff for review
In-Reply-To: <20080426164511.GJ9514@duo.random>
References: <20080426164511.GJ9514@duo.random>
Message-ID: <20080501181256.GK8150@duo.random>

Hello everyone,

this is the v14 to v15 difference of the mmu-notifier-core patch. This
is just for review of the difference; I'll post the full v15 soon.
Please review the diff in the meantime.

Many of these cleanups are thanks to Andrew's review of
mmu-notifier-core in v14. He also spotted the GFP_KERNEL allocation
under spin_lock, where DEBUG_SPINLOCK_SLEEP failed to catch it until I
enabled PREEMPT (GFP_KERNEL there was perfectly safe with the whole
patchset applied, but not ok if only mmu-notifier-core was applied).
As usual, that bug couldn't hurt anybody unless the mmu notifiers were
armed.

I also wrote a proper changelog for the mmu-notifier-core patch, which
I will append before the v14->v15 diff:

Subject: mmu-notifier-core

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean "secondary pte". In the GRU case
there's no actual secondary pte and there's only a secondary tlb,
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case with KVM, where a
secondary tlb miss will walk sptes in hardware and will refill the
secondary tlb transparently to software if the corresponding spte is
present).

The same way zap_page_range has to invalidate the pte before freeing
the page, the spte (and secondary tlb) must also be invalidated before
any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte, because they're part of the guest working set.
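(To illustrate the pinning pattern just described -- a sketch only,
using the current get_user_pages() interface; this is not kvm's actual
code:)

#include <linux/mm.h>
#include <linux/sched.h>

/* Hypothetical secondary-MMU driver: to keep an spte pointing at a
 * user page it must hold a page reference, so the page can never be
 * swapped while the spte exists. */
static int pin_and_map_spte(unsigned long addr, int write)
{
	struct page *page;
	int ret;

	down_read(&current->mm->mmap_sem);
	ret = get_user_pages(current, current->mm, addr, 1, write, 0,
			     &page, NULL);
	up_read(&current->mm->mmap_sem);
	if (ret != 1)
		return -EFAULT;

	/* ... install the spte pointing at page ... */

	/* the page_count pin is only dropped when the spte is torn
	 * down; this is exactly what mmu notifiers make unnecessary */
	return 0;
}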
Furthermore a spte unmap event can immediately lead to a page being
freed when the pin is released (so requiring the same complex and
relatively slow tlb_gather smp-safe logic we have in zap_page_range,
which can be avoided completely if the spte unmap event doesn't
require an unpin of the page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
know when the VM is swapping or freeing or doing anything on the
primary MMU, so that the secondary MMU code can drop sptes before the
pages are freed, avoiding all page pinning and allowing 100% reliable
swapping of guest physical address space. Furthermore it avoids
requiring the code that tears down the mappings of the secondary MMU
to implement a logic like tlb_gather in zap_page_range, which would
require many IPIs to flush other cpu tlbs for each fixed number of
sptes unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings
will be invalidated, and the next secondary-mmu page fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an updated
spte or secondary-tlb mapping on the copied page. Or it will set up a
readonly spte or readonly tlb mapping if it's a guest read, if it
calls get_user_pages with write=0. This is just an example.

This allows mapping any page pointed to by any pte (and in turn
visible in the primary CPU MMU) into a secondary MMU (be it a pure tlb
like GRU, or a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer in kvm), or a remote DMA in software like XPMEM
(hence the need to schedule in the XPMEM code to send the invalidate
to the remote node, while there is no need to schedule in kvm/GRU as
it's an immediate event, like invalidating a primary-mmu pte).

At least for KVM, without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
list_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't
doing anything with "mm". This allows mmu notifier users to keep track
of whether the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter increased in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference while the VM is in the middle of
range_begin/end, as any page returned by get_user_pages in that
critical section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites, the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken in virtual address order. The order is critical to
avoid mm_lock(mm1) and mm_lock(mm2) running concurrently and
triggering lock inversion deadlocks.

3) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled
if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage
of mmu notifiers, but this already allows compiling a KVM external
module against a kernel with mmu notifiers enabled, and from the next
pull from kvm.git we'll start using them.
And GRU/XPMEM will also be able to continue development by enabling
KVM=m in their config, until they submit all the GRU/XPMEM GPLv2 code
to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the
same way KVM does (even if KVM=n). This guarantees nobody selects
MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n.

The mmu_notifier_register call can fail because mm_lock may not be
able to allocate the required vmalloc space. See the comment on top of
the mm_lock() implementation for the worst case memory requirements.
Because mmu_notifier_register is used when a driver starts up, a
failure can be gracefully handled. Here is an example of the change
applied to kvm to register the mmu notifiers. Usually when a driver
starts up, other allocations are required anyway and -ENOMEM failure
paths exist already.

 struct kvm *kvm_arch_create_vm(void)
 {
 	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+	int err;

 	if (!kvm)
 		return ERR_PTR(-ENOMEM);

 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+	if (err) {
+		kfree(kvm);
+		return ERR_PTR(err);
+	}
+
 	return kvm;
 }

mmu_notifier_unregister returns void and it's reliable.

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -739,7 +739,7 @@ static inline void hlist_del(struct hlis
  * or hlist_del_rcu(), running on this same list.
  * However, it is perfectly legal to run concurrently with
  * the _rcu list-traversal primitives, such as
- * hlist_for_each_entry().
+ * hlist_for_each_entry_rcu().
  */
 static inline void hlist_del_rcu(struct hlist_node *n)
 {
@@ -755,6 +755,26 @@ static inline void hlist_del_init(struct
 	}
 }
 
+/**
+ * hlist_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: list_unhashed() on entry does return true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so list_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_add_head_rcu() or
+ * hlist_del_rcu(), running on this same list. However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_for_each_entry_rcu().
+ */ static inline void hlist_del_init_rcu(struct hlist_node *n) { if (!hlist_unhashed(n)) { diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1050,18 +1050,6 @@ extern int install_special_mapping(struc unsigned long addr, unsigned long len, unsigned long flags, struct page **pages); -/* - * mm_lock will take mmap_sem writably (to prevent all modifications - * and scanning of vmas) and then also takes the mapping locks for - * each of the vma to lockout any scans of pagetables of this address - * space. This can be used to effectively holding off reclaim from the - * address space. - * - * mm_lock can fail if there is not enough memory to store a pointer - * array to all vmas. - * - * mm_lock and mm_unlock are expensive operations that may take a long time. - */ struct mm_lock_data { spinlock_t **i_mmap_locks; spinlock_t **anon_vma_locks; diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -4,17 +4,24 @@ #include #include #include +#include struct mmu_notifier; struct mmu_notifier_ops; #ifdef CONFIG_MMU_NOTIFIER -#include +/* + * The mmu notifier_mm structure is allocated and installed in + * mm->mmu_notifier_mm inside the mm_lock() protected critical section + * and it's released only when mm_count reaches zero in mmdrop(). + */ struct mmu_notifier_mm { + /* all mmu notifiers registerd in this mm are queued in this list */ struct hlist_head list; + /* srcu structure for this mm */ struct srcu_struct srcu; - /* to serialize mmu_notifier_unregister against mmu_notifier_release */ + /* to serialize the list modifications and hlist_unhashed */ spinlock_t lock; }; @@ -23,8 +30,8 @@ struct mmu_notifier_ops { * Called either by mmu_notifier_unregister or when the mm is * being destroyed by exit_mmap, always before all pages are * freed. It's mandatory to implement this method. This can - * run concurrently to other mmu notifier methods and it - * should teardown all secondary mmu mappings and freeze the + * run concurrently with other mmu notifier methods and it + * should tear down all secondary mmu mappings and freeze the * secondary mmu. */ void (*release)(struct mmu_notifier *mn, @@ -43,9 +50,10 @@ struct mmu_notifier_ops { /* * Before this is invoked any secondary MMU is still ok to - * read/write to the page previously pointed by the Linux pte - * because the old page hasn't been freed yet. If required - * set_page_dirty has to be called internally to this method. + * read/write to the page previously pointed to by the Linux + * pte because the page hasn't been freed yet and it won't be + * freed until this returns. If required set_page_dirty has to + * be called internally to this method. */ void (*invalidate_page)(struct mmu_notifier *mn, struct mm_struct *mm, @@ -53,20 +61,18 @@ struct mmu_notifier_ops { /* * invalidate_range_start() and invalidate_range_end() must be - * paired and are called only when the mmap_sem is held and/or - * the semaphores protecting the reverse maps. Both functions + * paired and are called only when the mmap_sem and/or the + * locks protecting the reverse maps are held. Both functions * may sleep. The subsystem must guarantee that no additional - * references to the pages in the range established between - * the call to invalidate_range_start() and the matching call - * to invalidate_range_end(). 
+ * references are taken to the pages in the range established + * between the call to invalidate_range_start() and the + * matching call to invalidate_range_end(). * - * Invalidation of multiple concurrent ranges may be permitted - * by the driver or the driver may exclude other invalidation - * from proceeding by blocking on new invalidate_range_start() - * callback that overlap invalidates that are already in - * progress. Either way the establishment of sptes to the - * range can only be allowed if all invalidate_range_stop() - * function have been called. + * Invalidation of multiple concurrent ranges may be + * optionally permitted by the driver. Either way the + * establishment of sptes is forbidden in the range passed to + * invalidate_range_begin/end for the whole duration of the + * invalidate_range_begin/end critical section. * * invalidate_range_start() is called when all pages in the * range are still mapped and have at least a refcount of one. @@ -187,6 +193,14 @@ static inline void mmu_notifier_mm_destr __mmu_notifier_mm_destroy(mm); } +/* + * These two macros will sometime replace ptep_clear_flush. + * ptep_clear_flush is impleemnted as macro itself, so this also is + * implemented as a macro until ptep_clear_flush will converted to an + * inline function, to diminish the risk of compilation failure. The + * invalidate_page method over time can be moved outside the PT lock + * and these two macros can be later removed. + */ #define ptep_clear_flush_notify(__vma, __address, __ptep) \ ({ \ pte_t __pte; \ diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -193,7 +193,3 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS - -config MMU_NOTIFIER - def_bool y - bool "MMU notifier, for paging KVM/RDMA" diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -613,6 +613,12 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. + */ if (is_cow_mapping(vma->vm_flags)) mmu_notifier_invalidate_range_start(src_mm, addr, end); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2329,7 +2329,36 @@ static inline void __mm_unlock(spinlock_ * operations that could ever happen on a certain mm. This includes * vmtruncate, try_to_unmap, and all page faults. The holder * must not hold any mm related lock. A single task can't take more - * than one mm lock in a row or it would deadlock. + * than one mm_lock in a row or it would deadlock. + * + * The mmap_sem must be taken in write mode to block all operations + * that could modify pagetables and free pages without altering the + * vma layout (for example populate_range() with nonlinear vmas). + * + * The sorting is needed to avoid lock inversion deadlocks if two + * tasks run mm_lock at the same time on different mm that happen to + * share some anon_vmas/inodes but mapped in different order. + * + * mm_lock and mm_unlock are expensive operations that may have to + * take thousand of locks. Thanks to sort() the complexity is + * O(N*log(N)) where N is the number of VMAs in the mm. The max number + * of vmas is defined in /proc/sys/vm/max_map_count. + * + * mm_lock() can fail if memory allocation fails. 
The worst case + * vmalloc allocation required is 2*max_map_count*sizeof(spinlock *), + * so around 1Mbyte, but in practice it'll be much less because + * normally there won't be max_map_count vmas allocated in the task + * that runs mm_lock(). + * + * The vmalloc memory allocated by mm_lock is stored in the + * mm_lock_data structure that must be allocated by the caller and it + * must be later passed to mm_unlock that will free it after using it. + * Allocating the mm_lock_data structure on the stack is fine because + * it's only a couple of bytes in size. + * + * If mm_lock() returns -ENOMEM no memory has been allocated and the + * mm_lock_data structure can be freed immediately, and mm_unlock must + * not be called. */ int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) { @@ -2350,6 +2379,13 @@ int mm_lock(struct mm_struct *mm, struct return -ENOMEM; } + /* + * When mm_lock_sort_anon_vma/i_mmap returns zero it + * means there's no lock to take and so we can free + * the array here without waiting mm_unlock. mm_unlock + * will do nothing if nr_i_mmap/anon_vma_locks is + * zero. + */ data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); @@ -2374,7 +2410,17 @@ static void mm_unlock_vfree(spinlock_t * vfree(locks); } -/* avoid memory allocations for mm_unlock to prevent deadlock */ +/* + * mm_unlock doesn't require any memory allocation and it won't fail. + * + * All memory has been previously allocated by mm_lock and it'll be + * all freed before returning. Only after mm_unlock returns, the + * caller is allowed to free and forget the mm_lock_data structure. + * + * mm_unlock runs in O(N) where N is the max number of VMAs in the + * mm. The max number of vmas is defined in + * /proc/sys/vm/max_map_count. + */ void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) { if (mm->map_count) { diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -21,12 +21,12 @@ * This function can't run concurrently against mmu_notifier_register * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap * runs with mm_users == 0. Other tasks may still invoke mmu notifiers - * in parallel despite there's no task using this mm anymore, through - * the vmas outside of the exit_mmap context, like with + * in parallel despite there being no task using this mm any more, + * through the vmas outside of the exit_mmap context, such as with * vmtruncate. This serializes against mmu_notifier_unregister with * the mmu_notifier_mm->lock in addition to SRCU and it serializes * against the other mmu notifiers with SRCU. struct mmu_notifier_mm - * can't go away from under us as exit_mmap holds a mm_count pin + * can't go away from under us as exit_mmap holds an mm_count pin * itself. */ void __mmu_notifier_release(struct mm_struct *mm) @@ -41,7 +41,7 @@ void __mmu_notifier_release(struct mm_st hlist); /* * We arrived before mmu_notifier_unregister so - * mmu_notifier_unregister will do nothing else than + * mmu_notifier_unregister will do nothing other than * to wait ->release to finish and * mmu_notifier_unregister to return. */ @@ -66,7 +66,11 @@ void __mmu_notifier_release(struct mm_st spin_unlock(&mm->mmu_notifier_mm->lock); /* - * Wait ->release if mmu_notifier_unregister is running it. 
+ * synchronize_srcu here prevents mmu_notifier_release to + * return to exit_mmap (which would proceed freeing all pages + * in the mm) until the ->release method returns, if it was + * invoked by mmu_notifier_unregister. + * * The mmu_notifier_mm can't go away from under us because one * mm_count is hold by exit_mmap. */ @@ -144,8 +148,9 @@ void __mmu_notifier_invalidate_range_end * Must not hold mmap_sem nor any other VM related lock when calling * this registration function. Must also ensure mm_users can't go down * to zero while this runs to avoid races with mmu_notifier_release, - * so mm has to be current->mm or the mm should be pinned safely like - * with get_task_mm(). mmput can be called after mmu_notifier_register + * so mm has to be current->mm or the mm should be pinned safely such + * as with get_task_mm(). If the mm is not current->mm, the mm_users + * pin should be released by calling mmput after mmu_notifier_register * returns. mmu_notifier_unregister must be always called to * unregister the notifier. mm_count is automatically pinned to allow * mmu_notifier_unregister to safely run at any time later, before or @@ -155,29 +160,29 @@ int mmu_notifier_register(struct mmu_not int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) { struct mm_lock_data data; + struct mmu_notifier_mm * mmu_notifier_mm; int ret; BUG_ON(atomic_read(&mm->mm_users) <= 0); + ret = -ENOMEM; + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); + if (unlikely(!mmu_notifier_mm)) + goto out; + + ret = init_srcu_struct(&mmu_notifier_mm->srcu); + if (unlikely(ret)) + goto out_kfree; + ret = mm_lock(mm, &data); if (unlikely(ret)) - goto out; + goto out_cleanup; if (!mm_has_notifiers(mm)) { - mm->mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), - GFP_KERNEL); - ret = -ENOMEM; - if (unlikely(!mm_has_notifiers(mm))) - goto out_unlock; - - ret = init_srcu_struct(&mm->mmu_notifier_mm->srcu); - if (unlikely(ret)) { - kfree(mm->mmu_notifier_mm); - mmu_notifier_mm_init(mm); - goto out_unlock; - } - INIT_HLIST_HEAD(&mm->mmu_notifier_mm->list); - spin_lock_init(&mm->mmu_notifier_mm->lock); + INIT_HLIST_HEAD(&mmu_notifier_mm->list); + spin_lock_init(&mmu_notifier_mm->lock); + mm->mmu_notifier_mm = mmu_notifier_mm; + mmu_notifier_mm = NULL; } atomic_inc(&mm->mm_count); @@ -192,8 +197,14 @@ int mmu_notifier_register(struct mmu_not spin_lock(&mm->mmu_notifier_mm->lock); hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); spin_unlock(&mm->mmu_notifier_mm->lock); -out_unlock: + mm_unlock(mm, &data); +out_cleanup: + if (mmu_notifier_mm) + cleanup_srcu_struct(&mmu_notifier_mm->srcu); +out_kfree: + /* kfree() does nothing if mmu_notifier_mm is NULL */ + kfree(mmu_notifier_mm); out: BUG_ON(atomic_read(&mm->mm_users) <= 0); return ret; From shemminger at vyatta.com Thu May 1 11:22:52 2008 From: shemminger at vyatta.com (Stephen Hemminger) Date: Thu, 1 May 2008 11:22:52 -0700 Subject: [ofa-general] Re: [RESEND] RE: [PATCH 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <71d336490805010943s79f01e01u9b4566165c4fba3f@mail.gmail.com> References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171955.31725.7771.stgit@localhost.localdomain> <20080501075606.4963afa3@extreme> <71d336490805010943s79f01e01u9b4566165c4fba3f@mail.gmail.com> Message-ID: <20080501112252.167695c7@extreme> On Thu, 1 May 2008 22:13:08 +0530 "Ramachandra K" wrote: > Sorry for the resend. Original mail got bounced from netdev. 
> 
> On Thu, May 1, 2008 at 9:32 PM,  wrote:
> >
> >  Stephen,
> >
> >  Stephen Hemminger [mailto:shemminger at vyatta.com] wrote:
> >
> >  >> On Wed, 30 Apr 2008 22:49:55 +0530
> >  >> Ramachandra K wrote:
> >
> >  >> +static match_table_t vnic_opt_tokens = {
> >  >> +	{VNIC_OPT_IOC_GUID, "ioc_guid=%s"},
> >  >> +	{VNIC_OPT_DGID, "dgid=%s"},
> >  >> +	{VNIC_OPT_PKEY, "pkey=%x"},
> >  >> +	{VNIC_OPT_NAME, "name=%s"},
> >  >> +	{VNIC_OPT_INSTANCE, "instance=%d"},
> >  >> +	{VNIC_OPT_RXCSUM, "rx_csum=%s"},
> >  >> +	{VNIC_OPT_TXCSUM, "tx_csum=%s"},
> >  >> +	{VNIC_OPT_HEARTBEAT, "heartbeat=%d"},
> >  >> +	{VNIC_OPT_IOC_STRING, "ioc_string=\"%s"},
> >  >> +	{VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"},
> >  >> +	{VNIC_OPT_ERR, NULL}
> >  >> +};
> >  >>
> >
> >  > NO
> >  > 1. Most of this shouldn't be done via sysfs (rx_csum, tx_csum, ...)
> >  > 2. Sysfs is one value per file not name=value
> >
> >  The VNIC driver needs multiple parameters (IOC GUID, DGID, etc.) from
> >  user space to connect to the EVIC. For this, the "name=value"
> >  mechanism is used on a single write-only sysfs file as the input
> >  method to the driver.
> >
> >  The driver does follow the one-value-per-file sysfs rule whenever it
> >  returns data: each readable file returns only a single value.
> >
> >  Regards,
> >  Ram

The undocumented style rule of sysfs is one value (ascii) per file.

From shemminger at vyatta.com  Thu May  1 11:26:33 2008
From: shemminger at vyatta.com (Stephen Hemminger)
Date: Thu, 1 May 2008 11:26:33 -0700
Subject: [ofa-general] RE: [PATCH 11/13] QLogic VNIC: Driver utility file -
	implements various utility macros
In-Reply-To: <71d336490805011001he7359dfw831470986d87b385@mail.gmail.com>
References: <20080430171028.31725.86190.stgit@localhost.localdomain>
	<20080430172126.31725.48554.stgit@localhost.localdomain>
	<20080501075816.1010ec3a@extreme>
	<71d336490805011001he7359dfw831470986d87b385@mail.gmail.com>
Message-ID: <20080501112633.7e272dcc@extreme>

On Thu, 1 May 2008 22:31:10 +0530
"Ramachandra K"  wrote:

> Sorry for the resend. Original mail bounced from netdev.
> 
> On Thu, May 1, 2008 at 9:48 PM  wrote:
> > Stephen,
> >
> > Stephen Hemminger [mailto:shemminger at vyatta.com] wrote:
> >
> > > Ramachandra K wrote:
> >
> > >> +#define is_power_of2(value) (((value) & ((value - 1))) == 0)
> > >> +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1)))
> >
> > > In kernel.h already
> >
> > Will fix this. Thanks.
> >
> > >> +extern u32 vnic_debug;
> > > Use msg level macros instead?

There is an ethtool mechanism to set the message level for the debug
method; read other network drivers and look at netif_msg_timer(x),
netif_msg_probe(x), etc. in include/linux/netdevice.h.

The goal here is to not have any special configuration for each
different type of hardware. It is bad user-interface design to have
each hardware vendor choosing a different mechanism and different
values to enable debugging. You can argue that the existing
infrastructure is inadequate, in which case extend the infrastructure
for all devices.
From makc at sgi.com  Thu May  1 11:43:34 2008
From: makc at sgi.com (Max Matveev)
Date: Fri, 2 May 2008 04:43:34 +1000
Subject: [ofa-general] mapping IP addresses to GIDs across IP subnets
In-Reply-To: <20080430213051.GX24525@obsidianresearch.com>
References: <18456.56771.908062.459625@kuku.melbourne.sgi.com>
	<20080430213051.GX24525@obsidianresearch.com>
Message-ID: <18458.3926.633446.715678@kuku.melbourne.sgi.com>

On Wed, 30 Apr 2008 15:30:51 -0600, Jason Gunthorpe wrote:

 JG> Well, you can't just assume that a AAAA record associated with the
 JG> reverse of a IPv4 is a GID - it could be a legitimate IPv6 address.
 JG> The GID space and IPv6 space are completely distinct, despite the same
 JG> format of the address.

You can also make an administrative decision to make the IB prefix the
same as the IPv6 prefix, in which case the IPv6 address and the GID
become the same.

 JG> The only way I could see to do this with DNS is to introduce a new
 JG> record type for GIDs..

 JG> Alternatively, you could use DNS to manage a mapping table, ala the
 JG> reverse map:

 JG> 1.0.0.10.ipv4.ibta-addr. AAAA fd83:609c:bdc8:1:213:72ff:fe29:e65d

That could work too, but the resolver would need to be modified to ask
for the different TLA.

max

From weiny2 at llnl.gov  Thu May  1 15:50:45 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 1 May 2008 15:50:45 -0700
Subject: [ofa-general] [PATCH] infiniband-diags/scripts/iblinkinfo.pl: fix
	printing of switch name when port 1 is down.
Message-ID: <20080501155045.4aa3ef2c.weiny2@llnl.gov>

I found a bug in the printing of the names of switches in
iblinkinfo.pl. The name of the switch was being pulled from the first
port's "link" structure. The problem is that if the first port is
down, there is no structure available. This patch gets the switch name
from the first available link structure and prints the name correctly.

Ira

>From 9b69c0ff4c7785be78157ab78e4a4892d64e2fb2 Mon Sep 17 00:00:00 2001
From: Ira K. Weiny 
Date: Thu, 1 May 2008 15:46:25 -0700
Subject: [PATCH] infiniband-diags/scripts/iblinkinfo.pl: fix printing of
 switch name when port 1 is down.

Signed-off-by: Ira K. Weiny 

---
 infiniband-diags/scripts/iblinkinfo.pl |   14 +++++++++++++-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl
index 890567c..6077ded 100755
--- a/infiniband-diags/scripts/iblinkinfo.pl
+++ b/infiniband-diags/scripts/iblinkinfo.pl
@@ -139,11 +139,23 @@ sub main
 	foreach my $port (1 .. $num_ports) {
 		my $hr = $IBswcountlimits::link_ends{$switch}{$port};
 		if ($switch_prompt eq "no" && !$line_mode) {
+			my $switch_name = "";
+			my $tmp_port = $port;
+			while ($switch_name eq "" && $tmp_port <= $num_ports) {
+				# the first port is down find switch name with up port
+				my $hr = $IBswcountlimits::link_ends{$switch}{$tmp_port};
+				$switch_name = $hr->{loc_desc};
+				$tmp_port++;
+			}
+			if ($switch_name eq "") {
+				printf(
+					"WARNING: Switch Name not found for $switch\n");
+			}
 			push(
 				@output_lines,
 				sprintf(
 					"Switch %18s %s%s:\n",
-					$switch, $hr->{loc_desc}, $pkt_life_prompt
+					$switch, $switch_name, $pkt_life_prompt
 				)
 			);
 			$switch_prompt = "yes";
-- 
1.5.1

From rajib.majumder at credit-suisse.com  Fri May  2 02:37:38 2008
From: rajib.majumder at credit-suisse.com (Majumder, Rajib)
Date: Fri, 2 May 2008 17:37:38 +0800
Subject: [ofa-general] RDMA vs Shared Memory
Message-ID: <0175FAC12977B047809C1BACA25881AD239C7E@ESNG17P32002A.csfb.cs-group.com>

Hello,

I was trying to find the fastest IPC mechanism where the data source
and sink run on the same host, on SLES 10, running ConnectX and OFED
1.3.0. It seems RDMA is performing much better than shared memory.

Mode   Msg size   Latency (microsecs)   Throughput
-----  ---------  --------------------  -----------
SHM    32 bytes   20                    17 Mbps
RDMA   32 bytes   1.07                  1.2 Gbps
SHM    32k        70                    5.2 Gbps
RDMA   32k        30                    8.5 Gbps

Can someone explain why RDMA is giving better performance on the same
host? Is it only kernel bypass and zcopy?

Thanks

Rajib

==============================================================================
Please access the attached hyperlink for an important electronic communications disclaimer:

http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
==============================================================================
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From andrea at qumranet.com  Fri May  2 08:05:05 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Fri, 02 May 2008 17:05:05 +0200
Subject: [ofa-general] [PATCH 02 of 11] get_task_mm
In-Reply-To: 
Message-ID: 

# HG changeset patch
# User Andrea Arcangeli 
# Date 1209740185 -7200
# Node ID c85c85c4be165eb6de16136bb97cf1fa7fd5c88f
# Parent  1489529e7b53d3f2dab8431372aa4850ec821caa
get_task_mm

get_task_mm should not succeed if mmput() is running and has reduced
the mm_users count to zero. This can occur if a processor follows a
task's pointer to an mm struct, because that pointer is only cleared
after the mmput().

If get_task_mm() succeeds after mmput() has reduced mm_users to zero,
then we have the lovely situation where one portion of the kernel is
doing all the teardown work for an mm while another portion is happily
using it.
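(To illustrate the contract -- a hypothetical caller, not code from
this patch; with the fix below, get_task_mm() simply returns NULL for
an mm that is being torn down:)

#include <linux/sched.h>
#include <linux/mm.h>

static int inspect_task_mm(struct task_struct *task)
{
	struct mm_struct *mm = get_task_mm(task);

	if (!mm)
		return -EINVAL;	/* no mm, or mmput() already ran */

	down_read(&mm->mmap_sem);
	/* ... safely walk mm->mmap here ... */
	up_read(&mm->mmap_sem);

	mmput(mm);	/* drop the mm_users reference we took */
	return 0;
}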
Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -465,7 +465,8 @@ struct mm_struct *get_task_mm(struct tas if (task->flags & PF_BORROWED_MM) mm = NULL; else - atomic_inc(&mm->mm_users); + if (!atomic_inc_not_zero(&mm->mm_users)) + mm = NULL; } task_unlock(task); return mm; From andrea at qumranet.com Fri May 2 08:05:06 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:06 +0200 Subject: [ofa-general] [PATCH 03 of 11] invalidate_page outside PT lock In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1209740185 -7200 # Node ID ea8fc9187b6d3ef2742061b4f62598afe55281cf # Parent c85c85c4be165eb6de16136bb97cf1fa7fd5c88f invalidate_page outside PT lock Moves all mmu notifier methods outside the PT lock (first and not last step to make them sleep capable). Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -193,35 +193,6 @@ static inline void mmu_notifier_mm_destr __mmu_notifier_mm_destroy(mm); } -/* - * These two macros will sometime replace ptep_clear_flush. - * ptep_clear_flush is impleemnted as macro itself, so this also is - * implemented as a macro until ptep_clear_flush will converted to an - * inline function, to diminish the risk of compilation failure. The - * invalidate_page method over time can be moved outside the PT lock - * and these two macros can be later removed. - */ -#define ptep_clear_flush_notify(__vma, __address, __ptep) \ -({ \ - pte_t __pte; \ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __pte = ptep_clear_flush(___vma, ___address, __ptep); \ - mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ - __pte; \ -}) - -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ -({ \ - int __young; \ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ - __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ - ___address); \ - __young; \ -}) - #else /* CONFIG_MMU_NOTIFIER */ static inline void mmu_notifier_release(struct mm_struct *mm) @@ -257,9 +228,6 @@ static inline void mmu_notifier_mm_destr { } -#define ptep_clear_flush_young_notify ptep_clear_flush_young -#define ptep_clear_flush_notify ptep_clear_flush - #endif /* CONFIG_MMU_NOTIFIER */ #endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,11 +188,13 @@ __xip_unmap (struct address_space * mapp if (pte) { /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); pte_unmap_unlock(pte, ptl); + /* must invalidate_page _before_ freeing the page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(page); } } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -1714,9 +1714,10 @@ static int do_wp_page(struct mm_struct * */ page_table = pte_offset_map_lock(mm, pmd, address, &ptl); - page_cache_release(old_page); + new_page = NULL; if (!pte_same(*page_table, orig_pte)) goto unlock; + page_cache_release(old_page); page_mkwrite = 1; } @@ -1732,6 +1733,7 @@ static int do_wp_page(struct mm_struct * if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; + old_page = new_page = NULL; goto unlock; } @@ -1776,7 +1778,7 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. */ - ptep_clear_flush_notify(vma, address, page_table); + ptep_clear_flush(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); @@ -1788,12 +1790,18 @@ gotten: } else mem_cgroup_uncharge_page(new_page); - if (new_page) +unlock: + pte_unmap_unlock(page_table, ptl); + + if (new_page) { + if (new_page == old_page) + /* cow happened, notify before releasing old_page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(new_page); + } if (old_page) page_cache_release(old_page); -unlock: - pte_unmap_unlock(page_table, ptl); + if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -275,7 +275,7 @@ static int page_referenced_one(struct pa unsigned long address; pte_t *pte; spinlock_t *ptl; - int referenced = 0; + int referenced = 0, clear_flush_young = 0; address = vma_address(page, vma); if (address == -EFAULT) @@ -288,8 +288,11 @@ static int page_referenced_one(struct pa if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young_notify(vma, address, pte)) - referenced++; + } else { + clear_flush_young = 1; + if (ptep_clear_flush_young(vma, address, pte)) + referenced++; + } /* Pretend the page is referenced if the task has the swap token and is in the middle of a page fault. */ @@ -299,6 +302,10 @@ static int page_referenced_one(struct pa (*mapcount)--; pte_unmap_unlock(pte, ptl); + + if (clear_flush_young) + referenced += mmu_notifier_clear_flush_young(mm, address); + out: return referenced; } @@ -458,7 +465,7 @@ static int page_mkclean_one(struct page pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush_notify(vma, address, pte); + entry = ptep_clear_flush(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -466,6 +473,10 @@ static int page_mkclean_one(struct page } pte_unmap_unlock(pte, ptl); + + if (ret) + mmu_notifier_invalidate_page(mm, address); + out: return ret; } @@ -717,15 +728,14 @@ static int try_to_unmap_one(struct page * If it's recently referenced (perhaps page_referenced * skipped over this mm) then we should reactivate it. 
*/ - if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young_notify(vma, address, pte)))) { + if (!migration && (vma->vm_flags & VM_LOCKED)) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -780,6 +790,8 @@ static int try_to_unmap_one(struct page out_unmap: pte_unmap_unlock(pte, ptl); + if (ret != SWAP_FAIL) + mmu_notifier_invalidate_page(mm, address); out: return ret; } @@ -818,7 +830,7 @@ static void try_to_unmap_cluster(unsigne spinlock_t *ptl; struct page *page; unsigned long address; - unsigned long end; + unsigned long start, end; address = (vma->vm_start + cursor) & CLUSTER_MASK; end = address + CLUSTER_SIZE; @@ -839,6 +851,8 @@ static void try_to_unmap_cluster(unsigne if (!pmd_present(*pmd)) return; + start = address; + mmu_notifier_invalidate_range_start(mm, start, end); pte = pte_offset_map_lock(mm, pmd, address, &ptl); /* Update high watermark before we lower rss */ @@ -850,12 +864,12 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young_notify(vma, address, pte)) + if (ptep_clear_flush_young(vma, address, pte)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) @@ -871,6 +885,7 @@ static void try_to_unmap_cluster(unsigne (*mapcount)--; } pte_unmap_unlock(pte - 1, ptl); + mmu_notifier_invalidate_range_end(mm, start, end); } static int try_to_unmap_anon(struct page *page, int migration) From andrea at qumranet.com Fri May 2 08:05:07 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:07 +0200 Subject: [ofa-general] [PATCH 04 of 11] free-pgtables In-Reply-To: Message-ID: <14e9f5a12bb1657fa675.1209740707@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1209740185 -7200 # Node ID 14e9f5a12bb1657fa6756e18d5dac71d4ad1a55e # Parent ea8fc9187b6d3ef2742061b4f62598afe55281cf free-pgtables Move the tlb flushing into free_pgtables. The conversion of the locks taken for reverse map scanning would require taking sleeping locks in free_pgtables() and we cannot sleep while gathering pages for a tlb flush. Move the tlb_gather/tlb_finish call to free_pgtables() to be done for each vma. This may add a number of tlb flushes depending on the number of vmas that cannot be coalesced into one. The first pointer argument to free_pgtables() can then be dropped. 
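(A sketch of the per-vma pattern this describes -- simplified, with
the floor/ceiling computation elided; the actual change is in the diff
below:)

static void free_one_vma_pgtables(struct vm_area_struct *vma,
				  unsigned long floor, unsigned long ceiling)
{
	struct mmu_gather *tlb;

	/* gather and flush for this vma only, so that callers may
	 * take sleeping locks between vmas */
	tlb = tlb_gather_mmu(vma->vm_mm, 0);
	free_pgd_range(&tlb, vma->vm_start, vma->vm_end, floor, ceiling);
	tlb_finish_mmu(tlb, vma->vm_start, vma->vm_end);
}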
Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -772,8 +772,8 @@ int walk_page_range(const struct mm_stru void *private); void free_pgd_range(struct mmu_gather **tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma, - unsigned long floor, unsigned long ceiling); +void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor, + unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -272,9 +272,11 @@ void free_pgd_range(struct mmu_gather ** } while (pgd++, addr = next, addr != end); } -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma, - unsigned long floor, unsigned long ceiling) +void free_pgtables(struct vm_area_struct *vma, unsigned long floor, + unsigned long ceiling) { + struct mmu_gather *tlb; + while (vma) { struct vm_area_struct *next = vma->vm_next; unsigned long addr = vma->vm_start; @@ -286,7 +288,8 @@ void free_pgtables(struct mmu_gather **t unlink_file_vma(vma); if (is_vm_hugetlb_page(vma)) { - hugetlb_free_pgd_range(tlb, addr, vma->vm_end, + tlb = tlb_gather_mmu(vma->vm_mm, 0); + hugetlb_free_pgd_range(&tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } else { /* @@ -299,9 +302,11 @@ void free_pgtables(struct mmu_gather **t anon_vma_unlink(vma); unlink_file_vma(vma); } - free_pgd_range(tlb, addr, vma->vm_end, + tlb = tlb_gather_mmu(vma->vm_mm, 0); + free_pgd_range(&tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } + tlb_finish_mmu(tlb, addr, vma->vm_end); vma = next; } } diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1759,9 +1759,9 @@ static void unmap_region(struct mm_struc update_hiwater_rss(mm); unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, + tlb_finish_mmu(tlb, start, end); + free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); - tlb_finish_mmu(tlb, start, end); } /* @@ -2060,8 +2060,8 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); + free_pgtables(vma, FIRST_USER_ADDRESS, 0); /* * Walk the list again, actually closing and freeing it, From andrea at qumranet.com Fri May 2 08:05:04 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:04 +0200 Subject: [ofa-general] [PATCH 01 of 11] mmu-notifier-core In-Reply-To: Message-ID: <1489529e7b53d3f2dab8.1209740704@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1209740175 -7200 # Node ID 1489529e7b53d3f2dab8431372aa4850ec821caa # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8 mmu-notifier-core With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". 
In the GRU case there's no actual secondary pte and there's only a
secondary tlb, because the GRU secondary MMU has no knowledge about
sptes and every secondary tlb miss event in the MMU always generates a
page fault that has to be resolved by the CPU (this is not the case
with KVM, where a secondary tlb miss will walk sptes in hardware and
will refill the secondary tlb transparently to software if the
corresponding spte is present).

The same way zap_page_range has to invalidate the pte before freeing
the page, the spte (and secondary tlb) must also be invalidated before
any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte, because they're part of the guest working set.

Furthermore a spte unmap event can immediately lead to a page being
freed when the pin is released (so requiring the same complex and
relatively slow tlb_gather smp-safe logic we have in zap_page_range,
which can be avoided completely if the spte unmap event doesn't
require an unpin of the page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
know when the VM is swapping or freeing or doing anything on the
primary MMU, so that the secondary MMU code can drop sptes before the
pages are freed, avoiding all page pinning and allowing 100% reliable
swapping of guest physical address space. Furthermore it avoids
requiring the code that tears down the mappings of the secondary MMU
to implement a logic like tlb_gather in zap_page_range, which would
require many IPIs to flush other cpu tlbs for each fixed number of
sptes unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings
will be invalidated, and the next secondary-mmu page fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an updated
spte or secondary-tlb mapping on the copied page. Or it will set up a
readonly spte or readonly tlb mapping if it's a guest read, if it
calls get_user_pages with write=0. This is just an example.

This allows mapping any page pointed to by any pte (and in turn
visible in the primary CPU MMU) into a secondary MMU (be it a pure tlb
like GRU, or a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer in kvm), or a remote DMA in software like XPMEM
(hence the need to schedule in the XPMEM code to send the invalidate
to the remote node, while there is no need to schedule in kvm/GRU as
it's an immediate event, like invalidating a primary-mmu pte).

At least for KVM, without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
list_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't
doing anything with "mm". This allows mmu notifier users to keep track
of whether the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter increased in range_begin and
decreased in range_end.
No secondary MMU page fault is allowed to map any spte or secondary tlb reference, while the VM is in the middle of range_begin/end as any page returned by get_user_pages in that critical section could later immediately be freed without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page). To stop all page freeing and pagetable overwrites the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken in virtual address order. The order is critical to avoid mm_lock(mm1) and mm_lock(mm2) running concurrently to trigger lock inversion deadlocks. 3) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows to compile a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. The mmu_notifier_register call can fail because mm_lock may not allocate the required vmalloc space. See the comment on top of mm_lock() implementation for the worst case memory requirements. Because mmu_notifier_reigster is used when a driver startup, a failure can be gracefully handled. Here an example of the change applied to kvm to register the mmu notifiers. Usually when a driver startups other allocations are required anyway and -ENOMEM failure paths exists already. struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } mmu_notifier_unregister returns void and it's reliable. Signed-off-by: Andrea Arcangeli Signed-off-by: Nick Piggin Signed-off-by: Christoph Lameter diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on HAVE_KVM select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/include/linux/list.h b/include/linux/list.h --- a/include/linux/list.h +++ b/include/linux/list.h @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis * or hlist_del_rcu(), running on this same list. * However, it is perfectly legal to run concurrently with * the _rcu list-traversal primitives, such as - * hlist_for_each_entry(). + * hlist_for_each_entry_rcu(). */ static inline void hlist_del_rcu(struct hlist_node *n) { @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct if (!hlist_unhashed(n)) { __hlist_del(n); INIT_HLIST_NODE(n); + } +} + +/** + * hlist_del_init_rcu - deletes entry from hash list with re-initialization + * @n: the element to delete from the hash list. + * + * Note: list_unhashed() on entry does return true after this. 
Signed-off-by: Andrea Arcangeli Signed-off-by: Nick Piggin Signed-off-by: Christoph Lameter diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on HAVE_KVM select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/include/linux/list.h b/include/linux/list.h --- a/include/linux/list.h +++ b/include/linux/list.h @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis * or hlist_del_rcu(), running on this same list. * However, it is perfectly legal to run concurrently with * the _rcu list-traversal primitives, such as - * hlist_for_each_entry(). + * hlist_for_each_entry_rcu(). */ static inline void hlist_del_rcu(struct hlist_node *n) { @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct if (!hlist_unhashed(n)) { __hlist_del(n); INIT_HLIST_NODE(n); + } +} + +/** + * hlist_del_init_rcu - deletes entry from hash list with re-initialization + * @n: the element to delete from the hash list. + * + * Note: list_unhashed() on entry does return true after this. It is + * useful for RCU based read lockfree traversal if the writer side + * must know if the list entry is still hashed or already unhashed. + * + * In particular, it means that we can not poison the forward pointers + * that may still be used for walking the hash list and we can only + * zero the pprev pointer so list_unhashed() will return true after + * this. + * + * The caller must take whatever precautions are necessary (such as + * holding appropriate locks) to avoid racing with another + * list-mutation primitive, such as hlist_add_head_rcu() or + * hlist_del_rcu(), running on this same list. However, it is + * perfectly legal to run concurrently with the _rcu list-traversal + * primitives, such as hlist_for_each_entry_rcu(). + */ +static inline void hlist_del_init_rcu(struct hlist_node *n) +{ + if (!hlist_unhashed(n)) { + __hlist_del(n); + n->pprev = NULL; } } diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1084,6 +1084,15 @@ extern int install_special_mapping(struc unsigned long addr, unsigned long len, unsigned long flags, struct page **pages); +struct mm_lock_data { + spinlock_t **i_mmap_locks; + spinlock_t **anon_vma_locks; + size_t nr_i_mmap_locks; + size_t nr_anon_vma_locks; +}; +extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); +extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); + extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -19,6 +20,7 @@ #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) struct address_space; +struct mmu_notifier_mm; #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS typedef atomic_long_t mm_counter_t; @@ -235,6 +237,9 @@ struct mm_struct { struct file *exe_file; unsigned long num_exe_file_vmas; #endif +#ifdef CONFIG_MMU_NOTIFIER + struct mmu_notifier_mm *mmu_notifier_mm; +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,265 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include +#include +#include +#include + +struct mmu_notifier; +struct mmu_notifier_ops; + +#ifdef CONFIG_MMU_NOTIFIER + +/* + * The mmu_notifier_mm structure is allocated and installed in + * mm->mmu_notifier_mm inside the mm_lock() protected critical section + * and it's released only when mm_count reaches zero in mmdrop(). + */ +struct mmu_notifier_mm { + /* all mmu notifiers registered in this mm are queued in this list */ + struct hlist_head list; + /* srcu structure for this mm */ + struct srcu_struct srcu; + /* to serialize the list modifications and hlist_unhashed */ + spinlock_t lock; +}; + +struct mmu_notifier_ops { + /* + * Called either by mmu_notifier_unregister or when the mm is + * being destroyed by exit_mmap, always before all pages are + * freed. It's mandatory to implement this method. This can + * run concurrently with other mmu notifier methods and it + * should tear down all secondary mmu mappings and freeze the + * secondary mmu.
+ */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * clear_flush_young is called after the VM is + * test-and-clearing the young/accessed bitflag in the + * pte. This way the VM will provide proper aging to the + * accesses to the page through the secondary MMUs and not + * only to the ones through the Linux pte. + */ + int (*clear_flush_young)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * Before this is invoked any secondary MMU is still ok to + * read/write to the page previously pointed to by the Linux + * pte because the page hasn't been freed yet and it won't be + * freed until this returns. If required set_page_dirty has to + * be called internally to this method. + */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_range_start() and invalidate_range_end() must be + * paired and are called only when the mmap_sem and/or the + * locks protecting the reverse maps are held. Both functions + * may sleep. The subsystem must guarantee that no additional + * references are taken to the pages in the range established + * between the call to invalidate_range_start() and the + * matching call to invalidate_range_end(). + * + * Invalidation of multiple concurrent ranges may be + * optionally permitted by the driver. Either way the + * establishment of sptes is forbidden in the range passed to + * invalidate_range_start/end for the whole duration of the + * invalidate_range_start/end critical section. + * + * invalidate_range_start() is called when all pages in the + * range are still mapped and have at least a refcount of one. + * + * invalidate_range_end() is called when all pages in the + * range have been unmapped and the pages have been freed by + * the VM. + * + * The VM will remove the page table entries and potentially + * the page between invalidate_range_start() and + * invalidate_range_end(). If the page must not be freed + * because of pending I/O or other circumstances then the + * invalidate_range_start() callback (or the initial mapping + * by the driver) must make sure that the refcount is kept + * elevated. + * + * If the driver increases the refcount when the pages are + * initially mapped into an address space then either + * invalidate_range_start() or invalidate_range_end() may + * decrease the refcount. If the refcount is decreased on + * invalidate_range_start() then the VM can free pages as page + * table entries are removed. If the refcount is only + * dropped on invalidate_range_end() then the driver itself + * will drop the last refcount but it must take care to flush + * any secondary tlb before doing the final free on the + * page. Pages will no longer be referenced by the Linux + * address space but may still be referenced by sptes until + * the last refcount is dropped. + */ + void (*invalidate_range_start)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_end)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +/* + * The notifier chains are protected by mmap_sem and/or the reverse map + * semaphores. Notifier chains are only changed when all reverse maps and + * the mmap_sem locks are taken. + * + * Therefore notifier chains can only be traversed when either + * + * 1. mmap_sem is held. + * 2. One of the reverse map locks is held (i_mmap_sem or anon_vma->sem). + * 3.
No other concurrent thread can access the list (release) + */ +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +static inline int mm_has_notifiers(struct mm_struct *mm) +{ + return unlikely(mm->mmu_notifier_mm); +} + +extern int mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); +extern void __mmu_notifier_release(struct mm_struct *mm); +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end); +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end); + + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_release(mm); +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + return __mmu_notifier_clear_flush_young(mm, address); + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_page(mm, address); +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_start(mm, start, end); +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end); +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ + mm->mmu_notifier_mm = NULL; +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_mm_destroy(mm); +} + +/* + * These two macros will eventually replace ptep_clear_flush. + * ptep_clear_flush is implemented as a macro itself, so this also is + * implemented as a macro until ptep_clear_flush is converted to an + * inline function, to diminish the risk of compilation failure. The + * invalidate_page method over time can be moved outside the PT lock + * and these two macros can be later removed.
+ */ +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ +({ \ + pte_t __pte; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __pte = ptep_clear_flush(___vma, ___address, __ptep); \ + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ + __pte; \ +}) + +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ +({ \ + int __young; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ + ___address); \ + __young; \ +}) + +#else /* CONFIG_MMU_NOTIFIER */ + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ +} + +#define ptep_clear_flush_young_notify ptep_clear_flush_young +#define ptep_clear_flush_notify ptep_clear_flush + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/include/linux/srcu.h b/include/linux/srcu.h --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -27,6 +27,8 @@ #ifndef _LINUX_SRCU_H #define _LINUX_SRCU_H +#include + struct srcu_struct_array { int c[2]; }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -53,6 +53,7 @@ #include #include #include +#include #include #include @@ -385,6 +386,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_mm_init(mm); return mm; } @@ -417,6 +419,7 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mmu_notifier_mm_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -205,3 +205,6 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + bool diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,7 +188,7 @@ __xip_unmap (struct address_space * mapp if (pte) { /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier_invalidate_range_start(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier_invalidate_range_end(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -823,6 +824,7 @@ void __unmap_hugepage_range(struct vm_ar BUG_ON(start & ~HPAGE_MASK); BUG_ON(end & ~HPAGE_MASK); + mmu_notifier_invalidate_range_start(mm, start, end); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -843,6 +845,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range_end(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -632,6 +633,7 @@ int copy_page_range(struct mm_struct *ds unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int ret; /* * Don't copy ptes where a page fault will fill them correctly. @@ -647,17 +649,33 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. + */ + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_start(src_mm, addr, end); + + ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next)) - return -ENOMEM; + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, + vma, addr, next))) { + ret = -ENOMEM; + break; + } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - return 0; + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_end(src_mm, + vma->vm_start, end); + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -861,7 +879,9 @@ unsigned long unmap_vmas(struct mmu_gath unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? 
details->i_mmap_lock: NULL; int fullmm = (*tlbp)->fullmm; + struct mm_struct *mm = vma->vm_mm; + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -912,6 +932,7 @@ unsigned long unmap_vmas(struct mmu_gath } } out: + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -1541,10 +1562,11 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); + mmu_notifier_invalidate_range_start(mm, start, end); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1552,6 +1574,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier_invalidate_range_end(mm, start, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1753,7 +1776,7 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. */ - ptep_clear_flush(vma, address, page_table); + ptep_clear_flush_notify(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -26,6 +26,9 @@ #include #include #include +#include +#include +#include #include #include @@ -2048,6 +2051,7 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mmu_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); @@ -2255,3 +2259,190 @@ int install_special_mapping(struct mm_st return 0; } + +static int mm_lock_cmp(const void *a, const void *b) +{ + unsigned long _a = (unsigned long)*(spinlock_t **)a; + unsigned long _b = (unsigned long)*(spinlock_t **)b; + + cond_resched(); + if (_a < _b) + return -1; + if (_a > _b) + return 1; + return 0; +} + +static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, + int anon) +{ + struct vm_area_struct *vma; + size_t i = 0; + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if (anon) { + if (vma->anon_vma) + locks[i++] = &vma->anon_vma->lock; + } else { + if (vma->vm_file && vma->vm_file->f_mapping) + locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + } + } + + if (!i) + goto out; + + sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + +out: + return i; +} + +static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 1); +} + +static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 0); +} + +static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +{ + spinlock_t *last = NULL; + size_t i; + + for (i = 0; i < nr; i++) + /* Multiple vmas may use the same lock. */ + if (locks[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) locks[i]); + last = locks[i]; + if (lock) + spin_lock(last); + else + spin_unlock(last); + } +} + +static inline void __mm_lock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 1); +} + +static inline void __mm_unlock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 0); +} + +/* + * This operation locks against the VM for all pte/vma/mm related + * operations that could ever happen on a certain mm. This includes + * vmtruncate, try_to_unmap, and all page faults. 
The holder + * must not hold any mm related lock. A single task can't take more + * than one mm_lock in a row or it would deadlock. + * + * The mmap_sem must be taken in write mode to block all operations + * that could modify pagetables and free pages without altering the + * vma layout (for example populate_range() with nonlinear vmas). + * + * The sorting is needed to avoid lock inversion deadlocks if two + * tasks run mm_lock at the same time on different mms that happen to + * share some anon_vmas/inodes but map them in a different order. + * + * mm_lock and mm_unlock are expensive operations that may have to + * take thousands of locks. Thanks to sort() the complexity is + * O(N*log(N)) where N is the number of VMAs in the mm. The max number + * of vmas is defined in /proc/sys/vm/max_map_count. + * + * mm_lock() can fail if memory allocation fails. The worst case + * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *), + * so around 1Mbyte, but in practice it'll be much less because + * normally there won't be max_map_count vmas allocated in the task + * that runs mm_lock(). + * + * The vmalloc memory allocated by mm_lock is stored in the + * mm_lock_data structure that must be allocated by the caller and it + * must later be passed to mm_unlock, which will free it after using it. + * Allocating the mm_lock_data structure on the stack is fine because + * it's only a few dozen bytes in size. + * + * If mm_lock() returns -ENOMEM no memory has been allocated and the + * mm_lock_data structure can be freed immediately, and mm_unlock must + * not be called. + */ +int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) +{ + spinlock_t **anon_vma_locks, **i_mmap_locks; + + down_write(&mm->mmap_sem); + if (mm->map_count) { + anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!anon_vma_locks)) { + up_write(&mm->mmap_sem); + return -ENOMEM; + } + + i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!i_mmap_locks)) { + up_write(&mm->mmap_sem); + vfree(anon_vma_locks); + return -ENOMEM; + } + + /* + * When mm_lock_sort_anon_vma/i_mmap returns zero it + * means there's no lock to take and so we can free + * the array here without waiting for mm_unlock. mm_unlock + * will do nothing if nr_i_mmap/anon_vma_locks is + * zero. + */ + data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); + data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); + + if (data->nr_anon_vma_locks) { + __mm_lock(anon_vma_locks, data->nr_anon_vma_locks); + data->anon_vma_locks = anon_vma_locks; + } else + vfree(anon_vma_locks); + + if (data->nr_i_mmap_locks) { + __mm_lock(i_mmap_locks, data->nr_i_mmap_locks); + data->i_mmap_locks = i_mmap_locks; + } else + vfree(i_mmap_locks); + } + return 0; +} + +static void mm_unlock_vfree(spinlock_t **locks, size_t nr) +{ + __mm_unlock(locks, nr); + vfree(locks); +} + +/* + * mm_unlock doesn't require any memory allocation and it won't fail. + * + * All memory has been previously allocated by mm_lock and it will all + * be freed before returning. Only after mm_unlock returns is the + * caller allowed to free and forget the mm_lock_data structure. + * + * mm_unlock runs in O(N) where N is the max number of VMAs in the + * mm. The max number of vmas is defined in + * /proc/sys/vm/max_map_count.
+ */ +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) +{ + if (mm->map_count) { + if (data->nr_anon_vma_locks) + mm_unlock_vfree(data->anon_vma_locks, + data->nr_anon_vma_locks); + if (data->nr_i_mmap_locks) + mm_unlock_vfree(data->i_mmap_locks, + data->nr_i_mmap_locks); + } + up_write(&mm->mmap_sem); +} diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,269 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include +#include +#include +#include +#include +#include +#include + +/* + * This function can't run concurrently with mmu_notifier_register + * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap + * runs with mm_users == 0. Other tasks may still invoke mmu notifiers + * in parallel despite there being no task using this mm any more, + * through the vmas outside of the exit_mmap context, such as with + * vmtruncate. This serializes against mmu_notifier_unregister with + * the mmu_notifier_mm->lock in addition to SRCU and it serializes + * against the other mmu notifiers with SRCU. struct mmu_notifier_mm + * can't go away from under us as exit_mmap holds an mm_count pin + * itself. + */ +void __mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + int srcu; + + spin_lock(&mm->mmu_notifier_mm->lock); + while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { + mn = hlist_entry(mm->mmu_notifier_mm->list.first, + struct mmu_notifier, + hlist); + /* + * We arrived before mmu_notifier_unregister so + * mmu_notifier_unregister will do nothing other than + * to wait for ->release to finish and for + * mmu_notifier_unregister to return. + */ + hlist_del_init_rcu(&mn->hlist); + /* + * SRCU here will block mmu_notifier_unregister until + * ->release returns. + */ + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * if ->release runs before mmu_notifier_unregister it + * must be handled as it's the only way for the driver + * to flush all existing sptes and stop the driver + * from establishing any more sptes before all the + * pages in the mm are freed. + */ + mn->ops->release(mn, mm); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + spin_lock(&mm->mmu_notifier_mm->lock); + } + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * synchronize_srcu here prevents mmu_notifier_release from + * returning to exit_mmap (which would proceed freeing all pages + * in the mm) until the ->release method returns, if it was + * invoked by mmu_notifier_unregister. + * + * The mmu_notifier_mm can't go away from under us because one + * mm_count is held by exit_mmap. + */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); +} + +/* + * If no young bitflag is supported by the hardware, ->clear_flush_young can + * unmap the address and return 1 or 0 depending on whether the mapping + * previously existed or not.
+ */ +int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0, srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->clear_flush_young) + young |= mn->ops->clear_flush_young(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + + return young; +} + +void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_page) + mn->ops->invalidate_page(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_start) + mn->ops->invalidate_range_start(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_end) + mn->ops->invalidate_range_end(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +/* + * Must not hold mmap_sem nor any other VM related lock when calling + * this registration function. Must also ensure mm_users can't go down + * to zero while this runs to avoid races with mmu_notifier_release, + * so mm has to be current->mm or the mm should be pinned safely such + * as with get_task_mm(). If the mm is not current->mm, the mm_users + * pin should be released by calling mmput after mmu_notifier_register + * returns. mmu_notifier_unregister must always be called to + * unregister the notifier. mm_count is automatically pinned to allow + * mmu_notifier_unregister to safely run at any time later, before or + * after exit_mmap. ->release will always be called before exit_mmap + * frees the pages. + */ +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + struct mm_lock_data data; + struct mmu_notifier_mm * mmu_notifier_mm; + int ret; + + BUG_ON(atomic_read(&mm->mm_users) <= 0); + + ret = -ENOMEM; + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); + if (unlikely(!mmu_notifier_mm)) + goto out; + + ret = init_srcu_struct(&mmu_notifier_mm->srcu); + if (unlikely(ret)) + goto out_kfree; + + ret = mm_lock(mm, &data); + if (unlikely(ret)) + goto out_cleanup; + + if (!mm_has_notifiers(mm)) { + INIT_HLIST_HEAD(&mmu_notifier_mm->list); + spin_lock_init(&mmu_notifier_mm->lock); + mm->mmu_notifier_mm = mmu_notifier_mm; + mmu_notifier_mm = NULL; + } + atomic_inc(&mm->mm_count); + + /* + * Serialize the update against mmu_notifier_unregister. A + * side note: mmu_notifier_release can't run concurrently with + * us because we hold the mm_users pin (either implicitly as + * current->mm or explicitly with get_task_mm() or similar).
+ * We can't race against any other mmu notifiers either thanks + * to mm_lock(). + */ + spin_lock(&mm->mmu_notifier_mm->lock); + hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); + spin_unlock(&mm->mmu_notifier_mm->lock); + + mm_unlock(mm, &data); +out_cleanup: + if (mmu_notifier_mm) + cleanup_srcu_struct(&mmu_notifier_mm->srcu); +out_kfree: + /* kfree() does nothing if mmu_notifier_mm is NULL */ + kfree(mmu_notifier_mm); +out: + BUG_ON(atomic_read(&mm->mm_users) <= 0); + return ret; +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +/* this is called after the last mmu_notifier_unregister() returned */ +void __mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list)); + cleanup_srcu_struct(&mm->mmu_notifier_mm->srcu); + kfree(mm->mmu_notifier_mm); + mm->mmu_notifier_mm = LIST_POISON1; /* debug */ +} + +/* + * This releases the mm_count pin automatically and frees the mm + * structure if it was the last user of it. It serializes against + * running mmu notifiers with SRCU and against mmu_notifier_unregister + * with the unregister lock + SRCU. All sptes must be dropped before + * calling mmu_notifier_unregister. ->release or any other notifier + * method may be invoked concurrently with mmu_notifier_unregister, + * and only after mmu_notifier_unregister has returned are we guaranteed + * that ->release or any other method can't run anymore. + */ +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + spin_lock(&mm->mmu_notifier_mm->lock); + if (!hlist_unhashed(&mn->hlist)) { + int srcu; + + hlist_del_rcu(&mn->hlist); + + /* + * SRCU here will force exit_mmap to wait for ->release to + * finish before freeing the pages. + */ + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * exit_mmap will block in mmu_notifier_release to + * guarantee ->release is called before freeing the + * pages. + */ + mn->ops->release(mn, mm); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + } else + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * Wait for any running method to finish, of course including + * ->release if it was run by mmu_notifier_release instead of us.
+ */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); + + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + mmdrop(mm); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -198,10 +199,12 @@ success: dirty_accountable = 1; } + mmu_notifier_invalidate_range_start(mm, start, end); if (is_vm_hugetlb_page(vma)) hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mmu_notifier_invalidate_range_end(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -74,7 +75,11 @@ static void move_ptes(struct vm_area_str struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start; + old_start = old_addr; + mmu_notifier_invalidate_range_start(vma->vm_mm, + old_start, old_end); if (vma->vm_file) { /* * Subtle point from Rajesh Venkatasubramanian: before @@ -116,6 +121,7 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -49,6 +49,7 @@ #include #include #include +#include #include @@ -287,7 +288,7 @@ static int page_referenced_one(struct pa if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young(vma, address, pte)) + } else if (ptep_clear_flush_young_notify(vma, address, pte)) referenced++; /* Pretend the page is referenced if the task has the @@ -457,7 +458,7 @@ static int page_mkclean_one(struct page pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush(vma, address, pte); + entry = ptep_clear_flush_notify(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -717,14 +718,14 @@ static int try_to_unmap_one(struct page * skipped over this mm) then we should reactivate it. */ if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte)))) { + (ptep_clear_flush_young_notify(vma, address, pte)))) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -849,12 +850,12 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young_notify(vma, address, pte)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* If nonlinear, store the file page offset in the pte. 
*/ if (page->index != linear_page_index(vma, address)) From andrea at qumranet.com Fri May 2 08:05:08 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:08 +0200 Subject: [ofa-general] [PATCH 05 of 11] unmap vmas tlb flushing In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1209740186 -7200 # Node ID a8ac53b928dfcea0ccb326fb7d71f908f0df85f4 # Parent 14e9f5a12bb1657fa6756e18d5dac71d4ad1a55e unmap vmas tlb flushing Move the tlb flushing inside of unmap vmas. This saves us from passing a pointer to the TLB structure around and simplifies the callers. Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -744,8 +744,7 @@ struct page *vm_normal_page(struct vm_ar unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *); -unsigned long unmap_vmas(struct mmu_gather **tlb, - struct vm_area_struct *start_vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -849,7 +849,6 @@ static unsigned long unmap_page_range(st /** * unmap_vmas - unmap a range of memory covered by a list of vma's - * @tlbp: address of the caller's struct mmu_gather * @vma: the starting vma * @start_addr: virtual address at which to start unmapping * @end_addr: virtual address at which to end unmapping @@ -861,20 +860,13 @@ static unsigned long unmap_page_range(st * Unmap all pages in the vma list. * * We aim to not hold locks for too long (for scheduling latency reasons). - * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to - * return the ending mmu_gather to the caller. + * So zap pages in ZAP_BLOCK_SIZE bytecounts. * * Only addresses between `start' and `end' will be unmapped. * * The VMA list must be sorted in ascending virtual address order. - * - * unmap_vmas() assumes that the caller will flush the whole unmapped address - * range after unmap_vmas() returns. So the only responsibility here is to - * ensure that any thus-far unmapped pages are flushed before unmap_vmas() - * drops the lock and schedules. */ -unsigned long unmap_vmas(struct mmu_gather **tlbp, - struct vm_area_struct *vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *details) { @@ -883,9 +875,14 @@ unsigned long unmap_vmas(struct mmu_gath int tlb_start_valid = 0; unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? 
details->i_mmap_lock: NULL; - int fullmm = (*tlbp)->fullmm; + int fullmm; + struct mmu_gather *tlb; struct mm_struct *mm = vma->vm_mm; + lru_add_drain(); + tlb = tlb_gather_mmu(mm, 0); + update_hiwater_rss(mm); + fullmm = tlb->fullmm; mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -912,7 +909,7 @@ unsigned long unmap_vmas(struct mmu_gath (HPAGE_SIZE / PAGE_SIZE); start = end; } else - start = unmap_page_range(*tlbp, vma, + start = unmap_page_range(tlb, vma, start, end, &zap_work, details); if (zap_work > 0) { @@ -920,22 +917,23 @@ unsigned long unmap_vmas(struct mmu_gath break; } - tlb_finish_mmu(*tlbp, tlb_start, start); + tlb_finish_mmu(tlb, tlb_start, start); if (need_resched() || (i_mmap_lock && spin_needbreak(i_mmap_lock))) { if (i_mmap_lock) { - *tlbp = NULL; + tlb = NULL; goto out; } cond_resched(); } - *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm); + tlb = tlb_gather_mmu(vma->vm_mm, fullmm); tlb_start_valid = 0; zap_work = ZAP_BLOCK_SIZE; } } + tlb_finish_mmu(tlb, start_addr, end_addr); out: mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ @@ -951,18 +949,10 @@ unsigned long zap_page_range(struct vm_a unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { - struct mm_struct *mm = vma->vm_mm; - struct mmu_gather *tlb; unsigned long end = address + size; unsigned long nr_accounted = 0; - lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); - update_hiwater_rss(mm); - end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) - tlb_finish_mmu(tlb, address, end); - return end; + return unmap_vmas(vma, address, end, &nr_accounted, details); } /* diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1751,15 +1751,10 @@ static void unmap_region(struct mm_struc unsigned long start, unsigned long end) { struct vm_area_struct *next = prev? prev->vm_next: mm->mmap; - struct mmu_gather *tlb; unsigned long nr_accounted = 0; - lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); - update_hiwater_rss(mm); - unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL); + unmap_vmas(vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - tlb_finish_mmu(tlb, start, end); free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); } @@ -2044,7 +2039,6 @@ EXPORT_SYMBOL(do_brk); /* Release all mmaps. 
*/ void exit_mmap(struct mm_struct *mm) { - struct mmu_gather *tlb; struct vm_area_struct *vma = mm->mmap; unsigned long nr_accounted = 0; unsigned long end; @@ -2055,12 +2049,11 @@ void exit_mmap(struct mm_struct *mm) lru_add_drain(); flush_cache_mm(mm); - tlb = tlb_gather_mmu(mm, 1); + /* Don't update_hiwater_rss(mm) here, do_exit already did */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ - end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); + end = unmap_vmas(vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - tlb_finish_mmu(tlb, 0, end); free_pgtables(vma, FIRST_USER_ADDRESS, 0); /* From andrea at qumranet.com Fri May 2 08:05:09 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:09 +0200 Subject: [ofa-general] [PATCH 06 of 11] rwsem contended In-Reply-To: Message-ID: <74b873f3ea07012e2fc8.1209740709@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1209740186 -7200 # Node ID 74b873f3ea07012e2fc864f203edf1179865feb1 # Parent a8ac53b928dfcea0ccb326fb7d71f908f0df85f4 rwsem contended Add a function to rw_semaphores to check if there are any processes waiting for the semaphore. Add rwsem_needbreak to sched.h that works in the same way as spinlock_needbreak().
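For illustration, a long-running loop that holds an rwsem could use the new helper to voluntarily break the lock, along these lines (a hypothetical sketch; more_work() and do_some_work() are made-up placeholders, and patch 07 applies exactly this pattern to i_mmap_sem in unmap_vmas):

	down_read(&sem);
	while (more_work()) {
		do_some_work();
		/* back off if somebody is waiting for the semaphore */
		if (need_resched() || rwsem_needbreak(&sem)) {
			up_read(&sem);
			cond_resched();
			down_read(&sem);
		}
	}
	up_read(&sem);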
Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -57,6 +57,8 @@ extern void up_write(struct rw_semaphore */ extern void downgrade_write(struct rw_semaphore *sem); +extern int rwsem_is_contended(struct rw_semaphore *sem); + #ifdef CONFIG_DEBUG_LOCK_ALLOC /* * nested locking. NOTE: rwsems are not allowed to recurse diff --git a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2030,6 +2030,15 @@ static inline int spin_needbreak(spinloc #endif } +static inline int rwsem_needbreak(struct rw_semaphore *sem) +{ +#ifdef CONFIG_PREEMPT + return rwsem_is_contended(sem); +#else + return 0; +#endif +} + /* * Reevaluate whether the task has signals pending delivery. * Wake the task if so. diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c --- a/lib/rwsem-spinlock.c +++ b/lib/rwsem-spinlock.c @@ -305,6 +305,18 @@ void __downgrade_write(struct rw_semapho spin_unlock_irqrestore(&sem->wait_lock, flags); } +int rwsem_is_contended(struct rw_semaphore *sem) +{ + /* + * Racy check for an empty list. False positives or negatives + * would be okay. False positives may cause a useless dropping of + * locks. False negatives may cause locks to be held a bit + * longer until the next check. + */ + return !list_empty(&sem->wait_list); +} + +EXPORT_SYMBOL(rwsem_is_contended); EXPORT_SYMBOL(__init_rwsem); EXPORT_SYMBOL(__down_read); EXPORT_SYMBOL(__down_read_trylock); diff --git a/lib/rwsem.c b/lib/rwsem.c --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -251,6 +251,18 @@ asmregparm struct rw_semaphore *rwsem_do return sem; } +int rwsem_is_contended(struct rw_semaphore *sem) +{ + /* + * Racy check for an empty list. False positives or negatives + * would be okay. False positives may cause a useless dropping of + * locks. False negatives may cause locks to be held a bit + * longer until the next check. + */ + return !list_empty(&sem->wait_list); + } + +EXPORT_SYMBOL(rwsem_is_contended); EXPORT_SYMBOL(rwsem_down_read_failed); EXPORT_SYMBOL(rwsem_down_write_failed); EXPORT_SYMBOL(rwsem_wake); From andrea at qumranet.com Fri May 2 08:05:03 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:03 +0200 Subject: [ofa-general] [PATCH 00 of 11] mmu notifier #v15 Message-ID: Hello everyone, 1/11 is the latest version of the mmu-notifier-core patch. As usual all later 2-11/11 patches follow, but those aren't meant for 2.6.26. Thanks! Andrea From andrea at qumranet.com Fri May 2 08:05:10 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:10 +0200 Subject: [ofa-general] [PATCH 07 of 11] i_mmap_rwsem In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1209740186 -7200 # Node ID de28c85baef11b90c993047ca851a2f52c85a5be # Parent 74b873f3ea07012e2fc864f203edf1179865feb1 i_mmap_rwsem The conversion to a rwsem allows notifier callbacks during rmap traversal for files. A rw style lock also allows concurrent walking of the reverse map, so that multiple processors can expire pages in the same memory area of the same process. This increases the potential concurrency. Signed-off-by: Andrea Arcangeli Signed-off-by: Christoph Lameter diff --git a/Documentation/vm/locking b/Documentation/vm/locking --- a/Documentation/vm/locking +++ b/Documentation/vm/locking @@ -66,7 +66,7 @@ expand_stack(), it is hard to come up wi expand_stack(), it is hard to come up with a destructive scenario without having the vmlist protection in this case. -The page_table_lock nests with the inode i_mmap_lock and the kmem cache +The page_table_lock nests with the inode i_mmap_sem and the kmem cache c_spinlock spinlocks. This is okay, since the kmem code asks for pages after dropping c_spinlock.
The page_table_lock also nests with pagecache_lock and pagemap_lru_lock spinlocks, and no code asks for memory with these locks diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str if (!vma_shareable(vma, addr)) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) { if (svma == vma) continue; @@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str put_page(virt_to_page(spte)); spin_unlock(&mm->page_table_lock); out: - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino pgoff = offset >> PAGE_SHIFT; i_size_write(inode, offset); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); if (!prio_tree_empty(&mapping->i_mmap)) hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff); - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); truncate_hugepages(inode, offset); return 0; } diff --git a/fs/inode.c b/fs/inode.c --- a/fs/inode.c +++ b/fs/inode.c @@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode INIT_LIST_HEAD(&inode->i_devices); INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC); rwlock_init(&inode->i_data.tree_lock); - spin_lock_init(&inode->i_data.i_mmap_lock); + init_rwsem(&inode->i_data.i_mmap_sem); INIT_LIST_HEAD(&inode->i_data.private_list); spin_lock_init(&inode->i_data.private_lock); INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap); diff --git a/include/linux/fs.h b/include/linux/fs.h --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -502,7 +502,7 @@ struct address_space { unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ - spinlock_t i_mmap_lock; /* protect tree, count, list */ + struct rw_semaphore i_mmap_sem; /* protect tree, count, list */ unsigned int truncate_count; /* Cover race condition with truncate */ unsigned long nrpages; /* number of total pages */ pgoff_t writeback_index;/* writeback starts here */ diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -735,7 +735,7 @@ struct zap_details { struct address_space *check_mapping; /* Check page->mapping if set */ pgoff_t first_index; /* Lowest page->index to unmap */ pgoff_t last_index; /* Highest page->index to unmap */ - spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */ + struct rw_semaphore *i_mmap_sem; /* For unmap_mapping_range: */ unsigned long truncate_count; /* Compare vm_truncate_count */ }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -297,12 +297,12 @@ static int dup_mmap(struct mm_struct *mm atomic_dec(&inode->i_writecount); /* insert tmp into the share list, just after mpnt */ - spin_lock(&file->f_mapping->i_mmap_lock); + down_write(&file->f_mapping->i_mmap_sem); tmp->vm_truncate_count = mpnt->vm_truncate_count; flush_dcache_mmap_lock(file->f_mapping); vma_prio_tree_add(tmp, mpnt); flush_dcache_mmap_unlock(file->f_mapping); - spin_unlock(&file->f_mapping->i_mmap_lock); + up_write(&file->f_mapping->i_mmap_sem); } /* diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -61,16 +61,16 @@ 
generic_file_direct_IO(int rw, struct ki /* * Lock ordering: * - * ->i_mmap_lock (vmtruncate) + * ->i_mmap_sem (vmtruncate) * ->private_lock (__free_pte->__set_page_dirty_buffers) * ->swap_lock (exclusive_swap_page, others) * ->mapping->tree_lock * * ->i_mutex - * ->i_mmap_lock (truncate->unmap_mapping_range) + * ->i_mmap_sem (truncate->unmap_mapping_range) * * ->mmap_sem - * ->i_mmap_lock + * ->i_mmap_sem * ->page_table_lock or pte_lock (various, mainly in memory.c) * ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock) * @@ -87,7 +87,7 @@ generic_file_direct_IO(int rw, struct ki * ->sb_lock (fs/fs-writeback.c) * ->mapping->tree_lock (__sync_single_inode) * - * ->i_mmap_lock + * ->i_mmap_sem * ->anon_vma.lock (vma_adjust) * * ->anon_vma.lock diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -178,7 +178,7 @@ __xip_unmap (struct address_space * mapp if (!page) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { mm = vma->vm_mm; address = vma->vm_start + @@ -198,7 +198,7 @@ __xip_unmap (struct address_space * mapp page_cache_release(page); } } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -206,13 +206,13 @@ asmlinkage long sys_remap_file_pages(uns } goto out; } - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); flush_dcache_mmap_lock(mapping); vma->vm_flags |= VM_NONLINEAR; vma_prio_tree_remove(vma, &mapping->i_mmap); vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); flush_dcache_mmap_unlock(mapping); - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } mmu_notifier_invalidate_range_start(mm, start, start + size); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -814,7 +814,7 @@ void __unmap_hugepage_range(struct vm_ar struct page *page; struct page *tmp; /* - * A page gathering list, protected by per file i_mmap_lock. The + * A page gathering list, protected by per file i_mmap_sem. The * lock is used to avoid list corruption from multiple unmapping * of the same page since we are using page->lru. */ @@ -864,9 +864,9 @@ void unmap_hugepage_range(struct vm_area * do nothing in this case. */ if (vma->vm_file) { - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); + down_write(&vma->vm_file->f_mapping->i_mmap_sem); __unmap_hugepage_range(vma, start, end); - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + up_write(&vma->vm_file->f_mapping->i_mmap_sem); } } @@ -1111,7 +1111,7 @@ void hugetlb_change_protection(struct vm BUG_ON(address >= end); flush_cache_range(vma, address, end); - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); + down_write(&vma->vm_file->f_mapping->i_mmap_sem); spin_lock(&mm->page_table_lock); for (; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -1126,7 +1126,7 @@ void hugetlb_change_protection(struct vm } } spin_unlock(&mm->page_table_lock); - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + up_write(&vma->vm_file->f_mapping->i_mmap_sem); flush_tlb_range(vma, start, end); } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -874,7 +874,7 @@ unsigned long unmap_vmas(struct vm_area_ unsigned long tlb_start = 0; /* For tlb_finish_mmu */ int tlb_start_valid = 0; unsigned long start = start_addr; - spinlock_t *i_mmap_lock = details? 
details->i_mmap_lock: NULL; + struct rw_semaphore *i_mmap_sem = details? details->i_mmap_sem: NULL; int fullmm; struct mmu_gather *tlb; struct mm_struct *mm = vma->vm_mm; @@ -920,8 +920,8 @@ unsigned long unmap_vmas(struct vm_area_ tlb_finish_mmu(tlb, tlb_start, start); if (need_resched() || - (i_mmap_lock && spin_needbreak(i_mmap_lock))) { - if (i_mmap_lock) { + (i_mmap_sem && rwsem_needbreak(i_mmap_sem))) { + if (i_mmap_sem) { tlb = NULL; goto out; } @@ -1829,7 +1829,7 @@ unwritable_page: /* * Helper functions for unmap_mapping_range(). * - * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __ + * __ Notes on dropping i_mmap_sem to reduce latency while unmapping __ * * We have to restart searching the prio_tree whenever we drop the lock, * since the iterator is only valid while the lock is held, and anyway @@ -1848,7 +1848,7 @@ unwritable_page: * can't efficiently keep all vmas in step with mapping->truncate_count: * so instead reset them all whenever it wraps back to 0 (then go to 1). * mapping->truncate_count and vma->vm_truncate_count are protected by - * i_mmap_lock. + * i_mmap_sem. * * In order to make forward progress despite repeatedly restarting some * large vma, note the restart_addr from unmap_vmas when it breaks out: @@ -1898,7 +1898,7 @@ again: restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr, details); - need_break = need_resched() || spin_needbreak(details->i_mmap_lock); + need_break = need_resched() || rwsem_needbreak(details->i_mmap_sem); if (restart_addr >= end_addr) { /* We have now completed this vma: mark it so */ @@ -1912,9 +1912,9 @@ again: goto again; } - spin_unlock(details->i_mmap_lock); + up_write(details->i_mmap_sem); cond_resched(); - spin_lock(details->i_mmap_lock); + down_write(details->i_mmap_sem); return -EINTR; } @@ -2008,9 +2008,9 @@ void unmap_mapping_range(struct address_ details.last_index = hba + hlen - 1; if (details.last_index < details.first_index) details.last_index = ULONG_MAX; - details.i_mmap_lock = &mapping->i_mmap_lock; + details.i_mmap_sem = &mapping->i_mmap_sem; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); /* Protect against endless unmapping loops */ mapping->truncate_count++; @@ -2025,7 +2025,7 @@ void unmap_mapping_range(struct address_ unmap_mapping_range_tree(&mapping->i_mmap, &details); if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details); - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } EXPORT_SYMBOL(unmap_mapping_range); diff --git a/mm/migrate.c b/mm/migrate.c --- a/mm/migrate.c +++ b/mm/migrate.c @@ -211,12 +211,12 @@ static void remove_file_migration_ptes(s if (!mapping) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) remove_migration_pte(vma, old, new); - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -189,7 +189,7 @@ error: } /* - * Requires inode->i_mapping->i_mmap_lock + * Requires inode->i_mapping->i_mmap_sem */ static void __remove_shared_vm_struct(struct vm_area_struct *vma, struct file *file, struct address_space *mapping) @@ -217,9 +217,9 @@ void unlink_file_vma(struct vm_area_stru if (file) { struct address_space *mapping = file->f_mapping; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); __remove_shared_vm_struct(vma, file, mapping); - 
spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } } @@ -445,7 +445,7 @@ static void vma_link(struct mm_struct *m mapping = vma->vm_file->f_mapping; if (mapping) { - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); vma->vm_truncate_count = mapping->truncate_count; } anon_vma_lock(vma); @@ -455,7 +455,7 @@ static void vma_link(struct mm_struct *m anon_vma_unlock(vma); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); mm->map_count++; validate_mm(mm); @@ -542,7 +542,7 @@ again: remove_next = 1 + (end > next-> mapping = file->f_mapping; if (!(vma->vm_flags & VM_NONLINEAR)) root = &mapping->i_mmap; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); if (importer && vma->vm_truncate_count != next->vm_truncate_count) { /* @@ -626,7 +626,7 @@ again: remove_next = 1 + (end > next-> if (anon_vma) spin_unlock(&anon_vma->lock); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); if (remove_next) { if (file) { @@ -2068,7 +2068,7 @@ void exit_mmap(struct mm_struct *mm) /* Insert vm structure into process list sorted by address * and into the inode's i_mmap tree. If vm_file is non-NULL - * then i_mmap_lock is taken here. + * then i_mmap_sem is taken here. */ int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma) { diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -88,7 +88,7 @@ static void move_ptes(struct vm_area_str * and we propagate stale pages into the dst afterward. */ mapping = vma->vm_file->f_mapping; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); if (new_vma->vm_truncate_count && new_vma->vm_truncate_count != vma->vm_truncate_count) new_vma->vm_truncate_count = 0; @@ -120,7 +120,7 @@ static void move_ptes(struct vm_area_str pte_unmap_nested(new_pte - 1); pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -24,7 +24,7 @@ * inode->i_alloc_sem (vmtruncate_range) * mm->mmap_sem * page->flags PG_locked (lock_page) - * mapping->i_mmap_lock + * mapping->i_mmap_sem * anon_vma->lock * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) @@ -373,14 +373,14 @@ static int page_referenced_file(struct p * The page lock not only makes sure that page->mapping cannot * suddenly be NULLified by truncation, it makes sure that the * structure at mapping cannot be freed and reused yet, - * so we can safely take mapping->i_mmap_lock. + * so we can safely take mapping->i_mmap_sem. */ BUG_ON(!PageLocked(page)); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); /* - * i_mmap_lock does not stabilize mapcount at all, but mapcount + * i_mmap_sem does not stabilize mapcount at all, but mapcount * is more likely to be accurate if we note it after spinning. 
*/ mapcount = page_mapcount(page); @@ -403,7 +403,7 @@ static int page_referenced_file(struct p break; } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return referenced; } @@ -490,12 +490,12 @@ static int page_mkclean_file(struct addr BUG_ON(PageAnon(page)); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { if (vma->vm_flags & VM_SHARED) ret += page_mkclean_one(page, vma); } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return ret; } @@ -930,7 +930,7 @@ static int try_to_unmap_file(struct page unsigned long max_nl_size = 0; unsigned int mapcount; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { ret = try_to_unmap_one(page, vma, migration); if (ret == SWAP_FAIL || !page_mapped(page)) @@ -967,7 +967,6 @@ static int try_to_unmap_file(struct page mapcount = page_mapcount(page); if (!mapcount) goto out; - cond_resched_lock(&mapping->i_mmap_lock); max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK; if (max_nl_cursor == 0) @@ -989,7 +988,6 @@ static int try_to_unmap_file(struct page } vma->vm_private_data = (void *) max_nl_cursor; } - cond_resched_lock(&mapping->i_mmap_lock); max_nl_cursor += CLUSTER_SIZE; } while (max_nl_cursor <= max_nl_size); @@ -1001,7 +999,7 @@ static int try_to_unmap_file(struct page list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) vma->vm_private_data = NULL; out: - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return ret; } From andrea at qumranet.com Fri May 2 08:05:11 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:11 +0200 Subject: [ofa-general] [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: Message-ID: <0be678c52e540d5f5d5f.1209740711@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1209740186 -7200 # Node ID 0be678c52e540d5f5d5fd9af549b57b9bb018d32 # Parent de28c85baef11b90c993047ca851a2f52c85a5be anon-vma-rwsem Convert the anon_vma spinlock to a rw semaphore. This allows concurrent traversal of reverse maps for try_to_unmap() and page_mkclean(). It also allows the calling of sleeping functions from reverse map traversal as needed for the notifier callbacks. It includes possible concurrency. Rcu is used in some context to guarantee the presence of the anon_vma (try_to_unmap) while we acquire the anon_vma lock. We cannot take a semaphore within an rcu critical section. Add a refcount to the anon_vma structure which allow us to give an existence guarantee for the anon_vma structure independent of the spinlock or the list contents. The refcount can then be taken within the RCU section. If it has been taken successfully then the refcount guarantees the existence of the anon_vma. The refcount in anon_vma also allows us to fix a nasty issue in page migration where we fudged by using rcu for a long code path to guarantee the existence of the anon_vma. I think this is a bug because the anon_vma may become empty and get scheduled to be freed but then we increase the refcount again when the migration entries are removed. The refcount in general allows a shortening of RCU critical sections since we can do a rcu_unlock after taking the refcount. This is particularly relevant if the anon_vma chains contain hundreds of entries. However: - Atomic overhead increases in situations where a new reference to the anon_vma has to be established or removed. 
Overhead also increases when a speculative reference is used (try_to_unmap, page_mkclean, page migration). - There is the potential for more frequent processor change due to up_xxx letting waiting tasks run first. This results in f.e. the Aim9 brk performance test to got down by 10-15%. Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/rmap.h b/include/linux/rmap.h --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -25,7 +25,8 @@ * pointing to this anon_vma once its vma list is empty. */ struct anon_vma { - spinlock_t lock; /* Serialize access to vma list */ + atomic_t refcount; /* vmas on the list */ + struct rw_semaphore sem;/* Serialize access to vma list */ struct list_head head; /* List of private "related" vmas */ }; @@ -43,18 +44,31 @@ static inline void anon_vma_free(struct kmem_cache_free(anon_vma_cachep, anon_vma); } +struct anon_vma *grab_anon_vma(struct page *page); + +static inline void get_anon_vma(struct anon_vma *anon_vma) +{ + atomic_inc(&anon_vma->refcount); +} + +static inline void put_anon_vma(struct anon_vma *anon_vma) +{ + if (atomic_dec_and_test(&anon_vma->refcount)) + anon_vma_free(anon_vma); +} + static inline void anon_vma_lock(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); } static inline void anon_vma_unlock(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); } /* diff --git a/mm/migrate.c b/mm/migrate.c --- a/mm/migrate.c +++ b/mm/migrate.c @@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s return; /* - * We hold the mmap_sem lock. So no need to call page_lock_anon_vma. + * We hold either the mmap_sem lock or a reference on the + * anon_vma. So no need to call page_lock_anon_vma. */ anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON); - spin_lock(&anon_vma->lock); + down_read(&anon_vma->sem); list_for_each_entry(vma, &anon_vma->head, anon_vma_node) remove_migration_pte(vma, old, new); - spin_unlock(&anon_vma->lock); + up_read(&anon_vma->sem); } /* @@ -630,7 +631,7 @@ static int unmap_and_move(new_page_t get int rc = 0; int *result = NULL; struct page *newpage = get_new_page(page, private, &result); - int rcu_locked = 0; + struct anon_vma *anon_vma = NULL; int charge = 0; if (!newpage) @@ -654,16 +655,14 @@ static int unmap_and_move(new_page_t get } /* * By try_to_unmap(), page->mapcount goes down to 0 here. In this case, - * we cannot notice that anon_vma is freed while we migrates a page. + * we cannot notice that anon_vma is freed while we migrate a page. * This rcu_read_lock() delays freeing anon_vma pointer until the end * of migration. File cache pages are no problem because of page_lock() * File Caches may use write_page() or lock_page() in migration, then, * just care Anon page here. */ - if (PageAnon(page)) { - rcu_read_lock(); - rcu_locked = 1; - } + if (PageAnon(page)) + anon_vma = grab_anon_vma(page); /* * Corner case handling: @@ -681,10 +680,7 @@ static int unmap_and_move(new_page_t get if (!PageAnon(page) && PagePrivate(page)) { /* * Go direct to try_to_free_buffers() here because - * a) that's what try_to_release_page() would do anyway - * b) we may be under rcu_read_lock() here, so we can't - * use GFP_KERNEL which is what try_to_release_page() - * needs to be effective. 
+ * that's what try_to_release_page() would do anyway */ try_to_free_buffers(page); } @@ -705,8 +701,8 @@ static int unmap_and_move(new_page_t get } else if (charge) mem_cgroup_end_migration(newpage); rcu_unlock: - if (rcu_locked) - rcu_read_unlock(); + if (anon_vma) + put_anon_vma(anon_vma); unlock: diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -570,7 +570,7 @@ again: remove_next = 1 + (end > next-> if (vma->anon_vma) anon_vma = vma->anon_vma; if (anon_vma) { - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); /* * Easily overlooked: when mprotect shifts the boundary, * make sure the expanding vma has anon_vma set if the @@ -624,7 +624,7 @@ again: remove_next = 1 + (end > next-> } if (anon_vma) - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); if (mapping) up_write(&mapping->i_mmap_sem); diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -69,7 +69,7 @@ int anon_vma_prepare(struct vm_area_stru if (anon_vma) { allocated = NULL; locked = anon_vma; - spin_lock(&locked->lock); + down_write(&locked->sem); } else { anon_vma = anon_vma_alloc(); if (unlikely(!anon_vma)) @@ -81,6 +81,7 @@ int anon_vma_prepare(struct vm_area_stru /* page_table_lock to protect against threads */ spin_lock(&mm->page_table_lock); if (likely(!vma->anon_vma)) { + get_anon_vma(anon_vma); vma->anon_vma = anon_vma; list_add_tail(&vma->anon_vma_node, &anon_vma->head); allocated = NULL; @@ -88,7 +89,7 @@ int anon_vma_prepare(struct vm_area_stru spin_unlock(&mm->page_table_lock); if (locked) - spin_unlock(&locked->lock); + up_write(&locked->sem); if (unlikely(allocated)) anon_vma_free(allocated); } @@ -99,14 +100,17 @@ void __anon_vma_merge(struct vm_area_str { BUG_ON(vma->anon_vma != next->anon_vma); list_del(&next->anon_vma_node); + put_anon_vma(vma->anon_vma); } void __anon_vma_link(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; - if (anon_vma) + if (anon_vma) { + get_anon_vma(anon_vma); list_add_tail(&vma->anon_vma_node, &anon_vma->head); + } } void anon_vma_link(struct vm_area_struct *vma) @@ -114,36 +118,32 @@ void anon_vma_link(struct vm_area_struct struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) { - spin_lock(&anon_vma->lock); + get_anon_vma(anon_vma); + down_write(&anon_vma->sem); list_add_tail(&vma->anon_vma_node, &anon_vma->head); - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); } } void anon_vma_unlink(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; - int empty; if (!anon_vma) return; - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); list_del(&vma->anon_vma_node); - - /* We must garbage collect the anon_vma if it's empty */ - empty = list_empty(&anon_vma->head); - spin_unlock(&anon_vma->lock); - - if (empty) - anon_vma_free(anon_vma); + up_write(&anon_vma->sem); + put_anon_vma(anon_vma); } static void anon_vma_ctor(struct kmem_cache *cachep, void *data) { struct anon_vma *anon_vma = data; - spin_lock_init(&anon_vma->lock); + init_rwsem(&anon_vma->sem); + atomic_set(&anon_vma->refcount, 0); INIT_LIST_HEAD(&anon_vma->head); } @@ -157,9 +157,9 @@ void __init anon_vma_init(void) * Getting a lock on a stable anon_vma from a page off the LRU is * tricky: page_lock_anon_vma rely on RCU to guard against the races. 
*/ -static struct anon_vma *page_lock_anon_vma(struct page *page) +struct anon_vma *grab_anon_vma(struct page *page) { - struct anon_vma *anon_vma; + struct anon_vma *anon_vma = NULL; unsigned long anon_mapping; rcu_read_lock(); @@ -170,17 +170,26 @@ static struct anon_vma *page_lock_anon_v goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); - spin_lock(&anon_vma->lock); - return anon_vma; + if (!atomic_inc_not_zero(&anon_vma->refcount)) + anon_vma = NULL; out: rcu_read_unlock(); - return NULL; + return anon_vma; +} + +static struct anon_vma *page_lock_anon_vma(struct page *page) +{ + struct anon_vma *anon_vma = grab_anon_vma(page); + + if (anon_vma) + down_read(&anon_vma->sem); + return anon_vma; } static void page_unlock_anon_vma(struct anon_vma *anon_vma) { - spin_unlock(&anon_vma->lock); - rcu_read_unlock(); + up_read(&anon_vma->sem); + put_anon_vma(anon_vma); } /* From andrea at qumranet.com Fri May 2 08:05:13 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:13 +0200 Subject: [ofa-general] [PATCH 10 of 11] export zap_page_range for XPMEM In-Reply-To: Message-ID: <4f462fb3dff614cd7d97.1209740713@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1209740229 -7200 # Node ID 4f462fb3dff614cd7d971219c3feaef0b43359c1 # Parent 721c3787cd42043734331e54a42eb20c51766f71 export zap_page_range for XPMEM XPMEM would have used sys_madvise() except that madvise_dontneed() returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages XPMEM imports from other partitions and is also true for uncached pages allocated locally via the mspec allocator. XPMEM needs zap_page_range() functionality for these types of pages as well as 'normal' pages. Signed-off-by: Dean Nelson Signed-off-by: Andrea Arcangeli diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -954,6 +954,7 @@ unsigned long zap_page_range(struct vm_a return unmap_vmas(vma, address, end, &nr_accounted, details); } +EXPORT_SYMBOL_GPL(zap_page_range); /* * Do a quick page-table lookup for a single page. From andrea at qumranet.com Fri May 2 08:05:12 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 17:05:12 +0200 Subject: [ofa-general] [PATCH 09 of 11] mm_lock-rwsem In-Reply-To: Message-ID: <721c3787cd4204373433.1209740712@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1209740226 -7200 # Node ID 721c3787cd42043734331e54a42eb20c51766f71 # Parent 0be678c52e540d5f5d5fd9af549b57b9bb018d32 mm_lock-rwsem Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock conversion. 
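The core deadlock-avoidance idea is "sort, then lock": collect the
anon_vma/i_mmap semaphore pointers of all vmas in the mm, sort them so
there is a single global acquisition order, and take each distinct one
in ascending address order. A minimal standalone sketch of that
pattern (helper names here are illustrative, not the patch's API):

#include <linux/rwsem.h>
#include <linux/sort.h>

static int sem_cmp(const void *a, const void *b)
{
	unsigned long _a = (unsigned long)*(struct rw_semaphore **)a;
	unsigned long _b = (unsigned long)*(struct rw_semaphore **)b;

	return _a < _b ? -1 : (_a > _b ? 1 : 0);
}

/*
 * Take every distinct semaphore in ascending address order, so two
 * tasks locking overlapping sets can never deadlock against each
 * other. Duplicates are skipped because several vmas may share one
 * anon_vma or one i_mmap_sem.
 */
static void lock_sems_sorted(struct rw_semaphore **sems, size_t nr)
{
	struct rw_semaphore *last = NULL;
	size_t i;

	sort(sems, nr, sizeof(struct rw_semaphore *), sem_cmp, NULL);
	for (i = 0; i < nr; i++)
		if (sems[i] != last) {
			last = sems[i];
			down_write(last);
		}
}

Unlocking walks the same array with up_write(); the order no longer
matters once all the locks are held.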
Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1084,10 +1084,10 @@ extern int install_special_mapping(struc unsigned long flags, struct page **pages); struct mm_lock_data { - spinlock_t **i_mmap_locks; - spinlock_t **anon_vma_locks; - size_t nr_i_mmap_locks; - size_t nr_anon_vma_locks; + struct rw_semaphore **i_mmap_sems; + struct rw_semaphore **anon_vma_sems; + size_t nr_i_mmap_sems; + size_t nr_anon_vma_sems; }; extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2255,8 +2255,8 @@ int install_special_mapping(struct mm_st static int mm_lock_cmp(const void *a, const void *b) { - unsigned long _a = (unsigned long)*(spinlock_t **)a; - unsigned long _b = (unsigned long)*(spinlock_t **)b; + unsigned long _a = (unsigned long)*(struct rw_semaphore **)a; + unsigned long _b = (unsigned long)*(struct rw_semaphore **)b; cond_resched(); if (_a < _b) @@ -2266,7 +2266,7 @@ static int mm_lock_cmp(const void *a, co return 0; } -static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, +static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems, int anon) { struct vm_area_struct *vma; @@ -2275,59 +2275,59 @@ static unsigned long mm_lock_sort(struct for (vma = mm->mmap; vma; vma = vma->vm_next) { if (anon) { if (vma->anon_vma) - locks[i++] = &vma->anon_vma->lock; + sems[i++] = &vma->anon_vma->sem; } else { if (vma->vm_file && vma->vm_file->f_mapping) - locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem; } } if (!i) goto out; - sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL); out: return i; } static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, - spinlock_t **locks) + struct rw_semaphore **sems) { - return mm_lock_sort(mm, locks, 1); + return mm_lock_sort(mm, sems, 1); } static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, - spinlock_t **locks) + struct rw_semaphore **sems) { - return mm_lock_sort(mm, locks, 0); + return mm_lock_sort(mm, sems, 0); } -static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock) { - spinlock_t *last = NULL; + struct rw_semaphore *last = NULL; size_t i; for (i = 0; i < nr; i++) /* Multiple vmas may use the same lock. */ - if (locks[i] != last) { - BUG_ON((unsigned long) last > (unsigned long) locks[i]); - last = locks[i]; + if (sems[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) sems[i]); + last = sems[i]; if (lock) - spin_lock(last); + down_write(last); else - spin_unlock(last); + up_write(last); } } -static inline void __mm_lock(spinlock_t **locks, size_t nr) +static inline void __mm_lock(struct rw_semaphore **sems, size_t nr) { - mm_lock_unlock(locks, nr, 1); + mm_lock_unlock(sems, nr, 1); } -static inline void __mm_unlock(spinlock_t **locks, size_t nr) +static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr) { - mm_lock_unlock(locks, nr, 0); + mm_lock_unlock(sems, nr, 0); } /* @@ -2351,10 +2351,10 @@ static inline void __mm_unlock(spinlock_ * of vmas is defined in /proc/sys/vm/max_map_count. * * mm_lock() can fail if memory allocation fails. 
The worst case - * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *), - * so around 1Mbyte, but in practice it'll be much less because - * normally there won't be max_map_count vmas allocated in the task - * that runs mm_lock(). + * vmalloc allocation required is 2*max_map_count*sizeof(struct + * rw_semaphore *), so around 1Mbyte, but in practice it'll be much + * less because normally there won't be max_map_count vmas allocated + * in the task that runs mm_lock(). * * The vmalloc memory allocated by mm_lock is stored in the * mm_lock_data structure that must be allocated by the caller and it @@ -2368,20 +2368,20 @@ static inline void __mm_unlock(spinlock_ */ int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) { - spinlock_t **anon_vma_locks, **i_mmap_locks; + struct rw_semaphore **anon_vma_sems, **i_mmap_sems; down_write(&mm->mmap_sem); if (mm->map_count) { - anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); - if (unlikely(!anon_vma_locks)) { + anon_vma_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count); + if (unlikely(!anon_vma_sems)) { up_write(&mm->mmap_sem); return -ENOMEM; } - i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); - if (unlikely(!i_mmap_locks)) { + i_mmap_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count); + if (unlikely(!i_mmap_sems)) { up_write(&mm->mmap_sem); - vfree(anon_vma_locks); + vfree(anon_vma_sems); return -ENOMEM; } @@ -2389,31 +2389,31 @@ int mm_lock(struct mm_struct *mm, struct * When mm_lock_sort_anon_vma/i_mmap returns zero it * means there's no lock to take and so we can free * the array here without waiting mm_unlock. mm_unlock - * will do nothing if nr_i_mmap/anon_vma_locks is + * will do nothing if nr_i_mmap/anon_vma_sems is * zero. */ - data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); - data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); + data->nr_anon_vma_sems = mm_lock_sort_anon_vma(mm, anon_vma_sems); + data->nr_i_mmap_sems = mm_lock_sort_i_mmap(mm, i_mmap_sems); - if (data->nr_anon_vma_locks) { - __mm_lock(anon_vma_locks, data->nr_anon_vma_locks); - data->anon_vma_locks = anon_vma_locks; + if (data->nr_anon_vma_sems) { + __mm_lock(anon_vma_sems, data->nr_anon_vma_sems); + data->anon_vma_sems = anon_vma_sems; } else - vfree(anon_vma_locks); + vfree(anon_vma_sems); - if (data->nr_i_mmap_locks) { - __mm_lock(i_mmap_locks, data->nr_i_mmap_locks); - data->i_mmap_locks = i_mmap_locks; + if (data->nr_i_mmap_sems) { + __mm_lock(i_mmap_sems, data->nr_i_mmap_sems); + data->i_mmap_sems = i_mmap_sems; } else - vfree(i_mmap_locks); + vfree(i_mmap_sems); } return 0; } -static void mm_unlock_vfree(spinlock_t **locks, size_t nr) +static void mm_unlock_vfree(struct rw_semaphore **sems, size_t nr) { - __mm_unlock(locks, nr); - vfree(locks); + __mm_unlock(sems, nr); + vfree(sems); } /* @@ -2430,12 +2430,12 @@ void mm_unlock(struct mm_struct *mm, str void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) { if (mm->map_count) { - if (data->nr_anon_vma_locks) - mm_unlock_vfree(data->anon_vma_locks, - data->nr_anon_vma_locks); - if (data->nr_i_mmap_locks) - mm_unlock_vfree(data->i_mmap_locks, - data->nr_i_mmap_locks); + if (data->nr_anon_vma_sems) + mm_unlock_vfree(data->anon_vma_sems, + data->nr_anon_vma_sems); + if (data->nr_i_mmap_sems) + mm_unlock_vfree(data->i_mmap_sems, + data->nr_i_mmap_sems); } up_write(&mm->mmap_sem); } From andrea at qumranet.com Fri May 2 08:05:14 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Fri, 02 May 2008 
17:05:14 +0200 Subject: [ofa-general] [PATCH 11 of 11] mmap sems In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1209740229 -7200 # Node ID b4bf6df98bc00bfbef9423b0dd31cfdba63a5eeb # Parent 4f462fb3dff614cd7d971219c3feaef0b43359c1 mmap sems This patch adds a lock ordering rule to avoid a potential deadlock when multiple mmap_sems need to be locked. Signed-off-by: Dean Nelson Signed-off-by: Andrea Arcangeli diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -79,6 +79,9 @@ generic_file_direct_IO(int rw, struct ki * * ->i_mutex (generic_file_buffered_write) * ->mmap_sem (fault_in_pages_readable->do_page_fault) + * + * When taking multiple mmap_sems, one should lock the lowest-addressed + * one first proceeding on up to the highest-addressed one. * * ->i_mutex * ->i_alloc_sem (various) From swise at opengridcomputing.com Fri May 2 09:17:41 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 02 May 2008 11:17:41 -0500 Subject: [ofa-general] [PATCH 2.6.26 1/3] RDMA/cxgb3: QP flush fixes Message-ID: <20080502161741.30500.95337.stgit@dell3.ogc.int> - Flush the QP only after the HW disables the connection. Currently we flush the QP when transitioning to CLOSING. This exposes a race condition where the HW can complete a RECV WR, for instance, -and- the SW can flush that same WR. - Only call CQ event handlers on flush IFF we actually flushed anything. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 13 ++++++++++--- drivers/infiniband/hw/cxgb3/cxio_hal.h | 4 ++-- drivers/infiniband/hw/cxgb3/iwch_qp.c | 13 ++++++++----- 3 files changed, 20 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index 3de0fbf..8a86960 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -359,9 +359,10 @@ static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq) cq->sw_wptr++; } -void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) +int cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) { u32 ptr; + int flushed = 0; PDBG("%s wq %p cq %p\n", __FUNCTION__, wq, cq); @@ -369,8 +370,11 @@ void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) PDBG("%s rq_rptr %u rq_wptr %u skip count %u\n", __FUNCTION__, wq->rq_rptr, wq->rq_wptr, count); ptr = wq->rq_rptr + count; - while (ptr++ != wq->rq_wptr) + while (ptr++ != wq->rq_wptr) { insert_recv_cqe(wq, cq); + flushed++; + } + return flushed; } static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, @@ -394,9 +398,10 @@ static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, cq->sw_wptr++; } -void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) +int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) { __u32 ptr; + int flushed = 0; struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2); ptr = wq->sq_rptr + count; @@ -405,7 +410,9 @@ void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) insert_sq_cqe(wq, cq, sqp); sqp++; ptr++; + flushed++; } + return flushed; } /* diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h index 2bcff7f..69ab08e 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.h +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h @@ -173,8 +173,8 @@ u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp); void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid); int __init cxio_hal_init(void); void __exit 
cxio_hal_exit(void); -void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count); -void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count); +int cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count); +int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count); void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count); void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count); void cxio_flush_hw_cq(struct t3_cq *cq); diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index b0e5aea..353fbb3 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -655,6 +655,7 @@ static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag) { struct iwch_cq *rchp, *schp; int count; + int flushed; rchp = get_chp(qhp->rhp, qhp->attr.rcq); schp = get_chp(qhp->rhp, qhp->attr.scq); @@ -669,20 +670,22 @@ static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag) spin_lock(&qhp->lock); cxio_flush_hw_cq(&rchp->cq); cxio_count_rcqes(&rchp->cq, &qhp->wq, &count); - cxio_flush_rq(&qhp->wq, &rchp->cq, count); + flushed = cxio_flush_rq(&qhp->wq, &rchp->cq, count); spin_unlock(&qhp->lock); spin_unlock_irqrestore(&rchp->lock, *flag); - (*rchp->ibcq.comp_handler)(&rchp->ibcq, rchp->ibcq.cq_context); + if (flushed) + (*rchp->ibcq.comp_handler)(&rchp->ibcq, rchp->ibcq.cq_context); /* locking heirarchy: cq lock first, then qp lock. */ spin_lock_irqsave(&schp->lock, *flag); spin_lock(&qhp->lock); cxio_flush_hw_cq(&schp->cq); cxio_count_scqes(&schp->cq, &qhp->wq, &count); - cxio_flush_sq(&qhp->wq, &schp->cq, count); + flushed = cxio_flush_sq(&qhp->wq, &schp->cq, count); spin_unlock(&qhp->lock); spin_unlock_irqrestore(&schp->lock, *flag); - (*schp->ibcq.comp_handler)(&schp->ibcq, schp->ibcq.cq_context); + if (flushed) + (*schp->ibcq.comp_handler)(&schp->ibcq, schp->ibcq.cq_context); /* deref */ if (atomic_dec_and_test(&qhp->refcnt)) @@ -880,7 +883,6 @@ int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, ep = qhp->ep; get_ep(&ep->com); } - flush_qp(qhp, &flag); break; case IWCH_QP_STATE_TERMINATE: qhp->attr.state = IWCH_QP_STATE_TERMINATE; @@ -911,6 +913,7 @@ int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, } switch (attrs->next_state) { case IWCH_QP_STATE_IDLE: + flush_qp(qhp, &flag); qhp->attr.state = IWCH_QP_STATE_IDLE; qhp->attr.llp_stream_handle = NULL; put_ep(&qhp->ep->com); From swise at opengridcomputing.com Fri May 2 09:17:43 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 02 May 2008 11:17:43 -0500 Subject: [ofa-general] [PATCH 2.6.26 2/3] RDMA/cxgb3: Silently ignore close reply after abort. In-Reply-To: <20080502161741.30500.95337.stgit@dell3.ogc.int> References: <20080502161741.30500.95337.stgit@dell3.ogc.int> Message-ID: <20080502161743.30500.27928.stgit@dell3.ogc.int> Remove bad BUG_ON() in close_con_rpl(). It is possible to get a close_rpl message on a dead connection. The sequence is: host refs ep for close exchange host posts close_req hw posts PEER_ABORT from incoming RST host marks ep DEAD host posts ABORT_RPL and releases ep resources hw posts CLOSE_RPL host derefs ep and ep freed. 
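Since the ep can thus legitimately be DEAD by the time the CLOSE_RPL
arrives, the state switch has to treat DEAD like ABORTING and ignore
the message instead of hitting the BUG_ON(). An abbreviated sketch of
the resulting handling (the actual one-line fix is in the diff below):

	switch (ep->com.state) {
	case ABORTING:
	case DEAD:
		/* A CLOSE_RPL can arrive after a PEER_ABORT has
		 * already marked the ep DEAD -- ignore it silently. */
		break;
	default:
		BUG_ON(1);
		break;
	}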
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index f4f3c9e..b2db0a9 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -1650,8 +1650,8 @@ static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) release = 1; break; case ABORTING: - break; case DEAD: + break; default: BUG_ON(1); break; From swise at opengridcomputing.com Fri May 2 09:17:45 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 02 May 2008 11:17:45 -0500 Subject: [ofa-general] [PATCH 2.6.26 3/3] RDMA/cxgb3: Bump up the mpa connection setup timeout. In-Reply-To: <20080502161741.30500.95337.stgit@dell3.ogc.int> References: <20080502161741.30500.95337.stgit@dell3.ogc.int> Message-ID: <20080502161745.30500.99485.stgit@dell3.ogc.int> Testing on large clusters shows its way too short at 10 secs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index b2db0a9..9ea3a07 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -67,10 +67,10 @@ int peer2peer = 0; module_param(peer2peer, int, 0644); MODULE_PARM_DESC(peer2peer, "Support peer2peer ULPs (default=0)"); -static int ep_timeout_secs = 10; +static int ep_timeout_secs = 60; module_param(ep_timeout_secs, int, 0644); MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout " - "in seconds (default=10)"); + "in seconds (default=60)"); static int mpa_rev = 1; module_param(mpa_rev, int, 0644); From rdreier at cisco.com Fri May 2 10:58:10 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 02 May 2008 10:58:10 -0700 Subject: [ofa-general] Re: [PATCH 2.6.26 1/3] RDMA/cxgb3: QP flush fixes In-Reply-To: <20080502161741.30500.95337.stgit@dell3.ogc.int> (Steve Wise's message of "Fri, 02 May 2008 11:17:41 -0500") References: <20080502161741.30500.95337.stgit@dell3.ogc.int> Message-ID: thanks, applied all 3... (1/3 was against a slightly old tree, so I had to fix up __FUNCTION__ -> __func__ in the context of the diffs) From rdreier at cisco.com Fri May 2 11:15:10 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 02 May 2008 11:15:10 -0700 Subject: [ofa-general] Re: [PATCH 01/13] QLogic VNIC: Driver - netdev implementation In-Reply-To: <20080430171624.31725.98475.stgit@localhost.localdomain> (Ramachandra K.'s message of "Wed, 30 Apr 2008 22:46:24 +0530") References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171624.31725.98475.stgit@localhost.localdomain> Message-ID: > From: Ramachandra K > > QLogic Virtual NIC Driver. This patch implements netdev registration, > netdev functions and state maintenance of the QLogic Virtual NIC > corresponding to the various events associated with the QLogic Ethernet > Virtual I/O Controller (EVIC/VEx) connection. > > Signed-off-by: Poornima Kamath > Signed-off-by: Amar Mudrankit For the next submission please clean up the From and Signed-off-by lines. As it stands now you are saying that you (Ramachandra K) are the author of the patch, and that Poornima and Amar signed off on it (ie forwarded it), but you as the person sending the email did not sign off on it. 
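For illustration only (which of the three actually authored the patch
is for you to say; addresses elided here as elsewhere in this archive),
the conventional layout when someone other than the sender wrote the
patch is:

	From: Poornima Kamath <address>

	[patch description]

	Signed-off-by: Poornima Kamath <address>
	Signed-off-by: Amar Mudrankit <address>
	Signed-off-by: Ramachandra K <address>

That is, the From: line at the top of the body names the author, and
the person sending the email adds his or her own Signed-off-by last in
the chain.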
> +#include

I would like to kill off the caching support in the IB core, so adding
new users of the API is not desirable. However, your code doesn't seem
to call any functions from this header anyway, so I guess you can just
delete the include.

> +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath)
> +{
> +	VNIC_FUNCTION("vnic_stop_xmit()\n");
> +	if (netpath == vnic->current_path) {
> +		if (vnic->xmit_started) {
> +			netif_stop_queue(vnic->netdevice);
> +			vnic->xmit_started = 0;
> +		}
> +
> +		vnic_stop_xmit_stats(vnic);
> +	}
> +}

Do you have sufficient locking here? Could vnic->current_path or
vnic->xmit_started change after they are tested, leading to bad
results? Also, do you get anything from having an xmit_started flag
that you couldn't get just by testing with netif_queue_stopped()?

> +	vnic = (struct vnic *)device->priv;

All this device->priv usage should probably be netdev_priv() instead,
and without a cast (since a cast from void * is not needed).

> +	if (jiffies > netpath->connect_time +
> +			vnic->config->no_path_timeout) {

You want to use time_after() for jiffies comparisons to avoid problems
with jiffies wrap.

> +	vnic->netdevice = alloc_netdev((int) 0, config->name, vnic_setup);
> +	vnic->netdevice->priv = (void *)vnic;

Not sure this is even kosher to do any more. Anyway, I think it's much
cleaner if you just allocate everything with alloc_netdev instead of
trying to stick your own structure in the priv pointer.

> +extern cycles_t recv_ref;

Seems like too generic a name to make global. What the heck are you
using cycles_t to keep track of anyway?

> +/* This array should be kept next to enum above since a change to npevent_type
> +   enum affects this array. */
> +static const char *const vnic_npevent_str[] = {
> +	"PRIMARY CONNECTED",
> +	"PRIMARY DISCONNECTED",
> +	"PRIMARY CARRIER",

Putting this in a header means every file that uses it gets a private
copy.

From xavier at tddft.org Fri May 2 11:21:15 2008
From: xavier at tddft.org (Xavier Andrade)
Date: Fri, 2 May 2008 20:21:15 +0200 (CEST)
Subject: [ofa-general] Loading of ib_mthca fails
In-Reply-To: <4815BA98.8000802@mellanox.co.il>
References: <4815BA98.8000802@mellanox.co.il>
Message-ID:

Hi Tziporet,

On Mon, 28 Apr 2008, Tziporet Koren wrote:
>
> Attached is the ini file for this PSID.
> Please create a binary using the MFT package on our web site and try to burn
> it.
> If you have more issues please work with Todd, who is cc'ed on this mail.
>

I have generated the firmware with the .ini you sent me; I burned it,
and for the moment it seems to work.

Thanks,
Xavier

From swise at opengridcomputing.com Fri May 2 13:16:23 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 02 May 2008 15:16:23 -0500
Subject: [ofa-general] Re: [PATCH 2.6.26 1/3] RDMA/cxgb3: QP flush fixes
In-Reply-To:
References: <20080502161741.30500.95337.stgit@dell3.ogc.int>
Message-ID: <481B7697.2080500@opengridcomputing.com>

Roland Dreier wrote:
> thanks, applied all 3... (1/3 was against a slightly old tree, so I had
> to fix up __FUNCTION__ -> __func__ in the context of the diffs)
>

Sorry about that. Someday I'll learn __func__. ;-)

From steiner at sgi.com Sat May 3 04:09:04 2008
From: steiner at sgi.com (Jack Steiner)
Date: Sat, 3 May 2008 06:09:04 -0500
Subject: [ofa-general] Re: [PATCH 00 of 11] mmu notifier #v15
In-Reply-To:
References:
Message-ID: <20080503110904.GA19688@sgi.com>

On Fri, May 02, 2008 at 05:05:03PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
>
> 1/11 is the latest version of the mmu-notifier-core patch.
> > As usual all later 2-11/11 patches follows but those aren't meant for 2.6.26. > Not sure why -mm is different, but I get compile errors w/o the following... --- jack Index: linux/mm/mmu_notifier.c =================================================================== --- linux.orig/mm/mmu_notifier.c 2008-05-02 16:54:52.780576831 -0500 +++ linux/mm/mmu_notifier.c 2008-05-02 16:56:38.817719509 -0500 @@ -16,6 +16,7 @@ #include #include #include +#include /* * This function can't run concurrently against mmu_notifier_register From swise at opengridcomputing.com Sat May 3 09:05:26 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 03 May 2008 11:05:26 -0500 Subject: [ofa-general] [ GIT PULL ofed-1.3.1] - more cxgb3 fixes for -rc1 Message-ID: <481C8D46.3050305@opengridcomputing.com> Vlad, Please pull these additional upstream bug fixes into ofed-1.3.1. Pull from git://git.openfabrics.org/~swise/ofed-1.3.git ofed_kernel Shortlog: Steve Wise (4): RDMA/cxgb3: Program hardware IRD with correct value RDMA/cxgb3: QP flush fixes RDMA/cxgb3: Silently ignore close reply after abort. RDMA/cxgb3: Bump up the mpa connection setup timeout. From vlad at dev.mellanox.co.il Sun May 4 00:43:43 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 04 May 2008 10:43:43 +0300 Subject: [ofa-general] Re: [GIT PULL ofed-1.3.1] - chelsio changes for ofed-1.3.1 In-Reply-To: <4818E2C5.7060907@opengridcomputing.com> References: <4818E2C5.7060907@opengridcomputing.com> Message-ID: <481D692F.5080503@dev.mellanox.co.il> Steve Wise wrote: > Vlad, > > Please pull from: > > git://git.openfabrics.org/~swise/ofed-1.3 ofed_kernel > > This will sync up ofed-1.3.1 with all the important upstream fixes since > ofed-1.3. The patch files added are: > > kernel_patches/fixes/iw_cxgb3_0080_Fail_Loopback_Connections.patch > kernel_patches/fixes/iw_cxgb3_0090_Fix_shift_calc_in_build_phys_page_list_for_1-entry_page_lists.patch > > kernel_patches/fixes/iw_cxgb3_0100_Return_correct_max_inline_data_when_creating_a_QP.patch > > kernel_patches/fixes/iw_cxgb3_0110_Fix_iwch_create_cq_off-by-one_error.patch > > kernel_patches/fixes/iw_cxgb3_0120_Dont_access_a_cm_id_after_dropping_reference.patch > > kernel_patches/fixes/iw_cxgb3_0130_Correctly_set_the_max_mr_size_device_attribute.patch > > kernel_patches/fixes/iw_cxgb3_0140_Correctly_serialize_peer_abort_path.patch > > kernel_patches/fixes/iw_cxgb3_0150_Support_peer-2-peer_connection_setup.patch > > > > Thanks, > > Steve. > Done, Regards, Vladimir From vlad at dev.mellanox.co.il Sun May 4 00:45:49 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 04 May 2008 10:45:49 +0300 Subject: [ofa-general] Re: [GIT PULL ofed-1.3.1] libcxgb3 version 1.2.0 In-Reply-To: <4818E35C.4050206@opengridcomputing.com> References: <4818E35C.4050206@opengridcomputing.com> Message-ID: <481D69AD.9070404@dev.mellanox.co.il> Steve Wise wrote: > Vlad, > > Please pull in version 1.2.0 of libcxgb3. This is needed for the > ofed-1.3.1 kernel drivers. > > Pull from: > > git://git.openfabrics.org/~swise/libcxgb3 ofed_1_3_1 > > Thanks, > > Steve. 
> Done, Regards, Vladimir From vlad at dev.mellanox.co.il Sun May 4 00:49:21 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 04 May 2008 10:49:21 +0300 Subject: [ofa-general] Re: [ GIT PULL ofed-1.3.1] - more cxgb3 fixes for -rc1 In-Reply-To: <481C8D46.3050305@opengridcomputing.com> References: <481C8D46.3050305@opengridcomputing.com> Message-ID: <481D6A81.3060705@dev.mellanox.co.il> Steve Wise wrote: > Vlad, > > Please pull these additional upstream bug fixes into ofed-1.3.1. Pull from > > git://git.openfabrics.org/~swise/ofed-1.3.git ofed_kernel > > Shortlog: > > Steve Wise (4): > RDMA/cxgb3: Program hardware IRD with correct value > RDMA/cxgb3: QP flush fixes > RDMA/cxgb3: Silently ignore close reply after abort. > RDMA/cxgb3: Bump up the mpa connection setup timeout. > Done, Regards, Vladimir From vlad at dev.mellanox.co.il Sun May 4 00:52:59 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 04 May 2008 10:52:59 +0300 Subject: [ofa-general] Re: [ewg] [ANNOUNCE] dapl-1.2.6 and dapl-2.0.8 release In-Reply-To: <000401c8abb4$1334a7a0$f9b7020a@amr.corp.intel.com> References: <000401c8abb4$1334a7a0$f9b7020a@amr.corp.intel.com> Message-ID: <481D6B5B.20208@dev.mellanox.co.il> Arlin Davis wrote: > New release for uDAPL v1 (1.2.6) and v2 (2.0.8) is available at: > > http://www.openfabrics.org/downloads/dapl > > md5sum: 752ae54a93b4883c88b41241f52db4ab dapl-1.2.6.tar.gz > md5sum: a48f9da59318c395bcc6ad170226764a dapl-2.0.8.tar.gz > > Vlad, please pull into OFED 1.3.1 using package spec files and installing: > > dapl-1.2.6-1 > dapl-devel-1.2.6-1 > dapl-2.0.8-1 > dapl-utils-2.0.8-1 > dapl-devel-2.0.8-1 > dapl-debuginfo-2.0.8-1 > > tags: dapl-1.2.6-1, dapl-2.0.8-1 > > Summary of changes since last release: > > v2 - add private data exchange with reject > v1,v2 - better error reporting in non-debug builds > v1,v2 - update only OFA entries in dat.conf, cooperate with non-ofa providers > v1,v2 - support for zero byte operations, iov==NULL > v1,v2 - multi-transport support for inline data and private data differences > v1,v2 - fix memory leaks and other reported bugs since OFED 1.3 > > Thanks, > > -arlin > Done, Regards, Vladimir From kliteyn at dev.mellanox.co.il Sun May 4 02:57:30 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 04 May 2008 12:57:30 +0300 Subject: [ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache Message-ID: <481D888A.7080608@dev.mellanox.co.il> Hi Sasha, The following series of 4 patches implements unicast routing cache in OpenSM. None of the current routing engines is scalable when we're talking about big clusters. On ~5K cluster with ~1.3K switches, it takes about two minutes to calculate the routing. The problem is, each time the routing is calculated from scratch. Incremental routing (which is on my to-do list) aims to address this problem when there is some "local" change in fabric (e.g. single switch failure, single link failure, link added, etc). In such cases we can use the routing that was already calculated in the previous heavy sweep, and then we just have to modify it according to the change. For instance, if some switch has disappeared from the fabric, we can use the routing that existed with this switch, take a step back from this switch and see if it is possible to route all the lids that were routed through this switch some other way (which is usually the case). To implement incremental routing, we need to create some kind of unicast routing cache, which is what these patches implement. 
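To give the idea in code form, the decision the unicast manager makes
on every heavy sweep boils down to something like the sketch below
(all names are hypothetical, not the actual patch API; the tolerated
cases are exactly the ones listed further down):

	enum topo_diff {
		TOPO_SAME,            /* no topology change */
		TOPO_CA_MISSING,      /* one or more CAs disappeared */
		TOPO_LEAF_SW_MISSING, /* one or more leaf switches disappeared */
		TOPO_OTHER            /* any other change */
	};

	/* Non-zero means the cached LFTs can be written to the switches
	 * as-is; zero means invalidate the cache, recalculate the
	 * routing as usual, and cache the new result. */
	static int cache_usable(enum topo_diff diff)
	{
		return diff == TOPO_SAME ||
		       diff == TOPO_CA_MISSING ||
		       diff == TOPO_LEAF_SW_MISSING;
	}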
In addition to being a step toward incremental routing, the routing
cache is useful by itself. This cache can save us a routing
calculation when the change is in the leaf switches or in the hosts.
For instance, if some node is rebooted, OpenSM would start a heavy
sweep with a full routing recalculation when the HCA goes down, and
another one when the HCA is brought back up, when in fact both of
these routing calculations can be replaced by using the unicast
routing cache.

Unicast routing cache comprises the following:
 - Topology: a data structure with all the switches and CAs of the fabric
 - LFTs: each switch has an LFT cached
 - Lid matrices: each switch has lid matrices cached, which are needed
   for multicast routing (which is not cached).

There is a topology matching function that compares the current
topology with the cached one to find out whether the cache is usable
(valid) or not.

The cache is used the following way:
 - SM is executed
 - it starts the first routing calculation
 - the calculated routing is stored in the cache
 - at some point a new heavy sweep is triggered
 - the unicast manager checks whether the cache can be used instead of
   a new routing calculation.

In one of the following cases we can use the cached routing:
 + there is no topology change
 + one or more CAs disappeared (they exist in the cached topology
   model, but are missing in the newly discovered fabric)
 + one or more leaf switches disappeared

In these cases the cached routing is written to the switches as is
(unless the switch doesn't exist). If there is any other topology
change:
 - the existing cache is invalidated
 - the topology is cached
 - the routing is calculated as usual
 - the routing is cached

My simulations show that where the usual routing phase of the heavy
sweep on the topology I mentioned above takes ~2 minutes, the cached
routing reduces this time to 6 seconds (which is nice, if you ask
me...).

Of all the cases when the cache is valid, the most painful and
"complainable" case is when a compute node reboot (which happens
pretty often) causes two heavy sweeps with two full routing
calculations. The unicast routing cache is aimed at solving this
problem (again, in addition to being a step toward incremental
routing).

-- Yevgeny

From kliteyn at dev.mellanox.co.il Sun May 4 02:59:33 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sun, 04 May 2008 12:59:33 +0300
Subject: [ofa-general] [PATCH 1/4] opensm/osm_ucast_cache.{c, h}: ucast routing cache implementation
Message-ID: <481D8905.1010207@dev.mellanox.co.il>

Unicast routing cache implementation.

Unicast routing cache comprises the following:
 - Topology: a data structure with all the switches and CAs of the fabric
 - LFTs: each switch has an LFT cached
 - Lid matrices: each switch has lid matrices cached, which are needed
   for multicast routing (which is not cached).

There is also a topology matching function that compares the current
topology with the cached one to find out whether the cache is usable
(valid) or not.

Signed-off-by: Yevgeny Kliteynik
---
 opensm/include/opensm/osm_ucast_cache.h | 319 ++++++++
 opensm/opensm/osm_ucast_cache.c | 1197 +++++++++++++++++++++++++++++++
 2 files changed, 1516 insertions(+), 0 deletions(-)
 create mode 100644 opensm/include/opensm/osm_ucast_cache.h
 create mode 100644 opensm/opensm/osm_ucast_cache.c

diff --git a/opensm/include/opensm/osm_ucast_cache.h b/opensm/include/opensm/osm_ucast_cache.h
new file mode 100644
index 0000000..a3b40f9
--- /dev/null
+++ b/opensm/include/opensm/osm_ucast_cache.h
@@ -0,0 +1,319 @@
+/*
+ * Copyright (c) 2002-2008 Voltaire, Inc.
All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Declaration of osm_ucast_cache_t. + * This object represents the Unicast Cache object. + * + * Environment: + * Linux User Mode + * + * $Revision: 1.4 $ + */ + +#ifndef _OSM_UCAST_CACHE_H_ +#define _OSM_UCAST_CACHE_H_ + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +struct _osm_ucast_mgr; + +#define UCAST_CACHE_TOPOLOGY_MATCH 0x0000 +#define UCAST_CACHE_TOPOLOGY_LESS_SWITCHES 0x0001 +#define UCAST_CACHE_TOPOLOGY_LINK_TO_LEAF_SW_MISSING 0x0002 +#define UCAST_CACHE_TOPOLOGY_LINK_TO_CA_MISSING 0x0004 +#define UCAST_CACHE_TOPOLOGY_MORE_SWITCHES 0x0008 +#define UCAST_CACHE_TOPOLOGY_NEW_LID 0x0010 +#define UCAST_CACHE_TOPOLOGY_LINK_TO_SW_MISSING 0x0020 +#define UCAST_CACHE_TOPOLOGY_LINK_ADDED 0x0040 +#define UCAST_CACHE_TOPOLOGY_NEW_SWITCH 0x0080 +#define UCAST_CACHE_TOPOLOGY_NEW_CA 0x0100 +#define UCAST_CACHE_TOPOLOGY_NO_MATCH 0x0200 + +/****h* OpenSM/Unicast Manager/Unicast Cache +* NAME +* Unicast Cache +* +* DESCRIPTION +* The Unicast Cache object encapsulates the information +* needed to cache and write unicast routing of the subnet. +* +* The Unicast Cache object is NOT thread safe. +* +* This object should be treated as opaque and should be +* manipulated only through the provided functions. +* +* AUTHOR +* Yevgeny Kliteynik, Mellanox +* +*********/ + + +/****s* OpenSM: Unicast Cache/osm_ucast_cache_t +* NAME +* osm_ucast_cache_t +* +* DESCRIPTION +* Unicast Cache structure. +* +* This object should be treated as opaque and should +* be manipulated only through the provided functions. +* +* SYNOPSIS +*/ +typedef struct osm_ucast_cache_t_ { + struct _osm_ucast_mgr * p_ucast_mgr; + cl_qmap_t sw_tbl; + cl_qmap_t ca_tbl; + boolean_t topology_valid; + boolean_t routing_valid; + boolean_t need_update; +} osm_ucast_cache_t; +/* +* FIELDS +* p_ucast_mgr +* Pointer to the Unicast Manager for this subnet. +* +* sw_tbl +* Cached switches table. 
+* +* ca_tbl +* Cached CAs table. +* +* topology_valid +* TRUE if the cache is populated with the fabric topology. +* +* routing_valid +* TRUE if the cache is populated with the unicast routing +* in addition to the topology. +* +* need_update +* TRUE if the cached routing needs to be updated. +* +* SEE ALSO +* Unicast Manager object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_construct +* NAME +* osm_ucast_cache_construct +* +* DESCRIPTION +* This function constructs a Unicast Cache object. +* +* SYNOPSIS +*/ +osm_ucast_cache_t * +osm_ucast_cache_construct(struct _osm_ucast_mgr * const p_mgr); +/* +* PARAMETERS +* p_mgr +* [in] Pointer to a Unicast Manager object. +* +* RETURN VALUE +* This function return the created Ucast Cache object on success, +* or NULL on any error. +* +* NOTES +* Allows osm_ucast_cache_destroy +* +* Calling osm_ucast_mgr_construct is a prerequisite to +* calling any other method. +* +* SEE ALSO +* Unicast Cache object, osm_ucast_cache_destroy +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_destroy +* NAME +* osm_ucast_cache_destroy +* +* DESCRIPTION +* The osm_ucast_cache_destroy function destroys the object, +* releasing all resources. +* +* SYNOPSIS +*/ +void osm_ucast_cache_destroy(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the object to destroy. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* Performs any necessary cleanup of the specified +* Unicast Cache object. +* Further operations should not be attempted on the +* destroyed object. +* This function should only be called after a call to +* osm_ucast_cache_construct. +* +* SEE ALSO +* Unicast Cache object, osm_ucast_cache_construct +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_refresh_topo +* NAME +* osm_ucast_cache_refresh_topo +* +* DESCRIPTION +* The osm_ucast_cache_refresh_topo function re-reads the +* updated topology. +* +* SYNOPSIS +*/ +void osm_ucast_cache_refresh_topo(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the cache object to refresh. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* This function invalidates the existing unicast cache +* and re-reads the updated topology. +* +* SEE ALSO +* Unicast Cache object, osm_ucast_cache_construct +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_refresh_lid_matrices +* NAME +* osm_ucast_cache_refresh_lid_matrices +* +* DESCRIPTION +* The osm_ucast_cache_refresh_topo function re-reads the +* updated lid matrices. +* +* SYNOPSIS +*/ +void osm_ucast_cache_refresh_lid_matrices(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the cache object to refresh. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* This function re-reads the updated lid matrices. +* +* SEE ALSO +* Unicast Cache object, osm_ucast_cache_construct +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_apply +* NAME +* osm_ucast_cache_apply +* +* DESCRIPTION +* The osm_ucast_cache_apply function tries to apply +* the cached unicast routing on the subnet switches. +* +* SYNOPSIS +*/ +int osm_ucast_cache_apply(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the cache object to be used. +* +* RETURN VALUE +* 0 if unicast cache was successfully written to switches, +* non-zero for any error. 
+* +* NOTES +* Compares the current topology to the cached topology, +* and if the topology matches, or if changes in topology +* have no impact on routing tables, writes the cached +* unicast routing to the subnet switches. +* +* SEE ALSO +* Unicast Cache object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_set_sw_fwd_table +* NAME +* osm_ucast_cache_set_sw_fwd_table +* +* DESCRIPTION +* The osm_ucast_cache_set_sw_fwd_table function sets +* (caches) linear forwarding table for the specified +* switch. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_set_sw_fwd_table(osm_ucast_cache_t * p_cache, + uint8_t * ucast_mgr_lft_buf, + osm_switch_t * p_osm_sw); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the cache object to be used. +* +* ucast_mgr_lft_buf +* [in] LFT to set. +* +* p_osm_sw +* [in] pointer to the switch that the LFT refers to. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* +* SEE ALSO +* Unicast Cache object +*********/ + +END_C_DECLS +#endif /* _OSM_UCAST_MGR_H_ */ + diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c new file mode 100644 index 0000000..4ad7c30 --- /dev/null +++ b/opensm/opensm/osm_ucast_cache.c @@ -0,0 +1,1197 @@ +/* + * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +/* + * Abstract: + * Implementation of OpenSM Cached routing + * + * Environment: + * Linux User Mode + * + */ + +#if HAVE_CONFIG_H +# include +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct cache_sw_t_; +struct cache_ca_t_; +struct cache_port_t_; + +typedef union cache_sw_or_ca_ { + struct cache_sw_t_ * p_sw; + struct cache_ca_t_ * p_ca; +} cache_node_t; + +typedef struct cache_port_t_ { + uint8_t remote_node_type; + cache_node_t remote_node; +} cache_port_t; + +typedef struct cache_ca_t_ { + cl_map_item_t map_item; + uint16_t lid_ho; +} cache_ca_t; + +typedef struct cache_sw_t_ { + cl_map_item_t map_item; + uint16_t lid_ho; + uint16_t max_lid_ho; + osm_switch_t *p_osm_sw; /* pointer to the updated switch object */ + uint8_t num_ports; + cache_port_t ** ports; + uint8_t **lid_matrix; + uint8_t * lft_buff; + boolean_t is_leaf; +} cache_sw_t; + +/********************************************************************** + **********************************************************************/ + +static osm_switch_t * +__ucast_cache_get_starting_osm_sw(osm_ucast_cache_t * p_cache) +{ + osm_port_t * p_osm_port; + osm_node_t * p_osm_node; + osm_physp_t * p_osm_physp; + + CL_ASSERT(p_cache->p_ucast_mgr); + + /* find the OSM node */ + p_osm_port = osm_get_port_by_guid( + p_cache->p_ucast_mgr->p_subn, + p_cache->p_ucast_mgr->p_subn->sm_port_guid); + CL_ASSERT(p_osm_port); + + p_osm_node = p_osm_port->p_node; + switch (osm_node_get_type(p_osm_node)) { + case IB_NODE_TYPE_SWITCH: + /* OpenSM runs on switch - we're done */ + return p_osm_node->sw; + + case IB_NODE_TYPE_CA: + /* SM runs on CA - get the switch + that CA is connected to. */ + p_osm_physp = p_osm_port->p_physp; + p_osm_physp = osm_physp_get_remote(p_osm_physp); + p_osm_node = osm_physp_get_node_ptr(p_osm_physp); + CL_ASSERT(p_osm_node); + return p_osm_node->sw; + + default: + /* SM runs on some other node - not supported */ + return NULL; + } +} /* __ucast_cache_get_starting_osm_sw() */ + +/********************************************************************** + **********************************************************************/ + +static cache_sw_t * +__ucast_cache_get_sw(osm_ucast_cache_t * p_cache, + uint16_t lid_ho) +{ + cache_sw_t * p_sw; + + p_sw = (cache_sw_t *) cl_qmap_get(&p_cache->sw_tbl, lid_ho); + if (p_sw == (cache_sw_t *) cl_qmap_end(&p_cache->sw_tbl)) + return NULL; + + return p_sw; +} /* __ucast_cache_get_sw() */ + +/********************************************************************** + **********************************************************************/ + +static cache_ca_t * +__ucast_cache_get_ca(osm_ucast_cache_t * p_cache, + uint16_t lid_ho) +{ + cache_ca_t * p_ca; + + p_ca = (cache_ca_t *) cl_qmap_get(&p_cache->ca_tbl, lid_ho); + if (p_ca == (cache_ca_t *) cl_qmap_end(&p_cache->ca_tbl)) + return NULL; + + return p_ca; +} /* __ucast_cache_get_ca() */ + +/********************************************************************** + **********************************************************************/ + +static cache_port_t * +__ucast_cache_add_port(osm_ucast_cache_t * p_cache, + uint8_t remote_node_type, + uint16_t lid_ho) +{ + cache_port_t * p_port = (cache_port_t *) malloc(sizeof(cache_port_t)); + memset(p_port, 0, sizeof(cache_port_t)); + + p_port->remote_node_type = remote_node_type; + if (remote_node_type == IB_NODE_TYPE_SWITCH) + { + cache_sw_t * p_sw = __ucast_cache_get_sw( + p_cache, 
lid_ho); + CL_ASSERT(p_sw); + p_port->remote_node.p_sw = p_sw; + } + else { + cache_ca_t * p_ca = __ucast_cache_get_ca( + p_cache, lid_ho); + CL_ASSERT(p_ca); + p_port->remote_node.p_ca = p_ca; + } + + return p_port; +} /* __ucast_cache_add_port() */ + +/********************************************************************** + **********************************************************************/ + +static cache_sw_t * +__ucast_cache_add_sw(osm_ucast_cache_t * p_cache, + osm_switch_t * p_osm_sw) +{ + cache_sw_t *p_sw = (cache_sw_t*)malloc(sizeof(cache_sw_t)); + memset(p_sw, 0, sizeof(cache_sw_t)); + + p_sw->p_osm_sw = p_osm_sw; + + p_sw->lid_ho = + cl_ntoh16(osm_node_get_base_lid(p_osm_sw->p_node, 0)); + + p_sw->num_ports = osm_node_get_num_physp(p_osm_sw->p_node); + p_sw->ports = (cache_port_t **) + malloc(p_sw->num_ports * sizeof(cache_port_t *)); + memset(p_sw->ports, 0, p_sw->num_ports * sizeof(cache_port_t *)); + + cl_qmap_insert(&p_cache->sw_tbl, p_sw->lid_ho, &p_sw->map_item); + return p_sw; +} /* __ucast_cache_add_sw() */ + +/********************************************************************** + **********************************************************************/ + +static cache_ca_t * +__ucast_cache_add_ca(osm_ucast_cache_t * p_cache, + uint16_t lid_ho) +{ + cache_ca_t *p_ca = (cache_ca_t*)malloc(sizeof(cache_ca_t)); + memset(p_ca, 0, sizeof(cache_ca_t)); + + p_ca->lid_ho = lid_ho; + + cl_qmap_insert(&p_cache->ca_tbl, p_ca->lid_ho, &p_ca->map_item); + return p_ca; +} /* __ucast_cache_add_ca() */ + +/********************************************************************** + **********************************************************************/ + +static void +__cache_port_destroy(cache_port_t * p_port) +{ + if (!p_port) + return; + free(p_port); +} + +/********************************************************************** + **********************************************************************/ + +static void +__cache_sw_destroy(cache_sw_t * p_sw) +{ + int i; + + if (!p_sw) + return; + + if (p_sw->ports) { + for (i = 0; i < p_sw->num_ports; i++) + if (p_sw->ports[i]) + __cache_port_destroy(p_sw->ports[i]); + free(p_sw->ports); + } + + if (p_sw->lid_matrix) { + for (i = 0; i <= p_sw->max_lid_ho; i++) + if (p_sw->lid_matrix[i]) + free(p_sw->lid_matrix[i]); + free(p_sw->lid_matrix); + } + + if (p_sw->lft_buff) + free(p_sw->lft_buff); + + free(p_sw); +} /* __cache_sw_destroy() */ + +/********************************************************************** + **********************************************************************/ + +static void +__cache_ca_destroy(cache_ca_t * p_ca) +{ + if (!p_ca) + return; + free(p_ca); +} + +/********************************************************************** + **********************************************************************/ + +static int +__ucast_cache_populate(osm_ucast_cache_t * p_cache) +{ + cl_list_t sw_bfs_list; + osm_switch_t * p_osm_sw; + osm_switch_t * p_remote_osm_sw; + osm_node_t * p_osm_node; + osm_node_t * p_remote_osm_node; + osm_physp_t * p_osm_physp; + osm_physp_t * p_remote_osm_physp; + cache_sw_t * p_sw; + cache_sw_t * p_remote_sw; + cache_ca_t * p_remote_ca; + uint16_t remote_lid_ho; + unsigned num_ports; + unsigned i; + int res = 0; + osm_log_t * p_log = p_cache->p_ucast_mgr->p_log; + + OSM_LOG_ENTER(p_log); + + cl_list_init(&sw_bfs_list, 10); + + /* Use management switch or switch that is connected + to management CA as a BFS scan starting point */ + + p_osm_sw = __ucast_cache_get_starting_osm_sw(p_cache); + if 
(!p_osm_sw) { + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 3A51: " + "failed getting cache population starting point\n"); + res = 1; + goto Exit; + } + + /* switch is cached BEFORE entering to the BFS list, + so we will know whether this switch was "visited" */ + + p_sw = __ucast_cache_add_sw(p_cache, p_osm_sw); + cl_list_insert_tail(&sw_bfs_list, p_sw); + + /* Create cached switches in the BFS order. + This will ensure that the fabric scan is done each + time the same way and will allow accurate matching + between the current fabric and the cached one. */ + while (!cl_is_list_empty(&sw_bfs_list)) { + p_sw = (cache_sw_t *) cl_list_remove_head(&sw_bfs_list); + p_osm_sw = p_sw->p_osm_sw; + p_osm_node = p_osm_sw->p_node; + num_ports = osm_node_get_num_physp(p_osm_node); + + /* skipping port 0 on switches */ + for (i = 1; i < num_ports; i++) { + p_osm_physp = osm_node_get_physp_ptr(p_osm_node, i); + if (!p_osm_physp || + !osm_physp_is_valid(p_osm_physp) || + !osm_link_is_healthy(p_osm_physp)) + continue; + + p_remote_osm_physp = osm_physp_get_remote(p_osm_physp); + if (!p_remote_osm_physp || + !osm_physp_is_valid(p_remote_osm_physp) || + !osm_link_is_healthy(p_remote_osm_physp)) + continue; + + p_remote_osm_node = + osm_physp_get_node_ptr(p_remote_osm_physp); + if (!p_remote_osm_node) { + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 3A52: " + "no node for remote port\n"); + res = 1; + goto Exit; + } + + if (osm_node_get_type(p_remote_osm_node) == + IB_NODE_TYPE_SWITCH) { + + remote_lid_ho = cl_ntoh16( + osm_node_get_base_lid( + p_remote_osm_node, 0)); + + p_remote_osm_sw = p_remote_osm_node->sw; + CL_ASSERT(p_remote_osm_sw); + + p_remote_sw = __ucast_cache_get_sw( + p_cache, + remote_lid_ho); + + /* If the remote switch hasn't been + cached yet, add it to the cache + and insert it into the BFS list */ + + if (!p_remote_sw) { + p_remote_sw = __ucast_cache_add_sw( + p_cache, + p_remote_osm_sw); + cl_list_insert_tail(&sw_bfs_list, + p_remote_sw); + } + } + else { + remote_lid_ho = cl_ntoh16( + osm_physp_get_base_lid( + p_remote_osm_physp)); + + p_sw->is_leaf = TRUE; + p_remote_ca = __ucast_cache_add_ca( + p_cache, remote_lid_ho); + + /* no need to add this node to BFS list */ + } + + /* cache this port */ + p_sw->ports[i] = __ucast_cache_add_port( + p_cache, + osm_node_get_type(p_remote_osm_node), + remote_lid_ho); + } + } + + cl_list_destroy(&sw_bfs_list); + p_cache->topology_valid = TRUE; + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "cache populated (%u SWs, %u CAs)\n", + cl_qmap_count(&p_cache->sw_tbl), + cl_qmap_count(&p_cache->ca_tbl)); + + Exit: + OSM_LOG_EXIT(p_log); + return res; +} /* __ucast_cache_populate() */ + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_read_sw_lid_matrix(cl_map_item_t * const p_map_item, + void *context) +{ + cache_sw_t *p_sw = (cache_sw_t * const)p_map_item; + uint16_t target_lid_ho; + uint8_t port_num; + + if (!p_sw->p_osm_sw) + return; + + /* allocate lid matrices buffer: + lid_matrix[target_lids][port_nums] */ + CL_ASSERT(!p_sw->lid_matrix); + p_sw->lid_matrix = (uint8_t **) + malloc((p_sw->max_lid_ho + 1) * sizeof(uint8_t*)); + + for (target_lid_ho = 0; + target_lid_ho <= p_sw->max_lid_ho; target_lid_ho++){ + + /* set hops for this target through every switch port */ + + p_sw->lid_matrix[target_lid_ho] = + (uint8_t *)malloc(p_sw->num_ports); + memset(p_sw->lid_matrix[target_lid_ho], + OSM_NO_PATH, p_sw->num_ports); + + for (port_num = 1; port_num < 
p_sw->num_ports; port_num++) + p_sw->lid_matrix[target_lid_ho][port_num] = + osm_switch_get_hop_count(p_sw->p_osm_sw, + target_lid_ho, + port_num); + } +} /* __ucast_cache_read_sw_lid_matrix() */ + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_write_sw_routing(cl_map_item_t * const p_map_item, + void * context) +{ + cache_sw_t *p_sw = (cache_sw_t * const)p_map_item; + osm_ucast_cache_t * p_cache = (osm_ucast_cache_t *) context; + uint8_t *ucast_mgr_lft_buf = p_cache->p_ucast_mgr->lft_buf; + uint16_t target_lid_ho; + uint8_t port_num; + uint8_t hops; + osm_log_t * p_log = p_cache->p_ucast_mgr->p_log; + + OSM_LOG_ENTER(p_log); + + if (!p_sw->p_osm_sw) { + /* some switches (leaf switches) may exist in the + cache, but not exist in the current topology */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "cached switch 0x%04x doesn't exist in the fabric\n", + p_sw->lid_ho); + goto Exit; + } + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "writing routing for cached switch 0x%04x, " + "max_lid_ho = 0x%04x\n", + p_sw->lid_ho, p_sw->max_lid_ho); + + /* write cached LFT to this switch: clear existing + ucast mgr lft buffer, write the cached lft to the + ucast mgr buffer, and set this lft on switch */ + CL_ASSERT(p_sw->lft_buff); + memset(ucast_mgr_lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + if (p_sw->max_lid_ho > 0) + memcpy(ucast_mgr_lft_buf, p_sw->lft_buff, + p_sw->max_lid_ho + 1); + + p_sw->p_osm_sw->max_lid_ho = p_sw->max_lid_ho; + osm_ucast_mgr_set_fwd_table(p_cache->p_ucast_mgr,p_sw->p_osm_sw); + + /* write cached lid matrix to this switch */ + + osm_switch_prepare_path_rebuild(p_sw->p_osm_sw, p_sw->max_lid_ho); + + /* set hops to itself */ + osm_switch_set_hops(p_sw->p_osm_sw,p_sw->lid_ho,0,0); + + for (target_lid_ho = 0; + target_lid_ho <= p_sw->max_lid_ho; target_lid_ho++){ + /* port 0 on switches lid matrices is used + for storing minimal hops to the target + lid, so we iterate from port 1 */ + for (port_num = 1; port_num < p_sw->num_ports; port_num++) { + hops = p_sw->lid_matrix[target_lid_ho][port_num]; + if (hops != OSM_NO_PATH) + osm_switch_set_hops(p_sw->p_osm_sw, + target_lid_ho, port_num, hops); + } + } + Exit: + OSM_LOG_EXIT(p_log); +} /* __ucast_cache_write_sw_routing() */ + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_clear_sw_routing(cl_map_item_t * const p_map_item, + void *context) +{ + cache_sw_t *p_sw = (cache_sw_t * const)p_map_item; + unsigned lid; + + if(p_sw->lft_buff) { + free(p_sw->lft_buff); + p_sw->lft_buff = NULL; + } + + if(p_sw->lid_matrix) { + for (lid = 0; lid < p_sw->max_lid_ho; lid++) + if (p_sw->lid_matrix[lid]) + free(p_sw->lid_matrix[lid]); + free(p_sw->lid_matrix); + p_sw->lid_matrix = NULL; + } + + p_sw->max_lid_ho = 0; +} /* __ucast_cache_clear_sw_routing() */ + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_clear_routing(osm_ucast_cache_t * p_cache) +{ + cl_qmap_apply_func(&p_cache->sw_tbl, __ucast_cache_clear_sw_routing, + (void *)p_cache); + p_cache->routing_valid = FALSE; +} + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_invalidate(osm_ucast_cache_t * 
p_cache) +{ + cache_sw_t * p_sw; + cache_sw_t * p_next_sw; + cache_ca_t * p_ca; + cache_ca_t * p_next_ca; + + p_next_sw = (cache_sw_t *) cl_qmap_head(&p_cache->sw_tbl); + while (p_next_sw != (cache_sw_t *) cl_qmap_end(&p_cache->sw_tbl)) { + p_sw = p_next_sw; + p_next_sw = (cache_sw_t *) cl_qmap_next(&p_sw->map_item); + __cache_sw_destroy(p_sw); + } + cl_qmap_remove_all(&p_cache->sw_tbl); + + p_next_ca = (cache_ca_t *) cl_qmap_head(&p_cache->ca_tbl); + while (p_next_ca != (cache_ca_t *) cl_qmap_end(&p_cache->ca_tbl)) { + p_ca = p_next_ca; + p_next_ca = (cache_ca_t *) cl_qmap_next(&p_ca->map_item); + __cache_ca_destroy(p_ca); + } + cl_qmap_remove_all(&p_cache->ca_tbl); + + p_cache->routing_valid = FALSE; + p_cache->topology_valid = FALSE; + p_cache->need_update = FALSE; +} /* __ucast_cache_invalidate() */ + +/********************************************************************** + **********************************************************************/ + +static int +__ucast_cache_read_topology(osm_ucast_cache_t * p_cache) +{ + CL_ASSERT(p_cache && p_cache->p_ucast_mgr); + + return __ucast_cache_populate(p_cache); +} + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_read_lid_matrices(osm_ucast_cache_t * p_cache) +{ + CL_ASSERT(p_cache && p_cache->p_ucast_mgr && + p_cache->topology_valid); + + if (p_cache->routing_valid) + __ucast_cache_clear_routing(p_cache); + + cl_qmap_apply_func(&p_cache->sw_tbl, + __ucast_cache_read_sw_lid_matrix, + (void *)p_cache); + p_cache->routing_valid = TRUE; +} + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_write_routing(osm_ucast_cache_t * p_cache) +{ + CL_ASSERT(p_cache && p_cache->p_ucast_mgr && + p_cache->topology_valid && p_cache->routing_valid); + + cl_qmap_apply_func(&p_cache->sw_tbl, + __ucast_cache_write_sw_routing, + (void *)p_cache); +} + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_sw_clear_osm_ptr(cl_map_item_t * const p_map_item, + void *context) +{ + ((cache_sw_t * const)p_map_item)->p_osm_sw = NULL; +} + +/********************************************************************** + **********************************************************************/ + +static int +__ucast_cache_validate(osm_ucast_cache_t * p_cache) +{ + osm_switch_t * p_osm_sw; + osm_node_t * p_osm_node; + osm_node_t * p_remote_osm_node; + osm_physp_t * p_osm_physp; + osm_physp_t * p_remote_osm_physp; + cache_sw_t * p_sw; + cache_sw_t * p_remote_sw; + cache_ca_t * p_remote_ca; + uint16_t lid_ho; + uint16_t remote_lid_ho; + uint8_t remote_node_type; + unsigned num_ports; + unsigned i; + int res = UCAST_CACHE_TOPOLOGY_MATCH; + boolean_t fabric_link_exists; + osm_log_t * p_log = p_cache->p_ucast_mgr->p_log; + cl_qmap_t * p_osm_sw_guid_tbl; + + OSM_LOG_ENTER(p_log); + + p_osm_sw_guid_tbl = &p_cache->p_ucast_mgr->p_subn->sw_guid_tbl; + + if (cl_qmap_count(p_osm_sw_guid_tbl) > + cl_qmap_count(&p_cache->sw_tbl)) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "current subnet has more switches than the cache - " + "cache is invalid\n"); + res |= UCAST_CACHE_TOPOLOGY_MORE_SWITCHES; + goto Exit; + } + + if (cl_qmap_count(p_osm_sw_guid_tbl) < + cl_qmap_count(&p_cache->sw_tbl)) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + 
"current subnet has less switches than the cache - " + "continuing validation\n"); + res |= UCAST_CACHE_TOPOLOGY_LESS_SWITCHES; + } + + /* Clear the pointers to osm switch on all the cached switches. + These pointers might be invalid right now: some cached switch + might be missing in the real subnet, and some missing switch + might reappear, such as in case of switch reboot. */ + cl_qmap_apply_func(&p_cache->sw_tbl, __ucast_cache_sw_clear_osm_ptr, + NULL); + + + for (p_osm_sw = (osm_switch_t *) cl_qmap_head(p_osm_sw_guid_tbl); + p_osm_sw != (osm_switch_t *) cl_qmap_end(p_osm_sw_guid_tbl); + p_osm_sw = (osm_switch_t *) cl_qmap_next(&p_osm_sw->map_item)) { + + lid_ho = cl_ntoh16(osm_node_get_base_lid(p_osm_sw->p_node,0)); + p_sw = __ucast_cache_get_sw(p_cache, lid_ho); + if (!p_sw) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "new lid (0x%04x)is in the fabric - " + "cache is invalid\n", lid_ho); + res |= UCAST_CACHE_TOPOLOGY_NEW_LID; + goto Exit; + } + + p_sw->p_osm_sw = p_osm_sw; + + /* scan all the ports and check if the cache is valid */ + + p_osm_node = p_osm_sw->p_node; + num_ports = osm_node_get_num_physp(p_osm_node); + + /* skipping port 0 on switches */ + for (i = 1; i < num_ports; i++) { + p_osm_physp = osm_node_get_physp_ptr(p_osm_node, i); + + fabric_link_exists = FALSE; + if (p_osm_physp && + osm_physp_is_valid(p_osm_physp) && + osm_link_is_healthy(p_osm_physp)) { + p_remote_osm_physp = + osm_physp_get_remote(p_osm_physp); + if (p_remote_osm_physp && + osm_physp_is_valid(p_remote_osm_physp) && + osm_link_is_healthy(p_remote_osm_physp)) + fabric_link_exists = TRUE; + } + + if (!fabric_link_exists && !p_sw->ports[i]) + continue; + + if (fabric_link_exists && !p_sw->ports[i]) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, link exists " + "in the fabric, but not cached - " + "cache is invalid\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_LINK_ADDED; + goto Exit; + } + + if (!fabric_link_exists && p_sw->ports[i]){ + /* + * link exists in cache, but missing + * in current fabric + */ + if (p_sw->ports[i]->remote_node_type == + IB_NODE_TYPE_SWITCH) { + p_remote_sw = + p_sw->ports[i]->remote_node.p_sw; + /* cache is allowed to have a + leaf switch that is missing + in the current subnet */ + if (!p_remote_sw->is_leaf) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "fabric is missing a link " + "to non-leaf switch - " + "cache is invalid\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_LINK_TO_SW_MISSING; + goto Exit; + } + else { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "fabric is missing a link " + "to leaf switch - " + "continuing validation\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_LINK_TO_LEAF_SW_MISSING; + continue; + } + } + else { + /* this means that link to + non-switch node is missing */ + CL_ASSERT(p_sw->is_leaf); + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "fabric is missing a link " + "to CA - " + "continuing validation\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_LINK_TO_CA_MISSING; + continue; + } + } + + /* + * Link exists both in fabric and in cache. + * Compare remote nodes. + */ + + p_remote_osm_node = + osm_physp_get_node_ptr(p_remote_osm_physp); + if (!p_remote_osm_node) { + /* No node for remote port! + Something wrong is going on here, + so we better not use cache... 
*/ + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 3A53: " + "lid 0x%04x, port %d, " + "no node for remote port - " + "cache mismatch\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + + remote_node_type = + osm_node_get_type(p_remote_osm_node); + + if (remote_node_type != + p_sw->ports[i]->remote_node_type) { + /* remote node type in the current fabric + differs from the cached one - looks like + node was replaced by something else */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "remote node type mismatch - " + "cache is invalid\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + + if (remote_node_type == IB_NODE_TYPE_SWITCH) { + remote_lid_ho = + cl_ntoh16(osm_node_get_base_lid( + p_remote_osm_node, 0)); + + p_remote_sw = __ucast_cache_get_sw( + p_cache, + remote_lid_ho); + + if (!p_remote_sw) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, " + "new switch in the fabric - " + "cache is invalid\n", + remote_lid_ho); + res |= UCAST_CACHE_TOPOLOGY_NEW_SWITCH; + goto Exit; + } + + if (p_sw->ports[i]->remote_node.p_sw != + p_remote_sw) { + /* remote cached switch that pointed + by the port is not equal to the + switch that was obtained for the + remote lid - link was changed */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "link location changed " + "(remote node mismatch) - " + "cache is invalid\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + } + else { + if (!p_sw->is_leaf) { + /* remote node type is CA, but the + cached switch is not marked as + leaf - something has changed */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "link changed - " + "cache is invalid\n", + lid_ho, i); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + + remote_lid_ho = + cl_ntoh16(osm_physp_get_base_lid( + p_remote_osm_physp)); + + p_remote_ca = __ucast_cache_get_ca( + p_cache, remote_lid_ho); + + if (!p_remote_ca) { + /* new lid is in the fabric - + cache is invalid */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "new CA in the fabric " + "(lid 0x%04x) - " + "cache is invalid\n", + lid_ho, i, remote_lid_ho); + res |= UCAST_CACHE_TOPOLOGY_NEW_CA; + goto Exit; + } + + if (p_sw->ports[i]->remote_node.p_ca != + p_remote_ca) { + /* remote cached CA that pointed + by the port is not equal to the + CA that was obtained for the + remote lid - link was changed */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "lid 0x%04x, port %d, " + "link to CA (lid 0x%04x) " + "has changed - " + "cache is invalid\n", + lid_ho, i, remote_lid_ho); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + } + } /* done comparing the ports of the switch */ + } /* done comparing all the switches */ + + /* At this point we have four possible flags on: + 1. UCAST_CACHE_TOPOLOGY_MATCH + We have a perfect topology match to the cache + 2. UCAST_CACHE_TOPOLOGY_LESS_SWITCHES + Cached topology has one or more switches that do not exist + in the current topology. There are two types of such switches: + leaf switches and the regular switches. But if some regular + switch was missing, we would exit the comparison with the + UCAST_CACHE_TOPOLOGY_LINK_TO_SW_MISSING flag, so if some switch + in the topology is missing, it has to be leaf switch. + 3. UCAST_CACHE_TOPOLOGY_LINK_TO_LEAF_SW_MISSING + One or more link to leaf switches are missing in the current + topology. + 4. UCAST_CACHE_TOPOLOGY_LINK_TO_CA_MISSING + One or more CAs are missing in the current topology. 
+ In all these cases the cache is perfectly usable - it just might + have routing to nonexistent lids. */ + + if (res & UCAST_CACHE_TOPOLOGY_LESS_SWITCHES) { + /* if there are switches in the cache that don't exist + in the current topology, make sure that they are + all leaf switches, otherwise cache is useless */ + for (p_sw = (cache_sw_t *) cl_qmap_head(&p_cache->sw_tbl); + p_sw != (cache_sw_t *) cl_qmap_end(&p_cache->sw_tbl); + p_sw = (cache_sw_t *) cl_qmap_next(&p_sw->map_item)) { + if (!p_sw->p_osm_sw && !p_sw->is_leaf) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "non-leaf switch in the fabric is " + "missing - cache is invalid\n"); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + } + } + + if ((res & UCAST_CACHE_TOPOLOGY_LINK_TO_LEAF_SW_MISSING) && + !(res & UCAST_CACHE_TOPOLOGY_LESS_SWITCHES)) { + /* some link to leaf switch is missing, but there are + no missing switches - link failure or topology + changes, which means that we probably shouldn't + use the cache here */ + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "topology change - cache is invalid\n"); + res |= UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + + Exit: + OSM_LOG_EXIT(p_log); + return res; + +} /* __ucast_cache_validate() */ + +/********************************************************************** + **********************************************************************/ + +int +osm_ucast_cache_apply(osm_ucast_cache_t * p_cache) +{ + int res = 0; + osm_log_t * p_log; + + if (!p_cache) + return 1; + + p_log = p_cache->p_ucast_mgr->p_log; + + OSM_LOG_ENTER(p_log); + if (!p_cache->topology_valid) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "unicast cache is empty - can't " + "use it on this sweep\n"); + res = UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + + if (!p_cache->routing_valid) { + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 3A55: " + "cached routing invalid\n"); + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "invalidating cache\n"); + __ucast_cache_invalidate(p_cache); + res = UCAST_CACHE_TOPOLOGY_NO_MATCH; + goto Exit; + } + + res = __ucast_cache_validate(p_cache); + + if ((res & UCAST_CACHE_TOPOLOGY_NO_MATCH ) || + (res & UCAST_CACHE_TOPOLOGY_MORE_SWITCHES ) || + (res & UCAST_CACHE_TOPOLOGY_LINK_ADDED ) || + (res & UCAST_CACHE_TOPOLOGY_LINK_TO_SW_MISSING) || + (res & UCAST_CACHE_TOPOLOGY_NEW_SWITCH ) || + (res & UCAST_CACHE_TOPOLOGY_NEW_CA ) || + (res & UCAST_CACHE_TOPOLOGY_NEW_LID )) { + /* The change in topology doesn't allow us to use the + existing cache. The cache should be invalidated, and a new + cache should be built after the routing recalculation. */ + OSM_LOG(p_log, OSM_LOG_INFO, + "changes in topology (0x%x) - " + "invalidating cache\n", res); + __ucast_cache_invalidate(p_cache); + goto Exit; + } + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "cache is valid (status 0x%04x) - using the cached routing\n",res); + + /* existing cache can be used - write back the cached routing */ + __ucast_cache_write_routing(p_cache); + + /* + * ToDo: Detailed result of the topology comparison will + * ToDo: be needed later for the Incremental Routing, + * ToDo: where based on this result, the routing algorithm + * ToDo: will try to route "around" the missing components. + * ToDo: For now - reset the result whenever the cache + * ToDo: is valid.
+ */ + res = 0; + + Exit: + OSM_LOG_EXIT(p_log); + return res; +} /* osm_ucast_cache_apply() */ + +/********************************************************************** + **********************************************************************/ + +void osm_ucast_cache_set_sw_fwd_table(osm_ucast_cache_t * p_cache, + uint8_t * ucast_mgr_lft_buf, + osm_switch_t * p_osm_sw) +{ + uint16_t lid_ho = + cl_ntoh16(osm_node_get_base_lid(p_osm_sw->p_node, 0)); + cache_sw_t * p_sw = __ucast_cache_get_sw(p_cache, lid_ho); + + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); + + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "caching lft for switch 0x%04x\n", + lid_ho); + + if (!p_sw || !p_sw->p_osm_sw) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_ERROR, + "ERR 3A57: " + "fabric switch 0x%04x %s in the unicast cache\n", + lid_ho, + (p_sw) ? "is not initialized" : "doesn't exist"); + goto Exit; + } + + CL_ASSERT(p_sw->p_osm_sw == p_osm_sw); + CL_ASSERT(!p_sw->lft_buff); + + p_sw->max_lid_ho = p_osm_sw->max_lid_ho; + + /* allocate linear forwarding table buffer and fill it */ + p_sw->lft_buff = (uint8_t *)malloc(IB_LID_UCAST_END_HO + 1); + memcpy(p_sw->lft_buff, p_cache->p_ucast_mgr->lft_buf, + IB_LID_UCAST_END_HO + 1); + + Exit: + OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log); +} /* osm_ucast_cache_set_sw_fwd_table() */ + +/********************************************************************** + **********************************************************************/ + +void osm_ucast_cache_refresh_topo(osm_ucast_cache_t * p_cache) +{ + osm_log_t * p_log = p_cache->p_ucast_mgr->p_log; + OSM_LOG_ENTER(p_log); + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "starting ucast cache topology refresh\n"); + + if (p_cache->topology_valid) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "invalidating existing ucast cache\n"); + __ucast_cache_invalidate(p_cache); + } + + OSM_LOG(p_log, OSM_LOG_VERBOSE, "caching topology\n"); + + if (__ucast_cache_read_topology(p_cache) != 0) { + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 3A56: " + "cache population failed\n"); + __ucast_cache_invalidate(p_cache); + goto Exit; + } + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "ucast cache topology refresh done\n"); + Exit: + OSM_LOG_EXIT(p_log); +} /* osm_ucast_cache_refresh_topo() */ + +/********************************************************************** + **********************************************************************/ + +void osm_ucast_cache_refresh_lid_matrices(osm_ucast_cache_t * p_cache) +{ + osm_log_t * p_log = p_cache->p_ucast_mgr->p_log; + OSM_LOG_ENTER(p_log); + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "starting ucast cache lid matrices refresh\n"); + + if (!p_cache->topology_valid) { + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 3A54: " + "cached topology is invalid\n"); + goto Exit; + } + + if (p_cache->routing_valid) { + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "invalidating existing ucast routing cache\n"); + __ucast_cache_clear_routing(p_cache); + } + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "caching lid matrices\n"); + + __ucast_cache_read_lid_matrices(p_cache); + + OSM_LOG(p_log, OSM_LOG_VERBOSE, + "ucast cache lid matrices refresh done\n"); + Exit: + OSM_LOG_EXIT(p_log); +} /* osm_ucast_cache_refresh_lid_matrices() */ + +/********************************************************************** + **********************************************************************/ + +osm_ucast_cache_t * +osm_ucast_cache_construct(osm_ucast_mgr_t * const p_mgr) +{ + if (p_mgr->p_subn->opt.lmc > 0) { + OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A50: " + "Unicast cache is not 
supported for LMC>0\n"); + return NULL; + } + + osm_ucast_cache_t * p_cache = + (osm_ucast_cache_t*)malloc(sizeof(osm_ucast_cache_t)); + if (!p_cache) + return NULL; + + memset(p_cache, 0, sizeof(osm_ucast_cache_t)); + + cl_qmap_init(&p_cache->sw_tbl); + cl_qmap_init(&p_cache->ca_tbl); + p_cache->p_ucast_mgr = p_mgr; + + return p_cache; +} + +/********************************************************************** + **********************************************************************/ + +void +osm_ucast_cache_destroy(osm_ucast_cache_t * p_cache) +{ + if (!p_cache) + return; + __ucast_cache_invalidate(p_cache); + free(p_cache); +} + +/********************************************************************** + **********************************************************************/ -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Sun May 4 03:00:36 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 04 May 2008 13:00:36 +0300 Subject: [ofa-general] [PATCH 2/4] opensm: adding ucast cache option Message-ID: <481D8944.10003@dev.mellanox.co.il> Adding ucast cache option to OpenSM command line arguments: -F or --ucast_cache. Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_subnet.h | 6 +++++- opensm/opensm/main.c | 33 +++++++++++++++++++++++++++++++-- opensm/opensm/osm_subnet.c | 11 ++++++++++- 3 files changed, 46 insertions(+), 4 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index b1dd659..cffbe5e 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -256,6 +256,7 @@ typedef struct _osm_subn_opt { boolean_t sweep_on_trap; char *routing_engine_name; boolean_t connect_roots; + boolean_t use_ucast_cache; char *lid_matrix_dump_file; char *ucast_dump_file; char *root_guid_file; @@ -441,6 +442,9 @@ typedef struct _osm_subn_opt { * up/down routing engine (even if this violates "pure" deadlock * free up/down algorithm) * +* use_ucast_cache +* When TRUE enables unicast routing cache. +* * lid_matrix_dump_file * Name of the lid matrix dump file from where switch * lid matrices (min hops tables) will be loaded diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index fb41d50..71deacb 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -183,6 +183,17 @@ static void show_usage(void) " and in this way be IBA compliant. In many cases,\n" " this can violate \"pure\" deadlock free algorithm, so\n" " use it carefully.\n\n"); + printf("-F\n" + "--ucast_cache\n" + " This option enables unicast routing cache to prevent\n" + " routing recalculation (which is a heavy task in a\n" + " large cluster) when there was no topology change\n" + " detected during the heavy sweep, or when the topology\n" + " change does not require new routing calculation,\n" + " e.g. 
in case of host reboot.\n" + " This option becomes very handy when the cluster size\n" + " is thousands of nodes.\n" + " Unicast cache is not supported for LMC > 0.\n\n"); printf("-M\n" "--lid_matrix_file \n" " This option specifies the name of the lid matrix dump file\n" @@ -599,7 +610,7 @@ int main(int argc, char *argv[]) char *ignore_guids_file_name = NULL; uint32_t val; const char *const short_option = - "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:NBIQvVhorcyxp:n:q:k:C:"; + "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:FNBIQvVhorcyxp:n:q:k:C:"; /* In the array below, the 2nd parameter specifies the number @@ -634,6 +645,7 @@ int main(int argc, char *argv[]) {"smkey", 1, NULL, 'k'}, {"routing_engine", 1, NULL, 'R'}, {"connect_roots", 0, NULL, 'z'}, + {"ucast_cache", 0, NULL, 'F'}, {"lid_matrix_file", 1, NULL, 'M'}, {"ucast_file", 1, NULL, 'U'}, {"sadb_file", 1, NULL, 'S'}, @@ -805,6 +817,12 @@ int main(int argc, char *argv[]) "ERROR: LMC must be 7 or less."); return (-1); } + if (opt.use_ucast_cache && temp > 0) { + fprintf(stderr, + "ERROR: Unicast routing cache is " + "not supported for LMC > 0\n"); + return (-1); + } opt.lmc = (uint8_t) temp; printf(" LMC = %d\n", temp); break; @@ -891,6 +909,17 @@ int main(int argc, char *argv[]) printf(" Connect roots option is on\n"); break; + case 'F': + if (opt.lmc > 0) { + fprintf(stderr, + "ERROR: Unicast routing cache is " + "not supported for LMC > 0\n"); + return (-1); + } + opt.use_ucast_cache = TRUE; + printf(" Unicast routing cache option is on\n"); + break; + case 'M': opt.lid_matrix_dump_file = optarg; printf(" Lid matrix dump file is \'%s\'\n", optarg); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 47d735f..dc55e72 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -461,6 +461,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->sweep_on_trap = TRUE; p_opt->routing_engine_name = NULL; p_opt->connect_roots = FALSE; + p_opt->use_ucast_cache = FALSE; p_opt->lid_matrix_dump_file = NULL; p_opt->ucast_dump_file = NULL; p_opt->root_guid_file = NULL; @@ -1290,6 +1291,9 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) opts_unpack_boolean("connect_roots", p_key, p_val, &p_opts->connect_roots); + opts_unpack_boolean("use_ucast_cache", + p_key, p_val, &p_opts->use_ucast_cache); + opts_unpack_charp("log_file", p_key, p_val, &p_opts->log_file); opts_unpack_uint32("log_max_size", @@ -1543,6 +1547,11 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) "# Connect roots (use FALSE if unsure)\n" "connect_roots %s\n\n", p_opts->connect_roots ? "TRUE" : "FALSE"); + if (p_opts->use_ucast_cache) + fprintf(opts_file, + "# Use unicast routing cache (use FALSE if unsure)\n" + "use_ucast_cache %s\n\n", + p_opts->use_ucast_cache ? 
"TRUE" : "FALSE"); if (p_opts->lid_matrix_dump_file) fprintf(opts_file, "# Lid matrix dump file name\n" -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Sun May 4 03:02:03 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 04 May 2008 13:02:03 +0300 Subject: [ofa-general] [PATCH 3/4] opensm: compile new ucast cache files Message-ID: <481D899B.5010206@dev.mellanox.co.il> Include new ucast cache c/h files in the makefiles. Signed-off-by: Yevgeny Kliteynik --- opensm/include/Makefile.am | 1 + opensm/opensm/Makefile.am | 1 + 2 files changed, 2 insertions(+), 0 deletions(-) diff --git a/opensm/include/Makefile.am b/opensm/include/Makefile.am index 48264ff..a6791d4 100644 --- a/opensm/include/Makefile.am +++ b/opensm/include/Makefile.am @@ -33,6 +33,7 @@ EXTRA_DIST = \ $(srcdir)/opensm/osm_sm.h \ $(srcdir)/opensm/osm_lin_fwd_tbl.h \ $(srcdir)/opensm/osm_ucast_mgr.h \ + $(srcdir)/opensm/osm_ucast_cache.h \ $(srcdir)/opensm/osm_db.h \ $(srcdir)/opensm/osm_mad_pool.h \ $(srcdir)/opensm/osm_remote_sm.h \ diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am index acd0b1d..ec6c5b0 100644 --- a/opensm/opensm/Makefile.am +++ b/opensm/opensm/Makefile.am @@ -57,6 +57,7 @@ opensm_SOURCES = main.c osm_console_io.c osm_console.c osm_db_files.c \ osm_prtn.c osm_prtn_config.c osm_qos.c osm_router.c \ osm_trap_rcv.c osm_ucast_mgr.c osm_ucast_updn.c \ osm_ucast_lash.c osm_ucast_file.c osm_ucast_ftree.c \ + osm_ucast_cache.c \ osm_vl15intf.c osm_vl_arb_rcv.c \ st.c osm_perfmgr.c osm_perfmgr_db.c \ osm_event_plugin.c osm_dump.c \ -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Sun May 4 03:03:14 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 04 May 2008 13:03:14 +0300 Subject: [ofa-general] [PATCH 4/4] opensm/osm_ucast_mgr.{c, h}: integrate ucast cache Message-ID: <481D89E2.1090303@dev.mellanox.co.il> Integrating unicast routing cache into the unicast manager. The cache is used the following way: - SM is executed - it starts first routing calculation - calculated routing is stored in the cache - at some point new heavy sweep is triggered - unicast manager checks whether the cache can be used instead of new routing calculation. In one of the following cases we can use cached routing + there is no topology change + one or more CAs disappeared (they exist in the cached topology model, but missing in the newly discovered fabric) + one or more leaf switches disappeared In these cases cached routing is written to the switches as is (unless the switch doesn't exist). If there is any other topology change: - existing cache is invalidated - topology is cached - routing is calculated as usual - routing is cached Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_ucast_mgr.h | 7 +++- opensm/opensm/osm_ucast_mgr.c | 79 ++++++++++++++++++++++----------- 2 files changed, 59 insertions(+), 27 deletions(-) diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h index 0317c93..33e164b 100644 --- a/opensm/include/opensm/osm_ucast_mgr.h +++ b/opensm/include/opensm/osm_ucast_mgr.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -53,6 +53,7 @@ #include #include #include +#include #include #ifdef __cplusplus @@ -103,6 +104,7 @@ typedef struct _osm_ucast_mgr { boolean_t is_dor; boolean_t any_change; boolean_t some_hop_count_set; + osm_ucast_cache_t * p_cache; uint8_t *lft_buf; } osm_ucast_mgr_t; /* @@ -132,6 +134,9 @@ typedef struct _osm_ucast_mgr { * tables calculation iteration cycle, set to TRUE to indicate * that some hop count changes were done. * +* p_cache +* Pointer to the unicast cache object. +* * lft_buf * LFT buffer - used during LFT calculation/setup. * diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 938db84..d854fa9 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -81,6 +81,9 @@ void osm_ucast_mgr_destroy(IN osm_ucast_mgr_t * const p_mgr) if (p_mgr->lft_buf) free(p_mgr->lft_buf); + if (p_mgr->p_cache) + osm_ucast_cache_destroy(p_mgr->p_cache); + OSM_LOG_EXIT(p_mgr->p_log); } @@ -104,6 +107,9 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const p_mgr, IN osm_sm_t * sm) if (!p_mgr->lft_buf) return IB_INSUFFICIENT_MEMORY; + if (p_mgr->p_subn->opt.use_ucast_cache) + p_mgr->p_cache = osm_ucast_cache_construct(p_mgr); + OSM_LOG_EXIT(p_mgr->p_log); return (status); } @@ -375,6 +381,10 @@ osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, CL_ASSERT(p_path); + if (p_mgr->p_cache && p_mgr->p_cache->need_update) + osm_ucast_cache_set_sw_fwd_table(p_mgr->p_cache, + p_mgr->lft_buf, p_sw); + /* Set the top of the unicast forwarding table. */ @@ -688,33 +698,50 @@ osm_signal_t osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) p_mgr->any_change = FALSE; - if (!p_routing_eng->build_lid_matrices || - (blm = p_routing_eng->build_lid_matrices(p_routing_eng->context))) - osm_ucast_mgr_build_lid_matrices(p_mgr); + if (p_mgr->p_cache && (osm_ucast_cache_apply(p_mgr->p_cache) == 0)) + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, + "configured switch tables using cached routing\n"); + else { + if (p_mgr->p_cache) { + /* ucast cache is enabled - refresh + topology and mark routing for update */ + p_mgr->p_cache->need_update = TRUE; + osm_ucast_cache_refresh_topo(p_mgr->p_cache); + } + + if (!p_routing_eng->build_lid_matrices || + (blm = p_routing_eng->build_lid_matrices(p_routing_eng->context))) + osm_ucast_mgr_build_lid_matrices(p_mgr); - /* - Now that the lid matrices have been built, we can - build and download the switch forwarding tables. 
- */ - if (!p_routing_eng->ucast_build_fwd_tables || - (ubft = - p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context))) - cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, - p_mgr); - - /* 'file' routing engine has one unique logic corner case */ - if (p_routing_eng->name && (strcmp(p_routing_eng->name, "file") == 0) - && (!blm || !ubft)) - p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_FILE; - else if (!blm && !ubft) - p_osm->routing_engine_used = - osm_routing_engine_type(p_routing_eng->name); - else - p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; + /* + Now that the lid matrices have been built, we can + build and download the switch forwarding tables. + */ + if (!p_routing_eng->ucast_build_fwd_tables || + (ubft = + p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context))) + cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, + p_mgr); + + /* 'file' routing engine has one unique logic corner case */ + if (p_routing_eng->name && (strcmp(p_routing_eng->name, "file") == 0) + && (!blm || !ubft)) + p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_FILE; + else if (!blm && !ubft) + p_osm->routing_engine_used = + osm_routing_engine_type(p_routing_eng->name); + else + p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; + + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, + "%s tables configured on all switches\n", + osm_routing_engine_type_str(p_osm->routing_engine_used)); - OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, - "%s tables configured on all switches\n", - osm_routing_engine_type_str(p_osm->routing_engine_used)); + if (p_mgr->p_cache) { + osm_ucast_cache_refresh_lid_matrices(p_mgr->p_cache); + p_mgr->p_cache->need_update = FALSE; + } + } if (p_mgr->any_change) { signal = OSM_SIGNAL_DONE_PENDING; -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Sun May 4 03:08:51 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 04 May 2008 13:08:51 +0300 Subject: [ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache In-Reply-To: <481D888A.7080608@dev.mellanox.co.il> References: <481D888A.7080608@dev.mellanox.co.il> Message-ID: <481D8B33.4000803@dev.mellanox.co.il> One thing I need to add here: ucast cache is currently supported for LMC=0 only. -- Yevgeny Yevgeny Kliteynik wrote: > Hi Sasha, > > The following series of 4 patches implements unicast routing cache > in OpenSM. > > None of the current routing engines is scalable when we're talking > about big clusters. On a ~5K-node cluster with ~1.3K switches, it takes > about two minutes to calculate the routing. The problem is, each > time the routing is calculated from scratch. > > Incremental routing (which is on my to-do list) aims to address this > problem when there is some "local" change in the fabric (e.g. single > switch failure, single link failure, link added, etc). > In such cases we can use the routing that was already calculated in > the previous heavy sweep, and then we just have to modify it according > to the change. > > For instance, if some switch has disappeared from the fabric, we can > use the routing that existed with this switch, take a step back from > this switch and see if it is possible to route all the lids that were > routed through this switch some other way (which is usually the case). > > To implement incremental routing, we need to create some kind of unicast > routing cache, which is what these patches implement. In addition to being > a step toward the incremental routing, the routing cache is useful by itself. > > This cache can save us a routing calculation in case of a change in the leaf > switches or in hosts. For instance, if some node is rebooted, OpenSM would > start a heavy sweep with full routing recalculation when the HCA is going > down, and another one when the HCA is brought up, when in fact both of these > routing calculations can be replaced by use of the unicast routing cache. > > Unicast routing cache comprises the following: > - Topology: a data structure with all the switches and CAs of the fabric > - LFTs: each switch has an LFT cached > - Lid matrices: each switch has lid matrices cached, which is needed for > multicast routing (which is not cached). > > There is a topology matching function that compares the current topology > with the cached one to find out whether the cache is usable (valid) or not. > > The cache is used the following way: > - SM is executed - it starts first routing calculation > - calculated routing is stored in the cache > - at some point new heavy sweep is triggered > - unicast manager checks whether the cache can be used instead > of new routing calculation. > In one of the following cases we can use cached routing > + there is no topology change > + one or more CAs disappeared (they exist in the cached topology > model, but are missing in the newly discovered fabric) > + one or more leaf switches disappeared > In these cases cached routing is written to the switches as is > (unless the switch doesn't exist). > If there is any other topology change: > - existing cache is invalidated > - topology is cached > - routing is calculated as usual > - routing is cached > > My simulations show that when the usual routing phase of the heavy > sweep on the topology that I mentioned above takes ~2 minutes, > cached routing reduces this time to 6 seconds (which is nice, if you > ask me...). > > Of all the cases when the cache is valid, the most painful and > "complainable" case is when a compute node reboot (which happens pretty > often) causes two heavy sweeps with two full routing calculations. > Unicast Routing Cache is aimed to solve this problem (again, in addition > to being a step toward the incremental routing). > > -- Yevgeny > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >
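For anyone who wants to try the series: the cache is switched on through the options added in patch 2/4, either on the command line or in the OpenSM options file. A short usage sketch (the option names are exactly the ones added in that patch; note the LMC = 0 restriction):

	opensm -F
	opensm --ucast_cache        # long form of the same option

	# or in the OpenSM options file:
	use_ucast_cache TRUE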
From holt at sgi.com Sun May 4 12:13:45 2008 From: holt at sgi.com (Robin Holt) Date: Sun, 4 May 2008 14:13:45 -0500 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <1489529e7b53d3f2dab8.1209740704@duo.random> References: <1489529e7b53d3f2dab8.1209740704@duo.random> Message-ID: <20080504191345.GD18857@sgi.com> > diff --git a/mm/Kconfig b/mm/Kconfig > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -205,3 +205,6 @@ config VIRT_TO_BUS > config VIRT_TO_BUS > def_bool y > depends on !ARCH_NO_VIRT_TO_BUS > + > +config MMU_NOTIFIER > + bool Without some text following the bool keyword, I am not even asked for this config setting on my ia64 build. Thanks, Robin
From hrosenstock at xsigo.com Sun May 4 14:00:52 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 04 May 2008 14:00:52 -0700 Subject: [ofa-general] [PATCH][TRIVIAL] ibsim/sim_net.c: Fix some typos Message-ID: <1209934852.20493.182.camel@hrosenstock-ws.xsigo.com> ibsim/sim_net.c: Fix some typos Signed-off-by: Hal Rosenstock diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c index 888c91c..1873187 100644 --- a/ibsim/sim_net.c +++ b/ibsim/sim_net.c @@ -270,7 +270,7 @@ static int new_hca(Node * nd) static int build_nodeid(char *nodeid, char *base) { if (strchr(base, '#') || strchr(base, '@')) { - IBWARN("bad nodeid \"%s\": '#' & '@' characters are resereved", + IBWARN("bad nodeid \"%s\": '#' & '@' characters are reserved", base); return -1; } @@ -649,7 +649,7 @@ static int parse_port(char *line, Node * node, int type, int maxports) build_alias(port->remotealias, s, 0); expand_name(s, remotenodeid, &sp); - PDEBUG("remotenodid %s s %s sp %s", remotenodeid, s, sp); + PDEBUG("remotenodeid %s s %s sp %s", remotenodeid, s, sp); s += strlen(s) + 1; if (!sp && *s == '[') @@ -791,7 +791,7 @@ static int parse_guidbase(int fd, char *line, int type) char *s; if (!(s = strchr(line, '=')) && !(s = strchr(line, '+'))) { - IBWARN("bad assignemnt: missing '=|+' sign"); + IBWARN("bad assignment: missing '=|+' sign"); return -1; } @@ -805,7 +805,7 @@ guidbase = 0; } guids[type] = absguids[type] + guidbase; - PDEBUG("new guidbase for %s: base %" PRIx64 " current %" PRIx64, + PDEBUG("new guidbase for %s: base 0x%" PRIx64 " current 0x%" PRIx64, node_type_name(type), absguids[type], guids[type]); return 1; @@ -816,7 +816,7 @@ static int parse_devid(int fd, char *line) char *s; if (!(s = strchr(line, '='))) { - IBWARN("bad assignemnt: missing '=' sign"); + IBWARN("bad assignment: missing '=' sign"); return -1; } @@ -831,7 +831,7 @@ static int parse_width(int fd, char *line) int width; if (!(s = strchr(line, '='))) { - IBWARN("bad assignemnt: missing '=' sign"); + IBWARN("bad assignment: missing '=' sign"); return -1; } @@ -851,7 +851,7 @@ static int parse_speed(int fd, char *line) int speed; if (!(s = strchr(line, '='))) { - IBWARN("bad assignemnt: missing '=' sign"); + IBWARN("bad assignment: missing '=' sign"); return -1; } @@ -870,7 +870,7 @@ static int parse_netprefix(int fd, char *line) char *s; if (!(s = strchr(line, '='))) { - IBWARN("bad assignemnt: missing '=' sign"); + IBWARN("bad assignment: missing '=' sign"); return -1; } @@ -907,7 +907,7 @@ static int set_var(char *line, int *var) char *s; if (!(s = strchr(line, '='))) { - IBWARN("bad assignemnt: missing '=' sign"); + IBWARN("bad assignment: missing '=' sign"); return -1; } From hrosenstock at xsigo.com Sun May 4 14:01:07 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 04 May 2008 14:01:07 -0700 Subject: [ofa-general] [PATCH] ibsim/sim.h: Fix NodeDescription size Message-ID: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> ibsim/sim.h: Fix NodeDescription size so a maximum size NodeDescription per the IBA spec can be used rather than truncated Signed-off-by: Hal Rosenstock diff --git a/ibsim/sim.h b/ibsim/sim.h index bea136a..dbf1220 100644 --- a/ibsim/sim.h +++ b/ibsim/sim.h @@ -67,7 +67,7 @@ #define NODEIDBASE 20 #define NODEPREFIX 20 -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) +#define NODEIDLEN 65 #define ALIASLEN 40 #define MAXHOPS 16 From hrosenstock at xsigo.com Sun May 4 14:01:35 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 04 May 2008 14:01:35 -0700
Subject: [ofa-general] [PATCH][TRIVIAL] ibsim/ibsim.c: Fix usage display Message-ID: <1209934895.20493.184.camel@hrosenstock-ws.xsigo.com> ibsim/ibsim.c: Fix usage display Signed-off-by: Hal Rosenstock diff --git a/ibsim/ibsim.c b/ibsim/ibsim.c index d0f1c30..5b996fd 100644 --- a/ibsim/ibsim.c +++ b/ibsim/ibsim.c @@ -698,7 +698,7 @@ Client *find_client(Port * port, int response, int qp, uint64_t trid) void usage(char *prog_name) { fprintf(stderr, - "Usage: %s [-f outfile -d debug_level -p parse_debug -s(tart) -v(erbose) " + "Usage: %s [-f outfile -d(ebug) -p(arse_debug) -s(tart) -v(erbose) " "-I(gnore_duplicate) -N nodes -S switchs -P ports -L linearcap" " -M mcastcap -r(emote_mode) -l(isten_to_port) ] \n", prog_name); From hrosenstock at xsigo.com Sun May 4 14:01:48 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 04 May 2008 14:01:48 -0700 Subject: [ofa-general] [PATCH] ibsim/README: Clarify point of attachment/SIM_HOST use Message-ID: <1209934908.20493.186.camel@hrosenstock-ws.xsigo.com> ibsim/README: Clarify point of attachment/SIM_HOST use Signed-off-by: Hal Rosenstock diff --git a/README b/README index f782fbe..6f3c711 100644 --- a/README +++ b/README @@ -90,6 +90,9 @@ Building and using ibsim. - in order to run OpenSM as non-privileged user you may need to export OSM_CACHE_DIR variable and to use '-f' option in order to specify writable path to OpenSM log file. + - Point of attachment is indicated by SIM_HOST environment variable. + If not specified, first entry in topology file is used. For OpenSM, + if -g option is used, it must be the same as this. 5. Enjoy and comment. From hrosenstock at xsigo.com Sun May 4 14:02:24 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sun, 04 May 2008 14:02:24 -0700 Subject: [ofa-general] ibsim parsing question Message-ID: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> Hi Sasha, I have a question on ibsim parsing: In sim_net.c:parse_port, there is the following code: parse_opt: line = s; while (s && (s = strchr(s + 1, '='))) { char *opt = s; while (opt && !isalpha(*opt)) opt--; if (!opt || parse_port_opt(port, opt, s + 1) < 0) { IBWARN("bad port option"); return -1; } line = s + 1; } port options appear to include w for link width and s for link speed. An issue is that this parsing starts inside the NodeDescription. = is a valid character there and causes an invalid port option. There seem to me to be two choices here: 1. Either ignore unknown options in parse_port_opt and the rule becomes w= and s= are invalid in the NodeDescription (which is artificial and not really per the spec). or 2. Find some way to start this port option parsing past the end of the NodeDescription. As I'm not sure about all the formats supported, I don't know how to determine a "solid" way to get past the end of the NodeDescription in the topology format. Do you? -- Hal
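One possible shape for choice 2, assuming the node id/description in the topology line is enclosed in double quotes - a hypothetical helper, not verified against all the file formats ibsim accepts:

	/* Advance past a double-quoted node description so that '='
	 * characters inside it are not mistaken for port options.
	 * Returns the position after the closing quote, or the
	 * original position if no quoted string is found. */
	static char *skip_quoted_desc(char *line)
	{
		char *s = strchr(line, '"');	/* opening quote */

		if (s)
			s = strchr(s + 1, '"');	/* closing quote */
		return s ? s + 1 : line;
	}

parse_port could then start its '=' scan from skip_quoted_desc(line) instead of from the raw line.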
From andrea at qumranet.com Sun May 4 15:08:25 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Mon, 5 May 2008 00:08:25 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080504191345.GD18857@sgi.com> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080504191345.GD18857@sgi.com> Message-ID: <20080504220824.GA21051@duo.random> On Sun, May 04, 2008 at 02:13:45PM -0500, Robin Holt wrote: > > diff --git a/mm/Kconfig b/mm/Kconfig > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -205,3 +205,6 @@ config VIRT_TO_BUS > > config VIRT_TO_BUS > > def_bool y > > depends on !ARCH_NO_VIRT_TO_BUS > > + > > +config MMU_NOTIFIER > > + bool > > Without some text following the bool keyword, I am not even asked for > this config setting on my ia64 build. Yes, this was explicitly asked by Andrew after his review. This is the explanation pasted from the changelog. 3) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows compiling a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIER in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. From holt at sgi.com Sun May 4 19:25:46 2008 From: holt at sgi.com (Robin Holt) Date: Sun, 4 May 2008 21:25:46 -0500 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080504220824.GA21051@duo.random> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080504191345.GD18857@sgi.com> <20080504220824.GA21051@duo.random> Message-ID: <20080505022546.GE18857@sgi.com> On Mon, May 05, 2008 at 12:08:25AM +0200, Andrea Arcangeli wrote: > On Sun, May 04, 2008 at 02:13:45PM -0500, Robin Holt wrote: > > > diff --git a/mm/Kconfig b/mm/Kconfig > > > --- a/mm/Kconfig > > > +++ b/mm/Kconfig > > > @@ -205,3 +205,6 @@ config VIRT_TO_BUS > > > config VIRT_TO_BUS > > > def_bool y > > > depends on !ARCH_NO_VIRT_TO_BUS > > > + > > > +config MMU_NOTIFIER > > > + bool > > > > Without some text following the bool keyword, I am not even asked for > > this config setting on my ia64 build. > > Yes, this was explicitly asked by Andrew after his review. This is the > explanation pasted from the changelog. > > 3) It'd be a waste to add branches in the VM if nobody could possibly > run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled > if CONFIG_KVM=m/y. In the current kernel kvm won't yet take > advantage of mmu notifiers, but this already allows compiling a > KVM external module against a kernel with mmu notifiers enabled and > from the next pull from kvm.git we'll start using them. And > GRU/XPMEM will also be able to continue the development by enabling > KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code > to the mainline kernel. Then they can also enable MMU_NOTIFIER in > the same way KVM does it (even if KVM=n). This guarantees nobody > selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. Ah, so Andrew wants users of KVM to do a select of MMU_NOTIFIER. That makes sense. I will change (fix) my Kconfig changes. Thanks, Robin
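The pattern being settled on here, sketched in Kconfig terms (illustrative only - the exact prompts and dependencies of the real KVM entry differ):

	# mm/Kconfig: no prompt text, so the option never appears in
	# menuconfig and can only be turned on by a 'select'
	config MMU_NOTIFIER
		bool

	# a consumer's Kconfig (KVM here; GRU/XPMEM would do the same)
	config KVM
		tristate "Kernel-based Virtual Machine (KVM) support"
		select MMU_NOTIFIER

This keeps MMU_NOTIFIER=n unless at least one consumer is enabled, which is exactly the guarantee described above.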
Thanks, Robin From keshetti85-student at yahoo.co.in Sun May 4 21:55:46 2008 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Mon, 5 May 2008 10:25:46 +0530 Subject: [ofa-general] Install IPoIB separately .. Message-ID: <829ded920805042155r20df506l2ced783586664b@mail.gmail.com> While installing OFED-1.3 on a machine, I forgot to select the IPoIB module. Now, is it possible to build IPoIB module separately and install it without affecting the earlier installation? -Mahesh From ogerlitz at voltaire.com Sun May 4 23:37:29 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 05 May 2008 09:37:29 +0300 Subject: [ofa-general] Re: infiniband hot unplug In-Reply-To: <481A7173.3030007@cs.wisc.edu> References: <47E7DBEA.9030704@cs.wisc.edu> <4818AEB2.9050407@cs.wisc.edu> <4819D1BF.6090002@Voltaire.COM> <4819DF25.1010202@cs.wisc.edu> <481A7173.3030007@cs.wisc.edu> Message-ID: <481EAB29.5090901@voltaire.com> Mike Christie wrote: > > Oh yeah, I was just checking to see how infiniband handled hot > unplugging the card and sparks started to shoot out. Hi Mike, Maybe you can drop an email to Roland with cc to the OpenFabrics general list in order to initiate a discussion on the matter? Or. From vlad at dev.mellanox.co.il Sun May 4 23:39:38 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 05 May 2008 09:39:38 +0300 Subject: [ofa-general] Install IPoIB separately .. In-Reply-To: <829ded920805042155r20df506l2ced783586664b@mail.gmail.com> References: <829ded920805042155r20df506l2ced783586664b@mail.gmail.com> Message-ID: <481EABAA.6080105@dev.mellanox.co.il> Keshetti Mahesh wrote: > While installing OFED-1.3 on a machine, I forgot to select the IPoIB module. > Now, is it possible to build IPoIB module separately and install it without > affecting the earlier installation? > > -Mahesh Not with OFED-1.3 installation script. If you run install.pl, then it will: 1. Uninstall the current OFED installation. 2. Rebuild and install kernel-ib RPM (with IPoIB). 3. Install other binary RPMs following your selection using binary RPMs that were created during the previous install. Regards, Vladimir From tziporet at mellanox.co.il Mon May 5 04:26:54 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 5 May 2008 14:26:54 +0300 Subject: [ofa-general] Agenda for the OFED meeting today (May 5) Message-ID: <6C2C79E72C305246B504CBA17B5500C903EC0879@mtlexch01.mtl.com> Hi, This is the agenda for the OFED meeting today: 1. OFED 1.3.1: 1.1 Planned changes: ULPs changes: IB-bonding - done SRP failover - done SDP crashes - in progress (not clear if we will have something on time) RDS fixes for RDMA API - done librdmacm 1.0.7 - done uDAPL updates - done Open MPI 1.2.6 - done MVAPICH 1.0.1 - done IPoIB - 2 bugs are fixed. There is still one issue that should be resolved.
Low level drivers: Changes that were already committed: nes mlx4 cxgb3 ehca 1.2 Schedule: rc1 - will be released tomorrow rc2 - May 20 GA - May 29 Daily builds of 1.3.1 are already available at: http://www.openfabrics.org/builds/ofed-1.3.1 2. OFED 1.4: Delayed the work on the kernel rebase and will do it now on 2.6.26-rc1. Will have the new tree ready next week. Reason: Many fixes are already applied on the 2.6.26 tree and in this way we can do all the work only once. 3. Open discussion - Open SuSE build system - if Yiftah is able to update on progress - Other topics ... Tziporet From keshetti85-student at yahoo.co.in Mon May 5 05:29:48 2008 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Mon, 5 May 2008 17:59:48 +0530 Subject: [ofa-general] Re: [mvapich-discuss] Using RDMA CM with MVAPICH2 In-Reply-To: References: <829ded920805050401v16961f10y4cae507d58c74cfe@mail.gmail.com> Message-ID: <829ded920805050529k19dee08o3ad19ce9d07bd2ca@mail.gmail.com> On Mon, May 5, 2008 at 5:54 PM, Dhabaleswar Panda wrote: > RDMA CM is designed by OpenFabrics. MVAPICH2 uses this. If you post your > questions to OpenFabrics General mailing list, the designers of RDMA CM > component will be able to provide detailed answers regarding how it works. > > DK Hi all, I want to use the RDMA CM option of MVAPICH2. The procedure described in the user guide is not very informative. Can anyone here give me the detailed procedure for using the RDMA CM option? Also, I'll be glad if someone can give me a document describing how it works in detail. Actually I have some doubts, like how the IP addresses (???) are resolved into IB addresses and what happens in the case of nodes with two HCAs (or 1 HCA with two ports)? In the MVAPICH2 user guide it is mentioned that "RDMA CM device needs to be setup, configured with an IP address and connected to the network". Is this the same as configuring an IPoIB device? -Mahesh From olaf.kirch at oracle.com Mon May 5 05:50:08 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Mon, 5 May 2008 14:50:08 +0200 Subject: [ofa-general] Re: [ewg] Agenda for the OFED meeting today (May 5) In-Reply-To: <6C2C79E72C305246B504CBA17B5500C903EC0879@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C903EC0879@mtlexch01.mtl.com> Message-ID: <200805051450.09752.olaf.kirch@oracle.com> Hi Tziporet, > RDS fixes for RDMA API - done As an update, I just sent Vlad a bugfix for an RDMA-related crash in RDS. It would be cool if that could be included in 1.3.1. I am also currently testing three more bugfix patches; two of them related to dma_sync issues, and one patch to reduce the latency of RDS RDMA notifications (a process expects a notification from the kernel that tells it when it's okay to release the RDMA buffer - the current code tries to give a reliable status at the expense of one round-trip; this turns out to be too slow for some purposes). It is not yet clear, however, which (if any) of these three pending patches will make OFED 1.3.1.
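Coming back to Keshetti's RDMA CM question above: librdmacm resolves an IP address to IB addressing through the normal IP routing tables, which is also what picks between two HCAs (or two ports of one HCA) -- the port whose IPoIB interface routes to the destination gets used. A minimal sketch of the two resolution steps (timeout values arbitrary, event-channel handling elided):

#include <rdma/rdma_cma.h>

static int resolve_peer(struct rdma_cm_id *id, struct sockaddr *dst)
{
        int err;

        /* map the dst IP to a GID and bind id to the matching local port */
        err = rdma_resolve_addr(id, NULL, dst, 2000);
        if (err)
                return err;
        /* ... wait for RDMA_CM_EVENT_ADDR_RESOLVED ... */

        /* query the SA for a path record to the resolved GID */
        return rdma_resolve_route(id, 2000);
        /* ... then wait for RDMA_CM_EVENT_ROUTE_RESOLVED ... */
}

So the "RDMA CM device" setup mentioned in the MVAPICH2 guide does in practice amount to configuring the IPoIB interface.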
Regards, Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From jackm at dev.mellanox.co.il Mon May 5 08:20:49 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 5 May 2008 18:20:49 +0300 Subject: [ofa-general] [PATCH] mlx4_core: Support creation of FMRs with pages smaller than 4K Message-ID: <200805051820.49796.jackm@dev.mellanox.co.il> From: Oren Duer mlx4_core: Support creation of FMRs with pages smaller than 4K Actual smallest page size is given by device capabilities. Signed-off-by: Oren Duer Signed-off-by: Jack Morgenstein --- Roland, The device minimum page size should be taken from the device capabilities, and not hard-coded. This hard-coding has led to problems with new mlx4 firmware. This is for your 2.6.26 git tree. Jack diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c index 79b317b..7123463 100644 --- a/drivers/net/mlx4/mr.c +++ b/drivers/net/mlx4/mr.c @@ -551,7 +551,7 @@ int mlx4_fmr_alloc(struct mlx4_dev *dev, u32 pd, u32 access, int max_pages, u64 mtt_seg; int err = -ENOMEM; - if (page_shift < 12 || page_shift >= 32) + if (page_shift < (ffs(dev->caps.page_size_cap) - 1) || page_shift >= 32) return -EINVAL; /* All MTTs must fit in the same page */ From ramachandra.kuchimanchi at qlogic.com Mon May 5 08:36:58 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 5 May 2008 21:06:58 +0530 Subject: [ofa-general] Re: [PATCH 01/13] QLogic VNIC: Driver - netdev implementation In-Reply-To: References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171624.31725.98475.stgit@localhost.localdomain> Message-ID: <71d336490805050836o68d745f2k4bab68edcfe1da50@mail.gmail.com> Roland, Thanks for the review. Your comments make sense and we will fix the things you pointed out. Please see some clarifications in-line. On Fri, May 2, 2008 at 11:45 PM, Roland Dreier wrote: > > From: Ramachandra K > > > > Signed-off-by: Poornima Kamath > > Signed-off-by: Amar Mudrankit > > For the next submission please clean up the From and Signed-off-by > lines. As it stands now you are saying that you (Ramachandra K) are the > author of the patch, and that Poornima and Amar signed off on it (ie > forwarded it), but you as the person sending the email did not sign off > on it. > I will make sure to sign off on all patches. Should I also drop the From line for the patches which I developed, since I am mailing them myself? I am using the Signed-off-by line to indicate the people who were involved in the development of the patches at some stage. > > > > +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath) > > +{ > > + VNIC_FUNCTION("vnic_stop_xmit()\n"); > > + if (netpath == vnic->current_path) { > > + if (vnic->xmit_started) { > > + netif_stop_queue(vnic->netdevice); > > + vnic->xmit_started = 0; > > + } > > + > > + vnic_stop_xmit_stats(vnic); > > + } > > +} > > Do you have sufficient locking here? Could vnic->current_path or > vnic->xmit_started change after they are tested, leading to bad results? > Also do you get anything from having a xmit_started flag that you > couldn't get just by testing with netif_queue_stopped()? > You are right, xmit_started might not be required and we will look at the locking issue too. > > > > +extern cycles_t recv_ref; > > seems like too generic a name to make global. What the heck are you > using cycle_t to keep track of anyway?
> This is being used as part of the driver internal statistics collection to keep track of the time elapsed between a message arriving from the EVIC indicating that it has done an RDMA write of an Ethernet packet to the driver memory and the driver giving the packet to the network stack. Will fix the variable name. Regards, Ram From steiner at sgi.com Mon May 5 09:21:13 2008 From: steiner at sgi.com (Jack Steiner) Date: Mon, 5 May 2008 11:21:13 -0500 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <1489529e7b53d3f2dab8.1209740704@duo.random> References: <1489529e7b53d3f2dab8.1209740704@duo.random> Message-ID: <20080505162113.GA18761@sgi.com> On Fri, May 02, 2008 at 05:05:04PM +0200, Andrea Arcangeli wrote: > # HG changeset patch > # User Andrea Arcangeli > # Date 1209740175 -7200 > # Node ID 1489529e7b53d3f2dab8431372aa4850ec821caa > # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8 > mmu-notifier-core I upgraded to the latest mmu notifier patch & hit a deadlock. (Sorry - I should have seen this earlier but I haven't tracked the last couple of patches). The GRU does the registration/deregistration of mmu notifiers from mmap/munmap. At this point, the mmap_sem is already held writeable. I hit a deadlock in mm_lock. A quick fix would be to do one of the following: - move the mmap_sem locking to the caller of the [de]registration routines. Since the first/last thing done in mm_lock/mm_unlock is to acquire/release mmap_sem, this change does not cause major changes. - add a flag to mmu_notifier_[un]register routines to indicate if mmap_sem is already locked. I've temporarily deleted the mm_lock locking of mmap_sem and am continuing to test. More later.... --- jack From hrosenstock at xsigo.com Mon May 5 09:22:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 05 May 2008 09:22:13 -0700 Subject: [ofa-general] [PATCHv2] ibsim/README: Clarify point of attachment/SIM_HOST use Message-ID: <1210004533.20493.255.camel@hrosenstock-ws.xsigo.com> Fix typo in original version of this patch Resend as got bounce on general list -- ibsim/README: Clarify point of attachment/SIM_HOST use Signed-off-by: Hal Rosenstock diff --git a/README b/README index f782fbe..b7615aa 100644 --- a/README +++ b/README @@ -90,6 +90,9 @@ Building and using ibsim. - in order to run OpenSM as non-privileged user you may need to export OSM_CACHE_DIR variable and to use '-f' option in order to specify writable path to OpenSM log file. + - Point of attachment is indicated by SIM_HOST environment variable. + If not specified, first entry in topology file is used. For OpenSM, + if -g option is used, it must be the same as this. 5. Enjoy and comment. From chu11 at llnl.gov Mon May 5 09:32:49 2008 From: chu11 at llnl.gov (Al Chu) Date: Mon, 05 May 2008 09:32:49 -0700 Subject: [ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache In-Reply-To: <481D8B33.4000803@dev.mellanox.co.il> References: <481D888A.7080608@dev.mellanox.co.il> <481D8B33.4000803@dev.mellanox.co.il> Message-ID: <1210005169.11133.374.camel@cardanus.llnl.gov> Hey Yevgeny, This looks like a great idea. But is there a reason it's only supported for LMC=0? Since the caching is handled at the ucast-mgr level (rather than in the routing algorithm code), I don't quite see why LMC=0 matters. Maybe it is b/c of future incremental routing on your todo? If that's the case, instead of only caching when LMC=0, perhaps initial incremental routing should only work under LMC=0. Later on incremental routing for LMC > 0 could be added.
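For readers following the LMC question: with LMC l, each port answers to 2^l consecutive LIDs, so anything keyed per LID grows by that factor. The arithmetic, as a trivial illustrative helper (not from the patches):

static inline unsigned int lids_per_port(unsigned int lmc)
{
        return 1u << lmc;       /* LMC 0 -> 1 LID, LMC 7 -> 128 LIDs */
}

With LMC 0 every CA is a single target LID, which keeps a cached-topology comparison simple; with LMC > 0 each cached CA entry would have to cover a whole LID range.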
Al On Sun, 2008-05-04 at 13:08 +0300, Yevgeny Kliteynik wrote: > One thing I need to add here: ucast cache is currently supported > for LMC=0 only. > > -- Yevgeny > > Yevgeny Kliteynik wrote: > > Hi Sasha, > > > > The following series of 4 patches implements unicast routing cache > > in OpenSM. > > > > None of the current routing engines is scalable when we're talking > > about big clusters. On ~5K cluster with ~1.3K switches, it takes > > about two minutes to calculate the routing. The problem is, each > > time the routing is calculated from scratch. > > > > Incremental routing (which is on my to-do list) aims to address this > > problem when there is some "local" change in fabric (e.g. single > > switch failure, single link failure, link added, etc). > > In such cases we can use the routing that was already calculated in > > the previous heavy sweep, and then we just have to modify it according > > to the change. > > > > For instance, if some switch has disappeared from the fabric, we can > > use the routing that existed with this switch, take a step back from > > this switch and see if it is possible to route all the lids that were > > routed through this switch some other way (which is usually the case). > > > > To implement incremental routing, we need to create some kind of unicast > > routing cache, which is what these patches implement. In addition to being > > a step toward the incremental routing, routing cache is usefull by itself. > > > > This cache can save us routing calculation in case of change in the leaf > > switches or in hosts. For instance, if some node is rebooted, OpenSM would > > start a heavy sweep with full routing recalculation when the HCA is going > > down, and another one when HCA is brought up, when in fact both of these > > routing calculation can be replaced by using of unicast routing cache. > > > > Unicast routing cache comprises the following: > > - Topology: a data structure with all the switches and CAs of the fabric > > - LFTs: each switch has an LFT cached > > - Lid matrices: each switch has lid matrices cached, which is needed for > > multicast routing (which is not cached). > > > > There is a topology matching function that compares the current topology > > with the cached one to find out whether the cache is usable (valid) or not. > > > > The cache is used the following way: > > - SM is executed - it starts first routing calculation > > - calculated routing is stored in the cache > > - at some point new heavy sweep is triggered > > - unicast manager checks whether the cache can be used instead > > of new routing calculation. > > In one of the following cases we can use cached routing > > + there is no topology change > > + one or more CAs disappeared (they exist in the cached topology > > model, but missing in the newly discovered fabric) > > + one or more leaf switches disappeared > > In these cases cached routing is written to the switches as is > > (unless the switch doesn't exist). > > If there is any other topology change: > > - existing cache is invalidated > > - topology is cached > > - routing is calculated as usual > > - routing is cached > > > > My simulations show that when the usual routing phase of the heavy > > sweep on the topology that I mentioned above takes ~2 minutes, > > cached routing reduces this time to 6 seconds (which is nice, if you > > ask me...). 
> > > > Of all the cases when the cache is valid, the most painful and > > "complainable" case is when a compute node reboot (which happens pretty > > often) causes two heavy sweeps with two full routing calculations. > > Unicast Routing Cache is aimed to solve this problem (again, in addition > > to being a step toward the incremental routing). > > > > -- Yevgeny > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From chu11 at llnl.gov Mon May 5 09:39:08 2008 From: chu11 at llnl.gov (Al Chu) Date: Mon, 05 May 2008 09:39:08 -0700 Subject: [ofa-general] Re: [PATCH 2/4] opensm: adding ucast cache option In-Reply-To: <481D8944.10003@dev.mellanox.co.il> References: <481D8944.10003@dev.mellanox.co.il> Message-ID: <1210005548.11133.377.camel@cardanus.llnl.gov> Hey Yevgeny, Tiny nit, there is no manpage entry :-) Al On Sun, 2008-05-04 at 13:00 +0300, Yevgeny Kliteynik wrote: > Adding ucast cache option to OpenSM command line > arguments: -F or --ucast_cache. > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/include/opensm/osm_subnet.h | 6 +++++- > opensm/opensm/main.c | 33 +++++++++++++++++++++++++++++++-- > opensm/opensm/osm_subnet.c | 11 ++++++++++- > 3 files changed, 46 insertions(+), 4 deletions(-) > > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h > index b1dd659..cffbe5e 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > * > * This software is available to you under a choice of one of two > @@ -256,6 +256,7 @@ typedef struct _osm_subn_opt { > boolean_t sweep_on_trap; > char *routing_engine_name; > boolean_t connect_roots; > + boolean_t use_ucast_cache; > char *lid_matrix_dump_file; > char *ucast_dump_file; > char *root_guid_file; > @@ -441,6 +442,9 @@ typedef struct _osm_subn_opt { > * up/down routing engine (even if this violates "pure" deadlock > * free up/down algorithm) > * > +* use_ucast_cache > +* When TRUE enables unicast routing cache. > +* > * lid_matrix_dump_file > * Name of the lid matrix dump file from where switch > * lid matrices (min hops tables) will be loaded > diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c > index fb41d50..71deacb 100644 > --- a/opensm/opensm/main.c > +++ b/opensm/opensm/main.c > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> * > * This software is available to you under a choice of one of two > @@ -183,6 +183,17 @@ static void show_usage(void) > " and in this way be IBA compliant. In many cases,\n" > " this can violate \"pure\" deadlock free algorithm, so\n" > " use it carefully.\n\n"); > + printf("-F\n" > + "--ucast_cache\n" > + " This option enables unicast routing cache to prevent\n" > + " routing recalculation (which is a heavy task in a\n" > + " large cluster) when there was no topology change\n" > + " detected during the heavy sweep, or when the topology\n" > + " change does not require new routing calculation,\n" > + " e.g. in case of host reboot.\n" > + " This option becomes very handy when the cluster size\n" > + " is thousands of nodes.\n" > + " Unicast cache is not supported for LMC > 0.\n\n"); > printf("-M\n" > "--lid_matrix_file \n" > " This option specifies the name of the lid matrix dump file\n" > @@ -599,7 +610,7 @@ int main(int argc, char *argv[]) > char *ignore_guids_file_name = NULL; > uint32_t val; > const char *const short_option = > - "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:NBIQvVhorcyxp:n:q:k:C:"; > + "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:FNBIQvVhorcyxp:n:q:k:C:"; > > /* > In the array below, the 2nd parameter specifies the number > @@ -634,6 +645,7 @@ int main(int argc, char *argv[]) > {"smkey", 1, NULL, 'k'}, > {"routing_engine", 1, NULL, 'R'}, > {"connect_roots", 0, NULL, 'z'}, > + {"ucast_cache", 0, NULL, 'F'}, > {"lid_matrix_file", 1, NULL, 'M'}, > {"ucast_file", 1, NULL, 'U'}, > {"sadb_file", 1, NULL, 'S'}, > @@ -805,6 +817,12 @@ int main(int argc, char *argv[]) > "ERROR: LMC must be 7 or less."); > return (-1); > } > + if (opt.use_ucast_cache && temp > 0) { > + fprintf(stderr, > + "ERROR: Unicast routing cache is " > + "not supported for LMC > 0\n"); > + return (-1); > + } > opt.lmc = (uint8_t) temp; > printf(" LMC = %d\n", temp); > break; > @@ -891,6 +909,17 @@ int main(int argc, char *argv[]) > printf(" Connect roots option is on\n"); > break; > > + case 'F': > + if (opt.lmc > 0) { > + fprintf(stderr, > + "ERROR: Unicast routing cache is " > + "not supported for LMC > 0\n"); > + return (-1); > + } > + opt.use_ucast_cache = TRUE; > + printf(" Unicast routing cache option is on\n"); > + break; > + > case 'M': > opt.lid_matrix_dump_file = optarg; > printf(" Lid matrix dump file is \'%s\'\n", optarg); > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index 47d735f..dc55e72 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -1,6 +1,6 @@ > /* > * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
> * > * This software is available to you under a choice of one of two > @@ -461,6 +461,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) > p_opt->sweep_on_trap = TRUE; > p_opt->routing_engine_name = NULL; > p_opt->connect_roots = FALSE; > + p_opt->use_ucast_cache = FALSE; > p_opt->lid_matrix_dump_file = NULL; > p_opt->ucast_dump_file = NULL; > p_opt->root_guid_file = NULL; > @@ -1290,6 +1291,9 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) > opts_unpack_boolean("connect_roots", > p_key, p_val, &p_opts->connect_roots); > > + opts_unpack_boolean("use_ucast_cache", > + p_key, p_val, &p_opts->use_ucast_cache); > + > opts_unpack_charp("log_file", p_key, p_val, &p_opts->log_file); > > opts_unpack_uint32("log_max_size", > @@ -1543,6 +1547,11 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) > "# Connect roots (use FALSE if unsure)\n" > "connect_roots %s\n\n", > p_opts->connect_roots ? "TRUE" : "FALSE"); > + if (p_opts->use_ucast_cache) > + fprintf(opts_file, > + "# Use unicast routing cache (use FALSE if unsure)\n" > + "use_ucast_cache %s\n\n", > + p_opts->use_ucast_cache ? "TRUE" : "FALSE"); > if (p_opts->lid_matrix_dump_file) > fprintf(opts_file, > "# Lid matrix dump file name\n" -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From andrea at qumranet.com Mon May 5 10:14:34 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Mon, 5 May 2008 19:14:34 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080505162113.GA18761@sgi.com> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080505162113.GA18761@sgi.com> Message-ID: <20080505171434.GF8470@duo.random> On Mon, May 05, 2008 at 11:21:13AM -0500, Jack Steiner wrote: > The GRU does the registration/deregistration of mmu notifiers from mmap/munmap. > At this point, the mmap_sem is already held writeable. I hit a deadlock > in mm_lock. It'd been better to know about this detail earlier, but frankly this is a minor problem, the important thing is we all agree together on the more difficult parts ;). > A quick fix would be to do one of the following: > > - move the mmap_sem locking to the caller of the [de]registration routines. > Since the first/last thing done in mm_lock/mm_unlock is to > acquire/release mmap_sem, this change does not cause major changes. I don't like this solution very much. Neither GRU nor KVM will call mmu_notifier_register inside the mmap_sem protected sections, so I think the default mmu_notifier_register should be smp safe by itself without requiring additional locks to be artificially taken externally (especially because the need for mmap_sem in write mode is a very mmu_notifier internal detail). > - add a flag to mmu_notifier_[un]register routines to indicate > if mmap_sem is already locked. The interface would change like this: #define MMU_NOTIFIER_REGISTER_MMAP_SEM (1<<0) void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long mmu_notifier_flags); A third solution is to add: /* * This can be called instead of mmu_notifier_register after * taking the mmap_sem in write mode (read mode isn't enough). */ void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm); Do you still prefer the bitflag, or do you prefer __mmu_notifier_register? It's ok either way, except __mmu_notifier_register could be removed in a backwards-compatible way, the bitflag can't.
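Spelled out, the third option might look roughly like this (an illustration of the intent, not the actual patch):

void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm);

void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
{
        /* the plain variant takes mmap_sem itself; callers like GRU
         * that already hold it for writing use the __ variant directly */
        down_write(&mm->mmap_sem);
        __mmu_notifier_register(mn, mm);
        up_write(&mm->mmap_sem);
}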
> I've temporarily deleted the mm_lock locking of mmap_sem and am continuing to > test. More later.... Sure! In the meantime go ahead this way. Another very minor change I've been thinking about is to make ->release not mandatory. It happens that with KVM ->release isn't strictly required because after mm_users reaches 0, no guest could possibly run anymore. So I'm using ->release only for debugging by placing -1UL in the root shadow pagetable, to be sure ;). So because at least one user won't strictly require ->release, being consistent in having all methods optional may be nicer. Alternatively we could make them all mandatory and if somebody doesn't need one of the methods it should implement it as a dummy function. Both ways have pros and cons, but they don't make any difference to us in practice. If I've to change the patch for the mmap_sem taken during registration I may as well clean up this minor bit. Also note the rculist.h patch you sent earlier won't work against mainline so I can't incorporate it in my patchset; Andrew will have to apply it as mmu-notifier-core-mm after incorporating mmu-notifier-core into -mm. Until a new update is released, mmu-notifier-core v15 remains ok for merging, no known bugs, here we're talking about a new and simple feature and a tiny cleanup that nobody can notice anyway. From steiner at sgi.com Mon May 5 10:25:06 2008 From: steiner at sgi.com (Jack Steiner) Date: Mon, 5 May 2008 12:25:06 -0500 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080505171434.GF8470@duo.random> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080505162113.GA18761@sgi.com> <20080505171434.GF8470@duo.random> Message-ID: <20080505172506.GA9247@sgi.com> On Mon, May 05, 2008 at 07:14:34PM +0200, Andrea Arcangeli wrote: > On Mon, May 05, 2008 at 11:21:13AM -0500, Jack Steiner wrote: > > The GRU does the registration/deregistration of mmu notifiers from mmap/munmap. > > At this point, the mmap_sem is already held writeable. I hit a deadlock > > in mm_lock. > > It'd been better to know about this detail earlier, Agree. My apologies... I should have caught it. > but frankly this > is a minor problem, the important thing is we all agree together on > the more difficult parts ;). > > > A quick fix would be to do one of the following: > > > > - move the mmap_sem locking to the caller of the [de]registration routines. > > Since the first/last thing done in mm_lock/mm_unlock is to > > acquire/release mmap_sem, this change does not cause major changes. > > I don't like this solution very much. Neither GRU nor KVM will call > mmu_notifier_register inside the mmap_sem protected sections, so I > think the default mmu_notifier_register should be smp safe by itself > without requiring additional locks to be artificially taken externally > (especially because the need for mmap_sem in write mode is a very > mmu_notifier internal detail). > > > - add a flag to mmu_notifier_[un]register routines to indicate > > if mmap_sem is already locked. > > The interface would change like this: > > #define MMU_NOTIFIER_REGISTER_MMAP_SEM (1<<0) > void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm, > unsigned long mmu_notifier_flags); That works... > > A third solution is to add: > > /* > * This can be called instead of mmu_notifier_register after > * taking the mmap_sem in write mode (read mode isn't enough).
> */ > void __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm); > > Do you still prefer the bitflag, or do you prefer > __mmu_notifier_register? It's ok either way, except > __mmu_notifier_register could be removed in a backwards-compatible > way, the bitflag can't. __mmu_notifier_register/__mmu_notifier_unregister seems like a better way to go, although either is ok. > > Sure! In the meantime go ahead this way. > > Another very minor change I've been thinking about is to make > ->release not mandatory. It happens that with KVM ->release isn't > strictly required because after mm_users reaches 0, no guest could > possibly run anymore. So I'm using ->release only for debugging by > placing -1UL in the root shadow pagetable, to be sure ;). So because > at least one user won't strictly require ->release, being consistent in > having all methods optional may be nicer. Alternatively we could make > them all mandatory and if somebody doesn't need one of the methods it > should implement it as a dummy function. Both ways have pros and cons, > but they don't make any difference to us in practice. If I've to > change the patch for the mmap_sem taken during registration I may as > well clean up this minor bit. Let me finish my testing. At one time, I did not use ->release but with all the locking & teardown changes, I need to do some reverification. --- jack From andrea at qumranet.com Mon May 5 11:34:05 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Mon, 5 May 2008 20:34:05 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080505172506.GA9247@sgi.com> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080505162113.GA18761@sgi.com> <20080505171434.GF8470@duo.random> <20080505172506.GA9247@sgi.com> Message-ID: <20080505183405.GI8470@duo.random> On Mon, May 05, 2008 at 12:25:06PM -0500, Jack Steiner wrote: > Agree. My apologies... I should have caught it. No problem. > > __mmu_notifier_register/__mmu_notifier_unregister seems like a better way to > > go, although either is ok. > > If you also like __mmu_notifier_register more I'll go with it. The bitflags seem like a bit of overkill as I can't see the need of any other bitflag other than this one and they also can't be removed as easily in case you'll find a way to call it outside the lock later. > Let me finish my testing. At one time, I did not use ->release but > with all the locking & teardown changes, I need to do some reverification. If you didn't implement it you shall apply this patch but you shall read carefully the comment I wrote that covers that usage case. diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -29,10 +29,25 @@ struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is * being destroyed by exit_mmap, always before all pages are - * freed. It's mandatory to implement this method. This can - * run concurrently with other mmu notifier methods and it + * freed. This can run concurrently with other mmu notifier + * methods (the ones invoked outside the mm context) and it * should tear down all secondary mmu mappings and freeze the - * secondary mmu. + * secondary mmu.
If this method isn't implemented you've to + * be sure that nothing could possibly write to the pages + * through the secondary mmu by the time the last thread with + * tsk->mm == mm exits. + * + * As side note: the pages freed after ->release returns could + * be immediately reallocated by the gart at an alias physical + * address with a different cache model, so if ->release isn't + * implemented because all memory accesses through the + * secondary mmu implicitly are terminated by the time the + * last thread of this mm quits, you've also to be sure that + * speculative hardware operations can't allocate dirty + * cachelines in the cpu that could not be snooped and made + * coherent with the other read and write operations happening + * through the gart alias address, leading to memory + * corruption. */ void (*release)(struct mmu_notifier *mn, struct mm_struct *mm); diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -59,7 +59,8 @@ void __mmu_notifier_release(struct mm_st * from establishing any more sptes before all the * pages in the mm are freed. */ - mn->ops->release(mn, mm); + if (mn->ops->release) + mn->ops->release(mn, mm); srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); spin_lock(&mm->mmu_notifier_mm->lock); } @@ -251,7 +252,8 @@ void mmu_notifier_unregister(struct mmu_ * guarantee ->release is called before freeing the * pages. */ - mn->ops->release(mn, mm); + if (mn->ops->release) + mn->ops->release(mn, mm); srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); } else spin_unlock(&mm->mmu_notifier_mm->lock); From steiner at sgi.com Mon May 5 12:46:25 2008 From: steiner at sgi.com (Jack Steiner) Date: Mon, 5 May 2008 14:46:25 -0500 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080505183405.GI8470@duo.random> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080505162113.GA18761@sgi.com> <20080505171434.GF8470@duo.random> <20080505172506.GA9247@sgi.com> <20080505183405.GI8470@duo.random> Message-ID: <20080505194625.GA17734@sgi.com> On Mon, May 05, 2008 at 08:34:05PM +0200, Andrea Arcangeli wrote: > On Mon, May 05, 2008 at 12:25:06PM -0500, Jack Steiner wrote: > > Agree. My apologies... I should have caught it. > > No problem. > > > __mmu_notifier_register/__mmu_notifier_unregister seems like a better way to > > go, although either is ok. > > If you also like __mmu_notifier_register more I'll go with it. The > bitflags seems like a bit of overkill as I can't see the need of any > other bitflag other than this one and they also can't be removed as > easily in case you'll find a way to call it outside the lock later. > > > Let me finish my testing. At one time, I did not use ->release but > > with all the locking & teardown changes, I need to do some reverification. I finished testing & everything looks good. I do use the ->release callout but mainly as a performance hint that teardown is in progress & that TLB flushing is no longer needed. (GRU TLB entries are tagged with a task-specific ID that will not be reused until a full TLB purge is done. This eliminates the requirement to purge at task-exit.) Normally, a notifier is registered when a GRU segment is mmaped, and unregistered when the segment is unmapped. Well behaved tasks will not have a GRU or a notifier when exit starts. If a task fails to unmap a GRU segment, they still exist at the start of exit. On the ->release callout, I set a flag in the container of my mmu_notifier that exit has started. 
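Sketched in code, that container arrangement might look like the following (struct and field names are illustrative, based on the description here rather than on the attached file):

struct gru_mm_struct {
        struct mmu_notifier     ms_notifier;
        atomic_t                ms_refcnt;      /* shared by all GRU segments of the mm */
        int                     ms_released;    /* set once ->release fires */
};

static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
        struct gru_mm_struct *gms =
                container_of(mn, struct gru_mm_struct, ms_notifier);

        gms->ms_released = 1;   /* later TLB flushes become no-ops */
}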
As VMA are cleaned up, TLB flushes are skipped because the flag is set. When the GRU VMA is deleted, I free my structure containing the notifier. I _think_ it works. Do you see any problems? I should also mention that I have an open-coded function that possibly belongs in mmu_notifier.c. A user is allowed to have multiple GRU segments. Each GRU has a couple of data structures linked to the VMA. All, however, need to share the same notifier. I currently open code a function that scans the notifier list to determine if a GRU notifier already exists. If it does, I update a refcnt & use it. Otherwise, I register a new one. All of this is protected by the mmap_sem. Just in case I mangled the above description, I'll attach a copy of the GRU mmuops code. --- jack -------------- next part -------------- A non-text attachment was scrubbed... Name: z Type: application/x-compress Size: 2662 bytes Desc: not available URL: From rdreier at cisco.com Mon May 5 13:28:59 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 May 2008 13:28:59 -0700 Subject: [ofa-general] Re: [PATCH] mlx4_core: Support creation of FMRs with pages smaller than 4K In-Reply-To: <200805051820.49796.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 5 May 2008 18:20:49 +0300") References: <200805051820.49796.jackm@dev.mellanox.co.il> Message-ID: > The device minimum page size should be taken from the device > capabilities, and not hard-coded. This hard-coding has led to problems with new mlx4 firmware. Please don't expect me to guess what kind of problems... what changed with new firmware, what breaks, and why? From rdreier at cisco.com Mon May 5 13:42:15 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 May 2008 13:42:15 -0700 Subject: [ofa-general] Re: [PATCH 01/13] QLogic VNIC: Driver - netdev implementation In-Reply-To: <71d336490805050836o68d745f2k4bab68edcfe1da50@mail.gmail.com> (Ramachandra K.'s message of "Mon, 5 May 2008 21:06:58 +0530") References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171624.31725.98475.stgit@localhost.localdomain> <71d336490805050836o68d745f2k4bab68edcfe1da50@mail.gmail.com> Message-ID: > I will make sure to sign off on all patches. Should I also drop the From line > for the patches which I developed, since I am mailing them myself? It doesn't hurt to include a From: line if it is the same as the one for the email itself, but it isn't necessary. When I import a patch the last From: line will be used. > I am using the Signed-off-by line to indicate the people who were > involved in the development of the patches at some stage. That's fine. You can read Documentation/SubmittingPatches to see the precise legal meaning of Signed-off-by, and make sure that it applies to everyone whose signoff you are including. You can also add less formal text like "X helped develop this patch" in the changelog entry. > > > +extern cycles_t recv_ref; > > > > seems like too generic a name to make global. What the heck are you > > using cycle_t to keep track of anyway? > > > > This is being used as part of the driver internal statistics > collection to keep track of the time > elapsed between a message arriving from the EVIC indicating that it > has done an RDMA write of > an Ethernet packet to the driver memory and the driver giving the packet > to the network stack. cycles don't track time (eg x86 TSC might stop for a while). Do you *really* need to use cycles, or are jiffies a better replacement? - R.
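A sketch of the jiffies-based alternative (names invented for illustration, not taken from the QLogic driver):

#include <linux/jiffies.h>

static unsigned long vnic_recv_stamp;

static void vnic_stamp_recv(void)
{
        vnic_recv_stamp = jiffies;      /* EVIC completion arrived */
}

static unsigned int vnic_recv_latency_ms(void)
{
        /* called when the packet is handed to the network stack */
        return jiffies_to_msecs(jiffies - vnic_recv_stamp);
}

The tradeoff is resolution: jiffies advances at HZ (4 ms per tick at HZ=250), while a cycle counter is fine-grained but, as noted, not a dependable time base.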
From kliteyn at dev.mellanox.co.il Mon May 5 13:52:28 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 05 May 2008 23:52:28 +0300 Subject: [ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache In-Reply-To: <1210005169.11133.374.camel@cardanus.llnl.gov> References: <481D888A.7080608@dev.mellanox.co.il> <481D8B33.4000803@dev.mellanox.co.il> <1210005169.11133.374.camel@cardanus.llnl.gov> Message-ID: <481F738C.3040804@dev.mellanox.co.il> Al Chu wrote: > Hey Yevgeny, > > This looks like a great idea. But is there a reason its only supported > for LMC=0? Since the caching is handled at the ucast-mgr level (rather > than in the routing algorithm code), I don't quite see why LMC=0 > matters. No particular reason - I'll enhance it for LMC>0, just didn't find the time to do it right now. The cached topology model is based on LIDs, so I just need to check that LMC>0 doesn't break anything. I also had a more complex topology and routing model, where I wasn't relying on LIDs - I had what I called "Virtual LIDs", and at every heavy sweep the topology model was built and Virtual LIDs were matched to LIDs to create VLID <-> LID mapping, so that the cache won't depend on fabric LIDs, and there I had some problems with LMC (can't remember what exactly), but that model proved to be useless. > Maybe it is b/c of future incremental routing on your todo? If that's > the case, instead of only caching when LMC=0, perhaps initial > incremental routing should only work under LMC=0. Later on incremental > routing for LMC > 0 could be added. Agree, that is what I eventually should do. -- Yevgeny > Al > > On Sun, 2008-05-04 at 13:08 +0300, Yevgeny Kliteynik wrote: >> One thing I need to add here: ucast cache is currently supported >> for LMC=0 only. >> >> -- Yevgeny >> >> Yevgeny Kliteynik wrote: >>> Hi Sasha, >>> >>> The following series of 4 patches implements unicast routing cache >>> in OpenSM. >>> >>> None of the current routing engines is scalable when we're talking >>> about big clusters. On ~5K cluster with ~1.3K switches, it takes >>> about two minutes to calculate the routing. The problem is, each >>> time the routing is calculated from scratch. >>> >>> Incremental routing (which is on my to-do list) aims to address this >>> problem when there is some "local" change in fabric (e.g. single >>> switch failure, single link failure, link added, etc). >>> In such cases we can use the routing that was already calculated in >>> the previous heavy sweep, and then we just have to modify it according >>> to the change. >>> >>> For instance, if some switch has disappeared from the fabric, we can >>> use the routing that existed with this switch, take a step back from >>> this switch and see if it is possible to route all the lids that were >>> routed through this switch some other way (which is usually the case). >>> >>> To implement incremental routing, we need to create some kind of unicast >>> routing cache, which is what these patches implement. In addition to being >>> a step toward the incremental routing, routing cache is usefull by itself. >>> >>> This cache can save us routing calculation in case of change in the leaf >>> switches or in hosts. For instance, if some node is rebooted, OpenSM would >>> start a heavy sweep with full routing recalculation when the HCA is going >>> down, and another one when HCA is brought up, when in fact both of these >>> routing calculation can be replaced by using of unicast routing cache. 
>>> >>> Unicast routing cache comprises the following: >>> - Topology: a data structure with all the switches and CAs of the fabric >>> - LFTs: each switch has an LFT cached >>> - Lid matrices: each switch has lid matrices cached, which is needed for >>> multicast routing (which is not cached). >>> >>> There is a topology matching function that compares the current topology >>> with the cached one to find out whether the cache is usable (valid) or not. >>> >>> The cache is used the following way: >>> - SM is executed - it starts first routing calculation >>> - calculated routing is stored in the cache >>> - at some point new heavy sweep is triggered >>> - unicast manager checks whether the cache can be used instead >>> of new routing calculation. >>> In one of the following cases we can use cached routing >>> + there is no topology change >>> + one or more CAs disappeared (they exist in the cached topology >>> model, but missing in the newly discovered fabric) >>> + one or more leaf switches disappeared >>> In these cases cached routing is written to the switches as is >>> (unless the switch doesn't exist). >>> If there is any other topology change: >>> - existing cache is invalidated >>> - topology is cached >>> - routing is calculated as usual >>> - routing is cached >>> >>> My simulations show that when the usual routing phase of the heavy >>> sweep on the topology that I mentioned above takes ~2 minutes, >>> cached routing reduces this time to 6 seconds (which is nice, if you >>> ask me...). >>> >>> Of all the cases when the cache is valid, the most painful and >>> "complainable" case is when a compute node reboot (which happens pretty >>> often) causes two heavy sweeps with two full routing calculations. >>> Unicast Routing Cache is aimed to solve this problem (again, in addition >>> to being a step toward the incremental routing). >>> >>> -- Yevgeny >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Mon May 5 14:12:39 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 05 May 2008 14:12:39 -0700 Subject: [ofa-general] [PATCH 0/4] opensm: Unicast Routing Cache In-Reply-To: <481D888A.7080608@dev.mellanox.co.il> References: <481D888A.7080608@dev.mellanox.co.il> Message-ID: <1210021959.27137.89.camel@hrosenstock-ws.xsigo.com> On Sun, 2008-05-04 at 12:57 +0300, Yevgeny Kliteynik wrote: > My simulations show that when the usual routing phase of the heavy > sweep on the topology that I mentioned above takes ~2 minutes, > cached routing reduces this time to 6 seconds (which is nice, if you > ask me...). Cool! 
-- Hal From hrosenstock at xsigo.com Mon May 5 14:12:41 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 05 May 2008 14:12:41 -0700 Subject: [ofa-general] Re: [PATCH 1/4] opensm/osm_ucast_cache.{c, h}: ucast routing cache implementation In-Reply-To: <481D8905.1010207@dev.mellanox.co.il> References: <481D8905.1010207@dev.mellanox.co.il> Message-ID: <1210021961.27137.90.camel@hrosenstock-ws.xsigo.com> Hi Yevgeny, On Sun, 2008-05-04 at 12:59 +0300, Yevgeny Kliteynik wrote: I haven't yet had a chance to review this in detail but think that router ports need to be accommodated in the subnet (I think this is a firm requirement as router ports on the subnet are already supported) and also think that nothing should be introduced precluding the running of OpenSM on a router port. From the latter standpoint, it looks much like a CA port. -- Hal From kliteyn at dev.mellanox.co.il Mon May 5 14:35:28 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 06 May 2008 00:35:28 +0300 Subject: [ofa-general] Re: [PATCH 2/4] opensm: adding ucast cache option In-Reply-To: <1210005548.11133.377.camel@cardanus.llnl.gov> References: <481D8944.10003@dev.mellanox.co.il> <1210005548.11133.377.camel@cardanus.llnl.gov> Message-ID: <481F7DA0.6090707@dev.mellanox.co.il> Al Chu wrote: > Hey Yevgeny, > > Tiny nit, there is no manpage entry :-) Right, thanks :) -- Yevgeny > Al > > On Sun, 2008-05-04 at 13:00 +0300, Yevgeny Kliteynik wrote: >> Adding ucast cache option to OpenSM command line >> arguments: -F or --ucast_cache. >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> opensm/include/opensm/osm_subnet.h | 6 +++++- >> opensm/opensm/main.c | 33 +++++++++++++++++++++++++++++++-- >> opensm/opensm/osm_subnet.c | 11 ++++++++++- >> 3 files changed, 46 insertions(+), 4 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h >> index b1dd659..cffbe5e 100644 >> --- a/opensm/include/opensm/osm_subnet.h >> +++ b/opensm/include/opensm/osm_subnet.h >> @@ -1,6 +1,6 @@ >> /* >> * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. >> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. >> + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. >> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>> * >> * This software is available to you under a choice of one of two >> @@ -183,6 +183,17 @@ static void show_usage(void) >> " and in this way be IBA compliant. In many cases,\n" >> " this can violate \"pure\" deadlock free algorithm, so\n" >> " use it carefully.\n\n"); >> + printf("-F\n" >> + "--ucast_cache\n" >> + " This option enables unicast routing cache to prevent\n" >> + " routing recalculation (which is a heavy task in a\n" >> + " large cluster) when there was no topology change\n" >> + " detected during the heavy sweep, or when the topology\n" >> + " change does not require new routing calculation,\n" >> + " e.g. in case of host reboot.\n" >> + " This option becomes very handy when the cluster size\n" >> + " is thousands of nodes.\n" >> + " Unicast cache is not supported for LMC > 0.\n\n"); >> printf("-M\n" >> "--lid_matrix_file \n" >> " This option specifies the name of the lid matrix dump file\n" >> @@ -599,7 +610,7 @@ int main(int argc, char *argv[]) >> char *ignore_guids_file_name = NULL; >> uint32_t val; >> const char *const short_option = >> - "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:NBIQvVhorcyxp:n:q:k:C:"; >> + "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:FNBIQvVhorcyxp:n:q:k:C:"; >> >> /* >> In the array below, the 2nd parameter specifies the number >> @@ -634,6 +645,7 @@ int main(int argc, char *argv[]) >> {"smkey", 1, NULL, 'k'}, >> {"routing_engine", 1, NULL, 'R'}, >> {"connect_roots", 0, NULL, 'z'}, >> + {"ucast_cache", 0, NULL, 'F'}, >> {"lid_matrix_file", 1, NULL, 'M'}, >> {"ucast_file", 1, NULL, 'U'}, >> {"sadb_file", 1, NULL, 'S'}, >> @@ -805,6 +817,12 @@ int main(int argc, char *argv[]) >> "ERROR: LMC must be 7 or less."); >> return (-1); >> } >> + if (opt.use_ucast_cache && temp > 0) { >> + fprintf(stderr, >> + "ERROR: Unicast routing cache is " >> + "not supported for LMC > 0\n"); >> + return (-1); >> + } >> opt.lmc = (uint8_t) temp; >> printf(" LMC = %d\n", temp); >> break; >> @@ -891,6 +909,17 @@ int main(int argc, char *argv[]) >> printf(" Connect roots option is on\n"); >> break; >> >> + case 'F': >> + if (opt.lmc > 0) { >> + fprintf(stderr, >> + "ERROR: Unicast routing cache is " >> + "not supported for LMC > 0\n"); >> + return (-1); >> + } >> + opt.use_ucast_cache = TRUE; >> + printf(" Unicast routing cache option is on\n"); >> + break; >> + >> case 'M': >> opt.lid_matrix_dump_file = optarg; >> printf(" Lid matrix dump file is \'%s\'\n", optarg); >> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c >> index 47d735f..dc55e72 100644 >> --- a/opensm/opensm/osm_subnet.c >> +++ b/opensm/opensm/osm_subnet.c >> @@ -1,6 +1,6 @@ >> /* >> * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. >> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. >> + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. >> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
>> * >> * This software is available to you under a choice of one of two >> @@ -461,6 +461,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) >> p_opt->sweep_on_trap = TRUE; >> p_opt->routing_engine_name = NULL; >> p_opt->connect_roots = FALSE; >> + p_opt->use_ucast_cache = FALSE; >> p_opt->lid_matrix_dump_file = NULL; >> p_opt->ucast_dump_file = NULL; >> p_opt->root_guid_file = NULL; >> @@ -1290,6 +1291,9 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) >> opts_unpack_boolean("connect_roots", >> p_key, p_val, &p_opts->connect_roots); >> >> + opts_unpack_boolean("use_ucast_cache", >> + p_key, p_val, &p_opts->use_ucast_cache); >> + >> opts_unpack_charp("log_file", p_key, p_val, &p_opts->log_file); >> >> opts_unpack_uint32("log_max_size", >> @@ -1543,6 +1547,11 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) >> "# Connect roots (use FALSE if unsure)\n" >> "connect_roots %s\n\n", >> p_opts->connect_roots ? "TRUE" : "FALSE"); >> + if (p_opts->use_ucast_cache) >> + fprintf(opts_file, >> + "# Use unicast routing cache (use FALSE if unsure)\n" >> + "use_ucast_cache %s\n\n", >> + p_opts->use_ucast_cache ? "TRUE" : "FALSE"); >> if (p_opts->lid_matrix_dump_file) >> fprintf(opts_file, >> "# Lid matrix dump file name\n" -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From Nathan.Dauchy at noaa.gov Mon May 5 14:49:29 2008 From: Nathan.Dauchy at noaa.gov (Nathan Dauchy) Date: Mon, 05 May 2008 15:49:29 -0600 Subject: [ofa-general] Install IPoIB separately .. In-Reply-To: <481EABAA.6080105@dev.mellanox.co.il> References: <829ded920805042155r20df506l2ced783586664b@mail.gmail.com> <481EABAA.6080105@dev.mellanox.co.il> Message-ID: <481F80E9.9010009@noaa.gov> Vladimir Sokolovsky wrote: > Keshetti Mahesh wrote: >> While installing OFED-1.3 on a machine, I forgot to select the IPoIB >> module. >> Now, is it possible to build IPoIB module separately and install it >> without >> affecting the earlier installation? >> >> -Mahesh > > Not with OFED-1.3 installation script. > If you run install.pl, then it will: > 1. Uninstall the current OFED installation. > 2. Rebuild and install kernel-ib RPM (with IPoIB). > 3. Install other binary RPMs following your selection using binary RPMs > that were created during the previous install. > > Regards, > Vladimir This coupling of install and build steps complicates life for users and seems like a step backwards from OFED-1.2. From the "OFED Aug 13 meeting summary", this change was made in part because the previous build method and manner of handling dependencies did not follow standard RPM usage. I don't think that uninstalling multiple RPMs and rebuilding them in order to add another RPM is standard RPM usage either. Can this be put on the "to-do" list for OFED-1.3.2? Is "install.pl" the only way to reliably build OFED-1.3?
Thanks, Nathan From kliteyn at dev.mellanox.co.il Mon May 5 14:58:13 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 06 May 2008 00:58:13 +0300 Subject: [ofa-general] Re: [PATCH 1/4] opensm/osm_ucast_cache.{c, h}: ucast routing cache implementation In-Reply-To: <1210021961.27137.90.camel@hrosenstock-ws.xsigo.com> References: <481D8905.1010207@dev.mellanox.co.il> <1210021961.27137.90.camel@hrosenstock-ws.xsigo.com> Message-ID: <481F82F5.8030505@dev.mellanox.co.il> Hal Rosenstock wrote: > Hi Yevgeny, > > On Sun, 2008-05-04 at 12:59 +0300, Yevgeny Kliteynik wrote: > > I haven't yet had a chance to review this in detail but think that > router ports need to be accommodated in the subnet (I think this is a > firm requirement as router ports on the subnet are already supported) > and also think that nothing should be introduced precluding the running > of OpenSM on a router port. From the latter standpoint, it looks much > like a CA port. This is exactly how I implemented it - any non-switch port is treated as CA, which is just a target LID. Well, I mean I intended to implement it that way - I reviewed it again, and it appears that the cache is fine with routing to routers and running from switches, but there will be a problem when SM runs on a router node - cache will complain and fall back to usual routing. That can be easily fixed. However, I'm not sure how the OpenSM will behave in general when running from switch or router - I've never tried it. Has anyone tried it? -- Yevgeny > -- Hal > > From hrosenstock at xsigo.com Mon May 5 15:20:44 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 05 May 2008 15:20:44 -0700 Subject: [ofa-general] Re: [PATCH 1/4] opensm/osm_ucast_cache.{c, h}: ucast routing cache implementation In-Reply-To: <481F82F5.8030505@dev.mellanox.co.il> References: <481D8905.1010207@dev.mellanox.co.il> <1210021961.27137.90.camel@hrosenstock-ws.xsigo.com> <481F82F5.8030505@dev.mellanox.co.il> Message-ID: <1210026044.27137.121.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-06 at 00:58 +0300, Yevgeny Kliteynik wrote: > Hal Rosenstock wrote: > > Hi Yevgeny, > > > > On Sun, 2008-05-04 at 12:59 +0300, Yevgeny Kliteynik wrote: > > > > I haven't yet had a chance to review this in detail but think that > > router ports need to be accommodated in the subnet (I think this is a > > firm requirement as router ports on the subnet are already supported) > > and also think that nothing should be introduced precluding the running > > of OpenSM on a router port. From the latter standpoint, it looks much > > like a CA port. > > This is exactly how I implemented it - any non-switch port is > treated as CA, which is just a target LID. > > Well, I mean I intended to implement it that way - I reviewed it again, > and it appears that the cache is fine with routing to routers and running > from switches, then it's just the variable names which indicate ca if routers are grouped with cas. > but there will be a problem when SM runs on a router node - > cache will complain and fall back to usual routing. > That can be easily fixed. Right; the one thing I saw was in _ucast_cache_get_starting_osm_sw where routers were not supported. I think a one-line change is all that's needed there. Not sure if there are other places. > However, I'm not sure how the OpenSM will behave in general when running > from switch or router - I've never tried it. Has anyone tried it? I'm not sure either but would be interested to hear.
I think there are some using it on switch port 0 and also think others have tried it on routers. In terms of switches, it used to work and there is some support in the vendor directory for this. -- Hal > -- Yevgeny > > > -- Hal > > > > > From rdreier at cisco.com Mon May 5 15:54:49 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 May 2008 15:54:49 -0700 Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event In-Reply-To: <4819BD29.7080002@Voltaire.COM> (Moni Shoua's message of "Thu, 01 May 2008 15:52:57 +0300") References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM> Message-ID: > For consumers, the patch doesn't make things worse. Before the patch > mads are sent to the wrong SM and now they are blocked before > they are sent. Consumers can be improved if they examine the return > code and respond to EAGAIN properly but even without an improvement > the situation is not getting worse and in some cases it gets better. I guess I can believe things don't get worse but I still don't know how this makes things better. With the current code the request is lost because it goes to the wrong SM; with the new code the request is failed by the SA layer. So in both cases the consumer just has to try again. So is there some practical benefit we see by adding this code? - R. From rdreier at cisco.com Mon May 5 15:55:33 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 May 2008 15:55:33 -0700 Subject: [ofa-general] Re: [PATCH] mlx4_core: Support creation of FMRs with pages smaller than 4K In-Reply-To: <200805051820.49796.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 5 May 2008 18:20:49 +0300") References: <200805051820.49796.jackm@dev.mellanox.co.il> Message-ID: > mlx4_core: Support creation of FMRs with pages smaller than 4K never mind, the subject makes sense now. applied. From rdreier at cisco.com Mon May 5 16:00:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 May 2008 16:00:18 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get fixes for: - mlx4 breakage introduced (by me) with the CQ resize support - mlx4 breakage with new FW that supports smaller adapter pages - IPoIB breakage introduced with the separate send CQ support - longstanding cxgb3 breakage uncovered by stress testing - ehca minor messiness Eli Cohen (1): IB/ipoib: Fix transmit queue stalling forever Oren Duer (1): mlx4_core: Support creation of FMRs with pages smaller than 4K Roland Dreier (1): IB/mlx4: Fix off-by-one errors in calls to mlx4_ib_free_cq_buf() Stefan Roscher (1): IB/ehca: Fix function return types Steve Wise (3): RDMA/cxgb3: QP flush fixes RDMA/cxgb3: Silently ignore close reply after abort. RDMA/cxgb3: Bump up the MPA connection setup timeout.
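The IPoIB commit in this pull ("IB/ipoib: Fix transmit queue stalling forever") uses the common "stop the queue, arm the send CQ, wake on completion" pattern. A minimal sketch of the pattern, with hypothetical names (drv_priv, drv_xmit, drv_send_comp_handler, poll_tx and sendq_size are placeholders, not the actual IPoIB code; the real hunks are in ipoib_ib.c below):

	static int drv_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		struct drv_priv *priv = netdev_priv(dev);

		if (++priv->tx_outstanding == priv->sendq_size) {
			/*
			 * The ring just became full: request a completion
			 * event before stopping the queue, so a completion
			 * racing with the stop still generates a wakeup.
			 */
			if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP))
				printk(KERN_WARNING "%s: request notify on send CQ failed\n",
				       dev->name);
			netif_stop_queue(dev);
		}

		/* ... post the send WR here; undo the accounting on failure ... */

		return 0;
	}

	/* completion handler registered with ib_create_cq() for the send CQ */
	static void drv_send_comp_handler(struct ib_cq *cq, void *dev_ptr)
	{
		struct net_device *dev = dev_ptr;
		struct drv_priv *priv = netdev_priv(dev);

		while (poll_tx(priv))	/* reap completed sends */
			; /* nothing */

		if (netif_queue_stopped(dev))
			netif_wake_queue(dev);	/* ring drained - restart TX */
	}

The actual fix additionally arms a one-jiffy timer when the queue is still stopped after draining, as a backstop against a lost completion event.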
drivers/infiniband/hw/cxgb3/cxio_hal.c | 13 ++++++-- drivers/infiniband/hw/cxgb3/cxio_hal.h | 4 +- drivers/infiniband/hw/cxgb3/iwch_cm.c | 6 ++-- drivers/infiniband/hw/cxgb3/iwch_qp.c | 13 +++++--- drivers/infiniband/hw/ehca/ehca_hca.c | 7 ++-- drivers/infiniband/hw/mlx4/cq.c | 4 +- drivers/infiniband/ulp/ipoib/ipoib.h | 2 + drivers/infiniband/ulp/ipoib/ipoib_ib.c | 47 +++++++++++++++++++++++++--- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 3 +- drivers/net/mlx4/mr.c | 2 +- 10 files changed, 75 insertions(+), 26 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index ed2ee4b..5fd8506 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -359,9 +359,10 @@ static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq) cq->sw_wptr++; } -void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) +int cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) { u32 ptr; + int flushed = 0; PDBG("%s wq %p cq %p\n", __func__, wq, cq); @@ -369,8 +370,11 @@ void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) PDBG("%s rq_rptr %u rq_wptr %u skip count %u\n", __func__, wq->rq_rptr, wq->rq_wptr, count); ptr = wq->rq_rptr + count; - while (ptr++ != wq->rq_wptr) + while (ptr++ != wq->rq_wptr) { insert_recv_cqe(wq, cq); + flushed++; + } + return flushed; } static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, @@ -394,9 +398,10 @@ static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, cq->sw_wptr++; } -void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) +int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) { __u32 ptr; + int flushed = 0; struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2); ptr = wq->sq_rptr + count; @@ -405,7 +410,9 @@ void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) insert_sq_cqe(wq, cq, sqp); sqp++; ptr++; + flushed++; } + return flushed; } /* diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h index 2bcff7f..69ab08e 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.h +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h @@ -173,8 +173,8 @@ u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp); void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid); int __init cxio_hal_init(void); void __exit cxio_hal_exit(void); -void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count); -void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count); +int cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count); +int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count); void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count); void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count); void cxio_flush_hw_cq(struct t3_cq *cq); diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index d44a6df..c325c44 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -67,10 +67,10 @@ int peer2peer = 0; module_param(peer2peer, int, 0644); MODULE_PARM_DESC(peer2peer, "Support peer2peer ULPs (default=0)"); -static int ep_timeout_secs = 10; +static int ep_timeout_secs = 60; module_param(ep_timeout_secs, int, 0644); MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout " - "in seconds (default=10)"); + "in seconds (default=60)"); static int mpa_rev = 1; module_param(mpa_rev, int, 0644); @@ -1650,8 +1650,8 @@ static int close_con_rpl(struct 
t3cdev *tdev, struct sk_buff *skb, void *ctx) release = 1; break; case ABORTING: - break; case DEAD: + break; default: BUG_ON(1); break; diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 9b4be88..79dbe5b 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -655,6 +655,7 @@ static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag) { struct iwch_cq *rchp, *schp; int count; + int flushed; rchp = get_chp(qhp->rhp, qhp->attr.rcq); schp = get_chp(qhp->rhp, qhp->attr.scq); @@ -669,20 +670,22 @@ static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag) spin_lock(&qhp->lock); cxio_flush_hw_cq(&rchp->cq); cxio_count_rcqes(&rchp->cq, &qhp->wq, &count); - cxio_flush_rq(&qhp->wq, &rchp->cq, count); + flushed = cxio_flush_rq(&qhp->wq, &rchp->cq, count); spin_unlock(&qhp->lock); spin_unlock_irqrestore(&rchp->lock, *flag); - (*rchp->ibcq.comp_handler)(&rchp->ibcq, rchp->ibcq.cq_context); + if (flushed) + (*rchp->ibcq.comp_handler)(&rchp->ibcq, rchp->ibcq.cq_context); /* locking heirarchy: cq lock first, then qp lock. */ spin_lock_irqsave(&schp->lock, *flag); spin_lock(&qhp->lock); cxio_flush_hw_cq(&schp->cq); cxio_count_scqes(&schp->cq, &qhp->wq, &count); - cxio_flush_sq(&qhp->wq, &schp->cq, count); + flushed = cxio_flush_sq(&qhp->wq, &schp->cq, count); spin_unlock(&qhp->lock); spin_unlock_irqrestore(&schp->lock, *flag); - (*schp->ibcq.comp_handler)(&schp->ibcq, schp->ibcq.cq_context); + if (flushed) + (*schp->ibcq.comp_handler)(&schp->ibcq, schp->ibcq.cq_context); /* deref */ if (atomic_dec_and_test(&qhp->refcnt)) @@ -880,7 +883,6 @@ int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, ep = qhp->ep; get_ep(&ep->com); } - flush_qp(qhp, &flag); break; case IWCH_QP_STATE_TERMINATE: qhp->attr.state = IWCH_QP_STATE_TERMINATE; @@ -911,6 +913,7 @@ int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, } switch (attrs->next_state) { case IWCH_QP_STATE_IDLE: + flush_qp(qhp, &flag); qhp->attr.state = IWCH_QP_STATE_IDLE; qhp->attr.llp_stream_handle = NULL; put_ep(&qhp->ep->com); diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index 2515cbd..bc3b37d 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -101,7 +101,6 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) props->max_ee = limit_uint(rblock->max_rd_ee_context); props->max_rdd = limit_uint(rblock->max_rd_domain); props->max_fmr = limit_uint(rblock->max_mr); - props->local_ca_ack_delay = limit_uint(rblock->local_ca_ack_delay); props->max_qp_rd_atom = limit_uint(rblock->max_rr_qp); props->max_ee_rd_atom = limit_uint(rblock->max_rr_ee_context); props->max_res_rd_atom = limit_uint(rblock->max_rr_hca); @@ -115,7 +114,7 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) } props->max_pkeys = 16; - props->local_ca_ack_delay = limit_uint(rblock->local_ca_ack_delay); + props->local_ca_ack_delay = min_t(u8, rblock->local_ca_ack_delay, 255); props->max_raw_ipv6_qp = limit_uint(rblock->max_raw_ipv6_qp); props->max_raw_ethy_qp = limit_uint(rblock->max_raw_ethy_qp); props->max_mcast_grp = limit_uint(rblock->max_mcast_grp); @@ -136,7 +135,7 @@ query_device1: return ret; } -static int map_mtu(struct ehca_shca *shca, u32 fw_mtu) +static enum ib_mtu map_mtu(struct ehca_shca *shca, u32 fw_mtu) { switch (fw_mtu) { case 0x1: @@ -156,7 +155,7 @@ static int map_mtu(struct ehca_shca *shca, u32 fw_mtu) } } -static int 
map_number_of_vls(struct ehca_shca *shca, u32 vl_cap) +static u8 map_number_of_vls(struct ehca_shca *shca, u32 vl_cap) { switch (vl_cap) { case 0x1: diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 2f199c5..4521319 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -246,7 +246,7 @@ err_mtt: if (context) ib_umem_release(cq->umem); else - mlx4_ib_free_cq_buf(dev, &cq->buf, entries); + mlx4_ib_free_cq_buf(dev, &cq->buf, cq->ibcq.cqe); err_db: if (!context) @@ -434,7 +434,7 @@ int mlx4_ib_destroy_cq(struct ib_cq *cq) mlx4_ib_db_unmap_user(to_mucontext(cq->uobject->context), &mcq->db); ib_umem_release(mcq->umem); } else { - mlx4_ib_free_cq_buf(dev, &mcq->buf, cq->cqe + 1); + mlx4_ib_free_cq_buf(dev, &mcq->buf, cq->cqe); mlx4_db_free(dev->dev, &mcq->db); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 9044f88..ca126fc 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -334,6 +334,7 @@ struct ipoib_dev_priv { #endif int hca_caps; struct ipoib_ethtool_st ethtool; + struct timer_list poll_timer; }; struct ipoib_ah { @@ -404,6 +405,7 @@ extern struct workqueue_struct *ipoib_workqueue; int ipoib_poll(struct napi_struct *napi, int budget); void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr); +void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr); struct ipoib_ah *ipoib_create_ah(struct net_device *dev, struct ib_pd *pd, struct ib_ah_attr *attr); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 97b815c..f429bce 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -461,6 +461,26 @@ void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr) netif_rx_schedule(dev, &priv->napi); } +static void drain_tx_cq(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned long flags; + + spin_lock_irqsave(&priv->tx_lock, flags); + while (poll_tx(priv)) + ; /* nothing */ + + if (netif_queue_stopped(dev)) + mod_timer(&priv->poll_timer, jiffies + 1); + + spin_unlock_irqrestore(&priv->tx_lock, flags); +} + +void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr) +{ + drain_tx_cq((struct net_device *)dev_ptr); +} + static inline int post_send(struct ipoib_dev_priv *priv, unsigned int wr_id, struct ib_ah *address, u32 qpn, @@ -555,12 +575,22 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, else priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM; + if (++priv->tx_outstanding == ipoib_sendq_size) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP)) + ipoib_warn(priv, "request notify on send CQ failed\n"); + netif_stop_queue(dev); + } + if (unlikely(post_send(priv, priv->tx_head & (ipoib_sendq_size - 1), address->ah, qpn, tx_req, phead, hlen))) { ipoib_warn(priv, "post_send failed\n"); ++dev->stats.tx_errors; + --priv->tx_outstanding; ipoib_dma_unmap_tx(priv->ca, tx_req); dev_kfree_skb_any(skb); + if (netif_queue_stopped(dev)) + netif_wake_queue(dev); } else { dev->trans_start = jiffies; @@ -568,14 +598,11 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, ++priv->tx_head; skb_orphan(skb); - if (++priv->tx_outstanding == ipoib_sendq_size) { - ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); - netif_stop_queue(dev); - } } if (unlikely(priv->tx_outstanding > MAX_SEND_CQE)) - poll_tx(priv); + while (poll_tx(priv)) + ; 
/* nothing */ } static void __ipoib_reap_ah(struct net_device *dev) @@ -609,6 +636,11 @@ void ipoib_reap_ah(struct work_struct *work) round_jiffies_relative(HZ)); } +static void ipoib_ib_tx_timer_func(unsigned long ctx) +{ + drain_tx_cq((struct net_device *)ctx); +} + int ipoib_ib_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -645,6 +677,10 @@ int ipoib_ib_dev_open(struct net_device *dev) queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, round_jiffies_relative(HZ)); + init_timer(&priv->poll_timer); + priv->poll_timer.function = ipoib_ib_tx_timer_func; + priv->poll_timer.data = (unsigned long)dev; + set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags); return 0; @@ -810,6 +846,7 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush) ipoib_dbg(priv, "All sends and receives done.\n"); timeout: + del_timer_sync(&priv->poll_timer); qp_attr.qp_state = IB_QPS_RESET; if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE)) ipoib_warn(priv, "Failed to modify QP to RESET state\n"); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index c1e7ece..8766d29 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -187,7 +187,8 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca) goto out_free_mr; } - priv->send_cq = ib_create_cq(priv->ca, NULL, NULL, dev, ipoib_sendq_size, 0); + priv->send_cq = ib_create_cq(priv->ca, ipoib_send_comp_handler, NULL, + dev, ipoib_sendq_size, 0); if (IS_ERR(priv->send_cq)) { printk(KERN_WARNING "%s: failed to create send CQ\n", ca->name); goto out_free_recv_cq; diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c index cb46446..03a9abc 100644 --- a/drivers/net/mlx4/mr.c +++ b/drivers/net/mlx4/mr.c @@ -551,7 +551,7 @@ int mlx4_fmr_alloc(struct mlx4_dev *dev, u32 pd, u32 access, int max_pages, u64 mtt_seg; int err = -ENOMEM; - if (page_shift < 12 || page_shift >= 32) + if (page_shift < (ffs(dev->caps.page_size_cap) - 1) || page_shift >= 32) return -EINVAL; /* All MTTs must fit in the same page */ From kliteyn at dev.mellanox.co.il Mon May 5 21:33:32 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 06 May 2008 07:33:32 +0300 Subject: [ofa-general] Re: [PATCH 1/4] opensm/osm_ucast_cache.{c, h}: ucast routing cache implementation In-Reply-To: <1210026044.27137.121.camel@hrosenstock-ws.xsigo.com> References: <481D8905.1010207@dev.mellanox.co.il> <1210021961.27137.90.camel@hrosenstock-ws.xsigo.com> <481F82F5.8030505@dev.mellanox.co.il> <1210026044.27137.121.camel@hrosenstock-ws.xsigo.com> Message-ID: <481FDF9C.9000108@dev.mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > On Tue, 2008-05-06 at 00:58 +0300, Yevgeny Kliteynik wrote: >> Hal Rosenstock wrote: >>> Hi Yevgeny, >>> >>> On Sun, 2008-05-04 at 12:59 +0300, Yevgeny Kliteynik wrote: >>> >>> I haven't yet had a chance to review this in detail but think that >>> router ports need to be accomodated in the subnet (I think this is a >>> firm requirement as router ports on the subnet are already supported) >>> and also think that nothing should be introduced precluding the running >>> of OpenSM on a router port. From the latter standpoint, it looks much >>> like a CA port. >> This is exactly how I implemented it - any non-switch port is >> treated as CA, which is just a target LID. 
>> Well, I mean I intended to implement it that way - I reviewed it again, >> and it appears that the cache is fine with routing to routers and running >> from switches, > > then it's just the variable names which indicate ca if routers are > grouped with cas. > >> but there will be a problem when the SM runs on a router node - >> the cache will complain and fall back to the usual routing. >> That can be easily fixed. > > Right; the one thing I saw was in _ucast_cache_get_starting_osm_sw where > routers were not supported. I think a one-line change is all that's > needed there. Not sure if there are other places. Right, that's the place I was talking about. AFAIK, no other places. -- Yevgeny >> However, I'm not sure how OpenSM will behave in general when running >> from a switch or router - I've never tried it. Has anyone tried it? > > I'm not sure either but would be interested to hear. I think there are > some using it on switch port 0 and also think others have tried it on > routers. In terms of switches, it used to work and there is some support > in the vendor directory for this. > > -- Hal > >> -- Yevgeny >> >>> -- Hal >>> >>> > > From ogerlitz at voltaire.com Tue May 6 00:11:06 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 06 May 2008 10:11:06 +0300 Subject: [ofa-general] Re: Using RDMA CM with MPI In-Reply-To: <829ded920805050529k19dee08o3ad19ce9d07bd2ca@mail.gmail.com> References: <829ded920805050401v16961f10y4cae507d58c74cfe@mail.gmail.com> <829ded920805050529k19dee08o3ad19ce9d07bd2ca@mail.gmail.com> Message-ID: <4820048A.1000706@voltaire.com> Keshetti Mahesh wrote: > I want to use the RDMA CM option of MVAPICH2. The procedure described > in the user guide is not very informative. ... Also, I'll be glad if someone can give > me a document describing how it works in detail. try looking at the rdma_cm(7) man page installed by librdmacm > Actually I have some doubts like, how the IP addresses (???) are > resolved into IB addresses and what happens in the case of nodes with two HCAs > (or 1 HCA with two ports)? Each port you want to use for your job (HCA with two ports, two HCAs, etc) has to have an IP address associated with it. This IP address (or a host name which is translated to it through DNS lookup) is probably what you want the rank to advertise. RDMA address resolution uses route lookup and ARP to learn the local and remote GID. > > In the MVAPICH2 user guide it is mentioned that "RDMA CM device needs > to be setup, configured with an IP address and connected to the network". > Is this the same as configuring an IPoIB device? for IB, yes. Or.
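To make the above concrete, here is a minimal sketch of the two resolution steps with librdmacm (an illustration only, not MVAPICH2 code: resolve() is a made-up helper, error handling is omitted, and the destination IP is assumed to be configured on an IPoIB interface):

	#include <stdio.h>
	#include <string.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>
	#include <rdma/rdma_cma.h>

	int resolve(const char *ip)
	{
		struct rdma_event_channel *ch = rdma_create_event_channel();
		struct rdma_cm_id *id;
		struct rdma_cm_event *ev;
		struct sockaddr_in dst;

		memset(&dst, 0, sizeof dst);
		dst.sin_family = AF_INET;
		inet_pton(AF_INET, ip, &dst.sin_addr);

		rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);

		/* route lookup + ARP bind the IP to local and remote GIDs */
		rdma_resolve_addr(id, NULL, (struct sockaddr *) &dst, 2000);
		rdma_get_cm_event(ch, &ev);	/* expect RDMA_CM_EVENT_ADDR_RESOLVED */
		rdma_ack_cm_event(ev);

		/* for IB, this step queries the SA for a path record */
		rdma_resolve_route(id, 2000);
		rdma_get_cm_event(ch, &ev);	/* expect RDMA_CM_EVENT_ROUTE_RESOLVED */
		rdma_ack_cm_event(ev);

		printf("resolved via device %s\n", id->verbs->device->name);
		rdma_destroy_id(id);
		rdma_destroy_event_channel(ch);
		return 0;
	}

Note that which HCA port ends up being used falls out of the destination (and optional source) IP address, which is how the two-HCA / two-port case is disambiguated.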
From keshetti85-student at yahoo.co.in Tue May 6 02:21:08 2008 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Tue, 6 May 2008 14:51:08 +0530 Subject: [ofa-general] Re: [mvapich-discuss] Using RDMA CM with MVAPICH2 In-Reply-To: <829ded920805050529k19dee08o3ad19ce9d07bd2ca@mail.gmail.com> References: <829ded920805050401v16961f10y4cae507d58c74cfe@mail.gmail.com> <829ded920805050529k19dee08o3ad19ce9d07bd2ca@mail.gmail.com> Message-ID: <829ded920805060221lc3b5f77safacf8a8ad299b33@mail.gmail.com> I've a couple more questions to ask you. Below are the steps mentioned in the MVAPICH user guide for running an MPI application with RDMA CM support. • Setup the RDMA CM device: RDMA CM device needs to be setup, configured with an IP address and connected to the network. I have two machines (n0 and n1) connected with one ethernet interface and two IB interfaces in each. And /etc/hosts on both machines is like below. 192.168.3.1 n0 192.168.3.2 n1 172.131.15.1 n0_ib0 172.131.15.2 n0_ib1 172.131.15.3 n1_ib0 172.131.15.4 n1_ib1 Now, if I want to run an MPI job on both of the nodes, what should I mention in the 'hostfile' given to MPI ("n0, n1" or "n0_ib0, n1_ib0 ... ")? • Setup the Local Address File: Create the file (/etc/mv2.conf) with the local IP address to be used by RDMA CM. $ echo 10.1.1.1 >> /etc/mv2.conf Why is this file (/etc/mv2.conf) required? Is it required to be present on all nodes? -Mahesh From vlad at dev.mellanox.co.il Tue May 6 03:49:47 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 06 May 2008 13:49:47 +0300 Subject: [ofa-general] OFED 1.3.1 RC1 release is available Message-ID: <482037CB.7020903@dev.mellanox.co.il> Hi, OFED 1.3.1 RC1 release is available on http://www.openfabrics.org/downloads/OFED/ofed-1.3.1/OFED-1.3.1-rc1.tgz To get the BUILD_ID run ofed_info Please report any issues in Bugzilla https://bugs.openfabrics.org/ The RC2 release is expected on May 20 Release information: -------------------- Linux Operating Systems: - RedHat EL4 up4: 2.6.9-42.ELsmp - RedHat EL4 up5: 2.6.9-55.ELsmp - RedHat EL4 up6: 2.6.9-67.ELsmp - RedHat EL5: 2.6.18-8.el5 - RedHat EL5 up1: 2.6.18-53.el5 - Fedora C6: 2.6.18-8.fc6 - SLES10: 2.6.16.21-0.8-smp - SLES10 SP1: 2.6.16.46-0.12-smp - SLES10 SP1 up1: 2.6.16.53-0.16-smp - OpenSuSE 10.3: 2.6.22-*-* - kernel.org: 2.6.23 and 2.6.24 Systems: * x86_64 * x86 * ia64 * ppc64 Main Changes from OFED-1.3: * MPI packages update: * mvapich-1.0.1-2434 * mvapich2-1.0.3-1 * openmpi-1.2.6-1 * Updated libraries: * dapl-v1 1.2.6 * dapl-v2 2.0.8 * libcxgb3 1.2.0 * librdmacm 1.0.7 * ULPs changes: * IB Bonding: ib-bonding-0.9.0-24 * IPoIB bug fixes * RDS fixes for RDMA API * SRP failover * Updated low level drivers: * nes * mlx4 * cxgb3 * ehca Note: In the attached tgz file you can find the git-log of all changes. Vlad & Tziporet -------------- next part -------------- A non-text attachment was scrubbed... Name: ofed-1.3-1.3.1-rc1.diff.tgz Type: application/octet-stream Size: 20514 bytes Desc: not available URL: From hrosenstock at xsigo.com Tue May 6 06:16:55 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 06 May 2008 06:16:55 -0700 Subject: [ofa-general] [PATCH] ofed/docs: fixing wrong syntax in the example in QoS_management_in_OpenSM.txt In-Reply-To: <47E8C032.2050907@dev.mellanox.co.il> References: <47E8C032.2050907@dev.mellanox.co.il> Message-ID: <1210079815.27137.190.camel@hrosenstock-ws.xsigo.com> Hi Yevgeny, On Tue, 2008-03-25 at 11:04 +0200, Yevgeny Kliteynik wrote: > QoS_management_in_OpenSM.txt Shouldn't this doc also be available in the OpenSM git tree (in management/opensm/doc) and distributed as part of OpenSM? If so, I can issue a patch for this. Thanks.
-- Hal From kliteyn at dev.mellanox.co.il Tue May 6 06:40:56 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 06 May 2008 16:40:56 +0300 Subject: [ofa-general] [PATCH] ofed/docs: fixing wrong syntax in the example in QoS_management_in_OpenSM.txt In-Reply-To: <1210079815.27137.190.camel@hrosenstock-ws.xsigo.com> References: <47E8C032.2050907@dev.mellanox.co.il> <1210079815.27137.190.camel@hrosenstock-ws.xsigo.com> Message-ID: <48205FE8.3070104@dev.mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > Hi Yevgeny, > > On Tue, 2008-03-25 at 11:04 +0200, Yevgeny Kliteynik wrote: >> QoS_management_in_OpenSM.txt > > Shouldn't this doc also be available in the OpenSM git tree (in > management/opensm/doc) and distributed as part of OpenSM? > > If so, I can issue a patch for this. I think that it's a good idea (which reminds me that I still haven't fixed the patch for QoS stuff in the OpenSM man page...) -- Yevgeny > Thanks. > > -- Hal > > From hrosenstock at xsigo.com Tue May 6 06:44:12 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 06 May 2008 06:44:12 -0700 Subject: [ofa-general] [PATCH] ofed/docs: fixing wrong syntax in the example in QoS_management_in_OpenSM.txt In-Reply-To: <48205FE8.3070104@dev.mellanox.co.il> References: <47E8C032.2050907@dev.mellanox.co.il> <1210079815.27137.190.camel@hrosenstock-ws.xsigo.com> <48205FE8.3070104@dev.mellanox.co.il> Message-ID: <1210081452.2026.37.camel@hrosenstock-ws.xsigo.com> Hi Yevgeny, On Tue, 2008-05-06 at 16:40 +0300, Yevgeny Kliteynik wrote: > Hi Hal, > > Hal Rosenstock wrote: > > Hi Yevgeny, > > > > On Tue, 2008-03-25 at 11:04 +0200, Yevgeny Kliteynik wrote: > >> QoS_management_in_OpenSM.txt > > > > Shouldn't this doc also be available in the OpenSM git tree (in > > management/opensm/doc) and distributed as part of OpenSM? > > > > If so, I can issue a patch for this. > > I think that it's a good idea (which reminds me that I still > haven't fixed the patch for QoS stuff in the OpenSM man page...) Yes, that would be nice too :-) That's going to make the OpenSM man page huge. Not sure how that should be dealt with. Maybe a simple way out in the short term might be to just reference that doc in the man page. -- Hal > -- Yevgeny > > > Thanks. > > > > -- Hal From narravul at cse.ohio-state.edu Tue May 6 06:50:48 2008 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Tue, 6 May 2008 09:50:48 -0400 (EDT) Subject: [ofa-general] Re: [mvapich-discuss] Using RDMA CM with MVAPICH2 In-Reply-To: <829ded920805060221lc3b5f77safacf8a8ad299b33@mail.gmail.com> Message-ID: Hi Mahesh, Thanks for trying out our RDMA CM support. My answers are inline. > I've a couple more questions to ask you. > > Below are the steps mentioned in the MVAPICH user guide for > running an MPI application with RDMA CM support. > > • Setup the RDMA CM device: RDMA CM device needs to be > setup, configured with an IP address and connected to the network. > > I have two machines (n0 and n1) connected with one ethernet interface > and two IB interfaces in each. And /etc/hosts on both machines is like > below.
> 192.168.3.1 n0 > 192.168.3.2 n1 > 172.131.15.1 n0_ib0 > 172.131.15.2 n0_ib1 > 172.131.15.3 n1_ib0 > 172.131.15.4 n1_ib1 > > Now, if I want to run an MPI job on both of the nodes, what should I > mention in the 'hostfile' given to MPI ("n0, n1" or "n0_ib0, n1_ib0 ... ")? You can use any one of these pairs in your hostfile. That is, using n0 and n1 should work fine. > • Setup the Local Address File: Create the file (/etc/mv2.conf) with the > local IP address to be used by RDMA CM. > $ echo 10.1.1.1 >> /etc/mv2.conf > > Why is this file (/etc/mv2.conf) required? Is it required to be present on > all nodes? The local rdma-cm device that the mpi library needs to use is specified in the /etc/mv2.conf file. The file needs to be on all the machines. --Sundeep. > > -Mahesh > From suri at baymicrosystems.com Tue May 6 06:56:26 2008 From: suri at baymicrosystems.com (Suresh Shelvapille) Date: Tue, 6 May 2008 09:56:26 -0400 Subject: [ofa-general] IBTA Compliance- Mkey Violations trap In-Reply-To: <48205FE8.3070104@dev.mellanox.co.il> References: <47E8C032.2050907@dev.mellanox.co.il> <1210079815.27137.190.camel@hrosenstock-ws.xsigo.com> <48205FE8.3070104@dev.mellanox.co.il> Message-ID: <04d301c8af80$fd1a19c0$3414a8c0@md.baymicrosystems.com> Folks: WRT sending traps on Mkey violations, the spec is a little ambiguous IMO. In section 14.2.4.2, is C-14-18 saying to send a trap the first time the Mkey violation happens and the Lease Expiry timer is started, or to send a trap (possibly multiple of them) every time a Mkey violation happens even though the lease timer may have been started already? What is the consensus in this group? Many thanks, Suri From monis at Voltaire.COM Tue May 6 06:56:30 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 06 May 2008 16:56:30 +0300 Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event In-Reply-To: References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM> Message-ID: <4820638E.4030901@Voltaire.COM> > I guess I can believe things don't get worse but I still don't know how > this makes things better. With the current code the request is lost > because it goes to the wrong SM; with the new code the request is failed > by the SA layer. So in both cases the consumer just has to try again. > > So is there some practical benefit we see by adding this code? > > - R. In general I see the benefit in faster detection of a wrong SM ah (address handle). Before the patch, consumers needed to wait for a timeout before the detection; after the patch, it happens immediately on return from the function. This improves the performance of an SM failover scenario. Some applications may get the benefit above only if they handle the new return code (EAGAIN) specifically, but this patch opens the door for such improvement. thanks MoniS From monis at Voltaire.COM Tue May 6 07:09:20 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 06 May 2008 17:09:20 +0300 Subject: [ofa-general] [PATCH] IB/IPoIB: Separate IB events to groups and handle each according to level of severity Message-ID: <48206690.3090604@Voltaire.COM> The purpose of this patch is to make the events that are related to SM change (namely the CLIENT_REREGISTER event and the SM_CHANGE event) less disruptive. When SM related events are handled, it is not necessary to flush unicast info from the device; only the multicast info needs to be refreshed. This patch divides the events that are handled by IPoIB into three categories: 0, 1 and 2 (where 2 does more than 1 and 1 does more than 0). The main change is in __ipoib_ib_dev_flush(). Instead of passing a pkey_event flag to the function, we now use levels. An event that requires "harder" flushing calls this function with a higher number for level. Besides the concept, the actual change is that SM related events no longer flush unicast info or bring the device down; they only refresh the multicast info in the background. Signed-off-by: Moni Levy Signed-off-by: Moni Shoua
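In sketch form, the event dispatch this patch arrives at (a switch-style paraphrase of the ipoib_event() hunk at the end of the diff, not the literal code):

	switch (record->event) {
	case IB_EVENT_SM_CHANGE:
	case IB_EVENT_CLIENT_REREGISTER:
		/* level 0: only refresh multicast info */
		queue_work(ipoib_workqueue, &priv->flush_task0);
		break;
	case IB_EVENT_PORT_ERR:
	case IB_EVENT_PORT_ACTIVE:
	case IB_EVENT_LID_CHANGE:
		/* level 1: also flush unicast info (device down/up) */
		queue_work(ipoib_workqueue, &priv->flush_task1);
		break;
	case IB_EVENT_PKEY_CHANGE:
		/* level 2: also re-check the pkey index and restart the QP */
		queue_work(ipoib_workqueue, &priv->flush_task2);
		break;
	default:
		break;
	}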
--- drivers/infiniband/ulp/ipoib/ipoib.h | 9 ++++--- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 37 ++++++++++++++++++----------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 ++- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 19 +++++++------- 4 files changed, 43 insertions(+), 27 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 054fab8..e1e91d3 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -268,10 +268,11 @@ struct ipoib_dev_priv { struct delayed_work pkey_poll_task; struct delayed_work mcast_task; - struct work_struct flush_task; + struct work_struct flush_task0; + struct work_struct flush_task1; + struct work_struct flush_task2; struct work_struct restart_task; struct delayed_work ah_reap_task; - struct work_struct pkey_event_task; struct ib_device *ca; u8 port; @@ -401,7 +402,9 @@ void ipoib_flush_paths(struct net_device struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); -void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_ib_dev_flush0(struct work_struct *work); +void ipoib_ib_dev_flush1(struct work_struct *work); +void ipoib_ib_dev_flush2(struct work_struct *work); void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 08c4396..54fee47 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -749,12 +749,14 @@ int ipoib_ib_dev_init(struct net_device return 0; } -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) { struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; u16 new_index; + ipoib_dbg(priv, "Try flushing level %d\n", level); + mutex_lock(&priv->vlan_mutex); /* @@ -762,7 +764,7 @@ static void __ipoib_ib_dev_flush(struct * the parent is down.
*/ list_for_each_entry(cpriv, &priv->child_intfs, list) - __ipoib_ib_dev_flush(cpriv, pkey_event); + __ipoib_ib_dev_flush(cpriv, level); mutex_unlock(&priv->vlan_mutex); @@ -776,7 +778,7 @@ static void __ipoib_ib_dev_flush(struct return; } - if (pkey_event) { + if (level == 2) { if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ipoib_ib_dev_down(dev, 0); @@ -794,11 +796,13 @@ static void __ipoib_ib_dev_flush(struct priv->pkey_index = new_index; } - ipoib_dbg(priv, "flushing\n"); - ipoib_ib_dev_down(dev, 0); + ipoib_mcast_dev_flush(dev); + + if (level >= 1) + ipoib_ib_dev_down(dev, 0); - if (pkey_event) { + if (level >= 2) { ipoib_ib_dev_stop(dev, 0); ipoib_ib_dev_open(dev); } @@ -808,29 +812,36 @@ static void __ipoib_ib_dev_flush(struct * we get here, don't bring it back up if it's not configured up */ if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { - ipoib_ib_dev_up(dev); + if (level >= 1) + ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } } -void ipoib_ib_dev_flush(struct work_struct *work) +void ipoib_ib_dev_flush0(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + container_of(work, struct ipoib_dev_priv, flush_task0); - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); __ipoib_ib_dev_flush(priv, 0); } -void ipoib_pkey_event(struct work_struct *work) +void ipoib_ib_dev_flush1(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_event_task); + container_of(work, struct ipoib_dev_priv, flush_task1); - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name); __ipoib_ib_dev_flush(priv, 1); } +void ipoib_ib_dev_flush2(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task2); + + __ipoib_ib_dev_flush(priv, 2); +} + void ipoib_ib_dev_cleanup(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 5728204..54f046a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -992,9 +992,10 @@ static void ipoib_setup(struct net_devic INIT_LIST_HEAD(&priv->multicast_list); INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index a3aeb91..83d9c6d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -259,15 +259,16 @@ void ipoib_event(struct ib_event_handler if (record->element.port_num != priv->port) return; - if (record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PORT_ACTIVE || - record->event == IB_EVENT_LID_CHANGE || - record->event == IB_EVENT_SM_CHANGE || - record->event == IB_EVENT_CLIENT_REREGISTER) { - ipoib_dbg(priv, "Port state change event\n"); - queue_work(ipoib_workqueue, &priv->flush_task); + ipoib_dbg(priv, "Event %d on device 
%s port %d\n",record->event, + record->device->name, record->element.port_num); + if ( record->event == IB_EVENT_SM_CHANGE || + record->event == IB_EVENT_CLIENT_REREGISTER) { + queue_work(ipoib_workqueue, &priv->flush_task0); + } else if (record->event == IB_EVENT_PORT_ERR || + record->event == IB_EVENT_PORT_ACTIVE || + record->event == IB_EVENT_LID_CHANGE) { + queue_work(ipoib_workqueue, &priv->flush_task1); } else if (record->event == IB_EVENT_PKEY_CHANGE) { - ipoib_dbg(priv, "P_Key change event on port:%d\n", priv->port); - queue_work(ipoib_workqueue, &priv->pkey_event_task); + queue_work(ipoib_workqueue, &priv->flush_task2); } } From pawel.dziekonski at wcss.pl Tue May 6 07:10:40 2008 From: pawel.dziekonski at wcss.pl (Pawel Dziekonski) Date: Tue, 6 May 2008 16:10:40 +0200 Subject: [ofa-general] getting network statistics In-Reply-To: <1203424196.16145.1.camel@mtls03> References: <1203424196.16145.1.camel@mtls03> Message-ID: <20080506141039.GJ6586@cefeid.wcss.wroc.pl> you mean port_rcv_data and port_xmit_data ? if so, then I have 2 jobs that are definitelly using IB network, but those files almost do not change. :o OFED 1.2.5.5 and kernel 2.6.9-55.0.12.ELsmp root at wn111:/sys/class/infiniband/mthca0/ports/1/counters # ls -al total 0 drwxr-xr-x 2 root root 0 May 6 15:45 ./ drwxr-xr-x 5 root root 0 May 6 15:45 ../ -r--r--r-- 1 root root 4096 May 6 15:45 VL15_dropped -r--r--r-- 1 root root 4096 May 6 15:45 excessive_buffer_overrun_errors -r--r--r-- 1 root root 4096 May 6 15:45 link_downed -r--r--r-- 1 root root 4096 May 6 15:45 link_error_recovery -r--r--r-- 1 root root 4096 May 6 15:45 local_link_integrity_errors -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_constraint_errors -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_data -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_errors -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_packets -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_remote_physical_errors -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_switch_relay_errors -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_constraint_errors -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_data -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_discards -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_packets -r--r--r-- 1 root root 4096 May 6 15:45 symbol_error On Tue, 19 Feb 2008 at 02:29:56PM +0200, Eli Cohen wrote: > cat /sys/class/infiniband/mlx4_0/ports/1/counters/* > > mlx4_* can be mthca* > > On Tue, 2008-02-19 at 11:03 +0200, David Minor wrote: > > Under Linux with Mellanox ofed, how can I get real-time network > > statistics. e.g. how many bytes are being sent and received over each > > port at any given time? -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. 
From hrosenstock at xsigo.com Tue May 6 07:33:26 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 06 May 2008 07:33:26 -0700 Subject: [ofa-general] [PATCH] OpenSM: Add QoS_management_in_OpenSM.txt to opensm/doc directory Message-ID: <1210084406.2026.48.camel@hrosenstock-ws.xsigo.com> Add Yevgeny's QoS_management_in_OpenSM.txt to opensm/doc directory Signed-off-by: Hal Rosenstock --- /dev/null 2008-03-17 00:34:45.630902751 -0700 +++ opensm/doc/QoS_management_in_OpenSM.txt 2008-04-01 08:29:04.625737000 -0700 @@ -0,0 +1,492 @@ + + QoS Management in OpenSM +============================================================================== + Table of contents +============================================================================== + +1. Overview +2. Full QoS Policy File +3. Simplified QoS Policy Definition +4. Policy File Syntax Guidelines +5. Examples of Full Policy File +6. Simplified QoS Policy - Details and Examples +7. SL2VL Mapping and VL Arbitration + + +============================================================================== + 1. Overview +============================================================================== + +When QoS in OpenSM is enabled (-Q or --qos), OpenSM looks for the QoS policy file. +The default name of the OpenSM QoS policy file is +/usr/local/etc/opensm/qos-policy.conf. The default may be changed by using the -Y +or --qos_policy_file option with OpenSM. + +During fabric initialization and at every heavy sweep, OpenSM parses the QoS +policy file, applies its settings to the discovered fabric elements, and +enforces the provided policy on client requests. The overall flow for such +requests is: + - The request is matched against the defined matching rules such that the + QoS Level definition is found. + - Given the QoS Level, path(s) search is performed with the given + restrictions imposed by that level. + +There are two ways to define QoS policy: + - Full policy, where the policy file syntax provides an administrator + various ways to match PathRecord/MultiPathRecord (PR/MPR) requests and + enforce various QoS constraints on the requested PR/MPR + - Simplified QoS policy definition, where an administrator would be able to + match PR/MPR requests by various ULPs and applications running on top of + these ULPs. + +While the full policy syntax is very flexible, in many cases the simplified +policy definition would be sufficient. + + +============================================================================== + 2. Full QoS Policy File +============================================================================== + +The QoS policy file has the following sections: + +I) Port Groups (denoted by port-groups). +This section defines zero or more port groups that can be referred to later by +matching rules (see below). A port group lists ports by: + - Port GUID + - Port name, which is a combination of NodeDescription and IB port number + - PKey, which means that all the ports in the subnet that belong to a + partition with a given PKey belong to this port group + - Partition name, which means that all the ports in the subnet that belong + to a partition with a given name belong to this port group + - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and + SELF (SM's port). + +II) QoS Setup (denoted by qos-setup). +This section describes how to set up SL2VL and VL Arbitration tables on +various nodes in the fabric. +However, this is not supported in OFED 1.3.
+SL2VL and VLArb tables should be configured in the OpenSM options file +(default location - /var/cache/opensm/opensm.opts). + +III) QoS Levels (denoted by qos-levels). +Each QoS Level defines Service Level (SL) and a few optional fields: + - MTU limit + - Rate limit + - PKey + - Packet lifetime +When path(s) search is performed, it is done with regard to the restrictions that +these QoS Level parameters impose. +One QoS level that is mandatory to define is a DEFAULT QoS level. It is +applied to a PR/MPR query that does not match any existing match rule. +Similar to any other QoS Level, it can also be explicitly referred to by any +match rule. + +IV) QoS Matching Rules (denoted by qos-match-rules). +Each PathRecord/MultiPathRecord query that OpenSM receives is matched against +the set of matching rules. Rules are scanned in order of appearance in the QoS +policy file such that the first match takes precedence. +Each rule has the name of the QoS level that will be applied to the matching query. +A default QoS level is applied to a query that did not match any rule. +Queries can be matched by: + - Source port group (whether a source port is a member of a specified group) + - Destination port group (same as above, only for destination port) + - PKey + - QoS class + - Service ID +To match a certain matching rule, a PR/MPR query has to match ALL the rule's +criteria. However, not all the fields of the PR/MPR query have to appear in +the matching rule. +For instance, if the rule has a single criterion - Service ID, it will match +any query that has this Service ID, disregarding the rest of the query fields. +However, if a certain query has only Service ID (which means that this is the +only bit in the PR/MPR component mask that is on), it will not match any rule +that has other matching criteria besides Service ID. + + +============================================================================== + 3. Simplified QoS Policy Definition +============================================================================== + +Simplified QoS policy definition comprises a single section denoted by +qos-ulps. Similar to the full QoS policy, it has a list of match rules and +their QoS Level, but in this case a match rule has only one criterion - its +goal is to match a certain ULP (or a certain application on top of this ULP) +PR/MPR request, and QoS Level has only one constraint - Service Level (SL). +The simplified policy section may appear in the policy file in combination with +the full policy, or as a stand-alone policy definition. +See more details and a list of match rule criteria below. + + +============================================================================== + 4. Policy File Syntax Guidelines +============================================================================== + +- Empty lines are ignored. +- Leading and trailing blanks, as well as empty lines, are ignored, so + the indentation in the example is just for better readability. +- Comments are started with the pound sign (#) and terminated by EOL. +- Any keyword should be the first non-blank in the line, unless it's a + comment. +- Keywords that denote section/subsection start have matching closing + keywords. +- Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR + requests that didn't match any of the matching rules. +- Any section/subsection of the policy file is optional. + + +============================================================================== + 5.
Examples of Full Policy File +============================================================================== + +As mentioned earlier, any section of the policy file is optional, and +the only mandatory part of the policy file is a default QoS Level. +Here's an example of the shortest policy file: + + qos-levels + qos-level + name: DEFAULT + sl: 0 + end-qos-level + end-qos-levels + +The port groups section is missing because there are no match rules, which means +that port groups are not referred to anywhere, and there is no need to define +them. And since this policy file doesn't have any matching rules, a PR/MPR query +won't match any rule, and OpenSM will enforce the default QoS level. +Essentially, the above example is equivalent to not having a QoS policy file +at all. + +The following example shows all the possible options and keywords in the +policy file and their syntax: + + # + # See the comments in the following example. + # They explain different keywords and their meaning. + # + port-groups + + port-group # using port GUIDs + name: Storage + # "use" is just a description that is used for logging + # Other than that, it is just a comment + use: SRP Targets + port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA + port-guid: 0x1000000000FFFF + end-port-group + + port-group + name: Virtual Servers + # The syntax of the port name is as follows: + # "node_description/Pnum". + # node_description is compared to the NodeDescription of the node, + # and "Pnum" is a port number on that node. + port-name: vs1 HCA-1/P1, vs2 HCA-1/P1 + end-port-group + + # using partitions defined in the partition policy + port-group + name: Partitions + partition: Part1 + pkey: 0x1234 + end-port-group + + # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM) + # or ALL (for all the nodes in the subnet) + port-group + name: CAs and SM + node-type: CA, SELF + end-port-group + + end-port-groups + + qos-setup + # This section of the policy file describes how to set up SL2VL and VL + # Arbitration tables on various nodes in the fabric. + # However, this is not supported in OFED 1.3 - the section is parsed + # and ignored. SL2VL and VLArb tables should be configured in the + # OpenSM options file (by default - /var/cache/opensm/opensm.opts). + end-qos-setup + + qos-levels + + # Having a QoS Level named "DEFAULT" is a must - it is applied to + # PR/MPR requests that didn't match any of the matching rules. + qos-level + name: DEFAULT + use: default QoS Level + sl: 0 + end-qos-level + + # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime + qos-level + name: WholeSet + sl: 1 + mtu-limit: 4 + rate-limit: 5 + pkey: 0x1234 + packet-life: 8 + end-qos-level + + end-qos-levels + + # Match rules are scanned in order of their appearance in the policy file. + # First matched rule takes precedence.
+ qos-match-rules + + # matching by a single criterion: QoS class + qos-match-rule + use: by QoS class + qos-class: 7-9,11 + # Name of qos-level to apply to the matching PR/MPR + qos-level-name: WholeSet + end-qos-match-rule + + # show matching by destination group and service id + qos-match-rule + use: Storage targets + destination: Storage + service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF + qos-level-name: WholeSet + end-qos-match-rule + + qos-match-rule + source: Storage + use: match by source group only + qos-level-name: DEFAULT + end-qos-match-rule + + qos-match-rule + use: match by all parameters + qos-class: 7-9,11 + source: Virtual Servers + destination: Storage + service-id: 0x0000000000010000-0x000000000001FFFF + pkey: 0x0F00-0x0FFF + qos-level-name: WholeSet + end-qos-match-rule + + end-qos-match-rules + + +============================================================================== + 6. Simplified QoS Policy - Details and Examples +============================================================================== + +Simplified QoS policy match rules are tailored for matching ULPs (or some +application on top of a ULP) PR/MPR requests. This section has a list of +per-ULP (or per-application) match rules and the SL that should be enforced +on the matched PR/MPR query. + +Match rules include: + - Default match rule that is applied to a PR/MPR query that didn't match any + of the other match rules + - SDP + - SDP application with a specific target TCP/IP port range + - SRP with a specific target IB port GUID + - RDS + - iSER + - iSER application with a specific target TCP/IP port range + - IPoIB with a default PKey + - IPoIB with a specific PKey + - any ULP/application with a specific Service ID in the PR/MPR query + - any ULP/application with a specific PKey in the PR/MPR query + - any ULP/application with a specific target IB port GUID in the PR/MPR query + +Since any section of the policy file is optional, as long as the basic rules of +the file are kept (such as not referring to a nonexistent port group, having a +default QoS Level, etc.), the simplified policy section (qos-ulps) can serve +as a complete QoS policy file. +The shortest policy file in this case would be as follows: + + qos-ulps + default : 0 #default SL + end-qos-ulps + +It is equivalent to the previous example of the shortest policy file, and it +is also equivalent to not having a policy file at all.
+ +Below is an example of simplified QoS policy with all the possible keywords: + + qos-ulps + default : 0 # default SL + sdp, port-num 30000 : 0 # SL for application running on top + # of SDP when a destination + # TCP/IP port is 30000 + sdp, port-num 10000-20000 : 0 + sdp : 1 # default SL for any other + # application running on top of SDP + rds : 2 # SL for RDS traffic + iser, port-num 900 : 0 # SL for iSER with a specific target + # port + iser : 3 # default SL for iSER + ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with + # pkey 0x0001 + ipoib : 4 # default IPoIB partition, + # pkey=0x7FFF + any, service-id 0x6234 : 6 # match any PR/MPR query with a + # specific Service ID + any, pkey 0x0ABC : 6 # match any PR/MPR query with a + # specific PKey + srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on + # a specified IB port GUID + any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with + # a specific target port GUID + end-qos-ulps + + +Similar to the full policy definition, matching of PR/MPR queries is done in +order of appearance in the QoS policy file such that the first match takes +precedence, except for the "default" rule, which is applied only if the query +didn't match any other rule. + +All other sections of the QoS policy file take precedence over the qos-ulps +section. That is, if a policy file has both qos-match-rules and qos-ulps +sections, then any query is matched first against the rules in the +qos-match-rules section, and only if there was no match, the query is matched +against the rules in the qos-ulps section. + +Note that some of these match rules may overlap, so in order to use the +simplified QoS definition effectively, it is important to understand how each +of the ULPs is matched: + +6.1 IPoIB +An IPoIB query is matched by PKey. The default PKey for an IPoIB partition is +0x7fff, so the following three match rules are equivalent: + + ipoib : + ipoib, pkey 0x7fff : + any, pkey 0x7fff : + +6.2 SDP +An SDP PR query is matched by Service ID. The Service-ID for SDP is +0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port +Number to connect to. The following two match rules are equivalent: + + sdp : + any, service-id 0x0000000000010000-0x000000000001ffff : + +6.3 RDS +Similar to SDP, an RDS PR query is matched by Service ID. The Service ID for RDS +is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP +Port Number to connect to. The default port number for RDS is 0x48CA, which makes +the default Service-ID 0x00000000010648CA. The following two match rules are +equivalent: + + rds : + any, service-id 0x00000000010648CA : + +6.4 iSER +Similar to RDS, an iSER query is matched by Service ID, where the Service ID +is also 0x000000000106PPPP. The default port number for iSER is 0x035C, which makes +the default Service-ID 0x000000000106035C. The following two match rules are +equivalent: + + iser : + any, service-id 0x000000000106035C : + +6.5 SRP +The Service ID for SRP varies from storage vendor to vendor, thus an SRP query is +matched by the target IB port GUID. The following two match rules are +equivalent: + + srp, target-port-guid 0x1234 : + any, target-port-guid 0x1234 : + +Note that any of the above ULPs might contain a target port GUID in the PR +query, so in order for these queries not to be recognized by the QoS manager +as SRP, the SRP match rule (or any match rule that refers to the target port +guid only) should be placed at the end of the qos-ulps match rules.
+ +6.6 MPI +SL for MPI is manually configured by the MPI admin. OpenSM is not forcing any SL +on the MPI traffic, which is why it is the only ULP that does not appear in +the qos-ulps section. + + +============================================================================== + 7. SL2VL Mapping and VL Arbitration +============================================================================== + +The OpenSM cached options file has a set of QoS related configuration parameters +that are used to configure SL2VL mapping and VL arbitration on IB ports. +These parameters are: + - Max VLs: the maximum number of VLs that will be on the subnet. + - High limit: the limit of the High Priority component of the VL Arbitration + table (IBA 7.6.9). + - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template. + - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template. + - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs + corresponding to SLs 0-15 (Note that VL15 used here means drop this SL). + +There are separate QoS configuration parameter sets for various target types: +CAs, routers, switch external ports, and switch's enhanced port 0. The names +of such parameters are prefixed by the "qos_<type>_" string. Here is a full list +of the currently supported sets: + + qos_ca_ - QoS configuration parameters set for CAs. + qos_rtr_ - parameters set for routers. + qos_sw0_ - parameters set for switches' port 0. + qos_swe_ - parameters set for switches' external ports. + +Here's an example of typical default values for CAs and switches' external +ports (hard-coded in OpenSM initialization): + + qos_ca_max_vls=15 + qos_ca_high_limit=0 + qos_ca_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 + qos_ca_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 + qos_ca_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 + + qos_swe_max_vls=15 + qos_swe_high_limit=0 + qos_swe_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 + qos_swe_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 + qos_swe_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 + +VL arbitration tables (both high and low) are lists of VL/Weight pairs. +Each list entry contains a VL number (values from 0-14), and a weighting value +(values 0-255), indicating the number of 64 byte units (credits) which may be +transmitted from that VL when its turn in the arbitration occurs. A weight +of 0 indicates that this entry should be skipped. If a list entry is +programmed for VL15 or for a VL that is not supported or is not currently +configured by the port, the port may either skip that entry or send from any +supported VL for that entry. + +Note that the same VLs may be listed multiple times in the High or Low +priority arbitration tables, and, further, they can be listed in both tables. + +The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates the +number of high-priority packets that can be transmitted without an opportunity +to send a low-priority packet. Specifically, the number of bytes that can be +sent is high_limit times 4K bytes. + +A high_limit value of 255 indicates that the byte limit is unbounded. +Note: if the 255 value is used, the low priority VLs may be starved. +A value of 0 indicates that only a single packet from the high-priority table +may be sent before an opportunity is given to the low-priority table. + +Keep in mind that ports usually transmit packets of size equal to MTU.
+
+Below is an example of SL2VL and VL Arbitration configuration on the subnet:
+
+    qos_ca_max_vls=15
+    qos_ca_high_limit=6
+    qos_ca_vlarb_high=0:4
+    qos_ca_vlarb_low=0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
+    qos_ca_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+    qos_swe_max_vls=15
+    qos_swe_high_limit=6
+    qos_swe_vlarb_high=0:4
+    qos_swe_vlarb_low=0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
+    qos_swe_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is
+defined as a high-priority VL, and it is limited to 6 x 4KB = 24KB in a single
+transmission burst. Such a configuration would suit a VL that needs low
+latency and uses a small MTU when transmitting packets. The rest of the VLs
+are defined as low-priority VLs with different weights, while VL4 is
+effectively turned off.

From kliteyn at dev.mellanox.co.il  Tue May 6 07:33:56 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 06 May 2008 17:33:56 +0300
Subject: [ofa-general] [PATCH] ofed/docs: fixing wrong syntax in the example
	in QoS_management_in_OpenSM.txt
In-Reply-To: <1210081452.2026.37.camel@hrosenstock-ws.xsigo.com>
References: <47E8C032.2050907@dev.mellanox.co.il>
	<1210079815.27137.190.camel@hrosenstock-ws.xsigo.com>
	<48205FE8.3070104@dev.mellanox.co.il>
	<1210081452.2026.37.camel@hrosenstock-ws.xsigo.com>
Message-ID: <48206C54.8010605@dev.mellanox.co.il>

Hal Rosenstock wrote:
> Hi Yevgeny,
>
> On Tue, 2008-05-06 at 16:40 +0300, Yevgeny Kliteynik wrote:
>> Hi Hal,
>>
>> Hal Rosenstock wrote:
>>> Hi Yevgeny,
>>>
>>> On Tue, 2008-03-25 at 11:04 +0200, Yevgeny Kliteynik wrote:
>>>> QoS_management_in_OpenSM.txt
>>> Shouldn't this doc also be available in the OpenSM git tree (in
>>> management/opensm/doc) and distributed as part of OpenSM ?
>>>
>>> If so, I can issue a patch for this.
>> I think that it's a good idea (which reminds me that I still
>> haven't fixed the patch for QoS stuff in OpenSM man page...)
>
> Yes, that would be nice too :-)
>
> That's going to make the OpenSM man page huge. Not sure how that should
> be dealt with. Maybe a simple way out in the short term might be to just
> reference that doc in the man page.

I thought about it too.

The only problem is that "short term" solutions have an astonishing
ability to stay as "long term", or even "final", solutions... :)

Let's think about the long term solution right away.
Are we OK with having just 10-15 lines about the existence of QoS
annex support in the OpenSM man page (in addition to the SL2VL and VLArb
tables configuration that already exists there), and a reference
to the QoS Management doc? I, for one, have no problems with that.

I tried reducing the full text to include it in the man page - it's
still very long...

I just checked the mailing list - the mail isn't there (it was "delayed
for approval" when there were problems with the mailing list filtering
policy a month ago). I'll forward it to you.

-- Yevgeny

> -- Hal
>
>> -- Yevgeny
>>
>>> Thanks.
>>>
>>> -- Hal
>>>
>>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>
>

From kliteyn at dev.mellanox.co.il  Tue May 6 07:34:43 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 06 May 2008 17:34:43 +0300
Subject: [Fwd: [ofa-general] [PATCH] opensm/man: Adding QoS-related info
	to opensm man pages]
Message-ID: <48206C83.8000009@dev.mellanox.co.il>

Hi Hal,

This is the mail that I was talking about (QoS info for the OpenSM man page).
Sasha has reviewed it and posted his answer to the mailing list.

-- Yevgeny

-------- Original Message --------
Subject: [ofa-general] [PATCH] opensm/man: Adding QoS-related info to opensm man pages
Date: Wed, 26 Mar 2008 02:47:08 +0200
From: Yevgeny Kliteynik
To: Sasha Khapyorsky
CC: OpenIB

Hi Sasha,

I've added QoS-related info to the opensm man pages: enhanced the
existing part (which talked about VL arbitration) and added a
description of the QoS manager in accordance with the QoS annex.

Please apply to ofed_1_3 and master.

Signed-off-by: Yevgeny Kliteynik
---
 opensm/man/opensm.8.in | 501 +++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 457 insertions(+), 44 deletions(-)

diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
index 5322ab7..1d9c5b7 100644
--- a/opensm/man/opensm.8.in
+++ b/opensm/man/opensm.8.in
@@ -35,7 +35,8 @@ to initialize the InfiniBand hardware (at least one per each
 InfiniBand subnet).

 opensm also now contains an experimental version of a performance
-manager as well.
+manager and an experimental version of a QoS manager (in accordance
+with the IBA QoS Annex).

 opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB
 fabric, initialize it, and sweep occasionally for changes.

@@ -433,51 +434,463 @@ partition manager:

 Default=0x7fff,ipoib:ALL=full;

-.SH QOS CONFIGURATION
+.SH QUALITY OF SERVICE
 .PP
-There are a set of QoS related low-level configuration parameters.
-All these parameter names are prefixed by "qos_" string. Here is a full
-list of these parameters:
-
-    qos_max_vls - The maximum number of VLs that will be on the subnet
-    qos_high_limit - The limit of High Priority component of VL
-                     Arbitration table (IBA 7.6.9)
-    qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9)
-                    template
-    qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
-                     template
-                     Both VL arbitration templates are pairs of
-                     VL and weight
-    qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is
-                a list of VLs corresponding to SLs 0-15 (Note
-                that VL15 used here means drop this SL)
-
-Typical default values (hard-coded in OpenSM initialization) are:
-
- qos_max_vls=15
- qos_high_limit=0
- qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
- qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
- qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
-
-The syntax is compatible with rest of OpenSM configuration options and
-values may be stored in OpenSM config file (cached options file).
-
-In addition to the above, we may define separate QoS configuration
-parameters sets for various target types. As targets, we currently support
-CAs, routers, switch external ports, and switch's enhanced port 0. The
-names of such specialized parameters are prefixed by "qos_<type>_"
-string. Here is a full list of the currently supported sets:
-
- qos_ca_ - QoS configuration parameters set for CAs.
- qos_rtr_ - parameters set for routers.
- qos_sw0_ - parameters set for switches' port 0.
- qos_swe_ - parameters set for switches' external ports.
+OpenSM QoS support comprises two parts:

-Examples:
- qos_sw0_max_vls=2
- qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
- qos_swe_high_limit=0
+ 1. \fBQoS manager in accordance with the IBA QoS Annex\fP (experimental)
+.P
+ 2. \fBSL2VL and VL Arbitration tables configuration\fP
+.P
+.SS QoS Manager (experimental)
+.PP
+When Quality of Service in OpenSM is enabled (-Q or --qos), OpenSM looks
+for the QoS Policy file. The default name of this file is
+\fB\%@CONF_DIR@/@QOS_POLICY_FILE@\fP. The default may be changed by using
+the -Y or --qos_policy_file option with OpenSM.
+
+During fabric initialization and at every heavy sweep OpenSM parses the
+QoS policy file, applies its settings to the discovered fabric elements,
+and enforces the provided policy on client requests. The overall flow for
+such requests is as follows:
+ - The request is matched against the defined matching rules such that
+   the QoS Level definition is found.
+ - Given the QoS Level, a path(s) search is performed with the
+   restrictions imposed by that level.
+
+There are two ways to define QoS policy:
+ - \fBFull\fP: the full policy file syntax provides the administrator various
+   ways to match a PathRecord/MultiPathRecord (PR/MPR) request, and to
+   enforce various QoS constraints on the requested PR/MPR.
+ - \fBSimplified\fP: the simplified policy file syntax enables the
+   administrator to match PR/MPR requests by the various ULPs and
+   applications running on top of these ULPs.
+
+While the full policy syntax is very flexible, in many cases the simplified
+policy definition would be sufficient.
+.PP
+.B Full QoS Policy File
+.PP
+The QoS policy file has the following sections:
+
+.B I)
+Port Groups (denoted by port-groups).
+This section defines zero or more port groups that can be referred to later
+by the matching rules (see below). A port group lists ports by:
+ - Port GUID
+ - Port name, which is a combination of NodeDescription and IB port number
+ - PKey, which means that all the ports in the subnet that belong to the
+   partition with a given PKey belong to this port group
+ - Partition name, which means that all the ports in the subnet that belong
+   to the partition with a given name belong to this port group
+ - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and
+   SELF (the SM's port).
+
+.B II)
+QoS Setup (denoted by qos-setup).
+This section describes how to set up SL2VL and VL Arbitration tables on
+various nodes in the fabric.
+However, this is not supported in OFED 1.3.
+SL2VL and VLArb tables should be configured in the OpenSM options file.
+
+.B III)
+QoS Levels (denoted by qos-levels).
+Each QoS Level defines a Service Level (SL) and a few optional fields:
+ - MTU limit
+ - Rate limit
+ - PKey
+ - Packet lifetime
+
+When a path search is performed, it is done with regard to the restrictions
+that these QoS Level parameters impose.
+One QoS level that is mandatory to define is a DEFAULT QoS level. It is
+applied to a PR/MPR query that does not match any existing match rule.
+Similar to any other QoS Level, it can also be explicitly referred to by any
+match rule.
+
+.B IV)
+QoS Matching Rules (denoted by qos-match-rules).
+Each PathRecord/MultiPathRecord query that OpenSM receives is matched against
+the set of matching rules. Rules are scanned in order of appearance in the
+QoS policy file such that the first match takes precedence.
+Each rule carries the name of the QoS level that will be applied to the
+matching query.
+The default QoS level is applied to a query that did not match any rule.
+Queries can be matched by:
+ - Source port group (whether the source port is a member of a specified group)
+ - Destination port group (same as above, only for the destination port)
+ - PKey
+ - QoS class
+ - Service ID
+
+To match a certain matching rule, a PR/MPR query has to match ALL the rule's
+criteria. However, not all the fields of the PR/MPR query have to appear in
+the matching rule.
+For instance, if the rule has a single criterion - Service ID - it will match
+any query that has this Service ID, disregarding the rest of the query fields.
+However, if a certain query has only a Service ID (which means that this is
+the only bit in the PR/MPR component mask that is on), it will not match any
+rule that has other matching criteria besides Service ID.
+.PP
+.B Simplified QoS Policy Definition
+.PP
+The simplified QoS policy definition comprises a single section denoted by
+qos-ulps. Similar to the full QoS policy, it has a list of match rules and
+their QoS Level, but in this case a match rule has only one criterion - its
+goal is to match a certain ULP (or a certain application on top of this ULP)
+PR/MPR request - and the QoS Level has only one constraint - the Service
+Level (SL).
+The simplified policy section may appear in the policy file in combination
+with the full policy, or as a stand-alone policy definition.
+See more details and a list of match rule criteria below.
+.PP
+.B Policy File Syntax Guidelines
+.PP
+Leading and trailing blanks, as well as empty lines, are ignored, so the
+indentation in the example is just for better readability.
+Comments start with the pound sign (#) and are terminated by EOL.
+Any keyword should be the first non-blank word on its line, unless it's a
+comment.
+Keywords that denote a section/subsection start have matching closing
+keywords.
+Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR
+requests that didn't match any of the matching rules.
+Any section/subsection of the policy file is optional.
+
+.PP
+.B Examples of Full Policy File
+.PP
+As mentioned earlier, any section of the policy file is optional, and
+the only mandatory part of the policy file is a default QoS Level.
+Here's an example of the shortest policy file:
+
+    qos-levels
+        qos-level
+            name: DEFAULT
+            sl: 0
+        end-qos-level
+    end-qos-levels
+
+The port groups section is missing because there are no match rules, which
+means that port groups are not referred to anywhere, and there is no need to
+define them. And since this policy file doesn't have any matching rules, no
+PR/MPR query will match any rule, and OpenSM will enforce the default QoS
+level. Essentially, the above example is equivalent to not having a QoS
+policy file at all.
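+
+Before the full example, here is a minimal stand-alone C sketch of the
+first-match semantics described in section IV above (an illustration only,
+not the actual OpenSM matcher; the rule fields are simplified, and the real
+code also handles port groups, QoS class ranges, and the component mask):
+
+    /* Sketch only: the first matching rule wins; a rule matches when
+     * every criterion it specifies matches the query, and criteria it
+     * leaves unspecified are ignored. */
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct pr_query { uint64_t service_id; uint16_t pkey; };
+
+    struct match_rule {
+        int has_service_id, has_pkey;   /* which criteria are set */
+        uint64_t service_id;
+        uint16_t pkey;
+        const char *qos_level_name;
+    };
+
+    static const char *match_qos_level(const struct match_rule *rules,
+                                       size_t n, const struct pr_query *q)
+    {
+        size_t i;
+
+        for (i = 0; i < n; i++) {
+            const struct match_rule *r = &rules[i];
+
+            if (r->has_service_id && r->service_id != q->service_id)
+                continue;
+            if (r->has_pkey && r->pkey != q->pkey)
+                continue;
+            return r->qos_level_name;  /* first match takes precedence */
+        }
+        return "DEFAULT";              /* no rule matched */
+    }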
+
+The following example shows all the possible options and keywords in the
+policy file and their syntax:
+
+    #
+    # See the comments in the following example.
+    # They explain different keywords and their meaning.
+    #
+    port-groups
+
+        port-group
+            name: Storage
+            # "use" is just a description that is used for logging
+            # Other than that, it is just a comment
+            use: SRP Targets
+            port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
+            port-guid: 0x1000000000FFFF
+        end-port-group
+
+        port-group
+            name: Virtual Servers
+            # The syntax of the port name is as follows:
+            #   "node_description/Pnum".
+            # node_description is compared to the NodeDescription of the node,
+            # and "Pnum" is a port number on that node.
+            port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
+        end-port-group
+
+        # using partitions defined in the partition policy
+        port-group
+            name: Partitions
+            partition: Part1
+            pkey: 0x1234
+        end-port-group
+
+        # using node types: CA, ROUTER, SWITCH, SELF (for the node that runs
+        # the SM) or ALL (for all the nodes in the subnet)
+        port-group
+            name: CAs and SM
+            node-type: CA, SELF
+        end-port-group
+
+    end-port-groups
+
+    qos-setup
+        # This section of the policy file describes how to set up SL2VL and VL
+        # Arbitration tables on various nodes in the fabric.
+        # However, this is not supported in OFED 1.3 - the section is parsed
+        # and ignored. SL2VL and VLArb tables should be configured in the
+        # OpenSM options file (by default - /var/cache/opensm/opensm.opts).
+    end-qos-setup
+
+    qos-levels
+
+        # Having a QoS Level named "DEFAULT" is a must - it is applied to
+        # PR/MPR requests that didn't match any of the matching rules.
+        qos-level
+            name: DEFAULT
+            use: default QoS Level
+            sl: 0
+        end-qos-level
+
+        # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime
+        qos-level
+            name: WholeSet
+            sl: 1
+            mtu-limit: 4
+            rate-limit: 5
+            pkey: 0x1234
+            packet-life: 8
+        end-qos-level
+
+    end-qos-levels
+
+    # Match rules are scanned in order of their appearance in the policy file.
+    # The first matched rule takes precedence.
+    qos-match-rules
+
+        # matching by a single criterion: QoS class
+        qos-match-rule
+            use: by QoS class
+            qos-class: 7-9,11
+            # Name of the qos-level to apply to the matching PR/MPR
+            qos-level-name: WholeSet
+        end-qos-match-rule
+
+        # show matching by destination group and service id
+        qos-match-rule
+            use: Storage targets
+            destination: Storage
+            service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF
+            qos-level-name: WholeSet
+        end-qos-match-rule
+
+        qos-match-rule
+            source: Storage
+            use: match by source group only
+            qos-level-name: DEFAULT
+        end-qos-match-rule
+
+        qos-match-rule
+            use: match by all parameters
+            qos-class: 7-9,11
+            source: Virtual Servers
+            destination: Storage
+            service-id: 0x0000000000010000-0x000000000001FFFF
+            pkey: 0x0F00-0x0FFF
+            qos-level-name: WholeSet
+        end-qos-match-rule
+
+    end-qos-match-rules
+
+.PP
+.B Simplified QoS Policy - Details and Examples
+.PP
+Simplified QoS policy match rules are tailored for matching the PR/MPR
+requests of ULPs (or of some application on top of a ULP). The qos-ulps
+section holds a list of per-ULP (or per-application) match rules and the SL
+that should be enforced on the matched PR/MPR query.
+
+Match rules include:
+ - A default match rule that is applied to a PR/MPR query that didn't
+   match any of the other match rules
+ - SDP
+ - SDP application with a specific target TCP/IP port range
+ - SRP with a specific target IB port GUID
+ - RDS
+ - iSER
+ - iSER application with a specific target TCP/IP port range
+ - IPoIB with a default PKey
+ - IPoIB with a specific PKey
+ - any ULP/application with a specific Service ID in the PR/MPR query
+ - any ULP/application with a specific PKey in the PR/MPR query
+ - any ULP/application with a specific target IB port GUID in the PR/MPR query
+
+Since any section of the policy file is optional, as long as the basic rules
+of the file are kept (such as not referring to a nonexistent port group,
+having a default QoS Level, etc.), the simplified policy section (qos-ulps)
+can serve as a complete QoS policy file.
+The shortest policy file in this case would be as follows:
+
+    qos-ulps
+        default : 0 # default SL
+    end-qos-ulps
+
+It is equivalent to not having a policy file at all.
+
+Below is an example of a simplified QoS policy with all the possible keywords:
+
+    qos-ulps
+        default                               : 0  # default SL
+        sdp, port-num 30000                   : 0  # SL for application running on top
+                                                   # of SDP when a destination
+                                                   # TCP/IP port is 30000
+        sdp, port-num 10000-20000             : 0
+        sdp                                   : 1  # default SL for any other
+                                                   # application running on top of SDP
+        rds                                   : 2  # SL for RDS traffic
+        iser, port-num 900                    : 0  # SL for iSER with a specific target
+                                                   # port
+        iser                                  : 3  # default SL for iSER
+        ipoib, pkey 0x0001                    : 0  # SL for IPoIB on partition with
+                                                   # pkey 0x0001
+        ipoib                                 : 4  # default IPoIB partition,
+                                                   # pkey=0x7FFF
+        any, service-id 0x6234                : 6  # match any PR/MPR query with a
+                                                   # specific Service ID
+        any, pkey 0x0ABC                      : 6  # match any PR/MPR query with a
+                                                   # specific PKey
+        srp, target-port-guid 0x1234          : 5  # SRP when SRP Target is located on
+                                                   # a specified IB port GUID
+        any, target-port-guid 0x0ABC-0xFFFFF  : 6  # match any PR/MPR query with
+                                                   # a specific target port GUID
+    end-qos-ulps
+
+
+Similar to the full policy definition, matching of PR/MPR queries is done in
+order of appearance in the QoS policy file such that the first match takes
+precedence, except for the "default" rule, which is applied only if the query
+didn't match any other rule.
+
+All other sections of the QoS policy file take precedence over the qos-ulps
+section. That is, if a policy file has both qos-match-rules and qos-ulps
+sections, then any query is matched first against the rules in the
+qos-match-rules section, and only if there was no match, the query is matched
+against the rules in the qos-ulps section.
+
+Note that some of these match rules may overlap, so in order to use the
+simplified QoS definition effectively, it is important to understand how each
+of the ULPs is matched:
+
+.B IPoIB:
+An IPoIB PR query is matched by PKey. The default PKey for an IPoIB partition
+is 0x7fff, so the following three match rules are equivalent:
+
+    ipoib               :
+    ipoib, pkey 0x7fff  :
+    any, pkey 0x7fff    :
+
+.I Note
+: For OFED 1.3, IPoIB partition SL configuration should be done through the
+partition configuration file only.
+
+\fBSDP\fP: An SDP PR query is matched by Service ID. The Service-ID for SDP
+is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP
+Port Number to connect to. The following two match rules are equivalent:
+
+    sdp                                                     :
+    any, service-id 0x0000000000010000-0x000000000001ffff  :
+
+\fBRDS\fP: Similar to SDP, an RDS PR query is matched by Service ID. The
+Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits
+holding the remote TCP/IP Port Number to connect to. The default port number
+for RDS is 0x48CA, which makes the default Service-ID 0x00000000010648CA.
+The following two match rules are equivalent:
+
+    rds                                 :
+    any, service-id 0x00000000010648CA  :
+
+\fBiSER\fP: Similar to RDS, an iSER query is matched by Service ID, where the
+Service ID is also 0x000000000106PPPP. The default port number for iSER is
+0x035C, which makes the default Service-ID 0x000000000106035C.
+The following two match rules are equivalent:
+
+    iser                                :
+    any, service-id 0x000000000106035C  :
+
+\fBSRP\fP: The Service ID for SRP varies from storage vendor to vendor, thus
+an SRP query is matched by the target IB port GUID. The following two match
+rules are equivalent:
+
+    srp, target-port-guid 0x1234  :
+    any, target-port-guid 0x1234  :
+
+Note that any of the above ULPs might contain a target port GUID in the PR
+query, so in order for these queries not to be recognized by the QoS manager
+as SRP, the SRP match rule (or any match rule that refers to the target port
+GUID only) should be placed at the end of the qos-ulps match rules.
+
+\fBMPI\fP: The SL for MPI is manually configured by the MPI admin. OpenSM
+does not force any SL on MPI traffic, which is why it is the only ULP that
+does not appear in the qos-ulps section.
+
+
+.SS SL2VL Mapping and VL Arbitration
+.PP
+
+The OpenSM cached options file has a set of QoS-related configuration
+parameters that are used to configure SL2VL mapping and VL arbitration
+on IB ports. These parameters are:
+ - Max VLs: the maximum number of VLs that will be on the subnet.
+ - High limit: the limit of the High Priority component of the VL Arbitration
+   table (IBA 7.6.9).
+ - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template.
+ - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template.
+ - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs
+   corresponding to SLs 0-15 (note that VL15 used here means drop this SL).
+
+There are separate QoS configuration parameter sets for various target
+types: CAs, routers, switch external ports, and the switch's enhanced port 0.
+The names of such parameters are prefixed by the "qos_<type>_" string.
+Here is a full list of the currently supported sets:
+
+    qos_ca_  - QoS configuration parameters set for CAs.
+    qos_rtr_ - parameters set for routers.
+    qos_sw0_ - parameters set for switches' port 0.
+    qos_swe_ - parameters set for switches' external ports.
+
+Here is an example of typical default values for all the ports in the
+subnet (hard-coded in OpenSM initialization):
+
+    qos_max_vls=15
+    qos_high_limit=0
+    qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
+    qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
+    qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+
+VL arbitration tables (both high and low) are lists of VL/Weight pairs.
+Each list entry contains a VL number (values 0-14) and a weighting value
+(values 0-255), indicating the number of 64-byte units (credits) which may be
+transmitted from that VL when its turn in the arbitration occurs. A weight
+of 0 indicates that this entry should be skipped. If a list entry is
+programmed for VL15 or for a VL that is not supported or is not currently
+configured by the port, the port may either skip that entry or send from any
+supported VL for that entry.
+
+Note that the same VL may be listed multiple times in the High or Low
+priority arbitration tables and, further, may be listed in both tables.
+
+The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates
+the number of high-priority packets that can be transmitted without an
+opportunity to send a low-priority packet. Specifically, the number of bytes
+that can be sent is high_limit times 4K bytes.
+
+A high_limit value of 255 indicates that the byte limit is unbounded.
+Note: if the 255 value is used, the low-priority VLs may be starved.
+A value of 0 indicates that only a single packet from the high-priority table
+may be sent before an opportunity is given to the low-priority table.
+
+Keep in mind that ports usually transmit packets of size equal to MTU.
+For instance, for 4KB MTU a single packet will require 64 credits, so in order
+to achieve effective VL arbitration for packets of 4KB MTU, the weighting
+values for each VL should be multiples of 64.
+
+Below is an example of SL2VL and VL Arbitration configuration on the subnet:
+
+    qos_max_vls=15
+    qos_high_limit=6
+    qos_vlarb_high=0:4
+    qos_vlarb_low=0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
+    qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
+
+In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is
+defined as a high-priority VL, and it is limited to 6 x 4KB = 24KB in a single
+transmission burst. Such a configuration would suit a VL that needs low
+latency and uses a small MTU when transmitting packets. The rest of the VLs
+are defined as low-priority VLs with different weights, while VL4 is
+effectively turned off.

 .SH PREFIX ROUTES
 .PP
--
1.5.1.4

_______________________________________________
general mailing list
general at lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From hrosenstock at xsigo.com  Tue May 6 07:37:04 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Tue, 06 May 2008 07:37:04 -0700
Subject: [ofa-general] [PATCH] ofed/docs: fixing wrong syntax in the example
	in QoS_management_in_OpenSM.txt
In-Reply-To: <48206C54.8010605@dev.mellanox.co.il>
References: <47E8C032.2050907@dev.mellanox.co.il>
	<1210079815.27137.190.camel@hrosenstock-ws.xsigo.com>
	<48205FE8.3070104@dev.mellanox.co.il>
	<1210081452.2026.37.camel@hrosenstock-ws.xsigo.com>
	<48206C54.8010605@dev.mellanox.co.il>
Message-ID: <1210084624.2026.51.camel@hrosenstock-ws.xsigo.com>

On Tue, 2008-05-06 at 17:33 +0300, Yevgeny Kliteynik wrote:
> Hal Rosenstock wrote:
> > Hi Yevgeny,
> >
> > On Tue, 2008-05-06 at 16:40 +0300, Yevgeny Kliteynik wrote:
> >> Hi Hal,
> >>
> >> Hal Rosenstock wrote:
> >>> Hi Yevgeny,
> >>>
> >>> On Tue, 2008-03-25 at 11:04 +0200, Yevgeny Kliteynik wrote:
> >>>> QoS_management_in_OpenSM.txt
> >>> Shouldn't this doc also be available in the OpenSM git tree (in
> >>> management/opensm/doc) and distributed as part of OpenSM ?
> >>>
> >>> If so, I can issue a patch for this.
> >> I think that it's a good idea (which reminds me that I still
> >> haven't fixed the patch for QoS stuff in OpenSM man page...)
> >
> > Yes, that would be nice too :-)
> >
> > That's going to make the OpenSM man page huge. Not sure how that should
> > be dealt with. Maybe a simple way out in the short term might be to just
> > reference that doc in the man page.
>
> I thought about it too.
>
> The only problem is that "short term" solutions have an astonishing
> ability to stay as "long term", or even "final",
> solutions... :)
>
> Let's think about the long term solution right away.
> Are we OK with having just 10-15 lines about the existence of QoS
> annex support in the OpenSM man page (in addition to the SL2VL and VLArb
> tables configuration that already exists there), and a reference
> to the QoS Management doc? I, for one, have no problems with

At a high level, this sounds fine to me but I'd want to see the actual
text to be sure.

> I tried reducing the full text to include it in the man page - it's
> still very long...

I think the OpenSM man page needs to be broken up.

-- Hal

> I just checked the mailing list - the mail isn't there (it was "delayed
> for approval" when there were problems with the mailing list filtering
> policy a month ago). I'll forward it to you.
>
> -- Yevgeny
>
> > -- Hal
> >
> >> -- Yevgeny
> >>
> >>> Thanks.
> >>>
> >>> -- Hal
> >>>
> >>>
> >> _______________________________________________
> >> general mailing list
> >> general at lists.openfabrics.org
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>
> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>
>

From hrosenstock at xsigo.com  Tue May 6 07:45:25 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Tue, 06 May 2008 07:45:25 -0700
Subject: [Fwd: [ofa-general] [PATCH] opensm/man: Adding QoS-related info
	to opensm man pages]
In-Reply-To: <48206C83.8000009@dev.mellanox.co.il>
References: <48206C83.8000009@dev.mellanox.co.il>
Message-ID: <1210085125.2026.60.camel@hrosenstock-ws.xsigo.com>

Hi Yevgeny,

On Tue, 2008-05-06 at 17:34 +0300, Yevgeny Kliteynik wrote:
> Hi Hal,
>
> This is the mail that I was talking about (QoS info for the OpenSM man page).
> Sasha has reviewed it and posted his answer to the mailing list.

I must have missed that. What was the date of that post ?

See below for some additional comments.

-- Hal

>
> -- Yevgeny
>
>
> -------- Original Message --------
> Subject: [ofa-general] [PATCH] opensm/man: Adding QoS-related info to opensm man pages
> Date: Wed, 26 Mar 2008 02:47:08 +0200
> From: Yevgeny Kliteynik
> To: Sasha Khapyorsky
> CC: OpenIB
>
> Hi Sasha,
>
> I've added QoS-related info to the opensm man pages: enhanced the
> existing part (which talked about VL arbitration) and added a
> description of the QoS manager in accordance with the QoS annex.
>
> Please apply to ofed_1_3 and master.
>
> Signed-off-by: Yevgeny Kliteynik
> ---
>  opensm/man/opensm.8.in | 501 +++++++++++++++++++++++++++++++++++++++++++-----
>  1 files changed, 457 insertions(+), 44 deletions(-)
>
> diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
> index 5322ab7..1d9c5b7 100644
> --- a/opensm/man/opensm.8.in
> +++ b/opensm/man/opensm.8.in
> @@ -35,7 +35,8 @@ to initialize the InfiniBand hardware (at least one per each
>  InfiniBand subnet).
>
>  opensm also now contains an experimental version of a performance
> -manager as well.
> +manager and an experimental version of a QoS manager (in accordance
> +with the IBA QoS Annex).

Minor tweak as I think the performance manager is no longer being
indicated as experimental:

opensm also now contains a performance manager as well as an
experimental QoS manager (in accordance with IBTA 1.2.1 QoS Annex).

> opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB
> fabric, initialize it, and sweep occasionally for changes.
> @@ -433,51 +434,463 @@ partition manager: > > Default=0x7fff,ipoib:ALL=full; > > -.SH QOS CONFIGURATION > +.SH QUALITY OF SERVICE > .PP > -There are a set of QoS related low-level configuration parameters. > -All these parameter names are prefixed by "qos_" string. Here is a full > -list of these parameters: > - > - qos_max_vls - The maximum number of VLs that will be on the subnet > - qos_high_limit - The limit of High Priority component of VL > - Arbitration table (IBA 7.6.9) > - qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9) > - template > - qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9) > - template > - Both VL arbitration templates are pairs of > - VL and weight > - qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is > - a list of VLs corresponding to SLs 0-15 (Note > - that VL15 used here means drop this SL) > - > -Typical default values (hard-coded in OpenSM initialization) are: > - > - qos_max_vls=15 > - qos_high_limit=0 > - qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 > - qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 > - qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 > - > -The syntax is compatible with rest of OpenSM configuration options and > -values may be stored in OpenSM config file (cached options file). > - > -In addition to the above, we may define separate QoS configuration > -parameters sets for various target types. As targets, we currently support > -CAs, routers, switch external ports, and switch's enhanced port 0. The > -names of such specialized parameters are prefixed by "qos__" > -string. Here is a full list of the currently supported sets: > - > - qos_ca_ - QoS configuration parameters set for CAs. > - qos_rtr_ - parameters set for routers. > - qos_sw0_ - parameters set for switches' port 0. > - qos_swe_ - parameters set for switches' external ports. > +OpenSM QoS support comprises of two parts: > > -Examples: > - qos_sw0_max_vls=2 > - qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0, > - qos_swe_high_limit=0 > + 1. \fBQoS manager in accordance with IBA QoS Annex\fP (experimental) > +.P > + 2. \fBSL2VL and VL Arbitration tables configuration\fP > +.P > +.SS QoS Manager (experimental) > +.PP > +When Quality of Service in OpenSM is enabled (-Q or --qos), OpenSM looks > +for QoS Policy file. The default name of this file is > +\fB\%@CONF_DIR@/@QOS_POLICY_FILE@\fP. The default may be changed by using > +-Y or --qos_policy_file option with OpenSM. This is essentially the QoS management doc cast into the man page with the older primitive QoS tacked on. Should the annex support just refer to the doc and leave the older description present ? Or something else ? Maybe that was addressed in Sasha's response. > + > +During fabric initialization and at every heavy sweep OpenSM parses the > +QoS policy file, applies its settings to the discovered fabric elements, > +and enforces the provided policy on client requests. The overall flow for > +such requests is as follows: > + - The request is matched against the defined matching rules such that > + the QoS Level definition is found. > + - Given the QoS Level, path(s) search is performed with the given > + restrictions imposed by that level. > + > +There are two ways to define QoS policy: > + - \fBFull\fP: the full policy file syntax provides the administrator various > + ways to match a PathRecord/MultiPathRecord (PR/MPR) request, and to > + enforce various QoS constraints on the requested PR/MPR. 
> + - \fBSimplified\fP: the simplified policy file syntax enables the administrator > + match PR/MPR requests by various ULPs and applications running on top of > + these ULPs. > + > +While the full policy syntax is very flexible, in many cases the simplified > +policy definition would be sufficient. > +.PP > +.B Full QoS Policy File > +.PP > +QoS policy file has the following sections: > + > +.B I) > +Port Groups (denoted by port-groups). > +This section defines zero or more port groups that can be referred later by > +matching rules (see below). Port group lists ports by: > + - Port GUID > + - Port name, which is a combination of NodeDescription and IB port number > + - PKey, which means that all the ports in the subnet that belong to > + partition with a given PKey belong to this port group > + - Partition name, which means that all the ports in the subnet that belong > + to partition with a given name belong to this port group > + - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and > + SELF (SM's port). > + > +.B II) > +QoS Setup (denoted by qos-setup). > +This section describes how to set up SL2VL and VL Arbitration tables on > +various nodes in the fabric. > +However, this is not supported in OFED 1.3. > +SL2VL and VLArb tables should be configured in the OpenSM options file. > + > +.B III) > +QoS Levels (denoted by qos-levels). > +Each QoS Level defines Service Level (SL) and a few optional fields: > + - MTU limit > + - Rate limit > + - PKey > + - Packet lifetime > + > +When path(s) search is performed, it is done with regards to restriction that > +these QoS Level parameters impose. > +One QoS level that is mandatory to define is a DEFAULT QoS level. It is > +applied to a PR/MPR query that does not match any existing match rule. > +Similar to any other QoS Level, it can also be explicitly referred by any > +match rule. > + > +.B IV) > +QoS Matching Rules (denoted by qos-match-rules). > +Each PathRecord/MultiPathRecord query that OpenSM receives is matched against > +the set of matching rules. Rules are scanned in order of appearance in the QoS > +policy file such as the first match takes precedence. > +Each rule has a name of QoS level that will be applied to the matching query. > +A default QoS level is applied to a query that did not match any rule. > +Queries can be matched by: > + - Source port group (whether a source port is a member of a specified group) > + - Destination port group (same as above, only for destination port) > + - PKey > + - QoS class > + - Service ID > + > +To match a certain matching rule, PR/MPR query has to match ALL the rule's > +criteria. However, not all the fields of the PR/MPR query have to appear in > +the matching rule. > +For instance, if the rule has a single criterion - Service ID, it will match > +any query that has this Service ID, disregarding rest of the query fields. > +However, if a certain query has only Service ID (which means that this is the > +only bit in the PR/MPR component mask that is on), it will not match any rule > +that has other matching criteria besides Service ID. > +.PP > +.B Simplified QoS Policy Definition > +.PP > +Simplified QoS policy definition comprises of a single section denoted by > +qos-ulps. Similar to the full QoS policy, it has a list of match rules and > +their QoS Level, but in this case a match rule has only one criterion - its > +goal is to match a certain ULP (or a certain application on top of this ULP) > +PR/MPR request, and QoS Level has only one constraint - Service Level (SL). 
> +The simplified policy section may appear in the policy file in combine with > +the full policy, or as a stand-alone policy definition. > +See more details and list of match rule criteria below. > +.PP > +.B Policy File Syntax Guidelines > +.PP > +Empty lines are ignored. > +Leading and trailing blanks, as well as empty lines, are ignored, so the > +indentation in the example is just for better readability. > +Comments are started with the pound sign (#) and terminated by EOL. > +Any keyword should be the first non-blank in the line, unless it's a comment. > +Keywords that denote section/subsection start have matching closing keywords. > +Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR > +requests that didn't match any of the matching rules. > +Any section/subsection of the policy file is optional. > + > +.PP > +.B Examples of Full Policy File > +.PP > +As mentioned earlier, any section of the policy file is optional, and > +the only mandatory part of the policy file is a default QoS Level. > +Here's an example of the shortest policy file: > + > + qos-levels > + qos-level > + name: DEFAULT > + sl: 0 > + end-qos-level > + end-qos-levels > + > +Port groups section is missing because there are no match rules, which means > +that port groups are not referred anywhere, and there is no need defining > +them. And since this policy file doesn't have any matching rules, PR/MPR query > +won't match any rule, and OpenSM will enforce default QoS level. > +Essentially, the above example is equivalent to not having QoS policy file > +at all. > + > +The following example shows all the possible options and keywords in the > +policy file and their syntax: > + > + # > + # See the comments in the following example. > + # They explain different keywords and their meaning. > + # > + port-groups > + port-group > + name: Storage > + # "use" is just a description that is used for logging > + # Other than that, it is just a comment > + use: SRP Targets > + port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA > + port-guid: 0x1000000000FFFF > + end-port-group > + > + port-group > + name: Virtual Servers > + # The syntax of the port name is as follows: > + # "node_description/Pnum". > + # node_description is compared to the NodeDescription of the node, > + # and "Pnum" is a port number on that node. > + port-name: vs1 HCA-1/P1, vs2 HCA-1/P1 > + end-port-group > + > + # using partitions defined in the partition policy > + port-group > + name: Partitions > + partition: Part1 > + pkey: 0x1234 > + end-port-group > + > + # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM) > + # or ALL (for all the nodes in the subnet) > + port-group > + name: CAs and SM > + node-type: CA, SELF > + end-port-group > + > + end-port-groups > + > + qos-setup > + # This section of the policy file describes how to set up SL2VL and VL > + # Arbitration tables on various nodes in the fabric. > + # However, this is not supported in OFED 1.3 - the section is parsed > + # and ignored. SL2VL and VLArb tables should be configured in the > + # OpenSM options file (by default - /var/cache/opensm/opensm.opts). > + end-qos-setup > + > + qos-levels > + > + # Having a QoS Level named "DEFAULT" is a must - it is applied to > + # PR/MPR requests that didn't match any of the matching rules. 
> + qos-level > + name: DEFAULT > + use: default QoS Level > + sl: 0 > + end-qos-level > + > + # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime > + qos-level > + name: WholeSet > + sl: 1 > + mtu-limit: 4 > + rate-limit: 5 > + pkey: 0x1234 > + packet-life: 8 > + end-qos-level > + > + end-qos-levels > + > + # Match rules are scanned in order of their apperance in the policy file. > + # First matched rule takes precedence. > + qos-match-rules > + > + # matching by single criteria: QoS class > + qos-match-rule > + use: by QoS class > + qos-class: 7-9,11 > + # Name of qos-level to apply to the matching PR/MPR > + qos-level-name: WholeSet > + end-qos-match-rule > + > + # show matching by destination group and service id > + qos-match-rule > + use: Storage targets > + destination: Storage > + service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF > + qos-level-name: WholeSet > + end-qos-match-rule > + > + qos-match-rule > + source: Storage > + use: match by source group only > + qos-level-name: DEFAULT > + end-qos-match-rule > + > + qos-match-rule > + use: match by all parameters > + qos-class: 7-9,11 > + source: Virtual Servers > + destination: Storage > + service-id: 0x0000000000010000-0x000000000001FFFF > + pkey: 0x0F00-0x0FFF > + qos-level-name: WholeSet > + end-qos-match-rule > + > + end-qos-match-rules > + > +.PP > +.B Simplified QoS Policy - Details and Examples > +.PP > +Simplified QoS policy match rules are tailored for matching ULPs (or > +some application on top of a ULP) PR/MPR requests. It has a list of > +per-ULP (or per-application) match rules and the SL that should be > +enforced on the matched PR/MPR query. > + > +Match rules include: > + - Default match rule that is applied to PR/MPR query that didn't > + match any of the other match rules > + - SDP > + - SDP application with a specific target TCP/IP port range > + - SRP with a specific target IB port GUID > + - RDS > + - iSER > + - iSER application with a specific target TCP/IP port range > + - IPoIB with a default PKey > + - IPoIB with a specific PKey > + - any ULP/application with a specific Service ID in the PR/MPR query > + - any ULP/application with a specific PKey in the PR/MPR query > + - any ULP/application with a specific target IB port GUID in the PR/MPR query > + > +Since any section of the policy file is optional, as long as basic rules > +of the file are kept (such as no referring to nonexisting port group, > +having default QoS Level, etc), the simplified policy section (qos-ulps) > +can serve as a complete QoS policy file. > +The shortest policy file in this case would be as follows: > + > + qos-ulps > + default : 0 #default SL > + end-qos-ulps > + > +It is equivalent to not having policy file at all. 
> + > +Below is an example of simplified QoS policy with all the possible keywords: > + > + qos-ulps > + default : 0 # default SL > + sdp, port-num 30000 : 0 # SL for application running on top > + # of SDP when a destination > + # TCP/IPport is 30000 > + sdp, port-num 10000-20000 : 0 > + sdp : 1 # default SL for any other > + # application running on top of SDP > + rds : 2 # SL for RDS traffic > + iser, port-num 900 : 0 # SL for iSER with a specific target > + # port > + iser : 3 # default SL for iSER > + ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with > + # pkey 0x0001 > + ipoib : 4 # default IPoIB partition, > + # pkey=0x7FFF > + any, service-id 0x6234 : 6 # match any PR/MPR query with a > + # specific Service ID > + any, pkey 0x0ABC : 6 # match any PR/MPR query with a > + # specific PKey > + srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on > + # a specified IB port GUID > + any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with > + # a specific target port GUID > + end-qos-ulps > + > + > +Similar to the full policy definition, matching of PR/MPR queries is done in > +order of appearance in the QoS policy file such as the first match takes > +precedence, except for the "default" rule, which is applied only if the query > +didn't match any other rule. > + > +All other sections of the QoS policy file take precedence over the qos-ulps > +section. That is, if a policy file has both qos-match-rules and qos-ulps > +sections, then any query is matched first against the rules in the > +qos-match-rules section, and only if there was no match, the query is matched > +against the rules in qos-ulps section. > + > +Note that some of these match rules may overlap, so in order to use the > +simplified QoS definition effectively, it is important to understand how each > +of the ULPs is matched: > + > +.B IPoIB: > +PR query is matched by PKey. Default PKey for IPoIB partition is 0x7fff, so > +the following three match rules are equivalent: > + > + ipoib : > + ipoib, pkey 0x7fff : > + any, pkey 0x7fff : > + > +.I Note > +: For OFED 1.3, IPoIB partition SL configuration should be done through > +partition configuration file only. > + > +\fBSDP\fP: PR query is matched by Service ID. The Service-ID for SDP is > +0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP > +Port Number to connect to. The following two match rules are equivalent: > + > + sdp : > + any, service-id 0x0000000000010000-0x000000000001ffff : > + > +\fBRDS\fP: Similar to SDP, RDS PR query is matched by Service ID. The > +Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits > +holding the remote TCP/IP Port Number to connect to. Default port number > +for RDS is 0x48CA, which makes a default Service-ID 0x00000000010648CA. > +The following two match rules are equivalent: > + > + rds : > + any, service-id 0x00000000010648CA : > + > +\fBiSER\fP: Similar to RDS, iSER query is matched by Service ID, where the > +Service ID is also 0x000000000106PPPP. Default port number for iSER is 0x035C, > +which makes a default Service-ID 0x000000000106035C. > +The following two match rules are equivalent: > + > + iser : > + any, service-id 0x000000000106035C : > + > +\fBSRP\fP: Service ID for SRP varies from storage vendor to vendor, thus SRP query is > +matched by the target IB port GUID. 
The following two match rules are > +equivalent: > + > + srp, target-port-guid 0x1234 : > + any, target-port-guid 0x1234 : > + > +Note that any of the above ULPs might contain target port GUID in the PR > +query, so in order for these queries not to be recognized by the QoS manager > +as SRP, the SRP match rule (or any match rule that refers to the target port > +guid only) should be placed at the end of the qos-ulps match rules. > + > +\fBMPI\fP: SL for MPI is manually configured by MPI admin. OpenSM is not > +forcing any SL on the MPI traffic, and that's why it is the only ULP that > +did not appear in the qos-ulps section. > + > + > +.SS SL2VL Mapping and VL Arbitration > +.PP > + > +OpenSM cached options file has a set of QoS related configuration > +parameters, that are used to configure SL2VL mapping and VL arbitration > +on IB ports. These parameters are: > + - Max VLs: the maximum number of VLs that will be on the subnet. > + - High limit: the limit of High Priority component of VL Arbitration > + table (IBA 7.6.9). > + - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template. > + - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template. > + - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs > + corresponding to SLs 0-15 (Note that VL15 used here means drop this SL). > + > +There are separate QoS configuration parameters sets for various target > +types: CAs, routers, switch external ports, and switch's enhanced port 0. > +The names of such parameters are prefixed by "qos__" string. > +Here is a full list of the currently supported sets: > + > + qos_ca_ - QoS configuration parameters set for CAs. > + qos_rtr_ - parameters set for routers. > + qos_sw0_ - parameters set for switches' port 0. > + qos_swe_ - parameters set for switches' external ports. > + > +Here's the example of typical default values for all the ports in the > +subnet (hard-coded in OpenSM initialization): > + > + qos_max_vls=15 > + qos_high_limit=0 > + qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0 > + qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4 > + qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 > + > + > +VL arbitration tables (both high and low) are lists of VL/Weight pairs. > +Each list entry contains a VL number (values from 0-14), and a weighting value > +(values 0-255), indicating the number of 64 byte units (credits) which may be > +transmitted from that VL when its turn in the arbitration occurs. A weight > +of 0 indicates that this entry should be skipped. If a list entry is > +programmed for VL15 or for a VL that is not supported or is not currently > +configured by the port, the port may either skip that entry or send from any > +supported VL for that entry. > + > +Note, that the same VLs may be listed multiple times in the High or Low > +priority arbitration tables, and, further, it can be listed in both tables. > + > +The limit of high-priority VLArb table (qos__high_limit) indicates the > +number of high-priority packets that can be transmitted without an opportunity > +to send a low-priority packet. Specifically, the number of bytes that can be > +sent is high_limit times 4K bytes. > + > +A high_limit value of 255 indicates that the byte limit is unbounded. > +Note: if the 255 value is used, the low priority VLs may be starved. > +A value of 0 indicates that only a single packet from the high-priority table > +may be sent before an opportunity is given to the low-priority table. 
> + > +Keep in mind that ports usually transmit packets of size equal to MTU. > +For instance, for 4KB MTU a single packet will require 64 credits, so in order > +to achieve effective VL arbitration for packets of 4KB MTU, the weighting > +values for each VL should be multiples of 64. > + > +Below is an example of SL2VL and VL Arbitration configuration on subnet: > + > + qos_max_vls=15 > + qos_high_limit=6 > + qos_vlarb_high=0:4 > + qos_vlarb_low=0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64 > + qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7 > + > +In this example, there are 8 VLs configured on subnet: VL0 to VL7. VL0 is > +defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single > +transmission burst. Such configuration would suilt VL that needs low latency > +and uses small MTU when transmitting packets. Rest of VLs are defined as low > +priority VLs with different weights, while VL4 is effectively turned off. > > .SH PREFIX ROUTES > .PP From hrosenstock at xsigo.com Tue May 6 07:46:12 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 06 May 2008 07:46:12 -0700 Subject: [ofa-general] IBTA Compliance- Mkey Violations trap In-Reply-To: <04d301c8af80$fd1a19c0$3414a8c0@md.baymicrosystems.com> References: <47E8C032.2050907@dev.mellanox.co.il> <1210079815.27137.190.camel@hrosenstock-ws.xsigo.com> <48205FE8.3070104@dev.mellanox.co.il> <04d301c8af80$fd1a19c0$3414a8c0@md.baymicrosystems.com> Message-ID: <1210085172.2026.61.camel@hrosenstock-ws.xsigo.com> Suri, On Tue, 2008-05-06 at 09:56 -0400, Suresh Shelvapille wrote: > Folks: > > WRT sending traps on MkeyViolations, the spec is a little ambiguous IMO. > In section 14.2.4.2 C-14-18 is it saying to send a trap the first time the Mkey Violation > happens and the Lease Expiry timer is started or to send a trap(possibly multiple of them) > every time Mkey Violation happens even though the lease timer may have been started already? I think that it's every time (independent of whether the lease countdown has already been started or not): See o14-9. Note also that there is a max trap rate requirement though per PortInfo:SubnetTimeOut. If your company is an IBTA member, a better place for this inquiry is on the mgtwg mailing list IMO. -- Hal > What is the consensus in this group? > > Many thanks, > Suri > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From andrea at qumranet.com Tue May 6 07:46:54 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Tue, 6 May 2008 16:46:54 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080505194625.GA17734@sgi.com> References: <1489529e7b53d3f2dab8.1209740704@duo.random> <20080505162113.GA18761@sgi.com> <20080505171434.GF8470@duo.random> <20080505172506.GA9247@sgi.com> <20080505183405.GI8470@duo.random> <20080505194625.GA17734@sgi.com> Message-ID: <20080506144654.GD8471@duo.random> On Mon, May 05, 2008 at 02:46:25PM -0500, Jack Steiner wrote: > If a task fails to unmap a GRU segment, they still exist at the start of Yes, this will also happen in case the well behaved task receives SIGKILL, so you can test it that way too. > exit. On the ->release callout, I set a flag in the container of my > mmu_notifier that exit has started. As VMA are cleaned up, TLB flushes > are skipped because of the flag is set. 
When the GRU VMA is deleted, I free GRU TLB flushes aren't skipped because your flag is set but because __mmu_notifier_release already executed list_del_init_rcu(&grunotifier->hlist) before proceeding with unmap_vmas. > my structure containing the notifier. As long as nobody can write through the already established gru tlbs and nobody can establish new tlbs after exit_mmap run you don't strictly need ->release. > I _think_ works. Do you see any problems? You can remove the flag and ->release and ->clear_flush_young (if you keep clear_flush_young implemented it should return 0). The synchronize_rcu after mmu_notifier_register can also be dropped thanks to mm_lock(). gru_drop_mmu_notifier should be careful with current->mm if you're using an fd and if the fd can be passed to a different task through unix sockets (you should probably fail any operation if current->mm != gru->mm). The way I use ->release in KVM is to set the root hpa to -1UL (invalid) as a debug trap. That's only for debugging because even if tlb entries and sptes are still established on the secondary mmu they are only relevant when the cpu jumps to guest mode and that can never happen again after exit_mmap is started. > I should also mention that I have an open-coded function that possibly > belongs in mmu_notifier.c. A user is allowed to have multiple GRU segments. > Each GRU has a couple of data structures linked to the VMA. All, however, > need to share the same notifier. I currently open code a function that > scans the notifier list to determine if a GRU notifier already exists. > If it does, I update a refcnt & use it. Otherwise, I register a new > one. All of this is protected by the mmap_sem. > > Just in case I mangled the above description, I'll attach a copy of the GRU mmuops > code. Well that function needs fixing w.r.t. srcu. Are you sure you want to search for mn->ops == gru_mmuops and not for mn == gmn? And if you search for mn why can't you keep track of the mn being registered or unregistered outside of the mmu_notifier layer? Set a bitflag in the container after mmu_notifier_register returns and a clear it after _unregister returns. I doubt saving one bitflag is worth searching the list and your approach make it obvious that you've to protect the bitflag and the register/unregister under write-mmap_sem yourself. Otherwise the find function will return an object that can be freed at any time if somebody calls unregister and kfree. (synchronize_srcu in mmu_notifier_unregister won't wait for anything but some outstanding srcu_read_lock) From kliteyn at dev.mellanox.co.il Tue May 6 07:49:19 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 06 May 2008 17:49:19 +0300 Subject: [ofa-general] [Fwd: Re: [PATCH] opensm/man: Adding QoS-related info to opensm man pages] Message-ID: <48206FEF.6020602@dev.mellanox.co.il> And this is Sasha's response. -- Yevgeny -------- Original Message -------- Subject: Re: [PATCH] opensm/man: Adding QoS-related info to opensm man pages Date: Mon, 31 Mar 2008 10:52:13 +0000 From: Sasha Khapyorsky To: Yevgeny Kliteynik CC: OpenIB References: <47E99D0C.7040403 at dev.mellanox.co.il> Hi Yevgeny, On 02:47 Wed 26 Mar , Yevgeny Kliteynik wrote: > > I've added QoS related info to opensm man pages: enhanced > existing part (that was talking about VL arbitration) I see that this part was fully rewritten. And IMO it is less clear now than originally was (some comments are below). Any reason to not start enhancements from existing text? 
> and
> added description of QoS manager in accordance with QoS annex.
>
> Please apply to ofed_1_3 and master.

Comments are below.

>
> Signed-off-by: Yevgeny Kliteynik
> ---
>  opensm/man/opensm.8.in |  501 +++++++++++++++++++++++++++++++++++++++++++-----
>  1 files changed, 457 insertions(+), 44 deletions(-)
>
> diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
> index 5322ab7..1d9c5b7 100644
> --- a/opensm/man/opensm.8.in
> +++ b/opensm/man/opensm.8.in
> @@ -35,7 +35,8 @@ to initialize the InfiniBand hardware (at least one per each
>  InfiniBand subnet).
>
>  opensm also now contains an experimental version of a performance
> -manager as well.
> +manager and an experimental version of a QoS manager (in accordance
> +with the IBA QoS Annex).
>
>  opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB
>  fabric, initialize it, and sweep occasionally for changes.
> @@ -433,51 +434,463 @@ partition manager:
>
>  	Default=0x7fff,ipoib:ALL=full;
>
> -.SH QOS CONFIGURATION
> +.SH QUALITY OF SERVICE
>  .PP
> -There are a set of QoS related low-level configuration parameters.
> -All these parameter names are prefixed by "qos_" string. Here is a full
> -list of these parameters:
> -
> - qos_max_vls - The maximum number of VLs that will be on the subnet
> - qos_high_limit - The limit of High Priority component of VL
> -                  Arbitration table (IBA 7.6.9)
> - qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9)
> -                 template
> - qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
> -                  template
> -                  Both VL arbitration templates are pairs of
> -                  VL and weight
> - qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is
> -             a list of VLs corresponding to SLs 0-15 (Note
> -             that VL15 used here means drop this SL)
> -
> -Typical default values (hard-coded in OpenSM initialization) are:
> -
> - qos_max_vls=15
> - qos_high_limit=0
> - qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> - qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> - qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> -
> -The syntax is compatible with rest of OpenSM configuration options and
> -values may be stored in OpenSM config file (cached options file).
> -
> -In addition to the above, we may define separate QoS configuration
> -parameters sets for various target types. As targets, we currently support
> -CAs, routers, switch external ports, and switch's enhanced port 0. The
> -names of such specialized parameters are prefixed by "qos__"
> -string. Here is a full list of the currently supported sets:
> -
> - qos_ca_ - QoS configuration parameters set for CAs.
> - qos_rtr_ - parameters set for routers.
> - qos_sw0_ - parameters set for switches' port 0.
> - qos_swe_ - parameters set for switches' external ports.
> +OpenSM QoS support comprises two parts:
>
> -Examples:
> - qos_sw0_max_vls=2
> - qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
> - qos_swe_high_limit=0
> + 1. \fBQoS manager in accordance with IBA QoS Annex\fP (experimental)
> +.P
> + 2. \fBSL2VL and VL Arbitration tables configuration\fP
> +.P
> +.SS QoS Manager (experimental)
> +.PP
> +When Quality of Service in OpenSM is enabled (-Q or --qos), OpenSM looks
> +for the QoS Policy file. The default name of this file is
> +\fB\%@CONF_DIR@/@QOS_POLICY_FILE@\fP. The default may be changed by using
> +the -Y or --qos_policy_file option with OpenSM.
> +
> +During fabric initialization and at every heavy sweep OpenSM parses the
> +QoS policy file, applies its settings to the discovered fabric elements,
> +and enforces the provided policy on client requests. The overall flow for
> +such requests is as follows:
> + - The request is matched against the defined matching rules such that
> +   the QoS Level definition is found.
> + - Given the QoS Level, path(s) search is performed with the given
> +   restrictions imposed by that level.
> +
> +There are two ways to define QoS policy:
> + - \fBFull\fP: the full policy file syntax provides the administrator various
> +   ways to match a PathRecord/MultiPathRecord (PR/MPR) request, and to
> +   enforce various QoS constraints on the requested PR/MPR.
> + - \fBSimplified\fP: the simplified policy file syntax enables the
> +   administrator to match PR/MPR requests by various ULPs and applications
> +   running on top of these ULPs.
> +
> +While the full policy syntax is very flexible, in many cases the simplified
> +policy definition would be sufficient.
> +.PP
> +.B Full QoS Policy File
> +.PP
> +The QoS policy file has the following sections:
> +
> +.B I)
> +Port Groups (denoted by port-groups).
> +This section defines zero or more port groups that can be referred to later
> +by matching rules (see below). A port group lists ports by:
> + - Port GUID
> + - Port name, which is a combination of NodeDescription and IB port number
> + - PKey, which means that all the ports in the subnet that belong to a
> +   partition with a given PKey belong to this port group
> + - Partition name, which means that all the ports in the subnet that belong
> +   to a partition with a given name belong to this port group
> + - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and
> +   SELF (SM's port).
> +
> +.B II)
> +QoS Setup (denoted by qos-setup).
> +This section describes how to set up SL2VL and VL Arbitration tables on
> +various nodes in the fabric.
> +However, this is not supported in OFED 1.3.

Here and below. I would prefer not to refer to OFED versions (OpenSM can
be used as part of OFED or independently, or OFED/OpenSM versions can be
mixed); something like "this version of OpenSM" looks more appropriate
to me.

> +SL2VL and VLArb tables should be configured in the OpenSM options file.
> +
> +.B III)
> +QoS Levels (denoted by qos-levels).
> +Each QoS Level defines Service Level (SL) and a few optional fields:
> + - MTU limit
> + - Rate limit
> + - PKey
> + - Packet lifetime
> +
> +When path(s) search is performed, it is done with regard to the restrictions
> +that these QoS Level parameters impose.
> +One QoS level that is mandatory to define is a DEFAULT QoS level. It is
> +applied to a PR/MPR query that does not match any existing match rule.
> +Similar to any other QoS Level, it can also be explicitly referred to by any
> +match rule.

Shouldn't this paragraph be placed after IV)? It refers to matching
rules, which are defined below. Or maybe even merged with IV?

> +
> +.B IV)
> +QoS Matching Rules (denoted by qos-match-rules).
> +Each PathRecord/MultiPathRecord query that OpenSM receives is matched against
> +the set of matching rules. Rules are scanned in order of appearance in the
> +QoS policy file such that the first match takes precedence.
> +Each rule has the name of the QoS level that will be applied to the matching
> +query.
> +A default QoS level is applied to a query that did not match any rule.
> +Queries can be matched by:
> + - Source port group (whether a source port is a member of a specified group)
> + - Destination port group (same as above, only for destination port)
> + - PKey
> + - QoS class
> + - Service ID
> +
> +To match a certain matching rule, a PR/MPR query has to match ALL of the
> +rule's criteria. However, not all the fields of the PR/MPR query have to
> +appear in the matching rule.
> +For instance, if the rule has a single criterion - Service ID, it will match
> +any query that has this Service ID, disregarding the rest of the query
> +fields. However, if a certain query has only Service ID (which means that
> +this is the only bit in the PR/MPR component mask that is on), it will not
> +match any rule that has other matching criteria besides Service ID.
> +.PP
> +.B Simplified QoS Policy Definition
> +.PP
> +Simplified QoS policy definition comprises a single section denoted by
> +qos-ulps. Similar to the full QoS policy, it has a list of match rules and
> +their QoS Level, but in this case a match rule has only one criterion - its
> +goal is to match a certain ULP (or a certain application on top of this ULP)
> +PR/MPR request, and the QoS Level has only one constraint - Service Level
> +(SL).
> +The simplified policy section may appear in the policy file in combination
> +with the full policy, or as a stand-alone policy definition.
> +See more details and the list of match rule criteria below.

What about merging this paragraph with the Simplified QoS policy
description below? Here it looks like duplication.

> +.PP
> +.B Policy File Syntax Guidelines
> +.PP
> +Leading and trailing blanks, as well as empty lines, are ignored, so the
> +indentation in the example is just for better readability.
> +Comments are started with the pound sign (#) and terminated by EOL.
> +Any keyword should be the first non-blank in the line, unless it's a comment.
> +Keywords that denote section/subsection start have matching closing keywords.
> +Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR
> +requests that didn't match any of the matching rules.
> +Any section/subsection of the policy file is optional.

And should this paragraph be moved above the 'Full QoS Policy File'
section?

> +
> +.PP
> +.B Examples of Full Policy File
> +.PP
> +As mentioned earlier, any section of the policy file is optional, and
> +the only mandatory part of the policy file is a default QoS Level.
> +Here's an example of the shortest policy file:
> +
> +    qos-levels
> +        qos-level
> +            name: DEFAULT
> +            sl: 0
> +        end-qos-level
> +    end-qos-levels
> +
> +The port groups section is missing because there are no match rules, which
> +means that port groups are not referred to anywhere, and there is no need
> +to define them. And since this policy file doesn't have any matching rules,
> +a PR/MPR query won't match any rule, and OpenSM will enforce the default
> +QoS level.
> +Essentially, the above example is equivalent to not having a QoS policy
> +file at all.
> +
> +The following example shows all the possible options and keywords in the
> +policy file and their syntax:
> +
> +    #
> +    # See the comments in the following example.
> +    # They explain different keywords and their meaning.
> +    #
> +    port-groups
> +        port-group
> +            name: Storage
> +            # "use" is just a description that is used for logging
> +            # Other than that, it is just a comment
> +            use: SRP Targets
> +            port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
> +            port-guid: 0x1000000000FFFF
> +        end-port-group
> +
> +        port-group
> +            name: Virtual Servers
> +            # The syntax of the port name is as follows:
> +            #   "node_description/Pnum".
> +            # node_description is compared to the NodeDescription of the node,
> +            # and "Pnum" is a port number on that node.
> +            port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
> +        end-port-group
> +
> +        # using partitions defined in the partition policy
> +        port-group
> +            name: Partitions
> +            partition: Part1
> +            pkey: 0x1234
> +        end-port-group
> +
> +        # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
> +        # or ALL (for all the nodes in the subnet)
> +        port-group
> +            name: CAs and SM
> +            node-type: CA, SELF
> +        end-port-group
> +
> +    end-port-groups
> +
> +    qos-setup
> +        # This section of the policy file describes how to set up SL2VL and VL
> +        # Arbitration tables on various nodes in the fabric.
> +        # However, this is not supported in OFED 1.3 - the section is parsed
> +        # and ignored. SL2VL and VLArb tables should be configured in the
> +        # OpenSM options file (by default - /var/cache/opensm/opensm.opts).
> +    end-qos-setup
> +
> +    qos-levels
> +
> +        # Having a QoS Level named "DEFAULT" is a must - it is applied to
> +        # PR/MPR requests that didn't match any of the matching rules.
> +        qos-level
> +            name: DEFAULT
> +            use: default QoS Level
> +            sl: 0
> +        end-qos-level
> +
> +        # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime
> +        qos-level
> +            name: WholeSet
> +            sl: 1
> +            mtu-limit: 4
> +            rate-limit: 5
> +            pkey: 0x1234
> +            packet-life: 8
> +        end-qos-level
> +
> +    end-qos-levels
> +
> +    # Match rules are scanned in order of their appearance in the policy file.
> +    # First matched rule takes precedence.
> +    qos-match-rules
> +
> +        # matching by single criteria: QoS class
> +        qos-match-rule
> +            use: by QoS class
> +            qos-class: 7-9,11
> +            # Name of qos-level to apply to the matching PR/MPR
> +            qos-level-name: WholeSet
> +        end-qos-match-rule
> +
> +        # show matching by destination group and service id
> +        qos-match-rule
> +            use: Storage targets
> +            destination: Storage
> +            service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF
> +            qos-level-name: WholeSet
> +        end-qos-match-rule
> +
> +        qos-match-rule
> +            source: Storage
> +            use: match by source group only
> +            qos-level-name: DEFAULT
> +        end-qos-match-rule
> +
> +        qos-match-rule
> +            use: match by all parameters
> +            qos-class: 7-9,11
> +            source: Virtual Servers
> +            destination: Storage
> +            service-id: 0x0000000000010000-0x000000000001FFFF
> +            pkey: 0x0F00-0x0FFF
> +            qos-level-name: WholeSet
> +        end-qos-match-rule
> +
> +    end-qos-match-rules
> +
> +.PP
> +.B Simplified QoS Policy - Details and Examples
> +.PP
> +Simplified QoS policy match rules are tailored for matching PR/MPR
> +requests of ULPs (or of a certain application on top of a ULP). It has a
> +list of per-ULP (or per-application) match rules and the SL that should
> +be enforced on the matched PR/MPR query.
> +
> +Match rules include:
> + - Default match rule that is applied to a PR/MPR query that didn't
> +   match any of the other match rules
> + - SDP
> + - SDP application with a specific target TCP/IP port range
> + - SRP with a specific target IB port GUID
> + - RDS
> + - iSER
> + - iSER application with a specific target TCP/IP port range
> + - IPoIB with a default PKey
> + - IPoIB with a specific PKey
> + - any ULP/application with a specific Service ID in the PR/MPR query
> + - any ULP/application with a specific PKey in the PR/MPR query
> + - any ULP/application with a specific target IB port GUID in the PR/MPR query

Are there duplicated entries (SDP, iSER, IPoIB)?

> +
> +Since any section of the policy file is optional, as long as the basic rules
> +of the file are kept (such as not referring to a nonexistent port group,
> +having a default QoS Level, etc.), the simplified policy section (qos-ulps)
> +can serve as a complete QoS policy file.
> +The shortest policy file in this case would be as follows:
> +
> +    qos-ulps
> +        default : 0 #default SL
> +    end-qos-ulps
> +
> +It is equivalent to not having a policy file at all.
> +
> +Below is an example of simplified QoS policy with all the possible keywords:
> +
> +    qos-ulps
> +        default                      : 0 # default SL
> +        sdp, port-num 30000          : 0 # SL for application running on top
> +                                         # of SDP when a destination
> +                                         # TCP/IP port is 30000
> +        sdp, port-num 10000-20000    : 0
> +        sdp                          : 1 # default SL for any other
> +                                         # application running on top of SDP
> +        rds                          : 2 # SL for RDS traffic
> +        iser, port-num 900           : 0 # SL for iSER with a specific target
> +                                         # port
> +        iser                         : 3 # default SL for iSER
> +        ipoib, pkey 0x0001           : 0 # SL for IPoIB on partition with
> +                                         # pkey 0x0001
> +        ipoib                        : 4 # default IPoIB partition,
> +                                         # pkey=0x7FFF
> +        any, service-id 0x6234       : 6 # match any PR/MPR query with a
> +                                         # specific Service ID
> +        any, pkey 0x0ABC             : 6 # match any PR/MPR query with a
> +                                         # specific PKey
> +        srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on
> +                                         # a specified IB port GUID
> +        any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query
> +                                         # with a specific target port GUID
> +    end-qos-ulps
> +
> +
> +Similar to the full policy definition, matching of PR/MPR queries is done in
> +order of appearance in the QoS policy file such that the first match takes
> +precedence, except for the "default" rule, which is applied only if the query
> +didn't match any other rule.
> +
> +All other sections of the QoS policy file take precedence over the qos-ulps
> +section. That is, if a policy file has both qos-match-rules and qos-ulps
> +sections, then any query is matched first against the rules in the
> +qos-match-rules section, and only if there was no match, the query is matched
> +against the rules in the qos-ulps section.
> +
> +Note that some of these match rules may overlap, so in order to use the
> +simplified QoS definition effectively, it is important to understand how each
> +of the ULPs is matched:
> +
> +.B IPoIB:
> +A PR query is matched by PKey. The default PKey for an IPoIB partition is
> +0x7fff, so the following three match rules are equivalent:
> +
> +    ipoib              :
> +    ipoib, pkey 0x7fff :
> +    any,   pkey 0x7fff :
> +
> +.I Note
> +: For OFED 1.3, IPoIB partition SL configuration should be done through
> +the partition configuration file only.
> +
> +\fBSDP\fP: A PR query is matched by Service ID. The Service-ID for SDP is
> +0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP
> +Port Number to connect to.
> +The following two match rules are equivalent:
> +
> +    sdp :
> +    any, service-id 0x0000000000010000-0x000000000001ffff :
> +
> +\fBRDS\fP: Similar to SDP, an RDS PR query is matched by Service ID. The
> +Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits
> +holding the remote TCP/IP Port Number to connect to. The default port number
> +for RDS is 0x48CA, which makes the default Service-ID 0x00000000010648CA.
> +The following two match rules are equivalent:
> +
> +    rds :
> +    any, service-id 0x00000000010648CA :
> +
> +\fBiSER\fP: Similar to RDS, an iSER query is matched by Service ID, where the
> +Service ID is also 0x000000000106PPPP. The default port number for iSER is
> +0x035C, which makes the default Service-ID 0x000000000106035C.
> +The following two match rules are equivalent:
> +
> +    iser :
> +    any, service-id 0x000000000106035C :
> +
> +\fBSRP\fP: The Service ID for SRP varies from storage vendor to vendor, thus
> +an SRP query is matched by the target IB port GUID. The following two match
> +rules are equivalent:
> +
> +    srp, target-port-guid 0x1234 :
> +    any, target-port-guid 0x1234 :
> +
> +Note that any of the above ULPs might contain a target port GUID in the PR
> +query, so in order for these queries not to be recognized by the QoS manager
> +as SRP, the SRP match rule (or any match rule that refers to the target port
> +guid only) should be placed at the end of the qos-ulps match rules.
> +
> +\fBMPI\fP: SL for MPI is manually configured by the MPI admin. OpenSM is not
> +forcing any SL on the MPI traffic, and that's why it is the only ULP that
> +does not appear in the qos-ulps section.
> +
> +
> +.SS SL2VL Mapping and VL Arbitration
> +.PP
> +
> +The OpenSM cached options file has a set of QoS-related configuration
> +parameters that are used to configure SL2VL mapping and VL arbitration
> +on IB ports. These parameters are:
> + - Max VLs: the maximum number of VLs that will be on the subnet.
> + - High limit: the limit of the High Priority component of the VL Arbitration
> +   table (IBA 7.6.9).
> + - VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template.
> + - VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template.
> + - SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs
> +   corresponding to SLs 0-15 (Note that VL15 used here means drop this SL).

Why were the configuration keywords removed here? OpenSM doesn't know
what "Max VLs" is, but it knows "max_vls".

> +There are separate QoS configuration parameter sets for various target
> +types:

This is optional...

> CAs, routers, switch external ports, and switch's enhanced port 0.
> +The names of such parameters are prefixed by the "qos__" string.
> +Here is a full list of the currently supported sets:
> +
> + qos_ca_  - QoS configuration parameters set for CAs.
> + qos_rtr_ - parameters set for routers.
> + qos_sw0_ - parameters set for switches' port 0.
> + qos_swe_ - parameters set for switches' external ports.
> +
> +Here's an example of typical default values for all the ports in the
> +subnet (hard-coded in OpenSM initialization):
> +
> +    qos_max_vls=15
> +    qos_high_limit=0
> +    qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> +    qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> +    qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> +
> +
> +VL arbitration tables (both high and low) are lists of VL/Weight pairs.
> +Each list entry contains a VL number (values from 0-14), and a weighting value
> +(values 0-255), indicating the number of 64 byte units (credits) which may be
> +transmitted from that VL when its turn in the arbitration occurs. A weight
> +of 0 indicates that this entry should be skipped. If a list entry is
> +programmed for VL15 or for a VL that is not supported or is not currently
> +configured by the port, the port may either skip that entry or send from any
> +supported VL for that entry.
> +
> +Note that the same VL may be listed multiple times in the High or Low
> +priority arbitration tables, and, further, it can be listed in both tables.

Here and below. Do we need to rewrite the IBA spec in the man pages?
Those parameters are not invented by the OpenSM implementation; they
are port parameters as defined in the IBA spec.

> +
> +The limit of the high-priority VLArb table (qos__high_limit) indicates
> +the number of high-priority packets that can be transmitted without an
> +opportunity to send a low-priority packet. Specifically, the number of bytes
> +that can be sent is high_limit times 4K bytes.
> +
> +A high_limit value of 255 indicates that the byte limit is unbounded.
> +Note: if the 255 value is used, the low priority VLs may be starved.
> +A value of 0 indicates that only a single packet from the high-priority table
> +may be sent before an opportunity is given to the low-priority table.
> +
> +Keep in mind that ports usually transmit packets of size equal to MTU.
> +For instance, for 4KB MTU a single packet will require 64 credits, so in order
> +to achieve effective VL arbitration for packets of 4KB MTU, the weighting
> +values for each VL should be multiples of 64.
> +
> +Below is an example of SL2VL and VL Arbitration configuration on a subnet:
> +
> +    qos_max_vls=15
> +    qos_high_limit=6
> +    qos_vlarb_high=0:4
> +    qos_vlarb_low=0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
> +    qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> +
> +In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is
> +defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single
> +transmission burst. Such a configuration would suit a VL that needs low
> +latency and uses small MTU when transmitting packets. The rest of the VLs are
> +defined as low priority VLs with different weights, while VL4 is effectively
> +turned off.
> 
> .SH PREFIX ROUTES
> .PP

And finally, due to the huge size of the QoS description, wouldn't it
be useful to move it below the other (shorter) sections of the man page?

Sasha

From hrosenstock at xsigo.com  Tue May  6 07:53:08 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Tue, 06 May 2008 07:53:08 -0700
Subject: [ofa-general] Re: [Fwd: Re: [PATCH] opensm/man: Adding QoS-related
	info to opensm man pages]
In-Reply-To: <48206FEF.6020602@dev.mellanox.co.il>
References: <48206FEF.6020602@dev.mellanox.co.il>
Message-ID: <1210085588.2026.63.camel@hrosenstock-ws.xsigo.com>

On Tue, 2008-05-06 at 17:49 +0300, Yevgeny Kliteynik wrote:
> And this is Sasha's response.

Thanks. Looks like there's some work to do here to address the
comments. Also, a number of those comments apply to the doc just
submitted as a patch.

-- Hal

> -- Yevgeny

From tziporet at mellanox.co.il  Tue May  6 08:44:51 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 6 May 2008 18:44:51 +0300
Subject: [ofa-general] OFED May 5 meeting summary
Message-ID: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com>

May 5 OFED meeting summary:
===========================
1. OFED 1.3.1:

   1.1 Status of changes:
       IB-bonding - on work
       SRP failover - done (needs more testing)
       SDP crashes - on work (not clear if we will have something on time)
       RDS fixes for RDMA API - done
       librdmacm 1.0.7 - done
       uDAPL updates - done
       Open MPI 1.2.6 - done
       MVAPICH 1.0.1 - done
       MVAPICH2 1.0.3 - done
       IPoIB - 2 bugs fixed; there are still two issues that should be
               resolved
       Low level drivers - changes already committed: nes, mlx4, cxgb3,
               ehca

   1.2 Schedule:
       rc1 - was released today
       rc2 - May 20
       GA  - May 29

   1.3 Discussion:
       - The ipath driver is going to be updated
       - There is an issue with bonding and Ethernet drivers on RHEL4 -
         under debug
       - We wish to add support for SLES10 SP2; we already got approval
         from Novell. Any volunteer to provide the new backport patches?

2. OFED 1.4: Updated that the new tree will be ready next week - based
   on 2.6.26-rc

3. Update on the OpenSuSE build system - Yiftah updated on the work that
   has been done and on open problems:
   - The system requires clean RPMs only (no use of install scripts) -
     they are working to resolve this
   - We target this system toward releases (and not to replace the daily
     build system)
   - We may try it now with OFED 1.3.1

Tziporet

From swise at opengridcomputing.com  Tue May  6 10:02:30 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 06 May 2008 12:02:30 -0500
Subject: [ofa-general] [PATCH] Request For Comments:
Message-ID: <20080506170230.11409.43625.stgit@dell3.ogc.int>

From: Steve Wise

Here is the top level API change I'm proposing for enabling interoperable
peer2peer mode for iwarp. I want to get agreement on how to expose
this to the application before posting more of the gritty details of
the kernel driver changes needed. The plan is to include this support
in linux-2.6.27 + ofed-1.4.

Does this require an ABI bump?

Note: We could do this several ways. I'm proposing one with this
uncompiled patch. The downside of my proposal is that applications have
to change to turn this on. However, I'm not sure that's too painful.
We would have OMPI turn it on, and maybe even uDAPL so that all uDAPL
ULPs would get it (IMPI, dapltest, HPMPI).

Alternative designs:

- always do peer2peer and don't let the app choose. This forces
the overhead of p2p mode on all apps, but preserves the API.

- use an environment variable that librdmacm will query. This doesn't
force p2p, and has the benefit of not changing the API. But at the
expense of adding environment variables to the rdma-cm model. This is
used extensively in MPIs and even DAPL. I think it's an alternative
we should consider. This approach, however, doesn't help kernel
applications.

Steve.

-----

Peer2peer support in librdmacm.

User applications can set a new u8 boolean named peer2peer_mode in the
rdma_conn_param struct to indicate if they require peer2peer mode
support. This means they don't enforce the "client must send first"
iwarp requirement in their own application logic. If they set
peer2peer_mode to 1, then the iwarp CM and drivers will handle this
requirement. Applications that don't require this should set
peer2peer_mode to 0 to reduce the message exchange done at iwarp
connection setup.
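For illustration, here is a minimal, uncompiled sketch of how an
active-side client might enable the proposed mode. The rdma_cm_id setup
(rdma_resolve_addr()/rdma_resolve_route() and QP creation) is assumed to
have happened already; peer2peer_mode is only the field proposed by this
patch, not an existing librdmacm API:

	/* Sketch: request peer2peer mode on rdma_connect().
	 * "id" is a struct rdma_cm_id * that has already resolved its
	 * address and route and has a QP attached.
	 */
	struct rdma_conn_param conn_param;

	memset(&conn_param, 0, sizeof conn_param);
	conn_param.initiator_depth = 1;
	conn_param.responder_resources = 1;
	conn_param.retry_count = 7;
	conn_param.peer2peer_mode = 1;	/* proposed: CM/driver handle the RTR */

	if (rdma_connect(id, &conn_param))
		perror("rdma_connect");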
Signed-off-by: Steve Wise
---

 include/rdma/rdma_cma.h     |    1 +
 include/rdma/rdma_cma_abi.h |    1 +
 src/cma.c                   |    2 ++
 3 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/rdma/rdma_cma.h b/include/rdma/rdma_cma.h
index 76df90f..943aa45 100644
--- a/include/rdma/rdma_cma.h
+++ b/include/rdma/rdma_cma.h
@@ -118,6 +118,7 @@ struct rdma_conn_param {
 	uint8_t flow_control;
 	uint8_t retry_count;		/* ignored when accepting */
 	uint8_t rnr_retry_count;
+	uint8_t peer2peer_mode;
 	/* Fields below ignored if a QP is created on the rdma_cm_id. */
 	uint8_t srq;
 	uint32_t qp_num;
diff --git a/include/rdma/rdma_cma_abi.h b/include/rdma/rdma_cma_abi.h
index 1a3a9c2..5914aaa 100644
--- a/include/rdma/rdma_cma_abi.h
+++ b/include/rdma/rdma_cma_abi.h
@@ -140,6 +140,7 @@ struct ucma_abi_conn_param {
 	__u8 retry_count;
 	__u8 rnr_retry_count;
 	__u8 valid;
+	__u8 peer2peer_mode;
 };
 
 struct ucma_abi_ud_param {
diff --git a/src/cma.c b/src/cma.c
index fc98c8f..dbbb2e8 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -844,6 +844,7 @@ static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst,
 	dst->retry_count = src->retry_count;
 	dst->rnr_retry_count = src->rnr_retry_count;
 	dst->valid = 1;
+	dst->peer2peer_mode = src->peer2peer_mode;
 
 	if (src->private_data && src->private_data_len) {
 		memcpy(dst->private_data, src->private_data,
@@ -1261,6 +1262,7 @@ static void ucma_copy_conn_event(struct cma_event *event,
 	dst->rnr_retry_count = src->rnr_retry_count;
 	dst->srq = src->srq;
 	dst->qp_num = src->qp_num;
+	dst->peer2peer_mode = src->peer2peer_mode;
 }
 
 static void ucma_copy_ud_event(struct cma_event *event,

From sean.hefty at intel.com  Tue May  6 10:31:37 2008
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 6 May 2008 10:31:37 -0700
Subject: [ofa-general] RE: [PATCH] Request For Comments:
In-Reply-To: <20080506170230.11409.43625.stgit@dell3.ogc.int>
References: <20080506170230.11409.43625.stgit@dell3.ogc.int>
Message-ID: <000b01c8af9f$0a3b79f0$bafc070a@amr.corp.intel.com>

Thanks for looking at this.

>Here is the top level API change I'm proposing for enabling interoperable
>peer2peer mode for iwarp. I want to get agreement on how to expose
>this to the application before posting more of the gritty details of
>the kernel driver changes needed. The plan is to include this support
>in linux-2.6.27 + ofed-1.4.

I don't have a better idea what to call this, but when I think of peer
to peer, I think of that as the connection model, not a channel usage
restriction.

>Does this require an ABI bump?

I'd like to avoid breaking the ABI or userspace API if possible.

>Note: We could do this several ways. I'm proposing one with this
>uncompiled patch. The downside of my proposal is that applications have
>to change to turn this on. However, I'm not sure that's too painful.
>We would have OMPI turn it on, and maybe even uDAPL so that all uDAPL
>ULPs would get it (IMPI, dapltest, HPMPI).

We could use rdma_set_option() for this. If we do go the route of
changing the rdma_conn_param, adding generic flags or options would be
more extensible.

>- always do peer2peer and don't let the app choose. This forces
>the overhead of p2p mode on all apps, but preserves the API.

If we use rdma_set_option, I guess we could always enable it by
default, and let apps disable it. I'm unsure if the better default is
avoiding the overhead or making the API easier to use, but I'm leaning
toward the latter in this case.

>- use an environment variable that librdmacm will query. This doesn't
>force p2p, and has the benefit of not changing the API. But at the
>expense of adding environment variables to the rdma-cm model. This is
>used extensively in MPIs and even DAPL. I think it's an alternative
>we should consider. This approach, however, doesn't help kernel
>applications.

I'm not thrilled with this idea, although I'm fine with the kernel
solution being different from the userspace one.

- Sean

From rdreier at cisco.com  Tue May  6 10:47:47 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 06 May 2008 10:47:47 -0700
Subject: [ofa-general] Re: [PATCH] Request For Comments:
In-Reply-To: <20080506170230.11409.43625.stgit@dell3.ogc.int> (Steve
	Wise's message of "Tue, 06 May 2008 12:02:30 -0500")
References: <20080506170230.11409.43625.stgit@dell3.ogc.int>
Message-ID: 

 > - always do peer2peer and don't let the app choose. This forces
 > the overhead of p2p mode on all apps, but preserves the API.

How bad is the overhead?

 - R.

From andrea at qumranet.com  Tue May  6 10:53:57 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Tue, 6 May 2008 19:53:57 +0200
Subject: [ofa-general] mmu notifier v15 -> v16 diff
In-Reply-To: <20080505194625.GA17734@sgi.com>
References: <1489529e7b53d3f2dab8.1209740704@duo.random>
	<20080505162113.GA18761@sgi.com> <20080505171434.GF8470@duo.random>
	<20080505172506.GA9247@sgi.com> <20080505183405.GI8470@duo.random>
	<20080505194625.GA17734@sgi.com>
Message-ID: <20080506175357.GB12593@duo.random>

Hello everyone,

This is to allow GRU code to call __mmu_notifier_register inside the
mmap_sem (write mode is required as documented in the patch).

It also removes the requirement to implement ->release, as it's not
guaranteed all users will really need it.

I didn't integrate the search function, as we can sort that out after
2.6.26 is out and it wasn't entirely obvious it's really needed, as
the driver should be able to track if a mmu notifier is registered in
the container.

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -29,10 +29,25 @@ struct mmu_notifier_ops {
 	/*
 	 * Called either by mmu_notifier_unregister or when the mm is
 	 * being destroyed by exit_mmap, always before all pages are
-	 * freed. It's mandatory to implement this method. This can
-	 * run concurrently with other mmu notifier methods and it
+	 * freed. This can run concurrently with other mmu notifier
+	 * methods (the ones invoked outside the mm context) and it
 	 * should tear down all secondary mmu mappings and freeze the
-	 * secondary mmu.
+	 * secondary mmu. If this method isn't implemented you've to
+	 * be sure that nothing could possibly write to the pages
+	 * through the secondary mmu by the time the last thread with
+	 * tsk->mm == mm exits.
+	 *
+	 * As side note: the pages freed after ->release returns could
+	 * be immediately reallocated by the gart at an alias physical
+	 * address with a different cache model, so if ->release isn't
+	 * implemented because all _software_ driven memory accesses
+	 * through the secondary mmu are terminated by the time the
+	 * last thread of this mm quits, you've also to be sure that
+	 * speculative _hardware_ operations can't allocate dirty
+	 * cachelines in the cpu that could not be snooped and made
+	 * coherent with the other read and write operations happening
+	 * through the gart alias address, so leading to memory
+	 * corruption.
*/ void (*release)(struct mmu_notifier *mn, struct mm_struct *mm); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2340,13 +2340,20 @@ static inline void __mm_unlock(spinlock_ /* * This operation locks against the VM for all pte/vma/mm related * operations that could ever happen on a certain mm. This includes - * vmtruncate, try_to_unmap, and all page faults. The holder - * must not hold any mm related lock. A single task can't take more - * than one mm_lock in a row or it would deadlock. + * vmtruncate, try_to_unmap, and all page faults. * - * The mmap_sem must be taken in write mode to block all operations - * that could modify pagetables and free pages without altering the - * vma layout (for example populate_range() with nonlinear vmas). + * The caller must take the mmap_sem in read or write mode before + * calling mm_lock(). The caller isn't allowed to release the mmap_sem + * until mm_unlock() returns. + * + * While mm_lock() itself won't strictly require the mmap_sem in write + * mode to be safe, in order to block all operations that could modify + * pagetables and free pages without need of altering the vma layout + * (for example populate_range() with nonlinear vmas) the mmap_sem + * must be taken in write mode by the caller. + * + * A single task can't take more than one mm_lock in a row or it would + * deadlock. * * The sorting is needed to avoid lock inversion deadlocks if two * tasks run mm_lock at the same time on different mm that happen to @@ -2377,17 +2384,13 @@ int mm_lock(struct mm_struct *mm, struct { spinlock_t **anon_vma_locks, **i_mmap_locks; - down_write(&mm->mmap_sem); if (mm->map_count) { anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); - if (unlikely(!anon_vma_locks)) { - up_write(&mm->mmap_sem); + if (unlikely(!anon_vma_locks)) return -ENOMEM; - } i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); if (unlikely(!i_mmap_locks)) { - up_write(&mm->mmap_sem); vfree(anon_vma_locks); return -ENOMEM; } @@ -2426,10 +2429,12 @@ static void mm_unlock_vfree(spinlock_t * /* * mm_unlock doesn't require any memory allocation and it won't fail. * + * The mmap_sem cannot be released until mm_unlock returns. + * * All memory has been previously allocated by mm_lock and it'll be * all freed before returning. Only after mm_unlock returns, the * caller is allowed to free and forget the mm_lock_data structure. - * + * * mm_unlock runs in O(N) where N is the max number of VMAs in the * mm. The max number of vmas is defined in * /proc/sys/vm/max_map_count. @@ -2444,5 +2449,4 @@ void mm_unlock(struct mm_struct *mm, str mm_unlock_vfree(data->i_mmap_locks, data->nr_i_mmap_locks); } - up_write(&mm->mmap_sem); } diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -59,7 +59,8 @@ void __mmu_notifier_release(struct mm_st * from establishing any more sptes before all the * pages in the mm are freed. */ - mn->ops->release(mn, mm); + if (mn->ops->release) + mn->ops->release(mn, mm); srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); spin_lock(&mm->mmu_notifier_mm->lock); } @@ -144,20 +145,9 @@ void __mmu_notifier_invalidate_range_end srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); } -/* - * Must not hold mmap_sem nor any other VM related lock when calling - * this registration function. Must also ensure mm_users can't go down - * to zero while this runs to avoid races with mmu_notifier_release, - * so mm has to be current->mm or the mm should be pinned safely such - * as with get_task_mm(). 
If the mm is not current->mm, the mm_users - * pin should be released by calling mmput after mmu_notifier_register - * returns. mmu_notifier_unregister must be always called to - * unregister the notifier. mm_count is automatically pinned to allow - * mmu_notifier_unregister to safely run at any time later, before or - * after exit_mmap. ->release will always be called before exit_mmap - * frees the pages. - */ -int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +static int do_mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm, + int take_mmap_sem) { struct mm_lock_data data; struct mmu_notifier_mm * mmu_notifier_mm; @@ -174,6 +164,8 @@ int mmu_notifier_register(struct mmu_not if (unlikely(ret)) goto out_kfree; + if (take_mmap_sem) + down_write(&mm->mmap_sem); ret = mm_lock(mm, &data); if (unlikely(ret)) goto out_cleanup; @@ -200,6 +192,8 @@ int mmu_notifier_register(struct mmu_not mm_unlock(mm, &data); out_cleanup: + if (take_mmap_sem) + up_write(&mm->mmap_sem); if (mmu_notifier_mm) cleanup_srcu_struct(&mmu_notifier_mm->srcu); out_kfree: @@ -209,7 +203,35 @@ out: BUG_ON(atomic_read(&mm->mm_users) <= 0); return ret; } + +/* + * Must not hold mmap_sem nor any other VM related lock when calling + * this registration function. Must also ensure mm_users can't go down + * to zero while this runs to avoid races with mmu_notifier_release, + * so mm has to be current->mm or the mm should be pinned safely such + * as with get_task_mm(). If the mm is not current->mm, the mm_users + * pin should be released by calling mmput after mmu_notifier_register + * returns. mmu_notifier_unregister must be always called to + * unregister the notifier. mm_count is automatically pinned to allow + * mmu_notifier_unregister to safely run at any time later, before or + * after exit_mmap. ->release will always be called before exit_mmap + * frees the pages. + */ +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 1); +} EXPORT_SYMBOL_GPL(mmu_notifier_register); + +/* + * Same as mmu_notifier_register but here the caller must hold the + * mmap_sem in write mode. + */ +int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 0); +} +EXPORT_SYMBOL_GPL(__mmu_notifier_register); /* this is called after the last mmu_notifier_unregister() returned */ void __mmu_notifier_mm_destroy(struct mm_struct *mm) @@ -251,7 +273,8 @@ void mmu_notifier_unregister(struct mmu_ * guarantee ->release is called before freeing the * pages. 
*/ - mn->ops->release(mn, mm); + if (mn->ops->release) + mn->ops->release(mn, mm); srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); } else spin_unlock(&mm->mmu_notifier_mm->lock); diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -148,6 +148,8 @@ static inline int mm_has_notifiers(struc extern int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm); +extern int __mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); extern void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm); extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); From swise at opengridcomputing.com Tue May 6 11:32:07 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 06 May 2008 13:32:07 -0500 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: References: <20080506170230.11409.43625.stgit@dell3.ogc.int> Message-ID: <4820A427.1070405@opengridcomputing.com> Roland Dreier wrote: > > - always do peer2peer and don't let the app choose. This forces > > the overhead of p2p mode on all apps, but preserves the API. > > How bad is the overhead? > > - R. > The client side must send a "Ready To Receive" message. This will be negotiated via the MPA exchange and the resulting RTR message may be a 0B read + read response, 0B write, or a 0B send. For chelsio, the 0B write couldn't be used, and the 0B read was the least impact on the driver code, so we used that. For nes, they currently use a 0B write. Also, there are some "caveats" if you turn this on: 1) private data is used to negotiate the type of RTR message and if its needed. This is more of a global module option I think, since it will break interoperability with iwarp. Prolly will bump the MPA version number if this option is on too. 2) if the RTR message fails, it can generate a CQE that is unexpected. 3) if using SEND, then a recv completion is always generated. Steve. From ralph.campbell at qlogic.com Tue May 6 11:36:15 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 06 May 2008 11:36:15 -0700 Subject: [ofa-general] [PATCH 0/7] IB/ipath -- fixes for 2.6.26 Message-ID: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> The following patches fix a number of bugs for the QLogic DDR HCA. IB/ipath -- only warn about prototype chip during init IB/ipath - only increment SSN if WQE is put on send queue IB/ipath - fix bug that can leave sends disabled after freeze recovery IB/ipath - Return the correct opcode for RDMA WRITE with immediate IB/ipath -- fix count of packets received by kernel IB/ipath - need to always request and handle PIO avail interrupts IB/ipath - fix SDMA error recovery in absence of link status change These can also be pulled into Roland's infiniband.git for-2.6.26 repo using: git pull git://git.qlogic.com/ipath-linux-2.6 for-roland From ralph.campbell at qlogic.com Tue May 6 11:36:21 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 06 May 2008 11:36:21 -0700 Subject: [ofa-general] [PATCH 1/7] IB/ipath -- only warn about prototype chip during init In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> Message-ID: <20080506183620.6521.62416.stgit@eng-46.mv.qlogic.com> From: Michael Albaugh We warn about prototype chips, but the function that checks for support is also called as a result of a get_portinfo request, which can clutter the logs. 
Restrict warning to only appear during initialization. Signed-off-by: Michael Albaugh --- drivers/infiniband/hw/ipath/ipath_iba7220.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_iba7220.c b/drivers/infiniband/hw/ipath/ipath_iba7220.c index e3ec0d1..5f693de 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba7220.c +++ b/drivers/infiniband/hw/ipath/ipath_iba7220.c @@ -870,8 +870,9 @@ static int ipath_7220_boardname(struct ipath_devdata *dd, char *name, "revision %u.%u!\n", dd->ipath_majrev, dd->ipath_minrev); ret = 1; - } else if (dd->ipath_minrev == 1) { - /* Rev1 chips are prototype. Complain, but allow use */ + } else if (dd->ipath_minrev == 1 && + !(dd->ipath_flags & IPATH_INITTED)) { + /* Rev1 chips are prototype. Complain at init, but allow use */ ipath_dev_err(dd, "Unsupported hardware " "revision %u.%u, Contact support at qlogic.com\n", dd->ipath_majrev, dd->ipath_minrev); From ralph.campbell at qlogic.com Tue May 6 11:36:26 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 06 May 2008 11:36:26 -0700 Subject: [ofa-general] [PATCH 2/7] IB/ipath - only increment SSN if WQE is put on send queue In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> Message-ID: <20080506183626.6521.97442.stgit@eng-46.mv.qlogic.com> If a send work request has immediate errors and is not put on the send queue, we shouldn't update any of the QP state. The increment of the SSN wasn't obeying this. Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_verbs.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index e63927c..5015cd2 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -396,7 +396,6 @@ static int ipath_post_one_send(struct ipath_qp *qp, struct ib_send_wr *wr) wqe = get_swqe_ptr(qp, qp->s_head); wqe->wr = *wr; - wqe->ssn = qp->s_ssn++; wqe->length = 0; if (wr->num_sge) { acc = wr->opcode >= IB_WR_RDMA_READ ? @@ -422,6 +421,7 @@ static int ipath_post_one_send(struct ipath_qp *qp, struct ib_send_wr *wr) goto bail_inval; } else if (wqe->length > to_idev(qp->ibqp.device)->dd->ipath_ibmtu) goto bail_inval; + wqe->ssn = qp->s_ssn++; qp->s_head = next; ret = 0; From ralph.campbell at qlogic.com Tue May 6 11:36:31 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 06 May 2008 11:36:31 -0700 Subject: [ofa-general] [PATCH 3/7] IB/ipath - fix bug that can leave sends disabled after freeze recovery In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> Message-ID: <20080506183631.6521.82997.stgit@eng-46.mv.qlogic.com> From: Dave Olson The semantics of cancel_sends changed, but the code using it was missed. Don't leave sends and pioavail updates disabled, and add a comment as to why the force update is needed. 
Signed-off-by: Dave Olson
---

 drivers/infiniband/hw/ipath/ipath_intr.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index 1b58f47..45c4c06 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -933,11 +933,15 @@ void ipath_clear_freeze(struct ipath_devdata *dd)
 	 * therefore would not be sent, and eventually
 	 * might cause the process to run out of bufs
 	 */
-	ipath_cancel_sends(dd, 0);
+	ipath_cancel_sends(dd, 1);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_control,
 			 dd->ipath_control);
 
-	/* ensure pio avail updates continue */
+	/*
+	 * ensure pio avail updates continue (because the update
+	 * won't have happened from cancel_sends because we were
+	 * still in freeze)
+	 */
 	ipath_force_pio_avail_update(dd);
 
 	/*

From ralph.campbell at qlogic.com  Tue May  6 11:36:36 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 06 May 2008 11:36:36 -0700
Subject: [ofa-general] [PATCH 4/7] IB/ipath - Return the correct opcode for
	RDMA WRITE with immediate
In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com>
References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com>
Message-ID: <20080506183636.6521.42474.stgit@eng-46.mv.qlogic.com>

This patch fixes a bug in the RC responder which generates a completion
entry with the wrong opcode when an RDMA WRITE with immediate is
received.

Signed-off-by: Ralph Campbell
---

 drivers/infiniband/hw/ipath/ipath_rc.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c
index c405dfb..08b11b5 100644
--- a/drivers/infiniband/hw/ipath/ipath_rc.c
+++ b/drivers/infiniband/hw/ipath/ipath_rc.c
@@ -1746,7 +1746,11 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 		qp->r_wrid_valid = 0;
 		wc.wr_id = qp->r_wr_id;
 		wc.status = IB_WC_SUCCESS;
-		wc.opcode = IB_WC_RECV;
+		if (opcode == OP(RDMA_WRITE_LAST_WITH_IMMEDIATE) ||
+		    opcode == OP(RDMA_WRITE_ONLY_WITH_IMMEDIATE))
+			wc.opcode = IB_WC_RECV_RDMA_WITH_IMM;
+		else
+			wc.opcode = IB_WC_RECV;
 		wc.vendor_err = 0;
 		wc.qp = &qp->ibqp;
 		wc.src_qp = qp->remote_qpn;

From ralph.campbell at qlogic.com  Tue May  6 11:36:41 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 06 May 2008 11:36:41 -0700
Subject: [ofa-general] [PATCH 5/7] IB/ipath -- fix count of packets received
	by kernel
In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com>
References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com>
Message-ID: <20080506183641.6521.79318.stgit@eng-46.mv.qlogic.com>

From: Michael Albaugh

The loop in ipath_kreceive() that processes packets increments the
loop-index 'i' once too often, because the exit condition does not
depend on it, and is checked after the increment. By adding a check for
!last to the iterator in the for loop, we correct that in a way that is
not so likely to be re-broken by changes in the loop body.
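As a standalone illustration of what the iterator change accomplishes (a
toy model only - the fixed three-packet input and the printf bookkeeping
are made up for the example, not driver code):

	#include <stdio.h>

	/* Model of the ipath_kreceive() loop fix: 'i' counts packets and
	 * 'last' is set while processing the final packet. With
	 * "i += !last", the index is no longer bumped one extra time
	 * after the last packet has been handled.
	 */
	int main(void)
	{
		int last, i;

		for (last = 0, i = 1; !last; i += !last) {
			if (i == 3)	/* pretend packet 3 is the final one */
				last = 1;
			printf("processed packet %d\n", i);
		}
		printf("packets seen: %d\n", i);  /* 3; plain "i++" gave 4 */
		return 0;
	}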
Signed-off-by: Michael Albaugh
---

 drivers/infiniband/hw/ipath/ipath_driver.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index acf30c0..f81dd4a 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -1197,7 +1197,7 @@ void ipath_kreceive(struct ipath_portdata *pd)
 	}
 
 reloop:
-	for (last = 0, i = 1; !last; i++) {
+	for (last = 0, i = 1; !last; i += !last) {
 		hdr = dd->ipath_f_get_msgheader(dd, rhf_addr);
 		eflags = ipath_hdrget_err_flags(rhf_addr);
 		etype = ipath_hdrget_rcv_type(rhf_addr);

From ralph.campbell at qlogic.com  Tue May  6 11:36:47 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 06 May 2008 11:36:47 -0700
Subject: [ofa-general] [PATCH 6/7] IB/ipath - need to always request and
	handle PIO avail interrupts
In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com>
References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com>
Message-ID: <20080506183646.6521.81784.stgit@eng-46.mv.qlogic.com>

From: Dave Olson

Now that we always use PIO for vl15 on 7220, we could get stuck forever
if we happened to run out of PIO buffers from the verbs code, because
the setup code wouldn't run; the interrupt was also ignored if SDMA was
supported. We also have to reduce the pio update threshold if we have
fewer kernel buffers than the existing threshold.

Cleans up the initialization a bit to get ordering safer and more
sensible, and to use the existing ipath_chg_kernavail call to do init,
rather than doing it separately.

Drops unnecessary clearing of pio buffer on pio parity error.

Drops incorrect updating of pioavailshadow when exiting freeze mode
(software state may not match chip state if buffer has been allocated
and not yet written).

If we couldn't get a kernel buffer for a while, make sure we are in
sync with hardware, mainly to handle the exiting freeze case.

Signed-off-by: Dave Olson
---

 drivers/infiniband/hw/ipath/ipath_driver.c    |  128 +++++++++++++++++++++++--
 drivers/infiniband/hw/ipath/ipath_file_ops.c  |   72 ++++++--------
 drivers/infiniband/hw/ipath/ipath_iba7220.c   |   21 +---
 drivers/infiniband/hw/ipath/ipath_init_chip.c |   95 ++++++++-----------
 drivers/infiniband/hw/ipath/ipath_intr.c      |   82 ++--------------
 drivers/infiniband/hw/ipath/ipath_kernel.h    |    8 +-
 drivers/infiniband/hw/ipath/ipath_ruc.c       |    7 +
 drivers/infiniband/hw/ipath/ipath_sdma.c      |   13 ++-
 8 files changed, 224 insertions(+), 202 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index f81dd4a..2036d38 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -1428,6 +1428,40 @@ static void ipath_update_pio_bufs(struct ipath_devdata *dd)
 	spin_unlock_irqrestore(&ipath_pioavail_lock, flags);
 }
 
+/*
+ * used to force update of pioavailshadow if we can't get a pio buffer.
+ * Needed primarily due to exiting freeze mode after recovering
+ * from errors. Done lazily, because it's safer (known to not
+ * be writing pio buffers).
+ */
+static void ipath_reset_availshadow(struct ipath_devdata *dd)
+{
+	int i, im;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ipath_pioavail_lock, flags);
+	for (i = 0; i < dd->ipath_pioavregs; i++) {
+		u64 val, oldval;
+		/* deal with 6110 chip bug on high register #s */
+		im = (i > 3 && (dd->ipath_flags & IPATH_SWAP_PIOBUFS)) ?
+ i ^ 1 : i; + val = le64_to_cpu(dd->ipath_pioavailregs_dma[im]); + /* + * busy out the buffers not in the kernel avail list, + * without changing the generation bits. + */ + oldval = dd->ipath_pioavailshadow[i]; + dd->ipath_pioavailshadow[i] = val | + ((~dd->ipath_pioavailkernel[i] << + INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT) & + 0xaaaaaaaaaaaaaaaaULL); /* All BUSY bits in qword */ + if (oldval != dd->ipath_pioavailshadow[i]) + ipath_dbg("shadow[%d] was %Lx, now %lx\n", + i, oldval, dd->ipath_pioavailshadow[i]); + } + spin_unlock_irqrestore(&ipath_pioavail_lock, flags); +} + /** * ipath_setrcvhdrsize - set the receive header size * @dd: the infinipath device @@ -1482,9 +1516,12 @@ static noinline void no_pio_bufs(struct ipath_devdata *dd) */ ipath_stats.sps_nopiobufs++; if (!(++dd->ipath_consec_nopiobuf % 100000)) { - ipath_dbg("%u pio sends with no bufavail; dmacopy: " - "%llx %llx %llx %llx; shadow: %lx %lx %lx %lx\n", + ipath_force_pio_avail_update(dd); /* at start */ + ipath_dbg("%u tries no piobufavail ts%lx; dmacopy: " + "%llx %llx %llx %llx\n" + "ipath shadow: %lx %lx %lx %lx\n", dd->ipath_consec_nopiobuf, + (unsigned long)get_cycles(), (unsigned long long) le64_to_cpu(dma[0]), (unsigned long long) le64_to_cpu(dma[1]), (unsigned long long) le64_to_cpu(dma[2]), @@ -1496,14 +1533,17 @@ static noinline void no_pio_bufs(struct ipath_devdata *dd) */ if ((dd->ipath_piobcnt2k + dd->ipath_piobcnt4k) > (sizeof(shadow[0]) * 4 * 4)) - ipath_dbg("2nd group: dmacopy: %llx %llx " - "%llx %llx; shadow: %lx %lx %lx %lx\n", + ipath_dbg("2nd group: dmacopy: " + "%llx %llx %llx %llx\n" + "ipath shadow: %lx %lx %lx %lx\n", (unsigned long long)le64_to_cpu(dma[4]), (unsigned long long)le64_to_cpu(dma[5]), (unsigned long long)le64_to_cpu(dma[6]), (unsigned long long)le64_to_cpu(dma[7]), - shadow[4], shadow[5], shadow[6], - shadow[7]); + shadow[4], shadow[5], shadow[6], shadow[7]); + + /* at end, so update likely happened */ + ipath_reset_availshadow(dd); } } @@ -1652,19 +1692,46 @@ void ipath_chg_pioavailkernel(struct ipath_devdata *dd, unsigned start, unsigned len, int avail) { unsigned long flags; - unsigned end; + unsigned end, cnt = 0, next; /* There are two bits per send buffer (busy and generation) */ start *= 2; - len *= 2; - end = start + len; + end = start + len * 2; - /* Set or clear the generation bits. */ spin_lock_irqsave(&ipath_pioavail_lock, flags); + /* Set or clear the busy bit in the shadow. */ while (start < end) { if (avail) { - __clear_bit(start + INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT, - dd->ipath_pioavailshadow); + unsigned long dma; + int i, im; + /* + * the BUSY bit will never be set, because we disarm + * the user buffers before we hand them back to the + * kernel. We do have to make sure the generation + * bit is set correctly in shadow, since it could + * have changed many times while allocated to user. + * We can't use the bitmap functions on the full + * dma array because it is always little-endian, so + * we have to flip to host-order first. + * BITS_PER_LONG is slightly wrong, since it's + * always 64 bits per register in chip... + * We only work on 64 bit kernels, so that's OK. + */ + /* deal with 6110 chip bug on high register #s */ + i = start / BITS_PER_LONG; + im = (i > 3 && (dd->ipath_flags & IPATH_SWAP_PIOBUFS)) ? 
+ i ^ 1 : i; + __clear_bit(INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT + + start, dd->ipath_pioavailshadow); + dma = (unsigned long) le64_to_cpu( + dd->ipath_pioavailregs_dma[im]); + if (test_bit((INFINIPATH_SENDPIOAVAIL_CHECK_SHIFT + + start) % BITS_PER_LONG, &dma)) + __set_bit(INFINIPATH_SENDPIOAVAIL_CHECK_SHIFT + + start, dd->ipath_pioavailshadow); + else + __clear_bit(INFINIPATH_SENDPIOAVAIL_CHECK_SHIFT + + start, dd->ipath_pioavailshadow); __set_bit(start, dd->ipath_pioavailkernel); } else { __set_bit(start + INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT, @@ -1673,7 +1740,44 @@ void ipath_chg_pioavailkernel(struct ipath_devdata *dd, unsigned start, } start += 2; } + + if (dd->ipath_pioupd_thresh) { + end = 2 * (dd->ipath_piobcnt2k + dd->ipath_piobcnt4k); + next = find_first_bit(dd->ipath_pioavailkernel, end); + while (next < end) { + cnt++; + next = find_next_bit(dd->ipath_pioavailkernel, end, + next + 1); + } + } spin_unlock_irqrestore(&ipath_pioavail_lock, flags); + + /* + * When moving buffers from kernel to user, if number assigned to + * the user is less than the pio update threshold, and threshold + * is supported (cnt was computed > 0), drop the update threshold + * so we update at least once per allocated number of buffers. + * In any case, if the kernel buffers are less than the threshold, + * drop the threshold. We don't bother increasing it, having once + * decreased it, since it would typically just cycle back and forth. + * If we don't decrease below buffers in use, we can wait a long + * time for an update, until some other context uses PIO buffers. + */ + if (!avail && len < cnt) + cnt = len; + if (cnt < dd->ipath_pioupd_thresh) { + dd->ipath_pioupd_thresh = cnt; + ipath_dbg("Decreased pio update threshold to %u\n", + dd->ipath_pioupd_thresh); + spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); + dd->ipath_sendctrl &= ~(INFINIPATH_S_UPDTHRESH_MASK + << INFINIPATH_S_UPDTHRESH_SHIFT); + dd->ipath_sendctrl |= dd->ipath_pioupd_thresh + << INFINIPATH_S_UPDTHRESH_SHIFT; + ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, + dd->ipath_sendctrl); + spin_unlock_irqrestore(&dd->ipath_sendctrl_lock, flags); + } } /** diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c index 8b17522..3295177 100644 --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c @@ -173,47 +173,25 @@ static int ipath_get_base_info(struct file *fp, (void *) dd->ipath_statusp - (void *) dd->ipath_pioavailregs_dma; if (!shared) { - kinfo->spi_piocnt = dd->ipath_pbufsport; + kinfo->spi_piocnt = pd->port_piocnt; kinfo->spi_piobufbase = (u64) pd->port_piobufs; kinfo->__spi_uregbase = (u64) dd->ipath_uregbase + dd->ipath_ureg_align * pd->port_port; } else if (master) { - kinfo->spi_piocnt = (dd->ipath_pbufsport / subport_cnt) + - (dd->ipath_pbufsport % subport_cnt); + kinfo->spi_piocnt = (pd->port_piocnt / subport_cnt) + + (pd->port_piocnt % subport_cnt); /* Master's PIO buffers are after all the slave's */ kinfo->spi_piobufbase = (u64) pd->port_piobufs + dd->ipath_palign * - (dd->ipath_pbufsport - kinfo->spi_piocnt); + (pd->port_piocnt - kinfo->spi_piocnt); } else { unsigned slave = subport_fp(fp) - 1; - kinfo->spi_piocnt = dd->ipath_pbufsport / subport_cnt; + kinfo->spi_piocnt = pd->port_piocnt / subport_cnt; kinfo->spi_piobufbase = (u64) pd->port_piobufs + dd->ipath_palign * kinfo->spi_piocnt * slave; } - /* - * Set the PIO avail update threshold to no larger - * than the number of buffers per process. 
Note that - * we decrease it here, but won't ever increase it. - */ - if (dd->ipath_pioupd_thresh && - kinfo->spi_piocnt < dd->ipath_pioupd_thresh) { - unsigned long flags; - - dd->ipath_pioupd_thresh = kinfo->spi_piocnt; - ipath_dbg("Decreased pio update threshold to %u\n", - dd->ipath_pioupd_thresh); - spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); - dd->ipath_sendctrl &= ~(INFINIPATH_S_UPDTHRESH_MASK - << INFINIPATH_S_UPDTHRESH_SHIFT); - dd->ipath_sendctrl |= dd->ipath_pioupd_thresh - << INFINIPATH_S_UPDTHRESH_SHIFT; - ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, - dd->ipath_sendctrl); - spin_unlock_irqrestore(&dd->ipath_sendctrl_lock, flags); - } - if (shared) { kinfo->spi_port_uregbase = (u64) dd->ipath_uregbase + dd->ipath_ureg_align * pd->port_port; @@ -1309,19 +1287,19 @@ static int ipath_mmap(struct file *fp, struct vm_area_struct *vma) ureg = dd->ipath_uregbase + dd->ipath_ureg_align * pd->port_port; if (!pd->port_subport_cnt) { /* port is not shared */ - piocnt = dd->ipath_pbufsport; + piocnt = pd->port_piocnt; piobufs = pd->port_piobufs; } else if (!subport_fp(fp)) { /* caller is the master */ - piocnt = (dd->ipath_pbufsport / pd->port_subport_cnt) + - (dd->ipath_pbufsport % pd->port_subport_cnt); + piocnt = (pd->port_piocnt / pd->port_subport_cnt) + + (pd->port_piocnt % pd->port_subport_cnt); piobufs = pd->port_piobufs + - dd->ipath_palign * (dd->ipath_pbufsport - piocnt); + dd->ipath_palign * (pd->port_piocnt - piocnt); } else { unsigned slave = subport_fp(fp) - 1; /* caller is a slave */ - piocnt = dd->ipath_pbufsport / pd->port_subport_cnt; + piocnt = pd->port_piocnt / pd->port_subport_cnt; piobufs = pd->port_piobufs + dd->ipath_palign * piocnt * slave; } @@ -1633,9 +1611,6 @@ static int try_alloc_port(struct ipath_devdata *dd, int port, port_fp(fp) = pd; pd->port_pid = current->pid; strncpy(pd->port_comm, current->comm, sizeof(pd->port_comm)); - ipath_chg_pioavailkernel(dd, - dd->ipath_pbufsport * (pd->port_port - 1), - dd->ipath_pbufsport, 0); ipath_stats.sps_ports++; ret = 0; } else @@ -1938,11 +1913,25 @@ static int ipath_do_user_init(struct file *fp, /* for now we do nothing with rcvhdrcnt: uinfo->spu_rcvhdrcnt */ + /* some ports may get extra buffers, calculate that here */ + if (pd->port_port <= dd->ipath_ports_extrabuf) + pd->port_piocnt = dd->ipath_pbufsport + 1; + else + pd->port_piocnt = dd->ipath_pbufsport; + /* for right now, kernel piobufs are at end, so port 1 is at 0 */ + if (pd->port_port <= dd->ipath_ports_extrabuf) + pd->port_pio_base = (dd->ipath_pbufsport + 1) + * (pd->port_port - 1); + else + pd->port_pio_base = dd->ipath_ports_extrabuf + + dd->ipath_pbufsport * (pd->port_port - 1); pd->port_piobufs = dd->ipath_piobufbase + - dd->ipath_pbufsport * (pd->port_port - 1) * dd->ipath_palign; - ipath_cdbg(VERBOSE, "Set base of piobufs for port %u to 0x%x\n", - pd->port_port, pd->port_piobufs); + pd->port_pio_base * dd->ipath_palign; + ipath_cdbg(VERBOSE, "piobuf base for port %u is 0x%x, piocnt %u," + " first pio %u\n", pd->port_port, pd->port_piobufs, + pd->port_piocnt, pd->port_pio_base); + ipath_chg_pioavailkernel(dd, pd->port_pio_base, pd->port_piocnt, 0); /* * Now allocate the rcvhdr Q and eager TIDs; skip the TID @@ -2107,7 +2096,6 @@ static int ipath_close(struct inode *in, struct file *fp) } if (dd->ipath_kregbase) { - int i; /* atomically clear receive enable port and intr avail. 
*/ clear_bit(dd->ipath_r_portenable_shift + port, &dd->ipath_rcvctrl); @@ -2136,9 +2124,9 @@ static int ipath_close(struct inode *in, struct file *fp) ipath_write_kreg_port(dd, dd->ipath_kregs->kr_rcvhdraddr, pd->port_port, dd->ipath_dummy_hdrq_phys); - i = dd->ipath_pbufsport * (port - 1); - ipath_disarm_piobufs(dd, i, dd->ipath_pbufsport); - ipath_chg_pioavailkernel(dd, i, dd->ipath_pbufsport, 1); + ipath_disarm_piobufs(dd, pd->port_pio_base, pd->port_piocnt); + ipath_chg_pioavailkernel(dd, pd->port_pio_base, + pd->port_piocnt, 1); dd->ipath_f_clear_tids(dd, pd->port_port); diff --git a/drivers/infiniband/hw/ipath/ipath_iba7220.c b/drivers/infiniband/hw/ipath/ipath_iba7220.c index 5f693de..8eee783 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba7220.c +++ b/drivers/infiniband/hw/ipath/ipath_iba7220.c @@ -595,7 +595,7 @@ static void ipath_7220_txe_recover(struct ipath_devdata *dd) dev_info(&dd->pcidev->dev, "Recovering from TXE PIO parity error\n"); - ipath_disarm_senderrbufs(dd, 1); + ipath_disarm_senderrbufs(dd); } @@ -675,10 +675,8 @@ static void ipath_7220_handle_hwerrors(struct ipath_devdata *dd, char *msg, ctrl = ipath_read_kreg32(dd, dd->ipath_kregs->kr_control); if ((ctrl & INFINIPATH_C_FREEZEMODE) && !ipath_diag_inuse) { /* - * Parity errors in send memory are recoverable, - * just cancel the send (if indicated in * sendbuffererror), - * count the occurrence, unfreeze (if no other handled - * hardware error bits are set), and continue. + * Parity errors in send memory are recoverable by h/w + * just do housekeeping, exit freeze mode and continue. */ if (hwerrs & ((INFINIPATH_HWE_TXEMEMPARITYERR_PIOBUF | INFINIPATH_HWE_TXEMEMPARITYERR_PIOPBC) @@ -687,13 +685,6 @@ static void ipath_7220_handle_hwerrors(struct ipath_devdata *dd, char *msg, hwerrs &= ~((INFINIPATH_HWE_TXEMEMPARITYERR_PIOBUF | INFINIPATH_HWE_TXEMEMPARITYERR_PIOPBC) << INFINIPATH_HWE_TXEMEMPARITYERR_SHIFT); - if (!hwerrs) { - /* else leave in freeze mode */ - ipath_write_kreg(dd, - dd->ipath_kregs->kr_control, - dd->ipath_control); - goto bail; - } } if (hwerrs) { /* @@ -723,8 +714,8 @@ static void ipath_7220_handle_hwerrors(struct ipath_devdata *dd, char *msg, *dd->ipath_statusp |= IPATH_STATUS_HWERROR; dd->ipath_flags &= ~IPATH_INITTED; } else { - ipath_dbg("Clearing freezemode on ignored hardware " - "error\n"); + ipath_dbg("Clearing freezemode on ignored or " + "recovered hardware error\n"); ipath_clear_freeze(dd); } } @@ -1967,7 +1958,7 @@ static void ipath_7220_config_ports(struct ipath_devdata *dd, ushort cfgports) dd->ipath_rcvctrl); dd->ipath_p0_rcvegrcnt = 2048; /* always */ if (dd->ipath_flags & IPATH_HAS_SEND_DMA) - dd->ipath_pioreserved = 1; /* reserve a buffer */ + dd->ipath_pioreserved = 3; /* kpiobufs used for PIO */ } diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c index 27dd894..3e5baa4 100644 --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c @@ -41,7 +41,7 @@ /* * min buffers we want to have per port, after driver */ -#define IPATH_MIN_USER_PORT_BUFCNT 8 +#define IPATH_MIN_USER_PORT_BUFCNT 7 /* * Number of ports we are configured to use (to allow for more pio @@ -54,13 +54,9 @@ MODULE_PARM_DESC(cfgports, "Set max number of ports to use"); /* * Number of buffers reserved for driver (verbs and layered drivers.) - * Reserved at end of buffer list. Initialized based on - * number of PIO buffers if not set via module interface. 
+ * Initialized based on number of PIO buffers if not set via module interface. * The problem with this is that it's global, but we'll use different - * numbers for different chip types. So the default value is not - * very useful. I've redefined it for the 1.3 release so that it's - * zero unless set by the user to something else, in which case we - * try to respect it. + * numbers for different chip types. */ static ushort ipath_kpiobufs; @@ -546,9 +542,12 @@ static void enable_chip(struct ipath_devdata *dd, int reinit) pioavail = dd->ipath_pioavailregs_dma[i ^ 1]; else pioavail = dd->ipath_pioavailregs_dma[i]; - dd->ipath_pioavailshadow[i] = le64_to_cpu(pioavail) | - (~dd->ipath_pioavailkernel[i] << - INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT); + /* + * don't need to worry about ipath_pioavailkernel here + * because we will call ipath_chg_pioavailkernel() later + * in initialization, to busy out buffers as needed + */ + dd->ipath_pioavailshadow[i] = le64_to_cpu(pioavail); } /* can get counters, stats, etc. */ dd->ipath_flags |= IPATH_PRESENT; @@ -708,12 +707,11 @@ static void verify_interrupt(unsigned long opaque) int ipath_init_chip(struct ipath_devdata *dd, int reinit) { int ret = 0; - u32 val32, kpiobufs; + u32 kpiobufs, defkbufs; u32 piobufs, uports; u64 val; struct ipath_portdata *pd; gfp_t gfp_flags = GFP_USER | __GFP_COMP; - unsigned long flags; ret = init_housekeeping(dd, reinit); if (ret) @@ -753,56 +751,46 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) dd->ipath_pioavregs = ALIGN(piobufs, sizeof(u64) * BITS_PER_BYTE / 2) / (sizeof(u64) * BITS_PER_BYTE / 2); uports = dd->ipath_cfgports ? dd->ipath_cfgports - 1 : 0; - if (ipath_kpiobufs == 0) { - /* not set by user (this is default) */ - if (piobufs > 144) - kpiobufs = 32; - else - kpiobufs = 16; - } + if (piobufs > 144) + defkbufs = 32 + dd->ipath_pioreserved; else - kpiobufs = ipath_kpiobufs; + defkbufs = 16 + dd->ipath_pioreserved; - if (kpiobufs + (uports * IPATH_MIN_USER_PORT_BUFCNT) > piobufs) { + if (ipath_kpiobufs && (ipath_kpiobufs + + (uports * IPATH_MIN_USER_PORT_BUFCNT)) > piobufs) { int i = (int) piobufs - (int) (uports * IPATH_MIN_USER_PORT_BUFCNT); if (i < 1) i = 1; dev_info(&dd->pcidev->dev, "Allocating %d PIO bufs of " "%d for kernel leaves too few for %d user ports " - "(%d each); using %u\n", kpiobufs, + "(%d each); using %u\n", ipath_kpiobufs, piobufs, uports, IPATH_MIN_USER_PORT_BUFCNT, i); /* * shouldn't change ipath_kpiobufs, because could be * different for different devices... */ kpiobufs = i; - } + } else if (ipath_kpiobufs) + kpiobufs = ipath_kpiobufs; + else + kpiobufs = defkbufs; dd->ipath_lastport_piobuf = piobufs - kpiobufs; dd->ipath_pbufsport = uports ? 
dd->ipath_lastport_piobuf / uports : 0; - val32 = dd->ipath_lastport_piobuf - (dd->ipath_pbufsport * uports); - if (val32 > 0) { - ipath_dbg("allocating %u pbufs/port leaves %u unused, " - "add to kernel\n", dd->ipath_pbufsport, val32); - dd->ipath_lastport_piobuf -= val32; - kpiobufs += val32; - ipath_dbg("%u pbufs/port leaves %u unused, add to kernel\n", - dd->ipath_pbufsport, val32); - } + /* if not an even divisor, some user ports get extra buffers */ + dd->ipath_ports_extrabuf = dd->ipath_lastport_piobuf - + (dd->ipath_pbufsport * uports); + if (dd->ipath_ports_extrabuf) + ipath_dbg("%u pbufs/port leaves some unused, add 1 buffer to " + "ports <= %u\n", dd->ipath_pbufsport, + dd->ipath_ports_extrabuf); dd->ipath_lastpioindex = 0; dd->ipath_lastpioindexl = dd->ipath_piobcnt2k; - ipath_chg_pioavailkernel(dd, 0, piobufs, 1); + /* ipath_pioavailshadow initialized earlier */ ipath_cdbg(VERBOSE, "%d PIO bufs for kernel out of %d total %u " "each for %u user ports\n", kpiobufs, piobufs, dd->ipath_pbufsport, uports); - if (dd->ipath_pioupd_thresh) { - if (dd->ipath_pbufsport < dd->ipath_pioupd_thresh) - dd->ipath_pioupd_thresh = dd->ipath_pbufsport; - if (kpiobufs < dd->ipath_pioupd_thresh) - dd->ipath_pioupd_thresh = kpiobufs; - } - ret = dd->ipath_f_early_init(dd); if (ret) { ipath_dev_err(dd, "Early initialization failure\n"); @@ -810,13 +798,6 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) } /* - * Cancel any possible active sends from early driver load. - * Follows early_init because some chips have to initialize - * PIO buffers in early_init to avoid false parity errors. - */ - ipath_cancel_sends(dd, 0); - - /* * Early_init sets rcvhdrentsize and rcvhdrsize, so this must be * done after early_init. */ @@ -836,6 +817,7 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) ipath_write_kreg(dd, dd->ipath_kregs->kr_sendpioavailaddr, dd->ipath_pioavailregs_phys); + /* * this is to detect s/w errors, which the h/w works around by * ignoring the low 6 bits of address, if it wasn't aligned. @@ -862,12 +844,6 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) ~0ULL&~INFINIPATH_HWE_MEMBISTFAILED); ipath_write_kreg(dd, dd->ipath_kregs->kr_control, 0ULL); - spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); - dd->ipath_sendctrl = INFINIPATH_S_PIOENABLE; - ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, dd->ipath_sendctrl); - ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); - spin_unlock_irqrestore(&dd->ipath_sendctrl_lock, flags); - /* * before error clears, since we expect serdes pll errors during * this, the first time after reset @@ -940,6 +916,19 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit) else enable_chip(dd, reinit); + /* after enable_chip, so pioavailshadow setup */ + ipath_chg_pioavailkernel(dd, 0, piobufs, 1); + + /* + * Cancel any possible active sends from early driver load. + * Follows early_init because some chips have to initialize + * PIO buffers in early_init to avoid false parity errors. + * After enable and ipath_chg_pioavailkernel so we can safely + * enable pioavail updates and PIOENABLE; packets are now + * ready to go out. 
+ */ + ipath_cancel_sends(dd, 1); + if (!reinit) { /* * Used when we close a port, for DMA already in flight diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 45c4c06..26900b3 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -38,42 +38,12 @@ #include "ipath_verbs.h" #include "ipath_common.h" -/* - * clear (write) a pio buffer, to clear a parity error. This routine - * should only be called when in freeze mode, and the buffer should be - * canceled afterwards. - */ -static void ipath_clrpiobuf(struct ipath_devdata *dd, u32 pnum) -{ - u32 __iomem *pbuf; - u32 dwcnt; /* dword count to write */ - if (pnum < dd->ipath_piobcnt2k) { - pbuf = (u32 __iomem *) (dd->ipath_pio2kbase + pnum * - dd->ipath_palign); - dwcnt = dd->ipath_piosize2k >> 2; - } - else { - pbuf = (u32 __iomem *) (dd->ipath_pio4kbase + - (pnum - dd->ipath_piobcnt2k) * dd->ipath_4kalign); - dwcnt = dd->ipath_piosize4k >> 2; - } - dev_info(&dd->pcidev->dev, - "Rewrite PIO buffer %u, to recover from parity error\n", - pnum); - - /* no flush required, since already in freeze */ - writel(dwcnt + 1, pbuf); - while (--dwcnt) - writel(0, pbuf++); -} /* * Called when we might have an error that is specific to a particular * PIO buffer, and may need to cancel that buffer, so it can be re-used. - * If rewrite is true, and bits are set in the sendbufferror registers, - * we'll write to the buffer, for error recovery on parity errors. */ -void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite) +void ipath_disarm_senderrbufs(struct ipath_devdata *dd) { u32 piobcnt; unsigned long sbuf[4]; @@ -109,11 +79,8 @@ void ipath_disarm_senderrbufs(struct ipath_devdata *dd, int rewrite) } for (i = 0; i < piobcnt; i++) - if (test_bit(i, sbuf)) { - if (rewrite) - ipath_clrpiobuf(dd, i); + if (test_bit(i, sbuf)) ipath_disarm_piobufs(dd, i, 1); - } /* ignore armlaunch errs for a bit */ dd->ipath_lastcancel = jiffies+3; } @@ -164,7 +131,7 @@ static u64 handle_e_sum_errs(struct ipath_devdata *dd, ipath_err_t errs) { u64 ignore_this_time = 0; - ipath_disarm_senderrbufs(dd, 0); + ipath_disarm_senderrbufs(dd); if ((errs & E_SUM_LINK_PKTERRS) && !(dd->ipath_flags & IPATH_LINKACTIVE)) { /* @@ -909,8 +876,8 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) * processes (causing armlaunch), send errors due to going into freeze mode, * etc., and try to avoid causing extra interrupts while doing so. * Forcibly update the in-memory pioavail register copies after cleanup - * because the chip won't do it for anything changing while in freeze mode - * (we don't want to wait for the next pio buffer state change). + * because the chip won't do it while in freeze mode (the register values + * themselves are kept correct). 
* Make sure that we don't lose any important interrupts by using the chip * feature that says that writing 0 to a bit in *clear that is set in * *status will cause an interrupt to be generated again (if allowed by @@ -918,48 +885,23 @@ static int handle_errors(struct ipath_devdata *dd, ipath_err_t errs) */ void ipath_clear_freeze(struct ipath_devdata *dd) { - int i, im; - u64 val; - /* disable error interrupts, to avoid confusion */ ipath_write_kreg(dd, dd->ipath_kregs->kr_errormask, 0ULL); /* also disable interrupts; errormask is sometimes overwriten */ ipath_write_kreg(dd, dd->ipath_kregs->kr_intmask, 0ULL); - /* - * clear all sends, because they have may been - * completed by usercode while in freeze mode, and - * therefore would not be sent, and eventually - * might cause the process to run out of bufs - */ ipath_cancel_sends(dd, 1); + + /* clear the freeze, and be sure chip saw it */ ipath_write_kreg(dd, dd->ipath_kregs->kr_control, dd->ipath_control); + ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); - /* - * ensure pio avail updates continue (because the update - * won't have happened from cancel_sends because we were - * still in freeze - */ + /* force in-memory update now we are out of freeze */ ipath_force_pio_avail_update(dd); /* - * We just enabled pioavailupdate, so dma copy is almost certainly - * not yet right, so read the registers directly. Similar to init - */ - for (i = 0; i < dd->ipath_pioavregs; i++) { - /* deal with 6110 chip bug */ - im = (i > 3 && (dd->ipath_flags & IPATH_SWAP_PIOBUFS)) ? - i ^ 1 : i; - val = ipath_read_kreg64(dd, (0x1000 / sizeof(u64)) + im); - dd->ipath_pioavailregs_dma[i] = cpu_to_le64(val); - dd->ipath_pioavailshadow[i] = val | - (~dd->ipath_pioavailkernel[i] << - INFINIPATH_SENDPIOAVAIL_BUSY_SHIFT); - } - - /* * force new interrupt if any hwerr, error or interrupt bits are * still set, and clear "safe" send packet errors related to freeze * and cancelling sends. Re-enable error interrupts before possible @@ -1316,10 +1258,8 @@ irqreturn_t ipath_intr(int irq, void *data) ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); spin_unlock_irqrestore(&dd->ipath_sendctrl_lock, flags); - if (!(dd->ipath_flags & IPATH_HAS_SEND_DMA)) - handle_layer_pioavail(dd); - else - ipath_dbg("unexpected BUFAVAIL intr\n"); + /* always process; sdma verbs uses PIO for acks and VL15 */ + handle_layer_pioavail(dd); } ret = IRQ_HANDLED; diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 202337a..02b24a3 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -117,6 +117,10 @@ struct ipath_portdata { u16 port_subport_cnt; /* non-zero if port is being shared. 
*/ u16 port_subport_id; + /* number of pio bufs for this port (all procs, if shared) */ + u32 port_piocnt; + /* first pio buffer for this port */ + u32 port_pio_base; /* chip offset of PIO buffers for this port */ u32 port_piobufs; /* how many alloc_pages() chunks in port_rcvegrbuf_pages */ @@ -384,6 +388,8 @@ struct ipath_devdata { u32 ipath_lastrpkts; /* pio bufs allocated per port */ u32 ipath_pbufsport; + /* if remainder on bufs/port, ports < extrabuf get 1 extra */ + u32 ipath_ports_extrabuf; u32 ipath_pioupd_thresh; /* update threshold, some chips */ /* * number of ports configured as max; zero is set to number chip @@ -1011,7 +1017,7 @@ void ipath_get_eeprom_info(struct ipath_devdata *); int ipath_update_eeprom_log(struct ipath_devdata *dd); void ipath_inc_eeprom_err(struct ipath_devdata *dd, u32 eidx, u32 incr); u64 ipath_snap_cntr(struct ipath_devdata *, ipath_creg); -void ipath_disarm_senderrbufs(struct ipath_devdata *, int); +void ipath_disarm_senderrbufs(struct ipath_devdata *); void ipath_force_pio_avail_update(struct ipath_devdata *); void signal_ib_event(struct ipath_devdata *dd, enum ib_event_type ev); diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c index 8ac5c1d..9e3fe61 100644 --- a/drivers/infiniband/hw/ipath/ipath_ruc.c +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c @@ -481,9 +481,10 @@ done: wake_up(&qp->wait); } -static void want_buffer(struct ipath_devdata *dd) +static void want_buffer(struct ipath_devdata *dd, struct ipath_qp *qp) { - if (!(dd->ipath_flags & IPATH_HAS_SEND_DMA)) { + if (!(dd->ipath_flags & IPATH_HAS_SEND_DMA) || + qp->ibqp.qp_type == IB_QPT_SMI) { unsigned long flags; spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); @@ -519,7 +520,7 @@ static void ipath_no_bufs_available(struct ipath_qp *qp, spin_lock_irqsave(&dev->pending_lock, flags); list_add_tail(&qp->piowait, &dev->piowait); spin_unlock_irqrestore(&dev->pending_lock, flags); - want_buffer(dev->dd); + want_buffer(dev->dd, qp); dev->n_piowait++; } diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c index 1974df7..0d07682 100644 --- a/drivers/infiniband/hw/ipath/ipath_sdma.c +++ b/drivers/infiniband/hw/ipath/ipath_sdma.c @@ -449,16 +449,19 @@ int setup_sdma(struct ipath_devdata *dd) ipath_write_kreg(dd, dd->ipath_kregs->kr_senddmaheadaddr, dd->ipath_sdma_head_phys); - /* Reserve all the former "kernel" piobufs */ - n = dd->ipath_piobcnt2k + dd->ipath_piobcnt4k - dd->ipath_pioreserved; - for (i = dd->ipath_lastport_piobuf; i < n; ++i) { + /* + * Reserve all the former "kernel" piobufs, using high number range + * so we get as many 4K buffers as possible + */ + n = dd->ipath_piobcnt2k + dd->ipath_piobcnt4k; + i = dd->ipath_lastport_piobuf + dd->ipath_pioreserved; + ipath_chg_pioavailkernel(dd, i, n - i , 0); + for (; i < n; ++i) { unsigned word = i / 64; unsigned bit = i & 63; BUG_ON(word >= 3); senddmabufmask[word] |= 1ULL << bit; } - ipath_chg_pioavailkernel(dd, dd->ipath_lastport_piobuf, - n - dd->ipath_lastport_piobuf, 0); ipath_write_kreg(dd, dd->ipath_kregs->kr_senddmabufmask0, senddmabufmask[0]); ipath_write_kreg(dd, dd->ipath_kregs->kr_senddmabufmask1, From ralph.campbell at qlogic.com Tue May 6 11:36:52 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 06 May 2008 11:36:52 -0700 Subject: [ofa-general] [PATCH 7/7] IB/ipath - fix SDMA error recovery in absence of link status change In-Reply-To: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> References: 
<20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> Message-ID: <20080506183652.6521.21456.stgit@eng-46.mv.qlogic.com> From: John Gregor What's fixed: in ipath_cancel_sends() We need to unconditionally set ABORTING. So, swapped the tests so the set_bit() isn't shadowed by the &&. If we've disarmed the piobufs, then we need to unconditionally set DISARMED. So, moved it out from the overly protective if at the bottom. in sdma_abort_task() Abort_task was written knowing that the SDMA engine would always be reset (and restarted) on error. A recent change broke that fundamental assumption by taking the restart portion and making it conditional on a link status change. But, SDMA can go boom without a link status change in some conditions. Signed-off-by: John Gregor --- drivers/infiniband/hw/ipath/ipath_driver.c | 8 +++++-- drivers/infiniband/hw/ipath/ipath_sdma.c | 31 ++++++++++++++++++++++------ 2 files changed, 29 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 2036d38..ce7b7c3 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -1898,8 +1898,8 @@ void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl) spin_lock_irqsave(&dd->ipath_sdma_lock, flags); skip_cancel = - !test_bit(IPATH_SDMA_DISABLED, statp) && - test_and_set_bit(IPATH_SDMA_ABORTING, statp); + test_and_set_bit(IPATH_SDMA_ABORTING, statp) + && !test_bit(IPATH_SDMA_DISABLED, statp); spin_unlock_irqrestore(&dd->ipath_sdma_lock, flags); if (skip_cancel) goto bail; @@ -1930,6 +1930,9 @@ void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl) ipath_disarm_piobufs(dd, 0, dd->ipath_piobcnt2k + dd->ipath_piobcnt4k); + if (dd->ipath_flags & IPATH_HAS_SEND_DMA) + set_bit(IPATH_SDMA_DISARMED, &dd->ipath_sdma_status); + if (restore_sendctrl) { /* else done by caller later if needed */ spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); @@ -1949,7 +1952,6 @@ void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl) /* only wait so long for intr */ dd->ipath_sdma_abort_intr_timeout = jiffies + HZ; dd->ipath_sdma_reset_wait = 200; - __set_bit(IPATH_SDMA_DISARMED, &dd->ipath_sdma_status); if (!test_bit(IPATH_SDMA_SHUTDOWN, &dd->ipath_sdma_status)) tasklet_hi_schedule(&dd->ipath_sdma_abort_task); spin_unlock_irqrestore(&dd->ipath_sdma_lock, flags); diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c index 0d07682..3697449 100644 --- a/drivers/infiniband/hw/ipath/ipath_sdma.c +++ b/drivers/infiniband/hw/ipath/ipath_sdma.c @@ -308,13 +308,15 @@ static void sdma_abort_task(unsigned long opaque) spin_unlock_irqrestore(&dd->ipath_sdma_lock, flags); /* - * Don't restart sdma here. Wait until link is up to ACTIVE. - * VL15 MADs used to bring the link up use PIO, and multiple - * link transitions otherwise cause the sdma engine to be + * Don't restart sdma here (with the exception + * below). Wait until link is up to ACTIVE. VL15 MADs + * used to bring the link up use PIO, and multiple link + * transitions otherwise cause the sdma engine to be * stopped and started multiple times. - * The disable is done here, including the shadow, so the - * state is kept consistent. - * See ipath_restart_sdma() for the actual starting of sdma. + * The disable is done here, including the shadow, + * so the state is kept consistent. + * See ipath_restart_sdma() for the actual starting + * of sdma. 
*/ spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); dd->ipath_sendctrl &= ~INFINIPATH_S_SDMAENABLE; @@ -326,6 +328,13 @@ static void sdma_abort_task(unsigned long opaque) /* make sure I see next message */ dd->ipath_sdma_abort_jiffies = 0; + /* + * Not everything that takes SDMA offline is a link + * status change. If the link was up, restart SDMA. + */ + if (dd->ipath_flags & IPATH_LINKACTIVE) + ipath_restart_sdma(dd); + goto done; } @@ -427,7 +436,12 @@ int setup_sdma(struct ipath_devdata *dd) goto done; } - dd->ipath_sdma_status = 0; + /* + * Set initial status as if we had been up, then gone down. + * This lets initial start on transition to ACTIVE be the + * same as restart after link flap. + */ + dd->ipath_sdma_status = IPATH_SDMA_ABORT_ABORTED; dd->ipath_sdma_abort_jiffies = 0; dd->ipath_sdma_generation = 0; dd->ipath_sdma_descq_tail = 0; @@ -618,6 +632,9 @@ void ipath_restart_sdma(struct ipath_devdata *dd) ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); spin_unlock_irqrestore(&dd->ipath_sendctrl_lock, flags); + /* notify upper layers */ + ipath_ib_piobufavail(dd->verbs_dev); + bail: return; } From caitlin.bestler at neterion.com Tue May 6 12:28:28 2008 From: caitlin.bestler at neterion.com (Caitlin Bestler) Date: Tue, 6 May 2008 12:28:28 -0700 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <4820A427.1070405@opengridcomputing.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> Message-ID: <469958e00805061228s6a825066je6648d5e27db42a9@mail.gmail.com> On Tue, May 6, 2008 at 11:32 AM, Steve Wise wrote: > Roland Dreier wrote: > > > > - always do peer2peer and don't let the app choose. This forces > > > the overhead of p2p mode on all apps, but preserves the API. > > > > How bad is the overhead? > > > > - R. > > > > > The client side must send a "Ready To Receive" message. This will be > negotiated via the MPA exchange and the resulting RTR message may be a 0B > read + read response, 0B write, or a 0B send. For chelsio, the 0B write > couldn't be used, and the 0B read was the least impact on the driver code, > so we used that. For nes, they currently use a 0B write. > > Also, there are some "caveats" if you turn this on: > > 1) private data is used to negotiate the type of RTR message and if its > needed. This is more of a global module option I think, since it will > break interoperability with iwarp. Prolly will bump the MPA version number > if this option is on too. > > 2) if the RTR message fails, it can generate a CQE that is unexpected. > > 3) if using SEND, then a recv completion is always generated. > > Steve. > > > Keep in mind that even if it is a zero byte RDMA Write, it is still a distinct packet that needs TCP handling, will occupy a buffer in various switch queues, etc. So while it can be about as innocuous as any TCP segment can be, it is still an excess packet if it did not need to be sent. The overwhelming majority of applications use a client/server model rather than peer2peer. For them this is an excess wire packet, so I think that would make it excessive overhead. Secondly, the applications that need this feature will generally know that they need it. Developers of MPI and other peer-2-peer applications tend to know advanced networking a bit more than typical app developers. So keeping the default to match the client/server model makes sense. 
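To make the negotiation described in this thread concrete, here is a hedged sketch of the capability exchange Steve describes: each peer advertises in the MPA private data which zero-byte RTR operations it supports, and the client sends one of the mutually supported types. All names below are hypothetical illustrations; none of them come from the actual cxgb3 or nes drivers.

/* Hypothetical encoding of supported zero-byte RTR types. */
enum rtr_type {
	RTR_NONE   = 0,      /* plain client/server: no RTR sent */
	RTR_READ0  = 1 << 0, /* 0B RDMA Read + Read Response */
	RTR_WRITE0 = 1 << 1, /* 0B RDMA Write */
	RTR_SEND0  = 1 << 2, /* 0B Send; always generates a recv CQE */
};

/* Pick the RTR type the client will send, given the capability bits
 * both sides exchanged in the MPA private data. */
static enum rtr_type negotiate_rtr(unsigned char local, unsigned char remote)
{
	unsigned char common = local & remote;

	if (common & RTR_READ0)
		return RTR_READ0;
	if (common & RTR_WRITE0)
		return RTR_WRITE0;
	if (common & RTR_SEND0)
		return RTR_SEND0;
	return RTR_NONE; /* fall back for interop with plain iWARP peers */
}

An empty intersection would correspond to the interoperability fallback mentioned above: behave like an ordinary iWARP connection and send no RTR.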
From swise at opengridcomputing.com  Tue May  6 13:37:58 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 06 May 2008 15:37:58 -0500
Subject: [ofa-general] Re: [PATCH] Request For Comments:
In-Reply-To: <469958e00805061228s6a825066je6648d5e27db42a9@mail.gmail.com>
References: <20080506170230.11409.43625.stgit@dell3.ogc.int>
	<4820A427.1070405@opengridcomputing.com>
	<469958e00805061228s6a825066je6648d5e27db42a9@mail.gmail.com>
Message-ID: <4820C1A6.8020104@opengridcomputing.com>

Caitlin Bestler wrote:
> On Tue, May 6, 2008 at 11:32 AM, Steve Wise wrote:
>> Roland Dreier wrote:
>>> > - always do peer2peer and don't let the app choose.  This forces
>>> > the overhead of p2p mode on all apps, but preserves the API.
>>>
>>> How bad is the overhead?
>>>
>>>  - R.
>>>
>> The client side must send a "Ready To Receive" message.  This will be
>> negotiated via the MPA exchange and the resulting RTR message may be a 0B
>> read + read response, 0B write, or a 0B send.  For chelsio, the 0B write
>> couldn't be used, and the 0B read was the least impact on the driver code,
>> so we used that.  For nes, they currently use a 0B write.
>>
>> Also, there are some "caveats" if you turn this on:
>>
>> 1) private data is used to negotiate the type of RTR message and if its
>> needed.  This is more of a global module option I think, since it will
>> break interoperability with iwarp.  Prolly will bump the MPA version number
>> if this option is on too.
>>
>> 2) if the RTR message fails, it can generate a CQE that is unexpected.
>>
>> 3) if using SEND, then a recv completion is always generated.
>>
>> Steve.
>
> Keep in mind that even if it is a zero byte RDMA Write, it is still a distinct
> packet that needs TCP handling, will occupy a buffer in various switch
> queues, etc.
>
> So while it can be about as innocuous as any TCP segment can be, it
> is still an excess packet if it did not need to be sent. The overwhelming
> majority of applications use a client/server model rather than peer2peer.
> For them this is an excess wire packet, so I think that would make it
> excessive overhead.
>
> Secondly, the applications that need this feature will generally know
> that they need it. Developers of MPI and other peer-2-peer applications
> tend to know advanced networking a bit more than typical app developers.
> So keeping the default to match the client/server model makes sense.

What are the overwhelming majority of user mode rdma applications that
don't assume a peer2peer model?

Steve.

From eli at dev.mellanox.co.il  Wed May  7 00:34:58 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 07 May 2008 10:34:58 +0300
Subject: [ofa-general] getting network statistics
In-Reply-To: <20080506141039.GJ6586@cefeid.wcss.wroc.pl>
References: <1203424196.16145.1.camel@mtls03>
	<20080506141039.GJ6586@cefeid.wcss.wroc.pl>
Message-ID: <1210145698.15669.78.camel@mtls03>

These files are on a virtual file system and their size does not
change. You need to read them, e.g. using cat, in order to get the
statistics data. For example, "cat port_rcv_data" will give you a
measure of how many bytes of data were received by the port.
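For scripted monitoring the same counters can also be read from C. Below is a minimal sketch; the device name (mthca0), port number and one-second sampling interval are arbitrary choices, and the scaling by 4 reflects the IB spec's definition of the Port{Xmit,Rcv}Data counters in 4-octet units, which is worth verifying against your HCA's documentation.

#include <stdio.h>
#include <unistd.h>

static unsigned long long read_counter(const char *path)
{
	unsigned long long v = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%llu", &v) != 1)
			v = 0;
		fclose(f);
	}
	return v;
}

int main(void)
{
	const char *path =
		"/sys/class/infiniband/mthca0/ports/1/counters/port_rcv_data";
	unsigned long long before, after;

	/* sample the counter twice to estimate receive throughput */
	before = read_counter(path);
	sleep(1);
	after = read_counter(path);
	printf("approx. %llu bytes/sec received\n", (after - before) * 4);
	return 0;
}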
On Tue, 2008-05-06 at 16:10 +0200, Pawel Dziekonski wrote:
> you mean port_rcv_data and port_xmit_data ?
>
> if so, then I have 2 jobs that are definitely using IB network, but
> those files almost do not change. :o
>
> OFED 1.2.5.5 and kernel 2.6.9-55.0.12.ELsmp
>
> root at wn111:/sys/class/infiniband/mthca0/ports/1/counters # ls -al
> total 0
> drwxr-xr-x 2 root root    0 May  6 15:45 ./
> drwxr-xr-x 5 root root    0 May  6 15:45 ../
> -r--r--r-- 1 root root 4096 May  6 15:45 VL15_dropped
> -r--r--r-- 1 root root 4096 May  6 15:45 excessive_buffer_overrun_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 link_downed
> -r--r--r-- 1 root root 4096 May  6 15:45 link_error_recovery
> -r--r--r-- 1 root root 4096 May  6 15:45 local_link_integrity_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 port_rcv_constraint_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 port_rcv_data
> -r--r--r-- 1 root root 4096 May  6 15:45 port_rcv_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 port_rcv_packets
> -r--r--r-- 1 root root 4096 May  6 15:45 port_rcv_remote_physical_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 port_rcv_switch_relay_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 port_xmit_constraint_errors
> -r--r--r-- 1 root root 4096 May  6 15:45 port_xmit_data
> -r--r--r-- 1 root root 4096 May  6 15:45 port_xmit_discards
> -r--r--r-- 1 root root 4096 May  6 15:45 port_xmit_packets
> -r--r--r-- 1 root root 4096 May  6 15:45 symbol_error
>
> On Tue, 19 Feb 2008 at 02:29:56PM +0200, Eli Cohen wrote:
> > cat /sys/class/infiniband/mlx4_0/ports/1/counters/*
> >
> > mlx4_* can be mthca*
> >
> > On Tue, 2008-02-19 at 11:03 +0200, David Minor wrote:
> > > Under Linux with Mellanox ofed, how can I get real-time network
> > > statistics. e.g. how many bytes are being sent and received over each
> > > port at any given time?

From eli at dev.mellanox.co.il  Wed May  7 01:14:24 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 07 May 2008 11:14:24 +0300
Subject: [ofa-general] [PATCH] IB/IPoIB: Separate IB events to groups and
	handle each according to level of severity
In-Reply-To: <48206690.3090604@Voltaire.COM>
References: <48206690.3090604@Voltaire.COM>
Message-ID: <1210148064.15669.84.camel@mtls03>

On Tue, 2008-05-06 at 17:09 +0300, Moni Shoua wrote:
> The purpose of this patch is to make the events that are related to SM change
> (namely CLIENT_REREGISTER event and SM_CHANGE event) less disruptive.
> When SM related events are handled, it is not necessary to flush unicast
> info from device but only multicast info. This patch divides the events that are
> handled by IPoIB to three categories; 0, 1 and 2 (when 2 does more than 1 and 1
> does more than 0).
> The main change is in __ipoib_ib_dev_flush(). Instead of flagging to the function
> about pkey_events we now use leveling. An event that requires "harder" flushing
> calls this function with higher number for level. Besides the concept,
> the actual change is that SM related events are not flushing unicast info and
> not bringing the device down but only refresh the multicast info in the background.

As far as I know, when an SM change event occurs, it could mean the SM
changed and the new one "decided" to reprogram all the LIDs, for
example. In that case you will issue only level 0 and then all your
neighbours can become invalid.
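For readers following the thread, here is a sketch of the leveling idea under discussion. The IB event constants and __ipoib_ib_dev_flush() are real names (the latter taken from Moni's description); the function signature and the mapping of events to levels shown here are only one plausible reading of the patch, and Eli's objection above is precisely that level 0 may be too weak for SM_CHANGE.

/* Illustrative dispatch only -- not the actual patch. */
static void ipoib_event_to_flush_level(struct ipoib_dev_priv *priv,
				       enum ib_event_type event)
{
	switch (event) {
	case IB_EVENT_SM_CHANGE:
	case IB_EVENT_CLIENT_REREGISTER:
		/* level 0: only refresh multicast info in the background */
		__ipoib_ib_dev_flush(priv, 0);
		break;
	case IB_EVENT_PKEY_CHANGE:
		/* level 1: also flush unicast info */
		__ipoib_ib_dev_flush(priv, 1);
		break;
	case IB_EVENT_PORT_ERR:
		/* level 2: bring the device down and back up */
		__ipoib_ib_dev_flush(priv, 2);
		break;
	default:
		break;
	}
}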
From olaf.kirch at oracle.com  Wed May  7 03:50:22 2008
From: olaf.kirch at oracle.com (Olaf Kirch)
Date: Wed, 7 May 2008 12:50:22 +0200
Subject: [ofa-general] New rds patches for ofed 1.3.1
Message-ID: <200805071250.22719.olaf.kirch@oracle.com>

Hi,

I have two more RDS kernel patches for OFED 1.3.1, and one additional
rds-tools patch. They're available from my git trees at on branch
code-drop-20080507

If you have any feedback, please let me know.

At this point, I'm not going to submit the dma_sync patches yet. I think
they need more testing, and I'd rather postpone them to OFED 1.3.2.

I'll also post these patches in a follow-up email to this message.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

From olaf.kirch at oracle.com  Wed May  7 03:51:52 2008
From: olaf.kirch at oracle.com (Olaf Kirch)
Date: Wed, 7 May 2008 12:51:52 +0200
Subject: [ofa-general] [PATCH 1/3] Change RDMA completion notifications
In-Reply-To: <200805071250.22719.olaf.kirch@oracle.com>
References: <200805071250.22719.olaf.kirch@oracle.com>
Message-ID: <200805071251.53377.olaf.kirch@oracle.com>

commit 9194a75cf945beee95f8fb8ab08015d05aa797d4
Author: Olaf Kirch
Date:   Wed May 7 10:40:13 2008 +0200

    Change RDMA completion notifications

    If the user asked for a completion notification on RDMA ops, we can
    implement three different semantics:
     1. Notify when we received the ACK on the RDS message that was queued
        with the RDMA. This provides reliable notification of RDMA status
        at the expense of a one-way packet delay.
     2. Notify when the IB stack gives us the completion event for the
        RDMA operation.
     3. Notify when the IB stack gives us the completion event for the
        accompanying RDS messages.

    In OFED 1.3, RDS implemented approach #1. This turns out to be too slow
    for some purposes, so I'm switching to approach #3 with this patch.
    I'm leaving the old code in place however, so that we can support
    different modes later if we want.

    Signed-off-by: Olaf Kirch

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 4bbab10..724167c 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -53,6 +53,23 @@ void rds_ib_send_unmap_rm(struct rds_ib_connection *ic,
 	/* raise rdma completion hwm */
 	if (rm->m_rdma_op && success) {
+		/* If the user asked for a completion notification on this
+		 * message, we can implement three different semantics:
+		 * 1. Notify when we received the ACK on the RDS message
+		 *    that was queued with the RDMA. This provides reliable
+		 *    notification of RDMA status at the expense of a one-way
+		 *    packet delay.
+		 * 2. Notify when the IB stack gives us the completion event for
+		 *    the RDMA operation.
+		 * 3. Notify when the IB stack gives us the completion event for
+		 *    the accompanying RDS messages.
+		 * Here, we implement approach #3. To implement approach #2,
+		 * call rds_rdma_send_complete from the cq_handler.
To implement #1, + * don't call rds_rdma_send_complete at all, and fall back to the notify + * handling in the ACK processing code. + */ + rds_rdma_send_complete(rm); + if (rm->m_rdma_op->r_write) rds_stats_add(s_send_rdma_bytes, rm->m_rdma_op->r_bytes); else diff --git a/net/rds/rdma.h b/net/rds/rdma.h index 2ff0cea..289f962 100644 --- a/net/rds/rdma.h +++ b/net/rds/rdma.h @@ -71,5 +71,6 @@ int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm, int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm, struct cmsghdr *cmsg); void rds_rdma_free_op(struct rds_rdma_op *ro); +void rds_rdma_send_complete(struct rds_message *rm); #endif diff --git a/net/rds/send.c b/net/rds/send.c index 26e1e3e..2b7661d 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -356,6 +356,42 @@ int rds_send_acked_before(struct rds_connection *conn, u64 seq) } /* + * This is pretty similar to what happens below in the ACK + * handling code - except that we call here as soon as we get + * the IB send completion on the RDMA op and the accompanying + * message. + */ +void rds_rdma_send_complete(struct rds_message *rm) +{ + struct rds_sock *rs = NULL; + struct rds_rdma_op *ro; + struct rds_notifier *notifier; + + spin_lock(&rm->m_rs_lock); + + ro = rm->m_rdma_op; + if (test_bit(RDS_MSG_ON_SOCK, &rm->m_flags) + && ro && ro->r_notify + && (notifier = ro->r_notifier) != NULL) { + rs = rm->m_rs; + sock_hold(rds_rs_to_sk(rs)); + + spin_lock(&rs->rs_lock); + list_add_tail(¬ifier->n_list, &rs->rs_notify_queue); + spin_unlock(&rs->rs_lock); + + ro->r_notifier = NULL; + } + + spin_unlock(&rm->m_rs_lock); + + if (rs) { + rds_wake_sk_sleep(rs); + sock_put(rds_rs_to_sk(rs)); + } +} + +/* * This removes messages from the socket's list if they're on it. The list * argument must be private to the caller, we must be able to modify it * without locks. The messages must have a reference held for their -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From olaf.kirch at oracle.com Wed May 7 03:53:56 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Wed, 7 May 2008 12:53:56 +0200 Subject: [ofa-general] [PATCH 3/3] rds-stress: fix RDS congestion monitoring In-Reply-To: <200805071252.46590.olaf.kirch@oracle.com> References: <200805071250.22719.olaf.kirch@oracle.com> <200805071251.53377.olaf.kirch@oracle.com> <200805071252.46590.olaf.kirch@oracle.com> Message-ID: <200805071253.57429.olaf.kirch@oracle.com> commit 4dacd1a8270aa226bfff157af4519fa33c820253 Author: Olaf Kirch Date: Wed May 7 10:42:55 2008 +0200 Fix RDS congestion monitoring The RDS congestion monitoring code tries to help applications deal with remote congestion more efficiently. If enabled, an application that tried to send to a congested port will receive a notification as soon as the port becomes uncongested again. 
For efficiency reasons, the application isn't given a complete 8K congestion bitmap, but a 64bit mask that represents the ports having changed, with port N being represented by (1 << (port % 64)) The macro used to translate port numbers to the mask bit shifted integer 1, not 1ULL, resulting in undefined behavior when (port % 64) >= 32 Signed-off-by: Olaf Kirch diff --git a/net/ib_rds.h b/net/ib_rds.h index cea73fc..e098036 100644 --- a/net/ib_rds.h +++ b/net/ib_rds.h @@ -176,7 +176,7 @@ struct rds_info_tcp_socket { */ #define RDS_CONG_MONITOR_SIZE 64 #define RDS_CONG_MONITOR_BIT(port) (((unsigned int) port) % RDS_CONG_MONITOR_SIZE) -#define RDS_CONG_MONITOR_MASK(port) (1 << RDS_CONG_MONITOR_BIT(port)) +#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port)) /* * RDMA related types -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From olaf.kirch at oracle.com Wed May 7 03:54:13 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Wed, 7 May 2008 12:54:13 +0200 Subject: [ofa-general] [PATCH 2/3] RDS: Fix RDS congestion monitoring In-Reply-To: <200805071251.53377.olaf.kirch@oracle.com> References: <200805071250.22719.olaf.kirch@oracle.com> <200805071251.53377.olaf.kirch@oracle.com> Message-ID: <200805071254.14211.olaf.kirch@oracle.com> commit 01b72f27721c2dccf42dd5eea2e5d5e573cd8585 Author: Olaf Kirch Date: Wed May 7 10:40:36 2008 +0200 Fix RDS congestion monitoring The RDS congestion monitoring code tries to help applications deal with remote congestion more efficiently. If enabled, an application that tried to send to a congested port will receive a notification as soon as the port becomes uncongested again. For efficiency reasons, the application isn't given a complete 8K congestion bitmap, but a 64bit mask that represents the ports having changed, with port N being represented by (1 << (port % 64)) This code had several bugs in it: - the macro used to translate port numbers to the mask bit shifted integer 1, not 1ULL, resulting in undefined behavior when (port % 64) >= 32 - rds_ib_cong_recv computes the 64bit mask of all ports that changed from congested to uncongested. It got the bit arithmetics wrong. Also, it used be64_to_cpu to convert the mask, which is wrong, as the congestion map is little endian - in the IB send completion handler, we need to check whether there is a pending congestion map update we need to send. 
- in rds_poll, we should grab the rs_lock spinlock when testing whether
  rs_cong_mask is non-zero

Signed-off-by: Olaf Kirch

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 7c20dd4..2fa2d0e 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -165,8 +165,10 @@ static unsigned int rds_poll(struct file *file, struct socket *sock,
 		if (rds_cong_updated_since(&rs->rs_cong_track))
 			mask |= (POLLIN | POLLRDNORM | POLLWRBAND);
 	} else {
+		spin_lock(&rs->rs_lock);
 		if (rs->rs_cong_notify)
 			mask |= (POLLIN | POLLRDNORM);
+		spin_unlock(&rs->rs_lock);
 	}
 	if (!list_empty(&rs->rs_recv_queue)
 	 || !list_empty(&rs->rs_notify_queue))
diff --git a/net/rds/cong.c b/net/rds/cong.c
index 4ec85ce..beeb539 100644
--- a/net/rds/cong.c
+++ b/net/rds/cong.c
@@ -238,7 +238,7 @@ void rds_cong_map_updated(struct rds_cong_map *map, uint64_t portmask)
 	if (waitqueue_active(&rds_poll_waitq))
 		wake_up_all(&rds_poll_waitq);

-	if (!list_empty(&rds_cong_monitor)) {
+	if (portmask && !list_empty(&rds_cong_monitor)) {
 		unsigned long flags;
 		struct rds_sock *rs;

diff --git a/net/rds/ib_rds.h b/net/rds/ib_rds.h
index cea73fc..e098036 100644
--- a/net/rds/ib_rds.h
+++ b/net/rds/ib_rds.h
@@ -176,7 +176,7 @@ struct rds_info_tcp_socket {
  */
 #define RDS_CONG_MONITOR_SIZE	64
 #define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) port) % RDS_CONG_MONITOR_SIZE)
-#define RDS_CONG_MONITOR_MASK(port) (1 << RDS_CONG_MONITOR_BIT(port))
+#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))

 /*
  * RDMA related types
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 6c9cc9e..8ffbb0c 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -639,8 +639,9 @@ static void rds_ib_cong_recv(struct rds_connection *conn,
 		src = addr + frag_off;
 		dst = (void *)map->m_page_addrs[map_page] + map_off;
 		for (k = 0; k < to_copy; k += 8) {
-			/* Record ports that became uncongested */
-			uncongested |= *src & (*src ^ *dst);
+			/* Record ports that became uncongested, ie
+			 * bits that changed from 0 to 1. */
+			uncongested |= ~(*src) & *dst;
 			*dst++ = *src++;
 		}
 		kunmap_atomic(addr, KM_SOFTIRQ0);
@@ -662,7 +663,7 @@ static void rds_ib_cong_recv(struct rds_connection *conn,
 	}

 	/* the congestion map is in little endian order */
-	uncongested = be64_to_cpu(uncongested);
+	uncongested = le64_to_cpu(uncongested);
 	rds_cong_map_updated(map, uncongested);
 }

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 724167c..567f62f 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -218,7 +218,8 @@ void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context)

 	rds_ib_ring_free(&ic->i_send_ring, completed);

-	if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags))
+	if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags)
+	 || test_bit(0, &conn->c_map_queued))
 		queue_delayed_work(rds_wq, &conn->c_send_w, 0);

 	/* We expect errors as the qp is drained during shutdown */
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

From andrea at qumranet.com  Wed May  7 07:35:51 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Wed, 07 May 2008 16:35:51 +0200
Subject: [ofa-general] [PATCH 01 of 11] mmu-notifier-core
In-Reply-To:
Message-ID:

# HG changeset patch
# User Andrea Arcangeli
# Date 1210096013 -7200
# Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c
# Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
mmu-notifier-core

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean "secondary pte". In the GRU case
there's no actual secondary pte and there's only a secondary tlb,
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case of KVM, where a
secondary tlb miss will walk sptes in hardware and will refill the
secondary tlb transparently to software if the corresponding spte is
present).

The same way zap_page_range has to invalidate the pte before freeing
the page, the spte (and secondary tlb) must also be invalidated before
any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte, because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page being freed when the pin is
released (requiring the same complex and relatively slow tlb_gather
smp-safe logic we have in zap_page_range, which can be avoided
completely if the spte unmap event doesn't require an unpin of the page
previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary
MMU, so that the secondary MMU code can drop sptes before the pages are
freed, avoiding all page pinning and allowing 100% reliable swapping of
guest physical address space. It also spares the code that tears down
secondary MMU mappings from implementing tlb_gather-like logic as in
zap_page_range, which would require many IPIs to flush other cpu tlbs
for each fixed number of sptes unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect), the secondary MMU mappings
will be invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an updated
spte or secondary-tlb-mapping on the copied page. Or it will set up a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0. This is just an example.

This allows mapping any page pointed to by any pte (and in turn visible
in the primary CPU MMU) into a secondary MMU (be it a pure tlb like
GRU, or a full MMU with both sptes and secondary tlb like the
shadow-pagetable layer of kvm), or into a software remote DMA like
XPMEM (hence the need to schedule in XPMEM code to send the invalidate
to the remote node, while there's no need to schedule in kvm/gru, as
it's an immediate event like invalidating a primary-mmu pte).

At least for KVM without this patch it's impossible to swap guests
reliably.
And having this feature and removing the page pin allows several other
optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
   list_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't doing
   anything with "mm". This allows mmu notifier users to keep track of
   whether the VM is in the middle of the invalidate_range_begin/end
   critical section with an atomic counter increased in range_begin and
   decreased in range_end. No secondary MMU page fault is allowed to map
   any spte or secondary tlb reference while the VM is in the middle of
   range_begin/end, as any page returned by get_user_pages in that
   critical section could later immediately be freed without any further
   ->invalidate_page notification (invalidate_range_begin/end works on
   ranges and ->invalidate_page isn't called immediately before freeing
   the page). To stop all page freeing and pagetable overwrites, the
   mmap_sem must be taken in write mode and all other anon_vma/i_mmap
   locks must be taken in virtual address order. The order is critical
   to avoid mm_lock(mm1) and mm_lock(mm2) running concurrently and
   triggering lock inversion deadlocks.

3) It'd be a waste to add branches in the VM if nobody could possibly
   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be
   enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
   advantage of mmu notifiers, but this already allows compiling a KVM
   external module against a kernel with mmu notifiers enabled, and
   from the next pull from kvm.git we'll start using them. And GRU/XPMEM
   will also be able to continue development by enabling KVM=m in their
   config, until they submit all GRU/XPMEM GPLv2 code to the mainline
   kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM
   does it (even if KVM=n). This guarantees nobody selects
   MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n.

The mmu_notifier_register call can fail because mm_lock may not allocate
the required vmalloc space. See the comment on top of the mm_lock()
implementation for the worst-case memory requirements. Because
mmu_notifier_register is used at driver startup, a failure can be
gracefully handled. Here is an example of the change applied to kvm to
register the mmu notifiers. Usually when a driver starts up, other
allocations are required anyway and -ENOMEM failure paths exist already.

 struct kvm *kvm_arch_create_vm(void)
 {
 	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+	int err;

 	if (!kvm)
 		return ERR_PTR(-ENOMEM);

 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+	if (err) {
+		kfree(kvm);
+		return ERR_PTR(err);
+	}
+
 	return kvm;
 }

mmu_notifier_unregister returns void and it's reliable.

Signed-off-by: Andrea Arcangeli
Signed-off-by: Nick Piggin
Signed-off-by: Christoph Lameter

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis
  * or hlist_del_rcu(), running on this same list.
+ */ +struct mmu_notifier_mm { + /* all mmu notifiers registered in this mm are queued in this list */ + struct hlist_head list; + /* srcu structure for this mm */ + struct srcu_struct srcu; + /* to serialize the list modifications and hlist_unhashed */ + spinlock_t lock; +}; + +struct mmu_notifier_ops { + /* + * Called either by mmu_notifier_unregister or when the mm is + * being destroyed by exit_mmap, always before all pages are + * freed. This can run concurrently with other mmu notifier + * methods (the ones invoked outside the mm context) and it + * should tear down all secondary mmu mappings and freeze the + * secondary mmu. If this method isn't implemented you have to + * be sure that nothing could possibly write to the pages + * through the secondary mmu by the time the last thread with + * tsk->mm == mm exits. + * + * As a side note: the pages freed after ->release returns could + * be immediately reallocated by the gart at an alias physical + * address with a different cache model, so if ->release isn't + * implemented because all _software_ driven memory accesses + * through the secondary mmu are terminated by the time the + * last thread of this mm quits, you also have to be sure that + * speculative _hardware_ operations can't allocate dirty + * cachelines in the cpu that could not be snooped and made + * coherent with the other read and write operations happening + * through the gart alias address, thus leading to memory + * corruption. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * clear_flush_young is called after the VM is + * test-and-clearing the young/accessed bitflag in the + * pte. This way the VM will provide proper aging to the + * accesses to the page through the secondary MMUs and not + * only to the ones through the Linux pte. + */ + int (*clear_flush_young)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * Before this is invoked any secondary MMU is still ok to + * read/write to the page previously pointed to by the Linux + * pte because the page hasn't been freed yet and it won't be + * freed until this returns. If required set_page_dirty has to + * be called internally to this method. + */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_range_start() and invalidate_range_end() must be + * paired and are called only when the mmap_sem and/or the + * locks protecting the reverse maps are held. Both functions + * may sleep. The subsystem must guarantee that no additional + * references are taken to the pages in the range established + * between the call to invalidate_range_start() and the + * matching call to invalidate_range_end(). + * + * Invalidation of multiple concurrent ranges may be + * optionally permitted by the driver. Either way the + * establishment of sptes is forbidden in the range passed to + * invalidate_range_begin/end for the whole duration of the + * invalidate_range_begin/end critical section. + * + * invalidate_range_start() is called when all pages in the + * range are still mapped and have at least a refcount of one. + * + * invalidate_range_end() is called when all pages in the + * range have been unmapped and the pages have been freed by + * the VM. + * + * The VM will remove the page table entries and potentially + * the page between invalidate_range_start() and + * invalidate_range_end().
If the page must not be freed + * because of pending I/O or other circumstances then the + * invalidate_range_start() callback (or the initial mapping + * by the driver) must make sure that the refcount is kept + * elevated. + * + * If the driver increases the refcount when the pages are + * initially mapped into an address space then either + * invalidate_range_start() or invalidate_range_end() may + * decrease the refcount. If the refcount is decreased on + * invalidate_range_start() then the VM can free pages as page + * table entries are removed. If the refcount is only + * dropped on invalidate_range_end() then the driver itself + * will drop the last refcount but it must take care to flush + * any secondary tlb before doing the final free on the + * page. Pages will no longer be referenced by the linux + * address space but may still be referenced by sptes until + * the last refcount is dropped. + */ + void (*invalidate_range_start)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_end)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +/* + * The notifier chains are protected by mmap_sem and/or the reverse map + * semaphores. Notifier chains are only changed when all reverse maps and + * the mmap_sem locks are taken. + * + * Therefore notifier chains can only be traversed when either + * + * 1. mmap_sem is held. + * 2. One of the reverse map locks is held (i_mmap_sem or anon_vma->sem). + * 3. No other concurrent thread can access the list (release) + */ +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +static inline int mm_has_notifiers(struct mm_struct *mm) +{ + return unlikely(mm->mmu_notifier_mm); +} + +extern int mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern int __mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); +extern void __mmu_notifier_release(struct mm_struct *mm); +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end); +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end); + + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_release(mm); +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + return __mmu_notifier_clear_flush_young(mm, address); + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_page(mm, address); +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_start(mm, start, end); +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end); +} + +static inline void
mmu_notifier_mm_init(struct mm_struct *mm) +{ + mm->mmu_notifier_mm = NULL; +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_mm_destroy(mm); +} + +/* + * These two macros will sometime replace ptep_clear_flush. + * ptep_clear_flush is impleemnted as macro itself, so this also is + * implemented as a macro until ptep_clear_flush will converted to an + * inline function, to diminish the risk of compilation failure. The + * invalidate_page method over time can be moved outside the PT lock + * and these two macros can be later removed. + */ +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ +({ \ + pte_t __pte; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __pte = ptep_clear_flush(___vma, ___address, __ptep); \ + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ + __pte; \ +}) + +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ +({ \ + int __young; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ + ___address); \ + __young; \ +}) + +#else /* CONFIG_MMU_NOTIFIER */ + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ +} + +#define ptep_clear_flush_young_notify ptep_clear_flush_young +#define ptep_clear_flush_notify ptep_clear_flush + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/include/linux/srcu.h b/include/linux/srcu.h --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -27,6 +27,8 @@ #ifndef _LINUX_SRCU_H #define _LINUX_SRCU_H +#include + struct srcu_struct_array { int c[2]; }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -53,6 +53,7 @@ #include #include #include +#include #include #include @@ -385,6 +386,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_mm_init(mm); return mm; } @@ -417,6 +419,7 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mmu_notifier_mm_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -205,3 +205,6 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + bool diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,7 +188,7 @@ __xip_unmap (struct address_space * mapp if 
(pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier_invalidate_range_start(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier_invalidate_range_end(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -823,6 +824,7 @@ void __unmap_hugepage_range(struct vm_ar BUG_ON(start & ~HPAGE_MASK); BUG_ON(end & ~HPAGE_MASK); + mmu_notifier_invalidate_range_start(mm, start, end); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -843,6 +845,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range_end(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -632,6 +633,7 @@ int copy_page_range(struct mm_struct *ds unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int ret; /* * Don't copy ptes where a page fault will fill them correctly. @@ -647,17 +649,33 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. + */ + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_start(src_mm, addr, end); + + ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next)) - return -ENOMEM; + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, + vma, addr, next))) { + ret = -ENOMEM; + break; + } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - return 0; + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_end(src_mm, + vma->vm_start, end); + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -861,7 +879,9 @@ unsigned long unmap_vmas(struct mmu_gath unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? 
details->i_mmap_lock: NULL; int fullmm = (*tlbp)->fullmm; + struct mm_struct *mm = vma->vm_mm; + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -912,6 +932,7 @@ unsigned long unmap_vmas(struct mmu_gath } } out: + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -1541,10 +1562,11 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); + mmu_notifier_invalidate_range_start(mm, start, end); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1552,6 +1574,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier_invalidate_range_end(mm, start, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1753,7 +1776,7 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. */ - ptep_clear_flush(vma, address, page_table); + ptep_clear_flush_notify(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -26,6 +26,9 @@ #include #include #include +#include +#include +#include #include #include @@ -2048,6 +2051,7 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mmu_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); @@ -2255,3 +2259,194 @@ int install_special_mapping(struct mm_st return 0; } + +static int mm_lock_cmp(const void *a, const void *b) +{ + unsigned long _a = (unsigned long)*(spinlock_t **)a; + unsigned long _b = (unsigned long)*(spinlock_t **)b; + + cond_resched(); + if (_a < _b) + return -1; + if (_a > _b) + return 1; + return 0; +} + +static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, + int anon) +{ + struct vm_area_struct *vma; + size_t i = 0; + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if (anon) { + if (vma->anon_vma) + locks[i++] = &vma->anon_vma->lock; + } else { + if (vma->vm_file && vma->vm_file->f_mapping) + locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + } + } + + if (!i) + goto out; + + sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + +out: + return i; +} + +static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 1); +} + +static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 0); +} + +static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +{ + spinlock_t *last = NULL; + size_t i; + + for (i = 0; i < nr; i++) + /* Multiple vmas may use the same lock. */ + if (locks[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) locks[i]); + last = locks[i]; + if (lock) + spin_lock(last); + else + spin_unlock(last); + } +} + +static inline void __mm_lock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 1); +} + +static inline void __mm_unlock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 0); +} + +/* + * This operation locks against the VM for all pte/vma/mm related + * operations that could ever happen on a certain mm. This includes + * vmtruncate, try_to_unmap, and all page faults. 
+ * + * The caller must take the mmap_sem in read or write mode before + * calling mm_lock(). The caller isn't allowed to release the mmap_sem + * until mm_unlock() returns. + * + * While mm_lock() itself won't strictly require the mmap_sem in write + * mode to be safe, in order to block all operations that could modify + * pagetables and free pages without need of altering the vma layout + * (for example populate_range() with nonlinear vmas) the mmap_sem + * must be taken in write mode by the caller. + * + * A single task can't take more than one mm_lock in a row or it would + * deadlock. + * + * The sorting is needed to avoid lock inversion deadlocks if two + * tasks run mm_lock at the same time on different mms that happen to + * share some anon_vmas/inodes mapped in different order. + * + * mm_lock and mm_unlock are expensive operations that may have to + * take thousands of locks. Thanks to sort() the complexity is + * O(N*log(N)) where N is the number of VMAs in the mm. The max number + * of vmas is defined in /proc/sys/vm/max_map_count. + * + * mm_lock() can fail if memory allocation fails. The worst case + * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *), + * so around 1Mbyte, but in practice it'll be much less because + * normally there won't be max_map_count vmas allocated in the task + * that runs mm_lock(). + * + * The vmalloc memory allocated by mm_lock is stored in the + * mm_lock_data structure that must be allocated by the caller and it + * must be later passed to mm_unlock that will free it after using it. + * Allocating the mm_lock_data structure on the stack is fine because + * it's only a couple of bytes in size. + * + * If mm_lock() returns -ENOMEM no memory has been allocated and the + * mm_lock_data structure can be freed immediately, and mm_unlock must + * not be called. + */ +int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) +{ + spinlock_t **anon_vma_locks, **i_mmap_locks; + + if (mm->map_count) { + anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!anon_vma_locks)) + return -ENOMEM; + + i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!i_mmap_locks)) { + vfree(anon_vma_locks); + return -ENOMEM; + } + + /* + * When mm_lock_sort_anon_vma/i_mmap returns zero it + * means there's no lock to take and so we can free + * the array here without waiting for mm_unlock. mm_unlock + * will do nothing if nr_i_mmap/anon_vma_locks is + * zero. + */ + data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); + data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); + + if (data->nr_anon_vma_locks) { + __mm_lock(anon_vma_locks, data->nr_anon_vma_locks); + data->anon_vma_locks = anon_vma_locks; + } else + vfree(anon_vma_locks); + + if (data->nr_i_mmap_locks) { + __mm_lock(i_mmap_locks, data->nr_i_mmap_locks); + data->i_mmap_locks = i_mmap_locks; + } else + vfree(i_mmap_locks); + } + return 0; +} + +static void mm_unlock_vfree(spinlock_t **locks, size_t nr) +{ + __mm_unlock(locks, nr); + vfree(locks); +} + +/* + * mm_unlock doesn't require any memory allocation and it won't fail. + * + * The mmap_sem cannot be released until mm_unlock returns. + * + * All memory has been previously allocated by mm_lock and it'll be + * all freed before returning. Only after mm_unlock returns is the + * caller allowed to free and forget the mm_lock_data structure. + * + * mm_unlock runs in O(N) where N is the max number of VMAs in the + * mm.
The max number of vmas is defined in + * /proc/sys/vm/max_map_count. + */ +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) +{ + if (mm->map_count) { + if (data->nr_anon_vma_locks) + mm_unlock_vfree(data->anon_vma_locks, + data->nr_anon_vma_locks); + if (data->nr_i_mmap_locks) + mm_unlock_vfree(data->i_mmap_locks, + data->nr_i_mmap_locks); + } +} diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,292 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include +#include +#include +#include +#include +#include +#include + +/* + * This function can't run concurrently against mmu_notifier_register + * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap + * runs with mm_users == 0. Other tasks may still invoke mmu notifiers + * in parallel despite there being no task using this mm any more, + * through the vmas outside of the exit_mmap context, such as with + * vmtruncate. This serializes against mmu_notifier_unregister with + * the mmu_notifier_mm->lock in addition to SRCU and it serializes + * against the other mmu notifiers with SRCU. struct mmu_notifier_mm + * can't go away from under us as exit_mmap holds an mm_count pin + * itself. + */ +void __mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + int srcu; + + spin_lock(&mm->mmu_notifier_mm->lock); + while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { + mn = hlist_entry(mm->mmu_notifier_mm->list.first, + struct mmu_notifier, + hlist); + /* + * We arrived before mmu_notifier_unregister so + * mmu_notifier_unregister will do nothing other than + * to wait for ->release to finish and + * for mmu_notifier_unregister to return. + */ + hlist_del_init_rcu(&mn->hlist); + /* + * SRCU here will block mmu_notifier_unregister until + * ->release returns. + */ + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * if ->release runs before mmu_notifier_unregister it + * must be handled as it's the only way for the driver + * to flush all existing sptes and stop the driver + * from establishing any more sptes before all the + * pages in the mm are freed. + */ + if (mn->ops->release) + mn->ops->release(mn, mm); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + spin_lock(&mm->mmu_notifier_mm->lock); + } + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * synchronize_srcu here prevents mmu_notifier_release from + * returning to exit_mmap (which would proceed freeing all pages + * in the mm) until the ->release method returns, if it was + * invoked by mmu_notifier_unregister. + * + * The mmu_notifier_mm can't go away from under us because one + * mm_count is held by exit_mmap. + */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); +} + +/* + * If no young bitflag is supported by the hardware, ->clear_flush_young can + * unmap the address and return 1 or 0 depending on whether the mapping + * previously existed or not.
+ */ +int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0, srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->clear_flush_young) + young |= mn->ops->clear_flush_young(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + + return young; +} + +void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_page) + mn->ops->invalidate_page(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_start) + mn->ops->invalidate_range_start(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_end) + mn->ops->invalidate_range_end(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +static int do_mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm, + int take_mmap_sem) +{ + struct mm_lock_data data; + struct mmu_notifier_mm * mmu_notifier_mm; + int ret; + + BUG_ON(atomic_read(&mm->mm_users) <= 0); + + ret = -ENOMEM; + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); + if (unlikely(!mmu_notifier_mm)) + goto out; + + ret = init_srcu_struct(&mmu_notifier_mm->srcu); + if (unlikely(ret)) + goto out_kfree; + + if (take_mmap_sem) + down_write(&mm->mmap_sem); + ret = mm_lock(mm, &data); + if (unlikely(ret)) + goto out_cleanup; + + if (!mm_has_notifiers(mm)) { + INIT_HLIST_HEAD(&mmu_notifier_mm->list); + spin_lock_init(&mmu_notifier_mm->lock); + mm->mmu_notifier_mm = mmu_notifier_mm; + mmu_notifier_mm = NULL; + } + atomic_inc(&mm->mm_count); + + /* + * Serialize the update against mmu_notifier_unregister. A + * side note: mmu_notifier_release can't run concurrently with + * us because we hold the mm_users pin (either implicitly as + * current->mm or explicitly with get_task_mm() or similar). + * We can't race against any other mmu notifiers either thanks + * to mm_lock(). + */ + spin_lock(&mm->mmu_notifier_mm->lock); + hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); + spin_unlock(&mm->mmu_notifier_mm->lock); + + mm_unlock(mm, &data); +out_cleanup: + if (take_mmap_sem) + up_write(&mm->mmap_sem); + if (mmu_notifier_mm) + cleanup_srcu_struct(&mmu_notifier_mm->srcu); +out_kfree: + /* kfree() does nothing if mmu_notifier_mm is NULL */ + kfree(mmu_notifier_mm); +out: + BUG_ON(atomic_read(&mm->mm_users) <= 0); + return ret; +} + +/* + * Must not hold mmap_sem nor any other VM related lock when calling + * this registration function. 
Must also ensure mm_users can't go down + * to zero while this runs to avoid races with mmu_notifier_release, + * so mm has to be current->mm or the mm should be pinned safely such + * as with get_task_mm(). If the mm is not current->mm, the mm_users + * pin should be released by calling mmput after mmu_notifier_register + * returns. mmu_notifier_unregister must always be called to + * unregister the notifier. mm_count is automatically pinned to allow + * mmu_notifier_unregister to safely run at any time later, before or + * after exit_mmap. ->release will always be called before exit_mmap + * frees the pages. + */ +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 1); +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +/* + * Same as mmu_notifier_register but here the caller must hold the + * mmap_sem in write mode. + */ +int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 0); +} +EXPORT_SYMBOL_GPL(__mmu_notifier_register); + +/* this is called after the last mmu_notifier_unregister() has returned */ +void __mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list)); + cleanup_srcu_struct(&mm->mmu_notifier_mm->srcu); + kfree(mm->mmu_notifier_mm); + mm->mmu_notifier_mm = LIST_POISON1; /* debug */ +} + +/* + * This releases the mm_count pin automatically and frees the mm + * structure if it was the last user of it. It serializes against + * running mmu notifiers with SRCU and against mmu_notifier_unregister + * with the unregister lock + SRCU. All sptes must be dropped before + * calling mmu_notifier_unregister. ->release or any other notifier + * method may be invoked concurrently with mmu_notifier_unregister, + * and only after mmu_notifier_unregister has returned are we guaranteed + * that ->release or any other method can't run anymore. + */ +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + spin_lock(&mm->mmu_notifier_mm->lock); + if (!hlist_unhashed(&mn->hlist)) { + int srcu; + + hlist_del_rcu(&mn->hlist); + + /* + * SRCU here will force exit_mmap to wait for ->release to finish + * before freeing the pages. + */ + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * exit_mmap will block in mmu_notifier_release to + * guarantee ->release is called before freeing the + * pages. + */ + if (mn->ops->release) + mn->ops->release(mn, mm); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + } else + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * Wait for any running method to finish, of course including + * ->release if it was run by mmu_notifier_release instead of us.
+ */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); + + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + mmdrop(mm); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -198,10 +199,12 @@ success: dirty_accountable = 1; } + mmu_notifier_invalidate_range_start(mm, start, end); if (is_vm_hugetlb_page(vma)) hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mmu_notifier_invalidate_range_end(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -74,7 +75,11 @@ static void move_ptes(struct vm_area_str struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start; + old_start = old_addr; + mmu_notifier_invalidate_range_start(vma->vm_mm, + old_start, old_end); if (vma->vm_file) { /* * Subtle point from Rajesh Venkatasubramanian: before @@ -116,6 +121,7 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -49,6 +49,7 @@ #include #include #include +#include #include @@ -287,7 +288,7 @@ static int page_referenced_one(struct pa if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young(vma, address, pte)) + } else if (ptep_clear_flush_young_notify(vma, address, pte)) referenced++; /* Pretend the page is referenced if the task has the @@ -457,7 +458,7 @@ static int page_mkclean_one(struct page pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush(vma, address, pte); + entry = ptep_clear_flush_notify(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -717,14 +718,14 @@ static int try_to_unmap_one(struct page * skipped over this mm) then we should reactivate it. */ if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte)))) { + (ptep_clear_flush_young_notify(vma, address, pte)))) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -849,12 +850,12 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young_notify(vma, address, pte)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* If nonlinear, store the file page offset in the pte. 
*/ if (page->index != linear_page_index(vma, address)) From andrea at qumranet.com Wed May 7 07:35:52 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:52 +0200 Subject: [ofa-general] [PATCH 02 of 11] get_task_mm In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1210115127 -7200 # Node ID c5badbefeee07518d9d1acca13e94c981420317c # Parent e20917dcc8284b6a07cfcced13dda4cbca850a9c get_task_mm get_task_mm should not succeed if mmput() is running and has reduced the mm_users count to zero. This can occur if a processor follows a task's pointer to an mm struct because that pointer is only cleared after the mmput(). If get_task_mm() succeeds after mmput() has reduced the mm_users to zero then we have the lovely situation that one portion of the kernel is doing all the teardown work for an mm while another portion is happily using it. Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -465,7 +465,8 @@ struct mm_struct *get_task_mm(struct tas if (task->flags & PF_BORROWED_MM) mm = NULL; else - atomic_inc(&mm->mm_users); + if (!atomic_inc_not_zero(&mm->mm_users)) + mm = NULL; } task_unlock(task); return mm; From andrea at qumranet.com Wed May 7 07:35:54 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:54 +0200 Subject: [ofa-general] [PATCH 04 of 11] free-pgtables In-Reply-To: Message-ID: <34f6a4bf67ce66714ba2.1210170954@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115130 -7200 # Node ID 34f6a4bf67ce66714ba2d5c13a5fed241d34fb09 # Parent d60d200565abde6a8ed45271e53cde9c5c75b426 free-pgtables Move the tlb flushing into free_pgtables. The conversion of the locks taken for reverse map scanning would require taking sleeping locks in free_pgtables() and we cannot sleep while gathering pages for a tlb flush. Move the tlb_gather/tlb_finish call to free_pgtables() to be done for each vma. This may add a number of tlb flushes depending on the number of vmas that cannot be coalesced into one. The first pointer argument to free_pgtables() can then be dropped.
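As a purely illustrative sketch (not part of the patch; "floor" and "ceiling" stand in for the prev/next expressions used at the real call sites), a caller like unmap_region() goes from sharing one mmu_gather across the unmap and the pagetable freeing:

	tlb = tlb_gather_mmu(mm, 0);
	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
	free_pgtables(&tlb, vma, floor, ceiling);
	tlb_finish_mmu(tlb, start, end);

to finishing its own gather first, with free_pgtables() now running a private tlb_gather_mmu()/tlb_finish_mmu() cycle for each vma, which is what makes it legal to sleep between vmas:

	tlb = tlb_gather_mmu(mm, 0);
	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
	tlb_finish_mmu(tlb, start, end);
	free_pgtables(vma, floor, ceiling);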
Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -772,8 +772,8 @@ int walk_page_range(const struct mm_stru void *private); void free_pgd_range(struct mmu_gather **tlb, unsigned long addr, unsigned long end, unsigned long floor, unsigned long ceiling); -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma, - unsigned long floor, unsigned long ceiling); +void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor, + unsigned long ceiling); int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); void unmap_mapping_range(struct address_space *mapping, diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -272,9 +272,11 @@ void free_pgd_range(struct mmu_gather ** } while (pgd++, addr = next, addr != end); } -void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma, - unsigned long floor, unsigned long ceiling) +void free_pgtables(struct vm_area_struct *vma, unsigned long floor, + unsigned long ceiling) { + struct mmu_gather *tlb; + while (vma) { struct vm_area_struct *next = vma->vm_next; unsigned long addr = vma->vm_start; @@ -286,7 +288,8 @@ void free_pgtables(struct mmu_gather **t unlink_file_vma(vma); if (is_vm_hugetlb_page(vma)) { - hugetlb_free_pgd_range(tlb, addr, vma->vm_end, + tlb = tlb_gather_mmu(vma->vm_mm, 0); + hugetlb_free_pgd_range(&tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } else { /* @@ -299,9 +302,11 @@ void free_pgtables(struct mmu_gather **t anon_vma_unlink(vma); unlink_file_vma(vma); } - free_pgd_range(tlb, addr, vma->vm_end, + tlb = tlb_gather_mmu(vma->vm_mm, 0); + free_pgd_range(&tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } + tlb_finish_mmu(tlb, addr, vma->vm_end); vma = next; } } diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1759,9 +1759,9 @@ static void unmap_region(struct mm_struc update_hiwater_rss(mm); unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, + tlb_finish_mmu(tlb, start, end); + free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); - tlb_finish_mmu(tlb, start, end); } /* @@ -2060,8 +2060,8 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); + free_pgtables(vma, FIRST_USER_ADDRESS, 0); /* * Walk the list again, actually closing and freeing it, From andrea at qumranet.com Wed May 7 07:35:53 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:53 +0200 Subject: [ofa-general] [PATCH 03 of 11] invalidate_page outside PT lock In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1210115129 -7200 # Node ID d60d200565abde6a8ed45271e53cde9c5c75b426 # Parent c5badbefeee07518d9d1acca13e94c981420317c invalidate_page outside PT lock Moves all mmu notifier methods outside the PT lock (first and not last step to make them sleep capable). 
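The shape of the conversion, sketched here from the __xip_unmap() hunk below (surrounding code elided, names as in the kernel; only the ordering matters):

	/* before: the notifier fired inside the PT lock */
	pteval = ptep_clear_flush_notify(vma, address, pte);
	...
	pte_unmap_unlock(pte, ptl);
	page_cache_release(page);

	/* after: clear the pte under the lock, drop the lock, then
	   notify; the invalidate still happens before the page is freed */
	pteval = ptep_clear_flush(vma, address, pte);
	...
	pte_unmap_unlock(pte, ptl);
	mmu_notifier_invalidate_page(mm, address);
	page_cache_release(page);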
Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -210,35 +210,6 @@ static inline void mmu_notifier_mm_destr __mmu_notifier_mm_destroy(mm); } -/* - * These two macros will sometime replace ptep_clear_flush. - * ptep_clear_flush is impleemnted as macro itself, so this also is - * implemented as a macro until ptep_clear_flush will converted to an - * inline function, to diminish the risk of compilation failure. The - * invalidate_page method over time can be moved outside the PT lock - * and these two macros can be later removed. - */ -#define ptep_clear_flush_notify(__vma, __address, __ptep) \ -({ \ - pte_t __pte; \ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __pte = ptep_clear_flush(___vma, ___address, __ptep); \ - mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ - __pte; \ -}) - -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ -({ \ - int __young; \ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ - __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ - ___address); \ - __young; \ -}) - #else /* CONFIG_MMU_NOTIFIER */ static inline void mmu_notifier_release(struct mm_struct *mm) @@ -274,9 +245,6 @@ static inline void mmu_notifier_mm_destr { } -#define ptep_clear_flush_young_notify ptep_clear_flush_young -#define ptep_clear_flush_notify ptep_clear_flush - #endif /* CONFIG_MMU_NOTIFIER */ #endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,11 +188,13 @@ __xip_unmap (struct address_space * mapp if (pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); pte_unmap_unlock(pte, ptl); + /* must invalidate_page _before_ freeing the page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(page); } } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -1714,9 +1714,10 @@ static int do_wp_page(struct mm_struct * */ page_table = pte_offset_map_lock(mm, pmd, address, &ptl); - page_cache_release(old_page); + new_page = NULL; if (!pte_same(*page_table, orig_pte)) goto unlock; + page_cache_release(old_page); page_mkwrite = 1; } @@ -1732,6 +1733,7 @@ static int do_wp_page(struct mm_struct * if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; + old_page = new_page = NULL; goto unlock; } @@ -1776,7 +1778,7 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. 
*/ - ptep_clear_flush_notify(vma, address, page_table); + ptep_clear_flush(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); @@ -1788,12 +1790,18 @@ gotten: } else mem_cgroup_uncharge_page(new_page); - if (new_page) +unlock: + pte_unmap_unlock(page_table, ptl); + + if (new_page) { + if (new_page == old_page) + /* cow happened, notify before releasing old_page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(new_page); + } if (old_page) page_cache_release(old_page); -unlock: - pte_unmap_unlock(page_table, ptl); + if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -275,7 +275,7 @@ static int page_referenced_one(struct pa unsigned long address; pte_t *pte; spinlock_t *ptl; - int referenced = 0; + int referenced = 0, clear_flush_young = 0; address = vma_address(page, vma); if (address == -EFAULT) @@ -288,8 +288,11 @@ static int page_referenced_one(struct pa if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young_notify(vma, address, pte)) - referenced++; + } else { + clear_flush_young = 1; + if (ptep_clear_flush_young(vma, address, pte)) + referenced++; + } /* Pretend the page is referenced if the task has the swap token and is in the middle of a page fault. */ @@ -299,6 +302,10 @@ static int page_referenced_one(struct pa (*mapcount)--; pte_unmap_unlock(pte, ptl); + + if (clear_flush_young) + referenced += mmu_notifier_clear_flush_young(mm, address); + out: return referenced; } @@ -458,7 +465,7 @@ static int page_mkclean_one(struct page pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush_notify(vma, address, pte); + entry = ptep_clear_flush(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -466,6 +473,10 @@ static int page_mkclean_one(struct page } pte_unmap_unlock(pte, ptl); + + if (ret) + mmu_notifier_invalidate_page(mm, address); + out: return ret; } @@ -717,15 +728,14 @@ static int try_to_unmap_one(struct page * If it's recently referenced (perhaps page_referenced * skipped over this mm) then we should reactivate it. */ - if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young_notify(vma, address, pte)))) { + if (!migration && (vma->vm_flags & VM_LOCKED)) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. 
*/ if (pte_dirty(pteval)) @@ -780,6 +790,8 @@ static int try_to_unmap_one(struct page out_unmap: pte_unmap_unlock(pte, ptl); + if (ret != SWAP_FAIL) + mmu_notifier_invalidate_page(mm, address); out: return ret; } @@ -818,7 +830,7 @@ static void try_to_unmap_cluster(unsigne spinlock_t *ptl; struct page *page; unsigned long address; - unsigned long end; + unsigned long start, end; address = (vma->vm_start + cursor) & CLUSTER_MASK; end = address + CLUSTER_SIZE; @@ -839,6 +851,8 @@ static void try_to_unmap_cluster(unsigne if (!pmd_present(*pmd)) return; + start = address; + mmu_notifier_invalidate_range_start(mm, start, end); pte = pte_offset_map_lock(mm, pmd, address, &ptl); /* Update high watermark before we lower rss */ @@ -850,12 +864,12 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young_notify(vma, address, pte)) + if (ptep_clear_flush_young(vma, address, pte)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) @@ -871,6 +885,7 @@ static void try_to_unmap_cluster(unsigne (*mapcount)--; } pte_unmap_unlock(pte - 1, ptl); + mmu_notifier_invalidate_range_end(mm, start, end); } static int try_to_unmap_anon(struct page *page, int migration) From andrea at qumranet.com Wed May 7 07:35:56 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:56 +0200 Subject: [ofa-general] [PATCH 06 of 11] rwsem contended In-Reply-To: Message-ID: <0621238970155f8ff2d6.1210170956@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115132 -7200 # Node ID 0621238970155f8ff2d60ca4996dcdd470f9c6ce # Parent 20bc6a66a86ef6bd60919cc77ff51d4af741b057 rwsem contended Add a function to rw_semaphores to check if there are any processes waiting for the semaphore. Add rwsem_needbreak to sched.h that works in the same way as spinlock_needbreak(). Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -57,6 +57,8 @@ extern void up_write(struct rw_semaphore */ extern void downgrade_write(struct rw_semaphore *sem); +extern int rwsem_is_contended(struct rw_semaphore *sem); + #ifdef CONFIG_DEBUG_LOCK_ALLOC /* * nested locking. NOTE: rwsems are not allowed to recurse diff --git a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2030,6 +2030,15 @@ static inline int spin_needbreak(spinloc #endif } +static inline int rwsem_needbreak(struct rw_semaphore *sem) +{ +#ifdef CONFIG_PREEMPT + return rwsem_is_contended(sem); +#else + return 0; +#endif +} + /* * Reevaluate whether the task has signals pending delivery. * Wake the task if so. diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c --- a/lib/rwsem-spinlock.c +++ b/lib/rwsem-spinlock.c @@ -305,6 +305,18 @@ void __downgrade_write(struct rw_semapho spin_unlock_irqrestore(&sem->wait_lock, flags); } +int rwsem_is_contended(struct rw_semaphore *sem) +{ + /* + * Racy check for an empty list. False positives or negatives + * would be okay. False positive may cause a useless dropping of + * locks. False negatives may cause locks to be held a bit + * longer until the next check. 
+ */ + return !list_empty(&sem->wait_list); +} + +EXPORT_SYMBOL(rwsem_is_contended); EXPORT_SYMBOL(__init_rwsem); EXPORT_SYMBOL(__down_read); EXPORT_SYMBOL(__down_read_trylock); diff --git a/lib/rwsem.c b/lib/rwsem.c --- a/lib/rwsem.c +++ b/lib/rwsem.c @@ -251,6 +251,18 @@ asmregparm struct rw_semaphore *rwsem_do return sem; } +int rwsem_is_contended(struct rw_semaphore *sem) +{ + /* + * Racy check for an empty list. False positives or negatives + * would be okay. False positive may cause a useless dropping of + * locks. False negatives may cause locks to be held a bit + * longer until the next check. + */ + return !list_empty(&sem->wait_list); +} + +EXPORT_SYMBOL(rwsem_is_contended); EXPORT_SYMBOL(rwsem_down_read_failed); EXPORT_SYMBOL(rwsem_down_write_failed); EXPORT_SYMBOL(rwsem_wake); From andrea at qumranet.com Wed May 7 07:35:55 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:55 +0200 Subject: [ofa-general] [PATCH 05 of 11] unmap vmas tlb flushing In-Reply-To: Message-ID: <20bc6a66a86ef6bd6091.1210170955@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115131 -7200 # Node ID 20bc6a66a86ef6bd60919cc77ff51d4af741b057 # Parent 34f6a4bf67ce66714ba2d5c13a5fed241d34fb09 unmap vmas tlb flushing Move the tlb flushing inside of unmap vmas. This saves us from passing a pointer to the TLB structure around and simplifies the callers. Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -744,8 +744,7 @@ struct page *vm_normal_page(struct vm_ar unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *); -unsigned long unmap_vmas(struct mmu_gather **tlb, - struct vm_area_struct *start_vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -849,7 +849,6 @@ static unsigned long unmap_page_range(st /** * unmap_vmas - unmap a range of memory covered by a list of vma's - * @tlbp: address of the caller's struct mmu_gather * @vma: the starting vma * @start_addr: virtual address at which to start unmapping * @end_addr: virtual address at which to end unmapping @@ -861,20 +860,13 @@ static unsigned long unmap_page_range(st * Unmap all pages in the vma list. * * We aim to not hold locks for too long (for scheduling latency reasons). - * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to - * return the ending mmu_gather to the caller. + * So zap pages in ZAP_BLOCK_SIZE bytecounts. * * Only addresses between `start' and `end' will be unmapped. * * The VMA list must be sorted in ascending virtual address order. - * - * unmap_vmas() assumes that the caller will flush the whole unmapped address - * range after unmap_vmas() returns. So the only responsibility here is to - * ensure that any thus-far unmapped pages are flushed before unmap_vmas() - * drops the lock and schedules. 
*/ -unsigned long unmap_vmas(struct mmu_gather **tlbp, - struct vm_area_struct *vma, unsigned long start_addr, +unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, struct zap_details *details) { @@ -883,9 +875,14 @@ unsigned long unmap_vmas(struct mmu_gath int tlb_start_valid = 0; unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; - int fullmm = (*tlbp)->fullmm; + int fullmm; + struct mmu_gather *tlb; struct mm_struct *mm = vma->vm_mm; + lru_add_drain(); + tlb = tlb_gather_mmu(mm, 0); + update_hiwater_rss(mm); + fullmm = tlb->fullmm; mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -912,7 +909,7 @@ unsigned long unmap_vmas(struct mmu_gath (HPAGE_SIZE / PAGE_SIZE); start = end; } else - start = unmap_page_range(*tlbp, vma, + start = unmap_page_range(tlb, vma, start, end, &zap_work, details); if (zap_work > 0) { @@ -920,22 +917,23 @@ unsigned long unmap_vmas(struct mmu_gath break; } - tlb_finish_mmu(*tlbp, tlb_start, start); + tlb_finish_mmu(tlb, tlb_start, start); if (need_resched() || (i_mmap_lock && spin_needbreak(i_mmap_lock))) { if (i_mmap_lock) { - *tlbp = NULL; + tlb = NULL; goto out; } cond_resched(); } - *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm); + tlb = tlb_gather_mmu(vma->vm_mm, fullmm); tlb_start_valid = 0; zap_work = ZAP_BLOCK_SIZE; } } + tlb_finish_mmu(tlb, start_addr, end_addr); out: mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ @@ -951,18 +949,10 @@ unsigned long zap_page_range(struct vm_a unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size, struct zap_details *details) { - struct mm_struct *mm = vma->vm_mm; - struct mmu_gather *tlb; unsigned long end = address + size; unsigned long nr_accounted = 0; - lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); - update_hiwater_rss(mm); - end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) - tlb_finish_mmu(tlb, address, end); - return end; + return unmap_vmas(vma, address, end, &nr_accounted, details); } /* diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1751,15 +1751,10 @@ static void unmap_region(struct mm_struc unsigned long start, unsigned long end) { struct vm_area_struct *next = prev? prev->vm_next: mm->mmap; - struct mmu_gather *tlb; unsigned long nr_accounted = 0; - lru_add_drain(); - tlb = tlb_gather_mmu(mm, 0); - update_hiwater_rss(mm); - unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL); + unmap_vmas(vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - tlb_finish_mmu(tlb, start, end); free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); } @@ -2044,7 +2039,6 @@ EXPORT_SYMBOL(do_brk); /* Release all mmaps. 
*/ void exit_mmap(struct mm_struct *mm) { - struct mmu_gather *tlb; struct vm_area_struct *vma = mm->mmap; unsigned long nr_accounted = 0; unsigned long end; @@ -2055,12 +2049,11 @@ void exit_mmap(struct mm_struct *mm) lru_add_drain(); flush_cache_mm(mm); - tlb = tlb_gather_mmu(mm, 1); + /* Don't update_hiwater_rss(mm) here, do_exit already did */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ - end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); + end = unmap_vmas(vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - tlb_finish_mmu(tlb, 0, end); free_pgtables(vma, FIRST_USER_ADDRESS, 0); /* From andrea at qumranet.com Wed May 7 07:35:58 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:58 +0200 Subject: [ofa-general] [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: Message-ID: <6b384bb988786aa78ef0.1210170958@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115136 -7200 # Node ID 6b384bb988786aa78ef07440180e4b2948c4c6a2 # Parent 58f716ad4d067afb6bdd1b5f7042e19d854aae0d anon-vma-rwsem Convert the anon_vma spinlock to a rw semaphore. This allows concurrent traversal of reverse maps for try_to_unmap() and page_mkclean(). It also allows the calling of sleeping functions from reverse map traversal, as needed for the notifier callbacks; such traversals may now run concurrently. RCU is used in some contexts to guarantee the presence of the anon_vma (try_to_unmap) while we acquire the anon_vma lock. We cannot take a semaphore within an RCU critical section. Add a refcount to the anon_vma structure which allows us to give an existence guarantee for the anon_vma structure independent of the spinlock or the list contents. The refcount can then be taken within the RCU section. If it has been taken successfully then the refcount guarantees the existence of the anon_vma. The refcount in anon_vma also allows us to fix a nasty issue in page migration where we fudged by using RCU for a long code path to guarantee the existence of the anon_vma. I think this is a bug because the anon_vma may become empty and get scheduled to be freed but then we increase the refcount again when the migration entries are removed. The refcount in general allows a shortening of RCU critical sections since we can do an rcu_read_unlock() after taking the refcount. This is particularly relevant if the anon_vma chains contain hundreds of entries. However: - Atomic overhead increases in situations where a new reference to the anon_vma has to be established or removed. Overhead also increases when a speculative reference is used (try_to_unmap, page_mkclean, page migration). - There is the potential for more frequent processor changes due to up_xxx letting waiting tasks run first. This results in, e.g., the Aim9 brk performance test going down by 10-15%. Signed-off-by: Christoph Lameter Signed-off-by: Andrea Arcangeli diff --git a/include/linux/rmap.h b/include/linux/rmap.h --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -25,7 +25,8 @@ * pointing to this anon_vma once its vma list is empty.
diff --git a/include/linux/rmap.h b/include/linux/rmap.h --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -25,7 +25,8 @@ * pointing to this anon_vma once its vma list is empty.
*/ struct anon_vma { - spinlock_t lock; /* Serialize access to vma list */ + atomic_t refcount; /* vmas on the list */ + struct rw_semaphore sem;/* Serialize access to vma list */ struct list_head head; /* List of private "related" vmas */ }; @@ -43,18 +44,31 @@ static inline void anon_vma_free(struct kmem_cache_free(anon_vma_cachep, anon_vma); } +struct anon_vma *grab_anon_vma(struct page *page); + +static inline void get_anon_vma(struct anon_vma *anon_vma) +{ + atomic_inc(&anon_vma->refcount); +} + +static inline void put_anon_vma(struct anon_vma *anon_vma) +{ + if (atomic_dec_and_test(&anon_vma->refcount)) + anon_vma_free(anon_vma); +} + static inline void anon_vma_lock(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); } static inline void anon_vma_unlock(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); } /* diff --git a/mm/migrate.c b/mm/migrate.c --- a/mm/migrate.c +++ b/mm/migrate.c @@ -235,15 +235,16 @@ static void remove_anon_migration_ptes(s return; /* - * We hold the mmap_sem lock. So no need to call page_lock_anon_vma. + * We hold either the mmap_sem lock or a reference on the + * anon_vma. So no need to call page_lock_anon_vma. */ anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON); - spin_lock(&anon_vma->lock); + down_read(&anon_vma->sem); list_for_each_entry(vma, &anon_vma->head, anon_vma_node) remove_migration_pte(vma, old, new); - spin_unlock(&anon_vma->lock); + up_read(&anon_vma->sem); } /* @@ -630,7 +631,7 @@ static int unmap_and_move(new_page_t get int rc = 0; int *result = NULL; struct page *newpage = get_new_page(page, private, &result); - int rcu_locked = 0; + struct anon_vma *anon_vma = NULL; int charge = 0; if (!newpage) @@ -654,16 +655,14 @@ static int unmap_and_move(new_page_t get } /* * By try_to_unmap(), page->mapcount goes down to 0 here. In this case, - * we cannot notice that anon_vma is freed while we migrates a page. + * we cannot notice that anon_vma is freed while we migrate a page. * This rcu_read_lock() delays freeing anon_vma pointer until the end * of migration. File cache pages are no problem because of page_lock() * File Caches may use write_page() or lock_page() in migration, then, * just care Anon page here. */ - if (PageAnon(page)) { - rcu_read_lock(); - rcu_locked = 1; - } + if (PageAnon(page)) + anon_vma = grab_anon_vma(page); /* * Corner case handling: @@ -681,10 +680,7 @@ static int unmap_and_move(new_page_t get if (!PageAnon(page) && PagePrivate(page)) { /* * Go direct to try_to_free_buffers() here because - * a) that's what try_to_release_page() would do anyway - * b) we may be under rcu_read_lock() here, so we can't - * use GFP_KERNEL which is what try_to_release_page() - * needs to be effective.
+ * that's what try_to_release_page() would do anyway */ try_to_free_buffers(page); } @@ -705,8 +701,8 @@ static int unmap_and_move(new_page_t get } else if (charge) mem_cgroup_end_migration(newpage); rcu_unlock: - if (rcu_locked) - rcu_read_unlock(); + if (anon_vma) + put_anon_vma(anon_vma); unlock: diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -570,7 +570,7 @@ again: remove_next = 1 + (end > next-> if (vma->anon_vma) anon_vma = vma->anon_vma; if (anon_vma) { - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); /* * Easily overlooked: when mprotect shifts the boundary, * make sure the expanding vma has anon_vma set if the @@ -624,7 +624,7 @@ again: remove_next = 1 + (end > next-> } if (anon_vma) - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); if (mapping) up_write(&mapping->i_mmap_sem); diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -69,7 +69,7 @@ int anon_vma_prepare(struct vm_area_stru if (anon_vma) { allocated = NULL; locked = anon_vma; - spin_lock(&locked->lock); + down_write(&locked->sem); } else { anon_vma = anon_vma_alloc(); if (unlikely(!anon_vma)) @@ -81,6 +81,7 @@ int anon_vma_prepare(struct vm_area_stru /* page_table_lock to protect against threads */ spin_lock(&mm->page_table_lock); if (likely(!vma->anon_vma)) { + get_anon_vma(anon_vma); vma->anon_vma = anon_vma; list_add_tail(&vma->anon_vma_node, &anon_vma->head); allocated = NULL; @@ -88,7 +89,7 @@ int anon_vma_prepare(struct vm_area_stru spin_unlock(&mm->page_table_lock); if (locked) - spin_unlock(&locked->lock); + up_write(&locked->sem); if (unlikely(allocated)) anon_vma_free(allocated); } @@ -99,14 +100,17 @@ void __anon_vma_merge(struct vm_area_str { BUG_ON(vma->anon_vma != next->anon_vma); list_del(&next->anon_vma_node); + put_anon_vma(vma->anon_vma); } void __anon_vma_link(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; - if (anon_vma) + if (anon_vma) { + get_anon_vma(anon_vma); list_add_tail(&vma->anon_vma_node, &anon_vma->head); + } } void anon_vma_link(struct vm_area_struct *vma) @@ -114,36 +118,32 @@ void anon_vma_link(struct vm_area_struct struct anon_vma *anon_vma = vma->anon_vma; if (anon_vma) { - spin_lock(&anon_vma->lock); + get_anon_vma(anon_vma); + down_write(&anon_vma->sem); list_add_tail(&vma->anon_vma_node, &anon_vma->head); - spin_unlock(&anon_vma->lock); + up_write(&anon_vma->sem); } } void anon_vma_unlink(struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; - int empty; if (!anon_vma) return; - spin_lock(&anon_vma->lock); + down_write(&anon_vma->sem); list_del(&vma->anon_vma_node); - - /* We must garbage collect the anon_vma if it's empty */ - empty = list_empty(&anon_vma->head); - spin_unlock(&anon_vma->lock); - - if (empty) - anon_vma_free(anon_vma); + up_write(&anon_vma->sem); + put_anon_vma(anon_vma); } static void anon_vma_ctor(struct kmem_cache *cachep, void *data) { struct anon_vma *anon_vma = data; - spin_lock_init(&anon_vma->lock); + init_rwsem(&anon_vma->sem); + atomic_set(&anon_vma->refcount, 0); INIT_LIST_HEAD(&anon_vma->head); } @@ -157,9 +157,9 @@ void __init anon_vma_init(void) * Getting a lock on a stable anon_vma from a page off the LRU is * tricky: page_lock_anon_vma rely on RCU to guard against the races. 
*/ -static struct anon_vma *page_lock_anon_vma(struct page *page) +struct anon_vma *grab_anon_vma(struct page *page) { - struct anon_vma *anon_vma; + struct anon_vma *anon_vma = NULL; unsigned long anon_mapping; rcu_read_lock(); @@ -170,17 +170,26 @@ static struct anon_vma *page_lock_anon_v goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); - spin_lock(&anon_vma->lock); - return anon_vma; + if (!atomic_inc_not_zero(&anon_vma->refcount)) + anon_vma = NULL; out: rcu_read_unlock(); - return NULL; + return anon_vma; +} + +static struct anon_vma *page_lock_anon_vma(struct page *page) +{ + struct anon_vma *anon_vma = grab_anon_vma(page); + + if (anon_vma) + down_read(&anon_vma->sem); + return anon_vma; } static void page_unlock_anon_vma(struct anon_vma *anon_vma) { - spin_unlock(&anon_vma->lock); - rcu_read_unlock(); + up_read(&anon_vma->sem); + put_anon_vma(anon_vma); } /* From andrea at qumranet.com Wed May 7 07:36:00 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:36:00 +0200 Subject: [ofa-general] [PATCH 10 of 11] export zap_page_range for XPMEM In-Reply-To: Message-ID: <5b2eb7d28a4517daf91b.1210170960@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115797 -7200 # Node ID 5b2eb7d28a4517daf91b08b4dcfbb58fd2b42d0b # Parent 94eaa1515369e8ef183e2457f6f25a7f36473d70 export zap_page_range for XPMEM XPMEM would have used sys_madvise() except that madvise_dontneed() returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages XPMEM imports from other partitions and is also true for uncached pages allocated locally via the mspec allocator. XPMEM needs zap_page_range() functionality for these types of pages as well as 'normal' pages. Signed-off-by: Dean Nelson Signed-off-by: Andrea Arcangeli diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -954,6 +954,7 @@ unsigned long zap_page_range(struct vm_a return unmap_vmas(vma, address, end, &nr_accounted, details); } +EXPORT_SYMBOL_GPL(zap_page_range); /* * Do a quick page-table lookup for a single page. From andrea at qumranet.com Wed May 7 07:35:57 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:57 +0200 Subject: [ofa-general] [PATCH 07 of 11] i_mmap_rwsem In-Reply-To: Message-ID: <58f716ad4d067afb6bdd.1210170957@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115135 -7200 # Node ID 58f716ad4d067afb6bdd1b5f7042e19d854aae0d # Parent 0621238970155f8ff2d60ca4996dcdd470f9c6ce i_mmap_rwsem The conversion to a rwsem allows notifier callbacks during rmap traversal for files. A rw style lock also allows concurrent walking of the reverse map so that multiple processors can expire pages in the same memory area of the same process. So it increases the potential concurrency. Signed-off-by: Andrea Arcangeli Signed-off-by: Christoph Lameter diff --git a/Documentation/vm/locking b/Documentation/vm/locking --- a/Documentation/vm/locking +++ b/Documentation/vm/locking @@ -66,7 +66,7 @@ expand_stack(), it is hard to come up wi expand_stack(), it is hard to come up with a destructive scenario without having the vmlist protection in this case. -The page_table_lock nests with the inode i_mmap_lock and the kmem cache +The page_table_lock nests with the inode i_mmap_sem and the kmem cache c_spinlock spinlocks. This is okay, since the kmem code asks for pages after dropping c_spinlock. 
The page_table_lock also nests with pagecache_lock and pagemap_lru_lock spinlocks, and no code asks for memory with these locks diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str if (!vma_shareable(vma, addr)) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) { if (svma == vma) continue; @@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str put_page(virt_to_page(spte)); spin_unlock(&mm->page_table_lock); out: - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino pgoff = offset >> PAGE_SHIFT; i_size_write(inode, offset); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); if (!prio_tree_empty(&mapping->i_mmap)) hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff); - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); truncate_hugepages(inode, offset); return 0; } diff --git a/fs/inode.c b/fs/inode.c --- a/fs/inode.c +++ b/fs/inode.c @@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode INIT_LIST_HEAD(&inode->i_devices); INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC); rwlock_init(&inode->i_data.tree_lock); - spin_lock_init(&inode->i_data.i_mmap_lock); + init_rwsem(&inode->i_data.i_mmap_sem); INIT_LIST_HEAD(&inode->i_data.private_list); spin_lock_init(&inode->i_data.private_lock); INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap); diff --git a/include/linux/fs.h b/include/linux/fs.h --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -502,7 +502,7 @@ struct address_space { unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ - spinlock_t i_mmap_lock; /* protect tree, count, list */ + struct rw_semaphore i_mmap_sem; /* protect tree, count, list */ unsigned int truncate_count; /* Cover race condition with truncate */ unsigned long nrpages; /* number of total pages */ pgoff_t writeback_index;/* writeback starts here */ diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -735,7 +735,7 @@ struct zap_details { struct address_space *check_mapping; /* Check page->mapping if set */ pgoff_t first_index; /* Lowest page->index to unmap */ pgoff_t last_index; /* Highest page->index to unmap */ - spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */ + struct rw_semaphore *i_mmap_sem; /* For unmap_mapping_range: */ unsigned long truncate_count; /* Compare vm_truncate_count */ }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -297,12 +297,12 @@ static int dup_mmap(struct mm_struct *mm atomic_dec(&inode->i_writecount); /* insert tmp into the share list, just after mpnt */ - spin_lock(&file->f_mapping->i_mmap_lock); + down_write(&file->f_mapping->i_mmap_sem); tmp->vm_truncate_count = mpnt->vm_truncate_count; flush_dcache_mmap_lock(file->f_mapping); vma_prio_tree_add(tmp, mpnt); flush_dcache_mmap_unlock(file->f_mapping); - spin_unlock(&file->f_mapping->i_mmap_lock); + up_write(&file->f_mapping->i_mmap_sem); } /* diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -61,16 +61,16 @@ 
generic_file_direct_IO(int rw, struct ki /* * Lock ordering: * - * ->i_mmap_lock (vmtruncate) + * ->i_mmap_sem (vmtruncate) * ->private_lock (__free_pte->__set_page_dirty_buffers) * ->swap_lock (exclusive_swap_page, others) * ->mapping->tree_lock * * ->i_mutex - * ->i_mmap_lock (truncate->unmap_mapping_range) + * ->i_mmap_sem (truncate->unmap_mapping_range) * * ->mmap_sem - * ->i_mmap_lock + * ->i_mmap_sem * ->page_table_lock or pte_lock (various, mainly in memory.c) * ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock) * @@ -87,7 +87,7 @@ generic_file_direct_IO(int rw, struct ki * ->sb_lock (fs/fs-writeback.c) * ->mapping->tree_lock (__sync_single_inode) * - * ->i_mmap_lock + * ->i_mmap_sem * ->anon_vma.lock (vma_adjust) * * ->anon_vma.lock diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -178,7 +178,7 @@ __xip_unmap (struct address_space * mapp if (!page) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { mm = vma->vm_mm; address = vma->vm_start + @@ -198,7 +198,7 @@ __xip_unmap (struct address_space * mapp page_cache_release(page); } } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -206,13 +206,13 @@ asmlinkage long sys_remap_file_pages(uns } goto out; } - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); flush_dcache_mmap_lock(mapping); vma->vm_flags |= VM_NONLINEAR; vma_prio_tree_remove(vma, &mapping->i_mmap); vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); flush_dcache_mmap_unlock(mapping); - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } mmu_notifier_invalidate_range_start(mm, start, start + size); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -814,7 +814,7 @@ void __unmap_hugepage_range(struct vm_ar struct page *page; struct page *tmp; /* - * A page gathering list, protected by per file i_mmap_lock. The + * A page gathering list, protected by per file i_mmap_sem. The * lock is used to avoid list corruption from multiple unmapping * of the same page since we are using page->lru. */ @@ -864,9 +864,9 @@ void unmap_hugepage_range(struct vm_area * do nothing in this case. */ if (vma->vm_file) { - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); + down_write(&vma->vm_file->f_mapping->i_mmap_sem); __unmap_hugepage_range(vma, start, end); - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + up_write(&vma->vm_file->f_mapping->i_mmap_sem); } } @@ -1111,7 +1111,7 @@ void hugetlb_change_protection(struct vm BUG_ON(address >= end); flush_cache_range(vma, address, end); - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); + down_write(&vma->vm_file->f_mapping->i_mmap_sem); spin_lock(&mm->page_table_lock); for (; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -1126,7 +1126,7 @@ void hugetlb_change_protection(struct vm } } spin_unlock(&mm->page_table_lock); - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + up_write(&vma->vm_file->f_mapping->i_mmap_sem); flush_tlb_range(vma, start, end); } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -874,7 +874,7 @@ unsigned long unmap_vmas(struct vm_area_ unsigned long tlb_start = 0; /* For tlb_finish_mmu */ int tlb_start_valid = 0; unsigned long start = start_addr; - spinlock_t *i_mmap_lock = details? 
details->i_mmap_lock: NULL; + struct rw_semaphore *i_mmap_sem = details? details->i_mmap_sem: NULL; int fullmm; struct mmu_gather *tlb; struct mm_struct *mm = vma->vm_mm; @@ -920,8 +920,8 @@ unsigned long unmap_vmas(struct vm_area_ tlb_finish_mmu(tlb, tlb_start, start); if (need_resched() || - (i_mmap_lock && spin_needbreak(i_mmap_lock))) { - if (i_mmap_lock) { + (i_mmap_sem && rwsem_needbreak(i_mmap_sem))) { + if (i_mmap_sem) { tlb = NULL; goto out; } @@ -1829,7 +1829,7 @@ unwritable_page: /* * Helper functions for unmap_mapping_range(). * - * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __ + * __ Notes on dropping i_mmap_sem to reduce latency while unmapping __ * * We have to restart searching the prio_tree whenever we drop the lock, * since the iterator is only valid while the lock is held, and anyway @@ -1848,7 +1848,7 @@ unwritable_page: * can't efficiently keep all vmas in step with mapping->truncate_count: * so instead reset them all whenever it wraps back to 0 (then go to 1). * mapping->truncate_count and vma->vm_truncate_count are protected by - * i_mmap_lock. + * i_mmap_sem. * * In order to make forward progress despite repeatedly restarting some * large vma, note the restart_addr from unmap_vmas when it breaks out: @@ -1898,7 +1898,7 @@ again: restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr, details); - need_break = need_resched() || spin_needbreak(details->i_mmap_lock); + need_break = need_resched() || rwsem_needbreak(details->i_mmap_sem); if (restart_addr >= end_addr) { /* We have now completed this vma: mark it so */ @@ -1912,9 +1912,9 @@ again: goto again; } - spin_unlock(details->i_mmap_lock); + up_write(details->i_mmap_sem); cond_resched(); - spin_lock(details->i_mmap_lock); + down_write(details->i_mmap_sem); return -EINTR; } @@ -2008,9 +2008,9 @@ void unmap_mapping_range(struct address_ details.last_index = hba + hlen - 1; if (details.last_index < details.first_index) details.last_index = ULONG_MAX; - details.i_mmap_lock = &mapping->i_mmap_lock; + details.i_mmap_sem = &mapping->i_mmap_sem; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); /* Protect against endless unmapping loops */ mapping->truncate_count++; @@ -2025,7 +2025,7 @@ void unmap_mapping_range(struct address_ unmap_mapping_range_tree(&mapping->i_mmap, &details); if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details); - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } EXPORT_SYMBOL(unmap_mapping_range); diff --git a/mm/migrate.c b/mm/migrate.c --- a/mm/migrate.c +++ b/mm/migrate.c @@ -211,12 +211,12 @@ static void remove_file_migration_ptes(s if (!mapping) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) remove_migration_pte(vma, old, new); - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -189,7 +189,7 @@ error: } /* - * Requires inode->i_mapping->i_mmap_lock + * Requires inode->i_mapping->i_mmap_sem */ static void __remove_shared_vm_struct(struct vm_area_struct *vma, struct file *file, struct address_space *mapping) @@ -217,9 +217,9 @@ void unlink_file_vma(struct vm_area_stru if (file) { struct address_space *mapping = file->f_mapping; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); __remove_shared_vm_struct(vma, file, mapping); - 
spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } } @@ -445,7 +445,7 @@ static void vma_link(struct mm_struct *m mapping = vma->vm_file->f_mapping; if (mapping) { - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); vma->vm_truncate_count = mapping->truncate_count; } anon_vma_lock(vma); @@ -455,7 +455,7 @@ static void vma_link(struct mm_struct *m anon_vma_unlock(vma); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); mm->map_count++; validate_mm(mm); @@ -542,7 +542,7 @@ again: remove_next = 1 + (end > next-> mapping = file->f_mapping; if (!(vma->vm_flags & VM_NONLINEAR)) root = &mapping->i_mmap; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); if (importer && vma->vm_truncate_count != next->vm_truncate_count) { /* @@ -626,7 +626,7 @@ again: remove_next = 1 + (end > next-> if (anon_vma) spin_unlock(&anon_vma->lock); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); if (remove_next) { if (file) { @@ -2068,7 +2068,7 @@ void exit_mmap(struct mm_struct *mm) /* Insert vm structure into process list sorted by address * and into the inode's i_mmap tree. If vm_file is non-NULL - * then i_mmap_lock is taken here. + * then i_mmap_sem is taken here. */ int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma) { diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -88,7 +88,7 @@ static void move_ptes(struct vm_area_str * and we propagate stale pages into the dst afterward. */ mapping = vma->vm_file->f_mapping; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); if (new_vma->vm_truncate_count && new_vma->vm_truncate_count != vma->vm_truncate_count) new_vma->vm_truncate_count = 0; @@ -120,7 +120,7 @@ static void move_ptes(struct vm_area_str pte_unmap_nested(new_pte - 1); pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -24,7 +24,7 @@ * inode->i_alloc_sem (vmtruncate_range) * mm->mmap_sem * page->flags PG_locked (lock_page) - * mapping->i_mmap_lock + * mapping->i_mmap_sem * anon_vma->lock * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) @@ -373,14 +373,14 @@ static int page_referenced_file(struct p * The page lock not only makes sure that page->mapping cannot * suddenly be NULLified by truncation, it makes sure that the * structure at mapping cannot be freed and reused yet, - * so we can safely take mapping->i_mmap_lock. + * so we can safely take mapping->i_mmap_sem. */ BUG_ON(!PageLocked(page)); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); /* - * i_mmap_lock does not stabilize mapcount at all, but mapcount + * i_mmap_sem does not stabilize mapcount at all, but mapcount * is more likely to be accurate if we note it after spinning. 
*/ mapcount = page_mapcount(page); @@ -403,7 +403,7 @@ static int page_referenced_file(struct p break; } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return referenced; } @@ -490,12 +490,12 @@ static int page_mkclean_file(struct addr BUG_ON(PageAnon(page)); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { if (vma->vm_flags & VM_SHARED) ret += page_mkclean_one(page, vma); } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return ret; } @@ -930,7 +930,7 @@ static int try_to_unmap_file(struct page unsigned long max_nl_size = 0; unsigned int mapcount; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { ret = try_to_unmap_one(page, vma, migration); if (ret == SWAP_FAIL || !page_mapped(page)) @@ -967,7 +967,6 @@ static int try_to_unmap_file(struct page mapcount = page_mapcount(page); if (!mapcount) goto out; - cond_resched_lock(&mapping->i_mmap_lock); max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK; if (max_nl_cursor == 0) @@ -989,7 +988,6 @@ static int try_to_unmap_file(struct page } vma->vm_private_data = (void *) max_nl_cursor; } - cond_resched_lock(&mapping->i_mmap_lock); max_nl_cursor += CLUSTER_SIZE; } while (max_nl_cursor <= max_nl_size); @@ -1001,7 +999,7 @@ static int try_to_unmap_file(struct page list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) vma->vm_private_data = NULL; out: - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return ret; } From andrea at qumranet.com Wed May 7 07:36:01 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:36:01 +0200 Subject: [ofa-general] [PATCH 11 of 11] mmap sems In-Reply-To: Message-ID: # HG changeset patch # User Andrea Arcangeli # Date 1210115798 -7200 # Node ID eb924315351f6b056428e35c983ad28040420fea # Parent 5b2eb7d28a4517daf91b08b4dcfbb58fd2b42d0b mmap sems This patch adds a lock ordering rule to avoid a potential deadlock when multiple mmap_sems need to be locked. Signed-off-by: Dean Nelson Signed-off-by: Andrea Arcangeli diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -79,6 +79,9 @@ generic_file_direct_IO(int rw, struct ki * * ->i_mutex (generic_file_buffered_write) * ->mmap_sem (fault_in_pages_readable->do_page_fault) + * + * When taking multiple mmap_sems, one should lock the lowest-addressed + * one first proceeding on up to the highest-addressed one. * * ->i_mutex * ->i_alloc_sem (various) From andrea at qumranet.com Wed May 7 07:35:59 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:59 +0200 Subject: [ofa-general] [PATCH 09 of 11] mm_lock-rwsem In-Reply-To: Message-ID: <94eaa1515369e8ef183e.1210170959@duo.random> # HG changeset patch # User Andrea Arcangeli # Date 1210115508 -7200 # Node ID 94eaa1515369e8ef183e2457f6f25a7f36473d70 # Parent 6b384bb988786aa78ef07440180e4b2948c4c6a2 mm_lock-rwsem Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock conversion. 
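The ordering rule added above, and the mm_lock() scheme this patch converts, rest on the same idea: a task that must hold several locks at once takes them in ascending address order, so any two tasks contending for an overlapping set acquire the shared locks in the same order and cannot each end up waiting on the other. A minimal stand-alone sketch of the idea (illustrative only; invented names, with userspace pthread rwlocks standing in for the kernel's rw_semaphores):

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_lock_ptr(const void *a, const void *b)
{
	uintptr_t pa = (uintptr_t)*(pthread_rwlock_t *const *)a;
	uintptr_t pb = (uintptr_t)*(pthread_rwlock_t *const *)b;

	return (pa > pb) - (pa < pb);
}

/* Take every lock in the set, lowest address first; a lock that
 * appears more than once in the array is taken only once. */
static void lock_all(pthread_rwlock_t **locks, size_t nr)
{
	pthread_rwlock_t *last = NULL;
	size_t i;

	qsort(locks, nr, sizeof(*locks), cmp_lock_ptr);
	for (i = 0; i < nr; i++) {
		if (locks[i] == last)
			continue;
		last = locks[i];
		pthread_rwlock_wrlock(last);
	}
}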
Signed-off-by: Andrea Arcangeli diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1084,10 +1084,10 @@ extern int install_special_mapping(struc unsigned long flags, struct page **pages); struct mm_lock_data { - spinlock_t **i_mmap_locks; - spinlock_t **anon_vma_locks; - size_t nr_i_mmap_locks; - size_t nr_anon_vma_locks; + struct rw_semaphore **i_mmap_sems; + struct rw_semaphore **anon_vma_sems; + size_t nr_i_mmap_sems; + size_t nr_anon_vma_sems; }; extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2255,8 +2255,8 @@ int install_special_mapping(struct mm_st static int mm_lock_cmp(const void *a, const void *b) { - unsigned long _a = (unsigned long)*(spinlock_t **)a; - unsigned long _b = (unsigned long)*(spinlock_t **)b; + unsigned long _a = (unsigned long)*(struct rw_semaphore **)a; + unsigned long _b = (unsigned long)*(struct rw_semaphore **)b; cond_resched(); if (_a < _b) @@ -2266,7 +2266,7 @@ static int mm_lock_cmp(const void *a, co return 0; } -static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, +static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems, int anon) { struct vm_area_struct *vma; @@ -2275,59 +2275,59 @@ static unsigned long mm_lock_sort(struct for (vma = mm->mmap; vma; vma = vma->vm_next) { if (anon) { if (vma->anon_vma) - locks[i++] = &vma->anon_vma->lock; + sems[i++] = &vma->anon_vma->sem; } else { if (vma->vm_file && vma->vm_file->f_mapping) - locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem; } } if (!i) goto out; - sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL); out: return i; } static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, - spinlock_t **locks) + struct rw_semaphore **sems) { - return mm_lock_sort(mm, locks, 1); + return mm_lock_sort(mm, sems, 1); } static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, - spinlock_t **locks) + struct rw_semaphore **sems) { - return mm_lock_sort(mm, locks, 0); + return mm_lock_sort(mm, sems, 0); } -static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock) { - spinlock_t *last = NULL; + struct rw_semaphore *last = NULL; size_t i; for (i = 0; i < nr; i++) /* Multiple vmas may use the same lock. */ - if (locks[i] != last) { - BUG_ON((unsigned long) last > (unsigned long) locks[i]); - last = locks[i]; + if (sems[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) sems[i]); + last = sems[i]; if (lock) - spin_lock(last); + down_write(last); else - spin_unlock(last); + up_write(last); } } -static inline void __mm_lock(spinlock_t **locks, size_t nr) +static inline void __mm_lock(struct rw_semaphore **sems, size_t nr) { - mm_lock_unlock(locks, nr, 1); + mm_lock_unlock(sems, nr, 1); } -static inline void __mm_unlock(spinlock_t **locks, size_t nr) +static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr) { - mm_lock_unlock(locks, nr, 0); + mm_lock_unlock(sems, nr, 0); } /* @@ -2358,10 +2358,10 @@ static inline void __mm_unlock(spinlock_ * of vmas is defined in /proc/sys/vm/max_map_count. * * mm_lock() can fail if memory allocation fails. 
The worst case - * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *), - * so around 1Mbyte, but in practice it'll be much less because - * normally there won't be max_map_count vmas allocated in the task - * that runs mm_lock(). + * vmalloc allocation required is 2*max_map_count*sizeof(struct + * rw_semaphore *), so around 1Mbyte, but in practice it'll be much + * less because normally there won't be max_map_count vmas allocated + * in the task that runs mm_lock(). * * The vmalloc memory allocated by mm_lock is stored in the * mm_lock_data structure that must be allocated by the caller and it @@ -2375,16 +2375,16 @@ static inline void __mm_unlock(spinlock_ */ int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) { - spinlock_t **anon_vma_locks, **i_mmap_locks; + struct rw_semaphore **anon_vma_sems, **i_mmap_sems; if (mm->map_count) { - anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); - if (unlikely(!anon_vma_locks)) + anon_vma_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count); + if (unlikely(!anon_vma_sems)) return -ENOMEM; - i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); - if (unlikely(!i_mmap_locks)) { - vfree(anon_vma_locks); + i_mmap_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count); + if (unlikely(!i_mmap_sems)) { + vfree(anon_vma_sems); return -ENOMEM; } @@ -2392,31 +2392,31 @@ int mm_lock(struct mm_struct *mm, struct * When mm_lock_sort_anon_vma/i_mmap returns zero it * means there's no lock to take and so we can free * the array here without waiting mm_unlock. mm_unlock - * will do nothing if nr_i_mmap/anon_vma_locks is + * will do nothing if nr_i_mmap/anon_vma_sems is * zero. */ - data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); - data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); + data->nr_anon_vma_sems = mm_lock_sort_anon_vma(mm, anon_vma_sems); + data->nr_i_mmap_sems = mm_lock_sort_i_mmap(mm, i_mmap_sems); - if (data->nr_anon_vma_locks) { - __mm_lock(anon_vma_locks, data->nr_anon_vma_locks); - data->anon_vma_locks = anon_vma_locks; + if (data->nr_anon_vma_sems) { + __mm_lock(anon_vma_sems, data->nr_anon_vma_sems); + data->anon_vma_sems = anon_vma_sems; } else - vfree(anon_vma_locks); + vfree(anon_vma_sems); - if (data->nr_i_mmap_locks) { - __mm_lock(i_mmap_locks, data->nr_i_mmap_locks); - data->i_mmap_locks = i_mmap_locks; + if (data->nr_i_mmap_sems) { + __mm_lock(i_mmap_sems, data->nr_i_mmap_sems); + data->i_mmap_sems = i_mmap_sems; } else - vfree(i_mmap_locks); + vfree(i_mmap_sems); } return 0; } -static void mm_unlock_vfree(spinlock_t **locks, size_t nr) +static void mm_unlock_vfree(struct rw_semaphore **sems, size_t nr) { - __mm_unlock(locks, nr); - vfree(locks); + __mm_unlock(sems, nr); + vfree(sems); } /* @@ -2435,11 +2435,11 @@ void mm_unlock(struct mm_struct *mm, str void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) { if (mm->map_count) { - if (data->nr_anon_vma_locks) - mm_unlock_vfree(data->anon_vma_locks, - data->nr_anon_vma_locks); - if (data->nr_i_mmap_locks) - mm_unlock_vfree(data->i_mmap_locks, - data->nr_i_mmap_locks); + if (data->nr_anon_vma_sems) + mm_unlock_vfree(data->anon_vma_sems, + data->nr_anon_vma_sems); + if (data->nr_i_mmap_sems) + mm_unlock_vfree(data->i_mmap_sems, + data->nr_i_mmap_sems); } } From andrea at qumranet.com Wed May 7 07:35:50 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 07 May 2008 16:35:50 +0200 Subject: [ofa-general] [PATCH 00 of 11] mmu notifier #v16 Message-ID: Hello, this is the last 
update of the mmu notifier patch. Jack asked for a __mmu_notifier_register that can be called with mmap_sem held in write mode. Here is an update with that change, plus allowing ->release not to be implemented (a two-line change to mmu_notifier.c). The entire diff between v15 and v16 mmu-notifier-core was posted in a separate email. From andrea at qumranet.com Wed May 7 08:00:15 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 7 May 2008 17:00:15 +0200 Subject: [ofa-general] Re: [PATCH 01 of 12] Core of mmu notifiers In-Reply-To: <20080429160340.GG8315@duo.random> References: <20080424153943.GJ24536@duo.random> <20080424174145.GM24536@duo.random> <20080426131734.GB19717@sgi.com> <20080427122727.GO9514@duo.random> <20080429001052.GA8315@duo.random> <20080429153052.GE8315@duo.random> <20080429155030.GB28944@sgi.com> <20080429160340.GG8315@duo.random> Message-ID: <20080507150014.GI8362@duo.random> On Tue, Apr 29, 2008 at 06:03:40PM +0200, Andrea Arcangeli wrote: > Christoph if you've interest in evolving anon-vma-sem and i_mmap_sem > yourself in this direction, you're very welcome to go ahead while I In case you didn't notice this already, for a further explanation of why semaphores run slower for small critical sections and why the conversion from spinlock to rwsem should happen under a config option, see the "AIM7 40% regression with 2.6.26-rc1" thread. From rdreier at cisco.com Wed May 7 08:29:27 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 08:29:27 -0700 Subject: [ofa-general] [RFC/PATCH 1/2] RDMA/cxgb3: Don't add PBL memory to gen_pool in chunks Message-ID: Current iw_cxgb3 code adds PBL memory to the driver's gen_pool in 2 MB chunks. This limits the largest single allocation that can be done to the same size, which means that with 4 KB pages, each of which takes 8 bytes of PBL memory, the largest memory region that can be allocated is 1 GB (256K PBL entries * 4 KB/entry). Remove this limit by adding all the PBL memory in a single gen_pool chunk, if possible. Add code that falls back to smaller chunks if gen_pool_add() fails, which can happen if there is not sufficient contiguous lowmem for the internal gen_pool bitmap.
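The fallback described here amounts to the following loop (a stand-alone sketch with invented names, not the driver code): try to cover the remaining range with the current chunk size, and every time the allocator refuses a chunk, halve the chunk size and retry from the same start address, down to a floor.

/* Sketch: add [base, top] to a pool in the largest chunks that succeed;
 * add_chunk() returns 0 on success, nonzero on failure. */
static int add_range_in_chunks(unsigned long base, unsigned long top,
			       unsigned long min_chunk,
			       int (*add_chunk)(unsigned long start,
						unsigned long len))
{
	unsigned long start = base;
	unsigned long chunk = top - base + 1;	/* first try: one big chunk */

	while (start < top) {
		if (chunk > top - start + 1)
			chunk = top - start + 1;
		if (add_chunk(start, chunk)) {
			if (chunk <= min_chunk)
				return -1;	/* even the smallest chunk failed */
			chunk >>= 1;		/* halve and retry this region */
		} else {
			start += chunk;		/* chunk added; move past it */
		}
	}
	return 0;
}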
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/cxgb3/cxio_resource.c | 36 +++++++++++++++++++++------ 1 files changed, 28 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_resource.c b/drivers/infiniband/hw/cxgb3/cxio_resource.c index 45ed4f2..bd233c0 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_resource.c +++ b/drivers/infiniband/hw/cxgb3/cxio_resource.c @@ -250,7 +250,6 @@ void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp) */ #define MIN_PBL_SHIFT 8 /* 256B == min PBL size (32 entries) */ -#define PBL_CHUNK 2*1024*1024 u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size) { @@ -267,14 +266,35 @@ void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size) int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p) { - unsigned long i; + unsigned pbl_start, pbl_chunk; + rdev_p->pbl_pool = gen_pool_create(MIN_PBL_SHIFT, -1); - if (rdev_p->pbl_pool) - for (i = rdev_p->rnic_info.pbl_base; - i <= rdev_p->rnic_info.pbl_top - PBL_CHUNK + 1; - i += PBL_CHUNK) - gen_pool_add(rdev_p->pbl_pool, i, PBL_CHUNK, -1); - return rdev_p->pbl_pool ?
0 : -ENOMEM; + if (!rdev_p->pbl_pool) + return -ENOMEM; + + pbl_start = rdev_p->rnic_info.pbl_base; + pbl_chunk = rdev_p->rnic_info.pbl_top - pbl_start + 1; + + while (pbl_start < rdev_p->rnic_info.pbl_top) { + pbl_chunk = min(rdev_p->rnic_info.pbl_top - pbl_start + 1, + pbl_chunk); + if (gen_pool_add(rdev_p->pbl_pool, pbl_start, pbl_chunk, -1)) { + PDBG("%s failed to add PBL chunk (%x/%x)\n", + __func__, pbl_start, pbl_chunk); + if (pbl_chunk <= 1024 << MIN_PBL_SHIFT) { + printk(KERN_WARNING MOD "%s: Failed to add all PBL chunks (%x/%x)\n", + __func__, pbl_start, rdev_p->rnic_info.pbl_top - pbl_start); + return 0; + } + pbl_chunk >>= 1; + } else { + PDBG("%s added PBL chunk (%x/%x)\n", + __func__, pbl_start, pbl_chunk); + pbl_start += pbl_chunk; + } + } + + return 0; } void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p) -- 1.5.5.1 From rdreier at cisco.com Wed May 7 08:29:59 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 08:29:59 -0700 Subject: [ofa-general] [RFC/PATCH 2/2] RDMA/cxgb3: Fix severe limit on userspace memory registration size Message-ID: Currently, iw_cxgb3 is severely limited on the amount of userspace memory that can be registered in a single memory region, which causes big problems for applications that expect to be able to register 100s of MB. The problem is that the driver uses a single kmalloc()ed buffer to hold the physical buffer list (PBL) for the entire memory region during registration, which means that 8 bytes of contiguous memory are required for each page of memory being registered. For example, a 64 MB registration will require 128 KB of contiguous memory with 4 KB pages, and it is unlikely that such an allocation will succeed on a busy system. This is purely a driver problem: the temporary page list buffer is not needed by the hardware, so we can fix this by writing the PBL to the hardware in page-sized chunks rather than all at once. We do this by splitting the memory registration operation up into several steps: - Allocate PBL space in adapter memory for the full registration - Copy PBL to adapter memory in chunks - Allocate STag and enable memory region This also allows several other cleanups to the __cxio_tpt_op() interface and related parts of the driver. This change leaves the reregister memory region and memory window operations broken, but they already didn't work due to other longstanding bugs, so fixing them will be left to a later patch.
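The second step (copying the PBL in chunks) boils down to the following loop (a sketch with invented helper names; the real entry points introduced below are iwch_alloc_pbl(), iwch_write_pbl() and iwch_free_pbl()): page addresses are staged in a single page-sized buffer that is flushed to adapter memory each time it fills, so no allocation ever has to exceed one page.

/* Sketch: stream 'total' page addresses to adapter memory through one
 * page-sized staging buffer; write_chunk() pushes npages entries at
 * the given offset into the region's PBL space. */
static int write_pbl_in_chunks(__be64 *staging,
			       u64 (*next_dma_addr)(void *cursor), void *cursor,
			       int total,
			       int (*write_chunk)(__be64 *buf, int npages,
						  int offset))
{
	int entries_per_page = PAGE_SIZE / sizeof(__be64);
	int i = 0, done = 0, err;

	while (done + i < total) {
		staging[i++] = cpu_to_be64(next_dma_addr(cursor));
		if (i == entries_per_page) {
			err = write_chunk(staging, i, done);
			if (err)
				return err;
			done += i;
			i = 0;
		}
	}
	return i ? write_chunk(staging, i, done) : 0;	/* flush remainder */
}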
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 90 +++++++++++++------------- drivers/infiniband/hw/cxgb3/cxio_hal.h | 8 +- drivers/infiniband/hw/cxgb3/iwch_mem.c | 75 +++++++++++++++-------- drivers/infiniband/hw/cxgb3/iwch_provider.c | 68 ++++++++++++++++----- drivers/infiniband/hw/cxgb3/iwch_provider.h | 8 +- 5 files changed, 155 insertions(+), 94 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index 5fd8506..ebf9d30 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -588,7 +588,7 @@ static int cxio_hal_destroy_ctrl_qp(struct cxio_rdev *rdev_p) * caller aquires the ctrl_qp lock before the call */ static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, - u32 len, void *data, int completion) + u32 len, void *data) { u32 i, nr_wqe, copy_len; u8 *copy_data; @@ -624,7 +624,7 @@ static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, flag = 0; if (i == (nr_wqe - 1)) { /* last WQE */ - flag = completion ?
T3_COMPLETION_FLAG : 0; + flag = T3_COMPLETION_FLAG; if (len % 32) utx_len = len / 32 + 1; else @@ -683,21 +683,20 @@ static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, return 0; } -/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl, and pbl_size - * OUT: stag index, actual pbl_size, pbl_addr allocated. +/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl_size and pbl_addr + * OUT: stag index * TBD: shared memory region support */ static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, u32 *stag, u8 stag_state, u32 pdid, enum tpt_mem_type type, enum tpt_mem_perm perm, - u32 zbva, u64 to, u32 len, u8 page_size, __be64 *pbl, - u32 *pbl_size, u32 *pbl_addr) + u32 zbva, u64 to, u32 len, u8 page_size, + u32 pbl_size, u32 pbl_addr) { int err; struct tpt_entry tpt; u32 stag_idx; u32 wptr; - int rereg = (*stag != T3_STAG_UNSET); stag_state = stag_state > 0; stag_idx = (*stag) >> 8; @@ -711,30 +710,8 @@ static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, PDBG("%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x\n", __func__, stag_state, type, pdid, stag_idx); - if (reset_tpt_entry) - cxio_hal_pblpool_free(rdev_p, *pbl_addr, *pbl_size << 3); - else if (!rereg) { - *pbl_addr = cxio_hal_pblpool_alloc(rdev_p, *pbl_size << 3); - if (!*pbl_addr) { - return -ENOMEM; - } - } - mutex_lock(&rdev_p->ctrl_qp.lock); - /* write PBL first if any - update pbl only if pbl list exist */ - if (pbl) { - - PDBG("%s *pdb_addr 0x%x, pbl_base 0x%x, pbl_size %d\n", - __func__, *pbl_addr, rdev_p->rnic_info.pbl_base, - *pbl_size); - err = cxio_hal_ctrl_qp_write_mem(rdev_p, - (*pbl_addr >> 5), - (*pbl_size << 3), pbl, 0); - if (err) - goto ret; - } - /* write TPT entry */ if (reset_tpt_entry) memset(&tpt, 0, sizeof(tpt)); @@ -749,23 +726,23 @@ static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, V_TPT_ADDR_TYPE((zbva ? TPT_ZBTO : TPT_VATO)) | V_TPT_PAGE_SIZE(page_size)); tpt.rsvd_pbl_addr = reset_tpt_entry ? 0 : - cpu_to_be32(V_TPT_PBL_ADDR(PBL_OFF(rdev_p, *pbl_addr)>>3)); + cpu_to_be32(V_TPT_PBL_ADDR(PBL_OFF(rdev_p, pbl_addr)>>3)); tpt.len = cpu_to_be32(len); tpt.va_hi = cpu_to_be32((u32) (to >> 32)); tpt.va_low_or_fbo = cpu_to_be32((u32) (to & 0xFFFFFFFFULL)); tpt.rsvd_bind_cnt_or_pstag = 0; tpt.rsvd_pbl_size = reset_tpt_entry ?
0 : - cpu_to_be32(V_TPT_PBL_SIZE((*pbl_size) >> 2)); + cpu_to_be32(V_TPT_PBL_SIZE(pbl_size >> 2)); } err = cxio_hal_ctrl_qp_write_mem(rdev_p, stag_idx + (rdev_p->rnic_info.tpt_base >> 5), - sizeof(tpt), &tpt, 1); + sizeof(tpt), &tpt); /* release the stag index to free pool */ if (reset_tpt_entry) cxio_hal_put_stag(rdev_p->rscp, stag_idx); -ret: + wptr = rdev_p->ctrl_qp.wptr; mutex_unlock(&rdev_p->ctrl_qp.lock); if (!err) @@ -776,44 +753,67 @@ ret: return err; } +int cxio_write_pbl(struct cxio_rdev *rdev_p, __be64 *pbl, + u32 pbl_addr, u32 pbl_size) +{ + u32 wptr; + int err; + + PDBG("%s *pdb_addr 0x%x, pbl_base 0x%x, pbl_size %d\n", + __func__, pbl_addr, rdev_p->rnic_info.pbl_base, + pbl_size); + + mutex_lock(&rdev_p->ctrl_qp.lock); + err = cxio_hal_ctrl_qp_write_mem(rdev_p, pbl_addr >> 5, pbl_size << 3, + pbl); + wptr = rdev_p->ctrl_qp.wptr; + mutex_unlock(&rdev_p->ctrl_qp.lock); + if (err) + return err; + + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + SEQ32_GE(rdev_p->ctrl_qp.rptr, + wptr))) + return -ERESTARTSYS; + + return 0; +} + int cxio_register_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid, enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, - u8 page_size, __be64 *pbl, u32 *pbl_size, - u32 *pbl_addr) + u8 page_size, u32 pbl_size, u32 pbl_addr) { *stag = T3_STAG_UNSET; return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, - zbva, to, len, page_size, pbl, pbl_size, pbl_addr); + zbva, to, len, page_size, pbl_size, pbl_addr); } int cxio_reregister_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid, enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, - u8 page_size, __be64 *pbl, u32 *pbl_size, - u32 *pbl_addr) + u8 page_size, u32 pbl_size, u32 pbl_addr) { return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, - zbva, to, len, page_size, pbl, pbl_size, pbl_addr); + zbva, to, len, page_size, pbl_size, pbl_addr); } int cxio_dereg_mem(struct cxio_rdev *rdev_p, u32 stag, u32 pbl_size, u32 pbl_addr) { - return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, - &pbl_size, &pbl_addr); + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, + pbl_size, pbl_addr); } int cxio_allocate_window(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid) { - u32 pbl_size = 0; *stag = T3_STAG_UNSET; return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_MW, 0, 0, 0ULL, 0, 0, - NULL, &pbl_size, NULL); + 0, 0); } int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag) { - return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, - NULL, NULL); + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, + 0, 0); } int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h index 69ab08e..6e128f6 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.h +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h @@ -154,14 +154,14 @@ int cxio_create_qp(struct cxio_rdev *rdev, u32 kernel_domain, struct t3_wq *wq, int cxio_destroy_qp(struct cxio_rdev *rdev, struct t3_wq *wq, struct cxio_ucontext *uctx); int cxio_peek_cq(struct t3_wq *wr, struct t3_cq *cq, int opcode); +int cxio_write_pbl(struct cxio_rdev *rdev_p, __be64 *pbl, + u32 pbl_addr, u32 pbl_size); int cxio_register_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, - u8 page_size, __be64 *pbl, u32 *pbl_size, - u32 *pbl_addr); + u8 page_size, u32 pbl_size, u32 pbl_addr); int cxio_reregister_phys_mem(struct 
cxio_rdev *rdev, u32 * stag, u32 pdid, enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, - u8 page_size, __be64 *pbl, u32 *pbl_size, - u32 *pbl_addr); + u8 page_size, u32 pbl_size, u32 pbl_addr); int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, u32 pbl_addr); int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid); diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c index 58c3d61..ec49a5c 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_mem.c +++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c @@ -35,17 +35,26 @@ #include #include "cxio_hal.h" +#include "cxio_resource.h" #include "iwch.h" #include "iwch_provider.h" -int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, - struct iwch_mr *mhp, - int shift, - __be64 *page_list) +static void iwch_finish_mem_reg(struct iwch_mr *mhp, u32 stag) { - u32 stag; u32 mmid; + mhp->attr.state = 1; + mhp->attr.stag = stag; + mmid = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + insert_handle(mhp->rhp, &mhp->rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p\n", __func__, mmid, mhp); +} + +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, int shift) +{ + u32 stag; if (cxio_register_phys_mem(&rhp->rdev, &stag, mhp->attr.pdid, @@ -53,28 +62,21 @@ int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, mhp->attr.zbva, mhp->attr.va_fbo, mhp->attr.len, - shift-12, - page_list, - &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) + shift - 12, + mhp->attr.pbl_size, mhp->attr.pbl_addr)) return -ENOMEM; - mhp->attr.state = 1; - mhp->attr.stag = stag; - mmid = stag >> 8; - mhp->ibmr.rkey = mhp->ibmr.lkey = stag; - insert_handle(rhp, &rhp->mmidr, mhp, mmid); - PDBG("%s mmid 0x%x mhp %p\n", __func__, mmid, mhp); + + iwch_finish_mem_reg(mhp, stag); + return 0; } int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, struct iwch_mr *mhp, int shift, - __be64 *page_list, int npages) { u32 stag; - u32 mmid; - /* We could support this... 
*/ if (npages > mhp->attr.pbl_size) @@ -87,19 +89,40 @@ int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, mhp->attr.zbva, mhp->attr.va_fbo, mhp->attr.len, - shift-12, - page_list, - &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) + shift - 12, + mhp->attr.pbl_size, mhp->attr.pbl_addr)) return -ENOMEM; - mhp->attr.state = 1; - mhp->attr.stag = stag; - mmid = stag >> 8; - mhp->ibmr.rkey = mhp->ibmr.lkey = stag; - insert_handle(rhp, &rhp->mmidr, mhp, mmid); - PDBG("%s mmid 0x%x mhp %p\n", __func__, mmid, mhp); + + iwch_finish_mem_reg(mhp, stag); + + return 0; +} + +int iwch_alloc_pbl(struct iwch_mr *mhp, int npages) +{ + mhp->attr.pbl_addr = cxio_hal_pblpool_alloc(&mhp->rhp->rdev, + npages << 3); + + if (!mhp->attr.pbl_addr) + return -ENOMEM; + + mhp->attr.pbl_size = npages; + return 0; } +void iwch_free_pbl(struct iwch_mr *mhp) +{ + cxio_hal_pblpool_free(&mhp->rhp->rdev, mhp->attr.pbl_addr, + mhp->attr.pbl_size << 3); +} + +int iwch_write_pbl(struct iwch_mr *mhp, __be64 *pages, int npages, int offset) +{ + return cxio_write_pbl(&mhp->rhp->rdev, pages, + mhp->attr.pbl_addr + (offset << 3), npages); +} + int build_phys_page_list(struct ib_phys_buf *buffer_list, int num_phys_buf, u64 *iova_start, diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index d07d3a3..8934178 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -442,6 +442,7 @@ static int iwch_dereg_mr(struct ib_mr *ib_mr) mmid = mhp->attr.stag >> 8; cxio_dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, mhp->attr.pbl_addr); + iwch_free_pbl(mhp); remove_handle(rhp, &rhp->mmidr, mmid); if (mhp->kva) kfree((void *) (unsigned long) mhp->kva); @@ -475,6 +476,8 @@ static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd, if (!mhp) return ERR_PTR(-ENOMEM); + mhp->rhp = rhp; + /* First check that we have enough alignment */ if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) { ret = -EINVAL; @@ -492,7 +495,17 @@ static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd, if (ret) goto err; - mhp->rhp = rhp; + ret = iwch_alloc_pbl(mhp, npages); + if (ret) { + kfree(page_list); + goto err_pbl; + } + + ret = iwch_write_pbl(mhp, page_list, npages, 0); + kfree(page_list); + if (ret) + goto err_pbl; + mhp->attr.pdid = php->pdid; mhp->attr.zbva = 0; @@ -502,12 +515,15 @@ static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd, mhp->attr.len = (u32) total_size; mhp->attr.pbl_size = npages; - ret = iwch_register_mem(rhp, php, mhp, shift, page_list); - kfree(page_list); - if (ret) { - goto err; - } + ret = iwch_register_mem(rhp, php, mhp, shift); + if (ret) + goto err_pbl; + return &mhp->ibmr; + +err_pbl: + iwch_free_pbl(mhp); + err: kfree(mhp); return ERR_PTR(ret); @@ -560,7 +576,7 @@ static int iwch_reregister_phys_mem(struct ib_mr *mr, return ret; } - ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list, npages); + ret = iwch_reregister_mem(rhp, php, &mh, shift, npages); kfree(page_list); if (ret) { return ret; @@ -602,6 +618,8 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, if (!mhp) return ERR_PTR(-ENOMEM); + mhp->rhp = rhp; + mhp->umem = ib_umem_get(pd->uobject->context, start, length, acc, 0); if (IS_ERR(mhp->umem)) { err = PTR_ERR(mhp->umem); @@ -615,10 +633,14 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, list_for_each_entry(chunk, &mhp->umem->chunk_list, list) n += chunk->nents; - pages = kmalloc(n * sizeof(u64), 
GFP_KERNEL); + err = iwch_alloc_pbl(mhp, n); + if (err) + goto err; + + pages = (__be64 *) __get_free_page(GFP_KERNEL); if (!pages) { err = -ENOMEM; - goto err; + goto err_pbl; } i = n = 0; @@ -630,25 +652,38 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, pages[i++] = cpu_to_be64(sg_dma_address( &chunk->page_list[j]) + mhp->umem->page_size * k); + if (i == PAGE_SIZE / sizeof *pages) { + err = iwch_write_pbl(mhp, pages, i, n); + if (err) + goto pbl_done; + n += i; + i = 0; + } } } - mhp->rhp = rhp; + if (i) + err = iwch_write_pbl(mhp, pages, i, n); + +pbl_done: + free_page((unsigned long) pages); + if (err) + goto err_pbl; + mhp->attr.pdid = php->pdid; mhp->attr.zbva = 0; mhp->attr.perms = iwch_ib_to_tpt_access(acc); mhp->attr.va_fbo = virt; mhp->attr.page_size = shift - 12; mhp->attr.len = (u32) length; - mhp->attr.pbl_size = i; - err = iwch_register_mem(rhp, php, mhp, shift, pages); - kfree(pages); + + err = iwch_register_mem(rhp, php, mhp, shift); if (err) - goto err; + goto err_pbl; if (udata && !t3a_device(rhp)) { uresp.pbl_addr = (mhp->attr.pbl_addr - - rhp->rdev.rnic_info.pbl_base) >> 3; + rhp->rdev.rnic_info.pbl_base) >> 3; PDBG("%s user resp pbl_addr 0x%x\n", __func__, uresp.pbl_addr); @@ -661,6 +696,9 @@ static struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, return &mhp->ibmr; +err_pbl: + iwch_free_pbl(mhp); + err: ib_umem_release(mhp->umem); kfree(mhp); diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h index db5100d..836163f 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.h +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -340,14 +340,14 @@ int iwch_quiesce_qps(struct iwch_cq *chp); int iwch_resume_qps(struct iwch_cq *chp); void stop_read_rep_timer(struct iwch_qp *qhp); int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, - struct iwch_mr *mhp, - int shift, - __be64 *page_list); + struct iwch_mr *mhp, int shift); int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, struct iwch_mr *mhp, int shift, - __be64 *page_list, int npages); +int iwch_alloc_pbl(struct iwch_mr *mhp, int npages); +void iwch_free_pbl(struct iwch_mr *mhp); +int iwch_write_pbl(struct iwch_mr *mhp, __be64 *pages, int npages, int offset); int build_phys_page_list(struct ib_phys_buf *buffer_list, int num_phys_buf, u64 *iova_start, -- 1.5.5.1 From holt at sgi.com Wed May 7 08:59:48 2008 From: holt at sgi.com (Robin Holt) Date: Wed, 7 May 2008 10:59:48 -0500 Subject: [ofa-general] Re: [PATCH 02 of 11] get_task_mm In-Reply-To: References: Message-ID: <20080507155948.GO18857@sgi.com> You can drop this patch. This turned out to be a race in xpmem. It "appeared" as if it were a race in get_task_mm, but it really is not. The current->mm field is cleared under the task_lock and the task_lock is grabbed by get_task_mm. I have been testing your v15 version without this patch and have not encountered the problem again (now that I fixed my xpmem race).
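For reference, get_task_mm() looks roughly like this (paraphrased from memory, not an exact quote of kernel/fork.c); since exit_mm() clears current->mm under the same task_lock, the read of ->mm and the mm_users increment below cannot race with it:

struct mm_struct *get_task_mm(struct task_struct *task)
{
	struct mm_struct *mm;

	task_lock(task);
	mm = task->mm;
	if (mm) {
		if (task->flags & PF_BORROWED_MM)
			mm = NULL;	/* kernel thread borrowing an mm */
		else
			atomic_inc(&mm->mm_users);
	}
	task_unlock(task);
	return mm;
}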
Thanks, Robin On Wed, May 07, 2008 at 04:35:52PM +0200, Andrea Arcangeli wrote: > # HG changeset patch > # User Andrea Arcangeli > # Date 1210115127 -7200 > # Node ID c5badbefeee07518d9d1acca13e94c981420317c > # Parent e20917dcc8284b6a07cfcced13dda4cbca850a9c > get_task_mm From andrea at qumranet.com Wed May 7 09:20:07 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 7 May 2008 18:20:07 +0200 Subject: [ofa-general] Re: [PATCH 02 of 11] get_task_mm In-Reply-To: <20080507155948.GO18857@sgi.com> References: <20080507155948.GO18857@sgi.com> Message-ID: <20080507162006.GB18260@duo.random> On Wed, May 07, 2008 at 10:59:48AM -0500, Robin Holt wrote: > You can drop this patch. > > This turned out to be a race in xpmem. It "appeared" as if it were a > race in get_task_mm, but it really is not. The current->mm field is > cleared under the task_lock and the task_lock is grabbed by get_task_mm. 100% agreed, I'll nuke it as it seems really a noop. > I have been testing your v15 version without this patch and have not > encountered the problem again (now that I fixed my xpmem race). Great. About your other deadlock, I'm curious if my deadlock fix for the i_mmap_sem patch helped. That was crashing kvm with a VM 2G in the swap + a swaphog allocating and freeing another 2G of swap in a loop. I couldn't reproduce any other problem with KVM since I fixed that bit, regardless of whether I apply only mmu-notifier-core (2.6.26 version) or the full patchset (post 2.6.26). From swise at opengridcomputing.com Wed May 7 09:29:49 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 May 2008 11:29:49 -0500 Subject: [ofa-general] [RFC/PATCH 1/2] RDMA/cxgb3: Don't add PBL memory to gen_pool in chunks In-Reply-To: References: Message-ID: <4821D8FD.9030705@opengridcomputing.com> Roland Dreier wrote: > Current iw_cxgb3 code adds PBL memory to the driver's gen_pool in 2 MB > chunks. This limits the largest single allocation that can be done to > the same size, which means that with 4 KB pages, each of which takes 8 > bytes of PBL memory, the largest memory region that can be allocated > is 1 GB (256K PBL entries * 4 KB/entry). > > Remove this limit by adding all the PBL memory in a single gen_pool > chunk, if possible. Add code that falls back to smaller chunks if > gen_pool_add() fails, which can happen if there is not sufficient > contiguous lowmem for the internal gen_pool bitmap. > > Signed-off-by: Roland Dreier > Acked-by: Steve Wise From swise at opengridcomputing.com Wed May 7 09:30:12 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 May 2008 11:30:12 -0500 Subject: [ofa-general] [RFC/PATCH 2/2] RDMA/cxgb3: Fix severe limit on userspace memory registration size In-Reply-To: References: Message-ID: <4821D914.7030403@opengridcomputing.com> Roland Dreier wrote: > Currently, iw_cxgb3 is severely limited on the amount of userspace > memory that can be registered in a single memory region, which > causes big problems for applications that expect to be able to > register 100s of MB. > > The problem is that the driver uses a single kmalloc()ed buffer to > hold the physical buffer list (PBL) for the entire memory region > during registration, which means that 8 bytes of contiguous memory are > required for each page of memory being registered. For example, a 64 > MB registration will require 128 KB of contiguous memory with 4 KB > pages, and it is unlikely that such an allocation will succeed on a busy > system.
> > This is purely a driver problem: the temporary page list buffer is not > needed by the hardware, so we can fix this by writing the PBL to the > hardware in page-sized chunks rather than all at once. We do this by > splitting the memory registration operation up into several steps: > > - Allocate PBL space in adapter memory for the full registration > - Copy PBL to adapter memory in chunks > - Allocate STag and enable memory region > > This also allows several other cleanups to the __cxio_tpt_op() > interface and related parts of the driver. > > This change leaves the reregister memory region and memory window > operations broken, but they already didn't work due to other > longstanding bugs, so fixing them will be left to a later patch. > > Signed-off-by: Roland Dreier > Acked-by: Steve Wise From riel at redhat.com Wed May 7 10:35:32 2008 From: riel at redhat.com (Rik van Riel) Date: Wed, 7 May 2008 13:35:32 -0400 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: References: Message-ID: <20080507133532.4a4df89d@bree.surriel.com> On Wed, 07 May 2008 16:35:51 +0200 Andrea Arcangeli wrote: > Signed-off-by: Andrea Arcangeli > Signed-off-by: Nick Piggin > Signed-off-by: Christoph Lameter Acked-by: Rik van Riel -- All rights reversed. From riel at redhat.com Wed May 7 10:39:43 2008 From: riel at redhat.com (Rik van Riel) Date: Wed, 7 May 2008 13:39:43 -0400 Subject: [ofa-general] Re: [PATCH 03 of 11] invalidate_page outside PT lock In-Reply-To: References: Message-ID: <20080507133943.3e76c899@bree.surriel.com> On Wed, 07 May 2008 16:35:53 +0200 Andrea Arcangeli wrote: > # HG changeset patch > # User Andrea Arcangeli > # Date 1210115129 -7200 > # Node ID d60d200565abde6a8ed45271e53cde9c5c75b426 > # Parent c5badbefeee07518d9d1acca13e94c981420317c > invalidate_page outside PT lock > > Moves all mmu notifier methods outside the PT lock (first and not last > step to make them sleep capable). This patch appears to undo some of the changes made by patch 01/11. Would it be an idea to merge them into one, so the first patch introduces the right conventions directly? -- All rights reversed. From riel at redhat.com Wed May 7 10:41:33 2008 From: riel at redhat.com (Rik van Riel) Date: Wed, 7 May 2008 13:41:33 -0400 Subject: [ofa-general] Re: [PATCH 04 of 11] free-pgtables In-Reply-To: <34f6a4bf67ce66714ba2.1210170954@duo.random> References: <34f6a4bf67ce66714ba2.1210170954@duo.random> Message-ID: <20080507134133.6c3f7d99@bree.surriel.com> On Wed, 07 May 2008 16:35:54 +0200 Andrea Arcangeli wrote: > Signed-off-by: Christoph Lameter > Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel -- All rights reversed. From riel at redhat.com Wed May 7 10:46:29 2008 From: riel at redhat.com (Rik van Riel) Date: Wed, 7 May 2008 13:46:29 -0400 Subject: [ofa-general] Re: [PATCH 05 of 11] unmap vmas tlb flushing In-Reply-To: <20bc6a66a86ef6bd6091.1210170955@duo.random> References: <20bc6a66a86ef6bd6091.1210170955@duo.random> Message-ID: <20080507134629.0dcfd4a1@bree.surriel.com> On Wed, 07 May 2008 16:35:55 +0200 Andrea Arcangeli wrote: > Signed-off-by: Christoph Lameter > Signed-off-by: Andrea Arcangeli Acked-by: Rik van Riel -- All rights reversed. 
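Coming back to the cxgb3 registration change acked above: the heart of the fix is the chunked PBL write-out. In simplified form it amounts to the following (a sketch of the patch's logic, with a hypothetical iterator standing in for the nested umem chunk/page walk; not the verbatim driver code):

        /* Stream the PBL into adapter memory one page-sized staging buffer
         * at a time instead of kmalloc()ing the whole list.  'i' counts
         * entries staged in the buffer, 'n' counts entries already written. */
        static int write_pbl_in_chunks(struct iwch_mr *mhp, struct ib_umem *umem)
        {
                __be64 *pages;
                u64 addr;
                int i = 0, n = 0, err = 0;

                pages = (__be64 *) __get_free_page(GFP_KERNEL);
                if (!pages)
                        return -ENOMEM;

                for_each_umem_dma_addr(umem, addr) {    /* hypothetical: yields each DMA address */
                        pages[i++] = cpu_to_be64(addr);
                        if (i == PAGE_SIZE / sizeof *pages) {   /* staging page full: flush it */
                                err = iwch_write_pbl(mhp, pages, i, n);
                                if (err)
                                        break;
                                n += i;
                                i = 0;
                        }
                }
                if (!err && i)                          /* flush the final partial chunk */
                        err = iwch_write_pbl(mhp, pages, i, n);
                free_page((unsigned long) pages);
                return err;
        }

The design point is that only one page of contiguous kernel memory is ever needed, no matter how large the registration is.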
From andrea at qumranet.com Wed May 7 10:57:05 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 7 May 2008 19:57:05 +0200 Subject: [ofa-general] Re: [PATCH 03 of 11] invalidate_page outside PT lock In-Reply-To: <20080507133943.3e76c899@bree.surriel.com> References: <20080507133943.3e76c899@bree.surriel.com> Message-ID: <20080507175705.GE18260@duo.random> On Wed, May 07, 2008 at 01:39:43PM -0400, Rik van Riel wrote: > Would it be an idea to merge them into one, so the first patch > introduces the right conventions directly? The only reason this isn't merged into one is that it requires non-obvious (though not difficult) changes to the core VM code. I wanted to keep an obviously safe approach for 2.6.26. The other conventions are only needed by XPMEM, and XPMEM can't work without all the other patches anyway. From rdreier at cisco.com Wed May 7 11:02:46 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 11:02:46 -0700 Subject: [ofa-general] Re: [PATCH 7/7] IB/ipath - fix SDMA error recovery in absence of link status change In-Reply-To: <20080506183652.6521.21456.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Tue, 06 May 2008 11:36:52 -0700") References: <20080506183615.6521.98230.stgit@eng-46.mv.qlogic.com> <20080506183652.6521.21456.stgit@eng-46.mv.qlogic.com> Message-ID: thanks, applied all 7 From rdreier at cisco.com Wed May 7 12:17:26 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 12:17:26 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get fixes for various low-level HW driver issues: - cxgb3 severe limits on memory registration size - ehca QP async event race - ipath miscellaneous issues Dave Olson (2): IB/ipath: Fix bug that can leave sends disabled after freeze recovery IB/ipath: Need to always request and handle PIO avail interrupts John Gregor (1): IB/ipath: Fix SDMA error recovery in absence of link status change Michael Albaugh (2): IB/ipath: Only warn about prototype chip during init IB/ipath: Fix count of packets received by kernel Ralph Campbell (2): IB/ipath: Only increment SSN if WQE is put on send queue IB/ipath: Return the correct opcode for RDMA WRITE with immediate Roland Dreier (2): RDMA/cxgb3: Don't add PBL memory to gen_pool in chunks RDMA/cxgb3: Fix severe limit on userspace memory registration size Stefan Roscher (1): IB/ehca: Wait for async events to finish before destroying QP drivers/infiniband/hw/cxgb3/cxio_hal.c | 90 ++++++-------- drivers/infiniband/hw/cxgb3/cxio_hal.h | 8 +- drivers/infiniband/hw/cxgb3/cxio_resource.c | 36 +++++-- drivers/infiniband/hw/cxgb3/iwch_mem.c | 75 +++++++----- drivers/infiniband/hw/cxgb3/iwch_provider.c | 68 ++++++++++--- drivers/infiniband/hw/cxgb3/iwch_provider.h | 8 +- drivers/infiniband/hw/ehca/ehca_classes.h | 2 + drivers/infiniband/hw/ehca/ehca_irq.c | 4 + drivers/infiniband/hw/ehca/ehca_qp.c | 5 + drivers/infiniband/hw/ipath/ipath_driver.c | 138 ++++++++++++++++++++++--- drivers/infiniband/hw/ipath/ipath_file_ops.c | 72 ++++++-------- drivers/infiniband/hw/ipath/ipath_iba7220.c | 26 ++--- drivers/infiniband/hw/ipath/ipath_init_chip.c | 95 ++++++++---------- drivers/infiniband/hw/ipath/ipath_intr.c | 80 ++------------ drivers/infiniband/hw/ipath/ipath_kernel.h | 8 ++-
drivers/infiniband/hw/ipath/ipath_rc.c | 6 +- drivers/infiniband/hw/ipath/ipath_ruc.c | 7 +- drivers/infiniband/hw/ipath/ipath_sdma.c | 44 ++++++-- drivers/infiniband/hw/ipath/ipath_verbs.c | 2 +- 19 files changed, 458 insertions(+), 316 deletions(-) From akpm at linux-foundation.org Wed May 7 13:02:14 2008 From: akpm at linux-foundation.org (Andrew Morton) Date: Wed, 7 May 2008 13:02:14 -0700 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: References: Message-ID: <20080507130214.5884d94a.akpm@linux-foundation.org> On Wed, 07 May 2008 16:35:51 +0200 Andrea Arcangeli wrote: > # HG changeset patch > # User Andrea Arcangeli > # Date 1210096013 -7200 > # Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c > # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8 > mmu-notifier-core > > ... > > --- a/include/linux/list.h > +++ b/include/linux/list.h > @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis > * or hlist_del_rcu(), running on this same list. > * However, it is perfectly legal to run concurrently with > * the _rcu list-traversal primitives, such as > - * hlist_for_each_entry(). > + * hlist_for_each_entry_rcu(). > */ > static inline void hlist_del_rcu(struct hlist_node *n) > { > @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct > if (!hlist_unhashed(n)) { > __hlist_del(n); > INIT_HLIST_NODE(n); > + } > +} > + > +/** > + * hlist_del_init_rcu - deletes entry from hash list with re-initialization > + * @n: the element to delete from the hash list. > + * > + * Note: list_unhashed() on entry does return true after this. It is Should that be "does" or "does not". "does", I suppose. It should refer to hlist_unhashed() The term "on entry" is a bit ambiguous - we normally use that as shorthand to mean "on entry to the function". So I'll change this to > + * Note: hlist_unhashed() on the node returns true after this. It is OK? > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -10,6 +10,7 @@ > #include > #include > #include > +#include OK, unrelated bugfix ;) > --- a/include/linux/srcu.h > +++ b/include/linux/srcu.h > @@ -27,6 +27,8 @@ > #ifndef _LINUX_SRCU_H > #define _LINUX_SRCU_H > > +#include And another. Fair enough. From akpm at linux-foundation.org Wed May 7 13:05:28 2008 From: akpm at linux-foundation.org (Andrew Morton) Date: Wed, 7 May 2008 13:05:28 -0700 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: References: Message-ID: <20080507130528.adfd154c.akpm@linux-foundation.org> On Wed, 07 May 2008 16:35:51 +0200 Andrea Arcangeli wrote: > # HG changeset patch > # User Andrea Arcangeli > # Date 1210096013 -7200 > # Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c > # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8 > mmu-notifier-core The patch looks OK to me. The proposal is that we sneak this into 2.6.26. Are there any sufficiently-serious objections to this? The patch will be a no-op for 2.6.26. This is all rather unusual. For the record, could we please review the reasons for wanting to do this? Thanks. From torvalds at linux-foundation.org Wed May 7 13:30:39 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 13:30:39 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507130528.adfd154c.akpm@linux-foundation.org> References: <20080507130528.adfd154c.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Andrew Morton wrote: > > The patch looks OK to me. 
As far as I can tell, authorship has been destroyed by at least two of the patches (ie Christoph seems to be the author, but Andrea seems to have dropped that fact). > The proposal is that we sneak this into 2.6.26. Are there any > sufficiently-serious objections to this? Yeah, too late and no upside. That "locking" code is also too ugly to live, at least without some serious arguments for why it has to be done that way. Sorting the locks? In a vmalloc'ed area? And calling this something innocuous like "mm_lock()"? Hell no. That code needs some serious re-thinking. Linus From torvalds at linux-foundation.org Wed May 7 13:56:23 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 13:56:23 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <6b384bb988786aa78ef0.1210170958@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> Message-ID: On Wed, 7 May 2008, Andrea Arcangeli wrote: > > Convert the anon_vma spinlock to a rw semaphore. This allows concurrent > traversal of reverse maps for try_to_unmap() and page_mkclean(). It also > allows the calling of sleeping functions from reverse map traversal as > needed for the notifier callbacks. It includes possible concurrency. This also looks very debatable indeed. The only performance numbers quoted are: > This results in f.e. the Aim9 brk performance test to got down by 10-15%. which just seems like a total disaster. The whole series looks bad, in fact. Lack of authorship, bad single-line description, and the code itself sucks so badly that it's not even funny. NAK NAK NAK. All of it. It stinks. Linus From andrea at qumranet.com Wed May 7 14:26:50 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 7 May 2008 23:26:50 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> Message-ID: <20080507212650.GA8276@duo.random> On Wed, May 07, 2008 at 01:56:23PM -0700, Linus Torvalds wrote: > This also looks very debatable indeed. The only performance numbers quoted > are: > > > This results in f.e. the Aim9 brk performance test to got down by 10-15%. > > which just seems like a total disaster. > > The whole series looks bad, in fact. Lack of authorship, bad single-line Glad you agree. Note that the fact that the whole series looks bad is _exactly_ why I couldn't let Christoph keep mmu-notifier-core at the very end of his patchset. I had to move it to the top to have a chance to get the KVM and GRU requirements merged in 2.6.26. I think the spinlock->rwsem conversion is ok under a config option; as you can see, I complained about various of those patches myself, and I'll take care that they're in a mergeable state the moment I submit them. What XPMEM requires are different semantics for the methods, and we never had to do any blocking I/O during vmtruncate before; now we have to. And I don't see a problem in making the conversion from spinlock->rwsem only if CONFIG_XPMEM=y, as I doubt XPMEM works on anything but ia64. Please ignore all patches but mmu-notifier-core. I regularly forward _only_ mmu-notifier-core to Andrew; that's the only one that is in merge-ready status. Everything else is there just so XPMEM can test, and so we can keep discussing it to bring it to a mergeable state like mmu-notifier-core already is.
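To illustrate what I mean by doing the conversion under a config option, something along these lines (my sketch here, not code from the series):

        #ifdef CONFIG_XPMEM
        /* XPMEM needs to sleep during reverse map traversal: use an rwsem */
        #define anon_vma_lock(av)       down_write(&(av)->sem)
        #define anon_vma_unlock(av)     up_write(&(av)->sem)
        #else
        /* everybody else keeps the cheap non-sleeping spinlock */
        #define anon_vma_lock(av)       spin_lock(&(av)->lock)
        #define anon_vma_unlock(av)     spin_unlock(&(av)->lock)
        #endif

That way only kernels built for XPMEM-class hardware would pay for the sleepable lock.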
From torvalds at linux-foundation.org Wed May 7 14:36:57 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 14:36:57 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507212650.GA8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> Message-ID: On Wed, 7 May 2008, Andrea Arcangeli wrote: > > I think the spinlock->rwsem conversion is ok under config option, as > you can see I complained myself to various of those patches and I'll > take care they're in a mergeable state the moment I submit them. What > XPMEM requires are different semantics for the methods, and we never > had to do any blocking I/O during vmtruncate before, now we have to. I really suspect we don't really have to, and that it would be better to just fix the code that does that. > Please ignore all patches but mmu-notifier-core. I regularly forward > _only_ mmu-notifier-core to Andrew, that's the only one that is in > merge-ready status, everything else is just so XPMEM can test and we > can keep discussing it to bring it in a mergeable state like > mmu-notifier-core already is. The thing is, I didn't like that one *either*. I thought it was the biggest turd in the series (and by "biggest", I literally mean "most lines of turd-ness" rather than necessarily "ugliest per se"). I literally think that mm_lock() is an unbelievable piece of utter and horrible CRAP. There's simply no excuse for code like that. If you want to avoid the deadlock from taking multiple locks in order, but there is really just a single operation that needs it, there's a really really simple solution. And that solution is *not* to sort the whole damn f*cking list in a vmalloc'ed data structure prior to locking! Damn. No, the simple solution is to just make up a whole new upper-level lock, and get that lock *first*. You can then take all the multiple locks at a lower level in any order you damn well please. And yes, it's one more lock, and yes, it serializes stuff, but: - that code had better not be critical anyway, because if it was, then the whole "vmalloc+sort+lock+vunmap" sh*t was wrong _anyway_ - parallelism is overrated: it doesn't matter one effing _whit_ if something is a hundred times more parallel, if it's also a hundred times *SLOWER*. So dang it, flush the whole damn series down the toilet and either forget the thing entirely, or re-do it sanely. And here's an admission that I lied: it wasn't *all* clearly crap. I did like one part, namely list_del_init_rcu(), but that one should have been in a separate patch. I'll happily apply that one. Linus From andrea at qumranet.com Wed May 7 14:58:40 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Wed, 7 May 2008 23:58:40 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: References: <20080507130528.adfd154c.akpm@linux-foundation.org> Message-ID: <20080507215840.GB8276@duo.random> On Wed, May 07, 2008 at 01:30:39PM -0700, Linus Torvalds wrote: > > > On Wed, 7 May 2008, Andrew Morton wrote: > > > > The patch looks OK to me. > > As far as I can tell, authorship has been destroyed by at least two of the > patches (ie Christoph seems to be the author, but Andrea seems to have > dropped that fact). I can't follow this, please be more specific. About the patches I merged from Christoph, I didn't touch them at all (except for fixing a kernel crashing bug in them plus some reject fix). 
Initially I didn't even add a signed-off-by: andrea, and I only had the signed-off-by: christoph. But then he said I had to add my signed-off-by too, while I thought at most an acked-by was required. So if I got any attribution on Christoph's work it's only because he explicitly requested it as it was passing through my maintenance line. In any case, all patches except mmu-notifier-core are irrelevant in this context, and I'm entirely fine with giving Christoph attribution for the whole patchset, including mmu-notifier-core, where most of the code is mine. We had many discussions with Christoph, Robin and Jack, but I can assure you nobody had a single problem with regard to attribution. About all patches except mmu-notifier-core: Christoph, Robin and everyone (especially myself) agree those patches can't yet be merged in 2.6.26. With regard to the post-2.6.26 material, I think adding a config option to make the change at compile time is ok. And there's no other way to deal with it cleanly, as vmtruncate has to tear down pagetables, and if the i_mmap_lock is a spinlock there's no way to notify secondary mmus about it, if the ->invalidate_range_start method has to allocate an skb, send it through the network and wait for I/O completion with schedule(). > Yeah, too late and no upside. True, there's no upside for all the people setting CONFIG_KVM=n, but there's no downside for them either; that's the important fact! And for all the people setting CONFIG_KVM!=n, I should provide some background here. KVM MM development is halted without this; that includes: paging, ballooning, tlb flushing at large, pci-passthrough removing the page pin as a whole, etc... Everyone on kvm-devel talks about mmu-notifiers; check the last VT-d patch from Intel where Anthony (IBM/qemu/kvm) wonders how to handle things without mmu notifiers (mlock whatever). Rusty agreed we had to get mmu notifiers into 2.6.26 so much that he went as far as writing his own ultrasimple mmu notifier implementation, unfortunately too simple, as invalidate_range_start was missing, and without it we can't remove the page pinning and avoid doing spte=invalid;tlbflush;unpin for every group of sptes released. And without mm_lock, invalidate_range_start can't be implemented in a generic way (to work for GRU/XPMEM too). > That "locking" code is also too ugly to live, at least without some > serious arguments for why it has to be done that way. Sorting the locks? > In a vmalloc'ed area? And calling this something innocuous like > "mm_lock()"? Hell no. That's only invoked in mmu_notifier_register; mm_lock is explicitly documented as a heavyweight function. In the KVM case it's only called when a VM is created, which is irrelevant cpu cost compared to the time it takes the OS to boot in the VM... (especially without real mode emulation, with direct NPT-like secondary-mmu paging). mm_lock solved the fundamental race in the range_start/end invalidation model (which will allow GRU to do a single tlb flush for the whole range that is going to be freed by zap_page_range/unmap_vmas/whatever). Christoph merged mm_lock into his EMM versions of mmu notifiers moments after I released it; I think he wouldn't have done that if there were a better way. > That code needs some serious re-thinking.
Even if you're totally right, with Nick's mmu notifiers, Rusty's mmu notifiers, my original mmu notifiers, Christoph's first version of my mmu notifiers, my new mmu notifiers, Christoph's EMM version of my new mmu notifiers, my latest mmu notifiers, all the people making suggestions and testing the code and needing the code badly, and further patches awaiting inclusion during 2.6.27 in this area, it must be obvious to everyone that there's zero chance this code won't evolve over time to perfection; but we can't wait for it to be perfect before we start using it, or we're screwed. Even if it's entirely broken, this will allow kvm development to continue, and then we'll fix it (but don't worry, it works great at runtime and there are no race conditions; Jack and Robin are also using it with zero problems with GRU and XPMEM, just in case the KVM testing going great isn't enough). Furthermore, the API has been frozen for months, and everyone agrees with all the fundamental blocks in the mmu-notifier-core patch (to be complete, Christoph would like to replace invalidate_page with an invalidate_range_start/end, but that's a minor detail). And most importantly, we need something in now, regardless of which API. We can handle a change of API totally fine later. mm_lock() is not even part of the mmu notifier API, it's just an internal implementation detail, so whatever problem it has, or whatever better name we can find, isn't a high priority right now. If you suggest a better name now I'll fix it up immediately. I hope the mm_lock name and whatever signed-off-by error in the patches after mmu-notifier-core won't really be why this doesn't go in. Thanks a lot for your time reviewing, even if the outcome wasn't as positive as I hoped, Andrea From torvalds at linux-foundation.org Wed May 7 15:11:10 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 15:11:10 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507215840.GB8276@duo.random> References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> Message-ID: On Wed, 7 May 2008, Andrea Arcangeli wrote: > > As far as I can tell, authorship has been destroyed by at least two of the > > patches (ie Christoph seems to be the author, but Andrea seems to have > > dropped that fact). > > I can't follow this, please be more specific. The patches were sent to lkml without *any* indication that you weren't actually the author. So if Andrew had merged them, they would have been merged as yours. > > That "locking" code is also too ugly to live, at least without some > > serious arguments for why it has to be done that way. Sorting the locks? > > In a vmalloc'ed area? And calling this something innocuous like > > "mm_lock()"? Hell no. > > That's only invoked in mmu_notifier_register; mm_lock is explicitly > documented as a heavyweight function. Is that an excuse for UTTER AND TOTAL CRAP? Linus From rdreier at cisco.com Wed May 7 15:14:42 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 15:14:42 -0700 Subject: [ofa-general] [PATCH/RFC] RDMA/nes: Fix up nes_lro_max_aggr module parameter Message-ID: Fix some bugs with the max_aggr module parameter added with LRO support: - The module parameter value was ignored and not actually used to set lro_mgr.max_aggr. - MODULE_PARM_DESC had a typo "_mro_" instead of "_lro_" so it didn't end up describing the actual module parameter.
- The nes_lro_max_aggr variable was declared as unsigned, but the module_param line said "int" instead of "uint" for the type. - The default value for the parameter was stuck in the permissions field of module_param, which led to nonsensical permissions for the file under /sys/module/iw_nes/param. - The parameter was used in only one file but defined in another, which led to the variable being global for no good reason. Move everything related to the parameter to the file nes_hw.c where it is actually used. Signed-off-by: Roland Dreier --- drivers/infiniband/hw/nes/nes.c | 4 ---- drivers/infiniband/hw/nes/nes.h | 1 - drivers/infiniband/hw/nes/nes_hw.c | 6 +++++- 3 files changed, 5 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c index 9f7364a..a4e9269 100644 --- a/drivers/infiniband/hw/nes/nes.c +++ b/drivers/infiniband/hw/nes/nes.c @@ -91,10 +91,6 @@ unsigned int nes_debug_level = 0; module_param_named(debug_level, nes_debug_level, uint, 0644); MODULE_PARM_DESC(debug_level, "Enable debug output level"); -unsigned int nes_lro_max_aggr = NES_LRO_MAX_AGGR; -module_param(nes_lro_max_aggr, int, NES_LRO_MAX_AGGR); -MODULE_PARM_DESC(nes_mro_max_aggr, " nic LRO MAX packet aggregation"); - LIST_HEAD(nes_adapter_list); static LIST_HEAD(nes_dev_list); diff --git a/drivers/infiniband/hw/nes/nes.h b/drivers/infiniband/hw/nes/nes.h index 1f9f7bf..61b46e9 100644 --- a/drivers/infiniband/hw/nes/nes.h +++ b/drivers/infiniband/hw/nes/nes.h @@ -173,7 +173,6 @@ extern int disable_mpa_crc; extern unsigned int send_first; extern unsigned int nes_drv_opt; extern unsigned int nes_debug_level; -extern unsigned int nes_lro_max_aggr; extern struct list_head nes_adapter_list; diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index 8dc70f9..d3278f1 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -42,6 +42,10 @@ #include "nes.h" +static unsigned int nes_lro_max_aggr = NES_LRO_MAX_AGGR; +module_param(nes_lro_max_aggr, uint, 0444); +MODULE_PARM_DESC(nes_lro_max_aggr, "NIC LRO max packet aggregation"); + static u32 crit_err_count; u32 int_mod_timer_init; u32 int_mod_cq_depth_256; @@ -1738,7 +1742,7 @@ int nes_init_nic_qp(struct nes_device *nesdev, struct net_device *netdev) jumbomode = 1; nes_nic_init_timer_defaults(nesdev, jumbomode); } - nesvnic->lro_mgr.max_aggr = NES_LRO_MAX_AGGR; + nesvnic->lro_mgr.max_aggr = nes_lro_max_aggr; nesvnic->lro_mgr.max_desc = NES_MAX_LRO_DESCRIPTORS; nesvnic->lro_mgr.lro_arr = nesvnic->lro_desc; nesvnic->lro_mgr.get_skb_header = nes_lro_get_skb_hdr; -- 1.5.5.1 From andrea at qumranet.com Wed May 7 15:22:05 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 00:22:05 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> Message-ID: <20080507222205.GC8276@duo.random> On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote: > > had to do any blocking I/O during vmtruncate before, now we have to. > > I really suspect we don't really have to, and that it would be better to > just fix the code that does that. I'll let you discuss with Christoph and Robin about it. The moment I heard the schedule inside ->invalidate_page() requirement I reacted the same way you did. 
But I don't see any other real solution for XPMEM than spin-looping, halting the scheduler for ages while the ack is received from the network device. But mm_lock is required even without XPMEM. And srcu is also required without XPMEM, to allow ->release to schedule (however, downgrading srcu to rcu would result in a very small patch; srcu and rcu are about the same with a kernel supporting preempt=y like 2.6.26). > I literally think that mm_lock() is an unbelievable piece of utter and > horrible CRAP. > > There's simply no excuse for code like that. I think it's a great smp scalability optimization over the global lock you're proposing below. > No, the simple solution is to just make up a whole new upper-level lock, > and get that lock *first*. You can then take all the multiple locks at a > lower level in any order you damn well please. Unfortunately the lock you're talking about would be: static spinlock_t global_lock = ... There's no way to make it more granular. So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock, we'd need to take that extremely wide spinlock first (and even worse, later it would become a rwsem when XPMEM is selected, making the VM even slower than it already becomes when XPMEM support is selected at compile time). > And yes, it's one more lock, and yes, it serializes stuff, but: > > - that code had better not be critical anyway, because if it was, then > the whole "vmalloc+sort+lock+vunmap" sh*t was wrong _anyway_ mmu_notifier_register can take ages. No problem. > - parallelism is overrated: it doesn't matter one effing _whit_ if > something is a hundred times more parallel, if it's also a hundred > times *SLOWER*. mmu_notifier_register is fine being a hundred times slower (under preempt-rt those locks all become sleeping locks anyway, so no problem). > And here's an admission that I lied: it wasn't *all* clearly crap. I did > like one part, namely list_del_init_rcu(), but that one should have been > in a separate patch. I'll happily apply that one. Sure, I'll split it from the rest if mmu-notifier-core isn't merged. My objective has been: 1) add zero overhead to the VM before anybody starts a VM with kvm, and still zero overhead for all other tasks except the task where the VM runs. The only exception is the unlikely(!mm->mmu_notifier_mm) check, and that is optimized away too when CONFIG_KVM=n. And even for that check, my invalidate_page reduces the number of branches to the absolute minimum possible. 2) avoid any new cacheline collision in the fast paths, so numa systems don't nearly-crash (mm->mmu_notifier_mm will be shared and never written, except during the first mmu_notifier_register) 3) avoid any risk of introducing regressions in 2.6.26 (the patch must be obviously safe). Even if mm_lock were a bad idea like you say, it's orders of magnitude safer, even if entirely broken, than messing with the VM core locking in 2.6.26. mm_lock (or whatever name you like to give it; I admit mm_lock may not sound worrisome enough to keep people from calling it in a fast path) is going to be the real deal for the long term, to allow mmu_notifier_register to serialize against invalidate_range_start/end.
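For concreteness, the whole scheme boils down to something like this (a heavily simplified sketch of the idea; the real code also collects the i_mmap_locks, handles the failure paths, and unlocks in mm_unlock):

        static int cmp_lock_ptr(const void *a, const void *b)
        {
                /* total order on lock addresses, so everyone locks in the same order */
                const spinlock_t *l = *(spinlock_t * const *)a;
                const spinlock_t *r = *(spinlock_t * const *)b;

                if (l < r)
                        return -1;
                if (l > r)
                        return 1;
                return 0;
        }

        static int mm_lock_sketch(struct mm_struct *mm)
        {
                struct vm_area_struct *vma;
                spinlock_t **locks;
                int i, nr = 0;

                locks = vmalloc(mm->map_count * sizeof(*locks));
                if (!locks)
                        return -ENOMEM;
                for (vma = mm->mmap; vma; vma = vma->vm_next)
                        if (vma->anon_vma)
                                locks[nr++] = &vma->anon_vma->lock;
                sort(locks, nr, sizeof(*locks), cmp_lock_ptr, NULL);
                for (i = 0; i < nr; i++)
                        if (!i || locks[i] != locks[i - 1])     /* skip duplicates after sorting */
                                spin_lock(locks[i]);
                /* ... register the notifier, then unlock in reverse order and vfree ... */
                return 0;
        }

That's the "vmalloc+sort+lock+vunmap" being objected to: expensive, but confined to registration time.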
If I fail in 2.6.26 I'll offer maintainership to Christoph as promised, and you'll find him pushing for mm_lock to be merged (as XPMEM/GRU aren't technologies running on cellphones where your global wide spinlock is optimized away at compile time, and he also has to deal with XPMEM, where such a spinlock would need to become a rwsem as the anon_vma->sem has to be taken after it). But let's assume you're entirely right here that mm_lock is going to be dropped and there's a better way: it's still a fine solution for 2.6.26. And if you prefer, I can move the whole mm_lock() from mmap.c/mm.h to mmu_notifier.[ch] so you don't get any pollution in the core VM; mm_lock will then be invisible to everything except callers of mmu_notifier_register(), and it will be trivial to remove later if you really want to add a global spinlock, as there's no way to be more granular than a _global_ numa-wide spinlock taken before any i_mmap_lock/anon_vma->lock without my mm_lock. From rdreier at cisco.com Wed May 7 15:22:30 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 15:22:30 -0700 Subject: [ofa-general] [2.6.27 PATCH/RFC] IB/srp: Remove use of cached P_Key/GID queries Message-ID: The SRP initiator is currently using ib_find_cached_pkey() and ib_get_cached_gid() in situations where the uncached ib_find_pkey() and ib_query_gid() functions serve just as well: sleeping is allowed and performance is not an issue. Since we want to eliminate the cached operations in the long term, convert SRP to use the uncached variants. Signed-off-by: Roland Dreier --- Anyone have concerns about queueing this for the next merge window? drivers/infiniband/ulp/srp/ib_srp.c | 13 +++++-------- 1 files changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 4351457..81cc59c 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -49,8 +49,6 @@ #include #include -#include - #include "ib_srp.h" #define DRV_NAME "ib_srp" @@ -183,10 +181,10 @@ static int srp_init_qp(struct srp_target_port *target, if (!attr) return -ENOMEM; - ret = ib_find_cached_pkey(target->srp_host->srp_dev->dev, - target->srp_host->port, - be16_to_cpu(target->path.pkey), - &attr->pkey_index); + ret = ib_find_pkey(target->srp_host->srp_dev->dev, + target->srp_host->port, + be16_to_cpu(target->path.pkey), + &attr->pkey_index); if (ret) goto out; @@ -1883,8 +1881,7 @@ static ssize_t srp_create_target(struct device *dev, if (ret) goto err; - ib_get_cached_gid(host->srp_dev->dev, host->port, 0, - &target->path.sgid); + ib_query_gid(host->srp_dev->dev, host->port, 0, &target->path.sgid); shost_printk(KERN_DEBUG, target->scsi_host, PFX "new target: id_ext %016llx ioc_guid %016llx pkey %04x " -- 1.5.5.1 From rdreier at cisco.com Wed May 7 15:26:46 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 15:26:46 -0700 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <4820A427.1070405@opengridcomputing.com> (Steve Wise's message of "Tue, 06 May 2008 13:32:07 -0500") References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> Message-ID: > 3) if using SEND, then a recv completion is always generated. I'm just trying to define the scope of the issue here... so is there any conceivable real-life situation where neither a 0B read nor a 0B write would work, and the connection setup will have to use a 0B send? - R.
From andrea at qumranet.com Wed May 7 15:27:58 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 00:27:58 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> Message-ID: <20080507222758.GD8276@duo.random> On Wed, May 07, 2008 at 03:11:10PM -0700, Linus Torvalds wrote: > > > On Wed, 7 May 2008, Andrea Arcangeli wrote: > > > > As far as I can tell, authorship has been destroyed by at least two of the > > > patches (ie Christoph seems to be the author, but Andrea seems to have > > > dropped that fact). > > > > I can't follow this, please be more specific. > > The patches were sent to lkml without *any* indication that you weren't > actually the author. > > So if Andrew had merged them, they would have been merged as yours. I rechecked and I guarantee that the patches where Christoph isn't listed are developed by myself and he didn't write a single line on them. In any case I expect Christoph to review (he's CCed) and to point me to any attribution error. The only mistake I made once in that area was to give too _little_ attribution to myself: he asked me to add myself in the signed-off, so I added myself at Christoph's own request, but be sure I didn't remove him! From swise at opengridcomputing.com Wed May 7 15:29:44 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 May 2008 17:29:44 -0500 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> Message-ID: <48222D58.9020205@opengridcomputing.com> Roland Dreier wrote: > > 3) if using SEND, then a recv completion is always generated. > > I'm just trying to define the scope of the issue here... so is there any > conceivable real-life situation where neither a 0B read nor a 0B write > would work, and the connection setup will have to use a 0B send? > i'm not sure what you mean by "real-life". For the rnics we have: nes - requires 0b write cxgb3 - requires 0b read amso1100 - won't work in p2p mode So there are none that I know of that require a send for this. From rdreier at cisco.com Wed May 7 15:31:08 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 15:31:08 -0700 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507222758.GD8276@duo.random> (Andrea Arcangeli's message of "Thu, 8 May 2008 00:27:58 +0200") References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> <20080507222758.GD8276@duo.random> Message-ID: > I rechecked and I guarantee that the patches where Christoph isn't > listed are developed by myself and he didn't write a single line on > them. In any case I expect Christoph to review (he's CCed) and to > point me to any attribution error. The only mistake I made once in that > area was to give too _little_ attribution to myself: he asked me to > add myself in the signed-off, so I added myself at Christoph's own > request, but be sure I didn't remove him! I think the point you're missing is that any patches written by Christoph need a line like From: Christoph Lameter at the top of the body so that Christoph becomes the author when it is committed into git. The Signed-off-by: line needs to be preserved too of course, but it is not sufficient by itself. - R.
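In other words, a forwarded patch whose author is Christoph should start out something like this (illustrative layout, following SubmittingPatches):

        From: Christoph Lameter <clameter at sgi.com>

        [PATCH] subsystem: one-line summary phrase

        <changelog body explaining the change>

        Signed-off-by: Christoph Lameter <clameter at sgi.com>
        Signed-off-by: Andrea Arcangeli <andrea at qumranet.com>

The explicit From: line in the body is what git uses for authorship; the sign-off chain only records who handled the patch.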
From akpm at linux-foundation.org Wed May 7 15:31:03 2008 From: akpm at linux-foundation.org (Andrew Morton) Date: Wed, 7 May 2008 15:31:03 -0700 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507222205.GC8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> Message-ID: <20080507153103.237ea5b6.akpm@linux-foundation.org> On Thu, 8 May 2008 00:22:05 +0200 Andrea Arcangeli wrote: > > No, the simple solution is to just make up a whole new upper-level lock, > > and get that lock *first*. You can then take all the multiple locks at a > > lower level in any order you damn well please. > > Unfortunately the lock you're talking about would be: > > static spinlock_t global_lock = ... > > There's no way to make it more granular. > > So every time before taking any ->i_mmap_lock _and_ any anon_vma->lock > we'd need to take that extremely wide spinlock first (and even worse, > later it would become a rwsem when XPMEM is selected making the VM > even slower than it already becomes when XPMEM support is selected at > compile time). Nope. We only need to take the global lock before taking *two or more* of the per-vma locks. I really wish I'd thought of that. From rdreier at cisco.com Wed May 7 15:33:53 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 May 2008 15:33:53 -0700 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <48222D58.9020205@opengridcomputing.com> (Steve Wise's message of "Wed, 07 May 2008 17:29:44 -0500") References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> Message-ID: > > I'm just trying to define the scope of the issue here... so is there any > > conceivable real-life situation where neither a 0B read nor a 0B write > > would work, and the connection setup will have to use a 0B send? > i'm not sure what you mean by "real-life". For the rnics we have: > > nes - requires 0b write > cxgb3 - requires 0b read > amso1100 - won't work in p2p mode > > So there are none that I know of that require a send for this. I guess my question was whether we expect to ever need to worry about the 0B send case, or whether it's just theoretical. If no current NICs have a problem with read or write, and future NICs will be built to a future MPA spec, then it seems we don't have to worry about what happens if a 0B send is done as part of connection setup. The spurious CQE on connection failure and the private data breakage are serious obviously. The interoperability issues of this stuff seem pretty painful to me. From andrea at qumranet.com Wed May 7 15:37:38 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 00:37:38 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507222758.GD8276@duo.random> References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> <20080507222758.GD8276@duo.random> Message-ID: <20080507223738.GF8276@duo.random> On Thu, May 08, 2008 at 12:27:58AM +0200, Andrea Arcangeli wrote: > I rechecked and I guarantee that the patches where Christoph isn't > listed are developed by myself and he didn't write a single line on > them. In any case I expect Christoph to review (he's CCed) and to > point me to any attribution error. 
The only mistake I made once in that > area was to give too _little_ attribution to myself: he asked me to > add myself in the signed-off, so I added myself at Christoph's own > request, but be sure I didn't remove him! By PM (guess he's scared to post to this thread ;) Christoph tells me that what you perhaps mean is that I should add a From: Christoph line in the body of the email when the first signed-off-by is from Christoph, to indicate the first signoff was by him and the patch in turn was started by him. I thought the order of the signoffs was enough; but if that From: was mandatory and missing, any error obviously wasn't intentional, especially given that I only left a signed-off-by: christoph on his patches until he asked me to add my signoff too. Correcting it is trivial, given that I carefully ordered the signoffs so that the author is at the top of the signoff list. At least for mmu-notifier-core, given that I obviously am the original author of that code, I hope the From: of the email itself was enough even if an additional From: andrea was missing in the body. Also you can be sure that Christoph and especially Robin (XPMEM) will be more than happy if all patches with Christoph at the top of the signed-off-by are merged in 2.6.26, even though there wasn't a From: christoph at the top of the body ;). So I don't see a big deal here... From andrea at qumranet.com Wed May 7 15:39:14 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 00:39:14 +0200 Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> <20080507222758.GD8276@duo.random> Message-ID: <20080507223914.GG8276@duo.random> On Wed, May 07, 2008 at 03:31:08PM -0700, Roland Dreier wrote: > I think the point you're missing is that any patches written by > Christoph need a line like > > From: Christoph Lameter > > at the top of the body so that Christoph becomes the author when it is > committed into git. The Signed-off-by: line needs to be preserved too > of course, but it is not sufficient by itself. Ok so I see the problem Linus is referring to now (I received the hint by PM too); I thought the order of the signed-off-by was relevant, but it clearly isn't, or we're wasting space ;) From swise at opengridcomputing.com Wed May 7 15:41:01 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 May 2008 17:41:01 -0500 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> Message-ID: <48222FFD.40302@opengridcomputing.com> Roland Dreier wrote: > > > I'm just trying to define the scope of the issue here... so is there any > > > conceivable real-life situation where neither a 0B read nor a 0B write > > > would work, and the connection setup will have to use a 0B send? > > > i'm not sure what you mean by "real-life". For the rnics we have: > > > > nes - requires 0b write > > cxgb3 - requires 0b read > > amso1100 - won't work in p2p mode > > > > So there are none that I know of that require a send for this. > > I guess my question was whether we expect to ever need to worry about > the 0B send case, or whether it's just theoretical. If no current NICs > have a problem with read or write, and future NICs will be built to a > future MPA spec, then it seems we don't have to worry about what happens > if a 0B send is done as part of connection setup. > > I agree.
We can dump the 0B send stuff. > The spurious CQE on connection failure and the private data breakage are > serious obviously. The interoperability issues of this stuff seem > pretty painful to me. It is painful. But without anything, you cannot run OMPI, IMPI or HPMPI on an iwarp cluster with mixed vendor rnics... Steve. From steiner at sgi.com Wed May 7 15:42:33 2008 From: steiner at sgi.com (Jack Steiner) Date: Wed, 7 May 2008 17:42:33 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507212650.GA8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> Message-ID: <20080507224232.GA24600@sgi.com> > And I don't see a problem in making the conversion from > spinlock->rwsem only if CONFIG_XPMEM=y as I doubt XPMEM works on > anything but ia64. That is currently true, but we are also working on XPMEM for x86_64. The new XPMEM code should be posted within a few weeks. --- jack From andrea at qumranet.com Wed May 7 15:44:06 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 00:44:06 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507153103.237ea5b6.akpm@linux-foundation.org> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> Message-ID: <20080507224406.GI8276@duo.random> On Wed, May 07, 2008 at 03:31:03PM -0700, Andrew Morton wrote: > Nope. We only need to take the global lock before taking *two or more* of > the per-vma locks. > > I really wish I'd thought of that. I don't see how you can avoid taking the system-wide-global lock before every single anon_vma->lock/i_mmap_lock out there without mm_lock. Please note, we can't allow a thread to be in the middle of zap_page_range while mmu_notifier_register runs. vmtruncate takes one single lock, the i_mmap_lock of the inode. Not more than one lock, and we'd still have to take the global-system-wide lock _before_ this single i_mmap_lock and no other lock at all. Please elaborate, thanks! From torvalds at linux-foundation.org Wed May 7 15:44:24 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 15:44:24 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507222205.GC8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > Unfortunately the lock you're talking about would be: > > static spinlock_t global_lock = ... > > There's no way to make it more granular. Right. So what? It's still about a million times faster than what the code does now. Your comment about "great smp scalability optimization" just shows that you're a moron. It is no such thing. The fact is, it's a horrible pessimization, since even SMP will be *SLOWER*. It will just be "less slower" when you have a million CPU's and they all try to do this at the same time (which probably never ever happens). In other words, "scalability" is totally meaningless. The only thing that matters is *performance*. If the "scalable" version performs WORSE, then it is simply worse. Not better. End of story. > mmu_notifier_register can take ages. No problem. So what you're saying is that performance doesn't matter? So why do you do the ugly crazy hundred-line implementation, when a simple two-liner would do equally well? Your arguments are crap. Anyway, discussion over. This code doesn't get merged. It doesn't get merged before 2.6.26, and it doesn't get merged _after_ either. Rewrite the code, or not. I don't care. I'll very happily not merge crap
Your arguments are crap. Anyway, discussion over. This code doesn't get merged. It doesn't get merged before 2.6.26, and it doesn't get merged _after_ either. Rewrite the code, or not. I don't care. I'll very happily not merge crap for the rest of my life. Linus From andrea at qumranet.com Wed May 7 15:58:01 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 00:58:01 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> Message-ID: <20080507225801.GK8276@duo.random> On Wed, May 07, 2008 at 03:44:24PM -0700, Linus Torvalds wrote: > > > On Thu, 8 May 2008, Andrea Arcangeli wrote: > > > > Unfortunately the lock you're talking about would be: > > > > static spinlock_t global_lock = ... > > > > There's no way to make it more granular. > > Right. So what? > > It's still about a million times faster than what the code does now. mmu_notifier_register only runs when windows or linux or macosx boots. Who could ever care of the msec spent in mm_lock compared to the time it takes to linux to boot? What you're proposing is to slowdown AIM and certain benchmarks 20% or more for all users, just so you save at most 1msec to start a VM. > Rewrite the code, or not. I don't care. I'll very happily not merge crap > for the rest of my life. If you want the global lock I'll do it no problem, I just think it's obviously inferior solution for 99% of users out there (including kvm users that will also have to take that lock while kvm userland runs). In my view the most we should do in this area is to reduce further the max number of locks to take if max_map_count already isn't enough. From akpm at linux-foundation.org Wed May 7 15:59:14 2008 From: akpm at linux-foundation.org (Andrew Morton) Date: Wed, 7 May 2008 15:59:14 -0700 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507224406.GI8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> Message-ID: <20080507155914.d7790069.akpm@linux-foundation.org> On Thu, 8 May 2008 00:44:06 +0200 Andrea Arcangeli wrote: > On Wed, May 07, 2008 at 03:31:03PM -0700, Andrew Morton wrote: > > Nope. We only need to take the global lock before taking *two or more* of > > the per-vma locks. > > > > I really wish I'd thought of that. > > I don't see how you can avoid taking the system-wide-global lock > before every single anon_vma->lock/i_mmap_lock out there without > mm_lock. > > Please note, we can't allow a thread to be in the middle of > zap_page_range while mmu_notifier_register runs. > > vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more > than one lock and we've to still take the global-system-wide lock > _before_ this single i_mmap_lock and no other lock at all. > > Please elaborate, thanks! umm... CPU0: CPU1: spin_lock(a->lock); spin_lock(b->lock); spin_lock(b->lock); spin_lock(a->lock); bad. CPU0: CPU1: spin_lock(global_lock) spin_lock(global_lock); spin_lock(a->lock); spin_lock(b->lock); spin_lock(b->lock); spin_lock(a->lock); Is OK. CPU0: CPU1: spin_lock(global_lock) spin_lock(a->lock); spin_lock(b->lock); spin_lock(b->lock); spin_unlock(b->lock); spin_lock(a->lock); spin_unlock(a->lock); also OK. 
As long as all code paths which can take two or more locks are covered by the global lock, there is no deadlock scenario. If a thread takes just a single instance of one of these locks without taking the global_lock, then there is also no deadlock. Now, if we need to take both anon_vma->lock AND i_mmap_lock in the newly added mm_lock() thing and we also take both those locks at the same time in regular code, we're probably screwed. From torvalds at linux-foundation.org Wed May 7 16:00:13 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 16:00:13 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507222758.GD8276@duo.random> References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > I rechecked and I guarantee that the patches where Christoph isn't > listed are developed by myself. How long have you been doing kernel development? How about you read SubmittingPatches a few times before you show just how clueless you are? Hint: look for the string that says "From:". Also look at the section that talks about "summary phrase". You got it all wrong, and you don't even seem to realize that you got it wrong, even when I told you. Linus From andrea at qumranet.com Wed May 7 16:02:42 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 01:02:42 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507225801.GK8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507225801.GK8276@duo.random> Message-ID: <20080507230242.GL8276@duo.random> To remove mm_lock without adding a horrible system-wide lock before every i_mmap_lock etc., we have to remove invalidate_range_begin/end. Then we can return to an older approach of doing only invalidate_page and serializing it with the PT lock against get_user_pages. That works fine for KVM, but GRU will have to flush the tlb every time we drop the PT lock, that means once per 512 ptes on x86-64 etc..., instead of a single time for the whole range regardless of how large the range is. From torvalds at linux-foundation.org Wed May 7 16:03:00 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 16:03:00 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507223914.GG8276@duo.random> References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> <20080507222758.GD8276@duo.random> <20080507223914.GG8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > Ok so I see the problem Linus is referring to now (I received the hint > by PM too); I thought the order of the signed-off-by was relevant, but it > clearly isn't, or we're wasting space ;) The order of the signed-offs is somewhat relevant, but no, sign-offs don't mean authorship. See the rules for sign-off: you can sign off on another person's patches, even if they didn't sign off on them themselves. That's clause (b) in particular. So yes, quite often you'd _expect_ the first sign-off to match the author, but that's a correlation, not a causal relationship.
Linus From torvalds at linux-foundation.org Wed May 7 16:09:48 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 16:09:48 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507225801.GK8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507225801.GK8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > mmu_notifier_register only runs when windows or linux or macosx > boots. Who could ever care about the msec spent in mm_lock compared to > the time it takes linux to boot? Andrea, you're *this* close to going to my list of people who it is not worth reading email from, and where it's better for everybody involved if I just teach my spam-filter about it. That code was CRAP. That code was crap whether it's used once, or whether it's used a million times. Stop making excuses for it just because it's not performance-critical. So give it up already. I told you what the non-crap solution was. It's simpler, faster, and is about two lines of code compared to the crappy version (which was what - 200 lines of crap with a big comment on top of it just to explain the idiocy?). So until you can understand the better solution, don't even bother emailing me, ok? Because the next email I get from you that shows the intelligence level of a gnat, I'll just give up and put you in a spam-filter. Because my IQ goes down just from reading your mails. I can't afford to continue. Linus From torvalds at linux-foundation.org Wed May 7 16:19:05 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 16:19:05 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507155914.d7790069.akpm@linux-foundation.org> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Andrew Morton wrote: > > Now, if we need to take both anon_vma->lock AND i_mmap_lock in the newly > added mm_lock() thing and we also take both those locks at the same time in > regular code, we're probably screwed. No, just use the normal static ordering for that case: one type of lock goes before the other kind. If those locks nest in regular code, you have to do that *anyway*. The code that can take many locks will have to get the global lock *and* order the types, but that's still trivial. It's something like

	spin_lock(&global_lock);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (vma->anon_vma)
			spin_lock(&vma->anon_vma->lock);
	}
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (!vma->anon_vma && vma->vm_file && vma->vm_file->f_mapping)
			spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
	}
	spin_unlock(&global_lock);

and now everybody follows the rule that "anon_vma->lock" precedes "i_mmap_lock". So there can be no ABBA deadlock between the normal users and the many-locks version, and there can be no ABBA deadlock between many-locks-takers because they use the global_lock to serialize. This really isn't rocket science, guys.
(I really hope and believe that they don't nest anyway, and that you can just use a single for-loop for the many-lock case) Linus From sean.hefty at intel.com Wed May 7 16:25:06 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 7 May 2008 16:25:06 -0700 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <48222FFD.40302@opengridcomputing.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> <48222FFD.40302@opengridcomputing.com> Message-ID: <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com>

>> > nes - requires 0b write
>> > cxgb3 - requires 0b read
>> > amso1100 - won't work in p2p mode

I'm assuming by requires that you, uhm, mean requires, and nes couldn't do 0b reads, or cxgb3 0b writes.

>Its is painful. But without anything, you cannot run OMPI, IMPI or
>HPMPI on a iwarp cluster with mixed vendor rnics...

Is there any requirement at the receiving side, versus the initiating side? That is, just because nes issues a 0b write, does the receiving HW care if a read or write shows up? Or is this restriction on both sides? - Sean From benh at kernel.crashing.org Wed May 7 16:28:38 2008 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 08 May 2008 09:28:38 +1000 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507224406.GI8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> Message-ID: <1210202918.1421.20.camel@pasglop> On Thu, 2008-05-08 at 00:44 +0200, Andrea Arcangeli wrote: > > Please note, we can't allow a thread to be in the middle of > zap_page_range while mmu_notifier_register runs. You said yourself that mmu_notifier_register can be as slow as you want ... what about using stop_machine for it? I'm not even joking here :-) > vmtruncate takes 1 single lock, the i_mmap_lock of the inode. Not more > than one lock and we've to still take the global-system-wide lock _before_ this single i_mmap_lock and no other lock at all. Ben. From clameter at sgi.com Wed May 7 16:39:39 2008 From: clameter at sgi.com (Christoph Lameter) Date: Wed, 7 May 2008 16:39:39 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Linus Torvalds wrote:

> The code that can take many locks will have to get the global lock *and*
> order the types, but that's still trivial. It's something like
>
>	spin_lock(&global_lock);
>	for (vma = mm->mmap; vma; vma = vma->vm_next) {
>		if (vma->anon_vma)
>			spin_lock(&vma->anon_vma->lock);
>	}
>	for (vma = mm->mmap; vma; vma = vma->vm_next) {
>		if (!vma->anon_vma && vma->vm_file && vma->vm_file->f_mapping)
>			spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
>	}
>	spin_unlock(&global_lock);

Multiple vmas may share the same mapping or refer to the same anonymous vma. The above code will deadlock since we may take some locks multiple times.
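To make the failure mode and the eventual fix concrete, here is a minimal sketch of a duplicate-safe variant of the loop quoted above. It is illustrative only: the "locked" field is hypothetical (struct anon_vma has no such member at this point in the thread), and mmap_sem is assumed held so the vma list cannot change underneath the walk. The flag itself is only ever written under global_lock, which is what makes the skip test safe against other many-lock takers, while ordinary single-lock takers never look at it:

	spin_lock(&global_lock);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		struct anon_vma *anon = vma->anon_vma;

		/* Take each anon_vma lock at most once, even when many
		 * vmas in this mm share the same anon_vma. */
		if (anon && !anon->locked) {
			anon->locked = 1;
			spin_lock(&anon->lock);
		}
	}
	spin_unlock(&global_lock);

	/* ... critical section ... */

	/* Unlock pass: clear the hypothetical flag before releasing
	 * each lock, so only flagged (i.e. actually taken) locks are
	 * dropped, and each exactly once. */
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		struct anon_vma *anon = vma->anon_vma;

		if (anon && anon->locked) {
			anon->locked = 0;
			spin_unlock(&anon->lock);
		}
	}

This is essentially the per-anon_vma flag idea that comes up later in the thread; the i_mmap_lock pass would need the same treatment with a flag in struct address_space.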
From andrea at qumranet.com Wed May 7 16:39:53 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 01:39:53 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507155914.d7790069.akpm@linux-foundation.org> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: <20080507233953.GM8276@duo.random> Hi Andrew, On Wed, May 07, 2008 at 03:59:14PM -0700, Andrew Morton wrote:

> CPU0:                          CPU1:
>
> spin_lock(global_lock)
> spin_lock(a->lock);            spin_lock(b->lock);
================== mmu_notifier_register()
> spin_lock(b->lock);            spin_unlock(b->lock);
>                                spin_lock(a->lock);
>                                spin_unlock(a->lock);
>
> also OK.

But the problem is that we have to stop the critical section in the place I marked with "========" while mmu_notifier_register runs. Otherwise the driver calling mmu_notifier_register won't know if it's safe to start establishing secondary sptes/tlbs. If the driver establishes sptes/tlbs with get_user_pages/follow_page, the page could be freed immediately afterwards when zap_page_range starts. So if CPU1 doesn't take the global_lock before proceeding in zap_page_range (inside vmtruncate i_mmap_lock, represented as b->lock above) we're in trouble. What we can do is to replace the mm_lock with a spin_lock(&global_lock) only if all places that take i_mmap_lock take the global lock first, and that hurts scalability of the fast paths that are performance critical like vmtruncate and anon_vma->lock. Perhaps they're not so performance critical, but they're surely much more performance critical than mmu_notifier_register ;). The idea of polluting various scalable paths like the truncate() syscall in the VM with a global spinlock frightens me; I'd rather return to invalidate_page() inside the PT lock, removing both invalidate_range_start/end. Then all serialization against the mmu notifiers will be provided by the PT lock that the secondary mmu page fault also has to take in get_user_pages (or follow_page). In any case that is a better solution that won't slow down the VM when MMU_NOTIFIER=y, even if it's a bit slower for GRU; for KVM performance is about the same with or without invalidate_range_start/end. I didn't think anybody could care about how long mmu_notifier_register takes until it returns compared to all the heavyweight operations that happen to start a VM (not only in the kernel but in the guest too). In fact, if it's security that we worry about here, we can put a cap on the _time_ that mmu_notifier_register can take before it fails, and we fail to start a VM if it takes more than 5sec; that's still fine, as the failure could happen for other reasons too, like a vmalloc shortage, and we already handle it just fine. This 5sec delay can't possibly happen in practice anyway in the only interesting scenario, just like the vmalloc shortage.
This is obviously a superior solution to polluting the VM with a useless global spinlock that will destroy truncate/AIM on numa. Anyway Christoph, I uploaded my last version here: http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v16 (applies and runs fine on 26-rc1) You're more than welcome to take over from it; I kind of feel my time now may be better spent emulating the mmu-notifier-core with kprobes. From torvalds at linux-foundation.org Wed May 7 16:38:51 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 16:38:51 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 01 of 11] mmu-notifier-core In-Reply-To: <20080507223738.GF8276@duo.random> References: <20080507130528.adfd154c.akpm@linux-foundation.org> <20080507215840.GB8276@duo.random> <20080507222758.GD8276@duo.random> <20080507223738.GF8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > At least for mmu-notifier-core given I obviously am the original > author of that code, I hope the From: of the email was enough even if > an additional From: andrea was missing in the body. Ok, this whole series of patches has just been such a disaster that I'm (a) disgusted that _anybody_ sent an Acked-by: for any of it, and (b) that I'm still looking at it at all, but I am. And quite frankly, the more I look, and the more answers from you I get, the less I like it. And I didn't like it that much to start with, as you may have noticed. You say that "At least for mmu-notifier-core given I obviously am the original author of that code", but that is not at all obvious either. One of the reasons I stated that authorship seems to have been thrown away is very much exactly in that first mmu-notifier-core patch:

+ * linux/mm/mmu_notifier.c
+ *
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright (C) 2008 SGI
+ *             Christoph Lameter

so I would very strongly dispute that it's "obvious" that you are the original author of the code there. So there was a reason why I said that I thought authorship had been lost somewhere along the way. Linus From andrea at qumranet.com Wed May 7 16:45:21 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 01:45:21 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <1210202918.1421.20.camel@pasglop> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <1210202918.1421.20.camel@pasglop> Message-ID: <20080507234521.GN8276@duo.random> On Thu, May 08, 2008 at 09:28:38AM +1000, Benjamin Herrenschmidt wrote: > > On Thu, 2008-05-08 at 00:44 +0200, Andrea Arcangeli wrote: > > > > Please note, we can't allow a thread to be in the middle of > > zap_page_range while mmu_notifier_register runs. > > You said yourself that mmu_notifier_register can be as slow as you > want ... what about using stop_machine for it? I'm not even joking > here :-) We can put a cap on time + a cap on vmas. It's not important if it fails; in the only useful case we know of, it won't be slow at all. The failure can happen because the cap on time or the cap on vmas triggers, or because there's a vmalloc shortage. We handle the failure in userland of course. There are zillions of allocations needed anyway, and any one of them can fail, so this isn't a new fail path; it's the same fail path that always existed before mmu_notifiers existed.
I can't possibly see how adding a new global wide lock that forces all truncates to be serialized against each other, practically eliminating the need for the i_mmap_lock, could be superior to an approach that doesn't cause any overhead to the VM at all and only requires kvm to pay an additional cost at startup. Furthermore, the only reason I had to implement mm_lock was to fix the invalidate_range_start/end model. If we go with only invalidate_page and invalidate_pages called inside the PT lock, and we use the PT lock to serialize, we don't need a mm_lock anymore and no new lock from the VM either. I tried to push for that, but everyone else wanted invalidate_range_start/end. I only did the only possible thing to do: to make invalidate_range_start safe, to make everyone happy without slowing down the VM. From torvalds at linux-foundation.org Wed May 7 17:03:30 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 17:03:30 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Christoph Lameter wrote: > > Multiple vmas may share the same mapping or refer to the same anonymous > vma. The above code will deadlock since we may take some locks multiple > times. Ok, so that actually _is_ a problem. It would be easy enough to also add just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's suggestion of just using stop-machine is actually the right one just because it's _so_ simple. (That said, we're not running out of vm flags yet, and if we were, we could just add another word. We're already wasting that space right now on 64-bit by calling it "unsigned long"). Linus From swise at opengridcomputing.com Wed May 7 17:16:34 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 07 May 2008 19:16:34 -0500 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> <48222FFD.40302@opengridcomputing.com> <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com> Message-ID: <48224662.60401@opengridcomputing.com> Sean Hefty wrote:

>>> > nes - requires 0b write
>>> > cxgb3 - requires 0b read
>>> > amso1100 - won't work in p2p mode
>
> I'm assuming by requires that you, uhm, mean requires, and nes couldn't do 0b
> reads, or cxgb3 0b writes.

Well, I'm not sure about nes. But cxgb3 cannot deal with receiving a 0B write for the RTR because the FW doesn't see incoming writes, nor does the driver. nes may be able to request a 0b read, but what I meant was that they currently use a 0B write and not a read. So it's possible to reduce the complexity if we just mandate a 0B read for RTR. But it makes sense in my mind to allow the other message types...

>> Its is painful. But without anything, you cannot run OMPI, IMPI or
>> HPMPI on a iwarp cluster with mixed vendor rnics...
>
> Is there any requirement at the receiving side, versus the initiating side?
> That is, just because nes issues a 0b write, does the receiving HW care if a
> read or write shows up?
> Or is this restriction on both sides?

The requirement is mostly driven from the receiving side. For cxgb3 it is anyway... The receiving side, i.e. the side that issues the rdma_accept, will tell the sending side what RTR message to send, if any. So the MPA exchange will look like this:

client sends MPA Start request with private data saying "i can send an RTR if you want it".
server moves connection into RDMA mode
server sends MPA Start response with "lets do RTR and send me X" where X could be a 0B write, 0B read request, or 0B send.
client moves connection into RDMA mode
client sends X and then enables SQ processing (or indicates ESTABLISHED)
Once the server gets X it can enable SQ processing (or indicate ESTABLISHED)
If X was a 0B read request, the server sends a 0B read response.

Steve From holt at sgi.com Wed May 7 17:38:38 2008 From: holt at sgi.com (Robin Holt) Date: Wed, 7 May 2008 19:38:38 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> Message-ID: <20080508003838.GA9878@sgi.com> On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote: > On Wed, 7 May 2008, Andrea Arcangeli wrote: > > > > I think the spinlock->rwsem conversion is ok under config option, as > > you can see I complained myself to various of those patches and I'll > > take care they're in a mergeable state the moment I submit them. What > > XPMEM requires are different semantics for the methods, and we never > > had to do any blocking I/O during vmtruncate before, now we have to. > > I really suspect we don't really have to, and that it would be better to > just fix the code that does that. That fix is going to be fairly difficult. I will argue impossible. First, a little background. SGI allows one large numa-link connected machine to be broken into separate single-system images which we call partitions. XPMEM allows, at its most extreme, one process on one partition to grant access to a portion of its virtual address range to processes on another partition. Those processes can then fault pages and directly share the memory. In order to invalidate the remote page table entries, we need to send a message (using XPC) to the remote side. The remote side needs to acquire the importing process's mmap_sem and call zap_page_range(). Between the messaging and the acquiring of a sleeping lock, I would argue this will require sleeping locks in the path prior to the mmu_notifier invalidate_* callouts(). On a side note, we currently have XPMEM working on x86_64 SSI, and ia64 cross-partition. We are in the process of getting XPMEM working on x86_64 cross-partition in support of UV. Thanks, Robin Holt From holt at sgi.com Wed May 7 17:52:56 2008 From: holt at sgi.com (Robin Holt) Date: Wed, 7 May 2008 19:52:56 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: <20080508005256.GB9878@sgi.com> On Wed, May 07, 2008 at 05:03:30PM -0700, Linus Torvalds wrote: > > > On Wed, 7 May 2008, Christoph Lameter wrote: > > > > Multiple vmas may share the same mapping or refer to the same anonymous > > vma. The above code will deadlock since we may take some locks multiple > > times. > > Ok, so that actually _is_ a problem.
> It would be easy enough to also add
> just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing
> a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's
> suggestion of just using stop-machine is actually the right one just
> because it's _so_ simple.

Also, stop-machine will not work if we come to the conclusion that i_mmap_lock and anon_vma->lock need to be sleepable locks. Thanks, Robin Holt From clameter at sgi.com Wed May 7 17:56:17 2008 From: clameter at sgi.com (Christoph Lameter) Date: Wed, 7 May 2008 17:56:17 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Linus Torvalds wrote: > On Wed, 7 May 2008, Christoph Lameter wrote: > > > > Multiple vmas may share the same mapping or refer to the same anonymous > > vma. The above code will deadlock since we may take some locks multiple > > times. > > Ok, so that actually _is_ a problem. It would be easy enough to also add > just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing > a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's > suggestion of just using stop-machine is actually the right one just > because it's _so_ simple. Set the vma flag when we lock it and then skip it when we find it locked, right? This would be in addition to the global lock? stop-machine would work for KVM since it's a once-in-a-Guest-OS-lifetime thing. But GRU, KVM and eventually Infiniband need the ability to attach in a reasonable timeframe without causing major hiccups for other processes. > (That said, we're not running out of vm flags yet, and if we were, we > could just add another word. We're already wasting that space right now on > 64-bit by calling it "unsigned long"). We sure have enough flags. From torvalds at linux-foundation.org Wed May 7 17:55:33 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 17:55:33 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508003838.GA9878@sgi.com> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080508003838.GA9878@sgi.com> Message-ID: On Wed, 7 May 2008, Robin Holt wrote: > > In order to invalidate the remote page table entries, we need to message > (uses XPC) to the remote side. The remote side needs to acquire the > importing process's mmap_sem and call zap_page_range(). Between the > messaging and the acquiring a sleeping lock, I would argue this will > require sleeping locks in the path prior to the mmu_notifier invalidate_* > callouts(). You simply will *have* to do it without locally holding all the MM spinlocks. Because quite frankly, slowing down all the normal VM stuff for some really esoteric hardware simply isn't acceptable. We just don't do it. So what is it that actually triggers one of these events? The most obvious solution is to just queue the affected pages while holding the spinlocks (perhaps locking them locally), and then handle all the stuff that can block after releasing things. That's how we normally do these things, and it works beautifully, without making everything slower.
Sometimes we go to extremes, and actually break the locks and restart (ugh), and it gets ugly, but even that tends to be preferable to using the wrong locking. The thing is, spinlocks really kick ass. Yes, they restrict what you can do within them, but if 99.99% of all work is non-blocking, then the really odd rare blocking case is the one that needs to accommodate, not the rest. Linus From torvalds at linux-foundation.org Wed May 7 18:02:49 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 18:02:49 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507233953.GM8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > Hi Andrew, > > On Wed, May 07, 2008 at 03:59:14PM -0700, Andrew Morton wrote:

> > CPU0:                          CPU1:
> >
> > spin_lock(global_lock)
> > spin_lock(a->lock);            spin_lock(b->lock);
> ================== mmu_notifier_register()

If mmu_notifier_register() takes the global lock, it cannot happen here. It will be blocked (by CPU0), so there's no way it can then cause an ABBA deadlock. It will be released when CPU0 has taken *all* the locks it needed to take. > What we can do is to replace the mm_lock with a > spin_lock(&global_lock) only if all places that take i_mmap_lock NO! You replace mm_lock() with the sequence that Andrew gave you (and I described):

	spin_lock(&global_lock)
	.. get all locks UNORDERED ..
	spin_unlock(&global_lock)

and you're now done. You have your "mm_lock()" (which still needs to be renamed - it should be a "mmu_notifier_lock()" or something like that), but you don't need the insane sorting. At most you apparently need a way to recognize duplicates (so that you don't deadlock on yourself), which looks like a simple bit-per-vma. The global lock doesn't protect any data structures itself - it just protects two of these mm_lock() functions from ABBA'ing on each other! Linus From torvalds at linux-foundation.org Wed May 7 18:07:27 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 18:07:27 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Christoph Lameter wrote: > > Set the vma flag when we locked it and then skip when we find it locked > right? This would be in addition to the global lock? Yes. And clear it before unlocking (and again, testing if it's already clear - you mustn't unlock twice, so you must only unlock when the bit was set). You also (obviously) need to have something that guarantees that the lists themselves are stable over the whole sequence, but I assume you already have mmap_sem for reading (since you'd need it anyway just to follow the list). And if you have it for writing, it can obviously *act* as the global lock, since it would already guarantee mutual exclusion on that mm->mmap list.
Linus From clameter at sgi.com Wed May 7 18:12:32 2008 From: clameter at sgi.com (Christoph Lameter) Date: Wed, 7 May 2008 18:12:32 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> Message-ID: On Wed, 7 May 2008, Linus Torvalds wrote: > and you're now done. You have your "mm_lock()" (which still needs to be > renamed - it should be a "mmu_notifier_lock()" or something like that), > but you don't need the insane sorting. At most you apparently need a way > to recognize duplicates (so that you don't deadlock on yourself), which > looks like a simple bit-per-vma. Andrea's mm_lock could have wider impact. It is the first effective way that I have seen of temporarily holding off reclaim from an address space. It sure is a brute force approach. From andrea at qumranet.com Wed May 7 18:26:56 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 03:26:56 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> Message-ID: <20080508012656.GQ8276@duo.random> On Wed, May 07, 2008 at 06:02:49PM -0700, Linus Torvalds wrote:

> You replace mm_lock() with the sequence that Andrew gave you (and I
> described):
>
>	spin_lock(&global_lock)
>	.. get all locks UNORDERED ..
>	spin_unlock(&global_lock)
>
> and you're now done. You have your "mm_lock()" (which still needs to be
> renamed - it should be a "mmu_notifier_lock()" or something like that),
> but you don't need the insane sorting. At most you apparently need a way
> to recognize duplicates (so that you don't deadlock on yourself), which
> looks like a simple bit-per-vma.
>
> The global lock doesn't protect any data structures itself - it just
> protects two of these mm_lock() functions from ABBA'ing on each other!

I thought the thing to remove was the "get all locks". I didn't realize the major problem was only the sorting of the array. I'll add the global lock; it's worth it, as it drops the worst case number of steps by log(65536) times. Furthermore, two concurrent mmu_notifier_lock calls will surely run faster as it'll decrease the cacheline collisions. Since you ask to call it mmu_notifier_lock I'll also move it to mmu_notifier.[ch] as a consequence. From torvalds at linux-foundation.org Wed May 7 18:32:11 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 18:32:11 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> Message-ID: On Wed, 7 May 2008, Christoph Lameter wrote: > On Wed, 7 May 2008, Linus Torvalds wrote: > > and you're now done.
> > You have your "mm_lock()" (which still needs to be
> > renamed - it should be a "mmu_notifier_lock()" or something like that),
> > but you don't need the insane sorting. At most you apparently need a way
> > to recognize duplicates (so that you don't deadlock on yourself), which
> > looks like a simple bit-per-vma.
>
> Andrea's mm_lock could have wider impact. It is the first effective
> way that I have seen of temporarily holding off reclaim from an address
> space. It sure is a brute force approach.

Well, I don't think the naming necessarily has to be about notifiers, but it should be at least a *bit* more scary than "mm_lock()", to make it clear that it's pretty dang expensive. Even without the vmalloc and sorting, if it would be used by "normal" things it would still be very expensive for some cases - running things like ElectricFence, for example, will easily generate thousands and thousands of vma's in a process. Linus From andrea at qumranet.com Wed May 7 18:34:59 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 03:34:59 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080507234521.GN8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <1210202918.1421.20.camel@pasglop> <20080507234521.GN8276@duo.random> Message-ID: <20080508013459.GS8276@duo.random> Sorry for not having completely answered this. I initially thought stop_machine could work when you mentioned it, but I don't think it can, even removing xpmem's block-inside-mmu-notifier-method requirements. For stop_machine to solve this (besides being slower and potentially not safer, as running stop_machine in a loop isn't nice), we'd need to prevent preemption in between invalidate_range_start/end. I think there are two ways:

1) add a global lock around mm_lock to remove the sorting

2) remove invalidate_range_start/end, nuke mm_lock as a consequence of it, and replace all three with invalidate_pages issued inside the PT lock, one invalidation for each 512 pte_t modified, so serialization against get_user_pages becomes trivial; but this will not be ok at all for SGI as it increases their invalidation frequency a lot

For KVM both ways are almost the same. I'll implement 1 now, then we'll see... From torvalds at linux-foundation.org Wed May 7 18:39:48 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 18:39:48 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: On Wed, 7 May 2008, Christoph Lameter wrote: > > > (That said, we're not running out of vm flags yet, and if we were, we > > could just add another word. We're already wasting that space right now on > > 64-bit by calling it "unsigned long"). > > We sure have enough flags. Oh, btw, I was wrong - we wouldn't want to mark the vma's (they are unique), we need to mark the address spaces/anonvma's. So the flag would need to be in the "struct anon_vma" (and struct address_space), not in the vma itself. My bad.
So the flag wouldn't be one of the VM_xyzzy flags, and would require adding a new field to "struct anon_vma()". And related to that brain-fart of mine, that obviously also means that yes, the locking has to be stronger than "mm->mmap_sem" held for writing, so yeah, it would have to be a separate global spinlock (or perhaps a blocking lock if you have some reason to protect anything else with this too). Linus From andrea at qumranet.com Wed May 7 18:52:49 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 03:52:49 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> Message-ID: <20080508015249.GT8276@duo.random> On Wed, May 07, 2008 at 06:39:48PM -0700, Linus Torvalds wrote: > > > On Wed, 7 May 2008, Christoph Lameter wrote: > > > > > (That said, we're not running out of vm flags yet, and if we were, we > > > could just add another word. We're already wasting that space right now on > > > 64-bit by calling it "unsigned long"). > > > > We sure have enough flags. > > Oh, btw, I was wrong - we wouldn't want to mark the vma's (they are > unique), we need to mark the address spaces/anonvma's. So the flag would > need to be in the "struct anon_vma" (and struct address_space), not in the > vma itself. My bad. So the flag wouldn't be one of the VM_xyzzy flags, and > would require adding a new field to "struct anon_vma()" > > And related to that brain-fart of mine, that obviously also means that > yes, the locking has to be stronger than "mm->mmap_sem" held for writing, > so yeah, it would have be a separate global spinlock (or perhaps a > blocking lock if you have some reason to protect anything else with this So, because the bitflag can't prevent taking the same lock twice on two different vmas in the same mm, we still can't remove the sorting, and the global lock won't buy much other than reducing the collisions. I can add that though. I think it's more interesting to put a cap on the number of vmas at min(1024,max_map_count). The sort time on an 8k array runs in constant time. kvm runs with 127 vmas allocated... From torvalds at linux-foundation.org Wed May 7 18:57:05 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 18:57:05 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508015249.GT8276@duo.random> References: <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080508015249.GT8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > So because the bitflag can't prevent taking the same lock twice on two > different vmas in the same mm, we still can't remove the sorting Andrea. Take five minutes. Take a deep breath. And *think* about actually reading what I wrote. The bitflag *can* prevent taking the same lock twice. It just needs to be in the right place. Let me quote it for you: > So the flag wouldn't be one of the VM_xyzzy flags, and would require > adding a new field to "struct anon_vma()" IOW, just make it be in that anon_vma (and the address_space). No sorting required. > I think it's more interesting to put a cap on the number of vmas to > min(1024,max_map_count). The sort time on an 8k array runs in constant > time. Shut up already.
It's not constant time just because you can cap the overhead. We're not in a university, and we care about performance, not your made-up big-O notation. Linus From andrea at qumranet.com Wed May 7 19:24:24 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 04:24:24 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080508015249.GT8276@duo.random> Message-ID: <20080508022424.GU8276@duo.random> On Wed, May 07, 2008 at 06:57:05PM -0700, Linus Torvalds wrote: > Take five minutes. Take a deep breath. And *think* about actually reading > what I wrote. > > The bitflag *can* prevent taking the same lock twice. It just needs to be > in the right place. It's not that I didn't read it, but to do it I have to grow every anon_vma by 8 bytes. I thought it was implicit that the conclusion of your email is that it couldn't possibly make sense to grow the size of each anon_vma by 33% when nobody has loaded the kvm or gru or xpmem kernel modules. It surely isn't my preferred solution; still, capping the number of vmas to 1024 means sort() will make around 10240 steps; Matt can tell the exact number. The big cost shouldn't be in sort. Even 512 vmas will be more than enough for us, in fact. Note that I have a cond_resched in the sort compare function and I can re-add the signal_pending check. I had the signal_pending check in the original version that didn't use sort() but was doing an inner loop; I thought signal_pending wasn't necessary after speeding it up with sort(). But I can add it again, so then we'll only fail to abort inside sort() and we'll be able to break the loop while taking all the spinlocks, but with such a small array that can't be an issue, and the result will surely run faster than stop_machine, with zero ram and cpu overhead for the VM (besides, stop_machine can't work, or we'd need to disable preemption between invalidate_range_start/end, even removing the xpmem schedule-inside-mmu-notifier requirement). From torvalds at linux-foundation.org Wed May 7 19:32:05 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 19:32:05 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508022424.GU8276@duo.random> References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080508015249.GT8276@duo.random> <20080508022424.GU8276@duo.random> Message-ID: Andrea, I'm not interested. I've stated my standpoint: the code being discussed is crap. We're not doing that. Not in the core VM. I gave solutions that I think aren't crap, but I already also stated that I have no problems not merging it _ever_ if no solution can be found. The whole issue simply isn't even worth the pain, imnsho.
Linus From andrea at qumranet.com Wed May 7 19:56:52 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 04:56:52 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> Message-ID: <20080508025652.GW8276@duo.random> On Wed, May 07, 2008 at 06:12:32PM -0700, Christoph Lameter wrote: > Andrea's mm_lock could have wider impact. It is the first effective > way that I have seen of temporarily holding off reclaim from an address > space. It sure is a brute force approach. The only improvement I can imagine for mm_lock (after changing the name to global_mm_lock()) is to reestablish the signal_pending check in the loop that takes the spinlocks so it can back off, and to put the cap at 512 vmas, so that the ram wasted on anon-vmas wouldn't save more than 10-100usec at most (plus the vfree, which may be a bigger cost, but we're ok to pay it and it surely isn't security related). Then, in the long term, we need to talk to Matt about returning a parameter from the sort function to break the loop. After that we remove the 512 vma cap and mm_lock is free to run as long as it wants, like /dev/urandom; nobody could care less how long it will run before returning as long as it reacts to signals. This is the right way if we want to support XPMEM/GRU efficiently and without introducing unnecessary regressions in the VM fastpaths and VM footprint. From clameter at sgi.com Wed May 7 20:10:33 2008 From: clameter at sgi.com (Christoph Lameter) Date: Wed, 7 May 2008 20:10:33 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508025652.GW8276@duo.random> References: <20080507212650.GA8276@duo.random> <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > to the sort function to break the loop. After that we remove the 512 > vma cap and mm_lock is free to run as long as it wants, like > /dev/urandom; nobody could care less how long it will run before > returning as long as it reacts to signals. Look, Linus has told you what to do. Why not simply do it? From andrea at qumranet.com Wed May 7 20:41:33 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 05:41:33 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> Message-ID: <20080508034133.GY8276@duo.random> On Wed, May 07, 2008 at 08:10:33PM -0700, Christoph Lameter wrote: > On Thu, 8 May 2008, Andrea Arcangeli wrote: > > > to the sort function to break the loop. After that we remove the 512 > > vma cap and mm_lock is free to run as long as it wants, like > > /dev/urandom; nobody could care less how long it will run before > > returning as long as it reacts to signals. > > Look, Linus has told you what to do. Why not simply do it?
When it looked like we could use vm_flags to remove sort, that looked like an ok optimization; no problem with optimizations, I'm all for optimizations if they cost nothing to the VM fast paths and VM footprint. But removing sort isn't worth it if it takes away ram from the VM even when global_mm_lock will never be called. sort is like /dev/urandom, so after sort is fixed to handle signals (and I expect Matt will help with this) we'll remove the 512 vma cap. In the meantime we can live with the 512 vma cap that guarantees sort won't take more than a few dozen usec. Removing sort() is the only thing that the anon vma bitflag can achieve, and it's clearly not worth it, and it would go in the wrong direction (fixing sort to handle signals is clearly the right direction, if sort is a concern at all). Adding the global lock around global_mm_lock to keep one global_mm_lock from colliding with another global_mm_lock is sure ok with me; if that's still wanted now that it's clear removing sort isn't worth it, I'm neutral on this. Christoph, please go ahead and add the bitflag to anon-vma yourself if you want. If something isn't technically right I don't do it, no matter who asks. From torvalds at linux-foundation.org Wed May 7 21:14:45 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 7 May 2008 21:14:45 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508034133.GY8276@duo.random> References: <20080507222205.GC8276@duo.random> <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > But removing sort isn't worth it if it takes away ram from the VM even > when global_mm_lock will never be called. Andrea, you really are a piece of work. Your arguments have been bogus crap that didn't even understand what was going on from the beginning, and now you continue to do that. What exactly "takes away ram" from the VM? The notion of adding a flag to "struct anon_vma"? The one that already has a 4 byte padding thing on x86-64 just after the spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two bytes of padding if we didn't just make the spinlock type unconditionally 32 bits rather than the 16 bits we actually _use_? IOW, you didn't even look at it, did you? But whatever. I clearly don't want a patch from you anyway, so .. Linus From andrea at qumranet.com Wed May 7 22:20:19 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 07:20:19 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> Message-ID: <20080508052019.GA8276@duo.random> On Wed, May 07, 2008 at 09:14:45PM -0700, Linus Torvalds wrote: > IOW, you didn't even look at it, did you? Actually I looked both at the struct and at the slab alignment, just in case it was changed recently. Now after reading your mail I also compiled it just in case.
2.6.26-rc1

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma             260    576     24  144    1 : tunables  120   60    8 : slabdata      4      4      0
                                    ^^  ^^^

2.6.26-rc1 + below patch

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -27,6 +27,7 @@ struct anon_vma {
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
+	int flag:1;
 };

 #ifdef CONFIG_MMU

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
anon_vma             250    560     32  112    1 : tunables  120   60    8 : slabdata      5      5      0
                                    ^^  ^^^

Sure, it's not a big deal to grow it 33%, it's so small anyway, but I don't see the point in growing it. sort() can't be interrupted by signals, and until it can, we can cap it to 512 vmas, making the worst case take a few dozen usecs; I fail to see what you have against sort(). Again: if a vma bitflag + global lock could have avoided sort and run in O(N) instead of the current O(N*log(N)), I would have done that immediately; in fact I was in the process of doing it when you posted the followup. Nothing personal here, just staying technical. Hope you do too. From penberg at cs.helsinki.fi Wed May 7 22:27:47 2008 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Thu, 8 May 2008 08:27:47 +0300 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508052019.GA8276@duo.random> References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> Message-ID: <84144f020805072227i3382465eleccded79d9fcf532@mail.gmail.com> On Thu, May 8, 2008 at 8:20 AM, Andrea Arcangeli wrote: > Actually I looked both at the struct and at the slab alignment, just in > case it was changed recently. Now after reading your mail I also > compiled it just in case.
>
> @@ -27,6 +27,7 @@ struct anon_vma {
>  struct anon_vma {
>
>  	spinlock_t lock;	/* Serialize access to vma list */
>
>  	struct list_head head;	/* List of private "related" vmas */
> +	int flag:1;
> };

You might want to read carefully what Linus wrote: > The one that already has a 4 byte padding thing on x86-64 just after the > spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two > bytes of padding if we didn't just make the spinlock type unconditionally > 32 bits rather than the 16 bits we actually _use_? So you need to add the flag _after_ ->lock and _before_ ->head.... Pekka From penberg at cs.helsinki.fi Wed May 7 22:30:20 2008 From: penberg at cs.helsinki.fi (Pekka Enberg) Date: Thu, 8 May 2008 08:30:20 +0300 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <84144f020805072227i3382465eleccded79d9fcf532@mail.gmail.com> References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> <84144f020805072227i3382465eleccded79d9fcf532@mail.gmail.com> Message-ID: <84144f020805072230g2a619d65x8e3bb1fbf9d130d8@mail.gmail.com> On Thu, May 8, 2008 at 8:27 AM, Pekka Enberg wrote: > You might want to read carefully what Linus wrote: > > > The one that already has a 4 byte padding thing on x86-64 just after the > > spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two > > bytes of padding if we didn't just make the spinlock type unconditionally > > 32 bits rather than the 16 bits we actually _use_?
>
> So you need to add the flag _after_ ->lock and _before_ ->head....

Oh, I should have taken my morning coffee first; before ->lock obviously works as well. From andrea at qumranet.com Wed May 7 22:49:31 2008 From: andrea at qumranet.com (Andrea Arcangeli) Date: Thu, 8 May 2008 07:49:31 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <84144f020805072230g2a619d65x8e3bb1fbf9d130d8@mail.gmail.com> References: <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> <84144f020805072227i3382465eleccded79d9fcf532@mail.gmail.com> <84144f020805072230g2a619d65x8e3bb1fbf9d130d8@mail.gmail.com> Message-ID: <20080508054931.GB8276@duo.random> On Thu, May 08, 2008 at 08:30:20AM +0300, Pekka Enberg wrote: > On Thu, May 8, 2008 at 8:27 AM, Pekka Enberg wrote: > > You might want to read carefully what Linus wrote: > > > > > The one that already has a 4 byte padding thing on x86-64 just after the > > > spinlock? And that on 32-bit x86 (with less than 256 CPU's) would have two > > > bytes of padding if we didn't just make the spinlock type unconditionally > > > 32 bits rather than the 16 bits we actually _use_? > > > > So you need to add the flag _after_ ->lock and _before_ ->head.... > > Oh, I should have taken my morning coffee first; before ->lock obviously > works as well. Sorry, Linus is right: I didn't realize the "after the spinlock" was literally after the spinlock; I didn't see the 4 byte padding when I read the code and put the flag:1 in. If put between ->lock and ->head it doesn't take more memory on x86-64, as described literally. So the next step would be to find another place like that in the address_space. Perhaps after the private_lock using the same trick, or perhaps the slab alignment won't actually alter the number of slabs per page regardless. I leave that to Christoph, he's surely better than me at doing this; I give it up entirely, and I consider my attempt to merge a total failure and I strongly regret it. On a side note, the anon_vma will change to this when XPMEM support is compiled in:

 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	atomic_t refcount;	/* vmas on the list */
+	struct rw_semaphore sem;/* Serialize access to vma list */
 	struct list_head head;	/* List of private "related" vmas */
 };

not sure if it'll grow in size or not after that, but let's say it's not a big deal. From PHF at zurich.ibm.com Thu May 8 02:38:19 2008 From: PHF at zurich.ibm.com (Philip Frey1) Date: Thu, 8 May 2008 11:38:19 +0200 Subject: [ofa-general] cxgb3 user limitations Message-ID: Hi, I have a Chelsio T3. Whenever I do RDMA as a normal user, I am severely limited in terms of memory. Why is that? Is there a way of using the RNIC with the same privileges as root but without actually being root? Could you give me some insight into what the limits of the Chelsio RNIC are?
(Max MRs, QPs, PDs etc) Many thanks and kind regards, Philip From michael.heinz at qlogic.com Thu May 8 06:54:24 2008 From: michael.heinz at qlogic.com (Mike Heinz) Date: Thu, 8 May 2008 08:54:24 -0500 Subject: [ofa-general] Need help diagnosing a problem.... Message-ID: I was smoke testing a small cluster when one of the nodes posted this:

May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: Internal error detected:
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[00]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[01]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[02]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[03]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[04]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[05]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[06]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[07]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[08]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[09]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0a]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0b]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0c]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0d]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0e]: ffffffff
May 7 16:47:00 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0f]: ffffffff

At this point, all further IB traffic on that node failed, and it silently hung during shutdown. Any suggestions as to what I should look at? -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania From swise at opengridcomputing.com Thu May 8 07:16:54 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 08 May 2008 09:16:54 -0500 Subject: [ofa-general] Re: cxgb3 user limitations In-Reply-To: References: Message-ID: <48230B56.4060101@opengridcomputing.com> Philip Frey1 wrote: > > Hi, > > I have a Chelsio T3. Whenever I do RDMA as a normal user, I am severely > limited in terms of memory. > Why is that? Is there a way of using the RNIC with the same privileges > as root but without actually > being root? > > Could you give me some insight into what the limits of the Chelsio RNIC > are? (Max MRs, QPs, PDs etc) > > Many thanks and kind regards, > Philip Try running ibv_devinfo -v to see driver/hw limits. However, how are you limited? Are you getting failures registering memory? Did you try setting your ulimit -l to unlimited, or at least as large as the memory region you want to register? Steve. From pawel.dziekonski at wcss.pl Thu May 8 07:57:10 2008 From: pawel.dziekonski at wcss.pl (Pawel Dziekonski) Date: Thu, 8 May 2008 16:57:10 +0200 Subject: [ofa-general] getting network statistics In-Reply-To: <1210145698.15669.78.camel@mtls03> References: <1203424196.16145.1.camel@mtls03> <20080506141039.GJ6586@cefeid.wcss.wroc.pl> <1210145698.15669.78.camel@mtls03> Message-ID: <20080508145710.GA13329@cefeid.wcss.wroc.pl> yes, I meant CONTENT of those files.
while true; do cat port_rcv_data; sleep 1; done
223212307
223212307
223212307
223212342
223227022
223227022
223227022
223227022
223227022
223227057
223227265

On Wed, 07 May 2008 at 10:34:58AM +0300, Eli Cohen wrote: > These files are on a virtual file system and their size does not change. > You need to read them, e.g. using cat, in order to get the statistics > data. For example, "cat port_rcv_data" will give you a measure of how > many bytes of data were received by the port. > > On Tue, 2008-05-06 at 16:10 +0200, Pawel Dziekonski wrote: > > you mean port_rcv_data and port_xmit_data ? > > > > if so, then I have 2 jobs that are definitelly using IB network, but > > those files almost do not change. :o > > > > OFED 1.2.5.5 and kernel 2.6.9-55.0.12.ELsmp

> > root at wn111:/sys/class/infiniband/mthca0/ports/1/counters # ls -al
> > total 0
> > drwxr-xr-x 2 root root 0 May 6 15:45 ./
> > drwxr-xr-x 5 root root 0 May 6 15:45 ../
> > -r--r--r-- 1 root root 4096 May 6 15:45 VL15_dropped
> > -r--r--r-- 1 root root 4096 May 6 15:45 excessive_buffer_overrun_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 link_downed
> > -r--r--r-- 1 root root 4096 May 6 15:45 link_error_recovery
> > -r--r--r-- 1 root root 4096 May 6 15:45 local_link_integrity_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_constraint_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_data
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_packets
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_remote_physical_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_rcv_switch_relay_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_constraint_errors
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_data
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_discards
> > -r--r--r-- 1 root root 4096 May 6 15:45 port_xmit_packets
> > -r--r--r-- 1 root root 4096 May 6 15:45 symbol_error

> > > > On Tue, 19 Feb 2008 at 02:29:56PM +0200, Eli Cohen wrote:

> > > cat /sys/class/infiniband/mlx4_0/ports/1/counters/*
> > >
> > > mlx4_* can be mthca*

> > > On Tue, 2008-02-19 at 11:03 +0200, David Minor wrote: > > > > Under Linux with Mellanox ofed, how can I get real-time network > > > > statistics. e.g. how many bytes are being sent and received over each > > > > port at any given time? > > > -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl From torvalds at linux-foundation.org Thu May 8 08:03:19 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Thu, 8 May 2008 08:03:19 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508052019.GA8276@duo.random> References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> Message-ID: On Thu, 8 May 2008, Andrea Arcangeli wrote: > > Actually I looked both at the struct and at the slab alignment just in > case it was changed recently. Now after reading your mail I also > compiled it just in case. Put the flag after the spinlock, not after the "list_head".
Also, we'd need to make it

	unsigned short flag:1;

_and_ change spinlock_types.h to make the spinlock size actually match the required size (right now we make it an "unsigned int slock" even when we actually only use 16 bits).

See the #if (NR_CPUS < 256) code in <asm-x86/spinlock.h>.

		Linus

From torvalds at linux-foundation.org Thu May 8 09:11:33 2008
From: torvalds at linux-foundation.org (Linus Torvalds)
Date: Thu, 8 May 2008 09:11:33 -0700 (PDT)
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: 
References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random>
Message-ID: 

On Thu, 8 May 2008, Linus Torvalds wrote:
> 
> Also, we'd need to make it
> 
> 	unsigned short flag:1;
> 
> _and_ change spinlock_types.h to make the spinlock size actually match the
> required size (right now we make it an "unsigned int slock" even when we
> actually only use 16 bits).

Btw, this is an issue only on 32-bit x86, because on 64-bit one we already have the padding due to the alignment of the 64-bit pointers in the list_head (so there's already empty space there).

On 32-bit, the alignment of list-head is obviously just 32 bits, so right now the structure is "perfectly packed" and doesn't have any empty space. But that's just because the spinlock is unnecessarily big.

(Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the structure really will grow. That's a very odd configuration, though, and not one I feel we really need to care about).

		Linus

From akepner at sgi.com Thu May 8 10:09:48 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 8 May 2008 10:09:48 -0700
Subject: [ofa-general] IPoIB-UD TX timeouts (OFED 1.2)
In-Reply-To: <4e6a6b3c0804301300q57b4b562r854e337ff8706222@mail.gmail.com>
References: <20080430192354.GG26724@sgi.com> <4e6a6b3c0804301300q57b4b562r854e337ff8706222@mail.gmail.com>
Message-ID: <20080508170948.GT24293@sgi.com>

On Wed, Apr 30, 2008 at 11:00:55PM +0300, Eli Cohen wrote:
> ....
> when it happens please:
> 1. Check the link error counters.

Unfortunately there appear to be things running that periodically reset the counters (to avoid hitting the 32 bit limit), so the port counters usually come back as all 0.

> 2. Disconnect and reconnect the cable and see if it recovers.

None of the systems where this has been seen are physically accessible to me (and even if they were, finding the right cable to pull might be tricky :-)

We have some new information, which I'll post now.

-- 
Arthur

From akepner at sgi.com Thu May 8 10:19:36 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 8 May 2008 10:19:36 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
Message-ID: <20080508171936.GU24293@sgi.com>

In an earlier email I mentioned that, with certain workloads, we are seeing an endless loop of timeouts on the IPoIB-UD send queue.
Messages like "NETDEV WATCHDOG: ib0: transmit timed out" appear once a second until the driver is unloaded. That was with OFED 1.2.

Using OFED 1.3, we see what I believe is the same problem, but it looks a little different. We don't get "NETDEV WATCHDOG", but we get an endless string of "post_send failed". (I suspect, but haven't verified, that the difference is due to the sharing of ipoib_dev_priv's tx_outstanding member between the UD and CM IPoIB QPs; the value of tx_outstanding is used to determine when to call netif_stop_queue().)

The h/w is MT25204, with f/w version 1.2.0, on an x86_64.

I instrumented the mthca driver to maintain a circular buffer of the state of the IPoIB-UD send queue on each call to the "post_send" (mthca_arbel_post_send) and "poll_cq" (mthca_poll_one) routines, and also to dump the QP and CQ context when the full queue is detected.

At some point, we just stop getting completions on the send queue. Here are the last entries from the "poll_cq" log:

# jiffies      qpn    last   head   tail
#                     comp
.....
0x100032cdc   0x404   0x49   0x24b  0x24a
0x100032cdc   0x404   0x4a   0x24b  0x24b
0x100033eed   0x404   0x4c   0x24e  0x24d
0x100033eed   0x404   0x4d   0x24e  0x24e
0x10003b594   0x404   0x4f   0x251  0x250
0x10003b594   0x404   0x50   0x251  0x251
0x10003c999   0x404   0x52   0x254  0x253
0x10003ca16   0x404   0x53   0x255  0x254
0x10003ca93   0x404   0x54   0x256  0x255
0x10003ca93   0x404   0x55   0x256  0x256

We keep calling the send routine (apparently via the periodic ipoib_ib_tx_timer_func()) and keep getting a "queue full" condition - the send queue length is 128. Here are some entries after the queue has filled (they keep going "forever"):

# jiffies      qpn    last   head   tail
#                     comp
.....
0x1000760dd   0x404   0x55   0x2d6  0x256
0x1000761c6   0x404   0x55   0x2d6  0x256
0x1000761d7   0x404   0x55   0x2d6  0x256
0x1000762c0   0x404   0x55   0x2d6  0x256
0x1000762d1   0x404   0x55   0x2d6  0x256
0x1000763ba   0x404   0x55   0x2d6  0x256

And here's the QP and CQ context immediately after the first post_send failure:

QP context (including the two 32-bit "opt_param_mask" and reserved fields at the beginning):

[00] 0x00000000 0x00000000 0x30031900 0xef3e3f16
[10] 0x8b423b00 0x00000002 0x00000404 0x00000000
[20] 0x00000000 0x00000000 0x01000000 0x60000000
[30] 0x00000000 0x00000000 0x00000000 0x00000000
[40] 0x00000000 0x00000000 0x00000000 0x00000000
[50] 0x00000000 0x00000000 0x00000000 0x00000000
[60] 0x00000000 0x00000000 0x00000000 0x00000006
[70] 0x00000000 0x00002600 0xaf004000 0x00800088
[80] 0x00000256 0x00000082 0x00004000 0x00000005
[90] 0x00ffffff 0x00000257 0x00000008 0x003a277f
[a0] 0x25020200 0x00000081 0x00000000 0x00007ff9
[b0] 0x00000b1b 0x00000000 0x000003f8 0x03f80256
[c0] 0x00000000 0x00000000 0x00000000 0x00000000
[d0] 0x00000000 0x00000000 0x00000000 0x00000000
[e0] 0x00000000 0x00000000 0x00000000 0x00000000
[f0] 0x00000000 0x00000000 0x00000000 0x00000000

CQ context:

[00] 0x00000a00 0x00000000 0x00000000 0x08000002
[10] 0x00000000 0x00000001 0x00000004 0x00002500
[20] 0x000001fd 0x000001fd 0x00000000 0x00000238
[30] 0x00000082 0x00007ffa 0x00000004 0x00000000

I don't see anything obviously wrong here - anyone at Mellanox? Any idea why the card would stop generating TX completions?
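(For reference, the instrumentation is nothing exotic - conceptually just a fixed-size ring of snapshots taken on entry to those two routines. A simplified sketch with made-up names, not the actual hack:)

	struct sq_snap {
		u64 jiffies;		/* get_jiffies_64() at the time of the call */
		u32 qpn;		/* QP number */
		u32 last_comp;		/* last send wqe index seen completed */
		u32 head, tail;		/* send queue producer/consumer indices */
	};

	#define SQ_SNAP_NUM 1024	/* power of 2 so the index wraps cheaply */
	static struct sq_snap sq_snap_log[SQ_SNAP_NUM];
	static atomic_t sq_snap_idx = ATOMIC_INIT(-1);

	static void sq_snap_take(u32 qpn, u32 last_comp, u32 head, u32 tail)
	{
		u32 i = (u32) atomic_inc_return(&sq_snap_idx) &
			(SQ_SNAP_NUM - 1);

		sq_snap_log[i].jiffies = get_jiffies_64();
		sq_snap_log[i].qpn = qpn;
		sq_snap_log[i].last_comp = last_comp;
		sq_snap_log[i].head = head;
		sq_snap_log[i].tail = tail;
	}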
-- 
Arthur

From rdreier at cisco.com Thu May 8 10:42:37 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 May 2008 10:42:37 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: <20080508171936.GU24293@sgi.com> (akepner@sgi.com's message of "Thu, 8 May 2008 10:19:36 -0700")
References: <20080508171936.GU24293@sgi.com>
Message-ID: 

> Using OFED 1.3, we see what I believe is the same problem, but it
> looks a little different. We don't get "NETDEV WATCHDOG", but
> we get an endless string of "post_send failed".

That's bad. Did you check if the send is failing due to overrunning the send queue?

> (I suspect, but haven't verified, that the difference is due to
> the sharing of ipoib_dev_priv's tx_outstanding member between
> the UD and CM IPoIB QPs, the value of tx_outstanding is used
> to determine when to call netif_stop_queue().)

A while ago, I was worried about the handling of tx_outstanding and how the driver makes sure that it doesn't post too many sends, but I managed to convince myself that the code was OK. Guess we should check it over one more time.

 - R.

From rdreier at cisco.com Thu May 8 10:45:28 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 May 2008 10:45:28 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: <20080508171936.GU24293@sgi.com> (akepner@sgi.com's message of "Thu, 8 May 2008 10:19:36 -0700")
References: <20080508171936.GU24293@sgi.com>
Message-ID: 

> We keep calling the send routine (apparently via the periodic
> ipoib_ib_tx_timer_func()) and keep getting a "queue full" condition -
> the send queue length is 128.

I don't see how ipoib_ib_tx_timer_func() could call post_send since all it does is poll the CQ and handle completions.

Also ipoib_ib_tx_timer_func() was added post-OFED 1.3 (it is only in 2.6.26-rc1 AFAIK), so what kernel are you using?

 - R.

From akepner at sgi.com Thu May 8 10:43:58 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 8 May 2008 10:43:58 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: 
References: <20080508171936.GU24293@sgi.com>
Message-ID: <20080508174358.GW24293@sgi.com>

On Thu, May 08, 2008 at 10:42:37AM -0700, Roland Dreier wrote:
> ..
> That's bad. Did you check if the send is failing due to overrunning the
> send queue?

Yes, mthca_wq_overflow() is detecting a full queue.

(Is that what you mean?)

-- 
Arthur

From rdreier at cisco.com Thu May 8 10:50:11 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 May 2008 10:50:11 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: <20080508174358.GW24293@sgi.com> (akepner@sgi.com's message of "Thu, 8 May 2008 10:43:58 -0700")
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com>
Message-ID: 

> Yes, mthca_wq_overflow() is detecting a full queue.
> 
> (Is that what you mean?)

Yep, that's what I was wondering.

It might be useful to track the value of tx_outstanding... from a quick look at the code I can't see how the transmit queue could be awake when the UD send queue is full.

Are you using connected mode when you reproduce this, or does it happen with datagram mode?

 - R.
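P.S. Untested, but something as dumb as this would probably be enough to see it (a sketch against the OFED 1.3 ipoib; priv->tx_outstanding and ipoib_sendq_size are the names in that tree - call it from ipoib_send() and from the send completion handling):

	static void ipoib_dump_tx_outstanding(struct ipoib_dev_priv *priv,
					      const char *where)
	{
		/* only log the edges so we don't flood the kernel log */
		if (priv->tx_outstanding == 0 ||
		    priv->tx_outstanding >= ipoib_sendq_size)
			printk(KERN_DEBUG "%s: tx_outstanding %u, queue %s\n",
			       where, priv->tx_outstanding,
			       netif_queue_stopped(priv->dev) ?
			       "stopped" : "awake");
	}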
From akepner at sgi.com Thu May 8 10:52:26 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 8 May 2008 10:52:26 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: 
References: <20080508171936.GU24293@sgi.com>
Message-ID: <20080508175226.GX24293@sgi.com>

On Thu, May 08, 2008 at 10:45:28AM -0700, Roland Dreier wrote:
> 
> I don't see how ipoib_ib_tx_timer_func() could call post_send since all
> it does is poll the CQ and handle completions.

I'll check for sure about what's calling post_send here.

> 
> Also ipoib_ib_tx_timer_func() was added post-OFED 1.3 (it is only in
> 2.6.26-rc1 AFAIK), so what kernel are you using?
> 

The kernel is SLES10 SP1 (2.6.16.46-0.12-smp), and OFED 1.3-ga is installed on that.

-- 
Arthur

From akepner at sgi.com Thu May 8 10:55:47 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 8 May 2008 10:55:47 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: 
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com>
Message-ID: <20080508175547.GY24293@sgi.com>

On Thu, May 08, 2008 at 10:50:11AM -0700, Roland Dreier wrote:
> ...
> It might be useful to track the value of tx_outstanding... from a quick
> look at the code I can't see how the transmit queue could be awake when
> the UD send queue is full.
> 

OK, I'll check that.

> Are you using connected mode when you reproduce this, or does it happen
> with datagram mode?
> 

We're using connected mode. (I think we have had some similar problems when using datagram mode, but I don't have details.)

-- 
Arthur

From ralph.campbell at qlogic.com Thu May 8 11:55:12 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 08 May 2008 11:55:12 -0700
Subject: [ofa-general] [PATCH 0/3] IB/ipath -- fixes for 2.6.26
Message-ID: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com>

The following patches fix a number of bugs for the QLogic DDR HCA.

IB/ipath - fix RC and UC error handling
IB/ipath - fix many locking issues when switching to error state
IB/ipath - fix RDMA read response sequence checking

These can also be pulled into Roland's infiniband.git for-2.6.26 repo using:

git pull git://git.qlogic.com/ipath-linux-2.6 for-roland

Just FYI, these changes bring 2.6.26 into sync with what I submitted for OFED-1.3.1. I also don't expect further changes for a while.

From ralph.campbell at qlogic.com Thu May 8 11:55:17 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 08 May 2008 11:55:17 -0700
Subject: [ofa-general] [PATCH 1/3] IB/ipath - fix RC and UC error handling
In-Reply-To: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com>
References: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com>
Message-ID: <20080508185517.8547.42852.stgit@eng-46.mv.qlogic.com>

When errors are detected in RC, the QP should transition to the IB_QPS_ERR state, not the IB_QPS_SQE state. Also, when the error is on the responder side, the recv work completion error was incorrect (rem vs. local).
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_qp.c | 54 +-------- drivers/infiniband/hw/ipath/ipath_rc.c | 127 +++++++--------------- drivers/infiniband/hw/ipath/ipath_ruc.c | 165 ++++++++++++++--------------- drivers/infiniband/hw/ipath/ipath_verbs.c | 4 - drivers/infiniband/hw/ipath/ipath_verbs.h | 6 + 5 files changed, 132 insertions(+), 224 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index dd5b6e9..6f98632 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -374,13 +374,14 @@ static void ipath_reset_qp(struct ipath_qp *qp, enum ib_qp_type type) } /** - * ipath_error_qp - put a QP into an error state - * @qp: the QP to put into an error state + * ipath_error_qp - put a QP into the error state + * @qp: the QP to put into the error state * @err: the receive completion error to signal if a RWQE is active * * Flushes both send and receive work queues. * Returns true if last WQE event should be generated. * The QP s_lock should be held and interrupts disabled. + * If we are already in error state, just return. */ int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) @@ -389,8 +390,10 @@ int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) struct ib_wc wc; int ret = 0; - ipath_dbg("QP%d/%d in error state (%d)\n", - qp->ibqp.qp_num, qp->remote_qpn, err); + if (qp->state == IB_QPS_ERR) + goto bail; + + qp->state = IB_QPS_ERR; spin_lock(&dev->pending_lock); if (!list_empty(&qp->timerwait)) @@ -460,6 +463,7 @@ int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) } else if (qp->ibqp.event_handler) ret = 1; +bail: return ret; } @@ -1026,48 +1030,6 @@ bail: } /** - * ipath_sqerror_qp - put a QP's send queue into an error state - * @qp: QP who's send queue will be put into an error state - * @wc: the WC responsible for putting the QP in this state - * - * Flushes the send work queue. - * The QP s_lock should be held and interrupts disabled. - */ - -void ipath_sqerror_qp(struct ipath_qp *qp, struct ib_wc *wc) -{ - struct ipath_ibdev *dev = to_idev(qp->ibqp.device); - struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); - - ipath_dbg("Send queue error on QP%d/%d: err: %d\n", - qp->ibqp.qp_num, qp->remote_qpn, wc->status); - - spin_lock(&dev->pending_lock); - if (!list_empty(&qp->timerwait)) - list_del_init(&qp->timerwait); - if (!list_empty(&qp->piowait)) - list_del_init(&qp->piowait); - spin_unlock(&dev->pending_lock); - - ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 1); - if (++qp->s_last >= qp->s_size) - qp->s_last = 0; - - wc->status = IB_WC_WR_FLUSH_ERR; - - while (qp->s_last != qp->s_head) { - wqe = get_swqe_ptr(qp, qp->s_last); - wc->wr_id = wqe->wr.wr_id; - wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - ipath_cq_enter(to_icq(qp->ibqp.send_cq), wc, 1); - if (++qp->s_last >= qp->s_size) - qp->s_last = 0; - } - qp->s_cur = qp->s_tail = qp->s_head; - qp->state = IB_QPS_SQE; -} - -/** * ipath_get_credit - flush the send work queue of a QP * @qp: the qp who's send work queue to flush * @aeth: the Acknowledge Extended Transport Header diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index 08b11b5..b4b26c3 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -771,27 +771,14 @@ done: * * The QP s_lock should be held and interrupts disabled. 
*/ -void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc) +void ipath_restart_rc(struct ipath_qp *qp, u32 psn) { struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); struct ipath_ibdev *dev; if (qp->s_retry == 0) { - wc->wr_id = wqe->wr.wr_id; - wc->status = IB_WC_RETRY_EXC_ERR; - wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc->vendor_err = 0; - wc->byte_len = 0; - wc->qp = &qp->ibqp; - wc->imm_data = 0; - wc->src_qp = qp->remote_qpn; - wc->wc_flags = 0; - wc->pkey_index = 0; - wc->slid = qp->remote_ah_attr.dlid; - wc->sl = qp->remote_ah_attr.sl; - wc->dlid_path_bits = 0; - wc->port_num = 0; - ipath_sqerror_qp(qp, wc); + ipath_send_complete(qp, wqe, IB_WC_RETRY_EXC_ERR); + ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); goto bail; } qp->s_retry--; @@ -804,6 +791,8 @@ void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc) spin_lock(&dev->pending_lock); if (!list_empty(&qp->timerwait)) list_del_init(&qp->timerwait); + if (!list_empty(&qp->piowait)) + list_del_init(&qp->piowait); spin_unlock(&dev->pending_lock); if (wqe->wr.opcode == IB_WR_RDMA_READ) @@ -845,6 +834,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); struct ib_wc wc; + enum ib_wc_status status; struct ipath_swqe *wqe; int ret = 0; u32 ack_psn; @@ -909,7 +899,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, */ update_last_psn(qp, wqe->psn - 1); /* Retry this request. */ - ipath_restart_rc(qp, wqe->psn, &wc); + ipath_restart_rc(qp, wqe->psn); /* * No need to process the ACK/NAK since we are * restarting an earlier request. @@ -937,20 +927,15 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, /* Post a send completion queue entry if requested. */ if (!(qp->s_flags & IPATH_S_SIGNAL_REQ_WR) || (wqe->wr.send_flags & IB_SEND_SIGNALED)) { + memset(&wc, 0, sizeof wc); wc.wr_id = wqe->wr.wr_id; wc.status = IB_WC_SUCCESS; wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc.vendor_err = 0; wc.byte_len = wqe->length; - wc.imm_data = 0; wc.qp = &qp->ibqp; wc.src_qp = qp->remote_qpn; - wc.wc_flags = 0; - wc.pkey_index = 0; wc.slid = qp->remote_ah_attr.dlid; wc.sl = qp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; - wc.port_num = 0; ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0); } qp->s_retry = qp->s_retry_cnt; @@ -1012,7 +997,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, if (qp->s_last == qp->s_tail) goto bail; if (qp->s_rnr_retry == 0) { - wc.status = IB_WC_RNR_RETRY_EXC_ERR; + status = IB_WC_RNR_RETRY_EXC_ERR; goto class_b; } if (qp->s_rnr_retry_cnt < 7) @@ -1050,37 +1035,25 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, * RDMA READ response which terminates the RDMA * READ. 
*/ - ipath_restart_rc(qp, psn, &wc); + ipath_restart_rc(qp, psn); break; case 1: /* Invalid Request */ - wc.status = IB_WC_REM_INV_REQ_ERR; + status = IB_WC_REM_INV_REQ_ERR; dev->n_other_naks++; goto class_b; case 2: /* Remote Access Error */ - wc.status = IB_WC_REM_ACCESS_ERR; + status = IB_WC_REM_ACCESS_ERR; dev->n_other_naks++; goto class_b; case 3: /* Remote Operation Error */ - wc.status = IB_WC_REM_OP_ERR; + status = IB_WC_REM_OP_ERR; dev->n_other_naks++; class_b: - wc.wr_id = wqe->wr.wr_id; - wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc.vendor_err = 0; - wc.byte_len = 0; - wc.qp = &qp->ibqp; - wc.imm_data = 0; - wc.src_qp = qp->remote_qpn; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = qp->remote_ah_attr.dlid; - wc.sl = qp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; - wc.port_num = 0; - ipath_sqerror_qp(qp, &wc); + ipath_send_complete(qp, wqe, status); + ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); break; default: @@ -1126,8 +1099,8 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, int header_in_data) { struct ipath_swqe *wqe; + enum ib_wc_status status; unsigned long flags; - struct ib_wc wc; int diff; u32 pad; u32 aeth; @@ -1159,6 +1132,7 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, if (unlikely(qp->s_last == qp->s_tail)) goto ack_done; wqe = get_swqe_ptr(qp, qp->s_last); + status = IB_WC_SUCCESS; switch (opcode) { case OP(ACKNOWLEDGE): @@ -1200,7 +1174,7 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, /* no AETH, no ACK */ if (unlikely(ipath_cmp24(psn, qp->s_last_psn + 1))) { dev->n_rdma_seq++; - ipath_restart_rc(qp, qp->s_last_psn + 1, &wc); + ipath_restart_rc(qp, qp->s_last_psn + 1); goto ack_done; } if (unlikely(wqe->wr.opcode != IB_WR_RDMA_READ)) @@ -1261,7 +1235,7 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, /* ACKs READ req. 
*/ if (unlikely(ipath_cmp24(psn, qp->s_last_psn + 1))) { dev->n_rdma_seq++; - ipath_restart_rc(qp, qp->s_last_psn + 1, &wc); + ipath_restart_rc(qp, qp->s_last_psn + 1); goto ack_done; } if (unlikely(wqe->wr.opcode != IB_WR_RDMA_READ)) @@ -1291,31 +1265,16 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, goto ack_done; } -ack_done: - spin_unlock_irqrestore(&qp->s_lock, flags); - goto bail; - ack_op_err: - wc.status = IB_WC_LOC_QP_OP_ERR; + status = IB_WC_LOC_QP_OP_ERR; goto ack_err; ack_len_err: - wc.status = IB_WC_LOC_LEN_ERR; + status = IB_WC_LOC_LEN_ERR; ack_err: - wc.wr_id = wqe->wr.wr_id; - wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc.vendor_err = 0; - wc.byte_len = 0; - wc.imm_data = 0; - wc.qp = &qp->ibqp; - wc.src_qp = qp->remote_qpn; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = qp->remote_ah_attr.dlid; - wc.sl = qp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; - wc.port_num = 0; - ipath_sqerror_qp(qp, &wc); + ipath_send_complete(qp, wqe, status); + ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); +ack_done: spin_unlock_irqrestore(&qp->s_lock, flags); bail: return; @@ -1523,13 +1482,12 @@ send_ack: return 0; } -static void ipath_rc_error(struct ipath_qp *qp, enum ib_wc_status err) +void ipath_rc_error(struct ipath_qp *qp, enum ib_wc_status err) { unsigned long flags; int lastwqe; spin_lock_irqsave(&qp->s_lock, flags); - qp->state = IB_QPS_ERR; lastwqe = ipath_error_qp(qp, err); spin_unlock_irqrestore(&qp->s_lock, flags); @@ -1643,11 +1601,7 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, opcode == OP(SEND_LAST) || opcode == OP(SEND_LAST_WITH_IMMEDIATE)) break; - nack_inv: - ipath_rc_error(qp, IB_WC_REM_INV_REQ_ERR); - qp->r_nak_state = IB_NAK_INVALID_REQUEST; - qp->r_ack_psn = qp->r_psn; - goto send_ack; + goto nack_inv; case OP(RDMA_WRITE_FIRST): case OP(RDMA_WRITE_MIDDLE): @@ -1673,18 +1627,13 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, break; } - wc.imm_data = 0; - wc.wc_flags = 0; + memset(&wc, 0, sizeof wc); /* OK, process the packet. */ switch (opcode) { case OP(SEND_FIRST): - if (!ipath_get_rwqe(qp, 0)) { - rnr_nak: - qp->r_nak_state = IB_RNR_NAK | qp->r_min_rnr_timer; - qp->r_ack_psn = qp->r_psn; - goto send_ack; - } + if (!ipath_get_rwqe(qp, 0)) + goto rnr_nak; qp->r_rcv_len = 0; /* FALLTHROUGH */ case OP(SEND_MIDDLE): @@ -1751,14 +1700,10 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, wc.opcode = IB_WC_RECV_RDMA_WITH_IMM; else wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; wc.qp = &qp->ibqp; wc.src_qp = qp->remote_qpn; - wc.pkey_index = 0; wc.slid = qp->remote_ah_attr.dlid; wc.sl = qp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; - wc.port_num = 0; /* Signal completion event if the solicited bit is set. 
*/ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, (ohdr->bth[0] & @@ -1951,11 +1896,21 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, goto send_ack; goto done; +rnr_nak: + qp->r_nak_state = IB_RNR_NAK | qp->r_min_rnr_timer; + qp->r_ack_psn = qp->r_psn; + goto send_ack; + +nack_inv: + ipath_rc_error(qp, IB_WC_LOC_QP_OP_ERR); + qp->r_nak_state = IB_NAK_INVALID_REQUEST; + qp->r_ack_psn = qp->r_psn; + goto send_ack; + nack_acc: - ipath_rc_error(qp, IB_WC_REM_ACCESS_ERR); + ipath_rc_error(qp, IB_WC_LOC_PROT_ERR); qp->r_nak_state = IB_NAK_REMOTE_ACCESS_ERROR; qp->r_ack_psn = qp->r_psn; - send_ack: send_rc_ack(qp); diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c index 9e3fe61..c716a03 100644 --- a/drivers/infiniband/hw/ipath/ipath_ruc.c +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved. + * Copyright (c) 2006, 2007, 2008 QLogic Corporation. All rights reserved. * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved. * * This software is available to you under a choice of one of two @@ -140,20 +140,11 @@ int ipath_init_sge(struct ipath_qp *qp, struct ipath_rwqe *wqe, goto bail; bad_lkey: + memset(&wc, 0, sizeof(wc)); wc.wr_id = wqe->wr_id; wc.status = IB_WC_LOC_PROT_ERR; wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; - wc.byte_len = 0; - wc.imm_data = 0; wc.qp = &qp->ibqp; - wc.src_qp = 0; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = 0; - wc.sl = 0; - wc.dlid_path_bits = 0; - wc.port_num = 0; /* Signal solicited completion event. */ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); ret = 0; @@ -270,6 +261,7 @@ static void ipath_ruc_loopback(struct ipath_qp *sqp) struct ib_wc wc; u64 sdata; atomic64_t *maddr; + enum ib_wc_status send_status; qp = ipath_lookup_qpn(&dev->qp_table, sqp->remote_qpn); if (!qp) { @@ -300,8 +292,8 @@ again: wqe = get_swqe_ptr(sqp, sqp->s_last); spin_unlock_irqrestore(&sqp->s_lock, flags); - wc.wc_flags = 0; - wc.imm_data = 0; + memset(&wc, 0, sizeof wc); + send_status = IB_WC_SUCCESS; sqp->s_sge.sge = wqe->sg_list[0]; sqp->s_sge.sg_list = wqe->sg_list + 1; @@ -313,75 +305,33 @@ again: wc.imm_data = wqe->wr.ex.imm_data; /* FALLTHROUGH */ case IB_WR_SEND: - if (!ipath_get_rwqe(qp, 0)) { - rnr_nak: - /* Handle RNR NAK */ - if (qp->ibqp.qp_type == IB_QPT_UC) - goto send_comp; - if (sqp->s_rnr_retry == 0) { - wc.status = IB_WC_RNR_RETRY_EXC_ERR; - goto err; - } - if (sqp->s_rnr_retry_cnt < 7) - sqp->s_rnr_retry--; - dev->n_rnr_naks++; - sqp->s_rnr_timeout = - ib_ipath_rnr_table[qp->r_min_rnr_timer]; - ipath_insert_rnr_queue(sqp); - goto done; - } + if (!ipath_get_rwqe(qp, 0)) + goto rnr_nak; break; case IB_WR_RDMA_WRITE_WITH_IMM: - if (unlikely(!(qp->qp_access_flags & - IB_ACCESS_REMOTE_WRITE))) { - wc.status = IB_WC_REM_INV_REQ_ERR; - goto err; - } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_WRITE))) + goto inv_err; wc.wc_flags = IB_WC_WITH_IMM; wc.imm_data = wqe->wr.ex.imm_data; if (!ipath_get_rwqe(qp, 1)) goto rnr_nak; /* FALLTHROUGH */ case IB_WR_RDMA_WRITE: - if (unlikely(!(qp->qp_access_flags & - IB_ACCESS_REMOTE_WRITE))) { - wc.status = IB_WC_REM_INV_REQ_ERR; - goto err; - } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_WRITE))) + goto inv_err; if (wqe->length == 0) break; if (unlikely(!ipath_rkey_ok(qp, &qp->r_sge, wqe->length, wqe->wr.wr.rdma.remote_addr, wqe->wr.wr.rdma.rkey, - IB_ACCESS_REMOTE_WRITE))) { - acc_err: - wc.status = IB_WC_REM_ACCESS_ERR; - err: - wc.wr_id = 
wqe->wr.wr_id; - wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc.vendor_err = 0; - wc.byte_len = 0; - wc.qp = &sqp->ibqp; - wc.src_qp = sqp->remote_qpn; - wc.pkey_index = 0; - wc.slid = sqp->remote_ah_attr.dlid; - wc.sl = sqp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; - wc.port_num = 0; - spin_lock_irqsave(&sqp->s_lock, flags); - ipath_sqerror_qp(sqp, &wc); - spin_unlock_irqrestore(&sqp->s_lock, flags); - goto done; - } + IB_ACCESS_REMOTE_WRITE))) + goto acc_err; break; case IB_WR_RDMA_READ: - if (unlikely(!(qp->qp_access_flags & - IB_ACCESS_REMOTE_READ))) { - wc.status = IB_WC_REM_INV_REQ_ERR; - goto err; - } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_READ))) + goto inv_err; if (unlikely(!ipath_rkey_ok(qp, &sqp->s_sge, wqe->length, wqe->wr.wr.rdma.remote_addr, wqe->wr.wr.rdma.rkey, @@ -394,11 +344,8 @@ again: case IB_WR_ATOMIC_CMP_AND_SWP: case IB_WR_ATOMIC_FETCH_AND_ADD: - if (unlikely(!(qp->qp_access_flags & - IB_ACCESS_REMOTE_ATOMIC))) { - wc.status = IB_WC_REM_INV_REQ_ERR; - goto err; - } + if (unlikely(!(qp->qp_access_flags & IB_ACCESS_REMOTE_ATOMIC))) + goto inv_err; if (unlikely(!ipath_rkey_ok(qp, &qp->r_sge, sizeof(u64), wqe->wr.wr.atomic.remote_addr, wqe->wr.wr.atomic.rkey, @@ -415,7 +362,8 @@ again: goto send_comp; default: - goto done; + send_status = IB_WC_LOC_QP_OP_ERR; + goto serr; } sge = &sqp->s_sge.sge; @@ -458,14 +406,11 @@ again: wc.opcode = IB_WC_RECV; wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; - wc.vendor_err = 0; wc.byte_len = wqe->length; wc.qp = &qp->ibqp; wc.src_qp = qp->remote_qpn; - wc.pkey_index = 0; wc.slid = qp->remote_ah_attr.dlid; wc.sl = qp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; wc.port_num = 1; /* Signal completion event if the solicited bit is set. */ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, @@ -473,9 +418,63 @@ again: send_comp: sqp->s_rnr_retry = sqp->s_rnr_retry_cnt; - ipath_send_complete(sqp, wqe, IB_WC_SUCCESS); + ipath_send_complete(sqp, wqe, send_status); goto again; +rnr_nak: + /* Handle RNR NAK */ + if (qp->ibqp.qp_type == IB_QPT_UC) + goto send_comp; + /* + * Note: we don't need the s_lock held since the BUSY flag + * makes this single threaded. 
+ */ + if (sqp->s_rnr_retry == 0) { + send_status = IB_WC_RNR_RETRY_EXC_ERR; + goto serr; + } + if (sqp->s_rnr_retry_cnt < 7) + sqp->s_rnr_retry--; + spin_lock_irqsave(&sqp->s_lock, flags); + if (!(ib_ipath_state_ops[sqp->state] & IPATH_PROCESS_RECV_OK)) + goto unlock; + dev->n_rnr_naks++; + sqp->s_rnr_timeout = ib_ipath_rnr_table[qp->r_min_rnr_timer]; + ipath_insert_rnr_queue(sqp); + goto unlock; + +inv_err: + send_status = IB_WC_REM_INV_REQ_ERR; + wc.status = IB_WC_LOC_QP_OP_ERR; + goto err; + +acc_err: + send_status = IB_WC_REM_ACCESS_ERR; + wc.status = IB_WC_LOC_PROT_ERR; +err: + /* responder goes to error state */ + ipath_rc_error(qp, wc.status); + +serr: + spin_lock_irqsave(&sqp->s_lock, flags); + ipath_send_complete(sqp, wqe, send_status); + if (sqp->ibqp.qp_type == IB_QPT_RC) { + int lastwqe = ipath_error_qp(sqp, IB_WC_WR_FLUSH_ERR); + + sqp->s_flags &= ~IPATH_S_BUSY; + spin_unlock_irqrestore(&sqp->s_lock, flags); + if (lastwqe) { + struct ib_event ev; + + ev.device = sqp->ibqp.device; + ev.element.qp = &sqp->ibqp; + ev.event = IB_EVENT_QP_LAST_WQE_REACHED; + sqp->ibqp.event_handler(&ev, sqp->ibqp.qp_context); + } + goto done; + } +unlock: + spin_unlock_irqrestore(&sqp->s_lock, flags); done: if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); @@ -651,21 +650,15 @@ void ipath_send_complete(struct ipath_qp *qp, struct ipath_swqe *wqe, status != IB_WC_SUCCESS) { struct ib_wc wc; + memset(&wc, 0, sizeof wc); wc.wr_id = wqe->wr.wr_id; wc.status = status; wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - wc.vendor_err = 0; - wc.byte_len = wqe->length; - wc.imm_data = 0; wc.qp = &qp->ibqp; - wc.src_qp = 0; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = 0; - wc.sl = 0; - wc.dlid_path_bits = 0; - wc.port_num = 0; - ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 0); + if (status == IB_WC_SUCCESS) + wc.byte_len = wqe->length; + ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, + status != IB_WC_SUCCESS); } spin_lock_irqsave(&qp->s_lock, flags); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 5015cd2..22bb42d 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -744,12 +744,10 @@ static void ipath_ib_timer(struct ipath_ibdev *dev) /* XXX What if timer fires again while this is running? 
*/ for (qp = resend; qp != NULL; qp = qp->timer_next) { - struct ib_wc wc; - spin_lock_irqsave(&qp->s_lock, flags); if (qp->s_last != qp->s_tail && qp->state == IB_QPS_RTS) { dev->n_timeouts++; - ipath_restart_rc(qp, qp->s_last_psn + 1, &wc); + ipath_restart_rc(qp, qp->s_last_psn + 1); } spin_unlock_irqrestore(&qp->s_lock, flags); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 6514aa8..4c7c2aa 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -710,8 +710,6 @@ void ipath_free_all_qps(struct ipath_qp_table *qpt); int ipath_init_qp_table(struct ipath_ibdev *idev, int size); -void ipath_sqerror_qp(struct ipath_qp *qp, struct ib_wc *wc); - void ipath_get_credit(struct ipath_qp *qp, u32 aeth); unsigned ipath_ib_rate_to_mult(enum ib_rate rate); @@ -729,7 +727,9 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, int has_grh, void *data, u32 tlen, struct ipath_qp *qp); -void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc); +void ipath_restart_rc(struct ipath_qp *qp, u32 psn); + +void ipath_rc_error(struct ipath_qp *qp, enum ib_wc_status err); int ipath_post_ud_send(struct ipath_qp *qp, struct ib_send_wr *wr); From ralph.campbell at qlogic.com Thu May 8 11:55:23 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 08 May 2008 11:55:23 -0700 Subject: [ofa-general] [PATCH 2/3] IB/ipath - fix many locking issues when switching to error state In-Reply-To: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com> References: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com> Message-ID: <20080508185522.8547.46713.stgit@eng-46.mv.qlogic.com> The send DMA hardware queue voided a number of prior assumptions about when a send is complete which led to completions being generated out of order. There were also a number of locking issues when switching the QP to the error or reset states, and implements the IB_QPS_SQD state. 
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_qp.c | 183 ++++++++++++------------- drivers/infiniband/hw/ipath/ipath_rc.c | 151 +++++++++++++-------- drivers/infiniband/hw/ipath/ipath_ruc.c | 168 ++++++++++++++++------- drivers/infiniband/hw/ipath/ipath_uc.c | 57 +++++--- drivers/infiniband/hw/ipath/ipath_ud.c | 66 +++++++-- drivers/infiniband/hw/ipath/ipath_user_sdma.h | 2 drivers/infiniband/hw/ipath/ipath_verbs.c | 174 ++++++++++++++++-------- drivers/infiniband/hw/ipath/ipath_verbs.h | 57 +++++++- 8 files changed, 554 insertions(+), 304 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c index 6f98632..4715911 100644 --- a/drivers/infiniband/hw/ipath/ipath_qp.c +++ b/drivers/infiniband/hw/ipath/ipath_qp.c @@ -242,7 +242,6 @@ static void ipath_free_qp(struct ipath_qp_table *qpt, struct ipath_qp *qp) { struct ipath_qp *q, **qpp; unsigned long flags; - int fnd = 0; spin_lock_irqsave(&qpt->lock, flags); @@ -253,51 +252,40 @@ static void ipath_free_qp(struct ipath_qp_table *qpt, struct ipath_qp *qp) *qpp = qp->next; qp->next = NULL; atomic_dec(&qp->refcount); - fnd = 1; break; } } spin_unlock_irqrestore(&qpt->lock, flags); - - if (!fnd) - return; - - free_qpn(qpt, qp->ibqp.qp_num); - - wait_event(qp->wait, !atomic_read(&qp->refcount)); } /** - * ipath_free_all_qps - remove all QPs from the table + * ipath_free_all_qps - check for QPs still in use * @qpt: the QP table to empty + * + * There should not be any QPs still in use. + * Free memory for table. */ -void ipath_free_all_qps(struct ipath_qp_table *qpt) +unsigned ipath_free_all_qps(struct ipath_qp_table *qpt) { unsigned long flags; - struct ipath_qp *qp, *nqp; - u32 n; + struct ipath_qp *qp; + u32 n, qp_inuse = 0; + spin_lock_irqsave(&qpt->lock, flags); for (n = 0; n < qpt->max; n++) { - spin_lock_irqsave(&qpt->lock, flags); qp = qpt->table[n]; qpt->table[n] = NULL; - spin_unlock_irqrestore(&qpt->lock, flags); - - while (qp) { - nqp = qp->next; - free_qpn(qpt, qp->ibqp.qp_num); - if (!atomic_dec_and_test(&qp->refcount) || - !ipath_destroy_qp(&qp->ibqp)) - ipath_dbg("QP memory leak!\n"); - qp = nqp; - } + + for (; qp; qp = qp->next) + qp_inuse++; } + spin_unlock_irqrestore(&qpt->lock, flags); - for (n = 0; n < ARRAY_SIZE(qpt->map); n++) { + for (n = 0; n < ARRAY_SIZE(qpt->map); n++) if (qpt->map[n].page) - free_page((unsigned long)qpt->map[n].page); - } + free_page((unsigned long) qpt->map[n].page); + return qp_inuse; } /** @@ -336,11 +324,12 @@ static void ipath_reset_qp(struct ipath_qp *qp, enum ib_qp_type type) qp->remote_qpn = 0; qp->qkey = 0; qp->qp_access_flags = 0; - qp->s_busy = 0; + atomic_set(&qp->s_dma_busy, 0); qp->s_flags &= IPATH_S_SIGNAL_REQ_WR; qp->s_hdrwords = 0; qp->s_wqe = NULL; qp->s_pkt_delay = 0; + qp->s_draining = 0; qp->s_psn = 0; qp->r_psn = 0; qp->r_msn = 0; @@ -353,7 +342,8 @@ static void ipath_reset_qp(struct ipath_qp *qp, enum ib_qp_type type) } qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; qp->r_nak_state = 0; - qp->r_wrid_valid = 0; + qp->r_aflags = 0; + qp->r_flags = 0; qp->s_rnr_timeout = 0; qp->s_head = 0; qp->s_tail = 0; @@ -361,7 +351,6 @@ static void ipath_reset_qp(struct ipath_qp *qp, enum ib_qp_type type) qp->s_last = 0; qp->s_ssn = 1; qp->s_lsn = 0; - qp->s_wait_credit = 0; memset(qp->s_ack_queue, 0, sizeof(qp->s_ack_queue)); qp->r_head_ack_queue = 0; qp->s_tail_ack_queue = 0; @@ -370,7 +359,6 @@ static void ipath_reset_qp(struct ipath_qp *qp, enum ib_qp_type type) qp->r_rq.wq->head = 0; qp->r_rq.wq->tail = 0; } - 
qp->r_reuse_sge = 0; } /** @@ -402,39 +390,21 @@ int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) list_del_init(&qp->piowait); spin_unlock(&dev->pending_lock); - wc.vendor_err = 0; - wc.byte_len = 0; - wc.imm_data = 0; + /* Schedule the sending tasklet to drain the send work queue. */ + if (qp->s_last != qp->s_head) + ipath_schedule_send(qp); + + memset(&wc, 0, sizeof(wc)); wc.qp = &qp->ibqp; - wc.src_qp = 0; - wc.wc_flags = 0; - wc.pkey_index = 0; - wc.slid = 0; - wc.sl = 0; - wc.dlid_path_bits = 0; - wc.port_num = 0; - if (qp->r_wrid_valid) { - qp->r_wrid_valid = 0; + wc.opcode = IB_WC_RECV; + + if (test_and_clear_bit(IPATH_R_WRID_VALID, &qp->r_aflags)) { wc.wr_id = qp->r_wr_id; - wc.opcode = IB_WC_RECV; wc.status = err; ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, 1); } wc.status = IB_WC_WR_FLUSH_ERR; - while (qp->s_last != qp->s_head) { - struct ipath_swqe *wqe = get_swqe_ptr(qp, qp->s_last); - - wc.wr_id = wqe->wr.wr_id; - wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode]; - if (++qp->s_last >= qp->s_size) - qp->s_last = 0; - ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1); - } - qp->s_cur = qp->s_tail = qp->s_head; - qp->s_hdrwords = 0; - qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE; - if (qp->r_rq.wq) { struct ipath_rwq *wq; u32 head; @@ -450,7 +420,6 @@ int ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err) tail = wq->tail; if (tail >= qp->r_rq.size) tail = 0; - wc.opcode = IB_WC_RECV; while (tail != head) { wc.wr_id = get_rwqe_ptr(&qp->r_rq, tail)->wr_id; if (++tail >= qp->r_rq.size) @@ -482,11 +451,10 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, struct ipath_ibdev *dev = to_idev(ibqp->device); struct ipath_qp *qp = to_iqp(ibqp); enum ib_qp_state cur_state, new_state; - unsigned long flags; int lastwqe = 0; int ret; - spin_lock_irqsave(&qp->s_lock, flags); + spin_lock_irq(&qp->s_lock); cur_state = attr_mask & IB_QP_CUR_STATE ? 
attr->cur_qp_state : qp->state; @@ -539,16 +507,42 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, switch (new_state) { case IB_QPS_RESET: + if (qp->state != IB_QPS_RESET) { + qp->state = IB_QPS_RESET; + spin_lock(&dev->pending_lock); + if (!list_empty(&qp->timerwait)) + list_del_init(&qp->timerwait); + if (!list_empty(&qp->piowait)) + list_del_init(&qp->piowait); + spin_unlock(&dev->pending_lock); + qp->s_flags &= ~IPATH_S_ANY_WAIT; + spin_unlock_irq(&qp->s_lock); + /* Stop the sending tasklet */ + tasklet_kill(&qp->s_task); + wait_event(qp->wait_dma, !atomic_read(&qp->s_dma_busy)); + spin_lock_irq(&qp->s_lock); + } ipath_reset_qp(qp, ibqp->qp_type); break; + case IB_QPS_SQD: + qp->s_draining = qp->s_last != qp->s_cur; + qp->state = new_state; + break; + + case IB_QPS_SQE: + if (qp->ibqp.qp_type == IB_QPT_RC) + goto inval; + qp->state = new_state; + break; + case IB_QPS_ERR: lastwqe = ipath_error_qp(qp, IB_WC_WR_FLUSH_ERR); break; default: + qp->state = new_state; break; - } if (attr_mask & IB_QP_PKEY_INDEX) @@ -601,8 +595,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC) qp->s_max_rd_atomic = attr->max_rd_atomic; - qp->state = new_state; - spin_unlock_irqrestore(&qp->s_lock, flags); + spin_unlock_irq(&qp->s_lock); if (lastwqe) { struct ib_event ev; @@ -616,7 +609,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, goto bail; inval: - spin_unlock_irqrestore(&qp->s_lock, flags); + spin_unlock_irq(&qp->s_lock); ret = -EINVAL; bail: @@ -647,7 +640,7 @@ int ipath_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, attr->pkey_index = qp->s_pkey_index; attr->alt_pkey_index = 0; attr->en_sqd_async_notify = 0; - attr->sq_draining = 0; + attr->sq_draining = qp->s_draining; attr->max_rd_atomic = qp->s_max_rd_atomic; attr->max_dest_rd_atomic = qp->r_max_rd_atomic; attr->min_rnr_timer = qp->r_min_rnr_timer; @@ -837,6 +830,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd, spin_lock_init(&qp->r_rq.lock); atomic_set(&qp->refcount, 0); init_waitqueue_head(&qp->wait); + init_waitqueue_head(&qp->wait_dma); tasklet_init(&qp->s_task, ipath_do_send, (unsigned long)qp); INIT_LIST_HEAD(&qp->piowait); INIT_LIST_HEAD(&qp->timerwait); @@ -930,6 +924,7 @@ bail_ip: else vfree(qp->r_rq.wq); ipath_free_qp(&dev->qp_table, qp); + free_qpn(&dev->qp_table, qp->ibqp.qp_num); bail_qp: kfree(qp); bail_swq: @@ -951,41 +946,44 @@ int ipath_destroy_qp(struct ib_qp *ibqp) { struct ipath_qp *qp = to_iqp(ibqp); struct ipath_ibdev *dev = to_idev(ibqp->device); - unsigned long flags; - spin_lock_irqsave(&qp->s_lock, flags); - qp->state = IB_QPS_ERR; - spin_unlock_irqrestore(&qp->s_lock, flags); - spin_lock(&dev->n_qps_lock); - dev->n_qps_allocated--; - spin_unlock(&dev->n_qps_lock); + /* Make sure HW and driver activity is stopped. */ + spin_lock_irq(&qp->s_lock); + if (qp->state != IB_QPS_RESET) { + qp->state = IB_QPS_RESET; + spin_lock(&dev->pending_lock); + if (!list_empty(&qp->timerwait)) + list_del_init(&qp->timerwait); + if (!list_empty(&qp->piowait)) + list_del_init(&qp->piowait); + spin_unlock(&dev->pending_lock); + qp->s_flags &= ~IPATH_S_ANY_WAIT; + spin_unlock_irq(&qp->s_lock); + /* Stop the sending tasklet */ + tasklet_kill(&qp->s_task); + wait_event(qp->wait_dma, !atomic_read(&qp->s_dma_busy)); + } else + spin_unlock_irq(&qp->s_lock); - /* Stop the sending tasklet. 
*/ - tasklet_kill(&qp->s_task); + ipath_free_qp(&dev->qp_table, qp); if (qp->s_tx) { atomic_dec(&qp->refcount); if (qp->s_tx->txreq.flags & IPATH_SDMA_TXREQ_F_FREEBUF) kfree(qp->s_tx->txreq.map_addr); + spin_lock_irq(&dev->pending_lock); + list_add(&qp->s_tx->txreq.list, &dev->txreq_free); + spin_unlock_irq(&dev->pending_lock); + qp->s_tx = NULL; } - /* Make sure the QP isn't on the timeout list. */ - spin_lock_irqsave(&dev->pending_lock, flags); - if (!list_empty(&qp->timerwait)) - list_del_init(&qp->timerwait); - if (!list_empty(&qp->piowait)) - list_del_init(&qp->piowait); - if (qp->s_tx) - list_add(&qp->s_tx->txreq.list, &dev->txreq_free); - spin_unlock_irqrestore(&dev->pending_lock, flags); + wait_event(qp->wait, !atomic_read(&qp->refcount)); - /* - * Make sure that the QP is not in the QPN table so receive - * interrupts will discard packets for this QP. XXX Also remove QP - * from multicast table. - */ - if (atomic_read(&qp->refcount) != 0) - ipath_free_qp(&dev->qp_table, qp); + /* all user's cleaned up, mark it available */ + free_qpn(&dev->qp_table, qp->ibqp.qp_num); + spin_lock(&dev->n_qps_lock); + dev->n_qps_allocated--; + spin_unlock(&dev->n_qps_lock); if (qp->ip) kref_put(&qp->ip->ref, ipath_release_mmap_info); @@ -1055,9 +1053,10 @@ void ipath_get_credit(struct ipath_qp *qp, u32 aeth) } /* Restart sending if it was blocked due to lack of credits. */ - if (qp->s_cur != qp->s_head && + if ((qp->s_flags & IPATH_S_WAIT_SSN_CREDIT) && + qp->s_cur != qp->s_head && (qp->s_lsn == (u32) -1 || ipath_cmp24(get_swqe_ptr(qp, qp->s_cur)->ssn, qp->s_lsn + 1) <= 0)) - tasklet_hi_schedule(&qp->s_task); + ipath_schedule_send(qp); } diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index b4b26c3..5b5276a 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -92,6 +92,10 @@ static int ipath_make_rc_ack(struct ipath_ibdev *dev, struct ipath_qp *qp, u32 bth0; u32 bth2; + /* Don't send an ACK if we aren't supposed to. */ + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) + goto bail; + /* header size in 32-bit words LRH+BTH = (8+12)/4. */ hwords = 5; @@ -238,14 +242,25 @@ int ipath_make_rc_req(struct ipath_qp *qp) ipath_make_rc_ack(dev, qp, ohdr, pmtu)) goto done; - if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK) || - qp->s_rnr_timeout || qp->s_wait_credit) - goto bail; + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) { + if (!(ib_ipath_state_ops[qp->state] & IPATH_FLUSH_SEND)) + goto bail; + /* We are in the error state, flush the work request. */ + if (qp->s_last == qp->s_head) + goto bail; + /* If DMAs are in progress, we can't flush immediately. */ + if (atomic_read(&qp->s_dma_busy)) { + qp->s_flags |= IPATH_S_WAIT_DMA; + goto bail; + } + wqe = get_swqe_ptr(qp, qp->s_last); + ipath_send_complete(qp, wqe, IB_WC_WR_FLUSH_ERR); + goto done; + } - /* Limit the number of packets sent without an ACK. */ - if (ipath_cmp24(qp->s_psn, qp->s_last_psn + IPATH_PSN_CREDIT) > 0) { - qp->s_wait_credit = 1; - dev->n_rc_stalls++; + /* Leave BUSY set until RNR timeout. */ + if (qp->s_rnr_timeout) { + qp->s_flags |= IPATH_S_WAITING; goto bail; } @@ -257,6 +272,9 @@ int ipath_make_rc_req(struct ipath_qp *qp) wqe = get_swqe_ptr(qp, qp->s_cur); switch (qp->s_state) { default: + if (!(ib_ipath_state_ops[qp->state] & + IPATH_PROCESS_NEXT_SEND_OK)) + goto bail; /* * Resend an old request or start a new one. 
* @@ -294,8 +312,10 @@ int ipath_make_rc_req(struct ipath_qp *qp) case IB_WR_SEND_WITH_IMM: /* If no credit, return. */ if (qp->s_lsn != (u32) -1 && - ipath_cmp24(wqe->ssn, qp->s_lsn + 1) > 0) + ipath_cmp24(wqe->ssn, qp->s_lsn + 1) > 0) { + qp->s_flags |= IPATH_S_WAIT_SSN_CREDIT; goto bail; + } wqe->lpsn = wqe->psn; if (len > pmtu) { wqe->lpsn += (len - 1) / pmtu; @@ -325,8 +345,10 @@ int ipath_make_rc_req(struct ipath_qp *qp) case IB_WR_RDMA_WRITE_WITH_IMM: /* If no credit, return. */ if (qp->s_lsn != (u32) -1 && - ipath_cmp24(wqe->ssn, qp->s_lsn + 1) > 0) + ipath_cmp24(wqe->ssn, qp->s_lsn + 1) > 0) { + qp->s_flags |= IPATH_S_WAIT_SSN_CREDIT; goto bail; + } ohdr->u.rc.reth.vaddr = cpu_to_be64(wqe->wr.wr.rdma.remote_addr); ohdr->u.rc.reth.rkey = @@ -570,7 +592,11 @@ int ipath_make_rc_req(struct ipath_qp *qp) ipath_make_ruc_header(dev, qp, ohdr, bth0 | (qp->s_state << 24), bth2); done: ret = 1; + goto unlock; + bail: + qp->s_flags &= ~IPATH_S_BUSY; +unlock: spin_unlock_irqrestore(&qp->s_lock, flags); return ret; } @@ -606,7 +632,11 @@ static void send_rc_ack(struct ipath_qp *qp) spin_unlock_irqrestore(&qp->s_lock, flags); + /* Don't try to send ACKs if the link isn't ACTIVE */ dd = dev->dd; + if (!(dd->ipath_flags & IPATH_LINKACTIVE)) + goto done; + piobuf = ipath_getpiobuf(dd, 0, NULL); if (!piobuf) { /* @@ -668,15 +698,16 @@ static void send_rc_ack(struct ipath_qp *qp) goto done; queue_ack: - dev->n_rc_qacks++; - qp->s_flags |= IPATH_S_ACK_PENDING; - qp->s_nak_state = qp->r_nak_state; - qp->s_ack_psn = qp->r_ack_psn; + if (ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK) { + dev->n_rc_qacks++; + qp->s_flags |= IPATH_S_ACK_PENDING; + qp->s_nak_state = qp->r_nak_state; + qp->s_ack_psn = qp->r_ack_psn; + + /* Schedule the send tasklet. */ + ipath_schedule_send(qp); + } spin_unlock_irqrestore(&qp->s_lock, flags); - - /* Call ipath_do_rc_send() in another thread. */ - tasklet_hi_schedule(&qp->s_task); - done: return; } @@ -735,7 +766,7 @@ static void reset_psn(struct ipath_qp *qp, u32 psn) /* * Set the state to restart in the middle of a request. * Don't change the s_sge, s_cur_sge, or s_cur_size. - * See ipath_do_rc_send(). + * See ipath_make_rc_req(). */ switch (opcode) { case IB_WR_SEND: @@ -801,7 +832,7 @@ void ipath_restart_rc(struct ipath_qp *qp, u32 psn) dev->n_rc_resends += (qp->s_psn - psn) & IPATH_PSN_MASK; reset_psn(qp, psn); - tasklet_hi_schedule(&qp->s_task); + ipath_schedule_send(qp); bail: return; @@ -809,13 +840,7 @@ bail: static inline void update_last_psn(struct ipath_qp *qp, u32 psn) { - if (qp->s_last_psn != psn) { - qp->s_last_psn = psn; - if (qp->s_wait_credit) { - qp->s_wait_credit = 0; - tasklet_hi_schedule(&qp->s_task); - } - } + qp->s_last_psn = psn; } /** @@ -915,14 +940,10 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, wqe->wr.opcode == IB_WR_ATOMIC_FETCH_AND_ADD)) { qp->s_num_rd_atomic--; /* Restart sending task if fence is complete */ - if ((qp->s_flags & IPATH_S_FENCE_PENDING) && - !qp->s_num_rd_atomic) { - qp->s_flags &= ~IPATH_S_FENCE_PENDING; - tasklet_hi_schedule(&qp->s_task); - } else if (qp->s_flags & IPATH_S_RDMAR_PENDING) { - qp->s_flags &= ~IPATH_S_RDMAR_PENDING; - tasklet_hi_schedule(&qp->s_task); - } + if (((qp->s_flags & IPATH_S_FENCE_PENDING) && + !qp->s_num_rd_atomic) || + qp->s_flags & IPATH_S_RDMAR_PENDING) + ipath_schedule_send(qp); } /* Post a send completion queue entry if requested. 
*/ if (!(qp->s_flags & IPATH_S_SIGNAL_REQ_WR) || @@ -956,6 +977,8 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, } else { if (++qp->s_last >= qp->s_size) qp->s_last = 0; + if (qp->state == IB_QPS_SQD && qp->s_last == qp->s_cur) + qp->s_draining = 0; if (qp->s_last == qp->s_tail) break; wqe = get_swqe_ptr(qp, qp->s_last); @@ -979,7 +1002,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, */ if (ipath_cmp24(qp->s_psn, psn) <= 0) { reset_psn(qp, psn + 1); - tasklet_hi_schedule(&qp->s_task); + ipath_schedule_send(qp); } } else if (ipath_cmp24(qp->s_psn, psn) <= 0) { qp->s_state = OP(SEND_LAST); @@ -1018,6 +1041,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode, ib_ipath_rnr_table[(aeth >> IPATH_AETH_CREDIT_SHIFT) & IPATH_AETH_CREDIT_MASK]; ipath_insert_rnr_queue(qp); + ipath_schedule_send(qp); goto bail; case 3: /* NAK */ @@ -1108,6 +1132,10 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, spin_lock_irqsave(&qp->s_lock, flags); + /* Double check we can process this now that we hold the s_lock. */ + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) + goto ack_done; + /* Ignore invalid responses. */ if (ipath_cmp24(psn, qp->s_next_psn) >= 0) goto ack_done; @@ -1343,7 +1371,12 @@ static inline int ipath_rc_rcv_error(struct ipath_ibdev *dev, psn &= IPATH_PSN_MASK; e = NULL; old_req = 1; + spin_lock_irqsave(&qp->s_lock, flags); + /* Double check we can process this now that we hold the s_lock. */ + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) + goto unlock_done; + for (i = qp->r_head_ack_queue; ; i = prev) { if (i == qp->s_tail_ack_queue) old_req = 0; @@ -1471,7 +1504,7 @@ static inline int ipath_rc_rcv_error(struct ipath_ibdev *dev, break; } qp->r_nak_state = 0; - tasklet_hi_schedule(&qp->s_task); + ipath_schedule_send(qp); unlock_done: spin_unlock_irqrestore(&qp->s_lock, flags); @@ -1503,18 +1536,15 @@ void ipath_rc_error(struct ipath_qp *qp, enum ib_wc_status err) static inline void ipath_update_ack_queue(struct ipath_qp *qp, unsigned n) { - unsigned long flags; unsigned next; next = n + 1; if (next > IPATH_MAX_RDMA_ATOMIC) next = 0; - spin_lock_irqsave(&qp->s_lock, flags); if (n == qp->s_tail_ack_queue) { qp->s_tail_ack_queue = next; qp->s_ack_state = OP(ACKNOWLEDGE); } - spin_unlock_irqrestore(&qp->s_lock, flags); } /** @@ -1543,6 +1573,7 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, int diff; struct ib_reth *reth; int header_in_data; + unsigned long flags; /* Validate the SLID. See Ch. 9.6.1.5 */ if (unlikely(be16_to_cpu(hdr->lrh[3]) != qp->remote_ah_attr.dlid)) @@ -1690,9 +1721,8 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, goto nack_inv; ipath_copy_sge(&qp->r_sge, data, tlen); qp->r_msn++; - if (!qp->r_wrid_valid) + if (!test_and_clear_bit(IPATH_R_WRID_VALID, &qp->r_aflags)) break; - qp->r_wrid_valid = 0; wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; if (opcode == OP(RDMA_WRITE_LAST_WITH_IMMEDIATE) || @@ -1764,9 +1794,13 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, next = qp->r_head_ack_queue + 1; if (next > IPATH_MAX_RDMA_ATOMIC) next = 0; + spin_lock_irqsave(&qp->s_lock, flags); + /* Double check we can process this while holding the s_lock. 
*/ + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) + goto unlock; if (unlikely(next == qp->s_tail_ack_queue)) { if (!qp->s_ack_queue[next].sent) - goto nack_inv; + goto nack_inv_unlck; ipath_update_ack_queue(qp, next); } e = &qp->s_ack_queue[qp->r_head_ack_queue]; @@ -1787,7 +1821,7 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, ok = ipath_rkey_ok(qp, &e->rdma_sge, len, vaddr, rkey, IB_ACCESS_REMOTE_READ); if (unlikely(!ok)) - goto nack_acc; + goto nack_acc_unlck; /* * Update the next expected PSN. We add 1 later * below, so only add the remainder here. @@ -1814,13 +1848,12 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, qp->r_psn++; qp->r_state = opcode; qp->r_nak_state = 0; - barrier(); qp->r_head_ack_queue = next; - /* Call ipath_do_rc_send() in another thread. */ - tasklet_hi_schedule(&qp->s_task); + /* Schedule the send tasklet. */ + ipath_schedule_send(qp); - goto done; + goto unlock; } case OP(COMPARE_SWAP): @@ -1839,9 +1872,13 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, next = qp->r_head_ack_queue + 1; if (next > IPATH_MAX_RDMA_ATOMIC) next = 0; + spin_lock_irqsave(&qp->s_lock, flags); + /* Double check we can process this while holding the s_lock. */ + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) + goto unlock; if (unlikely(next == qp->s_tail_ack_queue)) { if (!qp->s_ack_queue[next].sent) - goto nack_inv; + goto nack_inv_unlck; ipath_update_ack_queue(qp, next); } if (!header_in_data) @@ -1851,13 +1888,13 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, vaddr = ((u64) be32_to_cpu(ateth->vaddr[0]) << 32) | be32_to_cpu(ateth->vaddr[1]); if (unlikely(vaddr & (sizeof(u64) - 1))) - goto nack_inv; + goto nack_inv_unlck; rkey = be32_to_cpu(ateth->rkey); /* Check rkey & NAK */ if (unlikely(!ipath_rkey_ok(qp, &qp->r_sge, sizeof(u64), vaddr, rkey, IB_ACCESS_REMOTE_ATOMIC))) - goto nack_acc; + goto nack_acc_unlck; /* Perform atomic OP and save result. */ maddr = (atomic64_t *) qp->r_sge.sge.vaddr; sdata = be64_to_cpu(ateth->swap_data); @@ -1874,13 +1911,12 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, qp->r_psn++; qp->r_state = opcode; qp->r_nak_state = 0; - barrier(); qp->r_head_ack_queue = next; - /* Call ipath_do_rc_send() in another thread. */ - tasklet_hi_schedule(&qp->s_task); + /* Schedule the send tasklet. */ + ipath_schedule_send(qp); - goto done; + goto unlock; } default: @@ -1901,19 +1937,26 @@ rnr_nak: qp->r_ack_psn = qp->r_psn; goto send_ack; +nack_inv_unlck: + spin_unlock_irqrestore(&qp->s_lock, flags); nack_inv: ipath_rc_error(qp, IB_WC_LOC_QP_OP_ERR); qp->r_nak_state = IB_NAK_INVALID_REQUEST; qp->r_ack_psn = qp->r_psn; goto send_ack; +nack_acc_unlck: + spin_unlock_irqrestore(&qp->s_lock, flags); nack_acc: ipath_rc_error(qp, IB_WC_LOC_PROT_ERR); qp->r_nak_state = IB_NAK_REMOTE_ACCESS_ERROR; qp->r_ack_psn = qp->r_psn; send_ack: send_rc_ack(qp); + goto done; +unlock: + spin_unlock_irqrestore(&qp->s_lock, flags); done: return; } diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c index c716a03..a4b5521 100644 --- a/drivers/infiniband/hw/ipath/ipath_ruc.c +++ b/drivers/infiniband/hw/ipath/ipath_ruc.c @@ -78,6 +78,7 @@ const u32 ib_ipath_rnr_table[32] = { * ipath_insert_rnr_queue - put QP on the RNR timeout list for the device * @qp: the QP * + * Called with the QP s_lock held and interrupts disabled. * XXX Use a simple list for now. 
We might need a priority * queue if we have lots of QPs waiting for RNR timeouts * but that should be rare. @@ -85,9 +86,9 @@ const u32 ib_ipath_rnr_table[32] = { void ipath_insert_rnr_queue(struct ipath_qp *qp) { struct ipath_ibdev *dev = to_idev(qp->ibqp.device); - unsigned long flags; - spin_lock_irqsave(&dev->pending_lock, flags); + /* We already did a spin_lock_irqsave(), so just use spin_lock */ + spin_lock(&dev->pending_lock); if (list_empty(&dev->rnrwait)) list_add(&qp->timerwait, &dev->rnrwait); else { @@ -109,7 +110,7 @@ void ipath_insert_rnr_queue(struct ipath_qp *qp) nqp->s_rnr_timeout -= qp->s_rnr_timeout; list_add(&qp->timerwait, l); } - spin_unlock_irqrestore(&dev->pending_lock, flags); + spin_unlock(&dev->pending_lock); } /** @@ -185,6 +186,11 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) } spin_lock_irqsave(&rq->lock, flags); + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) { + ret = 0; + goto unlock; + } + wq = rq->wq; tail = wq->tail; /* Validate tail before using it since it is user writable. */ @@ -192,9 +198,8 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) tail = 0; do { if (unlikely(tail == wq->head)) { - spin_unlock_irqrestore(&rq->lock, flags); ret = 0; - goto bail; + goto unlock; } /* Make sure entry is read after head index is read. */ smp_rmb(); @@ -207,7 +212,7 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) wq->tail = tail; ret = 1; - qp->r_wrid_valid = 1; + set_bit(IPATH_R_WRID_VALID, &qp->r_aflags); if (handler) { u32 n; @@ -234,8 +239,8 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only) goto bail; } } +unlock: spin_unlock_irqrestore(&rq->lock, flags); - bail: return ret; } @@ -263,35 +268,59 @@ static void ipath_ruc_loopback(struct ipath_qp *sqp) atomic64_t *maddr; enum ib_wc_status send_status; + /* + * Note that we check the responder QP state after + * checking the requester's state. + */ qp = ipath_lookup_qpn(&dev->qp_table, sqp->remote_qpn); - if (!qp) { - dev->n_pkt_drops++; - return; - } -again: spin_lock_irqsave(&sqp->s_lock, flags); - if (!(ib_ipath_state_ops[sqp->state] & IPATH_PROCESS_SEND_OK) || - sqp->s_rnr_timeout) { - spin_unlock_irqrestore(&sqp->s_lock, flags); - goto done; - } + /* Return if we are already busy processing a work request. */ + if ((sqp->s_flags & (IPATH_S_BUSY | IPATH_S_ANY_WAIT)) || + !(ib_ipath_state_ops[sqp->state] & IPATH_PROCESS_OR_FLUSH_SEND)) + goto unlock; - /* Get the next send request. */ - if (sqp->s_last == sqp->s_head) { - /* Send work queue is empty. */ - spin_unlock_irqrestore(&sqp->s_lock, flags); - goto done; + sqp->s_flags |= IPATH_S_BUSY; + +again: + if (sqp->s_last == sqp->s_head) + goto clr_busy; + wqe = get_swqe_ptr(sqp, sqp->s_last); + + /* Return if it is not OK to start a new work reqeust. */ + if (!(ib_ipath_state_ops[sqp->state] & IPATH_PROCESS_NEXT_SEND_OK)) { + if (!(ib_ipath_state_ops[sqp->state] & IPATH_FLUSH_SEND)) + goto clr_busy; + /* We are in the error state, flush the work request. */ + send_status = IB_WC_WR_FLUSH_ERR; + goto flush_send; } /* * We can rely on the entry not changing without the s_lock * being held until we update s_last. + * We increment s_cur to indicate s_last is in progress. 
*/ - wqe = get_swqe_ptr(sqp, sqp->s_last); + if (sqp->s_last == sqp->s_cur) { + if (++sqp->s_cur >= sqp->s_size) + sqp->s_cur = 0; + } spin_unlock_irqrestore(&sqp->s_lock, flags); + if (!qp || !(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) { + dev->n_pkt_drops++; + /* + * For RC, the requester would timeout and retry so + * shortcut the timeouts and just signal too many retries. + */ + if (sqp->ibqp.qp_type == IB_QPT_RC) + send_status = IB_WC_RETRY_EXC_ERR; + else + send_status = IB_WC_SUCCESS; + goto serr; + } + memset(&wc, 0, sizeof wc); send_status = IB_WC_SUCCESS; @@ -396,8 +425,7 @@ again: sqp->s_len -= len; } - if (wqe->wr.opcode == IB_WR_RDMA_WRITE || - wqe->wr.opcode == IB_WR_RDMA_READ) + if (!test_and_clear_bit(IPATH_R_WRID_VALID, &qp->r_aflags)) goto send_comp; if (wqe->wr.opcode == IB_WR_RDMA_WRITE_WITH_IMM) @@ -417,6 +445,8 @@ again: wqe->wr.send_flags & IB_SEND_SOLICITED); send_comp: + spin_lock_irqsave(&sqp->s_lock, flags); +flush_send: sqp->s_rnr_retry = sqp->s_rnr_retry_cnt; ipath_send_complete(sqp, wqe, send_status); goto again; @@ -437,11 +467,12 @@ rnr_nak: sqp->s_rnr_retry--; spin_lock_irqsave(&sqp->s_lock, flags); if (!(ib_ipath_state_ops[sqp->state] & IPATH_PROCESS_RECV_OK)) - goto unlock; + goto clr_busy; + sqp->s_flags |= IPATH_S_WAITING; dev->n_rnr_naks++; sqp->s_rnr_timeout = ib_ipath_rnr_table[qp->r_min_rnr_timer]; ipath_insert_rnr_queue(sqp); - goto unlock; + goto clr_busy; inv_err: send_status = IB_WC_REM_INV_REQ_ERR; @@ -473,17 +504,19 @@ serr: } goto done; } +clr_busy: + sqp->s_flags &= ~IPATH_S_BUSY; unlock: spin_unlock_irqrestore(&sqp->s_lock, flags); done: - if (atomic_dec_and_test(&qp->refcount)) + if (qp && atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); } static void want_buffer(struct ipath_devdata *dd, struct ipath_qp *qp) { if (!(dd->ipath_flags & IPATH_HAS_SEND_DMA) || - qp->ibqp.qp_type == IB_QPT_SMI) { + qp->ibqp.qp_type == IB_QPT_SMI) { unsigned long flags; spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); @@ -501,26 +534,36 @@ static void want_buffer(struct ipath_devdata *dd, struct ipath_qp *qp) * @dev: the device we ran out of buffers on * * Called when we run out of PIO buffers. + * If we are now in the error state, return zero to flush the + * send work request. */ -static void ipath_no_bufs_available(struct ipath_qp *qp, +static int ipath_no_bufs_available(struct ipath_qp *qp, struct ipath_ibdev *dev) { unsigned long flags; + int ret = 1; /* * Note that as soon as want_buffer() is called and * possibly before it returns, ipath_ib_piobufavail() - * could be called. If we are still in the tasklet function, - * tasklet_hi_schedule() will not call us until the next time - * tasklet_hi_schedule() is called. - * We leave the busy flag set so that another post send doesn't - * try to put the same QP on the piowait list again. + * could be called. Therefore, put QP on the piowait list before + * enabling the PIO avail interrupt. 
*/ - spin_lock_irqsave(&dev->pending_lock, flags); - list_add_tail(&qp->piowait, &dev->piowait); - spin_unlock_irqrestore(&dev->pending_lock, flags); - want_buffer(dev->dd, qp); - dev->n_piowait++; + spin_lock_irqsave(&qp->s_lock, flags); + if (ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK) { + dev->n_piowait++; + qp->s_flags |= IPATH_S_WAITING; + qp->s_flags &= ~IPATH_S_BUSY; + spin_lock(&dev->pending_lock); + if (list_empty(&qp->piowait)) + list_add_tail(&qp->piowait, &dev->piowait); + spin_unlock(&dev->pending_lock); + } else + ret = 0; + spin_unlock_irqrestore(&qp->s_lock, flags); + if (ret) + want_buffer(dev->dd, qp); + return ret; } /** @@ -596,15 +639,13 @@ void ipath_do_send(unsigned long data) struct ipath_qp *qp = (struct ipath_qp *)data; struct ipath_ibdev *dev = to_idev(qp->ibqp.device); int (*make_req)(struct ipath_qp *qp); - - if (test_and_set_bit(IPATH_S_BUSY, &qp->s_busy)) - goto bail; + unsigned long flags; if ((qp->ibqp.qp_type == IB_QPT_RC || qp->ibqp.qp_type == IB_QPT_UC) && qp->remote_ah_attr.dlid == dev->dd->ipath_lid) { ipath_ruc_loopback(qp); - goto clear; + goto bail; } if (qp->ibqp.qp_type == IB_QPT_RC) @@ -614,6 +655,19 @@ void ipath_do_send(unsigned long data) else make_req = ipath_make_ud_req; + spin_lock_irqsave(&qp->s_lock, flags); + + /* Return if we are already busy processing a work request. */ + if ((qp->s_flags & (IPATH_S_BUSY | IPATH_S_ANY_WAIT)) || + !(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_OR_FLUSH_SEND)) { + spin_unlock_irqrestore(&qp->s_lock, flags); + goto bail; + } + + qp->s_flags |= IPATH_S_BUSY; + + spin_unlock_irqrestore(&qp->s_lock, flags); + again: /* Check for a constructed packet to be sent. */ if (qp->s_hdrwords != 0) { @@ -623,8 +677,8 @@ again: */ if (ipath_verbs_send(qp, &qp->s_hdr, qp->s_hdrwords, qp->s_cur_sge, qp->s_cur_size)) { - ipath_no_bufs_available(qp, dev); - goto bail; + if (ipath_no_bufs_available(qp, dev)) + goto bail; } dev->n_unicast_xmit++; /* Record that we sent the packet and s_hdr is empty. */ @@ -633,16 +687,20 @@ again: if (make_req(qp)) goto again; -clear: - clear_bit(IPATH_S_BUSY, &qp->s_busy); + bail:; } +/* + * This should be called with s_lock held. + */ void ipath_send_complete(struct ipath_qp *qp, struct ipath_swqe *wqe, enum ib_wc_status status) { - unsigned long flags; - u32 last; + u32 old_last, last; + + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_OR_FLUSH_SEND)) + return; /* See ch. 11.2.4.1 and 10.7.3.1 */ if (!(qp->s_flags & IPATH_S_SIGNAL_REQ_WR) || @@ -661,10 +719,14 @@ void ipath_send_complete(struct ipath_qp *qp, struct ipath_swqe *wqe, status != IB_WC_SUCCESS); } - spin_lock_irqsave(&qp->s_lock, flags); - last = qp->s_last; + old_last = last = qp->s_last; if (++last >= qp->s_size) last = 0; qp->s_last = last; - spin_unlock_irqrestore(&qp->s_lock, flags); + if (qp->s_cur == old_last) + qp->s_cur = last; + if (qp->s_tail == old_last) + qp->s_tail = last; + if (qp->state == IB_QPS_SQD && last == qp->s_cur) + qp->s_draining = 0; } diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c index bfe8926..7fd18e8 100644 --- a/drivers/infiniband/hw/ipath/ipath_uc.c +++ b/drivers/infiniband/hw/ipath/ipath_uc.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved. + * Copyright (c) 2006, 2007, 2008 QLogic Corporation. All rights reserved. * Copyright (c) 2005, 2006 PathScale, Inc. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -47,14 +47,30 @@ int ipath_make_uc_req(struct ipath_qp *qp) { struct ipath_other_headers *ohdr; struct ipath_swqe *wqe; + unsigned long flags; u32 hwords; u32 bth0; u32 len; u32 pmtu = ib_mtu_enum_to_int(qp->path_mtu); int ret = 0; - if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) + spin_lock_irqsave(&qp->s_lock, flags); + + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK)) { + if (!(ib_ipath_state_ops[qp->state] & IPATH_FLUSH_SEND)) + goto bail; + /* We are in the error state, flush the work request. */ + if (qp->s_last == qp->s_head) + goto bail; + /* If DMAs are in progress, we can't flush immediately. */ + if (atomic_read(&qp->s_dma_busy)) { + qp->s_flags |= IPATH_S_WAIT_DMA; + goto bail; + } + wqe = get_swqe_ptr(qp, qp->s_last); + ipath_send_complete(qp, wqe, IB_WC_WR_FLUSH_ERR); goto done; + } ohdr = &qp->s_hdr.u.oth; if (qp->remote_ah_attr.ah_flags & IB_AH_GRH) @@ -69,9 +85,12 @@ int ipath_make_uc_req(struct ipath_qp *qp) qp->s_wqe = NULL; switch (qp->s_state) { default: + if (!(ib_ipath_state_ops[qp->state] & + IPATH_PROCESS_NEXT_SEND_OK)) + goto bail; /* Check if send work queue is empty. */ if (qp->s_cur == qp->s_head) - goto done; + goto bail; /* * Start a new request. */ @@ -134,7 +153,7 @@ int ipath_make_uc_req(struct ipath_qp *qp) break; default: - goto done; + goto bail; } break; @@ -194,9 +213,14 @@ int ipath_make_uc_req(struct ipath_qp *qp) ipath_make_ruc_header(to_idev(qp->ibqp.device), qp, ohdr, bth0 | (qp->s_state << 24), qp->s_next_psn++ & IPATH_PSN_MASK); +done: ret = 1; + goto unlock; -done: +bail: + qp->s_flags &= ~IPATH_S_BUSY; +unlock: + spin_unlock_irqrestore(&qp->s_lock, flags); return ret; } @@ -258,8 +282,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, */ opcode = be32_to_cpu(ohdr->bth[0]) >> 24; - wc.imm_data = 0; - wc.wc_flags = 0; + memset(&wc, 0, sizeof wc); /* Compare the PSN verses the expected PSN. */ if (unlikely(ipath_cmp24(psn, qp->r_psn) != 0)) { @@ -322,8 +345,8 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, case OP(SEND_ONLY): case OP(SEND_ONLY_WITH_IMMEDIATE): send_first: - if (qp->r_reuse_sge) { - qp->r_reuse_sge = 0; + if (qp->r_flags & IPATH_R_REUSE_SGE) { + qp->r_flags &= ~IPATH_R_REUSE_SGE; qp->r_sge = qp->s_rdma_read_sge; } else if (!ipath_get_rwqe(qp, 0)) { dev->n_pkt_drops++; @@ -340,13 +363,13 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, case OP(SEND_MIDDLE): /* Check for invalid length PMTU or posted rwqe len. */ if (unlikely(tlen != (hdrsize + pmtu + 4))) { - qp->r_reuse_sge = 1; + qp->r_flags |= IPATH_R_REUSE_SGE; dev->n_pkt_drops++; goto done; } qp->r_rcv_len += pmtu; if (unlikely(qp->r_rcv_len > qp->r_len)) { - qp->r_reuse_sge = 1; + qp->r_flags |= IPATH_R_REUSE_SGE; dev->n_pkt_drops++; goto done; } @@ -372,7 +395,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, /* Check for invalid length. 
*/ /* XXX LAST len should be >= 1 */ if (unlikely(tlen < (hdrsize + pad + 4))) { - qp->r_reuse_sge = 1; + qp->r_flags |= IPATH_R_REUSE_SGE; dev->n_pkt_drops++; goto done; } @@ -380,7 +403,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, tlen -= (hdrsize + pad + 4); wc.byte_len = tlen + qp->r_rcv_len; if (unlikely(wc.byte_len > qp->r_len)) { - qp->r_reuse_sge = 1; + qp->r_flags |= IPATH_R_REUSE_SGE; dev->n_pkt_drops++; goto done; } @@ -390,14 +413,10 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; wc.qp = &qp->ibqp; wc.src_qp = qp->remote_qpn; - wc.pkey_index = 0; wc.slid = qp->remote_ah_attr.dlid; wc.sl = qp->remote_ah_attr.sl; - wc.dlid_path_bits = 0; - wc.port_num = 0; /* Signal completion event if the solicited bit is set. */ ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc, (ohdr->bth[0] & @@ -488,8 +507,8 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, dev->n_pkt_drops++; goto done; } - if (qp->r_reuse_sge) - qp->r_reuse_sge = 0; + if (qp->r_flags & IPATH_R_REUSE_SGE) + qp->r_flags &= ~IPATH_R_REUSE_SGE; else if (!ipath_get_rwqe(qp, 1)) { dev->n_pkt_drops++; goto done; diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c index 8b6a261..77ca8ca 100644 --- a/drivers/infiniband/hw/ipath/ipath_ud.c +++ b/drivers/infiniband/hw/ipath/ipath_ud.c @@ -65,9 +65,9 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe) u32 length; qp = ipath_lookup_qpn(&dev->qp_table, swqe->wr.wr.ud.remote_qpn); - if (!qp) { + if (!qp || !(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_RECV_OK)) { dev->n_pkt_drops++; - goto send_comp; + goto done; } rsge.sg_list = NULL; @@ -91,14 +91,12 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe) * present on the wire. */ length = swqe->length; + memset(&wc, 0, sizeof wc); wc.byte_len = length + sizeof(struct ib_grh); if (swqe->wr.opcode == IB_WR_SEND_WITH_IMM) { wc.wc_flags = IB_WC_WITH_IMM; wc.imm_data = swqe->wr.ex.imm_data; - } else { - wc.wc_flags = 0; - wc.imm_data = 0; } /* @@ -229,7 +227,6 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe) } wc.status = IB_WC_SUCCESS; wc.opcode = IB_WC_RECV; - wc.vendor_err = 0; wc.qp = &qp->ibqp; wc.src_qp = sqp->ibqp.qp_num; /* XXX do we know which pkey matched? Only needed for GSI. */ @@ -248,8 +245,7 @@ drop: kfree(rsge.sg_list); if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); -send_comp: - ipath_send_complete(sqp, swqe, IB_WC_SUCCESS); +done:; } /** @@ -264,6 +260,7 @@ int ipath_make_ud_req(struct ipath_qp *qp) struct ipath_other_headers *ohdr; struct ib_ah_attr *ah_attr; struct ipath_swqe *wqe; + unsigned long flags; u32 nwords; u32 extra_bytes; u32 bth0; @@ -271,13 +268,30 @@ int ipath_make_ud_req(struct ipath_qp *qp) u16 lid; int ret = 0; - if (unlikely(!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK))) - goto bail; + spin_lock_irqsave(&qp->s_lock, flags); + + if (!(ib_ipath_state_ops[qp->state] & IPATH_PROCESS_NEXT_SEND_OK)) { + if (!(ib_ipath_state_ops[qp->state] & IPATH_FLUSH_SEND)) + goto bail; + /* We are in the error state, flush the work request. */ + if (qp->s_last == qp->s_head) + goto bail; + /* If DMAs are in progress, we can't flush immediately. 
*/ + if (atomic_read(&qp->s_dma_busy)) { + qp->s_flags |= IPATH_S_WAIT_DMA; + goto bail; + } + wqe = get_swqe_ptr(qp, qp->s_last); + ipath_send_complete(qp, wqe, IB_WC_WR_FLUSH_ERR); + goto done; + } if (qp->s_cur == qp->s_head) goto bail; wqe = get_swqe_ptr(qp, qp->s_cur); + if (++qp->s_cur >= qp->s_size) + qp->s_cur = 0; /* Construct the header. */ ah_attr = &to_iah(wqe->wr.wr.ud.ah)->attr; @@ -288,10 +302,23 @@ int ipath_make_ud_req(struct ipath_qp *qp) dev->n_unicast_xmit++; } else { dev->n_unicast_xmit++; - lid = ah_attr->dlid & - ~((1 << dev->dd->ipath_lmc) - 1); + lid = ah_attr->dlid & ~((1 << dev->dd->ipath_lmc) - 1); if (unlikely(lid == dev->dd->ipath_lid)) { + /* + * If DMAs are in progress, we can't generate + * a completion for the loopback packet since + * it would be out of order. + * XXX Instead of waiting, we could queue a + * zero length descriptor so we get a callback. + */ + if (atomic_read(&qp->s_dma_busy)) { + qp->s_flags |= IPATH_S_WAIT_DMA; + goto bail; + } + spin_unlock_irqrestore(&qp->s_lock, flags); ipath_ud_loopback(qp, wqe); + spin_lock_irqsave(&qp->s_lock, flags); + ipath_send_complete(qp, wqe, IB_WC_SUCCESS); goto done; } } @@ -368,11 +395,13 @@ int ipath_make_ud_req(struct ipath_qp *qp) ohdr->u.ud.deth[1] = cpu_to_be32(qp->ibqp.qp_num); done: - if (++qp->s_cur >= qp->s_size) - qp->s_cur = 0; ret = 1; + goto unlock; bail: + qp->s_flags &= ~IPATH_S_BUSY; +unlock: + spin_unlock_irqrestore(&qp->s_lock, flags); return ret; } @@ -506,8 +535,8 @@ void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, /* * Get the next work request entry to find where to put the data. */ - if (qp->r_reuse_sge) - qp->r_reuse_sge = 0; + if (qp->r_flags & IPATH_R_REUSE_SGE) + qp->r_flags &= ~IPATH_R_REUSE_SGE; else if (!ipath_get_rwqe(qp, 0)) { /* * Count VL15 packets dropped due to no receive buffer. @@ -523,7 +552,7 @@ void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, } /* Silently drop packets which are too big. 
*/ if (wc.byte_len > qp->r_len) { - qp->r_reuse_sge = 1; + qp->r_flags |= IPATH_R_REUSE_SGE; dev->n_pkt_drops++; goto bail; } @@ -535,7 +564,8 @@ void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, ipath_skip_sge(&qp->r_sge, sizeof(struct ib_grh)); ipath_copy_sge(&qp->r_sge, data, wc.byte_len - sizeof(struct ib_grh)); - qp->r_wrid_valid = 0; + if (!test_and_clear_bit(IPATH_R_WRID_VALID, &qp->r_aflags)) + goto bail; wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; wc.opcode = IB_WC_RECV; diff --git a/drivers/infiniband/hw/ipath/ipath_user_sdma.h b/drivers/infiniband/hw/ipath/ipath_user_sdma.h index e70946c..fc76316 100644 --- a/drivers/infiniband/hw/ipath/ipath_user_sdma.h +++ b/drivers/infiniband/hw/ipath/ipath_user_sdma.h @@ -45,8 +45,6 @@ int ipath_user_sdma_writev(struct ipath_devdata *dd, int ipath_user_sdma_make_progress(struct ipath_devdata *dd, struct ipath_user_sdma_queue *pq); -int ipath_user_sdma_pkt_sent(const struct ipath_user_sdma_queue *pq, - u32 counter); void ipath_user_sdma_queue_drain(struct ipath_devdata *dd, struct ipath_user_sdma_queue *pq); diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index 22bb42d..e0ec540 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -111,16 +111,24 @@ static unsigned int ib_ipath_disable_sma; module_param_named(disable_sma, ib_ipath_disable_sma, uint, S_IWUSR | S_IRUGO); MODULE_PARM_DESC(disable_sma, "Disable the SMA"); +/* + * Note that it is OK to post send work requests in the SQE and ERR + * states; ipath_do_send() will process them and generate error + * completions as per IB 1.2 C10-96. + */ const int ib_ipath_state_ops[IB_QPS_ERR + 1] = { [IB_QPS_RESET] = 0, [IB_QPS_INIT] = IPATH_POST_RECV_OK, [IB_QPS_RTR] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK, [IB_QPS_RTS] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK | - IPATH_POST_SEND_OK | IPATH_PROCESS_SEND_OK, + IPATH_POST_SEND_OK | IPATH_PROCESS_SEND_OK | + IPATH_PROCESS_NEXT_SEND_OK, [IB_QPS_SQD] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK | - IPATH_POST_SEND_OK, - [IB_QPS_SQE] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK, - [IB_QPS_ERR] = 0, + IPATH_POST_SEND_OK | IPATH_PROCESS_SEND_OK, + [IB_QPS_SQE] = IPATH_POST_RECV_OK | IPATH_PROCESS_RECV_OK | + IPATH_POST_SEND_OK | IPATH_FLUSH_SEND, + [IB_QPS_ERR] = IPATH_POST_RECV_OK | IPATH_FLUSH_RECV | + IPATH_POST_SEND_OK | IPATH_FLUSH_SEND, }; struct ipath_ucontext { @@ -230,18 +238,6 @@ void ipath_skip_sge(struct ipath_sge_state *ss, u32 length) } } -static void ipath_flush_wqe(struct ipath_qp *qp, struct ib_send_wr *wr) -{ - struct ib_wc wc; - - memset(&wc, 0, sizeof(wc)); - wc.wr_id = wr->wr_id; - wc.status = IB_WC_WR_FLUSH_ERR; - wc.opcode = ib_ipath_wc_opcode[wr->opcode]; - wc.qp = &qp->ibqp; - ipath_cq_enter(to_icq(qp->ibqp.send_cq), &wc, 1); -} - /* * Count the number of DMA descriptors needed to send length bytes of data. * Don't modify the ipath_sge_state to get the count. @@ -347,14 +343,8 @@ static int ipath_post_one_send(struct ipath_qp *qp, struct ib_send_wr *wr) spin_lock_irqsave(&qp->s_lock, flags); /* Check that state is OK to post send. */ - if (unlikely(!(ib_ipath_state_ops[qp->state] & IPATH_POST_SEND_OK))) { - if (qp->state != IB_QPS_SQE && qp->state != IB_QPS_ERR) - goto bail_inval; - /* C10-96 says generate a flushed completion entry. 
*/ - ipath_flush_wqe(qp, wr); - ret = 0; - goto bail; - } + if (unlikely(!(ib_ipath_state_ops[qp->state] & IPATH_POST_SEND_OK))) + goto bail_inval; /* IB spec says that num_sge == 0 is OK. */ if (wr->num_sge > qp->s_max_sge) @@ -677,6 +667,7 @@ bail:; static void ipath_ib_timer(struct ipath_ibdev *dev) { struct ipath_qp *resend = NULL; + struct ipath_qp *rnr = NULL; struct list_head *last; struct ipath_qp *qp; unsigned long flags; @@ -703,7 +694,9 @@ static void ipath_ib_timer(struct ipath_ibdev *dev) if (--qp->s_rnr_timeout == 0) { do { list_del_init(&qp->timerwait); - tasklet_hi_schedule(&qp->s_task); + qp->timer_next = rnr; + rnr = qp; + atomic_inc(&qp->refcount); if (list_empty(last)) break; qp = list_entry(last->next, struct ipath_qp, @@ -743,9 +736,13 @@ static void ipath_ib_timer(struct ipath_ibdev *dev) spin_unlock_irqrestore(&dev->pending_lock, flags); /* XXX What if timer fires again while this is running? */ - for (qp = resend; qp != NULL; qp = qp->timer_next) { + while (resend != NULL) { + qp = resend; + resend = qp->timer_next; + spin_lock_irqsave(&qp->s_lock, flags); - if (qp->s_last != qp->s_tail && qp->state == IB_QPS_RTS) { + if (qp->s_last != qp->s_tail && + ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK) { dev->n_timeouts++; ipath_restart_rc(qp, qp->s_last_psn + 1); } @@ -755,6 +752,19 @@ static void ipath_ib_timer(struct ipath_ibdev *dev) if (atomic_dec_and_test(&qp->refcount)) wake_up(&qp->wait); } + while (rnr != NULL) { + qp = rnr; + rnr = qp->timer_next; + + spin_lock_irqsave(&qp->s_lock, flags); + if (ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK) + ipath_schedule_send(qp); + spin_unlock_irqrestore(&qp->s_lock, flags); + + /* Notify ipath_destroy_qp() if it is waiting. */ + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } } static void update_sge(struct ipath_sge_state *ss, u32 length) @@ -1010,13 +1020,24 @@ static void sdma_complete(void *cookie, int status) struct ipath_verbs_txreq *tx = cookie; struct ipath_qp *qp = tx->qp; struct ipath_ibdev *dev = to_idev(qp->ibqp.device); + unsigned int flags; + enum ib_wc_status ibs = status == IPATH_SDMA_TXREQ_S_OK ? + IB_WC_SUCCESS : IB_WC_WR_FLUSH_ERR; - /* Generate a completion queue entry if needed */ - if (qp->ibqp.qp_type != IB_QPT_RC && tx->wqe) { - enum ib_wc_status ibs = status == IPATH_SDMA_TXREQ_S_OK ? - IB_WC_SUCCESS : IB_WC_WR_FLUSH_ERR; - + if (atomic_dec_and_test(&qp->s_dma_busy)) { + spin_lock_irqsave(&qp->s_lock, flags); + if (tx->wqe) + ipath_send_complete(qp, tx->wqe, ibs); + if ((ib_ipath_state_ops[qp->state] & IPATH_FLUSH_SEND && + qp->s_last != qp->s_head) || + (qp->s_flags & IPATH_S_WAIT_DMA)) + ipath_schedule_send(qp); + spin_unlock_irqrestore(&qp->s_lock, flags); + wake_up(&qp->wait_dma); + } else if (tx->wqe) { + spin_lock_irqsave(&qp->s_lock, flags); ipath_send_complete(qp, tx->wqe, ibs); + spin_unlock_irqrestore(&qp->s_lock, flags); } if (tx->txreq.flags & IPATH_SDMA_TXREQ_F_FREEBUF) @@ -1027,6 +1048,21 @@ static void sdma_complete(void *cookie, int status) wake_up(&qp->wait); } +static void decrement_dma_busy(struct ipath_qp *qp) +{ + unsigned int flags; + + if (atomic_dec_and_test(&qp->s_dma_busy)) { + spin_lock_irqsave(&qp->s_lock, flags); + if ((ib_ipath_state_ops[qp->state] & IPATH_FLUSH_SEND && + qp->s_last != qp->s_head) || + (qp->s_flags & IPATH_S_WAIT_DMA)) + ipath_schedule_send(qp); + spin_unlock_irqrestore(&qp->s_lock, flags); + wake_up(&qp->wait_dma); + } +} + /* * Compute the number of clock cycles of delay before sending the next packet. 
* The multipliers reflect the number of clocks for the fastest rate so @@ -1065,9 +1101,12 @@ static int ipath_verbs_send_dma(struct ipath_qp *qp, if (tx) { qp->s_tx = NULL; /* resend previously constructed packet */ + atomic_inc(&qp->s_dma_busy); ret = ipath_sdma_verbs_send(dd, tx->ss, tx->len, tx); - if (ret) + if (ret) { qp->s_tx = tx; + decrement_dma_busy(qp); + } goto bail; } @@ -1118,12 +1157,14 @@ static int ipath_verbs_send_dma(struct ipath_qp *qp, tx->txreq.sg_count = ndesc; tx->map_len = (hdrwords + 2) << 2; tx->txreq.map_addr = &tx->hdr; + atomic_inc(&qp->s_dma_busy); ret = ipath_sdma_verbs_send(dd, ss, dwords, tx); if (ret) { /* save ss and length in dwords */ tx->ss = ss; tx->len = dwords; qp->s_tx = tx; + decrement_dma_busy(qp); } goto bail; } @@ -1144,6 +1185,7 @@ static int ipath_verbs_send_dma(struct ipath_qp *qp, memcpy(piobuf, hdr, hdrwords << 2); ipath_copy_from_sge(piobuf + hdrwords, ss, len); + atomic_inc(&qp->s_dma_busy); ret = ipath_sdma_verbs_send(dd, NULL, 0, tx); /* * If we couldn't queue the DMA request, save the info @@ -1154,6 +1196,7 @@ static int ipath_verbs_send_dma(struct ipath_qp *qp, tx->ss = NULL; tx->len = 0; qp->s_tx = tx; + decrement_dma_busy(qp); } dev->n_unaligned++; goto bail; @@ -1177,6 +1220,7 @@ static int ipath_verbs_send_pio(struct ipath_qp *qp, unsigned flush_wc; u32 control; int ret; + unsigned int flags; piobuf = ipath_getpiobuf(dd, plen, NULL); if (unlikely(piobuf == NULL)) { @@ -1247,8 +1291,11 @@ static int ipath_verbs_send_pio(struct ipath_qp *qp, } copy_io(piobuf, ss, len, flush_wc); done: - if (qp->s_wqe) + if (qp->s_wqe) { + spin_lock_irqsave(&qp->s_lock, flags); ipath_send_complete(qp, qp->s_wqe, IB_WC_SUCCESS); + spin_unlock_irqrestore(&qp->s_lock, flags); + } ret = 0; bail: return ret; @@ -1281,19 +1328,12 @@ int ipath_verbs_send(struct ipath_qp *qp, struct ipath_ib_header *hdr, * can defer SDMA restart until link goes ACTIVE without * worrying about just how we got there. */ - if (qp->ibqp.qp_type == IB_QPT_SMI) + if (qp->ibqp.qp_type == IB_QPT_SMI || + !(dd->ipath_flags & IPATH_HAS_SEND_DMA)) ret = ipath_verbs_send_pio(qp, hdr, hdrwords, ss, len, plen, dwords); - /* All non-VL15 packets are dropped if link is not ACTIVE */ - else if (!(dd->ipath_flags & IPATH_LINKACTIVE)) { - if (qp->s_wqe) - ipath_send_complete(qp, qp->s_wqe, IB_WC_SUCCESS); - ret = 0; - } else if (dd->ipath_flags & IPATH_HAS_SEND_DMA) - ret = ipath_verbs_send_dma(qp, hdr, hdrwords, ss, len, - plen, dwords); else - ret = ipath_verbs_send_pio(qp, hdr, hdrwords, ss, len, + ret = ipath_verbs_send_dma(qp, hdr, hdrwords, ss, len, plen, dwords); return ret; @@ -1401,27 +1441,46 @@ bail: * This is called from ipath_intr() at interrupt level when a PIO buffer is * available after ipath_verbs_send() returned an error that no buffers were * available. Return 1 if we consumed all the PIO buffers and we still have - * QPs waiting for buffers (for now, just do a tasklet_hi_schedule and + * QPs waiting for buffers (for now, just restart the send tasklet and * return zero). 
*/ int ipath_ib_piobufavail(struct ipath_ibdev *dev) { + struct list_head *list; + struct ipath_qp *qplist; struct ipath_qp *qp; unsigned long flags; if (dev == NULL) goto bail; + list = &dev->piowait; + qplist = NULL; + spin_lock_irqsave(&dev->pending_lock, flags); - while (!list_empty(&dev->piowait)) { - qp = list_entry(dev->piowait.next, struct ipath_qp, - piowait); + while (!list_empty(list)) { + qp = list_entry(list->next, struct ipath_qp, piowait); list_del_init(&qp->piowait); - clear_bit(IPATH_S_BUSY, &qp->s_busy); - tasklet_hi_schedule(&qp->s_task); + qp->pio_next = qplist; + qplist = qp; + atomic_inc(&qp->refcount); } spin_unlock_irqrestore(&dev->pending_lock, flags); + while (qplist != NULL) { + qp = qplist; + qplist = qp->pio_next; + + spin_lock_irqsave(&qp->s_lock, flags); + if (ib_ipath_state_ops[qp->state] & IPATH_PROCESS_SEND_OK) + ipath_schedule_send(qp); + spin_unlock_irqrestore(&qp->s_lock, flags); + + /* Notify ipath_destroy_qp() if it is waiting. */ + if (atomic_dec_and_test(&qp->refcount)) + wake_up(&qp->wait); + } + bail: return 0; } @@ -2143,11 +2202,12 @@ bail: void ipath_unregister_ib_device(struct ipath_ibdev *dev) { struct ib_device *ibdev = &dev->ibdev; - - disable_timer(dev->dd); + u32 qps_inuse; ib_unregister_device(ibdev); + disable_timer(dev->dd); + if (!list_empty(&dev->pending[0]) || !list_empty(&dev->pending[1]) || !list_empty(&dev->pending[2])) @@ -2162,7 +2222,10 @@ void ipath_unregister_ib_device(struct ipath_ibdev *dev) * Note that ipath_unregister_ib_device() can be called before all * the QPs are destroyed! */ - ipath_free_all_qps(&dev->qp_table); + qps_inuse = ipath_free_all_qps(&dev->qp_table); + if (qps_inuse) + ipath_dev_err(dev->dd, "QP memory leak! %u still in use\n", + qps_inuse); kfree(dev->qp_table.table); kfree(dev->lk_table.table); kfree(dev->txreq_bufs); @@ -2213,17 +2276,14 @@ static ssize_t show_stats(struct device *device, struct device_attribute *attr, "RC OTH NAKs %d\n" "RC timeouts %d\n" "RC RDMA dup %d\n" - "RC stalls %d\n" "piobuf wait %d\n" - "no piobuf %d\n" "unaligned %d\n" "PKT drops %d\n" "WQE errs %d\n", dev->n_rc_resends, dev->n_rc_qacks, dev->n_rc_acks, dev->n_seq_naks, dev->n_rdma_seq, dev->n_rnr_naks, dev->n_other_naks, dev->n_timeouts, - dev->n_rdma_dup_busy, dev->n_rc_stalls, dev->n_piowait, - dev->n_no_piobuf, dev->n_unaligned, + dev->n_rdma_dup_busy, dev->n_piowait, dev->n_unaligned, dev->n_pkt_drops, dev->n_wqe_errs); for (i = 0; i < ARRAY_SIZE(dev->opstats); i++) { const struct ipath_opcode_stats *si = &dev->opstats[i]; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 4c7c2aa..eed1fdc 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -74,6 +74,11 @@ #define IPATH_POST_RECV_OK 0x02 #define IPATH_PROCESS_RECV_OK 0x04 #define IPATH_PROCESS_SEND_OK 0x08 +#define IPATH_PROCESS_NEXT_SEND_OK 0x10 +#define IPATH_FLUSH_SEND 0x20 +#define IPATH_FLUSH_RECV 0x40 +#define IPATH_PROCESS_OR_FLUSH_SEND \ + (IPATH_PROCESS_SEND_OK | IPATH_FLUSH_SEND) /* IB Performance Manager status values */ #define IB_PMA_SAMPLE_STATUS_DONE 0x00 @@ -353,12 +358,14 @@ struct ipath_qp { struct ib_qp ibqp; struct ipath_qp *next; /* link list for QPN hash table */ struct ipath_qp *timer_next; /* link list for ipath_ib_timer() */ + struct ipath_qp *pio_next; /* link for ipath_ib_piobufavail() */ struct list_head piowait; /* link for wait PIO buf */ struct list_head timerwait; /* link for waiting for timeouts */ struct ib_ah_attr 
remote_ah_attr; struct ipath_ib_header s_hdr; /* next packet header to send */ atomic_t refcount; wait_queue_head_t wait; + wait_queue_head_t wait_dma; struct tasklet_struct s_task; struct ipath_mmap_info *ip; struct ipath_sge_state *s_cur_sge; @@ -369,7 +376,7 @@ struct ipath_qp { struct ipath_sge_state s_rdma_read_sge; struct ipath_sge_state r_sge; /* current receive data */ spinlock_t s_lock; - unsigned long s_busy; + atomic_t s_dma_busy; u16 s_pkt_delay; u16 s_hdrwords; /* size of s_hdr in 32 bit words */ u32 s_cur_size; /* size of send packet in bytes */ @@ -383,6 +390,7 @@ struct ipath_qp { u32 s_rnr_timeout; /* number of milliseconds for RNR timeout */ u32 r_ack_psn; /* PSN for next ACK or atomic ACK */ u64 r_wr_id; /* ID for current receive WQE */ + unsigned long r_aflags; u32 r_len; /* total length of r_sge */ u32 r_rcv_len; /* receive data len processed */ u32 r_psn; /* expected rcv packet sequence number */ @@ -394,8 +402,7 @@ struct ipath_qp { u8 r_state; /* opcode of last packet received */ u8 r_nak_state; /* non-zero if NAK is pending */ u8 r_min_rnr_timer; /* retry timeout value for RNR NAKs */ - u8 r_reuse_sge; /* for UC receive errors */ - u8 r_wrid_valid; /* r_wrid set but CQ entry not yet made */ + u8 r_flags; u8 r_max_rd_atomic; /* max number of RDMA read/atomic to receive */ u8 r_head_ack_queue; /* index into s_ack_queue[] */ u8 qp_access_flags; @@ -404,13 +411,13 @@ struct ipath_qp { u8 s_rnr_retry_cnt; u8 s_retry; /* requester retry counter */ u8 s_rnr_retry; /* requester RNR retry counter */ - u8 s_wait_credit; /* limit number of unacked packets sent */ u8 s_pkey_index; /* PKEY index to use */ u8 s_max_rd_atomic; /* max number of RDMA read/atomic to send */ u8 s_num_rd_atomic; /* number of RDMA read/atomic pending */ u8 s_tail_ack_queue; /* index into s_ack_queue[] */ u8 s_flags; u8 s_dmult; + u8 s_draining; u8 timeout; /* Timeout for this QP */ enum ib_mtu path_mtu; u32 remote_qpn; @@ -428,16 +435,39 @@ struct ipath_qp { struct ipath_sge r_sg_list[0]; /* verified SGEs */ }; -/* Bit definition for s_busy. */ -#define IPATH_S_BUSY 0 +/* + * Atomic bit definitions for r_aflags. + */ +#define IPATH_R_WRID_VALID 0 + +/* + * Bit definitions for r_flags. + */ +#define IPATH_R_REUSE_SGE 0x01 /* * Bit definitions for s_flags. + * + * IPATH_S_FENCE_PENDING - waiting for all prior RDMA read or atomic SWQEs + * before processing the next SWQE + * IPATH_S_RDMAR_PENDING - waiting for any RDMA read or atomic SWQEs + * before processing the next SWQE + * IPATH_S_WAITING - waiting for RNR timeout or send buffer available. + * IPATH_S_WAIT_SSN_CREDIT - waiting for RC credits to process next SWQE + * IPATH_S_WAIT_DMA - waiting for send DMA queue to drain before generating + next send completion entry not via send DMA. 
*/ #define IPATH_S_SIGNAL_REQ_WR 0x01 #define IPATH_S_FENCE_PENDING 0x02 #define IPATH_S_RDMAR_PENDING 0x04 #define IPATH_S_ACK_PENDING 0x08 +#define IPATH_S_BUSY 0x10 +#define IPATH_S_WAITING 0x20 +#define IPATH_S_WAIT_SSN_CREDIT 0x40 +#define IPATH_S_WAIT_DMA 0x80 + +#define IPATH_S_ANY_WAIT (IPATH_S_FENCE_PENDING | IPATH_S_RDMAR_PENDING | \ + IPATH_S_WAITING | IPATH_S_WAIT_SSN_CREDIT | IPATH_S_WAIT_DMA) #define IPATH_PSN_CREDIT 512 @@ -573,13 +603,11 @@ struct ipath_ibdev { u32 n_rnr_naks; u32 n_other_naks; u32 n_timeouts; - u32 n_rc_stalls; u32 n_pkt_drops; u32 n_vl15_dropped; u32 n_wqe_errs; u32 n_rdma_dup_busy; u32 n_piowait; - u32 n_no_piobuf; u32 n_unaligned; u32 port_cap_flags; u32 pma_sample_start; @@ -657,6 +685,17 @@ static inline struct ipath_ibdev *to_idev(struct ib_device *ibdev) return container_of(ibdev, struct ipath_ibdev, ibdev); } +/* + * This must be called with s_lock held. + */ +static inline void ipath_schedule_send(struct ipath_qp *qp) +{ + if (qp->s_flags & IPATH_S_ANY_WAIT) + qp->s_flags &= ~IPATH_S_ANY_WAIT; + if (!(qp->s_flags & IPATH_S_BUSY)) + tasklet_hi_schedule(&qp->s_task); +} + int ipath_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, @@ -706,7 +745,7 @@ int ipath_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int ipath_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask, struct ib_qp_init_attr *init_attr); -void ipath_free_all_qps(struct ipath_qp_table *qpt); +unsigned ipath_free_all_qps(struct ipath_qp_table *qpt); int ipath_init_qp_table(struct ipath_ibdev *idev, int size); From ralph.campbell at qlogic.com Thu May 8 11:55:28 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 08 May 2008 11:55:28 -0700 Subject: [ofa-general] [PATCH 3/3] IB/ipath - fix RDMA read response sequence checking In-Reply-To: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com> References: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com> Message-ID: <20080508185528.8547.31626.stgit@eng-46.mv.qlogic.com> If an out of sequence RDMA read response middle or last packet is received, we should only resend the RDMA read request on the first out of sequence packet and drop subsequent out of sequence packets otherwise, we get "too many retries". Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_rc.c | 7 +++++++ drivers/infiniband/hw/ipath/ipath_verbs.h | 1 + 2 files changed, 8 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c index 5b5276a..108df66 100644 --- a/drivers/infiniband/hw/ipath/ipath_rc.c +++ b/drivers/infiniband/hw/ipath/ipath_rc.c @@ -1189,6 +1189,7 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, wqe = get_swqe_ptr(qp, qp->s_last); if (unlikely(wqe->wr.opcode != IB_WR_RDMA_READ)) goto ack_op_err; + qp->r_flags &= ~IPATH_R_RDMAR_SEQ; /* * If this is a response to a resent RDMA read, we * have to be careful to copy the data to the right @@ -1202,6 +1203,9 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, /* no AETH, no ACK */ if (unlikely(ipath_cmp24(psn, qp->s_last_psn + 1))) { dev->n_rdma_seq++; + if (qp->r_flags & IPATH_R_RDMAR_SEQ) + goto ack_done; + qp->r_flags |= IPATH_R_RDMAR_SEQ; ipath_restart_rc(qp, qp->s_last_psn + 1); goto ack_done; } @@ -1263,6 +1267,9 @@ static inline void ipath_rc_rcv_resp(struct ipath_ibdev *dev, /* ACKs READ req. 
*/ if (unlikely(ipath_cmp24(psn, qp->s_last_psn + 1))) { dev->n_rdma_seq++; + if (qp->r_flags & IPATH_R_RDMAR_SEQ) + goto ack_done; + qp->r_flags |= IPATH_R_RDMAR_SEQ; ipath_restart_rc(qp, qp->s_last_psn + 1); goto ack_done; } diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index eed1fdc..d64ca0f 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -444,6 +444,7 @@ struct ipath_qp { * Bit definitions for r_flags. */ #define IPATH_R_REUSE_SGE 0x01 +#define IPATH_R_RDMAR_SEQ 0x02 /* * Bit definitions for s_flags. From sean.hefty at intel.com Thu May 8 11:58:03 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 8 May 2008 11:58:03 -0700 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <48224662.60401@opengridcomputing.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> <48222FFD.40302@opengridcomputing.com> <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com> <48224662.60401@opengridcomputing.com> Message-ID: <000a01c8b13d$716d2980$465a180a@amr.corp.intel.com> >The requirement is mostly driven from the receiving side. For cxgb3 it >is anyway... Maybe you can help me understand the spec here. If we ignore this feature for a minute, then the side that calls rdma_connect() must instead issue the first 'send' request to the server. Can the first 'send' be a 0B rdma write or read? Why wouldn't the target of that request not have to transition to connected? Is the issue that there's no way for the receiving FW/driver to know that this has occurred so that it can signal that the connection has been established? I.e. a client that does this must signal the server that things are ready through some out of band means. >server sends MPA Start response with "lets do RTR and send me X" where >X could be 0B write, 0B read request or 0B send. Are there any restrictions where a client may not be able to issue what the server requests? E.g. the hardware doesn't issue 0B writes. - Sean From swise at opengridcomputing.com Thu May 8 12:17:02 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 08 May 2008 14:17:02 -0500 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <000a01c8b13d$716d2980$465a180a@amr.corp.intel.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> <48222FFD.40302@opengridcomputing.com> <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com> <48224662.60401@opengridcomputing.com> <000a01c8b13d$716d2980$465a180a@amr.corp.intel.com> Message-ID: <482351AE.2050800@opengridcomputing.com> An HTML attachment was scrubbed... 
URL:

From swise at opengridcomputing.com  Thu May  8 12:25:13 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 08 May 2008 14:25:13 -0500
Subject: [ofa-general] Re: [PATCH] Request For Comments:
In-Reply-To: <482351AE.2050800@opengridcomputing.com>
References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> <48222FFD.40302@opengridcomputing.com> <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com> <48224662.60401@opengridcomputing.com> <000a01c8b13d$716d2980$465a180a@amr.corp.intel.com> <482351AE.2050800@opengridcomputing.com>
Message-ID: <48235399.4030609@opengridcomputing.com>

From RFC 5044, section 7.1.2 "Connection Startup Rules", Page 29:

   4.  MPA Responder mode implementations MUST receive and validate at
       least one FPDU before sending any FPDUs or Markers.

       Note: This requirement is present to allow the Initiator time to
       get its receiver into Full Operation before an FPDU arrives,
       avoiding potential race conditions at the Initiator. This
       was also subject to some debate in the work group before
       rough consensus was reached. Eliminating this requirement
       would allow faster startup in some types of applications.
       However, that would also make certain implementations
       (particularly "dual stack") much harder.

Steve Wise wrote:
> Sean Hefty wrote:
>>> The requirement is mostly driven from the receiving side. For cxgb3 it
>>> is anyway...
>>>
>>
>> Maybe you can help me understand the spec here. If we ignore this feature for a
>> minute, then the side that calls rdma_connect() must instead issue the first
>> 'send' request to the server. Can the first 'send' be a 0B rdma write or read?
>>
> According to the MPA IETF RFC, the initiator must send the first
> FPDU. That could be anything. The spec leaves it up to the ULP.
>
>> Why wouldn't the target of that request not have to transition to connected?
>>
>>
> I don't understand this question? What does 'transition to connected'
> mean?
>
> The requirement is that the responder (the side that issues the
> rdma_accept in rdma-cma terms) _cannot_ send an FPDU until it first
> receives one from the initiator. How that is enforced is an
> implementation detail. The responder driver could hold off on the
> ESTABLISHED event until it receives the first FPDU. Or it could stall
> SQ processing until the first FPDU is received yet still indicate that
> the connection is ESTABLISHED.
>
>> Is the issue that there's no way for the receiving FW/driver to know that this
>> has occurred so that it can signal that the connection has been established?
>> I.e. a client that does this must signal the server that things are ready
>> through some out of band means.
>>
>>
> I don't understand what you're getting at exactly.
>
> The issue is that the server doesn't know when the client receives the
> MPA Start Response and has successfully transitioned the connection
> into RDMA mode. If the server sends an FPDU immediately following the
> MPA Start Response (which is in streaming mode), then it's possible for
> that first FPDU to get passed up to the driver/ULP as streaming mode
> data. Which breaks everything. Soooo, the spec says the server
> cannot send an FPDU until it first receives one and thus _knows_ the
> client is in RDMA mode (by virtue of the fact that the client sent an
> FPDU).
>
>
>>> server sends MPA Start response with "lets do RTR and send me X" where
>>> X could be 0B write, 0B read request or 0B send.
>>>
>>
>> Are there any restrictions where a client may not be able to issue what the
>> server requests? E.g. the hardware doesn't issue 0B writes.
>>
>>
>
> Well I guess there could be. The consensus within the iWARP vendors
> at Reno was that 0B read would be ok. During the previous discussion on
> this list shortly after Reno, issues were raised that we should allow
> other types.
>
> We could make the MPA start request have more info than "I can do
> RTR". It could have "Here are the RTR msgs I can send". Does that
> help?
>
>
>
> Steve.
>
>

From swise at opengridcomputing.com  Thu May  8 12:42:55 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 08 May 2008 14:42:55 -0500
Subject: [ofa-general] Re: [PATCH] Request For Comments:
In-Reply-To: <48235399.4030609@opengridcomputing.com>
References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <4820A427.1070405@opengridcomputing.com> <48222D58.9020205@opengridcomputing.com> <48222FFD.40302@opengridcomputing.com> <001001c8b099$960a9560$5fd4180a@amr.corp.intel.com> <48224662.60401@opengridcomputing.com> <000a01c8b13d$716d2980$465a180a@amr.corp.intel.com> <482351AE.2050800@opengridcomputing.com> <48235399.4030609@opengridcomputing.com>
Message-ID: <482357BF.8050702@opengridcomputing.com>

Here is the thread where we discussed how to implement peer-to-peer
for iWARP in Nov/2007:

http://lists.openfabrics.org/pipermail/general/2007-November/043252.html

Steve Wise wrote:
>
>
> From RFC 5044, section 7.1.2 "Connection Startup Rules", Page 29:
>
>    4.  MPA Responder mode implementations MUST receive and validate at
>        least one FPDU before sending any FPDUs or Markers.
>
>        Note: This requirement is present to allow the Initiator time to
>        get its receiver into Full Operation before an FPDU arrives,
>        avoiding potential race conditions at the Initiator. This
>        was also subject to some debate in the work group before
>        rough consensus was reached. Eliminating this requirement
>        would allow faster startup in some types of applications.
>        However, that would also make certain implementations
>        (particularly "dual stack") much harder.
>
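[Editorial aside: to make the "0B read" RTR option in the exchange above concrete, here is a minimal sketch of how an initiator could post a zero-byte RDMA read as the connection's first FPDU using the stock libibverbs API. This is not code from the thread; the wr_id is an arbitrary cookie, and whether a zero-length read may carry a zero rkey is device-specific, so treat that as an assumption to verify against your RNIC.]

#include <string.h>
#include <infiniband/verbs.h>

/* Post a zero-byte RDMA read as the first FPDU on a freshly
 * connected iWARP QP (assumed already in RTS via the RDMA CM). */
static int post_zero_byte_read(struct ibv_qp *qp)
{
	struct ibv_send_wr wr;
	struct ibv_send_wr *bad_wr;

	memset(&wr, 0, sizeof wr);
	wr.wr_id = 0xcafe;                 /* arbitrary completion cookie */
	wr.opcode = IBV_WR_RDMA_READ;
	wr.send_flags = IBV_SEND_SIGNALED;
	wr.num_sge = 0;                    /* zero-byte: no local SGEs */
	wr.wr.rdma.remote_addr = 0;        /* nothing is actually read */
	wr.wr.rdma.rkey = 0;               /* assumption: ignored at 0B */

	return ibv_post_send(qp, &wr, &bad_wr);
}

The responder would simply wait for the corresponding receive/read activity before posting any work of its own, which is exactly the ordering RFC 5044 rule 4 above requires.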
From jsquyres at cisco.com  Thu May  8 14:28:46 2008
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 8 May 2008 17:28:46 -0400
Subject: [ofa-general] Verbs: IB vs. iWARP
Message-ID: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>

Over the past 24 hours, we assembled a list of differences between IB
and iWARP usage of verbs. I got a few comments on the text we
assembled, and figured it was time to turn this text over to
OpenFabrics to make it fully correct/complete/whatever, and then
publish it however you see fit.

I hope this starter text is helpful to you; enjoy.

-----
* struct ib_device.transport_type will be IBV_TRANSPORT_IWARP for
iWARP devices and IBV_TRANSPORT_IB for IB devices.

* ibv_query_gid():
   * When invoked on an IB HCA, will return the IB subnet prefix in
subnet_prefix and GUID of the port in the interface_id.
   * When invoked on an iWARP NIC, will return the NIC's MAC address
in subnet_prefix and 0 in the interface_id.

* iWARP QPs ''must'' be made with the RDMA CM; IB QPs can be made
using the IB CM, RDMA CM, or some other (assumedly out-of-band)
mechanism.

* When making QPs, some versions of iWARP drivers require the
initiator of the connection to send the first message (having the
non-initiator send the first message will terminate the connection).
Newer versions of iWARP firmware/drivers hide this functionality down
in the driver, so the ULP doesn't have to ensure that the initiator
sends the first message.

* When terminating connections via the RDMA CM (via the
rdma_disconnect() call or by simply destroying the QP without
disconnecting first), iWARP transports will automatically create a
CQE for any pending send or receive WRs with the status set to
IBV_WC_WR_FLUSH_ERR. Note that IB HCAs do the same thing, but the
iWARP RDMA CM disconnection progresses independently of the ULP,
meaning that when one side issues the disconnect, the other side will
automatically be disconnected (even if the ULP doesn't realize it).
IB HCAs may not process the disconnect until later (via RDMA CM or
otherwise), perhaps not until the ULP realizes that the disconnect
has occurred. In short: device-independent verbs-based applications
need to be able to handle FLUSH WRs during disconnection and not
treat them as an error.

* LIDs are always 0 in iWARP.

* LMC is always 0 for iWARP.

* Memory regions used to receive RDMA read responses must have
"remote write" permission (since in the iWARP protocol, RDMA read
responses are basically the same as incoming RDMA write requests).

* Atomics and immediate data are not available in iWARP.

* The sink scatter-gather list for an RDMA read can only have one
element for iWARP (which is reported accurately in struct
ibv_device.max_sge).

* Send completions provide a slightly different guarantee:
   * iWARP: indicates that the resources in the corresponding WR can
be reused; it does ''not'' indicate that the data is in the peer's
memory, or even that they have been transmitted yet.
   * IB: indicates that the data has been transmitted and has arrived
at the remote HCA (but is not necessarily in the remote target buffer
yet)

* All currently-available RNICs (May 2008) do not support RNR retry.
Specifically: current RNICs will terminate a QP connection if a SEND
arrives with no corresponding pre-posted receive.

--
Jeff Squyres
Cisco Systems
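[Editorial aside: several of the items in the list above hinge on detecting the transport at runtime. Here is a small sketch, using only stock libibverbs calls, of how a portable ULP might branch on it; the printing is purely illustrative.]

#include <stdio.h>
#include <infiniband/verbs.h>

/* List local RDMA devices and report IB vs. iWARP, so a ULP can
 * decide up front whether, e.g., atomics or RNR retry are usable. */
int main(void)
{
	int i, num;
	struct ibv_device **devs = ibv_get_device_list(&num);

	if (!devs)
		return 1;
	for (i = 0; i < num; ++i)
		printf("%s: %s\n", ibv_get_device_name(devs[i]),
		       devs[i]->transport_type == IBV_TRANSPORT_IWARP ?
		       "iWARP" : "IB");
	ibv_free_device_list(devs);
	return 0;
}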
From andrea at qumranet.com  Thu May  8 15:01:06 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Fri, 9 May 2008 00:01:06 +0200
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: 
References: <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random>
Message-ID: <20080508220106.GF2964@duo.random>

On Thu, May 08, 2008 at 09:11:33AM -0700, Linus Torvalds wrote:
> Btw, this is an issue only on 32-bit x86, because on 64-bit one we already
> have the padding due to the alignment of the 64-bit pointers in the
> list_head (so there's already empty space there).
>
> On 32-bit, the alignment of list-head is obviously just 32 bits, so right
> now the structure is "perfectly packed" and doesn't have any empty space.
> But that's just because the spinlock is unnecessarily big.
>
> (Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the
> structure really will grow. That's a very odd configuration, though, and
> not one I feel we really need to care about).

I see two ways to implement it:

1) use #ifdef and make it zero overhead for 64bit only without
playing any non obvious trick.
	struct anon_vma {
		spinlock_t lock;
	#ifdef CONFIG_MMU_NOTIFIER
		int global_mm_lock:1;
	#endif

	struct address_space {
		spinlock_t private_lock;
	#ifdef CONFIG_MMU_NOTIFIER
		int global_mm_lock:1;
	#endif

2) add a:

	#define AS_GLOBAL_MM_LOCK	(__GFP_BITS_SHIFT + 2)	/* global_mm_locked */

and use address_space->flags with bitops

And as Andrew pointed out to me by PM, for the anon_vma we can use the
LSB of the list.next/prev because the list can't be browsed when the
lock is taken, so taking the lock and then setting the bit and
clearing the bit before unlocking is safe. The LSB will always read 0
even if it's under list_add modification when the global spinlock
isn't taken. And after taking the anon_vma lock we can switch the LSB
from 0 to 1 without races and the 1 will be protected by the global
spinlock.

The above solution is zero cost for 32bit too, so I prefer it.

So I now agree with you this is a great idea on how to remove sort()
and vmalloc and especially vfree without increasing the VM footprint.

I'll send an update with this for review very shortly and I hope this
goes in so KVM will be able to swap and do many other things very well
starting in 2.6.26.

Thanks a lot,
Andrea
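[Editorial aside: a tiny sketch of the pointer-LSB trick Andrea describes, under the assumption that list_head pointers are at least 2-byte aligned so bit 0 is free; the helper names are invented for illustration, and the bit may only be set or cleared while the anon_vma lock is held.]

#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

#define GLOBAL_LOCK_BIT 0x1UL

/* Caller holds the anon_vma lock: steal bit 0 of head->next. */
static void mark_global_locked(struct list_head *head)
{
	head->next = (struct list_head *)
		((uintptr_t)head->next | GLOBAL_LOCK_BIT);
}

/* Caller holds the anon_vma lock: restore the clean pointer. */
static void clear_global_locked(struct list_head *head)
{
	head->next = (struct list_head *)
		((uintptr_t)head->next & ~GLOBAL_LOCK_BIT);
}

/* Readers that already hold the lock can test the flag; lockless
 * readers always see bit 0 as 0, as described above. */
static int is_global_locked(const struct list_head *head)
{
	return (uintptr_t)head->next & GLOBAL_LOCK_BIT;
}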
From changquing.tang at hp.com  Thu May  8 15:02:13 2008
From: changquing.tang at hp.com (Tang, Changqing)
Date: Thu, 8 May 2008 22:02:13 +0000
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
Message-ID: 

Great, Thanks, Jeff.

--CQ

> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
> Jeff Squyres
> Sent: Thursday, May 08, 2008 4:29 PM
> To: OpenFabrics General
> Subject: [ofa-general] Verbs: IB vs. iWARP
>
> Over the past 24 hours, we assembled a list of differences
> between IB and iWARP usage of verbs. I got a few comments on
> the text we assembled, and figured it was time to turn this
> text over to OpenFabrics to make it fully
> correct/complete/whatever, and then publish it however you see fit.
>
> I hope this starter text is helpful to you; enjoy.
>
> -----
> * struct ib_device.transport_type will be
> IBV_TRANSPORT_IWARP for iWARP devices and IBV_TRANSPORT_IB
> for IB devices.
>
> * ibv_query_gid():
> * When invoked on an IB HCA, will return the IB subnet
> prefix in subnet_prefix and GUID of the port in the interface_id.
> * When invoked on an iWARP NIC, will return the NIC's MAC
> address in subnet_prefix and 0 in the interface_id.
>
> * iWARP QPs ''must'' be made with the RDMA CM; IB QPs can
> be made using the IB CM, RDMA CM, or some other (assumedly
> out-of-band) mechanism.
>
> * When making QPs, some versions of iWARP drivers require
> the initiator of the connection to send the first message
> (having the non-initiator send the first message will
> terminate the connection).
> Newer versions of iWARP firmware/drivers hide this
> functionality down in the driver, so the ULP doesn't have to
> ensure that the initiator sends the first message.
>
> * When terminating connections via the RDMA CM (via the
> rdma_disconnect() call or by simply destroying the QP without
> disconnecting first), iWARP transports will automatically
> create a CQE for any pending send or receive WRs with the
> status set to IBV_WC_WR_FLUSH_ERR. Note that IB HCAs do the
> same thing, but the iWARP RDMA CM disconnection progresses
> independently of the ULP, meaning that when one side issues
> the disconnect, the other side will
> automatically be disconnected (even if the ULP doesn't realize it).
> IB HCAs may not process the disconnect until later (via RDMA
> CM or otherwise), perhaps not until the ULP realizes that the
> disconnect has occurred. In short: device-independent
> verbs-based applications need to be able to handle FLUSH WRs
> during disconnection and not treat them as an error.
>
> * LIDs are always 0 in iWARP.
>
> * LMC is always 0 for iWARP.
>
> * Memory regions used to receive RDMA read responses must
> have "remote write" permission (since in the iWARP protocol,
> RDMA read responses are basically the same as incoming RDMA
> write requests).
>
> * Atomics and immediate data are not available in iWARP.
>
> * The sink scatter-gather list for an RDMA read can only
> have one element for iWARP (which is reported accurately in
> struct ibv_device.max_sge).
>
> * Send completions provide a slightly different guarantee:
> * iWARP: indicates that the resources in the
> corresponding WR can be reused; it does ''not'' indicate that
> the data is in the peer's memory, or even that they have been
> transmitted yet.
> * IB: indicates that the data has been transmitted and
> has arrived at the remote HCA (but is not necessarily in the
> remote target buffer
> yet)
>
> * All currently-available RNICs (May 2008) do not support
> RNR retry. Specifically: current RNICs will terminate a QP
> connection if a SEND arrives with no corresponding pre-posted receive.
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From Arkady.Kanevsky at netapp.com  Thu May  8 15:14:54 2008
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Thu, 8 May 2008 18:14:54 -0400
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
Message-ID: 

There are also some differences in memory registration, for example FMR.

Peer-to-peer iWARP CM support has been submitted by Steve Wise.
We will test its interop in Sept, assuming that it will be in the OFED
version which will be used for OFA interop.
The changes are not just in the FW and driver but also in iWARP CM.

Also one can call iWARP CM directly, bypassing RDMA CM.
But there is no reason for it.
All iWARP apps had been developed after RDMA CM was in place, so there
was no reason to go under the covers.

Cheers,

Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance Inc.                phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.          Fax: 781-895-1195
Waltham, MA 02451                     central phone: 781-768-5300

> -----Original Message-----
> From: Jeff Squyres [mailto:jsquyres at cisco.com]
> Sent: Thursday, May 08, 2008 5:29 PM
> To: OpenFabrics General
> Subject: [ofa-general] Verbs: IB vs. iWARP
>
> Over the past 24 hours, we assembled a list of differences
> between IB and iWARP usage of verbs. I got a few comments on
> the text we assembled, and figured it was time to turn this
> text over to OpenFabrics to make it fully
> correct/complete/whatever, and then publish it however you see fit.
>
> I hope this starter text is helpful to you; enjoy.
>
> -----
> * struct ib_device.transport_type will be
> IBV_TRANSPORT_IWARP for iWARP devices and IBV_TRANSPORT_IB
> for IB devices.
>
> * ibv_query_gid():
> * When invoked on an IB HCA, will return the IB subnet
> prefix in subnet_prefix and GUID of the port in the interface_id.
> * When invoked on an iWARP NIC, will return the NIC's MAC
> address in subnet_prefix and 0 in the interface_id.
>
> * iWARP QPs ''must'' be made with the RDMA CM; IB QPs can
> be made using the IB CM, RDMA CM, or some other (assumedly
> out-of-band) mechanism.
>
> * When making QPs, some versions of iWARP drivers require
> the initiator of the connection to send the first message
> (having the non-initiator send the first message will terminate the connection).
> Newer versions of iWARP firmware/drivers hide this
> functionality down in the driver, so the ULP doesn't have to
> ensure that the initiator sends the first message.
>
> * When terminating connections via the RDMA CM (via the
> rdma_disconnect() call or by simply destroying the QP without
> disconnecting first), iWARP transports will automatically
> create a CQE for any pending send or receive WRs with the
> status set to IBV_WC_WR_FLUSH_ERR. Note that IB HCAs do the
> same thing, but the iWARP RDMA CM disconnection progresses
> independently of the ULP, meaning that when one side issues
> the disconnect, the other side will
> automatically be disconnected (even if the ULP doesn't realize it).
> IB HCAs may not process the disconnect until later (via RDMA
> CM or otherwise), perhaps not until the ULP realizes that the
> disconnect has occurred. In short: device-independent
> verbs-based applications need to be able to handle FLUSH WRs
> during disconnection and not treat them as an error.
>
> * LIDs are always 0 in iWARP.
>
> * LMC is always 0 for iWARP.
>
> * Memory regions used to receive RDMA read responses must
> have "remote write" permission (since in the iWARP protocol,
> RDMA read responses are basically the same as incoming RDMA
> write requests).
>
> * Atomics and immediate data are not available in iWARP.
>
> * The sink scatter-gather list for an RDMA read can only
> have one element for iWARP (which is reported accurately in
> struct ibv_device.max_sge).
>
> * Send completions provide a slightly different guarantee:
> * iWARP: indicates that the resources in the
> corresponding WR can be reused; it does ''not'' indicate that
> the data is in the peer's memory, or even that they have been
> transmitted yet.
> * IB: indicates that the data has been transmitted and
> has arrived at the remote HCA (but is not necessarily in the
> remote target buffer
> yet)
>
> * All currently-available RNICs (May 2008) do not support
> RNR retry. Specifically: current RNICs will terminate a QP
> connection if a SEND arrives with no corresponding pre-posted receive.
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From sean.hefty at intel.com  Thu May  8 15:16:12 2008
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 8 May 2008 15:16:12 -0700
Subject: [ofa-general] Verbs: IB vs. iWARP
From sean.hefty at intel.com Thu May 8 15:16:12 2008
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 8 May 2008 15:16:12 -0700
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
Message-ID: <000c01c8b159$1fb6f9b0$465a180a@amr.corp.intel.com>

It'd be great to find a place for this on the wiki, so it's easier to
find in the future.

- Sean

From rdreier at cisco.com Thu May 8 15:22:47 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 May 2008 15:22:47 -0700
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: (Arkady Kanevsky's message of "Thu, 8 May 2008 18:14:54 -0400")
References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
Message-ID:

 > There are also some differences in memory registration, for example FMR.

What are the differences?  I don't know of any significant ones (given
the IB verbs extensions).

 - R.

From Arkady.Kanevsky at netapp.com Thu May 8 15:31:02 2008
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Thu, 8 May 2008 18:31:02 -0400
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
Message-ID:

I had not checked for a while, but my recollection is that the FMR
implementation is vendor-specific...

Arkady Kanevsky                   email: arkady at netapp.com
Network Appliance Inc.            phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.      Fax: 781-895-1195
Waltham, MA 02451                 central phone: 781-768-5300

> -----Original Message-----
> From: Roland Dreier [mailto:rdreier at cisco.com]
> Sent: Thursday, May 08, 2008 6:23 PM
> To: Kanevsky, Arkady
> Cc: OpenFabrics General
> Subject: Re: [ofa-general] Verbs: IB vs. iWARP
>
> > There are also some differences in memory registration, for
> example FMR.
>
> What are the differences?  I don't know of any significant
> ones (given the IB verbs extensions).
>
> - R.
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From akpm at linux-foundation.org Thu May 8 22:29:16 2008
From: akpm at linux-foundation.org (Andrew Morton)
Date: Thu, 8 May 2008 22:29:16 -0700
Subject: [ofa-general] bitops take an unsigned long *
Message-ID: <20080508222916.277649ca.akpm@linux-foundation.org>

Most architectures could (and should) take an unsigned long * arg for
their bitops.  x86 doesn't do this and it needs fixing.  I fixed it.
Infiniband is being a problem.  It would be nice to get it fixed up,
please.
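The warnings that follow all come from handing a bitop a pointer to a
narrower or differently-typed word. A minimal illustration of the
pattern (the struct and flag names here are hypothetical, not from the
ipath driver):

#include <linux/bitops.h>
#include <linux/types.h>

struct foo_dev {
        u64 status;             /* &status trips the x86 bitop
                                 * prototypes, which take unsigned long * */
        unsigned long flags;    /* the type the bitops API expects */
};

static void foo_mark_busy(struct foo_dev *dd)
{
        set_bit(0, &dd->flags);         /* fine: unsigned long * */
        /* set_bit(0, &dd->status);        would warn once bitops
                                           take unsigned long * */
}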
drivers/infiniband/hw/ipath/ipath_driver.c: In function 'decode_sdma_errs': drivers/infiniband/hw/ipath/ipath_driver.c:926: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:926: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c: In function 'ipath_cancel_sends': drivers/infiniband/hw/ipath/ipath_driver.c:1901: warning: passing argument 2 of 'test_and_set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1902: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1902: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1934: warning: passing argument 2 of 'set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1949: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1949: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1950: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1950: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1955: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_driver.c:1955: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c: In function 'handle_sdma_errors': drivers/infiniband/hw/ipath/ipath_intr.c:553: warning: passing argument 2 of '__set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:554: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:554: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c: In function 'handle_sdma_intr': drivers/infiniband/hw/ipath/ipath_intr.c:566: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:566: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:570: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:570: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:575: warning: passing argument 2 of '__set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:579: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_intr.c:579: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/mlx4/qp.c: In function 'mlx4_ib_post_send': drivers/infiniband/hw/mlx4/qp.c:1460: warning: 'seglen' may be used uninitialized in this function drivers/char/epca.c:2542: warning: 'epca_setup' defined but not used drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_notify_task': drivers/infiniband/hw/ipath/ipath_sdma.c:200: warning: passing argument 2 of 'constant_test_bit' from incompatible 
pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:200: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': drivers/infiniband/hw/ipath/ipath_sdma.c:236: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:236: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:253: warning: passing argument 2 of '__set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:354: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:354: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'setup_sdma': drivers/infiniband/hw/ipath/ipath_sdma.c:504: warning: passing argument 2 of '__set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'teardown_sdma': drivers/infiniband/hw/ipath/ipath_sdma.c:521: warning: passing argument 2 of '__clear_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:522: warning: passing argument 2 of '__set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:523: warning: passing argument 2 of '__set_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_restart_sdma': drivers/infiniband/hw/ipath/ipath_sdma.c:608: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:608: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:609: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:609: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:612: warning: passing argument 2 of '__clear_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:613: warning: passing argument 2 of '__clear_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:614: warning: passing argument 2 of '__clear_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_sdma_verbs_send': 
drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type
drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type

From Robert at saq.co.uk Fri May 9 01:26:47 2008
From: Robert at saq.co.uk (Robert Dunkley)
Date: Fri, 9 May 2008 09:26:47 +0100
Subject: [ofa-general] Some general Infinband questions
Message-ID:

First of all please excuse me if this is not the right place to ask (If
anyone knows a good place to start for information on Infiniband
clustering then please let me know). I'm considering an Infiniband Xen
Virtual Server cluster under Centos 5.1. I already have a pair of under
utilized NAS servers running Windows Server 2003. I've played with
MYSQL NDB clustering and have a reasonable amount of Linux and Windows
experience. Would using the Windows Servers as storage and the Centos
based servers as cluster nodes even be possible with the current state
of software? Any general advice or tips?

The SAQ Group
Registered Office: 18 Chapel Street, Petersfield, Hampshire. GU32 3DZ
SEmtec Limited trading as SAQ is Registered in England & Wales
Company Number: 06481952
http://www.saqnet.co.uk AS29219
SAQ Group Delivers high quality, honestly priced communication and I.T. services to UK Business.
DSL : Domains : Email : Hosting : CoLo : Servers : Racks : Transit : Backups : Managed Networks : Remote Support.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From michael.heinz at qlogic.com Fri May 9 06:17:45 2008
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Fri, 9 May 2008 08:17:45 -0500
Subject: [ofa-general] Still looking for help debugging a problem.
Message-ID:

May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: Internal error detected:
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[00]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[01]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[02]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[03]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[04]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[05]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[06]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[07]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[08]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[09]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0a]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0b]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0c]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0d]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0e]: ffffffff
May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: buf[0f]: ffffffff

The HCA in question is a Connect-X and the problem only seems to happen
with this node.

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From forum.san at gmail.com Fri May 9 08:47:06 2008
From: forum.san at gmail.com (Sangamesh B)
Date: Fri, 9 May 2008 21:17:06 +0530
Subject: [ofa-general] RPM build errors:user vlad does not exist - using root
Message-ID:

Hi all,

I've worked with MPICH2, but I am a beginner to Infiniband and OFED.
The installation of the OFED-1.3.rc1 package on a cluster with CentOS 5
gave the following error:

....
.......
gcc -Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/.addr.o.d -nostdinc -iwithprefix include -D__KERNEL__ -include include/linux/autoconf.h -include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/linux/autoconf.h -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/kernel_addons/backport/2.6.9_U3/include/ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/debug -I/usr/local/include/scst -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/net/cxgb3 -Iinclude -Wall -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Os -fomit-frame-pointer -g -Wdeclaration-after-statement -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -funit-at-a-time -DMODULE -DKBUILD_BASENAME=addr -DKBUILD_MODNAME=ib_addr -c -o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/.tmp_addr.o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:31:25: linux/mutex.h: No such file or directory In file included from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:32: include/linux/inetdevice.h:50: error: field `mr_gq_timer' has incomplete type include/linux/inetdevice.h:51: error: field `mr_ifc_timer' has incomplete type include/linux/inetdevice.h:95: error: `IFNAMSIZ' undeclared here (not in a function) include/linux/inetdevice.h: In function `in_dev_get': include/linux/inetdevice.h:146: error: dereferencing pointer to incomplete type include/linux/inetdevice.h: In function `__in_dev_get': include/linux/inetdevice.h:156: error: dereferencing pointer to incomplete type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:38:26: net/netevent.h: No such file or directory In file included from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/rdma/ib_addr.h:37, from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:39: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/rdma/ib_verbs.h:51:31: linux/scatterlist.h: No such file or directory In file included from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/rdma/ib_addr.h:37, from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:39: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/rdma/ib_verbs.h: At top level: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/rdma/ib_verbs.h:1090: error: field `xrcd_table_mutex' has incomplete type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:60: warning: type defaults to `int' in declaration of `DEFINE_MUTEX' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:60: warning: parameter names (without types) in function declaration /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:62: warning: type defaults to `int' in declaration of `DECLARE_DELAYED_WORK' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:62: warning: parameter names (without types) in function declaration /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `set_timeout': /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:127: error: `work' undeclared (first use in this function) /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:127: error: (Each undeclared identifier is reported only once 
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:127: error: for each function it appears in.)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `queue_req':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:140: warning: implicit declaration of function `mutex_lock'
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:140: error: `lock' undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:150: warning: implicit declaration of function `mutex_unlock'
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `process_req':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:225: error: `lock' undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `rdma_addr_cancel':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:341: error: `lock' undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `netevent_callback':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:358: error: `NETEVENT_NEIGH_UPDATE' undeclared (first use in this function)
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `addr_init':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:378: warning: implicit declaration of function `register_netevent_notifier'
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function `addr_cleanup':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:384: warning: implicit declaration of function `unregister_netevent_notifier'
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: At top level:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:218: warning: 'process_req' defined but not used
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:60: warning: 'DEFINE_MUTEX' declared `static' but never defined
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:62: warning: 'DECLARE_DELAYED_WORK' declared `static' but never defined
make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.o] Error 1
make[3]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core] Error 2
make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband] Error 2
make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.9-34.0.2.EL-smp-x86_64'
make: *** [kernel] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.66489 (%build)

RPM build errors:
    user vlad does not exist - using root
    group vlad does not exist - using root
    user vlad does not exist - using root
    group vlad does not exist - using root
    Bad exit status from /var/tmp/rpm-tmp.66489 (%build)

Why these errors?

As a beginner, I want to know some points. In the
docs/OFED_Installation_Guide.txt guide, it is given that:

1. OS Distribution          Required Packages
   ---------------          ----------------------------------
   General:
   o Common to all          gcc, glib, glib-devel, glibc, glibc-devel,
                            glibc-devel-32bit (to build 32-bit libraries
                            on x86_64 and ppc64), zlib-devel, automake,
                            autoconf, libtool.
   o RedHat, Fedora         kernel-devel, rpm-build

And:
2. Specific Component Requirements:
   o Mvapich       a Fortran compiler (such as gcc-g77)
   o Mvapich2      libstdc++-devel, sysfsutils (SuSE), libsysfs-devel
                   (RedHat 5.0, Fedora C6)

Since the OS is CentOS, I tried to install the libstdc++-devel CentOS
rpms, but that failed because of glibc dependencies. Are both of these
really required?

In mvapich2, three makefiles are given: make.mvapich2.ofad,
make.mvapich2.vapi and make.mvapich2.udapl. Do all of these support
OFED, or does only make.mvapich2.ofad support it? If so, what should
the other two be used for?

Thanks in advance for Howto: Infiniband + OFED + mvapich2 concepts.

-Sangamesh

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rdreier at cisco.com Fri May 9 09:20:13 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 09 May 2008 09:20:13 -0700
Subject: [ofa-general] Still looking for help debugging a problem.
In-Reply-To: (Mike Heinz's message of "Fri, 9 May 2008 08:17:45 -0500")
References: Message-ID:

 > May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0:
 > Internal error detected:
 > May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0:
 > buf[00]: ffffffff

 > The HCA in question is a Connect-X and the problem only seems to happen
 > with this node.

Sounds like a hardware problem.  Try reseating everything etc.

From swise at opengridcomputing.com Fri May 9 09:26:00 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 09 May 2008 11:26:00 -0500
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com>
Message-ID: <48247B18.8000100@opengridcomputing.com>

The ib_fmr stuff is pretty HW-specific yes?

Kanevsky, Arkady wrote:
> I had not checked for a while, but my recollection is that the FMR
> implementation is
> vendor-specific...
>
> Arkady Kanevsky                   email: arkady at netapp.com
> Network Appliance Inc.            phone: 781-768-5395
> 1601 Trapelo Rd. - Suite 16.      Fax: 781-895-1195
> Waltham, MA 02451                 central phone: 781-768-5300
>
>> -----Original Message-----
>> From: Roland Dreier [mailto:rdreier at cisco.com]
>> Sent: Thursday, May 08, 2008 6:23 PM
>> To: Kanevsky, Arkady
>> Cc: OpenFabrics General
>> Subject: Re: [ofa-general] Verbs: IB vs. iWARP
>>
>> > There are also some differences in memory registration, for
>> example FMR.
>>
>> What are the differences?  I don't know of any significant
>> ones (given the IB verbs extensions).
>>
>> - R.
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
>>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From rdreier at cisco.com Fri May 9 09:34:24 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 09 May 2008 09:34:24 -0700
Subject: [ofa-general] Verbs: IB vs. iWARP
In-Reply-To: <48247B18.8000100@opengridcomputing.com> (Steve Wise's message of "Fri, 09 May 2008 11:26:00 -0500")
References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com> <48247B18.8000100@opengridcomputing.com>
Message-ID:

 > The ib_fmr stuff is pretty HW-specific yes?

But not really an IB vs. iWARP thing...
it's just some wacky extension that was created a long time ago, which an iWARP RNIC could implement just as easily as an IB HCA. - R. From rdreier at cisco.com Fri May 9 09:36:35 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 09 May 2008 09:36:35 -0700 Subject: [ofa-general] bitops take an unsigned long * In-Reply-To: <20080508222916.277649ca.akpm@linux-foundation.org> (Andrew Morton's message of "Thu, 8 May 2008 22:29:16 -0700") References: <20080508222916.277649ca.akpm@linux-foundation.org> Message-ID: > Most architectures could (and should) take an unsigned long * arg for their > bitops. x86 doesn't do this and it needs fixing. I fixed it. Infiniband > is being a problem. Is your fix available somewhere? Would like to check any patches I make. > It would be nice to get it fixed up, please. Will take a look. A few non-ipath warnings in the spew: > drivers/infiniband/hw/mlx4/qp.c: In function 'mlx4_ib_post_send': > drivers/infiniband/hw/mlx4/qp.c:1460: warning: 'seglen' may be used uninitialized in this function which gcc version is giving this? > drivers/char/epca.c:2542: warning: 'epca_setup' defined but not used ...this got lost in the noise. - R. From swise at opengridcomputing.com Fri May 9 09:37:33 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 09 May 2008 11:37:33 -0500 Subject: [ofa-general] Verbs: IB vs. iWARP In-Reply-To: <000c01c8b159$1fb6f9b0$465a180a@amr.corp.intel.com> References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com> <000c01c8b159$1fb6f9b0$465a180a@amr.corp.intel.com> Message-ID: <48247DCD.9010205@opengridcomputing.com> Sean Hefty wrote: > It'd be great to find a place for this on the wiki, so it's easier to find in > the future. > > https://wiki.openfabrics.org/tiki-index.php?page=Verbs%3A+Infiniband+vs+iWARP From Arkady.Kanevsky at netapp.com Fri May 9 09:54:34 2008 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 9 May 2008 12:54:34 -0400 Subject: [ofa-general] Verbs: IB vs. iWARP In-Reply-To: <48247B18.8000100@opengridcomputing.com> References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com> <48247B18.8000100@opengridcomputing.com> Message-ID: The one in the core should not be. The one in a vendor driver invoked by core is vendor and HW specific. Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Steve Wise [mailto:swise at opengridcomputing.com] > Sent: Friday, May 09, 2008 12:26 PM > To: Kanevsky, Arkady > Cc: Roland Dreier; OpenFabrics General > Subject: Re: [ofa-general] Verbs: IB vs. iWARP > > The ib_fmr stuff is pretty HW-specific yes? > > > Kanevsky, Arkady wrote: > > I had not check for a while but my recollection is the the fmr > > implementation is vendor specific... > > > > Arkady Kanevsky email: arkady at netapp.com > > Network Appliance Inc. phone: 781-768-5395 > > 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 > > Waltham, MA 02451 central phone: 781-768-5300 > > > > > > > >> -----Original Message----- > >> From: Roland Dreier [mailto:rdreier at cisco.com] > >> Sent: Thursday, May 08, 2008 6:23 PM > >> To: Kanevsky, Arkady > >> Cc: OpenFabrics General > >> Subject: Re: [ofa-general] Verbs: IB vs. iWARP > >> > >> > There are also some difference in memory registration, > for example > >> FMR. > >> > >> What are the differences? I don't know of any significant ones > >> (given the IB verbs extensions). > >> > >> - R. 
> >> > >> > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit > >> http://openib.org/mailman/listinfo/openib-general > >> > >> > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From swise at opengridcomputing.com Fri May 9 10:48:08 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 09 May 2008 12:48:08 -0500 Subject: [ofa-general] Verbs: IB vs. iWARP In-Reply-To: References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com> <48247B18.8000100@opengridcomputing.com> Message-ID: <48248E58.8060301@opengridcomputing.com> One more item added to the wiki page: * iWARP RNICs support a special privileged lkey == 0 which can be used in local SGLs when the address is a bus/dma address. From michael.heinz at qlogic.com Fri May 9 10:58:03 2008 From: michael.heinz at qlogic.com (Mike Heinz) Date: Fri, 9 May 2008 12:58:03 -0500 Subject: [ofa-general] Still looking for help debugging a problem. In-Reply-To: References: Message-ID: Thanks, Roland. We're trying switching HCAs now. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Friday, May 09, 2008 12:20 PM To: Mike Heinz Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] Still looking for help debugging a problem. > May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: > Internal error detected: > May 9 08:38:54 compute-0-4.local kernel: mlx4_core 0000:02:00.0: > buf[00]: ffffffff > The HCA in question is a Connect-X and the problem only seems to happen > with this node. Sounds like a hardware problem. Try reseating everything etc. From a.p.zijlstra at chello.nl Fri May 9 11:37:29 2008 From: a.p.zijlstra at chello.nl (Peter Zijlstra) Date: Fri, 09 May 2008 20:37:29 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <20080507153103.237ea5b6.akpm@linux-foundation.org> <20080507224406.GI8276@duo.random> <20080507155914.d7790069.akpm@linux-foundation.org> <20080507233953.GM8276@duo.random> <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> Message-ID: <1210358249.13978.275.camel@twins> On Thu, 2008-05-08 at 09:11 -0700, Linus Torvalds wrote: > > On Thu, 8 May 2008, Linus Torvalds wrote: > > > > Also, we'd need to make it > > > > unsigned short flag:1; > > > > _and_ change spinlock_types.h to make the spinlock size actually match the > > required size (right now we make it an "unsigned int slock" even when we > > actually only use 16 bits). > > Btw, this is an issue only on 32-bit x86, because on 64-bit one we already > have the padding due to the alignment of the 64-bit pointers in the > list_head (so there's already empty space there). 
> > On 32-bit, the alignment of list-head is obviously just 32 bits, so right
> > now the structure is "perfectly packed" and doesn't have any empty space.
> > But that's just because the spinlock is unnecessarily big.
> >
> > (Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the
> > structure really will grow. That's a very odd configuration, though, and
> > not one I feel we really need to care about).

Another possibility, would something like this work?

/*
 * null out the begin function, no new begin calls can be made
 */
rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL);

/*
 * lock/unlock all rmap locks in any order - this ensures that any
 * pending start() will have its end() function called.
 */
mm_barrier(mm);

/*
 * now that no new start() call can be made and all start()/end() pairs
 * are complete we can remove the notifier.
 */
mmu_notifier_remove(mm, my_notifier);

This requires a mmu_notifier instance per attached mm and that
__mmu_notifier_invalidate_range_start() uses rcu_dereference() to
obtain the function.

But I think it's enough to ensure that:

  for each start an end will be called

It can however happen that end is called without start - but we could
handle that I think.

From andrea at qumranet.com Fri May 9 11:55:53 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Fri, 9 May 2008 20:55:53 +0200
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <1210358249.13978.275.camel@twins>
References: <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> <1210358249.13978.275.camel@twins>
Message-ID: <20080509185553.GF7710@duo.random>

On Fri, May 09, 2008 at 08:37:29PM +0200, Peter Zijlstra wrote:
> Another possibility, would something like this work?
>
> /*
>  * null out the begin function, no new begin calls can be made
>  */
> rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL);
>
> /*
>  * lock/unlock all rmap locks in any order - this ensures that any
>  * pending start() will have its end() function called.
>  */
> mm_barrier(mm);
>
> /*
>  * now that no new start() call can be made and all start()/end() pairs
>  * are complete we can remove the notifier.
>  */
> mmu_notifier_remove(mm, my_notifier);
>
> This requires a mmu_notifier instance per attached mm and that
> __mmu_notifier_invalidate_range_start() uses rcu_dereference() to
> obtain the function.
>
> But I think it's enough to ensure that:
>
>   for each start an end will be called

We don't need that, it's perfectly ok if start is called but end is
not, it's ok to unregister in the middle as I guarantee ->release is
called before mmu_notifier_unregister returns (if ->release is needed
at all, not the case for KVM/GRU).

Unregister is already solved with srcu/rcu without any additional
complication as we don't need the guarantee that for each start an end
will be called.

> It can however happen that end is called without start - but we could
> handle that I think.

The only reason mm_lock() was introduced is to solve "register", to
guarantee that for each end there was a start. We can't handle end
called without start in the driver.

The reason the driver must be prevented from registering in the middle
of start/end is that, if that ever happens, the driver has no way to
know it must stop the secondary mmu page faults from calling
get_user_pages and instantiating sptes/secondarytlbs on pages that will
be freed as soon as zap_page_range starts.
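To make that invariant concrete: a secondary-MMU driver typically
refuses to instantiate sptes while any invalidate_range_begin/end
critical section is in flight, which only works if every end() it sees
had a matching start(). A minimal sketch (all drv_* names are
hypothetical, not from any posted driver):

static atomic_t drv_range_count = ATOMIC_INIT(0);

static void drv_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
                            unsigned long start, unsigned long end)
{
        atomic_inc(&drv_range_count);   /* secondary page faults hold off */
        drv_zap_sptes(mm, start, end);  /* hypothetical teardown helper */
}

static void drv_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
                          unsigned long start, unsigned long end)
{
        /*
         * An end() without a matching start() would drive the count
         * negative and let the fault path map pages the VM is still
         * freeing - exactly the case the driver cannot handle.
         */
        atomic_dec(&drv_range_count);
}

static int drv_fault_allowed(void)
{
        return atomic_read(&drv_range_count) == 0;
}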
From a.p.zijlstra at chello.nl Fri May 9 12:04:47 2008 From: a.p.zijlstra at chello.nl (Peter Zijlstra) Date: Fri, 09 May 2008 21:04:47 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080509185553.GF7710@duo.random> References: <20080508025652.GW8276@duo.random> <20080508034133.GY8276@duo.random> <20080508052019.GA8276@duo.random> <1210358249.13978.275.camel@twins> <20080509185553.GF7710@duo.random> Message-ID: <1210359887.6524.0.camel@lappy.programming.kicks-ass.net> On Fri, 2008-05-09 at 20:55 +0200, Andrea Arcangeli wrote: > On Fri, May 09, 2008 at 08:37:29PM +0200, Peter Zijlstra wrote: > > Another possibility, would something like this work? > > > > > > /* > > * null out the begin function, no new begin calls can be made > > */ > > rcu_assing_pointer(my_notifier.invalidate_start_begin, NULL); > > > > /* > > * lock/unlock all rmap locks in any order - this ensures that any > > * pending start() will have its end() function called. > > */ > > mm_barrier(mm); > > > > /* > > * now that no new start() call can be made and all start()/end() pairs > > * are complete we can remove the notifier. > > */ > > mmu_notifier_remove(mm, my_notifier); > > > > > > This requires a mmu_notifier instance per attached mm and that > > __mmu_notifier_invalidate_range_start() uses rcu_dereference() to obtain > > the function. > > > > But I think its enough to ensure that: > > > > for each start an end will be called > > We don't need that, it's perfectly ok if start is called but end is > not, it's ok to unregister in the middle as I guarantee ->release is > called before mmu_notifier_unregister returns (if ->release is needed > at all, not the case for KVM/GRU). > > Unregister is already solved with srcu/rcu without any additional > complication as we don't need the guarantee that for each start an end > will be called. > > > It can however happen that end is called without start - but we could > > handle that I think. > > The only reason mm_lock() was introduced is to solve "register", to > guarantee that for each end there was a start. We can't handle end > called without start in the driver. > > The reason the driver must be prevented to register in the middle of > start/end, if that if it ever happens the driver has no way to know it > must stop the secondary mmu page faults to call get_user_pages and > instantiate sptes/secondarytlbs on pages that will be freed as soon as > zap_page_range starts. Right - then I got it backwards. Never mind me then.. From akpm at linux-foundation.org Fri May 9 12:05:29 2008 From: akpm at linux-foundation.org (Andrew Morton) Date: Fri, 9 May 2008 12:05:29 -0700 Subject: [ofa-general] bitops take an unsigned long * In-Reply-To: References: <20080508222916.277649ca.akpm@linux-foundation.org> Message-ID: <20080509120529.a9e616e3.akpm@linux-foundation.org> On Fri, 09 May 2008 09:36:35 -0700 Roland Dreier wrote: > > Most architectures could (and should) take an unsigned long * arg for their > > bitops. x86 doesn't do this and it needs fixing. I fixed it. Infiniband > > is being a problem. > > Is your fix available somewhere? Would like to check any patches I make. It needs some preparatory patches, otherwise you'll be looking through thousands of warnings. 
At http://userweb.kernel.org/~akpm/mmotm/broken-out/ we have

arch-x86-mm-patc-use-boot_cpu_has.patch
x86-setup_force_cpu_cap-dont-do-clear_bitnon-unsigned-long.patch
lguest-use-cpu-capability-accessors.patch
x86-set_restore_sigmask-avoid-bitop-on-a-u32.patch

and then the conversion patch:

x86-bitops-take-an-unsigned-long.patch

> > It would be nice to get it fixed up, please.
>
> Will take a look.

Thanks.

> A few non-ipath warnings in the spew:
>
> > drivers/infiniband/hw/mlx4/qp.c: In function 'mlx4_ib_post_send':
> > drivers/infiniband/hw/mlx4/qp.c:1460: warning: 'seglen' may be used uninitialized in this function

That's a falsie: gcc assumes that foo(&var) doesn't write to `var' :(

> which gcc version is giving this?

4.0.2 I think.

> > drivers/char/epca.c:2542: warning: 'epca_setup' defined but not used
>
> ...this got lost in the noise.

Interesting, thanks.  I'll bug Alan about that.

From andrea at qumranet.com Fri May 9 12:32:30 2008
From: andrea at qumranet.com (Andrea Arcangeli)
Date: Fri, 9 May 2008 21:32:30 +0200
Subject: [ofa-general] [PATCH 001/001] mmu-notifier-core v17
Message-ID: <20080509193230.GH7710@duo.random>

From: Andrea Arcangeli

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean "secondary pte". In the GRU case
there's no actual secondary pte and there's only a secondary tlb,
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case for KVM, where a
secondary tlb miss will walk sptes in hardware and refill the secondary
tlb transparently to software if the corresponding spte is present).

The same way zap_page_range has to invalidate the pte before freeing
the page, the spte (and secondary tlb) must also be invalidated before
any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page being freed when the pin is
released (so requiring the same complex and relatively slow tlb_gather
smp-safe logic we have in zap_page_range, which can be avoided
completely if the spte unmap event doesn't require an unpin of the page
previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU
so that the secondary MMU code can drop sptes before the pages are
freed, avoiding all page pinning and allowing 100% reliable swapping of
guest physical address space. Furthermore it avoids requiring the code
that tears down the secondary MMU mappings to implement logic like
tlb_gather in zap_page_range, which would require many IPIs to flush
other cpu tlbs for each fixed number of sptes unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will
be invalidated, and the next secondary-mmu page fault will call
get_user_pages and trigger a do_wp_page if it calls get_user_pages with
write=1, and it'll re-establish an updated spte or
secondary-tlb-mapping on the copied page.
Or it will set up a readonly spte or readonly tlb mapping, if it's a
guest read and it calls get_user_pages with write=0. This is just an
example.

This allows mapping any page pointed to by any pte (and in turn visible
in the primary CPU MMU) into a secondary MMU (be it a pure tlb like
GRU, or a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer with kvm), or into a remote DMA in software like
XPMEM (hence the need to schedule in XPMEM code to send the invalidate
to the remote node, while there is no need to schedule in kvm/gru, as
it's an immediate event like invalidating a primary-mmu pte).

At least for KVM, without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
list_del_rcu too)

2) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track of whether the VM is in the middle of the
invalidate_range_begin/end critical section with an atomic counter
increase in range_begin and decrease in range_end. No secondary MMU
page fault is allowed to map any spte or secondary tlb reference while
the VM is in the middle of range_begin/end, as any page returned by
get_user_pages in that critical section could later be freed
immediately without any further ->invalidate_page notification
(invalidate_range_begin/end works on ranges and ->invalidate_page isn't
called immediately before freeing the page). To stop all page freeing
and pagetable overwrites the mmap_sem must be taken in write mode and
all other anon_vma/i_mmap locks must be taken too.

3) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled
if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage
of mmu notifiers, but this already allows compiling a KVM external
module against a kernel with mmu notifiers enabled, and from the next
pull from kvm.git we'll start using them. And GRU/XPMEM will also be
able to continue development by enabling KVM=m in their config, until
they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they
can also enable MMU_NOTIFIERS in the same way KVM does it (even if
KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU
and XPMEM are all =n.

The mmu_notifier_register call can fail because mm_take_all_locks may
be interrupted by a signal and return -EINTR. Because
mmu_notifier_register is called at driver startup, a failure can be
handled gracefully. Here is an example of the change applied to kvm to
register the mmu notifiers. Usually when a driver starts up, other
allocations are required anyway, and -ENOMEM failure paths exist
already.

 struct kvm *kvm_arch_create_vm(void)
 {
 	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+	int err;

 	if (!kvm)
 		return ERR_PTR(-ENOMEM);

 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+	if (err) {
+		kfree(kvm);
+		return ERR_PTR(err);
+	}
+
 	return kvm;
 }

mmu_notifier_unregister returns void and it's reliable.

Signed-off-by: Andrea Arcangeli
Signed-off-by: Nick Piggin
Signed-off-by: Christoph Lameter

---

Full patchset is here:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc1/mmu-notifier-v17

Thanks!
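For drivers other than kvm the shape is the same: fill in an ops table
and register against the mm, handling the error return. A minimal
sketch (the drv_* names are hypothetical and reuse the
drv_range_start/drv_range_end handlers sketched earlier in the thread):

static const struct mmu_notifier_ops drv_mn_ops = {
	.invalidate_range_start	= drv_range_start,
	.invalidate_range_end	= drv_range_end,
};

static struct mmu_notifier drv_mn;

static int drv_attach_mm(struct mm_struct *mm)
{
	drv_mn.ops = &drv_mn_ops;
	/* may fail with -EINTR (signal during mm_take_all_locks) or
	 * -ENOMEM; the caller unwinds like any other startup failure */
	return mmu_notifier_register(&drv_mn, mm);
}

Teardown is the reverse: mmu_notifier_unregister(&drv_mn, mm), which
per the changelog above returns void and is reliable.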
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on HAVE_KVM select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/include/linux/list.h b/include/linux/list.h --- a/include/linux/list.h +++ b/include/linux/list.h @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis * or hlist_del_rcu(), running on this same list. * However, it is perfectly legal to run concurrently with * the _rcu list-traversal primitives, such as - * hlist_for_each_entry(). + * hlist_for_each_entry_rcu(). */ static inline void hlist_del_rcu(struct hlist_node *n) { @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct if (!hlist_unhashed(n)) { __hlist_del(n); INIT_HLIST_NODE(n); + } +} + +/** + * hlist_del_init_rcu - deletes entry from hash list with re-initialization + * @n: the element to delete from the hash list. + * + * Note: list_unhashed() on the node return true after this. It is + * useful for RCU based read lockfree traversal if the writer side + * must know if the list entry is still hashed or already unhashed. + * + * In particular, it means that we can not poison the forward pointers + * that may still be used for walking the hash list and we can only + * zero the pprev pointer so list_unhashed() will return true after + * this. + * + * The caller must take whatever precautions are necessary (such as + * holding appropriate locks) to avoid racing with another + * list-mutation primitive, such as hlist_add_head_rcu() or + * hlist_del_rcu(), running on this same list. However, it is + * perfectly legal to run concurrently with the _rcu list-traversal + * primitives, such as hlist_for_each_entry_rcu(). + */ +static inline void hlist_del_init_rcu(struct hlist_node *n) +{ + if (!hlist_unhashed(n)) { + __hlist_del(n); + n->pprev = NULL; } } diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1067,6 +1067,9 @@ extern struct vm_area_struct *copy_vma(s unsigned long addr, unsigned long len, pgoff_t pgoff); extern void exit_mmap(struct mm_struct *); +extern int mm_take_all_locks(struct mm_struct *mm); +extern void mm_drop_all_locks(struct mm_struct *mm); + #ifdef CONFIG_PROC_FS /* From fs/proc/base.c. 
callers must _not_ hold the mm's exe_file_lock */ extern void added_exe_file_vma(struct mm_struct *mm); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -10,6 +10,7 @@ #include #include #include +#include #include #include @@ -19,6 +20,7 @@ #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) struct address_space; +struct mmu_notifier_mm; #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS typedef atomic_long_t mm_counter_t; @@ -235,6 +237,9 @@ struct mm_struct { struct file *exe_file; unsigned long num_exe_file_vmas; #endif +#ifdef CONFIG_MMU_NOTIFIER + struct mmu_notifier_mm *mmu_notifier_mm; +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,279 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include +#include +#include + +struct mmu_notifier; +struct mmu_notifier_ops; + +#ifdef CONFIG_MMU_NOTIFIER + +/* + * The mmu notifier_mm structure is allocated and installed in + * mm->mmu_notifier_mm inside the mm_take_all_locks() protected + * critical section and it's released only when mm_count reaches zero + * in mmdrop(). + */ +struct mmu_notifier_mm { + /* all mmu notifiers registerd in this mm are queued in this list */ + struct hlist_head list; + /* to serialize the list modifications and hlist_unhashed */ + spinlock_t lock; +}; + +struct mmu_notifier_ops { + /* + * Called either by mmu_notifier_unregister or when the mm is + * being destroyed by exit_mmap, always before all pages are + * freed. This can run concurrently with other mmu notifier + * methods (the ones invoked outside the mm context) and it + * should tear down all secondary mmu mappings and freeze the + * secondary mmu. If this method isn't implemented you've to + * be sure that nothing could possibly write to the pages + * through the secondary mmu by the time the last thread with + * tsk->mm == mm exits. + * + * As side note: the pages freed after ->release returns could + * be immediately reallocated by the gart at an alias physical + * address with a different cache model, so if ->release isn't + * implemented because all _software_ driven memory accesses + * through the secondary mmu are terminated by the time the + * last thread of this mm quits, you've also to be sure that + * speculative _hardware_ operations can't allocate dirty + * cachelines in the cpu that could not be snooped and made + * coherent with the other read and write operations happening + * through the gart alias address, so leading to memory + * corruption. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * clear_flush_young is called after the VM is + * test-and-clearing the young/accessed bitflag in the + * pte. This way the VM will provide proper aging to the + * accesses to the page through the secondary MMUs and not + * only to the ones through the Linux pte. + */ + int (*clear_flush_young)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * Before this is invoked any secondary MMU is still ok to + * read/write to the page previously pointed to by the Linux + * pte because the page hasn't been freed yet and it won't be + * freed until this returns. If required set_page_dirty has to + * be called internally to this method. 
+ */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_range_start() and invalidate_range_end() must be + * paired and are called only when the mmap_sem and/or the + * locks protecting the reverse maps are held. The subsystem + * must guarantee that no additional references are taken to + * the pages in the range established between the call to + * invalidate_range_start() and the matching call to + * invalidate_range_end(). + * + * Invalidation of multiple concurrent ranges may be + * optionally permitted by the driver. Either way the + * establishment of sptes is forbidden in the range passed to + * invalidate_range_begin/end for the whole duration of the + * invalidate_range_begin/end critical section. + * + * invalidate_range_start() is called when all pages in the + * range are still mapped and have at least a refcount of one. + * + * invalidate_range_end() is called when all pages in the + * range have been unmapped and the pages have been freed by + * the VM. + * + * The VM will remove the page table entries and potentially + * the page between invalidate_range_start() and + * invalidate_range_end(). If the page must not be freed + * because of pending I/O or other circumstances then the + * invalidate_range_start() callback (or the initial mapping + * by the driver) must make sure that the refcount is kept + * elevated. + * + * If the driver increases the refcount when the pages are + * initially mapped into an address space then either + * invalidate_range_start() or invalidate_range_end() may + * decrease the refcount. If the refcount is decreased on + * invalidate_range_start() then the VM can free pages as page + * table entries are removed. If the refcount is only + * droppped on invalidate_range_end() then the driver itself + * will drop the last refcount but it must take care to flush + * any secondary tlb before doing the final free on the + * page. Pages will no longer be referenced by the linux + * address space but may still be referenced by sptes until + * the last refcount is dropped. + */ + void (*invalidate_range_start)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_end)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +/* + * The notifier chains are protected by mmap_sem and/or the reverse map + * semaphores. Notifier chains are only changed when all reverse maps and + * the mmap_sem locks are taken. + * + * Therefore notifier chains can only be traversed when either + * + * 1. mmap_sem is held. + * 2. One of the reverse map locks is held (i_mmap_lock or anon_vma->lock). + * 3. 
No other concurrent thread can access the list (release) + */ +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +static inline int mm_has_notifiers(struct mm_struct *mm) +{ + return unlikely(mm->mmu_notifier_mm); +} + +extern int mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern int __mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); +extern void __mmu_notifier_release(struct mm_struct *mm); +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end); +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end); + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_release(mm); +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + return __mmu_notifier_clear_flush_young(mm, address); + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_page(mm, address); +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_start(mm, start, end); +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end); +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ + mm->mmu_notifier_mm = NULL; +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_mm_destroy(mm); +} + +/* + * These two macros will sometime replace ptep_clear_flush. + * ptep_clear_flush is impleemnted as macro itself, so this also is + * implemented as a macro until ptep_clear_flush will converted to an + * inline function, to diminish the risk of compilation failure. The + * invalidate_page method over time can be moved outside the PT lock + * and these two macros can be later removed. 
+ */ +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ +({ \ + pte_t __pte; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __pte = ptep_clear_flush(___vma, ___address, __ptep); \ + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ + __pte; \ +}) + +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ +({ \ + int __young; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ + ___address); \ + __young; \ +}) + +#else /* CONFIG_MMU_NOTIFIER */ + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ +} + +#define ptep_clear_flush_young_notify ptep_clear_flush_young +#define ptep_clear_flush_notify ptep_clear_flush + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -19,6 +19,7 @@ */ #define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */ #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */ +#define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */ static inline void mapping_set_error(struct address_space *mapping, int error) { diff --git a/include/linux/rmap.h b/include/linux/rmap.h --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -26,6 +26,14 @@ */ struct anon_vma { spinlock_t lock; /* Serialize access to vma list */ + /* + * NOTE: the LSB of the head.next is set by + * mm_take_all_locks() _after_ taking the above lock. So the + * head must only be read/written after taking the above lock + * to be sure to see a valid next pointer. The LSB bit itself + * is serialized by a system wide lock only visible to + * mm_take_all_locks() (mm_all_locks_mutex). 
+ */ struct list_head head; /* List of private "related" vmas */ }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -54,6 +54,7 @@ #include #include #include +#include #include #include @@ -386,6 +387,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_mm_init(mm); return mm; } @@ -418,6 +420,7 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mmu_notifier_mm_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -205,3 +205,6 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + bool diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,7 +188,7 @@ __xip_unmap (struct address_space * mapp if (pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier_invalidate_range_start(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier_invalidate_range_end(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -823,6 +824,7 @@ void __unmap_hugepage_range(struct vm_ar BUG_ON(start & ~HPAGE_MASK); BUG_ON(end & ~HPAGE_MASK); + mmu_notifier_invalidate_range_start(mm, start, end); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -843,6 +845,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range_end(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -632,6 +633,7 @@ int copy_page_range(struct mm_struct *ds unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int ret; /* * Don't copy ptes where a page fault will fill them correctly. @@ -647,17 +649,33 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. 
+ */ + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_start(src_mm, addr, end); + + ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next)) - return -ENOMEM; + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, + vma, addr, next))) { + ret = -ENOMEM; + break; + } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - return 0; + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_end(src_mm, + vma->vm_start, end); + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -861,7 +879,9 @@ unsigned long unmap_vmas(struct mmu_gath unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; int fullmm = (*tlbp)->fullmm; + struct mm_struct *mm = vma->vm_mm; + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -912,6 +932,7 @@ unsigned long unmap_vmas(struct mmu_gath } } out: + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -1544,10 +1565,11 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); + mmu_notifier_invalidate_range_start(mm, start, end); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1555,6 +1577,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier_invalidate_range_end(mm, start, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1756,7 +1779,7 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. */ - ptep_clear_flush(vma, address, page_table); + ptep_clear_flush_notify(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include @@ -2048,6 +2049,7 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mmu_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); @@ -2255,3 +2257,152 @@ int install_special_mapping(struct mm_st return 0; } + +static DEFINE_MUTEX(mm_all_locks_mutex); + +/* + * This operation locks against the VM for all pte/vma/mm related + * operations that could ever happen on a certain mm. This includes + * vmtruncate, try_to_unmap, and all page faults. + * + * The caller must take the mmap_sem in write mode before calling + * mm_take_all_locks(). The caller isn't allowed to release the + * mmap_sem until mm_drop_all_locks() returns. + * + * mmap_sem in write mode is required in order to block all operations + * that could modify pagetables and free pages without need of + * altering the vma layout (for example populate_range() with + * nonlinear vmas). It's also needed in write mode to avoid new + * anon_vmas to be associated with existing vmas. + * + * A single task can't take more than one mm_take_all_locks() in a row + * or it would deadlock. 
+ * + * The LSB in anon_vma->head.next and the AS_MM_ALL_LOCKS bitflag in + * mapping->flags avoid taking the same lock twice, if more than one + * vma in this mm is backed by the same anon_vma or address_space. + * + * We can take all the locks in random order because the VM code + * taking i_mmap_lock or anon_vma->lock outside the mmap_sem never + * takes more than one of them in a row. Secondly we're protected + * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex. + * + * mm_take_all_locks() and mm_drop_all_locks are expensive operations + * that may have to take thousands of locks. + * + * mm_take_all_locks() can fail if it's interrupted by signals. + */ +int mm_take_all_locks(struct mm_struct *mm) +{ + struct vm_area_struct *vma; + int ret = -EINTR; + + BUG_ON(down_read_trylock(&mm->mmap_sem)); + + mutex_lock(&mm_all_locks_mutex); + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + struct file *filp; + if (signal_pending(current)) + goto out_unlock; + if (vma->anon_vma && !test_bit(0, (unsigned long *) + &vma->anon_vma->head.next)) { + /* + * The LSB of head.next can't change from + * under us because we hold the + * global_mm_spinlock. + */ + spin_lock(&vma->anon_vma->lock); + /* + * We can safely modify head.next after taking + * the anon_vma->lock. If some other vma in + * this mm shares the same anon_vma we won't + * take it again. + * + * No need for atomic instructions here, + * head.next can't change from under us thanks + * to the anon_vma->lock. + */ + if (__test_and_set_bit(0, (unsigned long *) + &vma->anon_vma->head.next)) + BUG(); + } + + filp = vma->vm_file; + if (filp && filp->f_mapping && + !test_bit(AS_MM_ALL_LOCKS, &filp->f_mapping->flags)) { + /* + * AS_MM_ALL_LOCKS can't change from under us + * because we hold the global_mm_spinlock. + * + * Operations on ->flags have to be atomic + * because even if AS_MM_ALL_LOCKS is stable + * thanks to the global_mm_spinlock, there may + * be other cpus changing other bitflags in + * parallel to us. + */ + if (test_and_set_bit(AS_MM_ALL_LOCKS, + &filp->f_mapping->flags)) + BUG(); + spin_lock(&filp->f_mapping->i_mmap_lock); + } + } + ret = 0; + +out_unlock: + if (ret) + mm_drop_all_locks(mm); + + return ret; +} + +/* + * The mmap_sem cannot be released by the caller until + * mm_drop_all_locks() returns. + */ +void mm_drop_all_locks(struct mm_struct *mm) +{ + struct vm_area_struct *vma; + + BUG_ON(down_read_trylock(&mm->mmap_sem)); + BUG_ON(!mutex_is_locked(&mm_all_locks_mutex)); + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + struct file *filp; + if (vma->anon_vma && + test_bit(0, (unsigned long *) + &vma->anon_vma->head.next)) { + /* + * The LSB of head.next can't change to 0 from + * under us because we hold the + * global_mm_spinlock. + * + * We must however clear the bitflag before + * unlocking the vma so the users using the + * anon_vma->head will never see our bitflag. + * + * No need for atomic instructions here, + * head.next can't change from under us until + * we release the anon_vma->lock. + */ + if (!__test_and_clear_bit(0, (unsigned long *) + &vma->anon_vma->head.next)) + BUG(); + spin_unlock(&vma->anon_vma->lock); + } + filp = vma->vm_file; + if (filp && filp->f_mapping && + test_bit(AS_MM_ALL_LOCKS, &filp->f_mapping->flags)) { + /* + * AS_MM_ALL_LOCKS can't change to 0 from under us + * because we hold the global_mm_spinlock. 
+ */ + spin_unlock(&filp->f_mapping->i_mmap_lock); + if (!test_and_clear_bit(AS_MM_ALL_LOCKS, + &filp->f_mapping->flags)) + BUG(); + } + } + + mutex_unlock(&mm_all_locks_mutex); +} diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,276 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include +#include +#include +#include +#include +#include + +/* + * This function can't run concurrently against mmu_notifier_register + * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap + * runs with mm_users == 0. Other tasks may still invoke mmu notifiers + * in parallel despite there being no task using this mm any more, + * through the vmas outside of the exit_mmap context, such as with + * vmtruncate. This serializes against mmu_notifier_unregister with + * the mmu_notifier_mm->lock in addition to RCU and it serializes + * against the other mmu notifiers with RCU. struct mmu_notifier_mm + * can't go away from under us as exit_mmap holds an mm_count pin + * itself. + */ +void __mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + + spin_lock(&mm->mmu_notifier_mm->lock); + while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { + mn = hlist_entry(mm->mmu_notifier_mm->list.first, + struct mmu_notifier, + hlist); + /* + * We arrived before mmu_notifier_unregister so + * mmu_notifier_unregister will do nothing other than + * wait for ->release to finish and for + * mmu_notifier_unregister to return. + */ + hlist_del_init_rcu(&mn->hlist); + /* + * RCU here will block mmu_notifier_unregister until + * ->release returns. + */ + rcu_read_lock(); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * if ->release runs before mmu_notifier_unregister it + * must be handled as it's the only way for the driver + * to flush all existing sptes and stop the driver + * from establishing any more sptes before all the + * pages in the mm are freed. + */ + if (mn->ops->release) + mn->ops->release(mn, mm); + rcu_read_unlock(); + spin_lock(&mm->mmu_notifier_mm->lock); + } + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * synchronize_rcu here prevents mmu_notifier_release from + * returning to exit_mmap (which would proceed to free all pages + * in the mm) until the ->release method returns, if it was + * invoked by mmu_notifier_unregister. + * + * The mmu_notifier_mm can't go away from under us because one + * mm_count is held by exit_mmap. + */ + synchronize_rcu(); +} + +/* + * If no young bitflag is supported by the hardware, ->clear_flush_young can + * unmap the address and return 1 or 0 depending on whether the mapping + * previously existed or not. 
+ */ +int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0; + + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->clear_flush_young) + young |= mn->ops->clear_flush_young(mn, mm, address); + } + rcu_read_unlock(); + + return young; +} + +void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_page) + mn->ops->invalidate_page(mn, mm, address); + } + rcu_read_unlock(); +} + +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_start) + mn->ops->invalidate_range_start(mn, mm, start, end); + } + rcu_read_unlock(); +} + +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + + rcu_read_lock(); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_end) + mn->ops->invalidate_range_end(mn, mm, start, end); + } + rcu_read_unlock(); +} + +static int do_mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm, + int take_mmap_sem) +{ + struct mmu_notifier_mm * mmu_notifier_mm; + int ret; + + BUG_ON(atomic_read(&mm->mm_users) <= 0); + + ret = -ENOMEM; + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); + if (unlikely(!mmu_notifier_mm)) + goto out; + + if (take_mmap_sem) + down_write(&mm->mmap_sem); + ret = mm_take_all_locks(mm); + if (unlikely(ret)) + goto out_cleanup; + + if (!mm_has_notifiers(mm)) { + INIT_HLIST_HEAD(&mmu_notifier_mm->list); + spin_lock_init(&mmu_notifier_mm->lock); + mm->mmu_notifier_mm = mmu_notifier_mm; + mmu_notifier_mm = NULL; + } + atomic_inc(&mm->mm_count); + + /* + * Serialize the update against mmu_notifier_unregister. A + * side note: mmu_notifier_release can't run concurrently with + * us because we hold the mm_users pin (either implicitly as + * current->mm or explicitly with get_task_mm() or similar). + * We can't race against any other mmu notifier method either + * thanks to mm_take_all_locks(). + */ + spin_lock(&mm->mmu_notifier_mm->lock); + hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); + spin_unlock(&mm->mmu_notifier_mm->lock); + + mm_drop_all_locks(mm); +out_cleanup: + if (take_mmap_sem) + up_write(&mm->mmap_sem); + /* kfree() does nothing if mmu_notifier_mm is NULL */ + kfree(mmu_notifier_mm); +out: + BUG_ON(atomic_read(&mm->mm_users) <= 0); + return ret; +} + +/* + * Must not hold mmap_sem nor any other VM related lock when calling + * this registration function. Must also ensure mm_users can't go down + * to zero while this runs to avoid races with mmu_notifier_release, + * so mm has to be current->mm or the mm should be pinned safely such + * as with get_task_mm(). If the mm is not current->mm, the mm_users + * pin should be released by calling mmput after mmu_notifier_register + * returns. mmu_notifier_unregister must always be called to + * unregister the notifier. 
mm_count is automatically pinned to allow + * mmu_notifier_unregister to safely run at any time later, before or + * after exit_mmap. ->release will always be called before exit_mmap + * frees the pages. + */ +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 1); +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +/* + * Same as mmu_notifier_register but here the caller must hold the + * mmap_sem in write mode. + */ +int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 0); +} +EXPORT_SYMBOL_GPL(__mmu_notifier_register); + +/* this is called after the last mmu_notifier_unregister() has returned */ +void __mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list)); + kfree(mm->mmu_notifier_mm); + mm->mmu_notifier_mm = LIST_POISON1; /* debug */ +} + +/* + * This releases the mm_count pin automatically and frees the mm + * structure if it was the last user of it. It serializes against + * running mmu notifiers with RCU and against mmu_notifier_unregister + * with the unregister lock + RCU. All sptes must be dropped before + * calling mmu_notifier_unregister. ->release or any other notifier + * method may be invoked concurrently with mmu_notifier_unregister, + * and only after mmu_notifier_unregister returns are we guaranteed + * that ->release or any other method can't run anymore. + */ +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + spin_lock(&mm->mmu_notifier_mm->lock); + if (!hlist_unhashed(&mn->hlist)) { + hlist_del_rcu(&mn->hlist); + + /* + * RCU here will force exit_mmap to wait for ->release to finish + * before freeing the pages. + */ + rcu_read_lock(); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * exit_mmap will block in mmu_notifier_release to + * guarantee ->release is called before freeing the + * pages. + */ + if (mn->ops->release) + mn->ops->release(mn, mm); + rcu_read_unlock(); + } else + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * Wait for any running method to finish, of course including + * ->release if it was run by mmu_notifier_release instead of us. 
+ */ + synchronize_rcu(); + + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + mmdrop(mm); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -198,10 +199,12 @@ success: dirty_accountable = 1; } + mmu_notifier_invalidate_range_start(mm, start, end); if (is_vm_hugetlb_page(vma)) hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mmu_notifier_invalidate_range_end(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -74,7 +75,11 @@ static void move_ptes(struct vm_area_str struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start; + old_start = old_addr; + mmu_notifier_invalidate_range_start(vma->vm_mm, + old_start, old_end); if (vma->vm_file) { /* * Subtle point from Rajesh Venkatasubramanian: before @@ -116,6 +121,7 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -49,6 +49,7 @@ #include #include #include +#include #include @@ -287,7 +288,7 @@ static int page_referenced_one(struct pa if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young(vma, address, pte)) + } else if (ptep_clear_flush_young_notify(vma, address, pte)) referenced++; /* Pretend the page is referenced if the task has the @@ -457,7 +458,7 @@ static int page_mkclean_one(struct page pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush(vma, address, pte); + entry = ptep_clear_flush_notify(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -717,14 +718,14 @@ static int try_to_unmap_one(struct page * skipped over this mm) then we should reactivate it. */ if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte)))) { + (ptep_clear_flush_young_notify(vma, address, pte)))) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -849,12 +850,12 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young_notify(vma, address, pte)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* If nonlinear, store the file page offset in the pte. 
*/ if (page->index != linear_page_index(vma, address)) From swise at opengridcomputing.com Fri May 9 13:19:02 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 09 May 2008 15:19:02 -0500 Subject: [ofa-general] [PATCH 2.6.26] RDMA/cxgb3: Wrap the software sq ptr as needed on flush. Message-ID: <20080509201902.13077.53047.stgit@dell3.ogc.int> cxio_flush_sq() was failing to wrap around the sw-sq causing garbage completion entries on a flush operation. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index 5fd8506..20a6326 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -405,11 +405,11 @@ int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count) struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2); ptr = wq->sq_rptr + count; - sqp += count; + sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); while (ptr != wq->sq_wptr) { insert_sq_cqe(wq, cq, sqp); - sqp++; ptr++; + sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); flushed++; } return flushed; From rdreier at cisco.com Fri May 9 22:37:36 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 09 May 2008 22:37:36 -0700 Subject: [ofa-general] bitops take an unsigned long * In-Reply-To: <20080508222916.277649ca.akpm@linux-foundation.org> (Andrew Morton's message of "Thu, 8 May 2008 22:29:16 -0700") References: <20080508222916.277649ca.akpm@linux-foundation.org> Message-ID: > Most architectures could (and should) take an unsigned long * arg for their > bitops. x86 doesn't do this and it needs fixing. I fixed it. Infiniband > is being a problem. > drivers/infiniband/hw/ipath/ipath_driver.c: In function 'decode_sdma_errs': > drivers/infiniband/hw/ipath/ipath_driver.c:926: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:926: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c: In function 'ipath_cancel_sends': > drivers/infiniband/hw/ipath/ipath_driver.c:1901: warning: passing argument 2 of 'test_and_set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1902: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1902: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1934: warning: passing argument 2 of 'set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1949: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1949: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1950: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1950: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1955: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_driver.c:1955: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > 
drivers/infiniband/hw/ipath/ipath_intr.c: In function 'handle_sdma_errors': > drivers/infiniband/hw/ipath/ipath_intr.c:553: warning: passing argument 2 of '__set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:554: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:554: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c: In function 'handle_sdma_intr': > drivers/infiniband/hw/ipath/ipath_intr.c:566: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:566: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:570: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:570: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:575: warning: passing argument 2 of '__set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:579: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_intr.c:579: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_notify_task': > drivers/infiniband/hw/ipath/ipath_sdma.c:200: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:200: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': > drivers/infiniband/hw/ipath/ipath_sdma.c:236: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:236: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:253: warning: passing argument 2 of '__set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:354: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:354: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > 
drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'setup_sdma': > drivers/infiniband/hw/ipath/ipath_sdma.c:504: warning: passing argument 2 of '__set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'teardown_sdma': > drivers/infiniband/hw/ipath/ipath_sdma.c:521: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:522: warning: passing argument 2 of '__set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:523: warning: passing argument 2 of '__set_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_restart_sdma': > drivers/infiniband/hw/ipath/ipath_sdma.c:608: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:608: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:609: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:609: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:612: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:613: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:614: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_sdma_verbs_send': > drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type So all of these are ipath warnings, seemingly all because ipath_devdata.ipath_sdma_status is a u64. The stupid fix is to change this declaration to unsigned long as below, but this sets a trap if the driver is ever fixed so that it doesn't depend on 64BIT, because of /* bit positions for sdma_status */ #define IPATH_SDMA_ABORTING 0 #define IPATH_SDMA_DISARMED 1 #define IPATH_SDMA_DISABLED 2 #define IPATH_SDMA_LAYERBUF 3 #define IPATH_SDMA_RUNNING 62 #define IPATH_SDMA_SHUTDOWN 63 I don't see that this status is shared with hardware, and I don't see why the RUNNING and SHUTDOWN bits need to be 62 and 63... converting to unsigned long and moving those to bits 4 and 5 seems like it might be a clean fix. The other option is to convert to a bitmap and using the bitmap operations, which ends up being a bigger patch. But since I don't really understand this part of the driver, some guidance would be helpful... - R. 
diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index ce7b7c3..7635ace 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -1894,7 +1894,7 @@ void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl) */ if (dd->ipath_flags & IPATH_HAS_SEND_DMA) { int skip_cancel; - u64 *statp = &dd->ipath_sdma_status; + unsigned long *statp = &dd->ipath_sdma_status; spin_lock_irqsave(&dd->ipath_sdma_lock, flags); skip_cancel = diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 02b24a3..a46f8ad 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -483,7 +483,7 @@ struct ipath_devdata { /* SendDMA related entries */ spinlock_t ipath_sdma_lock; - u64 ipath_sdma_status; + unsigned long ipath_sdma_status; unsigned long ipath_sdma_abort_jiffies; unsigned long ipath_sdma_abort_intr_timeout; unsigned long ipath_sdma_buf_jiffies; From akpm at linux-foundation.org Sat May 10 00:08:38 2008 From: akpm at linux-foundation.org (Andrew Morton) Date: Sat, 10 May 2008 00:08:38 -0700 Subject: [ofa-general] bitops take an unsigned long * In-Reply-To: References: <20080508222916.277649ca.akpm@linux-foundation.org> Message-ID: <20080510000838.e85f5d89.akpm@linux-foundation.org> On Fri, 09 May 2008 22:37:36 -0700 Roland Dreier wrote: > > Most architectures could (and should) take an unsigned long * arg for their > > bitops. x86 doesn't do this and it needs fixing. I fixed it. Infiniband > > is being a problem. > > ... > > > drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > So all of these are ipath warnings, seemingly all because > ipath_devdata.ipath_sdma_status is a u64. The stupid fix is to change > this declaration to unsigned long as below, but this sets a trap if the > driver is ever fixed so that it doesn't depend on 64BIT, because of > > /* bit positions for sdma_status */ > #define IPATH_SDMA_ABORTING 0 > #define IPATH_SDMA_DISARMED 1 > #define IPATH_SDMA_DISABLED 2 > #define IPATH_SDMA_LAYERBUF 3 > #define IPATH_SDMA_RUNNING 62 > #define IPATH_SDMA_SHUTDOWN 63 > > I don't see that this status is shared with hardware, and I don't see > why the RUNNING and SHUTDOWN bits need to be 62 and 63... converting to > unsigned long and moving those to bits 4 and 5 seems like it might be a > clean fix. > > The other option is to convert to a bitmap and using the bitmap > operations, which ends up being a bigger patch. > > But since I don't really understand this part of the driver, some > guidance would be helpful... > Another option might be - u64 ipath_sdma_status; + unsigned long ipath_sdma_status[64/BITS_PER_LONG]; Because the bitops are OK for use against an _array_ of unsigned longs, not just a single unsigned long. Or, if you want to preserve that u64: union { u64 ipath_sdma_status; unsigned long ipath_sdma_status_bits[64/BITS_PER_LONG]; }; and do the bitops on ipath_sdma_status_bits. Or just remove all the set_bit/clear_bit/etc and use plain old |, &, etc. It all needs a bit of thought if you're supporting big-endian machines, however. 
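[A standalone illustration of the endianness point above. This is hypothetical demo code, not taken from the ipath driver: it applies set_bit()-style word-and-bit indexing to a u64 viewed as an array of unsigned long, then reads the same bit back with a plain u64 shift. The two views agree on little-endian systems and wherever long is 64 bits wide, but disagree for bit positions >= 32 on big-endian 32-bit machines, which is exactly why status bits 62 and 63 are the awkward ones.]

#include <stdio.h>
#include <stdint.h>
#include <limits.h>

int main(void)
{
	uint64_t status = 0;
	unsigned long *words = (unsigned long *) &status;	/* the bitops view */
	const int bit = 62;					/* e.g. IPATH_SDMA_RUNNING */

	/* what set_bit(bit, words) effectively does */
	words[bit / (sizeof(long) * CHAR_BIT)] |=
		1UL << (bit % (sizeof(long) * CHAR_BIT));

	/* prints 1 on little-endian (any word size) and on 64-bit
	 * big-endian, but 0 on big-endian 32-bit */
	printf("u64 view sees bit %d as %d\n", bit,
	       (int) ((status >> bit) & 1));
	return 0;
}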
From hrosenstock at xsigo.com Sat May 10 05:29:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Sat, 10 May 2008 05:29:13 -0700 Subject: [ofa-general] OpenSM and fat tree Message-ID: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> Hi Yevgeny, Is it possible that OpenSM's fat tree routing somehow depends on the LIDs previously assigned ? It seems that for a legitimate fat tree topology, the topology sometimes won't come up as a fat tree if reassigning LIDs (-r) is not used. In addition to -r making fat tree work, certain routing algorithms seem to also clear this out (without using -r). For example, if lash were run and then ftree, it seems to work without doing the -r. (Haven't yet tried updn). Any ideas on this ? Should a bug be filed on this ? Thanks. -- Hal From rdreier at cisco.com Sat May 10 09:02:09 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 10 May 2008 09:02:09 -0700 Subject: [ofa-general] Verbs: IB vs. iWARP In-Reply-To: <48248E58.8060301@opengridcomputing.com> (Steve Wise's message of "Fri, 09 May 2008 12:48:08 -0500") References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com> <48247B18.8000100@opengridcomputing.com> <48248E58.8060301@opengridcomputing.com> Message-ID: > One more item added to the wiki page: > > > * iWARP RNICs support a special privileged lkey == 0 which can be > used in local SGLs when the address is a bus/dma address. I think under Linux this is pretty much irrelevant, given that we have ib_get_dma_mr(). And the IB BMME define the same concept anyway. - R. From kliteyn at dev.mellanox.co.il Sat May 10 11:51:53 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sat, 10 May 2008 21:51:53 +0300 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> Message-ID: <4825EEC9.4070208@dev.mellanox.co.il> Hal Rosenstock wrote: > Hi Yevgeny, > > Is it possible that OpenSM's fat tree routing somehow depends on the > LIDs previously assigned ? It depends only on the existence of the LIDs. > It seems that for a legitimate fat tree topology, the topology sometimes > won't come up as a fat tree if reassigning LIDs (-r) is not used. That's odd... > In addition to -r making fat tree work, certain routing algorithms seem > to also clear this out (without using -r). For example, if lash were run > and then ftree, it seems to work without doing the -r. (Haven't yet > tried updn). > > Any ideas on this ? Should a bug be filed on this ? Thanks. No ideas whatsoever. Please file a bug on this. It would be nice if I could reproduce it in simulation. 
-- Yevgeny > -- Hal > > > From akepner at sgi.com Sat May 10 12:07:21 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Sat, 10 May 2008 12:07:21 -0700 Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3) In-Reply-To: References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> Message-ID: <20080510190721.GI5298@sgi.com> On Thu, May 08, 2008 at 10:50:11AM -0700, Roland Dreier wrote: > ... > It might be useful to track the value of tx_outstanding... from a quick > look at the code I can't see how the transmit queue could be awake when > the UD send queue is full. > I haven't been able to get any new debug data (the only way we know to reproduce this one is to use a pretty large system - a scarce resource), but it does look like there's a hole here, since ipoib_cm.c:ipoib_cm_send() and ipoib_ib.c:ipoib_send() check on different conditions (off by one) to detect a full queue. ipoib_cm.c:ipoib_cm_send() does: if (++priv->tx_outstanding == ipoib_sendq_size) netif_stop_queue(dev); but ipoib_ib.c:ipoib_send() does: if (++priv->tx_outstanding == (ipoib_sendq_size - 1)) { netif_stop_queue(dev); So a call to ipoib_cm_send() with tx_outstanding = (ipoib_sendq_size - 2), followed by a call to ipoib_send() would get to a situation where the queue was full, but not stopped. I'm not saying this is what's happening for us (just dunno yet) but it looks possible. -- Arthur From swise at opengridcomputing.com Sat May 10 16:18:45 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 10 May 2008 18:18:45 -0500 Subject: [ofa-general] Verbs: IB vs. iWARP In-Reply-To: References: <681B7596-9218-4016-863E-7DDBC085AD7E@cisco.com> <48247B18.8000100@opengridcomputing.com> <48248E58.8060301@opengridcomputing.com> Message-ID: <48262D55.80806@opengridcomputing.com> Roland Dreier wrote: > > One more item added to the wiki page: > > > > > > * iWARP RNICs support a special privileged lkey == 0 which can be > > used in local SGLs when the address is a bus/dma address. > > I think under Linux this is pretty much irrelevant, given that we have > ib_get_dma_mr(). And the IB BMME define the same concept anyway. > > - R. Its not irrelevant if someone tries to port an iwarp app to IB and said iwarp app uses lkey 0 everywhere... From jackm at dev.mellanox.co.il Sat May 10 22:49:24 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 11 May 2008 08:49:24 +0300 Subject: [ofa-general] [2.6.27 PATCH/RFC] IB/srp: Remove use of cached P_Key/GID queries In-Reply-To: References: Message-ID: <200805110849.25034.jackm@dev.mellanox.co.il> On Thursday 08 May 2008 01:22, Roland Dreier wrote: >  Since we want to eliminate the > cached operations in the long term, convert SRP to use the uncached > variants. Eliminating the caches will pose a performance problem when sending raw packets. The ib_post_send API provides the pkey_index -- and this needs to be translated to the actual p_key when building the Base Transport Header. 
- Jack From eli at dev.mellanox.co.il Sun May 11 01:18:19 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 11 May 2008 11:18:19 +0300 Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3) In-Reply-To: <20080510190721.GI5298@sgi.com> References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com> Message-ID: <1210493899.15669.116.camel@mtls03> On Sat, 2008-05-10 at 12:07 -0700, akepner at sgi.com wrote: > I haven't been able to get any new debug data (the only way we > know to reproduce this one is to use a pretty large system - a > scarce resource), but it does look like there's a hole here, > since ipoib_cm.c:ipoib_cm_send() and ipoib_ib.c:ipoib_send() > check on different conditions (off by one) to detect a full > queue. > > ipoib_cm.c:ipoib_cm_send() does: > if (++priv->tx_outstanding == ipoib_sendq_size) > netif_stop_queue(dev); > > but ipoib_ib.c:ipoib_send() does: > if (++priv->tx_outstanding == (ipoib_sendq_size - 1)) { > netif_stop_queue(dev); > > So a call to ipoib_cm_send() with tx_outstanding = (ipoib_sendq_size - 2), > followed by a call to ipoib_send() would get to a situation where > the queue was full, but not stopped. The reason why the queue is stopped when there is one entry still left is to allow ipoib_ib_tx_timer_func() to post a special send request that will ensure a completion is reported for this operation thus freeing entries at the tx ring. I don't think the scenario you describe here can lead to a deadlock since if that happens, it will be released because of either one of the following two reasons: 1. If the tx queue contains not yet polled, more than one completion of send WRs posted by ipoib_cm_send(), they will soon be polled since they are posted to a signaled QP and sooner or later will generate completions and interrupts. In this case, subsequent postings to ipoib_send() will work as expected. 2. If there is only one outstanding ipoib_cm_send() at the tx queue, it means that there are 126 outstanding ipoib_send() requests at the tx queue and this means that a few of them are signaled and are expected to be completed soon. If you just want to make sure there is no bug in my theory you can just use this patch: Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2008-05-07 12:30:10.000000000 +0300 +++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2008-05-11 09:59:42.000000000 +0300 @@ -535,7 +535,9 @@ static inline int post_send(struct ipoib } else priv->tx_wr.opcode = IB_WR_SEND; - if (unlikely((priv->tx_head & (MAX_SEND_CQE - 1)) == MAX_SEND_CQE - 1)) + /* start forcing signaled if we get near queue full */ + if (unlikely((priv->tx_head & (MAX_SEND_CQE - 1)) == MAX_SEND_CQE - 1) || + priv->tx_outstanding > (ipoib_sendq_size - 5)) priv->tx_wr.send_flags |= IB_SEND_SIGNALED; else priv->tx_wr.send_flags &= ~IB_SEND_SIGNALED; And last, could you arrange a remote access to a machine in this condition so we could check the state of the device/FW? 
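[The off-by-one interaction Arthur described, and that Eli argues is harmless in practice, can be modeled in isolation. This is hypothetical demo code with made-up helpers that mimic the two stop-queue checks quoted above; it is not driver code, and it only shows the accounting, not whether real traffic can reach this interleaving.]

#include <stdio.h>
#include <stdbool.h>

enum { SENDQ_SIZE = 128 };	/* stands in for ipoib_sendq_size */

static int tx_outstanding;
static bool stopped;

/* mimics the check in ipoib_cm_send(): stop only when completely full */
static void cm_send(void)
{
	if (++tx_outstanding == SENDQ_SIZE)
		stopped = true;
}

/* mimics the check in ipoib_send(): stop one entry early, leaving room
 * for the completion-forcing WR posted by ipoib_ib_tx_timer_func() */
static void ud_send(void)
{
	if (++tx_outstanding == SENDQ_SIZE - 1)
		stopped = true;
}

int main(void)
{
	while (tx_outstanding < SENDQ_SIZE - 1)
		cm_send();	/* reaches size-1 without tripping either test */
	ud_send();		/* increments to size, skipping the size-1 test */

	/* prints outstanding=128 stopped=0: ring full, queue never stopped */
	printf("outstanding=%d stopped=%d\n", tx_outstanding, stopped);
	return 0;
}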
From olga.shern at gmail.com Sun May 11 02:38:50 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Sun, 11 May 2008 12:38:50 +0300 Subject: [ofa-general] Re: [ewg] OFED May 5 meeting summary In-Reply-To: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> Message-ID: On 5/6/08, Tziporet Koren wrote: > > > May 5 OFED meeting summary: > =========================== > > 1. OFED 1.3.1: > 1.1 Status of changes: > IB-bonding - on work > SRP failover - done (need more testing) > SDP crashes - on work (not clear if we will have > something on time) > RDS fixes for RDMA API - done > librdmacm 1.0.7 - done > uDAPL updates - done > Open MPI 1.2.6 - done > MVAPICH 1.0.1 - done > MVAPICH2 1.0.3 - done > IPoIB - 2 bugs fixed. There are still two issue that > should be resolved. > Low level drivers: Changes that already committed: > nes > mlx4 > cxgb3 > ehca > > 1.2 Schedule: > rc1 - was released today > rc2 - May 20 > GA - May 29 > > 1.3 Discussion: > - ipath driver is going to be updated > - There is an issue of bonding and Ethernet drivers on RHEL4 - > under debug > - We wish to add support for SLES10 SP2. Already got an approval > from Novell > Any volunteer to provide the new backport patches? Tziporet, we will do it. Already started with it, seems like everything is compiled, need only backport bonding Olga 2. OFED 1.4: > Updated that the new tree will be ready next week - based on > 2.6.26-rc > > 3. Update on OpenSuSE build system - Yiftah updated on the work that is > done and problems: > - The system requires clean RPMs only (no use of install script) - > they work to resolve > - We target this system toward releases (and not to replace the daily > build system). > - we may try now with OFED 1.3.1 > > > Tziporet > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > -------------- next part -------------- An HTML attachment was scrubbed... URL: From moshek at voltaire.com Sun May 11 03:03:17 2008 From: moshek at voltaire.com (Moshe Kazir) Date: Sun, 11 May 2008 13:03:17 +0300 Subject: [ofa-general] OFED May 5 meeting summary In-Reply-To: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> Message-ID: <39C75744D164D948A170E9792AF8E7CAC5AF56@exil.voltaire.com> > - We wish to add support for SLES10 SP2. Already got an approval from Novell > Any volunteer to provide the new backport patches? I have checked OFED-1.3.1-rc1 on SLES10 SP 2 Beta3. ib-bonding compile failed. Everything else is compiled o.k. Attached : ib-bonding error log. I'll take the backport of ib-bonding to sles10 sp 2 on me (if needed, I'll get Moni's help). Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren Sent: Tuesday, May 06, 2008 6:45 PM To: Tziporet Koren; ewg at lists.openfabrics.org Cc: general at lists.openfabrics.org Subject: [ofa-general] OFED May 5 meeting summary May 5 OFED meeting summary: =========================== 1. 
OFED 1.3.1: 1.1 Status of changes: IB-bonding - on work SRP failover - done (need more testing) SDP crashes - on work (not clear if we will have something on time) RDS fixes for RDMA API - done librdmacm 1.0.7 - done uDAPL updates - done Open MPI 1.2.6 - done MVAPICH 1.0.1 - done MVAPICH2 1.0.3 - done IPoIB - 2 bugs fixed. There are still two issue that should be resolved. Low level drivers: Changes that already committed: nes mlx4 cxgb3 ehca 1.2 Schedule: rc1 - was released today rc2 - May 20 GA - May 29 1.3 Discussion: - ipath driver is going to be updated - There is an issue of bonding and Ethernet drivers on RHEL4 - under debug - We wish to add support for SLES10 SP2. Already got an approval from Novell Any volunteer to provide the new backport patches? 2. OFED 1.4: Updated that the new tree will be ready next week - based on 2.6.26-rc 3. Update on OpenSuSE build system - Yiftah updated on the work that is done and problems: - The system requires clean RPMs only (no use of install script) - they work to resolve - We target this system toward releases (and not to replace the daily build system). - we may try now with OFED 1.3.1 Tziporet _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- A non-text attachment was scrubbed... Name: ib-bonding.rpmbuild.log Type: application/octet-stream Size: 31538 bytes Desc: ib-bonding.rpmbuild.log URL: From akepner at sgi.com Sun May 11 03:23:45 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Sun, 11 May 2008 03:23:45 -0700 Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3) In-Reply-To: <1210493899.15669.116.camel@mtls03> References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com> <1210493899.15669.116.camel@mtls03> Message-ID: <20080511102345.GJ5298@sgi.com> On Sun, May 11, 2008 at 11:18:19AM +0300, Eli Cohen wrote: > .... > The reason why the queue is stopped when there is one entry still left > is to allow ipoib_ib_tx_timer_func() to post a special send request that > will ensure a completion is reported for this operation thus freeing > entries at the tx ring. I don't think the scenario you describe here can > lead to a deadlock since if that happens, it will be released because of > either one of the following two reasons: > 1. If the tx queue contains not yet polled, more than one completion of > send WRs posted by ipoib_cm_send(), they will soon be polled since they > are posted to a signaled QP and sooner or later will generate > completions and interrupts. In this case, subsequent postings to > ipoib_send() will work as expected. > > 2. If there is only one outstanding ipoib_cm_send() at the tx queue, it > means that there are 126 outstanding ipoib_send() requests at the tx > queue and this means that a few of them are signaled and are expected to > be completed soon. Thanks for the explanation. The main problem that we're seeing is that we just stop getting completions for the send queue. (And we see this with OFED-1.2 and 1.3, which makes me think that it's unlikely to be due to the IPoIB driver since that's changed so much.) > ..... > And last, could you arrange a remote access to a machine in this > condition so we could check the state of the device/FW? > Yes, I think so. Let me see if I can arrange that. 
-- Arthur From erezz at voltaire.com Sun May 11 04:00:16 2008 From: erezz at voltaire.com (Erez Zilber) Date: Sun, 11 May 2008 14:00:16 +0300 Subject: [ofa-general] Moving responsibility for iSER & iSCSI related issues Message-ID: <4826D1C0.4040301@voltaire.com> Hi, After ~4 years of working on iSER & iSCSI, I'm moving on and will be involved from a different perspective. Therefore, I will be unable to continue my current maintainership responsibility for iSER related issues. I want to thank everyone for the great work that I had the chance to be part of. Eli Dorfman (elid at voltaire.com) will be taking over my maintainership of iSER code for kernel.org. Eli has already started doing that work. Doron Shoham (dorons at voltaire.com) will be responsible for iSER and iSCSI related issues in OFED (i.e. open-iscsi, iSER & stgt). All relevant git trees will move from my trees to his. These changes will be effective as of 19/5/08. After that, if you need anything, I will be available on erezzi.list at gmail.com Erez From vlad at dev.mellanox.co.il Sun May 11 04:46:00 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 11 May 2008 14:46:00 +0300 Subject: [ofa-general] Re: [rds-devel] New rds patches for ofed 1.3.1 In-Reply-To: <200805071250.22719.olaf.kirch@oracle.com> References: <200805071250.22719.olaf.kirch@oracle.com> Message-ID: <4826DC78.50700@dev.mellanox.co.il> Olaf Kirch wrote: > Hi, > > I have two more RDS kernel patches for OFED 1.3.1, and one additional > rds-tools patch. They're available from my git trees at > on branch code-drop-20080507 > > If you have any feedback, please let me know. > > At this point, I'm not going to submit the dma_sync patches yet. > I think they need more testing, and I'd rather postpone them to > OFED 1.3.2. > > I'll also post these patches in a follow-up email to this message. > > Olaf Pulled into OFED-1.3.1. Regards, Vladimir From olga.shern at gmail.com Sun May 11 04:49:42 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Sun, 11 May 2008 14:49:42 +0300 Subject: [ofa-general] [PATCH] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <1210148064.15669.84.camel@mtls03> References: <48206690.3090604@Voltaire.COM> <1210148064.15669.84.camel@mtls03> Message-ID: On 5/7/08, Eli Cohen wrote: > > > On Tue, 2008-05-06 at 17:09 +0300, Moni Shoua wrote: > > The purpose of this patch is to make the events that are related to SM > change > > (namely CLIENT_REREGISTER event and SM_CHANGE event) less disruptive. > > When SM related events are handled, it is not necessary to flush unicast > > info from device but only multicast info. This patch divides the events > that are > > handled by IPoIB to three categories; 0, 1 and 2 (when 2 does more than 1 > and 1 > > does more than 0). > > The main change is in __ipoib_ib_dev_flush(). Instead of flagging to the > function > > about pkey_events we now use leveling. An event that requires "harder" > flushing > > calls this function with higher number for level. Besides the concept, > > the actual change is that SM related events are not flushing unicast > info and > > not bringing the device down but only refresh the multicast info in the > background. > > > As far as I know, when an SM change event occurs, it could mean the SM > changed and the new one "decided" to reprogram all the LIDs for example. > In that case you will issue only level 0 and the all your neighbours can > become invalid. 
> > When an SM change event occurs it means that there was an SM failover; OpenSM, and also vendor SMs, will in 99% of the cases keep the LIDs (LID persistency). If there is a LID change then there will be a LID change event, and that is a level 1 event, not a level 0 event. ______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eli at dev.mellanox.co.il Sun May 11 05:17:53 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 11 May 2008 15:17:53 +0300 Subject: [ofa-general] [PATCH] IB/core: Add completion flag for send with invalidate Message-ID: <1210508273.15669.131.camel@mtls03> >From da2391afba573aa5cbfd488e2c2498e3586ae1b9 Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Sun, 11 May 2008 14:59:08 +0300 Subject: [PATCH] IB/core: Add completion flag for send with invalidate Add IB_WC_WITH_INVALIDATE to enum ib_wc_flags to mark completions of "send with invalidate" operations. Signed-off-by: Eli Cohen --- include/rdma/ib_verbs.h | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..57a11f8 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -424,7 +424,8 @@ enum ib_wc_opcode { enum ib_wc_flags { IB_WC_GRH = 1, - IB_WC_WITH_IMM = (1<<1) + IB_WC_WITH_IMM = (1<<1), + IB_WC_WITH_INVALIDATE = (1<<2), }; struct ib_wc { -- 1.5.5.1 From eli at dev.mellanox.co.il Sun May 11 05:18:41 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 11 May 2008 15:18:41 +0300 Subject: [ofa-general] [PATCH] IB/mlx4: Add send with invalidate support Message-ID: <1210508321.15669.133.camel@mtls03> >From 1c9492f357efa456074ab7e4552e8d8eccfe3cfe Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Sun, 11 May 2008 15:02:04 +0300 Subject: [PATCH] IB/mlx4: Add send with invalidate support Add send with invalidate support to mlx4. 
Signed-off-by: Eli Cohen --- drivers/infiniband/hw/mlx4/cq.c | 8 ++++++++ drivers/infiniband/hw/mlx4/qp.c | 22 +++++++++++++++++----- drivers/net/mlx4/mr.c | 6 ++++-- 3 files changed, 29 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 4521319..291e856 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -637,6 +637,7 @@ repoll: case MLX4_OPCODE_SEND_IMM: wc->wc_flags |= IB_WC_WITH_IMM; case MLX4_OPCODE_SEND: + case MLX4_OPCODE_SEND_INVAL: wc->opcode = IB_WC_SEND; break; case MLX4_OPCODE_RDMA_READ: @@ -676,6 +677,13 @@ repoll: wc->wc_flags = IB_WC_WITH_IMM; wc->imm_data = cqe->immed_rss_invalid; break; + case MLX4_RECV_OPCODE_SEND_INVAL: + wc->opcode = IB_WC_RECV; + wc->wc_flags = IB_WC_WITH_INVALIDATE; + /* + * TBD: maybe we should just call this ieth_val + */ + wc->imm_data = be32_to_cpu(cqe->immed_rss_invalid); } wc->slid = be16_to_cpu(cqe->rlid); diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 8e02ecf..d0d5f77 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -78,6 +78,7 @@ static const __be32 mlx4_ib_opcode[] = { [IB_WR_RDMA_READ] = __constant_cpu_to_be32(MLX4_OPCODE_RDMA_READ), [IB_WR_ATOMIC_CMP_AND_SWP] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_CS), [IB_WR_ATOMIC_FETCH_AND_ADD] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_FA), + [IB_WR_SEND_WITH_INV] = __constant_cpu_to_be32(MLX4_OPCODE_SEND_INVAL), }; static struct mlx4_ib_sqp *to_msqp(struct mlx4_ib_qp *mqp) @@ -1444,6 +1445,21 @@ static int build_lso_seg(struct mlx4_lso_seg *wqe, struct ib_send_wr *wr, return 0; } +static __be32 get_ieth(struct ib_send_wr *wr) +{ + switch (wr->opcode) { + case IB_WR_SEND_WITH_IMM: + case IB_WR_RDMA_WRITE_WITH_IMM: + return wr->ex.imm_data; + + case IB_WR_SEND_WITH_INV: + return cpu_to_be32(wr->ex.invalidate_rkey); + + default: + return 0; + } +} + int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { @@ -1490,11 +1506,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, MLX4_WQE_CTRL_TCP_UDP_CSUM) : 0) | qp->sq_signal_bits; - if (wr->opcode == IB_WR_SEND_WITH_IMM || - wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) - ctrl->imm = wr->ex.imm_data; - else - ctrl->imm = 0; + ctrl->imm = get_ieth(wr); wqe += sizeof *ctrl; size = sizeof *ctrl / 16; diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c index 03a9abc..e78f53d 100644 --- a/drivers/net/mlx4/mr.c +++ b/drivers/net/mlx4/mr.c @@ -47,7 +47,7 @@ struct mlx4_mpt_entry { __be32 flags; __be32 qpn; __be32 key; - __be32 pd; + __be32 pd_flags; __be64 start; __be64 length; __be32 lkey; @@ -71,6 +71,8 @@ struct mlx4_mpt_entry { #define MLX4_MPT_STATUS_SW 0xF0 #define MLX4_MPT_STATUS_HW 0x00 +#define MLX4_MPT_FLAG_EN_INV 0x3000000 + static u32 mlx4_buddy_alloc(struct mlx4_buddy *buddy, int order) { int o; @@ -320,7 +322,7 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr) mr->access); mpt_entry->key = cpu_to_be32(key_to_hw_index(mr->key)); - mpt_entry->pd = cpu_to_be32(mr->pd); + mpt_entry->pd_flags = cpu_to_be32(mr->pd | MLX4_MPT_FLAG_EN_INV); mpt_entry->start = cpu_to_be64(mr->iova); mpt_entry->length = cpu_to_be64(mr->size); mpt_entry->entity_size = cpu_to_be32(mr->mtt.page_shift); -- 1.5.5.1 From kliteyn at dev.mellanox.co.il Sun May 11 05:37:11 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 11 May 2008 15:37:11 +0300 Subject: [ofa-general] [PATCH] infiniband-diags/Makefile.am: fix 
location of ibdiag_version.h Message-ID: <4826E877.7090706@dev.mellanox.co.il> Hi Sasha, When compiling infiniband-diags not from the source code location, compilation fails to find the ibdiag_version.h file - fixing it. Signed-off-by: Yevgeny Kliteynik --- infiniband-diags/Makefile.am | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index e502a06..b6228b5 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -1,5 +1,5 @@ -INCLUDES = -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband +INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband if DEBUG DBGFLAGS = -ggdb -D_DEBUG_ @@ -103,7 +103,7 @@ man_MANS = man/ibaddr.8 man/ibcheckerrors.8 man/ibcheckerrs.8 \ BUILT_SOURCES = ibdiag_version ibdiag_version: if [ -x $(top_srcdir)/../gen_ver.sh ] ; then \ - ver_file=$(srcdir)/include/ibdiag_version.h ; \ + ver_file=$(top_builddir)/include/ibdiag_version.h ; \ ibdiag_ver=`cat $$ver_file | sed -ne '/#define IBDIAG_VERSION /s/^.*\"\(.*\)\"$$/\1/p'` ; \ ver=`$(top_srcdir)/../gen_ver.sh $(PACKAGE)` ; \ if [ $$ver != $$ibdiag_ver ] ; then \ -- 1.5.1.4 From eli at dev.mellanox.co.il Sun May 11 07:21:11 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 11 May 2008 17:21:11 +0300 Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support Message-ID: <1210515671.15669.138.camel@mtls03> >From 0fdabd83e54369b51ac41003f7fe282604b63ad5 Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Sun, 11 May 2008 15:02:04 +0300 Subject: Add send with invalidate support to mlx4. Signed-off-by: Eli Cohen --- changes since last commit: set cap flag IB_DEVICE_SEND_W_INV drivers/infiniband/hw/mlx4/cq.c | 8 ++++++++ drivers/infiniband/hw/mlx4/main.c | 3 ++- drivers/infiniband/hw/mlx4/qp.c | 22 +++++++++++++++++----- drivers/net/mlx4/mr.c | 6 ++++-- 4 files changed, 31 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 4521319..291e856 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -637,6 +637,7 @@ repoll: case MLX4_OPCODE_SEND_IMM: wc->wc_flags |= IB_WC_WITH_IMM; case MLX4_OPCODE_SEND: + case MLX4_OPCODE_SEND_INVAL: wc->opcode = IB_WC_SEND; break; case MLX4_OPCODE_RDMA_READ: @@ -676,6 +677,13 @@ repoll: wc->wc_flags = IB_WC_WITH_IMM; wc->imm_data = cqe->immed_rss_invalid; break; + case MLX4_RECV_OPCODE_SEND_INVAL: + wc->opcode = IB_WC_RECV; + wc->wc_flags = IB_WC_WITH_INVALIDATE; + /* + * TBD: maybe we should just call this ieth_val + */ + wc->imm_data = be32_to_cpu(cqe->immed_rss_invalid); } wc->slid = be16_to_cpu(cqe->rlid); diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 4d61e32..a88fa15 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -90,7 +90,8 @@ static int mlx4_ib_query_device(struct ib_device *ibdev, props->device_cap_flags = IB_DEVICE_CHANGE_PHY_PORT | IB_DEVICE_PORT_ACTIVE_EVENT | IB_DEVICE_SYS_IMAGE_GUID | - IB_DEVICE_RC_RNR_NAK_GEN; + IB_DEVICE_RC_RNR_NAK_GEN | + IB_DEVICE_SEND_W_INV; if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BAD_PKEY_CNTR) props->device_cap_flags |= IB_DEVICE_BAD_PKEY_CNTR; if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_BAD_QKEY_CNTR) diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 8e02ecf..d0d5f77 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -78,6 
+78,7 @@ static const __be32 mlx4_ib_opcode[] = { [IB_WR_RDMA_READ] = __constant_cpu_to_be32(MLX4_OPCODE_RDMA_READ), [IB_WR_ATOMIC_CMP_AND_SWP] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_CS), [IB_WR_ATOMIC_FETCH_AND_ADD] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_FA), + [IB_WR_SEND_WITH_INV] = __constant_cpu_to_be32(MLX4_OPCODE_SEND_INVAL), }; static struct mlx4_ib_sqp *to_msqp(struct mlx4_ib_qp *mqp) @@ -1444,6 +1445,21 @@ static int build_lso_seg(struct mlx4_lso_seg *wqe, struct ib_send_wr *wr, return 0; } +static __be32 get_ieth(struct ib_send_wr *wr) +{ + switch (wr->opcode) { + case IB_WR_SEND_WITH_IMM: + case IB_WR_RDMA_WRITE_WITH_IMM: + return wr->ex.imm_data; + + case IB_WR_SEND_WITH_INV: + return cpu_to_be32(wr->ex.invalidate_rkey); + + default: + return 0; + } +} + int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { @@ -1490,11 +1506,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, MLX4_WQE_CTRL_TCP_UDP_CSUM) : 0) | qp->sq_signal_bits; - if (wr->opcode == IB_WR_SEND_WITH_IMM || - wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) - ctrl->imm = wr->ex.imm_data; - else - ctrl->imm = 0; + ctrl->imm = get_ieth(wr); wqe += sizeof *ctrl; size = sizeof *ctrl / 16; diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c index 03a9abc..e78f53d 100644 --- a/drivers/net/mlx4/mr.c +++ b/drivers/net/mlx4/mr.c @@ -47,7 +47,7 @@ struct mlx4_mpt_entry { __be32 flags; __be32 qpn; __be32 key; - __be32 pd; + __be32 pd_flags; __be64 start; __be64 length; __be32 lkey; @@ -71,6 +71,8 @@ struct mlx4_mpt_entry { #define MLX4_MPT_STATUS_SW 0xF0 #define MLX4_MPT_STATUS_HW 0x00 +#define MLX4_MPT_FLAG_EN_INV 0x3000000 + static u32 mlx4_buddy_alloc(struct mlx4_buddy *buddy, int order) { int o; @@ -320,7 +322,7 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr) mr->access); mpt_entry->key = cpu_to_be32(key_to_hw_index(mr->key)); - mpt_entry->pd = cpu_to_be32(mr->pd); + mpt_entry->pd_flags = cpu_to_be32(mr->pd | MLX4_MPT_FLAG_EN_INV); mpt_entry->start = cpu_to_be64(mr->iova); mpt_entry->length = cpu_to_be64(mr->size); mpt_entry->entity_size = cpu_to_be32(mr->mtt.page_shift); -- 1.5.5.1 From rdreier at cisco.com Sun May 11 08:34:12 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 11 May 2008 08:34:12 -0700 Subject: [ofa-general] [2.6.27 PATCH/RFC] IB/srp: Remove use of cached P_Key/GID queries In-Reply-To: <200805110849.25034.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Sun, 11 May 2008 08:49:24 +0300") References: <200805110849.25034.jackm@dev.mellanox.co.il> Message-ID: > >  Since we want to eliminate the > > cached operations in the long term, convert SRP to use the uncached > > variants. > Eliminating the caches will pose a performance problem when sending > raw packets. The ib_post_send API provides the pkey_index -- and this > needs to be translated to the actual p_key when building the Base > Transport Header. Not sure what you mean about a performance problem... if you're talking about the mthca/mlx4 internals, then we just need a private P_Key/GID cache inside the driver. This can be updated synchronously when processing MADs, rather than having all the complicated logic in the current caching module. But I don't think sending MADs is really a fast path anyway. - R. 
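As a minimal sketch of the two query styles under discussion (an illustration, assuming the 2.6.x ib_core interfaces; the wrapper function below is hypothetical, while ib_query_pkey() and ib_get_cached_pkey() are the actual verbs/cache calls):

#include <rdma/ib_verbs.h>
#include <rdma/ib_cache.h>

/* ib_get_cached_pkey() reads the P_Key table cached by ib_core;
 * ib_query_pkey() asks the device driver directly, so it keeps
 * working if the core-level cache is eventually removed. */
static int example_lookup_pkey(struct ib_device *dev, u8 port_num,
                               u16 index, u16 *pkey)
{
        /* cached variant (what SRP used before the patch): */
        /* return ib_get_cached_pkey(dev, port_num, index, pkey); */

        /* uncached variant (what the patch converts SRP to): */
        return ib_query_pkey(dev, port_num, index, pkey);
}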
From rdreier at cisco.com Sun May 11 08:37:57 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 11 May 2008 08:37:57 -0700 Subject: [ofa-general] Re: [PATCH] IB/core: Add completion flag for send with invalidate In-Reply-To: <1210508273.15669.131.camel@mtls03> (Eli Cohen's message of "Sun, 11 May 2008 15:17:53 +0300") References: <1210508273.15669.131.camel@mtls03> Message-ID:

> Add IB_WC_WITH_INVALIDATE to enum ib_wc_flags to mark completions > of "send with invalidate" operations.

Actually, given that we took the approach of adding a new send opcode for send w/invalidate, I think it probably makes more sense to add a new IB_WC_ opcode for such completions. This doesn't match the immediate data handling, but on the other hand, if we just add a wc flag for invalidate, what opcode do we report for local invalidate work requests?

- R.

From rdreier at cisco.com Sun May 11 08:44:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 11 May 2008 08:44:18 -0700 Subject: [ofa-general] Re: [PATCH] IB/core: Add completion flag for send with invalidate In-Reply-To: <1210508273.15669.131.camel@mtls03> (Eli Cohen's message of "Sun, 11 May 2008 15:17:53 +0300") References: <1210508273.15669.131.camel@mtls03> Message-ID:

Never mind... I see that this is for the corresponding receive on the other side. But:

> + /* > + * TBD: maybe we should just call this ieth_val > + */ > + wc->imm_data = be32_to_cpu(cqe->immed_rss_invalid);

I think we should make the wc->imm_data handling a union like we did for the work request structure. If we call it ieth_val I think it's too hard for anyone to understand, and it is IB-specific (iWARP has no IETH of course).

- R.

From ogerlitz at voltaire.com Sun May 11 22:13:50 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 12 May 2008 08:13:50 +0300 Subject: [ofa-general] Re: [2.6.27 PATCH/RFC] IB/srp: Remove use of cached P_Key/GID queries In-Reply-To: References: <200805110849.25034.jackm@dev.mellanox.co.il> Message-ID: <4827D20E.4060805@voltaire.com>

Roland Dreier wrote: > But I don't think sending MADs is really a fast path anyway.

It is a fast path to some extent when this node runs the SM.

Or.

From ogerlitz at voltaire.com Sun May 11 22:44:31 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 12 May 2008 08:44:31 +0300 Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support In-Reply-To: <1210515671.15669.138.camel@mtls03> References: <1210515671.15669.138.camel@mtls03> Message-ID: <4827D93F.1020208@voltaire.com>

Eli Cohen wrote: > Add send with invalidate support to mlx4. > Signed-off-by: Eli Cohen

Hi Eli, Thinking about it a little, I don't see how this send-with-invalidate support is applicable for ULPs such as SRP, iSER and RDS who use the Mellanox FMRs. This is because to invalidate an rkey they call fmr_unmap and what the mlx4 (similarly for mthca) driver does at mlx4_ib_unmap_fmr() is call mlx4_fmr_unmap() for each fmr and then issue SYNC_TPT command. Even if doing send-with-inv would save the ULP the indirect call to mlx4_fmr_unmap() (which does almost nothing by itself), if it doesn't cause the HW/FW to issue SYNC_TPT, it can not replace the call to ib_unmap_fmr in the side that generated this rkey. And if it does cause SYNC_TPT, the effect of amortizing the cost of this heavy command through un-mapping on many fmrs at once is lost, correct?
Or

> mlx4_ib_unmap_fmr calls mlx4_fmr_unmap for each fmr and then issues SYNC_TPT command > > void mlx4_fmr_unmap(struct mlx4_dev *dev, struct mlx4_fmr *fmr, > u32 *lkey, u32 *rkey) > { > if (!fmr->maps) > return; > > fmr->maps = 0; > > *(u8 *) fmr->mpt = MLX4_MPT_STATUS_SW; > } >

From keshetti85-student at yahoo.co.in Sun May 11 23:30:14 2008 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Mon, 12 May 2008 12:00:14 +0530 Subject: [ofa-general] OpenSM SA dump ? Message-ID: <829ded920805112330w2378e795qe5342b50b1c4aff0@mail.gmail.com>

When I ran opensm with '-D 0x43' option, it generated opensm-sa.dump file in /var/log directory. But to my surprise 'opensm-sa.dump' file is very small in size and it contained information of MC groups only. Do I need to give more options to get detailed information of SA dump ?

Also, is there any way to dump the local SA cache to a file in the OFED-1.3 implementation ?

-Mahesh

From monis at Voltaire.COM Mon May 12 01:08:25 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Mon, 12 May 2008 11:08:25 +0300 Subject: [ofa-general] [PATCH] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <48206690.3090604@Voltaire.COM> References: <48206690.3090604@Voltaire.COM> Message-ID: <4827FAF9.2030506@Voltaire.COM>

Hi Roland, Do you have comments for this patch? We'd like to have it in, please. thanks MoniS

From monis at Voltaire.COM Mon May 12 01:12:15 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Mon, 12 May 2008 11:12:15 +0300 Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event In-Reply-To: <4820638E.4030901@Voltaire.COM> References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM> <4820638E.4030901@Voltaire.COM> Message-ID: <4827FBDF.9040308@Voltaire.COM>

Moni Shoua wrote: >> I guess I can believe things don't get worse but I still don't know how >> this makes things better. With the current code the request is lost >> because it goes to the wrong SM; with the new code the request is failed >> by the SA layer. So in both cases the consumer just has to try again. >> >> So is there some practical benefit we see by adding this code? >> >> - R. > > In general I see the benefit in faster detection of a wrong SM AH. Before the patch, consumers > need to wait for a timeout before the detection; after the patch it happens immediately > on return from the function. This improves the performance of an SM failover scenario. > > Some applications may get the benefit above only if they handle the new return code (EAGAIN) specifically, > but this patch opens the door for such improvement. > > thanks > > MoniS

Hi Roland, Can we please go on with this patch? We would like to see it in the next kernel.
thanks MoniS From eli at dev.mellanox.co.il Mon May 12 01:32:42 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 12 May 2008 11:32:42 +0300 Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support In-Reply-To: <4827D93F.1020208@voltaire.com> References: <1210515671.15669.138.camel@mtls03> <4827D93F.1020208@voltaire.com> Message-ID: <1210581162.15669.158.camel@mtls03> On Mon, 2008-05-12 at 08:44 +0300, Or Gerlitz wrote: > Thinking about it a little, I don't see how this send-with-invalidate > support is applicable for ULPs such as SRP, iSER and RDS who use the > Mellanox FMRs. > > This is because to invalidate an rkey they call fmr_unmap and what the > mlx4 (similarly for mthca) driver does at mlx4_ib_unmap_fmr() is call > mlx4_fmr_unmap() for each fmr and then issue SYNC_TPT command. > > Even if doing send-with-inv would save the ULP the indirect call to > mlx4_fmr_unmap() (which does almost nothing by itself), if it doesn't > cause the HW/FW to issue SYNC_TPT, it can not replace the call to > ib_unmap_fmr in the side that generated this rkey. And if it does cause > SYNC_TPT, the effect of amortizing the cost of this heavy command > through un-mapping on many fmrs at once is lost, correct? The outcome of send with invalidate involves an implicit "sync_tpt" like operation although it syncs the caches to invalidate only the specific memory key (as opposed to sync tpt command which has a more global nature). But I think that the idea is not to save the overhead of sync tpt commands but to provide security. Perhaps someone from RDS can add more on that. From pawel.dziekonski at wcss.pl Mon May 12 01:54:15 2008 From: pawel.dziekonski at wcss.pl (Pawel Dziekonski) Date: Mon, 12 May 2008 10:54:15 +0200 Subject: [ofa-general] Obtaining RDMA statistics for an InfiniBand interface In-Reply-To: <1205844786.11393.121.camel@hrosenstock-ws.xsigo.com> References: <1E3DCD1C63492545881FACB6063A57C102221AD0@mtiexch01.mti.com> <1205844786.11393.121.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512085415.GF5226@cefeid.wcss.wroc.pl> On Tue, 18 Mar 2008 at 05:53:06AM -0700, Hal Rosenstock wrote: > On Tue, 2008-03-18 at 09:17 +0100, Bart Van Assche wrote: > > On Mon, Mar 17, 2008 at 6:07 PM, Boris Shpolyansky wrote: > > > Check : > > > > > > /sys/class/infiniband/[mlx4_*|mthca*]/ports/*/counters/* > > > > Thanks, these are interesting counters. Unfortunately these counters > > are 32-bit counters and already overflowed during the test I ran (less > > than one day of SRP communication): > > > > $ uname -r > > 2.6.24 > > $ uname -m > > x86_64 > > $ head /sys/class/infiniband/mthca0/ports/1/counters/*_{data,packets} > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_data <== > > 4294967295 > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_data <== > > 4294967295 > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets <== > > 4294967295 > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets <== > > 4294967295 > > Depending on which fabric management (usually bundled with the SM) is > being used, you may be able to obtain this information via the > Performance Manager (and not be limited to 32 bit counters). Hi, I have exactly the same problem on OFED 1.2.5.5, redhat kernel 2.6.9-55.0.12.ELsmp on X86_64 machines. {data,packets} files contain 4294967295 value, or barely change. :( How can I get Performance Manager running and printing some reasonable numbers? 
regards, P -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl From ogerlitz at voltaire.com Mon May 12 03:42:31 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 12 May 2008 13:42:31 +0300 Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support In-Reply-To: <1210581162.15669.158.camel@mtls03> References: <1210515671.15669.138.camel@mtls03> <4827D93F.1020208@voltaire.com> <1210581162.15669.158.camel@mtls03> Message-ID: <48281F17.400@voltaire.com> Eli Cohen wrote: > The outcome of send with invalidate involves an implicit "sync_tpt" like > operation although it syncs the caches to invalidate only the specific > memory key (as opposed to sync tpt command which has a more global > nature). > But I think that the idea is not to save the overhead of sync tpt > commands but to provide security. Yes, if send-with-invalidate causes a sync-tpt which applies only to the specific rkey (I assume its documented in the PRM) this can be used to make the mellanox fmrs --much-- more secure. Are you thinking on enhancement to support that for consumers that use FMRs through the pool at the core? Or. From tziporet at dev.mellanox.co.il Mon May 12 04:09:52 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 12 May 2008 14:09:52 +0300 Subject: [ewg] RE: [ofa-general] OFED May 5 meeting summary In-Reply-To: <39C75744D164D948A170E9792AF8E7CAC5AF56@exil.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> <39C75744D164D948A170E9792AF8E7CAC5AF56@exil.voltaire.com> Message-ID: <48282580.8040208@mellanox.co.il> Moshe Kazir wrote: > > I have checked OFED-1.3.1-rc1 on SLES10 SP 2 Beta3. > > ib-bonding compile failed. Everything else is compiled o.k. > > Attached : ib-bonding error log. > > > I'll take the backport of ib-bonding to sles10 sp 2 on me (if needed, > I'll get Moni's help). > > Thanks Please update when done. Any need for a change in the install script? Tziporet From hrosenstock at xsigo.com Mon May 12 04:19:05 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 04:19:05 -0700 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <4825EEC9.4070208@dev.mellanox.co.il> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> Message-ID: <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> On Sat, 2008-05-10 at 21:51 +0300, Yevgeny Kliteynik wrote: > Hal Rosenstock wrote: > > Hi Yevgeny, > > > > Is it possible that OpenSM's fat tree routing somehow depends on the > > LIDs previously assigned ? > > It depends only on the existence of the LIDs. > > > It seems that for a legitimate fat tree topology, the topology sometimes > > won't come up as a fat tree if reassigning LIDs (-r) is not used. > > That's odd... > > > In addition to -r making fat tree work, certain routing algorithms seem > > to also clear this out (without using -r). For example, if lash were run > > and then ftree, it seems to work without doing the -r. (Haven't yet > > tried updn). > > > > Any ideas on this ? Should a bug be filed on this ? Thanks. > > No ideas whatsoever. Please file a bug on this. I filed this as bug 1031: https://bugs.openfabrics.org/show_bug.cgi?id=1031 > It would be nice if I could reproduce it in simulation. Yes, that would be nice; but I don't have a sim case. 
-- Hal > -- Yevgeny > > > -- Hal > > > > > > > From kliteyn at dev.mellanox.co.il Mon May 12 05:30:18 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 12 May 2008 15:30:18 +0300 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> Message-ID: <4828385A.6080804@dev.mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > On Sat, 2008-05-10 at 21:51 +0300, Yevgeny Kliteynik wrote: >> Hal Rosenstock wrote: >>> Hi Yevgeny, >>> >>> Is it possible that OpenSM's fat tree routing somehow depends on the >>> LIDs previously assigned ? >> It depends only on the existence of the LIDs. >> >>> It seems that for a legitimate fat tree topology, the topology sometimes >>> won't come up as a fat tree if reassigning LIDs (-r) is not used. >> That's odd... >> >>> In addition to -r making fat tree work, certain routing algorithms seem >>> to also clear this out (without using -r). For example, if lash were run >>> and then ftree, it seems to work without doing the -r. (Haven't yet >>> tried updn). >>> >>> Any ideas on this ? Should a bug be filed on this ? Thanks. >> No ideas whatsoever. Please file a bug on this. > > I filed this as bug 1031: > https://bugs.openfabrics.org/show_bug.cgi?id=1031 Thanks >> It would be nice if I could reproduce it in simulation. > > Yes, that would be nice; but I don't have a sim case. The problem is, I don't even know where to start. I tested it in simulations on different topologies, and it is used on real cluster(s) too. I need more details, and some hint on how to reproduce it. Can you describe the setup you used when you saw this problem? -- Yevgeny > -- Hal > >> -- Yevgeny >> >>> -- Hal >>> >>> >>> > > From olga.shern at gmail.com Mon May 12 06:49:34 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Mon, 12 May 2008 16:49:34 +0300 Subject: [ewg] RE: [ofa-general] OFED May 5 meeting summary In-Reply-To: <48282580.8040208@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> <39C75744D164D948A170E9792AF8E7CAC5AF56@exil.voltaire.com> <48282580.8040208@mellanox.co.il> Message-ID: On 5/12/08, Tziporet Koren wrote: > > Moshe Kazir wrote: > > > > > I have checked OFED-1.3.1-rc1 on SLES10 SP 2 Beta3. > > > > ib-bonding compile failed. Everything else is compiled o.k. > > Attached : ib-bonding error log. > > > > > > I'll take the backport of ib-bonding to sles10 sp 2 on me (if needed, > > I'll get Moni's help). > > > > > > > Thanks > Please update when done. > Any need for a change in the install script? It seems that there is no need for changes in the install script, I will update you Tziporet _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olgas at voltaire.com Mon May 12 07:18:33 2008 From: olgas at voltaire.com (Olga Shern) Date: Mon, 12 May 2008 17:18:33 +0300 Subject: [ofa-general] Compiling OFED 1.3 on Gentoo Message-ID: <39C75744D164D948A170E9792AF8E7CA012CD4B0@exil.voltaire.com> Hi, We are trying to compile OFED 1.3 on Gentoo and see the following error, Build falls on libibcommon library with the error bellow. 
Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' --define 'dist ' --target i386 --define '_prefix /usr' --define '_exec_prefix /usr' --define '_sysconfdir /etc' --define '_usr /usr' /tmp/OFED-1.3/SRPMS/libibcommon-1.0.8-1.ofed1.3.src.rpm error: Macro %dist has empty body error: Macro %dist has empty body sh: line 0: fg: no job control error: Failed build dependencies: is needed by libibcommon-1.0.8-1.ofed1.3.src Installing /tmp/OFED-1.3/SRPMS/libibcommon-1.0.8-1.ofed1.3.src.rpm Building target platforms: i386 Building for target i386 There is a strange space under 'error:' line, before 'is needed by libibcommon-1.0.8-1.ofed1.3.src' But if I install source RPM file and then running 'rpmbuild -ba libibcommon.spec' then I can build RPM, so only rpmbuild --rebuild command causing to problems. Have someone seen this error before? Have someone succeeded to build OFED 1.3 on Gentoo? Thanks Olga -------------- next part -------------- An HTML attachment was scrubbed... URL: From xemul at openvz.org Mon May 12 07:43:58 2008 From: xemul at openvz.org (Pavel Emelyanov) Date: Mon, 12 May 2008 18:43:58 +0400 Subject: [ofa-general] [PATCH][INFINIBAND]: Make ipath_portdata work with struct pid * not pid_t. Message-ID: <482857AE.2030904@openvz.org> The official reason is "with the presence of pid namespaces in the kernel using pid_t-s inside one is no longer safe". But the reason I fix exactly the infiniband right now is the following. About a month ago (when the 2.6.25 was not yet released) there still was a one last caller of a to-be-deprecated-soon function find_pid() - the kill_proc() function, which in turn was only used by nfs callback code. During the last merge window, this last caller was finally eliminated by some NFS patch(es) and I was about to finally kill this kill_proc() and find_pid(), but found, that I was late and the kill_proc is now called from the infiniband driver (commit 58411d1c). So here's the patch, that turns this code to use struct pid * and (!) the kill_pid routine. If it is possible to have this one in 2.6.26, I would appreciate this A LOT and be able to close one more hole in pid namespaces. 
Signed-off-by: Pavel Emelyanov --- diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index ce7b7c3..258e66c 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -2616,7 +2616,7 @@ int ipath_reset_device(int unit) ipath_dbg("unit %u port %d is in use " "(PID %u cmd %s), can't reset\n", unit, i, - dd->ipath_pd[i]->port_pid, + pid_nr(dd->ipath_pd[i]->port_pid), dd->ipath_pd[i]->port_comm); ret = -EBUSY; goto bail; @@ -2654,19 +2654,21 @@ bail: static int ipath_signal_procs(struct ipath_devdata *dd, int sig) { int i, sub, any = 0; - pid_t pid; + struct pid *pid; if (!dd->ipath_pd) return 0; for (i = 1; i < dd->ipath_cfgports; i++) { - if (!dd->ipath_pd[i] || !dd->ipath_pd[i]->port_cnt || - !dd->ipath_pd[i]->port_pid) + if (!dd->ipath_pd[i] || !dd->ipath_pd[i]->port_cnt) continue; pid = dd->ipath_pd[i]->port_pid; + if (!pid) + continue; + dev_info(&dd->pcidev->dev, "context %d in use " "(PID %u), sending signal %d\n", - i, pid, sig); - kill_proc(pid, sig, 1); + i, pid_nr(pid), sig); + kill_pid(pid, sig, 1); any++; for (sub = 0; sub < INFINIPATH_MAX_SUBPORT; sub++) { pid = dd->ipath_pd[i]->port_subpid[sub]; @@ -2674,8 +2676,8 @@ static int ipath_signal_procs(struct ipath_devdata *dd, int sig) continue; dev_info(&dd->pcidev->dev, "sub-context " "%d:%d in use (PID %u), sending " - "signal %d\n", i, sub, pid, sig); - kill_proc(pid, sig, 1); + "signal %d\n", i, sub, pid_nr(pid), sig); + kill_pid(pid, sig, 1); any++; } } diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c index 3295177..b472b15 100644 --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c @@ -555,7 +555,7 @@ static int ipath_tid_free(struct ipath_portdata *pd, unsigned subport, p = dd->ipath_pageshadow[porttid + tid]; dd->ipath_pageshadow[porttid + tid] = NULL; ipath_cdbg(VERBOSE, "PID %u freeing TID %u\n", - pd->port_pid, tid); + pid_nr(pd->port_pid), tid); dd->ipath_f_put_tid(dd, &tidbase[tid], RCVHQ_RCV_TYPE_EXPECTED, dd->ipath_tidinvalid); @@ -1609,7 +1609,7 @@ static int try_alloc_port(struct ipath_devdata *dd, int port, port); pd->port_cnt = 1; port_fp(fp) = pd; - pd->port_pid = current->pid; + pd->port_pid = get_pid(task_pid(current)); strncpy(pd->port_comm, current->comm, sizeof(pd->port_comm)); ipath_stats.sps_ports++; ret = 0; @@ -1793,14 +1793,15 @@ static int find_shared_port(struct file *fp, } port_fp(fp) = pd; subport_fp(fp) = pd->port_cnt++; - pd->port_subpid[subport_fp(fp)] = current->pid; + pd->port_subpid[subport_fp(fp)] = + get_pid(task_pid(current)); tidcursor_fp(fp) = 0; pd->active_slaves |= 1 << subport_fp(fp); ipath_cdbg(PROC, "%s[%u] %u sharing %s[%u] unit:port %u:%u\n", current->comm, current->pid, subport_fp(fp), - pd->port_comm, pd->port_pid, + pd->port_comm, pid_nr(pd->port_pid), dd->ipath_unit, pd->port_port); ret = 1; goto done; @@ -2066,7 +2067,8 @@ static int ipath_close(struct inode *in, struct file *fp) * the slave(s) don't wait for receive data forever. 
*/ pd->active_slaves &= ~(1 << fd->subport); - pd->port_subpid[fd->subport] = 0; + put_pid(pd->port_subpid[fd->subport]); + pd->port_subpid[fd->subport] = NULL; mutex_unlock(&ipath_mutex); goto bail; } @@ -2074,7 +2076,7 @@ static int ipath_close(struct inode *in, struct file *fp) if (pd->port_hdrqfull) { ipath_cdbg(PROC, "%s[%u] had %u rcvhdrqfull errors " - "during run\n", pd->port_comm, pd->port_pid, + "during run\n", pd->port_comm, pid_nr(pd->port_pid), pd->port_hdrqfull); pd->port_hdrqfull = 0; } @@ -2134,11 +2136,12 @@ static int ipath_close(struct inode *in, struct file *fp) unlock_expected_tids(pd); ipath_stats.sps_ports--; ipath_cdbg(PROC, "%s[%u] closed port %u:%u\n", - pd->port_comm, pd->port_pid, + pd->port_comm, pid_nr(pd->port_pid), dd->ipath_unit, port); } - pd->port_pid = 0; + put_pid(pd->port_pid); + pd->port_pid = NULL; dd->ipath_pd[pd->port_port] = NULL; /* before releasing mutex */ mutex_unlock(&ipath_mutex); ipath_free_pddata(dd, pd); /* after releasing the mutex */ diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 02b24a3..2097587 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -159,8 +159,8 @@ struct ipath_portdata { /* saved total number of polled urgent packets for poll edge trigger */ u32 port_urgent_poll; /* pid of process using this port */ - pid_t port_pid; - pid_t port_subpid[INFINIPATH_MAX_SUBPORT]; + struct pid *port_pid; + struct pid *port_subpid[INFINIPATH_MAX_SUBPORT]; /* same size as task_struct .comm[] */ char port_comm[16]; /* pkeys set by this use of this port */

From sashak at voltaire.com Mon May 12 11:37:36 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 18:37:36 +0000 Subject: [ofa-general] Re: [PATCH][TRIVIAL] ibsim/sim_net.c: Fix some typos In-Reply-To: <1209934852.20493.182.camel@hrosenstock-ws.xsigo.com> References: <1209934852.20493.182.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512183736.GB17046@sashak.voltaire.com>

On 14:00 Sun 04 May , Hal Rosenstock wrote: > ibsim/sim_net.c: Fix some typos > > Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From sashak at voltaire.com Mon May 12 11:38:07 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 18:38:07 +0000 Subject: [ofa-general] Re: [PATCH][TRIVIAL] ibsim/ibsim.c: Fix usage display In-Reply-To: <1209934895.20493.184.camel@hrosenstock-ws.xsigo.com> References: <1209934895.20493.184.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512183807.GC17046@sashak.voltaire.com>

On 14:01 Sun 04 May , Hal Rosenstock wrote: > ibsim/ibsim.c: Fix usage display > > Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From rdreier at cisco.com Mon May 12 08:41:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 May 2008 08:41:18 -0700 Subject: [ofa-general] Re: [PATCH][INFINIBAND]: Make ipath_portdata work with struct pid * not pid_t. In-Reply-To: <482857AE.2030904@openvz.org> (Pavel Emelyanov's message of "Mon, 12 May 2008 18:43:58 +0400") References: <482857AE.2030904@openvz.org> Message-ID:

Seems fine to me... ipath guys, any comment? I think it would be reasonable to include this with the other ipath fixes when I ask Linus to pull in a day or two.
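For reference, a minimal sketch of the struct pid lifetime pattern the ipath patch above relies on (the example_* names are hypothetical; get_pid(), put_pid(), pid_nr() and kill_pid() are the kernel primitives it uses):

#include <linux/kernel.h>
#include <linux/pid.h>
#include <linux/sched.h>

static struct pid *example_owner;	/* counted reference, not a raw pid_t */

static void example_open(void)
{
	/* take a reference to the opening task's struct pid */
	example_owner = get_pid(task_pid(current));
}

static void example_signal(int sig)
{
	if (example_owner) {
		/* pid_nr() is only for printing; kill_pid() works across pid namespaces */
		printk(KERN_INFO "signalling PID %d\n", pid_nr(example_owner));
		kill_pid(example_owner, sig, 1);
	}
}

static void example_close(void)
{
	put_pid(example_owner);	/* put_pid(NULL) is a safe no-op */
	example_owner = NULL;
}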
From sashak at voltaire.com Mon May 12 11:41:38 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 18:41:38 +0000 Subject: [ofa-general] Re: [PATCH] ibsim/sim.h: Fix NodeDescription size In-Reply-To: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> References: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512184138.GD17046@sashak.voltaire.com> On 14:01 Sun 04 May , Hal Rosenstock wrote: > ibsim/sim.h: Fix NodeDescription size so can have maximum size > NodeDescription per IBA spec rather than truncating them > > Signed-off-by: Hal Rosenstock > > diff --git a/ibsim/sim.h b/ibsim/sim.h > index bea136a..dbf1220 100644 > --- a/ibsim/sim.h > +++ b/ibsim/sim.h > @@ -67,7 +67,7 @@ > > #define NODEIDBASE 20 > #define NODEPREFIX 20 > -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) > +#define NODEIDLEN 65 > #define ALIASLEN 40 nodeid filed in struct Node still have length 64, so it looks that using NODEIDLEN value larger than this introduces overflow. I think bigger change is needed there. Sasha From sashak at voltaire.com Mon May 12 11:49:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 18:49:03 +0000 Subject: [ofa-general] Re: [PATCH] ibsim/README: Clarify point of attachment/SIM_HOST use In-Reply-To: <1209934908.20493.186.camel@hrosenstock-ws.xsigo.com> References: <1209934908.20493.186.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512184903.GE17046@sashak.voltaire.com> On 14:01 Sun 04 May , Hal Rosenstock wrote: > ibsim/README: Clarify point of attachment/SIM_HOST use > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From hrosenstock at xsigo.com Mon May 12 08:59:17 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 08:59:17 -0700 Subject: [ofa-general] Re: [PATCH] ibsim/README: Clarify point of attachment/SIM_HOST use In-Reply-To: <20080512184903.GE17046@sashak.voltaire.com> References: <1209934908.20493.186.camel@hrosenstock-ws.xsigo.com> <20080512184903.GE17046@sashak.voltaire.com> Message-ID: <1210607957.2026.501.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 18:49 +0000, Sasha Khapyorsky wrote: > On 14:01 Sun 04 May , Hal Rosenstock wrote: > > ibsim/README: Clarify point of attachment/SIM_HOST use > > > > Signed-off-by: Hal Rosenstock > > Applied. Thanks. There was a v2 of this patch with a minor change for some omitted words. -- Hal > Sasha From dave.olson at qlogic.com Mon May 12 08:48:44 2008 From: dave.olson at qlogic.com (Dave Olson) Date: Mon, 12 May 2008 08:48:44 -0700 (PDT) Subject: [ofa-general] Re: [PATCH][INFINIBAND]: Make ipath_portdata work with struct pid * not pid_t. In-Reply-To: References: <482857AE.2030904@openvz.org> Message-ID: On Mon, 12 May 2008, Roland Dreier wrote: | Seems fine to me... ipath guys, any comment? I think it would be | reasonale to include this with the other ipath fixes when I ask Linus to | pull in a day or two. I looked at the original patch, and it looks fine to me. Should be fairly easy to cover in ofed 1.4 backport patches for the older kernels. 
Dave Olson dave.olson at qlogic.com From sashak at voltaire.com Mon May 12 12:10:32 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 19:10:32 +0000 Subject: [ofa-general] Re: [PATCH] ibsim/README: Clarify point of attachment/SIM_HOST use In-Reply-To: <1210607957.2026.501.camel@hrosenstock-ws.xsigo.com> References: <1209934908.20493.186.camel@hrosenstock-ws.xsigo.com> <20080512184903.GE17046@sashak.voltaire.com> <1210607957.2026.501.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512191032.GG17046@sashak.voltaire.com> On 08:59 Mon 12 May , Hal Rosenstock wrote: > On Mon, 2008-05-12 at 18:49 +0000, Sasha Khapyorsky wrote: > > On 14:01 Sun 04 May , Hal Rosenstock wrote: > > > ibsim/README: Clarify point of attachment/SIM_HOST use > > > > > > Signed-off-by: Hal Rosenstock > > > > Applied. Thanks. > > There was a v2 of this patch with a minor change for some omitted words. Applied this too. Thanks. Sasha From hrosenstock at xsigo.com Mon May 12 09:28:16 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 09:28:16 -0700 Subject: [ofa-general] Re: [PATCH] ibsim/sim.h: Fix NodeDescription size In-Reply-To: <20080512184138.GD17046@sashak.voltaire.com> References: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> <20080512184138.GD17046@sashak.voltaire.com> Message-ID: <1210609696.2026.508.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 18:41 +0000, Sasha Khapyorsky wrote: > On 14:01 Sun 04 May , Hal Rosenstock wrote: > > ibsim/sim.h: Fix NodeDescription size so can have maximum size > > NodeDescription per IBA spec rather than truncating them > > > > Signed-off-by: Hal Rosenstock > > > > diff --git a/ibsim/sim.h b/ibsim/sim.h > > index bea136a..dbf1220 100644 > > --- a/ibsim/sim.h > > +++ b/ibsim/sim.h > > @@ -67,7 +67,7 @@ > > > > #define NODEIDBASE 20 > > #define NODEPREFIX 20 > > -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) > > +#define NODEIDLEN 65 > > #define ALIASLEN 40 > > nodeid filed in struct Node still have length 64, so it looks that using > NODEIDLEN value larger than this introduces overflow. I think bigger > change is needed there. I made NODEIDLEN 65 rather than 64 due to the +1 in the original define. How about defining it as 64 as in the below ? Does that get around the overflow issue ? -- Hal ibsim/sim.h: Fix NodeDescription size so can have maximum size NodeDescription per IBA spec rather than truncating them Signed-off-by: Hal Rosenstock diff --git a/ibsim/sim.h b/ibsim/sim.h index bea136a..0bf14fd 100644 --- a/ibsim/sim.h +++ b/ibsim/sim.h @@ -65,9 +65,8 @@ #define DEFAULT_LINKWIDTH LINKWIDTH_4x #define DEFAULT_LINKSPEED LINKSPEED_SDR -#define NODEIDBASE 20 #define NODEPREFIX 20 -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) +#define NODEIDLEN 64 #define ALIASLEN 40 #define MAXHOPS 16 > Sasha From jon at opengridcomputing.com Mon May 12 09:57:38 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Mon, 12 May 2008 11:57:38 -0500 Subject: [ofa-general] RDS flow control Message-ID: <200805121157.38135.jon@opengridcomputing.com> As part of my effort to get RDS working for iWARP, I will be working on the RDS flow control. Flow control is needed for iWARP due to the fact that iWARP connections terminate if there is no posted recv for an incoming packet. IB connections do not have this limitation if setup in a certain way. In its current implementation, RDS sets the connection attribute rnr_retry to 7. This causes IB to retransmit until there is a posted recv buffer. 
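As a rough illustration of that setting, here is a sketch against the kernel rdma_cm connection parameters; the function name and the values are illustrative, not actual RDS code:

#include <linux/string.h>
#include <rdma/rdma_cm.h>

static void example_conn_params(struct rdma_conn_param *param)
{
	memset(param, 0, sizeof(*param));
	param->responder_resources = 4;
	param->initiator_depth = 4;
	param->retry_count = 7;		/* transport-level retries */
	/* 7 means "retry indefinitely" on RNR NAK; this is the IB-only
	 * behaviour that credit-based flow control must replace for iWARP */
	param->rnr_retry_count = 7;
}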
Using a credit based flow control mechanism, we can ensure there will be a posted recv for every incoming packet (thus laying part of the foundation of allowing iWARP to work). Also, it will reduce unnecessary IB transport traffic (at the expense of maintaining the credit schema). I am still in the very early stages of implementing this. So any pointers to RDS documentation (or a RDS git tree) would be very helpful. I have a small IB setup to test this on, so anyone willing to test it when I am done would be helpful as well. Thanks, Jon From richard.frank at oracle.com Mon May 12 10:08:06 2008 From: richard.frank at oracle.com (Richard Frank) Date: Mon, 12 May 2008 13:08:06 -0400 Subject: [ofa-general] RDS flow control In-Reply-To: <200805121157.38135.jon@opengridcomputing.com> References: <200805121157.38135.jon@opengridcomputing.com> Message-ID: <48287976.10403@oracle.com> We should define a set of performance criteria / tests to ensure we do not impact our current performance with IB... An alternative would be to push this into an IWARP specific module....and if works well there - we might then want to move it to generic RDS layer ? As an example, the TCP transport for RDS - handles flow control internally.. Jon Mason wrote: > As part of my effort to get RDS working for iWARP, I will be working on the > RDS flow control. Flow control is needed for iWARP due to the fact that > iWARP connections terminate if there is no posted recv for an incoming > packet. IB connections do not have this limitation if setup in a certain > way. In its current implementation, RDS sets the connection attribute > rnr_retry to 7. This causes IB to retransmit until there is a posted recv > buffer. > > Using a credit based flow control mechanism, we can ensure there will be a > posted recv for every incoming packet (thus laying part of the foundation of > allowing iWARP to work). Also, it will reduce unnecessary IB transport > traffic (at the expense of maintaining the credit schema). > > I am still in the very early stages of implementing this. So any pointers to > RDS documentation (or a RDS git tree) would be very helpful. I have a small > IB setup to test this on, so anyone willing to test it when I am done would > be helpful as well. > > Thanks, > Jon > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Mon May 12 14:25:36 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 21:25:36 +0000 Subject: [ofa-general] Re: [PATCH] ibsim/sim.h: Fix NodeDescription size In-Reply-To: <1210609696.2026.508.camel@hrosenstock-ws.xsigo.com> References: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> <20080512184138.GD17046@sashak.voltaire.com> <1210609696.2026.508.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512212536.GJ17046@sashak.voltaire.com> On 09:28 Mon 12 May , Hal Rosenstock wrote: > > I made NODEIDLEN 65 rather than 64 due to the +1 in the original define. > How about defining it as 64 as in the below ? Does that get around the > overflow issue ? 
> > -- Hal > > ibsim/sim.h: Fix NodeDescription size so can have maximum size > NodeDescription per IBA spec rather than truncating them > > Signed-off-by: Hal Rosenstock > > diff --git a/ibsim/sim.h b/ibsim/sim.h > index bea136a..0bf14fd 100644 > --- a/ibsim/sim.h > +++ b/ibsim/sim.h > @@ -65,9 +65,8 @@ > #define DEFAULT_LINKWIDTH LINKWIDTH_4x > #define DEFAULT_LINKSPEED LINKSPEED_SDR > > -#define NODEIDBASE 20 > #define NODEPREFIX 20 > -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) > +#define NODEIDLEN 64 > #define ALIASLEN 40 It is likely will prevent overflow, but will potentially truncate last NodeDesc character due to string NULL terminator. What about something like below? Sasha diff --git a/ibsim/sim.h b/ibsim/sim.h index 81bb47c..d3294f4 100644 --- a/ibsim/sim.h +++ b/ibsim/sim.h @@ -67,7 +67,7 @@ #define NODEIDBASE 20 #define NODEPREFIX 20 -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) +#define NODEIDLEN 65 #define ALIASLEN 40 #define MAXHOPS 16 @@ -237,7 +237,7 @@ struct Node { uint64_t sysguid; uint64_t nodeguid; // also portguid int portsbase; // in port table - char nodeid[64]; // contain nodeid[NODEIDLEN] + char nodeid[NODEIDLEN]; // contain nodeid[NODEIDLEN] uint8_t nodeinfo[64]; char nodedesc[64]; Switch *sw; diff --git a/ibsim/sim_cmd.c b/ibsim/sim_cmd.c index 5f64229..fe3e9be 100644 --- a/ibsim/sim_cmd.c +++ b/ibsim/sim_cmd.c @@ -56,7 +56,7 @@ extern Port *ports; extern Port **lids; extern int netnodes, netports, netswitches; -#define NAMELEN 64 +#define NAMELEN NODEIDLEN char *portstates[] = { "-", "Down", "Init", "Armed", "Active", diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c index bf7a06a..2a9c19b 100644 --- a/ibsim/sim_net.c +++ b/ibsim/sim_net.c @@ -267,17 +267,16 @@ static int new_hca(Node * nd) return 0; } -static int build_nodeid(char *nodeid, char *base) +static int build_nodeid(char *nodeid, size_t len, char *base) { if (strchr(base, '#') || strchr(base, '@')) { IBWARN("bad nodeid \"%s\": '#' & '@' characters are reserved", base); return -1; } - if (netprefix[0] == 0) - strncpy(nodeid, base, NODEIDLEN); - else - snprintf(nodeid, NODEIDLEN, "%s#%s", netprefix, base); + + snprintf(nodeid, len, "%s%s%s", netprefix, *netprefix ? "#" : "", base); + return 0; } @@ -287,7 +286,7 @@ static Node *new_node(int type, char *nodename, char *nodedesc, int nodeports) char nodeid[NODEIDLEN]; Node *nd; - if (build_nodeid(nodeid, nodename) < 0) + if (build_nodeid(nodeid, sizeof(nodeid), nodename) < 0) return 0; if (find_node(nodeid)) { @@ -310,11 +309,9 @@ static Node *new_node(int type, char *nodename, char *nodedesc, int nodeports) nd->type = type; nd->numports = nodeports; - strncpy(nd->nodeid, nodeid, NODEIDLEN - 1); - if (nodedesc && nodedesc[0]) - strncpy(nd->nodedesc, nodedesc, NODEIDLEN - 1); - else - strncpy(nd->nodedesc, nodeid, NODEIDLEN - 1); + strncpy(nd->nodeid, nodeid, sizeof(nd->nodeid) - 1); + strncpy(nd->nodedesc, nodedesc && *nodedesc ? nodedesc : nodeid, + sizeof(nd->nodedesc) - 1); nd->sysguid = nd->nodeguid = guids[type]; if (type == SWITCH_NODE) { nodeports++; // port 0 is SMA @@ -551,22 +548,20 @@ char *expand_name(char *base, char *name, char **portstr) if (netprefix[0] != 0 && !strchr(base, '#')) snprintf(name, NODEIDLEN, "%s#%s", netprefix, base); else - strcpy(name, base); + strncpy(name, base, NODEIDLEN - 1); if (portstr) - *portstr = 0; + *portstr = NULL; PDEBUG("name %s port %s", name, portstr ? 
*portstr : 0); return name; } - if (base[0] == '@') - snprintf(name, ALIASLEN, "%s%s", netprefix, base); - else - strcpy(name, base); + + snprintf(name, NODEIDLEN, "%s%s", base[0] == '@' ? netprefix : "", base); PDEBUG("alias %s", name); if (!(s = map_alias(name))) return 0; - strcpy(name, s); + strncpy(name, s, NODEIDLEN - 1); if (portstr) { *portstr = name; @@ -1075,12 +1070,12 @@ int link_ports(Port * lport, Port * rport) lport->remotenode = rnode; lport->remoteport = rport->portnum; set_portinfo(lport, lnode->type == SWITCH_NODE ? swport : hcaport); - memcpy(lport->remotenodeid, rnode->nodeid, NODEIDLEN); + memcpy(lport->remotenodeid, rnode->nodeid, sizeof(lport->remotenodeid)); rport->remotenode = lnode; rport->remoteport = lport->portnum; set_portinfo(rport, rnode->type == SWITCH_NODE ? swport : hcaport); - memcpy(rport->remotenodeid, lnode->nodeid, NODEIDLEN); + memcpy(rport->remotenodeid, lnode->nodeid, sizeof(rport->remotenodeid)); lport->state = rport->state = 2; // Initialilze lport->physstate = rport->physstate = 5; // LinkUP if (lnode->sw) @@ -1166,7 +1161,7 @@ int connect_ports(void) } } else if (remoteport->remoteport != port->portnum || strncmp(remoteport->remotenodeid, port->node->nodeid, - NODEIDLEN)) { + sizeof(remoteport->remotenodeid))) { IBWARN ("remote port %d in node \"%s\" is not connected to " "node \"%s\" port %d (\"%s\" %d)", From arlin.r.davis at intel.com Mon May 12 11:29:35 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 12 May 2008 11:29:35 -0700 Subject: [ofa-general] [PATCH 1/2][dat1.2] dapl: change cma provider to use max_rdma_read_in, out from ep_attr instead of HCA max values when connecting. Message-ID: <000001c8b45e$217943e0$9f97070a@amr.corp.intel.com> Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dapl/openib_cma/dapl_ib_cm.c | 9 ++++----- 1 files changed, 4 insertions(+), 5 deletions(-) diff --git a/dapl/openib_cma/dapl_ib_cm.c b/dapl/openib_cma/dapl_ib_cm.c index f08ee4b..f2eb8cb 100755 --- a/dapl/openib_cma/dapl_ib_cm.c +++ b/dapl/openib_cma/dapl_ib_cm.c @@ -404,9 +404,6 @@ static void dapli_cm_passive_cb(struct dapl_cm_id *conn, struct rdma_cm_event *event) { struct dapl_cm_id *new_conn; -#ifdef DAPL_DBG - struct rdma_addr *ipaddr = &conn->cm_id->route.addr; -#endif dapl_dbg_log(DAPL_DBG_TYPE_CM, " passive_cb: conn %p id %d event %d\n", @@ -539,8 +536,10 @@ DAT_RETURN dapls_ib_connect(IN DAT_EP_HANDLE ep_handle, /* Setup QP/CM parameters and private data in cm_id */ (void)dapl_os_memzero(&conn->params, sizeof(conn->params)); - conn->params.responder_resources = conn->hca->ib_trans.max_rdma_rd_in; - conn->params.initiator_depth = conn->hca->ib_trans.max_rdma_rd_out; + conn->params.responder_resources = + ep_ptr->param.ep_attr.max_rdma_read_in; + conn->params.initiator_depth = + ep_ptr->param.ep_attr.max_rdma_read_out; conn->params.flow_control = 1; conn->params.rnr_retry_count = IB_RNR_RETRY_COUNT; conn->params.retry_count = IB_RC_RETRY_COUNT; -- 1.5.2.5 From arlin.r.davis at intel.com Mon May 12 11:29:40 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 12 May 2008 11:29:40 -0700 Subject: [ofa-general] [PATCH 2/2][dat1.2] dapl: Fix long delays with the cma provider open call when DNS is not configure on server. Message-ID: Open call should default to netdev names when resolving local IP address for cma binding to match dat.conf settings. 
The open code attempts to resolve with IP or Hostname first and if there is no DNS services setup the failover to netdev name resolution is delayed for as much as 20 seconds. Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dapl/openib_cma/dapl_ib_util.c | 36 ++++++++++++++++-------------------- 1 files changed, 16 insertions(+), 20 deletions(-) diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c index 4de5a2c..e76e319 100755 --- a/dapl/openib_cma/dapl_ib_util.c +++ b/dapl/openib_cma/dapl_ib_util.c @@ -117,28 +117,24 @@ bail: static int getipaddr(char *name, char *addr, int len) { struct addrinfo *res; - int ret; - /* Assume network name and address type for first attempt */ - if (getaddrinfo(name, NULL, NULL, &res)) { - /* retry using network device name */ - ret = getipaddr_netdev(name,addr,len); - if (ret) { + /* assume netdev for first attempt, then network and address type */ + if (getipaddr_netdev(name,addr,len)) { + if (getaddrinfo(name, NULL, NULL, &res)) { dapl_log(DAPL_DBG_TYPE_ERR, - " open_hca: getaddr_netdev ERROR:" - " %s. Is %s configured?\n", - strerror(errno), name); - return ret; - } - } else { - if (len >= res->ai_addrlen) - memcpy(addr, res->ai_addr, res->ai_addrlen); - else { + " open_hca: getaddr_netdev ERROR:" + " %s. Is %s configured?\n", + strerror(errno), name); + return 1; + } else { + if (len >= res->ai_addrlen) + memcpy(addr, res->ai_addr, res->ai_addrlen); + else { + freeaddrinfo(res); + return 1; + } freeaddrinfo(res); - return EINVAL; } - - freeaddrinfo(res); } dapl_dbg_log(DAPL_DBG_TYPE_UTIL, @@ -642,7 +638,7 @@ DAT_RETURN dapli_ib_thread_init(void) while (g_ib_thread_state != IB_THREAD_RUN) { struct timespec sleep, remain; sleep.tv_sec = 0; - sleep.tv_nsec = 20000000; /* 20 ms */ + sleep.tv_nsec = 2000000; /* 2 ms */ dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_init: waiting for ib_thread\n"); dapl_os_unlock(&g_hca_lock); @@ -679,7 +675,7 @@ void dapli_ib_thread_destroy(void) while ((g_ib_thread_state != IB_THREAD_EXIT) && (retries--)) { struct timespec sleep, remain; sleep.tv_sec = 0; - sleep.tv_nsec = 20000000; /* 20 ms */ + sleep.tv_nsec = 2000000; /* 2 ms */ dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_destroy: waiting for ib_thread\n"); write(g_ib_pipe[1], "w", sizeof "w"); -- 1.5.2.5 From chu11 at llnl.gov Mon May 12 11:33:45 2008 From: chu11 at llnl.gov (Al Chu) Date: Mon, 12 May 2008 11:33:45 -0700 Subject: [ofa-general] Re: [RFC][PATCH 0/4] opensm: using conventional config file In-Reply-To: <1207703425-19039-1-git-send-email-sashak@voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> Message-ID: <1210617225.11133.461.camel@cardanus.llnl.gov> Hey Sasha, Ira and I were chatting. A few other comments: 1) Many configuration values are not output by default in opensm right now, mainly b/c it behaves like a cache rather than an configuration file. i.e. if (p_opts->connect_roots) fprintf(opts_file, "# Connect roots (use FALSE if unsure)\n" "connect_roots %s\n\n", p_opts->connect_roots ? "TRUE" : "FALSE"); Going forward w/ a config file, I think these should be output by default all the time so users know they exist. 2) Will there be an option to specify an alternate configuration file, i.e. not /etc/opensm/opensm.conf? Al On Wed, 2008-04-09 at 01:10 +0000, Sasha Khapyorsky wrote: > Hi, > > This is attempt to make some order with OpenSM configuration. 
Now it > will use conventional (similar to another programs which may have > configuration) config ($sysconfig/etc/opensm/opensm.conf) file instead > of option cache file. Config file for some startup scripts should go > away. Option '-c' is preserved - it can be useful for config file > template generation, but OpenSM will not try to read option cache file. > > This is RFC yet. In addition to this we will need to update scripts and > man pages. > > Any feedback? Thoughts? > > Sasha -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From chu11 at llnl.gov Mon May 12 11:33:45 2008 From: chu11 at llnl.gov (Al Chu) Date: Mon, 12 May 2008 11:33:45 -0700 Subject: [ofa-general] Re: [RFC][PATCH 0/4] opensm: using conventional config file In-Reply-To: <1207703425-19039-1-git-send-email-sashak@voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> Message-ID: <1210617225.11133.461.camel@cardanus.llnl.gov> Hey Sasha, Ira and I were chatting. A few other comments: 1) Many configuration values are not output by default in opensm right now, mainly b/c it behaves like a cache rather than an configuration file. i.e. if (p_opts->connect_roots) fprintf(opts_file, "# Connect roots (use FALSE if unsure)\n" "connect_roots %s\n\n", p_opts->connect_roots ? "TRUE" : "FALSE"); Going forward w/ a config file, I think these should be output by default all the time so users know they exist. 2) Will there be an option to specify an alternate configuration file, i.e. not /etc/opensm/opensm.conf? Al On Wed, 2008-04-09 at 01:10 +0000, Sasha Khapyorsky wrote: > Hi, > > This is attempt to make some order with OpenSM configuration. Now it > will use conventional (similar to another programs which may have > configuration) config ($sysconfig/etc/opensm/opensm.conf) file instead > of option cache file. Config file for some startup scripts should go > away. Option '-c' is preserved - it can be useful for config file > template generation, but OpenSM will not try to read option cache file. > > This is RFC yet. In addition to this we will need to update scripts and > man pages. > > Any feedback? Thoughts? > > Sasha -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From hrosenstock at xsigo.com Mon May 12 12:00:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 12:00:13 -0700 Subject: [ofa-general] Obtaining RDMA statistics for an InfiniBand interface In-Reply-To: <20080512085415.GF5226@cefeid.wcss.wroc.pl> References: <1E3DCD1C63492545881FACB6063A57C102221AD0@mtiexch01.mti.com> <1205844786.11393.121.camel@hrosenstock-ws.xsigo.com> <20080512085415.GF5226@cefeid.wcss.wroc.pl> Message-ID: <1210618813.2026.551.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 10:54 +0200, Pawel Dziekonski wrote: > On Tue, 18 Mar 2008 at 05:53:06AM -0700, Hal Rosenstock wrote: > > On Tue, 2008-03-18 at 09:17 +0100, Bart Van Assche wrote: > > > On Mon, Mar 17, 2008 at 6:07 PM, Boris Shpolyansky wrote: > > > > Check : > > > > > > > > /sys/class/infiniband/[mlx4_*|mthca*]/ports/*/counters/* > > > > > > Thanks, these are interesting counters. 
Unfortunately these counters > > > are 32-bit counters and already overflowed during the test I ran (less > > > than one day of SRP communication): > > > > > > $ uname -r > > > 2.6.24 > > > $ uname -m > > > x86_64 > > > $ head /sys/class/infiniband/mthca0/ports/1/counters/*_{data,packets} > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_data <== > > > 4294967295 > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_data <== > > > 4294967295 > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets <== > > > 4294967295 > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets <== > > > 4294967295 > > > > Depending on which fabric management (usually bundled with the SM) is > > being used, you may be able to obtain this information via the > > Performance Manager (and not be limited to 32 bit counters). > > Hi, > > I have exactly the same problem on OFED 1.2.5.5, redhat kernel > 2.6.9-55.0.12.ELsmp on X86_64 machines. {data,packets} files contain > 4294967295 value, or barely change. :( > > How can I get Performance Manager running and printing some reasonable > numbers? Are you using OpenSM ? If so, which version ? -- Hal > regards, P From hrosenstock at xsigo.com Mon May 12 12:05:17 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 12:05:17 -0700 Subject: [ofa-general] OpenSM SA dump ? In-Reply-To: <829ded920805112330w2378e795qe5342b50b1c4aff0@mail.gmail.com> References: <829ded920805112330w2378e795qe5342b50b1c4aff0@mail.gmail.com> Message-ID: <1210619117.2026.554.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 12:00 +0530, Keshetti Mahesh wrote: > When I ran opensm with '-D 0x43' option, it generated opensm-sa.dump > file in /var/log directory. But to my surprise 'opensm-sa.dump' file is very > small in size and it contained information of MC groups only. Only multicast, services, and informs are dumped. These are the so called client registrations. > Do I need > to give more options to get detailed information of SA dump ? What SA information are you looking for ? -- Hal > Also, is there any way to dump the local SA cache to a file in the > OFED-1.3 implementation ? > > -Mahesh > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From pawel.dziekonski at wcss.pl Mon May 12 12:05:35 2008 From: pawel.dziekonski at wcss.pl (Pawel Dziekonski) Date: Mon, 12 May 2008 21:05:35 +0200 Subject: [ofa-general] Obtaining RDMA statistics for an InfiniBand interface In-Reply-To: <1210618813.2026.551.camel@hrosenstock-ws.xsigo.com> References: <1E3DCD1C63492545881FACB6063A57C102221AD0@mtiexch01.mti.com> <1205844786.11393.121.camel@hrosenstock-ws.xsigo.com> <20080512085415.GF5226@cefeid.wcss.wroc.pl> <1210618813.2026.551.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512190535.GA24024@cefeid.wcss.wroc.pl> On Mon, 12 May 2008 at 12:00:13PM -0700, Hal Rosenstock wrote: > On Mon, 2008-05-12 at 10:54 +0200, Pawel Dziekonski wrote: > > On Tue, 18 Mar 2008 at 05:53:06AM -0700, Hal Rosenstock wrote: > > > On Tue, 2008-03-18 at 09:17 +0100, Bart Van Assche wrote: > > > > On Mon, Mar 17, 2008 at 6:07 PM, Boris Shpolyansky wrote: > > > > > Check : > > > > > > > > > > /sys/class/infiniband/[mlx4_*|mthca*]/ports/*/counters/* > > > > > > > > Thanks, these are interesting counters. 
Unfortunately these counters > > > > are 32-bit counters and already overflowed during the test I ran (less > > > > than one day of SRP communication): > > > > > > > > $ uname -r > > > > 2.6.24 > > > > $ uname -m > > > > x86_64 > > > > $ head /sys/class/infiniband/mthca0/ports/1/counters/*_{data,packets} > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_data <== > > > > 4294967295 > > > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_data <== > > > > 4294967295 > > > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets <== > > > > 4294967295 > > > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets <== > > > > 4294967295 > > > > > > Depending on which fabric management (usually bundled with the SM) is > > > being used, you may be able to obtain this information via the > > > Performance Manager (and not be limited to 32 bit counters). > > > > Hi, > > > > I have exactly the same problem on OFED 1.2.5.5, redhat kernel > > 2.6.9-55.0.12.ELsmp on X86_64 machines. {data,packets} files contain > > 4294967295 value, or barely change. :( > > > > How can I get Performance Manager running and printing some reasonable > > numbers? > > Are you using OpenSM ? If so, which version ? yes, opensm-3.0.3. P -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl From hrosenstock at xsigo.com Mon May 12 12:13:14 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 12:13:14 -0700 Subject: [ofa-general] Obtaining RDMA statistics for an InfiniBand interface In-Reply-To: <20080512190535.GA24024@cefeid.wcss.wroc.pl> References: <1E3DCD1C63492545881FACB6063A57C102221AD0@mtiexch01.mti.com> <1205844786.11393.121.camel@hrosenstock-ws.xsigo.com> <20080512085415.GF5226@cefeid.wcss.wroc.pl> <1210618813.2026.551.camel@hrosenstock-ws.xsigo.com> <20080512190535.GA24024@cefeid.wcss.wroc.pl> Message-ID: <1210619594.2026.563.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 21:05 +0200, Pawel Dziekonski wrote: > On Mon, 12 May 2008 at 12:00:13PM -0700, Hal Rosenstock wrote: > > On Mon, 2008-05-12 at 10:54 +0200, Pawel Dziekonski wrote: > > > On Tue, 18 Mar 2008 at 05:53:06AM -0700, Hal Rosenstock wrote: > > > > On Tue, 2008-03-18 at 09:17 +0100, Bart Van Assche wrote: > > > > > On Mon, Mar 17, 2008 at 6:07 PM, Boris Shpolyansky wrote: > > > > > > Check : > > > > > > > > > > > > /sys/class/infiniband/[mlx4_*|mthca*]/ports/*/counters/* > > > > > > > > > > Thanks, these are interesting counters. 
Unfortunately these counters > > > > > are 32-bit counters and already overflowed during the test I ran (less > > > > > than one day of SRP communication): > > > > > > > > > > $ uname -r > > > > > 2.6.24 > > > > > $ uname -m > > > > > x86_64 > > > > > $ head /sys/class/infiniband/mthca0/ports/1/counters/*_{data,packets} > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_data <== > > > > > 4294967295 > > > > > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_data <== > > > > > 4294967295 > > > > > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_rcv_packets <== > > > > > 4294967295 > > > > > > > > > > ==> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_packets <== > > > > > 4294967295 > > > > > > > > Depending on which fabric management (usually bundled with the SM) is > > > > being used, you may be able to obtain this information via the > > > > Performance Manager (and not be limited to 32 bit counters). > > > > > > Hi, > > > > > > I have exactly the same problem on OFED 1.2.5.5, redhat kernel > > > 2.6.9-55.0.12.ELsmp on X86_64 machines. {data,packets} files contain > > > 4294967295 value, or barely change. :( > > > > > > How can I get Performance Manager running and printing some reasonable > > > numbers? > > > > Are you using OpenSM ? If so, which version ? > > yes, opensm-3.0.3. 3.0.3 or 3.0.13 ? Anyway, PerfMgr was not part of those 3.0.x versions. I think it was added in the 3.1 series (and is available in OFED 1.3) or 3.2 series (trunk). Is upgrading a possibility ? If so, you might want to check out Ira's response on how to do this (and also what the output looks like to make sure it can meet your needs). -- Hal > P From hrosenstock at xsigo.com Mon May 12 12:20:25 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 12:20:25 -0700 Subject: [ofa-general] Re: [PATCH] ibsim/sim.h: Fix NodeDescription size In-Reply-To: <20080512212536.GJ17046@sashak.voltaire.com> References: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> <20080512184138.GD17046@sashak.voltaire.com> <1210609696.2026.508.camel@hrosenstock-ws.xsigo.com> <20080512212536.GJ17046@sashak.voltaire.com> Message-ID: <1210620025.2026.567.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 21:25 +0000, Sasha Khapyorsky wrote: > On 09:28 Mon 12 May , Hal Rosenstock wrote: > > > > I made NODEIDLEN 65 rather than 64 due to the +1 in the original define. > > How about defining it as 64 as in the below ? Does that get around the > > overflow issue ? > > > > -- Hal > > > > ibsim/sim.h: Fix NodeDescription size so can have maximum size > > NodeDescription per IBA spec rather than truncating them > > > > Signed-off-by: Hal Rosenstock > > > > diff --git a/ibsim/sim.h b/ibsim/sim.h > > index bea136a..0bf14fd 100644 > > --- a/ibsim/sim.h > > +++ b/ibsim/sim.h > > @@ -65,9 +65,8 @@ > > #define DEFAULT_LINKWIDTH LINKWIDTH_4x > > #define DEFAULT_LINKSPEED LINKSPEED_SDR > > > > -#define NODEIDBASE 20 > > #define NODEPREFIX 20 I think this can now be eliminated as the only use was in NODEIDLEN. > > -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) > > +#define NODEIDLEN 64 > > #define ALIASLEN 40 > > It will likely prevent overflow, but will potentially truncate last > NodeDesc character due to string NULL terminator. What about something > like below? Thanks. This seems to work for my usage but not sure about some of the other concatenated names and whether it accommodates those other usages which I don't fully understand.
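(A small, self-contained illustration of the terminator issue under discussion, not ibsim code: why a 64-byte NodeDescription needs a 65-byte buffer, and strncpy's non-termination pitfall. The copy_nodeid name is assumed for illustration.)

#include <string.h>

#define NODEDESC_LEN 64                 /* bytes in the wire field */
#define NODEIDLEN    (NODEDESC_LEN + 1) /* + 1 for the NUL terminator */

/* strncpy() does not NUL-terminate when strlen(src) >= n, so either
 * terminate explicitly, as here, or use the "n - 1" idiom from the
 * patch above. */
static void copy_nodeid(char dst[NODEIDLEN], const char *src)
{
        strncpy(dst, src, NODEIDLEN - 1);
        dst[NODEIDLEN - 1] = '\0';
}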
-- Hal > Sasha > > > diff --git a/ibsim/sim.h b/ibsim/sim.h > index 81bb47c..d3294f4 100644 > --- a/ibsim/sim.h > +++ b/ibsim/sim.h > @@ -67,7 +67,7 @@ > > #define NODEIDBASE 20 > #define NODEPREFIX 20 > -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) > +#define NODEIDLEN 65 > #define ALIASLEN 40 > > #define MAXHOPS 16 > @@ -237,7 +237,7 @@ struct Node { > uint64_t sysguid; > uint64_t nodeguid; // also portguid > int portsbase; // in port table > - char nodeid[64]; // contain nodeid[NODEIDLEN] > + char nodeid[NODEIDLEN]; // contain nodeid[NODEIDLEN] > uint8_t nodeinfo[64]; > char nodedesc[64]; > Switch *sw; > diff --git a/ibsim/sim_cmd.c b/ibsim/sim_cmd.c > index 5f64229..fe3e9be 100644 > --- a/ibsim/sim_cmd.c > +++ b/ibsim/sim_cmd.c > @@ -56,7 +56,7 @@ extern Port *ports; > extern Port **lids; > extern int netnodes, netports, netswitches; > > -#define NAMELEN 64 > +#define NAMELEN NODEIDLEN > > char *portstates[] = { > "-", "Down", "Init", "Armed", "Active", > diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c > index bf7a06a..2a9c19b 100644 > --- a/ibsim/sim_net.c > +++ b/ibsim/sim_net.c > @@ -267,17 +267,16 @@ static int new_hca(Node * nd) > return 0; > } > > -static int build_nodeid(char *nodeid, char *base) > +static int build_nodeid(char *nodeid, size_t len, char *base) > { > if (strchr(base, '#') || strchr(base, '@')) { > IBWARN("bad nodeid \"%s\": '#' & '@' characters are reserved", > base); > return -1; > } > - if (netprefix[0] == 0) > - strncpy(nodeid, base, NODEIDLEN); > - else > - snprintf(nodeid, NODEIDLEN, "%s#%s", netprefix, base); > + > + snprintf(nodeid, len, "%s%s%s", netprefix, *netprefix ? "#" : "", base); > + > return 0; > } > > @@ -287,7 +286,7 @@ static Node *new_node(int type, char *nodename, char *nodedesc, int nodeports) > char nodeid[NODEIDLEN]; > Node *nd; > > - if (build_nodeid(nodeid, nodename) < 0) > + if (build_nodeid(nodeid, sizeof(nodeid), nodename) < 0) > return 0; > > if (find_node(nodeid)) { > @@ -310,11 +309,9 @@ static Node *new_node(int type, char *nodename, char *nodedesc, int nodeports) > > nd->type = type; > nd->numports = nodeports; > - strncpy(nd->nodeid, nodeid, NODEIDLEN - 1); > - if (nodedesc && nodedesc[0]) > - strncpy(nd->nodedesc, nodedesc, NODEIDLEN - 1); > - else > - strncpy(nd->nodedesc, nodeid, NODEIDLEN - 1); > + strncpy(nd->nodeid, nodeid, sizeof(nd->nodeid) - 1); > + strncpy(nd->nodedesc, nodedesc && *nodedesc ? nodedesc : nodeid, > + sizeof(nd->nodedesc) - 1); > nd->sysguid = nd->nodeguid = guids[type]; > if (type == SWITCH_NODE) { > nodeports++; // port 0 is SMA > @@ -551,22 +548,20 @@ char *expand_name(char *base, char *name, char **portstr) > if (netprefix[0] != 0 && !strchr(base, '#')) > snprintf(name, NODEIDLEN, "%s#%s", netprefix, base); > else > - strcpy(name, base); > + strncpy(name, base, NODEIDLEN - 1); > if (portstr) > - *portstr = 0; > + *portstr = NULL; > PDEBUG("name %s port %s", name, portstr ? *portstr : 0); > return name; > } > - if (base[0] == '@') > - snprintf(name, ALIASLEN, "%s%s", netprefix, base); > - else > - strcpy(name, base); > + > + snprintf(name, NODEIDLEN, "%s%s", base[0] == '@' ? netprefix : "", base); > PDEBUG("alias %s", name); > > if (!(s = map_alias(name))) > return 0; > > - strcpy(name, s); > + strncpy(name, s, NODEIDLEN - 1); > > if (portstr) { > *portstr = name; > @@ -1075,12 +1070,12 @@ int link_ports(Port * lport, Port * rport) > lport->remotenode = rnode; > lport->remoteport = rport->portnum; > set_portinfo(lport, lnode->type == SWITCH_NODE ? 
swport : hcaport); > - memcpy(lport->remotenodeid, rnode->nodeid, NODEIDLEN); > + memcpy(lport->remotenodeid, rnode->nodeid, sizeof(lport->remotenodeid)); > > rport->remotenode = lnode; > rport->remoteport = lport->portnum; > set_portinfo(rport, rnode->type == SWITCH_NODE ? swport : hcaport); > - memcpy(rport->remotenodeid, lnode->nodeid, NODEIDLEN); > + memcpy(rport->remotenodeid, lnode->nodeid, sizeof(rport->remotenodeid)); > lport->state = rport->state = 2; // Initialilze > lport->physstate = rport->physstate = 5; // LinkUP > if (lnode->sw) > @@ -1166,7 +1161,7 @@ int connect_ports(void) > } > } else if (remoteport->remoteport != port->portnum || > strncmp(remoteport->remotenodeid, port->node->nodeid, > - NODEIDLEN)) { > + sizeof(remoteport->remotenodeid))) { > IBWARN > ("remote port %d in node \"%s\" is not connected to " > "node \"%s\" port %d (\"%s\" %d)", > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Mon May 12 15:23:16 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 22:23:16 +0000 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512222316.GK17046@sashak.voltaire.com> Hi Hal, On 14:02 Sun 04 May , Hal Rosenstock wrote: > > I have a question on ibsim parsing: > > In sim_net.c:parse_port, there is the following code: > parse_opt: > line = s; > while (s && (s = strchr(s + 1, '='))) { > char *opt = s; > while (opt && !isalpha(*opt)) > opt--; > if (!opt || parse_port_opt(port, opt, s + 1) < 0) { > IBWARN("bad port option"); > return -1; > } > line = s + 1; > } > > port options appear include w for link width and s for link speed. > > An issue is that this parsing starts inside the NodeDescription. = is a > valid character there and causes an invalid port option. I can see the issue, but not sure I know best solution yet (never used 's' and 'w' options and didn't see topology files examples where it was used). > There seem to > me to be two choices here: > 1. Either ignore unknown options in parse_port_option and the rule > becomes w= and s= are invalid in the NodeDescription (which is > artificial and not really per the spec). > or Not sure that "per the spec" restriction is applicable here, it is only about simulator topology file format and this file is editable (node description value is ignored by ibsim parser in those lines anyway). > 2. Find some way to start this port option parsing past the end of the > NodeDescription. As I'm not sure about all the formats supported, I > don't know how to determine a "solid" way to get past the end of the > NodeDescription in the topology format. Do you ? Do you have examples of 'w' or 's' usage? If it is something like: [1] "S-0008f104003f15e4"[19][ext 1] w=4 s=2 # lid 460 lmc 1 "ISR9288/ISR9096 Voltaire sLB-24D" , then it should be easy separable by '#' character. 
Sasha From hrosenstock at xsigo.com Mon May 12 12:35:27 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 12:35:27 -0700 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <20080512222316.GK17046@sashak.voltaire.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> Message-ID: <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> Hi Sasha, On Mon, 2008-05-12 at 22:23 +0000, Sasha Khapyorsky wrote: > Hi Hal, > > On 14:02 Sun 04 May , Hal Rosenstock wrote: > > > > I have a question on ibsim parsing: > > > > In sim_net.c:parse_port, there is the following code: > > parse_opt: > > line = s; > > while (s && (s = strchr(s + 1, '='))) { > > char *opt = s; > > while (opt && !isalpha(*opt)) > > opt--; > > if (!opt || parse_port_opt(port, opt, s + 1) < 0) { > > IBWARN("bad port option"); > > return -1; > > } > > line = s + 1; > > } > > > > port options appear include w for link width and s for link speed. > > > > An issue is that this parsing starts inside the NodeDescription. = is a > > valid character there and causes an invalid port option. > > I can see the issue, but not sure I know best solution yet (never used > 's' and 'w' options and didn't see topology files examples where it was > used). > > > There seem to > > me to be two choices here: > > 1. Either ignore unknown options in parse_port_option and the rule > > becomes w= and s= are invalid in the NodeDescription (which is > > artificial and not really per the spec). > > or > > Not sure that "per the spec" restriction is applicable here, it is > only about simulator topology file format and this file is editable > (node description value is ignored by ibsim parser in those lines > anyway). > > > 2. Find some way to start this port option parsing past the end of the > > NodeDescription. As I'm not sure about all the formats supported, I > > don't know how to determine a "solid" way to get past the end of the > > NodeDescription in the topology format. Do you ? > > Do you have examples of 'w' or 's' usage? No. > If it is something like: > > [1] "S-0008f104003f15e4"[19][ext 1] w=4 s=2 # lid 460 lmc 1 "ISR9288/ISR9096 Voltaire sLB-24D" > > , then it should be easy separable by '#' character. The = character is part of the NodeDescription and doesn't get skipped even though it should. -- Hal > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Mon May 12 15:57:25 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 22:57:25 +0000 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512225725.GN17046@sashak.voltaire.com> On 12:35 Mon 12 May , Hal Rosenstock wrote: > > The = character is part of the NodeDescription and doesn't get skipped > even though it should. I'm not following. If '=' character is used in NodeDescription why it should be skipped? 
Sasha From steiner at sgi.com Mon May 12 13:01:13 2008 From: steiner at sgi.com (Jack Steiner) Date: Mon, 12 May 2008 15:01:13 -0500 Subject: [ofa-general] Re: [PATCH 001/001] mmu-notifier-core v17 In-Reply-To: <20080509193230.GH7710@duo.random> References: <20080509193230.GH7710@duo.random> Message-ID: <20080512200113.GA31862@sgi.com> On Fri, May 09, 2008 at 09:32:30PM +0200, Andrea Arcangeli wrote: > From: Andrea Arcangeli > > With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to > pages. There are secondary MMUs (with secondary sptes and secondary > tlbs) too. sptes in the kvm case are shadow pagetables, but when I say > spte in mmu-notifier context, I mean "secondary pte". In GRU case > there's no actual secondary pte and there's only a secondary tlb > because the GRU secondary MMU has no knowledge about sptes and every > secondary tlb miss event in the MMU always generates a page fault that > has to be resolved by the CPU (this is not the case of KVM where the a > secondary tlb miss will walk sptes in hardware and it will refill the >... FYI, I applied the patch to a tree that has the GRU driver. All regression tests passed. --- jack From sashak at voltaire.com Mon May 12 16:00:09 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 23:00:09 +0000 Subject: [ofa-general] Re: [PATCH] ibsim/sim.h: Fix NodeDescription size In-Reply-To: <1210620025.2026.567.camel@hrosenstock-ws.xsigo.com> References: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> <20080512184138.GD17046@sashak.voltaire.com> <1210609696.2026.508.camel@hrosenstock-ws.xsigo.com> <20080512212536.GJ17046@sashak.voltaire.com> <1210620025.2026.567.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512230009.GO17046@sashak.voltaire.com> On 12:20 Mon 12 May , Hal Rosenstock wrote: > > > diff --git a/ibsim/sim.h b/ibsim/sim.h > > > index bea136a..0bf14fd 100644 > > > --- a/ibsim/sim.h > > > +++ b/ibsim/sim.h > > > @@ -65,9 +65,8 @@ > > > #define DEFAULT_LINKWIDTH LINKWIDTH_4x > > > #define DEFAULT_LINKSPEED LINKSPEED_SDR > > > > > > -#define NODEIDBASE 20 > > > #define NODEPREFIX 20 > > I think this can now be eliminated as the only use was in NODEIDLEN. Agree (missed this). > > > -#define NODEIDLEN (NODEIDBASE+NODEPREFIX+1) > > > +#define NODEIDLEN 64 > > > #define ALIASLEN 40 > > > > It will likely prevent overflow, but will potentially truncate last > > NodeDesc character due to string NULL terminator. What about something > > like below? > > Thanks. This seems to work for my usage but not sure about some of the > other concatenated names and whether it accommodates those other usages > which I don't fully understand. There are no changes in the parser, just field sizes handling. Sasha From hrosenstock at xsigo.com Mon May 12 13:02:10 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 13:02:10 -0700 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <20080512225725.GN17046@sashak.voltaire.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> Message-ID: <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 22:57 +0000, Sasha Khapyorsky wrote: > On 12:35 Mon 12 May , Hal Rosenstock wrote: > > > > The = character is part of the NodeDescription and doesn't get skipped > > even though it should. > > I'm not following.
If '=' character is used in NodeDescription why it > should be skipped? In your previous post, you wrote: "node description value is ignored by ibsim parser in those lines anyway" so that seems like it should be "skipped" rather than treating it like some keyword precedes it. -- Hal > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Mon May 12 16:10:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 23:10:03 +0000 Subject: [ofa-general] Re: [PATCH] ibsim/sim.h: Fix NodeDescription size In-Reply-To: <20080512230009.GO17046@sashak.voltaire.com> References: <1209934867.20493.183.camel@hrosenstock-ws.xsigo.com> <20080512184138.GD17046@sashak.voltaire.com> <1210609696.2026.508.camel@hrosenstock-ws.xsigo.com> <20080512212536.GJ17046@sashak.voltaire.com> <1210620025.2026.567.camel@hrosenstock-ws.xsigo.com> <20080512230009.GO17046@sashak.voltaire.com> Message-ID: <20080512231003.GP17046@sashak.voltaire.com> On 23:00 Mon 12 May , Sasha Khapyorsky wrote: > > > > I think this can now be eliminated as the only use was in NODEIDLEN. > > Agree (missed this). I applied both patches. Sasha From sashak at voltaire.com Mon May 12 16:12:48 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 23:12:48 +0000 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080512231248.GQ17046@sashak.voltaire.com> On 13:02 Mon 12 May , Hal Rosenstock wrote: > > In your previous post, you wrote: > "node description value is ignored by ibsim parser in those lines > anyway" > so that seems like it should be "skipped" rather than treating it like > some keyword precedes it. Yes, that is correct. Sasha From sashak at voltaire.com Mon May 12 16:18:31 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 12 May 2008 23:18:31 +0000 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <20080512231248.GQ17046@sashak.voltaire.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> <20080512231248.GQ17046@sashak.voltaire.com> Message-ID: <20080512231831.GR17046@sashak.voltaire.com> On 23:12 Mon 12 May , Sasha Khapyorsky wrote: > On 13:02 Mon 12 May , Hal Rosenstock wrote: > > > > In your previous post, you wrote: > > "node description value is ignored by ibsim parser in those lines > > anyway" > > so that seems like it should be "skipped" rather than treating it like > > some keyword precedes it. > > Yes, that is correct. Something like this should help (eg ignore "unknown" options).
Sasha diff --git a/ibsim/sim_net.c index 2a9c19b..6e3c0e9 100644 --- a/ibsim/sim_net.c +++ b/ibsim/sim_net.c @@ -432,32 +432,31 @@ static int parse_port_lid_and_lmc(Port * port, char *line) static int parse_port_opt(Port * port, char *opt, char *val) { - int width; - int speed; + int v; - if (*opt == 'w') { - width = strtoul(val, 0, 0); - if (!is_linkwidth_valid(width)) + switch (*opt) { + case 'w': + v = strtoul(val, 0, 0); + if (!is_linkwidth_valid(v)) return -1; - port->linkwidthena = width; + port->linkwidthena = v; DEBUG("port %p linkwidth enabled set to %d", port, port->linkwidthena); - return 0; - } else if (*opt == 's') { - speed = strtoul(val, 0, 0); - - if (!is_linkspeed_valid(speed)) + break; + case 's': + v = strtoul(val, 0, 0); + if (!is_linkspeed_valid(v)) return -1; - port->linkspeedena = speed; + port->linkspeedena = v; DEBUG("port %p linkspeed enabled set to %d", port, port->linkspeedena); - return 0; - } else { - IBWARN("unknown opt %c", *opt); - return -1; + break; + default: + break; } + return 0; } static void init_ports(Node * node, int type, int maxports) From hrosenstock at xsigo.com Mon May 12 13:40:04 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 12 May 2008 13:40:04 -0700 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <20080512231831.GR17046@sashak.voltaire.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> <20080512231248.GQ17046@sashak.voltaire.com> Message-ID: <1210624804.2026.586.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-12 at 23:18 +0000, Sasha Khapyorsky wrote: > On 23:12 Mon 12 May , Sasha Khapyorsky wrote: > > On 13:02 Mon 12 May , Hal Rosenstock wrote: > > > > > > In your previous post, you wrote: > > > "node description value is ignored by ibsim parser in those lines > > > anyway" > > > so that seems like it should be "skipped" rather than treating it like > > > some keyword precedes it. > > > > Yes, that is correct. > > Something like this should help (eg ignore "unknown" options). Right; that's what I meant by option 1. Also, it's not really unknown options since it's part of the NodeDescription. This works as long as the "known" options (currently s= and w=) are not part of NodeDescription. It works for the real life use case that started this.
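(A small sketch of the other approach floated earlier in this thread: cutting the line at the '#' comment before scanning for '=' options, so a NodeDescription in the trailing comment can never be mistaken for an option. Illustrative only; the strip_comment name is an assumption, and this is not the code ibsim applied.)

#include <string.h>

/* Truncate a topology-file line at its '#' comment, in place.
 * Run this before the '=' scan so "w=4 s=2" is still seen but a
 * NodeDescription such as "ISR9288 x=y" in the comment is not. */
static void strip_comment(char *line)
{
        char *hash = strchr(line, '#');
        if (hash)
                *hash = '\0';
}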
-- Hal > Sasha > > > diff --git a/ibsim/sim_net.c b/ibsim/sim_net.c > index 2a9c19b..6e3c0e9 100644 > --- a/ibsim/sim_net.c > +++ b/ibsim/sim_net.c > @@ -432,32 +432,31 @@ static int parse_port_lid_and_lmc(Port * port, char *line) > > static int parse_port_opt(Port * port, char *opt, char *val) > { > - int width; > - int speed; > + int v; > > - if (*opt == 'w') { > - width = strtoul(val, 0, 0); > - if (!is_linkwidth_valid(width)) > + switch (*opt) { > + case 'w': > + v = strtoul(val, 0, 0); > + if (!is_linkwidth_valid(v)) > return -1; > > - port->linkwidthena = width; > + port->linkwidthena = v; > DEBUG("port %p linkwidth enabled set to %d", port, > port->linkwidthena); > - return 0; > - } else if (*opt == 's') { > - speed = strtoul(val, 0, 0); > - > - if (!is_linkspeed_valid(speed)) > + break; > + case 's': > + v = strtoul(val, 0, 0); > + if (!is_linkspeed_valid(v)) > return -1; > > - port->linkspeedena = speed; > + port->linkspeedena = v; > DEBUG("port %p linkspeed enabled set to %d", port, > port->linkspeedena); > - return 0; > - } else { > - IBWARN("unknown opt %c", *opt); > - return -1; > + break; > + default: > + break; > } > + return 0; > } > > static void init_ports(Node * node, int type, int maxports) > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From christophe.jaillet at wanadoo.fr Mon May 12 14:35:59 2008 From: christophe.jaillet at wanadoo.fr (Christophe Jaillet) Date: Mon, 12 May 2008 23:35:59 +0200 Subject: [ofa-general] ***SPAM*** [PATCH 1/1] infiniband/hw/nes/: avoid unnecessary memset Message-ID: <20080512213601.626C91C0008F@mwinf2103.orange.fr> From: Christophe Jaillet Hi, here is a patch against linux/drivers/infiniband/hw/nes/nes_cm.c which : 1) Remove an explicit memset(.., 0, ...) to a variable allocated with kzalloc (i.e. 'listener'). Note: this patch is based on 'linux-2.6.25.tar.bz2' Signed-off-by: Christophe Jaillet --- --- linux/drivers/infiniband/hw/nes/nes_cm.c 2008-04-17 04:49:44.000000000 +0200 +++ linux/drivers/infiniband/hw/nes/nes_cm.c.cj 2008-05-12 23:31:24.000000000 +0200 @@ -1587,7 +1587,6 @@ static struct nes_cm_listener *mini_cm_l return NULL; } - memset(listener, 0, sizeof(struct nes_cm_listener)); listener->loc_addr = htonl(cm_info->loc_addr); listener->loc_port = htons(cm_info->loc_port); listener->reused_node = 0; From weiny2 at llnl.gov Mon May 12 14:45:41 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 12 May 2008 14:45:41 -0700 Subject: [ofa-general] Re: [RFC][PATCH 0/4] opensm: using conventional config file In-Reply-To: <1210617225.11133.461.camel@cardanus.llnl.gov> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> Message-ID: <20080512144541.3879de40.weiny2@llnl.gov> Sasha, Also, I wonder if anyone would object to you applying your patches to the tree as is and we work out the details from there? I don't see anything wrong with your patches except that more work will be needed, as you said, in the man pages and scripts. After you apply your patches I think we can start in changing the man pages and scripts. Al, Tim, and I started talking about this after I ran into problems with the current config files trying to write a PerfMgr HOWTO today. :-( Ira On Mon, 12 May 2008 11:33:45 -0700 Al Chu wrote: > Hey Sasha, > > Ira and I were chatting. 
A few other comments: > > 1) Many configuration values are not output by default in opensm right > now, mainly b/c it behaves like a cache rather than a configuration > file. i.e. > > if (p_opts->connect_roots) > fprintf(opts_file, > "# Connect roots (use FALSE if unsure)\n" > "connect_roots %s\n\n", > p_opts->connect_roots ? "TRUE" : "FALSE"); > > Going forward w/ a config file, I think these should be output by > default all the time so users know they exist. > > 2) Will there be an option to specify an alternate configuration file, > i.e. not /etc/opensm/opensm.conf? > > Al > > On Wed, 2008-04-09 at 01:10 +0000, Sasha Khapyorsky wrote: > > Hi, > > > > This is an attempt to make some order with OpenSM configuration. Now it > > will use conventional (similar to other programs which may have > > configuration) config ($sysconfig/etc/opensm/opensm.conf) file instead > > of option cache file. Config file for some startup scripts should go > > away. Option '-c' is preserved - it can be useful for config file > > template generation, but OpenSM will not try to read option cache file. > > > > This is RFC yet. In addition to this we will need to update scripts and > > man pages. > > > > Any feedback? Thoughts? > > > > Sasha > -- > Albert Chu > chu11 at llnl.gov > 925-422-5311 > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From ralph.campbell at qlogic.com Mon May 12 16:13:25 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Mon, 12 May 2008 16:13:25 -0700 Subject: [ofa-general] bitops take an unsigned long * In-Reply-To: References: <20080508222916.277649ca.akpm@linux-foundation.org> Message-ID: <1210634005.3949.26.camel@brick.pathscale.com> This change looks fine to me. ipath_sdma_status doesn't depend on hardware so changing #define IPATH_SDMA_RUNNING 62 #define IPATH_SDMA_SHUTDOWN 63 to different values is fine. Roland, do you want me to send a patch for this? On Fri, 2008-05-09 at 22:37 -0700, Roland Dreier wrote: > > Most architectures could (and should) take an unsigned long * arg for their > > bitops. x86 doesn't do this and it needs fixing. I fixed it. Infiniband > > is being a problem.
> > > drivers/infiniband/hw/ipath/ipath_driver.c: In function 'decode_sdma_errs': > > drivers/infiniband/hw/ipath/ipath_driver.c:926: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:926: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c: In function 'ipath_cancel_sends': > > drivers/infiniband/hw/ipath/ipath_driver.c:1901: warning: passing argument 2 of 'test_and_set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1902: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1902: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1934: warning: passing argument 2 of 'set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1949: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1949: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1950: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1950: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1955: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_driver.c:1955: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c: In function 'handle_sdma_errors': > > drivers/infiniband/hw/ipath/ipath_intr.c:553: warning: passing argument 2 of '__set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:554: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:554: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c: In function 'handle_sdma_intr': > > drivers/infiniband/hw/ipath/ipath_intr.c:566: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:566: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:570: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:570: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:575: warning: passing argument 2 of '__set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:579: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_intr.c:579: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_notify_task': > > drivers/infiniband/hw/ipath/ipath_sdma.c:200: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:200: warning: passing argument 2 of 'variable_test_bit' from 
incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': > > drivers/infiniband/hw/ipath/ipath_sdma.c:236: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:236: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:253: warning: passing argument 2 of '__set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:354: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:354: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'setup_sdma': > > drivers/infiniband/hw/ipath/ipath_sdma.c:504: warning: passing argument 2 of '__set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'teardown_sdma': > > drivers/infiniband/hw/ipath/ipath_sdma.c:521: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:522: warning: passing argument 2 of '__set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:523: warning: passing argument 2 of '__set_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_restart_sdma': > > drivers/infiniband/hw/ipath/ipath_sdma.c:608: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:608: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:609: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:609: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:612: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:613: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:614: warning: passing argument 2 of '__clear_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_sdma_verbs_send': > > 
drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > drivers/infiniband/hw/ipath/ipath_sdma.c:691: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type > > So all of these are ipath warnings, seemingly all because > ipath_devdata.ipath_sdma_status is a u64. The stupid fix is to change > this declaration to unsigned long as below, but this sets a trap if the > driver is ever fixed so that it doesn't depend on 64BIT, because of > > /* bit positions for sdma_status */ > #define IPATH_SDMA_ABORTING 0 > #define IPATH_SDMA_DISARMED 1 > #define IPATH_SDMA_DISABLED 2 > #define IPATH_SDMA_LAYERBUF 3 > #define IPATH_SDMA_RUNNING 62 > #define IPATH_SDMA_SHUTDOWN 63 > > I don't see that this status is shared with hardware, and I don't see > why the RUNNING and SHUTDOWN bits need to be 62 and 63... converting to > unsigned long and moving those to bits 4 and 5 seems like it might be a > clean fix. > > The other option is to convert to a bitmap and using the bitmap > operations, which ends up being a bigger patch. > > But since I don't really understand this part of the driver, some > guidance would be helpful... > > - R. > > > diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c > index ce7b7c3..7635ace 100644 > --- a/drivers/infiniband/hw/ipath/ipath_driver.c > +++ b/drivers/infiniband/hw/ipath/ipath_driver.c > @@ -1894,7 +1894,7 @@ void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl) > */ > if (dd->ipath_flags & IPATH_HAS_SEND_DMA) { > int skip_cancel; > - u64 *statp = &dd->ipath_sdma_status; > + unsigned long *statp = &dd->ipath_sdma_status; > > spin_lock_irqsave(&dd->ipath_sdma_lock, flags); > skip_cancel = > diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h > index 02b24a3..a46f8ad 100644 > --- a/drivers/infiniband/hw/ipath/ipath_kernel.h > +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h > @@ -483,7 +483,7 @@ struct ipath_devdata { > > /* SendDMA related entries */ > spinlock_t ipath_sdma_lock; > - u64 ipath_sdma_status; > + unsigned long ipath_sdma_status; > unsigned long ipath_sdma_abort_jiffies; > unsigned long ipath_sdma_abort_intr_timeout; > unsigned long ipath_sdma_buf_jiffies; > > From gsadasiv7 at gmail.com Mon May 12 16:32:26 2008 From: gsadasiv7 at gmail.com (Ganesh Sadasivan) Date: Mon, 12 May 2008 16:32:26 -0700 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: References: <20070525212214.20500.qmail@station183.com> Message-ID: <532b813a0805121632p2ebccd11h80496f9b9c09e0c8@mail.gmail.com> Hi, Was there any resolution to this issue? I am seeing the exact behavior where no event is generated after doing a send. There were a few successful sends that got completion events. But it just stops without any error indication. 
I am pasting the part of the code which does this operation: create_qp () { struct ibv_qp_init_attr init_attr; init_attr.cap.max_send_wr = 20; init_attr.cap.max_recv_wr = 20; init_attr.cap.max_recv_sge = 1; init_attr.cap.max_send_sge = 1; init_attr.qp_type = IBV_QPT_RC; init_attr.send_cq = send_cq; init_attr.recv_cq = recv_cq; init_attr.sq_sig_all = 0; qp = ibv_create_qp(pd, &init_attr); if (!qp) { return 1; } attr.qp_state = IBV_QPS_INIT; attr.pkey_index = 0; attr.port_num = src_port; attr.qp_access_flags = 0; if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS)) { return 1; } attr.qp_state = IBV_QPS_RTR; attr.path_mtu = IBV_MTU_2048; attr.rq_psn = 1; attr.dest_qp_num = dst_qp_num; attr.max_dest_rd_atomic = 1; attr.ah_attr.dlid = dst_lid; attr.ah_attr.sl = serv_level; attr.ah_attr.port_num = src_port; attr.min_rnr_timer = 12; attr.ah_attr.is_global = 0; attr.ah_attr.src_path_bits = 0; if (ibv_modify_qp(qp, &attr, IBV_QP_STATE| IBV_QP_PATH_MTU| IBV_QP_RQ_PSN| IBV_QP_DEST_QPN| IBV_QP_MAX_DEST_RD_ATOMIC| IBV_QP_AV| IBV_QP_MIN_RNR_TIMER)) { return 1; } attr.qp_state = IBV_QPS_RTS; attr.timeout = 10; attr.retry_cnt = 7; attr.rnr_retry = 7; attr.sq_psn = 1; attr.max_rd_atomic = 1; if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC)) { return 1; } } send_data(char *buf, int datasz, void *arg) { int ret; /* * Save the WR-id so that we can compare against this * once tx is done. */ sq_wr_id[tail] = global_cnt++; send_sgl[tail].addr = (u64) (unsigned long) buf; send_sgl[tail].length = datasz; send_sgl[tail].lkey = send_mr->lkey; sq_wr[tail].opcode = IBV_WR_SEND; sq_wr[tail].send_flags = IBV_SEND_SIGNALED; sq_wr[tail].sg_list = &send_sgl[tail]; sq_wr[tail].num_sge = 1; send_data[tail] = (u64)buf; send_arg[tail] = arg; ret = ibv_post_send(qp, &sq_wr[tail], &bad_wr); if (tail == 19) { //max_send_wr -1 tail = 0; } else { tail += 1; } return ret; } recv_thread (void *arg) { struct ibv_cq *ev_cq; void *ev_ctx; int ret; ret = ibv_get_cq_event(comp_channel, &ev_cq, &ev_ctx); if (ret) { return 1; } ibv_ack_cq_events(ev_cq, 1); ret = ibv_req_notify_cq(ev_cq, 0); if (ret) { return 1; } while ((rv = ibv_poll_cq(cq, 1, &wc)) == 1) { switch (wc.opcode) { case IBV_WC_SEND: { if (wc.status == IBV_WC_SUCCESS) { if (sq_wr_id[head] != wc.wr_id) { datasz = 0; return 1; } } else { retuen 1; } buf = (char *)send_data[head]; arg = (u64)send_arg[head]; sq_wr_id[head] = 0; if (head == 19) {//max_send_wr -1 head = 0; } else { head += 1; } break; } } } Thanks Ganesh On Mon, May 28, 2007 at 9:28 PM, Roland Dreier wrote: > > Any ideas on why the ibv_get_cq_event() would never see an event > > after a "successful" send requesting a completion event? > > It's either a bug in your code or a bug in the stack below your code. > The best way to debug this would be for you to post your actual code > (in a form that someone else can run), so that we can either point out > what's wrong with your code, or have a test case for the real bug. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... 
From sean.hefty at intel.com Mon May 12 16:44:34 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 12 May 2008 16:44:34 -0700 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: <532b813a0805121632p2ebccd11h80496f9b9c09e0c8@mail.gmail.com> References: <20070525212214.20500.qmail@station183.com> <532b813a0805121632p2ebccd11h80496f9b9c09e0c8@mail.gmail.com> Message-ID: <000301c8b48a$2200e670$465a180a@amr.corp.intel.com> > attr.rnr_retry = 7; Can you drop this to 6 and see if the behavior changes? > recv_thread (void *arg) > { > struct ibv_cq *ev_cq; > void *ev_ctx; > int ret; > > > ret = ibv_get_cq_event(comp_channel, &ev_cq, &ev_ctx); > if (ret) { > return 1; > } > > ibv_ack_cq_events(ev_cq, 1); > > ret = ibv_req_notify_cq(ev_cq, 0); > if (ret) { > return 1; > } > > while ((rv = ibv_poll_cq(cq, 1, &wc)) == 1) { > switch (wc.opcode) { > case IBV_WC_SEND: { > if (wc.status == IBV_WC_SUCCESS) { > if (sq_wr_id[head] != wc.wr_id) { > datasz = 0; > return 1; > } > } else { > retuen 1; ^^^^^^ Are you sure this is the code that's running? > } > buf = (char *)send_data[head]; > arg = (u64)send_arg[head]; > sq_wr_id[head] = 0; > if (head == 19) {//max_send_wr -1 > head = 0; > } else { > head += 1; > } > break; > } > } > > } Where do you re-post receives? From gsadasiv7 at gmail.com Mon May 12 17:03:00 2008 From: gsadasiv7 at gmail.com (Ganesh Sadasivan) Date: Mon, 12 May 2008 17:03:00 -0700 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: <000301c8b48a$2200e670$465a180a@amr.corp.intel.com> References: <20070525212214.20500.qmail@station183.com> <532b813a0805121632p2ebccd11h80496f9b9c09e0c8@mail.gmail.com> <000301c8b48a$2200e670$465a180a@amr.corp.intel.com> Message-ID: <532b813a0805121703p76df78a8g51256d5bcdb7c330@mail.gmail.com> On Mon, May 12, 2008 at 4:44 PM, Sean Hefty wrote: > > attr.rnr_retry = 7; > > Can you drop this to 6 and see if the behavior changes? That does not change the behavior. > > > recv_thread (void *arg) > > { > > struct ibv_cq *ev_cq; > > void *ev_ctx; > > int ret; > > > > > > ret = ibv_get_cq_event(comp_channel, &ev_cq, &ev_ctx); > > if (ret) { > > return 1; > > } > > > > ibv_ack_cq_events(ev_cq, 1); > > > > ret = ibv_req_notify_cq(ev_cq, 0); > > if (ret) { > > return 1; > > } > > > > while ((rv = ibv_poll_cq(cq, 1, &wc)) == 1) { > > switch (wc.opcode) { > > case IBV_WC_SEND: { > > if (wc.status == IBV_WC_SUCCESS) { > > if (sq_wr_id[head] != wc.wr_id) { > > datasz = 0; > > return 1; > > } > > } else { > > retuen 1; > > ^^^^^^ > Are you sure this is the code that's running? This is a cut-paste error. I just extracted the relevant code from the actual piece. > > > } > > buf = (char *)send_data[head]; > > arg = (u64)send_arg[head]; > > sq_wr_id[head] = 0; > > if (head == 19) {//max_send_wr -1 > > head = 0; > > } else { > > head += 1; > > } > > break; > > } > > } > > > > } > > Where do you re-post receives? I just pasted the send part of the code. Should I send the receive code too? Thanks Ganesh -------------- next part -------------- An HTML attachment was scrubbed... URL:
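For reference, and as a generic sketch rather than a diagnosis of the problem above: the event-driven completion loop described in the libibverbs ibv_get_cq_event() documentation re-arms notification and then drains the CQ on every wakeup. If the CQ is not polled to empty after re-arming, a completion that slips in between the last poll and the next wait is never signalled and the reader blocks forever. The cq_event_loop name and the setup of cq/channel are assumptions here:

#include <infiniband/verbs.h>

/* Generic completion-event loop; cq and channel are assumed created
 * elsewhere (ibv_create_cq() bound to the completion channel), and
 * receives are assumed to be re-posted by the wc handler. */
static int cq_event_loop(struct ibv_comp_channel *channel, struct ibv_cq *cq)
{
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        struct ibv_wc wc;

        if (ibv_req_notify_cq(cq, 0))   /* arm before the first wait */
                return 1;

        for (;;) {
                if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))
                        return 1;
                ibv_ack_cq_events(ev_cq, 1);
                if (ibv_req_notify_cq(ev_cq, 0))  /* re-arm first... */
                        return 1;
                while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
                        /* ...then drain fully: handle wc here */
                }
        }
}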
From rdreier at cisco.com Mon May 12 18:03:35 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 May 2008 18:03:35 -0700 Subject: [ofa-general] ibv_get_cq_event blocking forever after successful ibv_post_send... In-Reply-To: <532b813a0805121703p76df78a8g51256d5bcdb7c330@mail.gmail.com> (Ganesh Sadasivan's message of "Mon, 12 May 2008 17:03:00 -0700") References: <20070525212214.20500.qmail@station183.com> <532b813a0805121632p2ebccd11h80496f9b9c09e0c8@mail.gmail.com> <000301c8b48a$2200e670$465a180a@amr.corp.intel.com> <532b813a0805121703p76df78a8g51256d5bcdb7c330@mail.gmail.com> Message-ID: > This is a cut-paste error. I just extracted the relevant code from the > actual piece. Unless you send an actual test app that someone could really compile and run, it's very hard to help debug it. Basically your only chance is if you have a really obvious bug that someone could see by reading your code. From keshetti85-student at yahoo.co.in Mon May 12 21:33:53 2008 From: keshetti85-student at yahoo.co.in (Keshetti Mahesh) Date: Tue, 13 May 2008 10:03:53 +0530 Subject: [ofa-general] OpenSM SA dump ? In-Reply-To: <829ded920805122128q7fd0956fkcf2ca86635b4673c@mail.gmail.com> References: <829ded920805122128q7fd0956fkcf2ca86635b4673c@mail.gmail.com> Message-ID: <829ded920805122133j76f483et8280197f216721c6@mail.gmail.com> Thanks Hal for the reply. > Only multicast, services, and informs are dumped. These are the so > called client registrations. > > What SA information are you looking for ? I expected 'opensm-sa.dump' file to contain all the configured paths between the hosts. Is there any way to dump the local SA cache to a file with current OFED ? -Mahesh From okir at lst.de Mon May 12 23:08:09 2008 From: okir at lst.de (Olaf Kirch) Date: Tue, 13 May 2008 08:08:09 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: <200805121157.38135.jon@opengridcomputing.com> References: <200805121157.38135.jon@opengridcomputing.com> Message-ID: <200805130808.10510.okir@lst.de> On Monday 12 May 2008 18:57:38 Jon Mason wrote: > As part of my effort to get RDS working for iWARP, I will be working on the > RDS flow control. Flow control is needed for iWARP due to the fact that > iWARP connections terminate if there is no posted recv for an incoming > packet. IB connections do not have this limitation if setup in a certain > way. In its current implementation, RDS sets the connection attribute > rnr_retry to 7.
This causes IB to retransmit until there is a posted recv > buffer. I think for the initial implementation, it is fine for iWARP to just fail the connect when that happens, and re-establish the connection. If you use reasonable defaults for the send and recv queues, receiver overruns should be relatively rare. Once everything else works, let's revisit the flow control part. > I am still in the very early stages of implementing this. So any pointers to > RDS documentation (or a RDS git tree) would be very helpful. I have a small > IB setup to test this on, so anyone willing to test it when I am done would > be helpful as well. The main RDS repo is the OFED tree. If you want to integrate with my work tree, let me know and I'll feed your patches into my tree at http://www.openfabrics.org/git/?p=~okir/ofed_1_3/linux-2.6.git Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From kliteyn at dev.mellanox.co.il Tue May 13 04:15:22 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 13 May 2008 14:15:22 +0300 Subject: [ofa-general] [PATCH] opensm/osm_state_mgr.c: fix segmentation fault Message-ID: <4829784A.6030708@dev.mellanox.co.il> Hi Sasha, Fixing trivial segmentation fault in state manager. Please apply to ofed_1_3 branch and to master. -- Yevgeny Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_state_mgr.c | 5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index 4b7235f..6f06a8d 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -736,9 +736,8 @@ static boolean_t __osm_state_mgr_is_sm_port_down(IN osm_state_mgr_t * if (!p_port) { osm_log(p_mgr->p_log, OSM_LOG_ERROR, "__osm_state_mgr_is_sm_port_down: ERR 3309: " - "SM port with GUID:%016" PRIx64 " (%s) is unknown\n", - cl_ntoh64(port_guid), - p_port->p_node ? p_port->p_node->print_desc : "UNKNOWN"); + "SM port with GUID:%016" PRIx64 " is unknown\n", + cl_ntoh64(port_guid)); state = IB_LINK_DOWN; CL_PLOCK_RELEASE(p_mgr->p_lock); goto Exit; -- 1.5.1.4 From moshek at voltaire.com Tue May 13 04:53:08 2008 From: moshek at voltaire.com (Moshe Kazir) Date: Tue, 13 May 2008 14:53:08 +0300 Subject: [ewg] RE: [ofa-general] OFED May 5 meeting summary In-Reply-To: <48282580.8040208@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> <39C75744D164D948A170E9792AF8E7CAC5AF56@exil.voltaire.com> <48282580.8040208@mellanox.co.il> Message-ID: <39C75744D164D948A170E9792AF8E7CAC5AF5F@exil.voltaire.com> Backport of ib-bonding to sles10 sp2 Beta3 is finished. The diff file was delivered to Moni Shoua . BUT !!! When I tried OFED-1.3.1 on sles10 sp2 rc3 I found that the kernel has changed and we have backport issues with ofa_kernel compilation..... I continue digging .... How many rc's are planned for sles 10 sp 2 ? Do we want to backport every RC ? or do we want to wait till the last RC before GA ?
Moshe

____________________________________________________________
Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m)
Voltaire - The Grid Backbone
www.voltaire.com

-----Original Message-----
From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il]
Sent: Monday, May 12, 2008 2:10 PM
To: Moshe Kazir
Cc: ewg at lists.openfabrics.org; Moni Shoua; Olga Shern; general at lists.openfabrics.org
Subject: Re: [ewg] RE: [ofa-general] OFED May 5 meeting summary

Moshe Kazir wrote:
>
> I have checked OFED-1.3.1-rc1 on SLES10 SP2 Beta3.
>
> ib-bonding compile failed. Everything else compiled OK.
>
> Attached: ib-bonding error log.
>
> I'll take the backport of ib-bonding to SLES10 SP2 on me (if needed,
> I'll get Moni's help).
>
> Thanks

Please update when done.
Any need for a change in the install script?

Tziporet

From tziporet at dev.mellanox.co.il Tue May 13 05:06:38 2008
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 13 May 2008 15:06:38 +0300
Subject: [ewg] RE: [ofa-general] OFED May 5 meeting summary
In-Reply-To: <39C75744D164D948A170E9792AF8E7CAC5AF5F@exil.voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C90282E694@mtlexch01.mtl.com> <39C75744D164D948A170E9792AF8E7CAC5AF56@exil.voltaire.com> <48282580.8040208@mellanox.co.il> <39C75744D164D948A170E9792AF8E7CAC5AF5F@exil.voltaire.com>
Message-ID: <4829844E.3090307@mellanox.co.il>

Moshe Kazir wrote:
> The backport of ib-bonding to SLES10 SP2 Beta3 is finished.
> The diff file was delivered to Moni Shoua.
>
> BUT !!!
>
> When I tried OFED-1.3.1 on SLES10 SP2 RC3 I found that the kernel has
> changed and we have backport issues with the ofa_kernel compilation.....
>
> I continue digging ....
>
> How many RCs are planned for SLES10 SP2?
>
> Do we want to backport every RC,
> or do we want to wait till the last RC before GA?
>
I think we should add support for the latest SLES10 SP2 only.
Meanwhile we can add backport patches for the latest available RC and replace them when a new RC is out.

Tziporet

From nickpiggin at yahoo.com.au Tue May 13 05:06:44 2008
From: nickpiggin at yahoo.com.au (Nick Piggin)
Date: Tue, 13 May 2008 22:06:44 +1000
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <20080508003838.GA9878@sgi.com>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com>
Message-ID: <200805132206.47655.nickpiggin@yahoo.com.au>

On Thursday 08 May 2008 10:38, Robin Holt wrote:
> On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote:
> > On Wed, 7 May 2008, Andrea Arcangeli wrote:
> > > I think the spinlock->rwsem conversion is ok under config option, as
> > > you can see I complained myself to various of those patches and I'll
> > > take care they're in a mergeable state the moment I submit them. What
> > > XPMEM requires are different semantics for the methods, and we never
> > > had to do any blocking I/O during vmtruncate before, now we have to.
> >
> > I really suspect we don't really have to, and that it would be better to
> > just fix the code that does that.
>
> That fix is going to be fairly difficult. I will argue impossible.
>
> First, a little background. SGI allows one large NUMA-link connected
> machine to be broken into separate single-system images which we call
> partitions.
>
> XPMEM allows, at its most extreme, one process on one partition to
> grant access to a portion of its virtual address range to processes on
> another partition. Those processes can then fault pages and directly
> share the memory.
> > In order to invalidate the remote page table entries, we need to message > (uses XPC) to the remote side. The remote side needs to acquire the > importing process's mmap_sem and call zap_page_range(). Between the > messaging and the acquiring a sleeping lock, I would argue this will > require sleeping locks in the path prior to the mmu_notifier invalidate_* > callouts(). Why do you need to take mmap_sem in order to shoot down pagetables of the process? It would be nice if this can just be done without sleeping. From nickpiggin at yahoo.com.au Tue May 13 05:14:24 2008 From: nickpiggin at yahoo.com.au (Nick Piggin) Date: Tue, 13 May 2008 22:14:24 +1000 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080508013459.GS8276@duo.random> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507234521.GN8276@duo.random> <20080508013459.GS8276@duo.random> Message-ID: <200805132214.27510.nickpiggin@yahoo.com.au> On Thursday 08 May 2008 11:34, Andrea Arcangeli wrote: > Sorry for not having completely answered to this. I initially thought > stop_machine could work when you mentioned it, but I don't think it > can even removing xpmem block-inside-mmu-notifier-method requirements. > > For stop_machine to solve this (besides being slower and potentially > not more safe as running stop_machine in a loop isn't nice), we'd need > to prevent preemption in between invalidate_range_start/end. > > I think there are two ways: > > 1) add global lock around mm_lock to remove the sorting > > 2) remove invalidate_range_start/end, nuke mm_lock as consequence of > it, and replace all three with invalidate_pages issued inside the > PT lock, one invalidation for each 512 pte_t modified, so > serialization against get_user_pages becomes trivial but this will > be not ok at all for SGI as it increases a lot their invalidation > frequency This is what I suggested to begin with before this crazy locking was developed to handle these corner cases... because I wanted the locking to match with the tried and tested Linux core mm/ locking rather than introducing this new idea. I don't see why you're bending over so far backwards to accommodate this GRU thing that we don't even have numbers for and could actually potentially be batched up in other ways (eg. using mmu_gather or mmu_gather-like idea). The bare essential, matches-with-Linux-mm mmu notifiers that I first saw of yours was pretty elegant and nice. The idea that "only one solution must go in and handle everything perfectly" is stupid because it is quite obvious that the sleeping invalidate idea is just an order of magnitude or two more complex than the simple atomic invalidates needed by you. We should and could easily have had that code upstream long ago :( I'm not saying we ignore the sleeping or batching cases, but we should introduce the ideas slowly and carefully and assess the pros and cons of each step along the way. > > For KVM both ways are almost the same. > > I'll implement 1 now then we'll see... > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo at kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email at kvack.org

From ogerlitz at voltaire.com Tue May 13 07:11:15 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 13 May 2008 17:11:15 +0300 (IDT)
Subject: [ofa-general] [RFC PATCH 1/4] net/bonding: announce fail-over for the active-backup mode
Message-ID:

Enhance bonding to announce fail-over for the active-backup mode through the netdev events notifier chain mechanism. Such an event can be of use for the RDMA CM (communication manager) to let native RDMA ULPs (eg NFS-RDMA, iSER) always use the same links as the IP stack does.

Signed-off-by: Or Gerlitz

I am sending the patch along with the series before its review in netdev since it's needed later in patch #4, and I see some issues while testing it with 2.6.26-rc2 which I am pretty sure are not directly related to my work, so I'd like to do some more testing before handing it to the bonding maintainer.

Index: linux-2.6.26-rc2/drivers/net/bonding/bond_main.c
===================================================================
--- linux-2.6.26-rc2.orig/drivers/net/bonding/bond_main.c	2008-05-13 10:02:22.000000000 +0300
+++ linux-2.6.26-rc2/drivers/net/bonding/bond_main.c	2008-05-13 16:34:28.000000000 +0300
@@ -1117,6 +1117,7 @@ void bond_change_active_slave(struct bon
 			bond->send_grat_arp = 1;
 		} else
 			bond_send_gratuitous_arp(bond);
+		netdev_bonding_change(bond->dev);
 	}
 }

Index: linux-2.6.26-rc2/include/linux/notifier.h
===================================================================
--- linux-2.6.26-rc2.orig/include/linux/notifier.h	2008-05-13 10:02:30.000000000 +0300
+++ linux-2.6.26-rc2/include/linux/notifier.h	2008-05-13 11:50:44.000000000 +0300
@@ -197,6 +197,7 @@ static inline int notifier_to_errno(int
 #define NETDEV_GOING_DOWN	0x0009
 #define NETDEV_CHANGENAME	0x000A
 #define NETDEV_FEAT_CHANGE	0x000B
+#define NETDEV_BONDING_FAILOVER	0x000C

 #define SYS_DOWN	0x0001	/* Notify of system down */
 #define SYS_RESTART	SYS_DOWN

Index: linux-2.6.26-rc2/include/linux/netdevice.h
===================================================================
--- linux-2.6.26-rc2.orig/include/linux/netdevice.h	2008-05-13 10:02:30.000000000 +0300
+++ linux-2.6.26-rc2/include/linux/netdevice.h	2008-05-13 11:50:20.000000000 +0300
@@ -1459,6 +1459,7 @@ extern void __dev_addr_unsync(struct de
 extern void dev_set_promiscuity(struct net_device *dev, int inc);
 extern void dev_set_allmulti(struct net_device *dev, int inc);
 extern void netdev_state_change(struct net_device *dev);
+extern void netdev_bonding_change(struct net_device *dev);
 extern void netdev_features_change(struct net_device *dev);
 /* Load a device via the kmod */
 extern void dev_load(struct net *net, const char *name);

Index: linux-2.6.26-rc2/net/core/dev.c
===================================================================
--- linux-2.6.26-rc2.orig/net/core/dev.c	2008-05-13 10:02:31.000000000 +0300
+++ linux-2.6.26-rc2/net/core/dev.c	2008-05-13 11:50:49.000000000 +0300
@@ -956,6 +956,12 @@ void netdev_state_change(struct net_devi
 	}
 }

+void netdev_bonding_change(struct net_device *dev)
+{
+	call_netdevice_notifiers(NETDEV_BONDING_FAILOVER, dev);
+}
+EXPORT_SYMBOL(netdev_bonding_change);
+
 /**
  *	dev_load - load a network module
  *	@net: the applicable net namespace

From ogerlitz at voltaire.com Tue May 13 07:12:16 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 13 May 2008 17:12:16 +0300 (IDT)
Subject: [ofa-general] [RFC PATCH 2/4] rdma/addr: keep the name of the netdevice in struct rdma_dev_addr
In-Reply-To:
References:
Message-ID:

Also keep the local (src) device
name in struct rdma_dev_addr. Under a bonding HA scheme this can be used by the rdma-cm to align RDMA sessions to use the same links as the IP stack does after a bonding fail-over has happened.

Signed-off-by: Or Gerlitz

Index: linux-2.6.26-rc2/drivers/infiniband/core/addr.c
===================================================================
--- linux-2.6.26-rc2.orig/drivers/infiniband/core/addr.c	2008-05-13 16:31:07.000000000 +0300
+++ linux-2.6.26-rc2/drivers/infiniband/core/addr.c	2008-05-13 16:45:01.000000000 +0300
@@ -100,6 +100,7 @@ int rdma_copy_addr(struct rdma_dev_addr
 	memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN);
 	if (dst_dev_addr)
 		memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN);
+	memcpy(dev_addr->src_netdev_name, dev->name, IFNAMSIZ);
 	return 0;
 }
 EXPORT_SYMBOL(rdma_copy_addr);

Index: linux-2.6.26-rc2/drivers/infiniband/core/cma.c
===================================================================
--- linux-2.6.26-rc2.orig/drivers/infiniband/core/cma.c	2008-05-13 16:31:07.000000000 +0300
+++ linux-2.6.26-rc2/drivers/infiniband/core/cma.c	2008-05-13 16:45:01.000000000 +0300
@@ -998,6 +998,7 @@ static struct rdma_id_private *cma_new_c
 	union cma_ip_addr *src, *dst;
 	__be16 port;
 	u8 ip_ver;
+	int ret;

 	if (cma_get_net_info(ib_event->private_data, listen_id->ps,
 			     &ip_ver, &port, &src, &dst))
@@ -1022,10 +1023,11 @@ static struct rdma_id_private *cma_new_c
 	if (rt->num_paths == 2)
 		rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path;

-	ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid);
 	ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid);
-	ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey));
-	rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA;
+	ret = rdma_translate_ip(&id->route.addr.src_addr,
+				&id->route.addr.dev_addr);
+	if (ret)
+		goto destroy_id;

 	id_priv = container_of(id, struct rdma_id_private, id);
 	id_priv->state = CMA_CONNECT;

Index: linux-2.6.26-rc2/include/rdma/ib_addr.h
===================================================================
--- linux-2.6.26-rc2.orig/include/rdma/ib_addr.h	2008-05-13 16:31:07.000000000 +0300
+++ linux-2.6.26-rc2/include/rdma/ib_addr.h	2008-05-13 16:45:01.000000000 +0300
@@ -57,6 +57,7 @@ struct rdma_dev_addr {
 	unsigned char dst_dev_addr[MAX_ADDR_LEN];
 	unsigned char broadcast[MAX_ADDR_LEN];
 	enum rdma_node_type dev_type;
+	char src_netdev_name[IFNAMSIZ];
 };

 /**

From ogerlitz at voltaire.com Tue May 13 07:13:14 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 13 May 2008 17:13:14 +0300 (IDT)
Subject: [ofa-general] [RFC PATCH 3/4] rdma/cma: add high availability mode attribute to IDs
In-Reply-To:
References:
Message-ID:

The RDMA_ALIGN_WITH_NETDEVICE high availability (ha) mode means that the consumer of the rdma-cm wants RDMA sessions to always use the same links as the IP stack does. In the current code, this does not happen when bonding has failed over but the IB link used by an already existing session is operating fine. For now this mode is supported only for the connected services of the rdma-cm. More ha modes can be added in the future.

Signed-off-by: Or Gerlitz
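To illustrate the intended usage, here is a minimal hypothetical sketch of a consumer -- the ULP structure, handler, and timeout constant are made up for the example and are not part of this patch:

/* hypothetical ULP sketch: keep this connection aligned with the
 * netdevice that the IP stack is using */
static int ulp_create_aligned_id(struct ulp_conn *conn)
{
	int ret;

	conn->cm_id = rdma_create_id(ulp_cm_handler, conn, RDMA_PS_TCP);
	if (IS_ERR(conn->cm_id))
		return PTR_ERR(conn->cm_id);

	/* UD port spaces are rejected with -ENOTSUPP */
	ret = rdma_set_high_availability_mode(conn->cm_id,
					      RDMA_ALIGN_WITH_NETDEVICE);
	if (ret) {
		rdma_destroy_id(conn->cm_id);
		return ret;
	}

	return rdma_resolve_addr(conn->cm_id, NULL,
				 (struct sockaddr *)&conn->dst_addr,
				 ULP_RESOLVE_TIMEOUT_MS);
}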
Index: linux-2.6.26-rc2/drivers/infiniband/core/addr.c
===================================================================
Index: linux-2.6.26-rc2/include/rdma/rdma_cm.h
===================================================================
--- linux-2.6.26-rc2.orig/include/rdma/rdma_cm.h	2008-04-17 05:49:44.000000000 +0300
+++ linux-2.6.26-rc2/include/rdma/rdma_cm.h	2008-05-13 13:52:53.000000000 +0300
@@ -328,4 +328,10 @@ void rdma_leave_multicast(struct rdma_cm
  */
 void rdma_set_service_type(struct rdma_cm_id *id, int tos);

+enum rdma_ha_mode {
+	RDMA_ALIGN_WITH_NETDEVICE = 1
+};
+
+int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode mode);
+
 #endif /* RDMA_CM_H */

Index: linux-2.6.26-rc2/drivers/infiniband/core/cma.c
===================================================================
--- linux-2.6.26-rc2.orig/drivers/infiniband/core/cma.c	2008-05-13 11:57:02.000000000 +0300
+++ linux-2.6.26-rc2/drivers/infiniband/core/cma.c	2008-05-13 14:57:12.000000000 +0300
@@ -143,6 +143,7 @@ struct rdma_id_private {
 	u32 qp_num;
 	u8 srq;
 	u8 tos;
+	enum rdma_ha_mode ha_mode;
 };

 struct cma_multicast {
@@ -1523,6 +1524,19 @@ void rdma_set_service_type(struct rdma_c
 }
 EXPORT_SYMBOL(rdma_set_service_type);

+int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode mode)
+{
+	struct rdma_id_private *id_priv;
+
+	if ((mode == RDMA_ALIGN_WITH_NETDEVICE) && cma_is_ud_ps(id->ps))
+		return -ENOTSUPP;
+
+	id_priv = container_of(id, struct rdma_id_private, id);
+	id_priv->ha_mode = mode;
+	return 0;
+}
+EXPORT_SYMBOL(rdma_set_high_availability_mode);
+
 static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec,
 			      void *context)
 {

From ogerlitz at voltaire.com Tue May 13 07:13:58 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 13 May 2008 17:13:58 +0300 (IDT)
Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode
In-Reply-To:
References:
Message-ID:

The RDMA_ALIGN_WITH_NETDEVICE high availability (ha) mode means that the consumer of the rdma-cm wants RDMA sessions to always use the same links as the IP stack does. In the current code, this does not happen when bonding has failed over but the IB link used by an already existing session is operating fine.

Use a netdev event notification for sensing that a change has happened in the IP stack, then scan the rdma-cm IDs list to see if there is an ID that is "misaligned" in that respect with the IP stack, and disconnect it, in case this is what the user asked for when setting an ha mode for the ID.

Signed-off-by: Or Gerlitz
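On the ULP side the effect of this patch is an ordinary disconnect; a hypothetical event-handler fragment (the reconnect helper is assumed, not part of this series) would react roughly like:

static int ulp_cm_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
{
	struct ulp_conn *conn = id->context;

	switch (event->event) {
	case RDMA_CM_EVENT_DISCONNECTED:
		/*
		 * With RDMA_ALIGN_WITH_NETDEVICE set, this disconnect may
		 * be the rdma-cm reacting to a bonding fail-over.  Tear
		 * down and resolve the address again; the new connection
		 * then follows the currently active slave.
		 */
		ulp_schedule_reconnect(conn);	/* assumed helper */
		break;
	default:
		break;
	}
	return 0;
}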
Index: linux-2.6.26-rc2/drivers/infiniband/core/cma.c
===================================================================
--- linux-2.6.26-rc2.orig/drivers/infiniband/core/cma.c	2008-05-13 16:57:47.000000000 +0300
+++ linux-2.6.26-rc2/drivers/infiniband/core/cma.c	2008-05-13 16:58:55.000000000 +0300
@@ -144,6 +144,7 @@ struct rdma_id_private {
 	u8 srq;
 	u8 tos;
 	enum rdma_ha_mode ha_mode;
+	struct work_struct ha_work;
 };

 struct cma_multicast {
@@ -268,6 +269,14 @@ static inline int cma_is_ud_ps(enum rdma
 	return (ps == RDMA_PS_UDP || ps == RDMA_PS_IPOIB);
 }

+static void cma_ha_work_handler(struct work_struct *work)
+{
+	struct rdma_id_private *id_priv;
+
+	id_priv = container_of(work, struct rdma_id_private, ha_work);
+	rdma_disconnect(&id_priv->id);
+}
+
 static void cma_attach_to_dev(struct rdma_id_private *id_priv,
 			      struct cma_device *cma_dev)
 {
@@ -401,7 +410,8 @@ struct rdma_cm_id *rdma_create_id(rdma_c
 	INIT_LIST_HEAD(&id_priv->listen_list);
 	INIT_LIST_HEAD(&id_priv->mc_list);
 	get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
-
+	INIT_WORK(&id_priv->ha_work, cma_ha_work_handler);
+
 	return &id_priv->id;
 }
 EXPORT_SYMBOL(rdma_create_id);
@@ -2743,6 +2753,38 @@ void rdma_leave_multicast(struct rdma_cm
 }
 EXPORT_SYMBOL(rdma_leave_multicast);

+static int cma_netdev_callback(struct notifier_block *self, unsigned long event,
+			       void *ctx)
+{
+	struct net_device *ndev = (struct net_device *)ctx;
+	struct cma_device *cma_dev;
+	struct rdma_id_private *id_priv;
+	struct rdma_dev_addr *dev_addr;
+
+	if (dev_net(ndev) != &init_net)
+		return NOTIFY_DONE;
+
+	if (event != NETDEV_BONDING_FAILOVER)
+		return NOTIFY_DONE;
+
+	if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING))
+		return NOTIFY_DONE;
+
+	list_for_each_entry(cma_dev, &dev_list, list)
+		list_for_each_entry(id_priv, &cma_dev->id_list, list) {
+			dev_addr = &id_priv->id.route.addr.dev_addr;
+			if (!memcmp(dev_addr->src_netdev_name, ndev->name, IFNAMSIZ) &&
+			    memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len))
+				if (id_priv->ha_mode == RDMA_ALIGN_WITH_NETDEVICE)
+					schedule_work(&id_priv->ha_work);
+		}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block cma_nb = {
+	.notifier_call = cma_netdev_callback
+};
+
 static void cma_add_one(struct ib_device *device)
 {
 	struct cma_device *cma_dev;
@@ -2847,6 +2889,7 @@ static int cma_init(void)

 	ib_sa_register_client(&sa_client);
 	rdma_addr_register_client(&addr_client);
+	register_netdevice_notifier(&cma_nb);

 	ret = ib_register_client(&cma_client);
 	if (ret)
@@ -2854,6 +2897,7 @@ static int cma_init(void)
 	return 0;
 err:
+	unregister_netdevice_notifier(&cma_nb);
 	rdma_addr_unregister_client(&addr_client);
 	ib_sa_unregister_client(&sa_client);
 	destroy_workqueue(cma_wq);
@@ -2863,6 +2907,7 @@ err:
 static void cma_cleanup(void)
 {
 	ib_unregister_client(&cma_client);
+	unregister_netdevice_notifier(&cma_nb);
 	rdma_addr_unregister_client(&addr_client);
 	ib_sa_unregister_client(&sa_client);
 	destroy_workqueue(cma_wq);

From rdreier at cisco.com Tue May 13 07:27:46 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 07:27:46 -0700
Subject: [ofa-general] [RFC PATCH 3/4] rdma/cma: add high availability mode attribute to IDs
In-Reply-To: (Or Gerlitz's message of "Tue, 13 May 2008 17:13:14 +0300 (IDT)")
References:
Message-ID:

 > +enum rdma_ha_mode {
 > +	RDMA_ALIGN_WITH_NETDEVICE = 1
 > +};

 > +int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode mode)

this seems like overengineering to me...
given there are no other modes, you are adding an elaborate NOP. (Nothing looks at ha_mode) Do you have plans for other modes? > u8 srq; > u8 tos; > + enum rdma_ha_mode ha_mode; Side note -- you're wasting two bytes here because of alignment. From rdreier at cisco.com Tue May 13 07:32:10 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 May 2008 07:32:10 -0700 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: (Or Gerlitz's message of "Tue, 13 May 2008 17:13:58 +0300 (IDT)") References: Message-ID: > Use netevent notification for sensing that a change has happened in the IP stack, > then scan the rdma-cm IDs list to see if there is an ID that is "misaligned" > in that respect with the IP stack, and disconnect it, in case this is what the > user asked to when setting an ha mode for the ID. this seems like a strange "HA" feature -- to disconnect connections that otherwise would continue operating. What is the use case/use scenario? > + list_for_each_entry(cma_dev, &dev_list, list) > + list_for_each_entry(id_priv, &cma_dev->id_list, list) { > + dev_addr = &id_priv->id.route.addr.dev_addr; > + if (!memcmp(dev_addr->src_netdev_name, ndev->name, IFNAMSIZ) && > + memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) > + if (id_priv->ha_mode == RDMA_ALIGN_WITH_NETDEVICE) > + schedule_work(&id_priv->ha_work); > + } This looks horribly racy/incorrect against RDMA device removal, CMA ID destruction and netdev renaming. - R. From weiny2 at llnl.gov Tue May 13 08:05:36 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 13 May 2008 08:05:36 -0700 Subject: [ofa-general] OpenSM SA dump ? In-Reply-To: <829ded920805122133j76f483et8280197f216721c6@mail.gmail.com> References: <829ded920805122128q7fd0956fkcf2ca86635b4673c@mail.gmail.com> <829ded920805122133j76f483et8280197f216721c6@mail.gmail.com> Message-ID: <20080513080536.079b56a3.weiny2@llnl.gov> On Tue, 13 May 2008 10:03:53 +0530 "Keshetti Mahesh" wrote: > Thanks Hal for the reply. > > > Only multicast, services, and informs are dumped. These are the so > > called client registrations. > > > > What SA information are you looking for ? > > I expected 'opensm-sa.dump' file to contain all the configured paths > between the hosts. saquery can provide the PathRecords anytime. Ira > > Is there any way to dump the local SA cache to a file with current > OFED ? > > -Mahesh > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From holt at sgi.com Tue May 13 08:32:38 2008 From: holt at sgi.com (Robin Holt) Date: Tue, 13 May 2008 10:32:38 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <200805132206.47655.nickpiggin@yahoo.com.au> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> Message-ID: <20080513153238.GL19717@sgi.com> On Tue, May 13, 2008 at 10:06:44PM +1000, Nick Piggin wrote: > On Thursday 08 May 2008 10:38, Robin Holt wrote: > > In order to invalidate the remote page table entries, we need to message > > (uses XPC) to the remote side. The remote side needs to acquire the > > importing process's mmap_sem and call zap_page_range(). 
Between the
> > messaging and the acquiring a sleeping lock, I would argue this will
> > require sleeping locks in the path prior to the mmu_notifier invalidate_*
> > callouts().
>
> Why do you need to take mmap_sem in order to shoot down pagetables of
> the process? It would be nice if this can just be done without
> sleeping.

We are trying to shoot down page tables of a different process running on a different instance of Linux running on NUMA-link connected portions of the same machine. The messaging is clearly going to require sleeping. Are you suggesting we need to rework XPC communications to not require sleeping? I think that is going to be impossible since the transfer engine requires a sleeping context.

Additionally, the call to zap_page_range expects to have the mmap_sem held. I suppose we could use something other than zap_page_range and atomically clear the process page tables. Doing that will not alleviate the need to sleep for the messaging to the other partitions.

Thanks,
Robin

From rdreier at cisco.com Tue May 13 09:17:45 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 09:17:45 -0700
Subject: [ofa-general] bitops take an unsigned long *
In-Reply-To: <1210634005.3949.26.camel@brick.pathscale.com> (Ralph Campbell's message of "Mon, 12 May 2008 16:13:25 -0700")
References: <20080508222916.277649ca.akpm@linux-foundation.org> <1210634005.3949.26.camel@brick.pathscale.com>
Message-ID:

> This change looks fine to me.
>
> ipath_sdma_status doesn't depend on hardware so changing
> #define IPATH_SDMA_RUNNING 62
> #define IPATH_SDMA_SHUTDOWN 63
> to different values is fine.

Great, I guess I will change them to 30 and 31 so the values always work even if unsigned long is 32 bits. Out of curiosity, was there any reason for choosing 0, 1, 2, 3 and then skipping to 62?

> Roland, do you want me to send a patch for this?

I can handle it I think... I'll merge a patch that changes the two declarations to unsigned long (as I sent out before) and also changes RUNNING and SHUTDOWN to 30 and 31.

- R.

From swise at opengridcomputing.com Tue May 13 09:47:05 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 13 May 2008 11:47:05 -0500
Subject: [ofa-general] RDS flow control
In-Reply-To: <200805130808.10510.okir@lst.de>
References: <200805121157.38135.jon@opengridcomputing.com> <200805130808.10510.okir@lst.de>
Message-ID: <4829C609.5000205@opengridcomputing.com>

Olaf Kirch wrote:
> On Monday 12 May 2008 18:57:38 Jon Mason wrote:
>
>> As part of my effort to get RDS working for iWARP, I will be working on the
>> RDS flow control. Flow control is needed for iWARP due to the fact that
>> iWARP connections terminate if there is no posted recv for an incoming
>> packet. IB connections do not have this limitation if set up in a certain
>> way. In its current implementation, RDS sets the connection attribute
>> rnr_retry to 7. This causes IB to retransmit until there is a posted recv
>> buffer.
>
> I think for the initial implementation, it is fine for iWARP to just
> fail the connect when that happens, and re-establish the connection.
>
> If you use reasonable defaults for the send and recv queues, receiver
> overruns should be relatively rare.
>
> Once everything else works, let's revisit the flow control part.

I _think_ you'll hit this quickly with one-way flows. Send completions for iWARP only mean the user's buffer can be reused, not that it's placed at the remote peer or in the remote user's buffer.

But perhaps I'm wrong.
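For reference, the knob in question is set at connect time; a minimal sketch, assuming the usual rdma_cm conn_param convention (the wrapper function itself is hypothetical):

static int rds_style_connect(struct rdma_cm_id *cm_id)
{
	struct rdma_conn_param conn_param;

	memset(&conn_param, 0, sizeof conn_param);
	conn_param.responder_resources = 1;
	conn_param.initiator_depth = 1;
	conn_param.retry_count = 7;
	/* 7 == retry indefinitely on RNR NAK; 0 would make a send fail
	 * as soon as the peer has no recv buffer posted */
	conn_param.rnr_retry_count = 7;

	return rdma_connect(cm_id, &conn_param);
}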
Jon, maybe you should try to hit this with IB and rnr_retry == 0 using the rds perf tools?

Also, "the everything else" part depends on removing the fmr usage. I'm working on the new RDMA memory verbs allowing fast registration of physical memory via a send WR. To support iWARP we need to remove the fmr usage from RDS. The idea was to replace fmrs with the new fastreg verbs. Thoughts?

Stay tuned for the new verbs API RFC...

Steve.

From richard.frank at oracle.com Tue May 13 10:03:21 2008
From: richard.frank at oracle.com (Richard Frank)
Date: Tue, 13 May 2008 13:03:21 -0400
Subject: [ofa-general] RDS flow control
In-Reply-To: <4829C609.5000205@opengridcomputing.com>
References: <200805121157.38135.jon@opengridcomputing.com> <200805130808.10510.okir@lst.de> <4829C609.5000205@opengridcomputing.com>
Message-ID: <4829C9D9.6050409@oracle.com>

Steve Wise wrote:
> Olaf Kirch wrote:
>> On Monday 12 May 2008 18:57:38 Jon Mason wrote:
>>
>>> As part of my effort to get RDS working for iWARP, I will be working
>>> on the RDS flow control. Flow control is needed for iWARP due to
>>> the fact that iWARP connections terminate if there is no posted recv
>>> for an incoming packet. IB connections do not have this limitation
>>> if set up in a certain way. In its current implementation, RDS sets
>>> the connection attribute rnr_retry to 7. This causes IB to
>>> retransmit until there is a posted recv buffer.
>>
>> I think for the initial implementation, it is fine for iWARP to just
>> fail the connect when that happens, and re-establish the connection.
>>
>> If you use reasonable defaults for the send and recv queues, receiver
>> overruns should be relatively rare.
>>
>> Once everything else works, let's revisit the flow control part.
>>
> I _think_ you'll hit this quickly with one-way flows. Send
> completions for iWARP only mean the user's buffer can be reused, not
> that it's placed at the remote peer or in the remote user's buffer.
>
Let's see what happens - anyway, this could be solved in an iWARP extension to RDS, right?

> But perhaps I'm wrong. Jon, maybe you should try to hit this with IB
> and rnr_retry == 0 using the rds perf tools?

> Also, "the everything else" part depends on removing the fmr usage. I'm
> working on the new RDMA memory verbs allowing fast registration of
> physical memory via a send WR. To support iWARP we need to remove the
> fmr usage from RDS. The idea was to replace fmrs with the new
> fastreg verbs. Thoughts?
>
What does "fast" imply here - how does this compare to the performance of FMRs?

Why not push memory window creation into the RDS transport specific implementations?

Changing the API may be OK - if we retain the performance we have with IB.

> Stay tuned for the new verbs API RFC...
>
> Steve.
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Tue May 13 10:05:12 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 13 May 2008 10:05:12 -0700 Subject: [ofa-general] RE: [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: References: Message-ID: <000001c8b51b$81cd77d0$865b180a@amr.corp.intel.com> >+static void cma_ha_work_handler(struct work_struct *work) >+{ >+ struct rdma_id_private *id_priv; >+ >+ id_priv = container_of(work, struct rdma_id_private, ha_work); >+ rdma_disconnect(&id_priv->id); >+} This will race with other user calls. I've found it fairly difficult for the rdma_cm to call back into its own API and avoid racing with the user trying to destroy the cm_id. None of the APIs are coded to allow calling them simultaneously with destroy. A better solution for this may be for the rdma_cm to simply notify the user that the IP mapping for their RDMA device has changed. The user can then disconnect, with the appropriate synchronization, if they want their RDMA connection to follow the IP address. (If I understood correctly, the reason for this is to allow failing back to a repaired port.) >+static int cma_netdev_callback(struct notifier_block *self, unsigned long >event, >+ void *ctx) >+{ >+ struct net_device *ndev = (struct net_device *)ctx; >+ struct cma_device *cma_dev; >+ struct rdma_id_private *id_priv; >+ struct rdma_dev_addr *dev_addr; >+ >+ if (dev_net(ndev) != &init_net) >+ return NOTIFY_DONE; >+ >+ if (event != NETDEV_BONDING_FAILOVER) >+ return NOTIFY_DONE; >+ >+ if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) >+ return NOTIFY_DONE; >+ >+ list_for_each_entry(cma_dev, &dev_list, list) >+ list_for_each_entry(id_priv, &cma_dev->id_list, list) { >+ dev_addr = &id_priv->id.route.addr.dev_addr; >+ if (!memcmp(dev_addr->src_netdev_name, ndev->name, IFNAMSIZ) >&& >+ memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev- >>addr_len)) >+ if (id_priv->ha_mode == >RDMA_ALIGN_WITH_NETDEVICE) >+ schedule_work(&id_priv->ha_work); >+ } >+ return NOTIFY_DONE; >+} As Roland mentioned, this is racy in the areas he pointed out. This will take some thought to handle correctly. - Sean From rdreier at cisco.com Tue May 13 10:41:39 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 May 2008 10:41:39 -0700 Subject: [ofa-general] [PATCH] IB/core: handle race between elements in qork queues after event In-Reply-To: <4827FBDF.9040308@Voltaire.COM> (Moni Shoua's message of "Mon, 12 May 2008 11:12:15 +0300") References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM> <4820638E.4030901@Voltaire.COM> <4827FBDF.9040308@Voltaire.COM> Message-ID: > Can we please go on with this patch? We would like to see it in the next kernel. I still don't get why this is important to you. Is there a concrete example of a situation where this actually makes a measurable difference? We need some justification for adding this locking complexity beyond "it doesn't hurt." (And also of course we need it fixed so there aren't races) - R. 
From swise at opengridcomputing.com Tue May 13 10:58:11 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 13 May 2008 12:58:11 -0500 Subject: [ofa-general] RDS flow control In-Reply-To: <4829C9D9.6050409@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805130808.10510.okir@lst.de> <4829C609.5000205@opengridcomputing.com> <4829C9D9.6050409@oracle.com> Message-ID: <4829D6B3.5080900@opengridcomputing.com> Richard Frank wrote: > Steve Wise wrote: >> Olaf Kirch wrote: >>> On Monday 12 May 2008 18:57:38 Jon Mason wrote: >>> >>>> As part of my effort to get RDS working for iWARP, I will be >>>> working on the RDS flow control. Flow control is needed for iWARP >>>> due to the fact that iWARP connections terminate if there is no >>>> posted recv for an incoming packet. IB connections do not have >>>> this limitation if setup in a certain way. In its current >>>> implementation, RDS sets the connection attribute rnr_retry to 7. >>>> This causes IB to retransmit until there is a posted recv buffer. >>> >>> I think for the initial implementation, it is fine for iWARP to just >>> fail the connect when that happens, and re-establish the connection. >>> >>> If you use reasonable defaults for the send and recv queues, receiver >>> overruns should be relatively rare. >>> >>> Once everything else works, let's revisit the flow control part. >>> >>> >> I _think_ you'll hit this quickly with one-way flows. Send >> completions for iWARP only mean the user's buffer can be reused. Not >> that its placed at the remote peer or in the remote user's buffer. >> > Let's see what happens - anyway - this could be solved in an IWARP > extension to RDS - right ? Yes, by adding flow control. And it could be iwarp-specific if you want. I would not suggest relying on connection termination and re-establishment as the way to handle this :). >> But perhaps I'm wrong. Jon, maybe you should try to hit this with IB >> and rnr_retry == 0 using the rds perf tools? >> Also "the everything else" part depends on remove fmr usage. I'm >> working on the new RDMA memory verbs allowing fast registration of >> physical memory via a send WR. To support iWARP we need to remove >> the fmr usage from RDS. The idea was to replace fmrs with the new >> fastreg verbs. Thoughts? >> > What does "fast" imply here - how does this compare to the performance > of FMRs ? Don't know yet, but probably as fast. > > Why would not push memory window creation into the RDS transport > specific implementations ? Isn't it already transport-specific? IE you don't need FMRs for TCP. (I'm ignorant on the specifics of the implementation at this point, so please excuse any dumb statements :) > > Changing the API may be OK - if we retain the performance we have with > IB. I assume nothing would fly that regresses IB performance. Worst case, you have an iwarp-specific RDS transport like you do for TCP, I guess. Hopefully though, IB + iWARP will be a common transport. > >> Stay tuned for the new verbs API RFC... >> >> Steve. 
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general

From okir at lst.de Tue May 13 11:04:00 2008
From: okir at lst.de (Olaf Kirch)
Date: Tue, 13 May 2008 20:04:00 +0200
Subject: [ofa-general] RDS flow control
In-Reply-To: <4829D6B3.5080900@opengridcomputing.com>
References: <200805121157.38135.jon@opengridcomputing.com> <4829C9D9.6050409@oracle.com> <4829D6B3.5080900@opengridcomputing.com>
Message-ID: <200805132004.01371.okir@lst.de>

On Tuesday 13 May 2008 19:58:11 Steve Wise wrote:
> Yes, by adding flow control. And it could be iwarp-specific if you
> want. I would not suggest relying on connection termination and
> re-establishment as the way to handle this :).

No, not in the long term. But let's hold off on the flow control stuff for a little - I would first like to finish my patch set and hand it out for you folks to bang on it, rather than the other way round. Okay with you guys?

> I assume nothing would fly that regresses IB performance. Worst case,
> you have an iwarp-specific RDS transport like you do for TCP, I guess.
> Hopefully though, IB + iWARP will be a common transport.

If it turns out that way, fine. If iWARP ends up sharing 80% of the code with IB except the RDMA specific functions, I think that's very much acceptable, too.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

From swise at opengridcomputing.com Tue May 13 11:08:46 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 13 May 2008 13:08:46 -0500
Subject: [ofa-general] RDS flow control
In-Reply-To: <200805132004.01371.okir@lst.de>
References: <200805121157.38135.jon@opengridcomputing.com> <4829C9D9.6050409@oracle.com> <4829D6B3.5080900@opengridcomputing.com> <200805132004.01371.okir@lst.de>
Message-ID: <4829D92E.2070504@opengridcomputing.com>

Olaf Kirch wrote:
> On Tuesday 13 May 2008 19:58:11 Steve Wise wrote:
>
>> Yes, by adding flow control. And it could be iwarp-specific if you
>> want. I would not suggest relying on connection termination and
>> re-establishment as the way to handle this :).
>>
>
> No, not in the long term. But let's hold off on the flow control stuff
> for a little - I would first like to finish my patch set and hand it
> out for you folks to bang on it, rather than the other way round.
> Okay with you guys?
>

What patch set? We can't run on chelsio's rnic with fmrs...

>
>> I assume nothing would fly that regresses IB performance. Worst case,
>> you have an iwarp-specific RDS transport like you do for TCP, I guess.
>> Hopefully though, IB + iWARP will be a common transport.
>>
>
> If it turns out that way, fine. If iWARP ends up sharing 80% of the
> code with IB except the RDMA specific functions, I think that's
> very much acceptable, too.
>
> Olaf
>

From okir at lst.de Tue May 13 11:24:11 2008
From: okir at lst.de (Olaf Kirch)
Date: Tue, 13 May 2008 20:24:11 +0200
Subject: [ofa-general] RDS flow control
In-Reply-To: <4829D92E.2070504@opengridcomputing.com>
References: <200805121157.38135.jon@opengridcomputing.com> <200805132004.01371.okir@lst.de> <4829D92E.2070504@opengridcomputing.com>
Message-ID: <200805132024.12741.okir@lst.de>

On Tuesday 13 May 2008 20:08:46 Steve Wise wrote:
> > No, not in the long term.
> > But let's hold off on the flow control stuff
> > for a little - I would first like to finish my patch set and hand it
> > out for you folks to bang on it, rather than the other way round.
> > Okay with you guys?
>
> What patch set?

I mentioned in a previous mail to Jon that I have some partial patches that implement flow control. I want to get that code out to you ASAP; I think that's easier than having two different approaches that need to be reconciled afterwards.

> We can't run on chelsio's rnic with fmrs...

Yes, that is understood.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

From rdreier at cisco.com Tue May 13 11:46:06 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 11:46:06 -0700
Subject: [ofa-general] [PATCH 3/3] IB/ipath - fix RDMA read response sequence checking
In-Reply-To: <20080508185528.8547.31626.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Thu, 08 May 2008 11:55:28 -0700")
References: <20080508185512.8547.29637.stgit@eng-46.mv.qlogic.com> <20080508185528.8547.31626.stgit@eng-46.mv.qlogic.com>
Message-ID:

OK, applied all 3.

From rdreier at cisco.com Tue May 13 11:46:13 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 11:46:13 -0700
Subject: [ofa-general] Re: [PATCH][INFINIBAND]: Make ipath_portdata work with struct pid * not pid_t.
In-Reply-To: <482857AE.2030904@openvz.org> (Pavel Emelyanov's message of "Mon, 12 May 2008 18:43:58 +0400")
References: <482857AE.2030904@openvz.org>
Message-ID:

thanks, applied

From rdreier at cisco.com Tue May 13 11:51:58 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 11:51:58 -0700
Subject: [ofa-general] bitops take an unsigned long *
In-Reply-To: <1210634005.3949.26.camel@brick.pathscale.com> (Ralph Campbell's message of "Mon, 12 May 2008 16:13:25 -0700")
References: <20080508222916.277649ca.akpm@linux-foundation.org> <1210634005.3949.26.camel@brick.pathscale.com>
Message-ID:

OK, I added the below to my tree for my next pull request:

commit f018c7e177a50390f6fcb137f1a28a6027d8ba50
Author: Roland Dreier
Date:   Tue May 13 11:51:23 2008 -0700

    IB/ipath: Change ipath_devdata.ipath_sdma_status to be unsigned long

    Andrew Morton pointed out that bitops should take an unsigned long *
    arg. However, the ipath driver was doing bitops on struct
    ipath_devdata.ipath_sdma_status, which is u64. Change this member to
    unsigned long to avoid tons of warnings when x86 fixes the bitops to
    take unsigned long * instead of void *.

    Also, change the IPATH_SDMA_RUNNING and IPATH_SDMA_SHUTDOWN bit
    numbers to 30 and 31 (instead of 62 and 63) so that we're not setting
    another booby trap for someone who tries to make ipath work on a
    32-bit architecture.

    Signed-off-by: Roland Dreier
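To make the 32-bit trap concrete, a small illustrative sketch (not part of the commit; this is just what the generic bitops arithmetic does):

	u64 status = 0;		/* the old ipath_sdma_status type */

	set_bit(62, (unsigned long *)&status);
	/*
	 * On a 64-bit kernel this sets bit 62 of status.  On a 32-bit
	 * kernel, unsigned long is 32 bits, so bit 62 means bit 30 of
	 * the word at index 1: the high half of the u64 on a little-
	 * endian machine, but the LOW half -- i.e. bit 30 of status --
	 * on a big-endian one.  With unsigned long status and bit
	 * numbers <= 31, the same code is well-defined everywhere.
	 */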
diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index 258e66c..daad09a 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -1894,7 +1894,7 @@ void ipath_cancel_sends(struct ipath_devdata *dd, int restore_sendctrl)
 	 */
 	if (dd->ipath_flags & IPATH_HAS_SEND_DMA) {
 		int skip_cancel;
-		u64 *statp = &dd->ipath_sdma_status;
+		unsigned long *statp = &dd->ipath_sdma_status;

 		spin_lock_irqsave(&dd->ipath_sdma_lock, flags);
 		skip_cancel =
diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h
index 2097587..59a8b25 100644
--- a/drivers/infiniband/hw/ipath/ipath_kernel.h
+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h
@@ -483,7 +483,7 @@ struct ipath_devdata {
 	/* SendDMA related entries */
 	spinlock_t ipath_sdma_lock;
-	u64 ipath_sdma_status;
+	unsigned long ipath_sdma_status;
 	unsigned long ipath_sdma_abort_jiffies;
 	unsigned long ipath_sdma_abort_intr_timeout;
 	unsigned long ipath_sdma_buf_jiffies;
@@ -822,8 +822,8 @@ struct ipath_devdata {
 #define IPATH_SDMA_DISARMED 1
 #define IPATH_SDMA_DISABLED 2
 #define IPATH_SDMA_LAYERBUF 3
-#define IPATH_SDMA_RUNNING 62
-#define IPATH_SDMA_SHUTDOWN 63
+#define IPATH_SDMA_RUNNING 30
+#define IPATH_SDMA_SHUTDOWN 31

 /* bit combinations that correspond to abort states */
 #define IPATH_SDMA_ABORT_NONE 0

From rdreier at cisco.com Tue May 13 11:53:20 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 11:53:20 -0700
Subject: [ofa-general] [PATCH 2.6.26] RDMA/cxgb3: Wrap the software sq ptr as needed on flush.
In-Reply-To: <20080509201902.13077.53047.stgit@dell3.ogc.int> (Steve Wise's message of "Fri, 09 May 2008 15:19:02 -0500")
References: <20080509201902.13077.53047.stgit@dell3.ogc.int>
Message-ID:

thanks, applied

From rdreier at cisco.com Tue May 13 11:56:21 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 11:56:21 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: <20080510190721.GI5298@sgi.com> (akepner@sgi.com's message of "Sat, 10 May 2008 12:07:21 -0700")
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com>
Message-ID:

> ipoib_cm.c:ipoib_cm_send() does:
>	if (++priv->tx_outstanding == ipoib_sendq_size)
>		netif_stop_queue(dev);
>
> but ipoib_ib.c:ipoib_send() does:
>	if (++priv->tx_outstanding == (ipoib_sendq_size - 1)) {
>		netif_stop_queue(dev);

So this is not in the upstream kernel... I wonder if this is a bug introduced in an OFED 1.3 patch?

From sashak at voltaire.com Tue May 13 15:18:34 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 13 May 2008 22:18:34 +0000
Subject: [ofa-general] Re: [PATCH] opensm/osm_state_mgr.c: fix segmentation fault
In-Reply-To: <4829784A.6030708@dev.mellanox.co.il>
References: <4829784A.6030708@dev.mellanox.co.il>
Message-ID: <20080513221834.GE21414@sashak.voltaire.com>

Hi Yevgeny,

On 14:15 Tue 13 May , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> Fixing a trivial segmentation fault in the state manager.
>
> Please apply to the ofed_1_3 branch and to master.
>
> -- Yevgeny
>
> Signed-off-by: Yevgeny Kliteynik

This patch is not against the master branch; I applied it by hand-editing. Thanks for the fix. But please rebase your working branch!
Sasha

From or.gerlitz at gmail.com Tue May 13 12:48:27 2008
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 13 May 2008 22:48:27 +0300
Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode
In-Reply-To:
References:
Message-ID: <15ddcffd0805131248l7da1274fy8f467ae3a98e176e@mail.gmail.com>

On 5/13/08, Roland Dreier wrote:
>
> > Use a netdev event notification for sensing that a change has happened
> > in the IP stack, then scan the rdma-cm IDs list to see if there is an ID
> > that is "misaligned" in that respect with the IP stack, and disconnect it,
> > in case this is what the user asked for when setting an ha mode for the ID.
>
> this seems like a strange "HA" feature -- to disconnect connections that
> otherwise would continue operating. What is the use case/use scenario?

OK, I may have gone too fast here. The idea is to align the RDMA traffic with the links used by the IP stack. In the case where the app takes advantage of bonding ipoib devices to achieve HA AND it wants this alignment, when bonding does a fail-over for any reason (eg the problem is fixed and the primary option is used), a "fail-back" of the connection is needed.

(*) HW error --> RC connection break && bonding failover (change of active slave device, send gratuitous ARP), then this app reconnects and it's back in business.

> > +	list_for_each_entry(cma_dev, &dev_list, list)
> > +		list_for_each_entry(id_priv, &cma_dev->id_list, list) {
> > +			dev_addr = &id_priv->id.route.addr.dev_addr;
> > +			if (!memcmp(dev_addr->src_netdev_name, ndev->name, IFNAMSIZ) &&
> > +			    memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len))
> > +				if (id_priv->ha_mode == RDMA_ALIGN_WITH_NETDEVICE)
> > +					schedule_work(&id_priv->ha_work);
> > +		}
>
> This looks horribly racy/incorrect against RDMA device removal, CMA ID
> destruction and netdev renaming.

mmm, bad. I see your point re the first two, that is, some locking is needed to protect against device removal, the ID should be referenced, etc. As for the netdev renaming, I don't see how making a decision based on memcmp(dev_addr->src_netdev_name, ndev->name, IFNAMSIZ) is racy?! Even in the crazy case where ndev->name gets changed in the middle of this memcmp, the only issue would be some confusion made by the code, no damage.

Or.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sashak at voltaire.com Tue May 13 15:59:14 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 13 May 2008 22:59:14 +0000
Subject: [ofa-general] Re: [PATCH] infiniband-diags/Makefile.am: fix location of ibdiag_version.h
In-Reply-To: <4826E877.7090706@dev.mellanox.co.il>
References: <4826E877.7090706@dev.mellanox.co.il>
Message-ID: <20080513225914.GF21414@sashak.voltaire.com>

On 15:37 Sun 11 May , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> When compiling infiniband-diags not from the source code location,
> compilation fails to find the ibdiag_version.h file - fixing it.
>
> Signed-off-by: Yevgeny Kliteynik

Applied. Thanks.
Sasha

From kliteyn at dev.mellanox.co.il Tue May 13 13:01:53 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 13 May 2008 23:01:53 +0300
Subject: [ofa-general] Re: [PATCH] opensm/osm_state_mgr.c: fix segmentation fault
In-Reply-To: <20080513221834.GE21414@sashak.voltaire.com>
References: <4829784A.6030708@dev.mellanox.co.il> <20080513221834.GE21414@sashak.voltaire.com>
Message-ID: <4829F3B1.3040009@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> Hi Yevgeny,
>
> On 14:15 Tue 13 May , Yevgeny Kliteynik wrote:
>> Hi Sasha,
>>
>> Fixing a trivial segmentation fault in the state manager.
>>
>> Please apply to the ofed_1_3 branch and to master.
>>
>> -- Yevgeny
>>
>> Signed-off-by: Yevgeny Kliteynik
>
> This patch is not against the master branch; I applied it by hand-editing.
> Thanks for the fix. But please rebase your working branch!

My bad, this patch was against ofed_1_3 only, and I forgot to send a separate patch for master after seeing that... Sorry

-- Yevgeny

> Sasha
>

From or.gerlitz at gmail.com Tue May 13 13:10:30 2008
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 13 May 2008 23:10:30 +0300
Subject: [ofa-general] [RFC PATCH 3/4] rdma/cma: add high availability mode attribute to IDs
In-Reply-To:
References:
Message-ID: <15ddcffd0805131310x6710fbb6v890d71297f5588ed@mail.gmail.com>

On 5/13/08, Roland Dreier wrote:
>
> > +enum rdma_ha_mode {
> > +	RDMA_ALIGN_WITH_NETDEVICE = 1
> > +};
>
> > +int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode mode)
>
> this seems like overengineering to me... given there are no other modes,
> you are adding an elaborate NOP. (Nothing looks at ha_mode)

First, this patch would later be extended for the rdma_ucm part (exposing the ha_mode to user space). Second, indeed nothing looks at ha_mode in this patch, but the next one (4/4) uses it. I was thinking it's better to decompose the changes this way, such that the patches are neither too small nor too big, both in size and in the change they carry.

> Do you have plans for other modes?

Down the road someone might want to add APM support for the rdma-cm, or more modes that I can't think of now.

> >	u8 srq;
> >	u8 tos;
> > +	enum rdma_ha_mode ha_mode;
>
> Side note -- you're wasting two bytes here because of alignment.

What would be the easy way to avoid it?

Or.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From or.gerlitz at gmail.com Tue May 13 13:26:06 2008
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 13 May 2008 23:26:06 +0300
Subject: [ofa-general] RE: [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode
In-Reply-To: <000001c8b51b$81cd77d0$865b180a@amr.corp.intel.com>
References: <000001c8b51b$81cd77d0$865b180a@amr.corp.intel.com>
Message-ID: <15ddcffd0805131326x7873df30ua015a1719c90fb89@mail.gmail.com>

On 5/13/08, Sean Hefty wrote:

> This will race with other user calls. I've found it fairly difficult for
> the rdma_cm to call back into its own API and avoid racing with the user
> trying to destroy the cm_id. None of the APIs are coded to allow calling
> them simultaneously with destroy.

I see.

> A better solution for this may be for the rdma_cm to simply notify the
> user that the IP mapping for their RDMA device has changed. The user can
> then disconnect, with the appropriate synchronization, if they want their
> RDMA connection to follow the IP address.
Yes, this is possible. I tried to implement it in the rdma-cm to avoid having each ULP do it in its own code; if you think it's practically impossible for the rdma-cm to call its own API, I can change this into delivering a disconnected event.

> (If I understood correctly, the reason for this is to allow failing back
> to a repaired port.)

Indeed, this is a possible use case.

> >+	list_for_each_entry(cma_dev, &dev_list, list)
> >+		list_for_each_entry(id_priv, &cma_dev->id_list, list) {
> >+			dev_addr = &id_priv->id.route.addr.dev_addr;
> >+			if (!memcmp(dev_addr->src_netdev_name, ndev->name, IFNAMSIZ) &&
> >+			    memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len))
> >+				if (id_priv->ha_mode == RDMA_ALIGN_WITH_NETDEVICE)
> >+					schedule_work(&id_priv->ha_work);
> >+		}
>
> As Roland mentioned, this is racy in the areas he pointed out. This will
> take some thought to handle correctly.

OK, I will try to improve things here; any hints/directions would be very much appreciated...

Or.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rdreier at cisco.com Tue May 13 13:40:52 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 13:40:52 -0700
Subject: [ofa-general] Re: [PATCH 06/13] QLogic VNIC: IB core stack interaction
In-Reply-To: <20080430171855.31725.89658.stgit@localhost.localdomain> (Ramachandra K.'s message of "Wed, 30 Apr 2008 22:48:55 +0530")
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171855.31725.89658.stgit@localhost.localdomain>
Message-ID:

> +#include

> +	ret = ib_find_cached_pkey(viport_config->ibdev,
> +				  viport_config->port,
> +				  be16_to_cpu(viport_config->path_info.path.pkey),
> +				  &attr->pkey_index);

I think this can just be replaced with ib_find_pkey()... there is a call to kmalloc(... GFP_KERNEL) just a couple of lines above, so you are in a context where sleeping is allowed.

As I said before we want to get rid of the caching infrastructure so please don't add new users.

From rdreier at cisco.com Tue May 13 13:41:37 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 13 May 2008 13:41:37 -0700
Subject: [ofa-general] Re: [PATCH 07/13] QLogic VNIC: Handling configurable parameters of the driver
In-Reply-To: <20080430171925.31725.22023.stgit@localhost.localdomain> (Ramachandra K.'s message of "Wed, 30 Apr 2008 22:49:25 +0530")
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430171925.31725.22023.stgit@localhost.localdomain>
Message-ID:

> +	ib_get_cached_gid(config->ibdev, config->port, 0,
> +			  &config->path_info.path.sgid);

Again, looks like a sleepable context so please use ib_query_gid() instead.

From sashak at voltaire.com Tue May 13 16:43:45 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 13 May 2008 23:43:45 +0000
Subject: [ofa-general] Re: ibsim parsing question
In-Reply-To: <1210624804.2026.586.camel@hrosenstock-ws.xsigo.com>
References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> <20080512231248.GQ17046@sashak.voltaire.com> <20080512231831.GR17046@sashak.voltaire.com> <1210624804.2026.586.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080513234345.GI21414@sashak.voltaire.com>

On 13:40 Mon 12 May , Hal Rosenstock wrote:
>
> Right; that's what I meant by option 1.

True.
> Also, it's not really unknown options since it's part of the > NodeDescription. > > This works as long as the "known" options (currently s= and w=) are not > part of NodeDescription. It works for the real life use case that > started this. That is correct, but we know that ibsim parser doesn't parse NodeDescription in those (port related) lines, so in such "worst" cases when 's=' and/or 'w=' strings are used in NodeDescription this could be just filtered out from ibnetdiscovery file. Sasha From sashak at voltaire.com Tue May 13 16:51:00 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 13 May 2008 23:51:00 +0000 Subject: [ofa-general] Re: [PATCH] infiniband-diags/scripts/iblinkinfo.pl: fix printing of switch name when port 1 is down. In-Reply-To: <20080501155045.4aa3ef2c.weiny2@llnl.gov> References: <20080501155045.4aa3ef2c.weiny2@llnl.gov> Message-ID: <20080513235100.GK21414@sashak.voltaire.com> On 15:50 Thu 01 May , Ira Weiny wrote: > I found a bug in the printing of the names of switches on iblinkinfo.pl. The > name of the switch was being pulled from the first ports "link" structure. The > problem is, if the first port is down there was no structure available. This > gets the switch name from the first link structure available and prints the > name correctly. > > Ira > > From 9b69c0ff4c7785be78157ab78e4a4892d64e2fb2 Mon Sep 17 00:00:00 2001 > From: Ira K. Weiny > Date: Thu, 1 May 2008 15:46:25 -0700 > Subject: [PATCH] infiniband-diags/scripts/iblinkinfo.pl: fix printing of switch name when port 1 > is down. > > Signed-off-by: Ira K. Weiny Applied. Thanks. Sasha From hrosenstock at xsigo.com Tue May 13 13:56:58 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 13 May 2008 13:56:58 -0700 Subject: [ofa-general] Re: ibsim parsing question In-Reply-To: <20080513234345.GI21414@sashak.voltaire.com> References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> <20080512231248.GQ17046@sashak.voltaire.com> <20080512231831.GR17046@sashak.voltaire.com> <1210624804.2026.586.camel@hrosenstock-ws.xsigo.com> <20080513234345.GI21414@sashak.voltaire.com> Message-ID: <1210712218.2026.719.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-13 at 23:43 +0000, Sasha Khapyorsky wrote: > On 13:40 Mon 12 May , Hal Rosenstock wrote: > > > > Right; that's what I meant by option 1. > > True. > > > Also, it's not really unknown options since it's part of the > > NodeDescription. > > > > This works as long as the "known" options (currently s= and w=) are not > > part of NodeDescription. It works for the real life use case that > > started this. > > That is correct, but we know that ibsim parser doesn't parse > NodeDescription in those (port related) lines, so in such "worst" cases > when 's=' and/or 'w=' strings are used in NodeDescription this could be > just filtered out from ibnetdiscovery file. That's why I termed this approach a workaround and it does limit the NodeDescription in ways not limited by the IBA spec. Is this worth mentioning in the README or some other doc for ibsim ? 
-- Hal

> Sasha

From sashak at voltaire.com Tue May 13 17:02:47 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 14 May 2008 00:02:47 +0000
Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.)
In-Reply-To: <20080424181657.28d58a29.weiny2@llnl.gov>
References: <20080423133816.6c1b6315.weiny2@llnl.gov> <48109087.6030606@voltaire.com> <20080424143125.2aad1db8.weiny2@llnl.gov> <15ddcffd0804241523p19559580vc3a1293c1fe097b1@mail.gmail.com> <20080424181657.28d58a29.weiny2@llnl.gov>
Message-ID: <20080514000247.GL21414@sashak.voltaire.com>

On 18:16 Thu 24 Apr , Ira Weiny wrote:
>
> From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001
> From: Ira K. Weiny
> Date: Thu, 24 Apr 2008 18:05:01 -0700
> Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit
>
>
> Signed-off-by: Ira K. Weiny

Applied. Thanks.

Sasha

From sashak at voltaire.com Tue May 13 17:06:17 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 14 May 2008 00:06:17 +0000
Subject: [ofa-general] Nodes dropping out of IPoIB mcast group
In-Reply-To: <4816C6F6.6000602@voltaire.com>
References: <20080423133816.6c1b6315.weiny2@llnl.gov> <48109087.6030606@voltaire.com> <20080424143125.2aad1db8.weiny2@llnl.gov> <15ddcffd0804241523p19559580vc3a1293c1fe097b1@mail.gmail.com> <20080424181657.28d58a29.weiny2@llnl.gov> <48143DBA.3080701@voltaire.com> <20080428091923.0abf9fb5.weiny2@llnl.gov> <4816C6F6.6000602@voltaire.com>
Message-ID: <20080514000617.GM21414@sashak.voltaire.com>

On 09:57 Tue 29 Apr , Or Gerlitz wrote:
>
> And when openSM does the heavy sweep, what nodes would have their client
> rereg bit set, only the ones beyond the recovered link?

Yes.

> also will openSM
> cycle the logical link state of those nodes (which is active!) through
> armed-active again or the only SET would be for the rereg bit?

No, ideally (unless other PortInfo fields were changed) only a client rereg bit SET will be issued.

Sasha.

From caitlin.bestler at neterion.com Tue May 13 14:15:04 2008
From: caitlin.bestler at neterion.com (Caitlin Bestler)
Date: Tue, 13 May 2008 14:15:04 -0700
Subject: [ofa-general] [RFC PATCH 3/4] rdma/cma: add high availability mode attribute to IDs
In-Reply-To:
References:
Message-ID: <469958e00805131415l39c54201v4b5f39ed81fbf9cf@mail.gmail.com>

On Tue, May 13, 2008 at 7:13 AM, Or Gerlitz wrote:
> RDMA_ALIGN_WITH_NETDEVICE high availability (ha) mode means that the consumer
> of the rdma-cm wants that RDMA sessions would always use the same links (eg )
> as the IP stack does. In the current code, this does not happen when bonding did
> fail-over but the IB link used by an already existing session is operating fine.
> For now this mode is supported only for the connected services of the rdma-cm.
>

I'm not sure I've even seen an "RDMA Session". There are lots of RDMA *connections*, and there are RDMA applications that have an application-layer session that use several RDMA connections. But I'm fairly certain that there is no such thing as an "RDMA Session". Which raises some serious doubts about an automatic connection teardown based upon decisions at the RDMA layer.

This will also create problems with iWARP/IB compatibility. The iWARP standards (IETF and RDMAC) both solve the problem of RDMA endpoint / IP Address affinity by simply mandating it.
While no real solution is given in the standards, it has generally been interpreted to mean:

- You cannot create an RDMA connection on a device (or assign an existing TCP connection to an RDMA endpoint) if the device is not a valid route given the source/destination IP addresses.
- You can determine the set of possible RDMA devices by first consulting the local routing tables using the desired source and destination IP addresses.
- If an RDMA device is no longer a valid route for a connection, then the underlying TCP connection will fail (and it would be real nice if this happened promptly when the reason is a network reconfiguration, rather than just waiting for things to fail).

An important corner case here is that there may not be a need to migrate an existing RDMA connection to a new device just because the *preferred* route has changed. The non-preferred route may still be fully operable, and it may be preferable to continue using it for *this* connection given the cost of teardown and startup. Keep in mind that if the old route does not work, then it will fail fairly quickly. If doing it quickly is important, then the device should have mechanisms to ensure that it does not keep stale ARP or Neighbor Discovery entries lingering around. If the ARP/ND information is erased, the connection will be torn down very quickly (destination unreachable).

Now, for both IB and iWARP there is a substantial possibility that a connection can be migrated to a different port within the same or co-operating devices. In that case the High Availability is achieved without the application having to be involved at all.

If the connection is going to have to be re-established on a *different* device, there is a substantial risk that this will involve re-registering memory, re-connecting, and re-advertising buffers. I don't see how you can wisely decide that the benefits of a preferred route outweigh these costs on an application-independent basis. What if the application was nearly done with the connection? Or knew that it would be ending a current burst of activity in a few seconds and could pay for the connection shift-back then?

And if the application is going to make the decision, then can't it just subscribe to the local routing tables on its own without any help from OFA? Even if it is in response to a failure on the old connection, any application that has a "session" concept will have procedures for re-establishing the session on a new connection. Where is the need for a one-size-fits-none standardized solution?
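To make the routing-table consultation in the list above concrete, here is a hypothetical librdmacm sketch (the helper name and the timeout are assumptions, and error handling is trimmed). This is roughly how an rdma-cm consumer lets the kernel's route selection pick the device for a new connection:

	#include <stdlib.h>
	#include <sys/socket.h>
	#include <rdma/rdma_cma.h>

	/* Hypothetical helper: bind a new cm_id to whatever RDMA device the
	 * local routing table selects for this src/dst pair. Both resolution
	 * steps are asynchronous; completion is reported on the event channel
	 * as RDMA_CM_EVENT_ADDR_RESOLVED / RDMA_CM_EVENT_ROUTE_RESOLVED. */
	static struct rdma_cm_id *id_along_ip_route(struct rdma_event_channel *ch,
						    struct sockaddr *src,
						    struct sockaddr *dst)
	{
		struct rdma_cm_id *id;

		if (rdma_create_id(ch, &id, NULL, RDMA_PS_TCP))
			return NULL;
		/* The routing lookup happens here; id ends up bound to the
		 * device behind the egress interface the stack picked. */
		if (rdma_resolve_addr(id, src, dst, 2000 /* ms */)) {
			rdma_destroy_id(id);
			return NULL;
		}
		return id;
	}

Note that nothing in this sequence is re-run for an established id when the preferred route later changes; that gap is what the RDMA_ALIGN_WITH_NETDEVICE proposal in this thread is trying to close.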
From sashak at voltaire.com Tue May 13 17:15:45 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 14 May 2008 00:15:45 +0000
Subject: [ofa-general] Re: ibsim parsing question
In-Reply-To: <1210712218.2026.719.camel@hrosenstock-ws.xsigo.com>
References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> <20080512231248.GQ17046@sashak.voltaire.com> <20080512231831.GR17046@sashak.voltaire.com> <1210624804.2026.586.camel@hrosenstock-ws.xsigo.com> <20080513234345.GI21414@sashak.voltaire.com> <1210712218.2026.719.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080514001545.GO21414@sashak.voltaire.com>

On 13:56 Tue 13 May , Hal Rosenstock wrote:
>
> > That is correct, but we know that the ibsim parser doesn't parse
> > NodeDescription in those (port related) lines, so in such "worst" cases
> > when 's=' and/or 'w=' strings are used in NodeDescription these could be
> > simply filtered out of the ibnetdiscover file.
>
> That's why I termed this approach a workaround and it does limit the
> NodeDescription in ways not limited by the IBA spec.

No, it does not limit NodeDescription at all - it is *only* a file format limitation (remove NodeDescription from port related lines in the file and we are done).

> Is this worth mentioning in the README or some other doc for ibsim ?

Looks like overkill to me.

Sasha

From flatif at NetEffect.com Tue May 13 14:46:47 2008
From: flatif at NetEffect.com (Faisal Latif)
Date: Tue, 13 May 2008 16:46:47 -0500
Subject: [ofa-general] RE: [PATCH 1/1] infiniband/hw/nes/: avoid unnecessary memset
In-Reply-To: <20080512213601.626C91C0008F@mwinf2103.orange.fr>
References: <20080512213601.626C91C0008F@mwinf2103.orange.fr>
Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC080BD82B@venom2>

Acked-by: Faisal Latif

Thanks
Faisal

>
> From: Christophe Jaillet
>
> Hi, here is a patch against linux/drivers/infiniband/hw/nes/nes_cm.c which:
>
> 1) Remove an explicit memset(.., 0, ...) of a variable allocated with
> kzalloc (i.e. 'listener').
>
> Note: this patch is based on 'linux-2.6.25.tar.bz2'
>
> Signed-off-by: Christophe Jaillet
>
> ---
>
> --- linux/drivers/infiniband/hw/nes/nes_cm.c	2008-04-17 04:49:44.000000000 +0200
> +++ linux/drivers/infiniband/hw/nes/nes_cm.c.cj	2008-05-12 23:31:24.000000000 +0200
> @@ -1587,7 +1587,6 @@ static struct nes_cm_listener *mini_cm_l
> 		return NULL;
> 	}
>
> -	memset(listener, 0, sizeof(struct nes_cm_listener));
> 	listener->loc_addr = htonl(cm_info->loc_addr);
> 	listener->loc_port = htons(cm_info->loc_port);
> 	listener->reused_node = 0;
>

From swise at opengridcomputing.com Tue May 13 14:59:14 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 13 May 2008 16:59:14 -0500
Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode
In-Reply-To:
References:
Message-ID: <482A0F32.2010001@opengridcomputing.com>

Roland Dreier wrote:
> > Use netevent notification for sensing that a change has happened in the IP stack,
> > then scan the rdma-cm IDs list to see if there is an ID that is "misaligned"
> > in that respect with the IP stack, and disconnect it, in case this is what the
> > user asked for when setting an ha mode for the ID.
>
> this seems like a strange "HA" feature -- to disconnect connections that
> otherwise would continue operating. What is the use case/use scenario?
Maybe this should really be implemented in the ULP that wants this behavior, i.e., the ULP could register for routing/neighbour changes and tear down connections and re-establish them on the correct device.

Steve.

From swise at opengridcomputing.com Tue May 13 15:00:18 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 13 May 2008 17:00:18 -0500
Subject: [ofa-general] RE: [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode
In-Reply-To: <000001c8b51b$81cd77d0$865b180a@amr.corp.intel.com>
References: <000001c8b51b$81cd77d0$865b180a@amr.corp.intel.com>
Message-ID: <482A0F72.3090206@opengridcomputing.com>

Sean Hefty wrote:
>> +static void cma_ha_work_handler(struct work_struct *work)
>> +{
>> +	struct rdma_id_private *id_priv;
>> +
>> +	id_priv = container_of(work, struct rdma_id_private, ha_work);
>> +	rdma_disconnect(&id_priv->id);
>> +}
>
> This will race with other user calls. I've found it fairly difficult for the
> rdma_cm to call back into its own API and avoid racing with the user trying to
> destroy the cm_id. None of the APIs are coded to allow calling them
> simultaneously with destroy.
>
> A better solution for this may be for the rdma_cm to simply notify the user that
> the IP mapping for their RDMA device has changed. The user can then disconnect,
> with the appropriate synchronization, if they want their RDMA connection to
> follow the IP address. (If I understood correctly, the reason for this is to
> allow failing back to a repaired port.)

Yes. Move this logic to the ULP, not in the rdma-cm...

From bryan.d.green at nasa.gov Tue May 13 15:16:28 2008
From: bryan.d.green at nasa.gov (Bryan Green)
Date: Tue, 13 May 2008 15:16:28 -0700
Subject: [ofa-general] libibvpp - A libibverbs C++ wrapper library.
Message-ID: <20080513221628.3BC9E20415F@ece06.nas.nasa.gov>

I'd like to make an announcement about a recently released library. I've released a C++ wrapper library for libibverbs, called libibvpp. It is currently released under the NOSA (NASA Open Source Agreement) license. For more information, please see the README (link below).

I hope this library is of interest to the C++ programmers out there in the OpenFabrics community. I'm also curious whether there would be any interest in hosting this project on the openfabrics website.

Here is the library's current home: http://opensource.arc.nasa.gov/project/libivpp/
README: http://opensource.arc.nasa.gov/software/24/notes/
Download: http://opensource.arc.nasa.gov/static/downloads/libibvpp-0.1.tar.gz

Thanks,
-Bryan

---------------------------------------
Bryan Green
Visualization Group
NASA Advanced Supercomputing Division
NASA Ames Research Center
email: bryan.d.green at nasa.gov
---------------------------------------

From richard.frank at oracle.com Tue May 13 15:36:44 2008
From: richard.frank at oracle.com (Richard Frank)
Date: Tue, 13 May 2008 18:36:44 -0400
Subject: [ofa-general] RDS flow control
In-Reply-To: <200805132024.12741.okir@lst.de>
References: <200805121157.38135.jon@opengridcomputing.com> <200805132004.01371.okir@lst.de> <4829D92E.2070504@opengridcomputing.com> <200805132024.12741.okir@lst.de>
Message-ID: <482A17FC.7070804@oracle.com>

Olaf, if/when you have this running for IB, let me know. I think we can give it to some folks at Oracle who will be able to tell us if there is any performance regression using TPCH, especially if we have it in the next week or so, as I think we have a config to test with.
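As background for the flow-control work discussed in this subthread, a toy model of credit-based send throttling (the names and the accounting here are assumptions for illustration, not Olaf's patch set): the sender consumes a credit per posted send and stalls at zero; the receiver returns credits as it reposts receive buffers.

	#include <stdio.h>

	/* Each posted send consumes one credit; a sender at zero stalls
	 * instead of overrunning the peer's receive queue. */
	struct send_credits {
		int avail;
	};

	static int try_send(struct send_credits *c, const char *msg)
	{
		if (c->avail == 0)
			return 0;	/* caller must queue and retry later */
		c->avail--;
		printf("sent: %s (credits left: %d)\n", msg, c->avail);
		return 1;
	}

	/* Piggybacked on incoming traffic: the peer advertises how many
	 * new receive buffers it has posted. */
	static void return_credits(struct send_credits *c, int n)
	{
		c->avail += n;
	}

	int main(void)
	{
		struct send_credits c = { 2 };

		try_send(&c, "a");
		try_send(&c, "b");
		if (!try_send(&c, "c"))
			printf("stalled: no credits\n");
		return_credits(&c, 1);
		try_send(&c, "c");
		return 0;
	}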
Olaf Kirch wrote:
> On Tuesday 13 May 2008 20:08:46 Steve Wise wrote:
>>> No, not in the long term. But let's hold off on the flow control stuff
>>> for a little - I would first like to finish my patch set and hand it
>>> out for you folks to bang on it, rather than the other way round.
>>> Okay with you guys?
>>
>> What patch set?
>
> I mentioned in a previous mail to Jon that I have some partial patches
> that implement flow control. I want to get that code out to you ASAP;
> I think that's easier than having two different approaches that need
> to be reconciled afterwards.
>
>> We can't run on chelsio's rnic with fmrs...
>
> Yes, that is understood.
>
> Olaf

From john.gregor at qlogic.com Tue May 13 16:00:49 2008
From: john.gregor at qlogic.com (John Gregor)
Date: Tue, 13 May 2008 16:00:49 -0700 (PDT)
Subject: [ofa-general] bitops take an unsigned long *
Message-ID: <20080513230049.C9DC121A047C@diamond.mv.qlogic.com>

From: Roland Dreier
> Out of curiousity, was there any reason for choosing 0, 1, 2, 3 and
> then skipping to 62?

Not really. Just that RUNNING and SHUTDOWN are conceptually different from ABORTING, DISARMED, and DISABLED, and so it seemed to make sense at the time to cluster the bits at opposite ends of the qword. It made the printk() output easier to scan quickly during debugging.

-John Gregor

From worleys at gmail.com Tue May 13 17:47:15 2008
From: worleys at gmail.com (Chris Worley)
Date: Tue, 13 May 2008 18:47:15 -0600
Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others
In-Reply-To:
References:
Message-ID:

In two 1.3 builds I get different SET_IPOIB_CM settings in /etc/infiniband/openib.conf.

A generic build sets it to "yes". A kitchen-sink build doesn't set it.

Is there a reason (as I need it to be enabled on a system that needs the kitchen-sink build)?
Thanks,

Chris

From akepner at sgi.com Tue May 13 18:21:46 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Tue, 13 May 2008 18:21:46 -0700
Subject: [ofa-general] [PATCH] IPoIB: keep ib_wc[] on stack
Message-ID: <20080514012146.GG29302@sgi.com>

We're getting panics like this one on big clusters:

skb_over_panic: text:ffffffff8821f32e len:160 put:100
head:ffff810372b0f000 data:ffff810372b0f01c tail:ffff810372b0f0bc
end:ffff810372b0f080 dev:ib0
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at net/core/skbuff.c:94
invalid opcode: 0000 [1] SMP
last sysfs file: /class/infiniband/mlx4_0/node_type
CPU 0
Modules linked in: worm sg sd_mod crc32c libcrc32c rdma_ucm rdma_cm iw_cm
ib_addr ib_uverbs ib_umad iw_cxgb3 cxgb3 firmware_class mlx4_ib ib_mthca
iscsi_tcp libiscsi scsi_transport_iscsi ib_ipoib ib_cm ib_sa ib_mad ib_core
ipv6 loop numatools xpmem shpchp pci_hotplug i2c_i801 i2c_core mlx4_core
libata scsi_mod nfs lockd nfs_acl af_packet sunrpc e1000
Pid: 0, comm: swapper Tainted: G U 2.6.16.46-0.12-smp #1
RIP: 0010:[] {skb_over_panic+77}
RSP: 0018:ffffffff80417e28 EFLAGS: 00010292
RAX: 0000000000000098 RBX: ffff81041b4bee08 RCX: 0000000000000292
RDX: ffffffff80347868 RSI: 0000000000000292 RDI: ffffffff80347860
RBP: ffff8103725817c0 R08: ffffffff80347868 R09: ffff81041d94e3c0
R10: 0000000000000000 R11: 0000000000000000 R12: ffff81041b4be500
R13: 0000000000000060 R14: 0000000000000900 R15: ffffc20000078908
FS: 0000000000000000(0000) GS:ffffffff803be000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b44089dc000 CR3: 000000041f35d000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff803d8000, task ffffffff80341340)
Stack: ffff810372b0f0bc ffff810372b0f080 ffff81041b4be000 ffff81041b4be500
 0000000000000060 ffffffff8821f336 ffffffff80417ec8 ffff81041b4be000
 0000000417227014 0000000000000292
Call Trace: {:ib_ipoib:ipoib_ib_handle_rx_wc+909}
 {:ib_ipoib:ipoib_poll+159} {net_rx_action+165}
 {__do_softirq+85} {call_softirq+30}
 {do_softirq+44} {do_IRQ+64} {mwait_idle+0}
 {ret_from_intr+0} {mwait_idle+0} {mwait_idle+54}
 {cpu_idle+151} {start_kernel+601} {_sinittext+650}

Started looking into what might cause this and I found that IPoIB always does something like this:

	int ipoib_poll(struct net_device *dev, int *budget)
	{
		struct ipoib_dev_priv *priv = netdev_priv(dev);
		....
		ib_poll_cq(priv->rcq, t, priv->ibwc);

		for (i = 0; i < n; i++) {
			struct ib_wc *wc = priv->ibwc + i;
			....
			ipoib_ib_handle_rx_wc(dev, wc);

What happens if we call ib_poll_cq() then, before processing the rx completions in ipoib_ib_handle_rx_wc(), ipoib_poll() gets called again (on a different CPU)? That could corrupt the priv->ibwc array, and lead to a panic like the one above.

How about keeping the array of struct ib_wc on the stack? This has been tested only on a small system, not yet on one large enough to verify that it prevents the panic. But this "obviously" needs to be fixed, no?
Signed-off-by: Arthur Kepner

---
 ipoib.h    |  3 ---
 ipoib_ib.c | 31 +++++++++++++++++--------------
 2 files changed, 17 insertions(+), 17 deletions(-)

diff -rup a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2008-05-12 16:39:22.024109931 -0700
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2008-05-13 16:21:52.433988977 -0700
@@ -326,7 +326,6 @@ struct ipoib_cm_dev_priv {
 	struct sk_buff_head skb_queue;
 	struct list_head start_list;
 	struct list_head reap_list;
-	struct ib_wc ibwc[IPOIB_NUM_WC];
 	struct ib_sge rx_sge[IPOIB_CM_RX_SG];
 	struct ib_recv_wr rx_wr;
 	int nonsrq_conn_qp;
@@ -406,8 +405,6 @@ struct ipoib_dev_priv {
 	struct ib_send_wr tx_wr;
 	unsigned tx_outstanding;
-	struct ib_wc ibwc[IPOIB_NUM_WC];
-	struct ib_wc send_wc[MAX_SEND_CQE];
 	unsigned int tx_poll;
 	struct list_head dead_ahs;
diff -rup a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2008-05-12 16:39:22.020109690 -0700
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2008-05-13 17:19:28.809819954 -0700
@@ -366,12 +366,13 @@ static void ipoib_ib_handle_tx_wc(struct
 void poll_tx(struct ipoib_dev_priv *priv)
 {
+	struct ib_wc send_wc[MAX_SEND_CQE];
 	int n, i;

 	while (1) {
-		n = ib_poll_cq(priv->scq, MAX_SEND_CQE, priv->send_wc);
+		n = ib_poll_cq(priv->scq, MAX_SEND_CQE, send_wc);
 		for (i = 0; i < n; ++i)
-			ipoib_ib_handle_tx_wc(priv->dev, priv->send_wc + i);
+			ipoib_ib_handle_tx_wc(priv->dev, send_wc + i);

 		if (n < MAX_SEND_CQE)
 			break;
@@ -380,6 +381,7 @@ void poll_tx(struct ipoib_dev_priv *priv
 int ipoib_poll(struct net_device *dev, int *budget)
 {
+	struct ib_wc ibwc[IPOIB_NUM_WC];
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int max = min(*budget, dev->quota);
 	int done;
@@ -393,10 +395,10 @@ poll_more:
 	while (max) {
 		t = min(IPOIB_NUM_WC, max);
-		n = ib_poll_cq(priv->rcq, t, priv->ibwc);
+		n = ib_poll_cq(priv->rcq, t, ibwc);

 		for (i = 0; i < n; i++) {
-			struct ib_wc *wc = priv->ibwc + i;
+			struct ib_wc *wc = ibwc + i;

 			if (wc->wr_id & IPOIB_OP_RECV) {
 				++done;
@@ -783,29 +785,30 @@ static int recvs_pending(struct net_devi
 void ipoib_drain_cq(struct net_device *dev)
 {
+	struct ib_wc ibwc[IPOIB_NUM_WC];
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int i, n;

 	do {
-		n = ib_poll_cq(priv->rcq, IPOIB_NUM_WC, priv->ibwc);
+		n = ib_poll_cq(priv->rcq, IPOIB_NUM_WC, ibwc);
 		for (i = 0; i < n; ++i) {
 			/*
 			 * Convert any successful completions to flush
 			 * errors to avoid passing packets up the
 			 * stack after bringing the device down.
 			 */
-			if (priv->ibwc[i].status == IB_WC_SUCCESS)
-				priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR;
+			if (ibwc[i].status == IB_WC_SUCCESS)
+				ibwc[i].status = IB_WC_WR_FLUSH_ERR;

-			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
-				if (priv->ibwc[i].wr_id & IPOIB_OP_CM)
-					ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
+			if (ibwc[i].wr_id & IPOIB_OP_RECV) {
+				if (ibwc[i].wr_id & IPOIB_OP_CM)
+					ipoib_cm_handle_rx_wc(dev, ibwc + i);
 				else
-					ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
+					ipoib_ib_handle_rx_wc(dev, ibwc + i);
 			} else {
-				if (priv->ibwc[i].wr_id & IPOIB_OP_CM)
-					ipoib_cm_handle_tx_wc(dev, priv->ibwc + i);
+				if (ibwc[i].wr_id & IPOIB_OP_CM)
+					ipoib_cm_handle_tx_wc(dev, ibwc + i);
 				else
-					ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
+					ipoib_ib_handle_tx_wc(dev, ibwc + i);
 			}
 		}
 	} while (n == IPOIB_NUM_WC);

From npiggin at suse.de Tue May 13 21:11:22 2008
From: npiggin at suse.de (Nick Piggin)
Date: Wed, 14 May 2008 06:11:22 +0200
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <20080513153238.GL19717@sgi.com>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com>
Message-ID: <20080514041122.GE24516@wotan.suse.de>

On Tue, May 13, 2008 at 10:32:38AM -0500, Robin Holt wrote:
> On Tue, May 13, 2008 at 10:06:44PM +1000, Nick Piggin wrote:
> > On Thursday 08 May 2008 10:38, Robin Holt wrote:
> > > In order to invalidate the remote page table entries, we need to message
> > > (uses XPC) to the remote side. The remote side needs to acquire the
> > > importing process's mmap_sem and call zap_page_range(). Between the
> > > messaging and the acquiring a sleeping lock, I would argue this will
> > > require sleeping locks in the path prior to the mmu_notifier invalidate_*
> > > callouts().
> >
> > Why do you need to take mmap_sem in order to shoot down pagetables of
> > the process? It would be nice if this can just be done without
> > sleeping.
>
> We are trying to shoot down page tables of a different process running
> on a different instance of Linux running on Numa-link connected portions
> of the same machine.

Right. You can zap page tables without sleeping, if you're careful. I don't know that we quite do that for anonymous pages at the moment, but it should be possible with a bit of thought, I believe.

> The messaging is clearly going to require sleeping. Are you suggesting
> we need to rework XPC communications to not require sleeping? I think
> that is going to be impossible since the transfer engine requires a
> sleeping context.

I guess that you have found a way to perform TLB flushing within coherent domains over the numalink interconnect without sleeping. I'm sure it would be possible to send similar messages between non coherent domains.
So yes, I'd much rather rework such a highly specialized system to fit in closer with Linux than rework Linux to fit with these machines (and apparently slow everyone else down).

> Additionally, the call to zap_page_range expects to have the mmap_sem
> held. I suppose we could use something other than zap_page_range and
> atomically clear the process page tables.

zap_page_range does not expect to have mmap_sem held. I think for anon pages it is always called with mmap_sem, however try_to_unmap_anon is not (although it expects page lock to be held, I think we should be able to avoid that).

> Doing that will not alleviate
> the need to sleep for the messaging to the other partitions.

No, but I'd venture to guess that is not impossible to implement even on your current hardware (maybe a firmware update is needed)?

From benh at kernel.crashing.org Tue May 13 22:43:59 2008
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 13 May 2008 22:43:59 -0700
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <200805132214.27510.nickpiggin@yahoo.com.au>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507234521.GN8276@duo.random> <20080508013459.GS8276@duo.random> <200805132214.27510.nickpiggin@yahoo.com.au>
Message-ID: <1210743839.8297.55.camel@pasglop>

On Tue, 2008-05-13 at 22:14 +1000, Nick Piggin wrote:
> ea.
>
> I don't see why you're bending over so far backwards to accommodate
> this GRU thing that we don't even have numbers for and could actually
> potentially be batched up in other ways (eg. using mmu_gather or
> mmu_gather-like idea).

I agree, we're better off generalizing the mmu_gather batching instead... I had some never-finished patches to use the mmu_gather for pretty much everything except single page faults, tho various subtle differences between archs and lack of time caused me to let them gather dust and not finish them...

I can try to dig some of that out when I'm back from my current travel, though it's probably worth re-doing from scratch now.

Ben.

From npiggin at suse.de Tue May 13 23:06:11 2008
From: npiggin at suse.de (Nick Piggin)
Date: Wed, 14 May 2008 08:06:11 +0200
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <1210743839.8297.55.camel@pasglop>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507234521.GN8276@duo.random> <20080508013459.GS8276@duo.random> <200805132214.27510.nickpiggin@yahoo.com.au> <1210743839.8297.55.camel@pasglop>
Message-ID: <20080514060610.GB30448@wotan.suse.de>

On Tue, May 13, 2008 at 10:43:59PM -0700, Benjamin Herrenschmidt wrote:
>
> On Tue, 2008-05-13 at 22:14 +1000, Nick Piggin wrote:
> > ea.
> >
> > I don't see why you're bending over so far backwards to accommodate
> > this GRU thing that we don't even have numbers for and could actually
> > potentially be batched up in other ways (eg. using mmu_gather or
> > mmu_gather-like idea).
>
> I agree, we're better off generalizing the mmu_gather batching
> instead...

Well, the first thing would be just to get rid of the whole start/end idea, which completely departs from the standard Linux system of clearing ptes, then flushing TLBs, then freeing memory. The onus would then be on GRU to come up with some numbers to justify batching, and a patch which works nicely with the rest of the Linux mm. And yes, mmu-gather is *the* obvious first choice of places to look if one wanted batching hooks.
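A toy illustration of the mmu_gather-style batching idea just mentioned (this is not the kernel's mmu_gather, only the shape of it): accumulate the range covered by cleared ptes, then make a single invalidate callout at flush time instead of one per page.

	#include <stdio.h>

	struct gather_range {
		unsigned long start;	/* lowest address invalidated so far */
		unsigned long end;	/* one past the highest */
	};

	static void gather_init(struct gather_range *g)
	{
		g->start = ~0UL;
		g->end = 0;
	}

	/* Called for every pte cleared: cheap bookkeeping, no callout. */
	static void gather_add(struct gather_range *g, unsigned long addr,
			       unsigned long size)
	{
		if (addr < g->start)
			g->start = addr;
		if (addr + size > g->end)
			g->end = addr + size;
	}

	/* Called once per batch: the single point where an external-TLB
	 * notifier (e.g. for the GRU) would be invoked. */
	static void gather_flush(struct gather_range *g)
	{
		if (g->end > g->start)
			printf("invalidate [%#lx, %#lx)\n", g->start, g->end);
		gather_init(g);
	}

	int main(void)
	{
		struct gather_range g;
		unsigned long a;

		gather_init(&g);
		for (a = 0x1000; a < 0x5000; a += 0x1000)
			gather_add(&g, a, 0x1000);	/* four ptes */
		gather_flush(&g);			/* one callout */
		return 0;
	}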
> I had some never-finished patches to use the mmu_gather for pretty much
> everything except single page faults, tho various subtle differences
> between archs and lack of time caused me to let them gather dust and
> not finish them...
>
> I can try to dig some of that out when I'm back from my current travel,
> though it's probably worth re-doing from scratch now.

I always liked the idea as you know. But I don't think that should be mixed in with the first iteration of the mmu notifiers patch anyway. GRU actually can work without batching, but there is simply some (unquantified to me) penalty for not batching it. I think it is far better to put in clean, simple, working functionality first. The idea that we have to unload some monster be-all-and-end-all solution onto mainline in a single go seems counterproductive to me.

From ogerlitz at voltaire.com Wed May 13 23:23:48 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 14 May 2008 09:23:48 +0300
Subject: [ofa-general] Re: the so many IPoIB-UD failures introduced by OFED 1.3
In-Reply-To:
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com>
Message-ID: <482A8574.8070201@voltaire.com>

Roland Dreier wrote:
> > ipoib_cm.c:ipoib_cm_send() does:
> > 	if (++priv->tx_outstanding == ipoib_sendq_size)
> > 		netif_stop_queue(dev);
> >
> > but ipoib_ib.c:ipoib_send() does:
> > 	if (++priv->tx_outstanding == (ipoib_sendq_size - 1)) {
> > 		netif_stop_queue(dev);
>
> So this is not in the upstream kernel... I wonder if this is a bug
> introduced in an OFED 1.3 patch?

Over the last period we have had so much debugging done on unreviewed ipoib patches which were merged into OFED 1.3, bypassing any sane procedure. This includes people sending Roland bug reports on code he does not have in his tree, and people reporting bugs introduced by code pushed to OFED after rc3! It seems like we chose a very inefficient way to work: first, merge code; second, test and see it crash; third, ask the maintainer to review and get him to fix it; fourth, push it to the kernel. OFED 1.3 is out there, merged into commercial "enterprise" distros, and ipoib is the first thing people test, so these people will hit all these crashes. Maybe it's about time for the Linux IB maintainers to get a little angry?!

Or.

From eli at dev.mellanox.co.il Wed May 14 00:25:48 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Wed, 14 May 2008 10:25:48 +0300
Subject: [ofa-general] [PATCH] IPoIB: keep ib_wc[] on stack
In-Reply-To: <20080514012146.GG29302@sgi.com>
References: <20080514012146.GG29302@sgi.com>
Message-ID: <1210749948.15669.268.camel@mtls03>

On Tue, 2008-05-13 at 18:21 -0700, akepner at sgi.com wrote:
> We're getting panics like this one on big clusters:
>
> skb_over_panic: text:ffffffff8821f32e len:160 put:100 head:ffff810372b0f000 data:ffff810372b0f01c tail:ffff810372b0f0bc end:ffff810372b0f080 dev:ib0

RX SKBs are large enough to contain 100 bytes... this looks like corruption. Can you give more information on OS, kernel version, and OFED version?

> Started looking into what might cause this and I found that IPoIB
> always does something like this:
>
> 	int ipoib_poll(struct net_device *dev, int *budget)
> 	{
> 		struct ipoib_dev_priv *priv = netdev_priv(dev);
> 		....
> 		ib_poll_cq(priv->rcq, t, priv->ibwc);
>
> 		for (i = 0; i < n; i++) {
> 			struct ib_wc *wc = priv->ibwc + i;
> 			....
> 			ipoib_ib_handle_rx_wc(dev, wc);

From NAPI_HOWTO.txt (although the file has been removed, I think the statement is still valid):

-Guarantee: Only one CPU at any time can call dev->poll(); this is because only one CPU can pick the initial interrupt and hence the initial netif_rx_schedule(dev);

> How about keeping the array of struct ib_wc on the stack?

The stack is limited for kernel code, and putting this on the stack is limiting. I think this could hurt performance too, due to more cache misses.

From monis at Voltaire.COM Wed May 14 00:41:19 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 14 May 2008 10:41:19 +0300
Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event
In-Reply-To:
References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM> <4820638E.4030901@Voltaire.COM> <4827FBDF.9040308@Voltaire.COM>
Message-ID: <482A979F.6040305@Voltaire.COM>

Roland Dreier wrote:
> > Can we please go on with this patch? We would like to see it in the next kernel.
>
> I still don't get why this is important to you. Is there a concrete
> example of a situation where this actually makes a measurable difference?
>
> We need some justification for adding this locking complexity beyond "it
> doesn't hurt." (And also of course we need it fixed so there aren't races)
>
> - R.

Hi,

OK. Here is an example that was seen in our tests. One IPoIB host (client) sends a stream of multicast packets to another IPoIB host (server). An SM takeover event takes place during traffic; as a result, multicast info is flushed and the hosts need to rejoin. Without the patch there is a chance (which according to our experience is a very big chance) that the request to rejoin will go to the old SM, and the join completes successfully only after a retry. This takes too long, and the patch solves it. I hope that this is convincing enough for you, because for us it is important that recovery from a failure be as quick as possible.

thanks

MoniS

From vlad at dev.mellanox.co.il Wed May 14 03:55:12 2008
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 14 May 2008 13:55:12 +0300
Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others
In-Reply-To:
References:
Message-ID: <482AC510.3090602@dev.mellanox.co.il>

Chris Worley wrote:
> In two 1.3 builds I get different SET_IPOIB_CM settings in
> /etc/infiniband/openib.conf.
>
> A generic build sets it to "yes". A kitchen-sink build doesn't set it.
>
> Is there a reason (as I need it to be enabled on a system that needs
> the kitchen-sink build)?
> Thanks,
>
> Chris

The default mode for IPoIB CM in OFED-1.3 (/etc/infiniband/openib.conf) is: SET_IPOIB_CM=yes

It was different (SET_IPOIB_CM=no) in OFED-1.2 between Thu Mar 29 16:57:22 2007 and Wed Apr 4 10:41:09 2007 (before OFED-1.2-rc1). Can you point me to the OFED-1.3 build where SET_IPOIB_CM is set to "no"?

Regards,
Vladimir

From hrosenstock at xsigo.com Wed May 14 04:24:04 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Wed, 14 May 2008 04:24:04 -0700
Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.)
In-Reply-To: <20080514000247.GL21414@sashak.voltaire.com>
References: <20080423133816.6c1b6315.weiny2@llnl.gov> <48109087.6030606@voltaire.com> <20080424143125.2aad1db8.weiny2@llnl.gov> <15ddcffd0804241523p19559580vc3a1293c1fe097b1@mail.gmail.com> <20080424181657.28d58a29.weiny2@llnl.gov> <20080514000247.GL21414@sashak.voltaire.com>
Message-ID: <1210764244.2026.728.camel@hrosenstock-ws.xsigo.com>

On Wed, 2008-05-14 at 00:02 +0000, Sasha Khapyorsky wrote:
> On 18:16 Thu 24 Apr , Ira Weiny wrote:
> >
> > From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001
> > From: Ira K. Weiny
> > Date: Thu, 24 Apr 2008 18:05:01 -0700
> > Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit
> >
> >
> > Signed-off-by: Ira K. Weiny
>
> Applied. Thanks.

Would this change also be applied to the ofed_1_3 branch?

-- Hal

> Sasha

From holt at sgi.com Wed May 14 04:26:25 2008
From: holt at sgi.com (Robin Holt)
Date: Wed, 14 May 2008 06:26:25 -0500
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <20080514041122.GE24516@wotan.suse.de>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de>
Message-ID: <20080514112625.GY9878@sgi.com>

On Wed, May 14, 2008 at 06:11:22AM +0200, Nick Piggin wrote:
> On Tue, May 13, 2008 at 10:32:38AM -0500, Robin Holt wrote:
> > On Tue, May 13, 2008 at 10:06:44PM +1000, Nick Piggin wrote:
> > > On Thursday 08 May 2008 10:38, Robin Holt wrote:
> > > > In order to invalidate the remote page table entries, we need to message
> > > > (uses XPC) to the remote side. The remote side needs to acquire the
> > > > importing process's mmap_sem and call zap_page_range(). Between the
> > > > messaging and the acquiring a sleeping lock, I would argue this will
> > > > require sleeping locks in the path prior to the mmu_notifier invalidate_*
> > > > callouts().
> > >
> > > Why do you need to take mmap_sem in order to shoot down pagetables of
> > > the process? It would be nice if this can just be done without
> > > sleeping.
> >
> > We are trying to shoot down page tables of a different process running
> > on a different instance of Linux running on Numa-link connected portions
> > of the same machine.
>
> Right.
> You can zap page tables without sleeping, if you're careful. I
> don't know that we quite do that for anonymous pages at the moment, but it
> should be possible with a bit of thought, I believe.
>
> > The messaging is clearly going to require sleeping. Are you suggesting
> > we need to rework XPC communications to not require sleeping? I think
> > that is going to be impossible since the transfer engine requires a
> > sleeping context.
>
> I guess that you have found a way to perform TLB flushing within coherent
> domains over the numalink interconnect without sleeping. I'm sure it would
> be possible to send similar messages between non coherent domains.

I assume by coherent domains you are actually talking about system images. Our memory coherence domain on the 3700 family is 512 processors on 128 nodes. On the 4700 family, it is 16,384 processors on 4096 nodes. We extend a "Read-Exclusive" mode beyond the coherence domain so any processor is able to read any cacheline on the system. We also provide uncached access for certain types of memory beyond the coherence domain.

For the other partitions, the exporting partition does not know at what virtual address the imported pages are mapped. The pages are frequently mapped in a different order by the MPI library to help with MPI collective operations. For the exporting side to do those TLB flushes, we would need to replicate all that importing information back to the exporting side.

Additionally, the hardware that does the TLB flushing is protected by a spinlock on each system image. We would need to change that simple spinlock into a type of hardware lock that would work (on 3700) outside the processor's coherence domain. The only way to do that is to use uncached addresses with our Atomic Memory Operations, which do the cmpxchg at the memory controller. The uncached accesses are an order of magnitude or more slower.

> So yes, I'd much rather rework such a highly specialized system to fit in
> closer with Linux than rework Linux to fit with these machines (and
> apparently slow everyone else down).

But it isn't that we are having a problem adapting to just the hardware. One of the limiting factors is Linux on the other partition.

> > Additionally, the call to zap_page_range expects to have the mmap_sem
> > held. I suppose we could use something other than zap_page_range and
> > atomically clear the process page tables.
>
> zap_page_range does not expect to have mmap_sem held. I think for anon
> pages it is always called with mmap_sem, however try_to_unmap_anon is
> not (although it expects page lock to be held, I think we should be able
> to avoid that).

zap_page_range calls unmap_vmas, which walks to vma->next. Are you saying that can be walked without grabbing the mmap_sem at least for read? I feel my understanding of list management and locking completely shifting.

> > Doing that will not alleviate
> > the need to sleep for the messaging to the other partitions.
>
> No, but I'd venture to guess that is not impossible to implement even
> on your current hardware (maybe a firmware update is needed)?

Are you suggesting the sending side would not need to sleep, or the receiving side? Assuming you meant the sender, it spins waiting for the remote side to acknowledge the invalidate request? We place the data into a previously agreed-upon buffer and send an interrupt. At this point, we would need to start spinning and waiting for completion. Let's assume we never run out of buffer space.

The receiving side receives an interrupt.
The interrupt currently wakes an XPC thread to do the work of transferring and delivering the message to XPMEM. The transfer of the data, which XPC does using the BTE engine, takes up to 28 seconds to time out (a hardware timeout before raising an error), and the BTE code automatically does a retry for certain types of failure. We currently need to grab semaphores, which _MAY_ be able to be reworked into other types of locks.

Thanks,
Robin

From hrosenstock at xsigo.com Wed May 14 04:29:39 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Wed, 14 May 2008 04:29:39 -0700
Subject: [ofa-general] Re: ibsim parsing question
In-Reply-To: <20080514001545.GO21414@sashak.voltaire.com>
References: <1209934944.20493.188.camel@hrosenstock-ws.xsigo.com> <20080512222316.GK17046@sashak.voltaire.com> <1210620927.2026.575.camel@hrosenstock-ws.xsigo.com> <20080512225725.GN17046@sashak.voltaire.com> <1210622530.2026.580.camel@hrosenstock-ws.xsigo.com> <20080512231248.GQ17046@sashak.voltaire.com> <20080512231831.GR17046@sashak.voltaire.com> <1210624804.2026.586.camel@hrosenstock-ws.xsigo.com> <20080513234345.GI21414@sashak.voltaire.com> <1210712218.2026.719.camel@hrosenstock-ws.xsigo.com> <20080514001545.GO21414@sashak.voltaire.com>
Message-ID: <1210764579.2026.735.camel@hrosenstock-ws.xsigo.com>

On Wed, 2008-05-14 at 00:15 +0000, Sasha Khapyorsky wrote:
> On 13:56 Tue 13 May , Hal Rosenstock wrote:
> >
> > > That is correct, but we know that the ibsim parser doesn't parse
> > > NodeDescription in those (port related) lines, so in such "worst" cases
> > > when 's=' and/or 'w=' strings are used in NodeDescription these could be
> > > simply filtered out of the ibnetdiscover file.
> >
> > That's why I termed this approach a workaround and it does limit the
> > NodeDescription in ways not limited by the IBA spec.
>
> No, it does not limit NodeDescription at all - it is *only* a file format
> limitation (remove NodeDescription from port related lines in the file
> and we are done).

That's my point; NodeDescription is part of the ibnd file format, and hence this limits what can be there, so in that sense this limits its contents, and removing such occurrences from the file is a workaround IMO. While a distinction can be made between the actual NodeDescription and what is in the file, they're supposed to be one and the same and not require some extra post-processing.

> > Is this worth mentioning in the README or some other doc for ibsim ?
>
> Looks like overkill to me.

OK.

-- Hal

> Sasha

From ogerlitz at voltaire.com Wed May 14 05:33:47 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 14 May 2008 15:33:47 +0300
Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode
In-Reply-To: <482A0F32.2010001@opengridcomputing.com>
References: <482A0F32.2010001@opengridcomputing.com>
Message-ID: <482ADC2B.5080008@voltaire.com>

Steve Wise wrote:
> Maybe this should really be implemented in the ULP that wants this
> behavior, i.e., the ULP could register for routing/neighbour changes and
> tear down connections and re-establish them on the correct device.

Hi Steve,

First, registration for neighbour changes can't serve the purpose of aligning RDMA traffic with the IP stack, for a bunch of reasons, among them:

- for IB, no neighbour is created at the passive side of the unicast session
- for unicast sessions, address resolution involves ARP, but the neighbour may be deleted by the kernel since the rdma traffic does not go through the stack
- for multicast sessions, no neighbour is created during address resolution

Second, the rdma-cm does well in saving the ULP from interacting with the network stack; that is, the ULP is not aware of the routing lookup / neighbour / net device used for address resolution. (A skeleton of such low-level event registration is sketched below.)
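A skeleton of what such low-level registration for net events could look like inside the rdma-cm (a sketch under assumptions: the handler name and the id-scanning step are illustrative, not the posted patch; the notifier calls are the stock 2.6 kernel API):

	#include <linux/module.h>
	#include <linux/notifier.h>
	#include <linux/netdevice.h>

	static int cma_netdev_callback(struct notifier_block *self,
				       unsigned long event, void *ctx)
	{
		/* on these kernels the callback argument is the net_device */
		struct net_device *ndev = ctx;

		if (event != NETDEV_UP && event != NETDEV_CHANGEADDR)
			return NOTIFY_DONE;

		/* Here the rdma-cm would walk its list of ids and schedule
		 * work for any id whose bound device no longer matches what
		 * the IP stack would now use for ndev (omitted in this sketch). */
		(void)ndev;
		return NOTIFY_DONE;
	}

	static struct notifier_block cma_netdev_nb = {
		.notifier_call = cma_netdev_callback,
	};

	static int __init cma_ha_sketch_init(void)
	{
		return register_netdevice_notifier(&cma_netdev_nb);
	}

	static void __exit cma_ha_sketch_exit(void)
	{
		unregister_netdevice_notifier(&cma_netdev_nb);
	}

	module_init(cma_ha_sketch_init);
	module_exit(cma_ha_sketch_exit);
	MODULE_LICENSE("GPL");

Registering once at this level spares every ULP from duplicating the hook, which is the design choice argued for here.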
In that spirit I prefer to add the registration for net events at the low level (rdma-cm).

Third, thanks for bringing up the point of route changes :)

Or.

From ogerlitz at voltaire.com Wed May 14 05:44:01 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 14 May 2008 15:44:01 +0300
Subject: [ofa-general] [RFC PATCH 3/4] rdma/cma: add high availability mode attribute to IDs
In-Reply-To: <469958e00805131415l39c54201v4b5f39ed81fbf9cf@mail.gmail.com>
References: <469958e00805131415l39c54201v4b5f39ed81fbf9cf@mail.gmail.com>
Message-ID: <482ADE91.1050704@voltaire.com>

Caitlin Bestler wrote:
> I'm not sure I've even seen an "RDMA Session".

OK, there was some misunderstanding here: by "session" I refer to both connected and unconnected (unicast & multicast) services of the rdma-cm. I wanted to emphasize that, similarly to the network stack, where bonding works for TCP, UDP unicast, UDP multicast, etc. traffic, I want this "rdma/ip traffic alignment" feature to work not only for RC connections.

> And if the application is going to make the decision, then
> can't it just subscribe to the local routing tables on its
> own without any help from OFA?

It can, but I don't want it to. Please see my other response to Steve on this thread from today.

Or.

From ogerlitz at voltaire.com Wed May 14 05:52:11 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 14 May 2008 15:52:11 +0300
Subject: [ofa-general] Re: IB/ipoib: Fix neigh destructor oops for kernels older than 2.6.21
In-Reply-To: <1210765031.15669.285.camel@mtls03>
References: <1210765031.15669.285.camel@mtls03>
Message-ID: <482AE07B.8040501@voltaire.com>

Eli Cohen wrote:
> IB/ipoib: Fix neigh destructor oops
>
> For kernels 2.6.20 and older, it may happen that ipoib_neigh_cleanup()
> is called through a stale pointer after IPoIB has been unloaded,
> causing a kernel oops. This problem has been fixed for 2.6.21 with
> the following commit: ecbb416939da77c0d107409976499724baddce7b

Hi Eli,

Before looking into the solution, I'd like to slow down a little and understand the problem (how can ipoib_neigh_cleanup() be called after IPoIB has been unloaded?) and why the commit below solves it; from its change-log I don't see any reference to this problem:

> commit ecbb416939da77c0d107409976499724baddce7b
> Author: Alexey Kuznetsov
> Date: Sat Mar 24 12:52:16 2007 -0700
>
> [NET]: Fix neighbour destructor handling.
>
> ->neigh_destructor() is killed (not used), replaced with
> ->neigh_cleanup(), which is called when neighbor entry goes to dead
> state. At this point everything is still valid: neigh->dev,
> neigh->parms etc.
>
> The device should guarantee that dead neighbor entries (neigh->dead !=
> 0) do not get private part initialized, otherwise nobody will cleanup
> it.
>
> I think this is enough for ipoib which is the only user of this thing.
> Initialization private part of neighbor entries happens in ipib
> start_xmit routine, which is not reached when device is down. But it
> would be better to add explicit test for neigh->dead in any case.
>
> Signed-off-by: David S. Miller
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 0741c6d..f2a40ae 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -814,7 +814,7 @@ static void ipoib_set_mcast_list(struct net_device *dev)
>  		queue_work(ipoib_workqueue, &priv->restart_task);
>  }
>
> -static void ipoib_neigh_destructor(struct neighbour *n)
> +static void ipoib_neigh_cleanup(struct neighbour *n)
>  {
>  	struct ipoib_neigh *neigh;
>  	struct ipoib_dev_priv *priv = netdev_priv(n->dev);
> @@ -822,7 +822,7 @@ static void ipoib_neigh_destructor(struct neighbour *n)
>  	struct ipoib_ah *ah = NULL;
>
>  	ipoib_dbg(priv,
> -		  "neigh_destructor for %06x " IPOIB_GID_FMT "\n",
> +		  "neigh_cleanup for %06x " IPOIB_GID_FMT "\n",
>  		  IPOIB_QPN(n->ha),
>  		  IPOIB_GID_RAW_ARG(n->ha + 4));
>
> @@ -874,7 +874,7 @@ void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh)
>
>  static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms)
>  {
> -	parms->neigh_destructor = ipoib_neigh_destructor;
> +	parms->neigh_cleanup = ipoib_neigh_cleanup;
>
>  	return 0;
>  }
> diff --git a/include/net/neighbour.h b/include/net/neighbour.h
> index 3725b93..ad7fe11 100644
> --- a/include/net/neighbour.h
> +++ b/include/net/neighbour.h
> @@ -36,7 +36,7 @@ struct neigh_parms
>  	struct net_device *dev;
>  	struct neigh_parms *next;
>  	int (*neigh_setup)(struct neighbour *);
> -	void (*neigh_destructor)(struct neighbour *);
> +	void (*neigh_cleanup)(struct neighbour *);
>  	struct neigh_table *tbl;
>
>  	void *sysctl_table;
> diff --git a/net/atm/clip.c b/net/atm/clip.c
> index ebb5d0c..8c38258 100644
> --- a/net/atm/clip.c
> +++ b/net/atm/clip.c
> @@ -261,14 +261,6 @@ static void clip_pop(struct atm_vcc *vcc, struct sk_buff *skb)
>  	spin_unlock_irqrestore(&PRIV(dev)->xoff_lock, flags);
>  }
>
> -static void clip_neigh_destroy(struct neighbour *neigh)
> -{
> -	DPRINTK("clip_neigh_destroy (neigh %p)\n", neigh);
> -	if (NEIGH2ENTRY(neigh)->vccs)
> -		printk(KERN_CRIT "clip_neigh_destroy: vccs != NULL !!!\n");
> -	NEIGH2ENTRY(neigh)->vccs = (void *) NEIGHBOR_DEAD;
> -}
> -
>  static void clip_neigh_solicit(struct neighbour *neigh, struct sk_buff *skb)
>  {
>  	DPRINTK("clip_neigh_solicit (neigh %p, skb %p)\n", neigh, skb);
> @@ -342,7 +334,6 @@ static struct neigh_table clip_tbl = {
>  	/* parameters are copied from ARP ... */
>  	.parms = {
>  		.tbl = &clip_tbl,
> -		.neigh_destructor = clip_neigh_destroy,
>  		.base_reachable_time = 30 * HZ,
>  		.retrans_time = 1 * HZ,
>  		.gc_staletime = 60 * HZ,
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 3183142..cfc6001 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -140,6 +140,8 @@ static int neigh_forced_gc(struct neigh_table *tbl)
>  			n->dead = 1;
>  			shrunk = 1;
>  			write_unlock(&n->lock);
> +			if (n->parms->neigh_cleanup)
> +				n->parms->neigh_cleanup(n);
>  			neigh_release(n);
>  			continue;
>  		}
> @@ -211,6 +213,8 @@ static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev)
>  				NEIGH_PRINTK2("neigh %p is stray.\n", n);
>  			}
>  			write_unlock(&n->lock);
> +			if (n->parms->neigh_cleanup)
> +				n->parms->neigh_cleanup(n);
>  			neigh_release(n);
>  		}
>  	}
> @@ -582,9 +586,6 @@ void neigh_destroy(struct neighbour *neigh)
>  		kfree(hh);
>  	}
>
> -	if (neigh->parms->neigh_destructor)
> -		(neigh->parms->neigh_destructor)(neigh);
> -
>  	skb_queue_purge(&neigh->arp_queue);
>
>  	dev_put(neigh->dev);
> @@ -675,6 +676,8 @@ static void neigh_periodic_timer(unsigned long arg)
>  			*np = n->next;
>  			n->dead = 1;
>  			write_unlock(&n->lock);
> +			if (n->parms->neigh_cleanup)
> +				n->parms->neigh_cleanup(n);
>  			neigh_release(n);
>  			continue;
>  		}
> @@ -2088,8 +2091,11 @@ void __neigh_for_each_release(struct neigh_table *tbl,
>  			} else
>  				np = &n->next;
>  			write_unlock(&n->lock);
> -			if (release)
> +			if (release) {
> +				if (n->parms->neigh_cleanup)
> +					n->parms->neigh_cleanup(n);
>  				neigh_release(n);
> +			}
>  		}
>  	}
>  }

From steiner at sgi.com Wed May 14 06:15:32 2008
From: steiner at sgi.com (Jack Steiner)
Date: Wed, 14 May 2008 08:15:32 -0500
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <1210743839.8297.55.camel@pasglop>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080507234521.GN8276@duo.random> <20080508013459.GS8276@duo.random> <200805132214.27510.nickpiggin@yahoo.com.au> <1210743839.8297.55.camel@pasglop>
Message-ID: <20080514131531.GA10393@sgi.com>

On Tue, May 13, 2008 at 10:43:59PM -0700, Benjamin Herrenschmidt wrote:
> On Tue, 2008-05-13 at 22:14 +1000, Nick Piggin wrote:
> > ea.
> >
> > I don't see why you're bending over so far backwards to accommodate
> > this GRU thing that we don't even have numbers for and could actually
> > potentially be batched up in other ways (eg. using mmu_gather or
> > mmu_gather-like idea).
>
> I agree, we're better off generalizing the mmu_gather batching
> instead...

Unfortunately, we are at least several months away from being able to provide numbers to justify batching - assuming it is really needed. We need large systems running real user workloads. I wish we had that available right now, but we don't.

It also depends on what you mean by "no batching". If you mean that the notifier gets called for each pte that is removed from the page table, then the overhead is clearly very high for some operations. Consider the unmap of a very large object. A TLB flush per page will be too costly. However, something based on the mmu_gather seems like it should provide exactly what is needed to do efficient flushing of the TLB.

The GRU does not require that it be called in a sleepable context. As long as the notifier callout provides the mmu_gather and vaddr range being flushed, the GRU can efficiently do the rest.
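One possible shape for the callout just described (hypothetical: these names do not exist in any kernel; the point is only that a single call per flushed range suffices for an external TLB):

	#include <linux/kernel.h>
	#include <linux/mm_types.h>

	/* Hypothetical hook: invoked from the mmu_gather flush path, after
	 * the ptes are cleared and alongside the CPU TLB flush, with the
	 * virtual address range the batch covered. No sleeping required. */
	struct mmu_range_ops {
		void (*invalidate_range)(struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end);
	};

	/* What a GRU-style driver might plug in: one external-TLB purge
	 * per batch, rather than one per pte. */
	static void gru_invalidate_range(struct mm_struct *mm,
					 unsigned long start, unsigned long end)
	{
		pr_debug("purge external TLB for [%#lx, %#lx)\n", start, end);
		/* ... write the range to the flush hardware here ... */
	}

	static struct mmu_range_ops gru_range_ops = {
		.invalidate_range = gru_invalidate_range,
	};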
> > I had some never-finished patches to use the mmu_gather for pretty much > everything except single page faults, tho various subtle differences > between archs and lack of time caused me to let them gather dust and > not finish them... > > I can try to dig some of that out when I'm back from my current travel, > though it's probably worth re-doing from scratch now. > > Ben. > -- jack From okir at lst.de Wed May 14 06:16:00 2008 From: okir at lst.de (Olaf Kirch) Date: Wed, 14 May 2008 15:16:00 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: <482A17FC.7070804@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805132024.12741.okir@lst.de> <482A17FC.7070804@oracle.com> Message-ID: <200805141516.01908.okir@lst.de> On Wednesday 14 May 2008 00:36:44 Richard Frank wrote: > Olaf, if / when you have this running for IB - let me know - I think we > can give it to some folks at Oracle who will be able to tell us if there is > any performance regression using TPCH.. especially if we have it in the > next week or so.. as I think we have a config to test with. I'll give it a try. It's not complete, and still somewhat brittle. As the code was never designed with transport level flow control in mind, there are a few things I need to move around that tend to sparkle and emit bits of smoke when you touch them the wrong way :) I'll let you know as soon as I have something for you to test. Olaf > > Olaf Kirch wrote: > > On Tuesday 13 May 2008 20:08:46 Steve Wise wrote: > > > >>> No, not in the long term. But let's hold off on the flow control stuff > >>> for a little - I would first like to finish my patch set and hand it > >>> out for you folks to bang on it, rather than the other way round. > >>> Okay with you guys? > >>> > >>> > >> What patch set? > >> > > > > I mentioned in a previous mail to Jon that I have some partial patches > > that implement flow control. I want to get that code out to you ASAP; > > I think that's easier than having two different approaches that need > > to be reconciled afterwards. > > > > > >> We can't run on chelsio's rnic with fmrs... > >> > > > > Yes, that is understood. > > > > Olaf > > > -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From eli at dev.mellanox.co.il Wed May 14 06:38:10 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 14 May 2008 16:38:10 +0300 Subject: [ofa-general] Re: IB/ipoib: Fix neigh destructor oops for kernels older than 2.6.21 In-Reply-To: <482AE07B.8040501@voltaire.com> References: <1210765031.15669.285.camel@mtls03> <482AE07B.8040501@voltaire.com> Message-ID: <1210772290.20499.5.camel@mtls03> On Wed, 2008-05-14 at 15:52 +0300, Or Gerlitz wrote: > Eli Cohen wrote: > > IB/ipoib: Fix neigh destructor oops > > > > For kernels 2.6.20 and older, it may happen that ipoib_neigh_cleanup() is > > called through a stale pointer after IPoIB has been unloaded, > > causing a kernel oops. This problem has been fixed for 2.6.21 with > > the following commit: ecbb416939da77c0d107409976499724baddce7b > Hi Eli, > > Before looking into the solution, I'd like to slow down a little and > understand the problem (how can ipoib_neigh_cleanup() be called after IPoIB > has been unloaded) and why the commit below solves it; from its change-log I don't see any reference to this problem: > > > commit ecbb416939da77c0d107409976499724baddce7b > > Author: Alexey Kuznetsov > > Date: Sat Mar 24 12:52:16 2007 -0700 > > I add to the thread the author of the commit.
I don't know this code well enough to give an explanation, but for kernels following this commit I don't get these failures. Perhaps someone else can comment on this. From ogerlitz at voltaire.com Wed May 14 06:40:12 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 14 May 2008 16:40:12 +0300 Subject: [ofa-general] Re: IB/ipoib: Fix neigh destructor oops for kernels older than 2.6.21 In-Reply-To: <1210772290.20499.5.camel@mtls03> References: <1210765031.15669.285.camel@mtls03> <482AE07B.8040501@voltaire.com> <1210772290.20499.5.camel@mtls03> Message-ID: <482AEBBC.10803@voltaire.com> Eli Cohen wrote: > I add to the thread the author of the commit. I don't know this code > well enough to give an explanation, but for kernels following this commit > I don't get these failures. Perhaps someone else can comment on this. > Even before understanding what Alexey's patch is doing, can you explain why the ipoib neighbour destructor callback is called after the ipoib module has been unloaded? Is it because the stack did this call on a neighbour created by another device such as loopback etc? Or. From eli at dev.mellanox.co.il Wed May 14 06:52:10 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 14 May 2008 16:52:10 +0300 Subject: [ofa-general] Re: IB/ipoib: Fix neigh destructor oops for kernels older than 2.6.21 In-Reply-To: <482AEBBC.10803@voltaire.com> References: <1210765031.15669.285.camel@mtls03> <482AE07B.8040501@voltaire.com> <1210772290.20499.5.camel@mtls03> <482AEBBC.10803@voltaire.com> Message-ID: <1210773130.20499.11.camel@mtls03> On Wed, 2008-05-14 at 16:40 +0300, Or Gerlitz wrote: > Eli Cohen wrote: > > I add to the thread the author of the commit. I don't know this code > > well enough to give an explanation, but for kernels following this commit > > I don't get these failures. Perhaps someone else can comment on this. > > > Even before understanding what Alexey's patch is doing, can you explain > why the ipoib neighbour destructor callback is called after the ipoib > module has been unloaded? Is it because the stack did this call on a > neighbour created by another device such as loopback etc? > That could be one reason for this. And it could be that the kernel does not guarantee that all the neighbour destructors of the interface get called before the interface is stopped (and the module is unloaded).
From akepner at sgi.com Wed May 14 07:05:46 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Wed, 14 May 2008 07:05:46 -0700 Subject: [ofa-general] [PATCH] IPoIB: keep ib_wc[] on stack In-Reply-To: <1210749948.15669.268.camel@mtls03> References: <20080514012146.GG29302@sgi.com> <1210749948.15669.268.camel@mtls03> Message-ID: <20080514140546.GK29302@sgi.com> On Wed, May 14, 2008 at 10:25:48AM +0300, Eli Cohen wrote: > On Tue, 2008-05-13 at 18:21 -0700, akepner at sgi.com wrote: > > We're getting panics like this one on big clusters: > > > > skb_over_panic: text:ffffffff8821f32e len:160 put:100 head:ffff810372b0f000 data:ffff810372b0f01c tail:ffff810372b0f0bc end:ffff810372b0f080 dev:ib0 > > RX SKBs are large enough to contain 100 bytes... this looks like > corruption. Exactly. > Can you give more information on OS, kernel version, OFED > version. SUSE Linux Enterprise Server 10 SP1 (x86_64) - Kernel 2.6.16.46-0.12-smp OFED 1.3 GA > ..... > From NAPI_HOWTO.txt, although the file has been removed, I think the > statement is still valid: > > -Guarantee: Only one CPU at any time can call dev->poll(); this is > because only one CPU can pick the initial interrupt and hence the > initial netif_rx_schedule(dev); > Yes, you're correct. I missed the use of the __LINK_STATE_RX_SCHED bit in __netif_rx_schedule_prep()/netif_rx_complete() that serializes this. (Roland also pointed this out to me.) -- Arthur From eli at dev.mellanox.co.il Wed May 14 07:22:46 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 14 May 2008 17:22:46 +0300 Subject: [ofa-general] [PATCH] IPoIB: keep ib_wc[] on stack In-Reply-To: <20080514140546.GK29302@sgi.com> References: <20080514012146.GG29302@sgi.com> <1210749948.15669.268.camel@mtls03> <20080514140546.GK29302@sgi.com> Message-ID: <1210774966.23636.3.camel@mtls03> On Wed, 2008-05-14 at 07:05 -0700, akepner at sgi.com wrote: > On Wed, May 14, 2008 at 10:25:48AM +0300, Eli Cohen wrote: > > > On Tue, 2008-05-13 at 18:21 -0700, akepner at sgi.com wrote: > > > We're getting panics like this one on big clusters: > > > > > > skb_over_panic: text:ffffffff8821f32e len:160 put:100 head:ffff810372b0f000 data:ffff810372b0f01c tail:ffff810372b0f0bc end:ffff810372b0f080 dev:ib0 > > > > RX SKBs are large enough to contain 100 bytes... this looks like > > corruption. > > Exactly. One thing that can help discover memory corruptions and other bugs is to use a debug kernel. Is it possible that you will configure a few nodes with a debug kernel? From akepner at sgi.com Wed May 14 07:23:43 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Wed, 14 May 2008 07:23:43 -0700 Subject: [ofa-general] [PATCH] IPoIB: keep ib_wc[] on stack In-Reply-To: <1210774966.23636.3.camel@mtls03> References: <20080514012146.GG29302@sgi.com> <1210749948.15669.268.camel@mtls03> <20080514140546.GK29302@sgi.com> <1210774966.23636.3.camel@mtls03> Message-ID: <20080514142343.GM29302@sgi.com> On Wed, May 14, 2008 at 05:22:46PM +0300, Eli Cohen wrote: > .... > One thing that can help discover memory corruptions and other bugs is to > use a debug kernel. Is it possible that you will configure a few nodes > with a debug kernel? > Yes, we can certainly do that. It may take some time, because this bug (like lots of others) is only seen on very large systems, and scheduling test/debug time on these systems isn't easy.
-- Arthur From weiny2 at llnl.gov Wed May 14 07:59:47 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 14 May 2008 07:59:47 -0700 Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.) In-Reply-To: <1210764244.2026.728.camel@hrosenstock-ws.xsigo.com> References: <20080423133816.6c1b6315.weiny2@llnl.gov> <48109087.6030606@voltaire.com> <20080424143125.2aad1db8.weiny2@llnl.gov> <15ddcffd0804241523p19559580vc3a1293c1fe097b1@mail.gmail.com> <20080424181657.28d58a29.weiny2@llnl.gov> <20080514000247.GL21414@sashak.voltaire.com> <1210764244.2026.728.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080514075947.1e0c3b53.weiny2@llnl.gov> Yea I guess it should be. Ira On Wed, 14 May 2008 04:24:04 -0700 Hal Rosenstock wrote: > On Wed, 2008-05-14 at 00:02 +0000, Sasha Khapyorsky wrote: > > On 18:16 Thu 24 Apr , Ira Weiny wrote: > > > > > > From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001 > > > From: Ira K. Weiny > > > Date: Thu, 24 Apr 2008 18:05:01 -0700 > > > Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit > > > > > > > > > Signed-off-by: Ira K. Weiny > > > > Applied. Thanks. > > Would this change also be applied to ofed_1_3 branch ? > > -- Hal > > > Sasha > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From torvalds at linux-foundation.org Wed May 14 08:18:21 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 14 May 2008 08:18:21 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080514112625.GY9878@sgi.com> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> Message-ID: On Wed, 14 May 2008, Robin Holt wrote: > > Are you suggesting the sending side would not need to sleep or the > receiving side? One thing to realize is that most of the time (read: pretty much *always*) when we have the problem of wanting to sleep inside a spinlock, the solution is actually to just move the sleeping to outside the lock, and then have something else that serializes things. That way, the core code (protected by the spinlock, and in all the hot paths) doesn't sleep, but the special case code (that wants to sleep) can have some other model of serialization that allows sleeping, and that includes as a small part the spinlocked region. I do not know how XPMEM actually works, or how you use it, but it seriously sounds like that is how things *should* work. And yes, that probably means that the mmu-notifiers as they are now are simply not workable: they'd need to be moved up so that they are inside the mmap semaphore but not the spinlocks. Can it be done? I don't know. But I do know that I'm unlikely to accept a noticeable slowdown in some very core code for a case that affects about 0.00001% of the population. In other words, I think you *have* to do it. 
Linus From kuznet at ms2.inr.ac.ru Wed May 14 08:30:00 2008 From: kuznet at ms2.inr.ac.ru (Alexey Kuznetsov) Date: Wed, 14 May 2008 19:30:00 +0400 Subject: [ofa-general] Re: IB/ipoib: Fix neigh destructor oops for kernels older than 2.6.21 In-Reply-To: <482AEBBC.10803@voltaire.com> References: <1210765031.15669.285.camel@mtls03> <482AE07B.8040501@voltaire.com> <1210772290.20499.5.camel@mtls03> <482AEBBC.10803@voltaire.com> Message-ID: <20080514153000.GA23220@ms2.inr.ac.ru> Hello! On Wed, May 14, 2008 at 04:40:12PM +0300, Or Gerlitz wrote: > Eli Cohen wrote: > >I add to the thread the author of the commit. I don't know this code > >well enough to give an explanation, but for kernels following this commit > >I don't get these failures. Perhaps someone else can comment on this. > > > Even before understanding what Alexey's patch is doing, can you explain > why the ipoib neighbour destructor callback is called after the ipoib > module has been unloaded? Is it because the stack did this call on a > neighbour created by another device such as loopback etc? Look at the thread "Subject: dst_ifdown breaks infiniband?" in netdev or lkml. In short, the problem is the following: * To unload a netdevice we must release all the references. * In particular, we force release of the neighbour references, redirecting stale neighbour entries to the loopback device. * However, the neighbour destructor still points to the ipoib device, and it is called after the device is unloaded (since we dropped the reference, unloading is possible). The observation was that the destructor is the only harmful thing, and that it is actually not used by anyone but ipoib. The patch gets rid of the destructor and introduces a cleanup callback, which is made once for each neighbour entry before invalidation (in particular, right before the device is unregistered) and is supposed to move the neighbour entry to a state where no calls into the device code are needed. Alexey From sashak at voltaire.com Wed May 14 08:47:46 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 May 2008 18:47:46 +0300 Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.) In-Reply-To: <20080514075947.1e0c3b53.weiny2@llnl.gov> References: <20080423133816.6c1b6315.weiny2@llnl.gov> <48109087.6030606@voltaire.com> <20080424143125.2aad1db8.weiny2@llnl.gov> <15ddcffd0804241523p19559580vc3a1293c1fe097b1@mail.gmail.com> <20080424181657.28d58a29.weiny2@llnl.gov> <20080514000247.GL21414@sashak.voltaire.com> <1210764244.2026.728.camel@hrosenstock-ws.xsigo.com> <20080514075947.1e0c3b53.weiny2@llnl.gov> Message-ID: <20080514154746.GF4616@sashak.voltaire.com> On 07:59 Wed 14 May , Ira Weiny wrote: > Yea I guess it should be. Ok. I applied this to the 1.3 branch too. Sasha From holt at sgi.com Wed May 14 09:22:24 2008 From: holt at sgi.com (Robin Holt) Date: Wed, 14 May 2008 11:22:24 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> Message-ID: <20080514162223.GZ9878@sgi.com> On Wed, May 14, 2008 at 08:18:21AM -0700, Linus Torvalds wrote: > > > On Wed, 14 May 2008, Robin Holt wrote: > > > > Are you suggesting the sending side would not need to sleep or the > > receiving side?
> > One thing to realize is that most of the time (read: pretty much *always*) > when we have the problem of wanting to sleep inside a spinlock, the > solution is actually to just move the sleeping to outside the lock, and > then have something else that serializes things. > > That way, the core code (protected by the spinlock, and in all the hot > paths) doesn't sleep, but the special case code (that wants to sleep) can > have some other model of serialization that allows sleeping, and that > includes as a small part the spinlocked region. > > I do not know how XPMEM actually works, or how you use it, but it > seriously sounds like that is how things *should* work. And yes, that > probably means that the mmu-notifiers as they are now are simply not > workable: they'd need to be moved up so that they are inside the mmap > semaphore but not the spinlocks. We are in the process of attempting this now. Unfortunately for SGI, Christoph is on vacation right now so we have been trying to work it internally. We are looking at two possible methods: in one, we add a callout to the tlb flush paths for both the mmu_gather and flush_tlb_page locations; in the other, we place a specific callout separate from the gather callouts in the paths we are concerned with. We will look at both more carefully before posting. In either implementation, not all call paths would require the stall to ensure data integrity. Would it be acceptable to always put a sleepable stall in even if the code path did not require the pages be unwritable prior to continuing? If we did that, I would be freed from having a pool of invalidate threads ready for XPMEM to use for that work. Maybe there is a better way, but the sleeping requirement we would have on the threads makes most options seem unworkable. Thanks, Robin From caitlin.bestler at neterion.com Wed May 14 09:24:16 2008 From: caitlin.bestler at neterion.com (Caitlin Bestler) Date: Wed, 14 May 2008 09:24:16 -0700 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <482ADC2B.5080008@voltaire.com> References: <482A0F32.2010001@opengridcomputing.com> <482ADC2B.5080008@voltaire.com> Message-ID: <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> On Wed, May 14, 2008 at 5:33 AM, Or Gerlitz wrote: > Steve Wise wrote: >> >> Maybe this should really be implemented in the ULP that wants this >> behavior. IE the ULP could register for routing/neighbour changes and tear >> down connections and re-establish them on the correct device. >> > Hi Steve, > > First, registration for neighbour changes can't serve the purpose of > aligning RDMA traffic with the IP stack, for a bunch of reasons, among them: > > - for IB, no neighbour is created at the passive side of the unicast session > > - for unicast sessions, address resolution involves ARP but the neighbour > may be deleted by the kernel since the rdma traffic does not go through the > stack > > - for multicast sessions, no neighbour is created during address resolution > > Second, the rdma-cm does well in saving the ULP from interacting with the > network stack, that is, the ULP is not aware of the routing lookup / neighbour > / net device used for address resolution. In that spirit I prefer to add the > registration for net events at the low level (rdma-cm). > > Third, thanks for bringing up the point of route changes :) > > Or.
> > Perhaps one of the most fundamental differences for RDMA services versus the traditional socket interface is that RDMA services need to be bound to a specific device. When establishing a connection (or flow) the application needs to select which device to use. Traditional socket applications do not need to do this, but rdma-cm seems to be an acceptable solution. The trickier problem is the one you raise on migrating a connection or flow when IP routing is reconfigured. To a classic socket application, each IP datagram generated is sent according to the current routing tables. A connection or flow is not sticky. An RDMA connection (or IB UD flow) is sticky. The question is how sticky should it be. If it is too sticky the application may have to wait for a time-out, or be stuck using an inferior path after the primary path is restored. These are obviously undesirable. But what you have not addressed is how this compares with the cost of forcing the application session to shift connections even when the inferior path would have been acceptable. Is it not true that the lower performance of an inferior path may be preferable to the cost of tearing down and recreating a connection (and its associated protection domains and memory regions)? Because of those costs I can only see two options: 1) Merely enable the application to know when there has been a significant change in IP routing. If the current services are inadequate for this purpose then extend those rather than do an automatic connection teardown/rebuild. 2) Reduce the cost of connection teardown/rebuild by offering an option to "pre-bind" two RDMA devices so that memory registrations will be valid on both. This probably requires device level co-operation on L-Key/STag allocation, but it would be reasonable feature to consider for the High Availability market. But making automatic connection teardown a standard feature is not the best solution. From worleys at gmail.com Wed May 14 09:26:08 2008 From: worleys at gmail.com (Chris Worley) Date: Wed, 14 May 2008 10:26:08 -0600 Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others In-Reply-To: <482AC510.3090602@dev.mellanox.co.il> References: <482AC510.3090602@dev.mellanox.co.il> Message-ID: On Wed, May 14, 2008 at 4:55 AM, Vladimir Sokolovsky wrote: > Chris Worley wrote: >> >> In two 1.3 builds I get different SET_IPOIB_CM settings in >> /etc/infiniband/openib.conf. >> >> A generic build sets it to "yes". A kitchen-sink build doesn't set it. >> >> Is there a reason (as I need it to be enabled on a system that needs >> the kitchen-sink build)? >> >> Thanks, >> >> Chris > > The default mode for IPoIB CM in OFED-1.3 (/etc/infiniband/openib.conf) is: > SET_IPOIB_CM=yes > > It was different (SET_IPOIB_CM=no) in OFED-1.2 between Thu Mar 29 16:57:22 > 2007 and Wed Apr 4 10:41:09 2007 (before OFED-1.2-rc1). > > Can you point me the OFED-1.3 build where SET_IPOIB_CM is set to "no"? Ahhh... it was probably because I added the RPMs w/o deleting 1.2.5.5 in the "kitchen sink" build. Is there any reason to NOT use connected mode? 
Thanks, Chris > > Regards, > Vladimir > From rdreier at cisco.com Wed May 14 09:33:44 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 09:33:44 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get fixes for various low-level HW driver issues: - nes bugs with LRO module parameter - cxgb3 bug in handling flushing on connection teardown - ipath miscellaneous issues (kind of big but look like needed fixes) Pavel Emelyanov (1): IB/ipath: Make ipath_portdata work with struct pid * not pid_t Ralph Campbell (3): IB/ipath: Fix RC and UC error handling IB/ipath: Fix many locking issues when switching to error state IB/ipath: Fix RDMA read response sequence checking Roland Dreier (2): RDMA/nes: Fix up nes_lro_max_aggr module parameter IB/ipath: Change ipath_devdata.ipath_sdma_status to be unsigned long Steve Wise (1): RDMA/cxgb3: Wrap the software send queue pointer as needed on flush drivers/infiniband/hw/cxgb3/cxio_hal.c | 4 +- drivers/infiniband/hw/ipath/ipath_driver.c | 20 +- drivers/infiniband/hw/ipath/ipath_file_ops.c | 19 +- drivers/infiniband/hw/ipath/ipath_kernel.h | 10 +- drivers/infiniband/hw/ipath/ipath_qp.c | 237 ++++++++----------- drivers/infiniband/hw/ipath/ipath_rc.c | 285 +++++++++++----------- drivers/infiniband/hw/ipath/ipath_ruc.c | 329 ++++++++++++++---------- drivers/infiniband/hw/ipath/ipath_uc.c | 57 +++-- drivers/infiniband/hw/ipath/ipath_ud.c | 66 ++++-- drivers/infiniband/hw/ipath/ipath_user_sdma.h | 2 - drivers/infiniband/hw/ipath/ipath_verbs.c | 176 +++++++++----- drivers/infiniband/hw/ipath/ipath_verbs.h | 64 ++++- drivers/infiniband/hw/nes/nes.c | 4 - drivers/infiniband/hw/nes/nes.h | 1 - drivers/infiniband/hw/nes/nes_hw.c | 6 +- 15 files changed, 725 insertions(+), 555 deletions(-) From rdreier at cisco.com Wed May 14 09:36:30 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 09:36:30 -0700 Subject: [ofa-general] RE: [PATCH 1/1] infiniband/hw/nes/: avoid unnecessary memset In-Reply-To: <5E701717F2B2ED4EA60F87C8AA57B7CC080BD82B@venom2> (Faisal Latif's message of "Tue, 13 May 2008 16:46:47 -0500") References: <20080512213601.626C91C0008F@mwinf2103.orange.fr> <5E701717F2B2ED4EA60F87C8AA57B7CC080BD82B@venom2> Message-ID: Thanks guys, applied for 2.6.27 From sean.hefty at intel.com Wed May 14 09:40:00 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 14 May 2008 09:40:00 -0700 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> References: <482A0F32.2010001@opengridcomputing.com> <482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> Message-ID: <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> > 1) Merely enable the application to know when there has been > a significant change in IP routing. If the current services are > inadequate for this purpose then extend those rather than > do an automatic connection teardown/rebuild. This is my current preferred solution. I don't have an issue with the rdma_cm issuing some sort of notification event when an IP address mapping changes. I would use an event name that indicated this, rather than 'disconnect'.
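To illustrate what such a notification could look like from the ULP side, here is a minimal sketch of an rdma_cm event handler consuming it. RDMA_CM_EVENT_ADDR_CHANGE is a placeholder name for the not-yet-defined event, and the ulp_* helpers stand in for whatever policy the ULP implements; neither exists in the rdma_cm as of this thread:

#include <rdma/rdma_cm.h>

/* ULP-defined policy hooks; assumed for this sketch. */
void ulp_schedule_reconnect(void *ulp_ctx);
void ulp_handle_disconnect(void *ulp_ctx);

static int ulp_cm_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
{
	switch (event->event) {
	case RDMA_CM_EVENT_ADDR_CHANGE:	/* hypothetical new event */
		/*
		 * The IP-to-device mapping moved; the ULP alone decides
		 * whether to tear down and reconnect, or to keep using
		 * the now-inferior path.
		 */
		ulp_schedule_reconnect(id->context);
		break;
	case RDMA_CM_EVENT_DISCONNECTED:
		ulp_handle_disconnect(id->context);
		break;
	default:
		break;
	}
	return 0;
}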
If this is implemented, I'd like to minimize the overhead per rdma_cm_id required to report this event. - Sean From torvalds at linux-foundation.org Wed May 14 09:56:18 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 14 May 2008 09:56:18 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080514162223.GZ9878@sgi.com> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080514162223.GZ9878@sgi.com> Message-ID: On Wed, 14 May 2008, Robin Holt wrote: > > Would it be acceptable to always put a sleepable stall in even if the > code path did not require the pages be unwritable prior to continuing? > If we did that, I would be freed from having a pool of invalidate > threads ready for XPMEM to use for that work. Maybe there is a better > way, but the sleeping requirement we would have on the threads makes most > options seem unworkable. I'm not understanding the question. If you can do your management outside of the spinlocks, then you can obviously do whatever you want, including sleeping. It's changing the existing spinlocks to be sleepable that is not acceptable, because it's such a performance problem. Linus From ruimario at gmail.com Wed May 14 09:59:41 2008 From: ruimario at gmail.com (Rui Machado) Date: Wed, 14 May 2008 18:59:41 +0200 Subject: [ofa-general] timeout question Message-ID: <6978b4af0805140959h53a319f5s713e4084698fe077@mail.gmail.com> Hi all, when setting the timeout in a struct ibv_qp_attr, this value corresponds to the Local ACK Timeout, which according to the InfiniBand spec defines the transport timer timeout via the formula 4.096 us * 2^(Local ACK Timeout). Is this right? And is there a value for this timeout that is considered "good practice"? Also, in a client-server setup, if this timeout is set to a "big value" (like 30) when the server dies, the client will take that amount of time to realize the failure. Is this correct? Thank you for the help. Rui From clameter at sgi.com Wed May 14 10:57:08 2008 From: clameter at sgi.com (Christoph Lameter) Date: Wed, 14 May 2008 10:57:08 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> Message-ID: On Wed, 14 May 2008, Linus Torvalds wrote: > One thing to realize is that most of the time (read: pretty much *always*) > when we have the problem of wanting to sleep inside a spinlock, the > solution is actually to just move the sleeping to outside the lock, and > then have something else that serializes things. The problem is that the code in rmap.c try_to_unmap() and friends loops over reverse maps after taking a spinlock. The mm_struct is only known after the rmap has been accessed. This means *inside* the spinlock. That is why I tried to convert the locks used to scan the reverse maps into semaphores. If that is done then one can indeed do the callouts outside of atomic contexts. > Can it be done? I don't know. But I do know that I'm unlikely to accept a > noticeable slowdown in some very core code for a case that affects about > 0.00001% of the population. In other words, I think you *have* to do it.
With larger numbers of processors, semaphores make a lot of sense since the holdoff times on spinlocks will increase. If we go to sleep then the processor can do something useful instead of hogging a cacheline. A rw lock there can also increase concurrency during reclaim, especially if the anon_vma chains are long and the number of address spaces mapping a page is high. From torvalds at linux-foundation.org Wed May 14 11:27:14 2008 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Wed, 14 May 2008 11:27:14 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> Message-ID: On Wed, 14 May 2008, Christoph Lameter wrote: > > The problem is that the code in rmap.c try_to_unmap() and friends loops > over reverse maps after taking a spinlock. The mm_struct is only known > after the rmap has been accessed. This means *inside* the spinlock. So you queue them. That's what we do with things like the dirty bit. We need to hold various spinlocks to look up pages, but then we can't actually call the filesystem with the spinlock held. Converting a spinlock to a waiting lock for things like that is simply not acceptable. You have to work with the system. Yeah, there's only a single bit worth of information on whether a page is dirty or not, so "queueing" that information is trivial (it's just the return value from "page_mkclean_file()"). Some things are harder than others, and I suspect you need some kind of "gather" structure to queue up all the vma's that can be affected. But it sounds like for the case of rmap, the approach is: - the page lock is the higher-level "sleeping lock" (which makes sense, since this is very close to an IO event, and that is what the page lock is generally used for). But hey, it could be anything else - maybe you have some other even bigger lock to allow you to handle lots of pages in one go. - with that lock held, you do the whole rmap dance (which requires spinlocks) and gather up the vma's and the struct mm's involved. - outside the spinlocks you then do whatever it is you need to do. This doesn't sound all that different from TLB shoot-down in SMP, and the "mmu_gather" structure. Now, admittedly we can do the TLB shoot-down while holding the spinlocks, but if we couldn't that's how we'd still do it: it would get more involved (because we'd need to guarantee that the gather can hold *all* the pages - right now we can just flush in the middle if we need to), but it wouldn't be all that fundamentally different. And no, I really haven't even wanted to look at what XPMEM really needs to do, so maybe the above thing doesn't work for you, and you have other issues. I'm just pointing you in a general direction, not trying to say "this is exactly how to get there". Linus From swise at opengridcomputing.com Wed May 14 12:05:32 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 14:05:32 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. Message-ID: <20080514190532.28544.41595.stgit@dell3.ogc.int> The following patch proposes the API and core changes needed to implement the IB BMME and iWARP equivalent memory extensions. Please review these vs the verbs specs and see what I've missed. This patch is a request for comments and hasn't even been compiled... Steve.
----- RDMA: New Memory Extensions. Support for the IB BMME and iWARP equivalent memory extensions to non-shared memory regions. This includes: - allocation of an ib_mr for use in fast register work requests - device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (like via dma_alloc_coherent). - fast register memory region work request - invalidate local memory region work request - read with invalidate local memory region work request (iWARP only) Design details: - New device capability flag added: IB_DEVICE_MEMORY_EXTENSIONS indicates device support for this feature. - New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request. - New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr. - New API function, ib_alloc_mr() used to allocate fast_reg memory regions. - New API function, ib_alloc_fast_reg_page_list to allocate device-specific page lists. - New API function, ib_free_fast_reg_page_list to free said page lists. Usage Model: - MR allocated with ib_alloc_mr() - Page lists allocated via ib_alloc_fast_reg_page_list(). - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) - MR deallocated with ib_dereg_mr() - page lists deallocated via ib_free_fast_reg_page_list(). Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs via posting work requests to the send queue. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can be achieved while still allowing device-specific page_list processing. --- drivers/infiniband/core/verbs.c | 46 +++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 55 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 101 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 0504208..869be7d 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr) } EXPORT_SYMBOL(ib_dereg_mr); +struct ib_mr *ib_alloc_mr(struct ib_pd *pd, int pbl_depth, int remote_access) +{ + struct ib_mr *mr; + + if (!pd->device->alloc_mr) + return ERR_PTR(-ENOSYS); + + mr = pd->device->alloc_mr(pd, pbl_depth, remote_access); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + mr->uobject = NULL; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_alloc_mr); + +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + if (!device->alloc_fast_reg_page_list) + return ERR_PTR(-ENOSYS); + + page_list = device->alloc_fast_reg_page_list(device, page_list_len); + + if (!IS_ERR(page_list)) { + page_list->device = device; + page_list->page_list_len = page_list_len; + } + + return page_list; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_page_list); + +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) +{ + page_list->device->free_fast_reg_page_list(page_list); +} +EXPORT_SYMBOL(ib_free_fast_reg_page_list); + /* Memory windows */ struct ib_mw *ib_alloc_mw(struct ib_pd *pd) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..d6d9514 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -106,6 +106,7 @@ enum ib_device_cap_flags {
IB_DEVICE_UD_IP_CSUM = (1<<18), IB_DEVICE_UD_TSO = (1<<19), IB_DEVICE_SEND_W_INV = (1<<21), + IB_DEVICE_MM_EXTENSIONS = (1<<22), }; enum ib_atomic_cap { @@ -414,6 +415,8 @@ enum ib_wc_opcode { IB_WC_FETCH_ADD, IB_WC_BIND_MW, IB_WC_LSO, + IB_WC_FAST_REG_MR, + IB_WC_INVALIDATE_MR, /* * Set value of IB_WC_RECV so consumers can test if a completion is a * receive by testing (opcode & IB_WC_RECV). @@ -628,6 +631,9 @@ enum ib_wr_opcode { IB_WR_ATOMIC_FETCH_AND_ADD, IB_WR_LSO, IB_WR_SEND_WITH_INV, + IB_WR_FAST_REG_MR, + IB_WR_INVALIDATE_MR, + IB_WR_READ_WITH_INV, }; enum ib_send_flags { @@ -676,6 +682,17 @@ struct ib_send_wr { u16 pkey_index; /* valid for GSI only */ u8 port_num; /* valid for DR SMPs on switch only */ } ud; + struct { + u64 iova_start; + struct ib_fast_reg_page_list *page_list; + int fbo; + u32 length; + int access_flags; + struct ib_mr *mr; + } fast_reg; + struct { + struct ib_mr *mr; + } local_inv; } wr; }; @@ -1014,6 +1031,11 @@ struct ib_device { int (*query_mr)(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int (*dereg_mr)(struct ib_mr *mr); + struct ib_mr * (*alloc_mr)(struct ib_pd *pd, + int pbl_depth, + int remote_access); + struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device, int page_list_len); + void (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list); int (*rereg_phys_mr)(struct ib_mr *mr, int mr_rereg_mask, struct ib_pd *pd, @@ -1808,6 +1830,39 @@ int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int ib_dereg_mr(struct ib_mr *mr); /** + * ib_alloc_mr - Allocates memory region usable with the + * IB_WR_FAST_REG_MR send work request. + * @pd: The protection domain associated with the region. + * @pbl_depth: requested max physical buffer list size to be allocated. + * @remote_access: set to 1 if remote rdma operations are allowed. + */ +struct ib_mr *ib_alloc_mr(struct ib_pd *pd, int pbl_depth, + int remote_access); + +struct ib_fast_reg_page_list { + struct ib_device *device; + u64 *page_list; + int page_list_len; +}; + +/** + * ib_alloc_fast_reg_page_list - Allocates a page list array to be used + * in a IB_WR_FAST_REG_MR work request. The resources allocated by this method + * allows for dev-specific optimization of the FAST_REG operation. + * @device - ib device pointer. + * @page_list_len - depth of the page list array to be allocated. + */ +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int page_list_len); + +/** + * ib_free_fast_reg_page_list - Deallocates a previously allocated + * page list array. + * @page_list - struct ib_fast_reg_page_list pointer to be deallocated. + */ +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list); + +/** * ib_alloc_mw - Allocates a memory window. * @pd: The protection domain associated with the memory window. */ From dotanb at dev.mellanox.co.il Wed May 14 13:05:24 2008 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 14 May 2008 22:05:24 +0200 Subject: [ofa-general] ibv_get_cq_event blocking forever after successfulibv_post_send... In-Reply-To: References: <20070525212214.20500.qmail@station183.com> <532b813a0805121632p2ebccd11h80496f9b9c09e0c8@mail.gmail.com> <000301c8b48a$2200e670$465a180a@amr.corp.intel.com> <532b813a0805121703p76df78a8g51256d5bcdb7c330@mail.gmail.com> Message-ID: <482B4604.4010909@dev.mellanox.co.il> Roland Dreier wrote: > > This is a cut-paste error. I just extracted the relevant code from the > > actual piece. 
> > Unless you send an actual test app that someone could really compile and run, it's very hard to help debug it. Basically your only chance is if you have a really obvious bug that someone could see by reading your code. > There is a minor bug in the code: wc.opcode is valid ONLY if wc.status == IBV_WC_SUCCESS (otherwise, its value is undefined). But anyway, this shouldn't prevent you from getting the completion event. Do you call ibv_req_notify_cq BEFORE any completion is created (for example, right after the CQ is created) to request to get a completion event on the first completion? Dotan From swise at opengridcomputing.com Wed May 14 12:11:27 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 14:11:27 -0500 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> References: <482A0F32.2010001@opengridcomputing.com> <482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> Message-ID: <482B395F.8020201@opengridcomputing.com> An HTML attachment was scrubbed... URL: From dotanb at dev.mellanox.co.il Wed May 14 13:11:21 2008 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 14 May 2008 22:11:21 +0200 Subject: [ofa-general] timeout question In-Reply-To: <6978b4af0805140959h53a319f5s713e4084698fe077@mail.gmail.com> References: <6978b4af0805140959h53a319f5s713e4084698fe077@mail.gmail.com> Message-ID: <482B4769.1010307@dev.mellanox.co.il> Hi. Rui Machado wrote: > Hi all, > > when setting the timeout in a struct ibv_qp_attr, this value > corresponds to the Local ACK Timeout, which according to the InfiniBand > spec defines the transport timer timeout via the formula > 4.096 us * 2^(Local ACK Timeout). Is this right? > And is there a value for this timeout that is considered "good practice"? > This value depends on your fabric size, on the HCA you have, and some other factors. For example, a timeout value of 14 gives 4.096 us * 2^14, roughly 67 ms per retry. > Also, in a client-server setup, if this timeout is set to a "big > value" (like 30) when the server dies, the client will take that > amount of time to realize the failure. Is this correct? > Yes, after (at least) the calculated timeout multiplied by retry_count, the sender QP will get a retry-exceeded error (if a send request was posted without any response from the receiver). > Thank you for the help. > Dotan From rdreier at cisco.com Wed May 14 12:31:05 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 12:31:05 -0700 Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event In-Reply-To: <482A979F.6040305@Voltaire.COM> (Moni Shoua's message of "Wed, 14 May 2008 10:41:19 +0300") References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM> <4820638E.4030901@Voltaire.COM> <4827FBDF.9040308@Voltaire.COM> <482A979F.6040305@Voltaire.COM> Message-ID: > OK. Here is an example that was viewed in our tests. > One IPoIB host (client) sends a stream of multicast packets to another IPoIB host (server). > SM takeover event takes place during traffic and as a result multicast info is flushed > and the hosts need to rejoin. Without the patch there is a chance (which according to our experience > is a very big chance) that the request to rejoin will go to the old SM, and the join completes successfully > only after a retry. This takes too long and the patch solves it.
OK, that is fairly convincing (and it would have been nice to include when sending the original patch). Please resend a version that fixes the races in the patch and we can probably add this for 2.6.27. - R. From swise at opengridcomputing.com Wed May 14 12:39:36 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 14:39:36 -0500 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <482B395F.8020201@opengridcomputing.com> References: <482A0F32.2010001@opengridcomputing.com> <482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> <482B395F.8020201@opengridcomputing.com> Message-ID: <482B3FF8.4040102@opengridcomputing.com> Steve Wise wrote: > Sean Hefty wrote: >>> 1) Merely enable the application to know when there has been >>> a significant change in IP routing. If the current services are >>> inadequate for this purpose then extend those rather than >>> do an automatic connection teardown/rebuild. >>> >> >> This is my current preferred solution. >> >> > I agree. >> I don't have an issue with the rdma_cm issuing some sort of notification event >> when an IP address mapping changes. I would use an event name that indicated >> this, rather than 'disconnect'. >> >> If this is implemented, I'd like to minimize the overhead per rdma_cm_id >> required to report this event. >> >> > > Maybe instead of making this a cm_id event, we should add a concept of > rdma async events that aren't necessarily affiliated with any > particular cm_id? IE a new channel for these types. Then you can > post it once when a route changes that affects rdma devices, for > example... > As opposed to posting the event to every cm_id affected... From dotanba at gmail.com Wed May 14 13:45:20 2008 From: dotanba at gmail.com (Dotan Barak) Date: Wed, 14 May 2008 22:45:20 +0200 Subject: [ofa-general] Moving on Message-ID: <482B4F60.5060300@gmail.com> Hi, After more than seven years of having the fun of being part of the development and productizing of InfiniBand products, and especially their SW, I have decided to move on professionally. I'm proud that I had a part in the OpenFabrics project and I had the opportunity to learn from the best !!! Even though I won't be an employee of Mellanox anymore, I will still try to be involved in the SW development in OpenFabrics and give remarks/patches/support like I did before. I will still be available for questions/replies if needed at: dotanba at gmail.com thanks Dotan From swise at opengridcomputing.com Wed May 14 13:34:30 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 15:34:30 -0500 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <000b01c8af9f$0a3b79f0$bafc070a@amr.corp.intel.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <000b01c8af9f$0a3b79f0$bafc070a@amr.corp.intel.com> Message-ID: <482B4CD6.2090709@opengridcomputing.com> Sean Hefty wrote: > Thanks for looking at this. > >> Here is the top level API change I'm proposing for enabling interoperable >> peer2peer mode for iwarp. I want to get agreement on how to expose >> this to the application before posting more of the gritty details of >> the kernel driver changes needed. The plan is to include this support >> in linux-2.6.27 + ofed-1.4.
> > I don't have a better idea what to call this, but when I think of peer to peer, > I think of that as the connection model, not a channel usage restriction. > > I think I'll call it rtr_mode. That better describes it. The client side is sending a "ready to receive" message. And the server side holds off SQ processing until the RTR is received... From Arkady.Kanevsky at netapp.com Wed May 14 13:36:28 2008 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Wed, 14 May 2008 16:36:28 -0400 Subject: [ofa-general] Re: [PATCH] Request For Comments: In-Reply-To: <482B4CD6.2090709@opengridcomputing.com> References: <20080506170230.11409.43625.stgit@dell3.ogc.int> <000b01c8af9f$0a3b79f0$bafc070a@amr.corp.intel.com> <482B4CD6.2090709@opengridcomputing.com> Message-ID: But the difference is who generates the RTR message. It is not the user's job to deal with it. Thanks, Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Steve Wise [mailto:swise at opengridcomputing.com] > Sent: Wednesday, May 14, 2008 4:35 PM > To: Sean Hefty > Cc: rdreier at cisco.com; ewg at lists.openfabrics.org; > general at lists.openfabrics.org > Subject: [ofa-general] Re: [PATCH] Request For Comments: > > Sean Hefty wrote: > > Thanks for looking at this. > > > > > >> Here is the top level API change I'm proposing for enabling > >> interoperable peer2peer mode for iwarp. I want to get agreement on > >> how to expose this to the application before posting more of the > >> gritty details of the kernel driver changes needed. The plan is to > >> include this support in linux-2.6.27 + ofed-1.4. > >> > > > > I don't have a better idea what to call this, but when I think of peer > > to peer, I think of that as the connection model, not a channel usage restriction. > > > > > > I think I'll call it rtr_mode. That better describes it. > The client side is sending a "ready to receive" message. And > the server side holds off SQ processing until the RTR is received... > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Wed May 14 13:46:38 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 14 May 2008 13:46:38 -0700 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <482B3FF8.4040102@opengridcomputing.com> References: <482A0F32.2010001@opengridcomputing.com> <482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> <482B395F.8020201@opengridcomputing.com> <482B3FF8.4040102@opengridcomputing.com> Message-ID: <000001c8b603$9bb70880$8e248686@amr.corp.intel.com> >> Maybe instead of making this a cm_id event, we should add a concept of >> rdma async events that aren't necessarily affiliated with any >> particular cm_id? IE a new channel for these types. Then you can >> post it once when a route changes that affects rdma devices, for >> example... >> >As opposed to posting the event to every cm_id affected... I thought about this, and I agree that it's worth exploring. The locking to support device removal ended up being fairly complex.
(I'm not sure it would have been any easier for ULPs to do this though.) The main counter I see to using a separate channel is that device removal is invoked per rdma_cm_id, so there's precedent for invoking the callback per id. My expectation is that this is a rare event. - Sean From ralph.campbell at qlogic.com Wed May 14 15:56:06 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 14 May 2008 15:56:06 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <20080514190532.28544.41595.stgit@dell3.ogc.int> References: <20080514190532.28544.41595.stgit@dell3.ogc.int> Message-ID: <1210805766.3949.114.camel@brick.pathscale.com> Do we have any expected consumers for this interface? I would guess ib_srp, ib_iser as likely candidates. Detailed comments inline below. On Wed, 2008-05-14 at 14:05 -0500, Steve Wise wrote: > The following patch proposes the API and core changes needed to > implement the IB BMME and iWARP equivalent memory extensions. > > Please review these vs the verbs specs and see what I've missed. > > This patch is a request for comments and hasn't even been compiled... > > Steve. > > ----- > > RDMA: New Memory Extensions. > > Support for the IB BMME and iWARP equivalent memory extensions to > non-shared memory regions. This includes: > > - allocation of an ib_mr for use in fast register work requests > > - device-specific alloc/free of physical buffer lists for use in fast > register work requests. This allows devices to allocate this memory as > needed (like via dma_alloc_coherent). > > - fast register memory region work request > > - invalidate local memory region work request > > - read with invalidate local memory region work request (iWARP only) > > > Design details: > > - New device capability flag added: IB_DEVICE_MEMORY_EXTENSIONS indicates > device support for this feature. > > - New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request. > > - New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr. > > - New API function, ib_alloc_mr() used to allocate fast_reg memory > regions. > > - New API function, ib_alloc_fast_reg_page_list to allocate > device-specific page lists. > > - New API function, ib_free_fast_reg_page_list to free said page lists. > > > Usage Model: > > - MR allocated with ib_alloc_mr() > > - Page lists allocated via ib_alloc_fast_reg_page_list(). > > - MR made VALID and bound to a specific page list via > ib_post_send(IB_WR_FAST_REG_MR) Can the same ib_alloc_fast_reg_page_list() page list be bound to more than one MR? What happens if a user tries to issue an ib_post_send(IB_WR_FAST_REG_MR) to a VALID MR? How can the memory be read/written? If the MR allows remote operations, then RDMA writes could be used. An RDMA READ could be used. What about local access by the host CPU? > - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) > - MR deallocated with ib_dereg_mr() > > - page lists deallocated via ib_free_fast_reg_page_list(). > > Applications can allocate a fast_reg mr once, and then can repeatedly > bind the mr to different physical memory SGLs via posting work requests > to the send queue. For each outstanding mr-to-pbl binding in the SQ > pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can > be achieved while still allowing device-specific page_list processing.
> --- > > drivers/infiniband/core/verbs.c | 46 +++++++++++++++++++++++++++++++++ > include/rdma/ib_verbs.h | 55 +++++++++++++++++++++++++++++++++++++++ > 2 files changed, 101 insertions(+), 0 deletions(-) > > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c > index 0504208..869be7d 100644 > --- a/drivers/infiniband/core/verbs.c > +++ b/drivers/infiniband/core/verbs.c > @@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr) > } > EXPORT_SYMBOL(ib_dereg_mr); What does pbl_depth actually control? Is it the maximum size page list that can be used in a ib_post_send(IB_WR_FAST_REG_MR) work request? pbl_depth should be unsigned since I don't think negative values make sense. > +struct ib_mr *ib_alloc_mr(struct ib_pd *pd, int pbl_depth, int remote_access) > +{ > + struct ib_mr *mr; > + > + if (!pd->device->alloc_mr) > + return ERR_PTR(-ENOSYS); > + > + mr = pd->device->alloc_mr(pd, pbl_depth, remote_access); > + > + if (!IS_ERR(mr)) { > + mr->device = pd->device; > + mr->pd = pd; > + mr->uobject = NULL; > + atomic_inc(&pd->usecnt); > + atomic_set(&mr->usecnt, 0); > + } > + > + return mr; > +} > +EXPORT_SYMBOL(ib_alloc_mr); > + > +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( > + struct ib_device *device, int page_list_len) > +{ > + struct ib_fast_reg_page_list *page_list > + > + if (!device->alloc_fast_reg_page_list) > + return ERR_PTR(-ENOSYS); > + > + page_list = device->alloc_fast_reg_page_list(device, page_list_len); > + > + if (!IS_ERR(page_list)) { > + page_list->device = device; > + page_list->page_list_len = page_list_len; > + } > + > + return page_list; > +} > +EXPORT_SYMBOL(ib_alloc_fast_reg_page_list); > + > +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) > +{ > + page_list->device->dealloc_fast_reg_page_list(page_list); > +} > +EXPORT_SYMBOL(ib_free_fast_reg_page_list); > + > /* Memory windows */ > > struct ib_mw *ib_alloc_mw(struct ib_pd *pd) > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index 911a661..d6d9514 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -106,6 +106,7 @@ enum ib_device_cap_flags { > IB_DEVICE_UD_IP_CSUM = (1<<18), > IB_DEVICE_UD_TSO = (1<<19), > IB_DEVICE_SEND_W_INV = (1<<21), > + IB_DEVICE_MM_EXTENSIONS = (1<<22), > }; > > enum ib_atomic_cap { > @@ -414,6 +415,8 @@ enum ib_wc_opcode { > IB_WC_FETCH_ADD, > IB_WC_BIND_MW, > IB_WC_LSO, > + IB_WC_FAST_REG_MR, > + IB_WC_INVALIDATE_MR, > /* > * Set value of IB_WC_RECV so consumers can test if a completion is a > * receive by testing (opcode & IB_WC_RECV). > @@ -628,6 +631,9 @@ enum ib_wr_opcode { > IB_WR_ATOMIC_FETCH_AND_ADD, > IB_WR_LSO, > IB_WR_SEND_WITH_INV, > + IB_WR_FAST_REG_MR, > + IB_WR_INVALIDATE_MR, > + IB_WR_READ_WITH_INV, > }; > > enum ib_send_flags { > @@ -676,6 +682,17 @@ struct ib_send_wr { > u16 pkey_index; /* valid for GSI only */ > u8 port_num; /* valid for DR SMPs on switch only */ > } ud; > + struct { > + u64 iova_start; > + struct ib_fast_reg_page_list *page_list; > + int fbo; What is fbo? First byte offset? I assume fbo can't be negative so it should be "unsigned" > + u32 length; So I'm guessing the fbo and length select a subset from page_list for initializing the mr. Otherwise, the ib_fast_reg_page_list has the info. 
> + int access_flags; > + struct ib_mr *mr; > + } fast_reg; > + struct { > + struct ib_mr *mr; > + } local_inv; > } wr; > }; > > @@ -1014,6 +1031,11 @@ struct ib_device { > int (*query_mr)(struct ib_mr *mr, > struct ib_mr_attr *mr_attr); > int (*dereg_mr)(struct ib_mr *mr); > + struct ib_mr * (*alloc_mr)(struct ib_pd *pd, > + int pbl_depth, > + int remote_access); > + struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device, int page_list_len); > + void (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list); > int (*rereg_phys_mr)(struct ib_mr *mr, > int mr_rereg_mask, > struct ib_pd *pd, > @@ -1808,6 +1830,39 @@ int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); > int ib_dereg_mr(struct ib_mr *mr); We should define what error return values are possible and what they mean. Obviously ENOSYS is being used as the call is not supported by the device. ENOMEM is obvious. But what about EPERM, EINVAL, etc. > /** > + * ib_alloc_mr - Allocates memory region usable with the > + * IB_WR_FAST_REG_MR send work request. > + * @pd: The protection domain associated with the region. > + * @pbl_depth: requested max physical buffer list size to be allocated. > + * @remote_access: set to 1 if remote rdma operations are allowed. > + */ > +struct ib_mr *ib_alloc_mr(struct ib_pd *pd, int pbl_depth, > + int remote_access); > + > +struct ib_fast_reg_page_list { > + struct ib_device *device; > + u64 *page_list; > + int page_list_len; > +}; Is the page size always assumed to be PAGE_SIZE? What about large pages? The interface definition should say whether the page_list values are meaningful to the verbs caller. Can this list be used only for ib_post_send(IB_WR_FAST_REG_MR) or also by ib_map_phys_fmr() for example. > +/** > + * ib_alloc_fast_reg_page_list - Allocates a page list array to be used > + * in a IB_WR_FAST_REG_MR work request. The resources allocated by this method > + * allows for dev-specific optimization of the FAST_REG operation. > + * @device - ib device pointer. > + * @page_list_len - depth of the page list array to be allocated. > + */ > +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( > + struct ib_device *device, int page_list_len); > + > +/** > + * ib_free_fast_reg_page_list - Deallocates a previously allocated > + * page list array. > + * @page_list - struct ib_fast_reg_page_list pointer to be deallocated. > + */ > +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list); > + > +/** > * ib_alloc_mw - Allocates a memory window. > * @pd: The protection domain associated with the memory window. > */ > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Wed May 14 16:49:57 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 16:49:57 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <1210805766.3949.114.camel@brick.pathscale.com> (Ralph Campbell's message of "Wed, 14 May 2008 15:56:06 -0700") References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> Message-ID: > Can the same ib_alloc_fast_reg_page_list() page list be > bound to more than one MR? Yes, but as the IB spec describes, the page list belongs to the low-level driver until the fast-reg operation has completed. 
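A short sketch of that ownership rule, again with the RFC's names (the
polling loop is illustrative only; a real consumer would reap the
completion from its normal CQ handling): the page list must not be
refilled until the fast-reg work request completes.

	struct ib_wc wc;

	frwr.send_flags = IB_SEND_SIGNALED;
	ib_post_send(qp, &frwr, &bad);

	while (ib_poll_cq(cq, 1, &wc) == 0)
		cpu_relax();		/* page list still owned by the driver */

	if (wc.status == IB_WC_SUCCESS && wc.opcode == IB_WC_FAST_REG_MR)
		pl->page_list[0] = next_dma_addr;  /* consumer owns it again */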
> What happens if a user tries to issue an
> ib_post_send(IB_WR_FAST_REG_MR) to a VALID MR?

The operation completes with an error status.

> How can the memory be read/written?

what memory?

> > +struct ib_mr *ib_alloc_mr(struct ib_pd *pd, int pbl_depth, int remote_access)

> What does pbl_depth actually control?

pbl_depth is actually a terrible name.  I would suggest calling the
parameter something like max_page_list_len.

I wonder if we really need the remote access flag.  I know the iWARP and
IB verbs both call this out, but is there really a case where specifying
the exact permissions when doing the fast register is insufficient?

also I wonder if it's clearer if we call this verb
ib_alloc_fast_reg_mr().

> What is fbo? First byte offset?

yes... too many abbreviations in this API, better to make things
self-documenting at the cost of a bit more typing.

> So I'm guessing the fbo and length select a subset from page_list for
> initializing the mr.  Otherwise, the ib_fast_reg_page_list has the
> info.

If you pass in one page, you might want the MR to start after the
beginning of the page, and end before the end of the page.

> We should define what error return values are possible
> and what they mean.  Obviously ENOSYS is being used as
> the call is not supported by the device.  ENOMEM is
> obvious.  But what about EPERM, EINVAL, etc.

This is a big project, given we haven't done this for any other functions.

> Is the page size always assumed to be PAGE_SIZE?

I think we want a page_size member here for sure.

> The interface definition should say whether the page_list
> values are meaningful to the verbs caller.

not sure what you mean... the values are initialized by the verbs
consumer so they better mean something.

> Can this
> list be used only for ib_post_send(IB_WR_FAST_REG_MR)
> or also by ib_map_phys_fmr() for example.

It's just for posting sends, because it gives us a way to let low-level
drivers enforce requirements they have for the page_list passed into the
fast register via send queue operation -- e.g. it may need to be DMA-able
memory (since the adapter fetches it as part of executing the WQE),
there may be alignment restrictions, etc.

I think we should consider the fmr interface as legacy and try to phase
out using it over the long term.

- R.

From rdreier at cisco.com  Wed May 14 16:54:59 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 14 May 2008 16:54:59 -0700
Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions.
In-Reply-To: <20080514190532.28544.41595.stgit@dell3.ogc.int> (Steve Wise's
	message of "Wed, 14 May 2008 14:05:32 -0500")
References: <20080514190532.28544.41595.stgit@dell3.ogc.int>
Message-ID: 

A few quick comments (more later):

> - New device capability flag added: IB_DEVICE_MEMORY_EXTENSIONS indicates
> device support for this feature.

We still have time before 2.6.26 comes out.  Rather than moving
IB_DEVICE_SEND_W_INV to a new bit number, I think it might be better to
just remove IB_DEVICE_SEND_W_INV and make IB_DEVICE_MEMORY_EXTENSIONS
(maybe "MM_EXTENSIONS" or "MEMORY_MANAGEMENT_EXTENSIONS" is better?)
imply real send-with-invalidate support... so 2.6.26 won't have
send-with-invalidate and 2.6.27 will have all of the IB base MM exts
(and iWARP equivs) in one capability bit.

Any thoughts either way?  I'll post a trial balloon patch tomorrow.

Second question -- IB BMME and iWARP talk about a key portion (least
significant byte) of STag/L_Key/R_Key as being under consumer control.
Do we want to expose that as part of this API?
Basically it means we need to add a way for the consumer to pass in a
new L_Key/STag as part of a lot of calls.

- R.

From rdreier at cisco.com  Wed May 14 17:02:10 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 14 May 2008 17:02:10 -0700
Subject: [ofa-general] Re: the so many IPoIB-UD failures introduced by OFED 1.3
In-Reply-To: <482A8574.8070201@voltaire.com> (Or Gerlitz's message of "Wed,
	14 May 2008 09:23:48 +0300")
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com>
	<20080510190721.GI5298@sgi.com> <482A8574.8070201@voltaire.com>
Message-ID: 

> Maybe its about time for the Linux IB maintainers to get a little angry?!

I'm not angry about it, although I have pretty much given up on trying
to debug IPoIB issues seen running anything other than an upstream
kernel.

It seems like the OFED maintainers, the enterprise distros and their
customers should be more concerned about the failure of the OFED
process -- clearly producing something much buggier and less reliable
than the stock kernel is not what anyone wants.

- R.

From charr at fusionio.com  Wed May 14 17:12:20 2008
From: charr at fusionio.com (Cameron Harr)
Date: Wed, 14 May 2008 18:12:20 -0600
Subject: [ofa-general] iSer and Direct IO
Message-ID: <482B7FE4.9070502@fusionio.com>

Hi, I've been trying to compare performance between iSer and srpt and
am getting mixed results where iSer wins for IOPs and srpt wins for some
streaming b/w tests.  I've tested with iozone, spew and FIO, and IOP
numbers are always higher on iSer.  My problem though is that I'm a
little suspicious of some of the iSer numbers and whether they are
really using Direct IO.  For example, you'll see below in some of my FIO
results that I'm getting a write B/W of 799.1 MB/s at one point.  That's
way above what I can get natively on the device (~650 MB/s DIO) and is
more along the lines of buffered IO.  If the IOP numbers are also using
some kind of caching, that could possibly taint them also.  Does anyone
know if specifying DIO will really bypass all buffers or if something is
getting cached in the agents (iscsi, tgtd)?

FIO
              iSer 1     iSer 2     SRPT 1     SRPT 2
RBW (MB/s)     565.3      836.5      622.0      581.7
Read IOPs    63488.1    68053.8     5335.6     5446.1
WBW (MB/s)     799.1      737.7      589.5      594.4
Write IOPs   79086.6    80005.7    33884.6    34058.6

Thanks much,
Cameron

From swise at opengridcomputing.com  Wed May 14 17:56:13 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 14 May 2008 19:56:13 -0500
Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions.
In-Reply-To: 
References: <20080514190532.28544.41595.stgit@dell3.ogc.int>
Message-ID: <482B8A2D.7030505@opengridcomputing.com>

Roland Dreier wrote:
> A few quick comments (more later):
>
>  > - New device capability flag added: IB_DEVICE_MEMORY_EXTENSIONS indicates
>  > device support for this feature.
>
> We still have time before 2.6.26 comes out.  Rather than moving
> IB_DEVICE_SEND_W_INV to a new bit number, I think it might be better to
> just remove IB_DEVICE_SEND_W_INV and make IB_DEVICE_MEMORY_EXTENSIONS
> (maybe "MM_EXTENSIONS" or "MEMORY_MANAGEMENT_EXTENSIONS" is better?)
> imply real send-with-invalidate support... so 2.6.26 won't have
> send-with-invalidate and 2.6.27 will have all of the IB base MM exts
> (and iWARP equivs) in one capability bit.
>
> Any thoughts either way?  I'll post a trial balloon patch tomorrow.
>

Sounds fine to me.  As you've seen, I don't like to type, so
MM_EXTENSIONS seems better. :)
> Second question -- IB BMME and iWARP talk about a key portion (least
> significant byte) of STag/L_Key/R_Key as being under consumer control.
> Do we want to expose that as part of this API?  Basically it means we
> need to add a way for the consumer to pass in a new L_Key/STag as part
> of a lot of calls.

I left it out from this first pass because we don't expose any of that
in the existing RDMA API.  Currently the iWARP providers make up their
own keys.  E.g., ib_reg_phys_mr() should also allow passing in the key,
at least according to the iWARP verbs.  But I don't really see the need...

Steve.

From swise at opengridcomputing.com  Wed May 14 18:05:30 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 14 May 2008 20:05:30 -0500
Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions.
In-Reply-To: 
References: <20080514190532.28544.41595.stgit@dell3.ogc.int>
	<1210805766.3949.114.camel@brick.pathscale.com>
Message-ID: <482B8C5A.5020904@opengridcomputing.com>

Roland Dreier wrote:
>  > Can the same ib_alloc_fast_reg_page_list() page list be
>  > bound to more than one MR?
>
> Yes, but as the IB spec describes, the page list belongs to the
> low-level driver until the fast-reg operation has completed.
>
>  > What happens if a user tries to issue an
>  > ib_post_send(IB_WR_FAST_REG_MR) to a VALID MR?
>
> The operation completes with an error status.
>
>  > How can the memory be read/written?
>
> what memory?
>
>  > > +struct ib_mr *ib_alloc_mr(struct ib_pd *pd, int pbl_depth, int remote_access)
>
>  > What does pbl_depth actually control?
>
> pbl_depth is actually a terrible name.  I would suggest calling the
> parameter something like max_page_list_len.
>

Terrible? :(

max_page_list_len is ok.

> I wonder if we really need the remote access flag.  I know the iWARP and
> IB verbs both call this out, but is there really a case where specifying
> the exact permissions when doing the fast register is insufficient?
>

I agree.  I don't know why they specify this.  Let's remove it.

> also I wonder if it's clearer if we call this verb
> ib_alloc_fast_reg_mr().

Ok.

>  > What is fbo? First byte offset?
>
> yes... too many abbreviations in this API, better to make things
> self-documenting at the cost of a bit more typing.
>

ooh_kay :)

>  > So I'm guessing the fbo and length select a subset from page_list for
>  > initializing the mr.  Otherwise, the ib_fast_reg_page_list has the
>  > info.
>
> If you pass in one page, you might want the MR to start after the
> beginning of the page, and end before the end of the page.
>
>  > We should define what error return values are possible
>  > and what they mean.  Obviously ENOSYS is being used as
>  > the call is not supported by the device.  ENOMEM is
>  > obvious.  But what about EPERM, EINVAL, etc.
>
> This is a big project, given we haven't done this for any other functions.
>
>  > Is the page size always assumed to be PAGE_SIZE?
>
> I think we want a page_size member here for sure.
>

So you want the page size specified in the fast_reg_page_list as
opposed to when the page list is bound to the fast_reg mr (via
post_send)?

>  > The interface definition should say whether the page_list
>  > values are meaningful to the verbs caller.
>
> not sure what you mean... the values are initialized by the verbs
> consumer so they better mean something.
>

The idea is that the (kernel) application will allocate the page_list
memory via ib_alloc_fast_reg_page_list(), then map the desired physical
IO memory page-by-page, filling in the page_list with the resulting
DMA addresses.
This page_list is then bound to a MR via the post_send(IB_WR_FAST_REG_MR). The rkey can then be advertised to peers for remote IO, or the lkey used for local IO. > > Can this > > list be used only for ib_post_send(IB_WR_FAST_REG_MR) > > or also by ib_map_phys_fmr() for example. > > It's just for posting sends, because it gives us a way to let low-level > drivers enforce requirements they have for the page_list passed into the > fast register via send queue operation-- eg it may need to be DMA-able > memory (since the adapter fetches it as part of executing the WQE), > there may be alignment restrictions, etc. > > I think we should consider the fmr interface as legacy and try to phase > out using it over the long term. Agreed. Steve. From swise at opengridcomputing.com Wed May 14 18:20:53 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 20:20:53 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <1210805766.3949.114.camel@brick.pathscale.com> References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> Message-ID: <482B8FF5.80309@opengridcomputing.com> Ralph Campbell wrote: > Do we have any expected consumers for this interface? > I would guess ib_srp, ib_iser as likely candidates. > NFSRDMA RDS > detailed comments inline below. > I followed up on Roland's answers to your questions, and added a few replies inline below: >> Usage Model: >> >> - MR allocated with ib_alloc_mr() >> >> - Page lists allocated via ib_alloc_fast_reg_page_list(). >> >> - MR made VALID and bound to a specific page list via >> ib_post_send(IB_WR_FAST_REG_MR) > > Can the same ib_alloc_fast_reg_page_list() page list be > bound to more than one MR? > What happens if a user tries to issue a > ib_post_send(IB_WR_FAST_REG_MR) to a VALID MR? > > How can the memory be read/written? > If the MR allows remote operations, then RDMA writes could be > used. An RDMA READ could be used. What about local access > by the host CPU? > LOCAL_WRITE can be supplied allowing the device to do local IO. > > What does pbl_depth actually control? It allows the device to pre-allocate the page_list resources in HW. > Is it the maximum size page list that can be used in a > ib_post_send(IB_WR_FAST_REG_MR) work request? Yes. > > pbl_depth should be unsigned since I don't think negative values > make sense. > Ok. >> @@ -676,6 +682,17 @@ struct ib_send_wr { >> u16 pkey_index; /* valid for GSI only */ >> u8 port_num; /* valid for DR SMPs on switch only */ >> } ud; >> + struct { >> + u64 iova_start; >> + struct ib_fast_reg_page_list *page_list; >> + int fbo; > > What is fbo? First byte offset? > I assume fbo can't be negative so it should be "unsigned" > Ok. From rdreier at cisco.com Wed May 14 19:49:07 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 19:49:07 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <482B8C5A.5020904@opengridcomputing.com> (Steve Wise's message of "Wed, 14 May 2008 20:05:30 -0500") References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> Message-ID: > So you want the page size specified in the fast_reg_page_list as > opposed to when the page list is bound to the fast_reg mr (via > post_send)? It's kind of the same thing, since the fast_reg_page_list is part of the send work request... 
the structures you have at the moment are: > + struct { > + u64 iova_start; > + struct ib_fast_reg_page_list *page_list; > + int fbo; > + u32 length; > + int access_flags; > + struct ib_mr *mr; (side note... move this pointer up with the other pointers, so you don't end up with a hole in the structure due to alignment... or stick an int page_size in to fill the hole) > + } fast_reg; > +struct ib_fast_reg_page_list { > + struct ib_device *device; > + u64 *page_list; > + int page_list_len; > +}; is page_list_len the maximum length of the page_list, or is it filled in by the consumer? The driver could figure out the length of the page_list for any given work request by looking at the MR length and the page_size I suppose. - R. From rdreier at cisco.com Wed May 14 19:50:58 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 19:50:58 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <482B8A2D.7030505@opengridcomputing.com> (Steve Wise's message of "Wed, 14 May 2008 19:56:13 -0500") References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <482B8A2D.7030505@opengridcomputing.com> Message-ID: > > Second question -- IB BMME and iWARP talk about a key portion (least > > significant byte) of STag/L_Key/R_Key as being under consumer control. > > Do we want to expose that as part of this API? Basically it means we > > need to add a way for the consumer to pass in a new L_Key/STag as part > > of a lot of calls. > > I left it out from this first pass because we don't expose any of that > in the existing RDMA API. Currently the iwarp providers make up their > own keys. EG: ib_reg_phys_mr() should also allow passing in the key, > at least according to iWARP verbs. But I don't really see the need... Makes sense. Maybe RDS would like to control the key to avoid reuse but they've lived without it so far, and the RDS use case is kind of wacky anyway. From rdreier at cisco.com Wed May 14 19:50:05 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 19:50:05 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <482B8A2D.7030505@opengridcomputing.com> (Steve Wise's message of "Wed, 14 May 2008 19:56:13 -0500") References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <482B8A2D.7030505@opengridcomputing.com> Message-ID: > Sounds fine to me. As you've seen, I don't like to type, so > MM_EXTENSIONS seems better. :) How about we compromise on MEM_MGT_EXTENSIONS ;) MM is not 100% clear at first glance I think. - R. From rdreier at cisco.com Wed May 14 20:07:02 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 May 2008 20:07:02 -0700 Subject: [ofa-general] mthca max_sge value... ugh. Message-ID: it was recently pointed out to me that mem-free mthca devices cannot create a QP with max_send_sge and max_recv_sge set to the max_sge value returned by querying the device. The strange thing is that I thought I had tested this a while ago and it worked, but I can't see anything that would have changed things. Anyway, the patch below fixes things for me (tested on both mem-free and mem-full HCAs). But does anyone see something better to do? 
(short term or long term) diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 9ebadd6..200cf13 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -45,6 +45,7 @@ #include "mthca_cmd.h" #include "mthca_profile.h" #include "mthca_memfree.h" +#include "mthca_wqe.h" MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); @@ -200,7 +201,18 @@ static int mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim) mdev->limits.gid_table_len = dev_lim->max_gids; mdev->limits.pkey_table_len = dev_lim->max_pkeys; mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay; - mdev->limits.max_sg = dev_lim->max_sg; + /* + * Need to allow for worst case send WQE overhead and check + * whether max_desc_sz imposes a lower limit than max_sg; UD + * send has the biggest overhead. + */ + mdev->limits.max_sg = min_t(int, dev_lim->max_sg, + (dev_lim->max_desc_sz - + sizeof (struct mthca_next_seg) - + (mthca_is_memfree(mdev) ? + sizeof (struct mthca_arbel_ud_seg) : + sizeof (struct mthca_tavor_ud_seg))) / + sizeof (struct mthca_data_seg)); mdev->limits.max_wqes = dev_lim->max_qp_sz; mdev->limits.max_qp_init_rdma = dev_lim->max_requester_per_qp; mdev->limits.reserved_qps = dev_lim->reserved_qps; From Thomas.Talpey at netapp.com Wed May 14 20:32:21 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 14 May 2008 23:32:21 -0400 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: Message-ID: At 11:07 PM 5/14/2008, Roland Dreier wrote: >it was recently pointed out to me that mem-free mthca devices cannot >create a QP with max_send_sge and max_recv_sge set to the max_sge value >returned by querying the device. We've been hit by this twice this week on two NFS/RDMA servers, so I'm glad to see this! But, for us it happens with memless ConnectX - our mthca devices are ok (but OTOH they're memfull not memfree) I'll be happy to test it with our misbehaving cards, but I can't do it until next week since they just went into a box for shipping. In the meantime, dare I ask - what's different about memfree cards that limits the sge attributes like this? And, what values result from the new code? The ConnectX ones I have report 32, and fail when trying to set that. Tom. From swise at opengridcomputing.com Wed May 14 20:40:56 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 22:40:56 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <482B8A2D.7030505@opengridcomputing.com> Message-ID: <482BB0C8.3040307@opengridcomputing.com> Roland Dreier wrote: > > Sounds fine to me. As you've seen, I don't like to type, so > > MM_EXTENSIONS seems better. :) > > How about we compromise on MEM_MGT_EXTENSIONS ;) > MM is not 100% clear at first glance I think. > > - R. works for me. From swise at opengridcomputing.com Wed May 14 20:46:57 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 May 2008 22:46:57 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. 
In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> Message-ID: <482BB231.2000909@opengridcomputing.com> Roland Dreier wrote: > > So you want the page size specified in the fast_reg_page_list as > > opposed to when the page list is bound to the fast_reg mr (via > > post_send)? > > It's kind of the same thing, since the fast_reg_page_list is part of the > send work request... the structures you have at the moment are: > Yes, but I guess it makes more sense to specify it when you allocate the page_list. > > + struct { > > + u64 iova_start; > > + struct ib_fast_reg_page_list *page_list; > > + int fbo; > > + u32 length; > > + int access_flags; > > + struct ib_mr *mr; > > (side note... move this pointer up with the other pointers, so you don't > end up with a hole in the structure due to alignment... or stick an int > page_size in to fill the hole) k > > > + } fast_reg; > > > +struct ib_fast_reg_page_list { > > + struct ib_device *device; > > + u64 *page_list; > > + int page_list_len; > > +}; > > is page_list_len the maximum length of the page_list, or is it filled in > by the consumer? The driver could figure out the length of the > page_list for any given work request by looking at the MR length and the > page_size I suppose. The idea was that it was the current page_list length. But perhaps the struct needs both current and max? Or maybe the struct contains the max, and the actual length is passed in with the bind. Apps, however, might need both anyway and providing a place to keep them in this struct will help apps... Steve. From Thomas.Talpey at netapp.com Wed May 14 22:59:13 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 15 May 2008 01:59:13 -0400 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> Message-ID: At 07:49 PM 5/14/2008, Roland Dreier wrote: >also I wonder if it's clearer if we call this verb >ib_alloc_fast_reg_mr(). I have to disagree. Calling anything "fast" simply invites a "faster" thing to come along later. It's like calling something "new". I say call it what it is - a work-request-based, alloc-phys-buffer-list, bind-pages-to-list, to-be-widely-supported memory registration. Obviously, the individual verbs need to be a bit more precise. :-) Ralph - to answer your question who wants it, NFS/RDMA does, both client and server. I talked about requirements that it matches closely at Sonoma last month. But Steve - aren't these capable of protecting memory at byte granularity? The word "page" in some of the names implies otherwise. Tom. From Thomas.Talpey at netapp.com Wed May 14 23:04:52 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 15 May 2008 02:04:52 -0400 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> Message-ID: At 07:54 PM 5/14/2008, Roland Dreier wrote: >Second question -- IB BMME and iWARP talk about a key portion (least >significant byte) of STag/L_Key/R_Key as being under consumer control. >Do we want to expose that as part of this API? Basically it means we >need to add a way for the consumer to pass in a new L_Key/STag as part >of a lot of calls. 
I think the Key portion is a quite useful way for the upper layer to salt the actual R_Keys as a protection mechanism, and having it would simplify a bunch of defensive code in the NFS/RDMA client. Currently, because the keys are provider-chosen and potentially recycled, there is a latent risk. But, I only want it if ALL future providers support it in some way. If a subset does not, it's not worth coding around the differences. Tom. From sashak at voltaire.com Wed May 14 23:09:16 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 May 2008 09:09:16 +0300 Subject: [ofa-general] Re: [PATCH] OpenSM: Add QoS_management_in_OpenSM.txt to opensm/doc directory In-Reply-To: <1210084406.2026.48.camel@hrosenstock-ws.xsigo.com> References: <1210084406.2026.48.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080515060916.GA24654@sashak.voltaire.com> On 07:33 Tue 06 May , Hal Rosenstock wrote: > Add Yevgeny's QoS_management_in_OpenSM.txt to opensm/doc directory > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From Thomas.Talpey at netapp.com Wed May 14 23:11:47 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 15 May 2008 02:11:47 -0400 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> Message-ID: At 02:04 AM 5/15/2008, Talpey, Thomas wrote: >At 07:54 PM 5/14/2008, Roland Dreier wrote: >>Second question -- IB BMME and iWARP talk about a key portion (least >>significant byte) of STag/L_Key/R_Key as being under consumer control. >>Do we want to expose that as part of this API? Basically it means we >>need to add a way for the consumer to pass in a new L_Key/STag as part >>of a lot of calls. > >I think the Key portion is a quite useful way for the upper layer to >salt the actual R_Keys as a protection mechanism, and having it would >simplify a bunch of defensive code in the NFS/RDMA client. Currently, >because the keys are provider-chosen and potentially recycled, there >is a latent risk. > >But, I only want it if ALL future providers support it in some way. If a >subset does not, it's not worth coding around the differences. I forgot to mention that the provider portion of the R_Key is reduced to 24 bits as a result of exposing/requiring the key. This may cause an issue at large scale, if the R_Keys have global scope. If they are limited to use on specific connections as in iWARP, then this is less of an issue. Tom. From kliteyn at dev.mellanox.co.il Thu May 15 00:09:51 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 15 May 2008 10:09:51 +0300 Subject: [ofa-general] [Fwd: Your message to general awaits moderator approval] Message-ID: <482BE1BF.8030505@dev.mellanox.co.il> Guys, I'm having some troubles with the mailing list filter again. Do we have some filtering changes? Any other ideas? -- Yevgeny -------- Original Message -------- Subject: Your message to general awaits moderator approval Date: Thu, 15 May 2008 00:06:04 -0700 From: general-bounces at lists.openfabrics.org To: kliteyn at dev.mellanox.co.il Your mail to 'general' with the subject ***SPAM*** [PATCH] opensm/ib_types.h: cosmetics - ?sec to usec Is being held until the list moderator can review it for approval. The reason it is being held: The message headers matched a filter rule Either the message will get posted to the list, or you will receive notification of the moderator's decision. 
If you would like to cancel this posting, please visit the following URL: http://lists.openfabrics.org/cgi-bin/mailman/confirm/general/b4e509be88382dc9cdc3eb30aa789ce58306a08c From eli at dev.mellanox.co.il Thu May 15 00:20:27 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 15 May 2008 10:20:27 +0300 Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support Message-ID: <1210836027.18385.2.camel@mtls03> >From 2fa86ee977039784f50de982e2f6bf197f00fbeb Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Sun, 11 May 2008 15:02:04 +0300 Subject: [PATCH] IB/mlx4: Add send with invalidate support Add send with invalidate support to mlx4. Signed-off-by: Eli Cohen --- drivers/infiniband/hw/mlx4/cq.c | 8 ++++++++ drivers/infiniband/hw/mlx4/main.c | 4 ++++ drivers/infiniband/hw/mlx4/qp.c | 22 +++++++++++++++++----- drivers/net/mlx4/mr.c | 6 ++++-- 4 files changed, 33 insertions(+), 7 deletions(-) Changes since last post: set IB_DEVICE_SEND_W_INV only if FW supports it. diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 4521319..291e856 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -637,6 +637,7 @@ repoll: case MLX4_OPCODE_SEND_IMM: wc->wc_flags |= IB_WC_WITH_IMM; case MLX4_OPCODE_SEND: + case MLX4_OPCODE_SEND_INVAL: wc->opcode = IB_WC_SEND; break; case MLX4_OPCODE_RDMA_READ: @@ -676,6 +677,13 @@ repoll: wc->wc_flags = IB_WC_WITH_IMM; wc->imm_data = cqe->immed_rss_invalid; break; + case MLX4_RECV_OPCODE_SEND_INVAL: + wc->opcode = IB_WC_RECV; + wc->wc_flags = IB_WC_WITH_INVALIDATE; + /* + * TBD: maybe we should just call this ieth_val + */ + wc->imm_data = be32_to_cpu(cqe->immed_rss_invalid); } wc->slid = be16_to_cpu(cqe->rlid); diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 4d61e32..b1e9505 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -56,6 +56,8 @@ static const char mlx4_ib_version[] = DRV_NAME ": Mellanox ConnectX InfiniBand driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; +#define MLX4_FW_VER_LOCAL_SEND_INVL mlx4_fw_ver(2, 5, 0) + static void init_query_mad(struct ib_smp *mad) { mad->base_version = 1; @@ -103,6 +105,8 @@ static int mlx4_ib_query_device(struct ib_device *ibdev, props->device_cap_flags |= IB_DEVICE_UD_IP_CSUM; if (dev->dev->caps.max_gso_sz) props->device_cap_flags |= IB_DEVICE_UD_TSO; + if (dev->dev->caps.fw_ver >= MLX4_FW_VER_LOCAL_SEND_INVL) + props->device_cap_flags |= IB_DEVICE_SEND_W_INV; props->vendor_id = be32_to_cpup((__be32 *) (out_mad->data + 36)) & 0xffffff; diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 8e02ecf..d0d5f77 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -78,6 +78,7 @@ static const __be32 mlx4_ib_opcode[] = { [IB_WR_RDMA_READ] = __constant_cpu_to_be32(MLX4_OPCODE_RDMA_READ), [IB_WR_ATOMIC_CMP_AND_SWP] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_CS), [IB_WR_ATOMIC_FETCH_AND_ADD] = __constant_cpu_to_be32(MLX4_OPCODE_ATOMIC_FA), + [IB_WR_SEND_WITH_INV] = __constant_cpu_to_be32(MLX4_OPCODE_SEND_INVAL), }; static struct mlx4_ib_sqp *to_msqp(struct mlx4_ib_qp *mqp) @@ -1444,6 +1445,21 @@ static int build_lso_seg(struct mlx4_lso_seg *wqe, struct ib_send_wr *wr, return 0; } +static __be32 get_ieth(struct ib_send_wr *wr) +{ + switch (wr->opcode) { + case IB_WR_SEND_WITH_IMM: + case IB_WR_RDMA_WRITE_WITH_IMM: + return wr->ex.imm_data; + + case IB_WR_SEND_WITH_INV: + return cpu_to_be32(wr->ex.invalidate_rkey); + + 
default:
+		return 0;
+	}
+}
+
 int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 		      struct ib_send_wr **bad_wr)
 {
@@ -1490,11 +1506,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 						    MLX4_WQE_CTRL_TCP_UDP_CSUM) : 0) |
 			qp->sq_signal_bits;

-		if (wr->opcode == IB_WR_SEND_WITH_IMM ||
-		    wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM)
-			ctrl->imm = wr->ex.imm_data;
-		else
-			ctrl->imm = 0;
+		ctrl->imm = get_ieth(wr);

 		wqe += sizeof *ctrl;
 		size = sizeof *ctrl / 16;
diff --git a/drivers/net/mlx4/mr.c b/drivers/net/mlx4/mr.c
index 03a9abc..e78f53d 100644
--- a/drivers/net/mlx4/mr.c
+++ b/drivers/net/mlx4/mr.c
@@ -47,7 +47,7 @@ struct mlx4_mpt_entry {
 	__be32 flags;
 	__be32 qpn;
 	__be32 key;
-	__be32 pd;
+	__be32 pd_flags;
 	__be64 start;
 	__be64 length;
 	__be32 lkey;
@@ -71,6 +71,8 @@ struct mlx4_mpt_entry {
 #define MLX4_MPT_STATUS_SW 0xF0
 #define MLX4_MPT_STATUS_HW 0x00

+#define MLX4_MPT_FLAG_EN_INV 0x3000000
+
 static u32 mlx4_buddy_alloc(struct mlx4_buddy *buddy, int order)
 {
 	int o;
@@ -320,7 +322,7 @@ int mlx4_mr_enable(struct mlx4_dev *dev, struct mlx4_mr *mr)
 			   mr->access);

 	mpt_entry->key	       = cpu_to_be32(key_to_hw_index(mr->key));
-	mpt_entry->pd	       = cpu_to_be32(mr->pd);
+	mpt_entry->pd_flags    = cpu_to_be32(mr->pd | MLX4_MPT_FLAG_EN_INV);
 	mpt_entry->start       = cpu_to_be64(mr->iova);
 	mpt_entry->length      = cpu_to_be64(mr->size);
 	mpt_entry->entity_size = cpu_to_be32(mr->mtt.page_shift);
--
1.5.5.1

From npiggin at suse.de  Thu May 15 00:57:47 2008
From: npiggin at suse.de (Nick Piggin)
Date: Thu, 15 May 2008 09:57:47 +0200
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <20080514112625.GY9878@sgi.com>
References: <6b384bb988786aa78ef0.1210170958@duo.random>
	<20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au>
	<20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de>
	<20080514112625.GY9878@sgi.com>
Message-ID: <20080515075747.GA7177@wotan.suse.de>

On Wed, May 14, 2008 at 06:26:25AM -0500, Robin Holt wrote:
> On Wed, May 14, 2008 at 06:11:22AM +0200, Nick Piggin wrote:
> >
> > I guess that you have found a way to perform TLB flushing within coherent
> > domains over the numalink interconnect without sleeping.  I'm sure it would
> > be possible to send similar messages between non coherent domains.
>
> I assume by coherent domains, you are actually talking about system
> images.

Yes

> Our memory coherence domain on the 3700 family is 512 processors
> on 128 nodes.  On the 4700 family, it is 16,384 processors on 4096 nodes.
> We extend a "Read-Exclusive" mode beyond the coherence domain so any
> processor is able to read any cacheline on the system.  We also provide
> uncached access for certain types of memory beyond the coherence domain.

Yes, I understand the basics.

> For the other partitions, the exporting partition does not know at what
> virtual address the imported pages are mapped.  The pages are frequently
> mapped in a different order by the MPI library to help with MPI collective
> operations.
>
> For the exporting side to do those TLB flushes, we would need to replicate
> all that importing information back to the exporting side.

Right.  Or the exporting side could be passed tokens that it tracks
itself, rather than virtual addresses.

> Additionally, the hardware that does the TLB flushing is protected
> by a spinlock on each system image.  We would need to change that
> simple spinlock into a type of hardware lock that would work (on 3700)
> outside the processors coherence domain.
The only way to do that is to
> use uncached addresses with our Atomic Memory Operations which do the
> cmpxchg at the memory controller.  The uncached accesses are an order
> of magnitude or more slower.

I'm not sure if you're thinking about what I'm thinking of.  With the
scheme I'm imagining, all you will need is some way to raise an IPI-like
interrupt on the target domain.  The IPI target will have a driver to
handle the interrupt, which will determine the mm and virtual addresses
which are to be invalidated, and will then tear down those page tables
and issue hardware TLB flushes within its domain.  On the Linux side,
I don't see why this can't be done.

> > So yes, I'd much rather rework such highly specialized system to fit in
> > closer with Linux than rework Linux to fit with these machines (and
> > apparently slow everyone else down).
>
> But it isn't that we are having a problem adapting to just the hardware.
> One of the limiting factors is Linux on the other partition.

In what way is the Linux limiting?

> > > Additionally, the call to zap_page_range expects to have the mmap_sem
> > > held.  I suppose we could use something other than zap_page_range and
> > > atomically clear the process page tables.
> >
> > zap_page_range does not expect to have mmap_sem held.  I think for anon
> > pages it is always called with mmap_sem, however try_to_unmap_anon is
> > not (although it expects page lock to be held, I think we should be able
> > to avoid that).
>
> zap_page_range calls unmap_vmas which walks to vma->next.  Are you saying
> that can be walked without grabbing the mmap_sem at least readably?

Oh, I get that confused because of the mixed up naming conventions
there: unmap_page_range should actually be called zap_page_range.  But
at any rate, yes we can easily zap pagetables without holding mmap_sem.

> I feel my understanding of list management and locking completely
> shifting.

FWIW, mmap_sem isn't held to protect vma->next there anyway, because at
that point the vmas are detached from the mm's rbtree and linked list.
But sure, in that particular path it is held for other reasons.

> > > Doing that will not alleviate
> > > the need to sleep for the messaging to the other partitions.
> >
> > No, but I'd venture to guess that is not impossible to implement even
> > on your current hardware (maybe a firmware update is needed)?
>
> Are you suggesting the sending side would not need to sleep or the
> receiving side?  Assuming you meant the sender, it spins waiting for the
> remote side to acknowledge the invalidate request?  We place the data
> into a previously agreed upon buffer and send an interrupt.  At this
> point, we would need to start spinning and waiting for completion.
> Let's assume we never run out of buffer space.

How would you run out of buffer space if it is synchronous?

> The receiving side receives an interrupt.  The interrupt currently wakes
> an XPC thread to do the work of transferring and delivering the message
> to XPMEM.  The transfer of the data which XPC does uses the BTE engine
> which takes up to 28 seconds to timeout (hardware timeout before raising
> an error) and the BTE code automatically does a retry for certain
> types of failure.  We currently need to grab semaphores which _MAY_
> be able to be reworked into other types of locks.

Sure, you obviously would need to rework your code because it's been
written with the assumption that it can sleep.

What is XPMEM exactly anyway?  I'd assumed it is a Linux driver.
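For what it's worth, a rough and purely illustrative sketch of the receive
side being argued about here -- every xinval_* name below is hypothetical --
would be an interrupt handler that drains (token, range) requests from a
previously agreed-upon buffer, tears down its own page tables and TLB
entries without sleeping, and then acks a word the sender spins on:

struct xinval_req {		/* deposited by the sending partition   */
	u64 token;		/* identifies the target mm, not a VA   */
	unsigned long start;
	unsigned long len;
};

static irqreturn_t xinval_interrupt(int irq, void *arg)
{
	struct xinval_queue *q = arg;	/* agreed-upon request buffer   */
	struct xinval_req r;

	while (xinval_dequeue(q, &r)) {	/* non-sleeping ring buffer     */
		struct mm_struct *mm = xinval_token_to_mm(r.token);

		if (mm)
			/* stand-in for page table teardown plus a
			 * hardware TLB flush within this domain     */
			xinval_zap_range(mm, r.start, r.len);
	}
	xinval_ack(q);			/* the sender spins on this word */
	return IRQ_HANDLED;
}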
From sashak at voltaire.com  Thu May 15 01:21:12 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 15 May 2008 11:21:12 +0300
Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: cosmetics - ?sec to usec
In-Reply-To: <482BE026.7030104@dev.mellanox.co.il>
References: <482BE026.7030104@dev.mellanox.co.il>
Message-ID: <20080515082112.GC24654@sashak.voltaire.com>

On 10:03 Thu 15 May     , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> Although "?sec" in the comments looks really cool and sophisticated :-),
> I'd prefer to lose it and replace with a simple "usec".
>
> Having '?' in the code confuses some editors and tools, such as "kompare".
>
> Signed-off-by: Yevgeny Kliteynik

Applied. Thanks.

Sasha

From kliteyn at dev.mellanox.co.il  Thu May 15 02:08:30 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 15 May 2008 12:08:30 +0300
Subject: [ofa-general] [PATCH] opensm/ib_types.h: fixing some wrong comments
Message-ID: <482BFD8E.10101@dev.mellanox.co.il>

Hi Sasha,

Fixing a couple of wrong attribute descriptions in ib_types.h

Signed-off-by: Yevgeny Kliteynik
---
 opensm/include/iba/ib_types.h |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index 6f3c400..e6bd9ee 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -974,7 +974,7 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code)
 * IB_MAD_ATTR_PORT_SMPL_CTRL
 *
 * DESCRIPTION
-*	NodeDescription attribute (16.1.2)
+*	PortSamplesControl attribute (16.1.3)
 *
 * SOURCE
 */
@@ -998,7 +998,7 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code)
 * IB_MAD_ATTR_PORT_SMPL_RSLT
 *
 * DESCRIPTION
-*	NodeInfo attribute (16.1.2)
+*	PortSamplesResult attribute (16.1.3)
 *
 * SOURCE
 */
@@ -1022,7 +1022,7 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code)
 * IB_MAD_ATTR_PORT_CNTRS
 *
 * DESCRIPTION
-*	SwitchInfo attribute (16.1.2)
+*	PortCounters attribute (16.1.3)
 *
 * SOURCE
 */
--
1.5.1.4

From sashak at voltaire.com  Thu May 15 02:09:56 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 15 May 2008 12:09:56 +0300
Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: fixing some wrong comments
In-Reply-To: <482BFD8E.10101@dev.mellanox.co.il>
References: <482BFD8E.10101@dev.mellanox.co.il>
Message-ID: <20080515090956.GF24654@sashak.voltaire.com>

On 12:08 Thu 15 May     , Yevgeny Kliteynik wrote:
> Hi Sasha,
>
> Fixing a couple of wrong attribute descriptions in ib_types.h
>
> Signed-off-by: Yevgeny Kliteynik

Applied. Thanks.
Sasha > --- > opensm/include/iba/ib_types.h | 6 +++--- > 1 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h > index 6f3c400..e6bd9ee 100644 > --- a/opensm/include/iba/ib_types.h > +++ b/opensm/include/iba/ib_types.h > @@ -974,7 +974,7 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) > * IB_MAD_ATTR_PORT_SMPL_CTRL > * > * DESCRIPTION > -* NodeDescription attribute (16.1.2) > +* PortSamplesControl attribute (16.1.3) > * > * SOURCE > */ > @@ -998,7 +998,7 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) > * IB_MAD_ATTR_PORT_SMPL_RSLT > * > * DESCRIPTION > -* NodeInfo attribute (16.1.2) > +* PortSamplesResult attribute (16.1.3) > * > * SOURCE > */ > @@ -1022,7 +1022,7 @@ static inline boolean_t OSM_API ib_class_is_rmpp(IN const uint8_t class_code) > * IB_MAD_ATTR_PORT_CNTRS > * > * DESCRIPTION > -* SwitchInfo attribute (16.1.2) > +* PortCounters attribute (16.1.3) > * > * SOURCE > */ > -- > 1.5.1.4 > From kliteyn at dev.mellanox.co.il Thu May 15 02:18:00 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 15 May 2008 12:18:00 +0300 Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: cosmetics - ?sec to usec In-Reply-To: <20080515082112.GC24654@sashak.voltaire.com> References: <482BE026.7030104@dev.mellanox.co.il> <20080515082112.GC24654@sashak.voltaire.com> Message-ID: <482BFFC8.1060404@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 10:03 Thu 15 May , Yevgeny Kliteynik wrote: >> Hi Sasha, >> >> Although "?sec" in the comments looks really cool and sophisticated :-), >> I'd prefer to lose it and replace with a simple "usec". By the way, the problematic character confused Thunderburd too :) I see it was replaced by '?'. How did you apply the patch? -- Yevgeny >> Heaving '?' in the code confuses some editors and tools, such as "kompare". >> >> Signed-off-by: Yevgeny Kliteynik > > Applied. Thanks. > > Sasha > From atheatre at bellnet.ca Thu May 15 02:22:15 2008 From: atheatre at bellnet.ca (Jocey Wall.) Date: Thu, 15 May 2008 5:22:15 -0400 Subject: [ofa-general] Ref : L/400-26932 Message-ID: <6t0pti$17dfkr@toip35-bus.srvr.bell.ca> You won £2,000,000.00 GBP.Get back to us via return email with your Name,and Address\Country and your Phone Number for more information on how you won and the delivery of your won prize to you.Email: processing.unit at btinternet.com From atheatre at bellnet.ca Thu May 15 02:24:03 2008 From: atheatre at bellnet.ca (Jocey Wall.) Date: Thu, 15 May 2008 5:24:03 -0400 Subject: [ofa-general] Ref : L/400-26932 Message-ID: <6t0pti$17dfth@toip35-bus.srvr.bell.ca> You won £2,000,000.00 GBP.Get back to us via return email with your Name,and Address\Country and your Phone Number for more information on how you won and the delivery of your won prize to you.Email: processing.unit at btinternet.com From kliteyn at dev.mellanox.co.il Thu May 15 02:25:50 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 15 May 2008 12:25:50 +0300 Subject: [ofa-general] [PATCH v2] opensm/osm_qos_policy.c: log matched QoS criteria Message-ID: <482C019E.40606@dev.mellanox.co.il> Hi Sasha, I think this patch was somehow lost in the pile of patches that you recently got. Anyhow, reposting it: Adding log messages for matched criteria of the QoS policy rule. 
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_qos_policy.c | 18 +++++++++++++++--- 1 files changed, 15 insertions(+), 3 deletions(-) diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c index 6c81872..ebe3a7f 100644 --- a/opensm/opensm/osm_qos_policy.c +++ b/opensm/opensm/osm_qos_policy.c @@ -598,10 +598,13 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( { osm_qos_match_rule_t *p_qos_match_rule = NULL; cl_list_iterator_t list_iterator; + osm_log_t * p_log = &p_qos_policy->p_subn->p_osm->log; if (!cl_list_count(&p_qos_policy->qos_match_rules)) return NULL; + OSM_LOG_ENTER(p_log); + /* Go over all QoS match rules and find the one that matches the request */ list_iterator = cl_list_head(&p_qos_policy->qos_match_rules); @@ -624,6 +627,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( list_iterator = cl_list_next(list_iterator); continue; } + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "Source port matched.\n"); } /* If a match rule has Destination groups, PR request dest. has to be in this list */ @@ -637,6 +642,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( list_iterator = cl_list_next(list_iterator); continue; } + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "Destination port matched.\n"); } /* If a match rule has QoS classes, PR request HAS @@ -655,7 +662,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( list_iterator = cl_list_next(list_iterator); continue; } - + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "QoS Class matched.\n"); } /* If a match rule has Service IDs, PR request HAS @@ -675,7 +683,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( list_iterator = cl_list_next(list_iterator); continue; } - + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "Service ID matched.\n"); } /* If a match rule has PKeys, PR request HAS @@ -694,13 +703,16 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( list_iterator = cl_list_next(list_iterator); continue; } - + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, + "PKey matched.\n"); } /* if we got here, then this match-rule matched this PR request */ break; } + OSM_LOG_EXIT(p_log); + if (list_iterator == cl_list_end(&p_qos_policy->qos_match_rules)) return NULL; -- 1.5.1.4 From sashak at voltaire.com Thu May 15 02:37:09 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 May 2008 12:37:09 +0300 Subject: [ofa-general] Re: [ewg] Compiling OFED 1.3 on Gentoo In-Reply-To: <39C75744D164D948A170E9792AF8E7CA012CD4B0@exil.voltaire.com> References: <39C75744D164D948A170E9792AF8E7CA012CD4B0@exil.voltaire.com> Message-ID: <20080515093709.GH24654@sashak.voltaire.com> Hi Olga, On 17:18 Mon 12 May , Olga Shern wrote: > > We are trying to compile OFED 1.3 on Gentoo and see the following error, > But if I install source RPM file and then running 'rpmbuild -ba > libibcommon.spec' then I can build RPM, so only rpmbuild --rebuild > command causing to problems. Basically Gentoo doesn't use RPM as package manager, but builds packages from sources using portage/emerge stuff. So it is nice that it builds somehow. As another workaround rpm2targz probably can be used too. Of course better would be to have native *.ebuild files. > Have someone succeeded to build OFED 1.3 on Gentoo? I'm using Gentoo as my workstation, don't do RPMs there however. 
Sasha From sashak at voltaire.com Thu May 15 02:40:40 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 May 2008 12:40:40 +0300 Subject: [ofa-general] Re: [PATCH] opensm/ib_types.h: cosmetics - ?sec to usec In-Reply-To: <482BFFC8.1060404@dev.mellanox.co.il> References: <482BE026.7030104@dev.mellanox.co.il> <20080515082112.GC24654@sashak.voltaire.com> <482BFFC8.1060404@dev.mellanox.co.il> Message-ID: <20080515094040.GK24654@sashak.voltaire.com> On 12:18 Thu 15 May , Yevgeny Kliteynik wrote: > > By the way, the problematic character confused Thunderburd too :) > I see it was replaced by '?'. > How did you apply the patch? With 'git-am' and without any problem :) Sasha From sashak at voltaire.com Thu May 15 02:52:25 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 May 2008 12:52:25 +0300 Subject: [ofa-general] Re: [PATCH v2] opensm/osm_qos_policy.c: log matched QoS criteria In-Reply-To: <482C019E.40606@dev.mellanox.co.il> References: <482C019E.40606@dev.mellanox.co.il> Message-ID: <20080515095225.GL24654@sashak.voltaire.com> Hi Yevgeny, On 12:25 Thu 15 May , Yevgeny Kliteynik wrote: > > I think this patch was somehow lost in the pile of patches > that you recently got. Anyhow, reposting it: It wasn't lost, I just didn't process it yet (and there still be more unreviewed patched on the list I need to care about). My very first thought was to not do it because such debug prints hurt performance a lot even when log level has lower value (it was measured very well during Up/Down routing optimizations), so I pend it in order to get some numbers first. Another thing I don't like is that with higher debug levels OpenSM generates ~1GB log file just during initial sweep. But this is more general concern. Sasha > Adding log messages for matched criteria of the QoS policy rule. > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/opensm/osm_qos_policy.c | 18 +++++++++++++++--- > 1 files changed, 15 insertions(+), 3 deletions(-) > > diff --git a/opensm/opensm/osm_qos_policy.c b/opensm/opensm/osm_qos_policy.c > index 6c81872..ebe3a7f 100644 > --- a/opensm/opensm/osm_qos_policy.c > +++ b/opensm/opensm/osm_qos_policy.c > @@ -598,10 +598,13 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > { > osm_qos_match_rule_t *p_qos_match_rule = NULL; > cl_list_iterator_t list_iterator; > + osm_log_t * p_log = &p_qos_policy->p_subn->p_osm->log; > > if (!cl_list_count(&p_qos_policy->qos_match_rules)) > return NULL; > > + OSM_LOG_ENTER(p_log); > + > /* Go over all QoS match rules and find the one that matches the request */ > > list_iterator = cl_list_head(&p_qos_policy->qos_match_rules); > @@ -624,6 +627,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > list_iterator = cl_list_next(list_iterator); > continue; > } > + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, > + "Source port matched.\n"); > } > > /* If a match rule has Destination groups, PR request dest. 
has to be in this list */ > @@ -637,6 +642,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > list_iterator = cl_list_next(list_iterator); > continue; > } > + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, > + "Destination port matched.\n"); > } > > /* If a match rule has QoS classes, PR request HAS > @@ -655,7 +662,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > list_iterator = cl_list_next(list_iterator); > continue; > } > - > + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, > + "QoS Class matched.\n"); > } > > /* If a match rule has Service IDs, PR request HAS > @@ -675,7 +683,8 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > list_iterator = cl_list_next(list_iterator); > continue; > } > - > + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, > + "Service ID matched.\n"); > } > > /* If a match rule has PKeys, PR request HAS > @@ -694,13 +703,16 @@ static osm_qos_match_rule_t *__qos_policy_get_match_rule_by_params( > list_iterator = cl_list_next(list_iterator); > continue; > } > - > + OSM_LOG(&p_qos_policy->p_subn->p_osm->log, OSM_LOG_DEBUG, > + "PKey matched.\n"); > } > > /* if we got here, then this match-rule matched this PR request */ > break; > } > > + OSM_LOG_EXIT(p_log); > + > if (list_iterator == cl_list_end(&p_qos_policy->qos_match_rules)) > return NULL; > > -- > 1.5.1.4 > From holt at sgi.com Thu May 15 04:01:48 2008 From: holt at sgi.com (Robin Holt) Date: Thu, 15 May 2008 06:01:48 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080515075747.GA7177@wotan.suse.de> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> Message-ID: <20080515110147.GD10126@sgi.com> We are pursuing Linus' suggestion currently. This discussion is completely unrelated to that work. On Thu, May 15, 2008 at 09:57:47AM +0200, Nick Piggin wrote: > I'm not sure if you're thinking about what I'm thinking of. With the > scheme I'm imagining, all you will need is some way to raise an IPI-like > interrupt on the target domain. The IPI target will have a driver to > handle the interrupt, which will determine the mm and virtual addresses > which are to be invalidated, and will then tear down those page tables > and issue hardware TLB flushes within its domain. On the Linux side, > I don't see why this can't be done. We would need to deposit the payload into a central location to do the invalidate, correct? That central location would either need to be indexed by physical cpuid (65536 possible currently, UV will push that up much higher) or some sort of global id which is difficult because remote partitions can reboot giving you a different view of the machine and running partitions would need to be updated. Alternatively, that central location would need to be protected by a global lock or atomic type operation, but a majority of the machine does not have coherent access to other partitions so they would need to use uncached operations. Essentially, take away from this paragraph that it is going to be really slow or really large. Then we need to deposit the information needed to do the invalidate. Lastly, we would need to interrupt. Unfortunately, here we have a thundering herd. There could be up to 16256 processors interrupting the same processor. 
That will be a lot of work. It will need to look up the mm (without grabbing any sleeping locks in either xpmem or the kernel) and do the tlb invalidates. Unfortunately, the sending side is not free to continue (in most cases) until it knows that the invalidate is completed. So it will need to spin waiting for a completion signal, which could be as simple as an uncached word. But how will it handle the possible failure of the other partition? How will it detect that failure and recover? A timeout value could be difficult to gauge because the other side may be off doing a considerable amount of work and may just be backed up. > Sure, you obviously would need to rework your code because it's been > written with the assumption that it can sleep. It is an assumption based upon some of the kernel functions we call doing things like grabbing mutexes or rw_sems. That pushes back to us. I think the kernel's locking is perfectly reasonable. The problem we run into is we are trying to get from one context in one kernel to a different context in another, and the in-between piece needs to be sleepable. > What is XPMEM exactly anyway? I'd assumed it is a Linux driver. XPMEM allows one process to make a portion of its virtual address range directly addressable by another process with the appropriate access. The other process can be on other partitions. As long as Numa-link allows access to the memory, we can make it available. Userland has an advantage in that the kernel entrance/exit code contains memory errors, so we can confine hardware failures (in most cases) to terminating a user program rather than losing the partition. The kernel enjoys no such fault containment, so it cannot safely reference that memory directly. Thanks, Robin From avi at qumranet.com Thu May 15 04:12:34 2008 From: avi at qumranet.com (Avi Kivity) Date: Thu, 15 May 2008 14:12:34 +0300 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080515110147.GD10126@sgi.com> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515110147.GD10126@sgi.com> Message-ID: <482C1AA2.20307@qumranet.com> Robin Holt wrote: > Then we need to deposit the information needed to do the invalidate. > > Lastly, we would need to interrupt. Unfortunately, here we have a > thundering herd. There could be up to 16256 processors interrupting the > same processor. That will be a lot of work. It will need to look up the > mm (without grabbing any sleeping locks in either xpmem or the kernel) > and do the tlb invalidates. > > You don't need to interrupt every time. Place your data in a queue (you do support rmw operations, right?) and interrupt. Invalidates from other processors will see that the queue hasn't been processed yet and skip the interrupt.
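To make the queue-plus-flag suggestion above concrete, here is a minimal sender-side sketch in which only the enqueuer that finds no interrupt outstanding raises one, so concurrent senders piggyback on a single interrupt. Every name in it (inv_request, inv_queue, send_cross_partition_ipi) is hypothetical rather than XPMEM or kernel code, and it glosses over the cross-partition coherence problem Robin describes: a real queue would have to live in memory both partitions can update with uncached atomic operations.

/* Hypothetical sketch of the "queue + single interrupt" idea; none of
 * these names come from XPMEM or the kernel tree. */
#include <linux/list.h>
#include <linux/spinlock.h>
#include <asm/atomic.h>
#include <asm/processor.h>

struct inv_request {
	struct list_head node;
	unsigned long start, len;	/* virtual range to invalidate */
	atomic_t done;			/* sender spins on this word */
};

struct inv_queue {
	spinlock_t lock;
	struct list_head reqs;
	int ipi_pending;		/* an interrupt is already on its way */
};

void send_cross_partition_ipi(struct inv_queue *q);	/* hypothetical transport hook */

static void queue_invalidate(struct inv_queue *q, struct inv_request *req)
{
	int need_ipi;

	spin_lock(&q->lock);
	list_add_tail(&req->node, &q->reqs);
	need_ipi = !q->ipi_pending;	/* only the first enqueuer interrupts */
	q->ipi_pending = 1;
	spin_unlock(&q->lock);

	if (need_ipi)
		send_cross_partition_ipi(q);

	while (!atomic_read(&req->done))	/* completion can be an uncached word */
		cpu_relax();
}

The handler on the target would drain the whole list, perform the TLB invalidates, clear ipi_pending under the lock, and set each request's done flag; that drain is what lets later enqueuers skip the interrupt.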
-- error compiling committee.c: too many arguments to function From sashak at voltaire.com Thu May 15 04:19:14 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 May 2008 14:19:14 +0300 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080515111914.GO24654@sashak.voltaire.com> Hi Hal, On 04:19 Mon 12 May , Hal Rosenstock wrote: > > I filed this as bug 1031: > https://bugs.openfabrics.org/show_bug.cgi?id=1031 > > > It would be nice if I could reproduce it in simulation. > > Yes, that would be nice; but I don't have a sim case. Do you have ibnetdiscover file for this case? If not from where report is coming? Sasha From dorfman.eli at gmail.com Thu May 15 04:23:31 2008 From: dorfman.eli at gmail.com (Eli Dorfman) Date: Thu, 15 May 2008 14:23:31 +0300 Subject: [ofa-general] iSer and Direct IO In-Reply-To: <482B7FE4.9070502@fusionio.com> References: <482B7FE4.9070502@fusionio.com> Message-ID: <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> On Thu, May 15, 2008 at 3:12 AM, Cameron Harr wrote: > Hi, I've been trying to compare performances between iSer and srpt and > am getting mixed results where iSer wins for IOPs and srpt wins for some > streaming b/w tests. I've tested with iozone, spew and FIO, and IOP > numbers are always higher on iSer. My problem though is that I'm a > little suspicious of some of the iSer numbers and whether they are > really using Direct IO. For example, you'll see below in some of my FIO > results that I'm getting a write B/W of 799.1 MB/s at one point. That's > way above what I can get natively on the device (~650 MB/s DIO) and is > more along the lines of buffered IO. If the IOP numbers are also using > some kind of caching, that could possibly taint them also. Does anyone > know if specifying DIO will really bypass all buffers or if something is > getting cached in the agents (iscsi, tgtd)? > > > FIO > --------------- iSer 1----iSer 2----SRPT 1----SRPT 2- > RBW (MB/s) 565.3 836.5 622.0 581.7 > Read IOPs 63488.1 68053.8 5335.6 5446.1 > WBW (MB/s) 799.1 737.7 589.5 594.4 > Write IOPs 79086.6 80005.7 33884.6 34058.6 > > > Thanks much, > Cameron > Your question should be posted on linux-scsi. See the following link that explains about DIO http://tldp.org/HOWTO/SCSI-Generic-HOWTO/dio.html Please check with sgp_dd to avoid any caching. Thanks, Eli From hrosenstock at xsigo.com Thu May 15 04:53:03 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 15 May 2008 04:53:03 -0700 Subject: [ofa-general] Re: [PATCH] OpenSM: Add QoS_management_in_OpenSM.txt to opensm/doc directory In-Reply-To: <20080515060916.GA24654@sashak.voltaire.com> References: <1210084406.2026.48.camel@hrosenstock-ws.xsigo.com> <20080515060916.GA24654@sashak.voltaire.com> Message-ID: <1210852383.2026.833.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-15 at 09:09 +0300, Sasha Khapyorsky wrote: > On 07:33 Tue 06 May , Hal Rosenstock wrote: > > Add Yevgeny's QoS_management_in_OpenSM.txt to opensm/doc directory > > > > Signed-off-by: Hal Rosenstock > > Applied. Thanks. This should also be applied to ofed_1_3 branch IMO. 
-- Hal > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From kliteyn at dev.mellanox.co.il Thu May 15 00:03:02 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 15 May 2008 10:03:02 +0300 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/ib_types.h: cosmetics - =?iso-8859-1?q?=B5sec_to_usec?= Message-ID: <482BE026.7030104@dev.mellanox.co.il> Hi Sasha, Although "µsec" in the comments looks really cool and sophisticated :-), I'd prefer to lose it and replace it with a simple "usec". Having 'µ' in the code confuses some editors and tools, such as "kompare". Signed-off-by: Yevgeny Kliteynik --- opensm/include/iba/ib_types.h | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index 51695b5..6f3c400 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -3099,7 +3099,7 @@ ib_path_rec_pkt_life(IN const ib_path_rec_t * const p_rec) * [in] Pointer to the path record object. * * RETURN VALUES -* Encoded path pkt_life = 4.096 µsec * 2 ** PacketLifeTime. +* Encoded path pkt_life = 4.096 usec * 2 ** PacketLifeTime. * * NOTES * @@ -6391,7 +6391,7 @@ ib_multipath_rec_pkt_life(IN const ib_multipath_rec_t * const p_rec) * [in] Pointer to the multipath record object. * * RETURN VALUES -* Encoded multipath pkt_life = 4.096 µsec * 2 ** PacketLifeTime. +* Encoded multipath pkt_life = 4.096 usec * 2 ** PacketLifeTime. * * NOTES * -- 1.5.1.4 From swise at opengridcomputing.com Thu May 15 07:16:41 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 09:16:41 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> Message-ID: <482C45C9.2010600@opengridcomputing.com> Talpey, Thomas wrote: > At 07:49 PM 5/14/2008, Roland Dreier wrote: >> also I wonder if it's clearer if we call this verb >> ib_alloc_fast_reg_mr(). > > I have to disagree. Calling anything "fast" simply invites a "faster" > thing to come along later. It's like calling something "new". > > I say call it what it is - a work-request-based, alloc-phys-buffer-list, > bind-pages-to-list, to-be-widely-supported memory registration. > Obviously, the individual verbs need to be a bit more precise. :-) > > Ralph - to answer your question who wants it, NFS/RDMA does, both > client and server. I talked about requirements that it matches closely > at Sonoma last month. > > But Steve - aren't these capable of protecting memory at byte > granularity? The word "page" in some of the names implies otherwise. > The MR, once bound, defines the memory region at byte granularity. The page list is just that: an array of DMA addresses of physical pages in memory. The page list + the region length + the first byte offset define the region. Steve.
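To ground the discussion, here is roughly how a consumer would drive the registration model Steve describes: allocate the MR and physical buffer list once, then post a work request whose page list, region length, and first-byte offset (iova_start) together define a byte-granular region. The function and field names follow the API proposed in this RFC thread (ib_alloc_fast_reg_mr and friends) and could change before merging; this is a usage sketch, not kernel code.

/* Usage sketch for the proposed work-request-based registration; names
 * follow this RFC and may differ in whatever API is finally merged. */
#include <rdma/ib_verbs.h>

static int fast_reg_region(struct ib_pd *pd, struct ib_qp *qp,
			   u64 *dma_pages, int npages,
			   u64 iova_start, u32 byte_len)
{
	struct ib_mr *mr;
	struct ib_fast_reg_page_list *pl;
	struct ib_send_wr wr, *bad_wr;
	int i, ret;

	mr = ib_alloc_fast_reg_mr(pd, npages);
	if (IS_ERR(mr))
		return PTR_ERR(mr);

	pl = ib_alloc_fast_reg_page_list(pd->device, npages);
	if (IS_ERR(pl)) {
		ib_dereg_mr(mr);
		return PTR_ERR(pl);
	}

	for (i = 0; i < npages; i++)
		pl->page_list[i] = dma_pages[i];	/* DMA addresses of the pages */

	memset(&wr, 0, sizeof wr);
	wr.opcode = IB_WR_FAST_REG_MR;
	wr.wr.fast_reg.page_list = pl;
	wr.wr.fast_reg.page_list_len = npages;
	wr.wr.fast_reg.page_shift = PAGE_SHIFT;
	wr.wr.fast_reg.length = byte_len;	/* region length ... */
	wr.wr.fast_reg.iova_start = iova_start;	/* ... and first byte offset */
	wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE |
		IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;
	wr.wr.fast_reg.rkey = mr->rkey;

	ret = ib_post_send(qp, &wr, &bad_wr);
	/* completion handling, invalidate, and dereg omitted from this sketch */
	return ret;
}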
From ogerlitz at voltaire.com Thu May 15 07:21:25 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:21:25 +0300 (IDT) Subject: [ofa-general] [RFC v2 PATCH 0/5] rdma/cma: RDMA_ALIGN_WITH_NETDEVICE ha mode Message-ID: main changes from v1: - added new event RDMA_CM_EVENT_NETDEV_CHANGE - took the approach of notifying the user vs disconnecting the ID - this change bought us support also for the datagram (unconnected) services! I prefer to go with the affiliated event approach, for the following reasons: 1) the rdma-cm consumer ULP is not actually exposed to neighbours/routes and netdevices, i.e. it knows the destination IP address and the rdma-cm does all the interaction with the network stack needed for the local (device/gid|mac/port/pkey) and remote (gid|mac) address resolutions, so in that respect this change follows this scheme. 2) it's much harder for user space ULPs to get network events, but they can easily get rdma-cm events. Or From ogerlitz at voltaire.com Thu May 15 07:22:03 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:22:03 +0300 (IDT) Subject: [ofa-general] [RFC v2 PATCH 1/5] net/bonding: announce fail-over for the active-backup mode In-Reply-To: References: Message-ID: Enhance bonding to announce fail-over for the active-backup mode through the netdev events notifier chain mechanism. Such an event can be of use for the RDMA CM (communication manager) to let native RDMA ULPs (eg NFS-RDMA, iSER) always use the same links as the IP stack does. Signed-off-by: Or Gerlitz Index: linux-2.6.26-rc2/drivers/net/bonding/bond_main.c =================================================================== --- linux-2.6.26-rc2.orig/drivers/net/bonding/bond_main.c 2008-05-13 10:02:22.000000000 +0300 +++ linux-2.6.26-rc2/drivers/net/bonding/bond_main.c 2008-05-15 12:29:44.000000000 +0300 @@ -1117,6 +1117,7 @@ void bond_change_active_slave(struct bon bond->send_grat_arp = 1; } else bond_send_gratuitous_arp(bond); + netdev_bonding_change(bond->dev); } } Index: linux-2.6.26-rc2/include/linux/notifier.h =================================================================== --- linux-2.6.26-rc2.orig/include/linux/notifier.h 2008-05-13 10:02:30.000000000 +0300 +++ linux-2.6.26-rc2/include/linux/notifier.h 2008-05-13 11:50:44.000000000 +0300 @@ -197,6 +197,7 @@ static inline int notifier_to_errno(int #define NETDEV_GOING_DOWN 0x0009 #define NETDEV_CHANGENAME 0x000A #define NETDEV_FEAT_CHANGE 0x000B +#define NETDEV_BONDING_FAILOVER 0x000C #define SYS_DOWN 0x0001 /* Notify of system down */ #define SYS_RESTART SYS_DOWN Index: linux-2.6.26-rc2/include/linux/netdevice.h =================================================================== --- linux-2.6.26-rc2.orig/include/linux/netdevice.h 2008-05-13 10:02:30.000000000 +0300 +++ linux-2.6.26-rc2/include/linux/netdevice.h 2008-05-13 11:50:20.000000000 +0300 @@ -1459,6 +1459,7 @@ extern void __dev_addr_unsync(struct de extern void dev_set_promiscuity(struct net_device *dev, int inc); extern void dev_set_allmulti(struct net_device *dev, int inc); extern void netdev_state_change(struct net_device *dev); +extern void netdev_bonding_change(struct net_device *dev); extern void netdev_features_change(struct net_device *dev); /* Load a device via the kmod */ extern void dev_load(struct net *net, const char *name); Index: linux-2.6.26-rc2/net/core/dev.c =================================================================== --- linux-2.6.26-rc2.orig/net/core/dev.c 2008-05-13 10:02:31.000000000 +0300 +++
linux-2.6.26-rc2/net/core/dev.c 2008-05-13 11:50:49.000000000 +0300 @@ -956,6 +956,12 @@ void netdev_state_change(struct net_devi } } +void netdev_bonding_change(struct net_device *dev) +{ + call_netdevice_notifiers(NETDEV_BONDING_FAILOVER, dev); +} +EXPORT_SYMBOL(netdev_bonding_change); + /** * dev_load - load a network module * @net: the applicable net namespace From ogerlitz at voltaire.com Thu May 15 07:22:35 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:22:35 +0300 (IDT) Subject: [ofa-general] [RFC v2 PATCH 2/5] rdma/addr: keep the name of the netdevice in struct rdma_dev_addr In-Reply-To: References: Message-ID: Keep also the local (src) device name in struct rdma_dev_addr. Under bonding HA scheme this can be used by the rdma-cm to align RDMA sessions to use the same links as the IP stack does under fail-over and route change cases. Signed-off-by: Or Gerlitz Index: linux-2.6.26-rc2/drivers/infiniband/core/addr.c =================================================================== --- linux-2.6.26-rc2.orig/drivers/infiniband/core/addr.c 2008-05-15 12:19:42.000000000 +0300 +++ linux-2.6.26-rc2/drivers/infiniband/core/addr.c 2008-05-15 14:49:31.000000000 +0300 @@ -100,6 +100,7 @@ int rdma_copy_addr(struct rdma_dev_addr memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN); if (dst_dev_addr) memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); + memcpy(dev_addr->src_dev_name, dev->name, IFNAMSIZ); return 0; } EXPORT_SYMBOL(rdma_copy_addr); Index: linux-2.6.26-rc2/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc2.orig/drivers/infiniband/core/cma.c 2008-05-15 12:19:42.000000000 +0300 +++ linux-2.6.26-rc2/drivers/infiniband/core/cma.c 2008-05-15 14:48:44.000000000 +0300 @@ -998,6 +998,7 @@ static struct rdma_id_private *cma_new_c union cma_ip_addr *src, *dst; __be16 port; u8 ip_ver; + int ret; if (cma_get_net_info(ib_event->private_data, listen_id->ps, &ip_ver, &port, &src, &dst)) @@ -1022,10 +1023,11 @@ static struct rdma_id_private *cma_new_c if (rt->num_paths == 2) rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path; - ib_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); ib_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); - ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); - rt->addr.dev_addr.dev_type = RDMA_NODE_IB_CA; + ret = rdma_translate_ip(&id->route.addr.src_addr, + &id->route.addr.dev_addr); + if (ret) + goto destroy_id; id_priv = container_of(id, struct rdma_id_private, id); id_priv->state = CMA_CONNECT; Index: linux-2.6.26-rc2/include/rdma/ib_addr.h =================================================================== --- linux-2.6.26-rc2.orig/include/rdma/ib_addr.h 2008-05-15 12:19:42.000000000 +0300 +++ linux-2.6.26-rc2/include/rdma/ib_addr.h 2008-05-15 14:49:08.000000000 +0300 @@ -57,6 +57,7 @@ struct rdma_dev_addr { unsigned char dst_dev_addr[MAX_ADDR_LEN]; unsigned char broadcast[MAX_ADDR_LEN]; enum rdma_node_type dev_type; + char src_dev_name[IFNAMSIZ]; }; /** From ogerlitz at voltaire.com Thu May 15 07:23:31 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:23:31 +0300 (IDT) Subject: [ofa-general] [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: References: Message-ID: RDMA_ALIGN_WITH_NETDEVICE high availability (ha) mode means that the consumer of the rdma-cm wants that RDMA sessions would always use the same links (eg ) as the IP stack does. 
In the current code, this does not happen when bonding did fail-over but the link used by an already existing session is operating fine. Consumers seeking this ha mode would get the new RDMA_CM_EVENT_NETDEV_CHANGE event when such misalignment happens. More ha modes can be added in the future. Signed-off-by: Or Gerlitz changes from v1 - - added new event RDMA_CM_EVENT_NETDEV_CHANGE - took the approach of notifying the user vs disconnecting the ID Index: linux-2.6.26-rc2/include/rdma/rdma_cm.h =================================================================== --- linux-2.6.26-rc2.orig/include/rdma/rdma_cm.h 2008-05-15 14:48:44.000000000 +0300 +++ linux-2.6.26-rc2/include/rdma/rdma_cm.h 2008-05-15 14:49:48.000000000 +0300 @@ -53,7 +53,8 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_DISCONNECTED, RDMA_CM_EVENT_DEVICE_REMOVAL, RDMA_CM_EVENT_MULTICAST_JOIN, - RDMA_CM_EVENT_MULTICAST_ERROR + RDMA_CM_EVENT_MULTICAST_ERROR, + RDMA_CM_EVENT_NETDEV_CHANGE }; enum rdma_port_space { @@ -328,4 +329,10 @@ void rdma_leave_multicast(struct rdma_cm */ void rdma_set_service_type(struct rdma_cm_id *id, int tos); +enum rdma_ha_mode { + RDMA_ALIGN_WITH_NETDEVICE = 1 +}; + +int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode mode); + #endif /* RDMA_CM_H */ Index: linux-2.6.26-rc2/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc2.orig/drivers/infiniband/core/cma.c 2008-05-15 14:48:44.000000000 +0300 +++ linux-2.6.26-rc2/drivers/infiniband/core/cma.c 2008-05-15 16:30:42.000000000 +0300 @@ -143,6 +143,7 @@ struct rdma_id_private { u32 qp_num; u8 srq; u8 tos; + enum rdma_ha_mode ha_mode; }; struct cma_multicast { @@ -1523,6 +1524,19 @@ void rdma_set_service_type(struct rdma_c } EXPORT_SYMBOL(rdma_set_service_type); +int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode mode) +{ + struct rdma_id_private *id_priv; + + if (mode != RDMA_ALIGN_WITH_NETDEVICE) + return -EINVAL; + + id_priv = container_of(id, struct rdma_id_private, id); + id_priv->ha_mode = mode; + return 0; +} +EXPORT_SYMBOL(rdma_set_high_availability_mode); + static void cma_query_handler(int status, struct ib_sa_path_rec *path_rec, void *context) { From ogerlitz at voltaire.com Thu May 15 07:25:34 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:25:34 +0300 (IDT) Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: References: Message-ID: RDMA_ALIGN_WITH_NETDEVICE high availability (ha) mode means that the consumer of the rdma-cm wants that RDMA sessions would always use the same links (eg ) as the IP stack does. In the current code, this does not happen when bonding did fail-over but the IB link used by an already existing session is operating fine. Use netevent notification for sensing that a change has happened in the IP stack, then scan the rdma-cm IDs list to see if there is an ID that is misaligned in that respect with the IP stack, and deliver RDMA_CM_EVENT_NETDEV_CHANGE for this ID, in case this is what the user asked by setting this mode for the ID. Signed-off-by: Or Gerlitz changes from v1 - - took the approach of notifying the user vs disconnecting the ID - this change bought us support also for the datagram (unconnected) services! - I used the cma_work_handler existing mechanism and decided to leave the ID state unchanged. 
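Before the locking notes below, it may help to spell out the consumer side of this proposal: a ULP opts in once per ID and then reacts to the new event in its ordinary event handler, exactly as the iSER patch later in this series does. The handler name below is a placeholder; only the rdma_cm symbols come from this patch set.

/* Placeholder ULP handler; only the rdma_cm symbols are from this series. */
static int my_cma_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
{
	switch (event->event) {
	case RDMA_CM_EVENT_NETDEV_CHANGE:
		/* the IP stack moved to another link: drop the connection
		 * and let the upper layer reconnect over the new path */
		rdma_disconnect(id);
		break;
	default:
		break;
	}
	return 0;
}

/* after rdma_create_id(), opt in to the notification:
 *	rdma_set_high_availability_mode(id, RDMA_ALIGN_WITH_NETDEVICE);
 */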
As for the locking/protection issues, I assume the netdev notifiers protect against net device removal etc. while processing the event, so dev_get/put calls are not needed. Other than that there's a need to protect against (rdma) device removal and ID destruction. Spending some time on the code, I couldn't see how to do it in finer grain than the global mutex being locked/unlocked over the execution of the double (dev list / id list) loops. Taking into account that this event is --rare-- and I changed the logic to first see if this ID wanted ha notification and only then do the more expensive memcmp calls, maybe this global locking is acceptable, and if not, I'd be happy to get some directions, e.g. if/how cma_disable_remove() and cma_enable_remove() can help to hold the lock for a shorter time, etc. Index: linux-2.6.26-rc2/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc2.orig/drivers/infiniband/core/cma.c 2008-05-15 16:30:42.000000000 +0300 +++ linux-2.6.26-rc2/drivers/infiniband/core/cma.c 2008-05-15 16:36:34.000000000 +0300 @@ -2743,6 +2743,64 @@ void rdma_leave_multicast(struct rdma_cm } EXPORT_SYMBOL(rdma_leave_multicast); +static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private *id_priv) +{ + struct rdma_dev_addr *dev_addr; + struct cma_work *work; + + dev_addr = &id_priv->id.route.addr.dev_addr; + + if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && + memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { + printk(KERN_ERR "addr change for device %s used by id %p, notifying\n", + ndev->name, &id_priv->id); + work = kzalloc(sizeof *work, GFP_KERNEL); + if (!work) + return -ENOMEM; + work->id = id_priv; + INIT_WORK(&work->work, cma_work_handler); + work->old_state = id_priv->state; + work->new_state = id_priv->state; + work->event.event = RDMA_CM_EVENT_NETDEV_CHANGE; + atomic_inc(&id_priv->refcount); + queue_work(cma_wq, &work->work); + } +} + +static int cma_netdev_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + struct net_device *ndev = (struct net_device *)ctx; + struct cma_device *cma_dev; + struct rdma_id_private *id_priv; + int ret = NOTIFY_DONE; + + if (dev_net(ndev) != &init_net) + return NOTIFY_DONE; + + if (event != NETDEV_BONDING_FAILOVER) + return NOTIFY_DONE; + + if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) + return NOTIFY_DONE; + + mutex_lock(&lock); + list_for_each_entry(cma_dev, &dev_list, list) + list_for_each_entry(id_priv, &cma_dev->id_list, list) { + if (id_priv->ha_mode == RDMA_ALIGN_WITH_NETDEVICE) { + ret = cma_netdev_align_id(ndev, id_priv); + if (ret) + break; + } + } + mutex_unlock(&lock); + return ret; +} + +static struct notifier_block cma_nb = { + .notifier_call = cma_netdev_callback +}; + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; @@ -2847,6 +2905,7 @@ static int cma_init(void) ib_sa_register_client(&sa_client); rdma_addr_register_client(&addr_client); + register_netdevice_notifier(&cma_nb); ret = ib_register_client(&cma_client); if (ret) @@ -2854,6 +2913,7 @@ static int cma_init(void) return 0; err: + unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); @@ -2863,6 +2923,7 @@ err: static void cma_cleanup(void) { ib_unregister_client(&cma_client); + unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client);
ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); From ogerlitz at voltaire.com Thu May 15 07:26:05 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:26:05 +0300 (IDT) Subject: [ofa-general] [RFC PATCH 5/5] ib/iser: use the rdma-cm new RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: References: Message-ID: enhance iser to request for notification on network stack changes which makes its rdma connection unaligned with the link used by the stack for the IPs used to establish the connection. When RDMA_CM_EVENT_NETDEV_CHANGE arrives, just disconnect the connection, following that the iscsid daemon would reconnect, and the new connection would be well aligned. Signed-off-by: Or Gerlitz Index: linux-2.6.26-rc2/drivers/infiniband/ulp/iser/iser_verbs.c =================================================================== --- linux-2.6.26-rc2.orig/drivers/infiniband/ulp/iser/iser_verbs.c 2008-05-15 15:10:21.000000000 +0300 +++ linux-2.6.26-rc2/drivers/infiniband/ulp/iser/iser_verbs.c 2008-05-15 15:31:49.000000000 +0300 @@ -476,6 +476,9 @@ static int iser_cma_handler(struct rdma_ case RDMA_CM_EVENT_DEVICE_REMOVAL: iser_disconnected_handler(cma_id); break; + case RDMA_CM_EVENT_NETDEV_CHANGE: + rdma_disconnect(cma_id); + break; default: iser_err("Unexpected RDMA CM event (%d)\n", event->event); break; @@ -534,7 +537,9 @@ int iser_connect(struct iser_conn *ib_ iser_err("rdma_create_id failed: %d\n", err); goto id_failure; } - + + rdma_set_high_availability_mode(ib_conn->cma_id, RDMA_ALIGN_WITH_NETDEVICE); + src = (struct sockaddr *)src_addr; dst = (struct sockaddr *)dst_addr; err = rdma_resolve_addr(ib_conn->cma_id, src, dst, 1000); From nico.mittenzwey at s2001.tu-chemnitz.de Thu May 15 07:31:40 2008 From: nico.mittenzwey at s2001.tu-chemnitz.de (Nico Mittenzwey) Date: Thu, 15 May 2008 16:31:40 +0200 Subject: [ofa-general] Retry count error with ipath on OFED-1.3 Message-ID: <482C494C.10204@s2001.tu-chemnitz.de> Hi, We have a problem with our QLogic InfiniPath PE-800 (rev 02), OFED 1.3 and MPI. Running simple MPI jobs like the OSU MPI bandwidth test between two nodes results in a retry count error (see end of the mail). We tried different MPI implementations like the supplied openmpi or self compiled openmpi/mvapich but always get this error. Using OFED 1.2 or the QLogic InfiniPath driver (which includes OFED 1.2) we don't get any errors. The system is a Scientific Linux SL release 5.1 with kernel 2.6.18-8.1.3.el5 (for OFED 1.2) or 2.6.18-53.1.14.el5 (OFED 1.3). There is also a Mellanox MT25204 HCA in the system which works perfectly (removing it doesn't help with the ipath problem). Since we like to stay updated we want to use OFED 1.3. Did anyone get the same error and found a solution? Thanks & regards Nico OFED 1.3 Infinipath Error: ># OSU MPI Bandwidth Test v3.1 ># Size Bandwidth (MB/s) >1 0.17 >2 0.39 >4 0.66 >8 1.80 >16 2.53 >32 5.11 >64 8.80 >128 23.09 >256 43.65 >512 84.42 >1024 151.63 >[0,1,0][btl_openib_component.c:1338:btl_openib_component_progress] from >compute-6-7 to: compute-6-8 error polling HP CQ with status RETRY >EXCEEDED ERROR status number 12 for wr_id 185705200 opcode 1 >-------------------------------------------------------------------------- >The InfiniBand retry count between two MPI processes has been >exceeded. "Retry count" is defined in the InfiniBand spec 1.2 >(section 12.7.38): > > The total number of times that the sender wishes the receiver to > retry timeout, packet sequence, etc. errors before posting a > completion error. 
> >This error typically means that there is something awry within the >InfiniBand fabric itself. You should note the hosts on which this >error has occurred; it has been observed that rebooting or removing a >particular host from the job can sometimes resolve this issue. > >Two MCA parameters can be used to control Open MPI's behavior with >respect to the retry count: > >* btl_openib_ib_retry_count - The number of times the sender will > attempt to retry (defaulted to 7, the maximum value). > >* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted > to 10). The actual timeout value used is calculated as: > > 4.096 microseconds * (2^btl_openib_ib_timeout) > > See the InfiniBand spec 1.2 (section 12.7.34) for more details. >-------------------------------------------------------------------------- >mpirun noticed that job rank 1 with PID 16883 on node compute-6-8 >exited on signal 15 (Terminated). From swise at opengridcomputing.com Thu May 15 07:36:02 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 09:36:02 -0500 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: References: Message-ID: <482C4A52.4000501@opengridcomputing.com> Or Gerlitz wrote: > RDMA_ALIGN_WITH_NETDEVICE high availability (ha) mode means that the consumer > of the rdma-cm wants that RDMA sessions would always use the same links (eg ) > as the IP stack does. In the current code, this does not happen when bonding did > fail-over but the IB link used by an already existing session is operating fine. > > Use netevent notification for sensing that a change has happened in the IP stack, > then scan the rdma-cm IDs list to see if there is an ID that is misaligned > in that respect with the IP stack, and deliver RDMA_CM_EVENT_NETDEV_CHANGE for this > ID, in case this is what the user asked by setting this mode for the ID. > > Signed-off-by: Or Gerlitz > At this point, I wonder if the naming should be different. IE the consumer really just wants notification of device change events. So instead of adding a new function rdma_set_high_availability_mode, you could just set an option saying WANT_NETDEV_CHANGE_EVENTS. Maybe we need to add rdma_set_option() to the kernel RDMA-CM API? IE make it more generic. Just a thought... Steve. From ogerlitz at voltaire.com Thu May 15 07:38:59 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:38:59 +0300 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implementRDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <000001c8b603$9bb70880$8e248686@amr.corp.intel.com> References: <482A0F32.2010001@opengridcomputing.com><482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> <482B395F.8020201@opengridcomputing.com> <482B3FF8.4040102@opengridcomputing.com> <000001c8b603$9bb70880$8e248686@amr.corp.intel.com> Message-ID: <482C4B03.9050507@voltaire.com> Sean Hefty wrote: > I thought about this, and I agree that it's worth exploring. The locking to > support device removal ended up being fairly complex. (I'm not sure it would > have been any easier for ULPs to do this though.) The main counter I see to > using a separate channel is that device removal is invoked per rdma_cm_id, so > there's precedence for invoking the callback per id. > > My expectation is that this is a rare event. Sean, Steve, Yes, this is rare event. 
I have stated at the [v2 0/5] email posting why I prefer this to be ID affiliated event, will be glad to hear your feedback on my arguments. Or. From ogerlitz at voltaire.com Thu May 15 07:41:11 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:41:11 +0300 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implementRDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> References: <482A0F32.2010001@opengridcomputing.com><482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> Message-ID: <482C4B87.30807@voltaire.com> Sean Hefty wrote: > This is my current preferred solution. > > I don't have an issue with the rdma_cm issuing some sort of notification event > when an IP address mapping changes. I would use an event name that indicated > this, rather than 'disconnect'. > > If this is implemented, I'd like to minimize the overhead per rdma_cm_id > required to report this event. OK, Sean, I took the notification (vs disconnection) approach which seemed to be suggested by all the reviewers. As for the overhead, I tried to minimize it. Or. From swise at opengridcomputing.com Thu May 15 07:43:49 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 09:43:49 -0500 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implementRDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <482C4B03.9050507@voltaire.com> References: <482A0F32.2010001@opengridcomputing.com><482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> <000801c8b5e1$27129f20$14c9180a@amr.corp.intel.com> <482B395F.8020201@opengridcomputing.com> <482B3FF8.4040102@opengridcomputing.com> <000001c8b603$9bb70880$8e248686@amr.corp.intel.com> <482C4B03.9050507@voltaire.com> Message-ID: <482C4C25.3040607@opengridcomputing.com> Or Gerlitz wrote: > Sean Hefty wrote: >> I thought about this, and I agree that it's worth exploring. The >> locking to >> support device removal ended up being fairly complex. (I'm not sure >> it would >> have been any easier for ULPs to do this though.) The main counter I >> see to >> using a separate channel is that device removal is invoked per >> rdma_cm_id, so >> there's precedence for invoking the callback per id. >> >> My expectation is that this is a rare event. > Sean, Steve, > > Yes, this is rare event. I have stated at the [v2 0/5] email posting why > I prefer this to be ID affiliated event, will be glad to hear your > feedback on my arguments. > ID affiliated event seems reasonable. Especially since its rare anyway. Steve. From ogerlitz at voltaire.com Thu May 15 07:44:41 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 15 May 2008 17:44:41 +0300 Subject: [ofa-general] [RFC PATCH 4/4] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> References: <482A0F32.2010001@opengridcomputing.com> <482ADC2B.5080008@voltaire.com> <469958e00805140924j4074a857j957297553a0801e7@mail.gmail.com> Message-ID: <482C4C59.20303@voltaire.com> Caitlin Bestler wrote: > 2) Reduce the cost of connection teardown/rebuild by offering > an option to "pre-bind" two RDMA devices so that memory > registrations will be valid on both. This probably requires > device level co-operation on L-Key/STag allocation, but > it would be reasonable feature to consider for the High > Availability market. 
> I am not going to explore this direction, feel free to explore it and let me know your findings. Or. From charr at fusionio.com Thu May 15 08:11:15 2008 From: charr at fusionio.com (Cameron Harr) Date: Thu, 15 May 2008 09:11:15 -0600 Subject: [ofa-general] iSer and Direct IO In-Reply-To: <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> References: <482B7FE4.9070502@fusionio.com> <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> Message-ID: <482C5293.5090005@fusionio.com> An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Thu May 15 08:25:22 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 15 May 2008 11:25:22 -0400 Subject: [ofa-general] iSer and Direct IO In-Reply-To: <482C5293.5090005@fusionio.com> References: <482B7FE4.9070502@fusionio.com> <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> <482C5293.5090005@fusionio.com> Message-ID: <482C55E2.8060905@scalableinformatics.com> Cameron Harr wrote: > ---- > [root at test05 ~]# sgp_dd dio=1 if=/dev/zero of=/dev/fioa bs=512 bpt=2048 > count=16777216 time=1 This is only 8 GB of IO. It is possible that (despite dio) you are caching. Make the IO much larger than RAM. Use a count of 128m or so. > time to transfer data was 5.556115 secs, 1546.03 MB/sec > [root at test05 ~]# sg_dd dio=1 if=/dev/zero of=/dev/fioa bs=512 bpt=2048 > count=16777216 time=1 > time to transfer data: 5.565360 secs at 1543.46 MB/sec > [root at test05 ~]# dd oflag=direct if=/dev/zero of=/dev/fioa bs=1M count=8192 > 8589934592 bytes (8.6 GB) copied, 12.7761 seconds, 672 MB/s We have found dd to be quite trustworthy with [oi]flag=direct. > ---- > Using iSer, with the small transfer chunks, sgp_dd has numbers that are in line > with what I'd expect for DIO while sg_dd doesn't: > --------- > sgp_dd: 200.64 MB/s > sg_dd: 735.42 MB/s > dd: 62.3 MB/s > -------- > But for larger transfers (with 1M block transfers), both sgp_dd and sg_dd show > well above what I think I can be getting: > ------- > sgp_dd: 882.43 > sg_dd: 819.89 > dd: 731 MB/s #Which is still high, and which makes me suspect iSer We had iSER bouncing from low 200s through 1000 MB/s during testing. Very hard to pin down good stable benchmark times. This was a few months ago. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From charr at fusionio.com Thu May 15 08:50:28 2008 From: charr at fusionio.com (Cameron Harr) Date: Thu, 15 May 2008 09:50:28 -0600 Subject: [ofa-general] iSer and Direct IO In-Reply-To: <482C55E2.8060905@scalableinformatics.com> References: <482B7FE4.9070502@fusionio.com> <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> <482C5293.5090005@fusionio.com> <482C55E2.8060905@scalableinformatics.com> Message-ID: <482C5BC4.6090301@fusionio.com> Joe Landman wrote: > This is only 8 GB of IO. It is possible that (despite dio) you are > caching. Make the IO much larger than RAM. Use a count of 128m or so. This is going to sound dumb, but I thought I had 4 GB of RAM and thus intentionally used a file size 2x my physical RAM. As it turns out, I have 32GB of RAM on the box (4G usually shows up as 38.... and I just saw the 3). Anyway, with a 64GB file the numbers are looking more accurate (and even low): 393.3 MB/s > We have found dd to be quite trustworthy with [oi]flag=direct. I like it too. 
At any rate, I'm going to need to do some new testing to avoid the ram size (might just set a mem limit on the boot line). There's still a bit of a discrepancy between IOP performance with iSer and srpt. Has anyone else done comparisons with the two? I think Erez was hoping to get some numbers before too long. Cameron From worleys at gmail.com Thu May 15 08:52:52 2008 From: worleys at gmail.com (Chris Worley) Date: Thu, 15 May 2008 09:52:52 -0600 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <20080515111914.GO24654@sashak.voltaire.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> Message-ID: Is there any command line utility to tell nodes that don't see the route change to "go ask the SM again for your routes"... or "clear the route table"? On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 04:19 Mon 12 May , Hal Rosenstock wrote: >> >> I filed this as bug 1031: >> https://bugs.openfabrics.org/show_bug.cgi?id=1031 >> >> > It would be nice if I could reproduce it in simulation. >> >> Yes, that would be nice; but I don't have a sim case. > > Do you have ibnetdiscover file for this case? If not from where report > is coming? > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From landman at scalableinformatics.com Thu May 15 08:58:58 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 15 May 2008 11:58:58 -0400 Subject: [ofa-general] iSer and Direct IO In-Reply-To: <482C5BC4.6090301@fusionio.com> References: <482B7FE4.9070502@fusionio.com> <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> <482C5293.5090005@fusionio.com> <482C55E2.8060905@scalableinformatics.com> <482C5BC4.6090301@fusionio.com> Message-ID: <482C5DC2.7000100@scalableinformatics.com> Cameron Harr wrote: > Joe Landman wrote: >> This is only 8 GB of IO. It is possible that (despite dio) you are >> caching. Make the IO much larger than RAM. Use a count of 128m or so. > > This is going to sound dumb, but I thought I had 4 GB of RAM and thus > intentionally used a file size 2x my physical RAM. As it turns out, I > have 32GB of RAM on the box (4G usually shows up as 38.... and I just > saw the 3). Anyway, with a 64GB file the numbers are looking more > accurate (and even low): > 393.3 MB/s This is about right. We were seeing ~650MB/s iSER for a 1.3 TB file dd on our units, but it bounced all over the place in terms of rates. Very hard to pin down a single performance number. Locally the drives were >750 MB/s, so 650 isn't terrible. >> We have found dd to be quite trustworthy with [oi]flag=direct. > I like it too. At any rate, I'm going to need to do some new testing to > avoid the ram size (might just set a mem limit on the boot line). > > There's still a bit of a discrepancy between IOP performance with iSer > and srpt. Has anyone else done comparisons with the two? I think Erez > was hoping to get some numbers before too long. > Cameron I think it might be coalescing the IOPs somehow (what do your elevators look like, how deep are your queues). Each drive can do 100-300 IOPs best case. 30000 IOPs is 100-300 drives. Or caching/coalescing/elevators in action. 
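As a cross-check on the caching question running through this thread, a small self-contained reader makes the O_DIRECT constraints explicit: the buffer must be sector-aligned (hence posix_memalign) and the run must be sized well past installed RAM so page-cache hits cannot inflate the rate. This is a generic sketch, not a tool anyone in the thread used, and the default device path is only an example taken from the earlier commands.

/* Generic O_DIRECT read-throughput check (illustrative only; not a tool
 * from this thread). Size the run well past RAM to defeat caching. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const size_t bs = 1 << 20;			/* 1 MiB per read */
	long long total = 0, limit = 64LL << 30;	/* 64 GiB, i.e. >> RAM */
	struct timespec t0, t1;
	void *buf;
	int fd = open(argc > 1 ? argv[1] : "/dev/fioa", O_RDONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, 4096, bs))
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	while (total < limit) {
		ssize_t n = read(fd, buf, bs);	/* aligned buffer, aligned size */
		if (n <= 0)
			break;
		total += n;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%lld bytes in %.1f s = %.1f MB/s\n", total, secs, total / secs / 1e6);
	return 0;
}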
Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From hrosenstock at xsigo.com Thu May 15 09:10:27 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 15 May 2008 09:10:27 -0700 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> Message-ID: <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> Chris, On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: > Is there any command line utility to tell nodes that don't see the > route change to "go ask the SM again for your routes"... or "clear the > route table"? I'm not sure what you're asking. There is no route table at end nodes; only switch nodes and the SM maintains these. The end node only has path records which it has retrieved and perhaps cached. Path records should be refreshed when SM or local LID changes which are local events to the end node. -- Hal > On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote: > > Hi Hal, > > > > On 04:19 Mon 12 May , Hal Rosenstock wrote: > >> > >> I filed this as bug 1031: > >> https://bugs.openfabrics.org/show_bug.cgi?id=1031 > >> > >> > It would be nice if I could reproduce it in simulation. > >> > >> Yes, that would be nice; but I don't have a sim case. > > > > Do you have ibnetdiscover file for this case? If not from where report > > is coming? > > > > Sasha > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From charr at fusionio.com Thu May 15 09:11:25 2008 From: charr at fusionio.com (Cameron Harr) Date: Thu, 15 May 2008 10:11:25 -0600 Subject: [ofa-general] iSer and Direct IO In-Reply-To: <482C5DC2.7000100@scalableinformatics.com> References: <482B7FE4.9070502@fusionio.com> <694d48600805150423n1a8b0efwf6d6596f8e7891ef@mail.gmail.com> <482C5293.5090005@fusionio.com> <482C55E2.8060905@scalableinformatics.com> <482C5BC4.6090301@fusionio.com> <482C5DC2.7000100@scalableinformatics.com> Message-ID: <482C60AD.9070109@fusionio.com> Joe Landman wrote: > > I think it might be coalescing the IOPs somehow (what do your > elevators look like, how deep are your queues). Each drive can do > 100-300 IOPs best case. 30000 IOPs is 100-300 drives. Or > caching/coalescing/elevators in action. > I'm actually using a single nand-flash device (Fusion IO's ioDrive) which can do up to 100K IOPs depending on the pattern and am using the default cfq elevator with all default values (queue depth 128). Not sure if you're looking for something more than that, but it's a pretty simple setup. 
Cameron From worleys at gmail.com Thu May 15 09:26:37 2008 From: worleys at gmail.com (Chris Worley) Date: Thu, 15 May 2008 10:26:37 -0600 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> Message-ID: On Thu, May 15, 2008 at 10:10 AM, Hal Rosenstock wrote: > Chris, > > On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: >> Is there any command line utility to tell nodes that don't see the >> route change to "go ask the SM again for your routes"... or "clear the >> route table"? > > I'm not sure what you're asking. There is no route table at end nodes; > only switch nodes and the SM maintains these. The end node only has path > records which it has retrieved and perhaps cached. Path records should > be refreshed when SM or local LID changes which are local events to the > end node. After an sm change (i.e. using the "-r" switch), nodes can't ping each other over IPoIB (other protocols also can't communicate). Restarting the OFED stack works, but modules won't unload if there was something active (i.e. Lustre), so the only recource to getting the OFED stack working again is a hard reboot. That's what I'd like to avoid if possible. Chris > > -- Hal > >> On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote: >> > Hi Hal, >> > >> > On 04:19 Mon 12 May , Hal Rosenstock wrote: >> >> >> >> I filed this as bug 1031: >> >> https://bugs.openfabrics.org/show_bug.cgi?id=1031 >> >> >> >> > It would be nice if I could reproduce it in simulation. >> >> >> >> Yes, that would be nice; but I don't have a sim case. >> > >> > Do you have ibnetdiscover file for this case? If not from where report >> > is coming? >> > >> > Sasha >> > _______________________________________________ >> > general mailing list >> > general at lists.openfabrics.org >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> > >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From sean.hefty at intel.com Thu May 15 09:42:20 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 May 2008 09:42:20 -0700 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: References: Message-ID: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> >+enum rdma_ha_mode { >+ RDMA_ALIGN_WITH_NETDEVICE = 1 >+}; >+ >+int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode >mode); I think we should just always report this event, and let users ignore it if they want. We don't seem to gain much by filtering the event at a lower level. 
- Sean From swise at opengridcomputing.com Thu May 15 09:45:24 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 11:45:24 -0500 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> Message-ID: <482C68A4.9020305@opengridcomputing.com> Sean Hefty wrote: >> +enum rdma_ha_mode { >> + RDMA_ALIGN_WITH_NETDEVICE = 1 >> +}; >> + >> +int rdma_set_high_availability_mode(struct rdma_cm_id *id, enum rdma_ha_mode >> mode); >> > > I think we should just always report this event, and let users ignore it if they > want. We don't seem to gain much by filtering the event at a lower level. > > - Sean > > Um, doesn't that then change the ABI? Some apps might hurl on a new (unexpected) event. Steve. From sean.hefty at intel.com Thu May 15 09:47:07 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 May 2008 09:47:07 -0700 Subject: [ofa-general] RE: [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: References: Message-ID: <000101c8b6ab$4fc9d7b0$bd59180a@amr.corp.intel.com> >+static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private >*id_priv) >+{ >+ struct rdma_dev_addr *dev_addr; >+ struct cma_work *work; >+ >+ dev_addr = &id_priv->id.route.addr.dev_addr; >+ >+ if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && >+ memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { >+ printk(KERN_ERR "addr change for device %s used by id %p, >notifying\n", >+ ndev->name, &id_priv->id); >+ work = kzalloc(sizeof *work, GFP_KERNEL); >+ if (!work) >+ return -ENOMEM; >+ work->id = id_priv; >+ INIT_WORK(&work->work, cma_work_handler); >+ work->old_state = id_priv->state; >+ work->new_state = id_priv->state; >+ work->event.event = RDMA_CM_EVENT_NETDEV_CHANGE; >+ atomic_inc(&id_priv->refcount); >+ queue_work(cma_wq, &work->work); >+ } >+} My initial thought on this is to see if we can just queue a single work item that can be used to invoke the user callbacks. I'd have to see how the locking worked out though to know if that approach is 'cleaner'. Currently, the rdma_cm ensures that only a single callback to the user is invoked at a time. This is needed to support the user trying to destroy their rdma_cm_id from the callback. I didn't look to see if this still maintains that. - Sean From sean.hefty at intel.com Thu May 15 09:50:16 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 May 2008 09:50:16 -0700 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <482C4A52.4000501@opengridcomputing.com> References: <482C4A52.4000501@opengridcomputing.com> Message-ID: <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> >So instead of adding a new function rdma_set_high_availability_mode, you >could just set an option saying WANT_NETDEV_CHANGE_EVENTS. Maybe we >need to add rdma_set_option() to the kernel RDMA-CM API? > >IE make it more generic. I agree with this. Having a generic mechanism to report rare events would be useful. Maybe the device removal notification can be adapted for this purpose? 
- Sean From swise at opengridcomputing.com Thu May 15 09:55:06 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 11:55:06 -0500 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> References: <482C4A52.4000501@opengridcomputing.com> <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> Message-ID: <482C6AEA.50109@opengridcomputing.com> Sean Hefty wrote: >> So instead of adding a new function rdma_set_high_availability_mode, you >> could just set an option saying WANT_NETDEV_CHANGE_EVENTS. Maybe we >> need to add rdma_set_option() to the kernel RDMA-CM API? >> >> IE make it more generic. >> > > I agree with this. Having a generic mechanism to report rare events would be > useful. Maybe the device removal notification can be adapted for this purpose? > > - Sean > Both of these events are device related... We could have a cm_id option that can be set that sez "i want device related events"... From sean.hefty at intel.com Thu May 15 09:59:12 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 May 2008 09:59:12 -0700 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <482C6AEA.50109@opengridcomputing.com> References: <482C4A52.4000501@opengridcomputing.com> <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> <482C6AEA.50109@opengridcomputing.com> Message-ID: <000301c8b6ac$ffd1aa60$bd59180a@amr.corp.intel.com> >Both of these events are device related... We could have a cm_id option >that can be set that sez "i want device related events"... I'm not sure you want to hide device removal events, since the user must destroy their rdma_cm_id in that case. From hrosenstock at xsigo.com Thu May 15 10:12:48 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 15 May 2008 10:12:48 -0700 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> Message-ID: <1210871569.12616.41.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-15 at 10:26 -0600, Chris Worley wrote: > On Thu, May 15, 2008 at 10:10 AM, Hal Rosenstock wrote: > > Chris, > > > > On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: > >> Is there any command line utility to tell nodes that don't see the > >> route change to "go ask the SM again for your routes"... or "clear the > >> route table"? > > > > I'm not sure what you're asking. There is no route table at end nodes; > > only switch nodes and the SM maintains these. The end node only has path > > records which it has retrieved and perhaps cached. Path records should > > be refreshed when SM or local LID changes which are local events to the > > end node. > > After an sm change (i.e. using the "-r" switch), That should be a local LID change. > nodes can't ping each > other over IPoIB (other protocols also can't communicate). Sounds like ULP issue(s) in handling this. What kernel and/or OFED version are you running ? -- Hal > Restarting the OFED stack works, but modules won't unload if there was > something active (i.e. Lustre), so the only recource to getting the > OFED stack working again is a hard reboot. > > That's what I'd like to avoid if possible. 
> > Chris > > > > -- Hal > > > >> On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote: > >> > Hi Hal, > >> > > >> > On 04:19 Mon 12 May , Hal Rosenstock wrote: > >> >> > >> >> I filed this as bug 1031: > >> >> https://bugs.openfabrics.org/show_bug.cgi?id=1031 > >> >> > >> >> > It would be nice if I could reproduce it in simulation. > >> >> > >> >> Yes, that would be nice; but I don't have a sim case. > >> > > >> > Do you have ibnetdiscover file for this case? If not from where report > >> > is coming? > >> > > >> > Sasha > >> > _______________________________________________ > >> > general mailing list > >> > general at lists.openfabrics.org > >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > > >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >> > > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From weiny2 at llnl.gov Thu May 15 10:14:18 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 15 May 2008 10:14:18 -0700 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080515101418.4ccb53f3.weiny2@llnl.gov> On Thu, 15 May 2008 10:26:37 -0600 "Chris Worley" wrote: > On Thu, May 15, 2008 at 10:10 AM, Hal Rosenstock wrote: > > Chris, > > > > On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: > >> Is there any command line utility to tell nodes that don't see the > >> route change to "go ask the SM again for your routes"... or "clear the > >> route table"? > > > > I'm not sure what you're asking. There is no route table at end nodes; > > only switch nodes and the SM maintains these. The end node only has path > > records which it has retrieved and perhaps cached. Path records should > > be refreshed when SM or local LID changes which are local events to the > > end node. > > After an sm change (i.e. using the "-r" switch), nodes can't ping each > other over IPoIB (other protocols also can't communicate). Is it absolutely necessary to run with the "-r" switch? Here we have not problems letting the SM attempt to use the same LID's for nodes. Ira > > Restarting the OFED stack works, but modules won't unload if there was > something active (i.e. Lustre), so the only recource to getting the > OFED stack working again is a hard reboot. > > That's what I'd like to avoid if possible. > > Chris > > > > -- Hal > > > >> On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote: > >> > Hi Hal, > >> > > >> > On 04:19 Mon 12 May , Hal Rosenstock wrote: > >> >> > >> >> I filed this as bug 1031: > >> >> https://bugs.openfabrics.org/show_bug.cgi?id=1031 > >> >> > >> >> > It would be nice if I could reproduce it in simulation. > >> >> > >> >> Yes, that would be nice; but I don't have a sim case. > >> > > >> > Do you have ibnetdiscover file for this case? If not from where report > >> > is coming? 
> >> > > >> > Sasha > >> > _______________________________________________ > >> > general mailing list > >> > general at lists.openfabrics.org > >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > > >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >> > > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From clameter at sgi.com Thu May 15 10:33:57 2008 From: clameter at sgi.com (Christoph Lameter) Date: Thu, 15 May 2008 10:33:57 -0700 (PDT) Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080515075747.GA7177@wotan.suse.de> References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> Message-ID: On Thu, 15 May 2008, Nick Piggin wrote: > Oh, I get that confused because of the mixed up naming conventions > there: unmap_page_range should actually be called zap_page_range. But > at any rate, yes we can easily zap pagetables without holding mmap_sem. How is that synchronized with code that walks the same pagetable. These walks may not hold mmap_sem either. I would expect that one could only remove a portion of the pagetable where we have some sort of guarantee that no accesses occur. So the removal of the vma prior ensures that? From worleys at gmail.com Thu May 15 10:35:17 2008 From: worleys at gmail.com (Chris Worley) Date: Thu, 15 May 2008 11:35:17 -0600 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <1210871569.12616.41.camel@hrosenstock-ws.xsigo.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> <1210871569.12616.41.camel@hrosenstock-ws.xsigo.com> Message-ID: On Thu, May 15, 2008 at 11:12 AM, Hal Rosenstock wrote: > On Thu, 2008-05-15 at 10:26 -0600, Chris Worley wrote: >> On Thu, May 15, 2008 at 10:10 AM, Hal Rosenstock wrote: >> > Chris, >> > >> > On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: >> >> Is there any command line utility to tell nodes that don't see the >> >> route change to "go ask the SM again for your routes"... or "clear the >> >> route table"? >> > >> > I'm not sure what you're asking. There is no route table at end nodes; >> > only switch nodes and the SM maintains these. The end node only has path >> > records which it has retrieved and perhaps cached. Path records should >> > be refreshed when SM or local LID changes which are local events to the >> > end node. >> >> After an sm change (i.e. using the "-r" switch), > > That should be a local LID change. > >> nodes can't ping each >> other over IPoIB (other protocols also can't communicate). > > Sounds like ULP issue(s) in handling this. What kernel and/or OFED > version are you running ? 
Currently, the SM is running OFED 1.3 on an RHEL4 2.6.9-67.0.4 kernel with Lustre 1.6.4.2 changes. The compute nodes are running the same kernel w/ OFED 1.2.5.5... which will be upgraded to 1.3 by the end of the day. Chris > > -- Hal > >> Restarting the OFED stack works, but modules won't unload if there was >> something active (i.e. Lustre), so the only recource to getting the >> OFED stack working again is a hard reboot. >> >> That's what I'd like to avoid if possible. >> >> Chris >> > >> > -- Hal >> > >> >> On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote: >> >> > Hi Hal, >> >> > >> >> > On 04:19 Mon 12 May , Hal Rosenstock wrote: >> >> >> >> >> >> I filed this as bug 1031: >> >> >> https://bugs.openfabrics.org/show_bug.cgi?id=1031 >> >> >> >> >> >> > It would be nice if I could reproduce it in simulation. >> >> >> >> >> >> Yes, that would be nice; but I don't have a sim case. >> >> > >> >> > Do you have ibnetdiscover file for this case? If not from where report >> >> > is coming? >> >> > >> >> > Sasha >> >> > _______________________________________________ >> >> > general mailing list >> >> > general at lists.openfabrics.org >> >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> > >> >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> > >> >> _______________________________________________ >> >> general mailing list >> >> general at lists.openfabrics.org >> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > >> > > > From worleys at gmail.com Thu May 15 10:37:18 2008 From: worleys at gmail.com (Chris Worley) Date: Thu, 15 May 2008 11:37:18 -0600 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <20080515101418.4ccb53f3.weiny2@llnl.gov> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> <20080515101418.4ccb53f3.weiny2@llnl.gov> Message-ID: On Thu, May 15, 2008 at 11:14 AM, Ira Weiny wrote: > On Thu, 15 May 2008 10:26:37 -0600 > "Chris Worley" wrote: >> After an sm change (i.e. using the "-r" switch), nodes can't ping each >> other over IPoIB (other protocols also can't communicate). > > Is it absolutely necessary to run with the "-r" switch? Here we have not > problems letting the SM attempt to use the same LID's for nodes. yes, especially when chaging routing algorithms between the default and fat-tree. Chris From ralph.campbell at qlogic.com Thu May 15 10:40:43 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 15 May 2008 10:40:43 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> Message-ID: <1210873243.3949.130.camel@brick.pathscale.com> On Thu, 2008-05-15 at 02:11 -0400, Talpey, Thomas wrote: > At 02:04 AM 5/15/2008, Talpey, Thomas wrote: > >At 07:54 PM 5/14/2008, Roland Dreier wrote: > >>Second question -- IB BMME and iWARP talk about a key portion (least > >>significant byte) of STag/L_Key/R_Key as being under consumer control. > >>Do we want to expose that as part of this API? Basically it means we > >>need to add a way for the consumer to pass in a new L_Key/STag as part > >>of a lot of calls. 
> > > >I think the Key portion is a quite useful way for the upper layer to > >salt the actual R_Keys as a protection mechanism, and having it would > >simplify a bunch of defensive code in the NFS/RDMA client. Currently, > >because the keys are provider-chosen and potentially recycled, there > >is a latent risk. > > > >But, I only want it if ALL future providers support it in some way. If a > >subset does not, it's not worth coding around the differences. > > I forgot to mention that the provider portion of the R_Key is reduced > to 24 bits as a result of exposing/requiring the key. This may cause an > issue at large scale, if the R_Keys have global scope. If they are limited > to use on specific connections as in iWARP, then this is less of an issue. > > Tom. For IB, the R_Keys are global and the spec. says that the user portion always has to be the lower 8 bits (ch 10.6.3.4) so it should be the same for all HCAs. From hrosenstock at xsigo.com Thu May 15 10:50:40 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 15 May 2008 10:50:40 -0700 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> <1210871569.12616.41.camel@hrosenstock-ws.xsigo.com> Message-ID: <1210873840.12616.45.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-15 at 11:35 -0600, Chris Worley wrote: > On Thu, May 15, 2008 at 11:12 AM, Hal Rosenstock wrote: > > On Thu, 2008-05-15 at 10:26 -0600, Chris Worley wrote: > >> On Thu, May 15, 2008 at 10:10 AM, Hal Rosenstock wrote: > >> > Chris, > >> > > >> > On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: > >> >> Is there any command line utility to tell nodes that don't see the > >> >> route change to "go ask the SM again for your routes"... or "clear the > >> >> route table"? > >> > > >> > I'm not sure what you're asking. There is no route table at end nodes; > >> > only switch nodes and the SM maintains these. The end node only has path > >> > records which it has retrieved and perhaps cached. Path records should > >> > be refreshed when SM or local LID changes which are local events to the > >> > end node. > >> > >> After an sm change (i.e. using the "-r" switch), > > > > That should be a local LID change. > > > >> nodes can't ping each > >> other over IPoIB (other protocols also can't communicate). > > > > Sounds like ULP issue(s) in handling this. What kernel and/or OFED > > version are you running ? > > Currently, the SM is running OFED 1.3 on an RHEL4 2.6.9-67.0.4 kernel > with Lustre 1.6.4.2 changes. > > The compute nodes are running the same kernel w/ OFED 1.2.5.5... which > will be upgraded to 1.3 by the end of the day. Maybe that will be better for LID change; Let us know. -- Hal > > Chris > > > > -- Hal > > > >> Restarting the OFED stack works, but modules won't unload if there was > >> something active (i.e. Lustre), so the only recource to getting the > >> OFED stack working again is a hard reboot. > >> > >> That's what I'd like to avoid if possible. 
> >> >> Chris
> >> >
> >> > -- Hal
> >> >
> >> >> On Thu, May 15, 2008 at 5:19 AM, Sasha Khapyorsky wrote:
> >> >> > Hi Hal,
> >> >> >
> >> >> > On 04:19 Mon 12 May , Hal Rosenstock wrote:
> >> >> >>
> >> >> >> I filed this as bug 1031:
> >> >> >> https://bugs.openfabrics.org/show_bug.cgi?id=1031
> >> >> >>
> >> >> >> > It would be nice if I could reproduce it in simulation.
> >> >> >>
> >> >> >> Yes, that would be nice; but I don't have a sim case.
> >> >> >
> >> >> > Do you have ibnetdiscover file for this case? If not from where report
> >> >> > is coming?
> >> >> >
> >> >> > Sasha
> >> >> > _______________________________________________
> >> >> > general mailing list
> >> >> > general at lists.openfabrics.org
> >> >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >> >> >
> >> >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >> >> >
> >> >> _______________________________________________
> >> >> general mailing list
> >> >> general at lists.openfabrics.org
> >> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >> >>
> >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >> >
> >> >
> >
>

From swise at opengridcomputing.com Thu May 15 11:17:34 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 15 May 2008 13:17:34 -0500
Subject: [ofa-general] [PATCH RFC v2] RDMA: New Memory Extensions.
Message-ID: <20080515181734.21020.47137.stgit@dell3.ogc.int>

The following patch proposes the API and core changes needed to implement
the IB BMME and iWARP equivalent memory extensions. Please review these vs
the verbs specs and see what I've missed. This patch is a request for
comments.

Steve.

Changes since Version 1:

- ib_alloc_mr() -> ib_alloc_fast_reg_mr()
- pbl_depth -> max_page_list_len
- page_list_len -> max_page_list_len where it makes sense
- int -> unsigned int where needed
- fbo -> first_byte_offset
- added page size and page_list_len to fast_reg union in ib_send_wr
- rearranged work request fast_reg union of ib_send_wr to pack it
- dropped remove_access parameter from ib_alloc_fast_reg_mr()
- IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS
- compiled

-----

RDMA: New Memory Extensions.

Support for the IB BMME and iWARP equivalent memory extensions to
non-shared memory regions. This includes:

- allocation of an ib_mr for use in fast register work requests
- device-specific alloc/free of physical buffer lists for use in fast
  register work requests. This allows devices to allocate this memory as
  needed (like via dma_alloc_coherent).
- fast register memory region work request
- invalidate local memory region work request
- read with invalidate local memory region work request (iWARP only)

Design details:

- New device capability flag added: IB_DEVICE_MEM_MGT_EXTENSIONS indicates
  device support for this feature.
- New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request.
- New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr.
- New API function, ib_alloc_fast_reg_mr() used to allocate fast_reg
  memory regions.
- New API function, ib_alloc_fast_reg_page_list to allocate device-specific
  page lists.
- New API function, ib_free_fast_reg_page_list to free said page lists.

Usage Model:

- MR allocated with ib_alloc_fast_reg_mr()
- Page lists allocated via ib_alloc_fast_reg_page_list().
- MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) - MR deallocated with ib_dereg_mr() - page lists dealloced via ib_free_fast_reg_page_list(). Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs via posting work requests to the send queue. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can be achieved while still allowing device-specific page_list processing. --- drivers/infiniband/core/verbs.c | 46 +++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 55 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 101 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 0504208..0a334b4 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr) } EXPORT_SYMBOL(ib_dereg_mr); +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len) +{ + struct ib_mr *mr; + + if (!pd->device->alloc_fast_reg_mr) + return ERR_PTR(-ENOSYS); + + mr = pd->device->alloc_fast_reg_mr(pd, max_page_list_len); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + mr->uobject = NULL; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_mr); + +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int max_page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + if (!device->alloc_fast_reg_page_list) + return ERR_PTR(-ENOSYS); + + page_list = device->alloc_fast_reg_page_list(device, max_page_list_len); + + if (!IS_ERR(page_list)) { + page_list->device = device; + page_list->max_page_list_len = max_page_list_len; + } + + return page_list; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_page_list); + +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) +{ + page_list->device->free_fast_reg_page_list(page_list); +} +EXPORT_SYMBOL(ib_free_fast_reg_page_list); + /* Memory windows */ struct ib_mw *ib_alloc_mw(struct ib_pd *pd) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..cbef5a6 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -106,6 +106,7 @@ enum ib_device_cap_flags { IB_DEVICE_UD_IP_CSUM = (1<<18), IB_DEVICE_UD_TSO = (1<<19), IB_DEVICE_SEND_W_INV = (1<<21), + IB_DEVICE_MEM_MGT_EXTENSIONS = (1<<22), }; enum ib_atomic_cap { @@ -414,6 +415,8 @@ enum ib_wc_opcode { IB_WC_FETCH_ADD, IB_WC_BIND_MW, IB_WC_LSO, + IB_WC_FAST_REG_MR, + IB_WC_INVALIDATE_MR, /* * Set value of IB_WC_RECV so consumers can test if a completion is a * receive by testing (opcode & IB_WC_RECV). 
@@ -628,6 +631,9 @@ enum ib_wr_opcode { IB_WR_ATOMIC_FETCH_AND_ADD, IB_WR_LSO, IB_WR_SEND_WITH_INV, + IB_WR_FAST_REG_MR, + IB_WR_INVALIDATE_MR, + IB_WR_READ_WITH_INV, }; enum ib_send_flags { @@ -676,6 +682,20 @@ struct ib_send_wr { u16 pkey_index; /* valid for GSI only */ u8 port_num; /* valid for DR SMPs on switch only */ } ud; + struct { + u64 iova_start; + struct ib_mr *mr; + struct ib_fast_reg_page_list *page_list; + unsigned int page_size; + unsigned int page_list_len; + unsigned int first_byte_offset; + u32 length; + int access_flags; + + } fast_reg; + struct { + struct ib_mr *mr; + } local_inv; } wr; }; @@ -1014,6 +1034,10 @@ struct ib_device { int (*query_mr)(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int (*dereg_mr)(struct ib_mr *mr); + struct ib_mr * (*alloc_fast_reg_mr)(struct ib_pd *pd, + int max_page_list_len); + struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device, int page_list_len); + void (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list); int (*rereg_phys_mr)(struct ib_mr *mr, int mr_rereg_mask, struct ib_pd *pd, @@ -1808,6 +1832,37 @@ int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int ib_dereg_mr(struct ib_mr *mr); /** + * ib_alloc_fast_reg_mr - Allocates memory region usable with the + * IB_WR_FAST_REG_MR send work request. + * @pd: The protection domain associated with the region. + * @max_page_list_len: requested max physical buffer list size to be allocated. + */ +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len); + +struct ib_fast_reg_page_list { + struct ib_device *device; + u64 *page_list; + unsigned int max_page_list_len; +}; + +/** + * ib_alloc_fast_reg_page_list - Allocates a page list array to be used + * in a IB_WR_FAST_REG_MR work request. The resources allocated by this method + * allows for dev-specific optimization of the FAST_REG operation. + * @device - ib device pointer. + * @page_list_len - depth of the page list array to be allocated. + */ +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int page_list_len); + +/** + * ib_free_fast_reg_page_list - Deallocates a previously allocated + * page list array. + * @page_list - struct ib_fast_reg_page_list pointer to be deallocated. + */ +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list); + +/** * ib_alloc_mw - Allocates a memory window. * @pd: The protection domain associated with the memory window. */ From ralph.campbell at qlogic.com Thu May 15 11:18:59 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 15 May 2008 11:18:59 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> Message-ID: <1210875539.3949.153.camel@brick.pathscale.com> On Wed, 2008-05-14 at 19:49 -0700, Roland Dreier wrote: > > So you want the page size specified in the fast_reg_page_list as > > opposed to when the page list is bound to the fast_reg mr (via > > post_send)? > > It's kind of the same thing, since the fast_reg_page_list is part of the > send work request... the structures you have at the moment are: > > > + struct { > > + u64 iova_start; > > + struct ib_fast_reg_page_list *page_list; > > + int fbo; > > + u32 length; > > + int access_flags; > > + struct ib_mr *mr; > > (side note... 
move this pointer up with the other pointers, so you don't > end up with a hole in the structure due to alignment... or stick an int > page_size in to fill the hole) > > > + } fast_reg; > > > +struct ib_fast_reg_page_list { > > + struct ib_device *device; > > + u64 *page_list; > > + int page_list_len; > > +}; > > is page_list_len the maximum length of the page_list, or is it filled in > by the consumer? The driver could figure out the length of the > page_list for any given work request by looking at the MR length and the > page_size I suppose. > > - R. I think Roland and Steve misunderstood what I was asking about the struct ib_fast_reg_page_list * returned from ib_alloc_fast_reg_page_list(). The question is "what can the caller do with the pointer?" Clearly, the caller can pass the pointer to ib_post_send(IB_WR_FAST_REG_MR) and use the [LR]_Key in the normal ways. Can the caller dereference the pointer and look at the values in page_list[]? Are these values understood to be a physical addresses that can be passed to phys_to_virt() for example? Are they byte addresses always aligned to a page boundary? The reason I ask is that the address used with the [LR]_Key from ib_get_dma_mr() has to be translated with ib_dma_map_single(), etc. because the ipath driver doesn't necessarily use physical addresses for the address in the send WQEs. Normally, the address in the send WQE is a kernel virtual address so the ib_ipath driver can memcpy() the data to the chip. Lets say that ib_ipath uses vmalloc() to allocate the pages instead of dma_alloc_coherent(). As long as the ULP only uses the page_list values as an uninterpreted number that is passed back to the driver via subsequent verbs calls, it wouldn't matter to the ULP what the number represents. But if the ULP expects to be able to call some other kernel function to map or translate that value, then the ULP has to know what kind of number it represents, its size and alignment, etc. From swise at opengridcomputing.com Thu May 15 11:39:25 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 13:39:25 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <1210875539.3949.153.camel@brick.pathscale.com> References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> <1210875539.3949.153.camel@brick.pathscale.com> Message-ID: <482C835D.50401@opengridcomputing.com> Ralph Campbell wrote: > On Wed, 2008-05-14 at 19:49 -0700, Roland Dreier wrote: > >> > So you want the page size specified in the fast_reg_page_list as >> > opposed to when the page list is bound to the fast_reg mr (via >> > post_send)? >> >> It's kind of the same thing, since the fast_reg_page_list is part of the >> send work request... the structures you have at the moment are: >> >> > + struct { >> > + u64 iova_start; >> > + struct ib_fast_reg_page_list *page_list; >> > + int fbo; >> > + u32 length; >> > + int access_flags; >> > + struct ib_mr *mr; >> >> (side note... move this pointer up with the other pointers, so you don't >> end up with a hole in the structure due to alignment... or stick an int >> page_size in to fill the hole) >> >> > + } fast_reg; >> >> > +struct ib_fast_reg_page_list { >> > + struct ib_device *device; >> > + u64 *page_list; >> > + int page_list_len; >> > +}; >> >> is page_list_len the maximum length of the page_list, or is it filled in >> by the consumer? 
The driver could figure out the length of the >> page_list for any given work request by looking at the MR length and the >> page_size I suppose. >> >> - R. >> > > I think Roland and Steve misunderstood what I was asking about > the struct ib_fast_reg_page_list * returned from > ib_alloc_fast_reg_page_list(). > > The question is "what can the caller do with the pointer?" > Clearly, the caller can pass the pointer to > ib_post_send(IB_WR_FAST_REG_MR) and use the [LR]_Key in the > normal ways. > > Can the caller dereference the pointer and look at the > values in page_list[]? Are these values understood to be > a physical addresses that can be passed to phys_to_virt() for example? > Are they byte addresses always aligned to a page boundary? > > The caller must _fill in_ the values in the page list. That's the whole point. IE all this func is doing is allocating the _memory_ to store the page list that the caller is building. The special function is needed because some devices might need to DMA the page list array from this memory as part of processing the FAST_REG_MR work request, and thus needs to allocate it dma coherently. The pointer returned is a kernel virtual address and can be read from/written to by the caller. > The reason I ask is that the address used with the [LR]_Key from > ib_get_dma_mr() has to be translated with ib_dma_map_single(), etc. > because the ipath driver doesn't necessarily use physical addresses > for the address in the send WQEs. Normally, the address in the > send WQE is a kernel virtual address so the ib_ipath driver can > memcpy() the data to the chip. > > Lets say that ib_ipath uses vmalloc() to allocate the pages > instead of dma_alloc_coherent(). As long as the ULP only uses > the page_list values as an uninterpreted number that is passed > back to the driver via subsequent verbs calls, it wouldn't > matter to the ULP what the number represents. But if the ULP > expects to be able to call some other kernel function to > map or translate that value, then the ULP has to know what > kind of number it represents, its size and alignment, etc. > We're not talking about allocating the pages themselves. 
Here's an example (ignoring errors): page_list = ib_alloc_fast_reg_page_list(device, 1); v = get_free_page(GFP_KERNEL); page_list->page_list[0] = ib_dma_map_single(device, v, PAGE_SIZE, DMA_TO_DEVICE|DMA_FROM_DEVICE); wr.opcode = IB_WR_FAST_REG_MR; wr.next = NULL; wr.send_flags = 0; wr.wr_id = 0xdeadbeef; wr.wr.fast_reg.mr = mr; wr.wr.fast_reg.page_list = page_list; wr.wr.fast_reg.page_size = PAGE_SIZE; wr.wr.fast_reg.page_list_len = 1; wr.wr.fast_reg.first_byte_offset = 0; wr.wr.fast_reg.iova_start = (u64)v; wr.wr.fast_reg.length = PAGE_SIZE; wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE; ib_post_send(qp, &wr, &bad_wr); From worleys at gmail.com Thu May 15 11:45:08 2008 From: worleys at gmail.com (Chris Worley) Date: Thu, 15 May 2008 12:45:08 -0600 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <1210873840.12616.45.camel@hrosenstock-ws.xsigo.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> <1210871569.12616.41.camel@hrosenstock-ws.xsigo.com> <1210873840.12616.45.camel@hrosenstock-ws.xsigo.com> Message-ID: On Thu, May 15, 2008 at 11:50 AM, Hal Rosenstock wrote: > On Thu, 2008-05-15 at 11:35 -0600, Chris Worley wrote: >> On Thu, May 15, 2008 at 11:12 AM, Hal Rosenstock wrote: >> > On Thu, 2008-05-15 at 10:26 -0600, Chris Worley wrote: >> >> On Thu, May 15, 2008 at 10:10 AM, Hal Rosenstock wrote: >> >> > Chris, >> >> > >> >> > On Thu, 2008-05-15 at 09:52 -0600, Chris Worley wrote: >> >> >> Is there any command line utility to tell nodes that don't see the >> >> >> route change to "go ask the SM again for your routes"... or "clear the >> >> >> route table"? >> >> > >> >> > I'm not sure what you're asking. There is no route table at end nodes; >> >> > only switch nodes and the SM maintains these. The end node only has path >> >> > records which it has retrieved and perhaps cached. Path records should >> >> > be refreshed when SM or local LID changes which are local events to the >> >> > end node. >> >> >> >> After an sm change (i.e. using the "-r" switch), >> > >> > That should be a local LID change. >> > >> >> nodes can't ping each >> >> other over IPoIB (other protocols also can't communicate). >> > >> > Sounds like ULP issue(s) in handling this. What kernel and/or OFED >> > version are you running ? >> >> Currently, the SM is running OFED 1.3 on an RHEL4 2.6.9-67.0.4 kernel >> with Lustre 1.6.4.2 changes. >> >> The compute nodes are running the same kernel w/ OFED 1.2.5.5... which >> will be upgraded to 1.3 by the end of the day. > > Maybe that will be better for LID change; Let us know. Unfortunately, it isn't a good day to test; a critical job is running. After upgrading all but the nodes the critical job was running on, I found the opensmd hung, the rebooted nodes were not getting initialized, I had to "kill -9" it. Upon opensmd restart, I couldn't risk using the "-r" switch, but, w/o it, the fat-tree came up w/o error. Chris From ralph.campbell at qlogic.com Thu May 15 11:53:17 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 15 May 2008 11:53:17 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. 
In-Reply-To: <482C835D.50401@opengridcomputing.com> References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> <1210875539.3949.153.camel@brick.pathscale.com> <482C835D.50401@opengridcomputing.com> Message-ID: <1210877597.3949.158.camel@brick.pathscale.com> On Thu, 2008-05-15 at 13:39 -0500, Steve Wise wrote: > Ralph Campbell wrote: > > On Wed, 2008-05-14 at 19:49 -0700, Roland Dreier wrote: > > > >> > So you want the page size specified in the fast_reg_page_list as > >> > opposed to when the page list is bound to the fast_reg mr (via > >> > post_send)? > >> > >> It's kind of the same thing, since the fast_reg_page_list is part of the > >> send work request... the structures you have at the moment are: > >> > >> > + struct { > >> > + u64 iova_start; > >> > + struct ib_fast_reg_page_list *page_list; > >> > + int fbo; > >> > + u32 length; > >> > + int access_flags; > >> > + struct ib_mr *mr; > >> > >> (side note... move this pointer up with the other pointers, so you don't > >> end up with a hole in the structure due to alignment... or stick an int > >> page_size in to fill the hole) > >> > >> > + } fast_reg; > >> > >> > +struct ib_fast_reg_page_list { > >> > + struct ib_device *device; > >> > + u64 *page_list; > >> > + int page_list_len; > >> > +}; > >> > >> is page_list_len the maximum length of the page_list, or is it filled in > >> by the consumer? The driver could figure out the length of the > >> page_list for any given work request by looking at the MR length and the > >> page_size I suppose. > >> > >> - R. > >> > > > > I think Roland and Steve misunderstood what I was asking about > > the struct ib_fast_reg_page_list * returned from > > ib_alloc_fast_reg_page_list(). > > > > The question is "what can the caller do with the pointer?" > > Clearly, the caller can pass the pointer to > > ib_post_send(IB_WR_FAST_REG_MR) and use the [LR]_Key in the > > normal ways. > > > > Can the caller dereference the pointer and look at the > > values in page_list[]? Are these values understood to be > > a physical addresses that can be passed to phys_to_virt() for example? > > Are they byte addresses always aligned to a page boundary? > > > > > > The caller must _fill in_ the values in the page list. That's the whole > point. IE all this func is doing is allocating the _memory_ to store > the page list that the caller is building. The special function is > needed because some devices might need to DMA the page list array from > this memory as part of processing the FAST_REG_MR work request, and thus > needs to allocate it dma coherently. The pointer returned is a kernel > virtual address and can be read from/written to by the caller. > > > The reason I ask is that the address used with the [LR]_Key from > > ib_get_dma_mr() has to be translated with ib_dma_map_single(), etc. > > because the ipath driver doesn't necessarily use physical addresses > > for the address in the send WQEs. Normally, the address in the > > send WQE is a kernel virtual address so the ib_ipath driver can > > memcpy() the data to the chip. > > > > > Lets say that ib_ipath uses vmalloc() to allocate the pages > > instead of dma_alloc_coherent(). As long as the ULP only uses > > the page_list values as an uninterpreted number that is passed > > back to the driver via subsequent verbs calls, it wouldn't > > matter to the ULP what the number represents. 
But if the ULP > > expects to be able to call some other kernel function to > > map or translate that value, then the ULP has to know what > > kind of number it represents, its size and alignment, etc. > > > > > We're not talking about allocating the pages themselves. > > Here's an example (ignoring errors): > > page_list = ib_alloc_fast_reg_page_list(device, 1); > > v = get_free_page(GFP_KERNEL); > > page_list->page_list[0] = ib_dma_map_single(device, v, PAGE_SIZE, > > DMA_TO_DEVICE|DMA_FROM_DEVICE); > > wr.opcode = IB_WR_FAST_REG_MR; > wr.next = NULL; > wr.send_flags = 0; > wr.wr_id = 0xdeadbeef; > wr.wr.fast_reg.mr = mr; > wr.wr.fast_reg.page_list = page_list; > wr.wr.fast_reg.page_size = PAGE_SIZE; > wr.wr.fast_reg.page_list_len = 1; > wr.wr.fast_reg.first_byte_offset = 0; > wr.wr.fast_reg.iova_start = (u64)v; > wr.wr.fast_reg.length = PAGE_SIZE; > wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | > > IB_ACCESS_REMOTE_READ | > > IB_ACCESS_REMOTE_WRITE; > > ib_post_send(qp, &wr, &bad_wr); OK. Thanks for clarifying. This wasn't clear to me from the original description but I understand now. From swise at opengridcomputing.com Thu May 15 12:05:45 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 14:05:45 -0500 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. In-Reply-To: <1210877597.3949.158.camel@brick.pathscale.com> References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> <1210875539.3949.153.camel@brick.pathscale.com> <482C835D.50401@opengridcomputing.com> <1210877597.3949.158.camel@brick.pathscale.com> Message-ID: <482C8989.10305@opengridcomputing.com> >> We're not talking about allocating the pages themselves. >> >> Here's an example (ignoring errors): >> >> page_list = ib_alloc_fast_reg_page_list(device, 1); >> >> v = get_free_page(GFP_KERNEL); >> >> page_list->page_list[0] = ib_dma_map_single(device, v, PAGE_SIZE, >> >> DMA_TO_DEVICE|DMA_FROM_DEVICE); >> >> wr.opcode = IB_WR_FAST_REG_MR; >> wr.next = NULL; >> wr.send_flags = 0; >> wr.wr_id = 0xdeadbeef; >> wr.wr.fast_reg.mr = mr; >> wr.wr.fast_reg.page_list = page_list; >> wr.wr.fast_reg.page_size = PAGE_SIZE; >> wr.wr.fast_reg.page_list_len = 1; >> wr.wr.fast_reg.first_byte_offset = 0; >> wr.wr.fast_reg.iova_start = (u64)v; >> wr.wr.fast_reg.length = PAGE_SIZE; >> wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | >> >> IB_ACCESS_REMOTE_READ | >> >> IB_ACCESS_REMOTE_WRITE; >> >> ib_post_send(qp, &wr, &bad_wr); >> > > OK. Thanks for clarifying. This wasn't clear to me from the > original description but I understand now. > Perhaps ib_alloc_fast_reg_page_list() isn't clear. Maybe ib_alloc_fast_reg_page_list_mem() is better? That's getting too long for my taste, but if others thing it helps... I'll change it. Steve. From ralph.campbell at qlogic.com Thu May 15 12:37:25 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 15 May 2008 12:37:25 -0700 Subject: [ofa-general] [PATCH RFC] RDMA: New Memory Extensions. 
In-Reply-To: <482C8989.10305@opengridcomputing.com> References: <20080514190532.28544.41595.stgit@dell3.ogc.int> <1210805766.3949.114.camel@brick.pathscale.com> <482B8C5A.5020904@opengridcomputing.com> <1210875539.3949.153.camel@brick.pathscale.com> <482C835D.50401@opengridcomputing.com> <1210877597.3949.158.camel@brick.pathscale.com> <482C8989.10305@opengridcomputing.com> Message-ID: <1210880245.3949.180.camel@brick.pathscale.com> On Thu, 2008-05-15 at 14:05 -0500, Steve Wise wrote: > >> We're not talking about allocating the pages themselves. > >> > >> Here's an example (ignoring errors): > >> > >> page_list = ib_alloc_fast_reg_page_list(device, 1); > >> > >> v = get_free_page(GFP_KERNEL); > >> > >> page_list->page_list[0] = ib_dma_map_single(device, v, PAGE_SIZE, > >> > >> DMA_TO_DEVICE|DMA_FROM_DEVICE); > >> > >> wr.opcode = IB_WR_FAST_REG_MR; > >> wr.next = NULL; > >> wr.send_flags = 0; > >> wr.wr_id = 0xdeadbeef; > >> wr.wr.fast_reg.mr = mr; > >> wr.wr.fast_reg.page_list = page_list; > >> wr.wr.fast_reg.page_size = PAGE_SIZE; > >> wr.wr.fast_reg.page_list_len = 1; > >> wr.wr.fast_reg.first_byte_offset = 0; > >> wr.wr.fast_reg.iova_start = (u64)v; > >> wr.wr.fast_reg.length = PAGE_SIZE; > >> wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | > >> > >> IB_ACCESS_REMOTE_READ | > >> > >> IB_ACCESS_REMOTE_WRITE; > >> > >> ib_post_send(qp, &wr, &bad_wr); > >> > > > > OK. Thanks for clarifying. This wasn't clear to me from the > > original description but I understand now. > > > > Perhaps ib_alloc_fast_reg_page_list() isn't clear. Maybe > ib_alloc_fast_reg_page_list_mem() is better? That's getting too long > for my taste, but if others thing it helps... I'll change it. At a minimum, I would change the comments for the function in ib_verbs.h: +/** + * ib_alloc_fast_reg_page_list - Allocates a page list array + * @device - ib device pointer. + * @page_list_len - size of the page list array to be allocated. + * + * This allocates and returns a struct ib_fast_reg_page_list * + * and a page_list array that is at least page_list_len in size. + * The actual size is returned in max_page_list_len. + * The caller is responsible for initializing the contents of the + * page_list array before posting a send work request with the + * IB_WC_FAST_REG_MR opcode. The page_list array entries must be + * translated using one of the ib_dma_*() functions similar to the + * addresses passed to ib_map_phys_fmr(). Once the ib_post_send() + * is issued, the struct ib_fast_reg_page_list should not be modified + * by the caller until a completion notice is returned by the device. + */ From swise at opengridcomputing.com Thu May 15 12:41:43 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 15 May 2008 14:41:43 -0500 Subject: [ofa-general] [PATCH RFC v2] RDMA: New Memory Extensions. In-Reply-To: <20080515181734.21020.47137.stgit@dell3.ogc.int> References: <20080515181734.21020.47137.stgit@dell3.ogc.int> Message-ID: <482C91F7.60708@opengridcomputing.com> BTW: I think we need a way for users to query the device to know the max page_list_length that can be handled in a FAST_REG_MR work request. In other words, a device attribute. Steve. From weiny2 at llnl.gov Thu May 15 13:27:21 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 15 May 2008 13:27:21 -0700 Subject: [ofa-general] [PATCH] OpenSM: Fix rpm build, /opensm/opensm.conf failed to install Message-ID: <20080515132721.37644ade.weiny2@llnl.gov> Sasha, I found this while trying to add the Performance Manager HOWTO to the rpm. 
Therefore, I think this will conflict slightly with that patch. If you like I can resubmit that patch after you apply this. Thanks, Ira >From 8453b86e94175ff3054a57c5c50e337a96d536bd Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Thu, 15 May 2008 13:13:16 -0700 Subject: [PATCH] Fix rpm build, /opensm/opensm.conf failed to install Signed-off-by: Ira K. Weiny --- opensm/configure.in | 9 +++++++-- opensm/opensm.spec.in | 8 ++++---- 2 files changed, 11 insertions(+), 6 deletions(-) diff --git a/opensm/configure.in b/opensm/configure.in index d36d7be..2ae8bd0 100644 --- a/opensm/configure.in +++ b/opensm/configure.in @@ -87,7 +87,7 @@ conf_dir_tmp1="`eval echo ${sysconfdir} | sed 's/^NONE/$ac_default_prefix/'`" SYS_CONFIG_DIR="`eval echo $conf_dir_tmp1`" dnl Check for a different subdir for the config files. -OPENSM_CONFIG_DIR=$SYS_CONFIG_DIR/opensm +OPENSM_CONFIG_SUB_DIR=opensm AC_MSG_CHECKING(for --with-opensm-conf-sub-dir) AC_ARG_WITH(opensm-conf-sub-dir, AC_HELP_STRING([--with-opensm-conf-sub-dir=dir], @@ -96,10 +96,15 @@ AC_ARG_WITH(opensm-conf-sub-dir, no) ;; *) - OPENSM_CONFIG_DIR=$SYS_CONFIG_DIR/$withval + OPENSM_CONFIG_SUB_DIR=$withval ;; esac ] ) +dnl this needs to be configured for rpmbuilds separate from the full path +dnl "OPENSM_CONFIG_DIR" +AC_SUBST(OPENSM_CONFIG_SUB_DIR) + +OPENSM_CONFIG_DIR=$SYS_CONFIG_DIR/$OPENSM_CONFIG_SUB_DIR AC_MSG_RESULT($OPENSM_CONFIG_DIR) AC_DEFINE_UNQUOTED(OPENSM_CONFIG_DIR, ["$OPENSM_CONFIG_DIR"], diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in index feabfef..b439323 100644 --- a/opensm/opensm.spec.in +++ b/opensm/opensm.spec.in @@ -94,9 +94,9 @@ if [ -f /etc/redhat-release -o -s /etc/redhat-release ]; then else REDHAT="" fi -mkdir -p $etc/{init.d,logrotate.d} @OPENSM_CONFIG_DIR@ +mkdir -p $etc/{init.d,logrotate.d} $etc/@OPENSM_CONFIG_SUB_DIR@ install -m 755 scripts/${REDHAT}opensm.init $etc/init.d/opensmd -install -m 644 scripts/opensm.conf @OPENSM_CONFIG_DIR@/opensm.conf +install -m 644 scripts/opensm.conf $etc/@OPENSM_CONFIG_SUB_DIR@/opensm.conf install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm install -m 755 scripts/sldd.sh $RPM_BUILD_ROOT%{_sbindir}/sldd.sh @@ -128,10 +128,10 @@ fi %doc AUTHORS COPYING README %{_sysconfdir}/init.d/opensmd %{_sbindir}/sldd.sh -%config(noreplace) @OPENSM_CONFIG_DIR@/opensm.conf +%config(noreplace) %{_sysconfdir}/@OPENSM_CONFIG_SUB_DIR@/opensm.conf %config(noreplace) %{_sysconfdir}/logrotate.d/opensm %dir /var/cache/opensm -%dir @OPENSM_CONFIG_DIR@ +%dir %{_sysconfdir}/@OPENSM_CONFIG_SUB_DIR@ %files libs %defattr(-,root,root,-) -- 1.5.1 From weiny2 at llnl.gov Thu May 15 13:27:23 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 15 May 2008 13:27:23 -0700 Subject: [ofa-general] [PATCH] OpenSM: Add a Performance Manager HOWTO to the docs and the dist Message-ID: <20080515132723.3add7c6a.weiny2@llnl.gov> There seems to be a lot of questions on the list about how to gather port counters. The Performance Manager included in OpenSM, v3.1.X (OFED 1.3) can be used to collect these counters in one place. I decided to write a little HOWTO to help people to set it up. Patch is attached, Ira >From bfc303f76a40fb5e3a9cf2c01c16c25c517c8ddd Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Thu, 15 May 2008 08:19:17 -0700 Subject: [PATCH] Add a Performance Manager HOWTO to the docs and the dist Signed-off-by: Ira K. 
Weiny
---
 opensm/Makefile.am                       |    3 +-
 opensm/doc/performance-manager-HOWTO.txt |  153 ++++++++++++++++++++++++++++++
 opensm/opensm.spec.in                    |    2 +-
 3 files changed, 156 insertions(+), 2 deletions(-)
 create mode 100644 opensm/doc/performance-manager-HOWTO.txt

diff --git a/opensm/Makefile.am b/opensm/Makefile.am
index 3811963..4c79f49 100644
--- a/opensm/Makefile.am
+++ b/opensm/Makefile.am
@@ -24,8 +24,9 @@ endif
 man_MANS = man/opensm.8 man/osmtest.8

 various_scripts = $(wildcard scripts/*)
+docs = doc/performance-manager-HOWTO.txt

-EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS)
+EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS) $(docs)

 dist-hook: $(EXTRA_DIST)
 	if [ -x $(top_srcdir)/../gen_chlog.sh ] ; then \
diff --git a/opensm/doc/performance-manager-HOWTO.txt b/opensm/doc/performance-manager-HOWTO.txt
new file mode 100644
index 0000000..c655f6c
--- /dev/null
+++ b/opensm/doc/performance-manager-HOWTO.txt
@@ -0,0 +1,153 @@
+OpenSM Performance manager HOWTO
+================================
+
+Introduction
+============
+
+OpenSM now includes a performance manager which collects Port counters from
+the subnet and stores them internally in OpenSM.
+
+Some of the features of the performance manager are:
+
+   1) Collect port data and error counters per v1.2 spec and store in
+      64-bit internal counts.
+   2) Automatic reset of counters when they reach approximately 3/4 full.
+      (While not guaranteeing that counts will not be missed this does
+      keep counts incrementing as best as possible given the current
+      hardware limitations.)
+   3) Basic warnings in the OpenSM log on "critical" errors like symbol
+      errors.
+   4) Automatically detects "outside" resets of counters and adjusts to
+      continue collecting data.
+   5) Can be run in a standby SM.
+
+Known issues are:
+
+   1) Data counters will be lost on high data rate links. Sweeping the
+      fabric fast enough for a DDR link is not practical.
+   2) Default partition support only.
+
+
+Setup and Usage
+===============
+
+Using the Performance Manager consists of 3 steps:
+
+   1) compiling in support for the perfmgr (Optionally: the console
+      socket as well)
+   2) enabling the perfmgr and console in opensm.opts
+   3) retrieving data which has been collected.
+      3a) using console to "dump data"
+      3b) using a plugin module to store the data to your own
+          "database"
+
+Step 1: Compile in support for the Performance Manager
+------------------------------------------------------
+
+Because of the performance manager's experimental status, it is not enabled at
+compile time by default. (This will hopefully soon change as more people use
+it and confirm that it does not break things... ;-) The configure option is
+"--enable-perf-mgr".
+
+At this time it is really best to enable the console socket option as well.
+OpenSM can be run in an "interactive" mode. But with the console socket option
+turned on one can also make a connection to a running OpenSM. The console
+option is "--enable-console-socket". This option requires the use of
+tcp_wrappers to ensure security. Please be aware of your configuration for
+tcp_wrappers as the commands presented in the console can affect the operation
+of your subnet.
+
+The following configure line includes turning on the performance manager
+as well as the console:
+
+   ./configure --enable-perf-mgr --enable-console-socket
+
+
+Step 2: Enable the perfmgr and console in opensm.opts
+-----------------------------------------------------
+
+Turning the Performance Manager on is pretty easy: set the following options
+in the opensm.opts config file. (Default location is
+/var/cache/opensm/opensm.opts)
+
+   # Turn it all on.
+   perfmgr TRUE
+
+   # sweep time in seconds
+   perfmgr_sweep_time_s 180
+
+   # Dump file to dump the events to
+   event_db_dump_file /var/log/opensm_port_counters.log
+
+Also enable the console socket and configure the port for it to listen to if
+desired.
+
+   # console [off|local|socket]
+   console socket
+
+   # Telnet port for console (default 10000)
+   console_port 10000
+
+As noted above you also need to set up tcp_wrappers to prevent unauthorized
+users from connecting to the console.[*]
+
+   [*] As an alternative you can use the loopback mode but I noticed when
+   writing this (OpenSM v3.1.10; OFED 1.3) that there are some bugs in
+   specifying the loopback mode in the opensm.opts file. Look for this to
+   be fixed in newer versions.
+
+   [**] Also you could use "local" but this is only useful if you run
+   OpenSM in the foreground of a terminal. As OpenSM is usually started
+   as a daemon I left this out as an option.
+
+Step 3: retrieve data which has been collected
+----------------------------------------------
+
+Step 3a: Using console dump function
+------------------------------------
+
+The console command "perfmgr dump_counters" will dump counters to the file
+specified in the opensm.opts file, in the example above
+"/var/log/opensm_port_counters.log".
+
+Example output is below:
+
+
+"SW1 wopr ISR9024D (MLX4 FW)" 0x8f10400411f56 port 1 (Since Mon May 12 13:27:14 2008)
+     symbol_err_cnt       : 0
+     link_err_recover     : 0
+     link_downed          : 0
+     rcv_err              : 0
+     rcv_rem_phys_err     : 0
+     rcv_switch_relay_err : 2
+     xmit_discards        : 0
+     xmit_constraint_err  : 0
+     rcv_constraint_err   : 0
+     link_integrity_err   : 0
+     buf_overrun_err      : 0
+     vl15_dropped         : 0
+     xmit_data            : 470435
+     rcv_data             : 405956
+     xmit_pkts            : 8954
+     rcv_pkts             : 6900
+     unicast_xmit_pkts    : 0
+     unicast_rcv_pkts     : 0
+     multicast_xmit_pkts  : 0
+     multicast_rcv_pkts   : 0
+
+
+
+Step 3b: Using a plugin module
+------------------------------
+
+If you want a more automated method of retrieving the data OpenSM provides a
+plugin interface to extend OpenSM. The header file is osm_event_plugin.h.
+The functions you register with this interface will be called when data is
+collected. You can then use that data as appropriate.
+
+An example plugin can be configured at compile time using the
+"--enable-default-event-plugin" option on the configure line. This plugin is
+very simple. It logs "events" received from the performance manager to a log
+file. I don't recommend using this directly but rather use it as a template
+to create your own plugin.
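(A note for readers who want to attempt step 3b: a rough skeleton of such a
plugin is sketched below. The structure and event names are approximations
of the osm_event_plugin.h interface and may differ between OpenSM versions;
treat every identifier here as an assumption and verify it against the
header in your tree before building.)

#include <stdio.h>
#include <opensm/osm_event_plugin.h>

/* Sketch only: identifiers approximated from osm_event_plugin.h. */

static void *my_create(osm_opensm_t *osm)
{
	/* Whatever we return here is handed back as plugin_data below. */
	return fopen("/var/log/my_perfmgr_data.log", "a");
}

static void my_delete(void *plugin_data)
{
	if (plugin_data)
		fclose((FILE *)plugin_data);
}

/* Called by the performance manager each time it has fresh counters. */
static void my_report(void *plugin_data, osm_epi_event_id_t event_id,
		      void *event_data)
{
	FILE *f = plugin_data;

	switch (event_id) {
	case OSM_EVENT_ID_PORT_DATA_COUNTERS:
		/* event_data points at the port's data counters; a real
		 * plugin would cast it and store the individual fields. */
		fprintf(f, "data counters event\n");
		break;
	case OSM_EVENT_ID_PORT_ERRORS:
		fprintf(f, "error counters event\n");
		break;
	default:
		break;
	}
	fflush(f);
}

/* OpenSM looks up this symbol when it loads the shared object. */
osm_event_plugin_t osm_event_plugin = {
	.osm_version = OSM_VERSION,
	.create = my_create,
	.delete = my_delete,
	.report = my_report,
};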
+ diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in index feabfef..c36d6f2 100644 --- a/opensm/opensm.spec.in +++ b/opensm/opensm.spec.in @@ -125,7 +125,7 @@ fi %{_sbindir}/opensm %{_sbindir}/osmtest %{_mandir}/man8/* -%doc AUTHORS COPYING README +%doc AUTHORS COPYING README doc/performance-manager-HOWTO.txt %{_sysconfdir}/init.d/opensmd %{_sbindir}/sldd.sh %config(noreplace) @OPENSM_CONFIG_DIR@/opensm.conf -- 1.5.1 From weiny2 at llnl.gov Thu May 15 13:34:56 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 15 May 2008 13:34:56 -0700 Subject: [ofa-general] [Announce] libopensmskummeeplugin; OpenSM/PerfMgr to MySQL plugin Message-ID: <20080515133456.7b1e5c04.weiny2@llnl.gov> Announcing, libopensmskummeeplugin. https://computing.llnl.gov/linux/skummeeplugin.html This plugin takes the data from the PerfMgr and logs it to a MySQL DB. In addition it comes with scripts to set up the connection between the PerfMgr and SKUMMEE (https://sourceforge.net/projects/skummee) an open source cluster monitoring tool. Although this has been developed primarily to get data into SKUMMEE it can be used to simply store the data in a MySQL DB which can then be querried. I hope someone finds this useful, Ira Weiny From rdreier at cisco.com Thu May 15 14:50:38 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 May 2008 14:50:38 -0700 Subject: [ofa-general] [PATCH RFC v2] RDMA: New Memory Extensions. In-Reply-To: <482C91F7.60708@opengridcomputing.com> (Steve Wise's message of "Thu, 15 May 2008 14:41:43 -0500") References: <20080515181734.21020.47137.stgit@dell3.ogc.int> <482C91F7.60708@opengridcomputing.com> Message-ID: > I think we need a way for users to query the device to know the max > page_list_length that can be handled in a FAST_REG_MR work request. > In other words, a device attribute. Yeah, stick it in there... From rdreier at cisco.com Thu May 15 15:22:12 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 May 2008 15:22:12 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: (Thomas Talpey's message of "Wed, 14 May 2008 23:32:21 -0400") References: Message-ID: > We've been hit by this twice this week on two NFS/RDMA servers, so I'm > glad to see this! But, for us it happens with memless ConnectX - our mthca > devices are ok (but OTOH they're memfull not memfree) Strange... as I said before though something seems to have changed to affect this, though I have no idea what. I'm including the test program I use to check if QP creation succeeds, you can run this on any suspect systems and see what it prints. > I'll be happy to test it with our misbehaving cards, but I can't do it until > next week since they just went into a box for shipping. In the meantime, > dare I ask - what's different about memfree cards that limits the sge > attributes like this? And, what values result from the new code? The > ConnectX ones I have report 32, and fail when trying to set that. The patch doesn't change ConnectX -- creating a QP with max send/recv sge 32 works fine for me here with mlx4 from 2.6.26-rc2. For mem-free the new max_sge reported is 27 sge entries, and for memful it is 59 (and creating such QPs succeeds of course). The difference between memfree and memful that matters is just that the max_sge on memfree runs into the max WQE size, and the code didn't handle that correctly without the patch. 
Here's the test program to check QP creation vs reported max_sge:

#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(int argc, char *argv[])
{
	struct ibv_device **dev_list;
	struct ibv_device_attr dev_attr;
	struct ibv_context *context;
	struct ibv_pd *pd;
	struct ibv_cq *cq;
	struct ibv_qp_init_attr qp_attr;
	int t;

	static const struct {
		enum ibv_qp_type type;
		char *name;
	} type_tab[] = {
		{ IBV_QPT_RC, "RC" },
		{ IBV_QPT_UC, "UC" },
		{ IBV_QPT_UD, "UD" },
	};

	dev_list = ibv_get_device_list(NULL);
	if (!dev_list) {
		printf("No RDMA devices found\n");
		return 1;
	}

	for (; *dev_list; ++dev_list) {
		printf("%s:\n", ibv_get_device_name(*dev_list));

		context = ibv_open_device(*dev_list);
		if (!context) {
			printf("  ibv_open_device failed\n");
			continue;
		}

		if (ibv_query_device(context, &dev_attr)) {
			printf("  ibv_query_device failed\n");
			continue;
		}

		cq = ibv_create_cq(context, 1, NULL, NULL, 0);
		if (!cq) {
			printf("  ibv_create_cq failed\n");
			continue;
		}

		pd = ibv_alloc_pd(context);
		if (!pd) {
			printf("  ibv_alloc_pd failed\n");
			continue;
		}

		/* Try to create a QP of each type with the device's own
		 * reported max_sge for both send and receive queues. */
		for (t = 0; t < sizeof type_tab / sizeof type_tab[0]; ++t) {
			memset(&qp_attr, 0, sizeof qp_attr);
			qp_attr.send_cq = cq;
			qp_attr.recv_cq = cq;
			qp_attr.cap.max_send_wr = 1;
			qp_attr.cap.max_recv_wr = 1;
			qp_attr.cap.max_send_sge = dev_attr.max_sge;
			qp_attr.cap.max_recv_sge = dev_attr.max_sge;
			qp_attr.qp_type = type_tab[t].type;

			printf("  %s: SGE %d ", type_tab[t].name, dev_attr.max_sge);
			if (ibv_create_qp(pd, &qp_attr))
				printf("ok (got %d/%d)\n",
				       qp_attr.cap.max_send_sge,
				       qp_attr.cap.max_recv_sge);
			else
				printf("FAILED\n");
		}
	}

	return 0;
}

From rdreier at cisco.com Thu May 15 15:31:55 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 May 2008 15:31:55 -0700
Subject: [ofa-general] Re: [PATCH 12/13] QLogic VNIC: Driver Kconfig and Makefile.
In-Reply-To: <20080430172156.31725.94843.stgit@localhost.localdomain> (Ramachandra K.'s message of "Wed, 30 Apr 2008 22:51:56 +0530")
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430172156.31725.94843.stgit@localhost.localdomain>
Message-ID:

 > +config INFINIBAND_QLGC_VNIC_DEBUG
 > + bool "QLogic VNIC Verbose debugging"
 > + depends on INFINIBAND_QLGC_VNIC
 > + default n
 > + ---help---
 > + This option causes verbose debugging code to be compiled
 > + into the QLogic VNIC driver. The output can be turned on via the
 > + vnic_debug module parameter.

If you have runtime control of this, I suggest making it default to on,
like mthca does with:

config INFINIBAND_MTHCA_DEBUG
	bool "Verbose debugging output" if EMBEDDED
	depends on INFINIBAND_MTHCA
	default y

otherwise distros will leave the option off and it becomes a pain to
debug problems because you force users to do compiles.

From rdreier at cisco.com Thu May 15 15:33:56 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 May 2008 15:33:56 -0700
Subject: [ofa-general] Re: [PATCH 10/13] QLogic VNIC: Driver Statistics collection
In-Reply-To: <20080430172055.31725.70663.stgit@localhost.localdomain> (Ramachandra K.'s message of "Wed, 30 Apr 2008 22:50:55 +0530")
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430172055.31725.70663.stgit@localhost.localdomain>
Message-ID:

 > +#else /*CONFIG_INIFINIBAND_VNIC_STATS*/
 > +
 > +static inline void vnic_connected_stats(struct vnic *vnic)
 > +{
 > +	;
 > +}

there are an awful lot of stubs here. Do you really expect anyone to
set CONFIG_INFINIBAND_QLGC_VNIC_STATS=n?

 - R.
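The idiom being reviewed here is the usual kernel one for compiling
optional statistics out: real functions when the config option is set,
empty static inlines otherwise, so call sites never need #ifdefs. A
generic sketch of that pattern (the names below are invented for
illustration and are not the QLogic VNIC code):

#ifdef CONFIG_MYDRV_STATS
void mydrv_stats_conn_up(struct mydrv *drv);
void mydrv_stats_tx_pkt(struct mydrv *drv, unsigned int len);
#else
/* No-op stubs: with the option off the compiler discards the calls
 * entirely, and the callers compile unchanged. */
static inline void mydrv_stats_conn_up(struct mydrv *drv)
{
}
static inline void mydrv_stats_tx_pkt(struct mydrv *drv, unsigned int len)
{
}
#endif

Roland's point is that when nobody is expected to build with the option
off, the stubs (and the config knob itself) are just dead weight.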
From akepner at sgi.com Thu May 15 15:34:18 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Thu, 15 May 2008 15:34:18 -0700 Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3) In-Reply-To: References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com> Message-ID: <20080515223418.GY29302@sgi.com> Last night we were able to reproduce the bug that I reported at the beginning of this thread. In brief, the bug is that we stop getting completions on the IPoIB-UD send queue. The queue fills, and we get an endless stream of "post_send failed". This is with 2.6.16.46-0.12 (SLES 10 SP1), and OFED 1.3, running on a moderately large (512 CPU) cluster. IB HCA is MT25204, f/w 1.2.0. The workload is an MPI job of some sort, and the failure seems to happen very soon (within minutes) after the job starts. We're using CM. I've added some debug output to the driver. The debug driver prints the "tx_outstanding" value when we get post_send failures in ipoib_send(): ib0: tx_outstanding 0x82 (ipoib_sendq_size 0x80) ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80) ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80) ib0: tx_outstanding 0x84 (ipoib_sendq_size 0x80) .... (We never call netif_stop_queue().) I also instrumented calls to mthca_arbel_post_send() and mthca_poll_one() - we keep a circular buffer of the last 0x80 sends, and completions. One curious thing is that the mthca_poll_one() and mthca_arbel_post_send() routines are usually called at a very regular rate, e.g., here are the last few calls to mthca_poll_one(): # delta_t head tail # [jiffies] .... 125 0x5c 0x5b 125 0x5d 0x5c 125 0x5e 0x5d 125 0x5f 0x5e 125 0x60 0x5f 125 0x61 0x60 125 0x62 0x61 125 0x63 0x62 125 0x64 0x63 125 0x65 0x64 125 0x66 0x65 125 0x67 0x66 125 0x68 0x67 125 0x69 0x68 125 0x69 0x69 After a short time, we just stop logging any more calls to mthca_poll_one(). Then it took a few minutes to fill the queue, and start making the "post_send failed" messages. The last few succeeding calls to mthca_arbel_post_send() were: # delta_t head tail # [jiffies] .... 125 0xdc 0x69 125 0xdd 0x69 125 0xde 0x69 125 0xdf 0x69 125 0xe0 0x69 125 0xe1 0x69 125 0xe2 0x69 125 0xe3 0x69 125 0xe4 0x69 125 0xe5 0x69 125 0xe6 0x69 125 0xe7 0x69 2860 0xe8 0x69 250 0xe9 0x69 HZ = 250, so we're calling these routines twice per second. Looks like we must be doing: static void ipoib_ib_tx_timer_func(unsigned long dev_ptr) { if (post_zlen_send_wr(priv, wrid)) { ... .... poll_tx(priv); ... mod_timer(&priv->poll_timer, jiffies + HZ / 2); } Apparently when we do: static inline int mthca_poll_one( ... cqe = next_cqe_sw(cq); if (!cqe) return -EAGAIN; we find that the next CQE is owned by h/w (if we hadn't returned early, the debug code would've logged the poll.) 
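For reference, the instrumentation amounts to something like the following
(a simplified sketch with invented names, not the actual debug patch; no
locking shown):

#define LOG_DEPTH 0x80	/* power of two, so the index mask below works */

struct wq_event {
	unsigned long when;	/* jiffies at post/poll time */
	unsigned int head;	/* queue head index at that time */
	unsigned int tail;	/* queue tail index at that time */
};

static struct wq_event event_log[LOG_DEPTH];
static unsigned int event_idx;

/* Called from the post_send and poll paths being instrumented. */
static void log_wq_event(unsigned int head, unsigned int tail)
{
	struct wq_event *e = &event_log[event_idx++ & (LOG_DEPTH - 1)];

	e->when = jiffies;
	e->head = head;
	e->tail = tail;
}

Dumping the buffer after the hang then gives the last 0x80
posts/completions and the inter-event times shown above.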
--
Arthur

From rdreier at cisco.com  Thu May 15 15:38:00 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 May 2008 15:38:00 -0700
Subject: [ofa-general] [PATCH 09/13] QLogic VNIC: IB Multicast for Ethernet broadcast/multicast
In-Reply-To: <20080430172025.31725.97795.stgit@localhost.localdomain> (Ramachandra K.'s message of "Wed, 30 Apr 2008 22:50:25 +0530")
References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430172025.31725.97795.stgit@localhost.localdomain>
Message-ID:

> +#define SET_MCAST_STATE_INVALID \
> +do { \
> +	viport->mc_info.state = MCAST_STATE_INVALID; \
> +	viport->mc_info.mc = NULL; \
> +	memset(&viport->mc_info.mgid, 0, sizeof(union ib_gid)); \
> +} while (0);

Seems like this could be profitably implemented in C instead of CPP.

> +	spin_lock_irqsave(&viport->mc_info.lock, flags);
> +	viport->mc_info.state = MCAST_STATE_INVALID;
> +	spin_unlock_irqrestore(&viport->mc_info.lock, flags);

This pattern makes me uneasy about the locking... setting the state member will already be atomic, so what do you think you're protecting against here by taking the lock?

 - R.

From ralph.campbell at qlogic.com  Thu May 15 15:48:26 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 15 May 2008 15:48:26 -0700
Subject: [ofa-general] [PATCH 0/2] IB/ipath -- fixes for 2.6.26
Message-ID: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com>

The following patches fix two minor bugs for the QLogic DDR HCA.

  IB/ipath - fix printk compiler warning for ipath_sdma_status
  IB/ipath - fix UC receive completion opcode

These can also be pulled into Roland's infiniband.git for-2.6.26 repo using:

  git pull git://git.qlogic.com/ipath-linux-2.6 for-roland

From ralph.campbell at qlogic.com  Thu May 15 15:48:31 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 15 May 2008 15:48:31 -0700
Subject: [ofa-general] [PATCH 1/2] IB/ipath - fix printk compiler warning for ipath_sdma_status
In-Reply-To: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com>
References: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com>
Message-ID: <20080515224831.23487.47599.stgit@eng-46.mv.qlogic.com>

This patch fixes a printk format string compiler warning to match the change of ipath_sdma_status from u64 to unsigned long.
Signed-off-by: Ralph Campbell
---

 drivers/infiniband/hw/ipath/ipath_sdma.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c
index 3697449..0e860fd 100644
--- a/drivers/infiniband/hw/ipath/ipath_sdma.c
+++ b/drivers/infiniband/hw/ipath/ipath_sdma.c
@@ -345,7 +345,7 @@ resched:
 	 * state change
 	 */
 	if (jiffies > dd->ipath_sdma_abort_jiffies) {
-		ipath_dbg("looping with status 0x%016llx\n",
+		ipath_dbg("looping with status 0x%016lx\n",
 			  dd->ipath_sdma_status);
 		dd->ipath_sdma_abort_jiffies = jiffies + 5 * HZ;
 	}
@@ -615,7 +615,7 @@ void ipath_restart_sdma(struct ipath_devdata *dd)
 	}
 	spin_unlock_irqrestore(&dd->ipath_sdma_lock, flags);
 	if (!needed) {
-		ipath_dbg("invalid attempt to restart SDMA, status 0x%016llx\n",
+		ipath_dbg("invalid attempt to restart SDMA, status 0x%016lx\n",
 			  dd->ipath_sdma_status);
 		goto bail;
 	}

From ralph.campbell at qlogic.com  Thu May 15 15:48:36 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 15 May 2008 15:48:36 -0700
Subject: [ofa-general] [PATCH 2/2] IB/ipath - fix UC receive completion opcode
In-Reply-To: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com>
References: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com>
Message-ID: <20080515224836.23487.696.stgit@eng-46.mv.qlogic.com>

When I fixed the RC receive completion opcode, I forgot to fix UC, which had the same problem for RDMA write with immediate returning the wrong opcode.

Signed-off-by: Ralph Campbell
---

 drivers/infiniband/hw/ipath/ipath_uc.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c
index 7fd18e8..0596ec1 100644
--- a/drivers/infiniband/hw/ipath/ipath_uc.c
+++ b/drivers/infiniband/hw/ipath/ipath_uc.c
@@ -407,12 +407,11 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 			dev->n_pkt_drops++;
 			goto done;
 		}
-		/* XXX Need to free SGEs */
+		wc.opcode = IB_WC_RECV;
 	last_imm:
 		ipath_copy_sge(&qp->r_sge, data, tlen);
 		wc.wr_id = qp->r_wr_id;
 		wc.status = IB_WC_SUCCESS;
-		wc.opcode = IB_WC_RECV;
 		wc.qp = &qp->ibqp;
 		wc.src_qp = qp->remote_qpn;
 		wc.slid = qp->remote_ah_attr.dlid;
@@ -514,6 +513,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 			goto done;
 		}
 		wc.byte_len = qp->r_len;
+		wc.opcode = IB_WC_RECV_RDMA_WITH_IMM;
 		goto last_imm;

 	case OP(RDMA_WRITE_LAST):

From rdreier at cisco.com  Thu May 15 16:35:39 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 May 2008 16:35:39 -0700
Subject: [ofa-general] Re: [PATCH 1/2] IB/ipath - fix printk compiler warning for ipath_sdma_status
In-Reply-To: <20080515224831.23487.47599.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Thu, 15 May 2008 15:48:31 -0700")
References: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com> <20080515224831.23487.47599.stgit@eng-46.mv.qlogic.com>
Message-ID:

> This patch fixes a printk format string compiler warning to match
> the change of ipath_sdma_status from u64 to unsigned long.

Thanks... already had fixed this locally.
(Don't know how I missed the warning.)

From rdreier at cisco.com  Thu May 15 16:36:07 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 15 May 2008 16:36:07 -0700
Subject: [ofa-general] Re: [PATCH 2/2] IB/ipath - fix UC receive completion opcode
In-Reply-To: <20080515224836.23487.696.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Thu, 15 May 2008 15:48:36 -0700")
References: <20080515224826.23487.53076.stgit@eng-46.mv.qlogic.com> <20080515224836.23487.696.stgit@eng-46.mv.qlogic.com>
Message-ID:

thanks, applied.

From npiggin at suse.de  Thu May 15 16:52:03 2008
From: npiggin at suse.de (Nick Piggin)
Date: Fri, 16 May 2008 01:52:03 +0200
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To:
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de>
Message-ID: <20080515235203.GB25305@wotan.suse.de>

On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
> On Thu, 15 May 2008, Nick Piggin wrote:
>
> > Oh, I get that confused because of the mixed up naming conventions
> > there: unmap_page_range should actually be called zap_page_range. But
> > at any rate, yes we can easily zap pagetables without holding mmap_sem.
>
> How is that synchronized with code that walks the same pagetable. These
> walks may not hold mmap_sem either. I would expect that one could only
> remove a portion of the pagetable where we have some sort of guarantee
> that no accesses occur. So the removal of the vma prior ensures that?

I don't really understand the question. If you remove the pte and invalidate the TLBs on the remote image's process (importing the page), then it can of course try to refault the page in because its vma is still there. But you catch that refault in your driver, which can prevent the page from being faulted back in.

From okir at lst.de  Fri May 16 00:19:44 2008
From: okir at lst.de (Olaf Kirch)
Date: Fri, 16 May 2008 09:19:44 +0200
Subject: [ofa-general] mthca max_sge value... ugh.
In-Reply-To:
References:
Message-ID: <200805160919.45676.okir@lst.de>

On Friday 16 May 2008 00:22:12 Roland Dreier wrote:
> Strange... as I said before though something seems to have changed to
> affect this, though I have no idea what. I'm including the test program
> I use to check if QP creation succeeds, you can run this on any suspect
> systems and see what it prints.

I ran into this a few weeks back as well, when I tried to up the SG limit in RDS to 32 (on an Arbel memfree card). I grepped around the code a bit, got a little confused by all the different max_sge, max_sg and max_gs variables :-) and eventually convinced myself that the max_sge reported simply doesn't include the transport-specific overhead that mthca_alloc_wqe_buf factors in.

Given that you have quite different WQE overheads depending on the transport, a conservative max_sge value that works for all transports wastes one or two entries on some of them. Maybe once the QP is created, it could report the actual max_sge value (which may actually be greater than the conservative, transport-independent max_sge estimate of the device).
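As it happens, the verbs interface already has a slot for exactly this reporting: on success, ibv_create_qp() writes the capabilities actually granted back into the caller's ibv_qp_init_attr.cap, which is what the test program earlier in the thread reads. A minimal sketch (assuming pd, cq and dev_attr have been set up with the same headers and calls as that program; the function name is illustrative):

static int report_granted_sge(struct ibv_pd *pd, struct ibv_cq *cq,
			      struct ibv_device_attr *dev_attr)
{
	struct ibv_qp_init_attr attr;
	struct ibv_qp *qp;

	memset(&attr, 0, sizeof attr);
	attr.send_cq          = cq;
	attr.recv_cq          = cq;
	attr.qp_type          = IBV_QPT_RC;
	attr.cap.max_send_wr  = 1;
	attr.cap.max_recv_wr  = 1;
	attr.cap.max_send_sge = dev_attr->max_sge;	/* what we ask for... */
	attr.cap.max_recv_sge = dev_attr->max_sge;

	qp = ibv_create_qp(pd, &attr);
	if (!qp)
		return -1;	/* the failure mode this thread is about */

	/* ...and what the provider actually granted */
	printf("RC: granted %d send / %d recv SGEs\n",
	       attr.cap.max_send_sge, attr.cap.max_recv_sge);
	return ibv_destroy_qp(qp);
}

Whether mthca fills these fields in with the true per-transport limits, rather than echoing the request, is the cleanup Roland describes later in the thread as needing real work.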
Olaf
--
Olaf Kirch  |  --- o ---  Nous sommes du soleil we love when we play
okir at lst.de |    / | \    sol.dhoop.naytheet.ah kin.ir.samse.qurax

From keshetti85-student at yahoo.co.in  Fri May 16 02:42:58 2008
From: keshetti85-student at yahoo.co.in (Keshetti Mahesh)
Date: Fri, 16 May 2008 15:12:58 +0530
Subject: [ofa-general] Retry count error with ipath on OFED-1.3
Message-ID: <829ded920805160242i57481603t3c65c44ceafd640@mail.gmail.com>

OFED 1.3 Infinipath Error:

># OSU MPI Bandwidth Test v3.1
># Size        Bandwidth (MB/s)
>1             0.17
>2             0.39
>4             0.66
>8             1.80
>16            2.53
>32            5.11
>64            8.80
>128           23.09
>256           43.65
>512           84.42
>1024          151.63
>[0,1,0][btl_openib_component.c:1338:btl_openib_component_progress] from
>compute-6-7 to: compute-6-8 error polling HP CQ with status RETRY
>EXCEEDED ERROR status number 12 for wr_id 185705200 opcode 1
>--------------------------------------------------------------------------
>The InfiniBand retry count between two MPI processes has been
>exceeded. "Retry count" is defined in the InfiniBand spec 1.2
>(section 12.7.38):
>
>    The total number of times that the sender wishes the receiver to
>    retry timeout, packet sequence, etc. errors before posting a
>    completion error.
>
>This error typically means that there is something awry within the
>InfiniBand fabric itself. You should note the hosts on which this
>error has occurred; it has been observed that rebooting or removing a
>particular host from the job can sometimes resolve this issue.
>
>Two MCA parameters can be used to control Open MPI's behavior with
>respect to the retry count:
>
>* btl_openib_ib_retry_count - The number of times the sender will
>  attempt to retry (defaulted to 7, the maximum value).
>
>* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>  to 10). The actual timeout value used is calculated as:
>
>    4.096 microseconds * (2^btl_openib_ib_timeout)
>
>  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>--------------------------------------------------------------------------
>mpirun noticed that job rank 1 with PID 16883 on node compute-6-8
>exited on signal 15 (Terminated).

Hi Nico,

The above error arises from a deadlock scenario in the network, but that should not happen in your case since you are using only 2 nodes. Try increasing the IB parameters (btl_openib_ib_retry_count and btl_openib_ib_timeout) mentioned in the error message.

-Mahesh

From holt at sgi.com  Fri May 16 04:23:06 2008
From: holt at sgi.com (Robin Holt)
Date: Fri, 16 May 2008 06:23:06 -0500
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <20080515235203.GB25305@wotan.suse.de>
References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de>
Message-ID: <20080516112306.GA4287@sgi.com>

On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote:
> On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
> > On Thu, 15 May 2008, Nick Piggin wrote:
> >
> > > Oh, I get that confused because of the mixed up naming conventions
> > > there: unmap_page_range should actually be called zap_page_range. But
> > > at any rate, yes we can easily zap pagetables without holding mmap_sem.
> >
> > How is that synchronized with code that walks the same pagetable. These
> > walks may not hold mmap_sem either.
> > I would expect that one could only
> > remove a portion of the pagetable where we have some sort of guarantee
> > that no accesses occur. So the removal of the vma prior ensures that?
>
> I don't really understand the question. If you remove the pte and invalidate
> the TLBs on the remote image's process (importing the page), then it can
> of course try to refault the page in because its vma is still there. But
> you catch that refault in your driver, which can prevent the page from
> being faulted back in.

I think Christoph's question has more to do with faults that are in flight. A recently requested fault could have just released the last lock that was holding up the invalidate callout. It would then begin messaging back the response PFN, which could still be in flight. The invalidate callout would then fire and do the interrupt shoot-down while that response was still active (essentially beating the in-flight response). The invalidate would clear up nothing, and then the response would insert the PFN after it is no longer the correct PFN.

Thanks,
Robin

From holt at sgi.com  Fri May 16 04:50:05 2008
From: holt at sgi.com (Robin Holt)
Date: Fri, 16 May 2008 06:50:05 -0500
Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem
In-Reply-To: <20080516112306.GA4287@sgi.com>
References: <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com>
Message-ID: <20080516115005.GC4287@sgi.com>

On Fri, May 16, 2008 at 06:23:06AM -0500, Robin Holt wrote:
> On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote:
> > On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
> > > On Thu, 15 May 2008, Nick Piggin wrote:
> > >
> > > > Oh, I get that confused because of the mixed up naming conventions
> > > > there: unmap_page_range should actually be called zap_page_range. But
> > > > at any rate, yes we can easily zap pagetables without holding mmap_sem.
> > >
> > > How is that synchronized with code that walks the same pagetable. These
> > > walks may not hold mmap_sem either. I would expect that one could only
> > > remove a portion of the pagetable where we have some sort of guarantee
> > > that no accesses occur. So the removal of the vma prior ensures that?
> >
> > I don't really understand the question. If you remove the pte and invalidate
> > the TLBs on the remote image's process (importing the page), then it can
> > of course try to refault the page in because its vma is still there. But
> > you catch that refault in your driver, which can prevent the page from
> > being faulted back in.
>
> I think Christoph's question has more to do with faults that are
> in flight. A recently requested fault could have just released the
> last lock that was holding up the invalidate callout. It would then
> begin messaging back the response PFN which could still be in flight.
> The invalidate callout would then fire and do the interrupt shoot-down
> while that response was still active (essentially beating the in-flight
> response). The invalidate would clear up nothing and then the response
> would insert the PFN after it is no longer the correct PFN.

I just looked over XPMEM. I think we could make this work. We already have a list of active faults, which is protected by a simple spinlock.
I would need to nest this lock within another lock protecting our PFN table (currently it is a mutex), and then the invalidate interrupt handler would need to mark the fault as invalid (which is also currently there).

I think my sticking points with the interrupt method remain fault containment and timeout. The inability of the ia64 processor to provide predictive failures for reads/writes of memory on other partitions prevents us from being able to contain the failure. I don't think we can get the information we would need to do the invalidate without introducing fault containment issues, which has been a continuous area of concern for our customers.

Thanks,
Robin

From eli at dev.mellanox.co.il  Fri May 16 06:06:56 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Fri, 16 May 2008 16:06:56 +0300
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: <20080515223418.GY29302@sgi.com>
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com> <20080515223418.GY29302@sgi.com>
Message-ID: <1210943216.9524.15.camel@eli-laptop>

On Thu, 2008-05-15 at 15:34 -0700, akepner at sgi.com wrote:

> ib0: tx_outstanding 0x82 (ipoib_sendq_size 0x80)
> ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80)
> ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80)
> ib0: tx_outstanding 0x84 (ipoib_sendq_size 0x80)
> ....

This should not happen. Can you send the source files for ipoib which you're using (with the debug patches)?

> (We never call netif_stop_queue().)

You mean you don't see it get called; you did not change the code so it won't be called, correct?

From olaf.kirch at oracle.com  Fri May 16 07:38:16 2008
From: olaf.kirch at oracle.com (Olaf Kirch)
Date: Fri, 16 May 2008 16:38:16 +0200
Subject: [ofa-general] RDS flow control
In-Reply-To: <200805141516.01908.okir@lst.de>
References: <200805121157.38135.jon@opengridcomputing.com> <482A17FC.7070804@oracle.com> <200805141516.01908.okir@lst.de>
Message-ID: <200805161638.18067.olaf.kirch@oracle.com>

On Wednesday 14 May 2008 15:16:00 Olaf Kirch wrote:
> I'll let you know as soon as I have something for you to test.

Okay, here we go. I have a whole stack of patches sitting in

  http://www.openfabrics.org/git/?p=~okir/ofed_1_3/linux-2.6.git

on branch future-20080516. This patch stack contains everything I'm working on right now, but which isn't ready for OFED 1.3.1 yet (and I haven't started on 1.4 yet). So you get a lot more than you bargained for... but a fair bit of that is actually needed because it prepares the ground for the flow control stuff, so I didn't bother with ripping out the unneeded pieces and rediffing everything.

I did some light testing with the code - it hasn't oopsed in a few hours, but I'm getting occasional errors with rds-stress right after reloading the module. They go away on the next attempt - so there's still something fishy about the code (or about rds-stress).

I'm not completely happy with the performance yet. Early versions might have been more adequately dubbed "trickle control" - but I eventually managed to get something not-too-bad. However, I'm still seeing performance degradation of ~5% with some packet sizes.
And that is *just* the overhead from exchanging the credit information and checking it - at some point we need to take a spinlock, and that seems to delay things just enough to make a dent in my throughput graph. In fact, I haven't yet found a test case where the sender had to slow down because it ran out of credits, which confirms my suspicion that the current setup isn't so bad, at least for IB...

If you're interested in the flow control code, the last commit on that branch is the one to look at - commit header appended below. I'm pretty sure that this is not exactly what iWARP needs, so please send comments/patches on how to beat it into shape for iWARP.

Enjoyable weekend to everyone,
Olaf

commit e8a64b4f83df9df6617f75dff9e591b86174fa7c
Author: Olaf Kirch
Date:   Fri May 16 06:16:40 2008 -0700

    RDS: Implement IB flow control

    Here it is - flow control for RDS/IB. This patch is still very much
    experimental. Here are the essentials:

    - The approach chosen here uses a credit-based flow control
      mechanism. Every SEND WR (including ACKs) consumes one credit,
      and if the sender runs out of credits, it stalls.

    - As new receive buffers are posted, credits are transferred to the
      remote node (using yet another RDS header byte for this).

    - Flow control is negotiated during connection setup. Initial
      credits are exchanged in the rds_ib_connect_private struct -
      sending a value of zero (which is also the default for older
      protocol versions) means no flow control.

    - We avoid deadlock (both nodes depleting their credits, and being
      unable to inform the peer of newly posted buffers) by requiring
      that the last credit can only be used if we're posting new
      credits to the peer.

    Flow control is configurable via sysctl. It only affects newly
    created connections, however - so your best bet is to set this
    right after loading the RDS module.

    Signed-off-by: Olaf Kirch

--
Olaf Kirch  |  --- o ---  Nous sommes du soleil we love when we play
okir at lst.de |    / | \    sol.dhoop.naytheet.ah kin.ir.samse.qurax

From akepner at sgi.com  Fri May 16 07:39:30 2008
From: akepner at sgi.com (akepner at sgi.com)
Date: Fri, 16 May 2008 07:39:30 -0700
Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3)
In-Reply-To: <1210943216.9524.15.camel@eli-laptop>
References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com> <20080515223418.GY29302@sgi.com> <1210943216.9524.15.camel@eli-laptop>
Message-ID: <20080516143930.GE29302@sgi.com>

On Fri, May 16, 2008 at 04:06:56PM +0300, Eli Cohen wrote:
> On Thu, 2008-05-15 at 15:34 -0700, akepner at sgi.com wrote:
>
> > ib0: tx_outstanding 0x82 (ipoib_sendq_size 0x80)
> > ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80)
> > ib0: tx_outstanding 0x83 (ipoib_sendq_size 0x80)
> > ib0: tx_outstanding 0x84 (ipoib_sendq_size 0x80)
> > ....
>
> This should not happen. Can you send the source files for ipoib which
> you're using (with the debug patches)?

Sure. I'll send them privately, and not spam the mail list with this.

But I'll restate what I said earlier in this email thread - I don't think the root cause here is IPoIB. I think IPoIB is a victim when the card stops generating completions. We've seen what looks to be the *same* bug (send queue gets forever stuffed up) on both OFED 1.2 and OFED 1.3. The drivers in these two releases (I know you're well aware) are very different. The common element is MT25204.

> > (We never call netif_stop_queue().)
>
> You mean you don't see it get called; you did not change the code so it
> won't be called, correct?
I didn't change things so that netif_stop_queue() wouldn't be called.

--
Arthur

From matthewtsmall at gmail.com  Fri May 16 09:40:17 2008
From: matthewtsmall at gmail.com (Matthew Small)
Date: Fri, 16 May 2008 12:40:17 -0400
Subject: [ofa-general] The significance of port numbers when creating QPs?
Message-ID:

Can anyone explain a little on the significance of choosing a port number when creating a QP. In particular, my implementation has multiple QPs associated with a single PD and the only attr.port_num I can use to initialize my queue pair seems to be 1. Can someone answer why this is and perhaps explain a general method for choosing an available port_num.

-Matt
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ralph.campbell at qlogic.com  Fri May 16 09:50:15 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Fri, 16 May 2008 09:50:15 -0700
Subject: [ofa-general] The significance of port numbers when creating QPs?
In-Reply-To:
References:
Message-ID: <1210956615.3949.199.camel@brick.pathscale.com>

It depends on the hardware you have in your system. Most HCAs have one or two ports (a CX4 connector for the IB cable). The port_num is a property of the address handle (for UD QPs) or QP attributes (for UC, RC QPs) which specifies which physical IB port to use.

On Fri, 2008-05-16 at 12:40 -0400, Matthew Small wrote:
> Can anyone explain a little on the significance of choosing a port
> number when creating a QP. In particular, my implementation has
> multiple QPs associated with a single PD and the only attr.port_num I
> can use to initialize my queue pair seems to be 1. Can someone
> answer why this is and perhaps explain a general method for choosing
> an available port_num.
>
> -Matt

From matthewtsmall at gmail.com  Fri May 16 10:27:24 2008
From: matthewtsmall at gmail.com (Matthew Small)
Date: Fri, 16 May 2008 13:27:24 -0400
Subject: [ofa-general] The significance of port numbers when creating QPs?
In-Reply-To: <1210956615.3949.199.camel@brick.pathscale.com>
References: <1210956615.3949.199.camel@brick.pathscale.com>
Message-ID:

So, when you are using an RC QP and attempting to write code for general hardware, how would you query the device to find which physical IB ports are available?

On Fri, May 16, 2008 at 12:50 PM, Ralph Campbell wrote:
> It depends on the hardware you have in your system.
> Most HCAs have one or two ports (a CX4 connector
> for the IB cable). The port_num is a property of
> the address handle (for UD QPs) or QP attributes
> (for UC, RC QPs) which specifies which physical IB
> port to use.
>
> On Fri, 2008-05-16 at 12:40 -0400, Matthew Small wrote:
> > Can anyone explain a little on the significance of choosing
> > a port number when creating a QP. In particular, my
> > implementation has multiple QPs associated with a single PD
> > and the only attr.port_num I can use to initialize my queue
> > pair seems to be 1. Can someone answer why this is and
> > perhaps explain a general method for choosing an available
> > port_num.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ralph.campbell at qlogic.com  Fri May 16 10:45:04 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Fri, 16 May 2008 10:45:04 -0700
Subject: [ofa-general] The significance of port numbers when creating QPs?
In-Reply-To:
References: <1210956615.3949.199.camel@brick.pathscale.com>
Message-ID: <1210959904.3949.213.camel@brick.pathscale.com>

ibv_query_device() will return the number of physical ports, but what you are probably asking is how to establish a connection to a particular host. That is like mapping a hostname to an IP address, which is accomplished via the connection manager. See the documentation for librdmacm and libibverbs.

On Fri, 2008-05-16 at 13:27 -0400, Matthew Small wrote:
> So, when you are using an RC QP and attempting to write code for
> general hardware, how would you query the device to find which
> physical IB ports are available?
>
> > On Fri, 2008-05-16 at 12:40 -0400, Matthew Small wrote:
> > > Can anyone explain a little on the significance of choosing
> > > a port number when creating a QP. In particular, my implementation
> > > has multiple QPs associated with a single PD and the only
> > > attr.port_num I can use to initialize my queue pair seems to be 1.
> > > Can someone answer why this is and perhaps explain a general method
> > > for choosing an available port_num.
> > > > -Matt > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From ruimario at gmail.com Fri May 16 11:10:56 2008 From: ruimario at gmail.com (Rui Machado) Date: Fri, 16 May 2008 20:10:56 +0200 Subject: [ofa-general] timeout question Message-ID: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> Hi, >> >> when setting the timeout in a struct ibv_qp_attr, this value >> corresponds to the Local ACK timeout which according to the Infiniband >> spec will define the transport timer timeout defined by the formula: >> 4.096uS * 2 ^Local Ack timeout". Is this right? >> And is there a value for this timeout to be considered "good practice"? >> > This value is depend on your fabric size, on the HCA you have (and some more factors).. >> Also, in a client-server setup, if this timeout is set to a "big >> value" (like 30) when the server dies, the client will take that >> amount of time to realize the failure. Is this correct? >> > Yes, after (at least) the calculated time * number of retry_count usec, the sender QP will get a retry exceeded > (if there was a SR which was posted without any response from the receiver). > hmm..... and is there no workaround for this, for this situation? I mean, if the server dies isn't there any possibility that the sender/client realizes this. If the timeout it's too large this can be cumbersome. I tried reducing the timeout and indeed the client realizes faster when the server exits but another problem arises: Without exiting the server, on the client side I get the error (retry exceed) when polling for a recently posted send - this after some hours. Thank you for the help. Rui From rdreier at cisco.com Fri May 16 11:13:38 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 11:13:38 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: <200805160919.45676.okir@lst.de> (Olaf Kirch's message of "Fri, 16 May 2008 09:19:44 +0200") References: <200805160919.45676.okir@lst.de> Message-ID: > Given that you have quite different WQE overheads depending on the transport, > a conservative max_sge value that works for all transports wastes one or two > entries on some others. Maybe once the QP is created, it could report > the actual max_sge value (which may actually be greater than the conservative, > transport-independent max_sge estimate of the device). Is using 27 S/G entries vs, say 29 really a big problem? The interface exists for the driver to return the actual capabilities returned, but the mthca code is a bit of a mess and I think it would require a decent amount of cleanup work to do this sanely. - R. From rdreier at cisco.com Fri May 16 11:16:32 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 11:16:32 -0700 Subject: [ofa-general] timeout question In-Reply-To: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> (Rui Machado's message of "Fri, 16 May 2008 20:10:56 +0200") References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> Message-ID: > hmm..... and is there no workaround for this, for this situation? I > mean, if the server dies isn't there any possibility that > the sender/client realizes this. If the timeout it's too large this > can be cumbersome. 
> > I tried reducing the timeout and indeed the client realizes faster > when the server exits but another problem arises: Without exiting the > server, > on the client side I get the error (retry exceed) when polling for a > recently posted send - this after some hours. There's a tradeoff between detecting real failures faster, and reducing false errors detected because a response came too slowly. Clearly if a response may take an amount of time 'X' to be received under normal conditions, there's no way to conclude that the remote side has failed without waiting at least 'X'. - R. From rdreier at cisco.com Fri May 16 11:21:26 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 11:21:26 -0700 Subject: [ofa-general] Re: [PATCH v2] IB/mlx4: Add send with invalidate support In-Reply-To: <1210836027.18385.2.camel@mtls03> (Eli Cohen's message of "Thu, 15 May 2008 10:20:27 +0300") References: <1210836027.18385.2.camel@mtls03> Message-ID: > +#define MLX4_FW_VER_LOCAL_SEND_INVL mlx4_fw_ver(2, 5, 0) > + if (dev->dev->caps.fw_ver >= MLX4_FW_VER_LOCAL_SEND_INVL) > + props->device_cap_flags |= IB_DEVICE_SEND_W_INV; Are we forced to to look at the firmware version, or can we use the bmme flag that the DEV_CAP firmware command gives us? - R. From richard.frank at oracle.com Fri May 16 11:28:03 2008 From: richard.frank at oracle.com (Richard Frank) Date: Fri, 16 May 2008 14:28:03 -0400 Subject: [ofa-general] Folks is this a known problem / already fixed ? Message-ID: <482DD233.5010808@oracle.com> We see the following failure for our ConnetX HCAs.. with 1.3.1 Daily 20080512 done on vanilla OEL5U1. They are failing to initialize with the following: mlx4_core: Mellanox ConnectX core driver v1.0 (February 28, 2008) mlx4_core: Initializing 0000:05:00.0 mlx4_core 0000:05:00.0: Failed to initialize queue pair table, aborting. mlx4_core 0000:05:00.0: Failed to initialize queue pair table, aborting. mlx4_core: probe of 0000:05:00.0 failed with error -16 And lspci shows: 05:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR] (rev a0) Subsystem: Mellanox Technologies MT25418 [ConnectX IB DDR] Flags: fast devsel, IRQ 169 Memory at fcc00000 (64-bit, non-prefetchable) [disabled] [size=1M] Memory at fff000000 (64-bit, prefetchable) [disabled] [size=8M] Memory at fcbfe000 (64-bit, non-prefetchable) [disabled] [size=8K] Capabilities: [40] Power Management version 3 Capabilities: [48] Vital Product Data Capabilities: [9c] MSI-X: Enable- Mask- TabSize=256 Capabilities: [60] Express Endpoint IRQ 0 From ruimario at gmail.com Fri May 16 11:40:09 2008 From: ruimario at gmail.com (Rui Machado) Date: Fri, 16 May 2008 20:40:09 +0200 Subject: [ofa-general] timeout question In-Reply-To: References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> Message-ID: <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> 2008/5/16 Roland Dreier : > > hmm..... and is there no workaround for this, for this situation? I > > mean, if the server dies isn't there any possibility that > > the sender/client realizes this. If the timeout it's too large this > > can be cumbersome. > > > > I tried reducing the timeout and indeed the client realizes faster > > when the server exits but another problem arises: Without exiting the > > server, > > on the client side I get the error (retry exceed) when polling for a > > recently posted send - this after some hours. > > There's a tradeoff between detecting real failures faster, and reducing > false errors detected because a response came too slowly. 
> > Clearly if a response may take an amount of time 'X' to be received > under normal conditions, there's no way to conclude that the remote side > has failed without waiting at least 'X'. > I understand. So there's no really difference between the two situations, real server failure or just a load problem that takes more time? Something like a different error or a SIGPIPE :) ? I will describe my situation, maybe it helps (bare with me as I'm starting with Infiniband and so on) I have a client and a server.The clients posts RDMA calls one at a time (post, poll, post...). So server is just there. If I try to start something like 16 clients on 1 machine, after a few hours I will get an error on some client programs (retry excess) with a timeout of 14. If I increase the timeout for 32, I don't see that error but if I stop the server, the clients take a lot of time to acknowledge that, which is also not wanted. That's why I asked if there a 'good value'. If I have such a load between 2 nodes, I always have to risk that if the server dies the client will take much time to see it. That's not nice! Thanks for the help and quick answers, Rui From dotanba at gmail.com Fri May 16 18:54:54 2008 From: dotanba at gmail.com (Dotan Barak) Date: Sat, 17 May 2008 03:54:54 +0200 Subject: [ofa-general] timeout question In-Reply-To: <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> Message-ID: <482E3AEE.4070603@gmail.com> Rui Machado wrote: > 2008/5/16 Roland Dreier : > >> > hmm..... and is there no workaround for this, for this situation? I >> > mean, if the server dies isn't there any possibility that >> > the sender/client realizes this. If the timeout it's too large this >> > can be cumbersome. >> > >> > I tried reducing the timeout and indeed the client realizes faster >> > when the server exits but another problem arises: Without exiting the >> > server, >> > on the client side I get the error (retry exceed) when polling for a >> > recently posted send - this after some hours. >> >> There's a tradeoff between detecting real failures faster, and reducing >> false errors detected because a response came too slowly. >> >> Clearly if a response may take an amount of time 'X' to be received >> under normal conditions, there's no way to conclude that the remote side >> has failed without waiting at least 'X'. >> >> > > I understand. So there's no really difference between the two > situations, real server failure or just a load problem that takes more > time? > From the sender QP point of view, they are the same (ack/nack wasn't send during a specific period of time) > Something like a different error or a SIGPIPE :) ? > > I will describe my situation, maybe it helps (bare with me as I'm > starting with Infiniband and so on) > I have a client and a server.The clients posts RDMA calls one at a > time (post, poll, post...). So server is just there. > If I try to start something like 16 clients on 1 machine, after a few > hours I will get an error on some client programs (retry excess) with > a timeout of 14. If I increase the timeout for 32, I don't see that > error but if I stop the server, the clients take a lot of time to > acknowledge that, which is also not wanted. > That's why I asked if there a 'good value'. If I have such a load > between 2 nodes, I always have to risk that if the server dies the > client will take much time to see it. That's not nice! 
> Did you try to increase the retry_count too? (and not only the timeout). By the way, Which RDMA operation do you execute READ or WRITE? > Thanks for the help and quick answers, > You are always welcome .. Dotan From ruimario at gmail.com Fri May 16 12:01:07 2008 From: ruimario at gmail.com (Rui Machado) Date: Fri, 16 May 2008 21:01:07 +0200 Subject: [ofa-general] timeout question In-Reply-To: <482E3AEE.4070603@gmail.com> References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> <482E3AEE.4070603@gmail.com> Message-ID: <6978b4af0805161201r4a113a1cpd98cbb1e81cf175a@mail.gmail.com> 2008/5/17 Dotan Barak : > Rui Machado wrote: >> >> 2008/5/16 Roland Dreier : >> >>> >>> > hmm..... and is there no workaround for this, for this situation? I >>> > mean, if the server dies isn't there any possibility that >>> > the sender/client realizes this. If the timeout it's too large this >>> > can be cumbersome. >>> > >>> > I tried reducing the timeout and indeed the client realizes faster >>> > when the server exits but another problem arises: Without exiting the >>> > server, >>> > on the client side I get the error (retry exceed) when polling for a >>> > recently posted send - this after some hours. >>> >>> There's a tradeoff between detecting real failures faster, and reducing >>> false errors detected because a response came too slowly. >>> >>> Clearly if a response may take an amount of time 'X' to be received >>> under normal conditions, there's no way to conclude that the remote side >>> has failed without waiting at least 'X'. >>> >>> >> >> I understand. So there's no really difference between the two >> situations, real server failure or just a load problem that takes more >> time? >> > > From the sender QP point of view, they are the same (ack/nack wasn't send > during a specific > period of time) >> >> Something like a different error or a SIGPIPE :) ? >> >> I will describe my situation, maybe it helps (bare with me as I'm >> starting with Infiniband and so on) >> I have a client and a server.The clients posts RDMA calls one at a >> time (post, poll, post...). So server is just there. >> If I try to start something like 16 clients on 1 machine, after a few >> hours I will get an error on some client programs (retry excess) with >> a timeout of 14. If I increase the timeout for 32, I don't see that >> error but if I stop the server, the clients take a lot of time to >> acknowledge that, which is also not wanted. >> That's why I asked if there a 'good value'. If I have such a load >> between 2 nodes, I always have to risk that if the server dies the >> client will take much time to see it. That's not nice! >> > > Did you try to increase the retry_count too? > (and not only the timeout). But that wouldn't change my scenario since the overall time is given by the timeout * retry count right? > By the way, Which RDMA operation do you execute READ or WRITE? >> READ. >> Thanks for the help and quick answers, >> > > You are always welcome .. 
Great :) Cheers, Rui From dotanba at gmail.com Fri May 16 13:04:51 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 16 May 2008 22:04:51 +0200 Subject: [ofa-general] timeout question In-Reply-To: <6978b4af0805161201r4a113a1cpd98cbb1e81cf175a@mail.gmail.com> References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> <482E3AEE.4070603@gmail.com> <6978b4af0805161201r4a113a1cpd98cbb1e81cf175a@mail.gmail.com> Message-ID: <482DE8E3.4090200@gmail.com> Rui Machado wrote: > 2008/5/17 Dotan Barak : > >> Rui Machado wrote: >> >>> 2008/5/16 Roland Dreier : >>> >>> >>>> > hmm..... and is there no workaround for this, for this situation? I >>>> > mean, if the server dies isn't there any possibility that >>>> > the sender/client realizes this. If the timeout it's too large this >>>> > can be cumbersome. >>>> > >>>> > I tried reducing the timeout and indeed the client realizes faster >>>> > when the server exits but another problem arises: Without exiting the >>>> > server, >>>> > on the client side I get the error (retry exceed) when polling for a >>>> > recently posted send - this after some hours. >>>> >>>> There's a tradeoff between detecting real failures faster, and reducing >>>> false errors detected because a response came too slowly. >>>> >>>> Clearly if a response may take an amount of time 'X' to be received >>>> under normal conditions, there's no way to conclude that the remote side >>>> has failed without waiting at least 'X'. >>>> >>>> >>>> >>> I understand. So there's no really difference between the two >>> situations, real server failure or just a load problem that takes more >>> time? >>> >>> >> From the sender QP point of view, they are the same (ack/nack wasn't send >> during a specific >> period of time) >> >>> Something like a different error or a SIGPIPE :) ? >>> >>> I will describe my situation, maybe it helps (bare with me as I'm >>> starting with Infiniband and so on) >>> I have a client and a server.The clients posts RDMA calls one at a >>> time (post, poll, post...). So server is just there. >>> If I try to start something like 16 clients on 1 machine, after a few >>> hours I will get an error on some client programs (retry excess) with >>> a timeout of 14. If I increase the timeout for 32, I don't see that >>> error but if I stop the server, the clients take a lot of time to >>> acknowledge that, which is also not wanted. >>> That's why I asked if there a 'good value'. If I have such a load >>> between 2 nodes, I always have to risk that if the server dies the >>> client will take much time to see it. That's not nice! >>> >>> >> Did you try to increase the retry_count too? >> (and not only the timeout). >> Yes. > > But that wouldn't change my scenario since the overall time is given > by the timeout * retry count right? > > >> By the way, Which RDMA operation do you execute READ or WRITE? >> > READ. > Can you replace it with a write (from the other side)? READ has "higher price" than a WRITE. Anyway, you should get the mentioned behavior anyway.. When the sender get the error, what is the status of the receiver QP? (did you try to execute ibv_query_qp and get its status?) Dotan From paulmck at linux.vnet.ibm.com Fri May 16 12:07:52 2008 From: paulmck at linux.vnet.ibm.com (Paul E. 
McKenney) Date: Fri, 16 May 2008 12:07:52 -0700 Subject: [ofa-general] Re: [PATCH 001/001] mmu-notifier-core v17 In-Reply-To: <20080509193230.GH7710@duo.random> References: <20080509193230.GH7710@duo.random> Message-ID: <20080516190752.GK11333@linux.vnet.ibm.com> On Fri, May 09, 2008 at 09:32:30PM +0200, Andrea Arcangeli wrote: > From: Andrea Arcangeli The hlist_del_init_rcu() primitive looks good. The rest of the RCU code looks fine assuming that "mn->ops->release()" either does call_rcu() to defer actual removal, or that the actual removal is deferred until after mmu_notifier_release() returns. Acked-by: Paul E. McKenney > With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to > pages. There are secondary MMUs (with secondary sptes and secondary > tlbs) too. sptes in the kvm case are shadow pagetables, but when I say > spte in mmu-notifier context, I mean "secondary pte". In GRU case > there's no actual secondary pte and there's only a secondary tlb > because the GRU secondary MMU has no knowledge about sptes and every > secondary tlb miss event in the MMU always generates a page fault that > has to be resolved by the CPU (this is not the case of KVM where the a > secondary tlb miss will walk sptes in hardware and it will refill the > secondary tlb transparently to software if the corresponding spte is > present). The same way zap_page_range has to invalidate the pte before > freeing the page, the spte (and secondary tlb) must also be > invalidated before any page is freed and reused. > > Currently we take a page_count pin on every page mapped by sptes, but > that means the pages can't be swapped whenever they're mapped by any > spte because they're part of the guest working set. Furthermore a spte > unmap event can immediately lead to a page to be freed when the pin is > released (so requiring the same complex and relatively slow tlb_gather > smp safe logic we have in zap_page_range and that can be avoided > completely if the spte unmap event doesn't require an unpin of the > page previously mapped in the secondary MMU). > > The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and > know when the VM is swapping or freeing or doing anything on the > primary MMU so that the secondary MMU code can drop sptes before the > pages are freed, avoiding all page pinning and allowing 100% reliable > swapping of guest physical address space. Furthermore it avoids the > code that teardown the mappings of the secondary MMU, to implement a > logic like tlb_gather in zap_page_range that would require many IPI to > flush other cpu tlbs, for each fixed number of spte unmapped. > > To make an example: if what happens on the primary MMU is a protection > downgrade (from writeable to wrprotect) the secondary MMU mappings > will be invalidated, and the next secondary-mmu-page-fault will call > get_user_pages and trigger a do_wp_page through get_user_pages if it > called get_user_pages with write=1, and it'll re-establishing an > updated spte or secondary-tlb-mapping on the copied page. Or it will > setup a readonly spte or readonly tlb mapping if it's a guest-read, if > it calls get_user_pages with write=0. This is just an example. 
> > This allows to map any page pointed by any pte (and in turn visible in > the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, > or an full MMU with both sptes and secondary-tlb like the > shadow-pagetable layer with kvm), or a remote DMA in software like > XPMEM (hence needing of schedule in XPMEM code to send the invalidate > to the remote node, while no need to schedule in kvm/gru as it's an > immediate event like invalidating primary-mmu pte). > > At least for KVM without this patch it's impossible to swap guests > reliably. And having this feature and removing the page pin allows > several other optimizations that simplify life considerably. > > Dependencies: > > 1) Introduces list_del_init_rcu and documents it (fixes a comment for > list_del_rcu too) > > 2) mm_take_all_locks() to register the mmu notifier when the whole VM > isn't doing anything with "mm". This allows mmu notifier users to > keep track if the VM is in the middle of the > invalidate_range_begin/end critical section with an atomic counter > incraese in range_begin and decreased in range_end. No secondary > MMU page fault is allowed to map any spte or secondary tlb > reference, while the VM is in the middle of range_begin/end as any > page returned by get_user_pages in that critical section could > later immediately be freed without any further ->invalidate_page > notification (invalidate_range_begin/end works on ranges and > ->invalidate_page isn't called immediately before freeing the > page). To stop all page freeing and pagetable overwrites the > mmap_sem must be taken in write mode and all other anon_vma/i_mmap > locks must be taken too. > > 3) It'd be a waste to add branches in the VM if nobody could possibly > run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled > if CONFIG_KVM=m/y. In the current kernel kvm won't yet take > advantage of mmu notifiers, but this already allows to compile a > KVM external module against a kernel with mmu notifiers enabled and > from the next pull from kvm.git we'll start using them. And > GRU/XPMEM will also be able to continue the development by enabling > KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code > to the mainline kernel. Then they can also enable MMU_NOTIFIERS in > the same way KVM does it (even if KVM=n). This guarantees nobody > selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. > > The mmu_notifier_register call can fail because mm_take_all_locks may > be interrupted by a signal and return -EINTR. Because > mmu_notifier_reigster is used when a driver startup, a failure can be > gracefully handled. Here an example of the change applied to kvm to > register the mmu notifiers. Usually when a driver startups other > allocations are required anyway and -ENOMEM failure paths exists > already. > > struct kvm *kvm_arch_create_vm(void) > { > struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); > + int err; > > if (!kvm) > return ERR_PTR(-ENOMEM); > > INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); > > + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; > + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); > + if (err) { > + kfree(kvm); > + return ERR_PTR(err); > + } > + > return kvm; > } > > mmu_notifier_unregister returns void and it's reliable. > > Signed-off-by: Andrea Arcangeli > Signed-off-by: Nick Piggin > Signed-off-by: Christoph Lameter > --- > > Full patchset is here: > > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc1/mmu-notifier-v17 > > Thanks! 
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -21,6 +21,7 @@ config KVM > tristate "Kernel-based Virtual Machine (KVM) support" > depends on HAVE_KVM > select PREEMPT_NOTIFIERS > + select MMU_NOTIFIER > select ANON_INODES > ---help--- > Support hosting fully virtualized guest machines using hardware > diff --git a/include/linux/list.h b/include/linux/list.h > --- a/include/linux/list.h > +++ b/include/linux/list.h > @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis > * or hlist_del_rcu(), running on this same list. > * However, it is perfectly legal to run concurrently with > * the _rcu list-traversal primitives, such as > - * hlist_for_each_entry(). > + * hlist_for_each_entry_rcu(). > */ > static inline void hlist_del_rcu(struct hlist_node *n) > { > @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct > if (!hlist_unhashed(n)) { > __hlist_del(n); > INIT_HLIST_NODE(n); > + } > +} > + > +/** > + * hlist_del_init_rcu - deletes entry from hash list with re-initialization > + * @n: the element to delete from the hash list. > + * > + * Note: list_unhashed() on the node return true after this. It is > + * useful for RCU based read lockfree traversal if the writer side > + * must know if the list entry is still hashed or already unhashed. > + * > + * In particular, it means that we can not poison the forward pointers > + * that may still be used for walking the hash list and we can only > + * zero the pprev pointer so list_unhashed() will return true after > + * this. > + * > + * The caller must take whatever precautions are necessary (such as > + * holding appropriate locks) to avoid racing with another > + * list-mutation primitive, such as hlist_add_head_rcu() or > + * hlist_del_rcu(), running on this same list. However, it is > + * perfectly legal to run concurrently with the _rcu list-traversal > + * primitives, such as hlist_for_each_entry_rcu(). > + */ > +static inline void hlist_del_init_rcu(struct hlist_node *n) > +{ > + if (!hlist_unhashed(n)) { > + __hlist_del(n); > + n->pprev = NULL; > } > } > > diff --git a/include/linux/mm.h b/include/linux/mm.h > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1067,6 +1067,9 @@ extern struct vm_area_struct *copy_vma(s > unsigned long addr, unsigned long len, pgoff_t pgoff); > extern void exit_mmap(struct mm_struct *); > > +extern int mm_take_all_locks(struct mm_struct *mm); > +extern void mm_drop_all_locks(struct mm_struct *mm); > + > #ifdef CONFIG_PROC_FS > /* From fs/proc/base.c. 
callers must _not_ hold the mm's exe_file_lock */ > extern void added_exe_file_vma(struct mm_struct *mm); > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -10,6 +10,7 @@ > #include > #include > #include > +#include > #include > #include > > @@ -19,6 +20,7 @@ > #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) > > struct address_space; > +struct mmu_notifier_mm; > > #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS > typedef atomic_long_t mm_counter_t; > @@ -235,6 +237,9 @@ struct mm_struct { > struct file *exe_file; > unsigned long num_exe_file_vmas; > #endif > +#ifdef CONFIG_MMU_NOTIFIER > + struct mmu_notifier_mm *mmu_notifier_mm; > +#endif > }; > > #endif /* _LINUX_MM_TYPES_H */ > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > new file mode 100644 > --- /dev/null > +++ b/include/linux/mmu_notifier.h > @@ -0,0 +1,279 @@ > +#ifndef _LINUX_MMU_NOTIFIER_H > +#define _LINUX_MMU_NOTIFIER_H > + > +#include > +#include > +#include > + > +struct mmu_notifier; > +struct mmu_notifier_ops; > + > +#ifdef CONFIG_MMU_NOTIFIER > + > +/* > + * The mmu notifier_mm structure is allocated and installed in > + * mm->mmu_notifier_mm inside the mm_take_all_locks() protected > + * critical section and it's released only when mm_count reaches zero > + * in mmdrop(). > + */ > +struct mmu_notifier_mm { > + /* all mmu notifiers registerd in this mm are queued in this list */ > + struct hlist_head list; > + /* to serialize the list modifications and hlist_unhashed */ > + spinlock_t lock; > +}; > + > +struct mmu_notifier_ops { > + /* > + * Called either by mmu_notifier_unregister or when the mm is > + * being destroyed by exit_mmap, always before all pages are > + * freed. This can run concurrently with other mmu notifier > + * methods (the ones invoked outside the mm context) and it > + * should tear down all secondary mmu mappings and freeze the > + * secondary mmu. If this method isn't implemented you've to > + * be sure that nothing could possibly write to the pages > + * through the secondary mmu by the time the last thread with > + * tsk->mm == mm exits. > + * > + * As side note: the pages freed after ->release returns could > + * be immediately reallocated by the gart at an alias physical > + * address with a different cache model, so if ->release isn't > + * implemented because all _software_ driven memory accesses > + * through the secondary mmu are terminated by the time the > + * last thread of this mm quits, you've also to be sure that > + * speculative _hardware_ operations can't allocate dirty > + * cachelines in the cpu that could not be snooped and made > + * coherent with the other read and write operations happening > + * through the gart alias address, so leading to memory > + * corruption. > + */ > + void (*release)(struct mmu_notifier *mn, > + struct mm_struct *mm); > + > + /* > + * clear_flush_young is called after the VM is > + * test-and-clearing the young/accessed bitflag in the > + * pte. This way the VM will provide proper aging to the > + * accesses to the page through the secondary MMUs and not > + * only to the ones through the Linux pte. 
> + */ > + int (*clear_flush_young)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long address); > + > + /* > + * Before this is invoked, any secondary MMU is still ok to > + * read/write to the page previously pointed to by the Linux > + * pte because the page hasn't been freed yet and it won't be > + * freed until this returns. If required, set_page_dirty has to > + * be called from within this method. > + */ > + void (*invalidate_page)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long address); > + > + /* > + * invalidate_range_start() and invalidate_range_end() must be > + * paired and are called only when the mmap_sem and/or the > + * locks protecting the reverse maps are held. The subsystem > + * must guarantee that no additional references are taken to > + * the pages in the range established between the call to > + * invalidate_range_start() and the matching call to > + * invalidate_range_end(). > + * > + * Invalidation of multiple concurrent ranges may be > + * optionally permitted by the driver. Either way the > + * establishment of sptes is forbidden in the range passed to > + * invalidate_range_start/end for the whole duration of the > + * invalidate_range_start/end critical section. > + * > + * invalidate_range_start() is called when all pages in the > + * range are still mapped and have at least a refcount of one. > + * > + * invalidate_range_end() is called when all pages in the > + * range have been unmapped and the pages have been freed by > + * the VM. > + * > + * The VM will remove the page table entries and potentially > + * the page between invalidate_range_start() and > + * invalidate_range_end(). If the page must not be freed > + * because of pending I/O or other circumstances then the > + * invalidate_range_start() callback (or the initial mapping > + * by the driver) must make sure that the refcount is kept > + * elevated. > + * > + * If the driver increases the refcount when the pages are > + * initially mapped into an address space then either > + * invalidate_range_start() or invalidate_range_end() may > + * decrease the refcount. If the refcount is decreased on > + * invalidate_range_start() then the VM can free pages as page > + * table entries are removed. If the refcount is only > + * dropped on invalidate_range_end() then the driver itself > + * will drop the last refcount but it must take care to flush > + * any secondary tlb before doing the final free on the > + * page. Pages will no longer be referenced by the linux > + * address space but may still be referenced by sptes until > + * the last refcount is dropped. > + */ > + void (*invalidate_range_start)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end); > + void (*invalidate_range_end)(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end); > +}; > + > +/* > + * The notifier chains are protected by mmap_sem and/or the reverse map > + * semaphores. Notifier chains are only changed when all reverse maps and > + * the mmap_sem locks are taken. > + * > + * Therefore notifier chains can only be traversed when either > + * > + * 1. mmap_sem is held. > + * 2. One of the reverse map locks is held (i_mmap_lock or anon_vma->lock). > + * 3.
No other concurrent thread can access the list (release) > + */ > +struct mmu_notifier { > + struct hlist_node hlist; > + const struct mmu_notifier_ops *ops; > +}; > + > +static inline int mm_has_notifiers(struct mm_struct *mm) > +{ > + return unlikely(mm->mmu_notifier_mm); > +} > + > +extern int mmu_notifier_register(struct mmu_notifier *mn, > + struct mm_struct *mm); > +extern int __mmu_notifier_register(struct mmu_notifier *mn, > + struct mm_struct *mm); > +extern void mmu_notifier_unregister(struct mmu_notifier *mn, > + struct mm_struct *mm); > +extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); > +extern void __mmu_notifier_release(struct mm_struct *mm); > +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, > + unsigned long address); > +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, > + unsigned long address); > +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, > + unsigned long start, unsigned long end); > +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, > + unsigned long start, unsigned long end); > + > +static inline void mmu_notifier_release(struct mm_struct *mm) > +{ > + if (mm_has_notifiers(mm)) > + __mmu_notifier_release(mm); > +} > + > +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, > + unsigned long address) > +{ > + if (mm_has_notifiers(mm)) > + return __mmu_notifier_clear_flush_young(mm, address); > + return 0; > +} > + > +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, > + unsigned long address) > +{ > + if (mm_has_notifiers(mm)) > + __mmu_notifier_invalidate_page(mm, address); > +} > + > +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + if (mm_has_notifiers(mm)) > + __mmu_notifier_invalidate_range_start(mm, start, end); > +} > + > +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + if (mm_has_notifiers(mm)) > + __mmu_notifier_invalidate_range_end(mm, start, end); > +} > + > +static inline void mmu_notifier_mm_init(struct mm_struct *mm) > +{ > + mm->mmu_notifier_mm = NULL; > +} > + > +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) > +{ > + if (mm_has_notifiers(mm)) > + __mmu_notifier_mm_destroy(mm); > +} > + > +/* > + * These two macros will eventually replace ptep_clear_flush. > + * ptep_clear_flush is implemented as a macro itself, so this also is > + * implemented as a macro until ptep_clear_flush is converted to an > + * inline function, to diminish the risk of compilation failure. The > + * invalidate_page method over time can be moved outside the PT lock > + * and these two macros can later be removed.
> + */ > +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ > +({ \ > + pte_t __pte; \ > + struct vm_area_struct *___vma = __vma; \ > + unsigned long ___address = __address; \ > + __pte = ptep_clear_flush(___vma, ___address, __ptep); \ > + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ > + __pte; \ > +}) > + > +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ > +({ \ > + int __young; \ > + struct vm_area_struct *___vma = __vma; \ > + unsigned long ___address = __address; \ > + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ > + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ > + ___address); \ > + __young; \ > +}) > + > +#else /* CONFIG_MMU_NOTIFIER */ > + > +static inline void mmu_notifier_release(struct mm_struct *mm) > +{ > +} > + > +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, > + unsigned long address) > +{ > + return 0; > +} > + > +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, > + unsigned long address) > +{ > +} > + > +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > +} > + > +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > +} > + > +static inline void mmu_notifier_mm_init(struct mm_struct *mm) > +{ > +} > + > +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) > +{ > +} > + > +#define ptep_clear_flush_young_notify ptep_clear_flush_young > +#define ptep_clear_flush_notify ptep_clear_flush > + > +#endif /* CONFIG_MMU_NOTIFIER */ > + > +#endif /* _LINUX_MMU_NOTIFIER_H */ > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -19,6 +19,7 @@ > */ > #define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */ > #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */ > +#define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */ > > static inline void mapping_set_error(struct address_space *mapping, int error) > { > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -26,6 +26,14 @@ > */ > struct anon_vma { > spinlock_t lock; /* Serialize access to vma list */ > + /* > + * NOTE: the LSB of the head.next is set by > + * mm_take_all_locks() _after_ taking the above lock. So the > + * head must only be read/written after taking the above lock > + * to be sure to see a valid next pointer. The LSB bit itself > + * is serialized by a system wide lock only visible to > + * mm_take_all_locks() (mm_all_locks_mutex). 
> + */ > struct list_head head; /* List of private "related" vmas */ > }; > > diff --git a/kernel/fork.c b/kernel/fork.c > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -54,6 +54,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -386,6 +387,7 @@ static struct mm_struct * mm_init(struct > > if (likely(!mm_alloc_pgd(mm))) { > mm->def_flags = 0; > + mmu_notifier_mm_init(mm); > return mm; > } > > @@ -418,6 +420,7 @@ void __mmdrop(struct mm_struct *mm) > BUG_ON(mm == &init_mm); > mm_free_pgd(mm); > destroy_context(mm); > + mmu_notifier_mm_destroy(mm); > free_mm(mm); > } > EXPORT_SYMBOL_GPL(__mmdrop); > diff --git a/mm/Kconfig b/mm/Kconfig > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -205,3 +205,6 @@ config VIRT_TO_BUS > config VIRT_TO_BUS > def_bool y > depends on !ARCH_NO_VIRT_TO_BUS > + > +config MMU_NOTIFIER > + bool > diff --git a/mm/Makefile b/mm/Makefile > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o > obj-$(CONFIG_SMP) += allocpercpu.o > obj-$(CONFIG_QUICKLIST) += quicklist.o > obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o > +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o > > diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c > --- a/mm/filemap_xip.c > +++ b/mm/filemap_xip.c > @@ -188,7 +188,7 @@ __xip_unmap (struct address_space * mapp > if (pte) { > /* Nuke the page table entry. */ > flush_cache_page(vma, address, pte_pfn(*pte)); > - pteval = ptep_clear_flush(vma, address, pte); > + pteval = ptep_clear_flush_notify(vma, address, pte); > page_remove_rmap(page, vma); > dec_mm_counter(mm, file_rss); > BUG_ON(pte_dirty(pteval)); > diff --git a/mm/fremap.c b/mm/fremap.c > --- a/mm/fremap.c > +++ b/mm/fremap.c > @@ -15,6 +15,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns > spin_unlock(&mapping->i_mmap_lock); > } > > + mmu_notifier_invalidate_range_start(mm, start, start + size); > err = populate_range(mm, vma, start, size, pgoff); > + mmu_notifier_invalidate_range_end(mm, start, start + size); > if (!err && !(flags & MAP_NONBLOCK)) { > if (unlikely(has_write_lock)) { > downgrade_write(&mm->mmap_sem); > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -14,6 +14,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -823,6 +824,7 @@ void __unmap_hugepage_range(struct vm_ar > BUG_ON(start & ~HPAGE_MASK); > BUG_ON(end & ~HPAGE_MASK); > > + mmu_notifier_invalidate_range_start(mm, start, end); > spin_lock(&mm->page_table_lock); > for (address = start; address < end; address += HPAGE_SIZE) { > ptep = huge_pte_offset(mm, address); > @@ -843,6 +845,7 @@ void __unmap_hugepage_range(struct vm_ar > } > spin_unlock(&mm->page_table_lock); > flush_tlb_range(vma, start, end); > + mmu_notifier_invalidate_range_end(mm, start, end); > list_for_each_entry_safe(page, tmp, &page_list, lru) { > list_del(&page->lru); > put_page(page); > diff --git a/mm/memory.c b/mm/memory.c > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -51,6 +51,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -632,6 +633,7 @@ int copy_page_range(struct mm_struct *ds > unsigned long next; > unsigned long addr = vma->vm_start; > unsigned long end = vma->vm_end; > + int ret; > > /* > * Don't copy ptes where a page fault will fill them correctly. 
> @@ -647,17 +649,33 @@ int copy_page_range(struct mm_struct *ds > if (is_vm_hugetlb_page(vma)) > return copy_hugetlb_page_range(dst_mm, src_mm, vma); > > + /* > + * We need to invalidate the secondary MMU mappings only when > + * there could be a permission downgrade on the ptes of the > + * parent mm. And a permission downgrade will only happen if > + * is_cow_mapping() returns true. > + */ > + if (is_cow_mapping(vma->vm_flags)) > + mmu_notifier_invalidate_range_start(src_mm, addr, end); > + > + ret = 0; > dst_pgd = pgd_offset(dst_mm, addr); > src_pgd = pgd_offset(src_mm, addr); > do { > next = pgd_addr_end(addr, end); > if (pgd_none_or_clear_bad(src_pgd)) > continue; > - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, > - vma, addr, next)) > - return -ENOMEM; > + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, > + vma, addr, next))) { > + ret = -ENOMEM; > + break; > + } > } while (dst_pgd++, src_pgd++, addr = next, addr != end); > - return 0; > + > + if (is_cow_mapping(vma->vm_flags)) > + mmu_notifier_invalidate_range_end(src_mm, > + vma->vm_start, end); > + return ret; > } > > static unsigned long zap_pte_range(struct mmu_gather *tlb, > @@ -861,7 +879,9 @@ unsigned long unmap_vmas(struct mmu_gath > unsigned long start = start_addr; > spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; > int fullmm = (*tlbp)->fullmm; > + struct mm_struct *mm = vma->vm_mm; > > + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); > for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { > unsigned long end; > > @@ -912,6 +932,7 @@ unsigned long unmap_vmas(struct mmu_gath > } > } > out: > + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); > return start; /* which is now the end (or restart) address */ > } > > @@ -1544,10 +1565,11 @@ int apply_to_page_range(struct mm_struct > { > pgd_t *pgd; > unsigned long next; > - unsigned long end = addr + size; > + unsigned long start = addr, end = addr + size; > int err; > > BUG_ON(addr >= end); > + mmu_notifier_invalidate_range_start(mm, start, end); > pgd = pgd_offset(mm, addr); > do { > next = pgd_addr_end(addr, end); > @@ -1555,6 +1577,7 @@ int apply_to_page_range(struct mm_struct > if (err) > break; > } while (pgd++, addr = next, addr != end); > + mmu_notifier_invalidate_range_end(mm, start, end); > return err; > } > EXPORT_SYMBOL_GPL(apply_to_page_range); > @@ -1756,7 +1779,7 @@ gotten: > * seen in the presence of one thread doing SMC and another > * thread doing COW. > */ > - ptep_clear_flush(vma, address, page_table); > + ptep_clear_flush_notify(vma, address, page_table); > set_pte_at(mm, address, page_table, entry); > update_mmu_cache(vma, address, entry); > lru_cache_add_active(new_page); > diff --git a/mm/mmap.c b/mm/mmap.c > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -26,6 +26,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -2048,6 +2049,7 @@ void exit_mmap(struct mm_struct *mm) > > /* mm's last user has gone, and its about to be pulled down */ > arch_exit_mmap(mm); > + mmu_notifier_release(mm); > > lru_add_drain(); > flush_cache_mm(mm); > @@ -2255,3 +2257,152 @@ int install_special_mapping(struct mm_st > > return 0; > } > + > +static DEFINE_MUTEX(mm_all_locks_mutex); > + > +/* > + * This operation locks against the VM for all pte/vma/mm related > + * operations that could ever happen on a certain mm. This includes > + * vmtruncate, try_to_unmap, and all page faults. > + * > + * The caller must take the mmap_sem in write mode before calling > + * mm_take_all_locks(). 
The caller isn't allowed to release the > + * mmap_sem until mm_drop_all_locks() returns. > + * > + * mmap_sem in write mode is required in order to block all operations > + * that could modify pagetables and free pages without needing to > + * alter the vma layout (for example populate_range() with > + * nonlinear vmas). It's also needed in write mode to prevent new > + * anon_vmas from being associated with existing vmas. > + * > + * A single task can't take more than one mm_take_all_locks() in a row > + * or it would deadlock. > + * > + * The LSB in anon_vma->head.next and the AS_MM_ALL_LOCKS bitflag in > + * mapping->flags avoid taking the same lock twice, if more than one > + * vma in this mm is backed by the same anon_vma or address_space. > + * > + * We can take all the locks in random order because the VM code > + * taking i_mmap_lock or anon_vma->lock outside the mmap_sem never > + * takes more than one of them in a row. Secondly we're protected > + * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex. > + * > + * mm_take_all_locks() and mm_drop_all_locks() are expensive operations > + * that may have to take thousands of locks. > + * > + * mm_take_all_locks() can fail if it's interrupted by signals. > + */ > +int mm_take_all_locks(struct mm_struct *mm) > +{ > + struct vm_area_struct *vma; > + int ret = -EINTR; > + > + BUG_ON(down_read_trylock(&mm->mmap_sem)); > + > + mutex_lock(&mm_all_locks_mutex); > + > + for (vma = mm->mmap; vma; vma = vma->vm_next) { > + struct file *filp; > + if (signal_pending(current)) > + goto out_unlock; > + if (vma->anon_vma && !test_bit(0, (unsigned long *) > + &vma->anon_vma->head.next)) { > + /* > + * The LSB of head.next can't change from > + * under us because we hold the > + * global_mm_spinlock. > + */ > + spin_lock(&vma->anon_vma->lock); > + /* > + * We can safely modify head.next after taking > + * the anon_vma->lock. If some other vma in > + * this mm shares the same anon_vma we won't > + * take it again. > + * > + * No need of atomic instructions here, > + * head.next can't change from under us thanks > + * to the anon_vma->lock. > + */ > + if (__test_and_set_bit(0, (unsigned long *) > + &vma->anon_vma->head.next)) > + BUG(); > + } > + > + filp = vma->vm_file; > + if (filp && filp->f_mapping && > + !test_bit(AS_MM_ALL_LOCKS, &filp->f_mapping->flags)) { > + /* > + * AS_MM_ALL_LOCKS can't change from under us > + * because we hold the global_mm_spinlock. > + * > + * Operations on ->flags have to be atomic > + * because even if AS_MM_ALL_LOCKS is stable > + * thanks to the global_mm_spinlock, there may > + * be other cpus changing other bitflags in > + * parallel to us. > + */ > + if (test_and_set_bit(AS_MM_ALL_LOCKS, > + &filp->f_mapping->flags)) > + BUG(); > + spin_lock(&filp->f_mapping->i_mmap_lock); > + } > + } > + ret = 0; > + > +out_unlock: > + if (ret) > + mm_drop_all_locks(mm); > + > + return ret; > +} > + > +/* > + * The mmap_sem cannot be released by the caller until > + * mm_drop_all_locks() returns. > + */ > +void mm_drop_all_locks(struct mm_struct *mm) > +{ > + struct vm_area_struct *vma; > + > + BUG_ON(down_read_trylock(&mm->mmap_sem)); > + BUG_ON(!mutex_is_locked(&mm_all_locks_mutex)); > + > + for (vma = mm->mmap; vma; vma = vma->vm_next) { > + struct file *filp; > + if (vma->anon_vma && > + test_bit(0, (unsigned long *) > + &vma->anon_vma->head.next)) { > + /* > + * The LSB of head.next can't change to 0 from > + * under us because we hold the > + * global_mm_spinlock.
> + * > + * We must however clear the bitflag before > + * unlocking the vma so the users using the > + * anon_vma->head will never see our bitflag. > + * > + * No need of atomic instructions here, > + * head.next can't change from under us until > + * we release the anon_vma->lock. > + */ > + if (!__test_and_clear_bit(0, (unsigned long *) > + &vma->anon_vma->head.next)) > + BUG(); > + spin_unlock(&vma->anon_vma->lock); > + } > + filp = vma->vm_file; > + if (filp && filp->f_mapping && > + test_bit(AS_MM_ALL_LOCKS, &filp->f_mapping->flags)) { > + /* > + * AS_MM_ALL_LOCKS can't change to 0 from under us > + * because we hold the global_mm_spinlock. > + */ > + spin_unlock(&filp->f_mapping->i_mmap_lock); > + if (!test_and_clear_bit(AS_MM_ALL_LOCKS, > + &filp->f_mapping->flags)) > + BUG(); > + } > + } > + > + mutex_unlock(&mm_all_locks_mutex); > +} > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c > new file mode 100644 > --- /dev/null > +++ b/mm/mmu_notifier.c > @@ -0,0 +1,276 @@ > +/* > + * linux/mm/mmu_notifier.c > + * > + * Copyright (C) 2008 Qumranet, Inc. > + * Copyright (C) 2008 SGI > + * Christoph Lameter > + * > + * This work is licensed under the terms of the GNU GPL, version 2. See > + * the COPYING file in the top-level directory. > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > + > +/* > + * This function can't run concurrently against mmu_notifier_register > + * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap > + * runs with mm_users == 0. Other tasks may still invoke mmu notifiers > + * in parallel despite there being no task using this mm any more, > + * through the vmas outside of the exit_mmap context, such as with > + * vmtruncate. This serializes against mmu_notifier_unregister with > + * the mmu_notifier_mm->lock in addition to RCU and it serializes > + * against the other mmu notifiers with RCU. struct mmu_notifier_mm > + * can't go away from under us as exit_mmap holds an mm_count pin > + * itself. > + */ > +void __mmu_notifier_release(struct mm_struct *mm) > +{ > + struct mmu_notifier *mn; > + > + spin_lock(&mm->mmu_notifier_mm->lock); > + while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { > + mn = hlist_entry(mm->mmu_notifier_mm->list.first, > + struct mmu_notifier, > + hlist); > + /* > + * We arrived before mmu_notifier_unregister so > + * mmu_notifier_unregister will do nothing other than > + * wait for ->release to finish and for > + * mmu_notifier_unregister to return. > + */ > + hlist_del_init_rcu(&mn->hlist); > + /* > + * RCU here will block mmu_notifier_unregister until > + * ->release returns. > + */ > + rcu_read_lock(); > + spin_unlock(&mm->mmu_notifier_mm->lock); > + /* > + * if ->release runs before mmu_notifier_unregister it > + * must be handled as it's the only way for the driver > + * to flush all existing sptes and stop the driver > + * from establishing any more sptes before all the > + * pages in the mm are freed. > + */ > + if (mn->ops->release) > + mn->ops->release(mn, mm); > + rcu_read_unlock(); > + spin_lock(&mm->mmu_notifier_mm->lock); > + } > + spin_unlock(&mm->mmu_notifier_mm->lock); > + > + /* > + * synchronize_rcu here prevents mmu_notifier_release from > + * returning to exit_mmap (which would proceed to free all pages > + * in the mm) until the ->release method returns, if it was > + * invoked by mmu_notifier_unregister. > + * > + * The mmu_notifier_mm can't go away from under us because one > + * mm_count is held by exit_mmap.
> + */ > + synchronize_rcu(); > +} > + > +/* > + * If no young bitflag is supported by the hardware, ->clear_flush_young can > + * unmap the address and return 1 or 0 depending if the mapping previously > + * existed or not. > + */ > +int __mmu_notifier_clear_flush_young(struct mm_struct *mm, > + unsigned long address) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n; > + int young = 0; > + > + rcu_read_lock(); > + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { > + if (mn->ops->clear_flush_young) > + young |= mn->ops->clear_flush_young(mn, mm, address); > + } > + rcu_read_unlock(); > + > + return young; > +} > + > +void __mmu_notifier_invalidate_page(struct mm_struct *mm, > + unsigned long address) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n; > + > + rcu_read_lock(); > + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { > + if (mn->ops->invalidate_page) > + mn->ops->invalidate_page(mn, mm, address); > + } > + rcu_read_unlock(); > +} > + > +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n; > + > + rcu_read_lock(); > + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { > + if (mn->ops->invalidate_range_start) > + mn->ops->invalidate_range_start(mn, mm, start, end); > + } > + rcu_read_unlock(); > +} > + > +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + struct mmu_notifier *mn; > + struct hlist_node *n; > + > + rcu_read_lock(); > + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { > + if (mn->ops->invalidate_range_end) > + mn->ops->invalidate_range_end(mn, mm, start, end); > + } > + rcu_read_unlock(); > +} > + > +static int do_mmu_notifier_register(struct mmu_notifier *mn, > + struct mm_struct *mm, > + int take_mmap_sem) > +{ > + struct mmu_notifier_mm * mmu_notifier_mm; > + int ret; > + > + BUG_ON(atomic_read(&mm->mm_users) <= 0); > + > + ret = -ENOMEM; > + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); > + if (unlikely(!mmu_notifier_mm)) > + goto out; > + > + if (take_mmap_sem) > + down_write(&mm->mmap_sem); > + ret = mm_take_all_locks(mm); > + if (unlikely(ret)) > + goto out_cleanup; > + > + if (!mm_has_notifiers(mm)) { > + INIT_HLIST_HEAD(&mmu_notifier_mm->list); > + spin_lock_init(&mmu_notifier_mm->lock); > + mm->mmu_notifier_mm = mmu_notifier_mm; > + mmu_notifier_mm = NULL; > + } > + atomic_inc(&mm->mm_count); > + > + /* > + * Serialize the update against mmu_notifier_unregister. A > + * side note: mmu_notifier_release can't run concurrently with > + * us because we hold the mm_users pin (either implicitly as > + * current->mm or explicitly with get_task_mm() or similar). > + * We can't race against any other mmu notifier method either > + * thanks to mm_take_all_locks(). > + */ > + spin_lock(&mm->mmu_notifier_mm->lock); > + hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); > + spin_unlock(&mm->mmu_notifier_mm->lock); > + > + mm_drop_all_locks(mm); > +out_cleanup: > + if (take_mmap_sem) > + up_write(&mm->mmap_sem); > + /* kfree() does nothing if mmu_notifier_mm is NULL */ > + kfree(mmu_notifier_mm); > +out: > + BUG_ON(atomic_read(&mm->mm_users) <= 0); > + return ret; > +} > + > +/* > + * Must not hold mmap_sem nor any other VM related lock when calling > + * this registration function. 
Must also ensure mm_users can't go down > + * to zero while this runs to avoid races with mmu_notifier_release, > + * so mm has to be current->mm or the mm should be pinned safely such > + * as with get_task_mm(). If the mm is not current->mm, the mm_users > + * pin should be released by calling mmput after mmu_notifier_register > + * returns. mmu_notifier_unregister must be always called to > + * unregister the notifier. mm_count is automatically pinned to allow > + * mmu_notifier_unregister to safely run at any time later, before or > + * after exit_mmap. ->release will always be called before exit_mmap > + * frees the pages. > + */ > +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + return do_mmu_notifier_register(mn, mm, 1); > +} > +EXPORT_SYMBOL_GPL(mmu_notifier_register); > + > +/* > + * Same as mmu_notifier_register but here the caller must hold the > + * mmap_sem in write mode. > + */ > +int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + return do_mmu_notifier_register(mn, mm, 0); > +} > +EXPORT_SYMBOL_GPL(__mmu_notifier_register); > + > +/* this is called after the last mmu_notifier_unregister() returned */ > +void __mmu_notifier_mm_destroy(struct mm_struct *mm) > +{ > + BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list)); > + kfree(mm->mmu_notifier_mm); > + mm->mmu_notifier_mm = LIST_POISON1; /* debug */ > +} > + > +/* > + * This releases the mm_count pin automatically and frees the mm > + * structure if it was the last user of it. It serializes against > + * running mmu notifiers with RCU and against mmu_notifier_unregister > + * with the unregister lock + RCU. All sptes must be dropped before > + * calling mmu_notifier_unregister. ->release or any other notifier > + * method may be invoked concurrently with mmu_notifier_unregister, > + * and only after mmu_notifier_unregister returned we're guaranteed > + * that ->release or any other method can't run anymore. > + */ > +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + BUG_ON(atomic_read(&mm->mm_count) <= 0); > + > + spin_lock(&mm->mmu_notifier_mm->lock); > + if (!hlist_unhashed(&mn->hlist)) { > + hlist_del_rcu(&mn->hlist); > + > + /* > + * RCU here will force exit_mmap to wait ->release to finish > + * before freeing the pages. > + */ > + rcu_read_lock(); > + spin_unlock(&mm->mmu_notifier_mm->lock); > + /* > + * exit_mmap will block in mmu_notifier_release to > + * guarantee ->release is called before freeing the > + * pages. > + */ > + if (mn->ops->release) > + mn->ops->release(mn, mm); > + rcu_read_unlock(); > + } else > + spin_unlock(&mm->mmu_notifier_mm->lock); > + > + /* > + * Wait any running method to finish, of course including > + * ->release if it was run by mmu_notifier_relase instead of us. 
> + */ > + synchronize_rcu(); > + > + BUG_ON(atomic_read(&mm->mm_count) <= 0); > + > + mmdrop(mm); > +} > +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); > diff --git a/mm/mprotect.c b/mm/mprotect.c > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -21,6 +21,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -198,10 +199,12 @@ success: > dirty_accountable = 1; > } > > + mmu_notifier_invalidate_range_start(mm, start, end); > if (is_vm_hugetlb_page(vma)) > hugetlb_change_protection(vma, start, end, vma->vm_page_prot); > else > change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); > + mmu_notifier_invalidate_range_end(mm, start, end); > vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); > vm_stat_account(mm, newflags, vma->vm_file, nrpages); > return 0; > diff --git a/mm/mremap.c b/mm/mremap.c > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -18,6 +18,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -74,7 +75,11 @@ static void move_ptes(struct vm_area_str > struct mm_struct *mm = vma->vm_mm; > pte_t *old_pte, *new_pte, pte; > spinlock_t *old_ptl, *new_ptl; > + unsigned long old_start; > > + old_start = old_addr; > + mmu_notifier_invalidate_range_start(vma->vm_mm, > + old_start, old_end); > if (vma->vm_file) { > /* > * Subtle point from Rajesh Venkatasubramanian: before > @@ -116,6 +121,7 @@ static void move_ptes(struct vm_area_str > pte_unmap_unlock(old_pte - 1, old_ptl); > if (mapping) > spin_unlock(&mapping->i_mmap_lock); > + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); > } > > #define LATENCY_LIMIT (64 * PAGE_SIZE) > diff --git a/mm/rmap.c b/mm/rmap.c > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -49,6 +49,7 @@ > #include > #include > #include > +#include > > #include > > @@ -287,7 +288,7 @@ static int page_referenced_one(struct pa > if (vma->vm_flags & VM_LOCKED) { > referenced++; > *mapcount = 1; /* break early from loop */ > - } else if (ptep_clear_flush_young(vma, address, pte)) > + } else if (ptep_clear_flush_young_notify(vma, address, pte)) > referenced++; > > /* Pretend the page is referenced if the task has the > @@ -457,7 +458,7 @@ static int page_mkclean_one(struct page > pte_t entry; > > flush_cache_page(vma, address, pte_pfn(*pte)); > - entry = ptep_clear_flush(vma, address, pte); > + entry = ptep_clear_flush_notify(vma, address, pte); > entry = pte_wrprotect(entry); > entry = pte_mkclean(entry); > set_pte_at(mm, address, pte, entry); > @@ -717,14 +718,14 @@ static int try_to_unmap_one(struct page > * skipped over this mm) then we should reactivate it. > */ > if (!migration && ((vma->vm_flags & VM_LOCKED) || > - (ptep_clear_flush_young(vma, address, pte)))) { > + (ptep_clear_flush_young_notify(vma, address, pte)))) { > ret = SWAP_FAIL; > goto out_unmap; > } > > /* Nuke the page table entry. */ > flush_cache_page(vma, address, page_to_pfn(page)); > - pteval = ptep_clear_flush(vma, address, pte); > + pteval = ptep_clear_flush_notify(vma, address, pte); > > /* Move the dirty bit to the physical page now the pte is gone. */ > if (pte_dirty(pteval)) > @@ -849,12 +850,12 @@ static void try_to_unmap_cluster(unsigne > page = vm_normal_page(vma, address, *pte); > BUG_ON(!page || PageAnon(page)); > > - if (ptep_clear_flush_young(vma, address, pte)) > + if (ptep_clear_flush_young_notify(vma, address, pte)) > continue; > > /* Nuke the page table entry. 
*/ > flush_cache_page(vma, address, pte_pfn(*pte)); > - pteval = ptep_clear_flush(vma, address, pte); > + pteval = ptep_clear_flush_notify(vma, address, pte); > > /* If nonlinear, store the file page offset in the pte. */ > if (page->index != linear_page_index(vma, address)) From dotanba at gmail.com Fri May 16 18:15:03 2008 From: dotanba at gmail.com (Dotan Barak) Date: Sat, 17 May 2008 03:15:03 +0200 Subject: Re: [ofa-general] timeout question In-Reply-To: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> Message-ID: <482E3197.5040604@gmail.com> Rui Machado wrote: > Hi, > > >>> when setting the timeout in a struct ibv_qp_attr, this value >>> corresponds to the Local ACK timeout which according to the InfiniBand >>> spec will define the transport timer timeout given by the formula >>> 4.096uS * 2^(Local ACK timeout). Is this right? >>> And is there a value for this timeout to be considered "good practice"? >>> >>> >> This value depends on your fabric size, on the HCA you have (and some more factors).. >> >>> Also, in a client-server setup, if this timeout is set to a "big >>> value" (like 30) when the server dies, the client will take that >>> amount of time to realize the failure. Is this correct? >>> >>> >> Yes, after (at least) the calculated timeout multiplied by retry_count usec, the sender QP will get a retry exceeded >> (if there was a SR which was posted without any response from the receiver). >> >> > hmm..... and is there no workaround for this situation? I > mean, if the server dies, isn't there any possibility that > the sender/client realizes this? If the timeout is too large this > can be cumbersome. > > I tried reducing the timeout and indeed the client realizes faster > when the server exits but another problem arises: Without exiting the > server, > on the client side I get the error (retry exceeded) when polling for a > recently posted send - this after some hours. > You don't really need to set a timeout of hours, I believe that a few seconds should be enough for almost any (today's) cluster... For example, a timeout value of 18 corresponds to 4.096 usec * 2^18, roughly one second per retry. > Thank you for the help. > You are welcome :) Dotan From rdreier at cisco.com Fri May 16 12:41:42 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 12:41:42 -0700 Subject: [ofa-general] [PATCH/RFC] Remove IB_DEVICE_SEND_W_INV from 2.6.26 Message-ID: Given that we should have full support for memory management extensions pending for 2.6.27, and the support we have for send w/ invalidate in 2.6.26 is incomplete (no provision for returning the STag/L_Key in a receive completion, and no implementation of that in amso1100 for one thing), I think it makes sense to simply remove the IB_DEVICE_SEND_W_INV capability flag rather than moving it to a new bit position. Then when we add all the memory management extension support in 2.6.27, we can just use bit 21 for IB_DEVICE_MEM_MGT_EXTENSIONS and avoid having such fine-grained distinctions, and avoid having all sorts of strange code to monkey around with the SEND_W_INV bit in libibverbs and userspace driver libraries. Thoughts pro or con? - R.
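For reference, IB_DEVICE_SEND_W_INV is the bit a kernel consumer would have to test before deciding to post IB_WR_SEND_WITH_INV. A minimal sketch of such a check follows; this is an illustration only, not code from any in-tree consumer, and the helper name is made up:

#include <rdma/ib_verbs.h>

/* Hypothetical helper: decide once at setup time whether this device
 * accepts send-with-invalidate work requests.
 */
static int can_use_send_with_inv(struct ib_device *device)
{
	struct ib_device_attr attr;

	if (ib_query_device(device, &attr))
		return 0;

	return !!(attr.device_cap_flags & IB_DEVICE_SEND_W_INV);
}

With the flag gone, a 2.6.27 consumer would instead test the single IB_DEVICE_MEM_MGT_EXTENSIONS bit covering the whole feature set.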
diff --git a/drivers/infiniband/hw/amso1100/c2_rnic.c b/drivers/infiniband/hw/amso1100/c2_rnic.c index 9a054c6..b1441ae 100644 --- a/drivers/infiniband/hw/amso1100/c2_rnic.c +++ b/drivers/infiniband/hw/amso1100/c2_rnic.c @@ -455,8 +455,7 @@ int __devinit c2_rnic_init(struct c2_dev *c2dev) IB_DEVICE_CURR_QP_STATE_MOD | IB_DEVICE_SYS_IMAGE_GUID | IB_DEVICE_ZERO_STAG | - IB_DEVICE_MEM_WINDOW | - IB_DEVICE_SEND_W_INV); + IB_DEVICE_MEM_WINDOW); /* Allocate the qptr_array */ c2dev->qptr_array = vmalloc(C2_MAX_CQS * sizeof(void *)); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..31d30b1 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -105,7 +105,6 @@ enum ib_device_cap_flags { */ IB_DEVICE_UD_IP_CSUM = (1<<18), IB_DEVICE_UD_TSO = (1<<19), - IB_DEVICE_SEND_W_INV = (1<<21), }; enum ib_atomic_cap { From swise at opengridcomputing.com Fri May 16 12:46:42 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 14:46:42 -0500 Subject: [ofa-general] Re: [PATCH/RFC] Remove IB_DEVICE_SEND_W_INV from 2.6.26 In-Reply-To: References: Message-ID: <482DE4A2.9070305@opengridcomputing.com> Sounds ok to me. Roland Dreier wrote: > Given that we should have full support for memory management extensions > pending for 2.6.27, and the support we have for send w/ invalidate in > 2.6.26 is incomplete (no provision for returning STag/L_Key in receive > completion and no implementation of that in amso1100 for one thing), I > think it makes sense to simply remove the IB_DEVICE_SEND_W_INV > capability flag rather than moving it to a new bit position. > > Then when we add all the memory management extension support in 2.6.27, > we can just use bit 21 for IB_DEVICE_MEM_MGT_EXTENSIONS and avoid having > such fine grained distinctions, and avoid having all sorts of strange > code to monkey around with the SEND_W_INV bit in libibverbs and > userspace driver libraries. > > Thoughts pro or con? > > - R. > > diff --git a/drivers/infiniband/hw/amso1100/c2_rnic.c b/drivers/infiniband/hw/amso1100/c2_rnic.c > index 9a054c6..b1441ae 100644 > --- a/drivers/infiniband/hw/amso1100/c2_rnic.c > +++ b/drivers/infiniband/hw/amso1100/c2_rnic.c > @@ -455,8 +455,7 @@ int __devinit c2_rnic_init(struct c2_dev *c2dev) > IB_DEVICE_CURR_QP_STATE_MOD | > IB_DEVICE_SYS_IMAGE_GUID | > IB_DEVICE_ZERO_STAG | > - IB_DEVICE_MEM_WINDOW | > - IB_DEVICE_SEND_W_INV); > + IB_DEVICE_MEM_WINDOW); > > /* Allocate the qptr_array */ > c2dev->qptr_array = vmalloc(C2_MAX_CQS * sizeof(void *)); > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index 911a661..31d30b1 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -105,7 +105,6 @@ enum ib_device_cap_flags { > */ > IB_DEVICE_UD_IP_CSUM = (1<<18), > IB_DEVICE_UD_TSO = (1<<19), > - IB_DEVICE_SEND_W_INV = (1<<21), > }; > > enum ib_atomic_cap { > From hrosenstock at xsigo.com Fri May 16 12:52:09 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 16 May 2008 12:52:09 -0700 Subject: [ofa-general] Re: [PATCH] OpenSM: Add a Performance Manager HOWTO to the docs and the dist In-Reply-To: <20080515132723.3add7c6a.weiny2@llnl.gov> References: <20080515132723.3add7c6a.weiny2@llnl.gov> Message-ID: <1210967529.12616.287.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-15 at 13:27 -0700, Ira Weiny wrote: > I decided to write a little HOWTO to help people to set it up. Nice writeup :-) > 5) Can be run in a standby SM I thought it was changed so that it can run in a standalone mode without SM. 
Am I confusing this with something else ? -- Hal From hrosenstock at xsigo.com Fri May 16 13:13:52 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 16 May 2008 13:13:52 -0700 Subject: [ofa-general] [PATCH] [TRIVIAL] OpenSM/doc/modular_routing.txt: Fix typo Message-ID: <1210968832.12616.294.camel@hrosenstock-ws.xsigo.com> OpenSM/doc/modular_routing.txt: Fix typo Signed-off-by: Hal Rosenstock diff --git a/opensm/doc/modular-routing.txt b/opensm/doc/modular-routing.txt index f2f70f0..a531c5a 100644 --- a/opensm/doc/modular-routing.txt +++ b/opensm/doc/modular-routing.txt @@ -64,7 +64,7 @@ standard opensm dump directory (/var/log by default) when OSM_LOG_ROUTING logging flag is set. When routing engine 'file' is activated, but dump file is not specified -or not cannot be open default lid matrix algorithm will be used. +or cannot be opened, the default lid matrix algorithm will be used. There is also a switch forwarding tables dumper which generates a file compatible with dump_lfts.sh output. This file can be used From hrosenstock at xsigo.com Fri May 16 13:16:16 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 16 May 2008 13:16:16 -0700 Subject: [ofa-general] [PATCH] OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED Message-ID: <1210968976.12616.298.camel@hrosenstock-ws.xsigo.com> OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED I'll leave the other heavy lifting to Yevgeny :-) Signed-off-by: Hal Rosenstock diff --git a/opensm/doc/QoS_management_in_OpenSM.txt b/opensm/doc/QoS_management_in_OpenSM.txt index 307c80f..17a4fd5 100644 --- a/opensm/doc/QoS_management_in_OpenSM.txt +++ b/opensm/doc/QoS_management_in_OpenSM.txt @@ -65,7 +65,7 @@ matching rules (see below). Port group lists ports by: II) QoS Setup (denoted by qos-setup). This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. -However, this is not supported in OFED 1.3. +However, this is not supported in OpenSM currently. SL2VL and VLArb tables should be configured in the OpenSM options file (default location - /var/cache/opensm/opensm.opts). @@ -203,8 +203,8 @@ policy file and their syntax: qos-setup # This section of the policy file describes how to set up SL2VL and VL # Arbitration tables on various nodes in the fabric. - # However, this is not supported in OFED 1.3 - the section is parsed - # and ignored. SL2VL and VLArb tables should be configured in the + # However, this is not supported in OpenSM currently - the section is + # parsed and ignored. SL2VL and VLArb tables should be configured in the # OpenSM options file (by default - /var/cache/opensm/opensm.opts). end-qos-setup From weiny2 at llnl.gov Fri May 16 13:35:02 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 16 May 2008 13:35:02 -0700 Subject: [ofa-general] Re: [PATCH V2] OpenSM: Add a Performance Manager HOWTO to the docs and the dist In-Reply-To: <1210967529.12616.287.camel@hrosenstock-ws.xsigo.com> References: <20080515132723.3add7c6a.weiny2@llnl.gov> <1210967529.12616.287.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080516133502.27a1e9b6.weiny2@llnl.gov> On Fri, 16 May 2008 12:52:09 -0700 Hal Rosenstock wrote: > On Thu, 2008-05-15 at 13:27 -0700, Ira Weiny wrote: > > I decided to write a little HOWTO to help people to set it up. > > Nice writeup :-) > > > 5) Can be run in a standby SM > > I thought it was changed so that it can run in a standalone mode without > SM. Am I confusing this with something else ? 
> I think you are right, I should have said standalone. However, can't it also work in a standby SM? Yeah, from the patch which Sasha applied: opensm/perfmgr: PerfMgr for SM standby and inactive states Here is an updated patch with the correction. Ira >From 9be13c3da4d34ad0a736ced4c9e3bb5e13a24bb6 Mon Sep 17 00:00:00 2001 From: Ira K. Weiny Date: Thu, 15 May 2008 08:19:17 -0700 Subject: [PATCH] Add a Performance Manager HOWTO to the docs and the dist Signed-off-by: Ira K. Weiny --- opensm/Makefile.am | 3 +- opensm/doc/performance-manager-HOWTO.txt | 153 ++++++++++++++++++++++++++++++ opensm/opensm.spec.in | 2 +- 3 files changed, 156 insertions(+), 2 deletions(-) create mode 100644 opensm/doc/performance-manager-HOWTO.txt diff --git a/opensm/Makefile.am b/opensm/Makefile.am index 3811963..4c79f49 100644 --- a/opensm/Makefile.am +++ b/opensm/Makefile.am @@ -24,8 +24,9 @@ endif man_MANS = man/opensm.8 man/osmtest.8 various_scripts = $(wildcard scripts/*) +docs = doc/performance-manager-HOWTO.txt -EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS) +EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS) $(docs) dist-hook: $(EXTRA_DIST) if [ -x $(top_srcdir)/../gen_chlog.sh ] ; then \ diff --git a/opensm/doc/performance-manager-HOWTO.txt b/opensm/doc/performance-manager-HOWTO.txt new file mode 100644 index 0000000..f0380f3 --- /dev/null +++ b/opensm/doc/performance-manager-HOWTO.txt @@ -0,0 +1,153 @@ +OpenSM Performance Manager HOWTO +================================ + +Introduction +============ + +OpenSM now includes a performance manager which collects port counters from +the subnet and stores them internally in OpenSM. + +Some of the features of the performance manager are: + + 1) Collect port data and error counters per the v1.2 spec and store them + in 64-bit internal counts. + 2) Automatic reset of counters when they reach approximately 3/4 full. + (While not guaranteeing that counts will not be missed, this does + keep counts incrementing as best as possible given the current + hardware limitations.) + 3) Basic warnings in the OpenSM log on "critical" errors like symbol + errors. + 4) Automatically detects "outside" resets of counters and adjusts to + continue collecting data. + 5) Can be run when OpenSM is in standby or inactive states. + +Known issues are: + + 1) Data counters will be lost on high data rate links. Sweeping the + fabric fast enough for a DDR link is not practical. + 2) Default partition support only. + + +Setup and Usage +=============== + +Using the Performance Manager consists of 3 steps: + + 1) compiling in support for the perfmgr (optionally: the console + socket as well) + 2) enabling the perfmgr and console in opensm.opts + 3) retrieving data which has been collected. + 3a) using the console to "dump data" + 3b) using a plugin module to store the data to your own + "database" + +Step 1: Compile in support for the Performance Manager +------------------------------------------------------ + +Because of the performance manager's experimental status, it is not enabled at +compile time by default. This will hopefully soon change as more people use +it and confirm that it does not break things... ;-) The configure option is +"--enable-perf-mgr". + +At this time it is really best to enable the console socket option as well. +OpenSM can be run in an "interactive" mode. But with the console socket option +turned on one can also make a connection to a running OpenSM. The console +option is "--enable-console-socket".
This option requires the use of +tcp_wrappers to ensure security. Please be aware of your configuration for +tcp_wrappers as the commands presented in the console can affect the operation +of your subnet. + +The following configure line includes turning on the performance manager as +well as the console: + + ./configure --enable-perf-mgr --enable-console-socket + + +Step 2: Enable the perfmgr and console in opensm.opts +----------------------------------------------------- + +Turning the Performance Manager on is pretty easy: set the following options in +the opensm.opts config file. (Default location is +/var/cache/opensm/opensm.opts.) + + # Turn it all on. + perfmgr TRUE + + # sweep time in seconds + perfmgr_sweep_time_s 180 + + # Dump file to dump the events to + event_db_dump_file /var/log/opensm_port_counters.log + +Also enable the console socket and configure the port for it to listen on, if +desired. + + # console [off|local|socket] + console socket + + # Telnet port for console (default 10000) + console_port 10000 + +As noted above, you also need to set up tcp_wrappers to prevent unauthorized +users from connecting to the console.[*] + + [*] As an alternative you can use the loopback mode, but I noticed when + writing this (OpenSM v3.1.10; OFED 1.3) that there are some bugs in + specifying the loopback mode in the opensm.opts file. Look for this to + be fixed in newer versions. + + [**] Also you could use "local", but this is only useful if you run + OpenSM in the foreground of a terminal. As OpenSM is usually started + as a daemon I left this out as an option. + +Step 3: Retrieve data which has been collected +---------------------------------------------- + +Step 3a: Using the console dump function +------------------------------------ + +The console command "perfmgr dump_counters" will dump counters to the file +specified in the opensm.opts file. In the example above, this is +"/var/log/opensm_port_counters.log". + +Example output is below: + + +"SW1 wopr ISR9024D (MLX4 FW)" 0x8f10400411f56 port 1 (Since Mon May 12 13:27:14 2008) + symbol_err_cnt : 0 + link_err_recover : 0 + link_downed : 0 + rcv_err : 0 + rcv_rem_phys_err : 0 + rcv_switch_relay_err : 2 + xmit_discards : 0 + xmit_constraint_err : 0 + rcv_constraint_err : 0 + link_integrity_err : 0 + buf_overrun_err : 0 + vl15_dropped : 0 + xmit_data : 470435 + rcv_data : 405956 + xmit_pkts : 8954 + rcv_pkts : 6900 + unicast_xmit_pkts : 0 + unicast_rcv_pkts : 0 + multicast_xmit_pkts : 0 + multicast_rcv_pkts : 0 + + + +Step 3b: Using a plugin module +------------------------------ + +If you want a more automated method of retrieving the data, OpenSM provides a +plugin interface to extend OpenSM. The header file is osm_event_plugin.h. +The functions you register with this interface will be called when data is +collected. You can then use that data as appropriate. + +An example plugin can be configured at compile time using the +"--enable-default-event-plugin" option on the configure line. This plugin is +very simple. It logs "events" received from the performance manager to a log +file. I don't recommend using this directly, but rather using it as a template to +create your own plugin.
+ diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in index feabfef..c36d6f2 100644 --- a/opensm/opensm.spec.in +++ b/opensm/opensm.spec.in @@ -125,7 +125,7 @@ fi %{_sbindir}/opensm %{_sbindir}/osmtest %{_mandir}/man8/* -%doc AUTHORS COPYING README +%doc AUTHORS COPYING README doc/performance-manager-HOWTO.txt %{_sysconfdir}/init.d/opensmd %{_sbindir}/sldd.sh %config(noreplace) @OPENSM_CONFIG_DIR@/opensm.conf -- 1.5.1 From hrosenstock at xsigo.com Fri May 16 13:37:00 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 16 May 2008 13:37:00 -0700 Subject: [ofa-general] Re: [PATCH V2] OpenSM: Add a Performance Manager HOWTO to the docs and the dist In-Reply-To: <20080516133502.27a1e9b6.weiny2@llnl.gov> References: <20080515132723.3add7c6a.weiny2@llnl.gov> <1210967529.12616.287.camel@hrosenstock-ws.xsigo.com> <20080516133502.27a1e9b6.weiny2@llnl.gov> Message-ID: <1210970220.12616.304.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-16 at 13:35 -0700, Ira Weiny wrote: > However, can't it also work in a standby SM? Yes, it works with SM in any state and standalone without SM AFAIK. -- Hal From hrosenstock at xsigo.com Fri May 16 14:18:38 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 16 May 2008 14:18:38 -0700 Subject: [ofa-general] [PATCH] OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED In-Reply-To: <1210968976.12616.298.camel@hrosenstock-ws.xsigo.com> References: <1210968976.12616.298.camel@hrosenstock-ws.xsigo.com> Message-ID: <1210972718.12616.309.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-16 at 13:16 -0700, Hal Rosenstock wrote: > OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED > > I'll leave the other heavy lifting to Yevgeny :-) I forgot: Please apply to master and ofed_1_3 > Signed-off-by: Hal Rosenstock From rdreier at cisco.com Fri May 16 14:28:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 14:28:18 -0700 Subject: [ofa-general] [PATCH/RFC] RDMA/cxgb3: Fix uninitialized variable warning in iwch_post_send() Message-ID: drivers/infiniband/hw/cxgb3/iwch_qp.c: In function 'iwch_post_send': drivers/infiniband/hw/cxgb3/iwch_qp.c:232: warning: 't3_wr_flit_cnt' may be used uninitialized in this function This is what akpm describes as "the dopey gcc-doesn't-know-that-foo(&var)-writes-to-var problem." 
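Reduced to a minimal sketch, the pattern gcc trips over looks like this (both function names here are invented for illustration):

/* gcc cannot prove that init_cnt() always writes *cnt before demo()
 * reads it, so it emits a spurious "may be used uninitialized" warning.
 */
static void init_cnt(u8 *cnt)
{
	*cnt = 3;
}

static u8 demo(void)
{
	u8 cnt;

	init_cnt(&cnt);
	return cnt;
}

uninitialized_var() silences exactly this class of false positive without generating any extra initialization code.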
Signed-off-by: Roland Dreier --- drivers/infiniband/hw/cxgb3/iwch_qp.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 79dbe5b..9926137 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -229,7 +229,7 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { int err = 0; - u8 t3_wr_flit_cnt; + u8 uninitialized_var(t3_wr_flit_cnt); enum t3_wr_opcode t3_wr_opcode = 0; enum t3_wr_flags t3_wr_flags; struct iwch_qp *qhp; -- 1.5.5.1 From swise at opengridcomputing.com Fri May 16 14:52:55 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 16:52:55 -0500 Subject: [ofa-general] Re: [PATCH/RFC] RDMA/cxgb3: Fix uninitialized variable warning in iwch_post_send() In-Reply-To: References: Message-ID: <482E0237.3000603@opengridcomputing.com> Acked-by: Steve Wise From swise at opengridcomputing.com Fri May 16 15:30:37 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:30:37 -0500 Subject: [ofa-general] [PATCH RFC v3] RDMA: New Memory Extensions Message-ID: <20080516223037.27127.26712.stgit@dell3.ogc.int> The following patch series proposes: - The API and core changes needed to implement the IB BMME and iWARP equivalent memory extensions. - cxgb3 support. Changes since version 2: - added device attribute max_fast_reg_page_list_len - added cxgb3 patch Changes since version 1: - ib_alloc_mr() -> ib_alloc_fast_reg_mr() - pbl_depth -> max_page_list_len - page_list_len -> max_page_list_len where it makes sense - int -> unsigned int where needed - fbo -> first_byte_offset - added page size and page_list_len to fast_reg union in ib_send_wr - rearranged work request fast_reg union of ib_send_wr to pack it - dropped remove_access parameter from ib_alloc_fast_reg_mr() - IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS - compiled Steve. From swise at opengridcomputing.com Fri May 16 15:32:43 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:32:43 -0500 Subject: [ofa-general] [PATCH RFC v3] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223037.27127.26712.stgit@dell3.ogc.int> References: <20080516223037.27127.26712.stgit@dell3.ogc.int> Message-ID: <20080516223243.27127.10687.stgit@dell3.ogc.int> Support for the IB BMME and iWARP equivalent memory extensions to non-shared memory regions. This includes: - allocation of an ib_mr for use in fast register work requests - device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (such as via dma_alloc_coherent). - fast register memory region work request - invalidate local memory region work request - read with invalidate local memory region work request (iWARP only) Design details: - New device capability flag added: IB_DEVICE_MEM_MGT_EXTENSIONS indicates device support for this feature. - New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request. - New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr. - New API function, ib_alloc_fast_reg_mr(), used to allocate fast_reg memory regions. - New API function, ib_alloc_fast_reg_page_list(), to allocate device-specific page lists. - New API function, ib_free_fast_reg_page_list(), to free said page lists. Usage Model: - MR allocated with ib_alloc_fast_reg_mr() - Page lists allocated via ib_alloc_fast_reg_page_list().
- MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) - MR deallocated with ib_dereg_mr() - page lists dealloced via ib_free_fast_reg_page_list(). Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs via posting work requests to the send queue. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can be achieved while still allowing device-specific page_list processing. Signed-off-by: Steve Wise --- drivers/infiniband/core/verbs.c | 46 ++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 56 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 102 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 0504208..0a334b4 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr) } EXPORT_SYMBOL(ib_dereg_mr); +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len) +{ + struct ib_mr *mr; + + if (!pd->device->alloc_fast_reg_mr) + return ERR_PTR(-ENOSYS); + + mr = pd->device->alloc_fast_reg_mr(pd, max_page_list_len); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + mr->uobject = NULL; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_mr); + +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int max_page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + if (!device->alloc_fast_reg_page_list) + return ERR_PTR(-ENOSYS); + + page_list = device->alloc_fast_reg_page_list(device, max_page_list_len); + + if (!IS_ERR(page_list)) { + page_list->device = device; + page_list->max_page_list_len = max_page_list_len; + } + + return page_list; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_page_list); + +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) +{ + page_list->device->free_fast_reg_page_list(page_list); +} +EXPORT_SYMBOL(ib_free_fast_reg_page_list); + /* Memory windows */ struct ib_mw *ib_alloc_mw(struct ib_pd *pd) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..c4ace0f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -106,6 +106,7 @@ enum ib_device_cap_flags { IB_DEVICE_UD_IP_CSUM = (1<<18), IB_DEVICE_UD_TSO = (1<<19), IB_DEVICE_SEND_W_INV = (1<<21), + IB_DEVICE_MEM_MGT_EXTENSIONS = (1<<22), }; enum ib_atomic_cap { @@ -151,6 +152,7 @@ struct ib_device_attr { int max_srq; int max_srq_wr; int max_srq_sge; + unsigned int max_fast_reg_page_list_len; u16 max_pkeys; u8 local_ca_ack_delay; }; @@ -414,6 +416,8 @@ enum ib_wc_opcode { IB_WC_FETCH_ADD, IB_WC_BIND_MW, IB_WC_LSO, + IB_WC_FAST_REG_MR, + IB_WC_INVALIDATE_MR, /* * Set value of IB_WC_RECV so consumers can test if a completion is a * receive by testing (opcode & IB_WC_RECV). 
From swise at opengridcomputing.com Fri May 16 15:32:56 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:32:56 -0500 Subject: [ofa-general] [RESEND PATCH RFC v3 0/2] RDMA: New Memory Extensions Message-ID: <20080516223256.27221.34568.stgit@dell3.ogc.int> The following patch series proposes: - The API and core changes needed to implement the IB BMME and iWARP equivalent memory extensions. - cxgb3 support. Changes since version 2: - added device attribute max_fast_reg_page_list_len - added cxgb3 patch Changes since version 1: - ib_alloc_mr() -> ib_alloc_fast_reg_mr() - pbl_depth -> max_page_list_len - page_list_len -> max_page_list_len where it makes sense - int -> unsigned int where needed - fbo -> first_byte_offset - added page size and page_list_len to fast_reg union in ib_send_wr - rearranged work request fast_reg union of ib_send_wr to pack it - dropped remove_access parameter from ib_alloc_fast_reg_mr() - IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS - compiled Steve.
From swise at opengridcomputing.com Fri May 16 15:34:20 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:34:20 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223256.27221.34568.stgit@dell3.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> Message-ID: <20080516223419.27221.49014.stgit@dell3.ogc.int> Support for the IB BMME and iWARP equivalent memory extensions to non-shared memory regions. This includes: - allocation of an ib_mr for use in fast register work requests - device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (e.g. via dma_alloc_coherent()). - fast register memory region work request - invalidate local memory region work request - read with invalidate local memory region work request (iWARP only) Design details: - New device capability flag added: IB_DEVICE_MEM_MGT_EXTENSIONS indicates device support for this feature. - New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request. - New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr. - New API function, ib_alloc_fast_reg_mr(), used to allocate fast_reg memory regions. - New API function, ib_alloc_fast_reg_page_list(), used to allocate device-specific page lists. - New API function, ib_free_fast_reg_page_list(), used to free said page lists. Usage Model: - MR allocated with ib_alloc_fast_reg_mr() - Page lists allocated via ib_alloc_fast_reg_page_list(). - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) - MR deallocated with ib_dereg_mr() - page lists deallocated via ib_free_fast_reg_page_list(). Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs by posting work requests to the send queue. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can be achieved while still allowing device-specific page_list processing.
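As an illustration of that usage model (this sketch is not part of the patch; qp, pd, device, npages, io_vaddr, io_len and all sizes are placeholder assumptions, and error handling is omitted):

	/* One-time setup: a fast_reg MR and a page list good for up to 16 pages. */
	struct ib_mr *mr = ib_alloc_fast_reg_mr(pd, 16);
	struct ib_fast_reg_page_list *pl = ib_alloc_fast_reg_page_list(device, 16);
	struct ib_send_wr wr, *bad_wr;

	/* Per I/O: fill pl->page_list[] with DMA addresses, then bind. */
	memset(&wr, 0, sizeof wr);
	wr.opcode = IB_WR_FAST_REG_MR;
	wr.wr.fast_reg.mr = mr;
	wr.wr.fast_reg.page_list = pl;
	wr.wr.fast_reg.page_list_len = npages;
	wr.wr.fast_reg.page_size = PAGE_SIZE;
	wr.wr.fast_reg.iova_start = io_vaddr;
	wr.wr.fast_reg.length = io_len;
	wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ;
	ib_post_send(qp, &wr, &bad_wr);

	/* ... peer does RDMA against mr->rkey ... then invalidate before reuse: */
	memset(&wr, 0, sizeof wr);
	wr.opcode = IB_WR_INVALIDATE_MR;
	wr.wr.local_inv.mr = mr;
	ib_post_send(qp, &wr, &bad_wr);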
Signed-off-by: Steve Wise --- drivers/infiniband/core/verbs.c | 46 ++++++++++++++++++++++++++++++++ include/rdma/ib_verbs.h | 56 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 102 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 0504208..0a334b4 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr) } EXPORT_SYMBOL(ib_dereg_mr); +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len) +{ + struct ib_mr *mr; + + if (!pd->device->alloc_fast_reg_mr) + return ERR_PTR(-ENOSYS); + + mr = pd->device->alloc_fast_reg_mr(pd, max_page_list_len); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + mr->uobject = NULL; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_mr); + +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int max_page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + if (!device->alloc_fast_reg_page_list) + return ERR_PTR(-ENOSYS); + + page_list = device->alloc_fast_reg_page_list(device, max_page_list_len); + + if (!IS_ERR(page_list)) { + page_list->device = device; + page_list->max_page_list_len = max_page_list_len; + } + + return page_list; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_page_list); + +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) +{ + page_list->device->free_fast_reg_page_list(page_list); +} +EXPORT_SYMBOL(ib_free_fast_reg_page_list); + /* Memory windows */ struct ib_mw *ib_alloc_mw(struct ib_pd *pd) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..c4ace0f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -106,6 +106,7 @@ enum ib_device_cap_flags { IB_DEVICE_UD_IP_CSUM = (1<<18), IB_DEVICE_UD_TSO = (1<<19), IB_DEVICE_SEND_W_INV = (1<<21), + IB_DEVICE_MEM_MGT_EXTENSIONS = (1<<22), }; enum ib_atomic_cap { @@ -151,6 +152,7 @@ struct ib_device_attr { int max_srq; int max_srq_wr; int max_srq_sge; + unsigned int max_fast_reg_page_list_len; u16 max_pkeys; u8 local_ca_ack_delay; }; @@ -414,6 +416,8 @@ enum ib_wc_opcode { IB_WC_FETCH_ADD, IB_WC_BIND_MW, IB_WC_LSO, + IB_WC_FAST_REG_MR, + IB_WC_INVALIDATE_MR, /* * Set value of IB_WC_RECV so consumers can test if a completion is a * receive by testing (opcode & IB_WC_RECV). 
@@ -628,6 +632,9 @@ enum ib_wr_opcode { IB_WR_ATOMIC_FETCH_AND_ADD, IB_WR_LSO, IB_WR_SEND_WITH_INV, + IB_WR_FAST_REG_MR, + IB_WR_INVALIDATE_MR, + IB_WR_READ_WITH_INV, }; enum ib_send_flags { @@ -676,6 +683,20 @@ struct ib_send_wr { u16 pkey_index; /* valid for GSI only */ u8 port_num; /* valid for DR SMPs on switch only */ } ud; + struct { + u64 iova_start; + struct ib_mr *mr; + struct ib_fast_reg_page_list *page_list; + unsigned int page_size; + unsigned int page_list_len; + unsigned int first_byte_offset; + u32 length; + int access_flags; + + } fast_reg; + struct { + struct ib_mr *mr; + } local_inv; } wr; }; @@ -1014,6 +1035,10 @@ struct ib_device { int (*query_mr)(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int (*dereg_mr)(struct ib_mr *mr); + struct ib_mr * (*alloc_fast_reg_mr)(struct ib_pd *pd, + int max_page_list_len); + struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device, int page_list_len); + void (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list); int (*rereg_phys_mr)(struct ib_mr *mr, int mr_rereg_mask, struct ib_pd *pd, @@ -1808,6 +1833,37 @@ int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int ib_dereg_mr(struct ib_mr *mr); /** + * ib_alloc_fast_reg_mr - Allocates memory region usable with the + * IB_WR_FAST_REG_MR send work request. + * @pd: The protection domain associated with the region. + * @max_page_list_len: requested max physical buffer list size to be allocated. + */ +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len); + +struct ib_fast_reg_page_list { + struct ib_device *device; + u64 *page_list; + unsigned int max_page_list_len; +}; + +/** + * ib_alloc_fast_reg_page_list - Allocates a page list array to be used + * in an IB_WR_FAST_REG_MR work request. The resources allocated by this method + * allow for dev-specific optimization of the FAST_REG operation. + * @device - ib device pointer. + * @page_list_len - depth of the page list array to be allocated. + */ +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int page_list_len); + +/** + * ib_free_fast_reg_page_list - Deallocates a previously allocated + * page list array. + * @page_list - struct ib_fast_reg_page_list pointer to be deallocated. + */ +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list); + +/** * ib_alloc_mw - Allocates a memory window. + * @pd: The protection domain associated with the memory window. */ From swise at opengridcomputing.com Fri May 16 15:34:22 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:34:22 -0500 Subject: [ofa-general] [PATCH RFC v3 2/2] RDMA/cxgb3: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223256.27221.34568.stgit@dell3.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> Message-ID: <20080516223422.27221.23807.stgit@dell3.ogc.int> - set IB_DEVICE_MEM_MGT_EXTENSIONS capability bit. - set max_fast_reg_page_list_len device attribute. - add iwch_alloc_fast_reg_mr function. - add iwch_alloc_fastreg_pbl - add iwch_free_fastreg_pbl - adjust the WQ depth for kernel mode work queues to account for fastreg possibly taking 2 WR slots. - add fastreg_mr work request support. - add invalidate_mr work request support. - add send_with_inv and send_with_se_inv work request support.
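A brief aside on the "2 WR slots" point above (illustrative arithmetic, not from the patch): the driver caps a WQE at 15 flits (8-byte words), and a fastreg WR needs 5 flits of header plus one flit per page-list entry, so only the first 10 entries fit in one slot; with T3_MAX_FASTREG_DEPTH of 24, entries 11-24 spill into a second slot. Hence the kernel-QP sizing in the patch below roughly doubles the SQ contribution, e.g. (simplified, values assumed):

	sqsize = roundup_pow_of_two(128);		/* max_send_wr = 128 */
	rqsize = roundup_pow_of_two(100);		/* -> 128 */
	wqsize = roundup_pow_of_two(rqsize + sqsize);	/* 256 */
	if (wqsize < rqsize + 2 * sqsize)		/* 256 < 384 */
		wqsize = roundup_pow_of_two(rqsize + 2 * sqsize);	/* -> 512 */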
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 13 ++- drivers/infiniband/hw/cxgb3/cxio_hal.h | 1 drivers/infiniband/hw/cxgb3/cxio_wr.h | 50 ++++++++++- drivers/infiniband/hw/cxgb3/iwch_provider.c | 77 ++++++++++++++++- drivers/infiniband/hw/cxgb3/iwch_qp.c | 123 +++++++++++++++++++-------- 5 files changed, 214 insertions(+), 50 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index 3f441fc..6315c77 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -145,7 +145,7 @@ static int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid) } wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); memset(wqe, 0, sizeof(*wqe)); - build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 0, qpid, 7); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 0, qpid, 7, 3); wqe->flags = cpu_to_be32(MODQP_WRITE_EC); sge_cmd = qpid << 8 | 3; wqe->sge_cmd = cpu_to_be64(sge_cmd); @@ -558,7 +558,7 @@ static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p) wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); memset(wqe, 0, sizeof(*wqe)); build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 0, - T3_CTL_QP_TID, 7); + T3_CTL_QP_TID, 7, 3); wqe->flags = cpu_to_be32(MODQP_WRITE_EC); sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3; wqe->sge_cmd = cpu_to_be64(sge_cmd); @@ -674,7 +674,7 @@ static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag, Q_GENBIT(rdev_p->ctrl_qp.wptr, T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID, - wr_len); + wr_len, 3); if (flag == T3_COMPLETION_FLAG) ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID); len -= 96; @@ -816,6 +816,13 @@ int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag) 0, 0); } +int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid) +{ + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, + 0, 0, 0ULL, 0, 0, 0, 0); +} + int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) { struct t3_rdma_init_wr *wqe; diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h index 6e128f6..e7659f6 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.h +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h @@ -165,6 +165,7 @@ int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, u32 pbl_addr); int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid); +int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid); int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag); int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr); void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb); diff --git a/drivers/infiniband/hw/cxgb3/cxio_wr.h b/drivers/infiniband/hw/cxgb3/cxio_wr.h index f1a25a8..bc8f49b 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_wr.h +++ b/drivers/infiniband/hw/cxgb3/cxio_wr.h @@ -72,7 +72,8 @@ enum t3_wr_opcode { T3_WR_BIND = FW_WROPCODE_RI_BIND_MW, T3_WR_RCV = FW_WROPCODE_RI_RECEIVE, T3_WR_INIT = FW_WROPCODE_RI_RDMA_INIT, - T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP + T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP, + T3_WR_FASTREG = FW_WROPCODE_RI_FASTREGISTER_MR } __attribute__ ((packed)); enum t3_rdma_opcode { @@ -89,7 +90,8 @@ enum t3_rdma_opcode { T3_FAST_REGISTER, T3_LOCAL_INV, T3_QP_MOD, - T3_BYPASS + T3_BYPASS, + 
T3_RDMA_READ_REQ_WITH_INV, } __attribute__ ((packed)); static inline enum t3_rdma_opcode wr2opcode(enum t3_wr_opcode wrop) @@ -170,11 +172,45 @@ struct t3_send_wr { struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ }; +#define T3_MAX_FASTREG_DEPTH 24 + +struct t3_fastreg_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 stag; /* 2 */ + __be32 len; + __be32 va_base_hi; /* 3 */ + __be32 va_base_lo_fbo; + __be32 page_type_perms; /* 4 */ + __be32 reserved; + __be64 pbl_addrs[0]; /* 5+ */ +}; + +#define S_FR_PAGE_COUNT 31 +#define M_FR_PAGE_COUNT 0xff +#define V_FR_PAGE_COUNT(x) ((x) << S_FR_PAGE_COUNT) +#define G_FR_PAGE_COUNT(x) ((((x) >> S_FR_PAGE_COUNT)) & M_FR_PAGE_COUNT) + +#define S_FR_PAGE_SIZE 23 +#define M_FR_PAGE_SIZE 0x7 +#define V_FR_PAGE_SIZE(x) ((x) << S_FR_PAGE_SIZE) +#define G_FR_PAGE_SIZE(x) ((((x) >> S_FR_PAGE_SIZE)) & M_FR_PAGE_SIZE) + +#define S_FR_TYPE 20 +#define M_FR_TYPE 0x1 +#define V_FR_TYPE(x) ((x) << S_FR_TYPE) +#define G_FR_TYPE(x) ((((x) >> S_FR_TYPE)) & M_FR_TYPE) + +#define S_FR_PERMS 20 +#define M_FR_PERMS 0x1f +#define V_FR_PERMS(x) ((x) << S_FR_PERMS) +#define G_FR_PERMS(x) ((((x) >> S_FR_PERMS)) & M_FR_PERMS) + struct t3_local_inv_wr { struct fw_riwrh wrh; /* 0 */ union t3_wrid wrid; /* 1 */ __be32 stag; /* 2 */ - __be32 reserved3; + __be32 reserved; }; struct t3_rdma_write_wr { @@ -210,7 +246,8 @@ enum t3_mem_perms { T3_MEM_ACCESS_LOCAL_READ = 0x1, T3_MEM_ACCESS_LOCAL_WRITE = 0x2, T3_MEM_ACCESS_REM_READ = 0x4, - T3_MEM_ACCESS_REM_WRITE = 0x8 + T3_MEM_ACCESS_REM_WRITE = 0x8, + T3_MEM_ACCESS_MW_BIND = 0x10 } __attribute__ ((packed)); struct t3_bind_mw_wr { @@ -346,6 +383,7 @@ union t3_wr { struct t3_rdma_write_wr write; struct t3_rdma_read_wr read; struct t3_receive_wr recv; + struct t3_fastreg_wr fastreg; struct t3_local_inv_wr local_inv; struct t3_bind_mw_wr bind; struct t3_bypass_wr bypass; @@ -368,10 +406,10 @@ static inline enum t3_wr_opcode fw_riwrh_opcode(struct fw_riwrh *wqe) static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op, enum t3_wr_flags flags, u8 genbit, u32 tid, - u8 len) + u8 len, u8 sopeop) { wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) | - V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) | + V_FW_RIWR_SOPEOP(sopeop) | V_FW_RIWR_FLAGS(flags)); wmb(); wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) | diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 8934178..cf51800 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -768,6 +768,64 @@ static int iwch_dealloc_mw(struct ib_mw *mw) return 0; } +static struct ib_mr *iwch_alloc_fast_reg_mr(struct ib_pd *pd, int pbl_depth) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + u32 mmid; + u32 stag = 0; + int ret; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + ret = iwch_alloc_pbl(mhp, pbl_depth); + if (ret) { + kfree(mhp); + return ERR_PTR(ret); + } + mhp->attr.pbl_size = pbl_depth; + ret = cxio_allocate_stag(&rhp->rdev, &stag, php->pdid); + if (ret) { + iwch_free_pbl(mhp); + kfree(mhp); + return ERR_PTR(ret); + } + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.type = TPT_NON_SHARED_MR; + mhp->attr.stag = stag; + mmid = (stag) >> 8; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __func__, mmid, mhp, stag); + return &(mhp->ibmr); +} + +static struct ib_fast_reg_page_list *iwch_alloc_fastreg_pbl( 
+ struct ib_device *device, + int page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + page_list = kmalloc(sizeof *page_list + page_list_len * sizeof(u64), + GFP_KERNEL); + if (!page_list) + return ERR_PTR(-ENOMEM); + + page_list->page_list = (u64 *)(page_list + 1); + + return page_list; +} + +static void iwch_free_fastreg_pbl(struct ib_fast_reg_page_list *page_list) +{ + kfree(page_list); +} + static int iwch_destroy_qp(struct ib_qp *ib_qp) { struct iwch_dev *rhp; @@ -843,6 +901,15 @@ static struct ib_qp *iwch_create_qp(struct ib_pd *pd, */ sqsize = roundup_pow_of_two(attrs->cap.max_send_wr); wqsize = roundup_pow_of_two(rqsize + sqsize); + + /* + * Kernel users need more wq space for fastreg WRs which can take + * 2 WR fragments. + */ + ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; + if (!ucontext && wqsize < (rqsize + (2 * sqsize))) + wqsize = roundup_pow_of_two(rqsize + + roundup_pow_of_two(attrs->cap.max_send_wr * 2)); PDBG("%s wqsize %d sqsize %d rqsize %d\n", __func__, wqsize, sqsize, rqsize); qhp = kzalloc(sizeof(*qhp), GFP_KERNEL); @@ -851,7 +918,6 @@ static struct ib_qp *iwch_create_qp(struct ib_pd *pd, qhp->wq.size_log2 = ilog2(wqsize); qhp->wq.rq_size_log2 = ilog2(rqsize); qhp->wq.sq_size_log2 = ilog2(sqsize); - ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq, ucontext ? &ucontext->uctx : &rhp->rdev.uctx)) { kfree(qhp); @@ -1048,6 +1114,7 @@ static int iwch_query_device(struct ib_device *ibdev, props->max_mr = dev->attr.max_mem_regs; props->max_pd = dev->attr.max_pds; props->local_ca_ack_delay = 0; + props->max_fast_reg_page_list_len = T3_MAX_FASTREG_DEPTH; return 0; } @@ -1145,8 +1212,9 @@ int iwch_register_device(struct iwch_dev *dev) memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); dev->ibdev.owner = THIS_MODULE; - dev->device_cap_flags = - (IB_DEVICE_ZERO_STAG | IB_DEVICE_MEM_WINDOW); + dev->device_cap_flags = IB_DEVICE_ZERO_STAG | + IB_DEVICE_MEM_WINDOW | + IB_DEVICE_MEM_MGT_EXTENSIONS; dev->ibdev.uverbs_cmd_mask = (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | @@ -1198,6 +1266,9 @@ int iwch_register_device(struct iwch_dev *dev) dev->ibdev.alloc_mw = iwch_alloc_mw; dev->ibdev.bind_mw = iwch_bind_mw; dev->ibdev.dealloc_mw = iwch_dealloc_mw; + dev->ibdev.alloc_fast_reg_mr = iwch_alloc_fast_reg_mr; + dev->ibdev.alloc_fast_reg_page_list = iwch_alloc_fastreg_pbl; + dev->ibdev.free_fast_reg_page_list = iwch_free_fastreg_pbl; dev->ibdev.attach_mcast = iwch_multicast_attach; dev->ibdev.detach_mcast = iwch_multicast_detach; diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 79dbe5b..9c0cc7e 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -44,54 +44,39 @@ static int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr, switch (wr->opcode) { case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: if (wr->send_flags & IB_SEND_SOLICITED) wqe->send.rdmaop = T3_SEND_WITH_SE; else wqe->send.rdmaop = T3_SEND; wqe->send.rem_stag = 0; break; -#if 0 /* Not currently supported */ - case TYPE_SEND_INVALIDATE: - case TYPE_SEND_INVALIDATE_IMMEDIATE: - wqe->send.rdmaop = T3_SEND_WITH_INV; - wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); - break; - case TYPE_SEND_SE_INVALIDATE: - wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + case IB_WR_SEND_WITH_INV: + if (wr->send_flags & IB_SEND_SOLICITED) + wqe->send.rdmaop = 
T3_SEND_WITH_SE_INV; + else + wqe->send.rdmaop = T3_SEND_WITH_INV; wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); break; -#endif default: - break; + return -EINVAL; } if (wr->num_sge > T3_MAX_SGE) return -EINVAL; wqe->send.reserved[0] = 0; wqe->send.reserved[1] = 0; wqe->send.reserved[2] = 0; - if (wr->opcode == IB_WR_SEND_WITH_IMM) { - plen = 4; - wqe->send.sgl[0].stag = wr->ex.imm_data; - wqe->send.sgl[0].len = __constant_cpu_to_be32(0); - wqe->send.num_sgle = __constant_cpu_to_be32(0); - *flit_cnt = 5; - } else { - plen = 0; - for (i = 0; i < wr->num_sge; i++) { - if ((plen + wr->sg_list[i].length) < plen) { - return -EMSGSIZE; - } - plen += wr->sg_list[i].length; - wqe->send.sgl[i].stag = - cpu_to_be32(wr->sg_list[i].lkey); - wqe->send.sgl[i].len = - cpu_to_be32(wr->sg_list[i].length); - wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); + plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((plen + wr->sg_list[i].length) < plen) { + return -EMSGSIZE; } - wqe->send.num_sgle = cpu_to_be32(wr->num_sge); - *flit_cnt = 4 + ((wr->num_sge) << 1); + plen += wr->sg_list[i].length; + wqe->send.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey); + wqe->send.sgl[i].len = cpu_to_be32(wr->sg_list[i].length); + wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); } + wqe->send.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 4 + ((wr->num_sge) << 1); wqe->send.plen = cpu_to_be32(plen); return 0; } @@ -155,6 +140,56 @@ static int iwch_build_rdma_read(union t3_wr *wqe, struct ib_send_wr *wr, return 0; } +static int iwch_build_fastreg(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt, int *wr_cnt, struct t3_wq *wq) +{ + int i; + u64 *p; + + if (wr->wr.fast_reg.page_list_len > T3_MAX_FASTREG_DEPTH) + return -EINVAL; + *wr_cnt = 1; + wqe->fastreg.stag = cpu_to_be32(wr->wr.fast_reg.mr->rkey); + wqe->fastreg.len = cpu_to_be32(wr->wr.fast_reg.length); + wqe->fastreg.va_base_hi = cpu_to_be32(wr->wr.fast_reg.iova_start>>32); + wqe->fastreg.va_base_lo_fbo = + cpu_to_be32(wr->wr.fast_reg.iova_start&0xffffffff); + wqe->fastreg.page_type_perms = cpu_to_be32( + V_FR_PAGE_COUNT(wr->wr.fast_reg.page_list_len) | + V_FR_PAGE_SIZE(ilog2(wr->wr.fast_reg.page_size)-12) | + V_FR_TYPE(T3_VA_BASED_TO) | + V_FR_PERMS(iwch_ib_to_mwbind_access(wr->wr.fast_reg.access_flags))); + p = &wqe->fastreg.pbl_addrs[0]; + for (i = 0; i < wr->wr.fast_reg.page_list_len; i++, p++) { + + /* If we need a 2nd WR, then set it up */ + if (i == 10) { + *wr_cnt = 2; + wqe = (union t3_wr *)(wq->queue + + Q_PTR2IDX((wq->wptr+1), wq->size_log2)); + build_fw_riwrh((void *)wqe, T3_WR_FASTREG, 0, + Q_GENBIT(wq->wptr, wq->size_log2), + 0, 1 + wr->wr.fast_reg.page_list_len - 10, 1); + + p = &wqe->flit[1]; + } + *p = cpu_to_be64((u64)wr->wr.fast_reg.page_list->page_list[i]); + } + *flit_cnt = 5 + wr->wr.fast_reg.page_list_len; + if (*flit_cnt > 15) + *flit_cnt = 15; + return 0; +} + +static int iwch_build_inv_stag(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt) +{ + wqe->local_inv.stag = cpu_to_be32(wr->wr.local_inv.mr->rkey); + wqe->local_inv.reserved = 0; + *flit_cnt = sizeof(struct t3_local_inv_wr) >> 3; + return 0; +} + /* * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now. 
*/ @@ -238,6 +273,7 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, u32 num_wrs; unsigned long flag; struct t3_swsq *sqp; + int wr_cnt = 1; qhp = to_iwch_qp(ibqp); spin_lock_irqsave(&qhp->lock, flag); @@ -262,15 +298,15 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, t3_wr_flags = 0; if (wr->send_flags & IB_SEND_SOLICITED) t3_wr_flags |= T3_SOLICITED_EVENT_FLAG; - if (wr->send_flags & IB_SEND_FENCE) - t3_wr_flags |= T3_READ_FENCE_FLAG; if (wr->send_flags & IB_SEND_SIGNALED) t3_wr_flags |= T3_COMPLETION_FLAG; sqp = qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); switch (wr->opcode) { case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: + case IB_WR_SEND_WITH_INV: + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_READ_FENCE_FLAG; t3_wr_opcode = T3_WR_SEND; err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt); break; @@ -289,6 +325,17 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, if (!qhp->wq.oldest_read) qhp->wq.oldest_read = sqp; break; + case IB_WR_FAST_REG_MR: + t3_wr_opcode = T3_WR_FASTREG; + err = iwch_build_fastreg(wqe, wr, &t3_wr_flit_cnt, + &wr_cnt, &qhp->wq); + break; + case IB_WR_INVALIDATE_MR: + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_LOCAL_FENCE_FLAG; + t3_wr_opcode = T3_WR_INV_STAG; + err = iwch_build_inv_stag(wqe, wr, &t3_wr_flit_cnt); + break; default: PDBG("%s post of type=%d TBD!\n", __func__, wr->opcode); @@ -307,14 +354,14 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags, Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), - 0, t3_wr_flit_cnt); + 0, t3_wr_flit_cnt, (wr_cnt == 1) ? 3 : 2); PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n", __func__, (unsigned long long) wr->wr_id, idx, Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2), sqp->opcode); wr = wr->next; num_wrs--; - ++(qhp->wq.wptr); + qhp->wq.wptr += wr_cnt; ++(qhp->wq.sq_wptr); } spin_unlock_irqrestore(&qhp->lock, flag); @@ -359,7 +406,7 @@ int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, wr->wr_id; build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG, Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), - 0, sizeof(struct t3_receive_wr) >> 3); + 0, sizeof(struct t3_receive_wr) >> 3, 3); PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rw_rptr 0x%x " "wqe %p \n", __func__, (unsigned long long) wr->wr_id, idx, qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe); @@ -444,7 +491,7 @@ int iwch_bind_mw(struct ib_qp *qp, wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id; build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags, Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, - sizeof(struct t3_bind_mw_wr) >> 3); + sizeof(struct t3_bind_mw_wr) >> 3, 3); ++(qhp->wq.wptr); ++(qhp->wq.sq_wptr); spin_unlock_irqrestore(&qhp->lock, flag); From rdreier at cisco.com Fri May 16 15:48:36 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 15:48:36 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: (Thomas Talpey's message of "Wed, 14 May 2008 23:32:21 -0400") References: Message-ID: > We've been hit by this twice this week on two NFS/RDMA servers, so I'm > glad to see this! But, for us it happens with memless ConnectX - our mthca > devices are ok (but OTOH they're memfull not memfree) OK, I see a problem with mlx4 -- it may spuriously return failure when you try to create a QP with max_send_sge == 32, but only for kernel QPs. Which is why my userspace test didn't catch it. - R. 
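To put rough numbers on this (the overhead figure is an assumption; see send_wqe_overhead() in mlx4/qp.c for the exact value, and Roland's follow-up below for the failing code path):

	/* Kernel RC QP asking for max_send_sge = 32, with WQE shrinking
	 * unavailable because of selective signaling (the NFS/RDMA case): */
	s = 32 * sizeof(struct mlx4_wqe_data_seg)	/* 32 * 16 = 512 */
	    + 48;					/* ~send_wqe_overhead() */
	wqe_shift = ilog2(roundup_pow_of_two(s));	/* ilog2(1024) = 10 */
	/* 1 << 10 = 1024 > max_sq_desc_sz (1008), so the old check returns
	 * -EINVAL even though a 1008-byte descriptor would actually fit. */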
From rdreier at cisco.com Fri May 16 16:12:51 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 16:12:51 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: (Roland Dreier's message of "Fri, 16 May 2008 15:48:36 -0700") References: Message-ID: > OK, I see a problem with mlx4 -- it may spuriously return failure when > you try to create a QP with max_send_sge == 32, but only for kernel > QPs. Which is why my userspace test didn't catch it. The problem is this code in set_kernel_sq_size: if (dev->dev->caps.fw_ver >= MLX4_FW_VER_WQE_CTRL_NEC && qp->sq_signal_bits && BITS_PER_LONG == 64 && type != IB_QPT_SMI && type != IB_QPT_GSI) qp->sq.wqe_shift = ilog2(64); else qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); for (;;) { if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) return -EINVAL; if we can't use the "WQE shrinking" feature (because of selective signaling in the NFS/RDMA case), and we want to use 32 sge entries, then the WQE size 's' will end up a little more than 512 bytes, and the wqe_shift will end up as 10. But since the max_sq_desc_sz is 1008, we return -EINVAL, when it is really fine to have a wqe_shift of 10 as long as we don't use more than 1008 bytes per descriptor (I think). So something like this is probably the fix (it suffices to make NFS/RDMA mount work with ConnectX on both sides): diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index cec030e..b6612a0 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -372,7 +372,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); for (;;) { - if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) + if (qp->sq.wqe_shift > + ilog2(roundup_pow_of_two(dev->dev->caps.max_sq_desc_sz))) return -EINVAL; qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1U << qp->sq.wqe_shift); @@ -395,7 +396,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, ++qp->sq.wqe_shift; } - qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + qp->sq.max_gs = (min(dev->dev->caps.max_sq_desc_sz, + (qp->sq_max_wqes_per_wr << qp->sq.wqe_shift)) - send_wqe_overhead(type, qp->flags)) / sizeof (struct mlx4_wqe_data_seg); From rdreier at cisco.com Fri May 16 16:24:37 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 May 2008 16:24:37 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: (Roland Dreier's message of "Fri, 16 May 2008 16:12:51 -0700") References: Message-ID: Or maybe something like this is better: diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index cec030e..907eb34 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -333,6 +333,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + send_wqe_overhead(type, qp->flags); + if (s > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + /* * Hermon supports shrinking WQEs, such that a single work * request can include multiple units of 1 << wqe_shift. 
This @@ -372,9 +375,6 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); for (;;) { - if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) - return -EINVAL; - qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1U << qp->sq.wqe_shift); /* @@ -395,7 +395,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, ++qp->sq.wqe_shift; } - qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + qp->sq.max_gs = (min(dev->dev->caps.max_sq_desc_sz, + (qp->sq_max_wqes_per_wr << qp->sq.wqe_shift)) - send_wqe_overhead(type, qp->flags)) / sizeof (struct mlx4_wqe_data_seg); From clameter at sgi.com Fri May 16 18:38:28 2008 From: clameter at sgi.com (Christoph Lameter) Date: Fri, 16 May 2008 18:38:28 -0700 (PDT) Subject: [ofa-general] mm notifier: Notifications when pages are unmapped. In-Reply-To: References: <6b384bb988786aa78ef0.1210170958@duo.random> <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> Message-ID: Implementation of what Linus suggested: Defer the XPMEM processing until after the locks are dropped. Allow immediate action by GRU/KVM. This patch implements callbacks for device drivers that establish external references to pages aside from the Linux rmaps. Those either: 1. Do not take a refcount on pages that are mapped from devices. They have TLB-cache-like handling and must be able to flush external references from atomic contexts. These devices do not need to provide the _sync methods. 2. Do take a refcount on pages mapped externally. These are handled by marking pages to be invalidated in atomic contexts. Invalidation may be started by the driver. A _sync variant for the individual or range unmap is called when we are back in a nonatomic context. At that point the device must complete the removal of external references and drop its refcount. With the mm notifier it is possible for the device driver to release external references after the page references are removed from a process that made them available. With the notifier it becomes possible to get pages unpinned on request and thus avoid issues that come with having a large number of pinned pages. A device driver must subscribe to a process using mm_notifier_register(struct mm_notifier *, struct mm_struct *) The VM will then perform callbacks for operations that unmap or change permissions of pages in that address space. When the process terminates, the ->release method is first called to remove all pages still mapped to the process. Before the mm_struct is freed the ->destroy() method is called which should dispose of the mm_notifier structure. The following callbacks exist: invalidate_range(notifier, mm_struct *, from, to) Invalidate a range of addresses. The invalidation is not required to complete immediately. invalidate_range_sync(notifier, mm_struct *, from, to) This is called after some invalidate_range callouts. The driver may only return when the invalidation of the references is completed. The callback is only called from non-atomic contexts. There is no need to provide this callback if the driver can remove references in an atomic context. invalidate_page(notifier, mm_struct *, struct page *page, unsigned long address) Invalidate references to a particular page. The driver may defer the invalidation. 
invalidate_page_sync(notifier, mm_struct *, struct page *) Called after one or more invalidate_page() callbacks. The callback must only return when the external references have been removed. The callback does not need to be provided if the driver can remove references in atomic contexts. [NOTE] The invalidate_page_sync() callback is weird because it is called for every notifier that supports the invalidate_page_sync() callback if a page has PageNotifier() set. The driver must determine in an efficient way that the page is not of interest. This is because we do not have the mm context after we have dropped the rmap list lock. Drivers incrementing the refcount must set and clear PageNotifier appropriately when establishing and/or dropping a refcount! [These conditions are similar to the rmap notifier that was introduced in my V7 of the mmu_notifier]. There is no support for an aging callback. A device driver may simply set the reference bit on the Linux pte when the external mapping is referenced if such support is desired. The patch is provisional. All functions are inlined for now. They should be wrapped like in Andrea's series. It's probably good to have Andrea review this if we actually decide to go this route since he is pretty good at detecting issues with complex lock interactions in the vm. mmu notifiers V7 was rejected by Andrew because of the strange asymmetry in invalidate_page_sync() (at that time called rmap notifier) and we are reintroducing that now in a lightweight form in order to be able to defer freeing until after the rmap spinlocks have been dropped. Jack tested this with the GRU. Signed-off-by: Christoph Lameter --- fs/hugetlbfs/inode.c | 2 include/linux/mm_types.h | 3 include/linux/page-flags.h | 3 include/linux/rmap.h | 161 +++++++++++++++++++++++++++++++++++++++++++++ kernel/fork.c | 4 + mm/Kconfig | 4 + mm/filemap_xip.c | 2 mm/fremap.c | 2 mm/hugetlb.c | 3 mm/memory.c | 38 ++++++++-- mm/mmap.c | 3 mm/mprotect.c | 3 mm/mremap.c | 5 + mm/rmap.c | 11 ++- 14 files changed, 234 insertions(+), 10 deletions(-) Index: linux-2.6/kernel/fork.c =================================================================== --- linux-2.6.orig/kernel/fork.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/kernel/fork.c 2008-05-16 16:06:26.000000000 -0700 @@ -386,6 +386,9 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; +#ifdef CONFIG_MM_NOTIFIER + mm->mm_notifier = NULL; +#endif return mm; } @@ -418,6 +421,7 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mm_notifier_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); Index: linux-2.6/mm/filemap_xip.c =================================================================== --- linux-2.6.orig/mm/filemap_xip.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/filemap_xip.c 2008-05-16 16:06:26.000000000 -0700 @@ -189,6 +189,7 @@ __xip_unmap (struct address_space * mapp /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); @@ -197,6 +198,7 @@ __xip_unmap (struct address_space * mapp } } spin_unlock(&mapping->i_mmap_lock); + mm_notifier_invalidate_page_sync(page); } /* Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/fremap.c 2008-05-16 16:06:26.000000000 -0700 @@ -214,7 +214,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mm_notifier_invalidate_range(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mm_notifier_invalidate_range_sync(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); Index: linux-2.6/mm/hugetlb.c =================================================================== --- linux-2.6.orig/mm/hugetlb.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/hugetlb.c 2008-05-16 17:50:31.000000000 -0700 @@ -14,6 +14,7 @@ #include #include #include +#include <linux/rmap.h> #include #include @@ -843,6 +844,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mm_notifier_invalidate_range(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); @@ -864,6 +866,7 @@ void unmap_hugepage_range(struct vm_area spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); __unmap_hugepage_range(vma, start, end); spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + mm_notifier_invalidate_range_sync(vma->vm_mm, start, end); } }
+ */ + if (is_cow_mapping(vma->vm_flags)) + mm_notifier_invalidate_range_sync(src_mm, vma->vm_start, end); + + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -913,6 +928,7 @@ unsigned long unmap_vmas(struct mmu_gath } tlb_finish_mmu(*tlbp, tlb_start, start); + mm_notifier_invalidate_range(vma->vm_mm, tlb_start, start); if (need_resched() || (i_mmap_lock && spin_needbreak(i_mmap_lock))) { @@ -951,8 +967,10 @@ unsigned long zap_page_range(struct vm_a tlb = tlb_gather_mmu(mm, 0); update_hiwater_rss(mm); end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) + if (tlb) { tlb_finish_mmu(tlb, address, end); + mm_notifier_invalidate_range(mm, address, end); + } return end; } @@ -1711,7 +1729,6 @@ static int do_wp_page(struct mm_struct * */ page_table = pte_offset_map_lock(mm, pmd, address, &ptl); - page_cache_release(old_page); if (!pte_same(*page_table, orig_pte)) goto unlock; @@ -1729,6 +1746,7 @@ static int do_wp_page(struct mm_struct * if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; + old_page = NULL; goto unlock; } @@ -1774,6 +1792,7 @@ gotten: * thread doing COW. */ ptep_clear_flush(vma, address, page_table); + mm_notifier_invalidate_page(mm, old_page, address); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); @@ -1787,10 +1806,13 @@ gotten: if (new_page) page_cache_release(new_page); - if (old_page) - page_cache_release(old_page); unlock: pte_unmap_unlock(page_table, ptl); + if (old_page) { + mm_notifier_invalidate_page_sync(old_page); + page_cache_release(old_page); + } + if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/mmap.c 2008-05-16 16:06:26.000000000 -0700 @@ -1759,6 +1759,8 @@ static void unmap_region(struct mm_struc free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? 
next->vm_start: 0); tlb_finish_mmu(tlb, start, end); + mm_notifier_invalidate_range(mm, start, end); + mm_notifier_invalidate_range_sync(mm, start, end); } /* @@ -2048,6 +2050,7 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mm_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); Index: linux-2.6/mm/mprotect.c =================================================================== --- linux-2.6.orig/mm/mprotect.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/mprotect.c 2008-05-16 16:06:26.000000000 -0700 @@ -21,6 +21,7 @@ #include #include #include +#include <linux/rmap.h> #include #include #include @@ -132,6 +133,7 @@ static void change_protection(struct vm_ change_pud_range(mm, pgd, addr, next, newprot, dirty_accountable); } while (pgd++, addr = next, addr != end); flush_tlb_range(vma, start, end); + mm_notifier_invalidate_range(vma->vm_mm, start, end); } int @@ -211,6 +213,7 @@ success: hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mm_notifier_invalidate_range_sync(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; Index: linux-2.6/mm/mremap.c =================================================================== --- linux-2.6.orig/mm/mremap.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/mremap.c 2008-05-16 16:06:26.000000000 -0700 @@ -18,6 +18,7 @@ #include #include #include +#include <linux/rmap.h> #include #include @@ -74,6 +75,7 @@ static void move_ptes(struct vm_area_str struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start = old_addr; if (vma->vm_file) { /* @@ -100,6 +102,7 @@ static void move_ptes(struct vm_area_str spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); arch_enter_lazy_mmu_mode(); + mm_notifier_invalidate_range(mm, old_addr, old_end); for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, new_pte++, new_addr += PAGE_SIZE) { if (pte_none(*old_pte)) @@ -116,6 +119,8 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + + mm_notifier_invalidate_range_sync(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/rmap.c 2008-05-16 16:06:26.000000000 -0700 @@ -52,6 +52,9 @@ #include +struct mm_notifier *mm_notifier_page_sync; +DECLARE_RWSEM(mm_notifier_page_sync_sem); + struct kmem_cache *anon_vma_cachep; /* This must be called under the mmap_sem. */ @@ -458,6 +461,7 @@ static int page_mkclean_one(struct page flush_cache_page(vma, address, pte_pfn(*pte)); entry = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -502,8 +506,8 @@ int page_mkclean(struct page *page) ret = 1; } } + mm_notifier_invalidate_page_sync(page); } - return ret; } EXPORT_SYMBOL_GPL(page_mkclean); @@ -725,6 +729,7 @@ static int try_to_unmap_one(struct page /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -855,6 +860,7 @@ static void try_to_unmap_cluster(unsigne /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) @@ -1013,8 +1019,9 @@ int try_to_unmap(struct page *page, int else ret = try_to_unmap_file(page, migration); + mm_notifier_invalidate_page_sync(page); + if (!page_mapped(page)) ret = SWAP_SUCCESS; return ret; } - Index: linux-2.6/include/linux/rmap.h =================================================================== --- linux-2.6.orig/include/linux/rmap.h 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/include/linux/rmap.h 2008-05-16 18:32:52.000000000 -0700 @@ -133,4 +133,165 @@ static inline int page_mkclean(struct pa #define SWAP_AGAIN 1 #define SWAP_FAIL 2 +#ifdef CONFIG_MM_NOTIFIER + +struct mm_notifier_ops { + void (*invalidate_range)(struct mm_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_sync)(struct mm_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_page)(struct mm_notifier *mn, struct mm_struct *mm, + struct page *page, unsigned long addr); + void (*invalidate_page_sync)(struct mm_notifier *mn, struct mm_struct *mm, + struct page *page); + void (*release)(struct mm_notifier *mn, struct mm_struct *mm); + void (*destroy)(struct mm_notifier *mn, struct mm_struct *mm); +}; + +struct mm_notifier { + struct mm_notifier_ops *ops; + struct mm_struct *mm; + struct mm_notifier *next; + struct mm_notifier *next_page_sync; +}; + +extern struct mm_notifier *mm_notifier_page_sync; +extern struct rw_semaphore mm_notifier_page_sync_sem; + +/* + * Must hold mmap_sem when calling mm_notifier_register. + */ +static inline void mm_notifier_register(struct mm_notifier *mn, + struct mm_struct *mm) +{ + mn->mm = mm; + mn->next = mm->mm_notifier; + rcu_assign_pointer(mm->mm_notifier, mn); + if (mn->ops->invalidate_page_sync) { + down_write(&mm_notifier_page_sync_sem); + mn->next_page_sync = mm_notifier_page_sync; + mm_notifier_page_sync = mn; + up_write(&mm_notifier_page_sync_sem); + } +} + +/* + * Invalidate remote references in a particular address range + */ +static inline void mm_notifier_invalidate_range(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->invalidate_range(mn, mm, start, end); +} + +/* + * Invalidate remote references in a particular address range. + * Can sleep. Only return if all remote references have been removed. 
+ */ +static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + if (mn->ops->invalidate_range_sync) + mn->ops->invalidate_range_sync(mn, mm, start, end); +} + +/* + * Invalidate remote references to a page + */ +static inline void mm_notifier_invalidate_page(struct mm_struct *mm, + struct page *page, unsigned long addr) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->invalidate_page(mn, mm, page, addr); +} + +/* + * Invalidate remote references to a particular page. Only return + * if all references have been removed. + * + * Note: This is an expensive function since it is not clear at the time + * of call to which mm_struct the page belongs. It walks through the + * mmlist and calls the mmu notifier ops for each address space in the + * system. At some point this needs to be optimized. + */ +static inline void mm_notifier_invalidate_page_sync(struct page *page) +{ + struct mm_notifier *mn; + + if (!PageNotifier(page)) + return; + + down_read(&mm_notifier_page_sync_sem); + + for (mn = mm_notifier_page_sync; mn; mn = mn->next_page_sync) + if (mn->ops->invalidate_page_sync) + mn->ops->invalidate_page_sync(mn, mn->mm, page); + + up_read(&mm_notifier_page_sync_sem); +} + +/* + * Invalidate all remote references before shutdown + */ +static inline void mm_notifier_release(struct mm_struct *mm) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->release(mn, mm); +} + +/* + * Release resources before freeing mm_struct. + */ +static inline void mm_notifier_destroy(struct mm_struct *mm) +{ + struct mm_notifier *mn; + + while (mm->mm_notifier) { + mn = mm->mm_notifier; + mm->mm_notifier = mn->next; + if (mn->ops->invalidate_page_sync) { + struct mm_notifier *m; + + down_write(&mm_notifier_page_sync_sem); + + if (mm_notifier_page_sync != mn) { + for (m = mm_notifier_page_sync; m; m = m->next_page_sync) + if (m->next_page_sync == mn) + break; + + m->next_page_sync = mn->next_page_sync; + } else + mm_notifier_page_sync = mn->next_page_sync; + + up_write(&mm_notifier_page_sync_sem); + } + mn->ops->destroy(mn, mm); + } +} +#else +static inline void mm_notifier_invalidate_range(struct mm_struct *mm, + unsigned long start, unsigned long end) {} +static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm, + unsigned long start, unsigned long end) {} +static inline void mm_notifier_invalidate_page(struct mm_struct *mm, + struct page *page, unsigned long address) {} +static inline void mm_notifier_invalidate_page_sync(struct page *page) {} +static inline void mm_notifier_release(struct mm_struct *mm) {} +static inline void mm_notifier_destroy(struct mm_struct *mm) {} +#endif + #endif /* _LINUX_RMAP_H */ Index: linux-2.6/mm/Kconfig =================================================================== --- linux-2.6.orig/mm/Kconfig 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/Kconfig 2008-05-16 16:06:26.000000000 -0700 @@ -205,3 +205,7 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MM_NOTIFIER + def_bool y + Index: linux-2.6/include/linux/mm_types.h =================================================================== --- linux-2.6.orig/include/linux/mm_types.h 2008-05-16 11:28:49.000000000 -0700 +++ 
linux-2.6/include/linux/mm_types.h 2008-05-16 16:06:26.000000000 -0700 @@ -244,6 +244,9 @@ struct mm_struct { struct file *exe_file; unsigned long num_exe_file_vmas; #endif +#ifdef CONFIG_MM_NOTIFIER + struct mm_notifier *mm_notifier; +#endif }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/include/linux/page-flags.h 2008-05-16 16:06:26.000000000 -0700 @@ -93,6 +93,7 @@ enum pageflags { PG_mappedtodisk, /* Has blocks allocated on-disk */ PG_reclaim, /* To be reclaimed asap */ PG_buddy, /* Page is free, on buddy lists */ + PG_notifier, /* Call notifier when page is changed/unmapped */ #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR PG_uncached, /* Page has been mapped as uncached */ #endif @@ -173,6 +174,8 @@ PAGEFLAG(MappedToDisk, mappedtodisk) PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim) PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */ +PAGEFLAG(Notifier, notifier); + #ifdef CONFIG_HIGHMEM /* * Must use a macro here due to header dependency issues. page_zone() is not Index: linux-2.6/fs/hugetlbfs/inode.c =================================================================== --- linux-2.6.orig/fs/hugetlbfs/inode.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/fs/hugetlbfs/inode.c 2008-05-16 16:06:55.000000000 -0700 @@ -442,6 +442,8 @@ hugetlb_vmtruncate_list(struct prio_tree __unmap_hugepage_range(vma, vma->vm_start + v_offset, vma->vm_end); + mm_notifier_invalidate_range_sync(vma->vm_mm, + vma->vm_start + v_offset, vma->vm_end); } } From sashak at voltaire.com Sat May 17 08:17:50 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 17 May 2008 18:17:50 +0300 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> <20080515101418.4ccb53f3.weiny2@llnl.gov> Message-ID: <20080517151750.GA30185@sashak.voltaire.com> On 11:37 Thu 15 May , Chris Worley wrote: > On Thu, May 15, 2008 at 11:14 AM, Ira Weiny wrote: > > On Thu, 15 May 2008 10:26:37 -0600 > > "Chris Worley" wrote: > > >> After an sm change (i.e. using the "-r" switch), nodes can't ping each > >> other over IPoIB (other protocols also can't communicate). > > > > Is it absolutely necessary to run with the "-r" switch? Here we have no > > problems letting the SM attempt to use the same LID's for nodes. > > yes, especially when changing routing algorithms between the default > and fat-tree. As Yevgeny said it looks like an error (or at least unexpected behavior) in fat-tree code. Could you send ibnetdiscover output and "old" guid2lid file for us? 
Sasha From kliteyn at mellanox.co.il Sat May 17 08:52:45 2008 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Sat, 17 May 2008 18:52:45 +0300 Subject: [ofa-general] Re: OpenSM and fat tree In-Reply-To: <20080517151750.GA30185@sashak.voltaire.com> References: <1210422553.2026.411.camel@hrosenstock-ws.xsigo.com> <4825EEC9.4070208@dev.mellanox.co.il> <1210591145.2026.458.camel@hrosenstock-ws.xsigo.com> <20080515111914.GO24654@sashak.voltaire.com> <1210867827.12616.38.camel@hrosenstock-ws.xsigo.com> <20080515101418.4ccb53f3.weiny2@llnl.gov> <20080517151750.GA30185@sashak.voltaire.com> Message-ID: <482EFF4D.8060501@mellanox.co.il> Sasha Khapyorsky wrote: > On 11:37 Thu 15 May , Chris Worley wrote: > >> On Thu, May 15, 2008 at 11:14 AM, Ira Weiny wrote: >> >>> On Thu, 15 May 2008 10:26:37 -0600 >>> "Chris Worley" wrote: >>> >> >> >>>> After an sm change (i.e. using the "-r" switch), nodes can't ping each >>>> other over IPoIB (other protocols also can't communicate). >>>> >>> Is it absolutely necessary to run with the "-r" switch? Here we have no >>> problems letting the SM attempt to use the same LID's for nodes. >>> >> yes, especially when changing routing algorithms between the default >> and fat-tree. >> > > As Yevgeny said it looks like an error (or at least unexpected behavior) > in fat-tree code. Could you send ibnetdiscover output and "old" guid2lid > file for us? > There's also an open bug on bugzilla for this: https://bugs.openfabrics.org/show_bug.cgi?id=1031 (which also lacks the details that would help me to reproduce it). -- Yevgeny > Sasha From eli at dev.mellanox.co.il Sat May 17 11:27:44 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sat, 17 May 2008 21:27:44 +0300 Subject: [ofa-general] Folks is this a known problem / already fixed ? In-Reply-To: <482DD233.5010808@oracle.com> References: <482DD233.5010808@oracle.com> Message-ID: <1211048864.6696.7.camel@eli-laptop> On Fri, 2008-05-16 at 14:28 -0400, Richard Frank wrote: > We see the following failure for our ConnectX HCAs.. with 1.3.1 Daily > 20080512 done on vanilla OEL5U1. > > They are failing to initialize with the following: > > mlx4_core: Mellanox ConnectX core driver v1.0 (February 28, 2008) > mlx4_core: Initializing 0000:05:00.0 > mlx4_core 0000:05:00.0: Failed to initialize queue pair table, aborting. > mlx4_core 0000:05:00.0: Failed to initialize queue pair table, aborting. > mlx4_core: probe of 0000:05:00.0 failed with error -16 > > And lspci shows: > > 05:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR] (rev > a0) > Subsystem: Mellanox Technologies MT25418 [ConnectX IB DDR] > Flags: fast devsel, IRQ 169 > Memory at fcc00000 (64-bit, non-prefetchable) [disabled] [size=1M] > Memory at fff000000 (64-bit, prefetchable) [disabled] [size=8M] > Memory at fcbfe000 (64-bit, non-prefetchable) [disabled] [size=8K] > Capabilities: [40] Power Management version 3 > Capabilities: [48] Vital Product Data > Capabilities: [9c] MSI-X: Enable- Mask- TabSize=256 > Capabilities: [60] Express Endpoint IRQ 0 > Can you send the output of lspci for the bridge connecting the ConnectX with the upstream PCI bus? 
I guess the problem would be that the bridge blocks memory writes to
ConnectX's UAR area, thus causing a failure to arm the EQ and eventually
a failure to load the driver. Now it could be a failure of the kernel to
configure the bridge properly. Could you try with the latest kernel?

From sashak at voltaire.com  Sat May 17 16:13:55 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 18 May 2008 02:13:55 +0300
Subject: [ofa-general] Re: [PATCH] OpenSM: Fix rpm build, /opensm/opensm.conf failed to install
In-Reply-To: <20080515132721.37644ade.weiny2@llnl.gov>
References: <20080515132721.37644ade.weiny2@llnl.gov>
Message-ID: <20080517231355.GB30185@sashak.voltaire.com>

Hi Ira,

On 13:27 Thu 15 May , Ira Weiny wrote:
>
> I found this while trying to add the Performance Manager HOWTO to the rpm.

Right, when *.spec was generated with a different sysconfdir value than
the one used with rpmbuild there is an issue. Thanks for finding this.

> Therefore, I think this will conflict slightly with that patch.  If you like I
> can resubmit that patch after you apply this.

I don't think it is needed.

>
> Thanks,
> Ira
>
> From 8453b86e94175ff3054a57c5c50e337a96d536bd Mon Sep 17 00:00:00 2001
> From: Ira K. Weiny
> Date: Thu, 15 May 2008 13:13:16 -0700
> Subject: [PATCH] Fix rpm build, /opensm/opensm.conf failed to install
>
>
> Signed-off-by: Ira K. Weiny

Applied. Thanks.

Sasha

From sashak at voltaire.com  Sat May 17 16:23:45 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 18 May 2008 02:23:45 +0300
Subject: [ofa-general] Re: [PATCH V2] OpenSM: Add a Performance Manager HOWTO to the docs and the dist
In-Reply-To: <20080516133502.27a1e9b6.weiny2@llnl.gov>
References: <20080515132723.3add7c6a.weiny2@llnl.gov>
	<1210967529.12616.287.camel@hrosenstock-ws.xsigo.com>
	<20080516133502.27a1e9b6.weiny2@llnl.gov>
Message-ID: <20080517232345.GC30185@sashak.voltaire.com>

On 13:35 Fri 16 May , Ira Weiny wrote:
> On Fri, 16 May 2008 12:52:09 -0700
> Hal Rosenstock wrote:
>
> > On Thu, 2008-05-15 at 13:27 -0700, Ira Weiny wrote:
> > > I decided to write a little HOWTO to help people to set it up.
> >
> > Nice writeup :-)

Really good doc.

> >
> > > 5) Can be run in a standby SM
> >
> > I thought it was changed so that it can run in a standalone mode without
> > SM.  Am I confusing this with something else ?
> >
>
> I think you are right I should have said standalone.  However, can't it also
> work in a standby SM?
>
> yea, from the patch which Sasha applied:
>
>     opensm/perfmgr: PerfMgr for SM standby and inactive states
>
> Here is an updated patch with the correction.
>
> Ira
>
>
> From 9be13c3da4d34ad0a736ced4c9e3bb5e13a24bb6 Mon Sep 17 00:00:00 2001
> From: Ira K. Weiny
> Date: Thu, 15 May 2008 08:19:17 -0700
> Subject: [PATCH] Add a Performance Manager HOWTO to the docs and the dist
>
>
> Signed-off-by: Ira K. Weiny

Applied. Thanks.

Sasha

From sashak at voltaire.com  Sat May 17 16:32:02 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 18 May 2008 02:32:02 +0300
Subject: [ofa-general] Re: [PATCH] [TRIVIAL] OpenSM/doc/modular_routing.txt: Fix typo
In-Reply-To: <1210968832.12616.294.camel@hrosenstock-ws.xsigo.com>
References: <1210968832.12616.294.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080517233202.GE30185@sashak.voltaire.com>

On 13:13 Fri 16 May , Hal Rosenstock wrote:
> OpenSM/doc/modular_routing.txt: Fix typo
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.
Sasha

From sashak at voltaire.com  Sat May 17 16:33:37 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 18 May 2008 02:33:37 +0300
Subject: [ofa-general] Re: [PATCH] OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED
In-Reply-To: <1210968976.12616.298.camel@hrosenstock-ws.xsigo.com>
References: <1210968976.12616.298.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080517233337.GF30185@sashak.voltaire.com>

On 13:16 Fri 16 May , Hal Rosenstock wrote:
> OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED
>
> I'll leave the other heavy lifting to Yevgeny :-)
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From sashak at voltaire.com  Sat May 17 17:10:38 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 18 May 2008 03:10:38 +0300
Subject: [ofa-general] [PATCH] OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED
In-Reply-To: <1210972718.12616.309.camel@hrosenstock-ws.xsigo.com>
References: <1210968976.12616.298.camel@hrosenstock-ws.xsigo.com>
	<1210972718.12616.309.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080518001038.GI30185@sashak.voltaire.com>

On 14:18 Fri 16 May , Hal Rosenstock wrote:
> On Fri, 2008-05-16 at 13:16 -0700, Hal Rosenstock wrote:
> > OpenSM/doc/QoS_management_in_OpenSM.txt: Remove mention of OFED
> >
> > I'll leave the other heavy lifting to Yevgeny :-)
>
> I forgot: Please apply to master and ofed_1_3

Applied this in my branch.

Sasha

From ogerlitz at voltaire.com  Sat May 17 22:59:53 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 18 May 2008 08:59:53 +0300
Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support
In-Reply-To: <1210836027.18385.2.camel@mtls03>
References: <1210836027.18385.2.camel@mtls03>
Message-ID: <482FC5D9.7060009@voltaire.com>

Eli Cohen wrote:
> --- a/drivers/infiniband/hw/mlx4/cq.c
> +++ b/drivers/infiniband/hw/mlx4/cq.c
> @@ -637,6 +637,7 @@ repoll:
> 	case MLX4_OPCODE_SEND_IMM:
> 		wc->wc_flags |= IB_WC_WITH_IMM;
> 	case MLX4_OPCODE_SEND:
> +	case MLX4_OPCODE_SEND_INVAL:
> 		wc->opcode = IB_WC_SEND;
> 		break;
> 	case MLX4_OPCODE_RDMA_READ:
> @@ -676,6 +677,13 @@ repoll:
> 		wc->wc_flags = IB_WC_WITH_IMM;
> 		wc->imm_data = cqe->immed_rss_invalid;
> 		break;
> +	case MLX4_RECV_OPCODE_SEND_INVAL:
> +		wc->opcode = IB_WC_RECV;
> +		wc->wc_flags = IB_WC_WITH_INVALIDATE;
> +		/*
> +		 * TBD: maybe we should just call this ieth_val
> +		 */
> +		wc->imm_data = be32_to_cpu(cqe->immed_rss_invalid);

Eli,

Is it correct that "cqe->immed_rss_invalid" equals
"wr->ex.invalidate_rkey" that was provided at the sender? if yes, any
reason not to have the same/similar union (imm_data/invalidate_rkey)
also for the work completion structure?
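(For illustration only -- a sketch of the shape being suggested here,
not the committed API: one shared union that both the send WR and the
work completion could use, instead of two locally defined copies. The
union name is made up.)

    #include <linux/types.h>

    /*
     * Sketch only: a single union definition that ib_send_wr and
     * ib_wc could share.  Field names follow the wr->ex usage
     * quoted above.
     */
    union ib_ex_data {
        __be32 imm_data;        /* receive completion: IB_WC_WITH_IMM     */
        u32    invalidate_rkey; /* IB_WC_WITH_INVALIDATE / send WR rkey   */
    };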
Or

From eli at dev.mellanox.co.il  Sun May 18 02:14:58 2008
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Sun, 18 May 2008 12:14:58 +0300
Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support
In-Reply-To: <482FC5D9.7060009@voltaire.com>
References: <1210836027.18385.2.camel@mtls03>
	<482FC5D9.7060009@voltaire.com>
Message-ID: <1211102098.6963.14.camel@eli-laptop>

On Sun, 2008-05-18 at 08:59 +0300, Or Gerlitz wrote:
> Is it correct that "cqe->immed_rss_invalid" equals
> "wr->ex.invalidate_rkey" that was provided at the sender? if yes, any
> reason not to have the same/similar union (imm_data/invalidate_rkey)
> also for the work completion structure?

No reason for them to be different. Roland already suggested using a
union here, although he defines the union locally inside the containing
struct and thus has two definitions for the same union. Roland, do you
intend to commit that?

From monis at Voltaire.COM  Sun May 18 05:19:50 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Sun, 18 May 2008 15:19:50 +0300
Subject: [ofa-general] [PATCH] IB/core: handle race between elements in work queues after event
In-Reply-To: 
References: <48187E5A.7040809@Voltaire.COM> <4819BD29.7080002@Voltaire.COM>
	<4820638E.4030901@Voltaire.COM> <4827FBDF.9040308@Voltaire.COM>
	<482A979F.6040305@Voltaire.COM>
Message-ID: <48301EE6.5070902@Voltaire.COM>

Roland Dreier wrote:
> > OK. Here is an example that was viewed in our tests.
> > One IPoIB host (client) sends a stream of multicast packets to another IPoIB host (server).
> > SM takeover event takes place during traffic and as a result multicast info is flushed
> > and there is a need to rejoin by hosts. Without the patch there is a chance (which according to our experience
> > is a very big chance) that the request to rejoin will be to the old SM and only after a retry join completes successfully.
> > This takes too long and the patch solves it.
>
> OK, that is fairly convincing (and it would be nice to include when
> sending the original patch).
>
> Please resend a version that fixes the races in the patch and we can
> probably add this for 2.6.27.
>
>  - R.

Thanks. I will resend this patch in a series of 2 (in a different
thread). The other patch in the series is related to the one above but
was sent by me earlier in a different thread without justification.

From ogerlitz at voltaire.com  Sun May 18 05:22:12 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 18 May 2008 15:22:12 +0300
Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
In-Reply-To: <20080516223419.27221.49014.stgit@dell3.ogc.int>
References: <20080516223256.27221.34568.stgit@dell3.ogc.int>
	<20080516223419.27221.49014.stgit@dell3.ogc.int>
Message-ID: <48301F74.4020905@voltaire.com>

Steve Wise wrote:
> - device-specific alloc/free of physical buffer lists for use in fast
> register work requests.  This allows devices to allocate this memory as
> needed (like via dma_alloc_coherent).

Steve,

Reading through the suggested API / patches and the previous threads, I
was not sure whether the HW driver must not assume that it has ownership
of the page --list-- structure until the registration work request is
completed - or not.
Now, if ownership cannot be assumed (e.g. as for the SG list elements
pointed to by send/recv WRs), the driver has to clone it anyway, and
thus I don't see the need for the ib_alloc/free_fast_reg_page_list
verbs.

If ownership can be assumed, I suggest having the core fall back to an
implementation of these two verbs like the one you did for the Chelsio
driver whenever the HW driver does not implement them (i.e. instead of
returning ENOSYS). In that case, the alloc_list verb should do DMA
mapping FROM device (I think...) since the device is going to do DMA to
read the page list, and the free_list verb should do DMA unmapping, etc.

Or.

From monis at Voltaire.COM  Sun May 18 05:25:24 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Sun, 18 May 2008 15:25:24 +0300
Subject: [ofa-general] [PATCH 0/2] IB: Improve recovery from SM change events after takeover
Message-ID: <48302034.8040709@Voltaire.COM>

The patches below improve the recovery of the IPoIB driver from a
failure of the SM and a takeover by another SM. The purpose is to
minimize the time that two IPoIB hosts remain disconnected after an SM
takeover event.

Here is an example that we observed in our tests. One IPoIB host
(client) sends a stream of multicast packets to another IPoIB host
(server). An SM takeover event takes place during traffic; as a result,
multicast info is flushed and the hosts need to rejoin. Without the
patches there is a chance (in our experience, a very big one) that the
rejoin request will go to the old SM, and the join completes
successfully only after a retry. This takes too long, and the patches
solve it.

Our tests for IP multicast and unicast traffic between two hosts show
that without the patches there is a period of up to 5 seconds during
which communication is lost; with the patches the time decreases to
less than a second.

From monis at Voltaire.COM  Sun May 18 05:34:31 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Sun, 18 May 2008 15:34:31 +0300
Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in work queues after event
In-Reply-To: <48302034.8040709@Voltaire.COM>
References: <48302034.8040709@Voltaire.COM>
Message-ID: <48302257.2050308@Voltaire.COM>

This patch solves a race between work elements that are carried out
after an event occurs. When the SM address handle becomes invalid and
needs an update, this is handled by a work item in the global workqueue.
On the other hand, the same event is also handled in ib_ipoib by queuing
a work item in the ipoib_workqueue that does the mcast join. Although
the queuing is in the right order, it is done to two different
workqueues, so there is no guarantee that the first to be queued is the
first to be executed.

The patch sets the SM address handle to NULL, and until update_sm_ah()
is called, any request that needs sm_ah fails with an -EAGAIN return
status. For consumers, the patch doesn't make things worse. Before the
patch, MADs were sent to the wrong SM, so the request got lost.
Consumers can be improved to examine the return code and respond to
-EAGAIN properly, but even without that improvement the situation does
not get worse, and in some cases it gets better.
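(For illustration only, not part of the patch: one way a consumer could
respond to the new -EAGAIN return. The retry count, the delay, and the
path_rec_done callback below are made-up placeholders; only
ib_sa_path_rec_get() itself is the existing API.)

    #include <linux/delay.h>
    #include <rdma/ib_sa.h>

    /* Hypothetical completion callback, declared for the sketch. */
    static void path_rec_done(int status, struct ib_sa_path_rec *resp,
                              void *context);

    /*
     * Sketch: retry a path record query while the SA has no SM
     * address handle yet (-EAGAIN means update_sm_ah() has not run
     * yet).  Retry/delay values are arbitrary.
     */
    static int path_rec_get_retry(struct ib_sa_client *client,
                                  struct ib_device *device, u8 port_num,
                                  struct ib_sa_path_rec *rec,
                                  ib_sa_comp_mask comp_mask,
                                  struct ib_sa_query **query)
    {
        int tries = 5;
        int ret;

        do {
            ret = ib_sa_path_rec_get(client, device, port_num, rec,
                                     comp_mask, 1000, GFP_KERNEL,
                                     path_rec_done, NULL, query);
            if (ret != -EAGAIN)
                break;
            msleep(100);    /* give update_sm_ah() a chance to run */
        } while (--tries);

        return ret;
    }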
Signed-off-by: Moni Levy
Signed-off-by: Moni Shoua

---

 drivers/infiniband/core/sa_query.c |   26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index cf474ec..a2e61d7 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -413,9 +413,20 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event
 	    event->event == IB_EVENT_PKEY_CHANGE ||
 	    event->event == IB_EVENT_SM_CHANGE ||
 	    event->event == IB_EVENT_CLIENT_REREGISTER) {
-		struct ib_sa_device *sa_dev;
-		sa_dev = container_of(handler, typeof(*sa_dev), event_handler);
-
+		unsigned long flags;
+		struct ib_sa_device *sa_dev =
+			container_of(handler, typeof(*sa_dev), event_handler);
+		struct ib_sa_port *port =
+			&sa_dev->port[event->element.port_num - sa_dev->start_port];
+		struct ib_sa_sm_ah *sm_ah;
+
+		spin_lock_irqsave(&port->ah_lock, flags);
+		sm_ah = port->sm_ah;
+		port->sm_ah = NULL;
+		spin_unlock_irqrestore(&port->ah_lock, flags);
+
+		if (sm_ah)
+			kref_put(&sm_ah->ref, free_sm_ah);
 		schedule_work(&sa_dev->port[event->element.port_num -
 					    sa_dev->start_port].update_task);
 	}
@@ -663,6 +674,8 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
 		return -ENODEV;
 
 	port  = &sa_dev->port[port_num - sa_dev->start_port];
+	if (!port->sm_ah)
+		return -EAGAIN;
 	agent = port->agent;
 
 	query = kmalloc(sizeof *query, gfp_mask);
@@ -780,6 +793,9 @@ int ib_sa_service_rec_query(struct ib_sa_client *client,
 		return -ENODEV;
 
 	port  = &sa_dev->port[port_num - sa_dev->start_port];
+	if (!port->sm_ah)
+		return -EAGAIN;
+
 	agent = port->agent;
 
 	if (method != IB_MGMT_METHOD_GET &&
@@ -877,8 +893,10 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
 		return -ENODEV;
 
 	port  = &sa_dev->port[port_num - sa_dev->start_port];
-	agent = port->agent;
 
+	if (!port->sm_ah)
+		return -EAGAIN;
+	agent = port->agent;
 	query = kmalloc(sizeof *query, gfp_mask);
 	if (!query)
 		return -ENOMEM;

From monis at Voltaire.COM  Sun May 18 05:36:11 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Sun, 18 May 2008 15:36:11 +0300
Subject: [ofa-general] [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity
In-Reply-To: <48302034.8040709@Voltaire.COM>
References: <48302034.8040709@Voltaire.COM>
Message-ID: <483022BB.9060004@Voltaire.COM>

The purpose of this patch is to make the events that are related to an
SM change (namely the CLIENT_REREGISTER and SM_CHANGE events) less
disruptive. When SM-related events are handled, it is not necessary to
flush unicast info from the device, only multicast info. This patch
divides the events handled by IPoIB into three categories: 0, 1 and 2
(where level 2 does more than level 1, and level 1 more than level 0).
The main change is in __ipoib_ib_dev_flush(). Instead of passing the
function a pkey_event flag, we now use levels: an event that requires
"harder" flushing calls this function with a higher level. Beyond the
concept, the actual change is that SM-related events no longer flush
unicast info or bring the device down; they only refresh the multicast
info in the background.
Signed-off-by: Moni Levy Signed-off-by: Moni Shoua --- drivers/infiniband/ulp/ipoib/ipoib.h | 9 ++++--- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 37 ++++++++++++++++++----------- drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 ++- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 19 +++++++------- 4 files changed, 43 insertions(+), 27 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index ca126fc..8ed4dc0 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -276,10 +276,11 @@ struct ipoib_dev_priv { struct delayed_work pkey_poll_task; struct delayed_work mcast_task; - struct work_struct flush_task; + struct work_struct flush_task0; + struct work_struct flush_task1; + struct work_struct flush_task2; struct work_struct restart_task; struct delayed_work ah_reap_task; - struct work_struct pkey_event_task; struct ib_device *ca; u8 port; @@ -427,7 +428,9 @@ void ipoib_flush_paths(struct net_device *dev); struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); -void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_ib_dev_flush0(struct work_struct *work); +void ipoib_ib_dev_flush1(struct work_struct *work); +void ipoib_ib_dev_flush2(struct work_struct *work); void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index f429bce..2a9c058 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -898,12 +898,14 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) return 0; } -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) { struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; u16 new_index; + ipoib_dbg(priv, "Try flushing level %d\n", level); + mutex_lock(&priv->vlan_mutex); /* @@ -911,7 +913,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) * the parent is down. 
*/ list_for_each_entry(cpriv, &priv->child_intfs, list) - __ipoib_ib_dev_flush(cpriv, pkey_event); + __ipoib_ib_dev_flush(cpriv, level); mutex_unlock(&priv->vlan_mutex); @@ -925,7 +927,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) return; } - if (pkey_event) { + if (level == 2) { if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ipoib_ib_dev_down(dev, 0); @@ -943,11 +945,13 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) priv->pkey_index = new_index; } - ipoib_dbg(priv, "flushing\n"); - ipoib_ib_dev_down(dev, 0); + ipoib_mcast_dev_flush(dev); + + if (level >= 1) + ipoib_ib_dev_down(dev, 0); - if (pkey_event) { + if (level >= 2) { ipoib_ib_dev_stop(dev, 0); ipoib_ib_dev_open(dev); } @@ -957,29 +961,36 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) * we get here, don't bring it back up if it's not configured up */ if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { - ipoib_ib_dev_up(dev); + if (level >= 1) + ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } } -void ipoib_ib_dev_flush(struct work_struct *work) +void ipoib_ib_dev_flush0(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + container_of(work, struct ipoib_dev_priv, flush_task0); - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); __ipoib_ib_dev_flush(priv, 0); } -void ipoib_pkey_event(struct work_struct *work) +void ipoib_ib_dev_flush1(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_event_task); + container_of(work, struct ipoib_dev_priv, flush_task1); - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name); __ipoib_ib_dev_flush(priv, 1); } +void ipoib_ib_dev_flush2(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task2); + + __ipoib_ib_dev_flush(priv, 2); +} + void ipoib_ib_dev_cleanup(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 2442090..2808023 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -989,9 +989,10 @@ static void ipoib_setup(struct net_device *dev) INIT_LIST_HEAD(&priv->multicast_list); INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 8766d29..80c0409 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -289,15 +289,16 @@ void ipoib_event(struct ib_event_handler *handler, if (record->element.port_num != priv->port) return; - if (record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PORT_ACTIVE || - record->event == IB_EVENT_LID_CHANGE || - record->event == IB_EVENT_SM_CHANGE || - record->event == IB_EVENT_CLIENT_REREGISTER) { - 
ipoib_dbg(priv, "Port state change event\n");
-		queue_work(ipoib_workqueue, &priv->flush_task);
+	ipoib_dbg(priv, "Event %d on device %s port %d\n",record->event,
+		record->device->name, record->element.port_num);
+	if ( record->event == IB_EVENT_SM_CHANGE ||
+	     record->event == IB_EVENT_CLIENT_REREGISTER) {
+		queue_work(ipoib_workqueue, &priv->flush_task0);
+	} else if (record->event == IB_EVENT_PORT_ERR ||
+		   record->event == IB_EVENT_PORT_ACTIVE ||
+		   record->event == IB_EVENT_LID_CHANGE) {
+		queue_work(ipoib_workqueue, &priv->flush_task1);
 	} else if (record->event == IB_EVENT_PKEY_CHANGE) {
-		ipoib_dbg(priv, "P_Key change event on port:%d\n", priv->port);
-		queue_work(ipoib_workqueue, &priv->pkey_event_task);
+		queue_work(ipoib_workqueue, &priv->flush_task2);
 	}
 }

From jackm at dev.mellanox.co.il  Sun May 18 07:34:55 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 18 May 2008 17:34:55 +0300
Subject: [ofa-general] mthca max_sge value... ugh.
In-Reply-To: <200805160919.45676.okir@lst.de>
References: <200805160919.45676.okir@lst.de>
Message-ID: <200805181734.55378.jackm@dev.mellanox.co.il>

This is actually a known issue, which we never got around to fixing.
From the OFED 1.2.5 release notes (document docs/mthca_release_notes.txt),
in section "3. Known Issues":

3. In mem-free devices, RC QPs can be created with a maximum of
   (max_sge - 3) entries only.

- Jack

On Friday 16 May 2008 10:19, Olaf Kirch wrote:
> On Friday 16 May 2008 00:22:12 Roland Dreier wrote:
> > I ran into this a few weeks back as well, when I tried to up the SG limit
> in RDS to 32 (on a arbel memfree card).

memfree returns 30 as the max, I believe. ConnectX returns 32.

> I grepped around the code a bit, got a little confused because of all the
> different max_sge, max_sg and max_gs variables :-) and eventually
> convinced myself that the max_sge reported simply doesn't include the
> transport specific overhead that mthca_alloc_wqe_buf factors in.
>
> Given that you have quite different WQE overheads depending on the transport,
> a conservative max_sge value that works for all transports wastes one or two
> entries on some others. Maybe once the QP is created, it could report
> the actual max_sge value (which may actually be greater than the conservative,
> transport-independent max_sge estimate of the device).

This is a problem, because then you are returning a value which is
greater than the declared device max. This causes IB Spec
non-compliance.

>
> Olaf

From hrosenstock at xsigo.com  Sun May 18 07:44:50 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Sun, 18 May 2008 07:44:50 -0700
Subject: [ofa-general] [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity
In-Reply-To: <483022BB.9060004@Voltaire.COM>
References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM>
Message-ID: <1211121890.12616.381.camel@hrosenstock-ws.xsigo.com>

On Sun, 2008-05-18 at 15:36 +0300, Moni Shoua wrote:
> The purpose of this patch is to make the events that are related to an SM change
> (namely the CLIENT_REREGISTER and SM_CHANGE events) less disruptive.
> When SM-related events are handled, it is not necessary to flush unicast
> info from the device, only multicast info.

How is unicast invalidation handled on these changes? On a local LID
change event, how does an end port know/determine what else (e.g. other
LIDs, paths) the SM might have changed (that specifically might affect
IPoIB since this is limited to IPoIB)?
Also, wouldn't there be similar issues with other ULPs ? -- Hal > This patch divides the events that are > handled by IPoIB to three categories; 0, 1 and 2 (when 2 does more than 1 and 1 > does more than 0). > The main change is in __ipoib_ib_dev_flush(). Instead of flagging to the function > about pkey_events we now use leveling. An event that requires "harder" flushing > calls this function with higher number for level. Besides the concept, > the actual change is that SM related events are not flushing unicast info and > not bringing the device down but only refresh the multicast info in the background. > > Signed-off-by: Moni Levy > Signed-off-by: Moni Shoua > > --- > > drivers/infiniband/ulp/ipoib/ipoib.h | 9 ++++--- > drivers/infiniband/ulp/ipoib/ipoib_ib.c | 37 ++++++++++++++++++----------- > drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 ++- > drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 19 +++++++------- > 4 files changed, 43 insertions(+), 27 deletions(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h > index ca126fc..8ed4dc0 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > @@ -276,10 +276,11 @@ struct ipoib_dev_priv { > > struct delayed_work pkey_poll_task; > struct delayed_work mcast_task; > - struct work_struct flush_task; > + struct work_struct flush_task0; > + struct work_struct flush_task1; > + struct work_struct flush_task2; > struct work_struct restart_task; > struct delayed_work ah_reap_task; > - struct work_struct pkey_event_task; > > struct ib_device *ca; > u8 port; > @@ -427,7 +428,9 @@ void ipoib_flush_paths(struct net_device *dev); > struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); > > int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); > -void ipoib_ib_dev_flush(struct work_struct *work); > +void ipoib_ib_dev_flush0(struct work_struct *work); > +void ipoib_ib_dev_flush1(struct work_struct *work); > +void ipoib_ib_dev_flush2(struct work_struct *work); > void ipoib_pkey_event(struct work_struct *work); > void ipoib_ib_dev_cleanup(struct net_device *dev); > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > index f429bce..2a9c058 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > @@ -898,12 +898,14 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) > return 0; > } > > -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) > { > struct ipoib_dev_priv *cpriv; > struct net_device *dev = priv->dev; > u16 new_index; > > + ipoib_dbg(priv, "Try flushing level %d\n", level); > + > mutex_lock(&priv->vlan_mutex); > > /* > @@ -911,7 +913,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > * the parent is down. 
> */ > list_for_each_entry(cpriv, &priv->child_intfs, list) > - __ipoib_ib_dev_flush(cpriv, pkey_event); > + __ipoib_ib_dev_flush(cpriv, level); > > mutex_unlock(&priv->vlan_mutex); > > @@ -925,7 +927,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > return; > } > > - if (pkey_event) { > + if (level == 2) { > if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { > clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > ipoib_ib_dev_down(dev, 0); > @@ -943,11 +945,13 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > priv->pkey_index = new_index; > } > > - ipoib_dbg(priv, "flushing\n"); > > - ipoib_ib_dev_down(dev, 0); > + ipoib_mcast_dev_flush(dev); > + > + if (level >= 1) > + ipoib_ib_dev_down(dev, 0); > > - if (pkey_event) { > + if (level >= 2) { > ipoib_ib_dev_stop(dev, 0); > ipoib_ib_dev_open(dev); > } > @@ -957,29 +961,36 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > * we get here, don't bring it back up if it's not configured up > */ > if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { > - ipoib_ib_dev_up(dev); > + if (level >= 1) > + ipoib_ib_dev_up(dev); > ipoib_mcast_restart_task(&priv->restart_task); > } > } > > -void ipoib_ib_dev_flush(struct work_struct *work) > +void ipoib_ib_dev_flush0(struct work_struct *work) > { > struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, flush_task); > + container_of(work, struct ipoib_dev_priv, flush_task0); > > - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); > __ipoib_ib_dev_flush(priv, 0); > } > > -void ipoib_pkey_event(struct work_struct *work) > +void ipoib_ib_dev_flush1(struct work_struct *work) > { > struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, pkey_event_task); > + container_of(work, struct ipoib_dev_priv, flush_task1); > > - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name); > __ipoib_ib_dev_flush(priv, 1); > } > > +void ipoib_ib_dev_flush2(struct work_struct *work) > +{ > + struct ipoib_dev_priv *priv = > + container_of(work, struct ipoib_dev_priv, flush_task2); > + > + __ipoib_ib_dev_flush(priv, 2); > +} > + > void ipoib_ib_dev_cleanup(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 2442090..2808023 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -989,9 +989,10 @@ static void ipoib_setup(struct net_device *dev) > INIT_LIST_HEAD(&priv->multicast_list); > > INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); > - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); > INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); > - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); > + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); > + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); > + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); > INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); > INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); > } > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > index 8766d29..80c0409 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > @@ -289,15 +289,16 @@ void ipoib_event(struct ib_event_handler *handler, > if (record->element.port_num != priv->port) > return; > > - if 
(record->event == IB_EVENT_PORT_ERR || > - record->event == IB_EVENT_PORT_ACTIVE || > - record->event == IB_EVENT_LID_CHANGE || > - record->event == IB_EVENT_SM_CHANGE || > - record->event == IB_EVENT_CLIENT_REREGISTER) { > - ipoib_dbg(priv, "Port state change event\n"); > - queue_work(ipoib_workqueue, &priv->flush_task); > + ipoib_dbg(priv, "Event %d on device %s port %d\n",record->event, > + record->device->name, record->element.port_num); > + if ( record->event == IB_EVENT_SM_CHANGE || > + record->event == IB_EVENT_CLIENT_REREGISTER) { > + queue_work(ipoib_workqueue, &priv->flush_task0); > + } else if (record->event == IB_EVENT_PORT_ERR || > + record->event == IB_EVENT_PORT_ACTIVE || > + record->event == IB_EVENT_LID_CHANGE) { > + queue_work(ipoib_workqueue, &priv->flush_task1); > } else if (record->event == IB_EVENT_PKEY_CHANGE) { > - ipoib_dbg(priv, "P_Key change event on port:%d\n", priv->port); > - queue_work(ipoib_workqueue, &priv->pkey_event_task); > + queue_work(ipoib_workqueue, &priv->flush_task2); > } > } > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jackm at dev.mellanox.co.il Sun May 18 07:49:41 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 18 May 2008 17:49:41 +0300 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: Message-ID: <200805181749.42187.jackm@dev.mellanox.co.il> On Saturday 17 May 2008 02:12, Roland Dreier wrote: > > if we can't use the "WQE shrinking" feature (because of selective > signaling in the NFS/RDMA case), and we want to use 32 sge entries, then > the WQE size 's' will end up a little more than 512 bytes, and the > wqe_shift will end up as 10. > But since the max_sq_desc_sz is 1008, we > return -EINVAL, when it is really fine to have a wqe_shift of 10 as long > as we don't use more than 1008 bytes per descriptor (I think). Correct. ... > @@ -395,7 +396,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, > ++qp->sq.wqe_shift; > } > > - qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - > + qp->sq.max_gs = (min(dev->dev->caps.max_sq_desc_sz, > + (qp->sq_max_wqes_per_wr << qp->sq.wqe_shift)) - > send_wqe_overhead(type, qp->flags)) / > sizeof (struct mlx4_wqe_data_seg); In this case, sq.max_gs ( = (1008 - wqe overhead) / 16) will be larger than the "max sge" value returned by ib_query_device, (max_sge returned by ib_query_device is 32). I'm not crazy about this inconsistency. Please note also that the IB Spec does not differentiate between Send max_sge, and Receive max_sge, so we're reduced to enforcing the minimum of the two values. The general approach taken in the driver is to enforce the smallest of the sge values, to avoid dealing with the individual qp type maxima. - Jack From tziporet at dev.mellanox.co.il Sun May 18 08:05:49 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 18 May 2008 18:05:49 +0300 Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others In-Reply-To: References: <482AC510.3090602@dev.mellanox.co.il> Message-ID: <483045CD.8060301@mellanox.co.il> Chris Worley wrote: > Ahhh... it was probably because I added the RPMs w/o deleting 1.2.5.5 > in the "kitchen sink" build. > > Is there any reason to NOT use connected mode? 
In general the CM gives better performance for medium and large
messages. We found that UD mode is better for small UDP messages.

Tziporet

From monis at Voltaire.COM  Sun May 18 08:10:02 2008
From: monis at Voltaire.COM (Moni Shoua)
Date: Sun, 18 May 2008 18:10:02 +0300
Subject: [ofa-general] [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity
In-Reply-To: <1211121890.12616.381.camel@hrosenstock-ws.xsigo.com>
References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM>
	<1211121890.12616.381.camel@hrosenstock-ws.xsigo.com>
Message-ID: <483046CA.3010403@Voltaire.COM>

Hal Rosenstock wrote:
> On Sun, 2008-05-18 at 15:36 +0300, Moni Shoua wrote:
>> The purpose of this patch is to make the events that are related to an SM change
>> (namely the CLIENT_REREGISTER and SM_CHANGE events) less disruptive.
>> When SM-related events are handled, it is not necessary to flush unicast
>> info from the device, only multicast info.
>
> How is unicast invalidation handled on these changes? On a local LID
> change event, how does an end port know/determine what else (e.g. other
> LIDs, paths) the SM might have changed (that specifically might affect
> IPoIB since this is limited to IPoIB)?

I'm not sure I understand the question, but a local LID change would be
handled as before, with a LID_CHANGE event. For this type of event there
is no change in what IPoIB does to cope.

>
> Also, wouldn't there be similar issues with other ULPs ?

There might be, but the purpose of this one is to make things better for
IPoIB.

>
> -- Hal
>

From vlad at dev.mellanox.co.il  Sun May 18 08:22:39 2008
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Sun, 18 May 2008 18:22:39 +0300
Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size
Message-ID: <483049BF.4050603@dev.mellanox.co.il>

From 3bb8b713da6a0b2087201d6fb6c1a8d9274cf16e Mon Sep 17 00:00:00 2001
From: Vladimir Sokolovsky
Date: Sun, 18 May 2008 11:25:55 +0300
Subject: [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size

There is a bug in the OFED 1.3 mlx4 driver: mlx4_alloc_fmr hardcodes
the minimum acceptable page_shift to be 12. However, new mlx4 firmware
has a minimum page_shift of 9 (log_pg_sz of 9 returned by
QUERY_DEV_LIM) -- so ib_fmr_alloc fails for ULPs using the device
minimum when creating FMRs.

To preserve firmware compatibility with released OFED drivers, the
firmware will continue to return 12 as before for log_page_sz in
QUERY_DEV_CAP for these drivers. However, to enable new drivers to take
advantage of the available smaller page size, the mlx4 driver now first
sets the log_pg_sz to the device minimum via the MOD_STAT_CFG() command,
and only then calls QUERY_DEV_CAP(). The QUERY_DEV_CAP() command then
returns the new (lower) log_pg_sz value.
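(For illustration only, not part of the patch: what the lower log_pg_sz
buys a ULP. ib_alloc_fmr() and struct ib_fmr_attr are the existing
kernel verbs API; the function name and the max_pages/max_maps numbers
below are made up for the example.)

    #include <rdma/ib_verbs.h>

    /*
     * Sketch: create an FMR over 512-byte pages (page_shift = 9).
     * Against a driver that still reports a 4KB minimum (12), the
     * same call would fail.
     */
    static struct ib_fmr *alloc_small_page_fmr(struct ib_pd *pd)
    {
        struct ib_fmr_attr attr = {
            .max_pages  = 64,   /* arbitrary for the example */
            .max_maps   = 32,   /* arbitrary for the example */
            .page_shift = 9,    /* 512-byte pages */
        };

        return ib_alloc_fmr(pd, IB_ACCESS_LOCAL_WRITE |
                                IB_ACCESS_REMOTE_READ |
                                IB_ACCESS_REMOTE_WRITE, &attr);
    }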
Signed-off-by: Jack Morgenstein Signed-off-by: Vladimir Sokolovsky --- drivers/net/mlx4/fw.c | 28 ++++++++++++++++++++++++++++ drivers/net/mlx4/fw.h | 6 ++++++ drivers/net/mlx4/main.c | 13 +++++++++++++ include/linux/mlx4/cmd.h | 2 +- 4 files changed, 48 insertions(+), 1 deletions(-) diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c index d82f275..2b5006b 100644 --- a/drivers/net/mlx4/fw.c +++ b/drivers/net/mlx4/fw.c @@ -101,6 +101,34 @@ static void dump_dev_cap_flags(struct mlx4_dev *dev, u32 flags) mlx4_dbg(dev, " %s\n", fname[i]); } +int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg) +{ + struct mlx4_cmd_mailbox *mailbox; + u32 *inbox; + int err = 0; + +#define MOD_STAT_CFG_IN_SIZE 0x100 + +#define MOD_STAT_CFG_PG_SZ_M_OFFSET 0x002 +#define MOD_STAT_CFG_PG_SZ_OFFSET 0x003 + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + inbox = mailbox->buf; + + memset(inbox, 0, MOD_STAT_CFG_IN_SIZE); + + MLX4_PUT(inbox, cfg->log_pg_sz, MOD_STAT_CFG_PG_SZ_OFFSET); + MLX4_PUT(inbox, cfg->log_pg_sz_m, MOD_STAT_CFG_PG_SZ_M_OFFSET); + + err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_MOD_STAT_CFG, + MLX4_CMD_TIME_CLASS_A); + + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} + int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) { struct mlx4_cmd_mailbox *mailbox; diff --git a/drivers/net/mlx4/fw.h b/drivers/net/mlx4/fw.h index 306cb9b..a0e046c 100644 --- a/drivers/net/mlx4/fw.h +++ b/drivers/net/mlx4/fw.h @@ -38,6 +38,11 @@ #include "mlx4.h" #include "icm.h" +struct mlx4_mod_stat_cfg { + u8 log_pg_sz; + u8 log_pg_sz_m; +}; + struct mlx4_dev_cap { int max_srq_sz; int max_qp_sz; @@ -162,5 +167,6 @@ int mlx4_SET_ICM_SIZE(struct mlx4_dev *dev, u64 icm_size, u64 *aux_pages); int mlx4_MAP_ICM_AUX(struct mlx4_dev *dev, struct mlx4_icm *icm); int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev); int mlx4_NOP(struct mlx4_dev *dev); +int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg); #endif /* MLX4_FW_H */ diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index a6aa49f..2a155ee 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -75,6 +75,11 @@ static char mlx4_version[] __devinitdata = DRV_NAME ": Mellanox ConnectX core driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; +static int mlx4_log_pg_sz = 0; +module_param(mlx4_log_pg_sz, int, 0444); +MODULE_PARM_DESC(mlx4_log_pg_sz, + "set FW log system min page size (0 gets native FW min. 
default=0)"); + static struct mlx4_profile default_profile = { .num_qp = 1 << 17, .num_srq = 1 << 16, @@ -485,6 +490,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev) struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_adapter adapter; struct mlx4_dev_cap dev_cap; + struct mlx4_mod_stat_cfg mlx4_cfg; struct mlx4_profile profile; struct mlx4_init_hca_param init_hca; u64 icm_size; @@ -502,6 +508,13 @@ static int mlx4_init_hca(struct mlx4_dev *dev) return err; } + mlx4_cfg.log_pg_sz_m = 1; + mlx4_cfg.log_pg_sz = (u8) mlx4_log_pg_sz; + err = mlx4_MOD_STAT_CFG(dev, &mlx4_cfg); + if (err) + mlx4_warn(dev, "Failed to override log_pg_sz parameter to %d\n", + mlx4_log_pg_sz); + err = mlx4_dev_cap(dev, &dev_cap); if (err) { mlx4_err(dev, "QUERY_DEV_CAP command failed, aborting.\n"); diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h index 77323a7..3b563ed 100644 --- a/include/linux/mlx4/cmd.h +++ b/include/linux/mlx4/cmd.h @@ -169,7 +169,7 @@ static inline int mlx4_cmd_imm(struct mlx4_dev *dev, u64 in_param, u64 *out_para u32 in_modifier, u8 op_modifier, u16 op, unsigned long timeout) { - return __mlx4_cmd(dev, in_param, out_param, 1, in_modifier, + return __mlx4_cmd(dev, in_param, out_param, out_param ? 1 : 0, in_modifier, op_modifier, op, timeout); } -- 1.5.5.1 From roland.list at gmail.com Sun May 18 09:04:27 2008 From: roland.list at gmail.com (Roland Dreier) Date: Sun, 18 May 2008 09:04:27 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: <200805181749.42187.jackm@dev.mellanox.co.il> References: <200805181749.42187.jackm@dev.mellanox.co.il> Message-ID: >> - qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - >> + qp->sq.max_gs = (min(dev->dev->caps.max_sq_desc_sz, >> + (qp->sq_max_wqes_per_wr << qp->sq.wqe_shift)) - >> send_wqe_overhead(type, qp->flags)) / >> sizeof (struct mlx4_wqe_data_seg); > > In this case, sq.max_gs ( = (1008 - wqe overhead) / 16) will be larger than the > "max sge" value returned by ib_query_device, (max_sge returned by ib_query_device is 32). > I'm not crazy about this inconsistency. Please note also that the IB Spec does not > differentiate between Send max_sge, and Receive max_sge, so we're reduced to enforcing > the minimum of the two values. OK, we can clamp the value lower here to the max_sge reported by the driver (but the change I'm making here already only lowers the returned sq.max_gs value, since the value qp->sq_max_wqes_per_wr << qp->sq.wqe_shift will be 1024 in the case in question). But can you point me to the place in the IB spec where it requires all sge limits to be no bigger than the returned max_sge value? - R. From swise at opengridcomputing.com Sun May 18 09:46:21 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 18 May 2008 11:46:21 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48301F74.4020905@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> Message-ID: <48305D5D.4040401@opengridcomputing.com> Or Gerlitz wrote: > Steve Wise wrote: >> - device-specific alloc/free of physical buffer lists for use in fast >> register work requests. This allows devices to allocate this memory as >> needed (like via dma_alloc_coherent). 
>> > Steve, > > Reading through the suggested API / patches and the previous threads I > was not sure to understand if the HW driver must not assume that it has > the ownership on the page --list-- structure until the registration work > request is completed - or not. > Yes, the driver owns the page list structure until the WR completes (ie is reaped by the consumer via poll_cq()). > Now, if ownership can not be assumed (eg as for the SG list elements > pointed by send/recv WR), the driver has to clone it anyway, and thus I > don't see the need in the ib_alloc/free_fast_reg_page_list verbs. > > If ownership can be assumed, I suggest to have the core use the > implementation of these two verbs as you did that for the Chelsio driver > in case the HW driver did not implement it (i.e instead of returning > ENOSYS). In that case, the alloc_list verb should do DMA mapping FROM > device (I think...) since the device is going to do DMA to read the page > list, and the free_list verb should do DMA unmapping, etc. > Some devices don't need DMA mappings at all (chelsio for instance). The idea of a device-specific method was so the device could allocate a bigger structure to hold its own context info. So a core service that sets up DMA, in my opinion, isn't really useful. Steve. > Or. > From rdreier at cisco.com Sun May 18 14:39:55 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 18 May 2008 14:39:55 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48301F74.4020905@voltaire.com> (Or Gerlitz's message of "Sun, 18 May 2008 15:22:12 +0300") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> Message-ID: > If ownership can be assumed, I suggest to have the core use the > implementation of these two verbs as you did that for the Chelsio > driver in case the HW driver did not implement it (i.e instead of > returning ENOSYS). In that case, the alloc_list verb should do DMA > mapping FROM device (I think...) since the device is going to do DMA > to read the page list, and the free_list verb should do DMA unmapping, > etc. Yes, the point of this verb is that the low-level driver owns the page list from when the fast register work request is posted until it completes. This should be explicitly documented somewhere. However the reason for having the low-level driver implement it is so that all strange device-specific issues can be taken care of in the driver. For instance mlx4 is going to require that the page list be aligned to 64 bytes, and will DMA from the memory, so we need to use dma_alloc_consistent(). On the other hand cxgb3 is just going to copy in software, so kmalloc is sufficient. - R. From rdreier at cisco.com Sun May 18 14:42:23 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 18 May 2008 14:42:23 -0700 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: <48302257.2050308@Voltaire.COM> (Moni Shoua's message of "Sun, 18 May 2008 15:34:31 +0300") References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> Message-ID: I asked you to resend with the race fixed. However I guess I never spelled out what race I meant ... I thought I pointed this out, but looking at the archives I don't see it. 
So think about this: What happens if someone calls ib_sa_path_rec_get() and

@@ -663,6 +674,8 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
 		return -ENODEV;
 
 	port  = &sa_dev->port[port_num - sa_dev->start_port];
+	if (!port->sm_ah)
+		return -EAGAIN;

right about here (*after* the test), ib_sa_event() does
"port->sm_ah = NULL;" on another CPU.

 	agent = port->agent;
 
 	query = kmalloc(sizeof *query, gfp_mask);

From roland.list at gmail.com  Sun May 18 14:45:40 2008
From: roland.list at gmail.com (Roland Dreier)
Date: Sun, 18 May 2008 14:45:40 -0700
Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size
In-Reply-To: <483049BF.4050603@dev.mellanox.co.il>
References: <483049BF.4050603@dev.mellanox.co.il>
Message-ID: 

[Adding list cc]

> +static int mlx4_log_pg_sz = 0;
> +module_param(mlx4_log_pg_sz, int, 0444);
> +MODULE_PARM_DESC(mlx4_log_pg_sz,
> +	"set FW log system min page size (0 gets native FW min. default=0)");

Why do we need this module parameter? When would someone set it to
anything other than 0?

> -	return __mlx4_cmd(dev, in_param, out_param, 1, in_modifier,
> +	return __mlx4_cmd(dev, in_param, out_param, out_param ? 1 : 0, in_modifier,
> 		op_modifier, op, timeout);

I don't see any call to mlx4_cmd_imm in this patch -- why is this change
needed?

 - R.

From jackm at dev.mellanox.co.il  Sun May 18 22:12:38 2008
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 19 May 2008 08:12:38 +0300
Subject: [ofa-general] mthca max_sge value... ugh.
In-Reply-To: 
References: <200805181749.42187.jackm@dev.mellanox.co.il>
Message-ID: <200805190812.39408.jackm@dev.mellanox.co.il>

On Sunday 18 May 2008 19:04, Roland Dreier wrote:
> But can you point me to the place in the IB spec where it requires all
> sge limits to be no bigger
> than the returned max_sge value?
>
It's not mentioned specifically, but it certainly is strongly implied.

1. Section 11.2.1.2 -- Query HCA:
   • The maximum number of scatter/gather entries per Work Request
     supported by this HCA, for all Work Requests other than Reliable
     Datagram Receive Queue Work Requests.

This certainly implies that the create verb should not return a number
of scatter-gather entries greater than the max_sge value (which is the
max value supported by this HCA). Otherwise, why have a max_sge value
returned by Query HCA? What use would it serve?

Furthermore, how can the HCA return an sge value in Create QP which
exceeds the max_sge value returned by Query HCA? If it does, the sge
value in Create QP should be the one returned in Query HCA!

- Jack

From rdreier at cisco.com  Sun May 18 22:54:07 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 18 May 2008 22:54:07 -0700
Subject: [ofa-general] mthca max_sge value... ugh.
In-Reply-To: <200805190812.39408.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 19 May 2008 08:12:38 +0300")
References: <200805181749.42187.jackm@dev.mellanox.co.il>
	<200805190812.39408.jackm@dev.mellanox.co.il>
Message-ID: 
In other words, if some work requests support 4 s/g entries and others support 8, then query HCA should return 4 as the max s/g entries, since this is the largest number that *all* work requests support (although some support 8). > Otherwise, why have a max_sge value returned by Query HCA? What use > would it serve? It gives an upper bound on what consumers can request in a simple way, without having to have the complexity of per-transport limits for send and receive queues separately. > Furthermore, how can the HCA return an sge value > in Create QP which exceeds the max_sge value returned by Query HCA? If > it does, the sge value in Create QP should be the one returned in > Query HCA! The mlx4 case is a simple example: send work requests support more s/g entries than receive work requests do. So query HCA must return the lower receive work request limit, but I see no reason why create QP can't return the actual limit for send work requests. - R. From jackm at dev.mellanox.co.il Mon May 19 00:07:24 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 19 May 2008 10:07:24 +0300 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: <200805190812.39408.jackm@dev.mellanox.co.il> Message-ID: <200805191007.24888.jackm@dev.mellanox.co.il> On Monday 19 May 2008 08:54, Roland Dreier wrote: > The mlx4 case is a simple example: send work requests support more s/g > entries than receive work requests do.  So query HCA must return the > lower receive work request limit, but I see no reason why create QP > can't return the actual limit for send work requests. > Then, we get into the complexity of sanity checking in create_qp (since we should be able to use the value returned by create-qp when calling create-qp, and get the same result). Essentially, we will need to check the requested sge numbers per QP type, whether it is for send or receive, etc. IMHO, this gets nasty very quickly -- creates a problem with support -- users will need a "roadmap" for create-qp. I much prefer to treat the query_hca returned values as absolute maxima, and enforce these limits (although this is at the expense of additional s/g entries for some qp types and send/receive). - Jack From eli at dev.mellanox.co.il Mon May 19 00:40:10 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 19 May 2008 10:40:10 +0300 Subject: [ofa-general] Re: [PATCH v2] IB/mlx4: Add send with invalidate support In-Reply-To: References: <1210836027.18385.2.camel@mtls03> Message-ID: <1211182810.6515.6.camel@eli-laptop> On Fri, 2008-05-16 at 11:21 -0700, Roland Dreier wrote: > Are we forced to to look at the firmware version, or can we use the bmme > flag that the DEV_CAP firmware command gives us? We are going to have a few capability bits defined for each bmme feature. Once they're defined I'll regenerate the patch and resend. From ogerlitz at voltaire.com Mon May 19 01:04:10 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 May 2008 11:04:10 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int><20080516223419.27221.49014.stgit@dell3.ogc.int><48301F74.4020905@voltaire.com> Message-ID: <4831347A.1010506@voltaire.com> Roland Dreier wrote: > > Yes, the point of this verb is that the low-level driver owns the page > list from when the fast register work request is posted until it > completes. This should be explicitly documented somewhere. 
> OK, got it, so this is different case compared to the SG elements which are not owned by the driver once the posting call returns. > > However the reason for having the low-level driver implement it is so > that all strange device-specific issues can be taken care of in the > driver. For instance mlx4 is going to require that the page list be > aligned to 64 bytes, and will DMA from the memory, so we need to use > dma_alloc_consistent(). On the other hand cxgb3 is just going to copy > in software, so kmalloc is sufficient. > I see. Just wondering, in the mlx4 case, is it a must to use dma consistent memory allocation or dma mapping would work too? Or. Or. From olaf.kirch at oracle.com Mon May 19 01:05:59 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Mon, 19 May 2008 10:05:59 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: <200805161638.18067.olaf.kirch@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805141516.01908.okir@lst.de> <200805161638.18067.olaf.kirch@oracle.com> Message-ID: <200805191006.00114.olaf.kirch@oracle.com> > However, I'm still seeing performance degradation of ~5% with some packet > sizes. And that is *just* the overhead from exchanging the credit information > and checking it - at some point we need to take a spinlock, and that seems > to delay things just enough to make a dent in my throughput graph. Here's an updated version of the flow control patch - which is now completely lockless, and uses a single atomic_t to hold both credit counters. This has given me back close to full performance in my testing (throughput seems to be down less than 1%, which is almost within the noise range). I'll push it to my git tree a little later today, so folks can test it if they like. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ---- From: Olaf Kirch Subject: RDS: Implement IB flow control Here it is - flow control for RDS/IB. This patch is still very much experimental. Here's the essentials - The approach chosen here uses a credit-based flow control mechanism. Every SEND WR (including ACKs) consumes one credit, and if the sender runs out of credits, it stalls. - As new receive buffers are posted, credits are transferred to the remote node (using yet another RDS header byte for this). - Flow control is negotiated during connection setup. Initial credits are exchanged in the rds_ib_connect_private sruct - sending a value of zero (which is also the default for older protocol versions) means no flow control. - We avoid deadlock (both nodes depleting their credits, and being unable to inform the peer of newly posted buffers) by requiring that the last credit can only be used if we're posting new credits to the peer. The approach implemented here is lock-free; preliminary tests show the impact on throughput to be less than 1%, and the impact on RTT, CPU, TX delay and other metrics to be below the noise threshold. Flow control is configurable via sysctl. It only affects newly created connections however - so your best bet is to set this right after loading the RDS module. 
Signed-off-by: Olaf Kirch --- net/rds/ib.c | 1 net/rds/ib.h | 30 ++++++++ net/rds/ib_cm.c | 49 ++++++++++++- net/rds/ib_recv.c | 48 +++++++++--- net/rds/ib_send.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++++ net/rds/ib_stats.c | 3 net/rds/ib_sysctl.c | 10 ++ net/rds/rds.h | 4 - 8 files changed, 325 insertions(+), 14 deletions(-) Index: ofa_kernel-1.3/net/rds/ib.h =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib.h +++ ofa_kernel-1.3/net/rds/ib.h @@ -46,6 +46,7 @@ struct rds_ib_connect_private { __be16 dp_protocol_minor_mask; /* bitmask */ __be32 dp_reserved1; __be64 dp_ack_seq; + __be32 dp_credit; /* non-zero enables flow ctl */ }; struct rds_ib_send_work { @@ -110,15 +111,32 @@ struct rds_ib_connection { struct ib_sge i_ack_sge; u64 i_ack_dma; unsigned long i_ack_queued; + + /* Flow control related information + * + * Our algorithm uses a pair of variables that we need to access + * atomically - one for the send credits, and one for the posted + * recv credits we need to transfer to the remote. + * Rather than protect them using a slow spinlock, we put both into + * a single atomic_t and update it using cmpxchg + */ + atomic_t i_credits; /* Protocol version specific information */ unsigned int i_hdr_idx; /* 1 (old) or 0 (3.1 or later) */ + unsigned int i_flowctl : 1; /* enable/disable flow ctl */ /* Batched completions */ unsigned int i_unsignaled_wrs; long i_unsignaled_bytes; }; +/* This assumes that atomic_t is at least 32 bits */ +#define IB_GET_SEND_CREDITS(v) ((v) & 0xffff) +#define IB_GET_POST_CREDITS(v) ((v) >> 16) +#define IB_SET_SEND_CREDITS(v) ((v) & 0xffff) +#define IB_SET_POST_CREDITS(v) ((v) << 16) + struct rds_ib_ipaddr { struct list_head list; __be32 ipaddr; @@ -153,14 +171,17 @@ struct rds_ib_statistics { unsigned long s_ib_tx_cq_call; unsigned long s_ib_tx_cq_event; unsigned long s_ib_tx_ring_full; + unsigned long s_ib_tx_throttle; unsigned long s_ib_tx_sg_mapping_failure; unsigned long s_ib_tx_stalled; + unsigned long s_ib_tx_credit_updates; unsigned long s_ib_rx_cq_call; unsigned long s_ib_rx_cq_event; unsigned long s_ib_rx_ring_empty; unsigned long s_ib_rx_refill_from_cq; unsigned long s_ib_rx_refill_from_thread; unsigned long s_ib_rx_alloc_limit; + unsigned long s_ib_rx_credit_updates; unsigned long s_ib_ack_sent; unsigned long s_ib_ack_send_failure; unsigned long s_ib_ack_send_delayed; @@ -244,6 +265,8 @@ void rds_ib_flush_mrs(void); int __init rds_ib_recv_init(void); void rds_ib_recv_exit(void); int rds_ib_recv(struct rds_connection *conn); +int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, + gfp_t page_gfp, int prefill); void rds_ib_inc_purge(struct rds_incoming *inc); void rds_ib_inc_free(struct rds_incoming *inc); int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iovec *iov, @@ -252,6 +275,7 @@ void rds_ib_recv_cq_comp_handler(struct void rds_ib_recv_init_ring(struct rds_ib_connection *ic); void rds_ib_recv_clear_ring(struct rds_ib_connection *ic); void rds_ib_recv_init_ack(struct rds_ib_connection *ic); +void rds_ib_attempt_ack(struct rds_ib_connection *ic); void rds_ib_ack_send_complete(struct rds_ib_connection *ic); u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic); @@ -266,12 +290,17 @@ u32 rds_ib_ring_completed(struct rds_ib_ extern wait_queue_head_t rds_ib_ring_empty_wait; /* ib_send.c */ +void rds_ib_xmit_complete(struct rds_connection *conn); int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, unsigned int hdr_off, unsigned int sg, unsigned int off); void
rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context); void rds_ib_send_init_ring(struct rds_ib_connection *ic); void rds_ib_send_clear_ring(struct rds_ib_connection *ic); int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op); +void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits); +void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted); +int rds_ib_send_grab_credits(struct rds_ib_connection *ic, u32 wanted, + u32 *adv_credits); /* ib_stats.c */ RDS_DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats); @@ -287,6 +316,7 @@ extern unsigned long rds_ib_sysctl_max_r extern unsigned long rds_ib_sysctl_max_unsig_wrs; extern unsigned long rds_ib_sysctl_max_unsig_bytes; extern unsigned long rds_ib_sysctl_max_recv_allocation; +extern unsigned int rds_ib_sysctl_flow_control; extern ctl_table rds_ib_sysctl_table[]; /* Index: ofa_kernel-1.3/net/rds/ib_cm.c =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib_cm.c +++ ofa_kernel-1.3/net/rds/ib_cm.c @@ -55,6 +55,22 @@ static void rds_ib_set_protocol(struct r } /* + * Set up flow control + */ +static void rds_ib_set_flow_control(struct rds_connection *conn, u32 credits) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + if (rds_ib_sysctl_flow_control && credits != 0) { + /* We're doing flow control */ + ic->i_flowctl = 1; + rds_ib_send_add_credits(conn, credits); + } else { + ic->i_flowctl = 0; + } +} + +/* * Connection established. * We get here for both outgoing and incoming connection. */ @@ -72,12 +88,16 @@ static void rds_ib_connect_complete(stru rds_ib_set_protocol(conn, RDS_PROTOCOL(dp->dp_protocol_major, dp->dp_protocol_minor)); + rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); } - rdsdebug("RDS/IB: ib conn complete on %u.%u.%u.%u version %u.%u\n", + printk(KERN_NOTICE "RDS/IB: connected to %u.%u.%u.%u version %u.%u%s\n", NIPQUAD(conn->c_laddr), RDS_PROTOCOL_MAJOR(conn->c_version), - RDS_PROTOCOL_MINOR(conn->c_version)); + RDS_PROTOCOL_MINOR(conn->c_version), + ic->i_flowctl? ", flow control" : ""); + + rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 1); /* Tune the RNR timeout. We use a rather low timeout, but * not the absolute minimum - this should be tunable. @@ -129,6 +149,24 @@ static void rds_ib_cm_fill_conn_param(st dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS); dp->dp_ack_seq = rds_ib_piggyb_ack(ic); + /* Advertise flow control. + * + * Major chicken and egg alert! + * We would like to post receive buffers before we get here (eg. + * in rds_ib_setup_qp), so that we can give the peer an accurate + * credit value. + * Unfortunately we can't post receive buffers until we've finished + * protocol negotiation, and know in which order data and payload + * are arranged. + * + * What we do here is we give the peer a small initial credit, and + * initialize the number of posted buffers to a negative value. + */ + if (ic->i_flowctl) { + atomic_set(&ic->i_credits, IB_SET_POST_CREDITS(-4)); + dp->dp_credit = cpu_to_be32(4); + } + conn_param->private_data = dp; conn_param->private_data_len = sizeof(*dp); } @@ -363,6 +401,7 @@ static int rds_ib_cm_handle_connect(stru ic = conn->c_transport_data; rds_ib_set_protocol(conn, version); + rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); /* If the peer gave us the last packet it saw, process this as if * we had received a regular ACK. 
*/ @@ -428,6 +467,7 @@ out: static int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id) { struct rds_connection *conn = cm_id->context; + struct rds_ib_connection *ic = conn->c_transport_data; struct rdma_conn_param conn_param; struct rds_ib_connect_private dp; int ret; @@ -435,6 +475,7 @@ static int rds_ib_cm_initiate_connect(st /* If the peer doesn't do protocol negotiation, we must * default to RDSv3.0 */ rds_ib_set_protocol(conn, RDS_PROTOCOL_3_0); + ic->i_flowctl = rds_ib_sysctl_flow_control; /* advertise flow control */ ret = rds_ib_setup_qp(conn); if (ret) { @@ -688,6 +729,10 @@ void rds_ib_conn_shutdown(struct rds_con #endif ic->i_ack_recv = 0; + /* Clear flow control state */ + ic->i_flowctl = 0; + atomic_set(&ic->i_credits, 0); + if (ic->i_ibinc) { rds_inc_put(&ic->i_ibinc->ii_inc); ic->i_ibinc = NULL; Index: ofa_kernel-1.3/net/rds/ib_recv.c =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib_recv.c +++ ofa_kernel-1.3/net/rds/ib_recv.c @@ -220,16 +220,17 @@ out: * -1 is returned if posting fails due to temporary resource exhaustion. */ int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, - gfp_t page_gfp) + gfp_t page_gfp, int prefill) { struct rds_ib_connection *ic = conn->c_transport_data; struct rds_ib_recv_work *recv; struct ib_recv_wr *failed_wr; + unsigned int posted = 0; int ret = 0; u32 pos; - while (rds_conn_up(conn) && rds_ib_ring_alloc(&ic->i_recv_ring, 1, &pos)) { - + while ((prefill || rds_conn_up(conn)) + && rds_ib_ring_alloc(&ic->i_recv_ring, 1, &pos)) { if (pos >= ic->i_recv_ring.w_nr) { printk(KERN_NOTICE "Argh - ring alloc returned pos=%u\n", pos); @@ -257,8 +258,14 @@ int rds_ib_recv_refill(struct rds_connec ret = -1; break; } + + posted++; } + /* We're doing flow control - update the window. */ + if (ic->i_flowctl && posted) + rds_ib_advertise_credits(conn, posted); + if (ret) rds_ib_ring_unalloc(&ic->i_recv_ring, 1); return ret; @@ -436,7 +443,7 @@ static u64 rds_ib_get_ack(struct rds_ib_ #endif -static void rds_ib_send_ack(struct rds_ib_connection *ic) +static void rds_ib_send_ack(struct rds_ib_connection *ic, unsigned int adv_credits) { struct rds_header *hdr = ic->i_ack; struct ib_send_wr *failed_wr; @@ -448,6 +455,7 @@ static void rds_ib_send_ack(struct rds_i rdsdebug("send_ack: ic %p ack %llu\n", ic, (unsigned long long) seq); rds_message_populate_header(hdr, 0, 0, 0); hdr->h_ack = cpu_to_be64(seq); + hdr->h_credit = adv_credits; rds_message_make_checksum(hdr); ic->i_ack_queued = jiffies; @@ -460,6 +468,8 @@ static void rds_ib_send_ack(struct rds_i set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); rds_ib_stats_inc(s_ib_ack_send_failure); + /* Need to finesse this later. */ + BUG(); } else rds_ib_stats_inc(s_ib_ack_sent); } @@ -502,15 +512,27 @@ static void rds_ib_send_ack(struct rds_i * When we get here, we're called from the recv queue handler. * Check whether we ought to transmit an ACK. */ -static void rds_ib_attempt_ack(struct rds_ib_connection *ic) +void rds_ib_attempt_ack(struct rds_ib_connection *ic) { + unsigned int adv_credits; + if (!test_bit(IB_ACK_REQUESTED, &ic->i_ack_flags)) return; - if (!test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) { - clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); - rds_ib_send_ack(ic); - } else + + if (test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) { rds_ib_stats_inc(s_ib_ack_send_delayed); + return; + } + + /* Can we get a send credit? 
*/ + if (!rds_ib_send_grab_credits(ic, 1, &adv_credits)) { + rds_ib_stats_inc(s_ib_tx_throttle); + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + return; + } + + clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + rds_ib_send_ack(ic, adv_credits); } /* @@ -706,6 +728,10 @@ void rds_ib_process_recv(struct rds_conn state->ack_recv = be64_to_cpu(ihdr->h_ack); state->ack_recv_valid = 1; + /* Process the credits update if there was one */ + if (ihdr->h_credit) + rds_ib_send_add_credits(conn, ihdr->h_credit); + if (ihdr->h_sport == 0 && ihdr->h_dport == 0 && byte_len == 0) { /* This is an ACK-only packet. The fact that it gets * special treatment here is that historically, ACKs @@ -877,7 +903,7 @@ void rds_ib_recv_cq_comp_handler(struct if (mutex_trylock(&ic->i_recv_mutex)) { if (rds_ib_recv_refill(conn, GFP_ATOMIC, - GFP_ATOMIC | __GFP_HIGHMEM)) + GFP_ATOMIC | __GFP_HIGHMEM, 0)) ret = -EAGAIN; else rds_ib_stats_inc(s_ib_rx_refill_from_cq); @@ -901,7 +927,7 @@ int rds_ib_recv(struct rds_connection *c * we're really low and we want the caller to back off for a bit. */ mutex_lock(&ic->i_recv_mutex); - if (rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER)) + if (rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 0)) ret = -ENOMEM; else rds_ib_stats_inc(s_ib_rx_refill_from_thread); Index: ofa_kernel-1.3/net/rds/ib.c =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib.c +++ ofa_kernel-1.3/net/rds/ib.c @@ -187,6 +187,7 @@ static void rds_ib_exit(void) struct rds_transport rds_ib_transport = { .laddr_check = rds_ib_laddr_check, + .xmit_complete = rds_ib_xmit_complete, .xmit = rds_ib_xmit, .xmit_cong_map = NULL, .xmit_rdma = rds_ib_xmit_rdma, Index: ofa_kernel-1.3/net/rds/ib_send.c =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib_send.c +++ ofa_kernel-1.3/net/rds/ib_send.c @@ -245,6 +245,144 @@ void rds_ib_send_cq_comp_handler(struct } } +/* + * This is the main function for allocating credits when sending + * messages. + * + * Conceptually, we have two counters: + * - send credits: this tells us how many WRs we're allowed + * to submit without overrunning the receiver's queue. For + * each SEND WR we post, we decrement this by one. + * + * - posted credits: this tells us how many WRs we recently + * posted to the receive queue. This value is transferred + * to the peer as a "credit update" in an RDS header field. + * Every time we transmit credits to the peer, we subtract + * the amount of transferred credits from this counter. + * + * It is essential that we avoid situations where both sides have + * exhausted their send credits, and are unable to send new credits + * to the peer. We achieve this by requiring that we send at least + * one credit update to the peer before exhausting our credits. + * When new credits arrive, we subtract one credit that is withheld + * until we've posted new buffers and are ready to transmit these + * credits (see rds_ib_send_add_credits below). + * + * The RDS send code is essentially single-threaded; rds_send_xmit + * grabs c_send_sem to ensure exclusive access to the send ring. + * However, the ACK sending code is independent and can race with + * message SENDs. + * + * In the send path, we need to update the counters for send credits + * and the counter of posted buffers atomically - when we use the + * last available credit, we cannot allow another thread to race us + * and grab the posted credits counter.
Hence, we have to use a + * spinlock to protect the credit counter, or use atomics. + * + * Spinlocks shared between the send and the receive path are bad, + * because they create unnecessary delays. An early implementation + * using a spinlock showed a 5% degradation in throughput at some + * loads. + * + * This implementation avoids spinlocks completely, putting both + * counters into a single atomic, and updating that atomic using + * atomic_add (in the receive path, when receiving fresh credits), + * and using atomic_cmpxchg when updating the two counters. + */ +int rds_ib_send_grab_credits(struct rds_ib_connection *ic, + u32 wanted, u32 *adv_credits) +{ + unsigned int avail, posted, got = 0, advertise; + long oldval, newval; + + *adv_credits = 0; + if (!ic->i_flowctl) + return wanted; + +try_again: + advertise = 0; + oldval = newval = atomic_read(&ic->i_credits); + posted = IB_GET_POST_CREDITS(oldval); + avail = IB_GET_SEND_CREDITS(oldval); + + rdsdebug("rds_ib_send_grab_credits(%u): credits=%u posted=%u\n", + wanted, avail, posted); + + /* The last credit must be used to send a credit update. */ + if (avail && !posted) + avail--; + + if (avail < wanted) { + struct rds_connection *conn = ic->i_cm_id->context; + + /* Oops, there aren't that many credits left! */ + set_bit(RDS_LL_SEND_FULL, &conn->c_flags); + got = avail; + } else { + /* Sometimes you get what you want, lalala. */ + got = wanted; + } + newval -= IB_SET_SEND_CREDITS(got); + + if (got && posted) { + advertise = min_t(unsigned int, posted, RDS_MAX_ADV_CREDIT); + newval -= IB_SET_POST_CREDITS(advertise); + } + + /* Finally bill everything */ + if (atomic_cmpxchg(&ic->i_credits, oldval, newval) != oldval) + goto try_again; + + *adv_credits = advertise; + return got; +} + +void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + if (credits == 0) + return; + + rdsdebug("rds_ib_send_add_credits(%u): current=%u%s\n", + credits, + IB_GET_SEND_CREDITS(atomic_read(&ic->i_credits)), + test_bit(RDS_LL_SEND_FULL, &conn->c_flags)? ", ll_send_full" : ""); + + atomic_add(IB_SET_SEND_CREDITS(credits), &ic->i_credits); + if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags)) + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + + WARN_ON(IB_GET_SEND_CREDITS(credits) >= 16384); + + rds_ib_stats_inc(s_ib_rx_credit_updates); +} + +void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + if (posted == 0) + return; + + atomic_add(IB_SET_POST_CREDITS(posted), &ic->i_credits); + + /* Decide whether to send an update to the peer now. + * If we would send a credit update for every single buffer we + * post, we would end up with an ACK storm (ACK arrives, + * consumes buffer, we refill the ring, send ACK to remote + * advertising the newly posted buffer... ad inf) + * + * Performance pretty much depends on how often we send + * credit updates - too frequent updates mean lots of ACKs. + * Too infrequent updates, and the peer will run out of + * credits and have to throttle. + * For the time being, 16 seems to be a good compromise.
+ */ + if (IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)) >= 16) + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); +} + static inline void rds_ib_xmit_populate_wr(struct rds_ib_connection *ic, struct rds_ib_send_work *send, unsigned int pos, @@ -307,6 +445,8 @@ int rds_ib_xmit(struct rds_connection *c u32 pos; u32 i; u32 work_alloc; + u32 credit_alloc; + u32 adv_credits = 0; int send_flags = 0; int sent; int ret; @@ -314,6 +454,7 @@ int rds_ib_xmit(struct rds_connection *c BUG_ON(off % RDS_FRAG_SIZE); BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header)); + /* FIXME we may overallocate here */ if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) i = 1; else @@ -327,8 +468,29 @@ int rds_ib_xmit(struct rds_connection *c goto out; } + credit_alloc = work_alloc; + if (ic->i_flowctl) { + credit_alloc = rds_ib_send_grab_credits(ic, work_alloc, &adv_credits); + if (credit_alloc < work_alloc) { + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - credit_alloc); + work_alloc = credit_alloc; + } + if (work_alloc == 0) { + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + rds_ib_stats_inc(s_ib_tx_throttle); + ret = -ENOMEM; + goto out; + } + } + /* map the message the first time we see it */ if (ic->i_rm == NULL) { + /* + printk(KERN_NOTICE "rds_ib_xmit prep msg dport=%u flags=0x%x len=%d\n", + be16_to_cpu(rm->m_inc.i_hdr.h_dport), + rm->m_inc.i_hdr.h_flags, + be32_to_cpu(rm->m_inc.i_hdr.h_len)); + */ if (rm->m_nents) { rm->m_count = ib_dma_map_sg(dev, rm->m_sg, rm->m_nents, DMA_TO_DEVICE); @@ -449,6 +611,24 @@ add_header: * have been set up to point to the right header buffer. */ memcpy(&ic->i_send_hdrs[pos], &rm->m_inc.i_hdr, sizeof(struct rds_header)); + if (0) { + struct rds_header *hdr = &ic->i_send_hdrs[pos]; + + printk(KERN_NOTICE "send WR dport=%u flags=0x%x len=%d\n", + be16_to_cpu(hdr->h_dport), + hdr->h_flags, + be32_to_cpu(hdr->h_len)); + } + if (adv_credits) { + struct rds_header *hdr = &ic->i_send_hdrs[pos]; + + /* add credit and redo the header checksum */ + hdr->h_credit = adv_credits; + rds_message_make_checksum(hdr); + adv_credits = 0; + rds_ib_stats_inc(s_ib_tx_credit_updates); + } + if (prev) prev->s_wr.next = &send->s_wr; prev = send; @@ -472,6 +652,8 @@ add_header: rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - i); work_alloc = i; } + if (ic->i_flowctl && i < credit_alloc) + rds_ib_send_add_credits(conn, credit_alloc - i); /* XXX need to worry about failed_wr and partial sends. */ failed_wr = &first->s_wr; @@ -487,11 +669,14 @@ add_header: ic->i_rm = prev->s_rm; prev->s_rm = NULL; } + /* Finesse this later */ + BUG(); goto out; } ret = sent; out: + BUG_ON(adv_credits); return ret; } @@ -630,3 +815,12 @@ int rds_ib_xmit_rdma(struct rds_connecti out: return ret; } + +void rds_ib_xmit_complete(struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + /* We may have a pending ACK or window update we were unable + * to send previously (due to flow control). Try again. 
+ */ + rds_ib_attempt_ack(ic); +} Index: ofa_kernel-1.3/net/rds/ib_stats.c =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib_stats.c +++ ofa_kernel-1.3/net/rds/ib_stats.c @@ -46,14 +46,17 @@ static char *rds_ib_stat_names[] = { "ib_tx_cq_call", "ib_tx_cq_event", "ib_tx_ring_full", + "ib_tx_throttle", "ib_tx_sg_mapping_failure", "ib_tx_stalled", + "ib_tx_credit_updates", "ib_rx_cq_call", "ib_rx_cq_event", "ib_rx_ring_empty", "ib_rx_refill_from_cq", "ib_rx_refill_from_thread", "ib_rx_alloc_limit", + "ib_rx_credit_updates", "ib_ack_sent", "ib_ack_send_failure", "ib_ack_send_delayed", Index: ofa_kernel-1.3/net/rds/rds.h =================================================================== --- ofa_kernel-1.3.orig/net/rds/rds.h +++ ofa_kernel-1.3/net/rds/rds.h @@ -170,6 +170,7 @@ struct rds_connection { #define RDS_FLAG_CONG_BITMAP 0x01 #define RDS_FLAG_ACK_REQUIRED 0x02 #define RDS_FLAG_RETRANSMITTED 0x04 +#define RDS_MAX_ADV_CREDIT 255 /* * Maximum space available for extension headers. @@ -183,7 +184,8 @@ struct rds_header { __be16 h_sport; __be16 h_dport; u8 h_flags; - u8 h_padding[5]; + u8 h_credit; + u8 h_padding[4]; __sum16 h_csum; u8 h_exthdr[RDS_HEADER_EXT_SPACE]; Index: ofa_kernel-1.3/net/rds/ib_sysctl.c =================================================================== --- ofa_kernel-1.3.orig/net/rds/ib_sysctl.c +++ ofa_kernel-1.3/net/rds/ib_sysctl.c @@ -53,6 +53,8 @@ unsigned long rds_ib_sysctl_max_unsig_by static unsigned long rds_ib_sysctl_max_unsig_bytes_min = 1; static unsigned long rds_ib_sysctl_max_unsig_bytes_max = ~0UL; +unsigned int rds_ib_sysctl_flow_control = 1; + ctl_table rds_ib_sysctl_table[] = { { .ctl_name = 1, @@ -102,6 +104,14 @@ ctl_table rds_ib_sysctl_table[] = { .mode = 0644, .proc_handler = &proc_doulongvec_minmax, }, + { + .ctl_name = 6, + .procname = "flow_control", + .data = &rds_ib_sysctl_flow_control, + .maxlen = sizeof(rds_ib_sysctl_flow_control), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, { .ctl_name = 0} }; From ogerlitz at voltaire.com Mon May 19 01:26:29 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 May 2008 11:26:29 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223419.27221.49014.stgit@dell3.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> Message-ID: <483139B5.8040908@voltaire.com> Steve Wise wrote: > Support for the IB BMME and iWARP equivalent memory extensions to > non shared memory regions. Usage Model: > > - MR allocated with ib_alloc_mr() > - Page lists allocated via ib_alloc_fast_reg_page_list(). > - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) > - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) > - MR deallocated with ib_dereg_mr() > - page lists dealloced via ib_free_fast_reg_page_list(). Steve, Does this design go hand-in-hand with remote invalidation? Such that if the remote side invalidated the mapping, there is no need to issue the IB_WR_INVALIDATE_MR work request. Also, does the proposed design support fmr pages of granularity different from the OS page size? For example, the OS pages are 4K and the ULP wants to use fmrs of 512-byte "pages" (the "block lists" feature), etc. In that case, doesn't the size of each page have to be specified as a param to the alloc_fast_reg_mr() verb?
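To make the usage model quoted above concrete, here is a rough sketch of posting the two work requests (the wr.fast_reg/wr.local_inv field names follow the ib_verbs.h hunk quoted below; the surrounding variables, and treating page_size as a byte count, are assumptions about the RFC rather than code from it):

struct ib_send_wr wr, *bad_wr;
int ret;

/* Bind the MR to a physical page list - this is what makes it VALID */
memset(&wr, 0, sizeof wr);
wr.opcode = IB_WR_FAST_REG_MR;
wr.send_flags = IB_SEND_SIGNALED;
wr.wr.fast_reg.mr = mr;			/* from ib_alloc_mr() */
wr.wr.fast_reg.page_list = page_list;	/* from ib_alloc_fast_reg_page_list() */
wr.wr.fast_reg.page_list_len = npages;
wr.wr.fast_reg.page_size = PAGE_SIZE;
wr.wr.fast_reg.iova_start = io_addr;
wr.wr.fast_reg.first_byte_offset = 0;
wr.wr.fast_reg.length = len;
wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE |
			      IB_ACCESS_REMOTE_READ |
			      IB_ACCESS_REMOTE_WRITE;
ret = ib_post_send(qp, &wr, &bad_wr);

/* ... later, once the remote I/O is done, make the MR INVALID again */
memset(&wr, 0, sizeof wr);
wr.opcode = IB_WR_INVALIDATE_MR;
wr.wr.local_inv.mr = mr;
ret = ib_post_send(qp, &wr, &bad_wr);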
> > Applications can allocate a fast_reg mr once, and then can repeatedly > bind the mr to different physical memory SGLs via posting work requests > to the send queue. For each outstanding mr-to-pbl binding in the SQ > pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can > be achieved while still allowing device-specific page_list processing. mmm, is it a must for the ULP to issue a page list alloc/free per IB_WR_FAST_REG_MR call? > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -676,6 +683,20 @@ struct ib_send_wr { > u16 pkey_index; /* valid for GSI only */ > u8 port_num; /* valid for DR SMPs on switch only */ > } ud; > + struct { > + u64 iova_start; > + struct ib_mr *mr; > + struct ib_fast_reg_page_list *page_list; > + unsigned int page_size; > + unsigned int page_list_len; > + unsigned int first_byte_offset; > + u32 length; > + int access_flags; > + > + } fast_reg; > + struct { > + struct ib_mr *mr; > + } local_inv; > } wr; > }; I suggest using a "page_shift" notation and not "page_size", to comply with the kernel semantics of other APIs. Or. From Sumit.Gaur at Sun.COM Mon May 19 02:55:00 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Mon, 19 May 2008 15:25:00 +0530 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <20080513185404.3D00BE60C16@openfabrics.org> References: <20080513185404.3D00BE60C16@openfabrics.org> Message-ID: <48314E74.9010107@Sun.COM> Hi I have an issue while my program is interacting with the OFED umad library. I have two separate threads, one for sending SMP and GMP packets and another to receive responses. Things are working fine, but during the whole process I keep receiving packets with an unknown tid apart from the correct responses. Is this correct behavior? If yes, how can I avoid them? Thanks and Regards sumit general-request at lists.openfabrics.org wrote: > Send general mailing list submissions to > general at lists.openfabrics.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > or, via email, send a message with subject or body 'help' to > general-request at lists.openfabrics.org > > You can reach the person managing the list at > general-owner at lists.openfabrics.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of general digest..." > > > Today's Topics: > > 1. Re: [PATCH] IB/core: handle race between elements in qork > queues after event (Roland Dreier) > 2. Re: RDS flow control (Steve Wise) > 3. Re: RDS flow control (Olaf Kirch) > 4. Re: RDS flow control (Steve Wise) > 5. Re: RDS flow control (Olaf Kirch) > 6. Re: [PATCH 3/3] IB/ipath - fix RDMA read response sequence > checking (Roland Dreier) > 7. Re: [PATCH][INFINIBAND]: Make ipath_portdata work with > struct pid * not pid_t. (Roland Dreier) > 8. Re: bitops take an unsigned long * (Roland Dreier) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Tue, 13 May 2008 10:41:39 -0700 > From: Roland Dreier > Subject: Re: [ofa-general] [PATCH] IB/core: handle race between > elements in qork queues after event > To: Moni Shoua > Cc: Olga Stern , OpenFabrics General > > Message-ID: > Content-Type: text/plain; charset=us-ascii > > > Can we please go on with this patch? We would like to see it in the next kernel. > > I still don't get why this is important to you. Is there a concrete > example of a situation where this actually makes a measurable difference?
> > We need some justification for adding this locking complexity beyond "it > doesn't hurt." (And also of course we need it fixed so there aren't races) > > - R. > > > ------------------------------ > > Message: 2 > Date: Tue, 13 May 2008 12:58:11 -0500 > From: Steve Wise > Subject: Re: [ofa-general] RDS flow control > To: Richard Frank > Cc: rds-devel at oss.oracle.com, general at lists.openfabrics.org > Message-ID: <4829D6B3.5080900 at opengridcomputing.com> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Richard Frank wrote: > >>Steve Wise wrote: >> >>>Olaf Kirch wrote: >>> >>>>On Monday 12 May 2008 18:57:38 Jon Mason wrote: >>>> >>>> >>>>>As part of my effort to get RDS working for iWARP, I will be >>>>>working on the RDS flow control. Flow control is needed for iWARP >>>>>due to the fact that iWARP connections terminate if there is no >>>>>posted recv for an incoming packet. IB connections do not have >>>>>this limitation if setup in a certain way. In its current >>>>>implementation, RDS sets the connection attribute rnr_retry to 7. >>>>>This causes IB to retransmit until there is a posted recv buffer. >>>> >>>>I think for the initial implementation, it is fine for iWARP to just >>>>fail the connect when that happens, and re-establish the connection. >>>> >>>>If you use reasonable defaults for the send and recv queues, receiver >>>>overruns should be relatively rare. >>>> >>>>Once everything else works, let's revisit the flow control part. >>>> >>>> >>> >>>I _think_ you'll hit this quickly with one-way flows. Send >>>completions for iWARP only mean the user's buffer can be reused. Not >>>that its placed at the remote peer or in the remote user's buffer. >>> >> >>Let's see what happens - anyway - this could be solved in an IWARP >>extension to RDS - right ? > > > > Yes, by adding flow control. And it could be iwarp-specific if you > want. I would not suggest relying on connection termination and > re-establishment as the way to handle this :). > > > > >>>But perhaps I'm wrong. Jon, maybe you should try to hit this with IB >>>and rnr_retry == 0 using the rds perf tools? >>>Also "the everything else" part depends on remove fmr usage. I'm >>>working on the new RDMA memory verbs allowing fast registration of >>>physical memory via a send WR. To support iWARP we need to remove >>>the fmr usage from RDS. The idea was to replace fmrs with the new >>>fastreg verbs. Thoughts? >>> >> >>What does "fast" imply here - how does this compare to the performance >>of FMRs ? > > > > Don't know yet, but probably as fast. > > >>Why would not push memory window creation into the RDS transport >>specific implementations ? > > > Isn't it already transport-specific? IE you don't need FMRs for TCP. > (I'm ignorant on the specifics of the implementation at this point, so > please excuse any dumb statements :) > > > >>Changing the API may be OK - if we retain the performance we have with >>IB. > > > > I assume nothing would fly that regresses IB performance. Worst case, > you have an iwarp-specific RDS transport like you do for TCP, I guess. > Hopefully though, IB + iWARP will be a common transport. > > > >>>Stay tuned for the new verbs API RFC... >>> >>>Steve. 
>>>_______________________________________________ >>>general mailing list >>>general at lists.openfabrics.org >>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>>To unsubscribe, please visit >>>http://openib.org/mailman/listinfo/openib-general > > > > > ------------------------------ > > Message: 3 > Date: Tue, 13 May 2008 20:04:00 +0200 > From: Olaf Kirch > Subject: Re: [ofa-general] RDS flow control > To: Steve Wise > Cc: rds-devel at oss.oracle.com, general at lists.openfabrics.org > Message-ID: <200805132004.01371.okir at lst.de> > Content-Type: text/plain; charset="iso-8859-1" > > On Tuesday 13 May 2008 19:58:11 Steve Wise wrote: > >>Yes, by adding flow control. And it could be iwarp-specific if you >>want. I would not suggest relying on connection termination and >>re-establishment as the way to handle this :). > > No, not in the long term. But let's hold off on the flow control stuff > for a little - I would first like to finish my patch set and hand it > out for you folks to bang on it, rather than the other way round. > Okay with you guys? > >>I assume nothing would fly that regresses IB performance. Worst case, >>you have an iwarp-specific RDS transport like you do for TCP, I guess. >>Hopefully though, IB + iWARP will be a common transport. > > If it turns out that way, fine. If iWARP ends up sharing 80% of the > code with IB except the RDMA specific functions, I think that's > very much acceptable, too. > > Olaf From ruimario at gmail.com Mon May 19 03:49:43 2008 From: ruimario at gmail.com (Rui Machado) Date: Mon, 19 May 2008 12:49:43 +0200 Subject: [ofa-general] timeout question In-Reply-To: <482DE8E3.4090200@gmail.com> References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> <482E3AEE.4070603@gmail.com> <6978b4af0805161201r4a113a1cpd98cbb1e81cf175a@mail.gmail.com> <482DE8E3.4090200@gmail.com> Message-ID: <6978b4af0805190349o11c7eab4t74607708a369489@mail.gmail.com> 2008/5/16 Dotan Barak : > Rui Machado wrote: >> >> 2008/5/17 Dotan Barak : >>> >>> Rui Machado wrote: >>>> >>>> 2008/5/16 Roland Dreier : >>>>> >>>>> > hmm..... and is there no workaround for this, for this situation? I >>>>> > mean, if the server dies isn't there any possibility that >>>>> > the sender/client realizes this. If the timeout is too large this >>>>> > can be cumbersome. >>>>> > >>>>> > I tried reducing the timeout and indeed the client realizes faster >>>>> > when the server exits but another problem arises: without exiting the >>>>> > server, >>>>> > on the client side I get the error (retry exceeded) when polling for a >>>>> > recently posted send - this after some hours. >>>>> >>>>> There's a tradeoff between detecting real failures faster, and reducing >>>>> false errors detected because a response came too slowly. >>>>> >>>>> Clearly if a response may take an amount of time 'X' to be received >>>>> under normal conditions, there's no way to conclude that the remote >>>>> side >>>>> has failed without waiting at least 'X'. >>>>> >>>>> >>>> >>>> I understand. So there's no real difference between the two >>>> situations, real server failure or just a load problem that takes more >>>> time? >>>> >>>> >>> >>> From the sender QP point of view, they are the same (an ack/nack wasn't sent >>> during a specific >>> period of time) >>>>
>>>> >>>> I will describe my situation, maybe it helps (bare with me as I'm >>>> starting with Infiniband and so on) >>>> I have a client and a server.The clients posts RDMA calls one at a >>>> time (post, poll, post...). So server is just there. >>>> If I try to start something like 16 clients on 1 machine, after a few >>>> hours I will get an error on some client programs (retry excess) with >>>> a timeout of 14. If I increase the timeout for 32, I don't see that >>>> error but if I stop the server, the clients take a lot of time to >>>> acknowledge that, which is also not wanted. >>>> That's why I asked if there a 'good value'. If I have such a load >>>> between 2 nodes, I always have to risk that if the server dies the >>>> client will take much time to see it. That's not nice! >>>> >>>> >>> >>> Did you try to increase the retry_count too? >>> (and not only the timeout). >>> > > Yes. >> >> But that wouldn't change my scenario since the overall time is given >> by the timeout * retry count right? >> >> >>> >>> By the way, Which RDMA operation do you execute READ or WRITE? >>> >> >> READ. >> > > Can you replace it with a write (from the other side)? > READ has "higher price" than a WRITE. > Can you please, shortly explain why this higher price? > Anyway, you should get the mentioned behavior anyway.. > > When the sender get the error, what is the status of the receiver QP? > (did you try to execute ibv_query_qp and get its status?) > I tried to get the qp state right after the error and it is 6 (which I believe is IBV_QPS_ERR). Why do you ask? Thanks Rui From ogerlitz at voltaire.com Mon May 19 04:29:30 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 May 2008 14:29:30 +0300 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <482C68A4.9020305@opengridcomputing.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> <482C68A4.9020305@opengridcomputing.com> Message-ID: <4831649A.2020206@voltaire.com> Steve Wise wrote: > Sean Hefty wrote: >> I think we should just always report this event, and let users ignore >> it if they >> want. We don't seem to gain much by filtering the event at a lower >> level. > Um, doesn't that then change the ABI? Some apps might hurl on a new > (unexpected) event. Steve, I think ULPs should be designed/coded to live well with new events delivered by the rdma-cm as (A) the model is event based and (B) such events can be introduced while developing new features... So my suggestion is that a ULP which is limited in that sense would have to state in its package dependency requirements that they are dependent on librdmacm 1.0.7 or earlier. Ofcourse, the kernel is one package so I will make sure that the current intree rdma-cm consumers (iser, rnfs) live well with this event. Sean, please let me know your preference (as it was somehow unclear from the thread) if you want the delivery of this event to be dependent on the ulp asking for it or no. Or. 
From ogerlitz at voltaire.com Mon May 19 04:42:21 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 May 2008 14:42:21 +0300 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> References: <482C4A52.4000501@opengridcomputing.com> <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> Message-ID: <4831679D.9040804@voltaire.com> Sean Hefty wrote: >> So instead of adding a new function rdma_set_high_availability_mode, you >> could just set an option saying WANT_NETDEV_CHANGE_EVENTS. Maybe we >> need to add rdma_set_option() to the kernel RDMA-CM API? > I agree with this. Having a generic mechanism to report rare events would be > useful. Maybe the device removal notification can be adapted for this purpose? Sean, as suggested in the past (eg over the QoS discussion), rdma_set_option can serve more purposes, similarly to setsockopt, and I guess that down the road, as the RDMA stack gets enhanced with more features, adding these rdma_set/get_opt calls would make sense. So (Steve) in that respect, I don't see rdma_set_opt as a mechanism to report rare events. As I said, please let me know your preference so I can work on a patch. Or. From hrosenstock at xsigo.com Mon May 19 04:47:23 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 04:47:23 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <48314E74.9010107@Sun.COM> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> Message-ID: <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> Sumit, On Mon, 2008-05-19 at 15:25 +0530, Sumit Gaur - Sun Microsystem wrote: > Hi > I have an issue while my program is interacting with the OFED umad library. Are you referring to libibumad ? > I have two > separate threads, one for sending SMP and GMP packets and another to receive > responses. Things are working fine, but during the whole process I keep receiving > packets with an unknown tid apart from the correct responses. What's the exact message ? > Is this correct behavior? It could be; there's not enough info as to what is going on. It could be some unsolicited message (e.g. from SM) comes in during your transactions. Can you see what MADs are incoming ? One way to do that would be to run madeye. > If yes, how can I avoid them? Not sure what you are seeing yet. -- Hal
From Sumit.Gaur at Sun.COM Mon May 19 04:50:05 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Mon, 19 May 2008 17:20:05 +0530 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> Message-ID: <4831696D.6060409@Sun.COM> Hi Hal, Hal Rosenstock wrote: > Sumit, > > On Mon, 2008-05-19 at 15:25 +0530, Sumit Gaur - Sun Microsystem wrote: > >>Hi >>I have an issue while my program is interacting with the OFED umad library. > > > Are you referring to libibumad ? yes, I am using the mad_receive(0, -1) function to get my response back. > > >>I have two >>separate threads, one for sending SMP and GMP packets and another to receive >>responses. Things are working fine, but during the whole process I keep receiving >>packets with an unknown tid apart from the correct responses. > > > What's the exact message ?
Response comes as proper MAD packets, but with a "tid" that I have never sent, and my logic to keep track of send/response pkts failed. > > >> Is this correct behavior? > > > It could be; there's not enough info as to what is going on. It could be > some unsolicited message (e.g. from SM) comes in during your > transactions. Can you see what MADs are incoming ? One way to do that > would be to run madeye. Yes, I can see the complete MAD; the madhdr has the following fields: Response TID2 = 0x000000006701869b , BaseVersion = 1, MgmtClass=129, ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=435 If these are unsolicited packets, is there any way to filter them? Any reference for madeye? > > >>If yes, how can I avoid them? > > > Not sure what you are seeing yet. > > -- Hal
From ogerlitz at voltaire.com Mon May 19 05:05:08 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 19 May 2008 15:05:08 +0300 Subject: [ofa-general] Re: [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <000101c8b6ab$4fc9d7b0$bd59180a@amr.corp.intel.com> References: <000101c8b6ab$4fc9d7b0$bd59180a@amr.corp.intel.com> Message-ID: <48316CF4.2010507@voltaire.com> Sean Hefty wrote: >> +static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private >> *id_priv) >> +{ >> + struct rdma_dev_addr *dev_addr; >> + struct cma_work *work; >> + >> + dev_addr = &id_priv->id.route.addr.dev_addr; >> + >> + if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && >> + memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { >> + printk(KERN_ERR "addr change for device %s used by id %p, >> notifying\n", >> + ndev->name, &id_priv->id); >> + work = kzalloc(sizeof *work, GFP_KERNEL); >> + if (!work) >> + return -ENOMEM; >> + work->id = id_priv; >> + INIT_WORK(&work->work, cma_work_handler); >> + work->old_state = id_priv->state; >> + work->new_state = id_priv->state; >> + work->event.event = RDMA_CM_EVENT_NETDEV_CHANGE; >> + atomic_inc(&id_priv->refcount); >> + queue_work(cma_wq, &work->work); >> + } >> +} > > My initial thought on this is to see if we can just queue a single work item > that can be used to invoke the user callbacks. I'd have to see how the locking > worked out though to know if that approach is 'cleaner'. Sean, Yes, it is possible to queue a single work item that can be used to invoke the user callbacks, eg cma_netdev_change_handler() would be queued to be executed by a thread (eg the cma_wq mentioned below) and do what this code does. What makes you think that it's 'cleaner' to do it this way? > Currently, the rdma_cm ensures that only a single callback to the user is > invoked at a time. This is needed to support the user trying to destroy their > rdma_cm_id from the callback. I didn't look to see if this still maintains > that. OK, so I understand from the code that the callback to the user may be delivered not only through cma_work_handler (that is, in the context of the work queue thread that is created by the rdma-cm). So does the design keep up this serialization by tracking the ID state/changes before invoking the callback, or is a different method used? Or.
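For reference, the "single work item" flavor Sean suggests could look roughly like the following (a sketch only, modeled on the cma_work/cma_work_handler pattern in the hunk quoted above; the struct and handler names here are hypothetical, not part of the RFC):

struct cma_ndev_work {
	struct work_struct work;
	struct rdma_id_private *id;
	struct rdma_cm_event event;
};

static void cma_ndev_work_handler(struct work_struct *_work)
{
	struct cma_ndev_work *work =
		container_of(_work, struct cma_ndev_work, work);
	struct rdma_id_private *id_priv = work->id;
	int destroy = 0;

	/* Serialize with other callbacks on this id, so the ULP may
	 * destroy the id from within its handler. */
	if (id_priv->id.event_handler(&id_priv->id, &work->event))
		destroy = 1;

	cma_deref_id(id_priv);
	if (destroy)
		rdma_destroy_id(&id_priv->id);
	kfree(work);
}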
From hrosenstock at xsigo.com Mon May 19 06:01:21 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 06:01:21 -0700 Subject: [ofa-general] [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <483046CA.3010403@Voltaire.COM> References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <1211121890.12616.381.camel@hrosenstock-ws.xsigo.com> <483046CA.3010403@Voltaire.COM> Message-ID: <1211202081.12616.417.camel@hrosenstock-ws.xsigo.com> On Sun, 2008-05-18 at 18:10 +0300, Moni Shoua wrote: > Hal Rosenstock wrote: > > On Sun, 2008-05-18 at 15:36 +0300, Moni Shoua wrote: > >> The purpose of this patch is to make the events that are related to SM change > >> (namely CLIENT_REREGISTER event and SM_CHANGE event) less disruptive. > >> When SM related events are handled, it is not necessary to flush unicast > >> info from device but only multicast info. > > > > How is unicast invalidation handled on these changes ? On a local LID > > change event, how does an end port know/determine what else (e.g. other > > LIDs, paths) the SM might have changed (that specifically might affect > > IPoIB since this is limited to IPoIB) ? > I'm not sure I understand the question but local LID change would be handled as before > with a LID_CHANGE event. For this type of event, there is not change in what IPoIB does to cope. It's SM change which I'm not sure about. I'm unaware of an IBA spec guarantee on preservation of paths on SM failover. Can you point me at this ? Also, as many routing protocols are dependent on where they are run in the subnet (location of SM node in the topology), I don't think all path parameters can be maintained when in a heterogeneous subnet and hence would need refreshing (or flushing to cause this) on an SM change event. So while it may work in a homogeneous subnet, I don't think this is the general case. > > Also, wouldn't there be similar issues with other ULPs ? > There might be but the purpose of this one is to make things better for IPoIB Understood; just trying to widen the scope. IMO other ULPs should at least be inspected for the same issues. The multicast issue is IPoIB specific but local LID, client reregister (maybe only events for other ULPs as multicast and service records may not apply (perhaps except DAPL but this may be old implementation)) and SM changes apply to all. -- Hal > > -- Hal > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jackm at dev.mellanox.co.il Mon May 19 06:03:50 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 19 May 2008 16:03:50 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> Message-ID: <200805191603.50735.jackm@dev.mellanox.co.il> On Monday 19 May 2008 00:45, Roland Dreier wrote: > [Adding list cc] > > > +static int mlx4_log_pg_sz = 0; > > +module_param(mlx4_log_pg_sz, int, 0444); > > +MODULE_PARM_DESC(mlx4_log_pg_sz, > > + "set FW log system min page size (0 gets native FW min. default=0)"); > > Why do we need this module parameter? When would someone set it to anything > other than 0? 
This is in case at some installation, the administrator wishes to use the legacy device page size of 12, for example. Having a module parameter enables such tweaking to be done painlessly. > > > - return __mlx4_cmd(dev, in_param, out_param, 1, in_modifier, > > + return __mlx4_cmd(dev, in_param, out_param, out_param ? 1 : 0, in_modifier, > > op_modifier, op, timeout); > > I don't see any call to mlx4_cmd_imm in this patch -- why is this change needed? > You're right, this was just a hold-over from the first version of the patch (which used immediate data instead of the mailbox). - Jack From Thomas.Talpey at netapp.com Mon May 19 06:14:32 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 19 May 2008 09:14:32 -0400 Subject: Selective signalling vs WQE size - was Re: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: Message-ID: At 07:12 PM 5/16/2008, Roland Dreier wrote: >if we can't use the "WQE shrinking" feature (because of selective >signaling in the NFS/RDMA case), and we want to use 32 sge entries, then >the WQE size 's' will end up a little more than 512 bytes, and the >wqe_shift will end up as 10. Can you elaborate on this? The NFS/RDMA client does selective signalling on its send queue in order to save on interrupts and CQE generation/handling. Which I always thought was a (very) good approach. Because the RPC request/response paradigm guarantees an eventual receive completion, we simply defer (or even completely avoid) this work. Would that be a bad trade if it takes a WQE management opportunity away from the provider? It's quite easy to change this in the NFS/RDMA code, or make it a selectable parameter. Tom. From hrosenstock at xsigo.com Mon May 19 06:25:14 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 06:25:14 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <4831696D.6060409@Sun.COM> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> Message-ID: <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> Hi Sumit, On Mon, 2008-05-19 at 17:20 +0530, Sumit Gaur - Sun Microsystem wrote: > Hi Hal, > > > Hal Rosenstock wrote: > > Sumit, > > > > On Mon, 2008-05-19 at 15:25 +0530, Sumit Gaur - Sun Microsystem wrote: > > > >>Hi > >>I have an issue while my program interacting with OFED umad library. > > > > > > Are you referring to libibumad ? > yes, I am using mad_receive(0, -1) function to get my response back. OK. > >>I have two > >>separate threads one for sending SMP,GMP packets and another to receive > >>response. Things are working fine but during the whole process I keep receiving > >>packets with unknown tid apart from correct response. > > > > > > What's the exact message ? > Response comes as proper mad packets but with "tid" that I have never send and > my logic to keep track of send/response pkts failed. > > > > > >> Is it a correct behavior. > > > > > > It could be; there's not enough info as to what is going on. It could be > > some unsolicited message (e.g. from SM) comes in during your > > transactions. Can you see what MADs are incoming ? One way to do that > > would be to run madeye. > Yes I could see complete mad with madhdr as following fields > > Response TID2 = 0x000000006701869b , BaseVersion = 1, MgmtClass=129, > ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=435 Class 129 is a Subn directed route packet. 
Some of the other info (like attribute ID) doesn't look right to me but maybe that's something "special" to your environment. > If these are unsolicited packets. Is there anyway to filter them. Yes. How do you register ? > Any reference to madeye ? There's only the code for this (kernel module) which is added by OFED (not upstream) in drivers/infiniband/util but it's pretty straightforward to use. -- Hal > >>If yes how I could avoid them ? > > > > > > Not sure what you are seeing yet. > > > > -- Hal > > > > > >>Thanks and Regards > >>sumit > >> > >>general-request at lists.openfabrics.org wrote: > >> > >>>Send general mailing list submissions to > >>> general at lists.openfabrics.org > >>> > >>>To subscribe or unsubscribe via the World Wide Web, visit > >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>>or, via email, send a message with subject or body 'help' to > >>> general-request at lists.openfabrics.org > >>> > >>>You can reach the person managing the list at > >>> general-owner at lists.openfabrics.org > >>> > >>>When replying, please edit your Subject line so it is more specific > >>>than "Re: Contents of general digest..." > >>> > >>> > >>>Today's Topics: > >>> > >>> 1. Re: [PATCH] IB/core: handle race between elements in qork > >>> queues after event (Roland Dreier) > >>> 2. Re: RDS flow control (Steve Wise) > >>> 3. Re: RDS flow control (Olaf Kirch) > >>> 4. Re: RDS flow control (Steve Wise) > >>> 5. Re: RDS flow control (Olaf Kirch) > >>> 6. Re: [PATCH 3/3] IB/ipath - fix RDMA read response sequence > >>> checking (Roland Dreier) > >>> 7. Re: [PATCH][INFINIBAND]: Make ipath_portdata work with > >>> struct pid * not pid_t. (Roland Dreier) > >>> 8. Re: bitops take an unsigned long * (Roland Dreier) > >>> > >>> > >>>---------------------------------------------------------------------- > >>> > >>>Message: 1 > >>>Date: Tue, 13 May 2008 10:41:39 -0700 > >>>From: Roland Dreier > >>>Subject: Re: [ofa-general] [PATCH] IB/core: handle race between > >>> elements in qork queues after event > >>>To: Moni Shoua > >>>Cc: Olga Stern , OpenFabrics General > >>> > >>>Message-ID: > >>>Content-Type: text/plain; charset=us-ascii > >>> > >>> > Can we please go on with this patch? We would like to see it in the next kernel. > >>> > >>>I still don't get why this is important to you. Is there a concrete > >>>example of a situation where this actually makes a measurable difference? > >>> > >>>We need some justification for adding this locking complexity beyond "it > >>>doesn't hurt." (And also of course we need it fixed so there aren't races) > >>> > >>> - R. > >>> > >>> > >>>------------------------------ > >>> > >>>Message: 2 > >>>Date: Tue, 13 May 2008 12:58:11 -0500 > >>>From: Steve Wise > >>>Subject: Re: [ofa-general] RDS flow control > >>>To: Richard Frank > >>>Cc: rds-devel at oss.oracle.com, general at lists.openfabrics.org > >>>Message-ID: <4829D6B3.5080900 at opengridcomputing.com> > >>>Content-Type: text/plain; charset=ISO-8859-1; format=flowed > >>> > >>>Richard Frank wrote: > >>> > >>> > >>>>Steve Wise wrote: > >>>> > >>>> > >>>>>Olaf Kirch wrote: > >>>>> > >>>>> > >>>>>>On Monday 12 May 2008 18:57:38 Jon Mason wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>>>As part of my effort to get RDS working for iWARP, I will be > >>>>>>>working on the RDS flow control. Flow control is needed for iWARP > >>>>>>>due to the fact that iWARP connections terminate if there is no > >>>>>>>posted recv for an incoming packet. 
> >>>>> [...]

From swise at opengridcomputing.com Mon May 19 06:40:29 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 19 May 2008 08:40:29 -0500
Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
In-Reply-To: <483139B5.8040908@voltaire.com>
References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483139B5.8040908@voltaire.com>
Message-ID: <4831834D.3050705@opengridcomputing.com>

Or Gerlitz wrote:
> Steve Wise wrote:
>> Support for the IB BMME and iWARP equivalent memory extensions to non
>> shared memory regions. Usage Model:
>>
>> - MR allocated with ib_alloc_mr()
>> - Page lists allocated via ib_alloc_fast_reg_page_list().
>> - MR made VALID and bound to a specific page list via
>>   ib_post_send(IB_WR_FAST_REG_MR)
>> - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR)
>> - MR deallocated with ib_dereg_mr()
>> - page lists dealloced via ib_free_fast_reg_page_list().
>
> Steve,
>
> Does this design go hand-in-hand with remote invalidation? Such that
> if the remote side invalidated the mapping there is no need to issue
> the IB_WR_INVALIDATE_MR work request.

Yes.

> Also, does the proposed design support fmr pages of a granularity
> different than the OS ones? For example, the OS pages are 4K and the
> ULP wants to use fmrs with 512-byte "pages" (the "block lists"
> feature), etc. In that case, doesn't the size of each page have to be
> specified as a param to the alloc_fast_reg_mr() verb?

Page size is passed in at registration time. At allocation time, the HW
only needs to know what the max page list length (or PBL depth) will ever
be, so it can pre-allocate that at alloc time. The actual page list
length, the page size of each entry in the page list, and the page list
itself are passed in via the post_send(IB_WR_FAST_REG_MR) work request.
See the fast_reg union in struct ib_send_wr.

>> Applications can allocate a fast_reg mr once, and then can repeatedly
>> bind the mr to different physical memory SGLs via posting work requests
>> to the send queue. For each outstanding mr-to-pbl binding in the SQ
>> pipe, a fast_reg_page_list needs to be allocated. Thus pipelining can
>> be achieved while still allowing device-specific page_list processing.
>
> mmm, is it a must for the ULP to issue page list alloc/free per
> IB_WR_FAST_REG_MR call?

No, they can be reused as needed.
They typically will only get allocated once, used many times, then freed when the application is done. My point in the text above was that an application could allocate N page lists and use them in a pipeline for the same fast reg mr by fencing things appropriately in the SQ. >> --- a/include/rdma/ib_verbs.h >> +++ b/include/rdma/ib_verbs.h >> @@ -676,6 +683,20 @@ struct ib_send_wr { >> u16 pkey_index; /* valid for GSI only */ >> u8 port_num; /* valid for DR SMPs on switch only */ >> } ud; >> + struct { >> + u64 iova_start; >> + struct ib_mr *mr; >> + struct ib_fast_reg_page_list *page_list; >> + unsigned int page_size; >> + unsigned int page_list_len; >> + unsigned int first_byte_offset; >> + u32 length; >> + int access_flags; >> + >> + } fast_reg; >> + struct { >> + struct ib_mr *mr; >> + } local_inv; >> } wr; >> }; > I suggest to use a "page_shift" notation and not "page_size" to comply > with the kernel semantics of other APIs. > Ok, I wondered about that. It will also ensure a power of two. Steve. From Thomas.Talpey at netapp.com Mon May 19 06:53:23 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 19 May 2008 09:53:23 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4831834D.3050705@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483139B5.8040908@voltaire.com> <4831834D.3050705@opengridcomputing.com> Message-ID: At 09:40 AM 5/19/2008, Steve Wise wrote: >> I suggest to use a "page_shift" notation and not "page_size" to comply >> with the kernel semantics of other APIs. >> >Ok, I wondered about that. It will also ensure a power of two. Does it have to be ^2? In the iWARP spec development, we envisioned the possibility of arbitrary page sizes. I don't recall any such dependency in the protocol architecture. Storage has been known to adopt non ^2 blocks, for instance including block checksums in sectors, etc. If transferred, these will become quite inefficient on ^2 hardware. Tom. From swise at opengridcomputing.com Mon May 19 06:56:03 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 May 2008 08:56:03 -0500 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <4831679D.9040804@voltaire.com> References: <482C4A52.4000501@opengridcomputing.com> <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> <4831679D.9040804@voltaire.com> Message-ID: <483186F3.6090703@opengridcomputing.com> Or Gerlitz wrote: > Sean Hefty wrote: >>> So instead of adding a new function rdma_set_high_availability_mode, >>> you >>> could just set an option saying WANT_NETDEV_CHANGE_EVENTS. Maybe we >>> need to add rdma_set_option() to the kernel RDMA-CM API? >> I agree with this. Having a generic mechanism to report rare events >> would be >> useful. Maybe the device removal notification can be adapted for >> this purpose? > Sean, as suggested in the past (eg over the QoS discussion) > rdma_set_option can serve from more purposes similarly to setsockopt, > and I guess that down the road as the RDMA stack would get enhanced by > more features adding these rdma_set/get_opt calls would make sense. So > (Steve) in that respect, I don't see rdma_set_opt as a mechanism to > report rare events. > I don't understand your rationale above on why adding a new function is better than using an extensive "set this option" function. Can you clarify? 
Function rdma_set_option() wouldn't be the mechanism to report the events. It would be the mechanism to set an option indicating the cm_id wants NETDEV_CHANGE events. Seems like the exact fit rather than a new API call to set some very specific mode or option. I missed the QoS thread you're referencing, so please excuse me if I'm rehashing something that has been agreed-to in the past... Steve. From Sumit.Gaur at Sun.COM Mon May 19 06:49:41 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Mon, 19 May 2008 19:19:41 +0530 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> Message-ID: <48318575.7060701@Sun.COM> Hal Rosenstock wrote: > Hi Sumit, > > On Mon, 2008-05-19 at 17:20 +0530, Sumit Gaur - Sun Microsystem wrote: > >>Hi Hal, >> >> >>Hal Rosenstock wrote: >> >>>Sumit, >>> >>>On Mon, 2008-05-19 at 15:25 +0530, Sumit Gaur - Sun Microsystem wrote: >>> >>> >>>>Hi >>>>I have an issue while my program interacting with OFED umad library. >>> >>> >>>Are you referring to libibumad ? >> >>yes, I am using mad_receive(0, -1) function to get my response back. > > > OK. > > >>>>I have two >>>>separate threads one for sending SMP,GMP packets and another to receive >>>>response. Things are working fine but during the whole process I keep receiving >>>>packets with unknown tid apart from correct response. >>> >>> >>>What's the exact message ? >> >>Response comes as proper mad packets but with "tid" that I have never send and >>my logic to keep track of send/response pkts failed. >> >>> >>>>Is it a correct behavior. >>> >>> >>>It could be; there's not enough info as to what is going on. It could be >>>some unsolicited message (e.g. from SM) comes in during your >>>transactions. Can you see what MADs are incoming ? One way to do that >>>would be to run madeye. >> >>Yes I could see complete mad with madhdr as following fields >> >>Response TID2 = 0x000000006701869b , BaseVersion = 1, MgmtClass=129, >>ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=435 > > > Class 129 is a Subn directed route packet. Some of the other info (like > attribute ID) doesn't look right to me but maybe that's something > "special" to your environment. Sorry missed last number AttributeID=4352 > > >> If these are unsolicited packets. Is there anyway to filter them. > > > Yes. How do you register ? For registration I am calling madrpc_init(ca, ca_port, mgmt_classes, 4) function once before starting polling thread for SMI and GSI packet receive. Once I received packet I am filtering them on the basis of madhdr->MgmtClass. int mgmt_classes[4] = {IB_SMI_CLASS,IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS}; for given ca and ca port of local node. > > >>Any reference to madeye ? > > > There's only the code for this (kernel module) which is added by OFED > (not upstream) in drivers/infiniband/util but it's pretty > straightforward to use. > > -- Hal > > >>>>If yes how I could avoid them ? >>> >>> >>>Not sure what you are seeing yet. 
>>> [...]
Worst case, >>>>>>you have an iwarp-specific RDS transport like you do for TCP, I guess. >>>>>>Hopefully though, IB + iWARP will be a common transport. >>>>> >>>>> >>>>>If it turns out that way, fine. If iWARP ands up sharing 80% of the >>>>>code with IB except the RDMA specific functions, I think that's >>>>>very much acceptable, too. >>>>> >>>>>Olaf >>>> >>>>_______________________________________________ >>>>general mailing list >>>>general at lists.openfabrics.org >>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >>> > From swise at opengridcomputing.com Mon May 19 06:58:20 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 May 2008 08:58:20 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483139B5.8040908@voltaire.com> <4831834D.3050705@opengridcomputing.com> Message-ID: <4831877C.6010208@opengridcomputing.com> An HTML attachment was scrubbed... URL: From Thomas.Talpey at netapp.com Mon May 19 06:59:34 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 19 May 2008 09:59:34 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4831877C.6010208@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483139B5.8040908@voltaire.com> <4831834D.3050705@opengridcomputing.com> <4831877C.6010208@opengridcomputing.com> Message-ID: At 09:58 AM 5/19/2008, Steve Wise wrote: >>Storage has been known to adopt non ^2 blocks, for instance including >>block checksums in sectors, etc. If transferred, these will become quite >>inefficient on ^2 hardware. >> >> >Is this true today for any of the existing RDMA ULPs that will utilize fastreg? Ask the iSER and SRP folks. NFS won't. Tom. From jackm at dev.mellanox.co.il Mon May 19 07:03:05 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 19 May 2008 17:03:05 +0300 Subject: [ofa-general] [PATCH] IPoIB: Test for NULL broadcast object in opiob_mcast_join_finish. Message-ID: <200805191703.05887.jackm@dev.mellanox.co.il> IPoIB: "join finish" occurring just after device was flushed caused Oops. ipoib_mcast_join_finish() processing could conceivably occur just after ipoib_mcast_dev_flush() was invoked (in which case the broadcast pointer is NULL). This patch tests for and fixes this case. Signed-off-by: Jack Morgenstein --- Roland, We encountered this problem in our regression testing (kernel Oops). (bugzilla bug 1040). The test randomly causes the HCA physical port to go down then up. We then have a situation where a "flush" could occur while IPoIB mcast initialization was still in progress. 
Index: ofed_kernel/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- ofed_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2008-05-19 15:48:17.000000000 +0300
+++ ofed_kernel/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2008-05-19 16:07:52.723294000 +0300
@@ -194,7 +194,13 @@ static int ipoib_mcast_join_finish(struc
 	/* Set the cached Q_Key before we attach if it's the broadcast group */
 	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		    sizeof (union ib_gid))) {
+		spin_lock_irq(&priv->lock);
+		if (!priv->broadcast) {
+			spin_unlock_irq(&priv->lock);
+			return -EAGAIN;
+		}
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
+		spin_unlock_irq(&priv->lock);
 		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
 	}

-------------------------------------------------------

From tziporet at mellanox.co.il Mon May 19 07:20:57 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 19 May 2008 17:20:57 +0300
Subject: [ofa-general] Agenda for the OFED meeting today (May 5)
Message-ID: <6C2C79E72C305246B504CBA17B5500C9040C0CA6@mtlexch01.mtl.com>

Hi,
This is the agenda for the OFED meeting today:

1. OFED 1.3.1:
   1.1 Schedule:
       rc1 - done on May 6
       rc2 - May 22 <== I propose to delay to Thursday since there are a few IPoIB bugs being worked on
       GA - May 29
   1.2 OS support:
       SLES10 SP2 backports were done (thanks to Moshe from Voltaire)
       There is a request for RHEL 5.2 - who has this OS and can help with the backports?
   1.3 Bugs status:
       Please set release version 1.3.1 for all bugs that should be resolved in 1.3.1.
       In the way the bugs are assigned today it is very hard to extract the relevant bugs for the release.
       This is the list of bugs that should be resolved to my best knowledge (please add more):

       1024 normal   monis at voltaire.com      Bonding-Ping not recovery after reconnect the non active interface
       1027 normal   sashak at voltaire.com     kernel panic in mad.c handle_outgoing_dr_smp with RESULT_CONSUMED
       1031 normal   kliteyn at mellanox.co.il  OpenSM fat tree routing thinks fat tree isn't
       1032 critical vu at mellanox.com         RHEL 5.1 and OFED 1.3 cannot write IO blocks greater than 1024
       1038 normal   eli at mellanox.co.il      Kernel panic while running tcp/ip ltp tests
       1040 normal   jackm at mellanox.co.il    Kernel Oops during "port up/down test"
       1041 normal   vlad at mellanox.co.il     Install Failed with memtrack flag in the conf file
       1042 normal   vlad at mellanox.co.il     ofed-1.3.1 install fails

2. OFED 1.4:
   - Kernel rebase status: we have prepared the new tree, make-dist passes but compilation still fails.
     Any help to resolve compilation issues is welcome.
     URL: git://git.openfabrics.org/ofed_1_4/linux-2.6.git ofed_kernel
   - Update from the participants (mainly on new components/features):
     - NFSoRDMA - Jeff
     - Management - Sasha
     - Multiple EQs to best fit multi-core systems - we try to define it with Roland
     - RDMA CM to support IPv6 - Woody, any news on this?
     - IB BMME and iWARP equivalent memory extensions - under progress on the general list

3. Open discussion
   - Upgrade memory in the OFA server:
     This request was raised a long time ago and we had a promise to do it after the 1.3 release. What is the status?
   - Other topics ...
Tziporet From swise at opengridcomputing.com Mon May 19 07:23:13 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 May 2008 09:23:13 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> Message-ID: <48318D51.5070904@opengridcomputing.com> Roland Dreier wrote: > > If ownership can be assumed, I suggest to have the core use the > > implementation of these two verbs as you did that for the Chelsio > > driver in case the HW driver did not implement it (i.e instead of > > returning ENOSYS). In that case, the alloc_list verb should do DMA > > mapping FROM device (I think...) since the device is going to do DMA > > to read the page list, and the free_list verb should do DMA unmapping, > > etc. > > Yes, the point of this verb is that the low-level driver owns the page > list from when the fast register work request is posted until it > completes. This should be explicitly documented somewhere. > > I've added it to the comments for ib_alloc_fast_reg_page_list() as per Ralph Campbell's suggestion. > However the reason for having the low-level driver implement it is so > that all strange device-specific issues can be taken care of in the > driver. For instance mlx4 is going to require that the page list be > aligned to 64 bytes, and will DMA from the memory, so we need to use > dma_alloc_consistent(). On the other hand cxgb3 is just going to copy > in software, so kmalloc is sufficient. > > - R. > From swise at opengridcomputing.com Fri May 16 15:30:37 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:30:37 -0500 Subject: [ofa-general] [PATCH RFC v3] RDMA: New Memory Extensions Message-ID: <20080516223037.27127.26712.stgit@dell3.ogc.int> The following patch series proposes: - The API and core changes needed to implement the IB BMMR and iWARP equivalient memory extensions. - cxgb3 support. Changes since version 3: - better comments to ib_alloc_fast_reg_page_list() function to explicitly state the page list is owned by the device until the fast_reg WR completes. Changes since version 2: - added device attribute max_fast_reg_page_list_len - added cxgb3 patch Changes since version 1: - ib_alloc_mr() -> ib_alloc_fast_reg_mr() - pbl_depth -> max_page_list_len - page_list_len -> max_page_list_len where it makes sense - int -> unsigned int where needed - fbo -> first_byte_offset - added page size and page_list_len to fast_reg union in ib_send_wr - rearranged work request fast_reg union of ib_send_wr to pack it - dropped remove_access parameter from ib_alloc_fast_reg_mr() - IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS - compiled Steve. From swise at opengridcomputing.com Fri May 16 15:30:37 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 16 May 2008 17:30:37 -0500 Subject: [ofa-general] [PATCH RFC v3] RDMA: New Memory Extensions Message-ID: <20080516223037.27127.26712.stgit@dell3.ogc.int> The following patch series proposes: - The API and core changes needed to implement the IB BMMR and iWARP equivalient memory extensions. - cxgb3 support. Changes since version 3: - better comments to ib_alloc_fast_reg_page_list() function to explicitly state the page list is owned by the device until the fast_reg WR completes. 
- cxgb3 - when allocating a page list, set max_page_list_len Changes since version 2: - added device attribute max_fast_reg_page_list_len - added cxgb3 patch Changes since version 1: - ib_alloc_mr() -> ib_alloc_fast_reg_mr() - pbl_depth -> max_page_list_len - page_list_len -> max_page_list_len where it makes sense - int -> unsigned int where needed - fbo -> first_byte_offset - added page size and page_list_len to fast_reg union in ib_send_wr - rearranged work request fast_reg union of ib_send_wr to pack it - dropped remove_access parameter from ib_alloc_fast_reg_mr() - IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS - compiled Steve. From or.gerlitz at gmail.com Mon May 19 07:44:03 2008 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 19 May 2008 17:44:03 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: <483049BF.4050603@dev.mellanox.co.il> References: <483049BF.4050603@dev.mellanox.co.il> Message-ID: <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> On 5/18/08, Vladimir Sokolovsky wrote: > > There is a bug in the OFED 1.3 mlx4 driver in mlx4_alloc_fmr which > hardcoded > the minimum acceptable page_shift to be 12. However, new mlx4 firmware has > a > minimum page_shift of 9 (log_pg_sz of 9 returned by QUERY_DEV_LIM) -- so > that > ib_fmr_alloc fails for ULPs using the device minimum when creating FMRs. > > To preserve firmware compatibility with released OFED drivers, the firmware > will continue to return 12 as before for log_page_sz in QUERY_DEV_CAP for > these > drivers. Hi Vlad, Roland, To start with, the bug is in the Linux kernel mlx4 driver, there's nothing like "OFED 1.3 mlx4 driver" (who's the maintainer? why there's need to be another instace of a driver merged into the mainline kerel, etc). This bug was fixed last week or so by a patch sent by someone from Mellanox. To continue with, maybe we just state that kernels < X = 2.6.26 are not compatible with FW version > Y = 2.3? or have the patch that fixes the problem be sent to -stable versions of older kernels? If those solutions are not enough, I think that the default behaviour of FW AND the mainline driver would be to get the actual minimal driver supported, namely nine. Or. -------------- next part -------------- An HTML attachment was scrubbed... URL: From or.gerlitz at gmail.com Mon May 19 07:48:12 2008 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 19 May 2008 17:48:12 +0300 Subject: [ofa-general] [RFC v2 PATCH 4/5] rdma/cma: implement RDMA_ALIGN_WITH_NETDEVICE ha mode In-Reply-To: <483186F3.6090703@opengridcomputing.com> References: <482C4A52.4000501@opengridcomputing.com> <000201c8b6ab$c0c66730$bd59180a@amr.corp.intel.com> <4831679D.9040804@voltaire.com> <483186F3.6090703@opengridcomputing.com> Message-ID: <15ddcffd0805190748k7ea3ca96o815cf64e71758d82@mail.gmail.com> On 5/19/08, Steve Wise wrote: > I don't understand your rationale above on why adding a new function is > better than using an extensive "set this option" function. Can you clarify? > > Function rdma_set_option() wouldn't be the mechanism to report the events. > It would be the mechanism to set an option indicating the cm_id wants > NETDEV_CHANGE events. Seems like the exact fit rather than a new API call > to set some very specific mode or option. I missed the QoS thread you're > referencing, so please excuse me if I'm rehashing something that has been > agreed-to in the past... 
> Steve, I think there was some misunderstanding here, I am not against rdma_set_option, I just said that I would follow what would be decided over the review / maintainer decision. I am fine both with using a set_opt call or a dedicated call. Or. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hrosenstock at xsigo.com Mon May 19 07:49:10 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 07:49:10 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <48318575.7060701@Sun.COM> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> Message-ID: <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-19 at 19:19 +0530, Sumit Gaur - Sun Microsystem wrote: > > Hal Rosenstock wrote: > > Hi Sumit, > > > > On Mon, 2008-05-19 at 17:20 +0530, Sumit Gaur - Sun Microsystem wrote: > > > >>Hi Hal, > >> > >> > >>Hal Rosenstock wrote: > >> > >>>Sumit, > >>> > >>>On Mon, 2008-05-19 at 15:25 +0530, Sumit Gaur - Sun Microsystem wrote: > >>> > >>> > >>>>Hi > >>>>I have an issue while my program interacting with OFED umad library. > >>> > >>> > >>>Are you referring to libibumad ? > >> > >>yes, I am using mad_receive(0, -1) function to get my response back. > > > > > > OK. > > > > > >>>>I have two > >>>>separate threads one for sending SMP,GMP packets and another to receive > >>>>response. Things are working fine but during the whole process I keep receiving > >>>>packets with unknown tid apart from correct response. > >>> > >>> > >>>What's the exact message ? > >> > >>Response comes as proper mad packets but with "tid" that I have never send and > >>my logic to keep track of send/response pkts failed. > >> > >>> > >>>>Is it a correct behavior. > >>> > >>> > >>>It could be; there's not enough info as to what is going on. It could be > >>>some unsolicited message (e.g. from SM) comes in during your > >>>transactions. Can you see what MADs are incoming ? One way to do that > >>>would be to run madeye. > >> > >>Yes I could see complete mad with madhdr as following fields > >> > >>Response TID2 = 0x000000006701869b , BaseVersion = 1, MgmtClass=129, > >>ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=435 > > > > > > Class 129 is a Subn directed route packet. Some of the other info (like > > attribute ID) doesn't look right to me but maybe that's something > > "special" to your environment. > Sorry missed last number AttributeID=4352 I don't know what that attribute ID is so there's something different about that. Out of curiousity, what SM are you using ? > >> If these are unsolicited packets. Is there anyway to filter them. > > > > > > Yes. How do you register ? > For registration I am calling madrpc_init(ca, ca_port, mgmt_classes, 4) > function once before starting polling thread for SMI and GSI packet receive. > Once I received packet I am filtering them on the basis of madhdr->MgmtClass. > > int mgmt_classes[4] = {IB_SMI_CLASS,IB_SMI_DIRECT_CLASS, IB_SA_CLASS, > IB_PERFORMANCE_CLASS}; > > for given ca and ca port of local node. That looks like it would register with a NULL method mask which should filter unsolicited packets. I think I see the issue: the incoming packet appears to have a method of 129 (GetResp) which has the response bit on so it's not considered unsolicited. 
You need to see what exactly that packet is and where it's coming from and why. -- Hal > > > > > >>Any reference to madeye ? > > > > > > There's only the code for this (kernel module) which is added by OFED > > (not upstream) in drivers/infiniband/util but it's pretty > > straightforward to use. > > > > -- Hal > > > > > >>>>If yes how I could avoid them ? > >>> > >>> > >>>Not sure what you are seeing yet. > >>> > >>>-- Hal > >>> > >>> > >>> > >>>>Thanks and Regards > >>>>sumit > >>>> > >>>>general-request at lists.openfabrics.org wrote: > >>>> > >>>> > >>>>>Send general mailing list submissions to > >>>>> general at lists.openfabrics.org > >>>>> > >>>>>To subscribe or unsubscribe via the World Wide Web, visit > >>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>>>>or, via email, send a message with subject or body 'help' to > >>>>> general-request at lists.openfabrics.org > >>>>> > >>>>>You can reach the person managing the list at > >>>>> general-owner at lists.openfabrics.org > >>>>> > >>>>>When replying, please edit your Subject line so it is more specific > >>>>>than "Re: Contents of general digest..." > >>>>> > >>>>> > >>>>>Today's Topics: > >>>>> > >>>>> 1. Re: [PATCH] IB/core: handle race between elements in qork > >>>>> queues after event (Roland Dreier) > >>>>> 2. Re: RDS flow control (Steve Wise) > >>>>> 3. Re: RDS flow control (Olaf Kirch) > >>>>> 4. Re: RDS flow control (Steve Wise) > >>>>> 5. Re: RDS flow control (Olaf Kirch) > >>>>> 6. Re: [PATCH 3/3] IB/ipath - fix RDMA read response sequence > >>>>> checking (Roland Dreier) > >>>>> 7. Re: [PATCH][INFINIBAND]: Make ipath_portdata work with > >>>>> struct pid * not pid_t. (Roland Dreier) > >>>>> 8. Re: bitops take an unsigned long * (Roland Dreier) > >>>>> > >>>>> > >>>>>---------------------------------------------------------------------- > >>>>> > >>>>>Message: 1 > >>>>>Date: Tue, 13 May 2008 10:41:39 -0700 > >>>>>From: Roland Dreier > >>>>>Subject: Re: [ofa-general] [PATCH] IB/core: handle race between > >>>>> elements in qork queues after event > >>>>>To: Moni Shoua > >>>>>Cc: Olga Stern , OpenFabrics General > >>>>> > >>>>>Message-ID: > >>>>>Content-Type: text/plain; charset=us-ascii > >>>>> > >>>>> > >>>>>>Can we please go on with this patch? We would like to see it in the next kernel. > >>>>> > >>>>>I still don't get why this is important to you. Is there a concrete > >>>>>example of a situation where this actually makes a measurable difference? > >>>>> > >>>>>We need some justification for adding this locking complexity beyond "it > >>>>>doesn't hurt." (And also of course we need it fixed so there aren't races) > >>>>> > >>>>>- R. > >>>>> > >>>>> > >>>>>------------------------------ > >>>>> > >>>>>Message: 2 > >>>>>Date: Tue, 13 May 2008 12:58:11 -0500 > >>>>>From: Steve Wise > >>>>>Subject: Re: [ofa-general] RDS flow control > >>>>>To: Richard Frank > >>>>>Cc: rds-devel at oss.oracle.com, general at lists.openfabrics.org > >>>>>Message-ID: <4829D6B3.5080900 at opengridcomputing.com> > >>>>>Content-Type: text/plain; charset=ISO-8859-1; format=flowed > >>>>> > >>>>>Richard Frank wrote: > >>>>> > >>>>> > >>>>> > >>>>>>Steve Wise wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>>>Olaf Kirch wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>>On Monday 12 May 2008 18:57:38 Jon Mason wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>>As part of my effort to get RDS working for iWARP, I will be > >>>>>>>>>working on the RDS flow control. 
> >>>>> [...]

From rdreier at cisco.com Mon May 19 07:49:40 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 19 May 2008 07:49:40 -0700
Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size
In-Reply-To: <200805191603.50735.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 19 May 2008 16:03:50 +0300")
References: <483049BF.4050603@dev.mellanox.co.il> <200805191603.50735.jackm@dev.mellanox.co.il>
Message-ID:

 > This is in case at some installation, the administrator wishes to use
 > the legacy device page size of 12, for example. Having a module
 > parameter enables such tweaking to be done painlessly.

And why would the administrator want that?

 - R.

From rdreier at cisco.com Mon May 19 07:51:01 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 19 May 2008 07:51:01 -0700
Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size
In-Reply-To: <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> (Or Gerlitz's message of "Mon, 19 May 2008 17:44:03 +0300")
References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com>
Message-ID:

 > To continue with, maybe we just state that kernels < X = 2.6.26 are not
 > compatible with FW version > Y = 2.3? or have the patch that fixes the
 > problem be sent to -stable versions of older kernels?

Why?
This patch provides a pretty simple way for older kernels and newer firmware to continue to work together, while providing the new functionality with new firmware and new kernels. Why introduce gratuitous breakage? - R.
From or.gerlitz at gmail.com Mon May 19 07:55:42 2008 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 19 May 2008 17:55:42 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> Message-ID: <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> On 5/19/08, Roland Dreier wrote: > > Why? This patch provides a pretty simple way for older kernels and > newer firmware to continue to work together, while providing the new > functionality with new firmware and new kernels. Why introduce > gratuitous breakage? > > I understand that for the new functionality to take effect with new kernels, the admin has to set the module param to a non-default value, correct? So you are fine with that? Or. -------------- next part -------------- An HTML attachment was scrubbed... URL:
From jackm at dev.mellanox.co.il Mon May 19 08:01:43 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 19 May 2008 18:01:43 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <200805191603.50735.jackm@dev.mellanox.co.il> Message-ID: <200805191801.43596.jackm@dev.mellanox.co.il> On Monday 19 May 2008 17:49, Roland Dreier wrote: > > This is in case at some installation, the administrator wishes to use > > the legacy device page size of 12, for example. Having a module > > parameter enables such tweaking to be done painlessly. > > And why would the administrator want that? > Ok, we'll get rid of the module parameter. - Jack
From olga.shern at gmail.com Mon May 19 08:11:37 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Mon, 19 May 2008 18:11:37 +0300 Subject: [ofa-general] Re: [ewg] Agenda for the OFED meeting today (May 5) In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9040C0CA6@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9040C0CA6@mtlexch01.mtl.com> Message-ID: On 5/19/08, Tziporet Koren wrote: > > Hi, > > This is the agenda for the OFED meeting today: > 1. OFED 1.3.1: > 1.1 Schedule: > rc1 - done on May 6 > rc2 - May 22 <== I propose to delay to Thursday since there are > a few IPoIB bugs in work > GA - May 29 > 1.2 OS support: > SLES10 SP2 backports were done (thanks to Moshe from Voltaire) > There is a request for RHEL 5.2 - who has this OS and can help > with the backports? > 1.3 Bugs status > Please set release version 1.3.1 for all bugs that should be > resolved in 1.3.1 > The way the bugs are assigned today, it is very hard to > extract the relevant bugs for the release.
> This is the list of bugs that should be resolved to my best > knowledge (please add more): There is also bug number 1004 1004 maj P2 RHEL eli at mellanox.co.il IPoIB failed on stress testing 1024 normal monis at voltaire.com Bonding-Ping not recovery after > reconnect the non active interface > 1027 normal sashak at voltaire.com kernel panic in mad.c > handle_outgoing_dr_smp with RESULT_CONSUMED > 1031 normal kliteyn at mellanox.co.il OpenSM fat tree routing thinks > fat tree isn't > 1032 critical vu at mellanox.com RHEL 5.1 and OFED 1.3 > cannot write IO blocks greater than 1024. > 1038 normal eli at mellanox.co.il Kernel panic while running > tcp/ip ltp tests > 1040 normal jackm at mellanox.co.il Kernel Oops during "port up/down > test" > 1041 normal vlad at mellanox.co.il Install Failed with memtrack > flag in the conf file > 1042 normal vlad at mellanox.co.il ofed-1.3.1 install fails > > 2. OFED 1.4: > - Kernel rebase status: we have prepared the new tree, make-dist > pass but compilation still fails. > Any help to resolve compilation issues is welcome. > URL: git://git.openfabrics.org/ofed_1_4/linux-2.6.git > ofed_kernel > - Update from the participants (mainly on new > components/features): > - NFSoRDMA - Jeff > - Management - Sasha > - Multiple EQs to best fit multi-core systems - we try to > define it with Roland > - RDMA CM to support IPv6 - Woody any news on this? > - IB BMME and iWARP equivalent memory extensions - under > progress on the general list > > 3. Open discussion > - Upgrade memory in the OFA server: > This request raised long time ago and we had a promise to do it > after 1.3 release. What is the status? > - Other topics ... > > Tziporet > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > -------------- next part -------------- An HTML attachment was scrubbed... URL: From monis at Voltaire.COM Mon May 19 08:17:00 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Mon, 19 May 2008 18:17:00 +0300 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> Message-ID: <483199EC.7070900@Voltaire.COM> Thanks for the comment and example. Please take a look below. The last paragraph in the patch documentation refers to the race you pointed. The test is made after sm_ah is copied to the mad so there is no risk of it being NULL in the mad structure. --------------------------------------- This patch solves a race between work elements that are carried out after an event occurs. When SM address handle becomes invalid and needs an update it is handled by a work in the global workqueue. On the other hand this event is also handled in ib_ipoib by queuing a work in the ipoib_workqueue that does mcast join. Although queuing is in the right order, it is done to 2 different workqueues and so there is no guarantee that the first to be queued is the first to be executed. The patch sets the SM address handle to NULL and until update_sm_ah() is called, any request that needs sm_ah is replied with -EAGAIN return status. For consumers, the patch doesn't make things worse. Before the patch, MADS are sent to the wrong SM so the request gets lost. Consumers can be improved if they examine the return code and respond to EAGAIN properly but even without an improvement the situation is not getting worse and in some cases it gets better. 
If ib_sa_event() is called during a consumer's work (e.g. ib_sa_path_rec_get()), after the check for a NULL SM address handle, the result would be the same as before the patch, without a risk of dereferencing a NULL pointer. Signed-off-by: Moni Levy Signed-off-by: Moni Shoua --- drivers/infiniband/core/sa_query.c | 49 +++++++++++++++++++++++++++---------- 1 file changed, 37 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index cf474ec..8170381 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -413,9 +413,20 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event event->event == IB_EVENT_PKEY_CHANGE || event->event == IB_EVENT_SM_CHANGE || event->event == IB_EVENT_CLIENT_REREGISTER) { - struct ib_sa_device *sa_dev; - sa_dev = container_of(handler, typeof(*sa_dev), event_handler); - + unsigned long flags; + struct ib_sa_device *sa_dev = + container_of(handler, typeof(*sa_dev), event_handler); + struct ib_sa_port *port = + &sa_dev->port[event->element.port_num - sa_dev->start_port]; + struct ib_sa_sm_ah *sm_ah; + + spin_lock_irqsave(&port->ah_lock, flags); + sm_ah = port->sm_ah; + port->sm_ah = NULL; + spin_unlock_irqrestore(&port->ah_lock, flags); + + if (sm_ah) + kref_put(&sm_ah->ref, free_sm_ah); schedule_work(&sa_dev->port[event->element.port_num - sa_dev->start_port].update_task); } @@ -673,6 +684,10 @@ int ib_sa_path_rec_get(struct ib_sa_client *client, ret = alloc_mad(&query->sa_query, gfp_mask); if (ret) goto err1; + if (!port->sm_ah) { + ret = -EAGAIN; + goto err2; + } ib_sa_client_get(client); query->sa_query.client = client; @@ -694,13 +709,14 @@ int ib_sa_path_rec_get(struct ib_sa_client *client, ret = send_mad(&query->sa_query, timeout_ms, gfp_mask); if (ret < 0) - goto err2; + goto err3; return ret; -err2: +err3: *sa_query = NULL; ib_sa_client_put(query->sa_query.client); +err2: free_mad(&query->sa_query); err1: kfree(query); return ret; @@ -780,6 +796,7 @@ int ib_sa_service_rec_query(struct ib_sa_client *client, return -ENODEV; port = &sa_dev->port[port_num - sa_dev->start_port]; + agent = port->agent; if (method != IB_MGMT_METHOD_GET && @@ -795,6 +812,10 @@ int ib_sa_service_rec_query(struct ib_sa_client *client, ret = alloc_mad(&query->sa_query, gfp_mask); if (ret) goto err1; + if (!port->sm_ah) { + ret = -EAGAIN; + goto err2; + } ib_sa_client_get(client); query->sa_query.client = client; @@ -817,15 +838,15 @@ int ib_sa_service_rec_query(struct ib_sa_client *client, ret = send_mad(&query->sa_query, timeout_ms, gfp_mask); if (ret < 0) - goto err2; + goto err3; return ret; -err2: +err3: *sa_query = NULL; ib_sa_client_put(query->sa_query.client); +err2: free_mad(&query->sa_query); - err1: kfree(query); return ret; @@ -877,8 +898,8 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, return -ENODEV; port = &sa_dev->port[port_num - sa_dev->start_port]; - agent = port->agent; + agent = port->agent; query = kmalloc(sizeof *query, gfp_mask); if (!query) return -ENOMEM; @@ -887,6 +908,10 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, ret = alloc_mad(&query->sa_query, gfp_mask); if (ret) goto err1; + if (!port->sm_ah) { + ret = -EAGAIN; + goto err2; + } ib_sa_client_get(client); query->sa_query.client = client; @@ -909,15 +934,15 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client, ret = send_mad(&query->sa_query, timeout_ms, gfp_mask); if (ret < 0) - goto err2; + goto err3; return ret; -err2: +err3: *sa_query = NULL;
ib_sa_client_put(query->sa_query.client); +err2: free_mad(&query->sa_query); - err1: kfree(query); return ret;
From jackm at dev.mellanox.co.il Mon May 19 08:25:37 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 19 May 2008 18:25:37 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <200805191603.50735.jackm@dev.mellanox.co.il> Message-ID: <200805191825.38176.jackm@dev.mellanox.co.il> On Monday 19 May 2008 17:49, Roland Dreier wrote: > > This is in case at some installation, the administrator wishes to use > > the legacy device page size of 12, for example. Having a module > > parameter enables such tweaking to be done painlessly. > > And why would the administrator want that? > I just remembered. If we create FMRs using 512 as the device page size, we will use 8 times the MTT entries as we would if the page size was 4K. The ULP (rds and iser) can run out of MTT entries much faster. This can give an administrator a quick workaround if needed (until we fix the resource allocator to allow bitmaps larger than 2^20 -- which is the current default max number of MTTs). (The allocator's problem is that kzalloc cannot allocate a block larger than 128 KB (= 1M bits).) - Jack
From yevgenyp at mellanox.co.il Mon May 19 08:22:52 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Mon, 19 May 2008 18:22:52 +0300 Subject: [ofa-general][PATCH] Re: mlx4: Completion EQ per cpu (MP support, Patch 10) In-Reply-To: References: <40FA0A8088E8A441973D37502F00933E3A24@mtlexch01.mtl.com> Message-ID: <48319B4C.7040309@mellanox.co.il> Roland Dreier wrote: > > > I would just like to see an approach that is fully thought through and > > > gives a way for applications/kernel drivers to choose a CQ vector based > > > on some information about what CPU it will go to. > > > Isn't the decision of which CPU an MSI-X is routed to (and hence, to > > which CPU an EQ is bound) determined by userspace? (either by the irq > > balancer process or by manually setting /proc/irq//smp_affinity)? > > Yes, but how can anything tell which IRQ number corresponds to a given > "CQ vector" number? (And don't be too stuck on MSI-X, since ehca uses > some completely different GX-bus related thing to get multiple interrupts) > > > What are we risking in making the default action to spread interrupts? > > There are fairly plausible scenarios like a multi-threaded app where > each thread creates a send CQ and a receive CQ, which should both be > bound to the same CPU as the thread. If we spread all CQs then it's > impossible to get thread-locality. > > I'm not saying that round-robin is necessarily a bad default policy, but > I do think there needs to be a complete picture of how that policy can > be overridden before we go for multiple interrupt vectors. > > - R. Hello Roland, We can add the multiple interrupt vectors support in two stages: 1. The low-level driver can create multiple interrupt vectors. Their name would include a serial number from 0 to #CPUs-1. The number of completion vectors can be populated through ib_device.num_comp_vectors. Then each ULP can ask for a specific completion vector when creating a CQ, which means that passing vector=0 while creating a CQ will assign it to completion vector #0. 2. As the second stage, we can create a "don't care" value which would mean that the driver can attach the CQ to any completion vector.
In this case the policy shouldn't necessarily be round-robin. We can manage the number of "clients" for each completion vector and then assign the CQ to the least busy one. What is your opinion on this solution? Thanks, Yevgeny
From rdreier at cisco.com Mon May 19 08:43:48 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 08:43:48 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4831347A.1010506@voltaire.com> (Or Gerlitz's message of "Mon, 19 May 2008 11:04:10 +0300") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> <4831347A.1010506@voltaire.com> Message-ID: > I see. Just wondering, in the mlx4 case, is it a must to use dma > consistent memory allocation or would dma mapping work too? dma mapping would work too but then handling the map/unmap becomes an issue. I think it is way too complicated to add new verbs for map/unmap fastreg page list (in addition to the alloc/free fastreg page list that we are already adding) and force the consumer to do it. And if we expect the low-level driver to do it, then the map is easy (can be done while posting the send) but the unmap is a pain -- it would have to be done inside poll_cq when reaping the completion, and the low-level driver would have to keep some complicated extra data structure to go back from the completion to the original fast reg page list structure.
From rdreier at cisco.com Mon May 19 08:44:57 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 08:44:57 -0700 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: <200805191825.38176.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 19 May 2008 18:25:37 +0300") References: <483049BF.4050603@dev.mellanox.co.il> <200805191603.50735.jackm@dev.mellanox.co.il> <200805191825.38176.jackm@dev.mellanox.co.il> Message-ID: > I just remembered. If we create FMRs using 512 as the device page size, we will > use 8 times the MTT entries as we would if the page size was 4K. The ULP (rds and iser) > can run out of MTT entries much faster. Seems like a ULP issue -- if they don't want 512 byte pages for FMRs, why are they asking for them? - R.
From rdreier at cisco.com Mon May 19 08:45:49 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 08:45:49 -0700 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> (Or Gerlitz's message of "Mon, 19 May 2008 17:55:42 +0300") References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> Message-ID: > I understand that for the new functionality to take effect with new > kernels, the admin has to set the module param to a non-default value, > correct? So you are fine with that? You misunderstood the patch I think (unless I did). By default new kernel + new firmware gets the smaller page size. The kernel module parameter does seem kind of useless. - R.
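To make the MTT arithmetic in the exchange above concrete: with a 512-byte device page (page_shift = 9) a 4K buffer costs eight MTT entries, while with page_shift = 12 it costs one. A ULP that maps buffers in 4K units can say so explicitly when allocating its FMR pool. The sketch below is illustrative only (the helper name and the max_pages/max_maps limits are assumptions, not code from any patch in this thread):

#include <rdma/ib_verbs.h>

/* Illustrative: an FMR that maps in 4K units, so one mapping of
 * max_pages * 4K = 256K consumes at most 64 MTT entries. */
static struct ib_fmr *example_alloc_4k_fmr(struct ib_pd *pd)
{
	struct ib_fmr_attr attr = {
		.max_pages  = 64,
		.max_maps   = 32,
		.page_shift = 12,	/* 4K pages, not the 512-byte minimum */
	};

	return ib_alloc_fmr(pd, IB_ACCESS_LOCAL_WRITE |
				IB_ACCESS_REMOTE_READ |
				IB_ACCESS_REMOTE_WRITE,
			    &attr);
}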
From swise at opengridcomputing.com Mon May 19 08:46:39 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 May 2008 10:46:39 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> <4831347A.1010506@voltaire.com> Message-ID: <4831A0DF.2070603@opengridcomputing.com> Roland Dreier wrote: > > I see. Just wondering, in the mlx4 case, is it a must to use dma > > consistent memory allocation or dma mapping would work too? > > dma mapping would work too but then handling the map/unmap becomes an > issue. I think it is way too complicated too add new verbs for > map/unmap fastreg page list (in addition to the alloc/free fastreg page > list that we are already adding) and force the consumer to do it. And > if we expect the low-level driver to do it, then the map is easy (can be > done while posting the send) but the unmap is a pain -- it would have to > be done inside poll_cq when reapind the completion, and the low-level > driver would have to keep some complicated extra data structure to go > back from the completion to the original fast reg page list structure. > And certain platforms can fail map requests (like PPC64) because they have limited resources for dma mapping. So then you'd fail a SQ work request when you might not want to... Steve. From olga.shern at gmail.com Mon May 19 08:47:57 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Mon, 19 May 2008 18:47:57 +0300 Subject: [ofa-general] [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <1211202081.12616.417.camel@hrosenstock-ws.xsigo.com> References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <1211121890.12616.381.camel@hrosenstock-ws.xsigo.com> <483046CA.3010403@Voltaire.COM> <1211202081.12616.417.camel@hrosenstock-ws.xsigo.com> Message-ID: On 5/19/08, Hal Rosenstock wrote: > > On Sun, 2008-05-18 at 18:10 +0300, Moni Shoua wrote: > > Hal Rosenstock wrote: > > > On Sun, 2008-05-18 at 15:36 +0300, Moni Shoua wrote: > > >> The purpose of this patch is to make the events that are related to SM > change > > >> (namely CLIENT_REREGISTER event and SM_CHANGE event) less disruptive. > > >> When SM related events are handled, it is not necessary to flush > unicast > > >> info from device but only multicast info. > > > > > > How is unicast invalidation handled on these changes ? On a local LID > > > change event, how does an end port know/determine what else (e.g. other > > > LIDs, paths) the SM might have changed (that specifically might affect > > > IPoIB since this is limited to IPoIB) ? > > I'm not sure I understand the question but local LID change would be > handled as before > > with a LID_CHANGE event. For this type of event, there is not change in > what IPoIB does to cope. > > It's SM change which I'm not sure about. I'm unaware of an IBA spec > guarantee on preservation of paths on SM failover. Can you point me at > this ? > > Also, as many routing protocols are dependent on where they are run in > the subnet (location of SM node in the topology), I don't think all path > parameters can be maintained when in a heterogeneous subnet and hence > would need refreshing (or flushing to cause this) on an SM change event. > > So while it may work in a homogeneous subnet, I don't think this is the > general case. 
You are rigth there is no IBA spec request to preserve LIDs but all SMs that we are familiar with, are doing so. You are refering to the case where there is remote LID change but not local LID change, but also without this patch this case is not taken care of. We should think about solution for this case in the future. > > Also, wouldn't there be similar issues with other ULPs ? > > There might be but the purpose of this one is to make things better for > IPoIB > > Understood; just trying to widen the scope. IMO other ULPs should at > least be inspected for the same issues. The multicast issue is IPoIB > specific but local LID, client reregister (maybe only events for other > ULPs as multicast and service records may not apply (perhaps except DAPL > but this may be old implementation)) and SM changes apply to all. > > -- Hal > > > > -- Hal > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Mon May 19 08:49:04 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 08:49:04 -0700 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: <483199EC.7070900@Voltaire.COM> (Moni Shoua's message of "Mon, 19 May 2008 18:17:00 +0300") References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> Message-ID: and what happens if alloc_mad() is called while port->sm_ah is NULL? From Sumit.Gaur at Sun.COM Mon May 19 09:08:56 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur) Date: Mon, 19 May 2008 21:38:56 +0530 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> Message-ID: <4831A618.9090806@Sun.COM> Hal Rosenstock wrote: > On Mon, 2008-05-19 at 19:19 +0530, Sumit Gaur - Sun Microsystem wrote: > >> Hal Rosenstock wrote: >> >>> Hi Sumit, >>> >>> On Mon, 2008-05-19 at 17:20 +0530, Sumit Gaur - Sun Microsystem wrote: >>> >>> >>>> Hi Hal, >>>> >>>> >>>> Hal Rosenstock wrote: >>>> >>>> >>>>> Sumit, >>>>> >>>>> On Mon, 2008-05-19 at 15:25 +0530, Sumit Gaur - Sun Microsystem wrote: >>>>> >>>>> >>>>> >>>>>> Hi >>>>>> I have an issue while my program interacting with OFED umad library. >>>>>> >>>>> Are you referring to libibumad ? >>>>> >>>> yes, I am using mad_receive(0, -1) function to get my response back. >>>> >>> OK. >>> >>> >>> >>>>>> I have two >>>>>> separate threads one for sending SMP,GMP packets and another to receive >>>>>> response. Things are working fine but during the whole process I keep receiving >>>>>> packets with unknown tid apart from correct response. >>>>>> >>>>> What's the exact message ? 
>>>>> >>>> Response comes as proper mad packets but with "tid" that I have never send and >>>> my logic to keep track of send/response pkts failed. >>>> >>>> >>>>>> Is it a correct behavior. >>>>>> >>>>> It could be; there's not enough info as to what is going on. It could be >>>>> some unsolicited message (e.g. from SM) comes in during your >>>>> transactions. Can you see what MADs are incoming ? One way to do that >>>>> would be to run madeye. >>>>> >>>> Yes I could see complete mad with madhdr as following fields >>>> >>>> Response TID2 = 0x000000006701869b , BaseVersion = 1, MgmtClass=129, >>>> ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=435 >>>> >>> Class 129 is a Subn directed route packet. Some of the other info (like >>> attribute ID) doesn't look right to me but maybe that's something >>> "special" to your environment. >>> >> Sorry missed last number AttributeID=4352 >> > > I don't know what that attribute ID is so there's something different > about that. > > Out of curiousity, what SM are you using ? > > >>>> If these are unsolicited packets. Is there anyway to filter them. >>>> >>> Yes. How do you register ? >>> >> For registration I am calling madrpc_init(ca, ca_port, mgmt_classes, 4) >> function once before starting polling thread for SMI and GSI packet receive. >> Once I received packet I am filtering them on the basis of madhdr->MgmtClass. >> >> int mgmt_classes[4] = {IB_SMI_CLASS,IB_SMI_DIRECT_CLASS, IB_SA_CLASS, >> IB_PERFORMANCE_CLASS}; >> >> for given ca and ca port of local node. >> > > That looks like it would register with a NULL method mask which should > filter unsolicited packets. > > I think I see the issue: the incoming packet appears to have a method of > 129 (GetResp) which has the response bit on so it's not considered > unsolicited. You need to see what exactly that packet is and where it's > coming from and why. > > -- Hal > > Hi Hal, It is true that packets received are looks like proper response but as I mentioned before they content TID that I have never send to OFED and this cause the problem. Why OFED is sending these extra packets Is the matter to investigate. sumit sumit >>> >>>> Any reference to madeye ? >>>> >>> There's only the code for this (kernel module) which is added by OFED >>> (not upstream) in drivers/infiniband/util but it's pretty >>> straightforward to use. >>> >>> -- Hal >>> >>> >>> >>>>>> If yes how I could avoid them ? >>>>>> >>>>> Not sure what you are seeing yet. >>>>> >>>>> -- Hal >>>>> >>>>> >>>>> >>>>> >>>>>> Thanks and Regards >>>>>> sumit >>>>>> >>>>>> general-request at lists.openfabrics.org wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Send general mailing list submissions to >>>>>>> general at lists.openfabrics.org >>>>>>> >>>>>>> To subscribe or unsubscribe via the World Wide Web, visit >>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>>>> or, via email, send a message with subject or body 'help' to >>>>>>> general-request at lists.openfabrics.org >>>>>>> >>>>>>> You can reach the person managing the list at >>>>>>> general-owner at lists.openfabrics.org >>>>>>> >>>>>>> When replying, please edit your Subject line so it is more specific >>>>>>> than "Re: Contents of general digest..." >>>>>>> >>>>>>> >>>>>>> Today's Topics: >>>>>>> >>>>>>> 1. Re: [PATCH] IB/core: handle race between elements in qork >>>>>>> queues after event (Roland Dreier) >>>>>>> 2. Re: RDS flow control (Steve Wise) >>>>>>> 3. Re: RDS flow control (Olaf Kirch) >>>>>>> 4. 
Re: RDS flow control (Steve Wise) >>>>>>> 5. Re: RDS flow control (Olaf Kirch) >>>>>>> 6. Re: [PATCH 3/3] IB/ipath - fix RDMA read response sequence >>>>>>> checking (Roland Dreier) >>>>>>> 7. Re: [PATCH][INFINIBAND]: Make ipath_portdata work with >>>>>>> struct pid * not pid_t. (Roland Dreier) >>>>>>> 8. Re: bitops take an unsigned long * (Roland Dreier) >>>>>>> >>>>>>> >>>>>>> ---------------------------------------------------------------------- >>>>>>> >>>>>>> Message: 1 >>>>>>> Date: Tue, 13 May 2008 10:41:39 -0700 >>>>>>> From: Roland Dreier >>>>>>> Subject: Re: [ofa-general] [PATCH] IB/core: handle race between >>>>>>> elements in qork >>>>>>> queues after event >>>>>>> To: Moni Shoua >>>>>>> Cc: Olga Stern , OpenFabrics General >>>>>>> >>>>>>> Message-ID: >>>>>>> Content-Type: text/plain; charset=us-ascii >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Can we please go on with this patch? We would like to see it in the next kernel. >>>>>>>> >>>>>>> I still don't get why this is important to you. Is there a concrete >>>>>>> example of a situation where this actually makes a measurable difference? >>>>>>> >>>>>>> We need some justification for adding this locking complexity beyond "it >>>>>>> doesn't hurt." (And also of course we need it fixed so there aren't races) >>>>>>> >>>>>>> - R. >>>>>>> >>>>>>> >>>>>>> ------------------------------ >>>>>>> >>>>>>> [...]
>>>>>>> >>>>>>> Olaf >>>>>>> >>>>>> _______________________________________________ >>>>>> general mailing list >>>>>> general at lists.openfabrics.org >>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>>> >>>>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>>>>> >>>>> > > From hrosenstock at xsigo.com Mon May 19 09:17:50 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 09:17:50 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <4831A618.9090806@Sun.COM> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> Message-ID: <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> Sumit, On Mon, 2008-05-19 at 21:38 +0530, Sumit Gaur wrote: > Hi Hal, > It is true that packets received are looks like proper response but as I > mentioned before they content TID that I have never send to OFED and > this cause the problem. Why OFED is sending these extra packets Is the > matter to investigate. The received packet is SM class attribute ID 4352 which is non IBA standard and AFAIK OFED does not send so it likely comes from some non OFED software. As far as why it is being received, it is a response to a class your application is subscribed to so it passes it through. As to what is going on, some sort of packet trace would be needed. -- Hal > sumit > sumit From or.gerlitz at gmail.com Mon May 19 09:27:56 2008 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 19 May 2008 19:27:56 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <200805191603.50735.jackm@dev.mellanox.co.il> <200805191825.38176.jackm@dev.mellanox.co.il> Message-ID: <15ddcffd0805190927n4dc16c12m22aa9b5e219ed65e@mail.gmail.com> On 5/19/08, Roland Dreier wrote: > > Seems like a ULP issue -- if they don't want 512 byte pages for FMRs, > why are they asking for them? > Indeed, the ULP code has to do well with the reported sizes, etc Or. -------------- next part -------------- An HTML attachment was scrubbed... URL: From balaji at mcs.anl.gov Mon May 19 08:46:58 2008 From: balaji at mcs.anl.gov (Pavan Balaji) Date: Mon, 19 May 2008 10:46:58 -0500 Subject: [ofa-general] [p2s2-announce] Reminder: P2S2 Workshop Deadline Coming Up Message-ID: <4831A0F2.70106@mcs.anl.gov> We would like to remind you all that the P2S2 Workshop deadline is coming up in a few days (May 21st). We look forward to receiving paper submissions from you. Please note in the CFP below that the actual workshop is moved from the first day of the ICPP conference (Sep. 8th) to the last day (Sep. 12th), so that it does not conflict with other conferences in the same area. This announcement list is for people who are interested in the P2S2 workshop. If you are not interested in these announcements, information on how to unsubscribe from this list is available at the bottom of this email. ======================================================================== CALL FOR PAPERS =============== First International Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2) Sep. 
12th, 2008 Web link: http://www.mcs.anl.gov/events/workshops/p2s2 To be held in conjunction with ICPP-08: The 27th International Conference on Parallel Processing Sep. 8-12, 2008 Portland, Oregon, USA SCOPE ----- The goal of this workshop is to bring together researchers and practitioners in parallel programming models and systems software for high-end computing systems. Please join us in a discussion of new ideas, experiences, and the latest trends in these areas at the workshop. TOPICS OF INTEREST ------------------ The focus areas for this workshop include, but are not limited to: * Programming models and their high-performance implementations o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel o Other Hybrid Programming Models * Systems software for scientific and enterprise computing o Communication sub-subsystems for high-end computing o High-performance File and storage systems o Fault-tolerance techniques and implementations o Efficient and high-performance virtualization and other management mechanisms * Tools for Management, Maintenance, Coordination and Synchronization o Software for Enterprise Data-centers using Modern Architectures o Job scheduling libraries o Management libraries for large-scale system o Toolkits for process and task coordination on modern platforms * Performance evaluation, analysis and modeling of emerging computing platforms PROCEEDINGS ----------- Proceedings of this workshop will be published by the IEEE Computer Society (together with the ICPP conference proceedings) in CD format only and will be available at the conference. SUBMISSION INSTRUCTIONS ----------------------- Submissions should be in PDF format in U.S. Letter size paper. They should not exceed 8 pages (all inclusive). Submissions will be judged based on relevance, significance, originality, correctness and clarity. DATES AND DEADLINES ------------------- Paper Submission: Extended to May 21st, 2008 Author Notification: June 4th, 2008 Camera Ready: June 18th, 2008 PROGRAM CHAIRS -------------- * Pavan Balaji (Argonne National Laboratory) * Sayantan Sur (IBM Research) STEERING COMMITTEE ------------------ * William D. Gropp (University of Illinois Urbana-Champaign) * Dhabaleswar K. Panda (Ohio State University) * Vijay Saraswat (IBM Research) PROGRAM COMMITTEE ----------------- * David Bernholdt (Oak Ridge National Laboratory) * Ron Brightwell (Sandia National Laboratory) * Wu-chun Feng (Virginia Tech) * Richard Graham (Oak Ridge National Laboratory) * Hyun-wook Jin (Konkuk University, South Korea) * Sameer Kumar (IBM Research) * Doug Lea (State University of New York at Oswego) * Jarek Nieplocha (Pacific Northwest National Laboratory) * Scott Pakin (Los Alamos National Laboratory) * Vivek Sarkar (Rice University) * Rajeev Thakur (Argonne National Laboratory) * Pete Wyckoff (Ohio Supercomputing Center) If you have any questions, please contact us at p2s2-chairs at mcs.anl.gov ======================================================================== If you do not want to receive any more announcements regarding the P2S2 workshop, please send an email to majordomo at mcs.anl.gov with the email body (not email subject) as "unsubscribe p2s2-announce". 
======================================================================== -- Pavan Balaji http://www.mcs.anl.gov/~balaji
From tziporet at dev.mellanox.co.il Mon May 19 09:41:28 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 19 May 2008 19:41:28 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <200805191603.50735.jackm@dev.mellanox.co.il> <200805191825.38176.jackm@dev.mellanox.co.il> Message-ID: <4831ADB8.6040508@mellanox.co.il> Roland Dreier wrote: > > I just remembered. If we create FMRs using 512 as the device page size, we will > > use 8 times the MTT entries as we would if the page size was 4K. The ULP (rds and iser) > > can run out of MTT entries much faster. > > Seems like a ULP issue -- if they don't want 512 byte pages for FMRs, > why are they asking for them? > I agree - each ULP should ask for the size it needs and not rely on the HCA to decide this. This feature was added for backward compatibility of older kernels with the upcoming FW, and there is no need to add more options to the driver. Tziporet
From tziporet at dev.mellanox.co.il Mon May 19 09:43:02 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 19 May 2008 19:43:02 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> Message-ID: <4831AE16.9080807@mellanox.co.il> Roland Dreier wrote: > You misunderstood the patch I think (unless I did). By default new > kernel + new firmware gets the smaller page size. > This is correct. Tziporet
From sashak at voltaire.com Mon May 19 10:01:30 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:01:30 +0300 Subject: [ofa-general] Re: [RFC][PATCH 0/4] opensm: using conventional config file In-Reply-To: <1207851427.7695.123.camel@cardanus.llnl.gov> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1207851427.7695.123.camel@cardanus.llnl.gov> Message-ID: <20080519170130.GH4616@sashak.voltaire.com> Hi Al, On 11:17 Thu 10 Apr , Al Chu wrote: > > I suddenly thought about this. If the /var/cache/opensm/opensm.opts > file is no longer readable (and presumably people will not know about it > b/c it is not documented anywhere), At the moment I changed "usage" ('--help' option) message accordingly. > how will users know how to write the > opensm.conf? /var/cache/opensm/opensm.opts will still be writable with the '-c' option. And this can be used as a template. > Will opensm distribute a "template" .conf file with all > values initially commented out?? (I think this is the best idea). I have nothing against it. Sasha
From sashak at voltaire.com Mon May 19 10:03:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:03:03 +0300 Subject: [ofa-general] Re: [RFC][PATCH 0/4] opensm: using conventional config file In-Reply-To: <1210617225.11133.461.camel@cardanus.llnl.gov> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> Message-ID: <20080519170303.GI4616@sashak.voltaire.com> On 11:33 Mon 12 May , Al Chu wrote: > > Ira and I were chatting.
A few other comments: > > 1) Many configuration values are not output by default in opensm right > now, mainly b/c it behaves like a cache rather than an configuration > file. i.e. > > if (p_opts->connect_roots) > fprintf(opts_file, > "# Connect roots (use FALSE if unsure)\n" > "connect_roots %s\n\n", > p_opts->connect_roots ? "TRUE" : "FALSE"); > > Going forward w/ a config file, I think these should be output by > default all the time so users know they exist. Good point! Will submit patches shortly. > 2) Will there be an option to specify an alternate configuration file, > i.e. not /etc/opensm/opensm.conf? Yes, '-F' or '--config' option. Sasha From sashak at voltaire.com Mon May 19 10:06:24 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:06:24 +0300 Subject: [ofa-general] Re: [RFC][PATCH 0/4] opensm: using conventional config file In-Reply-To: <20080512144541.3879de40.weiny2@llnl.gov> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080512144541.3879de40.weiny2@llnl.gov> Message-ID: <20080519170624.GJ4616@sashak.voltaire.com> Hi Ira, On 14:45 Mon 12 May , Ira Weiny wrote: > > Also, I wonder if anyone would object to you applying your patches to the tree > as is and we work out the details from there? I don't see anything wrong with > your patches except that more work will be needed, as you said, in the man > pages and scripts. > > After you apply your patches I think we can start in changing the man pages and > scripts. Basically I'm fine with such approach. But I think applying "AS IS" will yet break startup scripts, I will look at this to setup at least temporary solution there. Sasha From sashak at voltaire.com Mon May 19 10:08:39 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:08:39 +0300 Subject: [ofa-general] [PATCH] opensm: merge disable_multicast and no_multicast_option options In-Reply-To: <20080519170303.GI4616@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> Message-ID: <20080519170839.GK4616@sashak.voltaire.com> I cannot find how those options should be different. Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_subnet.h | 3 +-- opensm/opensm/osm_sa_class_port_info.c | 2 +- opensm/opensm/osm_subnet.c | 10 ++-------- 3 files changed, 4 insertions(+), 11 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index b1dd659..daab453 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -221,7 +221,6 @@ typedef struct _osm_subn_opt { boolean_t reassign_lids; boolean_t ignore_other_sm; boolean_t single_thread; - boolean_t no_multicast_option; boolean_t disable_multicast; boolean_t force_log_flush; uint8_t subnet_timeout; @@ -338,7 +337,7 @@ typedef struct _osm_subn_opt { * ignore_other_sm_option * This flag is TRUE if other SMs on the subnet should be ignored. * -* no_multicast_option +* disable_multicast * This flag is TRUE if OpenSM should disable multicast support. 
* * max_msg_fifo_timeout diff --git a/opensm/opensm/osm_sa_class_port_info.c b/opensm/opensm/osm_sa_class_port_info.c index f0afb32..0839c1b 100644 --- a/opensm/opensm/osm_sa_class_port_info.c +++ b/opensm/opensm/osm_sa_class_port_info.c @@ -167,7 +167,7 @@ __osm_cpi_rcv_respond(IN osm_sa_t * sa, if (sa->p_subn->opt.qos) ib_class_set_cap_mask2(p_resp_cpi, OSM_CAP2_IS_QOS_SUPPORTED); - if (sa->p_subn->opt.no_multicast_option != TRUE) + if (!sa->p_subn->opt.disable_multicast) p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; p_resp_cpi->cap_mask = cl_hton16(p_resp_cpi->cap_mask); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 47d735f..a916270 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -409,7 +409,6 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->reassign_lids = FALSE; p_opt->ignore_other_sm = FALSE; p_opt->single_thread = FALSE; - p_opt->no_multicast_option = FALSE; p_opt->disable_multicast = FALSE; p_opt->force_log_flush = FALSE; p_opt->subnet_timeout = OSM_DEFAULT_SUBNET_TIMEOUT; @@ -1230,9 +1229,6 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) opts_unpack_boolean("single_thread", p_key, p_val, &p_opts->single_thread); - opts_unpack_boolean("no_multicast_option", - p_key, p_val, &p_opts->no_multicast_option); - opts_unpack_boolean("disable_multicast", p_key, p_val, &p_opts->disable_multicast); @@ -1673,9 +1669,8 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) "enable_quirks %s\n\n" "# If TRUE disables client reregistration\n" "no_clients_rereg %s\n\n" - "# If TRUE OpenSM should disable multicast support\n" - "no_multicast_option %s\n\n" - "# No multicast routing is performed if TRUE\n" + "# If TRUE OpenSM should disable multicast support and\n" + "# no multicast routing is performed if TRUE\n" "disable_multicast %s\n\n" "# If TRUE opensm will exit on fatal initialization issues\n" "exit_on_fatal %s\n\n" "# console [off|local" @@ -1695,7 +1690,6 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) p_opts->dump_files_dir, p_opts->enable_quirks ? "TRUE" : "FALSE", p_opts->no_clients_rereg ? "TRUE" : "FALSE", - p_opts->no_multicast_option ? "TRUE" : "FALSE", p_opts->disable_multicast ? "TRUE" : "FALSE", p_opts->exit_on_fatal ? "TRUE" : "FALSE", p_opts->console, -- 1.5.4.rc2.60.gb2e62 From sashak at voltaire.com Mon May 19 10:09:16 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:09:16 +0300 Subject: [ofa-general] [PATCH] opensm: remove unused pfn_ui_* callback options In-Reply-To: <20080519170839.GK4616@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> <20080519170839.GK4616@sashak.voltaire.com> Message-ID: <20080519170916.GL4616@sashak.voltaire.com> Remove unused pfn_ui_pre_lid_assign and pfn_ui_mcast_fdb_assign callbacks from OpenSM subnet options. 
Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_subnet.h | 20 ------------------ opensm/opensm/osm_lid_mgr.c | 7 ------ opensm/opensm/osm_mcast_mgr.c | 40 ++++++----------------------------- opensm/opensm/osm_subnet.c | 4 --- 4 files changed, 7 insertions(+), 64 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index daab453..56b0165 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -248,10 +248,6 @@ typedef struct _osm_subn_opt { uint16_t console_port; cl_map_t port_prof_ignore_guids; boolean_t port_profile_switch_nodes; - osm_pfn_ui_extension_t pfn_ui_pre_lid_assign; - void *ui_pre_lid_assign_ctx; - osm_pfn_ui_mcast_extension_t pfn_ui_mcast_fdb_assign; - void *ui_mcast_fdb_assign_ctx; boolean_t sweep_on_trap; char *routing_engine_name; boolean_t connect_roots; @@ -412,22 +408,6 @@ typedef struct _osm_subn_opt { * If TRUE will count the number of switch nodes routed through * the link. If FALSE - only CA/RT nodes are counted. * -* pfn_ui_pre_lid_assign -* A UI function to be invoked prior to lid assigment. It should -* return 1 if any change was made to any lid or 0 otherwise. -* -* ui_pre_lid_assign_ctx -* A UI context (void *) to be provided to the pfn_ui_pre_lid_assign -* -* pfn_ui_mcast_fdb_assign -* A UI function to be called inside the mcast manager instead of -* the call for the build spanning tree. This will be called on -* every multicast call for create, join and leave, and is -* responsible for the mcast FDB configuration. -* -* ui_mcast_fdb_assign_ctx -* A UI context (void *) to be provided to the pfn_ui_mcast_fdb_assign -* * sweep_on_trap * Received traps will initiate a new sweep. * diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c index af0d020..7f25750 100644 --- a/opensm/opensm/osm_lid_mgr.c +++ b/opensm/opensm/osm_lid_mgr.c @@ -1212,13 +1212,6 @@ osm_signal_t osm_lid_mgr_process_sm(IN osm_lid_mgr_t * const p_mgr) persistent db */ __osm_lid_mgr_init_sweep(p_mgr); - if (p_mgr->p_subn->opt.pfn_ui_pre_lid_assign) { - OSM_LOG(p_mgr->p_log, OSM_LOG_VERBOSE, - "Invoking UI function pfn_ui_pre_lid_assign\n"); - p_mgr->p_subn->opt.pfn_ui_pre_lid_assign(p_mgr->p_subn->opt. - ui_pre_lid_assign_ctx); - } - /* Set the send_set_reqs of the p_mgr to FALSE, and we'll see if any set requests were sent. If not - can signal OSM_SIGNAL_DONE */ diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c index 683a16d..a6185fe 100644 --- a/opensm/opensm/osm_mcast_mgr.c +++ b/opensm/opensm/osm_mcast_mgr.c @@ -1085,7 +1085,6 @@ osm_mcast_mgr_process_tree(osm_sm_t * sm, { ib_api_status_t status = IB_SUCCESS; ib_net16_t mlid; - boolean_t ui_mcast_fdb_assign_func_defined; OSM_LOG_ENTER(sm->p_log); @@ -1107,44 +1106,19 @@ osm_mcast_mgr_process_tree(osm_sm_t * sm, goto Exit; } - if (sm->p_subn->opt.pfn_ui_mcast_fdb_assign) - ui_mcast_fdb_assign_func_defined = TRUE; - else - ui_mcast_fdb_assign_func_defined = FALSE; - /* Clear the multicast tables to start clean, then build the spanning tree which sets the mcast table bits for each port in the group. - We will clean the multicast tables if a ui_mcast function isn't - defined, or if such function is defined, but we got here - through a MC_CREATE request - this means we are creating a new - multicast group - clean all old data. 
*/ - if (ui_mcast_fdb_assign_func_defined == FALSE || - req_type == OSM_MCAST_REQ_TYPE_CREATE) - __osm_mcast_mgr_clear(sm, p_mgrp); - - /* If a UI function is defined, then we will call it here. - If not - the use the regular build spanning tree function */ - if (ui_mcast_fdb_assign_func_defined == FALSE) { - status = __osm_mcast_mgr_build_spanning_tree(sm, p_mgrp); - if (status != IB_SUCCESS) { - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " - "Unable to create spanning tree (%s)\n", - ib_get_err_str(status)); - goto Exit; - } - } else { - if (osm_log_is_active(sm->p_log, OSM_LOG_DEBUG)) { - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, - "Invoking UI function pfn_ui_mcast_fdb_assign\n"); - } + __osm_mcast_mgr_clear(sm, p_mgrp); - sm->p_subn->opt.pfn_ui_mcast_fdb_assign(sm->p_subn->opt. - ui_mcast_fdb_assign_ctx, - mlid, req_type, - port_guid); + status = __osm_mcast_mgr_build_spanning_tree(sm, p_mgrp); + if (status != IB_SUCCESS) { + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " + "Unable to create spanning tree (%s)\n", + ib_get_err_str(status)); + goto Exit; } Exit: diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index a916270..2191f2d 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -453,10 +453,6 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->qos_policy_file = OSM_DEFAULT_QOS_POLICY_FILE; p_opt->accum_log_file = TRUE; p_opt->port_profile_switch_nodes = FALSE; - p_opt->pfn_ui_pre_lid_assign = NULL; - p_opt->ui_pre_lid_assign_ctx = NULL; - p_opt->pfn_ui_mcast_fdb_assign = NULL; - p_opt->ui_mcast_fdb_assign_ctx = NULL; p_opt->sweep_on_trap = TRUE; p_opt->routing_engine_name = NULL; p_opt->connect_roots = FALSE; -- 1.5.4.rc2.60.gb2e62 From sashak at voltaire.com Mon May 19 10:10:06 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:10:06 +0300 Subject: [ofa-general] [PATCH] opensm: port_prof_ignore_file option In-Reply-To: <20080519170839.GK4616@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> <20080519170839.GK4616@sashak.voltaire.com> Message-ID: <20080519171006.GM4616@sashak.voltaire.com> Move run-time port_prof_ignore_guids map to osm_subnet_t struct and instead in options define port_prof_ignore_file - a name of the file with port guids to be ignored by port profiling. Command line option '-i' ('--ignore-guids') will work as before. 
Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_port_profile.h | 7 +++---- opensm/include/opensm/osm_subnet.h | 10 +++++++--- opensm/opensm/main.c | 9 ++++----- opensm/opensm/osm_subnet.c | 17 ++++++++++++++--- 4 files changed, 28 insertions(+), 15 deletions(-) diff --git a/opensm/include/opensm/osm_port_profile.h b/opensm/include/opensm/osm_port_profile.h index 2442850..bbb59ef 100644 --- a/opensm/include/opensm/osm_port_profile.h +++ b/opensm/include/opensm/osm_port_profile.h @@ -205,7 +205,7 @@ static inline boolean_t osm_port_prof_is_ignored_port(IN const osm_subn_t * p_subn, IN ib_net64_t port_guid, IN uint8_t port_num) { - const cl_map_t *p_map = &(p_subn->opt.port_prof_ignore_guids); + const cl_map_t *p_map = &p_subn->port_prof_ignore_guids; const void *p_obj = cl_map_get(p_map, port_guid); size_t res; @@ -246,7 +246,7 @@ static inline void osm_port_prof_set_ignored_port(IN osm_subn_t * p_subn, IN ib_net64_t port_guid, IN uint8_t port_num) { - cl_map_t *p_map = &(p_subn->opt.port_prof_ignore_guids); + cl_map_t *p_map = &p_subn->port_prof_ignore_guids; const void *p_obj = cl_map_get(p_map, port_guid); size_t value = 0; @@ -259,8 +259,7 @@ osm_port_prof_set_ignored_port(IN osm_subn_t * p_subn, } value = value | (1 << port_num); - cl_map_insert(&(p_subn->opt.port_prof_ignore_guids), - port_guid, (void *)value); + cl_map_insert(p_map, port_guid, (void *)value); } /* diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 56b0165..349ba79 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -246,7 +246,7 @@ typedef struct _osm_subn_opt { boolean_t accum_log_file; char *console; uint16_t console_port; - cl_map_t port_prof_ignore_guids; + char *port_prof_ignore_file; boolean_t port_profile_switch_nodes; boolean_t sweep_on_trap; char *routing_engine_name; @@ -401,8 +401,8 @@ typedef struct _osm_subn_opt { * If TRUE (default) - the log file will be accumulated. * If FALSE - the log file will be erased before starting current opensm run. * -* port_prof_ignore_guids -* A map of guids to be ignored by port profiling. +* port_prof_ignore_file +* Name of file with port guids to be ignored by port profiling. * * port_profile_switch_nodes * If TRUE will count the number of switch nodes routed through @@ -531,6 +531,7 @@ typedef struct _osm_subn { cl_qlist_t sa_sr_list; cl_qlist_t sa_infr_list; cl_ptr_vector_t port_lid_tbl; + cl_map_t port_prof_ignore_guids; ib_net16_t master_sm_base_lid; ib_net16_t sm_base_lid; ib_net64_t sm_port_guid; @@ -587,6 +588,9 @@ typedef struct _osm_subn { * Container of pointers to all Port objects in the subent. * Indexed by port LID. * +* port_prof_ignore_guids +* A map of guids to be ignored by port profiling. +* * master_sm_base_lid * The base LID owned by the subnet's master SM. * diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index fb41d50..89a42b4 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -596,7 +596,6 @@ int main(int argc, char *argv[]) int32_t vendor_debug = 0; uint32_t next_option; boolean_t cache_options = FALSE; - char *ignore_guids_file_name = NULL; uint32_t val; const char *const short_option = "i:f:ed:g:l:L:s:t:a:u:m:R:zM:U:S:P:Y:NBIQvVhorcyxp:n:q:k:C:"; @@ -702,9 +701,9 @@ int main(int argc, char *argv[]) /* Specifies ignore guids file. 
*/ - ignore_guids_file_name = optarg; + opt.port_prof_ignore_file = optarg; printf(" Ignore Guids File = %s\n", - ignore_guids_file_name); + opt.port_prof_ignore_file); break; case 'g': @@ -1027,8 +1026,8 @@ int main(int argc, char *argv[]) /* * Define some port guids to ignore during path equalization */ - if (ignore_guids_file_name != NULL) { - status = parse_ignore_guids_file(ignore_guids_file_name, &osm); + if (opt.port_prof_ignore_file != NULL) { + status = parse_ignore_guids_file(opt.port_prof_ignore_file, &osm); if (status != IB_SUCCESS) { printf("\nError from parse_ignore_guids_file (0x%X)\n", status); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 2191f2d..20add92 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -166,8 +166,9 @@ void osm_subn_destroy(IN osm_subn_t * const p_subn) } cl_ptr_vector_destroy(&p_subn->port_lid_tbl); - cl_map_remove_all(&(p_subn->opt.port_prof_ignore_guids)); - cl_map_destroy(&(p_subn->opt.port_prof_ignore_guids)); + + cl_map_remove_all(&p_subn->port_prof_ignore_guids); + cl_map_destroy(&p_subn->port_prof_ignore_guids); osm_qos_policy_destroy(p_subn->p_qos_policy); @@ -212,7 +213,7 @@ osm_subn_init(IN osm_subn_t * const p_subn, p_subn->min_ca_rate = IB_MAX_RATE; /* note that insert and remove are part of the port_profile thing */ - cl_map_init(&(p_subn->opt.port_prof_ignore_guids), 10); + cl_map_init(&p_subn->port_prof_ignore_guids, 10); p_subn->ignore_existing_lfts = TRUE; @@ -452,6 +453,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->qos = FALSE; p_opt->qos_policy_file = OSM_DEFAULT_QOS_POLICY_FILE; p_opt->accum_log_file = TRUE; + p_opt->port_prof_ignore_file = NULL; p_opt->port_profile_switch_nodes = FALSE; p_opt->sweep_on_trap = TRUE; p_opt->routing_engine_name = NULL; @@ -1270,6 +1272,9 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) opts_unpack_uint8("log_flags", p_key, p_val, &p_opts->log_flags); + opts_unpack_charp("port_prof_ignore_file", p_key, p_val, + &p_opts->port_prof_ignore_file); + opts_unpack_boolean("port_profile_switch_nodes", p_key, p_val, &p_opts->port_profile_switch_nodes); @@ -1525,6 +1530,12 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) "port_profile_switch_nodes %s\n\n", p_opts->port_profile_switch_nodes ? 
"TRUE" : "FALSE"); + if (p_opts->port_prof_ignore_file) + fprintf(opts_file, + "# Name of file with port guids to be ignored by port profiling\n" + "port_prof_ignore_file %s\n\n", + p_opts->port_prof_ignore_file); + if (p_opts->routing_engine_name) fprintf(opts_file, "# Routing engine\n" -- 1.5.4.rc2.60.gb2e62 From kliteyn at dev.mellanox.co.il Mon May 19 10:12:57 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 19 May 2008 20:12:57 +0300 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> Message-ID: <4831B519.2060002@dev.mellanox.co.il> Hal Rosenstock wrote: > Sumit, > On Mon, 2008-05-19 at 21:38 +0530, Sumit Gaur wrote: > >> Hi Hal, >> It is true that packets received are looks like proper response but as I >> mentioned before they content TID that I have never send to OFED and >> this cause the problem. Why OFED is sending these extra packets Is the >> matter to investigate. > > The received packet is SM class attribute ID 4352 which is non IBA > standard and AFAIK OFED does not send so it likely comes from some non > OFED software. Just a thought: Decimal 4352 is 0x1100. With reverted endian we get 0x0011, which is NodeInfo, that SM sends while sweeping the subnet, which comes at regular interval. As I said, just a thought... -- Yevgeny > As far as why it is being received, it is a response to a class your > application is subscribed to so it passes it through. > > As to what is going on, some sort of packet trace would be needed. > > -- Hal > >> sumit >> sumit > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Mon May 19 10:10:46 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 20:10:46 +0300 Subject: [ofa-general] [PATCH] opensm: write all OpenSM options to cache file In-Reply-To: <20080519171006.GM4616@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> <20080519170839.GK4616@sashak.voltaire.com> <20080519171006.GM4616@sashak.voltaire.com> Message-ID: <20080519171046.GN4616@sashak.voltaire.com> We want to have all OpenSM options in cache file, so it will be useful as configuration template. 
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_subnet.c | 93 ++++++++++++++++++++++---------------------- 1 files changed, 47 insertions(+), 46 deletions(-) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 20add92..2dc0ca8 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -79,6 +79,8 @@ #define OSM_PATH_MAX 256 #endif +static const char null_str[] = "(null)"; + /********************************************************************** **********************************************************************/ void osm_subn_construct(IN osm_subn_t * const p_subn) @@ -621,7 +623,7 @@ opts_unpack_charp(IN char *p_req_key, cl_log_event("OpenSM", CL_LOG_INFO, buff, NULL, 0); /* special case the "(null)" string */ - if (strcmp("(null)", p_val_str) == 0) { + if (strcmp(null_str, p_val_str) == 0) { *p_val = NULL; } else { /* @@ -1530,51 +1532,50 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) "port_profile_switch_nodes %s\n\n", p_opts->port_profile_switch_nodes ? "TRUE" : "FALSE"); - if (p_opts->port_prof_ignore_file) - fprintf(opts_file, - "# Name of file with port guids to be ignored by port profiling\n" - "port_prof_ignore_file %s\n\n", - p_opts->port_prof_ignore_file); - - if (p_opts->routing_engine_name) - fprintf(opts_file, - "# Routing engine\n" - "# Supported engines: minhop, updn, file, ftree, lash, dor\n" - "routing_engine %s\n\n", p_opts->routing_engine_name); - if (p_opts->connect_roots) - fprintf(opts_file, - "# Connect roots (use FALSE if unsure)\n" - "connect_roots %s\n\n", - p_opts->connect_roots ? "TRUE" : "FALSE"); - if (p_opts->lid_matrix_dump_file) - fprintf(opts_file, - "# Lid matrix dump file name\n" - "lid_matrix_dump_file %s\n\n", - p_opts->lid_matrix_dump_file); - if (p_opts->ucast_dump_file) - fprintf(opts_file, - "# Ucast dump file name\n" - "ucast_dump_file %s\n\n", p_opts->ucast_dump_file); - if (p_opts->root_guid_file) - fprintf(opts_file, - "# The file holding the root node guids (for fat-tree or Up/Down)\n" - "# One guid in each line\n" - "root_guid_file %s\n\n", p_opts->root_guid_file); - if (p_opts->cn_guid_file) - fprintf(opts_file, - "# The file holding the fat-tree compute node guids\n" - "# One guid in each line\n" - "cn_guid_file %s\n\n", p_opts->cn_guid_file); - if (p_opts->ids_guid_file) - fprintf(opts_file, - "# The file holding the node ids which will be used by" - " Up/Down algorithm instead\n# of GUIDs (one guid and" - " id in each line)\n" - "ids_guid_file %s\n\n", p_opts->ids_guid_file); - if (p_opts->sa_db_file) - fprintf(opts_file, - "# SA database file name\n" - "sa_db_file %s\n\n", p_opts->sa_db_file); + fprintf(opts_file, + "# Name of file with port guids to be ignored by port profiling\n" + "port_prof_ignore_file %s\n\n", p_opts->port_prof_ignore_file ? + p_opts->port_prof_ignore_file : null_str); + + fprintf(opts_file, + "# Routing engine\n" + "# Supported engines: minhop, updn, file, ftree, lash, dor\n" + "routing_engine %s\n\n", p_opts->routing_engine_name ? + p_opts->routing_engine_name : null_str); + + fprintf(opts_file, + "# Connect roots (use FALSE if unsure)\n" + "connect_roots %s\n\n", + p_opts->connect_roots ? "TRUE" : "FALSE"); + + fprintf(opts_file, + "# Lid matrix dump file name\n" + "lid_matrix_dump_file %s\n\n", p_opts->lid_matrix_dump_file ? + p_opts->lid_matrix_dump_file : null_str); + + fprintf(opts_file, + "# Ucast dump file name\nucast_dump_file %s\n\n", + p_opts->ucast_dump_file ? 
p_opts->ucast_dump_file : null_str); + + fprintf(opts_file, + "# The file holding the root node guids (for fat-tree or Up/Down)\n" + "# One guid in each line\nroot_guid_file %s\n\n", + p_opts->root_guid_file ? p_opts->root_guid_file : null_str); + + fprintf(opts_file, + "# The file holding the fat-tree compute node guids\n" + "# One guid in each line\ncn_guid_file %s\n\n", + p_opts->cn_guid_file ? p_opts->cn_guid_file : null_str); + + fprintf(opts_file, + "# The file holding the node ids which will be used by" + " Up/Down algorithm instead\n# of GUIDs (one guid and" + " id in each line)\nids_guid_file %s\n\n", + p_opts->ids_guid_file ? p_opts->ids_guid_file : null_str); + + fprintf(opts_file, + "# SA database file name\nsa_db_file %s\n\n", + p_opts->sa_db_file ? p_opts->sa_db_file : null_str); fprintf(opts_file, "#\n# HANDOVER - MULTIPLE SMs OPTIONS\n#\n" -- 1.5.4.rc2.60.gb2e62 From hrosenstock at xsigo.com Mon May 19 10:18:28 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 10:18:28 -0700 Subject: [ofa-general] [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <1211121890.12616.381.camel@hrosenstock-ws.xsigo.com> <483046CA.3010403@Voltaire.COM> <1211202081.12616.417.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211217509.12616.487.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-19 at 18:47 +0300, Olga Shern (Voltaire) wrote: > > > On 5/19/08, Hal Rosenstock wrote: On Sun, 2008-05-18 at 18:10 +0300, Moni Shoua wrote: > > Hal Rosenstock wrote: > > > On Sun, 2008-05-18 at 15:36 +0300, Moni Shoua wrote: > > >> The purpose of this patch is to make the events that are related to SM change > > >> (namely the CLIENT_REREGISTER event and the SM_CHANGE event) less disruptive. > > >> When SM related events are handled, it is not necessary to flush unicast > > >> info from the device, but only multicast info. > > > > > > How is unicast invalidation handled on these changes ? On a local LID > > > change event, how does an end port know/determine what else (e.g. other > > > LIDs, paths) the SM might have changed (that specifically might affect > > > IPoIB since this is limited to IPoIB) ? > > I'm not sure I understand the question, but a local LID change would be handled as before > > with a LID_CHANGE event. For this type of event, there is no change in what IPoIB does to cope. > > It's SM change which I'm not sure about. I'm unaware of an IBA spec > guarantee on preservation of paths on SM failover. Can you point me at > this ? > > Also, as many routing protocols are dependent on where they are run in > the subnet (location of the SM node in the topology), I don't think all path > parameters can be maintained in a heterogeneous subnet and hence > would need refreshing (or flushing to cause this) on an SM change event. > > So while it may work in a homogeneous subnet, I don't think this is the > general case. > > You are right, there is no IBA spec requirement to preserve LIDs, but all > SMs that we are familiar with > are doing so. It's more than LID preservation though; it's also routing preservation (and rate, etc.) if a path is rerouted in a heterogeneous subnet on SM failover. In terms of SM LID preservation, in the case of OpenSM there are two additional scenarios where LID preservation on SM failover is not a valid assumption: 1. If the guid2lid files are not sync'd with OpenSM instances. 2. 
If reassign LIDs is used. Also, since it's not a spec requirement, I don't see how this can be relied upon. Maybe there could be some option (mod param) for this being the case in some configurations, with the default being that LID preservation is not assumed ? Also, what happens when that assumption is not valid ? I'm referring to the case where it's treated as a less disruptive event but it really needed to be treated as a more disruptive one. > You are referring to the case where there is a remote LID change but not > a local LID change, Yes, in addition to the SM change event handling issues above. > but also without this patch this case is not taken care of. True. > We should think about a solution for this case in the future. Indeed. -- Hal > > > Also, wouldn't there be similar issues with other ULPs ? > > There might be, but the purpose of this one is to make things better for IPoIB > > Understood; just trying to widen the scope. IMO other ULPs should at > least be inspected for the same issues. The multicast issue is IPoIB > specific but local LID, client reregister (maybe only events for other > ULPs as multicast and service records may not apply (perhaps except DAPL > but this may be old implementation)) and SM changes apply to all. > > -- Hal > > > > -- Hal > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi- > bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From hrosenstock at xsigo.com Mon May 19 10:19:55 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Mon, 19 May 2008 10:19:55 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <4831B519.2060002@dev.mellanox.co.il> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> <4831B519.2060002@dev.mellanox.co.il> Message-ID: <1211217595.12616.490.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-05-19 at 20:12 +0300, Yevgeny Kliteynik wrote: > Hal Rosenstock wrote: > > Sumit, > > On Mon, 2008-05-19 at 21:38 +0530, Sumit Gaur wrote: > > > >> Hi Hal, > >> It is true that the packets received look like proper responses, but as I > >> mentioned before they contain a TID that I never sent to OFED, and > >> this causes the problem. Why OFED is sending these extra packets is the > >> matter to investigate. > > > > The received packet is SM class attribute ID 4352, which is not IBA > > standard and which AFAIK OFED does not send, so it likely comes from some non > > OFED software. > > Just a thought: > Decimal 4352 is 0x1100. With the byte order reversed we get 0x0011, > which is NodeInfo, which the SM sends while sweeping the subnet - > and sweeps come at regular intervals. > > As I said, just a thought... Yes, that makes sense to me. As this is an incoming response, maybe this node is running the SM as well as this application. 
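To make the byte-order arithmetic concrete: MAD attribute IDs are 16-bit fields carried in network byte order on the wire, so reading NodeInfo (0x0011) without the byte swap yields 0x1100, i.e. 4352 decimal. A standalone sketch of the swap (illustration only, not OFED code):

	#include <stdio.h>

	int main(void)
	{
		unsigned short attr_id = 0x0011; /* NodeInfo */
		/* swap the two bytes, as a missing ntohs() would leave them */
		unsigned short raw = (attr_id >> 8) | ((attr_id & 0xff) << 8);
		printf("%u\n", raw); /* prints 4352 */
		return 0;
	}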
-- Hal > -- Yevgeny > > > As far as why it is being received, it is a response to a class your > > application is subscribed to, so it passes it through. > > > > As to what is going on, some sort of packet trace would be needed. > > > > -- Hal > > > >> sumit > >> sumit > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:31:23 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:01:23 +0530 Subject: [ofa-general] [PATCH v2 00/13] QLogic VNIC Driver Message-ID: <20080519102843.12355.832.stgit@localhost.localdomain> Roland, This is the second round of the QLogic Virtual NIC driver patch series for submission to the 2.6.27 kernel. The series has been tested against your for-2.6.27 branch. Based on comments received on the first series of patches, the following fixes are introduced in this series: - Removal of the IB cache implementation for the QLogic VNIC ULP. - netdev->priv structure allocation through alloc_netdev and use of netdev_priv() to access the same. - Implementation of a spinlock to protect potential vnic->current_path race conditions. - Removed the use of the "vnic->xmit_started" variable. - vnic_multicast.c coding style and lock fixes. - Use of the "time_after" macro for jiffies comparison. - vnic_npevent_str has been moved to vnic_main.c to avoid its inclusion every time along with vnic_main.h - Use of the kernel "is_power_of_2" function in place of the driver's own. - The global "recv_ref" variable has been renamed to "vnic_recv_ref". I have signed off all patches in the series. Sparse endianness checking for the driver did not give any warnings, and checkpatch.pl gave a few warnings about lines slightly longer than 80 columns. Background: As mentioned in the first version of the patch series, this series adds the QLogic Virtual NIC (VNIC) driver, which works in conjunction with the QLogic Ethernet Virtual I/O Controller (EVIC) hardware. The VNIC driver, along with the QLogic EVIC's two 10 Gigabit Ethernet ports, enables Infiniband clusters to connect to Ethernet networks. This driver also works with the earlier version of the I/O Controller, the VEx. The QLogic VNIC driver creates virtual Ethernet interfaces and tunnels the Ethernet data to/from the EVIC over Infiniband using an Infiniband reliable connection. 
[PATCH v2 01/13] QLogic VNIC: Driver - netdev implementation [PATCH v2 02/13] QLogic VNIC: Netpath - abstraction of connection to EVIC/VEx [PATCH v2 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx [PATCH v2 04/13] QLogic VNIC: Implementation of Control path of communication protocol [PATCH v2 05/13] QLogic VNIC: Implementation of Data path of communication protocol [PATCH v2 06/13] QLogic VNIC: IB core stack interaction [PATCH v2 07/13] QLogic VNIC: Handling configurable parameters of the driver [PATCH v2 08/13] QLogic VNIC: sysfs interface implementation for the driver [PATCH v2 09/13] QLogic VNIC: IB Multicast for Ethernet broadcast/multicast [PATCH v2 10/13] QLogic VNIC: Driver Statistics collection [PATCH v2 11/13] QLogic VNIC: Driver utility file - implements various utility macros [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. [PATCH v2 13/13] QLogic VNIC: Modifications to IB Kconfig and Makefile drivers/infiniband/Kconfig | 2 drivers/infiniband/Makefile | 1 drivers/infiniband/ulp/qlgc_vnic/Kconfig | 28 drivers/infiniband/ulp/qlgc_vnic/Makefile | 13 drivers/infiniband/ulp/qlgc_vnic/vnic_config.c | 379 +++ drivers/infiniband/ulp/qlgc_vnic/vnic_config.h | 242 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_control.c | 2286 ++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_control.h | 179 ++ .../infiniband/ulp/qlgc_vnic/vnic_control_pkt.h | 368 +++ drivers/infiniband/ulp/qlgc_vnic/vnic_data.c | 1492 +++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_data.h | 206 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c | 1043 +++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h | 206 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_main.c | 1098 ++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_main.h | 154 + drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c | 319 +++ drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h | 77 + drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c | 112 + drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h | 79 + drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c | 234 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h | 497 ++++ drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1131 ++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h | 62 + drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h | 103 + drivers/infiniband/ulp/qlgc_vnic/vnic_util.h | 250 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c | 1214 +++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h | 176 ++ 27 files changed, 11951 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Kconfig create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Makefile create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c create 
mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_util.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h -- Regards, Ram From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:31:58 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:01:58 +0530 Subject: [ofa-general] [PATCH v2 01/13] QLogic VNIC: Driver - netdev implementation In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103158.12355.61926.stgit@localhost.localdomain> From: Ramachandra K QLogic Virtual NIC Driver. This patch implements netdev registration, netdev functions and state maintenance of the QLogic Virtual NIC corresponding to the various events associated with the QLogic Ethernet Virtual I/O Controller (EVIC/VEx) connection. Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_main.c | 1098 ++++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_main.h | 154 ++++ 2 files changed, 1252 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c new file mode 100644 index 0000000..570c069 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c @@ -0,0 +1,1098 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_netpath.h" +#include "vnic_viport.h" +#include "vnic_ib.h" +#include "vnic_stats.h" + +#define MODULEVERSION "1.3.0.0.4" +#define MODULEDETAILS \ + "QLogic Corp. Virtual NIC (VNIC) driver version " MODULEVERSION + +MODULE_AUTHOR("QLogic Corp."); +MODULE_DESCRIPTION(MODULEDETAILS); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_SUPPORTED_DEVICE("QLogic Ethernet Virtual I/O Controller"); + +u32 vnic_debug; + +module_param(vnic_debug, uint, 0444); +MODULE_PARM_DESC(vnic_debug, "Enable debug tracing if > 0"); + +LIST_HEAD(vnic_list); + +static DECLARE_WAIT_QUEUE_HEAD(vnic_npevent_queue); +static LIST_HEAD(vnic_npevent_list); +static DECLARE_COMPLETION(vnic_npevent_thread_exit); +static spinlock_t vnic_npevent_list_lock; +static struct task_struct *vnic_npevent_thread; +static int vnic_npevent_thread_end; + +static const char *const vnic_npevent_str[] = { + "PRIMARY CONNECTED", + "PRIMARY DISCONNECTED", + "PRIMARY CARRIER", + "PRIMARY NO CARRIER", + "PRIMARY TIMER EXPIRED", + "PRIMARY SETLINK", + "SECONDARY CONNECTED", + "SECONDARY DISCONNECTED", + "SECONDARY CARRIER", + "SECONDARY NO CARRIER", + "SECONDARY TIMER EXPIRED", + "SECONDARY SETLINK", + "FREE VNIC", +}; + +void vnic_connected(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_connected()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_CONNECTED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_CONNECTED); + + vnic_connected_stats(vnic); +} + +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_disconnected()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_DISCONNECTED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_DISCONNECTED); +} + +void vnic_link_up(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_link_up()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKUP); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKUP); +} + +void vnic_link_down(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_link_down()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKDOWN); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKDOWN); +} + +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath) +{ + unsigned long flags; + + VNIC_FUNCTION("vnic_stop_xmit()\n"); + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (netpath == vnic->current_path) { + if (!netif_queue_stopped(vnic->netdevice)) { + netif_stop_queue(vnic->netdevice); + vnic->failed_over = 0; + } + + vnic_stop_xmit_stats(vnic); + } + spin_unlock_irqrestore(&vnic->current_path_lock, flags); +} + +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath) +{ + unsigned long flags; + + VNIC_FUNCTION("vnic_restart_xmit()\n"); + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (netpath == vnic->current_path) { + if (netif_queue_stopped(vnic->netdevice)) + netif_wake_queue(vnic->netdevice); + + vnic_restart_xmit_stats(vnic); + } + spin_unlock_irqrestore(&vnic->current_path_lock, flags); +} + +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, + struct sk_buff *skb) +{ + VNIC_FUNCTION("vnic_recv_packet()\n"); + if ((netpath != vnic->current_path) || !vnic->open) { + VNIC_INFO("tossing packet\n"); + dev_kfree_skb(skb); + return; + } + + vnic->netdevice->last_rx 
= jiffies; + skb->dev = vnic->netdevice; + skb->protocol = eth_type_trans(skb, skb->dev); + if (!vnic->config->use_rx_csum) + skb->ip_summed = CHECKSUM_NONE; + netif_rx(skb); + vnic_recv_pkt_stats(vnic); +} + +static struct net_device_stats *vnic_get_stats(struct net_device *device) +{ + struct vnic *vnic; + struct netpath *np; + unsigned long flags; + + VNIC_FUNCTION("vnic_get_stats()\n"); + vnic = netdev_priv(device); + + spin_lock_irqsave(&vnic->current_path_lock, flags); + np = vnic->current_path; + if (np && np->viport) { + atomic_inc(&np->viport->reference_count); + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + viport_get_stats(np->viport, &vnic->stats); + atomic_dec(&np->viport->reference_count); + wake_up(&np->viport->reference_queue); + } else + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + + return &vnic->stats; +} + +static int vnic_open(struct net_device *device) +{ + struct vnic *vnic; + + VNIC_FUNCTION("vnic_open()\n"); + vnic = netdev_priv(device); + + vnic->open++; + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK); + netif_start_queue(vnic->netdevice); + + return 0; +} + +static int vnic_stop(struct net_device *device) +{ + struct vnic *vnic; + int ret = 0; + + VNIC_FUNCTION("vnic_stop()\n"); + vnic = netdev_priv(device); + netif_stop_queue(device); + vnic->open--; + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK); + + return ret; +} + +static int vnic_hard_start_xmit(struct sk_buff *skb, + struct net_device *device) +{ + struct vnic *vnic; + struct netpath *np; + cycles_t xmit_time; + int ret = -1; + + VNIC_FUNCTION("vnic_hard_start_xmit()\n"); + vnic = netdev_priv(device); + np = vnic->current_path; + + vnic_pre_pkt_xmit_stats(&xmit_time); + + if (np && np->viport) + ret = viport_xmit_packet(np->viport, skb); + + if (ret) { + vnic_xmit_fail_stats(vnic); + dev_kfree_skb_any(skb); + vnic->stats.tx_dropped++; + goto out; + } + + device->trans_start = jiffies; + vnic_post_pkt_xmit_stats(vnic, xmit_time); +out: + return 0; +} + +static void vnic_tx_timeout(struct net_device *device) +{ + struct vnic *vnic; + struct viport *viport = NULL; + unsigned long flags; + + VNIC_FUNCTION("vnic_tx_timeout()\n"); + vnic = netdev_priv(device); + device->trans_start = jiffies; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (vnic->current_path && vnic->current_path->viport) { + if (vnic->failed_over) { + if (vnic->current_path == &vnic->primary_path) + viport = vnic->secondary_path.viport; + else if (vnic->current_path == &vnic->secondary_path) + viport = vnic->primary_path.viport; + } else + viport = vnic->current_path->viport; + + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + if (viport) + viport_failure(viport); + } else + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + + VNIC_ERROR("vnic_tx_timeout\n"); +} + +static void vnic_set_multicast_list(struct net_device *device) +{ + struct vnic *vnic; + unsigned long flags; + + VNIC_FUNCTION("vnic_set_multicast_list()\n"); + vnic = netdev_priv(device); + + spin_lock_irqsave(&vnic->lock, flags); + if (device->mc_count == 0) { + if (vnic->mc_list_len) { + vnic->mc_list_len = vnic->mc_count = 0; + kfree(vnic->mc_list); + } + } else { + struct dev_mc_list *mc_list = device->mc_list; + int i; + + if (device->mc_count > vnic->mc_list_len) { + if (vnic->mc_list_len) + kfree(vnic->mc_list); + vnic->mc_list_len = device->mc_count + 10; + vnic->mc_list = kmalloc(vnic->mc_list_len * + sizeof *mc_list, GFP_ATOMIC); + if (!vnic->mc_list) { + vnic->mc_list_len = 
vnic->mc_count = 0; + VNIC_ERROR("failed allocating mc_list\n"); + goto failure; + } + } + vnic->mc_count = device->mc_count; + for (i = 0; i < device->mc_count; i++) { + vnic->mc_list[i] = *mc_list; + vnic->mc_list[i].next = &vnic->mc_list[i + 1]; + mc_list = mc_list->next; + } + } + spin_unlock_irqrestore(&vnic->lock, flags); + + if (vnic->primary_path.viport) + viport_set_multicast(vnic->primary_path.viport, + vnic->mc_list, vnic->mc_count); + + if (vnic->secondary_path.viport) + viport_set_multicast(vnic->secondary_path.viport, + vnic->mc_list, vnic->mc_count); + + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK); + return; +failure: + spin_unlock_irqrestore(&vnic->lock, flags); +} + +/** + * Following set of functions queues up the events for EVIC and the + * kernel thread queuing up the event might return. + */ +static int vnic_set_mac_address(struct net_device *device, void *addr) +{ + struct vnic *vnic; + struct sockaddr *sockaddr = addr; + u8 *address; + int ret = -1; + + VNIC_FUNCTION("vnic_set_mac_address()\n"); + vnic = netdev_priv(device); + + if (!is_valid_ether_addr(sockaddr->sa_data)) + return -EADDRNOTAVAIL; + + if (netif_running(device)) + return -EBUSY; + + memcpy(device->dev_addr, sockaddr->sa_data, ETH_ALEN); + address = sockaddr->sa_data; + + if (vnic->primary_path.viport) + ret = viport_set_unicast(vnic->primary_path.viport, + address); + + if (ret) + return ret; + + if (vnic->secondary_path.viport) + viport_set_unicast(vnic->secondary_path.viport, address); + + vnic->mac_set = 1; + return 0; +} + +static int vnic_change_mtu(struct net_device *device, int mtu) +{ + struct vnic *vnic; + int ret = 0; + int pri_max_mtu; + int sec_max_mtu; + + VNIC_FUNCTION("vnic_change_mtu()\n"); + vnic = netdev_priv(device); + + if (vnic->primary_path.viport) + pri_max_mtu = viport_max_mtu(vnic->primary_path.viport); + else + pri_max_mtu = MAX_PARAM_VALUE; + + if (vnic->secondary_path.viport) + sec_max_mtu = viport_max_mtu(vnic->secondary_path.viport); + else + sec_max_mtu = MAX_PARAM_VALUE; + + if ((mtu < pri_max_mtu) && (mtu < sec_max_mtu)) { + device->mtu = mtu; + vnic_npevent_queue_evt(&vnic->primary_path, + VNIC_PRINP_SETLINK); + vnic_npevent_queue_evt(&vnic->secondary_path, + VNIC_SECNP_SETLINK); + } else if (pri_max_mtu < sec_max_mtu) + printk(KERN_WARNING PFX "%s: Maximum " + "supported MTU size is %d. " + "Cannot set MTU to %d\n", + vnic->config->name, pri_max_mtu, mtu); + else + printk(KERN_WARNING PFX "%s: Maximum " + "supported MTU size is %d. " + "Cannot set MTU to %d\n", + vnic->config->name, sec_max_mtu, mtu); + + return ret; +} + +static int vnic_npevent_register(struct vnic *vnic, struct netpath *netpath) +{ + u8 *address; + int ret; + + if (!vnic->mac_set) { + /* if netpath == secondary_path, then the primary path isn't + * connected. MAC address will be set when the primary + * connects. 
+ */ + netpath_get_hw_addr(netpath, vnic->netdevice->dev_addr); + address = vnic->netdevice->dev_addr; + + if (vnic->secondary_path.viport) + viport_set_unicast(vnic->secondary_path.viport, + address); + + vnic->mac_set = 1; + } + ret = register_netdev(vnic->netdevice); + if (ret) { + printk(KERN_ERR PFX "%s failed registering netdev " + "error %d - calling viport_failure\n", + config_viport_name(vnic->primary_path.viport->config), + ret); + vnic_free(vnic); + printk(KERN_ERR PFX "%s DELETED : register_netdev failure\n", + config_viport_name(vnic->primary_path.viport->config)); + return ret; + } + + vnic->state = VNIC_REGISTERED; + vnic->carrier = 2; /*special value to force netif_carrier_(on|off)*/ + return 0; +} + +static void vnic_npevent_dequeue_all(struct vnic *vnic) +{ + unsigned long flags; + struct vnic_npevent *npevt, *tmp; + + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) + goto out; + list_for_each_entry_safe(npevt, tmp, &vnic_npevent_list, + list_ptrs) { + if ((npevt->vnic == vnic)) { + list_del(&npevt->list_ptrs); + kfree(npevt); + } + } +out: + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); +} + +static void update_path_and_reconnect(struct netpath *netpath, + struct vnic *vnic) +{ + struct viport_config *config = netpath->viport->config; + int delay = 1; + + if (vnic_ib_get_path(netpath, vnic)) + return; + /* + * tell viport_connect to wait for default_no_path_timeout + * before connecting if we are retrying the same path index + * within default_no_path_timeout. + * This prevents flooding connect requests to a path (or set + * of paths) that aren't successfully connecting for some reason. + */ + if (time_after(jiffies, + (netpath->connect_time + vnic->config->no_path_timeout))) { + netpath->path_idx = config->path_idx; + netpath->connect_time = jiffies; + netpath->delay_reconnect = 0; + delay = 0; + } else if (config->path_idx != netpath->path_idx) { + delay = netpath->delay_reconnect; + netpath->path_idx = config->path_idx; + netpath->delay_reconnect = 1; + } else + delay = 1; + viport_connect(netpath->viport, delay); +} + +static inline void vnic_set_checksum_flag(struct vnic *vnic, + struct netpath *target_path) +{ + unsigned long flags; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + vnic->current_path = target_path; + vnic->failed_over = 1; + if (vnic->config->use_tx_csum && + netpath_can_tx_csum(vnic->current_path)) + vnic->netdevice->features |= NETIF_F_IP_CSUM; + + spin_unlock_irqrestore(&vnic->current_path_lock, flags); +} + +static void vnic_set_uni_multicast(struct vnic *vnic, + struct netpath *netpath) +{ + unsigned long flags; + u8 *address; + + if (vnic->mac_set) { + address = vnic->netdevice->dev_addr; + + if (netpath->viport) + viport_set_unicast(netpath->viport, address); + } + spin_lock_irqsave(&vnic->lock, flags); + + if (vnic->mc_list && netpath->viport) + viport_set_multicast(netpath->viport, vnic->mc_list, + vnic->mc_count); + + spin_unlock_irqrestore(&vnic->lock, flags); + if (vnic->state == VNIC_REGISTERED) { + if (!netpath->viport) + return; + viport_set_link(netpath->viport, + vnic->netdevice->flags & ~IFF_UP, + vnic->netdevice->mtu); + } +} + +static void vnic_set_netpath_timers(struct vnic *vnic, + struct netpath *netpath) +{ + switch (netpath->timer_state) { + case NETPATH_TS_IDLE: + netpath->timer_state = NETPATH_TS_ACTIVE; + if (vnic->state == VNIC_UNINITIALIZED) + netpath_timer(netpath, + vnic->config-> + primary_connect_timeout); + else + netpath_timer(netpath, + vnic->config-> + 
primary_reconnect_timeout); + break; + case NETPATH_TS_ACTIVE: + /*nothing to do*/ + break; + case NETPATH_TS_EXPIRED: + if (vnic->state == VNIC_UNINITIALIZED) + vnic_npevent_register(vnic, netpath); + + break; + } +} + +static void vnic_check_primary_path_timer(struct vnic *vnic) +{ + switch (vnic->primary_path.timer_state) { + case NETPATH_TS_ACTIVE: + /* nothing to do. just wait */ + break; + case NETPATH_TS_IDLE: + netpath_timer(&vnic->primary_path, + vnic->config-> + primary_switch_timeout); + break; + case NETPATH_TS_EXPIRED: + printk(KERN_INFO PFX + "%s: switching to primary path\n", + vnic->config->name); + + vnic_set_checksum_flag(vnic, &vnic->primary_path); + break; + } +} + +static void vnic_carrier_loss(struct vnic *vnic, + struct netpath *last_path) +{ + if (vnic->primary_path.carrier) { + vnic->carrier = 1; + vnic_set_checksum_flag(vnic, &vnic->primary_path); + + if (last_path && last_path != vnic->current_path) + printk(KERN_INFO PFX + "%s: failing over to primary path\n", + vnic->config->name); + else if (!last_path) + printk(KERN_INFO PFX "%s: using primary path\n", + vnic->config->name); + + } else if ((vnic->secondary_path.carrier) && + (vnic->secondary_path.timer_state != NETPATH_TS_ACTIVE)) { + vnic->carrier = 1; + vnic_set_checksum_flag(vnic, &vnic->secondary_path); + + if (last_path && last_path != vnic->current_path) + printk(KERN_INFO PFX + "%s: failing over to secondary path\n", + vnic->config->name); + else if (!last_path) + printk(KERN_INFO PFX "%s: using secondary path\n", + vnic->config->name); + + } + +} + +static void vnic_handle_path_change(struct vnic *vnic, + struct netpath **path) +{ + struct netpath *last_path = *path; + + if (!last_path) { + if (vnic->current_path == &vnic->primary_path) + last_path = &vnic->secondary_path; + else + last_path = &vnic->primary_path; + + } + + if (vnic->current_path && vnic->current_path->viport) + viport_set_link(vnic->current_path->viport, + vnic->netdevice->flags, + vnic->netdevice->mtu); + + if (last_path->viport) + viport_set_link(last_path->viport, + vnic->netdevice->flags & + ~IFF_UP, vnic->netdevice->mtu); + + vnic_restart_xmit(vnic, vnic->current_path); +} + +static void vnic_report_path_change(struct vnic *vnic, + struct netpath *last_path, + int other_path_ok) +{ + if (!vnic->current_path) { + if (last_path == &vnic->primary_path) + printk(KERN_INFO PFX "%s: primary path lost, " + "no failover path available\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: secondary path lost, " + "no failover path available\n", + vnic->config->name); + return; + } + + if (last_path != vnic->current_path) + return; + + if (vnic->current_path == &vnic->secondary_path) { + if (other_path_ok != vnic->primary_path.carrier) { + if (other_path_ok) + printk(KERN_INFO PFX "%s: primary path no" + " longer available for failover\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: primary path now" + " available for failover\n", + vnic->config->name); + } + } else { + if (other_path_ok != vnic->secondary_path.carrier) { + if (other_path_ok) + printk(KERN_INFO PFX "%s: secondary path no" + " longer available for failover\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: secondary path now" + " available for failover\n", + vnic->config->name); + } + } +} + +static void vnic_handle_free_vnic_evt(struct vnic *vnic) +{ + unsigned long flags; + + if (!netif_queue_stopped(vnic->netdevice)) + netif_stop_queue(vnic->netdevice); + + netpath_timer_stop(&vnic->primary_path); + 
netpath_timer_stop(&vnic->secondary_path); + spin_lock_irqsave(&vnic->current_path_lock, flags); + vnic->current_path = NULL; + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + netpath_free(&vnic->primary_path); + netpath_free(&vnic->secondary_path); + if (vnic->state == VNIC_REGISTERED) + unregister_netdev(vnic->netdevice); + + vnic_npevent_dequeue_all(vnic); + kfree(vnic->config); + if (vnic->mc_list_len) { + vnic->mc_list_len = vnic->mc_count = 0; + kfree(vnic->mc_list); + } + + sysfs_remove_group(&vnic->dev_info.dev.kobj, + &vnic_dev_attr_group); + vnic_cleanup_stats_files(vnic); + device_unregister(&vnic->dev_info.dev); + wait_for_completion(&vnic->dev_info.released); + free_netdev(vnic->netdevice); +} + +static struct vnic *vnic_handle_npevent(struct vnic *vnic, + enum vnic_npevent_type npevt_type) +{ + struct netpath *netpath; + const char *netpath_str; + + if (npevt_type <= VNIC_PRINP_LASTTYPE) + netpath_str = netpath_to_string(vnic, &vnic->primary_path); + else if (npevt_type <= VNIC_SECNP_LASTTYPE) + netpath_str = netpath_to_string(vnic, &vnic->secondary_path); + else + netpath_str = netpath_to_string(vnic, vnic->current_path); + + VNIC_INFO("%s: processing %s, netpath=%s, carrier=%d\n", + vnic->config->name, vnic_npevent_str[npevt_type], + netpath_str, vnic->carrier); + + switch (npevt_type) { + case VNIC_PRINP_CONNECTED: + netpath = &vnic->primary_path; + if (vnic->state == VNIC_UNINITIALIZED) { + if (vnic_npevent_register(vnic, netpath)) + break; + } + vnic_set_uni_multicast(vnic, netpath); + break; + case VNIC_SECNP_CONNECTED: + vnic_set_uni_multicast(vnic, &vnic->secondary_path); + break; + case VNIC_PRINP_TIMEREXPIRED: + netpath = &vnic->primary_path; + netpath->timer_state = NETPATH_TS_EXPIRED; + if (!netpath->carrier) + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_SECNP_TIMEREXPIRED: + netpath = &vnic->secondary_path; + netpath->timer_state = NETPATH_TS_EXPIRED; + if (!netpath->carrier) + update_path_and_reconnect(netpath, vnic); + else { + if (vnic->state == VNIC_UNINITIALIZED) + vnic_npevent_register(vnic, netpath); + } + break; + case VNIC_PRINP_LINKUP: + vnic->primary_path.carrier = 1; + break; + case VNIC_SECNP_LINKUP: + netpath = &vnic->secondary_path; + netpath->carrier = 1; + if (!vnic->carrier) + vnic_set_netpath_timers(vnic, netpath); + break; + case VNIC_PRINP_LINKDOWN: + vnic->primary_path.carrier = 0; + break; + case VNIC_SECNP_LINKDOWN: + if (vnic->state == VNIC_UNINITIALIZED) + netpath_timer_stop(&vnic->secondary_path); + vnic->secondary_path.carrier = 0; + break; + case VNIC_PRINP_DISCONNECTED: + netpath = &vnic->primary_path; + netpath_timer_stop(netpath); + netpath->carrier = 0; + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_SECNP_DISCONNECTED: + netpath = &vnic->secondary_path; + netpath_timer_stop(netpath); + netpath->carrier = 0; + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_PRINP_SETLINK: + netpath = vnic->current_path; + if (!netpath || !netpath->viport) + break; + viport_set_link(netpath->viport, + vnic->netdevice->flags, + vnic->netdevice->mtu); + break; + case VNIC_SECNP_SETLINK: + netpath = &vnic->secondary_path; + if (!netpath || !netpath->viport) + break; + viport_set_link(netpath->viport, + vnic->netdevice->flags, + vnic->netdevice->mtu); + break; + case VNIC_NP_FREEVNIC: + vnic_handle_free_vnic_evt(vnic); + vnic = NULL; + break; + } + return vnic; +} + +static int vnic_npevent_statemachine(void *context) +{ + struct vnic_npevent *vnic_link_evt; + enum vnic_npevent_type 
npevt_type; + struct vnic *vnic; + int last_carrier; + int other_path_ok = 0; + struct netpath *last_path; + + while (!vnic_npevent_thread_end || + !list_empty(&vnic_npevent_list)) { + unsigned long flags; + + wait_event_interruptible(vnic_npevent_queue, + !list_empty(&vnic_npevent_list) + || vnic_npevent_thread_end); + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) { + spin_unlock_irqrestore(&vnic_npevent_list_lock, + flags); + VNIC_INFO("netpath statemachine wake" + " on empty list\n"); + continue; + } + + vnic_link_evt = list_entry(vnic_npevent_list.next, + struct vnic_npevent, + list_ptrs); + list_del(&vnic_link_evt->list_ptrs); + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); + vnic = vnic_link_evt->vnic; + npevt_type = vnic_link_evt->event_type; + kfree(vnic_link_evt); + + if (vnic->current_path == &vnic->secondary_path) + other_path_ok = vnic->primary_path.carrier; + else if (vnic->current_path == &vnic->primary_path) + other_path_ok = vnic->secondary_path.carrier; + + vnic = vnic_handle_npevent(vnic, npevt_type); + + if (!vnic) + continue; + + last_carrier = vnic->carrier; + last_path = vnic->current_path; + + if (!vnic->current_path || + !vnic->current_path->carrier) { + vnic->carrier = 0; + vnic->current_path = NULL; + vnic->netdevice->features &= ~NETIF_F_IP_CSUM; + } + + if (!vnic->carrier) + vnic_carrier_loss(vnic, last_path); + else if ((vnic->current_path != &vnic->primary_path) && + (vnic->config->prefer_primary) && + (vnic->primary_path.carrier)) + vnic_check_primary_path_timer(vnic); + + if (last_path) + vnic_report_path_change(vnic, last_path, + other_path_ok); + + VNIC_INFO("new netpath=%s, carrier=%d\n", + netpath_to_string(vnic, vnic->current_path), + vnic->carrier); + + if (vnic->current_path != last_path) + vnic_handle_path_change(vnic, &last_path); + + if (vnic->carrier != last_carrier) { + if (vnic->carrier) { + VNIC_INFO("netif_carrier_on\n"); + netif_carrier_on(vnic->netdevice); + vnic_carrier_loss_stats(vnic); + } else { + VNIC_INFO("netif_carrier_off\n"); + netif_carrier_off(vnic->netdevice); + vnic_disconn_stats(vnic); + } + + } + } + complete_and_exit(&vnic_npevent_thread_exit, 0); + return 0; +} + +void vnic_npevent_queue_evt(struct netpath *netpath, + enum vnic_npevent_type evt) +{ + struct vnic_npevent *npevent; + unsigned long flags; + + npevent = kmalloc(sizeof *npevent, GFP_ATOMIC); + if (!npevent) { + VNIC_ERROR("Could not allocate memory for vnic event\n"); + return; + } + npevent->vnic = netpath->parent; + npevent->event_type = evt; + INIT_LIST_HEAD(&npevent->list_ptrs); + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + list_add_tail(&npevent->list_ptrs, &vnic_npevent_list); + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); + wake_up(&vnic_npevent_queue); +} + +void vnic_npevent_dequeue_evt(struct netpath *netpath, + enum vnic_npevent_type evt) +{ + unsigned long flags; + struct vnic_npevent *npevt, *tmp; + struct vnic *vnic = netpath->parent; + + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) + goto out; + list_for_each_entry_safe(npevt, tmp, &vnic_npevent_list, + list_ptrs) { + if ((npevt->vnic == vnic) && + (npevt->event_type == evt)) { + list_del(&npevt->list_ptrs); + kfree(npevt); + break; + } + } +out: + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); +} + +static int vnic_npevent_start(void) +{ + VNIC_FUNCTION("vnic_npevent_start()\n"); + + spin_lock_init(&vnic_npevent_list_lock); + vnic_npevent_thread = 
kthread_run(vnic_npevent_statemachine, NULL, + "qlgc_vnic_npevent_s_m"); + if (IS_ERR(vnic_npevent_thread)) { + printk(KERN_WARNING PFX "failed to create vnic npevent" + " thread; error %d\n", + (int) PTR_ERR(vnic_npevent_thread)); + vnic_npevent_thread = NULL; + return 1; + } + + return 0; +} + +void vnic_npevent_cleanup(void) +{ + if (vnic_npevent_thread) { + vnic_npevent_thread_end = 1; + wake_up(&vnic_npevent_queue); + wait_for_completion(&vnic_npevent_thread_exit); + vnic_npevent_thread = NULL; + } +} + +static void vnic_setup(struct net_device *device) +{ + ether_setup(device); + + /* ether_setup is used to fill + * device parameters for ethernet devices. + * We override some of the parameters + * which are specific to VNIC. + */ + device->get_stats = vnic_get_stats; + device->open = vnic_open; + device->stop = vnic_stop; + device->hard_start_xmit = vnic_hard_start_xmit; + device->tx_timeout = vnic_tx_timeout; + device->set_multicast_list = vnic_set_multicast_list; + device->set_mac_address = vnic_set_mac_address; + device->change_mtu = vnic_change_mtu; + device->watchdog_timeo = 10 * HZ; + device->features = 0; +} + +struct vnic *vnic_allocate(struct vnic_config *config) +{ + struct vnic *vnic = NULL; + struct net_device *netdev; + + VNIC_FUNCTION("vnic_allocate()\n"); + netdev = alloc_netdev((int) sizeof(*vnic), config->name, vnic_setup); + if (!netdev) { + VNIC_ERROR("failed allocating vnic structure\n"); + return NULL; + } + + vnic = netdev_priv(netdev); + vnic->netdevice = netdev; + spin_lock_init(&vnic->lock); + spin_lock_init(&vnic->current_path_lock); + vnic_alloc_stats(vnic); + vnic->state = VNIC_UNINITIALIZED; + vnic->config = config; + + netpath_init(&vnic->primary_path, vnic, 0); + netpath_init(&vnic->secondary_path, vnic, 1); + + vnic->current_path = NULL; + vnic->failed_over = 0; + + list_add_tail(&vnic->list_ptrs, &vnic_list); + + return vnic; +} + +void vnic_free(struct vnic *vnic) +{ + VNIC_FUNCTION("vnic_free()\n"); + list_del(&vnic->list_ptrs); + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_NP_FREEVNIC); +} + +static void __exit vnic_cleanup(void) +{ + VNIC_FUNCTION("vnic_cleanup()\n"); + + VNIC_INIT("unloading %s\n", MODULEDETAILS); + + while (!list_empty(&vnic_list)) { + struct vnic *vnic = + list_entry(vnic_list.next, struct vnic, list_ptrs); + vnic_free(vnic); + } + + vnic_npevent_cleanup(); + viport_cleanup(); + vnic_ib_cleanup(); +} + +static int __init vnic_init(void) +{ + int ret; + VNIC_FUNCTION("vnic_init()\n"); + VNIC_INIT("Initializing %s\n", MODULEDETAILS); + + ret = config_start(); + if (ret) { + VNIC_ERROR("config_start failed\n"); + goto failure; + } + + ret = vnic_ib_init(); + if (ret) { + VNIC_ERROR("ib_start failed\n"); + goto failure; + } + + ret = viport_start(); + if (ret) { + VNIC_ERROR("viport_start failed\n"); + goto failure; + } + + ret = vnic_npevent_start(); + if (ret) { + VNIC_ERROR("vnic_npevent_start failed\n"); + goto failure; + } + + return 0; +failure: + vnic_cleanup(); + return ret; +} + +module_init(vnic_init); +module_exit(vnic_cleanup); diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h new file mode 100644 index 0000000..7535124 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h @@ -0,0 +1,154 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_MAIN_H_INCLUDED +#define VNIC_MAIN_H_INCLUDED + +#include +#include +#include +#include + +#include "vnic_config.h" +#include "vnic_netpath.h" + +extern u16 vnic_max_mtu; +extern struct list_head vnic_list; +extern struct attribute_group vnic_stats_attr_group; +extern cycles_t vnic_recv_ref; + +enum vnic_npevent_type { + VNIC_PRINP_CONNECTED = 0, + VNIC_PRINP_DISCONNECTED = 1, + VNIC_PRINP_LINKUP = 2, + VNIC_PRINP_LINKDOWN = 3, + VNIC_PRINP_TIMEREXPIRED = 4, + VNIC_PRINP_SETLINK = 5, + + /* used to figure out PRI vs SEC types for dbg msg*/ + VNIC_PRINP_LASTTYPE = VNIC_PRINP_SETLINK, + + VNIC_SECNP_CONNECTED = 6, + VNIC_SECNP_DISCONNECTED = 7, + VNIC_SECNP_LINKUP = 8, + VNIC_SECNP_LINKDOWN = 9, + VNIC_SECNP_TIMEREXPIRED = 10, + VNIC_SECNP_SETLINK = 11, + + /* used to figure out PRI vs SEC types for dbg msg*/ + VNIC_SECNP_LASTTYPE = VNIC_SECNP_SETLINK, + + VNIC_NP_FREEVNIC = 12, + + /* + * NOTE : If any new netpath event is being added, don't forget to + * add corresponding netpath event string into vnic_main.c. 
+ */ +}; + +struct vnic_npevent { + struct list_head list_ptrs; + struct vnic *vnic; + enum vnic_npevent_type event_type; +}; + +void vnic_npevent_queue_evt(struct netpath *netpath, + enum vnic_npevent_type evt); +void vnic_npevent_dequeue_evt(struct netpath *netpath, + enum vnic_npevent_type evt); + +enum vnic_state { + VNIC_UNINITIALIZED = 0, + VNIC_REGISTERED = 1 +}; + +struct vnic { + struct list_head list_ptrs; + enum vnic_state state; + struct vnic_config *config; + struct netpath *current_path; + struct netpath primary_path; + struct netpath secondary_path; + int open; + int carrier; + int failed_over; + int mac_set; + struct net_device_stats stats; + struct net_device *netdevice; + struct dev_info dev_info; + struct dev_mc_list *mc_list; + int mc_list_len; + int mc_count; + spinlock_t lock; + spinlock_t current_path_lock; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + cycles_t start_time; + cycles_t conn_time; + cycles_t disconn_ref; /* intermediate time */ + cycles_t disconn_time; + u32 disconn_num; + cycles_t xmit_time; + u32 xmit_num; + u32 xmit_fail; + cycles_t recv_time; + u32 recv_num; + u32 multicast_recv_num; + cycles_t xmit_ref; /* intermediate time */ + cycles_t xmit_off_time; + u32 xmit_off_num; + cycles_t carrier_ref; /* intermediate time */ + cycles_t carrier_off_time; + u32 carrier_off_num; + } statistics; + struct dev_info stat_info; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +struct vnic *vnic_allocate(struct vnic_config *config); + +void vnic_free(struct vnic *vnic); + +void vnic_connected(struct vnic *vnic, struct netpath *netpath); +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath); + +void vnic_link_up(struct vnic *vnic, struct netpath *netpath); +void vnic_link_down(struct vnic *vnic, struct netpath *netpath); + +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath); +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath); + +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, + struct sk_buff *skb); +void vnic_npevent_cleanup(void); +void completion_callback_cleanup(struct vnic_ib_conn *ib_conn); +#endif /* VNIC_MAIN_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:32:28 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:02:28 +0530 Subject: [ofa-general] [PATCH v2 02/13] QLogic VNIC: Netpath - abstraction of connection to EVIC/VEx In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103228.12355.9952.stgit@localhost.localdomain> From: Ramachandra K This patch implements the netpath layer of QLogic VNIC. Netpath is an abstraction of a connection to EVIC. It primarily includes the implementation which maintains the timers to monitor the status of the connection to EVIC/VEx. 
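One point about the timer handling in this patch: the expiry handler does not touch netpath state itself - it only queues a TIMEREXPIRED event for the single state-machine thread, so all state transitions stay serialized there. Reduced to its core (the if/else of the actual vnic_npevent_timeout() below collapsed into one expression):

	static void timeout_sketch(unsigned long data)
	{
		struct netpath *netpath = (struct netpath *)data;
		/* hand off to the state machine; no locking needed here */
		vnic_npevent_queue_evt(netpath, netpath->second_bias ?
				       VNIC_SECNP_TIMEREXPIRED :
				       VNIC_PRINP_TIMEREXPIRED);
	}

The flip side is in netpath_timer_stop(): after del_timer_sync() it also calls vnic_npevent_dequeue_evt() to flush an expiry event that may already be sitting in the event queue.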
Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c | 112 +++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h | 79 ++++++++++++++++ 2 files changed, 191 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c new file mode 100644 index 0000000..820b996 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c @@ -0,0 +1,112 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_netpath.h" + +static void vnic_npevent_timeout(unsigned long data) +{ + struct netpath *netpath = (struct netpath *)data; + + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_TIMEREXPIRED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_TIMEREXPIRED); +} + +void netpath_timer(struct netpath *netpath, int timeout) +{ + if (netpath->timer_state == NETPATH_TS_ACTIVE) + del_timer_sync(&netpath->timer); + if (timeout) { + init_timer(&netpath->timer); + netpath->timer_state = NETPATH_TS_ACTIVE; + netpath->timer.expires = jiffies + timeout; + netpath->timer.data = (unsigned long)netpath; + netpath->timer.function = vnic_npevent_timeout; + add_timer(&netpath->timer); + } else + vnic_npevent_timeout((unsigned long)netpath); +} + +void netpath_timer_stop(struct netpath *netpath) +{ + if (netpath->timer_state != NETPATH_TS_ACTIVE) + return; + del_timer_sync(&netpath->timer); + if (netpath->second_bias) + vnic_npevent_dequeue_evt(netpath, VNIC_SECNP_TIMEREXPIRED); + else + vnic_npevent_dequeue_evt(netpath, VNIC_PRINP_TIMEREXPIRED); + + netpath->timer_state = NETPATH_TS_IDLE; +} + +void netpath_free(struct netpath *netpath) +{ + if (!netpath->viport) + return; + viport_free(netpath->viport); + netpath->viport = NULL; + sysfs_remove_group(&netpath->dev_info.dev.kobj, + &vnic_path_attr_group); + device_unregister(&netpath->dev_info.dev); + wait_for_completion(&netpath->dev_info.released); +} + +void netpath_init(struct netpath *netpath, struct vnic *vnic, + int second_bias) +{ + netpath->parent = vnic; + netpath->carrier = 0; + netpath->viport = NULL; + netpath->second_bias = second_bias; + netpath->timer_state = NETPATH_TS_IDLE; + init_timer(&netpath->timer); +} + +const char *netpath_to_string(struct vnic *vnic, struct netpath *netpath) +{ + if (!netpath) + return "NULL"; + else if (netpath == &vnic->primary_path) + return "PRIMARY"; + else if (netpath == &vnic->secondary_path) + return "SECONDARY"; + else + return "UNKNOWN"; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h new file mode 100644 index 0000000..f4e142e --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h @@ -0,0 +1,79 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_NETPATH_H_INCLUDED +#define VNIC_NETPATH_H_INCLUDED + +#include + +#include "vnic_sys.h" + +struct viport; +struct vnic; + +enum netpath_ts { + NETPATH_TS_IDLE = 0, + NETPATH_TS_ACTIVE = 1, + NETPATH_TS_EXPIRED = 2 +}; + +struct netpath { + int carrier; + struct vnic *parent; + struct viport *viport; + size_t path_idx; + unsigned long connect_time; + int second_bias; + u8 is_primary_path; + u8 delay_reconnect; + struct timer_list timer; + enum netpath_ts timer_state; + struct dev_info dev_info; +}; + +void netpath_init(struct netpath *netpath, struct vnic *vnic, + int second_bias); +void netpath_free(struct netpath *netpath); + +void netpath_timer(struct netpath *netpath, int timeout); +void netpath_timer_stop(struct netpath *netpath); + +const char *netpath_to_string(struct vnic *vnic, struct netpath *netpath); + +#define netpath_get_hw_addr(netpath, address) \ + viport_get_hw_addr((netpath)->viport, address) +#define netpath_is_connected(netpath) \ + (netpath->state == NETPATH_CONNECTED) +#define netpath_can_tx_csum(netpath) \ + viport_can_tx_csum(netpath->viport) + +#endif /* VNIC_NETPATH_H_INCLUDED */ From joel at finetec.com Mon May 19 10:29:30 2008 From: joel at finetec.com (Joe Li) Date: Mon, 19 May 2008 10:29:30 -0700 Subject: [ofa-general] Need help--Failed to build ofa-kernel.rpm while installing OFED1.3 on RHEL5.1 with Kernel 2.6.25-rc3/rc8 Message-ID: Hello everyone, I am a newbie to openfabric and I have an issue here which needs your help. When trying to install OFED1.3 on kernel 2.6.25-rc3 or kernel 2.6.25-rc8, I get an ofa-kernel rpm build error: "Running rpm -e --allmatches libibverbs libibcommon libibumad librdmacm opensm-libs ibutils openib opensm-libs dapl libibcommon libibumad libibverbs librdmacm ibutils ibutils-libs Build ofa_kernel RPM Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' --define 'configure_options --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mthca-mod --with-mlx4-mod --with-cxgb3-mod --with-nes-mod --with-ipath_inf-mod --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-srp-target-mod --with-rds-mod' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'KVERSION 2.6.25-rc3' --define 'K_SRC /lib/modules/2.6.25-rc3/build' --define 'network_dir /etc/sysconfig/network-scripts' --define '_prefix /usr' /ofed1.3/OFED-1.3/SRPMS/ofa_kernel-1.3-ofed1.3.src.rpm Failed to build ofa_kernel RPM See /tmp/OFED.6361.logs/ofa_kernel.rpmbuild.log" In the ofa_kernel.rpmbuild.log file, it says: gcc -Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/.addr.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/4.1.2/include -D__KERNEL__ \ -include include/linux/autoconf.h \ -include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include/linux/autoconf.h \ \ \ \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/include \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/debug \ -I/usr/local/include/scst \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt \ -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/net/cxgb3 \ -Iinclude \ \ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Os 
-fno-stack-protector -m64 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -fomit-frame-pointer -g -Wdeclaration-after-statement -Wno-pointer-sign -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(addr)" -D"KBUILD_MODNAME=KBUILD_STR(ib_addr)" -c -o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/.tmp_addr.o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function 'rdma_translate_ip': /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:113: warning: passing argument 1 of 'ip_dev_find' makes pointer from integer without a cast /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:113: error: too few arguments to function 'ip_dev_find' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function 'addr_send_arp': /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:161: warning: passing argument 1 of 'ip_route_output_key' from incompatible pointer type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:161: warning: passing argument 2 of 'ip_route_output_key' from incompatible pointer type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:161: error: too few arguments to function 'ip_route_output_key' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function 'addr_resolve_remote': /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:182: warning: passing argument 1 of 'ip_route_output_key' from incompatible pointer type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:182: warning: passing argument 2 of 'ip_route_output_key' from incompatible pointer type /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:182: error: too few arguments to function 'ip_route_output_key' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c: In function 'addr_resolve_local': /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:264: warning: passing argument 1 of 'ip_dev_find' makes pointer from integer without a cast /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:264: error: too few arguments to function 'ip_dev_find' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:268: error: implicit declaration of function 'ZERONET' /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.c:272: error: implicit declaration of function 'LOOPBACK' make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core/addr.o] Error 1 make[3]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/core] Error 2 make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband] Error 2 make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3] Error 2 make[1]: Leaving directory `/usr/src/linux-2.6.25-rc3' make: *** [kernel] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.67082 (%build) RPM build errors: user vlad does not exist - using root group vlad does not exist - using root user vlad does not exist - using root group vlad does not exist - using root Bad exit status from /var/tmp/rpm-tmp.67082 (%build) Can anyone please point out what might be wrong? 
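The errors themselves all trace to networking API changes merged for 2.6.25, which the OFED 1.3 backports (written for released kernels up to roughly 2.6.24, as far as I know) predate: ip_dev_find() and ip_route_output_key() gained a struct net * namespace argument, and the ZERONET()/LOOPBACK() macros were replaced by inline helpers. A sketch of the 2.6.25-style calls (illustrative only, not a tested OFED fix):

	__be32 addr;		/* IPv4 address being resolved */
	struct net_device *dev;
	struct rtable *rt;
	struct flowi fl;
	int ret;

	/* 2.6.25: both routines now take the network namespace first */
	dev = ip_dev_find(&init_net, addr);		/* was ip_dev_find(addr) */
	ret = ip_route_output_key(&init_net, &rt, &fl);	/* was ip_route_output_key(&rt, &fl) */

	/* 2.6.25: the old address-class macros became inline helpers */
	if (ipv4_is_zeronet(addr))			/* was ZERONET(addr) */
		return -EADDRNOTAVAIL;
	if (ipv4_is_loopback(addr))			/* was LOOPBACK(addr) */
		return -EADDRNOTAVAIL;

So either the backport glue in ofa_kernel needs updating for 2.6.25, or building against a released kernel that OFED 1.3 supports should avoid these errors.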
Thanks in advance. Regards, Joe -------------- next part -------------- An HTML attachment was scrubbed... URL: From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:32:58 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:02:58 +0530 Subject: [ofa-general] [PATCH v2 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103258.12355.6146.stgit@localhost.localdomain> From: Poornima Kamath This patch implements the state machine for the protocol used to communicate with the EVIC. It also implements the viport abstraction, which represents the virtual Ethernet port on the EVIC. Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c | 1214 ++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h | 176 +++ 2 files changed, 1390 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c new file mode 100644 index 0000000..0a94cd3 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c @@ -0,0 +1,1214 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE.
+ */ + +#include +#include +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_netpath.h" +#include "vnic_control.h" +#include "vnic_data.h" +#include "vnic_config.h" +#include "vnic_control_pkt.h" + +#define VIPORT_DISCONN_TIMER 10000 /* 10 seconds */ + +#define MAX_RETRY_INTERVAL 20000 /* 20 seconds */ +#define RETRY_INCREMENT 5000 /* 5 seconds */ +#define MAX_CONNECT_RETRY_TIMEOUT 600000 /* 10 minutes */ + +static DECLARE_WAIT_QUEUE_HEAD(viport_queue); +static LIST_HEAD(viport_list); +static DECLARE_COMPLETION(viport_thread_exit); +static spinlock_t viport_list_lock; + +static struct task_struct *viport_thread; +static int viport_thread_end; + +static void viport_timer(struct viport *viport, int timeout); + +struct viport *viport_allocate(struct viport_config *config) +{ + struct viport *viport; + + VIPORT_FUNCTION("viport_allocate()\n"); + viport = kzalloc(sizeof *viport, GFP_KERNEL); + if (!viport) { + VIPORT_ERROR("failed allocating viport structure\n"); + return NULL; + } + + viport->state = VIPORT_DISCONNECTED; + viport->link_state = LINK_FIRSTCONNECT; + viport->connect = WAIT; + viport->new_mtu = 1500; + viport->new_flags = 0; + viport->config = config; + viport->connect = DELAY; + viport->data.max_mtu = vnic_max_mtu; + spin_lock_init(&viport->lock); + init_waitqueue_head(&viport->stats_queue); + init_waitqueue_head(&viport->disconnect_queue); + init_waitqueue_head(&viport->reference_queue); + INIT_LIST_HEAD(&viport->list_ptrs); + + vnic_mc_init(viport); + + return viport; +} + +void viport_connect(struct viport *viport, int delay) +{ + VIPORT_FUNCTION("viport_connect()\n"); + + if (viport->connect != DELAY) + viport->connect = (delay) ? DELAY : NOW; + if (viport->link_state == LINK_FIRSTCONNECT) { + u32 duration; + duration = (net_random() & 0x1ff); + if (!viport->parent->is_primary_path) + duration += 0x1ff; + viport->link_state = LINK_RETRYWAIT; + viport_timer(viport, duration); + } else + viport_kick(viport); +} + +void viport_disconnect(struct viport *viport) +{ + VIPORT_FUNCTION("viport_disconnect()\n"); + viport->disconnect = 1; + viport_failure(viport); + wait_event(viport->disconnect_queue, viport->disconnect == 0); +} + +void viport_free(struct viport *viport) +{ + VIPORT_FUNCTION("viport_free()\n"); + viport_disconnect(viport); /* NOTE: this can sleep */ + vnic_mc_uninit(viport); + kfree(viport->config); + kfree(viport); +} + +void viport_set_link(struct viport *viport, u16 flags, u16 mtu) +{ + unsigned long localflags; + int i; + + VIPORT_FUNCTION("viport_set_link()\n"); + if (mtu > data_max_mtu(&viport->data)) { + VIPORT_ERROR("configuration error." + " mtu of %d unsupported by %s\n", mtu, + config_viport_name(viport->config)); + goto failure; + } + + spin_lock_irqsave(&viport->lock, localflags); + flags &= IFF_UP | IFF_ALLMULTI | IFF_PROMISC; + if ((viport->new_flags != flags) + || (viport->new_mtu != mtu)) { + viport->new_flags = flags; + viport->new_mtu = mtu; + viport->updates |= NEED_LINK_CONFIG; + if (viport->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + if (((viport->mtu <= MCAST_MSG_SIZE) && (mtu > MCAST_MSG_SIZE)) || + ((viport->mtu > MCAST_MSG_SIZE) && (mtu <= MCAST_MSG_SIZE))) { + /* + * MTU value will enable/disable the multicast. In + * either case, need to send the CMD_CONFIG_ADDRESS2 to + * EVIC. Hence, setting the NEED_ADDRESS_CONFIG flag. 
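+ * For example, assuming MCAST_MSG_SIZE lies between the two values,
+ * an MTU change from 1500 to 9000 disables inbound IB multicast and
+ * a change back from 9000 to 1500 re-enables it; in both cases the
+ * address table is resent so the EVIC can update the MGIDs.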
+ */ + viport->updates |= NEED_ADDRESS_CONFIG; + if (mtu <= MCAST_MSG_SIZE) { + VIPORT_PRINT("%s: MTU changed; " + "old:%d new:%d (threshold:%d);" + " MULTICAST will be enabled.\n", + config_viport_name(viport->config), + viport->mtu, mtu, + (int)MCAST_MSG_SIZE); + } else { + VIPORT_PRINT("%s: MTU changed; " + "old:%d new:%d (threshold:%d); " + "MULTICAST will be disabled.\n", + config_viport_name(viport->config), + viport->mtu, mtu, + (int)MCAST_MSG_SIZE); + } + /* When we resend these addresses, EVIC will + * send mgid=0 back in response. So no need to + * shutoff ib_multicast. + */ + for (i = MCAST_ADDR_START; i < viport->num_mac_addresses; i++) { + if (viport->mac_addresses[i].valid) + viport->mac_addresses[i].operation = VNIC_OP_SET_ENTRY; + } + } + } + viport_kick(viport); + } + + spin_unlock_irqrestore(&viport->lock, localflags); + return; +failure: + viport_failure(viport); +} + +int viport_set_unicast(struct viport *viport, u8 *address) +{ + unsigned long flags; + int ret = -1; + VIPORT_FUNCTION("viport_set_unicast()\n"); + spin_lock_irqsave(&viport->lock, flags); + + if (!viport->mac_addresses) + goto out; + + if (memcmp(viport->mac_addresses[UNICAST_ADDR].address, + address, ETH_ALEN)) { + memcpy(viport->mac_addresses[UNICAST_ADDR].address, + address, ETH_ALEN); + viport->mac_addresses[UNICAST_ADDR].operation + = VNIC_OP_SET_ENTRY; + viport->updates |= NEED_ADDRESS_CONFIG; + viport_kick(viport); + } + ret = 0; +out: + spin_unlock_irqrestore(&viport->lock, flags); + return ret; +} + +int viport_set_multicast(struct viport *viport, + struct dev_mc_list *mc_list, int mc_count) +{ + u32 old_update_list; + int i; + int ret = -1; + unsigned long flags; + + VIPORT_FUNCTION("viport_set_multicast()\n"); + spin_lock_irqsave(&viport->lock, flags); + + if (!viport->mac_addresses) + goto out; + + old_update_list = viport->updates; + if (mc_count > viport->num_mac_addresses - MCAST_ADDR_START) + viport->updates |= NEED_LINK_CONFIG | MCAST_OVERFLOW; + else { + if (mc_count == 0) { + ret = 0; + goto out; + } + if (viport->updates & MCAST_OVERFLOW) { + viport->updates &= ~MCAST_OVERFLOW; + viport->updates |= NEED_LINK_CONFIG; + } + for (i = MCAST_ADDR_START; i < mc_count + MCAST_ADDR_START; + i++, mc_list = mc_list->next) { + if (viport->mac_addresses[i].valid && + !memcmp(viport->mac_addresses[i].address, + mc_list->dmi_addr, ETH_ALEN)) + continue; + memcpy(viport->mac_addresses[i].address, + mc_list->dmi_addr, ETH_ALEN); + viport->mac_addresses[i].valid = 1; + viport->mac_addresses[i].operation = VNIC_OP_SET_ENTRY; + } + for (; i < viport->num_mac_addresses; i++) { + if (!viport->mac_addresses[i].valid) + continue; + viport->mac_addresses[i].valid = 0; + viport->mac_addresses[i].operation = VNIC_OP_SET_ENTRY; + } + if (mc_count) + viport->updates |= NEED_ADDRESS_CONFIG; + } + + if (viport->updates != old_update_list) + viport_kick(viport); + ret = 0; +out: + spin_unlock_irqrestore(&viport->lock, flags); + return ret; +} + +static inline void viport_disable_multicast(struct viport *viport) +{ + VIPORT_INFO("turned off IB_MULTICAST\n"); + viport->config->control_config.ib_multicast = 0; + viport->config->control_config.ib_config.conn_data.features_supported &= + __constant_cpu_to_be32((u32)~VNIC_FEAT_INBOUND_IB_MC); + viport->link_state = LINK_RESET; +} + +void viport_get_stats(struct viport *viport, + struct net_device_stats *stats) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_get_stats()\n"); + /* Reference count has been already incremented indicating + * that viport structure is 
being used, which prevents its + * freeing when this task sleeps + */ + if (time_after(jiffies, + (viport->last_stats_time + viport->config->stats_interval))) { + + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_STATS; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + wait_event(viport->stats_queue, + !(viport->updates & NEED_STATS) + || (viport->disconnect == 1)); + + if (viport->stats.ethernet_status) + vnic_link_up(viport->vnic, viport->parent); + else + vnic_link_down(viport->vnic, viport->parent); + } + + stats->rx_packets = be64_to_cpu(viport->stats.if_in_ok); + stats->tx_packets = be64_to_cpu(viport->stats.if_out_ok); + stats->rx_bytes = be64_to_cpu(viport->stats.if_in_octets); + stats->tx_bytes = be64_to_cpu(viport->stats.if_out_octets); + stats->rx_errors = be64_to_cpu(viport->stats.if_in_errors); + stats->tx_errors = be64_to_cpu(viport->stats.if_out_errors); + stats->rx_dropped = 0; /* EIOC doesn't track */ + stats->tx_dropped = 0; /* EIOC doesn't track */ + stats->multicast = be64_to_cpu(viport->stats.if_in_nucast_pkts); + stats->collisions = 0; /* EIOC doesn't track */ +} + +int viport_xmit_packet(struct viport *viport, struct sk_buff *skb) +{ + int status = -1; + unsigned long flags; + + VIPORT_FUNCTION("viport_xmit_packet()\n"); + spin_lock_irqsave(&viport->lock, flags); + if (viport->state == VIPORT_CONNECTED) + status = data_xmit_packet(&viport->data, skb); + spin_unlock_irqrestore(&viport->lock, flags); + + return status; +} + +void viport_kick(struct viport *viport) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_kick()\n"); + spin_lock_irqsave(&viport_list_lock, flags); + if (list_empty(&viport->list_ptrs)) { + list_add_tail(&viport->list_ptrs, &viport_list); + wake_up(&viport_queue); + } + spin_unlock_irqrestore(&viport_list_lock, flags); +} + +void viport_failure(struct viport *viport) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_failure()\n"); + vnic_stop_xmit(viport->vnic, viport->parent); + spin_lock_irqsave(&viport_list_lock, flags); + viport->errored = 1; + if (list_empty(&viport->list_ptrs)) { + list_add_tail(&viport->list_ptrs, &viport_list); + wake_up(&viport_queue); + } + spin_unlock_irqrestore(&viport_list_lock, flags); +} + +static void viport_timeout(unsigned long data) +{ + struct viport *viport; + + VIPORT_FUNCTION("viport_timeout()\n"); + viport = (struct viport *)data; + viport->timer_active = 0; + viport_kick(viport); +} + +static void viport_timer(struct viport *viport, int timeout) +{ + VIPORT_FUNCTION("viport_timer()\n"); + if (viport->timer_active) + del_timer(&viport->timer); + init_timer(&viport->timer); + viport->timer.expires = jiffies + timeout; + viport->timer.data = (unsigned long)viport; + viport->timer.function = viport_timeout; + viport->timer_active = 1; + add_timer(&viport->timer); +} + +static void viport_timer_stop(struct viport *viport) +{ + VIPORT_FUNCTION("viport_timer_stop()\n"); + if (viport->timer_active) + del_timer(&viport->timer); + viport->timer_active = 0; +} + +static int viport_init_mac_addresses(struct viport *viport) +{ + struct vnic_address_op2 *temp; + unsigned long flags; + int i; + + VIPORT_FUNCTION("viport_init_mac_addresses()\n"); + i = viport->num_mac_addresses * sizeof *temp; + temp = kzalloc(viport->num_mac_addresses * sizeof *temp, + GFP_KERNEL); + if (!temp) { + VIPORT_ERROR("failed allocating MAC address table\n"); + return -ENOMEM; + } + + spin_lock_irqsave(&viport->lock, flags); + viport->mac_addresses = temp; + for (i = 0; i < 
viport->num_mac_addresses; i++) { + viport->mac_addresses[i].index = cpu_to_be16(i); + viport->mac_addresses[i].vlan = + cpu_to_be16(viport->default_vlan); + } + memset(viport->mac_addresses[BROADCAST_ADDR].address, + 0xFF, ETH_ALEN); + viport->mac_addresses[BROADCAST_ADDR].valid = 1; + memcpy(viport->mac_addresses[UNICAST_ADDR].address, + viport->hw_mac_address, ETH_ALEN); + viport->mac_addresses[UNICAST_ADDR].valid = 1; + + spin_unlock_irqrestore(&viport->lock, flags); + + return 0; +} + +static inline void viport_match_mac_address(struct vnic *vnic, + struct viport *viport) +{ + if (vnic && vnic->current_path && + viport == vnic->current_path->viport && + vnic->mac_set && + memcmp(vnic->netdevice->dev_addr, viport->hw_mac_address, ETH_ALEN)) { + VIPORT_ERROR("*** ERROR MAC address mismatch; " + "current = %02x:%02x:%02x:%02x:%02x:%02x " + "From EVIC = %02x:%02x:%02x:%02x:%02x:%02x\n", + vnic->netdevice->dev_addr[0], + vnic->netdevice->dev_addr[1], + vnic->netdevice->dev_addr[2], + vnic->netdevice->dev_addr[3], + vnic->netdevice->dev_addr[4], + vnic->netdevice->dev_addr[5], + viport->hw_mac_address[0], + viport->hw_mac_address[1], + viport->hw_mac_address[2], + viport->hw_mac_address[3], + viport->hw_mac_address[4], + viport->hw_mac_address[5]); + } +} + +static int viport_handle_init_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_UNINITIALIZED: + LINK_STATE("state LINK_UNINITIALIZED\n"); + viport->updates = 0; + spin_lock_irq(&viport_list_lock); + list_del_init(&viport->list_ptrs); + spin_unlock_irq(&viport_list_lock); + if (atomic_read(&viport->reference_count)) { + wake_up(&viport->stats_queue); + wait_event(viport->reference_queue, + atomic_read(&viport->reference_count) == 0); + } + /* No more references to viport structure + * so it is safe to delete it by waking disconnect + * queue + */ + + viport->disconnect = 0; + wake_up(&viport->disconnect_queue); + break; + case LINK_INITIALIZE: + LINK_STATE("state LINK_INITIALIZE\n"); + viport->errored = 0; + viport->connect = WAIT; + viport->last_stats_time = 0; + if (viport->disconnect) + viport->link_state = LINK_UNINITIALIZED; + else + viport->link_state = LINK_INITIALIZECONTROL; + break; + case LINK_INITIALIZECONTROL: + LINK_STATE("state LINK_INITIALIZECONTROL\n"); + viport->pd = ib_alloc_pd(viport->config->ibdev); + if (IS_ERR(viport->pd)) + viport->link_state = LINK_DISCONNECTED; + else if (control_init(&viport->control, viport, + &viport->config->control_config, + viport->pd)) { + ib_dealloc_pd(viport->pd); + viport->link_state = LINK_DISCONNECTED; + + } else + viport->link_state = LINK_INITIALIZEDATA; + break; + case LINK_INITIALIZEDATA: + LINK_STATE("state LINK_INITIALIZEDATA\n"); + if (data_init(&viport->data, viport, + &viport->config->data_config, + viport->pd)) + viport->link_state = LINK_CLEANUPCONTROL; + else + viport->link_state = LINK_CONTROLCONNECT; + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_control_states(struct viport *viport) +{ + enum link_state old_state; + struct vnic *vnic; + + do { + switch (old_state = viport->link_state) { + case LINK_CONTROLCONNECT: + if (vnic_ib_cm_connect(&viport->control.ib_conn)) + viport->link_state = LINK_CLEANUPDATA; + else + viport->link_state = LINK_CONTROLCONNECTWAIT; + break; + case LINK_CONTROLCONNECTWAIT: + LINK_STATE("state LINK_CONTROLCONNECTWAIT\n"); + if (control_is_connected(&viport->control)) + viport->link_state = 
LINK_INITVNICREQ; + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_CONTROLDISCONNECT; + } + break; + case LINK_INITVNICREQ: + LINK_STATE("state LINK_INITVNICREQ\n"); + if (control_init_vnic_req(&viport->control)) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_INITVNICRSP; + break; + case LINK_INITVNICRSP: + LINK_STATE("state LINK_INITVNICRSP\n"); + control_process_async(&viport->control); + + if (!control_init_vnic_rsp(&viport->control, + &viport->features_supported, + viport->hw_mac_address, + &viport->num_mac_addresses, + &viport->default_vlan)) { + if (viport_init_mac_addresses(viport)) + viport->link_state = + LINK_RESETCONTROL; + else { + viport->link_state = + LINK_BEGINDATAPATH; + /* + * Ensure that the current path's MAC + * address matches the one returned by + * EVIC - we've had cases of mismatch + * which then caused havoc. + */ + vnic = viport->parent->parent; + viport_match_mac_address(vnic, viport); + } + } + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESETCONTROL; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_data_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_BEGINDATAPATH: + LINK_STATE("state LINK_BEGINDATAPATH\n"); + viport->link_state = LINK_CONFIGDATAPATHREQ; + break; + case LINK_CONFIGDATAPATHREQ: + LINK_STATE("state LINK_CONFIGDATAPATHREQ\n"); + if (control_config_data_path_req(&viport->control, + data_path_id(&viport-> + data), + data_host_pool_max + (&viport->data), + data_eioc_pool_max + (&viport->data))) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_CONFIGDATAPATHRSP; + break; + case LINK_CONFIGDATAPATHRSP: + LINK_STATE("state LINK_CONFIGDATAPATHRSP\n"); + control_process_async(&viport->control); + + if (!control_config_data_path_rsp(&viport->control, + data_host_pool + (&viport->data), + data_eioc_pool + (&viport->data), + data_host_pool_max + (&viport->data), + data_eioc_pool_max + (&viport->data), + data_host_pool_min + (&viport->data), + data_eioc_pool_min + (&viport->data))) + viport->link_state = LINK_DATACONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESETCONTROL; + } + break; + case LINK_DATACONNECT: + LINK_STATE("state LINK_DATACONNECT\n"); + if (data_connect(&viport->data)) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_DATACONNECTWAIT; + break; + case LINK_DATACONNECTWAIT: + LINK_STATE("state LINK_DATACONNECTWAIT\n"); + control_process_async(&viport->control); + if (data_is_connected(&viport->data)) + viport->link_state = LINK_XCHGPOOLREQ; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_xchgpool_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_XCHGPOOLREQ: + LINK_STATE("state LINK_XCHGPOOLREQ\n"); + if (control_exchange_pools_req(&viport->control, + data_local_pool_addr + (&viport->data), + data_local_pool_rkey + (&viport->data))) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_XCHGPOOLRSP; + break; + case LINK_XCHGPOOLRSP: + LINK_STATE("state LINK_XCHGPOOLRSP\n"); + control_process_async(&viport->control); + + if 
(!control_exchange_pools_rsp(&viport->control, + data_remote_pool_addr + (&viport->data), + data_remote_pool_rkey + (&viport->data))) + viport->link_state = LINK_INITIALIZED; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + case LINK_INITIALIZED: + LINK_STATE("state LINK_INITIALIZED\n"); + viport->state = VIPORT_CONNECTED; + printk(KERN_INFO PFX + "%s: connection established\n", + config_viport_name(viport->config)); + data_connected(&viport->data); + vnic_connected(viport->parent->parent, + viport->parent); + if (viport->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + printk(KERN_INFO PFX "%s: Supports Inbound IB " + "Multicast\n", + config_viport_name(viport->config)); + if (mc_data_init(&viport->mc_data, viport, + &viport->config->data_config, + viport->pd)) { + viport_disable_multicast(viport); + break; + } + } + spin_lock_irq(&viport->lock); + viport->mtu = 1500; + viport->flags = 0; + if ((viport->mtu != viport->new_mtu) || + (viport->flags != viport->new_flags)) + viport->updates |= NEED_LINK_CONFIG; + spin_unlock_irq(&viport->lock); + viport->link_state = LINK_IDLE; + viport->retry_duration = 0; + viport->total_retry_duration = 0; + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_idle_states(struct viport *viport) +{ + enum link_state old_state; + int handle_mc_join_compl, handle_mc_join; + + do { + switch (old_state = viport->link_state) { + case LINK_IDLE: + LINK_STATE("state LINK_IDLE\n"); + if (viport->config->hb_interval) + viport_timer(viport, + viport->config->hb_interval); + viport->link_state = LINK_IDLING; + break; + case LINK_IDLING: + LINK_STATE("state LINK_IDLING\n"); + control_process_async(&viport->control); + if (viport->errored) { + viport_timer_stop(viport); + viport->errored = 0; + viport->link_state = LINK_RESET; + break; + } + + spin_lock_irq(&viport->lock); + handle_mc_join = (viport->updates & NEED_MCAST_JOIN); + handle_mc_join_compl = + (viport->updates & NEED_MCAST_COMPLETION); + /* + * Turn off both flags, the handler functions will + * rearm them if necessary. 
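+ * (vnic_mc_join() and vnic_mc_join_handle_completion(), called just
+ * below, are the handlers that set these flags again when the join
+ * still has work pending.)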
+ */ + viport->updates &= ~(NEED_MCAST_JOIN | NEED_MCAST_COMPLETION); + + if (viport->updates & NEED_LINK_CONFIG) { + viport_timer_stop(viport); + viport->link_state = LINK_CONFIGLINKREQ; + } else if (viport->updates & NEED_ADDRESS_CONFIG) { + viport_timer_stop(viport); + viport->link_state = LINK_CONFIGADDRSREQ; + } else if (viport->updates & NEED_STATS) { + viport_timer_stop(viport); + viport->link_state = LINK_REPORTSTATREQ; + } else if (viport->config->hb_interval) { + if (!viport->timer_active) + viport->link_state = + LINK_HEARTBEATREQ; + } + spin_unlock_irq(&viport->lock); + if (handle_mc_join) { + if (vnic_mc_join(viport)) + viport_disable_multicast(viport); + } + if (handle_mc_join_compl) + vnic_mc_join_handle_completion(viport); + + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_config_states(struct viport *viport) +{ + enum link_state old_state; + int res; + + do { + switch (old_state = viport->link_state) { + case LINK_CONFIGLINKREQ: + LINK_STATE("state LINK_CONFIGLINKREQ\n"); + spin_lock_irq(&viport->lock); + viport->updates &= ~NEED_LINK_CONFIG; + viport->flags = viport->new_flags; + if (viport->updates & MCAST_OVERFLOW) + viport->flags |= IFF_ALLMULTI; + viport->mtu = viport->new_mtu; + spin_unlock_irq(&viport->lock); + if (control_config_link_req(&viport->control, + viport->flags, + viport->mtu)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_CONFIGLINKRSP; + break; + case LINK_CONFIGLINKRSP: + LINK_STATE("state LINK_CONFIGLINKRSP\n"); + control_process_async(&viport->control); + + if (!control_config_link_rsp(&viport->control, + &viport->flags, + &viport->mtu)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + case LINK_CONFIGADDRSREQ: + LINK_STATE("state LINK_CONFIGADDRSREQ\n"); + + spin_lock_irq(&viport->lock); + res = control_config_addrs_req(&viport->control, + viport->mac_addresses, + viport-> + num_mac_addresses); + + if (res > 0) { + viport->updates &= ~NEED_ADDRESS_CONFIG; + viport->link_state = LINK_CONFIGADDRSRSP; + } else if (res == 0) + viport->link_state = LINK_CONFIGADDRSRSP; + else + viport->link_state = LINK_RESET; + spin_unlock_irq(&viport->lock); + break; + case LINK_CONFIGADDRSRSP: + LINK_STATE("state LINK_CONFIGADDRSRSP\n"); + control_process_async(&viport->control); + + if (!control_config_addrs_rsp(&viport->control)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_stat_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_REPORTSTATREQ: + LINK_STATE("state LINK_REPORTSTATREQ\n"); + if (control_report_statistics_req(&viport->control)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_REPORTSTATRSP; + break; + case LINK_REPORTSTATRSP: + LINK_STATE("state LINK_REPORTSTATRSP\n"); + control_process_async(&viport->control); + + spin_lock_irq(&viport->lock); + if (control_report_statistics_rsp(&viport->control, + &viport->stats) == 0) { + viport->updates &= ~NEED_STATS; + viport->last_stats_time = jiffies; + wake_up(&viport->stats_queue); + viport->link_state = LINK_IDLE; + } + + spin_unlock_irq(&viport->lock); + + if (viport->errored) { + viport->errored = 0; + viport->link_state = 
LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_heartbeat_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_HEARTBEATREQ: + LINK_STATE("state LINK_HEARTBEATREQ\n"); + if (control_heartbeat_req(&viport->control, + viport->config->hb_timeout)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_HEARTBEATRSP; + break; + case LINK_HEARTBEATRSP: + LINK_STATE("state LINK_HEARTBEATRSP\n"); + control_process_async(&viport->control); + + if (!control_heartbeat_rsp(&viport->control)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_reset_states(struct viport *viport) +{ + enum link_state old_state; + int handle_mc_join_compl = 0, handle_mc_join = 0; + + do { + switch (old_state = viport->link_state) { + case LINK_RESET: + LINK_STATE("state LINK_RESET\n"); + viport->errored = 0; + spin_lock_irq(&viport->lock); + viport->state = VIPORT_DISCONNECTED; + /* + * Turn off both flags, the handler functions will + * rearm them if necessary + */ + viport->updates &= ~(NEED_MCAST_JOIN | NEED_MCAST_COMPLETION); + + spin_unlock_irq(&viport->lock); + vnic_link_down(viport->vnic, viport->parent); + printk(KERN_INFO PFX + "%s: connection lost\n", + config_viport_name(viport->config)); + if (handle_mc_join) { + if (vnic_mc_join(viport)) + viport_disable_multicast(viport); + } + if (handle_mc_join_compl) + vnic_mc_join_handle_completion(viport); + if (viport->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + vnic_mc_leave(viport); + vnic_mc_data_cleanup(&viport->mc_data); + } + + if (control_reset_req(&viport->control)) + viport->link_state = LINK_DATADISCONNECT; + else + viport->link_state = LINK_RESETRSP; + break; + case LINK_RESETRSP: + LINK_STATE("state LINK_RESETRSP\n"); + control_process_async(&viport->control); + + if (!control_reset_rsp(&viport->control)) + viport->link_state = LINK_DATADISCONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_DATADISCONNECT; + } + break; + case LINK_RESETCONTROL: + LINK_STATE("state LINK_RESETCONTROL\n"); + if (control_reset_req(&viport->control)) + viport->link_state = LINK_CONTROLDISCONNECT; + else + viport->link_state = LINK_RESETCONTROLRSP; + break; + case LINK_RESETCONTROLRSP: + LINK_STATE("state LINK_RESETCONTROLRSP\n"); + control_process_async(&viport->control); + + if (!control_reset_rsp(&viport->control)) + viport->link_state = LINK_CONTROLDISCONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_CONTROLDISCONNECT; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_disconn_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_DATADISCONNECT: + LINK_STATE("state LINK_DATADISCONNECT\n"); + data_disconnect(&viport->data); + viport->link_state = LINK_CONTROLDISCONNECT; + break; + case LINK_CONTROLDISCONNECT: + LINK_STATE("state LINK_CONTROLDISCONNECT\n"); + viport->link_state = LINK_CLEANUPDATA; + break; + case LINK_CLEANUPDATA: + LINK_STATE("state LINK_CLEANUPDATA\n"); + data_cleanup(&viport->data); + viport->link_state = LINK_CLEANUPCONTROL; + break; + case 
LINK_CLEANUPCONTROL: + LINK_STATE("state LINK_CLEANUPCONTROL\n"); + spin_lock_irq(&viport->lock); + kfree(viport->mac_addresses); + viport->mac_addresses = NULL; + spin_unlock_irq(&viport->lock); + control_cleanup(&viport->control); + ib_dealloc_pd(viport->pd); + viport->link_state = LINK_DISCONNECTED; + break; + case LINK_DISCONNECTED: + LINK_STATE("state LINK_DISCONNECTED\n"); + vnic_disconnected(viport->parent->parent, + viport->parent); + if (viport->disconnect != 0) + viport->link_state = LINK_UNINITIALIZED; + else if (viport->retry == 1) { + viport->retry = 0; + /* + * Check if the initial retry interval has crossed + * 20 seconds. + * The retry interval is initially 5 seconds which + * is incremented by 5. Once it is 20 the interval + * is fixed to 20 seconds till 10 minutes, + * after which retrying is stopped + */ + if (viport->retry_duration < MAX_RETRY_INTERVAL) + viport->retry_duration += + RETRY_INCREMENT; + + viport->total_retry_duration += + viport->retry_duration; + + if (viport->total_retry_duration >= + MAX_CONNECT_RETRY_TIMEOUT) { + viport->link_state = LINK_UNINITIALIZED; + printk("Timed out after retrying" + " for retry_duration %d msecs\n" + , viport->total_retry_duration); + } else { + viport->connect = DELAY; + viport->link_state = LINK_RETRYWAIT; + } + viport_timer(viport, + msecs_to_jiffies(viport->retry_duration)); + } else { + u32 duration = 5000 + ((net_random()) & 0x1FF); + if (!viport->parent->is_primary_path) + duration += 0x1ff; + viport_timer(viport, + msecs_to_jiffies(duration)); + viport->connect = DELAY; + viport->link_state = LINK_RETRYWAIT; + } + break; + case LINK_RETRYWAIT: + LINK_STATE("state LINK_RETRYWAIT\n"); + viport->stats.ethernet_status = 0; + viport->updates = 0; + wake_up(&viport->stats_queue); + if (viport->disconnect != 0) { + viport_timer_stop(viport); + viport->link_state = LINK_UNINITIALIZED; + } else if (viport->connect == DELAY) { + if (!viport->timer_active) + viport->link_state = LINK_INITIALIZE; + } else if (viport->connect == NOW) { + viport_timer_stop(viport); + viport->link_state = LINK_INITIALIZE; + } + break; + case LINK_FIRSTCONNECT: + viport->stats.ethernet_status = 0; + viport->updates = 0; + wake_up(&viport->stats_queue); + if (viport->disconnect != 0) { + viport_timer_stop(viport); + viport->link_state = LINK_UNINITIALIZED; + } + + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_statemachine(void *context) +{ + struct viport *viport; + enum link_state old_link_state; + + VIPORT_FUNCTION("viport_statemachine()\n"); + while (!viport_thread_end || !list_empty(&viport_list)) { + wait_event_interruptible(viport_queue, + !list_empty(&viport_list) + || viport_thread_end); + spin_lock_irq(&viport_list_lock); + if (list_empty(&viport_list)) { + spin_unlock_irq(&viport_list_lock); + continue; + } + viport = list_entry(viport_list.next, struct viport, + list_ptrs); + list_del_init(&viport->list_ptrs); + spin_unlock_irq(&viport_list_lock); + + do { + old_link_state = viport->link_state; + + /* + * Optimize for the state machine steady state + * by checking for the most common states first. 
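+ * Once a viport is connected it spends nearly all of its time in
+ * the idle, heartbeat, statistics and link-config states, so those
+ * handlers are tried before the setup and teardown handlers.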
+ * + */ + if (viport_handle_idle_states(viport) == 0) + break; + if (viport_handle_heartbeat_states(viport) == 0) + break; + if (viport_handle_stat_states(viport) == 0) + break; + if (viport_handle_config_states(viport) == 0) + break; + + if (viport_handle_init_states(viport) == 0) + break; + if (viport_handle_control_states(viport) == 0) + break; + if (viport_handle_data_states(viport) == 0) + break; + if (viport_handle_xchgpool_states(viport) == 0) + break; + if (viport_handle_reset_states(viport) == 0) + break; + if (viport_handle_disconn_states(viport) == 0) + break; + } while (viport->link_state != old_link_state); + } + + complete_and_exit(&viport_thread_exit, 0); +} + +int viport_start(void) +{ + VIPORT_FUNCTION("viport_start()\n"); + + spin_lock_init(&viport_list_lock); + viport_thread = kthread_run(viport_statemachine, NULL, + "qlgc_vnic_viport_s_m"); + if (IS_ERR(viport_thread)) { + printk(KERN_WARNING PFX "Could not create viport_thread;" + " error %d\n", (int) PTR_ERR(viport_thread)); + viport_thread = NULL; + return 1; + } + + return 0; +} + +void viport_cleanup(void) +{ + VIPORT_FUNCTION("viport_cleanup()\n"); + if (viport_thread) { + viport_thread_end = 1; + wake_up(&viport_queue); + wait_for_completion(&viport_thread_exit); + viport_thread = NULL; + } +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h new file mode 100644 index 0000000..6d36181 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h @@ -0,0 +1,176 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_VIPORT_H_INCLUDED +#define VNIC_VIPORT_H_INCLUDED + +#include "vnic_control.h" +#include "vnic_data.h" +#include "vnic_multicast.h" + +enum viport_state { + VIPORT_DISCONNECTED = 0, + VIPORT_CONNECTED = 1 +}; + +enum link_state { + LINK_UNINITIALIZED = 0, + LINK_INITIALIZE = 1, + LINK_INITIALIZECONTROL = 2, + LINK_INITIALIZEDATA = 3, + LINK_CONTROLCONNECT = 4, + LINK_CONTROLCONNECTWAIT = 5, + LINK_INITVNICREQ = 6, + LINK_INITVNICRSP = 7, + LINK_BEGINDATAPATH = 8, + LINK_CONFIGDATAPATHREQ = 9, + LINK_CONFIGDATAPATHRSP = 10, + LINK_DATACONNECT = 11, + LINK_DATACONNECTWAIT = 12, + LINK_XCHGPOOLREQ = 13, + LINK_XCHGPOOLRSP = 14, + LINK_INITIALIZED = 15, + LINK_IDLE = 16, + LINK_IDLING = 17, + LINK_CONFIGLINKREQ = 18, + LINK_CONFIGLINKRSP = 19, + LINK_CONFIGADDRSREQ = 20, + LINK_CONFIGADDRSRSP = 21, + LINK_REPORTSTATREQ = 22, + LINK_REPORTSTATRSP = 23, + LINK_HEARTBEATREQ = 24, + LINK_HEARTBEATRSP = 25, + LINK_RESET = 26, + LINK_RESETRSP = 27, + LINK_RESETCONTROL = 28, + LINK_RESETCONTROLRSP = 29, + LINK_DATADISCONNECT = 30, + LINK_CONTROLDISCONNECT = 31, + LINK_CLEANUPDATA = 32, + LINK_CLEANUPCONTROL = 33, + LINK_DISCONNECTED = 34, + LINK_RETRYWAIT = 35, + LINK_FIRSTCONNECT = 36 +}; + +enum { + BROADCAST_ADDR = 0, + UNICAST_ADDR = 1, + MCAST_ADDR_START = 2 +}; + +#define current_mac_address mac_addresses[UNICAST_ADDR].address + +enum { + NEED_STATS = 0x00000001, + NEED_ADDRESS_CONFIG = 0x00000002, + NEED_LINK_CONFIG = 0x00000004, + MCAST_OVERFLOW = 0x00000008, + NEED_MCAST_COMPLETION = 0x00000010, + NEED_MCAST_JOIN = 0x00000020 +}; + +struct viport { + struct list_head list_ptrs; + struct netpath *parent; + struct vnic *vnic; + struct viport_config *config; + struct control control; + struct data data; + spinlock_t lock; + struct ib_pd *pd; + enum viport_state state; + enum link_state link_state; + struct vnic_cmd_report_stats_rsp stats; + wait_queue_head_t stats_queue; + unsigned long last_stats_time; + u32 features_supported; + u8 hw_mac_address[ETH_ALEN]; + u16 default_vlan; + u16 num_mac_addresses; + struct vnic_address_op2 *mac_addresses; + u32 updates; + u16 flags; + u16 new_flags; + u16 mtu; + u16 new_mtu; + u32 errored; + enum { WAIT, DELAY, NOW } connect; + u32 disconnect; + u32 retry; + wait_queue_head_t disconnect_queue; + int timer_active; + struct timer_list timer; + u32 retry_duration; + u32 total_retry_duration; + atomic_t reference_count; + wait_queue_head_t reference_queue; + struct mc_info mc_info; + struct mc_data mc_data; +}; + +int viport_start(void); +void viport_cleanup(void); + +struct viport *viport_allocate(struct viport_config *config); +void viport_free(struct viport *viport); + +void viport_connect(struct viport *viport, int delay); +void viport_disconnect(struct viport *viport); + +void viport_set_link(struct viport *viport, u16 flags, u16 mtu); +void viport_get_stats(struct viport *viport, + struct net_device_stats *stats); +int viport_xmit_packet(struct viport *viport, struct sk_buff *skb); +void viport_kick(struct viport *viport); + +void viport_failure(struct viport *viport); + +int viport_set_unicast(struct viport *viport, u8 *address); +int viport_set_multicast(struct viport *viport, + struct dev_mc_list *mc_list, + int mc_count); + +#define viport_max_mtu(viport) data_max_mtu(&(viport)->data) + +#define viport_get_hw_addr(viport, address) \ + memcpy(address, (viport)->hw_mac_address, ETH_ALEN) + +#define viport_features(viport) ((viport)->features_supported) + +#define viport_can_tx_csum(viport) \ + (((viport)->features_supported & \ + 
(VNIC_FEAT_IPV4_CSUM_TX | VNIC_FEAT_TCP_CSUM_TX | \ + VNIC_FEAT_UDP_CSUM_TX)) == (VNIC_FEAT_IPV4_CSUM_TX | \ + VNIC_FEAT_TCP_CSUM_TX | VNIC_FEAT_UDP_CSUM_TX)) + +#endif /* VNIC_VIPORT_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:33:28 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:03:28 +0530 Subject: [ofa-general] [PATCH v2 04/13] QLogic VNIC: Implementation of Control path of communication protocol In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103328.12355.6429.stgit@localhost.localdomain> From: Poornima Kamath This patch adds the files that define the control packet formats and implements various control messages that are exchanged as part of the communication protocol with the EVIC/VEx. Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_control.c | 2286 ++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_control.h | 179 ++ .../infiniband/ulp/qlgc_vnic/vnic_control_pkt.h | 368 +++ 3 files changed, 2833 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_control.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.c new file mode 100644 index 0000000..774a071 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.c @@ -0,0 +1,2286 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_stats.h" + +#define vnic_multicast_address(rsp2_address, index) \ + ((rsp2_address)->list_address_ops[index].address[0] & 0x01) + +static void control_log_control_packet(struct vnic_control_packet *pkt); + +char *control_ifcfg_name(struct control *control) +{ + if (!control) + return "nctl"; + if (!control->parent) + return "np"; + if (!control->parent->parent) + return "npp"; + if (!control->parent->parent->parent) + return "nppp"; + if (!control->parent->parent->parent->config) + return "npppc"; + return (control->parent->parent->parent->config->name); +} + +static void control_recv(struct control *control, struct recv_io *recv_io) +{ + if (vnic_ib_post_recv(&control->ib_conn, &recv_io->io)) + viport_failure(control->parent); +} + +static void control_recv_complete(struct io *io) +{ + struct recv_io *recv_io = (struct recv_io *)io; + struct recv_io *last_recv_io; + struct control *control = &io->viport->control; + struct vnic_control_packet *pkt = control_packet(recv_io); + struct vnic_control_header *c_hdr = &pkt->hdr; + unsigned long flags; + cycles_t response_time; + + CONTROL_FUNCTION("%s: control_recv_complete() State=%d\n", + control_ifcfg_name(control), control->req_state); + + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + control_note_rsptime_stats(&response_time); + CONTROL_PACKET(pkt); + spin_lock_irqsave(&control->io_lock, flags); + if (c_hdr->pkt_type == TYPE_INFO) { + last_recv_io = control->info; + control->info = recv_io; + spin_unlock_irqrestore(&control->io_lock, flags); + viport_kick(control->parent); + if (last_recv_io) + control_recv(control, last_recv_io); + } else if (c_hdr->pkt_type == TYPE_RSP) { + u8 repost = 0; + u8 fail = 0; + u8 kick = 0; + + switch (control->req_state) { + case REQ_INACTIVE: + case RSP_RECEIVED: + case REQ_COMPLETED: + CONTROL_ERROR("%s: Unexpected control" + "response received: CMD = %d\n", + control_ifcfg_name(control), + c_hdr->pkt_cmd); + control_log_control_packet(pkt); + control->req_state = REQ_FAILED; + fail = 1; + break; + case REQ_POSTED: + case REQ_SENT: + if (c_hdr->pkt_cmd != control->last_cmd + || c_hdr->pkt_seq_num != control->seq_num) { + CONTROL_ERROR("%s: Incorrect Control Response " + "received\n", + control_ifcfg_name(control)); + CONTROL_ERROR("%s: Sent control request:\n", + control_ifcfg_name(control)); + control_log_control_packet(control_last_req(control)); + CONTROL_ERROR("%s: Received control response:\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + control->req_state = REQ_FAILED; + fail = 1; + } else { + control->response = recv_io; + control_update_rsptime_stats(control, + response_time); + if (control->req_state == REQ_POSTED) { + CONTROL_INFO("%s: Recv CMD RSP %d" + "before Send Completion\n", + control_ifcfg_name(control), + c_hdr->pkt_cmd); + control->req_state = RSP_RECEIVED; + } else { + control->req_state = REQ_COMPLETED; + kick = 1; + } + } + break; + case REQ_FAILED: + /* stay in REQ_FAILED state */ + repost = 1; + break; + } + spin_unlock_irqrestore(&control->io_lock, flags); + /* we must do this outside the lock*/ + if (kick) + viport_kick(control->parent); + if (repost || fail) { + control_recv(control, recv_io); + if (fail) + viport_failure(control->parent); + } + + } else { + list_add_tail(&recv_io->io.list_ptrs, + &control->failure_list); + 
+		spin_unlock_irqrestore(&control->io_lock, flags);
+		viport_kick(control->parent);
+	}
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->recv_dma, control->recv_len,
+				      DMA_FROM_DEVICE);
+}
+
+static void control_timeout(unsigned long data)
+{
+	struct control *control;
+	unsigned long flags;
+	u8 fail = 0;
+	u8 kick = 0;
+
+	control = (struct control *)data;
+	CONTROL_FUNCTION("%s: control_timeout(), State=%d\n",
+			 control_ifcfg_name(control), control->req_state);
+	control->timer_state = TIMER_EXPIRED;
+
+	spin_lock_irqsave(&control->io_lock, flags);
+	switch (control->req_state) {
+	case REQ_INACTIVE:
+		kick = 1;
+		/* stay in REQ_INACTIVE state */
+		break;
+	case REQ_POSTED:
+	case REQ_SENT:
+		control->req_state = REQ_FAILED;
+		CONTROL_ERROR("%s: No send Completion for Cmd=%d\n",
+			      control_ifcfg_name(control), control->last_cmd);
+		control_timeout_stats(control);
+		fail = 1;
+		break;
+	case RSP_RECEIVED:
+		control->req_state = REQ_FAILED;
+		CONTROL_ERROR("%s: No response received from EIOC for Cmd=%d\n",
+			      control_ifcfg_name(control), control->last_cmd);
+		control_timeout_stats(control);
+		fail = 1;
+		break;
+	case REQ_COMPLETED:
+		/* stay in REQ_COMPLETED state */
+		kick = 1;
+		break;
+	case REQ_FAILED:
+		/* stay in REQ_FAILED state */
+		break;
+	}
+	spin_unlock_irqrestore(&control->io_lock, flags);
+	/* we must do this outside the lock */
+	if (fail)
+		viport_failure(control->parent);
+	if (kick)
+		viport_kick(control->parent);
+
+	return;
+}
+
+static void control_timer(struct control *control, int timeout)
+{
+	CONTROL_FUNCTION("%s: control_timer()\n",
+			 control_ifcfg_name(control));
+	if (control->timer_state == TIMER_ACTIVE)
+		mod_timer(&control->timer, jiffies + timeout);
+	else {
+		init_timer(&control->timer);
+		control->timer.expires = jiffies + timeout;
+		control->timer.data = (unsigned long)control;
+		control->timer.function = control_timeout;
+		control->timer_state = TIMER_ACTIVE;
+		add_timer(&control->timer);
+	}
+}
+
+static void control_timer_stop(struct control *control)
+{
+	CONTROL_FUNCTION("%s: control_timer_stop()\n",
+			 control_ifcfg_name(control));
+	if (control->timer_state == TIMER_ACTIVE)
+		del_timer_sync(&control->timer);
+
+	control->timer_state = TIMER_IDLE;
+}
+
+static int control_send(struct control *control, struct send_io *send_io)
+{
+	unsigned long flags;
+	int ret = -1;
+	u8 fail = 0;
+	struct vnic_control_packet *pkt = control_packet(send_io);
+
+	CONTROL_FUNCTION("%s: control_send(), State=%d\n",
+			 control_ifcfg_name(control), control->req_state);
+	spin_lock_irqsave(&control->io_lock, flags);
+	switch (control->req_state) {
+	case REQ_INACTIVE:
+		CONTROL_PACKET(pkt);
+		control_timer(control, control->config->rsp_timeout);
+		control_note_reqtime_stats(control);
+		if (vnic_ib_post_send(&control->ib_conn, &control->send_io.io)) {
+			CONTROL_ERROR("%s: Failed to post send\n",
+				      control_ifcfg_name(control));
+			/* stay in REQ_INACTIVE state */
+			fail = 1;
+		} else {
+			control->last_cmd = pkt->hdr.pkt_cmd;
+			control->req_state = REQ_POSTED;
+			ret = 0;
+		}
+		break;
+	case REQ_POSTED:
+	case REQ_SENT:
+	case RSP_RECEIVED:
+	case REQ_COMPLETED:
+		CONTROL_ERROR("%s: Previous Command is not completed. "
+			      "New CMD: %d Last CMD: %d Seq: %d\n",
+			      control_ifcfg_name(control), pkt->hdr.pkt_cmd,
+			      control->last_cmd, control->seq_num);
+
+		control->req_state = REQ_FAILED;
+		fail = 1;
+		break;
+	case REQ_FAILED:
+		/* this can occur after an error when the ViPort state machine
+		 * attempts to reset the link.
+		 */
+		CONTROL_INFO("%s: Attempt to send in failed state. "
+			     "New CMD: %d Last CMD: %d\n",
+			     control_ifcfg_name(control), pkt->hdr.pkt_cmd,
+			     control->last_cmd);
+		/* stay in REQ_FAILED state */
+		break;
+	}
+	spin_unlock_irqrestore(&control->io_lock, flags);
+
+	/* we must do this outside the lock */
+	if (fail)
+		viport_failure(control->parent);
+	return ret;
+
+}
+
+static void control_send_complete(struct io *io)
+{
+	struct control *control = &io->viport->control;
+	unsigned long flags;
+	u8 fail = 0;
+	u8 kick = 0;
+
+	CONTROL_FUNCTION("%s: control_send_complete(), State=%d\n",
+			 control_ifcfg_name(control), control->req_state);
+	spin_lock_irqsave(&control->io_lock, flags);
+	switch (control->req_state) {
+	case REQ_INACTIVE:
+	case REQ_SENT:
+	case REQ_COMPLETED:
+		CONTROL_ERROR("%s: Unexpected control send completion\n",
+			      control_ifcfg_name(control));
+		fail = 1;
+		control->req_state = REQ_FAILED;
+		break;
+	case REQ_POSTED:
+		control->req_state = REQ_SENT;
+		break;
+	case RSP_RECEIVED:
+		control->req_state = REQ_COMPLETED;
+		kick = 1;
+		break;
+	case REQ_FAILED:
+		/* stay in REQ_FAILED state */
+		break;
+	}
+	spin_unlock_irqrestore(&control->io_lock, flags);
+	/* we must do this outside the lock */
+	if (fail)
+		viport_failure(control->parent);
+	if (kick)
+		viport_kick(control->parent);
+
+	return;
+}
+
+void control_process_async(struct control *control)
+{
+	struct recv_io *recv_io;
+	struct vnic_control_packet *pkt;
+	unsigned long flags;
+
+	CONTROL_FUNCTION("%s: control_process_async()\n",
+			 control_ifcfg_name(control));
+	ib_dma_sync_single_for_cpu(control->parent->config->ibdev,
+				   control->recv_dma, control->recv_len,
+				   DMA_FROM_DEVICE);
+
+	spin_lock_irqsave(&control->io_lock, flags);
+	recv_io = control->info;
+	if (recv_io) {
+		CONTROL_INFO("%s: processing info packet\n",
+			     control_ifcfg_name(control));
+		control->info = NULL;
+		spin_unlock_irqrestore(&control->io_lock, flags);
+		pkt = control_packet(recv_io);
+		if (pkt->hdr.pkt_cmd == CMD_REPORT_STATUS) {
+			u32 status;
+			status =
+			    be32_to_cpu(pkt->cmd.report_status.status_number);
+			switch (status) {
+			case VNIC_STATUS_LINK_UP:
+				CONTROL_INFO("%s: link up\n",
+					     control_ifcfg_name(control));
+				vnic_link_up(control->parent->vnic,
+					     control->parent->parent);
+				break;
+			case VNIC_STATUS_LINK_DOWN:
+				CONTROL_INFO("%s: link down\n",
+					     control_ifcfg_name(control));
+				vnic_link_down(control->parent->vnic,
+					       control->parent->parent);
+				break;
+			default:
+				CONTROL_ERROR("%s: asynchronous status"
+					      " received from EIOC\n",
+					      control_ifcfg_name(control));
+				control_log_control_packet(pkt);
+				break;
+			}
+		}
+		if ((pkt->hdr.pkt_cmd != CMD_REPORT_STATUS) ||
+		    pkt->cmd.report_status.is_fatal)
+			viport_failure(control->parent);
+
+		control_recv(control, recv_io);
+		spin_lock_irqsave(&control->io_lock, flags);
+	}
+
+	while (!list_empty(&control->failure_list)) {
+		CONTROL_INFO("%s: processing error packet\n",
+			     control_ifcfg_name(control));
+		recv_io = (struct recv_io *)
+		    list_entry(control->failure_list.next, struct io,
+			       list_ptrs);
+		list_del(&recv_io->io.list_ptrs);
+		spin_unlock_irqrestore(&control->io_lock, flags);
+		pkt = control_packet(recv_io);
+		CONTROL_ERROR("%s: asynchronous error received from EIOC\n",
+			      control_ifcfg_name(control));
+		control_log_control_packet(pkt);
+		if ((pkt->hdr.pkt_type != TYPE_ERR)
+		    || (pkt->hdr.pkt_cmd != CMD_REPORT_STATUS)
+		    || pkt->cmd.report_status.is_fatal)
+			viport_failure(control->parent);
+
+		control_recv(control, recv_io);
+		spin_lock_irqsave(&control->io_lock, flags);
+	}
+
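+	/* failure list drained; drop the lock and hand the
+	 * receive buffer back to the device below
+	 */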
+	spin_unlock_irqrestore(&control->io_lock, flags);
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->recv_dma, control->recv_len,
+				      DMA_FROM_DEVICE);
+
+	CONTROL_FUNCTION("%s: done control_process_async\n",
+			 control_ifcfg_name(control));
+}
+
+static struct send_io *control_init_hdr(struct control *control, u8 cmd)
+{
+	struct control_config *config;
+	struct vnic_control_packet *pkt;
+	struct vnic_control_header *hdr;
+
+	CONTROL_FUNCTION("control_init_hdr()\n");
+	config = control->config;
+
+	pkt = control_packet(&control->send_io);
+	hdr = &pkt->hdr;
+
+	hdr->pkt_type = TYPE_REQ;
+	hdr->pkt_cmd = cmd;
+	control->seq_num++;
+	hdr->pkt_seq_num = control->seq_num;
+	hdr->pkt_retry_count = 0;
+
+	return &control->send_io;
+}
+
+static struct recv_io *control_get_rsp(struct control *control)
+{
+	struct recv_io *recv_io = NULL;
+	unsigned long flags;
+	u8 fail = 0;
+
+	CONTROL_FUNCTION("%s: control_get_rsp(), State=%d\n",
+			 control_ifcfg_name(control), control->req_state);
+	spin_lock_irqsave(&control->io_lock, flags);
+	switch (control->req_state) {
+	case REQ_INACTIVE:
+		CONTROL_ERROR("%s: Checked for Response with no "
+			      "command pending\n",
+			      control_ifcfg_name(control));
+		control->req_state = REQ_FAILED;
+		fail = 1;
+		break;
+	case REQ_POSTED:
+	case REQ_SENT:
+	case RSP_RECEIVED:
+		/* no response available yet,
+		 * stay in present state */
+		break;
+	case REQ_COMPLETED:
+		recv_io = control->response;
+		if (!recv_io) {
+			control->req_state = REQ_FAILED;
+			fail = 1;
+			break;
+		}
+		control->response = NULL;
+		control->last_cmd = CMD_INVALID;
+		control_timer_stop(control);
+		control->req_state = REQ_INACTIVE;
+		break;
+	case REQ_FAILED:
+		control_timer_stop(control);
+		/* stay in REQ_FAILED state */
+		break;
+	}
+	spin_unlock_irqrestore(&control->io_lock, flags);
+	if (fail)
+		viport_failure(control->parent);
+	return recv_io;
+}
+
+int control_init_vnic_req(struct control *control)
+{
+	struct send_io *send_io;
+	struct control_config *config = control->config;
+	struct vnic_control_packet *pkt;
+	struct vnic_cmd_init_vnic_req *init_vnic_req;
+
+	ib_dma_sync_single_for_cpu(control->parent->config->ibdev,
+				   control->send_dma, control->send_len,
+				   DMA_TO_DEVICE);
+
+	send_io = control_init_hdr(control, CMD_INIT_VNIC);
+	if (!send_io)
+		goto failure;
+
+	pkt = control_packet(send_io);
+	init_vnic_req = &pkt->cmd.init_vnic_req;
+	init_vnic_req->vnic_major_version =
+	    __constant_cpu_to_be16(VNIC_MAJORVERSION);
+	init_vnic_req->vnic_minor_version =
+	    __constant_cpu_to_be16(VNIC_MINORVERSION);
+	init_vnic_req->vnic_instance = config->vnic_instance;
+	init_vnic_req->num_data_paths = 1;
+	init_vnic_req->num_address_entries =
+	    cpu_to_be16(config->max_address_entries);
+
+	control->last_cmd = pkt->hdr.pkt_cmd;
+	CONTROL_PACKET(pkt);
+
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+
+	return control_send(control, send_io);
+failure:
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+	return -1;
+}
+
+static int control_chk_vnic_rsp_values(struct control *control,
+				       u16 *num_addrs,
+				       u8 num_data_paths,
+				       u8 num_lan_switches,
+				       u32 *features)
+{
+
+	struct control_config *config = control->config;
+
+	if ((control->maj_ver > VNIC_MAJORVERSION)
+	    || ((control->maj_ver == VNIC_MAJORVERSION)
+		&& (control->min_ver > VNIC_MINORVERSION))) {
+		CONTROL_ERROR("%s: unsupported version\n",
+			      control_ifcfg_name(control));
+		goto failure;
+	}
+	if
(num_data_paths != 1) { + CONTROL_ERROR("%s: EIOC returned too many datapaths\n", + control_ifcfg_name(control)); + goto failure; + } + if (*num_addrs > config->max_address_entries) { + CONTROL_ERROR("%s: EIOC returned more address" + " entries than requested\n", + control_ifcfg_name(control)); + goto failure; + } + if (*num_addrs < config->min_address_entries) { + CONTROL_ERROR("%s: not enough address entries\n", + control_ifcfg_name(control)); + goto failure; + } + if (num_lan_switches < 1) { + CONTROL_ERROR("%s: EIOC returned no lan switches\n", + control_ifcfg_name(control)); + goto failure; + } + if (num_lan_switches > 1) { + CONTROL_ERROR("%s: EIOC returned multiple lan switches\n", + control_ifcfg_name(control)); + goto failure; + } + CONTROL_ERROR("%s checking features %x ib_multicast:%d\n", + control_ifcfg_name(control), + *features, config->ib_multicast); + if ((*features & VNIC_FEAT_INBOUND_IB_MC) && !config->ib_multicast) { + /* disable multicast if it is not on in the cfg file, or + if we turned it off because join failed */ + *features &= ~VNIC_FEAT_INBOUND_IB_MC; + } + + return 0; +failure: + return -1; +} + +int control_init_vnic_rsp(struct control *control, u32 *features, + u8 *mac_address, u16 *num_addrs, u16 *vlan) +{ + u8 num_data_paths; + u8 num_lan_switches; + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_init_vnic_rsp *init_vnic_rsp; + + + CONTROL_FUNCTION("%s: control_init_vnic_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_INIT_VNIC) + goto failure; + + init_vnic_rsp = &pkt->cmd.init_vnic_rsp; + control->maj_ver = be16_to_cpu(init_vnic_rsp->vnic_major_version); + control->min_ver = be16_to_cpu(init_vnic_rsp->vnic_minor_version); + num_data_paths = init_vnic_rsp->num_data_paths; + num_lan_switches = init_vnic_rsp->num_lan_switches; + *features = be32_to_cpu(init_vnic_rsp->features_supported); + *num_addrs = be16_to_cpu(init_vnic_rsp->num_address_entries); + + if (control_chk_vnic_rsp_values(control, num_addrs, + num_data_paths, + num_lan_switches, + features)) + goto failure; + + control->lan_switch.lan_switch_num = + init_vnic_rsp->lan_switch[0].lan_switch_num; + control->lan_switch.num_enet_ports = + init_vnic_rsp->lan_switch[0].num_enet_ports; + control->lan_switch.default_vlan = + init_vnic_rsp->lan_switch[0].default_vlan; + *vlan = be16_to_cpu(control->lan_switch.default_vlan); + memcpy(control->lan_switch.hw_mac_address, + init_vnic_rsp->lan_switch[0].hw_mac_address, ETH_ALEN); + memcpy(mac_address, init_vnic_rsp->lan_switch[0].hw_mac_address, + ETH_ALEN); + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +static void copy_recv_pool_config(struct vnic_recv_pool_config *src, + struct vnic_recv_pool_config *dst) +{ + dst->size_recv_pool_entry = src->size_recv_pool_entry; + dst->num_recv_pool_entries = src->num_recv_pool_entries; + dst->timeout_before_kick = src->timeout_before_kick; + dst->num_recv_pool_entries_before_kick = + src->num_recv_pool_entries_before_kick; + 
dst->num_recv_pool_bytes_before_kick =
+	    src->num_recv_pool_bytes_before_kick;
+	dst->free_recv_pool_entries_per_update =
+	    src->free_recv_pool_entries_per_update;
+}
+
+static int check_recv_pool_config_value(__be32 *src, __be32 *dst,
+					__be32 *max, __be32 *min,
+					char *name)
+{
+	u32 value;
+
+	value = be32_to_cpu(*src);
+	if (value > be32_to_cpu(*max)) {
+		CONTROL_ERROR("value %s too large\n", name);
+		return -1;
+	} else if (value < be32_to_cpu(*min)) {
+		CONTROL_ERROR("value %s too small\n", name);
+		return -1;
+	}
+
+	*dst = cpu_to_be32(value);
+	return 0;
+}
+
+static int check_recv_pool_config(struct vnic_recv_pool_config *src,
+				  struct vnic_recv_pool_config *dst,
+				  struct vnic_recv_pool_config *max,
+				  struct vnic_recv_pool_config *min)
+{
+	if (check_recv_pool_config_value(&src->size_recv_pool_entry,
+					 &dst->size_recv_pool_entry,
+					 &max->size_recv_pool_entry,
+					 &min->size_recv_pool_entry,
+					 "size_recv_pool_entry")
+	    || check_recv_pool_config_value(&src->num_recv_pool_entries,
+					    &dst->num_recv_pool_entries,
+					    &max->num_recv_pool_entries,
+					    &min->num_recv_pool_entries,
+					    "num_recv_pool_entries")
+	    || check_recv_pool_config_value(&src->timeout_before_kick,
+					    &dst->timeout_before_kick,
+					    &max->timeout_before_kick,
+					    &min->timeout_before_kick,
+					    "timeout_before_kick")
+	    || check_recv_pool_config_value(&src->num_recv_pool_entries_before_kick,
+					    &dst->num_recv_pool_entries_before_kick,
+					    &max->num_recv_pool_entries_before_kick,
+					    &min->num_recv_pool_entries_before_kick,
+					    "num_recv_pool_entries_before_kick")
+	    || check_recv_pool_config_value(&src->num_recv_pool_bytes_before_kick,
+					    &dst->num_recv_pool_bytes_before_kick,
+					    &max->num_recv_pool_bytes_before_kick,
+					    &min->num_recv_pool_bytes_before_kick,
+					    "num_recv_pool_bytes_before_kick")
+	    || check_recv_pool_config_value(&src->free_recv_pool_entries_per_update,
+					    &dst->free_recv_pool_entries_per_update,
+					    &max->free_recv_pool_entries_per_update,
+					    &min->free_recv_pool_entries_per_update,
+					    "free_recv_pool_entries_per_update"))
+		goto failure;
+
+	if (!is_power_of_2(be32_to_cpu(dst->num_recv_pool_entries))) {
+		CONTROL_ERROR("num_recv_pool_entries (%d)"
+			      " must be power of 2\n",
+			      be32_to_cpu(dst->num_recv_pool_entries));
+		goto failure;
+	}
+
+	if (!is_power_of_2(be32_to_cpu(dst->free_recv_pool_entries_per_update))) {
+		CONTROL_ERROR("free_recv_pool_entries_per_update (%d)"
+			      " must be power of 2\n",
+			      be32_to_cpu(dst->free_recv_pool_entries_per_update));
+		goto failure;
+	}
+
+	if (be32_to_cpu(dst->free_recv_pool_entries_per_update) >=
+	    be32_to_cpu(dst->num_recv_pool_entries)) {
+		CONTROL_ERROR("free_recv_pool_entries_per_update (%d) must"
+			      " be less than num_recv_pool_entries (%d)\n",
+			      be32_to_cpu(dst->free_recv_pool_entries_per_update),
+			      be32_to_cpu(dst->num_recv_pool_entries));
+		goto failure;
+	}
+
+	if (be32_to_cpu(dst->num_recv_pool_entries_before_kick) >=
+	    be32_to_cpu(dst->num_recv_pool_entries)) {
+		CONTROL_ERROR("num_recv_pool_entries_before_kick (%d) must"
+			      " be less than num_recv_pool_entries (%d)\n",
+			      be32_to_cpu(dst->num_recv_pool_entries_before_kick),
+			      be32_to_cpu(dst->num_recv_pool_entries));
+		goto failure;
+	}
+
+	return 0;
+failure:
+	return -1;
+}
+
+int control_config_data_path_req(struct control *control, u64 path_id,
+				 struct vnic_recv_pool_config *host,
+				 struct vnic_recv_pool_config *eioc)
+{
+	struct send_io *send_io;
+	struct vnic_control_packet *pkt;
+	struct vnic_cmd_config_data_path *config_data_path;
+
+	CONTROL_FUNCTION("%s: control_config_data_path_req()\n",
+			 control_ifcfg_name(control));
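+	/* make the send buffer CPU-owned while the request is built */ +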
ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_CONFIG_DATA_PATH); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_data_path = &pkt->cmd.config_data_path_req; + config_data_path->data_path = 0; + config_data_path->path_identifier = path_id; + copy_recv_pool_config(host, + &config_data_path->host_recv_pool_config); + copy_recv_pool_config(eioc, + &config_data_path->eioc_recv_pool_config); + CONTROL_PACKET(pkt); + + control->last_cmd = pkt->hdr.pkt_cmd; + + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_config_data_path_rsp(struct control *control, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc, + struct vnic_recv_pool_config *max_host, + struct vnic_recv_pool_config *max_eioc, + struct vnic_recv_pool_config *min_host, + struct vnic_recv_pool_config *min_eioc) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_data_path *config_data_path; + + CONTROL_FUNCTION("%s: control_config_data_path_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_CONFIG_DATA_PATH) + goto failure; + + config_data_path = &pkt->cmd.config_data_path_rsp; + if (config_data_path->data_path != 0) { + CONTROL_ERROR("%s: received CMD_CONFIG_DATA_PATH response" + " for wrong data path: %u\n", + control_ifcfg_name(control), + config_data_path->data_path); + goto failure; + } + + if (check_recv_pool_config(&config_data_path-> + host_recv_pool_config, + host, max_host, min_host) + || check_recv_pool_config(&config_data_path-> + eioc_recv_pool_config, + eioc, max_eioc, min_eioc)) { + goto failure; + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_exchange_pools_req(struct control *control, u64 addr, u32 rkey) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_exchange_pools *exchange_pools; + + CONTROL_FUNCTION("%s: control_exchange_pools_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_EXCHANGE_POOLS); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + exchange_pools = &pkt->cmd.exchange_pools_req; + exchange_pools->data_path = 0; + exchange_pools->pool_rkey = cpu_to_be32(rkey); + exchange_pools->pool_addr = cpu_to_be64(addr); + + control->last_cmd = pkt->hdr.pkt_cmd; + + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + 
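/* even on failure, hand buffer ownership back to the device */ +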
ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_exchange_pools_rsp(struct control *control, u64 *addr, + u32 *rkey) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_exchange_pools *exchange_pools; + + CONTROL_FUNCTION("%s: control_exchange_pools_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_EXCHANGE_POOLS) + goto failure; + + exchange_pools = &pkt->cmd.exchange_pools_rsp; + *rkey = be32_to_cpu(exchange_pools->pool_rkey); + *addr = be64_to_cpu(exchange_pools->pool_addr); + + if (exchange_pools->data_path != 0) { + CONTROL_ERROR("%s: received CMD_EXCHANGE_POOLS response" + " for wrong data path: %u\n", + control_ifcfg_name(control), + exchange_pools->data_path); + goto failure; + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_config_link_req(struct control *control, u16 flags, u16 mtu) +{ + struct send_io *send_io; + struct vnic_cmd_config_link *config_link_req; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_config_link_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_CONFIG_LINK); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_link_req = &pkt->cmd.config_link_req; + config_link_req->lan_switch_num = + control->lan_switch.lan_switch_num; + config_link_req->cmd_flags = VNIC_FLAG_SET_MTU; + if (flags & IFF_UP) + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_NIC; + else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_NIC; + if (flags & IFF_ALLMULTI) + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_MCAST_ALL; + else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_MCAST_ALL; + if (flags & IFF_PROMISC) { + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_PROMISC; + /* the EIOU doesn't really do PROMISC mode. + * if PROMISC is set, it only receives unicast packets + * I also have to set MCAST_ALL if I want real + * PROMISC mode. 
+		 */
+		config_link_req->cmd_flags &= ~VNIC_FLAG_DISABLE_MCAST_ALL;
+		config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_MCAST_ALL;
+	} else
+		config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_PROMISC;
+
+	config_link_req->mtu_size = cpu_to_be16(mtu);
+
+	control->last_cmd = pkt->hdr.pkt_cmd;
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+	return control_send(control, send_io);
+failure:
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+	return -1;
+}
+
+int control_config_link_rsp(struct control *control, u16 *flags, u16 *mtu)
+{
+	struct recv_io *recv_io;
+	struct vnic_control_packet *pkt;
+	struct vnic_cmd_config_link *config_link_rsp;
+
+	CONTROL_FUNCTION("%s: control_config_link_rsp()\n",
+			 control_ifcfg_name(control));
+	ib_dma_sync_single_for_cpu(control->parent->config->ibdev,
+				   control->recv_dma, control->recv_len,
+				   DMA_FROM_DEVICE);
+
+	recv_io = control_get_rsp(control);
+	if (!recv_io)
+		goto out;
+
+	pkt = control_packet(recv_io);
+	if (pkt->hdr.pkt_cmd != CMD_CONFIG_LINK)
+		goto failure;
+	config_link_rsp = &pkt->cmd.config_link_rsp;
+	if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_NIC)
+		*flags |= IFF_UP;
+	if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL)
+		*flags |= IFF_ALLMULTI;
+	if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_PROMISC)
+		*flags |= IFF_PROMISC;
+
+	*mtu = be16_to_cpu(config_link_rsp->mtu_size);
+
+	if (control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC) {
+		/* features_supported might include INBOUND_IB_MC, but the
+		 * MTU might cause it to be auto-disabled at the embedded
+		 * side */
+		if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL) {
+			union ib_gid mgid = config_link_rsp->allmulti_mgid;
+			if (mgid.raw[0] != 0xff) {
+				CONTROL_ERROR("%s: invalid format prefix "
+					      VNIC_GID_FMT "\n",
+					      control_ifcfg_name(control),
+					      VNIC_GID_RAW_ARG(mgid.raw));
+			} else {
+				/* rather than issuing join here, which might
+				 * arrive at SM before EVIC creates the MC
+				 * group, postpone it.
+				 */
+				vnic_mc_join_setup(control->parent, &mgid);
+				CONTROL_ERROR("join setup for ALL_MULTI\n");
+			}
+		}
+		/* we don't want to leave mcast group if MCAST_ALL is disabled
+		 * because there are no doubt multicast addresses set and we
+		 * want to stay joined so we can get that traffic via the
+		 * mcast group.
+ */ + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +/* control_config_addrs_req: + * return values: + * -1: failure + * 0: incomplete (successful operation, but more address + * table entries to be updated) + * 1: complete + */ +int control_config_addrs_req(struct control *control, + struct vnic_address_op2 *addrs, u16 num) +{ + u16 i; + u8 j; + int ret = 1; + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_addresses *config_addrs_req; + struct vnic_cmd_config_addresses2 *config_addrs_req2; + + CONTROL_FUNCTION("%s: control_config_addrs_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + if (control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + CONTROL_INFO("Sending CMD_CONFIG_ADDRESSES2 %lx MAX:%d " + "sizes:%d %d(off:%d) sizes2:%d %d %d" + "(off:%d - %d %d %d %d %d %d %d)\n", jiffies, + (int)MAX_CONFIG_ADDR_ENTRIES2, + (int)sizeof(struct vnic_cmd_config_addresses), + (int)sizeof(struct vnic_address_op), + (int)offsetof(struct vnic_cmd_config_addresses, + list_address_ops), + (int)sizeof(struct vnic_cmd_config_addresses2), + (int)sizeof(struct vnic_address_op2), + (int)sizeof(union ib_gid), + (int)offsetof(struct vnic_cmd_config_addresses2, + list_address_ops), + (int)offsetof(struct vnic_address_op2, index), + (int)offsetof(struct vnic_address_op2, operation), + (int)offsetof(struct vnic_address_op2, valid), + (int)offsetof(struct vnic_address_op2, address), + (int)offsetof(struct vnic_address_op2, vlan), + (int)offsetof(struct vnic_address_op2, reserved), + (int)offsetof(struct vnic_address_op2, mgid) + ); + send_io = control_init_hdr(control, CMD_CONFIG_ADDRESSES2); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_addrs_req2 = &pkt->cmd.config_addresses_req2; + memset(pkt->cmd.cmd_data, 0, VNIC_MAX_CONTROLDATASZ); + config_addrs_req2->lan_switch_num = + control->lan_switch.lan_switch_num; + for (i = 0, j = 0; (i < num) && (j < MAX_CONFIG_ADDR_ENTRIES2); i++) { + if (!addrs[i].operation) + continue; + config_addrs_req2->list_address_ops[j].index = + cpu_to_be16(i); + config_addrs_req2->list_address_ops[j].operation = + VNIC_OP_SET_ENTRY; + config_addrs_req2->list_address_ops[j].valid = + addrs[i].valid; + memcpy(config_addrs_req2->list_address_ops[j].address, + addrs[i].address, ETH_ALEN); + config_addrs_req2->list_address_ops[j].vlan = + addrs[i].vlan; + addrs[i].operation = 0; + CONTROL_INFO("%s i=%d " + "addr[%d]=%02x:%02x:%02x:%02x:%02x:%02x " + "valid:%d\n", control_ifcfg_name(control), i, j, + addrs[i].address[0], addrs[i].address[1], + addrs[i].address[2], addrs[i].address[3], + addrs[i].address[4], addrs[i].address[5], + addrs[i].valid); + j++; + } + config_addrs_req2->num_address_ops = j; + } else { + send_io = control_init_hdr(control, CMD_CONFIG_ADDRESSES); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_addrs_req = &pkt->cmd.config_addresses_req; + config_addrs_req->lan_switch_num = + control->lan_switch.lan_switch_num; + for (i = 0, j = 0; (i < num) && (j < 16); i++) { + if (!addrs[i].operation) + continue; + 
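/* copy the pending entry and clear its operation flag so it is not sent twice */ +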
config_addrs_req->list_address_ops[j].index =
+			    cpu_to_be16(i);
+			config_addrs_req->list_address_ops[j].operation =
+			    VNIC_OP_SET_ENTRY;
+			config_addrs_req->list_address_ops[j].valid =
+			    addrs[i].valid;
+			memcpy(config_addrs_req->list_address_ops[j].address,
+			       addrs[i].address, ETH_ALEN);
+			config_addrs_req->list_address_ops[j].vlan =
+			    addrs[i].vlan;
+			addrs[i].operation = 0;
+			j++;
+		}
+		config_addrs_req->num_address_ops = j;
+	}
+	for (; i < num; i++) {
+		if (addrs[i].operation) {
+			ret = 0;
+			break;
+		}
+	}
+
+	control->last_cmd = pkt->hdr.pkt_cmd;
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+
+	if (control_send(control, send_io))
+		return -1;
+	return ret;
+failure:
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+	return -1;
+}
+
+static int process_cmd_config_address2_rsp(struct control *control,
+					   struct vnic_control_packet *pkt,
+					   struct recv_io *recv_io)
+{
+	struct vnic_cmd_config_addresses2 *config_addrs_rsp2;
+	int idx, mcaddrs, nomgid;
+	union ib_gid mgid, rsp_mgid;
+
+	config_addrs_rsp2 = &pkt->cmd.config_addresses_rsp2;
+	CONTROL_INFO("%s rsp to CONFIG_ADDRESSES2\n",
+		     control_ifcfg_name(control));
+
+	for (idx = 0, mcaddrs = 0, nomgid = 1;
+	     idx < config_addrs_rsp2->num_address_ops;
+	     idx++) {
+		if (!config_addrs_rsp2->list_address_ops[idx].valid)
+			continue;
+
+		/* check if address is multicast */
+		if (!vnic_multicast_address(config_addrs_rsp2, idx))
+			continue;
+
+		mcaddrs++;
+		mgid = config_addrs_rsp2->list_address_ops[idx].mgid;
+		CONTROL_INFO("%s: got mgid " VNIC_GID_FMT
+			     " MCAST_MSG_SIZE:%d mtu:%d\n",
+			     control_ifcfg_name(control),
+			     VNIC_GID_RAW_ARG(mgid.raw),
+			     (int)MCAST_MSG_SIZE,
+			     control->parent->mtu);
+
+		/* Embedded should have turned off multicast
+		 * due to large MTU size; mgid had better be 0.
+		 */
+		if (control->parent->mtu > MCAST_MSG_SIZE) {
+			if ((mgid.global.subnet_prefix != 0) ||
+			    (mgid.global.interface_id != 0)) {
+				CONTROL_ERROR("%s: invalid mgid; "
+					      "expected 0 "
+					      VNIC_GID_FMT "\n",
+					      control_ifcfg_name(control),
+					      VNIC_GID_RAW_ARG(mgid.raw));
+			}
+			continue;
+		}
+		if (mgid.raw[0] != 0xff) {
+			CONTROL_ERROR("%s: invalid format prefix "
+				      VNIC_GID_FMT "\n",
+				      control_ifcfg_name(control),
+				      VNIC_GID_RAW_ARG(mgid.raw));
+			continue;
+		}
+		nomgid = 0;	/* got a valid mgid */
+
+		/* let's verify that all the mgids match this one */
+		for (; idx < config_addrs_rsp2->num_address_ops; idx++) {
+			if (!config_addrs_rsp2->list_address_ops[idx].valid)
+				continue;
+
+			/* check if address is multicast */
+			if (!vnic_multicast_address(config_addrs_rsp2, idx))
+				continue;
+
+			rsp_mgid = config_addrs_rsp2->list_address_ops[idx].mgid;
+			if (memcmp(&mgid, &rsp_mgid, sizeof(union ib_gid)) == 0)
+				continue;
+
+			CONTROL_ERROR("%s: Multicast Group MGIDs not "
+				      "unique; mgids: " VNIC_GID_FMT
+				      " " VNIC_GID_FMT "\n",
+				      control_ifcfg_name(control),
+				      VNIC_GID_RAW_ARG(mgid.raw),
+				      VNIC_GID_RAW_ARG(rsp_mgid.raw));
+			return 1;
+		}
+
+		/* rather than issuing join here, which might arrive
+		 * at SM before EVIC creates the MC group, postpone it.
+		 */
+		vnic_mc_join_setup(control->parent, &mgid);
+
+		/* there is only one multicast group to join, so we're done. */
+		break;
+	}
+
+	/* we sent at least one multicast address but got no MGID
+	 * back so, if it is not the allmulti case, leave the group
+	 * we joined before.
+	 * (for the allmulti case we have to stay joined)
+	 */
+	if ((config_addrs_rsp2->num_address_ops > 0) && (mcaddrs > 0) &&
+	    nomgid && !(control->parent->flags & IFF_ALLMULTI)) {
+		CONTROL_INFO("numaddrops:%d mcaddrs:%d nomgid:%d\n",
+			     config_addrs_rsp2->num_address_ops,
+			     mcaddrs > 0, nomgid);
+
+		vnic_mc_leave(control->parent);
+	}
+
+	return 0;
+}
+
+int control_config_addrs_rsp(struct control *control)
+{
+	struct recv_io *recv_io;
+	struct vnic_control_packet *pkt;
+
+	CONTROL_FUNCTION("%s: control_config_addrs_rsp()\n",
+			 control_ifcfg_name(control));
+	ib_dma_sync_single_for_cpu(control->parent->config->ibdev,
+				   control->recv_dma, control->recv_len,
+				   DMA_FROM_DEVICE);
+
+	recv_io = control_get_rsp(control);
+	if (!recv_io)
+		goto out;
+
+	pkt = control_packet(recv_io);
+	if ((pkt->hdr.pkt_cmd != CMD_CONFIG_ADDRESSES) &&
+	    (pkt->hdr.pkt_cmd != CMD_CONFIG_ADDRESSES2))
+		goto failure;
+
+	if (((pkt->hdr.pkt_cmd == CMD_CONFIG_ADDRESSES2) &&
+	     !(control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC)) ||
+	    ((pkt->hdr.pkt_cmd == CMD_CONFIG_ADDRESSES) &&
+	     (control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC))) {
+		CONTROL_ERROR("%s unexpected response pktCmd:%d flag:%x\n",
+			      control_ifcfg_name(control), pkt->hdr.pkt_cmd,
+			      control->parent->features_supported &
+			      VNIC_FEAT_INBOUND_IB_MC);
+		goto failure;
+	}
+
+	if (pkt->hdr.pkt_cmd == CMD_CONFIG_ADDRESSES2) {
+		if (process_cmd_config_address2_rsp(control, pkt, recv_io))
+			goto failure;
+	} else {
+		struct vnic_cmd_config_addresses *config_addrs_rsp;
+		config_addrs_rsp = &pkt->cmd.config_addresses_rsp;
+	}
+
+	control_recv(control, recv_io);
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->recv_dma, control->recv_len,
+				      DMA_FROM_DEVICE);
+	return 0;
+failure:
+	viport_failure(control->parent);
+out:
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->recv_dma, control->recv_len,
+				      DMA_FROM_DEVICE);
+	return -1;
+}
+
+int control_report_statistics_req(struct control *control)
+{
+	struct send_io *send_io;
+	struct vnic_control_packet *pkt;
+	struct vnic_cmd_report_stats_req *report_statistics_req;
+
+	CONTROL_FUNCTION("%s: control_report_statistics_req()\n",
+			 control_ifcfg_name(control));
+	ib_dma_sync_single_for_cpu(control->parent->config->ibdev,
+				   control->send_dma, control->send_len,
+				   DMA_TO_DEVICE);
+
+	send_io = control_init_hdr(control, CMD_REPORT_STATISTICS);
+	if (!send_io)
+		goto failure;
+
+	pkt = control_packet(send_io);
+	report_statistics_req = &pkt->cmd.report_statistics_req;
+	report_statistics_req->lan_switch_num =
+	    control->lan_switch.lan_switch_num;
+
+	control->last_cmd = pkt->hdr.pkt_cmd;
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+	return control_send(control, send_io);
+failure:
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->send_dma, control->send_len,
+				      DMA_TO_DEVICE);
+	return -1;
+}
+
+int control_report_statistics_rsp(struct control *control,
+				  struct vnic_cmd_report_stats_rsp *stats)
+{
+	struct recv_io *recv_io;
+	struct vnic_control_packet *pkt;
+	struct vnic_cmd_report_stats_rsp *rep_stat_rsp;
+
+	CONTROL_FUNCTION("%s: control_report_statistics_rsp()\n",
+			 control_ifcfg_name(control));
+	ib_dma_sync_single_for_cpu(control->parent->config->ibdev,
+				   control->recv_dma, control->recv_len,
+				   DMA_FROM_DEVICE);
+
+	recv_io = control_get_rsp(control);
+	if (!recv_io)
+		goto out;
+
+	pkt = control_packet(recv_io);
+	if (pkt->hdr.pkt_cmd !=
CMD_REPORT_STATISTICS) + goto failure; + + rep_stat_rsp = &pkt->cmd.report_statistics_rsp; + + stats->if_in_broadcast_pkts = rep_stat_rsp->if_in_broadcast_pkts; + stats->if_in_multicast_pkts = rep_stat_rsp->if_in_multicast_pkts; + stats->if_in_octets = rep_stat_rsp->if_in_octets; + stats->if_in_ucast_pkts = rep_stat_rsp->if_in_ucast_pkts; + stats->if_in_nucast_pkts = rep_stat_rsp->if_in_nucast_pkts; + stats->if_in_underrun = rep_stat_rsp->if_in_underrun; + stats->if_in_errors = rep_stat_rsp->if_in_errors; + stats->if_out_errors = rep_stat_rsp->if_out_errors; + stats->if_out_octets = rep_stat_rsp->if_out_octets; + stats->if_out_ucast_pkts = rep_stat_rsp->if_out_ucast_pkts; + stats->if_out_multicast_pkts = rep_stat_rsp->if_out_multicast_pkts; + stats->if_out_broadcast_pkts = rep_stat_rsp->if_out_broadcast_pkts; + stats->if_out_nucast_pkts = rep_stat_rsp->if_out_nucast_pkts; + stats->if_out_ok = rep_stat_rsp->if_out_ok; + stats->if_in_ok = rep_stat_rsp->if_in_ok; + stats->if_out_ucast_bytes = rep_stat_rsp->if_out_ucast_bytes; + stats->if_out_multicast_bytes = rep_stat_rsp->if_out_multicast_bytes; + stats->if_out_broadcast_bytes = rep_stat_rsp->if_out_broadcast_bytes; + stats->if_in_ucast_bytes = rep_stat_rsp->if_in_ucast_bytes; + stats->if_in_multicast_bytes = rep_stat_rsp->if_in_multicast_bytes; + stats->if_in_broadcast_bytes = rep_stat_rsp->if_in_broadcast_bytes; + stats->ethernet_status = rep_stat_rsp->ethernet_status; + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_reset_req(struct control *control) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_reset_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_RESET); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_reset_rsp(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_reset_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_RESET) + goto failure; + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_heartbeat_req(struct control *control, u32 hb_interval) +{ + struct send_io *send_io; + struct 
vnic_control_packet *pkt; + struct vnic_cmd_heartbeat *heartbeat_req; + + CONTROL_FUNCTION("%s: control_heartbeat_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_HEARTBEAT); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + heartbeat_req = &pkt->cmd.heartbeat_req; + heartbeat_req->hb_interval = cpu_to_be32(hb_interval); + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_heartbeat_rsp(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_heartbeat *heartbeat_rsp; + + CONTROL_FUNCTION("%s: control_heartbeat_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_HEARTBEAT) + goto failure; + + heartbeat_rsp = &pkt->cmd.heartbeat_rsp; + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +static int control_init_recv_ios(struct control *control, + struct viport *viport, + struct vnic_control_packet *pkt) +{ + struct io *io; + struct ib_device *ibdev = viport->config->ibdev; + struct control_config *config = control->config; + dma_addr_t recv_dma; + unsigned int i; + + + control->recv_len = sizeof *pkt * config->num_recvs; + control->recv_dma = ib_dma_map_single(ibdev, + pkt, control->recv_len, + DMA_FROM_DEVICE); + + if (ib_dma_mapping_error(ibdev, control->recv_dma)) { + CONTROL_ERROR("control recv dma map error\n"); + goto failure; + } + + recv_dma = control->recv_dma; + for (i = 0; i < config->num_recvs; i++) { + io = &control->recv_ios[i].io; + io->viport = viport; + io->routine = control_recv_complete; + io->type = RECV; + + control->recv_ios[i].virtual_addr = (u8 *)pkt; + control->recv_ios[i].list.addr = recv_dma; + control->recv_ios[i].list.length = sizeof *pkt; + control->recv_ios[i].list.lkey = control->mr->lkey; + + recv_dma = recv_dma + sizeof *pkt; + pkt++; + + io->rwr.wr_id = (u64)io; + io->rwr.sg_list = &control->recv_ios[i].list; + io->rwr.num_sge = 1; + if (vnic_ib_post_recv(&control->ib_conn, io)) + goto unmap_recv; + } + + return 0; +unmap_recv: + ib_dma_unmap_single(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); +failure: + return -1; +} + +static int control_init_send_ios(struct control *control, + struct viport *viport, + struct vnic_control_packet *pkt) +{ + struct io *io; + struct ib_device *ibdev = viport->config->ibdev; + + control->send_io.virtual_addr = (u8 *)pkt; + control->send_len = sizeof *pkt; + control->send_dma = ib_dma_map_single(ibdev, pkt, + control->send_len, + DMA_TO_DEVICE); + if (ib_dma_mapping_error(ibdev, control->send_dma)) { + 
CONTROL_ERROR("control send dma map error\n"); + goto failure; + } + + io = &control->send_io.io; + io->viport = viport; + io->routine = control_send_complete; + + control->send_io.list.addr = control->send_dma; + control->send_io.list.length = sizeof *pkt; + control->send_io.list.lkey = control->mr->lkey; + + io->swr.wr_id = (u64)io; + io->swr.sg_list = &control->send_io.list; + io->swr.num_sge = 1; + io->swr.opcode = IB_WR_SEND; + io->swr.send_flags = IB_SEND_SIGNALED; + io->type = SEND; + + return 0; +failure: + return -1; +} + +int control_init(struct control *control, struct viport *viport, + struct control_config *config, struct ib_pd *pd) +{ + struct vnic_control_packet *pkt; + unsigned int sz; + + CONTROL_FUNCTION("%s: control_init()\n", + control_ifcfg_name(control)); + control->parent = viport; + control->config = config; + control->ib_conn.viport = viport; + control->ib_conn.ib_config = &config->ib_config; + control->ib_conn.state = IB_CONN_UNINITTED; + control->ib_conn.callback_thread = NULL; + control->ib_conn.callback_thread_end = 0; + control->req_state = REQ_INACTIVE; + control->last_cmd = CMD_INVALID; + control->seq_num = 0; + control->response = NULL; + control->info = NULL; + INIT_LIST_HEAD(&control->failure_list); + spin_lock_init(&control->io_lock); + + if (vnic_ib_conn_init(&control->ib_conn, viport, pd, + &config->ib_config)) { + CONTROL_ERROR("Control IB connection" + " initialization failed\n"); + goto failure; + } + + control->mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(control->mr)) { + CONTROL_ERROR("%s: failed to register memory" + " for control connection\n", + control_ifcfg_name(control)); + goto destroy_conn; + } + + control->ib_conn.cm_id = ib_create_cm_id(viport->config->ibdev, + vnic_ib_cm_handler, + &control->ib_conn); + if (IS_ERR(control->ib_conn.cm_id)) { + CONTROL_ERROR("creating control CM ID failed\n"); + goto destroy_mr; + } + + sz = sizeof(struct recv_io) * config->num_recvs; + control->recv_ios = vmalloc(sz); + + if (!control->recv_ios) { + CONTROL_ERROR("%s: failed allocating space for recv ios\n", + control_ifcfg_name(control)); + goto destroy_cm_id; + } + + memset(control->recv_ios, 0, sz); + /*One send buffer and num_recvs recv buffers */ + control->local_storage = kzalloc(sizeof *pkt * + (config->num_recvs + 1), + GFP_KERNEL); + + if (!control->local_storage) { + CONTROL_ERROR("%s: failed allocating space" + " for local storage\n", + control_ifcfg_name(control)); + goto free_recv_ios; + } + + pkt = control->local_storage; + if (control_init_send_ios(control, viport, pkt)) + goto free_storage; + + pkt++; + if (control_init_recv_ios(control, viport, pkt)) + goto unmap_send; + + return 0; + +unmap_send: + ib_dma_unmap_single(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); +free_storage: + kfree(control->local_storage); +free_recv_ios: + vfree(control->recv_ios); +destroy_cm_id: + ib_destroy_cm_id(control->ib_conn.cm_id); +destroy_mr: + ib_dereg_mr(control->mr); +destroy_conn: + ib_destroy_qp(control->ib_conn.qp); + ib_destroy_cq(control->ib_conn.cq); +failure: + return -1; +} + +void control_cleanup(struct control *control) +{ + CONTROL_FUNCTION("%s: control_disconnect()\n", + control_ifcfg_name(control)); + + if (ib_send_cm_dreq(control->ib_conn.cm_id, NULL, 0)) + CONTROL_ERROR("control CM DREQ sending failed\n"); + + control->ib_conn.state = IB_CONN_DISCONNECTED; + control_timer_stop(control); + control->req_state = REQ_INACTIVE; + control->response = NULL; + control->last_cmd = 
CMD_INVALID; + completion_callback_cleanup(&control->ib_conn); + ib_destroy_cm_id(control->ib_conn.cm_id); + ib_destroy_qp(control->ib_conn.qp); + ib_destroy_cq(control->ib_conn.cq); + ib_dereg_mr(control->mr); + ib_dma_unmap_single(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + ib_dma_unmap_single(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + vfree(control->recv_ios); + kfree(control->local_storage); + +} + +static void control_log_report_status_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_REPORT_STATUS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO + " lan_switch_num = %u, is_fatal = %u\n", + pkt->cmd.report_status.lan_switch_num, + pkt->cmd.report_status.is_fatal); + printk(KERN_INFO + " status_number = %u, status_info = %u\n", + be32_to_cpu(pkt->cmd.report_status.status_number), + be32_to_cpu(pkt->cmd.report_status.status_info)); + pkt->cmd.report_status.file_name[31] = '\0'; + pkt->cmd.report_status.routine[31] = '\0'; + printk(KERN_INFO " filename = %s, routine = %s\n", + pkt->cmd.report_status.file_name, + pkt->cmd.report_status.routine); + printk(KERN_INFO + " line_num = %u, error_parameter = %u\n", + be32_to_cpu(pkt->cmd.report_status.line_num), + be32_to_cpu(pkt->cmd.report_status.error_parameter)); + pkt->cmd.report_status.desc_text[127] = '\0'; + printk(KERN_INFO " desc_text = %s\n", + pkt->cmd.report_status.desc_text); +} + +static void control_log_report_stats_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_REPORT_STATISTICS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " lan_switch_num = %u\n", + pkt->cmd.report_statistics_req.lan_switch_num); + if (pkt->hdr.pkt_type == TYPE_REQ) + return; + printk(KERN_INFO " if_in_broadcast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_broadcast_pkts)); + printk(" if_in_multicast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_multicast_pkts)); + printk(KERN_INFO " if_in_octets = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_octets)); + printk(" if_in_ucast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_ucast_pkts)); + printk(KERN_INFO " if_in_nucast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_nucast_pkts)); + printk(" if_in_underrun = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_underrun)); + printk(KERN_INFO " if_in_errors = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_errors)); + printk(" if_out_errors = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_errors)); + printk(KERN_INFO " if_out_octets = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_octets)); + printk(" if_out_ucast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_ucast_pkts)); + printk(KERN_INFO " if_out_multicast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_multicast_pkts)); + printk(" if_out_broadcast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_broadcast_pkts)); + printk(KERN_INFO " if_out_nucast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. 
+ if_out_nucast_pkts)); + printk(" if_out_ok = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp.if_out_ok)); + printk(KERN_INFO " if_in_ok = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp.if_in_ok)); + printk(" if_out_ucast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_ucast_bytes)); + printk(KERN_INFO " if_out_multicast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_multicast_bytes)); + printk(" if_out_broadcast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_broadcast_bytes)); + printk(KERN_INFO " if_in_ucast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_ucast_bytes)); + printk(" if_in_multicast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_multicast_bytes)); + printk(KERN_INFO " if_in_broadcast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_broadcast_bytes)); + printk(" ethernet_status = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + ethernet_status)); +} + +static void control_log_config_link_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_LINK\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " cmd_flags = %x\n", + pkt->cmd.config_link_req.cmd_flags); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_ENABLE_NIC) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_NIC\n"); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_DISABLE_NIC) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_NIC\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_" + "MCAST_ALL\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_DISABLE_MCAST_ALL) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_" + "MCAST_ALL\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_ENABLE_PROMISC) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_" + "PROMISC\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_DISABLE_PROMISC) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_" + "PROMISC\n"); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_SET_MTU) + printk(KERN_INFO + " VNIC_FLAG_SET_MTU\n"); + printk(KERN_INFO + " lan_switch_num = %x, mtu_size = %d\n", + pkt->cmd.config_link_req.lan_switch_num, + be16_to_cpu(pkt->cmd.config_link_req.mtu_size)); + if (pkt->hdr.pkt_type == TYPE_RSP) { + printk(KERN_INFO + " default_vlan = %u," + " hw_mac_address =" + " %02x:%02x:%02x:%02x:%02x:%02x\n", + be16_to_cpu(pkt->cmd.config_link_req. + default_vlan), + pkt->cmd.config_link_req.hw_mac_address[0], + pkt->cmd.config_link_req.hw_mac_address[1], + pkt->cmd.config_link_req.hw_mac_address[2], + pkt->cmd.config_link_req.hw_mac_address[3], + pkt->cmd.config_link_req.hw_mac_address[4], + pkt->cmd.config_link_req.hw_mac_address[5]); + } +} + +static void print_config_addr(struct vnic_address_op *list, + int num_address_ops, size_t mgidoff) +{ + int i = 0; + + while (i < num_address_ops && i < 16) { + printk(KERN_INFO " list_address_ops[%u].index" + " = %u\n", i, be16_to_cpu(list->index)); + switch (list->operation) { + case VNIC_OP_GET_ENTRY: + printk(KERN_INFO " list_address_ops[%u]." + "operation = VNIC_OP_GET_ENTRY\n", i); + break; + case VNIC_OP_SET_ENTRY: + printk(KERN_INFO " list_address_ops[%u]." + "operation = VNIC_OP_SET_ENTRY\n", i); + break; + default: + printk(KERN_INFO " list_address_ops[%u]." 
+ "operation = UNKNOWN(%d)\n", i, + list->operation); + break; + } + printk(KERN_INFO " list_address_ops[%u].valid" + " = %u\n", i, list->valid); + printk(KERN_INFO " list_address_ops[%u].address" + " = %02x:%02x:%02x:%02x:%02x:%02x\n", i, + list->address[0], list->address[1], + list->address[2], list->address[3], + list->address[4], list->address[5]); + printk(KERN_INFO " list_address_ops[%u].vlan" + " = %u\n", i, be16_to_cpu(list->vlan)); + if (mgidoff) { + printk(KERN_INFO + " list_address_ops[%u].mgid" + " = " VNIC_GID_FMT "\n", i, + VNIC_GID_RAW_ARG((char *)list + mgidoff)); + list = (struct vnic_address_op *) + ((char *)list + sizeof(struct vnic_address_op2)); + } else + list = (struct vnic_address_op *) + ((char *)list + sizeof(struct vnic_address_op)); + i++; + } +} + +static void control_log_config_addrs_pkt(struct vnic_control_packet *pkt, + u8 addresses2) +{ + struct vnic_address_op *list; + int no_address_ops; + + if (addresses2) + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_ADDRESSES2\n"); + else + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_ADDRESSES\n"); + printk(KERN_INFO " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, pkt->hdr.pkt_retry_count); + if (addresses2) { + printk(KERN_INFO " num_address_ops = %x," + " lan_switch_num = %d\n", + pkt->cmd.config_addresses_req2.num_address_ops, + pkt->cmd.config_addresses_req2.lan_switch_num); + list = (struct vnic_address_op *) + pkt->cmd.config_addresses_req2.list_address_ops; + no_address_ops = pkt->cmd.config_addresses_req2.num_address_ops; + print_config_addr(list, no_address_ops, + offsetof(struct vnic_address_op2, mgid)); + } else { + printk(KERN_INFO " num_address_ops = %x," + " lan_switch_num = %d\n", + pkt->cmd.config_addresses_req.num_address_ops, + pkt->cmd.config_addresses_req.lan_switch_num); + list = pkt->cmd.config_addresses_req.list_address_ops; + no_address_ops = pkt->cmd.config_addresses_req.num_address_ops; + print_config_addr(list, no_address_ops, 0); + } +} + +static void control_log_exch_pools_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_EXCHANGE_POOLS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " datapath = %u\n", + pkt->cmd.exchange_pools_req.data_path); + printk(KERN_INFO " pool_rkey = %08x" + " pool_addr = %llx\n", + be32_to_cpu(pkt->cmd.exchange_pools_req.pool_rkey), + be64_to_cpu(pkt->cmd.exchange_pools_req.pool_addr)); +} + +static void control_log_data_path_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_DATA_PATH\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " path_identifier = %llx," + " data_path = %u\n", + pkt->cmd.config_data_path_req.path_identifier, + pkt->cmd.config_data_path_req.data_path); + printk(KERN_INFO + "host config size_recv_pool_entry = %u," + " num_recv_pool_entries = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.size_recv_pool_entry), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.num_recv_pool_entries)); + printk(KERN_INFO + " timeout_before_kick = %u," + " num_recv_pool_entries_before_kick = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.timeout_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. 
+ num_recv_pool_entries_before_kick)); + printk(KERN_INFO + " num_recv_pool_bytes_before_kick = %u," + " free_recv_pool_entries_per_update = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. + num_recv_pool_bytes_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. + free_recv_pool_entries_per_update)); + printk(KERN_INFO + "eioc config size_recv_pool_entry = %u," + " num_recv_pool_entries = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.size_recv_pool_entry), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.num_recv_pool_entries)); + printk(KERN_INFO + " timeout_before_kick = %u," + " num_recv_pool_entries_before_kick = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.timeout_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + num_recv_pool_entries_before_kick)); + printk(KERN_INFO + " num_recv_pool_bytes_before_kick = %u," + " free_recv_pool_entries_per_update = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + num_recv_pool_bytes_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + free_recv_pool_entries_per_update)); +} + +static void control_log_init_vnic_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_INIT_VNIC\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO + " vnic_major_version = %u," + " vnic_minor_version = %u\n", + be16_to_cpu(pkt->cmd.init_vnic_req.vnic_major_version), + be16_to_cpu(pkt->cmd.init_vnic_req.vnic_minor_version)); + if (pkt->hdr.pkt_type == TYPE_REQ) { + printk(KERN_INFO + " vnic_instance = %u," + " num_data_paths = %u\n", + pkt->cmd.init_vnic_req.vnic_instance, + pkt->cmd.init_vnic_req.num_data_paths); + printk(KERN_INFO + " num_address_entries = %u\n", + be16_to_cpu(pkt->cmd.init_vnic_req. + num_address_entries)); + } else { + printk(KERN_INFO + " num_lan_switches = %u," + " num_data_paths = %u\n", + pkt->cmd.init_vnic_rsp.num_lan_switches, + pkt->cmd.init_vnic_rsp.num_data_paths); + printk(KERN_INFO + " num_address_entries = %u," + " features_supported = %08x\n", + be16_to_cpu(pkt->cmd.init_vnic_rsp. + num_address_entries), + be32_to_cpu(pkt->cmd.init_vnic_rsp. + features_supported)); + if (pkt->cmd.init_vnic_rsp.num_lan_switches != 0) { + printk(KERN_INFO + "lan_switch[0] lan_switch_num = %u," + " num_enet_ports = %08x\n", + pkt->cmd.init_vnic_rsp. + lan_switch[0].lan_switch_num, + pkt->cmd.init_vnic_rsp. + lan_switch[0].num_enet_ports); + printk(KERN_INFO + " default_vlan = %u," + " hw_mac_address =" + " %02x:%02x:%02x:%02x:%02x:%02x\n", + be16_to_cpu(pkt->cmd.init_vnic_rsp. + lan_switch[0].default_vlan), + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[0], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[1], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[2], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[3], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[4], + pkt->cmd.init_vnic_rsp.lan_switch[0]. 
+ hw_mac_address[5]); + } + } +} + +static void control_log_control_packet(struct vnic_control_packet *pkt) +{ + switch (pkt->hdr.pkt_type) { + case TYPE_INFO: + printk(KERN_INFO "control_packet: pkt_type = TYPE_INFO\n"); + break; + case TYPE_REQ: + printk(KERN_INFO "control_packet: pkt_type = TYPE_REQ\n"); + break; + case TYPE_RSP: + printk(KERN_INFO "control_packet: pkt_type = TYPE_RSP\n"); + break; + case TYPE_ERR: + printk(KERN_INFO "control_packet: pkt_type = TYPE_ERR\n"); + break; + default: + printk(KERN_INFO "control_packet: pkt_type = UNKNOWN\n"); + } + + switch (pkt->hdr.pkt_cmd) { + case CMD_INIT_VNIC: + control_log_init_vnic_pkt(pkt); + break; + case CMD_CONFIG_DATA_PATH: + control_log_data_path_pkt(pkt); + break; + case CMD_EXCHANGE_POOLS: + control_log_exch_pools_pkt(pkt); + break; + case CMD_CONFIG_ADDRESSES: + control_log_config_addrs_pkt(pkt, 0); + break; + case CMD_CONFIG_ADDRESSES2: + control_log_config_addrs_pkt(pkt, 1); + break; + case CMD_CONFIG_LINK: + control_log_config_link_pkt(pkt); + break; + case CMD_REPORT_STATISTICS: + control_log_report_stats_pkt(pkt); + break; + case CMD_CLEAR_STATISTICS: + printk(KERN_INFO + " pkt_cmd = CMD_CLEAR_STATISTICS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + case CMD_REPORT_STATUS: + control_log_report_status_pkt(pkt); + + break; + case CMD_RESET: + printk(KERN_INFO + " pkt_cmd = CMD_RESET\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + case CMD_HEARTBEAT: + printk(KERN_INFO + " pkt_cmd = CMD_HEARTBEAT\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " hb_interval = %d\n", + be32_to_cpu(pkt->cmd.heartbeat_req.hb_interval)); + break; + default: + printk(KERN_INFO + " pkt_cmd = UNKNOWN (%u)\n", + pkt->hdr.pkt_cmd); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + } +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_control.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.h new file mode 100644 index 0000000..57fab67 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.h @@ -0,0 +1,179 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_CONTROL_H_INCLUDED +#define VNIC_CONTROL_H_INCLUDED + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS +#include +#include +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ + +#include "vnic_ib.h" +#include "vnic_control_pkt.h" + +enum control_timer_state { + TIMER_IDLE = 0, + TIMER_ACTIVE = 1, + TIMER_EXPIRED = 2 +}; + +enum control_request_state { + REQ_INACTIVE, /* quiet state, all previous operations done + * response is NULL + * last_cmd = CMD_INVALID + * timer_state = IDLE + */ + REQ_POSTED, /* REQ put on send Q + * response is NULL + * last_cmd = command issued + * timer_state = ACTIVE + */ + REQ_SENT, /* Send completed for REQ + * response is NULL + * last_cmd = command issued + * timer_state = ACTIVE + */ + RSP_RECEIVED, /* Received Resp, but no Send completion yet + * response is response buffer received + * last_cmd = command issued + * timer_state = ACTIVE + */ + REQ_COMPLETED, /* all processing for REQ completed, ready to be gotten + * response is response buffer received + * last_cmd = command issued + * timer_state = ACTIVE + */ + REQ_FAILED, /* processing of REQ/RSP failed. + * response is NULL + * last_cmd = CMD_INVALID + * timer_state = IDLE or EXPIRED + * viport has been moved to error state to force + * recovery + */ +}; + +struct control { + struct viport *parent; + struct control_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + struct vnic_control_packet *local_storage; + int send_len; + int recv_len; + u16 maj_ver; + u16 min_ver; + struct vnic_lan_switch_attribs lan_switch; + struct send_io send_io; + struct recv_io *recv_ios; + dma_addr_t send_dma; + dma_addr_t recv_dma; + enum control_timer_state timer_state; + enum control_request_state req_state; + struct timer_list timer; + u8 seq_num; + u8 last_cmd; + struct recv_io *response; + struct recv_io *info; + struct list_head failure_list; + spinlock_t io_lock; + struct completion done; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + cycles_t request_time; /* intermediate value */ + cycles_t response_time; + u32 response_num; + cycles_t response_max; + cycles_t response_min; + u32 timeout_num; + } statistics; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +int control_init(struct control *control, struct viport *viport, + struct control_config *config, struct ib_pd *pd); + +void control_cleanup(struct control *control); + +void control_process_async(struct control *control); + +int control_init_vnic_req(struct control *control); +int control_init_vnic_rsp(struct control *control, u32 *features, + u8 *mac_address, u16 *num_addrs, u16 *vlan); + +int control_config_data_path_req(struct control *control, u64 path_id, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc); +int control_config_data_path_rsp(struct control *control, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc, + struct vnic_recv_pool_config *max_host, + struct vnic_recv_pool_config *max_eioc, + struct vnic_recv_pool_config *min_host, + struct vnic_recv_pool_config *min_eioc); + +int control_exchange_pools_req(struct control *control, + u64 addr, u32 rkey); +int control_exchange_pools_rsp(struct control *control, + u64 *addr, u32 *rkey); + +int control_config_link_req(struct control *control, 
+ u16 flags, u16 mtu); +int control_config_link_rsp(struct control *control, + u16 *flags, u16 *mtu); + +int control_config_addrs_req(struct control *control, + struct vnic_address_op2 *addrs, u16 num); +int control_config_addrs_rsp(struct control *control); + +int control_report_statistics_req(struct control *control); +int control_report_statistics_rsp(struct control *control, + struct vnic_cmd_report_stats_rsp *stats); + +int control_heartbeat_req(struct control *control, u32 hb_interval); +int control_heartbeat_rsp(struct control *control); + +int control_reset_req(struct control *control); +int control_reset_rsp(struct control *control); + +#define control_packet(io) \ + (struct vnic_control_packet *)(io)->virtual_addr +#define control_is_connected(control) \ + (vnic_ib_conn_connected(&((control)->ib_conn))) + +#define control_last_req(control) control_packet(&(control)->send_io) +#define control_features(control) (control)->features_supported + +#define control_get_mac_address(control,addr) \ + memcpy(addr, (control)->lan_switch.hw_mac_address, ETH_ALEN) + +#endif /* VNIC_CONTROL_H_INCLUDED */ diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h new file mode 100644 index 0000000..1fc62fb --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h @@ -0,0 +1,368 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */
+
+#ifndef VNIC_CONTROL_PKT_H_INCLUDED
+#define VNIC_CONTROL_PKT_H_INCLUDED
+
+#include
+#include
+
+#define VNIC_MAX_NODENAME_LEN 64
+
+struct vnic_connection_data {
+	u64 path_id;
+	u8 vnic_instance;
+	u8 path_num;
+	u8 nodename[VNIC_MAX_NODENAME_LEN + 1];
+	u8 reserved; /* for alignment */
+	__be32 features_supported;
+};
+
+struct vnic_control_header {
+	u8 pkt_type;
+	u8 pkt_cmd;
+	u8 pkt_seq_num;
+	u8 pkt_retry_count;
+	u32 reserved; /* for 64-bit alignment */
+};
+
+/* pkt_type values */
+enum {
+	TYPE_INFO = 0,
+	TYPE_REQ = 1,
+	TYPE_RSP = 2,
+	TYPE_ERR = 3
+};
+
+/* pkt_cmd values */
+enum {
+	CMD_INVALID = 0,
+	CMD_INIT_VNIC = 1,
+	CMD_CONFIG_DATA_PATH = 2,
+	CMD_EXCHANGE_POOLS = 3,
+	CMD_CONFIG_ADDRESSES = 4,
+	CMD_CONFIG_LINK = 5,
+	CMD_REPORT_STATISTICS = 6,
+	CMD_CLEAR_STATISTICS = 7,
+	CMD_REPORT_STATUS = 8,
+	CMD_RESET = 9,
+	CMD_HEARTBEAT = 10,
+	CMD_CONFIG_ADDRESSES2 = 11,
+};
+
+/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_REQ data format */
+struct vnic_cmd_init_vnic_req {
+	__be16 vnic_major_version;
+	__be16 vnic_minor_version;
+	u8 vnic_instance;
+	u8 num_data_paths;
+	__be16 num_address_entries;
+};
+
+/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_RSP subdata format */
+struct vnic_lan_switch_attribs {
+	u8 lan_switch_num;
+	u8 num_enet_ports;
+	__be16 default_vlan;
+	u8 hw_mac_address[ETH_ALEN];
+};
+
+/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_RSP data format */
+struct vnic_cmd_init_vnic_rsp {
+	__be16 vnic_major_version;
+	__be16 vnic_minor_version;
+	u8 num_lan_switches;
+	u8 num_data_paths;
+	__be16 num_address_entries;
+	__be32 features_supported;
+	struct vnic_lan_switch_attribs lan_switch[1];
+};
+
+/* features_supported values */
+enum {
+	VNIC_FEAT_IPV4_HEADERS = 0x0001,
+	VNIC_FEAT_IPV6_HEADERS = 0x0002,
+	VNIC_FEAT_IPV4_CSUM_RX = 0x0004,
+	VNIC_FEAT_IPV4_CSUM_TX = 0x0008,
+	VNIC_FEAT_TCP_CSUM_RX = 0x0010,
+	VNIC_FEAT_TCP_CSUM_TX = 0x0020,
+	VNIC_FEAT_UDP_CSUM_RX = 0x0040,
+	VNIC_FEAT_UDP_CSUM_TX = 0x0080,
+	VNIC_FEAT_TCP_SEGMENT = 0x0100,
+	VNIC_FEAT_IPV4_IPSEC_OFFLOAD = 0x0200,
+	VNIC_FEAT_IPV6_IPSEC_OFFLOAD = 0x0400,
+	VNIC_FEAT_FCS_PROPAGATE = 0x0800,
+	VNIC_FEAT_PF_KICK = 0x1000,
+	VNIC_FEAT_PF_FORCE_ROUTE = 0x2000,
+	VNIC_FEAT_CHASH_OFFLOAD = 0x4000,
+	/* host send with immediate data */
+	VNIC_FEAT_RDMA_IMMED = 0x8000,
+	/* host ignore inbound PF_VLAN_INSERT flag */
+	VNIC_FEAT_IGNORE_VLAN = 0x10000,
+	/* host supports IB multicast for inbound Ethernet mcast traffic */
+	VNIC_FEAT_INBOUND_IB_MC = 0x20000,
+};
+
+/* pkt_cmd CMD_CONFIG_DATA_PATH subdata format */
+struct vnic_recv_pool_config {
+	__be32 size_recv_pool_entry;
+	__be32 num_recv_pool_entries;
+	__be32 timeout_before_kick;
+	__be32 num_recv_pool_entries_before_kick;
+	__be32 num_recv_pool_bytes_before_kick;
+	__be32 free_recv_pool_entries_per_update;
+};
+
+/* pkt_cmd CMD_CONFIG_DATA_PATH data format */
+struct vnic_cmd_config_data_path {
+	u64 path_identifier;
+	u8 data_path;
+	u8 reserved[3];
+	struct vnic_recv_pool_config host_recv_pool_config;
+	struct vnic_recv_pool_config eioc_recv_pool_config;
+};
+
+/* pkt_cmd CMD_EXCHANGE_POOLS data format */
+struct vnic_cmd_exchange_pools {
+	u8 data_path;
+	u8 reserved[3];
+	__be32 pool_rkey;
+	__be64 pool_addr;
+};
+
+/* pkt_cmd CMD_CONFIG_ADDRESSES subdata format */
+struct vnic_address_op {
+	__be16 index;
+	u8 operation;
+	u8 valid;
+	u8 address[6];
+	__be16 vlan;
+};
+
+/* pkt_cmd CMD_CONFIG_ADDRESSES2 subdata format */
+struct vnic_address_op2 {
+	__be16 index;
+	u8 operation;
+	u8 valid;
+	u8 address[6];
+	__be16 vlan;
+	u32 reserved; /* for
alignment */ + union ib_gid mgid; /* valid in rsp only if both ends support mcast */ +}; + +/* operation values */ +enum { + VNIC_OP_SET_ENTRY = 0x01, + VNIC_OP_GET_ENTRY = 0x02 +}; + +/* pkt_cmd CMD_CONFIG_ADDRESSES data format */ +struct vnic_cmd_config_addresses { + u8 num_address_ops; + u8 lan_switch_num; + struct vnic_address_op list_address_ops[1]; +}; + +/* pkt_cmd CMD_CONFIG_ADDRESSES2 data format */ +struct vnic_cmd_config_addresses2 { + u8 num_address_ops; + u8 lan_switch_num; + u8 reserved1; + u8 reserved2; + u8 reserved3; + struct vnic_address_op2 list_address_ops[1]; +}; + +/* CMD_CONFIG_LINK data format */ +struct vnic_cmd_config_link { + u8 cmd_flags; + u8 lan_switch_num; + __be16 mtu_size; + __be16 default_vlan; + u8 hw_mac_address[6]; + u32 reserved; /* for alignment */ + /* valid in rsp only if both ends support mcast */ + union ib_gid allmulti_mgid; +}; + +/* cmd_flags values */ +enum { + VNIC_FLAG_ENABLE_NIC = 0x01, + VNIC_FLAG_DISABLE_NIC = 0x02, + VNIC_FLAG_ENABLE_MCAST_ALL = 0x04, + VNIC_FLAG_DISABLE_MCAST_ALL = 0x08, + VNIC_FLAG_ENABLE_PROMISC = 0x10, + VNIC_FLAG_DISABLE_PROMISC = 0x20, + VNIC_FLAG_SET_MTU = 0x40 +}; + +/* pkt_cmd CMD_REPORT_STATISTICS, pkt_type TYPE_REQ data format */ +struct vnic_cmd_report_stats_req { + u8 lan_switch_num; +}; + +/* pkt_cmd CMD_REPORT_STATISTICS, pkt_type TYPE_RSP data format */ +struct vnic_cmd_report_stats_rsp { + u8 lan_switch_num; + u8 reserved[7]; /* for 64-bit alignment */ + __be64 if_in_broadcast_pkts; + __be64 if_in_multicast_pkts; + __be64 if_in_octets; + __be64 if_in_ucast_pkts; + __be64 if_in_nucast_pkts; /* if_in_broadcast_pkts + + if_in_multicast_pkts */ + __be64 if_in_underrun; /* (OID_GEN_RCV_NO_BUFFER) */ + __be64 if_in_errors; /* (OID_GEN_RCV_ERROR) */ + __be64 if_out_errors; /* (OID_GEN_XMIT_ERROR) */ + __be64 if_out_octets; + __be64 if_out_ucast_pkts; + __be64 if_out_multicast_pkts; + __be64 if_out_broadcast_pkts; + __be64 if_out_nucast_pkts; /* if_out_broadcast_pkts + + if_out_multicast_pkts */ + __be64 if_out_ok; /* if_out_nucast_pkts + + if_out_ucast_pkts(OID_GEN_XMIT_OK) */ + __be64 if_in_ok; /* if_in_nucast_pkts + + if_in_ucast_pkts(OID_GEN_RCV_OK) */ + __be64 if_out_ucast_bytes; /* (OID_GEN_DIRECTED_BYTES_XMT) */ + __be64 if_out_multicast_bytes; /* (OID_GEN_MULTICAST_BYTES_XMT) */ + __be64 if_out_broadcast_bytes; /* (OID_GEN_BROADCAST_BYTES_XMT) */ + __be64 if_in_ucast_bytes; /* (OID_GEN_DIRECTED_BYTES_RCV) */ + __be64 if_in_multicast_bytes; /* (OID_GEN_MULTICAST_BYTES_RCV) */ + __be64 if_in_broadcast_bytes; /* (OID_GEN_BROADCAST_BYTES_RCV) */ + __be64 ethernet_status; /* OID_GEN_MEDIA_CONNECT_STATUS) */ +}; + +/* pkt_cmd CMD_CLEAR_STATISTICS data format */ +struct vnic_cmd_clear_statistics { + u8 lan_switch_num; +}; + +/* pkt_cmd CMD_REPORT_STATUS data format */ +struct vnic_cmd_report_status { + u8 lan_switch_num; + u8 is_fatal; + u8 reserved[2]; /* for 32-bit alignment */ + __be32 status_number; + __be32 status_info; + u8 file_name[32]; + u8 routine[32]; + __be32 line_num; + __be32 error_parameter; + u8 desc_text[128]; +}; + +/* pkt_cmd CMD_HEARTBEAT data format */ +struct vnic_cmd_heartbeat { + __be32 hb_interval; +}; + +enum { + VNIC_STATUS_LINK_UP = 1, + VNIC_STATUS_LINK_DOWN = 2, + VNIC_STATUS_ENET_AGGREGATION_CHANGE = 3, + VNIC_STATUS_EIOC_SHUTDOWN = 4, + VNIC_STATUS_CONTROL_ERROR = 5, + VNIC_STATUS_EIOC_ERROR = 6 +}; + +#define VNIC_MAX_CONTROLPKTSZ 256 +#define VNIC_MAX_CONTROLDATASZ \ + (VNIC_MAX_CONTROLPKTSZ - sizeof(struct vnic_control_header)) + +struct vnic_control_packet { + struct 
vnic_control_header hdr;
+	union {
+		struct vnic_cmd_init_vnic_req init_vnic_req;
+		struct vnic_cmd_init_vnic_rsp init_vnic_rsp;
+		struct vnic_cmd_config_data_path config_data_path_req;
+		struct vnic_cmd_config_data_path config_data_path_rsp;
+		struct vnic_cmd_exchange_pools exchange_pools_req;
+		struct vnic_cmd_exchange_pools exchange_pools_rsp;
+		struct vnic_cmd_config_addresses config_addresses_req;
+		struct vnic_cmd_config_addresses2 config_addresses_req2;
+		struct vnic_cmd_config_addresses config_addresses_rsp;
+		struct vnic_cmd_config_addresses2 config_addresses_rsp2;
+		struct vnic_cmd_config_link config_link_req;
+		struct vnic_cmd_config_link config_link_rsp;
+		struct vnic_cmd_report_stats_req report_statistics_req;
+		struct vnic_cmd_report_stats_rsp report_statistics_rsp;
+		struct vnic_cmd_clear_statistics clear_statistics_req;
+		struct vnic_cmd_clear_statistics clear_statistics_rsp;
+		struct vnic_cmd_report_status report_status;
+		struct vnic_cmd_heartbeat heartbeat_req;
+		struct vnic_cmd_heartbeat heartbeat_rsp;
+
+		char cmd_data[VNIC_MAX_CONTROLDATASZ];
+	} cmd;
+};
+
+union ib_gid_cpu {
+	u8 raw[16];
+	struct {
+		u64 subnet_prefix;
+		u64 interface_id;
+	} global;
+};
+
+static inline void bswap_ib_gid(union ib_gid *mgid1, union ib_gid_cpu *mgid2)
+{
+	/* swap hi & low */
+	__be64 low = mgid1->global.subnet_prefix;
+	mgid2->global.subnet_prefix = be64_to_cpu(mgid1->global.interface_id);
+	mgid2->global.interface_id = be64_to_cpu(low);
+}
+
+#define VNIC_GID_FMT "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x"
+
+#define VNIC_GID_RAW_ARG(gid) be16_to_cpu(*(__be16 *)&(gid)[0]), \
+			be16_to_cpu(*(__be16 *)&(gid)[2]), \
+			be16_to_cpu(*(__be16 *)&(gid)[4]), \
+			be16_to_cpu(*(__be16 *)&(gid)[6]), \
+			be16_to_cpu(*(__be16 *)&(gid)[8]), \
+			be16_to_cpu(*(__be16 *)&(gid)[10]), \
+			be16_to_cpu(*(__be16 *)&(gid)[12]), \
+			be16_to_cpu(*(__be16 *)&(gid)[14])
+
+
+/* These defines are used to figure out how many address entries can be passed
+ * in config_addresses request.
+ */
+#define MAX_CONFIG_ADDR_ENTRIES \
+	((VNIC_MAX_CONTROLDATASZ - (sizeof(struct vnic_cmd_config_addresses) \
+	- sizeof(struct vnic_address_op)))/sizeof(struct vnic_address_op))
+#define MAX_CONFIG_ADDR_ENTRIES2 \
+	((VNIC_MAX_CONTROLDATASZ - (sizeof(struct vnic_cmd_config_addresses2) \
+	- sizeof(struct vnic_address_op2)))/sizeof(struct vnic_address_op2))
+
+
+#endif /* VNIC_CONTROL_PKT_H_INCLUDED */

From ramachandra.kuchimanchi at qlogic.com  Mon May 19 03:33:58 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Mon, 19 May 2008 16:03:58 +0530
Subject: [ofa-general] [PATCH v2 05/13] QLogic VNIC: Implementation of Data path of communication protocol
In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain>
References: <20080519102843.12355.832.stgit@localhost.localdomain>
Message-ID: <20080519103358.12355.76791.stgit@localhost.localdomain>

From: Ramachandra K

This patch implements the actual data transfer part of the communication
protocol with the EVIC/VEx. RDMA of Ethernet packets is implemented here.
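For reviewers new to the protocol: each transmit is a single RDMA write that
deposits the Ethernet frame at the tail of a remote EIOC buffer, immediately
followed by a struct viport_trailer. The snippet below is a minimal,
self-contained sketch of just that placement arithmetic, as done by
data_rdma_packet() in this patch; TRAILER_ALIGNMENT and TRAILER_SIZE are
illustrative stand-ins for VIPORT_TRAILER_ALIGNMENT and
sizeof(struct viport_trailer), not values taken from the patch:

	#include <stdint.h>
	#include <stdio.h>

	#define TRAILER_ALIGNMENT 8U	/* stand-in for VIPORT_TRAILER_ALIGNMENT */
	#define TRAILER_SIZE	  32U	/* stand-in for sizeof(struct viport_trailer) */

	int main(void)
	{
		uint64_t remote_addr = 0x100000; /* bpe->remote_addr (example value) */
		uint64_t buffer_sz = 2048;	 /* xmit_pool->buffer_sz (example value) */
		uint32_t pkt_len = 1514;	 /* bytes in the outgoing frame */

		/* pad the payload so the trailer that follows it stays aligned */
		uint32_t len = (pkt_len + TRAILER_ALIGNMENT - 1) &
			       ~(TRAILER_ALIGNMENT - 1);

		/* place payload + trailer flush against the end of the buffer */
		uint64_t target = remote_addr + buffer_sz - (TRAILER_SIZE + len);

		printf("RDMA write: %u payload bytes (+%u pad, +%u trailer) at"
		       " buffer offset 0x%llx\n", pkt_len, len - pkt_len,
		       TRAILER_SIZE, (unsigned long long)(target - remote_addr));
		return 0;
	}

The receive side mirrors this layout: data_alloc_buffers() points
rdma_dest->trailer at the end of each receive buffer, so the trailer is always
found at a fixed offset from the buffer's tail regardless of frame size.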
Signed-off-by: Ramachandra K
Signed-off-by: Poornima Kamath
Signed-off-by: Amar Mudrankit
---

 drivers/infiniband/ulp/qlgc_vnic/vnic_data.c    | 1492 +++++++++++++++++++++++
 drivers/infiniband/ulp/qlgc_vnic/vnic_data.h    |  206 +++
 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h |  103 ++
 3 files changed, 1801 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.c
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.h
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h

diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_data.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.c
new file mode 100644
index 0000000..b81fcde
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.c
@@ -0,0 +1,1492 @@
+/*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *	- Redistributions of source code must retain the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer.
+ *
+ *	- Redistributions in binary form must reproduce the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer in the documentation and/or other materials
+ *	  provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include
+#include
+#include
+#include
+
+#include "vnic_util.h"
+#include "vnic_viport.h"
+#include "vnic_main.h"
+#include "vnic_data.h"
+#include "vnic_trailer.h"
+#include "vnic_stats.h"
+
+static void data_received_kick(struct io *io);
+static void data_xmit_complete(struct io *io);
+
+static void mc_data_recv_routine(struct io *io);
+static void mc_data_post_recvs(struct mc_data *mc_data);
+static void mc_data_recv_to_skbuff(struct viport *viport, struct sk_buff *skb,
+				   struct viport_trailer *trailer);
+
+static u32 min_rcv_skb = 60;
+module_param(min_rcv_skb, int, 0444);
+MODULE_PARM_DESC(min_rcv_skb, "Packets of size (in bytes) less than"
+		 " or equal to this value will be copied during receive."
+		 " Default 60");
+
+static u32 min_xmt_skb = 60;
+module_param(min_xmt_skb, int, 0444);
+MODULE_PARM_DESC(min_xmt_skb, "Packets of size (in bytes) less than"
+		 " or equal to this value will be copied during transmit. "
+ "Default 60"); + +int data_init(struct data *data, struct viport *viport, + struct data_config *config, struct ib_pd *pd) +{ + DATA_FUNCTION("data_init()\n"); + + data->parent = viport; + data->config = config; + data->ib_conn.viport = viport; + data->ib_conn.ib_config = &config->ib_config; + data->ib_conn.state = IB_CONN_UNINITTED; + data->ib_conn.callback_thread = NULL; + data->ib_conn.callback_thread_end = 0; + + if ((min_xmt_skb < 60) || (min_xmt_skb > 9000)) { + DATA_ERROR("min_xmt_skb (%d) must be between 60 and 9000\n", + min_xmt_skb); + goto failure; + } + if (vnic_ib_conn_init(&data->ib_conn, viport, pd, + &config->ib_config)) { + DATA_ERROR("Data IB connection initialization failed\n"); + goto failure; + } + data->mr = ib_get_dma_mr(pd, + IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE); + if (IS_ERR(data->mr)) { + DATA_ERROR("failed to register memory for" + " data connection\n"); + goto destroy_conn; + } + + data->ib_conn.cm_id = ib_create_cm_id(viport->config->ibdev, + vnic_ib_cm_handler, + &data->ib_conn); + + if (IS_ERR(data->ib_conn.cm_id)) { + DATA_ERROR("creating data CM ID failed\n"); + goto dereg_mr; + } + + return 0; + +dereg_mr: + ib_dereg_mr(data->mr); +destroy_conn: + completion_callback_cleanup(&data->ib_conn); + ib_destroy_qp(data->ib_conn.qp); + ib_destroy_cq(data->ib_conn.cq); +failure: + return -1; +} + +static void data_post_recvs(struct data *data) +{ + unsigned long flags; + int i = 0; + + DATA_FUNCTION("data_post_recvs()\n"); + spin_lock_irqsave(&data->recv_ios_lock, flags); + while (!list_empty(&data->recv_ios)) { + struct io *io = list_entry(data->recv_ios.next, + struct io, list_ptrs); + struct recv_io *recv_io = (struct recv_io *)io; + + list_del(&recv_io->io.list_ptrs); + spin_unlock_irqrestore(&data->recv_ios_lock, flags); + if (vnic_ib_post_recv(&data->ib_conn, &recv_io->io)) { + viport_failure(data->parent); + return; + } + i++; + spin_lock_irqsave(&data->recv_ios_lock, flags); + } + spin_unlock_irqrestore(&data->recv_ios_lock, flags); + DATA_INFO("data posted %d %p\n", i, &data->recv_ios); +} + +static void data_init_pool_work_reqs(struct data *data, + struct recv_io *recv_io) +{ + struct recv_pool *recv_pool = &data->recv_pool; + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct rdma_io *rdma_io; + struct rdma_dest *rdma_dest; + dma_addr_t xmit_dma; + u8 *xmit_data; + unsigned int i; + + INIT_LIST_HEAD(&data->recv_ios); + spin_lock_init(&data->recv_ios_lock); + spin_lock_init(&data->xmit_buf_lock); + for (i = 0; i < data->config->num_recvs; i++) { + recv_io[i].io.viport = data->parent; + recv_io[i].io.routine = data_received_kick; + recv_io[i].list.addr = data->region_data_dma; + recv_io[i].list.length = 4; + recv_io[i].list.lkey = data->mr->lkey; + + recv_io[i].io.rwr.wr_id = (u64)&recv_io[i].io; + recv_io[i].io.rwr.sg_list = &recv_io[i].list; + recv_io[i].io.rwr.num_sge = 1; + + list_add(&recv_io[i].io.list_ptrs, &data->recv_ios); + } + + INIT_LIST_HEAD(&recv_pool->avail_recv_bufs); + for (i = 0; i < recv_pool->pool_sz; i++) { + rdma_dest = &recv_pool->recv_bufs[i]; + list_add(&rdma_dest->list_ptrs, + &recv_pool->avail_recv_bufs); + } + + xmit_dma = xmit_pool->xmitdata_dma; + xmit_data = xmit_pool->xmit_data; + + for (i = 0; i < xmit_pool->num_xmit_bufs; i++) { + rdma_io = &xmit_pool->xmit_bufs[i]; + rdma_io->index = i; + rdma_io->io.viport = data->parent; + rdma_io->io.routine = data_xmit_complete; + + rdma_io->list[0].lkey = data->mr->lkey; + rdma_io->list[1].lkey = data->mr->lkey; + rdma_io->io.swr.wr_id = 
(u64)rdma_io;
+		rdma_io->io.swr.sg_list = rdma_io->list;
+		rdma_io->io.swr.num_sge = 2;
+		rdma_io->io.swr.opcode = IB_WR_RDMA_WRITE;
+		rdma_io->io.swr.send_flags = IB_SEND_SIGNALED;
+		rdma_io->io.type = RDMA;
+
+		rdma_io->data = xmit_data;
+		rdma_io->data_dma = xmit_dma;
+
+		xmit_data += ALIGN(min_xmt_skb, VIPORT_TRAILER_ALIGNMENT);
+		xmit_dma += ALIGN(min_xmt_skb, VIPORT_TRAILER_ALIGNMENT);
+		rdma_io->trailer = (struct viport_trailer *)xmit_data;
+		rdma_io->trailer_dma = xmit_dma;
+		xmit_data += sizeof(struct viport_trailer);
+		xmit_dma += sizeof(struct viport_trailer);
+	}
+
+	xmit_pool->rdma_rkey = data->mr->rkey;
+	xmit_pool->rdma_addr = xmit_pool->buf_pool_dma;
+}
+
+static void data_init_free_bufs_swrs(struct data *data)
+{
+	struct rdma_io *rdma_io;
+	struct send_io *send_io;
+
+	rdma_io = &data->free_bufs_io;
+	rdma_io->io.viport = data->parent;
+	rdma_io->io.routine = NULL;
+
+	rdma_io->list[0].lkey = data->mr->lkey;
+
+	rdma_io->io.swr.wr_id = (u64)rdma_io;
+	rdma_io->io.swr.sg_list = rdma_io->list;
+	rdma_io->io.swr.num_sge = 1;
+	rdma_io->io.swr.opcode = IB_WR_RDMA_WRITE;
+	rdma_io->io.swr.send_flags = IB_SEND_SIGNALED;
+	rdma_io->io.type = RDMA;
+
+	send_io = &data->kick_io;
+	send_io->io.viport = data->parent;
+	send_io->io.routine = NULL;
+
+	send_io->list.addr = data->region_data_dma;
+	send_io->list.length = 0;
+	send_io->list.lkey = data->mr->lkey;
+
+	send_io->io.swr.wr_id = (u64)send_io;
+	send_io->io.swr.sg_list = &send_io->list;
+	send_io->io.swr.num_sge = 1;
+	send_io->io.swr.opcode = IB_WR_SEND;
+	send_io->io.swr.send_flags = IB_SEND_SIGNALED;
+	send_io->io.type = SEND;
+}
+
+static int data_init_buf_pools(struct data *data)
+{
+	struct recv_pool *recv_pool = &data->recv_pool;
+	struct xmit_pool *xmit_pool = &data->xmit_pool;
+	struct viport *viport = data->parent;
+
+	recv_pool->buf_pool_len =
+	    sizeof(struct buff_pool_entry) * recv_pool->eioc_pool_sz;
+
+	recv_pool->buf_pool = kzalloc(recv_pool->buf_pool_len, GFP_KERNEL);
+
+	if (!recv_pool->buf_pool) {
+		DATA_ERROR("failed allocating %d bytes"
+			   " for recv pool bufpool\n",
+			   recv_pool->buf_pool_len);
+		goto failure;
+	}
+
+	recv_pool->buf_pool_dma =
+	    ib_dma_map_single(viport->config->ibdev,
+			      recv_pool->buf_pool, recv_pool->buf_pool_len,
+			      DMA_TO_DEVICE);
+
+	if (ib_dma_mapping_error(viport->config->ibdev, recv_pool->buf_pool_dma)) {
+		DATA_ERROR("recv buf_pool dma map error\n");
+		goto free_recv_pool;
+	}
+
+	xmit_pool->buf_pool_len =
+	    sizeof(struct buff_pool_entry) * xmit_pool->pool_sz;
+	xmit_pool->buf_pool = kzalloc(xmit_pool->buf_pool_len, GFP_KERNEL);
+
+	if (!xmit_pool->buf_pool) {
+		DATA_ERROR("failed allocating %d bytes"
+			   " for xmit pool bufpool\n",
+			   xmit_pool->buf_pool_len);
+		goto unmap_recv_pool;
+	}
+
+	xmit_pool->buf_pool_dma =
+	    ib_dma_map_single(viport->config->ibdev,
+			      xmit_pool->buf_pool, xmit_pool->buf_pool_len,
+			      DMA_FROM_DEVICE);
+
+	if (ib_dma_mapping_error(viport->config->ibdev, xmit_pool->buf_pool_dma)) {
+		DATA_ERROR("xmit buf_pool dma map error\n");
+		goto free_xmit_pool;
+	}
+
+	xmit_pool->xmit_data = kzalloc(xmit_pool->xmitdata_len, GFP_KERNEL);
+
+	if (!xmit_pool->xmit_data) {
+		DATA_ERROR("failed allocating %d bytes for xmit data\n",
+			   xmit_pool->xmitdata_len);
+		goto unmap_xmit_pool;
+	}
+
+	xmit_pool->xmitdata_dma =
+	    ib_dma_map_single(viport->config->ibdev,
+			      xmit_pool->xmit_data, xmit_pool->xmitdata_len,
+			      DMA_TO_DEVICE);
+
+	if (ib_dma_mapping_error(viport->config->ibdev, xmit_pool->xmitdata_dma)) {
+		DATA_ERROR("xmit data dma map error\n");
+		goto free_xmit_data;
+	}
+
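+	/* Both pool entry tables and the xmit data region (small-packet
+	 * copy space plus the per-buffer trailers) are now allocated and
+	 * DMA-mapped.
+	 */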
return 0; + +free_xmit_data: + kfree(xmit_pool->xmit_data); +unmap_xmit_pool: + ib_dma_unmap_single(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_FROM_DEVICE); +free_xmit_pool: + kfree(xmit_pool->buf_pool); +unmap_recv_pool: + ib_dma_unmap_single(data->parent->config->ibdev, + recv_pool->buf_pool_dma, + recv_pool->buf_pool_len, DMA_TO_DEVICE); +free_recv_pool: + kfree(recv_pool->buf_pool); +failure: + return -1; +} + +static void data_init_xmit_pool(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + + xmit_pool->pool_sz = + be32_to_cpu(data->eioc_pool_parms.num_recv_pool_entries); + xmit_pool->buffer_sz = + be32_to_cpu(data->eioc_pool_parms.size_recv_pool_entry); + + xmit_pool->notify_count = 0; + xmit_pool->notify_bundle = data->config->notify_bundle; + xmit_pool->next_xmit_pool = 0; + xmit_pool->num_xmit_bufs = xmit_pool->notify_bundle * 2; + xmit_pool->next_xmit_buf = 0; + xmit_pool->last_comp_buf = xmit_pool->num_xmit_bufs - 1; + /* This assumes that data_init_recv_pool has been called + * before. + */ + data->max_mtu = MAX_PAYLOAD(min((data)->recv_pool.buffer_sz, + (data)->xmit_pool.buffer_sz)) - VLAN_ETH_HLEN; + + xmit_pool->kick_count = 0; + xmit_pool->kick_byte_count = 0; + + xmit_pool->send_kicks = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_entries_before_kick) + || be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_bytes_before_kick); + xmit_pool->kick_bundle = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_entries_before_kick); + xmit_pool->kick_byte_bundle = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_bytes_before_kick); + + xmit_pool->need_buffers = 1; + + xmit_pool->xmitdata_len = + BUFFER_SIZE(min_xmt_skb) * xmit_pool->num_xmit_bufs; +} + +static void data_init_recv_pool(struct data *data) +{ + struct recv_pool *recv_pool = &data->recv_pool; + + recv_pool->pool_sz = data->config->host_recv_pool_entries; + recv_pool->eioc_pool_sz = + be32_to_cpu(data->host_pool_parms.num_recv_pool_entries); + if (recv_pool->pool_sz > recv_pool->eioc_pool_sz) + recv_pool->pool_sz = + be32_to_cpu(data->host_pool_parms.num_recv_pool_entries); + + recv_pool->buffer_sz = + be32_to_cpu(data->host_pool_parms.size_recv_pool_entry); + + recv_pool->sz_free_bundle = + be32_to_cpu(data-> + host_pool_parms.free_recv_pool_entries_per_update); + recv_pool->num_free_bufs = 0; + recv_pool->num_posted_bufs = 0; + + recv_pool->next_full_buf = 0; + recv_pool->next_free_buf = 0; + recv_pool->kick_on_free = 0; +} + +int data_connect(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct recv_pool *recv_pool = &data->recv_pool; + struct recv_io *recv_io; + unsigned int sz; + struct viport *viport = data->parent; + + DATA_FUNCTION("data_connect()\n"); + + /* Do not interchange the order of the functions + * called below as this will affect the MAX MTU + * calculation + */ + + data_init_recv_pool(data); + data_init_xmit_pool(data); + + sz = sizeof(struct rdma_dest) * recv_pool->pool_sz + + sizeof(struct recv_io) * data->config->num_recvs + + sizeof(struct rdma_io) * xmit_pool->num_xmit_bufs; + + data->local_storage = vmalloc(sz); + + if (!data->local_storage) { + DATA_ERROR("failed allocating %d bytes" + " local storage\n", sz); + goto out; + } + + memset(data->local_storage, 0, sz); + + recv_pool->recv_bufs = (struct rdma_dest *)data->local_storage; + sz = sizeof(struct rdma_dest) * recv_pool->pool_sz; + + recv_io = (struct recv_io *)(data->local_storage + sz); + sz += sizeof(struct recv_io) * 
data->config->num_recvs; + + xmit_pool->xmit_bufs = (struct rdma_io *)(data->local_storage + sz); + data->region_data = kzalloc(4, GFP_KERNEL); + + if (!data->region_data) { + DATA_ERROR("failed to alloc memory for region data\n"); + goto free_local_storage; + } + + data->region_data_dma = + ib_dma_map_single(viport->config->ibdev, + data->region_data, 4, DMA_BIDIRECTIONAL); + + if (ib_dma_mapping_error(viport->config->ibdev, data->region_data_dma)) { + DATA_ERROR("region data dma map error\n"); + goto free_region_data; + } + + if (data_init_buf_pools(data)) + goto unmap_region_data; + + data_init_free_bufs_swrs(data); + data_init_pool_work_reqs(data, recv_io); + + data_post_recvs(data); + + if (vnic_ib_cm_connect(&data->ib_conn)) + goto unmap_region_data; + + return 0; + +unmap_region_data: + ib_dma_unmap_single(data->parent->config->ibdev, + data->region_data_dma, 4, DMA_BIDIRECTIONAL); +free_region_data: + kfree(data->region_data); +free_local_storage: + vfree(data->local_storage); +out: + return -1; +} + +static void data_add_free_buffer(struct data *data, int index, + struct rdma_dest *rdma_dest) +{ + struct recv_pool *pool = &data->recv_pool; + struct buff_pool_entry *bpe; + dma_addr_t vaddr_dma; + + DATA_FUNCTION("data_add_free_buffer()\n"); + rdma_dest->trailer->connection_hash_and_valid = 0; + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + bpe = &pool->buf_pool[index]; + bpe->rkey = cpu_to_be32(data->mr->rkey); + vaddr_dma = ib_dma_map_single(data->parent->config->ibdev, + rdma_dest->data, pool->buffer_sz, + DMA_FROM_DEVICE); + if (ib_dma_mapping_error(data->parent->config->ibdev, vaddr_dma)) { + DATA_ERROR("rdma_dest->data dma map error\n"); + goto failure; + } + bpe->remote_addr = cpu_to_be64(vaddr_dma); + bpe->valid = (u32) (rdma_dest - &pool->recv_bufs[0]) + 1; + ++pool->num_free_bufs; +failure: + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); +} + +/* NOTE: this routine is not reentrant */ +static void data_alloc_buffers(struct data *data, int initial_allocation) +{ + struct recv_pool *pool = &data->recv_pool; + struct rdma_dest *rdma_dest; + struct sk_buff *skb; + int index; + + DATA_FUNCTION("data_alloc_buffers()\n"); + index = ADD(pool->next_free_buf, pool->num_free_bufs, + pool->eioc_pool_sz); + + while (!list_empty(&pool->avail_recv_bufs)) { + rdma_dest = + list_entry(pool->avail_recv_bufs.next, + struct rdma_dest, list_ptrs); + if (!rdma_dest->skb) { + if (initial_allocation) + skb = alloc_skb(pool->buffer_sz + 2, + GFP_KERNEL); + else + skb = dev_alloc_skb(pool->buffer_sz + 2); + if (!skb) + break; + skb_reserve(skb, 2); + skb_put(skb, pool->buffer_sz); + rdma_dest->skb = skb; + rdma_dest->data = skb->data; + rdma_dest->trailer = + (struct viport_trailer *)(rdma_dest->data + + pool->buffer_sz - + sizeof(struct + viport_trailer)); + } + rdma_dest->trailer->connection_hash_and_valid = 0; + + list_del_init(&rdma_dest->list_ptrs); + + data_add_free_buffer(data, index, rdma_dest); + index = NEXT(index, pool->eioc_pool_sz); + } +} + +static void data_send_kick_message(struct data *data) +{ + struct xmit_pool *pool = &data->xmit_pool; + DATA_FUNCTION("data_send_kick_message()\n"); + /* stop timer for bundle_timeout */ + if (data->kick_timer_on) { + del_timer(&data->kick_timer); + data->kick_timer_on = 0; + } + pool->kick_count = 0; + pool->kick_byte_count = 0; + + /* TODO: keep track of when kick is outstanding, and + * don't reuse 
until complete + */ + if (vnic_ib_post_send(&data->ib_conn, &data->free_bufs_io.io)) { + DATA_ERROR("failed to post send\n"); + viport_failure(data->parent); + } +} + +static void data_send_free_recv_buffers(struct data *data) +{ + struct recv_pool *pool = &data->recv_pool; + struct ib_send_wr *swr = &data->free_bufs_io.io.swr; + + int bufs_sent = 0; + u64 rdma_addr; + u32 offset; + u32 sz; + unsigned int num_to_send, next_increment; + + DATA_FUNCTION("data_send_free_recv_buffers()\n"); + + for (num_to_send = pool->sz_free_bundle; + num_to_send <= pool->num_free_bufs; + num_to_send += pool->sz_free_bundle) { + /* handle multiple bundles as one when possible. */ + next_increment = num_to_send + pool->sz_free_bundle; + if ((next_increment <= pool->num_free_bufs) + && (pool->next_free_buf + next_increment <= + pool->eioc_pool_sz)) + continue; + + offset = pool->next_free_buf * + sizeof(struct buff_pool_entry); + sz = num_to_send * sizeof(struct buff_pool_entry); + rdma_addr = pool->eioc_rdma_addr + offset; + swr->sg_list->length = sz; + swr->sg_list->addr = pool->buf_pool_dma + offset; + swr->wr.rdma.remote_addr = rdma_addr; + + if (vnic_ib_post_send(&data->ib_conn, + &data->free_bufs_io.io)) { + DATA_ERROR("failed to post send\n"); + viport_failure(data->parent); + return; + } + INC(pool->next_free_buf, num_to_send, pool->eioc_pool_sz); + pool->num_free_bufs -= num_to_send; + pool->num_posted_bufs += num_to_send; + bufs_sent = 1; + } + + if (bufs_sent) { + if (pool->kick_on_free) + data_send_kick_message(data); + } + if (pool->num_posted_bufs == 0) { + struct vnic *vnic = data->parent->vnic; + unsigned long flags; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (vnic->current_path == &vnic->primary_path) { + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + DATA_ERROR("%s: primary path: " + "unable to allocate receive buffers\n", + vnic->config->name); + } else { + if (vnic->current_path == &vnic->secondary_path) { + spin_unlock_irqrestore(&vnic->current_path_lock, + flags); + DATA_ERROR("%s: secondary path: " + "unable to allocate receive buffers\n", + vnic->config->name); + } else + spin_unlock_irqrestore(&vnic->current_path_lock, + flags); + } + data->ib_conn.state = IB_CONN_ERRORED; + viport_failure(data->parent); + } +} + +void data_connected(struct data *data) +{ + DATA_FUNCTION("data_connected()\n"); + data->free_bufs_io.io.swr.wr.rdma.rkey = + data->recv_pool.eioc_rdma_rkey; + data_alloc_buffers(data, 1); + data_send_free_recv_buffers(data); + data->connected = 1; +} + +void data_disconnect(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct recv_pool *recv_pool = &data->recv_pool; + unsigned int i; + + DATA_FUNCTION("data_disconnect()\n"); + + data->connected = 0; + if (data->kick_timer_on) { + del_timer_sync(&data->kick_timer); + data->kick_timer_on = 0; + } + + if (ib_send_cm_dreq(data->ib_conn.cm_id, NULL, 0)) + DATA_ERROR("data CM DREQ sending failed\n"); + data->ib_conn.state = IB_CONN_DISCONNECTED; + + completion_callback_cleanup(&data->ib_conn); + + for (i = 0; i < xmit_pool->num_xmit_bufs; i++) { + if (xmit_pool->xmit_bufs[i].skb) + dev_kfree_skb(xmit_pool->xmit_bufs[i].skb); + xmit_pool->xmit_bufs[i].skb = NULL; + + } + for (i = 0; i < recv_pool->pool_sz; i++) { + if (data->recv_pool.recv_bufs[i].skb) + dev_kfree_skb(recv_pool->recv_bufs[i].skb); + recv_pool->recv_bufs[i].skb = NULL; + } + vfree(data->local_storage); + if (data->region_data) { + ib_dma_unmap_single(data->parent->config->ibdev, + data->region_data_dma, 4, + 
DMA_BIDIRECTIONAL); + kfree(data->region_data); + } + + if (recv_pool->buf_pool) { + ib_dma_unmap_single(data->parent->config->ibdev, + recv_pool->buf_pool_dma, + recv_pool->buf_pool_len, DMA_TO_DEVICE); + kfree(recv_pool->buf_pool); + } + + if (xmit_pool->buf_pool) { + ib_dma_unmap_single(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_FROM_DEVICE); + kfree(xmit_pool->buf_pool); + } + + if (xmit_pool->xmit_data) { + ib_dma_unmap_single(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + kfree(xmit_pool->xmit_data); + } +} + +void data_cleanup(struct data *data) +{ + ib_destroy_cm_id(data->ib_conn.cm_id); + + /* Completion callback cleanup called again. + * This is to cleanup the threads in case there is an + * error before state LINK_DATACONNECT due to which + * data_disconnect is not called. + */ + completion_callback_cleanup(&data->ib_conn); + ib_destroy_qp(data->ib_conn.qp); + ib_destroy_cq(data->ib_conn.cq); + ib_dereg_mr(data->mr); + +} + +static int data_alloc_xmit_buffer(struct data *data, struct sk_buff *skb, + struct buff_pool_entry **pp_bpe, + struct rdma_io **pp_rdma_io, + int *last) +{ + struct xmit_pool *pool = &data->xmit_pool; + unsigned long flags; + int ret; + + DATA_FUNCTION("data_alloc_xmit_buffer()\n"); + + spin_lock_irqsave(&data->xmit_buf_lock, flags); + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + *last = 0; + *pp_rdma_io = &pool->xmit_bufs[pool->next_xmit_buf]; + *pp_bpe = &pool->buf_pool[pool->next_xmit_pool]; + + if ((*pp_bpe)->valid && pool->next_xmit_buf != + pool->last_comp_buf) { + INC(pool->next_xmit_buf, 1, pool->num_xmit_bufs); + INC(pool->next_xmit_pool, 1, pool->pool_sz); + if (!pool->buf_pool[pool->next_xmit_pool].valid) { + DATA_INFO("just used the last EIOU" + " receive buffer\n"); + *last = 1; + pool->need_buffers = 1; + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + data_kickreq_stats(data); + } else if (pool->next_xmit_buf == pool->last_comp_buf) { + DATA_INFO("just used our last xmit buffer\n"); + pool->need_buffers = 1; + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + } + (*pp_rdma_io)->skb = skb; + (*pp_bpe)->valid = 0; + ret = 0; + } else { + data_no_xmitbuf_stats(data); + DATA_ERROR("Out of xmit buffers\n"); + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + ret = -1; + } + + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, + pool->buf_pool_len, DMA_TO_DEVICE); + spin_unlock_irqrestore(&data->xmit_buf_lock, flags); + return ret; +} + +static void data_rdma_packet(struct data *data, struct buff_pool_entry *bpe, + struct rdma_io *rdma_io) +{ + struct ib_send_wr *swr; + struct sk_buff *skb; + dma_addr_t trailer_data_dma; + dma_addr_t skb_data_dma; + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct viport *viport = data->parent; + u8 *d; + int len; + int fill_len; + + DATA_FUNCTION("data_rdma_packet()\n"); + swr = &rdma_io->io.swr; + skb = rdma_io->skb; + len = ALIGN(rdma_io->len, VIPORT_TRAILER_ALIGNMENT); + fill_len = len - skb->len; + + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + + d = (u8 *) rdma_io->trailer - fill_len; + trailer_data_dma = rdma_io->trailer_dma - fill_len; + memset(d, 0, fill_len); + + swr->sg_list[0].length = skb->len; + if (skb->len <= min_xmt_skb) { + memcpy(rdma_io->data, skb->data, skb->len); + 
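+		/* Small frame: the payload was just copied into the
+		 * pre-mapped per-buffer copy area, so the gather list can
+		 * point there and the skb can be freed immediately.
+		 */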
swr->sg_list[0].lkey = data->mr->lkey; + swr->sg_list[0].addr = rdma_io->data_dma; + dev_kfree_skb_any(skb); + rdma_io->skb = NULL; + } else { + swr->sg_list[0].lkey = data->mr->lkey; + + skb_data_dma = ib_dma_map_single(viport->config->ibdev, + skb->data, skb->len, + DMA_TO_DEVICE); + + if (ib_dma_mapping_error(viport->config->ibdev, skb_data_dma)) { + DATA_ERROR("skb data dma map error\n"); + goto failure; + } + + rdma_io->skb_data_dma = skb_data_dma; + + swr->sg_list[0].addr = skb_data_dma; + skb_orphan(skb); + } + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_TO_DEVICE); + + swr->sg_list[1].addr = trailer_data_dma; + swr->sg_list[1].length = fill_len + sizeof(struct viport_trailer); + swr->sg_list[0].lkey = data->mr->lkey; + swr->wr.rdma.remote_addr = be64_to_cpu(bpe->remote_addr); + swr->wr.rdma.remote_addr += data->xmit_pool.buffer_sz; + swr->wr.rdma.remote_addr -= (sizeof(struct viport_trailer) + len); + swr->wr.rdma.rkey = be32_to_cpu(bpe->rkey); + + ib_dma_sync_single_for_device(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_TO_DEVICE); + + /* If VNIC_FEAT_RDMA_IMMED is supported then change the work request + * opcode to IB_WR_RDMA_WRITE_WITH_IMM + */ + + if (data->parent->features_supported & VNIC_FEAT_RDMA_IMMED) { + swr->ex.imm_data = 0; + swr->opcode = IB_WR_RDMA_WRITE_WITH_IMM; + } + + data->xmit_pool.notify_count++; + if (data->xmit_pool.notify_count >= data->xmit_pool.notify_bundle) { + data->xmit_pool.notify_count = 0; + swr->send_flags = IB_SEND_SIGNALED; + } else { + swr->send_flags = 0; + } + ib_dma_sync_single_for_device(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + if (vnic_ib_post_send(&data->ib_conn, &rdma_io->io)) { + DATA_ERROR("failed to post send for data RDMA write\n"); + viport_failure(data->parent); + goto failure; + } + + data_xmits_stats(data); +failure: + ib_dma_sync_single_for_device(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); +} + +static void data_kick_timeout_handler(unsigned long arg) +{ + struct data *data = (struct data *)arg; + + DATA_FUNCTION("data_kick_timeout_handler()\n"); + data->kick_timer_on = 0; + data_send_kick_message(data); +} + +int data_xmit_packet(struct data *data, struct sk_buff *skb) +{ + struct xmit_pool *pool = &data->xmit_pool; + struct rdma_io *rdma_io; + struct buff_pool_entry *bpe; + struct viport_trailer *trailer; + unsigned int sz = skb->len; + int last; + + DATA_FUNCTION("data_xmit_packet()\n"); + if (sz > pool->buffer_sz) { + DATA_ERROR("outbound packet too large, size = %d\n", sz); + return -1; + } + + if (data_alloc_xmit_buffer(data, skb, &bpe, &rdma_io, &last)) { + DATA_ERROR("error in allocating data xmit buffer\n"); + return -1; + } + + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->xmitdata_dma, pool->xmitdata_len, + DMA_TO_DEVICE); + trailer = rdma_io->trailer; + + memset(trailer, 0, sizeof *trailer); + memcpy(trailer->dest_mac_addr, skb->data, ETH_ALEN); + + if (skb->sk) + trailer->connection_hash_and_valid = 0x40 | + ((be16_to_cpu(inet_sk(skb->sk)->sport) + + be16_to_cpu(inet_sk(skb->sk)->dport)) & 0x3f); + + trailer->connection_hash_and_valid |= CHV_VALID; + + if ((sz > 16) && (*(__be16 *) (skb->data + 12) == + __constant_cpu_to_be16(ETH_P_8021Q))) { + trailer->vlan = *(__be16 *) (skb->data + 14); + memmove(skb->data + 4, skb->data, 12); + skb_pull(skb, 4); + sz -= 4; + 
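+		/* The 802.1Q tag has been stripped from the frame data and
+		 * saved in the trailer; PF_VLAN_INSERT lets the remote end
+		 * re-insert it on the wire.
+		 */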
trailer->pkt_flags |= PF_VLAN_INSERT; + } + if (last) + trailer->pkt_flags |= PF_KICK; + if (sz < ETH_ZLEN) { + /* EIOU requires all packets to be + * of ethernet minimum packet size. + */ + trailer->data_length = __constant_cpu_to_be16(ETH_ZLEN); + rdma_io->len = ETH_ZLEN; + } else { + trailer->data_length = cpu_to_be16(sz); + rdma_io->len = sz; + } + + if (skb->ip_summed == CHECKSUM_PARTIAL) { + trailer->tx_chksum_flags = TX_CHKSUM_FLAGS_CHECKSUM_V4 + | TX_CHKSUM_FLAGS_IP_CHECKSUM + | TX_CHKSUM_FLAGS_TCP_CHECKSUM + | TX_CHKSUM_FLAGS_UDP_CHECKSUM; + } + + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->xmitdata_dma, pool->xmitdata_len, + DMA_TO_DEVICE); + data_rdma_packet(data, bpe, rdma_io); + + if (pool->send_kicks) { + /* EIOC needs kicks to inform it of sent packets */ + pool->kick_count++; + pool->kick_byte_count += sz; + if ((pool->kick_count >= pool->kick_bundle) + || (pool->kick_byte_count >= pool->kick_byte_bundle)) { + data_send_kick_message(data); + } else if (pool->kick_count == 1) { + init_timer(&data->kick_timer); + /* timeout_before_kick is in usec */ + data->kick_timer.expires = + msecs_to_jiffies(be32_to_cpu(data-> + eioc_pool_parms.timeout_before_kick) * 1000) + + jiffies; + data->kick_timer.data = (unsigned long)data; + data->kick_timer.function = data_kick_timeout_handler; + add_timer(&data->kick_timer); + data->kick_timer_on = 1; + } + } + return 0; +} + +static void data_check_xmit_buffers(struct data *data) +{ + struct xmit_pool *pool = &data->xmit_pool; + unsigned long flags; + + DATA_FUNCTION("data_check_xmit_buffers()\n"); + spin_lock_irqsave(&data->xmit_buf_lock, flags); + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + if (data->xmit_pool.need_buffers + && pool->buf_pool[pool->next_xmit_pool].valid + && pool->next_xmit_buf != pool->last_comp_buf) { + data->xmit_pool.need_buffers = 0; + vnic_restart_xmit(data->parent->vnic, + data->parent->parent); + DATA_INFO("there are free xmit buffers\n"); + } + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + spin_unlock_irqrestore(&data->xmit_buf_lock, flags); +} + +static struct sk_buff *data_recv_to_skbuff(struct data *data, + struct rdma_dest *rdma_dest) +{ + struct viport_trailer *trailer; + struct sk_buff *skb = NULL; + int start; + unsigned int len; + u8 rx_chksum_flags; + + DATA_FUNCTION("data_recv_to_skbuff()\n"); + trailer = rdma_dest->trailer; + start = data_offset(data, trailer); + len = data_len(data, trailer); + + if (len <= min_rcv_skb) + skb = dev_alloc_skb(len + VLAN_HLEN + 2); + /* leave room for VLAN header and alignment */ + if (skb) { + skb_reserve(skb, VLAN_HLEN + 2); + memcpy(skb->data, rdma_dest->data + start, len); + skb_put(skb, len); + } else { + skb = rdma_dest->skb; + rdma_dest->skb = NULL; + rdma_dest->trailer = NULL; + rdma_dest->data = NULL; + skb_pull(skb, start); + skb_trim(skb, len); + } + + rx_chksum_flags = trailer->rx_chksum_flags; + DATA_INFO("rx_chksum_flags = %d, LOOP = %c, IP = %c," + " TCP = %c, UDP = %c\n", + rx_chksum_flags, + (rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) ? 'Y' : 'N', + (rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_FAILED) ? 'N' : + '-', + (rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_FAILED) ? 
'N' : + '-', + (rx_chksum_flags & RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_UDP_CHECKSUM_FAILED) ? 'N' : + '-'); + + if ((rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) + || ((rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) + && ((rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) + || (rx_chksum_flags & + RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED)))) + skb->ip_summed = CHECKSUM_UNNECESSARY; + else + skb->ip_summed = CHECKSUM_NONE; + + if ((trailer->pkt_flags & PF_VLAN_INSERT) && + !(data->parent->features_supported & VNIC_FEAT_IGNORE_VLAN)) { + u8 *rv; + + rv = skb_push(skb, 4); + memmove(rv, rv + 4, 12); + *(__be16 *) (rv + 12) = __constant_cpu_to_be16(ETH_P_8021Q); + if (trailer->pkt_flags & PF_PVID_OVERRIDDEN) + *(__be16 *) (rv + 14) = trailer->vlan & + __constant_cpu_to_be16(0xF000); + else + *(__be16 *) (rv + 14) = trailer->vlan; + } + + return skb; +} + +static int data_incoming_recv(struct data *data) +{ + struct recv_pool *pool = &data->recv_pool; + struct rdma_dest *rdma_dest; + struct viport_trailer *trailer; + struct buff_pool_entry *bpe; + struct sk_buff *skb; + dma_addr_t vaddr_dma; + + DATA_FUNCTION("data_incoming_recv()\n"); + if (pool->next_full_buf == pool->next_free_buf) + return -1; + bpe = &pool->buf_pool[pool->next_full_buf]; + vaddr_dma = be64_to_cpu(bpe->remote_addr); + rdma_dest = &pool->recv_bufs[bpe->valid - 1]; + trailer = rdma_dest->trailer; + + if (!trailer + || !(trailer->connection_hash_and_valid & CHV_VALID)) + return -1; + + /* received a packet */ + if (trailer->pkt_flags & PF_KICK) + pool->kick_on_free = 1; + + skb = data_recv_to_skbuff(data, rdma_dest); + + if (skb) { + vnic_recv_packet(data->parent->vnic, + data->parent->parent, skb); + list_add(&rdma_dest->list_ptrs, &pool->avail_recv_bufs); + } + + ib_dma_unmap_single(data->parent->config->ibdev, + vaddr_dma, pool->buffer_sz, + DMA_FROM_DEVICE); + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + bpe->valid = 0; + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + INC(pool->next_full_buf, 1, pool->eioc_pool_sz); + pool->num_posted_bufs--; + data_recvs_stats(data); + return 0; +} + +static void data_received_kick(struct io *io) +{ + struct data *data = &io->viport->data; + unsigned long flags; + + DATA_FUNCTION("data_received_kick()\n"); + data_note_kickrcv_time(); + spin_lock_irqsave(&data->recv_ios_lock, flags); + list_add(&io->list_ptrs, &data->recv_ios); + spin_unlock_irqrestore(&data->recv_ios_lock, flags); + data_post_recvs(data); + data_rcvkicks_stats(data); + data_check_xmit_buffers(data); + + while (!data_incoming_recv(data)); + + if (data->connected) { + data_alloc_buffers(data, 0); + data_send_free_recv_buffers(data); + } +} + +static void data_xmit_complete(struct io *io) +{ + struct rdma_io *rdma_io = (struct rdma_io *)io; + struct data *data = &io->viport->data; + struct xmit_pool *pool = &data->xmit_pool; + struct sk_buff *skb; + + DATA_FUNCTION("data_xmit_complete()\n"); + + if (rdma_io->skb) + ib_dma_unmap_single(data->parent->config->ibdev, + rdma_io->skb_data_dma, rdma_io->skb->len, + DMA_TO_DEVICE); + + while (pool->last_comp_buf != rdma_io->index) { + INC(pool->last_comp_buf, 1, pool->num_xmit_bufs); + skb = pool->xmit_bufs[pool->last_comp_buf].skb; + if (skb) + dev_kfree_skb_any(skb); + pool->xmit_bufs[pool->last_comp_buf].skb = NULL; + } + + data_check_xmit_buffers(data); +} + +static 
int mc_data_alloc_skb(struct ud_recv_io *recv_io, u32 len, + int initial_allocation) +{ + struct sk_buff *skb; + struct mc_data *mc_data = &recv_io->io.viport->mc_data; + + DATA_FUNCTION("mc_data_alloc_skb\n"); + if (initial_allocation) + skb = alloc_skb(len, GFP_KERNEL); + else + skb = alloc_skb(len, GFP_ATOMIC); + if (!skb) { + DATA_ERROR("failed to alloc MULTICAST skb\n"); + return -1; + } + skb_put(skb, len); + recv_io->skb = skb; + + recv_io->skb_data_dma = ib_dma_map_single( + recv_io->io.viport->config->ibdev, + skb->data, skb->len, + DMA_FROM_DEVICE); + + if (ib_dma_mapping_error(recv_io->io.viport->config->ibdev, + recv_io->skb_data_dma)) { + DATA_ERROR("skb data dma map error\n"); + dev_kfree_skb(skb); + return -1; + } + + recv_io->list[0].addr = recv_io->skb_data_dma; + recv_io->list[0].length = sizeof(struct ib_grh); + recv_io->list[0].lkey = mc_data->mr->lkey; + + recv_io->list[1].addr = recv_io->skb_data_dma + sizeof(struct ib_grh); + recv_io->list[1].length = len - sizeof(struct ib_grh); + recv_io->list[1].lkey = mc_data->mr->lkey; + + recv_io->io.rwr.wr_id = (u64)&recv_io->io; + recv_io->io.rwr.sg_list = recv_io->list; + recv_io->io.rwr.num_sge = 2; + recv_io->io.rwr.next = NULL; + + return 0; +} + +static int mc_data_alloc_buffers(struct mc_data *mc_data) +{ + unsigned int i, num; + struct ud_recv_io *bufs = NULL, *recv_io; + + DATA_FUNCTION("mc_data_alloc_buffers\n"); + if (!mc_data->skb_len) { + unsigned int len; + /* align multicast msg buffer on viport_trailer boundary */ + len = (MCAST_MSG_SIZE + VIPORT_TRAILER_ALIGNMENT - 1) & + (~((unsigned int)VIPORT_TRAILER_ALIGNMENT - 1)); + /* + * Add size of grh and trailer - + * note, we don't need a + 4 for vlan because we have room in + * netbuf for grh & trailer and we'll strip them both, so there + * will be room enough to handle the 4 byte insertion for vlan. 
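+ * (Worked example, assuming the usual 40-byte GRH alongside the
+ * 32-byte viport_trailer: MCAST_MSG_SIZE is 2048 - 40 - 32 = 1976,
+ * which rounds up to 1984, so the final skb_len comes out at 2056.)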
+ */ + len += sizeof(struct ib_grh) + + sizeof(struct viport_trailer); + mc_data->skb_len = len; + DATA_INFO("mc_data->skb_len %d (sizes:%d %d)\n", + len, (int)sizeof(struct ib_grh), + (int)sizeof(struct viport_trailer)); + } + mc_data->recv_len = sizeof(struct ud_recv_io) * mc_data->num_recvs; + bufs = kmalloc(mc_data->recv_len, GFP_KERNEL); + if (!bufs) { + DATA_ERROR("failed to allocate MULTICAST buffers size:%d\n", + mc_data->recv_len); + return -1; + } + DATA_INFO("allocated num_recvs:%d recv_len:%d \n", + mc_data->num_recvs, mc_data->recv_len); + for (num = 0; num < mc_data->num_recvs; num++) { + recv_io = &bufs[num]; + recv_io->len = mc_data->skb_len; + recv_io->io.type = RECV_UD; + recv_io->io.viport = mc_data->parent; + recv_io->io.routine = mc_data_recv_routine; + + if (mc_data_alloc_skb(recv_io, mc_data->skb_len, 1)) { + for (i = 0; i < num; i++) { + recv_io = &bufs[i]; + ib_dma_unmap_single(recv_io->io.viport->config->ibdev, + recv_io->skb_data_dma, + recv_io->skb->len, + DMA_FROM_DEVICE); + dev_kfree_skb(recv_io->skb); + } + kfree(bufs); + return -1; + } + list_add_tail(&recv_io->io.list_ptrs, + &mc_data->avail_recv_ios_list); + } + mc_data->recv_ios = bufs; + return 0; +} + +void vnic_mc_data_cleanup(struct mc_data *mc_data) +{ + unsigned int num; + + DATA_FUNCTION("vnic_mc_data_cleanup()\n"); + completion_callback_cleanup(&mc_data->ib_conn); + if (!IS_ERR(mc_data->ib_conn.qp)) { + ib_destroy_qp(mc_data->ib_conn.qp); + mc_data->ib_conn.qp = (struct ib_qp *)ERR_PTR(-EINVAL); + } + if (!IS_ERR(mc_data->ib_conn.cq)) { + ib_destroy_cq(mc_data->ib_conn.cq); + mc_data->ib_conn.cq = (struct ib_cq *)ERR_PTR(-EINVAL); + } + if (mc_data->recv_ios) { + for (num = 0; num < mc_data->num_recvs; num++) { + if (mc_data->recv_ios[num].skb) + dev_kfree_skb(mc_data->recv_ios[num].skb); + mc_data->recv_ios[num].skb = NULL; + } + kfree(mc_data->recv_ios); + mc_data->recv_ios = (struct ud_recv_io *)NULL; + } + if (mc_data->mr) { + ib_dereg_mr(mc_data->mr); + mc_data->mr = (struct ib_mr *)NULL; + } + DATA_FUNCTION("vnic_mc_data_cleanup done\n"); + +} + +int mc_data_init(struct mc_data *mc_data, struct viport *viport, + struct data_config *config, struct ib_pd *pd) +{ + DATA_FUNCTION("mc_data_init()\n"); + + mc_data->num_recvs = viport->data.config->num_recvs; + + INIT_LIST_HEAD(&mc_data->avail_recv_ios_list); + spin_lock_init(&mc_data->recv_lock); + + mc_data->parent = viport; + mc_data->config = config; + + mc_data->ib_conn.cm_id = NULL; + mc_data->ib_conn.viport = viport; + mc_data->ib_conn.ib_config = &config->ib_config; + mc_data->ib_conn.state = IB_CONN_UNINITTED; + mc_data->ib_conn.callback_thread = NULL; + mc_data->ib_conn.callback_thread_end = 0; + + if (vnic_ib_mc_init(mc_data, viport, pd, + &config->ib_config)) { + DATA_ERROR("vnic_ib_mc_init failed\n"); + goto failure; + } + mc_data->mr = ib_get_dma_mr(pd, + IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE); + if (IS_ERR(mc_data->mr)) { + DATA_ERROR("failed to register memory for" + " mc_data connection\n"); + goto destroy_conn; + } + + if (mc_data_alloc_buffers(mc_data)) + goto dereg_mr; + + mc_data_post_recvs(mc_data); + if (vnic_ib_mc_mod_qp_to_rts(mc_data->ib_conn.qp)) + goto dereg_mr; + + return 0; + +dereg_mr: + ib_dereg_mr(mc_data->mr); + mc_data->mr = (struct ib_mr *)NULL; +destroy_conn: + completion_callback_cleanup(&mc_data->ib_conn); + ib_destroy_qp(mc_data->ib_conn.qp); + mc_data->ib_conn.qp = (struct ib_qp *)ERR_PTR(-EINVAL); + ib_destroy_cq(mc_data->ib_conn.cq); + mc_data->ib_conn.cq = (struct ib_cq *)ERR_PTR(-EINVAL); 
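+ /* The qp/cq handles are parked on ERR_PTR(-EINVAL) here, as in
+ * vnic_mc_data_cleanup(), so that later cleanup can use IS_ERR()
+ * to tell already-destroyed handles from live ones.
+ */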
+failure: + return -1; +} + +static void mc_data_post_recvs(struct mc_data *mc_data) +{ + unsigned long flags; + int i = 0; + DATA_FUNCTION("mc_data_post_recvs\n"); + spin_lock_irqsave(&mc_data->recv_lock, flags); + while (!list_empty(&mc_data->avail_recv_ios_list)) { + struct io *io = list_entry(mc_data->avail_recv_ios_list.next, + struct io, list_ptrs); + struct ud_recv_io *recv_io = + container_of(io, struct ud_recv_io, io); + list_del(&recv_io->io.list_ptrs); + spin_unlock_irqrestore(&mc_data->recv_lock, flags); + if (vnic_ib_mc_post_recv(mc_data, &recv_io->io)) { + viport_failure(mc_data->parent); + return; + } + spin_lock_irqsave(&mc_data->recv_lock, flags); + i++; + } + DATA_INFO("mcdata posted %d %p\n", i, &mc_data->avail_recv_ios_list); + spin_unlock_irqrestore(&mc_data->recv_lock, flags); +} + +static void mc_data_recv_routine(struct io *io) +{ + struct sk_buff *skb; + struct ib_grh *grh; + struct viport_trailer *trailer; + struct mc_data *mc_data; + unsigned long flags; + struct ud_recv_io *recv_io = container_of(io, struct ud_recv_io, io); + union ib_gid_cpu sgid; + + DATA_FUNCTION("mc_data_recv_routine\n"); + skb = recv_io->skb; + grh = (struct ib_grh *)skb->data; + mc_data = &recv_io->io.viport->mc_data; + + ib_dma_unmap_single(recv_io->io.viport->config->ibdev, + recv_io->skb_data_dma, recv_io->skb->len, + DMA_FROM_DEVICE); + + /* first - check if we've got our own mc packet */ + /* convert sgid from host to cpu form before comparing */ + bswap_ib_gid(&grh->sgid, &sgid); + if (cpu_to_be64(sgid.global.interface_id) == + io->viport->config->path_info.path.sgid.global.interface_id) { + DATA_ERROR("dropping - our mc packet\n"); + dev_kfree_skb(skb); + } else { + /* GRH is at head and trailer at end. Remove GRH from head. */ + trailer = (struct viport_trailer *) + (skb->data + recv_io->len - + sizeof(struct viport_trailer)); + skb_pull(skb, sizeof(struct ib_grh)); + if (trailer->connection_hash_and_valid & CHV_VALID) { + mc_data_recv_to_skbuff(io->viport, skb, trailer); + vnic_recv_packet(io->viport->vnic, io->viport->parent, + skb); + vnic_multicast_recv_pkt_stats(io->viport->vnic); + } else { + DATA_ERROR("dropping - no CHV_VALID in HashAndValid\n"); + dev_kfree_skb(skb); + } + } + recv_io->skb = NULL; + if (mc_data_alloc_skb(recv_io, mc_data->skb_len, 0)) + return; + + spin_lock_irqsave(&mc_data->recv_lock, flags); + list_add_tail(&recv_io->io.list_ptrs, &mc_data->avail_recv_ios_list); + spin_unlock_irqrestore(&mc_data->recv_lock, flags); + mc_data_post_recvs(mc_data); + return; +} + +static void mc_data_recv_to_skbuff(struct viport *viport, struct sk_buff *skb, + struct viport_trailer *trailer) +{ + u8 rx_chksum_flags = trailer->rx_chksum_flags; + + /* drop alignment bytes at start */ + skb_pull(skb, trailer->data_alignment_offset); + /* drop excess from end */ + skb_trim(skb, __be16_to_cpu(trailer->data_length)); + + if ((rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) + || ((rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) + && ((rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) + || (rx_chksum_flags & + RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED)))) + skb->ip_summed = CHECKSUM_UNNECESSARY; + else + skb->ip_summed = CHECKSUM_NONE; + + if ((trailer->pkt_flags & PF_VLAN_INSERT) && + !(viport->features_supported & VNIC_FEAT_IGNORE_VLAN)) { + u8 *rv; + + /* insert VLAN id between source & length */ + DATA_INFO("VLAN adjustment\n"); + rv = skb_push(skb, 4); + memmove(rv, rv + 4, 12); + *(__be16 *) (rv + 12) = __constant_cpu_to_be16(ETH_P_8021Q); + if (trailer->pkt_flags 
& PF_PVID_OVERRIDDEN) + /* + * Indicates VLAN is 0 but we keep the protocol id. + */ + *(__be16 *) (rv + 14) = trailer->vlan & + __constant_cpu_to_be16(0xF000); + else + *(__be16 *) (rv + 14) = trailer->vlan; + DATA_INFO("vlan:%x\n", *(int *)(rv+14)); + } + + return; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_data.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.h new file mode 100644 index 0000000..866b9ee --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.h @@ -0,0 +1,206 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_DATA_H_INCLUDED +#define VNIC_DATA_H_INCLUDED + +#include + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS +#include +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ + +#include "vnic_ib.h" +#include "vnic_control_pkt.h" +#include "vnic_trailer.h" + +struct rdma_dest { + struct list_head list_ptrs; + struct sk_buff *skb; + u8 *data; + struct viport_trailer *trailer __attribute__((aligned(32))); +}; + +struct buff_pool_entry { + __be64 remote_addr; + __be32 rkey; + u32 valid; +}; + +struct recv_pool { + u32 buffer_sz; + u32 pool_sz; + u32 eioc_pool_sz; + u32 eioc_rdma_rkey; + u64 eioc_rdma_addr; + u32 next_full_buf; + u32 next_free_buf; + u32 num_free_bufs; + u32 num_posted_bufs; + u32 sz_free_bundle; + int kick_on_free; + struct buff_pool_entry *buf_pool; + dma_addr_t buf_pool_dma; + int buf_pool_len; + struct rdma_dest *recv_bufs; + struct list_head avail_recv_bufs; +}; + +struct xmit_pool { + u32 buffer_sz; + u32 pool_sz; + u32 notify_count; + u32 notify_bundle; + u32 next_xmit_buf; + u32 last_comp_buf; + u32 num_xmit_bufs; + u32 next_xmit_pool; + u32 kick_count; + u32 kick_byte_count; + u32 kick_bundle; + u32 kick_byte_bundle; + int need_buffers; + int send_kicks; + uint32_t rdma_rkey; + u64 rdma_addr; + struct buff_pool_entry *buf_pool; + dma_addr_t buf_pool_dma; + int buf_pool_len; + struct rdma_io *xmit_bufs; + u8 *xmit_data; + dma_addr_t xmitdata_dma; + int xmitdata_len; +}; + +struct data { + struct viport *parent; + struct data_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + u8 *local_storage; + struct vnic_recv_pool_config host_pool_parms; + struct vnic_recv_pool_config eioc_pool_parms; + struct recv_pool recv_pool; + struct xmit_pool xmit_pool; + u8 *region_data; + dma_addr_t region_data_dma; + struct rdma_io free_bufs_io; + struct send_io kick_io; + struct list_head recv_ios; + spinlock_t recv_ios_lock; + spinlock_t xmit_buf_lock; + int kick_timer_on; + int connected; + u16 max_mtu; + struct timer_list kick_timer; + struct completion done; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + u32 xmit_num; + u32 recv_num; + u32 free_buf_sends; + u32 free_buf_num; + u32 free_buf_min; + u32 kick_recvs; + u32 kick_reqs; + u32 no_xmit_bufs; + cycles_t no_xmit_buf_time; + } statistics; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +struct mc_data { + struct viport *parent; + struct data_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + + u32 num_recvs; + u32 skb_len; + spinlock_t recv_lock; + int recv_len; + struct ud_recv_io *recv_ios; + struct list_head avail_recv_ios_list; +}; + +int data_init(struct data *data, struct viport *viport, + struct data_config *config, struct ib_pd *pd); + +int data_connect(struct data *data); +void data_connected(struct data *data); +void data_disconnect(struct data *data); + +int data_xmit_packet(struct data *data, struct sk_buff *skb); + +void data_cleanup(struct data *data); + +#define data_is_connected(data) \ + (vnic_ib_conn_connected(&((data)->ib_conn))) +#define data_path_id(data) (data)->config->path_id +#define data_eioc_pool(data) &(data)->eioc_pool_parms +#define data_host_pool(data) &(data)->host_pool_parms +#define data_eioc_pool_min(data) &(data)->config->eioc_min +#define data_host_pool_min(data) &(data)->config->host_min +#define data_eioc_pool_max(data) &(data)->config->eioc_max +#define data_host_pool_max(data) &(data)->config->host_max +#define data_local_pool_addr(data) (data)->xmit_pool.rdma_addr +#define data_local_pool_rkey(data) (data)->xmit_pool.rdma_rkey 
+#define data_remote_pool_addr(data) &(data)->recv_pool.eioc_rdma_addr +#define data_remote_pool_rkey(data) &(data)->recv_pool.eioc_rdma_rkey + +#define data_max_mtu(data) (data)->max_mtu + + +#define data_len(data, trailer) be16_to_cpu(trailer->data_length) +#define data_offset(data, trailer) \ + ((data)->recv_pool.buffer_sz - sizeof(struct viport_trailer) \ + - ALIGN(data_len((data), (trailer)), VIPORT_TRAILER_ALIGNMENT) \ + + (trailer->data_alignment_offset)) + +/* the following macros manipulate ring buffer indexes. + * the ring buffer size must be a power of 2. + */ +#define ADD(index, increment, size) (((index) + (increment))&((size) - 1)) +#define NEXT(index, size) ADD(index, 1, size) +#define INC(index, increment, size) (index) = ADD(index, increment, size) + +/* this is max multicast msg embedded will send */ +#define MCAST_MSG_SIZE \ + (2048 - sizeof(struct ib_grh) - sizeof(struct viport_trailer)) + +int mc_data_init(struct mc_data *mc_data, struct viport *viport, + struct data_config *config, + struct ib_pd *pd); + +void vnic_mc_data_cleanup(struct mc_data *mc_data); + +#endif /* VNIC_DATA_H_INCLUDED */ diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h new file mode 100644 index 0000000..dd8a073 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h @@ -0,0 +1,103 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_TRAILER_H_INCLUDED +#define VNIC_TRAILER_H_INCLUDED + +/* pkt_flags values */ +enum { + PF_CHASH_VALID = 0x01, + PF_IPSEC_VALID = 0x02, + PF_TCP_SEGMENT = 0x04, + PF_KICK = 0x08, + PF_VLAN_INSERT = 0x10, + PF_PVID_OVERRIDDEN = 0x20, + PF_FCS_INCLUDED = 0x40, + PF_FORCE_ROUTE = 0x80 +}; + +/* tx_chksum_flags values */ +enum { + TX_CHKSUM_FLAGS_CHECKSUM_V4 = 0x01, + TX_CHKSUM_FLAGS_CHECKSUM_V6 = 0x02, + TX_CHKSUM_FLAGS_TCP_CHECKSUM = 0x04, + TX_CHKSUM_FLAGS_UDP_CHECKSUM = 0x08, + TX_CHKSUM_FLAGS_IP_CHECKSUM = 0x10 +}; + +/* rx_chksum_flags values */ +enum { + RX_CHKSUM_FLAGS_TCP_CHECKSUM_FAILED = 0x01, + RX_CHKSUM_FLAGS_UDP_CHECKSUM_FAILED = 0x02, + RX_CHKSUM_FLAGS_IP_CHECKSUM_FAILED = 0x04, + RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED = 0x08, + RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED = 0x10, + RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED = 0x20, + RX_CHKSUM_FLAGS_LOOPBACK = 0x40, + RX_CHKSUM_FLAGS_RESERVED = 0x80 +}; + +/* connection_hash_and_valid values */ +enum { + CHV_VALID = 0x80, + CHV_HASH_MASH = 0x7f +}; + +struct viport_trailer { + s8 data_alignment_offset; + u8 rndis_header_length; /* reserved for use by edp */ + __be16 data_length; + u8 pkt_flags; + u8 tx_chksum_flags; + u8 rx_chksum_flags; + u8 ip_sec_flags; + u32 tcp_seq_no; + u32 ip_sec_offload_handle; + u32 ip_sec_next_offload_handle; + u8 dest_mac_addr[6]; + __be16 vlan; + u16 time_stamp; + u8 origin; + u8 connection_hash_and_valid; +}; + +#define VIPORT_TRAILER_ALIGNMENT 32 + +#define BUFFER_SIZE(len) \ + (sizeof(struct viport_trailer) + \ + ALIGN((len), VIPORT_TRAILER_ALIGNMENT)) + +#define MAX_PAYLOAD(len) \ + ALIGN_DOWN((len) - sizeof(struct viport_trailer), \ + VIPORT_TRAILER_ALIGNMENT) + +#endif /* VNIC_TRAILER_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:34:29 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:04:29 +0530 Subject: [ofa-general] [PATCH v2 06/13] QLogic VNIC: IB core stack interaction In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103428.12355.53123.stgit@localhost.localdomain> From: Ramachandra K The patch implements the interaction of the QLogic VNIC driver with the underlying core infiniband stack. Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c | 1043 ++++++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h | 206 ++++++ 2 files changed, 1249 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c new file mode 100644 index 0000000..c43e69e --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c @@ -0,0 +1,1043 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_data.h" +#include "vnic_config.h" +#include "vnic_ib.h" +#include "vnic_viport.h" +#include "vnic_sys.h" +#include "vnic_main.h" +#include "vnic_stats.h" + +static int vnic_ib_inited; +static void vnic_add_one(struct ib_device *device); +static void vnic_remove_one(struct ib_device *device); +static int vnic_defer_completion(void *ptr); + +static int vnic_ib_mc_init_qp(struct mc_data *mc_data, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config *viport_config); + +static struct ib_client vnic_client = { + .name = "vnic", + .add = vnic_add_one, + .remove = vnic_remove_one +}; + +struct ib_sa_client vnic_sa_client; + +int vnic_ib_init(void) +{ + int ret = -1; + + IB_FUNCTION("vnic_ib_init()\n"); + + /* class has to be registered before + * calling ib_register_client() because, that call + * will trigger vnic_add_port() which will register + * class_device for the port with the parent class + * as vnic_class + */ + ret = class_register(&vnic_class); + if (ret) { + printk(KERN_ERR PFX "couldn't register class" + " infiniband_qlgc_vnic; error %d", ret); + goto out; + } + + ib_sa_register_client(&vnic_sa_client); + ret = ib_register_client(&vnic_client); + if (ret) { + printk(KERN_ERR PFX "couldn't register IB client;" + " error %d", ret); + goto err_ib_reg; + } + + interface_dev.dev.class = &vnic_class; + interface_dev.dev.release = vnic_release_dev; + snprintf(interface_dev.dev.bus_id, + BUS_ID_SIZE, "interfaces"); + init_completion(&interface_dev.released); + ret = device_register(&interface_dev.dev); + if (ret) { + printk(KERN_ERR PFX "couldn't register class interfaces;" + " error %d", ret); + goto err_class_dev; + } + ret = device_create_file(&interface_dev.dev, + &dev_attr_delete_vnic); + if (ret) { + printk(KERN_ERR PFX "couldn't create class file" + " 'delete_vnic'; error %d", ret); + goto err_class_file; + } + + vnic_ib_inited = 1; + + return ret; +err_class_file: + device_unregister(&interface_dev.dev); +err_class_dev: + ib_unregister_client(&vnic_client); +err_ib_reg: + ib_sa_unregister_client(&vnic_sa_client); + class_unregister(&vnic_class); +out: + return ret; +} + +static struct vnic_ib_port *vnic_add_port(struct vnic_ib_device *device, + u8 
port_num) +{ + struct vnic_ib_port *port; + + port = kzalloc(sizeof *port, GFP_KERNEL); + if (!port) + return NULL; + + init_completion(&port->pdev_info.released); + port->dev = device; + port->port_num = port_num; + + port->pdev_info.dev.class = &vnic_class; + port->pdev_info.dev.parent = NULL; + port->pdev_info.dev.release = vnic_release_dev; + snprintf(port->pdev_info.dev.bus_id, BUS_ID_SIZE, + "vnic-%s-%d", device->dev->name, port_num); + + if (device_register(&port->pdev_info.dev)) + goto free_port; + + if (device_create_file(&port->pdev_info.dev, + &dev_attr_create_primary)) + goto err_class; + if (device_create_file(&port->pdev_info.dev, + &dev_attr_create_secondary)) + goto err_class; + + return port; +err_class: + device_unregister(&port->pdev_info.dev); +free_port: + kfree(port); + + return NULL; +} + +static void vnic_add_one(struct ib_device *device) +{ + struct vnic_ib_device *vnic_dev; + struct vnic_ib_port *port; + int s, e, p; + + vnic_dev = kmalloc(sizeof *vnic_dev, GFP_KERNEL); + if (!vnic_dev) + return; + + vnic_dev->dev = device; + INIT_LIST_HEAD(&vnic_dev->port_list); + + if (device->node_type == RDMA_NODE_IB_SWITCH) { + s = 0; + e = 0; + + } else { + s = 1; + e = device->phys_port_cnt; + + } + + for (p = s; p <= e; p++) { + port = vnic_add_port(vnic_dev, p); + if (port) + list_add_tail(&port->list, &vnic_dev->port_list); + } + + ib_set_client_data(device, &vnic_client, vnic_dev); + +} + +static void vnic_remove_one(struct ib_device *device) +{ + struct vnic_ib_device *vnic_dev; + struct vnic_ib_port *port, *tmp_port; + + vnic_dev = ib_get_client_data(device, &vnic_client); + list_for_each_entry_safe(port, tmp_port, + &vnic_dev->port_list, list) { + device_unregister(&port->pdev_info.dev); + /* + * wait for sysfs entries to go away, so that no new vnics + * are created + */ + wait_for_completion(&port->pdev_info.released); + kfree(port); + + } + kfree(vnic_dev); + + /* TODO Only those vnic interfaces associated with + * the HCA whose remove event is called should be freed + * Currently all the vnic interfaces are freed + */ + + while (!list_empty(&vnic_list)) { + struct vnic *vnic = + list_entry(vnic_list.next, struct vnic, list_ptrs); + vnic_free(vnic); + } + + vnic_npevent_cleanup(); + viport_cleanup(); + +} + +void vnic_ib_cleanup(void) +{ + IB_FUNCTION("vnic_ib_cleanup()\n"); + + if (!vnic_ib_inited) + return; + + device_unregister(&interface_dev.dev); + wait_for_completion(&interface_dev.released); + + ib_unregister_client(&vnic_client); + ib_sa_unregister_client(&vnic_sa_client); + class_unregister(&vnic_class); +} + +static void vnic_path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *context) +{ + struct vnic_ib_path_info *p = context; + p->status = status; + if (!status) + p->path = *pathrec; + + complete(&p->done); +} + +int vnic_ib_get_path(struct netpath *netpath, struct vnic *vnic) +{ + struct viport_config *config = netpath->viport->config; + int ret = 0; + + init_completion(&config->path_info.done); + IB_INFO("Using SA path rec get time out value of %d\n", + config->sa_path_rec_get_timeout); + config->path_info.path_query_id = + ib_sa_path_rec_get(&vnic_sa_client, + config->ibdev, + config->port, + &config->path_info.path, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + config->sa_path_rec_get_timeout, + GFP_KERNEL, + vnic_path_rec_completion, + &config->path_info, + &config->path_info.path_query); + + if (config->path_info.path_query_id < 0) { + IB_ERROR("SA path record query 
failed; error %d\n", + config->path_info.path_query_id); + ret = config->path_info.path_query_id; + goto out; + } + + wait_for_completion(&config->path_info.done); + + if (config->path_info.status < 0) { + printk(KERN_WARNING PFX "connection not available to dgid " + "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x", + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[0]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[2]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[4]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[6]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[8]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[10]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[12]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[14])); + + if (config->path_info.status == -ETIMEDOUT) + printk(KERN_INFO " path query timed out\n"); + else if (config->path_info.status == -EIO) + printk(KERN_INFO " path query sending error\n"); + else + printk(KERN_INFO " error %d\n", + config->path_info.status); + + ret = config->path_info.status; + } +out: + if (ret) + netpath_timer(netpath, vnic->config->no_path_timeout); + + return ret; +} + +static inline void vnic_ib_handle_completions(struct ib_wc *wc, + struct vnic_ib_conn *ib_conn, + u32 *comp_num, + cycles_t *comp_time) +{ + struct io *io; + + io = (struct io *)(wc->wr_id); + vnic_ib_comp_stats(ib_conn, comp_num); + if (wc->status) { + IB_INFO("completion error wc.status %d" + " wc.opcode %d vendor err 0x%x\n", + wc->status, wc->opcode, wc->vendor_err); + } else if (io) { + vnic_ib_io_stats(io, ib_conn, *comp_time); + if (io->type == RECV_UD) { + struct ud_recv_io *recv_io = + container_of(io, struct ud_recv_io, io); + recv_io->len = wc->byte_len; + } + if (io->routine) + (*io->routine) (io); + } +} + +static void ib_qp_event(struct ib_event *event, void *context) +{ + IB_ERROR("QP event %d\n", event->event); +} + +static void vnic_ib_completion(struct ib_cq *cq, void *ptr) +{ + struct vnic_ib_conn *ib_conn = ptr; + unsigned long flags; + int compl_received; + struct ib_wc wc; + cycles_t comp_time; + u32 comp_num = 0; + + /* for multicast, cm_id is NULL, so skip that test */ + if (ib_conn->cm_id && + (ib_conn->state != IB_CONN_CONNECTED)) + return; + + /* Check if completion processing is taking place in thread + * If not then process completions in this handler, + * else set compl_received if not set, to indicate that + * there are more completions to process in thread. 
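+ * Concretely: once completion_limit completions have been handled
+ * in this callback, in_thread is set and the CQ is handed off to the
+ * callback thread; until that thread clears in_thread again, further
+ * CQ events only set compl_received and wake the thread.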
+ */ + + spin_lock_irqsave(&ib_conn->compl_received_lock, flags); + compl_received = ib_conn->compl_received; + spin_unlock_irqrestore(&ib_conn->compl_received_lock, flags); + + if (ib_conn->in_thread || compl_received) { + if (!compl_received) { + spin_lock_irqsave(&ib_conn->compl_received_lock, flags); + ib_conn->compl_received = 1; + spin_unlock_irqrestore(&ib_conn->compl_received_lock, + flags); + } + wake_up(&(ib_conn->callback_wait_queue)); + } else { + vnic_ib_note_comptime_stats(&comp_time); + vnic_ib_callback_stats(ib_conn); + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(cq, 1, &wc) > 0) { + vnic_ib_handle_completions(&wc, ib_conn, &comp_num, + &comp_time); + if (ib_conn->cm_id && + ib_conn->state != IB_CONN_CONNECTED) + break; + + /* If we get more completions than the completion limit + * defer completion to the thread + */ + if ((!ib_conn->in_thread) && + (comp_num >= ib_conn->ib_config->completion_limit)) { + ib_conn->in_thread = 1; + spin_lock_irqsave( + &ib_conn->compl_received_lock, flags); + ib_conn->compl_received = 1; + spin_unlock_irqrestore( + &ib_conn->compl_received_lock, flags); + wake_up(&(ib_conn->callback_wait_queue)); + break; + } + + } + vnic_ib_maxio_stats(ib_conn, comp_num); + } +} + +static int vnic_ib_mod_qp_to_rts(struct ib_cm_id *cm_id, + struct vnic_ib_conn *ib_conn) +{ + int attr_mask = 0; + int ret; + struct ib_qp_attr *qp_attr = NULL; + + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + return -ENOMEM; + + qp_attr->qp_state = IB_QPS_RTR; + + ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (ret) + goto out; + + ret = ib_modify_qp(ib_conn->qp, qp_attr, attr_mask); + if (ret) + goto out; + + IB_INFO("QP RTR\n"); + + qp_attr->qp_state = IB_QPS_RTS; + + ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (ret) + goto out; + + ret = ib_modify_qp(ib_conn->qp, qp_attr, attr_mask); + if (ret) + goto out; + + IB_INFO("QP RTS\n"); + + ret = ib_send_cm_rtu(cm_id, NULL, 0); + if (ret) + goto out; +out: + kfree(qp_attr); + return ret; +} + +int vnic_ib_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct vnic_ib_conn *ib_conn = cm_id->context; + struct viport *viport = ib_conn->viport; + int err = 0; + + switch (event->event) { + case IB_CM_REQ_ERROR: + IB_ERROR("sending CM REQ failed\n"); + err = 1; + viport->retry = 1; + break; + case IB_CM_REP_RECEIVED: + IB_INFO("CM REP recvd\n"); + if (vnic_ib_mod_qp_to_rts(cm_id, ib_conn)) + err = 1; + else { + ib_conn->state = IB_CONN_CONNECTED; + vnic_ib_connected_time_stats(ib_conn); + IB_INFO("RTU SENT\n"); + } + break; + case IB_CM_REJ_RECEIVED: + printk(KERN_ERR PFX " CM rejected control connection\n"); + if (event->param.rej_rcvd.reason == + IB_CM_REJ_INVALID_SERVICE_ID) + printk(KERN_ERR "reason: invalid service ID. 
" + "IOCGUID value specified may be incorrect\n"); + else + printk(KERN_ERR "reason code : 0x%x\n", + event->param.rej_rcvd.reason); + + err = 1; + viport->retry = 1; + break; + case IB_CM_MRA_RECEIVED: + IB_INFO("CM MRA received\n"); + break; + + case IB_CM_DREP_RECEIVED: + IB_INFO("CM DREP recvd\n"); + ib_conn->state = IB_CONN_DISCONNECTED; + break; + + case IB_CM_TIMEWAIT_EXIT: + IB_ERROR("CM timewait exit\n"); + err = 1; + break; + + default: + IB_INFO("unhandled CM event %d\n", event->event); + break; + + } + + if (err) { + ib_conn->state = IB_CONN_DISCONNECTED; + viport_failure(viport); + } + + viport_kick(viport); + return 0; +} + + +int vnic_ib_cm_connect(struct vnic_ib_conn *ib_conn) +{ + struct ib_cm_req_param *req = NULL; + struct viport *viport; + int ret = -1; + + if (!vnic_ib_conn_initted(ib_conn)) { + IB_ERROR("IB Connection out of state for CM connect (%d)\n", + ib_conn->state); + return -EINVAL; + } + + vnic_ib_conntime_stats(ib_conn); + req = kzalloc(sizeof *req, GFP_KERNEL); + if (!req) + return -ENOMEM; + + viport = ib_conn->viport; + + req->primary_path = &viport->config->path_info.path; + req->alternate_path = NULL; + req->qp_num = ib_conn->qp->qp_num; + req->qp_type = ib_conn->qp->qp_type; + req->service_id = ib_conn->ib_config->service_id; + req->private_data = &ib_conn->ib_config->conn_data; + req->private_data_len = sizeof(struct vnic_connection_data); + req->flow_control = 1; + + get_random_bytes(&req->starting_psn, 4); + req->starting_psn &= 0xffffff; + + /* + * Both responder_resources and initiator_depth are set to zero + * as we do not need RDMA read. + * + * They also must be set to zero, otherwise data connections + * are rejected by VEx. + */ + req->responder_resources = 0; + req->initiator_depth = 0; + req->remote_cm_response_timeout = 20; + req->local_cm_response_timeout = 20; + req->retry_count = ib_conn->ib_config->retry_count; + req->rnr_retry_count = ib_conn->ib_config->rnr_retry_count; + req->max_cm_retries = 15; + + ib_conn->state = IB_CONN_CONNECTING; + + ret = ib_send_cm_req(ib_conn->cm_id, req); + + kfree(req); + + if (ret) { + IB_ERROR("CM REQ sending failed; error %d \n", ret); + ib_conn->state = IB_CONN_DISCONNECTED; + } + + return ret; +} + +static int vnic_ib_init_qp(struct vnic_ib_conn *ib_conn, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config *viport_config) +{ + struct ib_qp_init_attr *init_attr; + struct ib_qp_attr *attr; + int ret; + + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) + return -ENOMEM; + + init_attr->event_handler = ib_qp_event; + init_attr->cap.max_send_wr = config->num_sends; + init_attr->cap.max_recv_wr = config->num_recvs; + init_attr->cap.max_recv_sge = config->recv_scatter; + init_attr->cap.max_send_sge = config->send_gather; + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + init_attr->qp_type = IB_QPT_RC; + init_attr->send_cq = ib_conn->cq; + init_attr->recv_cq = ib_conn->cq; + + ib_conn->qp = ib_create_qp(pd, init_attr); + + if (IS_ERR(ib_conn->qp)) { + ret = -1; + IB_ERROR("could not create QP\n"); + goto free_init_attr; + } + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + ret = -ENOMEM; + goto destroy_qp; + } + + ret = ib_find_pkey(viport_config->ibdev, viport_config->port, + be16_to_cpu(viport_config->path_info.path.pkey), + &attr->pkey_index); + if (ret) { + printk(KERN_WARNING PFX "ib_find_pkey() failed; " + "error %d\n", ret); + goto freeattr; + } + + attr->qp_state = IB_QPS_INIT; + attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE; + attr->port_num = 
viport_config->port; + + ret = ib_modify_qp(ib_conn->qp, attr, + IB_QP_STATE | + IB_QP_PKEY_INDEX | + IB_QP_ACCESS_FLAGS | IB_QP_PORT); + if (ret) { + printk(KERN_WARNING PFX "could not modify QP; error %d \n", + ret); + goto freeattr; + } + + kfree(attr); + kfree(init_attr); + return ret; + +freeattr: + kfree(attr); +destroy_qp: + ib_destroy_qp(ib_conn->qp); +free_init_attr: + kfree(init_attr); + return ret; +} + +int vnic_ib_conn_init(struct vnic_ib_conn *ib_conn, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config) +{ + struct viport_config *viport_config = viport->config; + int ret = -1; + unsigned int cq_size = config->num_sends + config->num_recvs; + + + if (!vnic_ib_conn_uninitted(ib_conn)) { + IB_ERROR("IB Connection out of state for init (%d)\n", + ib_conn->state); + return -EINVAL; + } + + ib_conn->cq = ib_create_cq(viport_config->ibdev, vnic_ib_completion, +#ifdef BUILD_FOR_OFED_1_2 + NULL, ib_conn, cq_size); +#else + NULL, ib_conn, cq_size, 0); +#endif + if (IS_ERR(ib_conn->cq)) { + IB_ERROR("could not create CQ\n"); + goto out; + } + + IB_INFO("cq created %p %d\n", ib_conn->cq, cq_size); + ib_req_notify_cq(ib_conn->cq, IB_CQ_NEXT_COMP); + init_waitqueue_head(&(ib_conn->callback_wait_queue)); + init_completion(&(ib_conn->callback_thread_exit)); + + spin_lock_init(&ib_conn->compl_received_lock); + + ib_conn->callback_thread = kthread_run(vnic_defer_completion, ib_conn, + "qlgc_vnic_def_compl"); + if (IS_ERR(ib_conn->callback_thread)) { + IB_ERROR("Could not create vnic_callback_thread;" + " error %d\n", (int) PTR_ERR(ib_conn->callback_thread)); + ib_conn->callback_thread = NULL; + goto destroy_cq; + } + + ret = vnic_ib_init_qp(ib_conn, config, pd, viport_config); + + if (ret) + goto destroy_thread; + + spin_lock_init(&ib_conn->conn_lock); + ib_conn->state = IB_CONN_INITTED; + + return ret; + +destroy_thread: + completion_callback_cleanup(ib_conn); +destroy_cq: + ib_destroy_cq(ib_conn->cq); +out: + return ret; +} + +int vnic_ib_post_recv(struct vnic_ib_conn *ib_conn, struct io *io) +{ + cycles_t post_time; + struct ib_recv_wr *bad_wr; + int ret = -1; + unsigned long flags; + + IB_FUNCTION("vnic_ib_post_recv()\n"); + + spin_lock_irqsave(&ib_conn->conn_lock, flags); + + if (!vnic_ib_conn_initted(ib_conn) && + !vnic_ib_conn_connected(ib_conn)) { + ret = -EINVAL; + goto out; + } + + vnic_ib_pre_rcvpost_stats(ib_conn, io, &post_time); + io->type = RECV; + ret = ib_post_recv(ib_conn->qp, &io->rwr, &bad_wr); + if (ret) { + IB_ERROR("error in posting rcv wr; error %d\n", ret); + ib_conn->state = IB_CONN_ERRORED; + goto out; + } + + vnic_ib_post_rcvpost_stats(ib_conn, post_time); +out: + spin_unlock_irqrestore(&ib_conn->conn_lock, flags); + return ret; + +} + +int vnic_ib_post_send(struct vnic_ib_conn *ib_conn, struct io *io) +{ + cycles_t post_time; + unsigned long flags; + struct ib_send_wr *bad_wr; + int ret = -1; + + IB_FUNCTION("vnic_ib_post_send()\n"); + + spin_lock_irqsave(&ib_conn->conn_lock, flags); + if (!vnic_ib_conn_connected(ib_conn)) { + IB_ERROR("IB Connection out of state for" + " posting sends (%d)\n", ib_conn->state); + goto out; + } + + vnic_ib_pre_sendpost_stats(io, &post_time); + if (io->swr.opcode == IB_WR_RDMA_WRITE) + io->type = RDMA; + else + io->type = SEND; + + ret = ib_post_send(ib_conn->qp, &io->swr, &bad_wr); + if (ret) { + IB_ERROR("error in posting send wr; error %d\n", ret); + ib_conn->state = IB_CONN_ERRORED; + goto out; + } + + vnic_ib_post_sendpost_stats(ib_conn, io, post_time); +out: + spin_unlock_irqrestore(&ib_conn->conn_lock, 
flags); + return ret; +} + +static int vnic_defer_completion(void *ptr) +{ + struct vnic_ib_conn *ib_conn = ptr; + struct ib_wc wc; + struct ib_cq *cq = ib_conn->cq; + cycles_t comp_time; + u32 comp_num = 0; + unsigned long flags; + + while (!ib_conn->callback_thread_end) { + wait_event_interruptible(ib_conn->callback_wait_queue, + ib_conn->compl_received || + ib_conn->callback_thread_end); + ib_conn->in_thread = 1; + spin_lock_irqsave(&ib_conn->compl_received_lock, flags); + ib_conn->compl_received = 0; + spin_unlock_irqrestore(&ib_conn->compl_received_lock, flags); + if (ib_conn->cm_id && + ib_conn->state != IB_CONN_CONNECTED) + goto out_thread; + + vnic_ib_note_comptime_stats(&comp_time); + vnic_ib_callback_stats(ib_conn); + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(cq, 1, &wc) > 0) { + vnic_ib_handle_completions(&wc, ib_conn, &comp_num, + &comp_time); + if (ib_conn->cm_id && + ib_conn->state != IB_CONN_CONNECTED) + break; + } + vnic_ib_maxio_stats(ib_conn, comp_num); +out_thread: + ib_conn->in_thread = 0; + } + complete_and_exit(&(ib_conn->callback_thread_exit), 0); + return 0; +} + +void completion_callback_cleanup(struct vnic_ib_conn *ib_conn) +{ + if (ib_conn->callback_thread) { + ib_conn->callback_thread_end = 1; + wake_up(&(ib_conn->callback_wait_queue)); + wait_for_completion(&(ib_conn->callback_thread_exit)); + ib_conn->callback_thread = NULL; + } +} + +int vnic_ib_mc_init(struct mc_data *mc_data, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config) +{ + struct viport_config *viport_config = viport->config; + int ret = -1; + unsigned int cq_size = config->num_recvs; /* recvs only */ + + IB_FUNCTION("vnic_ib_mc_init\n"); + + mc_data->ib_conn.cq = ib_create_cq(viport_config->ibdev, vnic_ib_completion, +#ifdef BUILD_FOR_OFED_1_2 + NULL, &mc_data->ib_conn, cq_size); +#else + NULL, &mc_data->ib_conn, cq_size, 0); +#endif + if (IS_ERR(mc_data->ib_conn.cq)) { + IB_ERROR("ib_create_cq failed\n"); + goto out; + } + IB_INFO("mc cq created %p %d\n", mc_data->ib_conn.cq, cq_size); + + ret = ib_req_notify_cq(mc_data->ib_conn.cq, IB_CQ_NEXT_COMP); + if (ret) { + IB_ERROR("ib_req_notify_cq failed %x \n", ret); + goto destroy_cq; + } + + init_waitqueue_head(&(mc_data->ib_conn.callback_wait_queue)); + init_completion(&(mc_data->ib_conn.callback_thread_exit)); + + spin_lock_init(&mc_data->ib_conn.compl_received_lock); + mc_data->ib_conn.callback_thread = kthread_run(vnic_defer_completion, + &mc_data->ib_conn, + "qlgc_vnic_mc_def_compl"); + if (IS_ERR(mc_data->ib_conn.callback_thread)) { + IB_ERROR("Could not create vnic_callback_thread for MULTICAST;" + " error %d\n", + (int) PTR_ERR(mc_data->ib_conn.callback_thread)); + mc_data->ib_conn.callback_thread = NULL; + goto destroy_cq; + } + IB_INFO("callback_thread created\n"); + + ret = vnic_ib_mc_init_qp(mc_data, config, pd, viport_config); + if (ret) + goto destroy_thread; + + spin_lock_init(&mc_data->ib_conn.conn_lock); + mc_data->ib_conn.state = IB_CONN_INITTED; /* stays in this state */ + + return ret; + +destroy_thread: + completion_callback_cleanup(&mc_data->ib_conn); +destroy_cq: + ib_destroy_cq(mc_data->ib_conn.cq); + mc_data->ib_conn.cq = (struct ib_cq *)ERR_PTR(-EINVAL); +out: + return ret; +} + +static int vnic_ib_mc_init_qp(struct mc_data *mc_data, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config *viport_config) +{ + struct ib_qp_init_attr *init_attr; + struct ib_qp_attr *qp_attr; + int ret; + + IB_FUNCTION("vnic_ib_mc_init_qp\n"); + + if (!mc_data->ib_conn.cq) { + 
IB_ERROR("cq is null\n"); + return -ENOMEM; + } + + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) { + IB_ERROR("failed to alloc init_attr\n"); + return -ENOMEM; + } + + init_attr->cap.max_recv_wr = config->num_recvs; + init_attr->cap.max_send_wr = 1; + init_attr->cap.max_recv_sge = 2; + init_attr->cap.max_send_sge = 1; + + /* Completion for all work requests. */ + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + + init_attr->qp_type = IB_QPT_UD; + + init_attr->send_cq = mc_data->ib_conn.cq; + init_attr->recv_cq = mc_data->ib_conn.cq; + + IB_INFO("creating qp %d \n", config->num_recvs); + + mc_data->ib_conn.qp = ib_create_qp(pd, init_attr); + + if (IS_ERR(mc_data->ib_conn.qp)) { + ret = -1; + IB_ERROR("could not create QP\n"); + goto free_init_attr; + } + + qp_attr = kzalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) { + ret = -ENOMEM; + goto destroy_qp; + } + + qp_attr->qp_state = IB_QPS_INIT; + qp_attr->port_num = viport_config->port; + qp_attr->qkey = IOC_NUMBER(be64_to_cpu(viport_config->ioc_guid)); + qp_attr->pkey_index = 0; + /* cannot set access flags for UD qp + qp_attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE; */ + + IB_INFO("port_num:%d qkey:%d pkey:%d\n", qp_attr->port_num, + qp_attr->qkey, qp_attr->pkey_index); + ret = ib_modify_qp(mc_data->ib_conn.qp, qp_attr, + IB_QP_STATE | + IB_QP_PKEY_INDEX | + IB_QP_QKEY | + + /* cannot set this for UD + IB_QP_ACCESS_FLAGS | */ + + IB_QP_PORT); + if (ret) { + IB_ERROR("ib_modify_qp to INIT failed %d \n", ret); + goto free_qp_attr; + } + + kfree(qp_attr); + kfree(init_attr); + return ret; + +free_qp_attr: + kfree(qp_attr); +destroy_qp: + ib_destroy_qp(mc_data->ib_conn.qp); + mc_data->ib_conn.qp = ERR_PTR(-EINVAL); +free_init_attr: + kfree(init_attr); + return ret; +} + +int vnic_ib_mc_mod_qp_to_rts(struct ib_qp *qp) +{ + int ret; + struct ib_qp_attr *qp_attr = NULL; + + IB_FUNCTION("vnic_ib_mc_mod_qp_to_rts\n"); + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + return -ENOMEM; + + memset(qp_attr, 0, sizeof *qp_attr); + qp_attr->qp_state = IB_QPS_RTR; + + ret = ib_modify_qp(qp, qp_attr, IB_QP_STATE); + if (ret) { + IB_ERROR("ib_modify_qp to RTR failed %d\n", ret); + goto out; + } + IB_INFO("MC QP RTR\n"); + + memset(qp_attr, 0, sizeof *qp_attr); + qp_attr->qp_state = IB_QPS_RTS; + qp_attr->sq_psn = 0; + + ret = ib_modify_qp(qp, qp_attr, IB_QP_STATE | IB_QP_SQ_PSN); + if (ret) { + IB_ERROR("ib_modify_qp to RTS failed %d\n", ret); + goto out; + } + IB_INFO("MC QP RTS\n"); + + return 0; + +out: + kfree(qp_attr); + return -1; +} + +int vnic_ib_mc_post_recv(struct mc_data *mc_data, struct io *io) +{ + cycles_t post_time; + struct ib_recv_wr *bad_wr; + int ret = -1; + + IB_FUNCTION("vnic_ib_mc_post_recv()\n"); + + vnic_ib_pre_rcvpost_stats(&mc_data->ib_conn, io, &post_time); + io->type = RECV_UD; + ret = ib_post_recv(mc_data->ib_conn.qp, &io->rwr, &bad_wr); + if (ret) { + IB_ERROR("error in posting rcv wr; error %d\n", ret); + goto out; + } + vnic_ib_post_rcvpost_stats(&mc_data->ib_conn, post_time); + +out: + return ret; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h new file mode 100644 index 0000000..ebf9ef5 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h @@ -0,0 +1,206 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_IB_H_INCLUDED +#define VNIC_IB_H_INCLUDED + +#include +#include +#include +#include +#include +#include + +#include "vnic_sys.h" +#include "vnic_netpath.h" +#define PFX "qlgc_vnic: " + +struct io; +typedef void (comp_routine_t) (struct io *io); + +enum vnic_ib_conn_state { + IB_CONN_UNINITTED = 0, + IB_CONN_INITTED = 1, + IB_CONN_CONNECTING = 2, + IB_CONN_CONNECTED = 3, + IB_CONN_DISCONNECTED = 4, + IB_CONN_ERRORED = 5 +}; + +struct vnic_ib_conn { + struct viport *viport; + struct vnic_ib_config *ib_config; + spinlock_t conn_lock; + enum vnic_ib_conn_state state; + struct ib_qp *qp; + struct ib_cq *cq; + struct ib_cm_id *cm_id; + int callback_thread_end; + struct task_struct *callback_thread; + wait_queue_head_t callback_wait_queue; + u32 in_thread; + u32 compl_received; + struct completion callback_thread_exit; + spinlock_t compl_received_lock; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + cycles_t connection_time; + cycles_t rdma_post_time; + u32 rdma_post_ios; + cycles_t rdma_comp_time; + u32 rdma_comp_ios; + cycles_t send_post_time; + u32 send_post_ios; + cycles_t send_comp_time; + u32 send_comp_ios; + cycles_t recv_post_time; + u32 recv_post_ios; + cycles_t recv_comp_time; + u32 recv_comp_ios; + u32 num_ios; + u32 num_callbacks; + u32 max_ios; + } statistics; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +struct vnic_ib_path_info { + struct ib_sa_path_rec path; + struct ib_sa_query *path_query; + int path_query_id; + int status; + struct completion done; +}; + +struct vnic_ib_device { + struct ib_device *dev; + struct list_head port_list; +}; + +struct vnic_ib_port { + struct vnic_ib_device *dev; + u8 port_num; + struct dev_info pdev_info; + struct list_head list; +}; + +struct io { + struct list_head list_ptrs; + struct viport *viport; + comp_routine_t *routine; + struct ib_recv_wr rwr; + struct ib_send_wr swr; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + cycles_t time; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ + enum {RECV, RDMA, SEND, RECV_UD} type; +}; + +struct rdma_io { + struct io io; + struct ib_sge list[2]; + u16 index; + u16 len; + u8 *data; + dma_addr_t data_dma; + struct sk_buff *skb; + dma_addr_t skb_data_dma; + struct viport_trailer *trailer; + dma_addr_t trailer_dma; +}; + +struct send_io { + struct io io; 
+ struct ib_sge list; + u8 *virtual_addr; +}; + +struct recv_io { + struct io io; + struct ib_sge list; + u8 *virtual_addr; +}; + +struct ud_recv_io { + struct io io; + u16 len; + dma_addr_t skb_data_dma; + struct ib_sge list[2]; /* one for grh and other for rest of pkt. */ + struct sk_buff *skb; +}; + +int vnic_ib_init(void); +void vnic_ib_cleanup(void); + +struct vnic; +int vnic_ib_get_path(struct netpath *netpath, struct vnic *vnic); +int vnic_ib_conn_init(struct vnic_ib_conn *ib_conn, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config); + +int vnic_ib_post_recv(struct vnic_ib_conn *ib_conn, struct io *io); +int vnic_ib_post_send(struct vnic_ib_conn *ib_conn, struct io *io); +int vnic_ib_cm_connect(struct vnic_ib_conn *ib_conn); +int vnic_ib_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); + +#define vnic_ib_conn_uninitted(ib_conn) \ + ((ib_conn)->state == IB_CONN_UNINITTED) +#define vnic_ib_conn_initted(ib_conn) \ + ((ib_conn)->state == IB_CONN_INITTED) +#define vnic_ib_conn_connecting(ib_conn) \ + ((ib_conn)->state == IB_CONN_CONNECTING) +#define vnic_ib_conn_connected(ib_conn) \ + ((ib_conn)->state == IB_CONN_CONNECTED) +#define vnic_ib_conn_disconnected(ib_conn) \ + ((ib_conn)->state == IB_CONN_DISCONNECTED) + +#define MCAST_GROUP_INVALID 0x00 /* viport failed to join or left mc group */ +#define MCAST_GROUP_JOINING 0x01 /* wait for completion */ +#define MCAST_GROUP_JOINED 0x02 /* join process completed successfully */ + +/* vnic_sa_client is used to register with sa once. It is needed to join and + * leave multicast groups. + */ +extern struct ib_sa_client vnic_sa_client; + +/* The following functions are using initialize and handle multicast + * components. + */ +struct mc_data; /* forward declaration */ +/* Initialize all necessary mc components */ +int vnic_ib_mc_init(struct mc_data *mc_data, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config); +/* Put multicast qp in RTS */ +int vnic_ib_mc_mod_qp_to_rts(struct ib_qp *qp); +/* Post multicast receive buffers */ +int vnic_ib_mc_post_recv(struct mc_data *mc_data, struct io *io); + +#endif /* VNIC_IB_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:34:59 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:04:59 +0530 Subject: [ofa-general] [PATCH v2 07/13] QLogic VNIC: Handling configurable parameters of the driver In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103459.12355.51105.stgit@localhost.localdomain> From: Poornima Kamath This patch adds the files that handle various configurable parameters of the VNIC driver ---- configuration of virtual NIC, control, data connections to the EVIC and general IB connection parameters. 
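As a rough illustration of the kind of validation these knobs call for (this sketch is not part of the patch, and the helper name is made up): vnic_max_mtu is documented as 1500-9500, and completion_limit promises a minimum of 10, but module_param() itself enforces no bounds. A loader-time clamp against the MIN_MTU/MAX_MTU macros the driver already uses could look like:

static void vnic_check_module_params(void)
{
	/* hypothetical helper: clamp vnic_max_mtu into its documented
	 * range so later BUFFER_SIZE() computations stay within bounds
	 */
	if (vnic_max_mtu < MIN_MTU) {
		printk(KERN_WARNING PFX "vnic_max_mtu %d too small, using %d\n",
		       vnic_max_mtu, MIN_MTU);
		vnic_max_mtu = MIN_MTU;
	} else if (vnic_max_mtu > MAX_MTU) {
		printk(KERN_WARNING PFX "vnic_max_mtu %d too large, using %d\n",
		       vnic_max_mtu, MAX_MTU);
		vnic_max_mtu = MAX_MTU;
	}
}

Running such a check first thing in module init keeps an out-of-range modprobe option from silently propagating into the pool-size negotiation below.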
Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_config.c | 379 ++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_config.h | 242 +++++++++++++++ 2 files changed, 621 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_config.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.c new file mode 100644 index 0000000..8bde3d8 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.c @@ -0,0 +1,379 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_config.h" +#include "vnic_trailer.h" +#include "vnic_main.h" + +u16 vnic_max_mtu = MAX_MTU; + +static u32 default_no_path_timeout = DEFAULT_NO_PATH_TIMEOUT; +static u32 sa_path_rec_get_timeout = SA_PATH_REC_GET_TIMEOUT; +static u32 default_primary_reconnect_timeout = + DEFAULT_PRIMARY_RECONNECT_TIMEOUT; +static u32 default_primary_switch_timeout = DEFAULT_PRIMARY_SWITCH_TIMEOUT; +static int default_prefer_primary = DEFAULT_PREFER_PRIMARY; + +static int use_rx_csum = VNIC_USE_RX_CSUM; +static int use_tx_csum = VNIC_USE_TX_CSUM; + +static u32 control_response_timeout = CONTROL_RSP_TIMEOUT; +static u32 completion_limit = DEFAULT_COMPLETION_LIMIT; + +module_param(vnic_max_mtu, ushort, 0444); +MODULE_PARM_DESC(vnic_max_mtu, "Maximum MTU size (1500-9500). Default is 9500"); + +module_param(default_prefer_primary, bool, 0444); +MODULE_PARM_DESC(default_prefer_primary, "Determines if primary path is" + " preferred (1) or not (0). Defaults to 0"); +module_param(use_rx_csum, bool, 0444); +MODULE_PARM_DESC(use_rx_csum, "Determines if RX checksum is done on VEx (1)" + " or not (0). Defaults to 1"); +module_param(use_tx_csum, bool, 0444); +MODULE_PARM_DESC(use_tx_csum, "Determines if TX checksum is done on VEx (1)" + " or not (0). 
Defaults to 1"); +module_param(default_no_path_timeout, uint, 0444); +MODULE_PARM_DESC(default_no_path_timeout, "Time to wait in milliseconds" + " before reconnecting to VEx after connection loss"); +module_param(default_primary_reconnect_timeout, uint, 0444); +MODULE_PARM_DESC(default_primary_reconnect_timeout, "Time to wait in" + " milliseconds before reconnecting the" + " primary path to VEx"); +module_param(default_primary_switch_timeout, uint, 0444); +MODULE_PARM_DESC(default_primary_switch_timeout, "Time to wait before" + " switching back to primary path if" + " primary path is preferred"); +module_param(sa_path_rec_get_timeout, uint, 0444); +MODULE_PARM_DESC(sa_path_rec_get_timeout, "Time out value in milliseconds" + " for SA path record get queries"); + +module_param(control_response_timeout, uint, 0444); +MODULE_PARM_DESC(control_response_timeout, "Time out value in milliseconds" + " to wait for response to control requests"); + +module_param(completion_limit, uint, 0444); +MODULE_PARM_DESC(completion_limit, "Maximum completions to process" + " in a single completion callback invocation. Default is 100" + " Minimum value is 10"); + +static void config_control_defaults(struct control_config *control_config, + struct path_param *params) +{ + int len; + char *dot; + u64 sid; + + sid = (SST_AGN << 56) | (SST_OUI << 32) | (CONTROL_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + control_config->ib_config.service_id = cpu_to_be64(sid); + control_config->ib_config.conn_data.path_id = 0; + control_config->ib_config.conn_data.vnic_instance = params->instance; + control_config->ib_config.conn_data.path_num = 0; + control_config->ib_config.conn_data.features_supported = + __constant_cpu_to_be32((u32) (VNIC_FEAT_IGNORE_VLAN | + VNIC_FEAT_RDMA_IMMED)); + dot = strchr(init_utsname()->nodename, '.'); + + if (dot) + len = dot - init_utsname()->nodename; + else + len = strlen(init_utsname()->nodename); + + if (len > VNIC_MAX_NODENAME_LEN) + len = VNIC_MAX_NODENAME_LEN; + + memcpy(control_config->ib_config.conn_data.nodename, + init_utsname()->nodename, len); + + if (params->ib_multicast == 1) + control_config->ib_multicast = 1; + else if (params->ib_multicast == 0) + control_config->ib_multicast = 0; + else { + /* parameter is not set - enable it by default */ + control_config->ib_multicast = 1; + CONFIG_ERROR("IOCGUID=%llx INSTANCE=%d IB_MULTICAST defaulted" + " to TRUE\n", + be64_to_cpu(params->ioc_guid), + (char)params->instance); + } + + if (control_config->ib_multicast) + control_config->ib_config.conn_data.features_supported |= + __constant_cpu_to_be32(VNIC_FEAT_INBOUND_IB_MC); + + control_config->ib_config.retry_count = RETRY_COUNT; + control_config->ib_config.rnr_retry_count = RETRY_COUNT; + control_config->ib_config.min_rnr_timer = MIN_RNR_TIMER; + + /* These values are not configurable*/ + control_config->ib_config.num_recvs = 5; + control_config->ib_config.num_sends = 1; + control_config->ib_config.recv_scatter = 1; + control_config->ib_config.send_gather = 1; + control_config->ib_config.completion_limit = completion_limit; + + control_config->num_recvs = control_config->ib_config.num_recvs; + + control_config->vnic_instance = params->instance; + control_config->max_address_entries = MAX_ADDRESS_ENTRIES; + control_config->min_address_entries = MIN_ADDRESS_ENTRIES; + control_config->rsp_timeout = msecs_to_jiffies(control_response_timeout); +} + +static void config_data_defaults(struct data_config *data_config, + struct path_param *params) +{ + u64 sid; + + sid = (SST_AGN 
<< 56) | (SST_OUI << 32) | (DATA_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + data_config->ib_config.service_id = cpu_to_be64(sid); + data_config->ib_config.conn_data.path_id = jiffies; /* random */ + data_config->ib_config.conn_data.vnic_instance = params->instance; + data_config->ib_config.conn_data.path_num = 0; + + data_config->ib_config.retry_count = RETRY_COUNT; + data_config->ib_config.rnr_retry_count = RETRY_COUNT; + data_config->ib_config.min_rnr_timer = MIN_RNR_TIMER; + + /* + * NOTE: the num_recvs size assumes that the EIOC could + * RDMA enough packets to fill all of the host recv + * pool entries, plus send a kick message after each + * packet, plus RDMA new buffers for the size of + * the EIOC recv buffer pool, plus send kick messages + * after each min_host_update_sz of new buffers all + * before the host can even pull off the first completed + * receive off the completion queue, and repost the + * receive. NOT LIKELY! + */ + data_config->ib_config.num_recvs = HOST_RECV_POOL_ENTRIES + + (MAX_EIOC_POOL_SZ / MIN_HOST_UPDATE_SZ); + + data_config->ib_config.num_sends = (2 * NOTIFY_BUNDLE_SZ) + + (HOST_RECV_POOL_ENTRIES / MIN_EIOC_UPDATE_SZ) + 1; + + data_config->ib_config.recv_scatter = 1; /* not configurable */ + data_config->ib_config.send_gather = 2; /* not configurable */ + data_config->ib_config.completion_limit = completion_limit; + + data_config->num_recvs = data_config->ib_config.num_recvs; + data_config->path_id = data_config->ib_config.conn_data.path_id; + + + data_config->host_recv_pool_entries = HOST_RECV_POOL_ENTRIES; + + data_config->host_min.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + MIN_MTU)); + data_config->host_max.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + vnic_max_mtu)); + data_config->eioc_min.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + MIN_MTU)); + data_config->eioc_max.size_recv_pool_entry = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_entries = + __constant_cpu_to_be32(MIN_HOST_POOL_SZ); + data_config->host_max.num_recv_pool_entries = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + data_config->eioc_min.num_recv_pool_entries = + __constant_cpu_to_be32(MIN_EIOC_POOL_SZ); + data_config->eioc_max.num_recv_pool_entries = + __constant_cpu_to_be32(MAX_EIOC_POOL_SZ); + + data_config->host_min.timeout_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_TIMEOUT); + data_config->host_max.timeout_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_TIMEOUT); + data_config->eioc_min.timeout_before_kick = 0; + data_config->eioc_max.timeout_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_ENTRIES); + data_config->host_max.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_ENTRIES); + data_config->eioc_min.num_recv_pool_entries_before_kick = 0; + data_config->eioc_max.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_BYTES); + data_config->host_max.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_BYTES); + data_config->eioc_min.num_recv_pool_bytes_before_kick = 0; + data_config->eioc_max.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MIN_HOST_UPDATE_SZ); + 
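+	/*
+	 * With the constants defined in vnic_config.h left at their
+	 * defaults (assuming none of them are changed), the num_recvs and
+	 * num_sends sizing at the top of this function works out to
+	 * num_recvs = 512 + 256/8 = 544 and
+	 * num_sends = 2 * 32 + 512/8 + 1 = 129 work requests per data QP.
+	 */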
data_config->host_max.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MAX_HOST_UPDATE_SZ); + data_config->eioc_min.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MIN_EIOC_UPDATE_SZ); + data_config->eioc_max.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MAX_EIOC_UPDATE_SZ); + + data_config->notify_bundle = NOTIFY_BUNDLE_SZ; +} + +static void config_path_info_defaults(struct viport_config *config, + struct path_param *params) +{ + int i; + ib_query_gid(config->ibdev, config->port, 0, + &config->path_info.path.sgid); + for (i = 0; i < 16; i++) + config->path_info.path.dgid.raw[i] = params->dgid[i]; + + config->path_info.path.pkey = params->pkey; + config->path_info.path.numb_path = 1; + config->sa_path_rec_get_timeout = sa_path_rec_get_timeout; + +} + +static void config_viport_defaults(struct viport_config *config, + struct path_param *params) +{ + config->ibdev = params->ibdev; + config->port = params->port; + config->ioc_guid = params->ioc_guid; + config->stats_interval = msecs_to_jiffies(VIPORT_STATS_INTERVAL); + config->hb_interval = msecs_to_jiffies(VIPORT_HEARTBEAT_INTERVAL); + config->hb_timeout = VIPORT_HEARTBEAT_TIMEOUT * 1000; + /*hb_timeout needs to be in usec*/ + strcpy(config->ioc_string, params->ioc_string); + config_path_info_defaults(config, params); + + config_control_defaults(&config->control_config, params); + config_data_defaults(&config->data_config, params); +} + +static void config_vnic_defaults(struct vnic_config *config) +{ + config->no_path_timeout = msecs_to_jiffies(default_no_path_timeout); + config->primary_connect_timeout = + msecs_to_jiffies(DEFAULT_PRIMARY_CONNECT_TIMEOUT); + config->primary_reconnect_timeout = + msecs_to_jiffies(default_primary_reconnect_timeout); + config->primary_switch_timeout = + msecs_to_jiffies(default_primary_switch_timeout); + config->prefer_primary = default_prefer_primary; + config->use_rx_csum = use_rx_csum; + config->use_tx_csum = use_tx_csum; +} + +struct viport_config *config_alloc_viport(struct path_param *params) +{ + struct viport_config *config; + + config = kzalloc(sizeof *config, GFP_KERNEL); + if (!config) { + CONFIG_ERROR("could not allocate memory for" + " struct viport_config\n"); + return NULL; + } + + config_viport_defaults(config, params); + + return config; +} + +struct vnic_config *config_alloc_vnic(void) +{ + struct vnic_config *config; + + config = kzalloc(sizeof *config, GFP_KERNEL); + if (!config) { + CONFIG_ERROR("couldn't allocate memory for" + " struct vnic_config\n"); + + return NULL; + } + + config_vnic_defaults(config); + return config; +} + +char *config_viport_name(struct viport_config *config) +{ + /* function only called by one thread, can return a static string */ + static char str[64]; + + sprintf(str, "GUID %llx instance %d", + be64_to_cpu(config->ioc_guid), + config->control_config.vnic_instance); + return str; +} + +int config_start(void) +{ + vnic_max_mtu = min_t(u16, vnic_max_mtu, MAX_MTU); + vnic_max_mtu = max_t(u16, vnic_max_mtu, MIN_MTU); + + sa_path_rec_get_timeout = min_t(u32, sa_path_rec_get_timeout, + MAX_SA_TIMEOUT); + sa_path_rec_get_timeout = max_t(u32, sa_path_rec_get_timeout, + MIN_SA_TIMEOUT); + + control_response_timeout = min_t(u32, control_response_timeout, + MAX_CONTROL_RSP_TIMEOUT); + + control_response_timeout = max_t(u32, control_response_timeout, + MIN_CONTROL_RSP_TIMEOUT); + + completion_limit = max_t(u32, completion_limit, + MIN_COMPLETION_LIMIT); + + if (!default_no_path_timeout) + default_no_path_timeout = 
DEFAULT_NO_PATH_TIMEOUT; + + if (!default_primary_reconnect_timeout) + default_primary_reconnect_timeout = + DEFAULT_PRIMARY_RECONNECT_TIMEOUT; + + if (!default_primary_switch_timeout) + default_primary_switch_timeout = + DEFAULT_PRIMARY_SWITCH_TIMEOUT; + + return 0; + +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_config.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.h new file mode 100644 index 0000000..dca5f98 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.h @@ -0,0 +1,242 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_CONFIG_H_INCLUDED +#define VNIC_CONFIG_H_INCLUDED + +#include +#include +#include + +#include "vnic_control.h" +#include "vnic_ib.h" + +#define SST_AGN 0x10ULL +#define SST_OUI 0x00066AULL + +enum { + CONTROL_PATH_ID = 0x0, + DATA_PATH_ID = 0x1 +}; + +#define IOC_NUMBER(GUID) (((GUID) >> 32) & 0xFF) + +enum { + VNIC_CLASS_SUBCLASS = 0x2000066A, + VNIC_PROTOCOL = 0, + VNIC_PROT_VERSION = 1 +}; + +enum { + MIN_MTU = 1500, /* minimum negotiated MTU size */ + MAX_MTU = 9500 /* jumbo frame */ +}; + +/* + * TODO: tune the pool parameter values + */ +enum { + MIN_ADDRESS_ENTRIES = 16, + MAX_ADDRESS_ENTRIES = 64 +}; + +enum { + HOST_RECV_POOL_ENTRIES = 512, + MIN_HOST_POOL_SZ = 64, + MIN_EIOC_POOL_SZ = 64, + MAX_EIOC_POOL_SZ = 256, + MIN_HOST_UPDATE_SZ = 8, + MAX_HOST_UPDATE_SZ = 32, + MIN_EIOC_UPDATE_SZ = 8, + MAX_EIOC_UPDATE_SZ = 32, + NOTIFY_BUNDLE_SZ = 32 +}; + +enum { + MIN_HOST_KICK_TIMEOUT = 10, /* in usec */ + MAX_HOST_KICK_TIMEOUT = 100 /* in usec */ +}; + +enum { + MIN_HOST_KICK_ENTRIES = 1, + MAX_HOST_KICK_ENTRIES = 128 +}; + +enum { + MIN_HOST_KICK_BYTES = 0, + MAX_HOST_KICK_BYTES = 5000 +}; + +enum { + DEFAULT_NO_PATH_TIMEOUT = 10000, + DEFAULT_PRIMARY_CONNECT_TIMEOUT = 10000, + DEFAULT_PRIMARY_RECONNECT_TIMEOUT = 10000, + DEFAULT_PRIMARY_SWITCH_TIMEOUT = 10000 +}; + +enum { + VIPORT_STATS_INTERVAL = 500, /* .5 sec */ + VIPORT_HEARTBEAT_INTERVAL = 1000, /* 1 second */ + VIPORT_HEARTBEAT_TIMEOUT = 64000 /* 64 sec */ +}; + +enum { + /* 5 sec increased for EVIC support for large number of + * host connections + */ + CONTROL_RSP_TIMEOUT = 5000, + MIN_CONTROL_RSP_TIMEOUT = 1000, /* 1 sec */ + MAX_CONTROL_RSP_TIMEOUT = 60000 /* 60 sec */ +}; + +/* Maximum number of completions to be processed + * during a single completion callback invocation + */ +enum { + DEFAULT_COMPLETION_LIMIT = 100, + MIN_COMPLETION_LIMIT = 10 +}; + +/* infiniband connection parameters */ +enum { + RETRY_COUNT = 3, + MIN_RNR_TIMER = 22, /* 20 ms */ + DEFAULT_PKEY = 0 /* pkey table index */ +}; + +enum { + SA_PATH_REC_GET_TIMEOUT = 1000, /* 1000 ms */ + MIN_SA_TIMEOUT = 100, /* 100 ms */ + MAX_SA_TIMEOUT = 20000 /* 20s */ +}; + +#define MAX_PARAM_VALUE 0x40000000 +#define VNIC_USE_RX_CSUM 1 +#define VNIC_USE_TX_CSUM 1 +#define DEFAULT_PREFER_PRIMARY 0 + +/* As per IBTA specification, IOCString Maximum length can be 512 bits. 
*/ +#define MAX_IOC_STRING_LEN (512/8) + +struct path_param { + __be64 ioc_guid; + u8 ioc_string[MAX_IOC_STRING_LEN+1]; + u8 port; + u8 instance; + struct ib_device *ibdev; + struct vnic_ib_port *ibport; + char name[IFNAMSIZ]; + u8 dgid[16]; + __be16 pkey; + int rx_csum; + int tx_csum; + int heartbeat; + int ib_multicast; +}; + +struct vnic_ib_config { + __be64 service_id; + struct vnic_connection_data conn_data; + u32 retry_count; + u32 rnr_retry_count; + u8 min_rnr_timer; + u32 num_sends; + u32 num_recvs; + u32 recv_scatter; /* 1 */ + u32 send_gather; /* 1 or 2 */ + u32 completion_limit; +}; + +struct control_config { + struct vnic_ib_config ib_config; + u32 num_recvs; + u8 vnic_instance; + u16 max_address_entries; + u16 min_address_entries; + u32 rsp_timeout; + u32 ib_multicast; +}; + +struct data_config { + struct vnic_ib_config ib_config; + u64 path_id; + u32 num_recvs; + u32 host_recv_pool_entries; + struct vnic_recv_pool_config host_min; + struct vnic_recv_pool_config host_max; + struct vnic_recv_pool_config eioc_min; + struct vnic_recv_pool_config eioc_max; + u32 notify_bundle; +}; + +struct viport_config { + struct viport *viport; + struct control_config control_config; + struct data_config data_config; + struct vnic_ib_path_info path_info; + u32 sa_path_rec_get_timeout; + struct ib_device *ibdev; + u32 port; + unsigned long stats_interval; + u32 hb_interval; + u32 hb_timeout; + __be64 ioc_guid; + u8 ioc_string[MAX_IOC_STRING_LEN+1]; + size_t path_idx; +}; + +/* + * primary_connect_timeout - if the secondary connects first, + * how long do we give the primary? + * primary_reconnect_timeout - same as above, but used when recovering + * from the case where both paths fail + * primary_switch_timeout - how long do we wait before switching to the + * primary when it comes back? + */ +struct vnic_config { + struct vnic *vnic; + char name[IFNAMSIZ]; + unsigned long no_path_timeout; + u32 primary_connect_timeout; + u32 primary_reconnect_timeout; + u32 primary_switch_timeout; + int prefer_primary; + int use_rx_csum; + int use_tx_csum; +}; + +int config_start(void); +struct viport_config *config_alloc_viport(struct path_param *params); +struct vnic_config *config_alloc_vnic(void); +char *config_viport_name(struct viport_config *config); + +#endif /* VNIC_CONFIG_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:35:29 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:05:29 +0530 Subject: [ofa-general] [PATCH v2 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103529.12355.82570.stgit@localhost.localdomain> From: Amar Mudrankit The sysfs interface for the QLogic VNIC driver is implemented through this patch. 
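For illustration, driving these attributes from a shell looks roughly like
the following (the GUID, GID and P_Key values are made-up placeholders, and
the exact sysfs location of the per-port create_* files and of delete_vnic
depends on how the class and port devices get registered on a given system;
echo -n avoids a trailing newline ending up in the parsed name):

  echo -n "ioc_guid=66a0130000001234,dgid=fe80000000000000000566a000001234,pkey=ffff,name=eioc1" > /sys/class/infiniband_qlgc_vnic/<port-dev>/create_primary
  echo -n "eioc1" > /sys/class/infiniband_qlgc_vnic/<interface-dev>/delete_vnic

ioc_guid, dgid (exactly 32 hex digits), pkey and name are mandatory;
instance, rx_csum, tx_csum, heartbeat, ioc_string and ib_multicast are
optional.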
Signed-off-by: Amar Mudrankit Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath --- drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1131 +++++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h | 62 + 2 files changed, 1193 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c new file mode 100644 index 0000000..312f37c --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c @@ -0,0 +1,1131 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_config.h" +#include "vnic_ib.h" +#include "vnic_viport.h" +#include "vnic_main.h" +#include "vnic_stats.h" + +/* + * target eiocs are added by writing + * + * ioc_guid=,dgid=,pkey=,name= + * to the create_primary sysfs attribute. 
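+ *
+ * For example (placeholder values only):
+ *
+ *   ioc_guid=66a0130000001234,dgid=fe80000000000000000566a000001234,pkey=ffff,name=eioc1
+ *
+ * with optional instance=, rx_csum=, tx_csum=, heartbeat=, ioc_string=
+ * and ib_multicast= keys appended as needed.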
+ */ +enum { + VNIC_OPT_ERR = 0, + VNIC_OPT_IOC_GUID = 1 << 0, + VNIC_OPT_DGID = 1 << 1, + VNIC_OPT_PKEY = 1 << 2, + VNIC_OPT_NAME = 1 << 3, + VNIC_OPT_INSTANCE = 1 << 4, + VNIC_OPT_RXCSUM = 1 << 5, + VNIC_OPT_TXCSUM = 1 << 6, + VNIC_OPT_HEARTBEAT = 1 << 7, + VNIC_OPT_IOC_STRING = 1 << 8, + VNIC_OPT_IB_MULTICAST = 1 << 9, + VNIC_OPT_ALL = (VNIC_OPT_IOC_GUID | + VNIC_OPT_DGID | VNIC_OPT_NAME | VNIC_OPT_PKEY), +}; + +static match_table_t vnic_opt_tokens = { + {VNIC_OPT_IOC_GUID, "ioc_guid=%s"}, + {VNIC_OPT_DGID, "dgid=%s"}, + {VNIC_OPT_PKEY, "pkey=%x"}, + {VNIC_OPT_NAME, "name=%s"}, + {VNIC_OPT_INSTANCE, "instance=%d"}, + {VNIC_OPT_RXCSUM, "rx_csum=%s"}, + {VNIC_OPT_TXCSUM, "tx_csum=%s"}, + {VNIC_OPT_HEARTBEAT, "heartbeat=%d"}, + {VNIC_OPT_IOC_STRING, "ioc_string=\"%s"}, + {VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"}, + {VNIC_OPT_ERR, NULL} +}; + +void vnic_release_dev(struct device *dev) +{ + struct dev_info *dev_info = + container_of(dev, struct dev_info, dev); + + complete(&dev_info->released); + +} + +struct class vnic_class = { + .name = "infiniband_qlgc_vnic", + .dev_release = vnic_release_dev +}; + +struct dev_info interface_dev; + +DEVICE_ATTR(create_primary, S_IWUSR, NULL, vnic_create_primary); +DEVICE_ATTR(create_secondary, S_IWUSR, NULL, vnic_create_secondary); +DEVICE_ATTR(delete_vnic, S_IWUSR, NULL, vnic_delete); + +static int vnic_parse_options(const char *buf, struct path_param *param) +{ + char *options, *sep_opt; + char *p; + char dgid[3]; + substring_t args[MAX_OPT_ARGS]; + int opt_mask = 0; + int token; + int ret = -EINVAL; + int i, len; + + options = kstrdup(buf, GFP_KERNEL); + if (!options) + return -ENOMEM; + + sep_opt = options; + while ((p = strsep(&sep_opt, ",")) != NULL) { + if (!*p) + continue; + + token = match_token(p, vnic_opt_tokens, args); + opt_mask |= token; + + switch (token) { + case VNIC_OPT_IOC_GUID: + p = match_strdup(args); + param->ioc_guid = cpu_to_be64(simple_strtoull(p, NULL, + 16)); + kfree(p); + break; + + case VNIC_OPT_DGID: + p = match_strdup(args); + if (strlen(p) != 32) { + printk(KERN_WARNING PFX + "bad dest GID parameter '%s'\n", p); + kfree(p); + goto out; + } + + for (i = 0; i < 16; ++i) { + strlcpy(dgid, p + i * 2, 3); + param->dgid[i] = simple_strtoul(dgid, NULL, + 16); + + } + kfree(p); + break; + + case VNIC_OPT_PKEY: + if (match_hex(args, &token)) { + printk(KERN_WARNING PFX + "bad P_key parameter '%s'\n", p); + goto out; + } + param->pkey = cpu_to_be16(token); + break; + + case VNIC_OPT_NAME: + p = match_strdup(args); + if (strlen(p) >= IFNAMSIZ) { + printk(KERN_WARNING PFX + "interface name parameter too long\n"); + kfree(p); + goto out; + } + strcpy(param->name, p); + kfree(p); + break; + case VNIC_OPT_INSTANCE: + if (match_int(args, &token)) { + printk(KERN_WARNING PFX + "bad instance parameter '%s'\n", p); + goto out; + } + + if (token > 255 || token < 0) { + printk(KERN_WARNING PFX + "instance parameter must be" + " >= 0 and <= 255\n"); + goto out; + } + + param->instance = token; + break; + case VNIC_OPT_RXCSUM: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->rx_csum = 1; + else if (!strncmp(p, "false", 5)) + param->rx_csum = 0; + else { + printk(KERN_WARNING PFX + "bad rx_csum parameter." + " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + case VNIC_OPT_TXCSUM: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->tx_csum = 1; + else if (!strncmp(p, "false", 5)) + param->tx_csum = 0; + else { + printk(KERN_WARNING PFX + "bad tx_csum parameter." 
+ " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + case VNIC_OPT_HEARTBEAT: + if (match_int(args, &token)) { + printk(KERN_WARNING PFX + "bad instance parameter '%s'\n", p); + goto out; + } + + if (token > 6000 || token <= 0) { + printk(KERN_WARNING PFX + "heartbeat parameter must be" + " > 0 and <= 6000\n"); + goto out; + } + param->heartbeat = token; + break; + case VNIC_OPT_IOC_STRING: + p = match_strdup(args); + len = strlen(p); + if (len > MAX_IOC_STRING_LEN) { + printk(KERN_WARNING PFX + "ioc string parameter too long\n"); + kfree(p); + goto out; + } + strcpy(param->ioc_string, p); + if (*(p + len - 1) != '\"') { + strcat(param->ioc_string, ","); + kfree(p); + p = strsep(&sep_opt, "\""); + strcat(param->ioc_string, p); + sep_opt++; + } else { + *(param->ioc_string + len - 1) = '\0'; + kfree(p); + } + break; + case VNIC_OPT_IB_MULTICAST: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->ib_multicast = 1; + else if (!strncmp(p, "false", 5)) + param->ib_multicast = 0; + else { + printk(KERN_WARNING PFX + "bad ib_multicast parameter." + " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + default: + printk(KERN_WARNING PFX + "unknown parameter or missing value " + "'%s' in target creation request\n", p); + goto out; + } + + } + + if ((opt_mask & VNIC_OPT_ALL) == VNIC_OPT_ALL) + ret = 0; + else + for (i = 0; i < ARRAY_SIZE(vnic_opt_tokens); ++i) + if ((vnic_opt_tokens[i].token & VNIC_OPT_ALL) && + !(vnic_opt_tokens[i].token & opt_mask)) + printk(KERN_WARNING PFX + "target creation request is " + "missing parameter '%s'\n", + vnic_opt_tokens[i].pattern); + +out: + kfree(options); + return ret; + +} + +static ssize_t show_vnic_state(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + switch (vnic->state) { + case VNIC_UNINITIALIZED: + return sprintf(buf, "VNIC_UNINITIALIZED\n"); + case VNIC_REGISTERED: + return sprintf(buf, "VNIC_REGISTERED\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + } + +} + +static DEVICE_ATTR(vnic_state, S_IRUGO, show_vnic_state, NULL); + +static ssize_t show_rx_csum(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + + if (vnic->config->use_rx_csum) + return sprintf(buf, "true\n"); + else + return sprintf(buf, "false\n"); +} + +static DEVICE_ATTR(rx_csum, S_IRUGO, show_rx_csum, NULL); + +static ssize_t show_tx_csum(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + + if (vnic->config->use_tx_csum) + return sprintf(buf, "true\n"); + else + return sprintf(buf, "false\n"); +} + +static DEVICE_ATTR(tx_csum, S_IRUGO, show_tx_csum, NULL); + +static ssize_t show_current_path(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + unsigned long flags; + size_t length; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (vnic->current_path == &vnic->primary_path) + length = sprintf(buf, "primary_path\n"); + else if (vnic->current_path == &vnic->secondary_path) + length = 
sprintf(buf, "secondary path\n"); + else + length = sprintf(buf, "none\n"); + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + return length; +} + +static DEVICE_ATTR(current_path, S_IRUGO, show_current_path, NULL); + +static struct attribute *vnic_dev_attrs[] = { + &dev_attr_vnic_state.attr, + &dev_attr_rx_csum.attr, + &dev_attr_tx_csum.attr, + &dev_attr_current_path.attr, + NULL +}; + +struct attribute_group vnic_dev_attr_group = { + .attrs = vnic_dev_attrs, +}; + +static inline void print_dgid(u8 *dgid) +{ + int i; + + for (i = 0; i < 16; i += 2) + printk("%04x", be16_to_cpu(*(__be16 *)&dgid[i])); +} + +static inline int is_dgid_zero(u8 *dgid) +{ + int i; + + for (i = 0; i < 16; i++) { + if (dgid[i] != 0) + return 1; + } + return 0; +} + +static int create_netpath(struct netpath *npdest, + struct path_param *p_params) +{ + struct viport_config *viport_config; + struct viport *viport; + struct vnic *vnic; + struct list_head *ptr; + int ret = 0; + + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (vnic->primary_path.viport) { + viport_config = vnic->primary_path.viport->config; + if ((viport_config->ioc_guid == p_params->ioc_guid) + && (viport_config->control_config.vnic_instance + == p_params->instance) + && (be64_to_cpu(p_params->ioc_guid))) { + SYS_ERROR("GUID %llx," + " INSTANCE %d already in use\n", + be64_to_cpu(p_params->ioc_guid), + p_params->instance); + ret = -EINVAL; + goto out; + } + } + + if (vnic->secondary_path.viport) { + viport_config = vnic->secondary_path.viport->config; + if ((viport_config->ioc_guid == p_params->ioc_guid) + && (viport_config->control_config.vnic_instance + == p_params->instance) + && (be64_to_cpu(p_params->ioc_guid))) { + SYS_ERROR("GUID %llx," + " INSTANCE %d already in use\n", + be64_to_cpu(p_params->ioc_guid), + p_params->instance); + ret = -EINVAL; + goto out; + } + } + } + + if (npdest->viport) { + SYS_ERROR("create_netpath: path already exists\n"); + ret = -EINVAL; + goto out; + } + + viport_config = config_alloc_viport(p_params); + if (!viport_config) { + SYS_ERROR("create_netpath: failed creating viport config\n"); + ret = -1; + goto out; + } + + /*User specified heartbeat value is in 1/100s of a sec*/ + if (p_params->heartbeat != -1) { + viport_config->hb_interval = + msecs_to_jiffies(p_params->heartbeat * 10); + viport_config->hb_timeout = + (p_params->heartbeat << 6) * 10000; /* usec */ + } + + viport_config->path_idx = 0; + + viport = viport_allocate(viport_config); + if (!viport) { + SYS_ERROR("create_netpath: failed creating viport\n"); + kfree(viport_config); + ret = -1; + goto out; + } + + npdest->viport = viport; + viport->parent = npdest; + viport->vnic = npdest->parent; + + if (is_dgid_zero(p_params->dgid) && p_params->ioc_guid != 0 + && p_params->pkey != 0) { + viport_kick(viport); + vnic_disconnected(npdest->parent, npdest); + } else { + printk(KERN_WARNING "Specified parameters IOCGUID=%llx, " + "P_Key=%x, DGID=", be64_to_cpu(p_params->ioc_guid), + p_params->pkey); + print_dgid(p_params->dgid); + printk(" insufficient for establishing %s path for interface " + "%s. Hence, path will not be established.\n", + (npdest->second_bias ? 
"secondary" : "primary"), + p_params->name); + } +out: + return ret; +} + +static struct vnic *create_vnic(struct path_param *param) +{ + struct vnic_config *vnic_config; + struct vnic *vnic; + struct list_head *ptr; + + SYS_INFO("create_vnic: name = %s\n", param->name); + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strcmp(vnic->config->name, param->name)) { + SYS_ERROR("vnic %s already exists\n", + param->name); + return NULL; + } + } + + vnic_config = config_alloc_vnic(); + if (!vnic_config) { + SYS_ERROR("create_vnic: failed creating vnic config\n"); + return NULL; + } + + if (param->rx_csum != -1) + vnic_config->use_rx_csum = param->rx_csum; + + if (param->tx_csum != -1) + vnic_config->use_tx_csum = param->tx_csum; + + strcpy(vnic_config->name, param->name); + vnic = vnic_allocate(vnic_config); + if (!vnic) { + SYS_ERROR("create_vnic: failed allocating vnic\n"); + goto free_vnic_config; + } + + init_completion(&vnic->dev_info.released); + + vnic->dev_info.dev.class = NULL; + vnic->dev_info.dev.parent = &interface_dev.dev; + vnic->dev_info.dev.release = vnic_release_dev; + snprintf(vnic->dev_info.dev.bus_id, BUS_ID_SIZE, + vnic_config->name); + + if (device_register(&vnic->dev_info.dev)) { + SYS_ERROR("create_vnic: error in registering" + " vnic class dev\n"); + goto free_vnic; + } + + if (sysfs_create_group(&vnic->dev_info.dev.kobj, + &vnic_dev_attr_group)) { + SYS_ERROR("create_vnic: error in creating" + "vnic attr group\n"); + goto err_attr; + + } + + if (vnic_setup_stats_files(vnic)) + goto err_stats; + + return vnic; +err_stats: + sysfs_remove_group(&vnic->dev_info.dev.kobj, + &vnic_dev_attr_group); +err_attr: + device_unregister(&vnic->dev_info.dev); + wait_for_completion(&vnic->dev_info.released); +free_vnic: + list_del(&vnic->list_ptrs); + kfree(vnic); +free_vnic_config: + kfree(vnic_config); + return NULL; +} + +ssize_t vnic_delete(struct device *dev, struct device_attribute *dev_attr, + const char *buf, size_t count) +{ + struct vnic *vnic; + struct list_head *ptr; + int ret = -EINVAL; + + if (count > IFNAMSIZ) { + printk(KERN_WARNING PFX "invalid vnic interface name\n"); + return ret; + } + + SYS_INFO("vnic_delete: name = %s\n", buf); + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strcmp(vnic->config->name, buf)) { + vnic_free(vnic); + return count; + } + } + + printk(KERN_WARNING PFX "vnic interface '%s' does not exist\n", buf); + return ret; +} + +static ssize_t show_viport_state(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct netpath *path = container_of(info, struct netpath, dev_info); + switch (path->viport->state) { + case VIPORT_DISCONNECTED: + return sprintf(buf, "VIPORT_DISCONNECTED\n"); + case VIPORT_CONNECTED: + return sprintf(buf, "VIPORT_CONNECTED\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + } + +} + +static DEVICE_ATTR(viport_state, S_IRUGO, show_viport_state, NULL); + +static ssize_t show_link_state(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct netpath *path = container_of(info, struct netpath, dev_info); + + switch (path->viport->link_state) { + case LINK_UNINITIALIZED: + return sprintf(buf, "LINK_UNINITIALIZED\n"); + case LINK_INITIALIZE: + return sprintf(buf, "LINK_INITIALIZE\n"); + case LINK_INITIALIZECONTROL: + return sprintf(buf, 
"LINK_INITIALIZECONTROL\n"); + case LINK_INITIALIZEDATA: + return sprintf(buf, "LINK_INITIALIZEDATA\n"); + case LINK_CONTROLCONNECT: + return sprintf(buf, "LINK_CONTROLCONNECT\n"); + case LINK_CONTROLCONNECTWAIT: + return sprintf(buf, "LINK_CONTROLCONNECTWAIT\n"); + case LINK_INITVNICREQ: + return sprintf(buf, "LINK_INITVNICREQ\n"); + case LINK_INITVNICRSP: + return sprintf(buf, "LINK_INITVNICRSP\n"); + case LINK_BEGINDATAPATH: + return sprintf(buf, "LINK_BEGINDATAPATH\n"); + case LINK_CONFIGDATAPATHREQ: + return sprintf(buf, "LINK_CONFIGDATAPATHREQ\n"); + case LINK_CONFIGDATAPATHRSP: + return sprintf(buf, "LINK_CONFIGDATAPATHRSP\n"); + case LINK_DATACONNECT: + return sprintf(buf, "LINK_DATACONNECT\n"); + case LINK_DATACONNECTWAIT: + return sprintf(buf, "LINK_DATACONNECTWAIT\n"); + case LINK_XCHGPOOLREQ: + return sprintf(buf, "LINK_XCHGPOOLREQ\n"); + case LINK_XCHGPOOLRSP: + return sprintf(buf, "LINK_XCHGPOOLRSP\n"); + case LINK_INITIALIZED: + return sprintf(buf, "LINK_INITIALIZED\n"); + case LINK_IDLE: + return sprintf(buf, "LINK_IDLE\n"); + case LINK_IDLING: + return sprintf(buf, "LINK_IDLING\n"); + case LINK_CONFIGLINKREQ: + return sprintf(buf, "LINK_CONFIGLINKREQ\n"); + case LINK_CONFIGLINKRSP: + return sprintf(buf, "LINK_CONFIGLINKRSP\n"); + case LINK_CONFIGADDRSREQ: + return sprintf(buf, "LINK_CONFIGADDRSREQ\n"); + case LINK_CONFIGADDRSRSP: + return sprintf(buf, "LINK_CONFIGADDRSRSP\n"); + case LINK_REPORTSTATREQ: + return sprintf(buf, "LINK_REPORTSTATREQ\n"); + case LINK_REPORTSTATRSP: + return sprintf(buf, "LINK_REPORTSTATRSP\n"); + case LINK_HEARTBEATREQ: + return sprintf(buf, "LINK_HEARTBEATREQ\n"); + case LINK_HEARTBEATRSP: + return sprintf(buf, "LINK_HEARTBEATRSP\n"); + case LINK_RESET: + return sprintf(buf, "LINK_RESET\n"); + case LINK_RESETRSP: + return sprintf(buf, "LINK_RESETRSP\n"); + case LINK_RESETCONTROL: + return sprintf(buf, "LINK_RESETCONTROL\n"); + case LINK_RESETCONTROLRSP: + return sprintf(buf, "LINK_RESETCONTROLRSP\n"); + case LINK_DATADISCONNECT: + return sprintf(buf, "LINK_DATADISCONNECT\n"); + case LINK_CONTROLDISCONNECT: + return sprintf(buf, "LINK_CONTROLDISCONNECT\n"); + case LINK_CLEANUPDATA: + return sprintf(buf, "LINK_CLEANUPDATA\n"); + case LINK_CLEANUPCONTROL: + return sprintf(buf, "LINK_CLEANUPCONTROL\n"); + case LINK_DISCONNECTED: + return sprintf(buf, "LINK_DISCONNECTED\n"); + case LINK_RETRYWAIT: + return sprintf(buf, "LINK_RETRYWAIT\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + + } + +} +static DEVICE_ATTR(link_state, S_IRUGO, show_link_state, NULL); + +static ssize_t show_heartbeat(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + /* hb_inteval is in jiffies, convert it back to + * 1/100ths of a second + */ + return sprintf(buf, "%d\n", + (jiffies_to_msecs(path->viport->config->hb_interval)/10)); +} + +static DEVICE_ATTR(heartbeat, S_IRUGO, show_heartbeat, NULL); + +static ssize_t show_ioc_guid(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "%llx\n", + __be64_to_cpu(path->viport->config->ioc_guid)); +} + +static DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); + +static inline void get_dgid_string(u8 *dgid, char *buf) +{ + int i; + char holder[5]; + + for (i = 0; i < 16; i += 2) { + 
sprintf(holder, "%04x", be16_to_cpu(*(__be16 *)&dgid[i])); + strcat(buf, holder); + } + + strcat(buf, "\n"); +} + +static ssize_t show_dgid(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + get_dgid_string(path->viport->config->path_info.path.dgid.raw, buf); + + return strlen(buf); +} + +static DEVICE_ATTR(dgid, S_IRUGO, show_dgid, NULL); + +static ssize_t show_pkey(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "%x\n", path->viport->config->path_info.path.pkey); +} + +static DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t show_hca_info(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "vnic-%s-%d\n", path->viport->config->ibdev->name, + path->viport->config->port); +} + +static DEVICE_ATTR(hca_info, S_IRUGO, show_hca_info, NULL); + +static ssize_t show_ioc_string(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "%s\n", path->viport->config->ioc_string); +} + +static DEVICE_ATTR(ioc_string, S_IRUGO, show_ioc_string, NULL); + +static ssize_t show_multicast_state(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + if (!(path->viport->features_supported & VNIC_FEAT_INBOUND_IB_MC)) + return sprintf(buf, "feature not enabled\n"); + + switch (path->viport->mc_info.state) { + case MCAST_STATE_INVALID: + return sprintf(buf, "state=Invalid\n"); + case MCAST_STATE_JOINING: + return sprintf(buf, "state=Joining MGID:" VNIC_GID_FMT "\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw)); + case MCAST_STATE_ATTACHING: + return sprintf(buf, "state=Attaching MGID:" VNIC_GID_FMT + " MLID:%X\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw), + path->viport->mc_info.mlid); + case MCAST_STATE_JOINED_ATTACHED: + return sprintf(buf, + "state=Joined & Attached MGID:" VNIC_GID_FMT + " MLID:%X\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw), + path->viport->mc_info.mlid); + case MCAST_STATE_DETACHING: + return sprintf(buf, "state=Detaching MGID: " VNIC_GID_FMT "\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw)); + case MCAST_STATE_RETRIED: + return sprintf(buf, "state=Retries Exceeded\n"); + } + return sprintf(buf, "invalid state\n"); +} + +static DEVICE_ATTR(multicast_state, S_IRUGO, show_multicast_state, NULL); + +static struct attribute *vnic_path_attrs[] = { + &dev_attr_viport_state.attr, + &dev_attr_link_state.attr, + &dev_attr_heartbeat.attr, + &dev_attr_ioc_guid.attr, + &dev_attr_dgid.attr, + &dev_attr_pkey.attr, + &dev_attr_hca_info.attr, + &dev_attr_ioc_string.attr, + &dev_attr_multicast_state.attr, + NULL +}; + +struct attribute_group vnic_path_attr_group = { + .attrs = vnic_path_attrs, +}; + + +static int setup_path_class_files(struct netpath *path, char *name) +{ + 
init_completion(&path->dev_info.released);
+
+	path->dev_info.dev.class = NULL;
+	path->dev_info.dev.parent = &path->parent->dev_info.dev;
+	path->dev_info.dev.release = vnic_release_dev;
+	snprintf(path->dev_info.dev.bus_id, BUS_ID_SIZE, "%s", name);
+
+	if (device_register(&path->dev_info.dev)) {
+		SYS_ERROR("error in registering path class dev\n");
+		goto out;
+	}
+
+	if (sysfs_create_group(&path->dev_info.dev.kobj,
+			       &vnic_path_attr_group)) {
+		SYS_ERROR("error in creating vnic path group attrs");
+		goto err_path;
+	}
+
+	return 0;
+
+err_path:
+	device_unregister(&path->dev_info.dev);
+	wait_for_completion(&path->dev_info.released);
+out:
+	return -1;
+
+}
+
+static inline void update_dgids(u8 *old, u8 *new, char *vnic_name,
+				char *path_name)
+{
+	int i;
+
+	if (!memcmp(old, new, 16))
+		return;
+
+	printk(KERN_INFO PFX "Changing dgid from 0x");
+	print_dgid(old);
+	printk(" to 0x");
+	print_dgid(new);
+	printk(" for %s path of %s\n", path_name, vnic_name);
+	for (i = 0; i < 16; i++)
+		old[i] = new[i];
+}
+
+static inline void update_ioc_guids(struct path_param *params,
+				    struct netpath *path,
+				    char *vnic_name, char *path_name)
+{
+	u64 sid;
+
+	if (path->viport->config->ioc_guid == params->ioc_guid)
+		return;
+
+	printk(KERN_INFO PFX "Changing IOC GUID from 0x%llx to 0x%llx "
+	       "for %s path of %s\n",
+	       __be64_to_cpu(path->viport->config->ioc_guid),
+	       __be64_to_cpu(params->ioc_guid), path_name, vnic_name);
+
+	path->viport->config->ioc_guid = params->ioc_guid;
+
+	sid = (SST_AGN << 56) | (SST_OUI << 32) | (CONTROL_PATH_ID << 8)
+		| IOC_NUMBER(be64_to_cpu(params->ioc_guid));
+
+	path->viport->config->control_config.ib_config.service_id =
+		cpu_to_be64(sid);
+
+	sid = (SST_AGN << 56) | (SST_OUI << 32) | (DATA_PATH_ID << 8)
+		| IOC_NUMBER(be64_to_cpu(params->ioc_guid));
+
+	path->viport->config->data_config.ib_config.service_id =
+		cpu_to_be64(sid);
+}
+
+static inline void update_pkeys(__be16 *old, __be16 *new, char *vnic_name,
+				char *path_name)
+{
+	if (*old == *new)
+		return;
+
+	printk(KERN_INFO PFX "Changing P_Key from 0x%x to 0x%x "
+	       "for %s path of %s\n", be16_to_cpu(*old), be16_to_cpu(*new),
+	       path_name, vnic_name);
+	*old = *new;
+}
+
+static void update_ioc_strings(struct path_param *params, struct netpath *path,
+			       char *path_name)
+{
+	if (!strcmp(params->ioc_string, path->viport->config->ioc_string))
+		return;
+
+	printk(KERN_INFO PFX "Changing ioc_string to %s for %s path of %s\n",
+	       params->ioc_string, path_name, params->name);
+
+	strcpy(path->viport->config->ioc_string, params->ioc_string);
+}
+
+static void update_path_parameters(struct path_param *params,
+				   struct netpath *path)
+{
+	update_dgids(path->viport->config->path_info.path.dgid.raw,
+		     params->dgid, params->name,
+		     (path->second_bias ? "secondary" : "primary"));
+
+	update_ioc_guids(params, path, params->name,
+			 (path->second_bias ? "secondary" : "primary"));
+
+	update_pkeys(&path->viport->config->path_info.path.pkey,
+		     &params->pkey, params->name,
+		     (path->second_bias ? "secondary" : "primary"));
+
+	update_ioc_strings(params, path,
+			   (path->second_bias ? "secondary" : "primary"));
+}
+
+static ssize_t update_params_and_connect(struct path_param *params,
+					 struct netpath *path, size_t count)
+{
+	if (is_dgid_zero(params->dgid) && params->ioc_guid != 0 &&
+	    params->pkey != 0) {
+
+		if (!memcmp(path->viport->config->path_info.path.dgid.raw,
+			    params->dgid, 16) &&
+		    params->ioc_guid == path->viport->config->ioc_guid &&
+		    params->pkey == path->viport->config->path_info.path.pkey) {
+
+			printk(KERN_WARNING PFX "dgid, ioc_guid and pkey are "
+			       "the same as the existing ones."
+			       " Not updating values.\n");
+			return -EINVAL;
+		} else {
+			if (path->viport->state == VIPORT_CONNECTED) {
+				printk(KERN_WARNING PFX "%s path of %s "
+				       "interface is already in connected "
+				       "state. Not updating values.\n",
+				       (path->second_bias ? "Secondary" : "Primary"),
+				       path->parent->config->name);
+				return -EINVAL;
+			} else {
+				update_path_parameters(params, path);
+				viport_kick(path->viport);
+				vnic_disconnected(path->parent, path);
+				return count;
+			}
+		}
+	} else {
+		printk(KERN_WARNING PFX "One of dgid, ioc_guid or pkey is "
+		       "zero. Not updating values.\n");
+		return -EINVAL;
+	}
+}
+
+ssize_t vnic_create_primary(struct device *dev,
+			    struct device_attribute *dev_attr, const char *buf,
+			    size_t count)
+{
+	struct dev_info *info = container_of(dev, struct dev_info, dev);
+	struct vnic_ib_port *target =
+		container_of(info, struct vnic_ib_port, pdev_info);
+
+	struct path_param param;
+	int ret = -EINVAL;
+	struct vnic *vnic;
+	struct list_head *ptr;
+
+	param.instance = 0;
+	param.rx_csum = -1;
+	param.tx_csum = -1;
+	param.heartbeat = -1;
+	param.ib_multicast = -1;
+	*param.ioc_string = '\0';
+
+	ret = vnic_parse_options(buf, &param);
+
+	if (ret)
+		goto out;
+
+	list_for_each(ptr, &vnic_list) {
+		vnic = list_entry(ptr, struct vnic, list_ptrs);
+		if (!strcmp(vnic->config->name, param.name)) {
+			ret = update_params_and_connect(&param,
+							&vnic->primary_path,
+							count);
+			goto out;
+		}
+	}
+
+	param.ibdev = target->dev->dev;
+	param.ibport = target;
+	param.port = target->port_num;
+
+	vnic = create_vnic(&param);
+	if (!vnic) {
+		printk(KERN_ERR PFX "creating vnic failed\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (create_netpath(&vnic->primary_path, &param)) {
+		printk(KERN_ERR PFX "creating primary netpath failed\n");
+		goto free_vnic;
+	}
+
+	if (setup_path_class_files(&vnic->primary_path, "primary_path"))
+		goto free_vnic;
+
+	if (vnic && !vnic->primary_path.viport) {
+		printk(KERN_ERR PFX "no valid netpaths\n");
+		goto free_vnic;
+	}
+
+	return count;
+
+free_vnic:
+	vnic_free(vnic);
+	ret = -EINVAL;
+out:
+	return ret;
+}
+
+ssize_t vnic_create_secondary(struct device *dev,
+			      struct device_attribute *dev_attr,
+			      const char *buf, size_t count)
+{
+	struct dev_info *info = container_of(dev, struct dev_info, dev);
+	struct vnic_ib_port *target =
+		container_of(info, struct vnic_ib_port, pdev_info);
+
+	struct path_param param;
+	struct vnic *vnic = NULL;
+	int ret = -EINVAL;
+	struct list_head *ptr;
+	int found = 0;
+
+	param.instance = 0;
+	param.rx_csum = -1;
+	param.tx_csum = -1;
+	param.heartbeat = -1;
+	param.ib_multicast = -1;
+	*param.ioc_string = '\0';
+
+	ret = vnic_parse_options(buf, &param);
+
+	if (ret)
+		goto out;
+
+	list_for_each(ptr, &vnic_list) {
+		vnic = list_entry(ptr, struct vnic, list_ptrs);
+		if (!strncmp(vnic->config->name, param.name, IFNAMSIZ)) {
+			if (vnic->secondary_path.viport) {
+				ret = update_params_and_connect(&param,
+								&vnic->secondary_path,
+								count);
+				goto out;
+			}
+			found = 1;
+			break;
+		}
+	}
+
+	if (!found) {
+		printk(KERN_ERR PFX
+		       "primary connection with name '%s' does not exist\n",
+		       param.name);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	param.ibdev = target->dev->dev;
+	param.ibport = target;
+	param.port = target->port_num;
+
+	if (create_netpath(&vnic->secondary_path, &param)) {
+		printk(KERN_ERR PFX "creating secondary netpath failed\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (setup_path_class_files(&vnic->secondary_path, "secondary_path"))
+		goto free_vnic;
+
+	return count;
+
+free_vnic:
+	vnic_free(vnic);
+	ret = -EINVAL;
+out:
+	return ret;
+}
diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h
new file mode 100644
index 0000000..b41e770
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h
@@ -0,0 +1,62 @@
+/*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *	- Redistributions of source code must retain the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer.
+ *
+ *	- Redistributions in binary form must reproduce the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer in the documentation and/or other materials
+ *	  provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */ + +#ifndef VNIC_SYS_H_INCLUDED +#define VNIC_SYS_H_INCLUDED + +struct dev_info { + struct device dev; + struct completion released; +}; + +extern struct class vnic_class; +extern struct dev_info interface_dev; +extern struct attribute_group vnic_dev_attr_group; +extern struct attribute_group vnic_path_attr_group; +extern struct device_attribute dev_attr_create_primary; +extern struct device_attribute dev_attr_create_secondary; +extern struct device_attribute dev_attr_delete_vnic; + +extern void vnic_release_dev(struct device *dev); + +extern ssize_t vnic_create_primary(struct device *dev, + struct device_attribute *dev_attr, + const char *buf, size_t count); + +extern ssize_t vnic_create_secondary(struct device *dev, + struct device_attribute *dev_attr, + const char *buf, size_t count); + +extern ssize_t vnic_delete(struct device *dev, + struct device_attribute *dev_attr, + const char *buf, size_t count); +#endif /*VNIC_SYS_H_INCLUDED*/ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:36:29 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:06:29 +0530 Subject: [ofa-general] [PATCH v2 10/13] QLogic VNIC: Driver Statistics collection In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103629.12355.46869.stgit@localhost.localdomain> From: Amar Mudrankit Collection of statistics about QLogic VNIC interfaces is implemented in this patch. Signed-off-by: Amar Mudrankit Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath --- drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c | 234 ++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h | 497 +++++++++++++++++++++++++ 2 files changed, 731 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c new file mode 100644 index 0000000..d11a8df --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c @@ -0,0 +1,234 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include + +#include "vnic_main.h" + +cycles_t vnic_recv_ref; + +/* + * TODO: Statistics reporting for control path, data path, + * RDMA times, IOs etc + * + */ +static ssize_t show_lifetime(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time = get_cycles() - vnic->statistics.start_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static DEVICE_ATTR(lifetime, S_IRUGO, show_lifetime, NULL); + +static ssize_t show_conntime(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + if (vnic->statistics.conn_time) + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.conn_time); + return 0; +} + +static DEVICE_ATTR(connection_time, S_IRUGO, show_conntime, NULL); + +static ssize_t show_disconnects(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + u32 num; + + if (vnic->statistics.disconn_ref) + num = vnic->statistics.disconn_num + 1; + else + num = vnic->statistics.disconn_num; + + return sprintf(buf, "%d\n", num); +} + +static DEVICE_ATTR(disconnects, S_IRUGO, show_disconnects, NULL); + +static ssize_t show_total_disconn_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time; + + if (vnic->statistics.disconn_ref) + time = vnic->statistics.disconn_time + + get_cycles() - vnic->statistics.disconn_ref; + else + time = vnic->statistics.disconn_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static DEVICE_ATTR(total_disconn_time, S_IRUGO, show_total_disconn_time, NULL); + +static ssize_t show_carrier_losses(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + u32 num; + + if (vnic->statistics.carrier_ref) + num = vnic->statistics.carrier_off_num + 1; + else + num = vnic->statistics.carrier_off_num; + + return sprintf(buf, "%d\n", num); +} + +static DEVICE_ATTR(carrier_losses, S_IRUGO, show_carrier_losses, NULL); + +static ssize_t show_total_carr_loss_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time; + + if (vnic->statistics.carrier_ref) + time = vnic->statistics.carrier_off_time + + get_cycles() - vnic->statistics.carrier_ref; + else + time = vnic->statistics.carrier_off_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static DEVICE_ATTR(total_carrier_loss_time, S_IRUGO, + show_total_carr_loss_time, NULL); + +static ssize_t show_total_recv_time(struct 
device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.recv_time); +} + +static DEVICE_ATTR(total_recv_time, S_IRUGO, show_total_recv_time, NULL); + +static ssize_t show_recvs(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.recv_num); +} + +static DEVICE_ATTR(recvs, S_IRUGO, show_recvs, NULL); + +static ssize_t show_multicast_recvs(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.multicast_recv_num); +} + +static DEVICE_ATTR(multicast_recvs, S_IRUGO, show_multicast_recvs, NULL); + +static ssize_t show_total_xmit_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.xmit_time); +} + +static DEVICE_ATTR(total_xmit_time, S_IRUGO, show_total_xmit_time, NULL); + +static ssize_t show_xmits(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.xmit_num); +} + +static DEVICE_ATTR(xmits, S_IRUGO, show_xmits, NULL); + +static ssize_t show_failed_xmits(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.xmit_fail); +} + +static DEVICE_ATTR(failed_xmits, S_IRUGO, show_failed_xmits, NULL); + +static struct attribute *vnic_stats_attrs[] = { + &dev_attr_lifetime.attr, + &dev_attr_xmits.attr, + &dev_attr_total_xmit_time.attr, + &dev_attr_failed_xmits.attr, + &dev_attr_recvs.attr, + &dev_attr_multicast_recvs.attr, + &dev_attr_total_recv_time.attr, + &dev_attr_connection_time.attr, + &dev_attr_disconnects.attr, + &dev_attr_total_disconn_time.attr, + &dev_attr_carrier_losses.attr, + &dev_attr_total_carrier_loss_time.attr, + NULL +}; + +struct attribute_group vnic_stats_attr_group = { + .attrs = vnic_stats_attrs, +}; diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h new file mode 100644 index 0000000..a241b71 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h @@ -0,0 +1,497 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_STATS_H_INCLUDED +#define VNIC_STATS_H_INCLUDED + +#include "vnic_main.h" +#include "vnic_ib.h" +#include "vnic_sys.h" + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + +static inline void vnic_connected_stats(struct vnic *vnic) +{ + if (vnic->statistics.conn_time == 0) { + vnic->statistics.conn_time = + get_cycles() - vnic->statistics.start_time; + } + + if (vnic->statistics.disconn_ref != 0) { + vnic->statistics.disconn_time += + get_cycles() - vnic->statistics.disconn_ref; + vnic->statistics.disconn_num++; + vnic->statistics.disconn_ref = 0; + } + +} + +static inline void vnic_stop_xmit_stats(struct vnic *vnic) +{ + if (vnic->statistics.xmit_ref == 0) + vnic->statistics.xmit_ref = get_cycles(); +} + +static inline void vnic_restart_xmit_stats(struct vnic *vnic) +{ + if (vnic->statistics.xmit_ref != 0) { + vnic->statistics.xmit_off_time += + get_cycles() - vnic->statistics.xmit_ref; + vnic->statistics.xmit_off_num++; + vnic->statistics.xmit_ref = 0; + } +} + +static inline void vnic_recv_pkt_stats(struct vnic *vnic) +{ + vnic->statistics.recv_time += get_cycles() - vnic_recv_ref; + vnic->statistics.recv_num++; +} + +static inline void vnic_multicast_recv_pkt_stats(struct vnic *vnic) +{ + vnic->statistics.multicast_recv_num++; +} + +static inline void vnic_pre_pkt_xmit_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void vnic_post_pkt_xmit_stats(struct vnic *vnic, + cycles_t time) +{ + vnic->statistics.xmit_time += get_cycles() - time; + vnic->statistics.xmit_num++; + +} + +static inline void vnic_xmit_fail_stats(struct vnic *vnic) +{ + vnic->statistics.xmit_fail++; +} + +static inline void vnic_carrier_loss_stats(struct vnic *vnic) +{ + if (vnic->statistics.carrier_ref != 0) { + vnic->statistics.carrier_off_time += + get_cycles() - vnic->statistics.carrier_ref; + vnic->statistics.carrier_off_num++; + vnic->statistics.carrier_ref = 0; + } +} + +static inline int vnic_setup_stats_files(struct vnic *vnic) +{ + init_completion(&vnic->stat_info.released); + vnic->stat_info.dev.class = NULL; + vnic->stat_info.dev.parent = &vnic->dev_info.dev; + vnic->stat_info.dev.release = vnic_release_dev; + snprintf(vnic->stat_info.dev.bus_id, BUS_ID_SIZE, + "stats"); + + if (device_register(&vnic->stat_info.dev)) { + SYS_ERROR("create_vnic: error 
in registering" + " stat class dev\n"); + goto stats_out; + } + + if (sysfs_create_group(&vnic->stat_info.dev.kobj, + &vnic_stats_attr_group)) + goto err_stats_file; + + return 0; +err_stats_file: + device_unregister(&vnic->stat_info.dev); + wait_for_completion(&vnic->stat_info.released); +stats_out: + return -1; +} + +static inline void vnic_cleanup_stats_files(struct vnic *vnic) +{ + sysfs_remove_group(&vnic->dev_info.dev.kobj, + &vnic_stats_attr_group); + device_unregister(&vnic->stat_info.dev); + wait_for_completion(&vnic->stat_info.released); +} + +static inline void vnic_disconn_stats(struct vnic *vnic) +{ + if (!vnic->statistics.disconn_ref) + vnic->statistics.disconn_ref = get_cycles(); + + if (vnic->statistics.carrier_ref == 0) + vnic->statistics.carrier_ref = get_cycles(); +} + +static inline void vnic_alloc_stats(struct vnic *vnic) +{ + vnic->statistics.start_time = get_cycles(); +} + +static inline void control_note_rsptime_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void control_update_rsptime_stats(struct control *control, + cycles_t response_time) +{ + response_time -= control->statistics.request_time; + control->statistics.response_time += response_time; + control->statistics.response_num++; + if (control->statistics.response_max < response_time) + control->statistics.response_max = response_time; + if ((control->statistics.response_min == 0) || + (control->statistics.response_min > response_time)) + control->statistics.response_min = response_time; + +} + +static inline void control_note_reqtime_stats(struct control *control) +{ + control->statistics.request_time = get_cycles(); +} + +static inline void control_timeout_stats(struct control *control) +{ + control->statistics.timeout_num++; +} + +static inline void data_kickreq_stats(struct data *data) +{ + data->statistics.kick_reqs++; +} + +static inline void data_no_xmitbuf_stats(struct data *data) +{ + data->statistics.no_xmit_bufs++; +} + +static inline void data_xmits_stats(struct data *data) +{ + data->statistics.xmit_num++; +} + +static inline void data_recvs_stats(struct data *data) +{ + data->statistics.recv_num++; +} + +static inline void data_note_kickrcv_time(void) +{ + vnic_recv_ref = get_cycles(); +} + +static inline void data_rcvkicks_stats(struct data *data) +{ + data->statistics.kick_recvs++; +} + + +static inline void vnic_ib_conntime_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.connection_time = get_cycles(); +} + +static inline void vnic_ib_note_comptime_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void vnic_ib_callback_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.num_callbacks++; +} + +static inline void vnic_ib_comp_stats(struct vnic_ib_conn *ib_conn, + u32 *comp_num) +{ + ib_conn->statistics.num_ios++; + *comp_num = *comp_num + 1; + +} + +static inline void vnic_ib_io_stats(struct io *io, + struct vnic_ib_conn *ib_conn, + cycles_t comp_time) +{ + if ((io->type == RECV) || (io->type == RECV_UD)) + io->time = comp_time; + else if (io->type == RDMA) { + ib_conn->statistics.rdma_comp_time += comp_time - io->time; + ib_conn->statistics.rdma_comp_ios++; + } else if (io->type == SEND) { + ib_conn->statistics.send_comp_time += comp_time - io->time; + ib_conn->statistics.send_comp_ios++; + } +} + +static inline void vnic_ib_maxio_stats(struct vnic_ib_conn *ib_conn, + u32 comp_num) +{ + if (comp_num > ib_conn->statistics.max_ios) + ib_conn->statistics.max_ios = comp_num; +} + +static inline void vnic_ib_connected_time_stats(struct 
vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.connection_time = + get_cycles() - ib_conn->statistics.connection_time; + +} + +static inline void vnic_ib_pre_rcvpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t *time) +{ + *time = get_cycles(); + if (io->time != 0) { + ib_conn->statistics.recv_comp_time += *time - io->time; + ib_conn->statistics.recv_comp_ios++; + } + +} + +static inline void vnic_ib_post_rcvpost_stats(struct vnic_ib_conn *ib_conn, + cycles_t time) +{ + ib_conn->statistics.recv_post_time += get_cycles() - time; + ib_conn->statistics.recv_post_ios++; +} + +static inline void vnic_ib_pre_sendpost_stats(struct io *io, + cycles_t *time) +{ + io->time = *time = get_cycles(); +} + +static inline void vnic_ib_post_sendpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t time) +{ + time = get_cycles() - time; + if (io->swr.opcode == IB_WR_RDMA_WRITE) { + ib_conn->statistics.rdma_post_time += time; + ib_conn->statistics.rdma_post_ios++; + } else { + ib_conn->statistics.send_post_time += time; + ib_conn->statistics.send_post_ios++; + } +} +#else /*CONFIG_INIFINIBAND_VNIC_STATS*/ + +static inline void vnic_connected_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_stop_xmit_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_restart_xmit_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_recv_pkt_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_multicast_recv_pkt_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_pre_pkt_xmit_stats(cycles_t *time) +{ + ; +} + +static inline void vnic_post_pkt_xmit_stats(struct vnic *vnic, + cycles_t time) +{ + ; +} + +static inline void vnic_xmit_fail_stats(struct vnic *vnic) +{ + ; +} + +static inline int vnic_setup_stats_files(struct vnic *vnic) +{ + return 0; +} + +static inline void vnic_cleanup_stats_files(struct vnic *vnic) +{ + ; +} + +static inline void vnic_carrier_loss_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_disconn_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_alloc_stats(struct vnic *vnic) +{ + ; +} + +static inline void control_note_rsptime_stats(cycles_t *time) +{ + ; +} + +static inline void control_update_rsptime_stats(struct control *control, + cycles_t response_time) +{ + ; +} + +static inline void control_note_reqtime_stats(struct control *control) +{ + ; +} + +static inline void control_timeout_stats(struct control *control) +{ + ; +} + +static inline void data_kickreq_stats(struct data *data) +{ + ; +} + +static inline void data_no_xmitbuf_stats(struct data *data) +{ + ; +} + +static inline void data_xmits_stats(struct data *data) +{ + ; +} + +static inline void data_recvs_stats(struct data *data) +{ + ; +} + +static inline void data_note_kickrcv_time(void) +{ + ; +} + +static inline void data_rcvkicks_stats(struct data *data) +{ + ; +} + +static inline void vnic_ib_conntime_stats(struct vnic_ib_conn *ib_conn) +{ + ; +} + +static inline void vnic_ib_note_comptime_stats(cycles_t *time) +{ + ; +} + +static inline void vnic_ib_callback_stats(struct vnic_ib_conn *ib_conn) + +{ + ; +} +static inline void vnic_ib_comp_stats(struct vnic_ib_conn *ib_conn, + u32 *comp_num) +{ + ; +} + +static inline void vnic_ib_io_stats(struct io *io, + struct vnic_ib_conn *ib_conn, + cycles_t comp_time) +{ + ; +} + +static inline void vnic_ib_maxio_stats(struct vnic_ib_conn *ib_conn, + u32 comp_num) +{ + ; +} + +static inline void vnic_ib_connected_time_stats(struct vnic_ib_conn *ib_conn) +{ + ; +} + +static inline void 
vnic_ib_pre_rcvpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t *time) +{ + ; +} + +static inline void vnic_ib_post_rcvpost_stats(struct vnic_ib_conn *ib_conn, + cycles_t time) +{ + ; +} + +static inline void vnic_ib_pre_sendpost_stats(struct io *io, + cycles_t *time) +{ + ; +} + +static inline void vnic_ib_post_sendpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t time) +{ + ; +} +#endif /*CONFIG_INIFINIBAND_VNIC_STATS*/ + +#endif /*VNIC_STATS_H_INCLUDED*/ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:37:00 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:07:00 +0530 Subject: [ofa-general] [PATCH v2 11/13] QLogic VNIC: Driver utility file - implements various utility macros In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103700.12355.20370.stgit@localhost.localdomain> From: Poornima Kamath This patch adds the driver utility file which mainly contains utility macros for debugging of QLogic VNIC driver. Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_util.h | 250 ++++++++++++++++++++++++++ 1 files changed, 250 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_util.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h new file mode 100644 index 0000000..572e338 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h @@ -0,0 +1,250 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_UTIL_H_INCLUDED +#define VNIC_UTIL_H_INCLUDED + +#define MODULE_NAME "QLGC_VNIC" + +#define VNIC_MAJORVERSION 1 +#define VNIC_MINORVERSION 1 + +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1))) + +extern u32 vnic_debug; + +enum { + DEBUG_IB_INFO = 0x00000001, + DEBUG_IB_FUNCTION = 0x00000002, + DEBUG_IB_FSTATUS = 0x00000004, + DEBUG_IB_ASSERTS = 0x00000008, + DEBUG_CONTROL_INFO = 0x00000010, + DEBUG_CONTROL_FUNCTION = 0x00000020, + DEBUG_CONTROL_PACKET = 0x00000040, + DEBUG_CONFIG_INFO = 0x00000100, + DEBUG_DATA_INFO = 0x00001000, + DEBUG_DATA_FUNCTION = 0x00002000, + DEBUG_NETPATH_INFO = 0x00010000, + DEBUG_VIPORT_INFO = 0x00100000, + DEBUG_VIPORT_FUNCTION = 0x00200000, + DEBUG_LINK_STATE = 0x00400000, + DEBUG_VNIC_INFO = 0x01000000, + DEBUG_VNIC_FUNCTION = 0x02000000, + DEBUG_MCAST_INFO = 0x04000000, + DEBUG_MCAST_FUNCTION = 0x08000000, + DEBUG_SYS_INFO = 0x10000000, + DEBUG_SYS_VERBOSE = 0x40000000 +}; + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_DEBUG +#define PRINT(level, x, fmt, arg...) \ + printk(level "%s: %s: %s, line %d: " fmt, \ + MODULE_NAME, x, __FILE__, __LINE__, ##arg) + +#define PRINT_CONDITIONAL(level, x, condition, fmt, arg...) \ + do { \ + if (condition) \ + printk(level "%s: %s: %s, line %d: " fmt, \ + MODULE_NAME, x, __FILE__, __LINE__, \ + ##arg); \ + } while (0) +#else +#define PRINT(level, x, fmt, arg...) \ + printk(level "%s: " fmt, MODULE_NAME, ##arg) + +#define PRINT_CONDITIONAL(level, x, condition, fmt, arg...) \ + do { \ + if (condition) \ + printk(level "%s: %s: " fmt, \ + MODULE_NAME, x, ##arg); \ + } while (0) +#endif /*CONFIG_INFINIBAND_QLGC_VNIC_DEBUG*/ + +#define IB_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "IB", fmt, ##arg) +#define IB_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "IB", fmt, ##arg) + +#define IB_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "IB", \ + (vnic_debug & DEBUG_IB_FUNCTION), \ + fmt, ##arg) + +#define IB_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "IB", \ + (vnic_debug & DEBUG_IB_INFO), \ + fmt, ##arg) + +#define IB_ASSERT(x) \ + do { \ + if ((vnic_debug & DEBUG_IB_ASSERTS) && !(x)) \ + panic("%s assertion failed, file: %s," \ + " line %d: ", \ + MODULE_NAME, __FILE__, __LINE__) \ + } while (0) + +#define CONTROL_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "CONTROL", fmt, ##arg) +#define CONTROL_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "CONTROL", fmt, ##arg) + +#define CONTROL_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONTROL", \ + (vnic_debug & DEBUG_CONTROL_INFO), \ + fmt, ##arg) + +#define CONTROL_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONTROL", \ + (vnic_debug & DEBUG_CONTROL_FUNCTION), \ + fmt, ##arg) + +#define CONTROL_PACKET(pkt) \ + do { \ + if (vnic_debug & DEBUG_CONTROL_PACKET) \ + control_log_control_packet(pkt); \ + } while (0) + +#define CONFIG_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "CONFIG", fmt, ##arg) +#define CONFIG_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "CONFIG", fmt, ##arg) + +#define CONFIG_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONFIG", \ + (vnic_debug & DEBUG_CONFIG_INFO), \ + fmt, ##arg) + +#define DATA_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "DATA", fmt, ##arg) +#define DATA_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "DATA", fmt, ##arg) + +#define DATA_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "DATA", \ + (vnic_debug & DEBUG_DATA_INFO), \ + fmt, ##arg) + +#define DATA_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "DATA", \ + (vnic_debug & DEBUG_DATA_FUNCTION), \ + fmt, ##arg) + + +#define MCAST_PRINT(fmt, arg...) 
\ + PRINT(KERN_INFO, "MCAST", fmt, ##arg) +#define MCAST_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "MCAST", fmt, ##arg) + +#define MCAST_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "MCAST", \ + (vnic_debug & DEBUG_MCAST_INFO), \ + fmt, ##arg) + +#define MCAST_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "MCAST", \ + (vnic_debug & DEBUG_MCAST_FUNCTION), \ + fmt, ##arg) + +#define NETPATH_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "NETPATH", fmt, ##arg) +#define NETPATH_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "NETPATH", fmt, ##arg) + +#define NETPATH_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NETPATH", \ + (vnic_debug & DEBUG_NETPATH_INFO), \ + fmt, ##arg) + +#define VIPORT_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "VIPORT", fmt, ##arg) +#define VIPORT_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "VIPORT", fmt, ##arg) + +#define VIPORT_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "VIPORT", \ + (vnic_debug & DEBUG_VIPORT_INFO), \ + fmt, ##arg) + +#define VIPORT_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "VIPORT", \ + (vnic_debug & DEBUG_VIPORT_FUNCTION), \ + fmt, ##arg) + +#define LINK_STATE(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "LINK", \ + (vnic_debug & DEBUG_LINK_STATE), \ + fmt, ##arg) + +#define VNIC_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "NIC", fmt, ##arg) +#define VNIC_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "NIC", fmt, ##arg) +#define VNIC_INIT(fmt, arg...) \ + PRINT(KERN_INFO, "NIC", fmt, ##arg) + +#define VNIC_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NIC", \ + (vnic_debug & DEBUG_VNIC_INFO), \ + fmt, ##arg) + +#define VNIC_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NIC", \ + (vnic_debug & DEBUG_VNIC_FUNCTION), \ + fmt, ##arg) + +#define SYS_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "SYS", fmt, ##arg) +#define SYS_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "SYS", fmt, ##arg) + +#define SYS_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "SYS", \ + (vnic_debug & DEBUG_SYS_INFO), \ + fmt, ##arg) + +#endif /* VNIC_UTIL_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:37:30 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:07:30 +0530 Subject: [ofa-general] [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103730.12355.14730.stgit@localhost.localdomain> From: Ramachandra K Kconfig and Makefile for the QLogic VNIC driver. Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/Kconfig | 28 ++++++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/Makefile | 13 +++++++++++++ 2 files changed, 41 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Kconfig create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Makefile diff --git a/drivers/infiniband/ulp/qlgc_vnic/Kconfig b/drivers/infiniband/ulp/qlgc_vnic/Kconfig new file mode 100644 index 0000000..6a08770 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/Kconfig @@ -0,0 +1,28 @@ +config INFINIBAND_QLGC_VNIC + tristate "QLogic VNIC - Support for QLogic Ethernet Virtual I/O Controller" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the QLogic Ethernet Virtual I/O Controller + (EVIC). 
In conjunction with the EVIC, this provides virtual + ethernet interfaces and transports ethernet packets over + InfiniBand so that you can communicate with Ethernet networks + using your IB device. + +config INFINIBAND_QLGC_VNIC_DEBUG + bool "QLogic VNIC Verbose debugging" + depends on INFINIBAND_QLGC_VNIC + default n + ---help--- + This option causes verbose debugging code to be compiled + into the QLogic VNIC driver. The output can be turned on via the + vnic_debug module parameter. + +config INFINIBAND_QLGC_VNIC_STATS + bool "QLogic VNIC Statistics" + depends on INFINIBAND_QLGC_VNIC + default n + ---help--- + This option compiles statistics collecting code into the + data path of the QLogic VNIC driver to help in profiling and fine + tuning. This adds some overhead in the interest of gathering + data. diff --git a/drivers/infiniband/ulp/qlgc_vnic/Makefile b/drivers/infiniband/ulp/qlgc_vnic/Makefile new file mode 100644 index 0000000..509dd67 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/Makefile @@ -0,0 +1,13 @@ +obj-$(CONFIG_INFINIBAND_QLGC_VNIC) += qlgc_vnic.o + +qlgc_vnic-y := vnic_main.o \ + vnic_ib.o \ + vnic_viport.o \ + vnic_control.o \ + vnic_data.o \ + vnic_netpath.o \ + vnic_config.o \ + vnic_sys.o \ + vnic_multicast.o + +qlgc_vnic-$(CONFIG_INFINIBAND_QLGC_VNIC_STATS) += vnic_stats.o From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:35:59 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:05:59 +0530 Subject: [ofa-general] [PATCH v2 09/13] QLogic VNIC: IB Multicast for Ethernet broadcast/multicast In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103559.12355.85037.stgit@localhost.localdomain> From: Usha Srinivasan Implementation of ethernet broadcasting and multicasting for QLogic VNIC interface by making use of underlying IB multicasting. Signed-off-by: Usha Srinivasan Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c | 319 +++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h | 77 +++++ 2 files changed, 396 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c new file mode 100644 index 0000000..f40ea20 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c @@ -0,0 +1,319 @@ +/* + * Copyright (c) 2008 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include "vnic_viport.h" +#include "vnic_main.h" +#include "vnic_util.h" + +static inline void vnic_set_multicast_state_invalid(struct viport *viport) +{ + viport->mc_info.state = MCAST_STATE_INVALID; + viport->mc_info.mc = NULL; + memset(&viport->mc_info.mgid, 0, sizeof(union ib_gid)); +} + +int vnic_mc_init(struct viport *viport) +{ + MCAST_FUNCTION("vnic_mc_init %p\n", viport); + vnic_set_multicast_state_invalid(viport); + viport->mc_info.retries = 0; + spin_lock_init(&viport->mc_info.lock); + + return 0; +} + +void vnic_mc_uninit(struct viport *viport) +{ + unsigned long flags; + MCAST_FUNCTION("vnic_mc_uninit %p\n", viport); + + spin_lock_irqsave(&viport->mc_info.lock, flags); + if ((viport->mc_info.state != MCAST_STATE_INVALID) && + (viport->mc_info.state != MCAST_STATE_RETRIED)) { + MCAST_ERROR("%s mcast state is not INVALID or RETRIED %d\n", + control_ifcfg_name(&viport->control), + viport->mc_info.state); + } + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_FUNCTION("vnic_mc_uninit done\n"); +} + + +/* This function is called when NEED_MCAST_COMPLETION is set. + * It finishes off the join multicast work. + */ +int vnic_mc_join_handle_completion(struct viport *viport) +{ + unsigned int ret = 0; + + MCAST_FUNCTION("vnic_mc_join_handle_completion()\n"); + if (viport->mc_info.state != MCAST_STATE_JOINING) { + MCAST_ERROR("%s unexpected mcast state in handle_completion: " + " %d\n", control_ifcfg_name(&viport->control), + viport->mc_info.state); + ret = -1; + goto out; + } + viport->mc_info.state = MCAST_STATE_ATTACHING; + MCAST_INFO("%s Attaching QP %lx mgid:" + VNIC_GID_FMT " mlid:%x\n", + control_ifcfg_name(&viport->control), jiffies, + VNIC_GID_RAW_ARG(viport->mc_info.mgid.raw), + viport->mc_info.mlid); + ret = ib_attach_mcast(viport->mc_data.ib_conn.qp, &viport->mc_info.mgid, + viport->mc_info.mlid); + if (ret) { + MCAST_ERROR("%s Attach mcast qp failed %d\n", + control_ifcfg_name(&viport->control), ret); + ret = -1; + goto out; + } + viport->mc_info.state = MCAST_STATE_JOINED_ATTACHED; + MCAST_INFO("%s UD QP successfully attached to mcast group\n", + control_ifcfg_name(&viport->control)); + +out: + return ret; +} + +/* NOTE: ib_sa.h says "returning a non-zero value from this callback will + * result in destroying the multicast tracking structure. 
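+ * The error paths in vnic_mc_join_complete() below rely on this: on + * failure they simply clear mc_info.mc and return the non-zero status, + * letting the SA layer free the ib_sa_multicast rather than calling + * ib_sa_free_multicast() themselves.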
+ */ +static int vnic_mc_join_complete(int status, + struct ib_sa_multicast *multicast) +{ + struct viport *viport = (struct viport *)multicast->context; + unsigned long flags; + + MCAST_FUNCTION("vnic_mc_join_complete() status:%x\n", status); + if (status) { + spin_lock_irqsave(&viport->mc_info.lock, flags); + if (status == -ENETRESET) { + vnic_set_multicast_state_invalid(viport); + viport->mc_info.retries = 0; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_ERROR("%s got ENETRESET\n", + control_ifcfg_name(&viport->control)); + goto out; + } + /* perhaps the mcgroup hasn't yet been created - retry */ + viport->mc_info.retries++; + viport->mc_info.mc = NULL; + if (viport->mc_info.retries > MAX_MCAST_JOIN_RETRIES) { + viport->mc_info.state = MCAST_STATE_RETRIED; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_ERROR("%s join failed 0x%x - max retries:%d " + "exceeded\n", + control_ifcfg_name(&viport->control), + status, viport->mc_info.retries); + } else { + viport->mc_info.state = MCAST_STATE_INVALID; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_MCAST_JOIN; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + MCAST_ERROR("%s join failed 0x%x - retrying; " + "retries:%d\n", + control_ifcfg_name(&viport->control), + status, viport->mc_info.retries); + } + goto out; + } + + /* finish join work from main state loop for viport - in case + * the work itself cannot be done in a callback environment */ + spin_lock_irqsave(&viport->lock, flags); + viport->mc_info.mlid = be16_to_cpu(multicast->rec.mlid); + viport->updates |= NEED_MCAST_COMPLETION; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + MCAST_INFO("%s setting NEED_MCAST_COMPLETION %x %x\n", + control_ifcfg_name(&viport->control), + multicast->rec.mlid, viport->mc_info.mlid); +out: + return status; +} + +void vnic_mc_join_setup(struct viport *viport, union ib_gid *mgid) +{ + unsigned long flags; + + MCAST_FUNCTION("in vnic_mc_join_setup\n"); + spin_lock_irqsave(&viport->mc_info.lock, flags); + if (viport->mc_info.state != MCAST_STATE_INVALID) { + if (viport->mc_info.state == MCAST_STATE_DETACHING) + MCAST_ERROR("%s detach in progress\n", + control_ifcfg_name(&viport->control)); + else if (viport->mc_info.state == MCAST_STATE_RETRIED) + MCAST_ERROR("%s max join retries exceeded\n", + control_ifcfg_name(&viport->control)); + else { + /* join/attach in progress or done */ + /* verify that the current mgid is same as prev mgid */ + if (memcmp(mgid, &viport->mc_info.mgid, sizeof(union ib_gid)) != 0) { + /* Separate MGID for each IOC */ + MCAST_ERROR("%s Multicast Group MGIDs not " + "unique; mgids: " VNIC_GID_FMT + " " VNIC_GID_FMT "\n", + control_ifcfg_name(&viport->control), + VNIC_GID_RAW_ARG(mgid->raw), + VNIC_GID_RAW_ARG(viport->mc_info.mgid.raw)); + } else + MCAST_INFO("%s join already issued: %d\n", + control_ifcfg_name(&viport->control), + viport->mc_info.state); + + } + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + return; + } + viport->mc_info.mgid = *mgid; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_MCAST_JOIN; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + MCAST_INFO("%s setting NEED_MCAST_JOIN \n", + control_ifcfg_name(&viport->control)); +} + +int vnic_mc_join(struct viport *viport) +{ + struct ib_sa_mcmember_rec rec; + ib_sa_comp_mask comp_mask; + unsigned long 
flags; + int ret = 0; + + MCAST_FUNCTION("vnic_mc_join()\n"); + if (!viport->mc_data.ib_conn.qp) { + MCAST_ERROR("%s qp is NULL\n", + control_ifcfg_name(&viport->control)); + ret = -1; + goto out; + } + spin_lock_irqsave(&viport->mc_info.lock, flags); + if (viport->mc_info.state != MCAST_STATE_INVALID) { + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_INFO("%s Multicast join already issued\n", + control_ifcfg_name(&viport->control)); + goto out; + } + viport->mc_info.state = MCAST_STATE_JOINING; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + + memset(&rec, 0, sizeof(rec)); + rec.join_state = 2; /* bit 1 is Nonmember */ + rec.mgid = viport->mc_info.mgid; + rec.port_gid = viport->config->path_info.path.sgid; + + comp_mask = IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + MCAST_INFO("%s Joining Multicast group%lx mgid:" + VNIC_GID_FMT " port_gid: " VNIC_GID_FMT "\n", + control_ifcfg_name(&viport->control), jiffies, + VNIC_GID_RAW_ARG(rec.mgid.raw), + VNIC_GID_RAW_ARG(rec.port_gid.raw)); + + viport->mc_info.mc = ib_sa_join_multicast(&vnic_sa_client, + viport->config->ibdev, viport->config->port, + &rec, comp_mask, GFP_KERNEL, + vnic_mc_join_complete, viport); + + if (IS_ERR(viport->mc_info.mc)) { + MCAST_ERROR("%s Multicast joining failed " VNIC_GID_FMT + ".\n", + control_ifcfg_name(&viport->control), + VNIC_GID_RAW_ARG(rec.mgid.raw)); + viport->mc_info.state = MCAST_STATE_INVALID; + ret = -1; + goto out; + } + MCAST_INFO("%s Multicast group join issued mgid:" + VNIC_GID_FMT " port_gid: " VNIC_GID_FMT "\n", + control_ifcfg_name(&viport->control), + VNIC_GID_RAW_ARG(rec.mgid.raw), + VNIC_GID_RAW_ARG(rec.port_gid.raw)); +out: + return ret; +} + +void vnic_mc_leave(struct viport *viport) +{ + unsigned long flags; + unsigned int ret; + struct ib_sa_multicast *mc; + + MCAST_FUNCTION("vnic_mc_leave()\n"); + + spin_lock_irqsave(&viport->mc_info.lock, flags); + if ((viport->mc_info.state == MCAST_STATE_INVALID) || + (viport->mc_info.state == MCAST_STATE_RETRIED)) { + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + return; + } + + if (viport->mc_info.state == MCAST_STATE_JOINED_ATTACHED) { + + viport->mc_info.state = MCAST_STATE_DETACHING; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + ret = ib_detach_mcast(viport->mc_data.ib_conn.qp, + &viport->mc_info.mgid, + viport->mc_info.mlid); + if (ret) { + MCAST_ERROR("%s UD QP Detach failed %d\n", + control_ifcfg_name(&viport->control), ret); + return; + } + MCAST_INFO("%s UD QP detached succesfully\n", + control_ifcfg_name(&viport->control)); + spin_lock_irqsave(&viport->mc_info.lock, flags); + } + mc = viport->mc_info.mc; + vnic_set_multicast_state_invalid(viport); + viport->mc_info.retries = 0; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + + if (mc) { + MCAST_INFO("%s Freeing up multicast structure.\n", + control_ifcfg_name(&viport->control)); + ib_sa_free_multicast(mc); + } + MCAST_FUNCTION("vnic_mc_leave done\n"); + return; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h new file mode 100644 index 0000000..e049180 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h @@ -0,0 +1,77 @@ +/* + * Copyright (c) 2008 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef __VNIC_MULTICAST_H__ +#define __VNIC_MULTICAST_H__ + +enum { + MCAST_STATE_INVALID = 0x00, /* join not attempted or failed */ + MCAST_STATE_JOINING = 0x01, /* join mcgroup in progress */ + MCAST_STATE_ATTACHING = 0x02, /* join completed with success, + * attach qp to mcgroup in progress + */ + MCAST_STATE_JOINED_ATTACHED = 0x03, /* join completed with success */ + MCAST_STATE_DETACHING = 0x04, /* detach qp in progress */ + MCAST_STATE_RETRIED = 0x05, /* retried join and failed */ +}; + +#define MAX_MCAST_JOIN_RETRIES 5 /* used to retry join */ + +struct mc_info { + u8 state; + spinlock_t lock; + union ib_gid mgid; + u16 mlid; + struct ib_sa_multicast *mc; + u8 retries; +}; + + +int vnic_mc_init(struct viport *viport); +void vnic_mc_uninit(struct viport *viport); +extern char *control_ifcfg_name(struct control *control); + +/* This function is called when a viport gets a multicast mgid from EVIC + and must join the multicast group. It sets the NEED_MCAST_JOIN flag, which + results in vnic_mc_join being called later. */ +void vnic_mc_join_setup(struct viport *viport, union ib_gid *mgid); + +/* This function is called when the NEED_MCAST_JOIN flag is set. */ +int vnic_mc_join(struct viport *viport); + +/* This function is called when NEED_MCAST_COMPLETION is set. + It finishes off the multicast join work. */ +int vnic_mc_join_handle_completion(struct viport *viport); + +void vnic_mc_leave(struct viport *viport); + +#endif /* __VNIC_MULTICAST_H__ */ From ramachandra.kuchimanchi at qlogic.com Mon May 19 03:38:00 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 19 May 2008 16:08:00 +0530 Subject: [ofa-general] [PATCH v2 13/13] QLogic VNIC: Modifications to IB Kconfig and Makefile In-Reply-To: <20080519102843.12355.832.stgit@localhost.localdomain> References: <20080519102843.12355.832.stgit@localhost.localdomain> Message-ID: <20080519103800.12355.70429.stgit@localhost.localdomain> From: Ramachandra K This patch modifies the top-level InfiniBand Kconfig and Makefile to include QLogic VNIC as a new ULP.
Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/Kconfig | 2 ++ drivers/infiniband/Makefile | 1 + 2 files changed, 3 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index a5dc78a..0775df5 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -53,4 +53,6 @@ source "drivers/infiniband/ulp/srp/Kconfig" source "drivers/infiniband/ulp/iser/Kconfig" +source "drivers/infiniband/ulp/qlgc_vnic/Kconfig" + endif # INFINIBAND diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index ed35e44..845271e 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -9,3 +9,4 @@ obj-$(CONFIG_INFINIBAND_NES) += hw/nes/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ +obj-$(CONFIG_INFINIBAND_QLGC_VNIC) += ulp/qlgc_vnic/ From weiny2 at llnl.gov Mon May 19 11:21:25 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 19 May 2008 11:21:25 -0700 Subject: [ofa-general] Re: [PATCH] opensm: remove unused pfn_ui_* callback options In-Reply-To: <20080519170916.GL4616@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> <20080519170839.GK4616@sashak.voltaire.com> <20080519170916.GL4616@sashak.voltaire.com> Message-ID: <20080519112125.3ec8ae61.weiny2@llnl.gov> Could these be used by some Windows add on? Not that it matters... Ira On Mon, 19 May 2008 20:09:16 +0300 Sasha Khapyorsky wrote: > > Remove unused pfn_ui_pre_lid_assign and pfn_ui_mcast_fdb_assign callbacks > from OpenSM subnet options. > > Signed-off-by: Sasha Khapyorsky > --- > opensm/include/opensm/osm_subnet.h | 20 ------------------ > opensm/opensm/osm_lid_mgr.c | 7 ------ > opensm/opensm/osm_mcast_mgr.c | 40 ++++++----------------------------- > opensm/opensm/osm_subnet.c | 4 --- > 4 files changed, 7 insertions(+), 64 deletions(-) > > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h > index daab453..56b0165 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -248,10 +248,6 @@ typedef struct _osm_subn_opt { > uint16_t console_port; > cl_map_t port_prof_ignore_guids; > boolean_t port_profile_switch_nodes; > - osm_pfn_ui_extension_t pfn_ui_pre_lid_assign; > - void *ui_pre_lid_assign_ctx; > - osm_pfn_ui_mcast_extension_t pfn_ui_mcast_fdb_assign; > - void *ui_mcast_fdb_assign_ctx; > boolean_t sweep_on_trap; > char *routing_engine_name; > boolean_t connect_roots; > @@ -412,22 +408,6 @@ typedef struct _osm_subn_opt { > * If TRUE will count the number of switch nodes routed through > * the link. If FALSE - only CA/RT nodes are counted. > * > -* pfn_ui_pre_lid_assign > -* A UI function to be invoked prior to lid assigment. It should > -* return 1 if any change was made to any lid or 0 otherwise. > -* > -* ui_pre_lid_assign_ctx > -* A UI context (void *) to be provided to the pfn_ui_pre_lid_assign > -* > -* pfn_ui_mcast_fdb_assign > -* A UI function to be called inside the mcast manager instead of > -* the call for the build spanning tree. This will be called on > -* every multicast call for create, join and leave, and is > -* responsible for the mcast FDB configuration. 
> -* > -* ui_mcast_fdb_assign_ctx > -* A UI context (void *) to be provided to the pfn_ui_mcast_fdb_assign > -* > * sweep_on_trap > * Received traps will initiate a new sweep. > * > diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c > index af0d020..7f25750 100644 > --- a/opensm/opensm/osm_lid_mgr.c > +++ b/opensm/opensm/osm_lid_mgr.c > @@ -1212,13 +1212,6 @@ osm_signal_t osm_lid_mgr_process_sm(IN osm_lid_mgr_t * const p_mgr) > persistent db */ > __osm_lid_mgr_init_sweep(p_mgr); > > - if (p_mgr->p_subn->opt.pfn_ui_pre_lid_assign) { > - OSM_LOG(p_mgr->p_log, OSM_LOG_VERBOSE, > - "Invoking UI function pfn_ui_pre_lid_assign\n"); > - p_mgr->p_subn->opt.pfn_ui_pre_lid_assign(p_mgr->p_subn->opt. > - ui_pre_lid_assign_ctx); > - } > - > /* Set the send_set_reqs of the p_mgr to FALSE, and > we'll see if any set requests were sent. If not - > can signal OSM_SIGNAL_DONE */ > diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c > index 683a16d..a6185fe 100644 > --- a/opensm/opensm/osm_mcast_mgr.c > +++ b/opensm/opensm/osm_mcast_mgr.c > @@ -1085,7 +1085,6 @@ osm_mcast_mgr_process_tree(osm_sm_t * sm, > { > ib_api_status_t status = IB_SUCCESS; > ib_net16_t mlid; > - boolean_t ui_mcast_fdb_assign_func_defined; > > OSM_LOG_ENTER(sm->p_log); > > @@ -1107,44 +1106,19 @@ osm_mcast_mgr_process_tree(osm_sm_t * sm, > goto Exit; > } > > - if (sm->p_subn->opt.pfn_ui_mcast_fdb_assign) > - ui_mcast_fdb_assign_func_defined = TRUE; > - else > - ui_mcast_fdb_assign_func_defined = FALSE; > - > /* > Clear the multicast tables to start clean, then build > the spanning tree which sets the mcast table bits for each > port in the group. > - We will clean the multicast tables if a ui_mcast function isn't > - defined, or if such function is defined, but we got here > - through a MC_CREATE request - this means we are creating a new > - multicast group - clean all old data. > */ > - if (ui_mcast_fdb_assign_func_defined == FALSE || > - req_type == OSM_MCAST_REQ_TYPE_CREATE) > - __osm_mcast_mgr_clear(sm, p_mgrp); > - > - /* If a UI function is defined, then we will call it here. > - If not - the use the regular build spanning tree function */ > - if (ui_mcast_fdb_assign_func_defined == FALSE) { > - status = __osm_mcast_mgr_build_spanning_tree(sm, p_mgrp); > - if (status != IB_SUCCESS) { > - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " > - "Unable to create spanning tree (%s)\n", > - ib_get_err_str(status)); > - goto Exit; > - } > - } else { > - if (osm_log_is_active(sm->p_log, OSM_LOG_DEBUG)) { > - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > - "Invoking UI function pfn_ui_mcast_fdb_assign\n"); > - } > + __osm_mcast_mgr_clear(sm, p_mgrp); > > - sm->p_subn->opt.pfn_ui_mcast_fdb_assign(sm->p_subn->opt. 
> - ui_mcast_fdb_assign_ctx, > - mlid, req_type, > - port_guid); > + status = __osm_mcast_mgr_build_spanning_tree(sm, p_mgrp); > + if (status != IB_SUCCESS) { > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " > + "Unable to create spanning tree (%s)\n", > + ib_get_err_str(status)); > + goto Exit; > } > > Exit: > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index a916270..2191f2d 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -453,10 +453,6 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) > p_opt->qos_policy_file = OSM_DEFAULT_QOS_POLICY_FILE; > p_opt->accum_log_file = TRUE; > p_opt->port_profile_switch_nodes = FALSE; > - p_opt->pfn_ui_pre_lid_assign = NULL; > - p_opt->ui_pre_lid_assign_ctx = NULL; > - p_opt->pfn_ui_mcast_fdb_assign = NULL; > - p_opt->ui_mcast_fdb_assign_ctx = NULL; > p_opt->sweep_on_trap = TRUE; > p_opt->routing_engine_name = NULL; > p_opt->connect_roots = FALSE; > -- > 1.5.4.rc2.60.gb2e62 > From ramachandra.kuchimanchi at qlogic.com Mon May 19 11:59:16 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 20 May 2008 00:29:16 +0530 Subject: [ofa-general] Need help--Failed to build ofa-kernel.rpm while installing OFED1.3 on RHEL5.1 with Kernel 2.6.25-rc3/rc8 In-Reply-To: References: Message-ID: <71d336490805191159k451cfc4dn6f6ad4f3ab88c43b@mail.gmail.com> On Mon, May 19, 2008 at 10:59 PM, Joe Li wrote: > > Hello everyone, > > I am a newbie to openfabric and I have an issue here which needs your help. > When trying to install OFED1.3 on kernel 2.6.25-rc3 or kernel 2.6.25-rc8, I > get an ofa-kernel rpm build error: OFED-1.3 does not support kernel 2.6.25. Among the plain vanilla kernel versions it supports 2.6.24. Please refer to docs/OFED_release_notes.txt for the supported kernels. Regards, Ram From joel at finetec.com Mon May 19 12:00:55 2008 From: joel at finetec.com (Joe Li) Date: Mon, 19 May 2008 12:00:55 -0700 Subject: [ofa-general] Need help--Failed to build ofa-kernel.rpm while installing OFED1.3 on RHEL5.1 with Kernel 2.6.25-rc3/rc8 References: <71d336490805191159k451cfc4dn6f6ad4f3ab88c43b@mail.gmail.com> Message-ID: Thank you very much for the information, I will check the release notes. Regards Joe ________________________________ From: ariston at gmail.com on behalf of Ramachandra K Sent: Mon 5/19/2008 11:59 AM To: Joe Li Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] Need help--Failed to build ofa-kernel.rpm while installing OFED1.3 on RHEL5.1 with Kernel 2.6.25-rc3/rc8 On Mon, May 19, 2008 at 10:59 PM, Joe Li wrote: > > Hello everyone, > > I am a newbie to openfabric and I have an issue here which needs your help. > When trying to install OFED1.3 on kernel 2.6.25-rc3 or kernel 2.6.25-rc8, I > get an ofa-kernel rpm build error: OFED-1.3 does not support kernel 2.6.25. Among the plain vanilla kernel versions it supports 2.6.24. Please refer to docs/OFED_release_notes.txt for the supported kernels. Regards, Ram -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dotanba at gmail.com Mon May 19 13:06:15 2008 From: dotanba at gmail.com (Dotan Barak) Date: Mon, 19 May 2008 22:06:15 +0200 Subject: [ofa-general] timeout question In-Reply-To: <6978b4af0805190349o11c7eab4t74607708a369489@mail.gmail.com> References: <6978b4af0805161110i235d1975k99b8788df14f4277@mail.gmail.com> <6978b4af0805161140t7b12532brc3709a2aec6154df@mail.gmail.com> <482E3AEE.4070603@gmail.com> <6978b4af0805161201r4a113a1cpd98cbb1e81cf175a@mail.gmail.com> <482DE8E3.4090200@gmail.com> <6978b4af0805190349o11c7eab4t74607708a369489@mail.gmail.com> Message-ID: <4831DDB7.4060804@gmail.com> Rui Machado wrote: >> >> Can you replace it with a write (from the other side)? >> READ has "higher price" than a WRITE. >> >> > > Can you please, shortly explain why this higher price? > Basically, when a Read request is sent, the HCA needs to reserve resources in order to accept the response. > >> Anyway, you should get the mentioned behavior anyway.. >> >> When the sender get the error, what is the status of the receiver QP? >> (did you try to execute ibv_query_qp and get its status?) >> >> > > I tried to get the qp state right after the error and it is 6 (which I > believe is IBV_QPS_ERR). > Why do you ask? > If the QP in the receiver side is in the ERROR state, it means that there was an error, and that is the root cause of the retry-exceeded failure in the first place ... maybe there was a remote address/key violation that caused the receiver QP to transition to the ERROR state. Dotan From worleys at gmail.com Mon May 19 12:11:43 2008 From: worleys at gmail.com (Chris Worley) Date: Mon, 19 May 2008 13:11:43 -0600 Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others In-Reply-To: <483045CD.8060301@mellanox.co.il> References: <482AC510.3090602@dev.mellanox.co.il> <483045CD.8060301@mellanox.co.il> Message-ID: In using netcat in UDP mode over IPoIB, I lose 25%-40% of the packets. Is that expected? Thanks, Chris On Sun, May 18, 2008 at 9:05 AM, Tziporet Koren wrote: > Chris Worley wrote: >> >> Ahhh... it was probably because I added the RPMs w/o deleting 1.2.5.5 >> in the "kitchen sink" build. >> >> Is there any reason to NOT use connected mode? >> > > In general the CM is better in performance of medium & large messages. > We found the UD mode is better in small UDP messages > > Tziporet > > > From sashak at voltaire.com Mon May 19 12:53:02 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 19 May 2008 22:53:02 +0300 Subject: [ofa-general] Re: [PATCH] opensm: remove unused pfn_ui_* callback options In-Reply-To: <20080519112125.3ec8ae61.weiny2@llnl.gov> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> <20080519170839.GK4616@sashak.voltaire.com> <20080519170916.GL4616@sashak.voltaire.com> <20080519112125.3ec8ae61.weiny2@llnl.gov> Message-ID: <20080519195302.GA1183@sashak.voltaire.com> On 11:21 Mon 19 May , Ira Weiny wrote: > Could these be used by some Windows add on? I don't think so; likely there were some ideas about utilizing osmsh or so (now we have the more powerful "routing engine" stuff). But I don't know for sure. I guess we will find out if it was. > Not that it matters... You said... 
:) Sasha From xma at us.ibm.com Mon May 19 13:08:07 2008 From: xma at us.ibm.com (Shirley Ma) Date: Mon, 19 May 2008 13:08:07 -0700 Subject: [ofa-general][PATCH] Re: mlx4: Completion EQ per cpu (MP support, Patch 10) In-Reply-To: <48319B4C.7040309@mellanox.co.il> Message-ID: We should support smp affinity from the userland as well through sysfs. Thanks Shirley Yevgeny Petrilin Sent by: general-bounces at lists.openfabrics.org 05/19/2008 08:22 AM To Roland Dreier cc Christoph Raisch , Hoang-Nam Nguyen , general at lists.openfabrics.org Subject Re: [ofa-general][PATCH] Re: mlx4: Completion EQ per cpu (MP support, Patch 10) Roland Dreier wrote: > > > I would just like to see an approach that is fully thought through and > > > gives a way for applications/kernel drivers to choose a CQ vector based > > > on some information about what CPU it will go to. > > > Isn't the decision of which CPU an MSI-X is routed to (and hence, to > > which CPI an EQ is bound to) determined by userspace? (either by the irq > > balancer process or by manually setting /proc/irq//smp_affinity)? > > Yes, but how can anything tell which IRQ number corresponds to a given > "CQ vector" number? (And don't be too stuck on MSI-X, since ehca uses > some completely different GX-bus related thing to get multiple interrupts) > > > What are we risking in making the default action to spread interrupts? > > There are fairly plausible scenarios like a multi-threaded app where > each thread creates a send CQ and a receive CQ, which should both be > bound to the same CPU as the thread. If we spread all CQs then it's > impossible to get thread-locality. > > I'm not saying that round-robin is necessarily a bad default policy, but > I do think there needs to be a complete picture of how that policy can > be overridden before we go for multiple interrupt vectors. > > - R. Hello Roland, We can add the multiple interrupt vectors support in two stages: 1. The low level driver can create multiple interrupt vectors. Their name would include a serial number from 0 to #CPU's-1. The number of completion vectors can be populated through ib_device.num_comp_vectors. Then each ulp can ask for a specific completion vector when creating CQ, which means that passing vector=0 while creating CQ will assign it to completion vector #0. 2. As the second stage, we can create a "don't care" value which would mean that the driver can can attach the CQ to any completion vector. In this case the policy shouldn't necessary be round-robin. We can manage the number of "clients" for each completion vector and then assign the CQ to the least busy one. What is your opinion on this solution? 
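To make the two stages concrete, the "don't care" selection could look something like the sketch below (a minimal illustration only -- MLX4_ANY_COMP_VECTOR, comp_vector_load and mlx4_pick_comp_vector are made-up names, not part of any posted patch):

#include <linux/threads.h>	/* NR_CPUS */

#define MLX4_ANY_COMP_VECTOR	(~0U)	/* hypothetical "don't care" value */

/* Number of CQs currently attached to each completion vector.  A real
 * driver would protect this with a lock and decrement the counter when
 * a CQ is destroyed. */
static unsigned int comp_vector_load[NR_CPUS];

static unsigned int mlx4_pick_comp_vector(unsigned int requested,
					  unsigned int num_comp_vectors)
{
	unsigned int i, best = 0;

	/* Stage 1: the ULP asked for a specific completion vector. */
	if (requested != MLX4_ANY_COMP_VECTOR)
		return requested % num_comp_vectors;

	/* Stage 2: "don't care" -- attach the CQ to the least busy
	 * vector rather than doing blind round-robin. */
	for (i = 1; i < num_comp_vectors; ++i)
		if (comp_vector_load[i] < comp_vector_load[best])
			best = i;

	comp_vector_load[best]++;
	return best;
}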
Thanks, Yevgeny _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Mon May 19 14:24:49 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 19 May 2008 14:24:49 -0700 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <4831649A.2020206@voltaire.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> <482C68A4.9020305@opengridcomputing.com> <4831649A.2020206@voltaire.com> Message-ID: <000101c8b9f6$c4dd4cf0$bd59180a@amr.corp.intel.com> >Sean, please let me know your preference (as it was somehow unclear from >the thread) if you want the delivery of this event to be dependent on >the ulp asking for it or no. I spent most of the morning looking at this, and until I know what the trade-offs really are in the implementation, I can't say that I have a strong preference for how to deal with any of this. My main concerns are: * All callbacks from the rdma_cm are serialized * We minimize the overhead of reporting events * We don't lose events * If the user returns a non-zero value from a callback, the rdma_cm_id is destroyed, an no further callbacks are invoked. and in concept I prefer to: * Always report the event and let ULPs ignore it * Let someone come up with a fantastically simple way of reporting new events The existing rdma_cm callbacks are naturally serialized with each other. (Callback for connect after resolve route after resolve address...) This allows using the stack for event structures, but the cost is complex synchronization with device removal. Supporting additional events while meeting the concerns listed above will be equally challenging. So if we can simplify device removal handling, then supporting similar types of events should be easier as well. If we can guarantee that this works, one option is to acquire a mutex before invoking a callback on an rdma_cm_id. I hesitate to hold any locks while in a callback, since it restricts what the user can do, but if the mutex is only used to synchronize calling the user back, it may work, since the rdma_cm never invokes a callback from a downcall. This should simplify the device removal handling, eliminating wait_remove and dev_remove from the rdma_cm_id. Alternatively, the ib_cm serializes callbacks using different logic (see cm_process_work() and use of work_count/work_list). I've been looking at what it would take to use the ib_cm event logic in the rdma_cm. The trick is to minimize the event reporting overhead without losing any events, (and minimizing the overhead may require registering for events...) What I've been exploring is adding an event_list to the rdma_cm_id. Whenever the user performs an asynchronous operation, event structure(s) is allocated and placed on the event_list. When an asynchronous operation completes, the event structure is removed from this list, placed on a work_list, and a call like cma_process_work() is invoked. Note that some operations (e.g. connect) result in multiple callbacks to the rdma_cm (connect and disconnect). And the more I consider this option, the more appealing just holding a mutex around the callbacks becomes. 
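As a sketch of that last option (illustrative only -- handler_mutex is an assumed new field in struct rdma_id_private, not existing rdma_cm code):

/* Funnel every event callback on an rdma_cm_id through one helper so
 * that callbacks are naturally serialized by a per-id mutex.  Because
 * the rdma_cm never invokes a callback from a downcall, holding the
 * mutex here cannot deadlock against the ULP calling back into the
 * rdma_cm from its handler. */
static int cma_notify_user(struct rdma_id_private *id_priv,
			   struct rdma_cm_event *event)
{
	int ret;

	mutex_lock(&id_priv->handler_mutex);
	ret = id_priv->id.event_handler(&id_priv->id, event);
	mutex_unlock(&id_priv->handler_mutex);

	/* Non-zero means the caller must destroy the rdma_cm_id and
	 * guarantee that no further callbacks are invoked. */
	return ret;
}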
- Sean From robert.j.woodruff at intel.com Mon May 19 15:01:31 2008 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 19 May 2008 15:01:31 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info Message-ID: Hi guys, I am starting to put together an updated list of maintainers that can be displayed on the OFA website, as what is out there now is horribly out of date. I am only collecting the info for the Linux side, Sean and Stan will be collecting the info for the Windows stack. Please respond with the components that you (or the people that you work with) maintain and the current maintainer of that component and email info. I have been working with the website maintainers to put in place a way that we can have the link to the maintainers list be a pointer to a simple text file, so it can be easily updated on the server when things change in the future, but I'd like to put together the initial file and then later it can be easily updated when things change. woody From swise at opengridcomputing.com Mon May 19 15:04:21 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 May 2008 17:04:21 -0500 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: References: Message-ID: <4831F965.6060607@opengridcomputing.com> Woodruff, Robert J wrote: > Hi guys, > > I am starting to put together an updated list of maintainers > that can be displayed on the OFA website, > as what is out there now is horribly out of > date. I am only collecting the info for the Linux side, > Sean and Stan will be collecting the info for the Windows stack. > > Please respond with the components that you (or the people that > you work with) maintain and the > current maintainer of that component and email info. > > Can you send the list of components? > I have been working with the website maintainers to put > in place a way that we can have the link to the maintainers > list be a pointer to a simple text file, so it can be > easily updated on the server when things change in the future, > but I'd like to put together the initial file and then later > it can be easily updated when things change. > > > woody > From rdreier at cisco.com Mon May 19 15:05:59 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:05:59 -0700 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: (Roland Dreier's message of "Mon, 19 May 2008 08:49:04 -0700") References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> Message-ID: > and what happens if alloc_mad() is called while port->sm_ah is NULL? Trivial fix seems to be to move the test for whether port->sm_ah is NULL into alloc_mad(), and have it return -EAGAIN if so. - R. 
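A sketch of that trivial fix (field and function names follow drivers/infiniband/core/sa_query.c, but the exact body here is illustrative, not a tested patch):

static int alloc_mad(struct ib_sa_query *query, gfp_t gfp_mask)
{
	unsigned long flags;

	spin_lock_irqsave(&query->port->ah_lock, flags);
	if (!query->port->sm_ah) {
		/* No address handle for the SM right now (e.g. it was
		 * just flushed by a port event): fail the query with
		 * -EAGAIN instead of dereferencing a NULL pointer. */
		spin_unlock_irqrestore(&query->port->ah_lock, flags);
		return -EAGAIN;
	}
	kref_get(&query->port->sm_ah->ref);
	query->sm_ah = query->port->sm_ah;
	spin_unlock_irqrestore(&query->port->ah_lock, flags);

	query->mad_buf = ib_create_send_mad(query->port->agent, 1,
					    query->sm_ah->pkey_index, 0,
					    IB_MGMT_SA_HDR, IB_MGMT_SA_DATA,
					    gfp_mask);
	if (IS_ERR(query->mad_buf)) {
		kref_put(&query->sm_ah->ref, free_sm_ah);
		return -ENOMEM;
	}

	query->mad_buf->ah = query->sm_ah->ah;
	return 0;
}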
From sean.hefty at intel.com Mon May 19 15:11:09 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 19 May 2008 15:11:09 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: References: Message-ID: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> This is what I could find in the MAINTAINERS file for 2.6.25: EHCA (IBM GX bus InfiniBand adapter) DRIVER: P: Hoang-Nam Nguyen M: hnguyen at de.ibm.com P: Christoph Raisch M: raisch at de.ibm.com L: general at lists.openfabrics.org S: Supported INFINIBAND SUBSYSTEM P: Roland Dreier M: rolandd at cisco.com P: Sean Hefty M: sean.hefty at intel.com P: Hal Rosenstock M: hal.rosenstock at gmail.com L: general at lists.openfabrics.org W: http://www.openib.org/ T: git kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git S: Supported IPATH DRIVER: P: Ralph Campbell M: infinipath at qlogic.com L: general at lists.openfabrics.org T: git git://git.qlogic.com/ipath-linux-2.6 S: Supported NETEFFECT IWARP RNIC DRIVER (IW_NES) P: Faisal Latif M: flatif at neteffect.com P: Nishi Gupta M: ngupta at neteffect.com P: Glenn Streiff M: gstreiff at neteffect.com L: general at lists.openfabrics.org W: http://www.neteffect.com S: Supported F: drivers/infiniband/hw/nes/ AMSO1100 RNIC DRIVER P: Tom Tucker M: tom at opengridcomputing.com P: Steve Wise M: swise at opengridcomputing.com L: general at lists.openfabrics.org S: Maintained From rdreier at cisco.com Mon May 19 15:12:15 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:12:15 -0700 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: <483199EC.7070900@Voltaire.COM> (Moni Shoua's message of "Mon, 19 May 2008 18:17:00 +0300") References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> Message-ID: By the way: > + struct ib_sa_sm_ah *sm_ah; > + > + spin_lock_irqsave(&port->ah_lock, flags); > + sm_ah = port->sm_ah; > + port->sm_ah = NULL; > + spin_unlock_irqrestore(&port->ah_lock, flags); > + > + if (sm_ah) > + kref_put(&sm_ah->ref, free_sm_ah); Is there some reason why this can't be simpler like: spin_lock_irqsave(&port->ah_lock, flags); if (port->sm_ah) kref_put(&port->sm_ah->ref, free_sm_ah); port->sm_ah = NULL; spin_unlock_irqrestore(&port->ah_lock, flags); I guess the same cleanup applies to update_sm_ah(), except after your patch I don't see any way that update_sm_ah() could be called with sm_ah anything but NULL, so we could drop the old_ah stuff completely there. - R. From rdreier at cisco.com Mon May 19 15:15:08 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:15:08 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> (Sean Hefty's message of "Mon, 19 May 2008 15:11:09 -0700") References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> Message-ID: thanks Sean. I notice at least CXGB3 is missing from MAINTAINERS -- Steve, if you want to send an entry I would add it. Also if anyone wants to add an ISER entry that would be good too. I could add IP-OVER-INFINIBAND and SRP entries but not sure it's worth it since I'm already in the main INFINIBAND entry. Maybe it's worth changing "INFINIBAND SUBSYSTEM" to "INFINIBAND/IWARP/RDMA SUBSYSTEM"? - R. 
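For example, a CXGB3 entry in the same format as the excerpt above might read as follows (contacts taken from this thread; this is a suggested entry, not one that has been merged):

CXGB3 IWARP RNIC DRIVER (IW_CXGB3)
P: Steve Wise
M: swise at opengridcomputing.com
L: general at lists.openfabrics.org
S: Supported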
From rdreier at cisco.com Mon May 19 15:19:28 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:19:28 -0700 Subject: [ofa-general][PATCH] Re: mlx4: Completion EQ per cpu (MP support, Patch 10) In-Reply-To: <48319B4C.7040309@mellanox.co.il> (Yevgeny Petrilin's message of "Mon, 19 May 2008 18:22:52 +0300") References: <40FA0A8088E8A441973D37502F00933E3A24@mtlexch01.mtl.com> <48319B4C.7040309@mellanox.co.il> Message-ID: > We can add the multiple interrupt vectors support in two stages: > 1. The low level driver can create multiple interrupt vectors. Their name would include a > serial number from 0 to #CPU's-1. The number of completion vectors can > be populated through ib_device.num_comp_vectors. Then each ulp can ask for a specific > completion vector when creating CQ, which means that passing vector=0 while creating CQ > will assign it to completion vector #0. > > 2. As the second stage, we can create a "don't care" value which would mean that the driver > can attach the CQ to any completion vector. In this case the policy shouldn't necessarily be > round-robin. We can manage the number of "clients" for each completion vector and then assign the CQ > to the least busy one. this makes sense. However I think we need to come up with some mechanism where a ULP or application can assign some semantic value to the CQ event vector it chooses. Maybe a new verb is required. - R. From rdreier at cisco.com Mon May 19 15:23:03 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:23:03 -0700 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <483022BB.9060004@Voltaire.COM> (Moni Shoua's message of "Sun, 18 May 2008 15:36:11 +0300") References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> Message-ID: > The purpose of this patch is to make the events that are related to SM change > (namely CLIENT_REREGISTER event and SM_CHANGE event) less disruptive. > When SM related events are handled, it is not necessary to flush unicast > info from device but only multicast info. This patch divides the events that are > handled by IPoIB to three categories; 0, 1 and 2 (when 2 does more than 1 and 1 > does more than 0). I see two issues with this patch: - Is it architecturally guaranteed by the IB spec that flushing unicast info is not required on an SM change or client reregister event? - The implementation looks to make maintainability somewhat harder, since it's not very clear what level 0, 1, and 2 events really mean. I suggest using some symbolic names (maybe bitmasks that are |ed together?) to make it explicit what is being flushed. - R. From rdreier at cisco.com Mon May 19 15:24:04 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:24:04 -0700 Subject: [ofa-general] [PATCH v2] IB/mlx4: Add send with invalidate support In-Reply-To: <1211102098.6963.14.camel@eli-laptop> (Eli Cohen's message of "Sun, 18 May 2008 12:14:58 +0300") References: <1210836027.18385.2.camel@mtls03> <482FC5D9.7060009@voltaire.com> <1211102098.6963.14.camel@eli-laptop> Message-ID: > No reason for them to be different. Roland already suggested to use a > union here although he defines the union locally inside the containing > struct thus he has two definitions for the same union. Roland do you > intend to commit that? I can if everyone agrees with it.
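For reference, the union in question is roughly the following -- a hedged sketch of the ib_verbs.h change being discussed, declared once in the send work request and once in the work completion (the field names are assumptions based on this thread, not necessarily the final ones):

	/* inside struct ib_send_wr */
	union {
		__be32	imm_data;		/* IB_WR_SEND_WITH_IMM */
		u32	invalidate_rkey;	/* send with invalidate */
	} ex;

	/* and the same union repeated inside struct ib_wc */
	union {
		__be32	imm_data;
		u32	invalidate_rkey;
	} ex;

A named top-level union would work mechanically, but as the follow-up below notes, there is no obviously good way to describe it independently, so the definition ends up duplicated in both containing structures.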
I can't think of a good way to describe the union independently, so I think I'll keep it as being duplicated between the WR and completion structures. - R. From rdreier at cisco.com Mon May 19 15:27:21 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 15:27:21 -0700 Subject: Selective signalling vs WQE size - was Re: [ofa-general] mthca max_sge value... ugh. In-Reply-To: (Thomas Talpey's message of "Mon, 19 May 2008 09:14:32 -0400") References: Message-ID: > >if we can't use the "WQE shrinking" feature (because of selective > >signaling in the NFS/RDMA case), and we want to use 32 sge entries, then > >the WQE size 's' will end up a little more than 512 bytes, and the > >wqe_shift will end up as 10. > > Can you elaborate on this? The NFS/RDMA client does selective signalling > on its send queue in order to save on interrupts and CQE generation/handling. > Which I always thought was a (very) good approach. Because the RPC > request/response paradigm guarantees an eventual receive completion, > we simply defer (or even completely avoid) this work. > > Would that be a bad trade if it takes a WQE management opportunity away > from the provider? It's quite easy to change this in the NFS/RDMA code, > or make it a selectable parameter. mlx4 has a feature that lets the driver post smaller WQEs to the send queue if not all s/g entries are used. But the current implementation at least can only use this feature if selective signaling is off. So it's a tradeoff -- more work completions or bigger data structures for the HCA to fetch. In the NFS/RDMA case I would expect the selective signaling to be a win, but I guess the thing to do is try ConnectX without selective signaling and see which wins. - R. From robert.j.woodruff at intel.com Mon May 19 15:28:15 2008 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 19 May 2008 15:28:15 -0700 Subject: [ofa-general] RE: Current list of Linux maintainers and their email info In-Reply-To: <4831F965.6060607@opengridcomputing.com> References: <4831F965.6060607@opengridcomputing.com> Message-ID: Steve Wise wrote, >Can you send the list of components? Here is what I have so far as the list of kernel and userspace components. Kernel Components Core kernel drivers, infiniband/core Sean Hefty, sean.hefty at intel.com Roland Dreier, rdreier at cisco.com Hardware Drivers: Mellanox HCA drivers, infiniband/hw/mthca, infiniband/hw/mlx4 Qlogic HCA driver, infiniband/hw/ipath NetEffect RNIC driver, infiniband/hw/nes IBM HCA, infiniband/hw/ehca Chelsio RNIC, infiniband/hw/cxgb3 Upper Level Protocols IPoIB SRP iSer SDP SRPT qlgc_vnic RDS User Space Components libibverbs uDAPL IB-Bonding IB-Sim IB-Utils IB-Diags libibcm librdmacm libibcommon libibmad libibumad libipathverbs libmlx4 libmthca libnes libsdp mpi-selector mpitests mstflint mvapich mvapich2 openmpi open-iscsi opensm perftest qlvnictools qperf rds-tools sdpnetstat srptools From robert.j.woodruff at intel.com Mon May 19 15:49:43 2008 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 19 May 2008 15:49:43 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> Message-ID: Roland wrote, >Maybe it's worth changing "INFINIBAND SUBSYSTEM" to >"INFINIBAND/IWARP/RDMA SUBSYSTEM"? > - R.
Good point, probably should change to something like INFINIBAND/IWARP/RDMA SUBSYSTEM, and perhaps for the kernel components, we should base our list off of the MAINTAINERS file in the kernel tree, plus I suppose we'd have to add a couple extra entries in our list for the things (like SDP) that are in OFED, but not upstream. From swise at opengridcomputing.com Mon May 19 16:04:19 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 May 2008 18:04:19 -0500 Subject: [ofa-general] [PATCH] Add cxgb3 and iw_cxgb3 maintainers. Message-ID: <20080519230419.5000.56974.stgit@dell3.ogc.int> Signed-off-by: Steve Wise --- MAINTAINERS | 14 ++++++++++++++ 1 files changed, 14 insertions(+), 0 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index c68a118..11453eb 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1239,6 +1239,20 @@ L: video4linux-list at redhat.com W: http://linuxtv.org S: Maintained +CXGB3 ETHERNET DRIVER (CXGB3) +P: Divy Le Ray +M: divy at chelsio.com +L: netdev at vger.kernel.org +W: http://www.chelsio.com +S: Supported + +CXGB3 IWARP RNIC DRIVER (IW_CXGB3) +P: Steve Wise +M: swise at chelsio.com +L: general at lists.openfabrics.org +W: http://www.openfabrics.org +S: Supported + CYBERPRO FB DRIVER P: Russell King M: rmk at arm.linux.org.uk From rdreier at cisco.com Mon May 19 16:10:11 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 19 May 2008 16:10:11 -0700 Subject: [ofa-general] Re: [PATCH] Add cxgb3 and iw_cxgb3 maintainers. In-Reply-To: <20080519230419.5000.56974.stgit@dell3.ogc.int> (Steve Wise's message of "Mon, 19 May 2008 18:04:19 -0500") References: <20080519230419.5000.56974.stgit@dell3.ogc.int> Message-ID: cool with you, Divy? If so I will merge. From divy at chelsio.com Mon May 19 16:11:33 2008 From: divy at chelsio.com (Divy Le Ray) Date: Mon, 19 May 2008 16:11:33 -0700 Subject: [ofa-general] Re: [PATCH] Add cxgb3 and iw_cxgb3 maintainers. In-Reply-To: References: <20080519230419.5000.56974.stgit@dell3.ogc.int> Message-ID: <48320925.7070405@chelsio.com> Roland Dreier wrote: > cool with you, Divy? > > If so I will merge. > Yes, it is perfect. Thanks a lot, Divy From sashak at voltaire.com Mon May 19 18:56:52 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 20 May 2008 04:56:52 +0300 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: References: <4831F965.6060607@opengridcomputing.com> Message-ID: <20080520015652.GE1183@sashak.voltaire.com> On 15:28 Mon 19 May , Woodruff, Robert J wrote: > > Here is what I have so far as the list of kernel and userspace > components. > IB-Sim > IB-Diags > libibcommon > libibmad > libibumad > opensm are "mine" - Sasha Khapyorsky Probably it would be good to unify component and package names? if so: IB-Sim -> ibsim, IB-Diags -> infiniband-diags. Sasha From eli at dev.mellanox.co.il Mon May 19 22:10:29 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Tue, 20 May 2008 08:10:29 +0300 Subject: Selective signalling vs WQE size - was Re: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: Message-ID: <1211260229.6556.18.camel@eli-laptop> On Mon, 2008-05-19 at 15:27 -0700, Roland Dreier wrote: > mlx4 has a feature that lets the driver post smaller WQEs to the send > queue if not all s/g entries are used. But the current implementation > at least can only use this feature if selective signaling is off. > > So it's a tradeoff -- more work completions or bigger data structures > for the HCA to fetch. 
In the NFS/RDMA case I would expect the selective > signaling to be a win, but I guess the thing to do is try ConnectX > without selective signaling and see which wins. > Roland, I posted a few months ago a patch that optimizes post send for selective signaling QPs. It must have slipped somehow because I did not get any reply on it and since I did not know of anyone using selective signaling I forgot about this too. The idea is that for selective signaling QPs, before you stamp the WQE, you read the value of the DS field which denotes the effective size of the descriptor as used in the previous post, and stamp only that area, relying on the fact that the rest of the descriptor is already stamped. Here is a link to the patch. I don't know if it applies cleanly now but if we agree on the idea I will generate it again against the current tree. http://lists.openfabrics.org/pipermail/general/2008-January/045071.html From npiggin at suse.de Mon May 19 22:31:46 2008 From: npiggin at suse.de (Nick Piggin) Date: Tue, 20 May 2008 07:31:46 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080516115005.GC4287@sgi.com> References: <20080508003838.GA9878@sgi.com> <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com> <20080516115005.GC4287@sgi.com> Message-ID: <20080520053145.GA19502@wotan.suse.de> On Fri, May 16, 2008 at 06:50:05AM -0500, Robin Holt wrote: > On Fri, May 16, 2008 at 06:23:06AM -0500, Robin Holt wrote: > > On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote: > > > On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote: > > > > On Thu, 15 May 2008, Nick Piggin wrote: > > > > > > > > > Oh, I get that confused because of the mixed up naming conventions > > > > > there: unmap_page_range should actually be called zap_page_range. But > > > > > at any rate, yes we can easily zap pagetables without holding mmap_sem. > > > > > > > > How is that synchronized with code that walks the same pagetable. These > > > > walks may not hold mmap_sem either. I would expect that one could only > > > > remove a portion of the pagetable where we have some sort of guarantee > > > > that no accesses occur. So the removal of the vma prior ensures that? > > > > > > I don't really understand the question. If you remove the pte and invalidate > > > the TLBS on the remote image's process (importing the page), then it can > > > of course try to refault the page in because it's vma is still there. But > > > you catch that refault in your driver , which can prevent the page from > > > being faulted back in. > > > > I think Christoph's question has more to do with faults that are > > in flight. A recently requested fault could have just released the > > last lock that was holding up the invalidate callout. It would then > > begin messaging back the response PFN which could still be in flight. > > The invalidate callout would then fire and do the interrupt shoot-down > > while that response was still active (essentially beating the inflight > > response). The invalidate would clear up nothing and then the response > > would insert the PFN after it is no longer the correct PFN. > > I just looked over XPMEM. I think we could make this work. We already > have a list of active faults which is protected by a simple spinlock. 
> I would need to nest this lock within another lock protected our PFN > table (currently it is a mutex) and then the invalidate interrupt handler > would need to mark the fault as invalid (which is also currently there). > > I think my sticking points with the interrupt method remain at fault > containment and timeout. The inability of the ia64 processor to handle > provide predictive failures for the read/write of memory on other > partitions prevents us from being able to contain the failure. I don't > think we can get the information we would need to do the invalidate > without introducing fault containment issues which has been a continous > area of concern for our customers. Really? You can get the information through via a sleeping messaging API, but not a non-sleeping one? What is the difference from the hardware POV? From orenmeron at dev.mellanox.co.il Mon May 19 22:36:59 2008 From: orenmeron at dev.mellanox.co.il (Oren Meron) Date: Tue, 20 May 2008 08:36:59 +0300 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: References: Message-ID: <4832637B.6040800@dev.mellanox.co.il> Woodruff, Robert J wrote: > Hi guys, > > I am starting to put together an updated list of maintainers > that can be displayed on the OFA website, > as what is out there now is horribly out of > date. I am only collecting the info for the Linux side, > Sean and Stan will be collecting the info for the Windows stack. > > Please respond with the components that you (or the people that > you work with) maintain and the > current maintainer of that component and email info. > > I have been working with the website maintainers to put > in place a way that we can have the link to the maintainers > list be a pointer to a simple text file, so it can be > easily updated on the server when things change in the future, > but I'd like to put together the initial file and then later > it can be easily updated when things change. > > > woody > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > Hi Woody, I maintain the perftest low-level benchmarks. thanks - oren. From ogerlitz at voltaire.com Mon May 19 23:01:38 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 09:01:38 +0300 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> Message-ID: <48326942.7080800@voltaire.com> Sean Hefty wrote: > This is what I could find in the MAINTAINERS file for 2.6.25: I am not sure I follow why there's a need to duplicate the Linux kernel IB (RDMA) stack maintainers file at the ofa website, but if for some reason people feel this is needed I suggest having a smart link that somehow goes to Linus' tree and fetches the up-to-date info. Or. From ogerlitz at voltaire.com Mon May 19 23:05:56 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 09:05:56 +0300 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: References: <4831F965.6060607@opengridcomputing.com> Message-ID: <48326A44.2080606@voltaire.com> Woodruff, Robert J wrote: > SRPT > Isn't what you call "SRPT" the target known as SCST, which also supports iSCSI?
if yes, I don't see why the string "SRP" has to be in the name. > IB-Bonding > the ib-bonding package provides the kernel bonding module Or. From Sumit.Gaur at Sun.COM Mon May 19 23:14:10 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Tue, 20 May 2008 11:44:10 +0530 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <1211217595.12616.490.camel@hrosenstock-ws.xsigo.com> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> <4831B519.2060002@dev.mellanox.co.il> <1211217595.12616.490.camel@hrosenstock-ws.xsigo.com> Message-ID: <48326C32.7000303@Sun.COM> Hal Rosenstock wrote: > On Mon, 2008-05-19 at 20:12 +0300, Yevgeny Kliteynik wrote: > >>Hal Rosenstock wrote: >> >>>Sumit, >>>On Mon, 2008-05-19 at 21:38 +0530, Sumit Gaur wrote: >>> >>> >>>>Hi Hal, >>>>It is true that the received packets look like proper responses, but as I >>>>mentioned before they contain TIDs that I never sent to OFED, and >>>>this causes the problem. Why OFED is sending these extra packets is the >>>>matter to investigate. >>> >>>The received packet is SM class attribute ID 4352 which is non IBA >>>standard and AFAIK OFED does not send so it likely comes from some non >>>OFED software. >> >>Just a thought: >>Decimal 4352 is 0x1100. With the endianness reversed we get 0x0011, >>which is NodeInfo, which the SM sends while sweeping the subnet, >>and which comes at regular intervals. >> >>As I said, just a thought... > > Yes, that makes sense to me. As this is an incoming response, maybe this node is running the SM as well as this application. Yes, the node is running the SM too. sminfo: sm lid 1 sm guid 0x3ba00534f000d, activity count 1213884 priority 7 state 3 SMINFO_MASTER Now it looks like we are going in the right direction, so the extra packets are incoming SM packets. The same question arises again: how can we identify and filter these incoming SM packets from the regular responses in the application? > > -- Hal > >>-- Yevgeny >> >>>As far as why it is being received, it is a response to a class your >>>application is subscribed to so it passes it through. >>> From ogerlitz at voltaire.com Mon May 19 23:37:12 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 09:37:12 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4831A0DF.2070603@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> <4831347A.1010506@voltaire.com> <4831A0DF.2070603@opengridcomputing.com> Message-ID: <48327198.7080305@voltaire.com> Steve Wise wrote: >> dma mapping would work too but then handling the map/unmap becomes an >> issue. I think it is way too complicated to add new verbs for >> map/unmap fastreg page list (in addition to the alloc/free fastreg page >> list that we are already adding) and force the consumer to do it.
And >> if we expect the low-level driver to do it, then the map is easy (can be >> done while posting the send) but the unmap is a pain -- it would have to >> be done inside poll_cq when reaping the completion, and the low-level >> driver would have to keep some complicated extra data structure to go >> back from the completion to the original fast reg page list structure. >> > And certain platforms can fail map requests (like PPC64) because they > have limited resources for dma mapping. So then you'd fail a SQ work > request when you might not want to... I see the point in allocating the page lists in dma consistent memory to make the mechanics of letting the HCA DMA the list easier and simpler, as I think Roland is suggesting in his post. However, I am not sure I understand how this helps in the PPC64 case; if the HCA does DMA to fetch the list, then IOMMU slots have to be consumed one way or another, correct? Or. From ogerlitz at voltaire.com Mon May 19 23:41:29 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 09:41:29 +0300 Subject: [ofa-general] Need help--Failed to build ofa-kernel.rpm while installing OFED1.3 on RHEL5.1 with Kernel 2.6.25-rc3/rc8 In-Reply-To: References: Message-ID: <48327299.6060303@voltaire.com> Joe Li wrote: > > I am a newbie to openfabric and I have an issue here which needs your > help. When trying to install OFED1.3 on kernel 2.6.25-rc3 or kernel > 2.6.25-rc8, I get an ofa-kernel rpm build error: > May I ask for what reason on earth you need the ofa-kernel rpm on top of a kernel whose IB stack is newer than the contents of the package? Or. From ogerlitz at voltaire.com Mon May 19 23:45:31 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 09:45:31 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> Message-ID: <4832738B.7070105@voltaire.com> Roland Dreier wrote: > You misunderstood the patch I think (unless I did). By default new > kernel + new firmware gets the smaller page size. > OK, thanks, I guess the change-log can be changed to make this point clearer. Actually, thinking on this a little further, if the (say 2.6.27/28 or later) mlx4 driver is going to support the memory extensions, maybe we could remove from it the support for the proprietary FMRs anyway at that point? Or. From tziporet at dev.mellanox.co.il Tue May 20 00:58:01 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 20 May 2008 10:58:01 +0300 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: <48326A44.2080606@voltaire.com> References: <4831F965.6060607@opengridcomputing.com> <48326A44.2080606@voltaire.com> Message-ID: <48328489.2030305@mellanox.co.il> Or Gerlitz wrote: > Woodruff, Robert J wrote: >> SRPT >> > Isn't what you call "SRPT" the target known as SCST, which also supports > iSCSI? if yes, I don't see why the string "SRP" has to be in the > name.
>> IB-Bonding >> > the ib-bonding package provides the kernel bonding module > But I think it will be good to know who is the maintainer from the IB side (at least for OFED users) Tziporet From ogerlitz at voltaire.com Tue May 20 01:03:40 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 11:03:40 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223419.27221.49014.stgit@dell3.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> Message-ID: <483285DC.20003@voltaire.com> Steve Wise wrote: > Support for the IB BMME and iWARP equivalent memory extensions to > non shared memory regions. Usage Model: > - MR allocated with ib_alloc_mr() > - Page lists allocated via ib_alloc_fast_reg_page_list(). > - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) > - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) Steve, I am trying to further understand what would be a real life ULP design here, and I think there are some more issues to clarify/define for the case of ULP which has to create a mapping for a list of pages and send this mapping (eg IB/rkey iWARP/stag) to a remote party that uses it for RDMA. AFAIK, the idea was to let the ulp post --two-- work requests, where the first creates the mapping and the second sends this mapping to the remote side, such that the second does not start before the first completes (i.e a fence). Now, the above scheme means that the ulp knows the value of the rkey/stag at the time of posting these two work requests (since it has to encode it in the second one), so something has to be clarified re the rkey/stag here, do they change each time this MR is used? how many bits can be changed, etc. I guess my questions are to some extent RTFM ones, but, first, with some quick looking in the IB spec I did not manage to get enough answers (pointers appreciated...) and second, you are proposing an implementation here, so I think it makes sense to review the actual usage model to see all aspects needed for ULPs are covered... Talking on usage, do you plan to patch the mainline nfs-rdma code to use these verbs? Or. > - MR deallocated with ib_dereg_mr() > - page lists dealloced via ib_free_fast_reg_page_list() From kliteyn at mellanox.co.il Tue May 20 01:05:26 2008 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 20 May 2008 11:05:26 +0300 Subject: [ofa-general] Re: [PATCH] opensm: remove unused pfn_ui_* callback options In-Reply-To: <20080519112125.3ec8ae61.weiny2@llnl.gov> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080519170303.GI4616@sashak.voltaire.com> <20080519170839.GK4616@sashak.voltaire.com> <20080519170916.GL4616@sashak.voltaire.com> <20080519112125.3ec8ae61.weiny2@llnl.gov> Message-ID: <48328646.1000403@mellanox.co.il> Ira Weiny wrote: > Could these be used by some Windows add on? > I'm not aware of any usage of these callbacks. -- Yevgeny > Not that it matters... > > Ira > > > On Mon, 19 May 2008 20:09:16 +0300 > Sasha Khapyorsky wrote: > > >> Remove unused pfn_ui_pre_lid_assign and pfn_ui_mcast_fdb_assign callbacks >> from OpenSM subnet options. 
>> >> Signed-off-by: Sasha Khapyorsky >> --- >> opensm/include/opensm/osm_subnet.h | 20 ------------------ >> opensm/opensm/osm_lid_mgr.c | 7 ------ >> opensm/opensm/osm_mcast_mgr.c | 40 ++++++----------------------------- >> opensm/opensm/osm_subnet.c | 4 --- >> 4 files changed, 7 insertions(+), 64 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h >> index daab453..56b0165 100644 >> --- a/opensm/include/opensm/osm_subnet.h >> +++ b/opensm/include/opensm/osm_subnet.h >> @@ -248,10 +248,6 @@ typedef struct _osm_subn_opt { >> uint16_t console_port; >> cl_map_t port_prof_ignore_guids; >> boolean_t port_profile_switch_nodes; >> - osm_pfn_ui_extension_t pfn_ui_pre_lid_assign; >> - void *ui_pre_lid_assign_ctx; >> - osm_pfn_ui_mcast_extension_t pfn_ui_mcast_fdb_assign; >> - void *ui_mcast_fdb_assign_ctx; >> boolean_t sweep_on_trap; >> char *routing_engine_name; >> boolean_t connect_roots; >> @@ -412,22 +408,6 @@ typedef struct _osm_subn_opt { >> * If TRUE will count the number of switch nodes routed through >> * the link. If FALSE - only CA/RT nodes are counted. >> * >> -* pfn_ui_pre_lid_assign >> -* A UI function to be invoked prior to lid assigment. It should >> -* return 1 if any change was made to any lid or 0 otherwise. >> -* >> -* ui_pre_lid_assign_ctx >> -* A UI context (void *) to be provided to the pfn_ui_pre_lid_assign >> -* >> -* pfn_ui_mcast_fdb_assign >> -* A UI function to be called inside the mcast manager instead of >> -* the call for the build spanning tree. This will be called on >> -* every multicast call for create, join and leave, and is >> -* responsible for the mcast FDB configuration. >> -* >> -* ui_mcast_fdb_assign_ctx >> -* A UI context (void *) to be provided to the pfn_ui_mcast_fdb_assign >> -* >> * sweep_on_trap >> * Received traps will initiate a new sweep. >> * >> diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c >> index af0d020..7f25750 100644 >> --- a/opensm/opensm/osm_lid_mgr.c >> +++ b/opensm/opensm/osm_lid_mgr.c >> @@ -1212,13 +1212,6 @@ osm_signal_t osm_lid_mgr_process_sm(IN osm_lid_mgr_t * const p_mgr) >> persistent db */ >> __osm_lid_mgr_init_sweep(p_mgr); >> >> - if (p_mgr->p_subn->opt.pfn_ui_pre_lid_assign) { >> - OSM_LOG(p_mgr->p_log, OSM_LOG_VERBOSE, >> - "Invoking UI function pfn_ui_pre_lid_assign\n"); >> - p_mgr->p_subn->opt.pfn_ui_pre_lid_assign(p_mgr->p_subn->opt. >> - ui_pre_lid_assign_ctx); >> - } >> - >> /* Set the send_set_reqs of the p_mgr to FALSE, and >> we'll see if any set requests were sent. If not - >> can signal OSM_SIGNAL_DONE */ >> diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c >> index 683a16d..a6185fe 100644 >> --- a/opensm/opensm/osm_mcast_mgr.c >> +++ b/opensm/opensm/osm_mcast_mgr.c >> @@ -1085,7 +1085,6 @@ osm_mcast_mgr_process_tree(osm_sm_t * sm, >> { >> ib_api_status_t status = IB_SUCCESS; >> ib_net16_t mlid; >> - boolean_t ui_mcast_fdb_assign_func_defined; >> >> OSM_LOG_ENTER(sm->p_log); >> >> @@ -1107,44 +1106,19 @@ osm_mcast_mgr_process_tree(osm_sm_t * sm, >> goto Exit; >> } >> >> - if (sm->p_subn->opt.pfn_ui_mcast_fdb_assign) >> - ui_mcast_fdb_assign_func_defined = TRUE; >> - else >> - ui_mcast_fdb_assign_func_defined = FALSE; >> - >> /* >> Clear the multicast tables to start clean, then build >> the spanning tree which sets the mcast table bits for each >> port in the group. 
>> - We will clean the multicast tables if a ui_mcast function isn't >> - defined, or if such function is defined, but we got here >> - through a MC_CREATE request - this means we are creating a new >> - multicast group - clean all old data. >> */ >> - if (ui_mcast_fdb_assign_func_defined == FALSE || >> - req_type == OSM_MCAST_REQ_TYPE_CREATE) >> - __osm_mcast_mgr_clear(sm, p_mgrp); >> - >> - /* If a UI function is defined, then we will call it here. >> - If not - the use the regular build spanning tree function */ >> - if (ui_mcast_fdb_assign_func_defined == FALSE) { >> - status = __osm_mcast_mgr_build_spanning_tree(sm, p_mgrp); >> - if (status != IB_SUCCESS) { >> - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " >> - "Unable to create spanning tree (%s)\n", >> - ib_get_err_str(status)); >> - goto Exit; >> - } >> - } else { >> - if (osm_log_is_active(sm->p_log, OSM_LOG_DEBUG)) { >> - OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> - "Invoking UI function pfn_ui_mcast_fdb_assign\n"); >> - } >> + __osm_mcast_mgr_clear(sm, p_mgrp); >> >> - sm->p_subn->opt.pfn_ui_mcast_fdb_assign(sm->p_subn->opt. >> - ui_mcast_fdb_assign_ctx, >> - mlid, req_type, >> - port_guid); >> + status = __osm_mcast_mgr_build_spanning_tree(sm, p_mgrp); >> + if (status != IB_SUCCESS) { >> + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0A17: " >> + "Unable to create spanning tree (%s)\n", >> + ib_get_err_str(status)); >> + goto Exit; >> } >> >> Exit: >> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c >> index a916270..2191f2d 100644 >> --- a/opensm/opensm/osm_subnet.c >> +++ b/opensm/opensm/osm_subnet.c >> @@ -453,10 +453,6 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) >> p_opt->qos_policy_file = OSM_DEFAULT_QOS_POLICY_FILE; >> p_opt->accum_log_file = TRUE; >> p_opt->port_profile_switch_nodes = FALSE; >> - p_opt->pfn_ui_pre_lid_assign = NULL; >> - p_opt->ui_pre_lid_assign_ctx = NULL; >> - p_opt->pfn_ui_mcast_fdb_assign = NULL; >> - p_opt->ui_mcast_fdb_assign_ctx = NULL; >> p_opt->sweep_on_trap = TRUE; >> p_opt->routing_engine_name = NULL; >> p_opt->connect_roots = FALSE; >> -- >> 1.5.4.rc2.60.gb2e62 >> >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From vlad at dev.mellanox.co.il Tue May 20 01:08:17 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 20 May 2008 11:08:17 +0300 Subject: [ofa-general] [PATCH v1] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size Message-ID: <483286F1.2080406@dev.mellanox.co.il> From a2df38ebba98611e24336c9e4ac4f709224aeadc Mon Sep 17 00:00:00 2001 From: Vladimir Sokolovsky Date: Sun, 18 May 2008 11:25:55 +0300 Subject: [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size There is a bug in the OFED 1.3 mlx4 driver in mlx4_alloc_fmr which hardcoded the minimum acceptable page_shift to be 12. However, new mlx4 firmware has a minimum page_shift of 9 (log_pg_sz of 9 returned by QUERY_DEV_LIM) -- so that ib_fmr_alloc fails for ULPs using the device minimum when creating FMRs. To preserve firmware compatibility with released OFED drivers, the firmware will continue to return 12 as before for log_page_sz in QUERY_DEV_CAP for these drivers. 
However, to enable new drivers to take advantage of the available smaller page size, the mlx4 driver now first sets the log_pg_sz to the device minimum via the MOD_STAT_CFG() command, and only then calls QUERY_DEV_CAP(). The QUERY_DEV_CAP() command then returns the new (lower) log_pg_sz value. Signed-off-by: Jack Morgenstein Signed-off-by: Vladimir Sokolovsky --- drivers/net/mlx4/fw.c | 28 ++++++++++++++++++++++++++++ drivers/net/mlx4/fw.h | 6 ++++++ drivers/net/mlx4/main.c | 7 +++++++ 3 files changed, 41 insertions(+), 0 deletions(-) diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c index d82f275..2b5006b 100644 --- a/drivers/net/mlx4/fw.c +++ b/drivers/net/mlx4/fw.c @@ -101,6 +101,34 @@ static void dump_dev_cap_flags(struct mlx4_dev *dev, u32 flags) mlx4_dbg(dev, " %s\n", fname[i]); } +int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg) +{ + struct mlx4_cmd_mailbox *mailbox; + u32 *inbox; + int err = 0; + +#define MOD_STAT_CFG_IN_SIZE 0x100 + +#define MOD_STAT_CFG_PG_SZ_M_OFFSET 0x002 +#define MOD_STAT_CFG_PG_SZ_OFFSET 0x003 + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + inbox = mailbox->buf; + + memset(inbox, 0, MOD_STAT_CFG_IN_SIZE); + + MLX4_PUT(inbox, cfg->log_pg_sz, MOD_STAT_CFG_PG_SZ_OFFSET); + MLX4_PUT(inbox, cfg->log_pg_sz_m, MOD_STAT_CFG_PG_SZ_M_OFFSET); + + err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_MOD_STAT_CFG, + MLX4_CMD_TIME_CLASS_A); + + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} + int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) { struct mlx4_cmd_mailbox *mailbox; diff --git a/drivers/net/mlx4/fw.h b/drivers/net/mlx4/fw.h index 306cb9b..a0e046c 100644 --- a/drivers/net/mlx4/fw.h +++ b/drivers/net/mlx4/fw.h @@ -38,6 +38,11 @@ #include "mlx4.h" #include "icm.h" +struct mlx4_mod_stat_cfg { + u8 log_pg_sz; + u8 log_pg_sz_m; +}; + struct mlx4_dev_cap { int max_srq_sz; int max_qp_sz; @@ -162,5 +167,6 @@ int mlx4_SET_ICM_SIZE(struct mlx4_dev *dev, u64 icm_size, u64 *aux_pages); int mlx4_MAP_ICM_AUX(struct mlx4_dev *dev, struct mlx4_icm *icm); int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev); int mlx4_NOP(struct mlx4_dev *dev); +int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg); #endif /* MLX4_FW_H */ diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index a6aa49f..d373601 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -485,6 +485,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev) struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_adapter adapter; struct mlx4_dev_cap dev_cap; + struct mlx4_mod_stat_cfg mlx4_cfg; struct mlx4_profile profile; struct mlx4_init_hca_param init_hca; u64 icm_size; @@ -502,6 +503,12 @@ static int mlx4_init_hca(struct mlx4_dev *dev) return err; } + mlx4_cfg.log_pg_sz_m = 1; + mlx4_cfg.log_pg_sz = 0; + err = mlx4_MOD_STAT_CFG(dev, &mlx4_cfg); + if (err) + mlx4_warn(dev, "Failed to override log_pg_sz parameter\n"); + err = mlx4_dev_cap(dev, &dev_cap); if (err) { mlx4_err(dev, "QUERY_DEV_CAP command failed, aborting.\n"); -- 1.5.5.1 From ogerlitz at voltaire.com Tue May 20 02:09:32 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 12:09:32 +0300 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: <48328489.2030305@mellanox.co.il> References: <4831F965.6060607@opengridcomputing.com> <48326A44.2080606@voltaire.com> <48328489.2030305@mellanox.co.il> Message-ID: <4832954C.2080209@voltaire.com> Tziporet Koren wrote: > 
But I think it will be good to know who is the maintainer from the IB > side (at least for OFED users) The mainline maintainer info of bonding is: BONDING DRIVER P: Jay Vosburgh M: fubar at us.ibm.com L: bonding-devel at lists.sourceforge.net W: http://sourceforge.net/projects/bonding/ S: Supported You need to ask him in case you intend to copy this record from the maintainers file, or ask Moni Shoua if you can list him as a contact for issues not related directly to the mainline driver. Or. From ogerlitz at voltaire.com Tue May 20 02:12:37 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 12:12:37 +0300 Subject: [ofa-general] [PATCH v1] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: <483286F1.2080406@dev.mellanox.co.il> References: <483286F1.2080406@dev.mellanox.co.il> Message-ID: <48329605.80704@voltaire.com> Vladimir Sokolovsky wrote: > There is a bug in the OFED 1.3 mlx4 driver in mlx4_alloc_fmr which > hardcoded > the minimum acceptable page_shift to be 12. However, new mlx4 firmware > has a > minimum page_shift of 9 (log_pg_sz of 9 returned by QUERY_DEV_LIM) -- > so that > ib_fmr_alloc fails for ULPs using the device minimum when creating FMRs. Please remove the word OFED from the change-log; the bug is in the mlx4 driver, period, it was not added by patches merged into ofed. Also replace "mlx4 firmware" with "ConnectX firmware", as mlx4 is not the name of any HW product and this text can be confusing. Or. From tziporet at dev.mellanox.co.il Tue May 20 02:56:13 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 20 May 2008 12:56:13 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: <4832738B.7070105@voltaire.com> References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> <4832738B.7070105@voltaire.com> Message-ID: <4832A03D.6000103@mellanox.co.il> Or Gerlitz wrote: > Roland Dreier wrote: >> You misunderstood the patch I think (unless I did). By default new >> kernel + new firmware gets the smaller page size. >> > OK, thanks, I guess the change-log can be changed to make this point > clearer. > > Actually, thinking on this a little further, if the (say 2.6.27/28 or > later) mlx4 driver is going to support the memory extensions, maybe we > could remove from it the support for the proprietary FMRs anyway at > that point? > We plan to add the memory extension for 2.6.27 or 28, but this is with ConnectX only.
So if someone is using InfiniHost III they will still need the FMRs Tziporet From holt at sgi.com Tue May 20 03:01:11 2008 From: holt at sgi.com (Robin Holt) Date: Tue, 20 May 2008 05:01:11 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080520053145.GA19502@wotan.suse.de> References: <200805132206.47655.nickpiggin@yahoo.com.au> <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com> <20080516115005.GC4287@sgi.com> <20080520053145.GA19502@wotan.suse.de> Message-ID: <20080520100111.GC30341@sgi.com> On Tue, May 20, 2008 at 07:31:46AM +0200, Nick Piggin wrote: > On Fri, May 16, 2008 at 06:50:05AM -0500, Robin Holt wrote: > > On Fri, May 16, 2008 at 06:23:06AM -0500, Robin Holt wrote: > > > On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote: > > > > On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote: > > > > > On Thu, 15 May 2008, Nick Piggin wrote: > > > > > > > > > > > Oh, I get that confused because of the mixed up naming conventions > > > > > > there: unmap_page_range should actually be called zap_page_range. But > > > > > > at any rate, yes we can easily zap pagetables without holding mmap_sem. > > > > > > > > > > How is that synchronized with code that walks the same pagetable. These > > > > > walks may not hold mmap_sem either. I would expect that one could only > > > > > remove a portion of the pagetable where we have some sort of guarantee > > > > > that no accesses occur. So the removal of the vma prior ensures that? > > > > > > > > I don't really understand the question. If you remove the pte and invalidate > > > > the TLBS on the remote image's process (importing the page), then it can > > > > of course try to refault the page in because it's vma is still there. But > > > > you catch that refault in your driver , which can prevent the page from > > > > being faulted back in. > > > > > > I think Christoph's question has more to do with faults that are > > > in flight. A recently requested fault could have just released the > > > last lock that was holding up the invalidate callout. It would then > > > begin messaging back the response PFN which could still be in flight. > > > The invalidate callout would then fire and do the interrupt shoot-down > > > while that response was still active (essentially beating the inflight > > > response). The invalidate would clear up nothing and then the response > > > would insert the PFN after it is no longer the correct PFN. > > > > I just looked over XPMEM. I think we could make this work. We already > > have a list of active faults which is protected by a simple spinlock. > > I would need to nest this lock within another lock protected our PFN > > table (currently it is a mutex) and then the invalidate interrupt handler > > would need to mark the fault as invalid (which is also currently there). > > > > I think my sticking points with the interrupt method remain at fault > > containment and timeout. The inability of the ia64 processor to handle > > provide predictive failures for the read/write of memory on other > > partitions prevents us from being able to contain the failure. I don't > > think we can get the information we would need to do the invalidate > > without introducing fault containment issues which has been a continous > > area of concern for our customers. > > Really? 
You can get the information through via a sleeping messaging API, > but not a non-sleeping one? What is the difference from the hardware POV? That was covered in the early very long discussion about 28 seconds. The read timeout for the BTE is 28 seconds and it automatically retried for certain failures. In interrupt context, that is 56 seconds without any subsequent interrupts of that or lower priority. Thanks, Robin From npiggin at suse.de Tue May 20 03:50:25 2008 From: npiggin at suse.de (Nick Piggin) Date: Tue, 20 May 2008 12:50:25 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080520100111.GC30341@sgi.com> References: <20080513153238.GL19717@sgi.com> <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com> <20080516115005.GC4287@sgi.com> <20080520053145.GA19502@wotan.suse.de> <20080520100111.GC30341@sgi.com> Message-ID: <20080520105025.GA25791@wotan.suse.de> On Tue, May 20, 2008 at 05:01:11AM -0500, Robin Holt wrote: > On Tue, May 20, 2008 at 07:31:46AM +0200, Nick Piggin wrote: > > > > Really? You can get the information through via a sleeping messaging API, > > but not a non-sleeping one? What is the difference from the hardware POV? > > That was covered in the early very long discussion about 28 seconds. > The read timeout for the BTE is 28 seconds and it automatically retried > for certain failures. In interrupt context, that is 56 seconds without > any subsequent interrupts of that or lower priority. I thought you said it would be possible to get the required invalidate information without using the BTE. Couldn't you use XPMEM pages in the kernel to read the data out of, if nothing else? From holt at sgi.com Tue May 20 04:05:29 2008 From: holt at sgi.com (Robin Holt) Date: Tue, 20 May 2008 06:05:29 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080520105025.GA25791@wotan.suse.de> References: <20080514041122.GE24516@wotan.suse.de> <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com> <20080516115005.GC4287@sgi.com> <20080520053145.GA19502@wotan.suse.de> <20080520100111.GC30341@sgi.com> <20080520105025.GA25791@wotan.suse.de> Message-ID: <20080520110528.GD30341@sgi.com> On Tue, May 20, 2008 at 12:50:25PM +0200, Nick Piggin wrote: > On Tue, May 20, 2008 at 05:01:11AM -0500, Robin Holt wrote: > > On Tue, May 20, 2008 at 07:31:46AM +0200, Nick Piggin wrote: > > > > > > Really? You can get the information through via a sleeping messaging API, > > > but not a non-sleeping one? What is the difference from the hardware POV? > > > > That was covered in the early very long discussion about 28 seconds. > > The read timeout for the BTE is 28 seconds and it automatically retried > > for certain failures. In interrupt context, that is 56 seconds without > > any subsequent interrupts of that or lower priority. > > I thought you said it would be possible to get the required invalidate > information without using the BTE. Couldn't you use XPMEM pages in > the kernel to read the data out of, if nothing else? I was wrong about that. I thought it was safe to do an uncached write, but it turns out any processor write is uncontained and the MCA that surfaces would be fatal. Likewise for the uncached read. 
From npiggin at suse.de Tue May 20 04:14:24 2008 From: npiggin at suse.de (Nick Piggin) Date: Tue, 20 May 2008 13:14:24 +0200 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080520110528.GD30341@sgi.com> References: <20080514112625.GY9878@sgi.com> <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com> <20080516115005.GC4287@sgi.com> <20080520053145.GA19502@wotan.suse.de> <20080520100111.GC30341@sgi.com> <20080520105025.GA25791@wotan.suse.de> <20080520110528.GD30341@sgi.com> Message-ID: <20080520111424.GB25791@wotan.suse.de> On Tue, May 20, 2008 at 06:05:29AM -0500, Robin Holt wrote: > On Tue, May 20, 2008 at 12:50:25PM +0200, Nick Piggin wrote: > > On Tue, May 20, 2008 at 05:01:11AM -0500, Robin Holt wrote: > > > On Tue, May 20, 2008 at 07:31:46AM +0200, Nick Piggin wrote: > > > > > > > > Really? You can get the information through via a sleeping messaging API, > > > > but not a non-sleeping one? What is the difference from the hardware POV? > > > > > > That was covered in the early very long discussion about 28 seconds. > > > The read timeout for the BTE is 28 seconds and it automatically retried > > > for certain failures. In interrupt context, that is 56 seconds without > > > any subsequent interrupts of that or lower priority. > > > > I thought you said it would be possible to get the required invalidate > > information without using the BTE. Couldn't you use XPMEM pages in > > the kernel to read the data out of, if nothing else? > > I was wrong about that. I thought it was safe to do an uncached write, > but it turns out any processor write is uncontained and the MCA that > surfaces would be fatal. Likewise for the uncached read. Oh, so the BTE transfer is purely for fault isolation. I was thinking you guys might have sufficient control of the hardware to be able to do it at the level of CPU memory operations, but if it is some limitation of ia64, then I guess that's a problem. How do you do fault isolation of userspace XPMEM accesses? From ogerlitz at voltaire.com Tue May 20 04:19:35 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 14:19:35 +0300 Subject: [ofa-general] [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size In-Reply-To: <4832A03D.6000103@mellanox.co.il> References: <483049BF.4050603@dev.mellanox.co.il> <15ddcffd0805190744uaf4710ah9a00a5653fe9f45a@mail.gmail.com> <15ddcffd0805190755ka46eb48p523272b1e10f00c9@mail.gmail.com> <4832738B.7070105@voltaire.com> <4832A03D.6000103@mellanox.co.il> Message-ID: <4832B3C7.1060602@voltaire.com> Tziporet Koren wrote: > Or Gerlitz wrote: >> Actually, thinking on this a little further, if the (say 2.6.27/28 or >> later) mlx4 driver is going to support the memory extensions, maybe >> we could remove from it the support for the proprietary FMRs anyway >> at that point? > We plan to add the memory extension for 2.6.27 or 28, but this is with > ConnectX only. > So if someone is using InfiniHost III they will still need the FMRs > Sure, if it was not clear, I said remove it from the --mlx4-- driver and not from the core/mthca Or.
From holt at sgi.com Tue May 20 04:26:35 2008 From: holt at sgi.com (Robin Holt) Date: Tue, 20 May 2008 06:26:35 -0500 Subject: [ofa-general] Re: [PATCH 08 of 11] anon-vma-rwsem In-Reply-To: <20080520111424.GB25791@wotan.suse.de> References: <20080515075747.GA7177@wotan.suse.de> <20080515235203.GB25305@wotan.suse.de> <20080516112306.GA4287@sgi.com> <20080516115005.GC4287@sgi.com> <20080520053145.GA19502@wotan.suse.de> <20080520100111.GC30341@sgi.com> <20080520105025.GA25791@wotan.suse.de> <20080520110528.GD30341@sgi.com> <20080520111424.GB25791@wotan.suse.de> Message-ID: <20080520112635.GE30341@sgi.com> On Tue, May 20, 2008 at 01:14:24PM +0200, Nick Piggin wrote: > On Tue, May 20, 2008 at 06:05:29AM -0500, Robin Holt wrote: > > On Tue, May 20, 2008 at 12:50:25PM +0200, Nick Piggin wrote: > > > On Tue, May 20, 2008 at 05:01:11AM -0500, Robin Holt wrote: > > > > On Tue, May 20, 2008 at 07:31:46AM +0200, Nick Piggin wrote: > > > > > > > > > > Really? You can get the information through via a sleeping messaging API, > > > > > but not a non-sleeping one? What is the difference from the hardware POV? > > > > > > > > That was covered in the early very long discussion about 28 seconds. > > > > The read timeout for the BTE is 28 seconds and it automatically retried > > > > for certain failures. In interrupt context, that is 56 seconds without > > > > any subsequent interrupts of that or lower priority. > > > > > > I thought you said it would be possible to get the required invalidate > > > information without using the BTE. Couldn't you use XPMEM pages in > > > the kernel to read the data out of, if nothing else? > > > > I was wrong about that. I thought it was safe to do an uncached write, > > but it turns out any processor write is uncontained and the MCA that > > surfaces would be fatal. Likewise for the uncached read. > > Oh, so the BTE transfer is purely for fault isolation. I was thinking > you guys might have sufficient control of the hardware to be able to > do it at the level of CPU memory operations, but if it is some > limitation of ia64, then I guess that's a problem. > > How do you do fault isolation of userspace XPMEM accesses? The MCA handler can see the fault was either in userspace (processor privilege level I believe) or in the early kernel entry where it is saving registers. When it sees that condition, it kills the user's process. While in kernel space, there is no equivalent of the saving user state that forces the processor stall. From ogerlitz at voltaire.com Tue May 20 05:10:20 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 20 May 2008 15:10:20 +0300 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <000101c8b9f6$c4dd4cf0$bd59180a@amr.corp.intel.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> <482C68A4.9020305@opengridcomputing.com> <4831649A.2020206@voltaire.com> <000101c8b9f6$c4dd4cf0$bd59180a@amr.corp.intel.com> Message-ID: <4832BFAC.2050506@voltaire.com> Sean Hefty wrote: > and in concept I prefer to: > * Always report the event and let ULPs ignore it > * Let someone come up with a fantastically simple way of reporting new events I am fine with the approach of always reporting the event and letting ULPs ignore it. Looking at how the ABI versions are exchanged between the rdma_ucm module and librdmacm, I don't see many alternatives other than bumping the ABI version to five.
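To illustrate the "report it and let ULPs ignore it" approach -- a ULP's rdma_cm event handler can simply fall through on anything it does not recognize; returning zero keeps the rdma_cm_id alive, while the new (still unnamed in this thread, so purely hypothetical) high-availability event would just hit the default arm:

static int ulp_cma_handler(struct rdma_cm_id *id,
			   struct rdma_cm_event *event)
{
	switch (event->event) {
	case RDMA_CM_EVENT_ADDR_RESOLVED:
		/* ... kick off route resolution ... */
		break;
	case RDMA_CM_EVENT_DISCONNECTED:
		/* ... tear down the connection ... */
		break;
	default:
		/*
		 * Events this ULP does not care about -- including any
		 * future HA event -- are silently ignored.  Returning
		 * non-zero here would destroy the rdma_cm_id.
		 */
		break;
	}
	return 0;
}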
If librdmacm can somehow note against what ABI version the app was built, we could bump the ABI version to five and require the user to upgrade his librdmacm to be able to run, but have --librdmacm-- hide this event from the user in case "his version" of the ABI is smaller. > I spent most of the morning looking at this, and until I know what the > trade-offs really are in the implementation, I can't say that I have a strong > preference for how to deal with any of this. My main concerns are: > > * All callbacks from the rdma_cm are serialized > * We minimize the overhead of reporting events > * We don't lose events > * If the user returns a non-zero value from a callback, the rdma_cm_id is > destroyed, and no further callbacks are invoked. Thanks for looking into that. Yes, I think it's correct and fair to require that all these characteristics remain after merging the new event. > The existing rdma_cm callbacks are naturally serialized with each other. > (Callback for connect after resolve route after resolve address...) This allows > using the stack for event structures, but the cost is complex synchronization > with device removal. Supporting additional events while meeting the concerns > listed above will be equally challenging. So if we can simplify device removal > handling, then supporting similar types of events should be easier as well. > > If we can guarantee that this works, one option is to acquire a mutex before > invoking a callback on an rdma_cm_id. I hesitate to hold any locks while in a > callback, since it restricts what the user can do, but if the mutex is only used > to synchronize calling the user back, it may work, since the rdma_cm never > invokes a callback from a downcall. This should simplify the device removal > handling, eliminating wait_remove and dev_remove from the rdma_cm_id. I would like to look into this possibility, which as you stated later in your post is simpler compared to the alternatives and would also make the current code supporting device removal less complex. So can/should that mutex be the existing one defined in cma.c or a new one? Or From tziporet at dev.mellanox.co.il Tue May 20 05:12:39 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 20 May 2008 15:12:39 +0300 Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others In-Reply-To: References: <482AC510.3090602@dev.mellanox.co.il> <483045CD.8060301@mellanox.co.il> Message-ID: <4832C037.1060206@mellanox.co.il> Chris Worley wrote: > In using netcat in UDP mode over IPoIB, I lose 25%-40% of the > packets. Is that expected? > What is the netcat test? Tziporet From tziporet at dev.mellanox.co.il Tue May 20 05:20:47 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 20 May 2008 15:20:47 +0300 Subject: [ofa-general][PATCH] Re: mlx4: Completion EQ per cpu (MP support, Patch 10) In-Reply-To: References: <40FA0A8088E8A441973D37502F00933E3A24@mtlexch01.mtl.com> <48319B4C.7040309@mellanox.co.il> Message-ID: <4832C21F.9060005@mellanox.co.il> Roland Dreier wrote: > > We can add the multiple interrupt vectors support in two stages: > > 1. The low level driver can create multiple interrupt vectors. Their name would include a > > serial number from 0 to #CPU's-1. The number of completion vectors can > > be populated through ib_device.num_comp_vectors. Then each ulp can ask for a specific > > completion vector when creating CQ, which means that passing vector=0 while creating CQ > > will assign it to completion vector #0.
> > > > 2. As the second stage, we can create a "don't care" value which would mean that the driver can > > can attach the CQ to any completion vector. In this case the policy shouldn't necessary be > > round-robin. We can manage the number of "clients" for each completion vector and then assign the CQ > > to the least busy one. > > this makes sense. However I think we need to come up with some > mechanism where a ULP or application can assign some semantic value to > the CQ event vector it chooses. Maybe a new verb is required. > > Add another verb is also a good idea. Do you have anything in mind? For now all ULPs use vector 0 and we stay with the same behavior as today. So is it OK to merge the change of the mlx4_core driver now? Thanks Tziporet From vlad at dev.mellanox.co.il Tue May 20 05:25:52 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 20 May 2008 15:25:52 +0300 Subject: [ofa-general] [PATCH v2] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size Message-ID: <4832C350.50004@dev.mellanox.co.il> From a2df38ebba98611e24336c9e4ac4f709224aeadc Mon Sep 17 00:00:00 2001 From: Vladimir Sokolovsky Date: Sun, 18 May 2008 11:25:55 +0300 Subject: [PATCH] mlx4: implement MOD_STAT_CFG command to use for changing ConnectX page size There was a bug in the mlx4 driver in mlx4_alloc_fmr which hardcoded the minimum acceptable page_shift to be 12. However, new ConnectX firmware has a minimum page_shift of 9 (log_pg_sz of 9 returned by QUERY_DEV_LIM) -- so that ib_fmr_alloc fails for ULPs using the device minimum when creating FMRs. To preserve firmware compatibility with released mlx4 drivers, the firmware will continue to return 12 as before for log_page_sz in QUERY_DEV_CAP for these drivers. However, to enable new drivers to take advantage of the available smaller page size, the mlx4 driver now first sets the log_pg_sz to the device minimum via the MOD_STAT_CFG() command, and only then calls QUERY_DEV_CAP(). The QUERY_DEV_CAP() command then returns the new (lower) log_pg_sz value. 
Signed-off-by: Jack Morgenstein Signed-off-by: Vladimir Sokolovsky --- drivers/net/mlx4/fw.c | 28 ++++++++++++++++++++++++++++ drivers/net/mlx4/fw.h | 6 ++++++ drivers/net/mlx4/main.c | 7 +++++++ 3 files changed, 41 insertions(+), 0 deletions(-) diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c index d82f275..2b5006b 100644 --- a/drivers/net/mlx4/fw.c +++ b/drivers/net/mlx4/fw.c @@ -101,6 +101,34 @@ static void dump_dev_cap_flags(struct mlx4_dev *dev, u32 flags) mlx4_dbg(dev, " %s\n", fname[i]); } +int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg) +{ + struct mlx4_cmd_mailbox *mailbox; + u32 *inbox; + int err = 0; + +#define MOD_STAT_CFG_IN_SIZE 0x100 + +#define MOD_STAT_CFG_PG_SZ_M_OFFSET 0x002 +#define MOD_STAT_CFG_PG_SZ_OFFSET 0x003 + + mailbox = mlx4_alloc_cmd_mailbox(dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + inbox = mailbox->buf; + + memset(inbox, 0, MOD_STAT_CFG_IN_SIZE); + + MLX4_PUT(inbox, cfg->log_pg_sz, MOD_STAT_CFG_PG_SZ_OFFSET); + MLX4_PUT(inbox, cfg->log_pg_sz_m, MOD_STAT_CFG_PG_SZ_M_OFFSET); + + err = mlx4_cmd(dev, mailbox->dma, 0, 0, MLX4_CMD_MOD_STAT_CFG, + MLX4_CMD_TIME_CLASS_A); + + mlx4_free_cmd_mailbox(dev, mailbox); + return err; +} + int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) { struct mlx4_cmd_mailbox *mailbox; diff --git a/drivers/net/mlx4/fw.h b/drivers/net/mlx4/fw.h index 306cb9b..a0e046c 100644 --- a/drivers/net/mlx4/fw.h +++ b/drivers/net/mlx4/fw.h @@ -38,6 +38,11 @@ #include "mlx4.h" #include "icm.h" +struct mlx4_mod_stat_cfg { + u8 log_pg_sz; + u8 log_pg_sz_m; +}; + struct mlx4_dev_cap { int max_srq_sz; int max_qp_sz; @@ -162,5 +167,6 @@ int mlx4_SET_ICM_SIZE(struct mlx4_dev *dev, u64 icm_size, u64 *aux_pages); int mlx4_MAP_ICM_AUX(struct mlx4_dev *dev, struct mlx4_icm *icm); int mlx4_UNMAP_ICM_AUX(struct mlx4_dev *dev); int mlx4_NOP(struct mlx4_dev *dev); +int mlx4_MOD_STAT_CFG(struct mlx4_dev *dev, struct mlx4_mod_stat_cfg *cfg); #endif /* MLX4_FW_H */ diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index a6aa49f..d373601 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -485,6 +485,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev) struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_adapter adapter; struct mlx4_dev_cap dev_cap; + struct mlx4_mod_stat_cfg mlx4_cfg; struct mlx4_profile profile; struct mlx4_init_hca_param init_hca; u64 icm_size; @@ -502,6 +503,12 @@ static int mlx4_init_hca(struct mlx4_dev *dev) return err; } + mlx4_cfg.log_pg_sz_m = 1; + mlx4_cfg.log_pg_sz = 0; + err = mlx4_MOD_STAT_CFG(dev, &mlx4_cfg); + if (err) + mlx4_warn(dev, "Failed to override log_pg_sz parameter\n"); + err = mlx4_dev_cap(dev, &dev_cap); if (err) { mlx4_err(dev, "QUERY_DEV_CAP command failed, aborting.\n"); -- 1.5.5.1 From hrosenstock at xsigo.com Tue May 20 05:46:25 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 05:46:25 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <48326C32.7000303@Sun.COM> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> <4831B519.2060002@dev.mellanox.co.il> <1211217595.12616.490.camel@hrosenstock-ws.xsigo.com> <48326C32.7000303@Sun.COM> 
Message-ID: <1211287585.12616.568.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 11:44 +0530, Sumit Gaur - Sun Microsystem wrote: > How we can identify and filter these incoming SM packets in application from > the regular responses. I'm surprised that it's working this way, i.e. that SM responses are getting into your application, as they _should_ have a different transaction ID per the following. From the kernel Documentation/infiniband/user_mad.txt: Transaction IDs Users of the umad devices can use the lower 32 bits of the transaction ID field (that is, the least significant half of the field in network byte order) in MADs being sent to match request/response pairs. The upper 32 bits are reserved for use by the kernel and will be overwritten before a MAD is sent. Is the same fd being used by OpenSM and your application somehow, or are you not using OpenSM and your SM overlaps with this? -- Hal From hrosenstock at xsigo.com Tue May 20 05:46:47 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 05:46:47 -0700 Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others In-Reply-To: <4832C037.1060206@mellanox.co.il> References: <482AC510.3090602@dev.mellanox.co.il> <483045CD.8060301@mellanox.co.il> <4832C037.1060206@mellanox.co.il> Message-ID: <1211287607.12616.569.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 15:12 +0300, Tziporet Koren wrote: > Chris Worley wrote: > > In using netcat in UDP mode over IPoIB, I loose 25%-40% of the > > packets. Is that expected? > > > > What is the netcap test? See http://netcat.sourceforge.net/ > Tziporet > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Thomas.Talpey at netapp.com Tue May 20 05:51:13 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 20 May 2008 08:51:13 -0400 Subject: [ofa-general] RE: Current list of Linux maintainers and their email info In-Reply-To: References: <4831F965.6060607@opengridcomputing.com> Message-ID: At 06:28 PM 5/19/2008, Woodruff, Robert J wrote: >Here is what I have so far as the list of kernel and userspace >components. > > Kernel Components > >Core kernel drivers, infiniband/core >... Your list doesn't include NFS/RDMA, but those components are effectively part of existing MAINTAINER sections. So, it might be worth adding a note to the effect that the NFS/RDMA client is maintained by Tom Talpey, as part of Trond Myklebust's NFS client codebase, and the NFS/RDMA server is Tom Tucker's, as part of Bruce Fields' and Neil Brown's NFS server base. These are the "NFS CLIENT" and "KERNEL NFSD" sections of the existing MAINTAINERS file, respectively. I don't personally think these NFS/RDMA sub-components currently rise to the level of needing mention in MAINTAINERS, but something in the openib docs would be very good. Tom.
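As a side note to Hal's transaction ID explanation above, here is a minimal userspace sketch of the matching rule from user_mad.txt. The struct below is a simplified stand-in rather than the real ib_user_mad layout, and the helper names are hypothetical; the key point is that only the lower 32 bits of the TID belong to the application, since the kernel rewrites the upper half on send.

#include <endian.h>
#include <stdint.h>

/* Simplified stand-in for a MAD header; only the TID matters here. */
struct mad_hdr {
	uint8_t  fields[8];	/* base version, mgmt class, method, status... */
	uint64_t tid;		/* transaction ID, network byte order */
};

static uint32_t next_app_tid;

/* Stamp only the lower 32 bits of the TID before sending; the
 * kernel owns the upper 32 bits and overwrites them anyway. */
static void stamp_tid(struct mad_hdr *hdr)
{
	uint64_t tid = be64toh(hdr->tid) & 0xffffffff00000000ull;

	hdr->tid = htobe64(tid | ++next_app_tid);
}

/* Match a response to a request by comparing only the lower 32 bits,
 * since the upper half was rewritten by the kernel on the way out. */
static int tid_matches(const struct mad_hdr *req, const struct mad_hdr *rsp)
{
	return (uint32_t)be64toh(req->tid) == (uint32_t)be64toh(rsp->tid);
}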
From monis at Voltaire.COM Tue May 20 06:31:08 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 20 May 2008 16:31:08 +0300 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> Message-ID: <4832D29C.2060704@Voltaire.COM> Roland Dreier wrote: > By the way:
> > > + struct ib_sa_sm_ah *sm_ah;
> > > +
> > > + spin_lock_irqsave(&port->ah_lock, flags);
> > > + sm_ah = port->sm_ah;
> > > + port->sm_ah = NULL;
> > > + spin_unlock_irqrestore(&port->ah_lock, flags);
> > > +
> > > + if (sm_ah)
> > > + kref_put(&sm_ah->ref, free_sm_ah);
> Is there some reason why this can't be simpler like:
> spin_lock_irqsave(&port->ah_lock, flags);
> if (port->sm_ah)
> kref_put(&port->sm_ah->ref, free_sm_ah);
> port->sm_ah = NULL;
> spin_unlock_irqrestore(&port->ah_lock, flags);
What happens if this happens:

# | CPU-0                                               | CPU-1
  |                                                     |
1 | if (port->sm_ah)                                    |
  |     kref_put(&port->sm_ah->ref, free_sm_ah);        |
--+-----------------------------------------------------+-----------------------
2 |                                                     | alloc_mad()
--+-----------------------------------------------------+-----------------------
3 | port->sm_ah = NULL;                                 |

As I see it, the process on CPU-1 gets a garbage sm_ah. Do you agree? > I guess the same cleanup applies to update_sm_ah(), except after your > patch I don't see any way that update_sm_ah() could be called with sm_ah > anything but NULL, so we could drop the old_ah stuff completely there. I agree. The cleanup code can be completely removed. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From monis at Voltaire.COM Tue May 20 06:32:22 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 20 May 2008 16:32:22 +0300 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> Message-ID: <4832D2E6.9050907@Voltaire.COM> Roland Dreier wrote: > > and what happens if alloc_mad() is called while port->sm_ah is NULL? > > Trivial fix seems to be to move the test for whether port->sm_ah is NULL > into alloc_mad(), and have it return -EAGAIN if so. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > I agree From monis at Voltaire.COM Tue May 20 06:34:33 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 20 May 2008 16:34:33 +0300 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> Message-ID: <4832D369.7070300@Voltaire.COM> Roland Dreier wrote: > and what happens if alloc_mad() is called while port->sm_ah is NULL?
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > In this case it is protected by the check after alloc_mad() is called, but I already moved the check inside alloc_mad() as you suggested, so the answer is a bit irrelevant now. From swise at opengridcomputing.com Tue May 20 06:40:44 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 May 2008 08:40:44 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48327198.7080305@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> <4831347A.1010506@voltaire.com> <4831A0DF.2070603@opengridcomputing.com> <48327198.7080305@voltaire.com> Message-ID: <4832D4DC.2040006@opengridcomputing.com> Or Gerlitz wrote: > Steve Wise wrote: >>> dma mapping would work too but then handling the map/unmap becomes an >>> issue. I think it is way too complicated to add new verbs for >>> map/unmap fastreg page list (in addition to the alloc/free fastreg page >>> list that we are already adding) and force the consumer to do it. And >>> if we expect the low-level driver to do it, then the map is easy >>> (can be >>> done while posting the send) but the unmap is a pain -- it would >>> have to >>> be done inside poll_cq when reaping the completion, and the low-level >>> driver would have to keep some complicated extra data structure to go >>> back from the completion to the original fast reg page list structure. >>> >> And certain platforms can fail map requests (like PPC64) because they >> have limited resources for dma mapping. So then you'd fail a SQ work >> request when you might not want to... > I see the point in allocating the page lists in dma consistent memory > to make the mechanics of letting the HCA to DMA the list easier and > simpler, as I think Roland is suggesting in his post. However, I am > not sure to understand how this helps in the PPC64 case, if the HCA > does DMA to fetch the list, then IOMMU slots have to be consumed this > way or another, correct? > My point is that if you do the mapping at allocation time, then the failure will happen when you allocate the page list vs when you post the send WR. Maybe it doesn't matter, but the idea, I think, is to not fail post_send for lack of resources. Everything should be pre-allocated pretty much by the time you post work requests... Steve. > Or.
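Pulling the pieces of this thread together, here is a rough sketch of the fast-register usage model being proposed, with everything allocated up front so the data path never allocates or maps. The verbs were still under definition at this point, so the work request field names (fr_wr.wr.fast_reg.*) and the dma_addr_of_page() helper below are assumptions for illustration, not a settled API:

/* One-time setup, e.g. at mount time for NFS-RDMA. The rkey/stag is
 * fixed from here on; only the page list bound to it changes. */
mr = ib_alloc_fast_reg_mr(pd, max_pages);
page_list = ib_alloc_fast_reg_page_list(device, max_pages);

/* Per I/O: fill in the page list and post the fast register WR,
 * fenced ahead of the send that advertises the rkey to the peer. */
for (i = 0; i < npages; i++)
	page_list->page_list[i] = dma_addr_of_page(i);	/* hypothetical helper */

memset(&fr_wr, 0, sizeof fr_wr);
fr_wr.opcode = IB_WR_FAST_REG_MR;
fr_wr.wr.fast_reg.page_list = page_list;	/* assumed field names */
fr_wr.wr.fast_reg.rkey = mr->rkey;
ret = ib_post_send(qp, &fr_wr, &bad_wr);

/* ... peer does its RDMA against mr->rkey ... */

/* Tear down just the mapping; the MR and page list are reused for
 * the next I/O and only freed when the ULP shuts down. */
memset(&inv_wr, 0, sizeof inv_wr);
inv_wr.opcode = IB_WR_INVALIDATE_MR;
inv_wr.wr.fast_reg.rkey = mr->rkey;
ret = ib_post_send(qp, &inv_wr, &bad_wr);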
> From rdreier at cisco.com Tue May 20 06:53:45 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 06:53:45 -0700 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: <4832D29C.2060704@Voltaire.COM> (Moni Shoua's message of "Tue, 20 May 2008 16:31:08 +0300") References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> <4832D29C.2060704@Voltaire.COM> Message-ID: > > spin_lock_irqsave(&port->ah_lock, flags); > > if (port->sm_ah) > > kref_put(&port->sm_ah->ref, free_sm_ah); > > port->sm_ah = NULL; > > spin_unlock_irqrestore(&port->ah_lock, flags); > > > What happens if this happens > > # | CPU-0 | CPU-1 > | | > 1 | if (port->sm_ah) | > | kref_put(&port->sm_ah->ref, free_sm_ah); | > --+-----------------------------------------------------+----------------------- > 2 | | alloc_mad() > --+-----------------------------------------------------+----------------------- > 3 | port->sm_ah = NULL; | > > As I see it, process on CPU-1 gets a garbage sm_ah > Do you agree? alloc_mad() must obviously take the lock when looking at port->sm_ah, and take a reference with kref_get() before dropping the lock. - R. From swise at opengridcomputing.com Tue May 20 06:55:28 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 20 May 2008 08:55:28 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483285DC.20003@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> Message-ID: <4832D850.2010102@opengridcomputing.com> Or Gerlitz wrote: > Steve Wise wrote: >> Support for the IB BMME and iWARP equivalent memory extensions to non >> shared memory regions. Usage Model: >> - MR allocated with ib_alloc_mr() >> - Page lists allocated via ib_alloc_fast_reg_page_list(). >> - MR made VALID and bound to a specific page list via >> ib_post_send(IB_WR_FAST_REG_MR) >> - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) > Steve, > > I am trying to further understand what would be a real life ULP design > here, and I think there are some more issues to clarify/define for the > case of ULP which has to create a mapping for a list of pages and send > this mapping (eg IB/rkey iWARP/stag) to a remote party that uses it > for RDMA. > > AFAIK, the idea was to let the ulp post --two-- work requests, where > the first creates the mapping and the second sends this mapping to the > remote side, such that the second does not start before the first > completes (i.e a fence). > > Now, the above scheme means that the ulp knows the value of the > rkey/stag at the time of posting these two work requests (since it has > to encode it in the second one), so something has to be clarified re > the rkey/stag here, do they change each time this MR is used? how many > bits can be changed, etc. The ULP knows the rkey/stag because its returned up front in the ib_alloc_fast_reg_mr(). And it doesn't change (ignoring the key issue which we haven't exposed yet to the ULP). The same rkey/stag can be used for multiple mappings. It can be made invalid at any point in time via the IB_WR_INVALIDATE_MR so the fact that you're leaving the same rkey/stag advertised is not a risk. 
So you allocate the rkey/stag up front, allocate page_lists up front, then as needed you populate your page list and bind it to the rkey/stag via IB_WR_FAST_REG_MR, and invalidate that mapping via IB_WR_INVALIDATE_MR. You can do this any number of times, and with proper fencing, you can pipeline these mappings. Eventually when you're done doing IO (like for NFSRDMA when the mount is unmounted) you free up the page list(s) and mr/rkey/stag. So NFSRDMA will keep these fast_reg_mrs and page_list structs pre-allocated and hung off some context so that per RPC, they can be bound/registered, the IO executed, and then the MR invalidated as part of processing the RPC. > > I guess my questions are to some extent RTFM ones, but, first, with > some quick looking in the IB spec I did not manage to get enough > answers (pointers appreciated...) and second, you are proposing an > implementation here, so I think it makes sense to review the actual > usage model to see all aspects needed for ULPs are covered... > > Talking on usage, do you plan to patch the mainline nfs-rdma code to > use these verbs? Yes. Tom Tucker will be doing this. Jon Mason is implementing RDS changes to utilize this too. The hope is all this makes 2.6.27/ofed-1.4. I can also post test code (krping module) if anyone is interested. I'm developing that now. Steve. From monis at Voltaire.COM Tue May 20 07:01:02 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 20 May 2008 17:01:02 +0300 Subject: [ofa-general] [PATCH V2 1/2] IB/core: handle race between elements in work queues after event In-Reply-To: <48302034.8040709@Voltaire.COM> References: <48302034.8040709@Voltaire.COM> Message-ID: <4832D99E.3010205@Voltaire.COM> This is the second version after some of Roland's comments. One comment, however, is still pending. ----------------------------------------------------- This patch solves a race between work elements that are carried out after an event occurs. When the SM address handle becomes invalid and needs an update, it is handled by a work in the global workqueue. On the other hand, this event is also handled in ib_ipoib by queuing a work in the ipoib_workqueue that does the mcast join. Although queuing is in the right order, it is done to 2 different workqueues and so there is no guarantee that the first to be queued is the first to be executed. The patch sets the SM address handle to NULL, and until update_sm_ah() is called, any request that needs sm_ah is replied with -EAGAIN return status. For consumers, the patch doesn't make things worse. Before the patch, MADs are sent to the wrong SM so the request gets lost. Consumers can be improved if they examine the return code and respond to EAGAIN properly, but even without an improvement the situation is not getting worse, and in some cases it gets better. If ib_sa_event() is called during a consumer's work (e.g. ib_sa_path_rec_get()) after the check for a NULL SM address handle, the result would be as before the patch, without a risk of dereferencing a NULL pointer.
Signed-off-by: Moni Levy Signed-off-by: Moni Shoua --- drivers/infiniband/core/sa_query.c | 26 +++++++++++++++++++------- 1 file changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index cf474ec..7224bd1 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -361,7 +361,7 @@ static void update_sm_ah(struct work_struct *work) { struct ib_sa_port *port = container_of(work, struct ib_sa_port, update_task); - struct ib_sa_sm_ah *new_ah, *old_ah; + struct ib_sa_sm_ah *new_ah; struct ib_port_attr port_attr; struct ib_ah_attr ah_attr; @@ -397,12 +397,9 @@ static void update_sm_ah(struct work_struct *work) } spin_lock_irq(&port->ah_lock); - old_ah = port->sm_ah; port->sm_ah = new_ah; spin_unlock_irq(&port->ah_lock); - if (old_ah) - kref_put(&old_ah->ref, free_sm_ah); } static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) @@ -413,9 +410,20 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event event->event == IB_EVENT_PKEY_CHANGE || event->event == IB_EVENT_SM_CHANGE || event->event == IB_EVENT_CLIENT_REREGISTER) { - struct ib_sa_device *sa_dev; - sa_dev = container_of(handler, typeof(*sa_dev), event_handler); - + unsigned long flags; + struct ib_sa_device *sa_dev = + container_of(handler, typeof(*sa_dev), event_handler); + struct ib_sa_port *port = + &sa_dev->port[event->element.port_num - sa_dev->start_port]; + struct ib_sa_sm_ah *sm_ah; + + spin_lock_irqsave(&port->ah_lock, flags); + sm_ah = port->sm_ah; + port->sm_ah = NULL; + spin_unlock_irqrestore(&port->ah_lock, flags); + + if (sm_ah) + kref_put(&sm_ah->ref, free_sm_ah); schedule_work(&sa_dev->port[event->element.port_num - sa_dev->start_port].update_task); } @@ -519,6 +527,10 @@ static int alloc_mad(struct ib_sa_query *query, gfp_t gfp_mask) unsigned long flags; spin_lock_irqsave(&query->port->ah_lock, flags); + if (!query->port->sm_ah) { + spin_unlock_irqrestore(&query->port->ah_lock, flags); + return -EAGAIN; + } kref_get(&query->port->sm_ah->ref); query->sm_ah = query->port->sm_ah; spin_unlock_irqrestore(&query->port->ah_lock, flags); From monis at Voltaire.COM Tue May 20 07:05:26 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 20 May 2008 17:05:26 +0300 Subject: [ofa-general] [PATCH 1/2] IB/core: handle race between elements in qork queues after event In-Reply-To: References: <48302034.8040709@Voltaire.COM> <48302257.2050308@Voltaire.COM> <483199EC.7070900@Voltaire.COM> <4832D29C.2060704@Voltaire.COM> Message-ID: <4832DAA6.7070500@Voltaire.COM> Roland Dreier wrote:
> > > spin_lock_irqsave(&port->ah_lock, flags);
> > > if (port->sm_ah)
> > > kref_put(&port->sm_ah->ref, free_sm_ah);
> > > port->sm_ah = NULL;
> > > spin_unlock_irqrestore(&port->ah_lock, flags);
> >
> > What happens if this happens:
> >
> > # | CPU-0                                               | CPU-1
> >   |                                                     |
> > 1 | if (port->sm_ah)                                    |
> >   |     kref_put(&port->sm_ah->ref, free_sm_ah);        |
> > --+-----------------------------------------------------+-----------------------
> > 2 |                                                     | alloc_mad()
> > --+-----------------------------------------------------+-----------------------
> > 3 | port->sm_ah = NULL;                                 |
> >
> > As I see it, the process on CPU-1 gets a garbage sm_ah
> > Do you agree?
> alloc_mad() must obviously take the lock when looking at port->sm_ah, > and take a reference with kref_get() before dropping the lock. > > - R.
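Spelled out, the rule Roland states here is the pattern the resent patch further down adopts in alloc_mad(): re-check sm_ah under ah_lock and take the kref before the lock is dropped, so the event handler can never free the AH between the check and the use. A sketch of just that window:

spin_lock_irqsave(&query->port->ah_lock, flags);
if (!query->port->sm_ah) {
	/* event handler already dropped the AH; tell the caller to retry */
	spin_unlock_irqrestore(&query->port->ah_lock, flags);
	return -EAGAIN;
}
kref_get(&query->port->sm_ah->ref);	/* reference taken while still locked */
query->sm_ah = query->port->sm_ah;
spin_unlock_irqrestore(&query->port->ah_lock, flags);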
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > You're right. I just sent a V2 but it needs to be modified according to the above. I'll resend soon. Thanks From rdreier at cisco.com Tue May 20 07:07:29 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 07:07:29 -0700 Subject: [ofa-general] Re: [PATCH V2 1/2] IB/core: handle race between elements in work queues after event In-Reply-To: <4832D99E.3010205@Voltaire.COM> (Moni Shoua's message of "Tue, 20 May 2008 17:01:02 +0300") References: <48302034.8040709@Voltaire.COM> <4832D99E.3010205@Voltaire.COM> Message-ID: > This is the second version after some of Roland's comments. > One comment, however, is still pending. What comment is still pending? From monis at voltaire.com Tue May 20 07:09:56 2008 From: monis at voltaire.com (Moni Shoua) Date: Tue, 20 May 2008 17:09:56 +0300 Subject: [ofa-general] RE: [PATCH V2 1/2] IB/core: handle race between elements in work queues after event In-Reply-To: References: <48302034.8040709@Voltaire.COM> <4832D99E.3010205@Voltaire.COM> Message-ID: <39C75744D164D948A170E9792AF8E7CA01269D56@exil.voltaire.com> I meant the one about the simpler cleanup in ib_sa_event(), but it is no longer pending. -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Tuesday, May 20, 2008 17:07 To: Moni Shoua Cc: Olga Shern; OpenFabrics General; Moni Levy Subject: Re: [PATCH V2 1/2] IB/core: handle race between elements in work queues after event > This is the second version after some of Roland's comments. > One comment, however, is still pending. What comment is still pending? From monis at Voltaire.COM Tue May 20 07:13:20 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 20 May 2008 17:13:20 +0300 Subject: [ofa-general] [PATCH V2 1/2] IB/core: handle race between elements in work queues after event In-Reply-To: <4832D99E.3010205@Voltaire.COM> References: <48302034.8040709@Voltaire.COM> <4832D99E.3010205@Voltaire.COM> Message-ID: <4832DC80.2000408@Voltaire.COM> This patch solves a race between work elements that are carried out after an event occurs. When the SM address handle becomes invalid and needs an update, it is handled by a work in the global workqueue. On the other hand, this event is also handled in ib_ipoib by queuing a work in the ipoib_workqueue that does the mcast join. Although queuing is in the right order, it is done to 2 different workqueues and so there is no guarantee that the first to be queued is the first to be executed. The patch sets the SM address handle to NULL, and until update_sm_ah() is called, any request that needs sm_ah is replied with -EAGAIN return status. For consumers, the patch doesn't make things worse. Before the patch, MADs are sent to the wrong SM so the request gets lost. Consumers can be improved if they examine the return code and respond to EAGAIN properly, but even without an improvement the situation is not getting worse, and in some cases it gets better. If ib_sa_event() is called during a consumer's work (e.g. ib_sa_path_rec_get()) after the check for a NULL SM address handle, the result would be as before the patch, without a risk of dereferencing a NULL pointer.
Signed-off-by: Moni Levy Signed-off-by: Moni Shoua --- drivers/infiniband/core/sa_query.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index cf474ec..78ea815 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -361,7 +361,7 @@ static void update_sm_ah(struct work_struct *work) { struct ib_sa_port *port = container_of(work, struct ib_sa_port, update_task); - struct ib_sa_sm_ah *new_ah, *old_ah; + struct ib_sa_sm_ah *new_ah; struct ib_port_attr port_attr; struct ib_ah_attr ah_attr; @@ -397,12 +397,9 @@ static void update_sm_ah(struct work_struct *work) } spin_lock_irq(&port->ah_lock); - old_ah = port->sm_ah; port->sm_ah = new_ah; spin_unlock_irq(&port->ah_lock); - if (old_ah) - kref_put(&old_ah->ref, free_sm_ah); } static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event) @@ -413,8 +410,17 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event event->event == IB_EVENT_PKEY_CHANGE || event->event == IB_EVENT_SM_CHANGE || event->event == IB_EVENT_CLIENT_REREGISTER) { - struct ib_sa_device *sa_dev; - sa_dev = container_of(handler, typeof(*sa_dev), event_handler); + unsigned long flags; + struct ib_sa_device *sa_dev = + container_of(handler, typeof(*sa_dev), event_handler); + struct ib_sa_port *port = + &sa_dev->port[event->element.port_num - sa_dev->start_port]; + + spin_lock_irqsave(&port->ah_lock, flags); + if (port->sm_ah) + kref_put(&port->sm_ah->ref, free_sm_ah); + port->sm_ah = NULL; + spin_unlock_irqrestore(&port->ah_lock, flags); schedule_work(&sa_dev->port[event->element.port_num - sa_dev->start_port].update_task); @@ -519,6 +525,10 @@ static int alloc_mad(struct ib_sa_query *query, gfp_t gfp_mask) unsigned long flags; spin_lock_irqsave(&query->port->ah_lock, flags); + if (!query->port->sm_ah) { + spin_unlock_irqrestore(&query->port->ah_lock, flags); + return -EAGAIN; + } kref_get(&query->port->sm_ah->ref); query->sm_ah = query->port->sm_ah; spin_unlock_irqrestore(&query->port->ah_lock, flags); From worleys at gmail.com Tue May 20 07:34:31 2008 From: worleys at gmail.com (Chris Worley) Date: Tue, 20 May 2008 08:34:31 -0600 Subject: [ofa-general] SET_IPOIB_CM enabled by default in some builds and not in others In-Reply-To: <4832C037.1060206@mellanox.co.il> References: <482AC510.3090602@dev.mellanox.co.il> <483045CD.8060301@mellanox.co.il> <4832C037.1060206@mellanox.co.il> Message-ID: On Tue, May 20, 2008 at 6:12 AM, Tziporet Koren wrote: > Chris Worley wrote: >> >> In using netcat in UDP mode over IPoIB, I loose 25%-40% of the >> packets. Is that expected? >> > > What is the netcap test? 
Start a listener on one node (both nodes running RHEL5.1/OFED1.3):

[root at poib01 ~]# nc -v -v -u -l 61984 | dd of=/dev/null bs=1024k

Start a sender on another, which completes:

[root at poib04 mnt]# dd if=/dev/zero bs=1024k count=10000 | nc -v -v -u 36.102.28.91 61984
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 48.0578 seconds, 218 MB/s

But the listener just hangs until ^C:

Connection from 36.102.28.94 port 61984 [udp/*] accepted
0+6078248 records in
0+6078248 records out
6239293440 bytes (6.2 GB) copied, 179.339 seconds, 34.8 MB/s

Chris > > Tziporet >
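nc piped into dd only shows an aggregate byte count, so it cannot say which datagrams went missing or when. If it helps, here is a minimal receive-side probe that counts gaps directly, assuming the sender stamps the first four bytes of every datagram with an incrementing counter starting at 1 (sender not shown; plain sockets, nothing IPoIB-specific, and the port simply mirrors the nc test above):

/* udp_loss_rx.c - receive-side loss counter for a sequence-numbered
 * UDP stream. Build: cc -O2 -o udp_loss_rx udp_loss_rx.c */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(void)
{
	struct sockaddr_in addr;
	uint32_t seq, max_seq = 0;
	unsigned long long received = 0;
	char buf[65536];
	int s = socket(AF_INET, SOCK_DGRAM, 0);

	if (s < 0) {
		perror("socket");
		return 1;
	}
	memset(&addr, 0, sizeof addr);
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(61984);	/* same port as the nc test above */
	if (bind(s, (struct sockaddr *)&addr, sizeof addr) < 0) {
		perror("bind");
		return 1;
	}
	for (;;) {
		ssize_t n = recv(s, buf, sizeof buf, 0);

		if (n < (ssize_t)sizeof seq)
			continue;
		memcpy(&seq, buf, sizeof seq);	/* sender's counter, network order */
		seq = ntohl(seq);
		if (seq > max_seq)
			max_seq = seq;
		received++;
		if (received % 100000 == 0)
			printf("received %llu of %u sent (%.1f%% lost)\n",
			       received, max_seq,
			       100.0 * (max_seq - received) / max_seq);
	}
}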
Tziporet From vlad at dev.mellanox.co.il Tue May 20 08:40:53 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 20 May 2008 18:40:53 +0300 Subject: [ofa-general] Re: [ewg] OFED May 29, 08 meeting summary In-Reply-To: <6C2C79E72C305246B504CBA17B5500C904111C9F@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C904111C9F@mtlexch01.mtl.com> Message-ID: <4832F105.8000402@dev.mellanox.co.il> Tziporet Koren wrote: > 2. OFED 1.4: > - Kernel rebase status: we have prepared the new tree, make-dist > pass but compilation still fails. > Any help to resolve compilation issues is welcome. > URL: git://git.openfabrics.org/ofed_1_4/linux-2.6.git > ofed_kernel > Vlad will send status soon Compilation passed on 2.6.26-rc2 kernel (except SDP). Working on backport patches. - Vladimir From David.Shue.ctr at rl.af.mil Tue May 20 08:54:21 2008 From: David.Shue.ctr at rl.af.mil (Shue, David CTR USAF AFMC AFRL/RITB) Date: Tue, 20 May 2008 11:54:21 -0400 Subject: [ofa-general] Infiniband Card Trouble In-Reply-To: References: Message-ID: UPDATE I was given a reflash image from a MELLANOX rep and it worked for me! Thanks for everyone's effort to help me. -Dave -----Original Message----- From: Mike Heinz [mailto:michael.heinz at qlogic.com] Sent: Thursday, May 01, 2008 11:21 AM To: Shue, David CTR USAF AFMC AFRL/RITB; general at lists.openfabrics.org Subject: RE: [ofa-general] Infiniband Card Trouble #6 makes it sound like it's an ofed installation issue rather than the HCA itself. Could you post the relevant /var/log/messages? Messages from ib_mthca would be especially important. In addition, the output from mstflint -d q could also be useful. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Shue, David CTR USAF AFMC AFRL/RITB Sent: Thursday, May 01, 2008 9:09 AM To: general at lists.openfabrics.org Subject: [ofa-general] Infiniband Card Trouble Hello, I have used the OFED-1.3 software to communicate with the current cards I have. These cards come up as "MT23108" in the logs, and I am not sure whom the manufacturer is. I was able to program the cards, and even install MPICH2 and run tests. I have recently obtained new IB cards from HP "HP PCI-X 2-port 4X Fabric (HPC) Adapter" http://h20000.www2.hp.com/bizsupport/TechSupport/Home.jsp?lang=en&cc=id& prodTypeId=12883&prodSeriesId=460713&lang=en&cc=id and these cards do not work the same. The machine boots up fine with the card in, and shows the card as Mellanox "MT23108" also? The two cards are visibly different in every way. Is the MT23108 a certain platform for IB? I am new to the entire IB technology. This is the history of what I did. 1) Staged the machine RH EL v5 2) Install the IB card 3) Boot machine up 4) Can see the card looking at "lspci" and "dmesg" but nothing in the network area or under "ifconfig" (Just like with the first cards) 5) I then install the OFED-1.3 software to communicate and configure the card 6) When I go to start the card (instead of reboot but have tried both ways) /etc/init.d/openib start, it all fails. I then look in the log file and see a bunch of "unknown symbol..." and "disagrees..." for all items of ib_uverbs, ib_umad,iw_cxgb3,ib_path, mlx_ib, and so on. 7) When I reboot, the machine reaches "UDEV" of the reboot stage, hangs for a little bit, and then many errors show and the machine won't boot, unless I take the card out. 
If I uninstall the OFED software, it will reboot fine with the card still in. The card from HP giving me problems, does not appear to have any drivers for it. It looks like HP supports it to work on Windows, and HPUX. I'm look for any help you can provide. Thanks in advance, Dave >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> David Shue Systems Specialist Computer Sciences Corporation <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< From Jeffrey.C.Becker at nasa.gov Tue May 20 09:26:01 2008 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Tue, 20 May 2008 09:26:01 -0700 Subject: [ofa-general] RE: Current list of Linux maintainers and their email info In-Reply-To: References: <4831F965.6060607@opengridcomputing.com> Message-ID: <4832FB99.1080409@nasa.gov> Talpey, Thomas wrote: > At 06:28 PM 5/19/2008, Woodruff, Robert J wrote: > >> Here is what I have so far as the list of kernel and userspace >> components. >> >> Kernel Components >> >> Core kernel drivers, infiniband/core >> ... >> > > Your list doesn't include NFS/RDMA, but those components are effectively part > of existing MAINTAINER sections. > > So, it might be worth adding a note to the effect that the NFS/RDMA client is > maintained by Tom Talpey, as part of Trond Myklebust's NFS client codebase, > and the NFS/RDMA server is Tom Tucker's, as part of Bruce Fields' and Neil Brown's > NFS server base. These are the "NFS CLIENT" and "KERNEL NFSD" sections of the > existing MAINTAINERS file, respectively. > And I am responsible for integrating this work with OFED, particularly providing and maintaining kernel backports for the various distros. Thanks. -jeff > I don't personally think these NFS/RDMA sub-components currently rise to the > level of needing mention in MAINTAINERS, but something in the openib docs would > be very good. > > Tom. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From arlin.r.davis at intel.com Tue May 20 09:36:11 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Tue, 20 May 2008 09:36:11 -0700 Subject: [ofa-general] RE: [ewg] OFED May 29, 08 meeting summary In-Reply-To: <6C2C79E72C305246B504CBA17B5500C904111C9F@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C904111C9F@mtlexch01.mtl.com> Message-ID: > >OFED May 29, 08 meeting summary > >1. OFED 1.3.1: > 1.3 Bugs status > Please set release version 1.3.1 for all bugs that should be >resolved in 1.3.1 > We decided these are the bugs that should be fixed for 1.3.1: > >uDAPL bug should be provided too - Arlin 1044 normal arlin.r.davis at intel.com uDAPL 1.2 - dat_ia_open delay if DNS not configured To be fixed. From ramachandra.kuchimanchi at qlogic.com Tue May 20 09:52:50 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 20 May 2008 22:22:50 +0530 Subject: [ofa-general] Re: [PATCH 10/13] QLogic VNIC: Driver Statistics collection In-Reply-To: References: <20080430171028.31725.86190.stgit@localhost.localdomain> <20080430172055.31725.70663.stgit@localhost.localdomain> Message-ID: <71d336490805200952k2c31b6abkfcf70ed7481a95d@mail.gmail.com> Roland, On Fri, May 16, 2008 at 4:03 AM, Roland Dreier wrote: > > +#else /*CONFIG_INIFINIBAND_VNIC_STATS*/ > > + > > +static inline void vnic_connected_stats(struct vnic *vnic) > > +{ > > + ; > > +} > > there are an awful lot of stubs here. 
Do you really expect anyone to > set CONFIG_INFINIBAND_QLGC_VNIC_STATS=n? Sorry, missed this comment from the first round of review. Yes, the default behavior we want is to disable the statistics collection as some of the statistics collection is in the data transfer path and hence this code can add overheads. These statistics are more for performance debugging/tuning. Regards, Ram From sean.hefty at intel.com Tue May 20 10:02:34 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 20 May 2008 10:02:34 -0700 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <4832BFAC.2050506@voltaire.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> <482C68A4.9020305@opengridcomputing.com> <4831649A.2020206@voltaire.com> <000101c8b9f6$c4dd4cf0$bd59180a@amr.corp.intel.com> <4832BFAC.2050506@voltaire.com> Message-ID: <000101c8ba9b$4ca68da0$a3d8180a@amr.corp.intel.com> >I am fine with the approach of always report the event and let ULPs >ignore it. Looking on how the ABI versions are exchanged between the >rdma_ucm module to librdmacm, I don't see much alternatives other to >bumping the ABI version to five. If librdmacm can somehow note against >what ABI version the app was built, we could bump the ABI version to >five and require the user to upgrade his librdmacm to be able to run, >but have --librdmacm-- hide this event from the user in case "his >version" of the ABI is smaller. I was only thinking of the kernel interfaces, but I don't see that this really changes the ABI. An existing library continues to work unmodified. (Is this that different than adding a new return value from a call?) If there really is an issue, then the rdma_ucm can toss the event. >I would like to look into this possibility which as you stated later in >your post is simpler compared to the alternatives and would also make >the current code of supporting device removal less complex. So >can/should that mutex be the existing one defined in cma.c or a new one? After more thought, this approach is what I would try first. I think you will need a new mutex per rdma_cm_id that does nothing but serializes callbacks. You might be able to acquire/release it in disable/enable remove, but I didn't look into the implementation in that much detail. - Sean From clameter at sgi.com Tue May 20 11:21:09 2008 From: clameter at sgi.com (Christoph Lameter) Date: Tue, 20 May 2008 11:21:09 -0700 (PDT) Subject: [ofa-general] Yet another mm notifier: Notify when pages are unmapped. Message-ID: Robin suggested that the last post as a reply in the anon_vma thread made this patch vanish. So here it is again (guess we are all tired of notifiers...) This patch implements a callbacks for device drivers that establish external references to pages aside from the Linux rmaps. Those either: 1. Do not take a refcount on pages that are mapped from devices. They have a TLB cache like handling and must be able to flush external references from atomic contexts. These devices do not need to provide the _sync methods. 2. Do take a refcount on pages mapped externally. These are handled by marking pages as to be invalidated in atomic contexts. Invalidation may be started by the driver. A _sync variant for the individual or range unmap is called when we are back in a nonatomic context. At that point the device must complete the removal of external references and drop its refcount. 
With the mm notifier it is possible for the device driver to release external references after the page references are removed from a process that made them available. With the notifier it becomes possible to get pages unpinned on request and thus avoid issues that come with having a large amount of pinned pages. A device driver must subscribe to a process using mm_register_notifier(struct mm_struct *, struct mm_notifier *) The VM will then perform callbacks for operations that unmap or change permissions of pages in that address space. When the process terminates then the ->release method is called first to remove all pages still mapped to the proces. Before the mm_struct is freed the ->destroy() method is called which should dispose of the mm_notifier structure. The following callbacks exist: invalidate_range(notifier, mm_struct *, from , to) Invalidate a range of addresses. The invalidation is not required to complete immediately. invalidate_range_sync(notifier, mm_struct *, from, to) This is called after some invalidate_range callouts. The driver may only return when the invalidation of the references is completed. Callback is only called from non atomic contexts. There is no need to provide this callback if the driver can remove references in an atomic context. invalidate_page(notifier, mm_struct *, struct page *page, unsigned long address) Invalidate references to a particular page. The driver may defer the invalidation. invalidate_page_sync(notifier, mm_struct *,struct *) Called after one or more invalidate_page() callbacks. The callback must only return when the external references have been removed. The callback does not need to be provided if the driver can remove references in atomic contexts. [NOTE] The invalidate_page_sync() callback is weird because it is called for every notifier that supports the invalidate_page_sync() callback if a page has PageNotifier() set. The driver must determine in an efficient way that the page is not of interest. This is because we do not have the mm context after we have dropped the rmap list lock. Drivers incrementing the refcount must set and clear PageNotifier appropriately when establishing and/or dropping a refcount! [These conditions are similar to the rmap notifier that was introduced in my V7 of the mmu_notifier]. There is no support for an aging callback. A device driver may simply set the reference bit on the linux pte when the external mapping is referenced if such support is desired. The patch is provisional. All functions are inlined for now. They should be wrapped like in Andrea's series. Its probably good to have Andrea review this if we actually decide to go this route since he is pretty good as detecting issues with complex lock interactions in the vm. mmu notifiers V7 was rejected by Andrew because of the strange asymmetry in invalidate_page_sync() (at that time called rmap notifier) and we are reintroducing that now in a light weight order to be able to defer freeing until after the rmap spinlocks have been dropped. Jack tested this with the GRU. 
Signed-off-by: Christoph Lameter --- fs/hugetlbfs/inode.c | 2 include/linux/mm_types.h | 3 include/linux/page-flags.h | 3 include/linux/rmap.h | 161 +++++++++++++++++++++++++++++++++++++++++++++ kernel/fork.c | 4 + mm/Kconfig | 4 + mm/filemap_xip.c | 2 mm/fremap.c | 2 mm/hugetlb.c | 3 mm/memory.c | 38 ++++++++-- mm/mmap.c | 3 mm/mprotect.c | 3 mm/mremap.c | 5 + mm/rmap.c | 11 ++- 14 files changed, 234 insertions(+), 10 deletions(-) Index: linux-2.6/kernel/fork.c =================================================================== --- linux-2.6.orig/kernel/fork.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/kernel/fork.c 2008-05-16 16:06:26.000000000 -0700 @@ -386,6 +386,9 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; +#ifdef CONFIG_MM_NOTIFIER + mm->mm_notifier = NULL; +#endif return mm; } @@ -418,6 +421,7 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mm_notifier_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); Index: linux-2.6/mm/filemap_xip.c =================================================================== --- linux-2.6.orig/mm/filemap_xip.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/filemap_xip.c 2008-05-16 16:06:26.000000000 -0700 @@ -189,6 +189,7 @@ __xip_unmap (struct address_space * mapp /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); @@ -197,6 +198,7 @@ __xip_unmap (struct address_space * mapp } } spin_unlock(&mapping->i_mmap_lock); + mm_notifier_invalidate_page_sync(page); } /* Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/fremap.c 2008-05-16 16:06:26.000000000 -0700 @@ -214,7 +214,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mm_notifier_invalidate_range(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mm_notifier_invalidate_range_sync(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); Index: linux-2.6/mm/hugetlb.c =================================================================== --- linux-2.6.orig/mm/hugetlb.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/hugetlb.c 2008-05-16 17:50:31.000000000 -0700 @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -843,6 +844,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mm_notifier_invalidate_range(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); @@ -864,6 +866,7 @@ void unmap_hugepage_range(struct vm_area spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); __unmap_hugepage_range(vma, start, end); spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + mm_notifier_invalidate_range_sync(vma->vm_mm, start, end); } } Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/memory.c 2008-05-16 16:06:26.000000000 -0700 @@ -527,6 +527,7 @@ copy_one_pte(struct mm_struct *dst_mm, s */ if (is_cow_mapping(vm_flags)) { ptep_set_wrprotect(src_mm, addr, 
src_pte); + mm_notifier_invalidate_range(src_mm, addr, addr + PAGE_SIZE); pte = pte_wrprotect(pte); } @@ -649,6 +650,7 @@ int copy_page_range(struct mm_struct *ds unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int ret; /* * Don't copy ptes where a page fault will fill them correctly. @@ -664,17 +666,30 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next)) - return -ENOMEM; + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, + vma, addr, next))) { + ret = -ENOMEM; + break; + } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - return 0; + + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. + */ + if (is_cow_mapping(vma->vm_flags)) + mm_notifier_invalidate_range_sync(src_mm, vma->vm_start, end); + + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -913,6 +928,7 @@ unsigned long unmap_vmas(struct mmu_gath } tlb_finish_mmu(*tlbp, tlb_start, start); + mm_notifier_invalidate_range(vma->vm_mm, tlb_start, start); if (need_resched() || (i_mmap_lock && spin_needbreak(i_mmap_lock))) { @@ -951,8 +967,10 @@ unsigned long zap_page_range(struct vm_a tlb = tlb_gather_mmu(mm, 0); update_hiwater_rss(mm); end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) + if (tlb) { tlb_finish_mmu(tlb, address, end); + mm_notifier_invalidate_range(mm, address, end); + } return end; } @@ -1711,7 +1729,6 @@ static int do_wp_page(struct mm_struct * */ page_table = pte_offset_map_lock(mm, pmd, address, &ptl); - page_cache_release(old_page); if (!pte_same(*page_table, orig_pte)) goto unlock; @@ -1729,6 +1746,7 @@ static int do_wp_page(struct mm_struct * if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; + old_page = NULL; goto unlock; } @@ -1774,6 +1792,7 @@ gotten: * thread doing COW. */ ptep_clear_flush(vma, address, page_table); + mm_notifier_invalidate_page(mm, old_page, address); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); @@ -1787,10 +1806,13 @@ gotten: if (new_page) page_cache_release(new_page); - if (old_page) - page_cache_release(old_page); unlock: pte_unmap_unlock(page_table, ptl); + if (old_page) { + mm_notifier_invalidate_page_sync(old_page); + page_cache_release(old_page); + } + if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/mmap.c 2008-05-16 16:06:26.000000000 -0700 @@ -1759,6 +1759,8 @@ static void unmap_region(struct mm_struc free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? 
next->vm_start: 0); tlb_finish_mmu(tlb, start, end); + mm_notifier_invalidate_range(mm, start, end); + mm_notifier_invalidate_range_sync(mm, start, end); } /* @@ -2048,6 +2050,7 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mm_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); Index: linux-2.6/mm/mprotect.c =================================================================== --- linux-2.6.orig/mm/mprotect.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/mprotect.c 2008-05-16 16:06:26.000000000 -0700 @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -132,6 +133,7 @@ static void change_protection(struct vm_ change_pud_range(mm, pgd, addr, next, newprot, dirty_accountable); } while (pgd++, addr = next, addr != end); flush_tlb_range(vma, start, end); + mm_notifier_invalidate_range(vma->vm_mm, start, end); } int @@ -211,6 +213,7 @@ success: hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mm_notifier_invalidate_range_sync(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; Index: linux-2.6/mm/mremap.c =================================================================== --- linux-2.6.orig/mm/mremap.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/mremap.c 2008-05-16 16:06:26.000000000 -0700 @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -74,6 +75,7 @@ static void move_ptes(struct vm_area_str struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start = old_addr; if (vma->vm_file) { /* @@ -100,6 +102,7 @@ static void move_ptes(struct vm_area_str spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); arch_enter_lazy_mmu_mode(); + mm_notifier_invalidate_range(mm, old_addr, old_end); for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, new_pte++, new_addr += PAGE_SIZE) { if (pte_none(*old_pte)) @@ -116,6 +119,8 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + + mm_notifier_invalidate_range_sync(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/rmap.c 2008-05-16 16:06:26.000000000 -0700 @@ -52,6 +52,9 @@ #include +struct mm_notifier *mm_notifier_page_sync; +DECLARE_RWSEM(mm_notifier_page_sync_sem); + struct kmem_cache *anon_vma_cachep; /* This must be called under the mmap_sem. */ @@ -458,6 +461,7 @@ static int page_mkclean_one(struct page flush_cache_page(vma, address, pte_pfn(*pte)); entry = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -502,8 +506,8 @@ int page_mkclean(struct page *page) ret = 1; } } + mm_notifier_invalidate_page_sync(page); } - return ret; } EXPORT_SYMBOL_GPL(page_mkclean); @@ -725,6 +729,7 @@ static int try_to_unmap_one(struct page /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -855,6 +860,7 @@ static void try_to_unmap_cluster(unsigne /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) @@ -1013,8 +1019,9 @@ int try_to_unmap(struct page *page, int else ret = try_to_unmap_file(page, migration); + mm_notifier_invalidate_page_sync(page); + if (!page_mapped(page)) ret = SWAP_SUCCESS; return ret; } - Index: linux-2.6/include/linux/rmap.h =================================================================== --- linux-2.6.orig/include/linux/rmap.h 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/include/linux/rmap.h 2008-05-16 18:32:52.000000000 -0700 @@ -133,4 +133,165 @@ static inline int page_mkclean(struct pa #define SWAP_AGAIN 1 #define SWAP_FAIL 2 +#ifdef CONFIG_MM_NOTIFIER + +struct mm_notifier_ops { + void (*invalidate_range)(struct mm_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_sync)(struct mm_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_page)(struct mm_notifier *mn, struct mm_struct *mm, + struct page *page, unsigned long addr); + void (*invalidate_page_sync)(struct mm_notifier *mn, struct mm_struct *mm, + struct page *page); + void (*release)(struct mm_notifier *mn, struct mm_struct *mm); + void (*destroy)(struct mm_notifier *mn, struct mm_struct *mm); +}; + +struct mm_notifier { + struct mm_notifier_ops *ops; + struct mm_struct *mm; + struct mm_notifier *next; + struct mm_notifier *next_page_sync; +}; + +extern struct mm_notifier *mm_notifier_page_sync; +extern struct rw_semaphore mm_notifier_page_sync_sem; + +/* + * Must hold mmap_sem when calling mm_notifier_register. + */ +static inline void mm_notifier_register(struct mm_notifier *mn, + struct mm_struct *mm) +{ + mn->mm = mm; + mn->next = mm->mm_notifier; + rcu_assign_pointer(mm->mm_notifier, mn); + if (mn->ops->invalidate_page_sync) { + down_write(&mm_notifier_page_sync_sem); + mn->next_page_sync = mm_notifier_page_sync; + mm_notifier_page_sync = mn; + up_write(&mm_notifier_page_sync_sem); + } +} + +/* + * Invalidate remote references in a particular address range + */ +static inline void mm_notifier_invalidate_range(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->invalidate_range(mn, mm, start, end); +} + +/* + * Invalidate remote references in a particular address range. + * Can sleep. Only return if all remote references have been removed. 
+ */ +static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + if (mn->ops->invalidate_range_sync) + mn->ops->invalidate_range_sync(mn, mm, start, end); +} + +/* + * Invalidate remote references to a page + */ +static inline void mm_notifier_invalidate_page(struct mm_struct *mm, + struct page *page, unsigned long addr) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->invalidate_page(mn, mm, page, addr); +} + +/* + * Invalidate remote references to a particular page. Only return + * if all references have been removed. + * + * Note: This is an expensive function since it is not clear at the time + * of call to which mm_struct() the page belongs. It walks through the + * mmlist and calls the mmu notifier ops for each address space in the + * system. At some point this needs to be optimized. + */ +static inline void mm_notifier_invalidate_page_sync(struct page *page) +{ + struct mm_notifier *mn; + + if (!PageNotifier(page)) + return; + + down_read(&mm_notifier_page_sync_sem); + + for (mn = mm_notifier_page_sync; mn; mn = mn->next_page_sync) + if (mn->ops->invalidate_page_sync) + mn->ops->invalidate_page_sync(mn, mn->mm, page); + + up_read(&mm_notifier_page_sync_sem); +} + +/* + * Invalidate all remote references before shutdown + */ +static inline void mm_notifier_release(struct mm_struct *mm) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->release(mn, mm); +} + +/* + * Release resources before freeing mm_struct. + */ +static inline void mm_notifier_destroy(struct mm_struct *mm) +{ + struct mm_notifier *mn; + + while (mm->mm_notifier) { + mn = mm->mm_notifier; + mm->mm_notifier = mn->next; + if (mn->ops->invalidate_page_sync) { + struct mm_notifier *m; + + down_write(&mm_notifier_page_sync_sem); + + if (mm_notifier_page_sync != mn) { + for (m = mm_notifier_page_sync; m; m = m->next_page_sync) + if (m->next_page_sync == mn) + break; + + m->next_page_sync = mn->next_page_sync; + } else + mm_notifier_page_sync = mn->next_page_sync; + + up_write(&mm_notifier_page_sync_sem); + } + mn->ops->destroy(mn, mm); + } +} +#else +static inline void mm_notifier_invalidate_range(struct mm_struct *mm, + unsigned long start, unsigned long end) {} +static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm, + unsigned long start, unsigned long end) {} +static inline void mm_notifier_invalidate_page(struct mm_struct *mm, + struct page *page, unsigned long address) {} +static inline void mm_notifier_invalidate_page_sync(struct page *page) {} +static inline void mm_notifier_release(struct mm_struct *mm) {} +static inline void mm_notifier_destroy(struct mm_struct *mm) {} +#endif + #endif /* _LINUX_RMAP_H */ Index: linux-2.6/mm/Kconfig =================================================================== --- linux-2.6.orig/mm/Kconfig 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/Kconfig 2008-05-16 16:06:26.000000000 -0700 @@ -205,3 +205,7 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MM_NOTIFIER + def_bool y + Index: linux-2.6/include/linux/mm_types.h =================================================================== --- linux-2.6.orig/include/linux/mm_types.h 2008-05-16 11:28:49.000000000 -0700 +++
linux-2.6/include/linux/mm_types.h 2008-05-16 16:06:26.000000000 -0700 @@ -244,6 +244,9 @@ struct mm_struct { struct file *exe_file; unsigned long num_exe_file_vmas; #endif +#ifdef CONFIG_MM_NOTIFIER + struct mm_notifier *mm_notifier; +#endif }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/include/linux/page-flags.h 2008-05-16 16:06:26.000000000 -0700 @@ -93,6 +93,7 @@ enum pageflags { PG_mappedtodisk, /* Has blocks allocated on-disk */ PG_reclaim, /* To be reclaimed asap */ PG_buddy, /* Page is free, on buddy lists */ + PG_notifier, /* Call notifier when page is changed/unmapped */ #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR PG_uncached, /* Page has been mapped as uncached */ #endif @@ -173,6 +174,8 @@ PAGEFLAG(MappedToDisk, mappedtodisk) PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim) PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */ +PAGEFLAG(Notifier, notifier); + #ifdef CONFIG_HIGHMEM /* * Must use a macro here due to header dependency issues. page_zone() is not Index: linux-2.6/fs/hugetlbfs/inode.c =================================================================== --- linux-2.6.orig/fs/hugetlbfs/inode.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/fs/hugetlbfs/inode.c 2008-05-16 16:06:55.000000000 -0700 @@ -442,6 +442,8 @@ hugetlb_vmtruncate_list(struct prio_tree __unmap_hugepage_range(vma, vma->vm_start + v_offset, vma->vm_end); + mm_notifier_invalidate_range_sync(vma->vm_mm, + vma->vm_start + v_offset, vma->vm_end); } } From robert.j.woodruff at intel.com Tue May 20 11:35:24 2008 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 20 May 2008 11:35:24 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: <48326942.7080800@voltaire.com> References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> <48326942.7080800@voltaire.com> Message-ID: Or wrote, > This is what I could find in the MAINTAINERS file for 2.6.25: >I am not sure to follow why there's a need to duplicate the Linux kernel >IB (RDMA) stack maintainers file at the ofa website, but if for some >reason people feel this is needed I suggest to have a smart link that >somehow goes to Linus tree and fetches the up2date info. >Or. We should have the list for a couple of reasons, first, not all OFA components are upstream, e.g., SDP. And also, the MAINTAINERS list in the kernel tree is only for the kernel components, so it would be good to have a list for the OFA user-space components/maintainers as well, and if we are posting a maintainers list on our website, might as well include the complete list. woody From sean.hefty at intel.com Tue May 20 12:06:14 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 20 May 2008 12:06:14 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> <48326942.7080800@voltaire.com> Message-ID: <000001c8baac$93e26c50$0d59180a@amr.corp.intel.com> OFED needs a separate list of maintainers. 
From tziporet at dev.mellanox.co.il Tue May 20 12:10:40 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 20 May 2008 22:10:40 +0300 Subject: [ofa-general] RE: Current list of Linux maintainers and their email info In-Reply-To: References: <4831F965.6060607@opengridcomputing.com> Message-ID: <48332230.2000609@mellanox.co.il> Woodruff, Robert J wrote: > Here is what I have so far as the list of kernel and userspace > components. > > > Regarding some of the ULPs - Roland is the maintainer in the kernel but we have other people in OFED that work on them I suggest we add their names too under OFED column. I also filled the user space components with my best knowledge on the owners Tziporet > > Kernel Components > > Core kernel drivers, infiniband/core > Sean Hefty, sean.hefty at intel.com > Roland Dreier, rdreir at csco.com > > Hardware Drivers: > Mellanox HCA drivers, infiniband/hw/mthca, infiniband/hw/mlx4 > In addition for Roland please add Jack Morgenstein for OFED > Qlogic HCA driver, infiniband/hw/ipath > > NetEffects RNIC driver, infiniband/hw/nes > > IBM HCA, infinband/hw/ehca > > Chelsio RNIC, infiniband/hw/cxgb3 > > Upper Level Protocols > > IPoIB > Can we also add here Eli Cohen since he is working on all the OFED related issues > SRP > Please add here Vu Pham for OFED > iSer > > SDP > Amir Vadai amirv at mellanox.co.il > SRPT > Vu Pham vuhuong at mellanox.com > qlgc_vnic > > RDS > Olaf Kirch olaf.kirch at oracle.com > User Space Components > > libibverbs > Roland Dreier > uDAPL > "Davis, Arlin R" > IB-Bonding > This is a kernel module and not user space. > IB-Sim > Sasha Khapyorsky > IB-Utils > Oren Kladnitsky > IB-Diags > Sasha Khapyorsky > libibcm > "Hefty, Sean" > librdmacm > "Hefty, Sean" > libibcommon > Roland Dreier > libibmad > Sasha Khapyorsky > libibumad > Sasha Khapyorsky > libipathverbs > > libmlx4 > Roland Dreier In addition for Roland please add Jack Morgenstein for OFED > libmthca > > Roland Dreier In addition for Roland please add Jack Morgenstein for OFED > libnes > > librdmacm > "Hefty, Sean" > libsdp > Amir Vadai amirv at mellanox.co.il > mpi-selector > Jeff Squyres (jsquyres) > mpitests > Pavel Shamis (Pasha) > mstflint Oren Kladnitsky > mvapich > "Pavel Shamis (Pasha)" > mvapich2 > Jonathan L. 
Perkins > openmpi > > Jeff Squyres (jsquyres) > open-iscsi > > opensm > Sasha Khapyorsky > > perftest > Oren Meron > qlvnictools > > qperf > Johann George > rds-tools > Olaf Kirch olaf.kirch at oracle.com > sdpnetstat > > Amir Vadai amirv at mellanox.co.il > srptools > Vu Pham vuhuong at mellanox.com From hrosenstock at xsigo.com Tue May 20 12:13:39 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 12:13:39 -0700 Subject: [ofa-general] RE: Current list of Linux maintainers and their email info In-Reply-To: <48332230.2000609@mellanox.co.il> References: <4831F965.6060607@opengridcomputing.com> <48332230.2000609@mellanox.co.il> Message-ID: <1211310819.12616.634.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 22:10 +0300, Tziporet Koren wrote: > > libibcommon > > > Roland Dreier Sasha Khapyorsky From robert.j.woodruff at intel.com Tue May 20 13:12:48 2008 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 20 May 2008 13:12:48 -0700 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: <4832954C.2080209@voltaire.com> References: <4831F965.6060607@opengridcomputing.com> <48326A44.2080606@voltaire.com> <48328489.2030305@mellanox.co.il> <4832954C.2080209@voltaire.com> Message-ID: Or wrote, Tziporet Koren wrote: >> But I think it will be good to know who is the maintainer from the IB >> side (at least for OFED users) >The mainline maintainer info of bonding is: >BONDING DRIVER >P: Jay Vosburgh >M: fubar at us.ibm.com >L: bonding-devel at lists.sourceforge.net >W: http://sourceforge.net/projects/bonding/ >S: Supported >You need to ask him in case you intend to copy this record from the >maintainers file, or ask Moni Shoua if you can list him as a contact for >issues not related directly to the mainline driver. >Or. Since this is a separate open source project in sourceforge, and not developed in OFA/OFED, perhaps we do not need this in our list of maintainers. From matthias at sgi.com Tue May 20 13:19:50 2008 From: matthias at sgi.com (Matthias Blankenhaus) Date: Tue, 20 May 2008 13:19:50 -0700 (PDT) Subject: [ofa-general] saquery port problems Message-ID: Howdy ! While using saquery to run some queries on a two-port HCA, I noticed some odd behavior. Here are my observations running on a SLES10SP2 (x86_64) Intel Xeon with a Mellanox Technologies MT25208 InfiniHost III Ex HCA: (01) saquery -C mthca0 -m This yields the output for port number two. This does not conform to the usual IB tools' behavior of reporting on port one by default. (02) saquery -C mthca0 -m -P 1 Fails with "Failed to find active port, check port status with 'ibstat'". This is incorrect, since # ibstat mthca0 1 CA: 'mthca0' Port 1: State: Active Physical state: LinkUp Rate: 20 Base lid: 5 LMC: 0 SM lid: 1 Capability mask: 0x02510a68 Port GUID: 0x0008f10403987dc5 This might be the reason why (01) reports on port two. (03) saquery -C mthca0 -m -P 2 Works, and its output is identical to that from (01). However, the following command options work: (04) saquery -P 1 -m Correctly yields the output for port one. In other words, port one seems to be fine, contrary to what (02) reports. (05) saquery -P 2 -m Correctly yields the output for port two. Is it incorrect to use -C and -P in combination? Why does saquery think that port one is not active?
Thanx, Matthias From jon at opengridcomputing.com Tue May 20 13:45:22 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Tue, 20 May 2008 15:45:22 -0500 Subject: [ofa-general] RDS flow control In-Reply-To: <200805191006.00114.olaf.kirch@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805141516.01908.okir@lst.de> <200805161638.18067.olaf.kirch@oracle.com> <200805191006.00114.olaf.kirch@oracle.com> Message-ID: <20080520204522.GD31790@opengridcomputing.com> On Mon, May 19, 2008 at 10:05:59AM +0200, Olaf Kirch wrote: > > > However, I'm still seeing performance degradation of ~5% with some packet > > sizes. And that is *just* the overhead from exchanging the credit information > > and checking it - at some point we need to take a spinlock, and that seems > > to delay things just enough to make a dent in my throughput graph. > > Here's an updated version of the flow control patch - which is now completely > lockless, and uses a single atomic_t to hold both credit counters. This has > given me back close to full performance in my testing (throughput seems to be > down less than 1%, which is almost within the noise range). > > I'll push it to my git tree a little later today, so folks can test it if > they like. Works well on my setup. With proper flow control, there should no longer be a need for rnr_retry (as there should always be a posted recv buffer waiting for the incoming data). I did a quick test and removed it and everything seemed to be happy on my rds-stress run. Thanks for pulling this in. Jon > > Olaf > -- > Olaf Kirch | --- o --- Nous sommes du soleil we love when we play > okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax > ---- > From: Olaf Kirch > Subject: RDS: Implement IB flow control > > Here it is - flow control for RDS/IB. > > This patch is still very much experimental. Here are the essentials: > > - The approach chosen here uses a credit-based flow control > mechanism. Every SEND WR (including ACKs) consumes one credit, > and if the sender runs out of credits, it stalls. > > - As new receive buffers are posted, credits are transferred to the > remote node (using yet another RDS header byte for this). > > - Flow control is negotiated during connection setup. Initial credits > are exchanged in the rds_ib_connect_private struct - sending a value > of zero (which is also the default for older protocol versions) > means no flow control. > > - We avoid deadlock (both nodes depleting their credits, and being > unable to inform the peer of newly posted buffers) by requiring > that the last credit can only be used if we're posting new credits > to the peer. > > The approach implemented here is lock-free; preliminary tests show > the impact on throughput to be less than 1%, and the impact on RTT, > CPU, TX delay and other metrics to be below the noise threshold. > > Flow control is configurable via sysctl. It only affects newly created > connections, however - so your best bet is to set this right after loading > the RDS module.
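The single atomic_t mentioned above is the heart of the lockless design: send credits live in the low 16 bits of the word, freshly posted receive credits in the high 16 bits, and both are updated together with a compare-and-swap loop (see the IB_GET_SEND_CREDITS / IB_GET_POST_CREDITS macros in the patch below). A minimal standalone C sketch of the packing idea follows - this is deliberately not the RDS code, and every name in it is invented for illustration:

	#include <stdatomic.h>
	#include <stdio.h>

	/* Two 16-bit counters packed into one 32-bit atomic word. */
	#define GET_SEND_CREDITS(v)	((v) & 0xffff)
	#define GET_POST_CREDITS(v)	((v) >> 16)
	#define SET_SEND_CREDITS(v)	((v) & 0xffff)
	#define SET_POST_CREDITS(v)	((v) << 16)

	static atomic_uint credits;

	/* Take up to 'wanted' send credits; returns how many were granted.
	 * The CAS loop retries if another thread updated the word meanwhile. */
	static unsigned int grab_send_credits(unsigned int wanted)
	{
		unsigned int oldval, newval, avail, got;

		do {
			oldval = atomic_load(&credits);
			avail = GET_SEND_CREDITS(oldval);
			got = avail < wanted ? avail : wanted;
			newval = oldval - SET_SEND_CREDITS(got);
		} while (!atomic_compare_exchange_weak(&credits, &oldval, newval));

		return got;
	}

	int main(void)
	{
		/* Peer granted 8 send credits; we also posted 4 fresh recv buffers. */
		atomic_fetch_add(&credits, SET_SEND_CREDITS(8) | SET_POST_CREDITS(4));
		printf("granted %u of 5 wanted credits\n", grab_send_credits(5));
		return 0;
	}

The real patch layers one more rule on top of this: the last send credit may only be spent if there are posted credits to advertise to the peer, which is what prevents the mutual-starvation deadlock described above.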
> > Signed-off-by: Olaf Kirch > --- > net/rds/ib.c | 1 > net/rds/ib.h | 30 ++++++++ > net/rds/ib_cm.c | 49 ++++++++++++- > net/rds/ib_recv.c | 48 +++++++++--- > net/rds/ib_send.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > net/rds/ib_stats.c | 3 > net/rds/ib_sysctl.c | 10 ++ > net/rds/rds.h | 4 - > 8 files changed, 325 insertions(+), 14 deletions(-) > > Index: ofa_kernel-1.3/net/rds/ib.h > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib.h > +++ ofa_kernel-1.3/net/rds/ib.h > @@ -46,6 +46,7 @@ struct rds_ib_connect_private { > __be16 dp_protocol_minor_mask; /* bitmask */ > __be32 dp_reserved1; > __be64 dp_ack_seq; > + __be32 dp_credit; /* non-zero enables flow ctl */ > }; > > struct rds_ib_send_work { > @@ -110,15 +111,32 @@ struct rds_ib_connection { > struct ib_sge i_ack_sge; > u64 i_ack_dma; > unsigned long i_ack_queued; > + > + /* Flow control related information > + * > + * Our algorithm uses a pair variables that we need to access > + * atomically - one for the send credits, and one posted > + * recv credits we need to transfer to remote. > + * Rather than protect them using a slow spinlock, we put both into > + * a single atomic_t and update it using cmpxchg > + */ > + atomic_t i_credits; > > /* Protocol version specific information */ > unsigned int i_hdr_idx; /* 1 (old) or 0 (3.1 or later) */ > + unsigned int i_flowctl : 1; /* enable/disable flow ctl */ > > /* Batched completions */ > unsigned int i_unsignaled_wrs; > long i_unsignaled_bytes; > }; > > +/* This assumes that atomic_t is at least 32 bits */ > +#define IB_GET_SEND_CREDITS(v) ((v) & 0xffff) > +#define IB_GET_POST_CREDITS(v) ((v) >> 16) > +#define IB_SET_SEND_CREDITS(v) ((v) & 0xffff) > +#define IB_SET_POST_CREDITS(v) ((v) << 16) > + > struct rds_ib_ipaddr { > struct list_head list; > __be32 ipaddr; > @@ -153,14 +171,17 @@ struct rds_ib_statistics { > unsigned long s_ib_tx_cq_call; > unsigned long s_ib_tx_cq_event; > unsigned long s_ib_tx_ring_full; > + unsigned long s_ib_tx_throttle; > unsigned long s_ib_tx_sg_mapping_failure; > unsigned long s_ib_tx_stalled; > + unsigned long s_ib_tx_credit_updates; > unsigned long s_ib_rx_cq_call; > unsigned long s_ib_rx_cq_event; > unsigned long s_ib_rx_ring_empty; > unsigned long s_ib_rx_refill_from_cq; > unsigned long s_ib_rx_refill_from_thread; > unsigned long s_ib_rx_alloc_limit; > + unsigned long s_ib_rx_credit_updates; > unsigned long s_ib_ack_sent; > unsigned long s_ib_ack_send_failure; > unsigned long s_ib_ack_send_delayed; > @@ -244,6 +265,8 @@ void rds_ib_flush_mrs(void); > int __init rds_ib_recv_init(void); > void rds_ib_recv_exit(void); > int rds_ib_recv(struct rds_connection *conn); > +int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, > + gfp_t page_gfp, int prefill); > void rds_ib_inc_purge(struct rds_incoming *inc); > void rds_ib_inc_free(struct rds_incoming *inc); > int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iovec *iov, > @@ -252,6 +275,7 @@ void rds_ib_recv_cq_comp_handler(struct > void rds_ib_recv_init_ring(struct rds_ib_connection *ic); > void rds_ib_recv_clear_ring(struct rds_ib_connection *ic); > void rds_ib_recv_init_ack(struct rds_ib_connection *ic); > +void rds_ib_attempt_ack(struct rds_ib_connection *ic); > void rds_ib_ack_send_complete(struct rds_ib_connection *ic); > u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic); > > @@ -266,12 +290,17 @@ u32 rds_ib_ring_completed(struct rds_ib_ > extern wait_queue_head_t rds_ib_ring_empty_wait; > > /* ib_send.c 
*/ > +void rds_ib_xmit_complete(struct rds_connection *conn); > int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, > unsigned int hdr_off, unsigned int sg, unsigned int off); > void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context); > void rds_ib_send_init_ring(struct rds_ib_connection *ic); > void rds_ib_send_clear_ring(struct rds_ib_connection *ic); > int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op); > +void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits); > +void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted); > +int rds_ib_send_grab_credits(struct rds_ib_connection *ic, u32 wanted, > + u32 *adv_credits); > > /* ib_stats.c */ > RDS_DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats); > @@ -287,6 +316,7 @@ extern unsigned long rds_ib_sysctl_max_r > extern unsigned long rds_ib_sysctl_max_unsig_wrs; > extern unsigned long rds_ib_sysctl_max_unsig_bytes; > extern unsigned long rds_ib_sysctl_max_recv_allocation; > +extern unsigned int rds_ib_sysctl_flow_control; > extern ctl_table rds_ib_sysctl_table[]; > > /* > Index: ofa_kernel-1.3/net/rds/ib_cm.c > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib_cm.c > +++ ofa_kernel-1.3/net/rds/ib_cm.c > @@ -55,6 +55,22 @@ static void rds_ib_set_protocol(struct r > } > > /* > + * Set up flow control > + */ > +static void rds_ib_set_flow_control(struct rds_connection *conn, u32 credits) > +{ > + struct rds_ib_connection *ic = conn->c_transport_data; > + > + if (rds_ib_sysctl_flow_control && credits != 0) { > + /* We're doing flow control */ > + ic->i_flowctl = 1; > + rds_ib_send_add_credits(conn, credits); > + } else { > + ic->i_flowctl = 0; > + } > +} > + > +/* > * Connection established. > * We get here for both outgoing and incoming connection. > */ > @@ -72,12 +88,16 @@ static void rds_ib_connect_complete(stru > rds_ib_set_protocol(conn, > RDS_PROTOCOL(dp->dp_protocol_major, > dp->dp_protocol_minor)); > + rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); > } > > - rdsdebug("RDS/IB: ib conn complete on %u.%u.%u.%u version %u.%u\n", > + printk(KERN_NOTICE "RDS/IB: connected to %u.%u.%u.%u version %u.%u%s\n", > NIPQUAD(conn->c_laddr), > RDS_PROTOCOL_MAJOR(conn->c_version), > - RDS_PROTOCOL_MINOR(conn->c_version)); > + RDS_PROTOCOL_MINOR(conn->c_version), > + ic->i_flowctl? ", flow control" : ""); > + > + rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 1); > > /* Tune the RNR timeout. We use a rather low timeout, but > * not the absolute minimum - this should be tunable. > @@ -129,6 +149,24 @@ static void rds_ib_cm_fill_conn_param(st > dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS); > dp->dp_ack_seq = rds_ib_piggyb_ack(ic); > > + /* Advertise flow control. > + * > + * Major chicken and egg alert! > + * We would like to post receive buffers before we get here (eg. > + * in rds_ib_setup_qp), so that we can give the peer an accurate > + * credit value. > + * Unfortunately we can't post receive buffers until we've finished > + * protocol negotiation, and know in which order data and payload > + * are arranged. > + * > + * What we do here is we give the peer a small initial credit, and > + * initialize the number of posted buffers to a negative value. 
> + */ > + if (ic->i_flowctl) { > + atomic_set(&ic->i_credits, IB_SET_POST_CREDITS(-4)); > + dp->dp_credit = cpu_to_be32(4); > + } > + > conn_param->private_data = dp; > conn_param->private_data_len = sizeof(*dp); > } > @@ -363,6 +401,7 @@ static int rds_ib_cm_handle_connect(stru > ic = conn->c_transport_data; > > rds_ib_set_protocol(conn, version); > + rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); > > /* If the peer gave us the last packet it saw, process this as if > * we had received a regular ACK. */ > @@ -428,6 +467,7 @@ out: > static int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id) > { > struct rds_connection *conn = cm_id->context; > + struct rds_ib_connection *ic = conn->c_transport_data; > struct rdma_conn_param conn_param; > struct rds_ib_connect_private dp; > int ret; > @@ -435,6 +475,7 @@ static int rds_ib_cm_initiate_connect(st > /* If the peer doesn't do protocol negotiation, we must > * default to RDSv3.0 */ > rds_ib_set_protocol(conn, RDS_PROTOCOL_3_0); > + ic->i_flowctl = rds_ib_sysctl_flow_control; /* advertise flow control */ > > ret = rds_ib_setup_qp(conn); > if (ret) { > @@ -688,6 +729,10 @@ void rds_ib_conn_shutdown(struct rds_con > #endif > ic->i_ack_recv = 0; > > + /* Clear flow control state */ > + ic->i_flowctl = 0; > + atomic_set(&ic->i_credits, 0); > + > if (ic->i_ibinc) { > rds_inc_put(&ic->i_ibinc->ii_inc); > ic->i_ibinc = NULL; > Index: ofa_kernel-1.3/net/rds/ib_recv.c > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib_recv.c > +++ ofa_kernel-1.3/net/rds/ib_recv.c > @@ -220,16 +220,17 @@ out: > * -1 is returned if posting fails due to temporary resource exhaustion. > */ > int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, > - gfp_t page_gfp) > + gfp_t page_gfp, int prefill) > { > struct rds_ib_connection *ic = conn->c_transport_data; > struct rds_ib_recv_work *recv; > struct ib_recv_wr *failed_wr; > + unsigned int posted = 0; > int ret = 0; > u32 pos; > > - while (rds_conn_up(conn) && rds_ib_ring_alloc(&ic->i_recv_ring, 1, &pos)) { > - > + while ((prefill || rds_conn_up(conn)) > + && rds_ib_ring_alloc(&ic->i_recv_ring, 1, &pos)) { > if (pos >= ic->i_recv_ring.w_nr) { > printk(KERN_NOTICE "Argh - ring alloc returned pos=%u\n", > pos); > @@ -257,8 +258,14 @@ int rds_ib_recv_refill(struct rds_connec > ret = -1; > break; > } > + > + posted++; > } > > + /* We're doing flow control - update the window. */ > + if (ic->i_flowctl && posted) > + rds_ib_advertise_credits(conn, posted); > + > if (ret) > rds_ib_ring_unalloc(&ic->i_recv_ring, 1); > return ret; > @@ -436,7 +443,7 @@ static u64 rds_ib_get_ack(struct rds_ib_ > #endif > > > -static void rds_ib_send_ack(struct rds_ib_connection *ic) > +static void rds_ib_send_ack(struct rds_ib_connection *ic, unsigned int adv_credits) > { > struct rds_header *hdr = ic->i_ack; > struct ib_send_wr *failed_wr; > @@ -448,6 +455,7 @@ static void rds_ib_send_ack(struct rds_i > rdsdebug("send_ack: ic %p ack %llu\n", ic, (unsigned long long) seq); > rds_message_populate_header(hdr, 0, 0, 0); > hdr->h_ack = cpu_to_be64(seq); > + hdr->h_credit = adv_credits; > rds_message_make_checksum(hdr); > ic->i_ack_queued = jiffies; > > @@ -460,6 +468,8 @@ static void rds_ib_send_ack(struct rds_i > set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); > > rds_ib_stats_inc(s_ib_ack_send_failure); > + /* Need to finesse this later. 
*/ > + BUG(); > } else > rds_ib_stats_inc(s_ib_ack_sent); > } > @@ -502,15 +512,27 @@ static void rds_ib_send_ack(struct rds_i > * When we get here, we're called from the recv queue handler. > * Check whether we ought to transmit an ACK. > */ > -static void rds_ib_attempt_ack(struct rds_ib_connection *ic) > +void rds_ib_attempt_ack(struct rds_ib_connection *ic) > { > + unsigned int adv_credits; > + > if (!test_bit(IB_ACK_REQUESTED, &ic->i_ack_flags)) > return; > - if (!test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) { > - clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); > - rds_ib_send_ack(ic); > - } else > + > + if (test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) { > rds_ib_stats_inc(s_ib_ack_send_delayed); > + return; > + } > + > + /* Can we get a send credit? */ > + if (!rds_ib_send_grab_credits(ic, 1, &adv_credits)) { > + rds_ib_stats_inc(s_ib_tx_throttle); > + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); > + return; > + } > + > + clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); > + rds_ib_send_ack(ic, adv_credits); > } > > /* > @@ -706,6 +728,10 @@ void rds_ib_process_recv(struct rds_conn > state->ack_recv = be64_to_cpu(ihdr->h_ack); > state->ack_recv_valid = 1; > > + /* Process the credits update if there was one */ > + if (ihdr->h_credit) > + rds_ib_send_add_credits(conn, ihdr->h_credit); > + > if (ihdr->h_sport == 0 && ihdr->h_dport == 0 && byte_len == 0) { > /* This is an ACK-only packet. The fact that it gets > * special treatment here is that historically, ACKs > @@ -877,7 +903,7 @@ void rds_ib_recv_cq_comp_handler(struct > > if (mutex_trylock(&ic->i_recv_mutex)) { > if (rds_ib_recv_refill(conn, GFP_ATOMIC, > - GFP_ATOMIC | __GFP_HIGHMEM)) > + GFP_ATOMIC | __GFP_HIGHMEM, 0)) > ret = -EAGAIN; > else > rds_ib_stats_inc(s_ib_rx_refill_from_cq); > @@ -901,7 +927,7 @@ int rds_ib_recv(struct rds_connection *c > * we're really low and we want the caller to back off for a bit. > */ > mutex_lock(&ic->i_recv_mutex); > - if (rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER)) > + if (rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 0)) > ret = -ENOMEM; > else > rds_ib_stats_inc(s_ib_rx_refill_from_thread); > Index: ofa_kernel-1.3/net/rds/ib.c > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib.c > +++ ofa_kernel-1.3/net/rds/ib.c > @@ -187,6 +187,7 @@ static void rds_ib_exit(void) > > struct rds_transport rds_ib_transport = { > .laddr_check = rds_ib_laddr_check, > + .xmit_complete = rds_ib_xmit_complete, > .xmit = rds_ib_xmit, > .xmit_cong_map = NULL, > .xmit_rdma = rds_ib_xmit_rdma, > Index: ofa_kernel-1.3/net/rds/ib_send.c > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib_send.c > +++ ofa_kernel-1.3/net/rds/ib_send.c > @@ -245,6 +245,144 @@ void rds_ib_send_cq_comp_handler(struct > } > } > > +/* > + * This is the main function for allocating credits when sending > + * messages. > + * > + * Conceptually, we have two counters: > + * - send credits: this tells us how many WRs we're allowed > + * to submit without overruning the reciever's queue. For > + * each SEND WR we post, we decrement this by one. > + * > + * - posted credits: this tells us how many WRs we recently > + * posted to the receive queue. This value is transferred > + * to the peer as a "credit update" in a RDS header field. > + * Every time we transmit credits to the peer, we subtract > + * the amount of transferred credits from this counter. 
> + * > + * It is essential that we avoid situations where both sides have > + * exhausted their send credits, and are unable to send new credits > + * to the peer. We achieve this by requiring that we send at least > + * one credit update to the peer before exhausting our credits. > + * When new credits arrive, we subtract one credit that is withheld > + * until we've posted new buffers and are ready to transmit these > + * credits (see rds_ib_send_add_credits below). > + * > + * The RDS send code is essentially single-threaded; rds_send_xmit > + * grabs c_send_sem to ensure exclusive access to the send ring. > + * However, the ACK sending code is independent and can race with > + * message SENDs. > + * > + * In the send path, we need to update the counters for send credits > + * and the counter of posted buffers atomically - when we use the > + * last available credit, we cannot allow another thread to race us > + * and grab the posted credits counter. Hence, we have to use a > + * spinlock to protect the credit counter, or use atomics. > + * > + * Spinlocks shared between the send and the receive path are bad, > + * because they create unnecessary delays. An early implementation > + * using a spinlock showed a 5% degradation in throughput at some > + * loads. > + * > + * This implementation avoids spinlocks completely, putting both > + * counters into a single atomic, and updating that atomic using > + * atomic_add (in the receive path, when receiving fresh credits), > + * and using atomic_cmpxchg when updating the two counters. > + */ > +int rds_ib_send_grab_credits(struct rds_ib_connection *ic, > + u32 wanted, u32 *adv_credits) > +{ > + unsigned int avail, posted, got = 0, advertise; > + long oldval, newval; > + > + *adv_credits = 0; > + if (!ic->i_flowctl) > + return wanted; > + > +try_again: > + advertise = 0; > + oldval = newval = atomic_read(&ic->i_credits); > + posted = IB_GET_POST_CREDITS(oldval); > + avail = IB_GET_SEND_CREDITS(oldval); > + > + rdsdebug("rds_ib_send_grab_credits(%u): credits=%u posted=%u\n", > + wanted, avail, posted); > + > + /* The last credit must be used to send a credit updated. */ > + if (avail && !posted) > + avail--; > + > + if (avail < wanted) { > + struct rds_connection *conn = ic->i_cm_id->context; > + > + /* Oops, there aren't that many credits left! */ > + set_bit(RDS_LL_SEND_FULL, &conn->c_flags); > + got = avail; > + } else { > + /* Sometimes you get what you want, lalala. */ > + got = wanted; > + } > + newval -= IB_SET_SEND_CREDITS(got); > + > + if (got && posted) { > + advertise = min_t(unsigned int, posted, RDS_MAX_ADV_CREDIT); > + newval -= IB_SET_POST_CREDITS(advertise); > + } > + > + /* Finally bill everything */ > + if (atomic_cmpxchg(&ic->i_credits, oldval, newval) != oldval) > + goto try_again; > + > + *adv_credits = advertise; > + return got; > +} > + > +void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits) > +{ > + struct rds_ib_connection *ic = conn->c_transport_data; > + > + if (credits == 0) > + return; > + > + rdsdebug("rds_ib_send_add_credits(%u): current=%u%s\n", > + credits, > + IB_GET_SEND_CREDITS(atomic_read(&ic->i_credits)), > + test_bit(RDS_LL_SEND_FULL, &conn->c_flags)? 
", ll_send_full" : ""); > + > + atomic_add(IB_SET_SEND_CREDITS(credits), &ic->i_credits); > + if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags)) > + queue_delayed_work(rds_wq, &conn->c_send_w, 0); > + > + WARN_ON(IB_GET_SEND_CREDITS(credits) >= 16384); > + > + rds_ib_stats_inc(s_ib_rx_credit_updates); > +} > + > +void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted) > +{ > + struct rds_ib_connection *ic = conn->c_transport_data; > + > + if (posted == 0) > + return; > + > + atomic_add(IB_SET_POST_CREDITS(posted), &ic->i_credits); > + > + /* Decide whether to send an update to the peer now. > + * If we would send a credit update for every single buffer we > + * post, we would end up with an ACK storm (ACK arrives, > + * consumes buffer, we refill the ring, send ACK to remote > + * advertising the newly posted buffer... ad inf) > + * > + * Performance pretty much depends on how often we send > + * credit updates - too frequent updates mean lots of ACKs. > + * Too infrequent updates, and the peer will run out of > + * credits and has to throttle. > + * For the time being, 16 seems to be a good compromise. > + */ > + if (IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)) >= 16) > + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); > +} > + > static inline void > rds_ib_xmit_populate_wr(struct rds_ib_connection *ic, > struct rds_ib_send_work *send, unsigned int pos, > @@ -307,6 +445,8 @@ int rds_ib_xmit(struct rds_connection *c > u32 pos; > u32 i; > u32 work_alloc; > + u32 credit_alloc; > + u32 adv_credits = 0; > int send_flags = 0; > int sent; > int ret; > @@ -314,6 +454,7 @@ int rds_ib_xmit(struct rds_connection *c > BUG_ON(off % RDS_FRAG_SIZE); > BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header)); > > + /* FIXME we may overallocate here */ > if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) > i = 1; > else > @@ -327,8 +468,29 @@ int rds_ib_xmit(struct rds_connection *c > goto out; > } > > + credit_alloc = work_alloc; > + if (ic->i_flowctl) { > + credit_alloc = rds_ib_send_grab_credits(ic, work_alloc, &adv_credits); > + if (credit_alloc < work_alloc) { > + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - credit_alloc); > + work_alloc = credit_alloc; > + } > + if (work_alloc == 0) { > + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); > + rds_ib_stats_inc(s_ib_tx_throttle); > + ret = -ENOMEM; > + goto out; > + } > + } > + > /* map the message the first time we see it */ > if (ic->i_rm == NULL) { > + /* > + printk(KERN_NOTICE "rds_ib_xmit prep msg dport=%u flags=0x%x len=%d\n", > + be16_to_cpu(rm->m_inc.i_hdr.h_dport), > + rm->m_inc.i_hdr.h_flags, > + be32_to_cpu(rm->m_inc.i_hdr.h_len)); > + */ > if (rm->m_nents) { > rm->m_count = ib_dma_map_sg(dev, > rm->m_sg, rm->m_nents, DMA_TO_DEVICE); > @@ -449,6 +611,24 @@ add_header: > * have been set up to point to the right header buffer. 
*/ > memcpy(&ic->i_send_hdrs[pos], &rm->m_inc.i_hdr, sizeof(struct rds_header)); > > + if (0) { > + struct rds_header *hdr = &ic->i_send_hdrs[pos]; > + > + printk(KERN_NOTICE "send WR dport=%u flags=0x%x len=%d\n", > + be16_to_cpu(hdr->h_dport), > + hdr->h_flags, > + be32_to_cpu(hdr->h_len)); > + } > + if (adv_credits) { > + struct rds_header *hdr = &ic->i_send_hdrs[pos]; > + > + /* add credit and redo the header checksum */ > + hdr->h_credit = adv_credits; > + rds_message_make_checksum(hdr); > + adv_credits = 0; > + rds_ib_stats_inc(s_ib_tx_credit_updates); > + } > + > if (prev) > prev->s_wr.next = &send->s_wr; > prev = send; > @@ -472,6 +652,8 @@ add_header: > rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - i); > work_alloc = i; > } > + if (ic->i_flowctl && i < credit_alloc) > + rds_ib_send_add_credits(conn, credit_alloc - i); > > /* XXX need to worry about failed_wr and partial sends. */ > failed_wr = &first->s_wr; > @@ -487,11 +669,14 @@ add_header: > ic->i_rm = prev->s_rm; > prev->s_rm = NULL; > } > + /* Finesse this later */ > + BUG(); > goto out; > } > > ret = sent; > out: > + BUG_ON(adv_credits); > return ret; > } > > @@ -630,3 +815,12 @@ int rds_ib_xmit_rdma(struct rds_connecti > out: > return ret; > } > + > +void rds_ib_xmit_complete(struct rds_connection *conn) > +{ > + struct rds_ib_connection *ic = conn->c_transport_data; > + > + /* We may have a pending ACK or window update we were unable > + * to send previously (due to flow control). Try again. */ > + rds_ib_attempt_ack(ic); > +} > Index: ofa_kernel-1.3/net/rds/ib_stats.c > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib_stats.c > +++ ofa_kernel-1.3/net/rds/ib_stats.c > @@ -46,14 +46,17 @@ static char *rds_ib_stat_names[] = { > "ib_tx_cq_call", > "ib_tx_cq_event", > "ib_tx_ring_full", > + "ib_tx_throttle", > "ib_tx_sg_mapping_failure", > "ib_tx_stalled", > + "ib_tx_credit_updates", > "ib_rx_cq_call", > "ib_rx_cq_event", > "ib_rx_ring_empty", > "ib_rx_refill_from_cq", > "ib_rx_refill_from_thread", > "ib_rx_alloc_limit", > + "ib_rx_credit_updates", > "ib_ack_sent", > "ib_ack_send_failure", > "ib_ack_send_delayed", > Index: ofa_kernel-1.3/net/rds/rds.h > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/rds.h > +++ ofa_kernel-1.3/net/rds/rds.h > @@ -170,6 +170,7 @@ struct rds_connection { > #define RDS_FLAG_CONG_BITMAP 0x01 > #define RDS_FLAG_ACK_REQUIRED 0x02 > #define RDS_FLAG_RETRANSMITTED 0x04 > +#define RDS_MAX_ADV_CREDIT 255 > > /* > * Maximum space available for extension headers. 
> @@ -183,7 +184,8 @@ struct rds_header { > __be16 h_sport; > __be16 h_dport; > u8 h_flags; > - u8 h_padding[5]; > + u8 h_credit; > + u8 h_padding[4]; > __sum16 h_csum; > > u8 h_exthdr[RDS_HEADER_EXT_SPACE]; > Index: ofa_kernel-1.3/net/rds/ib_sysctl.c > =================================================================== > --- ofa_kernel-1.3.orig/net/rds/ib_sysctl.c > +++ ofa_kernel-1.3/net/rds/ib_sysctl.c > @@ -53,6 +53,8 @@ unsigned long rds_ib_sysctl_max_unsig_by > static unsigned long rds_ib_sysctl_max_unsig_bytes_min = 1; > static unsigned long rds_ib_sysctl_max_unsig_bytes_max = ~0UL; > > +unsigned int rds_ib_sysctl_flow_control = 1; > + > ctl_table rds_ib_sysctl_table[] = { > { > .ctl_name = 1, > @@ -102,6 +104,14 @@ ctl_table rds_ib_sysctl_table[] = { > .mode = 0644, > .proc_handler = &proc_doulongvec_minmax, > }, > + { > + .ctl_name = 6, > + .procname = "flow_control", > + .data = &rds_ib_sysctl_flow_control, > + .maxlen = sizeof(rds_ib_sysctl_flow_control), > + .mode = 0644, > + .proc_handler = &proc_dointvec, > + }, > { .ctl_name = 0} > }; > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From matthias at sgi.com Tue May 20 13:44:32 2008 From: matthias at sgi.com (Matthias Blankenhaus) Date: Tue, 20 May 2008 13:44:32 -0700 (PDT) Subject: [ofa-general] saquery port problems In-Reply-To: References: Message-ID: Forgot some important info: saquery BUILD VERSION: 1.3.6 OFED-1.3 On Tue, 20 May 2008, Matthias Blankenhaus wrote: > Howdy ! > > While using this tool to run some queries on a two port HCA, I noticed > some odd behavior. Here are my observations running on a SLES10SP2 > (x86_64) Intel Xeon with a Mellanox Technologies MT25208 InfiniHost III > Ex HCA: > > (01) saquery -C mthca0 -m > This yields the output for port number two. This is not conform with the > usual ib tools behavior to report on port one per default. > > (02) saquery -C mthca0 -m -P 1 > Fails with "Failed to find active port, check port status with "ibstat". > This is incorrect, since > > # ibstat mthca0 1 > CA: 'mthca0' > Port 1: > State: Active > Physical state: LinkUp > Rate: 20 > Base lid: 5 > LMC: 0 > SM lid: 1 > Capability mask: 0x02510a68 > Port GUID: 0x0008f10403987dc5 > > This might be the reason why (01) report on port two. > > (03) saquery -C mthca0 -m -P 2 > Works and is identical with the out out from (01). > > However, the following command options work: > > (04) saquery -P 1 -m > Correctly yields the output for port one. In other words > port one seems to be fine unlike reported in (02). > > (05) saquery -P 2 -m > Correctly yields the output for port two. > > > Is it incorrect to use -C and -P in combination ? Why does does > saquery think that port one is not active ? 
> > > Thanx, > Matthias > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From Thomas.Talpey at netapp.com Tue May 20 13:53:39 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 20 May 2008 16:53:39 -0400 Subject: [ofa-general] RDS flow control In-Reply-To: <20080520204522.GD31790@opengridcomputing.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805141516.01908.okir@lst.de> <200805161638.18067.olaf.kirch@oracle.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> Message-ID: At 04:45 PM 5/20/2008, Jon Mason wrote: >With proper flow control, there should no longer be a need for >rnr_retry (as there >should always be a posted recv buffer waiting for the incoming data). I did a >quick test and removed it and everything seemed to be happy on my >rds-stress run. I'd be interested in any extended load-testing of operation with rnr_retry==0 that you might be able to do. The NFS/RDMA client sets it to zero, for the same reason (the rpcrdma protocol exchanges credits). But at the NFS Connectathon last week we were seeing spontaneous connection loss, that went away when we set rnr_retry to 7 (infinity). However, it also did not appear when it was set to 1, and later we were able to pass again at zero. Very strange, I'm still trying to figure if it's an upper layer issue or some lower layer timing quirk. The switch we were using there was a bit flaky. Tom. From rdreier at cisco.com Tue May 20 14:02:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 14:02:18 -0700 Subject: [ofa-general] mthca max_sge value... ugh. In-Reply-To: <200805191007.24888.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 19 May 2008 10:07:24 +0300") References: <200805190812.39408.jackm@dev.mellanox.co.il> <200805191007.24888.jackm@dev.mellanox.co.il> Message-ID: > Then, we get into the complexity of sanity checking in create_qp (since we should > be able to use the value returned by create-qp when calling create-qp, and get > the same result). Essentially, we will need to check the requested sge numbers > per QP type, whether it is for send or receive, etc. IMHO, this gets nasty very > quickly -- creates a problem with support -- users will need a "roadmap" for create-qp. Actually it seems pretty easy to understand -- returned max_sge is the largest value that is guaranteed to work. If it happens that the requested QP gives more capabilities "for free" then the driver will tell you in the returned structure. But whatever. > I much prefer to treat the query_hca returned values as absolute maxima, and enforce > these limits (although this is at the expense of additional s/g entries for some > qp types and send/receive). OK, I added the patch below to fix this mlx4 bug without returning any s/g entries beyond what the device returns. commit cd155c1c7c9e64df6afb5504d292fef7cb783a4f Author: Roland Dreier Date: Tue May 20 14:00:02 2008 -0700 IB/mlx4: Fix creation of kernel QP with max number of send s/g entries When creating a kernel QP where the consumer asked for a send queue with lots of scatter/gater entries, set_kernel_sq_size() incorrectly returned an error if the send queue stride is larger than the hardware's maximum send work request descriptor size. 
This is not a problem; the only issue is to make sure that the actual descriptors used do not overflow the maximum descriptor size, so check this instead. Clamp the returned max_send_sge value to be no bigger than what query_device returns for the max_sge to avoid confusing hapless users, even if the hardware is capable of handling a few more s/g entries. This bug caused NFS/RDMA mounts to fail when the server adapter used the mlx4 driver. Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index cec030e..a80df22 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -333,6 +333,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + send_wqe_overhead(type, qp->flags); + if (s > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + /* * Hermon supports shrinking WQEs, such that a single work * request can include multiple units of 1 << wqe_shift. This @@ -372,9 +375,6 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); for (;;) { - if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) - return -EINVAL; - qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1U << qp->sq.wqe_shift); /* @@ -395,7 +395,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, ++qp->sq.wqe_shift; } - qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + qp->sq.max_gs = (min(dev->dev->caps.max_sq_desc_sz, + (qp->sq_max_wqes_per_wr << qp->sq.wqe_shift)) - send_wqe_overhead(type, qp->flags)) / sizeof (struct mlx4_wqe_data_seg); @@ -411,7 +412,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_wr = qp->sq.max_post = (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; - cap->max_send_sge = qp->sq.max_gs; + cap->max_send_sge = min(qp->sq.max_gs, + min(dev->dev->caps.max_sq_sg, + dev->dev->caps.max_rq_sg)); /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; From rdreier at cisco.com Tue May 20 14:02:59 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 14:02:59 -0700 Subject: Selective signalling vs WQE size - was Re: [ofa-general] mthca max_sge value... ugh. In-Reply-To: <1211260229.6556.18.camel@eli-laptop> (Eli Cohen's message of "Tue, 20 May 2008 08:10:29 +0300") References: <1211260229.6556.18.camel@eli-laptop> Message-ID: > Roland, I posted a few months ago a patch that optimizes post send for > selective signaling QPs. It must have slipped somehow because I did not > get any reply on it and since I did not know of anyone using selective > signaling I forgot about this too. The idea is that for selective > signaling QPs, before you stamp the WQE, you read the value of the DS > field which denotes the effective size of the descriptor as used in the > previous post, and stamp only that area, relying on the fact that the > rest of the descriptor is already stamped. Here is a link to the patch. > I don't know if it applies cleanly now but if we agree on the idea I > will generate it again against the current tree. Does it make a measurable difference? If so then it seems like a good idea. 
From olaf.kirch at oracle.com Tue May 20 14:13:39 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Tue, 20 May 2008 23:13:39 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: <20080520204522.GD31790@opengridcomputing.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> Message-ID: <200805202313.40213.olaf.kirch@oracle.com> On Tuesday 20 May 2008 22:45:22 Jon Mason wrote: > Works well on my setup. Good to hear! > With proper flow control, there should no longer be a need for rnr_retry (as there > should always be a posted recv buffer waiting for the incoming data). I did a > quick test and removed it and everything seemed to be happy on my rds-stress run. I would like to make the setting of the RNR retry/timeout conditional on whether both ends of the connection support flow control or not - we need to think of rolling upgrades of a cluster, so mixed environments just have to work. Unfortunately, the RNR retry count is set prior to establishing the connection, before we even know whether the remote is capable of doing flow control. Is there a way of changing the RNR retry count back to 0 after establishing the connection? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From rdreier at cisco.com Tue May 20 14:21:40 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 14:21:40 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: <200805202313.40213.olaf.kirch@oracle.com> (Olaf Kirch's message of "Tue, 20 May 2008 23:13:39 +0200") References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> Message-ID: > Is there a way of changing the RNR retry count back to 0 after establishing > the connection? Yes... quite complicated but possible. Basically you have to transition to the QP to the "send queue drained" (SQD) state, change the rnr retry value in an SQD->SQD transition and then transition back to RTS. Not sure if anyone has ever tested that whole operation, so it may or may not actually work without driver/fw fixes required. - R. From hrosenstock at xsigo.com Tue May 20 14:24:15 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 14:24:15 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> Message-ID: <1211318655.18236.31.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 14:21 -0700, Roland Dreier wrote: > > Is there a way of changing the RNR retry count back to 0 after establishing > > the connection? > > Yes... quite complicated but possible. Basically you have to transition > to the QP to the "send queue drained" (SQD) state, change the rnr retry > value in an SQD->SQD transition and then transition back to RTS. Not > sure if anyone has ever tested that whole operation, so it may or may > not actually work without driver/fw fixes required. That's the local end; is there some needed CM aspect of this too ? -- Hal > - R. 
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue May 20 14:27:45 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 14:27:45 -0700 Subject: [ofa-general] [PATCH] IPoIB: Test for NULL broadcast object in opiob_mcast_join_finish. In-Reply-To: <200805191703.05887.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Mon, 19 May 2008 17:03:05 +0300") References: <200805191703.05887.jackm@dev.mellanox.co.il> Message-ID: thanks, applied. From ralph.campbell at qlogic.com Tue May 20 14:28:38 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 20 May 2008 14:28:38 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: <200805202313.40213.olaf.kirch@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> Message-ID: <1211318918.3949.268.camel@brick.pathscale.com> On Tue, 2008-05-20 at 23:13 +0200, Olaf Kirch wrote: > On Tuesday 20 May 2008 22:45:22 Jon Mason wrote: > > Works well on my setup. > > Good to hear! > > > With proper flow control, there should no longer be a need for rnr_retry (as there > > should always be a posted recv buffer waiting for the incoming data). I did a > > quick test and removed it and everything seemed to be happy on my rds-stress run. > > I would like to make the setting of the RNR retry/timeout conditional on > whether both ends of the connection support flow control or not - we need > to think of rolling upgrades of a cluster, so mixed environments just have > to work. Unfortunately, the RNR retry count is set prior to establishing > the connection, before we even know whether the remote is capable of doing > flow control. > > Is there a way of changing the RNR retry count back to 0 after establishing > the connection? You can use ib_modify_qp() to set the QP state to IB_QPS_SQD (drain), modify the IB_QP_RNR_RETRY parameter, and modify the QP back to IB_QPS_RTS. It seems to me that modify QP could allow a RTS to RTS transition and set the IB_QP_RNR_RETRY count but the qp_state_table[] doesn't seem to indicate that is valid. From rdreier at cisco.com Tue May 20 14:31:50 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 14:31:50 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: <1211318655.18236.31.camel@hrosenstock-ws.xsigo.com> (Hal Rosenstock's message of "Tue, 20 May 2008 14:24:15 -0700") References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <1211318655.18236.31.camel@hrosenstock-ws.xsigo.com> Message-ID: > That's the local end; is there some needed CM aspect of this too ? I don't think so. RNR retry behavior is purely local so I don't see any need to coordinate when changing it. - R. 
From rdreier at cisco.com Tue May 20 14:33:40 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 14:33:40 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: <1211318918.3949.268.camel@brick.pathscale.com> (Ralph Campbell's message of "Tue, 20 May 2008 14:28:38 -0700") References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <1211318918.3949.268.camel@brick.pathscale.com> Message-ID: > You can use ib_modify_qp() to set the QP state to IB_QPS_SQD (drain), > modify the IB_QP_RNR_RETRY parameter, and modify the QP back to > IB_QPS_RTS. It seems to me that modify QP could allow a RTS to RTS > transition and set the IB_QP_RNR_RETRY count but the qp_state_table[] > doesn't seem to indicate that is valid. The IB spec doesn't allow changing RNR retry on RTS to RTS transitions. Probably because synchronizing the change with in-flight send requests (that might be doing RNR handling at that moment) is too much of a mess. - R. From hrosenstock at xsigo.com Tue May 20 14:41:11 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 14:41:11 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <1211318655.18236.31.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211319671.18236.38.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 14:31 -0700, Roland Dreier wrote: > > That's the local end; is there some needed CM aspect of this too ? > > I don't think so. RNR retry behavior is purely local so I don't see any > need to coordinate when changing it. Yes, but it is exchanged in both CM REQ and REP: The total number of times that the REQ or REP sender wishes the receiver to retry RNR NAK errors before posting a completion error -- Hal > - R. From dave.olson at qlogic.com Tue May 20 15:08:51 2008 From: dave.olson at qlogic.com (Dave Olson) Date: Tue, 20 May 2008 15:08:51 -0700 (PDT) Subject: [ofa-general] PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned Message-ID: Ralph Campbell will submit this patch for ofed 1.3.1, also. IB/MAD - fix crash when HCA returns IB_MAD_RESULT_SUCCESS|IB_MAD_RESULT_CONSUMED This was observed with the hw/ipath driver, but could happen with any driver. It's OFED bug 1027. The fix is to kfree the local data and break, rather than falling through. 
Signed-off-by: Dave Olson --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -747,7 +747,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, break; case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: kmem_cache_free(ib_mad_cache, mad_priv); - break; + kfree(local); + goto out; case IB_MAD_RESULT_SUCCESS: /* Treat like an incoming receive MAD */ port_priv = ib_get_mad_port(mad_agent_priv->agent.device, Dave Olson dave.olson at qlogic.com From arlin.r.davis at intel.com Tue May 20 15:10:00 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Tue, 20 May 2008 15:10:00 -0700 Subject: [ofa-general] [PATCH 1/1][v1.2] dtest: fix build issue with Redhat EL5.1 Message-ID: <002401c8bac6$4047fe10$2fbf020a@amr.corp.intel.com> need include files/definitions for sleep, getpid, gettimeofday Signed-off by: Arlin Davis ardavis at ichips.intel.com --- test/dtest/dtest.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index 2db141f..039b6bf 100755 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -35,13 +35,16 @@ #include #include #include +#include #include +#include #include #include #include #include #include #include +#include #ifndef DAPL_PROVIDER #define DAPL_PROVIDER "OpenIB-cma" -- 1.5.2.5 From rdreier at cisco.com Tue May 20 15:10:08 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 15:10:08 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: <1211319671.18236.38.camel@hrosenstock-ws.xsigo.com> (Hal Rosenstock's message of "Tue, 20 May 2008 14:41:11 -0700") References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <1211318655.18236.31.camel@hrosenstock-ws.xsigo.com> <1211319671.18236.38.camel@hrosenstock-ws.xsigo.com> Message-ID: > Yes, but it is exchanged in both CM REQ and REP: > The total number of times that the REQ or REP sender wishes the receiver > to retry RNR NAK errors before posting a completion error I know -- but there isn't any requirement that I know of to do any further CM stuff if the values change after the connection is established. From rdreier at cisco.com Tue May 20 15:17:59 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 15:17:59 -0700 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: (Dave Olson's message of "Tue, 20 May 2008 15:08:51 -0700 (PDT)") References: Message-ID: > case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: > kmem_cache_free(ib_mad_cache, mad_priv); > - break; > + kfree(local); > + goto out; Seems you need to set ret = 1 here? Otherwise I think ib_post_send_mad will continue handling the send even though the packet was supposedly consumed. Also as a side note, I think handle_outgoing_dr_smp() would be clearer if rather than having out: return ret; and then doing stuff like ret = -EINVAL; goto out; the code just did "return -EINVAL;" Maybe I'll do that cleanup for 2.6.27. - R. 
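For reference, the case arm with that fix folded in would read roughly as follows (a sketch only, not a tested patch; ret == 1 is the value ib_post_send_mad() treats as "locally consumed", as the caller quoted later in this thread shows):

    case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED:
            kmem_cache_free(ib_mad_cache, mad_priv);
            kfree(local);
            /* tell ib_post_send_mad() the MAD was locally consumed */
            ret = 1;
            goto out;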
From dave.olson at qlogic.com Tue May 20 15:23:26 2008 From: dave.olson at qlogic.com (Dave Olson) Date: Tue, 20 May 2008 15:23:26 -0700 (PDT) Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: References: Message-ID: On Tue, 20 May 2008, Roland Dreier wrote: | > case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: | > kmem_cache_free(ib_mad_cache, mad_priv); | > - break; | > + kfree(local); | > + goto out; | | Seems you need to set ret = 1 here? Otherwise I think ib_post_send_mad | will continue handling the send even though the packet was supposedly | consumed. Yes, you are right about the fact that it should be set, but apparently all callers are simply checking for a return value > 0, because the packet is only sent once (return values > 1 have no defined meaning so I'm not surprised the callers just check > 0). Do you want me to resubmit it that way, or do you want to make the change? | Also as a side note, I think handle_outgoing_dr_smp() would be clearer | if rather than having | | out: | return ret; | | and then doing stuff like | | ret = -EINVAL; | goto out; | | the code just did "return -EINVAL;" | | Maybe I'll do that cleanup for 2.6.27. Seems reasonable enough to me. Dave Olson dave.olson at qlogic.com From ralph.campbell at qlogic.com Tue May 20 15:28:07 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 20 May 2008 15:28:07 -0700 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: References: Message-ID: <1211322487.3949.275.camel@brick.pathscale.com> On Tue, 2008-05-20 at 15:17 -0700, Roland Dreier wrote: > > case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: > > kmem_cache_free(ib_mad_cache, mad_priv); > > - break; > > + kfree(local); > > + goto out; > > Seems you need to set ret = 1 here? Otherwise I think ib_post_send_mad > will continue handling the send even though the packet was supposedly > consumed. I agree. > Also as a side note, I think handle_outgoing_dr_smp() would be clearer > if rather than having > > out: > return ret; > > and then doing stuff like > > ret = -EINVAL; > goto out; > > the code just did "return -EINVAL;" > > Maybe I'll do that cleanup for 2.6.27. I also agree but I remember at one point we got pushback from one of the mainline kernel developers who really wanted to see only one return point in the code even if it meant more gotos. I don't remember who though. From sean.hefty at intel.com Tue May 20 15:31:46 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 20 May 2008 15:31:46 -0700 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c whenIB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: <1211322487.3949.275.camel@brick.pathscale.com> References: <1211322487.3949.275.camel@brick.pathscale.com> Message-ID: <000101c8bac9$49c529b0$0d59180a@amr.corp.intel.com> >I also agree but I remember at one point we got pushback from >one of the mainline kernel developers who really wanted to see >only one return point in the code even if it meant more gotos. >I don't remember who though. I think the coding style document calls out using a single return point, but I don't think that's always the cleanest approach either. 
From rdreier at cisco.com Tue May 20 15:33:45 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 15:33:45 -0700 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: (Dave Olson's message of "Tue, 20 May 2008 15:23:26 -0700 (PDT)") References: Message-ID: > Yes, you are right about the fact that it should be set, but apparently > all callers are simply checking for a return value > 0, because the > packet is only sent once (return values > 1 have no defined meaning so > I'm not surprised the callers just check > 0). In my tree (ie the upstream kernel) I see only one place handle_outgoing_dr_smp() is called, and it looks like: ret = handle_outgoing_dr_smp(mad_agent_priv, mad_send_wr); if (ret < 0) /* error */ goto error; else if (ret == 1) /* locally consumed */ continue; so I'm not sure I understand what you mean. Clearly ret == 1 is special (any other positive return value is treated like 0). > Do you want me to resubmit it that way, or do you want to make the > change? I can fix it up locally but you are in charge of making sure that OFED 1.3.1 gets what you want it to. From hrosenstock at xsigo.com Tue May 20 15:34:03 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 15:34:03 -0700 Subject: [ofa-general] PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: References: Message-ID: <1211322844.18236.44.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 15:08 -0700, Dave Olson wrote: > Ralph Campbell will submit this patch for ofed 1.3.1, also. > > IB/MAD - fix crash when HCA returns IB_MAD_RESULT_SUCCESS|IB_MAD_RESULT_CONSUMED > > This was observed with the hw/ipath driver, but could happen with any > driver. Does this also occur with mthca/mlx4 ? Was the same thing that caused this on ipath tried with either of these HCAs ? > It's OFED bug 1027. What's the port disable command which causes this crash ? -- Hal > The fix is to kfree the local data and break, rather than falling through.
> > Signed-off-by: Dave Olson > > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -747,7 +747,8 @@ static int handle_outgoing_dr_smp(struct > ib_mad_agent_private *mad_agent_priv, > break; > case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: > kmem_cache_free(ib_mad_cache, mad_priv); > - break; > + kfree(local); > + goto out; > case IB_MAD_RESULT_SUCCESS: > /* Treat like an incoming receive MAD */ > port_priv = ib_get_mad_port(mad_agent_priv->agent.device, > > Dave Olson > dave.olson at qlogic.com From rdreier at cisco.com Tue May 20 15:37:37 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 20 May 2008 15:37:37 -0700 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: <000101c8bac9$49c529b0$0d59180a@amr.corp.intel.com> (Sean Hefty's message of "Tue, 20 May 2008 15:31:46 -0700") References: <1211322487.3949.275.camel@brick.pathscale.com> <000101c8bac9$49c529b0$0d59180a@amr.corp.intel.com> Message-ID: > >I also agree but I remember at one point we got pushback from > >one of the mainline kernel developers who really wanted to see > >only one return point in the code even if it meant more gotos. > >I don't remember who though. > > I think the coding style document calls out using a single return point, but I > don't think that's always the cleanest approach either. CodingStyle suggests using goto to avoid duplicating cleanup code at every return. I don't think anyone would argue in favor of the style we're talking about here of using goto to jump to a plain return statement. It doesn't help avoid bugs caused by missing cleanup, and it actually *causes* bugs like the one here where it becomes easy to forget what the function is returning. - R. From dave.olson at qlogic.com Tue May 20 15:41:43 2008 From: dave.olson at qlogic.com (Dave Olson) Date: Tue, 20 May 2008 15:41:43 -0700 (PDT) Subject: [ofa-general] PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: <1211322844.18236.44.camel@hrosenstock-ws.xsigo.com> References: <1211322844.18236.44.camel@hrosenstock-ws.xsigo.com> Message-ID: On Tue, 20 May 2008, Hal Rosenstock wrote: | On Tue, 2008-05-20 at 15:08 -0700, Dave Olson wrote: | > Ralph Campbell will submit this patch for ofed 1.3.1, also. | > | > IB/MAD - fix crash when HCA returns IB_MAD_RESULT_SUCCESS|IB_MAD_RESULT_CONSUMED | > | > This was observed with the hw/ipath driver, but could happen with any | > driver. | | Does this also occur with mthca/mlx4 ? Was the same thing that caused | this on ipath tried with either of these HCAs ? No, it doesn't happen on mthca/mlx4, but presumably those drivers never return this value. Yes, we tried it with them. No problem is introduced on those by the change, either. | > It's OFED bug 1027. | | What's the port disable command which causes this crash ? It was with the QLogic QuickSilver iba_portdisable and iba_portenable commands (they've been ported to work with OFED 1.3). ibportstate refuses to work on non-switch nodes, although in the past we've done local modifications to allow it to work on HCAs as well (we never submitted that change, because we assumed somebody explicitly didn't want HCAs targeted).
Dave Olson dave.olson at qlogic.com From dave.olson at qlogic.com Tue May 20 15:44:47 2008 From: dave.olson at qlogic.com (Dave Olson) Date: Tue, 20 May 2008 15:44:47 -0700 (PDT) Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: References: Message-ID: On Tue, 20 May 2008, Roland Dreier wrote: | > Yes, you are right about the fact that it should be set, but apparently | > all callers are simply checking for a return value > 0, because the | > packet is only sent once (return values > 1 have no defined meaning so | > I'm not surprised the callers just check > 0). | | In my tree (ie the upstream kernel) I see only one place | handle_outgoing_dr_smp() is called, and it looks like: | | ret = handle_outgoing_dr_smp(mad_agent_priv, | mad_send_wr); | if (ret < 0) /* error */ | goto error; | else if (ret == 1) /* locally consumed */ | continue; | | so I'm not sure I understand what you mean. Clearly ret == 1 is special | (any other positive return value is treated like 0). Indeed, I see that also, now that I look again more carefully. We definitely didn't see an infinite attempt to send the packet, so something else must have cleaned that up for us. Anyway, returning 1 is clearly the right answer. | > Do you want me to resubmit it that way, or do you want to make the | > change? | | I can fix it up locally but you are in charge of making sure that OFED | 1.3.1 gets what you want it to. Thanks, and yes, we'll do that for OFED 1.3.1 Dave Olson dave.olson at qlogic.com From nab at linux-iscsi.org Tue May 20 15:48:17 2008 From: nab at linux-iscsi.org (Nicholas A. Bellinger) Date: Tue, 20 May 2008 15:48:17 -0700 Subject: [ofa-general] LIO-Target Core v3.0.0 imported in k.o git Message-ID: <1211323697.14731.68.camel@haakon2.linux-iscsi.org> Greetings all, The LIO-Target Core v3.0.0 tree has been imported from v2.9-STABLE from the Linux-iSCSI.org source tree repositories into kernel.org git, and is building w/ v2.6.26-rc3. It can be found at: http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=summary I will be continuing the cleanup activities for upstream, which so far have included the removal of legacy unused engine level mirroring/replication bits (as we are using LIO-DRBD or LIO-NR1 w/ MD for this now), and a few LINUX_VERSION_CODE removals. [nab at hera lio-core]$ wc -l *.c *.h | tail -n 1 57879 total The goal now will be to separate out the LIO-Target / LIO-Core pieces for moving the latter upstream initially (eg a working passthrough), and then v3.0 LIO-Target using traditional iSCSI, then iWARP and iSER. This will all be going into the roadmap on Linux-iSCSI.org for reference. I invite interested parties to have a look, and please contact me on the LIO-Target devel list, or privately, if you would like to get involved looking at some code that is in line with your knowledge/interests/projects.
Many thanks for your most valuable of time, --nab From hrosenstock at xsigo.com Tue May 20 15:49:21 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 20 May 2008 15:49:21 -0700 Subject: [ofa-general] PATCH 1/1 - fix kernel crash in mad.c when IB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: References: <1211322844.18236.44.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211323761.18236.54.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-20 at 15:41 -0700, Dave Olson wrote: > On Tue, 20 May 2008, Hal Rosenstock wrote: > > | On Tue, 2008-05-20 at 15:08 -0700, Dave Olson wrote: > | > Ralph Campbell will submit this patch for ofed 1.3.1, also. > | > > | > IB/MAD - fix crash when HCA returns IB_MAD_RESULT_SUCCESS|IB_MAD_RESULT_CONSUMED > | > > | > This was observed with the hw/ipath driver, but could happen with any > | > driver. > | > | Does this also occur with mthca/mlx4 ? Was the same thing that caused > | this on ipath tried with either of these HCAs ? > > No, it doesn't happen on mthca/mlx4, but presumably those drivers never > return this value. Yes, we tried it with them. No problem is > introduced on those by the change, either. > > | > It's OFED bug 1027. > | > | What's the port disable command which causes this crash ? > > It was with the QLogic QuickSilver iba_portdisable > and iba_portenable commands (they've been ported to work with OFED 1.3). > > ibportstate refuses to work on non-switch nodes, Right. > although in the past > we've done local modifications to allow it to work on HCAs as well > (we never submitted that change, because we assumed somebody explicitly > didn't want HCAs targeted). Yes, it was a conscious choice discussed on the list. It's so people don't shoot themselves in the foot, as once a port is disabled it might not be so easy to enable it, depending on the configuration (out-of-band access might be needed to re-enable it). The only configurations where that would be limiting would be CA<->CA and CA<->router without intervening switches. -- Hal > Dave Olson > dave.olson at qlogic.com From Thomas.Talpey at netapp.com Tue May 20 19:18:20 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 20 May 2008 22:18:20 -0400 Subject: Selective signalling vs WQE size - was Re: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: <1211260229.6556.18.camel@eli-laptop> Message-ID: At 05:02 PM 5/20/2008, Roland Dreier wrote: > > > The idea is that for selective > > signaling QPs, before you stamp the WQE, you read the value of the DS > > field which denotes the effective size of the descriptor as used in the > > previous post, and stamp only that area, relying on the fact that the > > rest of the descriptor is already stamped. > >Does it make a measurable difference? If so then it seems like a good idea. I'll be happy to try it, but I bet it'll be hard to measure the difference with a storage workload. It sounds like a bit of a micro-optimization at the HCA interface, avoiding a few DMA cycles? I didn't see a URL so let me know if so. Tom.
From ogerlitz at voltaire.com Tue May 20 23:13:14 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 09:13:14 +0300 Subject: [ofa-general] Re: Current list of Linux maintainers and their email info In-Reply-To: References: <4831F965.6060607@opengridcomputing.com> <48326A44.2080606@voltaire.com> <48328489.2030305@mellanox.co.il> <4832954C.2080209@voltaire.com> Message-ID: <4833BD7A.5030303@voltaire.com> Woodruff, Robert J wrote: > Since this is a separate open source project in sourceforge, > and not developed in OFA/OFED, perhaps we do not need this > in our list of maintainers. > Woody, The bonding driver is a kernel module maintained by Jay V. The ib-bonding package provided both this module and some enhancements to network configuration tools to support bonding of ipoib devices. Or. From ogerlitz at voltaire.com Tue May 20 23:16:19 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 09:16:19 +0300 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: <000001c8baac$93e26c50$0d59180a@amr.corp.intel.com> References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> <48326942.7080800@voltaire.com> <000001c8baac$93e26c50$0d59180a@amr.corp.intel.com> Message-ID: <4833BE33.7000504@voltaire.com> Sean Hefty wrote: > OFED needs a separate list of maintainers Does this refer to the issue that the kernel IB maintainers can't be accountable for the OFED IB kernel code since it includes patches which were never reviewed nor merged upstream, as Roland noted in his comment (below)?! Or > > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Thursday, May 15, 2008 03:02 > To: Or Gerlitz > Cc: Sean Hefty; general at lists.openfabrics.org > Subject: Re: [ofa-general] Re: the so many IPoIB-UD failures > introduced by OFED 1.3 > > > Maybe its about time for the Linux IB maintainers to get a little > angry?! > > I'm not angry about it, although I have pretty much given up on trying > to debug IPoIB issues seen running anything other than an upstream > kernel. It seems like the OFED maintainers, the enterprise distros and > their customers should be more concerned about the failure of the OFED > process -- clearly producing something much buggier and less reliable > than the stock kernel is not what anyone wants. > > - R. From olaf.kirch at oracle.com Tue May 20 23:37:54 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Wed, 21 May 2008 08:37:54 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: References: <200805121157.38135.jon@opengridcomputing.com> <1211318918.3949.268.camel@brick.pathscale.com> Message-ID: <200805210837.55051.olaf.kirch@oracle.com> On Tuesday 20 May 2008 23:33:40 Roland Dreier wrote: > The IB spec doesn't allow changing RNR retry on RTS to RTS transitions. > Probably because synchronizing the change with in-flight send requests > (that might be doing RNR handling at that moment) is too much of a mess. I tried modifying the RNR retry count before transitioning to RTS (while the QP is still in RTR state), but that failed with EINVAL. Shouldn't it be possible to do that? Anyway, when I take the brief detour through SQD state, resetting the RNR retry count seems to work. 
Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From okir at lst.de Tue May 20 23:49:49 2008 From: okir at lst.de (Olaf Kirch) Date: Wed, 21 May 2008 08:49:49 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: <200805210837.55051.olaf.kirch@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805210837.55051.olaf.kirch@oracle.com> Message-ID: <200805210849.50582.okir@lst.de> On Wednesday 21 May 2008 08:37:54 Olaf Kirch wrote: > Anyway, when I take the brief detour through SQD state, resetting the RNR > retry count seems to work. For those willing to test this, I committed the following patch to branch future-20080519 on my git tree. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax ----- commit 735bdc95be33db4f21051c0d50090bc128719d98 Author: Olaf Kirch Date: Tue May 20 22:41:04 2008 -0700 RDS: disable RNR retries when flow control is on When flow control is enabled on a connection, we don't need RNR retries. Turning them off allows us to detect potential bugs in the credit accounting more quickly. Signed-off-by: Olaf Kirch diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 20c888d..a49e394 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -99,21 +99,40 @@ static void rds_ib_connect_complete(struct rds_connection *conn, struct rdma_cm_ rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 1); - /* Tune the RNR timeout. We use a rather low timeout, but - * not the absolute minimum - this should be tunable. + /* Tune RNR behavior. Without flow control, we use a rather + * low timeout, but not the absolute minimum - this should + * be tunable. * * We already set the RNR retry count to 7 (which is the - * smallest infinite number :-) above + * smallest infinite number :-) above. + * If flow control is off, we want to change this back to 0 + * so that we learn quickly when our credit accounting is + * buggy. */ - qp_attr.qp_state = IB_QPS_RTS; - qp_attr.min_rnr_timer = IB_RNR_TIMER_000_32; - ret = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, - IB_QP_STATE | IB_QP_MIN_RNR_TIMER); - if (ret) { - printk(KERN_NOTICE "ib_modify_qp(IB_QP_MIN_RNR_TIMER, %u): err=%d\n", - qp_attr.min_rnr_timer, -ret); + if (ic->i_flowctl) { + /* It seems we have to take a brief detour through SQD state + * in order to change the RNR retry count. 
*/ + qp_attr.qp_state = IB_QPS_SQD; + ret = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE); + if (ret) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, SQD): err=%d\n", -ret); + + qp_attr.rnr_retry = 0; + ret = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_RNR_RETRY); + if (ret) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_RNR_RETRY, 0): err=%d\n", -ret); + } else { + qp_attr.min_rnr_timer = IB_RNR_TIMER_000_32; + ret = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_MIN_RNR_TIMER); + if (ret) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_MIN_RNR_TIMER): err=%d\n", -ret); } + qp_attr.qp_state = IB_QPS_RTS; + ret = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE); + if (ret) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, RTS): err=%d\n", -ret); + /* update ib_device with this local ipaddr */ rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client); ib_update_ipaddr_for_device(rds_ibdev, conn->c_laddr); From eli at dev.mellanox.co.il Tue May 20 23:58:56 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 21 May 2008 09:58:56 +0300 Subject: Selective signalling vs WQE size - was Re: [ofa-general] mthca max_sge value... ugh. In-Reply-To: References: <1211260229.6556.18.camel@eli-laptop> Message-ID: <1211353136.31377.5.camel@mtls03> On Tue, 2008-05-20 at 22:18 -0400, Talpey, Thomas wrote: > > > >Does it make a measurable difference? If so then it seems like a good idea. It makes a noticeable difference when the required message rate is high. In those cases you spare the CPU from the need to write to memory, possibly saving cache misses. I saw differences for IPoIB in OFED where we use selective signalling for the UD QP. > > I'll be happy to try it, but I bet it'll be hard to measure the difference > with a storage workload. It sounds like a bit of a micro-optimization > at the HCA interface, avoiding a few DMA cycles? > > I didn't see a URL so let me know if so. > http://lists.openfabrics.org/pipermail/general/2008-January/045071.html From sean.hefty at intel.com Wed May 21 00:39:43 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 21 May 2008 00:39:43 -0700 Subject: [ofa-general] Current list of Linux maintainers and their email info In-Reply-To: <4833BE33.7000504@voltaire.com> References: <000201c8b9fd$3eff43c0$bd59180a@amr.corp.intel.com> <48326942.7080800@voltaire.com> <000001c8baac$93e26c50$0d59180a@amr.corp.intel.com> <4833BE33.7000504@voltaire.com> Message-ID: <000101c8bb15$d5fe0180$6f248686@amr.corp.intel.com> >> OFED needs a separate list of maintainers >Does this refer to the issue that the kernel IB maintainers can't be >accountable for the OFED IB kernel code since it includes patches which >were never reviewed nor merged upstream, as Roland noted in his comment >(below)?! I thought the userspace libraries differ as well.
From Sumit.Gaur at Sun.COM Wed May 21 00:33:00 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Wed, 21 May 2008 13:03:00 +0530 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <1211287585.12616.568.camel@hrosenstock-ws.xsigo.com> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> <4831B519.2060002@dev.mellanox.co.il> <1211217595.12616.490.camel@hrosenstock-ws.xsigo.com> <48326C32.7000303@Sun.COM> <1211287585.12616.568.camel@hrosenstock-ws.xsigo.com> Message-ID: <4833D02C.9040205@Sun.COM> Hi Hal/Yevgeny, It looks like I am confusing you rather than asking a clear-cut question, so I am giving you my implementation details in simple steps: 1) I am calling madrpc_init(ca, ca_port, mgmt_classes, 4); for the given ca and ca_port to register the following four classes with the OFED library {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS}; 2) After registration I am opening two separate independent threads, one for sending MADs and another for receiving them. 3) The sending thread sends MADs using umad_send(port_id, class_agents[mgtclass], &sndbuf, length, timeout, 0); 4) The receiver thread receives MADs using the mad_receive(0, -1); function. 5) I am sending SMP and GMP packets at a regular time interval and keep receiving responses on the receiver thread properly. But sometimes I receive some extra packets with *unknown tids* (TIDs I have never sent), e.g. Response TID2 = 0x000000006701869b, BaseVersion = 1, MgmtClass=129, ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=4352 (all decimal representation) Now the question is how I can filter these extra packets. These incoming packets could be responses the SM sends while sweeping the subnet (as pointed out by Yevgeny). Is there any unique MAD field that could be checked for an SM response? If they cannot be filtered, I will change the logic in the application. Thanks and Regards sumit Hal Rosenstock wrote: > On Tue, 2008-05-20 at 11:44 +0530, Sumit Gaur - Sun Microsystem wrote: > >>How we can identify and filter these incoming SM packets in application from >>the regular responses. > > > I'm surprised that it's working this way; that SM responses are getting > into your application as they _should_ have a different transaction ID > per the following. Yes, they have different TIDs. > >>From the kernel Documentation/infiniband/user_mad.txt: > > Transaction IDs > > Users of the umad devices can use the lower 32 bits of the > transaction ID field (that is, the least significant half of the > field in network byte order) in MADs being sent to match > request/response pairs. The upper 32 bits are reserved for use by > the kernel and will be overwritten before a MAD is sent. > > Is the same fd being used by OpenSM and your application somehow or you > are not using OpenSM and your SM overlaps with this ? I am not using OpenSM; I am directly calling the umad libraries.
> > -- Hal > From ogerlitz at voltaire.com Wed May 21 00:41:23 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 10:41:23 +0300 Subject: [ofa-general] RDS flow control In-Reply-To: References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> Message-ID: <4833D223.5090007@voltaire.com> Roland Dreier wrote: > > Is there a way of changing the RNR retry count back to 0 after establishing > > the connection? > > Yes... quite complicated but possible. Basically you have to transition > the QP to the "send queue drained" (SQD) state, change the rnr retry > value in an SQD->SQD transition and then transition back to RTS. In case the RTS->SQD->SQD->RTS transition is not applicable, or just for the sake of being aware of more solutions, I gave it some thought and it seems possible for you to build a protocol which exchanges (through the private data carried by the CM messages) whether each side supports credit management, and based on that && HW support of the IB_DEVICE_RC_RNR_NAK_GEN device capability decide what value to place into the QP RNR retries. On the passive side of the connection it's trivial, since the rdma-cm uses the values you place into the conn_param parameters of rdma_accept. On the active side, things are a bit more complex, but with some changes, I think you would be able to do it also in a different way than the SQD one: the RNR retries are set into the QP once it is moved to RTS (Ready-To-Send). So, if you managed to get the QP into your hands --before-- the RTU is sent (since this point in time is the last synchronization step provided to you by the IB CM), you could set the RNR retries value according to info carried in the REP message sent by the passive side (which it posted in the private data to rdma_accept, etc). This would be possible if you enhance the rdma-cm to deliver the RDMA_CM_EVENT_CONNECT_RESPONSE event also to IDs created with the PS_TCP port space (eg conditioned on some new field in conn_param), where today it is supported only for PS_SDP ones. Once this change is in place, you will get the RDMA_CM_EVENT_CONNECT_RESPONSE event, decide what RNR retry value you want to use, and call rdma_accept providing this value (one more little change is needed here in cma.c); the rdma cm would override the value set by cm_init_qp_rts_attr, see cma_modify_qp_rts -> rdma_init_qp_attr -> ib_cm_init_qp_attr -> cm_init_qp_rts_attr, and you are done... Or. From ogerlitz at voltaire.com Wed May 21 00:48:42 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 10:48:42 +0300 Subject: [ofa-general] RDS flow control In-Reply-To: <4833D223.5090007@voltaire.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <4833D223.5090007@voltaire.com> Message-ID: <4833D3DA.4040106@voltaire.com> Or Gerlitz wrote: > HW support of the IB_DEVICE_RC_RNR_NAK_GEN device capability I see now that only the mlx4, mthca and ehca drivers advertise this capability, but ipath doesn't. Ralph, was it just forgotten, or do you guys really not support this? Or.
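On the passive side the hook Or mentions already exists: struct rdma_conn_param carries an rnr_retry_count that rdma_accept() applies when moving the QP to RTS. A rough sketch, assuming the peer advertised its flow control capability in the REQ private data (struct my_priv_data and MY_FLAG_FLOWCTL are made up for illustration):

    static int my_accept(struct rdma_cm_id *id, struct my_priv_data *req_priv)
    {
            struct rdma_conn_param conn_param;

            memset(&conn_param, 0, sizeof conn_param);
            conn_param.responder_resources = 1;
            conn_param.initiator_depth = 1;
            /* no RNR retries if both sides do credit-based flow control,
             * otherwise keep the "smallest infinite number" */
            conn_param.rnr_retry_count =
                    (req_priv->flags & MY_FLAG_FLOWCTL) ? 0 : 7;

            return rdma_accept(id, &conn_param);
    }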
From ogerlitz at voltaire.com Wed May 21 02:08:05 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 12:08:05 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4832D4DC.2040006@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <48301F74.4020905@voltaire.com> <4831347A.1010506@voltaire.com> <4831A0DF.2070603@opengridcomputing.com> <48327198.7080305@voltaire.com> <4832D4DC.2040006@opengridcomputing.com> Message-ID: <4833E675.5050500@voltaire.com> Steve Wise wrote: > My point is that if you do the mapping at allocation time, then the > failure will happen when you allocate the page list vs when you post > the send WR. Maybe it doesn't matter, but the idea, I think, is to > not fail post_send for lack of resources. Everything should be > pre-allocated pretty much by the time you post work requests... Fair enough. I understand we are requiring that a page list can be reused without being freed; just make sure it's documented. Or. From ogerlitz at voltaire.com Wed May 21 02:24:59 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 12:24:59 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4832D850.2010102@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> Message-ID: <4833EA6B.9000705@voltaire.com> Steve Wise wrote: >>> Support for the IB BMME and iWARP equivalent memory extensions ... >>> Usage Model: >>> - MR allocated with ib_alloc_mr() >>> - Page lists allocated via ib_alloc_fast_reg_page_list(). >>> - MR made VALID & bound to a specific page list via >>> ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via >>> ib_post_send(IB_WR_INVALIDATE_MR) >> AFAIK, the idea was to let the ulp post --two-- work requests, where >> the first creates the mapping and the second sends this mapping to >> the remote side, such that the second does not start before the first >> completes (i.e a fence). >> >> Now, the above scheme means that the ulp knows the value of the >> rkey/stag at the time of posting these two work requests (since it >> has to encode it in the second one), so something has to be clarified >> re the rkey/stag here, do they change each time this MR is used? how >> many bits can be changed, etc. > > The ULP knows the rkey/stag because it's returned up front in the > ib_alloc_fast_reg_mr(). And it doesn't change (ignoring the key issue > which we haven't exposed yet to the ULP). The same rkey/stag can be > used for multiple mappings. It can be made invalid at any point in > time via the IB_WR_INVALIDATE_MR so the fact that you're leaving the > same rkey/stag advertised is not a risk. I understand that this (the same rkey/stag used for all mappings produced for a specific mr) is what you are proposing, but I still think there's a chance that by the spec and (not less important!) by existing HW support, it's possible to have a different rkey/stag per mapping done on an mr. For example, the IB spec uses a "consumer owned key portion of the L_Key" notation which makes me think there should be a way to have a different rkey per mapping, Roland? Dror?
> 10.7.2.6 FAST REGISTER PHYSICAL MR > The Fast Register Physical MR Operation is allowed on Non-Shared > Physical Memory Regions that were created with a Consumer owned key > portion of the L_Key, and any associated R_Key Or From ogerlitz at voltaire.com Wed May 21 02:33:15 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 12:33:15 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4832D850.2010102@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> Message-ID: <4833EC5B.8070504@voltaire.com> Steve Wise wrote: > So you allocate the rkey/stag up front, allocate page_lists up front, > then as needed you populate your page list and bind it to the > rkey/stag via IB_WR_FAST_REG_MR, and invalidate that mapping via > IB_WR_INVALIDATE_MR. You can do this any number of times, and with > proper fencing, you can pipeline these mappings. Eventually when > you're done doing IO (like for NFSRDMA when the mount is unmounted) > you free up the page list(s) and mr/rkey/stag. Yes, that was my thought as well. Just to make sure: by "proper fencing", is your understanding that for both IB and iWARP the ULP need not wait for the fast-reg work request to complete, but can instead post the send work request carrying the rkey/stag with the IB_SEND_FENCE flag? Looking in the IB spec, it seems that the fence indicator only applies to previous RDMA Read / Atomic operations; eg in section 11.4.1.1 POST SEND REQUEST it says: > Fence indicator. If the fence indicator is set, then all prior RDMA > Read and Atomic Work Requests on the queue must be completed before > starting to process this Work Request. >> Talking on usage, do you plan to patch the mainline nfs-rdma code to >> use these verbs? > Yes. Tom Tucker will be doing this. Jon Mason is implementing RDS > changes to utilize this too. The hope is all this makes 2.6.27/ofed-1.4. > > I can also post test code (krping module) if anyone is interested. > I'm developing that now. > Posting this code would be very much helpful (also to the discussion, I think), thanks. Or.
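To make the usage model concrete, a rough sketch of the pipeline Steve describes. The verb names (ib_alloc_mr, ib_alloc_fast_reg_page_list, IB_WR_FAST_REG_MR, IB_WR_INVALIDATE_MR) are those of this RFC and the wr.fast_reg field layout is assumed from the patch set; pd, qp, npages, io_addr and io_len are placeholders, so details may well differ from the final version:

    /* Setup, once per mount/connection: fixed rkey plus a reusable page list. */
    struct ib_mr *mr = ib_alloc_mr(pd, MAX_FR_PAGES);
    struct ib_fast_reg_page_list *pl =
            ib_alloc_fast_reg_page_list(pd->device, MAX_FR_PAGES);

    /* Per I/O: bind the current buffer to the (unchanged) rkey/stag. */
    struct ib_send_wr frwr, *bad_wr;

    memset(&frwr, 0, sizeof frwr);
    frwr.opcode = IB_WR_FAST_REG_MR;
    frwr.wr.fast_reg.page_list = pl;        /* pl->page_list[] filled with DMA addresses */
    frwr.wr.fast_reg.page_list_len = npages;
    frwr.wr.fast_reg.iova_start = io_addr;
    frwr.wr.fast_reg.length = io_len;
    frwr.wr.fast_reg.access_flags = IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;
    if (ib_post_send(qp, &frwr, &bad_wr))
            goto fail;

    /* ... advertise mr->rkey to the peer and perform the I/O; whether the
     * send advertising the rkey needs IB_SEND_FENCE is exactly the open
     * question raised above ... */

    /* When the I/O completes, invalidate the mapping so the rkey can be reused. */
    struct ib_send_wr inv;
    memset(&inv, 0, sizeof inv);
    inv.opcode = IB_WR_INVALIDATE_MR;
    /* how the MR to invalidate is identified is per the RFC, omitted here */
    if (ib_post_send(qp, &inv, &bad_wr))
            goto fail;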
From vlad at lists.openfabrics.org Wed May 21 03:10:51 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 21 May 2008 03:10:51 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080521-0200 daily build status Message-ID: <20080521101051.8307EE60E41@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From tziporet at dev.mellanox.co.il Wed May 21 03:31:55 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 21 May 2008 13:31:55 +0300 Subject: [ofa-general] RDS flow control In-Reply-To: References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> Message-ID: <4833FA1B.4010707@mellanox.co.il> Roland Dreier wrote: > > Is there a way of changing the RNR retry count back to 0 after establishing > > the connection? > > Yes... quite complicated but possible. 
Basically you have to transition > the QP to the "send queue drained" (SQD) state, change the rnr retry > value in an SQD->SQD transition and then transition back to RTS. Not > sure if anyone has ever tested that whole operation, so it may or may > not actually work without driver/fw fixes required. > > > SQD is not implemented in ConnectX for now. Tziporet From hrosenstock at xsigo.com Wed May 21 04:29:22 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 21 May 2008 04:29:22 -0700 Subject: [ofa-general] Receiving Unknown packets at regular interval In-Reply-To: <4833D02C.9040205@Sun.COM> References: <20080513185404.3D00BE60C16@openfabrics.org> <48314E74.9010107@Sun.COM> <1211197643.12616.408.camel@hrosenstock-ws.xsigo.com> <4831696D.6060409@Sun.COM> <1211203514.12616.430.camel@hrosenstock-ws.xsigo.com> <48318575.7060701@Sun.COM> <1211208550.12616.446.camel@hrosenstock-ws.xsigo.com> <4831A618.9090806@Sun.COM> <1211213870.12616.475.camel@hrosenstock-ws.xsigo.com> <4831B519.2060002@dev.mellanox.co.il> <1211217595.12616.490.camel@hrosenstock-ws.xsigo.com> <48326C32.7000303@Sun.COM> <1211287585.12616.568.camel@hrosenstock-ws.xsigo.com> <4833D02C.9040205@Sun.COM> Message-ID: <1211369362.18236.78.camel@hrosenstock-ws.xsigo.com> Sumit, On Wed, 2008-05-21 at 13:03 +0530, Sumit Gaur - Sun Microsystem wrote: > Hi Hal/Yevgeny, > It looks like I am confusing you rather than asking a clear-cut question, so I am > giving you my implementation details in simple steps: > > 1) I am calling madrpc_init(ca, ca_port, mgmt_classes, 4); for the given ca and > ca_port to register the following four classes with the OFED library > > {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS}; > > 2) After registration I am opening two separate independent threads, one for > sending MADs and another for receiving them. > > 3) The sending thread sends MADs using > > umad_send(port_id, class_agents[mgtclass], &sndbuf, length, timeout, 0); > > 4) The receiver thread receives MADs using the mad_receive(0, -1); function. > > 5) I am sending SMP and GMP packets at a regular time interval and keep receiving > responses on the receiver thread properly. But sometimes I receive some extra > packets with *unknown tids* (TIDs I have never sent), e.g. > > Response TID2 = 0x000000006701869b, BaseVersion = 1, MgmtClass=129, > ClassVersion=1, R_Method=129, ClassSpecific=1, Status=128, AttributeID=4352 > (all decimal representation) > > Now the question is how I can filter these extra packets. These incoming > packets could be responses the SM sends while sweeping the subnet (as pointed out by > Yevgeny). Is there any unique MAD field that could be checked for an SM response? > If they cannot be filtered, I will change the logic in the application. Not knowing what SMPs you are sending in your application, it's hard to be more specific. You can filter on class and attribute ID (assuming the attribute IDs per class are distinct). Another approach would be to filter on transaction ID, as the upper 32 bits should be different (on a per underlying fd basis). This approach is simpler and does not rely on non-overlapping subsets of attributes. I was also trying to say that I'm not sure you should be seeing these packets (and I don't think your application should need to do this). The current filtering appears to not be working for some unknown reason in your environment. Hopefully this makes more sense. Sorry for all the confusion.
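A sketch of the transaction ID filter suggested above, assuming the application records the 32-bit cookies it placed in the TIDs of its outstanding requests in a (hypothetical) sent_tids table; per the user_mad.txt text quoted earlier, the sender owns only the lower 32 bits of the TID, i.e. bytes 12-15 of the MAD header in network byte order:

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    #define MAX_OUTSTANDING 64

    static uint32_t sent_tids[MAX_OUTSTANDING]; /* cookies of requests in flight */

    /* Return 1 if this received MAD answers one of our requests. */
    static int is_our_response(const uint8_t *mad)
    {
            uint32_t lo;
            int i;

            /* The TID occupies bytes 8..15 of the MAD header; the kernel
             * overwrites the upper 32 bits, the application owns the rest. */
            memcpy(&lo, mad + 12, sizeof lo);
            lo = ntohl(lo);

            for (i = 0; i < MAX_OUTSTANDING; i++)
                    if (sent_tids[i] == lo)
                            return 1;
            return 0; /* e.g. a response provoked by an SM sweep - drop it */
    }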
-- Hal > Thanks and Regards > sumit > > > > Hal Rosenstock wrote: > > On Tue, 2008-05-20 at 11:44 +0530, Sumit Gaur - Sun Microsystem wrote: > > > >>How we can identify and filter these incoming SM packets in application from > >>the regular responses. > > > > > > I'm surprised that it's working this way; that SM responses are getting > > into your application as they _should_ have a different transaction ID > > per the following. > yes they have different TID. > > > >>From the kernel Documentation/infiniband/user_mad.txt: > > > > Transaction IDs > > > > Users of the umad devices can use the lower 32 bits of the > > transaction ID field (that is, the least significant half of the > > field in network byte order) in MADs being sent to match > > request/response pairs. The upper 32 bits are reserved for use by > > the kernel and will be overwritten before a MAD is sent. > > > > Is the same fd being used by OpenSM and your application somehow or you > > are not using OpenSM and your SM overlaps with this ? > I am not using OpenSM, I am directing calling umad libraries. > > > > -- Hal > > From tziporet at mellanox.co.il Wed May 21 04:30:28 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 21 May 2008 14:30:28 +0300 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c whenIB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: <1211322487.3949.275.camel@brick.pathscale.com> Message-ID: <6C2C79E72C305246B504CBA17B5500C90411217E@mtlexch01.mtl.com> Ralph, Can you provide the patch to OFED 1.3.1 today so we will be able to include it in RC1? Thanks Tziporet -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Ralph Campbell Sent: Wednesday, May 21, 2008 1:28 AM To: Roland Dreier Cc: Dave Olson; general at lists.openfabrics.org Subject: Re: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c whenIB_MAD_RESULT_(SUCCESS|CONSUMED) returned On Tue, 2008-05-20 at 15:17 -0700, Roland Dreier wrote: > > case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: > > kmem_cache_free(ib_mad_cache, mad_priv); > > - break; > > + kfree(local); > > + goto out; > > Seems you need to set ret = 1 here? Otherwise I think ib_post_send_mad > will continue handling the send even though the packet was supposedly > consumed. I agree. > Also as a side note, I think handle_outgoing_dr_smp() would be clearer > if rather than having > > out: > return ret; > > and then doing stuff like > > ret = -EINVAL; > goto out; > > the code just did "return -EINVAL;" > > Maybe I'll do that cleanup for 2.6.27. I also agree but I remember at one point we got pushback from one of the mainline kernel developers who really wanted to see only one return point in the code even if it meant more gotos. I don't remember who though. 
From ogerlitz at voltaire.com Wed May 21 05:43:11 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 15:43:11 +0300 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <000101c8ba9b$4ca68da0$a3d8180a@amr.corp.intel.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> <482C68A4.9020305@opengridcomputing.com> <4831649A.2020206@voltaire.com> <000101c8b9f6$c4dd4cf0$bd59180a@amr.corp.intel.com> <4832BFAC.2050506@voltaire.com> <000101c8ba9b$4ca68da0$a3d8180a@amr.corp.intel.com> Message-ID: <483418DF.1080707@voltaire.com> Sean Hefty wrote: > I was only thinking of the kernel interfaces, but I don't see that this really > changes the ABI. An existing library continues to work unmodified. (Is this > that different than adding a new return value from a call?) If there really is > an issue, then the rdma_ucm can toss the event. Yes, I agree that the ABI shouldn't be changed on every new return code or event added, so we can deliver the new event and existing apps should ignore it (and if a real issue is found, we can block it at the rdma_ucm). Or From ossrosch at linux.vnet.ibm.com Wed May 21 05:58:55 2008 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Wed, 21 May 2008 14:58:55 +0200 Subject: [ofa-general] [PATCH ofed-1.3.1 0/2] IB/ehca: Misc fixes Message-ID: <200805211458.56384.ossrosch@linux.vnet.ibm.com> Hi Vlad! I'm sending you a patch set for ehca to be included in ofed-1.3.1. These patches are based on OFED-1.3.1-rc1 and are already included in the kernel mainline. 1/2 IB/ehca: Fix function return types 2/2 IB/ehca: Wait for async events to finish before destroying QP They should apply cleanly against Vlad's git tree. Please accept them if they are ok.
Thanks Stefan From ossrosch at linux.vnet.ibm.com Wed May 21 05:59:18 2008 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Wed, 21 May 2008 14:59:18 +0200 Subject: [ofa-general] [PATCH ofed-1.3.1 1/2] IB/ehca: Fix function return types Message-ID: <200805211459.19594.ossrosch@linux.vnet.ibm.com> Signed-off-by: Stefan Roscher --- ehca_0041_Fix_wrong_return_types.patch | 38 +++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff -Nurp ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch --- ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch 1970-01-01 01:00:00.000000000 +0100 +++ ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch 2008-05-21 14:39:20.000000000 +0200 @@ -0,0 +1,38 @@ +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c +--- a/drivers/infiniband/hw/ehca/ehca_hca.c 2008-05-21 13:54:31.000000000 +0200 ++++ b/drivers/infiniband/hw/ehca/ehca_hca.c 2008-05-21 14:35:25.000000000 +0200 +@@ -101,7 +101,6 @@ int ehca_query_device(struct ib_device * + props->max_ee = limit_uint(rblock->max_rd_ee_context); + props->max_rdd = limit_uint(rblock->max_rd_domain); + props->max_fmr = limit_uint(rblock->max_mr); +- props->local_ca_ack_delay = limit_uint(rblock->local_ca_ack_delay); + props->max_qp_rd_atom = limit_uint(rblock->max_rr_qp); + props->max_ee_rd_atom = limit_uint(rblock->max_rr_ee_context); + props->max_res_rd_atom = limit_uint(rblock->max_rr_hca); +@@ -115,7 +114,7 @@ int ehca_query_device(struct ib_device * + } + + props->max_pkeys = 16; +- props->local_ca_ack_delay = limit_uint(rblock->local_ca_ack_delay); ++ props->local_ca_ack_delay = min_t(u8, rblock->local_ca_ack_delay, 255); + props->max_raw_ipv6_qp = limit_uint(rblock->max_raw_ipv6_qp); + props->max_raw_ethy_qp = limit_uint(rblock->max_raw_ethy_qp); + props->max_mcast_grp = limit_uint(rblock->max_mcast_grp); +@@ -136,7 +135,7 @@ query_device1: + return ret; + } + +-static int map_mtu(struct ehca_shca *shca, u32 fw_mtu) ++static enum ib_mtu map_mtu(struct ehca_shca *shca, u32 fw_mtu) + { + switch (fw_mtu) { + case 0x1: +@@ -156,7 +155,7 @@ static int map_mtu(struct ehca_shca *shc + } + } + +-static int map_number_of_vls(struct ehca_shca *shca, u32 vl_cap) ++static u8 map_number_of_vls(struct ehca_shca *shca, u32 vl_cap) + { + switch (vl_cap) { + case 0x1: From eli at mellanox.co.il Wed May 21 05:59:29 2008 From: eli at mellanox.co.il (Eli Cohen) Date: Wed, 21 May 2008 15:59:29 +0300 Subject: [ofa-general] [PATCH] IB/mlx4: Optimize stamping for selective signalling QPs Message-ID: <1211374769.6577.21.camel@eli-laptop> >From e6b956c2233669fc21ba1565fbcf78ce2cd186b7 Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Wed, 21 May 2008 15:55:51 +0300 Subject: [PATCH] IB/mlx4: Optimize stamping for selective signalling QPs The idea is that for selective signaling QPs, before stamping the WQE, you read the value of the DS field which denotes the effective size of the descriptor as used in the previous post, and stamp only that area, relying on the fact that the rest of the descriptor is already stamped. Signed-off-by: Eli Cohen --- This version cleanly applies on the head of the "for-2.6.26" branch. 
drivers/infiniband/hw/mlx4/qp.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index cec030e..b4d25c2 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -129,9 +129,10 @@ static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) int ind; void *buf; __be32 stamp; + struct mlx4_wqe_ctrl_seg *ctrl; - s = roundup(size, 1U << qp->sq.wqe_shift); if (qp->sq_max_wqes_per_wr > 1) { + s = roundup(size, 1U << qp->sq.wqe_shift); for (i = 0; i < s; i += 64) { ind = (i >> qp->sq.wqe_shift) + n; stamp = ind & qp->sq.wqe_cnt ? cpu_to_be32(0x7fffffff) : @@ -141,7 +142,8 @@ static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size) *wqe = stamp; } } else { - buf = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + ctrl = buf = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1)); + s = (ctrl->fence_size & 0x3f) << 4; for (i = 64; i < s; i += 64) { wqe = buf + i; *wqe = cpu_to_be32(0xffffffff); -- 1.5.5.1 From ossrosch at linux.vnet.ibm.com Wed May 21 05:59:53 2008 From: ossrosch at linux.vnet.ibm.com (Stefan Roscher) Date: Wed, 21 May 2008 14:59:53 +0200 Subject: [ofa-general] [PATCH ofed-1.3.1 2/2] IB/ehca: Wait for async events to finish before destroying QP Message-ID: <200805211459.55844.ossrosch@linux.vnet.ibm.com> Signed-off-by: Stefan Roscher --- ehca_0042_Count_async_events_for_EQs.patch | 55 +++++++++++++++++++++++++++++ 1 file changed, 55 insertions(+) diff -Nurp ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch --- ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch 1970-01-01 01:00:00.000000000 +0100 +++ ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch 2008-05-21 14:41:20.000000000 +0200 @@ -0,0 +1,55 @@ +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h +--- a/drivers/infiniband/hw/ehca/ehca_classes.h 2008-05-21 13:54:31.000000000 +0200 ++++ b/drivers/infiniband/hw/ehca/ehca_classes.h 2008-05-21 14:35:25.000000000 +0200 +@@ -192,6 +192,8 @@ struct ehca_qp { + int mtu_shift; + u32 message_count; + u32 packet_count; ++ atomic_t nr_events; /* events seen */ ++ wait_queue_head_t wait_completion; + }; + + #define IS_SRQ(qp) (qp->ext_type == EQPT_SRQ) +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c +--- a/drivers/infiniband/hw/ehca/ehca_irq.c 2008-05-21 13:54:31.000000000 +0200 ++++ b/drivers/infiniband/hw/ehca/ehca_irq.c 2008-05-21 14:35:25.000000000 +0200 +@@ -204,6 +204,8 @@ static void qp_event_callback(struct ehc + + read_lock(&ehca_qp_idr_lock); + qp = idr_find(&ehca_qp_idr, token); ++ if (qp) ++ atomic_inc(&qp->nr_events); + read_unlock(&ehca_qp_idr_lock); + + if (!qp) +@@ -223,6 +225,8 @@ static void qp_event_callback(struct ehc + if (fatal && qp->ext_type == EQPT_SRQBASE) + dispatch_qp_event(shca, qp, IB_EVENT_QP_LAST_WQE_REACHED); + ++ if (atomic_dec_and_test(&qp->nr_events)) ++ wake_up(&qp->wait_completion); + return; + } + +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c +--- a/drivers/infiniband/hw/ehca/ehca_qp.c 2008-05-21 13:54:31.000000000 +0200 ++++ b/drivers/infiniband/hw/ehca/ehca_qp.c 2008-05-21 14:35:25.000000000 +0200 +@@ -561,6 +566,8 @@ static struct ehca_qp *internal_create_q + return ERR_PTR(-ENOMEM); + } + ++ atomic_set(&my_qp->nr_events, 
0); ++ init_waitqueue_head(&my_qp->wait_completion); + spin_lock_init(&my_qp->spinlock_s); + spin_lock_init(&my_qp->spinlock_r); + my_qp->qp_type = qp_type; +@@ -1929,6 +1936,9 @@ static int internal_destroy_qp(struct ib + idr_remove(&ehca_qp_idr, my_qp->token); + write_unlock_irqrestore(&ehca_qp_idr_lock, flags); + ++ /* now wait until all pending events have completed */ ++ wait_event(my_qp->wait_completion, !atomic_read(&my_qp->nr_events)); ++ + h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); + if (h_ret != H_SUCCESS) { + ehca_err(dev, "hipz_h_destroy_qp() failed h_ret=%li " From richard.frank at oracle.com Wed May 21 06:21:05 2008 From: richard.frank at oracle.com (Richard Frank) Date: Wed, 21 May 2008 09:21:05 -0400 Subject: [ofa-general] RDS flow control In-Reply-To: <200805202313.40213.olaf.kirch@oracle.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> Message-ID: <483421C1.2090000@oracle.com> From Oracle's perspective I think you can punt on the cluster rolling upgrade aspect of this. It's likely that Oracle will be running a newer version of RDS with flow control - or all nodes will be on an older version. As long as different versions of drivers interact without crashing the node - and report a mismatch in protocol - we're probably OK. Assuming the flow control turns out not to impact performance, we should just remove the old RNR code. If we're ready, I'll give this version of the driver to our performance folks - hopefully they can give it a bash in the next week or so. Olaf Kirch wrote: > On Tuesday 20 May 2008 22:45:22 Jon Mason wrote: > >> Works well on my setup. >> > > Good to hear! > > >> With proper flow control, there should no longer be a need for rnr_retry (as there >> should always be a posted recv buffer waiting for the incoming data). I did a >> quick test and removed it and everything seemed to be happy on my rds-stress run. >> > > I would like to make the setting of the RNR retry/timeout conditional on > whether both ends of the connection support flow control or not - we need > to think of rolling upgrades of a cluster, so mixed environments just have > to work. Unfortunately, the RNR retry count is set prior to establishing > the connection, before we even know whether the remote is capable of doing > flow control. > > Is there a way of changing the RNR retry count back to 0 after establishing > the connection?
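For reference, the spec-defined way to change the RNR retry count after the fact is a round trip through the SQD state, since IB_QP_RNR_RETRY is only accepted on the RTR->RTS transition or around SQD - which is also what the ConnectX/SQD exchange later in this thread is about. A rough sketch, with error handling and the wait for the SQ-drained event omitted (exact attribute masks follow the spec's SQD transition tables):

	struct ib_qp_attr attr;

	/* 1. Drain the send queue; wait for IB_EVENT_SQ_DRAINED. */
	attr.qp_state = IB_QPS_SQD;
	ib_modify_qp(qp, &attr, IB_QP_STATE);

	/* 2. Rewrite the RNR retry count while sitting in SQD. */
	attr.qp_state  = IB_QPS_SQD;
	attr.rnr_retry = 0;			/* stop retrying on RNR NAK */
	ib_modify_qp(qp, &attr, IB_QP_STATE | IB_QP_RNR_RETRY);

	/* 3. Resume sending. */
	attr.qp_state = IB_QPS_RTS;
	ib_modify_qp(qp, &attr, IB_QP_STATE);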
> > Olaf > From vlad at dev.mellanox.co.il Wed May 21 06:28:18 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 21 May 2008 16:28:18 +0300 Subject: [ofa-general] Re: [ewg] [PATCH ofed-1.3.1 1/2] IB/ehca: Fix function return types In-Reply-To: <200805211459.19594.ossrosch@linux.vnet.ibm.com> References: <200805211459.19594.ossrosch@linux.vnet.ibm.com> Message-ID: <48342372.4000109@dev.mellanox.co.il> Stefan Roscher wrote: > Signed-off-by: Stefan Roscher > --- > ehca_0041_Fix_wrong_return_types.patch | 38 +++++++++++++++++++++++++++++++++ > 1 file changed, 38 insertions(+) > > diff -Nurp ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch > --- ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch 1970-01-01 01:00:00.000000000 +0100 > +++ ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0041_Fix_wrong_return_types.patch 2008-05-21 14:39:20.000000000 +0200 > @@ -0,0 +1,38 @@ > +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c > +--- a/drivers/infiniband/hw/ehca/ehca_hca.c 2008-05-21 13:54:31.000000000 +0200 > ++++ b/drivers/infiniband/hw/ehca/ehca_hca.c 2008-05-21 14:35:25.000000000 +0200 > +@@ -101,7 +101,6 @@ int ehca_query_device(struct ib_device * > + props->max_ee = limit_uint(rblock->max_rd_ee_context); > + props->max_rdd = limit_uint(rblock->max_rd_domain); > + props->max_fmr = limit_uint(rblock->max_mr); > +- props->local_ca_ack_delay = limit_uint(rblock->local_ca_ack_delay); > + props->max_qp_rd_atom = limit_uint(rblock->max_rr_qp); > + props->max_ee_rd_atom = limit_uint(rblock->max_rr_ee_context); > + props->max_res_rd_atom = limit_uint(rblock->max_rr_hca); > +@@ -115,7 +114,7 @@ int ehca_query_device(struct ib_device * > + } > + > + props->max_pkeys = 16; > +- props->local_ca_ack_delay = limit_uint(rblock->local_ca_ack_delay); > ++ props->local_ca_ack_delay = min_t(u8, rblock->local_ca_ack_delay, 255); > + props->max_raw_ipv6_qp = limit_uint(rblock->max_raw_ipv6_qp); > + props->max_raw_ethy_qp = limit_uint(rblock->max_raw_ethy_qp); > + props->max_mcast_grp = limit_uint(rblock->max_mcast_grp); > +@@ -136,7 +135,7 @@ query_device1: > + return ret; > + } > + > +-static int map_mtu(struct ehca_shca *shca, u32 fw_mtu) > ++static enum ib_mtu map_mtu(struct ehca_shca *shca, u32 fw_mtu) > + { > + switch (fw_mtu) { > + case 0x1: > +@@ -156,7 +155,7 @@ static int map_mtu(struct ehca_shca *shc > + } > + } > + > +-static int map_number_of_vls(struct ehca_shca *shca, u32 vl_cap) > ++static u8 map_number_of_vls(struct ehca_shca *shca, u32 vl_cap) > + { > + switch (vl_cap) { > + case 0x1: Applied, Regards, Vladimir From vlad at dev.mellanox.co.il Wed May 21 06:28:36 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 21 May 2008 16:28:36 +0300 Subject: [ofa-general] [PATCH ofed-1.3.1 2/2] IB/ehca: Wait for async events to finish before destroying QP In-Reply-To: <200805211459.55844.ossrosch@linux.vnet.ibm.com> References: <200805211459.55844.ossrosch@linux.vnet.ibm.com> Message-ID: <48342384.4080509@dev.mellanox.co.il> Stefan Roscher wrote: > Signed-off-by: Stefan Roscher > --- > ehca_0042_Count_async_events_for_EQs.patch | 55 +++++++++++++++++++++++++++++ > 1 file changed, 55 insertions(+) > > diff -Nurp ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch > --- 
ofa_kernel-1.3_old/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch 1970-01-01 01:00:00.000000000 +0100 > +++ ofa_kernel-1.3_new/kernel_patches/fixes/ehca_0042_Count_async_events_for_EQs.patch 2008-05-21 14:41:20.000000000 +0200 > @@ -0,0 +1,55 @@ > +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_classes.h b/drivers/infiniband/hw/ehca/ehca_classes.h > +--- a/drivers/infiniband/hw/ehca/ehca_classes.h 2008-05-21 13:54:31.000000000 +0200 > ++++ b/drivers/infiniband/hw/ehca/ehca_classes.h 2008-05-21 14:35:25.000000000 +0200 > +@@ -192,6 +192,8 @@ struct ehca_qp { > + int mtu_shift; > + u32 message_count; > + u32 packet_count; > ++ atomic_t nr_events; /* events seen */ > ++ wait_queue_head_t wait_completion; > + }; > + > + #define IS_SRQ(qp) (qp->ext_type == EQPT_SRQ) > +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c > +--- a/drivers/infiniband/hw/ehca/ehca_irq.c 2008-05-21 13:54:31.000000000 +0200 > ++++ b/drivers/infiniband/hw/ehca/ehca_irq.c 2008-05-21 14:35:25.000000000 +0200 > +@@ -204,6 +204,8 @@ static void qp_event_callback(struct ehc > + > + read_lock(&ehca_qp_idr_lock); > + qp = idr_find(&ehca_qp_idr, token); > ++ if (qp) > ++ atomic_inc(&qp->nr_events); > + read_unlock(&ehca_qp_idr_lock); > + > + if (!qp) > +@@ -223,6 +225,8 @@ static void qp_event_callback(struct ehc > + if (fatal && qp->ext_type == EQPT_SRQBASE) > + dispatch_qp_event(shca, qp, IB_EVENT_QP_LAST_WQE_REACHED); > + > ++ if (atomic_dec_and_test(&qp->nr_events)) > ++ wake_up(&qp->wait_completion); > + return; > + } > + > +diff -Nurp a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c > +--- a/drivers/infiniband/hw/ehca/ehca_qp.c 2008-05-21 13:54:31.000000000 +0200 > ++++ b/drivers/infiniband/hw/ehca/ehca_qp.c 2008-05-21 14:35:25.000000000 +0200 > +@@ -561,6 +566,8 @@ static struct ehca_qp *internal_create_q > + return ERR_PTR(-ENOMEM); > + } > + > ++ atomic_set(&my_qp->nr_events, 0); > ++ init_waitqueue_head(&my_qp->wait_completion); > + spin_lock_init(&my_qp->spinlock_s); > + spin_lock_init(&my_qp->spinlock_r); > + my_qp->qp_type = qp_type; > +@@ -1929,6 +1936,9 @@ static int internal_destroy_qp(struct ib > + idr_remove(&ehca_qp_idr, my_qp->token); > + write_unlock_irqrestore(&ehca_qp_idr_lock, flags); > + > ++ /* now wait until all pending events have completed */ > ++ wait_event(my_qp->wait_completion, !atomic_read(&my_qp->nr_events)); > ++ > + h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); > + if (h_ret != H_SUCCESS) { > + ehca_err(dev, "hipz_h_destroy_qp() failed h_ret=%li " Applied, Regards, Vladimir From ogerlitz at voltaire.com Wed May 21 06:31:54 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 21 May 2008 16:31:54 +0300 Subject: [ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs In-Reply-To: <000101c8ba9b$4ca68da0$a3d8180a@amr.corp.intel.com> References: <000001c8b6aa$a5418810$bd59180a@amr.corp.intel.com> <482C68A4.9020305@opengridcomputing.com> <4831649A.2020206@voltaire.com> <000101c8b9f6$c4dd4cf0$bd59180a@amr.corp.intel.com> <4832BFAC.2050506@voltaire.com> <000101c8ba9b$4ca68da0$a3d8180a@amr.corp.intel.com> Message-ID: <4834244A.4010509@voltaire.com> Sean Hefty wrote: > After more thought, this approach is what I would try first. I think you will > need a new mutex per rdma_cm_id that does nothing but serializes callbacks. You > might be able to acquire/release it in disable/enable remove, but I didn't look > into the implementation in that much detail. 
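The shape Sean is suggesting, very roughly - every name below is invented for illustration, not taken from any eventual implementation - is one mutex per rdma_cm_id taken around each event callback, so that disabling events can synchronously wait out a callback already in flight:

	struct id_private {
		struct rdma_cm_id	*id;
		struct mutex		handler_mutex;	/* serializes event callbacks */
	};

	static int serialized_event_handler(struct rdma_cm_id *id,
					    struct rdma_cm_event *event)
	{
		struct id_private *priv = id->context;
		int ret;

		mutex_lock(&priv->handler_mutex);
		ret = handle_one_event(priv, event);	/* the ULP's real handler */
		mutex_unlock(&priv->handler_mutex);
		return ret;
	}

	static void disable_callbacks(struct id_private *priv)
	{
		mutex_lock(&priv->handler_mutex);	/* waits out a running callback */
	}

	static void enable_callbacks(struct id_private *priv)
	{
		mutex_unlock(&priv->handler_mutex);
	}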
OK, thanks a lot for the guidance, I will try that and let you know. Or. From nix.or.die at googlemail.com Wed May 21 07:06:36 2008 From: nix.or.die at googlemail.com (Gabriel C) Date: Wed, 21 May 2008 16:06:36 +0200 Subject: [ofa-general] linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c , fix compiler warnings Message-ID: <48342C6C.2010502@googlemail.com> On linux-next from today , allmodconfig, I see the following warnings on 64bit: ... CC [M] drivers/infiniband/hw/ipath/ipath_sdma.o drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:269: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:271: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:273: warning: passing argument 2 of 'variable_test_bit' from incompatible pointer type drivers/infiniband/hw/ipath/ipath_sdma.c:348: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_restart_sdma': drivers/infiniband/hw/ipath/ipath_sdma.c:618: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' ... Signed-off-by: Gabriel C --- I see the 'format' warnings in mainline also. 
drivers/infiniband/hw/ipath/ipath_sdma.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c index 3697449..5f80151 100644 --- a/drivers/infiniband/hw/ipath/ipath_sdma.c +++ b/drivers/infiniband/hw/ipath/ipath_sdma.c @@ -257,7 +257,7 @@ static void sdma_abort_task(unsigned long opaque) /* everything is stopped, time to clean up and restart */ if (status == IPATH_SDMA_ABORT_ABORTED) { struct ipath_sdma_txreq *txp, *txpnext; - u64 hwstatus; + unsigned long hwstatus; int notify = 0; hwstatus = ipath_read_kreg64(dd, @@ -346,7 +346,7 @@ resched: */ if (jiffies > dd->ipath_sdma_abort_jiffies) { ipath_dbg("looping with status 0x%016llx\n", - dd->ipath_sdma_status); + (unsigned long long)dd->ipath_sdma_status); dd->ipath_sdma_abort_jiffies = jiffies + 5 * HZ; } resched_noprint: @@ -616,7 +616,7 @@ void ipath_restart_sdma(struct ipath_devdata *dd) spin_unlock_irqrestore(&dd->ipath_sdma_lock, flags); if (!needed) { ipath_dbg("invalid attempt to restart SDMA, status 0x%016llx\n", - dd->ipath_sdma_status); + (unsigned long long)dd->ipath_sdma_status); goto bail; } spin_lock_irqsave(&dd->ipath_sendctrl_lock, flags); From yevgenyp at mellanox.co.il Wed May 21 07:35:33 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Wed, 21 May 2008 17:35:33 +0300 Subject: [ofa-general][PATCH 0/2] mlx4: Multiple completion vectors Message-ID: <48343335.6050608@mellanox.co.il> Hello Roland, This is the implementation of mlx4 support for multiple completion vectors. Main idea: Create a completion EQ for every core to allow a better utilization of multi-core machines. Number of created completion vectors is advertised through ib_device.num_comp_vectors. Each ULP can decide to which completion vector it wants to assign a created CQ. It can also let mlx4_core to decide on the completion vector number, and it will attach the CQ to the EQ that has the smallest number of CQs attached to it. Thanks, Yevgeny From yevgenyp at mellanox.co.il Wed May 21 07:35:49 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Wed, 21 May 2008 17:35:49 +0300 Subject: [ofa-general][PATCH v1 1/2]mlx4: Multiple completion vectors support Message-ID: <48343345.6010205@mellanox.co.il> >From cde7a50546a0172849955909de41bcb1f8395f4e Mon Sep 17 00:00:00 2001 From: Yevgeny Petrilin Date: Tue, 20 May 2008 11:29:51 +0300 Subject: [PATCH] mlx4: Multiple completion vectors support The driver now creates a completion EQ for every core. While allocating CQ a ULP asks a completion vector number it wants the CQ to be attached to. 
The number of completion vectors is advertised through ib_device.num_comp_vectors Signed-off-by: Yevgeny Petrilin --- drivers/infiniband/hw/mlx4/cq.c | 2 +- drivers/infiniband/hw/mlx4/main.c | 2 +- drivers/net/mlx4/cq.c | 14 ++++++++-- drivers/net/mlx4/eq.c | 47 ++++++++++++++++++++++++------------ drivers/net/mlx4/main.c | 14 ++++++---- drivers/net/mlx4/mlx4.h | 4 +- include/linux/mlx4/device.h | 4 ++- 7 files changed, 57 insertions(+), 30 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 4521319..3519f92 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -221,7 +221,7 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector } err = mlx4_cq_alloc(dev->dev, entries, &cq->buf.mtt, uar, - cq->db.dma, &cq->mcq, 0); + cq->db.dma, &cq->mcq, vector, 0); if (err) goto err_dbmap; diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 098dcd2..7ffcb00 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -567,7 +567,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) mlx4_foreach_port(i, ibdev->ports_map) ibdev->num_ports++; ibdev->ib_dev.phys_port_cnt = ibdev->num_ports; - ibdev->ib_dev.num_comp_vectors = 1; + ibdev->ib_dev.num_comp_vectors = dev->caps.num_comp_vectors; ibdev->ib_dev.dma_device = &dev->pdev->dev; ibdev->ib_dev.uverbs_abi_ver = MLX4_IB_UVERBS_ABI_VERSION; diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c index 95e87a2..9be895f 100644 --- a/drivers/net/mlx4/cq.c +++ b/drivers/net/mlx4/cq.c @@ -189,7 +189,7 @@ EXPORT_SYMBOL_GPL(mlx4_cq_resize); int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, - int collapsed) + unsigned vector, int collapsed) { struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_cq_table *cq_table = &priv->cq_table; @@ -227,7 +227,15 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, cq_context->flags = cpu_to_be32(!!collapsed << 18); cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index); - cq_context->comp_eqn = priv->eq_table.eq[MLX4_EQ_COMP].eqn; + + if (vector >= dev->caps.num_comp_vectors) { + err = -EINVAL; + goto err_radix; + } + + cq->comp_eq_idx = MLX4_EQ_COMP_CPU0 + vector; + cq_context->comp_eqn = priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + + vector].eqn; cq_context->log_page_size = mtt->page_shift - MLX4_ICM_PAGE_SHIFT; mtt_addr = mlx4_mtt_addr(dev, mtt); @@ -276,7 +284,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) if (err) mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn); - synchronize_irq(priv->eq_table.eq[MLX4_EQ_COMP].irq); + synchronize_irq(priv->eq_table.eq[cq->comp_eq_idx].irq); spin_lock_irq(&cq_table->lock); radix_tree_delete(&cq_table->tree, cq->cqn); diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c index e141a15..825e90c 100644 --- a/drivers/net/mlx4/eq.c +++ b/drivers/net/mlx4/eq.c @@ -265,7 +265,7 @@ static irqreturn_t mlx4_interrupt(int irq, void *dev_ptr) writel(priv->eq_table.clr_mask, priv->eq_table.clr_int); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]); return IRQ_RETVAL(work); @@ -482,7 +482,7 @@ static void mlx4_free_irqs(struct mlx4_dev *dev) if (eq_table->have_irq) free_irq(dev->pdev->irq, dev); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + 
dev->caps.num_comp_vectors; ++i) if (eq_table->eq[i].have_irq) free_irq(eq_table->eq[i].irq, eq_table->eq + i); } @@ -553,6 +553,7 @@ void mlx4_unmap_eq_icm(struct mlx4_dev *dev) int mlx4_init_eq_table(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); + int req_eqs; int err; int i; @@ -573,11 +574,21 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) priv->eq_table.clr_int = priv->clr_base + (priv->eq_table.inta_pin < 32 ? 4 : 0); - err = mlx4_create_eq(dev, dev->caps.num_cqs + MLX4_NUM_SPARE_EQE, - (dev->flags & MLX4_FLAG_MSI_X) ? MLX4_EQ_COMP : 0, - &priv->eq_table.eq[MLX4_EQ_COMP]); - if (err) - goto err_out_unmap; + dev->caps.num_comp_vectors = 0; + req_eqs = (dev->flags & MLX4_FLAG_MSI_X) ? num_online_cpus() : 1; + while (req_eqs) { + err = mlx4_create_eq( + dev, dev->caps.num_cqs + MLX4_NUM_SPARE_EQE, + (dev->flags & MLX4_FLAG_MSI_X) ? + (MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors) : 0, + &priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + + dev->caps.num_comp_vectors]); + if (err) + goto err_out_comp; + + dev->caps.num_comp_vectors++; + req_eqs--; + } err = mlx4_create_eq(dev, MLX4_NUM_ASYNC_EQE + MLX4_NUM_SPARE_EQE, (dev->flags & MLX4_FLAG_MSI_X) ? MLX4_EQ_ASYNC : 0, @@ -586,12 +597,16 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) goto err_out_comp; if (dev->flags & MLX4_FLAG_MSI_X) { - static const char *eq_name[] = { - [MLX4_EQ_COMP] = DRV_NAME " (comp)", - [MLX4_EQ_ASYNC] = DRV_NAME " (async)" - }; + static char eq_name[MLX4_NUM_EQ][20]; + + for (i = 0; i < MLX4_EQ_COMP_CPU0 + + dev->caps.num_comp_vectors; ++i) { + if (i == 0) + snprintf(eq_name[0], 20, DRV_NAME "(async)"); + else + snprintf(eq_name[i], 20, "comp_" DRV_NAME "%d", + i - 1); - for (i = 0; i < MLX4_NUM_EQ; ++i) { err = request_irq(priv->eq_table.eq[i].irq, mlx4_msi_x_interrupt, 0, eq_name[i], priv->eq_table.eq + i); @@ -616,7 +631,7 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) mlx4_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", priv->eq_table.eq[MLX4_EQ_ASYNC].eqn, err); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) eq_set_ci(&priv->eq_table.eq[i], 1); return 0; @@ -625,9 +640,9 @@ err_out_async: mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_ASYNC]); err_out_comp: - mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_COMP]); + for (i = 0; i < dev->caps.num_comp_vectors; ++i) + mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i]); -err_out_unmap: mlx4_unmap_clr_int(dev); mlx4_free_irqs(dev); @@ -646,7 +661,7 @@ void mlx4_cleanup_eq_table(struct mlx4_dev *dev) mlx4_free_irqs(dev); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) mlx4_free_eq(dev, &priv->eq_table.eq[i]); mlx4_unmap_clr_int(dev); diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index f86d472..4a909cb 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -911,22 +911,24 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); struct msix_entry entries[MLX4_NUM_EQ]; + int needed_vectors = MLX4_EQ_COMP_CPU0 + num_online_cpus(); int err; int i; if (msi_x) { - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) entries[i].entry = i; - err = pci_enable_msix(dev->pdev, entries, ARRAY_SIZE(entries)); + err = pci_enable_msix(dev->pdev, entries, needed_vectors); if (err) { if (err > 0) - mlx4_info(dev, "Only %d MSI-X vectors available, " - "not using MSI-X\n", err); + mlx4_info(dev, "Only %d MSI-X vectors " + "available, need %d. 
Not using MSI-X\n", + err, needed_vectors); goto no_msi; } - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) priv->eq_table.eq[i].irq = entries[i].vector; dev->flags |= MLX4_FLAG_MSI_X; @@ -934,7 +936,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) } no_msi: - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) priv->eq_table.eq[i].irq = dev->pdev->irq; } diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 7bc4cbf..4435272 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -64,8 +64,8 @@ enum { enum { MLX4_EQ_ASYNC, - MLX4_EQ_COMP, - MLX4_NUM_EQ + MLX4_EQ_COMP_CPU0, + MLX4_NUM_EQ = MLX4_EQ_COMP_CPU0 + NR_CPUS }; enum { diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index d97314d..7cbe078 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -187,6 +187,7 @@ struct mlx4_caps { int reserved_cqs; int num_eqs; int reserved_eqs; + int num_comp_vectors; int num_mpts; int num_mtt_segs; int fmr_reserved_mtts; @@ -305,6 +306,7 @@ struct mlx4_cq { int arm_sn; int cqn; + int comp_eq_idx; atomic_t refcount; struct completion free; @@ -434,7 +436,7 @@ void mlx4_free_hwq_res(struct mlx4_dev *mdev, struct mlx4_hwq_resources *wqres, int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, - int collapsed); + unsigned vector, int collapsed); void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq); int mlx4_qp_reserve_range(struct mlx4_dev *dev, int cnt, int align, int *base); -- 1.5.4 From yevgenyp at mellanox.co.il Wed May 21 07:36:13 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Wed, 21 May 2008 17:36:13 +0300 Subject: [ofa-general][PATCH v1 2/2] mlx4: Default value for automatic completion vector selection Message-ID: <4834335D.8030903@mellanox.co.il> >From b652a738eb5acbb01f0d0a143a12a7bdcc86d002 Mon Sep 17 00:00:00 2001 From: Yevgeny Petrilin Date: Tue, 20 May 2008 13:51:00 +0300 Subject: [PATCH] mlx4: Default value for automatic completion vector selection When the vector number passed to mlx4_cq_alloc is MLX4_ANY_VECTOR (0xff), the driver selects the completion vector that has the least CQs attached to it and attaches the CQ to the chosen vector. 
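For context, the consumer-visible knob is the comp_vector argument that ULPs already pass to ib_create_cq(); a sketch of the usage this patch set enables (illustrative only, not part of the patches):

	/* Spread CQs explicitly, e.g. one per CPU across the advertised vectors. */
	static struct ib_cq *create_percpu_cq(struct ib_device *device, int cpu,
					      ib_comp_handler comp, void *ctx, int cqe)
	{
		return ib_create_cq(device, comp, NULL, ctx, cqe,
				    cpu % device->num_comp_vectors);
	}

Alternatively, a consumer calling mlx4_cq_alloc() directly can pass MLX4_ANY_VECTOR and let mlx4_core attach the CQ to the least-loaded completion EQ, as described above.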
Signed-off-by: Yevgeny Petrilin --- drivers/net/mlx4/cq.c | 22 +++++++++++++++++++++- drivers/net/mlx4/mlx4.h | 1 + include/linux/mlx4/device.h | 4 ++++ 3 files changed, 26 insertions(+), 1 deletions(-) diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c index 9be895f..7f0bdf6 100644 --- a/drivers/net/mlx4/cq.c +++ b/drivers/net/mlx4/cq.c @@ -187,6 +187,22 @@ int mlx4_cq_resize(struct mlx4_dev *dev, struct mlx4_cq *cq, } EXPORT_SYMBOL_GPL(mlx4_cq_resize); +static int mlx4_find_least_loaded_vector(struct mlx4_priv *priv) +{ + int i; + int index = 0; + int min = priv->eq_table.eq[MLX4_EQ_COMP_CPU0].load; + + for (i = 1; i < priv->dev.caps.num_comp_vectors; i++) { + if (priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i].load < min) { + index = i; + min = priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i].load; + } + } + + return index; +} + int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, unsigned vector, int collapsed) @@ -228,7 +244,9 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, cq_context->flags = cpu_to_be32(!!collapsed << 18); cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index); - if (vector >= dev->caps.num_comp_vectors) { + if (vector == MLX4_ANY_VECTOR) + vector = mlx4_find_least_loaded_vector(priv); + else if (vector >= dev->caps.num_comp_vectors) { err = -EINVAL; goto err_radix; } @@ -248,6 +266,7 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, if (err) goto err_radix; + priv->eq_table.eq[cq->comp_eq_idx].load++; cq->cons_index = 0; cq->arm_sn = 1; cq->uar = uar; @@ -285,6 +304,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn); synchronize_irq(priv->eq_table.eq[cq->comp_eq_idx].irq); + priv->eq_table.eq[cq->comp_eq_idx].load--; spin_lock_irq(&cq_table->lock); radix_tree_delete(&cq_table->tree, cq->cqn); diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 4435272..b2d103a 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -144,6 +144,7 @@ struct mlx4_eq { u16 irq; u16 have_irq; int nent; + int load; struct mlx4_buf_list *page_list; struct mlx4_mtt mtt; }; diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index 7cbe078..3cfe5c1 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -151,6 +151,10 @@ enum { MLX4_NUM_FEXCH = 64 * 1024, }; +enum { + MLX4_ANY_VECTOR = 0xff +}; + static inline u64 mlx4_fw_ver(u64 major, u64 minor, u64 subminor) { return (major << 32) | (minor << 16) | subminor; -- 1.5.4 From jackm at dev.mellanox.co.il Wed May 21 07:36:00 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 21 May 2008 17:36:00 +0300 Subject: [ofa-general] IPoIB-UD post_send failures (OFED 1.3) In-Reply-To: <20080511102345.GJ5298@sgi.com> References: <20080508171936.GU24293@sgi.com> <1210493899.15669.116.camel@mtls03> <20080511102345.GJ5298@sgi.com> Message-ID: <200805211736.01231.jackm@dev.mellanox.co.il> Arthur, I just checked in a fix for bugzilla 1004, which seems to be the same problem you are seeing. (I just noticed your explanation in this thread in an earlier post: "So a call to ipoib_cm_send() with tx_outstanding = (ipoib_sendq_size - 2), followed by a call to ipoib_send() would get to a situation where the queue was full, but not stopped." ). 
This is correct, and this was the bug (in addition to a missing invocation of netif_stop_queue in ipoib_ib_tx_timer_func() ). The patch uses the same value for tx_outstanding in all cases in the test for invoking netif_stop_queue(), so that there is no way the kernel will continue to send TX packets to IPoIB if the queue becomes too full. (using the same value in all tests creates a "barrier" with no holes). This patch will be part of OFED 1.3.1-rc2 -- and you should see no more mthca "queue full" messages. - Jack P.S., this fix is not needed in the upstream kernel, since the unsignalled UD send mechanism was not added upstream. On Sunday 11 May 2008 13:23, akepner at sgi.com wrote: > On Sun, May 11, 2008 at 11:18:19AM +0300, Eli Cohen wrote: > > .... > > The reason why the queue is stopped when there is one entry still left > > is to allow ipoib_ib_tx_timer_func() to post a special send request that > > will ensure a completion is reported for this operation thus freeing > > entries at the tx ring. I don't think the scenario you describe here can > > lead to a deadlock since if that happens, it will be released because of > > either one of the following two reasons: > > 1. If the tx queue contains not yet polled, more than one completion of > > send WRs posted by ipoib_cm_send(), they will soon be polled since they > > are posted to a signaled QP and sooner or later will generate > > completions and interrupts. In this case, subsequent postings to > > ipoib_send() will work as expected. > > > > 2. If there is only one outstanding ipoib_cm_send() at the tx queue, it > > means that there are 126 outstanding ipoib_send() requests at the tx > > queue and this means that a few of them are signaled and are expected to > > be completed soon. > > Thanks for the explanation. > > The main problem that we're seeing is that we just stop getting > completions for the send queue. (And we see this with OFED-1.2 > and 1.3, which makes me think that it's unlikely to be due to the > IPoIB driver since that's changed so much.) > > > ..... > > And last, could you arrange a remote access to a machine in this > > condition so we could check the state of the device/FW? > > > > Yes, I think so. Let me see if I can arrange that. > From olaf.kirch at oracle.com Wed May 21 07:52:49 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Wed, 21 May 2008 16:52:49 +0200 Subject: [ofa-general] RDS flow control In-Reply-To: <4833FA1B.4010707@mellanox.co.il> References: <200805121157.38135.jon@opengridcomputing.com> <4833FA1B.4010707@mellanox.co.il> Message-ID: <200805211652.50470.olaf.kirch@oracle.com> On Wednesday 21 May 2008 12:31:55 Tziporet Koren wrote: > SQD is not implemented in ConnectX for now So what do I do on ConnectX? Will the state transition RTR->SQD just appear to work (despite not doing anything), or will it fail? Will the subsequent change of the RNR retry count succeed? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From ctung at neteffect.com Wed May 21 09:49:58 2008 From: ctung at neteffect.com (Chien Tung) Date: Wed, 21 May 2008 11:49:58 -0500 Subject: [ofa-general] [ PATCH ] RDMA/nes Update MAINTAINERS list Message-ID: <200805211649.m4LGnwPP026935@velma.neteffect.com> Adding Chien to maintainers list for NetEffect. 
Signed-off-by: Chien Tung --- MAINTAINERS | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index bc1c008..39feafc 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2837,8 +2837,8 @@ S: Maintained NETEFFECT IWARP RNIC DRIVER (IW_NES) P: Faisal Latif M: flatif at neteffect.com -P: Nishi Gupta -M: ngupta at neteffect.com +P: Chien Tung +M: ctung at neteffect.com P: Glenn Streiff M: gstreiff at neteffect.com L: general at lists.openfabrics.org From ralph.campbell at qlogic.com Wed May 21 10:29:24 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 21 May 2008 10:29:24 -0700 Subject: [ofa-general] RDS flow control In-Reply-To: <4833D3DA.4040106@voltaire.com> References: <200805121157.38135.jon@opengridcomputing.com> <200805191006.00114.olaf.kirch@oracle.com> <20080520204522.GD31790@opengridcomputing.com> <200805202313.40213.olaf.kirch@oracle.com> <4833D223.5090007@voltaire.com> <4833D3DA.4040106@voltaire.com> Message-ID: <1211390964.3949.283.camel@brick.pathscale.com> On Wed, 2008-05-21 at 10:48 +0300, Or Gerlitz wrote: > Or Gerlitz wrote: > > HW support of the IB_DEVICE_RC_RNR_NAK_GEN device capability > I see now that only the mlx4, mthca and ehca drivers advertise this > capability, but the ipath doesn't, Ralph, was it just forgotten or you > guys really don't support this? > > Or. It is supported. Its just a bug that the capability bit isn't set. I'll make a patch for this. From ralph.campbell at qlogic.com Wed May 21 10:34:20 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 21 May 2008 10:34:20 -0700 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c whenIB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: <6C2C79E72C305246B504CBA17B5500C90411217E@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C90411217E@mtlexch01.mtl.com> Message-ID: <1211391260.3949.286.camel@brick.pathscale.com> On Wed, 2008-05-21 at 14:30 +0300, Tziporet Koren wrote: > Ralph, > > Can you provide the patch to OFED 1.3.1 today so we will be able to > include it in RC1? > > Thanks > Tziporet The patch is available for pulling from: git://git.openfabrics.org/~ralphc/linux-2.6/.git ofed_kernel commit 2c62d7930703acea41568a98cc74e712475ebe38 Author: Ralph Campbell (QLogic) Date: Tue May 20 16:58:41 2008 -0700 IB/MAD - fix crash when HCA returns IB_MAD_RESULT_SUCCESS|IB_MAD_RESULT_CONS This was observed with the hw/ipath driver, but could happen with any driver. It's OFED bug 1027. The fix is to kfree the local data and break, rather than falling through.
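In outline, the fix is of this shape (simplified, not the literal diff; ret and mad_priv as in drivers/infiniband/core/mad.c):

	switch (ret) {
	case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED:
		/* the device consumed the MAD: free the local copy and stop */
		kmem_cache_free(ib_mad_cache, mad_priv);
		break;				/* previously fell through */
	default:
		/* the REPLY, plain SUCCESS and FAILURE arms are unchanged */
		break;
	}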
Signed-off-by: Dave Olson From gdror at mellanox.co.il Wed May 21 12:02:02 2008 From: gdror at mellanox.co.il (Dror Goldenberg) Date: Wed, 21 May 2008 22:02:02 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4833EA6B.9000705@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> Message-ID: <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> -----Original Message----- From: Or Gerlitz [mailto:ogerlitz at voltaire.com] Sent: Wednesday, May 21, 2008 12:25 PM To: Steve Wise Cc: rdreier at cisco.com; general at lists.openfabrics.org; Dror Goldenberg Subject: Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support Steve Wise wrote: >>> Support for the IB BMME and iWARP equivalent memory extensions ... >>> Usage Model: >>> - MR allocated with ib_alloc_mr() >>> - Page lists allocated via ib_alloc_fast_reg_page_list(). >>> - MR made VALID & bound to a specific page list via >>> ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via >>> ib_post_send(IB_WR_INVALIDATE_MR) >> AFAIK, the idea was to let the ulp post --two-- work requests, where >> the first creates the mapping and the second sends this mapping to >> the remote side, such that the second does not start before the first >> completes (i.e a fence). >> >> Now, the above scheme means that the ulp knows the value of the >> rkey/stag at the time of posting these two work requests (since it >> has to encode it in the second one), so something has to be clarified >> re the rkey/stag here, do they change each time this MR is used? how >> many bits can be changed, etc. > > The ULP knows the rkey/stag because its returned up front in the > ib_alloc_fast_reg_mr(). And it doesn't change (ignoring the key issue > which we haven't exposed yet to the ULP). The same rkey/stag can be > used for multiple mappings. It can be made invalid at any point in > time via the IB_WR_INVALIDATE_MR so the fact that you're leaving the > same rkey/stag advertised is not a risk. I understand that this (same rkey/stag used for all mapping produced for a specific mr) is what you are proposing, I still think there's a chance that by the spec and (not less important!) by existing HW support, its possible to have a different rkey/stag per mapping done on an mr, for example the IB spec uses a "consumer owned key portion of the L_Key" notation which makes me think there should be a way to have different rkey per mapping, Roland? Dror? [dg] When you post a fast register WQE, you specify the new 8 LSBits to be assigned to the MR. The rest 24 MSBits are the ones that you obtained while allocating the MR, and they persist throughout the lifetime of this MR. From worleys at gmail.com Wed May 21 14:44:14 2008 From: worleys at gmail.com (Chris Worley) Date: Wed, 21 May 2008 15:44:14 -0600 Subject: OFED 1.3 w/ a Lustre kernel (was Re: [ofa-general] kernel ib build (OFED 1.3) fails on SLES 10) Message-ID: I'm building OFED 1.3 for an RHEL5.1 kernel for Lustre 1.6.4.3: 2.6.18-53.1.13.el5_lustre.1.6.4.3smp. The install.pl script errors-out at the end of building the kernel modules RPM saying some built modules don't exist, but those modules do get built; the only other warning is some undefined symbols coming out of the modpost command. 
The differences between the .configs of the two kernels are minimal, so I think it's the same problem as the attached (not getting the proper patch files). What's the work-around? Chris P.S. Here's a sample of the output of the modpost command: WARNING: "scst_register_target_template" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! WARNING: "scst_unregister" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! WARNING: "scst_register" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! WARNING: "scst_unregister_target_template" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! WARNING: "scst_register_session" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! WARNING: "scst_rx_mgmt_fn" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! WARNING: "scst_cmd_init_done" [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined! Here are the errors at the end: RPM build errors: ... File not found: /var/tmp/OFED/lib/modules/2.6.18-53.1.13.el5_lustre.1.6.4.3smp/updates/kernel/drivers/scsi/iscsi_tcp.ko File not found: /var/tmp/OFED/lib/modules/2.6.18-53.1.13.el5_lustre.1.6.4.3smp/updates/kernel/drivers/scsi/libiscsi.ko File not found: /var/tmp/OFED/lib/modules/2.6.18-53.1.13.el5_lustre.1.6.4.3smp/updates/kernel/drivers/scsi/scsi_transport_iscsi.ko File not found: /var/tmp/OFED/lib/modules/2.6.18-53.1.13.el5_lustre.1.6.4.3smp/updates/kernel/net/rds File not found: /var/tmp/OFED/lib/modules/2.6.18-53.1.13.el5_lustre.1.6.4.3smp/updates/kernel/drivers/net/cxgb3 File not found: /var/tmp/OFED/lib/modules/2.6.18-53.1.13.el5_lustre.1.6.4.3smp/updates/kernel/drivers/net/mlx4 On Tue, Apr 8, 2008 at 8:43 AM, Brian J. Murrell wrote: > On Tue, 2008-04-08 at 10:13 +0200, Thomas Großmann wrote: >> Hi, > > Hi > >> kernel ib build (OFED 1.3) fails on SLES 10. > > To be fair, it fails on Sun's version of the SLES 10 kernel for Lustre, > and here is why: > >> Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.52212 >> + umask 022 >> + cd /var/tmp/OFED_topdir/BUILD >> + /bin/rm -rf /var/tmp/OFED >> ++ dirname /var/tmp/OFED >> + /bin/mkdir -p /var/tmp >> + /bin/mkdir /var/tmp/OFED >> + cd ofa_kernel-1.3 >> + rm -rf /var/tmp/OFED >> + cd /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3 >> + mkdir -p /var/tmp/OFED//usr/local/ofed-1.3/src >> + cp -a /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3 /var/tmp/OFED//usr/local/ofed-1.3/src >> + ./configure --prefix=/usr/local/ofed-1.3 --kernel-version 2.6.16-54-0.2.5_lustre.1.6.4.3smp --kernel-sources /lib/modules/2.6.16-54-0.2.5_lustre.1.6.4.3smp/build --modules-dir /lib/modules/2.6.16-54-0.2.5_lustre.1.6.4.3smp/updates --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mthca-mod --with-mlx4-mod --with-cxgb3-mod --with-nes-mod --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-srp-target-mod --with-rds-mod --with-qlgc_vnic-mod >> ofed_patch.mk does not exist. running ofed_patch.sh >> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.3/ofed_scripts/ofed_patch.sh --kernel-version 2.6.16-54-0.2.5_lustre.1.6.4.3smp > ----------------------------------------------------------------------------------------------------------------------^ > This kernel version does not match what ofed_patch.sh thinks is a SLES 10 > kernel because it is not of the form "2.6.16.*-*-*".
Here's the code in > ofed_patch.sh which detects SLES 10 kernels and assigns the right patch > series for it: > > 2.6.16.*-*-*) > minor=$(echo $KVERSION | cut -d"." -f4 | cut -d"-" -f1) > if [ $minor -lt 37 ]; then > echo 2.6.16_sles10 > elif [ $minor -lt 60 ]; then > echo 2.6.16_sles10_sp1 > else > echo 2.6.16_sles10_sp2 > fi > ;; > > The lustre kernel version for SLES 10 is > "2.6.16-54-0.2.5_lustre.1.6.4.3smp". In order for it to match the above > code it needs to have a "-" put before the "smp" at the end. I am > working on the Lustre build process to do exactly this right at this > moment as well as build our released RPMs with OFED 1.3 support right in > them. My work is being done in Lustre bugzilla ticket 15316. When I > have something working, I will post an attachment there with a patch for > our current b1_6 that should apply to 1.6.4.3. > > In theory you should be able to use the "--with-backport*" configure > options to override this detection when building the RPMs; however, see my > message to this list (inconsistent use of --with-backport[-patches]) > last Saturday about how this seems to be broken currently. > > Cheers, > b. From arlin.r.davis at intel.com Wed May 21 15:07:24 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 21 May 2008 15:07:24 -0700 Subject: [ofa-general] [PATCH 1/3][2.0] dapl: change cma provider to use max_rdma_read_in, out from ep_attr instead of HCA max values when connecting. Message-ID: <000d01c8bb8f$0d353f50$8bc3020a@amr.corp.intel.com> Patch set for v2.0.
Same fixes already applied to v1.2 Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dapl/openib_cma/dapl_ib_cm.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/dapl/openib_cma/dapl_ib_cm.c b/dapl/openib_cma/dapl_ib_cm.c index d3835b3..de35eba 100755 --- a/dapl/openib_cma/dapl_ib_cm.c +++ b/dapl/openib_cma/dapl_ib_cm.c @@ -540,8 +540,10 @@ DAT_RETURN dapls_ib_connect(IN DAT_EP_HANDLE ep_handle, /* Setup QP/CM parameters and private data in cm_id */ (void)dapl_os_memzero(&conn->params, sizeof(conn->params)); - conn->params.responder_resources = conn->hca->ib_trans.max_rdma_rd_in; - conn->params.initiator_depth = conn->hca->ib_trans.max_rdma_rd_out; + conn->params.responder_resources = + ep_ptr->param.ep_attr.max_rdma_read_in; + conn->params.initiator_depth = + ep_ptr->param.ep_attr.max_rdma_read_out; conn->params.flow_control = 1; conn->params.rnr_retry_count = IB_RNR_RETRY_COUNT; conn->params.retry_count = IB_RC_RETRY_COUNT; -- 1.5.2.5 From arlin.r.davis at intel.com Wed May 21 15:07:28 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Wed, 21 May 2008 15:07:28 -0700 Subject: [ofa-general] [PATCH 3/3][2.0] dtest, dtestx, dapltest: fix build issues with Redhat EL5.1 Message-ID: need include files/definitions for sleep, getpid, gettimeofday Signed-off by: Arlin Davis ardavis at ichips.intel.com --- test/dapltest/mdep/linux/dapl_mdep_user.h | 1 + test/dtest/dtest.c | 3 +++ test/dtest/dtestx.c | 3 +++ 3 files changed, 7 insertions(+), 0 deletions(-) diff --git a/test/dapltest/mdep/linux/dapl_mdep_user.h b/test/dapltest/mdep/linux/dapl_mdep_user.h index 52199d1..c39d3d6 100755 --- a/test/dapltest/mdep/linux/dapl_mdep_user.h +++ b/test/dapltest/mdep/linux/dapl_mdep_user.h @@ -42,6 +42,7 @@ #include #include #include +#include /* inet_ntoa */ #include diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index 9c8ec71..095ff40 100755 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -63,13 +63,16 @@ #include #include #include +#include #include +#include #include #include #include #include #include #include +#include #define DAPL_PROVIDER "ofa-v2-ib0" diff --git a/test/dtest/dtestx.c b/test/dtest/dtestx.c index 19ef788..1db60eb 100755 --- a/test/dtest/dtestx.c +++ b/test/dtest/dtestx.c @@ -48,12 +48,15 @@ #define DAPL_PROVIDER "ibnic0v2" #else #include +#include #include +#include #include #include #include #include #include +#include #define DAPL_PROVIDER "ofa-v2-ib0" #define F64x "%"PRIx64"" -- 1.5.2.5 From arlin.r.davis at intel.com Wed May 21 15:07:26 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Wed, 21 May 2008 15:07:26 -0700 Subject: [ofa-general] [PATCH 2/3][2.0] dapl: Fix long delays with the cma provider open call when DNS is not configure on server. Message-ID: Open call should default to netdev names when resolving local IP address for cma binding to match dat.conf settings. The open code attempts to resolve with IP or Hostname first and if there is no DNS services setup the failover to netdev name resolution is delayed for as much as 20 seconds. 
Signed-off by: Arlin Davis ardavis at ichips.intel.com --- dapl/openib_cma/dapl_ib_util.c | 68 +++++++++++++++++++--------------------- 1 files changed, 32 insertions(+), 36 deletions(-) diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c index 41986a3..e3a3b29 100755 --- a/dapl/openib_cma/dapl_ib_util.c +++ b/dapl/openib_cma/dapl_ib_util.c @@ -105,41 +105,37 @@ bail: /* Get IP address using network name, address, or device name */ static int getipaddr(char *name, char *addr, int len) { - struct addrinfo *res; - int ret; - - /* Assume network name and address type for first attempt */ - if (getaddrinfo(name, NULL, NULL, &res)) { - /* retry using network device name */ - ret = getipaddr_netdev(name,addr,len); - if (ret) { - dapl_log(DAPL_DBG_TYPE_ERR, - " open_hca: getaddr_netdev ERROR:" - " %s. Is %s configured?\n", - strerror(errno), name); - return ret; - } - } else { - if (len >= res->ai_addrlen) - memcpy(addr, res->ai_addr, res->ai_addrlen); - else { - freeaddrinfo(res); - return EINVAL; - } - - freeaddrinfo(res); - } + struct addrinfo *res; + + /* assume netdev for first attempt, then network and address type */ + if (getipaddr_netdev(name,addr,len)) { + if (getaddrinfo(name, NULL, NULL, &res)) { + dapl_log(DAPL_DBG_TYPE_ERR, + " open_hca: getaddr_netdev ERROR:" + " %s. Is %s configured?\n", + strerror(errno), name); + return 1; + } else { + if (len >= res->ai_addrlen) + memcpy(addr, res->ai_addr, res->ai_addrlen); + else { + freeaddrinfo(res); + return 1; + } + freeaddrinfo(res); + } + } - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, - " getipaddr: family %d port %d addr %d.%d.%d.%d\n", - ((struct sockaddr_in *)addr)->sin_family, - ((struct sockaddr_in *)addr)->sin_port, - ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 0 & 0xff, - ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 8 & 0xff, - ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 16 & 0xff, - ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 24 & 0xff); - - return 0; + dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + " getipaddr: family %d port %d addr %d.%d.%d.%d\n", + ((struct sockaddr_in *)addr)->sin_family, + ((struct sockaddr_in *)addr)->sin_port, + ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 0 & 0xff, + ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 8 & 0xff, + ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 16 & 0xff, + ((struct sockaddr_in *)addr)->sin_addr.s_addr >> 24 & 0xff); + + return 0; } /* @@ -640,7 +636,7 @@ DAT_RETURN dapli_ib_thread_init(void) while (g_ib_thread_state != IB_THREAD_RUN) { struct timespec sleep, remain; sleep.tv_sec = 0; - sleep.tv_nsec = 20000000; /* 20 ms */ + sleep.tv_nsec = 2000000; /* 2 ms */ dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_init: waiting for ib_thread\n"); dapl_os_unlock(&g_hca_lock); @@ -677,7 +673,7 @@ void dapli_ib_thread_destroy(void) while ((g_ib_thread_state != IB_THREAD_EXIT) && (retries--)) { struct timespec sleep, remain; sleep.tv_sec = 0; - sleep.tv_nsec = 20000000; /* 20 ms */ + sleep.tv_nsec = 2000000; /* 2 ms */ dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " ib_thread_destroy: waiting for ib_thread\n"); write(g_ib_pipe[1], "w", sizeof "w"); -- 1.5.2.5 From arlin.r.davis at intel.com Wed May 21 15:48:14 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 21 May 2008 15:48:14 -0700 Subject: [ofa-general] [ANNOUNCE] dapl-1.2.7 and dapl-2.0.9 release Message-ID: <000e01c8bb94$c12f5e00$8bc3020a@amr.corp.intel.com> New release for dapl 1.2 and 2.0 available on the OFA download page and in my git tree. 
md5sum: 21e2ff9e64ba5ef697e413bb2373db18 dapl-1.2.7.tar.gz md5sum: 89e07acd0ff4dc73078ee46ee663c786 dapl-2.0.9.tar.gz Vlad, please pull both packages into OFED 1.3.1 and install the following: dapl-1.2.7-1 dapl-devel-1.2.7-1 dapl-2.0.9-1 dapl-utils-2.0.9-1 dapl-devel-2.0.9-1 dapl-debuginfo-2.0.9-1 Summary of fixes since last package: v1,v2 - dtest,dtestx,dapltest build issues on RHEL5.1 v1,v2 - long delay during dat_ia_open when DNS not configured v1,v2 - use rdma_read_in/out from ep_attr per consumer instead of HCA max See http://www.openfabrics.org/downloads/dapl/ for more details. -arlin From matthias at sgi.com Wed May 21 16:48:28 2008 From: matthias at sgi.com (Matthias Blankenhaus) Date: Wed, 21 May 2008 16:48:28 -0700 (PDT) Subject: [ofa-general] saquery port problems In-Reply-To: References: Message-ID: I have a patch that fixes the problem: diff -Narpu infiniband-diags-1.3.6.vanilla/src/saquery.c my/src/saquery.c --- infiniband-diags-1.3.6.vanilla/src/saquery.c 2008-02-28 00:58:36.000000000 -0800 +++ my/src/saquery.c 2008-05-21 16:08:19.583221794 -0700 @@ -1304,13 +1304,13 @@ get_bind_handle(void) ca_name_index++; if (sa_port_num && sa_port_num != attr_array[i].port_num) continue; - if (sa_hca_name && i == 0) - continue; if (sa_hca_name && strcmp(sa_hca_name, vendor->ca_names[ca_name_index]) != 0) continue; - if (attr_array[i].link_state == IB_LINK_ACTIVE) + if (attr_array[i].link_state == IB_LINK_ACTIVE) { port_guid = attr_array[i].port_guid; + break; + } } I have tested it and it solves the problem. Does this look ok? Matthias On Tue, 20 May 2008, Matthias Blankenhaus wrote: > Forgot some important info: > > saquery BUILD VERSION: 1.3.6 > OFED-1.3 > > On Tue, 20 May 2008, Matthias Blankenhaus wrote: > > Howdy! > > > > While using this tool to run some queries on a two-port HCA, I noticed > > some odd behavior. Here are my observations running on a SLES10SP2 > > (x86_64) Intel Xeon with a Mellanox Technologies MT25208 InfiniHost III > > Ex HCA: > > > > (01) saquery -C mthca0 -m > > This yields the output for port number two. This does not conform to the > > usual ib tools behavior of reporting on port one by default. > > > > (02) saquery -C mthca0 -m -P 1 > > Fails with "Failed to find active port, check port status with "ibstat". > > This is incorrect, since > > > > # ibstat mthca0 1 > > CA: 'mthca0' > > Port 1: > > State: Active > > Physical state: LinkUp > > Rate: 20 > > Base lid: 5 > > LMC: 0 > > SM lid: 1 > > Capability mask: 0x02510a68 > > Port GUID: 0x0008f10403987dc5 > > > > This might be the reason why (01) reports on port two. > > > > (03) saquery -C mthca0 -m -P 2 > > Works and is identical with the output from (01). > > > > However, the following command options work: > > > > (04) saquery -P 1 -m > > Correctly yields the output for port one. In other words, > > port one seems to be fine, unlike reported in (02). > > > > (05) saquery -P 2 -m > > Correctly yields the output for port two. > > > > Is it incorrect to use -C and -P in combination? Why does > > saquery think that port one is not active?
From nix.or.die at googlemail.com Wed May 21 18:45:32 2008 From: nix.or.die at googlemail.com (Gabriel C) Date: Thu, 22 May 2008 03:45:32 +0200 Subject: [ofa-general] Re: linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c , fix compiler warnings In-Reply-To: <20080522101714.3f2c5a82.sfr@canb.auug.org.au> References: <48342C6C.2010502@googlemail.com> <20080522101714.3f2c5a82.sfr@canb.auug.org.au> Message-ID: <1820d69d0805211845i54908fc5s486d8b9c1d818481@mail.gmail.com> 2008/5/22 Stephen Rothwell : > Hi Gabriel, > > We appreciate the reports, thanks. > > On Wed, 21 May 2008 16:06:36 +0200 Gabriel C > wrote: > > > > On linux-next from today, allmodconfig, I see the following warnings on > 64bit: > > What architecture are you building on? It is x86_64 Gabriel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nix.or.die at googlemail.com Wed May 21 19:39:16 2008 From: nix.or.die at googlemail.com (Gabriel C) Date: Thu, 22 May 2008 04:39:16 +0200 Subject: [ofa-general] Re: linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c , fix compiler warnings In-Reply-To: <20080522002335.GG20457@bakeyournoodle.com> References: <48342C6C.2010502@googlemail.com> <20080522002335.GG20457@bakeyournoodle.com> Message-ID: <1820d69d0805211939u7476ead9pe17946f5d4ee7248@mail.gmail.com> 2008/5/22 Tony Breeds : > > On Wed, May 21, 2008 at 04:06:36PM +0200, Gabriel C wrote: > > On linux-next from today , allmodconfig, I see the following warnings on 64bit: > > x86_64 right? > > > > > diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c > > index 3697449..5f80151 100644 > > --- a/drivers/infiniband/hw/ipath/ipath_sdma.c > > +++ b/drivers/infiniband/hw/ipath/ipath_sdma.c > > @@ -257,7 +257,7 @@ static void sdma_abort_task(unsigned long opaque) > > /* everything is stopped, time to clean up and restart */ > > if (status == IPATH_SDMA_ABORT_ABORTED) { > > struct ipath_sdma_txreq *txp, *txpnext; > > - u64 hwstatus; > > + unsigned long hwstatus; > > int notify = 0; > > > > hwstatus = ipath_read_kreg64(dd, > > This can't be right. hwstatus needs to be u64, as that's what ipath_read_kreg64() retuns. > and a little bit further down we see: > > --- > if (/* ScoreBoardDrainInProg */ > test_bit(63, &hwstatus) || > /* AbortInProg */ > test_bit(62, &hwstatus) || > /* InternalSDmaEnable */ > test_bit(61, &hwstatus) || > --- > > so hwstatus, clearly needs to be 64-bits. Hmm , right it need be 64-bits. I should drink my coffee first and read code more carefully before sending out wrong patches , sorry. > This brings up an interesting point. test_bit() and co are > essntally expecting to be passed the address > of an unsigned long[], so is it correct to pass &u64? I'm not sure about this one. > > Yours Tony > Gabriel From nix.or.die at googlemail.com Wed May 21 19:42:39 2008 From: nix.or.die at googlemail.com (Gabriel C) Date: Thu, 22 May 2008 04:42:39 +0200 Subject: [ofa-general] Re: linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c , fix compiler warnings In-Reply-To: <20080522002335.GG20457@bakeyournoodle.com> References: <48342C6C.2010502@googlemail.com> <20080522002335.GG20457@bakeyournoodle.com> Message-ID: <1820d69d0805211942v36fa54edh216a14efae9ebdde@mail.gmail.com> 2008/5/22 Tony Breeds : > On Wed, May 21, 2008 at 04:06:36PM +0200, Gabriel C wrote: >> On linux-next from today , allmodconfig, I see the following warnings on 64bit: > > x86_64 right? 
> Yes it is x86_64 From vlad at dev.mellanox.co.il Wed May 21 22:58:44 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 22 May 2008 08:58:44 +0300 Subject: [ofa-general] Re: PATCH 1/1 - fix kernel crash in mad.c whenIB_MAD_RESULT_(SUCCESS|CONSUMED) returned In-Reply-To: <1211391260.3949.286.camel@brick.pathscale.com> References: <6C2C79E72C305246B504CBA17B5500C90411217E@mtlexch01.mtl.com> <1211391260.3949.286.camel@brick.pathscale.com> Message-ID: <48350B94.7030906@dev.mellanox.co.il> Ralph Campbell wrote: > On Wed, 2008-05-21 at 14:30 +0300, Tziporet Koren wrote: >> Ralph, >> >> Can you provide the patch to OFED 1.3.1 today so we will be able to >> include it in RC1? >> >> Thanks >> Tziporet > > The patch is available for pulling from: > git://git.openfabrics.org/~ralphc/linux-2.6/.git ofed_kernel > > commit 2c62d7930703acea41568a98cc74e712475ebe38 > Author: Ralph Campbell (QLogic) > Date: Tue May 20 16:58:41 2008 -0700 > > IB/MAD - fix crash when HCA returns IB_MAD_RESULT_SUCCESS|IB_MAD_RESULT_CONS > > This was observed with the hw/ipath driver, but could happen with any > driver. It's OFED bug 1027. The fix is to kfree the local data and > break, rather than falling through. > > Signed-off-by: Dave Olson > > > Done, Regards, Vladimir From vlad at dev.mellanox.co.il Wed May 21 23:17:39 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 22 May 2008 09:17:39 +0300 Subject: [ofa-general] Re: [ewg] [ANNOUNCE] dapl-1.2.7 and dapl-2.0.9 release In-Reply-To: <000e01c8bb94$c12f5e00$8bc3020a@amr.corp.intel.com> References: <000e01c8bb94$c12f5e00$8bc3020a@amr.corp.intel.com> Message-ID: <48351003.1040906@dev.mellanox.co.il> Arlin Davis wrote: > New release for dapl 1.2 and 2.0 available on the OFA download page and in my git tree. > > md5sum: 21e2ff9e64ba5ef697e413bb2373db18 dapl-1.2.7.tar.gz > md5sum: 89e07acd0ff4dc73078ee46ee663c786 dapl-2.0.9.tar.gz > > Vlad, please pull both packages into OFED 1.3.1 and install the following: > > dapl-1.2.7-1 > dapl-devel-1.2.7-1 > dapl-2.0.9-1 > dapl-utils-2.0.9-1 > dapl-devel-2.0.9-1 > dapl-debuginfo-2.0.9-1 > > Summary of fixes since last package: > v1,v2 - dtest,dtestx,dapltest build issues on RHEL5.1 > v1,v2 - long delay during dat_ia_open when DNS not configured > v1,v2 - use rdma_read_in/out from ep_attr per consumer instead of HCA max > > See http://www.openfabrics.org/downloads/dapl/ more details. > > -arlin > Done, Regards, Vladimir From ogerlitz at voltaire.com Wed May 21 23:30:40 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 22 May 2008 09:30:40 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> Message-ID: <48351310.5090108@voltaire.com> Dror Goldenberg wrote: > When you post a fast register WQE, you specify the new 8 LSBits to > be assigned to the MR. The rest 24 MSBits are the ones that you obtained > while allocating the MR, and they persist throughout the lifetime of > this MR. OK, thanks Dror. Steve, do we agree on this point? if yes, the next version of the patches should include the new rkey value (or just the new 8 LSbits) in the work request. Or. 
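(An aside, not part of the posted patches: a sketch of the rkey arithmetic being discussed, assuming the 24/8 split Dror describes above. The helper name is invented, and the work-request field in the comment is a placeholder for whatever layout the final patches expose.)

/* The 24 MSBits of the rkey are fixed when the MR is allocated; a
 * fast-register work request supplies only the 8 LSBits ("key").
 * A ULP that wants a fresh rkey for every mapping could derive it
 * like this: */
#include <linux/types.h>

static inline u32 mr_next_rkey(u32 cur_rkey, u8 key)
{
	return (cur_rkey & 0xffffff00) | key;
}

/* e.g. bump the key on each re-registration so that a stale remote
 * access using the previous rkey is guaranteed to fail:
 *	wr.wr.fast_reg.rkey = mr_next_rkey(mr->rkey, ++map_seq);
 */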
From ogerlitz at voltaire.com Wed May 21 23:49:07 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 22 May 2008 09:49:07 +0300 Subject: [ofa-general] Re: the so many IPoIB-UD failures introduced by OFED 1.3 In-Reply-To: References: <20080508171936.GU24293@sgi.com> <20080508174358.GW24293@sgi.com> <20080510190721.GI5298@sgi.com> <482A8574.8070201@voltaire.com> Message-ID: <48351763.4070705@voltaire.com> Roland Dreier wrote: > > Maybe it's about time for the Linux IB maintainers to get a little angry?! > > I'm not angry about it, although I have pretty much given up on trying > to debug IPoIB issues seen running anything other than an upstream > kernel. It seems like the OFED maintainers, the enterprise distros and > their customers should be more concerned about the failure of the OFED > process -- clearly producing something much buggier and less reliable > than the stock kernel is not what anyone wants. Still, it wastes your time... For example, this thread ended up being the ninth(!) bug in a patch which was not reviewed on the mailing list (see https://bugs.openfabrics.org/show_bug.cgi?id=1004#c12) and which is now on hold, waiting to be sent for review, since it does not provide any benefit (see http://lists.openfabrics.org/pipermail/general/2008-March/048322.html). Can we get any crazier than that? Or.
From sashak at voltaire.com Thu May 22 00:37:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 May 2008 10:37:03 +0300 Subject: [ofa-general] saquery port problems In-Reply-To: References: Message-ID: <20080522073703.GA31474@sashak.voltaire.com> Hi Matthias, On 16:48 Wed 21 May , Matthias Blankenhaus wrote: > I have a patch that fixes the problem: > > diff -Narpu infiniband-diags-1.3.6.vanilla/src/saquery.c my/src/saquery.c > --- infiniband-diags-1.3.6.vanilla/src/saquery.c 2008-02-28 > 00:58:36.000000000 -0800 > +++ my/src/saquery.c 2008-05-21 16:08:19.583221794 -0700 > @@ -1304,13 +1304,13 @@ get_bind_handle(void) > ca_name_index++; > if (sa_port_num && sa_port_num != attr_array[i].port_num) > continue; > - if (sa_hca_name && i == 0) > - continue; > if (sa_hca_name > && strcmp(sa_hca_name, vendor->ca_names[ca_name_index]) != 0) > continue; > - if (attr_array[i].link_state == IB_LINK_ACTIVE) > + if (attr_array[i].link_state == IB_LINK_ACTIVE) { > port_guid = attr_array[i].port_guid; > + break; > + } > } > > > I have tested it and it solves the problem. > > Does this look ok? Yes, this looks correct. Thanks for fixing this. I will just need your 'Signed-off-by:' line in order to apply the patch. Sasha
From eli at mellanox.co.il Thu May 22 01:51:04 2008 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 22 May 2008 11:51:04 +0300 Subject: [ofa-general] [PATCH] IB/ipoib: set max CM MTU when moving to CM mode Message-ID: <1211446264.7310.49.camel@eli-laptop> >From c878b9d3a4cfd031e8baaba46a224b46b1ced441 Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Thu, 22 May 2008 11:45:04 +0300 Subject: [PATCH] IB/ipoib: set max CM MTU when moving to CM mode This will relieve the user of the need to restore the CM mode MTU every time we switch from UD to CM mode. Signed-off-by: Eli Cohen --- I would like to push this patch to ofed 1.3.1 too.
drivers/infiniband/ulp/ipoib/ipoib_cm.c | 5 +++++ 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 97e67d3..e6f57dd 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -1387,6 +1387,11 @@ static ssize_t set_mode(struct device *d, struct device_attribute *attr, dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_SG | NETIF_F_TSO); priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM; + if (ipoib_cm_max_mtu(dev) > priv->mcast_mtu) + ipoib_warn(priv, "mtu > %d will cause multicast packet drops.\n", + priv->mcast_mtu); + dev->mtu = ipoib_cm_max_mtu(dev); + ipoib_flush_paths(dev); return count; } -- 1.5.5.1 From vlad at lists.openfabrics.org Thu May 22 03:10:44 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 22 May 2008 03:10:44 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080522-0200 daily build status Message-ID: <20080522101044.DDC07E60E8B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with 
linux-2.6.18-53.el5 Log: from include/linux/mutex.h:13, from /home/vlad/tmp/ofa_1_3_kernel-20080522-0200_linux-2.6.18-53.el5_x86_64_check/drivers/infiniband/core/addr.c:31: /home/vlad/tmp/ofa_1_3_kernel-20080522-0200_linux-2.6.18-53.el5_x86_64_check/kernel_addons/backport/2.6.18-EL5.2/include/linux/log2.h:53: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'is_power_of_2' make[4]: *** [/home/vlad/tmp/ofa_1_3_kernel-20080522-0200_linux-2.6.18-53.el5_x86_64_check/drivers/infiniband/core/addr.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_3_kernel-20080522-0200_linux-2.6.18-53.el5_x86_64_check/drivers/infiniband/core] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_3_kernel-20080522-0200_linux-2.6.18-53.el5_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_3_kernel-20080522-0200_linux-2.6.18-53.el5_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.18-53.el5' make: *** [kernel] Error 2 ----------------------------------------------------------------------------------
From cousin_vinnie at hotmail.fr Thu May 22 04:29:21 2008 From: cousin_vinnie at hotmail.fr (Renaud Durand) Date: Thu, 22 May 2008 13:29:21 +0200 Subject: [ofa-general] iSER problem Message-ID: Hello, I tried to run iSCSI on my SLES10 SP1 computers. To do this, I followed the tutorials on OFED's wiki (https://wiki.openfabrics.org/tiki-index.php?page=iSER and https://wiki.openfabrics.org/tiki-index.php?page=ISER-target). My problem is this: the target is working well (I can "discover" myself with 127.0.0.1, with my Ethernet address, and with my IB address): linux-target:~ # iscsiadm -m discovery -t sendtargets -p 127.0.0.1 161.74.X.X:3260,1 iqn.2001-04.com.example:storage.disk2.amiens.sys1.xyz 127.0.0.1:3260,1 iqn.2001-04.com.example:storage.disk2.amiens.sys1.xyz linux-target:~ # but the remote computer can't discover the target: linux-cx5e:~ # iscsiadm -m discovery -t sendtargets -p 161.74.X.X iscsiadm: connection to discovery address 161.74.X.X failed iscsiadm: connection to discovery address 161.74.X.X failed iscsiadm: connection to discovery address 161.74.X.X failed iscsiadm: connection to discovery address 161.74.X.X failed iscsiadm: connection to discovery address 161.74.X.X failed iscsiadm: connection login retries (reopen_max) 5 exceeded The ping works, though: linux-cx5e:~ # ping 161.74.X.X PING 161.74.83.128 (161.74.X.X) 56(84) bytes of data. 64 bytes from 161.74.X.X: icmp_seq=1 ttl=64 time=2.27 ms 64 bytes from 161.74.X.X: icmp_seq=2 ttl=64 time=0.068 ms Here is lsmod on my computer: linux-cx5e:~ # lsmod | grep iscsi iscsi_tcp 27520 0 libiscsi 30208 1 iscsi_tcp scsi_transport_iscsi 34320 3 iscsi_tcp,libiscsi scsi_mod 156600 6 iscsi_tcp,libiscsi,scsi_transport_iscsi,sg,libata,sd_mod I really don't understand what the problem is. If you have a suggestion or solution, please tell me, because I am desperate. _________________________________________________________________ Retouch, organize and share your photos for free with the Photo Gallery software! http://www.windowslive.fr/galerie/ -------------- next part -------------- An HTML attachment was scrubbed...
URL: From Thomas.Talpey at netapp.com Thu May 22 04:52:30 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 22 May 2008 07:52:30 -0400 Subject: [ofa-general] iSER problem In-Reply-To: References: Message-ID: At 07:29 AM 5/22/2008, Renaud Durand wrote: >but the remote computer can't discover the target > >linux-cx5e:~ # iscsiadm -m discovery -t sendtargets -p 161.74.X.X >iscsiadm: connection to discovery address 161.74.X.X failed Do you have a firewall protecting the iSCSI ports perhaps? Tom. From kliteyn at dev.mellanox.co.il Thu May 22 05:17:18 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 22 May 2008 15:17:18 +0300 Subject: [ofa-general] [PATCH] ibutils: fix some log messages Message-ID: <4835644E.2050508@dev.mellanox.co.il> Hi Oren, Fixing some log messages from 'info' to 'error' Signed-off-by: Yevgeny Kliteynik --- ibmgtsim/src/dispatcher.cpp | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/ibmgtsim/src/dispatcher.cpp b/ibmgtsim/src/dispatcher.cpp index 02b63fe..1fd9512 100644 --- a/ibmgtsim/src/dispatcher.cpp +++ b/ibmgtsim/src/dispatcher.cpp @@ -309,12 +309,12 @@ IBMSDispatcher::routeMadToDestByLid( int hops = 0; MSGREG(inf0, 'I', "Routing MAD mgmt_class:$ method:$ tid:$ to lid:$ from:$ port:$", "dispatcher"); - MSGREG(inf1, 'I', "Got to dead-end routing to lid:$ at node:$", + MSGREG(inf1, 'E', "Got to dead-end routing to lid:$ at node:$ (fdb)", "dispatcher"); MSGREG(inf2, 'I', "Arrived at lid $ = node $ after $ hops", "dispatcher"); - MSGREG(inf3, 'I', "Got to dead-end routing to lid:$ at node:$ port:$", + MSGREG(inf3, 'E', "Got to dead-end routing to lid:$ at node:$ port:$", "dispatcher"); - MSGREG(inf4, 'I', "Got to dead-end routing to lid:$ at HCA node:$ port:$ lid:$", + MSGREG(inf4, 'E', "Got to dead-end routing to lid:$ at HCA node:$ port:$ lid:$", "dispatcher"); MSGREG(inf5, 'V', "Got node:$ through port:$", "dispatcher"); -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Thu May 22 05:20:53 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 22 May 2008 15:20:53 +0300 Subject: [ofa-general] [PATCH] ibutils: fixing seg. fault in ibis_gsi_mad_ctrl.c Message-ID: <48356525.9010700@dev.mellanox.co.il> Hi Oren, Fixing seg fault in allocation of gsi management class vector. 
Signed-off-by: Yevgeny Kliteynik --- ibis/src/ibis_gsi_mad_ctrl.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/ibis/src/ibis_gsi_mad_ctrl.c b/ibis/src/ibis_gsi_mad_ctrl.c index 356d33d..bfb5fe6 100644 --- a/ibis/src/ibis_gsi_mad_ctrl.c +++ b/ibis/src/ibis_gsi_mad_ctrl.c @@ -731,7 +731,7 @@ ibis_gsi_mad_ctrl_set_class_attr_cb( size = cl_vector_get_size(&p_ctrl->class_vector); if (size <= mad_class) { - cl_status = cl_vector_set_size(&p_ctrl->class_vector,mad_class); + cl_status = cl_vector_set_size(&p_ctrl->class_vector,mad_class+1); if( cl_status != CL_SUCCESS) { -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Thu May 22 05:33:13 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 22 May 2008 15:33:13 +0300 Subject: [ofa-general] [PATCH] ibutils: remove trailing blanks Message-ID: <48356809.3070405@dev.mellanox.co.il> Hi Oren, Removing trailing blanks in ibis.i Signed-off-by: Yevgeny Kliteynik --- ibis/src/ibis.i | 216 +++++++++++++++++++++++++++--------------------------- 1 files changed, 108 insertions(+), 108 deletions(-) diff --git a/ibis/src/ibis.i b/ibis/src/ibis.i index 21766f2..a3c9b45 100644 --- a/ibis/src/ibis.i +++ b/ibis/src/ibis.i @@ -96,11 +96,11 @@ ibisp_is_debug(void) // // TYPE MAPS: -// +// %include ibis_typemaps.i // -// exception handling wrapper based on the MsgMgr interfaces +// exception handling wrapper based on the MsgMgr interfaces // %{ @@ -110,7 +110,7 @@ ibisp_is_debug(void) void ibis_set_tcl_error(char *err) { if (strlen(err) < 1024) strcpy(ibis_tcl_error_msg, err); - else + else strncpy(ibis_tcl_error_msg, err, 1024); ibis_tcl_error = 1; } @@ -122,24 +122,24 @@ ibisp_is_debug(void) if (!IbisObj.initialized) { Tcl_SetStringObj( - Tcl_GetObjResult(interp), + Tcl_GetObjResult(interp), "ibis was not yet initialized. please use ibis_init and then ibis_set_port before.", -1); return TCL_ERROR; } - + if (! IbisObj.port_guid) { Tcl_SetStringObj( - Tcl_GetObjResult(interp), + Tcl_GetObjResult(interp), " ibis was not yet initialized. please use ibis_set_port before.", -1); return TCL_ERROR; } ibis_tcl_error = 0; - $function; - if (ibis_tcl_error) { + $function; + if (ibis_tcl_error) { Tcl_SetStringObj(Tcl_GetObjResult(interp), ibis_tcl_error_msg, -1); - return TCL_ERROR; + return TCL_ERROR; } } @@ -178,10 +178,10 @@ ibisp_is_debug(void) ibis_t IbisObj; static ibis_opt_t *ibis_opt_p; ibis_opt_t IbisOpts; - - /* initialize the ibis object - is not done during init so we + + /* initialize the ibis object - is not done during init so we can play with the options ... 
*/ - int ibis_ui_init(void) + int ibis_ui_init(void) { ib_api_status_t status; #ifdef OSM_BUILD_OPENIB @@ -202,8 +202,8 @@ ibisp_is_debug(void) printf("-E- fail to init ibcr_init.\n"); ibcr_destroy( p_ibcr_global ); exit(1); - } - + } + status = ibpm_init(p_ibpm_global); if( status != IB_SUCCESS ) { @@ -218,7 +218,7 @@ ibisp_is_debug(void) printf("-E- Fail to init ibvs_init.\n"); ibvs_destroy( p_ibvs_global ); exit(1); - } + } status = ibbbm_init(p_ibbbm_global); if( status != IB_SUCCESS ) @@ -226,7 +226,7 @@ ibisp_is_debug(void) printf("-E- Fail to init ibbbm_init.\n"); ibbbm_destroy( p_ibbbm_global ); exit(1); - } + } status = ibsm_init(gp_ibsm); if( status != IB_SUCCESS ) @@ -234,7 +234,7 @@ ibisp_is_debug(void) printf("-E- Fail to init ibbbm_init.\n"); ibsm_destroy( gp_ibsm ); exit(1); - } + } return 0; } @@ -265,12 +265,12 @@ ibisp_is_debug(void) /* simply return the active port guid ibis is binded to */ uint64_t ibis_get_port(void) { - return (IbisObj.port_guid); + return (IbisObj.port_guid); } /* set the port we bind to and initialize sub packages */ int ibis_set_port(uint64_t port_guid) - { + { ib_api_status_t status; if (! IbisObj.initialized) { @@ -280,7 +280,7 @@ ibisp_is_debug(void) } IbisObj.port_guid = port_guid; - + status = ibcr_bind(p_ibcr_global); if( status != IB_SUCCESS ) { @@ -303,15 +303,15 @@ ibisp_is_debug(void) printf("-E- Fail to ibvs_bind.\n"); ibvs_destroy( p_ibvs_global ); exit(1); - } - + } + status = ibbbm_bind(p_ibbbm_global); if( status != IB_SUCCESS ) { printf("-E- Fail to ibbbm_bind.\n"); ibbbm_destroy( p_ibbbm_global ); exit(1); - } + } status = ibsm_bind(gp_ibsm); if( status != IB_SUCCESS ) @@ -319,9 +319,9 @@ ibisp_is_debug(void) printf("-E- Fail to ibsm_bind.\n"); ibsm_destroy( gp_ibsm ); exit(1); - } + } - if (ibsac_bind(&IbisObj)) + if (ibsac_bind(&IbisObj)) { printf("-E- Fail to ibsac_bind.\n"); exit(1); @@ -345,7 +345,7 @@ ibisp_is_debug(void) } int ibis_set_transaction_timeout( uint32_t timeout_ms ) { - osm_log(&(IbisObj.log), + osm_log(&(IbisObj.log), OSM_LOG_VERBOSE, " Setting timeout to:%u[msec]\n", timeout_ms); IbisOpts.transaction_timeout = timeout_ms; @@ -364,16 +364,16 @@ ibisp_is_debug(void) Tcl_Obj *p_obj; if (!IbisObj.initialized) - { + { Tcl_SetStringObj( - Tcl_GetObjResult(interp), + Tcl_GetObjResult(interp), "ibis was not yet initialized. please use ibis_init before.", -1); return TCL_ERROR; } - + /* command options */ tcl_result = Tcl_GetObjResult(interp); - + if ((objc < 1) || (objc > 1)) { Tcl_SetStringObj(tcl_result,"Wrong # args. ibis_get_local_ports_info ",-1); return TCL_ERROR; @@ -394,12 +394,12 @@ ibisp_is_debug(void) return( TCL_ERROR ); } - /* - Go over all ports and build the return value + /* + Go over all ports and build the return value */ for( i = 0; i < num_ports; i++ ) { - + // start with 1 on host channel adapters. sprintf(res, "0x%016" PRIx64 " 0x%04X %s %u", cl_ntoh64( attr_array[i].port_guid ), @@ -407,11 +407,11 @@ ibisp_is_debug(void) ib_get_port_state_str( attr_array[i].link_state ), attr_array[i].port_num ); - + p_obj = Tcl_NewStringObj(res, strlen(res)); Tcl_ListObjAppendElement(interp, tcl_result, p_obj); } - + return TCL_OK; } @@ -421,7 +421,7 @@ ibisp_is_debug(void) // // INTERFACE DEFINITION (~copy of h file) -// +// %section "IBIS Constants" /* These constants are provided by IBIS: */ @@ -471,7 +471,7 @@ typedef struct _ibis_opt { /* If TRUE - forces flash after each log message (TRUE). 
*/ uint8_t log_flags; /* The log levels to be used */ - char log_file[1024]; + char log_file[1024]; /* The name of the log file used (read only) */ uint64_t sm_key; /* The SM_Key to be used when sending SubnetMgt and SubnetAdmin MADs */ @@ -497,7 +497,7 @@ int ibis_set_transaction_timeout(uint32_t timeout_ms); %text %{ ibis_get_local_ports_info [return list] - Return the list of available IB ports with GUID, LID and State. + Return the list of available IB ports with GUID, LID and State. %} extern char * ibisSourceVersion; @@ -509,12 +509,12 @@ extern char * ibisSourceVersion; /* Make sure that the osmv, complib and ibisp use same modes (debug/free) */ - if ( osm_is_debug() != cl_is_debug() || - osm_is_debug() != ibisp_is_debug() || + if ( osm_is_debug() != cl_is_debug() || + osm_is_debug() != ibisp_is_debug() || ibisp_is_debug() != cl_is_debug() ) { fprintf(stderr, "-E- OSMV, Complib and Ibis were compiled using different modes\n"); - fprintf(stderr, "-E- OSMV debug:%d Complib debug:%d IBIS debug:%d \n", + fprintf(stderr, "-E- OSMV debug:%d Complib debug:%d IBIS debug:%d \n", osm_is_debug(), cl_is_debug(), ibisp_is_debug() ); exit(1); } @@ -526,7 +526,7 @@ extern char * ibisSourceVersion; /* we initialize the structs etc only once. */ if (0 == notFirstTime++) { Tcl_StaticPackage(interp, "ibis", Ibis_Init, NULL); - Tcl_PkgProvide(interp, "ibis", IBIS_VERSION); + Tcl_PkgProvide(interp, "ibis", IBIS_VERSION); /* Default Options */ memset(&IbisOpts, 0,sizeof(ibis_opt_t)); IbisOpts.transaction_timeout = 4*OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC; @@ -538,45 +538,45 @@ extern char * ibisSourceVersion; IbisOpts.log_flags = OSM_LOG_ERROR; strcpy(IbisOpts.log_file,"/tmp/ibis.log"); - + /* we want all exists to cleanup */ Tcl_CreateExitHandler(ibis_exit, NULL); - + /* ------------------ IBCR ---------------------- */ p_ibcr_global = ibcr_construct(); - + if (p_ibcr_global == NULL) { printf("-E- Error from ibcr_construct.\n"); exit(1); - } - + } + /* ------------------ IBPM ---------------------- */ p_ibpm_global = ibpm_construct(); - + if (p_ibpm_global == NULL) { printf("-E- Error from ibpm_construct.\n"); exit(1); - } - + } + /* ------------------ IBVS ---------------------- */ p_ibvs_global = ibvs_construct(); - + if (p_ibvs_global == NULL) { printf("-E- Error from ibvs_construct.\n"); exit(1); - } + } /* ------------------ IBBBM ---------------------- */ p_ibbbm_global = ibbbm_construct(); - + if (p_ibbbm_global == NULL) { printf("-E- Error from ibbbm_construct.\n"); exit(1); - } + } /* ------------------ IBSM ---------------------- */ gp_ibsm = ibsm_construct(); - + if (gp_ibsm == NULL) { printf("-E- Error from ibsm_construct.\n"); exit(1); @@ -593,7 +593,7 @@ extern char * ibisSourceVersion; memset(&ibsm_sm_info_obj, 0, sizeof(ib_sm_info_t)); /* ------------------ IBSAC ---------------------- */ - + /* Initialize global records */ memset(&ibsac_node_rec, 0,sizeof(ibsac_node_rec)); memset(&ibsac_portinfo_rec, 0,sizeof(ibsac_portinfo_rec)); @@ -603,10 +603,10 @@ extern char * ibisSourceVersion; memset(&ibsac_path_rec, 0, sizeof(ib_path_rec_t)); memset(&ibsac_lft_rec, 0, sizeof(ib_lft_record_t)); memset(&ibsac_mcm_rec, 0, sizeof(ib_member_rec_t)); - - /* + + /* * A1 Supported features: - * + * * Query: Rec/Info Types Done * * NodeRecord (nr, ni) Y @@ -627,7 +627,7 @@ extern char * ibisSourceVersion; * VLArbTableRecord (vlarb) Y * PKeyTableRecord (pkr, pkt) Y */ - + /* We use alternate SWIG Objects mangling */ SWIG_AltMnglInit(); SWIG_AltMnglRegTypeToPrefix("_sacNodeInfo_p", "ni"); @@ -650,24 +650,24 
@@ extern char * ibisSourceVersion; SWIG_AltMnglRegTypeToPrefix("_sacVlArbRec_p", "vlarb"); SWIG_AltMnglRegTypeToPrefix("_sacPKeyTbl_p", "pkt"); SWIG_AltMnglRegTypeToPrefix("_sacPKeyRec_p", "pkr"); - + // register the pre-allocated objects SWIG_AltMnglRegObj("ni",&(ibsac_node_rec.node_info)); SWIG_AltMnglRegObj("nr",&(ibsac_node_rec)); - + SWIG_AltMnglRegObj("pi", &(ibsac_portinfo_rec.port_info)); SWIG_AltMnglRegObj("pir",&(ibsac_portinfo_rec)); - + SWIG_AltMnglRegObj("smi", &(ibsac_sminfo_rec.sm_info)); SWIG_AltMnglRegObj("smir",&(ibsac_sminfo_rec)); - + SWIG_AltMnglRegObj("swi", &(ibsac_swinfo_rec.switch_info)); SWIG_AltMnglRegObj("swir",&(ibsac_swinfo_rec)); SWIG_AltMnglRegObj("path",&(ibsac_path_rec)); - + SWIG_AltMnglRegObj("link",&(ibsac_link_rec)); - + SWIG_AltMnglRegObj("lft",&(ibsac_lft_rec)); SWIG_AltMnglRegObj("mcm",&(ibsac_mcm_rec)); @@ -675,18 +675,18 @@ extern char * ibisSourceVersion; SWIG_AltMnglRegObj("cpi",&(ibsac_class_port_info)); SWIG_AltMnglRegObj("info",&(ibsac_inform_info)); SWIG_AltMnglRegObj("svc",&(ibsac_svc_rec)); - + SWIG_AltMnglRegObj("slvt", &(ibsac_slvl_rec.slvl_tbl)); SWIG_AltMnglRegObj("slvr", &(ibsac_slvl_rec)); - + SWIG_AltMnglRegObj("vlarb", &(ibsac_vlarb_rec)); - + SWIG_AltMnglRegObj("pkt", &(ibsac_pkey_rec.pkey_tbl)); SWIG_AltMnglRegObj("pkr", &(ibsac_pkey_rec)); - + usleep(1000); } - + /* we defined this as a native command so declare it in here */ Tcl_CreateObjCommand(interp, "ibis_get_local_ports_info", ibis_get_local_ports_info, NULL, NULL); @@ -697,113 +697,113 @@ extern char * ibisSourceVersion; (ClientData)ibis_opt_p, 0); /* add commands for accessing the global query records */ - Tcl_CreateObjCommand(interp,"smNodeInfoMad", + Tcl_CreateObjCommand(interp,"smNodeInfoMad", TclsmNodeInfoMethodCmd, (ClientData)&ibsm_node_info_obj, 0); - - Tcl_CreateObjCommand(interp,"smPortInfoMad", + + Tcl_CreateObjCommand(interp,"smPortInfoMad", TclsmPortInfoMethodCmd, (ClientData)&ibsm_port_info_obj, 0); - - Tcl_CreateObjCommand(interp,"smSwitchInfoMad", + + Tcl_CreateObjCommand(interp,"smSwitchInfoMad", TclsmSwInfoMethodCmd, (ClientData)&ibsm_switch_info_obj, 0); - - Tcl_CreateObjCommand(interp,"smLftBlockMad", + + Tcl_CreateObjCommand(interp,"smLftBlockMad", TclsmLftBlockMethodCmd, (ClientData)&ibsm_lft_block_obj, 0); - - Tcl_CreateObjCommand(interp,"smMftBlockMad", + + Tcl_CreateObjCommand(interp,"smMftBlockMad", TclsmMftBlockMethodCmd, (ClientData)&ibsm_mft_block_obj, 0); - Tcl_CreateObjCommand(interp,"smGuidInfoMad", + Tcl_CreateObjCommand(interp,"smGuidInfoMad", TclsmGuidInfoMethodCmd, (ClientData)&ibsm_guid_info_obj, 0); - Tcl_CreateObjCommand(interp,"smPkeyTableMad", + Tcl_CreateObjCommand(interp,"smPkeyTableMad", TclsmPkeyTableMethodCmd, (ClientData)&ibsm_pkey_table_obj, 0); - Tcl_CreateObjCommand(interp,"smSlVlTableMad", + Tcl_CreateObjCommand(interp,"smSlVlTableMad", TclsmSlVlTableMethodCmd, (ClientData)&ibsm_slvl_table_obj, 0); - Tcl_CreateObjCommand(interp,"smVlArbTableMad", + Tcl_CreateObjCommand(interp,"smVlArbTableMad", TclsmVlArbTableMethodCmd, (ClientData)&ibsm_vl_arb_table_obj, 0); - Tcl_CreateObjCommand(interp,"smSMInfoMad", + Tcl_CreateObjCommand(interp,"smSMInfoMad", TclsmSMInfoMethodCmd, (ClientData)&ibsm_sm_info_obj, 0); - Tcl_CreateObjCommand(interp,"smNodeDescMad", + Tcl_CreateObjCommand(interp,"smNodeDescMad", TclsmNodeDescMethodCmd, (ClientData)&ibsm_node_desc_obj, 0); - Tcl_CreateObjCommand(interp,"smNoticeMad", + Tcl_CreateObjCommand(interp,"smNoticeMad", TclsmNoticeMethodCmd, (ClientData)&ibsm_notice_obj, 0); - 
Tcl_CreateObjCommand(interp,"sacNodeQuery", + Tcl_CreateObjCommand(interp,"sacNodeQuery", TclsacNodeRecMethodCmd, (ClientData)&ibsac_node_rec, 0); - - Tcl_CreateObjCommand(interp,"sacPortQuery", + + Tcl_CreateObjCommand(interp,"sacPortQuery", TclsacPortRecMethodCmd, (ClientData)&ibsac_portinfo_rec, 0); - - Tcl_CreateObjCommand(interp,"sacSmQuery", + + Tcl_CreateObjCommand(interp,"sacSmQuery", TclsacSmRecMethodCmd, (ClientData)&ibsac_sminfo_rec, 0); - - Tcl_CreateObjCommand(interp,"sacSwQuery", + + Tcl_CreateObjCommand(interp,"sacSwQuery", TclsacSwRecMethodCmd, (ClientData)&ibsac_swinfo_rec, 0); - - Tcl_CreateObjCommand(interp,"sacLinkQuery", + + Tcl_CreateObjCommand(interp,"sacLinkQuery", TclsacLinkRecMethodCmd, (ClientData)&ibsac_link_rec, 0); - Tcl_CreateObjCommand(interp,"sacPathQuery", + Tcl_CreateObjCommand(interp,"sacPathQuery", TclsacPathRecMethodCmd, (ClientData)&ibsac_path_rec, 0); - Tcl_CreateObjCommand(interp,"sacLFTQuery", + Tcl_CreateObjCommand(interp,"sacLFTQuery", TclsacLFTRecMethodCmd, (ClientData)&ibsac_lft_rec, 0); - Tcl_CreateObjCommand(interp,"sacMCMQuery", + Tcl_CreateObjCommand(interp,"sacMCMQuery", TclsacMCMRecMethodCmd, (ClientData)&ibsac_mcm_rec, 0); - Tcl_CreateObjCommand(interp,"sacClassPortInfoQuery", + Tcl_CreateObjCommand(interp,"sacClassPortInfoQuery", TclsacClassPortInfoMethodCmd, (ClientData)&ibsac_class_port_info, 0); - Tcl_CreateObjCommand(interp,"sacInformInfoQuery", + Tcl_CreateObjCommand(interp,"sacInformInfoQuery", TclsacInformInfoMethodCmd, (ClientData)&ibsac_inform_info, 0); - Tcl_CreateObjCommand(interp,"sacServiceQuery", + Tcl_CreateObjCommand(interp,"sacServiceQuery", TclsacServiceRecMethodCmd, (ClientData)&ibsac_svc_rec, 0); - Tcl_CreateObjCommand(interp,"sacSLVlQuery", + Tcl_CreateObjCommand(interp,"sacSLVlQuery", TclsacSlVlRecMethodCmd, (ClientData)&ibsac_slvl_rec, 0); - Tcl_CreateObjCommand(interp,"sacVlArbQuery", + Tcl_CreateObjCommand(interp,"sacVlArbQuery", TclsacVlArbRecMethodCmd, (ClientData)&ibsac_vlarb_rec, 0); - Tcl_CreateObjCommand(interp,"sacPKeyQuery", + Tcl_CreateObjCommand(interp,"sacPKeyQuery", TclsacPKeyRecMethodCmd, (ClientData)&ibsac_pkey_rec, 0); /* - use an embedded Tcl code for doing init if given command line - parameters: -port_num + use an embedded Tcl code for doing init if given command line + parameters: -port_num */ Tcl_GlobalEval( interp, @@ -835,4 +835,4 @@ extern char * ibisSourceVersion; "}\n"); } %} - + -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Thu May 22 05:56:32 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 22 May 2008 15:56:32 +0300 Subject: [ofa-general] [PATCH] ibutils: remove trailing blanks in Makefile.am Message-ID: <48356D80.60805@dev.mellanox.co.il> Hi Oren, Removing trailing blanks in ibis/src/Makefile.am Signed-off-by: Yevgeny Kliteynik --- ibis/src/Makefile.am | 12 ++++++------ 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/ibis/src/Makefile.am b/ibis/src/Makefile.am index 6b1fe09..7166906 100644 --- a/ibis/src/Makefile.am +++ b/ibis/src/Makefile.am @@ -48,13 +48,13 @@ AM_CXXFLAGS = $(TCL_CPPFLAGS) $(OSM_CFLAGS) $(DBG) -fno-strict-aliasing -fPIC - # ibis shared library version triplet is: # API_ID:API_VER:NUM_PREV_API_SUP = x:y:z # * change of API_ID means new API -# * change of AGE means how many API backward compt +# * change of AGE means how many API backward compt # * change of API_VER is required every version # Results with SO version: x-z:z:y LIB_VER_TRIPLET="1:0:0" LIB_FILE_TRIPLET=1.0.0 -lib_LTLIBRARIES = libibiscom.la libibis.la 
+lib_LTLIBRARIES = libibiscom.la libibis.la libibiscom_la_SOURCES = ibbbm.c ibcr.c ibis.c ibis_gsi_mad_ctrl.c \ ibpm.c ibsac.c ibsm.c ibvs.c @@ -74,9 +74,9 @@ LDADD = $(OSM_LDFLAGS) ibis_SOURCES = ibissh_wrap.cpp -ibis_LDFLAGS = -static +ibis_LDFLAGS = -static # note the order of the libraries does matter as we static link -ibis_LDADD = -libiscom $(OSM_LDFLAGS) $(TCL_LIBS) +ibis_LDADD = -libiscom $(OSM_LDFLAGS) $(TCL_LIBS) # SWIG FILES: @@ -122,7 +122,7 @@ ibis_wrap.c: @MAINTAINER_MODE_TRUE@ $(SWIG_IFC_FILES) swig -I$(srcdir) -dhtml -tcl8 -o swig_wrap.c $(srcdir)/ibis.i $(srcdir)/fixSwigWrapper -g -s -p -o ibis_wrap.c if test ibis_wrap.c -ef $(srcdir)/ibis_wrap.c; then cp ibis_wrap.c $(srcdir)/ibis_wrap.c; fi - rm -f swig_wrap.c + rm -f swig_wrap.c ibissh_wrap.cpp: @MAINTAINER_MODE_TRUE@ $(SWIG_IFC_FILES) swig -I$(srcdir) -dhtml -tcl8 -ltclsh.i -o swig_wrap.c $(srcdir)/ibis.i @@ -156,7 +156,7 @@ EXTRA_DIST = swig_extended_obj.c fixSwigWrapper pkgIndex.tcl \ install-libLTLIBRARIES: # this actually over write the lib install -install-exec-am: install-binPROGRAMS +install-exec-am: install-binPROGRAMS mkdir -p $(DESTDIR)/$(libdir)/ibis$(VERSION) cp .libs/libibis.so.$(LIB_FILE_TRIPLET) $(DESTDIR)/$(libdir)/ibis$(VERSION)/libibis.so.$(VERSION) sed 's/%VERSION%/'$(VERSION)'/g' $(srcdir)/pkgIndex.tcl > $(DESTDIR)/$(libdir)/ibis$(VERSION)/pkgIndex.tcl -- 1.5.1.4
From swise at opengridcomputing.com Thu May 22 06:45:14 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 22 May 2008 08:45:14 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48351310.5090108@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> Message-ID: <483578EA.4070503@opengridcomputing.com> Or Gerlitz wrote: > Dror Goldenberg wrote: >> When you post a fast register WQE, you specify the new 8 LSBits to >> be assigned to the MR. The rest 24 MSBits are the ones that you obtained >> while allocating the MR, and they persist throughout the lifetime of >> this MR. > OK, thanks Dror. > > Steve, do we agree on this point? if yes, the next version of the > patches should include the new rkey value (or just the new 8 LSbits) in > the work request. > Are we sure we need to expose this to the user? > Or. >
From sashak at voltaire.com Thu May 22 06:53:29 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 May 2008 16:53:29 +0300 Subject: [ofa-general] [PATCH] saquery: --smkey command line option Message-ID: <20080522135329.GB32128@sashak.voltaire.com> This adds the possibility to specify an SM_Key value with saquery. It should work with queries where OSM_DEFAULT_SM_KEY was used. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/saquery.c | 11 ++++++++--- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index ed61721..8edac5d 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -69,6 +69,7 @@ char *argv0 = "saquery";
@@ -730,7 +731,7 @@ get_all_records(osm_bind_handle_t bind_handle, int trusted) { return get_any_records(bind_handle, query_id, 0, 0, NULL, attr_offset, - trusted ? OSM_DEFAULT_SM_KEY : 0); + trusted ? smkey : 0); } /** @@ -1254,8 +1255,7 @@ print_pkey_tbl_records(const struct query_cmd *q, osm_bind_handle_t bind_handle, status = get_any_records(bind_handle, IB_MAD_ATTR_PKEY_TBL_RECORD, 0, comp_mask, &pktr, - ib_get_attr_offset(sizeof(pktr)), - OSM_DEFAULT_SM_KEY); + ib_get_attr_offset(sizeof(pktr)), smkey); if (status != IB_SUCCESS) return status; @@ -1411,6 +1411,7 @@ usage(void) "IPv6 format\n"); fprintf(stderr, " -C specify the SA query HCA\n"); fprintf(stderr, " -P specify the SA query port\n"); + fprintf(stderr, " --smkey specify SM_Key value for the query\n"); fprintf(stderr, " -t | --timeout specify the SA query " "response timeout (default %u msec)\n", DEFAULT_SA_TIMEOUT_MS); @@ -1466,6 +1467,7 @@ main(int argc, char **argv) {"sgid-to-dgid", 1, 0, 2}, {"timeout", 1, 0, 't'}, {"node-name-map", 1, 0, 3}, + {"smkey", 1, 0, 4}, { } }; @@ -1512,6 +1514,9 @@ main(int argc, char **argv) case 3: node_name_map_file = strdup(optarg); break; + case 4: + smkey = cl_hton64(strtoull(optarg, NULL, 0)); + break; case 'p': query_type = IB_MAD_ATTR_PATH_RECORD; break; -- 1.5.5.1.178.g1f811 From monis at Voltaire.COM Thu May 22 06:58:33 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Thu, 22 May 2008 16:58:33 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> Message-ID: <48357C09.1040302@Voltaire.COM> Hal, Roland Thanks for the comments. The patch below tries to address the issues that were raised in its previous form. Please note that I'm only asking for opinion for now. If the idea is acceptable then I will recreate more elegant patch with the required fixes if any and with respect to previous comments (such as replacing 0,1 and 2 with textual names). The idea in few words is to flush only paths but keeping address handles in ipoib_neigh. This will trigger a new path lookup when an ARP probe arrives and eventually an addess handle renewal. In the meantime, the old address handle is kept and can be used. In most cases this address handle is a valid address handle and when it is not than the situatio is not worse than before. My tests show that this patch completes the improvement that was archived with patch #1 to zero packet loss (tested with ping flood) when SM change event occurs. 
thanks MoniS diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index ca126fc..8ef6573 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -276,10 +276,11 @@ struct ipoib_dev_priv { struct delayed_work pkey_poll_task; struct delayed_work mcast_task; - struct work_struct flush_task; + struct work_struct flush_task0; + struct work_struct flush_task1; + struct work_struct flush_task2; struct work_struct restart_task; struct delayed_work ah_reap_task; - struct work_struct pkey_event_task; struct ib_device *ca; u8 port; @@ -423,11 +424,14 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_ah *address, u32 qpn); void ipoib_reap_ah(struct work_struct *work); +void ipoib_flush_paths_only(struct net_device *dev); void ipoib_flush_paths(struct net_device *dev); struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); -void ipoib_ib_dev_flush(struct work_struct *work); +void ipoib_ib_dev_flush0(struct work_struct *work); +void ipoib_ib_dev_flush1(struct work_struct *work); +void ipoib_ib_dev_flush2(struct work_struct *work); void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index f429bce..5a6bbe8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -898,7 +898,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) return 0; } -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) { struct ipoib_dev_priv *cpriv; struct net_device *dev = priv->dev; @@ -911,7 +911,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) * the parent is down. 
*/ list_for_each_entry(cpriv, &priv->child_intfs, list) - __ipoib_ib_dev_flush(cpriv, pkey_event); + __ipoib_ib_dev_flush(cpriv, level); mutex_unlock(&priv->vlan_mutex); @@ -925,7 +925,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) return; } - if (pkey_event) { + if (level == 2) { if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ipoib_ib_dev_down(dev, 0); @@ -943,11 +943,13 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) priv->pkey_index = new_index; } - ipoib_dbg(priv, "flushing\n"); - - ipoib_ib_dev_down(dev, 0); + ipoib_flush_paths_only(dev); + ipoib_mcast_dev_flush(dev); + + if (level >= 1) + ipoib_ib_dev_down(dev, 0); - if (pkey_event) { + if (level >= 2) { ipoib_ib_dev_stop(dev, 0); ipoib_ib_dev_open(dev); } @@ -957,29 +959,36 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) * we get here, don't bring it back up if it's not configured up */ if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { - ipoib_ib_dev_up(dev); + if (level >= 1) + ipoib_ib_dev_up(dev); ipoib_mcast_restart_task(&priv->restart_task); } } -void ipoib_ib_dev_flush(struct work_struct *work) +void ipoib_ib_dev_flush0(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, flush_task); + container_of(work, struct ipoib_dev_priv, flush_task0); - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); __ipoib_ib_dev_flush(priv, 0); } -void ipoib_pkey_event(struct work_struct *work) +void ipoib_ib_dev_flush1(struct work_struct *work) { struct ipoib_dev_priv *priv = - container_of(work, struct ipoib_dev_priv, pkey_event_task); + container_of(work, struct ipoib_dev_priv, flush_task1); - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name); __ipoib_ib_dev_flush(priv, 1); } +void ipoib_ib_dev_flush2(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, flush_task2); + + __ipoib_ib_dev_flush(priv, 2); +} + void ipoib_ib_dev_cleanup(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 2442090..c41798d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -259,6 +259,21 @@ static int __path_add(struct net_device *dev, struct ipoib_path *path) return 0; } +static void path_free_only(struct net_device *dev, struct ipoib_path *path) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh, *tn; + struct sk_buff *skb; + unsigned long flags; + + while ((skb = __skb_dequeue(&path->queue))) + dev_kfree_skb_irq(skb); + + if (path->ah) + ipoib_put_ah(path->ah); + + kfree(path); +} static void path_free(struct net_device *dev, struct ipoib_path *path) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -350,6 +365,34 @@ void ipoib_path_iter_read(struct ipoib_path_iter *iter, #endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ +void ipoib_flush_paths_only(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_path *path, *tp; + LIST_HEAD(remove_list); + + spin_lock_irq(&priv->tx_lock); + spin_lock(&priv->lock); + + list_splice_init(&priv->path_list, &remove_list); + + list_for_each_entry(path, &remove_list, list) + rb_erase(&path->rb_node, &priv->path_tree); + + list_for_each_entry_safe(path, tp, &remove_list, list) { + if (path->query) + 
ib_sa_cancel_query(path->query_id, path->query); + spin_unlock(&priv->lock); + spin_unlock_irq(&priv->tx_lock); + wait_for_completion(&path->done); + path_free_only(dev, path); + spin_lock_irq(&priv->tx_lock); + spin_lock(&priv->lock); + } + spin_unlock(&priv->lock); + spin_unlock_irq(&priv->tx_lock); +} + void ipoib_flush_paths(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -421,6 +464,8 @@ static void path_rec_completion(int status, __skb_queue_tail(&skqueue, skb); list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { + if (neigh->ah) + ipoib_put_ah(neigh->ah); kref_get(&path->ah->ref); neigh->ah = path->ah; memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw, @@ -989,9 +1034,10 @@ static void ipoib_setup(struct net_device *dev) INIT_LIST_HEAD(&priv->multicast_list); INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 8766d29..80c0409 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -289,15 +289,16 @@ void ipoib_event(struct ib_event_handler *handler, if (record->element.port_num != priv->port) return; - if (record->event == IB_EVENT_PORT_ERR || - record->event == IB_EVENT_PORT_ACTIVE || - record->event == IB_EVENT_LID_CHANGE || - record->event == IB_EVENT_SM_CHANGE || - record->event == IB_EVENT_CLIENT_REREGISTER) { - ipoib_dbg(priv, "Port state change event\n"); - queue_work(ipoib_workqueue, &priv->flush_task); + ipoib_dbg(priv, "Event %d on device %s port %d\n",record->event, + record->device->name, record->element.port_num); + if ( record->event == IB_EVENT_SM_CHANGE || + record->event == IB_EVENT_CLIENT_REREGISTER) { + queue_work(ipoib_workqueue, &priv->flush_task0); + } else if (record->event == IB_EVENT_PORT_ERR || + record->event == IB_EVENT_PORT_ACTIVE || + record->event == IB_EVENT_LID_CHANGE) { + queue_work(ipoib_workqueue, &priv->flush_task1); } else if (record->event == IB_EVENT_PKEY_CHANGE) { - ipoib_dbg(priv, "P_Key change event on port:%d\n", priv->port); - queue_work(ipoib_workqueue, &priv->pkey_event_task); + queue_work(ipoib_workqueue, &priv->flush_task2); } }
From sashak at voltaire.com Thu May 22 07:09:16 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 May 2008 17:09:16 +0300 Subject: [ofa-general] OSM_DEFAULT_SM_KEY byte order Message-ID: <20080522140916.GC32128@sashak.voltaire.com> Hi, I noticed that the OSM_DEFAULT_SM_KEY macro is defined and used in host byte order, which means it has different values on LE and BE machines (as a result we could see some osmtest failures between x86 and G5).
The fix could be trivial: diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h index 62d472e..7cc2757 100644 --- a/opensm/include/opensm/osm_base.h +++ b/opensm/include/opensm/osm_base.h @@ -117,7 +117,7 @@ BEGIN_C_DECLS * * SYNOPSIS */ -#define OSM_DEFAULT_SM_KEY 1 +#define OSM_DEFAULT_SM_KEY CL_HTON64(1) /********/ /****s* OpenSM: Base/OSM_DEFAULT_LMC * NAME , but some backward compatibility (currently I know that OSM_DEFAULT_SM_KEY is used with 'osmtest' and 'saquery') could be lost. Is this so important? Ideas? Sasha
From swise at opengridcomputing.com Thu May 22 07:22:42 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 22 May 2008 09:22:42 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4833EC5B.8070504@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EC5B.8070504@voltaire.com> Message-ID: <483581B2.7000109@opengridcomputing.com> Or Gerlitz wrote: > Steve Wise wrote: >> So you allocate the rkey/stag up front, allocate page_lists up front, >> then as needed you populate your page list and bind it to the >> rkey/stag via IB_WR_FAST_REG_MR, and invalidate that mapping via >> IB_WR_INVALIDATE_MR. You can do this any number of times, and with >> proper fencing, you can pipeline these mappings. Eventually when >> you're done doing IO (like for NFSRDMA when the mount is unmounted) >> you free up the page list(s) and mr/rkey/stag. > Yes, that was my thought as well. > > Just to make sure, by "proper fencing" your understanding is that for > both IB and iWARP the ULP should not wait for the fmr work request to > complete and post the send work-request carrying the rkey/stag with the > IB_SEND_FENCE flag? > > Looking in the IB spec, it seems that the fence indicator only applies > to previous rdma-read / atomic operations, e.g. in section 11.4.1.1 POST > SEND REQUEST it says: >> Fence indicator. If the fence indicator is set, then all prior RDMA >> Read and Atomic Work Requests on the queue must be completed before >> starting to process this Work Request. > The fast register and invalidate work requests require that they be completed by the device _before_ processing any subsequent work requests. So you can post subsequent SEND WRs that utilize the rkey without problems. In addition, invalidate allows a local fence, which means the device will not begin processing the invalidate until all _prior_ work requests complete (similar to a read fence, but for all prior WRs). Steve.
From ogerlitz at voltaire.com Thu May 22 07:33:12 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 22 May 2008 17:33:12 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483578EA.4070503@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> Message-ID: <48358428.2000902@voltaire.com> Steve Wise wrote: > Are we sure we need to expose this to the user? I believe this is the way to go if we want to let smart ULPs generate a new rkey/stag per mapping.
Simpler ULPs could then just put the same value for each map associated with the same mr. Or.
From hrosenstock at xsigo.com Thu May 22 07:46:49 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 07:46:49 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080522135329.GB32128@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> Message-ID: <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> Sasha, On Thu, 2008-05-22 at 16:53 +0300, Sasha Khapyorsky wrote: > This adds the possibility to specify an SM_Key value with saquery. It should > work with queries where OSM_DEFAULT_SM_KEY was used. I think this starts down a slippery slope and perhaps sets a bad precedent for MKey as well. I know this is useful as a debug tool, but it compromises what purports to be "security" IMO, as it means the keys need to be too widely known. -- Hal > Signed-off-by: Sasha Khapyorsky > --- > infiniband-diags/src/saquery.c | 11 ++++++++--- > 1 files changed, 8 insertions(+), 3 deletions(-) > > diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c > index ed61721..8edac5d 100644 > --- a/infiniband-diags/src/saquery.c > +++ b/infiniband-diags/src/saquery.c > @@ -69,6 +69,7 @@ char *argv0 = "saquery"; > > static char *node_name_map_file = NULL; > static nn_map_t *node_name_map = NULL; > +static ib_net64_t smkey = OSM_DEFAULT_SM_KEY; > > /** > * Declare some globals because I don't want this to be too complex. > @@ -730,7 +731,7 @@ get_all_records(osm_bind_handle_t bind_handle, > int trusted) > { > return get_any_records(bind_handle, query_id, 0, 0, NULL, attr_offset, > - trusted ? OSM_DEFAULT_SM_KEY : 0); > + trusted ? smkey : 0); > } > > /** > @@ -1254,8 +1255,7 @@ print_pkey_tbl_records(const struct query_cmd *q, osm_bind_handle_t bind_handle, > > status = get_any_records(bind_handle, IB_MAD_ATTR_PKEY_TBL_RECORD, 0, > comp_mask, &pktr, > - ib_get_attr_offset(sizeof(pktr)), > - OSM_DEFAULT_SM_KEY); > + ib_get_attr_offset(sizeof(pktr)), smkey); > if (status != IB_SUCCESS) > return status; > > @@ -1411,6 +1411,7 @@ usage(void) > "IPv6 format\n"); > fprintf(stderr, " -C specify the SA query HCA\n"); > fprintf(stderr, " -P specify the SA query port\n"); > + fprintf(stderr, " --smkey specify SM_Key value for the query\n"); > fprintf(stderr, " -t | --timeout specify the SA query " > "response timeout (default %u msec)\n", > DEFAULT_SA_TIMEOUT_MS); > @@ -1466,6 +1467,7 @@ main(int argc, char **argv) > {"sgid-to-dgid", 1, 0, 2}, > {"timeout", 1, 0, 't'}, > {"node-name-map", 1, 0, 3}, > + {"smkey", 1, 0, 4}, > { } > }; > > @@ -1512,6 +1514,9 @@ main(int argc, char **argv) > case 3: > node_name_map_file = strdup(optarg); > break; > + case 4: > + smkey = cl_hton64(strtoull(optarg, NULL, 0)); > + break; > case 'p': > query_type = IB_MAD_ATTR_PATH_RECORD; > break;
From sashak at voltaire.com Thu May 22 07:48:10 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 May 2008 17:48:10 +0300 Subject: [ofa-general] [PATCH] opensm/scripts/opensm.init.in: fix status command Message-ID: <20080522144810.GD32128@sashak.voltaire.com> This script is installed on SuSE systems where the 'status' command/shell function doesn't exist (bug#982 https://bugs.openfabrics.org/show_bug.cgi?id=982).
Signed-off-by: Sasha Khapyorsky --- opensm/scripts/opensm.init.in | 8 +++++++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/opensm/scripts/opensm.init.in b/opensm/scripts/opensm.init.in index da23b36..a573662 100644 --- a/opensm/scripts/opensm.init.in +++ b/opensm/scripts/opensm.init.in @@ -81,7 +81,13 @@ stop () { } Xstatus () { - status opensm + pid="`pidof opensm`" + ret=$? + if [ $ret -eq 0 ] ; then + echo "OpenSM is running... pid=$pid" + else + echo "OpenSM is not running." + fi } restart() { -- 1.5.4.rc2.60.gb2e62 From hrosenstock at xsigo.com Thu May 22 07:52:41 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 07:52:41 -0700 Subject: [ofa-general] Re: OSM_DEFAULT_SM_KEY byte order In-Reply-To: <20080522140916.GC32128@sashak.voltaire.com> References: <20080522140916.GC32128@sashak.voltaire.com> Message-ID: <1211467961.18236.178.camel@hrosenstock-ws.xsigo.com> Sasha, On Thu, 2008-05-22 at 17:09 +0300, Sasha Khapyorsky wrote: > Hi, > > I noticed that OSM_DEFAULT_SM_KEY macro is defined and used in host byte > order, this means it has different values on LE and BE machines (as > result we could see some osmtest failures between x86 and G5). The fix > could be trivial: > diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h > index 62d472e..7cc2757 100644 > --- a/opensm/include/opensm/osm_base.h > +++ b/opensm/include/opensm/osm_base.h > @@ -117,7 +117,7 @@ BEGIN_C_DECLS > * > * SYNOPSIS > */ > -#define OSM_DEFAULT_SM_KEY 1 > +#define OSM_DEFAULT_SM_KEY CL_HTON64(1) > /********/ > /****s* OpenSM: Base/OSM_DEFAULT_LMC > * NAME > > > , but sort of backward compatibility (currently I know that > OSM_DEFAULT_SM_KEY is used with 'osmtest' and 'saquery') could be lost. > Is this so important? Ideas? IMO yes, I think this breaks both backward compatibility and what was actually observed from some other SMs during interop testing. I agree it needs fixing but I think the proper thing is probably more like: #define OSM_DEFAULT_SM_KEY CL_HTON64(0x0100000000000000); -- Hal > Sasha From kliteyn at mellanox.co.il Thu May 22 07:56:28 2008 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 22 May 2008 17:56:28 +0300 Subject: [ofa-general] [PATCH] opensm/scripts/opensm.init.in: fix status command In-Reply-To: <20080522144810.GD32128@sashak.voltaire.com> References: <20080522144810.GD32128@sashak.voltaire.com> Message-ID: <4835899C.80400@mellanox.co.il> Great, thanks. -- Yevgeny Sasha Khapyorsky wrote: > This script is installed in SuSE systems where 'status' command/shell > function doesn't exist (bug#982 > https://bugs.openfabrics.org/show_bug.cgi?id=982). > > Signed-off-by: Sasha Khapyorsky > --- > opensm/scripts/opensm.init.in | 8 +++++++- > 1 files changed, 7 insertions(+), 1 deletions(-) > > diff --git a/opensm/scripts/opensm.init.in b/opensm/scripts/opensm.init.in > index da23b36..a573662 100644 > --- a/opensm/scripts/opensm.init.in > +++ b/opensm/scripts/opensm.init.in > @@ -81,7 +81,13 @@ stop () { > } > > Xstatus () { > - status opensm > + pid="`pidof opensm`" > + ret=$? > + if [ $ret -eq 0 ] ; then > + echo "OpenSM is running... pid=$pid" > + else > + echo "OpenSM is not running." 
> + fi
> }
>
> restart() {
>

From sashak at voltaire.com  Thu May 22 07:56:07 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 22 May 2008 17:56:07 +0300
Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option
In-Reply-To: <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com>
References: <20080522135329.GB32128@sashak.voltaire.com>
	<1211467609.18236.171.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080522145607.GE32128@sashak.voltaire.com>

On 07:46 Thu 22 May     , Hal Rosenstock wrote:
> Sasha,
>
> On Thu, 2008-05-22 at 16:53 +0300, Sasha Khapyorsky wrote:
> > This adds possibility to specify SM_Key value with saquery. It should
> > work with queries where OSM_DEFAULT_SM_KEY was used.
>
> I think this starts down a slippery slope and perhaps bad precedent for
> MKey as well. I know this is useful as a debug tool but compromises what
> purports as "security" IMO as this means the keys need to be too widely
> known.

When a value different from OSM_DEFAULT_SM_KEY is configured on the OpenSM
side, a user may or may not know it; in the latter case saquery will not
work (just like now). I don't see a hole.

Sasha

From hrosenstock at xsigo.com  Thu May 22 08:07:06 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Thu, 22 May 2008 08:07:06 -0700
Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to
	groups and handle each according to level of severity
In-Reply-To: <48357C09.1040302@Voltaire.COM>
References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM>
	<48357C09.1040302@Voltaire.COM>
Message-ID: <1211468826.18236.184.camel@hrosenstock-ws.xsigo.com>

Moni,

On Thu, 2008-05-22 at 16:58 +0300, Moni Shoua wrote:
> Hal, Roland
> Thanks for the comments. The patch below tries to address the issues that were
> raised in its previous form. Please note that I'm only asking for opinion for now.
> If the idea is acceptable then I will recreate more elegant patch with the required
> fixes if any and with respect to previous comments (such as replacing 0,1 and 2 with
> textual names).
>
> The idea in few words is to flush only paths but keeping address handles in ipoib_neigh.
> This will trigger a new path lookup when an ARP probe arrives and eventually an address
> handle renewal. In the meantime, the old address handle is kept and can be used. In most
> cases this address handle is a valid address handle, and when it is not, the situation
> is not worse than before.

This part seems OK to me.

> My tests show that this patch completes the improvement that was achieved with patch #1
> to zero packet loss (tested with ping flood) when SM change event occurs.

Looks to me like SM change is still "level 0". I may have missed it but
I don't see how this addresses the general architectural concerns
previously raised. This patch may work in your test environment but I
don't think that covers all the cases.
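For reference, the mechanism that does seem OK -- keeping the old address
handle alive until the new path record resolves -- comes down to the
path_rec_completion() hunk in the patch below. A simplified sketch of that
hunk follows; the comments are added here for explanation and are not part
of the patch:

	/* When the deferred SA path query completes, each neigh drops its
	 * reference on the stale AH and takes one on the freshly resolved
	 * AH. Senders still holding the old AH keep a valid handle until
	 * its refcount falls to zero, so traffic can continue while the
	 * query is outstanding. */
	list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
		if (neigh->ah)
			ipoib_put_ah(neigh->ah);	/* release the stale AH */
		kref_get(&path->ah->ref);		/* hold the new AH */
		neigh->ah = path->ah;			/* switch senders over */
	}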
-- Hal > thanks > > MoniS > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h > index ca126fc..8ef6573 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > @@ -276,10 +276,11 @@ struct ipoib_dev_priv { > > struct delayed_work pkey_poll_task; > struct delayed_work mcast_task; > - struct work_struct flush_task; > + struct work_struct flush_task0; > + struct work_struct flush_task1; > + struct work_struct flush_task2; > struct work_struct restart_task; > struct delayed_work ah_reap_task; > - struct work_struct pkey_event_task; > > struct ib_device *ca; > u8 port; > @@ -423,11 +424,14 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, > struct ipoib_ah *address, u32 qpn); > void ipoib_reap_ah(struct work_struct *work); > > +void ipoib_flush_paths_only(struct net_device *dev); > void ipoib_flush_paths(struct net_device *dev); > struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); > > int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); > -void ipoib_ib_dev_flush(struct work_struct *work); > +void ipoib_ib_dev_flush0(struct work_struct *work); > +void ipoib_ib_dev_flush1(struct work_struct *work); > +void ipoib_ib_dev_flush2(struct work_struct *work); > void ipoib_pkey_event(struct work_struct *work); > void ipoib_ib_dev_cleanup(struct net_device *dev); > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > index f429bce..5a6bbe8 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > @@ -898,7 +898,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) > return 0; > } > > -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) > { > struct ipoib_dev_priv *cpriv; > struct net_device *dev = priv->dev; > @@ -911,7 +911,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > * the parent is down. 
> */ > list_for_each_entry(cpriv, &priv->child_intfs, list) > - __ipoib_ib_dev_flush(cpriv, pkey_event); > + __ipoib_ib_dev_flush(cpriv, level); > > mutex_unlock(&priv->vlan_mutex); > > @@ -925,7 +925,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > return; > } > > - if (pkey_event) { > + if (level == 2) { > if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { > clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > ipoib_ib_dev_down(dev, 0); > @@ -943,11 +943,13 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > priv->pkey_index = new_index; > } > > - ipoib_dbg(priv, "flushing\n"); > - > - ipoib_ib_dev_down(dev, 0); > + ipoib_flush_paths_only(dev); > + ipoib_mcast_dev_flush(dev); > + > + if (level >= 1) > + ipoib_ib_dev_down(dev, 0); > > - if (pkey_event) { > + if (level >= 2) { > ipoib_ib_dev_stop(dev, 0); > ipoib_ib_dev_open(dev); > } > @@ -957,29 +959,36 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > * we get here, don't bring it back up if it's not configured up > */ > if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { > - ipoib_ib_dev_up(dev); > + if (level >= 1) > + ipoib_ib_dev_up(dev); > ipoib_mcast_restart_task(&priv->restart_task); > } > } > > -void ipoib_ib_dev_flush(struct work_struct *work) > +void ipoib_ib_dev_flush0(struct work_struct *work) > { > struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, flush_task); > + container_of(work, struct ipoib_dev_priv, flush_task0); > > - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); > __ipoib_ib_dev_flush(priv, 0); > } > > -void ipoib_pkey_event(struct work_struct *work) > +void ipoib_ib_dev_flush1(struct work_struct *work) > { > struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, pkey_event_task); > + container_of(work, struct ipoib_dev_priv, flush_task1); > > - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name); > __ipoib_ib_dev_flush(priv, 1); > } > > +void ipoib_ib_dev_flush2(struct work_struct *work) > +{ > + struct ipoib_dev_priv *priv = > + container_of(work, struct ipoib_dev_priv, flush_task2); > + > + __ipoib_ib_dev_flush(priv, 2); > +} > + > void ipoib_ib_dev_cleanup(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 2442090..c41798d 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -259,6 +259,21 @@ static int __path_add(struct net_device *dev, struct ipoib_path *path) > return 0; > } > > +static void path_free_only(struct net_device *dev, struct ipoib_path *path) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct ipoib_neigh *neigh, *tn; > + struct sk_buff *skb; > + unsigned long flags; > + > + while ((skb = __skb_dequeue(&path->queue))) > + dev_kfree_skb_irq(skb); > + > + if (path->ah) > + ipoib_put_ah(path->ah); > + > + kfree(path); > +} > static void path_free(struct net_device *dev, struct ipoib_path *path) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > @@ -350,6 +365,34 @@ void ipoib_path_iter_read(struct ipoib_path_iter *iter, > > #endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ > > +void ipoib_flush_paths_only(struct net_device *dev) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct ipoib_path *path, *tp; > + LIST_HEAD(remove_list); > + > + spin_lock_irq(&priv->tx_lock); > + spin_lock(&priv->lock); > + > 
+ list_splice_init(&priv->path_list, &remove_list); > + > + list_for_each_entry(path, &remove_list, list) > + rb_erase(&path->rb_node, &priv->path_tree); > + > + list_for_each_entry_safe(path, tp, &remove_list, list) { > + if (path->query) > + ib_sa_cancel_query(path->query_id, path->query); > + spin_unlock(&priv->lock); > + spin_unlock_irq(&priv->tx_lock); > + wait_for_completion(&path->done); > + path_free_only(dev, path); > + spin_lock_irq(&priv->tx_lock); > + spin_lock(&priv->lock); > + } > + spin_unlock(&priv->lock); > + spin_unlock_irq(&priv->tx_lock); > +} > + > void ipoib_flush_paths(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > @@ -421,6 +464,8 @@ static void path_rec_completion(int status, > __skb_queue_tail(&skqueue, skb); > > list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { > + if (neigh->ah) > + ipoib_put_ah(neigh->ah); > kref_get(&path->ah->ref); > neigh->ah = path->ah; > memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw, > @@ -989,9 +1034,10 @@ static void ipoib_setup(struct net_device *dev) > INIT_LIST_HEAD(&priv->multicast_list); > > INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); > - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); > INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); > - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); > + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); > + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); > + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); > INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); > INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); > } > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > index 8766d29..80c0409 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > @@ -289,15 +289,16 @@ void ipoib_event(struct ib_event_handler *handler, > if (record->element.port_num != priv->port) > return; > > - if (record->event == IB_EVENT_PORT_ERR || > - record->event == IB_EVENT_PORT_ACTIVE || > - record->event == IB_EVENT_LID_CHANGE || > - record->event == IB_EVENT_SM_CHANGE || > - record->event == IB_EVENT_CLIENT_REREGISTER) { > - ipoib_dbg(priv, "Port state change event\n"); > - queue_work(ipoib_workqueue, &priv->flush_task); > + ipoib_dbg(priv, "Event %d on device %s port %d\n",record->event, > + record->device->name, record->element.port_num); > + if ( record->event == IB_EVENT_SM_CHANGE || > + record->event == IB_EVENT_CLIENT_REREGISTER) { > + queue_work(ipoib_workqueue, &priv->flush_task0); > + } else if (record->event == IB_EVENT_PORT_ERR || > + record->event == IB_EVENT_PORT_ACTIVE || > + record->event == IB_EVENT_LID_CHANGE) { > + queue_work(ipoib_workqueue, &priv->flush_task1); > } else if (record->event == IB_EVENT_PKEY_CHANGE) { > - ipoib_dbg(priv, "P_Key change event on port:%d\n", priv->port); > - queue_work(ipoib_workqueue, &priv->pkey_event_task); > + queue_work(ipoib_workqueue, &priv->flush_task2); > } > } From hrosenstock at xsigo.com Thu May 22 08:10:29 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 08:10:29 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080522145607.GE32128@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> Message-ID: <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> On Thu, 
2008-05-22 at 17:56 +0300, Sasha Khapyorsky wrote:
> On 07:46 Thu 22 May     , Hal Rosenstock wrote:
> > Sasha,
> >
> > On Thu, 2008-05-22 at 16:53 +0300, Sasha Khapyorsky wrote:
> > > This adds possibility to specify SM_Key value with saquery. It should
> > > work with queries where OSM_DEFAULT_SM_KEY was used.
> >
> > I think this starts down a slippery slope and perhaps bad precedent for
> > MKey as well. I know this is useful as a debug tool but compromises what
> > purports as "security" IMO as this means the keys need to be too widely
> > known.
>
> When a value different from OSM_DEFAULT_SM_KEY is configured on the OpenSM
> side, a user may or may not know it; in the latter case saquery will not
> work (just like now). I don't see a hole.

I think it will tend towards a proliferation of keys, which will defeat any
security/trust. The idea of SMKey was to keep it private between SMs. This
is now spreading it wider IMO. I'm sure other patches will follow in the
same vein once an MKey manager exists.

-- Hal

> Sasha

From meier3 at llnl.gov  Thu May 22 08:15:04 2008
From: meier3 at llnl.gov (Timothy A. Meier)
Date: Thu, 22 May 2008 08:15:04 -0700
Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts
	with error if not root
Message-ID: <48358DF8.2060603@llnl.gov>

Sasha,

Trivial patch to enforce root for these perl scripts. More importantly,
the scripts no longer fail silently when not run as root; they exit with
an error code.

--
Timothy A. Meier
Computer Scientist
ICCD/High Performance Computing
925.422.3341 meier3 at llnl.gov

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 0001-infiniband-diags-terminate-perl-scripts-with-error.patch
URL: 

From hrosenstock at xsigo.com  Thu May 22 08:17:47 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Thu, 22 May 2008 08:17:47 -0700
Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts
	with error if not root
In-Reply-To: <48358DF8.2060603@llnl.gov>
References: <48358DF8.2060603@llnl.gov>
Message-ID: <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com>

Tim,

On Thu, 2008-05-22 at 08:15 -0700, Timothy A. Meier wrote:
> Sasha,
>
> Trivial patch to enforce root for these perl scripts. More importantly,
> the scripts no longer fail silently when not run as root; they exit with
> an error code.

Should these enforce root, or be based on the udev permissions for umad,
which default to root?

-- Hal

> plain text document attachment (0001-infiniband-diags-terminate-perl-
> scripts-with-error.patch)
> >From f4058a22d31dc31f0e8ecdffcc42bff065eefcce Mon Sep 17 00:00:00 2001
> From: Tim Meier
> Date: Wed, 21 May 2008 16:40:18 -0700
> Subject: [PATCH] infiniband-diags: terminate perl scripts with error if not root
>
> Adds the "auth_check" routine at the beginning of each main, which
> terminates with an error if not invoked as root.
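If the udev route were preferred, the check could probe the umad device
node the diags actually drive instead of testing for euid 0. A purely
hypothetical C sketch -- the access() test and the device path are
assumptions for illustration, not anything in this patch; pick the umad
node for the port in use:

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* Hypothetical: the conventional node for the first umad port */
		const char *umad_dev = "/dev/infiniband/umad0";

		/* Succeeds for root by default (udev makes umad nodes
		 * root-only), but also honors any relaxed permissions an
		 * admin has configured on the device node. */
		if (access(umad_dev, R_OK | W_OK) != 0) {
			perror(umad_dev);
			return 1;
		}
		printf("%s is accessible\n", umad_dev);
		return 0;
	}

Out of the box this behaves like the euid check, but it would not lock out
a non-root user who has deliberately been granted access to the device.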
> > Signed-off-by: Tim Meier > --- > infiniband-diags/scripts/IBswcountlimits.pm | 10 ++++++++++ > infiniband-diags/scripts/ibfindnodesusing.pl | 1 + > infiniband-diags/scripts/ibidsverify.pl | 1 + > infiniband-diags/scripts/iblinkinfo.pl | 1 + > infiniband-diags/scripts/ibprintca.pl | 1 + > infiniband-diags/scripts/ibprintrt.pl | 1 + > infiniband-diags/scripts/ibprintswitch.pl | 1 + > infiniband-diags/scripts/ibqueryerrors.pl | 1 + > infiniband-diags/scripts/ibswportwatch.pl | 1 + > 9 files changed, 18 insertions(+), 0 deletions(-) > > diff --git a/infiniband-diags/scripts/IBswcountlimits.pm b/infiniband-diags/scripts/IBswcountlimits.pm > index 9bc356f..0b7563e 100755 > --- a/infiniband-diags/scripts/IBswcountlimits.pm > +++ b/infiniband-diags/scripts/IBswcountlimits.pm > @@ -123,6 +123,16 @@ sub check_counters > "Total number of packets, excluding link packets, received on all VLs to the port" > ); > > +# ========================================================================= > +# only root is authorized, terminate with msg and err code > +# > +sub auth_check > +{ > + if ( $> != 0 ) { > + die "Permission denied, must be root\n"; > + } > +} > + > sub check_data_counters > { > my $print_action = $_[0]; > diff --git a/infiniband-diags/scripts/ibfindnodesusing.pl b/infiniband-diags/scripts/ibfindnodesusing.pl > index 1bf0987..49003af 100755 > --- a/infiniband-diags/scripts/ibfindnodesusing.pl > +++ b/infiniband-diags/scripts/ibfindnodesusing.pl > @@ -168,6 +168,7 @@ sub compress_hostlist > # > sub main > { > + auth_check; > my $found_switch = undef; > my $cache_file = get_cache_file($ca_name, $ca_port); > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > diff --git a/infiniband-diags/scripts/ibidsverify.pl b/infiniband-diags/scripts/ibidsverify.pl > index de78e6b..b857166 100755 > --- a/infiniband-diags/scripts/ibidsverify.pl > +++ b/infiniband-diags/scripts/ibidsverify.pl > @@ -163,6 +163,7 @@ sub insert_portguid > > sub main > { > + auth_check; > if ($regenerate_map > || !(-f "$IBswcountlimits::cache_dir/ibnetdiscover.topology")) > { > diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl > index a195474..4bb9598 100755 > --- a/infiniband-diags/scripts/iblinkinfo.pl > +++ b/infiniband-diags/scripts/iblinkinfo.pl > @@ -98,6 +98,7 @@ my $extra_smpquery_params = get_ca_name_port_param_string($ca_name, $ca_port); > > sub main > { > + auth_check; > get_link_ends($regenerate_map, $ca_name, $ca_port); > if (defined($direct_route)) { > # convert DR to guid, then use original single_switch option > diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl > index 38b4330..d5c5fba 100755 > --- a/infiniband-diags/scripts/ibprintca.pl > +++ b/infiniband-diags/scripts/ibprintca.pl > @@ -88,6 +88,7 @@ if ($target_hca eq "") { > # > sub main > { > + auth_check; > my $found_hca = undef; > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > my $in_hca = "no"; > diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl > index 86dcb64..c6070ff 100755 > --- a/infiniband-diags/scripts/ibprintrt.pl > +++ b/infiniband-diags/scripts/ibprintrt.pl > @@ -88,6 +88,7 @@ if ($target_rt eq "") { > # > sub main > { > + auth_check; > my $found_rt = undef; > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > my $in_rt = "no"; > diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl > index 
6712201..41a5131 100755 > --- a/infiniband-diags/scripts/ibprintswitch.pl > +++ b/infiniband-diags/scripts/ibprintswitch.pl > @@ -87,6 +87,7 @@ if ($target_switch eq "") { > # > sub main > { > + auth_check; > my $found_switch = undef; > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > my $in_switch = "no"; > diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl > index c807c02..3330687 100755 > --- a/infiniband-diags/scripts/ibqueryerrors.pl > +++ b/infiniband-diags/scripts/ibqueryerrors.pl > @@ -185,6 +185,7 @@ $cache_file = get_cache_file($ca_name, $ca_port); > > sub main > { > + auth_check; > if (@IBswcountlimits::suppress_errors) { > my $msg = join(",", @IBswcountlimits::suppress_errors); > print "Suppressing: $msg\n"; > diff --git a/infiniband-diags/scripts/ibswportwatch.pl b/infiniband-diags/scripts/ibswportwatch.pl > index 6d6ba1c..76398fa 100755 > --- a/infiniband-diags/scripts/ibswportwatch.pl > +++ b/infiniband-diags/scripts/ibswportwatch.pl > @@ -157,6 +157,7 @@ my $sw_port = $ARGV[1]; > > sub main > { > + auth_check; > clear_counters; > get_new_counts($sw_addr, $sw_port); > while ($cycle != 0) { > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Thu May 22 08:17:57 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 08:17:57 -0700 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> Message-ID: <1211469478.18236.198.camel@hrosenstock-ws.xsigo.com> Roland, On Mon, 2008-05-19 at 15:23 -0700, Roland Dreier wrote: > I see two issues with this patch: > > - Is is architecturally guaranteed by the IB spec that flushing unicast > info is not required on an SM change or client reregister event? FWIW, here's my take on what the IBA spec says relative to this: I don't think there's an issue with client reregister AFAIK. Client registrations refer to subscriptions only. SM change is another matter IMO and is not guaranteed as has been pointed out in several earlier posts in this thread. -- Hal From olga.shern at gmail.com Thu May 22 08:28:19 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Thu, 22 May 2008 18:28:19 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <1211468826.18236.184.camel@hrosenstock-ws.xsigo.com> References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <48357C09.1040302@Voltaire.COM> <1211468826.18236.184.camel@hrosenstock-ws.xsigo.com> Message-ID: On 5/22/08, Hal Rosenstock wrote: > > Moni, > > On Thu, 2008-05-22 at 16:58 +0300, Moni Shoua wrote: > > Hal, Roland > > Thanks for the comments. The patch below tries to address the issues that > were > > raised in its previous form. Please note that I'm only asking for opinion > for now. > > If the idea is acceptable then I will recreate more elegant patch with > the required > > fixes if any and with respect to previous comments (such as replacing 0,1 > and 2 with > > textual names). > > > > The idea in few words is to flush only paths but keeping address handles > in ipoib_neigh. 
> > This will trigger a new path lookup when an ARP probe arrives and > eventually an addess > > handle renewal. In the meantime, the old address handle is kept and can > be used. In most > > cases this address handle is a valid address handle and when it is not > than the situatio > > is not worse than before. > > This part seems OK to me. > > > My tests show that this patch completes the improvement that was archived > with patch #1 > > to zero packet loss (tested with ping flood) when SM change event occurs. > > Looks to me like SM change is still "level 0". I may have missed it but > I don't see how this addresses the general architectural concerns > previously raised. This patch may work in your test environment but I > don't think that covers all the cases. Hal, You pointed out that we cannot rely on the assumption that on SM failover there is not path change. In the previous patch we only flush multicast. What Moni changed in this patch is that on SM failover (SM change event), we will flush not only multicast but also all paths but without destroying ah. Olga -- Hal > > > thanks > > > > MoniS > > > > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h > b/drivers/infiniband/ulp/ipoib/ipoib.h > > index ca126fc..8ef6573 100644 > > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > > @@ -276,10 +276,11 @@ struct ipoib_dev_priv { > > > > struct delayed_work pkey_poll_task; > > struct delayed_work mcast_task; > > - struct work_struct flush_task; > > + struct work_struct flush_task0; > > + struct work_struct flush_task1; > > + struct work_struct flush_task2; > > struct work_struct restart_task; > > struct delayed_work ah_reap_task; > > - struct work_struct pkey_event_task; > > > > struct ib_device *ca; > > u8 port; > > @@ -423,11 +424,14 @@ void ipoib_send(struct net_device *dev, struct > sk_buff *skb, > > struct ipoib_ah *address, u32 qpn); > > void ipoib_reap_ah(struct work_struct *work); > > > > +void ipoib_flush_paths_only(struct net_device *dev); > > void ipoib_flush_paths(struct net_device *dev); > > struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); > > > > int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int > port); > > -void ipoib_ib_dev_flush(struct work_struct *work); > > +void ipoib_ib_dev_flush0(struct work_struct *work); > > +void ipoib_ib_dev_flush1(struct work_struct *work); > > +void ipoib_ib_dev_flush2(struct work_struct *work); > > void ipoib_pkey_event(struct work_struct *work); > > void ipoib_ib_dev_cleanup(struct net_device *dev); > > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > > index f429bce..5a6bbe8 100644 > > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > > @@ -898,7 +898,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct > ib_device *ca, int port) > > return 0; > > } > > > > -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int > pkey_event) > > +static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) > > { > > struct ipoib_dev_priv *cpriv; > > struct net_device *dev = priv->dev; > > @@ -911,7 +911,7 @@ static void __ipoib_ib_dev_flush(struct > ipoib_dev_priv *priv, int pkey_event) > > * the parent is down. 
> > */ > > list_for_each_entry(cpriv, &priv->child_intfs, list) > > - __ipoib_ib_dev_flush(cpriv, pkey_event); > > + __ipoib_ib_dev_flush(cpriv, level); > > > > mutex_unlock(&priv->vlan_mutex); > > > > @@ -925,7 +925,7 @@ static void __ipoib_ib_dev_flush(struct > ipoib_dev_priv *priv, int pkey_event) > > return; > > } > > > > - if (pkey_event) { > > + if (level == 2) { > > if (ib_find_pkey(priv->ca, priv->port, priv->pkey, > &new_index)) { > > clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > > ipoib_ib_dev_down(dev, 0); > > @@ -943,11 +943,13 @@ static void __ipoib_ib_dev_flush(struct > ipoib_dev_priv *priv, int pkey_event) > > priv->pkey_index = new_index; > > } > > > > - ipoib_dbg(priv, "flushing\n"); > > - > > - ipoib_ib_dev_down(dev, 0); > > + ipoib_flush_paths_only(dev); > > + ipoib_mcast_dev_flush(dev); > > + > > + if (level >= 1) > > + ipoib_ib_dev_down(dev, 0); > > > > - if (pkey_event) { > > + if (level >= 2) { > > ipoib_ib_dev_stop(dev, 0); > > ipoib_ib_dev_open(dev); > > } > > @@ -957,29 +959,36 @@ static void __ipoib_ib_dev_flush(struct > ipoib_dev_priv *priv, int pkey_event) > > * we get here, don't bring it back up if it's not configured up > > */ > > if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { > > - ipoib_ib_dev_up(dev); > > + if (level >= 1) > > + ipoib_ib_dev_up(dev); > > ipoib_mcast_restart_task(&priv->restart_task); > > } > > } > > > > -void ipoib_ib_dev_flush(struct work_struct *work) > > +void ipoib_ib_dev_flush0(struct work_struct *work) > > { > > struct ipoib_dev_priv *priv = > > - container_of(work, struct ipoib_dev_priv, flush_task); > > + container_of(work, struct ipoib_dev_priv, flush_task0); > > > > - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); > > __ipoib_ib_dev_flush(priv, 0); > > } > > > > -void ipoib_pkey_event(struct work_struct *work) > > +void ipoib_ib_dev_flush1(struct work_struct *work) > > { > > struct ipoib_dev_priv *priv = > > - container_of(work, struct ipoib_dev_priv, pkey_event_task); > > + container_of(work, struct ipoib_dev_priv, flush_task1); > > > > - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", > priv->dev->name); > > __ipoib_ib_dev_flush(priv, 1); > > } > > > > +void ipoib_ib_dev_flush2(struct work_struct *work) > > +{ > > + struct ipoib_dev_priv *priv = > > + container_of(work, struct ipoib_dev_priv, flush_task2); > > + > > + __ipoib_ib_dev_flush(priv, 2); > > +} > > + > > void ipoib_ib_dev_cleanup(struct net_device *dev) > > { > > struct ipoib_dev_priv *priv = netdev_priv(dev); > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c > b/drivers/infiniband/ulp/ipoib/ipoib_main.c > > index 2442090..c41798d 100644 > > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > > @@ -259,6 +259,21 @@ static int __path_add(struct net_device *dev, struct > ipoib_path *path) > > return 0; > > } > > > > +static void path_free_only(struct net_device *dev, struct ipoib_path > *path) > > +{ > > + struct ipoib_dev_priv *priv = netdev_priv(dev); > > + struct ipoib_neigh *neigh, *tn; > > + struct sk_buff *skb; > > + unsigned long flags; > > + > > + while ((skb = __skb_dequeue(&path->queue))) > > + dev_kfree_skb_irq(skb); > > + > > + if (path->ah) > > + ipoib_put_ah(path->ah); > > + > > + kfree(path); > > +} > > static void path_free(struct net_device *dev, struct ipoib_path *path) > > { > > struct ipoib_dev_priv *priv = netdev_priv(dev); > > @@ -350,6 +365,34 @@ void ipoib_path_iter_read(struct ipoib_path_iter > *iter, > > > > #endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ > > > > +void 
ipoib_flush_paths_only(struct net_device *dev) > > +{ > > + struct ipoib_dev_priv *priv = netdev_priv(dev); > > + struct ipoib_path *path, *tp; > > + LIST_HEAD(remove_list); > > + > > + spin_lock_irq(&priv->tx_lock); > > + spin_lock(&priv->lock); > > + > > + list_splice_init(&priv->path_list, &remove_list); > > + > > + list_for_each_entry(path, &remove_list, list) > > + rb_erase(&path->rb_node, &priv->path_tree); > > + > > + list_for_each_entry_safe(path, tp, &remove_list, list) { > > + if (path->query) > > + ib_sa_cancel_query(path->query_id, path->query); > > + spin_unlock(&priv->lock); > > + spin_unlock_irq(&priv->tx_lock); > > + wait_for_completion(&path->done); > > + path_free_only(dev, path); > > + spin_lock_irq(&priv->tx_lock); > > + spin_lock(&priv->lock); > > + } > > + spin_unlock(&priv->lock); > > + spin_unlock_irq(&priv->tx_lock); > > +} > > + > > void ipoib_flush_paths(struct net_device *dev) > > { > > struct ipoib_dev_priv *priv = netdev_priv(dev); > > @@ -421,6 +464,8 @@ static void path_rec_completion(int status, > > __skb_queue_tail(&skqueue, skb); > > > > list_for_each_entry_safe(neigh, tn, &path->neigh_list, > list) { > > + if (neigh->ah) > > + ipoib_put_ah(neigh->ah); > > kref_get(&path->ah->ref); > > neigh->ah = path->ah; > > memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw, > > @@ -989,9 +1034,10 @@ static void ipoib_setup(struct net_device *dev) > > INIT_LIST_HEAD(&priv->multicast_list); > > > > INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); > > - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); > > INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); > > - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); > > + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); > > + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); > > + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); > > INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); > > INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); > > } > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > > index 8766d29..80c0409 100644 > > --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > > @@ -289,15 +289,16 @@ void ipoib_event(struct ib_event_handler *handler, > > if (record->element.port_num != priv->port) > > return; > > > > - if (record->event == IB_EVENT_PORT_ERR || > > - record->event == IB_EVENT_PORT_ACTIVE || > > - record->event == IB_EVENT_LID_CHANGE || > > - record->event == IB_EVENT_SM_CHANGE || > > - record->event == IB_EVENT_CLIENT_REREGISTER) { > > - ipoib_dbg(priv, "Port state change event\n"); > > - queue_work(ipoib_workqueue, &priv->flush_task); > > + ipoib_dbg(priv, "Event %d on device %s port %d\n",record->event, > > + record->device->name, record->element.port_num); > > + if ( record->event == IB_EVENT_SM_CHANGE || > > + record->event == IB_EVENT_CLIENT_REREGISTER) { > > + queue_work(ipoib_workqueue, &priv->flush_task0); > > + } else if (record->event == IB_EVENT_PORT_ERR || > > + record->event == IB_EVENT_PORT_ACTIVE || > > + record->event == IB_EVENT_LID_CHANGE) { > > + queue_work(ipoib_workqueue, &priv->flush_task1); > > } else if (record->event == IB_EVENT_PKEY_CHANGE) { > > - ipoib_dbg(priv, "P_Key change event on port:%d\n", > priv->port); > > - queue_work(ipoib_workqueue, &priv->pkey_event_task); > > + queue_work(ipoib_workqueue, &priv->flush_task2); > > } > > } > > _______________________________________________ > general mailing list > general at 
lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eli at mellanox.co.il Thu May 22 08:40:15 2008 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 22 May 2008 18:40:15 +0300 Subject: [ofa-general] [PATCH] IB/ipoib: copy small SKBs in CM mode Message-ID: <1211470815.7310.61.camel@eli-laptop> >From a8ea680caf189ad984aedaa81463ed66e45c4e65 Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Thu, 22 May 2008 16:28:59 +0300 Subject: [PATCH] IB/ipoib: copy small SKBs in CM mode CM mode of ipoib has a large overhead in the receive flow for managing SKBs. It usually allocates an SKB with data as much as was used in the currently received SKB and moves unused fragments from the old SKB to the new one. This involves a loop on all the remaining fragments and incurs overhead on the CPU. This patch, for small SKBs, allocates an SKB just large enough to contain the received data and copies to it the data from the received SKB. The newly allocated SKB is passed to the stack and the old SKB is reposted. Signed-off-by: Eli Cohen --- When running netperf I see significant improvement when using this patch (BW Mbps): with patch: sender receiver 313 313 without the patch: 509 134 drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_cm.c | 15 +++++++++++++++ 2 files changed, 16 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index ca126fc..e39bf36 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -97,6 +97,7 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, MAX_SEND_CQE = 16, + SKB_TSHOLD = 256, }; #define IPOIB_OP_RECV (1ul << 31) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index e6f57dd..791bef7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -525,6 +525,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) u64 mapping[IPOIB_CM_RX_SG]; int frags; int has_srq; + struct sk_buff *small_skb; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); @@ -579,6 +580,19 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) } } + if (wc->byte_len < SKB_TSHOLD) { + int dlen = wc->byte_len; + + small_skb = dev_alloc_skb(dlen + 12); + if (small_skb) { + skb_reserve(small_skb, 12); + skb_copy_from_linear_data(skb, small_skb->data, dlen); + skb_put(small_skb, dlen); + skb = small_skb; + goto copied; + } + } + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; @@ -601,6 +615,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); +copied: skb->protocol = ((struct ipoib_header *) skb->data)->proto; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); -- 1.5.5.1 From hrosenstock at xsigo.com Thu May 22 08:43:28 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 08:43:28 -0700 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <48357C09.1040302@Voltaire.COM> <1211468826.18236.184.camel@hrosenstock-ws.xsigo.com> 
Message-ID: <1211471008.18236.209.camel@hrosenstock-ws.xsigo.com> Olga, On Thu, 2008-05-22 at 18:28 +0300, Olga Shern (Voltaire) wrote: > Hal, > You pointed out that we cannot rely on the assumption that on SM > failover there is not path change. > In the previous patch we only flush multicast. > What Moni changed in this patch is that on SM failover (SM change > event), we will flush not only multicast but also all paths but > without destroying ah. I missed that in the patch :-( It addresses the first level of concern in terms of the unicast paths but leaves open the path parameter changes (rate, etc.) as the address handles are preserved as Moni stated in other words. I agree it's in the right direction. I would like to see the whole problem solved. Is the cost of recreating the AHs too much or is something else leading towards preserving the AHs ? That's what's needed to be resolved for a complete solution. -- Hal > Olga From swise at opengridcomputing.com Thu May 22 09:00:34 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 22 May 2008 11:00:34 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48358428.2000902@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> Message-ID: <483598A2.1020503@opengridcomputing.com> Or Gerlitz wrote: > Steve Wise wrote: >> Are we sure we need to expose this to the user? > I believe this is the way to go if we want to let smart ULPs generate > new rkey/stag per mapping. Simpler ULPs could then just put the same > value for each map associated with the same mr. > > Or. > Roland, what do you think? I'm ok with adding this. From olga.shern at gmail.com Thu May 22 09:06:29 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Thu, 22 May 2008 19:06:29 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <1211471008.18236.209.camel@hrosenstock-ws.xsigo.com> References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <48357C09.1040302@Voltaire.COM> <1211468826.18236.184.camel@hrosenstock-ws.xsigo.com> <1211471008.18236.209.camel@hrosenstock-ws.xsigo.com> Message-ID: On 5/22/08, Hal Rosenstock wrote: > > Olga, > > On Thu, 2008-05-22 at 18:28 +0300, Olga Shern (Voltaire) wrote: > > Hal, > > You pointed out that we cannot rely on the assumption that on SM > > failover there is not path change. > > In the previous patch we only flush multicast. > > What Moni changed in this patch is that on SM failover (SM change > > event), we will flush not only multicast but also all paths but > > without destroying ah. > > I missed that in the patch :-( It addresses the first level of concern > in terms of the unicast paths but leaves open the path parameter changes > (rate, etc.) as the address handles are preserved as Moni stated in > other words. I agree it's in the right direction. I would like to see > the whole problem solved. Is the cost of recreating the AHs too much or > is something else leading towards preserving the AHs ? That's what's > needed to be resolved for a complete solution. 
I didn't explain it well, I will try again :) On SM change event we will not destroy ah but will flush paths, therefore unicast traffic will continue without packets lost. When there will be arp probe (issued by the kernel) it will look for a path and because we have deleted it will issue path query to SM and after reply from sm it will create a new ah that will replace the old ah. Before this patch all packets were dropped till there is a new ah, this patch creating new ah at the background. I hope it is clear now. Olga -- Hal > > > Olga > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.hefty at intel.com Thu May 22 09:21:20 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 22 May 2008 09:21:20 -0700 Subject: [ofa-general] [PATCH] [for-2.6.27] rdma: fix license text Message-ID: <000101c8bc27$df180960$ec248686@amr.corp.intel.com> The license text for several files references a third software license that was inadvertently copied in. Update the license to match that used by openfabrics. This update was based on a request from HP. Signed-off-by: Sean Hefty --- drivers/infiniband/core/addr.c | 41 ++++++++++++++++++--------------- drivers/infiniband/core/cma.c | 42 ++++++++++++++++++---------------- include/rdma/ib_addr.h | 42 ++++++++++++++++++---------------- include/rdma/rdma_cm.h | 42 ++++++++++++++++++---------------- include/rdma/rdma_cm_ib.h | 50 ++++++++++++++++++++++------------------ 5 files changed, 119 insertions(+), 98 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index 781ea59..e4eb8be 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -4,28 +4,33 @@ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * - * This Software is licensed under one of the following licenses: + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. 
- * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. */ #include diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 671f137..e5bd617 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -4,29 +4,33 @@ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. * Copyright (c) 2005-2006 Intel Corporation. All rights reserved. * - * This Software is licensed under one of the following licenses: + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
*/ #include diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index c36750f..b42bdd0 100644 --- a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -2,29 +2,33 @@ * Copyright (c) 2005 Voltaire Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * - * This Software is licensed under one of the following licenses: + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. */ #if !defined(IB_ADDR_H) diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h index 010f876..d8f9a95 100644 --- a/include/rdma/rdma_cm.h +++ b/include/rdma/rdma_cm.h @@ -2,29 +2,33 @@ * Copyright (c) 2005 Voltaire Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * - * This Software is licensed under one of the following licenses: + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. 
+ * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. */ #if !defined(RDMA_CM_H) diff --git a/include/rdma/rdma_cm_ib.h b/include/rdma/rdma_cm_ib.h index 950424b..2389c3b 100644 --- a/include/rdma/rdma_cm_ib.h +++ b/include/rdma/rdma_cm_ib.h @@ -1,29 +1,33 @@ /* * Copyright (c) 2006 Intel Corporation. All rights reserved. * - * This Software is licensed under one of the following licenses: - * - * 1) under the terms of the "Common Public License 1.0" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/cpl.php. - * - * 2) under the terms of the "The BSD License" a copy of which is - * available from the Open Source Initiative, see - * http://www.opensource.org/licenses/bsd-license.php. - * - * 3) under the terms of the "GNU General Public License (GPL) Version 2" a - * copy of which is available from the Open Source Initiative, see - * http://www.opensource.org/licenses/gpl-license.php. - * - * Licensee has the right to choose one of the above licenses. - * - * Redistributions of source code must retain the above copyright - * notice and one of the license notices. - * - * Redistributions in binary form must reproduce both the above copyright - * notice, one of the license notices in the documentation - * and/or other materials provided with the distribution. - * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. */ #if !defined(RDMA_CM_IB_H) From hrosenstock at xsigo.com Thu May 22 09:39:36 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 09:39:36 -0700 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <48357C09.1040302@Voltaire.COM> <1211468826.18236.184.camel@hrosenstock-ws.xsigo.com> <1211471008.18236.209.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211474376.18236.220.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-22 at 19:06 +0300, Olga Shern (Voltaire) wrote: > I didn't explain it well, I will try again :) > > On SM change event we will not destroy ah but will flush paths, > therefore unicast traffic will continue without packets lost. > When there will be arp probe (issued by the kernel) it will look for a > path and because we have deleted it will issue path query to SM and > after reply from sm it will create a new ah that will replace the old > ah. So the new path parameters do become available after a while (once the SA PR query is responded to). In the interim, IPoIB continues to send using the potentially old AH in the hope that nothing changes. If so, this sounds OK to me. > Before this patch all packets were dropped till there is a new ah, > this patch creating new ah at the background > I hope it is clear now. If what I wrote above is a restatement, then it's clearer to me now. Thanks. -- Hal > Olga From ralph.campbell at qlogic.com Thu May 22 10:29:35 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 22 May 2008 10:29:35 -0700 Subject: [ofa-general] ib_device_cap_flags Message-ID: <1211477375.3949.313.camel@brick.pathscale.com> Or Gerlitz pointed out that the ipath driver wasn't setting IB_DEVICE_RC_RNR_NAK_GEN. When I looked at the other flags to see if any others were missing, I wasn't sure what one of the flags means. Can someone enlighten me? IB_DEVICE_CURR_QP_STATE_MOD ib_modify_qp(IB_QP_CUR_STATE) supported? 
From sean.hefty at intel.com Thu May 22 10:34:58 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 22 May 2008 10:34:58 -0700 Subject: [ofa-general] ib_device_cap_flags In-Reply-To: <1211477375.3949.313.camel@brick.pathscale.com> References: <1211477375.3949.313.camel@brick.pathscale.com> Message-ID: <000301c8bc32$2850a6a0$ec248686@amr.corp.intel.com> >IB_DEVICE_CURR_QP_STATE_MOD > ib_modify_qp(IB_QP_CUR_STATE) supported? "Ability of this HCA to support the Current QP State modifier for Modify Queue Pair." It allows the user to specify the current state of the QP when transitioning to RTS (from RTR or SQD). - Sean From hrosenstock at xsigo.com Thu May 22 11:08:00 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 22 May 2008 11:08:00 -0700 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root In-Reply-To: <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> References: <48358DF8.2060603@llnl.gov> <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211479680.13185.37.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-22 at 08:17 -0700, Hal Rosenstock wrote: > Tim, > > On Thu, 2008-05-22 at 08:15 -0700, Timothy A. Meier wrote: > > Sasha, > > > > Trivial patch to enforce root for these perl scripts. More importantly, > > doesn't silently fail if not root, and returns an error code. > > Should these enforce root or be based on udev permissions for umad which > default to root ? > > -- Hal > > > plain text document attachment (0001-infiniband-diags-terminate-perl- > > scripts-with-error.patch) > > >From f4058a22d31dc31f0e8ecdffcc42bff065eefcce Mon Sep 17 00:00:00 2001 > > From: Tim Meier > > Date: Wed, 21 May 2008 16:40:18 -0700 > > Subject: [PATCH] infiniband-diags: terminate perl scripts with error if not root > > > > Adds the "auth_check" routine at the beginning of each main, which > > terminates with an error if not invoked as root. > > > > Signed-off-by: Tim Meier > > --- > > infiniband-diags/scripts/IBswcountlimits.pm | 10 ++++++++++ > > infiniband-diags/scripts/ibfindnodesusing.pl | 1 + > > infiniband-diags/scripts/ibidsverify.pl | 1 + > > infiniband-diags/scripts/iblinkinfo.pl | 1 + > > infiniband-diags/scripts/ibprintca.pl | 1 + > > infiniband-diags/scripts/ibprintrt.pl | 1 + > > infiniband-diags/scripts/ibprintswitch.pl | 1 + > > infiniband-diags/scripts/ibqueryerrors.pl | 1 + > > infiniband-diags/scripts/ibswportwatch.pl | 1 + > > 9 files changed, 18 insertions(+), 0 deletions(-) > > > > diff --git a/infiniband-diags/scripts/IBswcountlimits.pm b/infiniband-diags/scripts/IBswcountlimits.pm > > index 9bc356f..0b7563e 100755 > > --- a/infiniband-diags/scripts/IBswcountlimits.pm > > +++ b/infiniband-diags/scripts/IBswcountlimits.pm > > @@ -123,6 +123,16 @@ sub check_counters > > "Total number of packets, excluding link packets, received on all VLs to the port" > > ); > > > > +# ========================================================================= > > +# only root is authorized, terminate with msg and err code > > +# > > +sub auth_check > > +{ > > + if ( $> != 0 ) { > > + die "Permission denied, must be root\n"; > > + } I think all that's needed is a slightly more sophisticated auth_check than this :-) It could easily be a follow on patch to this. 
-- Hal > > +} > > + > > sub check_data_counters > > { > > my $print_action = $_[0]; > > diff --git a/infiniband-diags/scripts/ibfindnodesusing.pl b/infiniband-diags/scripts/ibfindnodesusing.pl > > index 1bf0987..49003af 100755 > > --- a/infiniband-diags/scripts/ibfindnodesusing.pl > > +++ b/infiniband-diags/scripts/ibfindnodesusing.pl > > @@ -168,6 +168,7 @@ sub compress_hostlist > > # > > sub main > > { > > + auth_check; > > my $found_switch = undef; > > my $cache_file = get_cache_file($ca_name, $ca_port); > > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > > diff --git a/infiniband-diags/scripts/ibidsverify.pl b/infiniband-diags/scripts/ibidsverify.pl > > index de78e6b..b857166 100755 > > --- a/infiniband-diags/scripts/ibidsverify.pl > > +++ b/infiniband-diags/scripts/ibidsverify.pl > > @@ -163,6 +163,7 @@ sub insert_portguid > > > > sub main > > { > > + auth_check; > > if ($regenerate_map > > || !(-f "$IBswcountlimits::cache_dir/ibnetdiscover.topology")) > > { > > diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl > > index a195474..4bb9598 100755 > > --- a/infiniband-diags/scripts/iblinkinfo.pl > > +++ b/infiniband-diags/scripts/iblinkinfo.pl > > @@ -98,6 +98,7 @@ my $extra_smpquery_params = get_ca_name_port_param_string($ca_name, $ca_port); > > > > sub main > > { > > + auth_check; > > get_link_ends($regenerate_map, $ca_name, $ca_port); > > if (defined($direct_route)) { > > # convert DR to guid, then use original single_switch option > > diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl > > index 38b4330..d5c5fba 100755 > > --- a/infiniband-diags/scripts/ibprintca.pl > > +++ b/infiniband-diags/scripts/ibprintca.pl > > @@ -88,6 +88,7 @@ if ($target_hca eq "") { > > # > > sub main > > { > > + auth_check; > > my $found_hca = undef; > > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > > my $in_hca = "no"; > > diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl > > index 86dcb64..c6070ff 100755 > > --- a/infiniband-diags/scripts/ibprintrt.pl > > +++ b/infiniband-diags/scripts/ibprintrt.pl > > @@ -88,6 +88,7 @@ if ($target_rt eq "") { > > # > > sub main > > { > > + auth_check; > > my $found_rt = undef; > > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > > my $in_rt = "no"; > > diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl > > index 6712201..41a5131 100755 > > --- a/infiniband-diags/scripts/ibprintswitch.pl > > +++ b/infiniband-diags/scripts/ibprintswitch.pl > > @@ -87,6 +87,7 @@ if ($target_switch eq "") { > > # > > sub main > > { > > + auth_check; > > my $found_switch = undef; > > open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; > > my $in_switch = "no"; > > diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl > > index c807c02..3330687 100755 > > --- a/infiniband-diags/scripts/ibqueryerrors.pl > > +++ b/infiniband-diags/scripts/ibqueryerrors.pl > > @@ -185,6 +185,7 @@ $cache_file = get_cache_file($ca_name, $ca_port); > > > > sub main > > { > > + auth_check; > > if (@IBswcountlimits::suppress_errors) { > > my $msg = join(",", @IBswcountlimits::suppress_errors); > > print "Suppressing: $msg\n"; > > diff --git a/infiniband-diags/scripts/ibswportwatch.pl b/infiniband-diags/scripts/ibswportwatch.pl > > index 6d6ba1c..76398fa 100755 > > --- 
a/infiniband-diags/scripts/ibswportwatch.pl > > +++ b/infiniband-diags/scripts/ibswportwatch.pl > > @@ -157,6 +157,7 @@ my $sw_port = $ARGV[1]; > > > > sub main > > { > > + auth_check; > > clear_counters; > > get_new_counts($sw_addr, $sw_port); > > while ($cycle != 0) { > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tziporet at mellanox.co.il Thu May 22 12:46:58 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 22 May 2008 22:46:58 +0300 Subject: [ofa-general] OFED 1.3.1 RC2 release is available Message-ID: <6C2C79E72C305246B504CBA17B5500C90282E6C3@mtlexch01.mtl.com> Hi, OFED 1.3.1 RC2 release is available on http://www.openfabrics.org/downloads/OFED/ofed-1.3.1/OFED-1.3.1-rc2.tgz To get BUILD_ID run ofed_info Please report any issues in Bugzilla https://bugs.openfabrics.org/ The GA version is expected on May 29 Release information: -------------------- Linux Operating Systems: - RedHat EL4 up4: 2.6.9-42.ELsmp - RedHat EL4 up5: 2.6.9-55.ELsmp - RedHat EL4 up6: 2.6.9-67.ELsmp - RedHat EL5: 2.6.18-8.el5 - RedHat EL5 up1: 2.6.18-53.el5 - RedHat EL5 up2 beta: 2.6.18-84.el5 * - Fedora C6: 2.6.18-8.fc6 * - SLES10: 2.6.16.21-0.8-smp - SLES10 SP1: 2.6.16.46-0.12-smp - SLES10 SP1 up1: 2.6.16.53-0.16-smp - SLES10 SP2: 2.6.16.60-0.21-smp * - OpenSuSE 10.3: 2.6.22-*-* * - kernel.org: 2.6.23 and 2.6.24 * OSes that are partially tested Systems: * x86_64 * x86 * ia64 * ppc64 Main changes from OFED 1.3.1-rc1 ================================ * Added backports for the OSes (with very limited testing): * SLES10 SP2 with kernel 2.6.16.60-0.21-smp * RedHat EL5 up2 beta with kernel 2.6.18-84.el5 * MPI packages update: * mvapich-1.0.1-2481 * Updated libraries: * dapl-v1 1.2.7-1 * dapl-v2 2.0.9-1 * libcxgb3 1.2.1 * ULPs changes: * OpenSM: Fix segmentation fault * iSER: Bug fixes since 2.6.24 * RDS: fixes for RDMA API * IPoIB: Fix several kernel crashes (see attached list) * Updated low level drivers: * nes * mlx4 * cxgb3 * ehca * ipath Main Changes from OFED-1.3: =========================== * MPI packages update: * mvapich-1.0.1-2434 * mvapich2-1.0.3-1 * openmpi-1.2.6-1 * Updated libraries: * dapl-v1 1.2.6 * dapl-v2 2.0.8 * libcxgb3 1.2.0 * librdmacm 1.0.7 * ULPs changes: * IB Bonding: ib-bonding-0.9.0-24 * IPoIB bug fixes * RDS fixes for RDMA API * SRP failover * Updated low level drivers: * nes * mlx4 * cxgb3 * ehca Vlad & Tziporet Note: In the attached tgz file you can find the git-log of all changes. In the CSV file there is a list of fixed bugs that were reported in bugzilla -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: rc2-fixed-bugs.csv Type: application/octet-stream Size: 464 bytes Desc: rc2-fixed-bugs.csv URL: From YJia at tmriusa.com Thu May 22 13:13:05 2008 From: YJia at tmriusa.com (Yicheng Jia) Date: Thu, 22 May 2008 15:13:05 -0500 Subject: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? Message-ID: Hi Folks, I'm trying to use CQ event notification for multiple completions (ARM_N) according to the Mellanox InfiniHost III Lx user manual, for scatter/gather RDMA. However I couldn't find it in the current MLX driver. It seems to me that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are multiple work requests, I have to use "poll_cq" to synchronously wait until all the requests are done. Is that correct? Is there a way to do asynchronous multiple sends by subscribing for an ARM_N event? Thanks! Yicheng _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http://www.ers.ibm.com _____________________________________________________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From greg at kroah.com Thu May 22 13:34:03 2008 From: greg at kroah.com (Greg KH) Date: Thu, 22 May 2008 13:34:03 -0700 Subject: [ofa-general] question about drivers/infiniband/core/cm.c's kobject usage Message-ID: <20080522203403.GA27263@kroah.com> Hi, I was working on some changes to the driver core that clean up the struct class fields, when I ran across the usage of cm.c and the infiniband_cm class. It looks like you are registering "raw" kobjects in this class, chaining things off of it, as if they were devices. If so, why not just use struct device in the first place? You are creating a tree which, on modern distros, userspace will never see, as it expects everything to show up in /sys/devices/ Entries in /sys/class/*/* now are symlinks into the /sys/devices tree, showing the representation of everything in one tree, not lots of little trees all over the place. So I was curious, was this done on purpose? If so, why? If not, any objection to me switching it over to using struct device properly? thanks, greg k-h From sean.hefty at intel.com Thu May 22 14:47:51 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 22 May 2008 14:47:51 -0700 Subject: [ofa-general] RE: question about drivers/infiniband/core/cm.c's kobject usage In-Reply-To: <20080522203403.GA27263@kroah.com> References: <20080522203403.GA27263@kroah.com> Message-ID: <000501c8bc55$7c2dd680$ec248686@amr.corp.intel.com> >So I was curious, was this done on purpose? If so, why? If not, any >objection to me switching it over to using struct device properly? It's entirely possible I have this wrong, but the intent is to export some infiniband communication management message counters and relate them to the corresponding ib_device/port. For example: /sys/class/infiniband_cm/<device>/<port>/<counter group>/<counter> (E.g. /sys/class/infiniband_cm/mthca0/1/cm_tx_msgs/req) If there's a better way to handle this, I have no objection to changing it. - Sean
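For comparison, a minimal sketch of the struct-device flavor Greg suggests, for one such counter. All names here are hypothetical (this is not the actual cm.c code), and it assumes the counter object is reachable through the device's drvdata:

	#include <linux/device.h>

	struct cm_port_counters {
		atomic_long_t tx_req;		/* REQs sent on this port */
	};

	static ssize_t cm_tx_req_show(struct device *dev,
				      struct device_attribute *attr, char *buf)
	{
		struct cm_port_counters *c = dev_get_drvdata(dev);

		return sprintf(buf, "%ld\n", atomic_long_read(&c->tx_req));
	}
	static DEVICE_ATTR(cm_tx_req, S_IRUGO, cm_tx_req_show, NULL);

	/* at port initialization, once the port's struct device is registered: */
	ret = device_create_file(port_dev, &dev_attr_cm_tx_req);

Attributes registered this way show up under the device's node in /sys/devices, with the /sys/class entries reduced to symlinks, i.e. the single-tree layout described above.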
From arlin.r.davis at intel.com Thu May 22 14:55:17 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 22 May 2008 14:55:17 -0700 Subject: [ofa-general] Multi-port testing with ibv_rdma_bw shows strange results Message-ID: <000f01c8bc56$85d1aee0$8bc3020a@amr.corp.intel.com> I have 2 servers (cst-53, cst-54) connected via one switch using a mlx4 dual port adapter. I did a quick test, using ibv_rdma_bw, across port 1 and then port 2. When running across port 1 the data seems to be split across port 1 and port 2; across port 2 the traffic is all on port 2 as expected. Any ideas? Can I trust perfquery results? Thanks, -arlin my configuration (OFED 1.3, RHEL5.1): cst-53: [root at cst-54 sbin]# ibstat CA 'mlx4_0' CA type: MT26418 Number of ports: 2 Firmware version: 2.4.938 Hardware version: a0 Node GUID: 0x0002c9030000a5b4 System image GUID: 0x0002c9030000a5b7 Port 1: State: Active Physical state: LinkUp Rate: 10 Base lid: 24 LMC: 0 SM lid: 3 Capability mask: 0x02510868 Port GUID: 0x0002c9030000a5b5 Port 2: State: Active Physical state: LinkUp Rate: 10 Base lid: 25 LMC: 0 SM lid: 3 Capability mask: 0x02510868 Port GUID: 0x0002c9030000a5b6 cst-54: [root at cst-53 fw-25408-rel-2_4_938]# ibstat CA 'mlx4_0' CA type: MT26418 Number of ports: 2 Firmware version: 2.4.938 Hardware version: a0 Node GUID: 0x0002c9030000a620 System image GUID: 0x0002c9030000a623 Port 1: State: Active Physical state: LinkUp Rate: 2 Base lid: 22 LMC: 0 SM lid: 3 Capability mask: 0x02510868 Port GUID: 0x0002c9030000a621 Port 2: State: Active Physical state: LinkUp Rate: 10 Base lid: 23 LMC: 0 SM lid: 3 Capability mask: 0x02510868 Port GUID: 0x0002c9030000a622 TEST results: ibv_rdma_bw on port 2 is working fine, all traffic on port 2: server: perfquery -R;/usr/bin/ib_rdma_bw -i 2;perfquery 24 1;perfquery 25 2 11621: | port=18515 | ib_port=2 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 | 11621: Local address: LID 0x19, QPN 0x16004a, PSN 0x42d3c6 RKey 0xc8002401 VAddr 0x002aaaaaf03000 11621: Remote address: LID 0x17, QPN 0x16004a, PSN 0x65136b, RKey 0x82002401 VAddr 0x002aaaab313000 # Port counters: Lid 24 port 1 PortSelect:......................1 XmtData:.........................0 RcvData:.........................0 XmtPkts:.........................0 RcvPkts:.........................0 # Port counters: Lid 25 port 2 PortSelect:......................2 XmtData:.........................62946 RcvData:.........................116072668 XmtPkts:.........................7279 RcvPkts:.........................224274 client: perfquery -R;/usr/bin/ib_rdma_bw -i 2 cst-54;perfquery 22 1;perfquery 23 2 9190: | port=18515 | ib_port=2 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 | 9190: Local address: LID 0x17, QPN 0x16004a, PSN 0x65136b RKey 0x82002401 VAddr 0x002aaaab313000 9190: Remote address: LID 0x19, QPN 0x16004a, PSN 0x42d3c6, RKey 0xc8002401 VAddr 0x002aaaaaf03000 9190: Bandwidth peak (#0 to #983): 937.621 MB/sec 9190: Bandwidth average: 937.614 MB/sec 9190: Service Demand peak (#0 to #983): 2077 cycles/KB 9190: Service Demand Avg : 2077 cycles/KB # Port counters: Lid 22 port 1 XmtData:.........................0 RcvData:.........................0 XmtPkts:.........................0 RcvPkts:.........................0 # Port counters: Lid 23 port 2 XmtData:.........................116075478 RcvData:.........................66442 XmtPkts:.........................224298 RcvPkts:.........................7318 port 1 with strange results - traffic split between port 1 and 2: client: [root at cst-53 fw-25408-rel-2_4_938]# perfquery -R;/usr/bin/ib_rdma_bw -i 1 cst-54;perfquery 22 1;perfquery 23 2 9144: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 | 9144: Local address: LID 0x16, QPN 0x10004a, PSN 0xd125be RKey 0x34002401 VAddr 0x002aaaab313000 9144: Remote address: LID 0x18, QPN 0x10004a, PSN 0x5cb52c, RKey 0x7a002401 VAddr 0x002aaaaaf03000 9144: Bandwidth peak (#0 to #978): 234.634 MB/sec 9144: Bandwidth average: 234.634 MB/sec 9144: Service Demand peak (#0 to #978): 8303 cycles/KB 9144: Service Demand Avg : 8303 cycles/KB # Port counters: Lid 22 port 1 XmtData:.........................16580072 RcvData:.........................7072 XmtPkts:.........................32001 RcvPkts:.........................1001 # Port counters: Lid 23 port 2 XmtData:.........................82915046 RcvData:.........................51692 XmtPkts:.........................160292 RcvPkts:.........................5308 server: [root at cst-54 sbin]# perfquery -R;/usr/bin/ib_rdma_bw -i 1 ;perfquery 24 1;perfquery 25 2 11586: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 | 11586: Local address: LID 0x18, QPN 0x14004a, PSN 0x5e8160 RKey 0xae002401 VAddr 0x002aaaaaf03000 11586: Remote address: LID 0x16, QPN 0x14004a, PSN 0xf0dd70, RKey 0x68002401 VAddr 0x002aaaab313000 # Port counters: Lid 24 port 1 XmtData:.........................7000 RcvData:.........................16580000 XmtPkts:.........................1000 RcvPkts:.........................32000 # Port counters: Lid 25 port 2 XmtData:.........................55802 RcvData:.........................99492206 XmtPkts:.........................6277 RcvPkts:.........................192268 ibtracert: cst-53 port 1 to cst-54 port 1 [root at cst-54 sbin]# ibtracert 22 24 From ca {0x0002c9030000a620} portnum 1 lid 22-22 "cst-53 HCA-1" [1] -> switch port {0x000b8cffff004046}[2] lid 2-2 "MT47396 Infiniscale-III Mellanox Technologies" [4] -> ca port {0x0002c9030000a5b5}[1] lid 24-24 "cst-54 HCA-1" To ca {0x0002c9030000a5b4} portnum 1 lid 24-24 "cst-54 HCA-1" cst-53 port 2 to cst-54 port 2 [root at cst-54 sbin]# ibtracert 23 25 From ca {0x0002c9030000a620} portnum 2 lid 23-23 "cst-53 HCA-1" [2] -> switch port {0x000b8cffff004046}[3] lid 2-2 "MT47396 Infiniscale-III Mellanox Technologies" [1] -> ca port {0x0002c9030000a5b6}[2] lid 25-25 "cst-54 HCA-1" To ca {0x0002c9030000a5b4} portnum 2 lid 25-25 "cst-54 HCA-1" -------------- next part -------------- An HTML attachment was scrubbed... URL: From greg at kroah.com Thu May 22 14:58:12 2008 From: greg at kroah.com (Greg KH) Date: Thu, 22 May 2008 14:58:12 -0700 Subject: [ofa-general] Re: question about drivers/infiniband/core/cm.c's kobject usage In-Reply-To: <000501c8bc55$7c2dd680$ec248686@amr.corp.intel.com> References: <20080522203403.GA27263@kroah.com> <000501c8bc55$7c2dd680$ec248686@amr.corp.intel.com> Message-ID: <20080522215812.GA3366@kroah.com> On Thu, May 22, 2008 at 02:47:51PM -0700, Sean Hefty wrote: > >So I was curious, was this done on purpose? If so, why? If not, any > >objection to me switching it over to using struct device properly? > > It's entirely possible I have this wrong, but the intent is to export some > infiniband communication management message counters and relate them to the > corresponding ib_device/port. For example: > > /sys/class/infiniband_cm/<device>/<port>/<counter group>/<counter> > > (E.g. /sys/class/infiniband_cm/mthca0/1/cm_tx_msgs/req) > > If there's a better way to handle this, I have no objection to changing it. Yes, just hang all of the stuff off of the original struct device. That seems like it would be much simpler.
thanks, greg k-h From arlin.r.davis at intel.com Thu May 22 15:42:36 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Thu, 22 May 2008 15:42:36 -0700 Subject: [ofa-general] Multi-port testing with ibv_rdma_bw shows strange results In-Reply-To: <000f01c8bc56$85d1aee0$8bc3020a@amr.corp.intel.com> References: <000f01c8bc56$85d1aee0$8bc3020a@amr.corp.intel.com> Message-ID: never mind. my use of perfquery to reset counters was not correct. ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Davis, Arlin R Sent: Thursday, May 22, 2008 2:55 PM To: [ofa_general] Subject: [ofa-general] Multi-port testing with ibv_rdma_bw shows strange results -------------- next part -------------- An HTML attachment was scrubbed... URL: From weiny2 at llnl.gov Thu May 22 15:47:02 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 22 May 2008 15:47:02 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080522154702.430cdef7.weiny2@llnl.gov> I guess my question is "does saquery need this to talk to the SA?" I am assuming the answer is "yes". I noticed this in the spec section 14.4.7 page 890: "The SM Key used for SM authentication is independent of the SM Key in the SA header used for SA authentication." Does this mean there could be 2 SM_Key values in use? Ira On Thu, 22 May 2008 08:10:29 -0700 Hal Rosenstock wrote: > On Thu, 2008-05-22 at 17:56 +0300, Sasha Khapyorsky wrote: > > On 07:46 Thu 22 May , Hal Rosenstock wrote: > > > Sasha, > > > > > > On Thu, 2008-05-22 at 16:53 +0300, Sasha Khapyorsky wrote: > > > > This adds possibility to specify SM_Key value with saquery. It should > > > > work with queries where OSM_DEFAULT_SM_KEY was used. > > > > > > I think this starts down a slippery slope and perhaps bad precedent for > > > MKey as well. I know this is useful as a debug tool but compromises what > > > purports as "security" IMO as this means the keys need to be too widely > > > known. > > > > When different than OSM_DEFAULT_SM_KEY value is configured on OpenSM > > > side an user may know this or not, in later case saquery will not work > > > (just like now). I don't see a hole. > > > > I think it will tend towards proliferation of keys which will defeat any > > security/trust. The idea of SMKey was to keep it private between SMs. > > This is now spreading it wider IMO. I'm sure other patches will follow > > in the same vein once an MKey manager exists. > > > > -- Hal > > > > > Sasha >
From freyp at student.ethz.ch Fri May 23 00:43:16 2008 From: freyp at student.ethz.ch (Philip Frey) Date: Fri, 23 May 2008 09:43:16 +0200 Subject: [ofa-general] Multithreaded iWARP application Message-ID: <48367594.9010401@student.ethz.ch> Hello, I have a peer-to-peer like application where on each peer there is a thread listening for connection requests. The peers can at the same time also actively connect to other peers. How can the concept of the "rdma_event_channel" be applied to this scenario? Until now I only had one "rdma_event_channel", but with the thread this leads to a race condition where the thread waits for a "RDMA_CM_EVENT_CONNECT_REQUEST" while the peer tries to actively open a connection and is awaiting "RDMA_CM_EVENT_ADDR_RESOLVED" etc. One solution would be to create a new "rdma_event_channel" for each active connection. But what happens at the accepting side? On the accepting "rdma_event_channel" (which is now exclusively used for that purpose), I get a new "rdma_cm_id" for the connection request from the respective event. Is it now possible to create a new "rdma_event_channel" for that new "rdma_cm_id"? If not, where does the "RDMA_CM_EVENT_ESTABLISHED" event go? (The questionable line is marked with "<-- HERE???" in the pseudo code below.)
In pseudo code:

/** connecting part **/
struct rdma_cm_id *id;
struct rdma_event_channel *channel;
struct rdma_cm_event *event;

channel = rdma_create_event_channel();
rdma_create_id(channel, &id, context, RDMA_PS_TCP);
rdma_resolve_addr(id, src_addr, dst_addr, timeout);
rdma_get_cm_event(channel, &event);     //expecting ADDR_RESOLVED
rdma_ack_cm_event(event);
//same for rdma_resolve_route()         //expecting ROUTE_RESOLVED
rdma_connect(id, conn_param);
rdma_get_cm_event(channel, &event);     //expecting ESTABLISHED
rdma_ack_cm_event(event);
... do RDMA here ...
//disconnect

/** accepting thread **/
struct rdma_cm_id *listen_id, *id;
struct rdma_event_channel *listen_channel;
struct rdma_cm_event *listen_event, *event;

listen_channel = rdma_create_event_channel();
rdma_create_id(listen_channel, &listen_id, context, RDMA_PS_TCP);
rdma_bind_addr(listen_id, addr);
rdma_listen(listen_id, backlog);
while (1) {
    rdma_get_cm_event(listen_channel, &listen_event);  //expecting CONNECT_REQUEST
    id = listen_event->id;              //save the new id before acking
    rdma_ack_cm_event(listen_event);
    id->channel = rdma_create_event_channel();         <-- HERE???
    rdma_accept(id, conn_param);
    rdma_get_cm_event(id->channel, &event);            //expecting ESTABLISHED
    rdma_ack_cm_event(event);
    ... do RDMA here ...
    //await disconnect
}

Many thanks for your advice and kind regards!

Philip
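One way to handle the line marked HERE, assuming a librdmacm recent enough to provide rdma_migrate_id() (the rdma_* calls below are the real librdmacm API; the variable names are made up, and error handling is omitted): rather than assigning id->channel directly, migrate the passively created id onto its own event channel before accepting, so the ESTABLISHED (and later DISCONNECTED) events for that connection arrive on a per-connection channel:

	struct rdma_cm_id *conn_id = listen_event->id;	/* saved before the ack */
	struct rdma_event_channel *conn_ch;

	conn_ch = rdma_create_event_channel();
	/* detach the new id from the shared listen channel */
	if (rdma_migrate_id(conn_id, conn_ch) == 0) {
		rdma_accept(conn_id, conn_param);
		rdma_get_cm_event(conn_ch, &event);	/* expecting ESTABLISHED */
		rdma_ack_cm_event(event);
	}

The listening thread's channel then only ever sees CONNECT_REQUEST events, which also avoids the race with the actively connecting side described above.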
From sashak at voltaire.com Fri May 23 01:49:41 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 11:49:41 +0300 Subject: [ofa-general] [PATCH] opensm/scripts: remove not used opensmd template In-Reply-To: <20080519170624.GJ4616@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080512144541.3879de40.weiny2@llnl.gov> <20080519170624.GJ4616@sashak.voltaire.com> Message-ID: <20080523084941.GA4164@sashak.voltaire.com> Remove not used opensmd startup script template. Signed-off-by: Sasha Khapyorsky --- opensm/configure.in | 2 +- opensm/scripts/opensmd.in | 469 --------------------------------------------- 2 files changed, 1 insertions(+), 470 deletions(-) delete mode 100755 opensm/scripts/opensmd.in diff --git a/opensm/configure.in b/opensm/configure.in index 2ae8bd0..e079065 100644 --- a/opensm/configure.in +++ b/opensm/configure.in @@ -207,7 +207,7 @@ OPENIB_APP_OSMV_CHECK_LIB # overrides. CFLAGS=$ac_env_CFLAGS_value -AC_CONFIG_FILES([man/opensm.8 scripts/opensm.init scripts/redhat-opensm.init scripts/opensmd scripts/sldd.sh]) +AC_CONFIG_FILES([man/opensm.8 scripts/opensm.init scripts/redhat-opensm.init scripts/sldd.sh]) dnl Create the following Makefiles AC_OUTPUT([include/opensm/osm_version.h Makefile include/Makefile complib/Makefile libvendor/Makefile opensm/Makefile osmeventplugin/Makefile osmtest/Makefile opensm.spec]) diff --git a/opensm/scripts/opensmd.in b/opensm/scripts/opensmd.in deleted file mode 100755 index 7e5d868..0000000 --- a/opensm/scripts/opensmd.in +++ /dev/null @@ -1,469 +0,0 @@ -#!/bin/bash - -# -# Copyright (c) 2006 Mellanox Technologies. All rights reserved. -# -# This Software is licensed under one of the following licenses: -# -# 1) under the terms of the "Common Public License 1.0" a copy of which is -# available from the Open Source Initiative, see -# http://www.opensource.org/licenses/cpl.php. -# -# 2) under the terms of the "The BSD License" a copy of which is -# available from the Open Source Initiative, see -# http://www.opensource.org/licenses/bsd-license.php. -# -# 3) under the terms of the "GNU General Public License (GPL) Version 2" a -# copy of which is available from the Open Source Initiative, see -# http://www.opensource.org/licenses/gpl-license.php. -# -# Licensee has the right to choose one of the above licenses. -# -# Redistributions of source code must retain the above copyright -# notice and one of the license notices. -# -# Redistributions in binary form must reproduce both the above copyright -# notice, one of the license notices in the documentation -# and/or other materials provided with the distribution. -# -# -# processname: @sbindir@/opensm -# config: @OPENSM_CONFIG_DIR@/opensm.conf -# pidfile: /var/run/opensm.pid - -prefix=@prefix@ -exec_prefix=@exec_prefix@ - -CONFIG=@OPENSM_CONFIG_DIR@/opensm.conf - -if [ ! -f $CONFIG ]; then - exit 0 -fi - -. $COFNIG - -prog=@sbindir@/opensm -bin=${prog##*/} - -# Handover daemon for updating guid2lid cache file -sldd_prog=@sbindir@/sldd.sh -sldd_bin=${sldd_prog##*/} -sldd_pid_file=/var/run/sldd.pid - -# Only use ONBOOT option if called by a runlevel directory. -# Therefore determine the base, follow a runlevel link name ... -base=${0##*/} -link=${base#*[SK][0-9][0-9]} -# ... and compare them -if [ $link == $base ] ; then - ONBOOT=yes -fi - -ACTION=$1 -shift - -if [ ! -x $prog ]; then - echo "OpenSM not installed" - exit 1 -fi - -# Check if OpenSM configured to start automatically -if [[ -z $ONBOOT || "$ONBOOT" != "yes" ]]; then - exit 0 -fi - -if ( grep -i 'SuSE Linux' /etc/issue >/dev/null 2>&1 ); then - if [ -n "$INIT_VERSION" ] ; then - # MODE=onboot - if LANG=C egrep -L "^ONBOOT=['\"]?[Nn][Oo]['\"]?" ${CONFIG} > /dev/null ; then - exit 0 - fi - fi -fi - -if [ -f /etc/init.d/functions ]; then - . /etc/init.d/functions -fi - -# Setting OpenSM start parameters -PID_FILE=/var/run/${bin}.pid -touch $PID_FILE - -if [[ -z $DEBUG || "$DEBUG" == "none" ]]; then - DEBUG_FLAG="" -else - DEBUG_FLAG="-d ${DEBUG}" -fi - -if [[ -z $LMC || "$LMC" == "0" ]]; then - LMC_FLAG="" -else - LMC_FLAG="-l ${LMC}" -fi - -if [[ -z $MAXSMPS || "$MAXSMPS" == "4" ]]; then - MAXSMPS_FLAG="" -else - MAXSMPS_FLAG="-maxsmps ${MAXSMPS}" -fi - -if [[ -z $REASSIGN_LIDS || "$REASSIGN_LIDS" == "no" ]]; then - REASSIGN_LIDS_FLAG="" -else - REASSIGN_LIDS_FLAG="-r" -fi - -if [[ -z $SWEEP || "$SWEEP" == "10" ]]; then - SWEEP_FLAG="" -else - SWEEP_FLAG="-s ${SWEEP}" -fi - -if [[ -z $TIMEOUT || "$TIMEOUT" == "100" ]]; then - TIMEOUT_FLAG="" -else - TIMEOUT_FLAG="-t ${TIMEOUT}" -fi - -if [[ -z $OSM_LOG || "$OSM_LOG" == "/var/log/opensm.log" ]]; then - OSM_LOG_FLAG="" -else - OSM_LOG_FLAG="-f ${OSM_LOG}" -fi - -if [[ -z $VERBOSE || "$VERBOSE" == "none" ]]; then - VERBOSE_FLAG="" -else - VERBOSE_FLAG="${VERBOSE}" -fi - -if [[ -z $UPDN || "$UPDN" == "off" ]]; then - UPDN_FLAG="" -else - UPDN_FLAG="-u" -fi - -if [[ -z $GUID_FILE || "$GUID_FILE" == "none" ]]; then - GUID_FILE_FLAG="" -else - GUID_FILE_FLAG="-a ${GUID_FILE}" -fi - -if [[ -z $GUID || "$GUID" == "none" ]]; then - GUID_FLAG="" -else - GUID_FLAG="-g ${GUID}" -fi - -if [[ -z $HONORE_GUID2LID || "$HONORE_GUID2LID" == "none" ]]; then - HONORE_GUID2LID_FLAG="" -else - HONORE_GUID2LID_FLAG="--honor_guid2lid" -fi - -if [[ -n "${OSM_HOSTS}" && $(echo -n ${OSM_HOSTS} | wc -w | tr -d '[:space:]') -gt 1 ]]; then - HONORE_GUID2LID_FLAG="--honor_guid2lid" -fi - - -if [[ -z $CACHE_OPTIONS || "$CACHE_OPTIONS" == "none" ]]; then - CACHE_OPTIONS_FLAG="" -else - CACHE_OPTIONS_FLAG="--cache-options" -fi - - -if [ -z $PORT_NUM ]; then - PORT_FLAG=1 -else - PORT_FLAG="${PORT_NUM}" -fi - -
-######################################################################### -# Get a sane screen width -[ -z "${COLUMNS:-}" ] && COLUMNS=80 - -[ -z "${CONSOLETYPE:-}" ] && [ -x /sbin/consoletype ] && CONSOLETYPE="`/sbin/consoletype`" - -if [ -f /etc/sysconfig/i18n -a -z "${NOLOCALE:-}" ] ; then - . /etc/sysconfig/i18n - if [ "$CONSOLETYPE" != "pty" ]; then - case "${LANG:-}" in - ja_JP*|ko_KR*|zh_CN*|zh_TW*) - export LC_MESSAGES=en_US - ;; - *) - export LANG - ;; - esac - else - export LANG - fi -fi - -# Read in our configuration -if [ -z "${BOOTUP:-}" ]; then - if [ -f /etc/sysconfig/init ]; then - . /etc/sysconfig/init - else - # This all seem confusing? Look in /etc/sysconfig/init, - # or in /usr/doc/initscripts-*/sysconfig.txt - BOOTUP=color - RES_COL=60 - MOVE_TO_COL="echo -en \\033[${RES_COL}G" - SETCOLOR_SUCCESS="echo -en \\033[1;32m" - SETCOLOR_FAILURE="echo -en \\033[1;31m" - SETCOLOR_WARNING="echo -en \\033[1;33m" - SETCOLOR_NORMAL="echo -en \\033[0;39m" - LOGLEVEL=1 - fi - if [ "$CONSOLETYPE" = "serial" ]; then - BOOTUP=serial - MOVE_TO_COL= - SETCOLOR_SUCCESS= - SETCOLOR_FAILURE= - SETCOLOR_WARNING= - SETCOLOR_NORMAL= - fi -fi - -if [ "${BOOTUP:-}" != "verbose" ]; then - INITLOG_ARGS="-q" -else - INITLOG_ARGS= -fi - -echo_success() { - echo -n $@ - [ "$BOOTUP" = "color" ] && $MOVE_TO_COL - echo -n "[ " - [ "$BOOTUP" = "color" ] && $SETCOLOR_SUCCESS - echo -n $"OK" - [ "$BOOTUP" = "color" ] && $SETCOLOR_NORMAL - echo -n " ]" - echo -e "\r" - return 0 -} - -echo_failure() { - echo -n $@ - [ "$BOOTUP" = "color" ] && $MOVE_TO_COL - echo -n "[" - [ "$BOOTUP" = "color" ] && $SETCOLOR_FAILURE - echo -n $"FAILED" - [ "$BOOTUP" = "color" ] && $SETCOLOR_NORMAL - echo -n "]" - echo -e "\r" - return 1 -} - - -######################################################################### - -# Check if $pid (could be plural) are running -checkpid() { - local i - - for i in $* ; do - [ -d "/proc/$i" ] || return 1 - done - return 0 -} - -start_sldd() -{ - if [ -f $sldd_pid_file ]; then - local line p - read line < $sldd_pid_file - for p in $line ; do - [ -z "${p//[0-9]/}" -a -d "/proc/$p" ] && sldd_pid="$sldd_pid $p" - done - fi - - if [ -z "$sldd_pid" ]; then - sldd_pid=`pidof -x $sldd_bin` - fi - - if [ -n "${sldd_pid:-}" ] ; then - kill -9 ${sldd_pid} > /dev/null 2>&1 - fi - - $sldd_prog > /dev/null 2>&1 & - sldd_pid=$! - - echo ${sldd_pid} > $sldd_pid_file - # Sleep is needed in order to update local gid2lid cache file before running opensm - sleep 3 -} - -stop_sldd() -{ - if [ -f $sldd_pid_file ]; then - local line p - read line < $sldd_pid_file - for p in $line ; do - [ -z "${p//[0-9]/}" -a -d "/proc/$p" ] && sldd_pid="$sldd_pid $p" - done - fi - - if [ -z "$sldd_pid" ]; then - sldd_pid=`pidof -x $sldd_bin` - fi - - if [ -n "${sldd_pid:-}" ] ; then - kill -15 ${sldd_pid} > /dev/null 2>&1 - fi - -} - -start() -{ - if [ ! -d /sys/class/infiniband ]; then - echo - echo "Please load Infiniband driver first" - echo - return 2 - fi - - local OSM_PID= - - if [ -f $PID_FILE ]; then - local line p - read line < $PID_FILE - for p in $line ; do - [ -z "${p//[0-9]/}" -a -d "/proc/$p" ] && pid="$pid $p" - done - fi - - if [ -z "$pid" ]; then - pid=`pidof -o $$ -o $PPID -o %PPID -x $bin` - fi - - if [ -n "${pid:-}" ] ; then - echo $"${bin} (pid $pid) is already running..." 
- else - - if [ -n "${HONORE_GUID2LID_FLAG}" ]; then - # Run sldd daemod - start_sldd - fi - - # Start opensm - local START_FLAGS="" - for flag in "$DEBUG_FLAG" "$LMC_FLAG" "$MAXSMPS_FLAG" "$REASSIGN_LIDS_FLAG" "$SWEEP_FLAG" "$TIMEOUT_FLAG" "$OSM_LOG_FLAG" "$VERBOSE_FLAG" "$UPDN_FLAG" "$GUID_FILE_FLAG" "$GUID_FLAG" "$HONORE_GUID2LID_FLAG" "$CACHE_OPTIONS_FLAG" - do - [ ! -z "$flag" ] && START_FLAGS="$START_FLAGS $flag" - done - - echo $PORT_FLAG | $prog $START_FLAGS > /dev/null 2>&1 & - OSM_PID=$! - echo $OSM_PID > $PID_FILE - sleep 1 - checkpid $OSM_PID - RC=$? - [ $RC -eq 0 ] && echo_success "$bin start" || echo_failure "$bin start" - - fi -return $RC -} - -stop() -{ - local pid= - local pid1= - local pid2= - - # Stop sldd daemon - stop_sldd - - if [ -f $PID_FILE ]; then - local line p - read line < $PID_FILE - for p in $line ; do - [ -z "${p//[0-9]/}" -a -d "/proc/$p" ] && pid1="$pid1 $p" - done - fi - - pid2=`pidof -o $$ -o $PPID -o %PPID -x $bin` - - pid=`echo "$pid1 $pid2" | sed -e 's/\ /\n/g' | sort -n | uniq | sed -e 's/\n/\ /g'` - - if [ -n "${pid:-}" ] ; then - # Kill opensm - kill -15 $pid > /dev/null 2>&1 - cnt=0 - while [ $cnt -lt 6 ]; do echo -n "."; sleep 1; let cnt++;done - - for p in $pid - do - while checkpid $p ; do - kill -KILL $p > /dev/null 2>&1 - echo -n "." - sleep 1 - done - done - echo - checkpid $pid - RC=$? - [ $RC -eq 0 ] && echo_failure "$bin shutdown" || echo_success "$bin shutdown" - RC=$((! $RC)) - else - echo_failure "$bin shutdown" - RC=1 - fi - - # Remove pid file if any. - rm -f $PID_FILE -return $RC -} - -status() -{ - local pid - - # First try "pidof" - pid=`pidof -o $$ -o $PPID -o %PPID -x ${bin}` - if [ -n "$pid" ]; then - echo $"${bin} (pid $pid) is running..." - return 0 - fi - - # Next try "/var/run/opensm.pid" files - if [ -f $PID_FILE ] ; then - read pid < $PID_FILE - if [ -n "$pid" ]; then - echo $"${bin} dead but pid file $PID_FILE exists" - return 1 - fi - fi - echo $"${bin} is stopped" - return 3 -} - - - -case $ACTION in - start) - start - ;; - stop) - stop - ;; - restart) - stop - start - ;; - status) - status - ;; - *) - echo - echo "Usage: `basename $0` {start|stop|restart|status}" - echo - exit 1 - ;; -esac - -RC=$? -exit $RC -- 1.5.5.1.178.g1f811 From sashak at voltaire.com Fri May 23 01:50:10 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 11:50:10 +0300 Subject: [ofa-general] [PATCH] opensm/scripts: remove opensm.conf usage In-Reply-To: <20080523084941.GA4164@sashak.voltaire.com> References: <1207703425-19039-1-git-send-email-sashak@voltaire.com> <1210617225.11133.461.camel@cardanus.llnl.gov> <20080512144541.3879de40.weiny2@llnl.gov> <20080519170624.GJ4616@sashak.voltaire.com> <20080523084941.GA4164@sashak.voltaire.com> Message-ID: <20080523085010.GB4164@sashak.voltaire.com> Remove opensm.conf usage - startup script configuration will be replaced soon by OpenSM's opensm.conf. Signed-off-by: Sasha Khapyorsky --- opensm/scripts/opensm.sysconfig | 4 +- opensm/scripts/redhat-opensm.init.in | 103 ++-------------------------------- opensm/scripts/sldd.sh.in | 3 +- 3 files changed, 8 insertions(+), 102 deletions(-) diff --git a/opensm/scripts/opensm.sysconfig b/opensm/scripts/opensm.sysconfig index d3fba93..2cc02e6 100644 --- a/opensm/scripts/opensm.sysconfig +++ b/opensm/scripts/opensm.sysconfig @@ -1,2 +1,2 @@ -# If you want to pass any options to OpenSM, set them here. 
-OPTIONS= +# It will be used for sldd.sh +OSM_HOSTS="" diff --git a/opensm/scripts/redhat-opensm.init.in b/opensm/scripts/redhat-opensm.init.in index 5cc9079..5526e44 100755 --- a/opensm/scripts/redhat-opensm.init.in +++ b/opensm/scripts/redhat-opensm.init.in @@ -38,7 +38,7 @@ # $Id: openib-1.0-opensm.init,v 1.5 2006/08/02 18:18:23 dledford Exp $ # # processname: @sbindir@/opensm -# config: @OPENSM_CONFIG_DIR@/opensm.conf +# config: @sysconfdir@/sysconfig/opensm.conf # pidfile: /var/run/opensm.pid prefix=@prefix@ @@ -46,7 +46,7 @@ exec_prefix=@exec_prefix@ . /etc/rc.d/init.d/functions -CONFIG=@OPENSM_CONFIG_DIR@/opensm.conf +CONFIG=@sysconfdir@/sysconfig/opensm.conf if [ ! -f $CONFIG ]; then exit 0 fi @@ -67,97 +67,10 @@ ACTION=$1 PID_FILE=/var/run/${bin}.pid touch $PID_FILE -if [[ -z $DEBUG || "$DEBUG" == "none" ]]; then - DEBUG_FLAG="" -else - DEBUG_FLAG="-d ${DEBUG}" -fi - -if [[ -z $LMC || "$LMC" == "0" ]]; then - LMC_FLAG="" -else - LMC_FLAG="-l ${LMC}" -fi - -if [[ -z $MAXSMPS || "$MAXSMPS" == "4" ]]; then - MAXSMPS_FLAG="" -else - MAXSMPS_FLAG="-maxsmps ${MAXSMPS}" -fi - -if [[ -z $REASSIGN_LIDS || "$REASSIGN_LIDS" == "no" ]]; then - REASSIGN_LIDS_FLAG="" -else - REASSIGN_LIDS_FLAG="-r" -fi - -if [[ -z $SWEEP || "$SWEEP" == "10" ]]; then - SWEEP_FLAG="" -else - SWEEP_FLAG="-s ${SWEEP}" -fi - -if [[ -z $TIMEOUT || "$TIMEOUT" == "100" ]]; then - TIMEOUT_FLAG="" -else - TIMEOUT_FLAG="-t ${TIMEOUT}" -fi - -if [[ -z $OSM_LOG || "$OSM_LOG" == "/tmp/osm.log" ]]; then - OSM_LOG_FLAG="" -else - OSM_LOG_FLAG="-f ${OSM_LOG}" -fi - -if [[ -z $VERBOSE || "$VERBOSE" == "none" ]]; then - VERBOSE_FLAG="" -else - VERBOSE_FLAG="${VERBOSE}" -fi - -if [[ -z $UPDN || "$UPDN" == "off" ]]; then - UPDN_FLAG="" -else - UPDN_FLAG="-u" -fi - -if [[ -z $GUID_FILE || "$GUID_FILE" == "none" ]]; then - GUID_FILE_FLAG="" -else - GUID_FILE_FLAG="-a ${GUID_FILE}" -fi - -if [[ -z $GUID || "$GUID" == "none" ]]; then - GUID_FLAG="" -else - GUID_FLAG="-g ${GUID}" -fi - -if [[ -z $HONORE_GUID2LID || "$HONORE_GUID2LID" == "none" ]]; then - HONORE_GUID2LID_FLAG="" -else - HONORE_GUID2LID_FLAG="--honor_guid2lid" -fi - if [[ -n "${OSM_HOSTS}" && $(echo -n ${OSM_HOSTS} | wc -w | tr -d '[:space:]') -gt 1 ]]; then - HONORE_GUID2LID_FLAG="--honor_guid2lid" + HONORE_GUID2LID="--honor_guid2lid" fi - -if [[ -z $CACHE_OPTIONS || "$CACHE_OPTIONS" == "none" ]]; then - CACHE_OPTIONS_FLAG="" -else - CACHE_OPTIONS_FLAG="--cache-options" -fi - - -if [ -z $PORT_NUM ]; then - PORT_FLAG=1 -else - PORT_FLAG="${PORT_NUM}" -fi - - ######################################################################### start_sldd() @@ -228,20 +141,14 @@ start() echo $"${bin} (pid $pid) is already running..." else - if [ -n "${HONORE_GUID2LID_FLAG}" ]; then + if [ -n "${HONORE_GUID2LID}" ]; then # Run sldd daemod start_sldd fi # Start opensm - local START_FLAGS="" - for flag in "$DEBUG_FLAG" "$LMC_FLAG" "$MAXSMPS_FLAG" "$REASSIGN_LIDS_FLAG" "$SWEEP_FLAG" "$TIMEOUT_FLAG" "$OSM_LOG_FLAG" "$VERBOSE_FLAG" "$UPDN_FLAG" "$GUID_FILE_FLAG" "$GUID_FLAG" "$HONORE_GUID2LID_FLAG" "$CACHE_OPTIONS_FLAG" - do - [ ! 
-z "$flag" ] && START_FLAGS="$START_FLAGS $flag" - done - echo -n "Starting IB Subnet Manager" - echo $PORT_FLAG | $prog $START_FLAGS > /dev/null 2>&1 & + $prog --daemon ${HONORE_GUID2LID} > /dev/null cnt=0; alive=0 while [ $cnt -lt 6 -a $alive -ne 1 ]; do echo -n "."; diff --git a/opensm/scripts/sldd.sh.in b/opensm/scripts/sldd.sh.in index a6f660f..8162c5c 100755 --- a/opensm/scripts/sldd.sh.in +++ b/opensm/scripts/sldd.sh.in @@ -41,10 +41,9 @@ prefix=@prefix@ exec_prefix=@exec_prefix@ -# config: @sysconfdir@/ofa/opensm.conf +# config: @sysconfdir@/sysconfig/opensm.conf [ -f @sysconfdir@/sysconfig/opensm.conf ] && CONFIG=@sysconfdir@/sysconfig/opensm.conf -[ -f @sysconfdir@/ofa/opensm.conf ] && CONFIG=@sysconfdir@/ofa/opensm.conf SLDD_DEBUG=${SLDD_DEBUG:-0} -- 1.5.5.1.178.g1f811 From vlad at lists.openfabrics.org Fri May 23 03:08:56 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 23 May 2008 03:08:56 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080523-0200 daily build status Message-ID: <20080523100856.DD4F7E60D03@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed
on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From sashak at voltaire.com Fri May 23 03:06:34 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 13:06:34 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080523100634.GD4164@sashak.voltaire.com> On 08:10 Thu 22 May , Hal Rosenstock wrote: > > I think it will tend towards proliferation of keys which will defeat any > security/trust. The idea of SMKey was to keep it private between SMs. > This is now spreading it wider IMO. Probably the original idea was different, but now in the IBA spec knowing a valid SM_Key is mandatory for privileged SA clients (which need to get the whole list of MCMemberRecord, ServiceInfo, etc.). Sasha From sashak at voltaire.com Fri May 23 03:25:57 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 13:25:57 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080522154702.430cdef7.weiny2@llnl.gov> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080522154702.430cdef7.weiny2@llnl.gov> Message-ID: <20080523102557.GE4164@sashak.voltaire.com> On 15:47 Thu 22 May , Ira Weiny wrote: > I guess my question is "does saquery need this to talk to the SA?" > > I am assuming the answer is "yes". > > I noticed this in the spec section 14.4.7 page 890: > > "The SM Key used for SM authentication is independent of the SM Key in the > SA header used for SA authentication." > > Does this mean there could be 2 SM_Key values in use? At least I see nothing in the spec against this. Also it is stated explicitly that the validity of non-zero values is vendor-defined. Sasha From sashak at voltaire.com Fri May 23 03:35:32 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 13:35:32 +0300 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root In-Reply-To: <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> References: <48358DF8.2060603@llnl.gov> <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080523103532.GA4640@sashak.voltaire.com> On 08:17 Thu 22 May , Hal Rosenstock wrote: > On Thu, 2008-05-22 at 08:15 -0700, Timothy A. Meier wrote: > > Sasha, > > > > Trivial patch to enforce root for these perl scripts. More importantly, > > doesn't silently fail if not root, and returns an error code. > > Should these enforce root or be based on udev permissions for umad which > default to root? I would ask the same question as Hal did. What is wrong with how it works now? On some systems access to the files could be arranged for group members, or ibnetdiscover, which is used as the engine for many of these scripts, could be made suid/sgid. This check will break there.
Sasha From hrosenstock at xsigo.com Fri May 23 04:07:35 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 23 May 2008 04:07:35 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080522154702.430cdef7.weiny2@llnl.gov> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080522154702.430cdef7.weiny2@llnl.gov> Message-ID: <1211540855.13185.71.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-22 at 15:47 -0700, Ira Weiny wrote: > I guess my question is "does saquery need this to talk to the SA?" > > I am assuming the answer is "yes". It depends on whether trusted operations need to be supported or not. A normal node has no need for trusted operations. There was a reason why the additional information was hidden with a key. It allows a malicious user to affect not just his node but the subnet. As I mentioned, this starts to be a slippery slope with the management keys. I think a better approach when a non-default key is in place is to support this via the OpenSM console, as OpenSM knows all the keys it's supposed to. > I noticed this in the spec section 14.4.7 page 890: > > "The SM Key used for SM authentication is independent of the SM Key in the > SA header used for SA authentication." > > Does this mean there could be 2 SM_Key values in use? This was a clarification added at IBA 1.2.1. The SA SMKey is really an SA Key. This lack of separation is a limitation in the current OpenSM implementation. -- Hal > Ira > > > On Thu, 22 May 2008 08:10:29 -0700 > Hal Rosenstock wrote: > > > On Thu, 2008-05-22 at 17:56 +0300, Sasha Khapyorsky wrote: > > > On 07:46 Thu 22 May , Hal Rosenstock wrote: > > > > Sasha, > > > > > > > > On Thu, 2008-05-22 at 16:53 +0300, Sasha Khapyorsky wrote: > > > > > This adds possibility to specify SM_Key value with saquery. It should > > > > > work with queries where OSM_DEFAULT_SM_KEY was used. > > > > > > > > I think this starts down a slippery slope and perhaps bad precedent for > > > > MKey as well. I know this is useful as a debug tool but compromises what > > > > purports as "security" IMO as this means the keys need to be too widely > > > > known. > > > > When different than OSM_DEFAULT_SM_KEY value is configured on OpenSM > > > side an user may know this or not, in later case saquery will not work > > > (just like now). I don't see a hole. > > I think it will tend towards proliferation of keys which will defeat any > > security/trust. The idea of SMKey was to keep it private between SMs. > > This is now spreading it wider IMO. I'm sure other patches will follow > > in the same vein once an MKey manager exists.
> > > > -- Hal > > > > Sasha > > From hrosenstock at xsigo.com Fri May 23 04:15:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 23 May 2008 04:15:13 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080523100634.GD4164@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> Message-ID: <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-23 at 13:06 +0300, Sasha Khapyorsky wrote: > On 08:10 Thu 22 May , Hal Rosenstock wrote: > > > > I think it will tend towards proliferation of keys which will defeat any > > security/trust. The idea of SMKey was to keep it private between SMs. > > This is now spreading it wider IMO. > > Probably original idea was different, No; the spec clarification was just that; a clarification of what the original intent was rather than a change in the original idea. > but now in IBA spec knowing a valid > SM_Key is mandatory for privileged SA clients (which need to get whole > list of MCMemberRecord, ServiceInfo, etc.). It's a grey area. The issue is what the privileged SA clients should be used for. I think this use case allows much more common knowledge of the management keys (in this case the SA key) as it will not just be the network administrator using it and even if it were, the user would be looking over his shoulder. That more common knowledge allows for a malicious user to more easily compromise the subnet. A better approach to all these trust issues IMO is to use the OpenSM console to support these types of operations. -- Hal > Sasha From hrosenstock at xsigo.com Fri May 23 04:17:09 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 23 May 2008 04:17:09 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080523102557.GE4164@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080522154702.430cdef7.weiny2@llnl.gov> <20080523102557.GE4164@sashak.voltaire.com> Message-ID: <1211541429.13185.84.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-23 at 13:25 +0300, Sasha Khapyorsky wrote: > On 15:47 Thu 22 May , Ira Weiny wrote: > > I guess my question is "does saquery need this to talk to the SA?" > > > > I am assuming the answer is "yes". > > > > I noticed this in the spec section 14.4.7 page 890: > > > > "The SM Key used for SM authentication is independent of the SM Key in the > > SA header used for SA authentication." > > > > Does this mean there could be 2 SM_Key values in use? > > At least I see nothing in the spec against this. Right; it is more a use case/compromise of trust issue and the implications of that. -- Hal > Also there is stated > explicitly that validity for non-zero values is vendor-defined. > > Sasha
From sashak at voltaire.com Fri May 23 05:34:14 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 15:34:14 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080523123414.GB4640@sashak.voltaire.com> On 04:15 Fri 23 May , Hal Rosenstock wrote: > > > but now in IBA spec knowing a valid > > SM_Key is mandatory for privileged SA clients (which need to get whole > > list of MCMemberRecord, ServiceInfo, etc.). > > It's a grey area. I don't see this as "grey" - spec is very clear about this sort of SA restrictions. > The issue is what the privileged SA clients should be > used for. It can be used for monitoring, SA DB sync/dump, debugging, etc.. > I think this use case allows much more common knowledge of the > management keys (in this case the SA key) as it will not just be the > network administrator using it and even if it were, the user would be > looking over his shoulder. A network administrator is not a little kid :) and this option is optional. Following your logic we will need to disable root passwords typing too. > That more common knowledge allows for a > malicious user to more easily compromise the subnet. There is nothing which could prevent from a malicious user to put things in the code. > A better approach to all these trust issues IMO is to use the OpenSM > console to support these types of operations. OpenSM console is not protected even by SM_Key. And what about diagnostics when other SMs are used?
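For the record, the code side of this option is tiny. The trusted queries in saquery already pass a key to the SA; the patch essentially just makes the compiled-in constant overridable from the command line. A rough sketch (names follow the saquery sources quoted elsewhere in this digest; the option parsing itself is omitted):

/* Sketch only: default as before; overwritten when --smkey is given
 * on the command line. */
static ib_net64_t smkey = OSM_DEFAULT_SM_KEY;

/* ... the trusted queries then use the variable instead of the constant: */
return get_any_records(bind_handle, query_id, 0, 0, NULL, attr_offset,
		       trusted ? smkey : 0);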
Sasha From hrosenstock at xsigo.com Fri May 23 05:52:41 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 23 May 2008 05:52:41 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080523123414.GB4640@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> Message-ID: <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-23 at 15:34 +0300, Sasha Khapyorsky wrote: > On 04:15 Fri 23 May , Hal Rosenstock wrote: > > > > > but now in IBA spec knowing a valid > > > SM_Key is mandatory for privileged SA clients (which need to get whole > > > list of MCMemberRecord, ServiceInfo, etc.). > > > > It's a grey area. > > I don't see this as "grey" - spec is very clear about this sort of SA > restrictions. It is not clear about the issues around the key proliferation. That is what is grey at least to me but maybe I'm the only one (at least speaking on this topic on this list). > > The issue is what the privileged SA clients should be > > used for. > > It can be used for monitoring, SA DB sync/dump, debugging, etc.. All those uses are easily imagined but that's not what I meant by that statement which was related to the key issue. > > I think this use case allows much more common knowledge of the > > management keys (in this case the SA key) as it will not just be the > > network administrator using it and even if it were, the user would be > > looking over his shoulder. > > A network administrator is not a little kid :) and this option is > optional. Following your logic we will need to disable root passwords > typing too. That's taking it too far. Root passwords are at least hidden when typing. > > That more common knowledge allows for a > > malicious user to more easily compromise the subnet. > > There is nothing which could prevent from a malicious user to put things > in the code. Of course not but it's one less hurdle to knock down. > > A better approach to all these trust issues IMO is to use the OpenSM > > console to support these types of operations. > > OpenSM console is not protected even by SM_Key. But can be protected by other weak access control currently and perhaps more in the future. New commands which require trust can utilize SMKey without it being specified (at least for OpenSM), no ? > And what about diagnostics when other SMs are used? I think there's a problem here in a trusted environments given the approach taken as I've stated in the past but seems to have been forgotten. The more trust the less the current diag strategy fits. Are you also going to be proposing exposing MKeys too once MKey management is supported by OpenSM/other SMs ? -- Hal > Sasha From gstreiff at NetEffect.com Fri May 23 06:17:58 2008 From: gstreiff at NetEffect.com (Glenn Streiff) Date: Fri, 23 May 2008 08:17:58 -0500 Subject: [ofa-general] Re: Current list of Linux maintainers and their emailinfo In-Reply-To: <20080520015652.GE1183@sashak.voltaire.com> Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC079501ED@venom2> > On 15:28 Mon 19 May , Woodruff, Robert J wrote: > > > > Here is what I have so far as the list of kernel and userspace > > components. > > > Hi, everyone. 
For NetEffect, iw_nes (iwarp rnic driver) libnes will be maintained by Chien Tung I'm in the process of passing the torch to Chien. He is a very capable developer and I know he will do a good job. I've enjoyed working with everyone. I may still post from time to time as necessary. :-) Glenn From hrosenstock at xsigo.com Fri May 23 06:19:36 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 23 May 2008 06:19:36 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211548776.13185.116.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-23 at 05:52 -0700, Hal Rosenstock wrote: > It is not clear about the issues around the key proliferation. In fact, if you notice, the IBA spec (at least the management chapters) was very careful to ignore all the key management issues to avoid discussions like we've been having ;-) -- Hal From hrosenstock at xsigo.com Fri May 23 06:47:12 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 23 May 2008 06:47:12 -0700 Subject: [ofa-general] [PATCH] management: Support separate SA and SM keys Message-ID: <1211550432.13185.121.camel@hrosenstock-ws.xsigo.com> management: Support separate SA and SM keys as clarified in IBA 1.2.1 Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index ed61721..ccf7bdd 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -730,7 +730,7 @@ get_all_records(osm_bind_handle_t bind_handle, int trusted) { return get_any_records(bind_handle, query_id, 0, 0, NULL, attr_offset, - trusted ? OSM_DEFAULT_SM_KEY : 0); + trusted ? OSM_DEFAULT_SA_KEY : 0); } /** @@ -1255,7 +1255,7 @@ print_pkey_tbl_records(const struct query_cmd *q, osm_bind_handle_t bind_handle, status = get_any_records(bind_handle, IB_MAD_ATTR_PKEY_TBL_RECORD, 0, comp_mask, &pktr, ib_get_attr_offset(sizeof(pktr)), - OSM_DEFAULT_SM_KEY); + OSM_DEFAULT_SA_KEY); if (status != IB_SUCCESS) return status; diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h index 62d472e..39f9057 100644 --- a/opensm/include/opensm/osm_base.h +++ b/opensm/include/opensm/osm_base.h @@ -119,6 +119,17 @@ BEGIN_C_DECLS */ #define OSM_DEFAULT_SM_KEY 1 /********/ +/****s* OpenSM: Base/OSM_DEFAULT_SA_KEY +* NAME +* OSM_DEFAULT_SA_KEY +* +* DESCRIPTION +* Subnet Adminstration key value. +* +* SYNOPSIS +*/ +#define OSM_DEFAULT_SA_KEY 1 +/********/ /****s* OpenSM: Base/OSM_DEFAULT_LMC * NAME * OSM_DEFAULT_LMC diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 349ba79..171b5db 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -208,6 +208,7 @@ typedef struct _osm_subn_opt { ib_net64_t guid; ib_net64_t m_key; ib_net64_t sm_key; + ib_net64_t sa_key; ib_net64_t subnet_prefix; ib_net16_t m_key_lease_period; uint32_t sweep_interval; @@ -291,7 +292,10 @@ typedef struct _osm_subn_opt { * M_Key value sent to all ports qualifing all Set(PortInfo). 
* * sm_key -* SM_Key value of the SM to qualify rcv SA queries as "trusted". +* SM_Key value of the SM used for SM authentication. +* +* sa_key +* SM_Key value to qualify rcv SA queries as "trusted". * * subnet_prefix * Subnet prefix used on this subnet. diff --git a/opensm/opensm/osm_sa_mad_ctrl.c b/opensm/opensm/osm_sa_mad_ctrl.c index 78fdec7..abd8d02 100644 --- a/opensm/opensm/osm_sa_mad_ctrl.c +++ b/opensm/opensm/osm_sa_mad_ctrl.c @@ -340,11 +340,11 @@ __osm_sa_mad_ctrl_rcv_callback(IN osm_madw_t * p_madw, * otherwise discard the MAD. */ if ((p_sa_mad->sm_key != 0) && - (p_sa_mad->sm_key != p_ctrl->p_subn->opt.sm_key)) { + (p_sa_mad->sm_key != p_ctrl->p_subn->opt.sa_key)) { OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 1A04: " "Non-Zero SA MAD SM_Key: 0x%" PRIx64 " != SM_Key: 0x%" PRIx64 "; MAD ignored\n", cl_ntoh64(p_sa_mad->sm_key), - cl_ntoh64(p_ctrl->p_subn->opt.sm_key) + cl_ntoh64(p_ctrl->p_subn->opt.sa_key) ); osm_mad_pool_put(p_ctrl->p_mad_pool, p_madw); goto Exit; diff --git a/opensm/opensm/osm_sa_pkey_record.c b/opensm/opensm/osm_sa_pkey_record.c index 5cea525..4d19ed4 100644 --- a/opensm/opensm/osm_sa_pkey_record.c +++ b/opensm/opensm/osm_sa_pkey_record.c @@ -269,7 +269,7 @@ void osm_pkey_rec_rcv_process(IN void *ctx, IN void *data) to trusted requests. Check that the requester is a trusted one. */ - if (p_rcvd_mad->sm_key != sa->p_subn->opt.sm_key) { + if (p_rcvd_mad->sm_key != sa->p_subn->opt.sa_key) { /* This is not a trusted requester! */ OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 4608: " "Request from non-trusted requester: " diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 2dc0ca8..a5c9b02 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -395,6 +395,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->guid = 0; p_opt->m_key = OSM_DEFAULT_M_KEY; p_opt->sm_key = OSM_DEFAULT_SM_KEY; + p_opt->sa_key = OSM_DEFAULT_SA_KEY; p_opt->subnet_prefix = IB_DEFAULT_SUBNET_PREFIX; p_opt->m_key_lease_period = 0; p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS; @@ -1183,6 +1184,8 @@ ib_api_status_t osm_subn_parse_conf_file(IN osm_subn_opt_t * const p_opts) opts_unpack_net64("sm_key", p_key, p_val, &p_opts->sm_key); + opts_unpack_net64("sa_key", p_key, p_val, &p_opts->sa_key); + opts_unpack_net64("subnet_prefix", p_key, p_val, &p_opts->subnet_prefix); @@ -1432,8 +1435,10 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) "m_key 0x%016" PRIx64 "\n\n" "# The lease period used for the M_Key on this subnet in [sec]\n" "m_key_lease_period %u\n\n" - "# SM_Key value of the SM to qualify rcv SA queries as 'trusted'\n" + "# SM_Key value of the SM used for SM authentication\n" "sm_key 0x%016" PRIx64 "\n\n" + "# SM_Key value to qualify rcv SA queries as 'trusted'\n" + "sa_key 0x%016" PRIx64 "\n\n" "# Subnet prefix used on this subnet\n" "subnet_prefix 0x%016" PRIx64 "\n\n" "# The LMC value used on this subnet\n" @@ -1487,6 +1492,7 @@ ib_api_status_t osm_subn_write_conf_file(IN osm_subn_opt_t * const p_opts) cl_ntoh64(p_opts->m_key), cl_ntoh16(p_opts->m_key_lease_period), cl_ntoh64(p_opts->sm_key), + cl_ntoh64(p_opts->sa_key), cl_ntoh64(p_opts->subnet_prefix), p_opts->lmc, p_opts->lmc_esp0 ? 
"TRUE" : "FALSE", From sashak at voltaire.com Fri May 23 06:46:08 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 16:46:08 +0300 Subject: [ofa-general] [PATCH] liberals4: fix linker dependencies Message-ID: <20080523134608.GD4640@sashak.voltaire.com> As stated in bug 1002 (https://bugs.openfabrics.org/show_bug.cgi?id=1002) when LDFLAGS like "-Wl,-z,defs" (disallows undefined symbols) is used it fails to resolve libpthread symbols. This simple patch fixes it. Signed-off-by: Sasha Khapyorsky --- configure.in | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/configure.in b/configure.in index 9304539..4f9ba8f 100644 --- a/configure.in +++ b/configure.in @@ -27,6 +27,8 @@ AC_PROG_CC dnl Checks for libraries AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], AC_MSG_ERROR([ibv_get_device_list() not found. libmlx4 requires libibverbs.])) +AC_CHECK_LIB(pthread, pthread_mutex_init, [], + AC_MSG_ERROR([pthread_mutex_init() not found. libmlx4 requires libpthread.])) dnl Checks for header files. AC_CHECK_HEADER(infiniband/driver.h, [], -- 1.5.5.1.178.g1f811 From sashak at voltaire.com Fri May 23 06:52:19 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 16:52:19 +0300 Subject: [ofa-general] [PATCH] libipathverbs: fix linker dependencies Message-ID: <20080523135219.GE4640@sashak.voltaire.com> As stated in bug 1002 (https://bugs.openfabrics.org/show_bug.cgi?id=1002) when LDFLAGS like "-Wl,-z,defs" (disallows undefined symbols) is used it fails to resolve libpthread symbols. This simple patch fixes it. Signed-off-by: Sasha Khapyorsky --- configure.in | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/configure.in b/configure.in index dcc207d..faaa4c3 100644 --- a/configure.in +++ b/configure.in @@ -62,6 +62,9 @@ AC_PROG_CC dnl Checks for libraries AC_CHECK_LIB(ibverbs, ibv_get_device_list, [], AC_MSG_ERROR([ibv_get_device_list() not found. libipathverbs requires libibverbs.])) +AC_CHECK_LIB(pthread, pthread_mutex_init, [], + AC_MSG_ERROR([pthread_mutex_init() not found. libipathverbs requires libpthread.])) + dnl Checks for header files. AC_CHECK_HEADER(infiniband/driver.h, [], -- 1.5.5.1.178.g1f811 From sashak at voltaire.com Fri May 23 06:53:44 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 23 May 2008 16:53:44 +0300 Subject: [ofa-general] [PATCH] liberals4: fix linker dependencies In-Reply-To: <20080523134608.GD4640@sashak.voltaire.com> References: <20080523134608.GD4640@sashak.voltaire.com> Message-ID: <20080523135344.GF4640@sashak.voltaire.com> Oops sorry, subject should be: libmlx4: fix linker dependencies Sasha From marcel.heinz at informatik.tu-chemnitz.de Fri May 23 08:26:41 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Fri, 23 May 2008 17:26:41 +0200 Subject: [ofa-general] Multicast Performance Message-ID: <4836E231.4000601@informatik.tu-chemnitz.de> Hi, I have ported an application to use InfiniBand multicast directly via libibverbs. I have discovered very low multicast throughput, only ~250MByte/s although we are using 4x DDR components. To count out any effects of the application, I've created a small benchmark (well, it's only a hack). It just tries to keep the send/recv queue filled with work request and polls the CQ in an endless loop. In server mode, it joins to/creates the multicast group as FullMember, attaches the QP to the group and receives any packets. The client joins as SendOnlyNonMember and sends Datagrams of full MTU size to the group. 
The test setup is as follows:

Host A <---> Switch <---> Host B

We use Mellanox InfiniHost III Lx HCAs (MT25204) and a Flextronics F-X430046 24-Port Switch, OFED 1.3 and a "vanilla" 2.6.23.9 Linux kernel. The results are:

Host A         Host B      Throughput (MByte/sec)
client         server       262
client         2xserver     146
client+server  server       944
client+server  ---          946

as reference: unicast ib_send_bw (in UD mode): 1146

I don't see any reason why it should become _faster_ when I additionally start a server on the same host as the client. OTOH, the 944MByte/s sound relatively sane when compared to the unicast performance with the additional overhead of having to copy the data locally. These 260MB/s seem relatively near to the 2GBit/s effective throughput of a 1x SDR connection. However, the created group is rate 6 (20GBit/s) and the /sys/class/infiniband/mthca0/ports/1/rate file showed 20 Gb/sec during the whole test. The error counters of all ports are showing nothing abnormal. Only the RcvSwRelayErrors counter of the switch's port (to the host running the client) is increasing very fast, but this seems to be normal for multicast packets, as the switch is not relaying these packets back to the source. We could test on another cluster with 6 nodes (also with MT25204 HCAs, I don't know the OFED version and switch type) and got the following results:

Host1  Host2  Host3  Host4  Host5  Host6   Throughput (MByte/s)
1s     1s     1c                            255,15
1s     1s     1s     1c                     255,22
1s     1s     1s     1s     1c              255,22
1s     1s     1s     1s     1s     1c       255,22

1s1c   1s     1s                            738,64
1s1c   1s     1s     1s                     695,08
1s1c   1s     1s     1s     1s              565,14
1s1c   1s     1s     1s     1s     1s       451,90

As long as there is no server and client on the same host, it at least behaves like multicast. When having both client and server on the same host, performance decreases as the number of servers increases, which is totally surprising to me. Another test I did was doing an ib_send_bw (UD) benchmark while the multicast benchmark was running between A and B. I got ~260MByte/s for the multicast and also 260MB/s for ib_send_bw. Has anyone an idea of what is going on there or a hint what I should check? Regards, Marcel From meier3 at llnl.gov Fri May 23 08:58:48 2008 From: meier3 at llnl.gov (Timothy A. Meier) Date: Fri, 23 May 2008 08:58:48 -0700 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root In-Reply-To: <20080523103532.GA4640@sashak.voltaire.com> References: <48358DF8.2060603@llnl.gov> <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> <20080523103532.GA4640@sashak.voltaire.com> Message-ID: <4836E9B8.2080406@llnl.gov> Sasha Khapyorsky wrote: > On 08:17 Thu 22 May , Hal Rosenstock wrote: >> On Thu, 2008-05-22 at 08:15 -0700, Timothy A. Meier wrote: >>> Sasha, >>> >>> Trivial patch to enforce root for these perl scripts. More importantly, >>> doesn't silently fail if not root, and returns an error code. >> Should these enforce root or be based on udev permissions for umad which >> default to root ? > > I would ask the same question as Hal did. > Ok, I understand. I have created another patch with just the auth_check routine in it. Following Hal's advice, authorization is based on the umad permissions. > What is wrong with how it works now? On some systems access to the files could > be arranged for group members, or ibnetdiscover, which is used as the engine > for many scripts, could be made setuid/setgid. This change would break such setups. > > Sasha > The new patch shouldn't break code. I didn't realize/think about non-root with the original patch. The intent is simply to provide a consistent and non-silent fail mechanism.
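In C terms, the logic of the described check is roughly the following (a sketch only; the actual auth_check is a Perl routine, and the umad device path and ownership semantics used here are assumptions):

#include <sys/stat.h>
#include <unistd.h>

/* Pass if running as root, or if the caller owns the umad device. */
static int auth_check(void)
{
	struct stat st;

	if (geteuid() == 0)
		return 1;
	if (stat("/dev/infiniband/umad0", &st) < 0)
		return 0;
	return st.st_uid == geteuid();
}

A script (or tool) would call this up front and exit with a non-zero status when it fails, instead of silently doing nothing.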
Currently, you can get partial functionality from these scripts (-? for example). So in that sense, this can change the behavior if the check is used early in the script (as I did in the original patch). I view most of these scripts as "all or nothing". -- Timothy A. Meier Computer Scientist ICCD/High Performance Computing 925.422.3341 meier3 at llnl.gov From meier3 at llnl.gov Fri May 23 09:04:55 2008 From: meier3 at llnl.gov (Timothy A. Meier) Date: Fri, 23 May 2008 09:04:55 -0700 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not authorized Message-ID: <4836EB27.7060707@llnl.gov> Sasha, Hal, Here is a revised version of the patch - just the auth_check() routine. Basically, it passes the test if root, or same ownership as umad0. The motivation for this patch is to provide a quick (and early) check for the perl scripts that were only intended for privileged users. Stop partial functionality, and provide a non-zero exit code. I will patch the relevant perl scripts to use this check, if accepted. -- Timothy A. Meier Computer Scientist ICCD/High Performance Computing 925.422.3341 meier3 at llnl.gov -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: 0001-infiniband-diags-terminate-perl-scripts-with-error.patch URL: From sean.hefty at intel.com Fri May 23 09:35:35 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 23 May 2008 09:35:35 -0700 Subject: [ofa-general] Multithreaded iWARP application In-Reply-To: <48367594.9010401@student.ethz.ch> References: <48367594.9010401@student.ethz.ch> Message-ID: <000001c8bcf3$070e4670$6b58180a@amr.corp.intel.com>

>In pseudo code:
>
>/** connecting part **/
>struct rdma_cm_id *id;
>struct rdma_event_channel *channel;
>struct rdma_cm_event *event;
>
>channel = rdma_create_event_channel();
>rdma_create_id(channel, &id, context, RDMA_PS_TCP);
>
>rdma_resolve_addr(id, src_addr, dst_addr, timeout);
>rdma_get_cm_event(channel, &event); //expecting ADDR_RESOLVED
>rdma_ack_cm_event(event);
>//same for rdma_resolve_route() //expecting ROUTE_RESOLVED
>
>rdma_connect(id, conn_param);
>rdma_get_cm_event(channel, &event); //expecting ESTABLISHED
>rdma_ack_cm_event(event);
>
>... do RDMA here ...
>//disconnect
>
>
>/** accepting thread **/
>struct rdma_cm_id *listen_id, *id;
>struct rdma_event_channel *listen_channel;
>struct rdma_cm_event *listen_event, *event;
>
>listen_channel = rdma_create_event_channel();
>rdma_create_id(listen_channel, &listen_id, context, RDMA_PS_TCP);
>
>rdma_bind_addr(listen_id, addr);
>rdma_listen(listen_id, backlog);
>
>while(1) {
>    rdma_get_cm_event(listen_channel, &listen_event);
>    //expecting CONNECT_REQUEST
>    id = listen_event->id;
>    rdma_ack_cm_event(listen_event);
>    id->channel = rdma_create_event_channel(); <-- HERE ???

This is disallowed. The kernel is still maintaining the association between the new id and its current channel, so will still deliver events to the old event channel for that id. I think the solution that you're looking for is the call rdma_migrate_id(). The listen request will have its own event channel, and you can migrate new connection(s) to separate channel(s). Depending on your app, you may be able to get away with a total of 2 channels - one for the listen, and another one for the connected ids. As a note, rdma_migrate_id() is a relatively new call. So I don't know if your installation has it. Without it, you're stuck using a single event channel on the listening side.
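A sketch of the accepting loop done that way (two channels total; error handling, QP setup and conn_param initialization are omitted):

struct rdma_event_channel *listen_channel, *conn_channel;
struct rdma_cm_id *listen_id, *id;
struct rdma_cm_event *event;

listen_channel = rdma_create_event_channel();
conn_channel = rdma_create_event_channel();
rdma_create_id(listen_channel, &listen_id, NULL, RDMA_PS_TCP);
rdma_bind_addr(listen_id, addr);
rdma_listen(listen_id, backlog);

while (1) {
	rdma_get_cm_event(listen_channel, &event);	/* CONNECT_REQUEST */
	id = event->id;
	rdma_ack_cm_event(event);
	rdma_migrate_id(id, conn_channel);	/* id's events now go to conn_channel */
	rdma_accept(id, &conn_param);		/* ESTABLISHED arrives on conn_channel */
}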
- Sean From rdreier at cisco.com Fri May 23 10:42:24 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 10:42:24 -0700 Subject: [ofa-general] Re: linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c , fix compiler warnings In-Reply-To: <48342C6C.2010502@googlemail.com> (Gabriel C.'s message of "Wed, 21 May 2008 16:06:36 +0200") References: <48342C6C.2010502@googlemail.com> Message-ID: > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': > drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type Perhaps the best way to fix these is to change code like if (/* ScoreBoardDrainInProg */ test_bit(63, &hwstatus) || /* AbortInProg */ test_bit(62, &hwstatus) || /* InternalSDmaEnable */ test_bit(61, &hwstatus) || /* ScbEmpty */ !test_bit(30, &hwstatus)) { to something like if ((hwstatus & (IPATH_SDMA_STATUS_SCORE_BOARD_DRAIN_IN_PROG | IPATH_SDMA_STATUS_ABORT_IN_PROG | IPATH_SDMA_STATUS_INTERNAL_SDMA_ENABLE)) || !(hwstatus & IPATH_SDMA_STATUS_SCB_EMPTY)) { with appropriate defines for the constants 1ull << 63 etc. (I think I got the logic correct but someone should check) > drivers/infiniband/hw/ipath/ipath_sdma.c:348: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_restart_sdma': > drivers/infiniband/hw/ipath/ipath_sdma.c:618: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' I have a fix for this pending; will ask Linus to pull today. From rdreier at cisco.com Fri May 23 10:45:11 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 10:45:11 -0700 Subject: [ofa-general] Re: [ PATCH ] RDMA/nes Update MAINTAINERS list In-Reply-To: <200805211649.m4LGnwPP026935@velma.neteffect.com> (Chien Tung's message of "Wed, 21 May 2008 11:49:58 -0500") References: <200805211649.m4LGnwPP026935@velma.neteffect.com> Message-ID: > Adding Chien to maintainers list for NetEffect. No problem with this, but is it intentional to remove Nishi Gupta in the same patch? > -P: Nishi Gupta > -M: ngupta at neteffect.com From rdreier at cisco.com Fri May 23 10:44:30 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 10:44:30 -0700 Subject: [ofa-general][PATCH v1 2/2] mlx4: Default value for automatic completion vector selection In-Reply-To: <4834335D.8030903@mellanox.co.il> (Yevgeny Petrilin's message of "Wed, 21 May 2008 17:36:13 +0300") References: <4834335D.8030903@mellanox.co.il> Message-ID: > When the vector number passed to mlx4_cq_alloc is MLX4_ANY_VECTOR (0xff), > the driver selects the completion vector that has the least CQs attached > to it and attaches the CQ to the chosen vector. Ummm... how could an app/ULP use this sanely? Have a huge switch statement to choose MLX4_ANY_VECTOR / EHCA_ANY_VECTOR / FOOHCA_ANY_VECTOR? We need something generic like IB_CQ_VECTOR_LEAST_ATTACHED that specifies the policy in a driver-independent way. - R. 
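To make the generic suggestion concrete, something along these lines -- illustrative only, neither the name nor the value is an agreed-upon API:

/* hypothetical driver-independent policy value for the comp_vector argument */
enum {
	IB_CQ_VECTOR_LEAST_ATTACHED = -1,
};

/* a ULP could then ask for the policy without any driver-specific knowledge: */
cq = ib_create_cq(device, comp_handler, event_handler, context, cqe,
		  IB_CQ_VECTOR_LEAST_ATTACHED);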
From rdreier at cisco.com Fri May 23 10:57:41 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 10:57:41 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get fixes for various issues: - Various trivial fixes that get rid of warnings - A couple of oopsable bugs fixed - Fixes for mthca/mlx4 driver bugs that stop NFS/RDMA from working - MAINTAINERS entry for Chelsio drivers Andrew Morton (1): IB/mlx4: Fix uninitialized-var warning in mlx4_ib_post_send() Dave Olson (1): IB/mad: Fix kernel crash when .process_mad() returns SUCCESS|CONSUMED Jack Morgenstein (1): IPoIB: Test for NULL broadcast object in ipiob_mcast_join_finish() Ralph Campbell (1): IB/ipath: Fix UC receive completion opcode for RDMA WRITE with immediate Roland Dreier (4): IB/ipath: Fix printk format for ipath_sdma_status RDMA/cxgb3: Fix uninitialized variable warning in iwch_post_send() IB/mthca: Fix max_sge value returned by query_device IB/mlx4: Fix creation of kernel QP with max number of send s/g entries Steve Wise (1): MAINTAINERS: Add cxgb3 and iw_cxgb3 NIC and iWARP driver entries MAINTAINERS | 14 ++++++++++++++ drivers/infiniband/core/mad.c | 4 +++- drivers/infiniband/hw/cxgb3/iwch_qp.c | 2 +- drivers/infiniband/hw/ipath/ipath_sdma.c | 4 ++-- drivers/infiniband/hw/ipath/ipath_uc.c | 4 ++-- drivers/infiniband/hw/mlx4/qp.c | 15 +++++++++------ drivers/infiniband/hw/mthca/mthca_main.c | 14 +++++++++++++- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 6 ++++++ 8 files changed, 50 insertions(+), 13 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index bc1c008..907d8c4 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1239,6 +1239,20 @@ L: video4linux-list at redhat.com W: http://linuxtv.org S: Maintained +CXGB3 ETHERNET DRIVER (CXGB3) +P: Divy Le Ray +M: divy at chelsio.com +L: netdev at vger.kernel.org +W: http://www.chelsio.com +S: Supported + +CXGB3 IWARP RNIC DRIVER (IW_CXGB3) +P: Steve Wise +M: swise at chelsio.com +L: general at lists.openfabrics.org +W: http://www.openfabrics.org +S: Supported + CYBERPRO FB DRIVER P: Russell King M: rmk at arm.linux.org.uk diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index fbe16d5..1adf2ef 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -747,7 +747,9 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, break; case IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED: kmem_cache_free(ib_mad_cache, mad_priv); - break; + kfree(local); + ret = 1; + goto out; case IB_MAD_RESULT_SUCCESS: /* Treat like an incoming receive MAD */ port_priv = ib_get_mad_port(mad_agent_priv->agent.device, diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 79dbe5b..9926137 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -229,7 +229,7 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, struct ib_send_wr **bad_wr) { int err = 0; - u8 t3_wr_flit_cnt; + u8 uninitialized_var(t3_wr_flit_cnt); enum t3_wr_opcode t3_wr_opcode = 0; enum t3_wr_flags t3_wr_flags; struct iwch_qp *qhp; diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c index 3697449..0a8c1b8 100644 ---
a/drivers/infiniband/hw/ipath/ipath_sdma.c +++ b/drivers/infiniband/hw/ipath/ipath_sdma.c @@ -345,7 +345,7 @@ resched: * state change */ if (jiffies > dd->ipath_sdma_abort_jiffies) { - ipath_dbg("looping with status 0x%016llx\n", + ipath_dbg("looping with status 0x%08lx\n", dd->ipath_sdma_status); dd->ipath_sdma_abort_jiffies = jiffies + 5 * HZ; } @@ -615,7 +615,7 @@ void ipath_restart_sdma(struct ipath_devdata *dd) } spin_unlock_irqrestore(&dd->ipath_sdma_lock, flags); if (!needed) { - ipath_dbg("invalid attempt to restart SDMA, status 0x%016llx\n", + ipath_dbg("invalid attempt to restart SDMA, status 0x%08lx\n", dd->ipath_sdma_status); goto bail; } diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c index 7fd18e8..0596ec1 100644 --- a/drivers/infiniband/hw/ipath/ipath_uc.c +++ b/drivers/infiniband/hw/ipath/ipath_uc.c @@ -407,12 +407,11 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, dev->n_pkt_drops++; goto done; } - /* XXX Need to free SGEs */ + wc.opcode = IB_WC_RECV; last_imm: ipath_copy_sge(&qp->r_sge, data, tlen); wc.wr_id = qp->r_wr_id; wc.status = IB_WC_SUCCESS; - wc.opcode = IB_WC_RECV; wc.qp = &qp->ibqp; wc.src_qp = qp->remote_qpn; wc.slid = qp->remote_ah_attr.dlid; @@ -514,6 +513,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr, goto done; } wc.byte_len = qp->r_len; + wc.opcode = IB_WC_RECV_RDMA_WITH_IMM; goto last_imm; case OP(RDMA_WRITE_LAST): diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 8e02ecf..a80df22 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -333,6 +333,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_inline_data + sizeof (struct mlx4_wqe_inline_seg)) + send_wqe_overhead(type, qp->flags); + if (s > dev->dev->caps.max_sq_desc_sz) + return -EINVAL; + /* * Hermon supports shrinking WQEs, such that a single work * request can include multiple units of 1 << wqe_shift. 
This @@ -372,9 +375,6 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s)); for (;;) { - if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz) - return -EINVAL; - qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1U << qp->sq.wqe_shift); /* @@ -395,7 +395,8 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, ++qp->sq.wqe_shift; } - qp->sq.max_gs = ((qp->sq_max_wqes_per_wr << qp->sq.wqe_shift) - + qp->sq.max_gs = (min(dev->dev->caps.max_sq_desc_sz, + (qp->sq_max_wqes_per_wr << qp->sq.wqe_shift)) - send_wqe_overhead(type, qp->flags)) / sizeof (struct mlx4_wqe_data_seg); @@ -411,7 +412,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap, cap->max_send_wr = qp->sq.max_post = (qp->sq.wqe_cnt - qp->sq_spare_wqes) / qp->sq_max_wqes_per_wr; - cap->max_send_sge = qp->sq.max_gs; + cap->max_send_sge = min(qp->sq.max_gs, + min(dev->dev->caps.max_sq_sg, + dev->dev->caps.max_rq_sg)); /* We don't support inline sends for kernel QPs (yet) */ cap->max_inline_data = 0; @@ -1457,7 +1460,7 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, unsigned ind; int uninitialized_var(stamp); int uninitialized_var(size); - unsigned seglen; + unsigned uninitialized_var(seglen); int i; spin_lock_irqsave(&qp->sq.lock, flags); diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 9ebadd6..200cf13 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -45,6 +45,7 @@ #include "mthca_cmd.h" #include "mthca_profile.h" #include "mthca_memfree.h" +#include "mthca_wqe.h" MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("Mellanox InfiniBand HCA low-level driver"); @@ -200,7 +201,18 @@ static int mthca_dev_lim(struct mthca_dev *mdev, struct mthca_dev_lim *dev_lim) mdev->limits.gid_table_len = dev_lim->max_gids; mdev->limits.pkey_table_len = dev_lim->max_pkeys; mdev->limits.local_ca_ack_delay = dev_lim->local_ca_ack_delay; - mdev->limits.max_sg = dev_lim->max_sg; + /* + * Need to allow for worst case send WQE overhead and check + * whether max_desc_sz imposes a lower limit than max_sg; UD + * send has the biggest overhead. + */ + mdev->limits.max_sg = min_t(int, dev_lim->max_sg, + (dev_lim->max_desc_sz - + sizeof (struct mthca_next_seg) - + (mthca_is_memfree(mdev) ? 
+ sizeof (struct mthca_arbel_ud_seg) : + sizeof (struct mthca_tavor_ud_seg))) / + sizeof (struct mthca_data_seg)); mdev->limits.max_wqes = dev_lim->max_qp_sz; mdev->limits.max_qp_init_rdma = dev_lim->max_requester_per_qp; mdev->limits.reserved_qps = dev_lim->reserved_qps; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index d00a2c1..3f663fb 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -194,7 +194,13 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, /* Set the cached Q_Key before we attach if it's the broadcast group */ if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4, sizeof (union ib_gid))) { + spin_lock_irq(&priv->lock); + if (!priv->broadcast) { + spin_unlock_irq(&priv->lock); + return -EAGAIN; + } priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey); + spin_unlock_irq(&priv->lock); priv->tx_wr.wr.ud.remote_qkey = priv->qkey; } From rdreier at cisco.com Fri May 23 11:04:23 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 11:04:23 -0700 Subject: [ofa-general] [PATCH] [for-2.6.27] rdma: fix license text In-Reply-To: <000101c8bc27$df180960$ec248686@amr.corp.intel.com> (Sean Hefty's message of "Thu, 22 May 2008 09:21:20 -0700") References: <000101c8bc27$df180960$ec248686@amr.corp.intel.com> Message-ID: thanks, applied From rdreier at cisco.com Fri May 23 11:05:13 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 11:05:13 -0700 Subject: [ofa-general] Re: [ewg] [PATCH] IB/ipoib: copy small SKBs in CM mode In-Reply-To: <1211470815.7310.61.camel@eli-laptop> (Eli Cohen's message of "Thu, 22 May 2008 18:40:15 +0300") References: <1211470815.7310.61.camel@eli-laptop> Message-ID: > --- > > When running netperf I see significant improvement when using this patch > (BW Mbps): > > with patch: > sender receiver > 313 313 > > without the patch: > 509 134 Any reason why we wouldn't want this info in the patch changelog? Can you explain why the sender gets dramatically slower with the patch? - R. From ctung at NetEffect.com Fri May 23 11:17:15 2008 From: ctung at NetEffect.com (Chien Tung) Date: Fri, 23 May 2008 13:17:15 -0500 Subject: [ofa-general] RE: [ PATCH ] RDMA/nes Update MAINTAINERS list In-Reply-To: References: <200805211649.m4LGnwPP026935@velma.neteffect.com> Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC0811C8FB@venom2> > > Adding Chien to maintainers list for NetEffect. > > No problem with this, but is it intentional to remove Nishi > Gupta in the same patch? Yes. I should have mentioned it in the abstract. Chien From swise at opengridcomputing.com Fri May 23 11:20:30 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 May 2008 13:20:30 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48358428.2000902@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> Message-ID: <48370AEE.7080507@opengridcomputing.com> Or Gerlitz wrote: > Steve Wise wrote: >> Are we sure we need to expose this to the user? > I believe this is the way to go if we want to let smart ULPs generate > new rkey/stag per mapping.
Simpler ULPs could then just put the same > value for each map associated with the same mr. > > Or. > How should I add this to the API? Perhaps we just document the format of an rkey in the struct ib_mr. Thus the app would do this to change the key before posting the fast_reg_mr wr (coded to be explicit, not efficient):

u8 newkey;
u32 newrkey;

newkey = 0xaa;
newrkey = (mr->rkey & 0xffffff00) | newkey;
mr->rkey = newrkey;
wr.wr.fast_reg.mr = mr;
...

Note, this assumes mr->rkey is in host byte order (I think the linux rdma code assumes this in other places too). Steve. From dotanba at gmail.com Fri May 23 12:28:14 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 23 May 2008 21:28:14 +0200 Subject: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? In-Reply-To: References: Message-ID: <48371ACE.908@gmail.com> Hi. Yicheng Jia wrote: > > Hi Folks, > > I'm trying to use CQ Event notification for multiple completions > (ARM_N) according to Mellanox Lx III user manual for scatter/gathering > RDMA. However I couldn't find it in current MLX driver. It seems to me > that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are > multiple work requests, I have to use "poll_cq" to synchronously wait > until all the requests are done, is it correct? Is there a way to do > asynchronous multiple send by subscribing for a ARM_N event? You are right: the low level drivers of Mellanox devices don't support ARM-N (this feature is supported by the devices, but it wasn't implemented in the low level drivers). You are also right that in order to read all of the completions you need to use poll_cq. By the way: do you have to create a completion for every WR? (If you are using one QP, this may solve your problem.) Dotan From dotanba at gmail.com Fri May 23 12:29:49 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 23 May 2008 21:29:49 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <4836E231.4000601@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> Message-ID: <48371B2D.3040908@gmail.com> Hi. Do you use the latest released FW for this device? thanks Dotan Marcel Heinz wrote: > Hi, > > I have ported an application to use InfiniBand multicast directly via > libibverbs. I have discovered very low multicast throughput, only > ~250MByte/s although we are using 4x DDR components. To count out any > effects of the application, I've created a small benchmark (well, it's > only a hack). It just tries to keep the send/recv queue filled with work > request and polls the CQ in an endless loop. In server mode, it joins > to/creates the multicast group as FullMember, attaches the QP to the > group and receives any packets. The client joins as SendOnlyNonMember > and sends Datagrams of full MTU size to the group. > > The test setup is as follows: > > Host A <---> Switch <---> Host B > > We use Mellanox InfiniHost III Lx HCAs (MT25204) and a Flextronics > F-X430046 24-Port Switch, OFED 1.3 and a "vanilla" 2.6.23.9 Linux kernel. > > The results are:
>
> Host A         Host B      Throughput (MByte/sec)
> client         server       262
> client         2xserver     146
> client+server  server       944
> client+server  ---          946
>
> as reference: unicast ib_send_bw (in UD mode): 1146 > > I don't see any reason why it should become _faster_ when I additionally > start a server on the same host as the client. OTOH, the 944MByte/s > sound relatively sane when compared to the unicast performance with the > additional overhead of having to copy the data locally.
> > These 260MB/s seem relatively near to the 2GBit/s effective throughput > of a 1x SDR connection. However, the created group is rate 6 (20GBit/s) > and the /sys/class/infiniband/mthca0/ports/1/rate file showed 20 Gb/sec > during the whole test. > > The error counters of all ports are showing nothing abnormal. Only the > RcvSwRelayErrors counter of the switch's port (to the host running the > client) is increasing very fast, but this seems to be normal for > multicast packets, as the switch is not relaying these packets back to > the source. > > We could test on another cluster with 6 nodes (also with MT25204 HCAs, I > don't know the OFED version and switch type) and got the following results:
>
> Host1  Host2  Host3  Host4  Host5  Host6   Throughput (MByte/s)
> 1s     1s     1c                            255,15
> 1s     1s     1s     1c                     255,22
> 1s     1s     1s     1s     1c              255,22
> 1s     1s     1s     1s     1s     1c       255,22
>
> 1s1c   1s     1s                            738,64
> 1s1c   1s     1s     1s                     695,08
> 1s1c   1s     1s     1s     1s              565,14
> 1s1c   1s     1s     1s     1s     1s       451,90
>
> As long as there is no server and client on the same host, it at least > behaves like multicast. When having both client and server on the same > host, performance decreases as the number of servers increases, which is > totally surprising to me. > > Another test I did was doing an ib_send_bw (UD) benchmark while the > multicast benchmark was running between A and B. I got ~260MByte/s for > the multicast and also 260MB/s for ib_send_bw. > > Has anyone an idea of what is going on there or a hint what I should check? > > Regards, > Marcel > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From weiny2 at llnl.gov Fri May 23 11:54:38 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 23 May 2008 11:54:38 -0700 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root In-Reply-To: <20080523103532.GA4640@sashak.voltaire.com> References: <48358DF8.2060603@llnl.gov> <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> <20080523103532.GA4640@sashak.voltaire.com> Message-ID: <20080523115438.72900365.weiny2@llnl.gov> On Fri, 23 May 2008 13:35:32 +0300 Sasha Khapyorsky wrote: > On 08:17 Thu 22 May , Hal Rosenstock wrote: > > On Thu, 2008-05-22 at 08:15 -0700, Timothy A. Meier wrote: > > > Sasha, > > > > > > Trivial patch to enforce root for these perl scripts. More importantly, > > > doesn't silently fail if not root, and returns an error code. > > > > Should these enforce root or be based on udev permissions for umad which > > default to root ? > > I would ask the same question as Hal did. > > What is wrong with how it works now? On some systems access to the files could > be arranged for group members, or ibnetdiscover, which is used as the engine > for many scripts, could be made setuid/setgid. This change would break such setups. The problem is, if you don't know what a particular script or option does and it simply returns a prompt with a "0" return code, the user will THINK it did whatever it was supposed to do, when in fact it did nothing!!! This is especially bad with these scripts as most of them simply query the fabric. This could lead one to believe that it did not find any information to return when in fact it did not query the fabric at all. I realize that running things which you don't know what they do is bad, but for sure it should not return "0" when it clearly did not perform the requested operation because of an error in permissions.
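Concretely, all that is being asked for is that the tools fail loudly. In C terms (a fragment only, placed at the top of main(); auth_check as in Tim's patch earlier in this thread):

	if (!auth_check()) {
		fprintf(stderr, "%s: insufficient permissions to query the fabric\n",
			argv[0]);
		exit(1);	/* anything but a silent "0" */
	}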
Ira From rdreier at cisco.com Fri May 23 11:56:15 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 11:56:15 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48370AEE.7080507@opengridcomputing.com> (Steve Wise's message of "Fri, 23 May 2008 13:20:30 -0500") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> Message-ID:

> How should I add this to the API?
>
> Perhaps we just document the format of an rkey in the struct ib_mr.
> Thus the app would do this to change the key before posting the
> fast_reg_mr wr (coded to be explicit, not efficient):
>
> u8 newkey;
> u32 newrkey;
>
> newkey = 0xaa;
> newrkey = (mr->rkey & 0xffffff00) | newkey;
> mr->rkey = newrkey;
> wr.wr.fast_reg.mr = mr;

Don't like it -- too easy for the consumer to screw up the data structures. Seems simpler to just add a u8 "key" field (or maybe there's a better name) to the work request. - R. From swise at opengridcomputing.com Fri May 23 11:58:19 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 May 2008 13:58:19 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> Message-ID: <483713CB.3010408@opengridcomputing.com> Roland Dreier wrote:

> > How should I add this to the API?
> >
> > Perhaps we just document the format of an rkey in the struct ib_mr.
> > Thus the app would do this to change the key before posting the
> > fast_reg_mr wr (coded to be explicit, not efficient):
> >
> > u8 newkey;
> > u32 newrkey;
> >
> > newkey = 0xaa;
> > newrkey = (mr->rkey & 0xffffff00) | newkey;
> > mr->rkey = newrkey;
> > wr.wr.fast_reg.mr = mr;
>
> Don't like it -- too easy for the consumer to screw up the data
> structures.
>
> Seems simpler to just add a u8 "key" field (or maybe there's a better
> name) to the work request.

And then the provider updates the mr->rkey field as part of WR processing? Steve. From dotanba at gmail.com Fri May 23 13:14:36 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 23 May 2008 23:14:36 +0300 Subject: [ofa-general] [PATCH] core/include: fix coding style typos according to checkpatch.pl Message-ID: <200805232314.37003.dotanba@gmail.com> Fixed header files coding style typos according to checkpatch.pl (without harming code readability). Signed-off-by: Dotan Barak --- diff --git a/include/rdma/ib_cache.h b/include/rdma/ib_cache.h index f179d23..a5501e3 100644 --- a/include/rdma/ib_cache.h +++ b/include/rdma/ib_cache.h @@ -31,7 +31,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE.
* - * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef _IB_CACHE_H diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index a627c86..48a30bd 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -32,7 +32,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_cm.h 4311 2005-12-05 18:42:01Z sean.hefty $ */ #if !defined(IB_CM_H) #define IB_CM_H diff --git a/include/rdma/ib_fmr_pool.h b/include/rdma/ib_fmr_pool.h index 00dadbf..15195e6 100644 --- a/include/rdma/ib_fmr_pool.h +++ b/include/rdma/ib_fmr_pool.h @@ -30,7 +30,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_fmr_pool.h 2730 2005-06-28 16:43:03Z sean.hefty $ */ #if !defined(IB_FMR_POOL_H) @@ -61,7 +60,7 @@ struct ib_fmr_pool_param { int pool_size; int dirty_watermark; void (*flush_function)(struct ib_fmr_pool *pool, - void * arg); + void *arg); void *flush_arg; unsigned cache:1; }; diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h index 7228c05..edfc0a9 100644 --- a/include/rdma/ib_mad.h +++ b/include/rdma/ib_mad.h @@ -33,10 +33,9 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_mad.h 5596 2006-03-03 01:00:07Z sean.hefty $ */ -#if !defined( IB_MAD_H ) +#if !defined(IB_MAD_H) #define IB_MAD_H #include @@ -194,8 +193,7 @@ struct ib_vendor_mad { u8 data[IB_MGMT_VENDOR_DATA]; }; -struct ib_class_port_info -{ +struct ib_class_port_info { u8 base_version; u8 class_version; __be16 capability_mask; @@ -614,11 +612,11 @@ int ib_process_mad_wc(struct ib_mad_agent *mad_agent, * any class specific header, and MAD data area. * If @rmpp_active is set, the RMPP header will be initialized for sending. */ -struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, - u32 remote_qpn, u16 pkey_index, - int rmpp_active, - int hdr_len, int data_len, - gfp_t gfp_mask); +struct ib_mad_send_buf *ib_create_send_mad(struct ib_mad_agent *mad_agent, + u32 remote_qpn, u16 pkey_index, + int rmpp_active, + int hdr_len, int data_len, + gfp_t gfp_mask); /** * ib_is_mad_class_rmpp - returns whether given management class diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h index f926020..f1703d5 100644 --- a/include/rdma/ib_pack.h +++ b/include/rdma/ib_pack.h @@ -29,7 +29,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_pack.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef IB_PACK_H diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index 942692b..39c9780 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -31,7 +31,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_sa.h 2811 2005-07-06 18:11:43Z halr $ */ #ifndef IB_SA_H diff --git a/include/rdma/ib_smi.h b/include/rdma/ib_smi.h index f29af13..da9428c 100644 --- a/include/rdma/ib_smi.h +++ b/include/rdma/ib_smi.h @@ -33,10 +33,9 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_smi.h 1389 2004-12-27 22:56:47Z roland $ */ -#if !defined( IB_SMI_H ) +#if !defined(IB_SMI_H) #define IB_SMI_H #include diff --git a/include/rdma/ib_user_cm.h b/include/rdma/ib_user_cm.h index 37650af..7e7571c 100644 --- a/include/rdma/ib_user_cm.h +++ b/include/rdma/ib_user_cm.h @@ -30,7 +30,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: ib_user_cm.h 4019 2005-11-11 00:33:09Z sean.hefty $ */ #ifndef IB_USER_CM_H diff --git a/include/rdma/ib_user_mad.h b/include/rdma/ib_user_mad.h index 29d2c72..11a8dde 100644 --- a/include/rdma/ib_user_mad.h +++ b/include/rdma/ib_user_mad.h @@ -30,7 +30,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_user_mad.h 2814 2005-07-06 19:14:09Z halr $ */ #ifndef IB_USER_MAD_H diff --git a/include/rdma/ib_user_verbs.h b/include/rdma/ib_user_verbs.h index 8d65bf0..e226c45 100644 --- a/include/rdma/ib_user_verbs.h +++ b/include/rdma/ib_user_verbs.h @@ -32,7 +32,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_user_verbs.h 4019 2005-11-11 00:33:09Z sean.hefty $ */ #ifndef IB_USER_VERBS_H diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..2a3bf8f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -35,7 +35,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ */ #if !defined(IB_VERBS_H) @@ -778,7 +777,7 @@ struct ib_cq { struct ib_uobject *uobject; ib_comp_handler comp_handler; void (*event_handler)(struct ib_event *, void *); - void * cq_context; + void *cq_context; int cqe; atomic_t usecnt; /* count number of work queues */ }; @@ -884,7 +883,7 @@ struct ib_dma_mapping_ops { void (*sync_single_for_cpu)(struct ib_device *dev, u64 dma_handle, size_t size, - enum dma_data_direction dir); + enum dma_data_direction dir); void (*sync_single_for_device)(struct ib_device *dev, u64 dma_handle, size_t size, diff --git a/include/rdma/iw_cm.h b/include/rdma/iw_cm.h index aeefa9b..cbb822e 100644 --- a/include/rdma/iw_cm.h +++ b/include/rdma/iw_cm.h @@ -62,7 +62,7 @@ struct iw_cm_event { struct sockaddr_in remote_addr; void *private_data; u8 private_data_len; - void* provider_data; + void *provider_data; }; /** diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h index 010f876..37eebb3 100644 --- a/include/rdma/rdma_cm.h +++ b/include/rdma/rdma_cm.h @@ -57,11 +57,11 @@ enum rdma_cm_event_type { }; enum rdma_port_space { - RDMA_PS_SDP = 0x0001, - RDMA_PS_IPOIB= 0x0002, - RDMA_PS_TCP = 0x0106, - RDMA_PS_UDP = 0x0111, - RDMA_PS_SCTP = 0x0183 + RDMA_PS_SDP = 0x0001, + RDMA_PS_IPOIB = 0x0002, + RDMA_PS_TCP = 0x0106, + RDMA_PS_UDP = 0x0111, + RDMA_PS_SCTP = 0x0183 }; struct rdma_addr { From dotanba at gmail.com Fri May 23 12:52:03 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 23 May 2008 22:52:03 +0300 Subject: [ofa-general] [PATCH] core/include: fix coding style typos according to checkpatch.pl Message-ID: <200805232252.04047.dotanba@gmail.com> Fixed header files coding style typos according to checkpatch.pl (without harming code readability). Signed-off-by: Dotan Barak --- diff --git a/include/rdma/ib_cache.h b/include/rdma/ib_cache.h index f179d23..a5501e3 100644 --- a/include/rdma/ib_cache.h +++ b/include/rdma/ib_cache.h @@ -31,7 +31,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_cache.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef _IB_CACHE_H diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index a627c86..48a30bd 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -32,7 +32,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: ib_cm.h 4311 2005-12-05 18:42:01Z sean.hefty $ */ #if !defined(IB_CM_H) #define IB_CM_H diff --git a/include/rdma/ib_fmr_pool.h b/include/rdma/ib_fmr_pool.h index 00dadbf..15195e6 100644 --- a/include/rdma/ib_fmr_pool.h +++ b/include/rdma/ib_fmr_pool.h @@ -30,7 +30,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_fmr_pool.h 2730 2005-06-28 16:43:03Z sean.hefty $ */ #if !defined(IB_FMR_POOL_H) @@ -61,7 +60,7 @@ struct ib_fmr_pool_param { int pool_size; int dirty_watermark; void (*flush_function)(struct ib_fmr_pool *pool, - void * arg); + void *arg); void *flush_arg; unsigned cache:1; }; diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h index 7228c05..edfc0a9 100644 --- a/include/rdma/ib_mad.h +++ b/include/rdma/ib_mad.h @@ -33,10 +33,9 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_mad.h 5596 2006-03-03 01:00:07Z sean.hefty $ */ -#if !defined( IB_MAD_H ) +#if !defined(IB_MAD_H) #define IB_MAD_H #include @@ -194,8 +193,7 @@ struct ib_vendor_mad { u8 data[IB_MGMT_VENDOR_DATA]; }; -struct ib_class_port_info -{ +struct ib_class_port_info { u8 base_version; u8 class_version; __be16 capability_mask; @@ -614,11 +612,11 @@ int ib_process_mad_wc(struct ib_mad_agent *mad_agent, * any class specific header, and MAD data area. * If @rmpp_active is set, the RMPP header will be initialized for sending. */ -struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent, - u32 remote_qpn, u16 pkey_index, - int rmpp_active, - int hdr_len, int data_len, - gfp_t gfp_mask); +struct ib_mad_send_buf *ib_create_send_mad(struct ib_mad_agent *mad_agent, + u32 remote_qpn, u16 pkey_index, + int rmpp_active, + int hdr_len, int data_len, + gfp_t gfp_mask); /** * ib_is_mad_class_rmpp - returns whether given management class diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h index f926020..f1703d5 100644 --- a/include/rdma/ib_pack.h +++ b/include/rdma/ib_pack.h @@ -29,7 +29,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_pack.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef IB_PACK_H diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index 942692b..39c9780 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -31,7 +31,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_sa.h 2811 2005-07-06 18:11:43Z halr $ */ #ifndef IB_SA_H diff --git a/include/rdma/ib_smi.h b/include/rdma/ib_smi.h index f29af13..da9428c 100644 --- a/include/rdma/ib_smi.h +++ b/include/rdma/ib_smi.h @@ -33,10 +33,9 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_smi.h 1389 2004-12-27 22:56:47Z roland $ */ -#if !defined( IB_SMI_H ) +#if !defined(IB_SMI_H) #define IB_SMI_H #include diff --git a/include/rdma/ib_user_cm.h b/include/rdma/ib_user_cm.h index 37650af..7e7571c 100644 --- a/include/rdma/ib_user_cm.h +++ b/include/rdma/ib_user_cm.h @@ -30,7 +30,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_user_cm.h 4019 2005-11-11 00:33:09Z sean.hefty $ */ #ifndef IB_USER_CM_H diff --git a/include/rdma/ib_user_mad.h b/include/rdma/ib_user_mad.h index 29d2c72..11a8dde 100644 --- a/include/rdma/ib_user_mad.h +++ b/include/rdma/ib_user_mad.h @@ -30,7 +30,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
* - * $Id: ib_user_mad.h 2814 2005-07-06 19:14:09Z halr $ */ #ifndef IB_USER_MAD_H diff --git a/include/rdma/ib_user_verbs.h b/include/rdma/ib_user_verbs.h index 8d65bf0..e226c45 100644 --- a/include/rdma/ib_user_verbs.h +++ b/include/rdma/ib_user_verbs.h @@ -32,7 +32,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_user_verbs.h 4019 2005-11-11 00:33:09Z sean.hefty $ */ #ifndef IB_USER_VERBS_H diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..2a3bf8f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -35,7 +35,6 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ib_verbs.h 1349 2004-12-16 21:09:43Z roland $ */ #if !defined(IB_VERBS_H) @@ -778,7 +777,7 @@ struct ib_cq { struct ib_uobject *uobject; ib_comp_handler comp_handler; void (*event_handler)(struct ib_event *, void *); - void * cq_context; + void *cq_context; int cqe; atomic_t usecnt; /* count number of work queues */ }; @@ -884,7 +883,7 @@ struct ib_dma_mapping_ops { void (*sync_single_for_cpu)(struct ib_device *dev, u64 dma_handle, size_t size, - enum dma_data_direction dir); + enum dma_data_direction dir); void (*sync_single_for_device)(struct ib_device *dev, u64 dma_handle, size_t size, diff --git a/include/rdma/iw_cm.h b/include/rdma/iw_cm.h index aeefa9b..cbb822e 100644 --- a/include/rdma/iw_cm.h +++ b/include/rdma/iw_cm.h @@ -62,7 +62,7 @@ struct iw_cm_event { struct sockaddr_in remote_addr; void *private_data; u8 private_data_len; - void* provider_data; + void *provider_data; }; /** diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h index 010f876..37eebb3 100644 --- a/include/rdma/rdma_cm.h +++ b/include/rdma/rdma_cm.h @@ -57,11 +57,11 @@ enum rdma_cm_event_type { }; enum rdma_port_space { - RDMA_PS_SDP = 0x0001, - RDMA_PS_IPOIB= 0x0002, - RDMA_PS_TCP = 0x0106, - RDMA_PS_UDP = 0x0111, - RDMA_PS_SCTP = 0x0183 + RDMA_PS_SDP = 0x0001, + RDMA_PS_IPOIB = 0x0002, + RDMA_PS_TCP = 0x0106, + RDMA_PS_UDP = 0x0111, + RDMA_PS_SCTP = 0x0183 }; struct rdma_addr { From weiny2 at llnl.gov Fri May 23 12:18:15 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 23 May 2008 12:18:15 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211540855.13185.71.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080522154702.430cdef7.weiny2@llnl.gov> <1211540855.13185.71.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080523121815.2c39e65a.weiny2@llnl.gov> On Fri, 23 May 2008 04:07:35 -0700 Hal Rosenstock wrote: > On Thu, 2008-05-22 at 15:47 -0700, Ira Weiny wrote: > > I guess my question is "does saquery need this to talk to the SA?" > > > > I am assuming the answer is "yes". > > It depends on whether trusted operations are needed to be supported or > not. A normal node has no need for trusted operations. There was a > reason why the additional information was hidden with a key. It allows a > malicious user to effect not just his node but the subnet. Ok... I guess from your other emails the point is that ULP's must get these keys by some "out of spec" method? saquery only queries information, much of which I think ULP's require to establish connections etc. How are others solving this problem? > > As I mentioned, this starts to be a slippery slope with the management > keys. 
I think a better approach when non default key is in place is to > support this via the OpenSM console as OpenSM knows all the keys it's > supposed to. When you mention this I start to think about the secure API which Tim submitted a few months ago and was not accepted. I know we are still discussing how to do "secure console" but perhaps this is a very valid use case for the SM to answer SSL socket connections to get keys? (yea, no longer thinking... ;-) Ira > > > I noticed this in the spec section 14.4.7 page 890: > > > > "The SM Key used for SM authentication is independent of the SM Key in the > > SA header used for SA authentication." > > > > Does this mean there could be 2 SM_Key values in use? > > This was a clarification added at IBA 1.2.1. The SA SMKey is really an > SA Key. This lack of separation is a limitation in the current OpenSM > implementation. > > -- Hal > > > Ira > > > > > > On Thu, 22 May 2008 08:10:29 -0700 > > Hal Rosenstock wrote: > > > > > On Thu, 2008-05-22 at 17:56 +0300, Sasha Khapyorsky wrote: > > > > On 07:46 Thu 22 May , Hal Rosenstock wrote: > > > > > Sasha, > > > > > > > > > > On Thu, 2008-05-22 at 16:53 +0300, Sasha Khapyorsky wrote: > > > > > > This adds possibility to specify SM_Key value with saquery. It should > > > > > > work with queries where OSM_DEFAULT_SM_KEY was used. > > > > > > > > > > I think this starts down a slippery slope and perhaps bad precedent for > > > > > MKey as well. I know this is useful as a debug tool but compromises what > > > > > purports as "security" IMO as this means the keys need to be too widely > > > > > known. > > > > When different than OSM_DEFAULT_SM_KEY value is configured on OpenSM > > > > side an user may know this or not, in later case saquery will not work > > > > (just like now). I don't see a hole. > > > I think it will tend towards proliferation of keys which will defeat any > > > security/trust. The idea of SMKey was to keep it private between SMs. > > > This is now spreading it wider IMO. I'm sure other patches will follow > > > in the same vein once an MKey manager exists. > > > -- Hal > > > > Sasha
From dotanba at gmail.com Fri May 23 13:32:07 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 23 May 2008 23:32:07 +0300 Subject: [ofa-general] [PATCH] libibverbs: fix coding style typos according to checkpatch.pl Message-ID: <200805232332.07576.dotanba@gmail.com> Fixed coding style typos according to checkpatch.pl (without harming code readability). Signed-off-by: Dotan Barak --- diff --git a/examples/devinfo.c b/examples/devinfo.c index 1fadc80..86ad7da 100644 --- a/examples/devinfo.c +++ b/examples/devinfo.c @@ -48,7 +48,7 @@ #include #include -static int verbose = 0; +static int verbose; static int null_gid(union ibv_gid *gid) { @@ -231,9 +231,8 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port) device_attr.max_total_mcast_qp_attach); printf("\tmax_ah:\t\t\t\t%d\n", device_attr.max_ah); printf("\tmax_fmr:\t\t\t%d\n", device_attr.max_fmr); - if (device_attr.max_fmr) { + if (device_attr.max_fmr) printf("\tmax_map_per_fmr:\t\t%d\n", device_attr.max_map_per_fmr); - } printf("\tmax_srq:\t\t\t%d\n", device_attr.max_srq); if (device_attr.max_srq) { printf("\tmax_srq_wr:\t\t\t%d\n", device_attr.max_srq_wr); diff --git a/examples/srq_pingpong.c b/examples/srq_pingpong.c index 95bebf4..e47bae6 100644 --- a/examples/srq_pingpong.c +++ b/examples/srq_pingpong.c @@ -143,7 +143,7 @@ static struct pingpong_dest *pp_client_exch_dest(const char *servername, int por .ai_socktype = SOCK_STREAM }; char *service; - char msg[ sizeof "0000:000000:000000"]; + char msg[sizeof "0000:000000:000000"]; int n; int r; int i; @@ -227,7 +227,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, .ai_socktype = SOCK_STREAM }; char *service; - char msg[ sizeof "0000:000000:000000"]; + char msg[sizeof "0000:000000:000000"]; int n; int r; int i; @@ -275,7 +275,7 @@ static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx, return NULL; } - rem_dest = malloc(MAX_QP *sizeof *rem_dest); + rem_dest = malloc(MAX_QP * sizeof *rem_dest); if (!rem_dest) goto out; diff --git a/src/cmd.c b/src/cmd.c index 9db8aa6..66d7134 100644 --- a/src/cmd.c +++ b/src/cmd.c @@ -851,7 +851,7 @@ int ibv_cmd_post_send(struct ibv_qp *ibqp, struct ibv_send_wr *wr, tmp->wr.ud.remote_qpn = i->wr.ud.remote_qpn; tmp->wr.ud.remote_qkey = i->wr.ud.remote_qkey; } else { - switch(i->opcode) { + switch (i->opcode) { case IBV_WR_RDMA_WRITE: case IBV_WR_RDMA_WRITE_WITH_IMM: case IBV_WR_RDMA_READ: diff --git a/src/ibverbs.h b/src/ibverbs.h index b1d2c2b..6a6e3c8 100644 --- a/src/ibverbs.h +++ b/src/ibverbs.h @@ -49,7 +49,7 @@ #endif /* HAVE_VALGRIND_MEMCHECK_H */ #ifndef VALGRIND_MAKE_MEM_DEFINED -# define VALGRIND_MAKE_MEM_DEFINED(addr,len) +# define VALGRIND_MAKE_MEM_DEFINED(addr, len) #endif #define HIDDEN __attribute__((visibility ("hidden"))) diff --git a/src/init.c b/src/init.c index 07ab855..82dfae4 100644 --- a/src/init.c +++ b/src/init.c @@ -321,7 +321,7 @@ static void read_config(void) goto next; read_config_file(path); - next: +next: free(path); } From rdreier at cisco.com Fri May 23 13:05:35 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 13:05:35 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483713CB.3010408@opengridcomputing.com> (Steve Wise's message of "Fri, 23 May 2008 13:58:19 -0500") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com>
<4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <483713CB.3010408@opengridcomputing.com> Message-ID: > And then the provider updates the mr->rkey field as part of WR processing? Yeah, I guess so. Actually thinking about it, another possibility would be to wrap up the > newrkey = (mr->rkey & 0xffffff00) | newkey; operation in a little inline helper function so people don't screw it up. Maybe that's the cleanest way to do it. (We would probably want the helper for low-level driver use anyway) - R. From rdreier at cisco.com Fri May 23 13:12:38 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 23 May 2008 13:12:38 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: (Roland Dreier's message of "Fri, 23 May 2008 13:05:35 -0700") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <483713CB.3010408@opengridcomputing.com> Message-ID: > Actually thinking about it, another possibility would be to wrap up the > newrkey = (mr->rkey & 0xffffff00) | newkey; > operation in a little inline helper function so people don't screw it > up. Maybe that's the cleanest way to do it. If we add a "key" field to the work request, then it seems too easy for a consumer to forget to set it and end up passing uninitialized garbage. If the consumer has to explicitly update the key when posting the work request then that failure is avoided. HOWEVER -- if we have the consumer update the key when posting the operation, then there is the problem of what happens when the consumer posts multiple fastreg work requests at once (ie fastreg, local inval, new fastreg, etc. in a pipelined way). Does the low-level driver just take the key value given when the WR is posted, even if there's a new value there by the time the WR is executed? - R. From ralph.campbell at qlogic.com Fri May 23 14:43:29 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 23 May 2008 14:43:29 -0700 Subject: [ofa-general] [PATCH 0/1] IB/ipath -- fix for 2.6.26 Message-ID: <20080523214329.20736.85555.stgit@eng-46.mv.qlogic.com> The following patch fixes a minor bug that Or Gerlitz found. IB/ipath - fix device capability flags This can be pulled into Roland's infiniband.git for-2.6.26 repo using: git pull git://git.qlogic.com/ipath-linux-2.6 for-roland From ralph.campbell at qlogic.com Fri May 23 14:43:34 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 23 May 2008 14:43:34 -0700 Subject: [ofa-general] [PATCH] IB/ipath - fix device capability flags In-Reply-To: <20080523214329.20736.85555.stgit@eng-46.mv.qlogic.com> References: <20080523214329.20736.85555.stgit@eng-46.mv.qlogic.com> Message-ID: <20080523214334.20736.86499.stgit@eng-46.mv.qlogic.com> The driver supports a few features that were not reported in the device capability flags. This patch fixes that.
Signed-off-by: Ralph Campbell --- drivers/infiniband/hw/ipath/ipath_verbs.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c index e0ec540..7779165 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.c +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c @@ -1494,7 +1494,8 @@ static int ipath_query_device(struct ib_device *ibdev, props->device_cap_flags = IB_DEVICE_BAD_PKEY_CNTR | IB_DEVICE_BAD_QKEY_CNTR | IB_DEVICE_SHUTDOWN_PORT | - IB_DEVICE_SYS_IMAGE_GUID; + IB_DEVICE_SYS_IMAGE_GUID | IB_DEVICE_RC_RNR_NAK_GEN | + IB_DEVICE_PORT_ACTIVE_EVENT | IB_DEVICE_SRQ_RESIZE; props->page_size_cap = PAGE_SIZE; props->vendor_id = dev->dd->ipath_vendorid; props->vendor_part_id = dev->dd->ipath_deviceid; From ralph.campbell at qlogic.com Fri May 23 14:45:01 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 23 May 2008 14:45:01 -0700 Subject: [ofa-general] Re: linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c, fix compiler warnings In-Reply-To: References: <48342C6C.2010502@googlemail.com> Message-ID: <1211579101.3949.326.camel@brick.pathscale.com> This looks good to me. On Fri, 2008-05-23 at 10:42 -0700, Roland Dreier wrote: > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': > > drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type > > Perhaps the best way to fix these is to change code like > > if (/* ScoreBoardDrainInProg */ > test_bit(63, &hwstatus) || > /* AbortInProg */ > test_bit(62, &hwstatus) || > /* InternalSDmaEnable */ > test_bit(61, &hwstatus) || > /* ScbEmpty */ > !test_bit(30, &hwstatus)) { > > to something like > > if ((hwstatus & (IPATH_SDMA_STATUS_SCORE_BOARD_DRAIN_IN_PROG | > IPATH_SDMA_STATUS_ABORT_IN_PROG | > IPATH_SDMA_STATUS_INTERNAL_SDMA_ENABLE)) || > !(hwstatus & IPATH_SDMA_STATUS_SCB_EMPTY)) { > > with appropriate defines for the constants 1ull << 63 etc. > > (I think I got the logic correct but someone should check) > > > drivers/infiniband/hw/ipath/ipath_sdma.c:348: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' > > drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'ipath_restart_sdma': > > drivers/infiniband/hw/ipath/ipath_sdma.c:618: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' > > I have a fix for this pending; will ask Linus to pull today. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
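A minimal sketch of the defines Roland's suggestion implies. The bit positions come straight from the quoted test_bit() calls and the names from his rewritten condition; their exact home (ipath_sdma.c itself or a header it includes) is an assumption:

/* assumed placement: drivers/infiniband/hw/ipath/ipath_sdma.c or a shared header */
#define IPATH_SDMA_STATUS_SCORE_BOARD_DRAIN_IN_PROG	(1ull << 63)	/* ScoreBoardDrainInProg */
#define IPATH_SDMA_STATUS_ABORT_IN_PROG			(1ull << 62)	/* AbortInProg */
#define IPATH_SDMA_STATUS_INTERNAL_SDMA_ENABLE		(1ull << 61)	/* InternalSDmaEnable */
#define IPATH_SDMA_STATUS_SCB_EMPTY			(1ull << 30)	/* ScbEmpty */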
From swise at opengridcomputing.com Fri May 23 20:13:24 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 May 2008 22:13:24 -0500 Subject: [ofa-general] [ANNOUNCE] chelsio rnic 6.0 firmware Message-ID: <483787D4.3030805@opengridcomputing.com> Chelsio iWARP fans, The new ofed-1.3.1 cxgb3 drivers require a firmware upgrade for the chelsio rnic. You can pull the firmware from: http://service.chelsio.com/drivers/firmware/t3/t3fw-6.0.0.bin.gz Unzip it and place it in /lib/firmware on your systems. Then the next time you reload and configure cxgb3 it will install the new firmware. Thanks, Steve. From swise at opengridcomputing.com Fri May 23 20:32:24 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 May 2008 22:32:24 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <483713CB.3010408@opengridcomputing.com> Message-ID: <48378C48.5060904@opengridcomputing.com> Roland Dreier wrote: > > Actually thinking about it, another possibility would be to wrap up the > > > newrkey = (mr->rkey & 0xffffff00) | newkey; > > > operation in a little inline helper function so people don't screw it > > up. Maybe that's the cleanest way to do it. > > If we add a "key" field to the work request, then it seems too easy for > a consumer to forget to set it and end up passing uninitialized garbage. > If the consumer has to explicitly update the key when posting the work > request then that failure is avoided. > > HOWEVER -- if we have the consumer update the key when posting the > operation, then there is the problem of what happens when the consumer > posts multiple fastreg work requests at once (ie fastreg, local inval, > new fastreg, etc. in a pipelined way). Does the low-level driver just > take the key value given when the WR is posted, even if there's a > new value there by the time the WR is executed? > I would have to say yes. And it makes sense, I think. Say rkey is 0x010203XX. Then a pipeline could look like: fastreg (mr->rkey is 0x01020301) rdma read (mr->rkey is 0x01020301) invalidate local with fence (mr->rkey is 0x01020301) fastreg (mr->rkey is 0x01020302) rdma read (sink mr->rkey is 0x01020302) invalidate local with fence (mr->rkey is 0x01020302) So the consumer is using the correct mr->rkey at all times even though the rnic is possibly processing the previous generation (that was copied into a fastreg WR at an earlier point in time) at the same time as the app is registering the next generation of the rkey. Steve.
From swise at opengridcomputing.com Fri May 23 20:42:42 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 23 May 2008 22:42:42 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48378C48.5060904@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <483713CB.3010408@opengridcomputing.com> <48378C48.5060904@opengridcomputing.com> Message-ID: <48378EB2.2060005@opengridcomputing.com> Steve Wise wrote: > > > Roland Dreier wrote: >> > Actually thinking about it, another possibility would be to wrap up >> the >> >> > newrkey = (mr->rkey & 0xffffff00) | newkey; >> >> > operation in a little inline helper function so people don't screw it >> > up. Maybe that's the cleanest way to do it. >> >> If we add a "key" field to the work request, then it seems too easy for >> a consumer to forget to set it and end up passing uninitialized garbage. >> If the consumer has to explicitly update the key when posting the work >> request then that failure is avoided. >> >> HOWEVER -- if we have the consumer update the key when posting the >> operation, then there is the problem of what happens when the consumer >> posts multiple fastreg work requests at once (ie fastreg, local inval, >> new fastreg, etc. in a pipelined way). Does the low-level driver just >> take the key value given when the WR is posted, even if there's a >> new value there by the time the WR is executed? >> > > I would have to say yes. And it makes sense, I think. > > Say rkey is 0x010203XX. Then a pipeline could look like: > > fastreg (mr->rkey is 0x01020301) > rdma read (mr->rkey is 0x01020301) > invalidate local with fence (mr->rkey is 0x01020301) > fastreg (mr->rkey is 0x01020302) > rdma read (sink mr->rkey is 0x01020302) > invalidate local with fence (mr->rkey is 0x01020302) > > So the consumer is using the correct mr->rkey at all times even though > the rnic is possibly processing the previous generation (that was copied > into a fastreg WR at an earlier point in time) at the same time as the > app is registering the next generation of the rkey. > So something like this? static void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey) { /* iWARP: rkey == lkey */ if (mr->rkey == mr->lkey) mr->lkey = mr->lkey & 0xffffff00 | newkey; mr->rkey = mr->rkey & 0xffffff00 | newkey; }
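A hypothetical caller for the helper above, following the fastreg/read/invalidate generation sequence Steve sketched in his previous message. Illustration only: post_fastreg(), post_rdma_read() and post_local_inval_fenced() are made-up stand-ins for ib_post_send() calls whose work request layout is exactly what this thread is still settling:

/* sketch, not the proposed API: the post_*() helpers are hypothetical
 * wrappers around ib_post_send() */
static void do_one_fastreg_io(struct ib_qp *qp, struct ib_mr *mr, u8 *gen)
{
	ib_update_fast_reg_key(mr, ++(*gen));	/* bump low-byte generation, e.g. ...01 -> ...02 */
	post_fastreg(qp, mr);			/* MR -> VALID, bound to its page list */
	post_rdma_read(qp, mr->rkey);		/* transfer using this generation's rkey */
	post_local_inval_fenced(qp, mr->rkey);	/* MR -> INVALID, ready for the next cycle */
}

Each posted WR carries the key value captured at post time, so the RNIC can still be executing generation N while the consumer is already registering generation N+1.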
From vlad at lists.openfabrics.org Sat May 24 03:09:34 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 24 May 2008 03:09:34 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080524-0200 daily build status Message-ID: <20080524100934.E47A1E60B71@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed:
From ogerlitz at voltaire.com Sat May 24 22:19:07 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 25 May 2008 08:19:07 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223419.27221.49014.stgit@dell3.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> Message-ID: <4838F6CB.2040203@voltaire.com> Steve Wise wrote: > Usage Model: > - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) > - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) Hi Steve, Roland, After discussing the rkey renew and fencing with send/rdma ops, I am quite clear with how all this plugs well into ULPs such as SCSI or FS low-level (interconnect) initiator/target drivers, specifically those who use a transactional protocol. A few more points to clarify (sorry if this became somewhat long): * Do we want it to be a must for a consumer to invalidate a fast-reg mr before reusing it? If yes, how? * If remote invalidation is supported, when the peer is done with the mr, it sends the "response" in send-with-invalidate fashion and saves the mapper side from doing a local invalidate. For the case of the mapping produced by SCSI initiator or FS client, when remote invalidation is not supported, I don't see how a local invalidate design can be made in a pipe-lined manner - since from the network perspective the I/O is done and the target response is at hand, but until the mr is invalidated the pages are typically not returned to the upper layer and the ULP has to "stall" till the invalidation WR is completed. I don't say it's a bug or a big issue, just wondering what your thoughts are regarding this point. * Talking about remote invalidation, I understand that it requires support of both sides (and hence has to be negotiated), so the IB_DEVICE_SEND_W_INV device capability says that a device can send-with-invalidate, do we need an IB_DEVICE_RECV_W_INV cap as well? * What about ZBVA, is it orthogonal to these calls, no enhancement of the suggested API is needed even if zbva is used, or the other way, it would work also when zbva is not used?
Or From ogerlitz at voltaire.com Sat May 24 22:32:29 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 25 May 2008 08:32:29 +0300 Subject: [ofa-general] [PATCH] management: Support separate SA and SM keys In-Reply-To: <1211550432.13185.121.camel@hrosenstock-ws.xsigo.com> References: <1211550432.13185.121.camel@hrosenstock-ws.xsigo.com> Message-ID: <4838F9ED.9090304@voltaire.com> Hal Rosenstock wrote: > management: Support separate SA and SM keys as clarified in IBA 1.2.1 Is some host side patch needed to inter-operate with this change? Or. From vlad at lists.openfabrics.org Sun May 25 03:09:17 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 25 May 2008 03:09:17 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080525-0200 daily build status Message-ID: <20080525100917.9E205E60C2A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From ogerlitz at voltaire.com Sun May 25 05:27:34 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 25 May 2008 15:27:34 +0300 (IDT) Subject: [ofa-general] Re: got
scheduling while atomic in ipoib (was net/bonding: announce fail-over for the active-backup mode) In-Reply-To: References: Message-ID: > Enhance bonding to announce fail-over for the active-backup mode through > the netdev events notifier chain mechanism. Such an event can be of use > for the RDMA CM (communication manager) to let native RDMA ULPs (eg > NFS-RDMA, iSER) always use the same links as the IP stack does. > --- linux-2.6.26-rc2.orig/drivers/net/bonding/bond_main.c 2008-05-13 10:02:22.000000000 +0300 > +++ linux-2.6.26-rc2/drivers/net/bonding/bond_main.c 2008-05-15 12:29:44.000000000 +0300 > @@ -1117,6 +1117,7 @@ void bond_change_active_slave(struct bon > bond->send_grat_arp = 1; > } else > bond_send_gratuitous_arp(bond); > + netdev_bonding_change(bond->dev); > } > } > --- linux-2.6.26-rc2.orig/net/core/dev.c 2008-05-13 10:02:31.000000000 +0300 > +++ linux-2.6.26-rc2/net/core/dev.c 2008-05-13 11:50:49.000000000 +0300 > @@ -956,6 +956,12 @@ void netdev_state_change(struct net_devi > } > } > > +void netdev_bonding_change(struct net_device *dev) > +{ > + call_netdevice_notifiers(NETDEV_BONDING_FAILOVER, dev); > +} > +EXPORT_SYMBOL(netdev_bonding_change); Hi Roland, I have enhanced the bonding driver to deliver an event through the netdev notifier chain, and I am getting this "scheduling while atomic" warning. The function __bond_mii_monitor does spin_lock_bh before calling bond_select_active_slave() which calls bond_change_active_slave(), so maybe it's not a good idea to deliver events under these atomic conditions, but I still want to make sure I didn't step on some problem in ipoib (per the :ib_ipoib:ipoib_start_xmit+0x445/0x459 line in the trace), any idea? bonding: bond0: link status definitely down for interface ib0, disabling it bonding: bond0: making interface ib1 the new active one. BUG: scheduling while atomic: bond0/14237/0x10000100 Pid: 14237, comm: bond0 Not tainted 2.6.26-rc3 #4 Call Trace: [] schedule+0x98/0x57b [] dbg_redzone1+0x16/0x1f [] :ib_ipoib:ipoib_start_xmit+0x445/0x459 [] kmem_cache_alloc_node+0x147/0x177 [] __alloc_skb+0x35/0x12b [] __cond_resched+0x1c/0x43 [] _cond_resched+0x2d/0x38 [] kmem_cache_alloc_node+0x25/0x177 [] __alloc_skb+0x35/0x12b [] rtmsg_ifinfo+0x3a/0xd4 [] rtnetlink_event+0x3d/0x41 [] notifier_call_chain+0x30/0x54 [] :bonding:bond_select_active_slave+0xb9/0xe8 [] :bonding:__bond_mii_monitor+0x43a/0x464 [] :bonding:bond_mii_monitor+0x5e/0xaa [] :bonding:bond_mii_monitor+0x0/0xaa [] run_workqueue+0x7f/0x107 [] worker_thread+0x0/0xef [] worker_thread+0xe5/0xef [] autoremove_wake_function+0x0/0x2e [] autoremove_wake_function+0x0/0x2e [] kthread+0x3d/0x63 [] child_rip+0xa/0x12 [] kthread+0x0/0x63 [] child_rip+0x0/0x12 eth2: no IPv6 routers present bond0: no IPv6 routers present end_request: I/O error, dev fd0, sector 0 Or. From eli at mellanox.co.il Sun May 25 05:38:29 2008 From: eli at mellanox.co.il (Eli Cohen) Date: Sun, 25 May 2008 15:38:29 +0300 Subject: [ofa-general] Re: [ewg] [PATCH] IB/ipoib: copy small SKBs in CM mode In-Reply-To: References: <1211470815.7310.61.camel@eli-laptop> Message-ID: <1211719109.13769.32.camel@mtls03> On Fri, 2008-05-23 at 11:05 -0700, Roland Dreier wrote: > > --- > > > > When running netperf I see significant improvement when using this patch > > (BW Mbps): > > > > with patch: > > sender receiver > > 313 313 > > > > without the patch: > > 509 134 > > Any reason why we wouldn't want this info in the patch changelog? Not really.
If you think it should be there, I'll add it to the changelog along with an explanation of the question below. > > Can you explain why the sender gets dramatically slower with the patch? > When using this patch, the overhead of the CPU for handling RX packets is dramatically reduced. As a result, we do not experience RNR NACK messages from the receiver which cause the connection to be closed and reopened again; when the patch is not used, the receiver cannot handle the packets fast enough so there is less time to post new buffers and hence the mentioned RNR NACKs. So what happens is that the application, e.g. netperf, *thinks* it posted a certain number of packets for transmission but these packets are flushed and do not really get transmitted. Since the connection gets opened and closed many times, each time netperf gets the CPU time that otherwise would have been used to actually transmit the packets. This can be verified when looking at the port counters, the output of ifconfig and the output of netperf (this is for the case without the patch): tx packets ========== port counter: 1,543,996 ifconfig: 1,581,426 netperf: 5,142,034 rx packets ========== netperf: 1,130,489 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 14.4.3.178 (14.4.3.178) port 0 AF_INET Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 114688 128 10.00 5142034 0 526.31 114688 10.00 1130489 115.71 From bradley.kite at gmail.com Sun May 25 05:48:57 2008 From: bradley.kite at gmail.com (Bradley Kite) Date: Sun, 25 May 2008 13:48:57 +0100 Subject: [ofa-general] SDP and epoll vs select() Message-ID: Hi all, Currently my application uses the Linux kernel's epoll interface for socket event notifications. From what I've read it looks like the SDP library only works with select()/poll() - is this actually the case or will epoll work too? Many thanks -- Brad. From eli at dev.mellanox.co.il Sun May 25 08:59:41 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 25 May 2008 18:59:41 +0300 Subject: [ofa-general] [PATCH] IB/ipoib: reduce CM tx object size Message-ID: <1211731181.13769.46.camel@mtls03> >From 74218f2b8fff790a0fa35c2bf3aa6ab48c08ba81 Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Sun, 25 May 2008 18:58:13 +0300 Subject: [PATCH] IB/ipoib: reduce CM tx object size Since IPOIB CM does not publish NETIF_F_SG, we don't need a mapping array so define a new struct with one u64 field and use it.
Signed-off-by: Eli Cohen --- drivers/infiniband/ulp/ipoib/ipoib.h | 7 ++++++- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 12 ++++++------ 2 files changed, 12 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index e39bf36..2b6f60b 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -109,6 +109,11 @@ enum { /* structs */ +struct ipoib_cm_tx_buf { + struct sk_buff *skb; + u64 mapping; +}; + struct ipoib_header { __be16 proto; u16 reserved; @@ -208,7 +213,7 @@ struct ipoib_cm_tx { struct net_device *dev; struct ipoib_neigh *neigh; struct ipoib_path *path; - struct ipoib_tx_buf *tx_ring; + struct ipoib_cm_tx_buf *tx_ring; unsigned tx_head; unsigned tx_tail; unsigned long flags; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 9e0facc..064971d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -662,7 +662,7 @@ static inline int post_send(struct ipoib_dev_priv *priv, void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx) { struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ipoib_tx_buf *tx_req; + struct ipoib_cm_tx_buf *tx_req; u64 addr; if (unlikely(skb->len > tx->mtu)) { @@ -693,7 +693,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ return; } - tx_req->mapping[0] = addr; + tx_req->mapping = addr; if (unlikely(post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), addr, skb->len))) { @@ -718,7 +718,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_cm_tx *tx = wc->qp->qp_context; unsigned int wr_id = wc->wr_id & ~IPOIB_OP_CM; - struct ipoib_tx_buf *tx_req; + struct ipoib_cm_tx_buf *tx_req; unsigned long flags; ipoib_dbg_data(priv, "cm send completion: id %d, status: %d\n", @@ -732,7 +732,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) tx_req = &tx->tx_ring[wr_id]; - ib_dma_unmap_single(priv->ca, tx_req->mapping[0], tx_req->skb->len, DMA_TO_DEVICE); + ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, DMA_TO_DEVICE); /* FIXME: is this right? Shouldn't we only increment on success? 
*/ ++dev->stats.tx_packets; @@ -1102,7 +1102,7 @@ err_tx: static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) { struct ipoib_dev_priv *priv = netdev_priv(p->dev); - struct ipoib_tx_buf *tx_req; + struct ipoib_cm_tx_buf *tx_req; unsigned long flags; unsigned long begin; @@ -1130,7 +1130,7 @@ timeout: while ((int) p->tx_tail - (int) p->tx_head < 0) { tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)]; - ib_dma_unmap_single(priv->ca, tx_req->mapping[0], tx_req->skb->len, + ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, DMA_TO_DEVICE); dev_kfree_skb_any(tx_req->skb); ++p->tx_tail; -- 1.5.5.1 From swise at opengridcomputing.com Sun May 25 09:56:36 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 25 May 2008 11:56:36 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4838F6CB.2040203@voltaire.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> Message-ID: <48399A44.7060009@opengridcomputing.com> Or Gerlitz wrote: > After discussing the rkey renew and fencing with send/rdma ops, I am > quite clear with how all this plugs well into ULPs such as SCSI or FS > low-level (interconnect) initiator/target drivers, specifically those > who use a transactional protocol. A few more points to clarify (sorry > if this became somewhat long): > > * Do we want it to be a must for a consumer to invalidate a fast-reg mr > before reusing it? If yes, how? The verbs specs mandate that the mr be in the invalid state when the fast-reg work request is processed. So I think that means yes. And the consumer invalidates it via the INVALIDATE_MR work request. > > * If remote invalidation is supported, when the peer is done with the > mr, it sends the "response" in > send-with-invalidate fashion and saves the mapper side from doing a > local invalidate. For the case of the mapping produced by SCSI initiator > or FS client, when remote invalidation is not supported, I don't see how > a local invalidate design can be made in a pipe-lined manner - since > from the network perspective the I/O is done and the target response is > at hand, but until the mr is invalidated the pages are typically not > returned to the upper layer and the ULP has to "stall" till the > invalidation WR is completed. I don't say it's a bug or a big issue, just > wondering what your thoughts are regarding this point. > I guess that's why they invented send-with-inv, and read-with-inv-local. > * Talking about remote invalidation, I understand that it requires > support of both sides (and hence has to be negotiated), so the > IB_DEVICE_SEND_W_INV device capability says that a device can > send-with-invalidate, do we need an IB_DEVICE_RECV_W_INV cap as well? > > * What about ZBVA, is it orthogonal to these calls, no enhancement of > the suggested API is needed even if zbva is used, or the other way, it > would work also when zbva is not used?
> Or From sashak at voltaire.com Sun May 25 12:10:47 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 25 May 2008 22:10:47 +0300 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root In-Reply-To: <4836E9B8.2080406@llnl.gov> References: <48358DF8.2060603@llnl.gov> <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> <20080523103532.GA4640@sashak.voltaire.com> <4836E9B8.2080406@llnl.gov> Message-ID: <20080525191047.GS4616@sashak.voltaire.com> Hi Tim, Ira, On 08:58 Fri 23 May , Timothy A. Meier wrote: > > Following Hals advice, authorization is based on the umad permissions. I will send some more comments about this method later today. But basically I still think that some things could be broken and that it is not really trivial to separate in this way wrong usage from desired behavior reliably (with some approximation it is possible of course). > The intent is simply to provide a consistent and > non-silent fail mechanism. OTOH I fully agree with yours and Ira's arguments about this - 'Silent' fails are bad. I thought about how to solve this and started to run diag perl scripts from an unprivileged account in various conditions (cache file exists or not, cache dir is readable or not, etc.). First thing I saw was that even on bad usage most scripts return 0. Then I found that on many failures the return status is not checked or is ignored and the program returns 0. I did these two patches (below) and up to now they work fine for me (but likely I didn't cover everything). What do you say? Sasha >From cbbc155996c9f6efe91b78f055a643809b997468 Mon Sep 17 00:00:00 2001 From: root Date: Sat, 24 May 2008 11:04:08 +0300 Subject: [PATCH] infiniband-diags/scripts/*.pl: exit 2 on usage errors Add non-zero exit status (2) on usage errors for perl scripts. Signed-off-by: root --- infiniband-diags/scripts/check_lft_balance.pl | 2 +- infiniband-diags/scripts/ibfindnodesusing.pl | 2 +- infiniband-diags/scripts/ibidsverify.pl | 2 +- infiniband-diags/scripts/iblinkinfo.pl | 2 +- infiniband-diags/scripts/ibprintca.pl | 2 +- infiniband-diags/scripts/ibprintrt.pl | 2 +- infiniband-diags/scripts/ibprintswitch.pl | 2 +- infiniband-diags/scripts/ibqueryerrors.pl | 2 +- infiniband-diags/scripts/ibswportwatch.pl | 2 +- 9 files changed, 9 insertions(+), 9 deletions(-) diff --git a/infiniband-diags/scripts/check_lft_balance.pl b/infiniband-diags/scripts/check_lft_balance.pl index 66f5f0f..b0f0fef 100755 --- a/infiniband-diags/scripts/check_lft_balance.pl +++ b/infiniband-diags/scripts/check_lft_balance.pl @@ -70,7 +70,7 @@ sub usage print "Usage: $prog [-R -v]\n"; print " -R recalculate all cached information\n"; print " -v verbose output\n"; - exit 0; + exit 2; } sub is_port_up diff --git a/infiniband-diags/scripts/ibfindnodesusing.pl b/infiniband-diags/scripts/ibfindnodesusing.pl index 1bf0987..71656b3 100755 --- a/infiniband-diags/scripts/ibfindnodesusing.pl +++ b/infiniband-diags/scripts/ibfindnodesusing.pl @@ -80,7 +80,7 @@ sub usage_and_exit print " -R Recalculate ibnetdiscover information\n"; print " -C use selected Channel Adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibidsverify.pl b/infiniband-diags/scripts/ibidsverify.pl index de78e6b..1a236c8 100755 --- a/infiniband-diags/scripts/ibidsverify.pl +++ b/infiniband-diags/scripts/ibidsverify.pl @@ -46,7 +46,7 @@ sub usage_and_exit print " -h This help message\n"; print " -R Recalculate ibnetdiscover information (Default is to reuse ibnetdiscover output)\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl index a195474..a7a3df5 100755 --- a/infiniband-diags/scripts/iblinkinfo.pl +++ b/infiniband-diags/scripts/iblinkinfo.pl @@ -62,7 +62,7 @@ sub usage_and_exit print " -C use selected Channel Adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; print " -g print port guids instead of node guids\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl index 38b4330..0baea0b 100755 --- a/infiniband-diags/scripts/ibprintca.pl +++ b/infiniband-diags/scripts/ibprintca.pl @@ -51,7 +51,7 @@ sub usage_and_exit print " -l list cas\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl index 86dcb64..0b3db19 100755 --- a/infiniband-diags/scripts/ibprintrt.pl +++ b/infiniband-diags/scripts/ibprintrt.pl @@ -51,7 +51,7 @@ sub usage_and_exit print " -l list rts\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl index 6712201..c7377a9 100755 --- a/infiniband-diags/scripts/ibprintswitch.pl +++ b/infiniband-diags/scripts/ibprintswitch.pl @@ -50,7 +50,7 @@ sub usage_and_exit print " -l list switches\n"; print " -C use selected channel adaptor name for
Signed-off-by: root --- infiniband-diags/scripts/check_lft_balance.pl | 2 +- infiniband-diags/scripts/ibfindnodesusing.pl | 2 +- infiniband-diags/scripts/ibidsverify.pl | 2 +- infiniband-diags/scripts/iblinkinfo.pl | 2 +- infiniband-diags/scripts/ibprintca.pl | 2 +- infiniband-diags/scripts/ibprintrt.pl | 2 +- infiniband-diags/scripts/ibprintswitch.pl | 2 +- infiniband-diags/scripts/ibqueryerrors.pl | 2 +- infiniband-diags/scripts/ibswportwatch.pl | 2 +- 9 files changed, 9 insertions(+), 9 deletions(-) diff --git a/infiniband-diags/scripts/check_lft_balance.pl b/infiniband-diags/scripts/check_lft_balance.pl index 66f5f0f..b0f0fef 100755 --- a/infiniband-diags/scripts/check_lft_balance.pl +++ b/infiniband-diags/scripts/check_lft_balance.pl @@ -70,7 +70,7 @@ sub usage print "Usage: $prog [-R -v]\n"; print " -R recalculate all cached information\n"; print " -v verbose output\n"; - exit 0; + exit 2; } sub is_port_up diff --git a/infiniband-diags/scripts/ibfindnodesusing.pl b/infiniband-diags/scripts/ibfindnodesusing.pl index 1bf0987..71656b3 100755 --- a/infiniband-diags/scripts/ibfindnodesusing.pl +++ b/infiniband-diags/scripts/ibfindnodesusing.pl @@ -80,7 +80,7 @@ sub usage_and_exit print " -R Recalculate ibnetdiscover information\n"; print " -C use selected Channel Adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibidsverify.pl b/infiniband-diags/scripts/ibidsverify.pl index de78e6b..1a236c8 100755 --- a/infiniband-diags/scripts/ibidsverify.pl +++ b/infiniband-diags/scripts/ibidsverify.pl @@ -46,7 +46,7 @@ sub usage_and_exit print " -h This help message\n"; print " -R Recalculate ibnetdiscover information (Default is to reuse ibnetdiscover output)\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl index a195474..a7a3df5 100755 --- a/infiniband-diags/scripts/iblinkinfo.pl +++ b/infiniband-diags/scripts/iblinkinfo.pl @@ -62,7 +62,7 @@ sub usage_and_exit print " -C use selected Channel Adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; print " -g print port guids instead of node guids\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl index 38b4330..0baea0b 100755 --- a/infiniband-diags/scripts/ibprintca.pl +++ b/infiniband-diags/scripts/ibprintca.pl @@ -51,7 +51,7 @@ sub usage_and_exit print " -l list cas\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl index 86dcb64..0b3db19 100755 --- a/infiniband-diags/scripts/ibprintrt.pl +++ b/infiniband-diags/scripts/ibprintrt.pl @@ -51,7 +51,7 @@ sub usage_and_exit print " -l list rts\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl index 6712201..c7377a9 100755 --- a/infiniband-diags/scripts/ibprintswitch.pl +++ b/infiniband-diags/scripts/ibprintswitch.pl @@ -50,7 +50,7 @@ sub usage_and_exit print " -l list switches\n"; print " -C use selected channel adaptor name for 
queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl index c807c02..5f2e167 100755 --- a/infiniband-diags/scripts/ibqueryerrors.pl +++ b/infiniband-diags/scripts/ibqueryerrors.pl @@ -149,7 +149,7 @@ sub usage_and_exit print " -d include the data counters in the output\n"; print " -C use selected Channel Adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; - exit 0; + exit 2; } my $argv0 = `basename $0`; diff --git a/infiniband-diags/scripts/ibswportwatch.pl b/infiniband-diags/scripts/ibswportwatch.pl index 6d6ba1c..d888f51 100755 --- a/infiniband-diags/scripts/ibswportwatch.pl +++ b/infiniband-diags/scripts/ibswportwatch.pl @@ -81,7 +81,7 @@ sub usage_and_exit print " -n run n cycles then exit (default -1 == forever)\n"; print " -G Address provided is a GUID\n"; print " -b report bytes/second packets/second\n"; - exit 0; + exit 2; } # ========================================================================= -- 1.5.4.rc2.60.gb2e62 >From a3d4a44d668912526466f591931a099a0978f943 Mon Sep 17 00:00:00 2001 From: root Date: Sun, 25 May 2008 15:54:00 +0300 Subject: [PATCH] infiniband-diags/scripts/*.pl: prevent some zero exists on errors Upon failures break execution and drop error status. Signed-off-by: root --- infiniband-diags/scripts/IBswcountlimits.pm | 19 +++++++++++-------- infiniband-diags/scripts/ibprintca.pl | 6 +++--- infiniband-diags/scripts/ibprintrt.pl | 6 +++--- infiniband-diags/scripts/ibprintswitch.pl | 4 ++-- infiniband-diags/scripts/ibqueryerrors.pl | 6 ++++-- infiniband-diags/scripts/ibswportwatch.pl | 6 ++---- 6 files changed, 25 insertions(+), 22 deletions(-) diff --git a/infiniband-diags/scripts/IBswcountlimits.pm b/infiniband-diags/scripts/IBswcountlimits.pm index 9bc356f..9794ff1 100755 --- a/infiniband-diags/scripts/IBswcountlimits.pm +++ b/infiniband-diags/scripts/IBswcountlimits.pm @@ -219,8 +219,9 @@ sub any_counts # sub ensure_cache_dir { - if (!(-d "$IBswcountlimits::cache_dir")) { - mkdir $IBswcountlimits::cache_dir, 0700; + if (!(-d "$IBswcountlimits::cache_dir") && + !mkdir($IBswcountlimits::cache_dir, 0700)) { + die "cannot create $IBswcountlimits::cache_dir: $!\n"; } } @@ -260,9 +261,8 @@ sub generate_ibnetdiscover_topology my $cache_file = get_cache_file($ca_name, $ca_port); my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); - `ibnetdiscover -g $extra_params > $cache_file`; - if ($? != 0) { - die "Execution of ibnetdiscover failed with errors\n"; + if (`ibnetdiscover -g $extra_params > $cache_file`) { + die "Execution of ibnetdiscover failed: $!\n"; } } @@ -421,7 +421,8 @@ sub get_num_ports my $num_ports = 0; my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); - my $data = `smpquery $extra_params -G nodeinfo $guid`; + my $data = `smpquery $extra_params -G nodeinfo $guid` || + die "'smpquery $extra_params -G nodeinfo $guid' faild\n"; my @lines = split("\n", $data); my $pkt_lifetime = ""; foreach my $line (@lines) { @@ -457,7 +458,8 @@ sub convert_dr_to_guid { my $guid = undef; - my $data = `smpquery nodeinfo -D $_[0]`; + my $data = `smpquery nodeinfo -D $_[0]` || + die "'mpquery nodeinfo -D $_[0]' failed\n"; my @lines = split("\n", $data); foreach my $line (@lines) { if ($line =~ /^PortGuid:\.+(.*)/) { $guid = $1; } @@ -480,7 +482,8 @@ sub get_node_type $query_arg .= "-D " . 
$_[0]; } - my $data = `$query_arg`; + my $data = `$query_arg` || + die "'$query_arg' failed\n"; my @lines = split("\n", $data); foreach my $line (@lines) { if ($line =~ /^NodeType:\.+(.*)/) { $type = $1; } diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl index 0baea0b..7de0801 100755 --- a/infiniband-diags/scripts/ibprintca.pl +++ b/infiniband-diags/scripts/ibprintca.pl @@ -118,9 +118,9 @@ sub main print $ports{$port}; } if (!$found_hca) { - print "\"$target_hca\" not found\n"; - print " Try running with the \"-R\" option.\n"; - print " If still not found the node is probably down.\n"; + die "\"$target_hca\" not found\n" . + " Try running with the \"-R\" option.\n" . + " If still not found the node is probably down.\n"; } close IBNET_TOPO; } diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl index 0b3db19..43323ca 100755 --- a/infiniband-diags/scripts/ibprintrt.pl +++ b/infiniband-diags/scripts/ibprintrt.pl @@ -118,9 +118,9 @@ sub main print $ports{$port}; } if (!$found_rt) { - print "\"$target_rt\" not found\n"; - print " Try running with the \"-R\" option.\n"; - print " If still not found the node is probably down.\n"; + die "\"$target_rt\" not found\n" . + " Try running with the \"-R\" option.\n" . + " If still not found the node is probably down.\n"; } close IBNET_TOPO; } diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl index c7377a9..8af3f48 100755 --- a/infiniband-diags/scripts/ibprintswitch.pl +++ b/infiniband-diags/scripts/ibprintswitch.pl @@ -117,8 +117,8 @@ sub main print $ports{$port}; } if (!$found_switch) { - print "Switch \"$target_switch\" not found\n"; - print " Try running with the \"-R\" option.\n"; + die "Switch \"$target_switch\" not found\n" . 
+ " Try running with the \"-R\" option.\n"; } close IBNET_TOPO; } diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl index 5f2e167..a6128b5 100755 --- a/infiniband-diags/scripts/ibqueryerrors.pl +++ b/infiniband-diags/scripts/ibqueryerrors.pl @@ -104,7 +104,8 @@ sub get_counts my $ca_port = $_[3]; my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); - my $data = `perfquery $extra_params -G $addr $port`; + my $data = `perfquery $extra_params -G $addr $port` || + die "'perfquery $extra_params -G $addr $port' FAILED.\n"; my @lines = split("\n", $data); foreach my $line (@lines) { foreach my $count (@IBswcountlimits::counters) { @@ -121,7 +122,8 @@ my %switches = (); sub get_switches { - my $data = `ibswitches $cache_file`; + my $data = `ibswitches $cache_file` || + die "'ibswitches $cache_file' failed.\n"; my @lines = split("\n", $data); foreach my $line (@lines) { if ($line =~ /^Switch\s+:\s+(\w+)\s+ports\s+(\d+)\s+.*/) { diff --git a/infiniband-diags/scripts/ibswportwatch.pl b/infiniband-diags/scripts/ibswportwatch.pl index d888f51..92066d1 100755 --- a/infiniband-diags/scripts/ibswportwatch.pl +++ b/infiniband-diags/scripts/ibswportwatch.pl @@ -121,12 +121,10 @@ sub get_new_counts ) ) { - print "perfquery failed : \"perfquery $GUID $addr $port\"\n"; - system("cat $IBswcountlimits::cache_dir/perfquery.out"); - exit 1; + die "perfquery failed : \"perfquery $GUID $addr $port\"\n"; } open PERF_QUERY, "<$IBswcountlimits::cache_dir/perfquery.out" - or die "perfquery failed"; + or die "cannot read '$IBswcountlimits::cache_dir/perfquery.out': $!\n"; while (my $line = ) { foreach my $count (@IBswcountlimits::counters) { if ($line =~ /^$count:\.+(\d+)/) { -- 1.5.4.rc2.60.gb2e62 From sashak at voltaire.com Sun May 25 12:14:30 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 25 May 2008 22:14:30 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags: terminate perl scripts with error if not authorized In-Reply-To: <4836EB27.7060707@llnl.gov> References: <4836EB27.7060707@llnl.gov> Message-ID: <20080525191430.GT4616@sashak.voltaire.com> Hi Tim, On 09:04 Fri 23 May , Timothy A. Meier wrote: > > +# ========================================================================= > +# only authorized if uid is root, or matches umad ownership > +# > +sub auth_check > +{ > + my $file = "/dev/infiniband/umad0"; How would we know that it is "/dev/infiniband/umad0" and not another device (when first port in not connected, or if -C and/or -P options are used, or if udev is configured to put the entries in another place)? Really I don't see an easy (without reimplementing most of libibumad device resolution functionality via sysfs in perl scripts) way to detect device reliably. > + my $uid = (stat $file)[4]; > + my $gid = (stat $file)[5]; > + if (($> != $uid) && ($> != $gid) && ($> != 0)){ The requirement here is not really ownership, but rather that the file is readable and writable by user which runs script. Right? 
Sasha From rdreier at cisco.com Sun May 25 13:43:08 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 May 2008 13:43:08 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <4838F6CB.2040203@voltaire.com> (Or Gerlitz's message of "Sun, 25 May 2008 08:19:07 +0300") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> Message-ID: > * Do we want it to be a must for a consumer to invalidate an fast-reg > mr before reusing it? if yes, how? The verbs specs go into exhaustive detail about the state diagram for validity of MRs. > * talking about remote invalidation, I understand that it requires > support of both sides (and hence has to be negotiated), so the > IB_DEVICE_SEND_W_INV device capability says that a device can > send-with-invalidate, do we need a IB_DEVICE_RECV_W_INV cap as well? I think we decided that all of these related features will be indicated by IB_DEVICE_MEM_MGT_EXTENSIONS to avoid an explosion of capability bits. > * what about ZBVA, is it orthogonal to these calls, no enhancement of > the suggested API is needed even if zbva is used, or the other way, it > would work also when zbva is not used? ZBVA would require adding some flag to request ZBVA when registering. - R. From rdreier at cisco.com Sun May 25 15:14:28 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 May 2008 15:14:28 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <48378EB2.2060005@opengridcomputing.com> (Steve Wise's message of "Fri, 23 May 2008 22:42:42 -0500") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <483713CB.3010408@opengridcomputing.com> <48378C48.5060904@opengridcomputing.com> <48378EB2.2060005@opengridcomputing.com> Message-ID: > So something like this? yeah, looks reasonable... > static void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey) > { > /* iWARP: rkey == lkey */ actually I need to reread the IB spec and understand how the consumer key part of L_Key and R_Key is supposed to work... for Mellanox adapters at least the L_Key and R_Key are the same too. 
> if (mr->rkey == mr->lkey) > mr->lkey = mr->lkey & 0xffffff00 | newkey; > mr->rkey = mr->rkey & 0xffffff00 | newkey; > } From rdreier at cisco.com Sun May 25 15:21:03 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 May 2008 15:21:03 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: (Roland Dreier's message of "Sun, 25 May 2008 15:14:28 -0700") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <483713CB.3010408@opengridcomputing.com> <48378C48.5060904@opengridcomputing.com> <48378EB2.2060005@opengridcomputing.com> Message-ID: > > static void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey) > > { > > /* iWARP: rkey == lkey */ > > actually I need to reread the IB spec and understand how the consumer > key part of L_Key and R_Key is supposed to work... for Mellanox adapters > at least the L_Key and R_Key are the same too. > > > if (mr->rkey == mr->lkey) > > mr->lkey = mr->lkey & 0xffffff00 | newkey; > > mr->rkey = mr->rkey & 0xffffff00 | newkey; > > } I just looked in the IB spec (1.2.1) and it talks about passing the "Key to use on the new L_Key and R_Key" into a fastreg work request. So I think we can just can the test for rkey==lkey and just do mr->lkey = mr->lkey & 0xffffff00 | newkey; mr->rkey = mr->rkey & 0xffffff00 | newkey; - R. From rdreier at cisco.com Sun May 25 15:43:56 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 May 2008 15:43:56 -0700 Subject: [ofa-general] Re: [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: <20080519103730.12355.14730.stgit@localhost.localdomain> (Ramachandra K.'s message of "Mon, 19 May 2008 16:07:30 +0530") References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103730.12355.14730.stgit@localhost.localdomain> Message-ID: > +config INFINIBAND_QLGC_VNIC_DEBUG > + bool "QLogic VNIC Verbose debugging" > + depends on INFINIBAND_QLGC_VNIC > + default n > + ---help--- > + This option causes verbose debugging code to be compiled > + into the QLogic VNIC driver. The output can be turned on via the > + vnic_debug module parameter. I think I mentioned this before, but... if you default this option to 'n', then all distributions will build your module with the option off. And if someone is having problems, they will be forced to rebuild their kernel to get debug output, which is a heavy burden for most users. 
Much better to do something like what I ended up doing for mthca, which is to have the option on unless someone specifically enables CONFIG_EMBEDDED and goes out of their way to disable it: config INFINIBAND_MTHCA_DEBUG bool "Verbose debugging output" if EMBEDDED depends on INFINIBAND_MTHCA default y ---help--- This option causes debugging code to be compiled into the
From rdreier at cisco.com Sun May 25 15:47:29 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 25 May 2008 15:47:29 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080516223419.27221.49014.stgit@dell3.ogc.int> (Steve Wise's message of "Fri, 16 May 2008 17:34:20 -0500") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> Message-ID: > - device-specific alloc/free of physical buffer lists for use in fast > register work requests. This allows devices to allocate this memory as > needed (like via dma_alloc_coherent). I'm looking at how one would implement the MM extensions for mlx4, and it turns out that in addition to needing to allocate these fastreg page lists in coherent memory, mlx4 is even going to need to write to the memory (basically set the lsb of each address for internal device reasons). So I think we just need to update the documentation of the interface so that not only does the page list belong to the device driver between posting the fastreg work request and completing the request, but also the device driver is allowed to change the page list as part of the work request processing. I don't see any real reason why this would cause problems for consumers; does this seem OK to other people?
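To make the proposed ownership rule concrete, here is a minimal sketch of the kind of in-place rewrite the driver would then be allowed to do. It assumes the RFC's plain u64 page-list array; the function name and the use of bit 0 as a device marker are hypothetical, purely for illustration:

#include <linux/types.h>

/*
 * Hedged sketch, not part of the RFC: between posting a fastreg work
 * request and reaping its completion the page list belongs to the
 * device driver, so it may rewrite the entries in place -- e.g. an
 * mlx4-style setting of the lsb of each address for internal device
 * reasons.
 */
static void driver_mark_fastreg_pages(u64 *page_list, unsigned int npages)
{
	unsigned int i;

	for (i = 0; i < npages; ++i)
		page_list[i] |= 1ULL;	/* hypothetical device-owned marker bit */
}

A consumer that treats the list as write-only after posting never notices the mutation, which is exactly the documentation change being proposed.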
From ihara at sun.com Sun May 25 19:25:21 2008 From: ihara at sun.com (Shuichi Ihara) Date: Mon, 26 May 2008 11:25:21 +0900 Subject: [ofa-general] question of mvapich version Message-ID: <483A1F91.6080205@sun.com> Hi, I have a question about the versions of mvapich and mvapich2 which are included in ofed-1.3.1-rc2. It looks like the mvapich version is 1.0.1-2481 in ofed-1.3.1, but I can't see the same version on mvapich's download site or in the SVN repositories. Is this version the same as Revision 2481:/mvapich/branches/1.0? Also, mvapich2's src.rpm filename is mvapich2-1.0.3-1.src.rpm. Is it also from the svn branches? I would like to know where the source trees of both packages are. Thanks, -Ihara
From panda at cse.ohio-state.edu Sun May 25 20:03:26 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Sun, 25 May 2008 23:03:26 -0400 (EDT) Subject: [ofa-general] question of mvapich version In-Reply-To: <483A1F91.6080205@sun.com> Message-ID: > Hi, > > I have a question about the versions of mvapich and mvapich2 which are included > in ofed-1.3.1-rc2. > > It looks like the mvapich version is 1.0.1-2481 in ofed-1.3.1, but I can't see the same > version on mvapich's download site or in the SVN repositories. Is this version the same as > Revision 2481:/mvapich/branches/1.0? Yes. This version will soon (most likely during the coming week) be available as MVAPICH 1.0.1. > Also, mvapich2's src.rpm filename is mvapich2-1.0.3-1.src.rpm. Is it > also from the svn branches? I would like to know where the source trees > of both packages are. Yes. This version will also soon be available as MVAPICH2 1.0.3. Both MVAPICH and MVAPICH2 source packages in OFED are from the original MVAPICH and MVAPICH2 SVN. Since OFED has its own release schedule, the versions in OFED are identified with a version number and a `-x' suffix to keep track of the exact versions which are going into OFED. Hope this helps. Thanks, DK > Thanks, > > -Ihara > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >
From eli at dev.mellanox.co.il Mon May 26 00:20:40 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 26 May 2008 10:20:40 +0300 Subject: [ofa-general] [PATCH] IB/ipoib: increase ring sizes Message-ID: <1211786440.13769.54.camel@mtls03>
>From b1ec82e65173556919f0e0f728af520e41bdbd5b Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Mon, 26 May 2008 10:19:16 +0300 Subject: [PATCH] IB/ipoib: increase ring sizes Increase the IPoIB ring sizes to twice their original size to act as a shock absorber for high traffic peaks.
Signed-off-by: Eli Cohen --- drivers/infiniband/ulp/ipoib/ipoib.h | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 2b6f60b..c49fc09 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -65,8 +65,8 @@ enum { IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN, IPOIB_CM_HEAD_SIZE = IPOIB_CM_BUF_SIZE % PAGE_SIZE, IPOIB_CM_RX_SG = ALIGN(IPOIB_CM_BUF_SIZE, PAGE_SIZE) / PAGE_SIZE, - IPOIB_RX_RING_SIZE = 128, - IPOIB_TX_RING_SIZE = 64, + IPOIB_RX_RING_SIZE = 256, + IPOIB_TX_RING_SIZE = 128, IPOIB_MAX_QUEUE_SIZE = 8192, IPOIB_MIN_QUEUE_SIZE = 2, IPOIB_CM_MAX_CONN_QP = 4096, -- 1.5.5.1
From ogerlitz at voltaire.com Mon May 26 00:29:53 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 26 May 2008 10:29:53 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> Message-ID: <483A66F1.9040305@voltaire.com> Roland Dreier wrote: > I don't see any real reason why this would cause problems for consumers; > does this seem OK to other people? this seems fine to me. Or.
From ramachandra.kuchimanchi at qlogic.com Mon May 26 00:37:28 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Mon, 26 May 2008 13:07:28 +0530 Subject: [ofa-general] Re: [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103730.12355.14730.stgit@localhost.localdomain> Message-ID: <71d336490805260037y4c45abd5lccd5dbadebd84eb8@mail.gmail.com> Roland, On Mon, May 26, 2008 at 4:13 AM, Roland Dreier wrote: > > +config INFINIBAND_QLGC_VNIC_DEBUG > > + bool "QLogic VNIC Verbose debugging" > > + depends on INFINIBAND_QLGC_VNIC > > + default n > > + ---help--- > > + This option causes verbose debugging code to be compiled > > + into the QLogic VNIC driver. The output can be turned on via the > > + vnic_debug module parameter. > > I think I mentioned this before, but... if you default this option to > 'n', then all distributions will build your module with the option off. > And if someone is having problems, they will be forced to rebuild their > kernel to get debug output, which is a heavy burden for most users. The debugging code is always compiled in and is controlled at run time through vnic_debug module parameter. INFINIBAND_QLGC_VNIC_DEBUG config option only controls verbose debugging which adds some extra information in the debug statements (file name, line number) which we typically use for debug builds of the driver. Even if this option is set to 'n', users can still get all debug messages from the driver by using the vnic_debug module parameter. Regards, Ram
From marcel.heinz at informatik.tu-chemnitz.de Mon May 26 02:09:20 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Mon, 26 May 2008 11:09:20 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <48371B2D.3040908@gmail.com> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> Message-ID: <483A7E40.5040407@informatik.tu-chemnitz.de> Hello, Dotan Barak wrote: > Hi. > > Do you use the latest released FW for this device? The HCAs all use Mellanox's latest released FW, version 1.2.0. I'll have a look at the switch later.
Regards, Marcel From vlad at lists.openfabrics.org Mon May 26 03:10:10 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 26 May 2008 03:10:10 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080526-0200 daily build status Message-ID: <20080526101010.8FA37E60CF8@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From ogerlitz at voltaire.com Mon May 26 04:10:49 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 26 May 2008 14:10:49 +0300 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> Message-ID: <483A9AB9.10809@voltaire.com> Roland Dreier wrote: > > * talking about remote invalidation, I understand that it requires > > support of both sides (and hence has to be negotiated), so the > > IB_DEVICE_SEND_W_INV device capability says that a device can > > send-with-invalidate, do we need a IB_DEVICE_RECV_W_INV cap as well? 
> > I think we decided that all of these related features will be indicated > by IB_DEVICE_MEM_MGT_EXTENSIONS to avoid an explosion of capability bits. send-with-invalidate is a little different in the sense that we would probably want to expose remote invalidation through libibverbs such that user space block/file targets (eg the iSER layer of STGT) would be able to use it, but (at least in this point of time) not expose the other memory management extensions to user space. BTW - what's the status of the send-with-invalidate patches to the core and mlx4? > ZBVA would require adding some flag to request ZBVA when registering. So this flag would be added as a field in the WR? for the current proposal, can the ULP dictate the VA as done with the current FMR API exposed by the core? Or. From swise at opengridcomputing.com Mon May 26 06:05:48 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 26 May 2008 08:05:48 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> Message-ID: <483AB5AC.3030406@opengridcomputing.com> Roland Dreier wrote: > > * Do we want it to be a must for a consumer to invalidate an fast-reg > > mr before reusing it? if yes, how? > > The verbs specs go into exhaustive detail about the state diagram for > validity of MRs. > > > * talking about remote invalidation, I understand that it requires > > support of both sides (and hence has to be negotiated), so the > > IB_DEVICE_SEND_W_INV device capability says that a device can > > send-with-invalidate, do we need a IB_DEVICE_RECV_W_INV cap as well? > > I think we decided that all of these related features will be indicated > by IB_DEVICE_MEM_MGT_EXTENSIONS to avoid an explosion of capability bits. > BTW: a single capability bit doesn't allow apps to decide at run time whether to use read-with-inv, which is iwarp-only. Perhaps we need that as its own capbility bit? Or perhaps we can load detailed support/no support into the query device logic? What it some devices can only support part of the suite of MEM_MGT_EXTENSIONS? > > * what about ZBVA, is it orthogonal to these calls, no enhancement of > > the suggested API is needed even if zbva is used, or the other way, it > > would work also when zbva is not used? > > ZBVA would require adding some flag to request ZBVA when registering. > > - R. From swise at opengridcomputing.com Mon May 26 06:07:50 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 26 May 2008 08:07:50 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> Message-ID: <483AB626.8060404@opengridcomputing.com> Roland Dreier wrote: > > - device-specific alloc/free of physical buffer lists for use in fast > > register work requests. This allows devices to allocate this memory as > > needed (like via dma_alloc_coherent). > > I'm looking at how one would implement the MM extensions for mlx4, and > it turns out that in addition to needing to allocate these fastreg page > lists in coherent memory, mlx4 is even going to need to write to the > memory (basically set the lsb of each address for internal device > reasons). 
So I think we just need to update the documentation of the interface so that not only does the page list belong to the device driver between posting the fastreg work request and completing the request, but also the device driver is allowed to change the page list as part of the work request processing. > > I don't see any real reason why this would cause problems for consumers; > does this seem OK to other people? Tom, Does this affect how you plan to implement NFSRDMA MEM_MGT_EXTENSIONS support?
From rdreier at cisco.com Mon May 26 14:47:29 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 14:47:29 -0700 Subject: [ofa-general] Re: [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: <71d336490805260037y4c45abd5lccd5dbadebd84eb8@mail.gmail.com> (Ramachandra K.'s message of "Mon, 26 May 2008 13:07:28 +0530") References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103730.12355.14730.stgit@localhost.localdomain> <71d336490805260037y4c45abd5lccd5dbadebd84eb8@mail.gmail.com> Message-ID: > The debugging code is always compiled in and is controlled > at run time through vnic_debug module parameter. > INFINIBAND_QLGC_VNIC_DEBUG config option only controls verbose debugging > which adds some extra information in the debug statements (file name, > line number) > which we typically use for debug builds of the driver. Even if this option is > set to 'n', users can still get all debug messages from the driver by using the > vnic_debug module parameter. OK, I looked at the code. Is there any point to having CONFIG_INFINIBAND_QLGC_VNIC_DEBUG at all?? Is anyone going to care about having __FILE__ and __LINE__ included in the output and want to set this option to 'n'? - R.
From rdreier at cisco.com Mon May 26 14:53:04 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 14:53:04 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483AB5AC.3030406@opengridcomputing.com> (Steve Wise's message of "Mon, 26 May 2008 08:05:48 -0500") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> Message-ID: > BTW: a single capability bit doesn't allow apps to decide at run time > whether to use read-with-inv, which is iwarp-only. Perhaps we need > that as its own capability bit? Or perhaps we can load detailed > support/no support into the query device logic? What if some devices > can only support part of the suite of MEM_MGT_EXTENSIONS? I think RDMA read with invalidate can be tested for as iWARP vs. IB. The reason IB doesn't have it is kind of inherent in the IB protocol, since remote access is not required for the RDMA target. I think making the capability flags really fine-grained isn't worth it -- we went too far in that direction historically, and no one checks any capability flags at all. It's just complexity. So any device that supports only part of the IB base memory mgt extensions (or doesn't support the full IWARP spec) just shouldn't advertise MEM_MGT_EXTENSIONS, I think. Implementing such a device would be kind of dumb anyway at this point. - R.
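To illustrate the single-bit scheme, a consumer-side gate might look roughly like the sketch below. ib_query_device(), IB_DEVICE_MEM_MGT_EXTENSIONS and rdma_node_get_transport() are the names used in this series and in mainline; the helper itself and its calling convention are hypothetical:

#include <linux/errno.h>
#include <rdma/ib_verbs.h>

/*
 * Hedged sketch only: gate the fastreg path on the single capability
 * bit, and treat read-with-invalidate as an iWARP-vs-IB transport
 * question rather than as a separate capability flag.
 */
static int can_use_fastreg(struct ib_device *dev, int *use_read_with_inv)
{
	struct ib_device_attr attr;
	int ret;

	ret = ib_query_device(dev, &attr);
	if (ret)
		return ret;

	if (!(attr.device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS))
		return -ENOSYS;

	*use_read_with_inv =
		rdma_node_get_transport(dev->node_type) ==
		RDMA_TRANSPORT_IWARP;

	return 0;
}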
From rdreier at cisco.com Mon May 26 14:57:41 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 14:57:41 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483A9AB9.10809@voltaire.com> (Or Gerlitz's message of "Mon, 26 May 2008 14:10:49 +0300") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483A9AB9.10809@voltaire.com> Message-ID: > send-with-invalidate is a little different in the sense that we would > probably want to expose remote invalidation through libibverbs such > that user space block/file targets (eg the iSER layer of STGT) would > be able to use it, but (at least at this point in time) not expose the > other memory management extensions to user space. Why? Local invalidate and RDMA read with invalidate make perfect sense for userspace too. Of course fast register through a send queue can't be used in userspace because it operates on physical memory, but I think MEM_MGT_EXTENSIONS makes sense as something userspace can test for and use. > BTW - what's the status of the send-with-invalidate patches to the > core and mlx4? I'll add the completion struct changes for 2.6.27, and roll the mlx4 patches into the full MEM_MGT_EXTENSIONS patch. > > ZBVA would require adding some flag to request ZBVA when registering. > So this flag would be added as a field in the WR? For the current > proposal, can the ULP dictate the VA as done with the current FMR API > exposed by the core? The IB spec has a table that shows exactly which operations would need flags to handle the ZBVA extension. As for the VA, I think the latest patch is pretty clear: > @@ -676,6 +683,20 @@ struct ib_send_wr { > u16 pkey_index; /* valid for GSI only */ > u8 port_num; /* valid for DR SMPs on switch only */ > } ud; > + struct { > + u64 iova_start; - R.
From rdreier at cisco.com Mon May 26 15:21:15 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 15:21:15 -0700 Subject: [ofa-general] Re: linux-next: [PATCH] infiniband/hw/ipath/ipath_sdma.c , fix compiler warnings In-Reply-To: <1211579101.3949.326.camel@brick.pathscale.com> (Ralph Campbell's message of "Fri, 23 May 2008 14:45:01 -0700") References: <48342C6C.2010502@googlemail.com> <1211579101.3949.326.camel@brick.pathscale.com> Message-ID: OK, I added the following to my tree: commit e8ffef73c8dd2c2d00287829db87cdaf229d3859 Author: Roland Dreier Date: Mon May 26 15:20:34 2008 -0700 IB/ipath: Avoid test_bit() on u64 SDMA status value Gabriel C pointed out that when the x86 bitops are updated to operate on unsigned long, the code in sdma_abort_task() will produce warnings: drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task': drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type and so on, because it uses test_bit() to operate on a u64 value (returned by ipath_read_kreg64() for a hardware register). Fix up these warnings by converting the test_bit() operations to &ing with appropriate symbolic defines of the bits within the hardware register. This has the benign side-effect of making the code more self-documenting as well.
Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/ipath/ipath_kernel.h b/drivers/infiniband/hw/ipath/ipath_kernel.h index 59a8b25..0bd8bcb 100644 --- a/drivers/infiniband/hw/ipath/ipath_kernel.h +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h @@ -232,6 +232,11 @@ struct ipath_sdma_desc { #define IPATH_SDMA_TXREQ_S_ABORTED 2 #define IPATH_SDMA_TXREQ_S_SHUTDOWN 3 +#define IPATH_SDMA_STATUS_SCORE_BOARD_DRAIN_IN_PROG (1ull << 63) +#define IPATH_SDMA_STATUS_ABORT_IN_PROG (1ull << 62) +#define IPATH_SDMA_STATUS_INTERNAL_SDMA_ENABLE (1ull << 61) +#define IPATH_SDMA_STATUS_SCB_EMPTY (1ull << 30) + /* max dwords in small buffer packet */ #define IPATH_SMALLBUF_DWORDS (dd->ipath_piosize2k >> 2) diff --git a/drivers/infiniband/hw/ipath/ipath_sdma.c b/drivers/infiniband/hw/ipath/ipath_sdma.c index 0a8c1b8..eaba032 100644 --- a/drivers/infiniband/hw/ipath/ipath_sdma.c +++ b/drivers/infiniband/hw/ipath/ipath_sdma.c @@ -263,14 +263,10 @@ static void sdma_abort_task(unsigned long opaque) hwstatus = ipath_read_kreg64(dd, dd->ipath_kregs->kr_senddmastatus); - if (/* ScoreBoardDrainInProg */ - test_bit(63, &hwstatus) || - /* AbortInProg */ - test_bit(62, &hwstatus) || - /* InternalSDmaEnable */ - test_bit(61, &hwstatus) || - /* ScbEmpty */ - !test_bit(30, &hwstatus)) { + if ((hwstatus & (IPATH_SDMA_STATUS_SCORE_BOARD_DRAIN_IN_PROG | + IPATH_SDMA_STATUS_ABORT_IN_PROG | + IPATH_SDMA_STATUS_INTERNAL_SDMA_ENABLE)) || + !(hwstatus & IPATH_SDMA_STATUS_SCB_EMPTY)) { if (dd->ipath_sdma_reset_wait > 0) { /* not done shutting down sdma */ --dd->ipath_sdma_reset_wait; From rdreier at cisco.com Mon May 26 15:22:27 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 15:22:27 -0700 Subject: [ofa-general] [PATCH] IB/ipath - fix device capability flags In-Reply-To: <20080523214334.20736.86499.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Fri, 23 May 2008 14:43:34 -0700") References: <20080523214329.20736.85555.stgit@eng-46.mv.qlogic.com> <20080523214334.20736.86499.stgit@eng-46.mv.qlogic.com> Message-ID: thanks, applied From rdreier at cisco.com Mon May 26 15:23:48 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 15:23:48 -0700 Subject: [ofa-general] Re: [ PATCH ] RDMA/nes Update MAINTAINERS list In-Reply-To: <200805211649.m4LGnwPP026935@velma.neteffect.com> (Chien Tung's message of "Wed, 21 May 2008 11:49:58 -0500") References: <200805211649.m4LGnwPP026935@velma.neteffect.com> Message-ID: thanks, applied. From swise at opengridcomputing.com Mon May 26 15:33:59 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 26 May 2008 17:33:59 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> Message-ID: <483B3AD7.4050208@opengridcomputing.com> Roland Dreier wrote: > > BTW: a single capability bit doesn't allow apps to decide at run time > > whether to use read-with-inv, which is iwarp-only. Perhaps we need > > that as its own capbility bit? Or perhaps we can load detailed > > support/no support into the query device logic? What it some devices > > can only support part of the suite of MEM_MGT_EXTENSIONS? > > I think RDMA read with invalidate can be tested for as iWARP vs. IB. > The reason IB doesn't have it is kind of inherent in the IB protocol, > since remote access is not required for the RDMA target. 
> The "invalidate local stag" part of a read is just a local sink side operation (ie no wire protocol change from a read). It's not like processing an ingress send-with-inv. It is really functionally like a read followed immediately by a fenced invalidate-local, but it doesn't stall the pipe. So the device has to remember the read is a "with inv local stag" and invalidate the stag after the read response is placed and before the WCE is reaped by the application. > I think making the capability flags really fine-grained isn't worth > it -- we went too far in that direction historically, and no one checks > any capability flags at all. It's just complexity. > Ok. Steve. From rdreier at cisco.com Mon May 26 16:02:30 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 16:02:30 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483B3AD7.4050208@opengridcomputing.com> (Steve Wise's message of "Mon, 26 May 2008 17:33:59 -0500") References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> Message-ID: > The "invalidate local stag" part of a read is just a local sink side > operation (ie no wire protocol change from a read). It's not like > processing an ingress send-with-inv. It is really functionally like a > read followed immediately by a fenced invalidate-local, but it doesn't > stall the pipe. So the device has to remember the read is a "with inv > local stag" and invalidate the stag after the read response is placed > and before the WCE is reaped by the application. Yes, understood. My point was just that in IB, at least in theory, one could just use an L_Key that doesn't have any remote permissions in the scatter list of an RDMA read, while in iWARP, the STag used to place an RDMA read response has to have remote write permission. So RDMA read with invalidate makes sense for iWARP, because it gives a race-free way to allow an STag to be invalidated immediately after an RDMA read response is placed, while in IB it's simpler just to never give remote access at all. - R. From uspropertyfax at gmail.com Mon May 26 19:06:14 2008 From: uspropertyfax at gmail.com (US Property Report) Date: Mon, 26 May 2008 19:06:14 -0700 Subject: [ofa-general] Property Fax Report Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
From rdreier at cisco.com Mon May 26 20:44:14 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 20:44:14 -0700 Subject: [ofa-general] [PATCH V2 1/2] IB/core: handle race between elements in work queues after event In-Reply-To: <4832DC80.2000408@Voltaire.COM> (Moni Shoua's message of "Tue, 20 May 2008 17:13:20 +0300") References: <48302034.8040709@Voltaire.COM> <4832D99E.3010205@Voltaire.COM> <4832DC80.2000408@Voltaire.COM> Message-ID: thanks, applied for 2.6.27
From rdreier at cisco.com Mon May 26 20:48:17 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 20:48:17 -0700 Subject: [ofa-general] [PATCH] IB/mlx4: Optimize stamping for selective signalling QPs In-Reply-To: <1211374769.6577.21.camel@eli-laptop> (Eli Cohen's message of "Wed, 21 May 2008 15:59:29 +0300") References: <1211374769.6577.21.camel@eli-laptop> Message-ID: thanks, applied for 2.6.27
From rdreier at cisco.com Mon May 26 20:58:29 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 20:58:29 -0700 Subject: [ofa-general] Re: [PATCH] core/include: fix coding style typos according to checkpatch.pl In-Reply-To: <200805232252.04047.dotanba@gmail.com> (Dotan Barak's message of "Fri, 23 May 2008 22:52:03 +0300") References: <200805232252.04047.dotanba@gmail.com> Message-ID: Thanks, all changes do look like improvements. applied for 2.6.27. I'll get rid of the rest of the $Id lines in drivers/infiniband.
From rdreier at cisco.com Mon May 26 21:10:08 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 21:10:08 -0700 Subject: [ofa-general] Re: [PATCH] core/include: fix coding style typos according to checkpatch.pl In-Reply-To: (Roland Dreier's message of "Mon, 26 May 2008 20:58:29 -0700") References: <200805232252.04047.dotanba@gmail.com> Message-ID: > I'll get rid of the rest of the $Id lines in drivers/infiniband. like this... commit 2be5019394ab8a6fa924bc955682db62950ddcc6 Author: Roland Dreier Date: Mon May 26 21:09:23 2008 -0700 RDMA: Remove subversion $Id tags They don't get updated by git and so they're worse than useless.
- * - * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h index 7ad47a4..05ac36e 100644 --- a/drivers/infiniband/core/core_priv.h +++ b/drivers/infiniband/core/core_priv.h @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: core_priv.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef _CORE_PRIV_H diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 5ac5ffe..7913b80 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: device.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/core/fmr_pool.c b/drivers/infiniband/core/fmr_pool.c index 1286dc1..4507043 100644 --- a/drivers/infiniband/core/fmr_pool.c +++ b/drivers/infiniband/core/fmr_pool.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: fmr_pool.c 2730 2005-06-28 16:43:03Z sean.hefty $ */ #include diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h index 8b75010..05ce331 100644 --- a/drivers/infiniband/core/mad_priv.h +++ b/drivers/infiniband/core/mad_priv.h @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mad_priv.h 5596 2006-03-03 01:00:07Z sean.hefty $ */ #ifndef __IB_MAD_PRIV_H__ diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c index a5e2a31..d0ef7d6 100644 --- a/drivers/infiniband/core/mad_rmpp.c +++ b/drivers/infiniband/core/mad_rmpp.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mad_rmpp.c 1921 2005-03-02 22:58:44Z sean.hefty $ */ #include "mad_priv.h" diff --git a/drivers/infiniband/core/mad_rmpp.h b/drivers/infiniband/core/mad_rmpp.h index f0616fd..3d336bf 100644 --- a/drivers/infiniband/core/mad_rmpp.h +++ b/drivers/infiniband/core/mad_rmpp.h @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mad_rmpp.h 1921 2005-02-25 22:58:44Z sean.hefty $ */ #ifndef __MAD_RMPP_H__ diff --git a/drivers/infiniband/core/packer.c b/drivers/infiniband/core/packer.c index c972d72..019bd4b 100644 --- a/drivers/infiniband/core/packer.c +++ b/drivers/infiniband/core/packer.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: packer.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 78ea815..1341de7 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: sa_query.c 2811 2005-07-06 18:11:43Z halr $ */ #include diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index 9575655..36a0ef9 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: sysfs.c 1349 2004-12-16 21:09:43Z roland $ */ #include "core_priv.h" diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index d7a6881..54fc1de 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include diff --git a/drivers/infiniband/core/ud_header.c b/drivers/infiniband/core/ud_header.c index 997c07d..8ec7876 100644 --- a/drivers/infiniband/core/ud_header.c +++ b/drivers/infiniband/core/ud_header.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ud_header.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index fe78f7d..5c145b2 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: uverbs_mem.c 2743 2005-06-28 22:27:59Z roland $ */ #include diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index 840ede9..eb58fcf 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -31,8 +31,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: user_mad.c 5596 2006-03-03 01:00:07Z sean.hefty $ */ #include diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 376a57c..b3ea958 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -32,8 +32,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: uverbs.h 2559 2005-06-06 19:43:16Z roland $ */ #ifndef UVERBS_H diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 2c3bff5..112b37c 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -31,8 +31,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: uverbs_cmd.c 2708 2005-06-24 17:27:21Z roland $ */ #include diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index f806da1..9bc07f4 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -32,8 +32,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: uverbs_main.c 2733 2005-06-28 19:14:34Z roland $ */ #include diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 0504208..9f399d3 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -34,8 +34,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: verbs.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_allocator.c b/drivers/infiniband/hw/mthca/mthca_allocator.c index a763067..c5ccc2d 100644 --- a/drivers/infiniband/hw/mthca/mthca_allocator.c +++ b/drivers/infiniband/hw/mthca/mthca_allocator.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_allocator.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_av.c b/drivers/infiniband/hw/mthca/mthca_av.c index 4b111a8..32f6c63 100644 --- a/drivers/infiniband/hw/mthca/mthca_av.c +++ b/drivers/infiniband/hw/mthca/mthca_av.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_av.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_catas.c b/drivers/infiniband/hw/mthca/mthca_catas.c index e948158..40573e4 100644 --- a/drivers/infiniband/hw/mthca/mthca_catas.c +++ b/drivers/infiniband/hw/mthca/mthca_catas.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id$ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index 54d230e..c33e1c5 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_cmd.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.h b/drivers/infiniband/hw/mthca/mthca_cmd.h index 8928ca4..6efd326 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.h +++ b/drivers/infiniband/hw/mthca/mthca_cmd.h @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_cmd.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef MTHCA_CMD_H diff --git a/drivers/infiniband/hw/mthca/mthca_config_reg.h b/drivers/infiniband/hw/mthca/mthca_config_reg.h index afa56bf..75671f7 100644 --- a/drivers/infiniband/hw/mthca/mthca_config_reg.h +++ b/drivers/infiniband/hw/mthca/mthca_config_reg.h @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: mthca_config_reg.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef MTHCA_CONFIG_REG_H diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 20401d2..f788fce 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -32,8 +32,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_cq.c 1369 2004-12-20 16:17:07Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index 7bc32f8..2997d8d 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -32,8 +32,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_dev.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef MTHCA_DEV_H diff --git a/drivers/infiniband/hw/mthca/mthca_doorbell.h b/drivers/infiniband/hw/mthca/mthca_doorbell.h index b374dc3..14f51ef 100644 --- a/drivers/infiniband/hw/mthca/mthca_doorbell.h +++ b/drivers/infiniband/hw/mthca/mthca_doorbell.h @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_doorbell.h 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_eq.c b/drivers/infiniband/hw/mthca/mthca_eq.c index 8bde7f9..4e36aa7 100644 --- a/drivers/infiniband/hw/mthca/mthca_eq.c +++ b/drivers/infiniband/hw/mthca/mthca_eq.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_eq.c 1382 2004-12-24 02:21:02Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_mad.c b/drivers/infiniband/hw/mthca/mthca_mad.c index 8b7e83e..6404495 100644 --- a/drivers/infiniband/hw/mthca/mthca_mad.c +++ b/drivers/infiniband/hw/mthca/mthca_mad.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_mad.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 200cf13..fb9f91b 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_main.c 1396 2004-12-28 04:10:27Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_mcg.c b/drivers/infiniband/hw/mthca/mthca_mcg.c index a8ad072..3f5f948 100644 --- a/drivers/infiniband/hw/mthca/mthca_mcg.c +++ b/drivers/infiniband/hw/mthca/mthca_mcg.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: mthca_mcg.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c index b224079..9e77ba9 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.c +++ b/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id$ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.h b/drivers/infiniband/hw/mthca/mthca_memfree.h index a1ab068..da9b8f9 100644 --- a/drivers/infiniband/hw/mthca/mthca_memfree.h +++ b/drivers/infiniband/hw/mthca/mthca_memfree.h @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id$ */ #ifndef MTHCA_MEMFREE_H diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c index 820205d..8489b1e 100644 --- a/drivers/infiniband/hw/mthca/mthca_mr.c +++ b/drivers/infiniband/hw/mthca/mthca_mr.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_mr.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_pd.c b/drivers/infiniband/hw/mthca/mthca_pd.c index c1e9507..266f14e 100644 --- a/drivers/infiniband/hw/mthca/mthca_pd.c +++ b/drivers/infiniband/hw/mthca/mthca_pd.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_pd.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_profile.c b/drivers/infiniband/hw/mthca/mthca_profile.c index 605a8d5..d168c25 100644 --- a/drivers/infiniband/hw/mthca/mthca_profile.c +++ b/drivers/infiniband/hw/mthca/mthca_profile.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_profile.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_profile.h b/drivers/infiniband/hw/mthca/mthca_profile.h index e76cb62..62b009c 100644 --- a/drivers/infiniband/hw/mthca/mthca_profile.h +++ b/drivers/infiniband/hw/mthca/mthca_profile.h @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_profile.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef MTHCA_PROFILE_H diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index be34f99..87ad889 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -32,8 +32,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: mthca_provider.c 4859 2006-01-09 21:55:10Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_provider.h b/drivers/infiniband/hw/mthca/mthca_provider.h index 934bf95..c621f87 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.h +++ b/drivers/infiniband/hw/mthca/mthca_provider.h @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_provider.h 1349 2004-12-16 21:09:43Z roland $ */ #ifndef MTHCA_PROVIDER_H diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 09dc361..3b1c5ba 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -31,8 +31,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_qp.c 1355 2004-12-17 15:23:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_reset.c b/drivers/infiniband/hw/mthca/mthca_reset.c index 91934f2..acb6817 100644 --- a/drivers/infiniband/hw/mthca/mthca_reset.c +++ b/drivers/infiniband/hw/mthca/mthca_reset.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_reset.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_srq.c b/drivers/infiniband/hw/mthca/mthca_srq.c index a5ffff6..4fabe62 100644 --- a/drivers/infiniband/hw/mthca/mthca_srq.c +++ b/drivers/infiniband/hw/mthca/mthca_srq.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: mthca_srq.c 3047 2005-08-10 03:59:35Z roland $ */ #include diff --git a/drivers/infiniband/hw/mthca/mthca_uar.c b/drivers/infiniband/hw/mthca/mthca_uar.c index 8b72848..ca5900c 100644 --- a/drivers/infiniband/hw/mthca/mthca_uar.c +++ b/drivers/infiniband/hw/mthca/mthca_uar.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id$ */ #include /* PAGE_SHIFT */ diff --git a/drivers/infiniband/hw/mthca/mthca_user.h b/drivers/infiniband/hw/mthca/mthca_user.h index e1262c9..5fe56e8 100644 --- a/drivers/infiniband/hw/mthca/mthca_user.h +++ b/drivers/infiniband/hw/mthca/mthca_user.h @@ -29,7 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * */ #ifndef MTHCA_USER_H diff --git a/drivers/infiniband/hw/mthca/mthca_wqe.h b/drivers/infiniband/hw/mthca/mthca_wqe.h index b3551a8..341a5ae 100644 --- a/drivers/infiniband/hw/mthca/mthca_wqe.h +++ b/drivers/infiniband/hw/mthca/mthca_wqe.h @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: mthca_wqe.h 3047 2005-08-10 03:59:35Z roland $ */ #ifndef MTHCA_WQE_H diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index ca126fc..0dcbab3 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ipoib.h 1358 2004-12-17 22:00:11Z roland $ */ #ifndef _IPOIB_H diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 97e67d3..91c9592 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id$ */ #include diff --git a/drivers/infiniband/ulp/ipoib/ipoib_fs.c b/drivers/infiniband/ulp/ipoib/ipoib_fs.c index 8b882bb..961c585 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_fs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_fs.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ipoib_fs.c 1389 2004-12-27 22:56:47Z roland $ */ #include diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index f429bce..eca8518 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -31,8 +31,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ipoib_ib.c 1386 2004-12-27 16:23:17Z roland $ */ #include diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 2442090..f217b1e 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ipoib_main.c 1377 2004-12-23 19:57:12Z roland $ */ #include "ipoib.h" diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 3f663fb..4a6538b 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -30,8 +30,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ipoib_multicast.c 1362 2004-12-18 15:56:29Z roland $ */ #include diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 8766d29..810790a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: ipoib_verbs.c 1349 2004-12-16 21:09:43Z roland $ */ #include "ipoib.h" diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c index 1cdb5cf..b08eb56 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ipoib_vlan.c 1349 2004-12-16 21:09:43Z roland $ */ #include diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c index aeb58ca..356fac6 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.c +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c @@ -42,9 +42,6 @@ * Zhenyu Wang * Modified by: * Erez Zilber - * - * - * $Id: iscsi_iser.c 6965 2006-05-07 11:36:20Z ogerlitz $ */ #include diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h index a8c1b30..0e10703 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.h +++ b/drivers/infiniband/ulp/iser/iscsi_iser.h @@ -36,8 +36,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: iscsi_iser.h 7051 2006-05-10 12:29:11Z ogerlitz $ */ #ifndef __ISCSI_ISER_H__ #define __ISCSI_ISER_H__ diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c index 08dc81c..31ad498 100644 --- a/drivers/infiniband/ulp/iser/iser_initiator.c +++ b/drivers/infiniband/ulp/iser/iser_initiator.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: iser_initiator.c 6964 2006-05-07 11:11:43Z ogerlitz $ */ #include #include diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c index cac50c4..81e49cb 100644 --- a/drivers/infiniband/ulp/iser/iser_memory.c +++ b/drivers/infiniband/ulp/iser/iser_memory.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: iser_memory.c 6964 2006-05-07 11:11:43Z ogerlitz $ */ #include #include diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c index d19cfe6..77cabee 100644 --- a/drivers/infiniband/ulp/iser/iser_verbs.c +++ b/drivers/infiniband/ulp/iser/iser_verbs.c @@ -29,8 +29,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: iser_verbs.c 7051 2006-05-10 12:29:11Z ogerlitz $ */ #include #include diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 81cc59c..ed7c5f7 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. 
- * - * $Id: ib_srp.c 3932 2005-11-01 17:19:29Z roland $ */ #include diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h index 63d2ae7..e185b90 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.h +++ b/drivers/infiniband/ulp/srp/ib_srp.h @@ -28,8 +28,6 @@ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. - * - * $Id: ib_srp.h 3932 2005-11-01 17:19:29Z roland $ */ #ifndef IB_SRP_H From rdreier at cisco.com Mon May 26 21:18:24 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 May 2008 21:18:24 -0700 Subject: [ofa-general] Re: [ewg] [PATCH] IB/ipoib: copy small SKBs in CM mode In-Reply-To: <1211719109.13769.32.camel@mtls03> (Eli Cohen's message of "Sun, 25 May 2008 15:38:29 +0300") References: <1211470815.7310.61.camel@eli-laptop> <1211719109.13769.32.camel@mtls03> Message-ID: > + if (wc->byte_len < SKB_TSHOLD) { > + int dlen = wc->byte_len; > + > + small_skb = dev_alloc_skb(dlen + 12); > + if (small_skb) { > + skb_reserve(small_skb, 12); > + skb_copy_from_linear_data(skb, small_skb->data, dlen); > + skb_put(small_skb, dlen); Just noticed in the original patch: you need calls to ib_dma_sync_single_for_cpu and ib_dma_sync_single_for_device around this skb_copy_from_linear_data. > > Any reason why we wouldn't want this info in the patch changelog? > Not really. If you think it should be there, I'll add it to the > changelog along with an explanation to the question bellow. Yes, definitely we want the performance info. Imagine if you were reading the patch in git history -- clearly this justification and measurement would be very helpful in understanding why the patch was added, and there's no reason to leave out the useful information you've already written. - R. From ramachandra.kuchimanchi at qlogic.com Mon May 26 23:23:42 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 27 May 2008 11:53:42 +0530 Subject: [ofa-general] Re: [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103730.12355.14730.stgit@localhost.localdomain> <71d336490805260037y4c45abd5lccd5dbadebd84eb8@mail.gmail.com> Message-ID: <71d336490805262323qd876161ue86f98dbc8499ad6@mail.gmail.com> Roland, On Tue, May 27, 2008 at 3:17 AM, Roland Dreier wrote: > OK, I looked at the code. Is there any point to having > CONFIG_INFINIBAND_QLGC_VNIC_DEBUG at all?? Is anyone going to care > about having __FILE__ and __LINE__ included in the output and want to > set this option to 'n'? Makes sense. We will get rid of this CONFIG option. Apart from this are there any other changes you would like to see in the patch series ? 
Regards, Ram From tziporet at mellanox.co.il Tue May 27 00:00:11 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 27 May 2008 10:00:11 +0300 Subject: [ofa-general] Update on features that are delayed from ofed 1.4 Message-ID: <6C2C79E72C305246B504CBA17B5500C9041CE703@mtlexch01.mtl.com> Hi, The following features were planned for OFED 1.4 but in the end will not be implemented: * RDMA CM to support IPv6 * Xsigo's host drivers: Virtual NIC and HBA * New IB verb for reliable multicast * SDP: RDMA zero copy Please review and comment Tziporet From marcel.heinz at informatik.tu-chemnitz.de Tue May 27 00:44:27 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Tue, 27 May 2008 09:44:27 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <483A7E40.5040407@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> Message-ID: <483BBBDB.6000605@informatik.tu-chemnitz.de> Marcel Heinz wrote: > Dotan Barak wrote: >>Do you use the latest released FW for this device? > > The HCAs all use Mellanox' latest released FW version 1.2.0. I'll have a > look at the switch later. The switch is Mellanox MT47396 based and uses FW version 1.0.0. This isn't the latest one, but I don't see anything in the release notes of the 1.0.5 firmware which is related to our problem. Regards, Marcel From Sumit.Gaur at Sun.COM Tue May 27 00:50:27 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Tue, 27 May 2008 13:20:27 +0530 Subject: [Fwd: Re: [ofa-general] question regarding umad_recv] Message-ID: <483BBD43.5080807@Sun.COM> An embedded message was scrubbed... From: Sumit Gaur - Sun Microsystem Subject: Re: [ofa-general] question regarding umad_recv Date: Tue, 27 May 2008 13:14:28 +0530 Size: 2626 URL: From okir at lst.de Tue May 27 01:14:29 2008 From: okir at lst.de (Olaf Kirch) Date: Tue, 27 May 2008 10:14:29 +0200 Subject: [ofa-general] Fwd: [rds-devel] RDS: Fix a bug in RDMA signalling Message-ID: <200805271014.30175.okir@lst.de> Oops, this should have gone to ofa-general as well. Olaf ---------- Forwarded Message ---------- Subject: [rds-devel] RDS: Fix a bug in RDMA signalling Date: Tuesday 27 May 2008 From: Olaf Kirch To: rds-devel at oss.oracle.com I found an issue with RDMA completions vs. IB connection teardown yesterday - the patch seems correct to me (and passed my testing), but I would appreciate a second set of eyeballs, and some additional testing (Rick, can you spin the crload wheel with this one, please?) Vlad, is there any chance to get this into 1.3.1? I think I remember that rc2 (which is out already) was supposed to be the final release candidate for 1.3.1 - correct? The patch is also available from my tree at git://git.openfabrics.org/~okir/ofed_1_3/linux-2.6.git code-drop-20080527 Olaf ------------------- From: Olaf Kirch Subject: RDS: Fix a bug in RDMA signalling Code inspection revealed a problem in the way we signal RDMA completions to user space when a connection goes down. The send CQ handler calls rds_ib_send_unmap_rm, indicating success no matter what the WC status code says. This means we may signal success for RDMAs that get flushed out with WR_FLUSH_ERR. This patch fixes the problem by passing the wc.status value to rds_ib_send_unmap_rm for inspection. While I was at it, I moved the code that translates WC status codes to RDMA notifications from the send CQ handler to rds_ib_send_unmap_rm where it belongs.
Signed-off-by: Olaf Kirch --- net/rds/ib_send.c | 39 ++++++++++++++++++++------------------- net/rds/rdma.h | 2 +- net/rds/send.c | 3 ++- 3 files changed, 23 insertions(+), 21 deletions(-) Index: ofa_kernel-1.3.1/net/rds/ib_send.c =================================================================== --- ofa_kernel-1.3.1.orig/net/rds/ib_send.c +++ ofa_kernel-1.3.1/net/rds/ib_send.c @@ -41,7 +41,7 @@ void rds_ib_send_unmap_rm(struct rds_ib_connection *ic, struct rds_ib_send_work *send, - int success) + int wc_status) { struct rds_message *rm = send->s_rm; @@ -52,7 +52,9 @@ void rds_ib_send_unmap_rm(struct rds_ib_ DMA_TO_DEVICE); /* raise rdma completion hwm */ - if (rm->m_rdma_op && success) { + if (rm->m_rdma_op && wc_status != IB_WC_WR_FLUSH_ERR) { + int notify_status; + /* If the user asked for a completion notification on this * message, we can implement three different semantics: * 1. Notify when we received the ACK on the RDS message @@ -68,7 +70,20 @@ void rds_ib_send_unmap_rm(struct rds_ib_ * don't call rds_rdma_send_complete at all, and fall back to the notify * handling in the ACK processing code. */ - rds_rdma_send_complete(rm); + switch (wc_status) { + case IB_WC_SUCCESS: + notify_status = RDS_RDMA_SUCCESS; + break; + + case IB_WC_REM_ACCESS_ERR: + notify_status = RDS_RDMA_REMOTE_ERROR; + break; + + default: + notify_status = RDS_RDMA_OTHER_ERROR; + break; + } + rds_rdma_send_complete(rm, notify_status); if (rm->m_rdma_op->r_write) rds_stats_add(s_send_rdma_bytes, rm->m_rdma_op->r_bytes); @@ -118,7 +133,7 @@ void rds_ib_send_clear_ring(struct rds_i if (send->s_wr.opcode == 0xdead) continue; if (send->s_rm) - rds_ib_send_unmap_rm(ic, send, 0); + rds_ib_send_unmap_rm(ic, send, IB_WC_WR_FLUSH_ERR); if (send->s_op) ib_dma_unmap_sg(ic->i_cm_id->device, send->s_op->r_sg, send->s_op->r_nents, @@ -174,7 +189,7 @@ void rds_ib_send_cq_comp_handler(struct switch (send->s_wr.opcode) { case IB_WR_SEND: if (send->s_rm) - rds_ib_send_unmap_rm(ic, send, 1); + rds_ib_send_unmap_rm(ic, send, wc.status); break; case IB_WR_RDMA_WRITE: if (send->s_op) @@ -204,20 +219,6 @@ void rds_ib_send_cq_comp_handler(struct oldest = (oldest + 1) % ic->i_send_ring.w_nr; } - if (unlikely(wc.status != IB_WC_SUCCESS && send->s_op && send->s_op->r_notifier)) { - switch (wc.status) { - default: - send->s_op->r_notifier->n_status = RDS_RDMA_OTHER_ERROR; - break; - case IB_WC_REM_ACCESS_ERR: - send->s_op->r_notifier->n_status = RDS_RDMA_REMOTE_ERROR; - break; - case IB_WC_WR_FLUSH_ERR: - /* flushed out; not an error */ - break; - } - } - rds_ib_ring_free(&ic->i_send_ring, completed); if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags) Index: ofa_kernel-1.3.1/net/rds/rdma.h =================================================================== --- ofa_kernel-1.3.1.orig/net/rds/rdma.h +++ ofa_kernel-1.3.1/net/rds/rdma.h @@ -71,6 +71,6 @@ int rds_cmsg_rdma_args(struct rds_sock * int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm, struct cmsghdr *cmsg); void rds_rdma_free_op(struct rds_rdma_op *ro); -void rds_rdma_send_complete(struct rds_message *rm); +void rds_rdma_send_complete(struct rds_message *rm, int); #endif Index: ofa_kernel-1.3.1/net/rds/send.c =================================================================== --- ofa_kernel-1.3.1.orig/net/rds/send.c +++ ofa_kernel-1.3.1/net/rds/send.c @@ -361,7 +361,7 @@ int rds_send_acked_before(struct rds_con * the IB send completion on the RDMA op and the accompanying * message. 
*/ -void rds_rdma_send_complete(struct rds_message *rm) +void rds_rdma_send_complete(struct rds_message *rm, int status) { struct rds_sock *rs = NULL; struct rds_rdma_op *ro; @@ -376,6 +376,7 @@ void rds_rdma_send_complete(struct rds_m rs = rm->m_rs; sock_hold(rds_rs_to_sk(rs)); + notifier->n_status = status; spin_lock(&rs->rs_lock); list_add_tail(&notifier->n_list, &rs->rs_notify_queue); spin_unlock(&rs->rs_lock); -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax _______________________________________________ rds-devel mailing list rds-devel at oss.oracle.com http://oss.oracle.com/mailman/listinfo/rds-devel ------------------------------------------------------- -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From ogerlitz at voltaire.com Tue May 27 02:01:45 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 27 May 2008 12:01:45 +0300 Subject: [ofa-general] Update on features that are delayed In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9041CE703@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9041CE703@mtlexch01.mtl.com> Message-ID: <483BCDF9.2080207@voltaire.com> Tziporet Koren wrote: > The following features were planned for OFED 1.4 but in the end will not be implemented: > * RDMA CM to support IPv6 Hi Sean, Can you please elaborate on what is missing in the design/implementation of the rdma-cm regarding IPv6? Or. From eli at dev.mellanox.co.il Tue May 27 02:05:48 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Tue, 27 May 2008 12:05:48 +0300 Subject: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode Message-ID: <1211879148.13769.94.camel@mtls03> >From db74e3fc04ef41da02d65c056b78275365891b3d Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Thu, 22 May 2008 16:28:59 +0300 Subject: [PATCH] IB/ipoib: copy small SKBs in CM mode CM mode of ipoib has a large overhead in the receive flow for managing SKBs. It usually allocates an SKB with as much data as was used in the currently received SKB and moves unused fragments from the old SKB to the new one. This involves a loop over all the remaining fragments and incurs overhead on the CPU.
This patch, for small SKBs, allocates an SKB just large enough to contain the received data and copies into it the data from the received SKB. The newly allocated SKB is passed to the stack and the old SKB is reposted. When running netperf, UDP small messages, without this patch I get: UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 14.4.3.178 (14.4.3.178) port 0 AF_INET Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 114688 128 10.00 5142034 0 526.31 114688 10.00 1130489 115.71 With this patch I get both send and receive at ~315 Mbps. The reason for that is as follows: When using this patch, the overhead of the CPU for handling RX packets is dramatically reduced. As a result, we do not experience RNR NACK messages from the receiver which cause the connection to be closed and reopened again; when the patch is not used, the receiver cannot handle the packets fast enough so there is less time to post new buffers and hence the mentioned RNR NACKs. So what happens is that the application *thinks* it posted a certain number of packets for transmission but these packets are flushed and do not really get transmitted. Since the connection gets opened and closed many times, each time netperf gets the CPU time that otherwise would have been given to IPoIB to actually transmit the packets. This can be verified by looking at the port counters, the output of ifconfig and the output of netperf (this is for the case without the patch): tx packets ========== port counter: 1,543,996 ifconfig: 1,581,426 netperf: 5,142,034 rx packets ========== netperf 1,130,489 Signed-off-by: Eli Cohen --- Changes since V1: 1. wrapped call to skb_copy_from_linear_data() with calls to ib_dma_sync_single_for_cpu() and ib_dma_sync_single_for_device() 2. Ensure SKB_TSHOLD is not defined too large.
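A note on the two changes listed above. Change 1 is needed because the receive buffer is DMA-mapped: on non-cache-coherent platforms the CPU is not guaranteed to see the HCA's writes until the buffer is synced for CPU access, so the copy has to be bracketed by sync calls. A minimal sketch of the pattern, with "mapping" abbreviating the patch's rx_ring[wr_id].mapping[0] (this is an illustration, not a drop-in snippet):

	/* Hand the DMA'd receive buffer to the CPU, copy the payload
	 * out, then hand it back to the device so it can be reposted. */
	ib_dma_sync_single_for_cpu(priv->ca, mapping, dlen, DMA_FROM_DEVICE);
	skb_copy_from_linear_data(skb, small_skb->data, dlen);
	ib_dma_sync_single_for_device(priv->ca, mapping, dlen, DMA_FROM_DEVICE);

For change 2, since SKB_TSHOLD and IPOIB_CM_HEAD_SIZE are both compile-time constants, the same guard could arguably be expressed at build time with BUILD_BUG_ON(SKB_TSHOLD > IPOIB_CM_HEAD_SIZE) instead of the module-load check; the patch below keeps the runtime variant as posted.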
drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_cm.c | 19 +++++++++++++++++++ drivers/infiniband/ulp/ipoib/ipoib_main.c | 10 ++++++++++ 3 files changed, 30 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index ca126fc..e39bf36 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -97,6 +97,7 @@ enum { IPOIB_MCAST_FLAG_ATTACHED = 3, MAX_SEND_CQE = 16, + SKB_TSHOLD = 256, }; #define IPOIB_OP_RECV (1ul << 31) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 97e67d3..7be0a43 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -525,6 +525,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) u64 mapping[IPOIB_CM_RX_SG]; int frags; int has_srq; + struct sk_buff *small_skb; ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n", wr_id, wc->status); @@ -579,6 +580,23 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) } } + if (wc->byte_len < SKB_TSHOLD) { + int dlen = wc->byte_len; + + small_skb = dev_alloc_skb(dlen + 12); + if (small_skb) { + skb_reserve(small_skb, 12); + ib_dma_sync_single_for_cpu(priv->ca, rx_ring[wr_id].mapping[0], + dlen, DMA_FROM_DEVICE); + skb_copy_from_linear_data(skb, small_skb->data, dlen); + ib_dma_sync_single_for_device(priv->ca, rx_ring[wr_id].mapping[0], + dlen, DMA_FROM_DEVICE); + skb_put(small_skb, dlen); + skb = small_skb; + goto copied; + } + } + frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len, (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE; @@ -601,6 +619,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb); +copied: skb->protocol = ((struct ipoib_header *) skb->data)->proto; skb_reset_mac_header(skb); skb_pull(skb, IPOIB_ENCAP_LEN); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 2442090..ec6e7c5 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1304,6 +1304,16 @@ static int __init ipoib_init_module(void) ipoib_max_conn_qp = min(ipoib_max_conn_qp, IPOIB_CM_MAX_CONN_QP); #endif + /* + * we rely on this condition when copying small skbs and we + * pass ownership of the first fragment only. + */ + if (SKB_TSHOLD > IPOIB_CM_HEAD_SIZE) { + printk("%s: SKB_TSHOLD(%d) must not be larger than %d\n", + THIS_MODULE->name, SKB_TSHOLD, IPOIB_CM_HEAD_SIZE); + return -EINVAL; + } + ret = ipoib_register_debugfs(); if (ret) return ret; -- 1.5.5.1 From Sumit.Gaur at Sun.COM Tue May 27 02:04:43 2008 From: Sumit.Gaur at Sun.COM (Sumit Gaur - Sun Microsystem) Date: Tue, 27 May 2008 14:34:43 +0530 Subject: [ofa-general] question regarding umad_recv In-Reply-To: <20071016204021.GC12364@sashak.voltaire.com> References: <20071015195140.GA12364@sashak.voltaire.com> <000001c80f65$87bea170$3c98070a@amr.corp.intel.com> <20071016204021.GC12364@sashak.voltaire.com> Message-ID: <483BCEAB.3020902@Sun.COM> Hi, Just want to confirm into which OFED build this bug fix (*support multiple umad_open_port*) has been integrated. I am working on OFED-1.2.5.5. Also, if it is not available in my current build, could I apply any available patch to get the fix?
Thanks and Regards sumit Sasha Khapyorsky wrote: > On 12:56 Mon 15 Oct , Sean Hefty wrote: > >>>Seems you don't think it is very critical, cannot say I disagree so much. >>>Hmm, let's change portid -> fd and deprecate umad_get_fd() after OFED? >> >>My vote is to retain some sort of abstraction. Once we get rid of it, it will >>be very hard to add it back in. > > > That is true, but I cannot find a scenario when using fd as umad device > handle could be insufficient. Even if we will need to create some > internally tracked per device data again (unlikely) fd can serve as an > index just as well. The whole issue is all about naming and seems minor for > me - without actual API change we can rename it once and rename again > later if it will be needed or keep things as it is - both options are > fine. And since there is concern let's do nothing and stay with "as is". > > >>My concern is that multi-thread receive handling isn't easily supported when >>RMPP is involved, and having umad_recv take an abstract 'id' gives us some >>flexibility that could come in useful someday. >> >>E.g. something like: >>umad_recv() -> returns too small, gives necessary size + id specific to a mad >>umad_recv(mad id, new size ...) -> returns reassembled rmpp mad > > > With this second umad_recv() we also will need to specify which umad > device to use, I think API change will be required, right? > (the option to encode both fd and mad id as first umad_recv() parameter > looks messy for me.) > > Sasha From sashak at voltaire.com Tue May 27 03:03:17 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 27 May 2008 13:03:17 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080523121815.2c39e65a.weiny2@llnl.gov> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080522154702.430cdef7.weiny2@llnl.gov> <1211540855.13185.71.camel@hrosenstock-ws.xsigo.com> <20080523121815.2c39e65a.weiny2@llnl.gov> Message-ID: <20080527100317.GE12014@sashak.voltaire.com> Hi Ira, On 12:18 Fri 23 May , Ira Weiny wrote: > > > When you mention this I start to think about the secure API which Tim > submitted a few months ago and was not accepted. Don't think I missed that. I remember that there were some other changes, not the secure API yet.
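To make the umad_recv() discussion quoted above concrete, here is a minimal sketch of the current buffer-sizing pattern, written against libibumad as documented in umad_recv(3); it is an illustration only, the 256-byte initial payload size is arbitrary, and error handling is elided:

	/* Open the first port on the first CA and receive one MAD,
	 * growing the buffer if the library reports it was too small
	 * (per umad_recv(3) the required size is left in "length"). */
	int length = 256;
	int portid, agent;
	void *buf;

	umad_init();
	portid = umad_open_port(NULL, 0);
	buf = malloc(umad_size() + length);
	agent = umad_recv(portid, buf, &length, -1 /* block forever */);
	if (agent < 0) {
		/* reallocate with the updated length and call again */
	}

The thread's proposal amounts to replacing the portid handle in this call with a plain fd, or with a per-MAD id for RMPP reassembly.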
Sasha From vlad at lists.openfabrics.org Tue May 27 03:11:55 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 27 May 2008 03:11:55 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080527-0200 daily build status Message-ID: <20080527101155.D4044E60D57@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From sashak at voltaire.com Tue May 27 03:33:41 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 27 May 2008 13:33:41 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> Message-ID: 
<20080527103341.GF12014@sashak.voltaire.com> On 05:52 Fri 23 May , Hal Rosenstock wrote: > > But can be protected by other weak access control currently and perhaps > more in the future. OpenSM console is not a great example IMO - OpenSM doesn't need to issue SA queries against itself. > New commands which require trust can utilize SMKey > without it being specified (at least for OpenSM), no ? Maybe yes, but could you be more specific? Store SMKey in read-only file on a client side? > > And what about diagnostics when other SMs are used? > > I think there's a problem here in a trusted environments given the > approach taken as I've stated in the past but seems to have been > forgotten. The more trust the less the current diag strategy fits. > > Are you also going to be proposing exposing MKeys too once MKey > management is supported by OpenSM/other SMs ? I don't have any M_Key manager implementation details, but hope it will not needed. I'm not proposing to expose SM_Key, just added such option where this key could be specified. So: 1) this is *optional*, 2) there is no suggestions about how the right value should be determined. Sasha From hrosenstock at xsigo.com Tue May 27 04:14:55 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 27 May 2008 04:14:55 -0700 Subject: [ofa-general] [PATCH] management: Support separate SA and SM keys In-Reply-To: <4838F9ED.9090304@voltaire.com> References: <1211550432.13185.121.camel@hrosenstock-ws.xsigo.com> <4838F9ED.9090304@voltaire.com> Message-ID: <1211886895.13185.207.camel@hrosenstock-ws.xsigo.com> On Sun, 2008-05-25 at 08:32 +0300, Or Gerlitz wrote: > Hal Rosenstock wrote: > > management: Support separate SA and SM keys as clarified in IBA 1.2.1 > Does some host side patch is needed to inter-operate with this change? Nope. -- Hal
> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Tue May 27 04:29:12 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 27 May 2008 04:29:12 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211887752.13185.212.camel@hrosenstock-ws.xsigo.com> On Fri, 2008-05-23 at 05:52 -0700, Hal Rosenstock wrote: > > Following your logic we will need to disable root passwords > > typing too. > > That's taking it too far. Root passwords are at least hidden when > typing. At least hide the key typing from plain sight when typing like su does. -- Hal From hrosenstock at xsigo.com Tue May 27 04:33:56 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 27 May 2008 04:33:56 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080527103341.GF12014@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> <20080527103341.GF12014@sashak.voltaire.com> Message-ID: <1211888036.13185.219.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-27 at 13:33 +0300, Sasha Khapyorsky wrote: > On 05:52 Fri 23 May , Hal Rosenstock wrote: > > > > But can be protected by other weak access control currently and perhaps > > more in the future. > > OpenSM console is not a great example IMO - OpenSM doesn't need to issue > SA queries against itself. There's no reason it couldn't though rather than go after internal data structures. > > New commands which require trust can utilize SMKey > > without it being specified (at least for OpenSM), no ? > > Maybe yes, but could you be more specific? Store SMKey in read-only > file on a client side? Treat smkey as su treats password rather than a command line parameter is another alternative. > > > And what about diagnostics when other SMs are used? > > > > I think there's a problem here in a trusted environments given the > > approach taken as I've stated in the past but seems to have been > > forgotten. The more trust the less the current diag strategy fits. > > > > Are you also going to be proposing exposing MKeys too once MKey > > management is supported by OpenSM/other SMs ? > > I don't have any M_Key manager implementation details, There have been no details as yet but one can readily extrapolate the same issue from the spec. (The issue actually goes further for MKey IMO). > but hope it will not needed. I believe it is on the OFED 1.4 list. 
> I'm not proposing to expose SM_Key, just added such option where this > key could be specified. How is that not exposing it ? -- Hal > So: 1) this is *optional*, 2) there is no > suggestions about how the right value should be determined. > > Sasha From nico.mittenzwey at s2001.tu-chemnitz.de Tue May 27 05:00:34 2008 From: nico.mittenzwey at s2001.tu-chemnitz.de (Nico Mittenzwey) Date: Tue, 27 May 2008 14:00:34 +0200 Subject: [ofa-general] Retry count error with ipath on OFED-1.3 In-Reply-To: <829ded920805160242i57481603t3c65c44ceafd640@mail.gmail.com> References: <829ded920805160242i57481603t3c65c44ceafd640@mail.gmail.com> Message-ID: <483BF7E2.2000402@s2001.tu-chemnitz.de> Hello Keshetti, thanks for your response. After more tests it turned out to be a hardware error of the Infinipath HCA. regards Nico From hrosenstock at xsigo.com Tue May 27 06:02:34 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Tue, 27 May 2008 06:02:34 -0700 Subject: [ofa-general] [PATCH] mlx4_core: enable changing default max HCA resource limits at run time -- reposting In-Reply-To: <200804281438.28417.jackm@dev.mellanox.co.il> References: <200804281438.28417.jackm@dev.mellanox.co.il> Message-ID: <1211893354.13185.229.camel@hrosenstock-ws.xsigo.com> On Mon, 2008-04-28 at 14:38 +0300, Jack Morgenstein wrote: > mlx4-core: enable changing default max HCA resource limits. > > Enable module-initialization time modification of default HCA > maximum resource limits via module parameters, as is done in mthca. > > Specify the log of the parameter value, rather than the value itself > to avoid the hidden side-effect of rounding up values to next power-of-2. > > Signed-off-by: Jack Morgenstein Sorry if I'm rehashing this but this thread appears to have died out and I'm not sure about it's status: Where do we stand in terms of getting the additional mlx4 module parameters incorporated ? Thanks. -- Hal From tziporet at dev.mellanox.co.il Tue May 27 06:17:57 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 27 May 2008 16:17:57 +0300 Subject: [ofa-general] Fwd: [rds-devel] RDS: Fix a bug in RDMA signalling In-Reply-To: <200805271014.30175.okir@lst.de> References: <200805271014.30175.okir@lst.de> Message-ID: <483C0A05.5030602@mellanox.co.il> Olaf Kirch wrote: > Oops, this should have gone to ofa-general as well. > > Olaf > ---------- Forwarded Message ---------- > > Subject: [rds-devel] RDS: Fix a bug in RDMA signalling > Date: Tuesday 27 May 2008 > From: Olaf Kirch > To: rds-devel at oss.oracle.com > > > > Vlad, is there any chance to get this into 1.3.1? I think I remember > that rc2 (which is out already) was supposed to be the final release > candidate for 1.3.1 - correct? > > > I just asked to delay 1.3.1 to Monday so we have time to take this patch too Tziporet From vlad at dev.mellanox.co.il Tue May 27 06:31:44 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 27 May 2008 16:31:44 +0300 Subject: [ofa-general] Fwd: [rds-devel] RDS: Fix a bug in RDMA signalling In-Reply-To: <200805271014.30175.okir@lst.de> References: <200805271014.30175.okir@lst.de> Message-ID: <483C0D40.5060902@dev.mellanox.co.il> Olaf Kirch wrote: > Oops, this should have gone to ofa-general as well. > > Olaf > ---------- Forwarded Message ---------- > > Subject: [rds-devel] RDS: Fix a bug in RDMA signalling > Date: Tuesday 27 May 2008 > From: Olaf Kirch > To: rds-devel at oss.oracle.com > > I found an issue with RDMA completions vs. 
IB connection teardown > yesterday - the patch seems correct to me (and passed my testing), > but I would appreciate a second set of eyeballs, and some additional > testing (Rick, can you spin the crload wheel with this one, please?) > > Vlad, is there any chance to get this into 1.3.1? I think I remember > that rc2 (which is out already) was supposed to be the final release > candidate for 1.3.1 - correct? > > The patch is also available from my tree at > git://git.openfabrics.org/~okir/ofed_1_3/linux-2.6.git code-drop-20080527 > > Olaf > ------------------- > From: Olaf Kirch > Subject: RDS: Fix a bug in RDMA signalling > > Code inspection revealed a problem in the way we signal RDMA > completions to user space when a connection goes down. > > The send CQ handler calls rds_ib_send_unmap_rm, indicating success > no matter what the WC status code says. This means we may > signal success for RDMAs that get flushed out with WR_FLUSH_ERR. > > This patch fixes the problem by passing the wc.status value to > rds_ib_send_unmap_rm for inspection. > > While I was at it, I moved the code that translates WC status codes > to RDMA notifications from the send CQ handler to rds_ib_send_unmap_rm > where it belongs. > > Signed-off-by: Olaf Kirch > --- > net/rds/ib_send.c | 39 ++++++++++++++++++++------------------- > net/rds/rdma.h | 2 +- > net/rds/send.c | 3 ++- > 3 files changed, 23 insertions(+), 21 deletions(-) > > Index: ofa_kernel-1.3.1/net/rds/ib_send.c > =================================================================== > --- ofa_kernel-1.3.1.orig/net/rds/ib_send.c > +++ ofa_kernel-1.3.1/net/rds/ib_send.c > @@ -41,7 +41,7 @@ > > void rds_ib_send_unmap_rm(struct rds_ib_connection *ic, > struct rds_ib_send_work *send, > - int success) > + int wc_status) > { > struct rds_message *rm = send->s_rm; > > @@ -52,7 +52,9 @@ void rds_ib_send_unmap_rm(struct rds_ib_ > DMA_TO_DEVICE); > > /* raise rdma completion hwm */ > - if (rm->m_rdma_op && success) { > + if (rm->m_rdma_op && wc_status != IB_WC_WR_FLUSH_ERR) { > + int notify_status; > + > /* If the user asked for a completion notification on this > * message, we can implement three different semantics: > * 1. Notify when we received the ACK on the RDS message > @@ -68,7 +70,20 @@ void rds_ib_send_unmap_rm(struct rds_ib_ > * don't call rds_rdma_send_complete at all, and fall back to the notify > * handling in the ACK processing code. 
> */ > - rds_rdma_send_complete(rm); > + switch (wc_status) { > + case IB_WC_SUCCESS: > + notify_status = RDS_RDMA_SUCCESS; > + break; > + > + case IB_WC_REM_ACCESS_ERR: > + notify_status = RDS_RDMA_REMOTE_ERROR; > + break; > + > + default: > + notify_status = RDS_RDMA_OTHER_ERROR; > + break; > + } > + rds_rdma_send_complete(rm, notify_status); > > if (rm->m_rdma_op->r_write) > rds_stats_add(s_send_rdma_bytes, rm->m_rdma_op->r_bytes); > @@ -118,7 +133,7 @@ void rds_ib_send_clear_ring(struct rds_i > if (send->s_wr.opcode == 0xdead) > continue; > if (send->s_rm) > - rds_ib_send_unmap_rm(ic, send, 0); > + rds_ib_send_unmap_rm(ic, send, IB_WC_WR_FLUSH_ERR); > if (send->s_op) > ib_dma_unmap_sg(ic->i_cm_id->device, > send->s_op->r_sg, send->s_op->r_nents, > @@ -174,7 +189,7 @@ void rds_ib_send_cq_comp_handler(struct > switch (send->s_wr.opcode) { > case IB_WR_SEND: > if (send->s_rm) > - rds_ib_send_unmap_rm(ic, send, 1); > + rds_ib_send_unmap_rm(ic, send, wc.status); > break; > case IB_WR_RDMA_WRITE: > if (send->s_op) > @@ -204,20 +219,6 @@ void rds_ib_send_cq_comp_handler(struct > oldest = (oldest + 1) % ic->i_send_ring.w_nr; > } > > - if (unlikely(wc.status != IB_WC_SUCCESS && send->s_op && send->s_op->r_notifier)) { > - switch (wc.status) { > - default: > - send->s_op->r_notifier->n_status = RDS_RDMA_OTHER_ERROR; > - break; > - case IB_WC_REM_ACCESS_ERR: > - send->s_op->r_notifier->n_status = RDS_RDMA_REMOTE_ERROR; > - break; > - case IB_WC_WR_FLUSH_ERR: > - /* flushed out; not an error */ > - break; > - } > - } > - > rds_ib_ring_free(&ic->i_send_ring, completed); > > if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags) > Index: ofa_kernel-1.3.1/net/rds/rdma.h > =================================================================== > --- ofa_kernel-1.3.1.orig/net/rds/rdma.h > +++ ofa_kernel-1.3.1/net/rds/rdma.h > @@ -71,6 +71,6 @@ int rds_cmsg_rdma_args(struct rds_sock * > int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm, > struct cmsghdr *cmsg); > void rds_rdma_free_op(struct rds_rdma_op *ro); > -void rds_rdma_send_complete(struct rds_message *rm); > +void rds_rdma_send_complete(struct rds_message *rm, int); > > #endif > Index: ofa_kernel-1.3.1/net/rds/send.c > =================================================================== > --- ofa_kernel-1.3.1.orig/net/rds/send.c > +++ ofa_kernel-1.3.1/net/rds/send.c > @@ -361,7 +361,7 @@ int rds_send_acked_before(struct rds_con > * the IB send completion on the RDMA op and the accompanying > * message. > */ > -void rds_rdma_send_complete(struct rds_message *rm) > +void rds_rdma_send_complete(struct rds_message *rm, int status) > { > struct rds_sock *rs = NULL; > struct rds_rdma_op *ro; > @@ -376,6 +376,7 @@ void rds_rdma_send_complete(struct rds_m > rs = rm->m_rs; > sock_hold(rds_rs_to_sk(rs)); > > + notifier->n_status = status; > spin_lock(&rs->rs_lock); > list_add_tail(&notifier->n_list, &rs->rs_notify_queue); > spin_unlock(&rs->rs_lock); > Applied to OFED-1.3.1 kernel git tree.
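For context on what the n_status set in rds_rdma_send_complete() eventually feeds: RDS reports RDMA completion status back to user space as a control message on the socket. A minimal receive-side sketch, assuming the RDS userspace definitions (SOL_RDS, RDS_CMSG_RDMA_STATUS and struct rds_rdma_notify from the RDS header; these names are worth checking against the installed rds.h):

	/* Read one RDMA completion notification off an RDS socket.
	 * Assumes <sys/socket.h>, <string.h> and the RDS header. */
	static void drain_rdma_notify(int rds_sock)
	{
		struct rds_rdma_notify notify;
		char cbuf[CMSG_SPACE(sizeof(notify))];
		struct msghdr msg = { .msg_control = cbuf,
				      .msg_controllen = sizeof(cbuf) };
		struct cmsghdr *cmsg;

		if (recvmsg(rds_sock, &msg, 0) < 0)
			return;
		for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
		     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
			if (cmsg->cmsg_level == SOL_RDS &&
			    cmsg->cmsg_type == RDS_CMSG_RDMA_STATUS) {
				memcpy(&notify, CMSG_DATA(cmsg), sizeof(notify));
				/* notify.user_token names the RDMA op;
				 * notify.status is RDS_RDMA_SUCCESS,
				 * RDS_RDMA_REMOTE_ERROR or RDS_RDMA_OTHER_ERROR */
			}
		}
	}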
Regards, Vladimir From monis at Voltaire.COM Tue May 27 07:38:13 2008 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 27 May 2008 17:38:13 +0300 Subject: [ofa-general] Re: [PATCH 2/2] IB/IPoIB: Separate IB events to groups and handle each according to level of severity In-Reply-To: <48357C09.1040302@Voltaire.COM> References: <48302034.8040709@Voltaire.COM> <483022BB.9060004@Voltaire.COM> <48357C09.1040302@Voltaire.COM> Message-ID: <483C1CD5.9040904@Voltaire.COM> Roland, I sent a proposal that tries to answer yours and Hal's comments regarding the second patch. I would appreciate it if you take a look at it and let me know if you think I can go on to produce a "decent" patch. I just want to get more reviews before I make another step. thanks MoniS Moni Shoua wrote: > Hal, Roland > Thanks for the comments. The patch below tries to address the issues that were > raised in its previous form. Please note that I'm only asking for opinion for now. > If the idea is acceptable then I will recreate a more elegant patch with the required > fixes if any and with respect to previous comments (such as replacing 0,1 and 2 with > textual names). > > The idea in a few words is to flush only paths but keeping address handles in ipoib_neigh. > This will trigger a new path lookup when an ARP probe arrives and eventually an address > handle renewal. In the meantime, the old address handle is kept and can be used. In most > cases this address handle is a valid address handle, and when it is not, the situation > is not worse than before. > My tests show that this patch completes the improvement that was achieved with patch #1 > to zero packet loss (tested with ping flood) when an SM change event occurs. > > > thanks > > MoniS > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h > index ca126fc..8ef6573 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > @@ -276,10 +276,11 @@ struct ipoib_dev_priv { > > struct delayed_work pkey_poll_task; > struct delayed_work mcast_task; > - struct work_struct flush_task; > + struct work_struct flush_task0; > + struct work_struct flush_task1; > + struct work_struct flush_task2; > struct work_struct restart_task; > struct delayed_work ah_reap_task; > - struct work_struct pkey_event_task; > > struct ib_device *ca; > u8 port; > @@ -423,11 +424,14 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, > struct ipoib_ah *address, u32 qpn); > void ipoib_reap_ah(struct work_struct *work); > > +void ipoib_flush_paths_only(struct net_device *dev); > void ipoib_flush_paths(struct net_device *dev); > struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); > > int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port); > -void ipoib_ib_dev_flush(struct work_struct *work); > +void ipoib_ib_dev_flush0(struct work_struct *work); > +void ipoib_ib_dev_flush1(struct work_struct *work); > +void ipoib_ib_dev_flush2(struct work_struct *work); > void ipoib_pkey_event(struct work_struct *work); > void ipoib_ib_dev_cleanup(struct net_device *dev); > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > index f429bce..5a6bbe8 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > @@ -898,7 +898,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) > return 0; > } > > -static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > +static void
__ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int level) > { > struct ipoib_dev_priv *cpriv; > struct net_device *dev = priv->dev; > @@ -911,7 +911,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > * the parent is down. > */ > list_for_each_entry(cpriv, &priv->child_intfs, list) > - __ipoib_ib_dev_flush(cpriv, pkey_event); > + __ipoib_ib_dev_flush(cpriv, level); > > mutex_unlock(&priv->vlan_mutex); > > @@ -925,7 +925,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > return; > } > > - if (pkey_event) { > + if (level == 2) { > if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &new_index)) { > clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); > ipoib_ib_dev_down(dev, 0); > @@ -943,11 +943,13 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > priv->pkey_index = new_index; > } > > - ipoib_dbg(priv, "flushing\n"); > - > - ipoib_ib_dev_down(dev, 0); > + ipoib_flush_paths_only(dev); > + ipoib_mcast_dev_flush(dev); > + > + if (level >= 1) > + ipoib_ib_dev_down(dev, 0); > > - if (pkey_event) { > + if (level >= 2) { > ipoib_ib_dev_stop(dev, 0); > ipoib_ib_dev_open(dev); > } > @@ -957,29 +959,36 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, int pkey_event) > * we get here, don't bring it back up if it's not configured up > */ > if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) { > - ipoib_ib_dev_up(dev); > + if (level >= 1) > + ipoib_ib_dev_up(dev); > ipoib_mcast_restart_task(&priv->restart_task); > } > } > > -void ipoib_ib_dev_flush(struct work_struct *work) > +void ipoib_ib_dev_flush0(struct work_struct *work) > { > struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, flush_task); > + container_of(work, struct ipoib_dev_priv, flush_task0); > > - ipoib_dbg(priv, "Flushing %s\n", priv->dev->name); > __ipoib_ib_dev_flush(priv, 0); > } > > -void ipoib_pkey_event(struct work_struct *work) > +void ipoib_ib_dev_flush1(struct work_struct *work) > { > struct ipoib_dev_priv *priv = > - container_of(work, struct ipoib_dev_priv, pkey_event_task); > + container_of(work, struct ipoib_dev_priv, flush_task1); > > - ipoib_dbg(priv, "Flushing %s and restarting its QP\n", priv->dev->name); > __ipoib_ib_dev_flush(priv, 1); > } > > +void ipoib_ib_dev_flush2(struct work_struct *work) > +{ > + struct ipoib_dev_priv *priv = > + container_of(work, struct ipoib_dev_priv, flush_task2); > + > + __ipoib_ib_dev_flush(priv, 2); > +} > + > void ipoib_ib_dev_cleanup(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 2442090..c41798d 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -259,6 +259,21 @@ static int __path_add(struct net_device *dev, struct ipoib_path *path) > return 0; > } > > +static void path_free_only(struct net_device *dev, struct ipoib_path *path) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct ipoib_neigh *neigh, *tn; > + struct sk_buff *skb; > + unsigned long flags; > + > + while ((skb = __skb_dequeue(&path->queue))) > + dev_kfree_skb_irq(skb); > + > + if (path->ah) > + ipoib_put_ah(path->ah); > + > + kfree(path); > +} > static void path_free(struct net_device *dev, struct ipoib_path *path) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > @@ -350,6 +365,34 @@ void ipoib_path_iter_read(struct ipoib_path_iter *iter, > > #endif /* 
CONFIG_INFINIBAND_IPOIB_DEBUG */ > > +void ipoib_flush_paths_only(struct net_device *dev) > +{ > + struct ipoib_dev_priv *priv = netdev_priv(dev); > + struct ipoib_path *path, *tp; > + LIST_HEAD(remove_list); > + > + spin_lock_irq(&priv->tx_lock); > + spin_lock(&priv->lock); > + > + list_splice_init(&priv->path_list, &remove_list); > + > + list_for_each_entry(path, &remove_list, list) > + rb_erase(&path->rb_node, &priv->path_tree); > + > + list_for_each_entry_safe(path, tp, &remove_list, list) { > + if (path->query) > + ib_sa_cancel_query(path->query_id, path->query); > + spin_unlock(&priv->lock); > + spin_unlock_irq(&priv->tx_lock); > + wait_for_completion(&path->done); > + path_free_only(dev, path); > + spin_lock_irq(&priv->tx_lock); > + spin_lock(&priv->lock); > + } > + spin_unlock(&priv->lock); > + spin_unlock_irq(&priv->tx_lock); > +} > + > void ipoib_flush_paths(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > @@ -421,6 +464,8 @@ static void path_rec_completion(int status, > __skb_queue_tail(&skqueue, skb); > > list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { > + if (neigh->ah) > + ipoib_put_ah(neigh->ah); > kref_get(&path->ah->ref); > neigh->ah = path->ah; > memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw, > @@ -989,9 +1034,10 @@ static void ipoib_setup(struct net_device *dev) > INIT_LIST_HEAD(&priv->multicast_list); > > INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll); > - INIT_WORK(&priv->pkey_event_task, ipoib_pkey_event); > INIT_DELAYED_WORK(&priv->mcast_task, ipoib_mcast_join_task); > - INIT_WORK(&priv->flush_task, ipoib_ib_dev_flush); > + INIT_WORK(&priv->flush_task0, ipoib_ib_dev_flush0); > + INIT_WORK(&priv->flush_task1, ipoib_ib_dev_flush1); > + INIT_WORK(&priv->flush_task2, ipoib_ib_dev_flush2); > INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); > INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); > } > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > index 8766d29..80c0409 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c > @@ -289,15 +289,16 @@ void ipoib_event(struct ib_event_handler *handler, > if (record->element.port_num != priv->port) > return; > > - if (record->event == IB_EVENT_PORT_ERR || > - record->event == IB_EVENT_PORT_ACTIVE || > - record->event == IB_EVENT_LID_CHANGE || > - record->event == IB_EVENT_SM_CHANGE || > - record->event == IB_EVENT_CLIENT_REREGISTER) { > - ipoib_dbg(priv, "Port state change event\n"); > - queue_work(ipoib_workqueue, &priv->flush_task); > + ipoib_dbg(priv, "Event %d on device %s port %d\n",record->event, > + record->device->name, record->element.port_num); > + if ( record->event == IB_EVENT_SM_CHANGE || > + record->event == IB_EVENT_CLIENT_REREGISTER) { > + queue_work(ipoib_workqueue, &priv->flush_task0); > + } else if (record->event == IB_EVENT_PORT_ERR || > + record->event == IB_EVENT_PORT_ACTIVE || > + record->event == IB_EVENT_LID_CHANGE) { > + queue_work(ipoib_workqueue, &priv->flush_task1); > } else if (record->event == IB_EVENT_PKEY_CHANGE) { > - ipoib_dbg(priv, "P_Key change event on port:%d\n", priv->port); > - queue_work(ipoib_workqueue, &priv->pkey_event_task); > + queue_work(ipoib_workqueue, &priv->flush_task2); > } > } > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit 
http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Tue May 27 08:07:00 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 27 May 2008 08:07:00 -0700 Subject: [ofa-general] Update on features that are delayed In-Reply-To: <483BCDF9.2080207@voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C9041CE703@mtlexch01.mtl.com> <483BCDF9.2080207@voltaire.com> Message-ID: <000001c8c00b$50df8be0$ebc8180a@amr.corp.intel.com> >Can you please elaborate what is missing in the design/implementation of >the rdma-cm regarding IPv6? The main item that's missing is code in ib_addr to convert the IPv6 address to an IB GID. Once that's available, there may be a couple of code paths where IPv6 checks are needed, but I can't think of anything that would be that hard. - Sean From taylor at hpc.ufl.edu Tue May 27 08:15:14 2008 From: taylor at hpc.ufl.edu (Charles Taylor) Date: Tue, 27 May 2008 11:15:14 -0400 Subject: [ofa-general] OpenSM? In-Reply-To: <48370AEE.7080507@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> Message-ID: <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> We have a 400 node IB cluster. We are running an embedded SM in failover mode on our TS270/Cisco7008 core switches. Lately we have been seeing problems with LID assignment when rebooting nodes (see log messages below). LID assignment is also taking far too long: ports take on the order of minutes to transition to "ACTIVE". This seems like a bug to us, and we are considering switching to OpenSM on a host. I'm wondering about others' experience running OpenSM for medium to large (Fat Tree) clusters, and what resources (memory/cpu) we should plan on for the host node.
Thanks, Charlie Taylor UF HPC Center May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet May 27 14:14:13 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59 May 27 14:14:13 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:256]: An existing IB node GUID 00:02:c9:02:00:21:4b:59 LID 194 was removed May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59 May 27 14:14:14 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: Topology changed May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by discovering removed ports May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet May 27 14:16:28 topspin-270sc ib_sm.x[812]: [ib_sm_discovery.c:1009]: no routing required for port guid 00:02:c9:02:00:21:4b:59, lid 194 May 27 14:16:30 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: Topology changed May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by discovering new ports May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change May 27 14:16:30 topspin-270sc ib_sm.x[812]: [ib_sm_assign.c:588]: Force port to go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:562]: Program port state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor node=00:02:c9:02:00:21:4b:58, port= 1, current state 2 May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:733]: Failed to negotiate MTU, op_vl for node=00:02:c9:02:00:21:4b:58, port= 1, mad status 0x1c May 27 14:18:42 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59 May 27 14:18:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:144]: A new IB node 00:02:c9:02:00:21:4b:59 was discovered and assigned LID 0 May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet May 27 14:18:46 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change May 27 14:18:46 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by previous GET/SET operation failures May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:545]: Reassigning LID, node - GUID=00:02:c9:02:00:21:4b:58, port=1, new LID=411, curr LID=0 May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:588]: Force port to go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:635]: Clean up SA resources for port forced down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_assign.c:667]: cleaning DB for guid 00:02:c9:02:00:21:4b:59, lid 194 May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: _ib_smAllocSubnet: initRate= 4 
May 27 14:18:47 topspin-270sc last message repeated 23 times May 27 14:18:47 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity links detected in the network May 27 14:21:01 topspin-270sc ib_sm.x[820]: [ib_sm_bringup.c:516]: Active port(s) now in INIT state node=00:02:c9:02:00:21:4b:58, port=16, state=2, neighbor node=00:02:c9:02:00:21:4b:58, port=1, state=2 May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: Rediscover the subnet May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:525]: IB node 00:06:6a:00:d9:00:04:5d port 16 is INIT state May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by some ports in INIT state May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by previous GET/SET operation failures May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: _ib_smAllocSubnet: initRate= 4 May 27 14:21:05 topspin-270sc last message repeated 23 times May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity links detected in the network May 27 14:23:19 topspin-270sc ib_sm.x[817]: [ib_sm_bringup.c:562]: Program port state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor node=00:02:c9:02:00:21:4b:58, port= 1, current state 2 May 27 14:23:24 topspin-270sc ib_sm.x[823]: [INFO]: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59 May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:23:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change May 27 14:23:26 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change May 27 14:23:33 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid 00:05:ad:00:00:02:3c:60, is no longer synchronized with Master SM May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Initialize a backup session with Standby SM guid 00:05:ad:00:00:02:3c:60 May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: async events require sweep May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid 00:05:ad:00:00:02:3c:60, started synchronizing with Master SM May 27 14:25:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change May 27 14:25:42 topspin-270sc ib_sm.x[803]: [INFO]: Configuration caused by multicast membership change May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB synchronized with Standby SM guid 00:05:ad:00:00:02:3c:60 May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB synchronized with all designated backup SMs May 27 14:28:04 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: ********************** NEW SWEEP ******************** May 27 14:28:06 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No topology change On May 23, 2008, at 2:20 PM, Steve Wise wrote: > Or Gerlitz wrote: >> Steve Wise wrote: >>> Are we sure we need to expose this to the user? 
>> I believe this is the way to go if we want to let smart ULPs >> generate new rkey/stag per mapping. Simpler ULPs could then just >> put the same value for each map associated with the same mr. >> >> Or. >> > > How should I add this to the API? > > Perhaps we just document the format of an rkey in the struct ib_mr. > Thus the app would do this to change the key before posting the > fast_reg_mr wr (coded to be explicit, not efficient): > > u8 newkey; > u32 newrkey; > > newkey = 0xaa; > newrkey = (mr->rkey & 0xffffff00) | newkey; > mr->rkey = newrkey; > wr.wr.fast_reg.mr = mr; > ... > > > Note, this assumes mr->rkey is in host byte order (I think the linux > rdma code assumes this in other places too). > > > Steve. -------------- next part -------------- An HTML attachment was scrubbed... URL: From xma at us.ibm.com Tue May 27 08:16:09 2008 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 27 May 2008 08:16:09 -0700 Subject: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode In-Reply-To: <1211879148.13769.94.camel@mtls03> Message-ID: Hello Eli, >So what happens is that the application *thinks* it posted a certain number of > packets for transmission but these packets are flushed and do not really get > transmitted. In this case, how many TX dropped packets does ifconfig report? Should the ifconfig TX dropped count plus the TX successfully transmitted count come close to the number of packets netperf sent? Any TCP STREAM test results to share here? thanks Shirley -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom at opengridcomputing.com Tue May 27 08:33:37 2008 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 27 May 2008 10:33:37 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> Message-ID: <1211902417.4114.73.camel@trinity.ogc.int> On Mon, 2008-05-26 at 16:02 -0700, Roland Dreier wrote: > > The "invalidate local stag" part of a read is just a local sink side > > operation (ie no wire protocol change from a read). It's not like > > processing an ingress send-with-inv. It is really functionally like a > > read followed immediately by a fenced invalidate-local, but it doesn't > > stall the pipe. So the device has to remember the read is a "with inv > > local stag" and invalidate the stag after the read response is placed > > and before the WCE is reaped by the application. > > Yes, understood. My point was just that in IB, at least in theory, one > could just use an L_Key that doesn't have any remote permissions in the > scatter list of an RDMA read, while in iWARP, the STag used to place an > RDMA read response has to have remote write permission. So RDMA read > with invalidate makes sense for iWARP, because it gives a race-free way > to allow an STag to be invalidated immediately after an RDMA read > response is placed, while in IB it's simpler just to never give remote > access at all. > So I think from an NFSRDMA coding perspective it's a wash... When creating the local data sink, we need to check the transport type: if it's IB --> only local access; if it's iWARP --> local + remote access. When posting the WR, we check the fastreg capabilities bit + transport type bit:

If fastreg is true -->
    Post FastReg
    If iWARP (or with a cap bit read-with-inv-flag)
        post rdma read w/ invalidate
    else /* IB */
        post rdma read
        post invalidate
    fi
else
    ... today's logic
fi
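As a rough C sketch of the dispatch above (IB_DEVICE_MEM_MGT_EXTENSIONS is the capability bit from this series and rdma_node_get_transport() is the existing verbs helper, but post_fastreg(), post_read_inv(), post_read() and post_local_inv() are hypothetical stand-ins for whatever work-request constructors the final API provides):

    /* Sketch: per-transport read/invalidate sequence for the data sink.
     * dev_attr comes from ib_query_device(); the post_*() helpers are
     * hypothetical stand-ins, not a real API. */
    if (dev_attr.device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) {
            post_fastreg(qp, mr);                   /* map the sink buffer */
            if (rdma_node_get_transport(qp->device->node_type) ==
                RDMA_TRANSPORT_IWARP)               /* or a read-w-inv cap bit */
                    post_read_inv(qp, mr);          /* read + local invalidate */
            else {                                  /* IB */
                    post_read(qp, mr);
                    post_local_inv(qp, mr);         /* separate invalidate WR */
            }
    } else {
            /* today's logic: conventional registration path */
    }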
I make the observation, however, that the transport type is now overloaded with a set of required verbs. For iWARP's case, this means rdma-read-w-inv, plus rdma-send-w-inv, etc... This also means that new transport types will inherit one or the other set of verbs (IB or iWARP). Tom > - R. From yevgenyp at mellanox.co.il Tue May 27 08:31:50 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Tue, 27 May 2008 18:31:50 +0300 Subject: [ofa-general][PATCH v2 1/2]mlx4: Multiple completion vectors support Message-ID: <483C2966.1080606@mellanox.co.il> >From 90b97243dc9429eec9d0ed2188c3e16ab04e5432 Mon Sep 17 00:00:00 2001 From: Yevgeny Petrilin Date: Tue, 27 May 2008 17:37:01 +0300 Subject: [PATCH] mlx4: Multiple completion vectors support The driver now creates a completion EQ for every CPU. When allocating a CQ, a ULP passes the number of the completion vector it wants the CQ attached to. The number of completion vectors is reported via ib_device.num_comp_vectors. Signed-off-by: Yevgeny Petrilin --- Changes since V1: Created the patch against the latest tree drivers/infiniband/hw/mlx4/cq.c | 2 +- drivers/infiniband/hw/mlx4/main.c | 2 +- drivers/net/mlx4/cq.c | 14 ++++++++-- drivers/net/mlx4/eq.c | 47 ++++++++++++++++++++++++------------ drivers/net/mlx4/main.c | 14 ++++++---- drivers/net/mlx4/mlx4.h | 4 +- include/linux/mlx4/device.h | 4 ++- 7 files changed, 57 insertions(+), 30 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 4521319..3519f92 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -221,7 +221,7 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector } err = mlx4_cq_alloc(dev->dev, entries, &cq->buf.mtt, uar, - cq->db.dma, &cq->mcq, 0); + cq->db.dma, &cq->mcq, vector, 0); if (err) goto err_dbmap; diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 4d61e32..60dc700 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -563,7 +563,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) ibdev->ib_dev.owner = THIS_MODULE; ibdev->ib_dev.node_type = RDMA_NODE_IB_CA; ibdev->ib_dev.phys_port_cnt = dev->caps.num_ports; - ibdev->ib_dev.num_comp_vectors = 1; + ibdev->ib_dev.num_comp_vectors = dev->caps.num_comp_vectors; ibdev->ib_dev.dma_device = &dev->pdev->dev; ibdev->ib_dev.uverbs_abi_ver = MLX4_IB_UVERBS_ABI_VERSION; diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c index 95e87a2..9be895f 100644 --- a/drivers/net/mlx4/cq.c +++ b/drivers/net/mlx4/cq.c @@ -189,7 +189,7 @@ EXPORT_SYMBOL_GPL(mlx4_cq_resize); int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, - int collapsed) + unsigned vector, int collapsed) { struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_cq_table *cq_table = &priv->cq_table; @@ -227,7 +227,15 @@ int mlx4_cq_alloc(struct
mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, cq_context->flags = cpu_to_be32(!!collapsed << 18); cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index); - cq_context->comp_eqn = priv->eq_table.eq[MLX4_EQ_COMP].eqn; + + if (vector >= dev->caps.num_comp_vectors) { + err = -EINVAL; + goto err_radix; + } + + cq->comp_eq_idx = MLX4_EQ_COMP_CPU0 + vector; + cq_context->comp_eqn = priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + + vector].eqn; cq_context->log_page_size = mtt->page_shift - MLX4_ICM_PAGE_SHIFT; mtt_addr = mlx4_mtt_addr(dev, mtt); @@ -276,7 +284,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) if (err) mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn); - synchronize_irq(priv->eq_table.eq[MLX4_EQ_COMP].irq); + synchronize_irq(priv->eq_table.eq[cq->comp_eq_idx].irq); spin_lock_irq(&cq_table->lock); radix_tree_delete(&cq_table->tree, cq->cqn); diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c index e141a15..825e90c 100644 --- a/drivers/net/mlx4/eq.c +++ b/drivers/net/mlx4/eq.c @@ -265,7 +265,7 @@ static irqreturn_t mlx4_interrupt(int irq, void *dev_ptr) writel(priv->eq_table.clr_mask, priv->eq_table.clr_int); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]); return IRQ_RETVAL(work); @@ -482,7 +482,7 @@ static void mlx4_free_irqs(struct mlx4_dev *dev) if (eq_table->have_irq) free_irq(dev->pdev->irq, dev); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) if (eq_table->eq[i].have_irq) free_irq(eq_table->eq[i].irq, eq_table->eq + i); } @@ -553,6 +553,7 @@ void mlx4_unmap_eq_icm(struct mlx4_dev *dev) int mlx4_init_eq_table(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); + int req_eqs; int err; int i; @@ -573,11 +574,21 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) priv->eq_table.clr_int = priv->clr_base + (priv->eq_table.inta_pin < 32 ? 4 : 0); - err = mlx4_create_eq(dev, dev->caps.num_cqs + MLX4_NUM_SPARE_EQE, - (dev->flags & MLX4_FLAG_MSI_X) ? MLX4_EQ_COMP : 0, - &priv->eq_table.eq[MLX4_EQ_COMP]); - if (err) - goto err_out_unmap; + dev->caps.num_comp_vectors = 0; + req_eqs = (dev->flags & MLX4_FLAG_MSI_X) ? num_online_cpus() : 1; + while (req_eqs) { + err = mlx4_create_eq( + dev, dev->caps.num_cqs + MLX4_NUM_SPARE_EQE, + (dev->flags & MLX4_FLAG_MSI_X) ? + (MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors) : 0, + &priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + + dev->caps.num_comp_vectors]); + if (err) + goto err_out_comp; + + dev->caps.num_comp_vectors++; + req_eqs--; + } err = mlx4_create_eq(dev, MLX4_NUM_ASYNC_EQE + MLX4_NUM_SPARE_EQE, (dev->flags & MLX4_FLAG_MSI_X) ? 
MLX4_EQ_ASYNC : 0, @@ -586,12 +597,16 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) goto err_out_comp; if (dev->flags & MLX4_FLAG_MSI_X) { - static const char *eq_name[] = { - [MLX4_EQ_COMP] = DRV_NAME " (comp)", - [MLX4_EQ_ASYNC] = DRV_NAME " (async)" - }; + static char eq_name[MLX4_NUM_EQ][20]; + + for (i = 0; i < MLX4_EQ_COMP_CPU0 + + dev->caps.num_comp_vectors; ++i) { + if (i == 0) + snprintf(eq_name[0], 20, DRV_NAME "(async)"); + else + snprintf(eq_name[i], 20, "comp_" DRV_NAME "%d", + i - 1); - for (i = 0; i < MLX4_NUM_EQ; ++i) { err = request_irq(priv->eq_table.eq[i].irq, mlx4_msi_x_interrupt, 0, eq_name[i], priv->eq_table.eq + i); @@ -616,7 +631,7 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) mlx4_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", priv->eq_table.eq[MLX4_EQ_ASYNC].eqn, err); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) eq_set_ci(&priv->eq_table.eq[i], 1); return 0; @@ -625,9 +640,9 @@ err_out_async: mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_ASYNC]); err_out_comp: - mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_COMP]); + for (i = 0; i < dev->caps.num_comp_vectors; ++i) + mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i]); -err_out_unmap: mlx4_unmap_clr_int(dev); mlx4_free_irqs(dev); @@ -646,7 +661,7 @@ void mlx4_cleanup_eq_table(struct mlx4_dev *dev) mlx4_free_irqs(dev); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) mlx4_free_eq(dev, &priv->eq_table.eq[i]); mlx4_unmap_clr_int(dev); diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index a6aa49f..8634b52 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -692,22 +692,24 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); struct msix_entry entries[MLX4_NUM_EQ]; + int needed_vectors = MLX4_EQ_COMP_CPU0 + num_online_cpus(); int err; int i; if (msi_x) { - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) entries[i].entry = i; - err = pci_enable_msix(dev->pdev, entries, ARRAY_SIZE(entries)); + err = pci_enable_msix(dev->pdev, entries, needed_vectors); if (err) { if (err > 0) - mlx4_info(dev, "Only %d MSI-X vectors available, " - "not using MSI-X\n", err); + mlx4_info(dev, "Only %d MSI-X vectors " + "available, need %d. 
Not using MSI-X\n", + err, needed_vectors); goto no_msi; } - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) priv->eq_table.eq[i].irq = entries[i].vector; dev->flags |= MLX4_FLAG_MSI_X; @@ -715,7 +717,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) } no_msi: - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) priv->eq_table.eq[i].irq = dev->pdev->irq; } diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index a4023c2..8e5fbe0 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -64,8 +64,8 @@ enum { enum { MLX4_EQ_ASYNC, - MLX4_EQ_COMP, - MLX4_NUM_EQ + MLX4_EQ_COMP_CPU0, + MLX4_NUM_EQ = MLX4_EQ_COMP_CPU0 + NR_CPUS }; enum { diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index a744383..accc1ee 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -168,6 +168,7 @@ struct mlx4_caps { int reserved_cqs; int num_eqs; int reserved_eqs; + int num_comp_vectors; int num_mpts; int num_mtt_segs; int fmr_reserved_mtts; @@ -279,6 +280,7 @@ struct mlx4_cq { int arm_sn; int cqn; + int comp_eq_idx; atomic_t refcount; struct completion free; @@ -383,7 +385,7 @@ void mlx4_free_hwq_res(struct mlx4_dev *mdev, struct mlx4_hwq_resources *wqres, int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, - int collapsed); + unsigned vector, int collapsed); void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq); int mlx4_qp_alloc(struct mlx4_dev *dev, int sqpn, struct mlx4_qp *qp); -- 1.5.3.7
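On the consumer side, the visible effect of the patch above is that the comp_vector argument to ib_create_cq() now selects a per-CPU EQ. A ULP could spread its CQs across the reported vectors roughly like this (an illustrative sketch, not code from the series; my_comp_handler, my_event_handler and my_context are the ULP's own):

    /* Sketch: put the i-th CQ on the i-th completion vector, wrapping
     * around at ib_device.num_comp_vectors. */
    int vector = i % device->num_comp_vectors;
    struct ib_cq *cq = ib_create_cq(device, my_comp_handler, my_event_handler,
                                    my_context, nent, vector);
    if (IS_ERR(cq))
            return PTR_ERR(cq);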
From yevgenyp at mellanox.co.il Tue May 27 08:35:00 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Tue, 27 May 2008 18:35:00 +0300 Subject: [ofa-general][PATCH v2 2/2] mlx4: Default value for automatic completion vector selection Message-ID: <483C2A24.90903@mellanox.co.il> >From 71b7fbb81b46f992986b2b278eea7c61d7e0372a Mon Sep 17 00:00:00 2001 From: Yevgeny Petrilin Date: Tue, 27 May 2008 18:15:36 +0300 Subject: [PATCH] mlx4: Default value for automatic completion vector selection When the vector number passed to mlx4_cq_alloc is IB_CQ_VECTOR_LEAST_ATTACHED (0xff), the driver selects the completion vector that has the fewest CQs attached to it and attaches the CQ to the chosen vector. IB_CQ_VECTOR_LEAST_ATTACHED is redefined in device.h as MLX4_ANY_VECTOR because we don't want all mlx4_core clients (Ethernet and FCoE) to include rdma/ib_verbs.h. Signed-off-by: Yevgeny Petrilin --- Changes since V1: 1. Added IB_CQ_VECTOR_LEAST_ATTACHED to rdma/ib_verbs.h 2. Set MLX4_ANY_VECTOR to IB_CQ_VECTOR_LEAST_ATTACHED drivers/net/mlx4/cq.c | 22 +++++++++++++++++++++- drivers/net/mlx4/mlx4.h | 1 + include/linux/mlx4/device.h | 4 ++++ include/rdma/ib_verbs.h | 10 +++++++++- 4 files changed, 35 insertions(+), 2 deletions(-) diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c index 9be895f..7f0bdf6 100644 --- a/drivers/net/mlx4/cq.c +++ b/drivers/net/mlx4/cq.c @@ -187,6 +187,22 @@ int mlx4_cq_resize(struct mlx4_dev *dev, struct mlx4_cq *cq, } EXPORT_SYMBOL_GPL(mlx4_cq_resize); +static int mlx4_find_least_loaded_vector(struct mlx4_priv *priv) +{ + int i; + int index = 0; + int min = priv->eq_table.eq[MLX4_EQ_COMP_CPU0].load; + + for (i = 1; i < priv->dev.caps.num_comp_vectors; i++) { + if (priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i].load < min) { + index = i; + min = priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i].load; + } + } + + return index; +} + int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, unsigned vector, int collapsed) @@ -228,7 +244,9 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, cq_context->flags = cpu_to_be32(!!collapsed << 18); cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index); - if (vector >= dev->caps.num_comp_vectors) { + if (vector == MLX4_ANY_VECTOR) + vector = mlx4_find_least_loaded_vector(priv); + else if (vector >= dev->caps.num_comp_vectors) { err = -EINVAL; goto err_radix; } @@ -248,6 +266,7 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, if (err) goto err_radix; + priv->eq_table.eq[cq->comp_eq_idx].load++; cq->cons_index = 0; cq->arm_sn = 1; cq->uar = uar; @@ -285,6 +304,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn); synchronize_irq(priv->eq_table.eq[cq->comp_eq_idx].irq); + priv->eq_table.eq[cq->comp_eq_idx].load--; spin_lock_irq(&cq_table->lock); radix_tree_delete(&cq_table->tree, cq->cqn); diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 8e5fbe0..df16f05 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -143,6 +143,7 @@ struct mlx4_eq { u16 irq; u16 have_irq; int nent; + int load; struct mlx4_buf_list *page_list; struct mlx4_mtt mtt; }; diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index accc1ee..fd93546 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -37,6 +37,8 @@ #include #include +#include + +#include + #include enum { @@ -133,6 +135,8 @@ enum { MLX4_STAT_RATE_OFFSET = 5 }; +#define MLX4_ANY_VECTOR IB_CQ_VECTOR_LEAST_ATTACHED + static inline u64 mlx4_fw_ver(u64 major, u64 minor, u64 subminor) { return (major << 32) | (minor << 16) | subminor; } diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..2462d83 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1364,6 +1364,13 @@ static inline int ib_post_recv(struct ib_qp *qp, return qp->device->post_recv(qp, recv_wr, bad_recv_wr); } +/* + * IB_CQ_VECTOR_LEAST_ATTACHED: The constant specifies that + * the cq will be attached to the least attached + * completion vector + */ +#define IB_CQ_VECTOR_LEAST_ATTACHED 0xff + /** * ib_create_cq - Creates a CQ on the specified device. * @device: The device on which to create the CQ. * @comp_handler: A user-specified callback that is invoked when a * completion event occurs on the CQ. * @event_handler: A user-specified callback that is invoked when an * asynchronous event not associated with a completion occurs on the CQ. * @cq_context: Context associated with the CQ returned to the user via * the associated completion and event handlers. * @cqe: The minimum size of the CQ.
* @comp_vector - Completion vector used to signal completion events. - * Must be >= 0 and < context->num_comp_vectors. + * Must be >= 0 and < context->num_comp_vectors + * or IB_CQ_VECTOR_LEAST_ATTACHED. * * Users can examine the cq structure to determine the actual CQ size. */ -- 1.5.3.7 From ogerlitz at voltaire.com Tue May 27 08:50:21 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 27 May 2008 18:50:21 +0300 (IDT) Subject: [ofa-general] [RFC PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks Message-ID: The rdma-cm has some logic in place to make sure that callbacks on an ID are delivered to the consumer in serialized manner, specifically it has code to protect against the device removal racing with a callback now being delivered to the user. This patch simplifies this logic by using a mutex per ID instead of the wait queue and atomic variable. I have left the disable/enable_remove notation such that the patch would be easier to read, but if this approach is accepted, I think we want to change it to disable/enable_callback Signed-off-by: Or Gerlitz Index: linux-2.6.26-rc3/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc3.orig/drivers/infiniband/core/cma.c 2008-05-26 15:11:17.000000000 +0300 +++ linux-2.6.26-rc3/drivers/infiniband/core/cma.c 2008-05-26 15:22:11.000000000 +0300 @@ -126,8 +126,7 @@ struct rdma_id_private { struct completion comp; atomic_t refcount; - wait_queue_head_t wait_remove; - atomic_t dev_remove; + struct mutex handler_mutex; int backlog; int timeout_ms; @@ -359,7 +358,7 @@ static int cma_disable_remove(struct rdm spin_lock_irqsave(&id_priv->lock, flags); if (id_priv->state == state) { - atomic_inc(&id_priv->dev_remove); + mutex_lock(&id_priv->handler_mutex); ret = 0; } else ret = -EINVAL; @@ -369,8 +368,7 @@ static int cma_disable_remove(struct rdm static void cma_enable_remove(struct rdma_id_private *id_priv) { - if (atomic_dec_and_test(&id_priv->dev_remove)) - wake_up(&id_priv->wait_remove); + mutex_unlock(&id_priv->handler_mutex); } static int cma_has_cm_dev(struct rdma_id_private *id_priv) @@ -395,8 +393,7 @@ struct rdma_cm_id *rdma_create_id(rdma_c mutex_init(&id_priv->qp_mutex); init_completion(&id_priv->comp); atomic_set(&id_priv->refcount, 1); - init_waitqueue_head(&id_priv->wait_remove); - atomic_set(&id_priv->dev_remove, 0); + mutex_init(&id_priv->handler_mutex); INIT_LIST_HEAD(&id_priv->listen_list); INIT_LIST_HEAD(&id_priv->mc_list); get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); @@ -1118,7 +1115,7 @@ static int cma_req_handler(struct ib_cm_ goto out; } - atomic_inc(&conn_id->dev_remove); + mutex_lock(&conn_id->handler_mutex); mutex_lock(&lock); ret = cma_acquire_dev(conn_id); mutex_unlock(&lock); @@ -1296,7 +1293,7 @@ static int iw_conn_req_handler(struct iw goto out; } conn_id = container_of(new_cm_id, struct rdma_id_private, id); - atomic_inc(&conn_id->dev_remove); + mutex_lock(&conn_id->handler_mutex); conn_id->state = CMA_CONNECT; dev = ip_dev_find(&init_net, iw_event->local_addr.sin_addr.s_addr); @@ -1588,7 +1585,7 @@ static void cma_work_handler(struct work struct rdma_id_private *id_priv = work->id; int destroy = 0; - atomic_inc(&id_priv->dev_remove); + mutex_lock(&id_priv->handler_mutex); if (!cma_comp_exch(id_priv, work->old_state, work->new_state)) goto out; @@ -1760,7 +1757,7 @@ static void addr_handler(int status, str struct rdma_cm_event event; memset(&event, 0, sizeof event); - atomic_inc(&id_priv->dev_remove); + 
mutex_lock(&id_priv->handler_mutex); /* * Grab mutex to block rdma_destroy_id() from removing the device while * @@ -2756,22 +2753,26 @@ static int cma_remove_id_dev(struct rdma { struct rdma_cm_event event; enum cma_state state; - + int ret = 0; + /* Record that we want to remove the device */ state = cma_exch(id_priv, CMA_DEVICE_REMOVAL); if (state == CMA_DESTROYING) return 0; cma_cancel_operation(id_priv, state); - wait_event(id_priv->wait_remove, !atomic_read(&id_priv->dev_remove)); + mutex_lock(&id_priv->handler_mutex); /* Check for destruction from another callback. */ if (!cma_comp(id_priv, CMA_DEVICE_REMOVAL)) - return 0; + goto out; memset(&event, 0, sizeof event); event.event = RDMA_CM_EVENT_DEVICE_REMOVAL; - return id_priv->id.event_handler(&id_priv->id, &event); + ret = id_priv->id.event_handler(&id_priv->id, &event); +out: + mutex_unlock(&id_priv->handler_mutex); + return ret; } static void cma_process_remove(struct cma_device *cma_dev) From ogerlitz at voltaire.com Tue May 27 08:51:34 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 27 May 2008 18:51:34 +0300 (IDT) Subject: [ofa-general] [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification In-Reply-To: References: Message-ID: The RDMA_CM_EVENT_NETDEV_CHANGE event can be used by rdma-cm consumers that wish to have their RDMA sessions always use the same links as the IP stack does. In the current code, this does not happen when bonding is used and fail-over has happened, but the IB link used by an already existing session is operating fine. Use netevent notification for sensing that a change has happened in the IP stack, then scan the rdma-cm IDs list to see if there is an ID that is "misaligned" in that respect with the IP stack, and deliver RDMA_CM_EVENT_NETDEV_CHANGE for this ID. The user can act on the event or just ignore it. Signed-off-by: Or Gerlitz This patch should be applied on top of the previous patch ("simplify locking needed for serialization of callbacks") and the first two patches of the series I have posted, which remained unchanged at this point: [RFC v2 PATCH 1/5] net/bonding: announce fail-over for the active-backup mode http://lists.openfabrics.org/pipermail/general/2008-May/050285.html [RFC v2 PATCH 2/5] rdma/addr: keep the name of the netdevice in struct rdma_dev_addr http://lists.openfabrics.org/pipermail/general/2008-May/050286.html Main changes from v2: - took the approach of unconditionally notifying the user - use the handler_mutex of the ID to serialize with other callbacks As for the locking issues, I still have the double loop in cma_netdev_callback() being wrapped with the rdma-cm global mutex taken. The loop on devices has to be under this lock because the device removal code in cma_remove_one() removes the device from the global linked list of devices this code loops on. The loop on IDs has to be under this lock because the device removal code in cma_process_remove() removes IDs from the device ID list this code loops on.
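A sketch of what acting on the new event could look like in a consumer's handler (illustrative only, not part of the patch; ulp_schedule_reconnect() is a hypothetical ULP-specific helper). Returning non-zero would, as with other events, let the rdma-cm destroy the ID:

    static int ulp_cma_handler(struct rdma_cm_id *id, struct rdma_cm_event *ev)
    {
            switch (ev->event) {
            case RDMA_CM_EVENT_NETDEV_CHANGE:
                    /* the IP stack failed over to another link: tear down
                     * and reconnect so address resolution is redone */
                    ulp_schedule_reconnect(id->context);
                    return 0;
            default:
                    return 0;
            }
    }

The patch itself follows.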
Index: linux-2.6.26-rc3/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc3.orig/drivers/infiniband/core/cma.c 2008-05-27 13:46:48.000000000 +0300 +++ linux-2.6.26-rc3/drivers/infiniband/core/cma.c 2008-05-27 13:46:58.000000000 +0300 @@ -164,6 +164,12 @@ struct cma_work { struct rdma_cm_event event; }; +struct cma_ndev_work { + struct work_struct work; + struct rdma_id_private *id; + struct rdma_cm_event event; +}; + union cma_ip_addr { struct in6_addr ip6; struct { @@ -1601,6 +1607,26 @@ out: kfree(work); } +static void cma_ndev_work_handler(struct work_struct *_work) +{ + struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, work); + struct rdma_id_private *id_priv = work->id; + int destroy = 0; + + mutex_lock(&id_priv->handler_mutex); + + if (id_priv->id.event_handler(&id_priv->id, &work->event)) { + cma_exch(id_priv, CMA_DESTROYING); + destroy = 1; + } + + cma_enable_remove(id_priv); + cma_deref_id(id_priv); + if (destroy) + rdma_destroy_id(&id_priv->id); + kfree(work); +} + static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) { struct rdma_route *route = &id_priv->id.route; @@ -2726,6 +2752,61 @@ void rdma_leave_multicast(struct rdma_cm } EXPORT_SYMBOL(rdma_leave_multicast); +static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private *id_priv) +{ + struct rdma_dev_addr *dev_addr; + struct cma_ndev_work *work; + + dev_addr = &id_priv->id.route.addr.dev_addr; + + if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && + memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { + printk(KERN_ERR "addr change for device %s used by id %p, notifying\n", + ndev->name, &id_priv->id); + work = kzalloc(sizeof *work, GFP_ATOMIC); + if (!work) + return -ENOMEM; + INIT_WORK(&work->work, cma_ndev_work_handler); + work->id = id_priv; + work->event.event = RDMA_CM_EVENT_NETDEV_CHANGE; + atomic_inc(&id_priv->refcount); + queue_work(cma_wq, &work->work); + } +} + +static int cma_netdev_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + struct net_device *ndev = (struct net_device *)ctx; + struct cma_device *cma_dev; + struct rdma_id_private *id_priv; + int ret = NOTIFY_DONE; + + if (dev_net(ndev) != &init_net) + return NOTIFY_DONE; + + if (event != NETDEV_BONDING_FAILOVER) + return NOTIFY_DONE; + + if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) + return NOTIFY_DONE; + + mutex_lock(&lock); + list_for_each_entry(cma_dev, &dev_list, list) + list_for_each_entry(id_priv, &cma_dev->id_list, list) { + ret = cma_netdev_align_id(ndev, id_priv); + if (ret) + break; + } + mutex_unlock(&lock); + + return ret; +} + +static struct notifier_block cma_nb = { + .notifier_call = cma_netdev_callback +}; + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; @@ -2834,6 +2915,7 @@ static int cma_init(void) ib_sa_register_client(&sa_client); rdma_addr_register_client(&addr_client); + register_netdevice_notifier(&cma_nb); ret = ib_register_client(&cma_client); if (ret) @@ -2841,6 +2923,7 @@ static int cma_init(void) return 0; err: + unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); @@ -2850,6 +2933,7 @@ err: static void cma_cleanup(void) { ib_unregister_client(&cma_client); + unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); 
destroy_workqueue(cma_wq); Index: linux-2.6.26-rc3/include/rdma/rdma_cm.h =================================================================== --- linux-2.6.26-rc3.orig/include/rdma/rdma_cm.h 2008-05-27 13:44:53.000000000 +0300 +++ linux-2.6.26-rc3/include/rdma/rdma_cm.h 2008-05-27 13:46:58.000000000 +0300 @@ -53,7 +53,8 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_DISCONNECTED, RDMA_CM_EVENT_DEVICE_REMOVAL, RDMA_CM_EVENT_MULTICAST_JOIN, - RDMA_CM_EVENT_MULTICAST_ERROR + RDMA_CM_EVENT_MULTICAST_ERROR, + RDMA_CM_EVENT_NETDEV_CHANGE }; enum rdma_port_space { From sean.hefty at intel.com Tue May 27 08:54:23 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 27 May 2008 08:54:23 -0700 Subject: [ofa-general] RE: [RFC PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks In-Reply-To: References: Message-ID: <000301c8c011$ef202f20$ebc8180a@amr.corp.intel.com> >@@ -359,7 +358,7 @@ static int cma_disable_remove(struct rdm > > spin_lock_irqsave(&id_priv->lock, flags); > if (id_priv->state == state) { >- atomic_inc(&id_priv->dev_remove); >+ mutex_lock(&id_priv->handler_mutex); This just tried to acquire a mutex while holding a spinlock. - Sean From sean.hefty at intel.com Tue May 27 09:19:37 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 27 May 2008 09:19:37 -0700 Subject: [ofa-general] RE: [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification In-Reply-To: References: Message-ID: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> >+static void cma_ndev_work_handler(struct work_struct *_work) >+{ >+ struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, >work); >+ struct rdma_id_private *id_priv = work->id; >+ int destroy = 0; >+ >+ mutex_lock(&id_priv->handler_mutex); >+ >+ if (id_priv->id.event_handler(&id_priv->id, &work->event)) { How do we know that the user hasn't tried to destroy the id from another callback? We need some sort of state check here. >+ cma_exch(id_priv, CMA_DESTROYING); >+ destroy = 1; >+ } >+ >+ cma_enable_remove(id_priv); I didn't see the matching cma_disable_remove() call. >+ cma_deref_id(id_priv); >+ if (destroy) >+ rdma_destroy_id(&id_priv->id); >+ kfree(work); >+} >+ > static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int >timeout_ms) > { > struct rdma_route *route = &id_priv->id.route; >@@ -2726,6 +2752,61 @@ void rdma_leave_multicast(struct rdma_cm > } > EXPORT_SYMBOL(rdma_leave_multicast); > >+static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private >*id_priv) >+{ nit - function name isn't clear to me. Maybe something like cma_netdev_change_handler()? Although I'm not sure that netdev change is what the user is really interested in. What they really want to know is if IP address mapping/resolution changed. netdev is hidden from the user. >+ struct rdma_dev_addr *dev_addr; >+ struct cma_ndev_work *work; >+ >+ dev_addr = &id_priv->id.route.addr.dev_addr; >+ >+ if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && >+ memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { >+ printk(KERN_ERR "addr change for device %s used by id %p, >notifying\n", >+ ndev->name, &id_priv->id); >+ work = kzalloc(sizeof *work, GFP_ATOMIC); >+ if (!work) >+ return -ENOMEM; >+ INIT_WORK(&work->work, cma_ndev_work_handler); >+ work->id = id_priv; >+ work->event.event = RDMA_CM_EVENT_NETDEV_CHANGE; Maybe call this RDMA_CM_EVENT_ADDR_CHANGE? 
>+ atomic_inc(&id_priv->refcount); >+ queue_work(cma_wq, &work->work); >+ } >+} >+ >+static int cma_netdev_callback(struct notifier_block *self, unsigned long >event, >+ void *ctx) >+{ >+ struct net_device *ndev = (struct net_device *)ctx; >+ struct cma_device *cma_dev; >+ struct rdma_id_private *id_priv; >+ int ret = NOTIFY_DONE; >+ >+ if (dev_net(ndev) != &init_net) >+ return NOTIFY_DONE; >+ >+ if (event != NETDEV_BONDING_FAILOVER) >+ return NOTIFY_DONE; >+ >+ if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) >+ return NOTIFY_DONE; >+ >+ mutex_lock(&lock); >+ list_for_each_entry(cma_dev, &dev_list, list) It seems like we just need to find the cma_dev that has the current mapping - Sean From ogerlitz at voltaire.com Tue May 27 09:37:09 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 27 May 2008 19:37:09 +0300 Subject: [ofa-general] Re: [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification In-Reply-To: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> References: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> Message-ID: <483C38B5.6080706@voltaire.com> Sean Hefty wrote: >> +static void cma_ndev_work_handler(struct work_struct *_work) >> +{ >> + struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, >> work); >> + struct rdma_id_private *id_priv = work->id; >> + int destroy = 0; >> + >> + mutex_lock(&id_priv->handler_mutex); >> + >> + if (id_priv->id.event_handler(&id_priv->id, &work->event)) { > How do we know that the user hasn't tried to destroy the id from another > callback? We need some sort of state check here. Correct, will be fixed. >> + cma_exch(id_priv, CMA_DESTROYING); >> + destroy = 1; >> + } >> + >> + cma_enable_remove(id_priv); > > I didn't see the matching cma_disable_remove() call. As you can see also in the patch 3/5, places in the code which originally did --not-- call cma_disable_remove() but rather just did atomic_inc(&conn_id->dev_remove) were just replaced with mutex_lock(&id_priv->handler_mutex). This is b/c cma_disable_remove does two things: 1) it does the state validation 2) it locks the handler_mutex, so places in the code which don't need the state validation don't call it... a bit dirty. >> >> +static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private >> *id_priv) >> +{ > > nit - function name isn't clear to me. Maybe something like > cma_netdev_change_handler()? Although I'm not sure that netdev change is what > the user is really interested in. What they really want to know is if IP > address mapping/resolution changed. netdev is hidden from the user. OK, I will see how to improve the name. > Maybe call this RDMA_CM_EVENT_ADDR_CHANGE? Let me think about it. >> +static int cma_netdev_callback(struct notifier_block *self, unsigned long >> event, >> + void *ctx) >> +{ >> + struct net_device *ndev = (struct net_device *)ctx; >> + struct cma_device *cma_dev; >> + struct rdma_id_private *id_priv; >> + int ret = NOTIFY_DONE; >> + >> + if (dev_net(ndev) != &init_net) >> + return NOTIFY_DONE; >> + >> + if (event != NETDEV_BONDING_FAILOVER) >> + return NOTIFY_DONE; >> + >> + if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) >> + return NOTIFY_DONE; >> + >> + mutex_lock(&lock); >> + list_for_each_entry(cma_dev, &dev_list, list) > > It seems like we just need to find the cma_dev that has the current mapping Correct. So I can take the lock, find the device in the list, increment its refcount and release the lock.
To that end I would have to save in the cma device structure the names of the network devices which are associated with it... I can't use a comparison of the pdev etc. pointers of the dma device, since some network devices (eg bonding / vlan interfaces in ethernet) have NULL pdev (they are virtual devices). Later I can scan this device ID list, but I must do it under the lock in order not to race with the device removal code, which removes IDs from this list in cma_process_remove(), correct? Or. From ogerlitz at voltaire.com Tue May 27 09:37:44 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 27 May 2008 19:37:44 +0300 Subject: [ofa-general] Re: [RFC PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks In-Reply-To: <000301c8c011$ef202f20$ebc8180a@amr.corp.intel.com> References: <000301c8c011$ef202f20$ebc8180a@amr.corp.intel.com> Message-ID: <483C38D8.5020600@voltaire.com> Sean Hefty wrote: >> @@ -359,7 +358,7 @@ static int cma_disable_remove(struct rdm >> >> spin_lock_irqsave(&id_priv->lock, flags); >> if (id_priv->state == state) { >> - atomic_inc(&id_priv->dev_remove); >> + mutex_lock(&id_priv->handler_mutex); > This just tried to acquire a mutex while holding a spinlock. I see. So can taking this spin lock be avoided here? I understand that the spin lock came to protect the state check, correct? Or.
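One way around the mutex-under-spinlock problem Sean flagged is to invert the order: take the handler mutex first and check the state under it, dropping the mutex on a mismatch. A sketch of that shape, borrowing the disable/enable_callback naming Or suggested earlier in the series (illustrative only, not necessarily what was merged):

    static int cma_disable_callback(struct rdma_id_private *id_priv,
                                    enum cma_state state)
    {
            mutex_lock(&id_priv->handler_mutex);
            if (id_priv->state != state) {
                    mutex_unlock(&id_priv->handler_mutex);
                    return -EINVAL;
            }
            return 0;       /* caller runs its handler, then unlocks */
    }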
From Thomas.Talpey at netapp.com Tue May 27 09:39:29 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 May 2008 12:39:29 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <1211902417.4114.73.camel@trinity.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> Message-ID: At 11:33 AM 5/27/2008, Tom Tucker wrote: >So I think from an NFSRDMA coding perspective it's a wash... Just to be clear, you're talking about the NFS/RDMA server. However, it's pretty much a wash on the client, for different reasons. >When posting the WR, we check the fastreg capabilities bit + transport >type bit: >If fastreg is true --> > Post FastReg > If iWARP (or with a cap bit read-with-inv-flag) > post rdma read w/ invalidate >... For iWARP's case, this means rdma-read-w-inv, >plus rdma-send-w-inv, etc... Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests don't support remote invalidate. At least, the table in RFC5040 (p.22) doesn't:

-------+-----------+-------+------+-------+-----------+--------------
RDMA   | Message   | Tagged| STag | Queue | Invalidate| Message
Message| Type      | Flag  | and  | Number| STag      | Length
OpCode |           |       | TO   |       |           | Communicated
       |           |       |      |       |           | between DDP
       |           |       |      |       |           | and RDMAP
-------+-----------+-------+------+-------+-----------+--------------
0000b  | RDMA Write| 1     | Valid| N/A   | N/A       | Yes
       |           |       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0001b  | RDMA Read | 0     | N/A  | 1     | N/A       | Yes
       | Request   |       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0010b  | RDMA Read | 1     | Valid| N/A   | N/A       | Yes
       | Response  |       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0011b  | Send      | 0     | N/A  | 0     | N/A       | Yes
       |           |       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0100b  | Send with | 0     | N/A  | 0     | Valid     | Yes
       | Invalidate|       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0101b  | Send with | 0     | N/A  | 0     | N/A       | Yes
       | SE        |       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0110b  | Send with | 0     | N/A  | 0     | Valid     | Yes
       | SE and    |       |      |       |           |
       | Invalidate|       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
0111b  | Terminate | 0     | N/A  | 2     | N/A       | Yes
       |           |       |      |       |           |
-------+-----------+-------+------+-------+-----------+--------------
1000b  |           |
to     | Reserved  | Not Specified
1111b  |           |
-------+-----------+-------------------------------------------------

I want to take this opportunity to also mention that the RPC/RDMA client-server exchange does not support remote-invalidate currently. Because of the multiple stags supported by the rpcrdma chunking header, and because the client needs to verify that the stags were in fact invalidated, there is significant overhead, and the jury is out on that benefit. In fact, I suspect it's a lose at the client. Tom (Talpey). From felix at chelsio.com Tue May 27 09:58:32 2008 From: felix at chelsio.com (Felix Marti) Date: Tue, 27 May 2008 09:58:32 -0700 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support References: <20080516223256.27221.34568.stgit@dell3.ogc.int><20080516223419.27221.49014.stgit@dell3.ogc.int><4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com><483B3AD7.4050208@opengridcomputing.com><1211902417.4114.73.camel@trinity.ogc.int> Message-ID: <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> > -----Original Message----- > From: general-bounces at lists.openfabrics.org [mailto:general- > bounces at lists.openfabrics.org] On Behalf Of Talpey, Thomas > Sent: Tuesday, May 27, 2008 9:39 AM > To: Tom Tucker > Cc: Roland Dreier; general at lists.openfabrics.org > Subject: Re: [ofa-general] [PATCH RFC v3 1/2] > RDMA/Core:MEM_MGT_EXTENSIONS support > > Maybe I'm confused, but I don't understand this. iWARP RDMA Read > requests don't support remote invalidate. At least, the table in > RFC5040 (p.22) doesn't:
iWARP RDMA Read > requests > don't support remote invalidate. At least, the table in RFC5040 (p.22) > doesn't: > > > > -------+-----------+-------+------+-------+-----------+------------- > - > RDMA | Message | Tagged| STag | Queue | Invalidate| Message > Message| Type | Flag | and | Number| STag | Length > OpCode | | | TO | | | Communicated > | | | | | | between DDP > | | | | | | and RDMAP > -------+-----------+-------+------+-------+-----------+------------- > - > 0000b | RDMA Write| 1 | Valid| N/A | N/A | Yes > | | | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0001b | RDMA Read | 0 | N/A | 1 | N/A | Yes > | Request | | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0010b | RDMA Read | 1 | Valid| N/A | N/A | Yes > | Response | | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0011b | Send | 0 | N/A | 0 | N/A | Yes > | | | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0100b | Send with | 0 | N/A | 0 | Valid | Yes > | Invalidate| | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0101b | Send with | 0 | N/A | 0 | N/A | Yes > | SE | | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0110b | Send with | 0 | N/A | 0 | Valid | Yes > | SE and | | | | | > | Invalidate| | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 0111b | Terminate | 0 | N/A | 2 | N/A | Yes > | | | | | | > -------+-----------+-------+------+-------+-----------+------------- > - > 1000b | | > to | Reserved | Not Specified > 1111b | | > -------+-----------+------------------------------------------------ > - RDMA Read with Local Invalidate does not affect the wire. The 'must invalidate' state is kept in the RNIC that issues the RDMA Read Request... > > > > I want to take this opportunity to also mention that the RPC/RDMA > client-server > exchange does not support remote-invalidate currently. Because of the > multiple > stags supported by the rpcrdma chunking header, and because the client > needs > to verify that the stags were in fact invalidated, there is significant > overhead, > and the jury is out on that benefit. In fact, I suspect it's a lose at > the client. > > Tom (Talpey). > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general From weiny2 at llnl.gov Tue May 27 10:08:59 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 27 May 2008 10:08:59 -0700 Subject: [ofa-general] OpenSM? In-Reply-To: <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> Message-ID: <20080527100859.6d48cd45.weiny2@llnl.gov> Charles, Here at LLNL we have been running OpenSM for some time. Thus far we are very happy with it's performance. 
Our largest cluster is 1152 nodes, and OpenSM can bring it up (not counting boot time) in less than a minute. Here are some details. We are running v3.1.10 of OpenSM with some minor modifications (mostly patches which have been submitted upstream and accepted by Sasha but are not yet in a release). Our clusters are all Fat-tree topologies. We have a node which is more or less dedicated to running OpenSM. We have some other monitoring software running on it, but OpenSM can utilize the CPU/memory if it needs to. A) On our large clusters this node is a 4 socket, dual core (8 cores total) Opteron running at 2.4GHz with 16GB of memory. I don't believe OpenSM needs this much, but the nodes were all built the same, so this is what it got. B) On one of our smaller clusters (128 nodes) OpenSM is running on a dual socket, single core (2 cores total) 2.4GHz Opteron node with 2GB of memory. We have not seen any issues with this cluster and OpenSM. We run with the up/down algorithm; ftree has not panned out for us yet. I can't say how that would compare to the Cisco algorithms. In short, OpenSM should work just fine on your cluster. Hope this helps, Ira On Tue, 27 May 2008 11:15:14 -0400 Charles Taylor wrote: > We have a 400 node IB cluster. We are running an embedded SM in failover mode on our TS270/Cisco7008 core switches. > > [rest of the quoted message and SM log excerpt snipped; quoted in full earlier in this digest]
> > Thanks, > > Charlie Taylor > UF HPC Center > > May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > ********************** NEW SWEEP ******************** > May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: > Rediscover > the subnet > May 27 14:14:13 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM > OUT_OF_SERVICE > trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59 > May 27 14:14:13 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:256]: An > existing IB > node GUID 00:02:c9:02:00:21:4b:59 LID 194 was removed > May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM > DELETE_MC_GROUP > trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59 > May 27 14:14:14 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: > Topology > changed > May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > discovering removed ports > May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: > async events > require sweep > May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > ********************** NEW SWEEP ******************** > May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: > Rediscover > the subnet > May 27 14:16:28 topspin-270sc ib_sm.x[812]: [ib_sm_discovery.c:1009]: no > routing required for port guid 00:02:c9:02:00:21:4b:59, lid 194 > May 27 14:16:30 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: > Topology > changed > May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > discovering new ports > May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > multicast membership change > May 27 14:16:30 topspin-270sc ib_sm.x[812]: [ib_sm_assign.c:588]: > Force port to > go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 > May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:562]: > Program port > state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor > node=00:02:c9:02:00:21:4b:58, port= 1, current state 2 > May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:733]: > Failed to > negotiate MTU, op_vl for node=00:02:c9:02:00:21:4b:58, port= 1, mad > status 0x1c > May 27 14:18:42 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM > IN_SERVICE trap > for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59 > May 27 14:18:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:144]: A new > IB node > 00:02:c9:02:00:21:4b:59 was discovered and assigned LID 0 > May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: > async events > require sweep > May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > ********************** NEW SWEEP ******************** > May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: > Rediscover > the subnet > May 27 14:18:46 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No > topology > change > May 27 14:18:46 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > previous GET/SET operation failures > May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:545]: > Reassigning > LID, node - GUID=00:02:c9:02:00:21:4b:58, port=1, new LID=411, curr > LID=0 > May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:588]: > Force port to > go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 > May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:635]: > Clean up SA > resources for port forced down due to LID conflict, node - > GUID=00:02:c9:02:00:21:4b:58, port=1 > May 27 14:18:47 topspin-270sc 
ib_sm.x[803]: [ib_sm_assign.c:667]: > cleaning DB > for guid 00:02:c9:02:00:21:4b:59, lid 194 > May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: > _ib_smAllocSubnet: initRate= 4 > May 27 14:18:47 topspin-270sc last message repeated 23 times > May 27 14:18:47 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity > links > detected in the network > May 27 14:21:01 topspin-270sc ib_sm.x[820]: [ib_sm_bringup.c:516]: > Active > port(s) now in INIT state node=00:02:c9:02:00:21:4b:58, port=16, > state=2, > neighbor node=00:02:c9:02:00:21:4b:58, port=1, state=2 > May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: > async events > require sweep > May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > ********************** NEW SWEEP ******************** > May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: > Rediscover > the subnet > May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No > topology > change > May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:525]: IB node > 00:06:6a:00:d9:00:04:5d port 16 is INIT state > May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > some ports in INIT state > May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > previous GET/SET operation failures > May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: > _ib_smAllocSubnet: initRate= 4 > May 27 14:21:05 topspin-270sc last message repeated 23 times > May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity > links > detected in the network > May 27 14:23:19 topspin-270sc ib_sm.x[817]: [ib_sm_bringup.c:562]: > Program port > state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor > node=00:02:c9:02:00:21:4b:58, port= 1, current state 2 > May 27 14:23:24 topspin-270sc ib_sm.x[823]: [INFO]: Generate SM > CREATE_MC_GROUP > trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59 > May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: > async events > require sweep > May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > ********************** NEW SWEEP ******************** > May 27 14:23:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No > topology > change > May 27 14:23:26 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > multicast membership change > May 27 14:23:33 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid > 00:05:ad:00:00:02:3c:60, is no longer synchronized with Master SM > May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Initialize a > backup session > with Standby SM guid 00:05:ad:00:00:02:3c:60 > May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: > async events > require sweep > May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > ********************** NEW SWEEP ******************** > May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid > 00:05:ad:00:00:02:3c:60, started synchronizing with Master SM > May 27 14:25:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No > topology > change > May 27 14:25:42 topspin-270sc ib_sm.x[803]: [INFO]: Configuration > caused by > multicast membership change > May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB > synchronized > with Standby SM guid 00:05:ad:00:00:02:3c:60 > May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB > synchronized > with all designated backup SMs > May 27 14:28:04 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: > 
********************** NEW SWEEP ******************** > May 27 14:28:06 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No > topology > change > > On May 23, 2008, at 2:20 PM, Steve Wise wrote: > > > Or Gerlitz wrote: > >> Steve Wise wrote: > >>> Are we sure we need to expose this to the user? > >> I believe this is the way to go if we want to let smart ULPs > >> generate new rkey/stag per mapping. Simpler ULPs could then just > >> put the same value for each map associated with the same mr. > >> > >> Or. > >> > > > > How should I add this to the API? > > > > Perhaps we just document the format of an rkey in the struct ib_mr. > > Thus the app would do this to change the key before posting the > > fast_reg_mr wr (coded to be explicit, not efficient): > > > > u8 newkey; > > u32 newrkey; > > > > newkey = 0xaa; > > newrkey = (mr->rkey & 0xffffff00) | newkey; > > mr->rkey = newrkey > > wr.wr.fast_reg.mr = mr; > > ... > > > > > > Note, this assumes mr->rkey is in host byte order (I think the linux > > rdma code assumes this in other places too). > > > > > > Steve. > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From sashak at voltaire.com Tue May 27 10:53:43 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 27 May 2008 20:53:43 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211888036.13185.219.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> <20080527103341.GF12014@sashak.voltaire.com> <1211888036.13185.219.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080527175343.GA14205@sashak.voltaire.com> On 04:33 Tue 27 May , Hal Rosenstock wrote: > > > > Maybe yes, but could you be more specific? Store SMKey in read-only > > file on a client side? > > Treat smkey as su treats password rather than a command line parameter > is another alternative. Ok, let's do it as '--smkey X' and then saquery will ask for a value, just like su does. Good? > > I'm not proposing to expose SM_Key, just added such option where this > > key could be specified. > > How is that not exposing it ? Because (1) and (2) below. Sasha > > -- Hal > > > So: 1) this is *optional*, 2) there is no > > suggestions about how the right value should be determined. 
> > > > Sasha > From sashak at voltaire.com Tue May 27 10:56:37 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 27 May 2008 20:56:37 +0300 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211887752.13185.212.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> <1211887752.13185.212.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080527175637.GB14205@sashak.voltaire.com> On 04:29 Tue 27 May, Hal Rosenstock wrote: > On Fri, 2008-05-23 at 05:52 -0700, Hal Rosenstock wrote: > > > Following your logic we will need to disable root password > > > typing too. > > > > That's taking it too far. Root passwords are at least hidden when > > typing. > > At least hide the key from plain sight when typing, like su does. There are a lot of tools where a password can be specified as clear text on the command line (wget, smbclient, etc.) - it is the user's responsibility to keep such sensitive data safe. Sasha From pw at osc.edu Tue May 27 11:00:04 2008 From: pw at osc.edu (Pete Wyckoff) Date: Tue, 27 May 2008 14:00:04 -0400 Subject: [ofa-general] mthca MR attrs userspace change Message-ID: <20080527180004.GA15444@osc.edu> My kernel started complaining to me recently: ib_mthca 0000:02:00.0: Process 'tgtd' did not pass in MR attrs. ib_mthca 0000:02:00.0: Update libmthca to fix this. It comes from commit cb9fbc5c37b69ac584e61d449cfd590f5ae1f90d ("IB: expand ib_umem_get() prototype") and the fix in baaad380c0aa955f7d62e846467316c94067f1a5 ("IB/mthca: Avoid changing userspace ABI to handle DMA write barrier attribute"). Nice that everything still works with old userspace, but where is the latest libmthca these days? The one at kernel.org still has ABI_VERSION 1: http://git.kernel.org/?p=libs/infiniband/libmthca.git;a=blob;f=src/mthca-abi.h;h=2557274e4cbd9f36df2be42379644d31b4ff5da3;hb=HEAD By the way, Roland, your efforts at Fedora packaging are certainly appreciated here. If the new libmthca just showed up in updates to F-9, that would be most convenient. -- Pete From rdreier at cisco.com Tue May 27 11:03:30 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 11:03:30 -0700 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: <20080527180004.GA15444@osc.edu> (Pete Wyckoff's message of "Tue, 27 May 2008 14:00:04 -0400") References: <20080527180004.GA15444@osc.edu> Message-ID: > ib_mthca 0000:02:00.0: Process 'tgtd' did not pass in MR attrs. > ib_mthca 0000:02:00.0: Update libmthca to fix this. Heh, thanks for the reminder. Will fix libmthca to handle this properly today. > By the way, Roland, your efforts at Fedora packaging are certainly > appreciated here. If the new libmthca just showed up in updates to > F-9, that would be most convenient. It will, although the Fedora process takes a while to grind through ;) - R. 
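On the saquery --smkey thread above, a minimal sketch of the agreed su-style prompting; prompt_smkey() is a hypothetical helper, not actual saquery code, and getpass(3) is obsolescent but widely available. The point is that the key is typed with echo disabled, so it never shows up on the command line, in shell history, or in ps output:

	#include <stdlib.h>
	#include <stdint.h>
	#include <unistd.h>

	/* Prompt for the 64-bit SM_Key with terminal echo disabled. */
	static uint64_t prompt_smkey(void)
	{
		const char *s = getpass("SM_Key: ");

		/* base 0 accepts 0x-prefixed hex as well as decimal */
		return s ? strtoull(s, NULL, 0) : 0;
	}

A termios-based echo-off loop would be the portable replacement if getpass() is unavailable.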
From sean.hefty at intel.com Tue May 27 11:09:24 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 27 May 2008 11:09:24 -0700 Subject: [ofa-general] RE: [RFC PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks In-Reply-To: <483C38D8.5020600@voltaire.com> References: <000301c8c011$ef202f20$ebc8180a@amr.corp.intel.com> <483C38D8.5020600@voltaire.com> Message-ID: <000e01c8c024$cb797730$ebc8180a@amr.corp.intel.com> >>> spin_lock_irqsave(&id_priv->lock, flags); >>> if (id_priv->state == state) { >>> - atomic_inc(&id_priv->dev_remove); >>> + mutex_lock(&id_priv->handler_mutex); >> This just tried to acquire a mutex while holding a spinlock. > >I see. So can taking this spin lock be avoided here? I understand that >spin lock came to protect the state check, correct? I think we should just remove cma_disable_remove() and cma_enable_remove(), and instead call mutex_lock/unlock directly in their places. Where cma_disable_remove() is called, add in appropriate state checks after acquiring the mutex. - Sean From tom at opengridcomputing.com Tue May 27 11:28:14 2008 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 27 May 2008 13:28:14 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <1211902417.4114.73.camel@trinity.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> Message-ID: <1211912894.4114.78.camel@trinity.ogc.int> On Tue, 2008-05-27 at 10:33 -0500, Tom Tucker wrote: > On Mon, 2008-05-26 at 16:02 -0700, Roland Dreier wrote: > > > The "invalidate local stag" part of a read is just a local sink side > > > operation (ie no wire protocol change from a read). It's not like > > > processing an ingress send-with-inv. It is really functionally like a > > > read followed immediately by a fenced invalidate-local, but it doesn't > > > stall the pipe. So the device has to remember the read is a "with inv > > > local stag" and invalidate the stag after the read response is placed > > > and before the WCE is reaped by the application. > > > > Yes, understood. My point was just that in IB, at least in theory, one > > could just use an L_Key that doesn't have any remote permissions in the > > scatter list of an RDMA read, while in iWARP, the STag used to place an > > RDMA read response has to have remote write permission. So RDMA read > > with invalidate makes sense for iWARP, because it gives a race-free way > > to allow an STag to be invalidated immediately after an RDMA read > > response is placed, while in IB it's simpler just to never give remote > > access at all. > > > > So I think from an NFSRDMA coding perspective it's a wash... > > When creating the local data sink, We need to check the transport type. > > If it's IB --> only local access, > if it's iWARP --> local + remote access. > > When posting the WR, We check the fastreg capabilities bit + transport type bit: > If fastreg is true --> > Post FastReg > If iWARP (or with a cap bit read-with-inv-flag) > post rdma read w/ invalidate > else /* IB */ > post rdma read Steve pointed out a good optimization here. Instead of fencing the RDMA READ here in advance of the INVALIDATE, we should post the INVALIDATE when the READ WR completes. This will avoid stalling the SQ. 
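Something like the following sketch, using the IB_WR_INVALIDATE_MR opcode and wr.local_inv field from Steve's v4 patch below; the helper name and the CQ plumbing that would call it are assumptions for illustration:

	/*
	 * Called when the RDMA READ's work completion is reaped. Only now
	 * is the local invalidate posted, so no read fence is needed and
	 * the SQ never stalls behind the outstanding READ.
	 */
	static int invalidate_on_read_done(struct ib_qp *qp, struct ib_mr *mr)
	{
		struct ib_send_wr inv_wr, *bad_wr;

		memset(&inv_wr, 0, sizeof inv_wr);
		inv_wr.opcode = IB_WR_INVALIDATE_MR;	/* proposed opcode */
		inv_wr.wr.local_inv.mr = mr;		/* proposed wr union member */
		inv_wr.send_flags = IB_SEND_SIGNALED;
		return ib_post_send(qp, &inv_wr, &bad_wr);
	}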
Since IB doesn't put the LKEY on the wire, there's no security issue to close. We need to keep a bunch of fastreg MR around anyway for concurrent RPC. Thoughts? Tom > post invalidate > fi > else > ... today's logic > fi > > I make the observation, however, that the transport type is now overloaded > with a set of required verbs. For iWARP's case, this means rdma-read-w-inv, > plus rdma-send-w-inv, etc... This also means that new transport types will > inherit one or the other set of verbs (IB or iWARP). > > Tom > > > > - R. > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Tue May 27 11:24:55 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 11:24:55 -0700 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: (Roland Dreier's message of "Tue, 27 May 2008 11:03:30 -0700") References: <20080527180004.GA15444@osc.edu> Message-ID: Here's the patch I plan to add to match the kernel, in case someone wants to check it over: diff --git a/src/mthca-abi.h b/src/mthca-abi.h index 2557274..7e47d70 100644 --- a/src/mthca-abi.h +++ b/src/mthca-abi.h @@ -36,7 +36,8 @@ #include -#define MTHCA_UVERBS_ABI_VERSION 1 +#define MTHCA_UVERBS_MIN_ABI_VERSION 1 +#define MTHCA_UVERBS_MAX_ABI_VERSION 2 struct mthca_alloc_ucontext_resp { struct ibv_get_context_resp ibv_resp; @@ -50,6 +51,17 @@ struct mthca_alloc_pd_resp { __u32 reserved; }; +struct mthca_reg_mr { + struct ibv_reg_mr ibv_cmd; +/* + * Mark the memory region with a DMA attribute that causes + * in-flight DMA to be flushed when the region is written to: + */ +#define MTHCA_MR_DMASYNC 0x1 + __u32 mr_attrs; + __u32 reserved; +}; + struct mthca_create_cq { struct ibv_create_cq ibv_cmd; __u32 lkey; diff --git a/src/mthca.c b/src/mthca.c index e00c4ee..dd95636 100644 --- a/src/mthca.c +++ b/src/mthca.c @@ -282,9 +282,11 @@ static struct ibv_device *mthca_driver_init(const char *uverbs_sys_path, return NULL; found: - if (abi_version > MTHCA_UVERBS_ABI_VERSION) { - fprintf(stderr, PFX "Fatal: ABI version %d of %s is too new (expected %d)\n", - abi_version, uverbs_sys_path, MTHCA_UVERBS_ABI_VERSION); + if (abi_version > MTHCA_UVERBS_MAX_ABI_VERSION || + abi_version < MTHCA_UVERBS_MIN_ABI_VERSION) { + fprintf(stderr, PFX "Fatal: ABI version %d of %s is not in supported range %d-%d\n", + abi_version, uverbs_sys_path, MTHCA_UVERBS_MIN_ABI_VERSION, + MTHCA_UVERBS_MAX_ABI_VERSION); return NULL; } diff --git a/src/verbs.c b/src/verbs.c index 6c9b53a..3d273d4 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -117,12 +117,21 @@ int mthca_free_pd(struct ibv_pd *pd) static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, uint64_t hca_va, - enum ibv_access_flags access) + enum ibv_access_flags access, + int dma_sync) { struct ibv_mr *mr; - struct ibv_reg_mr cmd; + struct mthca_reg_mr cmd; int ret; + /* + * Old kernels just ignore the extra data we pass in with the + * reg_mr command structure, so there's no need to add an ABI + * version check here. + */ + cmd.mr_attrs = dma_sync ? 
MTHCA_MR_DMASYNC : 0; + cmd.reserved = 0; + mr = malloc(sizeof *mr); if (!mr) return NULL; @@ -132,11 +141,11 @@ static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, struct ibv_reg_mr_resp resp; ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr, - &cmd, sizeof cmd, &resp, sizeof resp); + &cmd.ibv_cmd, sizeof cmd, &resp, sizeof resp); } #else ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr, - &cmd, sizeof cmd); + &cmd.ibv_cmd, sizeof cmd); #endif if (ret) { free(mr); @@ -149,7 +158,7 @@ static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, struct ibv_mr *mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, enum ibv_access_flags access) { - return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access); + return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access, 0); } int mthca_dereg_mr(struct ibv_mr *mr) @@ -202,7 +211,7 @@ struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe, cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, - 0, IBV_ACCESS_LOCAL_WRITE); + 0, IBV_ACCESS_LOCAL_WRITE, 1); if (!cq->mr) goto err_buf; @@ -297,7 +306,7 @@ int mthca_resize_cq(struct ibv_cq *ibcq, int cqe) mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, - 0, IBV_ACCESS_LOCAL_WRITE); + 0, IBV_ACCESS_LOCAL_WRITE, 1); if (!mr) { mthca_free_buf(&buf); ret = ENOMEM; @@ -405,7 +414,7 @@ struct ibv_srq *mthca_create_srq(struct ibv_pd *pd, if (mthca_alloc_srq_buf(pd, &attr->attr, srq)) goto err; - srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0); + srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0, 0); if (!srq->mr) goto err_free; @@ -525,7 +534,7 @@ struct ibv_qp *mthca_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE)) goto err_free; - qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0); + qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0, 0); if (!qp->mr) goto err_free; From rdreier at cisco.com Tue May 27 11:26:32 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 11:26:32 -0700 Subject: [ofa-general] Re: [PATCH v2 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: <71d336490805262323qd876161ue86f98dbc8499ad6@mail.gmail.com> (Ramachandra K.'s message of "Tue, 27 May 2008 11:53:42 +0530") References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103730.12355.14730.stgit@localhost.localdomain> <71d336490805260037y4c45abd5lccd5dbadebd84eb8@mail.gmail.com> <71d336490805262323qd876161ue86f98dbc8499ad6@mail.gmail.com> Message-ID: > Makes sense. We will get rid of this CONFIG option. Apart from this > are there any other changes you > would like to see in the patch series ? Have not reviewed the latest in detail but I think we are at least pretty close to something ready to merge. - R. From rdreier at cisco.com Tue May 27 11:33:55 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 11:33:55 -0700 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: <20080527180004.GA15444@osc.edu> (Pete Wyckoff's message of "Tue, 27 May 2008 14:00:04 -0400") References: <20080527180004.GA15444@osc.edu> Message-ID: > Nice that everything still works with old userspace, but where is > the latest libmthca these days? The one at kernel.org still has > ABI_VERSION 1: Actually this tricked me... the kernel ABI didn't get bumped so there was no reason to bump the libmthca ABI. 
I'll actually use this slightly simpler patch: diff --git a/src/mthca-abi.h b/src/mthca-abi.h index 2557274..4fbd98b 100644 --- a/src/mthca-abi.h +++ b/src/mthca-abi.h @@ -50,6 +50,17 @@ struct mthca_alloc_pd_resp { __u32 reserved; }; +struct mthca_reg_mr { + struct ibv_reg_mr ibv_cmd; +/* + * Mark the memory region with a DMA attribute that causes + * in-flight DMA to be flushed when the region is written to: + */ +#define MTHCA_MR_DMASYNC 0x1 + __u32 mr_attrs; + __u32 reserved; +}; + struct mthca_create_cq { struct ibv_create_cq ibv_cmd; __u32 lkey; diff --git a/src/verbs.c b/src/verbs.c index 6c9b53a..def0f30 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -117,12 +117,22 @@ int mthca_free_pd(struct ibv_pd *pd) static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, uint64_t hca_va, - enum ibv_access_flags access) + enum ibv_access_flags access, + int dma_sync) { struct ibv_mr *mr; - struct ibv_reg_mr cmd; + struct mthca_reg_mr cmd; int ret; + /* + * Old kernels just ignore the extra data we pass in with the + * reg_mr command structure, so there's no need to add an ABI + * version check here (and indeed the kernel ABI was not + * incremented due to this change). + */ + cmd.mr_attrs = dma_sync ? MTHCA_MR_DMASYNC : 0; + cmd.reserved = 0; + mr = malloc(sizeof *mr); if (!mr) return NULL; @@ -132,11 +142,11 @@ static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, struct ibv_reg_mr_resp resp; ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr, - &cmd, sizeof cmd, &resp, sizeof resp); + &cmd.ibv_cmd, sizeof cmd, &resp, sizeof resp); } #else ret = ibv_cmd_reg_mr(pd, addr, length, hca_va, access, mr, - &cmd, sizeof cmd); + &cmd.ibv_cmd, sizeof cmd); #endif if (ret) { free(mr); @@ -149,7 +159,7 @@ static struct ibv_mr *__mthca_reg_mr(struct ibv_pd *pd, void *addr, struct ibv_mr *mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, enum ibv_access_flags access) { - return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access); + return __mthca_reg_mr(pd, addr, length, (uintptr_t) addr, access, 0); } int mthca_dereg_mr(struct ibv_mr *mr) @@ -202,7 +212,7 @@ struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe, cq->mr = __mthca_reg_mr(to_mctx(context)->pd, cq->buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, - 0, IBV_ACCESS_LOCAL_WRITE); + 0, IBV_ACCESS_LOCAL_WRITE, 1); if (!cq->mr) goto err_buf; @@ -297,7 +307,7 @@ int mthca_resize_cq(struct ibv_cq *ibcq, int cqe) mr = __mthca_reg_mr(to_mctx(ibcq->context)->pd, buf.buf, cqe * MTHCA_CQ_ENTRY_SIZE, - 0, IBV_ACCESS_LOCAL_WRITE); + 0, IBV_ACCESS_LOCAL_WRITE, 1); if (!mr) { mthca_free_buf(&buf); ret = ENOMEM; @@ -405,7 +415,7 @@ struct ibv_srq *mthca_create_srq(struct ibv_pd *pd, if (mthca_alloc_srq_buf(pd, &attr->attr, srq)) goto err; - srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0); + srq->mr = __mthca_reg_mr(pd, srq->buf.buf, srq->buf_size, 0, 0, 0); if (!srq->mr) goto err_free; @@ -525,7 +535,7 @@ struct ibv_qp *mthca_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr) pthread_spin_init(&qp->rq.lock, PTHREAD_PROCESS_PRIVATE)) goto err_free; - qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0); + qp->mr = __mthca_reg_mr(pd, qp->buf.buf, qp->buf_size, 0, 0, 0); if (!qp->mr) goto err_free; From swise at opengridcomputing.com Tue May 27 11:34:29 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 13:34:29 -0500 Subject: [ofa-general] [PATCH RFC v4 0/2] RDMA/Core: MEM_MGT_EXTENSIONS support Message-ID: 
<20080527183429.32168.14351.stgit@dell3.ogc.int> The following patch series proposes: - The API and core changes needed to implement the IB BMME and iWARP equivalent memory extensions. - cxgb3 support. Changes since version 3: - better comments to the ib_alloc_fast_reg_page_list() function to explicitly state the page list is owned by the device until the fast_reg WR completes _and_ that the page_list can be modified by the device. - cxgb3 - when allocating a page list, set max_page_list_len. - page_size -> page_shift in fast_reg union of ib_send_wr struct. - key support via ib_update_fast_reg_key() Changes since version 2: - added device attribute max_fast_reg_page_list_len - added cxgb3 patch Changes since version 1: - ib_alloc_mr() -> ib_alloc_fast_reg_mr() - pbl_depth -> max_page_list_len - page_list_len -> max_page_list_len where it makes sense - int -> unsigned int where needed - fbo -> first_byte_offset - added page size and page_list_len to fast_reg union in ib_send_wr - rearranged work request fast_reg union of ib_send_wr to pack it - dropped remove_access parameter from ib_alloc_fast_reg_mr() - IB_DEVICE_MM_EXTENSIONS -> IB_DEVICE_MEM_MGT_EXTENSIONS - compiled Steve. From swise at opengridcomputing.com Tue May 27 11:35:49 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 13:35:49 -0500 Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080527183429.32168.14351.stgit@dell3.ogc.int> References: <20080527183429.32168.14351.stgit@dell3.ogc.int> Message-ID: <20080527183549.32168.22959.stgit@dell3.ogc.int> Support for the IB BMME and iWARP equivalent memory extensions to non-shared memory regions. This includes: - allocation of an ib_mr for use in fast register work requests - device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (like via dma_alloc_coherent). - fast register memory region work request - invalidate local memory region work request - read with invalidate local memory region work request (iWARP only) Design details: - New device capability flag added: IB_DEVICE_MEM_MGT_EXTENSIONS indicates device support for this feature. - New send WR opcode IB_WR_FAST_REG_MR used to issue a fast_reg request. - New send WR opcode IB_WR_INVALIDATE_MR used to invalidate a fast_reg mr. - New API function, ib_alloc_fast_reg_mr(), used to allocate fast_reg memory regions. - New API function, ib_alloc_fast_reg_page_list, to allocate device-specific page lists. - New API function, ib_free_fast_reg_page_list, to free said page lists. - New API function, ib_update_fast_reg_key, to allow the key portion of the R_Key and L_Key of a fast_reg MR to be updated. Applications call this if desired before posting the IB_WR_FAST_REG_MR. Usage Model: - MR allocated with ib_alloc_fast_reg_mr() - Page lists allocated via ib_alloc_fast_reg_page_list(). - MR R_Key/L_Key "key" field updated with ib_update_fast_reg_key(). - MR made VALID and bound to a specific page list via ib_post_send(IB_WR_FAST_REG_MR) - MR made INVALID via ib_post_send(IB_WR_INVALIDATE_MR) - MR deallocated with ib_dereg_mr() - page lists deallocated via ib_free_fast_reg_page_list(). Applications can allocate a fast_reg mr once, and then can repeatedly bind the mr to different physical memory SGLs via posting work requests to the SQ. For each outstanding mr-to-pbl binding in the SQ pipe, a fast_reg_page_list needs to be allocated. 
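In code, the model above reduces to something like the following sketch, using only the functions and wr fields introduced by this patch; pd, qp, MAX_PAGES, npages, len, and iova are assumed from context, and error handling and the page-list fill-in are omitted:

	struct ib_mr *mr;
	struct ib_fast_reg_page_list *pl;
	struct ib_send_wr wr, *bad_wr;
	u8 key = 0;
	int ret;

	/* Setup time: allocate the fast_reg MR and its page list once. */
	mr = ib_alloc_fast_reg_mr(pd, MAX_PAGES);
	pl = ib_alloc_fast_reg_page_list(qp->device, MAX_PAGES);

	/* Per I/O: bump the key, describe the new SGL, and post. */
	ib_update_fast_reg_key(mr, ++key);
	/* ... fill pl->page_list[0..npages-1] with ib_dma_* mapped addresses ... */
	memset(&wr, 0, sizeof wr);
	wr.opcode = IB_WR_FAST_REG_MR;
	wr.send_flags = IB_SEND_SIGNALED;
	wr.wr.fast_reg.mr = mr;
	wr.wr.fast_reg.page_list = pl;
	wr.wr.fast_reg.page_list_len = npages;
	wr.wr.fast_reg.page_shift = PAGE_SHIFT;
	wr.wr.fast_reg.length = len;
	wr.wr.fast_reg.iova_start = iova;
	wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE |
				      IB_ACCESS_REMOTE_READ;
	ret = ib_post_send(qp, &wr, &bad_wr);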
Thus pipelining can be achieved while still allowing device-specific page_list processing. The 4B fast_reg rkey or stag is composed of a 3B index, and a 1B key. The application can change the key each time it fast-registers thus allowing more control over the peer's use of the rkey (ie it can effectively be changed each time the rkey is rebound to a page list). Signed-off-by: Steve Wise --- drivers/infiniband/core/verbs.c | 46 ++++++++++++++++++++++++ include/rdma/ib_verbs.h | 76 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 122 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 0504208..0a334b4 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -755,6 +755,52 @@ int ib_dereg_mr(struct ib_mr *mr) } EXPORT_SYMBOL(ib_dereg_mr); +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len) +{ + struct ib_mr *mr; + + if (!pd->device->alloc_fast_reg_mr) + return ERR_PTR(-ENOSYS); + + mr = pd->device->alloc_fast_reg_mr(pd, max_page_list_len); + + if (!IS_ERR(mr)) { + mr->device = pd->device; + mr->pd = pd; + mr->uobject = NULL; + atomic_inc(&pd->usecnt); + atomic_set(&mr->usecnt, 0); + } + + return mr; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_mr); + +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int max_page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + if (!device->alloc_fast_reg_page_list) + return ERR_PTR(-ENOSYS); + + page_list = device->alloc_fast_reg_page_list(device, max_page_list_len); + + if (!IS_ERR(page_list)) { + page_list->device = device; + page_list->max_page_list_len = max_page_list_len; + } + + return page_list; +} +EXPORT_SYMBOL(ib_alloc_fast_reg_page_list); + +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list) +{ + page_list->device->free_fast_reg_page_list(page_list); +} +EXPORT_SYMBOL(ib_free_fast_reg_page_list); + /* Memory windows */ struct ib_mw *ib_alloc_mw(struct ib_pd *pd) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 911a661..ede0c80 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -106,6 +106,7 @@ enum ib_device_cap_flags { IB_DEVICE_UD_IP_CSUM = (1<<18), IB_DEVICE_UD_TSO = (1<<19), IB_DEVICE_SEND_W_INV = (1<<21), + IB_DEVICE_MEM_MGT_EXTENSIONS = (1<<22), }; enum ib_atomic_cap { @@ -151,6 +152,7 @@ struct ib_device_attr { int max_srq; int max_srq_wr; int max_srq_sge; + unsigned int max_fast_reg_page_list_len; u16 max_pkeys; u8 local_ca_ack_delay; }; @@ -414,6 +416,8 @@ enum ib_wc_opcode { IB_WC_FETCH_ADD, IB_WC_BIND_MW, IB_WC_LSO, + IB_WC_FAST_REG_MR, + IB_WC_INVALIDATE_MR, /* * Set value of IB_WC_RECV so consumers can test if a completion is a * receive by testing (opcode & IB_WC_RECV). 
@@ -628,6 +632,9 @@ enum ib_wr_opcode { IB_WR_ATOMIC_FETCH_AND_ADD, IB_WR_LSO, IB_WR_SEND_WITH_INV, + IB_WR_FAST_REG_MR, + IB_WR_INVALIDATE_MR, + IB_WR_READ_WITH_INV, }; enum ib_send_flags { @@ -676,6 +683,19 @@ struct ib_send_wr { u16 pkey_index; /* valid for GSI only */ u8 port_num; /* valid for DR SMPs on switch only */ } ud; + struct { + u64 iova_start; + struct ib_mr *mr; + struct ib_fast_reg_page_list *page_list; + unsigned int page_shift; + unsigned int page_list_len; + unsigned int first_byte_offset; + u32 length; + int access_flags; + } fast_reg; + struct { + struct ib_mr *mr; + } local_inv; } wr; }; @@ -1014,6 +1034,10 @@ struct ib_device { int (*query_mr)(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int (*dereg_mr)(struct ib_mr *mr); + struct ib_mr * (*alloc_fast_reg_mr)(struct ib_pd *pd, + int max_page_list_len); + struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct ib_device *device, int page_list_len); + void (*free_fast_reg_page_list)(struct ib_fast_reg_page_list *page_list); int (*rereg_phys_mr)(struct ib_mr *mr, int mr_rereg_mask, struct ib_pd *pd, @@ -1808,6 +1832,58 @@ int ib_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int ib_dereg_mr(struct ib_mr *mr); /** + * ib_alloc_fast_reg_mr - Allocates memory region usable with the + * IB_WR_FAST_REG_MR send work request. + * @pd: The protection domain associated with the region. + * @max_page_list_len: requested max physical buffer list size to be allocated. + */ +struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len); + +struct ib_fast_reg_page_list { + struct ib_device *device; + u64 *page_list; + unsigned int max_page_list_len; +}; + +/** + * ib_alloc_fast_reg_page_list - Allocates a page list array + * @device - ib device pointer. + * @page_list_len - size of the page list array to be allocated. + * + * This allocates and returns a struct ib_fast_reg_page_list * + * and a page_list array that is at least page_list_len in size. + * The actual size is returned in max_page_list_len. + * The caller is responsible for initializing the contents of the + * page_list array before posting a send work request with the + * IB_WC_FAST_REG_MR opcode. The page_list array entries must be + * translated using one of the ib_dma_*() functions similar to the + * addresses passed to ib_map_phys_fmr(). Once the ib_post_send() + * is issued, the struct ib_fast_reg_page_list must not be modified + * by the caller until a completion notice is returned by the device. + */ +struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list( + struct ib_device *device, int page_list_len); + +/** + * ib_free_fast_reg_page_list - Deallocates a previously allocated + * page list array. + * @page_list - struct ib_fast_reg_page_list pointer to be deallocated. + */ +void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list); + +/** + * ib_update_fast_reg_key - updates the key portion of the fast_reg + * R_Key and L_Key. + * @mr - struct ib_mr pointer to be updated. + * @newkey - new key to be used. + */ +static inline void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey) +{ + mr->lkey = (mr->lkey & 0xffffff00) | newkey; + mr->rkey = (mr->rkey & 0xffffff00) | newkey; +} + +/** * ib_alloc_mw - Allocates a memory window. * @pd: The protection domain associated with the memory window. 
*/ From swise at opengridcomputing.com Tue May 27 11:35:51 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 13:35:51 -0500 Subject: [ofa-general] [PATCH RFC v4 2/2] RDMA/cxgb3: MEM_MGT_EXTENSIONS support In-Reply-To: <20080527183429.32168.14351.stgit@dell3.ogc.int> References: <20080527183429.32168.14351.stgit@dell3.ogc.int> Message-ID: <20080527183551.32168.19227.stgit@dell3.ogc.int> - set IB_DEVICE_MEM_MGT_EXTENSIONS capability bit. - set max_fast_reg_page_list_len device attribute. - add iwch_alloc_fast_reg_mr function. - add iwch_alloc_fastreg_pbl - add iwch_free_fastreg_pbl - adjust the WQ depth for kernel mode work queues to account for fastreg possibly taking 2 WR slots. - add fastreg_mr work request support. - add invalidate_mr work request support. - add send_with_inv and send_with_se_inv work request support. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/cxio_hal.c | 13 ++- drivers/infiniband/hw/cxgb3/cxio_hal.h | 1 drivers/infiniband/hw/cxgb3/cxio_wr.h | 51 ++++++++++- drivers/infiniband/hw/cxgb3/iwch_provider.c | 78 ++++++++++++++++- drivers/infiniband/hw/cxgb3/iwch_qp.c | 123 +++++++++++++++++++-------- 5 files changed, 216 insertions(+), 50 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c index 3f441fc..6315c77 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.c +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c @@ -145,7 +145,7 @@ static int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid) } wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); memset(wqe, 0, sizeof(*wqe)); - build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 0, qpid, 7); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 0, qpid, 7, 3); wqe->flags = cpu_to_be32(MODQP_WRITE_EC); sge_cmd = qpid << 8 | 3; wqe->sge_cmd = cpu_to_be64(sge_cmd); @@ -558,7 +558,7 @@ static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p) wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); memset(wqe, 0, sizeof(*wqe)); build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 0, - T3_CTL_QP_TID, 7); + T3_CTL_QP_TID, 7, 3); wqe->flags = cpu_to_be32(MODQP_WRITE_EC); sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3; wqe->sge_cmd = cpu_to_be64(sge_cmd); @@ -674,7 +674,7 @@ static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag, Q_GENBIT(rdev_p->ctrl_qp.wptr, T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID, - wr_len); + wr_len, 3); if (flag == T3_COMPLETION_FLAG) ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID); len -= 96; @@ -816,6 +816,13 @@ int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag) 0, 0); } +int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid) +{ + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, + 0, 0, 0ULL, 0, 0, 0, 0); +} + int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) { struct t3_rdma_init_wr *wqe; diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h index 6e128f6..e7659f6 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_hal.h +++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h @@ -165,6 +165,7 @@ int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, u32 pbl_addr); int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid); +int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 
pdid); int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag); int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr); void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb); diff --git a/drivers/infiniband/hw/cxgb3/cxio_wr.h b/drivers/infiniband/hw/cxgb3/cxio_wr.h index f1a25a8..2a24962 100644 --- a/drivers/infiniband/hw/cxgb3/cxio_wr.h +++ b/drivers/infiniband/hw/cxgb3/cxio_wr.h @@ -72,7 +72,8 @@ enum t3_wr_opcode { T3_WR_BIND = FW_WROPCODE_RI_BIND_MW, T3_WR_RCV = FW_WROPCODE_RI_RECEIVE, T3_WR_INIT = FW_WROPCODE_RI_RDMA_INIT, - T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP + T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP, + T3_WR_FASTREG = FW_WROPCODE_RI_FASTREGISTER_MR } __attribute__ ((packed)); enum t3_rdma_opcode { @@ -89,7 +90,8 @@ enum t3_rdma_opcode { T3_FAST_REGISTER, T3_LOCAL_INV, T3_QP_MOD, - T3_BYPASS + T3_BYPASS, + T3_RDMA_READ_REQ_WITH_INV, } __attribute__ ((packed)); static inline enum t3_rdma_opcode wr2opcode(enum t3_wr_opcode wrop) @@ -170,11 +172,46 @@ struct t3_send_wr { struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ }; +#define T3_MAX_FASTREG_DEPTH 18 + +struct t3_fastreg_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 page_type_perms; /* 2 */ + __be32 reserved1; + __be32 stag; /* 3 */ + __be32 len; + __be32 va_base_hi; /* 4 */ + __be32 va_base_lo_fbo; + __be64 reserved2[6]; /* 5-10 */ + __be64 pbl_addrs[0]; /* 11+ */ +}; + +#define S_FR_PAGE_COUNT 24 +#define M_FR_PAGE_COUNT 0xff +#define V_FR_PAGE_COUNT(x) ((x) << S_FR_PAGE_COUNT) +#define G_FR_PAGE_COUNT(x) ((((x) >> S_FR_PAGE_COUNT)) & M_FR_PAGE_COUNT) + +#define S_FR_PAGE_SIZE 16 +#define M_FR_PAGE_SIZE 0x1f +#define V_FR_PAGE_SIZE(x) ((x) << S_FR_PAGE_SIZE) +#define G_FR_PAGE_SIZE(x) ((((x) >> S_FR_PAGE_SIZE)) & M_FR_PAGE_SIZE) + +#define S_FR_TYPE 8 +#define M_FR_TYPE 0x1 +#define V_FR_TYPE(x) ((x) << S_FR_TYPE) +#define G_FR_TYPE(x) ((((x) >> S_FR_TYPE)) & M_FR_TYPE) + +#define S_FR_PERMS 0 +#define M_FR_PERMS 0xff +#define V_FR_PERMS(x) ((x) << S_FR_PERMS) +#define G_FR_PERMS(x) ((((x) >> S_FR_PERMS)) & M_FR_PERMS) + struct t3_local_inv_wr { struct fw_riwrh wrh; /* 0 */ union t3_wrid wrid; /* 1 */ __be32 stag; /* 2 */ - __be32 reserved3; + __be32 reserved; }; struct t3_rdma_write_wr { @@ -210,7 +247,8 @@ enum t3_mem_perms { T3_MEM_ACCESS_LOCAL_READ = 0x1, T3_MEM_ACCESS_LOCAL_WRITE = 0x2, T3_MEM_ACCESS_REM_READ = 0x4, - T3_MEM_ACCESS_REM_WRITE = 0x8 + T3_MEM_ACCESS_REM_WRITE = 0x8, + T3_MEM_ACCESS_MW_BIND = 0x10 } __attribute__ ((packed)); struct t3_bind_mw_wr { @@ -346,6 +384,7 @@ union t3_wr { struct t3_rdma_write_wr write; struct t3_rdma_read_wr read; struct t3_receive_wr recv; + struct t3_fastreg_wr fastreg; struct t3_local_inv_wr local_inv; struct t3_bind_mw_wr bind; struct t3_bypass_wr bypass; @@ -368,10 +407,10 @@ static inline enum t3_wr_opcode fw_riwrh_opcode(struct fw_riwrh *wqe) static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op, enum t3_wr_flags flags, u8 genbit, u32 tid, - u8 len) + u8 len, u8 sopeop) { wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) | - V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) | + V_FW_RIWR_SOPEOP(sopeop) | V_FW_RIWR_FLAGS(flags)); wmb(); wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) | diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 8934178..e53d25b 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -768,6 +768,65 @@ static int iwch_dealloc_mw(struct ib_mw *mw) return 0; } +static struct ib_mr 
*iwch_alloc_fast_reg_mr(struct ib_pd *pd, int pbl_depth) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + u32 mmid; + u32 stag = 0; + int ret; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + ret = iwch_alloc_pbl(mhp, pbl_depth); + if (ret) { + kfree(mhp); + return ERR_PTR(ret); + } + mhp->attr.pbl_size = pbl_depth; + ret = cxio_allocate_stag(&rhp->rdev, &stag, php->pdid); + if (ret) { + iwch_free_pbl(mhp); + kfree(mhp); + return ERR_PTR(ret); + } + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.type = TPT_NON_SHARED_MR; + mhp->attr.stag = stag; + mmid = (stag) >> 8; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __func__, mmid, mhp, stag); + return &(mhp->ibmr); +} + +static struct ib_fast_reg_page_list *iwch_alloc_fastreg_pbl( + struct ib_device *device, + int page_list_len) +{ + struct ib_fast_reg_page_list *page_list; + + page_list = kmalloc(sizeof *page_list + page_list_len * sizeof(u64), + GFP_KERNEL); + if (!page_list) + return ERR_PTR(-ENOMEM); + + page_list->page_list = (u64 *)(page_list + 1); + page_list->max_page_list_len = page_list_len; + + return page_list; +} + +static void iwch_free_fastreg_pbl(struct ib_fast_reg_page_list *page_list) +{ + kfree(page_list); +} + static int iwch_destroy_qp(struct ib_qp *ib_qp) { struct iwch_dev *rhp; @@ -843,6 +902,15 @@ static struct ib_qp *iwch_create_qp(struct ib_pd *pd, */ sqsize = roundup_pow_of_two(attrs->cap.max_send_wr); wqsize = roundup_pow_of_two(rqsize + sqsize); + + /* + * Kernel users need more wq space for fastreg WRs which can take + * 2 WR fragments. + */ + ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; + if (!ucontext && wqsize < (rqsize + (2 * sqsize))) + wqsize = roundup_pow_of_two(rqsize + + roundup_pow_of_two(attrs->cap.max_send_wr * 2)); PDBG("%s wqsize %d sqsize %d rqsize %d\n", __func__, wqsize, sqsize, rqsize); qhp = kzalloc(sizeof(*qhp), GFP_KERNEL); @@ -851,7 +919,6 @@ static struct ib_qp *iwch_create_qp(struct ib_pd *pd, qhp->wq.size_log2 = ilog2(wqsize); qhp->wq.rq_size_log2 = ilog2(rqsize); qhp->wq.sq_size_log2 = ilog2(sqsize); - ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq, ucontext ? 
&ucontext->uctx : &rhp->rdev.uctx)) { kfree(qhp); @@ -1048,6 +1115,7 @@ static int iwch_query_device(struct ib_device *ibdev, props->max_mr = dev->attr.max_mem_regs; props->max_pd = dev->attr.max_pds; props->local_ca_ack_delay = 0; + props->max_fast_reg_page_list_len = T3_MAX_FASTREG_DEPTH; return 0; } @@ -1145,8 +1213,9 @@ int iwch_register_device(struct iwch_dev *dev) memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); dev->ibdev.owner = THIS_MODULE; - dev->device_cap_flags = - (IB_DEVICE_ZERO_STAG | IB_DEVICE_MEM_WINDOW); + dev->device_cap_flags = IB_DEVICE_ZERO_STAG | + IB_DEVICE_MEM_WINDOW | + IB_DEVICE_MEM_MGT_EXTENSIONS; dev->ibdev.uverbs_cmd_mask = (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | @@ -1198,6 +1267,9 @@ int iwch_register_device(struct iwch_dev *dev) dev->ibdev.alloc_mw = iwch_alloc_mw; dev->ibdev.bind_mw = iwch_bind_mw; dev->ibdev.dealloc_mw = iwch_dealloc_mw; + dev->ibdev.alloc_fast_reg_mr = iwch_alloc_fast_reg_mr; + dev->ibdev.alloc_fast_reg_page_list = iwch_alloc_fastreg_pbl; + dev->ibdev.free_fast_reg_page_list = iwch_free_fastreg_pbl; dev->ibdev.attach_mcast = iwch_multicast_attach; dev->ibdev.detach_mcast = iwch_multicast_detach; diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 79dbe5b..c702c71 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -44,54 +44,39 @@ static int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr, switch (wr->opcode) { case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: if (wr->send_flags & IB_SEND_SOLICITED) wqe->send.rdmaop = T3_SEND_WITH_SE; else wqe->send.rdmaop = T3_SEND; wqe->send.rem_stag = 0; break; -#if 0 /* Not currently supported */ - case TYPE_SEND_INVALIDATE: - case TYPE_SEND_INVALIDATE_IMMEDIATE: - wqe->send.rdmaop = T3_SEND_WITH_INV; - wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); - break; - case TYPE_SEND_SE_INVALIDATE: - wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + case IB_WR_SEND_WITH_INV: + if (wr->send_flags & IB_SEND_SOLICITED) + wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + else + wqe->send.rdmaop = T3_SEND_WITH_INV; wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); break; -#endif default: - break; + return -EINVAL; } if (wr->num_sge > T3_MAX_SGE) return -EINVAL; wqe->send.reserved[0] = 0; wqe->send.reserved[1] = 0; wqe->send.reserved[2] = 0; - if (wr->opcode == IB_WR_SEND_WITH_IMM) { - plen = 4; - wqe->send.sgl[0].stag = wr->ex.imm_data; - wqe->send.sgl[0].len = __constant_cpu_to_be32(0); - wqe->send.num_sgle = __constant_cpu_to_be32(0); - *flit_cnt = 5; - } else { - plen = 0; - for (i = 0; i < wr->num_sge; i++) { - if ((plen + wr->sg_list[i].length) < plen) { - return -EMSGSIZE; - } - plen += wr->sg_list[i].length; - wqe->send.sgl[i].stag = - cpu_to_be32(wr->sg_list[i].lkey); - wqe->send.sgl[i].len = - cpu_to_be32(wr->sg_list[i].length); - wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); + plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((plen + wr->sg_list[i].length) < plen) { + return -EMSGSIZE; } - wqe->send.num_sgle = cpu_to_be32(wr->num_sge); - *flit_cnt = 4 + ((wr->num_sge) << 1); + plen += wr->sg_list[i].length; + wqe->send.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey); + wqe->send.sgl[i].len = cpu_to_be32(wr->sg_list[i].length); + wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); } + wqe->send.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 4 + ((wr->num_sge) << 1); wqe->send.plen = cpu_to_be32(plen); return 0; 
} @@ -155,6 +140,56 @@ static int iwch_build_rdma_read(union t3_wr *wqe, struct ib_send_wr *wr, return 0; } +static int iwch_build_fastreg(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt, int *wr_cnt, struct t3_wq *wq) +{ + int i; + u64 *p; + + if (wr->wr.fast_reg.page_list_len > T3_MAX_FASTREG_DEPTH) + return -EINVAL; + *wr_cnt = 1; + wqe->fastreg.stag = cpu_to_be32(wr->wr.fast_reg.mr->rkey); + wqe->fastreg.len = cpu_to_be32(wr->wr.fast_reg.length); + wqe->fastreg.va_base_hi = cpu_to_be32(wr->wr.fast_reg.iova_start>>32); + wqe->fastreg.va_base_lo_fbo = + cpu_to_be32(wr->wr.fast_reg.iova_start&0xffffffff); + wqe->fastreg.page_type_perms = cpu_to_be32( + V_FR_PAGE_COUNT(wr->wr.fast_reg.page_list_len) | + V_FR_PAGE_SIZE(wr->wr.fast_reg.page_shift-12) | + V_FR_TYPE(T3_VA_BASED_TO) | + V_FR_PERMS(iwch_ib_to_mwbind_access(wr->wr.fast_reg.access_flags))); + p = &wqe->fastreg.pbl_addrs[0]; + for (i = 0; i < wr->wr.fast_reg.page_list_len; i++, p++) { + + /* If we need a 2nd WR, then set it up */ + if (i == 10) { + *wr_cnt = 2; + wqe = (union t3_wr *)(wq->queue + + Q_PTR2IDX((wq->wptr+1), wq->size_log2)); + build_fw_riwrh((void *)wqe, T3_WR_FASTREG, 0, + Q_GENBIT(wq->wptr, wq->size_log2), + 0, 1 + wr->wr.fast_reg.page_list_len - 10, 1); + + p = &wqe->flit[1]; + } + *p = cpu_to_be64((u64)wr->wr.fast_reg.page_list->page_list[i]); + } + *flit_cnt = 5 + wr->wr.fast_reg.page_list_len; + if (*flit_cnt > 15) + *flit_cnt = 15; + return 0; +} + +static int iwch_build_inv_stag(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt) +{ + wqe->local_inv.stag = cpu_to_be32(wr->wr.local_inv.mr->rkey); + wqe->local_inv.reserved = 0; + *flit_cnt = sizeof(struct t3_local_inv_wr) >> 3; + return 0; +} + /* * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now. */ @@ -238,6 +273,7 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, u32 num_wrs; unsigned long flag; struct t3_swsq *sqp; + int wr_cnt = 1; qhp = to_iwch_qp(ibqp); spin_lock_irqsave(&qhp->lock, flag); @@ -262,15 +298,15 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, t3_wr_flags = 0; if (wr->send_flags & IB_SEND_SOLICITED) t3_wr_flags |= T3_SOLICITED_EVENT_FLAG; - if (wr->send_flags & IB_SEND_FENCE) - t3_wr_flags |= T3_READ_FENCE_FLAG; if (wr->send_flags & IB_SEND_SIGNALED) t3_wr_flags |= T3_COMPLETION_FLAG; sqp = qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); switch (wr->opcode) { case IB_WR_SEND: - case IB_WR_SEND_WITH_IMM: + case IB_WR_SEND_WITH_INV: + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_READ_FENCE_FLAG; t3_wr_opcode = T3_WR_SEND; err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt); break; @@ -289,6 +325,17 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, if (!qhp->wq.oldest_read) qhp->wq.oldest_read = sqp; break; + case IB_WR_FAST_REG_MR: + t3_wr_opcode = T3_WR_FASTREG; + err = iwch_build_fastreg(wqe, wr, &t3_wr_flit_cnt, + &wr_cnt, &qhp->wq); + break; + case IB_WR_INVALIDATE_MR: + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_LOCAL_FENCE_FLAG; + t3_wr_opcode = T3_WR_INV_STAG; + err = iwch_build_inv_stag(wqe, wr, &t3_wr_flit_cnt); + break; default: PDBG("%s post of type=%d TBD!\n", __func__, wr->opcode); @@ -307,14 +354,14 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags, Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), - 0, t3_wr_flit_cnt); + 0, t3_wr_flit_cnt, (wr_cnt == 1) ? 
3 : 2); PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n", __func__, (unsigned long long) wr->wr_id, idx, Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2), sqp->opcode); wr = wr->next; num_wrs--; - ++(qhp->wq.wptr); + qhp->wq.wptr += wr_cnt; ++(qhp->wq.sq_wptr); } spin_unlock_irqrestore(&qhp->lock, flag); @@ -359,7 +406,7 @@ int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, wr->wr_id; build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG, Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), - 0, sizeof(struct t3_receive_wr) >> 3); + 0, sizeof(struct t3_receive_wr) >> 3, 3); PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rw_rptr 0x%x " "wqe %p \n", __func__, (unsigned long long) wr->wr_id, idx, qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe); @@ -444,7 +491,7 @@ int iwch_bind_mw(struct ib_qp *qp, wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id; build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags, Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, - sizeof(struct t3_bind_mw_wr) >> 3); + sizeof(struct t3_bind_mw_wr) >> 3, 3); ++(qhp->wq.wptr); ++(qhp->wq.sq_wptr); spin_unlock_irqrestore(&qhp->lock, flag); From Thomas.Talpey at netapp.com Tue May 27 11:35:19 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 May 2008 14:35:19 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support In-Reply-To: <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigner s.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> Message-ID: At 12:58 PM 5/27/2008, Felix Marti wrote: >RDMA Read with Local Invalidate does not affect the wire. The 'must >invalidate' state is kept in the RNIC that issues the RDMA Read >Request... Aha, okay that was not clear to me. What information does the RNIC use to line up the arrival of the RDMA Read response with the "must invalidate" state? Also, how does the RNIC signal whether the invalidation actually occurred, so the upper layer can defend itself from attack? Tom. From swise at opengridcomputing.com Tue May 27 11:40:28 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 13:40:28 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> Message-ID: <483C559C.90203@opengridcomputing.com> Talpey, Thomas wrote: > At 12:58 PM 5/27/2008, Felix Marti wrote: > >> RDMA Read with Local Invalidate does not affect the wire. The 'must >> invalidate' state is kept in the RNIC that issues the RDMA Read >> Request... >> > > Aha, okay that was not clear to me. What information does the RNIC use > to line up the arrival of the RDMA Read response with the "must invalidate" > state? The rnic already tracks outstanding read requests. It now also will track the local stag to invalidate when the read completes. > Also, how does the RNIC signal whether the invalidation actually > occurred, so the upper layer can defend itself from attack? 
> > The stag is guaranteed to be in the invalid state by the time the app reaps the read-inv-local work completion... Steve. From tom at opengridcomputing.com Tue May 27 11:58:43 2008 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 27 May 2008 13:58:43 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> Message-ID: <1211914723.4114.86.camel@trinity.ogc.int> On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote: > At 11:33 AM 5/27/2008, Tom Tucker wrote: > >So I think from an NFSRDMA coding perspective it's a wash... > > Just to be clear, you're talking about the NFS/RDMA server. However, it's > pretty much a wash on the client, for different reasons. > Tom: What client side memory registration strategy do you recommend if the default on the server side is fastreg? On the performance side we are limited by the min size of the read/write-chunk element. If the client still gives the server a 4k chunk, the performance benefit (fewer PDU on the wire) goes away. Tom > >When posting the WR, We check the fastreg capabilities bit + transport > >type bit: > >If fastreg is true --> > > Post FastReg > > If iWARP (or with a cap bit read-with-inv-flag) > > post rdma read w/ invalidate > > >... For iWARP's case, this means rdma-read-w-inv, > >plus rdma-send-w-inv, etc... > > > Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests > don't support remote invalidate. At least, the table in RFC5040 (p.22) doesn't: > > > > -------+-----------+-------+------+-------+-----------+-------------- > RDMA | Message | Tagged| STag | Queue | Invalidate| Message > Message| Type | Flag | and | Number| STag | Length > OpCode | | | TO | | | Communicated > | | | | | | between DDP > | | | | | | and RDMAP > -------+-----------+-------+------+-------+-----------+-------------- > 0000b | RDMA Write| 1 | Valid| N/A | N/A | Yes > | | | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0001b | RDMA Read | 0 | N/A | 1 | N/A | Yes > | Request | | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0010b | RDMA Read | 1 | Valid| N/A | N/A | Yes > | Response | | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0011b | Send | 0 | N/A | 0 | N/A | Yes > | | | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0100b | Send with | 0 | N/A | 0 | Valid | Yes > | Invalidate| | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0101b | Send with | 0 | N/A | 0 | N/A | Yes > | SE | | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0110b | Send with | 0 | N/A | 0 | Valid | Yes > | SE and | | | | | > | Invalidate| | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 0111b | Terminate | 0 | N/A | 2 | N/A | Yes > | | | | | | > -------+-----------+-------+------+-------+-----------+-------------- > 1000b | | > to | Reserved | Not Specified > 1111b | | > -------+-----------+------------------------------------------------- > > > > I want to take this opportunity to also mention that the RPC/RDMA client-server > exchange does not support remote-invalidate currently. 
Because of the multiple > stags supported by the rpcrdma chunking header, and because the client needs > to verify that the stags were in fact invalidated, there is significant overhead, > and the jury is out on that benefit. In fact, I suspect it's a lose at the client. > > Tom (Talpey). > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Tue May 27 11:58:19 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 13:58:19 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <1211914723.4114.86.camel@trinity.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <1211914723.4114.86.camel@trinity.ogc.int> Message-ID: <483C59CB.2060308@opengridcomputing.com> Tom Tucker wrote: > On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote: > >> At 11:33 AM 5/27/2008, Tom Tucker wrote: >> >>> So I think from an NFSRDMA coding perspective it's a wash... >>> >> Just to be clear, you're talking about the NFS/RDMA server. However, it's >> pretty much a wash on the client, for different reasons. >> >> > Tom: > > What client side memory registration strategy do you recommend if the > default on the server side is fastreg? > > On the performance side we are limited by the min size of the > read/write-chunk element. If the client still gives the server a 4k > chunk, the performance benefit (fewer PDU on the wire) goes away. > > Tom > > I would hope that dma_mr usage will be replaced with fast_reg on both the client and the server. >>> When posting the WR, We check the fastreg capabilities bit + transport >>> type bit: >>> If fastreg is true --> >>> Post FastReg >>> If iWARP (or with a cap bit read-with-inv-flag) >>> post rdma read w/ invalidate >>> >>> ... For iWARP's case, this means rdma-read-w-inv, >>> plus rdma-send-w-inv, etc... >>> >> Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests >> don't support remote invalidate. 
At least, the table in RFC5040 (p.22) doesn't: >> >> >> >> -------+-----------+-------+------+-------+-----------+-------------- >> RDMA | Message | Tagged| STag | Queue | Invalidate| Message >> Message| Type | Flag | and | Number| STag | Length >> OpCode | | | TO | | | Communicated >> | | | | | | between DDP >> | | | | | | and RDMAP >> -------+-----------+-------+------+-------+-----------+-------------- >> 0000b | RDMA Write| 1 | Valid| N/A | N/A | Yes >> | | | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0001b | RDMA Read | 0 | N/A | 1 | N/A | Yes >> | Request | | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0010b | RDMA Read | 1 | Valid| N/A | N/A | Yes >> | Response | | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0011b | Send | 0 | N/A | 0 | N/A | Yes >> | | | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0100b | Send with | 0 | N/A | 0 | Valid | Yes >> | Invalidate| | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0101b | Send with | 0 | N/A | 0 | N/A | Yes >> | SE | | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0110b | Send with | 0 | N/A | 0 | Valid | Yes >> | SE and | | | | | >> | Invalidate| | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 0111b | Terminate | 0 | N/A | 2 | N/A | Yes >> | | | | | | >> -------+-----------+-------+------+-------+-----------+-------------- >> 1000b | | >> to | Reserved | Not Specified >> 1111b | | >> -------+-----------+------------------------------------------------- >> >> >> >> I want to take this opportunity to also mention that the RPC/RDMA client-server >> exchange does not support remote-invalidate currently. Because of the multiple >> stags supported by the rpcrdma chunking header, and because the client needs >> to verify that the stags were in fact invalidated, there is significant overhead, >> and the jury is out on that benefit. In fact, I suspect it's a lose at the client. >> >> Tom (Talpey). >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From pw at osc.edu Tue May 27 12:18:25 2008 From: pw at osc.edu (Pete Wyckoff) Date: Tue, 27 May 2008 15:18:25 -0400 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: References: <20080527180004.GA15444@osc.edu> Message-ID: <20080527191825.GA15530@osc.edu> rdreier at cisco.com wrote on Tue, 27 May 2008 11:33 -0700: > > Nice that everything still works with old userspace, but where is > > the latest libmthca these days? The one at kernel.org still has > > ABI_VERSION 1: > > Actually this tricked me... the kernel ABI didn't get bumped so there > was no reason to bump the libmthca ABI. I'll actually use this slightly > simpler patch: Oh, yeah. Your patch put it back to 1. I missed that. This patch looks good, as far as I can tell without testing. Only CQ changes need dmasync, apparently. 
-- Pete From Thomas.Talpey at netapp.com Tue May 27 12:38:30 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 May 2008 15:38:30 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <1211914723.4114.86.camel@trinity.ogc.int> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <1211914723.4114.86.camel@trinity.ogc.int> Message-ID: At 02:58 PM 5/27/2008, Tom Tucker wrote: > >On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote: >> At 11:33 AM 5/27/2008, Tom Tucker wrote: >> >So I think from an NFSRDMA coding perspective it's a wash... >> >> Just to be clear, you're talking about the NFS/RDMA server. However, it's >> pretty much a wash on the client, for different reasons. >> >Tom: > >What client side memory registration strategy do you recommend if the >default on the server side is fastreg? "Whatever is fastest and safest". Given that the client and server won't necessarily be using the same hardware, nor the same kernel for that matter, I don't think we can or should legislate it. That said, I am hopeful that "fastreg" does turn out to be "fast" and therefore will become the only logical choice for the NFS/RDMA Linux client. But the future Linux client is only one such system. I cannot speak for others. Tom. > >On the performance side we are limited by the min size of the >read/write-chunk element. If the client still gives the server a 4k >chunk, the performance benefit (fewer PDU on the wire) goes away. From akepner at sgi.com Tue May 27 12:39:11 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Tue, 27 May 2008 12:39:11 -0700 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: References: <20080527180004.GA15444@osc.edu> Message-ID: <20080527193911.GF23650@sgi.com> On Tue, May 27, 2008 at 11:33:55AM -0700, Roland Dreier wrote: > ... > Actually this tricked me... the kernel ABI didn't get bumped so there > was no reason to bump the libmthca ABI. I'll actually use this slightly > simpler patch: I was wondering about that... FWIW, the patch (which I deleted below) looked good to me. -- Arthur From Thomas.Talpey at netapp.com Tue May 27 12:42:01 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 May 2008 15:42:01 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support In-Reply-To: <483C559C.90203@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> <483C559C.90203@opengridcomputing.com> Message-ID: At 02:40 PM 5/27/2008, Steve Wise wrote: >Talpey, Thomas wrote: >> At 12:58 PM 5/27/2008, Felix Marti wrote: >> >>> RDMA Read with Local Invalidate does not affect the wire. The 'must >>> invalidate' state is kept in the RNIC that issues the RDMA Read >>> Request... >>> >> >> Aha, okay that was not clear to me. What information does the RNIC use >> to line up the arrival of the RDMA Read response with the "must invalidate" >> state? > >The rnic already tracks outstanding read requests. 
It now also will >track the local stag to invalidate when the read completes. Ah - okay, so the stag that actually gets invalidated was provided with the RDMA Read request posting, and is not necessarily the stag that arrived in the peer's RDMA Read response. That helps. What happens if the upper layer gives up and invalidates the stag itself, and the peer's RDMA Read response arrives later? Nothing bad, I assume, and the peer's response is denied? > >> Also, how does the RNIC signal whether the invalidation actually >> occurred, so the upper layer can defend itself from attack? >> >> > >The stag is guaranteed to be in the invalid state by the time the app >reaps the read-inv-local work completion... Ok, given my correct understanding of the source of the stag above. Tom. From akepner at sgi.com Tue May 27 12:40:17 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Tue, 27 May 2008 12:40:17 -0700 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: <20080527191825.GA15530@osc.edu> References: <20080527180004.GA15444@osc.edu> <20080527191825.GA15530@osc.edu> Message-ID: <20080527194017.GG23650@sgi.com> On Tue, May 27, 2008 at 03:18:25PM -0400, Pete Wyckoff wrote: > > ... Only CQ changes need dmasync, apparently. Right. -- Arthur From swise at opengridcomputing.com Tue May 27 12:59:16 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 14:59:16 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> <483C559C.90203@opengridcomputing.com> Message-ID: <483C6814.4060706@opengridcomputing.com> Talpey, Thomas wrote: > At 02:40 PM 5/27/2008, Steve Wise wrote: > >> Talpey, Thomas wrote: >> >>> At 12:58 PM 5/27/2008, Felix Marti wrote: >>> >>> >>>> RDMA Read with Local Invalidate does not affect the wire. The 'must >>>> invalidate' state is kept in the RNIC that issues the RDMA Read >>>> Request... >>>> >>>> >>> Aha, okay that was not clear to me. What information does the RNIC use >>> to line up the arrival of the RDMA Read response with the "must invalidate" >>> state? >>> >> The rnic already tracks outstanding read requests. It now also will >> track the local stag to invalidate when the read completes. >> > > Ah - okay, so the stag that actually gets invalidated was provided with > the RDMA Read request posting, and is not necessarily the stag that > arrived in the peer's RDMA Read response. That helps. > > What happens if the upper layer gives up and invalidates the stag itself, > and the peer's RDMA Read response arrives later? Nothing bad, I assume, > and the peer's response is denied? > > It behaves just like any other tagged message arriving and the target stag is invalid. The connection is torn down via an RDMAP TERMINATE... >>> Also, how does the RNIC signal whether the invalidation actually >>> occurred, so the upper layer can defend itself from attack? >>> >>> >>> >> The stag is guaranteed to be in the invalid state by the time the app >> reaps the read-inv-local work completion... >> > > Ok, given my correct understanding of the source of the stag above. > > Tom. 
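[To make the invalidation semantics above concrete, here is a minimal sketch of posting such a read from the consumer side, assuming an IB_WR_RDMA_READ_WITH_INV opcode and an invalidate rkey field along the lines being discussed in this thread; the names are hypothetical, not a settled API.]

    /*
     * Post an RDMA Read whose completion also invalidates the local
     * stag supplied at posting time; per the discussion above, that
     * stag is guaranteed to be invalid by the time the work
     * completion is reaped.
     */
    static int post_read_with_inv(struct ib_qp *qp, struct ib_sge *sge,
                                  u64 remote_addr, u32 rkey, u32 local_stag)
    {
        struct ib_send_wr wr, *bad_wr;

        memset(&wr, 0, sizeof wr);
        wr.opcode              = IB_WR_RDMA_READ_WITH_INV;
        wr.send_flags          = IB_SEND_SIGNALED;
        wr.sg_list             = sge;
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        wr.ex.invalidate_rkey  = local_stag;   /* hypothetical field: local stag to invalidate */

        return ib_post_send(qp, &wr, &bad_wr);
    }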
> From rdreier at cisco.com Tue May 27 13:00:17 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 13:00:17 -0700 Subject: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode In-Reply-To: <1211879148.13769.94.camel@mtls03> (Eli Cohen's message of "Tue, 27 May 2008 12:05:48 +0300") References: <1211879148.13769.94.camel@mtls03> Message-ID: thanks, applied for 2.6.27... > + /* > + * we rely on this condition when copying small skbs and we > + * pass ownership of the first fragment only. > + */ > + if (SKB_TSHOLD > IPOIB_CM_HEAD_SIZE) { > + printk("%s: SKB_TSHOLD(%d) must not be larger then %d\n", > + THIS_MODULE->name, SKB_TSHOLD, IPOIB_CM_HEAD_SIZE); > + return -EINVAL; > + } I changed this to a build bug, to avoid waiting until runtime to notice this problem: + /* + * When copying small received packets, we only copy from the + * linear data part of the SKB, so we rely on this condition. + */ + BUILD_BUG_ON(IPOIB_CM_COPYBREAK > IPOIB_CM_HEAD_SIZE); From Thomas.Talpey at netapp.com Tue May 27 13:24:32 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 27 May 2008 16:24:32 -0400 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support In-Reply-To: <483C6814.4060706@opengridcomputing.com> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> <483C559C.90203@opengridcomputing.com> <483C6814.4060706@opengridcomputing.com> Message-ID: At 03:59 PM 5/27/2008, Steve Wise wrote: >Talpey, Thomas wrote: >> What happens if the upper layer gives up and invalidates the stag itself, >> and the peer's RDMA Read response arrives later? Nothing bad, I assume, >> and the peer's response is denied? >> >> > >It behaves just like any other tagged message arriving and the target >stag is invalid. The connection is torn down via an RDMAP TERMINATE... I was wondering more about the dangling stag reference that the original work request carried. Normally, it would reference the still-valid stag, but if that stag was torn down (causing the invalidation to point to nothing), or worse, re-bound (causing it to point at something else!), then it's a possible issue? Sorry to seem paranoid here. Storage is pretty sensitive to silent data corruption avenues. Because they always find a way to happen. Tom. From swise at opengridcomputing.com Tue May 27 13:33:52 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 15:33:52 -0500 Subject: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core:MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <4838F6CB.2040203@voltaire.com> <483AB5AC.3030406@opengridcomputing.com> <483B3AD7.4050208@opengridcomputing.com> <1211902417.4114.73.camel@trinity.ogc.int> <8A71B368A89016469F72CD08050AD33402DE30DE@maui.asicdesigners.com> <483C559C.90203@opengridcomputing.com> <483C6814.4060706@opengridcomputing.com> Message-ID: <483C7030.7040208@opengridcomputing.com> An HTML attachment was scrubbed... 
URL: From rdreier at cisco.com Tue May 27 13:50:54 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 13:50:54 -0700 Subject: [ofa-general] [ANNOUNCE] libmthca 1.0.5 released Message-ID: libmthca is a userspace driver for Mellanox InfiniBand HCAs. It is a plug-in module for libibverbs that allows programs to use Mellanox hardware directly from userspace. A new stable release, libmthca 1.0.5, is available from http://www.openfabrics.org/downloads/mlx4/libmthca-1.0.5.tar.gz with sha1sum a68b1de47d320546c7bcc92bfa9c482f7d74fac1 /data/home/roland/libmthca-1.0.5.tar.gz I also tagged the 1.0.5 release of libmthca and pushed it out to my git tree on kernel.org: git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git (the name of the tag is libmthca-1.0.5). Builds for the Ubuntu 7.10 and 8.04 releases will be available by adding the lines (replacing hardy by gutsy if needed) deb http://ppa.launchpad.net/roland-digitalvampire/ubuntu hardy main deb-src http://ppa.launchpad.net/roland-digitalvampire/ubuntu hardy main to your /etc/sources.list file, and updated Debian and Fedora packages will work their way into the main archives. This release fixes several bugs and adds support for the new kernel interface for mr attrs (which will be shipped in 2.6.26). The complete list of changes since 1.0.4 is: Eli Cohen (3): Ensure an receive WQEs are in memory before linking to chain Remove checks for srq->first_free < 0 IB/ib_mthca: Pre-link receive WQEs in Tavor mode Jack Morgenstein (1): Clear context struct at allocation time Michael S. Tsirkin (2): Fix posting >255 recv WRs for Tavor Set cleaned CQEs back to HW ownership when cleaning CQ Roland Dreier (21): Fix paths in Debian install files for libibverbs 1.1 Update Debian changelog debian/rules: Remove DEB_DH_STRIP_ARGS Fix handling of send CQE with error for QPs connected to SRQ Add missing wmb() in mthca_tavor_post_send() Remove deprecated ${Source-Version} from debian/control Clean up NVALGRIND comment in config.h.in Fix Valgrind annotations so they can actually be built Remove ibv_driver_init from linker version script Fix spec file License: tag Mark "driver" file in sysconfdir with %config Update Debian policy version to 3.7.3 Fix Valgrind false positives in mthca_create_cq() and mthca_create_srq() Add debian/watch file Update Debian build to avoid setting RPATH Change openib.org URLs to openfabrics.org URLs Fix CQ cleanup when QP is destroyed Update libmthca to handle new kernel ABI Include spec file changes from Fedora CVS Remove %config tag from mthca.driver file Roll libmthca-1.0.5 release From kliteyn at dev.mellanox.co.il Tue May 27 14:31:58 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 28 May 2008 00:31:58 +0300 Subject: [ofa-general] OpenSM? In-Reply-To: <20080527100859.6d48cd45.weiny2@llnl.gov> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> <20080527100859.6d48cd45.weiny2@llnl.gov> Message-ID: <483C7DCE.7080507@dev.mellanox.co.il> Charles, Ira Weiny wrote: > Charles, > > Here at LLNL we have been running OpenSM for some time. Thus far we are very > happy with it's performance. 
Our largest cluster is 1152 nodes and OpenSM can > bring it up (not counting boot time) in less than a minute. OpenSM is successfully running on some large clusters with 4-5K nodes. It takes about 2-3 minutes to bring up such clusters. > Here are some details. > > We are running v3.1.10 of OpenSM with some minor modifications (mostly patches > which have been submitted upstream and been accepted by Sasha but are not yet > in a release.) > > Our clusters are all Fat-tree topologies. > > We have a node which is more or less dedicated to running OpenSM. We have some > other monitoring software running on it, but OpenSM can utilize the CPU/Memory > if it needs to. > > A) On our large clusters this node is a 4 socket, dual core (8 cores > total) Opteron running at 2.4Gig with 16Gig of memory. I don't believe > OpenSM needs this much but the nodes were built all the same so this is > what it got. > > B) On one of our smaller clusters (128 nodes) OpenSM is running on a > dual socket, single core (2 core) 2.4Gig Opteron nodes with 2Gig of > memory. We have not seen any issues with this cluster and OpenSM. > > We run with the up/down algorithm, ftree has not panned out for us yet. I > can't say how that would compare to the Cisco algorithms. If the cluster topology is fat-tree, then there is a ftree and up/down routing. Ftree would be a good choice if you need LMC=0 (plus if the topology complies with certain fat-tree rules). For any other tree, or for LMC>0, up/down should work. -- Yevgeny > In short OpenSM should work just fine on your cluster. > > Hope this helps, > Ira > > > On Tue, 27 May 2008 11:15:14 -0400 > Charles Taylor wrote: > >> We have a 400 node IB cluster. We are running an embedded SM in >> failover mode on our TS270/Cisco7008 core switches. Lately we have >> been seeing problems with LID assignment when rebooting nodes (see log >> messages below). It is also taking far too long for LIDS to be >> assigned as it takes on the order of minutes for the ports to >> transition to "ACTIVE". >> >> This seems like a bug to us and we are considering switching to >> OpenSM on a host. I'm wondering about experience with running >> OpenSM for medium to large clusters (Fat Tree) and what resources >> (memory/cpu) we should plan on for the host node. 
>> >> Thanks, >> >> Charlie Taylor >> UF HPC Center >> >> May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:14:10 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: >> Rediscover >> the subnet >> May 27 14:14:13 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM >> OUT_OF_SERVICE >> trap for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59 >> May 27 14:14:13 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:256]: An >> existing IB >> node GUID 00:02:c9:02:00:21:4b:59 LID 194 was removed >> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM >> DELETE_MC_GROUP >> trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59 >> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: >> Topology >> changed >> May 27 14:14:14 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> discovering removed ports >> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: >> async events >> require sweep >> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:16:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: >> Rediscover >> the subnet >> May 27 14:16:28 topspin-270sc ib_sm.x[812]: [ib_sm_discovery.c:1009]: no >> routing required for port guid 00:02:c9:02:00:21:4b:59, lid 194 >> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1503]: >> Topology >> changed >> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> discovering new ports >> May 27 14:16:30 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> multicast membership change >> May 27 14:16:30 topspin-270sc ib_sm.x[812]: [ib_sm_assign.c:588]: >> Force port to >> go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 >> May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:562]: >> Program port >> state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor >> node=00:02:c9:02:00:21:4b:58, port= 1, current state 2 >> May 27 14:18:42 topspin-270sc ib_sm.x[819]: [ib_sm_bringup.c:733]: >> Failed to >> negotiate MTU, op_vl for node=00:02:c9:02:00:21:4b:58, port= 1, mad >> status 0x1c >> May 27 14:18:42 topspin-270sc ib_sm.x[803]: [INFO]: Generate SM >> IN_SERVICE trap >> for GID=fe:80:00:00:00:00:00:00:00:02:c9:02:00:21:4b:59 >> May 27 14:18:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:144]: A new >> IB node >> 00:02:c9:02:00:21:4b:59 was discovered and assigned LID 0 >> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: >> async events >> require sweep >> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:18:43 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: >> Rediscover >> the subnet >> May 27 14:18:46 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No >> topology >> change >> May 27 14:18:46 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> previous GET/SET operation failures >> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:545]: >> Reassigning >> LID, node - GUID=00:02:c9:02:00:21:4b:58, port=1, new LID=411, curr >> LID=0 >> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:588]: >> Force port to >> go down due to LID conflict, node - GUID=00:02:c9:02:00:21:4b:58, port=1 >> May 27 14:18:46 topspin-270sc ib_sm.x[816]: [ib_sm_assign.c:635]: >> Clean up SA >> resources for port forced down due to LID conflict, node 
- >> GUID=00:02:c9:02:00:21:4b:58, port=1 >> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_assign.c:667]: >> cleaning DB >> for guid 00:02:c9:02:00:21:4b:59, lid 194 >> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: >> _ib_smAllocSubnet: initRate= 4 >> May 27 14:18:47 topspin-270sc last message repeated 23 times >> May 27 14:18:47 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity >> links >> detected in the network >> May 27 14:21:01 topspin-270sc ib_sm.x[820]: [ib_sm_bringup.c:516]: >> Active >> port(s) now in INIT state node=00:02:c9:02:00:21:4b:58, port=16, >> state=2, >> neighbor node=00:02:c9:02:00:21:4b:58, port=1, state=2 >> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: >> async events >> require sweep >> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:21:01 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1320]: >> Rediscover >> the subnet >> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No >> topology >> change >> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:525]: IB node >> 00:06:6a:00:d9:00:04:5d port 16 is INIT state >> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> some ports in INIT state >> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> previous GET/SET operation failures >> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [ib_sm_routing.c:2936]: >> _ib_smAllocSubnet: initRate= 4 >> May 27 14:21:05 topspin-270sc last message repeated 23 times >> May 27 14:21:05 topspin-270sc ib_sm.x[803]: [INFO]: Different capacity >> links >> detected in the network >> May 27 14:23:19 topspin-270sc ib_sm.x[817]: [ib_sm_bringup.c:562]: >> Program port >> state, node=00:02:c9:02:00:21:4b:58, port= 16, current state 2, neighbor >> node=00:02:c9:02:00:21:4b:58, port= 1, current state 2 >> May 27 14:23:24 topspin-270sc ib_sm.x[823]: [INFO]: Generate SM >> CREATE_MC_GROUP >> trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:21:4b:59 >> May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: >> async events >> require sweep >> May 27 14:23:24 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:23:26 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No >> topology >> change >> May 27 14:23:26 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> multicast membership change >> May 27 14:23:33 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid >> 00:05:ad:00:00:02:3c:60, is no longer synchronized with Master SM >> May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Initialize a >> backup session >> with Standby SM guid 00:05:ad:00:00:02:3c:60 >> May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1875]: >> async events >> require sweep >> May 27 14:25:39 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:25:39 topspin-270sc ib_sm.x[826]: [INFO]: Standby SM guid >> 00:05:ad:00:00:02:3c:60, started synchronizing with Master SM >> May 27 14:25:42 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No >> topology >> change >> May 27 14:25:42 topspin-270sc ib_sm.x[803]: [INFO]: Configuration >> caused by >> multicast membership change >> May 27 14:25:43 topspin-270sc ib_sm.x[826]: [INFO]: Master SM DB >> synchronized >> with Standby SM guid 00:05:ad:00:00:02:3c:60 >> May 27 14:25:43 topspin-270sc ib_sm.x[826]: 
[INFO]: Master SM DB >> synchronized >> with all designated backup SMs >> May 27 14:28:04 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1914]: >> ********************** NEW SWEEP ******************** >> May 27 14:28:06 topspin-270sc ib_sm.x[803]: [ib_sm_sweep.c:1516]: No >> topology >> change >> >> On May 23, 2008, at 2:20 PM, Steve Wise wrote: >> >>> Or Gerlitz wrote: >>>> Steve Wise wrote: >>>>> Are we sure we need to expose this to the user? >>>> I believe this is the way to go if we want to let smart ULPs >>>> generate new rkey/stag per mapping. Simpler ULPs could then just >>>> put the same value for each map associated with the same mr. >>>> >>>> Or. >>>> >>> How should I add this to the API? >>> >>> Perhaps we just document the format of an rkey in the struct ib_mr. >>> Thus the app would do this to change the key before posting the >>> fast_reg_mr wr (coded to be explicit, not efficient): >>> >>> u8 newkey; >>> u32 newrkey; >>> >>> newkey = 0xaa; >>> newrkey = (mr->rkey & 0xffffff00) | newkey; >>> mr->rkey = newrkey >>> wr.wr.fast_reg.mr = mr; >>> ... >>> >>> >>> Note, this assumes mr->rkey is in host byte order (I think the linux >>> rdma code assumes this in other places too). >>> >>> >>> Steve. >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Tue May 27 15:53:02 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 15:53:02 -0700 Subject: [ofa-general] Re: mthca MR attrs userspace change In-Reply-To: <20080527180004.GA15444@osc.edu> (Pete Wyckoff's message of "Tue, 27 May 2008 14:00:04 -0400") References: <20080527180004.GA15444@osc.edu> Message-ID: libmthca-1.0.5 is now in Fedora 9 proposed updates -- http://koji.fedoraproject.org/koji/buildinfo?buildID=50682 I'm not sure exactly how it makes it into real F-9. - R. From YJia at tmriusa.com Tue May 27 16:28:27 2008 From: YJia at tmriusa.com (Yicheng Jia) Date: Tue, 27 May 2008 18:28:27 -0500 Subject: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? In-Reply-To: <48371ACE.908@gmail.com> Message-ID: Thanks for your reply. I'm using one CQ for all the WRs. Do you know why there's no ARM-N support in MLX drivers? My concern is the performance. The overhead of software poll_cq loop is quite significant if there are multiple pieces of small amount of data to be transferred on both sender/receiver sides. For instance, on the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you have a good solution for such kind of problem? Best, Yicheng Dotan Barak 05/23/2008 01:27 PM To Yicheng Jia cc general at lists.openfabrics.org Subject Re: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? Hi. Yicheng Jia wrote: > > Hi Folks, > > I'm trying to use CQ Event notification for multiple completions > (ARM_N) according to Mellanox Lx III user manual for scatter/gathering > RDMA. However I couldn't find it in current MLX driver. 
It seems to me > that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are > multiple work requests, I have to use "poll_cq" to synchronously wait > until all the requests are done, is that correct? Is there a way to do > asynchronous multiple send by subscribing for an ARM_N event? You are right: the low level drivers of Mellanox devices don't support ARM-N (this feature is supported by the devices, but it wasn't implemented in the low level drivers). You are right, in order to read all of the completions you need to use poll_cq. By the way: do you have to create a completion for every WR? (If you are using one QP, this may solve your problem.) Dotan _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http://www.ers.ibm.com _____________________________________________________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From swise at opengridcomputing.com Tue May 27 19:05:36 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 May 2008 21:05:36 -0500 Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <20080527183549.32168.22959.stgit@dell3.ogc.int> References: <20080527183429.32168.14351.stgit@dell3.ogc.int> <20080527183549.32168.22959.stgit@dell3.ogc.int> Message-ID: <483CBDF0.7030209@opengridcomputing.com> > enum ib_send_flags { > @@ -676,6 +683,19 @@ struct ib_send_wr { > u16 pkey_index; /* valid for GSI only */ > u8 port_num; /* valid for DR SMPs on switch only */ > } ud; > + struct { > + u64 iova_start; > + struct ib_mr *mr; > + struct ib_fast_reg_page_list *page_list; > + unsigned int page_shift; > + unsigned int page_list_len; > + unsigned int first_byte_offset; > + u32 length; > + int access_flags; > + } fast_reg; > + struct { > + struct ib_mr *mr; > + } local_inv; > } wr; > }; Ok, while writing a test case for all this jazz, I see that passing the struct ib_mr pointer to both IB_WR_FAST_REGISTER_MR and IB_WR_INVALIDATE_MR is perhaps bad. Consider a chain of WRs: an INVALIDATE_MR linked to a FAST_REGISTER_MR and passed to the provider via a single ib_post_send() call. You can't do that if you want to bump the key value between the invalidate and the fast_reg with the new key, which is probably what apps want to do. You are forced, under this proposed API, to post the two WRs separately and call ib_update_fast_reg_key() in between the ib_post_send() calls. Perhaps we should just pass in a u32 rkey for both WRs instead of the mr pointer? Then the code could put the old rkey in the invalidate WR, put the newly updated rkey in the fast_reg WR, chain the two together, and do a single post. I think this is the way to go: change the fast_reg and local_inv unions to take a u32 rkey instead of a struct ib_mr *mr. Thoughts? Steve.
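[To make the proposal concrete, a rough sketch of the chained posting this would enable, assuming the fast_reg and local_inv unions carry a u32 rkey as suggested above; the opcode and field names follow this RFC and are illustrative, and mr, qp, page_list, npages, iova and len are assumed from the surrounding code.]

    /*
     * Invalidate the MR under its old rkey and re-register it under a
     * bumped key, chained so that a single ib_post_send() suffices and
     * no ib_update_fast_reg_key() call is needed in between.
     */
    struct ib_send_wr inv_wr, frmr_wr, *bad_wr;
    u32 old_rkey = mr->rkey;
    u32 new_rkey = (old_rkey & 0xffffff00) | new_key;  /* new 8-bit key */
    int ret;

    memset(&inv_wr, 0, sizeof inv_wr);
    inv_wr.opcode            = IB_WR_INVALIDATE_MR;
    inv_wr.wr.local_inv.rkey = old_rkey;    /* invalidate the old key */
    inv_wr.next              = &frmr_wr;

    memset(&frmr_wr, 0, sizeof frmr_wr);
    frmr_wr.opcode                        = IB_WR_FAST_REGISTER_MR;
    frmr_wr.send_flags                    = IB_SEND_SIGNALED;
    frmr_wr.wr.fast_reg.rkey              = new_rkey;  /* register under new key */
    frmr_wr.wr.fast_reg.iova_start        = iova;
    frmr_wr.wr.fast_reg.page_list         = page_list;
    frmr_wr.wr.fast_reg.page_list_len     = npages;
    frmr_wr.wr.fast_reg.page_shift        = PAGE_SHIFT;
    frmr_wr.wr.fast_reg.first_byte_offset = 0;
    frmr_wr.wr.fast_reg.length            = len;
    frmr_wr.wr.fast_reg.access_flags      = IB_ACCESS_LOCAL_WRITE |
                                            IB_ACCESS_REMOTE_READ;

    ret = ib_post_send(qp, &inv_wr, &bad_wr);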
From rdreier at cisco.com Tue May 27 20:59:55 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 20:59:55 -0700 Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483CBDF0.7030209@opengridcomputing.com> (Steve Wise's message of "Tue, 27 May 2008 21:05:36 -0500") References: <20080527183429.32168.14351.stgit@dell3.ogc.int> <20080527183549.32168.22959.stgit@dell3.ogc.int> <483CBDF0.7030209@opengridcomputing.com> Message-ID: > Perhaps we should just pass in a u32 rkey for both WRs instead of the > mr pointer? Then the code could put the old rkey in the invalidate > WR, and the newly updated rkey in the fast_reg WR and chain the two > together and do a single post. Makes sense to me. The only thing I would worry about would be if some device needs the actual mr struct pointer to post the work request, but mlx4 and I guess cxgb3 don't at least and I don't see a good reason why another device would. Let's go for it. - R. From rdreier at cisco.com Tue May 27 21:23:06 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 21:23:06 -0700 Subject: [ofa-general] Re: [PATCH] libibverbs: fix coding style typos according to checkpatch.pl In-Reply-To: <200805232332.07576.dotanba@gmail.com> (Dotan Barak's message of "Fri, 23 May 2008 23:32:07 +0300") References: <200805232332.07576.dotanba@gmail.com> Message-ID: thanks, applied
From rdreier at cisco.com Tue May 27 22:26:04 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 22:26:04 -0700 Subject: [ofa-general] Re: [PATCH v2 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <20080519103529.12355.82570.stgit@localhost.localdomain> (Ramachandra K.'s message of "Mon, 19 May 2008 16:05:29 +0530") References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103529.12355.82570.stgit@localhost.localdomain> Message-ID: > +ssize_t vnic_create_primary(struct device *dev, > + struct device_attribute *dev_attr, const char *buf, > + size_t count) > +ssize_t vnic_create_secondary(struct device *dev, > + struct device_attribute *dev_attr, > + const char *buf, size_t count) > +ssize_t vnic_delete(struct device *dev, struct device_attribute *dev_attr, > + const char *buf, size_t count) These are all only referenced from a sysfs attribute defined in the same file, so they can be made static (and don't need extern declarations in a header file).
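[For anyone following the review, the pattern being asked for looks roughly like this; a sketch with a hypothetical body, not the actual VNIC code.]

    static ssize_t vnic_create_primary(struct device *dev,
                                       struct device_attribute *dev_attr,
                                       const char *buf, size_t count)
    {
        /* ... parse buf and create the primary connection here ... */
        return count;
    }

    /* The only reference is this attribute in the same file, so no
     * extern declaration is needed in a header. */
    static DEVICE_ATTR(create_primary, S_IWUSR, NULL, vnic_create_primary);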
From rdreier at cisco.com Tue May 27 22:28:46 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 May 2008 22:28:46 -0700 Subject: [ofa-general] Re: [PATCH v2 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx In-Reply-To: <20080519103258.12355.6146.stgit@localhost.localdomain> (Ramachandra K.'s message of "Mon, 19 May 2008 16:02:58 +0530") References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103258.12355.6146.stgit@localhost.localdomain> Message-ID: > +void viport_disconnect(struct viport *viport) > +{ > + VIPORT_FUNCTION("viport_disconnect()\n"); > + viport->disconnect = 1; > + viport_failure(viport); > + wait_event(viport->disconnect_queue, viport->disconnect == 0); > +} > + > +void viport_free(struct viport *viport) > +{ > + VIPORT_FUNCTION("viport_free()\n"); > + viport_disconnect(viport); /* NOTE: this can sleep */ There are no other calls to viport_disconnect() that I can see, so it can be made static (and the declaration in vnic_viport.h can be dropped). In fact, given how small the function is and the fact that it has only a single call site, it might be easier just to merge it into viport_free(). But that's a matter of taste. - R. From kliteyn at dev.mellanox.co.il Tue May 27 23:24:49 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 28 May 2008 09:24:49 +0300 Subject: [ofa-general] OpenSM? In-Reply-To: <20080527100859.6d48cd45.weiny2@llnl.gov> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> <20080527100859.6d48cd45.weiny2@llnl.gov> Message-ID: <483CFAB1.5020409@dev.mellanox.co.il> Ira, Ira Weiny wrote: > > We run with the up/down algorithm, ftree has not panned out for us yet. Can you elaborate on that? Did ftree fail to digest the topology? Or did it do a lousy job configuring the subnet? Or perhaps you need LMC>0, which ftree doesn't support? -- Yevgeny From ogerlitz at voltaire.com Tue May 27 23:45:33 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 28 May 2008 09:45:33 +0300 Subject: [ofa-general] Re: [RFC PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks In-Reply-To: <000e01c8c024$cb797730$ebc8180a@amr.corp.intel.com> References: <000301c8c011$ef202f20$ebc8180a@amr.corp.intel.com> <483C38D8.5020600@voltaire.com> <000e01c8c024$cb797730$ebc8180a@amr.corp.intel.com> Message-ID: <483CFF8D.2000001@voltaire.com> Sean Hefty wrote: > I think we should just remove cma_disable_remove() and cma_enable_remove(), and instead call mutex_lock/unlock directly in their places. Where cma_disable_remove() is called, add in appropriate state checks after acquiring > the mutex. OK, will do that. Or.
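[In code terms, the shape being agreed on here is roughly the following at each callback site; an illustrative sketch only. The V4 patch posted later in this thread wraps this check in a cma_disable_callback() helper and reads the state under the id spinlock.]

    /* serialize against device removal / id destruction */
    mutex_lock(&id_priv->handler_mutex);
    if (id_priv->state != CMA_CONNECT) {
        /* the id moved on or is being destroyed; drop the event */
        mutex_unlock(&id_priv->handler_mutex);
        return 0;
    }
    ret = id_priv->id.event_handler(&id_priv->id, &event);
    mutex_unlock(&id_priv->handler_mutex);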
From vlad at lists.openfabrics.org Wed May 28 03:09:57 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 28 May 2008 03:09:57 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080528-0200 daily build status Message-ID: <20080528100957.CC3BBE60CAB@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.14 Passed on ppc64 with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From hrosenstock at xsigo.com Wed May 28 04:06:30 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 28 May 2008 04:06:30 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080527175637.GB14205@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com>
<1211887752.13185.212.camel@hrosenstock-ws.xsigo.com> <20080527175637.GB14205@sashak.voltaire.com> Message-ID: <1211972790.13185.332.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-27 at 20:56 +0300, Sasha Khapyorsky wrote: > On 04:29 Tue 27 May , Hal Rosenstock wrote: > > On Fri, 2008-05-23 at 05:52 -0700, Hal Rosenstock wrote: > > > > Following your logic we will need to disable root passwords > > > > typing too. > > > > > > That's taking it too far. Root passwords are at least hidden when > > > typing. > > > > At least hide the key typing from plain sight when typing like su does. > > There are lot of tools where password can be specified as clear text in > command line (wget, smbclient, etc..) - it is an user responsibility to > keep his sensitive data safe. Do those tools provide a way to obscure passwords or force the user to do this in plain sight ? Seems like a user can't do this without support from the tool. smbclient seems to provide this; I didn't look at wget. smbclient supports an authorization file which supports this and says: Make certain that the permissions on the file restrict access from unwanted users. As you mentioned before, this is another acceptable approach (and this also lends itself better to scripting). -- Hal > Sasha From hrosenstock at xsigo.com Wed May 28 04:06:31 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 28 May 2008 04:06:31 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <20080527175343.GA14205@sashak.voltaire.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> <20080527103341.GF12014@sashak.voltaire.com> <1211888036.13185.219.camel@hrosenstock-ws.xsigo.com> <20080527175343.GA14205@sashak.voltaire.com> Message-ID: <1211972791.13185.334.camel@hrosenstock-ws.xsigo.com> On Tue, 2008-05-27 at 20:53 +0300, Sasha Khapyorsky wrote: > On 04:33 Tue 27 May , Hal Rosenstock wrote: > > > > > > Maybe yes, but could you be more specific? Store SMKey in read-only > > > file on a client side? > > > > Treat smkey as su treats password rather than a command line parameter > > is another alternative. > > Ok, let's do it as '--smkey X' and then saquery will ask for a value, > just like su does. Good? Works for me. > > > I'm not proposing to expose SM_Key, just added such option where this > > > key could be specified. > > > > How is that not exposing it ? > > Because (1) and (2) below. The original patch exposes the key when the option is invoked and that's just the time to hide it. -- Hal > Sasha > > > > > -- Hal > > > > > So: 1) this is *optional*, 2) there is no > > > suggestions about how the right value should be determined. 
> > > > > > Sasha > > From eli at dev.mellanox.co.il Wed May 28 04:13:45 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 28 May 2008 14:13:45 +0300 Subject: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode In-Reply-To: References: <1211879148.13769.94.camel@mtls03> Message-ID: <1211973225.13769.147.camel@mtls03> On Tue, 2008-05-27 at 13:00 -0700, Roland Dreier wrote: > I changed this to a build bug, to avoid waiting until runtime to notice > this problem: > > + /* > + * When copying small received packets, we only copy from the > + * linear data part of the SKB, so we rely on this condition. > + */ > + BUILD_BUG_ON(IPOIB_CM_COPYBREAK > IPOIB_CM_HEAD_SIZE); I was looking for this one thing to make this check at compile time... thanks for letting us know. From ogerlitz at voltaire.com Wed May 28 04:14:40 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 28 May 2008 14:14:40 +0300 Subject: [ofa-general] Re: [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification In-Reply-To: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> References: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> Message-ID: <483D3EA0.40506@voltaire.com> Sean Hefty wrote >> How do we know that the user hasn't tried to destroy the id from another >> callback? We need some sort of state check here. fixed >> >> +static int cma_netdev_align_id(struct net_device *ndev, struct rdma_id_private *id_priv) > nit - function name isn't clear to me. Maybe something like > cma_netdev_change_handler()? Although I'm not sure that netdev change is what > the user is really interested in. What they really want to know is if IP > address mapping/resolution changed. netdev is hidden from the user. I changed the function name to cma_netdev_change as it checks if there was some netdev change between the time of this ID address resolution to when the netdev event was delivered. The user doesn't get explicit notification from the rdma-cm on netdev change but rather on address change as you suggested next. > Maybe call this RDMA_CM_EVENT_ADDR_CHANGE? done Or. From ogerlitz at voltaire.com Wed May 28 04:31:15 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 28 May 2008 14:31:15 +0300 Subject: [ofa-general] Re: [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification In-Reply-To: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> References: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> Message-ID: <483D4283.60307@voltaire.com> Sean Hefty wrote: >> +static int cma_netdev_callback(struct notifier_block *self, unsigned long event, void *ctx) >> + >> + mutex_lock(&lock); >> + list_for_each_entry(cma_dev, &dev_list, list) > > It seems like we just need to find the cma_dev that has the current mapping If your comment comes to say that maybe first find the cma_dev to which this event applies, I don't think its possible, see below. If I didn't get you right, can you please explain it a little more. The rdma-cm maintains a mapping between IDs to the physical devices. The mapping is established during address resolution using the HW address of the --network-- device that was resolved (eg through ARP and then looking on neigh->dev or route lookup) for this ID. In the bonding case, the network device --borrows-- the HW address from the active slave device. During fail-over, the bonding net device changes its HW address and then the netdev event is delivered on which this code acts. 
So the same cma_dev can have IDs with different netdev HW addresses in their dev_addr structs, say bond0 = and pdevA list = { , } depending on when address resolution was done for ID1,ID2 and on the ULP behavior on the ADDR_CHANGE event. I don't see how to get along with a simple check that tells us which cma_dev to search for matches. If we really want to avoid scanning the whole cma_dev list, we can add a mapping from --net devices-- to IDs and then scan only the list of the affiliated netdevice. So I am still left with the general rdma-cm mutex being taken for the duration of the double-loop... Or. From ogerlitz at voltaire.com Wed May 28 04:34:31 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 28 May 2008 14:34:31 +0300 (IDT) Subject: [ofa-general] [RFC V4 PATCH 3/5] rdma/cma: simplify locking needed for serialization of callbacks In-Reply-To: References: Message-ID: The rdma-cm has some logic in place to make sure that callbacks on an ID are delivered to the consumer in a serialized manner; specifically, it has code to protect against device removal racing with a callback that is being delivered to the user. This patch simplifies this logic by using a mutex per ID instead of the wait queue and atomic variable. I have left the disable/enable_remove notation such that the patch would be easier to read, but if this approach is accepted, I think we want to change it to disable/enable_callback. Signed-off-by: Or Gerlitz cma.c | 96 ++++++++++++++++++++++++++++++++---------------------------------- 1 files changed, 47 insertions(+), 49 deletions(-) changes from v2 (I named this v4 to comply with the next patch) - cma_disable_remove --> cma_disable_callback, acquire the mutex before the spinlock - removed cma_enable_remove and just call mutex_unlock(id->handler_mutex) instead Sean, basically you asked that cma_disable_remove be removed from the code, but this would spread taking the spin lock and doing state checks over all the places which call it, so I think it is nice to still have it. As for the spin lock usage, I preferred not to touch it, since the code of cma_comp, cma_exch, cma_comp_exch etc. uses it.
Index: linux-2.6.26-rc3/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc3.orig/drivers/infiniband/core/cma.c 2008-05-26 15:11:17.000000000 +0300 +++ linux-2.6.26-rc3/drivers/infiniband/core/cma.c 2008-05-28 11:08:24.000000000 +0300 @@ -126,8 +126,7 @@ struct rdma_id_private { struct completion comp; atomic_t refcount; - wait_queue_head_t wait_remove; - atomic_t dev_remove; + struct mutex handler_mutex; int backlog; int timeout_ms; @@ -351,28 +350,24 @@ static void cma_deref_id(struct rdma_id_ complete(&id_priv->comp); } -static int cma_disable_remove(struct rdma_id_private *id_priv, +static int cma_disable_callback(struct rdma_id_private *id_priv, enum cma_state state) { unsigned long flags; int ret; + mutex_lock(&id_priv->handler_mutex); spin_lock_irqsave(&id_priv->lock, flags); - if (id_priv->state == state) { - atomic_inc(&id_priv->dev_remove); + if (id_priv->state == state) ret = 0; - } else + else { + mutex_unlock(&id_priv->handler_mutex); ret = -EINVAL; + } spin_unlock_irqrestore(&id_priv->lock, flags); return ret; } -static void cma_enable_remove(struct rdma_id_private *id_priv) -{ - if (atomic_dec_and_test(&id_priv->dev_remove)) - wake_up(&id_priv->wait_remove); -} - static int cma_has_cm_dev(struct rdma_id_private *id_priv) { return (id_priv->id.device && id_priv->cm_id.ib); @@ -395,8 +390,7 @@ struct rdma_cm_id *rdma_create_id(rdma_c mutex_init(&id_priv->qp_mutex); init_completion(&id_priv->comp); atomic_set(&id_priv->refcount, 1); - init_waitqueue_head(&id_priv->wait_remove); - atomic_set(&id_priv->dev_remove, 0); + mutex_init(&id_priv->handler_mutex); INIT_LIST_HEAD(&id_priv->listen_list); INIT_LIST_HEAD(&id_priv->mc_list); get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); @@ -923,7 +917,7 @@ static int cma_ib_handler(struct ib_cm_i struct rdma_cm_event event; int ret = 0; - if (cma_disable_remove(id_priv, CMA_CONNECT)) + if (cma_disable_callback(id_priv, CMA_CONNECT)) return 0; memset(&event, 0, sizeof event); @@ -980,12 +974,12 @@ static int cma_ib_handler(struct ib_cm_i /* Destroy the CM ID by returning a non-zero value. 
*/ id_priv->cm_id.ib = NULL; cma_exch(id_priv, CMA_DESTROYING); - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); rdma_destroy_id(&id_priv->id); return ret; } out: - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); return ret; } @@ -1097,7 +1091,7 @@ static int cma_req_handler(struct ib_cm_ int offset, ret; listen_id = cm_id->context; - if (cma_disable_remove(listen_id, CMA_LISTEN)) + if (cma_disable_callback(listen_id, CMA_LISTEN)) return -ECONNABORTED; memset(&event, 0, sizeof event); @@ -1118,7 +1112,7 @@ static int cma_req_handler(struct ib_cm_ goto out; } - atomic_inc(&conn_id->dev_remove); + mutex_lock(&conn_id->handler_mutex); mutex_lock(&lock); ret = cma_acquire_dev(conn_id); mutex_unlock(&lock); @@ -1140,7 +1134,7 @@ static int cma_req_handler(struct ib_cm_ !cma_is_ud_ps(conn_id->id.ps)) ib_send_cm_mra(cm_id, CMA_CM_MRA_SETTING, NULL, 0); mutex_unlock(&lock); - cma_enable_remove(conn_id); + mutex_unlock(&conn_id->handler_mutex); goto out; } @@ -1149,11 +1143,11 @@ static int cma_req_handler(struct ib_cm_ release_conn_id: cma_exch(conn_id, CMA_DESTROYING); - cma_enable_remove(conn_id); + mutex_unlock(&conn_id->handler_mutex); rdma_destroy_id(&conn_id->id); out: - cma_enable_remove(listen_id); + mutex_unlock(&listen_id->handler_mutex); return ret; } @@ -1219,7 +1213,7 @@ static int cma_iw_handler(struct iw_cm_i struct sockaddr_in *sin; int ret = 0; - if (cma_disable_remove(id_priv, CMA_CONNECT)) + if (cma_disable_callback(id_priv, CMA_CONNECT)) return 0; memset(&event, 0, sizeof event); @@ -1263,12 +1257,12 @@ static int cma_iw_handler(struct iw_cm_i /* Destroy the CM ID by returning a non-zero value. */ id_priv->cm_id.iw = NULL; cma_exch(id_priv, CMA_DESTROYING); - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); rdma_destroy_id(&id_priv->id); return ret; } - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); return ret; } @@ -1284,7 +1278,7 @@ static int iw_conn_req_handler(struct iw struct ib_device_attr attr; listen_id = cm_id->context; - if (cma_disable_remove(listen_id, CMA_LISTEN)) + if (cma_disable_callback(listen_id, CMA_LISTEN)) return -ECONNABORTED; /* Create a new RDMA id for the new IW CM ID */ @@ -1296,19 +1290,19 @@ static int iw_conn_req_handler(struct iw goto out; } conn_id = container_of(new_cm_id, struct rdma_id_private, id); - atomic_inc(&conn_id->dev_remove); + mutex_lock(&conn_id->handler_mutex); conn_id->state = CMA_CONNECT; dev = ip_dev_find(&init_net, iw_event->local_addr.sin_addr.s_addr); if (!dev) { ret = -EADDRNOTAVAIL; - cma_enable_remove(conn_id); + mutex_unlock(&conn_id->handler_mutex); rdma_destroy_id(new_cm_id); goto out; } ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL); if (ret) { - cma_enable_remove(conn_id); + mutex_unlock(&conn_id->handler_mutex); rdma_destroy_id(new_cm_id); goto out; } @@ -1317,7 +1311,7 @@ static int iw_conn_req_handler(struct iw ret = cma_acquire_dev(conn_id); mutex_unlock(&lock); if (ret) { - cma_enable_remove(conn_id); + mutex_unlock(&conn_id->handler_mutex); rdma_destroy_id(new_cm_id); goto out; } @@ -1333,7 +1327,7 @@ static int iw_conn_req_handler(struct iw ret = ib_query_device(conn_id->id.device, &attr); if (ret) { - cma_enable_remove(conn_id); + mutex_unlock(&conn_id->handler_mutex); rdma_destroy_id(new_cm_id); goto out; } @@ -1349,14 +1343,14 @@ static int iw_conn_req_handler(struct iw /* User wants to destroy the CM ID */ conn_id->cm_id.iw = NULL; cma_exch(conn_id, CMA_DESTROYING); - cma_enable_remove(conn_id); + 
mutex_unlock(&conn_id->handler_mutex); rdma_destroy_id(&conn_id->id); } out: if (dev) dev_put(dev); - cma_enable_remove(listen_id); + mutex_unlock(&listen_id->handler_mutex); return ret; } @@ -1588,7 +1582,7 @@ static void cma_work_handler(struct work struct rdma_id_private *id_priv = work->id; int destroy = 0; - atomic_inc(&id_priv->dev_remove); + mutex_lock(&id_priv->handler_mutex); if (!cma_comp_exch(id_priv, work->old_state, work->new_state)) goto out; @@ -1597,7 +1591,7 @@ static void cma_work_handler(struct work destroy = 1; } out: - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); cma_deref_id(id_priv); if (destroy) rdma_destroy_id(&id_priv->id); @@ -1760,7 +1754,7 @@ static void addr_handler(int status, str struct rdma_cm_event event; memset(&event, 0, sizeof event); - atomic_inc(&id_priv->dev_remove); + mutex_lock(&id_priv->handler_mutex); /* * Grab mutex to block rdma_destroy_id() from removing the device while @@ -1789,13 +1783,13 @@ static void addr_handler(int status, str if (id_priv->id.event_handler(&id_priv->id, &event)) { cma_exch(id_priv, CMA_DESTROYING); - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); cma_deref_id(id_priv); rdma_destroy_id(&id_priv->id); return; } out: - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); cma_deref_id(id_priv); } @@ -2122,7 +2116,7 @@ static int cma_sidr_rep_handler(struct i struct ib_cm_sidr_rep_event_param *rep = &ib_event->param.sidr_rep_rcvd; int ret = 0; - if (cma_disable_remove(id_priv, CMA_CONNECT)) + if (cma_disable_callback(id_priv, CMA_CONNECT)) return 0; memset(&event, 0, sizeof event); @@ -2163,12 +2157,12 @@ static int cma_sidr_rep_handler(struct i /* Destroy the CM ID by returning a non-zero value. */ id_priv->cm_id.ib = NULL; cma_exch(id_priv, CMA_DESTROYING); - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); rdma_destroy_id(&id_priv->id); return ret; } out: - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); return ret; } @@ -2566,8 +2560,8 @@ static int cma_ib_mc_handler(int status, int ret; id_priv = mc->id_priv; - if (cma_disable_remove(id_priv, CMA_ADDR_BOUND) && - cma_disable_remove(id_priv, CMA_ADDR_RESOLVED)) + if (cma_disable_callback(id_priv, CMA_ADDR_BOUND) && + cma_disable_callback(id_priv, CMA_ADDR_RESOLVED)) return 0; mutex_lock(&id_priv->qp_mutex); @@ -2592,12 +2586,12 @@ static int cma_ib_mc_handler(int status, ret = id_priv->id.event_handler(&id_priv->id, &event); if (ret) { cma_exch(id_priv, CMA_DESTROYING); - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); rdma_destroy_id(&id_priv->id); return 0; } - cma_enable_remove(id_priv); + mutex_unlock(&id_priv->handler_mutex); return 0; } @@ -2756,22 +2750,26 @@ static int cma_remove_id_dev(struct rdma { struct rdma_cm_event event; enum cma_state state; - + int ret = 0; + /* Record that we want to remove the device */ state = cma_exch(id_priv, CMA_DEVICE_REMOVAL); if (state == CMA_DESTROYING) return 0; cma_cancel_operation(id_priv, state); - wait_event(id_priv->wait_remove, !atomic_read(&id_priv->dev_remove)); + mutex_lock(&id_priv->handler_mutex); /* Check for destruction from another callback. 
*/ if (!cma_comp(id_priv, CMA_DEVICE_REMOVAL)) - return 0; + goto out; memset(&event, 0, sizeof event); event.event = RDMA_CM_EVENT_DEVICE_REMOVAL; - return id_priv->id.event_handler(&id_priv->id, &event); + ret = id_priv->id.event_handler(&id_priv->id, &event); +out: + mutex_unlock(&id_priv->handler_mutex); + return ret; } static void cma_process_remove(struct cma_device *cma_dev)

From ogerlitz at voltaire.com Wed May 28 04:36:30 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 28 May 2008 14:36:30 +0300 (IDT) Subject: [ofa-general] [RFC V4 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_ADDR_CHANGE notification In-Reply-To: References: Message-ID:

The RDMA_CM_EVENT_ADDR_CHANGE event can be used by rdma-cm consumers that wish to have their RDMA sessions always use the same links (eg ) as the IP stack does. In the current code, this does not happen when bonding is used and a fail-over has occurred while the IB link used by an already existing session is still operating fine.

Use netdev event notification for sensing that a change has happened in the IP stack, then scan the rdma-cm ID list to see if there is an ID that is "misaligned" in that respect with the IP stack, and deliver RDMA_CM_EVENT_ADDR_CHANGE for this ID. The user can act on the event or just ignore it.

Signed-off-by: Or Gerlitz

changes from v2:
- took the approach of unconditionally notifying the user
- use the handler_mutex of the ID to serialize with other callbacks

changes from v3:
- check in cma_ndev_work_handler to make sure the ID is not getting destroyed
- change the event name to be RDMA_CM_EVENT_ADDR_CHANGE
- cma_netdev_align_id --> cma_netdev_change

As for the locking issues, I still have the double loop in cma_netdev_callback() being wrapped with the rdma-cm global mutex taken, as I explained over the thread.
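[Editor's note: a minimal sketch of how a ULP might consume the new event. The handler signature is the stock rdma-cm one; my_restart_session is a placeholder invented here for illustration.]

    #include <rdma/rdma_cm.h>

    static void my_restart_session(void *ctx)
    {
            /* ULP-specific: tear down the QP/ID and re-run address
             * resolution so the session follows the IP stack again */
    }

    static int my_cm_handler(struct rdma_cm_id *id,
                             struct rdma_cm_event *event)
    {
            switch (event->event) {
            case RDMA_CM_EVENT_ADDR_CHANGE:
                    /* The IP stack moved (e.g. bonding fail-over) and this
                     * ID no longer matches the netdev.  Re-resolve -- or
                     * just ignore the event and keep using the
                     * still-working IB link. */
                    my_restart_session(id->context);
                    return 0;       /* non-zero would destroy the ID */
            default:
                    return 0;
            }
    }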
drivers/infiniband/core/cma.c | 88 ++++++++++++++++++++++++++++++++++++++++++ include/rdma/rdma_cm.h | 3 - 2 files changed, 90 insertions(+), 1 deletion(-) Index: linux-2.6.26-rc3/drivers/infiniband/core/cma.c =================================================================== --- linux-2.6.26-rc3.orig/drivers/infiniband/core/cma.c 2008-05-28 11:08:24.000000000 +0300 +++ linux-2.6.26-rc3/drivers/infiniband/core/cma.c 2008-05-28 13:03:43.000000000 +0300 @@ -164,6 +164,12 @@ struct cma_work { struct rdma_cm_event event; }; +struct cma_ndev_work { + struct work_struct work; + struct rdma_id_private *id; + struct rdma_cm_event event; +}; + union cma_ip_addr { struct in6_addr ip6; struct { @@ -1598,6 +1604,28 @@ out: kfree(work); } +static void cma_ndev_work_handler(struct work_struct *_work) +{ + struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, work); + struct rdma_id_private *id_priv = work->id; + int destroy = 0; + + mutex_lock(&id_priv->handler_mutex); + if (id_priv->state == CMA_DESTROYING) + goto out; + + if (id_priv->id.event_handler(&id_priv->id, &work->event)) { + cma_exch(id_priv, CMA_DESTROYING); + destroy = 1; + } +out: + mutex_unlock(&id_priv->handler_mutex); + cma_deref_id(id_priv); + if (destroy) + rdma_destroy_id(&id_priv->id); + kfree(work); +} + static int cma_resolve_ib_route(struct rdma_id_private *id_priv, int timeout_ms) { struct rdma_route *route = &id_priv->id.route; @@ -2723,6 +2751,63 @@ void rdma_leave_multicast(struct rdma_cm } EXPORT_SYMBOL(rdma_leave_multicast); +static int cma_netdev_change(struct net_device *ndev, struct rdma_id_private *id_priv) +{ + struct rdma_dev_addr *dev_addr; + struct cma_ndev_work *work; + + dev_addr = &id_priv->id.route.addr.dev_addr; + + if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && + memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { + printk(KERN_ERR "addr change for device %s used by id %p, notifying\n", + ndev->name, &id_priv->id); + work = kzalloc(sizeof *work, GFP_ATOMIC); + if (!work) + return -ENOMEM; + INIT_WORK(&work->work, cma_ndev_work_handler); + work->id = id_priv; + work->event.event = RDMA_CM_EVENT_ADDR_CHANGE; + atomic_inc(&id_priv->refcount); + queue_work(cma_wq, &work->work); + } + + return 0; +} + +static int cma_netdev_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + struct net_device *ndev = (struct net_device *)ctx; + struct cma_device *cma_dev; + struct rdma_id_private *id_priv; + int ret = NOTIFY_DONE; + + if (dev_net(ndev) != &init_net) + return NOTIFY_DONE; + + if (event != NETDEV_BONDING_FAILOVER) + return NOTIFY_DONE; + + if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) + return NOTIFY_DONE; + + mutex_lock(&lock); + list_for_each_entry(cma_dev, &dev_list, list) + list_for_each_entry(id_priv, &cma_dev->id_list, list) { + ret = cma_netdev_change(ndev, id_priv); + if (ret) + break; + } + mutex_unlock(&lock); + + return ret; +} + +static struct notifier_block cma_nb = { + .notifier_call = cma_netdev_callback +}; + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; @@ -2831,6 +2916,7 @@ static int cma_init(void) ib_sa_register_client(&sa_client); rdma_addr_register_client(&addr_client); + register_netdevice_notifier(&cma_nb); ret = ib_register_client(&cma_client); if (ret) @@ -2838,6 +2924,7 @@ static int cma_init(void) return 0; err: + unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); @@ 
-2847,6 +2934,7 @@ err: static void cma_cleanup(void) { ib_unregister_client(&cma_client); + unregister_netdevice_notifier(&cma_nb); rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); Index: linux-2.6.26-rc3/include/rdma/rdma_cm.h =================================================================== --- linux-2.6.26-rc3.orig/include/rdma/rdma_cm.h 2008-05-28 10:34:27.000000000 +0300 +++ linux-2.6.26-rc3/include/rdma/rdma_cm.h 2008-05-28 12:55:31.000000000 +0300 @@ -53,7 +53,8 @@ enum rdma_cm_event_type { RDMA_CM_EVENT_DISCONNECTED, RDMA_CM_EVENT_DEVICE_REMOVAL, RDMA_CM_EVENT_MULTICAST_JOIN, - RDMA_CM_EVENT_MULTICAST_ERROR + RDMA_CM_EVENT_MULTICAST_ERROR, + RDMA_CM_EVENT_ADDR_CHANGE }; enum rdma_port_space {

From eli at dev.mellanox.co.il Wed May 28 05:05:03 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 28 May 2008 15:05:03 +0300 Subject: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode In-Reply-To: References: Message-ID: <1211976303.13769.155.camel@mtls03>

> In this case, how many tx drop packets from ifconfig output? Should we > see ifconfig tx drop packets + tx successfully transmit packets close > to netperf packets? That's right. > > Any TCP STREAM test results to share here? TCP won't demonstrate the problem since it uses Nagle's algorithm to aggregate data into full sized packets. > > thanks > Shirley >

From Thomas.Talpey at netapp.com Wed May 28 05:06:45 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 May 2008 08:06:45 -0400 Subject: [ofa-general] Infiniband back-to-back without OpenSM? Message-ID:

Is it possible to manually configure two Infiniband ports to operate with one another in back-to-back mode, without running OpenSM on one of them? We have done this on other IB implementations by manually assigning LIDs, but I discover that the "lid" entry below /sys/class/infiniband/ is not writable, at least for mthca. Also, I expect that the ipoib driver will be unable to join the broadcast group, so will be unwilling to come up fully. With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why not IB? If you're wondering, my goal is to give NFS/RDMA users a way to avoid having to install the many userspace modules needed to do this, including libibverbs, opensm, etc. There's a lot to get wrong, and things go missing. Seeking an "easy" way to get started with just the kernel and some shell commands. Tom.

From hrosenstock at xsigo.com Wed May 28 05:39:29 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 28 May 2008 05:39:29 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: References: Message-ID: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com>

Tom, On Wed, 2008-05-28 at 08:06 -0400, Talpey, Thomas wrote: > Is it possible to manually configure two Infiniband ports to operate > with one another in back-to-back mode, without running OpenSM > on one of them? This is possible, but something would need to do at least some subset of what the SM does, depending on the precise requirements and the limits placed on the environment supported without a "full blown" SM. > We have done this on other IB implementations by manually assigning > LIDs, but I discover that the "lid" entry below /sys/class/infiniband/ > is not writable, at least for mthca. This can be done via MADs, so the user_mad kernel module would be needed to do this.
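[Editor's note: Hal's "can be done via MADs" deserves a concrete illustration. Below is a rough, untested sketch along the lines of what ibportstate does, written against the libibmad helpers of this era; treat every call and field name here as an assumption to verify against your installed headers.]

    #include <infiniband/mad.h>

    /* Write a LID into the local port's PortInfo with directed-route SMPs
     * (the zeroed portid is a zero-hop DR path, i.e. the local port). */
    static int set_local_lid(uint16_t lid)
    {
            int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
            ib_portid_t portid = { 0 };
            uint8_t data[IB_SMP_DATA_SIZE];

            madrpc_init(NULL, 0, mgmt_classes, 2); /* default HCA, port */
            if (!smp_query(data, &portid, IB_ATTR_PORT_INFO, 0, 0))
                    return -1;                     /* read current PortInfo */
            mad_set_field(data, 0, IB_PORT_LID_F, lid);
            if (!smp_set(data, &portid, IB_ATTR_PORT_INFO, 0, 0))
                    return -1;                     /* write it back */
            return 0;
    }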
> Also, I expect that the ipoib driver will > be unable to join the broadcast group, so will be unwilling to come up fully. Is IPoIB a requirement ? > With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why > not IB? The simple answer is that it is the nature of IB management (being different than ethernet). -- Hal > If you're wondering, my goal is give NFS/RDMA users a way to avoid having > to install the many userspace modules needed to do this, including libibverbs, > opensm, etc. There's a lot to get wrong, and things go missing. Seeking an > "easy" way to get started with just the kernel and some shell commands. > > Tom. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Thomas.Talpey at netapp.com Wed May 28 05:56:53 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 May 2008 08:56:53 -0400 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> Message-ID: At 08:39 AM 5/28/2008, Hal Rosenstock wrote: >Tom, > >On Wed, 2008-05-28 at 08:06 -0400, Talpey, Thomas wrote: >> Is it possible to manually configure two Infiniband ports to operate >> with one another in back-to-back mode, without running OpenSM >> on one of them? > >This is possible but something would need to do at least some subset of >what the SM does depending on the precise requirements and the limits >placed on the environment supported without a "full blown" SM. Okay ... but IMO the only thing we need is a LID. Or at least, in my experience all I've needed is a LID. In a previous effort, we simply stole the low octet of an IP address, so we'd "ifconfig ib0 1.2.3.X" and it would jam lid=X into the interface. Worked great. If necessary, we would set a manual arp entry (using iproute) to avoid having to broadcast. > >> We have done this on other IB implementations by manually assigning >> LIDs, but I discover that the "lid" entry below >/sys/class/infiniband/ >> is not writable, at least for mthca. > >This can be done via MADs so user_mad kernel module would be needed to >do this. Okay, all kernel modules can be assumed to be in place. How do we tell it to manage the LID, with a shell command? > >> Also, I expect that the ipoib driver will >> be unable to join the broadcast group, so will be unwilling to come up fully. > >Is IPoIB a requirement ? I think so, for two reasons. One, principle of least surprise - the user will expect to be able to ping, telnet etc if it has connectivity. Two, for NFS/RDMA we require TCP and UDP connections in order to perform the mount and do locking and recovery. We could do those over a parallel ethernet connection, but that's kind of not the point. > >> With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why >> not IB? > >The simple answer is that it is the nature of IB management (being >different than ethernet). Which, IMO, we need to boil down to simplest-possible, for at least some workable configuration. Thanks for the ideas! Tom. > >-- Hal > >> If you're wondering, my goal is give NFS/RDMA users a way to avoid having >> to install the many userspace modules needed to do this, including >libibverbs, >> opensm, etc. There's a lot to get wrong, and things go missing. 
Seeking an >> "easy" way to get started with just the kernel and some shell commands. >> >> Tom. >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Wed May 28 06:03:37 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 28 May 2008 06:03:37 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211979817.13185.357.camel@hrosenstock-ws.xsigo.com> On Wed, 2008-05-28 at 08:56 -0400, Talpey, Thomas wrote: > At 08:39 AM 5/28/2008, Hal Rosenstock wrote: > >Tom, > > > >On Wed, 2008-05-28 at 08:06 -0400, Talpey, Thomas wrote: > >> Is it possible to manually configure two Infiniband ports to operate > >> with one another in back-to-back mode, without running OpenSM > >> on one of them? > > > >This is possible but something would need to do at least some subset of > >what the SM does depending on the precise requirements and the limits > >placed on the environment supported without a "full blown" SM. > > Okay ... but IMO the only thing we need is a LID. Or at least, in my experience > all I've needed is a LID. The port also needs to be walked from init to active which takes coordination at both ends of the b2b link. > In a previous effort, we simply stole the low octet of an IP address, so we'd > "ifconfig ib0 1.2.3.X" and it would jam lid=X into the interface. Worked great. > If necessary, we would set a manual arp entry (using iproute) to avoid having > to broadcast. That could be done if that is what is desired and can be relied upon (that ib0 is configured and we only care about the first port). Is it just ARP support that is needed ? > >> We have done this on other IB implementations by manually assigning > >> LIDs, but I discover that the "lid" entry below > >/sys/class/infiniband/ > >> is not writable, at least for mthca. > > > >This can be done via MADs so user_mad kernel module would be needed to > >do this. > > Okay, all kernel modules can be assumed to be in place. How do we tell it > to manage the LID, with a shell command? A new "command" would be needed. -- Hal > >> Also, I expect that the ipoib driver will > >> be unable to join the broadcast group, so will be unwilling to come up fully. > > > >Is IPoIB a requirement ? > > I think so, for two reasons. One, principle of least surprise - the user will > expect to be able to ping, telnet etc if it has connectivity. Two, for NFS/RDMA > we require TCP and UDP connections in order to perform the mount and do > locking and recovery. We could do those over a parallel ethernet connection, > but that's kind of not the point. > > > > >> With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why > >> not IB? > > > >The simple answer is that it is the nature of IB management (being > >different than ethernet). > > Which, IMO, we need to boil down to simplest-possible, for at least some > workable configuration. > > Thanks for the ideas! > > Tom. > > > > >-- Hal > > > >> If you're wondering, my goal is give NFS/RDMA users a way to avoid having > >> to install the many userspace modules needed to do this, including > >libibverbs, > >> opensm, etc. There's a lot to get wrong, and things go missing. 
Seeking an > >> "easy" way to get started with just the kernel and some shell commands. > >> > >> Tom. > >> > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit > >http://openib.org/mailman/listinfo/openib-general >

From Thomas.Talpey at netapp.com Wed May 28 06:24:21 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 May 2008 09:24:21 -0400 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <1211979817.13185.357.camel@hrosenstock-ws.xsigo.com> References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> <1211979817.13185.357.camel@hrosenstock-ws.xsigo.com> Message-ID:

At 09:03 AM 5/28/2008, Hal Rosenstock wrote: >On Wed, 2008-05-28 at 08:56 -0400, Talpey, Thomas wrote: >> At 08:39 AM 5/28/2008, Hal Rosenstock wrote: >> >Tom, >> > >> >On Wed, 2008-05-28 at 08:06 -0400, Talpey, Thomas wrote: >> >> Is it possible to manually configure two Infiniband ports to operate >> >> with one another in back-to-back mode, without running OpenSM >> >> on one of them? >> > >> >This is possible but something would need to do at least some subset of >> >what the SM does depending on the precise requirements and the limits >> >placed on the environment supported without a "full blown" SM. >> >> Okay ... but IMO the only thing we need is a LID. Or at least, in my experience >> all I've needed is a LID. > >The port also needs to be walked from init to active which takes >coordination at both ends of the b2b link.

Yep. But, it has all it needs with a LID, right? No messages need to be exchanged, for instance.

> >> In a previous effort, we simply stole the low octet of an IP address, so we'd >> "ifconfig ib0 1.2.3.X" and it would jam lid=X into the interface. >Worked great. >> If necessary, we would set a manual arp entry (using iproute) to avoid having >> to broadcast. > >That could be done if that is what is desired and can be relied upon >(that ib0 is configured and we only care about the first port). > >Is it just ARP support that is needed ?

Well, ARP is the precursor to establishing an IP send and a TCP connection, which we need to do also. But, if the resulting ipaddr-hwaddr mapping is installed, then ARP is unnecessary and the IP layer can send without using it. When we did this before, we'd install a "permanent" ARP entry, in a two-line shell script. Roughly, for peers configuring lids X and Y, it would do

peer X:
    ifconfig ib0 1.2.3.X
    ip neigh add 1.2.3.Y nud permanent lladdr a.b.c.d.e.f....Y (i.e. Y's guid)

peer Y:
    ifconfig ib0 1.2.3.Y
    ip neigh add 1.2.3.X nud permanent lladdr a.b.c.d.e.f....X

And we'd be up and running for both IP and RDMA connections. We fixed a bug in the old iproute2 command to allow the long IB link addresses. I'm thinking that using IPOIB to drive this kind of manual setup is one way to approach it. It certainly would be simple, and worked for us before there was an OFA stack. Maybe I'm getting ahead of myself though, still wondering if there's a way to do it with what we have. Tom.

> >> >> We have done this on other IB implementations by manually assigning >> >> LIDs, but I discover that the "lid" entry below >> >/sys/class/infiniband/ >> >> is not writable, at least for mthca. >> > >> >This can be done via MADs so user_mad kernel module would be needed to >> >do this. >> >> Okay, all kernel modules can be assumed to be in place.
How do we tell it >> to manage the LID, with a shell command? > >A new "command" would be needed. > >-- Hal > >> >> Also, I expect that the ipoib driver will >> >> be unable to join the broadcast group, so will be unwilling to >come up fully. >> > >> >Is IPoIB a requirement ? >> >> I think so, for two reasons. One, principle of least surprise - the user will >> expect to be able to ping, telnet etc if it has connectivity. Two, >for NFS/RDMA >> we require TCP and UDP connections in order to perform the mount and do >> locking and recovery. We could do those over a parallel ethernet connection, >> but that's kind of not the point. >> >> > >> >> With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why >> >> not IB? >> > >> >The simple answer is that it is the nature of IB management (being >> >different than ethernet). >> >> Which, IMO, we need to boil down to simplest-possible, for at least some >> workable configuration. >> >> Thanks for the ideas! >> >> Tom. >> >> > >> >-- Hal >> > >> >> If you're wondering, my goal is give NFS/RDMA users a way to avoid having >> >> to install the many userspace modules needed to do this, including >> >libibverbs, >> >> opensm, etc. There's a lot to get wrong, and things go missing. Seeking an >> >> "easy" way to get started with just the kernel and some shell commands. >> >> >> >> Tom. >> >> >> >> _______________________________________________ >> >> general mailing list >> >> general at lists.openfabrics.org >> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> >> >> To unsubscribe, please visit >> >http://openib.org/mailman/listinfo/openib-general >> > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Wed May 28 06:34:10 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 28 May 2008 06:34:10 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> <1211979817.13185.357.camel@hrosenstock-ws.xsigo.com> Message-ID: <1211981650.13185.362.camel@hrosenstock-ws.xsigo.com> On Wed, 2008-05-28 at 09:24 -0400, Talpey, Thomas wrote: > At 09:03 AM 5/28/2008, Hal Rosenstock wrote: > >On Wed, 2008-05-28 at 08:56 -0400, Talpey, Thomas wrote: > >> At 08:39 AM 5/28/2008, Hal Rosenstock wrote: > >> >Tom, > >> > > >> >On Wed, 2008-05-28 at 08:06 -0400, Talpey, Thomas wrote: > >> >> Is it possible to manually configure two Infiniband ports to operate > >> >> with one another in back-to-back mode, without running OpenSM > >> >> on one of them? > >> > > >> >This is possible but something would need to do at least some subset of > >> >what the SM does depending on the precise requirements and the limits > >> >placed on the environment supported without a "full blown" SM. > >> > >> Okay ... but IMO the only thing we need is a LID. Or at least, in my > >experience > >> all I've needed is a LID. > > > >The port also needs to be walked from init to active which takes > >coordination at both ends of the b2b link. > > Yep. But, it has all it needs with a LID, right? No messages need to be > exchanged, for instance. It's more than a LID and messages do need to be exchanged (mini SM -> SMA) to walk the port from INIT to ACTIVE. This needs to be coordinated on both sides of the link so they move in rough concert. 
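[Editor's note: a sketch of the INIT-to-ACTIVE walk Hal describes, under the same hedged libibmad assumptions as the earlier LID sketch; both peers would run the same steps in lockstep, waiting for each other between steps.]

    /* PortState encoding in PortInfo: 1=DOWN, 2=INIT, 3=ARMED, 4=ACTIVE.
     * After LIDs are assigned, each side steps its port
     * INIT -> ARMED -> ACTIVE in rough concert with its peer. */
    static int step_port_state(ib_portid_t *portid, unsigned state)
    {
            uint8_t data[IB_SMP_DATA_SIZE];

            if (!smp_query(data, portid, IB_ATTR_PORT_INFO, 0, 0))
                    return -1;
            mad_set_field(data, 0, IB_PORT_STATE_F, state);
            return smp_set(data, portid, IB_ATTR_PORT_INFO, 0, 0) ? 0 : -1;
    }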
> >> In a previous effort, we simply stole the low octet of an IP address, so we'd > >> "ifconfig ib0 1.2.3.X" and it would jam lid=X into the interface. > >Worked great. > >> If necessary, we would set a manual arp entry (using iproute) to avoid having > >> to broadcast. > > > >That could be done if that is what is desired and can be relied upon > >(that ib0 is configured and we only care about the first port). > > > >Is it just ARP support that is needed ? > > Well, ARP is the precursor to establishing an IP send and a TCP connection, > which we need to do also. I was just asking about other broadcast/multicast needs. Sounds like this is not the case. > But, if the resulting ipaddr-hwaddr mapping is > installed, then ARP is unnecessary and the IP layer can send without using it. > > When we did this before, we'd install a "permanent" ARP entry, in a two-line > shell script. Roughly, for peers configuring lids X and Y, it would do > > peer X: > ifconfig ib0 1.2.3.X > ip neigh add 1.2.3.Y nud permanent lladdr a.b.c.d.e.f....Y (i.e. Y's guid) > > peer Y: > ifconfig ib0 1.2.3.Y > ip neigh add 1.2.3.X nud permanent lladdr a.b.c.d.e.f....X > > And we'd be up and running for both IP and RDMA connections. We fixed a > bug in the old iproute2 command to allow the long IB link addresses. > > I'm thinking that using IPOIB to drive this kind of manual setup is one way > to approach it. It certainly would be simple, and worked for us before there > was an OFA stack. This would still work. > Maybe I'm getting ahead of myself though, still wondering if there's a way > to do it with what we have. The closest thing is OpenSM run once mode but I think you've been describing a b2b mini SM command which wouldn't be hard to implement. -- Hal > Tom. > > > > >> >> We have done this on other IB implementations by manually assigning > >> >> LIDs, but I discover that the "lid" entry below > >> >/sys/class/infiniband/ > >> >> is not writable, at least for mthca. > >> > > >> >This can be done via MADs so user_mad kernel module would be needed to > >> >do this. > >> > >> Okay, all kernel modules can be assumed to be in place. How do we tell it > >> to manage the LID, with a shell command? > > > >A new "command" would be needed. > > > >-- Hal > > > >> >> Also, I expect that the ipoib driver will > >> >> be unable to join the broadcast group, so will be unwilling to > >come up fully. > >> > > >> >Is IPoIB a requirement ? > >> > >> I think so, for two reasons. One, principle of least surprise - the user will > >> expect to be able to ping, telnet etc if it has connectivity. Two, > >for NFS/RDMA > >> we require TCP and UDP connections in order to perform the mount and do > >> locking and recovery. We could do those over a parallel ethernet connection, > >> but that's kind of not the point. > >> > >> > > >> >> With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why > >> >> not IB? > >> > > >> >The simple answer is that it is the nature of IB management (being > >> >different than ethernet). > >> > >> Which, IMO, we need to boil down to simplest-possible, for at least some > >> workable configuration. > >> > >> Thanks for the ideas! > >> > >> Tom. > >> > >> > > >> >-- Hal > >> > > >> >> If you're wondering, my goal is give NFS/RDMA users a way to avoid having > >> >> to install the many userspace modules needed to do this, including > >> >libibverbs, > >> >> opensm, etc. There's a lot to get wrong, and things go missing. 
Seeking an > >> >> "easy" way to get started with just the kernel and some shell commands. > >> >> > >> >> Tom. > >> >> > >> >> _______________________________________________ > >> >> general mailing list > >> >> general at lists.openfabrics.org > >> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> >> > >> >> To unsubscribe, please visit > >> >http://openib.org/mailman/listinfo/openib-general >

From tziporet at dev.mellanox.co.il Wed May 28 06:49:26 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 28 May 2008 16:49:26 +0300 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: References: Message-ID: <483D62E6.2050107@mellanox.co.il>

Talpey, Thomas wrote: > > If you're wondering, my goal is give NFS/RDMA users a way to avoid having > to install the many userspace modules needed to do this, including libibverbs, > opensm, etc. There's a lot to get wrong, and things go missing. Seeking an > "easy" way to get started with just the kernel and some shell commands. > > No need for libibverbs for opensm, just the management libraries. Tziporet

From chu11 at llnl.gov Wed May 28 03:55:25 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 28 May 2008 06:55:25 -0400 Subject: [ofa-general] OpenSM? In-Reply-To: <483CFAB1.5020409@dev.mellanox.co.il> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> <20080527100859.6d48cd45.weiny2@llnl.gov> <483CFAB1.5020409@dev.mellanox.co.il> Message-ID: <1211972125.5192.5.camel@whatsup>

Hey Yevgeny, > Did ftree fail to digest the topology? For the current OpenSM, on at least one cluster, yes. I've been meaning to look into it. Just haven't gotten to it yet. There is some legacy too. Earlier versions of ftree weren't able to handle a number of corner cases, so updn was used instead (and now is still used). > Or perhaps that you need LMC>0, which ftree doesn't support? This too. Al

On Wed, 2008-05-28 at 09:24 +0300, Yevgeny Kliteynik wrote: > Ira, > > Ira Weiny wrote: > > > > We run with the up/down algorithm, ftree has not panned out for us yet. > Can you elaborate on that? > Did ftree fail to digest the topology? Or did it do a lousy job configuring > the subnet? Or perhaps that you need LMC>0, which ftree doesn't support?
> -- Yevgeny > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory
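[Editor's note: for readers following the ftree/updn discussion, the routing engine and LMC are opensm command-line choices -- a hedged sketch; the option spellings are from opensm usage of this era and worth verifying against your build's opensm --help.]

    # request the fat-tree routing engine
    opensm -R ftree

    # fall back to up/down, with LMC=1 for multipathing
    # (ftree, as discussed above, does not support LMC>0)
    opensm -R updn -l 1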
From ramachandra.kuchimanchi at qlogic.com Wed May 28 07:18:28 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Wed, 28 May 2008 19:48:28 +0530 Subject: [ofa-general] Re: [PATCH v2 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx In-Reply-To: References: <20080519102843.12355.832.stgit@localhost.localdomain> <20080519103258.12355.6146.stgit@localhost.localdomain> Message-ID: <71d336490805280718j79c3ea24j4408b851eb0a23ab@mail.gmail.com>

Roland, Thanks. Will fix both the items you pointed out. Regards, Ram

On Wed, May 28, 2008 at 10:58 AM, Roland Dreier wrote: > > +void viport_disconnect(struct viport *viport) > > +{ > > + VIPORT_FUNCTION("viport_disconnect()\n"); > > + viport->disconnect = 1; > > + viport_failure(viport); > > + wait_event(viport->disconnect_queue, viport->disconnect == 0); > > +} > > + > > +void viport_free(struct viport *viport) > > +{ > > + VIPORT_FUNCTION("viport_free()\n"); > > + viport_disconnect(viport); /* NOTE: this can sleep */ > > There are no other calls to viport_disconnect() that I can see, so it > can be made static (and the declaration in vnic_viport.h can be dropped). > in fact given how small the function is and the fact that it has only a > single call site, it might be easier just to merge it into > viport_free(). But that's a matter of taste. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >

From Thomas.Talpey at netapp.com Wed May 28 07:31:38 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Wed, 28 May 2008 10:31:38 -0400 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <483D62E6.2050107@mellanox.co.il> References: <483D62E6.2050107@mellanox.co.il> Message-ID:

At 09:49 AM 5/28/2008, Tziporet Koren wrote: >Talpey, Thomas wrote: >> >> If you're wondering, my goal is give NFS/RDMA users a way to avoid having >> to install the many userspace modules needed to do this, including >libibverbs, >> opensm, etc. There's a lot to get wrong, and things go missing. Seeking an >> "easy" way to get started with just the kernel and some shell commands. >> >> >No need for libibverbs for opensm, just the management libraries.

Hmm, well expanding on my "etc.", at the NFS Connectathon we were trying to install it on a green Fedora 9 system, and the RedHat opensm package (OFED 1.3) wanted the following prerequisites:

    libibcommon
    libibumad
    opensm_lib
    opensm

In addition we needed to install

    libmthca

which in turn wanted

    libibverbs

And, after we successfully loaded it all and started opensm, etc, then ipoib wouldn't come up to RUNNING because it couldn't join the broadcast group. At which point, we threw up our hands because all we wanted to do was make a b2b connection, so we admitted defeat and connected to a managed switch. :-/ I'm just looking to make it easier for the next guy/gal. ;-) Tom.

From kliteyn at dev.mellanox.co.il Wed May 28 07:44:27 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 28 May 2008 17:44:27 +0300 Subject: [ofa-general] OpenSM?
In-Reply-To: <1211972125.5192.5.camel@whatsup> References: <20080516223256.27221.34568.stgit@dell3.ogc.int> <20080516223419.27221.49014.stgit@dell3.ogc.int> <483285DC.20003@voltaire.com> <4832D850.2010102@opengridcomputing.com> <4833EA6B.9000705@voltaire.com> <26E4A768C32B0749989A2C07675EBABD12EA29@mtlexch01.mtl.com> <48351310.5090108@voltaire.com> <483578EA.4070503@opengridcomputing.com> <48358428.2000902@voltaire.com> <48370AEE.7080507@opengridcomputing.com> <5E22F547-0F44-4E56-A72C-DE9FC83EE697@hpc.ufl.edu> <20080527100859.6d48cd45.weiny2@llnl.gov> <483CFAB1.5020409@dev.mellanox.co.il> <1211972125.5192.5.camel@whatsup> Message-ID: <483D6FCB.8020308@dev.mellanox.co.il>

Hi Al, Al Chu wrote: > Hey Yevgeny, > >> Did ftree fail to digest the topology? > > For the current OpenSM, on at least one cluster, yes. I've been meaning > to look into it. Just haven't gotten to it yet. OK, let me know if ftree is still unable to work on your topology when you do get back to this - perhaps I'll need to tune it up a bit. > There is some legacy too. Earlier versions of ftree weren't able to handle a > number of corner cases, so updn was used instead (and now is still > used). > >> Or perhaps that you need LMC>0, which ftree doesn't support? > > This too. Unless, of course, you need LMC>0, in which case ftree is useless. -- Yevgeny > Al > > On Wed, 2008-05-28 at 09:24 +0300, Yevgeny Kliteynik wrote: >> Ira, >> >> Ira Weiny wrote: >>> We run with the up/down algorithm, ftree has not panned out for us yet. >> Can you elaborate on that? >> Did ftree fail to digest the topology? Or did it do a lousy job configuring >> the subnet? Or perhaps that you need LMC>0, which ftree doesn't support? >> >> -- Yevgeny >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From hrosenstock at xsigo.com Wed May 28 08:24:29 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Wed, 28 May 2008 08:24:29 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: References: <483D62E6.2050107@mellanox.co.il> Message-ID: <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com>

On Wed, 2008-05-28 at 10:31 -0400, Talpey, Thomas wrote: > In addition we needed to install > libmthca > which in turn wanted > libibverbs AFAIK, there is no OpenSM requirement for these libraries. > And, after we successfully loaded it all and started opensm, etc, > then ipoib wouldn't come up to RUNNING because it couldn't join > the broadcast group. This was likely due to IPoIB needing to be enabled on the default partition. From the 1.3 opensm man page,

PARTITION CONFIGURATION
    The default name of OpenSM partitions configuration file is
    /etc/opensm/partitions.conf. The default may be changed by using
    --Pconfig (-P) option with OpenSM.
...
The following rule is equivalent to how OpenSM used to run prior to the partition manager: Default=0x7fff,ipoib:ALL=full; -- Hal From weiny2 at llnl.gov Wed May 28 09:06:29 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 28 May 2008 09:06:29 -0700 Subject: [ofa-general] Re: [PATCH] saquery: --smkey command line option In-Reply-To: <1211972790.13185.332.camel@hrosenstock-ws.xsigo.com> References: <20080522135329.GB32128@sashak.voltaire.com> <1211467609.18236.171.camel@hrosenstock-ws.xsigo.com> <20080522145607.GE32128@sashak.voltaire.com> <1211469029.18236.188.camel@hrosenstock-ws.xsigo.com> <20080523100634.GD4164@sashak.voltaire.com> <1211541313.13185.80.camel@hrosenstock-ws.xsigo.com> <20080523123414.GB4640@sashak.voltaire.com> <1211547161.13185.103.camel@hrosenstock-ws.xsigo.com> <1211887752.13185.212.camel@hrosenstock-ws.xsigo.com> <20080527175637.GB14205@sashak.voltaire.com> <1211972790.13185.332.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080528090629.3ca96d30.weiny2@llnl.gov> On Wed, 28 May 2008 04:06:30 -0700 Hal Rosenstock wrote: > On Tue, 2008-05-27 at 20:56 +0300, Sasha Khapyorsky wrote: > > On 04:29 Tue 27 May , Hal Rosenstock wrote: > > > On Fri, 2008-05-23 at 05:52 -0700, Hal Rosenstock wrote: > > > > > Following your logic we will need to disable root passwords > > > > > typing too. > > > > > > > > That's taking it too far. Root passwords are at least hidden when > > > > typing. > > > > > > At least hide the key typing from plain sight when typing like su does. > > > > There are lot of tools where password can be specified as clear text in > > command line (wget, smbclient, etc..) - it is an user responsibility to > > keep his sensitive data safe. > > Do those tools provide a way to obscure passwords or force the user to > do this in plain sight ? Seems like a user can't do this without support > from the tool. smbclient seems to provide this; I didn't look at wget. > > smbclient supports an authorization file which supports this and says: > Make certain that the permissions on the file restrict access > from unwanted users. > > As you mentioned before, this is another acceptable approach (and this > also lends itself better to scripting). Another example of this is MySQL. From the man page: shell> mysql --user=user_name --password=your_password db_name With the plugin I just released I install a config file with this password accessible only to root. If someone runs OpenSM as another user or has other programs trying to access the DB (like SKUMMEE) then you will have to set the permissions on this file appropriately. I think I like the addition of a conf file for the scripts... Ira > > -- Hal > > > Sasha > From rdreier at cisco.com Wed May 28 09:11:37 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 May 2008 09:11:37 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: (Thomas Talpey's message of "Wed, 28 May 2008 10:31:38 -0400") References: <483D62E6.2050107@mellanox.co.il> Message-ID: > Hmm, well expanding on my "etc.", at the NFS Connectathon we were > trying to install it on a green Fedora 9 system, and the RedHat opensm > package (OFED 1.3) wanted the following prerequisites: > libibcommon > libibumad > opensm_lib > opensm I think the solution to this is not to do a bunch more work to create an opensm-free solution, but just to package opensm properly so it can get into Fedora. Then installation is just "yum install opensm". is a very detailed step-by-step guide to what is necessary. 
> In addition we needed to install > libmthca > which in turn wanted > libibverbs This isn't needed for opensm or NFS/RDMA, but it is very easy on Fedora 9: just "yum install libmthca". - R.

From dotanba at gmail.com Wed May 28 11:24:28 2008 From: dotanba at gmail.com (Dotan Barak) Date: Wed, 28 May 2008 20:24:28 +0200 Subject: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? In-Reply-To: References: Message-ID: <483DA35C.3080502@gmail.com>

Yicheng Jia wrote: > > Thanks for your reply. I'm using one CQ for all the WRs. Do you know > why there's no ARM-N support in MLX drivers? I don't know if I can speak in the name of the Mellanox/MLX driver maintainers, but I think that the reason is lack of demand for this feature (but I can't be sure). > My concern is the performance. The overhead of software poll_cq loop > is quite significant if there are multiple pieces of small amount of > data to be transferred on both sender/receiver sides. For instance, on > the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver > side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you > have a good solution for such kind of problem? How many QPs do you use? (and how many outstanding WRs from every QP?) Dotan

> Best, > Yicheng > > > > *Dotan Barak * > > 05/23/2008 01:27 PM > > > To > Yicheng Jia > cc > general at lists.openfabrics.org > Subject > Re: [ofa-general] MLX HCA: CQ request notification for multiple > completions not implemented? > > > > Hi. > > Yicheng Jia wrote: > > > > Hi Folks, > > > > I'm trying to use CQ Event notification for multiple completions > > (ARM_N) according to Mellanox Lx III user manual for scatter/gathering > > RDMA. However I couldn't find it in current MLX driver. It seems to me > > that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are > > multiple work requests, I have to use "poll_cq" to synchronously wait > > until all the requests are done, is it correct? Is there a way to do > > asynchronous multiple send by subscribing for a ARM_N event? > You are right: the low level drivers of Mellanox devices don't support > ARM-N > (This feature is supported by the devices, but it wasn't implemented in > the low level drivers). > > You are right, in order to read all of the completions you need to use > poll_cq. > > By the way: Do you have to create a completion for any WR? > (if you are using one QP, this will maybe solve your problem). > > Dotan
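[Editor's note: absent ARM-N, the usual userspace shape of the "arm and then poll_cq" advice above looks roughly like this -- a minimal libibverbs sketch; the function name is invented and error handling is trimmed.]

    #include <infiniband/verbs.h>

    /* Block for one CQ event, re-arm, then drain every completion.
     * Re-arming before the drain ensures no completion can slip in
     * between the final empty poll and the next notification. */
    static int wait_and_drain_cq(struct ibv_comp_channel *ch)
    {
            struct ibv_cq *cq;
            struct ibv_wc wc;
            void *ctx;
            int n;

            if (ibv_get_cq_event(ch, &cq, &ctx))
                    return -1;
            ibv_ack_cq_events(cq, 1);
            if (ibv_req_notify_cq(cq, 0))
                    return -1;
            while ((n = ibv_poll_cq(cq, 1, &wc)) > 0) {
                    /* handle wc.wr_id / wc.status here */
            }
            return n;       /* 0 when drained, negative on error */
    }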
From dotanba at gmail.com Wed May 28 11:31:46 2008 From: dotanba at gmail.com (Dotan Barak) Date: Wed, 28 May 2008 20:31:46 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <483BBBDB.6000605@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> Message-ID: <483DA512.2070403@gmail.com>

Marcel Heinz wrote: > Marcel Heinz wrote: > >> Dotan Barak wrote: > >>> Do you use the latest released FW for this device? >>> >> The HCAs all use Mellanox' latest released FW version 1.2.0. I'll have a >> look at the switch later. >> > The Switch is Mellanox MT47396 based and uses FW version 1.0.0. This isn't the latest one, but I don't see anything in the release notes of the 1.0.5 firmware which is related to our problem. > 1) I know that ib_send_bw supports multicast as well; can you please check that you can reproduce your problem on this benchmark too? 2) You should expect multicast messages to be slower than unicast, because the HCA/switch treat them in a different way (message duplication needs to be done when needed). Dotan

From sean.hefty at intel.com Wed May 28 10:59:34 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 May 2008 10:59:34 -0700 Subject: [ofa-general] RE: [RFC V3 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_NETDEV_CHANGE notification In-Reply-To: <483D4283.60307@voltaire.com> References: <000401c8c015$75c37890$ebc8180a@amr.corp.intel.com> <483D4283.60307@voltaire.com> Message-ID: <000601c8c0ec$96492510$6758180a@amr.corp.intel.com>

>The rdma-cm maintains a mapping between IDs to the physical devices. The >mapping is established during address resolution using the HW address of >the --network-- device that was resolved (eg through ARP and then >looking on neigh->dev or route lookup) for this ID. This is what I was thinking. >In the bonding case, the network device --borrows-- the HW address from >the active slave device. During fail-over, the bonding net device >changes its HW address and then the netdev event is delivered on which >this code acts. So the same cma_dev can have IDs with different netdev >HW address in their dev_addr struct, say bond0 = and pdevA >list = { , } depending on the time address >resolution was done to ID1,ID2 and the ULP behavior on the ADDR_CHANGE >event. I don't see how to get along with a simple check that tell on >what cma_dev to look for matches. If we really want to avoid scanning >all the cma_dev list, we can add a mapping between --net devices-- to >IDs and then scan only the list of the affiliated netdevice. Ok - looping through everything isn't that bad, since it's not expected to happen often. If there's a way to improve this, I'm fine waiting to see if there's a real problem before complicating things. - Sean

From YJia at tmriusa.com Wed May 28 11:10:42 2008 From: YJia at tmriusa.com (Yicheng Jia) Date: Wed, 28 May 2008 13:10:42 -0500 Subject: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? In-Reply-To: <483DA35C.3080502@gmail.com> Message-ID:

>> My concern is the performance. The overhead of software poll_cq loop >> is quite significant if there are multiple pieces of small amount of >> data to be transferred on both sender/receiver sides. For instance, on >> the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver >> side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you
For instance, on >> the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver >> side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you >> have a good solution for such kind of problem? >How many QPs do you use? >(and how outstanding WR from every QP?) Only one QP. Is it better to alloc multiple QPs and evenly distribute WRs among those QPs? Best, Yicheng Dotan Barak 05/28/2008 12:24 PM To Yicheng Jia cc general at lists.openfabrics.org Subject Re: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? Yicheng Jia wrote: > > Thanks for your reply. I'm using one CQ for all the WRs. Do you know > why there's no ARM-N support in MLX drivers? I don't know if i can speak in the name of Mellanox/MLX driver maintainers, but i think that the reason is lack of demand for this feature (but i can't be sure). > My concern is the performance. The overhead of software poll_cq loop > is quite significant if there are multiple pieces of small amount of > data to be transferred on both sender/receiver sides. For instance, on > the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver > side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you > have a good solution for such kind of problem? How many QPs do you use? (and how outstanding WR from every QP?) Dotan > Best, > Yicheng > > > > *Dotan Barak * > > 05/23/2008 01:27 PM > > > To > Yicheng Jia > cc > general at lists.openfabrics.org > Subject > Re: [ofa-general] MLX HCA: CQ request notification for multiple > completions not implemented? > > > > > > > > > > Hi. > > Yicheng Jia wrote: > > > > Hi Folks, > > > > I'm trying to use CQ Event notification for multiple completions > > (ARM_N) according to Mellanox Lx III user manual for scatter/gathering > > RDMA. However I couldn't find it in current MLX driver. It seems to me > > that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are > > multiple work requests, I have to use "poll_cq" to synchronously wait > > until all the requests are done, is it correct? Is there a way to do > > asynchronous multiple send by subscribing for a ARM_N event? > You are right: the low level drivers of Mellanox devices doesn't support > ARM-N > (This feature is supported by the devices, but it wasn't implemented in > the low level drivers). > > You are right, in order to read all of the completions you need to use > poll_cq. > > By the way: Do you have you have to create a completion for any WR? > (if you are using one QP, this will maybe solve your problem). > > Dotan > > _____________________________________________________________________________ > Scanned by IBM Email Security Management Services powered by > MessageLabs. For more information please visit http://www.ers.ibm.com > _____________________________________________________________________________ > > > _____________________________________________________________________________ > Scanned by IBM Email Security Management Services powered by > MessageLabs. For more information please visit http://www.ers.ibm.com > _____________________________________________________________________________ _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. 
From dotanba at gmail.com Wed May 28 12:20:52 2008 From: dotanba at gmail.com (Dotan Barak) Date: Wed, 28 May 2008 21:20:52 +0200 Subject: [ofa-general] MLX HCA: CQ request notification for multiple completions not implemented? In-Reply-To: References: Message-ID: <483DB094.4000706@gmail.com> Yicheng Jia wrote: > > >> My concern is the performance. The overhead of software poll_cq loop > >> is quite significant if there are multiple pieces of small amount of > >> data to be transferred on both sender/receiver sides. For instance, on > >> the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver > >> side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you > >> have a good solution for such kind of problem? > > >How many QPs do you use? > >(and how outstanding WR from every QP?) > > Only one QP. Is it better to alloc multiple QPs and evenly distribute > WRs among those QPs? It depends on what you are trying to do ... You don't have to ask for a completion for every SR that you post; this way you can do some optimization. Dotan > > Best, > Yicheng > > > > *Dotan Barak * > > 05/28/2008 12:24 PM > > > To > Yicheng Jia > cc > general at lists.openfabrics.org > Subject > Re: [ofa-general] MLX HCA: CQ request notification for multiple > completions not implemented? > > > > > > > > > > Yicheng Jia wrote: > > > > Thanks for your reply. I'm using one CQ for all the WRs. Do you know > > why there's no ARM-N support in MLX drivers? > I don't know if i can speak in the name of Mellanox/MLX driver > maintainers, but i think that the > reason is lack of demand for this feature (but i can't be sure). > > > My concern is the performance. The overhead of software poll_cq loop > > is quite significant if there are multiple pieces of small amount of > > data to be transferred on both sender/receiver sides. For instance, on > > the sender, the data I have are 1k, 1k, 2k, 1k..., on the receiver > > side, the data size and blocks are the same, 1k, 1k, 2k, 1k.... Do you > > have a good solution for such kind of problem? > How many QPs do you use? > (and how outstanding WR from every QP?) > > Dotan > > Best, > > Yicheng > > > > > > > > *Dotan Barak * > > > > 05/23/2008 01:27 PM > > > > > > To > > Yicheng Jia > > cc > > general at lists.openfabrics.org > > Subject > > Re: [ofa-general] MLX HCA: CQ request notification > for multiple > > completions not implemented? > > > > > > > > > > > > > > > > > > > > Hi. > > > > Yicheng Jia wrote: > > > > > > Hi Folks, > > > > > > I'm trying to use CQ Event notification for multiple completions > > > (ARM_N) according to Mellanox Lx III user manual for scatter/gathering > > > RDMA. However I couldn't find it in current MLX driver. It seems to me > > > that only ARM_NEXT and ARM_SOLICIT are implemented. So if there are > > > multiple work requests, I have to use "poll_cq" to synchronously wait > > > until all the requests are done, is it correct? Is there a way to do > > > asynchronous multiple send by subscribing for a ARM_N event? > > You are right: the low level drivers of Mellanox devices doesn't support > > ARM-N > > (This feature is supported by the devices, but it wasn't implemented in > > the low level drivers). > > > > You are right, in order to read all of the completions you need to use > > poll_cq. > > > > By the way: Do you have to create a completion for any WR? > > (if you are using one QP, this will maybe solve your problem). > > > > Dotan > > > > > _____________________________________________________________________________ > > Scanned by IBM Email Security Management Services powered by > > MessageLabs. For more information please visit http://www.ers.ibm.com > > > _____________________________________________________________________________ > > < http://www.ers.ibm.com/> > > > > > _____________________________________________________________________________ > > Scanned by IBM Email Security Management Services powered by > > MessageLabs. For more information please visit http://www.ers.ibm.com > > > _____________________________________________________________________________ > > > _____________________________________________________________________________ > Scanned by IBM Email Security Management Services powered by > MessageLabs. For more information please visit > http://www.ers.ibm.com > _____________________________________________________________________________ > > > _____________________________________________________________________________ > Scanned by IBM Email Security Management Services powered by > MessageLabs. For more information please visit http://www.ers.ibm.com > _____________________________________________________________________________
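A minimal sketch of the optimization Dotan is hinting at: post most sends unsignaled and request a completion only every Nth WR, so a single poll_cq reaps a whole batch. It assumes a QP created with sq_sig_all = 0; SIG_EVERY and the helper name are illustrative:

    #include <infiniband/verbs.h>

    #define SIG_EVERY 16    /* must stay well below the send queue depth */

    static int post_send_amortized(struct ibv_qp *qp, struct ibv_send_wr *wr,
                                   unsigned int *posted)
    {
            struct ibv_send_wr *bad_wr;

            wr->send_flags &= ~IBV_SEND_SIGNALED;
            if (++(*posted) % SIG_EVERY == 0)
                    wr->send_flags |= IBV_SEND_SIGNALED;  /* one WC per batch */
            return ibv_post_send(qp, wr, &bad_wr);
    }

Since work requests on a QP's send queue complete in order, the completion of the signaled WR implies that all earlier unsignaled WRs on that QP have completed; the signaling period has to stay smaller than the send queue depth or the queue fills with unreclaimable slots.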
From sean.hefty at intel.com Wed May 28 11:53:57 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 May 2008 11:53:57 -0700 Subject: [ofa-general] RE: [RFC V4 PATCH 3/5] rdma/cma: simply locking needed for serialization of callbacks In-Reply-To: References: Message-ID: <000801c8c0f4$2f2cfd90$6758180a@amr.corp.intel.com> >-static int cma_disable_remove(struct rdma_id_private *id_priv, >+static int cma_disable_callback(struct rdma_id_private *id_priv, > enum cma_state state) > { > unsigned long flags; > int ret; > >+ mutex_lock(&id_priv->handler_mutex); > spin_lock_irqsave(&id_priv->lock, flags); >- if (id_priv->state == state) { >- atomic_inc(&id_priv->dev_remove); >+ if (id_priv->state == state) > ret = 0; >- } else >+ else { >+ mutex_unlock(&id_priv->handler_mutex); > ret = -EINVAL; >+ } > spin_unlock_irqrestore(&id_priv->lock, flags); > return ret; > } I wasn't clear on this before, but we shouldn't need to take the spinlock here at all now. We needed it before in order to check the state and increment dev_remove in one operation. Once the spinlock was released the state could have changed, but dev_remove would have halted the device removal thread. Under the new method, device removal is halted while we hold the handler_mutex. >@@ -2566,8 +2560,8 @@ static int cma_ib_mc_handler(int status, > int ret; > > id_priv = mc->id_priv; >- if (cma_disable_remove(id_priv, CMA_ADDR_BOUND) && >- cma_disable_remove(id_priv, CMA_ADDR_RESOLVED)) >+ if (cma_disable_callback(id_priv, CMA_ADDR_BOUND) && >+ cma_disable_callback(id_priv, CMA_ADDR_RESOLVED)) This can end up trying to acquire the mutex twice.
We could change this to mutex_lock(); if (id_priv->state == CMA_ADDR_BOUND || id_priv->state == CMA_ADDR_RESOLVED) - Sean From sean.hefty at intel.com Wed May 28 12:06:00 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 May 2008 12:06:00 -0700 Subject: [ofa-general] RE: [RFC V4 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_ADDR_CHANGE notification In-Reply-To: References: Message-ID: <000901c8c0f5$ddee4680$6758180a@amr.corp.intel.com> >+static void cma_ndev_work_handler(struct work_struct *_work) >+{ >+ struct cma_ndev_work *work = container_of(_work, struct cma_ndev_work, >work); >+ struct rdma_id_private *id_priv = work->id; >+ int destroy = 0; >+ >+ mutex_lock(&id_priv->handler_mutex); >+ if (id_priv->state == CMA_DESTROYING) We should probably skip id_priv->state == CMA_DEVICE_REMOVAL as well. >@@ -2723,6 +2751,63 @@ void rdma_leave_multicast(struct rdma_cm > } > EXPORT_SYMBOL(rdma_leave_multicast); > >+static int cma_netdev_change(struct net_device *ndev, struct rdma_id_private >*id_priv) >+{ >+ struct rdma_dev_addr *dev_addr; >+ struct cma_ndev_work *work; >+ >+ dev_addr = &id_priv->id.route.addr.dev_addr; >+ >+ if (!memcmp(dev_addr->src_dev_name, ndev->name, IFNAMSIZ) && >+ memcmp(dev_addr->src_dev_addr, ndev->dev_addr, ndev->addr_len)) { >+ printk(KERN_ERR "addr change for device %s used by id %p, >notifying\n", Is KERN_ERR what we want here? >+static int cma_netdev_callback(struct notifier_block *self, unsigned long >event, >+ void *ctx) >+{ >+ struct net_device *ndev = (struct net_device *)ctx; >+ struct cma_device *cma_dev; >+ struct rdma_id_private *id_priv; >+ int ret = NOTIFY_DONE; >+ >+ if (dev_net(ndev) != &init_net) >+ return NOTIFY_DONE; >+ >+ if (event != NETDEV_BONDING_FAILOVER) >+ return NOTIFY_DONE; >+ >+ if (!(ndev->flags & IFF_MASTER) || !(ndev->priv_flags & IFF_BONDING)) >+ return NOTIFY_DONE; >+ >+ mutex_lock(&lock); >+ list_for_each_entry(cma_dev, &dev_list, list) >+ list_for_each_entry(id_priv, &cma_dev->id_list, list) { >+ ret = cma_netdev_change(ndev, id_priv); >+ if (ret) >+ break; Should this be goto (mutex_unlock) instead? Okay - I think we're pretty close on the rdma_cm side of things. Thanks. - Sean From jlentini at netapp.com Wed May 28 14:24:20 2008 From: jlentini at netapp.com (James Lentini) Date: Wed, 28 May 2008 17:24:20 -0400 (EDT) Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483CBDF0.7030209@opengridcomputing.com> References: <20080527183429.32168.14351.stgit@dell3.ogc.int> <20080527183549.32168.22959.stgit@dell3.ogc.int> <483CBDF0.7030209@opengridcomputing.com> Message-ID: On Tue, 27 May 2008, Steve Wise wrote: > > enum ib_send_flags { > > @@ -676,6 +683,19 @@ struct ib_send_wr { > > u16 pkey_index; /* valid for GSI only */ > > u8 port_num; /* valid for DR SMPs on switch > > only */ > > } ud; > > + struct { > > + u64 iova_start; > > + struct ib_mr *mr; > > + struct ib_fast_reg_page_list *page_list; > > + unsigned int page_shift; > > + unsigned int page_list_len; > > + unsigned int first_byte_offset; > > + u32 length; > > + int access_flags; > > + } fast_reg; > > + struct { > > + struct ib_mr *mr; > > + } local_inv; > > } wr; > > }; > > Ok, while writing a test case for all this jazz, Could you post the test case when it is ready? An example of how to use this API would be useful. Of course, I realize you are revising the API at the moment... 
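Pending the krping example discussed next, a rough sketch of how a ULP might fill in the fast_reg fields quoted above. The opcode name IB_WR_FAST_REG_MR, the helper's shape, and the page-aligned region are assumptions about the RFC under review, not a statement of the final API:

    #include <linux/string.h>
    #include <rdma/ib_verbs.h>

    /* Sketch only: post a fast-register WR for an already-allocated ib_mr. */
    static int post_fast_reg(struct ib_qp *qp, struct ib_mr *mr,
                             struct ib_fast_reg_page_list *pl, int npages,
                             u64 iova, u32 len, int access)
    {
            struct ib_send_wr wr, *bad_wr;

            memset(&wr, 0, sizeof wr);
            wr.opcode = IB_WR_FAST_REG_MR;             /* assumed opcode name */
            wr.send_flags = IB_SEND_SIGNALED;
            wr.wr.fast_reg.iova_start = iova;
            wr.wr.fast_reg.mr = mr;
            wr.wr.fast_reg.page_list = pl;             /* dma addrs of the pages */
            wr.wr.fast_reg.page_shift = PAGE_SHIFT;
            wr.wr.fast_reg.page_list_len = npages;
            wr.wr.fast_reg.first_byte_offset = 0;      /* region is page aligned */
            wr.wr.fast_reg.length = len;
            wr.wr.fast_reg.access_flags = access;      /* e.g. IB_ACCESS_REMOTE_WRITE */
            return ib_post_send(qp, &wr, &bad_wr);
    }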
From swise at opengridcomputing.com Wed May 28 14:29:12 2008 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 28 May 2008 16:29:12 -0500 Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: References: <20080527183429.32168.14351.stgit@dell3.ogc.int> <20080527183549.32168.22959.stgit@dell3.ogc.int> <483CBDF0.7030209@opengridcomputing.com> Message-ID: <483DCEA8.20505@opengridcomputing.com> James Lentini wrote: > On Tue, 27 May 2008, Steve Wise wrote: > > >>> enum ib_send_flags { >>> @@ -676,6 +683,19 @@ struct ib_send_wr { >>> u16 pkey_index; /* valid for GSI only */ >>> u8 port_num; /* valid for DR SMPs on switch >>> only */ >>> } ud; >>> + struct { >>> + u64 iova_start; >>> + struct ib_mr *mr; >>> + struct ib_fast_reg_page_list *page_list; >>> + unsigned int page_shift; >>> + unsigned int page_list_len; >>> + unsigned int first_byte_offset; >>> + u32 length; >>> + int access_flags; >>> + } fast_reg; >>> + struct { >>> + struct ib_mr *mr; >>> + } local_inv; >>> } wr; >>> }; >>> >> Ok, while writing a test case for all this jazz, >> > > Could you post the test case when it is ready? An example of how to > use this API would be useful. Of course, I realize you are revising > the API at the moment... > Yes, I have already said I'll post a test case. :) The krping tool will be the culprit. It's the kernel equivalent of rping and has been around for a long time in one form or another. It is available at git://git.openfabrics.org/~swise/krping It currently supports dma mrs and regular mrs only. I'm adding fastreg support now. And I want to add mw too. Steve. From jon at opengridcomputing.com Wed May 28 15:55:49 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Wed, 28 May 2008 17:55:49 -0500 Subject: [ofa-general] Port space sharing in RDS Message-ID: <20080528225549.GC6288@opengridcomputing.com> During RDS init, rds_ib_init and rds_tcp_init will both individually bind to the RDS port for all IP addresses. Unfortunately, that will not work for iWARP for 2 major reasons. Firstly, the binding of a CM ID to IP address via rdma_bind_addr on INADDR_ANY will cause the first caller to bind to all IP addresses/Devices (and the subsequent calls will fail). So whichever module (IB or iWARP) that is called first will break the second (and cause the module loading to abort). Secondly, iWARP and the Linux stack must share the same port space. If bound to the same port, the RNIC will not be able to tell if the incoming RDS packet is for the TCP or the IB/iWARP RDS module. It appears to be preferring the IB/iWARP RDS module and not passing the packet to the TCP RDS module. Thus currently, iWARP adapters will not work if both TCP and IB are enabled in RDS. Regardless of whether iWARP support is separate or rolled into IB, these issues need to be resolved. I am open to suggestions about how to go about correcting this. One idea to correct the first issue: we can have the bind and all other device specific setup of both IB and iWARP handled by a single function which will then, based on node_type, handle the IB or iWARP case. The second issue is more complicated, as there is currently no way for the rdma_bind_addr to know if the port is already in use and vice versa. Obviously, we can make TCP/IWARP inversely dependent on each other during compile time, but I'm not sure that is a good long term strategy. Thoughts?
Thanks, Jon From sean.hefty at intel.com Wed May 28 16:33:06 2008 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 28 May 2008 16:33:06 -0700 Subject: [ofa-general] Port space sharing in RDS In-Reply-To: <20080528225549.GC6288@opengridcomputing.com> References: <20080528225549.GC6288@opengridcomputing.com> Message-ID: <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> >During RDS init, rds_ib_init and rds_tcp_init will both individually bind to >the >RDS port for all IP addresses. Unfortunately, that will not work for iWARP for >2 major reasons. Can RDS use different port numbers for its RDMA and TCP protocols? The wire protocols end up being different when running over TCP versus iWarp. - Sean From jon at opengridcomputing.com Wed May 28 17:03:54 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Wed, 28 May 2008 19:03:54 -0500 Subject: [ofa-general] Port space sharing in RDS In-Reply-To: <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> Message-ID: <20080529000354.GD6288@opengridcomputing.com> On Wed, May 28, 2008 at 04:33:06PM -0700, Sean Hefty wrote: > >During RDS init, rds_ib_init and rds_tcp_init will both individually bind to > >the > >RDS port for all IP addresses. Unfortunately, that will not work for iWARP for > >2 major reasons. > > Can RDS use different port numbers for its RDMA and TCP protocols? The wire I do not know if this is desirable, but a quick test shows that having TCP and IB on different ports works around the problem. > protocols end up being different when running over TCP versus iWarp. > > - Sean > From chu11 at llnl.gov Wed May 28 17:14:59 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 28 May 2008 17:14:59 -0700 Subject: [ofa-general] [OpenSM] [PATCH 0/3] New "port-offsetting" option to updn/minhop routing In-Reply-To: <1207861815.7695.160.camel@cardanus.llnl.gov> References: <1207861815.7695.160.camel@cardanus.llnl.gov> Message-ID: <1212020100.31760.154.camel@cardanus.llnl.gov> Hey Sasha, Attached are some numbers from a recent run I did with my port offsetting patches. I ran w/ mvapich 0.9.9 and OpenMPI 1.2.6 on 120 nodes. I ran w/ 1 task per node or 8 tasks per node (nodes have 8 processors each), trying LMC=0, LMC=1, and LMC=2 with the original 'updn', then LMC=1 and LMC=2 with my port-offsetting patch (labeled "PO"). Next to these columns are the percentage worse the numbers are in comparison to LMC=0. My understanding is that mvapich 0.9.9 does not know how to take advantage of multiple lids while openMPI 1.2.6 does know how to take advantage of it. I think the key numbers to notice are that without port-offsetting, performance relative to LMC=0 is pretty bad when the MPI implementation does not know how to take advantage of multiple lids (mvapich 0.9.9). LMC=1 shows ~30% performance degradation and LMC=2 shows ~90% degradation on this cluster. With the port-offsetting turned on, the degradation falls to 0%-6%, a few times even being faster. We consider this within "noise" levels. For MPIs that do know how to take advantage of multiple lids it seems that the port-offsetting patch doesn't affect performance that much. (See OpenMPI 1.2.6 sections). PLMK what you think. Thanks. Al On Thu, 2008-04-10 at 14:10 -0700, Al Chu wrote: > Hey Sasha, > > I was going to submit this after I had a chance to test on one of our > big clusters to see if it worked 100% right. But my final testing has > been delayed (for a month now!). 
Ira said some folks from Sonoma were > interested in this, so I'll go ahead and post it. > > This is a patch for something I call "port_offsetting" (name/description > of the option is open to suggestion). Basically, we want to move to > using lmc > 0 on our clusters b/c some of the newer MPI implementations > take advantage of multiple lids and have shown faster performance when > lmc > 0. > > The problem is that those users that do not use the newer MPI > implementations, or do not run their code in a way that can take > advantage of multiple lids, suffer great performance degradation in > their code. We determined that the primary issue is what we started > calling "base lid alignment". Here's a simple example. > > Assume LMC = 2 and we are trying to route the lids of 4 ports (A,B,C,D). > Those lids are: > > port A - 1,2,3,4 > port B - 5,6,7,8 > port C - 9,10,11,12 > port D - 13,14,15,16 > > Suppose forwarding of these lids goes through 4 switch ports. If we > cycle through the ports like updn/minhop currently do, we would see > something like this. > > switch port 1: 1, 5, 9, 13 > switch port 2: 2, 6, 10, 14 > switch port 3: 3, 7, 11, 15 > switch port 4: 4, 8, 12, 16 > > Note that the base lid of each port (lids 1, 5, 9, 13) goes through only > 1 port of the switch. Thus a user that uses only the base lid is using > only 1 port out of the 4 ports they could be using. Leading to terrible > performance. > > We want to get this instead. > > switch port 1: 1, 8, 11, 14 > switch port 2: 2, 5, 12, 15 > switch port 3: 3, 6, 9, 16 > switch port 4: 4, 7, 10, 13 > > where base lids are distributed in a more even manner. > > In order to do this, we (effectively) iterate through all ports like > before, but we iterate starting at a different index depending on the > number of paths we have routed thus far. > > On one of our clusters, some testing has shown when we run w/ LMC=1 and > 1 task per node, mpibench (AlltoAll tests) range from 10-30% worse than > when LMC=0 is used. With LMC=2, mpibench tends to be 50-70% worse in > performance than with LMC=0. > > With the port offsetting option, the performance degradation ranges 1-5% > worse than LMC=0. I am currently at a loss why I cannot get it to be > even to LMC=0, but 1-5% is small enough to not make users mad :-) > > The part I haven't been able to test yet is whether newer MPIs that do > take advantage of LMC > 0 run equally when my port_offsetting is turned > off and on. That's the part I'm still haven't been able to test. > > Thanks, look forward to your comments, > > Al > > -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: mpi_port_offsetting.xls Type: application/vnd.ms-excel Size: 17408 bytes Desc: not available URL: From weiny2 at llnl.gov Wed May 28 17:57:19 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 28 May 2008 17:57:19 -0700 Subject: [ofa-general] [PATCH] infiniband-diags/scripts/ibprint[ca|switch|rt].pl: fix printing by name Message-ID: <20080528175719.06d6ae15.weiny2@llnl.gov> I guess when I added support to search by GUID I must have broken the printing by name. This changes these scripts to use the common convention of "-G" to specify that a GUID is to be searched for and fixes the printing when a name is specified. 
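(For illustration, assuming a hypothetical switch whose GUID is 0x0008f10400411f56 and whose node description is "ISR9024D Voltaire", the two lookup forms would now be spelled: ibprintswitch.pl -G 0x0008f10400411f56 versus ibprintswitch.pl "ISR9024D Voltaire". Without -G, the argument is matched against node descriptions, delimited by whitespace or quotes, in the cached ibnetdiscover output.)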
Ira >From e9b6766bd6b3661a5bc1e78c9e95784a99a631c0 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Fri, 23 May 2008 16:19:57 -0700 Subject: [PATCH] infiniband-diags/scripts/ibprint[ca|switch|rt].pl: fix printing by name Signed-off-by: Ira K. Weiny --- infiniband-diags/scripts/ibprintca.pl | 15 +++++++++++---- infiniband-diags/scripts/ibprintrt.pl | 15 +++++++++++---- infiniband-diags/scripts/ibprintswitch.pl | 15 +++++++++++---- 3 files changed, 33 insertions(+), 12 deletions(-) diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl index 38b4330..b13a83b 100755 --- a/infiniband-diags/scripts/ibprintca.pl +++ b/infiniband-diags/scripts/ibprintca.pl @@ -45,12 +45,13 @@ use IBswcountlimits; sub usage_and_exit { my $prog = $_[0]; - print "Usage: $prog [-R -l] []\n"; + print "Usage: $prog [-R -l] [-G | ]\n"; print " print only the ca specified from the ibnetdiscover output\n"; print " -R Recalculate ibnetdiscover information\n"; print " -l list cas\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; + print " -G node is specified with GUID\n"; exit 0; } @@ -59,15 +60,21 @@ my $regenerate_map = undef; my $list_hcas = undef; my $ca_name = ""; my $ca_port = ""; +my $name_is_guid = "no"; chomp $argv0; -if (!getopts("hRlC:P:")) { usage_and_exit $argv0; } +if (!getopts("hRlC:P:G")) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } if (defined $Getopt::Std::opt_l) { $list_hcas = $Getopt::Std::opt_l; } if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } if (defined $Getopt::Std::opt_P) { $ca_port = $Getopt::Std::opt_P; } +if (defined $Getopt::Std::opt_G) { $name_is_guid = "yes"; } -my $target_hca = format_guid($ARGV[0]); +my $target_hca = $ARGV[0]; + +if ($name_is_guid eq "yes") { + $target_hca = format_guid($target_hca); +} my $cache_file = get_cache_file($ca_name, $ca_port); @@ -100,7 +107,7 @@ sub main $in_hca = "no"; goto DONE; } - if ("0x$guid" eq $target_hca || $desc =~ /.*$target_hca.*/) { + if ("0x$guid" eq $target_hca || $desc =~ /[\s\"]$target_hca[\s\"]/) { print $line; $in_hca = "yes"; $found_hca = "yes"; diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl index 86dcb64..e9e6cc4 100755 --- a/infiniband-diags/scripts/ibprintrt.pl +++ b/infiniband-diags/scripts/ibprintrt.pl @@ -45,12 +45,13 @@ use IBswcountlimits; sub usage_and_exit { my $prog = $_[0]; - print "Usage: $prog [-R -l] []\n"; + print "Usage: $prog [-R -l] [-G | ]\n"; print " print only the rt specified from the ibnetdiscover output\n"; print " -R Recalculate ibnetdiscover information\n"; print " -l list rts\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; + print " -G node is specified with GUID\n"; exit 0; } @@ -59,15 +60,21 @@ my $regenerate_map = undef; my $list_rts = undef; my $ca_name = ""; my $ca_port = ""; +my $name_is_guid = "no"; chomp $argv0; -if (!getopts("hRlC:P:")) { usage_and_exit $argv0; } +if (!getopts("hRlC:P:G")) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } if (defined $Getopt::Std::opt_l) { $list_rts = $Getopt::Std::opt_l; } if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } if (defined $Getopt::Std::opt_P) { $ca_port = 
$Getopt::Std::opt_P; } +if (defined $Getopt::Std::opt_G) { $name_is_guid = "yes"; } -my $target_rt = format_guid($ARGV[0]); +my $target_rt = $ARGV[0]; + +if ($name_is_guid eq "yes") { + $target_rt = format_guid($target_rt); +} my $cache_file = get_cache_file($ca_name, $ca_port); @@ -100,7 +107,7 @@ sub main $in_rt = "no"; goto DONE; } - if ("0x$guid" eq $target_rt || $desc =~ /.*$target_rt.*/) { + if ("0x$guid" eq $target_rt || $desc =~ /[\s\"]$target_rt[\s\"]/) { print $line; $in_rt = "yes"; $found_rt = "yes"; diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl index 6712201..148d70e 100755 --- a/infiniband-diags/scripts/ibprintswitch.pl +++ b/infiniband-diags/scripts/ibprintswitch.pl @@ -44,12 +44,13 @@ use IBswcountlimits; sub usage_and_exit { my $prog = $_[0]; - print "Usage: $prog [-R -l] []\n"; + print "Usage: $prog [-R -l] [-G | ]\n"; print " print only the switch specified from the ibnetdiscover output\n"; print " -R Recalculate ibnetdiscover information\n"; print " -l list switches\n"; print " -C use selected channel adaptor name for queries\n"; print " -P use selected channel adaptor port for queries\n"; + print " -G node is specified with GUID\n"; exit 0; } @@ -58,15 +59,21 @@ my $regenerate_map = undef; my $list_switches = undef; my $ca_name = ""; my $ca_port = ""; +my $name_is_guid = "no"; chomp $argv0; -if (!getopts("hRlC:P:")) { usage_and_exit $argv0; } +if (!getopts("hRlC:P:G")) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } if (defined $Getopt::Std::opt_l) { $list_switches = $Getopt::Std::opt_l; } if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } if (defined $Getopt::Std::opt_P) { $ca_port = $Getopt::Std::opt_P; } +if (defined $Getopt::Std::opt_G) { $name_is_guid = "yes"; } -my $target_switch = format_guid($ARGV[0]); +my $target_switch = $ARGV[0]; + +if ($name_is_guid eq "yes") { + $target_switch = format_guid($target_switch); +} my $cache_file = get_cache_file($ca_name, $ca_port); @@ -99,7 +106,7 @@ sub main $in_switch = "no"; goto DONE; } - if ("0x$guid" eq $target_switch || $desc =~ /.*$target_switch.*/) { + if ("0x$guid" eq $target_switch || $desc =~ /[\s\"]$target_switch[\s\"]/) { print $line; $in_switch = "yes"; $found_switch = "yes"; -- 1.5.4.5 From weiny2 at llnl.gov Wed May 28 17:59:41 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 28 May 2008 17:59:41 -0700 Subject: [ofa-general] [PATCH] infiniband-diags/scripts/ibprint[ca|switch|rt].pl: allow printing of multiple matches but print warning to user that multiple matches were found Message-ID: <20080528175941.41425dac.weiny2@llnl.gov> I think it is useful to print multiple matches found when searching for matches. Specifically when switches have not been named, ie they all have some "Mellanox..." or "Voltaire..." name. This prints all matches but also warns the user at the end that it found X matches. Ira >From 11b85c9b526b9067aa12eac5d445d8ee43a7d024 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Fri, 23 May 2008 16:25:19 -0700 Subject: [PATCH] infiniband-diags/scripts/ibprint[ca|switch|rt].pl: allow printing of multiple matches but print warning to user that multiple matches were found Signed-off-by: Ira K. 
Weiny --- infiniband-diags/scripts/ibprintca.pl | 17 +++++++++-------- infiniband-diags/scripts/ibprintrt.pl | 17 +++++++++-------- infiniband-diags/scripts/ibprintswitch.pl | 17 +++++++++-------- 3 files changed, 27 insertions(+), 24 deletions(-) diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl index b13a83b..ccd5473 100755 --- a/infiniband-diags/scripts/ibprintca.pl +++ b/infiniband-diags/scripts/ibprintca.pl @@ -95,7 +95,7 @@ if ($target_hca eq "") { # sub main { - my $found_hca = undef; + my $found_hca = 0; open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; my $in_hca = "no"; my %ports = undef; @@ -105,12 +105,14 @@ sub main my $desc = $2; if ($in_hca eq "yes") { $in_hca = "no"; - goto DONE; + foreach my $port (sort { $a <=> $b } (keys %ports)) { + print $ports{$port}; + } } if ("0x$guid" eq $target_hca || $desc =~ /[\s\"]$target_hca[\s\"]/) { print $line; $in_hca = "yes"; - $found_hca = "yes"; + $found_hca++; } } if ($line =~ /^Switch.*/ || $line =~ /^Rt.*/) { $in_hca = "no"; } @@ -120,15 +122,14 @@ sub main } } - DONE: - foreach my $port (sort { $a <=> $b } (keys %ports)) { - print $ports{$port}; - } - if (!$found_hca) { + if ($found_hca == 0) { print "\"$target_hca\" not found\n"; print " Try running with the \"-R\" option.\n"; print " If still not found the node is probably down.\n"; } + if ($found_hca > 1) { + print "\nWARNING: Found $found_hca CA's with the name \"$target_hca\"\n"; + } close IBNET_TOPO; } main diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl index e9e6cc4..4b83ff0 100755 --- a/infiniband-diags/scripts/ibprintrt.pl +++ b/infiniband-diags/scripts/ibprintrt.pl @@ -95,7 +95,7 @@ if ($target_rt eq "") { # sub main { - my $found_rt = undef; + my $found_rt = 0; open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; my $in_rt = "no"; my %ports = undef; @@ -105,12 +105,14 @@ sub main my $desc = $2; if ($in_rt eq "yes") { $in_rt = "no"; - goto DONE; + foreach my $port (sort { $a <=> $b } (keys %ports)) { + print $ports{$port}; + } } if ("0x$guid" eq $target_rt || $desc =~ /[\s\"]$target_rt[\s\"]/) { print $line; $in_rt = "yes"; - $found_rt = "yes"; + $found_rt++; } } if ($line =~ /^Switch.*/ || $line =~ /^Ca.*/) { $in_rt = "no"; } @@ -120,15 +122,14 @@ sub main } } - DONE: - foreach my $port (sort { $a <=> $b } (keys %ports)) { - print $ports{$port}; - } - if (!$found_rt) { + if ($found_rt == 0) { print "\"$target_rt\" not found\n"; print " Try running with the \"-R\" option.\n"; print " If still not found the node is probably down.\n"; } + if ($found_rt > 1) { + print "\nWARNING: Found $found_rt Router's with the name \"$target_rt\"\n"; + } close IBNET_TOPO; } main diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl index 148d70e..9426673 100755 --- a/infiniband-diags/scripts/ibprintswitch.pl +++ b/infiniband-diags/scripts/ibprintswitch.pl @@ -94,7 +94,7 @@ if ($target_switch eq "") { # sub main { - my $found_switch = undef; + my $found_switch = 0; open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology\n"; my $in_switch = "no"; my %ports = undef; @@ -104,12 +104,14 @@ sub main my $desc = $2; if ($in_switch eq "yes") { $in_switch = "no"; - goto DONE; + foreach my $port (sort { $a <=> $b } (keys %ports)) { + print $ports{$port}; + } } if ("0x$guid" eq $target_switch || $desc =~ /[\s\"]$target_switch[\s\"]/) { print $line; $in_switch = "yes"; - $found_switch = "yes"; + $found_switch++; } } 
if ($line =~ /^Ca.*/) { $in_switch = "no"; } @@ -119,14 +121,13 @@ sub main } } - DONE: - foreach my $port (sort { $a <=> $b } (keys %ports)) { - print $ports{$port}; - } - if (!$found_switch) { + if ($found_switch == 0) { print "Switch \"$target_switch\" not found\n"; print " Try running with the \"-R\" option.\n"; } + if ($found_switch > 1) { + print "\nWARNING: Found $found_switch switches with the name \"$target_switch\"\n"; + } close IBNET_TOPO; } main -- 1.5.4.5 From chu11 at llnl.gov Wed May 28 19:30:52 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 28 May 2008 22:30:52 -0400 Subject: [ofa-general] [OpenSM] [PATCH 0/3] New "port-offsetting" option to updn/minhop routing In-Reply-To: <1212020100.31760.154.camel@cardanus.llnl.gov> References: <1207861815.7695.160.camel@cardanus.llnl.gov> <1212020100.31760.154.camel@cardanus.llnl.gov> Message-ID: <1212028252.6913.3.camel@whatsup> Oops, I forgot about one other important measurement we did. The following are the Average Send/Receive MPI bandwidths as measured by mpigraph (http://sourceforge.net/projects/mpigraph). Again, using updn routing. LMC=0 Send 391 MB/s Recv 461 MB/s LMC=1 Send 292 MB/s Recv 358 MB/s LMC=2 Send 197 MB/s Recv 241 MB/s with my port offsetting turned on. I got LMC=1 Send 387 MB/s Recv 457 MB/s LMC=2 Send 383 MB/s Recv 455 MB/s So similar to the AlltoAll MPI tests, the port offsetting gets the numbers back to about what they were at LMC=0. Al On Wed, 2008-05-28 at 17:14 -0700, Al Chu wrote: > Hey Sasha, > > Attached are some numbers from a recent run I did with my port > offsetting patches. I ran w/ mvapich 0.9.9 and OpenMPI 1.2.6 on 120 > nodes. I ran w/ 1 task per node or 8 tasks per node (nodes have 8 > processors each), trying LMC=0, LMC=1, and LMC=2 with the original > 'updn', then LMC=1 and LMC=2 with my port-offsetting patch (labeled > "PO"). Next to these columns are the percentage worse the numbers are > in comparison to LMC=0. My understanding is that mvapich 0.9.9 does not > know how to take advantage of multiple lids while openMPI 1.2.6 does > know how to take advantage of it. > > I think the key numbers to notice are that without port-offsetting, > performance relative to LMC=0 is pretty bad when the MPI implementation > does not know how to take advantage of multiple lids (mvapich 0.9.9). > LMC=1 shows ~30% performance degradation and LMC=2 shows ~90% > degradation on this cluster. With the port-offsetting turned on, the > degradation falls to 0%-6%, a few times even being faster. We consider > this within "noise" levels. > > For MPIs that do know how to take advantage of multiple lids it seems > that the port-offsetting patch doesn't affect performance that much. > (See OpenMPI 1.2.6 sections). > > PLMK what you think. Thanks. > > Al > > On Thu, 2008-04-10 at 14:10 -0700, Al Chu wrote: > > Hey Sasha, > > > > I was going to submit this after I had a chance to test on one of our > > big clusters to see if it worked 100% right. But my final testing has > > been delayed (for a month now!). Ira said some folks from Sonoma were > > interested in this, so I'll go ahead and post it. > > > > This is a patch for something I call "port_offsetting" (name/description > > of the option is open to suggestion). Basically, we want to move to > > using lmc > 0 on our clusters b/c some of the newer MPI implementations > > take advantage of multiple lids and have shown faster performance when > > lmc > 0. 
> > > > The problem is that those users that do not use the newer MPI > > implementations, or do not run their code in a way that can take > > advantage of multiple lids, suffer great performance degradation in > > their code. We determined that the primary issue is what we started > > calling "base lid alignment". Here's a simple example. > > > > Assume LMC = 2 and we are trying to route the lids of 4 ports (A,B,C,D). > > Those lids are: > > > > port A - 1,2,3,4 > > port B - 5,6,7,8 > > port C - 9,10,11,12 > > port D - 13,14,15,16 > > > > Suppose forwarding of these lids goes through 4 switch ports. If we > > cycle through the ports like updn/minhop currently do, we would see > > something like this. > > > > switch port 1: 1, 5, 9, 13 > > switch port 2: 2, 6, 10, 14 > > switch port 3: 3, 7, 11, 15 > > switch port 4: 4, 8, 12, 16 > > > > Note that the base lid of each port (lids 1, 5, 9, 13) goes through only > > 1 port of the switch. Thus a user that uses only the base lid is using > > only 1 port out of the 4 ports they could be using. Leading to terrible > > performance. > > > > We want to get this instead. > > > > switch port 1: 1, 8, 11, 14 > > switch port 2: 2, 5, 12, 15 > > switch port 3: 3, 6, 9, 16 > > switch port 4: 4, 7, 10, 13 > > > > where base lids are distributed in a more even manner. > > > > In order to do this, we (effectively) iterate through all ports like > > before, but we iterate starting at a different index depending on the > > number of paths we have routed thus far. > > > > On one of our clusters, some testing has shown when we run w/ LMC=1 and > > 1 task per node, mpibench (AlltoAll tests) range from 10-30% worse than > > when LMC=0 is used. With LMC=2, mpibench tends to be 50-70% worse in > > performance than with LMC=0. > > > > With the port offsetting option, the performance degradation ranges 1-5% > > worse than LMC=0. I am currently at a loss why I cannot get it to be > > even to LMC=0, but 1-5% is small enough to not make users mad :-) > > > > The part I haven't been able to test yet is whether newer MPIs that do > > take advantage of LMC > 0 run equally when my port_offsetting is turned > > off and on. That's the part I'm still haven't been able to test. > > > > Thanks, look forward to your comments, > > > > Al > > > > > -- > Albert Chu > chu11 at llnl.gov > 925-422-5311 > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Albert Chu chu11 at llnl.gov 925-422-5311 Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From sashak at voltaire.com Wed May 28 22:34:18 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 May 2008 08:34:18 +0300 Subject: [ofa-general] Re: [PATCH] infiniband-diags/scripts/ibprint[ca|switch|rt].pl: allow printing of multiple matches but print warning to user that multiple matches were found In-Reply-To: <20080528175941.41425dac.weiny2@llnl.gov> References: <20080528175941.41425dac.weiny2@llnl.gov> Message-ID: <20080529053418.GA16570@sashak.voltaire.com> On 17:59 Wed 28 May , Ira Weiny wrote: > I think it is useful to print multiple matches found when searching for > matches. Specifically when switches have not been named, ie they all have some > "Mellanox..." or "Voltaire..." name. 
> > This prints all matches but also warns the user at the end that it found X > matches. > > Ira > > From 11b85c9b526b9067aa12eac5d445d8ee43a7d024 Mon Sep 17 00:00:00 2001 > From: Ira Weiny > Date: Fri, 23 May 2008 16:25:19 -0700 > Subject: [PATCH] infiniband-diags/scripts/ibprint[ca|switch|rt].pl: allow printing of multiple > matches but print warning to user that multiple matches were found > > > Signed-off-by: Ira K. Weiny Both applied. Thanks. Sasha From jackm at dev.mellanox.co.il Wed May 28 22:42:40 2008 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 29 May 2008 08:42:40 +0300 Subject: [ofa-general] [PATCH] mlx4_core: enable changing default max HCA resource limits at run time -- reposting In-Reply-To: <1211893354.13185.229.camel@hrosenstock-ws.xsigo.com> References: <200804281438.28417.jackm@dev.mellanox.co.il> <1211893354.13185.229.camel@hrosenstock-ws.xsigo.com> Message-ID: <200805290842.40722.jackm@dev.mellanox.co.il> See http://lists.openfabrics.org/pipermail/general/2008-May/049781.html I'll submit a pair of patches incorporating my suggestions (at the end of the post). Roland? - Jack On Tuesday 27 May 2008 16:02, Hal Rosenstock wrote: > On Mon, 2008-04-28 at 14:38 +0300, Jack Morgenstein wrote: > > mlx4-core: enable changing default max HCA resource limits. > > > > Enable module-initialization time modification of default HCA > > maximum resource limits via module parameters, as is done in mthca. > > > > Specify the log of the parameter value, rather than the value itself > > to avoid the hidden side-effect of rounding up values to next power-of-2. > > > > Signed-off-by: Jack Morgenstein > > Sorry if I'm rehashing this but this thread appears to have died out and > I'm not sure about it's status: > > Where do we stand in terms of getting the additional mlx4 module > parameters incorporated ? > > Thanks. > > -- Hal > > From sashak at voltaire.com Wed May 28 22:56:41 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 May 2008 08:56:41 +0300 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com> References: <483D62E6.2050107@mellanox.co.il> <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080529055641.GB16570@sashak.voltaire.com> On 08:24 Wed 28 May , Hal Rosenstock wrote: > On Wed, 2008-05-28 at 10:31 -0400, Talpey, Thomas wrote: > > In addition we needed to install > > libmthca > > which in turn wanted > > libibverbs > > AFAIK, there is no OpenSM requirement for these libraries. Right, it is not needed. Actually it looks like a bug in OFED's install.pl script. > > And, after we successfully loaded it all and started opensm, etc, > > then ipoib wouldn't come up to RUNNING because it couldn't join > > the broadcast group. > > This was likely due to IPoIB needing to be enabled on the default > partition. The default partition is enabled by default. Sasha From sashak at voltaire.com Wed May 28 23:02:56 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 May 2008 09:02:56 +0300 Subject: [ofa-general] [PATCH] ofed_1_3_scripts: fix management libib* dependencies In-Reply-To: <483D62E6.2050107@mellanox.co.il> References: <483D62E6.2050107@mellanox.co.il> Message-ID: <20080529060256.GD16570@sashak.voltaire.com> libibcommon, libibumad and libibmad don't require libibverbs. 
Signed-off-by: Sasha Khapyorsky --- install.pl | 30 +++++++++++++++--------------- 1 files changed, 15 insertions(+), 15 deletions(-) diff --git a/install.pl b/install.pl index a795a3e..e533be7 100755 --- a/install.pl +++ b/install.pl @@ -584,28 +584,28 @@ my %packages_info = ( { name => "libibcommon", parent => "libibcommon", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => ["libtool"], - dist_req_inst => [], ofa_req_build => ["libibverbs"], - ofa_req_inst => ["libibverbs"], + dist_req_inst => [], ofa_req_build => [], + ofa_req_inst => [], install32 => 1, exception => 0, configure_options => '' }, 'libibcommon-devel' => { name => "libibcommon-devel", parent => "libibcommon", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs","libibverbs-devel"], - ofa_req_inst => ["libibverbs", "libibcommon"], + dist_req_inst => [], ofa_req_build => [], + ofa_req_inst => ["libibcommon"], install32 => 1, exception => 0 }, 'libibcommon-static' => { name => "libibcommon-static", parent => "libibcommon", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs","libibverbs-devel"], - ofa_req_inst => ["libibverbs", "libibcommon"], + dist_req_inst => [], ofa_req_build => [], + ofa_req_inst => ["libibcommon"], install32 => 1, exception => 0 }, 'libibcommon-debuginfo' => { name => "libibcommon-debuginfo", parent => "libibcommon", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs","libibverbs-devel"], + dist_req_inst => [], ofa_req_build => [], ofa_req_inst => [], install32 => 0, exception => 0 }, @@ -613,28 +613,28 @@ my %packages_info = ( { name => "libibumad", parent => "libibumad", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs", "libibcommon-devel"], - ofa_req_inst => ["libibverbs", "libibcommon"], + dist_req_inst => [], ofa_req_build => ["libibcommon-devel"], + ofa_req_inst => ["libibcommon"], install32 => 1, exception => 0, configure_options => '' }, 'libibumad-devel' => { name => "libibumad-devel", parent => "libibumad", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs","libibcommon-devel"], - ofa_req_inst => ["libibverbs", "libibcommon-devel", "libibumad"], + dist_req_inst => [], ofa_req_build => ["libibcommon-devel"], + ofa_req_inst => ["libibcommon-devel", "libibumad"], install32 => 1, exception => 0 }, 'libibumad-static' => { name => "libibumad-static", parent => "libibumad", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs","libibcommon-devel"], - ofa_req_inst => ["libibverbs", "libibcommon-devel", "libibumad"], + dist_req_inst => [], ofa_req_build => ["libibcommon-devel"], + ofa_req_inst => ["libibcommon-devel", "libibumad"], install32 => 1, exception => 0 }, 'libibumad-debuginfo' => { name => "libibumad-debuginfo", parent => "libibumad", selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, 
mode => "user", dist_req_build => [], - dist_req_inst => [], ofa_req_build => ["libibverbs","libibcommon-devel"], + dist_req_inst => [], ofa_req_build => ["libibcommon-devel"], ofa_req_inst => [], install32 => 0, exception => 0 }, @@ -643,7 +643,7 @@ my %packages_info = ( selected => 0, installed => 0, rpm_exist => 0, rpm_exist32 => 0, available => 1, mode => "user", dist_req_build => [], dist_req_inst => [], ofa_req_build => ["libibumad-devel"], - ofa_req_inst => ["libibverbs", "libibumad"], + ofa_req_inst => ["libibumad"], install32 => 1, exception => 0, configure_options => '' }, 'libibmad-devel' => { name => "libibmad-devel", parent => "libibmad", -- 1.5.5.1.178.g1f811 From ogerlitz at voltaire.com Wed May 28 23:23:59 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 29 May 2008 09:23:59 +0300 Subject: [ofa-general] Re: [RFC V4 PATCH 3/5] rdma/cma: simply locking needed for serialization of callbacks In-Reply-To: <000801c8c0f4$2f2cfd90$6758180a@amr.corp.intel.com> References: <000801c8c0f4$2f2cfd90$6758180a@amr.corp.intel.com> Message-ID: <483E4BFF.1090207@voltaire.com> Sean Hefty wrote: > I wasn't clear on this before, but we shouldn't need to take the spinlock here > at all now. We needed it before in order to check the state and increment > dev_remove in one operation. Once the spinlock was released the state could > have changed, but dev_remove would have halted the device removal thread. Under > the new method, device removal is halted while we hold the handler_mutex. OK, got it. > This can end up trying to acquire the mutex twice. We could change this to > mutex_lock(); > if (id_priv->state == CMA_ADDR_BOUND || id_priv->state == CMA_ADDR_RESOLVED) > OK, will do so. From ogerlitz at voltaire.com Wed May 28 23:29:16 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 29 May 2008 09:29:16 +0300 Subject: [ofa-general] Re: [RFC V4 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_ADDR_CHANGE notification In-Reply-To: <000901c8c0f5$ddee4680$6758180a@amr.corp.intel.com> References: <000901c8c0f5$ddee4680$6758180a@amr.corp.intel.com> Message-ID: <483E4D3C.3020600@voltaire.com> Sean Hefty wrote: >> + mutex_lock(&id_priv->handler_mutex); >> + if (id_priv->state == CMA_DESTROYING) > > We should probably skip id_priv->state == CMA_DEVICE_REMOVAL as well. OK > >> + printk(KERN_ERR "addr change for device %s used by id %p, notifying\n", > > Is KERN_ERR what we want here? no, I think we can do well with warning or info level > >> +static int cma_netdev_callback(struct notifier_block *self, unsigned long >> event, void *ctx) >> + mutex_lock(&lock); >> + list_for_each_entry(cma_dev, &dev_list, list) >> + list_for_each_entry(id_priv, &cma_dev->id_list, list) { >> + ret = cma_netdev_change(ndev, id_priv); >> + if (ret) >> + break; > > Should this be goto (mutex_unlock) instead? yes it would be better to have it this way Or From vlad at dev.mellanox.co.il Wed May 28 23:58:18 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 29 May 2008 09:58:18 +0300 Subject: [ofa-general] [PATCH] ofed_1_3_scripts: fix management libib* dependencies In-Reply-To: <20080529060256.GD16570@sashak.voltaire.com> References: <483D62E6.2050107@mellanox.co.il> <20080529060256.GD16570@sashak.voltaire.com> Message-ID: <483E540A.4080701@dev.mellanox.co.il> Sasha Khapyorsky wrote: > libibcommon, libibumad and libibmad don't require libibverbs. 
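(As an aside on the group-join point: with raw libibverbs, attaching a UD QP to a group only programs the local HCA's filter; the SA join, an MCMemberRecord request, is what makes the switch forward the group, and it has to happen separately, e.g. via librdmacm or, as in the test below, by letting IPoIB create the group. A minimal sketch, where the mgid/mlid values are placeholders that would normally come from the SA join reply:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* qp must be a UD QP; mgid/mlid must match the SA's MCMemberRecord. */
    static int attach_to_group(struct ibv_qp *qp, union ibv_gid *mgid,
                               uint16_t mlid)
    {
            if (ibv_attach_mcast(qp, mgid, mlid)) {
                    fprintf(stderr, "ibv_attach_mcast failed\n");
                    return -1;
            }
            return 0;   /* local HCA will now deliver the group's packets */
    }
)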
> > Signed-off-by: Sasha Khapyorsky > --- > install.pl | 30 +++++++++++++++--------------- > 1 files changed, 15 insertions(+), 15 deletions(-) > Applied, Regards, Vladimir From marcel.heinz at informatik.tu-chemnitz.de Thu May 29 02:19:28 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Thu, 29 May 2008 11:19:28 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <483DA512.2070403@gmail.com> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> Message-ID: <483E7520.1000302@informatik.tu-chemnitz.de> Hi, Dotan Barak wrote: > Marcel Heinz wrote: >> [low multicast throughput of ~250MB/s with own benchmark tool] > > 1) I know that ib_send_bw supports multicast as well, can you please > check that you can reproduce your problem > on this benchmark too? Well, the last time I've checked this, ib_send_bw didn't support multicast, but this was some months ago. That multicast support seems a bit odd, since it doesn't create/join the multicast groups and there is still a 1:1 TCP connection used to establish the IB connection, so one cannot benchmark "real" multicast scenarios with more than one receiver. However, here are the results (I just used ipoib to let it create some multicast groups for me): | mh at mhtest0:~$ ib_send_bw -c UD -g mhtest1 | ------------------------------------------------------------------ | Send BW Multicast Test | Connection type : UD | Max msg size in UD is 2048 changing to 2048 | Inline data is used up to 400 bytes message | local address: LID 0x01, QPN 0x4a0405, PSN 0x8667a7 | remote address: LID 0x03, QPN 0x4a0405, PSN 0x5d41b6 | Mtu : 2048 | ------------------------------------------------------------------ | #bytes #iterations BW peak[MB/sec] BW average[MB/sec] | 2048 1000 301.12 247.05 | ------------------------------------------------------------------ This is the same result as my own benchmark showed in that scenario. > 2) You should expect that multicast messages will be slower than > unicast because the HCA/switch treat them in different way > (message duplication need to be done if needed). Yes, but 250MB/s vs. 1100MB/s (UD unicast throughput) seems to be a bit too much of overhead, don't you think? Especially if I take into account that with my own benchmark, I can get ~950MB/s when I start another receiver on the same host as the sender. Note that both of the receivers, the local and the remote one, are seeing all packets at that rate, so the HCAs and the switch must be able to handle multicast packets with this throughput. The other strange thing is that multicast traffics slows down other traffic way more than the bandwith it consumes. Moreover, it seems like it limits any other connections to the same throughput than that of the multicast traffic, which looks suspicious to me. 
The same behavior can be reproduced with ib_send_bw, by starting a unicast and multicast run in parallel: | mh at mhtest0:~$ ib_send_bw -c UD mhtest1 & ib_send_bw -c UD -g mhtest1\ | -p 18516 | ./ib_send_bw -c UD -g mhtest1 -p 18516 | [1] 4927 | ------------------------------------------------------------------ | Send BW Test | Connection type : UD | Max msg size in UD is 2048 changing to 2048 | ------------------------------------------------------------------ | Send BW Multicast Test | Connection type : UD | Max msg size in UD is 2048 changing to 2048 | Inline data is used up to 400 bytes message | Inline data is used up to 400 bytes message | local address: LID 0x01, QPN 0x530405, PSN 0xe98523 | local address: LID 0x01, QPN 0x530406, PSN 0x3b338e | remote address: LID 0x03, QPN 0x540405, PSN 0x5c53e2 | Mtu : 2048 | remote address: LID 0x03, QPN 0x540406, PSN 0xff883f | Mtu : 2048 | ------------------------------------------------------------------ | #bytes #iterations BW peak[MB/sec] BW average[MB/sec] | ------------------------------------------------------------------ | #bytes #iterations BW peak[MB/sec] BW average[MB/sec] | 2048 1000 692.41 270.26 | 2048 1000 246.00 244.68 | ------------------------------------------------------------------ | ------------------------------------------------------------------ Doing 2 unicast UD runs in parallel, I'm getting ~650MB/s average bandwidth for each, which sounds reasonable. Also, when using bidirectional mode, I'm getting ~1900MB/s (almost doubled) throughput for unicast, but still ~250MB/s for multicast. Regards, Marcel From tziporet at dev.mellanox.co.il Thu May 29 02:35:05 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 29 May 2008 12:35:05 +0300 Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support In-Reply-To: <483DCEA8.20505@opengridcomputing.com> References: <20080527183429.32168.14351.stgit@dell3.ogc.int> <20080527183549.32168.22959.stgit@dell3.ogc.int> <483CBDF0.7030209@opengridcomputing.com> <483DCEA8.20505@opengridcomputing.com> Message-ID: <483E78C9.7080209@mellanox.co.il> Steve Wise wrote: > Yes, I have already said I'll post a test case. :) > > The krping tool will be the culprit. It's the kernel equivalent of > rping and has been around for a long time in one form or another. > > It is available at git://git.openfabrics.org/~swise/krping > Do you think we should include it in OFED as we include user space examples? Tziporet From ogerlitz at voltaire.com Thu May 29 02:39:57 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 29 May 2008 12:39:57 +0300 Subject: [ofa-general] Re: [RFC V4 PATCH 4/5] rdma/cma: implement RDMA_CM_EVENT_ADDR_CHANGE notification In-Reply-To: <000901c8c0f5$ddee4680$6758180a@amr.corp.intel.com> References: <000901c8c0f5$ddee4680$6758180a@amr.corp.intel.com> Message-ID: <483E79ED.4050301@voltaire.com> Sean Hefty wrote: > Okay - I think we're pretty close on the rdma_cm side of things. Thanks. > Sean, I have implemented all your last comments, so I think that rdma_cm wise we are kind of ready. The review of the bonding patch in netdev has just started and I want to get some progress there and testing before sending you the final set of the patches. Or.
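Since the bonding piece is what is now under review in netdev, a self-contained sketch of the notifier mechanism the ADDR_CHANGE patch hangs off. It assumes the NETDEV_BONDING_FAILOVER event introduced by that patch set; the real rdma_cm handler additionally walks its IDs, as in the hunks Sean reviewed above, while the names here are illustrative:

    #include <linux/module.h>
    #include <linux/netdevice.h>
    #include <linux/if.h>

    static int failover_event(struct notifier_block *nb, unsigned long event,
                              void *ctx)
    {
            struct net_device *ndev = ctx;

            /* Only react to failover on a bonding master device. */
            if (event == NETDEV_BONDING_FAILOVER &&
                (ndev->flags & IFF_MASTER) && (ndev->priv_flags & IFF_BONDING))
                    printk(KERN_INFO "failover on bonding master %s\n",
                           ndev->name);
            return NOTIFY_DONE;
    }

    static struct notifier_block failover_nb = {
            .notifier_call = failover_event,
    };

    static int __init failover_init(void)
    {
            return register_netdevice_notifier(&failover_nb);
    }

    static void __exit failover_exit(void)
    {
            unregister_netdevice_notifier(&failover_nb);
    }

    module_init(failover_init);
    module_exit(failover_exit);
    MODULE_LICENSE("GPL");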
From vlad at lists.openfabrics.org Thu May 29 03:09:11 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 29 May 2008 03:09:11 -0700 (PDT) Subject: [ofa-general] ofa_1_3_kernel 20080529-0200 daily build status Message-ID: <20080529100911.A200AE60E00@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git git_branch: ofed_kernel Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod Passed: Passed on i686 with 2.6.15-23-server Passed on i686 with linux-2.6.13 Passed on i686 with linux-2.6.12 Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.15 Passed on i686 with linux-2.6.14 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on x86_64 with linux-2.6.12 Passed on x86_64 with linux-2.6.14 Passed on x86_64 with linux-2.6.15 Passed on x86_64 with linux-2.6.13 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.12 Passed on ia64 with linux-2.6.13 Passed on ia64 with linux-2.6.15 Passed on ia64 with linux-2.6.14 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on powerpc with linux-2.6.12 Passed on powerpc with linux-2.6.14 Passed on powerpc with linux-2.6.13 Passed on powerpc with linux-2.6.15 Passed on ppc64 with linux-2.6.12 Passed on ppc64 with linux-2.6.13 Passed on ppc64 with linux-2.6.14 Passed on ppc64 with linux-2.6.15 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.24 Passed on ppc64 with linux-2.6.19 Failed: From hrosenstock at xsigo.com Thu May 29 04:34:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 04:34:13 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? 
In-Reply-To: <20080529055641.GB16570@sashak.voltaire.com> References: <483D62E6.2050107@mellanox.co.il> <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com> <20080529055641.GB16570@sashak.voltaire.com> Message-ID: <1212060853.27600.83.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-29 at 08:56 +0300, Sasha Khapyorsky wrote: > On 08:24 Wed 28 May , Hal Rosenstock wrote: > > On Wed, 2008-05-28 at 10:31 -0400, Talpey, Thomas wrote: > > > In addition we needed to install > > > libmthca > > > which in turn wanted > > > libibverbs > > > > AFAIK, there is no OpenSM requirement for these libraries. > > Right, it is not needed. Actually it looks like a bug in OFED's > install.pl script. > > > And, after we successfully loaded it all and started opensm, etc, > > > then ipoib wouldn't come up to RUNNING because it couldn't join > > > the broadcast group. > > > > This was likely due to IPoIB needing to be enabled on the default > > partition. > > The default partition is enabled by default. and so is ipoib on that partition (I forgot how this worked for the no config file case) so the problem was something else (perhaps rate or MTU mismatch if defaults were not adequate for the b2b configuration ?). -- Hal > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hrosenstock at xsigo.com Thu May 29 04:36:04 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 04:36:04 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <20080529055641.GB16570@sashak.voltaire.com> References: <483D62E6.2050107@mellanox.co.il> <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com> <20080529055641.GB16570@sashak.voltaire.com> Message-ID: <1212060964.27600.87.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-29 at 08:56 +0300, Sasha Khapyorsky wrote: > On 08:24 Wed 28 May , Hal Rosenstock wrote: > > On Wed, 2008-05-28 at 10:31 -0400, Talpey, Thomas wrote: > > > In addition we needed to install > > > libmthca > > > which in turn wanted > > > libibverbs > > > > AFAIK, there is no OpenSM requirement for these libraries. > > Right, it is not needed. Actually it looks like a bug in OFED's > install.pl script. What about getting the FC package updated too ? I thought that uses different packaging. -- Hal From richard.frank at oracle.com Thu May 29 04:52:11 2008 From: richard.frank at oracle.com (Richard Frank) Date: Thu, 29 May 2008 07:52:11 -0400 Subject: [rds-devel] [ofa-general] Port space sharing in RDS In-Reply-To: <20080529000354.GD6288@opengridcomputing.com> References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> Message-ID: <483E98EB.1020308@oracle.com> I see no problem with this - but defer to Olaf. Olaf is currently presenting at LinuxTag in Berlin - his responses may be delayed a day or two. Jon Mason wrote: > On Wed, May 28, 2008 at 04:33:06PM -0700, Sean Hefty wrote: > >>> During RDS init, rds_ib_init and rds_tcp_init will both individually bind to >>> the >>> RDS port for all IP addresses. Unfortunately, that will not work for iWARP for >>> 2 major reasons. >>> >> Can RDS use different port numbers for its RDMA and TCP protocols?
The wire >> > > I do not know if this is desirable, but a quick test shows that having TCP and IB > on different ports works around the problem. > > >> protocols end up being different when running over TCP versus iWarp. >> >> - Sean >> >> > > _______________________________________________ > rds-devel mailing list > rds-devel at oss.oracle.com > http://oss.oracle.com/mailman/listinfo/rds-devel > From eli at dev.mellanox.co.il Thu May 29 05:15:18 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 29 May 2008 15:15:18 +0300 Subject: [ofa-general] IB/ipoib: Fix CM connection premature destruction Message-ID: <1212063318.13769.186.camel@mtls03> >From 24e88d727dbbb7fd491edb57416f5cb0d4009f1d Mon Sep 17 00:00:00 2001 From: Eli Cohen Date: Thu, 29 May 2008 15:13:25 +0300 Subject: [PATCH] IB/ipoib: Fix CM connection premature destruction Destroy the CM connection at ipoib_cm_tx_destroy() after the TX queue is flushed. Failure to do so might cause the cm_id to be allocated again while pending TX completions which have not been reported yet move the connection to the reap list again, causing it to be destroyed before it has been used. The overall effect would be to delay the creation of a new connection. Signed-off-by: Eli Cohen --- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 819c027..a40e649 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -1113,9 +1113,6 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) ipoib_dbg(priv, "Destroy active connection 0x%x head 0x%x tail 0x%x\n", p->qp ? p->qp->qp_num : 0, p->tx_head, p->tx_tail); - if (p->id) - ib_destroy_cm_id(p->id); - if (p->tx_ring) { /* Wait for all sends to complete */ begin = jiffies; @@ -1131,6 +1128,8 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) } timeout: + if (p->id) + ib_destroy_cm_id(p->id); while ((int) p->tx_tail - (int) p->tx_head < 0) { tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)]; -- 1.5.5.1 From hrosenstock at xsigo.com Thu May 29 05:46:21 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 05:46:21 -0700 Subject: [ofa-general] Multicast Performance In-Reply-To: <483E7520.1000302@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> Message-ID: <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-29 at 11:19 +0200, Marcel Heinz wrote: > Hi, > > Dotan Barak wrote: > > Marcel Heinz wrote: > >> [low multicast throughput of ~250MB/s with own benchmark tool] > > > > 1) I know that ib_send_bw supports multicast as well, can you please > > check that you can reproduce your problem > > on this benchmark too? > > Well, the last time I've checked this, ib_send_bw didn't support > multicast, but this was some months ago. That multicast support seems > a bit odd, since it doesn't create/join the multicast groups and there > is still a 1:1 TCP connection used to establish the IB connection, so > one cannot benchmark "real" multicast scenarios with more than one > receiver.
> > However, here are the results (I just used ipoib to let it create some > multicast groups for me): > > | mh at mhtest0:~$ ib_send_bw -c UD -g mhtest1 > | ------------------------------------------------------------------ > | Send BW Multicast Test > | Connection type : UD > | Max msg size in UD is 2048 changing to 2048 > | Inline data is used up to 400 bytes message > | local address: LID 0x01, QPN 0x4a0405, PSN 0x8667a7 > | remote address: LID 0x03, QPN 0x4a0405, PSN 0x5d41b6 > | Mtu : 2048 > | ------------------------------------------------------------------ > | #bytes #iterations BW peak[MB/sec] BW average[MB/sec] > | 2048 1000 301.12 247.05 > | ------------------------------------------------------------------ > > This is the same result as my own benchmark showed in that scenario. > > > 2) You should expect that multicast messages will be slower than > > unicast because the HCA/switch treat them in different way > > (message duplication need to be done if needed). > > Yes, but 250MB/s vs. 1100MB/s (UD unicast throughput) seems to be a bit > too much of overhead, don't you think? Agreed. > Especially if I take into account > that with my own benchmark, I can get ~950MB/s when I start another > receiver on the same host as the sender. Note that both of the > receivers, the local and the remote one, are seeing all packets at that > rate, so the HCAs and the switch must be able to handle multicast > packets with this throughput. Perhaps this is a static rate issue. What SM is being used ? -- Hal > The other strange thing is that multicast traffic slows down other > traffic way more than the bandwidth it consumes. Moreover, it seems like > it limits any other connections to the same throughput as that of the > multicast traffic, which looks suspicious to me. > > The same behavior can be reproduced with ib_send_bw, by starting a > unicast and multicast run in parallel: > > | mh at mhtest0:~$ ib_send_bw -c UD mhtest1 & ib_send_bw -c UD -g mhtest1\ > | -p 18516 > | ./ib_send_bw -c UD -g mhtest1 -p 18516 > | [1] 4927 > | ------------------------------------------------------------------ > | Send BW Test > | Connection type : UD > | Max msg size in UD is 2048 changing to 2048 > | ------------------------------------------------------------------ > | Send BW Multicast Test > | Connection type : UD > | Max msg size in UD is 2048 changing to 2048 > | Inline data is used up to 400 bytes message > | Inline data is used up to 400 bytes message > | local address: LID 0x01, QPN 0x530405, PSN 0xe98523 > | local address: LID 0x01, QPN 0x530406, PSN 0x3b338e > | remote address: LID 0x03, QPN 0x540405, PSN 0x5c53e2 > | Mtu : 2048 > | remote address: LID 0x03, QPN 0x540406, PSN 0xff883f > | Mtu : 2048 > | ------------------------------------------------------------------ > | #bytes #iterations BW peak[MB/sec] BW average[MB/sec] > | ------------------------------------------------------------------ > | #bytes #iterations BW peak[MB/sec] BW average[MB/sec] > | 2048 1000 692.41 270.26 > | 2048 1000 246.00 244.68 > | ------------------------------------------------------------------ > | ------------------------------------------------------------------ > > Doing 2 unicast UD runs in parallel, I'm getting ~650MB/s average > bandwidth for each, which sounds reasonable. > > Also, when using bidirectional mode, I'm getting ~1900MB/s (almost > doubled) throughput for unicast, but still ~250MB/s for multicast.
> > Regards, > Marcel > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Thu May 29 06:08:42 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 May 2008 16:08:42 +0300 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <1212060964.27600.87.camel@hrosenstock-ws.xsigo.com> References: <483D62E6.2050107@mellanox.co.il> <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com> <20080529055641.GB16570@sashak.voltaire.com> <1212060964.27600.87.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080529130842.GU4616@sashak.voltaire.com> On 04:36 Thu 29 May , Hal Rosenstock wrote: > > What about getting the FC package updated too ? I thought that uses > different packaging. I don't know the FC story. The spec files which are in the management tree don't have such dependencies. Sasha From sashak at voltaire.com Thu May 29 06:10:31 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 May 2008 16:10:31 +0300 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <1212060853.27600.83.camel@hrosenstock-ws.xsigo.com> References: <483D62E6.2050107@mellanox.co.il> <1211988269.27600.32.camel@hrosenstock-ws.xsigo.com> <20080529055641.GB16570@sashak.voltaire.com> <1212060853.27600.83.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080529131031.GV4616@sashak.voltaire.com> On 04:34 Thu 29 May , Hal Rosenstock wrote: > > > > The default partition is enabled by default. > > and so is ipoib on that partition (I forgot how this worked for the no > config file case) so the problem was something else (perhaps rate or MTU > mismatch if defaults were not adequate for the b2b configuration ?). Yes, this should be something else. Sasha From hrosenstock at xsigo.com Thu May 29 06:22:52 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 06:22:52 -0700 Subject: [ofa-general] [PATCHv2] management: Support separate SA and SM keys as clarified in IBA 1.2.1 Message-ID: <1212067372.17997.43.camel@hrosenstock-ws.xsigo.com> management: Support separate SA and SM keys as clarified in IBA 1.2.1 v2 is just a rebase to the latest tree Signed-off-by: Hal Rosenstock diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index ed61721..ccf7bdd 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -730,7 +730,7 @@ get_all_records(osm_bind_handle_t bind_handle, int trusted) { return get_any_records(bind_handle, query_id, 0, 0, NULL, attr_offset, - trusted ? OSM_DEFAULT_SM_KEY : 0); + trusted ? OSM_DEFAULT_SA_KEY : 0); } /** @@ -1255,7 +1255,7 @@ print_pkey_tbl_records(const struct query_cmd *q, osm_bind_handle_t bind_handle, status = get_any_records(bind_handle, IB_MAD_ATTR_PKEY_TBL_RECORD, 0, comp_mask, &pktr, ib_get_attr_offset(sizeof(pktr)), - OSM_DEFAULT_SM_KEY); + OSM_DEFAULT_SA_KEY); if (status != IB_SUCCESS) return status; diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h index 289e49e..07cc407 100644 --- a/opensm/include/opensm/osm_base.h +++ b/opensm/include/opensm/osm_base.h @@ -119,6 +119,17 @@ BEGIN_C_DECLS */ #define OSM_DEFAULT_SM_KEY 1 /********/ +/****s* OpenSM: Base/OSM_DEFAULT_SA_KEY +* NAME +* OSM_DEFAULT_SA_KEY +* +* DESCRIPTION +* Subnet Administration key value.
+* +* SYNOPSIS +*/ +#define OSM_DEFAULT_SA_KEY 1 +/********/ /****s* OpenSM: Base/OSM_DEFAULT_LMC * NAME * OSM_DEFAULT_LMC diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index d84c5a2..1b862c0 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -209,6 +209,7 @@ typedef struct _osm_subn_opt { ib_net64_t guid; ib_net64_t m_key; ib_net64_t sm_key; + ib_net64_t sa_key; ib_net64_t subnet_prefix; ib_net16_t m_key_lease_period; uint32_t sweep_interval; @@ -295,7 +296,10 @@ typedef struct _osm_subn_opt { * M_Key value sent to all ports qualifing all Set(PortInfo). * * sm_key -* SM_Key value of the SM to qualify rcv SA queries as "trusted". +* SM_Key value of the SM used for SM authentication. +* +* sa_key +* SM_Key value to qualify rcv SA queries as "trusted". * * subnet_prefix * Subnet prefix used on this subnet. diff --git a/opensm/opensm/osm_sa_mad_ctrl.c b/opensm/opensm/osm_sa_mad_ctrl.c index 78fdec7..abd8d02 100644 --- a/opensm/opensm/osm_sa_mad_ctrl.c +++ b/opensm/opensm/osm_sa_mad_ctrl.c @@ -340,11 +340,11 @@ __osm_sa_mad_ctrl_rcv_callback(IN osm_madw_t * p_madw, * otherwise discard the MAD. */ if ((p_sa_mad->sm_key != 0) && - (p_sa_mad->sm_key != p_ctrl->p_subn->opt.sm_key)) { + (p_sa_mad->sm_key != p_ctrl->p_subn->opt.sa_key)) { OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "ERR 1A04: " "Non-Zero SA MAD SM_Key: 0x%" PRIx64 " != SM_Key: 0x%" PRIx64 "; MAD ignored\n", cl_ntoh64(p_sa_mad->sm_key), - cl_ntoh64(p_ctrl->p_subn->opt.sm_key) + cl_ntoh64(p_ctrl->p_subn->opt.sa_key) ); osm_mad_pool_put(p_ctrl->p_mad_pool, p_madw); goto Exit; diff --git a/opensm/opensm/osm_sa_pkey_record.c b/opensm/opensm/osm_sa_pkey_record.c index 5cea525..4d19ed4 100644 --- a/opensm/opensm/osm_sa_pkey_record.c +++ b/opensm/opensm/osm_sa_pkey_record.c @@ -269,7 +269,7 @@ void osm_pkey_rec_rcv_process(IN void *ctx, IN void *data) to trusted requests. Check that the requester is a trusted one. */ - if (p_rcvd_mad->sm_key != sa->p_subn->opt.sm_key) { + if (p_rcvd_mad->sm_key != sa->p_subn->opt.sa_key) { /* This is not a trusted requester! 
*/ OSM_LOG(sa->p_log, OSM_LOG_ERROR, "ERR 4608: " "Request from non-trusted requester: " diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 9d1fbeb..d1e25ef 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -387,6 +387,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->guid = 0; p_opt->m_key = OSM_DEFAULT_M_KEY; p_opt->sm_key = OSM_DEFAULT_SM_KEY; + p_opt->sa_key = OSM_DEFAULT_SA_KEY; p_opt->subnet_prefix = IB_DEFAULT_SUBNET_PREFIX; p_opt->m_key_lease_period = 0; p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS; @@ -1161,6 +1162,8 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) opts_unpack_net64("sm_key", p_key, p_val, &p_opts->sm_key); + opts_unpack_net64("sa_key", p_key, p_val, &p_opts->sa_key); + opts_unpack_net64("subnet_prefix", p_key, p_val, &p_opts->subnet_prefix); @@ -1401,8 +1404,10 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts) "m_key 0x%016" PRIx64 "\n\n" "# The lease period used for the M_Key on this subnet in [sec]\n" "m_key_lease_period %u\n\n" - "# SM_Key value of the SM to qualify rcv SA queries as 'trusted'\n" + "# SM_Key value of the SM used for SM authentication\n" "sm_key 0x%016" PRIx64 "\n\n" + "# SM_Key value to qualify rcv SA queries as 'trusted'\n" + "sa_key 0x%016" PRIx64 "\n\n" "# Subnet prefix used on this subnet\n" "subnet_prefix 0x%016" PRIx64 "\n\n" "# The LMC value used on this subnet\n" @@ -1456,6 +1461,7 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts) cl_ntoh64(p_opts->m_key), cl_ntoh16(p_opts->m_key_lease_period), cl_ntoh64(p_opts->sm_key), + cl_ntoh64(p_opts->sa_key), cl_ntoh64(p_opts->subnet_prefix), p_opts->lmc, p_opts->lmc_esp0 ? "TRUE" : "FALSE", From marcel.heinz at informatik.tu-chemnitz.de Thu May 29 06:35:22 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Thu, 29 May 2008 15:35:22 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> Message-ID: <483EB11A.5000000@informatik.tu-chemnitz.de> Hal Rosenstock wrote: > On Thu, 2008-05-29 at 11:19 +0200, Marcel Heinz wrote: >>Dotan Barak wrote: >>>Marcel Heinz wrote: >>> >>>>[low multicast throughput of ~250MB/s with own benchmark tool] >>> >>>1) I know that ib_send_bw supports multicast as well, can you please >>>check that you can reproduce your problem >>>on this benchmark too? >> >>| #bytes #iterations BW peak[MB/sec] BW average[MB/sec] >>| 2048 1000 301.12 247.05 >> >>This is the same result as my own benchmark showed in that scenario. >> >> >>>2) You should expect that multicast messages will be slower than >>>unicast because the HCA/switch treat them in different way >>>(message duplication need to be done if needed). >> >>Yes, but 250MB/s vs. 1100MB/s (UD unicast throughput) seems to be a bit >>too much of overhead, don't you think? > > > Agreed. > > >>Especially if I take into account >>that with my own benchmark, I can get ~950MB/s when I start another >>receiver on the same host as the sender. 
Note that both of the >>receivers, the local and the remote one, are seeing all packets at that >>rate, so the HCAs and the switch must be able to handle multicast >>packets with this throughput. > > Perhaps this is a static rate issue. > > What SM is being used ? It's OpenSM 3.1.7. I had also made some tests with OpenSM 3.2.1, but this didn't change anything. Regards, Marcel From hrosenstock at xsigo.com Thu May 29 06:37:23 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 06:37:23 -0700 Subject: [ofa-general] Multicast Performance In-Reply-To: <483EB11A.5000000@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> <483EB11A.5000000@informatik.tu-chemnitz.de> Message-ID: <1212068243.17997.48.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-29 at 15:35 +0200, Marcel Heinz wrote: > Hal Rosenstock wrote: > > On Thu, 2008-05-29 at 11:19 +0200, Marcel Heinz wrote: > >>Dotan Barak wrote: > >>>Marcel Heinz wrote: > >>> > >>>>[low multicast throughput of ~250MB/s with own benchmark tool] > >>> > >>>1) I know that ib_send_bw supports multicast as well, can you please > >>>check that you can reproduce your problem > >>>on this benchmark too? > >> > >>| #bytes #iterations BW peak[MB/sec] BW average[MB/sec] > >>| 2048 1000 301.12 247.05 > >> > >>This is the same result as my own benchmark showed in that scenario. > >> > >> > >>>2) You should expect that multicast messages will be slower than > >>>unicast because the HCA/switch treat them in different way > >>>(message duplication need to be done if needed). > >> > >>Yes, but 250MB/s vs. 1100MB/s (UD unicast throughput) seems to be a bit > >>too much of overhead, don't you think? > > > > > > Agreed. > > > > > >>Especially if I take into account > >>that with my own benchmark, I can get ~950MB/s when I start another > >>receiver on the same host as the sender. Note that both of the > >>receivers, the local and the remote one, are seeing all packets at that > >>rate, so the HCAs and the switch must be able to handle multicast > >>packets with this throughput. > > > > > > Perhaps this is a static rate issue. > > > > What SM is being used ? > > It's OpenSM 3.1.7. I had also made some tests with OpenSM 3.2.1, but > this didn't change anything. Can you validate either the PathRecord or MCMemberRecord returned or the static rate applied to the multicast QP in the various scenarios ? If it is the same, this is not the problem but if it's different then we're on to something here.
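(One way to capture what the SA actually hands back for the group is to join through the librdmacm UD multicast path and dump the address-handle attributes delivered with the join event. A sketch for illustration only, under the assumption that the application joins via rdma_join_multicast(); it is not taken from the benchmark under discussion.)

    #include <stdio.h>
    #include <rdma/rdma_cma.h>

    /* After rdma_join_multicast(), the CM reports the join result with
     * RDMA_CM_EVENT_MULTICAST_JOIN; param.ud carries the qkey and the
     * ah_attr (including static_rate) the SA returned for the group. */
    static void dump_mcast_join(struct rdma_cm_event *event)
    {
            if (event->event == RDMA_CM_EVENT_MULTICAST_JOIN) {
                    struct rdma_ud_param *ud = &event->param.ud;

                    printf("mcast join: qkey 0x%x dlid 0x%x static_rate %u sl %u\n",
                           (unsigned) ud->qkey, (unsigned) ud->ah_attr.dlid,
                           (unsigned) ud->ah_attr.static_rate,
                           (unsigned) ud->ah_attr.sl);
            }
    }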
-- Hal > Regards, > Marcel > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From marcel.heinz at informatik.tu-chemnitz.de Thu May 29 07:32:53 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Thu, 29 May 2008 16:32:53 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <1212068243.17997.48.camel@hrosenstock-ws.xsigo.com> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> <483EB11A.5000000@informatik.tu-chemnitz.de> <1212068243.17997.48.camel@hrosenstock-ws.xsigo.com> Message-ID: <483EBE95.60901@informatik.tu-chemnitz.de> Hi, Hal Rosenstock wrote: > On Thu, 2008-05-29 at 15:35 +0200, Marcel Heinz wrote: >>Hal Rosenstock wrote: >>>On Thu, 2008-05-29 at 11:19 +0200, Marcel Heinz wrote: >>>>Especially if I take into account >>>>that with my own benchmark, I can get ~950MB/s when I start another >>>>receiver on the same host as the sender. Note that both of the >>>>receivers, the local and the remote one, are seeing all packets at that >>>>rate, so the HCAs and the switch must be able to handle multicast >>>>packets with this throughput. >>> >>> >>>Perhaps this is a static rate issue. >>> >>>What SM is being used ? >> >>It's OpenSM 3.1.7. I had also made some tests with OpensSM 3.2.1, but >>this didn't change anything. > > > Can you validate either the PathRecord or MCMemberRecord returned or the > static rate applied to the multicast QP in the various scenarios ? If it > is the same, this is not the problem but if it's different then we're on > to something here. > This is what happened: 1. The server on host B is started and creates the MC group, OpenSM returns: | May 29 15:54:34 699610 [B6D71B90] 0x08 -> MCMember Record dump: | MGID....................0xff12000000000000 : 0x00010002deadbeef | PortGid.................0xfe80000000000000 : 0x0002c9020025abdd | qkey....................0xABCD | mlid....................0xC000 | mtu.....................0x84 | TClass..................0x0 | pkey....................0x7FFF | rate....................0x86 | pkt_life................0x80 | SLFlowLabelHopLimit.....0x0 | ScopeState..............0x21 | ProxyJoin...............0x0 2. The client on host A is started and joins to the group as SendOnlyNonMember, OpenSM returns: | May 29 15:54:45 381972 [B5D6FB90] 0x08 -> MCMember Record dump: | MGID....................0xff12000000000000 : 0x00010002deadbeef | PortGid.................0xfe80000000000000 : 0x0002c9020025abed | qkey....................0xABCD | mlid....................0xC000 | mtu.....................0x84 | TClass..................0x0 | pkey....................0x7FFF | rate....................0x86 | pkt_life................0x80 | SLFlowLabelHopLimit.....0x0 | ScopeState..............0x4 | ProxyJoin...............0x0 Now I have 255MB/s between host A and B. 3. 
I start another server on host A, it joins to the group and OpenSM returns: | May 29 15:54:56 129971 [B6570B90] 0x08 -> MCMember Record dump: | MGID....................0xff12000000000000 : 0x00010002deadbeef | PortGid.................0xfe80000000000000 : 0x0002c9020025abed | qkey....................0xABCD | mlid....................0xC000 | mtu.....................0x84 | TClass..................0x0 | pkey....................0x7FFF | rate....................0x86 | pkt_life................0x80 | SLFlowLabelHopLimit.....0x0 | ScopeState..............0x25 | ProxyJoin...............0x0 Now, all 3 instances measure 950MB/s throughput. The returned MCMember Records are absolutely identical except for the PortGid and the membership state. How can I find out the static rate applied to the multicast QP? Regards, Marcel From xma at us.ibm.com Thu May 29 07:37:12 2008 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 29 May 2008 07:37:12 -0700 Subject: [ofa-general] [PATCH v2] IB/ipoib: copy small SKBs in CM mode In-Reply-To: <1211976303.13769.155.camel@mtls03> Message-ID: Hello Eli, > > In this case, how many tx drop packets from ifconfig output? Should we > > see ifconfig tx drop packets + tx successfully transmit packets close > > to netperf packets? > That's right. I am looking at ipoib_cm_handle_tx_wc(); the tx dropped-packet count is not increased in this situation, so the tx transmit packet count should be close to the number of netperf send packets. void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) { ... tx_req = &tx->tx_ring[wr_id]; ib_dma_unmap_single(priv->ca, tx_req->mapping[0], tx_req->skb->len, DMA_TO_DEVICE); /* FIXME: is this right? Shouldn't we only increment on success? */ ++dev->stats.tx_packets; dev->stats.tx_bytes += tx_req->skb->len; ... } > > Any TCP STREAM test results to share here? > TCP won't demonstrate the problem since it uses Nagle's algorithm to > aggregate data into full sized packets. So when hitting this RNR retry, the returned error status was flush err, so the packets were silently dropped instead of reporting a "failed cm send event" and clearing the interface up flag? Please correct me if I am wrong. thanks Shirley -------------- next part -------------- An HTML attachment was scrubbed... URL: From hrosenstock at xsigo.com Thu May 29 07:49:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 07:49:13 -0700 Subject: [ofa-general] Multicast Performance In-Reply-To: <483EBE95.60901@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> <483EB11A.5000000@informatik.tu-chemnitz.de> <1212068243.17997.48.camel@hrosenstock-ws.xsigo.com> <483EBE95.60901@informatik.tu-chemnitz.de> Message-ID: <1212072553.17997.65.camel@hrosenstock-ws.xsigo.com> Hi Marcel, On Thu, 2008-05-29 at 16:32 +0200, Marcel Heinz wrote: > Hi, > > Hal Rosenstock wrote: > > On Thu, 2008-05-29 at 15:35 +0200, Marcel Heinz wrote: > >>Hal Rosenstock wrote: > >>>On Thu, 2008-05-29 at 11:19 +0200, Marcel Heinz wrote: > >>>>Especially if I take into account > >>>>that with my own benchmark, I can get ~950MB/s when I start another > >>>>receiver on the same host as the sender.
Note that both of the > >>>>receivers, the local and the remote one, are seeing all packets at that > >>>>rate, so the HCAs and the switch must be able to handle multicast > >>>>packets with this throughput. > >>> > >>> > >>>Perhaps this is a static rate issue. > >>> > >>>What SM is being used ? > >> > >>It's OpenSM 3.1.7. I had also made some tests with OpensSM 3.2.1, but > >>this didn't change anything. > > > > > > Can you validate either the PathRecord or MCMemberRecord returned or the > > static rate applied to the multicast QP in the various scenarios ? If it > > is the same, this is not the problem but if it's different then we're on > > to something here. > > > > This is what happened: > > 1. The server on host B is started and creates the MC group, OpenSM > returns: > > | May 29 15:54:34 699610 [B6D71B90] 0x08 -> MCMember Record dump: > | MGID....................0xff12000000000000 : 0x00010002deadbeef > | PortGid.................0xfe80000000000000 : 0x0002c9020025abdd > | qkey....................0xABCD > | mlid....................0xC000 > | mtu.....................0x84 > | TClass..................0x0 > | pkey....................0x7FFF > | rate....................0x86 > | pkt_life................0x80 > | SLFlowLabelHopLimit.....0x0 > | ScopeState..............0x21 > | ProxyJoin...............0x0 > > 2. The client on host A is started and joins to the group as > SendOnlyNonMember, OpenSM returns: > > | May 29 15:54:45 381972 [B5D6FB90] 0x08 -> MCMember Record dump: > | MGID....................0xff12000000000000 : 0x00010002deadbeef > | PortGid.................0xfe80000000000000 : 0x0002c9020025abed > | qkey....................0xABCD > | mlid....................0xC000 > | mtu.....................0x84 > | TClass..................0x0 > | pkey....................0x7FFF > | rate....................0x86 > | pkt_life................0x80 > | SLFlowLabelHopLimit.....0x0 > | ScopeState..............0x4 > | ProxyJoin...............0x0 > > Now I have 255MB/s between host A and B. > > 3. I start another server on host A, it joines to the group and > OpenSM returns: > > | May 29 15:54:56 129971 [B6570B90] 0x08 -> MCMember Record dump: > | MGID....................0xff12000000000000 : 0x00010002deadbeef > | PortGid.................0xfe80000000000000 : 0x0002c9020025abed > | qkey....................0xABCD > | mlid....................0xC000 > | mtu.....................0x84 > | TClass..................0x0 > | pkey....................0x7FFF > | rate....................0x86 > | pkt_life................0x80 > | SLFlowLabelHopLimit.....0x0 > | ScopeState..............0x25 > | ProxyJoin...............0x0 > > Now, all 3 instances measure 950MB/s throughput. > > The returned MCMember Records are absolutely identical except > for the PortGid and the membership state. Rate 0x86 is exactly 20 Gbps. > How can I find out the static rate applied to the multicast QP? Given the above, I don't see this as a likely suspect but you should be able to query the ah used for sending and look in the ah_attr for static_rate. 
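(To make the encodings concrete: in SA records the mtu and rate octets carry a two-bit selector, where 2 means "exactly", above a six-bit value, so 0x84 decodes to exactly MTU 2048 and 0x86 to exactly 20 Gbps, matching the dumps quoted above. A small standalone decoder, using the IBA 1.2 value tables, for illustration only:)

    #include <stdio.h>

    static const char *rate_str(int v)
    {
            switch (v) {
            case 2:  return "2.5 Gbps";
            case 3:  return "10 Gbps";
            case 4:  return "30 Gbps";
            case 5:  return "5 Gbps";
            case 6:  return "20 Gbps";
            case 7:  return "40 Gbps";
            case 8:  return "60 Gbps";
            case 9:  return "80 Gbps";
            case 10: return "120 Gbps";
            default: return "reserved";
            }
    }

    int main(void)
    {
            unsigned char rate = 0x86, mtu = 0x84; /* values from the dumps */

            /* top two bits: selector (2 = exactly); low six bits: value */
            printf("rate 0x%02x: selector %d, %s\n",
                   rate, rate >> 6, rate_str(rate & 0x3f));
            printf("mtu  0x%02x: selector %d, %d bytes\n",
                   mtu, mtu >> 6, 256 << ((mtu & 0x3f) - 1));
            return 0;
    }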
-- Hal > Regards, > Marcel From eli at dev.mellanox.co.il Thu May 29 08:14:51 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 29 May 2008 18:14:51 +0300 Subject: [ofa-general] Re: IB/ipoib: Fix CM connection premature destruction In-Reply-To: <1212063318.13769.186.camel@mtls03> References: <1212063318.13769.186.camel@mtls03> Message-ID: <4e6a6b3c0805290814m5399fc8frc7f0770ef003962f@mail.gmail.com> On Thu, May 29, 2008 at 3:15 PM, Eli Cohen wrote: > >From 24e88d727dbbb7fd491edb57416f5cb0d4009f1d Mon Sep 17 00:00:00 2001 > From: Eli Cohen > Date: Thu, 29 May 2008 15:13:25 +0300 > Subject: [PATCH] IB/ipoib: Fix CM connection premature destruction > > Destroy the CM connection at ipoib_cm_tx_destroy() after the TX > queue is flushed. Failure to do so might cause the cm_id to be > allocated again while pending TX completions which have not been > reported yet move the connection to the reap list again, causing it > to be destroyed before it has been used. The overall effect would be > to delay the creation of a new connection. Thinking it over there's no bug there that I can identify. Please ignore this patch. From marcel.heinz at informatik.tu-chemnitz.de Thu May 29 08:30:11 2008 From: marcel.heinz at informatik.tu-chemnitz.de (Marcel Heinz) Date: Thu, 29 May 2008 17:30:11 +0200 Subject: [ofa-general] Multicast Performance In-Reply-To: <1212072553.17997.65.camel@hrosenstock-ws.xsigo.com> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> <483EB11A.5000000@informatik.tu-chemnitz.de> <1212068243.17997.48.camel@hrosenstock-ws.xsigo.com> <483EBE95.60901@informatik.tu-chemnitz.de> <1212072553.17997.65.camel@hrosenstock-ws.xsigo.com> Message-ID: <483ECC03.8000600@informatik.tu-chemnitz.de> Hi Hal, Hal Rosenstock wrote: > On Thu, 2008-05-29 at 16:32 +0200, Marcel Heinz wrote: > >>How can I find out the static rate applied to the multicast QP? > > Given the above, I don't see this as a likely suspect but you should be > able to query the ah used for sending and look in the ah_attr for > static_rate. Well, there is no query_ah function. The ah I use for sending was created with static_rate 0. The manpage doesn't give any explanation of what static rate is meant to be, but after playing around with it I guess that it is what the spec calls "Inter Packet Delay", right? So 0 should be the correct choice. There is also the ah_attr field of the ib_qp_attr struct which I could query, but this field is not valid for datagram QPs.
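(In kernel space a query verb does exist, ib_query_ah(); a minimal sketch of reading back an AH's static rate with it, for illustration only. Note that not every driver implements query_ah, in which case an error is returned.)

    #include <rdma/ib_verbs.h>

    /* Read back the attributes of an address handle and report the
     * destination LID and static rate stored in it. */
    static int dump_ah_static_rate(struct ib_ah *ah)
    {
            struct ib_ah_attr attr;
            int ret = ib_query_ah(ah, &attr); /* may fail if unimplemented */

            if (!ret)
                    printk(KERN_INFO "ah: dlid 0x%x static_rate %d\n",
                           attr.dlid, attr.static_rate);
            return ret;
    }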
Regards, Marcel From hrosenstock at xsigo.com Thu May 29 08:34:41 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 08:34:41 -0700 Subject: [ofa-general] Multicast Performance In-Reply-To: <483ECC03.8000600@informatik.tu-chemnitz.de> References: <4836E231.4000601@informatik.tu-chemnitz.de> <48371B2D.3040908@gmail.com> <483A7E40.5040407@informatik.tu-chemnitz.de> <483BBBDB.6000605@informatik.tu-chemnitz.de> <483DA512.2070403@gmail.com> <483E7520.1000302@informatik.tu-chemnitz.de> <1212065181.27600.96.camel@hrosenstock-ws.xsigo.com> <483EB11A.5000000@informatik.tu-chemnitz.de> <1212068243.17997.48.camel@hrosenstock-ws.xsigo.com> <483EBE95.60901@informatik.tu-chemnitz.de> <1212072553.17997.65.camel@hrosenstock-ws.xsigo.com> <483ECC03.8000600@informatik.tu-chemnitz.de> Message-ID: <1212075281.17997.89.camel@hrosenstock-ws.xsigo.com> Hi Marcel, On Thu, 2008-05-29 at 17:30 +0200, Marcel Heinz wrote: > Hi Hal, > > Hal Rosenstock wrote: > > On Thu, 2008-05-29 at 16:32 +0200, Marcel Heinz wrote: > >>How can I find out the static rate applied to the multicast QP? > > > > Given the above, I don't see this as a likely suspect but you should be > > able to query the ah used for sending and look in the ah_attr for > > static_rate. > > Well, there is no query_ah function. I was looking at kernel space not user space. Not sure about user space but I think it's moot. > The ah I use for sending was > created with static_rate 0. The manpage doesn't give any explanation > what static rate is meant to be, but after playing around with it I > guess that it is what the spec calls "Inter Packet Delay", right? Yes. > So 0 should be the correct choice. Yes. This isn't the issue and I'm not sure what is. Sorry. -- Hal > There is also the ah_attr field of the ib_qp_attr struct which I could > query, but this field is not valid for datagram QPs. > Regards, > Marcel From jlentini at netapp.com Thu May 29 08:37:15 2008 From: jlentini at netapp.com (James Lentini) Date: Thu, 29 May 2008 11:37:15 -0400 (EDT) Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: <1211981650.13185.362.camel@hrosenstock-ws.xsigo.com> References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> <1211979817.13185.357.camel@hrosenstock-ws.xsigo.com> <1211981650.13185.362.camel@hrosenstock-ws.xsigo.com> Message-ID: On Wed, 28 May 2008, Hal Rosenstock wrote: > On Wed, 2008-05-28 at 09:24 -0400, Talpey, Thomas wrote: > > At 09:03 AM 5/28/2008, Hal Rosenstock wrote: > > >On Wed, 2008-05-28 at 08:56 -0400, Talpey, Thomas wrote: > > >> At 08:39 AM 5/28/2008, Hal Rosenstock wrote: > > >> >Tom, > > >> > > > >> >On Wed, 2008-05-28 at 08:06 -0400, Talpey, Thomas wrote: > > >> >> Is it possible to manually configure two Infiniband ports to operate > > >> >> with one another in back-to-back mode, without running OpenSM > > >> >> on one of them? > > >> > > > >> >This is possible but something would need to do at least some subset of > > >> >what the SM does depending on the precise requirements and the limits > > >> >placed on the environment supported without a "full blown" SM. > > >> > > >> Okay ... but IMO the only thing we need is a LID. Or at least, in my > > >experience > > >> all I've needed is a LID. > > > > > >The port also needs to be walked from init to active which takes > > >coordination at both ends of the b2b link. > > > > Yep. But, it has all it needs with a LID, right? No messages need to be > > exchanged, for instance. 
> > >> In a previous effort, we simply stole the low octet of an IP address, so we'd > > >> "ifconfig ib0 1.2.3.X" and it would jam lid=X into the interface. > > >Worked great. > > >> If necessary, we would set a manual arp entry (using iproute) to avoid having > > >> to broadcast. > > > > > >That could be done if that is what is desired and can be relied upon > > >(that ib0 is configured and we only care about the first port). > > > > > >Is it just ARP support that is needed ? > > > > Well, ARP is the precursor to establishing an IP send and a TCP connection, > > which we need to do also. > > I was just asking about other broadcast/multicast needs. Sounds like > this is not the case. > > > But, if the resulting ipaddr-hwaddr mapping is > > installed, then ARP is unnecessary and the IP layer can send without using it. > > > > When we did this before, we'd install a "permanent" ARP entry, in a two-line > > shell script. Roughly, for peers configuring lids X and Y, it would do > > > > peer X: > > ifconfig ib0 1.2.3.X > > ip neigh add 1.2.3.Y nud permanent lladdr a.b.c.d.e.f....Y (i.e. Y's guid) > > > > peer Y: > > ifconfig ib0 1.2.3.Y > > ip neigh add 1.2.3.X nud permanent lladdr a.b.c.d.e.f....X > > > > And we'd be up and running for both IP and RDMA connections. We fixed a > > bug in the old iproute2 command to allow the long IB link addresses. > > > > I'm thinking that using IPOIB to drive this kind of manual setup is one way > > to approach it. It certainly would be simple, and worked for us before there > > was an OFA stack. > > This would still work. > > > Maybe I'm getting ahead of myself though, still wondering if there's a way > > to do it with what we have. > > The closest thing is OpenSM run once mode but I think you've been > describing a b2b mini SM command which wouldn't be hard to implement. Unrelated to NFS/RDMA, I wrote a small kernel module that used MADs to assign a lid, and then transitioned the port to ARMED and ACTIVE. This worked for enabling IB communication, but not IPoIB. In retrospect, I probably could have implemented the same functionality in userspace. > -- Hal > > > Tom. > > > > > >> >> We have done this on other IB implementations by manually assigning > > >> >> LIDs, but I discover that the "lid" entry below > > >> >/sys/class/infiniband/ > > >> >> is not writable, at least for mthca. > > >> > > > >> >This can be done via MADs so user_mad kernel module would be needed to > > >> >do this. > > >> > > >> Okay, all kernel modules can be assumed to be in place. How do we tell it > > >> to manage the LID, with a shell command? > > > > > >A new "command" would be needed. > > > > > >-- Hal > > > > > >> >> Also, I expect that the ipoib driver will > > >> >> be unable to join the broadcast group, so will be unwilling to > > >come up fully. > > >> > > > >> >Is IPoIB a requirement ? > > >> > > >> I think so, for two reasons. One, principle of least surprise - the user will > > >> expect to be able to ping, telnet etc if it has connectivity. Two, > > >for NFS/RDMA > > >> we require TCP and UDP connections in order to perform the mount and do > > >> locking and recovery. We could do those over a parallel ethernet connection, > > >> but that's kind of not the point. > > >> > > >> > > > >> >> With ethernet, and maybe iWARP, just a simple ifconfig can do this. So why
So why > > >> >> not IB? > > >> > > > >> >The simple answer is that it is the nature of IB management (being > > >> >different than ethernet). > > >> > > >> Which, IMO, we need to boil down to simplest-possible, for at least some > > >> workable configuration. > > >> > > >> Thanks for the ideas! > > >> > > >> Tom. > > >> > > >> > > > >> >-- Hal > > >> > > > >> >> If you're wondering, my goal is give NFS/RDMA users a way to avoid having > > >> >> to install the many userspace modules needed to do this, including > > >> >libibverbs, > > >> >> opensm, etc. There's a lot to get wrong, and things go missing. Seeking an > > >> >> "easy" way to get started with just the kernel and some shell commands. > > >> >> > > >> >> Tom. > > >> >> > > >> >> _______________________________________________ > > >> >> general mailing list > > >> >> general at lists.openfabrics.org > > >> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > >> >> > > >> >> To unsubscribe, please visit > > >> >http://openib.org/mailman/listinfo/openib-general > > >> > > > > > >_______________________________________________ > > >general mailing list > > >general at lists.openfabrics.org > > >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hrosenstock at xsigo.com Thu May 29 08:45:28 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 08:45:28 -0700 Subject: [ofa-general] Infiniband back-to-back without OpenSM? In-Reply-To: References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com> <1211979817.13185.357.camel@hrosenstock-ws.xsigo.com> <1211981650.13185.362.camel@hrosenstock-ws.xsigo.com> Message-ID: <1212075928.17997.93.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-29 at 11:37 -0400, James Lentini wrote: > >The closest thing is OpenSM run once mode but I think you've been > >describing a b2b mini SM command which wouldn't be hard to > >implement. > > Unreleated to NFS/RDMA, I wrote a small kernel module that used MADs > to assign a lid, and then transitioned the port to ARMED and ACTIVE. Is this based on OpenIB/OpenFabrics kernel APIs ? > This worked for enabling IB communication, but not IPoIB. I think that IPoIB for b2b mode would be a relatively simple addition. -- Hal From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:53:46 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:23:46 +0530 Subject: [ofa-general] [PATCH v3 00/13] QLogic VNIC Driver Message-ID: <20080529095126.9943.84692.stgit@localhost.localdomain> Roland, This is the third round of QLogic Virtual NIC driver patch series for submission to 2.6.27 kernel. The series has been tested against your for-2.6.27 branch. Based on comments received on second series of patches, following fixes are introduced in this series: - Removal of CONFIG_INFINIBAND_QLGC_VNIC_DEBUG option. - Making few function definitions static. - Removed un-necessary extern declarations. The sparse endianness checking for the driver did not give any warnings and checkpatch.pl have few warnings indicating lines slightly longer than 80 columns. 
Background: As mentioned in the first version of the patch series, this series adds the QLogic Virtual NIC (VNIC) driver, which works in conjunction with the QLogic Ethernet Virtual I/O Controller (EVIC) hardware. The VNIC driver, along with the QLogic EVIC's two 10 Gigabit Ethernet ports, enables Infiniband clusters to connect to Ethernet networks. This driver also works with the earlier version of the I/O Controller, the VEx. The QLogic VNIC driver creates virtual ethernet interfaces and tunnels the Ethernet data to/from the EVIC over Infiniband using an Infiniband reliable connection. [PATCH v3 01/13] QLogic VNIC: Driver - netdev implementation [PATCH v3 02/13] QLogic VNIC: Netpath - abstraction of connection to EVIC/VEx [PATCH v3 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx [PATCH v3 04/13] QLogic VNIC: Implementation of Control path of communication protocol [PATCH v3 05/13] QLogic VNIC: Implementation of Data path of communication protocol [PATCH v3 06/13] QLogic VNIC: IB core stack interaction [PATCH v3 07/13] QLogic VNIC: Handling configurable parameters of the driver [PATCH v3 08/13] QLogic VNIC: sysfs interface implementation for the driver [PATCH v3 09/13] QLogic VNIC: IB Multicast for Ethernet broadcast/multicast [PATCH v3 10/13] QLogic VNIC: Driver Statistics collection [PATCH v3 11/13] QLogic VNIC: Driver utility file - implements various utility macros [PATCH v3 12/13] QLogic VNIC: Driver Kconfig and Makefile. [PATCH v3 13/13] QLogic VNIC: Modifications to IB Kconfig and Makefile drivers/infiniband/Kconfig | 2 drivers/infiniband/Makefile | 1 drivers/infiniband/ulp/qlgc_vnic/Kconfig | 19 drivers/infiniband/ulp/qlgc_vnic/Makefile | 13 drivers/infiniband/ulp/qlgc_vnic/vnic_config.c | 379 +++ drivers/infiniband/ulp/qlgc_vnic/vnic_config.h | 242 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_control.c | 2286 ++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_control.h | 179 ++ .../infiniband/ulp/qlgc_vnic/vnic_control_pkt.h | 368 +++ drivers/infiniband/ulp/qlgc_vnic/vnic_data.c | 1492 +++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_data.h | 206 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c | 1043 +++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h | 206 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_main.c | 1098 ++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_main.h | 154 + drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c | 319 +++ drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h | 77 + drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c | 112 + drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h | 79 + drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c | 234 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h | 497 ++++ drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1133 ++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h | 51 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h | 103 + drivers/infiniband/ulp/qlgc_vnic/vnic_util.h | 236 ++ drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c | 1214 +++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h | 175 ++ 27 files changed, 11918 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Kconfig create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Makefile create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h create
mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_util.h create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h -- Regards, Ram From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:54:23 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:24:23 +0530 Subject: [ofa-general] [PATCH v3 01/13] QLogic VNIC: Driver - netdev implementation In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095423.9943.77528.stgit@localhost.localdomain> From: Ramachandra K QLogic Virtual NIC Driver. This patch implements netdev registration, netdev functions and state maintenance of the QLogic Virtual NIC corresponding to the various events associated with the QLogic Ethernet Virtual I/O Controller (EVIC/VEx) connection. Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_main.c | 1098 ++++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_main.h | 154 ++++ 2 files changed, 1252 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c new file mode 100644 index 0000000..570c069 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c @@ -0,0 +1,1098 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_netpath.h" +#include "vnic_viport.h" +#include "vnic_ib.h" +#include "vnic_stats.h" + +#define MODULEVERSION "1.3.0.0.4" +#define MODULEDETAILS \ + "QLogic Corp. Virtual NIC (VNIC) driver version " MODULEVERSION + +MODULE_AUTHOR("QLogic Corp."); +MODULE_DESCRIPTION(MODULEDETAILS); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_SUPPORTED_DEVICE("QLogic Ethernet Virtual I/O Controller"); + +u32 vnic_debug; + +module_param(vnic_debug, uint, 0444); +MODULE_PARM_DESC(vnic_debug, "Enable debug tracing if > 0"); + +LIST_HEAD(vnic_list); + +static DECLARE_WAIT_QUEUE_HEAD(vnic_npevent_queue); +static LIST_HEAD(vnic_npevent_list); +static DECLARE_COMPLETION(vnic_npevent_thread_exit); +static spinlock_t vnic_npevent_list_lock; +static struct task_struct *vnic_npevent_thread; +static int vnic_npevent_thread_end; + +static const char *const vnic_npevent_str[] = { + "PRIMARY CONNECTED", + "PRIMARY DISCONNECTED", + "PRIMARY CARRIER", + "PRIMARY NO CARRIER", + "PRIMARY TIMER EXPIRED", + "PRIMARY SETLINK", + "SECONDARY CONNECTED", + "SECONDARY DISCONNECTED", + "SECONDARY CARRIER", + "SECONDARY NO CARRIER", + "SECONDARY TIMER EXPIRED", + "SECONDARY SETLINK", + "FREE VNIC", +}; + +void vnic_connected(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_connected()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_CONNECTED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_CONNECTED); + + vnic_connected_stats(vnic); +} + +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_disconnected()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_DISCONNECTED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_DISCONNECTED); +} + +void vnic_link_up(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_link_up()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKUP); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKUP); +} + +void vnic_link_down(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_link_down()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKDOWN); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKDOWN); +} + +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath) +{ + unsigned long flags; + + VNIC_FUNCTION("vnic_stop_xmit()\n"); + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (netpath == vnic->current_path) { + if (!netif_queue_stopped(vnic->netdevice)) { + netif_stop_queue(vnic->netdevice); + vnic->failed_over = 0; + } + + vnic_stop_xmit_stats(vnic); + } + spin_unlock_irqrestore(&vnic->current_path_lock, flags); +} + +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath) +{ + unsigned long flags; + + VNIC_FUNCTION("vnic_restart_xmit()\n"); + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (netpath 
== vnic->current_path) { + if (netif_queue_stopped(vnic->netdevice)) + netif_wake_queue(vnic->netdevice); + + vnic_restart_xmit_stats(vnic); + } + spin_unlock_irqrestore(&vnic->current_path_lock, flags); +} + +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, + struct sk_buff *skb) +{ + VNIC_FUNCTION("vnic_recv_packet()\n"); + if ((netpath != vnic->current_path) || !vnic->open) { + VNIC_INFO("tossing packet\n"); + dev_kfree_skb(skb); + return; + } + + vnic->netdevice->last_rx = jiffies; + skb->dev = vnic->netdevice; + skb->protocol = eth_type_trans(skb, skb->dev); + if (!vnic->config->use_rx_csum) + skb->ip_summed = CHECKSUM_NONE; + netif_rx(skb); + vnic_recv_pkt_stats(vnic); +} + +static struct net_device_stats *vnic_get_stats(struct net_device *device) +{ + struct vnic *vnic; + struct netpath *np; + unsigned long flags; + + VNIC_FUNCTION("vnic_get_stats()\n"); + vnic = netdev_priv(device); + + spin_lock_irqsave(&vnic->current_path_lock, flags); + np = vnic->current_path; + if (np && np->viport) { + atomic_inc(&np->viport->reference_count); + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + viport_get_stats(np->viport, &vnic->stats); + atomic_dec(&np->viport->reference_count); + wake_up(&np->viport->reference_queue); + } else + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + + return &vnic->stats; +} + +static int vnic_open(struct net_device *device) +{ + struct vnic *vnic; + + VNIC_FUNCTION("vnic_open()\n"); + vnic = netdev_priv(device); + + vnic->open++; + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK); + netif_start_queue(vnic->netdevice); + + return 0; +} + +static int vnic_stop(struct net_device *device) +{ + struct vnic *vnic; + int ret = 0; + + VNIC_FUNCTION("vnic_stop()\n"); + vnic = netdev_priv(device); + netif_stop_queue(device); + vnic->open--; + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK); + + return ret; +} + +static int vnic_hard_start_xmit(struct sk_buff *skb, + struct net_device *device) +{ + struct vnic *vnic; + struct netpath *np; + cycles_t xmit_time; + int ret = -1; + + VNIC_FUNCTION("vnic_hard_start_xmit()\n"); + vnic = netdev_priv(device); + np = vnic->current_path; + + vnic_pre_pkt_xmit_stats(&xmit_time); + + if (np && np->viport) + ret = viport_xmit_packet(np->viport, skb); + + if (ret) { + vnic_xmit_fail_stats(vnic); + dev_kfree_skb_any(skb); + vnic->stats.tx_dropped++; + goto out; + } + + device->trans_start = jiffies; + vnic_post_pkt_xmit_stats(vnic, xmit_time); +out: + return 0; +} + +static void vnic_tx_timeout(struct net_device *device) +{ + struct vnic *vnic; + struct viport *viport = NULL; + unsigned long flags; + + VNIC_FUNCTION("vnic_tx_timeout()\n"); + vnic = netdev_priv(device); + device->trans_start = jiffies; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (vnic->current_path && vnic->current_path->viport) { + if (vnic->failed_over) { + if (vnic->current_path == &vnic->primary_path) + viport = vnic->secondary_path.viport; + else if (vnic->current_path == &vnic->secondary_path) + viport = vnic->primary_path.viport; + } else + viport = vnic->current_path->viport; + + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + if (viport) + viport_failure(viport); + } else + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + + VNIC_ERROR("vnic_tx_timeout\n"); +} + +static void vnic_set_multicast_list(struct net_device *device) +{ + struct vnic *vnic; + unsigned long flags; + + VNIC_FUNCTION("vnic_set_multicast_list()\n"); + vnic = netdev_priv(device); + + 
spin_lock_irqsave(&vnic->lock, flags);
+	if (device->mc_count == 0) {
+		if (vnic->mc_list_len) {
+			vnic->mc_list_len = vnic->mc_count = 0;
+			kfree(vnic->mc_list);
+		}
+	} else {
+		struct dev_mc_list *mc_list = device->mc_list;
+		int i;
+
+		if (device->mc_count > vnic->mc_list_len) {
+			if (vnic->mc_list_len)
+				kfree(vnic->mc_list);
+			vnic->mc_list_len = device->mc_count + 10;
+			vnic->mc_list = kmalloc(vnic->mc_list_len *
+						sizeof *mc_list, GFP_ATOMIC);
+			if (!vnic->mc_list) {
+				vnic->mc_list_len = vnic->mc_count = 0;
+				VNIC_ERROR("failed allocating mc_list\n");
+				goto failure;
+			}
+		}
+		vnic->mc_count = device->mc_count;
+		for (i = 0; i < device->mc_count; i++) {
+			vnic->mc_list[i] = *mc_list;
+			vnic->mc_list[i].next = &vnic->mc_list[i + 1];
+			mc_list = mc_list->next;
+		}
+		/* terminate the copied chain so the last entry does not
+		 * point at an uninitialized table slot */
+		vnic->mc_list[vnic->mc_count - 1].next = NULL;
+	}
+	spin_unlock_irqrestore(&vnic->lock, flags);
+
+	if (vnic->primary_path.viport)
+		viport_set_multicast(vnic->primary_path.viport,
+				     vnic->mc_list, vnic->mc_count);
+
+	if (vnic->secondary_path.viport)
+		viport_set_multicast(vnic->secondary_path.viport,
+				     vnic->mc_list, vnic->mc_count);
+
+	vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK);
+	return;
+failure:
+	spin_unlock_irqrestore(&vnic->lock, flags);
+}
+
+/*
+ * The functions below queue up events for the netpath event kernel
+ * thread; they may return before the queued event has been processed.
+ */
+static int vnic_set_mac_address(struct net_device *device, void *addr)
+{
+	struct vnic *vnic;
+	struct sockaddr *sockaddr = addr;
+	u8 *address;
+	int ret = -1;
+
+	VNIC_FUNCTION("vnic_set_mac_address()\n");
+	vnic = netdev_priv(device);
+
+	if (!is_valid_ether_addr(sockaddr->sa_data))
+		return -EADDRNOTAVAIL;
+
+	if (netif_running(device))
+		return -EBUSY;
+
+	memcpy(device->dev_addr, sockaddr->sa_data, ETH_ALEN);
+	address = sockaddr->sa_data;
+
+	if (vnic->primary_path.viport)
+		ret = viport_set_unicast(vnic->primary_path.viport,
+					 address);
+
+	if (ret)
+		return ret;
+
+	if (vnic->secondary_path.viport)
+		viport_set_unicast(vnic->secondary_path.viport, address);
+
+	vnic->mac_set = 1;
+	return 0;
+}
+
+static int vnic_change_mtu(struct net_device *device, int mtu)
+{
+	struct vnic *vnic;
+	int ret = 0;
+	int pri_max_mtu;
+	int sec_max_mtu;
+
+	VNIC_FUNCTION("vnic_change_mtu()\n");
+	vnic = netdev_priv(device);
+
+	if (vnic->primary_path.viport)
+		pri_max_mtu = viport_max_mtu(vnic->primary_path.viport);
+	else
+		pri_max_mtu = MAX_PARAM_VALUE;
+
+	if (vnic->secondary_path.viport)
+		sec_max_mtu = viport_max_mtu(vnic->secondary_path.viport);
+	else
+		sec_max_mtu = MAX_PARAM_VALUE;
+
+	if ((mtu <= pri_max_mtu) && (mtu <= sec_max_mtu)) {
+		device->mtu = mtu;
+		vnic_npevent_queue_evt(&vnic->primary_path,
+				       VNIC_PRINP_SETLINK);
+		vnic_npevent_queue_evt(&vnic->secondary_path,
+				       VNIC_SECNP_SETLINK);
+	} else if (pri_max_mtu < sec_max_mtu)
+		printk(KERN_WARNING PFX "%s: Maximum "
+		       "supported MTU size is %d. "
+		       "Cannot set MTU to %d\n",
+		       vnic->config->name, pri_max_mtu, mtu);
+	else
+		printk(KERN_WARNING PFX "%s: Maximum "
+		       "supported MTU size is %d. "
+		       "Cannot set MTU to %d\n",
+		       vnic->config->name, sec_max_mtu, mtu);
+
+	return ret;
+}
+
+static int vnic_npevent_register(struct vnic *vnic, struct netpath *netpath)
+{
+	u8 *address;
+	int ret;
+
+	if (!vnic->mac_set) {
+		/* if netpath == secondary_path, then the primary path isn't
+		 * connected. MAC address will be set when the primary
+		 * connects.
+ */ + netpath_get_hw_addr(netpath, vnic->netdevice->dev_addr); + address = vnic->netdevice->dev_addr; + + if (vnic->secondary_path.viport) + viport_set_unicast(vnic->secondary_path.viport, + address); + + vnic->mac_set = 1; + } + ret = register_netdev(vnic->netdevice); + if (ret) { + printk(KERN_ERR PFX "%s failed registering netdev " + "error %d - calling viport_failure\n", + config_viport_name(vnic->primary_path.viport->config), + ret); + vnic_free(vnic); + printk(KERN_ERR PFX "%s DELETED : register_netdev failure\n", + config_viport_name(vnic->primary_path.viport->config)); + return ret; + } + + vnic->state = VNIC_REGISTERED; + vnic->carrier = 2; /*special value to force netif_carrier_(on|off)*/ + return 0; +} + +static void vnic_npevent_dequeue_all(struct vnic *vnic) +{ + unsigned long flags; + struct vnic_npevent *npevt, *tmp; + + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) + goto out; + list_for_each_entry_safe(npevt, tmp, &vnic_npevent_list, + list_ptrs) { + if ((npevt->vnic == vnic)) { + list_del(&npevt->list_ptrs); + kfree(npevt); + } + } +out: + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); +} + +static void update_path_and_reconnect(struct netpath *netpath, + struct vnic *vnic) +{ + struct viport_config *config = netpath->viport->config; + int delay = 1; + + if (vnic_ib_get_path(netpath, vnic)) + return; + /* + * tell viport_connect to wait for default_no_path_timeout + * before connecting if we are retrying the same path index + * within default_no_path_timeout. + * This prevents flooding connect requests to a path (or set + * of paths) that aren't successfully connecting for some reason. + */ + if (time_after(jiffies, + (netpath->connect_time + vnic->config->no_path_timeout))) { + netpath->path_idx = config->path_idx; + netpath->connect_time = jiffies; + netpath->delay_reconnect = 0; + delay = 0; + } else if (config->path_idx != netpath->path_idx) { + delay = netpath->delay_reconnect; + netpath->path_idx = config->path_idx; + netpath->delay_reconnect = 1; + } else + delay = 1; + viport_connect(netpath->viport, delay); +} + +static inline void vnic_set_checksum_flag(struct vnic *vnic, + struct netpath *target_path) +{ + unsigned long flags; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + vnic->current_path = target_path; + vnic->failed_over = 1; + if (vnic->config->use_tx_csum && + netpath_can_tx_csum(vnic->current_path)) + vnic->netdevice->features |= NETIF_F_IP_CSUM; + + spin_unlock_irqrestore(&vnic->current_path_lock, flags); +} + +static void vnic_set_uni_multicast(struct vnic *vnic, + struct netpath *netpath) +{ + unsigned long flags; + u8 *address; + + if (vnic->mac_set) { + address = vnic->netdevice->dev_addr; + + if (netpath->viport) + viport_set_unicast(netpath->viport, address); + } + spin_lock_irqsave(&vnic->lock, flags); + + if (vnic->mc_list && netpath->viport) + viport_set_multicast(netpath->viport, vnic->mc_list, + vnic->mc_count); + + spin_unlock_irqrestore(&vnic->lock, flags); + if (vnic->state == VNIC_REGISTERED) { + if (!netpath->viport) + return; + viport_set_link(netpath->viport, + vnic->netdevice->flags & ~IFF_UP, + vnic->netdevice->mtu); + } +} + +static void vnic_set_netpath_timers(struct vnic *vnic, + struct netpath *netpath) +{ + switch (netpath->timer_state) { + case NETPATH_TS_IDLE: + netpath->timer_state = NETPATH_TS_ACTIVE; + if (vnic->state == VNIC_UNINITIALIZED) + netpath_timer(netpath, + vnic->config-> + primary_connect_timeout); + else + netpath_timer(netpath, + vnic->config-> + 
primary_reconnect_timeout); + break; + case NETPATH_TS_ACTIVE: + /*nothing to do*/ + break; + case NETPATH_TS_EXPIRED: + if (vnic->state == VNIC_UNINITIALIZED) + vnic_npevent_register(vnic, netpath); + + break; + } +} + +static void vnic_check_primary_path_timer(struct vnic *vnic) +{ + switch (vnic->primary_path.timer_state) { + case NETPATH_TS_ACTIVE: + /* nothing to do. just wait */ + break; + case NETPATH_TS_IDLE: + netpath_timer(&vnic->primary_path, + vnic->config-> + primary_switch_timeout); + break; + case NETPATH_TS_EXPIRED: + printk(KERN_INFO PFX + "%s: switching to primary path\n", + vnic->config->name); + + vnic_set_checksum_flag(vnic, &vnic->primary_path); + break; + } +} + +static void vnic_carrier_loss(struct vnic *vnic, + struct netpath *last_path) +{ + if (vnic->primary_path.carrier) { + vnic->carrier = 1; + vnic_set_checksum_flag(vnic, &vnic->primary_path); + + if (last_path && last_path != vnic->current_path) + printk(KERN_INFO PFX + "%s: failing over to primary path\n", + vnic->config->name); + else if (!last_path) + printk(KERN_INFO PFX "%s: using primary path\n", + vnic->config->name); + + } else if ((vnic->secondary_path.carrier) && + (vnic->secondary_path.timer_state != NETPATH_TS_ACTIVE)) { + vnic->carrier = 1; + vnic_set_checksum_flag(vnic, &vnic->secondary_path); + + if (last_path && last_path != vnic->current_path) + printk(KERN_INFO PFX + "%s: failing over to secondary path\n", + vnic->config->name); + else if (!last_path) + printk(KERN_INFO PFX "%s: using secondary path\n", + vnic->config->name); + + } + +} + +static void vnic_handle_path_change(struct vnic *vnic, + struct netpath **path) +{ + struct netpath *last_path = *path; + + if (!last_path) { + if (vnic->current_path == &vnic->primary_path) + last_path = &vnic->secondary_path; + else + last_path = &vnic->primary_path; + + } + + if (vnic->current_path && vnic->current_path->viport) + viport_set_link(vnic->current_path->viport, + vnic->netdevice->flags, + vnic->netdevice->mtu); + + if (last_path->viport) + viport_set_link(last_path->viport, + vnic->netdevice->flags & + ~IFF_UP, vnic->netdevice->mtu); + + vnic_restart_xmit(vnic, vnic->current_path); +} + +static void vnic_report_path_change(struct vnic *vnic, + struct netpath *last_path, + int other_path_ok) +{ + if (!vnic->current_path) { + if (last_path == &vnic->primary_path) + printk(KERN_INFO PFX "%s: primary path lost, " + "no failover path available\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: secondary path lost, " + "no failover path available\n", + vnic->config->name); + return; + } + + if (last_path != vnic->current_path) + return; + + if (vnic->current_path == &vnic->secondary_path) { + if (other_path_ok != vnic->primary_path.carrier) { + if (other_path_ok) + printk(KERN_INFO PFX "%s: primary path no" + " longer available for failover\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: primary path now" + " available for failover\n", + vnic->config->name); + } + } else { + if (other_path_ok != vnic->secondary_path.carrier) { + if (other_path_ok) + printk(KERN_INFO PFX "%s: secondary path no" + " longer available for failover\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: secondary path now" + " available for failover\n", + vnic->config->name); + } + } +} + +static void vnic_handle_free_vnic_evt(struct vnic *vnic) +{ + unsigned long flags; + + if (!netif_queue_stopped(vnic->netdevice)) + netif_stop_queue(vnic->netdevice); + + netpath_timer_stop(&vnic->primary_path); + 
netpath_timer_stop(&vnic->secondary_path); + spin_lock_irqsave(&vnic->current_path_lock, flags); + vnic->current_path = NULL; + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + netpath_free(&vnic->primary_path); + netpath_free(&vnic->secondary_path); + if (vnic->state == VNIC_REGISTERED) + unregister_netdev(vnic->netdevice); + + vnic_npevent_dequeue_all(vnic); + kfree(vnic->config); + if (vnic->mc_list_len) { + vnic->mc_list_len = vnic->mc_count = 0; + kfree(vnic->mc_list); + } + + sysfs_remove_group(&vnic->dev_info.dev.kobj, + &vnic_dev_attr_group); + vnic_cleanup_stats_files(vnic); + device_unregister(&vnic->dev_info.dev); + wait_for_completion(&vnic->dev_info.released); + free_netdev(vnic->netdevice); +} + +static struct vnic *vnic_handle_npevent(struct vnic *vnic, + enum vnic_npevent_type npevt_type) +{ + struct netpath *netpath; + const char *netpath_str; + + if (npevt_type <= VNIC_PRINP_LASTTYPE) + netpath_str = netpath_to_string(vnic, &vnic->primary_path); + else if (npevt_type <= VNIC_SECNP_LASTTYPE) + netpath_str = netpath_to_string(vnic, &vnic->secondary_path); + else + netpath_str = netpath_to_string(vnic, vnic->current_path); + + VNIC_INFO("%s: processing %s, netpath=%s, carrier=%d\n", + vnic->config->name, vnic_npevent_str[npevt_type], + netpath_str, vnic->carrier); + + switch (npevt_type) { + case VNIC_PRINP_CONNECTED: + netpath = &vnic->primary_path; + if (vnic->state == VNIC_UNINITIALIZED) { + if (vnic_npevent_register(vnic, netpath)) + break; + } + vnic_set_uni_multicast(vnic, netpath); + break; + case VNIC_SECNP_CONNECTED: + vnic_set_uni_multicast(vnic, &vnic->secondary_path); + break; + case VNIC_PRINP_TIMEREXPIRED: + netpath = &vnic->primary_path; + netpath->timer_state = NETPATH_TS_EXPIRED; + if (!netpath->carrier) + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_SECNP_TIMEREXPIRED: + netpath = &vnic->secondary_path; + netpath->timer_state = NETPATH_TS_EXPIRED; + if (!netpath->carrier) + update_path_and_reconnect(netpath, vnic); + else { + if (vnic->state == VNIC_UNINITIALIZED) + vnic_npevent_register(vnic, netpath); + } + break; + case VNIC_PRINP_LINKUP: + vnic->primary_path.carrier = 1; + break; + case VNIC_SECNP_LINKUP: + netpath = &vnic->secondary_path; + netpath->carrier = 1; + if (!vnic->carrier) + vnic_set_netpath_timers(vnic, netpath); + break; + case VNIC_PRINP_LINKDOWN: + vnic->primary_path.carrier = 0; + break; + case VNIC_SECNP_LINKDOWN: + if (vnic->state == VNIC_UNINITIALIZED) + netpath_timer_stop(&vnic->secondary_path); + vnic->secondary_path.carrier = 0; + break; + case VNIC_PRINP_DISCONNECTED: + netpath = &vnic->primary_path; + netpath_timer_stop(netpath); + netpath->carrier = 0; + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_SECNP_DISCONNECTED: + netpath = &vnic->secondary_path; + netpath_timer_stop(netpath); + netpath->carrier = 0; + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_PRINP_SETLINK: + netpath = vnic->current_path; + if (!netpath || !netpath->viport) + break; + viport_set_link(netpath->viport, + vnic->netdevice->flags, + vnic->netdevice->mtu); + break; + case VNIC_SECNP_SETLINK: + netpath = &vnic->secondary_path; + if (!netpath || !netpath->viport) + break; + viport_set_link(netpath->viport, + vnic->netdevice->flags, + vnic->netdevice->mtu); + break; + case VNIC_NP_FREEVNIC: + vnic_handle_free_vnic_evt(vnic); + vnic = NULL; + break; + } + return vnic; +} + +static int vnic_npevent_statemachine(void *context) +{ + struct vnic_npevent *vnic_link_evt; + enum vnic_npevent_type 
npevt_type; + struct vnic *vnic; + int last_carrier; + int other_path_ok = 0; + struct netpath *last_path; + + while (!vnic_npevent_thread_end || + !list_empty(&vnic_npevent_list)) { + unsigned long flags; + + wait_event_interruptible(vnic_npevent_queue, + !list_empty(&vnic_npevent_list) + || vnic_npevent_thread_end); + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) { + spin_unlock_irqrestore(&vnic_npevent_list_lock, + flags); + VNIC_INFO("netpath statemachine wake" + " on empty list\n"); + continue; + } + + vnic_link_evt = list_entry(vnic_npevent_list.next, + struct vnic_npevent, + list_ptrs); + list_del(&vnic_link_evt->list_ptrs); + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); + vnic = vnic_link_evt->vnic; + npevt_type = vnic_link_evt->event_type; + kfree(vnic_link_evt); + + if (vnic->current_path == &vnic->secondary_path) + other_path_ok = vnic->primary_path.carrier; + else if (vnic->current_path == &vnic->primary_path) + other_path_ok = vnic->secondary_path.carrier; + + vnic = vnic_handle_npevent(vnic, npevt_type); + + if (!vnic) + continue; + + last_carrier = vnic->carrier; + last_path = vnic->current_path; + + if (!vnic->current_path || + !vnic->current_path->carrier) { + vnic->carrier = 0; + vnic->current_path = NULL; + vnic->netdevice->features &= ~NETIF_F_IP_CSUM; + } + + if (!vnic->carrier) + vnic_carrier_loss(vnic, last_path); + else if ((vnic->current_path != &vnic->primary_path) && + (vnic->config->prefer_primary) && + (vnic->primary_path.carrier)) + vnic_check_primary_path_timer(vnic); + + if (last_path) + vnic_report_path_change(vnic, last_path, + other_path_ok); + + VNIC_INFO("new netpath=%s, carrier=%d\n", + netpath_to_string(vnic, vnic->current_path), + vnic->carrier); + + if (vnic->current_path != last_path) + vnic_handle_path_change(vnic, &last_path); + + if (vnic->carrier != last_carrier) { + if (vnic->carrier) { + VNIC_INFO("netif_carrier_on\n"); + netif_carrier_on(vnic->netdevice); + vnic_carrier_loss_stats(vnic); + } else { + VNIC_INFO("netif_carrier_off\n"); + netif_carrier_off(vnic->netdevice); + vnic_disconn_stats(vnic); + } + + } + } + complete_and_exit(&vnic_npevent_thread_exit, 0); + return 0; +} + +void vnic_npevent_queue_evt(struct netpath *netpath, + enum vnic_npevent_type evt) +{ + struct vnic_npevent *npevent; + unsigned long flags; + + npevent = kmalloc(sizeof *npevent, GFP_ATOMIC); + if (!npevent) { + VNIC_ERROR("Could not allocate memory for vnic event\n"); + return; + } + npevent->vnic = netpath->parent; + npevent->event_type = evt; + INIT_LIST_HEAD(&npevent->list_ptrs); + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + list_add_tail(&npevent->list_ptrs, &vnic_npevent_list); + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); + wake_up(&vnic_npevent_queue); +} + +void vnic_npevent_dequeue_evt(struct netpath *netpath, + enum vnic_npevent_type evt) +{ + unsigned long flags; + struct vnic_npevent *npevt, *tmp; + struct vnic *vnic = netpath->parent; + + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) + goto out; + list_for_each_entry_safe(npevt, tmp, &vnic_npevent_list, + list_ptrs) { + if ((npevt->vnic == vnic) && + (npevt->event_type == evt)) { + list_del(&npevt->list_ptrs); + kfree(npevt); + break; + } + } +out: + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); +} + +static int vnic_npevent_start(void) +{ + VNIC_FUNCTION("vnic_npevent_start()\n"); + + spin_lock_init(&vnic_npevent_list_lock); + vnic_npevent_thread = 
kthread_run(vnic_npevent_statemachine, NULL, + "qlgc_vnic_npevent_s_m"); + if (IS_ERR(vnic_npevent_thread)) { + printk(KERN_WARNING PFX "failed to create vnic npevent" + " thread; error %d\n", + (int) PTR_ERR(vnic_npevent_thread)); + vnic_npevent_thread = NULL; + return 1; + } + + return 0; +} + +void vnic_npevent_cleanup(void) +{ + if (vnic_npevent_thread) { + vnic_npevent_thread_end = 1; + wake_up(&vnic_npevent_queue); + wait_for_completion(&vnic_npevent_thread_exit); + vnic_npevent_thread = NULL; + } +} + +static void vnic_setup(struct net_device *device) +{ + ether_setup(device); + + /* ether_setup is used to fill + * device parameters for ethernet devices. + * We override some of the parameters + * which are specific to VNIC. + */ + device->get_stats = vnic_get_stats; + device->open = vnic_open; + device->stop = vnic_stop; + device->hard_start_xmit = vnic_hard_start_xmit; + device->tx_timeout = vnic_tx_timeout; + device->set_multicast_list = vnic_set_multicast_list; + device->set_mac_address = vnic_set_mac_address; + device->change_mtu = vnic_change_mtu; + device->watchdog_timeo = 10 * HZ; + device->features = 0; +} + +struct vnic *vnic_allocate(struct vnic_config *config) +{ + struct vnic *vnic = NULL; + struct net_device *netdev; + + VNIC_FUNCTION("vnic_allocate()\n"); + netdev = alloc_netdev((int) sizeof(*vnic), config->name, vnic_setup); + if (!netdev) { + VNIC_ERROR("failed allocating vnic structure\n"); + return NULL; + } + + vnic = netdev_priv(netdev); + vnic->netdevice = netdev; + spin_lock_init(&vnic->lock); + spin_lock_init(&vnic->current_path_lock); + vnic_alloc_stats(vnic); + vnic->state = VNIC_UNINITIALIZED; + vnic->config = config; + + netpath_init(&vnic->primary_path, vnic, 0); + netpath_init(&vnic->secondary_path, vnic, 1); + + vnic->current_path = NULL; + vnic->failed_over = 0; + + list_add_tail(&vnic->list_ptrs, &vnic_list); + + return vnic; +} + +void vnic_free(struct vnic *vnic) +{ + VNIC_FUNCTION("vnic_free()\n"); + list_del(&vnic->list_ptrs); + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_NP_FREEVNIC); +} + +static void __exit vnic_cleanup(void) +{ + VNIC_FUNCTION("vnic_cleanup()\n"); + + VNIC_INIT("unloading %s\n", MODULEDETAILS); + + while (!list_empty(&vnic_list)) { + struct vnic *vnic = + list_entry(vnic_list.next, struct vnic, list_ptrs); + vnic_free(vnic); + } + + vnic_npevent_cleanup(); + viport_cleanup(); + vnic_ib_cleanup(); +} + +static int __init vnic_init(void) +{ + int ret; + VNIC_FUNCTION("vnic_init()\n"); + VNIC_INIT("Initializing %s\n", MODULEDETAILS); + + ret = config_start(); + if (ret) { + VNIC_ERROR("config_start failed\n"); + goto failure; + } + + ret = vnic_ib_init(); + if (ret) { + VNIC_ERROR("ib_start failed\n"); + goto failure; + } + + ret = viport_start(); + if (ret) { + VNIC_ERROR("viport_start failed\n"); + goto failure; + } + + ret = vnic_npevent_start(); + if (ret) { + VNIC_ERROR("vnic_npevent_start failed\n"); + goto failure; + } + + return 0; +failure: + vnic_cleanup(); + return ret; +} + +module_init(vnic_init); +module_exit(vnic_cleanup); diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h new file mode 100644 index 0000000..7535124 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h @@ -0,0 +1,154 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_MAIN_H_INCLUDED +#define VNIC_MAIN_H_INCLUDED + +#include +#include +#include +#include + +#include "vnic_config.h" +#include "vnic_netpath.h" + +extern u16 vnic_max_mtu; +extern struct list_head vnic_list; +extern struct attribute_group vnic_stats_attr_group; +extern cycles_t vnic_recv_ref; + +enum vnic_npevent_type { + VNIC_PRINP_CONNECTED = 0, + VNIC_PRINP_DISCONNECTED = 1, + VNIC_PRINP_LINKUP = 2, + VNIC_PRINP_LINKDOWN = 3, + VNIC_PRINP_TIMEREXPIRED = 4, + VNIC_PRINP_SETLINK = 5, + + /* used to figure out PRI vs SEC types for dbg msg*/ + VNIC_PRINP_LASTTYPE = VNIC_PRINP_SETLINK, + + VNIC_SECNP_CONNECTED = 6, + VNIC_SECNP_DISCONNECTED = 7, + VNIC_SECNP_LINKUP = 8, + VNIC_SECNP_LINKDOWN = 9, + VNIC_SECNP_TIMEREXPIRED = 10, + VNIC_SECNP_SETLINK = 11, + + /* used to figure out PRI vs SEC types for dbg msg*/ + VNIC_SECNP_LASTTYPE = VNIC_SECNP_SETLINK, + + VNIC_NP_FREEVNIC = 12, + + /* + * NOTE : If any new netpath event is being added, don't forget to + * add corresponding netpath event string into vnic_main.c. 
+ */ +}; + +struct vnic_npevent { + struct list_head list_ptrs; + struct vnic *vnic; + enum vnic_npevent_type event_type; +}; + +void vnic_npevent_queue_evt(struct netpath *netpath, + enum vnic_npevent_type evt); +void vnic_npevent_dequeue_evt(struct netpath *netpath, + enum vnic_npevent_type evt); + +enum vnic_state { + VNIC_UNINITIALIZED = 0, + VNIC_REGISTERED = 1 +}; + +struct vnic { + struct list_head list_ptrs; + enum vnic_state state; + struct vnic_config *config; + struct netpath *current_path; + struct netpath primary_path; + struct netpath secondary_path; + int open; + int carrier; + int failed_over; + int mac_set; + struct net_device_stats stats; + struct net_device *netdevice; + struct dev_info dev_info; + struct dev_mc_list *mc_list; + int mc_list_len; + int mc_count; + spinlock_t lock; + spinlock_t current_path_lock; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + cycles_t start_time; + cycles_t conn_time; + cycles_t disconn_ref; /* intermediate time */ + cycles_t disconn_time; + u32 disconn_num; + cycles_t xmit_time; + u32 xmit_num; + u32 xmit_fail; + cycles_t recv_time; + u32 recv_num; + u32 multicast_recv_num; + cycles_t xmit_ref; /* intermediate time */ + cycles_t xmit_off_time; + u32 xmit_off_num; + cycles_t carrier_ref; /* intermediate time */ + cycles_t carrier_off_time; + u32 carrier_off_num; + } statistics; + struct dev_info stat_info; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +struct vnic *vnic_allocate(struct vnic_config *config); + +void vnic_free(struct vnic *vnic); + +void vnic_connected(struct vnic *vnic, struct netpath *netpath); +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath); + +void vnic_link_up(struct vnic *vnic, struct netpath *netpath); +void vnic_link_down(struct vnic *vnic, struct netpath *netpath); + +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath); +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath); + +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, + struct sk_buff *skb); +void vnic_npevent_cleanup(void); +void completion_callback_cleanup(struct vnic_ib_conn *ib_conn); +#endif /* VNIC_MAIN_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:54:53 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:24:53 +0530 Subject: [ofa-general] [PATCH v3 02/13] QLogic VNIC: Netpath - abstraction of connection to EVIC/VEx In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095453.9943.66549.stgit@localhost.localdomain> From: Ramachandra K This patch implements the netpath layer of QLogic VNIC. Netpath is an abstraction of a connection to EVIC. It primarily includes the implementation which maintains the timers to monitor the status of the connection to EVIC/VEx. 
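One point worth calling out for reviewers: the netpath timer handler does
no real work itself. Timer callbacks run in softirq context and must not
sleep, so vnic_npevent_timeout() only queues a TIMEREXPIRED event for the
netpath event thread added in the main patch. A minimal sketch of this
timer-to-thread pattern follows; it is illustrative only, and the names
monitor, monitor_arm, monitor_expired and queue_event are hypothetical
rather than code from this series:

	#include <linux/timer.h>
	#include <linux/jiffies.h>

	#define TIMER_EXPIRED	1

	/* Hypothetical stand-in for vnic_npevent_queue_evt(): hands the
	 * event to a worker thread and returns without sleeping. */
	extern void queue_event(void *context, int event);

	struct monitor {
		struct timer_list timer;
	};

	/* Runs in softirq context: must not sleep or take sleeping
	 * locks, so it only queues an event for the worker thread. */
	static void monitor_expired(unsigned long data)
	{
		struct monitor *m = (struct monitor *)data;

		queue_event(m, TIMER_EXPIRED);
	}

	static void monitor_arm(struct monitor *m, int timeout_jiffies)
	{
		init_timer(&m->timer);
		m->timer.function = monitor_expired;
		m->timer.data = (unsigned long)m;
		m->timer.expires = jiffies + timeout_jiffies;
		add_timer(&m->timer);
	}

netpath_timer() below follows the same shape, additionally calling
del_timer_sync() before re-arming and treating a timeout of 0 as "fire
immediately".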
Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c | 112 +++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h | 79 ++++++++++++++++ 2 files changed, 191 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c new file mode 100644 index 0000000..820b996 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.c @@ -0,0 +1,112 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_netpath.h" + +static void vnic_npevent_timeout(unsigned long data) +{ + struct netpath *netpath = (struct netpath *)data; + + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_TIMEREXPIRED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_TIMEREXPIRED); +} + +void netpath_timer(struct netpath *netpath, int timeout) +{ + if (netpath->timer_state == NETPATH_TS_ACTIVE) + del_timer_sync(&netpath->timer); + if (timeout) { + init_timer(&netpath->timer); + netpath->timer_state = NETPATH_TS_ACTIVE; + netpath->timer.expires = jiffies + timeout; + netpath->timer.data = (unsigned long)netpath; + netpath->timer.function = vnic_npevent_timeout; + add_timer(&netpath->timer); + } else + vnic_npevent_timeout((unsigned long)netpath); +} + +void netpath_timer_stop(struct netpath *netpath) +{ + if (netpath->timer_state != NETPATH_TS_ACTIVE) + return; + del_timer_sync(&netpath->timer); + if (netpath->second_bias) + vnic_npevent_dequeue_evt(netpath, VNIC_SECNP_TIMEREXPIRED); + else + vnic_npevent_dequeue_evt(netpath, VNIC_PRINP_TIMEREXPIRED); + + netpath->timer_state = NETPATH_TS_IDLE; +} + +void netpath_free(struct netpath *netpath) +{ + if (!netpath->viport) + return; + viport_free(netpath->viport); + netpath->viport = NULL; + sysfs_remove_group(&netpath->dev_info.dev.kobj, + &vnic_path_attr_group); + device_unregister(&netpath->dev_info.dev); + wait_for_completion(&netpath->dev_info.released); +} + +void netpath_init(struct netpath *netpath, struct vnic *vnic, + int second_bias) +{ + netpath->parent = vnic; + netpath->carrier = 0; + netpath->viport = NULL; + netpath->second_bias = second_bias; + netpath->timer_state = NETPATH_TS_IDLE; + init_timer(&netpath->timer); +} + +const char *netpath_to_string(struct vnic *vnic, struct netpath *netpath) +{ + if (!netpath) + return "NULL"; + else if (netpath == &vnic->primary_path) + return "PRIMARY"; + else if (netpath == &vnic->secondary_path) + return "SECONDARY"; + else + return "UNKNOWN"; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h new file mode 100644 index 0000000..f4e142e --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_netpath.h @@ -0,0 +1,79 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_NETPATH_H_INCLUDED +#define VNIC_NETPATH_H_INCLUDED + +#include + +#include "vnic_sys.h" + +struct viport; +struct vnic; + +enum netpath_ts { + NETPATH_TS_IDLE = 0, + NETPATH_TS_ACTIVE = 1, + NETPATH_TS_EXPIRED = 2 +}; + +struct netpath { + int carrier; + struct vnic *parent; + struct viport *viport; + size_t path_idx; + unsigned long connect_time; + int second_bias; + u8 is_primary_path; + u8 delay_reconnect; + struct timer_list timer; + enum netpath_ts timer_state; + struct dev_info dev_info; +}; + +void netpath_init(struct netpath *netpath, struct vnic *vnic, + int second_bias); +void netpath_free(struct netpath *netpath); + +void netpath_timer(struct netpath *netpath, int timeout); +void netpath_timer_stop(struct netpath *netpath); + +const char *netpath_to_string(struct vnic *vnic, struct netpath *netpath); + +#define netpath_get_hw_addr(netpath, address) \ + viport_get_hw_addr((netpath)->viport, address) +#define netpath_is_connected(netpath) \ + (netpath->state == NETPATH_CONNECTED) +#define netpath_can_tx_csum(netpath) \ + viport_can_tx_csum(netpath->viport) + +#endif /* VNIC_NETPATH_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:55:23 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:25:23 +0530 Subject: [ofa-general] [PATCH v3 03/13] QLogic VNIC: Implementation of communication protocol with EVIC/VEx In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095523.9943.28433.stgit@localhost.localdomain> From: Poornima Kamath Implementation of the statemachine for the protocol used while communicating with the EVIC. The patch also implements the viport abstraction which represents the virtual ethernet port on EVIC. Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c | 1214 ++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h | 175 +++ 2 files changed, 1389 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c new file mode 100644 index 0000000..7462403 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.c @@ -0,0 +1,1214 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_netpath.h" +#include "vnic_control.h" +#include "vnic_data.h" +#include "vnic_config.h" +#include "vnic_control_pkt.h" + +#define VIPORT_DISCONN_TIMER 10000 /* 10 seconds */ + +#define MAX_RETRY_INTERVAL 20000 /* 20 seconds */ +#define RETRY_INCREMENT 5000 /* 5 seconds */ +#define MAX_CONNECT_RETRY_TIMEOUT 600000 /* 10 minutes */ + +static DECLARE_WAIT_QUEUE_HEAD(viport_queue); +static LIST_HEAD(viport_list); +static DECLARE_COMPLETION(viport_thread_exit); +static spinlock_t viport_list_lock; + +static struct task_struct *viport_thread; +static int viport_thread_end; + +static void viport_timer(struct viport *viport, int timeout); + +struct viport *viport_allocate(struct viport_config *config) +{ + struct viport *viport; + + VIPORT_FUNCTION("viport_allocate()\n"); + viport = kzalloc(sizeof *viport, GFP_KERNEL); + if (!viport) { + VIPORT_ERROR("failed allocating viport structure\n"); + return NULL; + } + + viport->state = VIPORT_DISCONNECTED; + viport->link_state = LINK_FIRSTCONNECT; + viport->connect = WAIT; + viport->new_mtu = 1500; + viport->new_flags = 0; + viport->config = config; + viport->connect = DELAY; + viport->data.max_mtu = vnic_max_mtu; + spin_lock_init(&viport->lock); + init_waitqueue_head(&viport->stats_queue); + init_waitqueue_head(&viport->disconnect_queue); + init_waitqueue_head(&viport->reference_queue); + INIT_LIST_HEAD(&viport->list_ptrs); + + vnic_mc_init(viport); + + return viport; +} + +void viport_connect(struct viport *viport, int delay) +{ + VIPORT_FUNCTION("viport_connect()\n"); + + if (viport->connect != DELAY) + viport->connect = (delay) ? DELAY : NOW; + if (viport->link_state == LINK_FIRSTCONNECT) { + u32 duration; + duration = (net_random() & 0x1ff); + if (!viport->parent->is_primary_path) + duration += 0x1ff; + viport->link_state = LINK_RETRYWAIT; + viport_timer(viport, duration); + } else + viport_kick(viport); +} + +static void viport_disconnect(struct viport *viport) +{ + VIPORT_FUNCTION("viport_disconnect()\n"); + viport->disconnect = 1; + viport_failure(viport); + wait_event(viport->disconnect_queue, viport->disconnect == 0); +} + +void viport_free(struct viport *viport) +{ + VIPORT_FUNCTION("viport_free()\n"); + viport_disconnect(viport); /* NOTE: this can sleep */ + vnic_mc_uninit(viport); + kfree(viport->config); + kfree(viport); +} + +void viport_set_link(struct viport *viport, u16 flags, u16 mtu) +{ + unsigned long localflags; + int i; + + VIPORT_FUNCTION("viport_set_link()\n"); + if (mtu > data_max_mtu(&viport->data)) { + VIPORT_ERROR("configuration error." 
+ " mtu of %d unsupported by %s\n", mtu, + config_viport_name(viport->config)); + goto failure; + } + + spin_lock_irqsave(&viport->lock, localflags); + flags &= IFF_UP | IFF_ALLMULTI | IFF_PROMISC; + if ((viport->new_flags != flags) + || (viport->new_mtu != mtu)) { + viport->new_flags = flags; + viport->new_mtu = mtu; + viport->updates |= NEED_LINK_CONFIG; + if (viport->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + if (((viport->mtu <= MCAST_MSG_SIZE) && (mtu > MCAST_MSG_SIZE)) || + ((viport->mtu > MCAST_MSG_SIZE) && (mtu <= MCAST_MSG_SIZE))) { + /* + * MTU value will enable/disable the multicast. In + * either case, need to send the CMD_CONFIG_ADDRESS2 to + * EVIC. Hence, setting the NEED_ADDRESS_CONFIG flag. + */ + viport->updates |= NEED_ADDRESS_CONFIG; + if (mtu <= MCAST_MSG_SIZE) { + VIPORT_PRINT("%s: MTU changed; " + "old:%d new:%d (threshold:%d);" + " MULTICAST will be enabled.\n", + config_viport_name(viport->config), + viport->mtu, mtu, + (int)MCAST_MSG_SIZE); + } else { + VIPORT_PRINT("%s: MTU changed; " + "old:%d new:%d (threshold:%d); " + "MULTICAST will be disabled.\n", + config_viport_name(viport->config), + viport->mtu, mtu, + (int)MCAST_MSG_SIZE); + } + /* When we resend these addresses, EVIC will + * send mgid=0 back in response. So no need to + * shutoff ib_multicast. + */ + for (i = MCAST_ADDR_START; i < viport->num_mac_addresses; i++) { + if (viport->mac_addresses[i].valid) + viport->mac_addresses[i].operation = VNIC_OP_SET_ENTRY; + } + } + } + viport_kick(viport); + } + + spin_unlock_irqrestore(&viport->lock, localflags); + return; +failure: + viport_failure(viport); +} + +int viport_set_unicast(struct viport *viport, u8 *address) +{ + unsigned long flags; + int ret = -1; + VIPORT_FUNCTION("viport_set_unicast()\n"); + spin_lock_irqsave(&viport->lock, flags); + + if (!viport->mac_addresses) + goto out; + + if (memcmp(viport->mac_addresses[UNICAST_ADDR].address, + address, ETH_ALEN)) { + memcpy(viport->mac_addresses[UNICAST_ADDR].address, + address, ETH_ALEN); + viport->mac_addresses[UNICAST_ADDR].operation + = VNIC_OP_SET_ENTRY; + viport->updates |= NEED_ADDRESS_CONFIG; + viport_kick(viport); + } + ret = 0; +out: + spin_unlock_irqrestore(&viport->lock, flags); + return ret; +} + +int viport_set_multicast(struct viport *viport, + struct dev_mc_list *mc_list, int mc_count) +{ + u32 old_update_list; + int i; + int ret = -1; + unsigned long flags; + + VIPORT_FUNCTION("viport_set_multicast()\n"); + spin_lock_irqsave(&viport->lock, flags); + + if (!viport->mac_addresses) + goto out; + + old_update_list = viport->updates; + if (mc_count > viport->num_mac_addresses - MCAST_ADDR_START) + viport->updates |= NEED_LINK_CONFIG | MCAST_OVERFLOW; + else { + if (mc_count == 0) { + ret = 0; + goto out; + } + if (viport->updates & MCAST_OVERFLOW) { + viport->updates &= ~MCAST_OVERFLOW; + viport->updates |= NEED_LINK_CONFIG; + } + for (i = MCAST_ADDR_START; i < mc_count + MCAST_ADDR_START; + i++, mc_list = mc_list->next) { + if (viport->mac_addresses[i].valid && + !memcmp(viport->mac_addresses[i].address, + mc_list->dmi_addr, ETH_ALEN)) + continue; + memcpy(viport->mac_addresses[i].address, + mc_list->dmi_addr, ETH_ALEN); + viport->mac_addresses[i].valid = 1; + viport->mac_addresses[i].operation = VNIC_OP_SET_ENTRY; + } + for (; i < viport->num_mac_addresses; i++) { + if (!viport->mac_addresses[i].valid) + continue; + viport->mac_addresses[i].valid = 0; + viport->mac_addresses[i].operation = VNIC_OP_SET_ENTRY; + } + if (mc_count) + viport->updates |= NEED_ADDRESS_CONFIG; + } + + if 
(viport->updates != old_update_list) + viport_kick(viport); + ret = 0; +out: + spin_unlock_irqrestore(&viport->lock, flags); + return ret; +} + +static inline void viport_disable_multicast(struct viport *viport) +{ + VIPORT_INFO("turned off IB_MULTICAST\n"); + viport->config->control_config.ib_multicast = 0; + viport->config->control_config.ib_config.conn_data.features_supported &= + __constant_cpu_to_be32((u32)~VNIC_FEAT_INBOUND_IB_MC); + viport->link_state = LINK_RESET; +} + +void viport_get_stats(struct viport *viport, + struct net_device_stats *stats) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_get_stats()\n"); + /* Reference count has been already incremented indicating + * that viport structure is being used, which prevents its + * freeing when this task sleeps + */ + if (time_after(jiffies, + (viport->last_stats_time + viport->config->stats_interval))) { + + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_STATS; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + wait_event(viport->stats_queue, + !(viport->updates & NEED_STATS) + || (viport->disconnect == 1)); + + if (viport->stats.ethernet_status) + vnic_link_up(viport->vnic, viport->parent); + else + vnic_link_down(viport->vnic, viport->parent); + } + + stats->rx_packets = be64_to_cpu(viport->stats.if_in_ok); + stats->tx_packets = be64_to_cpu(viport->stats.if_out_ok); + stats->rx_bytes = be64_to_cpu(viport->stats.if_in_octets); + stats->tx_bytes = be64_to_cpu(viport->stats.if_out_octets); + stats->rx_errors = be64_to_cpu(viport->stats.if_in_errors); + stats->tx_errors = be64_to_cpu(viport->stats.if_out_errors); + stats->rx_dropped = 0; /* EIOC doesn't track */ + stats->tx_dropped = 0; /* EIOC doesn't track */ + stats->multicast = be64_to_cpu(viport->stats.if_in_nucast_pkts); + stats->collisions = 0; /* EIOC doesn't track */ +} + +int viport_xmit_packet(struct viport *viport, struct sk_buff *skb) +{ + int status = -1; + unsigned long flags; + + VIPORT_FUNCTION("viport_xmit_packet()\n"); + spin_lock_irqsave(&viport->lock, flags); + if (viport->state == VIPORT_CONNECTED) + status = data_xmit_packet(&viport->data, skb); + spin_unlock_irqrestore(&viport->lock, flags); + + return status; +} + +void viport_kick(struct viport *viport) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_kick()\n"); + spin_lock_irqsave(&viport_list_lock, flags); + if (list_empty(&viport->list_ptrs)) { + list_add_tail(&viport->list_ptrs, &viport_list); + wake_up(&viport_queue); + } + spin_unlock_irqrestore(&viport_list_lock, flags); +} + +void viport_failure(struct viport *viport) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_failure()\n"); + vnic_stop_xmit(viport->vnic, viport->parent); + spin_lock_irqsave(&viport_list_lock, flags); + viport->errored = 1; + if (list_empty(&viport->list_ptrs)) { + list_add_tail(&viport->list_ptrs, &viport_list); + wake_up(&viport_queue); + } + spin_unlock_irqrestore(&viport_list_lock, flags); +} + +static void viport_timeout(unsigned long data) +{ + struct viport *viport; + + VIPORT_FUNCTION("viport_timeout()\n"); + viport = (struct viport *)data; + viport->timer_active = 0; + viport_kick(viport); +} + +static void viport_timer(struct viport *viport, int timeout) +{ + VIPORT_FUNCTION("viport_timer()\n"); + if (viport->timer_active) + del_timer(&viport->timer); + init_timer(&viport->timer); + viport->timer.expires = jiffies + timeout; + viport->timer.data = (unsigned long)viport; + viport->timer.function = viport_timeout; + viport->timer_active = 1; + 
add_timer(&viport->timer); +} + +static void viport_timer_stop(struct viport *viport) +{ + VIPORT_FUNCTION("viport_timer_stop()\n"); + if (viport->timer_active) + del_timer(&viport->timer); + viport->timer_active = 0; +} + +static int viport_init_mac_addresses(struct viport *viport) +{ + struct vnic_address_op2 *temp; + unsigned long flags; + int i; + + VIPORT_FUNCTION("viport_init_mac_addresses()\n"); + i = viport->num_mac_addresses * sizeof *temp; + temp = kzalloc(viport->num_mac_addresses * sizeof *temp, + GFP_KERNEL); + if (!temp) { + VIPORT_ERROR("failed allocating MAC address table\n"); + return -ENOMEM; + } + + spin_lock_irqsave(&viport->lock, flags); + viport->mac_addresses = temp; + for (i = 0; i < viport->num_mac_addresses; i++) { + viport->mac_addresses[i].index = cpu_to_be16(i); + viport->mac_addresses[i].vlan = + cpu_to_be16(viport->default_vlan); + } + memset(viport->mac_addresses[BROADCAST_ADDR].address, + 0xFF, ETH_ALEN); + viport->mac_addresses[BROADCAST_ADDR].valid = 1; + memcpy(viport->mac_addresses[UNICAST_ADDR].address, + viport->hw_mac_address, ETH_ALEN); + viport->mac_addresses[UNICAST_ADDR].valid = 1; + + spin_unlock_irqrestore(&viport->lock, flags); + + return 0; +} + +static inline void viport_match_mac_address(struct vnic *vnic, + struct viport *viport) +{ + if (vnic && vnic->current_path && + viport == vnic->current_path->viport && + vnic->mac_set && + memcmp(vnic->netdevice->dev_addr, viport->hw_mac_address, ETH_ALEN)) { + VIPORT_ERROR("*** ERROR MAC address mismatch; " + "current = %02x:%02x:%02x:%02x:%02x:%02x " + "From EVIC = %02x:%02x:%02x:%02x:%02x:%02x\n", + vnic->netdevice->dev_addr[0], + vnic->netdevice->dev_addr[1], + vnic->netdevice->dev_addr[2], + vnic->netdevice->dev_addr[3], + vnic->netdevice->dev_addr[4], + vnic->netdevice->dev_addr[5], + viport->hw_mac_address[0], + viport->hw_mac_address[1], + viport->hw_mac_address[2], + viport->hw_mac_address[3], + viport->hw_mac_address[4], + viport->hw_mac_address[5]); + } +} + +static int viport_handle_init_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_UNINITIALIZED: + LINK_STATE("state LINK_UNINITIALIZED\n"); + viport->updates = 0; + spin_lock_irq(&viport_list_lock); + list_del_init(&viport->list_ptrs); + spin_unlock_irq(&viport_list_lock); + if (atomic_read(&viport->reference_count)) { + wake_up(&viport->stats_queue); + wait_event(viport->reference_queue, + atomic_read(&viport->reference_count) == 0); + } + /* No more references to viport structure + * so it is safe to delete it by waking disconnect + * queue + */ + + viport->disconnect = 0; + wake_up(&viport->disconnect_queue); + break; + case LINK_INITIALIZE: + LINK_STATE("state LINK_INITIALIZE\n"); + viport->errored = 0; + viport->connect = WAIT; + viport->last_stats_time = 0; + if (viport->disconnect) + viport->link_state = LINK_UNINITIALIZED; + else + viport->link_state = LINK_INITIALIZECONTROL; + break; + case LINK_INITIALIZECONTROL: + LINK_STATE("state LINK_INITIALIZECONTROL\n"); + viport->pd = ib_alloc_pd(viport->config->ibdev); + if (IS_ERR(viport->pd)) + viport->link_state = LINK_DISCONNECTED; + else if (control_init(&viport->control, viport, + &viport->config->control_config, + viport->pd)) { + ib_dealloc_pd(viport->pd); + viport->link_state = LINK_DISCONNECTED; + + } else + viport->link_state = LINK_INITIALIZEDATA; + break; + case LINK_INITIALIZEDATA: + LINK_STATE("state LINK_INITIALIZEDATA\n"); + if (data_init(&viport->data, viport, + 
&viport->config->data_config, + viport->pd)) + viport->link_state = LINK_CLEANUPCONTROL; + else + viport->link_state = LINK_CONTROLCONNECT; + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_control_states(struct viport *viport) +{ + enum link_state old_state; + struct vnic *vnic; + + do { + switch (old_state = viport->link_state) { + case LINK_CONTROLCONNECT: + if (vnic_ib_cm_connect(&viport->control.ib_conn)) + viport->link_state = LINK_CLEANUPDATA; + else + viport->link_state = LINK_CONTROLCONNECTWAIT; + break; + case LINK_CONTROLCONNECTWAIT: + LINK_STATE("state LINK_CONTROLCONNECTWAIT\n"); + if (control_is_connected(&viport->control)) + viport->link_state = LINK_INITVNICREQ; + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_CONTROLDISCONNECT; + } + break; + case LINK_INITVNICREQ: + LINK_STATE("state LINK_INITVNICREQ\n"); + if (control_init_vnic_req(&viport->control)) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_INITVNICRSP; + break; + case LINK_INITVNICRSP: + LINK_STATE("state LINK_INITVNICRSP\n"); + control_process_async(&viport->control); + + if (!control_init_vnic_rsp(&viport->control, + &viport->features_supported, + viport->hw_mac_address, + &viport->num_mac_addresses, + &viport->default_vlan)) { + if (viport_init_mac_addresses(viport)) + viport->link_state = + LINK_RESETCONTROL; + else { + viport->link_state = + LINK_BEGINDATAPATH; + /* + * Ensure that the current path's MAC + * address matches the one returned by + * EVIC - we've had cases of mismatch + * which then caused havoc. + */ + vnic = viport->parent->parent; + viport_match_mac_address(vnic, viport); + } + } + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESETCONTROL; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_data_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_BEGINDATAPATH: + LINK_STATE("state LINK_BEGINDATAPATH\n"); + viport->link_state = LINK_CONFIGDATAPATHREQ; + break; + case LINK_CONFIGDATAPATHREQ: + LINK_STATE("state LINK_CONFIGDATAPATHREQ\n"); + if (control_config_data_path_req(&viport->control, + data_path_id(&viport-> + data), + data_host_pool_max + (&viport->data), + data_eioc_pool_max + (&viport->data))) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_CONFIGDATAPATHRSP; + break; + case LINK_CONFIGDATAPATHRSP: + LINK_STATE("state LINK_CONFIGDATAPATHRSP\n"); + control_process_async(&viport->control); + + if (!control_config_data_path_rsp(&viport->control, + data_host_pool + (&viport->data), + data_eioc_pool + (&viport->data), + data_host_pool_max + (&viport->data), + data_eioc_pool_max + (&viport->data), + data_host_pool_min + (&viport->data), + data_eioc_pool_min + (&viport->data))) + viport->link_state = LINK_DATACONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESETCONTROL; + } + break; + case LINK_DATACONNECT: + LINK_STATE("state LINK_DATACONNECT\n"); + if (data_connect(&viport->data)) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_DATACONNECTWAIT; + break; + case LINK_DATACONNECTWAIT: + LINK_STATE("state LINK_DATACONNECTWAIT\n"); + control_process_async(&viport->control); + if (data_is_connected(&viport->data)) + viport->link_state = LINK_XCHGPOOLREQ; + + if (viport->errored) { + 
viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_xchgpool_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_XCHGPOOLREQ: + LINK_STATE("state LINK_XCHGPOOLREQ\n"); + if (control_exchange_pools_req(&viport->control, + data_local_pool_addr + (&viport->data), + data_local_pool_rkey + (&viport->data))) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_XCHGPOOLRSP; + break; + case LINK_XCHGPOOLRSP: + LINK_STATE("state LINK_XCHGPOOLRSP\n"); + control_process_async(&viport->control); + + if (!control_exchange_pools_rsp(&viport->control, + data_remote_pool_addr + (&viport->data), + data_remote_pool_rkey + (&viport->data))) + viport->link_state = LINK_INITIALIZED; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + case LINK_INITIALIZED: + LINK_STATE("state LINK_INITIALIZED\n"); + viport->state = VIPORT_CONNECTED; + printk(KERN_INFO PFX + "%s: connection established\n", + config_viport_name(viport->config)); + data_connected(&viport->data); + vnic_connected(viport->parent->parent, + viport->parent); + if (viport->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + printk(KERN_INFO PFX "%s: Supports Inbound IB " + "Multicast\n", + config_viport_name(viport->config)); + if (mc_data_init(&viport->mc_data, viport, + &viport->config->data_config, + viport->pd)) { + viport_disable_multicast(viport); + break; + } + } + spin_lock_irq(&viport->lock); + viport->mtu = 1500; + viport->flags = 0; + if ((viport->mtu != viport->new_mtu) || + (viport->flags != viport->new_flags)) + viport->updates |= NEED_LINK_CONFIG; + spin_unlock_irq(&viport->lock); + viport->link_state = LINK_IDLE; + viport->retry_duration = 0; + viport->total_retry_duration = 0; + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_idle_states(struct viport *viport) +{ + enum link_state old_state; + int handle_mc_join_compl, handle_mc_join; + + do { + switch (old_state = viport->link_state) { + case LINK_IDLE: + LINK_STATE("state LINK_IDLE\n"); + if (viport->config->hb_interval) + viport_timer(viport, + viport->config->hb_interval); + viport->link_state = LINK_IDLING; + break; + case LINK_IDLING: + LINK_STATE("state LINK_IDLING\n"); + control_process_async(&viport->control); + if (viport->errored) { + viport_timer_stop(viport); + viport->errored = 0; + viport->link_state = LINK_RESET; + break; + } + + spin_lock_irq(&viport->lock); + handle_mc_join = (viport->updates & NEED_MCAST_JOIN); + handle_mc_join_compl = + (viport->updates & NEED_MCAST_COMPLETION); + /* + * Turn off both flags, the handler functions will + * rearm them if necessary. 
+ */ + viport->updates &= ~(NEED_MCAST_JOIN | NEED_MCAST_COMPLETION); + + if (viport->updates & NEED_LINK_CONFIG) { + viport_timer_stop(viport); + viport->link_state = LINK_CONFIGLINKREQ; + } else if (viport->updates & NEED_ADDRESS_CONFIG) { + viport_timer_stop(viport); + viport->link_state = LINK_CONFIGADDRSREQ; + } else if (viport->updates & NEED_STATS) { + viport_timer_stop(viport); + viport->link_state = LINK_REPORTSTATREQ; + } else if (viport->config->hb_interval) { + if (!viport->timer_active) + viport->link_state = + LINK_HEARTBEATREQ; + } + spin_unlock_irq(&viport->lock); + if (handle_mc_join) { + if (vnic_mc_join(viport)) + viport_disable_multicast(viport); + } + if (handle_mc_join_compl) + vnic_mc_join_handle_completion(viport); + + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_config_states(struct viport *viport) +{ + enum link_state old_state; + int res; + + do { + switch (old_state = viport->link_state) { + case LINK_CONFIGLINKREQ: + LINK_STATE("state LINK_CONFIGLINKREQ\n"); + spin_lock_irq(&viport->lock); + viport->updates &= ~NEED_LINK_CONFIG; + viport->flags = viport->new_flags; + if (viport->updates & MCAST_OVERFLOW) + viport->flags |= IFF_ALLMULTI; + viport->mtu = viport->new_mtu; + spin_unlock_irq(&viport->lock); + if (control_config_link_req(&viport->control, + viport->flags, + viport->mtu)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_CONFIGLINKRSP; + break; + case LINK_CONFIGLINKRSP: + LINK_STATE("state LINK_CONFIGLINKRSP\n"); + control_process_async(&viport->control); + + if (!control_config_link_rsp(&viport->control, + &viport->flags, + &viport->mtu)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + case LINK_CONFIGADDRSREQ: + LINK_STATE("state LINK_CONFIGADDRSREQ\n"); + + spin_lock_irq(&viport->lock); + res = control_config_addrs_req(&viport->control, + viport->mac_addresses, + viport-> + num_mac_addresses); + + if (res > 0) { + viport->updates &= ~NEED_ADDRESS_CONFIG; + viport->link_state = LINK_CONFIGADDRSRSP; + } else if (res == 0) + viport->link_state = LINK_CONFIGADDRSRSP; + else + viport->link_state = LINK_RESET; + spin_unlock_irq(&viport->lock); + break; + case LINK_CONFIGADDRSRSP: + LINK_STATE("state LINK_CONFIGADDRSRSP\n"); + control_process_async(&viport->control); + + if (!control_config_addrs_rsp(&viport->control)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_stat_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_REPORTSTATREQ: + LINK_STATE("state LINK_REPORTSTATREQ\n"); + if (control_report_statistics_req(&viport->control)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_REPORTSTATRSP; + break; + case LINK_REPORTSTATRSP: + LINK_STATE("state LINK_REPORTSTATRSP\n"); + control_process_async(&viport->control); + + spin_lock_irq(&viport->lock); + if (control_report_statistics_rsp(&viport->control, + &viport->stats) == 0) { + viport->updates &= ~NEED_STATS; + viport->last_stats_time = jiffies; + wake_up(&viport->stats_queue); + viport->link_state = LINK_IDLE; + } + + spin_unlock_irq(&viport->lock); + + if (viport->errored) { + viport->errored = 0; + viport->link_state = 
LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_heartbeat_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_HEARTBEATREQ: + LINK_STATE("state LINK_HEARTBEATREQ\n"); + if (control_heartbeat_req(&viport->control, + viport->config->hb_timeout)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_HEARTBEATRSP; + break; + case LINK_HEARTBEATRSP: + LINK_STATE("state LINK_HEARTBEATRSP\n"); + control_process_async(&viport->control); + + if (!control_heartbeat_rsp(&viport->control)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_reset_states(struct viport *viport) +{ + enum link_state old_state; + int handle_mc_join_compl = 0, handle_mc_join = 0; + + do { + switch (old_state = viport->link_state) { + case LINK_RESET: + LINK_STATE("state LINK_RESET\n"); + viport->errored = 0; + spin_lock_irq(&viport->lock); + viport->state = VIPORT_DISCONNECTED; + /* + * Turn off both flags, the handler functions will + * rearm them if necessary + */ + viport->updates &= ~(NEED_MCAST_JOIN | NEED_MCAST_COMPLETION); + + spin_unlock_irq(&viport->lock); + vnic_link_down(viport->vnic, viport->parent); + printk(KERN_INFO PFX + "%s: connection lost\n", + config_viport_name(viport->config)); + if (handle_mc_join) { + if (vnic_mc_join(viport)) + viport_disable_multicast(viport); + } + if (handle_mc_join_compl) + vnic_mc_join_handle_completion(viport); + if (viport->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + vnic_mc_leave(viport); + vnic_mc_data_cleanup(&viport->mc_data); + } + + if (control_reset_req(&viport->control)) + viport->link_state = LINK_DATADISCONNECT; + else + viport->link_state = LINK_RESETRSP; + break; + case LINK_RESETRSP: + LINK_STATE("state LINK_RESETRSP\n"); + control_process_async(&viport->control); + + if (!control_reset_rsp(&viport->control)) + viport->link_state = LINK_DATADISCONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_DATADISCONNECT; + } + break; + case LINK_RESETCONTROL: + LINK_STATE("state LINK_RESETCONTROL\n"); + if (control_reset_req(&viport->control)) + viport->link_state = LINK_CONTROLDISCONNECT; + else + viport->link_state = LINK_RESETCONTROLRSP; + break; + case LINK_RESETCONTROLRSP: + LINK_STATE("state LINK_RESETCONTROLRSP\n"); + control_process_async(&viport->control); + + if (!control_reset_rsp(&viport->control)) + viport->link_state = LINK_CONTROLDISCONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_CONTROLDISCONNECT; + } + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_disconn_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch (old_state = viport->link_state) { + case LINK_DATADISCONNECT: + LINK_STATE("state LINK_DATADISCONNECT\n"); + data_disconnect(&viport->data); + viport->link_state = LINK_CONTROLDISCONNECT; + break; + case LINK_CONTROLDISCONNECT: + LINK_STATE("state LINK_CONTROLDISCONNECT\n"); + viport->link_state = LINK_CLEANUPDATA; + break; + case LINK_CLEANUPDATA: + LINK_STATE("state LINK_CLEANUPDATA\n"); + data_cleanup(&viport->data); + viport->link_state = LINK_CLEANUPCONTROL; + break; + case 
LINK_CLEANUPCONTROL: + LINK_STATE("state LINK_CLEANUPCONTROL\n"); + spin_lock_irq(&viport->lock); + kfree(viport->mac_addresses); + viport->mac_addresses = NULL; + spin_unlock_irq(&viport->lock); + control_cleanup(&viport->control); + ib_dealloc_pd(viport->pd); + viport->link_state = LINK_DISCONNECTED; + break; + case LINK_DISCONNECTED: + LINK_STATE("state LINK_DISCONNECTED\n"); + vnic_disconnected(viport->parent->parent, + viport->parent); + if (viport->disconnect != 0) + viport->link_state = LINK_UNINITIALIZED; + else if (viport->retry == 1) { + viport->retry = 0; + /* + * Check if the initial retry interval has crossed + * 20 seconds. + * The retry interval is initially 5 seconds which + * is incremented by 5. Once it is 20 the interval + * is fixed to 20 seconds till 10 minutes, + * after which retrying is stopped + */ + if (viport->retry_duration < MAX_RETRY_INTERVAL) + viport->retry_duration += + RETRY_INCREMENT; + + viport->total_retry_duration += + viport->retry_duration; + + if (viport->total_retry_duration >= + MAX_CONNECT_RETRY_TIMEOUT) { + viport->link_state = LINK_UNINITIALIZED; + printk("Timed out after retrying" + " for retry_duration %d msecs\n" + , viport->total_retry_duration); + } else { + viport->connect = DELAY; + viport->link_state = LINK_RETRYWAIT; + } + viport_timer(viport, + msecs_to_jiffies(viport->retry_duration)); + } else { + u32 duration = 5000 + ((net_random()) & 0x1FF); + if (!viport->parent->is_primary_path) + duration += 0x1ff; + viport_timer(viport, + msecs_to_jiffies(duration)); + viport->connect = DELAY; + viport->link_state = LINK_RETRYWAIT; + } + break; + case LINK_RETRYWAIT: + LINK_STATE("state LINK_RETRYWAIT\n"); + viport->stats.ethernet_status = 0; + viport->updates = 0; + wake_up(&viport->stats_queue); + if (viport->disconnect != 0) { + viport_timer_stop(viport); + viport->link_state = LINK_UNINITIALIZED; + } else if (viport->connect == DELAY) { + if (!viport->timer_active) + viport->link_state = LINK_INITIALIZE; + } else if (viport->connect == NOW) { + viport_timer_stop(viport); + viport->link_state = LINK_INITIALIZE; + } + break; + case LINK_FIRSTCONNECT: + viport->stats.ethernet_status = 0; + viport->updates = 0; + wake_up(&viport->stats_queue); + if (viport->disconnect != 0) { + viport_timer_stop(viport); + viport->link_state = LINK_UNINITIALIZED; + } + + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_statemachine(void *context) +{ + struct viport *viport; + enum link_state old_link_state; + + VIPORT_FUNCTION("viport_statemachine()\n"); + while (!viport_thread_end || !list_empty(&viport_list)) { + wait_event_interruptible(viport_queue, + !list_empty(&viport_list) + || viport_thread_end); + spin_lock_irq(&viport_list_lock); + if (list_empty(&viport_list)) { + spin_unlock_irq(&viport_list_lock); + continue; + } + viport = list_entry(viport_list.next, struct viport, + list_ptrs); + list_del_init(&viport->list_ptrs); + spin_unlock_irq(&viport_list_lock); + + do { + old_link_state = viport->link_state; + + /* + * Optimize for the state machine steady state + * by checking for the most common states first. 
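+			 * (Once the link is up, the viport spends nearly
+			 * all of its time in the idle, heartbeat, stat
+			 * and config states, so those handlers are tried
+			 * before the connect/reset/disconnect handlers.)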
+ * + */ + if (viport_handle_idle_states(viport) == 0) + break; + if (viport_handle_heartbeat_states(viport) == 0) + break; + if (viport_handle_stat_states(viport) == 0) + break; + if (viport_handle_config_states(viport) == 0) + break; + + if (viport_handle_init_states(viport) == 0) + break; + if (viport_handle_control_states(viport) == 0) + break; + if (viport_handle_data_states(viport) == 0) + break; + if (viport_handle_xchgpool_states(viport) == 0) + break; + if (viport_handle_reset_states(viport) == 0) + break; + if (viport_handle_disconn_states(viport) == 0) + break; + } while (viport->link_state != old_link_state); + } + + complete_and_exit(&viport_thread_exit, 0); +} + +int viport_start(void) +{ + VIPORT_FUNCTION("viport_start()\n"); + + spin_lock_init(&viport_list_lock); + viport_thread = kthread_run(viport_statemachine, NULL, + "qlgc_vnic_viport_s_m"); + if (IS_ERR(viport_thread)) { + printk(KERN_WARNING PFX "Could not create viport_thread;" + " error %d\n", (int) PTR_ERR(viport_thread)); + viport_thread = NULL; + return 1; + } + + return 0; +} + +void viport_cleanup(void) +{ + VIPORT_FUNCTION("viport_cleanup()\n"); + if (viport_thread) { + viport_thread_end = 1; + wake_up(&viport_queue); + wait_for_completion(&viport_thread_exit); + viport_thread = NULL; + } +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h new file mode 100644 index 0000000..70cdc9f --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_viport.h @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_VIPORT_H_INCLUDED +#define VNIC_VIPORT_H_INCLUDED + +#include "vnic_control.h" +#include "vnic_data.h" +#include "vnic_multicast.h" + +enum viport_state { + VIPORT_DISCONNECTED = 0, + VIPORT_CONNECTED = 1 +}; + +enum link_state { + LINK_UNINITIALIZED = 0, + LINK_INITIALIZE = 1, + LINK_INITIALIZECONTROL = 2, + LINK_INITIALIZEDATA = 3, + LINK_CONTROLCONNECT = 4, + LINK_CONTROLCONNECTWAIT = 5, + LINK_INITVNICREQ = 6, + LINK_INITVNICRSP = 7, + LINK_BEGINDATAPATH = 8, + LINK_CONFIGDATAPATHREQ = 9, + LINK_CONFIGDATAPATHRSP = 10, + LINK_DATACONNECT = 11, + LINK_DATACONNECTWAIT = 12, + LINK_XCHGPOOLREQ = 13, + LINK_XCHGPOOLRSP = 14, + LINK_INITIALIZED = 15, + LINK_IDLE = 16, + LINK_IDLING = 17, + LINK_CONFIGLINKREQ = 18, + LINK_CONFIGLINKRSP = 19, + LINK_CONFIGADDRSREQ = 20, + LINK_CONFIGADDRSRSP = 21, + LINK_REPORTSTATREQ = 22, + LINK_REPORTSTATRSP = 23, + LINK_HEARTBEATREQ = 24, + LINK_HEARTBEATRSP = 25, + LINK_RESET = 26, + LINK_RESETRSP = 27, + LINK_RESETCONTROL = 28, + LINK_RESETCONTROLRSP = 29, + LINK_DATADISCONNECT = 30, + LINK_CONTROLDISCONNECT = 31, + LINK_CLEANUPDATA = 32, + LINK_CLEANUPCONTROL = 33, + LINK_DISCONNECTED = 34, + LINK_RETRYWAIT = 35, + LINK_FIRSTCONNECT = 36 +}; + +enum { + BROADCAST_ADDR = 0, + UNICAST_ADDR = 1, + MCAST_ADDR_START = 2 +}; + +#define current_mac_address mac_addresses[UNICAST_ADDR].address + +enum { + NEED_STATS = 0x00000001, + NEED_ADDRESS_CONFIG = 0x00000002, + NEED_LINK_CONFIG = 0x00000004, + MCAST_OVERFLOW = 0x00000008, + NEED_MCAST_COMPLETION = 0x00000010, + NEED_MCAST_JOIN = 0x00000020 +}; + +struct viport { + struct list_head list_ptrs; + struct netpath *parent; + struct vnic *vnic; + struct viport_config *config; + struct control control; + struct data data; + spinlock_t lock; + struct ib_pd *pd; + enum viport_state state; + enum link_state link_state; + struct vnic_cmd_report_stats_rsp stats; + wait_queue_head_t stats_queue; + unsigned long last_stats_time; + u32 features_supported; + u8 hw_mac_address[ETH_ALEN]; + u16 default_vlan; + u16 num_mac_addresses; + struct vnic_address_op2 *mac_addresses; + u32 updates; + u16 flags; + u16 new_flags; + u16 mtu; + u16 new_mtu; + u32 errored; + enum { WAIT, DELAY, NOW } connect; + u32 disconnect; + u32 retry; + wait_queue_head_t disconnect_queue; + int timer_active; + struct timer_list timer; + u32 retry_duration; + u32 total_retry_duration; + atomic_t reference_count; + wait_queue_head_t reference_queue; + struct mc_info mc_info; + struct mc_data mc_data; +}; + +int viport_start(void); +void viport_cleanup(void); + +struct viport *viport_allocate(struct viport_config *config); +void viport_free(struct viport *viport); + +void viport_connect(struct viport *viport, int delay); + +void viport_set_link(struct viport *viport, u16 flags, u16 mtu); +void viport_get_stats(struct viport *viport, + struct net_device_stats *stats); +int viport_xmit_packet(struct viport *viport, struct sk_buff *skb); +void viport_kick(struct viport *viport); + +void viport_failure(struct viport *viport); + +int viport_set_unicast(struct viport *viport, u8 *address); +int viport_set_multicast(struct viport *viport, + struct dev_mc_list *mc_list, + int mc_count); + +#define viport_max_mtu(viport) data_max_mtu(&(viport)->data) + +#define viport_get_hw_addr(viport, address) \ + memcpy(address, (viport)->hw_mac_address, ETH_ALEN) + +#define viport_features(viport) ((viport)->features_supported) + +#define viport_can_tx_csum(viport) \ + (((viport)->features_supported & \ + (VNIC_FEAT_IPV4_CSUM_TX | VNIC_FEAT_TCP_CSUM_TX 
| \
+	    VNIC_FEAT_UDP_CSUM_TX)) == (VNIC_FEAT_IPV4_CSUM_TX | \
+	    VNIC_FEAT_TCP_CSUM_TX | VNIC_FEAT_UDP_CSUM_TX))
+
+#endif	/* VNIC_VIPORT_H_INCLUDED */

From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:55:54 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 29 May 2008 15:25:54 +0530
Subject: [ofa-general] [PATCH v3 04/13] QLogic VNIC: Implementation of
	Control path of communication protocol
In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain>
References: <20080529095126.9943.84692.stgit@localhost.localdomain>
Message-ID: <20080529095554.9943.43485.stgit@localhost.localdomain>

From: Poornima Kamath

This patch adds the files that define the control packet formats and
implements the various control messages that are exchanged as part of
the communication protocol with the EVIC/VEx.

Signed-off-by: Poornima Kamath
Signed-off-by: Ramachandra K
Signed-off-by: Amar Mudrankit
---

 drivers/infiniband/ulp/qlgc_vnic/vnic_control.c | 2286 ++++++++++++++++++++
 drivers/infiniband/ulp/qlgc_vnic/vnic_control.h |  179 ++
 .../infiniband/ulp/qlgc_vnic/vnic_control_pkt.h |  368 +++
 3 files changed, 2833 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.c
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control.h
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h

diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_control.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.c
new file mode 100644
index 0000000..774a071
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.c
@@ -0,0 +1,2286 @@
+/*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *	- Redistributions of source code must retain the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer.
+ *
+ *	- Redistributions in binary form must reproduce the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer in the documentation and/or other materials
+ *	  provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */ + +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_stats.h" + +#define vnic_multicast_address(rsp2_address, index) \ + ((rsp2_address)->list_address_ops[index].address[0] & 0x01) + +static void control_log_control_packet(struct vnic_control_packet *pkt); + +char *control_ifcfg_name(struct control *control) +{ + if (!control) + return "nctl"; + if (!control->parent) + return "np"; + if (!control->parent->parent) + return "npp"; + if (!control->parent->parent->parent) + return "nppp"; + if (!control->parent->parent->parent->config) + return "npppc"; + return (control->parent->parent->parent->config->name); +} + +static void control_recv(struct control *control, struct recv_io *recv_io) +{ + if (vnic_ib_post_recv(&control->ib_conn, &recv_io->io)) + viport_failure(control->parent); +} + +static void control_recv_complete(struct io *io) +{ + struct recv_io *recv_io = (struct recv_io *)io; + struct recv_io *last_recv_io; + struct control *control = &io->viport->control; + struct vnic_control_packet *pkt = control_packet(recv_io); + struct vnic_control_header *c_hdr = &pkt->hdr; + unsigned long flags; + cycles_t response_time; + + CONTROL_FUNCTION("%s: control_recv_complete() State=%d\n", + control_ifcfg_name(control), control->req_state); + + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + control_note_rsptime_stats(&response_time); + CONTROL_PACKET(pkt); + spin_lock_irqsave(&control->io_lock, flags); + if (c_hdr->pkt_type == TYPE_INFO) { + last_recv_io = control->info; + control->info = recv_io; + spin_unlock_irqrestore(&control->io_lock, flags); + viport_kick(control->parent); + if (last_recv_io) + control_recv(control, last_recv_io); + } else if (c_hdr->pkt_type == TYPE_RSP) { + u8 repost = 0; + u8 fail = 0; + u8 kick = 0; + + switch (control->req_state) { + case REQ_INACTIVE: + case RSP_RECEIVED: + case REQ_COMPLETED: + CONTROL_ERROR("%s: Unexpected control" + "response received: CMD = %d\n", + control_ifcfg_name(control), + c_hdr->pkt_cmd); + control_log_control_packet(pkt); + control->req_state = REQ_FAILED; + fail = 1; + break; + case REQ_POSTED: + case REQ_SENT: + if (c_hdr->pkt_cmd != control->last_cmd + || c_hdr->pkt_seq_num != control->seq_num) { + CONTROL_ERROR("%s: Incorrect Control Response " + "received\n", + control_ifcfg_name(control)); + CONTROL_ERROR("%s: Sent control request:\n", + control_ifcfg_name(control)); + control_log_control_packet(control_last_req(control)); + CONTROL_ERROR("%s: Received control response:\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + control->req_state = REQ_FAILED; + fail = 1; + } else { + control->response = recv_io; + control_update_rsptime_stats(control, + response_time); + if (control->req_state == REQ_POSTED) { + CONTROL_INFO("%s: Recv CMD RSP %d" + "before Send Completion\n", + control_ifcfg_name(control), + c_hdr->pkt_cmd); + control->req_state = RSP_RECEIVED; + } else { + control->req_state = REQ_COMPLETED; + kick = 1; + } + } + break; + case REQ_FAILED: + /* stay in REQ_FAILED state */ + repost = 1; + break; + } + spin_unlock_irqrestore(&control->io_lock, flags); + /* we must do this outside the lock*/ + if (kick) + viport_kick(control->parent); + if (repost || fail) { + control_recv(control, recv_io); + if (fail) + viport_failure(control->parent); + } + + } else { + list_add_tail(&recv_io->io.list_ptrs, + &control->failure_list); + 
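+		/*
+		 * TYPE_ERR and unrecognized packets are queued here and
+		 * drained later by control_process_async().
+		 */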
spin_unlock_irqrestore(&control->io_lock, flags);
+		viport_kick(control->parent);
+	}
+	ib_dma_sync_single_for_device(control->parent->config->ibdev,
+				      control->recv_dma, control->recv_len,
+				      DMA_FROM_DEVICE);
+}
+
+static void control_timeout(unsigned long data)
+{
+	struct control *control;
+	unsigned long flags;
+	u8 fail = 0;
+	u8 kick = 0;
+
+	control = (struct control *)data;
+	CONTROL_FUNCTION("%s: control_timeout(), State=%d\n",
+			 control_ifcfg_name(control), control->req_state);
+	control->timer_state = TIMER_EXPIRED;
+
+	spin_lock_irqsave(&control->io_lock, flags);
+	switch (control->req_state) {
+	case REQ_INACTIVE:
+		kick = 1;
+		/* stay in REQ_INACTIVE state */
+		break;
+	case REQ_POSTED:
+	case REQ_SENT:
+		control->req_state = REQ_FAILED;
+		CONTROL_ERROR("%s: No send completion for Cmd=%d\n",
+			      control_ifcfg_name(control), control->last_cmd);
+		control_timeout_stats(control);
+		fail = 1;
+		break;
+	case RSP_RECEIVED:
+		control->req_state = REQ_FAILED;
+		CONTROL_ERROR("%s: No response received from EIOC for Cmd=%d\n",
+			      control_ifcfg_name(control), control->last_cmd);
+		control_timeout_stats(control);
+		fail = 1;
+		break;
+	case REQ_COMPLETED:
+		/* stay in REQ_COMPLETED state */
+		kick = 1;
+		break;
+	case REQ_FAILED:
+		/* stay in REQ_FAILED state */
+		break;
+	}
+	spin_unlock_irqrestore(&control->io_lock, flags);
+	/* we must do this outside the lock */
+	if (fail)
+		viport_failure(control->parent);
+	if (kick)
+		viport_kick(control->parent);
+
+	return;
+}
+
+static void control_timer(struct control *control, int timeout)
+{
+	CONTROL_FUNCTION("%s: control_timer()\n",
+			 control_ifcfg_name(control));
+	if (control->timer_state == TIMER_ACTIVE)
+		mod_timer(&control->timer, jiffies + timeout);
+	else {
+		init_timer(&control->timer);
+		control->timer.expires = jiffies + timeout;
+		control->timer.data = (unsigned long)control;
+		control->timer.function = control_timeout;
+		control->timer_state = TIMER_ACTIVE;
+		add_timer(&control->timer);
+	}
+}
+
+static void control_timer_stop(struct control *control)
+{
+	CONTROL_FUNCTION("%s: control_timer_stop()\n",
+			 control_ifcfg_name(control));
+	if (control->timer_state == TIMER_ACTIVE)
+		del_timer_sync(&control->timer);
+
+	control->timer_state = TIMER_IDLE;
+}
+
+static int control_send(struct control *control, struct send_io *send_io)
+{
+	unsigned long flags;
+	int ret = -1;
+	u8 fail = 0;
+	struct vnic_control_packet *pkt = control_packet(send_io);
+
+	CONTROL_FUNCTION("%s: control_send(), State=%d\n",
+			 control_ifcfg_name(control), control->req_state);
+	spin_lock_irqsave(&control->io_lock, flags);
+	switch (control->req_state) {
+	case REQ_INACTIVE:
+		CONTROL_PACKET(pkt);
+		control_timer(control, control->config->rsp_timeout);
+		control_note_reqtime_stats(control);
+		if (vnic_ib_post_send(&control->ib_conn, &control->send_io.io)) {
+			CONTROL_ERROR("%s: Failed to post send\n",
+				      control_ifcfg_name(control));
+			/* stay in REQ_INACTIVE state */
+			fail = 1;
+		} else {
+			control->last_cmd = pkt->hdr.pkt_cmd;
+			control->req_state = REQ_POSTED;
+			ret = 0;
+		}
+		break;
+	case REQ_POSTED:
+	case REQ_SENT:
+	case RSP_RECEIVED:
+	case REQ_COMPLETED:
+		CONTROL_ERROR("%s: Previous command is not completed."
+			      " New CMD: %d Last CMD: %d Seq: %d\n",
+			      control_ifcfg_name(control), pkt->hdr.pkt_cmd,
+			      control->last_cmd, control->seq_num);
+
+		control->req_state = REQ_FAILED;
+		fail = 1;
+		break;
+	case REQ_FAILED:
+		/* this can occur after an error, when the viport state
+		 * machine attempts to reset the link.
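+		 * The request is simply dropped: nothing is posted,
+		 * control_send() returns failure, and the reset
+		 * sequence carries on.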
+ */ + CONTROL_INFO("%s:Attempt to send in failed state." + "New CMD: %d Last CMD: %d\n", + control_ifcfg_name(control), pkt->hdr.pkt_cmd, + control->last_cmd); + /* stay in REQ_FAILED state*/ + break; + } + spin_unlock_irqrestore(&control->io_lock, flags); + + /* we must do this outside the lock */ + if (fail) + viport_failure(control->parent); + return ret; + +} + +static void control_send_complete(struct io *io) +{ + struct control *control = &io->viport->control; + unsigned long flags; + u8 fail = 0; + u8 kick = 0; + + CONTROL_FUNCTION("%s: control_sendComplete(), State=%d\n", + control_ifcfg_name(control), control->req_state); + spin_lock_irqsave(&control->io_lock, flags); + switch (control->req_state) { + case REQ_INACTIVE: + case REQ_SENT: + case REQ_COMPLETED: + CONTROL_ERROR("%s: Unexpected control send completion\n", + control_ifcfg_name(control)); + fail = 1; + control->req_state = REQ_FAILED; + break; + case REQ_POSTED: + control->req_state = REQ_SENT; + break; + case RSP_RECEIVED: + control->req_state = REQ_COMPLETED; + kick = 1; + break; + case REQ_FAILED: + /* stay in REQ_FAILED state */ + break; + } + spin_unlock_irqrestore(&control->io_lock, flags); + /* we must do this outside the lock */ + if (fail) + viport_failure(control->parent); + if (kick) + viport_kick(control->parent); + + return; +} + +void control_process_async(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + unsigned long flags; + + CONTROL_FUNCTION("%s: control_process_async()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + spin_lock_irqsave(&control->io_lock, flags); + recv_io = control->info; + if (recv_io) { + CONTROL_INFO("%s: processing info packet\n", + control_ifcfg_name(control)); + control->info = NULL; + spin_unlock_irqrestore(&control->io_lock, flags); + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd == CMD_REPORT_STATUS) { + u32 status; + status = + be32_to_cpu(pkt->cmd.report_status.status_number); + switch (status) { + case VNIC_STATUS_LINK_UP: + CONTROL_INFO("%s: link up\n", + control_ifcfg_name(control)); + vnic_link_up(control->parent->vnic, + control->parent->parent); + break; + case VNIC_STATUS_LINK_DOWN: + CONTROL_INFO("%s: link down\n", + control_ifcfg_name(control)); + vnic_link_down(control->parent->vnic, + control->parent->parent); + break; + default: + CONTROL_ERROR("%s: asynchronous status" + " received from EIOC\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + break; + } + } + if ((pkt->hdr.pkt_cmd != CMD_REPORT_STATUS) || + pkt->cmd.report_status.is_fatal) + viport_failure(control->parent); + + control_recv(control, recv_io); + spin_lock_irqsave(&control->io_lock, flags); + } + + while (!list_empty(&control->failure_list)) { + CONTROL_INFO("%s: processing error packet\n", + control_ifcfg_name(control)); + recv_io = (struct recv_io *) + list_entry(control->failure_list.next, struct io, + list_ptrs); + list_del(&recv_io->io.list_ptrs); + spin_unlock_irqrestore(&control->io_lock, flags); + pkt = control_packet(recv_io); + CONTROL_ERROR("%s: asynchronous error received from EIOC\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + if ((pkt->hdr.pkt_type != TYPE_ERR) + || (pkt->hdr.pkt_cmd != CMD_REPORT_STATUS) + || pkt->cmd.report_status.is_fatal) + viport_failure(control->parent); + + control_recv(control, recv_io); + spin_lock_irqsave(&control->io_lock, flags); + } + 
spin_unlock_irqrestore(&control->io_lock, flags); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + CONTROL_FUNCTION("%s: done control_process_async\n", + control_ifcfg_name(control)); +} + +static struct send_io *control_init_hdr(struct control *control, u8 cmd) +{ + struct control_config *config; + struct vnic_control_packet *pkt; + struct vnic_control_header *hdr; + + CONTROL_FUNCTION("control_init_hdr()\n"); + config = control->config; + + pkt = control_packet(&control->send_io); + hdr = &pkt->hdr; + + hdr->pkt_type = TYPE_REQ; + hdr->pkt_cmd = cmd; + control->seq_num++; + hdr->pkt_seq_num = control->seq_num; + hdr->pkt_retry_count = 0; + + return &control->send_io; +} + +static struct recv_io *control_get_rsp(struct control *control) +{ + struct recv_io *recv_io = NULL; + unsigned long flags; + u8 fail = 0; + + CONTROL_FUNCTION("%s: control_getRsp(), State=%d\n", + control_ifcfg_name(control), control->req_state); + spin_lock_irqsave(&control->io_lock, flags); + switch (control->req_state) { + case REQ_INACTIVE: + CONTROL_ERROR("%s: Checked for Response with no" + "command pending\n", + control_ifcfg_name(control)); + control->req_state = REQ_FAILED; + fail = 1; + break; + case REQ_POSTED: + case REQ_SENT: + case RSP_RECEIVED: + /* no response available yet + stay in present state*/ + break; + case REQ_COMPLETED: + recv_io = control->response; + if (!recv_io) { + control->req_state = REQ_FAILED; + fail = 1; + break; + } + control->response = NULL; + control->last_cmd = CMD_INVALID; + control_timer_stop(control); + control->req_state = REQ_INACTIVE; + break; + case REQ_FAILED: + control_timer_stop(control); + /* stay in REQ_FAILED state*/ + break; + } + spin_unlock_irqrestore(&control->io_lock, flags); + if (fail) + viport_failure(control->parent); + return recv_io; +} + +int control_init_vnic_req(struct control *control) +{ + struct send_io *send_io; + struct control_config *config = control->config; + struct vnic_control_packet *pkt; + struct vnic_cmd_init_vnic_req *init_vnic_req; + + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_INIT_VNIC); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + init_vnic_req = &pkt->cmd.init_vnic_req; + init_vnic_req->vnic_major_version = + __constant_cpu_to_be16(VNIC_MAJORVERSION); + init_vnic_req->vnic_minor_version = + __constant_cpu_to_be16(VNIC_MINORVERSION); + init_vnic_req->vnic_instance = config->vnic_instance; + init_vnic_req->num_data_paths = 1; + init_vnic_req->num_address_entries = + cpu_to_be16(config->max_address_entries); + + control->last_cmd = pkt->hdr.pkt_cmd; + CONTROL_PACKET(pkt); + + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +static int control_chk_vnic_rsp_values(struct control *control, + u16 *num_addrs, + u8 num_data_paths, + u8 num_lan_switches, + u32 *features) +{ + + struct control_config *config = control->config; + + if ((control->maj_ver > VNIC_MAJORVERSION) + || ((control->maj_ver == VNIC_MAJORVERSION) + && (control->min_ver > VNIC_MINORVERSION))) { + CONTROL_ERROR("%s: unsupported version\n", + control_ifcfg_name(control)); + goto failure; + } + if 
(num_data_paths != 1) { + CONTROL_ERROR("%s: EIOC returned too many datapaths\n", + control_ifcfg_name(control)); + goto failure; + } + if (*num_addrs > config->max_address_entries) { + CONTROL_ERROR("%s: EIOC returned more address" + " entries than requested\n", + control_ifcfg_name(control)); + goto failure; + } + if (*num_addrs < config->min_address_entries) { + CONTROL_ERROR("%s: not enough address entries\n", + control_ifcfg_name(control)); + goto failure; + } + if (num_lan_switches < 1) { + CONTROL_ERROR("%s: EIOC returned no lan switches\n", + control_ifcfg_name(control)); + goto failure; + } + if (num_lan_switches > 1) { + CONTROL_ERROR("%s: EIOC returned multiple lan switches\n", + control_ifcfg_name(control)); + goto failure; + } + CONTROL_ERROR("%s checking features %x ib_multicast:%d\n", + control_ifcfg_name(control), + *features, config->ib_multicast); + if ((*features & VNIC_FEAT_INBOUND_IB_MC) && !config->ib_multicast) { + /* disable multicast if it is not on in the cfg file, or + if we turned it off because join failed */ + *features &= ~VNIC_FEAT_INBOUND_IB_MC; + } + + return 0; +failure: + return -1; +} + +int control_init_vnic_rsp(struct control *control, u32 *features, + u8 *mac_address, u16 *num_addrs, u16 *vlan) +{ + u8 num_data_paths; + u8 num_lan_switches; + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_init_vnic_rsp *init_vnic_rsp; + + + CONTROL_FUNCTION("%s: control_init_vnic_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_INIT_VNIC) + goto failure; + + init_vnic_rsp = &pkt->cmd.init_vnic_rsp; + control->maj_ver = be16_to_cpu(init_vnic_rsp->vnic_major_version); + control->min_ver = be16_to_cpu(init_vnic_rsp->vnic_minor_version); + num_data_paths = init_vnic_rsp->num_data_paths; + num_lan_switches = init_vnic_rsp->num_lan_switches; + *features = be32_to_cpu(init_vnic_rsp->features_supported); + *num_addrs = be16_to_cpu(init_vnic_rsp->num_address_entries); + + if (control_chk_vnic_rsp_values(control, num_addrs, + num_data_paths, + num_lan_switches, + features)) + goto failure; + + control->lan_switch.lan_switch_num = + init_vnic_rsp->lan_switch[0].lan_switch_num; + control->lan_switch.num_enet_ports = + init_vnic_rsp->lan_switch[0].num_enet_ports; + control->lan_switch.default_vlan = + init_vnic_rsp->lan_switch[0].default_vlan; + *vlan = be16_to_cpu(control->lan_switch.default_vlan); + memcpy(control->lan_switch.hw_mac_address, + init_vnic_rsp->lan_switch[0].hw_mac_address, ETH_ALEN); + memcpy(mac_address, init_vnic_rsp->lan_switch[0].hw_mac_address, + ETH_ALEN); + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +static void copy_recv_pool_config(struct vnic_recv_pool_config *src, + struct vnic_recv_pool_config *dst) +{ + dst->size_recv_pool_entry = src->size_recv_pool_entry; + dst->num_recv_pool_entries = src->num_recv_pool_entries; + dst->timeout_before_kick = src->timeout_before_kick; + dst->num_recv_pool_entries_before_kick = + src->num_recv_pool_entries_before_kick; + 
dst->num_recv_pool_bytes_before_kick =
+		src->num_recv_pool_bytes_before_kick;
+	dst->free_recv_pool_entries_per_update =
+		src->free_recv_pool_entries_per_update;
+}
+
+static int check_recv_pool_config_value(__be32 *src, __be32 *dst,
+					__be32 *max, __be32 *min,
+					char *name)
+{
+	u32 value;
+
+	value = be32_to_cpu(*src);
+	if (value > be32_to_cpu(*max)) {
+		CONTROL_ERROR("value %s too large\n", name);
+		return -1;
+	} else if (value < be32_to_cpu(*min)) {
+		CONTROL_ERROR("value %s too small\n", name);
+		return -1;
+	}
+
+	*dst = cpu_to_be32(value);
+	return 0;
+}
+
+static int check_recv_pool_config(struct vnic_recv_pool_config *src,
+				  struct vnic_recv_pool_config *dst,
+				  struct vnic_recv_pool_config *max,
+				  struct vnic_recv_pool_config *min)
+{
+	if (check_recv_pool_config_value(&src->size_recv_pool_entry,
+					 &dst->size_recv_pool_entry,
+					 &max->size_recv_pool_entry,
+					 &min->size_recv_pool_entry,
+					 "size_recv_pool_entry")
+	    || check_recv_pool_config_value(&src->num_recv_pool_entries,
+					    &dst->num_recv_pool_entries,
+					    &max->num_recv_pool_entries,
+					    &min->num_recv_pool_entries,
+					    "num_recv_pool_entries")
+	    || check_recv_pool_config_value(&src->timeout_before_kick,
+					    &dst->timeout_before_kick,
+					    &max->timeout_before_kick,
+					    &min->timeout_before_kick,
+					    "timeout_before_kick")
+	    || check_recv_pool_config_value(&src->num_recv_pool_entries_before_kick,
+					    &dst->num_recv_pool_entries_before_kick,
+					    &max->num_recv_pool_entries_before_kick,
+					    &min->num_recv_pool_entries_before_kick,
+					    "num_recv_pool_entries_before_kick")
+	    || check_recv_pool_config_value(&src->num_recv_pool_bytes_before_kick,
+					    &dst->num_recv_pool_bytes_before_kick,
+					    &max->num_recv_pool_bytes_before_kick,
+					    &min->num_recv_pool_bytes_before_kick,
+					    "num_recv_pool_bytes_before_kick")
+	    || check_recv_pool_config_value(&src->free_recv_pool_entries_per_update,
+					    &dst->free_recv_pool_entries_per_update,
+					    &max->free_recv_pool_entries_per_update,
+					    &min->free_recv_pool_entries_per_update,
+					    "free_recv_pool_entries_per_update"))
+		goto failure;
+
+	if (!is_power_of_2(be32_to_cpu(dst->num_recv_pool_entries))) {
+		CONTROL_ERROR("num_recv_pool_entries (%d)"
+			      " must be a power of 2\n",
+			      be32_to_cpu(dst->num_recv_pool_entries));
+		goto failure;
+	}
+
+	if (!is_power_of_2(be32_to_cpu(dst->free_recv_pool_entries_per_update))) {
+		CONTROL_ERROR("free_recv_pool_entries_per_update (%d)"
+			      " must be a power of 2\n",
+			      be32_to_cpu(dst->free_recv_pool_entries_per_update));
+		goto failure;
+	}
+
+	if (be32_to_cpu(dst->free_recv_pool_entries_per_update) >=
+	    be32_to_cpu(dst->num_recv_pool_entries)) {
+		CONTROL_ERROR("free_recv_pool_entries_per_update (%d) must"
+			      " be less than num_recv_pool_entries (%d)\n",
+			      be32_to_cpu(dst->free_recv_pool_entries_per_update),
+			      be32_to_cpu(dst->num_recv_pool_entries));
+		goto failure;
+	}
+
+	if (be32_to_cpu(dst->num_recv_pool_entries_before_kick) >=
+	    be32_to_cpu(dst->num_recv_pool_entries)) {
+		CONTROL_ERROR("num_recv_pool_entries_before_kick (%d) must"
+			      " be less than num_recv_pool_entries (%d)\n",
+			      be32_to_cpu(dst->num_recv_pool_entries_before_kick),
+			      be32_to_cpu(dst->num_recv_pool_entries));
+		goto failure;
+	}
+
+	return 0;
+failure:
+	return -1;
+}
+
+int control_config_data_path_req(struct control *control, u64 path_id,
+				 struct vnic_recv_pool_config *host,
+				 struct vnic_recv_pool_config *eioc)
+{
+	struct send_io *send_io;
+	struct vnic_control_packet *pkt;
+	struct vnic_cmd_config_data_path *config_data_path;
+
+	CONTROL_FUNCTION("%s: control_config_data_path_req()\n",
+			 control_ifcfg_name(control));
+
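+	/*
+	 * Give the CPU ownership of the send buffer while the request
+	 * is built; it is synced back to the device before posting.
+	 */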
ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_CONFIG_DATA_PATH); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_data_path = &pkt->cmd.config_data_path_req; + config_data_path->data_path = 0; + config_data_path->path_identifier = path_id; + copy_recv_pool_config(host, + &config_data_path->host_recv_pool_config); + copy_recv_pool_config(eioc, + &config_data_path->eioc_recv_pool_config); + CONTROL_PACKET(pkt); + + control->last_cmd = pkt->hdr.pkt_cmd; + + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_config_data_path_rsp(struct control *control, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc, + struct vnic_recv_pool_config *max_host, + struct vnic_recv_pool_config *max_eioc, + struct vnic_recv_pool_config *min_host, + struct vnic_recv_pool_config *min_eioc) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_data_path *config_data_path; + + CONTROL_FUNCTION("%s: control_config_data_path_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_CONFIG_DATA_PATH) + goto failure; + + config_data_path = &pkt->cmd.config_data_path_rsp; + if (config_data_path->data_path != 0) { + CONTROL_ERROR("%s: received CMD_CONFIG_DATA_PATH response" + " for wrong data path: %u\n", + control_ifcfg_name(control), + config_data_path->data_path); + goto failure; + } + + if (check_recv_pool_config(&config_data_path-> + host_recv_pool_config, + host, max_host, min_host) + || check_recv_pool_config(&config_data_path-> + eioc_recv_pool_config, + eioc, max_eioc, min_eioc)) { + goto failure; + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_exchange_pools_req(struct control *control, u64 addr, u32 rkey) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_exchange_pools *exchange_pools; + + CONTROL_FUNCTION("%s: control_exchange_pools_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_EXCHANGE_POOLS); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + exchange_pools = &pkt->cmd.exchange_pools_req; + exchange_pools->data_path = 0; + exchange_pools->pool_rkey = cpu_to_be32(rkey); + exchange_pools->pool_addr = cpu_to_be64(addr); + + control->last_cmd = pkt->hdr.pkt_cmd; + + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + 
ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_exchange_pools_rsp(struct control *control, u64 *addr, + u32 *rkey) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_exchange_pools *exchange_pools; + + CONTROL_FUNCTION("%s: control_exchange_pools_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_EXCHANGE_POOLS) + goto failure; + + exchange_pools = &pkt->cmd.exchange_pools_rsp; + *rkey = be32_to_cpu(exchange_pools->pool_rkey); + *addr = be64_to_cpu(exchange_pools->pool_addr); + + if (exchange_pools->data_path != 0) { + CONTROL_ERROR("%s: received CMD_EXCHANGE_POOLS response" + " for wrong data path: %u\n", + control_ifcfg_name(control), + exchange_pools->data_path); + goto failure; + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_config_link_req(struct control *control, u16 flags, u16 mtu) +{ + struct send_io *send_io; + struct vnic_cmd_config_link *config_link_req; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_config_link_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_CONFIG_LINK); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_link_req = &pkt->cmd.config_link_req; + config_link_req->lan_switch_num = + control->lan_switch.lan_switch_num; + config_link_req->cmd_flags = VNIC_FLAG_SET_MTU; + if (flags & IFF_UP) + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_NIC; + else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_NIC; + if (flags & IFF_ALLMULTI) + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_MCAST_ALL; + else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_MCAST_ALL; + if (flags & IFF_PROMISC) { + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_PROMISC; + /* the EIOU doesn't really do PROMISC mode. + * if PROMISC is set, it only receives unicast packets + * I also have to set MCAST_ALL if I want real + * PROMISC mode. 
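+		 * That is what the two lines below do: clear
+		 * DISABLE_MCAST_ALL and force ENABLE_MCAST_ALL whenever
+		 * promiscuous mode is requested.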
+ */ + config_link_req->cmd_flags &= ~VNIC_FLAG_DISABLE_MCAST_ALL; + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_MCAST_ALL; + } else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_PROMISC; + + config_link_req->mtu_size = cpu_to_be16(mtu); + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_config_link_rsp(struct control *control, u16 *flags, u16 *mtu) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_link *config_link_rsp; + + CONTROL_FUNCTION("%s: control_config_link_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_CONFIG_LINK) + goto failure; + config_link_rsp = &pkt->cmd.config_link_rsp; + if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_NIC) + *flags |= IFF_UP; + if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL) + *flags |= IFF_ALLMULTI; + if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_PROMISC) + *flags |= IFF_PROMISC; + + *mtu = be16_to_cpu(config_link_rsp->mtu_size); + + if (control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + /* featuresSupported might include INBOUND_IB_MC but + MTU might cause it to be auto-disabled at embedded */ + if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL) { + union ib_gid mgid = config_link_rsp->allmulti_mgid; + if (mgid.raw[0] != 0xff) { + CONTROL_ERROR("%s: invalid formatprefix " + VNIC_GID_FMT "\n", + control_ifcfg_name(control), + VNIC_GID_RAW_ARG(mgid.raw)); + } else { + /* rather than issuing join here, which might + * arrive at SM before EVIC creates the MC + * group, postpone it. + */ + vnic_mc_join_setup(control->parent, &mgid); + CONTROL_ERROR("join setup for ALL_MULTI\n"); + } + } + /* we don't want to leave mcast group if MCAST_ALL is disabled + * because there are no doubt multicast addresses set and we + * want to stay joined so we can get that traffic via the + * mcast group. 
+ */ + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +/* control_config_addrs_req: + * return values: + * -1: failure + * 0: incomplete (successful operation, but more address + * table entries to be updated) + * 1: complete + */ +int control_config_addrs_req(struct control *control, + struct vnic_address_op2 *addrs, u16 num) +{ + u16 i; + u8 j; + int ret = 1; + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_addresses *config_addrs_req; + struct vnic_cmd_config_addresses2 *config_addrs_req2; + + CONTROL_FUNCTION("%s: control_config_addrs_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + if (control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC) { + CONTROL_INFO("Sending CMD_CONFIG_ADDRESSES2 %lx MAX:%d " + "sizes:%d %d(off:%d) sizes2:%d %d %d" + "(off:%d - %d %d %d %d %d %d %d)\n", jiffies, + (int)MAX_CONFIG_ADDR_ENTRIES2, + (int)sizeof(struct vnic_cmd_config_addresses), + (int)sizeof(struct vnic_address_op), + (int)offsetof(struct vnic_cmd_config_addresses, + list_address_ops), + (int)sizeof(struct vnic_cmd_config_addresses2), + (int)sizeof(struct vnic_address_op2), + (int)sizeof(union ib_gid), + (int)offsetof(struct vnic_cmd_config_addresses2, + list_address_ops), + (int)offsetof(struct vnic_address_op2, index), + (int)offsetof(struct vnic_address_op2, operation), + (int)offsetof(struct vnic_address_op2, valid), + (int)offsetof(struct vnic_address_op2, address), + (int)offsetof(struct vnic_address_op2, vlan), + (int)offsetof(struct vnic_address_op2, reserved), + (int)offsetof(struct vnic_address_op2, mgid) + ); + send_io = control_init_hdr(control, CMD_CONFIG_ADDRESSES2); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_addrs_req2 = &pkt->cmd.config_addresses_req2; + memset(pkt->cmd.cmd_data, 0, VNIC_MAX_CONTROLDATASZ); + config_addrs_req2->lan_switch_num = + control->lan_switch.lan_switch_num; + for (i = 0, j = 0; (i < num) && (j < MAX_CONFIG_ADDR_ENTRIES2); i++) { + if (!addrs[i].operation) + continue; + config_addrs_req2->list_address_ops[j].index = + cpu_to_be16(i); + config_addrs_req2->list_address_ops[j].operation = + VNIC_OP_SET_ENTRY; + config_addrs_req2->list_address_ops[j].valid = + addrs[i].valid; + memcpy(config_addrs_req2->list_address_ops[j].address, + addrs[i].address, ETH_ALEN); + config_addrs_req2->list_address_ops[j].vlan = + addrs[i].vlan; + addrs[i].operation = 0; + CONTROL_INFO("%s i=%d " + "addr[%d]=%02x:%02x:%02x:%02x:%02x:%02x " + "valid:%d\n", control_ifcfg_name(control), i, j, + addrs[i].address[0], addrs[i].address[1], + addrs[i].address[2], addrs[i].address[3], + addrs[i].address[4], addrs[i].address[5], + addrs[i].valid); + j++; + } + config_addrs_req2->num_address_ops = j; + } else { + send_io = control_init_hdr(control, CMD_CONFIG_ADDRESSES); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_addrs_req = &pkt->cmd.config_addresses_req; + config_addrs_req->lan_switch_num = + control->lan_switch.lan_switch_num; + for (i = 0, j = 0; (i < num) && (j < 16); i++) { + if (!addrs[i].operation) + continue; + 
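+			/*
+			 * Copy the pending table entry into the request
+			 * and clear its pending-operation flag.
+			 */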
config_addrs_req->list_address_ops[j].index = + cpu_to_be16(i); + config_addrs_req->list_address_ops[j].operation = + VNIC_OP_SET_ENTRY; + config_addrs_req->list_address_ops[j].valid = + addrs[i].valid; + memcpy(config_addrs_req->list_address_ops[j].address, + addrs[i].address, ETH_ALEN); + config_addrs_req->list_address_ops[j].vlan = + addrs[i].vlan; + addrs[i].operation = 0; + j++; + } + config_addrs_req->num_address_ops = j; + } + for (; i < num; i++) { + if (addrs[i].operation) { + ret = 0; + break; + } + } + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + if (control_send(control, send_io)) + return -1; + return ret; +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +static int process_cmd_config_address2_rsp(struct control *control, + struct vnic_control_packet *pkt, + struct recv_io *recv_io) +{ + struct vnic_cmd_config_addresses2 *config_addrs_rsp2; + int idx, mcaddrs, nomgid; + union ib_gid mgid, rsp_mgid; + + config_addrs_rsp2 = &pkt->cmd.config_addresses_rsp2; + CONTROL_INFO("%s rsp to CONFIG_ADDRESSES2\n", + control_ifcfg_name(control)); + + for (idx = 0, mcaddrs = 0, nomgid = 1; + idx < config_addrs_rsp2->num_address_ops; + idx++) { + if (!config_addrs_rsp2->list_address_ops[idx].valid) + continue; + + /* check if address is multicasts */ + if (!vnic_multicast_address(config_addrs_rsp2, idx)) + continue; + + mcaddrs++; + mgid = config_addrs_rsp2->list_address_ops[idx].mgid; + CONTROL_INFO("%s: got mgid " VNIC_GID_FMT + " MCAST_MSG_SIZE:%d mtu:%d\n", + control_ifcfg_name(control), + VNIC_GID_RAW_ARG(mgid.raw), + (int)MCAST_MSG_SIZE, + control->parent->mtu); + + /* Embedded should have turned off multicast + * due to large MTU size; mgid had better be 0. + */ + if (control->parent->mtu > MCAST_MSG_SIZE) { + if ((mgid.global.subnet_prefix != 0) || + (mgid.global.interface_id != 0)) { + CONTROL_ERROR("%s: invalid mgid; " + "expected 0 " + VNIC_GID_FMT "\n", + control_ifcfg_name(control), + VNIC_GID_RAW_ARG(mgid.raw)); + } + continue; + } + if (mgid.raw[0] != 0xff) { + CONTROL_ERROR("%s: invalid formatprefix " + VNIC_GID_FMT "\n", + control_ifcfg_name(control), + VNIC_GID_RAW_ARG(mgid.raw)); + continue; + } + nomgid = 0; /* got a valid mgid */ + + /* let's verify that all the mgids match this one */ + for (; idx < config_addrs_rsp2->num_address_ops; idx++) { + if (!config_addrs_rsp2->list_address_ops[idx].valid) + continue; + + /* check if address is multicasts */ + if (!vnic_multicast_address(config_addrs_rsp2, idx)) + continue; + + rsp_mgid = config_addrs_rsp2->list_address_ops[idx].mgid; + if (memcmp(&mgid, &rsp_mgid, sizeof(union ib_gid)) == 0) + continue; + + CONTROL_ERROR("%s: Multicast Group MGIDs not " + "unique; mgids: " VNIC_GID_FMT + " " VNIC_GID_FMT "\n", + control_ifcfg_name(control), + VNIC_GID_RAW_ARG(mgid.raw), + VNIC_GID_RAW_ARG(rsp_mgid.raw)); + return 1; + } + + /* rather than issuing join here, which might arrive + * at SM before EVIC creates the MC group, postpone it. + */ + vnic_mc_join_setup(control->parent, &mgid); + + /* there is only one multicast group to join, so we're done. */ + break; + } + + /* we sent atleast one multicast address but got no MGID + * back so, if it is not allmulti case, leave the group + * we joined before. 
(for allmulti case we have to stay + * joined) + */ + if ((config_addrs_rsp2->num_address_ops > 0) && (mcaddrs > 0) && + nomgid && !(control->parent->flags & IFF_ALLMULTI)) { + CONTROL_INFO("numaddrops:%d mcadrs:%d nomgid:%d\n", + config_addrs_rsp2->num_address_ops, + mcaddrs > 0, nomgid); + + vnic_mc_leave(control->parent); + } + + return 0; +} + +int control_config_addrs_rsp(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_config_addrs_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if ((pkt->hdr.pkt_cmd != CMD_CONFIG_ADDRESSES) && + (pkt->hdr.pkt_cmd != CMD_CONFIG_ADDRESSES2)) + goto failure; + + if (((pkt->hdr.pkt_cmd == CMD_CONFIG_ADDRESSES2) && + !control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC) || + ((pkt->hdr.pkt_cmd == CMD_CONFIG_ADDRESSES) && + control->parent->features_supported & VNIC_FEAT_INBOUND_IB_MC)) { + CONTROL_ERROR("%s unexpected response pktCmd:%d flag:%x\n", + control_ifcfg_name(control), pkt->hdr.pkt_cmd, + control->parent->features_supported & + VNIC_FEAT_INBOUND_IB_MC); + goto failure; + } + + if (pkt->hdr.pkt_cmd == CMD_CONFIG_ADDRESSES2) { + if (process_cmd_config_address2_rsp(control, pkt, recv_io)) + goto failure; + } else { + struct vnic_cmd_config_addresses *config_addrs_rsp; + config_addrs_rsp = &pkt->cmd.config_addresses_rsp; + } + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_report_statistics_req(struct control *control) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_report_stats_req *report_statistics_req; + + CONTROL_FUNCTION("%s: control_report_statistics_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_REPORT_STATISTICS); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + report_statistics_req = &pkt->cmd.report_statistics_req; + report_statistics_req->lan_switch_num = + control->lan_switch.lan_switch_num; + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_report_statistics_rsp(struct control *control, + struct vnic_cmd_report_stats_rsp *stats) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_report_stats_rsp *rep_stat_rsp; + + CONTROL_FUNCTION("%s: control_report_statistics_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != 
CMD_REPORT_STATISTICS) + goto failure; + + rep_stat_rsp = &pkt->cmd.report_statistics_rsp; + + stats->if_in_broadcast_pkts = rep_stat_rsp->if_in_broadcast_pkts; + stats->if_in_multicast_pkts = rep_stat_rsp->if_in_multicast_pkts; + stats->if_in_octets = rep_stat_rsp->if_in_octets; + stats->if_in_ucast_pkts = rep_stat_rsp->if_in_ucast_pkts; + stats->if_in_nucast_pkts = rep_stat_rsp->if_in_nucast_pkts; + stats->if_in_underrun = rep_stat_rsp->if_in_underrun; + stats->if_in_errors = rep_stat_rsp->if_in_errors; + stats->if_out_errors = rep_stat_rsp->if_out_errors; + stats->if_out_octets = rep_stat_rsp->if_out_octets; + stats->if_out_ucast_pkts = rep_stat_rsp->if_out_ucast_pkts; + stats->if_out_multicast_pkts = rep_stat_rsp->if_out_multicast_pkts; + stats->if_out_broadcast_pkts = rep_stat_rsp->if_out_broadcast_pkts; + stats->if_out_nucast_pkts = rep_stat_rsp->if_out_nucast_pkts; + stats->if_out_ok = rep_stat_rsp->if_out_ok; + stats->if_in_ok = rep_stat_rsp->if_in_ok; + stats->if_out_ucast_bytes = rep_stat_rsp->if_out_ucast_bytes; + stats->if_out_multicast_bytes = rep_stat_rsp->if_out_multicast_bytes; + stats->if_out_broadcast_bytes = rep_stat_rsp->if_out_broadcast_bytes; + stats->if_in_ucast_bytes = rep_stat_rsp->if_in_ucast_bytes; + stats->if_in_multicast_bytes = rep_stat_rsp->if_in_multicast_bytes; + stats->if_in_broadcast_bytes = rep_stat_rsp->if_in_broadcast_bytes; + stats->ethernet_status = rep_stat_rsp->ethernet_status; + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_reset_req(struct control *control) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_reset_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_RESET); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_reset_rsp(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_reset_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_RESET) + goto failure; + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_heartbeat_req(struct control *control, u32 hb_interval) +{ + struct send_io *send_io; + struct 
vnic_control_packet *pkt; + struct vnic_cmd_heartbeat *heartbeat_req; + + CONTROL_FUNCTION("%s: control_heartbeat_req()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_HEARTBEAT); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + heartbeat_req = &pkt->cmd.heartbeat_req; + heartbeat_req->hb_interval = cpu_to_be32(hb_interval); + + control->last_cmd = pkt->hdr.pkt_cmd; + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_heartbeat_rsp(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_heartbeat *heartbeat_rsp; + + CONTROL_FUNCTION("%s: control_heartbeat_rsp()\n", + control_ifcfg_name(control)); + ib_dma_sync_single_for_cpu(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_HEARTBEAT) + goto failure; + + heartbeat_rsp = &pkt->cmd.heartbeat_rsp; + + control_recv(control, recv_io); + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + ib_dma_sync_single_for_device(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +static int control_init_recv_ios(struct control *control, + struct viport *viport, + struct vnic_control_packet *pkt) +{ + struct io *io; + struct ib_device *ibdev = viport->config->ibdev; + struct control_config *config = control->config; + dma_addr_t recv_dma; + unsigned int i; + + + control->recv_len = sizeof *pkt * config->num_recvs; + control->recv_dma = ib_dma_map_single(ibdev, + pkt, control->recv_len, + DMA_FROM_DEVICE); + + if (ib_dma_mapping_error(ibdev, control->recv_dma)) { + CONTROL_ERROR("control recv dma map error\n"); + goto failure; + } + + recv_dma = control->recv_dma; + for (i = 0; i < config->num_recvs; i++) { + io = &control->recv_ios[i].io; + io->viport = viport; + io->routine = control_recv_complete; + io->type = RECV; + + control->recv_ios[i].virtual_addr = (u8 *)pkt; + control->recv_ios[i].list.addr = recv_dma; + control->recv_ios[i].list.length = sizeof *pkt; + control->recv_ios[i].list.lkey = control->mr->lkey; + + recv_dma = recv_dma + sizeof *pkt; + pkt++; + + io->rwr.wr_id = (u64)io; + io->rwr.sg_list = &control->recv_ios[i].list; + io->rwr.num_sge = 1; + if (vnic_ib_post_recv(&control->ib_conn, io)) + goto unmap_recv; + } + + return 0; +unmap_recv: + ib_dma_unmap_single(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); +failure: + return -1; +} + +static int control_init_send_ios(struct control *control, + struct viport *viport, + struct vnic_control_packet *pkt) +{ + struct io *io; + struct ib_device *ibdev = viport->config->ibdev; + + control->send_io.virtual_addr = (u8 *)pkt; + control->send_len = sizeof *pkt; + control->send_dma = ib_dma_map_single(ibdev, pkt, + control->send_len, + DMA_TO_DEVICE); + if (ib_dma_mapping_error(ibdev, control->send_dma)) { + 
CONTROL_ERROR("control send dma map error\n"); + goto failure; + } + + io = &control->send_io.io; + io->viport = viport; + io->routine = control_send_complete; + + control->send_io.list.addr = control->send_dma; + control->send_io.list.length = sizeof *pkt; + control->send_io.list.lkey = control->mr->lkey; + + io->swr.wr_id = (u64)io; + io->swr.sg_list = &control->send_io.list; + io->swr.num_sge = 1; + io->swr.opcode = IB_WR_SEND; + io->swr.send_flags = IB_SEND_SIGNALED; + io->type = SEND; + + return 0; +failure: + return -1; +} + +int control_init(struct control *control, struct viport *viport, + struct control_config *config, struct ib_pd *pd) +{ + struct vnic_control_packet *pkt; + unsigned int sz; + + CONTROL_FUNCTION("%s: control_init()\n", + control_ifcfg_name(control)); + control->parent = viport; + control->config = config; + control->ib_conn.viport = viport; + control->ib_conn.ib_config = &config->ib_config; + control->ib_conn.state = IB_CONN_UNINITTED; + control->ib_conn.callback_thread = NULL; + control->ib_conn.callback_thread_end = 0; + control->req_state = REQ_INACTIVE; + control->last_cmd = CMD_INVALID; + control->seq_num = 0; + control->response = NULL; + control->info = NULL; + INIT_LIST_HEAD(&control->failure_list); + spin_lock_init(&control->io_lock); + + if (vnic_ib_conn_init(&control->ib_conn, viport, pd, + &config->ib_config)) { + CONTROL_ERROR("Control IB connection" + " initialization failed\n"); + goto failure; + } + + control->mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(control->mr)) { + CONTROL_ERROR("%s: failed to register memory" + " for control connection\n", + control_ifcfg_name(control)); + goto destroy_conn; + } + + control->ib_conn.cm_id = ib_create_cm_id(viport->config->ibdev, + vnic_ib_cm_handler, + &control->ib_conn); + if (IS_ERR(control->ib_conn.cm_id)) { + CONTROL_ERROR("creating control CM ID failed\n"); + goto destroy_mr; + } + + sz = sizeof(struct recv_io) * config->num_recvs; + control->recv_ios = vmalloc(sz); + + if (!control->recv_ios) { + CONTROL_ERROR("%s: failed allocating space for recv ios\n", + control_ifcfg_name(control)); + goto destroy_cm_id; + } + + memset(control->recv_ios, 0, sz); + /*One send buffer and num_recvs recv buffers */ + control->local_storage = kzalloc(sizeof *pkt * + (config->num_recvs + 1), + GFP_KERNEL); + + if (!control->local_storage) { + CONTROL_ERROR("%s: failed allocating space" + " for local storage\n", + control_ifcfg_name(control)); + goto free_recv_ios; + } + + pkt = control->local_storage; + if (control_init_send_ios(control, viport, pkt)) + goto free_storage; + + pkt++; + if (control_init_recv_ios(control, viport, pkt)) + goto unmap_send; + + return 0; + +unmap_send: + ib_dma_unmap_single(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); +free_storage: + kfree(control->local_storage); +free_recv_ios: + vfree(control->recv_ios); +destroy_cm_id: + ib_destroy_cm_id(control->ib_conn.cm_id); +destroy_mr: + ib_dereg_mr(control->mr); +destroy_conn: + ib_destroy_qp(control->ib_conn.qp); + ib_destroy_cq(control->ib_conn.cq); +failure: + return -1; +} + +void control_cleanup(struct control *control) +{ + CONTROL_FUNCTION("%s: control_disconnect()\n", + control_ifcfg_name(control)); + + if (ib_send_cm_dreq(control->ib_conn.cm_id, NULL, 0)) + CONTROL_ERROR("control CM DREQ sending failed\n"); + + control->ib_conn.state = IB_CONN_DISCONNECTED; + control_timer_stop(control); + control->req_state = REQ_INACTIVE; + control->response = NULL; + control->last_cmd = 
CMD_INVALID; + completion_callback_cleanup(&control->ib_conn); + ib_destroy_cm_id(control->ib_conn.cm_id); + ib_destroy_qp(control->ib_conn.qp); + ib_destroy_cq(control->ib_conn.cq); + ib_dereg_mr(control->mr); + ib_dma_unmap_single(control->parent->config->ibdev, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + ib_dma_unmap_single(control->parent->config->ibdev, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + vfree(control->recv_ios); + kfree(control->local_storage); + +} + +static void control_log_report_status_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_REPORT_STATUS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO + " lan_switch_num = %u, is_fatal = %u\n", + pkt->cmd.report_status.lan_switch_num, + pkt->cmd.report_status.is_fatal); + printk(KERN_INFO + " status_number = %u, status_info = %u\n", + be32_to_cpu(pkt->cmd.report_status.status_number), + be32_to_cpu(pkt->cmd.report_status.status_info)); + pkt->cmd.report_status.file_name[31] = '\0'; + pkt->cmd.report_status.routine[31] = '\0'; + printk(KERN_INFO " filename = %s, routine = %s\n", + pkt->cmd.report_status.file_name, + pkt->cmd.report_status.routine); + printk(KERN_INFO + " line_num = %u, error_parameter = %u\n", + be32_to_cpu(pkt->cmd.report_status.line_num), + be32_to_cpu(pkt->cmd.report_status.error_parameter)); + pkt->cmd.report_status.desc_text[127] = '\0'; + printk(KERN_INFO " desc_text = %s\n", + pkt->cmd.report_status.desc_text); +} + +static void control_log_report_stats_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_REPORT_STATISTICS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " lan_switch_num = %u\n", + pkt->cmd.report_statistics_req.lan_switch_num); + if (pkt->hdr.pkt_type == TYPE_REQ) + return; + printk(KERN_INFO " if_in_broadcast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_broadcast_pkts)); + printk(" if_in_multicast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_multicast_pkts)); + printk(KERN_INFO " if_in_octets = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_octets)); + printk(" if_in_ucast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_ucast_pkts)); + printk(KERN_INFO " if_in_nucast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_nucast_pkts)); + printk(" if_in_underrun = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_underrun)); + printk(KERN_INFO " if_in_errors = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_errors)); + printk(" if_out_errors = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_errors)); + printk(KERN_INFO " if_out_octets = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_octets)); + printk(" if_out_ucast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_ucast_pkts)); + printk(KERN_INFO " if_out_multicast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_multicast_pkts)); + printk(" if_out_broadcast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_broadcast_pkts)); + printk(KERN_INFO " if_out_nucast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. 
+ if_out_nucast_pkts)); + printk(" if_out_ok = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp.if_out_ok)); + printk(KERN_INFO " if_in_ok = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp.if_in_ok)); + printk(" if_out_ucast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_ucast_bytes)); + printk(KERN_INFO " if_out_multicast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_multicast_bytes)); + printk(" if_out_broadcast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_broadcast_bytes)); + printk(KERN_INFO " if_in_ucast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_ucast_bytes)); + printk(" if_in_multicast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_multicast_bytes)); + printk(KERN_INFO " if_in_broadcast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_broadcast_bytes)); + printk(" ethernet_status = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + ethernet_status)); +} + +static void control_log_config_link_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_LINK\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " cmd_flags = %x\n", + pkt->cmd.config_link_req.cmd_flags); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_ENABLE_NIC) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_NIC\n"); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_DISABLE_NIC) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_NIC\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_" + "MCAST_ALL\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_DISABLE_MCAST_ALL) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_" + "MCAST_ALL\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_ENABLE_PROMISC) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_" + "PROMISC\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_DISABLE_PROMISC) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_" + "PROMISC\n"); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_SET_MTU) + printk(KERN_INFO + " VNIC_FLAG_SET_MTU\n"); + printk(KERN_INFO + " lan_switch_num = %x, mtu_size = %d\n", + pkt->cmd.config_link_req.lan_switch_num, + be16_to_cpu(pkt->cmd.config_link_req.mtu_size)); + if (pkt->hdr.pkt_type == TYPE_RSP) { + printk(KERN_INFO + " default_vlan = %u," + " hw_mac_address =" + " %02x:%02x:%02x:%02x:%02x:%02x\n", + be16_to_cpu(pkt->cmd.config_link_req. + default_vlan), + pkt->cmd.config_link_req.hw_mac_address[0], + pkt->cmd.config_link_req.hw_mac_address[1], + pkt->cmd.config_link_req.hw_mac_address[2], + pkt->cmd.config_link_req.hw_mac_address[3], + pkt->cmd.config_link_req.hw_mac_address[4], + pkt->cmd.config_link_req.hw_mac_address[5]); + } +} + +static void print_config_addr(struct vnic_address_op *list, + int num_address_ops, size_t mgidoff) +{ + int i = 0; + + while (i < num_address_ops && i < 16) { + printk(KERN_INFO " list_address_ops[%u].index" + " = %u\n", i, be16_to_cpu(list->index)); + switch (list->operation) { + case VNIC_OP_GET_ENTRY: + printk(KERN_INFO " list_address_ops[%u]." + "operation = VNIC_OP_GET_ENTRY\n", i); + break; + case VNIC_OP_SET_ENTRY: + printk(KERN_INFO " list_address_ops[%u]." + "operation = VNIC_OP_SET_ENTRY\n", i); + break; + default: + printk(KERN_INFO " list_address_ops[%u]." 
+ "operation = UNKNOWN(%d)\n", i, + list->operation); + break; + } + printk(KERN_INFO " list_address_ops[%u].valid" + " = %u\n", i, list->valid); + printk(KERN_INFO " list_address_ops[%u].address" + " = %02x:%02x:%02x:%02x:%02x:%02x\n", i, + list->address[0], list->address[1], + list->address[2], list->address[3], + list->address[4], list->address[5]); + printk(KERN_INFO " list_address_ops[%u].vlan" + " = %u\n", i, be16_to_cpu(list->vlan)); + if (mgidoff) { + printk(KERN_INFO + " list_address_ops[%u].mgid" + " = " VNIC_GID_FMT "\n", i, + VNIC_GID_RAW_ARG((char *)list + mgidoff)); + list = (struct vnic_address_op *) + ((char *)list + sizeof(struct vnic_address_op2)); + } else + list = (struct vnic_address_op *) + ((char *)list + sizeof(struct vnic_address_op)); + i++; + } +} + +static void control_log_config_addrs_pkt(struct vnic_control_packet *pkt, + u8 addresses2) +{ + struct vnic_address_op *list; + int no_address_ops; + + if (addresses2) + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_ADDRESSES2\n"); + else + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_ADDRESSES\n"); + printk(KERN_INFO " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, pkt->hdr.pkt_retry_count); + if (addresses2) { + printk(KERN_INFO " num_address_ops = %x," + " lan_switch_num = %d\n", + pkt->cmd.config_addresses_req2.num_address_ops, + pkt->cmd.config_addresses_req2.lan_switch_num); + list = (struct vnic_address_op *) + pkt->cmd.config_addresses_req2.list_address_ops; + no_address_ops = pkt->cmd.config_addresses_req2.num_address_ops; + print_config_addr(list, no_address_ops, + offsetof(struct vnic_address_op2, mgid)); + } else { + printk(KERN_INFO " num_address_ops = %x," + " lan_switch_num = %d\n", + pkt->cmd.config_addresses_req.num_address_ops, + pkt->cmd.config_addresses_req.lan_switch_num); + list = pkt->cmd.config_addresses_req.list_address_ops; + no_address_ops = pkt->cmd.config_addresses_req.num_address_ops; + print_config_addr(list, no_address_ops, 0); + } +} + +static void control_log_exch_pools_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_EXCHANGE_POOLS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " datapath = %u\n", + pkt->cmd.exchange_pools_req.data_path); + printk(KERN_INFO " pool_rkey = %08x" + " pool_addr = %llx\n", + be32_to_cpu(pkt->cmd.exchange_pools_req.pool_rkey), + be64_to_cpu(pkt->cmd.exchange_pools_req.pool_addr)); +} + +static void control_log_data_path_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_DATA_PATH\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " path_identifier = %llx," + " data_path = %u\n", + pkt->cmd.config_data_path_req.path_identifier, + pkt->cmd.config_data_path_req.data_path); + printk(KERN_INFO + "host config size_recv_pool_entry = %u," + " num_recv_pool_entries = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.size_recv_pool_entry), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.num_recv_pool_entries)); + printk(KERN_INFO + " timeout_before_kick = %u," + " num_recv_pool_entries_before_kick = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.timeout_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. 
+ num_recv_pool_entries_before_kick)); + printk(KERN_INFO + " num_recv_pool_bytes_before_kick = %u," + " free_recv_pool_entries_per_update = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. + num_recv_pool_bytes_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. + free_recv_pool_entries_per_update)); + printk(KERN_INFO + "eioc config size_recv_pool_entry = %u," + " num_recv_pool_entries = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.size_recv_pool_entry), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.num_recv_pool_entries)); + printk(KERN_INFO + " timeout_before_kick = %u," + " num_recv_pool_entries_before_kick = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.timeout_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + num_recv_pool_entries_before_kick)); + printk(KERN_INFO + " num_recv_pool_bytes_before_kick = %u," + " free_recv_pool_entries_per_update = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + num_recv_pool_bytes_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + free_recv_pool_entries_per_update)); +} + +static void control_log_init_vnic_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_INIT_VNIC\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO + " vnic_major_version = %u," + " vnic_minor_version = %u\n", + be16_to_cpu(pkt->cmd.init_vnic_req.vnic_major_version), + be16_to_cpu(pkt->cmd.init_vnic_req.vnic_minor_version)); + if (pkt->hdr.pkt_type == TYPE_REQ) { + printk(KERN_INFO + " vnic_instance = %u," + " num_data_paths = %u\n", + pkt->cmd.init_vnic_req.vnic_instance, + pkt->cmd.init_vnic_req.num_data_paths); + printk(KERN_INFO + " num_address_entries = %u\n", + be16_to_cpu(pkt->cmd.init_vnic_req. + num_address_entries)); + } else { + printk(KERN_INFO + " num_lan_switches = %u," + " num_data_paths = %u\n", + pkt->cmd.init_vnic_rsp.num_lan_switches, + pkt->cmd.init_vnic_rsp.num_data_paths); + printk(KERN_INFO + " num_address_entries = %u," + " features_supported = %08x\n", + be16_to_cpu(pkt->cmd.init_vnic_rsp. + num_address_entries), + be32_to_cpu(pkt->cmd.init_vnic_rsp. + features_supported)); + if (pkt->cmd.init_vnic_rsp.num_lan_switches != 0) { + printk(KERN_INFO + "lan_switch[0] lan_switch_num = %u," + " num_enet_ports = %08x\n", + pkt->cmd.init_vnic_rsp. + lan_switch[0].lan_switch_num, + pkt->cmd.init_vnic_rsp. + lan_switch[0].num_enet_ports); + printk(KERN_INFO + " default_vlan = %u," + " hw_mac_address =" + " %02x:%02x:%02x:%02x:%02x:%02x\n", + be16_to_cpu(pkt->cmd.init_vnic_rsp. + lan_switch[0].default_vlan), + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[0], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[1], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[2], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[3], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[4], + pkt->cmd.init_vnic_rsp.lan_switch[0]. 
+ hw_mac_address[5]); + } + } +} + +static void control_log_control_packet(struct vnic_control_packet *pkt) +{ + switch (pkt->hdr.pkt_type) { + case TYPE_INFO: + printk(KERN_INFO "control_packet: pkt_type = TYPE_INFO\n"); + break; + case TYPE_REQ: + printk(KERN_INFO "control_packet: pkt_type = TYPE_REQ\n"); + break; + case TYPE_RSP: + printk(KERN_INFO "control_packet: pkt_type = TYPE_RSP\n"); + break; + case TYPE_ERR: + printk(KERN_INFO "control_packet: pkt_type = TYPE_ERR\n"); + break; + default: + printk(KERN_INFO "control_packet: pkt_type = UNKNOWN\n"); + } + + switch (pkt->hdr.pkt_cmd) { + case CMD_INIT_VNIC: + control_log_init_vnic_pkt(pkt); + break; + case CMD_CONFIG_DATA_PATH: + control_log_data_path_pkt(pkt); + break; + case CMD_EXCHANGE_POOLS: + control_log_exch_pools_pkt(pkt); + break; + case CMD_CONFIG_ADDRESSES: + control_log_config_addrs_pkt(pkt, 0); + break; + case CMD_CONFIG_ADDRESSES2: + control_log_config_addrs_pkt(pkt, 1); + break; + case CMD_CONFIG_LINK: + control_log_config_link_pkt(pkt); + break; + case CMD_REPORT_STATISTICS: + control_log_report_stats_pkt(pkt); + break; + case CMD_CLEAR_STATISTICS: + printk(KERN_INFO + " pkt_cmd = CMD_CLEAR_STATISTICS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + case CMD_REPORT_STATUS: + control_log_report_status_pkt(pkt); + + break; + case CMD_RESET: + printk(KERN_INFO + " pkt_cmd = CMD_RESET\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + case CMD_HEARTBEAT: + printk(KERN_INFO + " pkt_cmd = CMD_HEARTBEAT\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " hb_interval = %d\n", + be32_to_cpu(pkt->cmd.heartbeat_req.hb_interval)); + break; + default: + printk(KERN_INFO + " pkt_cmd = UNKNOWN (%u)\n", + pkt->hdr.pkt_cmd); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + } +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_control.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.h new file mode 100644 index 0000000..57fab67 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_control.h @@ -0,0 +1,179 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_CONTROL_H_INCLUDED +#define VNIC_CONTROL_H_INCLUDED + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS +#include +#include +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ + +#include "vnic_ib.h" +#include "vnic_control_pkt.h" + +enum control_timer_state { + TIMER_IDLE = 0, + TIMER_ACTIVE = 1, + TIMER_EXPIRED = 2 +}; + +enum control_request_state { + REQ_INACTIVE, /* quiet state, all previous operations done + * response is NULL + * last_cmd = CMD_INVALID + * timer_state = IDLE + */ + REQ_POSTED, /* REQ put on send Q + * response is NULL + * last_cmd = command issued + * timer_state = ACTIVE + */ + REQ_SENT, /* Send completed for REQ + * response is NULL + * last_cmd = command issued + * timer_state = ACTIVE + */ + RSP_RECEIVED, /* Received Resp, but no Send completion yet + * response is response buffer received + * last_cmd = command issued + * timer_state = ACTIVE + */ + REQ_COMPLETED, /* all processing for REQ completed, ready to be gotten + * response is response buffer received + * last_cmd = command issued + * timer_state = ACTIVE + */ + REQ_FAILED, /* processing of REQ/RSP failed. + * response is NULL + * last_cmd = CMD_INVALID + * timer_state = IDLE or EXPIRED + * viport has been moved to error state to force + * recovery + */ +}; + +struct control { + struct viport *parent; + struct control_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + struct vnic_control_packet *local_storage; + int send_len; + int recv_len; + u16 maj_ver; + u16 min_ver; + struct vnic_lan_switch_attribs lan_switch; + struct send_io send_io; + struct recv_io *recv_ios; + dma_addr_t send_dma; + dma_addr_t recv_dma; + enum control_timer_state timer_state; + enum control_request_state req_state; + struct timer_list timer; + u8 seq_num; + u8 last_cmd; + struct recv_io *response; + struct recv_io *info; + struct list_head failure_list; + spinlock_t io_lock; + struct completion done; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + cycles_t request_time; /* intermediate value */ + cycles_t response_time; + u32 response_num; + cycles_t response_max; + cycles_t response_min; + u32 timeout_num; + } statistics; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +int control_init(struct control *control, struct viport *viport, + struct control_config *config, struct ib_pd *pd); + +void control_cleanup(struct control *control); + +void control_process_async(struct control *control); + +int control_init_vnic_req(struct control *control); +int control_init_vnic_rsp(struct control *control, u32 *features, + u8 *mac_address, u16 *num_addrs, u16 *vlan); + +int control_config_data_path_req(struct control *control, u64 path_id, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc); +int control_config_data_path_rsp(struct control *control, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc, + struct vnic_recv_pool_config *max_host, + struct vnic_recv_pool_config *max_eioc, + struct vnic_recv_pool_config *min_host, + struct vnic_recv_pool_config *min_eioc); + +int control_exchange_pools_req(struct control *control, + u64 addr, u32 rkey); +int control_exchange_pools_rsp(struct control *control, + u64 *addr, u32 *rkey); + +int control_config_link_req(struct control *control, 
+ u16 flags, u16 mtu); +int control_config_link_rsp(struct control *control, + u16 *flags, u16 *mtu); + +int control_config_addrs_req(struct control *control, + struct vnic_address_op2 *addrs, u16 num); +int control_config_addrs_rsp(struct control *control); + +int control_report_statistics_req(struct control *control); +int control_report_statistics_rsp(struct control *control, + struct vnic_cmd_report_stats_rsp *stats); + +int control_heartbeat_req(struct control *control, u32 hb_interval); +int control_heartbeat_rsp(struct control *control); + +int control_reset_req(struct control *control); +int control_reset_rsp(struct control *control); + +#define control_packet(io) \ + (struct vnic_control_packet *)(io)->virtual_addr +#define control_is_connected(control) \ + (vnic_ib_conn_connected(&((control)->ib_conn))) + +#define control_last_req(control) control_packet(&(control)->send_io) +#define control_features(control) (control)->features_supported + +#define control_get_mac_address(control,addr) \ + memcpy(addr, (control)->lan_switch.hw_mac_address, ETH_ALEN) + +#endif /* VNIC_CONTROL_H_INCLUDED */ diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h new file mode 100644 index 0000000..1fc62fb --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_control_pkt.h @@ -0,0 +1,368 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */
+
+#ifndef VNIC_CONTROL_PKT_H_INCLUDED
+#define VNIC_CONTROL_PKT_H_INCLUDED
+
+#include <linux/if_ether.h>
+#include <rdma/ib_verbs.h>
+
+#define VNIC_MAX_NODENAME_LEN 64
+
+struct vnic_connection_data {
+	u64 path_id;
+	u8 vnic_instance;
+	u8 path_num;
+	u8 nodename[VNIC_MAX_NODENAME_LEN + 1];
+	u8 reserved;	/* for alignment */
+	__be32 features_supported;
+};
+
+struct vnic_control_header {
+	u8 pkt_type;
+	u8 pkt_cmd;
+	u8 pkt_seq_num;
+	u8 pkt_retry_count;
+	u32 reserved;	/* for 64-bit alignment */
+};
+
+/* pkt_type values */
+enum {
+	TYPE_INFO = 0,
+	TYPE_REQ = 1,
+	TYPE_RSP = 2,
+	TYPE_ERR = 3
+};
+
+/* pkt_cmd values */
+enum {
+	CMD_INVALID = 0,
+	CMD_INIT_VNIC = 1,
+	CMD_CONFIG_DATA_PATH = 2,
+	CMD_EXCHANGE_POOLS = 3,
+	CMD_CONFIG_ADDRESSES = 4,
+	CMD_CONFIG_LINK = 5,
+	CMD_REPORT_STATISTICS = 6,
+	CMD_CLEAR_STATISTICS = 7,
+	CMD_REPORT_STATUS = 8,
+	CMD_RESET = 9,
+	CMD_HEARTBEAT = 10,
+	CMD_CONFIG_ADDRESSES2 = 11,
+};
+
+/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_REQ data format */
+struct vnic_cmd_init_vnic_req {
+	__be16 vnic_major_version;
+	__be16 vnic_minor_version;
+	u8 vnic_instance;
+	u8 num_data_paths;
+	__be16 num_address_entries;
+};
+
+/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_RSP subdata format */
+struct vnic_lan_switch_attribs {
+	u8 lan_switch_num;
+	u8 num_enet_ports;
+	__be16 default_vlan;
+	u8 hw_mac_address[ETH_ALEN];
+};
+
+/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_RSP data format */
+struct vnic_cmd_init_vnic_rsp {
+	__be16 vnic_major_version;
+	__be16 vnic_minor_version;
+	u8 num_lan_switches;
+	u8 num_data_paths;
+	__be16 num_address_entries;
+	__be32 features_supported;
+	struct vnic_lan_switch_attribs lan_switch[1];
+};
+
+/* features_supported values */
+enum {
+	VNIC_FEAT_IPV4_HEADERS = 0x0001,
+	VNIC_FEAT_IPV6_HEADERS = 0x0002,
+	VNIC_FEAT_IPV4_CSUM_RX = 0x0004,
+	VNIC_FEAT_IPV4_CSUM_TX = 0x0008,
+	VNIC_FEAT_TCP_CSUM_RX = 0x0010,
+	VNIC_FEAT_TCP_CSUM_TX = 0x0020,
+	VNIC_FEAT_UDP_CSUM_RX = 0x0040,
+	VNIC_FEAT_UDP_CSUM_TX = 0x0080,
+	VNIC_FEAT_TCP_SEGMENT = 0x0100,
+	VNIC_FEAT_IPV4_IPSEC_OFFLOAD = 0x0200,
+	VNIC_FEAT_IPV6_IPSEC_OFFLOAD = 0x0400,
+	VNIC_FEAT_FCS_PROPAGATE = 0x0800,
+	VNIC_FEAT_PF_KICK = 0x1000,
+	VNIC_FEAT_PF_FORCE_ROUTE = 0x2000,
+	VNIC_FEAT_CHASH_OFFLOAD = 0x4000,
+	/* host send with immediate data */
+	VNIC_FEAT_RDMA_IMMED = 0x8000,
+	/* host ignore inbound PF_VLAN_INSERT flag */
+	VNIC_FEAT_IGNORE_VLAN = 0x10000,
+	/* host supports IB multicast for inbound Ethernet mcast traffic */
+	VNIC_FEAT_INBOUND_IB_MC = 0x20000,
+};
+
+/* pkt_cmd CMD_CONFIG_DATA_PATH subdata format */
+struct vnic_recv_pool_config {
+	__be32 size_recv_pool_entry;
+	__be32 num_recv_pool_entries;
+	__be32 timeout_before_kick;
+	__be32 num_recv_pool_entries_before_kick;
+	__be32 num_recv_pool_bytes_before_kick;
+	__be32 free_recv_pool_entries_per_update;
+};
+
+/* pkt_cmd CMD_CONFIG_DATA_PATH data format */
+struct vnic_cmd_config_data_path {
+	u64 path_identifier;
+	u8 data_path;
+	u8 reserved[3];
+	struct vnic_recv_pool_config host_recv_pool_config;
+	struct vnic_recv_pool_config eioc_recv_pool_config;
+};
+
+/* pkt_cmd CMD_EXCHANGE_POOLS data format */
+struct vnic_cmd_exchange_pools {
+	u8 data_path;
+	u8 reserved[3];
+	__be32 pool_rkey;
+	__be64 pool_addr;
+};
+
+/* pkt_cmd CMD_CONFIG_ADDRESSES subdata format */
+struct vnic_address_op {
+	__be16 index;
+	u8 operation;
+	u8 valid;
+	u8 address[6];
+	__be16 vlan;
+};
+
+/* pkt_cmd CMD_CONFIG_ADDRESSES2 subdata format */
+struct vnic_address_op2 {
+	__be16 index;
+	u8 operation;
+	u8 valid;
+	u8 address[6];
+	__be16 vlan;
+	u32 reserved;	/* for
alignment */ + union ib_gid mgid; /* valid in rsp only if both ends support mcast */ +}; + +/* operation values */ +enum { + VNIC_OP_SET_ENTRY = 0x01, + VNIC_OP_GET_ENTRY = 0x02 +}; + +/* pkt_cmd CMD_CONFIG_ADDRESSES data format */ +struct vnic_cmd_config_addresses { + u8 num_address_ops; + u8 lan_switch_num; + struct vnic_address_op list_address_ops[1]; +}; + +/* pkt_cmd CMD_CONFIG_ADDRESSES2 data format */ +struct vnic_cmd_config_addresses2 { + u8 num_address_ops; + u8 lan_switch_num; + u8 reserved1; + u8 reserved2; + u8 reserved3; + struct vnic_address_op2 list_address_ops[1]; +}; + +/* CMD_CONFIG_LINK data format */ +struct vnic_cmd_config_link { + u8 cmd_flags; + u8 lan_switch_num; + __be16 mtu_size; + __be16 default_vlan; + u8 hw_mac_address[6]; + u32 reserved; /* for alignment */ + /* valid in rsp only if both ends support mcast */ + union ib_gid allmulti_mgid; +}; + +/* cmd_flags values */ +enum { + VNIC_FLAG_ENABLE_NIC = 0x01, + VNIC_FLAG_DISABLE_NIC = 0x02, + VNIC_FLAG_ENABLE_MCAST_ALL = 0x04, + VNIC_FLAG_DISABLE_MCAST_ALL = 0x08, + VNIC_FLAG_ENABLE_PROMISC = 0x10, + VNIC_FLAG_DISABLE_PROMISC = 0x20, + VNIC_FLAG_SET_MTU = 0x40 +}; + +/* pkt_cmd CMD_REPORT_STATISTICS, pkt_type TYPE_REQ data format */ +struct vnic_cmd_report_stats_req { + u8 lan_switch_num; +}; + +/* pkt_cmd CMD_REPORT_STATISTICS, pkt_type TYPE_RSP data format */ +struct vnic_cmd_report_stats_rsp { + u8 lan_switch_num; + u8 reserved[7]; /* for 64-bit alignment */ + __be64 if_in_broadcast_pkts; + __be64 if_in_multicast_pkts; + __be64 if_in_octets; + __be64 if_in_ucast_pkts; + __be64 if_in_nucast_pkts; /* if_in_broadcast_pkts + + if_in_multicast_pkts */ + __be64 if_in_underrun; /* (OID_GEN_RCV_NO_BUFFER) */ + __be64 if_in_errors; /* (OID_GEN_RCV_ERROR) */ + __be64 if_out_errors; /* (OID_GEN_XMIT_ERROR) */ + __be64 if_out_octets; + __be64 if_out_ucast_pkts; + __be64 if_out_multicast_pkts; + __be64 if_out_broadcast_pkts; + __be64 if_out_nucast_pkts; /* if_out_broadcast_pkts + + if_out_multicast_pkts */ + __be64 if_out_ok; /* if_out_nucast_pkts + + if_out_ucast_pkts(OID_GEN_XMIT_OK) */ + __be64 if_in_ok; /* if_in_nucast_pkts + + if_in_ucast_pkts(OID_GEN_RCV_OK) */ + __be64 if_out_ucast_bytes; /* (OID_GEN_DIRECTED_BYTES_XMT) */ + __be64 if_out_multicast_bytes; /* (OID_GEN_MULTICAST_BYTES_XMT) */ + __be64 if_out_broadcast_bytes; /* (OID_GEN_BROADCAST_BYTES_XMT) */ + __be64 if_in_ucast_bytes; /* (OID_GEN_DIRECTED_BYTES_RCV) */ + __be64 if_in_multicast_bytes; /* (OID_GEN_MULTICAST_BYTES_RCV) */ + __be64 if_in_broadcast_bytes; /* (OID_GEN_BROADCAST_BYTES_RCV) */ + __be64 ethernet_status; /* OID_GEN_MEDIA_CONNECT_STATUS) */ +}; + +/* pkt_cmd CMD_CLEAR_STATISTICS data format */ +struct vnic_cmd_clear_statistics { + u8 lan_switch_num; +}; + +/* pkt_cmd CMD_REPORT_STATUS data format */ +struct vnic_cmd_report_status { + u8 lan_switch_num; + u8 is_fatal; + u8 reserved[2]; /* for 32-bit alignment */ + __be32 status_number; + __be32 status_info; + u8 file_name[32]; + u8 routine[32]; + __be32 line_num; + __be32 error_parameter; + u8 desc_text[128]; +}; + +/* pkt_cmd CMD_HEARTBEAT data format */ +struct vnic_cmd_heartbeat { + __be32 hb_interval; +}; + +enum { + VNIC_STATUS_LINK_UP = 1, + VNIC_STATUS_LINK_DOWN = 2, + VNIC_STATUS_ENET_AGGREGATION_CHANGE = 3, + VNIC_STATUS_EIOC_SHUTDOWN = 4, + VNIC_STATUS_CONTROL_ERROR = 5, + VNIC_STATUS_EIOC_ERROR = 6 +}; + +#define VNIC_MAX_CONTROLPKTSZ 256 +#define VNIC_MAX_CONTROLDATASZ \ + (VNIC_MAX_CONTROLPKTSZ - sizeof(struct vnic_control_header)) + +struct vnic_control_packet { + struct 
vnic_control_header hdr;
+	union {
+		struct vnic_cmd_init_vnic_req init_vnic_req;
+		struct vnic_cmd_init_vnic_rsp init_vnic_rsp;
+		struct vnic_cmd_config_data_path config_data_path_req;
+		struct vnic_cmd_config_data_path config_data_path_rsp;
+		struct vnic_cmd_exchange_pools exchange_pools_req;
+		struct vnic_cmd_exchange_pools exchange_pools_rsp;
+		struct vnic_cmd_config_addresses config_addresses_req;
+		struct vnic_cmd_config_addresses2 config_addresses_req2;
+		struct vnic_cmd_config_addresses config_addresses_rsp;
+		struct vnic_cmd_config_addresses2 config_addresses_rsp2;
+		struct vnic_cmd_config_link config_link_req;
+		struct vnic_cmd_config_link config_link_rsp;
+		struct vnic_cmd_report_stats_req report_statistics_req;
+		struct vnic_cmd_report_stats_rsp report_statistics_rsp;
+		struct vnic_cmd_clear_statistics clear_statistics_req;
+		struct vnic_cmd_clear_statistics clear_statistics_rsp;
+		struct vnic_cmd_report_status report_status;
+		struct vnic_cmd_heartbeat heartbeat_req;
+		struct vnic_cmd_heartbeat heartbeat_rsp;
+
+		char cmd_data[VNIC_MAX_CONTROLDATASZ];
+	} cmd;
+};
+
+union ib_gid_cpu {
+	u8 raw[16];
+	struct {
+		u64 subnet_prefix;
+		u64 interface_id;
+	} global;
+};
+
+static inline void bswap_ib_gid(union ib_gid *mgid1, union ib_gid_cpu *mgid2)
+{
+	/* swap hi & low */
+	__be64 low = mgid1->global.subnet_prefix;
+	mgid2->global.subnet_prefix = be64_to_cpu(mgid1->global.interface_id);
+	mgid2->global.interface_id = be64_to_cpu(low);
+}
+
+#define VNIC_GID_FMT "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x"
+
+#define VNIC_GID_RAW_ARG(gid) be16_to_cpu(*(__be16 *)&(gid)[0]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[2]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[4]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[6]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[8]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[10]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[12]), \
+			      be16_to_cpu(*(__be16 *)&(gid)[14])
+
+/* These defines are used to figure out how many address entries can be passed
+ * in a config_addresses request.
+ */
+#define MAX_CONFIG_ADDR_ENTRIES \
+	((VNIC_MAX_CONTROLDATASZ - (sizeof(struct vnic_cmd_config_addresses) \
+	- sizeof(struct vnic_address_op)))/sizeof(struct vnic_address_op))
+#define MAX_CONFIG_ADDR_ENTRIES2 \
+	((VNIC_MAX_CONTROLDATASZ - (sizeof(struct vnic_cmd_config_addresses2) \
+	- sizeof(struct vnic_address_op2)))/sizeof(struct vnic_address_op2))
+
+#endif /* VNIC_CONTROL_PKT_H_INCLUDED */

From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:56:24 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 29 May 2008 15:26:24 +0530
Subject: [ofa-general] [PATCH v3 05/13] QLogic VNIC: Implementation of Data path of communication protocol
In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain>
References: <20080529095126.9943.84692.stgit@localhost.localdomain>
Message-ID: <20080529095624.9943.8469.stgit@localhost.localdomain>

From: Ramachandra K

This patch implements the actual data transfer part of the communication
protocol with the EVIC/VEx. RDMA of Ethernet packets is implemented here.
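(Aside for reviewers: throughout the data path, each transmit slot holds the
copied frame data padded up to VIPORT_TRAILER_ALIGNMENT, followed by a
struct viport_trailer, which is the arithmetic data_init_pool_work_reqs()
below steps through. Here is a tiny standalone sketch of that size math.
ALIGN_UP, slot_bytes, VIPORT_TRAILER_SIZE and the 32/64 constants are
made-up placeholders for this sketch only; the real alignment and trailer
layout come from vnic_trailer.h in this patch.)

    #include <stdio.h>
    #include <stddef.h>

    #define VIPORT_TRAILER_ALIGNMENT 32   /* assumed placeholder */
    #define VIPORT_TRAILER_SIZE      64   /* assumed sizeof(struct viport_trailer) */

    /* Round x up to a multiple of a (a power of two), like the kernel ALIGN() */
    #define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

    /* Mirrors the per-buffer walk: data area padded to alignment, then trailer */
    static size_t slot_bytes(size_t copy_buf)
    {
            return ALIGN_UP(copy_buf, VIPORT_TRAILER_ALIGNMENT) +
                   VIPORT_TRAILER_SIZE;
    }

    int main(void)
    {
            size_t min_xmt_skb = 60;     /* module parameter default */
            size_t num_xmit_bufs = 8;    /* notify_bundle * 2, e.g. bundle of 4 */

            printf("bytes per xmit slot: %zu\n", slot_bytes(min_xmt_skb));
            printf("xmit_data region:    %zu\n",
                   slot_bytes(min_xmt_skb) * num_xmit_bufs);
            return 0;
    }

With the default min_xmt_skb of 60 and a 32-byte alignment, each copy slot
rounds to 96 bytes of data area plus the trailer; frames larger than
min_xmt_skb are, per the parameter description, not copied and presumably go
out from the skb buffer itself.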
Signed-off-by: Ramachandra K
Signed-off-by: Poornima Kamath
Signed-off-by: Amar Mudrankit
---

 drivers/infiniband/ulp/qlgc_vnic/vnic_data.c    | 1492 +++++++++++++++++++++++
 drivers/infiniband/ulp/qlgc_vnic/vnic_data.h    |  206 +++
 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h |  103 ++
 3 files changed, 1801 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.c
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_data.h
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h

diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_data.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.c
new file mode 100644
index 0000000..b81fcde
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.c
@@ -0,0 +1,1492 @@
+/*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include
+#include
+#include
+#include
+
+#include "vnic_util.h"
+#include "vnic_viport.h"
+#include "vnic_main.h"
+#include "vnic_data.h"
+#include "vnic_trailer.h"
+#include "vnic_stats.h"
+
+static void data_received_kick(struct io *io);
+static void data_xmit_complete(struct io *io);
+
+static void mc_data_recv_routine(struct io *io);
+static void mc_data_post_recvs(struct mc_data *mc_data);
+static void mc_data_recv_to_skbuff(struct viport *viport, struct sk_buff *skb,
+				   struct viport_trailer *trailer);
+
+static u32 min_rcv_skb = 60;
+module_param(min_rcv_skb, int, 0444);
+MODULE_PARM_DESC(min_rcv_skb, "Packets of size (in bytes) less than"
+		 " or equal to this value will be copied during receive."
+		 " Default 60");
+
+static u32 min_xmt_skb = 60;
+module_param(min_xmt_skb, int, 0444);
+MODULE_PARM_DESC(min_xmt_skb, "Packets of size (in bytes) less than"
+		 " or equal to this value will be copied during transmit."
+		 " Default 60");
+
+int data_init(struct data *data, struct viport *viport,
+	      struct data_config *config, struct ib_pd *pd)
+{
+	DATA_FUNCTION("data_init()\n");
+
+	data->parent = viport;
+	data->config = config;
+	data->ib_conn.viport = viport;
+	data->ib_conn.ib_config = &config->ib_config;
+	data->ib_conn.state = IB_CONN_UNINITTED;
+	data->ib_conn.callback_thread = NULL;
+	data->ib_conn.callback_thread_end = 0;
+
+	if ((min_xmt_skb < 60) || (min_xmt_skb > 9000)) {
+		DATA_ERROR("min_xmt_skb (%d) must be between 60 and 9000\n",
+			   min_xmt_skb);
+		goto failure;
+	}
+	if (vnic_ib_conn_init(&data->ib_conn, viport, pd,
+			      &config->ib_config)) {
+		DATA_ERROR("Data IB connection initialization failed\n");
+		goto failure;
+	}
+	data->mr = ib_get_dma_mr(pd,
+				 IB_ACCESS_LOCAL_WRITE |
+				 IB_ACCESS_REMOTE_READ |
+				 IB_ACCESS_REMOTE_WRITE);
+	if (IS_ERR(data->mr)) {
+		DATA_ERROR("failed to register memory for"
+			   " data connection\n");
+		goto destroy_conn;
+	}
+
+	data->ib_conn.cm_id = ib_create_cm_id(viport->config->ibdev,
+					      vnic_ib_cm_handler,
+					      &data->ib_conn);
+
+	if (IS_ERR(data->ib_conn.cm_id)) {
+		DATA_ERROR("creating data CM ID failed\n");
+		goto dereg_mr;
+	}
+
+	return 0;
+
+dereg_mr:
+	ib_dereg_mr(data->mr);
+destroy_conn:
+	completion_callback_cleanup(&data->ib_conn);
+	ib_destroy_qp(data->ib_conn.qp);
+	ib_destroy_cq(data->ib_conn.cq);
+failure:
+	return -1;
+}
+
+static void data_post_recvs(struct data *data)
+{
+	unsigned long flags;
+	int i = 0;
+
+	DATA_FUNCTION("data_post_recvs()\n");
+	spin_lock_irqsave(&data->recv_ios_lock, flags);
+	while (!list_empty(&data->recv_ios)) {
+		struct io *io = list_entry(data->recv_ios.next,
+					   struct io, list_ptrs);
+		struct recv_io *recv_io = (struct recv_io *)io;
+
+		list_del(&recv_io->io.list_ptrs);
+		spin_unlock_irqrestore(&data->recv_ios_lock, flags);
+		if (vnic_ib_post_recv(&data->ib_conn, &recv_io->io)) {
+			viport_failure(data->parent);
+			return;
+		}
+		i++;
+		spin_lock_irqsave(&data->recv_ios_lock, flags);
+	}
+	spin_unlock_irqrestore(&data->recv_ios_lock, flags);
+	DATA_INFO("data posted %d %p\n", i, &data->recv_ios);
+}
+
+static void data_init_pool_work_reqs(struct data *data,
+				     struct recv_io *recv_io)
+{
+	struct recv_pool *recv_pool = &data->recv_pool;
+	struct xmit_pool *xmit_pool = &data->xmit_pool;
+	struct rdma_io *rdma_io;
+	struct rdma_dest *rdma_dest;
+	dma_addr_t xmit_dma;
+	u8 *xmit_data;
+	unsigned int i;
+
+	INIT_LIST_HEAD(&data->recv_ios);
+	spin_lock_init(&data->recv_ios_lock);
+	spin_lock_init(&data->xmit_buf_lock);
+	for (i = 0; i < data->config->num_recvs; i++) {
+		recv_io[i].io.viport = data->parent;
+		recv_io[i].io.routine = data_received_kick;
+		recv_io[i].list.addr = data->region_data_dma;
+		recv_io[i].list.length = 4;
+		recv_io[i].list.lkey = data->mr->lkey;
+
+		recv_io[i].io.rwr.wr_id = (u64)&recv_io[i].io;
+		recv_io[i].io.rwr.sg_list = &recv_io[i].list;
+		recv_io[i].io.rwr.num_sge = 1;
+
+		list_add(&recv_io[i].io.list_ptrs, &data->recv_ios);
+	}
+
+	INIT_LIST_HEAD(&recv_pool->avail_recv_bufs);
+	for (i = 0; i < recv_pool->pool_sz; i++) {
+		rdma_dest = &recv_pool->recv_bufs[i];
+		list_add(&rdma_dest->list_ptrs,
+			 &recv_pool->avail_recv_bufs);
+	}
+
+	xmit_dma = xmit_pool->xmitdata_dma;
+	xmit_data = xmit_pool->xmit_data;
+
+	for (i = 0; i < xmit_pool->num_xmit_bufs; i++) {
+		rdma_io = &xmit_pool->xmit_bufs[i];
+		rdma_io->index = i;
+		rdma_io->io.viport = data->parent;
+		rdma_io->io.routine = data_xmit_complete;
+
+		rdma_io->list[0].lkey = data->mr->lkey;
+		rdma_io->list[1].lkey = data->mr->lkey;
+		rdma_io->io.swr.wr_id =
(u64)rdma_io; + rdma_io->io.swr.sg_list = rdma_io->list; + rdma_io->io.swr.num_sge = 2; + rdma_io->io.swr.opcode = IB_WR_RDMA_WRITE; + rdma_io->io.swr.send_flags = IB_SEND_SIGNALED; + rdma_io->io.type = RDMA; + + rdma_io->data = xmit_data; + rdma_io->data_dma = xmit_dma; + + xmit_data += ALIGN(min_xmt_skb, VIPORT_TRAILER_ALIGNMENT); + xmit_dma += ALIGN(min_xmt_skb, VIPORT_TRAILER_ALIGNMENT); + rdma_io->trailer = (struct viport_trailer *)xmit_data; + rdma_io->trailer_dma = xmit_dma; + xmit_data += sizeof(struct viport_trailer); + xmit_dma += sizeof(struct viport_trailer); + } + + xmit_pool->rdma_rkey = data->mr->rkey; + xmit_pool->rdma_addr = xmit_pool->buf_pool_dma; +} + +static void data_init_free_bufs_swrs(struct data *data) +{ + struct rdma_io *rdma_io; + struct send_io *send_io; + + rdma_io = &data->free_bufs_io; + rdma_io->io.viport = data->parent; + rdma_io->io.routine = NULL; + + rdma_io->list[0].lkey = data->mr->lkey; + + rdma_io->io.swr.wr_id = (u64)rdma_io; + rdma_io->io.swr.sg_list = rdma_io->list; + rdma_io->io.swr.num_sge = 1; + rdma_io->io.swr.opcode = IB_WR_RDMA_WRITE; + rdma_io->io.swr.send_flags = IB_SEND_SIGNALED; + rdma_io->io.type = RDMA; + + send_io = &data->kick_io; + send_io->io.viport = data->parent; + send_io->io.routine = NULL; + + send_io->list.addr = data->region_data_dma; + send_io->list.length = 0; + send_io->list.lkey = data->mr->lkey; + + send_io->io.swr.wr_id = (u64)send_io; + send_io->io.swr.sg_list = &send_io->list; + send_io->io.swr.num_sge = 1; + send_io->io.swr.opcode = IB_WR_SEND; + send_io->io.swr.send_flags = IB_SEND_SIGNALED; + send_io->io.type = SEND; +} + +static int data_init_buf_pools(struct data *data) +{ + struct recv_pool *recv_pool = &data->recv_pool; + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct viport *viport = data->parent; + + recv_pool->buf_pool_len = + sizeof(struct buff_pool_entry) * recv_pool->eioc_pool_sz; + + recv_pool->buf_pool = kzalloc(recv_pool->buf_pool_len, GFP_KERNEL); + + if (!recv_pool->buf_pool) { + DATA_ERROR("failed allocating %d bytes" + " for recv pool bufpool\n", + recv_pool->buf_pool_len); + goto failure; + } + + recv_pool->buf_pool_dma = + ib_dma_map_single(viport->config->ibdev, + recv_pool->buf_pool, recv_pool->buf_pool_len, + DMA_TO_DEVICE); + + if (ib_dma_mapping_error(viport->config->ibdev, recv_pool->buf_pool_dma)) { + DATA_ERROR("xmit buf_pool dma map error\n"); + goto free_recv_pool; + } + + xmit_pool->buf_pool_len = + sizeof(struct buff_pool_entry) * xmit_pool->pool_sz; + xmit_pool->buf_pool = kzalloc(xmit_pool->buf_pool_len, GFP_KERNEL); + + if (!xmit_pool->buf_pool) { + DATA_ERROR("failed allocating %d bytes" + " for xmit pool bufpool\n", + xmit_pool->buf_pool_len); + goto unmap_recv_pool; + } + + xmit_pool->buf_pool_dma = + ib_dma_map_single(viport->config->ibdev, + xmit_pool->buf_pool, xmit_pool->buf_pool_len, + DMA_FROM_DEVICE); + + if (ib_dma_mapping_error(viport->config->ibdev, xmit_pool->buf_pool_dma)) { + DATA_ERROR("xmit buf_pool dma map error\n"); + goto free_xmit_pool; + } + + xmit_pool->xmit_data = kzalloc(xmit_pool->xmitdata_len, GFP_KERNEL); + + if (!xmit_pool->xmit_data) { + DATA_ERROR("failed allocating %d bytes for xmit data\n", + xmit_pool->xmitdata_len); + goto unmap_xmit_pool; + } + + xmit_pool->xmitdata_dma = + ib_dma_map_single(viport->config->ibdev, + xmit_pool->xmit_data, xmit_pool->xmitdata_len, + DMA_TO_DEVICE); + + if (ib_dma_mapping_error(viport->config->ibdev, xmit_pool->xmitdata_dma)) { + DATA_ERROR("xmit data dma map error\n"); + goto free_xmit_data; + } + + 
return 0; + +free_xmit_data: + kfree(xmit_pool->xmit_data); +unmap_xmit_pool: + ib_dma_unmap_single(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_FROM_DEVICE); +free_xmit_pool: + kfree(xmit_pool->buf_pool); +unmap_recv_pool: + ib_dma_unmap_single(data->parent->config->ibdev, + recv_pool->buf_pool_dma, + recv_pool->buf_pool_len, DMA_TO_DEVICE); +free_recv_pool: + kfree(recv_pool->buf_pool); +failure: + return -1; +} + +static void data_init_xmit_pool(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + + xmit_pool->pool_sz = + be32_to_cpu(data->eioc_pool_parms.num_recv_pool_entries); + xmit_pool->buffer_sz = + be32_to_cpu(data->eioc_pool_parms.size_recv_pool_entry); + + xmit_pool->notify_count = 0; + xmit_pool->notify_bundle = data->config->notify_bundle; + xmit_pool->next_xmit_pool = 0; + xmit_pool->num_xmit_bufs = xmit_pool->notify_bundle * 2; + xmit_pool->next_xmit_buf = 0; + xmit_pool->last_comp_buf = xmit_pool->num_xmit_bufs - 1; + /* This assumes that data_init_recv_pool has been called + * before. + */ + data->max_mtu = MAX_PAYLOAD(min((data)->recv_pool.buffer_sz, + (data)->xmit_pool.buffer_sz)) - VLAN_ETH_HLEN; + + xmit_pool->kick_count = 0; + xmit_pool->kick_byte_count = 0; + + xmit_pool->send_kicks = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_entries_before_kick) + || be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_bytes_before_kick); + xmit_pool->kick_bundle = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_entries_before_kick); + xmit_pool->kick_byte_bundle = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_bytes_before_kick); + + xmit_pool->need_buffers = 1; + + xmit_pool->xmitdata_len = + BUFFER_SIZE(min_xmt_skb) * xmit_pool->num_xmit_bufs; +} + +static void data_init_recv_pool(struct data *data) +{ + struct recv_pool *recv_pool = &data->recv_pool; + + recv_pool->pool_sz = data->config->host_recv_pool_entries; + recv_pool->eioc_pool_sz = + be32_to_cpu(data->host_pool_parms.num_recv_pool_entries); + if (recv_pool->pool_sz > recv_pool->eioc_pool_sz) + recv_pool->pool_sz = + be32_to_cpu(data->host_pool_parms.num_recv_pool_entries); + + recv_pool->buffer_sz = + be32_to_cpu(data->host_pool_parms.size_recv_pool_entry); + + recv_pool->sz_free_bundle = + be32_to_cpu(data-> + host_pool_parms.free_recv_pool_entries_per_update); + recv_pool->num_free_bufs = 0; + recv_pool->num_posted_bufs = 0; + + recv_pool->next_full_buf = 0; + recv_pool->next_free_buf = 0; + recv_pool->kick_on_free = 0; +} + +int data_connect(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct recv_pool *recv_pool = &data->recv_pool; + struct recv_io *recv_io; + unsigned int sz; + struct viport *viport = data->parent; + + DATA_FUNCTION("data_connect()\n"); + + /* Do not interchange the order of the functions + * called below as this will affect the MAX MTU + * calculation + */ + + data_init_recv_pool(data); + data_init_xmit_pool(data); + + sz = sizeof(struct rdma_dest) * recv_pool->pool_sz + + sizeof(struct recv_io) * data->config->num_recvs + + sizeof(struct rdma_io) * xmit_pool->num_xmit_bufs; + + data->local_storage = vmalloc(sz); + + if (!data->local_storage) { + DATA_ERROR("failed allocating %d bytes" + " local storage\n", sz); + goto out; + } + + memset(data->local_storage, 0, sz); + + recv_pool->recv_bufs = (struct rdma_dest *)data->local_storage; + sz = sizeof(struct rdma_dest) * recv_pool->pool_sz; + + recv_io = (struct recv_io *)(data->local_storage + sz); + sz += sizeof(struct recv_io) * 
data->config->num_recvs; + + xmit_pool->xmit_bufs = (struct rdma_io *)(data->local_storage + sz); + data->region_data = kzalloc(4, GFP_KERNEL); + + if (!data->region_data) { + DATA_ERROR("failed to alloc memory for region data\n"); + goto free_local_storage; + } + + data->region_data_dma = + ib_dma_map_single(viport->config->ibdev, + data->region_data, 4, DMA_BIDIRECTIONAL); + + if (ib_dma_mapping_error(viport->config->ibdev, data->region_data_dma)) { + DATA_ERROR("region data dma map error\n"); + goto free_region_data; + } + + if (data_init_buf_pools(data)) + goto unmap_region_data; + + data_init_free_bufs_swrs(data); + data_init_pool_work_reqs(data, recv_io); + + data_post_recvs(data); + + if (vnic_ib_cm_connect(&data->ib_conn)) + goto unmap_region_data; + + return 0; + +unmap_region_data: + ib_dma_unmap_single(data->parent->config->ibdev, + data->region_data_dma, 4, DMA_BIDIRECTIONAL); +free_region_data: + kfree(data->region_data); +free_local_storage: + vfree(data->local_storage); +out: + return -1; +} + +static void data_add_free_buffer(struct data *data, int index, + struct rdma_dest *rdma_dest) +{ + struct recv_pool *pool = &data->recv_pool; + struct buff_pool_entry *bpe; + dma_addr_t vaddr_dma; + + DATA_FUNCTION("data_add_free_buffer()\n"); + rdma_dest->trailer->connection_hash_and_valid = 0; + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + bpe = &pool->buf_pool[index]; + bpe->rkey = cpu_to_be32(data->mr->rkey); + vaddr_dma = ib_dma_map_single(data->parent->config->ibdev, + rdma_dest->data, pool->buffer_sz, + DMA_FROM_DEVICE); + if (ib_dma_mapping_error(data->parent->config->ibdev, vaddr_dma)) { + DATA_ERROR("rdma_dest->data dma map error\n"); + goto failure; + } + bpe->remote_addr = cpu_to_be64(vaddr_dma); + bpe->valid = (u32) (rdma_dest - &pool->recv_bufs[0]) + 1; + ++pool->num_free_bufs; +failure: + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); +} + +/* NOTE: this routine is not reentrant */ +static void data_alloc_buffers(struct data *data, int initial_allocation) +{ + struct recv_pool *pool = &data->recv_pool; + struct rdma_dest *rdma_dest; + struct sk_buff *skb; + int index; + + DATA_FUNCTION("data_alloc_buffers()\n"); + index = ADD(pool->next_free_buf, pool->num_free_bufs, + pool->eioc_pool_sz); + + while (!list_empty(&pool->avail_recv_bufs)) { + rdma_dest = + list_entry(pool->avail_recv_bufs.next, + struct rdma_dest, list_ptrs); + if (!rdma_dest->skb) { + if (initial_allocation) + skb = alloc_skb(pool->buffer_sz + 2, + GFP_KERNEL); + else + skb = dev_alloc_skb(pool->buffer_sz + 2); + if (!skb) + break; + skb_reserve(skb, 2); + skb_put(skb, pool->buffer_sz); + rdma_dest->skb = skb; + rdma_dest->data = skb->data; + rdma_dest->trailer = + (struct viport_trailer *)(rdma_dest->data + + pool->buffer_sz - + sizeof(struct + viport_trailer)); + } + rdma_dest->trailer->connection_hash_and_valid = 0; + + list_del_init(&rdma_dest->list_ptrs); + + data_add_free_buffer(data, index, rdma_dest); + index = NEXT(index, pool->eioc_pool_sz); + } +} + +static void data_send_kick_message(struct data *data) +{ + struct xmit_pool *pool = &data->xmit_pool; + DATA_FUNCTION("data_send_kick_message()\n"); + /* stop timer for bundle_timeout */ + if (data->kick_timer_on) { + del_timer(&data->kick_timer); + data->kick_timer_on = 0; + } + pool->kick_count = 0; + pool->kick_byte_count = 0; + + /* TODO: keep track of when kick is outstanding, and + * don't reuse 
until complete + */ + if (vnic_ib_post_send(&data->ib_conn, &data->free_bufs_io.io)) { + DATA_ERROR("failed to post send\n"); + viport_failure(data->parent); + } +} + +static void data_send_free_recv_buffers(struct data *data) +{ + struct recv_pool *pool = &data->recv_pool; + struct ib_send_wr *swr = &data->free_bufs_io.io.swr; + + int bufs_sent = 0; + u64 rdma_addr; + u32 offset; + u32 sz; + unsigned int num_to_send, next_increment; + + DATA_FUNCTION("data_send_free_recv_buffers()\n"); + + for (num_to_send = pool->sz_free_bundle; + num_to_send <= pool->num_free_bufs; + num_to_send += pool->sz_free_bundle) { + /* handle multiple bundles as one when possible. */ + next_increment = num_to_send + pool->sz_free_bundle; + if ((next_increment <= pool->num_free_bufs) + && (pool->next_free_buf + next_increment <= + pool->eioc_pool_sz)) + continue; + + offset = pool->next_free_buf * + sizeof(struct buff_pool_entry); + sz = num_to_send * sizeof(struct buff_pool_entry); + rdma_addr = pool->eioc_rdma_addr + offset; + swr->sg_list->length = sz; + swr->sg_list->addr = pool->buf_pool_dma + offset; + swr->wr.rdma.remote_addr = rdma_addr; + + if (vnic_ib_post_send(&data->ib_conn, + &data->free_bufs_io.io)) { + DATA_ERROR("failed to post send\n"); + viport_failure(data->parent); + return; + } + INC(pool->next_free_buf, num_to_send, pool->eioc_pool_sz); + pool->num_free_bufs -= num_to_send; + pool->num_posted_bufs += num_to_send; + bufs_sent = 1; + } + + if (bufs_sent) { + if (pool->kick_on_free) + data_send_kick_message(data); + } + if (pool->num_posted_bufs == 0) { + struct vnic *vnic = data->parent->vnic; + unsigned long flags; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (vnic->current_path == &vnic->primary_path) { + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + DATA_ERROR("%s: primary path: " + "unable to allocate receive buffers\n", + vnic->config->name); + } else { + if (vnic->current_path == &vnic->secondary_path) { + spin_unlock_irqrestore(&vnic->current_path_lock, + flags); + DATA_ERROR("%s: secondary path: " + "unable to allocate receive buffers\n", + vnic->config->name); + } else + spin_unlock_irqrestore(&vnic->current_path_lock, + flags); + } + data->ib_conn.state = IB_CONN_ERRORED; + viport_failure(data->parent); + } +} + +void data_connected(struct data *data) +{ + DATA_FUNCTION("data_connected()\n"); + data->free_bufs_io.io.swr.wr.rdma.rkey = + data->recv_pool.eioc_rdma_rkey; + data_alloc_buffers(data, 1); + data_send_free_recv_buffers(data); + data->connected = 1; +} + +void data_disconnect(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct recv_pool *recv_pool = &data->recv_pool; + unsigned int i; + + DATA_FUNCTION("data_disconnect()\n"); + + data->connected = 0; + if (data->kick_timer_on) { + del_timer_sync(&data->kick_timer); + data->kick_timer_on = 0; + } + + if (ib_send_cm_dreq(data->ib_conn.cm_id, NULL, 0)) + DATA_ERROR("data CM DREQ sending failed\n"); + data->ib_conn.state = IB_CONN_DISCONNECTED; + + completion_callback_cleanup(&data->ib_conn); + + for (i = 0; i < xmit_pool->num_xmit_bufs; i++) { + if (xmit_pool->xmit_bufs[i].skb) + dev_kfree_skb(xmit_pool->xmit_bufs[i].skb); + xmit_pool->xmit_bufs[i].skb = NULL; + + } + for (i = 0; i < recv_pool->pool_sz; i++) { + if (data->recv_pool.recv_bufs[i].skb) + dev_kfree_skb(recv_pool->recv_bufs[i].skb); + recv_pool->recv_bufs[i].skb = NULL; + } + vfree(data->local_storage); + if (data->region_data) { + ib_dma_unmap_single(data->parent->config->ibdev, + data->region_data_dma, 4, + 
DMA_BIDIRECTIONAL); + kfree(data->region_data); + } + + if (recv_pool->buf_pool) { + ib_dma_unmap_single(data->parent->config->ibdev, + recv_pool->buf_pool_dma, + recv_pool->buf_pool_len, DMA_TO_DEVICE); + kfree(recv_pool->buf_pool); + } + + if (xmit_pool->buf_pool) { + ib_dma_unmap_single(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_FROM_DEVICE); + kfree(xmit_pool->buf_pool); + } + + if (xmit_pool->xmit_data) { + ib_dma_unmap_single(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + kfree(xmit_pool->xmit_data); + } +} + +void data_cleanup(struct data *data) +{ + ib_destroy_cm_id(data->ib_conn.cm_id); + + /* Completion callback cleanup called again. + * This is to cleanup the threads in case there is an + * error before state LINK_DATACONNECT due to which + * data_disconnect is not called. + */ + completion_callback_cleanup(&data->ib_conn); + ib_destroy_qp(data->ib_conn.qp); + ib_destroy_cq(data->ib_conn.cq); + ib_dereg_mr(data->mr); + +} + +static int data_alloc_xmit_buffer(struct data *data, struct sk_buff *skb, + struct buff_pool_entry **pp_bpe, + struct rdma_io **pp_rdma_io, + int *last) +{ + struct xmit_pool *pool = &data->xmit_pool; + unsigned long flags; + int ret; + + DATA_FUNCTION("data_alloc_xmit_buffer()\n"); + + spin_lock_irqsave(&data->xmit_buf_lock, flags); + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + *last = 0; + *pp_rdma_io = &pool->xmit_bufs[pool->next_xmit_buf]; + *pp_bpe = &pool->buf_pool[pool->next_xmit_pool]; + + if ((*pp_bpe)->valid && pool->next_xmit_buf != + pool->last_comp_buf) { + INC(pool->next_xmit_buf, 1, pool->num_xmit_bufs); + INC(pool->next_xmit_pool, 1, pool->pool_sz); + if (!pool->buf_pool[pool->next_xmit_pool].valid) { + DATA_INFO("just used the last EIOU" + " receive buffer\n"); + *last = 1; + pool->need_buffers = 1; + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + data_kickreq_stats(data); + } else if (pool->next_xmit_buf == pool->last_comp_buf) { + DATA_INFO("just used our last xmit buffer\n"); + pool->need_buffers = 1; + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + } + (*pp_rdma_io)->skb = skb; + (*pp_bpe)->valid = 0; + ret = 0; + } else { + data_no_xmitbuf_stats(data); + DATA_ERROR("Out of xmit buffers\n"); + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + ret = -1; + } + + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, + pool->buf_pool_len, DMA_TO_DEVICE); + spin_unlock_irqrestore(&data->xmit_buf_lock, flags); + return ret; +} + +static void data_rdma_packet(struct data *data, struct buff_pool_entry *bpe, + struct rdma_io *rdma_io) +{ + struct ib_send_wr *swr; + struct sk_buff *skb; + dma_addr_t trailer_data_dma; + dma_addr_t skb_data_dma; + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct viport *viport = data->parent; + u8 *d; + int len; + int fill_len; + + DATA_FUNCTION("data_rdma_packet()\n"); + swr = &rdma_io->io.swr; + skb = rdma_io->skb; + len = ALIGN(rdma_io->len, VIPORT_TRAILER_ALIGNMENT); + fill_len = len - skb->len; + + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + + d = (u8 *) rdma_io->trailer - fill_len; + trailer_data_dma = rdma_io->trailer_dma - fill_len; + memset(d, 0, fill_len); + + swr->sg_list[0].length = skb->len; + if (skb->len <= min_xmt_skb) { + memcpy(rdma_io->data, skb->data, skb->len); + 
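+		/* The packet was copied into the pre-mapped xmit_data
+		 * region (packets of at most min_xmt_skb bytes take this
+		 * path), so the gather entry points at that copy and the
+		 * skb can be freed at once; larger packets are DMA-mapped
+		 * in place in the else branch.
+		 */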
swr->sg_list[0].lkey = data->mr->lkey; + swr->sg_list[0].addr = rdma_io->data_dma; + dev_kfree_skb_any(skb); + rdma_io->skb = NULL; + } else { + swr->sg_list[0].lkey = data->mr->lkey; + + skb_data_dma = ib_dma_map_single(viport->config->ibdev, + skb->data, skb->len, + DMA_TO_DEVICE); + + if (ib_dma_mapping_error(viport->config->ibdev, skb_data_dma)) { + DATA_ERROR("skb data dma map error\n"); + goto failure; + } + + rdma_io->skb_data_dma = skb_data_dma; + + swr->sg_list[0].addr = skb_data_dma; + skb_orphan(skb); + } + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_TO_DEVICE); + + swr->sg_list[1].addr = trailer_data_dma; + swr->sg_list[1].length = fill_len + sizeof(struct viport_trailer); + swr->sg_list[0].lkey = data->mr->lkey; + swr->wr.rdma.remote_addr = be64_to_cpu(bpe->remote_addr); + swr->wr.rdma.remote_addr += data->xmit_pool.buffer_sz; + swr->wr.rdma.remote_addr -= (sizeof(struct viport_trailer) + len); + swr->wr.rdma.rkey = be32_to_cpu(bpe->rkey); + + ib_dma_sync_single_for_device(data->parent->config->ibdev, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_TO_DEVICE); + + /* If VNIC_FEAT_RDMA_IMMED is supported then change the work request + * opcode to IB_WR_RDMA_WRITE_WITH_IMM + */ + + if (data->parent->features_supported & VNIC_FEAT_RDMA_IMMED) { + swr->ex.imm_data = 0; + swr->opcode = IB_WR_RDMA_WRITE_WITH_IMM; + } + + data->xmit_pool.notify_count++; + if (data->xmit_pool.notify_count >= data->xmit_pool.notify_bundle) { + data->xmit_pool.notify_count = 0; + swr->send_flags = IB_SEND_SIGNALED; + } else { + swr->send_flags = 0; + } + ib_dma_sync_single_for_device(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + if (vnic_ib_post_send(&data->ib_conn, &rdma_io->io)) { + DATA_ERROR("failed to post send for data RDMA write\n"); + viport_failure(data->parent); + goto failure; + } + + data_xmits_stats(data); +failure: + ib_dma_sync_single_for_device(data->parent->config->ibdev, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); +} + +static void data_kick_timeout_handler(unsigned long arg) +{ + struct data *data = (struct data *)arg; + + DATA_FUNCTION("data_kick_timeout_handler()\n"); + data->kick_timer_on = 0; + data_send_kick_message(data); +} + +int data_xmit_packet(struct data *data, struct sk_buff *skb) +{ + struct xmit_pool *pool = &data->xmit_pool; + struct rdma_io *rdma_io; + struct buff_pool_entry *bpe; + struct viport_trailer *trailer; + unsigned int sz = skb->len; + int last; + + DATA_FUNCTION("data_xmit_packet()\n"); + if (sz > pool->buffer_sz) { + DATA_ERROR("outbound packet too large, size = %d\n", sz); + return -1; + } + + if (data_alloc_xmit_buffer(data, skb, &bpe, &rdma_io, &last)) { + DATA_ERROR("error in allocating data xmit buffer\n"); + return -1; + } + + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->xmitdata_dma, pool->xmitdata_len, + DMA_TO_DEVICE); + trailer = rdma_io->trailer; + + memset(trailer, 0, sizeof *trailer); + memcpy(trailer->dest_mac_addr, skb->data, ETH_ALEN); + + if (skb->sk) + trailer->connection_hash_and_valid = 0x40 | + ((be16_to_cpu(inet_sk(skb->sk)->sport) + + be16_to_cpu(inet_sk(skb->sk)->dport)) & 0x3f); + + trailer->connection_hash_and_valid |= CHV_VALID; + + if ((sz > 16) && (*(__be16 *) (skb->data + 12) == + __constant_cpu_to_be16(ETH_P_8021Q))) { + trailer->vlan = *(__be16 *) (skb->data + 14); + memmove(skb->data + 4, skb->data, 12); + skb_pull(skb, 4); + sz -= 4; + 
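+		/* The 802.1Q tag has been saved in trailer->vlan and the
+		 * MAC header shifted over it, shortening the frame by four
+		 * bytes; PF_VLAN_INSERT tells the receiver to re-insert
+		 * the tag.
+		 */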
trailer->pkt_flags |= PF_VLAN_INSERT; + } + if (last) + trailer->pkt_flags |= PF_KICK; + if (sz < ETH_ZLEN) { + /* EIOU requires all packets to be + * of ethernet minimum packet size. + */ + trailer->data_length = __constant_cpu_to_be16(ETH_ZLEN); + rdma_io->len = ETH_ZLEN; + } else { + trailer->data_length = cpu_to_be16(sz); + rdma_io->len = sz; + } + + if (skb->ip_summed == CHECKSUM_PARTIAL) { + trailer->tx_chksum_flags = TX_CHKSUM_FLAGS_CHECKSUM_V4 + | TX_CHKSUM_FLAGS_IP_CHECKSUM + | TX_CHKSUM_FLAGS_TCP_CHECKSUM + | TX_CHKSUM_FLAGS_UDP_CHECKSUM; + } + + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->xmitdata_dma, pool->xmitdata_len, + DMA_TO_DEVICE); + data_rdma_packet(data, bpe, rdma_io); + + if (pool->send_kicks) { + /* EIOC needs kicks to inform it of sent packets */ + pool->kick_count++; + pool->kick_byte_count += sz; + if ((pool->kick_count >= pool->kick_bundle) + || (pool->kick_byte_count >= pool->kick_byte_bundle)) { + data_send_kick_message(data); + } else if (pool->kick_count == 1) { + init_timer(&data->kick_timer); + /* timeout_before_kick is in usec */ + data->kick_timer.expires = + msecs_to_jiffies(be32_to_cpu(data-> + eioc_pool_parms.timeout_before_kick) * 1000) + + jiffies; + data->kick_timer.data = (unsigned long)data; + data->kick_timer.function = data_kick_timeout_handler; + add_timer(&data->kick_timer); + data->kick_timer_on = 1; + } + } + return 0; +} + +static void data_check_xmit_buffers(struct data *data) +{ + struct xmit_pool *pool = &data->xmit_pool; + unsigned long flags; + + DATA_FUNCTION("data_check_xmit_buffers()\n"); + spin_lock_irqsave(&data->xmit_buf_lock, flags); + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + if (data->xmit_pool.need_buffers + && pool->buf_pool[pool->next_xmit_pool].valid + && pool->next_xmit_buf != pool->last_comp_buf) { + data->xmit_pool.need_buffers = 0; + vnic_restart_xmit(data->parent->vnic, + data->parent->parent); + DATA_INFO("there are free xmit buffers\n"); + } + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + spin_unlock_irqrestore(&data->xmit_buf_lock, flags); +} + +static struct sk_buff *data_recv_to_skbuff(struct data *data, + struct rdma_dest *rdma_dest) +{ + struct viport_trailer *trailer; + struct sk_buff *skb = NULL; + int start; + unsigned int len; + u8 rx_chksum_flags; + + DATA_FUNCTION("data_recv_to_skbuff()\n"); + trailer = rdma_dest->trailer; + start = data_offset(data, trailer); + len = data_len(data, trailer); + + if (len <= min_rcv_skb) + skb = dev_alloc_skb(len + VLAN_HLEN + 2); + /* leave room for VLAN header and alignment */ + if (skb) { + skb_reserve(skb, VLAN_HLEN + 2); + memcpy(skb->data, rdma_dest->data + start, len); + skb_put(skb, len); + } else { + skb = rdma_dest->skb; + rdma_dest->skb = NULL; + rdma_dest->trailer = NULL; + rdma_dest->data = NULL; + skb_pull(skb, start); + skb_trim(skb, len); + } + + rx_chksum_flags = trailer->rx_chksum_flags; + DATA_INFO("rx_chksum_flags = %d, LOOP = %c, IP = %c," + " TCP = %c, UDP = %c\n", + rx_chksum_flags, + (rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) ? 'Y' : 'N', + (rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_FAILED) ? 'N' : + '-', + (rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_FAILED) ? 
'N' : + '-', + (rx_chksum_flags & RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_UDP_CHECKSUM_FAILED) ? 'N' : + '-'); + + if ((rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) + || ((rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) + && ((rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) + || (rx_chksum_flags & + RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED)))) + skb->ip_summed = CHECKSUM_UNNECESSARY; + else + skb->ip_summed = CHECKSUM_NONE; + + if ((trailer->pkt_flags & PF_VLAN_INSERT) && + !(data->parent->features_supported & VNIC_FEAT_IGNORE_VLAN)) { + u8 *rv; + + rv = skb_push(skb, 4); + memmove(rv, rv + 4, 12); + *(__be16 *) (rv + 12) = __constant_cpu_to_be16(ETH_P_8021Q); + if (trailer->pkt_flags & PF_PVID_OVERRIDDEN) + *(__be16 *) (rv + 14) = trailer->vlan & + __constant_cpu_to_be16(0xF000); + else + *(__be16 *) (rv + 14) = trailer->vlan; + } + + return skb; +} + +static int data_incoming_recv(struct data *data) +{ + struct recv_pool *pool = &data->recv_pool; + struct rdma_dest *rdma_dest; + struct viport_trailer *trailer; + struct buff_pool_entry *bpe; + struct sk_buff *skb; + dma_addr_t vaddr_dma; + + DATA_FUNCTION("data_incoming_recv()\n"); + if (pool->next_full_buf == pool->next_free_buf) + return -1; + bpe = &pool->buf_pool[pool->next_full_buf]; + vaddr_dma = be64_to_cpu(bpe->remote_addr); + rdma_dest = &pool->recv_bufs[bpe->valid - 1]; + trailer = rdma_dest->trailer; + + if (!trailer + || !(trailer->connection_hash_and_valid & CHV_VALID)) + return -1; + + /* received a packet */ + if (trailer->pkt_flags & PF_KICK) + pool->kick_on_free = 1; + + skb = data_recv_to_skbuff(data, rdma_dest); + + if (skb) { + vnic_recv_packet(data->parent->vnic, + data->parent->parent, skb); + list_add(&rdma_dest->list_ptrs, &pool->avail_recv_bufs); + } + + ib_dma_unmap_single(data->parent->config->ibdev, + vaddr_dma, pool->buffer_sz, + DMA_FROM_DEVICE); + ib_dma_sync_single_for_cpu(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + bpe->valid = 0; + ib_dma_sync_single_for_device(data->parent->config->ibdev, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + INC(pool->next_full_buf, 1, pool->eioc_pool_sz); + pool->num_posted_bufs--; + data_recvs_stats(data); + return 0; +} + +static void data_received_kick(struct io *io) +{ + struct data *data = &io->viport->data; + unsigned long flags; + + DATA_FUNCTION("data_received_kick()\n"); + data_note_kickrcv_time(); + spin_lock_irqsave(&data->recv_ios_lock, flags); + list_add(&io->list_ptrs, &data->recv_ios); + spin_unlock_irqrestore(&data->recv_ios_lock, flags); + data_post_recvs(data); + data_rcvkicks_stats(data); + data_check_xmit_buffers(data); + + while (!data_incoming_recv(data)); + + if (data->connected) { + data_alloc_buffers(data, 0); + data_send_free_recv_buffers(data); + } +} + +static void data_xmit_complete(struct io *io) +{ + struct rdma_io *rdma_io = (struct rdma_io *)io; + struct data *data = &io->viport->data; + struct xmit_pool *pool = &data->xmit_pool; + struct sk_buff *skb; + + DATA_FUNCTION("data_xmit_complete()\n"); + + if (rdma_io->skb) + ib_dma_unmap_single(data->parent->config->ibdev, + rdma_io->skb_data_dma, rdma_io->skb->len, + DMA_TO_DEVICE); + + while (pool->last_comp_buf != rdma_io->index) { + INC(pool->last_comp_buf, 1, pool->num_xmit_bufs); + skb = pool->xmit_bufs[pool->last_comp_buf].skb; + if (skb) + dev_kfree_skb_any(skb); + pool->xmit_bufs[pool->last_comp_buf].skb = NULL; + } + + data_check_xmit_buffers(data); +} + +static 
int mc_data_alloc_skb(struct ud_recv_io *recv_io, u32 len, + int initial_allocation) +{ + struct sk_buff *skb; + struct mc_data *mc_data = &recv_io->io.viport->mc_data; + + DATA_FUNCTION("mc_data_alloc_skb\n"); + if (initial_allocation) + skb = alloc_skb(len, GFP_KERNEL); + else + skb = alloc_skb(len, GFP_ATOMIC); + if (!skb) { + DATA_ERROR("failed to alloc MULTICAST skb\n"); + return -1; + } + skb_put(skb, len); + recv_io->skb = skb; + + recv_io->skb_data_dma = ib_dma_map_single( + recv_io->io.viport->config->ibdev, + skb->data, skb->len, + DMA_FROM_DEVICE); + + if (ib_dma_mapping_error(recv_io->io.viport->config->ibdev, + recv_io->skb_data_dma)) { + DATA_ERROR("skb data dma map error\n"); + dev_kfree_skb(skb); + return -1; + } + + recv_io->list[0].addr = recv_io->skb_data_dma; + recv_io->list[0].length = sizeof(struct ib_grh); + recv_io->list[0].lkey = mc_data->mr->lkey; + + recv_io->list[1].addr = recv_io->skb_data_dma + sizeof(struct ib_grh); + recv_io->list[1].length = len - sizeof(struct ib_grh); + recv_io->list[1].lkey = mc_data->mr->lkey; + + recv_io->io.rwr.wr_id = (u64)&recv_io->io; + recv_io->io.rwr.sg_list = recv_io->list; + recv_io->io.rwr.num_sge = 2; + recv_io->io.rwr.next = NULL; + + return 0; +} + +static int mc_data_alloc_buffers(struct mc_data *mc_data) +{ + unsigned int i, num; + struct ud_recv_io *bufs = NULL, *recv_io; + + DATA_FUNCTION("mc_data_alloc_buffers\n"); + if (!mc_data->skb_len) { + unsigned int len; + /* align multicast msg buffer on viport_trailer boundary */ + len = (MCAST_MSG_SIZE + VIPORT_TRAILER_ALIGNMENT - 1) & + (~((unsigned int)VIPORT_TRAILER_ALIGNMENT - 1)); + /* + * Add size of grh and trailer - + * note, we don't need a + 4 for vlan because we have room in + * netbuf for grh & trailer and we'll strip them both, so there + * will be room enough to handle the 4 byte insertion for vlan. 
+ */ + len += sizeof(struct ib_grh) + + sizeof(struct viport_trailer); + mc_data->skb_len = len; + DATA_INFO("mc_data->skb_len %d (sizes:%d %d)\n", + len, (int)sizeof(struct ib_grh), + (int)sizeof(struct viport_trailer)); + } + mc_data->recv_len = sizeof(struct ud_recv_io) * mc_data->num_recvs; + bufs = kmalloc(mc_data->recv_len, GFP_KERNEL); + if (!bufs) { + DATA_ERROR("failed to allocate MULTICAST buffers size:%d\n", + mc_data->recv_len); + return -1; + } + DATA_INFO("allocated num_recvs:%d recv_len:%d \n", + mc_data->num_recvs, mc_data->recv_len); + for (num = 0; num < mc_data->num_recvs; num++) { + recv_io = &bufs[num]; + recv_io->len = mc_data->skb_len; + recv_io->io.type = RECV_UD; + recv_io->io.viport = mc_data->parent; + recv_io->io.routine = mc_data_recv_routine; + + if (mc_data_alloc_skb(recv_io, mc_data->skb_len, 1)) { + for (i = 0; i < num; i++) { + recv_io = &bufs[i]; + ib_dma_unmap_single(recv_io->io.viport->config->ibdev, + recv_io->skb_data_dma, + recv_io->skb->len, + DMA_FROM_DEVICE); + dev_kfree_skb(recv_io->skb); + } + kfree(bufs); + return -1; + } + list_add_tail(&recv_io->io.list_ptrs, + &mc_data->avail_recv_ios_list); + } + mc_data->recv_ios = bufs; + return 0; +} + +void vnic_mc_data_cleanup(struct mc_data *mc_data) +{ + unsigned int num; + + DATA_FUNCTION("vnic_mc_data_cleanup()\n"); + completion_callback_cleanup(&mc_data->ib_conn); + if (!IS_ERR(mc_data->ib_conn.qp)) { + ib_destroy_qp(mc_data->ib_conn.qp); + mc_data->ib_conn.qp = (struct ib_qp *)ERR_PTR(-EINVAL); + } + if (!IS_ERR(mc_data->ib_conn.cq)) { + ib_destroy_cq(mc_data->ib_conn.cq); + mc_data->ib_conn.cq = (struct ib_cq *)ERR_PTR(-EINVAL); + } + if (mc_data->recv_ios) { + for (num = 0; num < mc_data->num_recvs; num++) { + if (mc_data->recv_ios[num].skb) + dev_kfree_skb(mc_data->recv_ios[num].skb); + mc_data->recv_ios[num].skb = NULL; + } + kfree(mc_data->recv_ios); + mc_data->recv_ios = (struct ud_recv_io *)NULL; + } + if (mc_data->mr) { + ib_dereg_mr(mc_data->mr); + mc_data->mr = (struct ib_mr *)NULL; + } + DATA_FUNCTION("vnic_mc_data_cleanup done\n"); + +} + +int mc_data_init(struct mc_data *mc_data, struct viport *viport, + struct data_config *config, struct ib_pd *pd) +{ + DATA_FUNCTION("mc_data_init()\n"); + + mc_data->num_recvs = viport->data.config->num_recvs; + + INIT_LIST_HEAD(&mc_data->avail_recv_ios_list); + spin_lock_init(&mc_data->recv_lock); + + mc_data->parent = viport; + mc_data->config = config; + + mc_data->ib_conn.cm_id = NULL; + mc_data->ib_conn.viport = viport; + mc_data->ib_conn.ib_config = &config->ib_config; + mc_data->ib_conn.state = IB_CONN_UNINITTED; + mc_data->ib_conn.callback_thread = NULL; + mc_data->ib_conn.callback_thread_end = 0; + + if (vnic_ib_mc_init(mc_data, viport, pd, + &config->ib_config)) { + DATA_ERROR("vnic_ib_mc_init failed\n"); + goto failure; + } + mc_data->mr = ib_get_dma_mr(pd, + IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_WRITE); + if (IS_ERR(mc_data->mr)) { + DATA_ERROR("failed to register memory for" + " mc_data connection\n"); + goto destroy_conn; + } + + if (mc_data_alloc_buffers(mc_data)) + goto dereg_mr; + + mc_data_post_recvs(mc_data); + if (vnic_ib_mc_mod_qp_to_rts(mc_data->ib_conn.qp)) + goto dereg_mr; + + return 0; + +dereg_mr: + ib_dereg_mr(mc_data->mr); + mc_data->mr = (struct ib_mr *)NULL; +destroy_conn: + completion_callback_cleanup(&mc_data->ib_conn); + ib_destroy_qp(mc_data->ib_conn.qp); + mc_data->ib_conn.qp = (struct ib_qp *)ERR_PTR(-EINVAL); + ib_destroy_cq(mc_data->ib_conn.cq); + mc_data->ib_conn.cq = (struct ib_cq *)ERR_PTR(-EINVAL); 
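+	/* QP and CQ handles are reset to ERR_PTR(-EINVAL) so that
+	 * vnic_mc_data_cleanup() can tell which resources remain valid.
+	 */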
+failure: + return -1; +} + +static void mc_data_post_recvs(struct mc_data *mc_data) +{ + unsigned long flags; + int i = 0; + DATA_FUNCTION("mc_data_post_recvs\n"); + spin_lock_irqsave(&mc_data->recv_lock, flags); + while (!list_empty(&mc_data->avail_recv_ios_list)) { + struct io *io = list_entry(mc_data->avail_recv_ios_list.next, + struct io, list_ptrs); + struct ud_recv_io *recv_io = + container_of(io, struct ud_recv_io, io); + list_del(&recv_io->io.list_ptrs); + spin_unlock_irqrestore(&mc_data->recv_lock, flags); + if (vnic_ib_mc_post_recv(mc_data, &recv_io->io)) { + viport_failure(mc_data->parent); + return; + } + spin_lock_irqsave(&mc_data->recv_lock, flags); + i++; + } + DATA_INFO("mcdata posted %d %p\n", i, &mc_data->avail_recv_ios_list); + spin_unlock_irqrestore(&mc_data->recv_lock, flags); +} + +static void mc_data_recv_routine(struct io *io) +{ + struct sk_buff *skb; + struct ib_grh *grh; + struct viport_trailer *trailer; + struct mc_data *mc_data; + unsigned long flags; + struct ud_recv_io *recv_io = container_of(io, struct ud_recv_io, io); + union ib_gid_cpu sgid; + + DATA_FUNCTION("mc_data_recv_routine\n"); + skb = recv_io->skb; + grh = (struct ib_grh *)skb->data; + mc_data = &recv_io->io.viport->mc_data; + + ib_dma_unmap_single(recv_io->io.viport->config->ibdev, + recv_io->skb_data_dma, recv_io->skb->len, + DMA_FROM_DEVICE); + + /* first - check if we've got our own mc packet */ + /* convert sgid from host to cpu form before comparing */ + bswap_ib_gid(&grh->sgid, &sgid); + if (cpu_to_be64(sgid.global.interface_id) == + io->viport->config->path_info.path.sgid.global.interface_id) { + DATA_ERROR("dropping - our mc packet\n"); + dev_kfree_skb(skb); + } else { + /* GRH is at head and trailer at end. Remove GRH from head. */ + trailer = (struct viport_trailer *) + (skb->data + recv_io->len - + sizeof(struct viport_trailer)); + skb_pull(skb, sizeof(struct ib_grh)); + if (trailer->connection_hash_and_valid & CHV_VALID) { + mc_data_recv_to_skbuff(io->viport, skb, trailer); + vnic_recv_packet(io->viport->vnic, io->viport->parent, + skb); + vnic_multicast_recv_pkt_stats(io->viport->vnic); + } else { + DATA_ERROR("dropping - no CHV_VALID in HashAndValid\n"); + dev_kfree_skb(skb); + } + } + recv_io->skb = NULL; + if (mc_data_alloc_skb(recv_io, mc_data->skb_len, 0)) + return; + + spin_lock_irqsave(&mc_data->recv_lock, flags); + list_add_tail(&recv_io->io.list_ptrs, &mc_data->avail_recv_ios_list); + spin_unlock_irqrestore(&mc_data->recv_lock, flags); + mc_data_post_recvs(mc_data); + return; +} + +static void mc_data_recv_to_skbuff(struct viport *viport, struct sk_buff *skb, + struct viport_trailer *trailer) +{ + u8 rx_chksum_flags = trailer->rx_chksum_flags; + + /* drop alignment bytes at start */ + skb_pull(skb, trailer->data_alignment_offset); + /* drop excess from end */ + skb_trim(skb, __be16_to_cpu(trailer->data_length)); + + if ((rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) + || ((rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) + && ((rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) + || (rx_chksum_flags & + RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED)))) + skb->ip_summed = CHECKSUM_UNNECESSARY; + else + skb->ip_summed = CHECKSUM_NONE; + + if ((trailer->pkt_flags & PF_VLAN_INSERT) && + !(viport->features_supported & VNIC_FEAT_IGNORE_VLAN)) { + u8 *rv; + + /* insert VLAN id between source & length */ + DATA_INFO("VLAN adjustment\n"); + rv = skb_push(skb, 4); + memmove(rv, rv + 4, 12); + *(__be16 *) (rv + 12) = __constant_cpu_to_be16(ETH_P_8021Q); + if (trailer->pkt_flags 
& PF_PVID_OVERRIDDEN) + /* + * Indicates VLAN is 0 but we keep the protocol id. + */ + *(__be16 *) (rv + 14) = trailer->vlan & + __constant_cpu_to_be16(0xF000); + else + *(__be16 *) (rv + 14) = trailer->vlan; + DATA_INFO("vlan:%x\n", *(int *)(rv+14)); + } + + return; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_data.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.h new file mode 100644 index 0000000..866b9ee --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_data.h @@ -0,0 +1,206 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_DATA_H_INCLUDED +#define VNIC_DATA_H_INCLUDED + +#include + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS +#include +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ + +#include "vnic_ib.h" +#include "vnic_control_pkt.h" +#include "vnic_trailer.h" + +struct rdma_dest { + struct list_head list_ptrs; + struct sk_buff *skb; + u8 *data; + struct viport_trailer *trailer __attribute__((aligned(32))); +}; + +struct buff_pool_entry { + __be64 remote_addr; + __be32 rkey; + u32 valid; +}; + +struct recv_pool { + u32 buffer_sz; + u32 pool_sz; + u32 eioc_pool_sz; + u32 eioc_rdma_rkey; + u64 eioc_rdma_addr; + u32 next_full_buf; + u32 next_free_buf; + u32 num_free_bufs; + u32 num_posted_bufs; + u32 sz_free_bundle; + int kick_on_free; + struct buff_pool_entry *buf_pool; + dma_addr_t buf_pool_dma; + int buf_pool_len; + struct rdma_dest *recv_bufs; + struct list_head avail_recv_bufs; +}; + +struct xmit_pool { + u32 buffer_sz; + u32 pool_sz; + u32 notify_count; + u32 notify_bundle; + u32 next_xmit_buf; + u32 last_comp_buf; + u32 num_xmit_bufs; + u32 next_xmit_pool; + u32 kick_count; + u32 kick_byte_count; + u32 kick_bundle; + u32 kick_byte_bundle; + int need_buffers; + int send_kicks; + uint32_t rdma_rkey; + u64 rdma_addr; + struct buff_pool_entry *buf_pool; + dma_addr_t buf_pool_dma; + int buf_pool_len; + struct rdma_io *xmit_bufs; + u8 *xmit_data; + dma_addr_t xmitdata_dma; + int xmitdata_len; +}; + +struct data { + struct viport *parent; + struct data_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + u8 *local_storage; + struct vnic_recv_pool_config host_pool_parms; + struct vnic_recv_pool_config eioc_pool_parms; + struct recv_pool recv_pool; + struct xmit_pool xmit_pool; + u8 *region_data; + dma_addr_t region_data_dma; + struct rdma_io free_bufs_io; + struct send_io kick_io; + struct list_head recv_ios; + spinlock_t recv_ios_lock; + spinlock_t xmit_buf_lock; + int kick_timer_on; + int connected; + u16 max_mtu; + struct timer_list kick_timer; + struct completion done; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + u32 xmit_num; + u32 recv_num; + u32 free_buf_sends; + u32 free_buf_num; + u32 free_buf_min; + u32 kick_recvs; + u32 kick_reqs; + u32 no_xmit_bufs; + cycles_t no_xmit_buf_time; + } statistics; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +struct mc_data { + struct viport *parent; + struct data_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + + u32 num_recvs; + u32 skb_len; + spinlock_t recv_lock; + int recv_len; + struct ud_recv_io *recv_ios; + struct list_head avail_recv_ios_list; +}; + +int data_init(struct data *data, struct viport *viport, + struct data_config *config, struct ib_pd *pd); + +int data_connect(struct data *data); +void data_connected(struct data *data); +void data_disconnect(struct data *data); + +int data_xmit_packet(struct data *data, struct sk_buff *skb); + +void data_cleanup(struct data *data); + +#define data_is_connected(data) \ + (vnic_ib_conn_connected(&((data)->ib_conn))) +#define data_path_id(data) (data)->config->path_id +#define data_eioc_pool(data) &(data)->eioc_pool_parms +#define data_host_pool(data) &(data)->host_pool_parms +#define data_eioc_pool_min(data) &(data)->config->eioc_min +#define data_host_pool_min(data) &(data)->config->host_min +#define data_eioc_pool_max(data) &(data)->config->eioc_max +#define data_host_pool_max(data) &(data)->config->host_max +#define data_local_pool_addr(data) (data)->xmit_pool.rdma_addr +#define data_local_pool_rkey(data) (data)->xmit_pool.rdma_rkey 
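+
+/* The remote pool address/rkey identify the EIOC receive pool that
+ * data_send_free_recv_buffers() updates with RDMA writes.
+ */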
+#define data_remote_pool_addr(data) &(data)->recv_pool.eioc_rdma_addr +#define data_remote_pool_rkey(data) &(data)->recv_pool.eioc_rdma_rkey + +#define data_max_mtu(data) (data)->max_mtu + + +#define data_len(data, trailer) be16_to_cpu(trailer->data_length) +#define data_offset(data, trailer) \ + ((data)->recv_pool.buffer_sz - sizeof(struct viport_trailer) \ + - ALIGN(data_len((data), (trailer)), VIPORT_TRAILER_ALIGNMENT) \ + + (trailer->data_alignment_offset)) + +/* the following macros manipulate ring buffer indexes. + * the ring buffer size must be a power of 2. + */ +#define ADD(index, increment, size) (((index) + (increment))&((size) - 1)) +#define NEXT(index, size) ADD(index, 1, size) +#define INC(index, increment, size) (index) = ADD(index, increment, size) + +/* this is max multicast msg embedded will send */ +#define MCAST_MSG_SIZE \ + (2048 - sizeof(struct ib_grh) - sizeof(struct viport_trailer)) + +int mc_data_init(struct mc_data *mc_data, struct viport *viport, + struct data_config *config, + struct ib_pd *pd); + +void vnic_mc_data_cleanup(struct mc_data *mc_data); + +#endif /* VNIC_DATA_H_INCLUDED */ diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h new file mode 100644 index 0000000..dd8a073 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_trailer.h @@ -0,0 +1,103 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */
+
+#ifndef VNIC_TRAILER_H_INCLUDED
+#define VNIC_TRAILER_H_INCLUDED
+
+/* pkt_flags values */
+enum {
+	PF_CHASH_VALID = 0x01,
+	PF_IPSEC_VALID = 0x02,
+	PF_TCP_SEGMENT = 0x04,
+	PF_KICK = 0x08,
+	PF_VLAN_INSERT = 0x10,
+	PF_PVID_OVERRIDDEN = 0x20,
+	PF_FCS_INCLUDED = 0x40,
+	PF_FORCE_ROUTE = 0x80
+};
+
+/* tx_chksum_flags values */
+enum {
+	TX_CHKSUM_FLAGS_CHECKSUM_V4 = 0x01,
+	TX_CHKSUM_FLAGS_CHECKSUM_V6 = 0x02,
+	TX_CHKSUM_FLAGS_TCP_CHECKSUM = 0x04,
+	TX_CHKSUM_FLAGS_UDP_CHECKSUM = 0x08,
+	TX_CHKSUM_FLAGS_IP_CHECKSUM = 0x10
+};
+
+/* rx_chksum_flags values */
+enum {
+	RX_CHKSUM_FLAGS_TCP_CHECKSUM_FAILED = 0x01,
+	RX_CHKSUM_FLAGS_UDP_CHECKSUM_FAILED = 0x02,
+	RX_CHKSUM_FLAGS_IP_CHECKSUM_FAILED = 0x04,
+	RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED = 0x08,
+	RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED = 0x10,
+	RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED = 0x20,
+	RX_CHKSUM_FLAGS_LOOPBACK = 0x40,
+	RX_CHKSUM_FLAGS_RESERVED = 0x80
+};
+
+/* connection_hash_and_valid values */
+enum {
+	CHV_VALID = 0x80,
+	CHV_HASH_MASH = 0x7f
+};
+
+struct viport_trailer {
+	s8 data_alignment_offset;
+	u8 rndis_header_length;	/* reserved for use by edp */
+	__be16 data_length;
+	u8 pkt_flags;
+	u8 tx_chksum_flags;
+	u8 rx_chksum_flags;
+	u8 ip_sec_flags;
+	u32 tcp_seq_no;
+	u32 ip_sec_offload_handle;
+	u32 ip_sec_next_offload_handle;
+	u8 dest_mac_addr[6];
+	__be16 vlan;
+	u16 time_stamp;
+	u8 origin;
+	u8 connection_hash_and_valid;
+};
+
+#define VIPORT_TRAILER_ALIGNMENT 32
+
+#define BUFFER_SIZE(len) \
+	(sizeof(struct viport_trailer) + \
+	 ALIGN((len), VIPORT_TRAILER_ALIGNMENT))
+
+#define MAX_PAYLOAD(len) \
+	ALIGN_DOWN((len) - sizeof(struct viport_trailer), \
+		   VIPORT_TRAILER_ALIGNMENT)
+
+#endif /* VNIC_TRAILER_H_INCLUDED */

From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:56:54 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 29 May 2008 15:26:54 +0530
Subject: [ofa-general] [PATCH v3 06/13] QLogic VNIC: IB core stack interaction
In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain>
References: <20080529095126.9943.84692.stgit@localhost.localdomain>
Message-ID: <20080529095654.9943.72719.stgit@localhost.localdomain>

From: Ramachandra K

This patch implements the interaction of the QLogic VNIC driver with the
underlying core InfiniBand stack.

Signed-off-by: Ramachandra K
Signed-off-by: Poornima Kamath
Signed-off-by: Amar Mudrankit
---

 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c | 1043 ++++++++++++++++++++++++++++
 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h |  206 ++++++
 2 files changed, 1249 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c
 create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h

diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c
new file mode 100644
index 0000000..c43e69e
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.c
@@ -0,0 +1,1043 @@
+/*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_data.h" +#include "vnic_config.h" +#include "vnic_ib.h" +#include "vnic_viport.h" +#include "vnic_sys.h" +#include "vnic_main.h" +#include "vnic_stats.h" + +static int vnic_ib_inited; +static void vnic_add_one(struct ib_device *device); +static void vnic_remove_one(struct ib_device *device); +static int vnic_defer_completion(void *ptr); + +static int vnic_ib_mc_init_qp(struct mc_data *mc_data, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config *viport_config); + +static struct ib_client vnic_client = { + .name = "vnic", + .add = vnic_add_one, + .remove = vnic_remove_one +}; + +struct ib_sa_client vnic_sa_client; + +int vnic_ib_init(void) +{ + int ret = -1; + + IB_FUNCTION("vnic_ib_init()\n"); + + /* class has to be registered before + * calling ib_register_client() because, that call + * will trigger vnic_add_port() which will register + * class_device for the port with the parent class + * as vnic_class + */ + ret = class_register(&vnic_class); + if (ret) { + printk(KERN_ERR PFX "couldn't register class" + " infiniband_qlgc_vnic; error %d", ret); + goto out; + } + + ib_sa_register_client(&vnic_sa_client); + ret = ib_register_client(&vnic_client); + if (ret) { + printk(KERN_ERR PFX "couldn't register IB client;" + " error %d", ret); + goto err_ib_reg; + } + + interface_dev.dev.class = &vnic_class; + interface_dev.dev.release = vnic_release_dev; + snprintf(interface_dev.dev.bus_id, + BUS_ID_SIZE, "interfaces"); + init_completion(&interface_dev.released); + ret = device_register(&interface_dev.dev); + if (ret) { + printk(KERN_ERR PFX "couldn't register class interfaces;" + " error %d", ret); + goto err_class_dev; + } + ret = device_create_file(&interface_dev.dev, + &dev_attr_delete_vnic); + if (ret) { + printk(KERN_ERR PFX "couldn't create class file" + " 'delete_vnic'; error %d", ret); + goto err_class_file; + } + + vnic_ib_inited = 1; + + return ret; +err_class_file: + device_unregister(&interface_dev.dev); +err_class_dev: + ib_unregister_client(&vnic_client); +err_ib_reg: + ib_sa_unregister_client(&vnic_sa_client); + class_unregister(&vnic_class); +out: + return ret; +} + +static struct vnic_ib_port *vnic_add_port(struct vnic_ib_device *device, + u8 
port_num) +{ + struct vnic_ib_port *port; + + port = kzalloc(sizeof *port, GFP_KERNEL); + if (!port) + return NULL; + + init_completion(&port->pdev_info.released); + port->dev = device; + port->port_num = port_num; + + port->pdev_info.dev.class = &vnic_class; + port->pdev_info.dev.parent = NULL; + port->pdev_info.dev.release = vnic_release_dev; + snprintf(port->pdev_info.dev.bus_id, BUS_ID_SIZE, + "vnic-%s-%d", device->dev->name, port_num); + + if (device_register(&port->pdev_info.dev)) + goto free_port; + + if (device_create_file(&port->pdev_info.dev, + &dev_attr_create_primary)) + goto err_class; + if (device_create_file(&port->pdev_info.dev, + &dev_attr_create_secondary)) + goto err_class; + + return port; +err_class: + device_unregister(&port->pdev_info.dev); +free_port: + kfree(port); + + return NULL; +} + +static void vnic_add_one(struct ib_device *device) +{ + struct vnic_ib_device *vnic_dev; + struct vnic_ib_port *port; + int s, e, p; + + vnic_dev = kmalloc(sizeof *vnic_dev, GFP_KERNEL); + if (!vnic_dev) + return; + + vnic_dev->dev = device; + INIT_LIST_HEAD(&vnic_dev->port_list); + + if (device->node_type == RDMA_NODE_IB_SWITCH) { + s = 0; + e = 0; + + } else { + s = 1; + e = device->phys_port_cnt; + + } + + for (p = s; p <= e; p++) { + port = vnic_add_port(vnic_dev, p); + if (port) + list_add_tail(&port->list, &vnic_dev->port_list); + } + + ib_set_client_data(device, &vnic_client, vnic_dev); + +} + +static void vnic_remove_one(struct ib_device *device) +{ + struct vnic_ib_device *vnic_dev; + struct vnic_ib_port *port, *tmp_port; + + vnic_dev = ib_get_client_data(device, &vnic_client); + list_for_each_entry_safe(port, tmp_port, + &vnic_dev->port_list, list) { + device_unregister(&port->pdev_info.dev); + /* + * wait for sysfs entries to go away, so that no new vnics + * are created + */ + wait_for_completion(&port->pdev_info.released); + kfree(port); + + } + kfree(vnic_dev); + + /* TODO Only those vnic interfaces associated with + * the HCA whose remove event is called should be freed + * Currently all the vnic interfaces are freed + */ + + while (!list_empty(&vnic_list)) { + struct vnic *vnic = + list_entry(vnic_list.next, struct vnic, list_ptrs); + vnic_free(vnic); + } + + vnic_npevent_cleanup(); + viport_cleanup(); + +} + +void vnic_ib_cleanup(void) +{ + IB_FUNCTION("vnic_ib_cleanup()\n"); + + if (!vnic_ib_inited) + return; + + device_unregister(&interface_dev.dev); + wait_for_completion(&interface_dev.released); + + ib_unregister_client(&vnic_client); + ib_sa_unregister_client(&vnic_sa_client); + class_unregister(&vnic_class); +} + +static void vnic_path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *context) +{ + struct vnic_ib_path_info *p = context; + p->status = status; + if (!status) + p->path = *pathrec; + + complete(&p->done); +} + +int vnic_ib_get_path(struct netpath *netpath, struct vnic *vnic) +{ + struct viport_config *config = netpath->viport->config; + int ret = 0; + + init_completion(&config->path_info.done); + IB_INFO("Using SA path rec get time out value of %d\n", + config->sa_path_rec_get_timeout); + config->path_info.path_query_id = + ib_sa_path_rec_get(&vnic_sa_client, + config->ibdev, + config->port, + &config->path_info.path, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + config->sa_path_rec_get_timeout, + GFP_KERNEL, + vnic_path_rec_completion, + &config->path_info, + &config->path_info.path_query); + + if (config->path_info.path_query_id < 0) { + IB_ERROR("SA path record query 
failed; error %d\n", + config->path_info.path_query_id); + ret = config->path_info.path_query_id; + goto out; + } + + wait_for_completion(&config->path_info.done); + + if (config->path_info.status < 0) { + printk(KERN_WARNING PFX "connection not available to dgid " + "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x", + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[0]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[2]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[4]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[6]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[8]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[10]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[12]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[14])); + + if (config->path_info.status == -ETIMEDOUT) + printk(KERN_INFO " path query timed out\n"); + else if (config->path_info.status == -EIO) + printk(KERN_INFO " path query sending error\n"); + else + printk(KERN_INFO " error %d\n", + config->path_info.status); + + ret = config->path_info.status; + } +out: + if (ret) + netpath_timer(netpath, vnic->config->no_path_timeout); + + return ret; +} + +static inline void vnic_ib_handle_completions(struct ib_wc *wc, + struct vnic_ib_conn *ib_conn, + u32 *comp_num, + cycles_t *comp_time) +{ + struct io *io; + + io = (struct io *)(wc->wr_id); + vnic_ib_comp_stats(ib_conn, comp_num); + if (wc->status) { + IB_INFO("completion error wc.status %d" + " wc.opcode %d vendor err 0x%x\n", + wc->status, wc->opcode, wc->vendor_err); + } else if (io) { + vnic_ib_io_stats(io, ib_conn, *comp_time); + if (io->type == RECV_UD) { + struct ud_recv_io *recv_io = + container_of(io, struct ud_recv_io, io); + recv_io->len = wc->byte_len; + } + if (io->routine) + (*io->routine) (io); + } +} + +static void ib_qp_event(struct ib_event *event, void *context) +{ + IB_ERROR("QP event %d\n", event->event); +} + +static void vnic_ib_completion(struct ib_cq *cq, void *ptr) +{ + struct vnic_ib_conn *ib_conn = ptr; + unsigned long flags; + int compl_received; + struct ib_wc wc; + cycles_t comp_time; + u32 comp_num = 0; + + /* for multicast, cm_id is NULL, so skip that test */ + if (ib_conn->cm_id && + (ib_conn->state != IB_CONN_CONNECTED)) + return; + + /* Check if completion processing is taking place in thread + * If not then process completions in this handler, + * else set compl_received if not set, to indicate that + * there are more completions to process in thread. 
+ */ + + spin_lock_irqsave(&ib_conn->compl_received_lock, flags); + compl_received = ib_conn->compl_received; + spin_unlock_irqrestore(&ib_conn->compl_received_lock, flags); + + if (ib_conn->in_thread || compl_received) { + if (!compl_received) { + spin_lock_irqsave(&ib_conn->compl_received_lock, flags); + ib_conn->compl_received = 1; + spin_unlock_irqrestore(&ib_conn->compl_received_lock, + flags); + } + wake_up(&(ib_conn->callback_wait_queue)); + } else { + vnic_ib_note_comptime_stats(&comp_time); + vnic_ib_callback_stats(ib_conn); + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(cq, 1, &wc) > 0) { + vnic_ib_handle_completions(&wc, ib_conn, &comp_num, + &comp_time); + if (ib_conn->cm_id && + ib_conn->state != IB_CONN_CONNECTED) + break; + + /* If we get more completions than the completion limit + * defer completion to the thread + */ + if ((!ib_conn->in_thread) && + (comp_num >= ib_conn->ib_config->completion_limit)) { + ib_conn->in_thread = 1; + spin_lock_irqsave( + &ib_conn->compl_received_lock, flags); + ib_conn->compl_received = 1; + spin_unlock_irqrestore( + &ib_conn->compl_received_lock, flags); + wake_up(&(ib_conn->callback_wait_queue)); + break; + } + + } + vnic_ib_maxio_stats(ib_conn, comp_num); + } +} + +static int vnic_ib_mod_qp_to_rts(struct ib_cm_id *cm_id, + struct vnic_ib_conn *ib_conn) +{ + int attr_mask = 0; + int ret; + struct ib_qp_attr *qp_attr = NULL; + + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + return -ENOMEM; + + qp_attr->qp_state = IB_QPS_RTR; + + ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (ret) + goto out; + + ret = ib_modify_qp(ib_conn->qp, qp_attr, attr_mask); + if (ret) + goto out; + + IB_INFO("QP RTR\n"); + + qp_attr->qp_state = IB_QPS_RTS; + + ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask); + if (ret) + goto out; + + ret = ib_modify_qp(ib_conn->qp, qp_attr, attr_mask); + if (ret) + goto out; + + IB_INFO("QP RTS\n"); + + ret = ib_send_cm_rtu(cm_id, NULL, 0); + if (ret) + goto out; +out: + kfree(qp_attr); + return ret; +} + +int vnic_ib_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct vnic_ib_conn *ib_conn = cm_id->context; + struct viport *viport = ib_conn->viport; + int err = 0; + + switch (event->event) { + case IB_CM_REQ_ERROR: + IB_ERROR("sending CM REQ failed\n"); + err = 1; + viport->retry = 1; + break; + case IB_CM_REP_RECEIVED: + IB_INFO("CM REP recvd\n"); + if (vnic_ib_mod_qp_to_rts(cm_id, ib_conn)) + err = 1; + else { + ib_conn->state = IB_CONN_CONNECTED; + vnic_ib_connected_time_stats(ib_conn); + IB_INFO("RTU SENT\n"); + } + break; + case IB_CM_REJ_RECEIVED: + printk(KERN_ERR PFX " CM rejected control connection\n"); + if (event->param.rej_rcvd.reason == + IB_CM_REJ_INVALID_SERVICE_ID) + printk(KERN_ERR "reason: invalid service ID. 
" + "IOCGUID value specified may be incorrect\n"); + else + printk(KERN_ERR "reason code : 0x%x\n", + event->param.rej_rcvd.reason); + + err = 1; + viport->retry = 1; + break; + case IB_CM_MRA_RECEIVED: + IB_INFO("CM MRA received\n"); + break; + + case IB_CM_DREP_RECEIVED: + IB_INFO("CM DREP recvd\n"); + ib_conn->state = IB_CONN_DISCONNECTED; + break; + + case IB_CM_TIMEWAIT_EXIT: + IB_ERROR("CM timewait exit\n"); + err = 1; + break; + + default: + IB_INFO("unhandled CM event %d\n", event->event); + break; + + } + + if (err) { + ib_conn->state = IB_CONN_DISCONNECTED; + viport_failure(viport); + } + + viport_kick(viport); + return 0; +} + + +int vnic_ib_cm_connect(struct vnic_ib_conn *ib_conn) +{ + struct ib_cm_req_param *req = NULL; + struct viport *viport; + int ret = -1; + + if (!vnic_ib_conn_initted(ib_conn)) { + IB_ERROR("IB Connection out of state for CM connect (%d)\n", + ib_conn->state); + return -EINVAL; + } + + vnic_ib_conntime_stats(ib_conn); + req = kzalloc(sizeof *req, GFP_KERNEL); + if (!req) + return -ENOMEM; + + viport = ib_conn->viport; + + req->primary_path = &viport->config->path_info.path; + req->alternate_path = NULL; + req->qp_num = ib_conn->qp->qp_num; + req->qp_type = ib_conn->qp->qp_type; + req->service_id = ib_conn->ib_config->service_id; + req->private_data = &ib_conn->ib_config->conn_data; + req->private_data_len = sizeof(struct vnic_connection_data); + req->flow_control = 1; + + get_random_bytes(&req->starting_psn, 4); + req->starting_psn &= 0xffffff; + + /* + * Both responder_resources and initiator_depth are set to zero + * as we do not need RDMA read. + * + * They also must be set to zero, otherwise data connections + * are rejected by VEx. + */ + req->responder_resources = 0; + req->initiator_depth = 0; + req->remote_cm_response_timeout = 20; + req->local_cm_response_timeout = 20; + req->retry_count = ib_conn->ib_config->retry_count; + req->rnr_retry_count = ib_conn->ib_config->rnr_retry_count; + req->max_cm_retries = 15; + + ib_conn->state = IB_CONN_CONNECTING; + + ret = ib_send_cm_req(ib_conn->cm_id, req); + + kfree(req); + + if (ret) { + IB_ERROR("CM REQ sending failed; error %d \n", ret); + ib_conn->state = IB_CONN_DISCONNECTED; + } + + return ret; +} + +static int vnic_ib_init_qp(struct vnic_ib_conn *ib_conn, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config *viport_config) +{ + struct ib_qp_init_attr *init_attr; + struct ib_qp_attr *attr; + int ret; + + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) + return -ENOMEM; + + init_attr->event_handler = ib_qp_event; + init_attr->cap.max_send_wr = config->num_sends; + init_attr->cap.max_recv_wr = config->num_recvs; + init_attr->cap.max_recv_sge = config->recv_scatter; + init_attr->cap.max_send_sge = config->send_gather; + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + init_attr->qp_type = IB_QPT_RC; + init_attr->send_cq = ib_conn->cq; + init_attr->recv_cq = ib_conn->cq; + + ib_conn->qp = ib_create_qp(pd, init_attr); + + if (IS_ERR(ib_conn->qp)) { + ret = -1; + IB_ERROR("could not create QP\n"); + goto free_init_attr; + } + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + ret = -ENOMEM; + goto destroy_qp; + } + + ret = ib_find_pkey(viport_config->ibdev, viport_config->port, + be16_to_cpu(viport_config->path_info.path.pkey), + &attr->pkey_index); + if (ret) { + printk(KERN_WARNING PFX "ib_find_pkey() failed; " + "error %d\n", ret); + goto freeattr; + } + + attr->qp_state = IB_QPS_INIT; + attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE; + attr->port_num = 
viport_config->port; + + ret = ib_modify_qp(ib_conn->qp, attr, + IB_QP_STATE | + IB_QP_PKEY_INDEX | + IB_QP_ACCESS_FLAGS | IB_QP_PORT); + if (ret) { + printk(KERN_WARNING PFX "could not modify QP; error %d \n", + ret); + goto freeattr; + } + + kfree(attr); + kfree(init_attr); + return ret; + +freeattr: + kfree(attr); +destroy_qp: + ib_destroy_qp(ib_conn->qp); +free_init_attr: + kfree(init_attr); + return ret; +} + +int vnic_ib_conn_init(struct vnic_ib_conn *ib_conn, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config) +{ + struct viport_config *viport_config = viport->config; + int ret = -1; + unsigned int cq_size = config->num_sends + config->num_recvs; + + + if (!vnic_ib_conn_uninitted(ib_conn)) { + IB_ERROR("IB Connection out of state for init (%d)\n", + ib_conn->state); + return -EINVAL; + } + + ib_conn->cq = ib_create_cq(viport_config->ibdev, vnic_ib_completion, +#ifdef BUILD_FOR_OFED_1_2 + NULL, ib_conn, cq_size); +#else + NULL, ib_conn, cq_size, 0); +#endif + if (IS_ERR(ib_conn->cq)) { + IB_ERROR("could not create CQ\n"); + goto out; + } + + IB_INFO("cq created %p %d\n", ib_conn->cq, cq_size); + ib_req_notify_cq(ib_conn->cq, IB_CQ_NEXT_COMP); + init_waitqueue_head(&(ib_conn->callback_wait_queue)); + init_completion(&(ib_conn->callback_thread_exit)); + + spin_lock_init(&ib_conn->compl_received_lock); + + ib_conn->callback_thread = kthread_run(vnic_defer_completion, ib_conn, + "qlgc_vnic_def_compl"); + if (IS_ERR(ib_conn->callback_thread)) { + IB_ERROR("Could not create vnic_callback_thread;" + " error %d\n", (int) PTR_ERR(ib_conn->callback_thread)); + ib_conn->callback_thread = NULL; + goto destroy_cq; + } + + ret = vnic_ib_init_qp(ib_conn, config, pd, viport_config); + + if (ret) + goto destroy_thread; + + spin_lock_init(&ib_conn->conn_lock); + ib_conn->state = IB_CONN_INITTED; + + return ret; + +destroy_thread: + completion_callback_cleanup(ib_conn); +destroy_cq: + ib_destroy_cq(ib_conn->cq); +out: + return ret; +} + +int vnic_ib_post_recv(struct vnic_ib_conn *ib_conn, struct io *io) +{ + cycles_t post_time; + struct ib_recv_wr *bad_wr; + int ret = -1; + unsigned long flags; + + IB_FUNCTION("vnic_ib_post_recv()\n"); + + spin_lock_irqsave(&ib_conn->conn_lock, flags); + + if (!vnic_ib_conn_initted(ib_conn) && + !vnic_ib_conn_connected(ib_conn)) { + ret = -EINVAL; + goto out; + } + + vnic_ib_pre_rcvpost_stats(ib_conn, io, &post_time); + io->type = RECV; + ret = ib_post_recv(ib_conn->qp, &io->rwr, &bad_wr); + if (ret) { + IB_ERROR("error in posting rcv wr; error %d\n", ret); + ib_conn->state = IB_CONN_ERRORED; + goto out; + } + + vnic_ib_post_rcvpost_stats(ib_conn, post_time); +out: + spin_unlock_irqrestore(&ib_conn->conn_lock, flags); + return ret; + +} + +int vnic_ib_post_send(struct vnic_ib_conn *ib_conn, struct io *io) +{ + cycles_t post_time; + unsigned long flags; + struct ib_send_wr *bad_wr; + int ret = -1; + + IB_FUNCTION("vnic_ib_post_send()\n"); + + spin_lock_irqsave(&ib_conn->conn_lock, flags); + if (!vnic_ib_conn_connected(ib_conn)) { + IB_ERROR("IB Connection out of state for" + " posting sends (%d)\n", ib_conn->state); + goto out; + } + + vnic_ib_pre_sendpost_stats(io, &post_time); + if (io->swr.opcode == IB_WR_RDMA_WRITE) + io->type = RDMA; + else + io->type = SEND; + + ret = ib_post_send(ib_conn->qp, &io->swr, &bad_wr); + if (ret) { + IB_ERROR("error in posting send wr; error %d\n", ret); + ib_conn->state = IB_CONN_ERRORED; + goto out; + } + + vnic_ib_post_sendpost_stats(ib_conn, io, post_time); +out: + spin_unlock_irqrestore(&ib_conn->conn_lock, 
flags); + return ret; +} + +static int vnic_defer_completion(void *ptr) +{ + struct vnic_ib_conn *ib_conn = ptr; + struct ib_wc wc; + struct ib_cq *cq = ib_conn->cq; + cycles_t comp_time; + u32 comp_num = 0; + unsigned long flags; + + while (!ib_conn->callback_thread_end) { + wait_event_interruptible(ib_conn->callback_wait_queue, + ib_conn->compl_received || + ib_conn->callback_thread_end); + ib_conn->in_thread = 1; + spin_lock_irqsave(&ib_conn->compl_received_lock, flags); + ib_conn->compl_received = 0; + spin_unlock_irqrestore(&ib_conn->compl_received_lock, flags); + if (ib_conn->cm_id && + ib_conn->state != IB_CONN_CONNECTED) + goto out_thread; + + vnic_ib_note_comptime_stats(&comp_time); + vnic_ib_callback_stats(ib_conn); + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(cq, 1, &wc) > 0) { + vnic_ib_handle_completions(&wc, ib_conn, &comp_num, + &comp_time); + if (ib_conn->cm_id && + ib_conn->state != IB_CONN_CONNECTED) + break; + } + vnic_ib_maxio_stats(ib_conn, comp_num); +out_thread: + ib_conn->in_thread = 0; + } + complete_and_exit(&(ib_conn->callback_thread_exit), 0); + return 0; +} + +void completion_callback_cleanup(struct vnic_ib_conn *ib_conn) +{ + if (ib_conn->callback_thread) { + ib_conn->callback_thread_end = 1; + wake_up(&(ib_conn->callback_wait_queue)); + wait_for_completion(&(ib_conn->callback_thread_exit)); + ib_conn->callback_thread = NULL; + } +} + +int vnic_ib_mc_init(struct mc_data *mc_data, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config) +{ + struct viport_config *viport_config = viport->config; + int ret = -1; + unsigned int cq_size = config->num_recvs; /* recvs only */ + + IB_FUNCTION("vnic_ib_mc_init\n"); + + mc_data->ib_conn.cq = ib_create_cq(viport_config->ibdev, vnic_ib_completion, +#ifdef BUILD_FOR_OFED_1_2 + NULL, &mc_data->ib_conn, cq_size); +#else + NULL, &mc_data->ib_conn, cq_size, 0); +#endif + if (IS_ERR(mc_data->ib_conn.cq)) { + IB_ERROR("ib_create_cq failed\n"); + goto out; + } + IB_INFO("mc cq created %p %d\n", mc_data->ib_conn.cq, cq_size); + + ret = ib_req_notify_cq(mc_data->ib_conn.cq, IB_CQ_NEXT_COMP); + if (ret) { + IB_ERROR("ib_req_notify_cq failed %x \n", ret); + goto destroy_cq; + } + + init_waitqueue_head(&(mc_data->ib_conn.callback_wait_queue)); + init_completion(&(mc_data->ib_conn.callback_thread_exit)); + + spin_lock_init(&mc_data->ib_conn.compl_received_lock); + mc_data->ib_conn.callback_thread = kthread_run(vnic_defer_completion, + &mc_data->ib_conn, + "qlgc_vnic_mc_def_compl"); + if (IS_ERR(mc_data->ib_conn.callback_thread)) { + IB_ERROR("Could not create vnic_callback_thread for MULTICAST;" + " error %d\n", + (int) PTR_ERR(mc_data->ib_conn.callback_thread)); + mc_data->ib_conn.callback_thread = NULL; + goto destroy_cq; + } + IB_INFO("callback_thread created\n"); + + ret = vnic_ib_mc_init_qp(mc_data, config, pd, viport_config); + if (ret) + goto destroy_thread; + + spin_lock_init(&mc_data->ib_conn.conn_lock); + mc_data->ib_conn.state = IB_CONN_INITTED; /* stays in this state */ + + return ret; + +destroy_thread: + completion_callback_cleanup(&mc_data->ib_conn); +destroy_cq: + ib_destroy_cq(mc_data->ib_conn.cq); + mc_data->ib_conn.cq = (struct ib_cq *)ERR_PTR(-EINVAL); +out: + return ret; +} + +static int vnic_ib_mc_init_qp(struct mc_data *mc_data, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config *viport_config) +{ + struct ib_qp_init_attr *init_attr; + struct ib_qp_attr *qp_attr; + int ret; + + IB_FUNCTION("vnic_ib_mc_init_qp\n"); + + if (!mc_data->ib_conn.cq) { + 
IB_ERROR("cq is null\n"); + return -ENOMEM; + } + + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) { + IB_ERROR("failed to alloc init_attr\n"); + return -ENOMEM; + } + + init_attr->cap.max_recv_wr = config->num_recvs; + init_attr->cap.max_send_wr = 1; + init_attr->cap.max_recv_sge = 2; + init_attr->cap.max_send_sge = 1; + + /* Completion for all work requests. */ + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + + init_attr->qp_type = IB_QPT_UD; + + init_attr->send_cq = mc_data->ib_conn.cq; + init_attr->recv_cq = mc_data->ib_conn.cq; + + IB_INFO("creating qp %d \n", config->num_recvs); + + mc_data->ib_conn.qp = ib_create_qp(pd, init_attr); + + if (IS_ERR(mc_data->ib_conn.qp)) { + ret = -1; + IB_ERROR("could not create QP\n"); + goto free_init_attr; + } + + qp_attr = kzalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) { + ret = -ENOMEM; + goto destroy_qp; + } + + qp_attr->qp_state = IB_QPS_INIT; + qp_attr->port_num = viport_config->port; + qp_attr->qkey = IOC_NUMBER(be64_to_cpu(viport_config->ioc_guid)); + qp_attr->pkey_index = 0; + /* cannot set access flags for UD qp + qp_attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE; */ + + IB_INFO("port_num:%d qkey:%d pkey:%d\n", qp_attr->port_num, + qp_attr->qkey, qp_attr->pkey_index); + ret = ib_modify_qp(mc_data->ib_conn.qp, qp_attr, + IB_QP_STATE | + IB_QP_PKEY_INDEX | + IB_QP_QKEY | + + /* cannot set this for UD + IB_QP_ACCESS_FLAGS | */ + + IB_QP_PORT); + if (ret) { + IB_ERROR("ib_modify_qp to INIT failed %d \n", ret); + goto free_qp_attr; + } + + kfree(qp_attr); + kfree(init_attr); + return ret; + +free_qp_attr: + kfree(qp_attr); +destroy_qp: + ib_destroy_qp(mc_data->ib_conn.qp); + mc_data->ib_conn.qp = ERR_PTR(-EINVAL); +free_init_attr: + kfree(init_attr); + return ret; +} + +int vnic_ib_mc_mod_qp_to_rts(struct ib_qp *qp) +{ + int ret; + struct ib_qp_attr *qp_attr = NULL; + + IB_FUNCTION("vnic_ib_mc_mod_qp_to_rts\n"); + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + return -ENOMEM; + + memset(qp_attr, 0, sizeof *qp_attr); + qp_attr->qp_state = IB_QPS_RTR; + + ret = ib_modify_qp(qp, qp_attr, IB_QP_STATE); + if (ret) { + IB_ERROR("ib_modify_qp to RTR failed %d\n", ret); + goto out; + } + IB_INFO("MC QP RTR\n"); + + memset(qp_attr, 0, sizeof *qp_attr); + qp_attr->qp_state = IB_QPS_RTS; + qp_attr->sq_psn = 0; + + ret = ib_modify_qp(qp, qp_attr, IB_QP_STATE | IB_QP_SQ_PSN); + if (ret) { + IB_ERROR("ib_modify_qp to RTS failed %d\n", ret); + goto out; + } + IB_INFO("MC QP RTS\n"); + + return 0; + +out: + kfree(qp_attr); + return -1; +} + +int vnic_ib_mc_post_recv(struct mc_data *mc_data, struct io *io) +{ + cycles_t post_time; + struct ib_recv_wr *bad_wr; + int ret = -1; + + IB_FUNCTION("vnic_ib_mc_post_recv()\n"); + + vnic_ib_pre_rcvpost_stats(&mc_data->ib_conn, io, &post_time); + io->type = RECV_UD; + ret = ib_post_recv(mc_data->ib_conn.qp, &io->rwr, &bad_wr); + if (ret) { + IB_ERROR("error in posting rcv wr; error %d\n", ret); + goto out; + } + vnic_ib_post_rcvpost_stats(&mc_data->ib_conn, post_time); + +out: + return ret; +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h new file mode 100644 index 0000000..ebf9ef5 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_ib.h @@ -0,0 +1,206 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_IB_H_INCLUDED +#define VNIC_IB_H_INCLUDED + +#include +#include +#include +#include +#include +#include + +#include "vnic_sys.h" +#include "vnic_netpath.h" +#define PFX "qlgc_vnic: " + +struct io; +typedef void (comp_routine_t) (struct io *io); + +enum vnic_ib_conn_state { + IB_CONN_UNINITTED = 0, + IB_CONN_INITTED = 1, + IB_CONN_CONNECTING = 2, + IB_CONN_CONNECTED = 3, + IB_CONN_DISCONNECTED = 4, + IB_CONN_ERRORED = 5 +}; + +struct vnic_ib_conn { + struct viport *viport; + struct vnic_ib_config *ib_config; + spinlock_t conn_lock; + enum vnic_ib_conn_state state; + struct ib_qp *qp; + struct ib_cq *cq; + struct ib_cm_id *cm_id; + int callback_thread_end; + struct task_struct *callback_thread; + wait_queue_head_t callback_wait_queue; + u32 in_thread; + u32 compl_received; + struct completion callback_thread_exit; + spinlock_t compl_received_lock; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + struct { + cycles_t connection_time; + cycles_t rdma_post_time; + u32 rdma_post_ios; + cycles_t rdma_comp_time; + u32 rdma_comp_ios; + cycles_t send_post_time; + u32 send_post_ios; + cycles_t send_comp_time; + u32 send_comp_ios; + cycles_t recv_post_time; + u32 recv_post_ios; + cycles_t recv_comp_time; + u32 recv_comp_ios; + u32 num_ios; + u32 num_callbacks; + u32 max_ios; + } statistics; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ +}; + +struct vnic_ib_path_info { + struct ib_sa_path_rec path; + struct ib_sa_query *path_query; + int path_query_id; + int status; + struct completion done; +}; + +struct vnic_ib_device { + struct ib_device *dev; + struct list_head port_list; +}; + +struct vnic_ib_port { + struct vnic_ib_device *dev; + u8 port_num; + struct dev_info pdev_info; + struct list_head list; +}; + +struct io { + struct list_head list_ptrs; + struct viport *viport; + comp_routine_t *routine; + struct ib_recv_wr rwr; + struct ib_send_wr swr; +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + cycles_t time; +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ + enum {RECV, RDMA, SEND, RECV_UD} type; +}; + +struct rdma_io { + struct io io; + struct ib_sge list[2]; + u16 index; + u16 len; + u8 *data; + dma_addr_t data_dma; + struct sk_buff *skb; + dma_addr_t skb_data_dma; + struct viport_trailer *trailer; + dma_addr_t trailer_dma; +}; + +struct send_io { + struct io io; 
+	struct ib_sge list;
+	u8 *virtual_addr;
+};
+
+struct recv_io {
+	struct io io;
+	struct ib_sge list;
+	u8 *virtual_addr;
+};
+
+struct ud_recv_io {
+	struct io io;
+	u16 len;
+	dma_addr_t skb_data_dma;
+	struct ib_sge list[2];	/* one for grh and other for rest of pkt. */
+	struct sk_buff *skb;
+};
+
+int vnic_ib_init(void);
+void vnic_ib_cleanup(void);
+
+struct vnic;
+int vnic_ib_get_path(struct netpath *netpath, struct vnic *vnic);
+int vnic_ib_conn_init(struct vnic_ib_conn *ib_conn, struct viport *viport,
+		      struct ib_pd *pd, struct vnic_ib_config *config);
+
+int vnic_ib_post_recv(struct vnic_ib_conn *ib_conn, struct io *io);
+int vnic_ib_post_send(struct vnic_ib_conn *ib_conn, struct io *io);
+int vnic_ib_cm_connect(struct vnic_ib_conn *ib_conn);
+int vnic_ib_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event);
+
+#define vnic_ib_conn_uninitted(ib_conn) \
+	((ib_conn)->state == IB_CONN_UNINITTED)
+#define vnic_ib_conn_initted(ib_conn) \
+	((ib_conn)->state == IB_CONN_INITTED)
+#define vnic_ib_conn_connecting(ib_conn) \
+	((ib_conn)->state == IB_CONN_CONNECTING)
+#define vnic_ib_conn_connected(ib_conn) \
+	((ib_conn)->state == IB_CONN_CONNECTED)
+#define vnic_ib_conn_disconnected(ib_conn) \
+	((ib_conn)->state == IB_CONN_DISCONNECTED)
+
+#define MCAST_GROUP_INVALID	0x00 /* viport failed to join or left mc group */
+#define MCAST_GROUP_JOINING	0x01 /* wait for completion */
+#define MCAST_GROUP_JOINED	0x02 /* join process completed successfully */
+
+/* vnic_sa_client is used to register with sa once. It is needed to join and
+ * leave multicast groups.
+ */
+extern struct ib_sa_client vnic_sa_client;
+
+/* The following functions are used to initialize and handle multicast
+ * components.
+ */
+struct mc_data;	/* forward declaration */
+/* Initialize all necessary mc components */
+int vnic_ib_mc_init(struct mc_data *mc_data, struct viport *viport,
+		    struct ib_pd *pd, struct vnic_ib_config *config);
+/* Put multicast qp in RTS */
+int vnic_ib_mc_mod_qp_to_rts(struct ib_qp *qp);
+/* Post multicast receive buffers */
+int vnic_ib_mc_post_recv(struct mc_data *mc_data, struct io *io);
+
+#endif	/* VNIC_IB_H_INCLUDED */

From ramachandra.kuchimanchi at qlogic.com  Thu May 29 02:57:24 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 29 May 2008 15:27:24 +0530
Subject: [ofa-general] [PATCH v3 07/13] QLogic VNIC: Handling configurable parameters of the driver
In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain>
References: <20080529095126.9943.84692.stgit@localhost.localdomain>
Message-ID: <20080529095724.9943.92517.stgit@localhost.localdomain>

From: Poornima Kamath

This patch adds the files that handle the driver's configurable parameters:
configuration of the virtual NIC, of the control and data connections to the
EVIC, and of the general IB connection parameters.
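For reference, the control and data connections each derive their 64-bit CM
service ID from fixed fields plus the IOC number of the target. A minimal
userspace sketch of the layout follows; the constants are copied from
vnic_config.h in this patch, while the IOC GUID is a made-up example value,
so this is illustrative only and not part of the driver:

	#include <stdio.h>
	#include <stdint.h>

	#define SST_AGN		 0x10ULL
	#define SST_OUI		 0x00066AULL
	#define CONTROL_PATH_ID	 0x0ULL
	#define DATA_PATH_ID	 0x1ULL
	#define IOC_NUMBER(GUID) (((GUID) >> 32) & 0xFF)

	int main(void)
	{
		uint64_t ioc_guid = 0x00066a0100000123ULL; /* example only */
		uint64_t control_sid = (SST_AGN << 56) | (SST_OUI << 32) |
				       (CONTROL_PATH_ID << 8) |
				       IOC_NUMBER(ioc_guid);
		uint64_t data_sid = (SST_AGN << 56) | (SST_OUI << 32) |
				    (DATA_PATH_ID << 8) | IOC_NUMBER(ioc_guid);

		printf("control service id: 0x%016llx\n",
		       (unsigned long long)control_sid);
		printf("data service id:    0x%016llx\n",
		       (unsigned long long)data_sid);
		return 0;
	}

The two IDs differ only in the path-ID byte, which is how the target tells a
control connection from a data connection for the same IOC.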
Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_config.c | 379 ++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_config.h | 242 +++++++++++++++ 2 files changed, 621 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_config.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_config.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.c new file mode 100644 index 0000000..8bde3d8 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.c @@ -0,0 +1,379 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_config.h" +#include "vnic_trailer.h" +#include "vnic_main.h" + +u16 vnic_max_mtu = MAX_MTU; + +static u32 default_no_path_timeout = DEFAULT_NO_PATH_TIMEOUT; +static u32 sa_path_rec_get_timeout = SA_PATH_REC_GET_TIMEOUT; +static u32 default_primary_reconnect_timeout = + DEFAULT_PRIMARY_RECONNECT_TIMEOUT; +static u32 default_primary_switch_timeout = DEFAULT_PRIMARY_SWITCH_TIMEOUT; +static int default_prefer_primary = DEFAULT_PREFER_PRIMARY; + +static int use_rx_csum = VNIC_USE_RX_CSUM; +static int use_tx_csum = VNIC_USE_TX_CSUM; + +static u32 control_response_timeout = CONTROL_RSP_TIMEOUT; +static u32 completion_limit = DEFAULT_COMPLETION_LIMIT; + +module_param(vnic_max_mtu, ushort, 0444); +MODULE_PARM_DESC(vnic_max_mtu, "Maximum MTU size (1500-9500). Default is 9500"); + +module_param(default_prefer_primary, bool, 0444); +MODULE_PARM_DESC(default_prefer_primary, "Determines if primary path is" + " preferred (1) or not (0). Defaults to 0"); +module_param(use_rx_csum, bool, 0444); +MODULE_PARM_DESC(use_rx_csum, "Determines if RX checksum is done on VEx (1)" + " or not (0). Defaults to 1"); +module_param(use_tx_csum, bool, 0444); +MODULE_PARM_DESC(use_tx_csum, "Determines if TX checksum is done on VEx (1)" + " or not (0). 
Defaults to 1"); +module_param(default_no_path_timeout, uint, 0444); +MODULE_PARM_DESC(default_no_path_timeout, "Time to wait in milliseconds" + " before reconnecting to VEx after connection loss"); +module_param(default_primary_reconnect_timeout, uint, 0444); +MODULE_PARM_DESC(default_primary_reconnect_timeout, "Time to wait in" + " milliseconds before reconnecting the" + " primary path to VEx"); +module_param(default_primary_switch_timeout, uint, 0444); +MODULE_PARM_DESC(default_primary_switch_timeout, "Time to wait before" + " switching back to primary path if" + " primary path is preferred"); +module_param(sa_path_rec_get_timeout, uint, 0444); +MODULE_PARM_DESC(sa_path_rec_get_timeout, "Time out value in milliseconds" + " for SA path record get queries"); + +module_param(control_response_timeout, uint, 0444); +MODULE_PARM_DESC(control_response_timeout, "Time out value in milliseconds" + " to wait for response to control requests"); + +module_param(completion_limit, uint, 0444); +MODULE_PARM_DESC(completion_limit, "Maximum completions to process" + " in a single completion callback invocation. Default is 100" + " Minimum value is 10"); + +static void config_control_defaults(struct control_config *control_config, + struct path_param *params) +{ + int len; + char *dot; + u64 sid; + + sid = (SST_AGN << 56) | (SST_OUI << 32) | (CONTROL_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + control_config->ib_config.service_id = cpu_to_be64(sid); + control_config->ib_config.conn_data.path_id = 0; + control_config->ib_config.conn_data.vnic_instance = params->instance; + control_config->ib_config.conn_data.path_num = 0; + control_config->ib_config.conn_data.features_supported = + __constant_cpu_to_be32((u32) (VNIC_FEAT_IGNORE_VLAN | + VNIC_FEAT_RDMA_IMMED)); + dot = strchr(init_utsname()->nodename, '.'); + + if (dot) + len = dot - init_utsname()->nodename; + else + len = strlen(init_utsname()->nodename); + + if (len > VNIC_MAX_NODENAME_LEN) + len = VNIC_MAX_NODENAME_LEN; + + memcpy(control_config->ib_config.conn_data.nodename, + init_utsname()->nodename, len); + + if (params->ib_multicast == 1) + control_config->ib_multicast = 1; + else if (params->ib_multicast == 0) + control_config->ib_multicast = 0; + else { + /* parameter is not set - enable it by default */ + control_config->ib_multicast = 1; + CONFIG_ERROR("IOCGUID=%llx INSTANCE=%d IB_MULTICAST defaulted" + " to TRUE\n", + be64_to_cpu(params->ioc_guid), + (char)params->instance); + } + + if (control_config->ib_multicast) + control_config->ib_config.conn_data.features_supported |= + __constant_cpu_to_be32(VNIC_FEAT_INBOUND_IB_MC); + + control_config->ib_config.retry_count = RETRY_COUNT; + control_config->ib_config.rnr_retry_count = RETRY_COUNT; + control_config->ib_config.min_rnr_timer = MIN_RNR_TIMER; + + /* These values are not configurable*/ + control_config->ib_config.num_recvs = 5; + control_config->ib_config.num_sends = 1; + control_config->ib_config.recv_scatter = 1; + control_config->ib_config.send_gather = 1; + control_config->ib_config.completion_limit = completion_limit; + + control_config->num_recvs = control_config->ib_config.num_recvs; + + control_config->vnic_instance = params->instance; + control_config->max_address_entries = MAX_ADDRESS_ENTRIES; + control_config->min_address_entries = MIN_ADDRESS_ENTRIES; + control_config->rsp_timeout = msecs_to_jiffies(control_response_timeout); +} + +static void config_data_defaults(struct data_config *data_config, + struct path_param *params) +{ + u64 sid; + + sid = (SST_AGN 
<< 56) | (SST_OUI << 32) | (DATA_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + data_config->ib_config.service_id = cpu_to_be64(sid); + data_config->ib_config.conn_data.path_id = jiffies; /* random */ + data_config->ib_config.conn_data.vnic_instance = params->instance; + data_config->ib_config.conn_data.path_num = 0; + + data_config->ib_config.retry_count = RETRY_COUNT; + data_config->ib_config.rnr_retry_count = RETRY_COUNT; + data_config->ib_config.min_rnr_timer = MIN_RNR_TIMER; + + /* + * NOTE: the num_recvs size assumes that the EIOC could + * RDMA enough packets to fill all of the host recv + * pool entries, plus send a kick message after each + * packet, plus RDMA new buffers for the size of + * the EIOC recv buffer pool, plus send kick messages + * after each min_host_update_sz of new buffers all + * before the host can even pull off the first completed + * receive off the completion queue, and repost the + * receive. NOT LIKELY! + */ + data_config->ib_config.num_recvs = HOST_RECV_POOL_ENTRIES + + (MAX_EIOC_POOL_SZ / MIN_HOST_UPDATE_SZ); + + data_config->ib_config.num_sends = (2 * NOTIFY_BUNDLE_SZ) + + (HOST_RECV_POOL_ENTRIES / MIN_EIOC_UPDATE_SZ) + 1; + + data_config->ib_config.recv_scatter = 1; /* not configurable */ + data_config->ib_config.send_gather = 2; /* not configurable */ + data_config->ib_config.completion_limit = completion_limit; + + data_config->num_recvs = data_config->ib_config.num_recvs; + data_config->path_id = data_config->ib_config.conn_data.path_id; + + + data_config->host_recv_pool_entries = HOST_RECV_POOL_ENTRIES; + + data_config->host_min.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + MIN_MTU)); + data_config->host_max.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + vnic_max_mtu)); + data_config->eioc_min.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + MIN_MTU)); + data_config->eioc_max.size_recv_pool_entry = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_entries = + __constant_cpu_to_be32(MIN_HOST_POOL_SZ); + data_config->host_max.num_recv_pool_entries = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + data_config->eioc_min.num_recv_pool_entries = + __constant_cpu_to_be32(MIN_EIOC_POOL_SZ); + data_config->eioc_max.num_recv_pool_entries = + __constant_cpu_to_be32(MAX_EIOC_POOL_SZ); + + data_config->host_min.timeout_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_TIMEOUT); + data_config->host_max.timeout_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_TIMEOUT); + data_config->eioc_min.timeout_before_kick = 0; + data_config->eioc_max.timeout_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_ENTRIES); + data_config->host_max.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_ENTRIES); + data_config->eioc_min.num_recv_pool_entries_before_kick = 0; + data_config->eioc_max.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_BYTES); + data_config->host_max.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_BYTES); + data_config->eioc_min.num_recv_pool_bytes_before_kick = 0; + data_config->eioc_max.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MIN_HOST_UPDATE_SZ); + 
data_config->host_max.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MAX_HOST_UPDATE_SZ); + data_config->eioc_min.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MIN_EIOC_UPDATE_SZ); + data_config->eioc_max.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MAX_EIOC_UPDATE_SZ); + + data_config->notify_bundle = NOTIFY_BUNDLE_SZ; +} + +static void config_path_info_defaults(struct viport_config *config, + struct path_param *params) +{ + int i; + ib_query_gid(config->ibdev, config->port, 0, + &config->path_info.path.sgid); + for (i = 0; i < 16; i++) + config->path_info.path.dgid.raw[i] = params->dgid[i]; + + config->path_info.path.pkey = params->pkey; + config->path_info.path.numb_path = 1; + config->sa_path_rec_get_timeout = sa_path_rec_get_timeout; + +} + +static void config_viport_defaults(struct viport_config *config, + struct path_param *params) +{ + config->ibdev = params->ibdev; + config->port = params->port; + config->ioc_guid = params->ioc_guid; + config->stats_interval = msecs_to_jiffies(VIPORT_STATS_INTERVAL); + config->hb_interval = msecs_to_jiffies(VIPORT_HEARTBEAT_INTERVAL); + config->hb_timeout = VIPORT_HEARTBEAT_TIMEOUT * 1000; + /*hb_timeout needs to be in usec*/ + strcpy(config->ioc_string, params->ioc_string); + config_path_info_defaults(config, params); + + config_control_defaults(&config->control_config, params); + config_data_defaults(&config->data_config, params); +} + +static void config_vnic_defaults(struct vnic_config *config) +{ + config->no_path_timeout = msecs_to_jiffies(default_no_path_timeout); + config->primary_connect_timeout = + msecs_to_jiffies(DEFAULT_PRIMARY_CONNECT_TIMEOUT); + config->primary_reconnect_timeout = + msecs_to_jiffies(default_primary_reconnect_timeout); + config->primary_switch_timeout = + msecs_to_jiffies(default_primary_switch_timeout); + config->prefer_primary = default_prefer_primary; + config->use_rx_csum = use_rx_csum; + config->use_tx_csum = use_tx_csum; +} + +struct viport_config *config_alloc_viport(struct path_param *params) +{ + struct viport_config *config; + + config = kzalloc(sizeof *config, GFP_KERNEL); + if (!config) { + CONFIG_ERROR("could not allocate memory for" + " struct viport_config\n"); + return NULL; + } + + config_viport_defaults(config, params); + + return config; +} + +struct vnic_config *config_alloc_vnic(void) +{ + struct vnic_config *config; + + config = kzalloc(sizeof *config, GFP_KERNEL); + if (!config) { + CONFIG_ERROR("couldn't allocate memory for" + " struct vnic_config\n"); + + return NULL; + } + + config_vnic_defaults(config); + return config; +} + +char *config_viport_name(struct viport_config *config) +{ + /* function only called by one thread, can return a static string */ + static char str[64]; + + sprintf(str, "GUID %llx instance %d", + be64_to_cpu(config->ioc_guid), + config->control_config.vnic_instance); + return str; +} + +int config_start(void) +{ + vnic_max_mtu = min_t(u16, vnic_max_mtu, MAX_MTU); + vnic_max_mtu = max_t(u16, vnic_max_mtu, MIN_MTU); + + sa_path_rec_get_timeout = min_t(u32, sa_path_rec_get_timeout, + MAX_SA_TIMEOUT); + sa_path_rec_get_timeout = max_t(u32, sa_path_rec_get_timeout, + MIN_SA_TIMEOUT); + + control_response_timeout = min_t(u32, control_response_timeout, + MAX_CONTROL_RSP_TIMEOUT); + + control_response_timeout = max_t(u32, control_response_timeout, + MIN_CONTROL_RSP_TIMEOUT); + + completion_limit = max_t(u32, completion_limit, + MIN_COMPLETION_LIMIT); + + if (!default_no_path_timeout) + default_no_path_timeout = 
DEFAULT_NO_PATH_TIMEOUT; + + if (!default_primary_reconnect_timeout) + default_primary_reconnect_timeout = + DEFAULT_PRIMARY_RECONNECT_TIMEOUT; + + if (!default_primary_switch_timeout) + default_primary_switch_timeout = + DEFAULT_PRIMARY_SWITCH_TIMEOUT; + + return 0; + +} diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_config.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.h new file mode 100644 index 0000000..dca5f98 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_config.h @@ -0,0 +1,242 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_CONFIG_H_INCLUDED +#define VNIC_CONFIG_H_INCLUDED + +#include +#include +#include + +#include "vnic_control.h" +#include "vnic_ib.h" + +#define SST_AGN 0x10ULL +#define SST_OUI 0x00066AULL + +enum { + CONTROL_PATH_ID = 0x0, + DATA_PATH_ID = 0x1 +}; + +#define IOC_NUMBER(GUID) (((GUID) >> 32) & 0xFF) + +enum { + VNIC_CLASS_SUBCLASS = 0x2000066A, + VNIC_PROTOCOL = 0, + VNIC_PROT_VERSION = 1 +}; + +enum { + MIN_MTU = 1500, /* minimum negotiated MTU size */ + MAX_MTU = 9500 /* jumbo frame */ +}; + +/* + * TODO: tune the pool parameter values + */ +enum { + MIN_ADDRESS_ENTRIES = 16, + MAX_ADDRESS_ENTRIES = 64 +}; + +enum { + HOST_RECV_POOL_ENTRIES = 512, + MIN_HOST_POOL_SZ = 64, + MIN_EIOC_POOL_SZ = 64, + MAX_EIOC_POOL_SZ = 256, + MIN_HOST_UPDATE_SZ = 8, + MAX_HOST_UPDATE_SZ = 32, + MIN_EIOC_UPDATE_SZ = 8, + MAX_EIOC_UPDATE_SZ = 32, + NOTIFY_BUNDLE_SZ = 32 +}; + +enum { + MIN_HOST_KICK_TIMEOUT = 10, /* in usec */ + MAX_HOST_KICK_TIMEOUT = 100 /* in usec */ +}; + +enum { + MIN_HOST_KICK_ENTRIES = 1, + MAX_HOST_KICK_ENTRIES = 128 +}; + +enum { + MIN_HOST_KICK_BYTES = 0, + MAX_HOST_KICK_BYTES = 5000 +}; + +enum { + DEFAULT_NO_PATH_TIMEOUT = 10000, + DEFAULT_PRIMARY_CONNECT_TIMEOUT = 10000, + DEFAULT_PRIMARY_RECONNECT_TIMEOUT = 10000, + DEFAULT_PRIMARY_SWITCH_TIMEOUT = 10000 +}; + +enum { + VIPORT_STATS_INTERVAL = 500, /* .5 sec */ + VIPORT_HEARTBEAT_INTERVAL = 1000, /* 1 second */ + VIPORT_HEARTBEAT_TIMEOUT = 64000 /* 64 sec */ +}; + +enum { + /* 5 sec increased for EVIC support for large number of + * host connections + */ + CONTROL_RSP_TIMEOUT = 5000, + MIN_CONTROL_RSP_TIMEOUT = 1000, /* 1 sec */ + MAX_CONTROL_RSP_TIMEOUT = 60000 /* 60 sec */ +}; + +/* Maximum number of completions to be processed + * during a single completion callback invocation + */ +enum { + DEFAULT_COMPLETION_LIMIT = 100, + MIN_COMPLETION_LIMIT = 10 +}; + +/* infiniband connection parameters */ +enum { + RETRY_COUNT = 3, + MIN_RNR_TIMER = 22, /* 20 ms */ + DEFAULT_PKEY = 0 /* pkey table index */ +}; + +enum { + SA_PATH_REC_GET_TIMEOUT = 1000, /* 1000 ms */ + MIN_SA_TIMEOUT = 100, /* 100 ms */ + MAX_SA_TIMEOUT = 20000 /* 20s */ +}; + +#define MAX_PARAM_VALUE 0x40000000 +#define VNIC_USE_RX_CSUM 1 +#define VNIC_USE_TX_CSUM 1 +#define DEFAULT_PREFER_PRIMARY 0 + +/* As per IBTA specification, IOCString Maximum length can be 512 bits. 
*/ +#define MAX_IOC_STRING_LEN (512/8) + +struct path_param { + __be64 ioc_guid; + u8 ioc_string[MAX_IOC_STRING_LEN+1]; + u8 port; + u8 instance; + struct ib_device *ibdev; + struct vnic_ib_port *ibport; + char name[IFNAMSIZ]; + u8 dgid[16]; + __be16 pkey; + int rx_csum; + int tx_csum; + int heartbeat; + int ib_multicast; +}; + +struct vnic_ib_config { + __be64 service_id; + struct vnic_connection_data conn_data; + u32 retry_count; + u32 rnr_retry_count; + u8 min_rnr_timer; + u32 num_sends; + u32 num_recvs; + u32 recv_scatter; /* 1 */ + u32 send_gather; /* 1 or 2 */ + u32 completion_limit; +}; + +struct control_config { + struct vnic_ib_config ib_config; + u32 num_recvs; + u8 vnic_instance; + u16 max_address_entries; + u16 min_address_entries; + u32 rsp_timeout; + u32 ib_multicast; +}; + +struct data_config { + struct vnic_ib_config ib_config; + u64 path_id; + u32 num_recvs; + u32 host_recv_pool_entries; + struct vnic_recv_pool_config host_min; + struct vnic_recv_pool_config host_max; + struct vnic_recv_pool_config eioc_min; + struct vnic_recv_pool_config eioc_max; + u32 notify_bundle; +}; + +struct viport_config { + struct viport *viport; + struct control_config control_config; + struct data_config data_config; + struct vnic_ib_path_info path_info; + u32 sa_path_rec_get_timeout; + struct ib_device *ibdev; + u32 port; + unsigned long stats_interval; + u32 hb_interval; + u32 hb_timeout; + __be64 ioc_guid; + u8 ioc_string[MAX_IOC_STRING_LEN+1]; + size_t path_idx; +}; + +/* + * primary_connect_timeout - if the secondary connects first, + * how long do we give the primary? + * primary_reconnect_timeout - same as above, but used when recovering + * from the case where both paths fail + * primary_switch_timeout - how long do we wait before switching to the + * primary when it comes back? + */ +struct vnic_config { + struct vnic *vnic; + char name[IFNAMSIZ]; + unsigned long no_path_timeout; + u32 primary_connect_timeout; + u32 primary_reconnect_timeout; + u32 primary_switch_timeout; + int prefer_primary; + int use_rx_csum; + int use_tx_csum; +}; + +int config_start(void); +struct viport_config *config_alloc_viport(struct path_param *params); +struct vnic_config *config_alloc_vnic(void); +char *config_viport_name(struct viport_config *config); + +#endif /* VNIC_CONFIG_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:57:54 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:27:54 +0530 Subject: [ofa-general] [PATCH v3 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095754.9943.27936.stgit@localhost.localdomain> From: Amar Mudrankit The sysfs interface for the QLogic VNIC driver is implemented through this patch. 
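As background for reviewers: targets are created by writing a comma-separated
option string (ioc_guid=, dgid=, pkey=, name=, plus optional parameters) to
the create_primary and create_secondary attributes, and the 32-hex-digit
dgid= value is converted two digits at a time into 16 raw GID bytes. A
minimal userspace sketch of that conversion, mirroring the logic in
vnic_parse_options() (the GID string is a made-up example, and this is
illustrative only, not driver code):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		const char *p = "fe80000000000000000000000000abcd"; /* example */
		unsigned char dgid[16];
		char byte[3];
		int i;

		if (strlen(p) != 32)	/* same length check as the driver */
			return 1;
		for (i = 0; i < 16; i++) {
			/* two hex digits per byte, as in vnic_parse_options() */
			memcpy(byte, p + i * 2, 2);
			byte[2] = '\0';
			dgid[i] = (unsigned char)strtoul(byte, NULL, 16);
		}
		for (i = 0; i < 16; i++)
			printf("%02x", dgid[i]);
		printf("\n");
		return 0;
	}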
Signed-off-by: Amar Mudrankit Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath --- drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1133 +++++++++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h | 51 + 2 files changed, 1184 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c new file mode 100644 index 0000000..40b3c77 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c @@ -0,0 +1,1133 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_config.h" +#include "vnic_ib.h" +#include "vnic_viport.h" +#include "vnic_main.h" +#include "vnic_stats.h" + +/* + * target eiocs are added by writing + * + * ioc_guid=,dgid=,pkey=,name= + * to the create_primary sysfs attribute. 
+ */ +enum { + VNIC_OPT_ERR = 0, + VNIC_OPT_IOC_GUID = 1 << 0, + VNIC_OPT_DGID = 1 << 1, + VNIC_OPT_PKEY = 1 << 2, + VNIC_OPT_NAME = 1 << 3, + VNIC_OPT_INSTANCE = 1 << 4, + VNIC_OPT_RXCSUM = 1 << 5, + VNIC_OPT_TXCSUM = 1 << 6, + VNIC_OPT_HEARTBEAT = 1 << 7, + VNIC_OPT_IOC_STRING = 1 << 8, + VNIC_OPT_IB_MULTICAST = 1 << 9, + VNIC_OPT_ALL = (VNIC_OPT_IOC_GUID | + VNIC_OPT_DGID | VNIC_OPT_NAME | VNIC_OPT_PKEY), +}; + +static match_table_t vnic_opt_tokens = { + {VNIC_OPT_IOC_GUID, "ioc_guid=%s"}, + {VNIC_OPT_DGID, "dgid=%s"}, + {VNIC_OPT_PKEY, "pkey=%x"}, + {VNIC_OPT_NAME, "name=%s"}, + {VNIC_OPT_INSTANCE, "instance=%d"}, + {VNIC_OPT_RXCSUM, "rx_csum=%s"}, + {VNIC_OPT_TXCSUM, "tx_csum=%s"}, + {VNIC_OPT_HEARTBEAT, "heartbeat=%d"}, + {VNIC_OPT_IOC_STRING, "ioc_string=\"%s"}, + {VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"}, + {VNIC_OPT_ERR, NULL} +}; + +void vnic_release_dev(struct device *dev) +{ + struct dev_info *dev_info = + container_of(dev, struct dev_info, dev); + + complete(&dev_info->released); + +} + +struct class vnic_class = { + .name = "infiniband_qlgc_vnic", + .dev_release = vnic_release_dev +}; + +struct dev_info interface_dev; + +static int vnic_parse_options(const char *buf, struct path_param *param) +{ + char *options, *sep_opt; + char *p; + char dgid[3]; + substring_t args[MAX_OPT_ARGS]; + int opt_mask = 0; + int token; + int ret = -EINVAL; + int i, len; + + options = kstrdup(buf, GFP_KERNEL); + if (!options) + return -ENOMEM; + + sep_opt = options; + while ((p = strsep(&sep_opt, ",")) != NULL) { + if (!*p) + continue; + + token = match_token(p, vnic_opt_tokens, args); + opt_mask |= token; + + switch (token) { + case VNIC_OPT_IOC_GUID: + p = match_strdup(args); + param->ioc_guid = cpu_to_be64(simple_strtoull(p, NULL, + 16)); + kfree(p); + break; + + case VNIC_OPT_DGID: + p = match_strdup(args); + if (strlen(p) != 32) { + printk(KERN_WARNING PFX + "bad dest GID parameter '%s'\n", p); + kfree(p); + goto out; + } + + for (i = 0; i < 16; ++i) { + strlcpy(dgid, p + i * 2, 3); + param->dgid[i] = simple_strtoul(dgid, NULL, + 16); + + } + kfree(p); + break; + + case VNIC_OPT_PKEY: + if (match_hex(args, &token)) { + printk(KERN_WARNING PFX + "bad P_key parameter '%s'\n", p); + goto out; + } + param->pkey = cpu_to_be16(token); + break; + + case VNIC_OPT_NAME: + p = match_strdup(args); + if (strlen(p) >= IFNAMSIZ) { + printk(KERN_WARNING PFX + "interface name parameter too long\n"); + kfree(p); + goto out; + } + strcpy(param->name, p); + kfree(p); + break; + case VNIC_OPT_INSTANCE: + if (match_int(args, &token)) { + printk(KERN_WARNING PFX + "bad instance parameter '%s'\n", p); + goto out; + } + + if (token > 255 || token < 0) { + printk(KERN_WARNING PFX + "instance parameter must be" + " >= 0 and <= 255\n"); + goto out; + } + + param->instance = token; + break; + case VNIC_OPT_RXCSUM: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->rx_csum = 1; + else if (!strncmp(p, "false", 5)) + param->rx_csum = 0; + else { + printk(KERN_WARNING PFX + "bad rx_csum parameter." + " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + case VNIC_OPT_TXCSUM: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->tx_csum = 1; + else if (!strncmp(p, "false", 5)) + param->tx_csum = 0; + else { + printk(KERN_WARNING PFX + "bad tx_csum parameter." 
+ " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + case VNIC_OPT_HEARTBEAT: + if (match_int(args, &token)) { + printk(KERN_WARNING PFX + "bad instance parameter '%s'\n", p); + goto out; + } + + if (token > 6000 || token <= 0) { + printk(KERN_WARNING PFX + "heartbeat parameter must be" + " > 0 and <= 6000\n"); + goto out; + } + param->heartbeat = token; + break; + case VNIC_OPT_IOC_STRING: + p = match_strdup(args); + len = strlen(p); + if (len > MAX_IOC_STRING_LEN) { + printk(KERN_WARNING PFX + "ioc string parameter too long\n"); + kfree(p); + goto out; + } + strcpy(param->ioc_string, p); + if (*(p + len - 1) != '\"') { + strcat(param->ioc_string, ","); + kfree(p); + p = strsep(&sep_opt, "\""); + strcat(param->ioc_string, p); + sep_opt++; + } else { + *(param->ioc_string + len - 1) = '\0'; + kfree(p); + } + break; + case VNIC_OPT_IB_MULTICAST: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->ib_multicast = 1; + else if (!strncmp(p, "false", 5)) + param->ib_multicast = 0; + else { + printk(KERN_WARNING PFX + "bad ib_multicast parameter." + " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + default: + printk(KERN_WARNING PFX + "unknown parameter or missing value " + "'%s' in target creation request\n", p); + goto out; + } + + } + + if ((opt_mask & VNIC_OPT_ALL) == VNIC_OPT_ALL) + ret = 0; + else + for (i = 0; i < ARRAY_SIZE(vnic_opt_tokens); ++i) + if ((vnic_opt_tokens[i].token & VNIC_OPT_ALL) && + !(vnic_opt_tokens[i].token & opt_mask)) + printk(KERN_WARNING PFX + "target creation request is " + "missing parameter '%s'\n", + vnic_opt_tokens[i].pattern); + +out: + kfree(options); + return ret; + +} + +static ssize_t show_vnic_state(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + switch (vnic->state) { + case VNIC_UNINITIALIZED: + return sprintf(buf, "VNIC_UNINITIALIZED\n"); + case VNIC_REGISTERED: + return sprintf(buf, "VNIC_REGISTERED\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + } + +} + +static DEVICE_ATTR(vnic_state, S_IRUGO, show_vnic_state, NULL); + +static ssize_t show_rx_csum(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + + if (vnic->config->use_rx_csum) + return sprintf(buf, "true\n"); + else + return sprintf(buf, "false\n"); +} + +static DEVICE_ATTR(rx_csum, S_IRUGO, show_rx_csum, NULL); + +static ssize_t show_tx_csum(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + + if (vnic->config->use_tx_csum) + return sprintf(buf, "true\n"); + else + return sprintf(buf, "false\n"); +} + +static DEVICE_ATTR(tx_csum, S_IRUGO, show_tx_csum, NULL); + +static ssize_t show_current_path(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, dev_info); + unsigned long flags; + size_t length; + + spin_lock_irqsave(&vnic->current_path_lock, flags); + if (vnic->current_path == &vnic->primary_path) + length = sprintf(buf, "primary_path\n"); + else if (vnic->current_path == &vnic->secondary_path) + length = 
sprintf(buf, "secondary path\n"); + else + length = sprintf(buf, "none\n"); + spin_unlock_irqrestore(&vnic->current_path_lock, flags); + return length; +} + +static DEVICE_ATTR(current_path, S_IRUGO, show_current_path, NULL); + +static struct attribute *vnic_dev_attrs[] = { + &dev_attr_vnic_state.attr, + &dev_attr_rx_csum.attr, + &dev_attr_tx_csum.attr, + &dev_attr_current_path.attr, + NULL +}; + +struct attribute_group vnic_dev_attr_group = { + .attrs = vnic_dev_attrs, +}; + +static inline void print_dgid(u8 *dgid) +{ + int i; + + for (i = 0; i < 16; i += 2) + printk("%04x", be16_to_cpu(*(__be16 *)&dgid[i])); +} + +static inline int is_dgid_zero(u8 *dgid) +{ + int i; + + for (i = 0; i < 16; i++) { + if (dgid[i] != 0) + return 1; + } + return 0; +} + +static int create_netpath(struct netpath *npdest, + struct path_param *p_params) +{ + struct viport_config *viport_config; + struct viport *viport; + struct vnic *vnic; + struct list_head *ptr; + int ret = 0; + + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (vnic->primary_path.viport) { + viport_config = vnic->primary_path.viport->config; + if ((viport_config->ioc_guid == p_params->ioc_guid) + && (viport_config->control_config.vnic_instance + == p_params->instance) + && (be64_to_cpu(p_params->ioc_guid))) { + SYS_ERROR("GUID %llx," + " INSTANCE %d already in use\n", + be64_to_cpu(p_params->ioc_guid), + p_params->instance); + ret = -EINVAL; + goto out; + } + } + + if (vnic->secondary_path.viport) { + viport_config = vnic->secondary_path.viport->config; + if ((viport_config->ioc_guid == p_params->ioc_guid) + && (viport_config->control_config.vnic_instance + == p_params->instance) + && (be64_to_cpu(p_params->ioc_guid))) { + SYS_ERROR("GUID %llx," + " INSTANCE %d already in use\n", + be64_to_cpu(p_params->ioc_guid), + p_params->instance); + ret = -EINVAL; + goto out; + } + } + } + + if (npdest->viport) { + SYS_ERROR("create_netpath: path already exists\n"); + ret = -EINVAL; + goto out; + } + + viport_config = config_alloc_viport(p_params); + if (!viport_config) { + SYS_ERROR("create_netpath: failed creating viport config\n"); + ret = -1; + goto out; + } + + /*User specified heartbeat value is in 1/100s of a sec*/ + if (p_params->heartbeat != -1) { + viport_config->hb_interval = + msecs_to_jiffies(p_params->heartbeat * 10); + viport_config->hb_timeout = + (p_params->heartbeat << 6) * 10000; /* usec */ + } + + viport_config->path_idx = 0; + + viport = viport_allocate(viport_config); + if (!viport) { + SYS_ERROR("create_netpath: failed creating viport\n"); + kfree(viport_config); + ret = -1; + goto out; + } + + npdest->viport = viport; + viport->parent = npdest; + viport->vnic = npdest->parent; + + if (is_dgid_zero(p_params->dgid) && p_params->ioc_guid != 0 + && p_params->pkey != 0) { + viport_kick(viport); + vnic_disconnected(npdest->parent, npdest); + } else { + printk(KERN_WARNING "Specified parameters IOCGUID=%llx, " + "P_Key=%x, DGID=", be64_to_cpu(p_params->ioc_guid), + p_params->pkey); + print_dgid(p_params->dgid); + printk(" insufficient for establishing %s path for interface " + "%s. Hence, path will not be established.\n", + (npdest->second_bias ? 
"secondary" : "primary"), + p_params->name); + } +out: + return ret; +} + +static struct vnic *create_vnic(struct path_param *param) +{ + struct vnic_config *vnic_config; + struct vnic *vnic; + struct list_head *ptr; + + SYS_INFO("create_vnic: name = %s\n", param->name); + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strcmp(vnic->config->name, param->name)) { + SYS_ERROR("vnic %s already exists\n", + param->name); + return NULL; + } + } + + vnic_config = config_alloc_vnic(); + if (!vnic_config) { + SYS_ERROR("create_vnic: failed creating vnic config\n"); + return NULL; + } + + if (param->rx_csum != -1) + vnic_config->use_rx_csum = param->rx_csum; + + if (param->tx_csum != -1) + vnic_config->use_tx_csum = param->tx_csum; + + strcpy(vnic_config->name, param->name); + vnic = vnic_allocate(vnic_config); + if (!vnic) { + SYS_ERROR("create_vnic: failed allocating vnic\n"); + goto free_vnic_config; + } + + init_completion(&vnic->dev_info.released); + + vnic->dev_info.dev.class = NULL; + vnic->dev_info.dev.parent = &interface_dev.dev; + vnic->dev_info.dev.release = vnic_release_dev; + snprintf(vnic->dev_info.dev.bus_id, BUS_ID_SIZE, + vnic_config->name); + + if (device_register(&vnic->dev_info.dev)) { + SYS_ERROR("create_vnic: error in registering" + " vnic class dev\n"); + goto free_vnic; + } + + if (sysfs_create_group(&vnic->dev_info.dev.kobj, + &vnic_dev_attr_group)) { + SYS_ERROR("create_vnic: error in creating" + "vnic attr group\n"); + goto err_attr; + + } + + if (vnic_setup_stats_files(vnic)) + goto err_stats; + + return vnic; +err_stats: + sysfs_remove_group(&vnic->dev_info.dev.kobj, + &vnic_dev_attr_group); +err_attr: + device_unregister(&vnic->dev_info.dev); + wait_for_completion(&vnic->dev_info.released); +free_vnic: + list_del(&vnic->list_ptrs); + kfree(vnic); +free_vnic_config: + kfree(vnic_config); + return NULL; +} + +static ssize_t vnic_delete(struct device *dev, struct device_attribute *dev_attr, + const char *buf, size_t count) +{ + struct vnic *vnic; + struct list_head *ptr; + int ret = -EINVAL; + + if (count > IFNAMSIZ) { + printk(KERN_WARNING PFX "invalid vnic interface name\n"); + return ret; + } + + SYS_INFO("vnic_delete: name = %s\n", buf); + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strcmp(vnic->config->name, buf)) { + vnic_free(vnic); + return count; + } + } + + printk(KERN_WARNING PFX "vnic interface '%s' does not exist\n", buf); + return ret; +} + +DEVICE_ATTR(delete_vnic, S_IWUSR, NULL, vnic_delete); + +static ssize_t show_viport_state(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct netpath *path = container_of(info, struct netpath, dev_info); + switch (path->viport->state) { + case VIPORT_DISCONNECTED: + return sprintf(buf, "VIPORT_DISCONNECTED\n"); + case VIPORT_CONNECTED: + return sprintf(buf, "VIPORT_CONNECTED\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + } + +} + +static DEVICE_ATTR(viport_state, S_IRUGO, show_viport_state, NULL); + +static ssize_t show_link_state(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct netpath *path = container_of(info, struct netpath, dev_info); + + switch (path->viport->link_state) { + case LINK_UNINITIALIZED: + return sprintf(buf, "LINK_UNINITIALIZED\n"); + case LINK_INITIALIZE: + return sprintf(buf, "LINK_INITIALIZE\n"); + case 
LINK_INITIALIZECONTROL: + return sprintf(buf, "LINK_INITIALIZECONTROL\n"); + case LINK_INITIALIZEDATA: + return sprintf(buf, "LINK_INITIALIZEDATA\n"); + case LINK_CONTROLCONNECT: + return sprintf(buf, "LINK_CONTROLCONNECT\n"); + case LINK_CONTROLCONNECTWAIT: + return sprintf(buf, "LINK_CONTROLCONNECTWAIT\n"); + case LINK_INITVNICREQ: + return sprintf(buf, "LINK_INITVNICREQ\n"); + case LINK_INITVNICRSP: + return sprintf(buf, "LINK_INITVNICRSP\n"); + case LINK_BEGINDATAPATH: + return sprintf(buf, "LINK_BEGINDATAPATH\n"); + case LINK_CONFIGDATAPATHREQ: + return sprintf(buf, "LINK_CONFIGDATAPATHREQ\n"); + case LINK_CONFIGDATAPATHRSP: + return sprintf(buf, "LINK_CONFIGDATAPATHRSP\n"); + case LINK_DATACONNECT: + return sprintf(buf, "LINK_DATACONNECT\n"); + case LINK_DATACONNECTWAIT: + return sprintf(buf, "LINK_DATACONNECTWAIT\n"); + case LINK_XCHGPOOLREQ: + return sprintf(buf, "LINK_XCHGPOOLREQ\n"); + case LINK_XCHGPOOLRSP: + return sprintf(buf, "LINK_XCHGPOOLRSP\n"); + case LINK_INITIALIZED: + return sprintf(buf, "LINK_INITIALIZED\n"); + case LINK_IDLE: + return sprintf(buf, "LINK_IDLE\n"); + case LINK_IDLING: + return sprintf(buf, "LINK_IDLING\n"); + case LINK_CONFIGLINKREQ: + return sprintf(buf, "LINK_CONFIGLINKREQ\n"); + case LINK_CONFIGLINKRSP: + return sprintf(buf, "LINK_CONFIGLINKRSP\n"); + case LINK_CONFIGADDRSREQ: + return sprintf(buf, "LINK_CONFIGADDRSREQ\n"); + case LINK_CONFIGADDRSRSP: + return sprintf(buf, "LINK_CONFIGADDRSRSP\n"); + case LINK_REPORTSTATREQ: + return sprintf(buf, "LINK_REPORTSTATREQ\n"); + case LINK_REPORTSTATRSP: + return sprintf(buf, "LINK_REPORTSTATRSP\n"); + case LINK_HEARTBEATREQ: + return sprintf(buf, "LINK_HEARTBEATREQ\n"); + case LINK_HEARTBEATRSP: + return sprintf(buf, "LINK_HEARTBEATRSP\n"); + case LINK_RESET: + return sprintf(buf, "LINK_RESET\n"); + case LINK_RESETRSP: + return sprintf(buf, "LINK_RESETRSP\n"); + case LINK_RESETCONTROL: + return sprintf(buf, "LINK_RESETCONTROL\n"); + case LINK_RESETCONTROLRSP: + return sprintf(buf, "LINK_RESETCONTROLRSP\n"); + case LINK_DATADISCONNECT: + return sprintf(buf, "LINK_DATADISCONNECT\n"); + case LINK_CONTROLDISCONNECT: + return sprintf(buf, "LINK_CONTROLDISCONNECT\n"); + case LINK_CLEANUPDATA: + return sprintf(buf, "LINK_CLEANUPDATA\n"); + case LINK_CLEANUPCONTROL: + return sprintf(buf, "LINK_CLEANUPCONTROL\n"); + case LINK_DISCONNECTED: + return sprintf(buf, "LINK_DISCONNECTED\n"); + case LINK_RETRYWAIT: + return sprintf(buf, "LINK_RETRYWAIT\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + + } + +} +static DEVICE_ATTR(link_state, S_IRUGO, show_link_state, NULL); + +static ssize_t show_heartbeat(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + /* hb_inteval is in jiffies, convert it back to + * 1/100ths of a second + */ + return sprintf(buf, "%d\n", + (jiffies_to_msecs(path->viport->config->hb_interval)/10)); +} + +static DEVICE_ATTR(heartbeat, S_IRUGO, show_heartbeat, NULL); + +static ssize_t show_ioc_guid(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "%llx\n", + __be64_to_cpu(path->viport->config->ioc_guid)); +} + +static DEVICE_ATTR(ioc_guid, S_IRUGO, show_ioc_guid, NULL); + +static inline void get_dgid_string(u8 *dgid, char *buf) +{ + int i; + char 
holder[5]; + + for (i = 0; i < 16; i += 2) { + sprintf(holder, "%04x", be16_to_cpu(*(__be16 *)&dgid[i])); + strcat(buf, holder); + } + + strcat(buf, "\n"); +} + +static ssize_t show_dgid(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + get_dgid_string(path->viport->config->path_info.path.dgid.raw, buf); + + return strlen(buf); +} + +static DEVICE_ATTR(dgid, S_IRUGO, show_dgid, NULL); + +static ssize_t show_pkey(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "%x\n", path->viport->config->path_info.path.pkey); +} + +static DEVICE_ATTR(pkey, S_IRUGO, show_pkey, NULL); + +static ssize_t show_hca_info(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "vnic-%s-%d\n", path->viport->config->ibdev->name, + path->viport->config->port); +} + +static DEVICE_ATTR(hca_info, S_IRUGO, show_hca_info, NULL); + +static ssize_t show_ioc_string(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + return sprintf(buf, "%s\n", path->viport->config->ioc_string); +} + +static DEVICE_ATTR(ioc_string, S_IRUGO, show_ioc_string, NULL); + +static ssize_t show_multicast_state(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + + struct netpath *path = container_of(info, struct netpath, dev_info); + + if (!(path->viport->features_supported & VNIC_FEAT_INBOUND_IB_MC)) + return sprintf(buf, "feature not enabled\n"); + + switch (path->viport->mc_info.state) { + case MCAST_STATE_INVALID: + return sprintf(buf, "state=Invalid\n"); + case MCAST_STATE_JOINING: + return sprintf(buf, "state=Joining MGID:" VNIC_GID_FMT "\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw)); + case MCAST_STATE_ATTACHING: + return sprintf(buf, "state=Attaching MGID:" VNIC_GID_FMT + " MLID:%X\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw), + path->viport->mc_info.mlid); + case MCAST_STATE_JOINED_ATTACHED: + return sprintf(buf, + "state=Joined & Attached MGID:" VNIC_GID_FMT + " MLID:%X\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw), + path->viport->mc_info.mlid); + case MCAST_STATE_DETACHING: + return sprintf(buf, "state=Detaching MGID: " VNIC_GID_FMT "\n", + VNIC_GID_RAW_ARG(path->viport->mc_info.mgid.raw)); + case MCAST_STATE_RETRIED: + return sprintf(buf, "state=Retries Exceeded\n"); + } + return sprintf(buf, "invalid state\n"); +} + +static DEVICE_ATTR(multicast_state, S_IRUGO, show_multicast_state, NULL); + +static struct attribute *vnic_path_attrs[] = { + &dev_attr_viport_state.attr, + &dev_attr_link_state.attr, + &dev_attr_heartbeat.attr, + &dev_attr_ioc_guid.attr, + &dev_attr_dgid.attr, + &dev_attr_pkey.attr, + &dev_attr_hca_info.attr, + &dev_attr_ioc_string.attr, + &dev_attr_multicast_state.attr, + NULL +}; + +struct attribute_group vnic_path_attr_group = { + .attrs = vnic_path_attrs, +}; + + +static int setup_path_class_files(struct netpath *path, 
char *name) +{ + init_completion(&path->dev_info.released); + + path->dev_info.dev.class = NULL; + path->dev_info.dev.parent = &path->parent->dev_info.dev; + path->dev_info.dev.release = vnic_release_dev; + snprintf(path->dev_info.dev.bus_id, BUS_ID_SIZE, name); + + if (device_register(&path->dev_info.dev)) { + SYS_ERROR("error in registering path class dev\n"); + goto out; + } + + if (sysfs_create_group(&path->dev_info.dev.kobj, + &vnic_path_attr_group)) { + SYS_ERROR("error in creating vnic path group attrs"); + goto err_path; + } + + return 0; + +err_path: + device_unregister(&path->dev_info.dev); + wait_for_completion(&path->dev_info.released); +out: + return -1; + +} + +static inline void update_dgids(u8 *old, u8 *new, char *vnic_name, + char *path_name) +{ + int i; + + if (!memcmp(old, new, 16)) + return; + + printk(KERN_INFO PFX "Changing dgid from 0x"); + print_dgid(old); + printk(" to 0x"); + print_dgid(new); + printk(" for %s path of %s\n", path_name, vnic_name); + for (i = 0; i < 16; i++) + old[i] = new[i]; +} + +static inline void update_ioc_guids(struct path_param *params, + struct netpath *path, + char *vnic_name, char *path_name) +{ + u64 sid; + + if (path->viport->config->ioc_guid == params->ioc_guid) + return; + + printk(KERN_INFO PFX "Changing IOC GUID from 0x%llx to 0x%llx " + "for %s path of %s\n", + __be64_to_cpu(path->viport->config->ioc_guid), + __be64_to_cpu(params->ioc_guid), path_name, vnic_name); + + path->viport->config->ioc_guid = params->ioc_guid; + + sid = (SST_AGN << 56) | (SST_OUI << 32) | (CONTROL_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + path->viport->config->control_config.ib_config.service_id = + cpu_to_be64(sid); + + sid = (SST_AGN << 56) | (SST_OUI << 32) | (DATA_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + path->viport->config->data_config.ib_config.service_id = + cpu_to_be64(sid); +} + +static inline void update_pkeys(__be16 *old, __be16 *new, char *vnic_name, + char *path_name) +{ + if (*old == *new) + return; + + printk(KERN_INFO PFX "Changing P_Key from 0x%x to 0x%x " + "for %s path of %s\n", *old, *new, + path_name, vnic_name); + *old = *new; +} + +static void update_ioc_strings(struct path_param *params, struct netpath *path, + char *path_name) +{ + if (!strcmp(params->ioc_string, path->viport->config->ioc_string)) + return; + + printk(KERN_INFO PFX "Changing ioc_string to %s for %s path of %s\n", + params->ioc_string, path_name, params->name); + + strcpy(path->viport->config->ioc_string, params->ioc_string); +} + +static void update_path_parameters(struct path_param *params, + struct netpath *path) +{ + update_dgids(path->viport->config->path_info.path.dgid.raw, + params->dgid, params->name, + (path->second_bias ? "secondary" : "primary")); + + update_ioc_guids(params, path, params->name, + (path->second_bias ? "secondary" : "primary")); + + update_pkeys(&path->viport->config->path_info.path.pkey, + ¶ms->pkey, params->name, + (path->second_bias ? "secondary" : "primary")); + + update_ioc_strings(params, path, + (path->second_bias ? 
"secondary" : "primary")); +} + +static ssize_t update_params_and_connect(struct path_param *params, + struct netpath *path, size_t count) +{ + if (is_dgid_zero(params->dgid) && params->ioc_guid != 0 && + params->pkey != 0) { + + if (!memcmp(path->viport->config->path_info.path.dgid.raw, + params->dgid, 16) && + params->ioc_guid == path->viport->config->ioc_guid && + params->pkey == path->viport->config->path_info.path.pkey) { + + printk(KERN_WARNING PFX "All of the dgid, ioc_guid and " + "pkeys are same as the existing" + " one. Not updating values.\n"); + return -EINVAL; + } else { + if (path->viport->state == VIPORT_CONNECTED) { + printk(KERN_WARNING PFX "%s path of %s " + "interface is already in connected " + "state. Not updating values.\n", + (path->second_bias ? "Secondary" : "Primary"), + path->parent->config->name); + return -EINVAL; + } else { + update_path_parameters(params, path); + viport_kick(path->viport); + vnic_disconnected(path->parent, path); + return count; + } + } + } else { + printk(KERN_WARNING PFX "Either dgid, iocguid, pkey is zero. " + "No update.\n"); + return -EINVAL; + } +} + +static ssize_t vnic_create_primary(struct device *dev, + struct device_attribute *dev_attr, + const char *buf, size_t count) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic_ib_port *target = + container_of(info, struct vnic_ib_port, pdev_info); + + struct path_param param; + int ret = -EINVAL; + struct vnic *vnic; + struct list_head *ptr; + + param.instance = 0; + param.rx_csum = -1; + param.tx_csum = -1; + param.heartbeat = -1; + param.ib_multicast = -1; + *param.ioc_string = '\0'; + + ret = vnic_parse_options(buf, ¶m); + + if (ret) + goto out; + + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strcmp(vnic->config->name, param.name)) { + ret = update_params_and_connect(¶m, + &vnic->primary_path, + count); + goto out; + } + } + + param.ibdev = target->dev->dev; + param.ibport = target; + param.port = target->port_num; + + vnic = create_vnic(¶m); + if (!vnic) { + printk(KERN_ERR PFX "creating vnic failed\n"); + ret = -EINVAL; + goto out; + } + + if (create_netpath(&vnic->primary_path, ¶m)) { + printk(KERN_ERR PFX "creating primary netpath failed\n"); + goto free_vnic; + } + + if (setup_path_class_files(&vnic->primary_path, "primary_path")) + goto free_vnic; + + if (vnic && !vnic->primary_path.viport) { + printk(KERN_ERR PFX "no valid netpaths\n"); + goto free_vnic; + } + + return count; + +free_vnic: + vnic_free(vnic); + ret = -EINVAL; +out: + return ret; +} + +DEVICE_ATTR(create_primary, S_IWUSR, NULL, vnic_create_primary); + +static ssize_t vnic_create_secondary(struct device *dev, + struct device_attribute *dev_attr, + const char *buf, size_t count) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic_ib_port *target = + container_of(info, struct vnic_ib_port, pdev_info); + + struct path_param param; + struct vnic *vnic = NULL; + int ret = -EINVAL; + struct list_head *ptr; + int found = 0; + + param.instance = 0; + param.rx_csum = -1; + param.tx_csum = -1; + param.heartbeat = -1; + param.ib_multicast = -1; + *param.ioc_string = '\0'; + + ret = vnic_parse_options(buf, ¶m); + + if (ret) + goto out; + + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strncmp(vnic->config->name, param.name, IFNAMSIZ)) { + if (vnic->secondary_path.viport) { + ret = update_params_and_connect(¶m, + &vnic->secondary_path, + count); + goto out; + } + found = 1; + 
break; + } + } + + if (!found) { + printk(KERN_ERR PFX + "primary connection with name '%s' does not exist\n", + param.name); + ret = -EINVAL; + goto out; + } + + param.ibdev = target->dev->dev; + param.ibport = target; + param.port = target->port_num; + + if (create_netpath(&vnic->secondary_path, ¶m)) { + printk(KERN_ERR PFX "creating secondary netpath failed\n"); + ret = -EINVAL; + goto out; + } + + if (setup_path_class_files(&vnic->secondary_path, "secondary_path")) + goto free_vnic; + + return count; + +free_vnic: + vnic_free(vnic); + ret = -EINVAL; +out: + return ret; +} + +DEVICE_ATTR(create_secondary, S_IWUSR, NULL, vnic_create_secondary); diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h new file mode 100644 index 0000000..7e6aa8d --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h @@ -0,0 +1,51 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_SYS_H_INCLUDED +#define VNIC_SYS_H_INCLUDED + +struct dev_info { + struct device dev; + struct completion released; +}; + +extern struct class vnic_class; +extern struct dev_info interface_dev; +extern struct attribute_group vnic_dev_attr_group; +extern struct attribute_group vnic_path_attr_group; +extern struct device_attribute dev_attr_create_primary; +extern struct device_attribute dev_attr_create_secondary; +extern struct device_attribute dev_attr_delete_vnic; + +extern void vnic_release_dev(struct device *dev); + +#endif /*VNIC_SYS_H_INCLUDED*/ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:58:24 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:28:24 +0530 Subject: [ofa-general] [PATCH v3 09/13] QLogic VNIC: IB Multicast for Ethernet broadcast/multicast In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095824.9943.36889.stgit@localhost.localdomain> From: Usha Srinivasan Implementation of ethernet broadcasting and multicasting for QLogic VNIC interface by making use of underlying IB multicasting. 
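
At a high level the driver follows the standard kernel ULP multicast sequence:
issue an SA join with ib_sa_join_multicast(), and once the join callback reports
success, attach the UD QP to the group's MGID/MLID with ib_attach_mcast() so that
EVIC-forwarded broadcast/multicast frames are delivered. A condensed sketch of
that sequence is below (my_ctx, my_join and my_join_done are illustrative names
only, not driver code; the driver itself defers the attach out of the callback
into its viport state machine via the NEED_MCAST_COMPLETION flag):

#include <linux/err.h>
#include <rdma/ib_verbs.h>
#include <rdma/ib_sa.h>

static struct ib_sa_client my_sa_client;	/* ib_sa_register_client() at init */

struct my_ctx {
	struct ib_device	*device;
	u8			port_num;
	struct ib_qp		*qp;		/* UD QP that should see mcast frames */
	union ib_gid		mgid;		/* group GID, e.g. handed out by EVIC */
	union ib_gid		port_gid;	/* our port GID */
	struct ib_sa_multicast	*mcast;
};

/* SA join completion; a non-zero return tells the SA layer to drop
 * its tracking structure for this join. */
static int my_join_done(int status, struct ib_sa_multicast *mcast)
{
	struct my_ctx *ctx = mcast->context;

	if (status)
		return status;

	/* attach the UD QP so it receives packets sent to the group */
	return ib_attach_mcast(ctx->qp, &mcast->rec.mgid,
			       be16_to_cpu(mcast->rec.mlid));
}

static int my_join(struct my_ctx *ctx)
{
	struct ib_sa_mcmember_rec rec = {
		.mgid		= ctx->mgid,
		.port_gid	= ctx->port_gid,
		.join_state	= 2,	/* non-member join, as this patch uses */
	};
	ib_sa_comp_mask mask = IB_SA_MCMEMBER_REC_MGID |
			       IB_SA_MCMEMBER_REC_PORT_GID |
			       IB_SA_MCMEMBER_REC_JOIN_STATE;

	ctx->mcast = ib_sa_join_multicast(&my_sa_client, ctx->device,
					  ctx->port_num, &rec, mask,
					  GFP_KERNEL, my_join_done, ctx);
	return IS_ERR(ctx->mcast) ? PTR_ERR(ctx->mcast) : 0;
}
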
Signed-off-by: Usha Srinivasan Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c | 319 +++++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h | 77 +++++ 2 files changed, 396 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c new file mode 100644 index 0000000..f40ea20 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.c @@ -0,0 +1,319 @@ +/* + * Copyright (c) 2008 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include "vnic_viport.h" +#include "vnic_main.h" +#include "vnic_util.h" + +static inline void vnic_set_multicast_state_invalid(struct viport *viport) +{ + viport->mc_info.state = MCAST_STATE_INVALID; + viport->mc_info.mc = NULL; + memset(&viport->mc_info.mgid, 0, sizeof(union ib_gid)); +} + +int vnic_mc_init(struct viport *viport) +{ + MCAST_FUNCTION("vnic_mc_init %p\n", viport); + vnic_set_multicast_state_invalid(viport); + viport->mc_info.retries = 0; + spin_lock_init(&viport->mc_info.lock); + + return 0; +} + +void vnic_mc_uninit(struct viport *viport) +{ + unsigned long flags; + MCAST_FUNCTION("vnic_mc_uninit %p\n", viport); + + spin_lock_irqsave(&viport->mc_info.lock, flags); + if ((viport->mc_info.state != MCAST_STATE_INVALID) && + (viport->mc_info.state != MCAST_STATE_RETRIED)) { + MCAST_ERROR("%s mcast state is not INVALID or RETRIED %d\n", + control_ifcfg_name(&viport->control), + viport->mc_info.state); + } + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_FUNCTION("vnic_mc_uninit done\n"); +} + + +/* This function is called when NEED_MCAST_COMPLETION is set. + * It finishes off the join multicast work. 
+ */ +int vnic_mc_join_handle_completion(struct viport *viport) +{ + unsigned int ret = 0; + + MCAST_FUNCTION("vnic_mc_join_handle_completion()\n"); + if (viport->mc_info.state != MCAST_STATE_JOINING) { + MCAST_ERROR("%s unexpected mcast state in handle_completion: " + " %d\n", control_ifcfg_name(&viport->control), + viport->mc_info.state); + ret = -1; + goto out; + } + viport->mc_info.state = MCAST_STATE_ATTACHING; + MCAST_INFO("%s Attaching QP %lx mgid:" + VNIC_GID_FMT " mlid:%x\n", + control_ifcfg_name(&viport->control), jiffies, + VNIC_GID_RAW_ARG(viport->mc_info.mgid.raw), + viport->mc_info.mlid); + ret = ib_attach_mcast(viport->mc_data.ib_conn.qp, &viport->mc_info.mgid, + viport->mc_info.mlid); + if (ret) { + MCAST_ERROR("%s Attach mcast qp failed %d\n", + control_ifcfg_name(&viport->control), ret); + ret = -1; + goto out; + } + viport->mc_info.state = MCAST_STATE_JOINED_ATTACHED; + MCAST_INFO("%s UD QP successfully attached to mcast group\n", + control_ifcfg_name(&viport->control)); + +out: + return ret; +} + +/* NOTE: ib_sa.h says "returning a non-zero value from this callback will + * result in destroying the multicast tracking structure. + */ +static int vnic_mc_join_complete(int status, + struct ib_sa_multicast *multicast) +{ + struct viport *viport = (struct viport *)multicast->context; + unsigned long flags; + + MCAST_FUNCTION("vnic_mc_join_complete() status:%x\n", status); + if (status) { + spin_lock_irqsave(&viport->mc_info.lock, flags); + if (status == -ENETRESET) { + vnic_set_multicast_state_invalid(viport); + viport->mc_info.retries = 0; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_ERROR("%s got ENETRESET\n", + control_ifcfg_name(&viport->control)); + goto out; + } + /* perhaps the mcgroup hasn't yet been created - retry */ + viport->mc_info.retries++; + viport->mc_info.mc = NULL; + if (viport->mc_info.retries > MAX_MCAST_JOIN_RETRIES) { + viport->mc_info.state = MCAST_STATE_RETRIED; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_ERROR("%s join failed 0x%x - max retries:%d " + "exceeded\n", + control_ifcfg_name(&viport->control), + status, viport->mc_info.retries); + } else { + viport->mc_info.state = MCAST_STATE_INVALID; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_MCAST_JOIN; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + MCAST_ERROR("%s join failed 0x%x - retrying; " + "retries:%d\n", + control_ifcfg_name(&viport->control), + status, viport->mc_info.retries); + } + goto out; + } + + /* finish join work from main state loop for viport - in case + * the work itself cannot be done in a callback environment */ + spin_lock_irqsave(&viport->lock, flags); + viport->mc_info.mlid = be16_to_cpu(multicast->rec.mlid); + viport->updates |= NEED_MCAST_COMPLETION; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + MCAST_INFO("%s setting NEED_MCAST_COMPLETION %x %x\n", + control_ifcfg_name(&viport->control), + multicast->rec.mlid, viport->mc_info.mlid); +out: + return status; +} + +void vnic_mc_join_setup(struct viport *viport, union ib_gid *mgid) +{ + unsigned long flags; + + MCAST_FUNCTION("in vnic_mc_join_setup\n"); + spin_lock_irqsave(&viport->mc_info.lock, flags); + if (viport->mc_info.state != MCAST_STATE_INVALID) { + if (viport->mc_info.state == MCAST_STATE_DETACHING) + MCAST_ERROR("%s detach in progress\n", + control_ifcfg_name(&viport->control)); + else if (viport->mc_info.state == MCAST_STATE_RETRIED) + 
MCAST_ERROR("%s max join retries exceeded\n", + control_ifcfg_name(&viport->control)); + else { + /* join/attach in progress or done */ + /* verify that the current mgid is same as prev mgid */ + if (memcmp(mgid, &viport->mc_info.mgid, sizeof(union ib_gid)) != 0) { + /* Separate MGID for each IOC */ + MCAST_ERROR("%s Multicast Group MGIDs not " + "unique; mgids: " VNIC_GID_FMT + " " VNIC_GID_FMT "\n", + control_ifcfg_name(&viport->control), + VNIC_GID_RAW_ARG(mgid->raw), + VNIC_GID_RAW_ARG(viport->mc_info.mgid.raw)); + } else + MCAST_INFO("%s join already issued: %d\n", + control_ifcfg_name(&viport->control), + viport->mc_info.state); + + } + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + return; + } + viport->mc_info.mgid = *mgid; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_MCAST_JOIN; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + MCAST_INFO("%s setting NEED_MCAST_JOIN \n", + control_ifcfg_name(&viport->control)); +} + +int vnic_mc_join(struct viport *viport) +{ + struct ib_sa_mcmember_rec rec; + ib_sa_comp_mask comp_mask; + unsigned long flags; + int ret = 0; + + MCAST_FUNCTION("vnic_mc_join()\n"); + if (!viport->mc_data.ib_conn.qp) { + MCAST_ERROR("%s qp is NULL\n", + control_ifcfg_name(&viport->control)); + ret = -1; + goto out; + } + spin_lock_irqsave(&viport->mc_info.lock, flags); + if (viport->mc_info.state != MCAST_STATE_INVALID) { + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + MCAST_INFO("%s Multicast join already issued\n", + control_ifcfg_name(&viport->control)); + goto out; + } + viport->mc_info.state = MCAST_STATE_JOINING; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + + memset(&rec, 0, sizeof(rec)); + rec.join_state = 2; /* bit 1 is Nonmember */ + rec.mgid = viport->mc_info.mgid; + rec.port_gid = viport->config->path_info.path.sgid; + + comp_mask = IB_SA_MCMEMBER_REC_MGID | + IB_SA_MCMEMBER_REC_PORT_GID | + IB_SA_MCMEMBER_REC_JOIN_STATE; + + MCAST_INFO("%s Joining Multicast group%lx mgid:" + VNIC_GID_FMT " port_gid: " VNIC_GID_FMT "\n", + control_ifcfg_name(&viport->control), jiffies, + VNIC_GID_RAW_ARG(rec.mgid.raw), + VNIC_GID_RAW_ARG(rec.port_gid.raw)); + + viport->mc_info.mc = ib_sa_join_multicast(&vnic_sa_client, + viport->config->ibdev, viport->config->port, + &rec, comp_mask, GFP_KERNEL, + vnic_mc_join_complete, viport); + + if (IS_ERR(viport->mc_info.mc)) { + MCAST_ERROR("%s Multicast joining failed " VNIC_GID_FMT + ".\n", + control_ifcfg_name(&viport->control), + VNIC_GID_RAW_ARG(rec.mgid.raw)); + viport->mc_info.state = MCAST_STATE_INVALID; + ret = -1; + goto out; + } + MCAST_INFO("%s Multicast group join issued mgid:" + VNIC_GID_FMT " port_gid: " VNIC_GID_FMT "\n", + control_ifcfg_name(&viport->control), + VNIC_GID_RAW_ARG(rec.mgid.raw), + VNIC_GID_RAW_ARG(rec.port_gid.raw)); +out: + return ret; +} + +void vnic_mc_leave(struct viport *viport) +{ + unsigned long flags; + unsigned int ret; + struct ib_sa_multicast *mc; + + MCAST_FUNCTION("vnic_mc_leave()\n"); + + spin_lock_irqsave(&viport->mc_info.lock, flags); + if ((viport->mc_info.state == MCAST_STATE_INVALID) || + (viport->mc_info.state == MCAST_STATE_RETRIED)) { + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + return; + } + + if (viport->mc_info.state == MCAST_STATE_JOINED_ATTACHED) { + + viport->mc_info.state = MCAST_STATE_DETACHING; + spin_unlock_irqrestore(&viport->mc_info.lock, flags); + ret = ib_detach_mcast(viport->mc_data.ib_conn.qp, + &viport->mc_info.mgid, + 
viport->mc_info.mlid);
+		if (ret) {
+			MCAST_ERROR("%s UD QP Detach failed %d\n",
+				    control_ifcfg_name(&viport->control), ret);
+			return;
+		}
+		MCAST_INFO("%s UD QP detached successfully\n",
+			   control_ifcfg_name(&viport->control));
+		spin_lock_irqsave(&viport->mc_info.lock, flags);
+	}
+	mc = viport->mc_info.mc;
+	vnic_set_multicast_state_invalid(viport);
+	viport->mc_info.retries = 0;
+	spin_unlock_irqrestore(&viport->mc_info.lock, flags);
+
+	if (mc) {
+		MCAST_INFO("%s Freeing up multicast structure.\n",
+			   control_ifcfg_name(&viport->control));
+		ib_sa_free_multicast(mc);
+	}
+	MCAST_FUNCTION("vnic_mc_leave done\n");
+	return;
+}

diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h
new file mode 100644
index 0000000..e049180
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_multicast.h
@@ -0,0 +1,77 @@
+/*
+ * Copyright (c) 2008 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *	Redistribution and use in source and binary forms, with or
+ *	without modification, are permitted provided that the following
+ *	conditions are met:
+ *
+ *	- Redistributions of source code must retain the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer.
+ *
+ *	- Redistributions in binary form must reproduce the above
+ *	  copyright notice, this list of conditions and the following
+ *	  disclaimer in the documentation and/or other materials
+ *	  provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef __VNIC_MULTICAST_H__
+#define __VNIC_MULTICAST_H__
+
+enum {
+	MCAST_STATE_INVALID = 0x00,	/* join not attempted or failed */
+	MCAST_STATE_JOINING = 0x01,	/* join mcgroup in progress */
+	MCAST_STATE_ATTACHING = 0x02,	/* join completed with success,
+					 * attach qp to mcgroup in progress
+					 */
+	MCAST_STATE_JOINED_ATTACHED = 0x03,	/* join completed with success */
+	MCAST_STATE_DETACHING = 0x04,	/* detach qp in progress */
+	MCAST_STATE_RETRIED = 0x05,	/* retried join and failed */
+};
+
+#define MAX_MCAST_JOIN_RETRIES	5	/* used to retry join */
+
+struct mc_info {
+	u8			state;
+	spinlock_t		lock;
+	union ib_gid		mgid;
+	u16			mlid;
+	struct ib_sa_multicast	*mc;
+	u8			retries;
+};
+
+
+int vnic_mc_init(struct viport *viport);
+void vnic_mc_uninit(struct viport *viport);
+extern char *control_ifcfg_name(struct control *control);
+
+/* This function is called when a viport gets a multicast mgid from EVIC
+   and must join the multicast group. It sets the NEED_MCAST_JOIN flag, which
+   results in vnic_mc_join being called later. */
+void vnic_mc_join_setup(struct viport *viport, union ib_gid *mgid);
+
+/* This function is called when the NEED_MCAST_JOIN flag is set.
*/ +int vnic_mc_join(struct viport *viport); + +/* This function is called when NEED_MCAST_COMPLETION is set. + It finishes off the join multicast work. */ +int vnic_mc_join_handle_completion(struct viport *viport); + +void vnic_mc_leave(struct viport *viport); + +#endif /* __VNIC_MULTICAST_H__ */ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:58:54 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:28:54 +0530 Subject: [ofa-general] [PATCH v3 10/13] QLogic VNIC: Driver Statistics collection In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095854.9943.19624.stgit@localhost.localdomain> From: Amar Mudrankit Collection of statistics about QLogic VNIC interfaces is implemented in this patch. Signed-off-by: Amar Mudrankit Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath --- drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c | 234 ++++++++++++ drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h | 497 +++++++++++++++++++++++++ 2 files changed, 731 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c new file mode 100644 index 0000000..d11a8df --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.c @@ -0,0 +1,234 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include + +#include "vnic_main.h" + +cycles_t vnic_recv_ref; + +/* + * TODO: Statistics reporting for control path, data path, + * RDMA times, IOs etc + * + */ +static ssize_t show_lifetime(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time = get_cycles() - vnic->statistics.start_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static DEVICE_ATTR(lifetime, S_IRUGO, show_lifetime, NULL); + +static ssize_t show_conntime(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + if (vnic->statistics.conn_time) + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.conn_time); + return 0; +} + +static DEVICE_ATTR(connection_time, S_IRUGO, show_conntime, NULL); + +static ssize_t show_disconnects(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + u32 num; + + if (vnic->statistics.disconn_ref) + num = vnic->statistics.disconn_num + 1; + else + num = vnic->statistics.disconn_num; + + return sprintf(buf, "%d\n", num); +} + +static DEVICE_ATTR(disconnects, S_IRUGO, show_disconnects, NULL); + +static ssize_t show_total_disconn_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time; + + if (vnic->statistics.disconn_ref) + time = vnic->statistics.disconn_time + + get_cycles() - vnic->statistics.disconn_ref; + else + time = vnic->statistics.disconn_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static DEVICE_ATTR(total_disconn_time, S_IRUGO, show_total_disconn_time, NULL); + +static ssize_t show_carrier_losses(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + u32 num; + + if (vnic->statistics.carrier_ref) + num = vnic->statistics.carrier_off_num + 1; + else + num = vnic->statistics.carrier_off_num; + + return sprintf(buf, "%d\n", num); +} + +static DEVICE_ATTR(carrier_losses, S_IRUGO, show_carrier_losses, NULL); + +static ssize_t show_total_carr_loss_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time; + + if (vnic->statistics.carrier_ref) + time = vnic->statistics.carrier_off_time + + get_cycles() - vnic->statistics.carrier_ref; + else + time = vnic->statistics.carrier_off_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static DEVICE_ATTR(total_carrier_loss_time, S_IRUGO, + show_total_carr_loss_time, NULL); + +static ssize_t show_total_recv_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%llu\n", + (unsigned long 
long)vnic->statistics.recv_time); +} + +static DEVICE_ATTR(total_recv_time, S_IRUGO, show_total_recv_time, NULL); + +static ssize_t show_recvs(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.recv_num); +} + +static DEVICE_ATTR(recvs, S_IRUGO, show_recvs, NULL); + +static ssize_t show_multicast_recvs(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.multicast_recv_num); +} + +static DEVICE_ATTR(multicast_recvs, S_IRUGO, show_multicast_recvs, NULL); + +static ssize_t show_total_xmit_time(struct device *dev, + struct device_attribute *dev_attr, + char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.xmit_time); +} + +static DEVICE_ATTR(total_xmit_time, S_IRUGO, show_total_xmit_time, NULL); + +static ssize_t show_xmits(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.xmit_num); +} + +static DEVICE_ATTR(xmits, S_IRUGO, show_xmits, NULL); + +static ssize_t show_failed_xmits(struct device *dev, + struct device_attribute *dev_attr, char *buf) +{ + struct dev_info *info = container_of(dev, struct dev_info, dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.xmit_fail); +} + +static DEVICE_ATTR(failed_xmits, S_IRUGO, show_failed_xmits, NULL); + +static struct attribute *vnic_stats_attrs[] = { + &dev_attr_lifetime.attr, + &dev_attr_xmits.attr, + &dev_attr_total_xmit_time.attr, + &dev_attr_failed_xmits.attr, + &dev_attr_recvs.attr, + &dev_attr_multicast_recvs.attr, + &dev_attr_total_recv_time.attr, + &dev_attr_connection_time.attr, + &dev_attr_disconnects.attr, + &dev_attr_total_disconn_time.attr, + &dev_attr_carrier_losses.attr, + &dev_attr_total_carrier_loss_time.attr, + NULL +}; + +struct attribute_group vnic_stats_attr_group = { + .attrs = vnic_stats_attrs, +}; diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h new file mode 100644 index 0000000..a241b71 --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_stats.h @@ -0,0 +1,497 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_STATS_H_INCLUDED +#define VNIC_STATS_H_INCLUDED + +#include "vnic_main.h" +#include "vnic_ib.h" +#include "vnic_sys.h" + +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS + +static inline void vnic_connected_stats(struct vnic *vnic) +{ + if (vnic->statistics.conn_time == 0) { + vnic->statistics.conn_time = + get_cycles() - vnic->statistics.start_time; + } + + if (vnic->statistics.disconn_ref != 0) { + vnic->statistics.disconn_time += + get_cycles() - vnic->statistics.disconn_ref; + vnic->statistics.disconn_num++; + vnic->statistics.disconn_ref = 0; + } + +} + +static inline void vnic_stop_xmit_stats(struct vnic *vnic) +{ + if (vnic->statistics.xmit_ref == 0) + vnic->statistics.xmit_ref = get_cycles(); +} + +static inline void vnic_restart_xmit_stats(struct vnic *vnic) +{ + if (vnic->statistics.xmit_ref != 0) { + vnic->statistics.xmit_off_time += + get_cycles() - vnic->statistics.xmit_ref; + vnic->statistics.xmit_off_num++; + vnic->statistics.xmit_ref = 0; + } +} + +static inline void vnic_recv_pkt_stats(struct vnic *vnic) +{ + vnic->statistics.recv_time += get_cycles() - vnic_recv_ref; + vnic->statistics.recv_num++; +} + +static inline void vnic_multicast_recv_pkt_stats(struct vnic *vnic) +{ + vnic->statistics.multicast_recv_num++; +} + +static inline void vnic_pre_pkt_xmit_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void vnic_post_pkt_xmit_stats(struct vnic *vnic, + cycles_t time) +{ + vnic->statistics.xmit_time += get_cycles() - time; + vnic->statistics.xmit_num++; + +} + +static inline void vnic_xmit_fail_stats(struct vnic *vnic) +{ + vnic->statistics.xmit_fail++; +} + +static inline void vnic_carrier_loss_stats(struct vnic *vnic) +{ + if (vnic->statistics.carrier_ref != 0) { + vnic->statistics.carrier_off_time += + get_cycles() - vnic->statistics.carrier_ref; + vnic->statistics.carrier_off_num++; + vnic->statistics.carrier_ref = 0; + } +} + +static inline int vnic_setup_stats_files(struct vnic *vnic) +{ + init_completion(&vnic->stat_info.released); + vnic->stat_info.dev.class = NULL; + vnic->stat_info.dev.parent = &vnic->dev_info.dev; + vnic->stat_info.dev.release = vnic_release_dev; + snprintf(vnic->stat_info.dev.bus_id, BUS_ID_SIZE, + "stats"); + + if (device_register(&vnic->stat_info.dev)) { + SYS_ERROR("create_vnic: error in registering" + " stat class dev\n"); + goto stats_out; + } + + if (sysfs_create_group(&vnic->stat_info.dev.kobj, + &vnic_stats_attr_group)) + goto err_stats_file; + + return 0; +err_stats_file: + device_unregister(&vnic->stat_info.dev); + wait_for_completion(&vnic->stat_info.released); +stats_out: + return -1; +} + +static inline void vnic_cleanup_stats_files(struct vnic *vnic) +{ + sysfs_remove_group(&vnic->dev_info.dev.kobj, + &vnic_stats_attr_group); + device_unregister(&vnic->stat_info.dev); + 
wait_for_completion(&vnic->stat_info.released); +} + +static inline void vnic_disconn_stats(struct vnic *vnic) +{ + if (!vnic->statistics.disconn_ref) + vnic->statistics.disconn_ref = get_cycles(); + + if (vnic->statistics.carrier_ref == 0) + vnic->statistics.carrier_ref = get_cycles(); +} + +static inline void vnic_alloc_stats(struct vnic *vnic) +{ + vnic->statistics.start_time = get_cycles(); +} + +static inline void control_note_rsptime_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void control_update_rsptime_stats(struct control *control, + cycles_t response_time) +{ + response_time -= control->statistics.request_time; + control->statistics.response_time += response_time; + control->statistics.response_num++; + if (control->statistics.response_max < response_time) + control->statistics.response_max = response_time; + if ((control->statistics.response_min == 0) || + (control->statistics.response_min > response_time)) + control->statistics.response_min = response_time; + +} + +static inline void control_note_reqtime_stats(struct control *control) +{ + control->statistics.request_time = get_cycles(); +} + +static inline void control_timeout_stats(struct control *control) +{ + control->statistics.timeout_num++; +} + +static inline void data_kickreq_stats(struct data *data) +{ + data->statistics.kick_reqs++; +} + +static inline void data_no_xmitbuf_stats(struct data *data) +{ + data->statistics.no_xmit_bufs++; +} + +static inline void data_xmits_stats(struct data *data) +{ + data->statistics.xmit_num++; +} + +static inline void data_recvs_stats(struct data *data) +{ + data->statistics.recv_num++; +} + +static inline void data_note_kickrcv_time(void) +{ + vnic_recv_ref = get_cycles(); +} + +static inline void data_rcvkicks_stats(struct data *data) +{ + data->statistics.kick_recvs++; +} + + +static inline void vnic_ib_conntime_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.connection_time = get_cycles(); +} + +static inline void vnic_ib_note_comptime_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void vnic_ib_callback_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.num_callbacks++; +} + +static inline void vnic_ib_comp_stats(struct vnic_ib_conn *ib_conn, + u32 *comp_num) +{ + ib_conn->statistics.num_ios++; + *comp_num = *comp_num + 1; + +} + +static inline void vnic_ib_io_stats(struct io *io, + struct vnic_ib_conn *ib_conn, + cycles_t comp_time) +{ + if ((io->type == RECV) || (io->type == RECV_UD)) + io->time = comp_time; + else if (io->type == RDMA) { + ib_conn->statistics.rdma_comp_time += comp_time - io->time; + ib_conn->statistics.rdma_comp_ios++; + } else if (io->type == SEND) { + ib_conn->statistics.send_comp_time += comp_time - io->time; + ib_conn->statistics.send_comp_ios++; + } +} + +static inline void vnic_ib_maxio_stats(struct vnic_ib_conn *ib_conn, + u32 comp_num) +{ + if (comp_num > ib_conn->statistics.max_ios) + ib_conn->statistics.max_ios = comp_num; +} + +static inline void vnic_ib_connected_time_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.connection_time = + get_cycles() - ib_conn->statistics.connection_time; + +} + +static inline void vnic_ib_pre_rcvpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t *time) +{ + *time = get_cycles(); + if (io->time != 0) { + ib_conn->statistics.recv_comp_time += *time - io->time; + ib_conn->statistics.recv_comp_ios++; + } + +} + +static inline void vnic_ib_post_rcvpost_stats(struct vnic_ib_conn *ib_conn, + cycles_t time) +{ + 
ib_conn->statistics.recv_post_time += get_cycles() - time; + ib_conn->statistics.recv_post_ios++; +} + +static inline void vnic_ib_pre_sendpost_stats(struct io *io, + cycles_t *time) +{ + io->time = *time = get_cycles(); +} + +static inline void vnic_ib_post_sendpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t time) +{ + time = get_cycles() - time; + if (io->swr.opcode == IB_WR_RDMA_WRITE) { + ib_conn->statistics.rdma_post_time += time; + ib_conn->statistics.rdma_post_ios++; + } else { + ib_conn->statistics.send_post_time += time; + ib_conn->statistics.send_post_ios++; + } +} +#else /*CONFIG_INIFINIBAND_VNIC_STATS*/ + +static inline void vnic_connected_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_stop_xmit_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_restart_xmit_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_recv_pkt_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_multicast_recv_pkt_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_pre_pkt_xmit_stats(cycles_t *time) +{ + ; +} + +static inline void vnic_post_pkt_xmit_stats(struct vnic *vnic, + cycles_t time) +{ + ; +} + +static inline void vnic_xmit_fail_stats(struct vnic *vnic) +{ + ; +} + +static inline int vnic_setup_stats_files(struct vnic *vnic) +{ + return 0; +} + +static inline void vnic_cleanup_stats_files(struct vnic *vnic) +{ + ; +} + +static inline void vnic_carrier_loss_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_disconn_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_alloc_stats(struct vnic *vnic) +{ + ; +} + +static inline void control_note_rsptime_stats(cycles_t *time) +{ + ; +} + +static inline void control_update_rsptime_stats(struct control *control, + cycles_t response_time) +{ + ; +} + +static inline void control_note_reqtime_stats(struct control *control) +{ + ; +} + +static inline void control_timeout_stats(struct control *control) +{ + ; +} + +static inline void data_kickreq_stats(struct data *data) +{ + ; +} + +static inline void data_no_xmitbuf_stats(struct data *data) +{ + ; +} + +static inline void data_xmits_stats(struct data *data) +{ + ; +} + +static inline void data_recvs_stats(struct data *data) +{ + ; +} + +static inline void data_note_kickrcv_time(void) +{ + ; +} + +static inline void data_rcvkicks_stats(struct data *data) +{ + ; +} + +static inline void vnic_ib_conntime_stats(struct vnic_ib_conn *ib_conn) +{ + ; +} + +static inline void vnic_ib_note_comptime_stats(cycles_t *time) +{ + ; +} + +static inline void vnic_ib_callback_stats(struct vnic_ib_conn *ib_conn) + +{ + ; +} +static inline void vnic_ib_comp_stats(struct vnic_ib_conn *ib_conn, + u32 *comp_num) +{ + ; +} + +static inline void vnic_ib_io_stats(struct io *io, + struct vnic_ib_conn *ib_conn, + cycles_t comp_time) +{ + ; +} + +static inline void vnic_ib_maxio_stats(struct vnic_ib_conn *ib_conn, + u32 comp_num) +{ + ; +} + +static inline void vnic_ib_connected_time_stats(struct vnic_ib_conn *ib_conn) +{ + ; +} + +static inline void vnic_ib_pre_rcvpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t *time) +{ + ; +} + +static inline void vnic_ib_post_rcvpost_stats(struct vnic_ib_conn *ib_conn, + cycles_t time) +{ + ; +} + +static inline void vnic_ib_pre_sendpost_stats(struct io *io, + cycles_t *time) +{ + ; +} + +static inline void vnic_ib_post_sendpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t time) +{ + ; +} +#endif /*CONFIG_INIFINIBAND_VNIC_STATS*/ + +#endif 
/*VNIC_STATS_H_INCLUDED*/ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:59:25 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:29:25 +0530 Subject: [ofa-general] [PATCH v3 11/13] QLogic VNIC: Driver utility file - implements various utility macros In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095925.9943.21164.stgit@localhost.localdomain> From: Poornima Kamath This patch adds the driver utility file which mainly contains utility macros for debugging of QLogic VNIC driver. Signed-off-by: Poornima Kamath Signed-off-by: Ramachandra K Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/vnic_util.h | 236 ++++++++++++++++++++++++++ 1 files changed, 236 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_util.h diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h new file mode 100644 index 0000000..095fa3a --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_util.h @@ -0,0 +1,236 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_UTIL_H_INCLUDED +#define VNIC_UTIL_H_INCLUDED + +#define MODULE_NAME "QLGC_VNIC" + +#define VNIC_MAJORVERSION 1 +#define VNIC_MINORVERSION 1 + +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1))) + +extern u32 vnic_debug; + +enum { + DEBUG_IB_INFO = 0x00000001, + DEBUG_IB_FUNCTION = 0x00000002, + DEBUG_IB_FSTATUS = 0x00000004, + DEBUG_IB_ASSERTS = 0x00000008, + DEBUG_CONTROL_INFO = 0x00000010, + DEBUG_CONTROL_FUNCTION = 0x00000020, + DEBUG_CONTROL_PACKET = 0x00000040, + DEBUG_CONFIG_INFO = 0x00000100, + DEBUG_DATA_INFO = 0x00001000, + DEBUG_DATA_FUNCTION = 0x00002000, + DEBUG_NETPATH_INFO = 0x00010000, + DEBUG_VIPORT_INFO = 0x00100000, + DEBUG_VIPORT_FUNCTION = 0x00200000, + DEBUG_LINK_STATE = 0x00400000, + DEBUG_VNIC_INFO = 0x01000000, + DEBUG_VNIC_FUNCTION = 0x02000000, + DEBUG_MCAST_INFO = 0x04000000, + DEBUG_MCAST_FUNCTION = 0x08000000, + DEBUG_SYS_INFO = 0x10000000, + DEBUG_SYS_VERBOSE = 0x40000000 +}; + +#define PRINT(level, x, fmt, arg...) 
\ + printk(level "%s: " fmt, MODULE_NAME, ##arg) + +#define PRINT_CONDITIONAL(level, x, condition, fmt, arg...) \ + do { \ + if (condition) \ + printk(level "%s: %s: " fmt, \ + MODULE_NAME, x, ##arg); \ + } while (0) + +#define IB_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "IB", fmt, ##arg) +#define IB_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "IB", fmt, ##arg) + +#define IB_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "IB", \ + (vnic_debug & DEBUG_IB_FUNCTION), \ + fmt, ##arg) + +#define IB_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "IB", \ + (vnic_debug & DEBUG_IB_INFO), \ + fmt, ##arg) + +#define IB_ASSERT(x) \ + do { \ + if ((vnic_debug & DEBUG_IB_ASSERTS) && !(x)) \ + panic("%s assertion failed, file: %s," \ + " line %d: ", \ + MODULE_NAME, __FILE__, __LINE__) \ + } while (0) + +#define CONTROL_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "CONTROL", fmt, ##arg) +#define CONTROL_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "CONTROL", fmt, ##arg) + +#define CONTROL_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONTROL", \ + (vnic_debug & DEBUG_CONTROL_INFO), \ + fmt, ##arg) + +#define CONTROL_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONTROL", \ + (vnic_debug & DEBUG_CONTROL_FUNCTION), \ + fmt, ##arg) + +#define CONTROL_PACKET(pkt) \ + do { \ + if (vnic_debug & DEBUG_CONTROL_PACKET) \ + control_log_control_packet(pkt); \ + } while (0) + +#define CONFIG_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "CONFIG", fmt, ##arg) +#define CONFIG_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "CONFIG", fmt, ##arg) + +#define CONFIG_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONFIG", \ + (vnic_debug & DEBUG_CONFIG_INFO), \ + fmt, ##arg) + +#define DATA_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "DATA", fmt, ##arg) +#define DATA_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "DATA", fmt, ##arg) + +#define DATA_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "DATA", \ + (vnic_debug & DEBUG_DATA_INFO), \ + fmt, ##arg) + +#define DATA_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "DATA", \ + (vnic_debug & DEBUG_DATA_FUNCTION), \ + fmt, ##arg) + + +#define MCAST_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "MCAST", fmt, ##arg) +#define MCAST_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "MCAST", fmt, ##arg) + +#define MCAST_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "MCAST", \ + (vnic_debug & DEBUG_MCAST_INFO), \ + fmt, ##arg) + +#define MCAST_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "MCAST", \ + (vnic_debug & DEBUG_MCAST_FUNCTION), \ + fmt, ##arg) + +#define NETPATH_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "NETPATH", fmt, ##arg) +#define NETPATH_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "NETPATH", fmt, ##arg) + +#define NETPATH_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NETPATH", \ + (vnic_debug & DEBUG_NETPATH_INFO), \ + fmt, ##arg) + +#define VIPORT_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "VIPORT", fmt, ##arg) +#define VIPORT_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "VIPORT", fmt, ##arg) + +#define VIPORT_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "VIPORT", \ + (vnic_debug & DEBUG_VIPORT_INFO), \ + fmt, ##arg) + +#define VIPORT_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "VIPORT", \ + (vnic_debug & DEBUG_VIPORT_FUNCTION), \ + fmt, ##arg) + +#define LINK_STATE(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "LINK", \ + (vnic_debug & DEBUG_LINK_STATE), \ + fmt, ##arg) + +#define VNIC_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "NIC", fmt, ##arg) +#define VNIC_ERROR(fmt, arg...) 
\ + PRINT(KERN_ERR, "NIC", fmt, ##arg) +#define VNIC_INIT(fmt, arg...) \ + PRINT(KERN_INFO, "NIC", fmt, ##arg) + +#define VNIC_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NIC", \ + (vnic_debug & DEBUG_VNIC_INFO), \ + fmt, ##arg) + +#define VNIC_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NIC", \ + (vnic_debug & DEBUG_VNIC_FUNCTION), \ + fmt, ##arg) + +#define SYS_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "SYS", fmt, ##arg) +#define SYS_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "SYS", fmt, ##arg) + +#define SYS_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "SYS", \ + (vnic_debug & DEBUG_SYS_INFO), \ + fmt, ##arg) + +#endif /* VNIC_UTIL_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Thu May 29 02:59:55 2008 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 29 May 2008 15:29:55 +0530 Subject: [ofa-general] [PATCH v3 12/13] QLogic VNIC: Driver Kconfig and Makefile. In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> Message-ID: <20080529095955.9943.48616.stgit@localhost.localdomain> From: Ramachandra K Kconfig and Makefile for the QLogic VNIC driver. Signed-off-by: Ramachandra K Signed-off-by: Poornima Kamath Signed-off-by: Amar Mudrankit --- drivers/infiniband/ulp/qlgc_vnic/Kconfig | 19 +++++++++++++++++++ drivers/infiniband/ulp/qlgc_vnic/Makefile | 13 +++++++++++++ 2 files changed, 32 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Kconfig create mode 100644 drivers/infiniband/ulp/qlgc_vnic/Makefile diff --git a/drivers/infiniband/ulp/qlgc_vnic/Kconfig b/drivers/infiniband/ulp/qlgc_vnic/Kconfig new file mode 100644 index 0000000..7b4030e --- /dev/null +++ b/drivers/infiniband/ulp/qlgc_vnic/Kconfig @@ -0,0 +1,19 @@ +config INFINIBAND_QLGC_VNIC + tristate "QLogic VNIC - Support for QLogic Ethernet Virtual I/O Controller" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the QLogic Ethernet Virtual I/O Controller + (EVIC). In conjunction with the EVIC, this provides virtual + ethernet interfaces and transports ethernet packets over + InfiniBand so that you can communicate with Ethernet networks + using your IB device. + +config INFINIBAND_QLGC_VNIC_STATS + bool "QLogic VNIC Statistics" + depends on INFINIBAND_QLGC_VNIC + default n + ---help--- + This option compiles statistics collecting code into the + data path of the QLogic VNIC driver to help in profiling and fine + tuning. This adds some overhead in the interest of gathering + data. 
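
The INFINIBAND_QLGC_VNIC_STATS option has a purely compile-time effect: when it
is enabled, vnic_stats.h (patch 10/13) provides static inline hooks that record
cycle counters in the data path; when it is disabled, the same hooks compile to
empty static inlines that the compiler optimizes away, so a stats-free build
pays no cost. A minimal sketch of the pattern (my_dev and my_xmit_stats are
illustrative names, not driver code):

#include <linux/types.h>
#include <asm/timex.h>	/* cycles_t, get_cycles() */

struct my_dev {
	struct {
		cycles_t	xmit_time;
		u32		xmit_num;
	} stats;
};

#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS
/* real hook: accumulate cycles spent transmitting */
static inline void my_xmit_stats(struct my_dev *dev, cycles_t start)
{
	dev->stats.xmit_time += get_cycles() - start;
	dev->stats.xmit_num++;
}
#else
/* stub: empty body, no overhead when the option is off */
static inline void my_xmit_stats(struct my_dev *dev, cycles_t start)
{
}
#endif
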
diff --git a/drivers/infiniband/ulp/qlgc_vnic/Makefile b/drivers/infiniband/ulp/qlgc_vnic/Makefile
new file mode 100644
index 0000000..509dd67
--- /dev/null
+++ b/drivers/infiniband/ulp/qlgc_vnic/Makefile
@@ -0,0 +1,13 @@
+obj-$(CONFIG_INFINIBAND_QLGC_VNIC)	+= qlgc_vnic.o
+
+qlgc_vnic-y :=	vnic_main.o \
+		vnic_ib.o \
+		vnic_viport.o \
+		vnic_control.o \
+		vnic_data.o \
+		vnic_netpath.o \
+		vnic_config.o \
+		vnic_sys.o \
+		vnic_multicast.o
+
+qlgc_vnic-$(CONFIG_INFINIBAND_QLGC_VNIC_STATS) += vnic_stats.o

From ramachandra.kuchimanchi at qlogic.com  Thu May 29 03:00:25 2008
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 29 May 2008 15:30:25 +0530
Subject: [ofa-general] [PATCH v3 13/13] QLogic VNIC: Modifications to IB Kconfig and Makefile
In-Reply-To: <20080529095126.9943.84692.stgit@localhost.localdomain>
References: <20080529095126.9943.84692.stgit@localhost.localdomain>
Message-ID: <20080529100025.9943.35838.stgit@localhost.localdomain>

From: Ramachandra K

This patch modifies the top-level InfiniBand Kconfig and Makefile to include
the QLogic VNIC driver as a new ULP.

Signed-off-by: Ramachandra K
Signed-off-by: Poornima Kamath
Signed-off-by: Amar Mudrankit
---

 drivers/infiniband/Kconfig  |    2 ++
 drivers/infiniband/Makefile |    1 +
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a5dc78a..0775df5 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -53,4 +53,6 @@ source "drivers/infiniband/ulp/srp/Kconfig"
 source "drivers/infiniband/ulp/iser/Kconfig"
 
+source "drivers/infiniband/ulp/qlgc_vnic/Kconfig"
+
 endif # INFINIBAND

diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index ed35e44..845271e 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_INFINIBAND_NES)	+= hw/nes/
 obj-$(CONFIG_INFINIBAND_IPOIB)	+= ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP)	+= ulp/srp/
 obj-$(CONFIG_INFINIBAND_ISER)	+= ulp/iser/
+obj-$(CONFIG_INFINIBAND_QLGC_VNIC)	+= ulp/qlgc_vnic/

From weiny2 at llnl.gov  Thu May 29 10:06:17 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 29 May 2008 10:06:17 -0700
Subject: [ofa-general] Infiniband back-to-back without OpenSM?
In-Reply-To: 
References: <1211978369.13185.351.camel@hrosenstock-ws.xsigo.com>
	<1211979817.13185.357.camel@hrosenstock-ws.xsigo.com>
	<1211981650.13185.362.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080529100617.11d9b492.weiny2@llnl.gov>

On Thu, 29 May 2008 11:37:15 -0400 (EDT)
James Lentini wrote:

> 
> 
> On Wed, 28 May 2008, Hal Rosenstock wrote:
> 
> > On Wed, 2008-05-28 at 09:24 -0400, Talpey, Thomas wrote:
> > 
> > > Maybe I'm getting ahead of myself though, still wondering if there's a way
> > > to do it with what we have.
> > 
> > The closest thing is OpenSM run once mode but I think you've been
> > describing a b2b mini SM command which wouldn't be hard to implement.
> 
> Unrelated to NFS/RDMA, I wrote a small kernel module that used MADs
> to assign a lid, and then transitioned the port to ARMED and ACTIVE.
> This worked for enabling IB communication, but not IPoIB. In
> retrospect, I probably could have implemented the same functionality
> in userspace.
> 

Have/could you release this? I would be interested in looking at it.
Thanks, Ira From shemminger at vyatta.com Thu May 29 10:27:52 2008 From: shemminger at vyatta.com (Stephen Hemminger) Date: Thu, 29 May 2008 10:27:52 -0700 Subject: [ofa-general] Re: [PATCH v3 01/13] QLogic VNIC: Driver - netdev implementation In-Reply-To: <20080529095423.9943.77528.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> <20080529095423.9943.77528.stgit@localhost.localdomain> Message-ID: <20080529102752.584147ee@extreme> On Thu, 29 May 2008 15:24:23 +0530 Ramachandra K wrote: > From: Ramachandra K > > QLogic Virtual NIC Driver. This patch implements netdev registration, > netdev functions and state maintenance of the QLogic Virtual NIC > corresponding to the various events associated with the QLogic Ethernet > Virtual I/O Controller (EVIC/VEx) connection. > > Signed-off-by: Ramachandra K > Signed-off-by: Poornima Kamath > Signed-off-by: Amar Mudrankit > --- > > drivers/infiniband/ulp/qlgc_vnic/vnic_main.c | 1098 ++++++++++++++++++++++++++ > drivers/infiniband/ulp/qlgc_vnic/vnic_main.h | 154 ++++ > 2 files changed, 1252 insertions(+), 0 deletions(-) > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.c > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_main.h > > diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c > new file mode 100644 > index 0000000..570c069 > --- /dev/null > +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.c > @@ -0,0 +1,1098 @@ > +/* > + * Copyright (c) 2006 QLogic, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "vnic_util.h" > +#include "vnic_main.h" > +#include "vnic_netpath.h" > +#include "vnic_viport.h" > +#include "vnic_ib.h" > +#include "vnic_stats.h" > + > +#define MODULEVERSION "1.3.0.0.4" > +#define MODULEDETAILS \ > + "QLogic Corp. 
Virtual NIC (VNIC) driver version " MODULEVERSION > + > +MODULE_AUTHOR("QLogic Corp."); > +MODULE_DESCRIPTION(MODULEDETAILS); > +MODULE_LICENSE("Dual BSD/GPL"); > +MODULE_SUPPORTED_DEVICE("QLogic Ethernet Virtual I/O Controller"); > + > +u32 vnic_debug; > + > +module_param(vnic_debug, uint, 0444); > +MODULE_PARM_DESC(vnic_debug, "Enable debug tracing if > 0"); maybe migrate this to ethtool msg_level? > + > +LIST_HEAD(vnic_list); > + > +static DECLARE_WAIT_QUEUE_HEAD(vnic_npevent_queue); > +static LIST_HEAD(vnic_npevent_list); > +static DECLARE_COMPLETION(vnic_npevent_thread_exit); > +static spinlock_t vnic_npevent_list_lock; > +static struct task_struct *vnic_npevent_thread; > +static int vnic_npevent_thread_end; > + > +static const char *const vnic_npevent_str[] = { > + "PRIMARY CONNECTED", > + "PRIMARY DISCONNECTED", > + "PRIMARY CARRIER", > + "PRIMARY NO CARRIER", > + "PRIMARY TIMER EXPIRED", > + "PRIMARY SETLINK", > + "SECONDARY CONNECTED", > + "SECONDARY DISCONNECTED", > + "SECONDARY CARRIER", > + "SECONDARY NO CARRIER", > + "SECONDARY TIMER EXPIRED", > + "SECONDARY SETLINK", > + "FREE VNIC", > +}; > + > +void vnic_connected(struct vnic *vnic, struct netpath *netpath) > +{ > + VNIC_FUNCTION("vnic_connected()\n"); > + if (netpath->second_bias) > + vnic_npevent_queue_evt(netpath, VNIC_SECNP_CONNECTED); > + else > + vnic_npevent_queue_evt(netpath, VNIC_PRINP_CONNECTED); > + > + vnic_connected_stats(vnic); > +} > + > +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath) > +{ > + VNIC_FUNCTION("vnic_disconnected()\n"); > + if (netpath->second_bias) > + vnic_npevent_queue_evt(netpath, VNIC_SECNP_DISCONNECTED); > + else > + vnic_npevent_queue_evt(netpath, VNIC_PRINP_DISCONNECTED); > +} > + > +void vnic_link_up(struct vnic *vnic, struct netpath *netpath) > +{ > + VNIC_FUNCTION("vnic_link_up()\n"); > + if (netpath->second_bias) > + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKUP); > + else > + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKUP); > +} > + > +void vnic_link_down(struct vnic *vnic, struct netpath *netpath) > +{ > + VNIC_FUNCTION("vnic_link_down()\n"); > + if (netpath->second_bias) > + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKDOWN); > + else > + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKDOWN); > +} > + > +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath) > +{ > + unsigned long flags; > + > + VNIC_FUNCTION("vnic_stop_xmit()\n"); > + spin_lock_irqsave(&vnic->current_path_lock, flags); > + if (netpath == vnic->current_path) { > + if (!netif_queue_stopped(vnic->netdevice)) { > + netif_stop_queue(vnic->netdevice); > + vnic->failed_over = 0; > + } > + > + vnic_stop_xmit_stats(vnic); > + } > + spin_unlock_irqrestore(&vnic->current_path_lock, flags); > +} > + > +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath) > +{ > + unsigned long flags; > + > + VNIC_FUNCTION("vnic_restart_xmit()\n"); > + spin_lock_irqsave(&vnic->current_path_lock, flags); > + if (netpath == vnic->current_path) { > + if (netif_queue_stopped(vnic->netdevice)) > + netif_wake_queue(vnic->netdevice); > + > + vnic_restart_xmit_stats(vnic); > + } > + spin_unlock_irqrestore(&vnic->current_path_lock, flags); > +} > + > +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, > + struct sk_buff *skb) > +{ > + VNIC_FUNCTION("vnic_recv_packet()\n"); > + if ((netpath != vnic->current_path) || !vnic->open) { > + VNIC_INFO("tossing packet\n"); > + dev_kfree_skb(skb); > + return; > + } > + > + vnic->netdevice->last_rx = jiffies; > + skb->dev = vnic->netdevice; > + 
skb->protocol = eth_type_trans(skb, skb->dev);
> +	if (!vnic->config->use_rx_csum)
> +		skb->ip_summed = CHECKSUM_NONE;
> +	netif_rx(skb);

Not sure, but if you are always calling this from softirq context (i.e.
NAPI or a tasklet), then there is no need for the additional queuing and
softirq that netif_rx adds.

> +	vnic_recv_pkt_stats(vnic);
> +}
> +
> +static struct net_device_stats *vnic_get_stats(struct net_device *device)
> +{
> +	struct vnic *vnic;
> +	struct netpath *np;
> +	unsigned long flags;
> +
> +	VNIC_FUNCTION("vnic_get_stats()\n");
> +	vnic = netdev_priv(device);
> +
> +	spin_lock_irqsave(&vnic->current_path_lock, flags);
> +	np = vnic->current_path;
> +	if (np && np->viport) {
> +		atomic_inc(&np->viport->reference_count);
> +		spin_unlock_irqrestore(&vnic->current_path_lock, flags);
> +		viport_get_stats(np->viport, &vnic->stats);
> +		atomic_dec(&np->viport->reference_count);
> +		wake_up(&np->viport->reference_queue);
> +	} else
> +		spin_unlock_irqrestore(&vnic->current_path_lock, flags);
> +
> +	return &vnic->stats;
> +}

You can use device->stats and delete vnic->stats to save space.

> +
> +static int vnic_open(struct net_device *device)
> +{
> +	struct vnic *vnic;
> +
> +	VNIC_FUNCTION("vnic_open()\n");
> +	vnic = netdev_priv(device);
> +
> +	vnic->open++;

You don't need this (vnic->open); use netif_running(device) instead.

> +	vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK);
> +	netif_start_queue(vnic->netdevice);
> +
> +	return 0;
> +}
> +
> +static int vnic_stop(struct net_device *device)
> +{
> +	struct vnic *vnic;
> +	int ret = 0;
> +
> +	VNIC_FUNCTION("vnic_stop()\n");
> +	vnic = netdev_priv(device);
> +	netif_stop_queue(device);
> +	vnic->open--;
> +	vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK);
> +
> +	return ret;
> +}
> +
> +static int vnic_hard_start_xmit(struct sk_buff *skb,
> +				struct net_device *device)
> +{
> +	struct vnic *vnic;
> +	struct netpath *np;
> +	cycles_t xmit_time;
> +	int ret = -1;
> +
> +	VNIC_FUNCTION("vnic_hard_start_xmit()\n");
> +	vnic = netdev_priv(device);
> +	np = vnic->current_path;
> +
> +	vnic_pre_pkt_xmit_stats(&xmit_time);
> +
> +	if (np && np->viport)
> +		ret = viport_xmit_packet(np->viport, skb);
> +
> +	if (ret) {
> +		vnic_xmit_fail_stats(vnic);
> +		dev_kfree_skb_any(skb);
> +		vnic->stats.tx_dropped++;
> +		goto out;
> +	}
> +
> +	device->trans_start = jiffies;
> +	vnic_post_pkt_xmit_stats(vnic, xmit_time);
> +out:
> +	return 0;
> +}

No flow control? You will just drop packets if overloaded?
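(For reference, a minimal sketch of the tx flow control being asked about
here, assuming a hypothetical viport_xmit_ring_full() helper; the real
driver would test whatever send-ring state it keeps, and its
send-completion handler would call netif_wake_queue() once space frees
up:)

static int vnic_hard_start_xmit(struct sk_buff *skb,
				struct net_device *device)
{
	struct vnic *vnic = netdev_priv(device);
	struct netpath *np = vnic->current_path;

	/* Back-pressure the stack instead of silently dropping: stop
	 * the queue while the ring is full and let the core requeue
	 * this skb. */
	if (np && np->viport && viport_xmit_ring_full(np->viport)) {
		netif_stop_queue(device);
		return NETDEV_TX_BUSY;
	}

	/* ... transmit path as in the patch above ... */
	return NETDEV_TX_OK;
}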
> + > +static void vnic_tx_timeout(struct net_device *device) > +{ > + struct vnic *vnic; > + struct viport *viport = NULL; > + unsigned long flags; > + > + VNIC_FUNCTION("vnic_tx_timeout()\n"); > + vnic = netdev_priv(device); > + device->trans_start = jiffies; > + > + spin_lock_irqsave(&vnic->current_path_lock, flags); > + if (vnic->current_path && vnic->current_path->viport) { > + if (vnic->failed_over) { > + if (vnic->current_path == &vnic->primary_path) > + viport = vnic->secondary_path.viport; > + else if (vnic->current_path == &vnic->secondary_path) > + viport = vnic->primary_path.viport; > + } else > + viport = vnic->current_path->viport; > + > + spin_unlock_irqrestore(&vnic->current_path_lock, flags); > + if (viport) > + viport_failure(viport); > + } else > + spin_unlock_irqrestore(&vnic->current_path_lock, flags); > + > + VNIC_ERROR("vnic_tx_timeout\n"); > +} > + > +static void vnic_set_multicast_list(struct net_device *device) > +{ > + struct vnic *vnic; > + unsigned long flags; > + > + VNIC_FUNCTION("vnic_set_multicast_list()\n"); > + vnic = netdev_priv(device); > + > + spin_lock_irqsave(&vnic->lock, flags); > + if (device->mc_count == 0) { > + if (vnic->mc_list_len) { > + vnic->mc_list_len = vnic->mc_count = 0; > + kfree(vnic->mc_list); > + } > + } else { > + struct dev_mc_list *mc_list = device->mc_list; > + int i; > + > + if (device->mc_count > vnic->mc_list_len) { > + if (vnic->mc_list_len) > + kfree(vnic->mc_list); > + vnic->mc_list_len = device->mc_count + 10; > + vnic->mc_list = kmalloc(vnic->mc_list_len * > + sizeof *mc_list, GFP_ATOMIC); > + if (!vnic->mc_list) { > + vnic->mc_list_len = vnic->mc_count = 0; > + VNIC_ERROR("failed allocating mc_list\n"); > + goto failure; > + } > + } > + vnic->mc_count = device->mc_count; > + for (i = 0; i < device->mc_count; i++) { > + vnic->mc_list[i] = *mc_list; > + vnic->mc_list[i].next = &vnic->mc_list[i + 1]; > + mc_list = mc_list->next; > + } > + } > + spin_unlock_irqrestore(&vnic->lock, flags); > + > + if (vnic->primary_path.viport) > + viport_set_multicast(vnic->primary_path.viport, > + vnic->mc_list, vnic->mc_count); > + > + if (vnic->secondary_path.viport) > + viport_set_multicast(vnic->secondary_path.viport, > + vnic->mc_list, vnic->mc_count); > + > + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_PRINP_SETLINK); > + return; > +failure: > + spin_unlock_irqrestore(&vnic->lock, flags); > +} > + > +/** > + * Following set of functions queues up the events for EVIC and the > + * kernel thread queuing up the event might return. 
> + */ > +static int vnic_set_mac_address(struct net_device *device, void *addr) > +{ > + struct vnic *vnic; > + struct sockaddr *sockaddr = addr; > + u8 *address; > + int ret = -1; > + > + VNIC_FUNCTION("vnic_set_mac_address()\n"); > + vnic = netdev_priv(device); > + > + if (!is_valid_ether_addr(sockaddr->sa_data)) > + return -EADDRNOTAVAIL; > + > + if (netif_running(device)) > + return -EBUSY; > + > + memcpy(device->dev_addr, sockaddr->sa_data, ETH_ALEN); > + address = sockaddr->sa_data; > + > + if (vnic->primary_path.viport) > + ret = viport_set_unicast(vnic->primary_path.viport, > + address); > + > + if (ret) > + return ret; > + > + if (vnic->secondary_path.viport) > + viport_set_unicast(vnic->secondary_path.viport, address); > + > + vnic->mac_set = 1; > + return 0; > +} > + > +static int vnic_change_mtu(struct net_device *device, int mtu) > +{ > + struct vnic *vnic; > + int ret = 0; > + int pri_max_mtu; > + int sec_max_mtu; > + > + VNIC_FUNCTION("vnic_change_mtu()\n"); > + vnic = netdev_priv(device); > + > + if (vnic->primary_path.viport) > + pri_max_mtu = viport_max_mtu(vnic->primary_path.viport); > + else > + pri_max_mtu = MAX_PARAM_VALUE; > + > + if (vnic->secondary_path.viport) > + sec_max_mtu = viport_max_mtu(vnic->secondary_path.viport); > + else > + sec_max_mtu = MAX_PARAM_VALUE; > + > + if ((mtu < pri_max_mtu) && (mtu < sec_max_mtu)) { > + device->mtu = mtu; > + vnic_npevent_queue_evt(&vnic->primary_path, > + VNIC_PRINP_SETLINK); > + vnic_npevent_queue_evt(&vnic->secondary_path, > + VNIC_SECNP_SETLINK); > + } else if (pri_max_mtu < sec_max_mtu) > + printk(KERN_WARNING PFX "%s: Maximum " > + "supported MTU size is %d. " > + "Cannot set MTU to %d\n", > + vnic->config->name, pri_max_mtu, mtu); > + else > + printk(KERN_WARNING PFX "%s: Maximum " > + "supported MTU size is %d. " > + "Cannot set MTU to %d\n", > + vnic->config->name, sec_max_mtu, mtu); > + > + return ret; > +} > + > +static int vnic_npevent_register(struct vnic *vnic, struct netpath *netpath) > +{ > + u8 *address; > + int ret; > + > + if (!vnic->mac_set) { > + /* if netpath == secondary_path, then the primary path isn't > + * connected. MAC address will be set when the primary > + * connects. 
> + */ > + netpath_get_hw_addr(netpath, vnic->netdevice->dev_addr); > + address = vnic->netdevice->dev_addr; > + > + if (vnic->secondary_path.viport) > + viport_set_unicast(vnic->secondary_path.viport, > + address); > + > + vnic->mac_set = 1; > + } > + ret = register_netdev(vnic->netdevice); > + if (ret) { > + printk(KERN_ERR PFX "%s failed registering netdev " > + "error %d - calling viport_failure\n", > + config_viport_name(vnic->primary_path.viport->config), > + ret); > + vnic_free(vnic); > + printk(KERN_ERR PFX "%s DELETED : register_netdev failure\n", > + config_viport_name(vnic->primary_path.viport->config)); > + return ret; > + } > + > + vnic->state = VNIC_REGISTERED; > + vnic->carrier = 2; /*special value to force netif_carrier_(on|off)*/ > + return 0; > +} > + > +static void vnic_npevent_dequeue_all(struct vnic *vnic) > +{ > + unsigned long flags; > + struct vnic_npevent *npevt, *tmp; > + > + spin_lock_irqsave(&vnic_npevent_list_lock, flags); > + if (list_empty(&vnic_npevent_list)) > + goto out; > + list_for_each_entry_safe(npevt, tmp, &vnic_npevent_list, > + list_ptrs) { > + if ((npevt->vnic == vnic)) { > + list_del(&npevt->list_ptrs); > + kfree(npevt); > + } > + } > +out: > + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); > +} > + > +static void update_path_and_reconnect(struct netpath *netpath, > + struct vnic *vnic) > +{ > + struct viport_config *config = netpath->viport->config; > + int delay = 1; > + > + if (vnic_ib_get_path(netpath, vnic)) > + return; > + /* > + * tell viport_connect to wait for default_no_path_timeout > + * before connecting if we are retrying the same path index > + * within default_no_path_timeout. > + * This prevents flooding connect requests to a path (or set > + * of paths) that aren't successfully connecting for some reason. 
> + */ > + if (time_after(jiffies, > + (netpath->connect_time + vnic->config->no_path_timeout))) { > + netpath->path_idx = config->path_idx; > + netpath->connect_time = jiffies; > + netpath->delay_reconnect = 0; > + delay = 0; > + } else if (config->path_idx != netpath->path_idx) { > + delay = netpath->delay_reconnect; > + netpath->path_idx = config->path_idx; > + netpath->delay_reconnect = 1; > + } else > + delay = 1; > + viport_connect(netpath->viport, delay); > +} > + > +static inline void vnic_set_checksum_flag(struct vnic *vnic, > + struct netpath *target_path) > +{ > + unsigned long flags; > + > + spin_lock_irqsave(&vnic->current_path_lock, flags); > + vnic->current_path = target_path; > + vnic->failed_over = 1; > + if (vnic->config->use_tx_csum && > + netpath_can_tx_csum(vnic->current_path)) > + vnic->netdevice->features |= NETIF_F_IP_CSUM; > + > + spin_unlock_irqrestore(&vnic->current_path_lock, flags); > +} > + > +static void vnic_set_uni_multicast(struct vnic *vnic, > + struct netpath *netpath) > +{ > + unsigned long flags; > + u8 *address; > + > + if (vnic->mac_set) { > + address = vnic->netdevice->dev_addr; > + > + if (netpath->viport) > + viport_set_unicast(netpath->viport, address); > + } > + spin_lock_irqsave(&vnic->lock, flags); > + > + if (vnic->mc_list && netpath->viport) > + viport_set_multicast(netpath->viport, vnic->mc_list, > + vnic->mc_count); > + > + spin_unlock_irqrestore(&vnic->lock, flags); > + if (vnic->state == VNIC_REGISTERED) { > + if (!netpath->viport) > + return; > + viport_set_link(netpath->viport, > + vnic->netdevice->flags & ~IFF_UP, > + vnic->netdevice->mtu); > + } > +} > + > +static void vnic_set_netpath_timers(struct vnic *vnic, > + struct netpath *netpath) > +{ > + switch (netpath->timer_state) { > + case NETPATH_TS_IDLE: > + netpath->timer_state = NETPATH_TS_ACTIVE; > + if (vnic->state == VNIC_UNINITIALIZED) > + netpath_timer(netpath, > + vnic->config-> > + primary_connect_timeout); > + else > + netpath_timer(netpath, > + vnic->config-> > + primary_reconnect_timeout); > + break; > + case NETPATH_TS_ACTIVE: > + /*nothing to do*/ > + break; > + case NETPATH_TS_EXPIRED: > + if (vnic->state == VNIC_UNINITIALIZED) > + vnic_npevent_register(vnic, netpath); > + > + break; > + } > +} > + > +static void vnic_check_primary_path_timer(struct vnic *vnic) > +{ > + switch (vnic->primary_path.timer_state) { > + case NETPATH_TS_ACTIVE: > + /* nothing to do. 
just wait */ > + break; > + case NETPATH_TS_IDLE: > + netpath_timer(&vnic->primary_path, > + vnic->config-> > + primary_switch_timeout); > + break; > + case NETPATH_TS_EXPIRED: > + printk(KERN_INFO PFX > + "%s: switching to primary path\n", > + vnic->config->name); > + > + vnic_set_checksum_flag(vnic, &vnic->primary_path); > + break; > + } > +} > + > +static void vnic_carrier_loss(struct vnic *vnic, > + struct netpath *last_path) > +{ > + if (vnic->primary_path.carrier) { > + vnic->carrier = 1; > + vnic_set_checksum_flag(vnic, &vnic->primary_path); > + > + if (last_path && last_path != vnic->current_path) > + printk(KERN_INFO PFX > + "%s: failing over to primary path\n", > + vnic->config->name); > + else if (!last_path) > + printk(KERN_INFO PFX "%s: using primary path\n", > + vnic->config->name); > + > + } else if ((vnic->secondary_path.carrier) && > + (vnic->secondary_path.timer_state != NETPATH_TS_ACTIVE)) { > + vnic->carrier = 1; > + vnic_set_checksum_flag(vnic, &vnic->secondary_path); > + > + if (last_path && last_path != vnic->current_path) > + printk(KERN_INFO PFX > + "%s: failing over to secondary path\n", > + vnic->config->name); > + else if (!last_path) > + printk(KERN_INFO PFX "%s: using secondary path\n", > + vnic->config->name); > + > + } > + > +} > + > +static void vnic_handle_path_change(struct vnic *vnic, > + struct netpath **path) > +{ > + struct netpath *last_path = *path; > + > + if (!last_path) { > + if (vnic->current_path == &vnic->primary_path) > + last_path = &vnic->secondary_path; > + else > + last_path = &vnic->primary_path; > + > + } > + > + if (vnic->current_path && vnic->current_path->viport) > + viport_set_link(vnic->current_path->viport, > + vnic->netdevice->flags, > + vnic->netdevice->mtu); > + > + if (last_path->viport) > + viport_set_link(last_path->viport, > + vnic->netdevice->flags & > + ~IFF_UP, vnic->netdevice->mtu); > + > + vnic_restart_xmit(vnic, vnic->current_path); > +} > + > +static void vnic_report_path_change(struct vnic *vnic, > + struct netpath *last_path, > + int other_path_ok) > +{ > + if (!vnic->current_path) { > + if (last_path == &vnic->primary_path) > + printk(KERN_INFO PFX "%s: primary path lost, " > + "no failover path available\n", > + vnic->config->name); > + else > + printk(KERN_INFO PFX "%s: secondary path lost, " > + "no failover path available\n", > + vnic->config->name); > + return; > + } > + > + if (last_path != vnic->current_path) > + return; > + > + if (vnic->current_path == &vnic->secondary_path) { > + if (other_path_ok != vnic->primary_path.carrier) { > + if (other_path_ok) > + printk(KERN_INFO PFX "%s: primary path no" > + " longer available for failover\n", > + vnic->config->name); > + else > + printk(KERN_INFO PFX "%s: primary path now" > + " available for failover\n", > + vnic->config->name); > + } > + } else { > + if (other_path_ok != vnic->secondary_path.carrier) { > + if (other_path_ok) > + printk(KERN_INFO PFX "%s: secondary path no" > + " longer available for failover\n", > + vnic->config->name); > + else > + printk(KERN_INFO PFX "%s: secondary path now" > + " available for failover\n", > + vnic->config->name); > + } > + } > +} > + > +static void vnic_handle_free_vnic_evt(struct vnic *vnic) > +{ > + unsigned long flags; > + > + if (!netif_queue_stopped(vnic->netdevice)) > + netif_stop_queue(vnic->netdevice); > + > + netpath_timer_stop(&vnic->primary_path); > + netpath_timer_stop(&vnic->secondary_path); > + spin_lock_irqsave(&vnic->current_path_lock, flags); > + vnic->current_path = NULL; > + 
spin_unlock_irqrestore(&vnic->current_path_lock, flags); > + netpath_free(&vnic->primary_path); > + netpath_free(&vnic->secondary_path); > + if (vnic->state == VNIC_REGISTERED) > + unregister_netdev(vnic->netdevice); > + > + vnic_npevent_dequeue_all(vnic); > + kfree(vnic->config); > + if (vnic->mc_list_len) { > + vnic->mc_list_len = vnic->mc_count = 0; > + kfree(vnic->mc_list); > + } > + > + sysfs_remove_group(&vnic->dev_info.dev.kobj, > + &vnic_dev_attr_group); > + vnic_cleanup_stats_files(vnic); > + device_unregister(&vnic->dev_info.dev); > + wait_for_completion(&vnic->dev_info.released); > + free_netdev(vnic->netdevice); > +} > + > +static struct vnic *vnic_handle_npevent(struct vnic *vnic, > + enum vnic_npevent_type npevt_type) > +{ > + struct netpath *netpath; > + const char *netpath_str; > + > + if (npevt_type <= VNIC_PRINP_LASTTYPE) > + netpath_str = netpath_to_string(vnic, &vnic->primary_path); > + else if (npevt_type <= VNIC_SECNP_LASTTYPE) > + netpath_str = netpath_to_string(vnic, &vnic->secondary_path); > + else > + netpath_str = netpath_to_string(vnic, vnic->current_path); > + > + VNIC_INFO("%s: processing %s, netpath=%s, carrier=%d\n", > + vnic->config->name, vnic_npevent_str[npevt_type], > + netpath_str, vnic->carrier); > + > + switch (npevt_type) { > + case VNIC_PRINP_CONNECTED: > + netpath = &vnic->primary_path; > + if (vnic->state == VNIC_UNINITIALIZED) { > + if (vnic_npevent_register(vnic, netpath)) > + break; > + } > + vnic_set_uni_multicast(vnic, netpath); > + break; > + case VNIC_SECNP_CONNECTED: > + vnic_set_uni_multicast(vnic, &vnic->secondary_path); > + break; > + case VNIC_PRINP_TIMEREXPIRED: > + netpath = &vnic->primary_path; > + netpath->timer_state = NETPATH_TS_EXPIRED; > + if (!netpath->carrier) > + update_path_and_reconnect(netpath, vnic); > + break; > + case VNIC_SECNP_TIMEREXPIRED: > + netpath = &vnic->secondary_path; > + netpath->timer_state = NETPATH_TS_EXPIRED; > + if (!netpath->carrier) > + update_path_and_reconnect(netpath, vnic); > + else { > + if (vnic->state == VNIC_UNINITIALIZED) > + vnic_npevent_register(vnic, netpath); > + } > + break; > + case VNIC_PRINP_LINKUP: > + vnic->primary_path.carrier = 1; > + break; > + case VNIC_SECNP_LINKUP: > + netpath = &vnic->secondary_path; > + netpath->carrier = 1; > + if (!vnic->carrier) > + vnic_set_netpath_timers(vnic, netpath); > + break; > + case VNIC_PRINP_LINKDOWN: > + vnic->primary_path.carrier = 0; > + break; > + case VNIC_SECNP_LINKDOWN: > + if (vnic->state == VNIC_UNINITIALIZED) > + netpath_timer_stop(&vnic->secondary_path); > + vnic->secondary_path.carrier = 0; > + break; > + case VNIC_PRINP_DISCONNECTED: > + netpath = &vnic->primary_path; > + netpath_timer_stop(netpath); > + netpath->carrier = 0; > + update_path_and_reconnect(netpath, vnic); > + break; > + case VNIC_SECNP_DISCONNECTED: > + netpath = &vnic->secondary_path; > + netpath_timer_stop(netpath); > + netpath->carrier = 0; > + update_path_and_reconnect(netpath, vnic); > + break; > + case VNIC_PRINP_SETLINK: > + netpath = vnic->current_path; > + if (!netpath || !netpath->viport) > + break; > + viport_set_link(netpath->viport, > + vnic->netdevice->flags, > + vnic->netdevice->mtu); > + break; > + case VNIC_SECNP_SETLINK: > + netpath = &vnic->secondary_path; > + if (!netpath || !netpath->viport) > + break; > + viport_set_link(netpath->viport, > + vnic->netdevice->flags, > + vnic->netdevice->mtu); > + break; > + case VNIC_NP_FREEVNIC: > + vnic_handle_free_vnic_evt(vnic); > + vnic = NULL; > + break; > + } > + return vnic; > +} > + > +static int 
vnic_npevent_statemachine(void *context) > +{ > + struct vnic_npevent *vnic_link_evt; > + enum vnic_npevent_type npevt_type; > + struct vnic *vnic; > + int last_carrier; > + int other_path_ok = 0; > + struct netpath *last_path; > + > + while (!vnic_npevent_thread_end || > + !list_empty(&vnic_npevent_list)) { > + unsigned long flags; > + > + wait_event_interruptible(vnic_npevent_queue, > + !list_empty(&vnic_npevent_list) > + || vnic_npevent_thread_end); > + spin_lock_irqsave(&vnic_npevent_list_lock, flags); > + if (list_empty(&vnic_npevent_list)) { > + spin_unlock_irqrestore(&vnic_npevent_list_lock, > + flags); > + VNIC_INFO("netpath statemachine wake" > + " on empty list\n"); > + continue; > + } > + > + vnic_link_evt = list_entry(vnic_npevent_list.next, > + struct vnic_npevent, > + list_ptrs); You could use new list_first_entry macro here. > + list_del(&vnic_link_evt->list_ptrs); > + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); > + vnic = vnic_link_evt->vnic; > + npevt_type = vnic_link_evt->event_type; > + kfree(vnic_link_evt); > + > + if (vnic->current_path == &vnic->secondary_path) > + other_path_ok = vnic->primary_path.carrier; > + else if (vnic->current_path == &vnic->primary_path) > + other_path_ok = vnic->secondary_path.carrier; > + > + vnic = vnic_handle_npevent(vnic, npevt_type); > + > + if (!vnic) > + continue; > + > + last_carrier = vnic->carrier; > + last_path = vnic->current_path; > + > + if (!vnic->current_path || > + !vnic->current_path->carrier) { > + vnic->carrier = 0; > + vnic->current_path = NULL; > + vnic->netdevice->features &= ~NETIF_F_IP_CSUM; > + } > + > + if (!vnic->carrier) > + vnic_carrier_loss(vnic, last_path); > + else if ((vnic->current_path != &vnic->primary_path) && > + (vnic->config->prefer_primary) && > + (vnic->primary_path.carrier)) > + vnic_check_primary_path_timer(vnic); > + > + if (last_path) > + vnic_report_path_change(vnic, last_path, > + other_path_ok); > + > + VNIC_INFO("new netpath=%s, carrier=%d\n", > + netpath_to_string(vnic, vnic->current_path), > + vnic->carrier); > + > + if (vnic->current_path != last_path) > + vnic_handle_path_change(vnic, &last_path); > + > + if (vnic->carrier != last_carrier) { > + if (vnic->carrier) { > + VNIC_INFO("netif_carrier_on\n"); > + netif_carrier_on(vnic->netdevice); > + vnic_carrier_loss_stats(vnic); > + } else { > + VNIC_INFO("netif_carrier_off\n"); > + netif_carrier_off(vnic->netdevice); > + vnic_disconn_stats(vnic); > + } > + > + } > + } > + complete_and_exit(&vnic_npevent_thread_exit, 0); > + return 0; > +} > + > +void vnic_npevent_queue_evt(struct netpath *netpath, > + enum vnic_npevent_type evt) > +{ > + struct vnic_npevent *npevent; > + unsigned long flags; > + > + npevent = kmalloc(sizeof *npevent, GFP_ATOMIC); > + if (!npevent) { > + VNIC_ERROR("Could not allocate memory for vnic event\n"); > + return; > + } > + npevent->vnic = netpath->parent; > + npevent->event_type = evt; > + INIT_LIST_HEAD(&npevent->list_ptrs); > + spin_lock_irqsave(&vnic_npevent_list_lock, flags); > + list_add_tail(&npevent->list_ptrs, &vnic_npevent_list); > + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); > + wake_up(&vnic_npevent_queue); > +} > + > +void vnic_npevent_dequeue_evt(struct netpath *netpath, > + enum vnic_npevent_type evt) > +{ > + unsigned long flags; > + struct vnic_npevent *npevt, *tmp; > + struct vnic *vnic = netpath->parent; > + > + spin_lock_irqsave(&vnic_npevent_list_lock, flags); > + if (list_empty(&vnic_npevent_list)) > + goto out; > + list_for_each_entry_safe(npevt, tmp, 
&vnic_npevent_list, > + list_ptrs) { > + if ((npevt->vnic == vnic) && > + (npevt->event_type == evt)) { > + list_del(&npevt->list_ptrs); > + kfree(npevt); > + break; > + } > + } > +out: > + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); > +} > + > +static int vnic_npevent_start(void) > +{ > + VNIC_FUNCTION("vnic_npevent_start()\n"); > + > + spin_lock_init(&vnic_npevent_list_lock); > + vnic_npevent_thread = kthread_run(vnic_npevent_statemachine, NULL, > + "qlgc_vnic_npevent_s_m"); > + if (IS_ERR(vnic_npevent_thread)) { > + printk(KERN_WARNING PFX "failed to create vnic npevent" > + " thread; error %d\n", > + (int) PTR_ERR(vnic_npevent_thread)); > + vnic_npevent_thread = NULL; > + return 1; > + } > + > + return 0; > +} > + > +void vnic_npevent_cleanup(void) > +{ > + if (vnic_npevent_thread) { > + vnic_npevent_thread_end = 1; > + wake_up(&vnic_npevent_queue); > + wait_for_completion(&vnic_npevent_thread_exit); > + vnic_npevent_thread = NULL; > + } > +} > + > +static void vnic_setup(struct net_device *device) > +{ > + ether_setup(device); > + > + /* ether_setup is used to fill > + * device parameters for ethernet devices. > + * We override some of the parameters > + * which are specific to VNIC. > + */ > + device->get_stats = vnic_get_stats; > + device->open = vnic_open; > + device->stop = vnic_stop; > + device->hard_start_xmit = vnic_hard_start_xmit; > + device->tx_timeout = vnic_tx_timeout; > + device->set_multicast_list = vnic_set_multicast_list; > + device->set_mac_address = vnic_set_mac_address; > + device->change_mtu = vnic_change_mtu; > + device->watchdog_timeo = 10 * HZ; > + device->features = 0; > +} > + > +struct vnic *vnic_allocate(struct vnic_config *config) > +{ > + struct vnic *vnic = NULL; > + struct net_device *netdev; > + > + VNIC_FUNCTION("vnic_allocate()\n"); > + netdev = alloc_netdev((int) sizeof(*vnic), config->name, vnic_setup); > + if (!netdev) { > + VNIC_ERROR("failed allocating vnic structure\n"); > + return NULL; > + } > + > + vnic = netdev_priv(netdev); > + vnic->netdevice = netdev; > + spin_lock_init(&vnic->lock); > + spin_lock_init(&vnic->current_path_lock); > + vnic_alloc_stats(vnic); > + vnic->state = VNIC_UNINITIALIZED; > + vnic->config = config; > + > + netpath_init(&vnic->primary_path, vnic, 0); > + netpath_init(&vnic->secondary_path, vnic, 1); > + > + vnic->current_path = NULL; > + vnic->failed_over = 0; > + > + list_add_tail(&vnic->list_ptrs, &vnic_list); > + > + return vnic; > +} > + > +void vnic_free(struct vnic *vnic) > +{ > + VNIC_FUNCTION("vnic_free()\n"); > + list_del(&vnic->list_ptrs); > + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_NP_FREEVNIC); > +} > + > +static void __exit vnic_cleanup(void) > +{ > + VNIC_FUNCTION("vnic_cleanup()\n"); > + > + VNIC_INIT("unloading %s\n", MODULEDETAILS); > + > + while (!list_empty(&vnic_list)) { > + struct vnic *vnic = > + list_entry(vnic_list.next, struct vnic, list_ptrs); Another place to use list_first_entry > + vnic_free(vnic); > + } > + > + vnic_npevent_cleanup(); > + viport_cleanup(); > + vnic_ib_cleanup(); > +} > + > +static int __init vnic_init(void) > +{ > + int ret; > + VNIC_FUNCTION("vnic_init()\n"); > + VNIC_INIT("Initializing %s\n", MODULEDETAILS); > + > + ret = config_start(); > + if (ret) { > + VNIC_ERROR("config_start failed\n"); > + goto failure; > + } > + > + ret = vnic_ib_init(); > + if (ret) { > + VNIC_ERROR("ib_start failed\n"); > + goto failure; > + } > + > + ret = viport_start(); > + if (ret) { > + VNIC_ERROR("viport_start failed\n"); > + goto failure; > + } > + > + ret = 
vnic_npevent_start(); > + if (ret) { > + VNIC_ERROR("vnic_npevent_start failed\n"); > + goto failure; > + } > + > + return 0; > +failure: > + vnic_cleanup(); > + return ret; > +} > + > +module_init(vnic_init); > +module_exit(vnic_cleanup); > diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h > new file mode 100644 > index 0000000..7535124 > --- /dev/null > +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_main.h > @@ -0,0 +1,154 @@ > +/* > + * Copyright (c) 2006 QLogic, Inc. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + */ > + > +#ifndef VNIC_MAIN_H_INCLUDED > +#define VNIC_MAIN_H_INCLUDED > + > +#include > +#include > +#include > +#include > + > +#include "vnic_config.h" > +#include "vnic_netpath.h" > + > +extern u16 vnic_max_mtu; > +extern struct list_head vnic_list; > +extern struct attribute_group vnic_stats_attr_group; > +extern cycles_t vnic_recv_ref; > + > +enum vnic_npevent_type { > + VNIC_PRINP_CONNECTED = 0, > + VNIC_PRINP_DISCONNECTED = 1, > + VNIC_PRINP_LINKUP = 2, > + VNIC_PRINP_LINKDOWN = 3, > + VNIC_PRINP_TIMEREXPIRED = 4, > + VNIC_PRINP_SETLINK = 5, > + > + /* used to figure out PRI vs SEC types for dbg msg*/ > + VNIC_PRINP_LASTTYPE = VNIC_PRINP_SETLINK, > + > + VNIC_SECNP_CONNECTED = 6, > + VNIC_SECNP_DISCONNECTED = 7, > + VNIC_SECNP_LINKUP = 8, > + VNIC_SECNP_LINKDOWN = 9, > + VNIC_SECNP_TIMEREXPIRED = 10, > + VNIC_SECNP_SETLINK = 11, > + > + /* used to figure out PRI vs SEC types for dbg msg*/ > + VNIC_SECNP_LASTTYPE = VNIC_SECNP_SETLINK, > + > + VNIC_NP_FREEVNIC = 12, > + > + /* > + * NOTE : If any new netpath event is being added, don't forget to > + * add corresponding netpath event string into vnic_main.c. 
> + */ > +}; > + > +struct vnic_npevent { > + struct list_head list_ptrs; > + struct vnic *vnic; > + enum vnic_npevent_type event_type; > +}; > + > +void vnic_npevent_queue_evt(struct netpath *netpath, > + enum vnic_npevent_type evt); > +void vnic_npevent_dequeue_evt(struct netpath *netpath, > + enum vnic_npevent_type evt); > + > +enum vnic_state { > + VNIC_UNINITIALIZED = 0, > + VNIC_REGISTERED = 1 > +}; > + > +struct vnic { > + struct list_head list_ptrs; > + enum vnic_state state; > + struct vnic_config *config; > + struct netpath *current_path; > + struct netpath primary_path; > + struct netpath secondary_path; > + int open; > + int carrier; > + int failed_over; > + int mac_set; > + struct net_device_stats stats; > + struct net_device *netdevice; > + struct dev_info dev_info; > + struct dev_mc_list *mc_list; > + int mc_list_len; > + int mc_count; > + spinlock_t lock; > + spinlock_t current_path_lock; > +#ifdef CONFIG_INFINIBAND_QLGC_VNIC_STATS > + struct { > + cycles_t start_time; > + cycles_t conn_time; > + cycles_t disconn_ref; /* intermediate time */ > + cycles_t disconn_time; > + u32 disconn_num; > + cycles_t xmit_time; > + u32 xmit_num; > + u32 xmit_fail; > + cycles_t recv_time; > + u32 recv_num; > + u32 multicast_recv_num; > + cycles_t xmit_ref; /* intermediate time */ > + cycles_t xmit_off_time; > + u32 xmit_off_num; > + cycles_t carrier_ref; /* intermediate time */ > + cycles_t carrier_off_time; > + u32 carrier_off_num; > + } statistics; > + struct dev_info stat_info; > +#endif /* CONFIG_INFINIBAND_QLGC_VNIC_STATS */ > +}; > + > +struct vnic *vnic_allocate(struct vnic_config *config); > + > +void vnic_free(struct vnic *vnic); > + > +void vnic_connected(struct vnic *vnic, struct netpath *netpath); > +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath); > + > +void vnic_link_up(struct vnic *vnic, struct netpath *netpath); > +void vnic_link_down(struct vnic *vnic, struct netpath *netpath); > + > +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath); > +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath); > + > +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, > + struct sk_buff *skb); > +void vnic_npevent_cleanup(void); > +void completion_callback_cleanup(struct vnic_ib_conn *ib_conn); Change name to vnic_complete_cleanup or something like that for consistency. > +#endif /* VNIC_MAIN_H_INCLUDED */ > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html From shemminger at vyatta.com Thu May 29 10:30:03 2008 From: shemminger at vyatta.com (Stephen Hemminger) Date: Thu, 29 May 2008 10:30:03 -0700 Subject: [ofa-general] Re: [PATCH v3 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <20080529095754.9943.27936.stgit@localhost.localdomain> References: <20080529095126.9943.84692.stgit@localhost.localdomain> <20080529095754.9943.27936.stgit@localhost.localdomain> Message-ID: <20080529103003.010c4a08@extreme> On Thu, 29 May 2008 15:27:54 +0530 Ramachandra K wrote: > From: Amar Mudrankit > > The sysfs interface for the QLogic VNIC driver is implemented through > this patch. 
>
> Signed-off-by: Amar Mudrankit
> Signed-off-by: Ramachandra K
> Signed-off-by: Poornima Kamath
> ---
>
>  drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1133 +++++++++++++++++++++++++++
>  drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h |   51 +
>  2 files changed, 1184 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c
>  create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h
>
> diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c
> new file mode 100644
> index 0000000..40b3c77
> --- /dev/null
> +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c
> @@ -0,0 +1,1133 @@
> +/*
> + * Copyright (c) 2006 QLogic, Inc. All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + * - Redistributions of source code must retain the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer.
> + *
> + * - Redistributions in binary form must reproduce the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer in the documentation and/or other materials
> + * provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include
> +#include
> +#include
> +
> +#include "vnic_util.h"
> +#include "vnic_config.h"
> +#include "vnic_ib.h"
> +#include "vnic_viport.h"
> +#include "vnic_main.h"
> +#include "vnic_stats.h"
> +
> +/*
> + * target eiocs are added by writing
> + *
> + * ioc_guid=,dgid=,pkey=,name=
> + * to the create_primary sysfs attribute.
> + */
> +enum {
> +	VNIC_OPT_ERR = 0,
> +	VNIC_OPT_IOC_GUID = 1 << 0,
> +	VNIC_OPT_DGID = 1 << 1,
> +	VNIC_OPT_PKEY = 1 << 2,
> +	VNIC_OPT_NAME = 1 << 3,
> +	VNIC_OPT_INSTANCE = 1 << 4,
> +	VNIC_OPT_RXCSUM = 1 << 5,
> +	VNIC_OPT_TXCSUM = 1 << 6,
> +	VNIC_OPT_HEARTBEAT = 1 << 7,
> +	VNIC_OPT_IOC_STRING = 1 << 8,
> +	VNIC_OPT_IB_MULTICAST = 1 << 9,
> +	VNIC_OPT_ALL = (VNIC_OPT_IOC_GUID |
> +			VNIC_OPT_DGID | VNIC_OPT_NAME | VNIC_OPT_PKEY),
> +};
> +
> +static match_table_t vnic_opt_tokens = {
> +	{VNIC_OPT_IOC_GUID, "ioc_guid=%s"},
> +	{VNIC_OPT_DGID, "dgid=%s"},
> +	{VNIC_OPT_PKEY, "pkey=%x"},
> +	{VNIC_OPT_NAME, "name=%s"},
> +	{VNIC_OPT_INSTANCE, "instance=%d"},
> +	{VNIC_OPT_RXCSUM, "rx_csum=%s"},
> +	{VNIC_OPT_TXCSUM, "tx_csum=%s"},
> +	{VNIC_OPT_HEARTBEAT, "heartbeat=%d"},
> +	{VNIC_OPT_IOC_STRING, "ioc_string=\"%s"},
> +	{VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"},
> +	{VNIC_OPT_ERR, NULL}
> +};
>

No, sysfs is supposed to be one value per file; use separate attributes
for each one. This also eliminates the parsing code.
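(For reference, the one-value-per-file style being requested looks
roughly like this; the pkey attribute, the config->pkey field, and the
drvdata lookup are illustrative stand-ins, not code from the patch:)

/* One parameter per attribute file; no token parsing needed. */
static ssize_t show_pkey(struct device *dev,
			 struct device_attribute *attr, char *buf)
{
	struct vnic *vnic = dev_get_drvdata(dev);	/* hypothetical lookup */

	return sprintf(buf, "0x%04x\n", vnic->config->pkey);
}

static ssize_t store_pkey(struct device *dev,
			  struct device_attribute *attr,
			  const char *buf, size_t count)
{
	struct vnic *vnic = dev_get_drvdata(dev);

	vnic->config->pkey = simple_strtoul(buf, NULL, 0);
	return count;
}

static DEVICE_ATTR(pkey, S_IRUGO | S_IWUSR, show_pkey, store_pkey);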
From greg at kroah.com Thu May 29 10:48:05 2008 From: greg at kroah.com (Greg KH) Date: Thu, 29 May 2008 10:48:05 -0700 Subject: [ofa-general] Re: [PATCH v3 08/13] QLogic VNIC: sysfs interface implementation for the driver In-Reply-To: <20080529103003.010c4a08@extreme> References: <20080529095126.9943.84692.stgit@localhost.localdomain> <20080529095754.9943.27936.stgit@localhost.localdomain> <20080529103003.010c4a08@extreme> Message-ID: <20080529174805.GA10903@kroah.com> On Thu, May 29, 2008 at 10:30:03AM -0700, Stephen Hemminger wrote: > On Thu, 29 May 2008 15:27:54 +0530 > Ramachandra K wrote: > > > From: Amar Mudrankit > > > > The sysfs interface for the QLogic VNIC driver is implemented through > > this patch. > > > > Signed-off-by: Amar Mudrankit > > Signed-off-by: Ramachandra K > > Signed-off-by: Poornima Kamath > > --- > > > > drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c | 1133 +++++++++++++++++++++++++++ > > drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h | 51 + > > 2 files changed, 1184 insertions(+), 0 deletions(-) > > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c > > create mode 100644 drivers/infiniband/ulp/qlgc_vnic/vnic_sys.h > > > > diff --git a/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c > > new file mode 100644 > > index 0000000..40b3c77 > > --- /dev/null > > +++ b/drivers/infiniband/ulp/qlgc_vnic/vnic_sys.c > > @@ -0,0 +1,1133 @@ > > +/* > > + * Copyright (c) 2006 QLogic, Inc. All rights reserved. > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. > > + */ > > + > > +#include > > +#include > > +#include > > + > > +#include "vnic_util.h" > > +#include "vnic_config.h" > > +#include "vnic_ib.h" > > +#include "vnic_viport.h" > > +#include "vnic_main.h" > > +#include "vnic_stats.h" > > + > > +/* > > + * target eiocs are added by writing > > + * > > + * ioc_guid=,dgid=,pkey=,name= > > + * to the create_primary sysfs attribute. 
> > + */ > > +enum { > > + VNIC_OPT_ERR = 0, > > + VNIC_OPT_IOC_GUID = 1 << 0, > > + VNIC_OPT_DGID = 1 << 1, > > + VNIC_OPT_PKEY = 1 << 2, > > + VNIC_OPT_NAME = 1 << 3, > > + VNIC_OPT_INSTANCE = 1 << 4, > > + VNIC_OPT_RXCSUM = 1 << 5, > > + VNIC_OPT_TXCSUM = 1 << 6, > > + VNIC_OPT_HEARTBEAT = 1 << 7, > > + VNIC_OPT_IOC_STRING = 1 << 8, > > + VNIC_OPT_IB_MULTICAST = 1 << 9, > > + VNIC_OPT_ALL = (VNIC_OPT_IOC_GUID | > > + VNIC_OPT_DGID | VNIC_OPT_NAME | VNIC_OPT_PKEY), > > +}; > > + > > +static match_table_t vnic_opt_tokens = { > > + {VNIC_OPT_IOC_GUID, "ioc_guid=%s"}, > > + {VNIC_OPT_DGID, "dgid=%s"}, > > + {VNIC_OPT_PKEY, "pkey=%x"}, > > + {VNIC_OPT_NAME, "name=%s"}, > > + {VNIC_OPT_INSTANCE, "instance=%d"}, > > + {VNIC_OPT_RXCSUM, "rx_csum=%s"}, > > + {VNIC_OPT_TXCSUM, "tx_csum=%s"}, > > + {VNIC_OPT_HEARTBEAT, "heartbeat=%d"}, > > + {VNIC_OPT_IOC_STRING, "ioc_string=\"%s"}, > > + {VNIC_OPT_IB_MULTICAST, "ib_multicast=%s"}, > > + {VNIC_OPT_ERR, NULL} > > +}; > > > > No sysfs is supposed to be one value per file use separate attributes > for each one. This also eliminates the parsing code. Also, every new sysfs file needs to have an entry in Documentation/ABI/ which shows how to use it and what the contents are. And yes, multiple values per sysfs file are not allowed, sorry, please change this. If you need to configure your device through an interface like this, consider using configfs instead, that is what it is there for. thanks, greg k-h From matthias at sgi.com Thu May 29 11:32:47 2008 From: matthias at sgi.com (Matthias Blankenhaus) Date: Thu, 29 May 2008 11:32:47 -0700 (PDT) Subject: [ofa-general] saquery port problems In-Reply-To: <20080522073703.GA31474@sashak.voltaire.com> References: <20080522073703.GA31474@sashak.voltaire.com> Message-ID: On Thu, 22 May 2008, Sasha Khapyorsky wrote: > Hi Matthias, > > On 16:48 Wed 21 May , Matthias Blankenhaus wrote: > > I have a patch that fixes the problem: > > > > diff -Narpu infiniband-diags-1.3.6.vanilla/src/saquery.c my/src/saquery.c > > --- infiniband-diags-1.3.6.vanilla/src/saquery.c 2008-02-28 > > 00:58:36.000000000 -0800 > > +++ my/src/saquery.c 2008-05-21 16:08:19.583221794 -0700 > > @@ -1304,13 +1304,13 @@ get_bind_handle(void) > > ca_name_index++; > > if (sa_port_num && sa_port_num != attr_array[i].port_num) > > continue; > > - if (sa_hca_name && i == 0) > > - continue; > > if (sa_hca_name > > && strcmp(sa_hca_name, vendor->ca_names[ca_name_index]) != 0) > > continue; > > - if (attr_array[i].link_state == IB_LINK_ACTIVE) > > + if (attr_array[i].link_state == IB_LINK_ACTIVE) { > > port_guid = attr_array[i].port_guid; > > + break; > > + } > > } > > > > > > I have tested it and it solves the problem. > > > > Does this look ok ? > > Yes, this looks correct. Thanks for fixing this. I just will need your > 'Signed-off-by:' line in order to apply the patch. Sorry, I don't know what that is :-) This is my first patch for OFED, excuse my ignorance. 
Please let me know if this helps:

Signed-off-by: matthias at sgi.com

Matthias

> 
> Sasha
> 

From tziporet at mellanox.co.il Thu May 29 11:34:48 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 29 May 2008 21:34:48 +0300
Subject: [ofa-general] OFED 1.3.1 RC3 release is available
Message-ID: <6C2C79E72C305246B504CBA17B5500C90282E6E3@mtlexch01.mtl.com>

Hi,
OFED 1.3.1 RC3 release is available on
http://www.openfabrics.org/downloads/OFED/ofed-1.3.1/OFED-1.3.1-rc3.tgz

To get BUILD_ID run ofed_info

Please report any issues in Bugzilla https://bugs.openfabrics.org/

The GA version is expected next week.

Release information:
--------------------
Linux Operating Systems:
   - RedHat EL4 up4:  2.6.9-42.ELsmp
   - RedHat EL4 up5:  2.6.9-55.ELsmp
   - RedHat EL4 up6:  2.6.9-67.ELsmp
   - RedHat EL5:      2.6.18-8.el5
   - RedHat EL5 up1:  2.6.18-53.el5
   - RedHat EL5 up2 beta: 2.6.18-84.el5 *
   - Fedora C6:       2.6.18-8.fc6 *
   - SLES10:          2.6.16.21-0.8-smp
   - SLES10 SP1:      2.6.16.46-0.12-smp
   - SLES10 SP1 up1:  2.6.16.53-0.16-smp
   - SLES10 SP2:      2.6.16.60-0.21-smp *
   - OpenSuSE 10.3:   2.6.22-*-* *
   - kernel.org:      2.6.23 and 2.6.24

* OSes that are partially tested

Systems:
  * x86_64
  * x86
  * ia64
  * ppc64

Main changes from OFED 1.3.1-rc2
================================
* Updated utilities:
  * mstflint
  * ibutils: Added rcv/snd data/pkt counters to port counters fetch (-pm)
  * opensm version 3.1.11
* ULPs changes:
  * RDS:
    - Fix a bug in RDMA signaling
    - Add 3 more stats counters
    - Fix kernel oops: swiotlb_unmap_sg+0x35/0x126
  * IPoIB:
    - Fix alignment of small SKBs in CM mode receive
    - Set max CM MTU when moving to CM mode
    - Fix neigh destructor oops on kernel versions between 2.6.17 and 2.6.20
* General:
  * 90-ib.rules: fix uat that has been deprecated

Main changes from OFED 1.3.1-rc1
================================
* Added backports for the OSes (with very limited testing):
  * SLES10 SP2 with kernel 2.6.16.60-0.21-smp
  * RedHat EL5 up2 beta with kernel 2.6.18-84.el5
* MPI packages update:
  * mvapich-1.0.1-2481
* Updated libraries:
  * dapl-v1 1.2.7-1
  * dapl-v2 2.0.9-1
  * libcxgb3 1.2.1
* ULPs changes:
  * OpenSM: Fix segmentation fault
  * iSER: Bug fixes since 2.6.24
  * RDS: fixes for RDMA API
  * IPoIB: Fix several kernel crashes (see attached list)
* Updated low level drivers:
  * nes
  * mlx4
  * cxgb3
  * ehca
  * ipath

Main Changes from OFED-1.3:
===========================
* MPI packages update:
  * mvapich-1.0.1-2434
  * mvapich2-1.0.3-1
  * openmpi-1.2.6-1
* Updated libraries:
  * dapl-v1 1.2.6
  * dapl-v2 2.0.8
  * libcxgb3 1.2.0
  * librdmacm 1.0.7
* ULPs changes:
  * IB Bonding: ib-bonding-0.9.0-24
  * IPoIB bug fixes
  * RDS fixes for RDMA API
  * SRP failover
* Updated low level drivers:
  * nes
  * mlx4
  * cxgb3
  * ehca

Vlad & Tziporet

From okir at lst.de Thu May 29 11:38:34 2008
From: okir at lst.de (Olaf Kirch)
Date: Thu, 29 May 2008 20:38:34 +0200
Subject: [ofa-general] Fwd: [rds-devel] RDS: Fix a bug in RDMA signalling
In-Reply-To: <483C0D40.5060902@dev.mellanox.co.il>
References: <200805271014.30175.okir@lst.de> <483C0D40.5060902@dev.mellanox.co.il>
Message-ID: <200805292038.35476.okir@lst.de>

On Tuesday 27 May 2008 15:31:44 Vladimir Sokolovsky wrote:
> Applied to OFED-1.3.1 kernel git tree.

Thanks a lot, Vlad!
Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From olaf.kirch at oracle.com Thu May 29 11:40:55 2008 From: olaf.kirch at oracle.com (Olaf Kirch) Date: Thu, 29 May 2008 20:40:55 +0200 Subject: [rds-devel] [ofa-general] Port space sharing in RDS In-Reply-To: <20080529000354.GD6288@opengridcomputing.com> References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> Message-ID: <200805292040.56097.olaf.kirch@oracle.com> On Thursday 29 May 2008 02:03:54 Jon Mason wrote: > On Wed, May 28, 2008 at 04:33:06PM -0700, Sean Hefty wrote: > > >During RDS init, rds_ib_init and rds_tcp_init will both individually bind to > > >the > > >RDS port for all IP addresses. Unfortunately, that will not work for iWARP for > > >2 major reasons. > > > > Can RDS use different port numbers for its RDMA and TCP protocols? The wire > > I do not know if this is desirable, but a quick test shows that having TCP and IB > on different ports works around the problem. Okay, fine with me. Since TCP is disabled in 1.3 anyway, this shouldn't be an issue there, but it'll certainly crop up - I'm re-enabling TCP for 1.4. Care to send me a patch? Any preference as to the port number? Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From eaburns at iol.unh.edu Thu May 29 11:44:25 2008 From: eaburns at iol.unh.edu (Ethan Burns) Date: Thu, 29 May 2008 14:44:25 -0400 Subject: [ofa-general] UNH-iSCSI v2.0 with iSER Message-ID: <20080529184425.GA32717@postal.iol.unh.edu> Hello, I would like to inform everyone who is interested that the UNH-iSCSI sourceforge page [1] has been updated to include the latest version of the UNH-iSCSI initiator and target. This latest version: - includes major and minor bug fixes - includes iSER support - allows for compilation in user-space (for development purposes) - *finally* has support for selecting which DISKIO mode SCSI devices are offered by the target to the initiator. The code has been implemented and tested with RHEL5 (and OFED-1.3 when iSER mode is enabled). There are still some rough edges that need to be smoothed out, but hopefully the community will be able to help out here. Further, the iSER support has only been tested over iWARP. Thanks, Ethan Burns [1] https://sourceforge.net/projects/unh-iscsi From jon at opengridcomputing.com Thu May 29 11:55:25 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Thu, 29 May 2008 13:55:25 -0500 Subject: [rds-devel] [ofa-general] Port space sharing in RDS In-Reply-To: <200805292040.56097.olaf.kirch@oracle.com> References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> <200805292040.56097.olaf.kirch@oracle.com> Message-ID: <20080529185525.GD7299@opengridcomputing.com> On Thu, May 29, 2008 at 08:40:55PM +0200, Olaf Kirch wrote: > On Thursday 29 May 2008 02:03:54 Jon Mason wrote: > > On Wed, May 28, 2008 at 04:33:06PM -0700, Sean Hefty wrote: > > > >During RDS init, rds_ib_init and rds_tcp_init will both individually bind to > > > >the > > > >RDS port for all IP addresses. Unfortunately, that will not work for iWARP for > > > >2 major reasons. > > > > > > Can RDS use different port numbers for its RDMA and TCP protocols? 
The wire > > > > I do not know if this is desirable, but a quick test shows that having TCP and IB > > on different ports works around the problem. > > Okay, fine with me. Since TCP is disabled in 1.3 anyway, this shouldn't be an > issue there, but it'll certainly crop up - I'm re-enabling TCP for 1.4. > > Care to send me a patch? Any preference as to the port number? Sure, I'll be happy to send a patch. The port numbers I picked were simply the current one for use in TCP and the next one for IB. Obviously, I will need to verify that there are no conflicts with its usage. I'll check this out and send it out shortly. Thanks, Jon > > Olaf > -- > Olaf Kirch | --- o --- Nous sommes du soleil we love when we play > okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From hrosenstock at xsigo.com Thu May 29 12:25:07 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 12:25:07 -0700 Subject: [ofa-general] [PATCH][TRIVIAL] opensm/osm_sa_mcmember_record.c: Improve log message and some comments relating to SNM Message-ID: <1212089107.17997.122.camel@hrosenstock-ws.xsigo.com> opensm/osm_sa_mcmember_record.c: Improve log message and some comments relating to SNM (solicited node multicast) Signed-off-by: Hal Rosenstock diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c index fd6714c..040068f 100644 --- a/opensm/opensm/osm_sa_mcmember_record.c +++ b/opensm/opensm/osm_sa_mcmember_record.c @@ -1082,10 +1082,10 @@ __search_mgrp_by_mgid(IN cl_map_item_t * const p_map_item, IN void *context) if (memcmp(&p_mgrp->mcmember_rec.mgid, p_recvd_mgid, sizeof(ib_gid_t))) { if (sa->p_subn->opt.consolidate_ipv6_snm_req) { - /* Special Case IPV6 Multicast Loopback addresses */ + /* Special Case IPv6 Solicited Node Multicast (SNM) addresses */ /* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */ - /* Where XXXX is the partition and YYYYYY is the last 24 bits - * of the port guid */ + /* Where XXXX is the P_Key and + * YYYYYY is the last 24 bits of the port guid */ #define PREFIX_MASK (0xff12601b00000000ULL) #define INT_ID_MASK (0x00000001ff000000ULL) uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix); @@ -1099,8 +1099,8 @@ __search_mgrp_by_mgid(IN cl_map_item_t * const p_map_item, IN void *context) (g_interface_id & INT_ID_MASK) == (rcv_interface_id & INT_ID_MASK)) { OSM_LOG(sa->p_log, OSM_LOG_INFO, - "Special Case Mcast Join for MGID " - " MGID 0x%016"PRIx64" : 0x%016"PRIx64"\n", + "Special Case Solicited Node Mcast Join " + " for MGID 0x%016"PRIx64" : 0x%016"PRIx64"\n", rcv_prefix, rcv_interface_id); } else return; From hrosenstock at xsigo.com Thu May 29 12:25:10 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 12:25:10 -0700 Subject: [ofa-general] [PATCH][TRIVIAL] opensm/main.c: Minor change to long option for consolidate_ipv6_snm_req Message-ID: <1212089110.17997.123.camel@hrosenstock-ws.xsigo.com> opensm/main.c: Minor change to long option for consolidate_ipv6_snm_req Signed-off-by: Hal Rosenstock diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index fe19a12..05b3dd5 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -658,7 +658,7 @@ int main(int argc, char *argv[]) {"perfmgr_sweep_time_s", 1, NULL, 2}, #endif {"prefix_routes_file", 1, NULL, 3}, - {"consolidate_ipv6_snm_reqests", 0, NULL, 4}, + {"consolidate_ipv6_snm_req", 0, NULL, 4}, {NULL, 0, NULL, 0} /* Required at the end of the array */ }; From hrosenstock at xsigo.com Thu May 29 12:27:12 2008 From: hrosenstock at 
xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 12:27:12 -0700 Subject: [ofa-general] OpenSM IPv6 consolidation Message-ID: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com> Ira, In osm_sa_mcmember_record.c:__search_mgrp_by_mgid, there is: #define PREFIX_MASK (0xff12601b00000000ULL) Shouldn't all scopes be consolidated so this should be: #define PREFIX_MASK (0xff10601b00000000ULL) or was this intentional for some reason ? -- Hal From weiny2 at llnl.gov Thu May 29 14:35:35 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 29 May 2008 14:35:35 -0700 Subject: [ofa-general] Re: OpenSM IPv6 consolidation In-Reply-To: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com> References: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com> Message-ID: <20080529143535.60d02d75.weiny2@llnl.gov> On Thu, 29 May 2008 12:27:12 -0700 Hal Rosenstock wrote: > Ira, > > In osm_sa_mcmember_record.c:__search_mgrp_by_mgid, there is: > > #define PREFIX_MASK (0xff12601b00000000ULL) > > Shouldn't all scopes be consolidated so this should be: > > #define PREFIX_MASK (0xff10601b00000000ULL) > > or was this intentional for some reason ? > It seemed reasonable for this to consolidate link-local only because according to my IPv6 book, solicited node multicast is the particular range, ff02::1:ff00:0/104 However, I am a bit confused about how the scope bits map from the IP address to the MGID. The MGID refers only to the IB-subnet scope _not_ IP, therefore what I said above might not matter because we are now talking about the IB scope. But that begs the question: Can a node issue an SNM request to a node in another IB subnet? (I think the answer is yes if the IP subnet spans more than one IB subnet) In that case, the SNM address would be in the range ff02::1:ff00:0/104 but what MGID would that map onto in IB? I think the current mapping results in an IB link-local scope. So would a router have to forward it even though the IB scope is link-local? Now my head hurts... :-( Ira From meier3 at llnl.gov Thu May 29 14:43:50 2008 From: meier3 at llnl.gov (Timothy A. Meier) Date: Thu, 29 May 2008 14:43:50 -0700 Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root In-Reply-To: <20080525191047.GS4616@sashak.voltaire.com> References: <48358DF8.2060603@llnl.gov> <1211469467.18236.196.camel@hrosenstock-ws.xsigo.com> <20080523103532.GA4640@sashak.voltaire.com> <4836E9B8.2080406@llnl.gov> <20080525191047.GS4616@sashak.voltaire.com> Message-ID: <483F2396.3050008@llnl.gov> Sasha Khapyorsky wrote: > Hi Tim, Ira, > > On 08:58 Fri 23 May , Timothy A. Meier wrote: >> Following Hals advice, authorization is based on the umad permissions. > > I will send some more comments about this method later today. But > basically still think that some things could be broken and that it is > not really trivial to separate in this way wrong usage from desired > behavior reliably (with some approximation it is possible of course). > >> The intent is simply to provide a consistent and >> non-silent fail mechanism. > > OTOH I fully agree with yours and Ira's arguments about this - 'Silent' > fails are bad. I thought about how to solve this and and started to run > diag perl scripts from unprivileged account in various conditions (cache > file exists or not, cache dir is readable or not, etc.). > > First thing I saw was that even on bad usage most scripts return 0. Then > I found that on many failures return status is not checked or ignored > and program return 0. 
I did those two patches (below) and up to now it > works fine for me (but likely I didn't cover everything). What do you > say? > I think this patch is fine, and helps solve the improper "usage" issue. (btw - should we prefer the "adapter" spelling over "adaptor"?) My patch was addressing non-authorized use. Our philosophy was to not allow "any" sort of functionality (even help) if not authorized. Fail, and provide a reason/code. So rather than go through each perl script to see if the proper thing is done (return code is checked, error msg provided, terminate, etc.) each time a privileged function is invoked, we just do it at the beginning of the script, using a common (consistent) function call ( auth_check() ). I don't know if this is the desired behavior, but it would have caught a few problems we have encountered with "silent" failures that produce misleading results. It would also catch any future (unauthorized) scripting issues. On 5-23, I submitted a patch which adds an auth_check() function to the common perl module. I agree, the implementation is non-ideal, but it is probably sufficient for the vast majority of installations. If you think the concept of an auth_check() function is desirable/acceptable, then I will pursue fixing the implementation in a more universal way. > Sasha > > >>From cbbc155996c9f6efe91b78f055a643809b997468 Mon Sep 17 00:00:00 2001 > From: root > Date: Sat, 24 May 2008 11:04:08 +0300 > Subject: [PATCH] infiniband-diags/scripts/*.pl: exit 2 on usage errors > > Add non-zero exit status (2) on usage errors for perl scripts. > > Signed-off-by: root > --- > infiniband-diags/scripts/check_lft_balance.pl | 2 +- > infiniband-diags/scripts/ibfindnodesusing.pl | 2 +- > infiniband-diags/scripts/ibidsverify.pl | 2 +- > infiniband-diags/scripts/iblinkinfo.pl | 2 +- > infiniband-diags/scripts/ibprintca.pl | 2 +- > infiniband-diags/scripts/ibprintrt.pl | 2 +- > infiniband-diags/scripts/ibprintswitch.pl | 2 +- > infiniband-diags/scripts/ibqueryerrors.pl | 2 +- > infiniband-diags/scripts/ibswportwatch.pl | 2 +- > 9 files changed, 9 insertions(+), 9 deletions(-) > > diff --git a/infiniband-diags/scripts/check_lft_balance.pl b/infiniband-diags/scripts/check_lft_balance.pl > index 66f5f0f..b0f0fef 100755 > --- a/infiniband-diags/scripts/check_lft_balance.pl > +++ b/infiniband-diags/scripts/check_lft_balance.pl > @@ -70,7 +70,7 @@ sub usage > print "Usage: $prog [-R -v]\n"; > print " -R recalculate all cached information\n"; > print " -v verbose output\n"; > - exit 0; > + exit 2; > } > > sub is_port_up > diff --git a/infiniband-diags/scripts/ibfindnodesusing.pl b/infiniband-diags/scripts/ibfindnodesusing.pl > index 1bf0987..71656b3 100755 > --- a/infiniband-diags/scripts/ibfindnodesusing.pl > +++ b/infiniband-diags/scripts/ibfindnodesusing.pl > @@ -80,7 +80,7 @@ sub usage_and_exit > print " -R Recalculate ibnetdiscover information\n"; > print " -C use selected Channel Adaptor name for queries\n"; > print " -P use selected channel adaptor port for queries\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git a/infiniband-diags/scripts/ibidsverify.pl b/infiniband-diags/scripts/ibidsverify.pl > index de78e6b..1a236c8 100755 > --- a/infiniband-diags/scripts/ibidsverify.pl > +++ b/infiniband-diags/scripts/ibidsverify.pl > @@ -46,7 +46,7 @@ sub usage_and_exit > print " -h This help message\n"; > print > " -R Recalculate ibnetdiscover information (Default is to reuse ibnetdiscover output)\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git 
a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl > index a195474..a7a3df5 100755 > --- a/infiniband-diags/scripts/iblinkinfo.pl > +++ b/infiniband-diags/scripts/iblinkinfo.pl > @@ -62,7 +62,7 @@ sub usage_and_exit > print " -C use selected Channel Adaptor name for queries\n"; > print " -P use selected channel adaptor port for queries\n"; > print " -g print port guids instead of node guids\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git a/infiniband-diags/scripts/ibprintca.pl b/infiniband-diags/scripts/ibprintca.pl > index 38b4330..0baea0b 100755 > --- a/infiniband-diags/scripts/ibprintca.pl > +++ b/infiniband-diags/scripts/ibprintca.pl > @@ -51,7 +51,7 @@ sub usage_and_exit > print " -l list cas\n"; > print " -C use selected channel adaptor name for queries\n"; > print " -P use selected channel adaptor port for queries\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git a/infiniband-diags/scripts/ibprintrt.pl b/infiniband-diags/scripts/ibprintrt.pl > index 86dcb64..0b3db19 100755 > --- a/infiniband-diags/scripts/ibprintrt.pl > +++ b/infiniband-diags/scripts/ibprintrt.pl > @@ -51,7 +51,7 @@ sub usage_and_exit > print " -l list rts\n"; > print " -C use selected channel adaptor name for queries\n"; > print " -P use selected channel adaptor port for queries\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git a/infiniband-diags/scripts/ibprintswitch.pl b/infiniband-diags/scripts/ibprintswitch.pl > index 6712201..c7377a9 100755 > --- a/infiniband-diags/scripts/ibprintswitch.pl > +++ b/infiniband-diags/scripts/ibprintswitch.pl > @@ -50,7 +50,7 @@ sub usage_and_exit > print " -l list switches\n"; > print " -C use selected channel adaptor name for queries\n"; > print " -P use selected channel adaptor port for queries\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl > index c807c02..5f2e167 100755 > --- a/infiniband-diags/scripts/ibqueryerrors.pl > +++ b/infiniband-diags/scripts/ibqueryerrors.pl > @@ -149,7 +149,7 @@ sub usage_and_exit > print " -d include the data counters in the output\n"; > print " -C use selected Channel Adaptor name for queries\n"; > print " -P use selected channel adaptor port for queries\n"; > - exit 0; > + exit 2; > } > > my $argv0 = `basename $0`; > diff --git a/infiniband-diags/scripts/ibswportwatch.pl b/infiniband-diags/scripts/ibswportwatch.pl > index 6d6ba1c..d888f51 100755 > --- a/infiniband-diags/scripts/ibswportwatch.pl > +++ b/infiniband-diags/scripts/ibswportwatch.pl > @@ -81,7 +81,7 @@ sub usage_and_exit > print " -n run n cycles then exit (default -1 == forever)\n"; > print " -G Address provided is a GUID\n"; > print " -b report bytes/second packets/second\n"; > - exit 0; > + exit 2; > } > > # ========================================================================= -- Timothy A. 
Meier Computer Scientist ICCD/High Performance Computing 925.422.3341 meier3 at llnl.gov From jon at opengridcomputing.com Thu May 29 15:58:24 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Thu, 29 May 2008 17:58:24 -0500 Subject: [ofa-general] [PATCH] rds: use separate ports for TCP and IB In-Reply-To: <200805292040.56097.olaf.kirch@oracle.com> References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> <200805292040.56097.olaf.kirch@oracle.com> Message-ID: <20080529225824.GB7960@opengridcomputing.com> [PATCH] rds: use separate ports for TCP and IB Currently, RDS will bind to a single port during bring up of both the IB and TCP sub-modules. This binding of 2 different processes to a single port causes a port space collision to devices which are aware of both (e.g., iWARP). This prevents iWARP devices from working with RDS if both TCP and IB are compiled in. This patch works around this issue by having IB and TCP bind to separate ports, thus avoiding the port space collision. This enables iWARP to work over RDS TCP. Signed-off-by: Jon Mason diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index a49e394..9935c9b 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -628,7 +628,7 @@ int rds_ib_conn_connect(struct rds_connection *conn) dest.sin_family = AF_INET; dest.sin_addr.s_addr = (__force u32)conn->c_faddr; - dest.sin_port = (__force u16)htons(RDS_PORT); + dest.sin_port = (__force u16)htons(RDS_IB_PORT); ret = rdma_resolve_addr(ic->i_cm_id, (struct sockaddr *)&src, (struct sockaddr *)&dest, @@ -813,7 +813,7 @@ int __init rds_ib_listen_init(void) sin.sin_family = PF_INET, sin.sin_addr.s_addr = (__force u32)htonl(INADDR_ANY); - sin.sin_port = (__force u16)htons(RDS_PORT); + sin.sin_port = (__force u16)htons(RDS_IB_PORT); /* * XXX I bet this binds the cm_id to a device. If we want to support @@ -833,7 +833,7 @@ int __init rds_ib_listen_init(void) goto out; } - rdsdebug("cm %p listening on port %u\n", cm_id, RDS_PORT); + rdsdebug("cm %p listening on port %u\n", cm_id, RDS_IB_PORT); rds_ib_listen_id = cm_id; cm_id = NULL; diff --git a/net/rds/rds.h b/net/rds/rds.h index 03031e2..aa14fa6 100644 --- a/net/rds/rds.h +++ b/net/rds/rds.h @@ -25,9 +25,11 @@ * userspace from listening. * * port 18633 was the version that had ack frames on the wire. + * port 18634 was the version that had both TCP and IB transports on the + * same port. 
*/ -#define RDS_PORT 18634 - +#define RDS_IB_PORT 18635 +#define RDS_TCP_PORT 18636 #ifndef AF_RDS #define AF_RDS 28 /* Reliable Datagram Socket */ diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c index 0389a99..298e372 100644 --- a/net/rds/tcp_connect.c +++ b/net/rds/tcp_connect.c @@ -96,7 +96,7 @@ int rds_tcp_conn_connect(struct rds_connection *conn) dest.sin_family = AF_INET; dest.sin_addr.s_addr = (__force u32)conn->c_faddr; - dest.sin_port = (__force u16)htons(RDS_PORT); + dest.sin_port = (__force u16)htons(RDS_TCP_PORT); /* * once we call connect() we can start getting callbacks and they diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c index caeacbe..50709b7 100644 --- a/net/rds/tcp_listen.c +++ b/net/rds/tcp_listen.c @@ -159,7 +159,7 @@ int __init rds_tcp_listen_init(void) sin.sin_family = PF_INET, sin.sin_addr.s_addr = (__force u32)htonl(INADDR_ANY); - sin.sin_port = (__force u16)htons(RDS_PORT); + sin.sin_port = (__force u16)htons(RDS_TCP_PORT); ret = sock->ops->bind(sock, (struct sockaddr *)&sin, sizeof(sin)); if (ret < 0) From jgunthorpe at obsidianresearch.com Thu May 29 16:00:27 2008 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 29 May 2008 17:00:27 -0600 Subject: [ofa-general] Re: OpenSM IPv6 consolidation In-Reply-To: <20080529143535.60d02d75.weiny2@llnl.gov> References: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com> <20080529143535.60d02d75.weiny2@llnl.gov> Message-ID: <20080529230027.GG8259@obsidianresearch.com> On Thu, May 29, 2008 at 02:35:35PM -0700, Ira Weiny wrote: > But that begs the question: Can a node issue an SNM request to a node in > another IB subnet? (I think the answer is yes if the IP subnet spans more than > one IB subnet) In that case, the SNM address would be in the range > ff02::1:ff00:0/104 but what MGID would that map onto in IB? I think the > current mapping results in an IB link-local scope. So would a router have to > forward it even though the IB scope is link-local? IP (v4 and v6) 'link local' traffic (ie IPv4 broadcasts and IPv6 link local multicast) use MGID scope bits that are dependent on the configuration of the IPoIB stack. Today linux and everyone else uses link local MGID scope. There are patches floating about to make this configurable like pkey so that you can have a global IB scope IPoIB subnet. We used that patch set at SC07 to demonstrate IPoIB running single subnet across IB routers. Jason From weiny2 at llnl.gov Thu May 29 16:08:51 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 29 May 2008 16:08:51 -0700 Subject: [ofa-general] Re: OpenSM IPv6 consolidation In-Reply-To: <20080529230027.GG8259@obsidianresearch.com> References: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com> <20080529143535.60d02d75.weiny2@llnl.gov> <20080529230027.GG8259@obsidianresearch.com> Message-ID: <20080529160851.6ec1ed06.weiny2@llnl.gov> On Thu, 29 May 2008 17:00:27 -0600 Jason Gunthorpe wrote: > On Thu, May 29, 2008 at 02:35:35PM -0700, Ira Weiny wrote: > > > But that begs the question: Can a node issue an SNM request to a node in > > another IB subnet? (I think the answer is yes if the IP subnet spans more than > > one IB subnet) In that case, the SNM address would be in the range > > ff02::1:ff00:0/104 but what MGID would that map onto in IB? I think the > > current mapping results in an IB link-local scope. So would a router have to > > forward it even though the IB scope is link-local? 
> > IP (v4 and v6) 'link local' traffic (ie IPv4 broadcasts and IPv6 link > local multicast) use MGID scope bits that are dependent on the > configuration of the IPoIB stack. Today linux and everyone else uses > link local MGID scope. There are patches floating about to make this > configurable like pkey so that you can have a global IB scope IPoIB > subnet. We used that patch set at SC07 to demonstrate IPoIB running > single subnet across IB routers. > So, in that case if one is having issues with MLID space and wants to use my hack it should consolidate all the scopes. BTW, I still have on the back burner plans to implement a "real" fix to this problem... If only there were say -- 100 hours in a day? ;-) Ira From rdreier at cisco.com Thu May 29 16:11:09 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 29 May 2008 16:11:09 -0700 Subject: [ofa-general] [PATCH] rds: use separate ports for TCP and IB In-Reply-To: <20080529225824.GB7960@opengridcomputing.com> (Jon Mason's message of "Thu, 29 May 2008 17:58:24 -0500") References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> <200805292040.56097.olaf.kirch@oracle.com> <20080529225824.GB7960@opengridcomputing.com> Message-ID: > Currently, RDS will bind to a single port during bring up of both the IB and TCP > sub-modules. This binding of 2 different processes to a single port causes a > port space collision to devices which are aware of both (e.g., iWARP). This > prevents iWARP devices from working with RDS if both TCP and IB are compiled in. Of course nothing prevents another hapless application from trying to use port 18635 with TCP... Not really much we can do about the general port space collision problem unless and until the network stack guys are willing to cooperate though. - R. From jon at opengridcomputing.com Thu May 29 16:44:54 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Thu, 29 May 2008 18:44:54 -0500 Subject: [ofa-general] [PATCH] rds: use separate ports for TCP and IB In-Reply-To: References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> <200805292040.56097.olaf.kirch@oracle.com> <20080529225824.GB7960@opengridcomputing.com> Message-ID: <20080529234454.GD7960@opengridcomputing.com> On Thu, May 29, 2008 at 04:11:09PM -0700, Roland Dreier wrote: > > Currently, RDS will bind to a single port during bring up of both the IB and TCP > > sub-modules. This binding of 2 different processes to a single port causes a > > port space collision to devices which are aware of both (e.g., iWARP). This > > prevents iWARP devices from working with RDS if both TCP and IB are compiled in. > > Of course nothing prevents another hapless application from trying to > use port 18635 with TCP... Yes, but that potential problem was already there. I suppose in the long run RDS should try to get IANA to give a reserved port (assuming that the RDS of ports 1540 and 1541 is a different RDS). > Not really much we can do about the general port space collision problem > unless and until the network stack guys are willing to cooperate though. While I agree it is a necessity, I don't think I want to be the one to start that fight again. Perhaps if/when RDS is merged with mainline. > - R. 
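For readers following the port-space discussion, the split-port listen pattern adopted by the RDS patch earlier in this thread can be sketched in user space with librdmacm. This is a minimal illustration, not actual RDS code: one rdma_cm listener and one plain TCP socket listener, each bound to its own port so the two port spaces cannot collide on an iWARP device. The port numbers are the ones proposed in the patch; error handling is omitted.

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <rdma/rdma_cma.h>

#define RDS_IB_PORT  18635	/* values taken from the patch above */
#define RDS_TCP_PORT 18636

int main(void)
{
	struct rdma_event_channel *ec = rdma_create_event_channel();
	struct rdma_cm_id *cm_id;
	struct sockaddr_in sin;
	int sock;

	/* RDMA listener on its own port */
	rdma_create_id(ec, &cm_id, NULL, RDMA_PS_TCP);
	memset(&sin, 0, sizeof sin);
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(RDS_IB_PORT);
	rdma_bind_addr(cm_id, (struct sockaddr *) &sin);
	rdma_listen(cm_id, 16);

	/* plain TCP listener on a different port */
	sock = socket(AF_INET, SOCK_STREAM, 0);
	sin.sin_port = htons(RDS_TCP_PORT);
	bind(sock, (struct sockaddr *) &sin, sizeof sin);
	listen(sock, 16);

	printf("listening: RDMA on %d, TCP on %d\n", RDS_IB_PORT, RDS_TCP_PORT);
	return 0;
}

As Roland notes above, nothing stops an unrelated TCP application from grabbing either port; the split only removes the collision between the two RDS transports themselves.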
From vidvuds at ucla.edu Thu May 29 18:00:47 2008 From: vidvuds at ucla.edu (Vidvuds Ozolins) Date: Thu, 29 May 2008 18:00:47 -0700 Subject: [ofa-general] OFED-1.3.1 fails on CentOS 5.0 in libmlx4 Message-ID: Hi All, When I try installing OFED-1.3.1 on CentOS 5.0 I get the following error message: Build libmlx4 RPM Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' -- define 'dist ' --target x86_64 --define '_prefix /usr' --define '_exec_prefix /usr' --define '_sysconfdir /etc' --define '_usr /usr' / home/vidvuds/OFED-1.3.1-rc3/SRPMS/libmlx4-1.0-0.1.ofed20080421.src.rpm Failed to build libmlx4 RPM See /tmp/OFED.18735.logs/libmlx4.rpmbuild.log Anyone knows what is going on? The contents of the logfile are: [root at smithers OFED-1.3.1-rc3]# more /tmp/OFED.18735.logs/ libmlx4.rpmbuild.log Running rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' -- define 'dist ' --target x86_64 --define '_prefix /usr' --define '_exec_prefix /usr' --define '_sysconfdir /etc' --de fine '_usr /usr' /home/vidvuds/OFED-1.3.1-rc3/SRPMS/ libmlx4-1.0-0.1.ofed20080421.src.rpm error: Macro %dist has empty body error: Macro %dist has empty body warning: user vlad does not exist - using root warning: group vlad does not exist - using root warning: user vlad does not exist - using root warning: group vlad does not exist - using root Installing /home/vidvuds/OFED-1.3.1-rc3/SRPMS/ libmlx4-1.0-0.1.ofed20080421.src.rpm Building target platforms: x86_64 Building for target x86_64 Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.87202 + umask 022 + cd /var/tmp/OFED_topdir/BUILD + LANG=C + export LANG + unset DISPLAY + cd /var/tmp/OFED_topdir/BUILD + rm -rf libmlx4-1.0 + /bin/gzip -dc /var/tmp/OFED_topdir/SOURCES/ libmlx4-1.0-0.1.ofed20080421.tar.gz + tar -xf - + STATUS=0 + '[' 0 -ne 0 ']' + cd libmlx4-1.0 ++ /usr/bin/id -u + '[' 0 = 0 ']' + /bin/chown -Rhf root . ++ /usr/bin/id -u + '[' 0 = 0 ']' + /bin/chgrp -Rhf root . + /bin/chmod -Rf a+rX,u+w,g-w,o-w . + exit 0 Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.87202 + umask 022 + cd /var/tmp/OFED_topdir/BUILD + cd libmlx4-1.0 + LANG=C + export LANG + unset DISPLAY + CFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic' + export CFLAGS + CXXFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param =ssp-buffer-size=4 -m64 -mtune=generic' + export CXXFLAGS + FFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic' + export FFLAGS ++ find . -name config.guess -o -name config.sub + for i in '$(find . -name config.guess -o -name config.sub)' ++ basename ./config/config.sub + '[' -f /usr/lib/rpm/redhat/config.sub ']' + /bin/rm -f ./config/config.sub ++ basename ./config/config.sub + /bin/cp -fv /usr/lib/rpm/redhat/config.sub ./config/config.sub `/usr/lib/rpm/redhat/config.sub' -> `./config/config.sub' + for i in '$(find . 
-name config.guess -o -name config.sub)' ++ basename ./config/config.guess + '[' -f /usr/lib/rpm/redhat/config.guess ']' + /bin/rm -f ./config/config.guess ++ basename ./config/config.guess + /bin/cp -fv /usr/lib/rpm/redhat/config.guess ./config/config.guess `/usr/lib/rpm/redhat/config.guess' -> `./config/config.guess' + ./configure --build=x86_64-redhat-linux-gnu --host=x86_64-redhat- linux-gnu --target=x86_64- redhat-linux-gnu --program-prefix= --prefix=/usr --exec-prefix=/usr -- bindir=/usr/bin --sbind ir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/ include --libdir=/usr/l ib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/ usr/com --mandir=/usr/s hare/man --infodir=/usr/share/info checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for gawk... gawk checking whether make sets $(MAKE)... yes checking build system type... x86_64-redhat-linux-gnu checking host system type... x86_64-redhat-linux-gnu checking for style of include used by make... GNU checking for x86_64-redhat-linux-gnu-gcc... no checking for gcc... gcc checking for C compiler default output file name... a.out checking whether the C compiler works... yes checking whether we are cross compiling... no checking for suffix of executables... checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ANSI C... none needed checking dependency style of gcc... gcc3 checking for a sed that does not truncate output... /bin/sed checking for egrep... grep -E checking for ld used by gcc... /usr/bin/ld checking if the linker (/usr/bin/ld) is GNU ld... yes checking for /usr/bin/ld option to reload object files... -r checking for BSD-compatible nm... /usr/bin/nm -B checking whether ln -s works... yes checking how to recognise dependent libraries... pass_all checking how to run the C preprocessor... gcc -E checking for ANSI C header files... yes checking for sys/types.h... yes checking for sys/stat.h... yes checking for stdlib.h... yes checking for string.h... yes checking for memory.h... yes checking for strings.h... yes checking for inttypes.h... yes checking for stdint.h... yes checking for unistd.h... yes checking dlfcn.h usability... yes checking dlfcn.h presence... yes checking for dlfcn.h... yes checking for x86_64-redhat-linux-gnu-g++... no checking for x86_64-redhat-linux-gnu-c++... no checking for x86_64-redhat-linux-gnu-gpp... no checking for x86_64-redhat-linux-gnu-aCC... no checking for x86_64-redhat-linux-gnu-CC... no checking for x86_64-redhat-linux-gnu-cxx... no checking for x86_64-redhat-linux-gnu-cc++... no checking for x86_64-redhat-linux-gnu-cl... no checking for x86_64-redhat-linux-gnu-FCC... no checking for x86_64-redhat-linux-gnu-KCC... no checking for x86_64-redhat-linux-gnu-RCC... no checking for x86_64-redhat-linux-gnu-xlC_r... no checking for x86_64-redhat-linux-gnu-xlC... no checking for g++... g++ checking whether we are using the GNU C++ compiler... yes checking whether g++ accepts -g... yes checking dependency style of g++... gcc3 checking how to run the C++ preprocessor... g++ -E checking for x86_64-redhat-linux-gnu-g77... no checking for x86_64-redhat-linux-gnu-f77... no checking for x86_64-redhat-linux-gnu-xlf... no checking for x86_64-redhat-linux-gnu-frt... no checking for x86_64-redhat-linux-gnu-pgf77... no checking for x86_64-redhat-linux-gnu-fort77... 
no checking for x86_64-redhat-linux-gnu-fl32... no checking for x86_64-redhat-linux-gnu-af77... no checking for x86_64-redhat-linux-gnu-f90... no checking for x86_64-redhat-linux-gnu-xlf90... no checking for x86_64-redhat-linux-gnu-pgf90... no checking for x86_64-redhat-linux-gnu-epcf90... no checking for x86_64-redhat-linux-gnu-f95... no checking for x86_64-redhat-linux-gnu-fort... no checking for x86_64-redhat-linux-gnu-xlf95... no checking for x86_64-redhat-linux-gnu-ifc... no checking for x86_64-redhat-linux-gnu-efc... no checking for x86_64-redhat-linux-gnu-pgf95... no checking for x86_64-redhat-linux-gnu-lf95... no checking for x86_64-redhat-linux-gnu-gfortran... no checking for g77... g77 checking whether we are using the GNU Fortran 77 compiler... no checking whether g77 accepts -g... yes checking the maximum length of command line arguments... 32768 checking command to parse /usr/bin/nm -B output from gcc object... ok checking for objdir... .libs checking for x86_64-redhat-linux-gnu-ar... no checking for ar... ar checking for x86_64-redhat-linux-gnu-ranlib... no checking for ranlib... ranlib checking for x86_64-redhat-linux-gnu-strip... no checking for strip... strip checking if gcc supports -fno-rtti -fno-exceptions... no checking for gcc option to produce PIC... -fPIC checking if gcc PIC flag -fPIC works... yes checking if gcc static flag -static works... yes checking if gcc supports -c -o file.o... yes checking whether the gcc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking whether -lc should be explicitly linked in... no checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking whether stripping libraries is possible... yes checking if libtool supports shared libraries... yes checking whether to build shared libraries... yes checking whether to build static libraries... yes configure: creating libtool appending configuration tag "CXX" to libtool checking for ld used by g++... /usr/bin/ld -m elf_x86_64 checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking for g++ option to produce PIC... -fPIC checking if g++ PIC flag -fPIC works... yes checking if g++ static flag -static works... yes checking if g++ supports -c -o file.o... yes checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate appending configuration tag "F77" to libtool checking if libtool supports shared libraries... yes checking whether to build shared libraries... yes checking whether to build static libraries... yes checking for g77 option to produce PIC... -fPIC checking if g77 PIC flag -fPIC works... no checking if g77 static flag -static works... no checking if g77 supports -c -o file.o... no checking whether the g77 linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes checking dynamic linker characteristics... GNU/Linux ld.so checking how to hardcode library paths into programs... immediate checking for x86_64-redhat-linux-gnu-gcc... gcc checking whether we are using the GNU C compiler... (cached) yes checking whether gcc accepts -g... (cached) yes checking for gcc option to accept ANSI C... (cached) none needed checking dependency style of gcc... (cached) gcc3 checking for ibv_get_device_list in -libverbs... 
yes checking infiniband/driver.h usability... yes checking infiniband/driver.h presence... yes checking for infiniband/driver.h... yes checking for ANSI C header files... (cached) yes checking valgrind/memcheck.h usability... yes checking valgrind/memcheck.h presence... yes checking for valgrind/memcheck.h... yes checking for an ANSI C-conforming const... yes checking for long... yes checking size of long... 8 checking for struct ibv_context.xrc_ops... no checking for ibv_read_sysfs_file... yes checking for ibv_dontfork_range... yes checking for ibv_dofork_range... yes checking for ibv_register_driver... yes checking whether ld accepts --version-script... yes configure: creating ./config.status config.status: creating Makefile config.status: creating libmlx4.spec config.status: creating config.h config.status: executing depfiles commands + make -j8 make all-am make[1]: Entering directory `/var/tmp/OFED_topdir/BUILD/libmlx4-1.0' if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. -g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT buf.lo -MD -MP -MF ".deps/ buf.Tpo" -c -o buf.lo `tes t -f 'src/buf.c' || echo './'`src/buf.c; \ then mv -f ".deps/buf.Tpo" ".deps/buf.Plo"; else rm -f ".deps/ buf.Tpo"; exit 1; fi if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. -g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT cq.lo -MD -MP -MF ".deps/ cq.Tpo" -c -o cq.lo `test - f 'src/cq.c' || echo './'`src/cq.c; \ then mv -f ".deps/cq.Tpo" ".deps/cq.Plo"; else rm -f ".deps/cq.Tpo"; exit 1; fi if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. -g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT dbrec.lo -MD -MP -MF ".deps/ dbrec.Tpo" -c -o dbrec.l o `test -f 'src/dbrec.c' || echo './'`src/dbrec.c; \ then mv -f ".deps/dbrec.Tpo" ".deps/dbrec.Plo"; else rm -f ".deps/ dbrec.Tpo"; exit 1; fi if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. -g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT mlx4.lo -MD -MP -MF ".deps/ mlx4.Tpo" -c -o mlx4.lo ` test -f 'src/mlx4.c' || echo './'`src/mlx4.c; \ then mv -f ".deps/mlx4.Tpo" ".deps/mlx4.Plo"; else rm -f ".deps/ mlx4.Tpo"; exit 1; fi if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. -g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT qp.lo -MD -MP -MF ".deps/ qp.Tpo" -c -o qp.lo `test - f 'src/qp.c' || echo './'`src/qp.c; \ then mv -f ".deps/qp.Tpo" ".deps/qp.Plo"; else rm -f ".deps/qp.Tpo"; exit 1; fi if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. -g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT srq.lo -MD -MP -MF ".deps/ srq.Tpo" -c -o srq.lo `tes t -f 'src/srq.c' || echo './'`src/srq.c; \ then mv -f ".deps/srq.Tpo" ".deps/srq.Plo"; else rm -f ".deps/ srq.Tpo"; exit 1; fi if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. - I. -I. 
-g -Wall -D_G NU_SOURCE -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions - fstack-protector --param=s sp-buffer-size=4 -m64 -mtune=generic -MT verbs.lo -MD -MP -MF ".deps/ verbs.Tpo" -c -o verbs.l o `test -f 'src/verbs.c' || echo './'`src/verbs.c; \ then mv -f ".deps/verbs.Tpo" ".deps/verbs.Plo"; else rm -f ".deps/ verbs.Tpo"; exit 1; fi mkdir .libs gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT dbrec. lo -MD -MP -MF .deps/dbrec.Tpo -c src/dbrec.c -fPIC -DPIC -o .libs/ dbrec.o gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT buf.lo -MD -MP -MF .deps/buf.Tpo -c src/buf.c -fPIC -DPIC -o .libs/buf.o gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT mlx4.l o -MD -MP -MF .deps/mlx4.Tpo -c src/mlx4.c -fPIC -DPIC -o .libs/mlx4.o gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT srq.lo -MD -MP -MF .deps/srq.Tpo -c src/srq.c -fPIC -DPIC -o .libs/srq.o gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT cq.lo -MD -MP -MF .deps/cq.Tpo -c src/cq.c -fPIC -DPIC -o .libs/cq.o gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT qp.lo -MD -MP -MF .deps/qp.Tpo -c src/qp.c -fPIC -DPIC -o .libs/qp.o gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -O2 -g -pipe - Wall -Wp,-D_FORTIFY_SOU RCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 - mtune=generic -MT verbs. lo -MD -MP -MF .deps/verbs.Tpo -c src/verbs.c -fPIC -DPIC -o .libs/ verbs.o In file included from src/buf.c:39: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type make[1]: *** [buf.lo] Error 1 make[1]: *** Waiting for unfinished jobs.... In file included from src/dbrec.c:42: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type make[1]: *** [dbrec.lo] Error 1 In file included from src/srq.c:42: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type make[1]: *** [srq.lo] Error 1 In file included from src/cq.c:47: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type In file included from src/mlx4.c:49: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type In file included from src/qp.c:44: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type src/cq.c: In function 'mlx4_cq_clean': src/cq.c:408: error: 'struct ibv_srq' has no member named 'xrc_cq' src/qp.c: In function 'mlx4_post_send': src/qp.c:244: error: 'IBV_QPT_XRC' undeclared (first use in this function) src/qp.c:244: error: (Each undeclared identifier is reported only once src/qp.c:244: error: for each function it appears in.) 
src/qp.c:245: error: 'struct ibv_send_wr' has no member named 'xrc_remote_srq_num' make[1]: *** [cq.lo] Error 1 make[1]: *** [mlx4.lo] Error 1 src/qp.c: In function 'mlx4_calc_sq_wqe_size': src/qp.c:547: error: 'IBV_QPT_XRC' undeclared (first use in this function) src/qp.c: In function 'mlx4_set_sq_sizes': src/qp.c:636: error: 'IBV_QPT_XRC' undeclared (first use in this function) make[1]: *** [qp.lo] Error 1 In file included from src/verbs.c:44: src/mlx4.h:282: error: field 'ibv_xrcd' has incomplete type src/verbs.c: In function 'mlx4_destroy_srq': src/verbs.c:329: error: 'struct ibv_srq' has no member named 'xrc_cq' src/verbs.c:331: error: 'struct ibv_srq' has no member named 'xrc_cq' src/verbs.c:340: error: 'struct ibv_srq' has no member named 'xrc_cq' src/verbs.c: In function 'mlx4_create_qp': src/verbs.c:388: error: 'IBV_QPT_XRC' undeclared (first use in this function) src/verbs.c:388: error: (Each undeclared identifier is reported only once src/verbs.c:388: error: for each function it appears in.) src/verbs.c: In function 'mlx4_modify_qp': src/verbs.c:517: error: 'IBV_QPT_XRC' undeclared (first use in this function) src/verbs.c: In function 'mlx4_destroy_qp': src/verbs.c:579: error: 'IBV_QPT_XRC' undeclared (first use in this function) make[1]: *** [verbs.lo] Error 1 make[1]: Leaving directory `/var/tmp/OFED_topdir/BUILD/libmlx4-1.0' make: *** [all] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.87202 (%build) RPM build errors: Macro %dist has empty body Macro %dist has empty body user vlad does not exist - using root group vlad does not exist - using root user vlad does not exist - using root group vlad does not exist - using root Bad exit status from /var/tmp/rpm-tmp.87202 (%build) Thanks, Vidvuds Vidvuds Ozolins Dept. of Materials Science and Engineering University of California, Los Angeles 3121E Engineering V P.O. Box 951595 Los Angeles, CA 90095-1595 Office: (310) 267-5538 E-mail: vidvuds at ucla.edu From hrosenstock at xsigo.com Thu May 29 19:31:26 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Thu, 29 May 2008 19:31:26 -0700 Subject: [ofa-general] Re: OpenSM IPv6 consolidation In-Reply-To: <20080529143535.60d02d75.weiny2@llnl.gov> References: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com> <20080529143535.60d02d75.weiny2@llnl.gov> Message-ID: <1212114686.17997.139.camel@hrosenstock-ws.xsigo.com> On Thu, 2008-05-29 at 14:35 -0700, Ira Weiny wrote: > On Thu, 29 May 2008 12:27:12 -0700 > Hal Rosenstock wrote: > > > Ira, > > > > In osm_sa_mcmember_record.c:__search_mgrp_by_mgid, there is: > > > > #define PREFIX_MASK (0xff12601b00000000ULL) > > > > Shouldn't all scopes be consolidated so this should be: > > > > #define PREFIX_MASK (0xff10601b00000000ULL) > > > > or was this intentional for some reason ? > > > > It seemed reasonable for this to consolidate link-local only because according > to my IPv6 book, solicited node multicast is the particular range, > ff02::1:ff00:0/104 > > However, I am a bit confused about how the scope bits map from the IP address > to the MGID. The MGID refers only to the IB-subnet scope _not_ IP, therefore > what I said above might not matter because we are now talking about the IB > scope. Right; the IP scope is different from the IB scope. > But that begs the question: Can a node issue an SNM request to a node in > another IB subnet? Yes, as IPoIB subnets can span IB subnets. 
-- Hal

> (I think the answer is yes if the IP subnet spans more than
> one IB subnet) In that case, the SNM address would be in the range
> ff02::1:ff00:0/104 but what MGID would that map onto in IB? I think the
> current mapping results in an IB link-local scope. So would a router have to
> forward it even though the IB scope is link-local?
> Now my head hurts... :-(
>
> Ira
>

From hrosenstock at xsigo.com Thu May 29 19:32:10 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Thu, 29 May 2008 19:32:10 -0700
Subject: [ofa-general] Re: OpenSM IPv6 consolidation
In-Reply-To: <20080529143535.60d02d75.weiny2@llnl.gov>
References: <1212089232.17997.125.camel@hrosenstock-ws.xsigo.com>
	<20080529143535.60d02d75.weiny2@llnl.gov>
Message-ID: <1212114730.17997.141.camel@hrosenstock-ws.xsigo.com>

On Thu, 2008-05-29 at 14:35 -0700, Ira Weiny wrote:
> On Thu, 29 May 2008 12:27:12 -0700
> Hal Rosenstock wrote:
>
> > Ira,
> >
> > In osm_sa_mcmember_record.c:__search_mgrp_by_mgid, there is:
> >
> > #define PREFIX_MASK (0xff12601b00000000ULL)
> >
> > Shouldn't all scopes be consolidated so this should be:
> >
> > #define PREFIX_MASK (0xff10601b00000000ULL)
> >
> > or was this intentional for some reason ?
> >
> It seemed reasonable for this to consolidate link-local only

Actually, the code doesn't quite even do that. Patch to follow in a bit.

-- Hal

> because according
> to my IPv6 book, solicited node multicast is the particular range,
> ff02::1:ff00:0/104
>
> However, I am a bit confused about how the scope bits map from the IP address
> to the MGID. The MGID refers only to the IB-subnet scope _not_ IP, therefore
> what I said above might not matter because we are now talking about the IB
> scope.
>
> But that begs the question: Can a node issue an SNM request to a node in
> another IB subnet? (I think the answer is yes if the IP subnet spans more than
> one IB subnet) In that case, the SNM address would be in the range
> ff02::1:ff00:0/104 but what MGID would that map onto in IB? I think the
> current mapping results in an IB link-local scope. So would a router have to
> forward it even though the IB scope is link-local?
>
> Now my head hurts... :-(
>
> Ira
>
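To make the mask arithmetic in this thread concrete, the scope-insensitive match can be written as a small standalone C helper. The mask values are the ones proposed in Hal's follow-up patches later in this digest; the helper name is purely illustrative.

#include <stdint.h>

/* An IPv6 Solicited Node Multicast MGID has the layout
 *   0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY
 * where Z is the scope nibble, XXXX the P_Key, and YYYYYY the low
 * 24 bits of the port GUID. PREFIX_MASK zeroes both the scope nibble
 * and the P_Key, so SNM MGIDs of every scope and every partition
 * match the single signature below.
 */
#define PREFIX_MASK      0xff10ffff00000000ULL
#define PREFIX_SIGNATURE 0xff10601b00000000ULL
#define INT_ID_MASK      0x00000001ff000000ULL

/* prefix/iid are the two MGID halves in host byte order */
static int is_ipv6_snm_mgid(uint64_t prefix, uint64_t iid)
{
	return (prefix & PREFIX_MASK) == PREFIX_SIGNATURE &&
	       (iid & INT_ID_MASK) == INT_ID_MASK;
}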
From vlad at lists.openfabrics.org Fri May 30 03:09:07 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 30 May 2008 03:09:07 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20080530-0200 daily build status
Message-ID: <20080530100907.EF77AE60BE5@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.24
Passed on ppc64 with linux-2.6.19

Failed:

From cousin_vinnie at hotmail.fr Fri May 30 03:26:30 2008
From: cousin_vinnie at hotmail.fr (Renaud Durand)
Date: Fri, 30 May 2008 12:26:30 +0200
Subject: [ofa-general] iSCSI problem
Message-ID: 

Hello guys,

I tried to create an iSER target to load a hard drive from a remote computer, following this tutorial:
https://wiki.openfabrics.org/tiki-index.php?page=ISER

I can connect to the target, and I now have a device /dev/sdc. My problem is that I cannot mount this device. I try:

linux-cx5e:~ # mount /dev/sdc /mnt

and the computer freezes. I have no idea why it freezes, so if you have any suggestions...
thanks
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From hrosenstock at xsigo.com Fri May 30 04:07:39 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Fri, 30 May 2008 04:07:39 -0700
Subject: [ofa-general] [PATCH] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
Message-ID: <1212145660.17997.150.camel@hrosenstock-ws.xsigo.com>

OpenSM/osm_sa_mcmember_record.c: In
__search_mgrp_by_mgid, collapse all scopes when consolidating IPv6 SNM Minor comment change from v1 of this patch Patch is cumulative on minor improvement patch to this file Signed-off-by: Hal Rosenstock --- opensm/osm_sa_mcmember_record.c.1 2008-05-30 03:58:01.129544000 -0700 +++ opensm/osm_sa_mcmember_record.c 2008-05-30 04:11:10.637837000 -0700 @@ -1083,17 +1083,18 @@ if (sa->p_subn->opt.consolidate_ipv6_snm_req) { /* Special Case IPv6 Solicited Node Multicast (SNM) addresses */ - /* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */ - /* Where XXXX is the P_Key and + /* 0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY */ + /* Where Z is the scope, XXXX is the P_Key, and * YYYYYY is the last 24 bits of the port guid */ -#define PREFIX_MASK (0xff10601b00000000ULL) +#define PREFIX_MASK (0xff10ffff00000000ULL) +#define PREFIX_SIGNATURE (0xff10601b00000000ULL) #define INT_ID_MASK (0x00000001ff000000ULL) uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix); uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id); uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix); uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id); - if ((rcv_prefix & PREFIX_MASK) == PREFIX_MASK && + if ((rcv_prefix & PREFIX_MASK) == PREFIX_SIGNATURE && (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK && g_prefix == rcv_prefix && (g_interface_id & INT_ID_MASK) == From shuttingr8 at epicenter.org Fri May 30 04:27:43 2008 From: shuttingr8 at epicenter.org (Antoinette Battle) Date: Fri, 30 May 2008 20:27:43 +0900 Subject: [ofa-general] We have CEOs as students. Message-ID: <01c8c293$9cff0980$0811aa79@shuttingr8> Bacheelor, MasteerMBA, and Doctoraate diplomas available in the field of your choice that's right, you can even become a Doctor and receive all the benefits that comes with it! Our Diplomas/Certificates are recognised in most countries No required examination, tests, classes, books, or interviews. ** No one is turned down ** Confidentiality assured CALL US 24 HOURS A DAY, 7 DAYS A WEEK For US: 1-801-504-2132 Outside US: +1-801-504-2132 "Just leave your NAME & PHONE NO. (with CountryCode)" in the voicemail our staff will get back to you in next few days -------------- next part -------------- An HTML attachment was scrubbed... URL: From Thomas.Talpey at netapp.com Fri May 30 04:55:53 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 30 May 2008 07:55:53 -0400 Subject: [ofa-general] [PATCH] rds: use separate ports for TCP and IB In-Reply-To: <20080529234454.GD7960@opengridcomputing.com> References: <20080528225549.GC6288@opengridcomputing.com> <000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com> <20080529000354.GD6288@opengridcomputing.com> <200805292040.56097.olaf.kirch@oracle.com> <20080529225824.GB7960@opengridcomputing.com> <20080529234454.GD7960@opengridcomputing.com> Message-ID: At 07:44 PM 5/29/2008, Jon Mason wrote: >On Thu, May 29, 2008 at 04:11:09PM -0700, Roland Dreier wrote: >> Of course nothing prevents another hapless application from trying to >> use port 18635 with TCP... > >Yes, but that potential problem was already there. I suppose in the >long run RDS >should try to get IANA to give a reserved port (assuming that the RDS >of ports 1540 >and 1541 is a different RDS). RDS should do that right now, before the number is compiled into the code! 
The registry is at http://www.iana.org/assignments/port-numbers and the application form is http://www.iana.org/cgi-bin/usr-port-number.pl It's supposed to be a two-week process, but it can take longer if you have special requests like a specific port number, etc. > >> Not really much we can do about the general port space collision problem >> unless and until the network stack guys are willing to cooperate though. I don't think it has anything to do with the network stack code. It's basically an Internet license plate, issued by a separate authority. Tom. From hrosenstock at xsigo.com Fri May 30 05:50:24 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 30 May 2008 05:50:24 -0700 Subject: [ofa-general] More questions/comments on IPv6 SNM consolidation option in OpenSM Message-ID: <1212151825.17997.173.camel@hrosenstock-ws.xsigo.com> Ira, The IPv6 SNM consolidation option in OpenSM currently collapses the SNM groups down to 1 group aliased group. However, in a heterogeneous network, not all ports will be able to meet certain group parameters (MTU, rate). This has been discussed on the list before. My current read of the code indicates that these joins would be rejected. Is that right ? If so, my question is why not allow them to create and join with their original real multicast group for this case ? The downside would be that if there were a lot of ports like this, then the consolidation would reduce the number of groups but maybe not enough. So would we then want an additional option for doing this (and what the default should be) ? One cut on the default would be to keep it the same as now but does that really matter ? Ideally, those additional SNM groups would be collapsed too. I think that aspect was dealt with in Jason's approach to this in a thread entitled "IPv6 and IPoIB scalability issue": http://lists.openfabrics.org/pipermail/general/2006-November/029621.html in which he proposed an MGID range for collapsing IPv6 SNM groups. Also, have you tried IPv6 SNM consolidation with multiple partitions ? I may have more on this aspect later. -- Hal From hrosenstock at xsigo.com Fri May 30 05:56:13 2008 From: hrosenstock at xsigo.com (Hal Rosenstock) Date: Fri, 30 May 2008 05:56:13 -0700 Subject: [ofa-general] Re: [PATCH] infiniband-diags: terminate perl scripts with error if not authorized In-Reply-To: <20080525191430.GT4616@sashak.voltaire.com> References: <4836EB27.7060707@llnl.gov> <20080525191430.GT4616@sashak.voltaire.com> Message-ID: <1212152173.17997.176.camel@hrosenstock-ws.xsigo.com> On Sun, 2008-05-25 at 22:14 +0300, Sasha Khapyorsky wrote: > Hi Tim, > > On 09:04 Fri 23 May , Timothy A. Meier wrote: > > > > +# ========================================================================= > > +# only authorized if uid is root, or matches umad ownership > > +# > > +sub auth_check > > +{ > > + my $file = "/dev/infiniband/umad0"; > > How would we know that it is "/dev/infiniband/umad0" and not another > device (when first port in not connected, or if -C and/or -P options are > used, or if udev is configured to put the entries in another place)? > > Really I don't see an easy (without reimplementing most of libibumad > device resolution functionality via sysfs in perl scripts) way to detect > device reliably. How about having a library function return the umad mapping so this doesn't need to be reimplemented ? 
-- Hal > > + my $uid = (stat $file)[4]; > > + my $gid = (stat $file)[5]; > > + if (($> != $uid) && ($> != $gid) && ($> != 0)){ > > The requirement here is not really ownership, but rather that the file > is readable and writable by user which runs script. Right? > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From forum.san at gmail.com Fri May 30 06:05:21 2008 From: forum.san at gmail.com (Sangamesh B) Date: Fri, 30 May 2008 18:35:21 +0530 Subject: [ofa-general] ***SPAM*** How to test OFED install Message-ID: Hi all, Can some one send the link/document which can explain the OFA tests:To check drivers installed properly? Thanks, Sangamesh -------------- next part -------------- An HTML attachment was scrubbed... URL: From okir at lst.de Fri May 30 06:21:19 2008 From: okir at lst.de (Olaf Kirch) Date: Fri, 30 May 2008 15:21:19 +0200 Subject: [ofa-general] Port space sharing in RDS In-Reply-To: <20080528225549.GC6288@opengridcomputing.com> References: <20080528225549.GC6288@opengridcomputing.com> Message-ID: <200805301521.20476.okir@lst.de> On Thursday 29 May 2008 00:55:49 Jon Mason wrote: > During RDS init, rds_ib_init and rds_tcp_init will both individually bind to the > RDS port for all IP addresses. Unfortunately, that will not work for iWARP for > 2 major reasons. I looked at the code, and I think it is possible to bind to each interface address separately instead of binding to INADDR_ANY once. It's a bit of extra work, but it would take care of the conflict. For instance, in the IB case, the binding code could move to rds_ib_add_one, and we'd bind to each address associated with the interface - the problem there being of course that we're currently not watching for address changes. Olaf -- Olaf Kirch | --- o --- Nous sommes du soleil we love when we play okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax From jon at opengridcomputing.com Fri May 30 08:08:14 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Fri, 30 May 2008 10:08:14 -0500 Subject: [ofa-general] Port space sharing in RDS In-Reply-To: <200805301521.20476.okir@lst.de> References: <20080528225549.GC6288@opengridcomputing.com> <200805301521.20476.okir@lst.de> Message-ID: <20080530150814.GB8638@opengridcomputing.com> On Fri, May 30, 2008 at 03:21:19PM +0200, Olaf Kirch wrote: > On Thursday 29 May 2008 00:55:49 Jon Mason wrote: > > During RDS init, rds_ib_init and rds_tcp_init will both individually bind to the > > RDS port for all IP addresses. Unfortunately, that will not work for iWARP for > > 2 major reasons. > > I looked at the code, and I think it is possible to bind to each > interface address separately instead of binding to INADDR_ANY once. > It's a bit of extra work, but it would take care of the conflict. > > For instance, in the IB case, the binding code could move to > rds_ib_add_one, and we'd bind to each address associated with the interface - > the problem there being of course that we're currently not watching for > address changes. Yes, that was my original suggestion (though worded poorly). If that way is prefered, I can do it. 
Thanks,
Jon

>
> Olaf
> --
> Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
> okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

From freyp at student.ethz.ch  Fri May 30 09:00:26 2008
From: freyp at student.ethz.ch (Philip Frey)
Date: Fri, 30 May 2008 18:00:26 +0200
Subject: [ofa-general] Length of inbound RDMA send
Message-ID: <4840249A.2080009@student.ethz.ch>

Hello,

I was wondering if a receive work completion tells me how many bytes
have been placed. Is 'byte_len' the field indicating that value?

The various fields of 'struct ibv_wc' are not quite clear to me.
Can you point me to a document where this is described?

Many thanks and kind regards,
Philip

From rdreier at cisco.com  Fri May 30 09:09:15 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 30 May 2008 09:09:15 -0700
Subject: [ofa-general] [PATCH] rds: use separate ports for TCP and IB
In-Reply-To: (Thomas Talpey's message of "Fri, 30 May 2008 07:55:53 -0400")
References: <20080528225549.GC6288@opengridcomputing.com>
	<000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com>
	<20080529000354.GD6288@opengridcomputing.com>
	<200805292040.56097.olaf.kirch@oracle.com>
	<20080529225824.GB7960@opengridcomputing.com>
	<20080529234454.GD7960@opengridcomputing.com>
Message-ID:

> >> Not really much we can do about the general port space collision problem
> >> unless and until the network stack guys are willing to cooperate though.

> I don't think it has anything to do with the network stack code. It's basically
> an Internet license plate, issued by a separate authority.

I just meant that currently, I can bind an iWARP listen and a normal TCP
listen to the same port, and a connect attempt to one or the other will
fail.

From rdreier at cisco.com  Fri May 30 09:10:33 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 30 May 2008 09:10:33 -0700
Subject: [ofa-general] Length of inbound RDMA send
In-Reply-To: <4840249A.2080009@student.ethz.ch> (Philip Frey's message of
	"Fri, 30 May 2008 18:00:26 +0200")
References: <4840249A.2080009@student.ethz.ch>
Message-ID:

> I was wondering if a receive work completion tells me how
> many bytes have been placed. Is 'byte_len' the field indicating
> that value?

Not sure what an "RDMA send" is -- if you mean a normal send (as opposed
to an RDMA operation), then yes, the byte_len field has the length of
the message that was received.

> The various fields of 'struct ibv_wc' are not quite clear to me.
> Can you point me to a document where this is described?

The "poll CQ" section of chapter 11 of the IB spec should cover it.

 - R.

From Thomas.Talpey at netapp.com  Fri May 30 09:29:01 2008
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Fri, 30 May 2008 12:29:01 -0400
Subject: [ofa-general] [PATCH] rds: use separate ports for TCP and IB
In-Reply-To: References: <20080528225549.GC6288@opengridcomputing.com>
	<000001c8c11b$2ebaf020$8398070a@amr.corp.intel.com>
	<20080529000354.GD6288@opengridcomputing.com>
	<200805292040.56097.olaf.kirch@oracle.com>
	<20080529225824.GB7960@opengridcomputing.com>
	<20080529234454.GD7960@opengridcomputing.com>
Message-ID:

At 12:09 PM 5/30/2008, Roland Dreier wrote:
>
> > >> Not really much we can do about the general port space collision problem
> > >> unless and until the network stack guys are willing to cooperate though.
> >
> > I don't think it has anything to do with the network stack code.
> >It's basically
> > an Internet license plate, issued by a separate authority.
>
>I just meant that currently, I can bind an iWARP listen and a normal TCP
>listen to the same port, and a connect attempt to one or the other will fail.

Oh THAT problem. :-) Yes, at the moment TCP to the iWARP NIC is like
talking to a different host.

But, RDMA-aware versions of a given protocol still need a second port,
unless there is explicit upper layer support for initiating the MPA
exchange. We have the same issue with NFSv3/RDMA, and we have applied for
a second port (the application is still pending within IANA). The second
port is not needed for the future NFSv4.1, which has RDMA negotiation in
its session establishment. And it's also not needed for IB, which doesn't
have an RDMA upgrade at all. But for simplicity, we'll continue to use
both.

Tom.

From hrosenstock at xsigo.com  Fri May 30 11:07:15 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Fri, 30 May 2008 11:07:15 -0700
Subject: [ofa-general] [PATCH][TRIVIAL] opensm/osm_subnet.c: Change comment for IPv6 SNM in options file
Message-ID: <1212170835.17997.231.camel@hrosenstock-ws.xsigo.com>

opensm/osm_subnet.c: Change comment for IPv6 SNM in options file

Signed-off-by: Hal Rosenstock

--- opensm/osm_subnet.c.2	2008-05-29 04:24:19.802169000 -0700
+++ opensm/osm_subnet.c	2008-05-30 11:04:00.098938000 -0700
@@ -1713,7 +1713,7 @@
 			p_opts->prefix_routes_file);

 	fprintf(opts_file,
-		"#\n# IPv6 MCast Options\n#\n"
+		"#\n# IPv6 Solicited Node Multicast (SNM) Options\n#\n"
 		"consolidate_ipv6_snm_req %s\n\n",
 		p_opts->consolidate_ipv6_snm_req ? "TRUE" : "FALSE");

From dotanba at gmail.com  Fri May 30 12:39:03 2008
From: dotanba at gmail.com (Dotan Barak)
Date: Fri, 30 May 2008 21:39:03 +0200
Subject: [ofa-general] Length of inbound RDMA send
In-Reply-To: <4840249A.2080009@student.ethz.ch>
References: <4840249A.2080009@student.ethz.ch>
Message-ID: <484057D7.2060904@gmail.com>

Hi.

Philip Frey wrote:
> Hello,
>
> I was wondering if a receive work completion tells me how
> many bytes have been placed. Is 'byte_len' the field indicating
> that value?

But you know how many bytes you sent in this message... byte_len is most
useful for incoming messages (to understand how many bytes were
received).

Dotan

From swise at opengridcomputing.com  Fri May 30 12:24:30 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 30 May 2008 14:24:30 -0500
Subject: [ofa-general] Port space sharing in RDS
In-Reply-To: <20080530150814.GB8638@opengridcomputing.com>
References: <20080528225549.GC6288@opengridcomputing.com>
	<200805301521.20476.okir@lst.de>
	<20080530150814.GB8638@opengridcomputing.com>
Message-ID: <4840546E.40009@opengridcomputing.com>

Jon Mason wrote:
> On Fri, May 30, 2008 at 03:21:19PM +0200, Olaf Kirch wrote:
>
>> On Thursday 29 May 2008 00:55:49 Jon Mason wrote:
>>
>>> During RDS init, rds_ib_init and rds_tcp_init will both individually bind to the
>>> RDS port for all IP addresses. Unfortunately, that will not work for iWARP for
>>> 2 major reasons.
>>>
>> I looked at the code, and I think it is possible to bind to each
>> interface address separately instead of binding to INADDR_ANY once.
>> It's a bit of extra work, but it would take care of the conflict.
>>
>> For instance, in the IB case, the binding code could move to
>> rds_ib_add_one, and we'd bind to each address associated with the interface -
>> the problem there being of course that we're currently not watching for
>> address changes.
>>

> Yes, that was my original suggestion (though worded poorly). If that way
> is preferred, I can do it.
>

Note that if you do bind to specific addresses, then you need to deal
with multiple addresses bound to the same interface...

> Thanks,
> Jon
>
>> Olaf
>> --
>> Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
>> okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

From swise at opengridcomputing.com  Fri May 30 13:21:57 2008
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 30 May 2008 15:21:57 -0500
Subject: [ofa-general] [PATCH RFC v4 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
In-Reply-To: <483E78C9.7080209@mellanox.co.il>
References: <20080527183429.32168.14351.stgit@dell3.ogc.int>
	<20080527183549.32168.22959.stgit@dell3.ogc.int>
	<483CBDF0.7030209@opengridcomputing.com>
	<483DCEA8.20505@opengridcomputing.com>
	<483E78C9.7080209@mellanox.co.il>
Message-ID: <484061E5.5060600@opengridcomputing.com>

Tziporet Koren wrote:
> Steve Wise wrote:
>> Yes, I have already said I'll post a test case. :)
>>
>> The krping tool will be the culprit. It's the kernel equivalent of
>> rping and has been around for a long time in one form or another.
>>
>> It is available at git://git.openfabrics.org/~swise/krping
>>
> Do you think we should include it in OFED as we include user space
> examples?
>
> Tziporet

I would rather not ship it since then I'd have to support it. :)

From hrosenstock at xsigo.com  Fri May 30 13:22:14 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Fri, 30 May 2008 13:22:14 -0700
Subject: [ofa-general] [PATCHv3] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
Message-ID: <1212178934.17997.244.camel@hrosenstock-ws.xsigo.com>

OpenSM/osm_sa_mcmember_record.c: In __search_mgrp_by_mgid, collapse all
scopes when consolidating IPv6 SNM

v2 compares masked prefixes rather than actual prefix in MCMemberRecord MGID and MGRP
v1 had a minor comment change

Patch is cumulative on minor improvement patch to this file

Signed-off-by: Hal Rosenstock

--- opensm/osm_sa_mcmember_record.c.1	2008-05-30 03:58:01.129544000 -0700
+++ opensm/osm_sa_mcmember_record.c	2008-05-30 13:13:59.344954000 -0700
@@ -1083,19 +1083,21 @@

 	if (sa->p_subn->opt.consolidate_ipv6_snm_req) {
 		/* Special Case IPv6 Solicited Node Multicast (SNM) addresses */
-		/* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */
-		/* Where XXXX is the P_Key and
+		/* 0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY */
+		/* Where Z is the scope, XXXX is the P_Key, and
		 * YYYYYY is the last 24 bits of the port guid */
-#define PREFIX_MASK (0xff10601b00000000ULL)
+#define PREFIX_MASK (0xff10ffff00000000ULL)
+#define PREFIX_SIGNATURE (0xff10601b00000000ULL)
 #define INT_ID_MASK (0x00000001ff000000ULL)
		uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix);
		uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id);
		uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix);
		uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id);

-		if ((rcv_prefix & PREFIX_MASK) == PREFIX_MASK &&
+		if ((rcv_prefix & PREFIX_MASK) == PREFIX_SIGNATURE &&
		    (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK &&
-		    g_prefix == rcv_prefix &&
+		    (g_prefix & PREFIX_MASK) ==
+		    (rcv_prefix & PREFIX_MASK)
&&
		    (g_interface_id & INT_ID_MASK) ==
		    (rcv_interface_id & INT_ID_MASK)) {
			OSM_LOG(sa->p_log, OSM_LOG_INFO,

From weiny2 at llnl.gov  Fri May 30 14:41:15 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 30 May 2008 14:41:15 -0700
Subject: [ofa-general] Re: More questions/comments on IPv6 SNM consolidation option in OpenSM
In-Reply-To: <1212151825.17997.173.camel@hrosenstock-ws.xsigo.com>
References: <1212151825.17997.173.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080530144115.5fe80f0c.weiny2@llnl.gov>

On Fri, 30 May 2008 05:50:24 -0700
Hal Rosenstock wrote:

> Ira,
>
> The IPv6 SNM consolidation option in OpenSM currently collapses the SNM
> groups down to one aliased group. However, in a heterogeneous
> network, not all ports will be able to meet certain group parameters
> (MTU, rate). This has been discussed on the list before. My current read
> of the code indicates that these joins would be rejected. Is that
> right?

Yes, I believe so.

>
> If so, my question is: why not allow them to create and join their
> original real multicast group in this case?

That would be fine as long as there were not too many "odd" nodes.

>
> The downside would be that if there were a lot of ports like this, the
> consolidation would reduce the number of groups but maybe not enough.
> So would we then want an additional option for doing this (and what
> should the default be)?

I think it would be best to consolidate all the "like" ports. For
example, if you had 3 different MTUs on the fabric then you would have 3
different MGIDs and groups. The main reason I did not do this was that
it would have been a much larger change to the code and I did not want
to risk breaking things.

>
> One cut on the default would be to keep it the same as now, but does
> that really matter? Ideally, those additional SNM groups would be
> collapsed too. I think that aspect was dealt with in Jason's approach to
> this in a thread entitled "IPv6 and IPoIB scalability issue":
> http://lists.openfabrics.org/pipermail/general/2006-November/029621.html
> in which he proposed an MGID range for collapsing IPv6 SNM groups.

Ah yes... I guess I should have read this part before responding above!
;-)

>
> Also, have you tried IPv6 SNM consolidation with multiple partitions?
> I may have more on this aspect later.
>

No, as we don't really use partitions.
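To make the "like" ports idea concrete, the grouping test would be
something like this (standalone illustration only -- the struct and
field encodings are my own shorthand, not OpenSM types):

#include <stdint.h>

/* Two SNM join requests may share one consolidated group only if their
 * group parameters match; each distinct (pkey, mtu, rate) tuple would
 * get its own aliased group/MGID. */
struct snm_join {
	uint16_t pkey;
	uint8_t mtu;	/* IB-encoded MTU from the MCMemberRecord */
	uint8_t rate;	/* IB-encoded rate */
};

static int snm_joins_compatible(const struct snm_join *a,
				const struct snm_join *b)
{
	return a->pkey == b->pkey && a->mtu == b->mtu && a->rate == b->rate;
}

This is essentially where Jason's MGID range proposal would come in: one
aliased MGID per distinct tuple.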
Ira

From vlad at lists.openfabrics.org  Sat May 31 03:09:05 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 31 May 2008 03:09:05 -0700 (PDT)
Subject: [ofa-general] ofa_1_3_kernel 20080531-0200 daily build status
Message-ID: <20080531100905.7B0CBE60B3E@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod
--with-user_mad-mod --with-user_access-mod --with-mthca-mod
--with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod
--with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.12
Passed on ia64 with linux-2.6.13
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on powerpc with linux-2.6.12
Passed on powerpc with linux-2.6.13
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.15
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.13
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.14
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.24
Passed on ppc64 with linux-2.6.19

Failed:
From sashak at voltaire.com  Sat May 31 05:18:58 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 15:18:58 +0300
Subject: [ofa-general] saquery port problems
In-Reply-To:
References: <20080522073703.GA31474@sashak.voltaire.com>
Message-ID: <20080531121858.GB22418@sashak.voltaire.com>

On 11:32 Thu 29 May , Matthias Blankenhaus wrote:
>
> Sorry, I don't know what that is :-) This is my first patch for OFED,
> excuse my ignorance.

One of the simplest ways to send the patch is to commit the change to
some branch in your local git tree and then to email the output of the
'git-format-patch --stdout HEAD^' command.

> Please, let me know if this helps:
>
> Signed-off-by: matthias at sgi.com

Yes, this helps. Applied. Thanks.

Sasha

From sashak at voltaire.com  Sat May 31 06:41:19 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 16:41:19 +0300
Subject: [ofa-general] [PATCH] infiniband-diags: terminate perl scripts with error if not root
In-Reply-To: <483F2396.3050008@llnl.gov>
References: <48358DF8.2060603@llnl.gov>
	<1211469467.18236.196.camel@hrosenstock-ws.xsigo.com>
	<20080523103532.GA4640@sashak.voltaire.com>
	<4836E9B8.2080406@llnl.gov>
	<20080525191047.GS4616@sashak.voltaire.com>
	<483F2396.3050008@llnl.gov>
Message-ID: <20080531134119.GE22418@sashak.voltaire.com>

On 14:43 Thu 29 May , Timothy A. Meier wrote:
> I think this patch is fine, and helps solve the improper "usage" issue.

I will apply it then.

> (btw - should we prefer the "adapter" spelling over "adaptor"?)

Originally it was added as "adaptor" with the "adding -C, -P options"
patch. I have nothing against changing this to "adapter".

> My patch was addressing non-authorized use. Our philosophy was to not
> allow "any" sort of functionality (even help) if not authorized. Fail,
> and provide a reason/code.

Doesn't 'chmod 0700 /usr/local/sbin/ib*.pl' (as root) solve this?

> So rather than go through each perl script to see if the proper thing
> is done (return code is checked, error msg provided, terminate, etc.)

It is bug fixing... :)

> On 5-23, I submitted a patch which adds an auth_check() function to the
> common perl module. I agree, the implementation is non-ideal, but it is
> probably sufficient for the vast majority of installations.
>
> If you think the concept of an auth_check() function is
> desirable/acceptable, then I will pursue fixing the implementation in a
> more universal way.

Basically I think the idea of limited access is useful, but I don't see
why a simple 'chmod' is insufficient. And if it is not, I think that
auth_check() should be optional (and of course not broken).

Sasha

From sashak at voltaire.com  Sat May 31 07:11:19 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 17:11:19 +0300
Subject: [ofa-general] Re: [PATCHv2] management: Support separate SA and SM keys as clarified in IBA 1.2.1
In-Reply-To: <1212067372.17997.43.camel@hrosenstock-ws.xsigo.com>
References: <1212067372.17997.43.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531141119.GG22418@sashak.voltaire.com>

On 06:22 Thu 29 May , Hal Rosenstock wrote:
> management: Support separate SA and SM keys as clarified in IBA 1.2.1
>
> v2 is just a rebase to latest tree
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

A host order default value is obviously the wrong thing (like just
OSM_DEFAULT_SM_KEY, which is not resolved yet).
I think I will change it to something like:

#define OSM_DEFAULT_SA_KEY OSM_DEFAULT_SM_KEY

Sasha

From sashak at voltaire.com  Sat May 31 07:13:40 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 17:13:40 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_sa_mcmember_record.c: Improve log message and some comments relating to SNM
In-Reply-To: <1212089107.17997.122.camel@hrosenstock-ws.xsigo.com>
References: <1212089107.17997.122.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531141340.GH22418@sashak.voltaire.com>

On 12:25 Thu 29 May , Hal Rosenstock wrote:
> opensm/osm_sa_mcmember_record.c: Improve log message and some comments
> relating to SNM (solicited node multicast)
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From hrosenstock at xsigo.com  Sat May 31 07:19:32 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Sat, 31 May 2008 07:19:32 -0700
Subject: [ofa-general] Re: [PATCHv2] management: Support separate SA and SM keys as clarified in IBA 1.2.1
In-Reply-To: <20080531141119.GG22418@sashak.voltaire.com>
References: <1212067372.17997.43.camel@hrosenstock-ws.xsigo.com>
	<20080531141119.GG22418@sashak.voltaire.com>
Message-ID: <1212243572.17997.276.camel@hrosenstock-ws.xsigo.com>

On Sat, 2008-05-31 at 17:11 +0300, Sasha Khapyorsky wrote:
> On 06:22 Thu 29 May , Hal Rosenstock wrote:
> > management: Support separate SA and SM keys as clarified in IBA 1.2.1
> >
> > v2 is just a rebase to latest tree
> >
> > Signed-off-by: Hal Rosenstock
>
> Applied. Thanks.
>
> A host order default value is obviously the wrong thing

But it's compatible with what is there now, so it doesn't require a
change to saquery except in the PPC case as you noted.

> (like just OSM_DEFAULT_SM_KEY, which is not resolved yet).
> I think I will change it to something like:
>
> #define OSM_DEFAULT_SA_KEY OSM_DEFAULT_SM_KEY

It depends on how the default SM key finally settles out as to whether
it's better to do it this way. I agree that if we started with a clean
slate, this would be the way to do it.

-- Hal

> Sasha

From sashak at voltaire.com  Sat May 31 07:28:16 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 17:28:16 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/main.c: Minor change to long option for consolidate_ipv6_snm_req
In-Reply-To: <1212089110.17997.123.camel@hrosenstock-ws.xsigo.com>
References: <1212089110.17997.123.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531142816.GI22418@sashak.voltaire.com>

On 12:25 Thu 29 May , Hal Rosenstock wrote:
> opensm/main.c: Minor change to long option for consolidate_ipv6_snm_req
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From sashak at voltaire.com  Sat May 31 07:35:35 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 17:35:35 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL] OpenSM: Add another HP OUI to recognized vendor IDs
In-Reply-To: <1212145677.17997.151.camel@hrosenstock-ws.xsigo.com>
References: <1212145677.17997.151.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531143535.GJ22418@sashak.voltaire.com>

On 04:07 Fri 30 May , Hal Rosenstock wrote:
> OpenSM: Add another HP OUI to recognized vendor IDs
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.
Sasha

From sashak at voltaire.com  Sat May 31 07:40:53 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 17:40:53 +0300
Subject: [ofa-general] Re: [PATCH][TRIVIAL] opensm/osm_subnet.c: Change comment for IPv6 SNM in options file
In-Reply-To: <1212170835.17997.231.camel@hrosenstock-ws.xsigo.com>
References: <1212170835.17997.231.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531144053.GK22418@sashak.voltaire.com>

On 11:07 Fri 30 May , Hal Rosenstock wrote:
> opensm/osm_subnet.c: Change comment for IPv6 SNM in options file
>
> Signed-off-by: Hal Rosenstock

Applied by hand (patch is whitespace mangled). Thanks.

Sasha

> --- opensm/osm_subnet.c.2	2008-05-29 04:24:19.802169000 -0700
> +++ opensm/osm_subnet.c	2008-05-30 11:04:00.098938000 -0700
> @@ -1713,7 +1713,7 @@
> 			p_opts->prefix_routes_file);
>
> 	fprintf(opts_file,
> -		"#\n# IPv6 MCast Options\n#\n"
> +		"#\n# IPv6 Solicited Node Multicast (SNM) Options\n#\n"
> 		"consolidate_ipv6_snm_req %s\n\n",
> 		p_opts->consolidate_ipv6_snm_req ? "TRUE" : "FALSE");

From sashak at voltaire.com  Sat May 31 08:09:22 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 18:09:22 +0300
Subject: [ofa-general] Multicast Performance
In-Reply-To: <483EBE95.60901@informatik.tu-chemnitz.de>
References: <4836E231.4000601@informatik.tu-chemnitz.de>
	<48371B2D.3040908@gmail.com>
	<483A7E40.5040407@informatik.tu-chemnitz.de>
	<483BBBDB.6000605@informatik.tu-chemnitz.de>
	<483DA512.2070403@gmail.com>
	<483E7520.1000302@informatik.tu-chemnitz.de>
	<1212065181.27600.96.camel@hrosenstock-ws.xsigo.com>
	<483EB11A.5000000@informatik.tu-chemnitz.de>
	<1212068243.17997.48.camel@hrosenstock-ws.xsigo.com>
	<483EBE95.60901@informatik.tu-chemnitz.de>
Message-ID: <20080531150922.GL22418@sashak.voltaire.com>

On 16:32 Thu 29 May , Marcel Heinz wrote:
>
> Now, all 3 instances measure 950MB/s throughput.
>
> The returned MCMember Records are absolutely identical except
> for the PortGid and the membership state.

So the difference is only membership. If you have just 2 full member
instances, do you still see the performance degradation?

Sasha

From sashak at voltaire.com  Sat May 31 08:37:30 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 18:37:30 +0300
Subject: [ofa-general] Re: [PATCHv3] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
In-Reply-To: <1212178934.17997.244.camel@hrosenstock-ws.xsigo.com>
References: <1212178934.17997.244.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531153730.GM22418@sashak.voltaire.com>

Hi Hal,

I agree with the idea. But the patch itself does not seem to be against
mainline (or any known published branch). Comments are below.
On 13:22 Fri 30 May , Hal Rosenstock wrote:
> OpenSM/osm_sa_mcmember_record.c: In __search_mgrp_by_mgid, collapse all
> scopes when consolidating IPv6 SNM
>
> v2 compares masked prefixes rather than actual prefix in MCMemberRecord MGID and MGRP
> v1 had a minor comment change
>
> Patch is cumulative on minor improvement patch to this file
>
> Signed-off-by: Hal Rosenstock
>
> --- opensm/osm_sa_mcmember_record.c.1	2008-05-30 03:58:01.129544000 -0700
> +++ opensm/osm_sa_mcmember_record.c	2008-05-30 13:13:59.344954000 -0700

Please next time generate a diff at least at one level above (but better
is, as usual, at git tree level + 1).

> @@ -1083,19 +1083,21 @@
>
> 	if (sa->p_subn->opt.consolidate_ipv6_snm_req) {
> 		/* Special Case IPv6 Solicited Node Multicast (SNM) addresses */
> -		/* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */
> -		/* Where XXXX is the P_Key and
> +		/* 0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY */
> +		/* Where Z is the scope, XXXX is the P_Key, and
> 		 * YYYYYY is the last 24 bits of the port guid */
> -#define PREFIX_MASK (0xff10601b00000000ULL)

There I have the value 0xff12601b00000000ULL (and likely that is what you
wanted to fix :)).

> +#define PREFIX_MASK (0xff10ffff00000000ULL)
> +#define PREFIX_SIGNATURE (0xff10601b00000000ULL)
> #define INT_ID_MASK (0x00000001ff000000ULL)
> 		uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix);
> 		uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id);
> 		uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix);
> 		uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id);
>
> -		if ((rcv_prefix & PREFIX_MASK) == PREFIX_MASK &&
> +		if ((rcv_prefix & PREFIX_MASK) == PREFIX_SIGNATURE &&

If you are changing PREFIX_MASK to 0xff10601b00000000ULL, why is
PREFIX_SIGNATURE needed? Am I missing something?

Sasha

> 		    (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK &&
> -		    g_prefix == rcv_prefix &&
> +		    (g_prefix & PREFIX_MASK) ==
> +		    (rcv_prefix & PREFIX_MASK) &&
> 		    (g_interface_id & INT_ID_MASK) ==
> 		    (rcv_interface_id & INT_ID_MASK)) {
> 			OSM_LOG(sa->p_log, OSM_LOG_INFO,

From hrosenstock at xsigo.com  Sat May 31 08:48:45 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Sat, 31 May 2008 08:48:45 -0700
Subject: [ofa-general] Re: [PATCHv3] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
In-Reply-To: <20080531153730.GM22418@sashak.voltaire.com>
References: <1212178934.17997.244.camel@hrosenstock-ws.xsigo.com>
	<20080531153730.GM22418@sashak.voltaire.com>
Message-ID: <1212248925.17997.282.camel@hrosenstock-ws.xsigo.com>

Hi Sasha,

On Sat, 2008-05-31 at 18:37 +0300, Sasha Khapyorsky wrote:
> Hi Hal,
>
> I agree with the idea. But the patch itself does not seem to be against
> mainline (or any known published branch). Comments are below.

It is against master, but I made a mistake in generating it.

> On 13:22 Fri 30 May , Hal Rosenstock wrote:
> > OpenSM/osm_sa_mcmember_record.c: In __search_mgrp_by_mgid, collapse all
> > scopes when consolidating IPv6 SNM
> >
> > v2 compares masked prefixes rather than actual prefix in MCMemberRecord MGID and MGRP
> > v1 had a minor comment change
> >
> > Patch is cumulative on minor improvement patch to this file
> >
> > Signed-off-by: Hal Rosenstock
> >
> > --- opensm/osm_sa_mcmember_record.c.1	2008-05-30 03:58:01.129544000 -0700
> > +++ opensm/osm_sa_mcmember_record.c	2008-05-30 13:13:59.344954000 -0700
>
> Please next time generate a diff at least at one level above (but better
> is, as usual, at git tree level + 1).
> > @@ -1083,19 +1083,21 @@
> >
> > 	if (sa->p_subn->opt.consolidate_ipv6_snm_req) {
> > 		/* Special Case IPv6 Solicited Node Multicast (SNM) addresses */
> > -		/* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */
> > -		/* Where XXXX is the P_Key and
> > +		/* 0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY */
> > +		/* Where Z is the scope, XXXX is the P_Key, and
> > 		 * YYYYYY is the last 24 bits of the port guid */
> > -#define PREFIX_MASK (0xff10601b00000000ULL)
>
> There I have the value 0xff12601b00000000ULL (and likely that is what you
> wanted to fix :)).

Correct. That was where the generated patch was broken.

> > +#define PREFIX_MASK (0xff10ffff00000000ULL)
> > +#define PREFIX_SIGNATURE (0xff10601b00000000ULL)
> > #define INT_ID_MASK (0x00000001ff000000ULL)
> > 		uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix);
> > 		uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id);
> > 		uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix);
> > 		uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id);
> >
> > -		if ((rcv_prefix & PREFIX_MASK) == PREFIX_MASK &&
> > +		if ((rcv_prefix & PREFIX_MASK) == PREFIX_SIGNATURE &&
>
> If you are changing PREFIX_MASK to 0xff10601b00000000ULL,

No, PREFIX_MASK is being changed to 0xff10ffff00000000ULL to:
1. eliminate the scope part, and
2. get the entire signature
for subsequent comparison.

> why is PREFIX_SIGNATURE needed? Am I missing something?

Because PREFIX_MASK masks the bits to compare and SIGNATURE is what it
should be after masking.

I'll regenerate v4 and hopefully I'll get it right.

-- Hal

> Sasha
>
> > 		    (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK &&
> > -		    g_prefix == rcv_prefix &&
> > +		    (g_prefix & PREFIX_MASK) ==
> > +		    (rcv_prefix & PREFIX_MASK) &&
> > 		    (g_interface_id & INT_ID_MASK) ==
> > 		    (rcv_interface_id & INT_ID_MASK)) {
> > 			OSM_LOG(sa->p_log, OSM_LOG_INFO,

From hrosenstock at xsigo.com  Sat May 31 09:03:47 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Sat, 31 May 2008 09:03:47 -0700
Subject: [ofa-general] [PATCHv4] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
Message-ID: <1212249827.17997.294.camel@hrosenstock-ws.xsigo.com>

OpenSM/osm_sa_mcmember_record.c: In __search_mgrp_by_mgid, collapse all
scopes when consolidating IPv6 SNM

v4 fixes the original PREFIX_MASK as Sasha commented
v3 compares masked prefixes rather than actual prefix in MCMemberRecord
MGID and MGRP
v2 had a minor comment change

Patch is cumulative on minor improvement patch to this file

Signed-off-by: Hal Rosenstock

--- opensm/opensm/osm_sa_mcmember_record.c.1	2008-05-30 03:58:01.129544000 -0700
+++ opensm/opensm/osm_sa_mcmember_record.c	2008-05-30 13:13:59.344954000 -0700
@@ -1083,19 +1083,21 @@

 	if (sa->p_subn->opt.consolidate_ipv6_snm_req) {
 		/* Special Case IPv6 Solicited Node Multicast (SNM) addresses */
-		/* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */
-		/* Where XXXX is the P_Key and
+		/* 0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY */
+		/* Where Z is the scope, XXXX is the P_Key, and
		 * YYYYYY is the last 24 bits of the port guid */
-#define PREFIX_MASK (0xff12601b00000000ULL)
+#define PREFIX_MASK (0xff10ffff00000000ULL)
+#define PREFIX_SIGNATURE (0xff10601b00000000ULL)
 #define INT_ID_MASK (0x00000001ff000000ULL)
		uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix);
		uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id);
		uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix);
		uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id);

-		if ((rcv_prefix & PREFIX_MASK) == PREFIX_MASK &&
+		if ((rcv_prefix & PREFIX_MASK) == PREFIX_SIGNATURE &&
		    (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK &&
-		    g_prefix == rcv_prefix &&
+		    (g_prefix & PREFIX_MASK) ==
+		    (rcv_prefix & PREFIX_MASK) &&
		    (g_interface_id & INT_ID_MASK) ==
		    (rcv_interface_id & INT_ID_MASK)) {
			OSM_LOG(sa->p_log, OSM_LOG_INFO,

From sashak at voltaire.com  Sat May 31 10:13:15 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 20:13:15 +0300
Subject: [ofa-general] Re: [PATCHv3] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
In-Reply-To: <1212248925.17997.282.camel@hrosenstock-ws.xsigo.com>
References: <1212178934.17997.244.camel@hrosenstock-ws.xsigo.com>
	<20080531153730.GM22418@sashak.voltaire.com>
	<1212248925.17997.282.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531171315.GN22418@sashak.voltaire.com>

On 08:48 Sat 31 May , Hal Rosenstock wrote:
>
> No, PREFIX_MASK is being changed to 0xff10ffff00000000ULL to:
> 1. eliminate the scope part, and
> 2. get the entire signature
> for subsequent comparison.
>
> > why is PREFIX_SIGNATURE needed? Am I missing something?
>
> Because PREFIX_MASK masks the bits to compare and SIGNATURE is what it
> should be after masking.

I see. Then shouldn't it (the mask) be 0xff10ffff0000ffffULL?

Sasha

From sashak at voltaire.com  Sat May 31 10:28:15 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 20:28:15 +0300
Subject: [ofa-general] [PATCH v2] saquery: --smkey command line option
In-Reply-To: <1211972791.13185.334.camel@hrosenstock-ws.xsigo.com>
References: <20080522145607.GE32128@sashak.voltaire.com>
	<1211469029.18236.188.camel@hrosenstock-ws.xsigo.com>
	<20080523100634.GD4164@sashak.voltaire.com>
	<1211541313.13185.80.camel@hrosenstock-ws.xsigo.com>
	<20080523123414.GB4640@sashak.voltaire.com>
	<1211547161.13185.103.camel@hrosenstock-ws.xsigo.com>
	<20080527103341.GF12014@sashak.voltaire.com>
	<1211888036.13185.219.camel@hrosenstock-ws.xsigo.com>
	<20080527175343.GA14205@sashak.voltaire.com>
	<1211972791.13185.334.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531172815.GO22418@sashak.voltaire.com>

This adds the possibility to specify an SM_Key value with saquery. It
should work with queries where OSM_DEFAULT_SM_KEY was used. If a
non-numeric string (like 'x') is provided with the --smkey option then
saquery will prompt for the SM_Key value.
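For example (illustrative session only; the key value is made up and the
query type is incidental):

    $ saquery --smkey 0x12345 -p        # SM_Key given on the command line
    $ saquery --smkey x -p              # non-numeric, so saquery prompts
    SM_Key: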
Signed-off-by: Sasha Khapyorsky
---
SM_Key value prompting was added as an addition to v1 of the patch.

 infiniband-diags/src/saquery.c |   20 +++++++++++++++++---
 1 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 3d4ab24..d3875fc 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -37,6 +37,7 @@
  *
  */

+#include <ctype.h>
 #include
 #include
 #include
@@ -69,6 +70,7 @@ char *argv0 = "saquery";

 static char *node_name_map_file = NULL;
 static nn_map_t *node_name_map = NULL;
+static ib_net64_t smkey = OSM_DEFAULT_SA_KEY;

 /**
  * Declare some globals because I don't want this to be too complex.
@@ -730,7 +732,7 @@ get_all_records(osm_bind_handle_t bind_handle, int trusted)
 {
 	return get_any_records(bind_handle, query_id, 0, 0, NULL, attr_offset,
-			       trusted ? OSM_DEFAULT_SA_KEY : 0);
+			       trusted ? smkey : 0);
 }

 /**
@@ -1254,8 +1256,7 @@ print_pkey_tbl_records(const struct query_cmd *q, osm_bind_handle_t bind_handle,

 	status = get_any_records(bind_handle, IB_MAD_ATTR_PKEY_TBL_RECORD, 0,
 				 comp_mask, &pktr,
-				 ib_get_attr_offset(sizeof(pktr)),
-				 OSM_DEFAULT_SA_KEY);
+				 ib_get_attr_offset(sizeof(pktr)), smkey);
 	if (status != IB_SUCCESS)
 		return status;

@@ -1411,6 +1412,10 @@ usage(void)
 		"IPv6 format\n");
 	fprintf(stderr, "   -C specify the SA query HCA\n");
 	fprintf(stderr, "   -P specify the SA query port\n");
+	fprintf(stderr, "   --smkey specify SM_Key value for the query."
+		" If non-numeric value \n"
+		"   (like 'x') is specified then "
+		"saquery will prompt for a value\n");
 	fprintf(stderr, "   -t | --timeout specify the SA query "
 		"response timeout (default %u msec)\n",
 		DEFAULT_SA_TIMEOUT_MS);
@@ -1466,6 +1471,7 @@ main(int argc, char **argv)
 		{"sgid-to-dgid", 1, 0, 2},
 		{"timeout", 1, 0, 't'},
 		{"node-name-map", 1, 0, 3},
+		{"smkey", 1, 0, 4},
 		{ }
 	};

@@ -1512,6 +1518,14 @@ main(int argc, char **argv)
 		case 3:
 			node_name_map_file = strdup(optarg);
 			break;
+		case 4:
+			if (!isxdigit(*optarg) &&
+			    !(optarg = getpass("SM_Key: "))) {
+				fprintf(stderr, "cannot get SM_Key\n");
+				usage();
+			}
+			smkey = cl_hton64(strtoull(optarg, NULL, 0));
+			break;
 		case 'p':
 			query_type = IB_MAD_ATTR_PATH_RECORD;
 			break;
--
1.5.5.1.178.g1f811

From hrosenstock at xsigo.com  Sat May 31 10:31:54 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Sat, 31 May 2008 10:31:54 -0700
Subject: [ofa-general] Re: [PATCHv3] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
In-Reply-To: <20080531171315.GN22418@sashak.voltaire.com>
References: <1212178934.17997.244.camel@hrosenstock-ws.xsigo.com>
	<20080531153730.GM22418@sashak.voltaire.com>
	<1212248925.17997.282.camel@hrosenstock-ws.xsigo.com>
	<20080531171315.GN22418@sashak.voltaire.com>
Message-ID: <1212255114.17997.301.camel@hrosenstock-ws.xsigo.com>

On Sat, 2008-05-31 at 20:13 +0300, Sasha Khapyorsky wrote:
> On 08:48 Sat 31 May , Hal Rosenstock wrote:
> >
> > No, PREFIX_MASK is being changed to 0xff10ffff00000000ULL to:
> > 1. eliminate the scope part, and
> > 2. get the entire signature
> > for subsequent comparison.
> >
> > > why is PREFIX_SIGNATURE needed? Am I missing something?
> >
> > Because PREFIX_MASK masks the bits to compare and SIGNATURE is what it
> > should be after masking.
>
> I see. Then shouldn't it (the mask) be 0xff10ffff0000ffffULL?

Yes; updated patch to follow shortly. Similarly for INT_ID_MASK, and
I'll generate a separate patch for that.
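To spell out the mask/signature distinction with a standalone example
(host byte order, outside of OpenSM; the constants mirror the updated
patch):

#include <assert.h>
#include <stdint.h>

#define PREFIX_MASK	 0xff10ffff0000ffffULL
#define PREFIX_SIGNATURE 0xff10601b00000000ULL

/* The scope nibble and the P_Key bytes are masked out; what remains
 * must match the SNM signature exactly, including the low 16 bits,
 * which must be 0. */
static int is_ipv6_snm_prefix(uint64_t prefix)
{
	return (prefix & PREFIX_MASK) == PREFIX_SIGNATURE;
}

int main(void)
{
	assert(is_ipv6_snm_prefix(0xff12601b80010000ULL));  /* scope 2, P_Key 0x8001 */
	assert(!is_ipv6_snm_prefix(0xff12401b80010000ULL)); /* wrong signature */
	return 0;
}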
-- Hal

> Sasha

From hrosenstock at xsigo.com  Sat May 31 10:31:57 2008
From: hrosenstock at xsigo.com (Hal Rosenstock)
Date: Sat, 31 May 2008 10:31:57 -0700
Subject: [ofa-general] [PATCHv5] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
Message-ID: <1212255117.17997.303.camel@hrosenstock-ws.xsigo.com>

OpenSM/osm_sa_mcmember_record.c: In __search_mgrp_by_mgid, collapse all
scopes when consolidating IPv6 SNM

v5 changes PREFIX_MASK so low 16 bits are validated as 0
v4 fixes the original PREFIX_MASK as Sasha commented
v3 compares masked prefixes rather than actual prefix in MCMemberRecord
MGID and MGRP
v2 had a minor comment change

Signed-off-by: Hal Rosenstock

diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
index 040068f..c14632d 100644
--- a/opensm/opensm/osm_sa_mcmember_record.c
+++ b/opensm/opensm/osm_sa_mcmember_record.c
@@ -1083,19 +1083,21 @@ __search_mgrp_by_mgid(IN cl_map_item_t * const p_map_item, IN void *context)

 	if (sa->p_subn->opt.consolidate_ipv6_snm_req) {
 		/* Special Case IPv6 Solicited Node Multicast (SNM) addresses */
-		/* 0xff12601bXXXX0000 : 0x00000001ffYYYYYY */
-		/* Where XXXX is the P_Key and
+		/* 0xff1Z601bXXXX0000 : 0x00000001ffYYYYYY */
+		/* Where Z is the scope, XXXX is the P_Key, and
		 * YYYYYY is the last 24 bits of the port guid */
-#define PREFIX_MASK (0xff12601b00000000ULL)
+#define PREFIX_MASK (0xff10ffff0000ffffULL)
+#define PREFIX_SIGNATURE (0xff10601b00000000ULL)
 #define INT_ID_MASK (0x00000001ff000000ULL)
		uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix);
		uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id);
		uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix);
		uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id);

-		if ((rcv_prefix & PREFIX_MASK) == PREFIX_MASK &&
+		if ((rcv_prefix & PREFIX_MASK) == PREFIX_SIGNATURE &&
		    (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK &&
-		    g_prefix == rcv_prefix &&
+		    (g_prefix & PREFIX_MASK) ==
+		    (rcv_prefix & PREFIX_MASK) &&
		    (g_interface_id & INT_ID_MASK) ==
		    (rcv_interface_id & INT_ID_MASK)) {
			OSM_LOG(sa->p_log, OSM_LOG_INFO,

From sashak at voltaire.com  Sat May 31 12:25:51 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 31 May 2008 22:25:51 +0300
Subject: [ofa-general] Re: [PATCHv5] OpenSM/osm_sa_mcmember_record.c: Collapse all scopes when consolidating IPv6 SNM
In-Reply-To: <1212255117.17997.303.camel@hrosenstock-ws.xsigo.com>
References: <1212255117.17997.303.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531192551.GP22418@sashak.voltaire.com>

On 10:31 Sat 31 May , Hal Rosenstock wrote:
> OpenSM/osm_sa_mcmember_record.c: In __search_mgrp_by_mgid, collapse all
> scopes when consolidating IPv6 SNM
>
> v5 changes PREFIX_MASK so low 16 bits are validated as 0
> v4 fixes the original PREFIX_MASK as Sasha commented
> v3 compares masked prefixes rather than actual prefix in MCMemberRecord
> MGID and MGRP
> v2 had a minor comment change
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.
Sasha

From sashak at voltaire.com  Sat May 31 14:49:19 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 1 Jun 2008 00:49:19 +0300
Subject: [ofa-general] Re: OSM_DEFAULT_SM_KEY byte order
In-Reply-To: <1211467961.18236.178.camel@hrosenstock-ws.xsigo.com>
References: <20080522140916.GC32128@sashak.voltaire.com>
	<1211467961.18236.178.camel@hrosenstock-ws.xsigo.com>
Message-ID: <20080531214919.GS22418@sashak.voltaire.com>

On 07:52 Thu 22 May , Hal Rosenstock wrote:
> > +#define OSM_DEFAULT_SM_KEY CL_HTON64(1)
> > /********/
> > /****s* OpenSM: Base/OSM_DEFAULT_LMC
> > * NAME
> >
> >
> > , but sort of backward compatibility (currently I know that
> > OSM_DEFAULT_SM_KEY is used with 'osmtest' and 'saquery') could be lost.
> > Is this so important? Ideas?
>
> IMO yes, I think this breaks both backward compatibility and what was
> actually observed from some other SMs during interop testing.
>
> I agree it needs fixing but I think the proper thing is probably more
> like:
>
> #define OSM_DEFAULT_SM_KEY CL_HTON64(0x0100000000000000);

Using a value like this, we will break big endian machines, where the
original value is correct. I think that '1' in network byte order is
better (especially in the long term) - it is a more "native" non-zero
value. Also, I found at least one vendor SM which uses 1 as the default
SM key in network byte order (and this is expected; I doubt somebody
uses 0x0100000000000000).

Our own backward compatibility could be solved by configuring the SM key
(this will work with OpenSM and saquery).

Any other opinions?

Sasha

From sashak at voltaire.com  Sat May 31 15:13:22 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 1 Jun 2008 01:13:22 +0300
Subject: [ofa-general] [PATCH] opensm: remove osm_log reference from osm_mad_pool object
Message-ID: <20080531221322.GT22418@sashak.voltaire.com>

This removes the osm_log reference from the osm_mad_pool object, as well
as some noisy debug prints there. Recently osm_mad_pool was reworked to
use a plain malloc allocator instead of complib's cl_qlock_pool, so the
importance of those messages was reduced.

Signed-off-by: Sasha Khapyorsky
---
 opensm/include/opensm/osm_mad_pool.h |   12 +-------
 opensm/opensm/libopensm.ver          |    2 +-
 opensm/opensm/osm_mad_pool.c         |   50 +++------------------------------
 opensm/opensm/osm_opensm.c           |    2 +-
 4 files changed, 8 insertions(+), 58 deletions(-)

diff --git a/opensm/include/opensm/osm_mad_pool.h b/opensm/include/opensm/osm_mad_pool.h
index b8421b9..e3234f4 100644
--- a/opensm/include/opensm/osm_mad_pool.h
+++ b/opensm/include/opensm/osm_mad_pool.h
@@ -53,7 +53,6 @@
 #include
 #include
 #include
-#include <opensm/osm_log.h>

 #ifdef __cplusplus
 #  define BEGIN_C_DECLS extern "C" {
@@ -95,14 +94,10 @@ BEGIN_C_DECLS
 * SYNOPSIS
 */
 typedef struct _osm_mad_pool {
-	osm_log_t *p_log;
 	atomic32_t mads_out;
 } osm_mad_pool_t;
 /*
 * FIELDS
-*	p_log
-*		Pointer to the log object.
-*
 *	mads_out
 *		Running total of the number of MADs outstanding.
 *
@@ -176,17 +171,12 @@ void osm_mad_pool_destroy(IN osm_mad_pool_t * const p_pool);
 *
 * SYNOPSIS
 */
-ib_api_status_t osm_mad_pool_init(IN osm_mad_pool_t * const p_pool,
-				  IN osm_log_t * const p_log);
+ib_api_status_t osm_mad_pool_init(IN osm_mad_pool_t * const p_pool);
 /*
 * PARAMETERS
 *	p_pool
 *		[in] Pointer to an osm_mad_pool_t object to initialize.
 *
-*	p_log
-*		[in] Pointer to the log object.
-*
-*
 * RETURN VALUES
 *	CL_SUCCESS if the MAD Pool was initialized successfully.
 *

diff --git a/opensm/opensm/libopensm.ver b/opensm/opensm/libopensm.ver
index 3324b1a..0d5e9d4 100644
--- a/opensm/opensm/libopensm.ver
+++ b/opensm/opensm/libopensm.ver
@@ -6,4 +6,4 @@
 # API_REV - advance on any added API
 # RUNNING_REV - advance any change to the vendor files
 # AGE - number of backward versions the API still supports
-LIBVERSION=2:1:0
+LIBVERSION=2:2:0

diff --git a/opensm/opensm/osm_mad_pool.c b/opensm/opensm/osm_mad_pool.c
index 9b3812f..a7769d4 100644
--- a/opensm/opensm/osm_mad_pool.c
+++ b/opensm/opensm/osm_mad_pool.c
@@ -53,7 +53,6 @@
 #include
 #include
 #include
-#include <opensm/osm_log.h>
 #include

 /**********************************************************************
@@ -74,14 +73,10 @@ void osm_mad_pool_destroy(IN osm_mad_pool_t * const p_pool)

 /**********************************************************************
 **********************************************************************/
-ib_api_status_t
-osm_mad_pool_init(IN osm_mad_pool_t * const p_pool, IN osm_log_t * const p_log)
+ib_api_status_t osm_mad_pool_init(IN osm_mad_pool_t * const p_pool)
 {
-	OSM_LOG_ENTER(p_log);
+	p_pool->mads_out = 0;

-	p_pool->p_log = p_log;
-
-	OSM_LOG_EXIT(p_log);
 	return IB_SUCCESS;
 }

 /**********************************************************************
 **********************************************************************/
@@ -95,8 +90,6 @@ osm_madw_t *osm_mad_pool_get(IN osm_mad_pool_t * const p_pool,
 	osm_madw_t *p_madw;
 	ib_mad_t *p_mad;

-	OSM_LOG_ENTER(p_pool->p_log);
-
 	CL_ASSERT(h_bind != OSM_BIND_INVALID_HANDLE);
 	CL_ASSERT(total_size);

@@ -104,11 +97,8 @@ osm_madw_t *osm_mad_pool_get(IN osm_mad_pool_t * const p_pool,
 	   First, acquire a mad wrapper from the mad wrapper pool.
 	 */
 	p_madw = malloc(sizeof(*p_madw));
-	if (p_madw == NULL) {
-		OSM_LOG(p_pool->p_log, OSM_LOG_ERROR, "ERR 0703: "
-			"Unable to acquire MAD wrapper object\n");
+	if (p_madw == NULL)
 		goto Exit;
-	}

 	osm_madw_init(p_madw, h_bind, total_size, p_mad_addr);

@@ -117,9 +107,6 @@ osm_madw_t *osm_mad_pool_get(IN osm_mad_pool_t * const p_pool,
 	 */
 	p_mad = osm_vendor_get(h_bind, total_size, &p_madw->vend_wrap);
 	if (p_mad == NULL) {
-		OSM_LOG(p_pool->p_log, OSM_LOG_ERROR, "ERR 0704: "
-			"Unable to acquire wire MAD\n");
-
 		/* Don't leak wrappers! */
 		free(p_madw);
 		p_madw = NULL;

@@ -132,13 +119,8 @@ osm_madw_t *osm_mad_pool_get(IN osm_mad_pool_t * const p_pool,
 	 */
 	osm_madw_set_mad(p_madw, p_mad);

-	OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG,
-		"Acquired p_madw = %p, p_mad = %p, size = %u\n",
-		p_madw, p_madw->p_mad, total_size);
-
 Exit:
-	OSM_LOG_EXIT(p_pool->p_log);
-	return (p_madw);
+	return p_madw;
 }

 /**********************************************************************
@@ -151,8 +133,6 @@ osm_madw_t *osm_mad_pool_get_wrapper(IN osm_mad_pool_t * const p_pool,
 {
 	osm_madw_t *p_madw;

-	OSM_LOG_ENTER(p_pool->p_log);
-
 	CL_ASSERT(h_bind != OSM_BIND_INVALID_HANDLE);
 	CL_ASSERT(total_size);
 	CL_ASSERT(p_mad);

@@ -161,11 +141,8 @@ osm_madw_t *osm_mad_pool_get_wrapper(IN osm_mad_pool_t * const p_pool,
 	   First, acquire a mad wrapper from the mad wrapper pool.
 	 */
 	p_madw = malloc(sizeof(*p_madw));
-	if (p_madw == NULL) {
-		OSM_LOG(p_pool->p_log, OSM_LOG_ERROR, "ERR 0705: "
-			"Unable to acquire MAD wrapper object\n");
+	if (p_madw == NULL)
 		goto Exit;
-	}

 	/*
 	   Finally, initialize the wrapper object.
@@ -174,12 +151,7 @@ osm_madw_t *osm_mad_pool_get_wrapper(IN osm_mad_pool_t * const p_pool,
 	osm_madw_init(p_madw, h_bind, total_size, p_mad_addr);
 	osm_madw_set_mad(p_madw, p_mad);

-	OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG,
-		"Acquired p_madw = %p, p_mad = %p size = %u\n",
-		p_madw, p_madw->p_mad, total_size);
-
 Exit:
-	OSM_LOG_EXIT(p_pool->p_log);
 	return (p_madw);
 }

@@ -189,19 +161,14 @@ osm_madw_t *osm_mad_pool_get_wrapper_raw(IN osm_mad_pool_t * const p_pool)
 {
 	osm_madw_t *p_madw;

-	OSM_LOG_ENTER(p_pool->p_log);
-
 	p_madw = malloc(sizeof(*p_madw));
 	if (!p_madw)
 		return NULL;

-	OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG, "Getting p_madw = %p\n", p_madw);
-
 	osm_madw_init(p_madw, 0, 0, 0);
 	osm_madw_set_mad(p_madw, 0);
 	cl_atomic_inc(&p_pool->mads_out);

-	OSM_LOG_EXIT(p_pool->p_log);
 	return (p_madw);
 }

@@ -210,13 +177,8 @@ osm_madw_t *osm_mad_pool_get_wrapper_raw(IN osm_mad_pool_t * const p_pool)
 void
 osm_mad_pool_put(IN osm_mad_pool_t * const p_pool, IN osm_madw_t * const p_madw)
 {
-	OSM_LOG_ENTER(p_pool->p_log);
-
 	CL_ASSERT(p_madw);

-	OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG,
-		"Releasing p_madw = %p, p_mad = %p\n", p_madw, p_madw->p_mad);
-
 	/* First, return the wire mad to the pool */
@@ -174,12 +151,7 @@ osm_madw_t *osm_mad_pool_get_wrapper(IN osm_mad_pool_t * const p_pool, osm_madw_init(p_madw, h_bind, total_size, p_mad_addr); osm_madw_set_mad(p_madw, p_mad); - OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG, - "Acquired p_madw = %p, p_mad = %p size = %u\n", - p_madw, p_madw->p_mad, total_size); - Exit: - OSM_LOG_EXIT(p_pool->p_log); return (p_madw); } @@ -189,19 +161,14 @@ osm_madw_t *osm_mad_pool_get_wrapper_raw(IN osm_mad_pool_t * const p_pool) { osm_madw_t *p_madw; - OSM_LOG_ENTER(p_pool->p_log); - p_madw = malloc(sizeof(*p_madw)); if (!p_madw) return NULL; - OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG, "Getting p_madw = %p\n", p_madw); - osm_madw_init(p_madw, 0, 0, 0); osm_madw_set_mad(p_madw, 0); cl_atomic_inc(&p_pool->mads_out); - OSM_LOG_EXIT(p_pool->p_log); return (p_madw); } @@ -210,13 +177,8 @@ osm_madw_t *osm_mad_pool_get_wrapper_raw(IN osm_mad_pool_t * const p_pool) void osm_mad_pool_put(IN osm_mad_pool_t * const p_pool, IN osm_madw_t * const p_madw) { - OSM_LOG_ENTER(p_pool->p_log); - CL_ASSERT(p_madw); - OSM_LOG(p_pool->p_log, OSM_LOG_DEBUG, - "Releasing p_madw = %p, p_mad = %p\n", p_madw, p_madw->p_mad); - /* First, return the wire mad to the pool */ @@ -228,6 +190,4 @@ osm_mad_pool_put(IN osm_mad_pool_t * const p_pool, IN osm_madw_t * const p_madw) */ free(p_madw); cl_atomic_dec(&p_pool->mads_out); - - OSM_LOG_EXIT(p_pool->p_log); } diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c index aa7ded3..abe55b5 100644 --- a/opensm/opensm/osm_opensm.c +++ b/opensm/opensm/osm_opensm.c @@ -314,7 +314,7 @@ osm_opensm_init(IN osm_opensm_t * const p_osm, goto Exit; } - status = osm_mad_pool_init(&p_osm->mad_pool, &p_osm->log); + status = osm_mad_pool_init(&p_osm->mad_pool); if (status != IB_SUCCESS) goto Exit; -- 1.5.5.1.178.g1f811 From rdreier at cisco.com Sat May 31 22:46:05 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 31 May 2008 22:46:05 -0700 Subject: [ofa-general] Re: [PATCH v3 08/13] QLogic VNIC: sysfs interface implementation for the driver References: <20080529095126.9943.84692.stgit@localhost.localdomain> <20080529095754.9943.27936.stgit@localhost.localdomain> <20080529103003.010c4a08@extreme> <20080529174805.GA10903@kroah.com> Message-ID: > And yes, multiple values per sysfs file are not allowed, sorry, please > change this. If you need to configure your device through an interface > like this, consider using configfs instead, that is what it is there > for. Makes sense... I know that the SRP initiator uses the method of multiple 'token=' entries passed into sysfs, but the excuse is that SRP was merged before configfs. I'll also have a look at adding configfs support to SRP and deprecating the current sysfs method... - R.